Maximum likelihood

Maximum likelihood estimation (MLE) is a popular statistical method used to make inferences about parameters of the underlying probability distribution from a given data set. That is to say, you have a sample of data

X_{1}, \dots, X_{n} \!

and you want to infer the distribution.

Commonly, one assumes the data is independent, identically distributed (iid) draws from a particular distribution with unknown parameters and uses the MLE technique to create estimators for the unknown parameters.

For example, you may be interested in the height of Americans. To study this you sample some number of Americans, but not the entire population, and record their heights. Further, you are willing to assume that heights are normally distributed with some unknown mean and variance. MLE is a technique for deciding which mean and variance best account for the observed sample. Roughly speaking, it fixes the observed data and then picks the parameters of the distribution under which those data are "most likely". Rigorously, one starts with a statistical model, which is a family of distributions. In our height example,

\mathcal{ P } =  \left\{ \mathcal{N}(\mu, \sigma^{2}) \mid \mu \in \mathbb{R}, \sigma^{2} \in \mathbb{R}_{>0} \right\}. \!

Then the MLE technique picks a particular distribution

\mathcal{N}( \mu_{0}, \sigma^{2}_{0} ) = \mathrm{P}_{0} \in \mathcal{P} \!

such that for every

\mathrm{P} \in \mathcal{P}, \quad \mathrm{Prob} \left[ X_{1}, \dots, X_{n} \mid \mathrm{P}_{0} \right] \geq \mathrm{Prob} \left[ X_{1}, \dots, X_{n} \mid \mathrm{P} \right].

In the case of the normal distribution the maximum is unique.
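
As a concrete added illustration of the height example, the following Python sketch picks the member of the normal family that maximizes the likelihood of a small sample of heights; the height values are made up purely for illustration, and SciPy's norm.fit performs the maximization for the normal family.

import numpy as np
from scipy.stats import norm

# Hypothetical heights in centimetres (made up for illustration).
heights = np.array([170.2, 165.8, 181.1, 175.6, 168.9, 178.3, 172.4, 169.7])

# norm.fit returns the maximum-likelihood estimates of the location (mu)
# and scale (sigma) for the normal family.
mu_hat, sigma_hat = norm.fit(heights)
print(mu_hat, sigma_hat)

# For the normal family these coincide with the sample mean and the
# (1/n) standard deviation.
print(heights.mean(), heights.std())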

It has widespread applications in a variety of fields.

The method was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922 (see external resources below for more information on the history of MLE).

Prerequisites

The following discussion assumes that readers are familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes they are familiar with standard basic techniques of maximizing continuous real-valued functions, such as using differentiation to find a function's maxima.

Principles

Given a family Dθ of probability distributions parameterized by θ (which could be vector-valued), associated with either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted fθ, we may draw a sample x1, x2, ..., xn of n values from this distribution and then, using fθ, compute the probability density associated with the observed data:

f_\theta(x_1,\dots,x_n \mid \theta).\,\!

As a function of θ with x1, ..., xn fixed, this is the likelihood function

\mathcal{L}(\theta) = f_{\theta}(x_1,\dots,x_n \mid \theta).\,\!

The method of maximum likelihood estimates θ by finding the value of θ that maximizes L(θ). This is the maximum likelihood estimator (MLE) of θ.

This contrasts with seeking an unbiased estimator of θ, which may not yield the MLE but which will, on average, neither over-estimate nor under-estimate the true value of θ.

The maximum likelihood estimator may not be unique, or indeed may not even exist.
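
As an added sketch of this recipe, the Python code below maximizes a likelihood numerically by minimizing the negative log-likelihood. The assumed model (a normal distribution with unknown mean θ and known variance 1) and the data values are illustrative choices; the closed-form answer (the sample mean) is printed for comparison.

import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical observations, assumed iid from a normal distribution with
# unknown mean theta and known variance 1 (both choices are illustrative).
x = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4])

def neg_log_likelihood(theta):
    # Normal log-density with sigma = 1; additive constants are dropped
    # because they do not affect the maximizing theta.
    return 0.5 * np.sum((x - theta) ** 2)

result = minimize_scalar(neg_log_likelihood)
print(result.x)     # numerical maximum likelihood estimate of theta
print(x.mean())     # closed-form answer: the sample mean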

Examples

Discrete distribution, finite parameter space

Consider tossing an unfair coin 80 times (i.e., we sample something like x1=H, x2=T, ..., x80=T, and count the number of HEADS "H" observed). Call the probability of tossing a HEAD p, and the probability of tossing TAILS 1-p (so here p is θ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p=1/3, one which gives HEADS with probability p=1/2 and another which gives HEADS with probability p=2/3. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can calculate which coin has the largest likelihood, given the data that we observed. The likelihood function (defined above) takes one of three values:

\begin{matrix} \Pr(\mathrm{H} = 49 \mid p=1/3) & = & \binom{80}{49}(1/3)^{49}(1-1/3)^{31} \approx 0.000 \\ &&\\ \Pr(\mathrm{H} = 49 \mid p=1/2) & = & \binom{80}{49}(1/2)^{49}(1-1/2)^{31} \approx 0.012 \\ &&\\ \Pr(\mathrm{H} = 49 \mid p=2/3) & = & \binom{80}{49}(2/3)^{49}(1-2/3)^{31} \approx 0.054 \\ \end{matrix}

We see that the likelihood is maximized when p=2/3, and so this is our maximum likelihood estimate for p.
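
A short added check of this calculation in Python, using SciPy's binomial probability mass function to evaluate the likelihood at the three candidate values of p:

from scipy.stats import binom

# Likelihood of 49 heads in 80 tosses for each of the three candidate coins.
candidates = [1/3, 1/2, 2/3]
likelihoods = {p: binom.pmf(49, 80, p) for p in candidates}
print(likelihoods)                            # roughly 0.000, 0.012 and 0.054
print(max(likelihoods, key=likelihoods.get))  # 2/3, the maximum likelihood estimate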

Discrete distribution, continuous parameter space

Now suppose we had only one coin but its p could have been any value 0 ≤ p ≤ 1. We must maximize the likelihood function:

L(p) = f_D(\mathrm{H} = 49 \mid p) = \binom{80}{49} p^{49}(1-p)^{31}

over all possible values 0 ≤ p ≤ 1.

One way to maximize this function is by differentiating with respect to p and setting to zero:

\begin{align} {0}&{} = \frac{\partial}{\partial p} \left( \binom{80}{49} p^{49}(1-p)^{31} \right) \\   & {}\propto 49p^{48}(1-p)^{31} - 31p^{49}(1-p)^{30} \\   & {}= p^{48}(1-p)^{30}\left[ 49(1-p) - 31p \right]  \\   & {}= p^{48}(1-p)^{30}\left[ 49 - 80p \right] \end{align}
Figure: Likelihood of different proportion parameter values for a binomial process with t = 3 and n = 10; the ML estimator occurs at the mode with the peak (maximum) of the curve.

which has solutions p=0, p=1, and p=49/80. The solution which maximizes the likelihood is clearly p=49/80 (since p=0 and p=1 result in a likelihood of zero). Thus we say the maximum likelihood estimator for p is 49/80.

This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli trials resulting in t 'successes'.
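
The same answer can be checked numerically; the following added sketch maximizes the likelihood over the whole interval 0 ≤ p ≤ 1 instead of using calculus:

from scipy.optimize import minimize_scalar
from scipy.stats import binom

# Maximize the likelihood of 49 heads in 80 tosses over 0 <= p <= 1 by
# minimizing its negative.
result = minimize_scalar(lambda p: -binom.pmf(49, 80, p),
                         bounds=(0.0, 1.0), method='bounded')
print(result.x, 49 / 80)   # both approximately 0.6125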

Continuous distribution, continuous parameter space

For the normal distribution \mathcal{N}(\mu, \sigma^2) which has probability density function

f(x\mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \prod_{i=1}^{n} f( x_{i}\mid  \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{ \sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}\right),

or more conveniently:

f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right),

where \bar{x} is the sample mean.

This family of distributions has two parameters: θ=(μ,σ), so we maximize the likelihood \mathcal{L} (\mu,\sigma) = f(x_1,\ldots,x_n \mid \mu, \sigma) over both parameters simultaneously, or if possible, individually.

Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. [Note: the log-likelihood is closely related to information entropy and Fisher information.]
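
Written out explicitly, the logarithm of the likelihood above is

\log \mathcal{L}(\mu,\sigma) = \frac{n}{2}\log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2},

and it is this expression whose partial derivatives are set to zero below.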

\begin{align} 0 & = \frac{\partial}{\partial \mu} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right) \right) \\ & = \frac{\partial}{\partial \mu} \left( \frac{n}{2}\log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right) \\ & = 0 - \frac{-2n(\bar{x}-\mu)}{2\sigma^2}, \end{align}

which is solved by

\hat\mu = \bar{x} = \sum^{n}_{i=1}x_i/n.

This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,

E \left[ \widehat\mu \right] = \mu,

which means that the maximum-likelihood estimator \widehat\mu is unbiased.

Similarly we differentiate the log likelihood with respect to σ and equate to zero:

\begin{align} 0 & = \frac{\partial}{\partial \sigma} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right) \right) \\ & = \frac{\partial}{\partial \sigma} \left( \frac{n}{2}\log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right) \\ & = -\frac{n}{\sigma} + \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{\sigma^3}, \end{align}

which is solved by

\widehat\sigma^2 = \sum_{i=1}^n(x_i-\widehat{\mu})^2/n.

Inserting \widehat\mu we obtain

\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n x_i x_j.

When we calculate its expected value, note that E[x_i x_j] = μ² + σ² when i = j, while E[x_i x_j] = μ² when i ≠ j because the observations are independent. The μ² terms cancel and we obtain

E \left[ \widehat{\sigma^2}  \right]= \frac{n-1}{n}\sigma^2.

This means that the estimator \widehat\sigma^2 is biased; it is, however, consistent.
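
A small added simulation makes the bias visible; the sample size n = 5, the random seed and the true variance σ² = 1 are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000

# Draw many normal samples of size n with mean 0 and variance 1, and compute
# the maximum-likelihood variance estimate (dividing by n, i.e. ddof=0).
samples = rng.normal(0.0, 1.0, size=(trials, n))
sigma2_hat = samples.var(axis=1)
print(sigma2_hat.mean())   # close to (n-1)/n * sigma^2 = 0.8, not 1.0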

Formally we say that the maximum likelihood estimator for θ = (μ, σ²) is:

\widehat{\theta} = \left(\widehat{\mu},\widehat{\sigma}^2\right).

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.
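
As an added end-to-end sketch, the Python code below obtains both estimates simultaneously by maximizing the normal log-likelihood numerically and compares the result with the closed-form estimators derived above; the data values are made up for illustration.

import numpy as np
from scipy.optimize import minimize

# Hypothetical observations (illustrative only).
x = np.array([4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.2, 5.8])

def neg_log_likelihood(params):
    # Parameterize by (mu, log sigma) so that sigma stays positive.
    mu, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * len(x) * np.log(2.0 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2.0 * sigma2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma2_hat = result.x[0], np.exp(2.0 * result.x[1])
print(mu_hat, sigma2_hat)   # numerical maximum of the likelihood
print(x.mean(), x.var())    # closed-form MLEs: sample mean and 1/n variance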

Properties

Functional invariance

The maximum likelihood estimator (MLE) of a parameter θ can be used to calculate the MLE of a function of the parameter. Specifically, if \widehat{\theta} is the MLE for θ, and if g(θ) is a one-to-one function, then the MLE for α = g(θ) is

\widehat{\alpha} = g(\widehat{\theta}).\,\!

If g(θ) is not one-to-one, then g(\widehat{\theta}) is the MLE of α = g(θ) only if the likelihood function is modified to be

\bar{L}(\alpha) = \sup_{\theta: \alpha = g(\theta)} L(\theta).
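
For example (an added illustration of the one-to-one case, continuing the coin example above), the MLE of p is t/n, and the odds α = g(p) = p/(1−p) is a one-to-one function of p on (0, 1); by functional invariance the MLE of the odds is

\widehat{\alpha} = g(\widehat{p}) = \frac{t/n}{1-t/n} = \frac{t}{n-t}.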

Bias

The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution). If n is unknown, then the maximum-likelihood estimator of n is the number on the drawn ticket, even though the expected value of that number is only (n+1)/2. In estimating the highest number n, we can only be certain that it is greater than or equal to the drawn ticket number.
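
An added simulation of this example (with n = 100 tickets, an arbitrary choice) shows how far below n the estimate sits on average:

import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000

# Draw one ticket per trial; the maximum-likelihood estimate of n is simply
# the number on the drawn ticket.
draws = rng.integers(1, n + 1, size=trials)
print(draws.mean())   # approximately (n + 1) / 2 = 50.5, far below n = 100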

Asymptotics

In many cases, estimation is performed using a set of independent identically distributed measurements. These may correspond to distinct elements from a random sample, repeated observations, etc. In such cases, it is of interest to determine the behavior of a given estimator as the number of measurements increases to infinity, referred to as asymptotic behaviour.

Under certain (fairly weak) regularity conditions, which are listed below, the MLE exhibits several characteristics which can be interpreted to mean that it is "asymptotically optimal". These characteristics include:

  • The MLE is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity.
  • The MLE is asymptotically efficient, i.e., it achieves the Cramér–Rao lower bound when the number of samples tends to infinity. This means that, asymptotically, no unbiased estimator has lower mean squared error than the MLE.
  • The MLE is asymptotically normal. As the number of samples increases, the distribution of the MLE tends to the Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix.

The asymptotic unbiasedness and efficiency follow directly from this limiting Gaussian distribution, whose mean is θ and whose covariance attains the Cramér–Rao bound.

The regularity conditions required to ensure this behavior are:

  1. The first and second derivatives of the log-likelihood function must be defined.
  2. The Fisher information matrix must not be zero.

While these asymptotic properties only become strictly true in the limit of infinite sample size, in practice they are often assumed to be approximately true, especially when the sample size is not that small. In particular, inference about the estimated parameters is often based on the asymptotic Gaussian distribution of the MLE.
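
As an added illustration of these asymptotic claims, the following sketch simulates the MLE p̂ = t/n for a Bernoulli/binomial model; the true p, the sample size and the number of replications are arbitrary choices. Its mean is close to the true parameter and its variance is close to the inverse Fisher information p(1−p)/n.

import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.3, 1000, 20_000

# Each replication: n Bernoulli(p) trials, MLE p_hat = (number of successes)/n.
p_hat = rng.binomial(n, p, size=trials) / n
print(p_hat.mean())        # close to p = 0.3 (asymptotically unbiased)
print(p_hat.var())         # close to the inverse Fisher information ...
print(p * (1 - p) / n)     # ... p*(1-p)/n = 0.00021, the Cramer-Rao bound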

See also

  • mean squared error, a measure of how 'good' an estimator of a distributional parameter is (be it the maximum likelihood estimator or some other estimator).
  • The Rao–Blackwell theorem, a result which yields a process for finding the best possible unbiased estimator (in the sense of having minimal mean squared error). The MLE is often a good starting place for the process.
  • sufficient statistic, a function of the data through which the MLE (if it exists and is unique) depends on the data.
  • MAP estimator, for a contrasting way to calculate estimators when prior knowledge is postulated.
