M-estimator

In statistics, M-estimators are a broad class of statistics which are obtained as the solution to the problem of minimizing certain functions of the data. The process of obtaining an M-estimator is called M-estimation.

Some authors define M-estimators to be the root or roots of a system of equations consisting of certain functions of the data. This class is a subset of the class of minimization solutions. Typically these functions are the derivatives of the functions to be minimized in the broader definition.

Many classical statistics can be shown to be M-estimators. Their main utility, however, is as robust alternatives to classical statistical estimators.

Historical motivation

For a family of probability density functions f parameterized by θ, the maximum likelihood estimate of θ (which could be vector valued) is computed by maximizing the likelihood function over θ. The estimate is

\widehat{\theta} = \operatorname{argmax}_{\theta} \left( \prod_{i=1}^n f(x_i, \theta) \right)

or, equivalently,

\widehat{\theta} = \operatorname{argmin}_{\theta} \left( -\sum_{i=1}^n \log f(x_i, \theta) \right).

The performance of maximum likelihood estimators depends heavily on the assumed family of distributions being at least approximately correct for the data. In particular, maximum likelihood estimators can be inefficient and biased when the data are not drawn from the assumed distribution. Of particular concern is the presence of outliers.

Definition

In 1964, Peter Huber proposed generalizing maximum likelihood estimation to the minimization of

\sum_{i=1}^n \rho(x_i, \theta),

where ρ is a function with certain properties (see below). The solutions

\widehat{\theta} = \operatorname{argmin}_{\theta} \left( \sum_{i=1}^n \rho(x_i, \theta) \right)

are called M-estimators ("M" for "maximum likelihood-type" (Huber, 1981, page 43); other types of robust estimator include L-estimators, R-estimators and S-estimators). Maximum likelihood estimators are thus a special case of M-estimators.

The function ρ, or its derivative ψ, can be chosen in such a way as to give the estimator desirable properties (in terms of bias and efficiency) when the data are truly from the assumed distribution, and acceptable behaviour when the data are generated from a model that is, in some sense, close to the assumed distribution.
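
A standard example of such a ρ is Huber's loss function for a location parameter, which is quadratic for small deviations and linear for large ones; with tuning constant k,

\rho(x, \theta) = \begin{cases} \tfrac{1}{2}(x - \theta)^2, & |x - \theta| \le k, \\ k\,|x - \theta| - \tfrac{1}{2}k^2, & |x - \theta| > k. \end{cases}

The corresponding ψ (the derivative of ρ with respect to θ) is bounded by k in absolute value, which limits the influence any single observation can have on the estimate.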

Types of M-estimators

M-estimators are solutions θ which minimize

\sum_{i=1}^n \rho(x_i, \theta).

This minimization can always be done directly. Often it is simpler to differentiate with respect to θ and solve for the root of the derivative. When this differentiation is possible, the M-estimator is said to be of ψ-type. Otherwise, the M-estimator is said to be of ρ-type.

In most practical cases, the M-estimators are of ψ-type.

ρ-type

For positive integer r, let (\mathcal{X},\Sigma) and (\Theta\subset\mathbb{R}^r,S) be measure spaces. \theta\in\Theta is a vector of parameters. An M-estimator of ρ-type T is defined through a measurable function \rho:\mathcal{X}\times\Theta\rightarrow\mathbb{R}. It maps a probability distribution F on \mathcal{X} to the value T(F)\in\Theta (if it exists) that minimizes \int_{\mathcal{X}}\rho(x,\theta)dF(x):

T(F):=\arg\min_{\theta\in\Theta}\int_{\mathcal{X}}\rho(x,\theta)dF(x)

For example, for the maximum likelihood estimator, ρ(x,θ) = − log(f(x,θ)), where f(x,\theta)=\frac{\partial F(x,\theta)}{\partial x}.

ψ-type

If ρ is differentiable with respect to θ, the computation of \widehat{\theta} is usually much easier. An M-estimator of ψ-type T is defined through a measurable function \psi:\mathcal{X}\times\Theta\rightarrow\mathbb{R}^r. It maps a probability distribution F on \mathcal{X} to the value T(F)\in\Theta (if it exists) that solves the vector equation

\int_{\mathcal{X}}\psi(x,T(F))dF(x)=0

For example, for the maximum likelihood estimator, \psi(x,\theta)=\left(\frac{\partial\log(f(x,\theta))}{\partial \theta^1},\cdots,\frac{\partial\log(f(x,\theta))}{\partial \theta^r}\right)^\mathrm{t}, where u^\mathrm{t} denotes the transpose of vector u and f(x,\theta)=\frac{\partial F(x,\theta)}{\partial x}.

Such an estimator is not necessarily an M-estimator of ρ-type, but if ρ has a continuous first derivative with respect to θ, then a necessary condition for the corresponding M-estimator of ψ-type to be an M-estimator of ρ-type is \psi(x,\theta)=\nabla_\theta\rho(x,\theta). The previous definitions can easily be extended to finite samples.

If the function ψ decreases to zero as x \rightarrow \pm \infty, the estimator is called redescending. Such estimators have some additional desirable properties, such as complete rejection of gross outliers.
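
A well-known example of a redescending ψ function is Tukey's biweight (bisquare), with tuning constant k:

\psi(x) = \begin{cases} x \left( 1 - \left( \tfrac{x}{k} \right)^2 \right)^2, & |x| \le k, \\ 0, & |x| > k, \end{cases}

so observations lying further than k from the fit receive zero weight.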

Computation

For many choices of ρ or ψ, no closed form solution exists and an iterative approach to computation is required. It is possible to use standard function optimization algorithms, such as Newton-Raphson. However, in most cases an iteratively re-weighted least squares fitting algorithm can be performed; this is typically the preferred method.

For some choices of ψ, specifically, redescending functions, the solution may not be unique. The issue is particularly relevant in multivariate and regression problems. Thus, some care is needed to ensure that good starting points are chosen. Robust starting points, such as the median as an estimate of location and the median absolute deviation as a univariate estimate of scale, are common.
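
As an illustration (a minimal sketch, not from the article; the function name and the conventional tuning constant k = 1.345 are illustrative choices), the Huber location estimate can be computed by iteratively re-weighted least squares, starting from the median and using the median absolute deviation as the scale estimate:

    import numpy as np

    def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
        """Huber M-estimate of location via iteratively re-weighted least squares."""
        x = np.asarray(x, dtype=float)
        theta = np.median(x)                            # robust starting point
        scale = np.median(np.abs(x - theta)) / 0.6745   # MAD, rescaled for the normal
        if scale == 0:
            return theta
        for _ in range(max_iter):
            r = np.abs(x - theta) / scale                  # standardized absolute residuals
            w = np.minimum(1.0, k / np.maximum(r, 1e-12))  # Huber weights psi(r)/r
            theta_new = np.sum(w * x) / np.sum(w)          # weighted mean = IRLS update
            if abs(theta_new - theta) < tol * scale:
                return theta_new
            theta = theta_new
        return theta

    # A single gross outlier barely moves the estimate, unlike the ordinary mean.
    data = [2.1, 1.9, 2.0, 2.2, 1.8, 50.0]
    print(huber_location(data), np.mean(data))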

Properties

Distribution

It can be shown that M-estimators are asymptotically normally distributed. As such, Wald-type approaches to constructing confidence intervals and hypothesis tests can be used. However, since the theory is asymptotic, it will frequently be sensible to check the distribution, perhaps by examining the permutation or bootstrap distribution.
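
For example (an illustrative sketch, not from the article), the bootstrap distribution of the sample median, which is the M-estimator with ρ(x, θ) = |x − θ|, can be compared with the normal approximation behind a Wald-type interval:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_t(df=3, size=200)      # a heavy-tailed sample

    # Bootstrap distribution of the median (M-estimator with rho(x, theta) = |x - theta|)
    boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                     for _ in range(2000)])

    est, se = np.median(x), boot.std(ddof=1)
    print("Wald 95% CI:     ", (est - 1.96 * se, est + 1.96 * se))
    print("Bootstrap 95% CI:", tuple(np.percentile(boot, [2.5, 97.5])))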

Influence function

The influence function of an M-estimator of ψ-type is proportional to its defining ψ function.

Let T be an M-estimator of ψ-type, and G be a probability distribution for which T(G) is defined. Its influence function IF is

\operatorname{IF}(x;T,G) = -\frac{\psi(x,T(G))}{\int\left[\frac{\partial\psi(y,\theta)}{\partial\theta}\right]_{\theta=T(G)}\mathrm{d}G(y)}

A proof of this property of M-estimators can be found in Huber (1981, Section 3.2).
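
For instance, for the mean (the ψ-type estimator with ψ(x, θ) = θ − x discussed in the examples below), \frac{\partial\psi}{\partial\theta} = 1 and the influence function reduces to

\operatorname{IF}(x;T,G) = -\frac{T(G) - x}{\int 1 \,\mathrm{d}G(y)} = x - T(G),

which is unbounded in x; a bounded ψ, such as Huber's, gives a bounded influence function, which is the formal sense in which the corresponding estimator resists outliers.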

Applications

M-estimators can be constructed for location parameters and scale parameters in univariate and multivariate settings, as well as being used in robust regression.

Examples

Mean

Let (X1, ... , Xn) be a set of independent, identically distributed random variables, with distribution F.

If we define

\rho(x, \theta) = \frac{(x - \theta)^2}{2},

we note that the sum \sum_{i=1}^n \rho(X_i, \theta) is minimized when θ is the mean of the Xs. Thus the mean is an M-estimator of ρ-type, with this ρ function.

As this ρ function is continuously differentiable in θ, the mean is thus also an M-estimator of ψ-type for ψ(x, θ) = θ - x.
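
As a quick numerical check (an illustrative sketch, not part of the article), solving the ψ-type estimating equation \sum_{i=1}^n \psi(X_i, \theta) = 0 with ψ(x, θ) = θ − x recovers the sample mean:

    import numpy as np
    from scipy.optimize import brentq

    x = np.array([1.2, 3.4, 2.2, 5.0, 4.1])

    # psi-type estimating equation for the mean: sum over the sample of (theta - x_i)
    def psi_sum(theta):
        return np.sum(theta - x)

    root = brentq(psi_sum, x.min(), x.max())   # root of the estimating equation
    print(root, x.mean())                      # both equal the sample mean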


References

  • Andersen, R. (2008). Modern Methods for Robust Regression. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-152.
  • Huber, Peter J. (1981, 2004). Robust Statistics. New York: Wiley.
  • Hoaglin, David C.; Mosteller, Frederick; Tukey, John W. (1983). Understanding Robust and Exploratory Data Analysis. Wiley. ISBN 0-471-09777-2.
  • Wilcox, R. R. (2003). Applying Contemporary Statistical Techniques. San Diego, CA: Academic Press. pp. 55-79.