Leverage (statistics)

In statistics, leverage is a term used in connection with regression analysis and, in particular, in analyses aimed at identifying those observations that are far away from corresponding average predictor values. Leverage points do not necessarily have a large effect on the outcome of fitting regression models.

Leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation.[1]

Modern computer packages for statistical analysis include, as part of their facilities for regression analysis, various quantitative measures for identifying influential observations: among these measures is partial leverage, a measure of how a variable contributes to the leverage of a datum.

Linear regression model

Definition

In the linear regression model, the leverage score for the  i^{th} data unit is defined as

 h_{ii} = \left[ H \right]_{ii} ,

the  i^{th} diagonal element of the hat matrix  H=X(X'X)^{-1}X', where the apostrophe denotes the matrix transpose.
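As a numerical illustration, the leverage scores can be computed directly as the diagonal of the hat matrix. The following NumPy sketch (the toy design matrix and variable names are illustrative assumptions, not part of the definition) does so and, for the special case of simple regression with an intercept, compares the result with the standard closed form  h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2 .

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy design matrix: an intercept column plus one predictor.
    n = 20
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])

    # Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverage scores h_ii.
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    leverage = np.diag(H)

    # For simple regression with an intercept, h_ii = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2.
    x_centered = x - x.mean()
    closed_form = 1 / n + x_centered**2 / (x_centered**2).sum()
    print(np.allclose(leverage, closed_form))  # True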

The leverage score is also known as the observation self-sensitivity or self-influence,[2] as:

h_{ii} = \frac{\partial\hat{y}_i}{\partial y_i},

where \hat{y}_i and y_i are the fitted and measured observations, respectively.
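Because \hat{y} = Hy is linear in y, this self-sensitivity can be checked numerically by perturbing a single response and observing how the corresponding fitted value changes. A minimal sketch (the toy data and the perturbed index are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 15
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = rng.normal(size=n)

    H = X @ np.linalg.inv(X.T @ X) @ X.T

    # Perturb the i-th response and see how the i-th fitted value reacts.
    i, delta = 4, 1e-6
    y_pert = y.copy()
    y_pert[i] += delta
    dyhat_i = (H @ y_pert)[i] - (H @ y)[i]
    print(dyhat_i / delta, H[i, i])  # the two numbers agree up to rounding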

Property 1

 0 \leq h_{ii} \leq 1

Proof

First, note that  H^2=X(X'X)^{-1}X'X(X'X)^{-1}X'=XI(X'X)^{-1}X'=H , so  H is idempotent. Also, observe that  H is symmetric. So we have

 h_{ii} = (H^2)_{ii} = \sum_j h_{ij} h_{ji} = \sum_j h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \geq 0

and

 h_{ii} \geq h_{ii}^2 \implies h_{ii} \leq 1 .
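This bound can also be checked numerically on any design matrix; a brief sketch with arbitrary toy data:

    import numpy as np

    rng = np.random.default_rng(2)
    X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

    # Every leverage score lies in [0, 1], as Property 1 states.
    print(h.min() >= 0 and h.max() <= 1)  # True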

Property 2

If we are in an ordinary least squares setting with fixed X and homoscedastic, uncorrelated errors, i.e.

 Y = X\beta + \epsilon, \qquad var(\epsilon) = \sigma^2 I,

then  var(e_i)=(1-h_{ii})\sigma^2 , where  e_i=Y_i-\hat{Y}_i is the  i^{th} residual.

In other words, if the errors \epsilon are homoscedastic, the leverage score determines the variance of the corresponding residual: high-leverage observations have residuals with smaller variance, because the fitted model is pulled close to them.

Proof

First, note that  e = Y - \hat{Y} = Y - HY = (I-H)Y , and that  I-H is idempotent and symmetric. This gives  var(e)=var((I-H)Y)=(I-H)var(Y)(I-H)'=\sigma^2(I-H)^2=\sigma^2(I-H) .

So that,  var(e_i)=(1-h_{ii})\sigma^2 .
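Property 2 can also be illustrated by simulation: drawing many replicates of Y from the model above and recording the residual vectors, the empirical variance of each e_i approaches (1-h_{ii})\sigma^2. A sketch with arbitrarily chosen true coefficients, noise level, and number of replicates:

    import numpy as np

    rng = np.random.default_rng(3)
    n, sigma = 25, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta = np.array([1.0, -0.5])  # illustrative true coefficients
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)

    # Simulate many data sets with homoscedastic noise, one per row of Y.
    reps = 100_000
    Y = X @ beta + sigma * rng.normal(size=(reps, n))
    E = Y @ (np.eye(n) - H)          # residuals e = (I - H) Y, using that I - H is symmetric
    empirical_var = E.var(axis=0)

    # Largest deviation from (1 - h_ii) * sigma^2 is small (Monte Carlo error only).
    print(np.max(np.abs(empirical_var - (1 - h) * sigma**2)))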

See also

References

  1. Everitt, B. S. (2002). The Cambridge Dictionary of Statistics. Cambridge University Press. ISBN 0-521-81099-X.