Studentized residual

For a broader coverage related to this topic, see Studentization.


In statistics, a studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation. Typically the standard deviations of residuals in a sample vary greatly from one data point to another even when the errors all have the same standard deviation, particularly in regression analysis; thus it does not make sense to compare residuals at different data points without first studentizing. It is a form of a Student's t-statistic, with the estimate of error varying between points.

Studentized residuals are an important tool in the detection of outliers. The technique is one of several named in honor of William Sealey Gosset, who wrote under the pseudonym Student; dividing a statistic by an estimate of its scale is called studentizing, in analogy with standardizing and normalizing.

Motivation

The key reason for studentizing is that, in regression analysis of a multivariate distribution, the variances of the residuals at different input variable values may differ, even if the variances of the errors at these different input variable values are equal. The issue is the difference between errors and residuals in statistics, particularly the behavior of residuals in regressions.

Consider the simple linear regression model

Y = \alpha_0 + \alpha_1 X + \varepsilon. \,

Given a random sample (X_i, Y_i), i = 1, ..., n, each pair (X_i, Y_i) satisfies

Y_i = \alpha_0 + \alpha_1 X_i + \varepsilon_i,\,

where the errors ε_i are independent and all have the same variance σ². The residuals are not the true, unobservable errors but estimates of them based on the observable data. When the method of least squares is used to estimate α₀ and α₁, the residuals $\widehat{\varepsilon}_i$, unlike the errors $\varepsilon_i$, cannot be independent, since they satisfy the two constraints

\sum_{i=1}^n \widehat{\varepsilon}_i=0

and

\sum_{i=1}^n \widehat{\varepsilon}_i x_i=0.

(Here ε_i is the ith error, $\widehat{\varepsilon}_i$ is the ith residual, and x_i is the observed value of X_i.)
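The two constraints can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (the values of `xs` and `ys` and the helper name `fit_line` are illustrative and not part of the article); it fits the line by least squares and evaluates both sums.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the model y = a0 + a1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a0, a1


# Hypothetical illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.5]

a0, a1 = fit_line(xs, ys)
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]

# Both sums vanish up to floating-point rounding error.
sum_res = sum(residuals)
sum_res_x = sum(r * x for r, x in zip(residuals, xs))
print(sum_res, sum_res_x)
```

Because the n residuals satisfy two linear constraints, only n − 2 of them are free to vary, which is why they cannot be independent.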

Moreover, and most importantly, the residuals, unlike the errors, do not all have the same variance: the variance decreases as the corresponding x-value gets farther from the average x-value. This is a feature of the regression better fitting values at the ends of the domain, not the data itself, and is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence. This can also be seen because the residuals at endpoints depend greatly on the slope of a fitted line, while the residuals at the middle are relatively insensitive to the slope. The fact that the variances of the residuals differ, even though the variances of the true errors are all equal to each other, is the principal reason for the need for studentization.

Studentization is therefore needed not simply because the population parameters (mean and standard deviation) are unknown, but because a regression yields a different residual distribution at each data point. This is unlike point estimation for a univariate distribution, where all residuals share a common distribution.

How to studentize

For this simple model, the design matrix is

X=\left[\begin{matrix}1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{matrix}\right]

and the hat matrix H is the matrix of the orthogonal projection onto the column space of the design matrix:

H=X(X^T X)^{-1}X^T.\,

The leverage h_ii is the ith diagonal entry in the hat matrix. The variance of the ith residual is

\operatorname{var}(\widehat{\varepsilon}_i)=\sigma^2(1-h_{ii}).

In case the design matrix X has only two columns (as in the example above), this is equal to

\operatorname{var}(\widehat{\varepsilon}_i)=\sigma^2\left( 1 - \frac1n -\frac{(x_i-\bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2 } \right).
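The two-column formula can be evaluated directly. The sketch below, assuming a hypothetical set of equally spaced design points (`xs` and the helper name `leverages` are illustrative), computes the leverages h_ii = 1/n + (x_i − x̄)²/Σ(x_j − x̄)² and the factor 1 − h_ii by which σ² is multiplied.

```python
def leverages(xs):
    """Leverages h_ii for simple linear regression with an intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical design points, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]

h = leverages(xs)
# var(residual_i) / sigma^2 for each point.
var_factor = [1.0 - h_ii for h_ii in h]

for x, h_ii, v in zip(xs, h, var_factor):
    print(f"x = {x:.1f}  leverage = {h_ii:.3f}  variance factor = {v:.3f}")

# The leverages sum to the number of columns of X (here 2),
# since the trace of a projection matrix equals its rank.
print(sum(h))
```

For these points the variance factor is largest at the middle value of x and smallest at the two ends, in line with the observation that residual variance decreases away from the average x-value.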

The corresponding studentized residual is then

t_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma} \sqrt{1-h_{ii}}}

where $\widehat{\sigma}$ is an appropriate estimate of σ (see below).
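As a rough illustration, the sketch below computes studentized residuals for the simple linear model. It assumes that $\widehat{\sigma}^2$ is taken to be the residual sum of squares divided by n − 2, which is one common choice; the data and the function name `internally_studentized` are hypothetical.

```python
import math


def internally_studentized(xs, ys):
    """Studentized residuals for y = a0 + a1 * x, using the full-sample sigma estimate."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a0 = y_bar - a1 * x_bar

    residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
    h = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]

    # Assumed estimate of sigma: residual sum of squares over n - 2.
    sigma_hat = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [r / (sigma_hat * math.sqrt(1.0 - h_ii)) for r, h_ii in zip(residuals, h)]


# Hypothetical illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.5]

t = internally_studentized(xs, ys)
print([round(v, 4) for v in t])
```

Dividing by √(1 − h_ii) puts residuals at high-leverage points on the same scale as those near the center of the data.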

Internal and external studentization

The usual estimate of σ² is

\widehat{\sigma}^2={1 \over n-m}\sum_{j=1}^n \widehat{\varepsilon}_j^{\,2},

where m is the number of parameters in the model (2 in our example). However, when one is considering whether the ith case may be an outlier, it is desirable to exclude the ith observation from the process of estimating the variance. Consequently, one may use the estimate

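\widehat{\sigma}_{(i)}^2={1 \over n-m-1}\sum_{\begin{smallmatrix}j = 1\\j \ne i\end{smallmatrix}}^n \widehat{\varepsilon}_{j(i)}^{\,2},

based on all but the ith observation, where $\widehat{\varepsilon}_{j(i)}$ denotes the jth residual from the fit computed with the ith case deleted. (Summing the full-fit residuals $\widehat{\varepsilon}_j$ over j ≠ i would not remove the influence of the ith case, because those residuals still depend on it.) No refitting is needed, since the estimate can be obtained from the full fit as

\widehat{\sigma}_{(i)}^2={1 \over n-m-1}\left( (n-m)\,\widehat{\sigma}^2 - \frac{\widehat{\varepsilon}_i^{\,2}}{1-h_{ii}} \right).

If the former estimate $\widehat{\sigma}^2$ is used, including the ith case, then the residual is said to be internally studentized, $t_i$. If the latter estimate $\widehat{\sigma}_{(i)}^2$ is used instead, excluding the ith case, then it is said to be externally studentized, $t_{i(i)}$.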
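The sketch below illustrates external studentization for the simple linear model. It assumes that $\widehat{\sigma}_{(i)}^2$ is the residual variance of the fit with the ith observation deleted, and it relies on the standard identity that the deleted residual sum of squares equals the full residual sum of squares minus $\widehat{\varepsilon}_i^{\,2}/(1-h_{ii})$. The identity is cross-checked against an explicit refit. All data and function names are hypothetical.

```python
import math


def fit(xs, ys):
    """Return (residuals, leverages) for the least-squares line y = a0 + a1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a0 = y_bar - a1 * x_bar
    residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
    h = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]
    return residuals, h


def deleted_variances(xs, ys, m=2):
    """Leave-one-out variance estimates via the identity (no refitting)."""
    residuals, h = fit(xs, ys)
    rss = sum(r * r for r in residuals)
    n = len(xs)
    return [(rss - r * r / (1.0 - h_ii)) / (n - m - 1) for r, h_ii in zip(residuals, h)]


def deleted_variance_by_refit(xs, ys, i, m=2):
    """Same quantity, obtained by actually dropping case i and refitting."""
    xs_d = xs[:i] + xs[i + 1:]
    ys_d = ys[:i] + ys[i + 1:]
    residuals_d, _ = fit(xs_d, ys_d)
    return sum(r * r for r in residuals_d) / (len(xs_d) - m)


def externally_studentized(xs, ys, m=2):
    """Residuals scaled by the leave-one-out sigma estimate."""
    residuals, h = fit(xs, ys)
    s2 = deleted_variances(xs, ys, m)
    return [r / math.sqrt(s2_i * (1.0 - h_ii)) for r, s2_i, h_ii in zip(residuals, s2, h)]


# Hypothetical illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.5]

t_ext = externally_studentized(xs, ys)
print([round(v, 4) for v in t_ext])
```

Using the identity avoids n separate refits, which matters when the data set is large.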

Distribution

If the errors are independent and normally distributed with expected value 0 and variance σ², then the probability distribution of the ith externally studentized residual $t_{i(i)}$ is a Student's t-distribution with n − m − 1 degrees of freedom, and can range from $-\infty$ to $+\infty$.

On the other hand, the internally studentized residuals are in the range $0 \pm \sqrt{\nu}$, where ν = n − m is the number of residual degrees of freedom. If t_i represents the internally studentized residual, and again assuming that the errors are independent identically distributed Gaussian variables, then:^[1]

t_i \sim \sqrt{\nu} {t \over \sqrt{t^2+\nu-1}}

where t is a random variable distributed as Student's t-distribution with ν − 1 degrees of freedom. In fact, this implies that t_i²/ν follows the beta distribution B(1/2, (ν − 1)/2). The distribution above is sometimes referred to as the tau distribution;^[1] it was first derived by Thompson in 1935.^[2]
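The distributional formula has a deterministic counterpart on any single data set: applying the same map to the externally studentized residual $t_{i(i)}$ recovers the internally studentized residual t_i. The sketch below checks this numerically, assuming the external variance estimate is the leave-one-out residual variance; the data and the names `studentized_pair` and `tau_map` are hypothetical.

```python
import math


def studentized_pair(xs, ys):
    """Return (internal, external, nu) for the least-squares line y = a0 + a1 * x."""
    n, m = len(xs), 2
    nu = n - m  # residual degrees of freedom
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a0 = y_bar - a1 * x_bar
    residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
    h = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]
    rss = sum(r * r for r in residuals)

    internal = [
        r / math.sqrt(rss / nu * (1.0 - h_ii)) for r, h_ii in zip(residuals, h)
    ]
    # Leave-one-out variance: (rss - r^2 / (1 - h_ii)) / (nu - 1).
    external = [
        r / math.sqrt((rss - r * r / (1.0 - h_ii)) / (nu - 1) * (1.0 - h_ii))
        for r, h_ii in zip(residuals, h)
    ]
    return internal, external, nu


def tau_map(t, nu):
    """Map a t-type value to the tau scale: sqrt(nu) * t / sqrt(t^2 + nu - 1)."""
    return math.sqrt(nu) * t / math.sqrt(t * t + nu - 1)


# Hypothetical illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.5]

internal, external, nu = studentized_pair(xs, ys)
for t_int, t_ext in zip(internal, external):
    print(round(t_int, 4), round(tau_map(t_ext, nu), 4))  # the two columns agree
```

Since |t|/√(t² + ν − 1) is always below 1, the map also shows why internally studentized residuals cannot exceed √ν in absolute value.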

When ν = 3, the internally studentized residuals are uniformly distributed between $-\sqrt{3}$ and $+\sqrt{3}$. If there is only one residual degree of freedom, the above formula for the distribution of internally studentized residuals does not apply. In this case, the t_i are all either +1 or −1, with 50% chance for each.

The standard deviation of the distribution of internally studentized residuals is always 1, but this does not imply that the standard deviation of all the t_i of a particular experiment is 1. For instance, the internally studentized residuals when fitting a straight line going through (0, 0) to the points (1, 4), (2, −1), (2, −1) are $\sqrt{2},\ -\sqrt{5}/5,\ -\sqrt{5}/5$, and the standard deviation of these is not 1.
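The example can be reproduced in a few lines. The sketch below assumes the one-parameter model y = βx (so m = 1 and ν = 2), for which the leverage is h_ii = x_i²/Σx_j²; the variable names are illustrative.

```python
import math
import statistics

# The three points from the example; the line is forced through the origin.
xs = [1.0, 2.0, 2.0]
ys = [4.0, -1.0, -1.0]

sxx = sum(x * x for x in xs)
beta = sum(x * y for x, y in zip(xs, ys)) / sxx  # least-squares slope
residuals = [y - beta * x for x, y in zip(xs, ys)]

# Leverages for a single-column design matrix.
h = [x * x / sxx for x in xs]

nu = len(xs) - 1  # n - m with m = 1
sigma_hat = math.sqrt(sum(r * r for r in residuals) / nu)

t = [r / (sigma_hat * math.sqrt(1.0 - h_ii)) for r, h_ii in zip(residuals, h)]
print(t)  # approximately [1.41421, -0.44721, -0.44721]

# Sample standard deviation of these three values; it differs from 1.
print(statistics.stdev(t))
```

The fitted slope here is 0, so the residuals are simply 4, −1, −1, and the first studentized residual sits exactly at the upper bound √ν = √2.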

References

[1] Allen J. Pope (1976), "The statistics of residuals and the detection of outliers", U.S. Dept. of Commerce, National Oceanic and Atmospheric Administration, National Ocean Survey, Geodetic Research and Development Laboratory, 136 pages, eq. (6).
[2] William R. Thompson (1935), "On a Criterion for the Rejection of Observations and the Distribution of the Ratio of Deviation to Sample Standard Deviation", Ann. Math. Statist. 6 (4), 214–219. doi:10.1214/aoms/1177732567. http://projecteuclid.org/euclid.aoms/1177732567.

Cook, R. Dennis; Weisberg, Sanford (1982). Residuals and Influence in Regression (Repr. ed.). New York: Chapman and Hall. ISBN 041224280X.

This article is issued from Wikipedia - version of the Monday, November 16, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.