Multicollinearity

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with others.

Definition

Collinearity is a linear relationship between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, X_1 and X_2 are perfectly collinear if

X_1 = \lambda X_2

for some nonzero constant \lambda.

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly correlated. We have perfect multicollinearity if the correlation between two independent variables is equal to 1 or -1. In practice, we rarely face perfect multicollinearity in a data set. More commonly, the issue of multicollinearity arises when there is a high degree of correlation (either positive or negative) between two or more independent variables.
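
As an illustration of the distinction, the following sketch (assuming NumPy is available; the variable names and the noise scale are purely illustrative) contrasts a perfectly collinear pair of variables with a nearly collinear one:

    import numpy as np

    rng = np.random.default_rng(0)
    x2 = rng.normal(size=200)

    x1_perfect = 3.0 * x2                                   # exact linear relationship: X1 = 3 * X2
    x1_near = 3.0 * x2 + rng.normal(scale=0.05, size=200)   # a little noise breaks the exactness

    print(np.corrcoef(x1_perfect, x2)[0, 1])  # exactly 1.0: perfect multicollinearity
    print(np.corrcoef(x1_near, x2)[0, 1])     # close to, but below, 1.0: near multicollinearity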

Mathematically, a set of variables is collinear if there exist one or more exact linear relationships among the variables. For example, we may have

\lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0

holding for every observation i, where the \lambda_j are constants (not all zero) and X_{ji} is the i-th observation on the j-th explanatory variable. We can explore the problems caused by multicollinearity by examining the estimation of the parameters of the multiple regression equation:

Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i

The ordinary least squares estimates involve inverting the matrix

X^T X

where

X = \begin{bmatrix}
      1 & X_{11} & \cdots & X_{k1} \\
      \vdots & \vdots & & \vdots \\
      1 & X_{1N} & \cdots & X_{kN}
\end{bmatrix}

If there is an exact linear relationship among the independent variables, the rank of X is less than k + 1 (the number of its columns), and the matrix X^T X will not be invertible.

In most applications, perfect multicollinearity is unlikely. An analyst is more likely to face near multicollinearity. For example, suppose we add a stochastic error term v_i to the equation above, such that


\lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} + v_i = 0

In this case, there is no exact linear relationship among the variables, but the explanatory variables are nearly perfectly correlated. The matrix X^T X is then invertible, but it is ill-conditioned, so small changes in the data can produce large changes in its inverse and hence in the estimated coefficients.
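
To make the contrast concrete, here is a minimal sketch (assuming NumPy; the column construction is purely illustrative) showing that an exact linear relationship among the columns of X makes X^T X singular, while a near relationship leaves it invertible but ill-conditioned:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x1 = rng.normal(size=n)

    # Exact collinearity: the third column is exactly twice the second.
    X_exact = np.column_stack([np.ones(n), x1, 2.0 * x1])
    # Near collinearity: the same relationship plus a small stochastic error term.
    X_near = np.column_stack([np.ones(n), x1, 2.0 * x1 + rng.normal(scale=0.01, size=n)])

    print(np.linalg.matrix_rank(X_exact))     # 2, less than the 3 columns: X^T X is singular
    print(np.linalg.cond(X_near.T @ X_near))  # huge condition number: invertible but ill-conditioned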

Detection of multicollinearity

Indicators that multicollinearity may be present in a model:

1) Large changes in the estimated regression coefficients when a predictor variable is added or deleted

2) Insignificant regression coefficients for the affected variables in the multiple regression, even though a joint test of the hypothesis that those coefficients are all zero is rejected (using an F-test)

3) Large changes in the estimated regression coefficients when an observation is added or deleted

Some authors have suggested a formal detection rule based on the tolerance or the variance inflation factor (VIF) of each explanatory variable:

\mathrm{tolerance}_j = 1-R_j^2,\quad \mathrm{VIF}_j = \frac{1}{\mathrm{tolerance}_j},

where R_j^2 is the coefficient of determination of a regression of the j-th explanatory variable on all the other explanatory variables. A tolerance of less than 0.20 and/or a VIF of 5 and above indicates a multicollinearity problem (but see O'Brien 2007).[1]
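
A minimal sketch of how these quantities can be computed, assuming NumPy (the function name vif and the simulated data are illustrative, not a library API): regress each predictor on the remaining predictors, take that regression's R^2, and form the tolerance and VIF from it.

    import numpy as np

    def vif(X):
        """Variance inflation factors for the columns of X (predictors only, no intercept column)."""
        n, k = X.shape
        factors = []
        for j in range(k):
            y = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(others, y, rcond=None)
            resid = y - others @ beta
            r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))   # R_j^2
            factors.append(1.0 / (1.0 - r2))                                 # VIF_j = 1 / tolerance_j
        return np.array(factors)

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
    x3 = rng.normal(size=200)                   # essentially unrelated to x1 and x2
    print(vif(np.column_stack([x1, x2, x3])))   # large VIFs for x1 and x2, roughly 1 for x3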

Consequences of multicollinearity

In the presence of multicollinearity, the estimate of one variable's impact on y while controlling for the others tends to be less precise than if predictors were uncorrelated with one another. The usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one unit change in an independent variable, X1, holding the other variables constant. If X1 is highly correlated with another independent variable, X2, in the given data set, then we only have observations for which X1 and X2 have a particular relationship (either positive or negative). We don't have observations for which X1 changes independently of X2, so we have an imprecise estimate of the effect of independent changes in X1.

In some sense, the collinear variables contain the same information about the dependent variable. If nominally "different" measures actually quantify the same phenomenon then they are redundant. Alternatively, if the variables are accorded different names and perhaps employ different numeric measurement scales but are highly correlated with each other, then they suffer from redundancy.

One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large. In that case, the test of the hypothesis that a coefficient is equal to zero may lead to a failure to reject the null hypothesis. However, if a simple linear regression of the dependent variable on that explanatory variable alone is estimated, the coefficient may well be found to be significant; that is, the analyst will reject the hypothesis that the coefficient is zero. In the presence of multicollinearity, an analyst might therefore falsely conclude that there is no linear relationship between an independent variable and the dependent variable.
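
The point can be illustrated by simulation. The sketch below (assuming NumPy; the helper ols_se and the data-generating process are hypothetical, chosen only for illustration) fits the same model y = 1 + x1 + x2 + noise twice, once with nearly uncorrelated predictors and once with highly correlated ones, and prints the standard errors of the slope estimates; the collinear case shows much larger standard errors.

    import numpy as np

    def ols_se(X, y):
        """OLS coefficient estimates and their standard errors (X includes an intercept column)."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = (resid @ resid) / (X.shape[0] - X.shape[1])       # residual variance estimate
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))     # standard errors of the coefficients
        return beta, se

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2_indep = rng.normal(size=n)                   # essentially uncorrelated with x1
    x2_coll = x1 + rng.normal(scale=0.05, size=n)   # highly correlated with x1

    for x2 in (x2_indep, x2_coll):
        X = np.column_stack([np.ones(n), x1, x2])
        y = 1.0 + x1 + x2 + rng.normal(size=n)
        beta, se = ols_se(X, y)
        print(se[1:])   # slope standard errors are far larger in the collinear case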

A principal danger of such data redundancy is overfitting in regression analysis models. The best regression models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but correlate at most only minimally with each other. Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples drawn from the same statistical population).

See Multi-collinearity Variance Inflation and Orthogonalization in Regression by Dr. Alex Yu.

Remedies for multicollinearity

The consequences of multicollinearity have been likened to those of micronumerosity ("too little data"). Multicollinearity does not actually bias the results; it just produces large standard errors for the coefficients of the related independent variables. With enough data, these standard errors will be reduced.[1]

In addition, you may:

1) Leave the model as is, despite multicollinearity. The presence of multicollinearity does not affect the model's predictions, provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based[unreliable source?].

2) Drop one of the variables. An explanatory variable may be dropped to produce a model with significant coefficients. However, you lose information (because you've dropped a variable). Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables.

3) Obtain more data. This is the preferred solution. More data can produce more precise parameter estimates (with lower standard errors)[unreliable source?].

Note: Multicollinearity does not impact the reliability of the forecast, but rather the interpretation of the explanatory variables. As long as the collinear relationships among the independent variables remain stable over time, multicollinearity will not affect the forecast. If there is reason to believe that the collinear relationships do not remain stable over time, it is better to consider a technique such as ridge regression.
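
For reference, here is a minimal sketch of ridge regression under stated assumptions (NumPy; the penalty alpha is arbitrary here and would normally be chosen by cross-validation; for brevity the sketch also penalizes the intercept, which a careful implementation would usually avoid):

    import numpy as np

    def ridge(X, y, alpha=1.0):
        """Ridge estimate: solve (X^T X + alpha * I) b = X^T y, which stays well-conditioned
        even when the predictors are nearly collinear."""
        k = X.shape[1]
        return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear predictors
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + x1 + x2 + rng.normal(size=n)

    print(ridge(X, y, alpha=1.0))   # coefficients are shrunk toward zero but numerically stable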

Multicollinearity in survival analysis

Multicollinearity may also represent a serious issue in survival analysis. The problem is that time-varying covariates may change their values over the timeline of the study. A special procedure is recommended to assess the impact of multicollinearity on the results. See Van den Poel & Larivière (2004) for a detailed discussion.

References

  • Van den Poel, Dirk; Larivière, Bart (2004). "Attrition Analysis for Financial Services Using Proportional Hazard Models". European Journal of Operational Research 157 (1): 196–217.
