Multicollinearity


Multicollinearity refers to a linear relationship among two or more explanatory variables in a regression model. The original definition referred to an exact linear relationship, but the term was later extended to cover nearly perfect relationships as well. The correlation involved can be negative or positive.

In the presence of multicollinearity, the estimate of one variable's impact on the dependent variable y while controlling for the others tends to be less precise than it would be if the predictors were uncorrelated with one another.

Simply put, if nominally "different" measures are highly correlated with each other, whether positively or negatively -- for instance because they actually quantify the same phenomenon to a significant degree, even though the variables are accorded different names and perhaps employ different numeric measurement scales -- they are redundant as predictors.

A principal danger of such data redundancy is that of overfitting in regression analysis models. The best regression models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but correlate at most only minimally with each other. Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population).

See "Multi-collinearity, Variance Inflation, and Orthogonalization in Regression" by Dr. Alex Yu.

How to tell if you have multicollinearity:

1) Large changes in the estimated regression coefficients when a predictor variable is added or deleted

2) Nonsignificant results of simple linear regressions

3) Estimated regression coefficients whose sign is opposite to that predicted

4) Formal detection using the tolerance or the variance inflation factor (VIF):

\mathrm{tolerance}_j = 1 - R_j^2,\quad \mathrm{VIF}_j = \frac{1}{\mathrm{tolerance}_j},

where R_j^2 is the coefficient of determination obtained when the j-th predictor variable is regressed on all of the other predictor variables.

A tolerance of less than 0.1 (equivalently, a VIF greater than 10) indicates a multicollinearity problem.
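
The following is a minimal sketch, not part of the original article, of how tolerance and VIF can be computed with NumPy; the helper tolerance_and_vif and the synthetic data are illustrative assumptions, not a standard library API.

    import numpy as np

    def tolerance_and_vif(X):
        # X: (n_samples, n_predictors) design matrix without the intercept column.
        # For each predictor, regress it on the remaining predictors, take the R^2
        # of that auxiliary regression, and apply tolerance = 1 - R^2, VIF = 1/tolerance.
        n, p = X.shape
        tol = np.empty(p)
        for j in range(p):
            y_j = X[:, j]                              # j-th predictor treated as response
            A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(A, y_j, rcond=None)
            resid = y_j - A @ beta
            ss_tot = np.sum((y_j - y_j.mean()) ** 2)
            tol[j] = np.sum(resid ** 2) / ss_tot       # = 1 - R^2 of the auxiliary fit
        return tol, 1.0 / tol

    # Synthetic example: x2 is almost a copy of x1, x3 is independent.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.05, size=200)
    x3 = rng.normal(size=200)
    tol, vif = tolerance_and_vif(np.column_stack([x1, x2, x3]))
    print(tol)   # tolerances for x1 and x2 fall well below 0.1
    print(vif)   # their VIFs are correspondingly far above 10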

What to do...

1) The presence of multicollinearity does not affect the fitted model's predictions, provided that the new predictor values follow the same pattern of multicollinearity as the data on which the regression model was estimated.

2) A predictor variable may be dropped to lessen multicollinearity, as illustrated in the sketch after this list. (The information carried by the dropped variable is then lost, however.)

3) You may be able to add cases (collect additional data) that break the pattern of multicollinearity.

4) Estimate the regression coefficients from different data sets
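
As a rough illustration of symptom 1 above and of remedy 2 (the synthetic data, random seed, and helper ols_slopes are made up for this example), the sketch below shows how adding a nearly collinear predictor destabilizes an individual coefficient estimate and how dropping it restores a stable one.

    import numpy as np

    # Synthetic data: y truly depends on x1 only; x2 is almost a copy of x1.
    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)
    y = 2.0 * x1 + rng.normal(size=n)

    def ols_slopes(X, y):
        # Ordinary least squares; returns the slope estimates (intercept dropped).
        A = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return beta[1:]

    print(ols_slopes(x1.reshape(-1, 1), y))           # ~[2.0]: precise, stable estimate
    print(ols_slopes(np.column_stack([x1, x2]), y))   # the two slopes are individually
                                                      # imprecise, though their sum stays
                                                      # near 2.0; dropping x2 avoids this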

Note: Multicollinearity does not reduce the reliability of the forecast as a whole; rather, it affects the interpretation of the individual explanatory variables. As long as the collinear relationships among your independent variables remain stable over time, multicollinearity will not affect your forecast. If there is reason to believe that the collinear relationships do NOT remain stable over time, it is better to consider a technique such as ridge regression.
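
A minimal ridge-regression sketch is given below, assuming standardized predictors and a centered response; the closed-form solution (X'X + lambda*I)^{-1} X'y is standard, but the function name, synthetic data, and penalty value lam = 1.0 are illustrative assumptions.

    import numpy as np

    def ridge(X, y, lam=1.0):
        # Closed-form ridge regression: solve (X'X + lam*I) beta = X'y, with the
        # predictors standardized and the response centered. The penalty lam shrinks
        # and stabilizes the coefficients when X'X is near-singular; lam = 1.0 is an
        # arbitrary illustrative value and would normally be tuned, e.g. by cross-validation.
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        yc = y - y.mean()
        p = Xs.shape[1]
        return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

    # Synthetic example with two nearly collinear predictors:
    rng = np.random.default_rng(2)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.05, size=200)
    y = 2.0 * x1 + rng.normal(size=200)
    print(ridge(np.column_stack([x1, x2]), y))   # the penalized fit spreads the effect of
                                                 # x1 across the two collinear columns
                                                 # instead of swinging between them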

Multicollinearity in Survival Analysis

Multicollinearity may also represent a serious issue in survival analysis. The problem is that time-varying covariates may change their values over the course of the study. A special procedure is recommended to assess the impact of multicollinearity on the results; see Van den Poel & Larivière (2004) for a detailed discussion.

References

  • Van den Poel, Dirk; Larivière, Bart (2004). "Attrition Analysis for Financial Services Using Proportional Hazard Models". European Journal of Operational Research 157 (1): 196–217.
