Multicollinearity

Multicollinearity refers to linear inter-correlation among predictor variables. Simply put, if nominally "different" measures actually quantify the same phenomenon to a significant degree (that is, the variables are given different names and perhaps use different numeric measurement scales, yet correlate highly with each other), they are redundant.

A principal danger of such data redundancy is overfitting in regression models. The best regression models are those in which each predictor variable correlates highly with the dependent (outcome) variable but correlates at most minimally with the other predictors. Such a model is often called "low noise" and will be statistically robust; that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population.

See Multi-collinearity Variance Inflation and Orthogonalization in Regression by Dr. Alex Yu.

How to tell if you have multicollinearity:

1) Large changes in the estimated regression coefficients when a predictor variable is added or deleted

2) Non-significant results in individual tests on regression coefficients for predictors that simple linear regressions show to be important

3) Estimated regression coefficients have a sign opposite to the one predicted by theory or prior experience

4) Formal detection: tolerance or the variance inflation factor (VIF)

    Tolerance_j = 1 - R_j^2    VIF_j = 1 / Tolerance_j

where R_j^2 is the coefficient of determination obtained by regressing the j-th predictor on all the other predictors.

A tolerance of less than 0.1 (equivalently, a VIF greater than 10) is commonly taken to indicate a multicollinearity problem; a small numerical sketch of this check follows below.
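
The tolerance and VIF above can be computed directly from their definitions by regressing each predictor on all of the others. The following is a minimal sketch in Python with NumPy; the function name tolerance_and_vif and the simulated predictors are only illustrative, not part of any standard library.

    import numpy as np

    def tolerance_and_vif(X):
        # X is an (n_samples, n_predictors) array of predictor values.
        # For each predictor j, regress it on all the other predictors,
        # take that R_j^2, then Tolerance_j = 1 - R_j^2 and VIF_j = 1 / Tolerance_j.
        n, p = X.shape
        out = []
        for j in range(p):
            y = X[:, j]                               # predictor j as the response
            others = np.delete(X, j, axis=1)          # remaining predictors
            A = np.column_stack([np.ones(n), others]) # add an intercept column
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
            tol = 1.0 - r2
            out.append((tol, 1.0 / tol))
        return out

    # Example: x3 is almost a linear combination of x1 and x2, so all three
    # predictors end up with tolerances below the 0.1 rule of thumb (VIFs above 10).
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = rng.normal(size=200)
    x3 = x1 + x2 + rng.normal(scale=0.05, size=200)
    X = np.column_stack([x1, x2, x3])
    for j, (tol, vif) in enumerate(tolerance_and_vif(X), start=1):
        print(f"x{j}: tolerance = {tol:.3f}, VIF = {vif:.1f}")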

What to do...

1) The presence of multicollinearity need not harm predictions from the fitted model, provided that the predictor values for which predictions are made follow the same multicollinearity pattern as the data on which the regression model was fit.

2) A predictor variable may be dropped to lessen multicollinearity (but then the model gives no information about the dropped variable); see the sketch after this list.

3) You may be able to add cases (observations) that break the multicollinearity pattern

4) Estimate the regression coefficients from different data sets
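
As a rough illustration of remedy 2 (and of indicator 1 in the detection list), the sketch below uses simulated data and ordinary least squares in NumPy; the variable names and the simulated model are only an example, not a prescribed procedure.

    import numpy as np

    # Simulated data in which x3 nearly duplicates x1, so the true model
    # y = 2*x1 - x2 + noise involves only x1 and x2.
    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = x1 + rng.normal(scale=0.05, size=n)
    y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

    def slopes(columns):
        # Ordinary least squares with an intercept; returns the slope estimates.
        A = np.column_stack([np.ones(n)] + columns)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.round(coef[1:], 2)

    # With the redundant x3 included, the x1 and x3 estimates are unstable and
    # change drastically when x3 is dropped (detection indicator 1); without x3
    # the estimates are close to the true values (2, -1).
    print("with x1, x2, x3:", slopes([x1, x2, x3]))
    print("with x1, x2 only:", slopes([x1, x2]))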

Note: multicollinearity is particularly bad for forecasts in which the new predictor values do not follow the same multicollinearity pattern as the estimation data.
