Stepwise regression
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure.[1][2][3] Usually this takes the form of a sequence of F-tests, but other techniques are possible, such as t-tests, adjusted R-squared, the Akaike information criterion, the Bayesian information criterion, Mallows' Cp, or the false discovery rate.
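For reference, two of the information criteria mentioned can be stated briefly (these are the standard definitions, not specific to stepwise regression): for a model with k estimated parameters, maximized likelihood \hat{L}, and n observations,

    \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L},

and in both cases smaller values indicate a preferred model.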
The main approaches are:
a) Forward selection, which involves starting with no variables in the model, testing the addition of each variable in turn and including it if it is 'statistically significant' (a minimal sketch of this procedure is given after the list).
b) Backward elimination, which involves starting with all candidate variables and testing them one by one for statistical significance, deleting any that are not significant.
c) Methods that are a combination of the above, testing at each stage for variables to be included or excluded.
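The following Python sketch illustrates forward selection. It is illustrative only: it assumes the candidate predictors are supplied as a pandas DataFrame, uses ordinary least squares from statsmodels, and adopts a p-value entry criterion, which is only one of the possible criteria listed above; the function and argument names are not standard.

import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    # Greedy forward selection by p-value (illustrative sketch only).
    # X: pandas DataFrame of candidate predictors; y: response;
    # alpha: significance level a variable must reach to enter the model.
    selected = []
    remaining = list(X.columns)
    while remaining:
        # Refit the model once per remaining candidate and record that
        # candidate's p-value in its fit.
        pvals = {}
        for candidate in remaining:
            design = sm.add_constant(X[selected + [candidate]])
            pvals[candidate] = sm.OLS(y, design).fit().pvalues[candidate]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # no remaining candidate is 'statistically significant'
        selected.append(best)
        remaining.remove(best)
    return selected

Backward elimination can be sketched analogously, starting from all candidate variables and repeatedly dropping the one with the largest p-value until every remaining variable meets the chosen threshold.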
A widely used algorithm was proposed by Efroymson (1960).[5] This is an automatic procedure for statistical model selection in cases where there is a large number of potential explanatory variables and no underlying theory on which to base the model selection. The procedure is used primarily in regression analysis, though the basic approach is applicable in many forms of model selection. It is a variation on forward selection: at each stage in the process, after a new variable is added, a test is made to check whether some variables can be deleted without appreciably increasing the residual sum of squares (RSS). The procedure terminates when the selection criterion is (locally) optimized, or when the available improvement falls below some critical value.
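The add/delete decisions in such a procedure are typically based on a partial F-statistic comparing the residual sums of squares of the models with and without the variable in question; for a single variable this is

    F = \frac{\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}}}{\mathrm{RSS}_{\text{full}} / (n - p)},

where n is the number of observations and p the number of parameters in the larger model, and the statistic is compared against "F-to-enter" and "F-to-remove" thresholds. The Python sketch below conveys the same forward-then-check-for-removal idea; it is illustrative only, uses p-values rather than the F-to-enter / F-to-remove statistics of Efroymson's algorithm, and its thresholds and names are assumptions rather than part of the original procedure.

import statsmodels.api as sm

def stepwise_selection(X, y, alpha_enter=0.05, alpha_remove=0.10):
    # Illustrative Efroymson-style sketch: forward selection with a check,
    # after each addition, for variables that can be dropped again.
    # Requiring alpha_remove > alpha_enter prevents the variable that was
    # just added from being removed in the same pass.
    selected = []
    for _ in range(2 * X.shape[1] + 1):  # safety cap on the number of passes
        changed = False
        # Forward step: add the most significant remaining candidate, if any.
        remaining = [c for c in X.columns if c not in selected]
        if remaining:
            pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                     for c in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_enter:
                selected.append(best)
                changed = True
        # Backward step: drop the least significant selected variable if it
        # no longer meets the removal threshold.
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            pvals = fit.pvalues.drop("const")
            if pvals.max() > alpha_remove:
                selected.remove(pvals.idxmax())
                changed = True
        if not changed:
            break
    return selected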
Stepwise regression procedures are used in data mining, but are controversial. Several points of criticism have been made.
1. A sequence of F-tests is often used to control the inclusion or exclusion of variables, but these are carried out on the same data and so there will be problems of multiple comparisons for which many correction criteria have been developed.
2. It is difficult to interpret the p-values associated with these tests, since each is conditional on the previous tests of inclusion and exclusion (see "dependent tests" in false discovery rate).
3. The tests themselves are biased, since they are based on the same data (Rencher and Pun, 1980; Copas, 1983).[6][7] Wilkinson and Dallal (1981)[8] computed percentage points of the multiple correlation coefficient by simulation and showed that a final regression obtained by forward selection, said by the F-procedure to be significant at 0.1%, was in fact only significant at 5%.
Critics regard the procedure as a paradigmatic example of data dredging, with intensive computation often an inadequate substitute for subject-area expertise.
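The size of the selection bias underlying these criticisms is easy to demonstrate by simulation. The Python sketch below (the sample size, number of candidates, and threshold are arbitrary illustrative choices, not values taken from the references above) generates predictors that are pure noise and records how often the smallest single-variable p-value, the quantity that drives the first step of forward selection, falls below the nominal 5% level; because it is the minimum of many tests computed on the same data, this happens far more often than 5% of the time.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_obs, n_candidates, n_sims, alpha = 50, 20, 500, 0.05

false_entries = 0
for _ in range(n_sims):
    X = rng.standard_normal((n_obs, n_candidates))  # predictors: pure noise
    y = rng.standard_normal(n_obs)                  # response unrelated to X
    # Smallest p-value among the single-variable models considered in the
    # first forward-selection step.
    best_p = min(
        sm.OLS(y, sm.add_constant(X[:, [j]], prepend=True)).fit().pvalues[1]
        for j in range(n_candidates)
    )
    false_entries += best_p < alpha

# With roughly independent candidates this fraction is near 1 - 0.95**20,
# i.e. around 0.64, rather than the nominal 0.05.
print("Fraction of simulations admitting a noise variable:",
      false_entries / n_sims)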
See also
- Backward regression
- Forward regression
- Logistic regression
- Occam's Razor
References
- ^ Hocking, R. R. (1976) "The Analysis and Selection of Variables in Linear Regression", Biometrics, 32.
- ^ Draper, N. and Smith, H. (1981) Applied Regression Analysis, 2nd Edition, New York: John Wiley & Sons, Inc.
- ^ SAS Institute Inc. (1989) SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 2, Cary, NC: SAS Institute Inc.
- ^ Box-Behnken designs from a handbook on engineering statistics at NIST
- ^ Efroymson, M. A. (1960) "Multiple regression analysis". In Ralston, A. and Wilf, H. S., editors, Mathematical Methods for Digital Computers. Wiley.
- ^ Rencher, A. C. and Pun, F. C. (1980) "Inflation of R² in Best Subset Regression", Technometrics, 22, 49-54.
- ^ Copas, J. B. (1983) "Regression, prediction and shrinkage", Journal of the Royal Statistical Society, Series B, 45, 311-354.
- ^ Wilkinson, L. and Dallal, G. E. (1981) "Tests of significance in forward selection regression with an F-to-enter stopping rule", Technometrics, 23, 377-380.