Coefficient of determination

From Wikipedia, the free encyclopedia

In statistics, the coefficient of determination R^2 is the proportion of variability in a data set that is accounted for by a statistical model. In this definition, the term "variability" stands for variance or, equivalently, sum of squares. There are several common expressions for R^2, which are equivalent for a least squares fit that includes an intercept. The version most common in statistics texts is based on an analysis of variance decomposition as follows:

{R^{2} = {SS_R \over SS_T} = {1-{SS_E \over SS_T}}}.

In the above definition,

SS_T=\sum_i (y_i-\bar{y})^2, SS_R=\sum_i (\hat{y}_i-\bar{y})^2, SS_E=\sum_i (y_i - \hat{y}_i)^2.

That is, SS_T is the total sum of squares, SS_R is the regression sum of squares, and SS_E is the sum of squared errors. In some texts, the abbreviations SSE and SSR have the opposite meaning: SSE stands for the explained sum of squares (another name for the regression sum of squares) and SSR stands for the residual sum of squares (another name for the sum of squared errors).
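The definitions above can be checked numerically. The following is a minimal sketch in plain Python, assuming a straight-line least squares fit with an intercept; the data values and the helper names `fit_line` and `sums_of_squares` are made up for illustration. For such a fit SS_T = SS_R + SS_E, which is why the two expressions for R^2 agree.

```python
def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sums_of_squares(y, y_hat):
    """Return (SS_T, SS_R, SS_E) for observations y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    ss_t = sum((yi - y_bar) ** 2 for yi in y)               # total
    ss_r = sum((fi - y_bar) ** 2 for fi in y_hat)           # regression
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # errors
    return ss_t, ss_r, ss_e


# Illustrative data (hypothetical, not taken from any real study).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]
ss_t, ss_r, ss_e = sums_of_squares(y, y_hat)

r2_from_regression = ss_r / ss_t   # SS_R / SS_T
r2_from_errors = 1 - ss_e / ss_t   # 1 - SS_E / SS_T
# Both are 0.6 for this data, up to floating-point rounding.
```

If the fitted values do not come from a least squares fit with an intercept, the two expressions can differ, so it is worth stating which one is being reported.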

R^2 is a statistic that gives information about the goodness of fit of a model. It has a drawback: R^2 never decreases as more variables are added to the model, so a common alternative is the adjusted R^2. This statistic is interpreted in much the same way as R^2, but it penalizes R^2 for the number of variables used in the model.

Explanation and interpretation of R^2

For expository purposes, consider a linear model of the form

{Y_i = \beta_0 + \sum_{j=1}^p {\beta_j X_{i,j}} + \varepsilon_i},

where Y_i is the response variable; \beta_0,\dots,\beta_p are unknown coefficients; X_1,\dots,X_p are the p regressors; and \varepsilon_i is a mean-zero error term. The coefficient of determination R^2 is a measure of the global fit of the model. Specifically, for a least squares fit that includes the intercept \beta_0, R^2 is an element of [0,1] and represents the proportion of variability in Y_i that may be attributed to some linear combination of the regressors (explanatory variables) in X.

More simply, R^2 is often interpreted as the proportion of response variation "explained" by the regressors in the model. Thus, R^2 = 1 indicates that the fitted model explains all variability in y, while R^2 = 0 indicates no linear relationship between the response variable and the regressors. An interior value such as R^2 = 0.7 may be interpreted as follows: "Approximately seventy percent of the variation in the response variable can be explained by the explanatory variable. The remaining thirty percent can be explained by unknown, lurking variables or inherent variability."

A caution that applies to R^2, as to other statistical descriptions of correlation and association, is that "correlation does not imply causation." In other words, while correlations may provide valuable clues regarding causal relationships among variables, a high correlation between two variables is not adequate evidence that changes in one variable cause, or are caused by, changes in the other.

In the case of a single regressor fitted by least squares with an intercept, R^2 is the square of the Pearson product-moment correlation coefficient relating the regressor and the response variable.
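This relationship can be verified numerically. The sketch below, in plain Python, assumes a least squares line with an intercept and uses made-up data; the function names `pearson_r` and `r_squared_simple` are illustrative and not part of any library.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared_simple(x, y):
    """R^2 = 1 - SS_E / SS_T for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ss_e = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_t = sum((b - y_bar) ** 2 for b in y)
    return 1 - ss_e / ss_t


# Illustrative data (hypothetical).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
r2 = r_squared_simple(x, y)
# r ** 2 and r2 agree up to floating-point rounding.
```

Note that the sign of r is lost when squaring, so R^2 alone does not say whether the relationship is increasing or decreasing.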

Inflation of R^2

In least squares regression, R^2 is weakly increasing in the number of regressors in the model. As such, R^2 alone cannot be used as a meaningful comparison of models with different numbers of covariates. As a reminder of this, some authors denote R^2 by R^2_p, where p is the number of columns in X.

Demonstration of this property is trivial. To begin, recall that the objective of least squares regression is (in matrix notation)

\min_b SS_E(b) = \min_b \sum_i (y_i - X_i b)^2,

where X_i denotes the i-th row of X. The optimal value of the objective is weakly smaller as additional columns of X are added: the smaller model is the larger model with the coefficients of the extra columns constrained to zero, and a less constrained minimization can never yield a larger minimum than a more constrained one. Given this conclusion, and noting that SS_T depends only on y, the non-decreasing property of R^2 follows directly from the definition R^2 = 1 - SS_E / SS_T.
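The argument can be illustrated with a small simulation. The sketch below is written in plain Python and makes several assumptions for illustration only: the data are randomly generated, the fit is obtained by solving the normal equations with a hand-written Gaussian elimination (adequate for tiny, well-conditioned examples, not a recommended numerical method), and the names `solve` and `r_squared` are hypothetical helpers.

```python
import random


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of the least squares fit of y on an intercept plus the columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)
    y_hat = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    y_bar = sum(y) / n
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_t = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_e / ss_t


random.seed(0)
n = 30
x1 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]  # generated independently of y
y = [2.0 + 1.5 * v + random.gauss(0, 1) for v in x1]

r2_small = r_squared([x1], y)          # one regressor
r2_large = r_squared([x1, noise], y)   # same regressor plus a noise column
# r2_large is never below r2_small, apart from floating-point rounding.
```

The noise column carries no information about y by construction, yet including it cannot lower R^2, which is the inflation described above.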

Adjusted R^2

Adjusted R^2 is a modification of R^2 that adjusts for the number of explanatory terms in a model. Unlike R^2, the adjusted R^2 increases only if the new term improves the model more than would be expected by chance. The adjusted R^2 can be negative, and will always be less than or equal to R^2. The adjusted R^2 is defined as

{1-(1-R^{2}){n-1 \over n-p-1}}

where p is the total number of regressors in the linear model (not counting the constant term), and n is the sample size.
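The formula translates directly into code. The following is a minimal sketch in plain Python; the function name `adjusted_r_squared` and the input values are hypothetical, and p is assumed to exclude the constant term, matching the n - p - 1 in the denominator.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2: ordinary R^2 of the fitted model
    n:  sample size
    p:  number of regressors, not counting the constant term
    """
    if n - p - 1 <= 0:
        raise ValueError("adjusted R^2 requires n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# A moderate fit on a small sample is pulled down noticeably.
adj_a = adjusted_r_squared(0.6, n=5, p=1)    # 1 - 0.4 * 4 / 3

# A weak fit with several regressors can go negative.
adj_b = adjusted_r_squared(0.1, n=10, p=3)   # 1 - 0.9 * 9 / 6, about -0.35
```

Because (n - 1) / (n - p - 1) is at least 1 whenever p is non-negative, the result never exceeds the R^2 passed in.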

Adjusted R^2 does not have the same interpretation as R^2. As such, care must be taken in interpreting and reporting this statistic. Adjusted R^2 is particularly useful in the feature selection stage of model building.

Adjusted R^2 is not always more useful than R^2: the adjustment matters only if R^2 is calculated from a sample rather than the entire population. For example, if our unit of analysis is a country and we have data for all countries, then adjusted R^2 will not yield any more useful information than R^2.

Notes on interpreting R^2

R^2 does not indicate whether:

  • the independent variables are a true cause of the changes in the dependent variable;
  • omitted-variable bias exists;
  • the correct regression was used;
  • the most appropriate set of independent variables has been chosen; or
  • collinearity is present in the data.
