Talk:Coefficient of determination
Link with correlation coefficient
It says that only in the case of a linear regression is the coefficient of determination equal to the square of the correlation coefficient. Should this not be: only in the case of a linear regression with a linear model? (A linear regression can also be performed with, e.g., a quadratic model, in which case the coefficient of determination is not equal to the square of the correlation coefficient.) Job. 193.191.138.240 (talk) 09:53, 9 April 2008 (UTC)
- "With a linear model" would be ambiguous at best considering that in a regression context a "linear model" refers to how the parameters relate to the predicted values, not the structure of the model with respect to changes in the explanatory variables. Anyway, the present version is almost clear and nearly correct ..."correlation coefficient between the original and modelled data values" is meant to mean the correlation between the observed and predicted values, not the correlation between the observed and individual explanatory variables. Unfortunatey previous changes have left some terminology a liitle vague. I will try to improve things. Melcombe (talk) 11:24, 9 April 2008 (UTC)
Definitions/Variables are not consistent here
This page is confusing because the variables are not consistent with the pages for Residual sum of squares and Explained sum of squares. On those pages, SSE is defined as ESS, and the same goes for SST and SSR. Also, isn't $SST = SSR + SSE$? I don't want to fix this myself because I'm just learning stats; maybe someone more experienced can? —Preceding unsigned comment added by 128.6.30.208 (talk) 03:22, 23 October 2007 (UTC)
But these pages may have problems of their own. For example, Residual sum of squares defines RSS for general values of the regression coefficients, not necessarily for the fitted coefficients, whereas Explained sum of squares assumes that fitted coefficients are used, as would the usage here. There is also the question of whether the pages are general enough to be interpretable for more than just a single explanatory variable. Melcombe 14:41, 23 October 2007 (UTC)
The basic problem is that there is no such thing as a consistent definition. It would simply not make sense to "fix" something, since both definitions (Residual/Explained vs. Regression/Error SS) are widely used. Actually, this is the first time that I have learned about a problem like this in math/stat, since usually definitions are precise and unique. But that's life. --Scherben 01:13, 26 October 2007 (UTC)
Yes, these notations are confusing! In my opinion, there are two kinds of notations we can use: notations with or without subscripts. That is: $SS_T$, where SST and TSS are the "Total Sum of Squares"; $SS_R$, where SSR and SSreg are the "Sum of Squares for Regression" or "Explained Sum of Squares"; and $SS_E$, where SSE and RSS are the "Error Sum of Squares", "Residual Sum of Squares" or "Unexplained Sum of Squares". Then we can define R squared as $R^2 = SS_R/SS_T = 1 - SS_E/SS_T$. I think the main problem on the original page is SSR: some textbooks name it the "Residual Sum of Squares" while others use it for the "Sum of Squares for Regression". They are actually different, so it would be better to use RSS and SSreg to distinguish these two concepts. Otherwise, the R in SSR may mean too many things. Lanyijie 12/28/07
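A minimal sketch of the decomposition behind those definitions (my own toy data; assumes NumPy): for an ordinary least-squares line with an intercept, the total sum of squares splits into the regression and residual parts, and the two forms of R^2 agree.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

slope, intercept = np.polyfit(x, y, 1)  # OLS line with an intercept
y_hat = slope * x + intercept

ss_tot = np.sum((y - y.mean()) ** 2)      # SS_T (SST, TSS)
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # SS_R (SSreg, ESS)
rss = np.sum((y - y_hat) ** 2)            # SS_E (SSE, RSS)

print(np.isclose(ss_tot, ss_reg + rss))   # decomposition holds for OLS with intercept
print(ss_reg / ss_tot, 1 - rss / ss_tot)  # the two R^2 forms coincide
```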
Adjusted R Square
There is a somewhat better explanation of this at http://www.csus.edu/indiv/j/jensena/mgmt105/adjustr2.htm . I think we can add to the definition: 1) the motivation for "Adjusted R Square", and 2) a note that it can be viewed as an index when comparing regression models (like the standard error).
Tal.
—The preceding unsigned comment was added by Talgalili (talk • contribs) 16:04, 21 February 2007 (UTC).
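For reference, the usual adjustment is $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$ for n observations and k regressors (excluding the intercept). A minimal sketch with made-up numbers, just to show the penalty at work:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k regressors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.80, n=20, k=3))   # 0.7625: mild penalty for few regressors
print(adjusted_r2(0.80, n=20, k=15))  # 0.05: heavy penalty for many regressors
```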
Causality
I thought that since R is based on the general linear model, you could infer causality from the model? You are really just doing an ANOVA with a continuous factor (X) as opposed to a categorical one.
>> No. R^2 has nothing at all to do with causality. Causality can only be implied by imposition of specific assumptions on the process being modeled. -- Guest.
>> Causality is a design issue not a statistical one. You need to measure the exposure before you see the outcome. If youre doing a cross sectional regression causality can never be infered. Only if it is a regression with the exposure measured at one time point and outcome at a time after measuring the exposure can you suggest*** (suggest is the operative work) that there is a causal relationship- SM
Need some help... does anyone know why the R2 in the Excel program is different from this meaning?
Range of R-squared
Who says R-squared should be greater than zero? For example, if measured y-values are between 9 and 10, and the model prediction is always zero, then R-squared is heavily negative. Kokkokanta 07:50, 28 January 2007 (UTC)
>> Go back and look at the definition. For one thing, all the sums are of squared differences. Moreover, SSE<=SST by construction. So R^2 is certainly non-negative. Adjusted R^2 *can* be negative, however. -- Guest.
>> No R-squared can be negative. This page does not necessarily relate to linear regression, or if it is meant to do so, it does not say this. You only have the conclusion SSE<=SST if "prediction=mean" is a special case of the model being fitted and only for certain ways of fitting the model ... for example you could choose to always fit the model by setting all the parameters to 99. You can still evaluate a value of R-squared in such cases. Less outlandish cases arise where the model fitted doesn't include an intercept term in the usual terminology. The "no intercept" case might warrant a specific mention on the page. Melcombe 14:41, 20 June 2007 (UTC)
Possible expansions
- Consider mentioning the Nagelkerke criterion, an analogue that can be used with generalized linear models, which are not fitted by ordinary least squares.
- We can't assume that R^2 is applicable with every kind of least-squares regression. For example, it doesn't make sense with regression through the origin. There has been a discussion of limitations in The American Statistician.
- Adjusted R^2 can be negative.
Dfarrar 14:04, 8 March 2007 (UTC)
Nagelkerke's pseudo-R^2 really doesn't belong in this article IMHO. It deserves a separate page, perhaps along with other pseudo-R^2 measures. The point is well-made about regression through the origin, but redefinition of R^2 is trivial in this context. Perhaps that should be mentioned.
---Guest
R squared can be negative if you remove the intercept from the equation, as the sketch below shows.
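A small illustration (my own toy data; assumes NumPy): fit a line through the origin by least squares, then compute R^2 with the usual mean-centred total sum of squares.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([11.0, 10.0, 9.0])

b = np.sum(x * y) / np.sum(x**2)   # least-squares slope through the origin
y_hat = b * x

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)         # about -29.9: strongly negative
```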
Causality
R^2 is only one measure of association. The causality issue applies to all of them. The issue has been addressed generically. See links inserted.
Dfarrar 14:29, 8 March 2007 (UTC)
Inflation of R-square
This has been a good day for additions to my watched pages. Regarding this new material, I think some terms could be explained to make the article more widely accessible, without doing much harm, e.g. "weakly smaller". Repeating a previous point, I suggest inclusion of material on analogous statistics applicable with models other than Gaussian, e.g., with generalized linear models. Dfarrar 22:25, 20 March 2007 (UTC)
R squared formula
I changed the formula to what I believe to be the correct one, but it has been reverted. My source is Essentials of Econometrics by Damodar Gujarati. Can whoever changed it please cite their source for this? Cheers.
- I can see why you're confused. In your book though, "E" likely stands for "Explained" and "R" likely stands for "Residuals." In the equation on this page, "R" stands for "Regression" (or "Explained") and "E" stands for "Error" (or "Residuals"). Gujarati's Basic Econometrics uses "Explained" and "Residuals" as well, so the lettering is exactly the opposite. VivekVish 03:58, 18 April 2007 (UTC)
- Ah I see, thanks for clearing that up.
- An edit by someone yesterday (with a history of bad edits on other pages) screwed up this section again. It is now fixed again. The text on alternative meanings of E and R is very helpful, and hopefully will prevent these problems in the future. 152.3.58.200 16:44, 7 June 2007 (UTC)
Is there not a mistake in the definition of $SS_{reg} = \sum_i (f_i - \bar{y})^2$? I think $\bar{f}$ should be there.
Changed to this form, but there is equivalence, since under the conditions in which this form of R2 is used, the means are the same. Melcombe (talk) 16:53, 11 February 2008 (UTC)
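That equivalence is easy to verify numerically (illustrative data of my own; assumes NumPy): for an ordinary least-squares fit with an intercept, the residuals sum to zero, so the mean of the fitted values equals the mean of the observed values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1])

slope, intercept = np.polyfit(x, y, 1)  # OLS with an intercept
f = slope * x + intercept

print(y.mean(), f.mean())               # identical means
print(np.isclose(np.sum(y - f), 0.0))   # residuals sum to zero
```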
Adj R2
This bit seems wrong to me: "adjusted R2 will be more useful only if the R2 is calculated based on a sample, not the entire population. For example, if our unit of analysis is a state, and we have data for all counties, then adjusted R2 will not yield any more useful information than R2."
It is not clear why this would be. Even if you had the population, you would still be concerned about exhausting degrees of freedom. You would thus want to penalize any calculation of the R2 for the number of regressors. If you have the population of U.S. states (N=50) and you have a model with k=50, you will perfectly predict and get an R2 of one. But this is misleading. The adjustment is meant to account for degrees of freedom, not estimation error.
Still A Student 03:06, 9 September 2007 (UTC)
I agree with this comment. The para should be removed. It might be relevant to add something along the lines of: "If there is an unlimited number of linearly independent candidate regressors, both R2 and adjusted R2 become unreliable as the number of regressors increases: R2 tends towards one, while adjusted R2 becomes more variable". Also, perhaps there need to be some pointers to related statistics such as Mallows Cp. Melcombe 09:02, 18 September 2007 (UTC)
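Both points can be demonstrated with pure-noise regressors (a sketch on made-up data; assumes NumPy; note adjusted R2 is undefined once k + 1 = n): R2 climbs towards one as regressors are added, even though none of them carry any information about the response.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
y = rng.normal(size=n)  # response unrelated to any regressor

for k in (5, 15, 25):
    # intercept column plus k pure-noise regressors
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    print(k, round(r2, 3), round(adj_r2, 3))  # r2 rises with k; adj_r2 need not
```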
R-squared bigger than 1?
A simple question, but I can't figure it out:
Why is r squared for y=(1 3 5) and y_est=(2 7 3) bigger than 1? It must be between 0 and 1. SSR=17, SST=8. —Preceding unsigned comment added by 85.107.12.120 (talk) 13:21, 20 September 2007 (UTC)
This happened because the expression used assumes that the fitted values are obtained by regression on the observed values, and your values don't have the features that would occur if y_est had been obtained by regression. I have revised the main text. Melcombe 14:23, 10 October 2007 (UTC)
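Working through the questioner's numbers shows exactly what breaks when y_est does not come from a regression on y (plain NumPy, using the values from the question):

```python
import numpy as np

y = np.array([1.0, 3.0, 5.0])
y_est = np.array([2.0, 7.0, 3.0])  # not obtained by regressing y on anything

ss_tot = np.sum((y - y.mean()) ** 2)      # 8
ss_reg = np.sum((y_est - y.mean()) ** 2)  # 17
ss_res = np.sum((y - y_est) ** 2)         # 21

print(ss_reg / ss_tot)      # 2.125: "R^2" above 1 from the SSR/SST form
print(1 - ss_res / ss_tot)  # -1.625: negative from the 1 - SSE/SST form
print(np.isclose(ss_tot, ss_reg + ss_res))  # False: the decomposition fails
```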
What is $\bar{y}$?
What is $\bar{y}$? Is it the mean? Can someone put that in the text? --Play
Have included a first version. Melcombe (talk) 12:12, 6 December 2007 (UTC)
Added it again, as it was removed by someone. Melcombe (talk) 16:36, 11 February 2008 (UTC)
Causality Again (was at top)
I believe that R-squared is a measure of variability aligned rather than variability accounted for. With respect to "correlation is not causation", consider R-squared as variability "aligned" rather than "accounted for". For example, if the number of churches in cities is correlated with the number of bars in cities at, say, .9, then R-squared is .81. Rather than the number of bars accounting for the number of churches, consider that 81% of the variability of their related numbers is aligned. (Their variability alignment is most likely "accounted for" by population.) Respectfully submitted, Gary Greer greerg@uhd.edu January 26, 2008. —Preceding unsigned comment added by 75.16.159.122 (talk) 01:34, 27 January 2008 (UTC)
"Accounted for" is standard terminology. R-squared is used in connection with a model of the user's choice, where the user chooses which variables to use in constructing the model's predicted values. There is no implication of causality ... the idea is to find the best predictor of the dependent variable that can be constructed from the chosen predictors. From one point of view the idea is to explain as much of the variation in the dependent variable (variation of the value from case to case) as possible using the selected variables, and hence the task can be phrased as attempting to account for as much variation as poosible. Similarly, adding an additional independent variable can be thought of as seeking to account for more variation.Melcombe (talk) 16:53, 11 February 2008 (UTC)
Undid change from R2 to r2
I undid a set of changes that tried to change the notation from R2 to r2, because:
- I think R2 is the most commonly used notation
- The notation was not changed everywhere, specifically in displayed maths and section titles and possibly elsewhere, so that the result as left was very poor.