Talk:Regression toward the mean


This article is within the scope of WikiProject Statistics, which collaborates to improve Wikipedia's coverage of statistics. If you would like to participate, please visit the project page.

WikiProject Mathematics
This article is within the scope of WikiProject Mathematics, which collaborates on articles related to mathematics.
Mathematics rating: Start Class Mid Priority  Field: Probability and statistics
Please update this rating as the article progresses, or if the rating is inaccurate. Please also add comments to suggest improvements to the article.


This topic (the article + this discussion) reads like a mad grad school breakdown. If you please, would someone who has a good grasp of Regression Toward the Mean write an explanation based on ONE COGENT EXAMPLE that shows unambiguous data, processing steps, and results? The audience is dying to know what regression means to them. What is needed is an actual dataset and walkthrough to illustrate the concept. You know, narrate Galton's height experiment; that would be wildly appropriate. Think of your readers as high schoolers stuck with a snotty textbook who want some mentoring on this subject AT THEIR LEVEL. They'll get a kick out of it if you can make it mean something to them; otherwise they'll drop out and live in shipping containers with Teener-Kibble for sustenance. This is, after all, a topic that only first-year stats students should still be grappling with, yes? And of course it is Wikipedia. --24.113.89.98 05:24, 23 January 2007 (UTC) qwiki@edwordsmith.com



I'm not sure this page explains "regression to the mean" very well.

I agree; it's lousy. Michael Hardy 23:26, 2 Feb 2004 (UTC)
The first time I read it, I thought it was lousy. The second time I read it, it was closer to mediocre.

F. Galton's use of the terms "reversion" and "regression" described a certain, specific biological phenomenon, and it is connected with the stability of an autoregressive process: if there is not regression to the mean, the variance of the process increases over time. There is no reason to think that the same or a similar phenomenon occurs in, say, scores of students, and appealing to a general "principle of regression to the mean" is unwarranted.

I completely disagree with this one; there is indeed such a general principle. Michael Hardy 23:26, 2 Feb 2004 (UTC)

I guess I could be convinced of the existence of such a principle, but something more than anecdotes is needed to establish that.

Absolutely. A rationale needs to be given. Michael Hardy 23:26, 2 Feb 2004 (UTC)

Regression to the mean is just like normality of natural populations: maybe it's there, maybe it isn't; the only way to tell is to study a lot of examples.

No; it's not just empirical; there is a perfectly good rationale.

I'll revise this page in a week or two if I don't hear otherwise; the page should summarize Galton's findings,

I don't think regression toward the mean should be taken to mean only what Galton wrote about; it's far more general. I'm really surprised that someone who's edited a lot of statistics articles here does not know that there is a reason why regression toward the mean is widespread, and what the reason is. I'll return to this article within a few days. Michael Hardy 23:26, 2 Feb 2004 (UTC)

connect the biological phenomenon with autoregressive stability, and mention other (substantiated) examples. Wile E. Heresiarch 15:00, 2 Feb 2004 (UTC)
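
The link between regression and autoregressive stability raised in this thread can be made concrete with a first-order autoregressive model. What follows is only a sketch under an assumed model; the symbols $\phi$, $\varepsilon_t$ and $\sigma^2$ are introduced here for illustration and are not taken from Galton or from the comments above.

```latex
% Assumed model: zero-mean AR(1), noise independent of the past with variance \sigma^2
X_{t+1} = \phi X_t + \varepsilon_{t+1}, \qquad
\operatorname{E}[X_{t+1} \mid X_t] = \phi X_t

% Variance recursion implied by the model
\operatorname{Var}(X_{t+1}) = \phi^2 \operatorname{Var}(X_t) + \sigma^2

% Case |\phi| < 1: the conditional mean is closer to 0 than X_t,
% and the variance settles at a finite stationary value
\operatorname{Var}(X_t) = \frac{\sigma^2}{1 - \phi^2}

% Case \phi = 1: no regression toward the mean, and the variance grows without bound
\operatorname{Var}(X_t) = \operatorname{Var}(X_0) + t\,\sigma^2
```

Under these assumptions, "no regression to the mean" corresponds to $\phi = 1$, which is exactly the case in which the variance increases over time.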


In response to Michael Hardy's comments above --

  1. Perhaps I overstated the case. Yes, there is a class of distributions which show regression to the mean. (I'm not sure how big it is, but it includes the normal distribution, which counts for a lot!) However, if I'm not mistaken there are examples that don't, and these are by no means exotic.
  2. There is a terminology problem here -- it's not right to speak of a "principle of r.t.t.m." as the article does, since r.t.t.m. is a demonstrated property (i.e., a theorem) of certain distributions. "Principle" suggests that it is extra-mathematical, as in "likelihood principle". Maybe we can just drop "principle".
  3. I had just come over from the Galton page, & so that's why I had Galton impressed on my mind; this article should mention him but need not focus on his concept of regression, as pointed out above.

regards & happy editing, Wile E. Heresiarch 22:57, 3 Feb 2004 (UTC)

It's nothing to do with Normality - it applies to all distributions.

Johnbibby 22:11, 12 December 2006 (UTC)

--

The opening sentence "of related measurements, the second is expected to be closer to the mean than the first" is obviously wrong. Jdannan 08:17, 15 December 2005 (UTC)


Small change to the historical background note.

Principle of Regression

I agree that the "principle" cannot hold for all distributions, but only a certain class of them, which includes the normal distributions. I think R. A. Fisher found an extension to the case where the conditional distribution is Gaussian but the joint distribution need not be. In any case, in the section on "Mathematical Derivation", it should be made clear that the specific *linear* regression form E[Y|X]=rX is valid only when Y and X are jointly Gaussian. Of course there are some other examples such as when Y and X are jointly stable but that is another can of worms. The overall question might be rephrased: given two random variables X and Y of 0 mean and the same variance, for what distributions is |E[Y|X]| < |X| almost surely?

I will make some small edits to the "mathematical derivation" section.
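
As a rough numerical check of the jointly Gaussian case discussed above, the following sketch simulates standardized pairs with correlation r and compares the least-squares slope and a conditional average with rX. The function names, sample size, seed and threshold are illustrative assumptions, and the simulation says nothing about non-Gaussian distributions.

```python
import random


def simulate_pairs(r, n, seed=0):
    """Draw n standardized (x, y) pairs that are jointly Gaussian with correlation r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # y has mean 0, variance 1, and correlation r with x
        y = r * x + (1.0 - r * r) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def least_squares_slope(pairs):
    """Slope of the least-squares line of y on x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def means_where_x_exceeds(pairs, threshold):
    """Average x and average y among the pairs whose x is above the threshold."""
    chosen = [(x, y) for x, y in pairs if x > threshold]
    mean_x = sum(x for x, _ in chosen) / len(chosen)
    mean_y = sum(y for _, y in chosen) / len(chosen)
    return mean_x, mean_y


if __name__ == "__main__":
    r = 0.5
    pairs = simulate_pairs(r, 100_000)
    print("fitted slope:", least_squares_slope(pairs))
    mean_x, mean_y = means_where_x_exceeds(pairs, 1.0)
    print("mean x, mean y, r * mean x:", mean_x, mean_y, r * mean_x)
```

In this setup the fitted slope should land near r, and the average Y among high-X pairs should sit near r times the average X, that is, closer to zero than X itself.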

Intelligence

Linda Gottfredson points out that 40% of mothers having IQ of 75 or less also have children whose IQ is under 75 - as opposed to 7% of normal or bright mothers.

Fortunately, because of regression to the mean, their children will tend to be brighter than they are, but 4 in 10 still have IQs below 75. (Why g matters, page 40)

What do we know about IQ or g and regression toward the mean? Elabro 18:55, 5 December 2005 (UTC)

Your question seems to contain its own answer. Taking everything at face value, and brushing aside all the arguments (whether g exists, whether it means anything, whether Spearman's methodology was sound, whether imprecise measurements of g should be used to make decisions about people's lives, etc.) what the numbers you cite mean is simply that IQ measurements are mixtures of something that is inherited and something that is not inherited.
Intelligence, as measured by IQ score, is just about 50% heritable.
In this case, regression doesn't have to do with the child; it has to do with the mother. The lower the mother's IQ measurement, the further away from the mean it is. The further away from the mean it is, the more likely that this was not the result of something inherited but of some other factor, one which won't be passed on to the child, who will therefore be expected to have higher intelligence than the mother.
This isn't obvious at first glance but it is just plain statistics. Our article on regression doesn't have any diagrams, and one is needed here. Dpbsmith (talk) 20:26, 5 December 2005 (UTC)
Thanks for explaining that. It's clear to me now, and I hope we can also make it clear to the reader.
By the way, I'm studying "inheritance" and "heritage" and looking for factors (such as genes) that one cannot control, as well as factors (such as parenting techniques, choice of neighborhood and school) that one can control - and how these factors affect the academic achievement of children. This is because I'm interested in Educational reform, a topic that Wikipedia has long neglected. Elabro 22:10, 5 December 2005 (UTC)
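
The "mixture of something inherited and something not inherited" argument above can be illustrated with a toy simulation. This is a sketch under loud assumptions, not a genetic model and not a reproduction of Gottfredson's figures: each score is a shared component plus an independent individual component, `shared_fraction` is simply set to the 50% figure quoted above, and the mean of 100 and standard deviation of 15 are the usual IQ scaling. All names are hypothetical.

```python
import random
from statistics import fmean

MEAN, SD = 100.0, 15.0  # conventional IQ scaling (assumed here)


def simulate_families(n, shared_fraction=0.5, seed=0):
    """Return n (mother_score, child_score) pairs from the toy model."""
    rng = random.Random(seed)
    shared_sd = SD * shared_fraction ** 0.5        # part common to mother and child
    own_sd = SD * (1.0 - shared_fraction) ** 0.5   # part that is not passed on
    families = []
    for _ in range(n):
        shared = rng.gauss(0.0, shared_sd)
        mother = MEAN + shared + rng.gauss(0.0, own_sd)
        child = MEAN + shared + rng.gauss(0.0, own_sd)
        families.append((mother, child))
    return families


def low_group_means(families, cutoff=75.0):
    """Average mother and child scores among families whose mother scored <= cutoff."""
    low = [(m, c) for m, c in families if m <= cutoff]
    return fmean([m for m, _ in low]), fmean([c for _, c in low])


if __name__ == "__main__":
    mother_mean, child_mean = low_group_means(simulate_families(100_000))
    print("mean score of selected mothers:", mother_mean)
    print("mean score of their children:  ", child_mean)
```

In this toy model the children of low-scoring mothers average higher than their mothers but still below 100: the selected mothers are partly low on the shared component and partly unlucky on the component that is not passed on.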

Massachusetts test scores

HenryGB has twice removed a reference supporting the paragraph that gives MCAS "improvement" scores as a good example of the regression fallacy. He cites http://groups.google.com/group/sci.stat.edu/tree/browse_frm/thread/c1086922ef405246/60bb528144835a38?rnum=21&hl=en&_done=%2Fgroup%2Fsci.sta which I haven't had a chance to review. At the very least, it is extremely inappropriate to remove the reference supporting a statement without also removing the statement.

We need to decide whether this is a clear case of something that is not regression, in which case it doesn't belong in the article; or whether it's the usual case of a somewhat murky situation involving real-world data that isn't statistically pure, in a politically charged area, where different factions put a different spin on the data. If it's the latter, then it should go back with qualifying statements showing that not everyone agrees this is an actual example of regression. As I say, I haven't read his reference yet, so I don't know yet which I think. I gotta say that when I saw the headlines in the Globe about how shocked parents in wealthy towns were that their schools had scored much lower than some troubled urban schools on these "improvement" scores, the first thing that went through my mind was "regression." Dpbsmith (talk) 12:04, 31 March 2006 (UTC)

Poorly written

The introduction is poorly written and fairly confusing.


"SAT"

Would be better with an example that means something to those of us reading outside the USA. --Newshound 16:08, 5 March 2007 (UTC)

Sports info out of date

The trick for sports executives, then, is to determine whether or not a player's play in the previous season was indeed an outlier, or if the player has established a new level of play. However, this is not easy. Melvin Mora of the Baltimore Orioles put up a season in 2003, at age 31, that was so far away from his performance in prior seasons that analysts assumed it had to be an outlier... but in 2004, Mora was even better. Mora, then, had truly established a new level of production, though he will likely regress to his more reasonable 2003 numbers in 2005.

It's now 2007, but I don't know enough about baseball to comment on Mora's performance in 2005 or afterward. I also don't know how to tag this statement as out of date without using an "as of 2004" or "as of 2005" tag (I'm not sure how one could be worked in). Can anybody help? - furrykef (Talk at me) 08:42, 4 April 2007 (UTC)

I have great difficulty understanding this article. Everything, including the math, is just a mess. It is quite remarkable that I have never heard of the phenomenon "regression to the mean", and it seems that its usage is restricted to certain fields, such as the medical and social sciences.

My guess is that there are two phenomena: a) the biological property related to growth, first observed in the 19th century, and b) an obvious matter. Let me explain b), the obvious matter. I have a die with possible outcomes {1, ..., 6}. Assume I threw a 6. Then the next time I throw that die, it is very likely that the outcome will be less than 6 (since there is no 7!). If one calls that 'regression to the mean', the expression is more complicated than the fact itself. Can anybody comment? Sabbah67 13:54, 13 August 2007 (UTC)
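
The die example can be worked out exactly. The sketch below assumes a fair die and independent throws, which is the zero-correlation special case of the phenomenon; the function names are illustrative.

```python
from fractions import Fraction

FACES = range(1, 7)                      # a fair six-sided die
MEAN = Fraction(sum(FACES), len(FACES))  # 7/2


def prob_next_lower(first):
    """Probability that an independent second throw is strictly lower than the first."""
    return Fraction(sum(1 for face in FACES if face < first), len(FACES))


def prob_next_higher(first):
    """Probability that an independent second throw is strictly higher than the first."""
    return Fraction(sum(1 for face in FACES if face > first), len(FACES))


if __name__ == "__main__":
    print("mean of one throw:", MEAN)
    for first in FACES:
        print(first, prob_next_lower(first), prob_next_higher(first))
```

After a 6 the chance of a lower throw is 5/6, as the comment says. The pull is not only a boundary effect, though: after a 4, which is just above the mean of 3.5, a lower throw (probability 1/2) is still more likely than a higher one (probability 1/3).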


"History"

I think the history section is quite good except I think the history of the regression line is a bit off topic. Only if more detail were included (such as a discussion of the implications of the fact that the regression line had a slope <1) would the typical reader see the relevance. My opinion is that the regression line discussion be deleted but I don't feel strongly enough about it to do so myself. —Preceding unsigned comment added by 128.42.98.167 (talk) 19:09, 25 September 2007 (UTC)

Defeating regression by establishing variance

Right, so I'm measuring quantity X over a population, and looking for an effect of applying treatment A.

If I measure X for all individuals, apply A to the lowest-scoring half, and measure again, I'll see an apparent increase because of RTM, right?

If I apply A to half the population at random, or to a stratified sample, can I expect to not see RTM?

Now, my real question, I guess: if I measure X ten times over the course of a year, then apply A to the lowest-scoring half, then measure X ten more times over the next year, then calculate the mean and variance / standard deviation / standard error of the mean for each individual, and look for improvements by t-testing, would I see an effect of RTM?

If I understand it right, RTM works because the value of X is some kind of underlying true value, plus an error term. If I pick the lowest values of X, I get not only individuals who genuinely have a low true X, but also those with a middling X who happened to have a negative error term when I measured X. Assuming the error term is random, doesn't that mean that taking multiple measurements and working out the envelope of variance allows me to defeat RTM?

-- Tom Anderson 2008-02-18 1207 +0000 —Preceding unsigned comment added by 62.56.86.107 (talk) 12:07, 18 February 2008 (UTC)
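
The "true value plus error term" picture in this question can be simulated directly. The sketch below is a minimal illustration under stated assumptions: true values and errors are independent Gaussians with standard deviation 1, treatment A has no effect at all, and each individual's score before and after is the average of `n_measurements` readings. The function and parameter names are hypothetical, and the t-testing step from the question is not modelled.

```python
import random
from statistics import fmean


def apparent_gain(n_people, n_measurements, pick="lowest", seed=0):
    """Mean (after - before) in the chosen half when the treatment does nothing."""
    rng = random.Random(seed)

    def averaged_reading(true_value):
        # Average of several noisy readings of the same underlying true value
        return fmean([true_value + rng.gauss(0.0, 1.0) for _ in range(n_measurements)])

    people = []
    for _ in range(n_people):
        true_value = rng.gauss(0.0, 1.0)
        before = averaged_reading(true_value)
        after = averaged_reading(true_value)  # no treatment effect is added
        people.append((before, after))

    if pick == "lowest":
        people.sort(key=lambda pair: pair[0])  # select on the observed baseline
    else:
        rng.shuffle(people)                    # select a random half instead
    chosen = people[: n_people // 2]
    return fmean([after - before for before, after in chosen])


if __name__ == "__main__":
    print("lowest half, 1 reading:  ", apparent_gain(20_000, 1))
    print("lowest half, 10 readings:", apparent_gain(20_000, 10))
    print("random half, 1 reading:  ", apparent_gain(20_000, 1, pick="random"))
```

Under these assumptions, the lowest-scoring half shows a clear spurious gain with single readings, a much smaller but still positive gain with ten readings averaged, and essentially none when the half is chosen at random. Averaging shrinks the error term rather than removing it, so repeated measurement reduces RTM without eliminating it.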