Data-snooping bias
From Wikipedia, the free encyclopedia
In statistics, data-snooping bias is a form of statistical bias generated by the misuse of data mining techniques, which can lead to spurious results in scientific research. Although data-snooping bias can occur in any field that uses data mining, it is a particular concern in finance and medical research, both of which make heavy use of data mining techniques.
In the process of data mining, huge numbers of hypotheses about a single data set can be tested in a very short time, by exhaustively searching for combinations of variables that might show a correlation.
Because conventional tests of statistical significance are based on the probability that an observation arose by chance, it is reasonable to expect that about 5% of hypotheses with no real underlying effect will turn out to be significant at the 5% level, about 0.1% will turn out to be significant at the 0.1% significance level, and so on, simply by chance.
Thus, given enough hypotheses tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all. Researchers who are using data mining techniques can be easily misled by these apparently significant results, even though they are merely chance artifacts.
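The effect described above can be illustrated with a small simulation. The following is a minimal sketch, not a reference implementation: it assumes each hypothesis is an independent two-sided z-test on pure Gaussian noise with known unit variance, which is a simplification, since hypotheses mined from a single data set are usually correlated. The function names and parameter values are illustrative choices. The last function uses the standard result that, for m independent tests at level alpha with no real effects, the chance of at least one apparently significant result is 1 - (1 - alpha)^m.

```python
import math
import random


def p_value_mean_zero(sample):
    """Two-sided z-test p-value for H0: mean == 0, assuming known unit variance."""
    # z = sample mean * sqrt(n) = sum / sqrt(n) when the variance is 1.
    z = sum(sample) / math.sqrt(len(sample))
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(num_hypotheses, sample_size, alpha, rng):
    """Test many hypotheses on pure noise and count how many look significant."""
    hits = 0
    for _ in range(num_hypotheses):
        # Every sample is noise, so any "significant" result is a chance artifact.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        if p_value_mean_zero(sample) < alpha:
            hits += 1
    return hits


def prob_at_least_one(num_hypotheses, alpha):
    """Chance that at least one of m independent no-effect tests is significant."""
    return 1.0 - (1.0 - alpha) ** num_hypotheses


rng = random.Random(0)  # fixed seed so the run is repeatable
num_hypotheses, alpha = 2000, 0.05
hits = count_false_positives(num_hypotheses, 50, alpha, rng)
print(f"{hits} of {num_hypotheses} noise-only hypotheses significant at {alpha}")
print(f"P(at least one false positive in 100 tests) = {prob_at_least_one(100, alpha):.3f}")
```

Under these assumptions roughly 5% of the noise-only hypotheses should pass the 5% threshold, and with 100 independent tests the chance of at least one false positive is already above 99%.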
One way to think about data snooping is as the attitude to data analysis of "I don't care what my hypothesis turns out to be." Examining the data is then reduced to formulating a class of hypotheses so broad that at least one is bound to be true for that data. Where the data set cannot be checked against a separately collected one, this practice makes it difficult to recognize that the hypothesis so produced is spurious.
For example, in a list of 367 people, at least two are guaranteed to share a birthday, since there are only 366 possible birthdays. Let's say that Mary Jane and John Smith were both born on August 7th. A data snooper would seek to find something special about the two: perhaps they are the youngest and the oldest; perhaps they are the only two who have met exactly once before, exactly twice before, or exactly three times before; perhaps they are the only two whose fathers share a first name, or whose mothers do; and so on. By going through hundreds, or perhaps thousands, of potentially interesting hypotheses, each of which has a low probability of being true, we can eventually find one that is.
Let's say that for this group it turns out that John and Mary are the only two who switched minors three times in college, a fact we found by exhaustively comparing their life histories. Our hypothesis can then become "Being born on August 7th results in a much higher than average chance of switching minors more than twice in college"! Indeed, the data appear to support that correlation very strongly: not one of the people with a different birthday switched minors three times in college, whereas both of the people with an August 7th birthday did.
Turning to the general population, we attempt to reproduce the result by selecting people with August 7th birthdays, and find no such correlation. Why? Because in this example we have become victims of data snooping: the hypothesis was chosen only because some obscure fact happened to be true for that particular data set.
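The birthday example can be sketched as a simulation. This is a rough sketch under stated assumptions, not a model of real populations: birthdays are drawn uniformly from 366 days, and each candidate "obscure fact" is modeled as a hypothetical attribute held by exactly two randomly chosen people in the list. The function names, the seed, and the search budget are all illustrative. The search simply examines random attributes until one happens to single out the two people who share a birthday.

```python
import random


def find_birthday_pair(birthdays):
    """Return indices of the first two people found sharing a birthday, or None."""
    seen = {}
    for person, day in enumerate(birthdays):
        if day in seen:
            return seen[day], person
        seen[day] = person
    return None


def snoop_for_attribute(pair, num_people, max_attributes, rng):
    """Examine random rare attributes until one is held by exactly the given pair.

    Each hypothetical attribute is held by exactly two randomly chosen people.
    Returns how many attributes were examined, or None if the budget ran out.
    """
    target = frozenset(pair)
    for tried in range(1, max_attributes + 1):
        holders = frozenset(rng.sample(range(num_people), 2))
        if holders == target:
            return tried
    return None


rng = random.Random(7)  # fixed seed so the run is repeatable
num_people = 367  # one more than the 366 possible birthdays, so a match is certain
birthdays = [rng.randrange(366) for _ in range(num_people)]
pair = find_birthday_pair(birthdays)
tried = snoop_for_attribute(pair, num_people, 2_000_000, rng)
print(f"People {pair} share a birthday.")
print(f"A 'special' shared attribute turned up after examining {tried} attributes.")
```

Because the attributes are generated independently of the birthdays, any attribute found this way has no connection to the shared birthday, and it would not be expected to reappear in a fresh sample of people born on the same day.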