Data dredging
From Wikipedia, the free encyclopedia
Data dredging or data fishing is the inappropriate (sometimes deliberately so) search for 'statistically significant' relationships in large quantities of data. This activity was formerly known in the statistical community as data mining, but that term is now in widespread use with an essentially positive meaning, so the pejorative term data dredging is now used instead.
Conventional statistical procedure is to formulate a research hypothesis (such as 'people in higher social classes live longer'), then collect relevant data, then carry out a statistical significance test to see whether the results could be due to chance.
A key point is that every hypothesis must be tested with evidence that was not used in constructing the hypothesis. This is because every data set contains some chance patterns which are not present in the population under study, and which would disappear with a sufficiently large sample size. If the hypothesis is not tested on a different data set from the same population, it is likely that the patterns found are chance patterns.
As a simple example, throwing five coins first and observing 2 heads and 3 tails might lead one to ask why the coin favors tails by fifty percent. Forming the hypothesis first, by contrast, would lead one to conclude that only a 5-0 or 0-5 result would be very surprising, since the odds are 93.75% against this happening by chance (the probability is 2/32 = 6.25%). In the latter case, it becomes obvious that the 2-3 result is not anomalous.
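The arithmetic behind the coin example can be checked directly. The following is a minimal sketch that assumes fair, independent coins; the function names are illustrative and not part of any standard terminology.

```python
from math import comb

def prob_heads_count(n_coins: int, heads: int) -> float:
    """Probability of exactly `heads` heads when throwing n fair coins."""
    return comb(n_coins, heads) / 2 ** n_coins

def prob_all_same(n_coins: int) -> float:
    """Probability that n fair coins all land the same way (all heads or all tails)."""
    return prob_heads_count(n_coins, 0) + prob_heads_count(n_coins, n_coins)

p_extreme = prob_all_same(5)                                   # 2/32 = 0.0625
p_against = 1 - p_extreme                                      # 0.9375
p_near_even = prob_heads_count(5, 2) + prob_heads_count(5, 3)  # 20/32 = 0.625

print(f"P(5-0 or 0-5) = {p_extreme:.4f}")
print(f"P(not 5-0 or 0-5) = {p_against:.4f}")
print(f"P(2-3 or 3-2) = {p_near_even:.4f}")
```

Under these assumptions a 2-3 or 3-2 split is the most common outcome of five throws, so observing one gives no reason to suspect the coin.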
To construct hypotheses while avoiding the problems of data dredging, collect a data set, then randomly partition it into two subsets, A and B. Only one subset, say subset A, is examined when creating hypotheses. Once a hypothesis has been formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where such a hypothesis is also supported by B is it reasonable to believe that the hypothesis might be valid.
In meteorology, A is often weather data up to the present, which ensures that subset B, the weather still to come, could not influence the formulation of the hypothesis, even subconsciously. Of course, such a discipline necessitates waiting for new data to come in to show the formulated theory's predictive power versus the null hypothesis. This ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.
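The partition into subsets A and B can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are pure Gaussian noise, so no real relationship exists, and the sizes (200 rows, 50 candidate predictors), the seed, and all function names are hypothetical choices made for illustration. The "hypothesis" is formed by dredging subset A for the predictor most strongly correlated with the outcome, and is then checked on subset B.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def split_indices(n, rng):
    """Randomly partition row indices 0..n-1 into two halves, A and B."""
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]

def dredge_then_confirm(n_rows=200, n_predictors=50, seed=0):
    rng = random.Random(seed)
    # Pure noise: the outcome is unrelated to every predictor by construction.
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    predictors = [[rng.gauss(0, 1) for _ in range(n_rows)]
                  for _ in range(n_predictors)]
    a_idx, b_idx = split_indices(n_rows, rng)

    def corr_on(idx, j):
        return pearson([predictors[j][i] for i in idx],
                       [outcome[i] for i in idx])

    # Form the hypothesis by looking only at subset A.
    best = max(range(n_predictors), key=lambda j: abs(corr_on(a_idx, j)))
    # Test it on subset B, which played no part in choosing `best`.
    return best, corr_on(a_idx, best), corr_on(b_idx, best), a_idx, b_idx

best, r_a, r_b, a_idx, b_idx = dredge_then_confirm()
print(f"predictor {best}: r on A = {r_a:+.3f}, r on B = {r_b:+.3f}")
```

Because the winning predictor was chosen for its correlation on A, that correlation is inflated by selection. The correlation on B is an unbiased check and, for noise data like these, would be expected to lie close to zero.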
Example
Consider an analysis of sales in the period following an advertising campaign. Suppose that aggregate sales were unchanged, but that an analysis of a sample of households, comparing the treatment and control groups, found that sales went up more for Spanish-speaking households, or for households with incomes between $35,000 and $50,000, or for households that had refinanced in the past two years, or for some other subgroup, and that one or more of these increases was 'statistically significant'. There would certainly be a temptation to report such findings as 'proof' that the campaign was successful, or would be successful if targeted to such groups in other markets.
It is important to realise that the alleged statistical significance here is completely spurious: significance tests do not protect against data dredging. When a hypothesis is tested on a data set already known to support it, that data set is by definition not a representative data set, and any resulting significance levels are meaningless.
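A rough calculation shows why searching many subgroups readily turns up a 'significant' result. The sketch below assumes that the subgroup tests are independent and that the campaign has no true effect in any subgroup; neither assumption comes from the example itself, and the significance level of 0.05 and the subgroup counts are illustrative. It also shows the per-test threshold given by the Bonferroni correction, one standard adjustment for multiple comparisons.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent tests,
    each run at significance level alpha, when no real effect exists."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise error
    rate at or below alpha (Bonferroni correction)."""
    return alpha / n_tests

for k in (1, 5, 20):
    print(f"{k:2d} subgroups: "
          f"P(at least one 'significant') = {familywise_error_rate(k):.3f}, "
          f"Bonferroni threshold = {bonferroni_threshold(k):.4f}")
```

With 20 subgroups examined, the chance of at least one nominally significant result is roughly 0.64 under these assumptions, even though the campaign did nothing.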
See also
- Predictive analytics
- Testing hypotheses suggested by the data
- Multiple comparisons
- Bonferroni inequalities
External links
- John P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Medicine, August 2005.