Data dredging

Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) search for 'statistically significant' relationships in large quantities of data. This activity was formerly known in the statistical community as data mining, but that term is now in widespread use with an essentially positive meaning, so the pejorative term data dredging is now used instead.

The conventional statistical procedure is to formulate a research hypothesis (such as 'people in higher social classes live longer'), then collect relevant data, and then carry out a statistical significance test to see whether the results could be attributed to chance.

A key point is that every hypothesis must be tested with evidence that was not used in constructing the hypothesis. This is because every data set is liable to contain some chance patterns that are not present in the population under study and that would disappear in a sufficiently large sample. If the hypothesis is not tested on a different data set from the same population, it is likely that the patterns found are merely such chance patterns.

As a simplistic example, throwing five coins and obtaining a result of 2 heads and 3 tails might lead one to ask why the coin favors tails by fifty percent, whereas forming the hypothesis first would lead one to conclude that only a 5-0 or 0-5 result would be very surprising, since the probability of such a result is only 6.25% (the odds are 93.75% against it happening by chance). Viewed this way, it becomes obvious that the 2-3 result is not anomalous.
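
To make the arithmetic explicit, the following short Python sketch (illustrative only, not part of the original example) computes the probabilities involved:

from math import comb

n = 5                                    # five fair coin tosses
p_all_same = 2 * (0.5 ** n)              # probability of 5-0 or 0-5: 2/32
print(f"P(5-0 or 0-5)   = {p_all_same:.4f}")      # 0.0625
print(f"odds against it = {1 - p_all_same:.4f}")  # 0.9375

# A split at least as uneven as 3-2 covers every possible outcome,
# so a 2-3 result carries no surprise at all.
p_any_split = sum(comb(n, k) for k in range(n + 1)) * 0.5 ** n
print(f"P(split at least as uneven as 3-2) = {p_any_split:.4f}")  # 1.0000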

As a more lyrical example, on a cloudy day, try the experiment of looking for figures in the clouds: if one looks long enough one may see castles, cattle, and all sorts of fanciful images; but the images are not really in the clouds, as can easily be confirmed by looking at other clouds.

It is important to realize that the alleged statistical significance here is completely spurious: significance tests do not protect against data dredging. When a hypothesis is tested on the very data set from which it was derived, that data set is by construction not a representative sample with respect to the hypothesis, and any resulting significance levels are meaningless.
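
The point can be seen in a small Monte Carlo simulation; the sketch below (assuming NumPy and SciPy, and not part of the original text) dredges 20 pure-noise variables per experiment and reports how often at least one of them appears 'significant' at the nominal 5% level:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_tests, n_obs = 1000, 20, 50

false_positives = 0
for _ in range(n_experiments):
    # Pure noise: no variable has any real effect.
    data = rng.normal(size=(n_tests, n_obs))
    # Dredge: test every variable and keep only the best p-value.
    p_values = [stats.ttest_1samp(row, 0.0).pvalue for row in data]
    if min(p_values) < 0.05:
        false_positives += 1

# With 20 independent tests at the 5% level, roughly 1 - 0.95**20, about 64%,
# of experiments yield at least one "significant" finding by chance alone.
print(f"experiments with a spurious 'significant' result: {false_positives / n_experiments:.0%}")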

Examples

In meteorology, hypotheses are often formulated using weather data up to the present and then tested against future weather data; this ensures that, even subconsciously, the test data could not have influenced the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. It also ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.
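
A minimal sketch of such a temporal split, using synthetic temperature data (the data, model, and baseline here are stand-ins, not drawn from any real meteorological study):

import numpy as np

rng = np.random.default_rng(1)
days = np.arange(3650)                     # ten years of synthetic daily data
temps = 10 + 8 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 3, days.size)

# Formulate the model only on data up to "the present" ...
train_x = np.sin(2 * np.pi * days[:3000] / 365)
coeffs = np.polyfit(train_x, temps[:3000], deg=1)

# ... and judge its predictive power only on data that arrives afterwards,
# against a null baseline (the historical mean).
test_x, test_y = np.sin(2 * np.pi * days[3000:] / 365), temps[3000:]
pred = np.polyval(coeffs, test_x)
baseline = np.full_like(test_y, temps[:3000].mean())
print("model MSE   :", np.mean((pred - test_y) ** 2))
print("baseline MSE:", np.mean((baseline - test_y) ** 2))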

Consider an analysis of sales in the period following an advertising campaign. Suppose that aggregate sales were unchanged, but that analysis of a sample of households, comparing the treatment and control groups, found that sales went up more for Spanish-speaking households, or for households with incomes between $35,000 and $50,000, or for households that had refinanced in the past two years, and so on, and that such increases were 'statistically significant'. There would certainly be a temptation to report such findings as 'proof' that the campaign was successful, or would be successful if targeted to such groups in other markets.
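
The sketch below (synthetic data, assuming NumPy and SciPy; the subgroup names are hypothetical) mimics this situation: the campaign has no effect at all, yet scanning enough arbitrary subgroups will usually turn up at least one 'statistically significant' uplift:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_households = 2000

# Synthetic households: by construction the campaign has no effect on sales.
treated = rng.integers(0, 2, n_households).astype(bool)
sales_change = rng.normal(0, 1, n_households)

# Dredge through many arbitrary subgroups (language, income band, refinancing, ...).
subgroups = {f"subgroup_{i}": rng.random(n_households) < 0.3 for i in range(15)}

for name, members in subgroups.items():
    p = stats.ttest_ind(sales_change[members & treated],
                        sales_change[members & ~treated]).pvalue
    if p < 0.05:
        print(f"{name}: 'significant' uplift, p = {p:.3f} (spurious by construction)")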

Remedies

The practice of looking for patterns in data is legitimate; what is illegitimate is applying a statistical test of significance (hypothesis testing) to the same data from which the pattern was learned. One way to construct hypotheses while avoiding the problems of data dredging is randomization: the researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset, say subset A, is examined for creating hypotheses. Once a hypothesis has been formulated, it must be tested on subset B, which was not used to construct it. Only where such a hypothesis is also supported by B is it reasonable to believe that it might be valid.
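
As a sketch of this split-sample discipline (synthetic data with no real effects, assuming NumPy and SciPy):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(0, 1, size=(1000, 10))   # stand-in data set; no true effects

# Randomly partition the rows into an exploration subset A and a confirmation subset B.
idx = rng.permutation(len(data))
A, B = data[idx[:500]], data[idx[500:]]

# "Dredge" subset A for the column that looks most different from zero ...
p_A = [stats.ttest_1samp(A[:, j], 0.0).pvalue for j in range(A.shape[1])]
best = int(np.argmin(p_A))
print(f"column {best} looks promising on A: p = {p_A[best]:.3f}")

# ... then test that single pre-specified hypothesis on the untouched subset B.
p_B = stats.ttest_1samp(B[:, best], 0.0).pvalue
print(f"confirmation test on B: p = {p_B:.3f}")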

Another remedy for data dredging is to record the number of significance tests conducted during the experiment and simply multiply the final significance level by this number (the Bonferroni correction); however, this is a very conservative correction. The use of a false discovery rate is a more sophisticated approach that has become a popular method for controlling multiple hypothesis tests.
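
Both corrections are available in standard statistical software; for example, the following sketch uses the statsmodels library (the raw p-values are made-up illustrative numbers):

import numpy as np
from statsmodels.stats.multitest import multipletests

# Raw p-values from, say, ten significance tests run during one dredging session.
p_raw = np.array([0.003, 0.020, 0.045, 0.060, 0.120,
                  0.200, 0.350, 0.500, 0.700, 0.900])

# Bonferroni: in effect multiplies each p-value by the number of tests (capped at 1).
reject_bonf, p_bonf, _, _ = multipletests(p_raw, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg false discovery rate: less conservative; controls the expected
# proportion of false discoveries among the hypotheses that are rejected.
reject_fdr, p_fdr, _, _ = multipletests(p_raw, alpha=0.05, method="fdr_bh")

print("Bonferroni-adjusted:", np.round(p_bonf, 3), reject_bonf)
print("FDR-adjusted       :", np.round(p_fdr, 3), reject_fdr)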

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of the data and of the method used to examine the data. Thus, if someone says that a certain event has a probability of 20% +/- 2%, 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% figure, the result will lie between 18% and 22% with probability 0.95. No claim of statistical significance can be made by looking at the data alone, without due regard to the method used to assess it.
