"Correlation does not imply causation" (related to "ignoring a common cause" and questionable cause) is a phrase used in science and statistics to emphasize that correlation between two variables does not automatically imply that one causes the other (though correlation is necessary for linear causation in the absence of any third and countervailing causative variable, and can indicate possible causes or areas for further investigation; in other words, correlation can be a hint).[1][2]
The opposite belief, correlation proves causation, is a logical fallacy by which two events that occur together are claimed to have a cause-and-effect relationship. The fallacy is also known as cum hoc ergo propter hoc (Latin for "with this, therefore because of this") and false cause. By contrast, the fallacy post hoc ergo propter hoc requires that one event occur before the other and so may be considered a type of cum hoc fallacy.
In a widely studied example, numerous epidemiological studies showed that women who were taking combined hormone replacement therapy (HRT) also had a lower-than-average incidence of coronary heart disease (CHD), leading doctors to propose that HRT was protective against CHD. But randomized controlled trials showed that HRT caused a small but statistically significant increase in the risk of CHD. Re-analysis of the data from the epidemiological studies showed that women undertaking HRT were more likely to be from higher socio-economic groups (ABC1), with better-than-average diet and exercise regimens. The use of HRT and the decreased incidence of coronary heart disease were coincident effects of a common cause (i.e. the benefits associated with a higher socioeconomic status), rather than cause and effect as had been supposed.[3]
In logic, the technical use of the word "implies" means "to be a sufficient circumstance". This is the meaning intended by statisticians when they say causation is not certain. Indeed, p implies q has the technical meaning of logical implication: if p then q symbolized as p → q. That is "if circumstance p is true, then q necessarily follows." In this sense, it is always correct to say "Correlation does not imply causation".
However, in casual use, the word "imply" loosely means suggests rather than requires. The idea that correlation and causation are connected is certainly true; where there is causation, there is likely to be correlation. Indeed, correlation is used when inferring causation; the important point is that such inferences are not always correct because there are other possibilities, as explained later in this article.
Edward Tufte, in a criticism of the brevity of Microsoft PowerPoint presentations, deprecates the use of "is" to relate correlation and causation (as in "Correlation is not causation"), arguing that the statement is inaccurate because it is incomplete.[1] While it is not the case that correlation is causation, simply stating their nonequivalence omits information about their relationship. Tufte suggests that the shortest true statement that can be made about causality and correlation is one of the following:[4]
The cum hoc ergo propter hoc logical fallacy can be expressed as follows:
In this type of logical fallacy, one makes a premature conclusion about causality after observing only a correlation between two or more factors. Generally, if one factor (A) is observed to only be correlated with another factor (B), it is sometimes taken for granted that A is causing B even when no evidence supports it. This is a logical fallacy because there are at least five possibilities:
In other words, there can be no conclusion made regarding the existence or the direction of a cause and effect relationship only from the fact that A and B are correlated. Determining whether there is an actual cause and effect relationship requires further investigation, even when the relationship between A and B is statistically significant, a large effect size is observed, or a large part of the variance is explained.
In this example, the correlation between the number of firemen at a scene and the size of the fire does not imply that the firemen cause the fire. Firemen are sent according to the severity of the fire and if there is a large fire, a greater number of firemen are sent; therefore it is rather that fire causes firemen to arrive at the scene. So the above conclusion is false.
The ideal gas law, PV = nRT, describes the direct relationship between pressure and temperature (along with other factors) to show that there is a direct correlation between the two properties. For a fixed volume and mass of gas, an increase in temperature will cause an increase in pressure; likewise, increased pressure will cause an increase in temperature. This demonstrates bidirectional causation. The conclusion that pressure causes temperature is true but is not logically guaranteed by the premise.
All these examples deal with a lurking variable, which is simply a hidden third variable that affects both of the correlated variables; for example, the fact that it is summer in Example 3. A difficulty often also arises where the third factor, though fundamentally different from A and B, is so closely related to A and/or B as to be confused with them or very difficult to scientifically disentangle from them (see Example 4).
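The effect of a lurking variable can be illustrated with a small simulation. The following Python sketch is illustrative only and is not drawn from any cited study: the variable names (z, a, b), the noise model and the sample size are all assumptions. It generates two variables that each depend on a hidden third variable but not on each other, then compares their plain correlation with their partial correlation after accounting for the hidden variable.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def partial_corr(a, b, z):
    """First-order partial correlation of a and b, controlling for z."""
    r_ab, r_az, r_bz = pearson(a, b), pearson(a, z), pearson(b, z)
    return (r_ab - r_az * r_bz) / ((1 - r_az ** 2) * (1 - r_bz ** 2)) ** 0.5


def simulate(n=5000, seed=0):
    """Hypothetical data: z drives both a and b; a and b never interact."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # lurking variable
    a = [zi + rng.gauss(0, 1) for zi in z]    # A depends only on Z plus noise
    b = [zi + rng.gauss(0, 1) for zi in z]    # B depends only on Z plus noise
    return z, a, b


z, a, b = simulate()
r_ab = pearson(a, b)                 # clearly positive, despite no causal link
r_ab_given_z = partial_corr(a, b, z)  # close to zero once Z is accounted for
print(f"corr(A, B)     = {r_ab:.3f}")
print(f"corr(A, B | Z) = {r_ab_given_z:.3f}")
```

Under these assumptions the plain correlation should come out near 0.5 while the partial correlation should be near zero. In real data the lurking variable is usually unmeasured, which is why this kind of check is often unavailable.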
The above example commits the correlation-implies-causation fallacy, as it prematurely concludes that sleeping with one's shoes on causes headache. A more plausible explanation is that both are caused by a third factor, in this case going to bed drunk, which thereby gives rise to a correlation. So the conclusion is false.
This is a recent scientific example that resulted from a study at the University of Pennsylvania Medical Center. Published in the May 13, 1999 issue of Nature,[5] the study received much coverage at the time in the popular press.[6] However, a later study at Ohio State University did not find that infants sleeping with the light on caused the development of myopia. It did find a strong link between parental myopia and the development of child myopia, also noting that myopic parents were more likely to leave a light on in their children's bedroom.[7][8][9][10] In this case, the cause of both conditions is parental myopia, and the above-stated conclusion is false.
The aforementioned example fails to recognize the importance of time and temperature in relationship to ice cream sales. Ice cream is sold during the hot summer months at a much greater rate than during colder times, and it is during these hot summer months that people are more likely to engage in activities involving water, such as swimming. The increased drowning deaths are simply caused by more exposure to water-based activities, not ice cream. The stated conclusion is false.
However, as encountered in many psychological studies, another variable, a "self-consciousness score," is discovered which has a sharper correlation (+.73) with shyness. This suggests a possible "third variable" problem; however, when three such closely related measures are found, it further suggests that each may have bidirectional tendencies (see "bidirectional variable," above), being a cluster of correlated values each influencing one another to some extent. Therefore, the simple conclusion above may be false.
As car sales increase, carbon dioxide levels increase, and so does obesity, as people walk and bike less.
Recent research[12] calls this conclusion into question. Instead, it may be that other underlying factors, like genes, diet and exercise, affect both HDL levels and the likelihood of having a heart attack; it is possible that medicines may affect the directly measurable factor, HDL levels, without affecting the chance of heart attack.
This example is used satirically by the parody religion Pastafarianism to illustrate the logical fallacy of assuming that correlation equals causation.
David Hume argued that causality is based on experience, and experience similarly based on the assumption that the future models the past, which in turn can only be based on experience – leading to circular logic. In conclusion he asserted that causality is not based on actual reasoning: only correlation can actually be perceived.[13]
Intuitively, causation seems to require not just a correlation, but a counterfactual dependence. Suppose that a student performed poorly on a test and guesses that the cause was his not studying. To prove this, one thinks of the counterfactual – the same student writing the same test under the same circumstances but having studied the night before. If one could rewind history, and change only one small thing (making the student study for the exam), then causation could be observed (by comparing version 1 to version 2). Because one cannot rewind history and replay events after making small controlled changes, causation can only be inferred, never exactly known. This is referred to as the Fundamental Problem of Causal Inference – it is impossible to directly observe causal effects.[14]
A major goal of scientific experiments and statistical methods is to approximate as best as possible the counterfactual state of the world.[15] For example, one could run an experiment on identical twins who were known to consistently get the same grades on their tests. One twin is sent to study for six hours while the other is sent to the amusement park. If their test scores suddenly diverged by a large degree, this would be strong evidence that studying (or going to the amusement park) had a causal effect on test scores. In this case, correlation between studying and test scores would almost certainly imply causation.
Well-designed experimental studies replace equality of individuals, as in the previous example, by equality of groups. This is achieved by randomization of the subjects to two or more groups. Although this is not a perfect system, the likelihood of the groups being equal in all aspects rises with the number of subjects placed randomly in the treatment and placebo groups. From the significance of the difference between the effect of the treatment and that of the placebo, one can conclude the likelihood of the treatment having a causal effect on the disease. This likelihood can be quantified in statistical terms by the p-value.
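A minimal sketch of this idea follows, written in Python. Everything in it is a hypothetical illustration: the outcome model, the effect size, the group sizes and the function names are assumptions, and the permutation test is only one of several ways a p-value might be obtained. Subjects differ in a hidden baseline, but because assignment to treatment is random, the baseline is balanced across groups on average and the difference in group means estimates the treatment effect.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def run_trial(n=400, effect=1.0, seed=1):
    """Randomly assign half of n hypothetical subjects to treatment.

    Returns (outcomes, labels), where labels[i] is True if subject i
    was treated. The hidden baseline varies from subject to subject.
    """
    rng = random.Random(seed)
    baseline = [rng.gauss(0, 1) for _ in range(n)]
    labels = [True] * (n // 2) + [False] * (n - n // 2)
    rng.shuffle(labels)  # randomization step
    outcomes = [b + (effect if t else 0.0) for b, t in zip(baseline, labels)]
    return outcomes, labels


def mean_difference(outcomes, labels):
    treated = [y for y, t in zip(outcomes, labels) if t]
    control = [y for y, t in zip(outcomes, labels) if not t]
    return mean(treated) - mean(control)


def permutation_p_value(outcomes, labels, n_perm=1000, seed=2):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean_difference(outcomes, labels))
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # relabel subjects at random
        if abs(mean_difference(outcomes, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # add-one correction avoids p = 0


outcomes, labels = run_trial()
diff = mean_difference(outcomes, labels)
p_value = permutation_p_value(outcomes, labels)
print(f"estimated effect = {diff:.3f}, permutation p-value = {p_value:.4f}")
```

The permutation test asks how often a difference at least as large as the observed one would arise if the treatment labels were unrelated to the outcomes; a small p-value indicates that chance alone is an unlikely explanation.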
When experimental studies are impossible and only pre-existing data are available, as is usually the case in economics, for example, regression analysis can be used. Factors other than the potential causative variable of interest are controlled for by including them as regressors in addition to the regressor representing the variable of interest. False inferences of causation due to reverse causation (or wrong estimates of the magnitude of causation due to the presence of bidirectional causation) can be avoided by using explanators (regressors) that are necessarily exogenous, such as physical explanators like rainfall amount (as a determinant of, say, futures prices), lagged variables whose values were determined before the dependent variable's value was determined, instrumental variables for the explanators (chosen based on their known exogeneity), etc. See Causality#Economics. Spurious correlation due to mutual influence from a third, common causative variable is harder to avoid: the model must be specified such that there is a theoretical reason to believe that no such underlying causative variable has been omitted from the model; in particular, underlying time trends of both the dependent variable and the independent (potentially causative) variable must be controlled for by including time as another independent variable.
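The role of an additional regressor can be sketched with simulated data. The Python example below is a hypothetical illustration: the data-generating process, the coefficients and the function names are assumptions chosen for clarity. The dependent variable y is driven entirely by a common cause z, and x is also driven by z but has no effect on y. Regressing y on x alone yields a sizeable slope, while including z as a second regressor drives the coefficient on x toward zero.

```python
import random


def centered(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]


def dot(us, vs):
    return sum(u * v for u, v in zip(us, vs))


def ols_one(y, x):
    """Slope from least-squares regression of y on x (with intercept)."""
    yc, xc = centered(y), centered(x)
    return dot(xc, yc) / dot(xc, xc)


def ols_two(y, x, z):
    """Coefficients (b_x, b_z) from regressing y on x and z (with intercept).

    Solves the 2x2 normal equations on mean-centered data.
    """
    yc, xc, zc = centered(y), centered(x), centered(z)
    sxx, szz, sxz = dot(xc, xc), dot(zc, zc), dot(xc, zc)
    sxy, szy = dot(xc, yc), dot(zc, yc)
    det = sxx * szz - sxz ** 2
    b_x = (sxy * szz - szy * sxz) / det
    b_z = (szy * sxx - sxy * sxz) / det
    return b_x, b_z


# Hypothetical data: z causes both x and y; x has no effect on y.
rng = random.Random(3)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [2.0 * zi + rng.gauss(0, 1) for zi in z]

naive_slope = ols_one(y, x)   # biased: absorbs the influence of omitted z
b_x, b_z = ols_two(y, x, z)   # with z included, b_x should be near zero
print(f"y ~ x:     slope on x = {naive_slope:.3f}")
print(f"y ~ x + z: slope on x = {b_x:.3f}, slope on z = {b_z:.3f}")
```

The correction works here only because z is measured and included; if the common cause is omitted from the model, the regression cannot distinguish its influence from that of x.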