Lindley's paradox


Lindley's paradox is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution. The problem of the disagreement between the two approaches was discussed in Harold Jeffreys' 1939 textbook;[1] it became known as Lindley's paradox after Dennis Lindley called the disagreement a paradox in a 1957 paper.[2]

Although referred to as a paradox, the differing results from the Bayesian and frequentist approaches can be explained as answers to fundamentally different questions, rather than as an actual disagreement between the two methods.

Description of the paradox

Consider the result x of some experiment, with two possible explanations, hypotheses H_0 and H_1, and some prior distribution π representing uncertainty as to which hypothesis is more accurate before taking into account x.

Lindley's paradox occurs when

  1. The result x is "significant" by a frequentist test of H_0, indicating sufficient evidence to reject H_0, say, at the 5% level, and
  2. The posterior probability of H_0 given x is high, indicating strong evidence that H_0 is in better agreement with x than H_1.

These results can occur at the same time when H_0 is very specific, H_1 more diffuse, and the prior distribution does not strongly favor one or the other, as seen below.

Numerical example

We can illustrate Lindley's paradox with a numerical example. Imagine a certain city where 49,581 boys and 48,870 girls have been born over a certain time period. The observed proportion x of male births is thus 49,581/98,451 ≈ 0.5036. We assume the number of male births is a binomial variable with parameter θ. We are interested in testing whether θ is 0.5 or some other value. That is, our null hypothesis is H_0: θ = 0.5 and the alternative is H_1: θ ≠ 0.5.
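
The setup can be written down directly in code. The short Python sketch below simply records the counts and the observed proportion; the variable names are our own choice, not part of the original example.

    # Observed birth counts for the numerical example.
    boys, girls = 49_581, 48_870
    n = boys + girls            # total births: 98,451
    k = boys                    # observed number of male births

    x = k / n                   # observed proportion of male births
    print(round(x, 4))          # ~0.5036

    # Model: k ~ Binomial(n, theta); H0: theta = 0.5 versus H1: theta != 0.5.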

Frequentist approach

The frequentist approach to testing H_0 is to compute a p-value, the probability of observing a fraction of boys at least as large as x assuming H_0 is true. Because the number of births is very large, we can use a normal approximation for the number of male births X ~ N(μ, σ²), with μ = nθ = 98,451 × 0.5 = 49,225.5 and σ² = nθ(1 − θ) = 98,451 × 0.5 × 0.5 = 24,612.75, to compute

\begin{aligned}
P(X \geq 49581 \mid \mu = 49225.5) &= \int_{49581}^{98451} \frac{1}{\sqrt{2\pi\sigma^{2}}} \, e^{-(u - \mu)^{2}/(2\sigma^{2})} \, du \\
&= \int_{49581}^{98451} \frac{1}{\sqrt{2\pi \times 24612.75}} \, e^{-(u - 49225.5)^{2}/(2 \times 24612.75)} \, du \approx 0.0117.
\end{aligned}

We would have been equally surprised if we had seen 49,581 female births, i.e. x ≈ 0.4964, so a frequentist would usually perform a two-sided test, for which the p-value would be p ≈ 2 × 0.0117 ≈ 0.0235. In both cases, the p-value is lower than the significance level of 5%, so the frequentist approach rejects H_0 as disagreeing with the observed data.
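
As a rough check of these numbers, the one-sided and two-sided p-values can be reproduced with the same normal approximation, or with the exact binomial tail. A minimal Python sketch, assuming scipy is available; the variable names are ours.

    from scipy.stats import binom, norm

    n, k = 98_451, 49_581              # total births, observed boys

    # Normal approximation under H0: theta = 0.5.
    mu = 0.5 * n                       # 49,225.5
    sigma = (n * 0.5 * 0.5) ** 0.5     # sqrt(24,612.75)

    p_one_sided = norm.sf(k, loc=mu, scale=sigma)   # P(X >= k), ~0.0117
    print(p_one_sided, 2 * p_one_sided)             # two-sided p-value, ~0.0235

    # The exact binomial tail P(X >= k) gives essentially the same value.
    print(binom.sf(k - 1, n, 0.5))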

Bayesian approach

Assuming no reason to favor one hypothesis over the other, the Bayesian approach would be to assign prior probabilities π(H_0) = π(H_1) = 0.5, and then to compute the posterior probability of H_0 using Bayes' theorem,

P(H_0 \mid k) = \frac{P(k \mid H_0)\,\pi(H_0)}{P(k \mid H_0)\,\pi(H_0) + P(k \mid H_1)\,\pi(H_1)}.

After observing k = 49,581 boys out of n = 98,451 births, we can compute the posterior probability of each hypothesis using the probability mass function for a binomial variable,

\begin{aligned}
P(k \mid H_0) &= \binom{n}{k} (0.5)^{k} (1 - 0.5)^{n-k} \approx 1.95 \times 10^{-4}, \\
P(k \mid H_1) &= \int_{0}^{1} \binom{n}{k} u^{k} (1 - u)^{n-k} \, du = \binom{n}{k} \mathrm{B}(k+1,\, n-k+1) \approx 1.02 \times 10^{-5},
\end{aligned}

where B(a, b) is the Beta function.

From these values, we find the posterior probability P(H_0 | k) ≈ 0.95, which strongly favors H_0 over H_1.
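
The two likelihoods and the resulting posterior probability can be checked numerically. Below is a minimal sketch, assuming scipy; the closed form P(k | H_1) = 1/(n + 1) follows from the Beta-function identity B(k + 1, n − k + 1) = k!(n − k)!/(n + 1)!.

    from scipy.stats import binom

    n, k = 98_451, 49_581

    # Likelihood of the data under H0: theta = 0.5 exactly.
    p_k_H0 = binom.pmf(k, n, 0.5)      # ~1.95e-4

    # Marginal likelihood under H1: theta ~ Uniform(0, 1).
    # Integrating the binomial pmf against the uniform prior gives 1/(n + 1).
    p_k_H1 = 1 / (n + 1)               # ~1.02e-5

    # Equal prior weights pi(H0) = pi(H1) = 0.5 cancel in the ratio.
    posterior_H0 = p_k_H0 / (p_k_H0 + p_k_H1)
    print(posterior_H0)                # ~0.95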

The two approaches—the Bayesian and the frequentist—appear to be in conflict, and this is the "paradox".

The lack of an actual paradox

The apparent disagreement between the two approaches is caused by a combination of factors. First, the frequentist approach above tests H_0 without reference to H_1. The Bayesian approach evaluates H_0 as an alternative to H_1 and finds the former to be in better agreement with the observations. This is because the alternative hypothesis is much more diffuse, as θ can lie anywhere in [0, 1], which results in it having a very low posterior probability. To understand why, it is helpful to consider the two hypotheses as generators of the observations (a small simulation sketch follows the list below):

  • Under H_0, we choose θ = 0.5, and ask how likely it is to see 49,581 boys in 98,451 births.
  • Under H_1, we choose θ randomly from anywhere in [0, 1], and ask the same question.
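
One way to make this generator picture concrete is a small simulation: draw θ as each hypothesis prescribes, generate a birth count, and see how often the result lands near the observed 49,581. The sketch below is ours; the ±500-birth window and the random seed are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k_obs = 98_451, 49_581
    trials = 100_000
    window = 500                   # arbitrary "near the data" tolerance

    # H0 generator: theta fixed at 0.5.
    k_H0 = rng.binomial(n, 0.5, size=trials)

    # H1 generator: theta drawn uniformly from [0, 1], then a birth count.
    theta = rng.uniform(0.0, 1.0, size=trials)
    k_H1 = rng.binomial(n, theta)

    # H0 lands near the data in most runs (~0.8); H1 only rarely (~0.01),
    # because most values of theta it proposes fit the observations very poorly.
    print(np.mean(np.abs(k_H0 - k_obs) <= window))
    print(np.mean(np.abs(k_H1 - k_obs) <= window))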

Most of the possible values for θ under H_1 are very poorly supported by the observations. In essence, the apparent disagreement between the methods is not a disagreement at all, but rather two different statements about how the hypotheses relate to the data:

  • The frequentist finds that H_0 is a poor explanation for the observation.
  • The Bayesian finds that H_0 is a far better explanation for the observation than H_1.

For practical purposes (and particularly in the numerical example above), it could also be said that the disagreement is rooted in the poor choice of prior probabilities in the Bayesian approach. This becomes clear if the region around θ ≈ 0.5 is examined.

For example, this choice of hypotheses and prior probabilities implies the statement: "if 0.49 < θ < 0.51, then the prior probability of θ being exactly 0.5 is 0.5/0.51 ≈ 98%". (The point mass at θ = 0.5 carries prior weight 0.5, while the diffuse alternative spreads its weight of 0.5 uniformly over [0, 1] and so puts only 0.5 × 0.02 = 0.01 on the rest of that interval, giving 0.5/(0.5 + 0.01) ≈ 0.98.) Given such a strong preference for θ = 0.5, it is easy to see why the Bayesian approach favors H_0 in the face of x ≈ 0.5036, even though the observed value of x lies about 2.27σ away from 0.5. A deviation of over 2σ from H_0 is considered significant in the frequentist approach, but its significance is overruled by the prior in the Bayesian approach.

Looking at it another way, we can see that the prior distribution is essentially flat with a delta function at θ = 0.5. Clearly this is dubious: since θ is a continuous parameter, it would arguably be more natural to assume that no single value can be exactly equal to the parameter, i.e., to assign P(θ = 0.5) = 0.

A more realistic distribution for θ under the alternative hypothesis produces a less surprising result for the posterior probability of H_0. For example, if we replace H_1 with H_2: θ = x, i.e., the maximum likelihood estimate for θ, the posterior probability of H_0 would be only 0.07, compared to 0.93 for H_2. (Of course, one cannot actually use the MLE as part of a prior distribution.)
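
The 0.07 versus 0.93 split can be reproduced by comparing the point null against a point hypothesis at the maximum likelihood estimate; a minimal sketch, again assuming scipy.

    from scipy.stats import binom

    n, k = 98_451, 49_581

    p_k_H0 = binom.pmf(k, n, 0.5)      # likelihood under H0: theta = 0.5
    p_k_H2 = binom.pmf(k, n, k / n)    # likelihood under H2: theta = k/n (the MLE)

    # Equal prior weights on H0 and H2.
    posterior_H0 = p_k_H0 / (p_k_H0 + p_k_H2)
    print(posterior_H0, 1 - posterior_H0)   # ~0.07, ~0.93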

Reconciling the Bayesian and frequentist approaches

If one uses an uninformative prior and tests a hypothesis more similar to that in the frequentist approach, the paradox disappears.

For example, if we calculate the posterior distribution P(θ | k, n) using a uniform prior distribution on θ (i.e., π(θ) = 1 for θ ∈ [0, 1]), we find that the posterior is a Beta(k + 1, n − k + 1) distribution, with density

P(\theta \mid k, n) = \frac{\theta^{k} (1 - \theta)^{n-k}}{\mathrm{B}(k+1,\, n-k+1)}.

If we use this to check the probability that a newborn is more likely to be a boy than a girl, i.e., P(θ > 0.5 | k, n), we find

P(\theta > 0.5 \mid k, n) = \int_{0.5}^{1} \frac{\theta^{k} (1 - \theta)^{n-k}}{\mathrm{B}(k+1,\, n-k+1)} \, d\theta \approx 0.988.

In other words, it is very likely that the proportion of male births is above 0.5.
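
Since the posterior under the uniform prior is a Beta(k + 1, n − k + 1) distribution, this tail probability is one minus its cumulative distribution function at 0.5; a minimal check, assuming scipy.

    from scipy.stats import beta

    n, k = 98_451, 49_581

    # Posterior for theta under a uniform prior: Beta(k + 1, n - k + 1).
    posterior = beta(k + 1, n - k + 1)

    print(posterior.sf(0.5))   # P(theta > 0.5 | k, n), ~0.988
    # The complementary probability P(theta <= 0.5 | k, n) is ~0.012,
    # close to the one-sided p-value computed in the frequentist section.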

Neither analysis directly gives an estimate of the effect size, but both could be used to determine, for instance, whether the fraction of boy births is likely to be above some particular threshold.

Notes

  1. Jeffreys, Harold (1939). Theory of Probability. Oxford University Press. MR 924. 
  2. Lindley, D.V. (1957). "A Statistical Paradox". Biometrika 44 (1–2): 187–192. doi:10.1093/biomet/44.1-2.187. JSTOR 2333251. 
