Bayes' rule

This article is about the form of Bayes' theorem. For the decision rule, see Bayes estimator. For the use of Bayes factor in model selection, see Bayes factor.

Bayesian statistics

Theory
Admissible decision rule Bayesian efficiency Bayesian probability Probability interpretations Bayes' theorem Bayes' rule Bayes factor Bayesian inference Bayesian network Prior Posterior Likelihood Conjugate prior Posterior predictive Hyperparameter Hyperprior Principle of indifference Principle of maximum entropy Empirical Bayes method Cromwell's rule Bernstein–von Mises theorem Bayesian information criterion Credible interval Maximum a posteriori estimation
Techniques
Bayesian linear regression Bayesian estimator Approximate Bayesian computation
Statistics portal

In probability theory and applications, Bayes' rule relates the odds of event $A_1$ to the odds of event $A_2$ , before (prior to) and after (posterior to) conditioning on another event $B$ . The odds on $A_1$ to event $A_2$ is simply the ratio of the probabilities of the two events. The prior odds is the ratio of the unconditional or prior probabilities, the posterior odds is the ratio of conditional or posterior probabilities given the event $B$ . The relationship is expressed in terms of the likelihood ratio or Bayes factor, $\Lambda$ . By definition, this is the ratio of the conditional probabilities of the event $B$ given that $A_1$ is the case or that $A_2$ is the case, respectively. The rule simply states: posterior odds equals prior odds times Bayes factor (Gelman et al., 2005, Chapter 1).

When arbitrarily many events $A$ are of interest, not just two, the rule can be rephrased as posterior is proportional to prior times likelihood, $P(A|B)\propto P(A) P(B|A)$ where the proportionality symbol means that the left hand side is proportional to (i.e., equals a constant times) the right hand side as $A$ varies, for fixed or given $B$ (Lee, 2012; Bertsch McGrayne, 2012). In this form it goes back to Laplace (1774) and to Cournot (1843); see Fienberg (2005).

Bayes' rule is an equivalent way to formulate Bayes' theorem. If we know the odds for and against $A$ we also know the probabilities of $A$ . It may be preferred to Bayes' theorem in practice for a number of reasons.

Bayes' rule is widely used in statistics, science and engineering, for instance in model selection, probabilistic expert systems based on Bayes networks, statistical proof in legal proceedings, email spam filters, and so on (Rosenthal, 2005; Bertsch McGrayne, 2012). As an elementary fact from the calculus of probability, Bayes' rule tells us how unconditional and conditional probabilities are related whether we work with a frequentist interpretation of probability or a Bayesian interpretation of probability. Under the Bayesian interpretation it is frequently applied in the situation where $A_1$ and $A_2$ are competing hypotheses, and $B$ is some observed evidence. The rule shows how one's judgement on whether $A_1$ or $A_2$ is true should be updated on observing the evidence $B$ (Gelman et al., 2003).

The rule

Single event

Given events $A_1$ , $A_2$ and $B$ , Bayes' rule states that the conditional odds of $A_1:A_2$ given $B$ are equal to the marginal odds of $A_1:A_2$ multiplied by the Bayes factor or likelihood ratio $\Lambda$ :

O(A_1:A_2|B) = \Lambda(A_1:A_2|B) \cdot O(A_1:A_2) ,

where

\Lambda(A_1:A_2|B) = \frac{P(B|A_1)}{P(B|A_2)}.

Here, the odds and conditional odds, also known as prior odds and posterior odds, are defined by

O(A_1:A_2) = \frac{P(A_1)}{P(A_2)},

O(A_1:A_2|B) = \frac{P(A_1|B)}{P(A_2|B)}.

In the special case that $A_1 = A$ and $A_2 = \neg A$ , one writes $O(A)=O(A:\neg A)$ , and uses a similar abbreviation for the Bayes factor and for the conditional odds. The odds on $A$ is by definition the odds for and against $A$ . Bayes' rule can then be written in the abbreviated form

O(A|B) = O(A) \cdot \Lambda(A|B) ,

or in words: the posterior odds on $A$ equals the prior odds on $A$ times the likelihood ratio for $A$ given information $B$ . In short, posterior odds equals prior odds times likelihood ratio.

The rule is frequently applied when $A_1 = A$ and $A_2 = \neg A$ are two competing hypotheses concerning the cause of some event $B$ . The prior odds on $A$ , in other words, the odds between $A$ and $\neg A$ , expresses our initial beliefs concerning whether or not $A$ is true. The event $B$ represents some evidence, information, data, or observations. The likelihood ratio is the ratio of the chances of observing $B$ under the two hypotheses $A$ and $\neg A$ . The rule tells us how our prior beliefs concerning whether or not $A$ is true needs to be updated on receiving the information $B$ .

Many events

If we think of $A$ as arbitrary and $B$ as fixed then we can rewrite Bayes' theorem $P(A|B)=P(A)P(B|A)/P(B)$ in the form $P(A|B) \propto P(A)P(B|A)$ where the proportionality symbol means that, as $A$ varies but keeping $B$ fixed, the left hand side is equal to a constant times the right hand side.

In words posterior is proportional to prior times likelihood. This version of Bayes' theorem was first called "Bayes' rule" by Cournot (1843). Cournot popularized the earlier work of Laplace (1774) who had independently discovered Bayes' rule. The work of Bayes was published posthumously (1763) but remained more or less unknown till Cournot drew attention to it; see Fienberg (2006).

Bayes' rule may be preferred to the usual statement of Bayes' theorem for a number of reasons. One is that it is intuitively simpler to understand. Another reason is that normalizing probabilities is sometimes unnecessary: one sometimes only needs to know ratios of probabilities. Finally, doing the normalization is often easier to do after simplifying the product of prior and likelihood by deleting any factors which do not depend on $A$ , so we do not need to actually compute the denominator $P(B)$ in the usual statement of Bayes' theorem $P(A|B) = \frac{P(B | A)\, P(A)}{P(B)}\cdot \,$ .

In Bayesian statistics, Bayes' rule is often applied with a so-called improper prior, for instance, a uniform probability distribution over all real numbers. In that case, the prior distribution does not exist as a probability measure within conventional probability theory, and Bayes' theorem itself is not available.

Series of events

Bayes' rule may be applied a number of times. Each time we observe a new event, we update the odds between the events of interest, say $A_1$ and $A_2$ by taking account of the new information. For two events (information, evidence) $B$ and $C$ ,

O(A_1:A_2|B \cap C) = \Lambda(A_1:A_2|B \cap C) \cdot \Lambda(A_1:A_2|B) \cdot O(A_1:A_2) ,

where

\Lambda(A_1:A_2|B) = \frac{P(B|A_1)}{P(B|A_2)} ,

\Lambda(A_1:A_2|B \cap C) = \frac{P(C|A_1 \cap B)}{P(C|A_2 \cap B)} .

In the special case of two complementary events $A$ and $\neg A$ , the equivalent notation is

O(A|B,C) = \Lambda(A|B \cap C) \cdot \Lambda(A|B) \cdot O(A).

Derivation

Consider two instances of Bayes' theorem:

P(A_1|B) = \frac{1}{P(B)} \cdot P(B|A_1) \cdot P(A_1),

P(A_2|B) = \frac{1}{P(B)} \cdot P(B|A_2) \cdot P(A_2).

Combining these gives

\frac{P(A_1|B)}{P(A_2|B)} = \frac{P(B|A_1)}{P(B|A_2)} \cdot \frac{P(A_1)}{P(A_2)}.

Now defining

O(A_1:A_2|B) \triangleq \frac{P(A_1|B)}{P(A_2|B)}

O(A_1:A_2) \triangleq \frac{P(A_1)}{P(A_2)}

\Lambda(A_1:A_2|B) \triangleq \frac{P(B|A_1)}{P(B|A_2)},

this implies

O(A_1:A_2|B) = \Lambda(A_1:A_2|B) \cdot O(A_1:A_2).

A similar derivation applies for conditioning on multiple events, using the appropriate extension of Bayes' theorem

Examples

Frequentist example

Consider the drug testing example in the article on Bayes' theorem.

The same results may be obtained using Bayes' rule. The prior odds on an individual being a drug-user are 199 to 1 against, as $\textstyle 0.5\%=\frac{1}{200}$ and $\textstyle 99.5\%=\frac{199}{200}$ . The Bayes factor when an individual tests positive is $\textstyle \frac{0.99}{0.01} = 99:1$ in favour of being a drug-user: this is the ratio of the probability of a drug-user testing positive, to the probability of a non-drug user testing positive. The posterior odds on being a drug user are therefore $\textstyle 1 \times 99 : 199 \times 1 = 99:199$ , which is very close to $\textstyle 100:200 = 1:2$ . In round numbers, only one in three of those testing positive are actually drug-users.

External links

Bessière, P, Mazer, E, Ahuactzin, JM and Mekhnacha, K (2013), "Bayesian Programming", CRC Press.

Fienberg, SE (2006), "When did Bayesian inference become "Bayesian"?"", Bayesian analysis vol. 1, nr. 1, pp. 1-40.

Gelman, A, Carlin, JB, Stern, HS and Rubin, DB (2003), "Bayesian Data Analysis", Second Edition, CRC Press.

Lee, PM (2012), "Bayesian Statistics: An Introduction", Wiley.

McGrayne, SB (2012), "The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy", Yale University Press.

The on-line textbook: Information Theory, Inference, and Learning Algorithms, by MacKay, DJC, discusses Bayesian model comparison in Chapters 3 and 28.

Rosenthal, JS (2005): Struck by Lightning: the Curious World of Probabilities. Harper Collings 2005, ISBN 978-0-00-200791-7.

Stone, JV (2013), "Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis", Download chapter 1, Sebtel Press, England.