Beta-binomial distribution

From Wikipedia, the free encyclopedia

Probability mass function
Cumulative distribution function
Parameters	n ∈ N₀ — number of trials $\alpha >0$ (real) $\beta >0$ (real)
Support	k ∈ { 0, …, n }
pmf	${n \choose k}{\frac {{\mathrm {B}}(k+\alpha ,n-k+\beta )}{{\mathrm {B}}(\alpha ,\beta )}}\!$
CDF	$1-{\tfrac {{\mathrm {B}}(\beta +n-k-1,\alpha +k+1)_{3}F_{2}({\boldsymbol {a}},{\boldsymbol {b}};k)}{{\mathrm {B}}(\alpha ,\beta ){\mathrm {B}}(n-k,k+2)(n+1)}}$ where ₃F₂(a,b,k) is the generalized hypergeometric function =₃F₂(1, α + k + 1, −n + k + 1; k + 2, −β − n + k + 2; 1)
Mean	${\frac {n\alpha }{\alpha +\beta }}\!$
Variance	${\frac {n\alpha \beta (\alpha +\beta +n)}{(\alpha +\beta )^{2}(\alpha +\beta +1)}}\!$
Skewness	${\tfrac {(\alpha +\beta +2n)(\beta -\alpha )}{(\alpha +\beta +2)}}{\sqrt {{\tfrac {1+\alpha +\beta }{n\alpha \beta (n+\alpha +\beta )}}}}\!$
Ex. kurtosis	See text
MGF	$_{{2}}F_{{1}}(-n,\alpha ;\alpha +\beta ;1-e^{{t}})\!$ ${\text{for }}t<\log _{e}(2)$
CF	$_{{2}}F_{{1}}(-n,\alpha ;\alpha +\beta ;1-e^{{it}})\!$ ${\text{for }}\|t\|<\log _{e}(2)$

In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random. The beta-binomial distribution is the binomial distribution in which the probability of success at each trial is not fixed but random and follows the beta distribution. It is frequently used in Bayesian statistics, empirical Bayes methods and classical statistics as an overdispersed binomial distribution.

It reduces to the Bernoulli distribution as a special case when n = 1. For α = β = 1, it is the discrete uniform distribution from 0 to n. It also approximates the binomial distribution arbitrarily well for large α and β. The beta-binomial is a one-dimensional version of the Dirichlet-multinomial distribution, as the binomial and beta distributions are special cases of the multinomial and Dirichlet distributions, respectively.

Motivation and derivation

Beta-binomial distribution as a compound distribution

The beta distribution is a conjugate prior distribution for the binomial distribution. This fact leads to an analytically tractable compound distribution in which the $p$ parameter of the binomial distribution is itself randomly drawn from a beta distribution. Namely, if

${\begin{aligned}X&\sim \operatorname {Bin}(n,p)\\{\text{then }}P(X=k|p,n)&=L(k|p)={n \choose k}p^{k}(1-p)^{{n-k}}\end{aligned}}$

where Bin(n,p) stands for the binomial distribution, and where p is a random variable with a beta distribution.

${\begin{aligned}\pi (p|\alpha ,\beta )&={\mathrm {Beta}}(\alpha ,\beta )\\&={\frac {p^{{\alpha -1}}(1-p)^{{\beta -1}}}{{\mathrm {B}}(\alpha ,\beta )}}\end{aligned}}$

then the compound distribution is given by

${\begin{aligned}f(k|n,\alpha ,\beta )&=\int _{0}^{1}L(k|p)\pi (p|\alpha ,\beta )\,dp\\&={n \choose k}{\frac {1}{{\mathrm {B}}(\alpha ,\beta )}}\int _{0}^{1}p^{{k+\alpha -1}}(1-p)^{{n-k+\beta -1}}\,dp\\&={n \choose k}{\frac {{\mathrm {B}}(k+\alpha ,n-k+\beta )}{{\mathrm {B}}(\alpha ,\beta )}}.\end{aligned}}$

Using the properties of the beta function, this can alternatively be written

$f(k|n,\alpha ,\beta )={\frac {\Gamma (n+1)}{\Gamma (k+1)\Gamma (n-k+1)}}{\frac {\Gamma (k+\alpha )\Gamma (n-k+\beta )}{\Gamma (n+\alpha +\beta )}}{\frac {\Gamma (\alpha +\beta )}{\Gamma (\alpha )\Gamma (\beta )}}.$
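The gamma-function form above translates directly into code. The following is a minimal sketch, assuming only Python's standard library; the helper names `log_beta` and `betabinom_pmf` are illustrative and do not come from any particular package. Working on the log scale avoids overflow in the gamma functions for large arguments.

```python
from math import exp, lgamma

def log_beta(a, b):
    """log B(a, b), computed from log-gamma values."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(k, n, alpha, beta):
    """P(X = k) for a beta-binomial with parameters n, alpha, beta."""
    if k < 0 or k > n:
        return 0.0
    # log of the binomial coefficient n choose k
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose
               + log_beta(k + alpha, n - k + beta)
               - log_beta(alpha, beta))

# The probabilities over the support 0..n should sum to one.
probs = [betabinom_pmf(k, 10, 2.0, 3.0) for k in range(11)]

# Special case alpha = beta = 1: discrete uniform on 0..n.
uniform = [betabinom_pmf(k, 5, 1.0, 1.0) for k in range(6)]
```

With n = 1 the same function gives the Bernoulli case, where P(X = 1) equals α/(α + β).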

It is within this context that the beta-binomial distribution appears often in Bayesian statistics: the beta-binomial is the predictive distribution of a binomial random variable with a beta distribution prior on the success probability.

Beta-binomial as an urn model

The beta-binomial distribution can also be motivated via an urn model for positive integer values of α and β, known as the Pólya urn model. Specifically, imagine an urn containing α red balls and β black balls, from which random draws are made. If a red ball is drawn, it is replaced and another red ball is added to the urn. Likewise, if a black ball is drawn, it is replaced and another black ball is added to the urn. If this is repeated n times, then the number of red balls drawn follows a beta-binomial distribution with parameters n, α and β.

Note that if the random draws are with simple replacement (no balls over and above the observed ball are added to the urn), then the distribution follows a binomial distribution and if the random draws are made without replacement, the distribution follows a hypergeometric distribution.
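The urn description can be checked against the closed-form pmf. The sketch below is one possible way to do so, using exact rational arithmetic rather than simulation; the function names and the example urn (3 red, 2 black, 6 draws) are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, exp, lgamma

def polya_urn_distribution(n, red, black):
    """Exact distribution of the number of red draws in n Polya-urn draws."""
    dist = {0: Fraction(1)}  # dist[k] = P(k red balls drawn so far)
    for t in range(n):
        total = red + black + t  # one extra ball is added after every draw
        nxt = {}
        for k, p in dist.items():
            p_red = Fraction(red + k, total)  # k extra red balls so far
            nxt[k + 1] = nxt.get(k + 1, 0) + p * p_red
            nxt[k] = nxt.get(k, 0) + p * (1 - p_red)
        dist = nxt
    return [dist.get(k, Fraction(0)) for k in range(n + 1)]

def betabinom_pmf(k, n, a, b):
    """Closed-form beta-binomial pmf via log-gamma."""
    def log_beta(x, y):
        return lgamma(x) + lgamma(y) - lgamma(x + y)
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))

urn = polya_urn_distribution(6, 3, 2)
closed_form = [betabinom_pmf(k, 6, 3, 2) for k in range(7)]
```

The two lists should agree up to floating-point rounding, since each particular sequence with k red draws has the same probability regardless of order.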

Moments and properties

The first three raw moments are

${\begin{aligned}\mu _{1}&={\frac {n\alpha }{\alpha +\beta }}\\[8pt]\mu _{2}&={\frac {n\alpha [n(1+\alpha )+\beta ]}{(\alpha +\beta )(1+\alpha +\beta )}}\\[8pt]\mu _{3}&={\frac {n\alpha [n^{{2}}(1+\alpha )(2+\alpha )+3n(1+\alpha )\beta +\beta (\beta -\alpha )]}{(\alpha +\beta )(1+\alpha +\beta )(2+\alpha +\beta )}}\end{aligned}}$

and the kurtosis (not the excess kurtosis) is

$\gamma _{2}={\frac {(\alpha +\beta )^{2}(1+\alpha +\beta )}{n\alpha \beta (\alpha +\beta +2)(\alpha +\beta +3)(\alpha +\beta +n)}}\left[(\alpha +\beta )(\alpha +\beta -1+6n)+3\alpha \beta (n-2)+6n^{2}-{\frac {3\alpha \beta n(6-n)}{\alpha +\beta }}-{\frac {18\alpha \beta n^{{2}}}{(\alpha +\beta )^{2}}}\right].$

Letting $\pi ={\frac {\alpha }{\alpha +\beta }}\!$ we note, suggestively, that the mean can be written as

$\mu ={\frac {n\alpha }{\alpha +\beta }}=n\pi \!$

and the variance as

$\sigma ^{2}={\frac {n\alpha \beta (\alpha +\beta +n)}{(\alpha +\beta )^{2}(\alpha +\beta +1)}}=n\pi (1-\pi ){\frac {\alpha +\beta +n}{\alpha +\beta +1}}=n\pi (1-\pi )[1+(n-1)\rho ]\!$

where $\rho ={\tfrac {1}{\alpha +\beta +1}}\!$ is the pairwise correlation between the n Bernoulli draws and is called the over-dispersion parameter.
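As a quick numerical check of the overdispersion form of the variance, the sketch below computes the mean and variance directly from the pmf and compares them with nπ and nπ(1 − π)[1 + (n − 1)ρ]. The parameter values n = 10, α = 2, β = 3 are arbitrary choices for illustration.

```python
from math import exp, lgamma

def betabinom_pmf(k, n, alpha, beta):
    """Beta-binomial pmf via log-gamma."""
    def log_beta(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta)
               - log_beta(alpha, beta))

n, alpha, beta = 10, 2.0, 3.0
pmf = [betabinom_pmf(k, n, alpha, beta) for k in range(n + 1)]

# Moments computed directly from the pmf
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

# Closed forms in terms of pi and the over-dispersion parameter rho
pi = alpha / (alpha + beta)
rho = 1.0 / (alpha + beta + 1.0)
mean_closed = n * pi
var_closed = n * pi * (1 - pi) * (1 + (n - 1) * rho)
var_binomial = n * pi * (1 - pi)  # variance of Bin(n, pi) for comparison
```

Here π = 0.4 and ρ = 1/6, so the variance is inflated by the factor 1 + 9/6 = 2.5 relative to a binomial with the same mean.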

Point estimates

Method of moments

The method of moments estimates can be obtained from the first and second raw moments of the beta-binomial, namely

${\begin{aligned}\mu _{1}&={\frac {n\alpha }{\alpha +\beta }}\\\mu _{2}&={\frac {n\alpha [n(1+\alpha )+\beta ]}{(\alpha +\beta )(1+\alpha +\beta )}}\end{aligned}}$

and setting these raw moments equal to the sample moments

${\begin{aligned}{\hat {\mu }}_{1}&=m_{1}\\{\hat {\mu }}_{2}&=m_{2}\end{aligned}}$

and solving for α and β we get

${\begin{aligned}{\hat {\alpha }}&={\frac {nm_{1}-m_{2}}{n({\frac {m_{2}}{m_{1}}}-m_{1}-1)+m_{1}}}\\{\hat {\beta }}&={\frac {(n-m_{1})(n-{\frac {m_{2}}{m_{1}}})}{n({\frac {m_{2}}{m_{1}}}-m_{1}-1)+m_{1}}}.\end{aligned}}$

Note that these estimates can be nonsensically negative, which is evidence that the data are either undispersed or underdispersed relative to the binomial distribution. In these cases, the binomial distribution and the hypergeometric distribution, respectively, are alternative candidates.

Maximum likelihood estimation

While closed-form maximum likelihood estimates are impractical, the pmf consists of common functions (gamma and/or beta functions), so the estimates can easily be found via direct numerical optimization. Maximum likelihood estimates from empirical data can be computed using general methods for fitting multinomial Pólya distributions, which are described in Minka (2003). The R package VGAM, through the function vglm, fits GLM-type models by maximum likelihood with responses distributed according to the beta-binomial distribution. Note also that n need not be fixed across observations.

Example

The following data give the number of male children among the first 12 children in each of 6115 families of size 13, taken from hospital records in 19th-century Saxony (Sokal and Rohlf, p. 59, from Lindsey). The 13th child is ignored to mitigate the effect of families non-randomly stopping when a desired gender is reached.

Males	0	1	2	3	4	5	6	7	8	9	10	11	12
Families	3	24	104	286	670	1033	1343	1112	829	478	181	45	7

We note the first two sample moments are

${\begin{aligned}m_{1}&=6.23\\m_{2}&=42.31\\n&=12\end{aligned}}$

and therefore the method of moments estimates are

${\begin{aligned}{\hat {\alpha }}&=34.1350\\{\hat {\beta }}&=31.6085.\end{aligned}}$
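The sample moments and the method of moments estimates can be reproduced from the frequency table. The sketch below simply applies the formulas for the estimators given earlier to the tabulated counts; the variable and function names are illustrative.

```python
# Number of families with k = 0, 1, ..., 12 male children (table above)
families = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n = 12
total = sum(families)

# First and second raw sample moments
m1 = sum(k * f for k, f in enumerate(families)) / total
m2 = sum(k * k * f for k, f in enumerate(families)) / total

def method_of_moments(m1, m2, n):
    """Method of moments estimates (alpha_hat, beta_hat)."""
    denom = n * (m2 / m1 - m1 - 1) + m1
    alpha_hat = (n * m1 - m2) / denom
    beta_hat = (n - m1) * (n - m2 / m1) / denom
    return alpha_hat, beta_hat

alpha_hat, beta_hat = method_of_moments(m1, m2, n)
print(round(m1, 2), round(m2, 2), round(alpha_hat, 3), round(beta_hat, 3))
```

Because the denominator is a small difference of larger quantities, the estimates are sensitive to rounding: plugging in the two-decimal moments 6.23 and 42.31 instead of the unrounded values gives noticeably different numbers.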

The maximum likelihood estimates can be found numerically

${\begin{aligned}{\hat \alpha }_{{\mathrm {mle}}}&=34.09558\\{\hat \beta }_{{\mathrm {mle}}}&=31.5715\end{aligned}}$

and the maximized log-likelihood is

$\log {\mathcal {L}}=-12492.9$

from which we find the AIC

${\mathit {AIC}}=24989.74.$

The AIC for the competing binomial model is AIC = 25070.34, and thus the beta-binomial model provides a superior fit to the data, i.e. there is evidence for overdispersion. Trivers and Willard posit a theoretical justification for heterogeneity in gender-proneness among families (i.e. overdispersion).
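A direct numerical optimization of the kind described can be sketched with the standard library alone. The code below is an illustrative approach, not the procedure used to produce the figures above: it maximizes the log-likelihood by a simple compass (pattern) search over logit(α/(α+β)) and log(α+β), a parameterization chosen here as an assumption because it keeps both parameters positive. The starting point and tolerances are arbitrary.

```python
from math import exp, lgamma, log

families = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n = 12
log_choose = [lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
              for k in range(n + 1)]

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_loglik(alpha, beta):
    """Log-likelihood of the grouped data under the beta-binomial."""
    base = log_beta(alpha, beta)
    return sum(f * (log_choose[k] + log_beta(k + alpha, n - k + beta) - base)
               for k, f in enumerate(families))

def binom_loglik(p):
    """Log-likelihood of the grouped data under Bin(n, p)."""
    return sum(f * (log_choose[k] + k * log(p) + (n - k) * log(1.0 - p))
               for k, f in enumerate(families))

def to_alpha_beta(u, v):
    """Map (logit of the mean, log of alpha + beta) to (alpha, beta)."""
    mu = 1.0 / (1.0 + exp(-u))
    m = exp(v)
    return m * mu, m * (1.0 - mu)

def compass_search(objective, x, step=0.5, tol=1e-6):
    """Maximize objective by axis-aligned trial steps, halving on failure."""
    x = list(x)
    best = objective(*x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[i] += sign * step
                value = objective(*trial)
                if value > best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return x, best

start = (0.0, log(10.0))  # arbitrary start: alpha = beta = 5
(u_hat, v_hat), ll_bb = compass_search(
    lambda u, v: betabinom_loglik(*to_alpha_beta(u, v)), start)
alpha_mle, beta_mle = to_alpha_beta(u_hat, v_hat)

# Binomial competitor: its MLE is the pooled proportion of successes.
p_hat = sum(k * f for k, f in enumerate(families)) / (n * sum(families))
ll_bin = binom_loglik(p_hat)

aic_bb = 2 * 2 - 2 * ll_bb    # two free parameters
aic_bin = 2 * 1 - 2 * ll_bin  # one free parameter
print(alpha_mle, beta_mle, ll_bb, aic_bb, aic_bin)
```

In practice a library optimizer would normally replace the hand-written search; the point of the sketch is only that the log-likelihood needs nothing beyond log-gamma evaluations.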

The superior fit is especially evident in the tails:

Males	0	1	2	3	4	5	6	7	8	9	10	11	12
Observed Families	3	24	104	286	670	1033	1343	1112	829	478	181	45	7
Predicted (Beta-Binomial)	2.3	22.6	104.8	310.9	655.7	1036.2	1257.9	1182.1	853.6	461.9	177.9	43.8	5.2
Predicted (Binomial p = 0.519215)	0.9	12.1	71.8	258.5	628.1	1085.2	1367.3	1265.6	854.2	410.0	132.8	26.1	2.3
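The predicted rows are the fitted probabilities multiplied by the number of families. A minimal sketch of that calculation follows, assuming the maximum likelihood estimates quoted above for the beta-binomial and the pooled proportion of males for the binomial; the variable names are illustrative.

```python
from math import comb, exp, lgamma

observed = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n = 12
total = sum(observed)

# Beta-binomial parameters: the maximum likelihood estimates quoted above
alpha, beta = 34.09558, 31.5715
# Binomial parameter: pooled proportion of male children
p = sum(k * f for k, f in enumerate(observed)) / (n * total)

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Expected number of families with k male children under each model
predicted_bb = [
    total * comb(n, k)
    * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))
    for k in range(n + 1)
]
predicted_bin = [
    total * comb(n, k) * p ** k * (1 - p) ** (n - k)
    for k in range(n + 1)
]

for k in range(n + 1):
    print(k, observed[k], round(predicted_bb[k], 1), round(predicted_bin[k], 1))
```

Both predicted rows sum to the 6115 observed families, so the comparison is purely about how the mass is spread across the support.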

Further Bayesian considerations

It is convenient to reparameterize the distributions so that the expected mean of the prior is a single parameter: Let

${\begin{aligned}\pi (\theta |\mu ,M)&=\operatorname {Beta}(M\mu ,M(1-\mu ))\\&={\frac {\Gamma (M)}{\Gamma (M\mu )\Gamma (M(1-\mu ))}}\theta ^{{M\mu -1}}(1-\theta )^{{M(1-\mu )-1}}\end{aligned}}$

where

${\begin{aligned}\mu &={\frac {\alpha }{\alpha +\beta }}\\M&=\alpha +\beta \end{aligned}}$

so that

${\begin{aligned}\operatorname {E}(\theta |\mu ,M)&=\mu \\\operatorname {Var}(\theta |\mu ,M)&={\frac {\mu (1-\mu )}{M+1}}.\end{aligned}}$

The posterior distribution ρ(θ|k) is also a beta distribution:

${\begin{aligned}\rho (\theta |k)&\propto \ell (k|\theta )\pi (\theta |\mu ,M)\\&\propto \theta ^{{k+M\mu -1}}(1-\theta )^{{n-k+M(1-\mu )-1}},\\\rho (\theta |k)&=\operatorname {Beta}(k+M\mu ,n-k+M(1-\mu ))\\&={\frac {\Gamma (n+M)}{\Gamma (k+M\mu )\Gamma (n-k+M(1-\mu ))}}\theta ^{{k+M\mu -1}}(1-\theta )^{{n-k+M(1-\mu )-1}}\end{aligned}}$

and its mean is

$\operatorname {E}(\theta |k)={\frac {k+M\mu }{n+M}},$

while the marginal distribution m(k|μ, M) is given by

${\begin{aligned}m(k|\mu ,M)&=\int _{0}^{1}\ell (k|\theta )\pi (\theta |\mu ,M)\,d\theta \\&={\frac {\Gamma (M)}{\Gamma (M\mu )\Gamma (M(1-\mu ))}}{n \choose k}\int _{{0}}^{{1}}\theta ^{{k+M\mu -1}}(1-\theta )^{{n-k+M(1-\mu )-1}}d\theta \\&={\frac {\Gamma (M)}{\Gamma (M\mu )\Gamma (M(1-\mu ))}}{n \choose k}{\frac {\Gamma (k+M\mu )\Gamma (n-k+M(1-\mu ))}{\Gamma (n+M)}}.\end{aligned}}$

Because the marginal is a non-linear function of gamma functions, and its derivatives involve digamma functions, the marginal maximum likelihood estimates (MMLE) of μ and M have no closed form. Instead, we use the method of iterated expectations to obtain the marginal mean and variance, and from them moment estimates.

Let us write our model as a two-stage compound sampling model. Let k_i be the number of successes out of n_i trials for event i:

${\begin{aligned}k_{i}&\sim \operatorname {Bin}(n_{i},\theta _{i})\\\theta _{i}&\sim \operatorname {Beta}(M\mu ,M(1-\mu )),\ {\mathrm {i.i.d.}}\end{aligned}}$

We can find iterated moment estimates for the mean and variance using the moments for the distributions in the two-stage model:

$\operatorname {E}\left({\frac {k}{n}}\right)=\operatorname {E}\left[\operatorname {E}\left(\left.{\frac {k}{n}}\right|\theta \right)\right]=\operatorname {E}(\theta )=\mu$

${\begin{aligned}\operatorname {var}\left({\frac {k}{n}}\right)&=\operatorname {E}\left[\operatorname {var}\left(\left.{\frac {k}{n}}\right|\theta \right)\right]+\operatorname {var}\left[\operatorname {E}\left(\left.{\frac {k}{n}}\right|\theta \right)\right]\\&=\operatorname {E}\left[\left(\left.{\frac {1}{n}}\right)\theta (1-\theta )\right|\mu ,M\right]+\operatorname {var}\left(\theta |\mu ,M\right)\\&={\frac {1}{n}}\left(\mu (1-\mu )\right)+{\frac {n-1}{n}}{\frac {(\mu (1-\mu ))}{M+1}}\\&={\frac {\mu (1-\mu )}{n}}\left(1+{\frac {n-1}{M+1}}\right).\end{aligned}}$

(Here we have used the law of total expectation and the law of total variance.)
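The third line of the variance derivation uses a step that is not written out. Using only the prior mean and variance given earlier, it can be spelled out as

${\begin{aligned}\operatorname {E}[\theta (1-\theta )]&=\operatorname {E}(\theta )-\left[\operatorname {var}(\theta )+\operatorname {E}(\theta )^{2}\right]\\&=\mu (1-\mu )-{\frac {\mu (1-\mu )}{M+1}},\end{aligned}}$

so that

${\frac {1}{n}}\operatorname {E}[\theta (1-\theta )]+\operatorname {var}(\theta )={\frac {\mu (1-\mu )}{n}}+{\frac {n-1}{n}}{\frac {\mu (1-\mu )}{M+1}}.$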

We want point estimates for $\mu$ and $M$. The estimated mean ${\hat {\mu }}$ is calculated from the sample

${\hat {\mu }}={\frac {\sum _{{i=1}}^{N}k_{i}}{\sum _{{i=1}}^{N}n_{i}}}.$

The estimate of the hyperparameter M is obtained using the moment estimates for the variance of the two-stage model:

$s^{2}={\frac {1}{N}}\sum _{{i=1}}^{N}\operatorname {var}\left({\frac {k_{{i}}}{n_{{i}}}}\right)={\frac {1}{N}}\sum _{{i=1}}^{N}{\frac {{\hat {\mu }}(1-{\hat {\mu }})}{n_{i}}}\left[1+{\frac {n_{i}-1}{\widehat {M}+1}}\right]$

Solving:

$\widehat {M}={\frac {{\hat {\mu }}(1-{\hat {\mu }})-s^{2}}{s^{2}-{\frac {{\hat {\mu }}(1-{\hat {\mu }})}{N}}\sum _{{i=1}}^{N}1/n_{i}}},$

where

$s^{2}={\frac {N\sum _{{i=1}}^{N}n_{i}({\hat {\theta _{i}}}-{\hat {\mu }})^{2}}{(N-1)\sum _{{i=1}}^{N}n_{i}}}.$

Since we now have parameter point estimates, ${\hat {\mu }}$ and $\widehat {M}$, for the underlying distribution, we would like to find a point estimate ${\tilde {\theta }}_{i}$ for the probability of success for event i. This is a weighted average of the event estimate ${\hat {\theta }}_{i}=k_{i}/n_{i}$ and ${\hat {\mu }}$. Given our point estimates for the prior, we may now plug in these values to find a point estimate for the posterior mean:

${\tilde {\theta _{i}}}=E(\theta |k_{i})={\frac {k_{i}+\widehat {M}{\hat {\mu }}}{n_{i}+\widehat {M}}}={\frac {\widehat {M}}{n_{i}+\widehat {M}}}{\hat {\mu }}+{\frac {n_{i}}{n_{i}+\widehat {M}}}{\frac {k_{i}}{n_{i}}}.$

Shrinkage factors

We may write the posterior estimate as a weighted average:

${\tilde {\theta }}_{i}={\hat {B}}_{i}\,{\hat {\mu }}+(1-{\hat {B}}_{i}){\hat {\theta }}_{i}$

where ${\hat {B}}_{i}$ is called the shrinkage factor.

${\hat {B_{i}}}={\frac {{\hat {M}}}{{\hat {M}}+n_{i}}}$
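The whole empirical Bayes recipe (pooled mean, weighted sample variance, moment estimate of M, then shrinkage) fits in a short function. The sketch below follows the formulas of this section; the function name and the example counts are hypothetical and chosen only for illustration.

```python
def empirical_bayes_shrinkage(k, n):
    """Return (mu_hat, M_hat, shrunken estimates) for k[i] successes in n[i] trials."""
    N = len(k)
    mu = sum(k) / sum(n)                          # pooled mean
    theta = [ki / ni for ki, ni in zip(k, n)]     # raw event estimates
    # Weighted sample variance s^2 of the raw estimates
    s2 = (N * sum(ni * (t - mu) ** 2 for ni, t in zip(n, theta))
          / ((N - 1) * sum(n)))
    spread = mu * (1 - mu)
    denom = s2 - spread / N * sum(1 / ni for ni in n)
    if denom <= 0 or spread <= s2:
        # The moment equations have no positive solution for M.
        raise ValueError("moment estimate of M is not positive for these data")
    M = (spread - s2) / denom
    # Shrink each raw estimate toward mu with factor B_i = M / (M + n_i)
    shrunk = [(M / (M + ni)) * mu + (ni / (M + ni)) * t
              for ni, t in zip(n, theta)]
    return mu, M, shrunk

# Hypothetical data: six events with 20 trials each
k = [2, 9, 5, 14, 3, 12]
n = [20, 20, 20, 20, 20, 20]
mu_hat, M_hat, theta_tilde = empirical_bayes_shrinkage(k, n)
```

Each shrunken estimate lies between the raw proportion k_i/n_i and the pooled mean, and events with fewer trials are pulled more strongly toward the pooled mean. The explicit error for a non-positive M is a design choice of this sketch; the text does not prescribe what to do in that case.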

Related distributions

$BB(1,1,n)\sim U(0,n)\,$ where $U(a,b)\,$ is the discrete uniform distribution.

References

Minka, Thomas P. (2003). Estimating a Dirichlet distribution. Microsoft Technical Report.

External links

Using the Beta-binomial distribution to assess performance of a biometric identification device
Fastfit contains Matlab code for fitting Beta-Binomial distributions (in the form of two-dimensional Pólya distributions) to data.
Interactive graphic: Univariate Distribution Relationships
Beta-Binomial distribution package for R

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.