Exponential family
From Wikipedia, the free encyclopedia
In probability and statistics, an exponential family is any class of probability distributions having a certain form. This form is chosen for mathematical convenience, on account of some useful algebraic properties, as well as for generality, since exponential families are in a sense very natural distributions to consider. The exponential family first appeared in independent work by E. J. G. Pitman, G. Darmois, and B. O. Koopman in 1935–36.
There are both discrete and continuous exponential families that are useful and important in theoretical or practical work. We use cumulative distribution functions (cdf) in order to encompass both discrete and continuous distributions.
Suppose H is a non-decreasing function of a real variable and H(x) approaches 0 as x approaches −∞. Then Lebesgue-Stieltjes integrals with respect to dH(x) are integrals with respect to the "reference measure" of the exponential family generated by H.
Any member of that exponential family has cumulative distribution function
dF(x) = exp(ηTT(x) − A(η)) dH(x).
If F is a continuous distribution with a density, one can write dF(x) = f(x) dx. The meanings of the different symbols on the right-hand side are as follows:
- H(x) is a Lebesgue-Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then so is H (with the same support).
- η is the natural parameter, a column vector, so that ηT = (η1, ..., ηn), its transpose, is a row vector. The parameter space—i.e., the set of values of η for which this function is integrable—is necessarily convex.
- T(x) is the sufficient statistic of the distribution, and it is a column vector whose number of scalar components is the same as that of η so that ηTT(x) is a scalar. (Note that the concept of sufficient statistic applies more broadly than just to members of the exponential family.)
- and A(η) is a normalization factor without which F would not be a probability distribution. The function A is important in its own right, because when the reference measure dH(x) is a probability measure, A is the cumulant-generating function of the probability distribution of the sufficient statistic T(X) when the distribution of X is dH(x).
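As a concrete illustration (not part of the original exposition), the following Python sketch writes the Bernoulli(p) distribution in the form above, with counting measure on {0, 1} as reference measure, T(x) = x, η = log(p/(1 − p)) and A(η) = log(1 + e^η); the choice of distribution and the code are illustrative assumptions only.

```python
# Hedged sketch: the Bernoulli(p) pmf recovered from the exponential-family form
# dF(x) = exp(eta*T(x) - A(eta)) dH(x), with H = counting measure on {0, 1},
# T(x) = x, eta = log(p/(1-p)) and A(eta) = log(1 + exp(eta)).
import math

def bernoulli_pmf_natural(x, eta):
    """Mass at x in {0, 1} computed from the natural-parameter form."""
    A = math.log(1.0 + math.exp(eta))   # log-normalizer A(eta)
    return math.exp(eta * x - A)

p = 0.3
eta = math.log(p / (1.0 - p))           # natural parameter
for x in (0, 1):
    direct = p**x * (1.0 - p)**(1 - x)  # ordinary parametrization
    print(x, bernoulli_pmf_natural(x, eta), direct)  # the two columns agree
```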
Examples
The normal, gamma, chi-square, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial, and geometric distributions are all exponential families. The Weibull distributions do not comprise an exponential family, nor do the Cauchy distributions or uniform distributions.
- The binomial distribution. Suppose H is the function that steps upward by the binomial coefficient C(n, x) at each x ∈ {0, 1, 2, ..., n}. The probability mass function is
f(x) = C(n, x) p^x (1 − p)^(n − x)
- for x ∈ {0, 1, 2, ..., n}. Let F be the cumulative distribution function. Then
dF(x) = p^x (1 − p)^(n − x) dH(x) = exp(x log(p/(1 − p)) + n log(1 − p)) dH(x),
- so the "natural parameter" η (the same as a Lagrange multiplier in the maximum entropy formulation) for this family of distributions is
η = log(p/(1 − p))
- (see the numerical sketch below).
- [more to be added here....]
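A small numerical check of the binomial example, assuming (as sketched above) a reference measure with mass C(n, x) at each x, T(x) = x, η = log(p/(1 − p)) and A(η) = n log(1 + e^η); the particular values of n and p are arbitrary.

```python
# Hedged sketch: the binomial pmf recovered from the natural-parameter form.
# Reference measure dH puts mass C(n, x) at x in {0, ..., n}, T(x) = x,
# eta = log(p/(1-p)), and A(eta) = n*log(1 + exp(eta)).
import math

n, p = 10, 0.4
eta = math.log(p / (1.0 - p))
A = n * math.log(1.0 + math.exp(eta))

for x in range(n + 1):
    h = math.comb(n, x)                      # mass of the reference measure at x
    ef_form = h * math.exp(eta * x - A)      # exp-family form exp(eta*T(x) - A(eta)) dH(x)
    direct = math.comb(n, x) * p**x * (1 - p)**(n - x)
    assert abs(ef_form - direct) < 1e-12
print("binomial pmf recovered from the natural-parameter form")
```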
Maximum entropy derivation
The exponential family arises naturally as the answer to the following question: what is the maximum entropy distribution consistent with given constraints on expected values?
The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x). As an aside, for a frequentist this is a largely arbitrary choice, while a Bayesian can make this choice part of the prior probability distribution.
The entropy of dF(x) relative to dH(x) is
S[dF|dH] = −∫ (dF/dH) log(dF/dH) dH(x)
or
S[dF|dH] = ∫ log(dH/dF) dF(x),
where dF/dH and dH/dF are Radon–Nikodym derivatives. Note that the ordinary definition of entropy for a discrete distribution supported on a set I, namely
S = −Σi∈I pi log pi,
assumes (though this is seldom pointed out) that dH is chosen to be counting measure on I.
Consider now a collection of observable quantities (random variables) Ti. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of Ti be equal to ti, is a member of the exponential family with dH as reference measure and (T1, ..., Tn) as sufficient statistic.
The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting T0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to T0.
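The following sketch illustrates the derivation numerically under stated assumptions: counting measure on {0, 1, 2, 3} as reference measure and a single constraint E[T] = t with T(x) = x. Solving for the Lagrange multiplier gives a distribution of the exponential-family form p(x) ∝ exp(ηx); the support, the target mean t, and the use of scipy's brentq root finder are all illustrative choices.

```python
# Illustrative sketch: maximum entropy relative to counting measure on {0,1,2,3}
# subject to E[T] = t with T(x) = x gives p(x) proportional to exp(eta*x);
# the Lagrange multiplier eta is found numerically here.
import numpy as np
from scipy.optimize import brentq

support = np.arange(4)          # {0, 1, 2, 3}
t = 1.2                         # required expected value of T(X) = X

def mean_given_eta(eta):
    w = np.exp(eta * support)
    p = w / w.sum()             # exponential-family form with counting reference measure
    return p @ support

eta = brentq(lambda e: mean_given_eta(e) - t, -20, 20)   # solve E_eta[T] = t
w = np.exp(eta * support)
p = w / w.sum()
print("eta =", eta, "probabilities =", p, "mean =", p @ support)
```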
Role in statistics
Classical estimation: sufficiency
According to the Pitman-Koopman-Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. More long-windedly, suppose Xn, n = 1, 2, 3, ... are independent identically distributed random variables whose distribution is known to be in some family of probability distributions. Only if that family is an exponential family is there a (possibly vector-valued) sufficient statistic T(X1, ..., Xn) whose number of scalar components does not increase as the sample size n increases.
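To make the claim concrete, here is a small sketch (my own Poisson example, not part of the theorem's statement): the log-likelihood of an i.i.d. Poisson sample depends on the data only through the one-dimensional statistic Σxi, so the difference of log-likelihoods of two samples with the same sum is constant in the parameter, however large n is.

```python
# Sketch illustrating fixed-dimension sufficiency for an exponential family (Poisson):
# the likelihood of an i.i.d. sample depends on the data only through sum(x), so the
# one-dimensional statistic T = sum(X_1, ..., X_n) stays sufficient as n grows.
import math

def poisson_loglik(sample, lam):
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in sample)

a = [2, 0, 3, 1, 4]        # two hypothetical samples with the same sum (10)
b = [5, 1, 2, 1, 1]
for lam in (0.5, 1.0, 2.0, 5.0):
    diff = poisson_loglik(a, lam) - poisson_loglik(b, lam)
    print(lam, round(diff, 6))   # constant in lam: the likelihood ratio is free of the parameter
```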
Bayesian estimation: conjugate distributions
Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there exists a conjugate prior, which is often also in the exponential family. A conjugate prior π for the parameter η of an exponential family is given by
π(η) ∝ exp(ηTχ − β A(η)),
where the vector χ and the scalar β > 0 are hyperparameters.
A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then choosing a beta distribution as the prior yields another beta distribution as the posterior. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution, the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the success probability of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution.
An arbitrary likelihood will not belong to the exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.
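A minimal sketch of the beta–binomial conjugacy mentioned above, assuming hypothetical prior parameters and data; it compares the closed-form conjugate update Beta(a + k, b + n − k) with a brute-force grid normalization of prior × likelihood.

```python
# Hedged sketch of beta-binomial conjugacy (hypothetical numbers): a Beta(a, b) prior
# combined with a binomial likelihood gives a Beta(a + k, b + n - k) posterior.
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 3.0          # prior hyperparameters (illustrative)
n, k = 20, 7             # hypothetical data: k successes in n trials

theta = np.linspace(1e-6, 1 - 1e-6, 10001)
unnorm = beta.pdf(theta, a, b) * binom.pmf(k, n, theta)           # prior times likelihood
grid_posterior = unnorm / (unnorm.sum() * (theta[1] - theta[0]))  # normalize on the grid

conjugate_posterior = beta.pdf(theta, a + k, b + n - k)           # closed-form conjugate update
print(np.max(np.abs(grid_posterior - conjugate_posterior)))       # small discretization error
```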
Statistical inference
Sampling distributions
As discussed above, the sufficient statistic (T1, ..., Tn) plays a pivotal role in statistical inference, whether classical or Bayesian. Accordingly, it is interesting to study its sampling distribution. That is, if X1, ..., Xm is a random sample—that is, a collection of independent, identically-distributed random variables—drawn from a distribution in the exponential family, we want to know the probability distribution of the statistic
T̄i = (Ti(X1) + ... + Ti(Xm)) / m.
Letting T0 = 1, we can write
dF(x) = exp(ηiTi(x) − η0T0(x)) dH(x),
using Einstein's summation convention, namely
ηiTi(x) = η1T1(x) + ... + ηnTn(x).
Then,
Z(η) = ∫ exp(ηiTi(x)) dH(x)
is what physicists call the partition function in statistical mechanics. The condition that dF be normalized implies that η0 = log Z(η) = A(η), as anticipated in the above section on information entropy.
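As a check of the relation A(η) = log Z(η), here is a sketch using the binomial family again (my own choice, consistent with the example section): with reference masses C(n, x), the partition function is Z(η) = Σx C(n, x) e^(ηx) = (1 + e^η)^n, so log Z equals n log(1 + e^η) = A(η).

```python
# Sketch: for the binomial family with reference masses C(n, x), the partition function
# Z(eta) = sum_x C(n, x) * exp(eta * x) equals (1 + exp(eta))**n, so A(eta) = log Z(eta).
import math

n, eta = 10, 0.7                            # illustrative values
Z = sum(math.comb(n, x) * math.exp(eta * x) for x in range(n + 1))
A = n * math.log(1.0 + math.exp(eta))       # cumulant function of the binomial family
print(math.log(Z), A)                       # the two numbers agree
```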
Next, it is straightforward to check that
∂A(η)/∂ηi = E[Ti(X)],
denoted ti, and
∂²A(η)/∂ηi∂ηj = Cov(Ti(X), Tj(X)),
denoted tij. As the same information can be obtained from either Z or A, it is not necessary to normalize the probability distribution dF by setting η0 = A before taking the derivatives. Also, the function A(η) is the cumulant-generating function of the distribution of T, not just for dF or dH, but for the entire exponential subfamily with the given dH and T.
The equations
ti = ∂A(η)/∂ηi,    i = 1, ..., n,
can usually be solved to find η as a function of the ti, which means that either set of parameters can be used to completely specify a member of the specific subfamily under consideration. In that case, the covariances tij can also be expressed in terms of the ti, which is useful for estimation purposes as we shall see below.
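For instance, in the Poisson family (again an illustrative choice of mine) the cumulant function is A(η) = e^η, so the first derivative should reproduce the mean t = λ and the second derivative the variance; the sketch below checks this by finite differences.

```python
# Sketch (Poisson example): A(eta) = exp(eta), so dA/deta should equal the mean of
# T(X) = X and d^2A/deta^2 its variance. Checked here by finite differences.
import math

def A(eta):                      # cumulant function of the Poisson family
    return math.exp(eta)

lam = 2.5
eta = math.log(lam)              # natural parameter
h = 1e-5
dA = (A(eta + h) - A(eta - h)) / (2 * h)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
print(dA, d2A)                   # both approximately lam, the Poisson mean and variance
```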
We are now ready to consider the random samples mentioned earlier. It follows that
E[T̄i] = ti,
that is, the statistic T̄i is an unbiased estimator of ti. Moreover, since the elements of a random sample are assumed to be mutually independent,
Cov(T̄i, T̄j) = tij / m.
Because the covariance vanishes in the limit of large samples, the estimators are said to be consistent.
More generally, the kth cumulant of the distribution of T̄i can be seen to decay as 1/m^(k − 1), so the distribution of these statistics is asymptotically a multivariate normal distribution. To use asymptotic normality (as one would in the construction of confidence intervals) one needs an estimate of the covariances. Therefore we also need to look at the sampling distribution of
t̂ij = (1/(m − 1)) Σr (Ti(Xr) − T̄i)(Tj(Xr) − T̄j), the sum running over r = 1, ..., m.
This is easily seen to be an unbiased estimator of tij, but consistency and asymptotic chi-squared behaviour are rather more involved, and depend on the third and fourth cumulants of dF.
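A Monte Carlo sketch of the unbiasedness and consistency claims, under the assumption of a Poisson(λ) family with T(X) = X (so t = λ and t11 = λ): the average of the replicated sample means stays near λ while their variance shrinks roughly like λ/m.

```python
# Monte Carlo sketch (illustration only): for i.i.d. Poisson(lam) draws, the sample mean
# of T(X) = X is unbiased for t = lam and its spread shrinks like 1/m, consistent with
# the asymptotic normality discussed above.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.5
for m in (10, 100, 1000):
    means = rng.poisson(lam, size=(5000, m)).mean(axis=1)   # 5000 replicated samples of size m
    print(m, means.mean(), means.var())                     # mean ~ lam, variance ~ lam/m
```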
Hypothesis testing
Confidence intervals
External links
- A primer on the exponential family of distributions
- Exponential family of distributions, in the Earliest known uses of some of the words of mathematics