Exponential family
From Wikipedia, the free encyclopedia
In probability and statistics, an exponential family is any class of probability distributions having a certain form. This form is chosen for mathematical convenience, on account of some useful algebraic properties, as well as for generality, since exponential families are in a sense very natural distributions to consider. The exponential family first appeared in independent work by E. J. G. Pitman, G. Darmois, and B. O. Koopman in 1935–36.
There are both discrete and continuous exponential families that are useful and important in theoretical or practical work. We use cumulative distribution functions (cdf) in order to encompass both discrete and continuous distributions.
Suppose H is a non-decreasing function of a real variable and H(x) approaches 0 as x approaches −∞. Then Lebesgue-Stieltjes integrals with respect to dH(x) are integrals with respect to the "reference measure" of the exponential family generated by H.
Any member of that exponential family has cumulative distribution function
dF(x) = exp(ηTT(x) − A(η)) dH(x).
If F is a continuous distribution with a density, one can write dF(x) = f(x) dx. The meanings of the different symbols on the right-hand side are as follows:
- H(x) is a Lebesgue-Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then so is H (with the same support).
- η is the natural parameter, a column vector, so that ηT = (η1, ..., ηn), its transpose, is a row vector. The parameter space—i.e., the set of values of η for which this function is integrable—is necessarily convex.
- T(x) is the sufficient statistic of the distribution, and it is a column vector whose number of scalar components is the same as that of η so that ηTT(x) is a scalar. (Note that the concept of sufficient statistic applies more broadly than just to members of the exponential family.)
- and A(η) is a normalization factor without which F would not be a probability distribution. The function A is important in its own right, because when the reference measure dH(x) is a probability measure, A is the cumulant-generating function of the probability distribution of the sufficient statistic T(X) when the distribution of X is dH(x).
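As a concrete illustration (not part of the original exposition), the following Python sketch writes the Bernoulli(p) distribution in the form above, with counting measure on {0, 1} as reference measure, T(x) = x, η = log(p/(1 − p)) and A(η) = log(1 + e^η); the choice of distribution and the code are illustrative assumptions only.

```python
# Hedged sketch: the Bernoulli(p) pmf recovered from the exponential-family form
# dF(x) = exp(eta*T(x) - A(eta)) dH(x), with H = counting measure on {0, 1},
# T(x) = x, eta = log(p/(1-p)) and A(eta) = log(1 + exp(eta)).
import math

def bernoulli_pmf_natural(x, eta):
    """Mass at x in {0, 1} computed from the natural-parameter form."""
    A = math.log(1.0 + math.exp(eta))   # log-normalizer A(eta)
    return math.exp(eta * x - A)

p = 0.3
eta = math.log(p / (1.0 - p))           # natural parameter
for x in (0, 1):
    direct = p**x * (1.0 - p)**(1 - x)  # ordinary parametrization
    print(x, bernoulli_pmf_natural(x, eta), direct)  # the two columns agree
```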
Examples
The normal, gamma, chi-square, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial, and geometric distributions are all exponential families. The Weibull distributions do not comprise an exponential family, nor do the Cauchy distributions or uniform distributions.
- The binomial distribution. Suppose H is the function that steps upward by the binomial coefficient C(n, x) at each x ∈ {0, 1, 2, ..., n}. The probability mass function is
f(x) = C(n, x) p^x (1 − p)^(n − x)
- for x ∈ {0, 1, 2, ..., n}. Let F be the cumulative distribution function. Then
dF(x) = p^x (1 − p)^(n − x) dH(x) = exp(x log(p/(1 − p)) + n log(1 − p)) dH(x),
- so the "natural parameter" η (the same as a Lagrange multiplier in the maximum entropy formulation) for this family of distributions is
η = log(p/(1 − p))
- (see the numerical sketch below).
- [more to be added here....]
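A small numerical check of the binomial example, assuming (as sketched above) a reference measure with mass C(n, x) at each x, T(x) = x, η = log(p/(1 − p)) and A(η) = n log(1 + e^η); the particular values of n and p are arbitrary.

```python
# Hedged sketch: the binomial pmf recovered from the natural-parameter form.
# Reference measure dH puts mass C(n, x) at x in {0, ..., n}, T(x) = x,
# eta = log(p/(1-p)), and A(eta) = n*log(1 + exp(eta)).
import math

n, p = 10, 0.4
eta = math.log(p / (1.0 - p))
A = n * math.log(1.0 + math.exp(eta))

for x in range(n + 1):
    h = math.comb(n, x)                      # mass of the reference measure at x
    ef_form = h * math.exp(eta * x - A)      # exp-family form exp(eta*T(x) - A(eta)) dH(x)
    direct = math.comb(n, x) * p**x * (1 - p)**(n - x)
    assert abs(ef_form - direct) < 1e-12
print("binomial pmf recovered from the natural-parameter form")
```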
Maximum entropy derivation
The exponential family arises naturally as the answer to the following question: what is the maximum entropy distribution consistent with given constraints on expected values?
The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x). As an aside, for a frequentist this is a largely arbitrary choice, while a Bayesian can make this choice part of the prior probability distribution.
The entropy of dF(x) relative to dH(x) is
S[dF|dH] = −∫ (dF/dH) log(dF/dH) dH(x)
or
S[dF|dH] = ∫ log(dH/dF) dF(x),
where dF/dH and dH/dF are Radon–Nikodym derivatives. Note that the ordinary definition of entropy for a discrete distribution supported on a set I, namely
S = −Σi∈I pi log pi,
assumes (though this is seldom pointed out) that dH is chosen to be counting measure on I.
Consider now a collection of observable quantities (random variables) Ti. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of Ti be equal to ti, is a member of the exponential family with dH as reference measure and (T1, ..., Tn) as sufficient statistic.
The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting T0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to T0.
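The following sketch illustrates the derivation numerically under stated assumptions: counting measure on {0, 1, 2, 3} as reference measure and a single constraint E[T] = t with T(x) = x. Solving for the Lagrange multiplier gives a distribution of the exponential-family form p(x) ∝ exp(ηx); the support, the target mean t, and the use of scipy's brentq root finder are all illustrative choices.

```python
# Illustrative sketch: maximum entropy relative to counting measure on {0,1,2,3}
# subject to E[T] = t with T(x) = x gives p(x) proportional to exp(eta*x);
# the Lagrange multiplier eta is found numerically here.
import numpy as np
from scipy.optimize import brentq

support = np.arange(4)          # {0, 1, 2, 3}
t = 1.2                         # required expected value of T(X) = X

def mean_given_eta(eta):
    w = np.exp(eta * support)
    p = w / w.sum()             # exponential-family form with counting reference measure
    return p @ support

eta = brentq(lambda e: mean_given_eta(e) - t, -20, 20)   # solve E_eta[T] = t
w = np.exp(eta * support)
p = w / w.sum()
print("eta =", eta, "probabilities =", p, "mean =", p @ support)
```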
Role in statistics
Classical estimation: sufficiency
According to the Pitman-Koopman-Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. More long-windedly, suppose Xn, n = 1, 2, 3, ... are independent identically distributed random variables whose distribution is known to be in some family of probability distributions. Only if that family is an exponential family is there a (possibly vector-valued) sufficient statistic T(X1, ..., Xn) whose number of scalar components does not increase as the sample size n increases.
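To make the claim concrete, here is a small sketch (my own Poisson example, not part of the theorem's statement): the log-likelihood of an i.i.d. Poisson sample depends on the data only through the one-dimensional statistic Σxi, so the difference of log-likelihoods of two samples with the same sum is constant in the parameter, however large n is.

```python
# Sketch illustrating fixed-dimension sufficiency for an exponential family (Poisson):
# the likelihood of an i.i.d. sample depends on the data only through sum(x), so the
# one-dimensional statistic T = sum(X_1, ..., X_n) stays sufficient as n grows.
import math

def poisson_loglik(sample, lam):
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in sample)

a = [2, 0, 3, 1, 4]        # two hypothetical samples with the same sum (10)
b = [5, 1, 2, 1, 1]
for lam in (0.5, 1.0, 2.0, 5.0):
    diff = poisson_loglik(a, lam) - poisson_loglik(b, lam)
    print(lam, round(diff, 6))   # constant in lam: the likelihood ratio is free of the parameter
```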
Bayesian estimation: conjugate distributions
Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there exists a conjugate prior, which is often also in the exponential family. A conjugate prior π for the parameter η of an exponential family is given by
π(η) ∝ exp(ηTχ − β A(η)),
where the vector χ and the scalar β > 0 are hyperparameters.
A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then choosing a beta distribution as the prior yields another beta distribution as the posterior. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution, the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the success probability of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution.
An arbitrary likelihood will not belong to the exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.
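A minimal sketch of the beta–binomial conjugacy mentioned above, assuming hypothetical prior parameters and data; it compares the closed-form conjugate update Beta(a + k, b + n − k) with a brute-force grid normalization of prior × likelihood.

```python
# Hedged sketch of beta-binomial conjugacy (hypothetical numbers): a Beta(a, b) prior
# combined with a binomial likelihood gives a Beta(a + k, b + n - k) posterior.
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 3.0          # prior hyperparameters (illustrative)
n, k = 20, 7             # hypothetical data: k successes in n trials

theta = np.linspace(1e-6, 1 - 1e-6, 10001)
unnorm = beta.pdf(theta, a, b) * binom.pmf(k, n, theta)           # prior times likelihood
grid_posterior = unnorm / (unnorm.sum() * (theta[1] - theta[0]))  # normalize on the grid

conjugate_posterior = beta.pdf(theta, a + k, b + n - k)           # closed-form conjugate update
print(np.max(np.abs(grid_posterior - conjugate_posterior)))       # small discretization error
```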
Statistical inference
Sampling distributions
As discussed above, the sufficient statistic (T1, ..., Tn) plays a pivotal role in statistical inference, whether classical or Bayesian. Accordingly, it is interesting to study its sampling distribution. That is, if X1, ..., Xm is a random sample—that is, a collection of independent, identically-distributed random variables—drawn from a distribution in the exponential family, we want to know the probability distribution of the statistic
T̄i = (Ti(X1) + ... + Ti(Xm)) / m.
Letting T0 = 1, we can write
dF(x) = exp(ηiTi(x) − η0T0(x)) dH(x),
using Einstein's summation convention, namely
ηiTi(x) = η1T1(x) + ... + ηnTn(x).
Then,
Z(η) = ∫ exp(ηiTi(x)) dH(x)
is what physicists call the partition function in statistical mechanics. The condition that dF be normalized implies that η0 = log Z(η) = A(η), as anticipated in the above section on information entropy.
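As a check of the relation A(η) = log Z(η), here is a sketch using the binomial family again (my own choice, consistent with the example section): with reference masses C(n, x), the partition function is Z(η) = Σx C(n, x) e^(ηx) = (1 + e^η)^n, so log Z equals n log(1 + e^η) = A(η).

```python
# Sketch: for the binomial family with reference masses C(n, x), the partition function
# Z(eta) = sum_x C(n, x) * exp(eta * x) equals (1 + exp(eta))**n, so A(eta) = log Z(eta).
import math

n, eta = 10, 0.7                            # illustrative values
Z = sum(math.comb(n, x) * math.exp(eta * x) for x in range(n + 1))
A = n * math.log(1.0 + math.exp(eta))       # cumulant function of the binomial family
print(math.log(Z), A)                       # the two numbers agree
```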
Next, it is straightforward to check that
∂A(η)/∂ηi = E[Ti(X)],
denoted ti, and
∂²A(η)/∂ηi∂ηj = Cov(Ti(X), Tj(X)),
denoted tij. As the same information can be obtained from either Z or A, it is not necessary to normalize the probability distribution dF by setting η0 = A before taking the derivatives. Also, the function A(η) is the cumulant-generating function of the distribution of T, not just for dF or dH, but for the entire exponential subfamily with the given dH and T.
The equations
ti = ∂A(η)/∂ηi,    i = 1, ..., n,
can usually be solved to find η as a function of the ti, which means that either set of parameters can be used to completely specify a member of the specific subfamily under consideration. In that case, the covariances tij can also be expressed in terms of the ti, which is useful for estimation purposes as we shall see below.
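For instance, in the Poisson family (again an illustrative choice of mine) the cumulant function is A(η) = e^η, so the first derivative should reproduce the mean t = λ and the second derivative the variance; the sketch below checks this by finite differences.

```python
# Sketch (Poisson example): A(eta) = exp(eta), so dA/deta should equal the mean of
# T(X) = X and d^2A/deta^2 its variance. Checked here by finite differences.
import math

def A(eta):                      # cumulant function of the Poisson family
    return math.exp(eta)

lam = 2.5
eta = math.log(lam)              # natural parameter
h = 1e-5
dA = (A(eta + h) - A(eta - h)) / (2 * h)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
print(dA, d2A)                   # both approximately lam, the Poisson mean and variance
```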
We are now ready to consider the random samples mentioned earlier. It follows that
E[T̄i] = ti,
that is, the statistic T̄i is an unbiased estimator of ti. Moreover, since the elements of a random sample are assumed to be mutually independent,
Cov(T̄i, T̄j) = tij / m.
Because the covariance vanishes in the limit of large samples, the estimators are said to be consistent.
More generally, the kth cumulant of the distribution of T̄i can be seen to decay as 1/m^(k − 1), so the distribution of these statistics is asymptotically a multivariate normal distribution. To use asymptotic normality (as one would in the construction of confidence intervals) one needs an estimate of the covariances. Therefore we also need to look at the sampling distribution of
t̂ij = (1/(m − 1)) Σr (Ti(Xr) − T̄i)(Tj(Xr) − T̄j), the sum running over r = 1, ..., m.
This is easily seen to be an unbiased estimator of tij, but consistency and asymptotic chi-squared behaviour are rather more involved, and depend on the third and fourth cumulants of dF.
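A Monte Carlo sketch of the unbiasedness and consistency claims, under the assumption of a Poisson(λ) family with T(X) = X (so t = λ and t11 = λ): the average of the replicated sample means stays near λ while their variance shrinks roughly like λ/m.

```python
# Monte Carlo sketch (illustration only): for i.i.d. Poisson(lam) draws, the sample mean
# of T(X) = X is unbiased for t = lam and its spread shrinks like 1/m, consistent with
# the asymptotic normality discussed above.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.5
for m in (10, 100, 1000):
    means = rng.poisson(lam, size=(5000, m)).mean(axis=1)   # 5000 replicated samples of size m
    print(m, means.mean(), means.var())                     # mean ~ lam, variance ~ lam/m
```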
Hypothesis testing
Confidence intervals
External links
- A primer on the exponential family of distributions
- Exponential family of distributions, in the Earliest known uses of some of the words of mathematics