In probability and statistics, an exponential family is an important class of probability distributions sharing a certain form, specified below. This special form is chosen for mathematical convenience, on account of some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural distributions to consider. The concept of exponential families is credited to[1] E. J. G. Pitman,[2] G. Darmois,[3] and B. O. Koopman[4] in 1935–6. The term exponential class is sometimes used in place of "exponential family".[5]
The exponential families include many of the most common distributions, including the normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, Wishart, inverse Wishart and many others. Treating these and other distributions as members of an exponential family provides a framework for selecting an alternative parameterisation of the distribution in terms of natural parameters, and for defining useful sample statistics, called the natural statistics of the family. See below for more information.
The following is a sequence of increasingly general definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.
A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form
$$f_X(x \mid \theta) = h(x) \exp\bigl(\eta(\theta)\, T(x) - A(\theta)\bigr)$$
where $T(x)$, $h(x)$, $\eta(\theta)$, and $A(\theta)$ are known functions.
An alternative, equivalent form often given is
$$f_X(x \mid \theta) = h(x)\, g(\theta) \exp\bigl(\eta(\theta)\, T(x)\bigr)$$
or equivalently
$$f_X(x \mid \theta) = \exp\bigl(\eta(\theta)\, T(x) - A(\theta) + B(x)\bigr)$$
The value $\theta$ is called the parameter of the family.
Note that $x$ is often a vector of measurements, in which case $T(x)$ is a function from the space of possible values of $x$ to the real numbers.
If $\eta(\theta) = \theta$, then the exponential family is said to be in canonical form. By defining a transformed parameter $\eta = \eta(\theta)$, it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since $\eta(\theta)$ can be multiplied by any nonzero constant, provided that $T(x)$ is multiplied by that constant's reciprocal.
Even when $x$ is a scalar, and there is only a single parameter, the functions $\eta(\theta)$ and $T(x)$ can still be vectors, as described below.
Note also that the function $A(\theta)$ or equivalently $g(\theta)$ is automatically determined once the other functions have been chosen, and assumes a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of $\eta$, even when $\eta(\theta)$ is not a one-to-one function, i.e. two or more different values of $\theta$ map to the same value of $\eta(\theta)$, and hence $\eta(\theta)$ cannot be inverted. In such a case, all values of $\theta$ mapping to the same $\eta(\theta)$ will also have the same value for $A(\theta)$ and $g(\theta)$.
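As an illustrative sketch of the single-parameter form, the Poisson distribution can be written with the components h(x) = 1/x!, T(x) = x, η(λ) = log λ and A(λ) = λ. The function names below (`poisson_pmf`, `expfam_pmf`) are made up for this example, and the rate 3.5 is arbitrary. The check compares the generic form with the textbook mass function and confirms that A normalizes the distribution.

```python
import math

def poisson_pmf(x, lam):
    """Textbook Poisson mass function: lam**x * exp(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def expfam_pmf(x, theta, h, T, eta, A):
    """Generic single-parameter form h(x) * exp(eta(theta) * T(x) - A(theta))."""
    return h(x) * math.exp(eta(theta) * T(x) - A(theta))

# Poisson components (an illustrative choice; theta is the rate lam).
h = lambda x: 1.0 / math.factorial(x)
T = lambda x: x
eta = lambda lam: math.log(lam)
A = lambda lam: lam

lam = 3.5
# Largest pointwise gap between the two ways of writing the mass function.
max_gap = max(abs(poisson_pmf(x, lam) - expfam_pmf(x, lam, h, T, eta, A))
              for x in range(30))
# The log-normalizer A makes the masses sum to one (truncated sum).
total = sum(expfam_pmf(x, lam, h, T, eta, A) for x in range(100))
print(max_gap, total)
```

Because η(λ) = log λ rather than λ itself, this parametrization is not in canonical form; substituting η = log λ gives A as a function of η, namely e^η.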
Further down the page is the example of a normal distribution with unknown mean and known variance.
What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms: $f(x)$, $g(\theta)$, $c^{f(x)}$, $c^{g(\theta)}$, $[f(x)]^c$, $[g(\theta)]^c$, $[f(x)]^{g(\theta)}$, $[g(\theta)]^{f(x)}$, $[f(x)]^{h(x)g(\theta)}$, or $[g(\theta)]^{h(x)j(\theta)}$, where $f(x)$ and $h(x)$ are arbitrary functions of $x$; $g(\theta)$ and $j(\theta)$ are arbitrary functions of $\theta$; and $c$ is an arbitrary "constant" expression (i.e. an expression not involving $x$ or $\theta$).
There are further restrictions on how many such factors can occur. For example, an expression of the sort $[f(x)g(\theta)]^{h(x)j(\theta)}$ is the same as $[f(x)]^{h(x)j(\theta)} [g(\theta)]^{h(x)j(\theta)}$, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,
$$[f(x)g(\theta)]^{h(x)j(\theta)} = [f(x)]^{h(x)j(\theta)}\, [g(\theta)]^{h(x)j(\theta)} = e^{[h(x)\ln f(x)]\, j(\theta) + h(x)\, [j(\theta)\ln g(\theta)]},$$
it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.)
To see why an expression of the form $[f(x)]^{g(\theta)}$ qualifies, note that
$$[f(x)]^{g(\theta)} = e^{g(\theta)\ln f(x)}$$
and hence factorizes inside of the exponent. Similarly,
$$[f(x)]^{h(x)g(\theta)} = e^{h(x)g(\theta)\ln f(x)} = e^{[h(x)\ln f(x)]\, g(\theta)}$$
and again factorizes inside of the exponent.
Note also that a factor consisting of a sum where both types of variables are involved (e.g. a factor of the form $1 + f(x)g(\theta)$) cannot be factorized in this fashion (except in some cases where the sum occurs directly in an exponent); this is why, for example, the Cauchy distribution and Student's t distribution are not exponential families.
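As a worked illustration of the factorization rule, consider the gamma density with shape $k$ and scale $\theta$ (the symbols are chosen for this example only). Its factors are of the allowed kinds: $x^{k-1}$ and $e^{-x/\theta} = [e^{-x}]^{1/\theta}$ are both of the form $[f(x)]^{g(\theta)}$, and $1/(\Gamma(k)\theta^{k})$ depends on the parameters alone. Moving everything into the exponent gives:

```latex
\frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\,\theta^{k}}
  = \exp\Bigl( (k-1)\ln x \;-\; \frac{1}{\theta}\,x \;-\; \ln\Gamma(k) \;-\; k\ln\theta \Bigr)
```

Each term in the exponent is either a function of the parameters times a function of $x$, or a function of the parameters alone, which is the separation the definition requires.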
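The definition in terms of one real-number parameter can be extended to one real-vector parameter $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_d)^{\mathsf T}$. A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as
$$f_X(x \mid \boldsymbol{\theta}) = h(x) \exp\Bigl(\sum_{i=1}^{s} \eta_i(\boldsymbol{\theta})\, T_i(x) - A(\boldsymbol{\theta})\Bigr)$$
Or in a more compact form,
$$f_X(x \mid \boldsymbol{\theta}) = h(x) \exp\bigl(\boldsymbol{\eta}(\boldsymbol{\theta}) \cdot \mathbf{T}(x) - A(\boldsymbol{\theta})\bigr)$$
This form writes the sum as a dot product of vector-valued functions $\boldsymbol{\eta}(\boldsymbol{\theta})$ and $\mathbf{T}(x)$.
An alternative, equivalent form often seen is
$$f_X(x \mid \boldsymbol{\theta}) = h(x)\, g(\boldsymbol{\theta}) \exp\bigl(\boldsymbol{\eta}(\boldsymbol{\theta}) \cdot \mathbf{T}(x)\bigr)$$
As in the scalar-valued case, the exponential family is said to be in canonical form if $\eta_i(\boldsymbol{\theta}) = \theta_i$, for all $i$.
A vector exponential family is said to be curved if the dimension of $\boldsymbol{\theta}$ is less than the dimension of the vector $\boldsymbol{\eta}(\boldsymbol{\theta})$. That is, if the dimension of the parameter vector is less than the number of functions of the parameter vector in the above representation of the probability density function. Note that most common distributions in the exponential family are not curved, and many algorithms designed to work with any member of the exponential family implicitly or explicitly assume that the distribution is not curved.
Note that, as in the above case of a scalar-valued parameter, the function $A(\boldsymbol{\theta})$ or equivalently $g(\boldsymbol{\theta})$ is automatically determined once the other functions have been chosen, so that the entire distribution is normalized. In addition, as above, both of these functions can always be written as functions of $\boldsymbol{\eta}$, regardless of the form of the transformation that generates $\boldsymbol{\eta}$ from $\boldsymbol{\theta}$. Hence an exponential family in its "natural form" (parametrized by its natural parameter) looks like
$$f_X(x \mid \boldsymbol{\eta}) = h(x) \exp\bigl(\boldsymbol{\eta} \cdot \mathbf{T}(x) - A(\boldsymbol{\eta})\bigr)$$
or equivalently
$$f_X(x \mid \boldsymbol{\eta}) = h(x)\, g(\boldsymbol{\eta}) \exp\bigl(\boldsymbol{\eta} \cdot \mathbf{T}(x)\bigr)$$
Note that the above forms may sometimes be seen with $\boldsymbol{\eta}^{\mathsf T} \mathbf{T}(x)$ in place of $\boldsymbol{\eta} \cdot \mathbf{T}(x)$. These are exactly equivalent formulations, merely using different notation for the dot product.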
Further down the page is the example of a normal distribution with unknown mean and variance.
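The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar $x$ replaced by the vector $\mathbf{x} = (x_1, \ldots, x_k)$. Note that the dimension $k$ of the random variable need not match the dimension $d$ of the parameter vector, nor (in the case of a curved exponential family) the dimension $s$ of the natural parameter $\boldsymbol{\eta}$ and sufficient statistic $\mathbf{T}(\mathbf{x})$.
The distribution in this case is written as
$$f_X(\mathbf{x} \mid \boldsymbol{\theta}) = h(\mathbf{x}) \exp\Bigl(\sum_{i=1}^{s} \eta_i(\boldsymbol{\theta})\, T_i(\mathbf{x}) - A(\boldsymbol{\theta})\Bigr)$$
Or more compactly as
$$f_X(\mathbf{x} \mid \boldsymbol{\theta}) = h(\mathbf{x}) \exp\bigl(\boldsymbol{\eta}(\boldsymbol{\theta}) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\theta})\bigr)$$
Or alternatively as
$$f_X(\mathbf{x} \mid \boldsymbol{\theta}) = h(\mathbf{x})\, g(\boldsymbol{\theta}) \exp\bigl(\boldsymbol{\eta}(\boldsymbol{\theta}) \cdot \mathbf{T}(\mathbf{x})\bigr)$$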
We use cumulative distribution functions (cdf) in order to encompass both discrete and continuous distributions.
Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to dH(x) are integrals with respect to the "reference measure" of the exponential family generated by H.
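Any member of that exponential family has cumulative distribution function
$$dF(x \mid \boldsymbol{\eta}) = e^{\boldsymbol{\eta} \cdot \mathbf{T}(x) - A(\boldsymbol{\eta})}\, dH(x).$$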
If F is a continuous distribution with a density, one can write dF(x) = f(x) dx.
H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then H is a step function (with steps on the support of F).
In the definitions above, the functions $T(x)$, $\eta(\theta)$, and $A(\eta)$ were apparently arbitrarily defined. However, these functions play a significant role in the resulting probability distribution.
The normal, exponential, gamma, chi-squared, beta, Weibull (with known shape parameter k), Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial (with known stopping-time parameter r), and geometric distributions are all exponential families. The family of Pareto distributions with a fixed minimum bound forms an exponential family.
The Cauchy and uniform families of distributions are not exponential families. The Laplace family is not an exponential family unless the mean is zero.
Following are some detailed examples of the representation of some useful distributions as exponential families.
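As a first example, consider a random variable distributed normally with unknown mean $\mu$ and known variance $\sigma^2$. The probability density function is then
$$f_\sigma(x; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}.$$
This is a single-parameter exponential family, as can be seen by setting
$$h_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/(2\sigma^2)}, \qquad T_\sigma(x) = \frac{x}{\sigma}, \qquad A_\sigma(\mu) = \frac{\mu^2}{2\sigma^2}, \qquad \eta_\sigma(\mu) = \frac{\mu}{\sigma}.$$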
If σ = 1 this is in canonical form, as then η(μ) = μ.
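A quick numerical sketch of this decomposition follows. It assumes the choice h(x) = exp(-x²/(2σ²))/√(2πσ²), T(x) = x/σ, η(μ) = μ/σ and A(μ) = μ²/(2σ²), and the function names are invented for this example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Standard normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_expfam(x, mu, sigma):
    """Same density written as h(x) * exp(eta * T(x) - A), sigma treated as known."""
    h = math.exp(-x ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    T = x / sigma                      # sufficient statistic
    eta = mu / sigma                   # function of the unknown mean
    A = mu ** 2 / (2 * sigma ** 2)     # log-normalizer
    return h * math.exp(eta * T - A)

for x in (-1.0, 0.0, 2.5):
    print(x, normal_pdf(x, 1.2, 0.7), normal_expfam(x, 1.2, 0.7))
```

The two columns agree up to floating-point rounding, since expanding the square in the exponent gives exactly the three terms used above.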
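Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then
$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}.$$
This is an exponential family which can be written in canonical form by defining
$$\boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2},\; -\frac{1}{2\sigma^2}\right)^{\mathsf T}, \qquad h(x) = \frac{1}{\sqrt{2\pi}}, \qquad \mathbf{T}(x) = \left(x,\; x^2\right)^{\mathsf T},$$
$$A(\boldsymbol{\eta}) = \frac{\mu^2}{2\sigma^2} + \ln|\sigma| = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\ln\left|\frac{1}{2\eta_2}\right|.$$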
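The two-parameter case can be sketched in code as follows, assuming the natural parameters η = (μ/σ², -1/(2σ²)), sufficient statistic T(x) = (x, x²) and base measure h(x) = 1/√(2π). The helper names are hypothetical.

```python
import math

def natural_params(mu, sigma):
    """Map (mu, sigma) to the natural parameters (eta1, eta2)."""
    return mu / sigma ** 2, -1.0 / (2 * sigma ** 2)

def log_partition(eta1, eta2):
    """A(eta) = -eta1^2 / (4 eta2) + 0.5 * ln|1 / (2 eta2)|; requires eta2 < 0."""
    return -eta1 ** 2 / (4 * eta2) + 0.5 * math.log(abs(1.0 / (2 * eta2)))

def normal_natural_pdf(x, eta1, eta2):
    """Density h(x) * exp(eta . T(x) - A(eta)) with T(x) = (x, x^2)."""
    h = 1.0 / math.sqrt(2 * math.pi)
    return h * math.exp(eta1 * x + eta2 * x ** 2 - log_partition(eta1, eta2))

def mean_and_sigma(eta1, eta2):
    """Invert the parameter map to recover (mu, sigma)."""
    sigma2 = -1.0 / (2 * eta2)
    return eta1 * sigma2, math.sqrt(sigma2)

eta1, eta2 = natural_params(1.5, 2.0)
print(eta1, eta2, mean_and_sigma(eta1, eta2))
```

The natural parameter space here is the half-plane η₂ < 0; outside it the density cannot be normalized.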
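As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is
$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x \in \{0, 1, 2, \ldots, n\}.$$
This can equivalently be written as
$$f(x) = \binom{n}{x} \exp\left(x \log\left(\frac{p}{1-p}\right) + n \log(1-p)\right),$$
which shows that the binomial distribution is an exponential family, whose natural parameter is
$$\eta = \log\frac{p}{1-p}.$$
This function of p is known as logit.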
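The rewriting can be checked numerically. The sketch below assumes h(x) = C(n, x), T(x) = x and the log-normalizer A(η) = n log(1 + e^η), which equals -n log(1 - p); the function names are illustrative.

```python
import math

def logit(p):
    """Natural parameter of the binomial: log-odds of success."""
    return math.log(p / (1 - p))

def binom_pmf(x, n, p):
    """Textbook binomial mass function."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_expfam(x, n, p):
    """Same mass function as h(x) * exp(eta * x - A(eta))."""
    eta = logit(p)
    A = n * math.log1p(math.exp(eta))   # equals -n * log(1 - p)
    return math.comb(n, x) * math.exp(eta * x - A)

n, p = 8, 0.3
print([round(binom_expfam(x, n, p), 6) for x in range(n + 1)])
```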
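We start with the normalization of the probability distribution. Since
$$1 = \int h(x)\, e^{\boldsymbol{\eta} \cdot \mathbf{T}(x) - A(\boldsymbol{\eta})}\, dx = e^{-A(\boldsymbol{\eta})} \int h(x)\, e^{\boldsymbol{\eta} \cdot \mathbf{T}(x)}\, dx$$
(with the integral replaced by a sum in the discrete case), it follows that
$$A(\boldsymbol{\eta}) = \ln \int h(x)\, e^{\boldsymbol{\eta} \cdot \mathbf{T}(x)}\, dx.$$
This justifies calling A the log-partition function.
Now, the moment generating function of T(x) is
$$M_T(\mathbf{u}) \equiv E\bigl[e^{\mathbf{u} \cdot \mathbf{T}(x)} \mid \boldsymbol{\eta}\bigr] = \int h(x)\, e^{(\boldsymbol{\eta} + \mathbf{u}) \cdot \mathbf{T}(x) - A(\boldsymbol{\eta})}\, dx = e^{A(\boldsymbol{\eta} + \mathbf{u}) - A(\boldsymbol{\eta})},$$
proving the earlier statement that $K(\mathbf{u} \mid \boldsymbol{\eta}) = A(\boldsymbol{\eta} + \mathbf{u}) - A(\boldsymbol{\eta})$ is the cumulant generating function for T.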
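Both statements can be checked on a small discrete family. The sketch below uses the binomial in natural form, with h(x) = C(n, x) and T(x) = x, as an assumed example. It computes A(η) by direct summation and compares the moment generating function obtained by summing over the support with exp(A(η + u) - A(η)).

```python
import math

def log_partition(eta, n):
    """A(eta) = log sum_x h(x) * exp(eta * T(x)), with h(x) = C(n, x), T(x) = x."""
    return math.log(sum(math.comb(n, x) * math.exp(eta * x) for x in range(n + 1)))

def mgf_direct(u, eta, n):
    """E[exp(u * T(X))], summing over the support of the distribution."""
    A = log_partition(eta, n)
    return sum(math.exp(u * x) * math.comb(n, x) * math.exp(eta * x - A)
               for x in range(n + 1))

def mgf_from_A(u, eta, n):
    """The same quantity from the log-partition function alone."""
    return math.exp(log_partition(eta + u, n) - log_partition(eta, n))

n, eta, u = 10, 0.3, 0.7
print(mgf_direct(u, eta, n), mgf_from_A(u, eta, n))
print(log_partition(eta, n), n * math.log1p(math.exp(eta)))  # closed form for comparison
```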
An important subclass of the exponential family, the natural exponential family, has a similar form for the moment generating function for the distribution of x.
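In particular,
$$E(T_j) = \frac{\partial A(\boldsymbol{\eta})}{\partial \eta_j}$$
and
$$\operatorname{cov}(T_i, T_j) = \frac{\partial^2 A(\boldsymbol{\eta})}{\partial \eta_i\, \partial \eta_j}.$$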
The first two raw moments and all mixed second moments can be recovered from these two identities. Higher order moments and cumulants are obtained by higher derivatives. This technique is often useful when T is a complicated function of the data, whose moments are difficult to calculate by integration.
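The derivative identities (the mean of T is the first derivative of A, and its variance is the second) can be illustrated with finite differences. This is only a sketch: it assumes the binomial log-partition function A(η) = n log(1 + e^η), and the step size is an arbitrary choice that trades truncation error against rounding error.

```python
import math

def A(eta, n):
    """Binomial log-partition function in the natural parameter eta."""
    return n * math.log1p(math.exp(eta))

def first_derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)

def second_derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + step) - 2 * f(x) + f(x - step)) / step ** 2

n, p = 12, 0.3
eta = math.log(p / (1 - p))
mean_T = first_derivative(lambda e: A(e, n), eta)   # should be close to n * p
var_T = second_derivative(lambda e: A(e, n), eta)   # should be close to n * p * (1 - p)
print(mean_T, n * p)
print(var_T, n * p * (1 - p))
```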
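As an example consider a real-valued random variable $X$ with density
$$p_\theta(x) = \frac{\theta e^{-x}}{\left(1 + e^{-x}\right)^{\theta + 1}}$$
indexed by shape parameter $\theta \in (0, \infty)$ (this is called the skew-logistic distribution). The density can be rewritten as
$$\frac{e^{-x}}{1 + e^{-x}} \exp\bigl(-\theta \log\left(1 + e^{-x}\right) + \log\theta\bigr).$$
Notice this is an exponential family with natural parameter
$$\eta = -\theta,$$
sufficient statistic
$$T = \log\left(1 + e^{-x}\right),$$
and normalizing factor
$$A(\eta) = -\log\theta = -\log(-\eta).$$
So using the first identity,
$$E\bigl[\log\left(1 + e^{-X}\right)\bigr] = E(T) = \frac{\partial A}{\partial \eta} = \frac{\partial}{\partial \eta}\bigl[-\log(-\eta)\bigr] = \frac{1}{-\eta} = \frac{1}{\theta},$$
and using the second identity
$$\operatorname{var}\bigl(\log\left(1 + e^{-X}\right)\bigr) = \frac{\partial^2 A}{\partial \eta^2} = \frac{\partial}{\partial \eta}\left[\frac{1}{-\eta}\right] = \frac{1}{(-\eta)^2} = \frac{1}{\theta^2}.$$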
This example illustrates a case where using this method is very simple, while direct calculation by integration would be considerably more laborious.
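The result can be cross-checked by brute-force numerical integration. The sketch below assumes the skew-logistic density and sufficient statistic given in the docstrings. The integration limits, step count and the value θ = 2 are arbitrary choices, and the midpoint rule stands in for any quadrature method.

```python
import math

def skew_logistic_pdf(x, theta):
    """Assumed density: theta * exp(-x) / (1 + exp(-x)) ** (theta + 1)."""
    return theta * math.exp(-x) / (1 + math.exp(-x)) ** (theta + 1)

def T(x):
    """Sufficient statistic log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))

def integrate(f, lo, hi, steps):
    """Composite midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))

theta = 2.0
lo, hi, steps = -30.0, 40.0, 70000
mass = integrate(lambda x: skew_logistic_pdf(x, theta), lo, hi, steps)
mean_T = integrate(lambda x: T(x) * skew_logistic_pdf(x, theta), lo, hi, steps)
second_moment = integrate(lambda x: T(x) ** 2 * skew_logistic_pdf(x, theta), lo, hi, steps)
var_T = second_moment - mean_T ** 2
print(mass, mean_T, var_T)  # compare with 1, 1/theta and 1/theta**2
```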
The exponential family arises naturally as the answer to the following question: what is the maximum-entropy distribution consistent with given constraints on expected values?
The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x). As an aside, frequentists need to realize that this is a largely arbitrary choice, while Bayesians can just make this choice part of their prior probability distribution.
The entropy of dF(x) relative to dH(x) is
$$S[dF \mid dH] = -\int \frac{dF}{dH} \ln\frac{dF}{dH}\, dH$$
or
$$S[dF \mid dH] = \int \ln\frac{dH}{dF}\, dF$$
where dF/dH and dH/dF are Radon–Nikodym derivatives. Note that the ordinary definition of entropy for a discrete distribution supported on a set I, namely
$$S = -\sum_{i \in I} p_i \ln p_i$$
assumes, though this is seldom pointed out, that dH is chosen to be the counting measure on I.
Consider now a collection of observable quantities (random variables) Ti. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of Ti be equal to ti, is a member of the exponential family with dH as reference measure and (T1, ..., Tn) as sufficient statistic.
The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting T0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to T0.
For examples of such derivations, see Maximum entropy probability distribution.
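A small discrete instance of the maximum-entropy argument can be sketched as follows. It assumes a six-sided die (the counting measure on {1, ..., 6} as reference measure) and a single constraint on the mean, so the maximizing distribution has the form p_i ∝ exp(η i). The natural parameter η is found by bisection, which works because the mean is increasing in η. All names and the target mean 4.5 are illustrative.

```python
import math

def maxent_pmf(values, eta):
    """Exponential-family pmf p_i proportional to exp(eta * v_i)."""
    weights = [math.exp(eta * v) for v in values]
    Z = sum(weights)  # partition function
    return [w / Z for w in weights]

def mean(values, pmf):
    return sum(v * p for v, p in zip(values, pmf))

def entropy(pmf):
    """Shannon entropy relative to the counting measure (zero masses skipped)."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

def solve_eta(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Bisection on eta so that the pmf has the requested mean."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(values, maxent_pmf(values, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

faces = [1, 2, 3, 4, 5, 6]
eta = solve_eta(faces, 4.5)
pmf = maxent_pmf(faces, eta)
print(eta, pmf, entropy(pmf))
```

Any other distribution on the faces with mean 4.5, for example equal mass on 4 and 5, has lower entropy than the one returned here.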
According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. Less tersely, suppose Xn, n = 1, 2, 3, ... are independent identically distributed random variables whose distribution is known to be in some family of probability distributions. Only if that family is an exponential family is there a (possibly vector-valued) sufficient statistic T(X1, ..., Xn) whose number of scalar components does not increase as the sample size n increases.
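The practical meaning of a fixed-dimension sufficient statistic can be sketched with the normal family, where the likelihood of an i.i.d. sample depends on the data only through the sample size, the sum, and the sum of squares. The two small data sets below are made up so that they share these statistics; the function names are hypothetical.

```python
import math

def normal_loglik(data, mu, sigma):
    """Log-likelihood of an i.i.d. normal sample, computed from the raw data."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

def sufficient_stat(data):
    """(n, sum of x, sum of x^2): two data components regardless of n."""
    return len(data), sum(data), sum(x * x for x in data)

def loglik_from_stat(stat, mu, sigma):
    """The same log-likelihood, computed from the sufficient statistic only."""
    n, s1, s2 = stat
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - (s2 - 2 * mu * s1 + n * mu ** 2) / (2 * sigma ** 2))

a = [0.0, 3.0, 3.0]
b = [1.0, 1.0, 4.0]   # different data, same sum (6) and sum of squares (18)
print(sufficient_stat(a), sufficient_stat(b))
print(normal_loglik(a, 1.7, 0.9), normal_loglik(b, 1.7, 0.9))
```

Since the two samples have the same statistic, no choice of (mu, sigma) can distinguish them through the likelihood.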
Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there exists a conjugate prior, which is often also in the exponential family. A conjugate prior $\pi$ for the parameter $\boldsymbol{\eta}$ of an exponential family is given by
$$p_\pi(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu) \exp\bigl(\boldsymbol{\eta} \cdot \boldsymbol{\chi} - \nu A(\boldsymbol{\eta})\bigr)$$
or equivalently
$$p_\pi(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\bigl(\boldsymbol{\eta} \cdot \boldsymbol{\chi}\bigr)$$
where $\boldsymbol{\chi} \in \mathbb{R}^s$ (where $s$ is the dimension of $\boldsymbol{\eta}$) and $\nu > 0$ are hyperparameters (parameters controlling parameters). $\nu$ corresponds to the effective number of observations that the prior distribution contributes, and $\boldsymbol{\chi}$ corresponds to the total amount that these pseudo-observations contribute to the sufficient statistic over all observations and pseudo-observations. $f(\boldsymbol{\chi}, \nu)$ is a normalization constant that is automatically determined by the remaining functions and serves to ensure that the given function is a probability density function (i.e. it is normalized). $A(\boldsymbol{\eta})$ and equivalently $g(\boldsymbol{\eta})$ are the same functions as in the definition of the distribution over which $\pi$ is the conjugate prior.
A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the theta parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution. It can however be represented by using a mixture density as the prior, here a combination of two beta distributions; this is a form of hyperprior.
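The beta-binomial case mentioned above can be sketched in a few lines. The update rule (a, b) → (a + x, b + n - x) is the standard one for x successes in n trials. The code checks it by confirming that prior times likelihood differs from the claimed posterior density only by a constant factor that does not depend on p. The function names and numbers are illustrative.

```python
import math

def beta_log_pdf(p, a, b):
    """Log density of a Beta(a, b) distribution at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)

def binom_log_lik(x, n, p):
    """Log-likelihood of x successes in n trials with success probability p."""
    return math.log(math.comb(n, x)) + x * math.log(p) + (n - x) * math.log(1 - p)

def posterior_params(a, b, x, n):
    """Conjugate update: prior pseudo-counts plus observed counts."""
    return a + x, b + n - x

a, b, x, n = 2.0, 3.0, 7, 10
post_a, post_b = posterior_params(a, b, x, n)
# log(prior * likelihood) - log(posterior) should be the same constant for every p.
offsets = [beta_log_pdf(p, a, b) + binom_log_lik(x, n, p) - beta_log_pdf(p, post_a, post_b)
           for p in (0.1, 0.4, 0.8)]
print(post_a, post_b, offsets)
```

The constant offset is the log of the marginal likelihood of the data, which is what normalising the product removes.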
An arbitrary likelihood will not belong to the exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.
The one-parameter exponential family has a monotone non-decreasing likelihood ratio in the sufficient statistic T(x), provided that η(θ) is non-decreasing. As a consequence, there exists a uniformly most powerful test for testing the hypothesis H0: θ ≥ θ0 vs. H1: θ < θ0.
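The monotone likelihood ratio property can be illustrated with the Poisson family, where T(x) = x and η(λ) = log λ is increasing in λ. This is only a sketch with arbitrarily chosen rates. The log likelihood ratio between a larger and a smaller rate works out to x log(λ₁/λ₀) - (λ₁ - λ₀), which increases with x.

```python
import math

def poisson_log_pmf(x, lam):
    """Log of the Poisson mass function at x."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def log_likelihood_ratio(x, lam1, lam0):
    """log f(x | lam1) - log f(x | lam0)."""
    return poisson_log_pmf(x, lam1) - poisson_log_pmf(x, lam0)

# With lam1 > lam0 the ratio should be non-decreasing in the statistic T(x) = x.
ratios = [log_likelihood_ratio(x, 4.0, 2.0) for x in range(20)]
is_monotone = all(r0 <= r1 for r0, r1 in zip(ratios, ratios[1:]))
print(is_monotone, ratios[:3])
```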
The exponential family forms the basis for the distribution function used in generalized linear models, a class of models that encompasses many of the commonly used regression models in statistics.