Exponential family

In probability and statistics, an exponential family is an important class of probability distributions sharing a certain form, specified below. This special form is chosen for mathematical convenience, on account of some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural distributions to consider. The concept of exponential families is credited to[1] E. J. G. Pitman,[2] G. Darmois,[3] and B. O. Koopman[4] in 1935–6. The term exponential class is sometimes used in place of "exponential family".[5]

The exponential families include many of the most common distributions, including the normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, Wishart, inverse Wishart and many others. Consideration of these, and of other distributions that lie within an exponential family, provides a framework for selecting a possible alternative parameterisation of the distribution, in terms of natural parameters, and for defining useful sample statistics, called the natural statistics of the family. See below for more information.

Definition

The following is a sequence of increasingly general definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.

Scalar parameter

A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form

 f_X(x|\theta) = h(x)\ \exp[\ \eta(\theta) \cdot T(x)\ -\ A(\theta)\ ]

where T(x), h(x), \eta(\theta), and A(\theta) are known functions.

An alternative, equivalent form often given is

 f_X(x|\theta) = h(x)\ g(\theta) \exp[\ \eta(\theta) \cdot T(x)\ ]\,

or equivalently

 f_X(x|\theta) = \exp[\ \eta(\theta) \cdot T(x)\ -\ A(\theta) + B(x)\ ]

The value \theta is called the parameter of the family.

Note that x is often a vector of measurements, in which case T(x) is a function from the space of possible values of x to the real numbers.

If \eta(\theta) = \theta, then the exponential family is said to be in canonical form. By defining a transformed parameter \eta = \eta(\theta), it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since \eta(\theta) can be multiplied by any nonzero constant, provided that T(x) is multiplied by that constant's reciprocal.

Even when x is a scalar, and there is only a single parameter, the functions \eta(\theta) and T(x) can still be vectors, as described below.

Note also that the function A(\theta) or equivalently g(\theta) is automatically determined once the other functions have been chosen, and assumes a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of \eta, even when \eta(\theta) is not a one-to-one function, i.e. two or more different values of \theta map to the same value of \eta(\theta), and hence \eta(\theta) cannot be inverted. In such a case, all values of \theta mapping to the same \eta(\theta) will also have the same value for A(\theta) and g(\theta).

Further down the page is the example of a normal distribution with unknown mean and known variance.

Factorization of the variables involved

What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms: f(x), g(\theta), c^{f(x)}, c^{g(\theta)}, {[f(x)]}^c, {[g(\theta)]}^c, {[f(x)]}^{g(\theta)}, {[g(\theta)]}^{f(x)}, {[f(x)]}^{h(x)g(\theta)}, or {[g(\theta)]}^{h(x)j(\theta)}, where f(x) and h(x) are arbitrary functions of x; g(\theta) and j(\theta) are arbitrary functions of \theta; and c is an arbitrary "constant" expression (i.e. an expression not involving x or \theta).

There are further restrictions on how many such factors can occur. For example, an expression of the sort {[f(x) g(\theta)]}^{h(x)j(\theta)} is the same as {[f(x)]}^{h(x)j(\theta)} [g(\theta)]^{h(x)j(\theta)}, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,

{[f(x) g(\theta)]}^{h(x)j(\theta)} = {[f(x)]}^{h(x)j(\theta)} [g(\theta)]^{h(x)j(\theta)} = e^{[h(x) \ln f(x)] j(\theta) + h(x) [j(\theta) \ln g(\theta)]}\, ,

it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.)

To see why an expression of the form {[f(x)]}^{g(\theta)} qualifies, note that

{[f(x)]}^{g(\theta)} = e^{g(\theta) \ln f(x)}\,

and hence factorizes inside of the exponent. Similarly,

{[f(x)]}^{h(x)g(\theta)} = e^{h(x)g(\theta)\ln f(x)} =  e^{[h(x) \ln f(x)] g(\theta)}\,

and again factorizes inside of the exponent.

Note also that a factor consisting of a sum where both types of variables are involved (e.g. a factor of the form 1+f(x)g(\theta)) cannot be factorized in this fashion (except in some cases where it occurs directly in an exponent); this is why, for example, the Cauchy distribution and Student's t distribution are not exponential families.

Vector parameter

The definition in terms of one real-number parameter can be extended to one real-vector parameter {\boldsymbol \theta} = (\theta_1, \theta_2, \ldots, \theta_d)^T. A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as

 f_X(x|\boldsymbol \theta) = h(x) \exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(x) - A({\boldsymbol \theta}) \right) \,\!

Or in a more compact form,

 f_X(x|\boldsymbol \theta) = h(x) \exp\Big(\ \boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x) - A({\boldsymbol \theta})\ \Big) \,\!

This form writes the sum as a dot product of vector-valued functions \boldsymbol\eta({\boldsymbol \theta}) and \mathbf{T}(x).

An alternative, equivalent form often seen is

 f_X(x|\boldsymbol \theta) = h(x) g(\boldsymbol \theta) \exp\Big(\ \boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x)\ \Big) \,\!

As in the scalar valued case, the exponential family is said to be in canonical form if \eta_i({\boldsymbol \theta}) = \theta_i, for all i.

A vector exponential family is said to be curved if the dimension of {\boldsymbol \theta} = (\theta_1, \theta_2, \ldots, \theta_d)^T is less than the dimension of the vector {\boldsymbol \eta}(\boldsymbol \theta) = (\eta_1(\boldsymbol \theta), \eta_2(\boldsymbol \theta), \ldots, \eta_s(\boldsymbol \theta))^T. That is, if the dimension of the parameter vector is less than the number of functions of the parameter vector in the above representation of the probability density function. Note that most common distributions in the exponential family are not curved, and many algorithms designed to work with any member of the exponential family implicitly or explicitly assume that the distribution is not curved.

Note that, as in the above case of a scalar-valued parameter, the function A(\boldsymbol \theta) or equivalently g(\boldsymbol \theta) is automatically determined once the other functions have been chosen, so that the entire distribution is normalized. In addition, as above, both of these functions can always be written as functions of \boldsymbol\eta, regardless of the form of the transformation that generates \boldsymbol\eta from \boldsymbol\theta. Hence an exponential family in its "natural form" (parametrized by its natural parameter) looks like

 f_X(x|\boldsymbol \eta) = h(x) \exp\Big(\ \boldsymbol\eta \cdot \mathbf{T}(x) - A({\boldsymbol \eta})\ \Big) \,\!

or equivalently

 f_X(x|\boldsymbol \eta) = h(x) g(\boldsymbol \eta) \exp\Big(\ \boldsymbol\eta \cdot \mathbf{T}(x)\ \Big) \,\!

Note that the above forms may sometimes be seen with \boldsymbol\eta^T \mathbf{T}(x)\, in place of \boldsymbol\eta \cdot \mathbf{T}(x)\,. These are exactly equivalent formulations, merely using different notation for the dot product.

Further down the page is the example of a normal distribution with unknown mean and variance.

Vector parameter, vector variable

The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar x replaced by the vector \mathbf{x} = (x_1, x_2, \ldots, x_k). Note that the dimension k of the random variable need not match the dimension d of the parameter vector, nor (in the case of a curved exponential family) the dimension s of the natural parameter \boldsymbol\eta and sufficient statistic T(\mathbf{x}).

The distribution in this case is written as

 f_X(\mathbf{x}|\boldsymbol \theta) = h(\mathbf{x})\ \exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(\mathbf{x}) - A({\boldsymbol \theta}) \right) \,\!

Or more compactly as

 f_X(\mathbf{x}|\boldsymbol \theta) = h(\mathbf{x})\ \exp\Big(\ \boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(\mathbf{x}) - A({\boldsymbol \theta})\ \Big) \,\!

Or alternatively as

 f_X(\mathbf{x}|\boldsymbol \theta) = h(\mathbf{x})\ g(\boldsymbol \theta)\ \exp\Big(\ \boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(\mathbf{x})\ \Big) \,\!

Measure-theoretic formulation

We use cumulative distribution functions (cdf) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to dH(x) are integrals with respect to the "reference measure" of the exponential family generated by H.

Any member of that exponential family has cumulative distribution function

dF(\mathbf{x}|\boldsymbol\eta) = e^{\boldsymbol\eta^{\top} \mathbf{T}(\mathbf{x}) - A(\boldsymbol\eta)}\, dH(\mathbf{x}).

If F is a continuous distribution with a density, one can write dF(x) = f(x) dx.

H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then H is a step function (with steps on the support of F).

Interpretation

In the definitions above, the functions T(x), \eta(\theta), and A(\eta) were apparently arbitrarily defined. However, these functions play a significant role in the resulting probability distribution.

Examples

The normal, exponential, gamma, chi-squared, beta, Weibull (with known shape parameter k), Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial (with known stopping-time parameter r), and geometric distributions are all exponential families. The family of Pareto distributions with a fixed minimum bound also forms an exponential family.

The Cauchy and uniform families of distributions are not exponential families. The Laplace family is not an exponential family unless the mean is zero.

The following are detailed examples of how some useful distributions can be represented as exponential families.

Normal distribution: Unknown mean, known variance

As a first example, consider a random variable distributed normally with unknown mean \mu and known variance \sigma^2. The probability density function is then

f_\sigma(x;\mu) = \frac{1}{\sqrt{2 \pi}|\sigma|} e^{-(x-\mu)^2/2\sigma^2}.

This is a single-parameter exponential family, as can be seen by setting

h_\sigma(x) = e^{-x^2/2\sigma^2}/\sqrt{2\pi}|\sigma|
T_\sigma(x) = x/\sigma\!\,
A_\sigma(\mu) = \mu^2/2\sigma^2\!\,
\eta_\sigma(\mu) = \mu/\sigma.\!\,

If σ = 1 this is in canonical form, as then η(μ) = μ.
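
The decomposition above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (normal_pdf, expfam_pdf, max_abs_gap) are hypothetical and chosen for illustration. It evaluates the density once in its usual form and once as h(x) exp[η T(x) − A], using the four functions listed above.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density in its usual form."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * abs(sigma))

def expfam_pdf(x, mu, sigma):
    """Same density assembled as h(x) * exp(eta * T(x) - A)."""
    h = math.exp(-x ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * abs(sigma))
    T = x / sigma                    # sufficient statistic
    eta = mu / sigma                 # eta(mu) for known sigma
    A = mu ** 2 / (2 * sigma ** 2)   # A(mu)
    return h * math.exp(eta * T - A)

def max_abs_gap(mu, sigma, xs):
    """Largest absolute difference between the two forms over the points xs."""
    return max(abs(normal_pdf(x, mu, sigma) - expfam_pdf(x, mu, sigma)) for x in xs)
```

For any choice of mean and (nonzero) standard deviation, max_abs_gap should return a value at the level of floating-point rounding error.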

Normal distribution: Unknown mean and unknown variance

Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then

f(x;\mu,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(x-\mu)^2/(2 \sigma^2)}.

This is an exponential family which can be written in canonical form by defining

 \boldsymbol {\eta} = \left({\mu \over \sigma^2},{-1 \over 2\sigma^2} \right)^\top
 h(x) = {1 \over \sqrt{2 \pi}}
 T(x) = \left( x, x^2 \right)^\top
 A({\boldsymbol \eta})  = { \mu^2 \over 2 \sigma^2} + \ln |\sigma| = -{\eta_1^2 \over 4\eta_2} + {1 \over 2}\ln\left|{1 \over 2\eta_2}\right|
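
As an illustrative check of this canonical form, the sketch below (standard-library Python; the function names are hypothetical) maps (μ, σ) to the natural parameters, evaluates A(η) from η alone, and rebuilds the density as h(x) exp(η · T(x) − A(η)). It assumes σ > 0.

```python
import math

def natural_params(mu, sigma):
    """Map (mu, sigma) to eta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return mu / sigma ** 2, -1.0 / (2 * sigma ** 2)

def log_partition(eta1, eta2):
    """A(eta) = -eta1^2 / (4 eta2) + (1/2) ln|1 / (2 eta2)|, written in eta only."""
    return -eta1 ** 2 / (4 * eta2) + 0.5 * math.log(abs(1.0 / (2 * eta2)))

def expfam_pdf(x, mu, sigma):
    """Density as h(x) * exp(eta . T(x) - A(eta)) with T(x) = (x, x^2)."""
    eta1, eta2 = natural_params(mu, sigma)
    h = 1.0 / math.sqrt(2 * math.pi)
    return h * math.exp(eta1 * x + eta2 * x ** 2 - log_partition(eta1, eta2))

def normal_pdf(x, mu, sigma):
    """Normal density in its usual form, for comparison."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
```

The two expressions for A can also be compared directly: log_partition(*natural_params(mu, sigma)) should agree with mu**2 / (2 * sigma**2) + math.log(sigma) up to rounding.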

Binomial distribution

As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is

f(x)={n \choose x}p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}.

This can equivalently be written as

f(x)={n \choose x}\exp\left(x \log\left({p \over 1-p}\right) + n \log\left(1-p\right)\right),

which shows that the binomial distribution is an exponential family, whose natural parameter is

\eta = \log{p \over 1-p}.

This function of p is known as the logit.
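
A short sketch of the same rewriting in standard-library Python follows (math.comb requires Python 3.8 or later). The function names are hypothetical. The closed form A(η) = n log(1 + e^η) is not stated in the text above; it is obtained here by substituting p = e^η / (1 + e^η) into the term −n log(1 − p).

```python
import math

def logit(p):
    """Natural parameter eta = log(p / (1 - p))."""
    return math.log(p / (1 - p))

def binom_pmf(x, n, p):
    """Binomial probability mass function in its usual form."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_expfam(x, n, p):
    """Same pmf as h(x) * exp(eta * x - A(eta)) with h(x) = C(n, x)."""
    eta = logit(p)
    A = n * math.log1p(math.exp(eta))   # equals -n * log(1 - p)
    return math.comb(n, x) * math.exp(eta * x - A)
```

Summing binom_expfam over x = 0, ..., n should give 1 up to rounding, which is the role of A as a normalizer.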

Moments and cumulants of the sufficient statistic

Normalization of the distribution

We start with the normalization of the probability distribution. Since

1 = \int dF(x|\eta) = \int e^{\eta^\top T(x)-A(\eta)}dH(x)

it follows that

e^{A(\eta)} = \int e^{\eta^\top T(x)}dH(x)

This justifies calling A the log-partition function.

Moment generating function of the sufficient statistic

Now, the moment generating function of T(x) is

M_T(u) \equiv E[e^{u^\top T(x)}|\eta] = \int  e^{(\eta+u)^\top T(x)-A(\eta)}dH(x) = e^{A(\eta + u)-A(\eta)}

showing that K(u|\eta) = A(\eta+u) - A(\eta) is the cumulant generating function for T.
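
The identity can be illustrated with the binomial family with known n, for which T(x) = x. The sketch below (standard-library Python, hypothetical function names) compares the moment generating function computed by direct summation with exp(A(η + u) − A(η)). It assumes the closed form A(η) = n log(1 + e^η), which follows from the binomial example but is not written out in the text.

```python
import math

def log_partition(eta, n):
    """Assumed closed form A(eta) = n * log(1 + e^eta) for the binomial family."""
    return n * math.log1p(math.exp(eta))

def mgf_direct(u, n, p):
    """E[exp(u * X)] by summing over the binomial pmf."""
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) * math.exp(u * x)
               for x in range(n + 1))

def mgf_from_log_partition(u, n, p):
    """exp(A(eta + u) - A(eta)) with eta = logit(p)."""
    eta = math.log(p / (1 - p))
    return math.exp(log_partition(eta + u, n) - log_partition(eta, n))
```

Both functions should agree for any u, and both reduce to 1 at u = 0.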

An important subclass of the exponential family, the natural exponential family, has a similar form for the moment generating function for the distribution of x.

Differential identities for cumulants

In particular,

 E(T_{j}) = \frac{ \partial A(\eta) }{ \partial \eta_{j} }

and

 \mathrm{cov}(T_{i},T_{j}) = \frac{ \partial^{2} A(\eta) }{ \partial \eta_{i} \, \partial \eta_{j} }.

The first two raw moments and all mixed second moments can be recovered from these two identities. Higher order moments and cumulants are obtained by higher derivatives. This technique is often useful when T is a complicated function of the data, whose moments are difficult to calculate by integration.

Example

As an example, consider a real-valued random variable X with density

 p_\theta (x) = \frac{ \theta e^{-x} }{(1 + e^{-x})^{\theta + 1} }

indexed by shape parameter  \theta \in (0,\infty) (this is called the skew-logistic distribution). The density can be rewritten as

 \frac{ e^{-x} } { 1 + e^{-x} } \exp( -\theta \log(1 + e^{-x}) + \log(\theta))

Notice this is an exponential family with natural parameter

 \eta = -\theta, \,

sufficient statistic

 T = \log(1 + e^{-x}), \,

and normalizing factor

 A(\eta) = -\log(\theta) = -\log(-\eta) \,

So using the first identity,

 E(\log(1 + e^{-X})) = E(T) = \frac{ \partial A(\eta) }{ \partial \eta } = \frac{ \partial }{ \partial \eta } [-\log(-\eta)] = \frac{1}{-\eta} = \frac{1}{\theta},

and using the second identity

 \mathrm{var}(\log(1 + e^{-X})) = \frac{ \partial^2 A(\eta) }{ \partial \eta^2 } = \frac{ \partial }{ \partial \eta } \left[\frac{1}{-\eta}\right] = \frac{1}{(-\eta)^2} = \frac{1}{\theta^2}.

This example illustrates a case where the method is very simple to apply, whereas direct calculation of these moments by integration would be considerably more laborious.
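
The two results can be cross-checked by brute-force numerical integration. The sketch below is one possible check in standard-library Python. The function names, the integration range [−60, 60] and the number of midpoint-rule steps are arbitrary assumptions made for illustration, suitable for moderate values of θ.

```python
import math

def density(x, theta):
    """Skew-logistic density, evaluated in log space to avoid overflow."""
    log_p = math.log(theta) - x - (theta + 1) * math.log1p(math.exp(-x))
    return math.exp(log_p)

def T(x):
    """Sufficient statistic T(x) = log(1 + e^{-x})."""
    return math.log1p(math.exp(-x))

def moments_of_T(theta, lo=-60.0, hi=60.0, steps=20000):
    """Midpoint-rule estimates of E[T] and var(T) on the interval [lo, hi]."""
    dx = (hi - lo) / steps
    m0 = m1 = m2 = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        w = density(x, theta) * dx   # probability mass of this cell
        t = T(x)
        m0 += w
        m1 += w * t
        m2 += w * t * t
    mean = m1 / m0
    return mean, m2 / m0 - mean ** 2
```

For a given θ, the returned pair should be close to (1/θ, 1/θ²), in agreement with the derivatives of A(η) computed above.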

Maximum entropy derivation

The exponential family arises naturally as the answer to the following question: what is the maximum-entropy distribution consistent with given constraints on expected values?

The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x). As an aside, frequentists need to realize that this is a largely arbitrary choice, while Bayesians can just make this choice part of their prior probability distribution.

The entropy of dF(x) relative to dH(x) is

S[dF|dH]=-\int {dF\over dH}\ln{dF\over dH}\,dH

or

S[dF|dH]=\int\ln{dH\over dF}\,dF

where dF/dH and dH/dF are Radon–Nikodym derivatives. Note that the ordinary definition of entropy for a discrete distribution supported on a set I, namely

S=-\sum_{i\in I} p_i\ln p_i

assumes, though this is seldom pointed out, that dH is chosen to be the counting measure on I.

Consider now a collection of observable quantities (random variables) T_i. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of T_i be equal to t_i, is a member of the exponential family with dH as reference measure and (T_1, ..., T_n) as sufficient statistic.

The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting T_0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to T_0.
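
The variational step can be sketched as follows. The notation is introduced here for illustration only: f = dF/dH denotes the Radon–Nikodym derivative, and \lambda_0, \lambda_1, \ldots, \lambda_n are the Lagrange multipliers for the normalization and moment constraints. This is a formal calculation that ignores regularity conditions.

```latex
% Lagrangian: entropy plus the normalization and moment constraints
\mathcal{L}[f] = -\int f \ln f \, dH
  + \lambda_0 \left( \int f \, dH - 1 \right)
  + \sum_{i=1}^{n} \lambda_i \left( \int T_i f \, dH - t_i \right)

% Stationarity with respect to pointwise variations of f
\frac{\delta \mathcal{L}}{\delta f}
  = -\ln f - 1 + \lambda_0 + \sum_{i=1}^{n} \lambda_i T_i = 0

% Solving for f gives the exponential-family form
f(x) = \exp\left( \sum_{i=1}^{n} \lambda_i T_i(x) - (1 - \lambda_0) \right)
```

Comparing with the natural form of the family, the multipliers \lambda_i play the role of the natural parameters \eta_i, and the normalizer is fixed by \lambda_0 through A(\boldsymbol\eta) = 1 - \lambda_0.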

For examples of such derivations, see Maximum entropy probability distribution.

Role in statistics

Classical estimation: sufficiency

According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. Less tersely, suppose X_n, n = 1, 2, 3, ... are independent identically distributed random variables whose distribution is known to be in some family of probability distributions. Only if that family is an exponential family is there a (possibly vector-valued) sufficient statistic T(X_1, ..., X_n) whose number of scalar components does not increase as the sample size n increases.
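
As a concrete illustration of a fixed-dimension sufficient statistic, consider an i.i.d. normal sample with unknown mean and variance. The sketch below (standard-library Python, hypothetical function names) shows that the log-likelihood can be computed from the sample size together with the two sums Σx and Σx² alone, whatever the sample size.

```python
import math

def loglik_full(xs, mu, sigma):
    """Normal log-likelihood computed from every observation."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def sufficient_stat(xs):
    """Two-component statistic (sum of x, sum of x^2), whatever len(xs) is."""
    return sum(xs), sum(x * x for x in xs)

def loglik_from_stat(stat, n, mu, sigma):
    """Same log-likelihood using only the statistic and the sample size n."""
    s1, s2 = stat
    # sum of (x - mu)^2 expands to s2 - 2 * mu * s1 + n * mu^2
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - (s2 - 2 * mu * s1 + n * mu ** 2) / (2 * sigma ** 2))
```

Two samples of the same size that share the same sums therefore give identical likelihood functions for (μ, σ).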

Bayesian estimation: conjugate distributions

Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there exists a conjugate prior, which is often also in the exponential family. A conjugate prior \pi for the parameter \boldsymbol\eta of an exponential family is given by

p_\pi(\boldsymbol\eta|\boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) \exp(\boldsymbol\eta^{\top} \boldsymbol\chi - \nu\, A(\boldsymbol\eta)),

or equivalently

p_\pi(\boldsymbol\eta|\boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\top} \boldsymbol\chi),

where \boldsymbol\chi \in \mathbb{R}^s (where s is the dimension of \boldsymbol\eta) and \nu>0 are hyperparameters (parameters controlling parameters). \nu corresponds to the effective number of observations that the prior distribution contributes, and \boldsymbol\chi corresponds to the total amount that these pseudo-observations contribute to the sufficient statistic over all observations and pseudo-observations. f(\boldsymbol\chi,\nu) is a normalization constant that is automatically determined by the remaining functions and serves to ensure that the given function is a probability density function (i.e. it is normalized). A(\boldsymbol\eta) and equivalently g(\boldsymbol\eta) are the same functions as in the definition of the distribution over which \pi is the conjugate prior.

A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the theta parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution. It can however be represented by using a mixture density as the prior, here a combination of two beta distributions; this is a form of hyperprior.
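
The beta-binomial case mentioned above can be sketched in a few lines of standard-library Python. The function names and the grid size are hypothetical. With a Beta(a, b) prior and x successes in n trials, the posterior is taken to be Beta(a + x, b + n − x), and the code compares this closed form with a posterior obtained by multiplying prior and likelihood on a grid and normalising numerically.

```python
import math

def beta_pdf(p, a, b):
    """Beta(a, b) density at p in (0, 1), via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def posterior_params(a, b, x, n):
    """Conjugate update: Beta(a, b) prior, x successes in n trials."""
    return a + x, b + n - x

def max_gap(a, b, x, n, m=2000):
    """Largest difference between the grid posterior and the closed form."""
    ps = [(i + 0.5) / m for i in range(m)]           # midpoints of m cells
    unnorm = [beta_pdf(p, a, b) * p ** x * (1 - p) ** (n - x) for p in ps]
    z = sum(unnorm) / m                              # midpoint-rule normaliser
    a_post, b_post = posterior_params(a, b, x, n)
    return max(abs(u / z - beta_pdf(p, a_post, b_post)) for u, p in zip(unnorm, ps))
```

The binomial coefficient is omitted from the likelihood because it does not depend on p and cancels on normalisation.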

An arbitrary likelihood will not belong to the exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.

Hypothesis testing: Uniformly most powerful tests

The one-parameter exponential family has a monotone non-decreasing likelihood ratio in the sufficient statistic T(x), provided that η(θ) is non-decreasing. As a consequence, there exists a uniformly most powerful test for testing the hypothesis H_0: θ ≥ θ_0 vs. H_1: θ < θ_0.

Generalized linear models

The exponential family forms the basis for the distribution function used in generalized linear models, a class of models that encompasses many of the commonly used regression models in statistics.

References

  1. Andersen, Erling (September 1970). "Sufficiency and Exponential Families for Discrete Sample Spaces". Journal of the American Statistical Association 65 (331): 1248–1255. doi:10.2307/2284291. JSTOR 2284291. MR268992.
  2. Pitman, E.; Wishart, J. (1936). "Sufficient statistics and intrinsic accuracy". Mathematical Proceedings of the Cambridge Philosophical Society 32 (4): 567–579. doi:10.1017/S0305004100019307.
  3. Darmois, G. (1935). "Sur les lois de probabilités à estimation exhaustive" (in French). C.R. Acad. Sci. Paris 200: 1265–1266.
  4. Koopman, B. (1936). "On distributions admitting a sufficient statistic". Transactions of the American Mathematical Society 39 (3): 399–409. doi:10.2307/1989758. JSTOR 1989758. MR1501854.
  5. Kupperman, M. (1958). "Probabilities of Hypotheses and Information-Statistics in Sampling from Exponential-Class Populations". Annals of Mathematical Statistics 29 (2): 571–575. JSTOR 2237349.
