Differential entropy

Differential entropy (also referred to as continuous entropy) is a concept in information theory that attempts to extend the idea of (Shannon) entropy, a measure of the average surprisal of a random variable, to continuous probability distributions.

Definition

Let X be a random variable with a probability density function f whose support is a set \mathbb X. The differential entropy h(X) or h(f) is defined as

h(X) = -\int_\mathbb{X} f(x)\log f(x)\,dx.

As with its discrete analog, the units of differential entropy depend on the base of the logarithm, which is usually 2 (i.e., the units are bits). See logarithmic units for logarithms taken in different bases. Related concepts such as joint and conditional differential entropy and relative entropy are defined in a similar fashion. One must take care in trying to apply properties of discrete entropy to differential entropy, since probability density functions can be greater than 1. For example, Uniform(0,1/2) has differential entropy \int_0^\frac{1}{2} -2\log2\,dx=-\log2\,, which is negative.
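
As a quick illustration of the definition, the following sketch (not part of the original article; the helper name differential_entropy, the choice of SciPy's quad, and the parameter values are my own assumptions) integrates -f(x) log f(x) numerically for the Uniform(0, 1/2) density and confirms the result above: -log 2 nats, i.e. -1 bit.

    # Minimal numerical check of h(f) = -∫ f(x) log f(x) dx (illustrative sketch).
    import numpy as np
    from scipy import integrate

    def differential_entropy(pdf, lo, hi):
        """Numerically integrate -f(x) log f(x) over [lo, hi]; result is in nats."""
        integrand = lambda x: -pdf(x) * np.log(pdf(x)) if pdf(x) > 0 else 0.0
        value, _ = integrate.quad(integrand, lo, hi)
        return value

    uniform_pdf = lambda x: 2.0          # density of Uniform(0, 1/2)
    h_nats = differential_entropy(uniform_pdf, 0.0, 0.5)
    print(h_nats, h_nats / np.log(2))    # ≈ -0.693 nats = -1 bit; differential entropy can be negative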

The definition of differential entropy above can be motivated by partitioning the range of X into bins of length Δ with associated sample points iΔ within the bins, assuming f is Riemann integrable. This gives a quantized version of X, defined by X_\Delta = i\Delta if i\Delta \leq X < (i+1)\Delta, so that P(X_\Delta = i\Delta) \approx f(i\Delta)\Delta. Then the entropy of X_\Delta is

-\sum_i f(i\Delta)\Delta\log\left(f(i\Delta)\Delta\right) = -\sum_i f(i\Delta)\Delta\log f(i\Delta) - \sum_i f(i\Delta)\Delta\log \Delta.

The first term approximates the differential entropy, while the second term is approximately -\log\Delta, since \sum_i f(i\Delta)\Delta \approx \int f(x)\,dx = 1. Note that this procedure suggests that the differential entropy of a discrete random variable should be -\infty: for a discrete random variable the entropy of X_\Delta stays bounded as \Delta \to 0, so H(X_\Delta) + \log\Delta \to -\infty.
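
The binning argument can also be seen empirically. The sketch below (an illustration of my own, using samples from a standard normal rather than anything from the article) quantizes X into bins of width Δ, computes the discrete Shannon entropy of X_Δ, and shows that it behaves like h(X) - log Δ as Δ shrinks.

    # Discrete entropy of the quantized variable X_Δ grows like h(X) - log Δ (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.standard_normal(1_000_000)      # X ~ N(0,1), h(X) = 0.5*log(2*pi*e) ≈ 1.4189 nats
    h_true = 0.5 * np.log(2 * np.pi * np.e)

    for delta in (0.5, 0.1, 0.02):
        bins = np.floor(samples / delta)          # X_Δ = iΔ  when  iΔ <= X < (i+1)Δ
        _, counts = np.unique(bins, return_counts=True)
        p = counts / counts.sum()
        H_quantized = -np.sum(p * np.log(p))      # Shannon entropy of X_Δ, in nats
        # H_quantized + log Δ should stay close to h(X) while H_quantized itself diverges
        print(delta, H_quantized, H_quantized + np.log(delta), h_true)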

Note that the continuous mutual information I(X;Y) has the distinction of retaining its fundamental significance as a measure of discrete information since it is actually the limit of the discrete mutual information of partitions of X and Y as these partitions become finer and finer. Thus it is invariant under linear transformations of X and Y, and still represents the amount of discrete information that can be transmitted over a channel that admits a continuous space of values.[1]
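
To illustrate this limiting behaviour, here is a rough sketch (my own construction, not taken from the cited reference): a plug-in estimate of I(X;Y) from ever finer histogram partitions of jointly Gaussian data approaches the analytic value -½ log(1 - ρ²), and rescaling X leaves the estimate unchanged.

    # Plug-in mutual information from a 2-D histogram partition (illustrative sketch).
    import numpy as np

    def binned_mi(x, y, bins):
        """Estimate I(X;Y) in nats from a bins-by-bins histogram partition."""
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = joint / joint.sum()
        px = pxy.sum(axis=1, keepdims=True)       # marginal of X (column vector)
        py = pxy.sum(axis=0, keepdims=True)       # marginal of Y (row vector)
        mask = pxy > 0
        return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

    rng = np.random.default_rng(1)
    rho = 0.8
    x = rng.standard_normal(500_000)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(500_000)
    exact = -0.5 * np.log(1 - rho**2)             # I(X;Y) for a bivariate Gaussian

    for bins in (10, 30, 100):
        # finer partitions approach the continuous value; rescaling X does not change it
        print(bins, binned_mi(x, y, bins), binned_mi(3.0 * x, y, bins), exact)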

Properties of differential entropy

  • For two densities f and g, D(f||g) \geq 0 with equality if and only if f = g almost everywhere. Similarly, for two random variables X and Y, I(X;Y) \geq 0 and h(X|Y) \leq h(X), with equality if and only if X and Y are independent.
  • The chain rule for differential entropy holds as in the discrete case:
h(X_1, \ldots, X_n) = \sum_{i=1}^{n} h(X_i|X_1, \ldots, X_{i-1}) \leq \sum_{i=1}^{n} h(X_i).
  • Differential entropy is translation invariant, i.e., h(X + c) = h(X) for a constant c.
  • Differential entropy is in general not invariant under arbitrary invertible maps. In particular, for a constant a, h(aX) = h(X) + \log \left| a \right|, and for a vector-valued random variable \mathbf{X} and an invertible matrix A, h(A\mathbf{X}) = h(\mathbf{X}) + \log \left| \det A \right| (a numerical check of the translation and scaling properties appears after this list).
  • In general, for a transformation from a random vector \mathbf{X} to a random vector of the same dimension \mathbf{Y} = m(\mathbf{X}), with m invertible and differentiable, the corresponding entropies are related via h(\mathbf{Y}) = h(\mathbf{X}) + \int f(x) \log \left\vert \frac{\partial m}{\partial x} \right\vert dx, where \left\vert \frac{\partial m}{\partial x} \right\vert is the absolute value of the determinant of the Jacobian of the transformation m.
  • If a random vector \mathbf{X} \in \mathbb{R}^{n} has mean zero and covariance matrix K, then h(\mathbf{X}) \leq \frac{1}{2} \log[(2\pi e)^n \det{K}], with equality if and only if \mathbf{X} is jointly Gaussian.
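
As referenced in the scaling property above, the following sketch (my own spot check, using the closed form h(N(μ,σ²)) = ½ ln(2πeσ²) from the table below; parameter values are arbitrary) verifies translation invariance and the relation h(aX) = h(X) + log|a| for a normal random variable.

    # Spot check of translation invariance and h(aX) = h(X) + log|a| (illustrative sketch).
    import numpy as np

    def h_normal(sigma):
        """Differential entropy of N(mu, sigma^2) in nats (independent of mu)."""
        return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

    sigma, a = 2.0, 5.0
    h_X = h_normal(sigma)
    h_X_shifted = h_normal(sigma)          # X + c is N(mu + c, sigma^2): same entropy as X
    h_aX = h_normal(abs(a) * sigma)        # aX is N(a*mu, a^2 sigma^2)

    print(h_X_shifted - h_X)               # 0.0: translation invariance
    print(h_aX - h_X, np.log(abs(a)))      # both equal log|a|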

Example: Exponential distribution

Let X be an exponentially distributed random variable with parameter λ, that is, with probability density function

f(x) = \lambda e^{-\lambda x} \mbox{ for } x \geq 0.

Its differential entropy is then

h_e(X) = -\int_0^\infty \lambda e^{-\lambda x} \log (\lambda e^{-\lambda x})\,dx
= -\left(\int_0^\infty (\log \lambda)\lambda e^{-\lambda x}\,dx + \int_0^\infty (-\lambda x) \lambda e^{-\lambda x}\,dx\right)
= -\log \lambda \int_0^\infty f(x)\,dx + \lambda E[X]
= -\log\lambda + 1\,.

Here, h_e(X) was used rather than h(X) to make it explicit that the logarithm was taken to base e, which simplifies the calculation.
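
The closed form above is easy to confirm numerically. The sketch below (my own check; the choice λ = 2.5 and the use of SciPy's quad are assumptions) integrates -f log f over [0, 100], where the remaining tail is negligible, and compares the result with 1 - ln λ.

    # Numerical confirmation that h_e(X) = 1 - log(lambda) for the exponential density (sketch).
    import numpy as np
    from scipy import integrate

    lam = 2.5
    pdf = lambda x: lam * np.exp(-lam * x)
    integrand = lambda x: -pdf(x) * np.log(pdf(x))

    h_numeric, _ = integrate.quad(integrand, 0, 100)   # tail beyond x = 100 is negligible here
    print(h_numeric, 1 - np.log(lam))                  # both ≈ 0.0837 nats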

Differential entropies for various distributions

In the table below, \Gamma(x) = \int_0^{\infty} e^{-t} t^{x-1} dt is the gamma function, \psi(x) = \frac{d}{dx} \ln\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)} is the digamma function, B(p,q) = \frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)} is the beta function, and γ is the Euler–Mascheroni constant.

Table of differential entropies.

Distribution name | Probability density function (pdf) | Entropy in nats
Uniform | f(x) = \frac{1}{b-a} for a \leq x \leq b | \ln(b - a)
Normal | f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) | \ln\left(\sigma\sqrt{2\pi e}\right)
Exponential | f(x) = \lambda \exp\left(-\lambda x\right) | 1 - \ln \lambda
Rayleigh | f(x) = \frac{x}{b^2} \exp\left(-\frac{x^2}{2b^2}\right) | 1 + \ln \frac{b}{\sqrt{2}} + \frac{\gamma}{2}
Beta | f(x) = \frac{x^{p-1}(1-x)^{q-1}}{B(p,q)} for 0 \leq x \leq 1 | \ln B(p,q) - (p-1)[\psi(p) - \psi(p+q)] - (q-1)[\psi(q) - \psi(p+q)]
Cauchy | f(x) = \frac{\lambda}{\pi} \frac{1}{\lambda^2 + x^2} | \ln(4\pi\lambda)
Chi | f(x) = \frac{2}{2^{n/2} \sigma^n \Gamma(n/2)} x^{n-1} \exp\left(-\frac{x^2}{2\sigma^2}\right) | \ln\frac{\sigma\Gamma(n/2)}{\sqrt{2}} - \frac{n-1}{2} \psi\left(\frac{n}{2}\right) + \frac{n}{2}
Chi-squared | f(x) = \frac{1}{2^{n/2} \sigma^n \Gamma(n/2)} x^{\frac{n}{2}-1} \exp\left(-\frac{x}{2\sigma^2}\right) | \ln\left(2\sigma^{2}\Gamma\left(\frac{n}{2}\right)\right) - \left(1 - \frac{n}{2}\right)\psi\left(\frac{n}{2}\right) + \frac{n}{2}
Erlang | f(x) = \frac{\beta^n}{(n-1)!} x^{n-1} \exp(-\beta x) | (1-n)\psi(n) + \ln \frac{\Gamma(n)}{\beta} + n
F | f(x) = \frac{n_1^{\frac{n_1}{2}} n_2^{\frac{n_2}{2}}}{B(\frac{n_1}{2},\frac{n_2}{2})} \frac{x^{\frac{n_1}{2}-1}}{(n_2 + n_1 x)^{\frac{n_1+n_2}{2}}} | \ln\left(\frac{n_1}{n_2} B\left(\frac{n_1}{2},\frac{n_2}{2}\right)\right) + \left(1 - \frac{n_1}{2}\right)\psi\left(\frac{n_1}{2}\right) - \left(1 + \frac{n_2}{2}\right)\psi\left(\frac{n_2}{2}\right) + \frac{n_1+n_2}{2}\psi\left(\frac{n_1+n_2}{2}\right)
Gamma | f(x) = \frac{x^{\alpha-1} \exp(-\frac{x}{\beta})}{\beta^\alpha \Gamma(\alpha)} | \ln(\beta\Gamma(\alpha)) + (1-\alpha)\psi(\alpha) + \alpha
Laplace | f(x) = \frac{1}{2\lambda} \exp\left(-\frac{|x-\theta|}{\lambda}\right) | 1 + \ln(2\lambda)
Logistic | f(x) = \frac{e^{-x}}{(1+e^{-x})^2} | 2
Lognormal | f(x) = \frac{1}{\sigma x \sqrt{2\pi}} \exp\left(-\frac{(\ln x - m)^2}{2\sigma^2}\right) | m + \frac{1}{2}\ln(2\pi e\sigma^2)
Maxwell-Boltzmann | f(x) = 4\pi^{-\frac{1}{2}} \beta^{\frac{3}{2}} x^{2} \exp(-\beta x^2) | \frac{1}{2}\ln\frac{\pi}{\beta} + \gamma - \frac{1}{2}
Generalized normal | f(x) = \frac{2\beta^{\frac{\alpha}{2}}}{\Gamma(\frac{\alpha}{2})} x^{\alpha-1} \exp(-\beta x^2) | \ln\frac{\Gamma(\alpha/2)}{2\beta^{\frac{1}{2}}} - \frac{\alpha-1}{2}\psi\left(\frac{\alpha}{2}\right) + \frac{\alpha}{2}
Pareto | f(x) = \frac{a k^a}{x^{a+1}} | \ln\frac{k}{a} + 1 + \frac{1}{a}
Student's t | f(x) = \frac{(1 + x^2/n)^{-\frac{n+1}{2}}}{\sqrt{n}\,B(\frac{1}{2},\frac{n}{2})} | \frac{n+1}{2}\left[\psi\left(\frac{n+1}{2}\right) - \psi\left(\frac{n}{2}\right)\right] + \ln\left(\sqrt{n}\,B\left(\frac{1}{2},\frac{n}{2}\right)\right)
Triangular | f(x) = \begin{cases} \frac{2x}{a} & 0 \leq x \leq a \\ \frac{2(1-x)}{1-a} & a \leq x \leq 1 \end{cases} | \frac{1}{2} - \ln 2
Weibull | f(x) = \frac{c}{\alpha} x^{c-1} \exp\left(-\frac{x^c}{\alpha}\right) | \frac{(c-1)\gamma}{c} + \ln\frac{\alpha^{1/c}}{c} + 1
Multivariate normal | f_X(x_1,\dots,x_N) = \frac{1}{(2\pi)^{N/2}\left|\Sigma\right|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right) | \frac{1}{2}\ln\left[(2\pi e)^{N}\det(\Sigma)\right]
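
A few of the rows above can be cross-checked against SciPy, whose continuous distributions report differential entropy in nats via their entropy() method; the parameter values below are arbitrary choices of mine, not from the article.

    # Cross-check of selected table rows against scipy.stats (illustrative sketch).
    import numpy as np
    from scipy import stats

    sigma, lam, b = 1.7, 2.5, 0.9
    euler_gamma = np.euler_gamma

    checks = [
        ("Normal",      stats.norm(scale=sigma).entropy(),     np.log(sigma * np.sqrt(2 * np.pi * np.e))),
        ("Exponential", stats.expon(scale=1 / lam).entropy(),  1 - np.log(lam)),
        ("Rayleigh",    stats.rayleigh(scale=b).entropy(),     1 + np.log(b / np.sqrt(2)) + euler_gamma / 2),
    ]
    for name, scipy_value, table_value in checks:
        print(name, float(scipy_value), float(table_value))    # the two columns should agree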

References

  1. Fazlollah M. Reza (1994, first published 1961). An Introduction to Information Theory. Dover Publications, New York. ISBN 0-486-68210-2.
