Kullback's inequality

In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function.[1] If P and Q are probability distributions on the real line such that P is absolutely continuous with respect to Q, i.e. P \ll Q, and their first moments exist, then

D_{KL}(P\|Q) \ge \Psi_Q^*(\mu'_1(P)),

where \Psi_Q^* is the rate function, i.e. the convex conjugate of the cumulant-generating function, of Q, and \mu'_1(P) is the first moment of P.

The Cramér–Rao bound is a corollary of this result.
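
As an illustration (not taken from the cited source), the bound can be worked out in closed form when Q is the standard normal distribution and P is normal with mean m and standard deviation s. The symbols m and s are introduced here only for this example.

```latex
\Psi_Q(\theta) = \tfrac{1}{2}\theta^2, \qquad
\Psi_Q^*(\mu) = \sup_\theta \left\{ \mu\theta - \tfrac{1}{2}\theta^2 \right\} = \tfrac{1}{2}\mu^2,

D_{KL}(P\|Q) = \tfrac{1}{2}\left(s^2 + m^2 - 1\right) - \log s
   \ge \tfrac{1}{2} m^2 = \Psi_Q^*(\mu'_1(P)).
```

In this case the inequality reduces to \log s \le (s^2 - 1)/2, which holds for every s > 0, with equality exactly when s = 1.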

Proof

Let P and Q be probability distributions (measures) on the real line, whose first moments exist, and such that P \ll Q. Consider the natural exponential family of Q given by

Q_\theta(A) = \frac{\int_A e^{\theta x}Q(dx)}{\int_{-\infty}^\infty e^{\theta x}Q(dx)}
   = \frac{1}{M_Q(\theta)} \int_A e^{\theta x}Q(dx)

for every measurable set A, where M_Q is the moment-generating function of Q. (Note that Q_0 = Q.) Then

D_{KL}(P\|Q) = D_{KL}(P\|Q_\theta)
   + \int_{\mathrm{supp}P}\left(\log\frac{\mathrm dQ_\theta}{\mathrm dQ}\right)\mathrm dP.

By Gibbs' inequality we have D_{KL}(P\|Q_\theta) \ge 0 so that

D_{KL}(P\|Q) \ge
   \int_{\mathrm{supp}P}\left(\log\frac{\mathrm dQ_\theta}{\mathrm dQ}\right)\mathrm dP
 = \int_{\mathrm{supp}P}\left(\log\frac{e^{\theta x}}{M_Q(\theta)}\right) P(dx)

Simplifying the right side, we have, for every real θ where M_Q(\theta) < \infty:

D_{KL}(P\|Q) \ge \mu'_1(P) \theta - \Psi_Q(\theta),

where \mu'_1(P) is the first moment, or mean, of P, and \Psi_Q = \log M_Q is called the cumulant-generating function. Since this bound holds for every such θ, taking the supremum over θ gives the convex conjugate of \Psi_Q, which is the rate function:

D_{KL}(P\|Q) \ge \sup_\theta \left\{ \mu'_1(P) \theta - \Psi_Q(\theta) \right\}
   = \Psi_Q^*(\mu'_1(P)).
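
The supremum in the last step can also be approximated numerically. The following is a minimal sketch, assuming Q is the exponential distribution with rate 1, for which \Psi_Q(\theta) = -\log(1-\theta) when \theta < 1. The function names, the search bracket, and the two choices of P (an exponential distribution with rate 2 and a uniform distribution on [0, 1], both with mean 1/2) are illustrative assumptions, and the divergences are the standard closed forms for these pairs.

```python
import math

def psi_exp1(theta):
    """Cumulant-generating function of Q = Exp(1); finite only for theta < 1."""
    return -math.log(1.0 - theta)

def rate_function(mu, lo=-50.0, hi=1.0 - 1e-9, iters=200):
    """Approximate sup_theta {mu*theta - psi_exp1(theta)} by ternary search.

    The objective is concave in theta, so ternary search applies.
    The bracket [lo, hi] is an assumption that suits moderate values of mu.
    """
    def objective(theta):
        return mu * theta - psi_exp1(theta)

    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
    return objective(0.5 * (lo + hi))

def kl_exp_vs_exp1(lam):
    """D_KL(Exp(lam) || Exp(1)) in closed form."""
    return math.log(lam) + 1.0 / lam - 1.0

def kl_uniform_vs_exp1(a):
    """D_KL(Uniform(0, a) || Exp(1)) in closed form."""
    return a / 2.0 - math.log(a)

# Both choices of P below have mean 0.5, so they share the same lower bound.
bound = rate_function(0.5)
print("Exp(2):      ", kl_exp_vs_exp1(2.0), ">=", bound)
print("Uniform(0,1):", kl_uniform_vs_exp1(1.0), ">=", bound)
```

The exponential P belongs to the natural exponential family of Q, so D_{KL}(P\|Q_\theta) vanishes for the matching θ and the bound is attained. The uniform P does not belong to that family, and its divergence lies strictly above the bound.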

Corollary: the Cramér–Rao bound

Start with Kullback's inequality

Let X_\theta be a family of probability distributions on the real line indexed by the real parameter θ, and satisfying certain regularity conditions. Then

 \lim_{h\rightarrow 0} \frac {D_{KL}(X_{\theta+h}\|X_\theta)} {h^2}
    \ge \lim_{h\rightarrow 0} \frac {\Psi^*_\theta (\mu_{\theta+h})}{h^2},

where \Psi^*_\theta is the convex conjugate of the cumulant-generating function of X_\theta and \mu_{\theta+h} is the first moment of X_{\theta+h}.

Left side

The left side of this inequality can be simplified as follows:

\lim_{h\rightarrow 0}
       \frac {D_{KL}(X_{\theta+h}\|X_\theta)} {h^2}
      =\lim_{h\rightarrow 0}
       \frac 1 {h^2}
       \int_{-\infty}^\infty \left( \log\frac{\mathrm dX_{\theta+h}}{\mathrm dX_\theta} \right)
       \mathrm dX_{\theta+h}
  = \lim_{h\rightarrow 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[
            \left( 1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right)
 +\frac 1 2 \left( 1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right) ^ 2
 + o \left( \left( 1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right) ^ 2 \right)
          \right]\mathrm dX_{\theta+h},
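
where we have expanded the logarithm \log x in a Taylor series in 1-1/x, with x = \mathrm dX_{\theta+h}/\mathrm dX_\theta. The first-order term integrates to zero because X_\theta and X_{\theta+h} both have total mass 1, and under the regularity conditions the remainder term does not contribute in the limit, so the limit equals

  \lim_{h\rightarrow 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[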
  \frac 1 2 \left( 1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right) ^ 2
          \right]\mathrm dX_{\theta+h}

         = \lim_{h\rightarrow 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[
  \frac 1 2 \left( \frac{\mathrm dX_{\theta+h} - \mathrm dX_\theta}{\mathrm dX_{\theta+h}} \right) ^ 2
          \right]\mathrm dX_{\theta+h}
 = \frac 1 2 \mathcal I_X(\theta),

which is half the Fisher information of the parameter θ.
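
The limit on the left side can be checked numerically for a concrete family. The sketch below assumes a Laplace location family with location θ and fixed scale b, which is not part of the source. It uses the standard closed form for the divergence between two Laplace distributions with equal scale and the commonly quoted Fisher information 1/b^2 for the location parameter. The function name is hypothetical.

```python
import math

def kl_laplace_shift(h, b):
    """D_KL(Laplace(theta + h, b) || Laplace(theta, b)).

    For equal scales the divergence depends only on the shift h:
    |h|/b + exp(-|h|/b) - 1. expm1 avoids cancellation for small h.
    """
    d = abs(h) / b
    return d + math.expm1(-d)

b = 1.5                      # illustrative scale parameter
half_fisher = 0.5 / b**2     # half the Fisher information of the location

# The ratio D_KL / h^2 should approach half_fisher as h shrinks.
for h in (1e-1, 1e-2, 1e-3):
    print(h, kl_laplace_shift(h, b) / h**2, half_fisher)
```

The Laplace density is not differentiable at its mode, so this family only loosely fits the regularity conditions, but the computed ratios still approach half the Fisher information.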

Right side

The right side of the inequality can be developed as follows:


  \lim_{h\rightarrow 0} \frac {\Psi^*_\theta (\mu_{\theta+h})}{h^2}
= \lim_{h\rightarrow 0} \frac 1 {h^2} {\sup_t \{\mu_{\theta+h}t - \Psi_\theta(t)\} }.

This supremum is attained at t = τ, where the first derivative of the cumulant-generating function satisfies \Psi'_\theta(\tau) = \mu_{\theta+h}. Since \Psi'_\theta(0) = \mu_\theta, comparing \Psi'_\theta(\tau) - \Psi'_\theta(0) with \mu_{\theta+h} - \mu_\theta to first order in τ and h gives

\Psi''_\theta(0) = \frac{d\mu_\theta}{d\theta} \lim_{h \rightarrow 0} \frac h \tau.

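Moreover, expanding \Psi_\theta to second order about 0 gives \Psi^*_\theta(\mu_{\theta+h}) = \mu_{\theta+h}\tau - \Psi_\theta(\tau) = \tfrac 1 2 \Psi''_\theta(0)\tau^2 + o(\tau^2), so that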

\lim_{h\rightarrow 0} \frac {\Psi^*_\theta (\mu_{\theta+h})}{h^2}
   = \frac 1 {2\Psi''_\theta(0)}\left(\frac {d\mu_\theta}{d\theta}\right)^2
   = \frac 1 {2\mathrm{Var}(X_\theta)}\left(\frac {d\mu_\theta}{d\theta}\right)^2.
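
The right-side limit can be checked in the same spirit. The sketch below assumes a Laplace location family with location θ and scale b (an illustrative choice, not from the source), for which \Psi_\theta(t) = \theta t - \log(1 - b^2 t^2) on |t| < 1/b, the variance is 2b^2, and d\mu_\theta/d\theta = 1. The supremum is approximated by ternary search, and the function name is hypothetical.

```python
import math

def rate_laplace_shift(h, b, iters=200):
    """Approximate Psi*_theta(mu_{theta+h}) for a Laplace(theta, b) family.

    Here mu_{theta+h} * t - Psi_theta(t) = h*t + log(1 - (b*t)**2),
    which is concave on |t| < 1/b, so ternary search applies.
    """
    def objective(t):
        return h * t + math.log1p(-(b * t) ** 2)

    lo = -(1.0 - 1e-9) / b
    hi = (1.0 - 1e-9) / b
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        c = hi - (hi - lo) / 3.0
        if objective(a) < objective(c):
            lo = a
        else:
            hi = c
    return objective(0.5 * (lo + hi))

b = 1.5                        # illustrative scale parameter
variance = 2.0 * b**2          # variance of a Laplace distribution with scale b
target = 1.0 / (2.0 * variance)  # (d mu / d theta)^2 / (2 Var), with d mu / d theta = 1

# The ratio Psi* / h^2 should approach the target as h shrinks.
for h in (1e-1, 1e-2, 1e-3):
    print(h, rate_laplace_shift(h, b) / h**2, target)
```

For this family the limit is 1/(4b^2), which is strictly smaller than half the commonly quoted Fisher information, 1/(2b^2), so the resulting bound is not attained.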

Putting both sides back together

We have:

\frac 1 2 \mathcal I_X(\theta)
   \ge \frac 1 {2\mathrm{Var}(X_\theta)}\left(\frac {d\mu_\theta}{d\theta}\right)^2,

which can be rearranged as:

\mathrm{Var}(X_\theta) \ge \frac{(d\mu_\theta / d\theta)^2} {\mathcal I_X(\theta)}.

Notes and references

  1. Fuchs, Aimé; Letta, Giorgio (1970). L'inégalité de Kullback. Application à la théorie de l'estimation. Séminaire de probabilités 4. Strasbourg. pp. 108–131.