Talk:Information entropy

From Wikipedia, the free encyclopedia

This article incorporates material from PlanetMath, which is licensed under the GFDL.

Mathematics Portal

This article is within the scope of WikiProject Mathematics, which collaborates on articles related to mathematics.

Mathematics rating:

B Class

High Priority

Field: Applied mathematics

One of the 500 most frequently viewed mathematics articles.

Please update this rating as the article progresses, or if the rating is inaccurate.
Please add to or update the comments to suggest improvements to the article.
--Cronholm¹⁴⁴ 03:59, 16 June 2007 (UTC)

This article is within the scope of WikiProject Physics, which collaborates on articles related to physics.
This article has not yet received a rating on the assessment scale.
This article has not yet received an importance rating within physics.
Comments: --Cronholm¹⁴⁴ 03:59, 16 June 2007 (UTC)

This article is also assessed within the mathematics field Probability and statistics.

1 Archives
2 minimum
3 log probability
4 Units and the Continuous Case
5 Extending discrete entropy to the continuous case: differential entropy
6 Roulette Example
7 moved to talk page because wikipedia is not a textbook
- 7.1 Derivation of Shannon's entropy
8 H(X), H(Ω), and the word 'outcome'
9 Sorry, I don't get it
10 Compression of English Text
11 Entropy of English text
12 Boltzmann's lectures on entropy
13 log basis
14 Mistake inside an external reference
15 Looking for reference
16 Units in the continuous case
17 Entropy vs Entropy Rate
18 Uncertainty

Archives

Archive: Talk:Information entropy/Archive1 start - Dec 2005

minimum

The statement at the end of the second paragraph is simply not true: "the shortest number of bits necessary to transmit the message is the Shannon entropy in bits/symbol multiplied by the number of symbols in the original message." -- the formula of (bit/symbol * number of symbols) does not give the entropy when multiplied by the number of symbols in the original message! The original should be replaced with something like the "shortest possible representation".—Preceding unsigned comment added by 139.149.31.232 (talk • contribs)

log probability

The article Perplexity says that information entropy is "also called" log probability. Is it true that they're the same thing? If so, a mention or brief discussion of this in the article might be appropriate. dbtfz ^talk 01:34, 20 April 2006 (UTC)

Units and the Continuous Case

The extension to the continuous case has a subtle problem: the distribution f(x) has units of inverse length and the integral contains "log f(x)" in it. Logarithms should be taken on dimensionless quantities (quantities without units). Thus, the logarithm should be of the ratio of f(x) to some characteristic length L. Something like log [ f(x) / L ] would be more proper.

The problem with taking a transcendental function of a quantity with units arises from the way we define arithmetic operations for quantities with units. 5 m + 2 m is defined (5 m + 2 m = 7 m) but 5 m + 2 kg is not defined because the units are different among the quantities to be added. Transcendental functions (such as logarithms), of a variable x with units, present problems for determining the resulting units of the results of the functions of x. This is why scientists and engineers try to form ratios of quantities in which all the units cancel, and then apply transcendental functions to these ratios rather than the original quantities. As an example, in exp[-E/(kT)] the constant k has the proper units for canceling the units of energy E and temperature T so units cancel in the quantity E/(kT). Then the result of the operation, of a typical transcendental function on its dimensionless argument, is also dimensionless.

My suggested solution to the problem with the units raises another question: what choice of length L should be used in the expression log [ f(x) / L ]? I think any choice can work. —The preceding unsigned comment was added by 75.85.88.234 (talk) 18:06, 17 December 2006 (UTC).

For canceling the inverse unit of length (actually the inverse unit of x), there should appear a product of f(x) and a length L under the logarithm, i.e. log [ f(x) L ]. This would be, indeed, bizarre, as any length L would work - unless we are in the frame of quantum mechanics. In that case, we would simply use the smallest quantum-mechanically distinguishable value for L. If x is truly a length, then L could be Planck's length. But this is already too obfuscating for me. I would rather recommend concentrating on the discrete formula of entropy: S = - Sum [ p(i) log p(i) ]. Now, in the continuous case, the probability is infinitesimal and it is dP = f(x) dx. Thus, the exact transcription of the above formula with this probability would give S = - Sum [ f(x) dx log ( f(x) dx ) ]. Now Sum would become Integral and log ( f(x) dx ) is a functional which must take a form of L(x) dx. The worst problem now is that there are two dx under one integral. This problem appears in the above modified formula for S. This problem must be worked out somehow. Its source is in the product in the initial Shannon entropy.

If you want to work with continuous variables, you're on much stronger ground if you work with the relative entropy, ie the Kullback-Leibler distance from some prior distribution, rather than the Shannon entropy. This avoids all the problems of the infinities and the physical dimensionality; and often, when you think it through, you'll find that it may make a lot more sense philosophically in the context of your application, too. Jheald 19:32, 7 February 2007 (UTC)

Of course, the relative entropy is very good for the continuous case, but, unlike Shannon entropy, it is relative, as it needs a second distribution from which to depart. I was thinking of a formula that would give a good absolute entropy, similar to the Shannon entropy, for the continuous case. This is purely speculative, though. —The preceding unsigned comment was added by 193.254.231.71 (talk) 13:52, 8 February 2007 (UTC).

Extending discrete entropy to the continuous case: differential entropy

Q —The preceding unsigned comment was added by 193.254.231.71 (talk) 10:18, 12 February 2007 (UTC).

The last definition of the differential entropy (second last formula) seems to malfunction. Actually, it should read

h[f] = lim (Delta -> 0) [ H^Delta + log Delta * Sum [ f(xi) Delta ] ]

This would ensure the complete canceling of the second sum in H^Delta. With the current formula, there would remain a non-canceling term:

h[f] = lim (Delta -> 0) [ H^Delta + log Delta ] = - Integral[ f(x) log f(x) dx ] - lim (Delta -> 0) [ log Delta * ( Sum [ f(xi) Delta ] - 1 ) ] .

The last limit does not go to zero. Actually, through a l'Hopital applied to (1-Sum) / (1/log Delta) , it would go to

- lim (Delta -> 0) [ Delta (log Delta)^2 Sum[f(xi)] ],

and, as Delta -> 0, Sum[f(xi)] -> infinity as 1/Delta (since Sum[f(xi) Delta] -> 1), so it would cancel the first Delta in the limit above, and there would be only

- lim (Delta -> 0) [ (log Delta)^2 ] -> - infinity

Thus, the last definition of h[f] could not even be used. I recommend checking this against a reliable source and then, if that formula turns out to be wrong, removing it. Unfortunately, I do not (yet) know how formulas are written in Wikipedia.
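One way to probe the question numerically is sketched below. It assumes a standard normal density, bins of width Delta on a regular grid truncated at ±10, and bin probabilities approximated by f(x_i) Delta at the bin midpoints; none of these choices come from the article, and the function names are made up for illustration. It prints H^Delta + log2(Delta) for a few bin widths next to the closed-form differential entropy of a unit Gaussian, 0.5 log2(2 pi e) bits, so the two can be compared directly.

```python
import math


def gaussian_pdf(x):
    """Standard normal density (an assumed test distribution)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def discretized_entropy_bits(delta, half_width=10.0):
    """H^Delta = -sum p_i log2 p_i with p_i approximated by f(x_i) * delta."""
    n_bins = int(round(2.0 * half_width / delta))
    h = 0.0
    for i in range(n_bins):
        x_mid = -half_width + (i + 0.5) * delta  # midpoint of bin i
        p = gaussian_pdf(x_mid) * delta
        if p > 0.0:
            h -= p * math.log2(p)
    return h


def corrected_entropy_bits(delta):
    """H^Delta + log2(Delta), the quantity whose limit is under discussion."""
    return discretized_entropy_bits(delta) + math.log2(delta)


# Closed-form differential entropy of a unit-variance Gaussian, in bits.
EXACT_BITS = 0.5 * math.log2(2.0 * math.pi * math.e)

if __name__ == "__main__":
    for delta in (0.5, 0.1, 0.01):
        print(delta, corrected_entropy_bits(delta), EXACT_BITS)
```

For this particular density the midpoint sum of f(x_i) Delta is extremely close to 1, so the experiment says nothing about densities or grids where that sum deviates noticeably from 1.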

Roulette Example

In the roulette example, the entropy of a combination of numbers hit over P spins is defined as Omega/T, but the entropy is given as lg(Omega), which then calculates to the Shannon definition. Why is lg(Omega) used? (Note: I'm using the notation "lg" to denote "log base 2") 66.151.13.191 20:41, 31 March 2006 (UTC)

moved to talk page because wikipedia is not a textbook

Derivation of Shannon's entropy

Since the entropy was given as a definition, it does not need to be derived. On the other hand, a "derivation" can be given which gives a sense of the motivation for the definition as well as the link to thermodynamic entropy.

Q. Given a roulette with n pockets which are all equally likely to be landed on by the ball, what is the probability of obtaining a distribution (A₁, A₂, …, A_n) where A_i is the number of times pocket i was landed on and

$P = \sum_{i=1}^n A_i \,\!$

is the total number of ball-landing events?

A. The probability is a multinomial distribution, viz.

$p = {\Omega \over \Tau} = {P! \over A_1! \ A_2! \ A_3! \ \cdots \ A_n!} \left(\frac1n\right)^P \,\!$

where

$\Omega = {P! \over A_1! \ A_2! \ A_3! \ \cdots \ A_n!} \,\!$

is the number of possible combinations of outcomes (for the events) which fit the given distribution, and

$\Tau = n^P$

is the number of all possible combinations of outcomes for the set of P events.

Q. And what is the entropy?

A. The entropy of the distribution is obtained from the logarithm of Ω:

$H = \log \Omega = \log \frac{P!}{A_1! \ A_2! \ A_3! \cdots \ A_n!} \,\!$

$= \log P! - \log A_1! - \log A_2! - \log A_3! - \cdots - \log A_n!$

$= \sum_{i=1}^P \log i - \sum_{i=1}^{A_1} \log i - \sum_{i=1}^{A_2} \log i - \cdots - \sum_{i=1}^{A_n} \log i \,\!$

The summations can be approximated closely by being replaced with integrals:

$H = \int_1^P \log x \, dx - \int_1^{A_1} \log x \, dx - \int_1^{A_2} \log x \, dx - \cdots - \int_1^{A_n} \log x \, dx. \,\!$

The integral of the logarithm is

$\int \log x \, dx = x \log x - \int x \, {dx \over x} = x \log x - x. \,\!$

So the entropy is

$H = (P \log P - P + 1) - (A_1 \log A_1 - A_1 + 1) - (A_2 \log A_2 - A_2 + 1) - \cdots - (A_n \log A_n - A_n + 1)$

$= (P \log P + 1) - (A_1 \log A_1 + 1) - (A_2 \log A_2 + 1) - \cdots - (A_n \log A_n + 1)$

$= P \log P - \sum_{x=1}^n A_x \log A_x + (1 - n) \,\!$

By letting p_x = A_x/P, so that A_x log A_x = P p_x (log P + log p_x) and the P log P terms cancel, we obtain:

$H = (1 - n) - P \sum_{x=1}^n p_x \log p_x \,\!$

and the term (1 − n) can be dropped since it is a constant, independent of the p_x distribution. Dividing by the number of events P then gives the entropy per event:

$\frac{H}{P} = - \sum_{x=1}^n p_x \log p_x \,\!$ .

Thus, the Shannon entropy is a consequence of the equation

$H = \log \Omega$

which relates to Boltzmann's definition,

$\mathcal{S} = k \ln \Omega$ ,

of thermodynamic entropy, where k is the Boltzmann constant.

—The preceding unsigned comment was added by MisterSheik (talk • contribs) 17:34, 1 March 2007.
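A numerical sanity check of the derivation above, offered as a sketch and not as part of the original comment: it computes log Ω exactly through the log-gamma function, divides by the number of events P, and compares the result with −Σ p_x log p_x for the same counts. The pocket counts (in the ratio 1:2:3:4) and the helper names are arbitrary assumptions chosen for illustration.

```python
import math


def log_multinomial(counts):
    """Natural log of Omega = P! / (A_1! ... A_n!), via log-gamma."""
    total = sum(counts)
    return math.lgamma(total + 1) - sum(math.lgamma(a + 1) for a in counts)


def shannon_entropy_nats(counts):
    """-sum p_x ln p_x with p_x = A_x / P."""
    total = sum(counts)
    return -sum((a / total) * math.log(a / total) for a in counts if a > 0)


def per_event_gap(counts):
    """|log(Omega) / P - Shannon entropy| for the given counts."""
    total = sum(counts)
    return abs(log_multinomial(counts) / total - shannon_entropy_nats(counts))


if __name__ == "__main__":
    # Same proportions, increasingly many spins.
    for scale in (10, 1000, 100000):
        counts = [1 * scale, 2 * scale, 3 * scale, 4 * scale]
        print(sum(counts), per_event_gap(counts))
```

The gap printed in the second column shrinks as P grows, which is the sense in which the integral approximation and the dropped constant become harmless for large P.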

H(X), H(Ω), and the word 'outcome'

Recent edits to this page now stress the word "outcome" in the opening sentence:

information entropy is a measure of the average information content associated with the outcome [emphasised] of a random variable.

and have changed formulas like

$H(X)=-\sum_{i=1}^np(x_i)\log_2 p(x_i),\,\!$

to

$H(X) = -\sum_{\omega \in \Omega}p(\omega)\log_2 p(\omega)$

There appears to have been a confusion between two meanings of the word "outcome". Previously, the word was being used on these pages in a loose, informal, everyday sense to mean "the range of the random variable X" -- ie the set of values {x₁, x₂, x₃, ...} that might be revealed for X.

But "outcome" also has a technical meaning in probability, meaning the possible states of the universe {ω₁, ω₂, ω₃, ...}, which are then mapped down onto the states {x₁, x₂, x₃, ...} by the random variable X (considered to be a function mapping Ω -> R).

It is important that the mapping X may in general be many-to-one, so H(X) and H(Ω) are not in general the same. In fact we can say definitely that H(X) <= H(Ω), with equality holding only if the mapping is one-to-one over all subsets of Ω with non-zero measure (the "data processing theorem").

The correct equations are therefore

$H(X)=-\sum_{i=1}^np(x_i)\log_2 p(x_i),\,\!$

$H(\Omega) = -\sum_{\omega \in \Omega}p(\omega)\log_2 p(\omega)$

But in general the two are not the same. -- Jheald 11:37, 4 March 2007 (UTC).
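A small illustration of the distinction, as a sketch under assumed inputs: Ω is taken to be the six faces of a fair die and X the parity of the face, a many-to-one mapping. Neither the die nor the helper names appear in the article; they are only there to make H(X) <= H(Ω) concrete.

```python
import math
from collections import defaultdict


def entropy_bits(probabilities):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def pushforward(outcome_probs, mapping):
    """Distribution of X = mapping(omega), given probabilities on Omega."""
    dist = defaultdict(float)
    for omega, p in outcome_probs.items():
        dist[mapping(omega)] += p
    return dict(dist)


# Omega: faces of a fair die. X: parity of the face (many-to-one).
die = {face: 1.0 / 6.0 for face in range(1, 7)}
parity = pushforward(die, lambda face: face % 2)

h_omega = entropy_bits(die.values())   # log2(6) bits
h_x = entropy_bits(parity.values())    # 1 bit

if __name__ == "__main__":
    print("H(Omega) =", h_omega, "H(X) =", h_x)
```

Replacing the parity map with a one-to-one relabelling of the faces would make the two entropies coincide.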

Sorry, I don't get it

Self-information of an event is a number, right? Not a random variable. Yes?

So how can entropy be the expectation of self-information? I sort-of understand where the formula is coming from, but it doesn't look theoretically sound... Thanks. 83.67.217.254 13:19, 4 March 2007 (UTC)

Ok, maybe I understand. I(omega) is a number, but I(X) is itself a random variable. I have fixed the formula. 83.67.217.254 13:27, 4 March 2007 (UTC)

Uh-oh, what have I done? "Failed to parse (Missing texvc executable; please see math/README to configure.)" Could you please fix? Thank you. 83.67.217.254 13:30, 4 March 2007 (UTC)

Compression of English Text

If I take the text of the book "Uncle Tom's Cabin", http://etext.lib.virginia.edu/etcbin/toccer-new2?id=StoCabi.sgm&images=images/modeng&data=/texts/english/modeng/parsed&tag=public&part=all , it's about a megabyte of text. If I compress it using winzip I get 395K bytes. bzip2: 295KB. paq8l 235KB. This isn't normal English text, but I think you get the idea. Daniel.Cardenas 19:06, 13 May 2007 (UTC)

Compression software does give a nice rule-of-thumb entropy estimate, but in this case the actual entropy is a lot lower because compression software designed for general-purpose use doesn't have the extensive knowledge of the language that allows humans to see more redundancy in the text. More rigorous experiments usually show lower entropy rates for English, typically between 1.0 and 1.5 bits per character, as described in the reference I've added. 129.97.79.144 19:23, 21 May 2007 (UTC)

Thanks, that was a good one. :-) Daniel.Cardenas 19:35, 21 May 2007 (UTC)
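The rule-of-thumb estimate discussed above can be sketched with Python's standard `zlib` and `bz2` modules: compressed size in bits divided by the number of characters gives an upper bound on the entropy rate under that compressor's model. The function names and the placeholder sample string are assumptions for illustration; a real experiment would read the book's text from a file. Taking the original as exactly 1,000,000 bytes and K as 1000 bytes (both assumptions), the sizes quoted above work out to about 3.2, 2.4 and 1.9 bits per character.

```python
import bz2
import zlib


def compressed_bits_per_char(text, compress):
    """Compressed size in bits divided by the number of characters."""
    raw = text.encode("utf-8")
    return 8.0 * len(compress(raw)) / len(text)


def estimate_entropy_bounds(text):
    """Upper-bound estimates (bits per character) from two compressors."""
    return {
        "zlib": compressed_bits_per_char(text, lambda b: zlib.compress(b, 9)),
        "bz2": compressed_bits_per_char(text, lambda b: bz2.compress(b, 9)),
    }


if __name__ == "__main__":
    # Placeholder input: a highly repetitive string compresses far better
    # than real prose, so substitute the actual text for a real estimate.
    sample = "the quick brown fox jumps over the lazy dog. " * 200
    print(estimate_entropy_bounds(sample))
```

Because general-purpose compressors model language only crudely, these figures bound the entropy rate from above and should not be read as measurements of it.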

Entropy of English text

The article currently says "The entropy of English text is between 1.0 and 1.5 bits per letter.". Shouldn't the entropy in question decrease as one discovers more and more patterns in the language, making a text more predictable? If so, I think it would be a good idea to be a little less precise, saying "The entropy of English text can be regarded as being between 1.0 and 1.5 bits per letter." or similar instead. —Bromskloss 11:43, 7 June 2007 (UTC)

No, that's like saying "The sum of 2 plus 2 can be regarded as 4." Entropy has a precise mathematical definition. It isn't just possible to "regard" it as having an exact value, it actually does have an exact value. At most it can be said that entropy is hard to measure, which (along with differences between receivers and in what's called "English") is the reason a range instead of a single value is given. It's true that knowing more about the language (i.e. having more ability to predict the text) decreases the entropy; the studies on which the referenced statement is based are generally assuming something like the average user of English. Anyway, the statement in the article is what's in the reference and it's not appropriate for us to second-guess it. 216.75.189.154 13:18, 26 June 2007 (UTC)

Boltzmann's lectures on entropy

Since entropy was formally introduced by Ludwig Boltzmann the article should refer to his work:

Boltzmann, Ludwig (1896, 1898). Vorlesungen über Gastheorie : 2 Volumes - Leipzig 1895/98 UB: O 5262-6. English version: Lectures on gas theory. Translated by Stephen G. Brush (1964) Berkeley: University of California Press; (1995) New York: Dover ISBN 0-486-68455-5

—The preceding unsigned comment was added by Algorithms (talk • contribs) 19:35, 7 June 2007.

log basis

Hmmm, this article seems to assume that logs must always be taken to base 2 - which is not the case. We can define entropy to whatever base we like (in coding it often makes things easier to define it to a base equal to the number of code symbols, which in computer science is typically 2). This leads to different units of measurement: bits vs. nats vs. hartleys.

The article should probably be modified to reflect this. HyDeckar 01:16, 13 June 2007 (UTC)
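To make the point about bases concrete, here is a minimal sketch that evaluates the same entropy in bits, nats and hartleys. The three-symbol distribution and the function name are arbitrary assumptions; the only fact relied on is that changing the base of the logarithm rescales the result by a constant factor.

```python
import math


def entropy(probabilities, base=2.0):
    """Shannon entropy of a distribution, with logarithms taken to `base`."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


dist = [0.5, 0.25, 0.25]           # assumed example distribution

bits = entropy(dist, 2)            # base 2  -> bits
nats = entropy(dist, math.e)       # base e  -> nats
hartleys = entropy(dist, 10)       # base 10 -> hartleys

if __name__ == "__main__":
    print(bits, nats, hartleys)
    # Conversion factors: 1 bit = ln(2) nats = log10(2) hartleys.
    print(bits * math.log(2), bits * math.log10(2))
```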

Mistake inside an external reference

Regarding the reference: Information is not entropy, information is not uncertainty ! - a discussion of the use of the terms "information" and "entropy".

The referenced article is mistaken. It refutes the claim that "information is proportional to physical randomness". However, the more random a system is the more information we need in order to describe it. I suggest we remove this reference.

—The preceding unsigned comment was added by 89.139.67.125 (talk) 07:32, 13 June 2007

I agree. That reference reads more like a rant than a discussion. Its author appears to lack some basic understanding of thermodynamic vs. information-theoretic entropy. The above comment is absolutely correct in that "the more random a system is the more information we need in order to describe it." 198.145.196.71 16:36, 25 September 2007 (UTC)

Looking for reference

I'm looking for reliable, hard references for the following phrase in the article:

"Shannon's entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc. See Markov chain."

I'm sorry if the above concept is a bit basic and present in basic textbooks. I have not studied the subject formally, but I may have to apply the entropy concept in a small analysis for my master's dissertation.

Units in the continuous case

I think there needs to be some explanation on the matter of units for the continuous case.

$H[f] = -\int_{-\infty}^{\infty} f(x) \log_2 f(x)\, dx,\quad$

f(x) will have the unit 1/x. Unless x is dimensionless, the unit of entropy will include the log of a unit, which is weird. This is a strong reason why it is more useful for the continuous case to use the relative entropy of a distribution, where the general form is the Kullback-Leibler divergence from the distribution to a reference measure m(x). It could be pointed out that a useful special case of the relative entropy is:

$H_{relative}[f] = -\int_{x_{min}}^{x_{max}} f(x) \log_2 (f(x)(x_{max}-x_{min}))\, dx,\quad$

which corresponds to a rectangular distribution of m(x) between x_min and x_max. It is the entropy of a general bounded signal, and it gives the entropy in bits.

Petkr 13:38, 6 October 2007 (UTC)
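The unit dependence can be seen numerically with a short sketch. It assumes an arbitrary bounded density, a linear ramp 2x/w² on [0, w], and evaluates both integrals with a midpoint rule, once with w = 1 (x in metres, say) and once with w = 1000 (the same signal in millimetres). The density, the units and the function names are illustrative assumptions, not anything taken from the article.

```python
import math


def ramp_pdf(x, width):
    """Density 2x / width^2 on [0, width]; an arbitrary bounded example."""
    return 2.0 * x / (width * width)


def differential_entropy_bits(width, steps=100000):
    """Midpoint-rule estimate of -integral f log2 f dx over [0, width]."""
    dx = width / steps
    total = 0.0
    for i in range(steps):
        f = ramp_pdf((i + 0.5) * dx, width)
        total -= f * math.log2(f) * dx
    return total


def relative_entropy_bits(width, steps=100000):
    """Midpoint-rule estimate of -integral f log2(f * (x_max - x_min)) dx."""
    dx = width / steps
    total = 0.0
    for i in range(steps):
        f = ramp_pdf((i + 0.5) * dx, width)
        total -= f * math.log2(f * width) * dx
    return total


h_m = differential_entropy_bits(1.0)       # x measured in metres
h_mm = differential_entropy_bits(1000.0)   # same signal, x in millimetres
r_m = relative_entropy_bits(1.0)
r_mm = relative_entropy_bits(1000.0)

if __name__ == "__main__":
    print("differential:", h_m, h_mm, "shift:", h_mm - h_m)
    print("relative:    ", r_m, r_mm)
```

Under these assumptions the differential entropy shifts by log2(1000) when the unit changes, while the relative form stays the same; with this sign convention it is non-positive, and it is zero only for the rectangular distribution itself.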

Entropy vs Entropy Rate

Not sure about the section `Limitations of entropy as information content'.

Quote: Consider a source that produces the string ABABABABAB... in which A is always followed by B and vice versa. If the probabilistic model considers individual letters as independent, the entropy rate of the sequence is 1 bit per character. But if the sequence is considered as "AB AB AB AB AB..." with symbols as two-character blocks, then the entropy rate is 0 bits per character. End quote.

The average number of bits needed to encode this string is zero (asymptotically).

Also, treating this as a Markov chain (order 1), we can see from the formula in http://en.wikipedia.org/wiki/Entropy_rate and also in this article that the entropy rate is 0.

Also, in the next paragraph, quote: However, if we use very large blocks, then the estimate of per-character entropy rate may become artificially low. End quote.

isn't the `per-character entropy rate' redundant? should be either the `per-character entropy' or the `entropy rate' —Preceding unsigned comment added by 71.137.215.129 (talk) 07:23, 16 January 2008 (UTC)
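The two figures in the quoted passage can be reproduced with a short sketch that estimates them empirically from a finite prefix of the ABAB... source. The prefix length (1000 characters) and the function names are assumptions for illustration; the first estimate treats letters as independent, the second is the conditional entropy of a letter given its predecessor, i.e. an order-1 Markov model.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy in bits of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def iid_entropy_per_char(text):
    """Entropy per character when letters are modelled as independent."""
    return entropy_bits(Counter(text))


def markov1_entropy_rate(text):
    """Empirical H(X_{k+1} | X_k) in bits per character."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    n_pairs = len(text) - 1
    rate = 0.0
    for (prev, _next), c in pair_counts.items():
        # Weight by the pair frequency; the log term is P(next | prev).
        rate -= (c / n_pairs) * math.log2(c / context_counts[prev])
    return rate


source = "AB" * 500   # finite prefix of the ABABAB... source

if __name__ == "__main__":
    print("independent letters:", iid_entropy_per_char(source))
    print("order-1 Markov:     ", markov1_entropy_rate(source))
```

Both functions return bits per character, so the difference between the two numbers comes from the model assumed for the source and not from the unit.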

Uncertainty

Since "uncertainty" (whatever that may mean) is used as a motivating factor in this article, it might be good to have a brief discussion about what is meant by "uncertainty." Should the reader simply assume the common definition of uncertainty? Or is there a specific technical meaning to this word that should be introduced? —Preceding unsigned comment added by 131.215.7.196 (talk) 19:41, 27 January 2008 (UTC)

The article states: "Equivalently, the Shannon entropy is a measure of the average information content the recipient is missing when he does not know the value of the random variable." This has also been interpreted as an uncertainty in a system, not a measure of the information.

This interpretation is valid if we are sending a message from a sender to a receiver along a noisy channel, which may make the message uncertain. But there is an alternative interpretation where information entropy is hardly a measure of uncertainty.

For instance, if we replace a generation of one billion individuals with Gaussian distributed quantitative characters in a large population with a new generation, the situation is quite different. This is like sending one billion different Gaussian distributed messages in parallel from parents to offspring. Every new message is a random (noisy) recombination of messages from two randomly chosen parents, for instance.

As I see it, there is by definition no uncertainty with respect to the survival of the parents, and a moment matrix of their characters may as well exist. Thus a Gaussian distribution may serve as a good approximation of the region of acceptability, A, determining the possible spread of parents along A. See also the article about "Entropy in thermodynamics ... [[1]]--Kjells (talk) 13:30, 8 June 2008 (UTC)