Talk:Huffman coding
basic example image is wrong?
hi! I think the example in the basic technique section, the one in the caption of the image, is wrong. It should be:
a3 | 110 |
a4 | 111 |
and not:
a3 | 111 |
a4 | 110 |
Both the image and the table are wrong. I think one can't mix the way the 0/1 is set. PedroCarvalho (talk) 06:23, 28 November 2007 (UTC)
- Well, you can mix the way 0/1 is set; the output code will still be valid, you just have to remember the convention when you decode! Anyway, it's better not to change the way 0/1 is set, so I have fixed both the image and the caption. Alessio Damato 13:59, 4 December 2007 (UTC)
- It still mixes the way 0/1 is set. In the first split it gives 0 to the least common (a1: 0.4) and 1 to the most common ({a2,a3,a4}: 0.6), but in the two other splits it gives 0 to the most common and 1 to the least common. -Thomas Nygreen (talk) 09:59, 9 May 2008 (UTC)
- Huffman code generators can select 0/1 in any manner they choose (including randomly) and still generate a valid Huffman code. Nevertheless the diagram ought to be consistent with the pseudocode. Dcoetzee 16:30, 9 May 2008 (UTC)
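A small illustration of that point (hypothetical Python, not from the article or the diagram): relabeling the 0/1 branch choices, here a global flip as one example of many valid relabelings, changes the codewords but not their lengths, and the result is still a prefix code.

    # Sketch: relabeling branches preserves codeword lengths and prefix-freeness.
    def is_prefix_free(codewords):
        cws = sorted(codewords)
        # in lexicographic order, a prefix would sit directly before an extension of it
        return all(not b.startswith(a) for a, b in zip(cws, cws[1:]))

    code = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}
    flipped = {s: cw.translate(str.maketrans("01", "10")) for s, cw in code.items()}

    assert is_prefix_free(code.values()) and is_prefix_free(flipped.values())
    assert all(len(code[s]) == len(flipped[s]) for s in code)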
older
The algorithm in "Basic technique" is ambiguous. It's not clear how the two queues are used. It seems that the second queue is never read from, only added to.
[BREAK INTO DISCUSSION: 30 July 2007 - Simply does not compute - I have been trying to make sense of this pseudocode for a couple of hours now. I am no slouch, but I just can't get the algorithm from this description. Somebody, please post something more intelligible, then remove this comment. - TropicalCoder (Google me) EXIT DISCUSSION]
~ I believe I may have found a better solution to the example "Sample-1". I may be wrong, but I found the algorithm ill explained in my notes, so I visited Wikipedia to better understand it... I tried to solve it in the way that seemed most logical to me: building a complete binary tree of appropriate depth and using the percentage costs as guides for choosing the boundaries.
What I got was this: there is one character with weight 30; one with weight 20; three with weight 15; one with weight 10; and three with weight 5, for a total sum of 120. Dividing each weight by 120 and using the percentage to define how much of the tree to take away, you get that 30/120 = 25%... and you go on like that... Using this technique I got:
Character | Weight | Coding |
---|---|---|
h | 30 | 00 |
e | 20 | 010 |
b | 15 | 011 |
d | 15 | 100 |
g | 15 | 101 |
a | 10 | 1100 |
c | 5 | 1101 |
f | 5 | 1110 |
i | 5 | 1111 |
Total cost is exactly the same, but the codeword lengths vary only between 2 and 4 bits, and #Basic_technique says that "It is generally beneficial to minimize the variance of codeword length." I believe that makes my solution better, so if anyone with more than a mere day's knowledge on the matter agrees, then by all means make the correction :) --nunocordeiro 02:02, 11 July 2006 (UTC)
- You got the right answer for the wrong reasons. You used Shannon-Fano coding, which in this case happens to yield a result identical (in terms of each symbol's weight) to the bottom-merge Huffman code, the Huffman code with the lowest variance and lowest maximum length (among optimal codes). In general, these are not the same, and Shannon-Fano is suboptimal. The example should be replaced by one that either yields only one Huffman code (again, in terms of symbol weights, so {0,1} is the same code as {1,0}) or explains bottom-merge Huffman coding. I'll put this on my to-do list, but if someone wants to fix this, go ahead. Calbaer 02:10, 27 July 2006 (UTC)
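For anyone who wants to check this claim, here is a short sketch (hypothetical Python, assuming the nine weights from the table above) that computes Huffman codeword lengths with weight ties broken toward earlier-created nodes, one common way of obtaining the bottom-merge (low-variance) code among the optimal codes.

    import heapq
    from itertools import count

    def bottom_merge_lengths(weights):
        # Heap items are (weight, creation_order, leaf_indices).  The creation
        # counter breaks weight ties in favor of older nodes (leaves before
        # packages), which tends toward the minimum-variance optimal code.
        tick = count()
        heap = [(w, next(tick), [i]) for i, w in enumerate(weights)]
        heapq.heapify(heap)
        lengths = [0] * len(weights)
        while len(heap) > 1:
            w1, _, s1 = heapq.heappop(heap)
            w2, _, s2 = heapq.heappop(heap)
            for i in s1 + s2:  # every leaf in the merged package sinks one level
                lengths[i] += 1
            heapq.heappush(heap, (w1 + w2, next(tick), s1 + s2))
        return lengths

    print(bottom_merge_lengths([30, 20, 15, 15, 15, 10, 5, 5, 5]))
    # -> [2, 3, 3, 3, 3, 4, 4, 4, 4], matching the 2-to-4-bit lengths in the table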
Removed "*Animation of the Huffman Algorithm: Algorithm Simulation by Simon Mescal" since the link is dead. If it comes back up move back in.
What is meant by "cost" in this article?
- Yeah, I was wondering too. Look what I found: [1]. Now I see. The cost is the cost to send each letter in the target alphabet. For example, if we send letters as in Morse with "." & "_", we will spend more on "_". Gnomz007 15:07, 16 Nov 2004 (UTC)
Or to qualify a little more: the letter S in Morse is three dots in rapid succession, i.e., three time periods. The letter O is three dashes, and each dash takes up at least two time periods, so at least six time periods in total. Therefore, sending O is at least twice as expensive (in time!) as sending an S. You would think that a character is always 8 bits long and so this has no meaning. However, most US English text only uses about 7 bits per character. If you find that no 8-bit characters are being used in the file -- or if you find a way to mark them as such -- you can compress the file down to 7/8 of its original size right there. Consider some of the Unicode encodings, where most characters take 8 bits, but if you want an extended character, you can use extra bytes if you flag them. --67.177.235.1 18:55, 4 October 2005 (UTC)
- Oh no! This is wrong, wrong, wrong (in the context of the article). For most applications, costs are probabilities, pure and simple. But they need not be, because they are just multiplicative factors (or, as the Basic technique section has it, weights) for the expression to be minimized. Adding to the confusion, the Huffman coding with unequal letter costs subsection uses cost in the way you two and the linked paper are discussing, as output costs! It seems as though an information theorist wrote most of the article but a computer scientist wrote the Problem definition section. Anyway, I think most people who read the entire article would come away very confused. Looks like it's time for a rewrite (or at least a find/replace of the word 'cost')... Calbaer 02:22, 27 July 2006 (UTC)
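To make the two senses concrete, a hypothetical snippet with made-up numbers: in the main body of the article, the "cost" being minimized is the weighted sum of codeword lengths, with the weights usually being probabilities; per-output-letter prices like Morse dot/dash times belong to the separate unequal-letter-cost variant.

    # Sense 1: cost as the expected codeword length, sum of p_i * l_i.
    probs = {"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05}
    code  = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}
    expected_length = sum(probs[s] * len(code[s]) for s in probs)
    print(expected_length)  # 0.4*1 + 0.35*2 + 0.2*3 + 0.05*3 = 1.85

    # Sense 2: cost per output letter, e.g. a Morse dash taking at least
    # twice as long as a dot, as in the comment above.
    letter_cost = {".": 1, "-": 2}
    def transmission_cost(cw):
        return sum(letter_cost[ch] for ch in cw)
    print(transmission_cost("..."), transmission_cost("---"))  # 3 vs 6: O costs twice S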
Back-story on the Discovery of Huffman Coding
Dr. David A. Huffman was the Chairperson of the Information and Computer Science Department at the University of California at Santa Cruz (UCSC) in the early 1970s. He taught an undergraduate course, "The Algebraic Foundations of Cybernetics", in 1971-73. During one lecture he taught "Minimum Redundancy Coding". A student noticed that in some textbooks it was called "Huffman Coding", and asked him if it was the same principle.
Dr. Huffman replied by relating an interesting story.
As a student at MIT, David Huffman had been working on an unsolved, challenging problem in Information Theory. Instead of studying for the final, he had decided to devote all his free time to work on this problem. The night before the final, with no solution yet, in a moment of despair, he threw his large stack of notes into the trash. He described the next moment as a flash of insight in which the solution occurred to him.
He went on to explain that he was still working on many difficult, interesting problems that, when solved, could end up in textbooks for centuries. He did not want such a "simple idea" as minimum redundancy coding to bear his name. Students wondered and discussed among themselves: was this humility, or hubris?
not one-to-one on finite messages
Huffman coding is not one-to-one for finite messages. Suppose the message consists entirely of the same character (e.g. a file full of 'aaaaaaaaaaaaaaaaaaaa...'); then that character would be assigned the empty bitstring, and the whole message would encode to the empty string. But such messages of different lengths would all encode to the empty string, so the information about the length is lost. Thinking about decoding such a message shows the same problem: you have an empty encoded message, but you also have a symbol whose code is the empty bitstring, so you don't know how many characters to write out.
This problem does not occur for infinite messages, of course, since for any given character there is only one infinite message consisting solely of that character.
Perhaps this should be mentioned in the article? --Spoon! (talk) 04:18, 12 January 2008 (UTC)
- This is not entirely correct. The weights in the Huffman algorithm represent probabilities, not observed frequencies: so you can only assign the empty bitstring to a character if you know that the character will occur with certainty. Hence this is a degenerate case, where the only information being transmitted is the length of the string. In practice, if all you have is finite sample data, then you need to use something akin to Laplace's rule of succession to assign probabilities. —Preceding unsigned comment added by Classicalecon (talk • contribs) 11:50, 11 February 2008 (UTC)
- Another thing to keep in mind is that the empty string is not part of a valid Huffman code. A valid Huffman code has no codeword of length less than one. If the most likely outcome is more than 40% likely, its codeword will only have one bit, but it cannot have less, even if it is 99.9999% likely. If it is 100% likely — or even just much more likely than 50% — then a practical compressor wouldn't use a Huffman code per symbol, but perhaps instead a run-length encoding on the entire sequence. If there are relatively few possible sequences, that run-length code could be a Huffman code. Calbaer (talk) 22:27, 11 February 2008 (UTC)
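A tiny sketch of that last point (hypothetical Python, not from the article): run-length encoding the whole sequence keeps exactly the length information that an empty codeword would destroy.

    # Run-length encode a message; the run counts preserve the message
    # length, which is the information a zero-length codeword would lose.
    def rle(msg):
        runs, i = [], 0
        while i < len(msg):
            j = i
            while j < len(msg) and msg[j] == msg[i]:
                j += 1
            runs.append((msg[i], j - i))
            i = j
        return runs

    print(rle("aaaaaaaa"))   # [('a', 8)]
    print(rle("aaaaaaab"))   # [('a', 7), ('b', 1)]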
Length-limited Huffman coding
The section on length-limited Huffman coding needs to be clarified, since finding the optimal length-limited code given the initial weight distribution is trivial. In what sense is this an unsolved problem? —Preceding unsigned comment added by Classicalecon (talk • contribs) 10:56, 11 February 2008 (UTC)
- How is it trivial? If the Huffman code has a length longer than the length limit, we have to throw away this code and find the optimal length-limited code using a different method, e.g., the Package-Merge algorithm. It might be trivial to find a length-limited code in general — a fixed-length code certainly limits length — but to find an optimal length-limited code is another matter. No method is known to solve this problem in linear or logarithmic time. Calbaer (talk) 22:27, 11 February 2008 (UTC)
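For the curious, here is a compact sketch of the Package-Merge idea mentioned above (hypothetical Python, assuming comparable weights; roughly O(nL) work, not a reference implementation). Each symbol contributes a "coin" at every level 1..L; pairs are packaged and merged upward, and the smallest 2(n-1) items at the top level determine each symbol's codeword length.

    def length_limited_lengths(weights, L):
        # Package-merge: codeword lengths for an optimal code in which no
        # codeword exceeds L bits.  Requires 2**L >= len(weights).
        n = len(weights)
        assert (1 << L) >= n, "L is too small to give every symbol a codeword"
        coins = sorted((w, (i,)) for i, w in enumerate(weights))
        packages = []
        for _ in range(L - 1):                     # levels L down to 2
            merged = sorted(coins + packages)
            packages = [(merged[k][0] + merged[k + 1][0],
                         merged[k][1] + merged[k + 1][1])
                        for k in range(0, len(merged) - 1, 2)]
        merged = sorted(coins + packages)          # level 1
        lengths = [0] * n
        for _, symbols in merged[:2 * (n - 1)]:    # select total face value n - 1
            for i in symbols:
                lengths[i] += 1
        return lengths

    print(length_limited_lengths([1, 1, 2, 4], 3))  # [3, 3, 2, 1], the unrestricted code
    print(length_limited_lengths([1, 1, 2, 4], 2))  # [2, 2, 2, 2], forced by the limit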
Basic technique?
The article describes the basic technique like so:
1. Start with as many leaves as there are symbols.
2. Enqueue all leaf nodes into the first queue (by probability in increasing order, so that the least likely item is at the head of the queue).
3. While there is more than one node in the queues:
   1. Dequeue the two nodes with the lowest weight.
   2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
   3. Enqueue the new node into the rear of the second queue.
4. The remaining node is the root node; the tree has now been generated.
However, it never says what the second queue is used for! From what I understand, shouldn't this node be inserted back into the first queue and the queue re-sorted? I'm having real problems working out how to generate the Huffman tree (since it differs from a normal binary tree, as the highest-weight symbol should get only a single bit). —Preceding unsigned comment added by 88.107.145.7 (talk) 19:30, 6 May 2008 (UTC)
- This description is correct but confusing. The step "Dequeue the two nodes with the lowest weight" is implicitly examining the fronts of both queues. A conceptually simpler implementation uses priority queues, but is not linear time. I will attempt to clarify the exposition. Dcoetzee 19:35, 6 May 2008 (UTC)
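In the meantime, here is a minimal sketch of the two-queue method (hypothetical Python, not the article's pseudocode): both queue fronts are examined on every dequeue, and because newly created internal nodes have non-decreasing weights, the second queue stays sorted for free, which is what makes the method linear-time once the leaves are sorted.

    from collections import deque

    class Node:
        def __init__(self, weight, symbol=None, left=None, right=None):
            self.weight, self.symbol = weight, symbol
            self.left, self.right = left, right

    def two_queue_huffman(weights):
        # weights: dict of symbol -> weight.  Leaves enter q1 sorted by
        # weight; merged internal nodes are appended to q2, which stays
        # sorted because merge weights never decrease over time.
        q1 = deque(Node(w, s) for s, w in sorted(weights.items(), key=lambda kv: kv[1]))
        q2 = deque()

        def pop_lightest():
            # compare the fronts of both queues; take the lighter one
            if not q2 or (q1 and q1[0].weight <= q2[0].weight):
                return q1.popleft()
            return q2.popleft()

        while len(q1) + len(q2) > 1:
            a, b = pop_lightest(), pop_lightest()
            q2.append(Node(a.weight + b.weight, left=a, right=b))
        return (q1 or q2)[0]        # the surviving node is the root

    def codewords(node, prefix=""):
        if node.symbol is not None:
            return {node.symbol: prefix or "0"}   # lone-symbol edge case
        out = codewords(node.left, prefix + "0")
        out.update(codewords(node.right, prefix + "1"))
        return out

    print(codewords(two_queue_huffman({"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05})))
    # -> {'a1': '0', 'a4': '100', 'a3': '101', 'a2': '11'}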