Talk:Arithmetic coding

From Wikipedia, the free encyclopedia

This article is within the scope of Computing WikiProject, an attempt to build a comprehensive and detailed guide to computers and computing. If you would like to participate, you can edit the article attached to this page, or visit the project page, where you can join the project and/or contribute to the discussion.
Start This article has been rated as Start-Class on the quality scale
Mid This article has been rated as Mid-importance on the importance scale

The sentence that reads...
Instead, we represent the sequence as a rational number between 0 and 1 in base 3
Should read...
Instead, we represent the sequence as a rational number between 0 and 2 in base 3
I didn't want to edit the main page, maybe one of you who is more involved can
—Preceding unsigned comment added by 66.244.163.200 (talk) 18:17, 10 April 2008 (UTC)


Oy, oy, oy, just think what if Newton and Briggs had decided to patent logarithms!

Converted the sums and products to TeX, with the exception of one which had the unexpected expression "log_2". I think it is missing its argument. --Ejrh

JPEG2000 has an arithmetic coder... there must be an associated patent license. -- taral@taral.net


If these are covered by patent, we should have the list of patent numbers in the article so that we know when they expire. --ssd 18:26, 11 Jul 2004 (UTC)

Look here [1]

Contents

[edit] Entropy number wrong

I think the entropy estimate in the example is wrong. Each character seems to contain (-log2(.6) * .6) + (-log2(.2) * .2) + (-log2(.1) * .1) + (-log2(.1) * .1) = 1.57 bits of entropy. Ignoring the fact that an end of message only occurs at the end (which would reduce entropy of the entire message), a three character message like the example would have an entropy of 3*1.57=4.71. The article says 7.381.

As a check; you could actually encode any three character message of a four symbol alphabet in only 6 bits by simply assigning a 2 bit code to each of the inputs. Our example must have less entropy because the probabilities are skewed away from .25/.25/.25/.25. Mfinn68916 15:53, 27 March 2007 (UTC)

1.57b is the mean entropy of a symbol from the example entropy source. You can't just multiply that by the length of a particular message to get its entropy, as the symbols have different probabilities. The fact that a 2b fixed length code does 'better than entropy' in the "NEUTRAL NEGATIVE ENDOFDATA" case is irrelevant, what matters is the mean case, where indeed 2b/symbol does not beat the entropy of ~ 1.57b/sym. The self-information of a particular message should be calculated as the sum of -log2(Pj) over the probabilities, Pj, of the symbols in the message, for an order-0 Markov source (is there a better term?). So 7.381 is correct for the example message, but a new and clearer example would probably be a good idea... --Yjo (talk) 17:09, 26 November 2007 (UTC)

[edit] Improving the article

(note: these comments on improving the article were originally left on my talk page; I've taken the liberty of copying them here to start discussion of where the article can be improved.)

I still think my version is better for a few reasons. The main one is that the article jumps right into an example, which it then interrupts with a discussion. I think it is better to start with an analogy describing the method. Follow this with a complete example of the simple static case. Then discuss the adaptive case. And those diagrams in the history looked interesting, too. It was a shame to discard them.

And in general, the prose is too wordy, with too many asides within sentences. Go for simpler sentence structure, with fewer words. I don't have the page in front of me, but I remember the first sentence had these flaws. There was no need to define a message within the definition of arithmetic coding. A separate sentence would flow better.

And finally, I feel it would be better to do the examples in binary rather than decimal. It is just as easy, and that's really how the algorithm works in any implementation I've seen. To do it in decimal is analogous, but needlessly confusing. And speaking of analogies, your algorithm is an analogy as well. The real algorithm doesn't calculate all the interval endpoints; it just calculates the two it needs at each stage. And the decoding works more like how I described, by scaling the intervals to [0, 1) at each step.

I know my version was not finished, but it had some of the elements I describe here. Could you look at it again and see if it may be possible to work some of it into yours? Scottcraig 08:08, 13 Nov 2004 (UTC)

I think it would be better to begin discussion of these one at a time. I'll start with the issue of simplifications; there are two simplifications that you're complaining about here. The first is doing the example in decimal, when the actual computer will almost certainly be doing it in binary. You say "it is just as easy" for someone trying to grasp the basic concept of arithmetic coding to follow an example in binary as in decimal. Can you actually tell us why you think that? Among all the computer scientists and programmers I know, I don't know anyone who thinks in binary; I don't know anyone who finds it "just as easy" as decimal. We're trying to get across the basic concept of arithmetic coding; even if (for the sake of argument) 90% of our audience found binary just as easy as decimal, and only 10% found it an obstacle to following the obstacle and understand the principle illustrated, there's no reason to throw any unneeded obstacle in the path of the readers.
The second simplification is showing the algorithm as if the calculation was performed with infinite precision at each step and then converted to digits only at the end of the calculation. I think this is the proper way to introduce the concept -- which is a hard concept for those who don't already understand it -- that this mathematical interval actually contains the whole message we encoded into it. We would only obscure understanding of the basic concept by introducing the complication of re-normalization too fast. It would be a good idea to discuss the difference between the simplified version that makes the concept clear and the way it's actually done in production code; I think it would be wiser to discuss those differences after the basic concept has been explained. -- Antaeus Feldspar 17:42, 13 Nov 2004 (UTC)

[edit] Renormalisation

The description isn't quite accurate. You've left out an important case. Namely when the range is of the form [0111111111..., 1000000000...). Digits are also shifted, just not sent to the output. Scottcraig 20:16, 3 Dec 2004 (UTC)

Can you go into more detail about that? I'm not following the example you just gave. -- Antaeus Feldspar 20:34, 3 Dec 2004 (UTC)

The idea is that there is no limit to how much buffering you might have to do before you can know the correct value for a certain bit. Whenever the start and end of the range match in some number of bits, those bits are known for sure, and can be sent to the output. However, there's no limit to how small the range can get (i.e. how many bits are required to describe it) without any bits matching at all. This is problematic because the encoder and decoder need unlimited buffers.

To solve this problem, encoders typically use a technique called "bit-stuffing" that wastes a bit, but puts a finite limit on the amount of buffering required. Conceptually, bit stuffing puts a zero bit in the middle of a long string of 1 bits to limit the number of bits that can be flipped by a carry. So, for instance, you might find that you have reduced the range to [.01111111, .10000000). If your encoder/decoder has only 8 bits of buffer, you're toast because you still don't know what that first bit will be. However, in this situation, you can stuff a zero bit after the 7th bit, thereby representing this range as [011111101, 011111110). The extra digit in the 8th position has "absorbed" the carry from subsequent bits. Now you have 7 known bits, and you can send them to the output. Strictly speaking, this representation is no longer truly binary: the bit-stuffed sequence "011111110" is equivalent to the binary fraction ".10000000". --P3d0 20:37, Dec 21, 2004 (UTC)

I believe I understand now. What confused me was describing a "range" but marking it in interval notation instead. The important part is that the range can get smaller but will not always naturally converge to matching initial digits; however, when such convergence fails to occur naturally, nothing stops the encoder from voluntarily limiting its range to create such matching initial digits, which it does and then proceeds to renormalize. Is that correct? -- Antaeus Feldspar 23:31, 21 Dec 2004 (UTC)
I never looked at it that way. Since the range [.0111111, .1000000) is too hard to deal with for practical reasons, we take all the messages that would have been mapped to that interval and re-map them to the new interval [.01111110, 0.1111111), which is half of the original interval. From that point of view, the encodings are still perfectly valid binary fractions; it's just that the encoding interval [0,1) has some holes in it due to the practical limitations of the encoder/decoder. --P3d0 01:47, Dec 22, 2004 (UTC)

I will explain more thoroughly. I had assumed you had known about this already and only had to remind you, which is why I kept it short. I didn't see your question until today.

Suppose the current interval is [0111111111..., 1000000000...) to a large number of digits, with more symbols to encode. Now the left endpoint may move above 1/2, in which case both ends will start with 10000... . But the right endpoint could move below 1/2, in which case both ends start with 0111111... . But if all those digits are kept until either case happens, then all the precision is lost. So instead of this, we shift both endpoints past the 0111... and 1000... parts, keeping a count of the number of shifts. When case 1 or 2 eventually happens, we can output the correct digits. If neither has occurred and there are no more symbols, just output 10000... to the saved number of 0s. Scottcraig 07:53, 22 Dec 2004 (UTC)

I see. That's different from bit stuffing. I wonder why people do bit stuffing when this seems so much simpler and better? --P3d0 15:36, Dec 22, 2004 (UTC)
There are pros and cons to both. The original way achieves better compression. But it does not work so well with streaming data. It pauses, processing some symbols, then outputs a burst. I suppose this is why bit stuffing was developed. (I hadn't heard of it before. Thanks.) Bit stuffing ensures that every byte is sent as processed, at a slight cost. It's not that these situations occur that often anyway, so there wouldn't be much penalty. But it may be necessary to ensure that the pausing never occurs, I suppose. There is also a slight possibility that the stream may pause for so long that the number of shifts wraps around. This seems extremely unlikely, but bit-stuffing prevents it.
Actually, the whole discussion of renormalization concerns the specific implementation of the algorithm, not the algorithm itself. If we can't talk about the binary implementation, then the topic of renormalization is pretty well disconnected from anything else.
No one ever said we couldn't talk about the binary implementation. What was objected to was the idea that we should do all the examples in binary, on the assumption that someone who does not yet understand the basic idea of arithmetic coding and is coming to this article to learn will find it no obstacle at all to have the only demonstrations of the central concept in binary rather than decimal arithmetic. -- Antaeus Feldspar 07:04, 23 Dec 2004 (UTC)
The topic arises by answering the question, "Which bits can we store?"
Consider the current interval endpoints as a binary fraction, say xx...xx0abcdefgh... and xx...xx1ijklmnop... . They agree on the first part, and any subsequent narrowing will as well, so these bits can be sent to output. At the first difference, the left always has a 0, and the right has a 1. These can't be sent out yet, but they don't need to be stored either. So start storing from there, 16 bits each or what have you.
But if we stop here, then the problem under discussion arises. If the left starts with 01111, and the right starts with 10000, then storing the 1111 and 0000 robs some precision. (Remember the initial 0 and 1 are not stored anyway.) Because we know that eventually the 01111 will flip to 10000, or vice versa. If the string length goes past 16 (or what have you) bits, then the algorithm hits a wall with no way to recover. So instead, keep a counter of the length that will be output when we know if it's 0111... or 1000..., and store the 16 bits after that.
So suppose at the current step, we are storing three variables as unsigned ints: n, left, and right. Then not counting the bits already output, the endpoints are 216*(2n-1) + left, and 2n+16 + right. i.e., 01111leftbits and 10000rightbits. Note that left can be greater than right.
Next suppose that the next encoded symbol reduces the interval to a/c to b/c of the current one. (The data model must use fractions.)
The current interval has a width of 216 + right - left, or w = right + not(left) + 1, which is always positive. Then the new interval is left := left + (w div c) * a, to right := left + (w div c) * b, with adjustments still to be made depending on overflow in the calculations. Step 1 Case a) The calculation of left overflows, then the left end has moved above the half mark. Output 10...0 and set n to 0. Case b) The calculation of right does not overflow, then the right point has moved below the half mark. Output 01...1 and set n to 0. Case c) left does not overflow and right does. Do nothing. Step 2) In cases a or b, output the common starting digits and shift left. Step 3) If (0)111/(1)000 occurs, shift left while incrementing n.
I'm not really sure if all the above details are correct; there may be some off by one errors or the like. I'm sure the correct code can be found. I just don't have it in front of me. Scottcraig 05:28, 23 Dec 2004 (UTC)
  • In the first table under the heading: Interval reduced to eight-bit precision (as fractions), the denominators in the fractions are 256. Should they not be 255? 60.234.137.171 19:43, 18 March 2007 (UTC)

[edit] Encoding shorter than entropy is misleading

The article used to contain this text:

...we could have encoded the same message in the binary fraction .1000101 (equivalent to .5390625 decimal) at a cost of only 7 bits. This is actually shorter than the information content, or entropy of our message, which with a probability of 0.6% has an entropy of approximately 7.381 bits.

I have changed this because it is misleading. The fact is that giving the binary fraction as ".1000101" is ambiguous; these same bits can start any binary fraction in the range [.1000101, .1000110). For instance, both sequences "10001010" and "10001011" would start with the same bits, but one would decode to "NEUTRAL, NEGATIVE, END" while the other decodes to "NEUTRAL, END".

This is a subtle problem. We, as humans looking at the fraction ".1000101", know that all the subsequent bits must be zeoro. How do we know that? Because the symbol following the final "1" is a quote mark, which indicates that the fraction ends at that point. We have cheated, and implicitly added an extra symbol to the encoding. In a computer, there are two ways to deal with this problem:

  1. Add enough bits to make the fraction unambiguous. In this case, adding another zero bit makes the encoding unambiguous, in that subsequent bits don't affect the message. Now our encoding is 8 bits, which is more than the message's entropy.
  2. Encode the length of the stream in out-of-band data. This is an acceptable real-world solution, since filesystems do store file sizes in their metadata, but now the metadata must be included in theoretical discussions of message length.

In either case, Claude Shannon rests happily in his grave. --P3d0 19:43, Dec 21, 2004 (UTC)

Of course the encoding of a particular message can be shorter than its entropy. The title of this section is just wrong.
Shannon's result is that no encoding scheme can encode every message shorter than its entropy. This does not preclude the possibility that some are shorter, as long as some are not.
Fair enough. I have renamed this section, and removed the contentious statements from my explanation above. But what you describe is not what happened here. I think my explanation given above is correct, and it has nothing to do with decimal representation. --P3d0 20:41, Dec 21, 2004 (UTC)
On the other hand, you have uncovered a problem with the article. The ambiguity you describe comes from performing the encoding in decimal before converting to binary. I would prefer to do everything in binary. I think this would be much clearer, and provide the motivation for arithmetic coding in the first place. But others feel that requiring familiarity with binary would be too burdensome for our audience. As a result, the description lacks motivation, and has problems like the one you found. Scottcraig 20:15, 21 Dec 2004 (UTC)
Ooops. Thanks for catching that; I hadn't realized that "10001011" would decode to a different symbol stream than "10001010". Perhaps you could add something to the article about how a decoder knows it's added enough digits to make the end result truly un-ambiguous?
At the risk of making myself look even more like a dunce than I do, I also have to question your statement that a message's encoding can't be shorter than its entropy. That is certainly true over the set of all messages, that the total of their encodings will always be more than the total of their entropies. But consider a set of symbols such that one symbol, let's call it A, has a probability of just less than .5, and the least probable symbol still has a probability greater than (.5 - p(A)). (Hope I got that notation right...)
Anyways, under those conditions, those symbol probabilities would produce a Huffman tree that assigns A a one-bit code -- and yet the entropy of A must be greater than 1 bit, since its probability is less than .5. Every time an A appears, then, the cost of the encoding goes up by 1, but the total entropy of the message goes up by more than 1. With that being the case, doesn't it follow that to create a message whose encoding was shorter than its entropy, all we would need to do is take any message which did not meet that criteria and add A's until it did? -- Antaeus Feldspar 20:59, 21 Dec 2004 (UTC)
Agreed. I was wrong to say that no message can ever be encoded in fewer bits than its entropy. --P3d0 21:12, Dec 21, 2004 (UTC)

[edit] The relationship between arithmetic coding and Huffman

Could Antaeus Feldspar give a reference to e.g. a scientific paper that proves "it has been shown that Huffman is just a specialized case of arithmetic coding"? It is clear to me that in the case of the probabilities of the symbols being 2^(-n) n=(1,...) , then Huffman and the arithmetic coder get the same results. But this is not showing that Huffman is a special case of the arithmetic coding?! I don't believe that "Huffman is a special case of arithmetic coding" but that there are cases were both algorithms work identically?? -- stef@nstrahl.de 18:10, 16 Feb 2005 (CET)

No, I do not have a reference in the literature for this. My understanding that Huffman had been at some point formally proven to be a specialized case of arithmetic coding was based on years of subscribing to the Usenet newsgroup comp.compression. I would be unable at this point to find the exact post(s) where that claim was made. If you feel you have a better phrasing for the relationship between arithmetic coding and Huffman coding, by all means present it for consideration. -- Antaeus Feldspar 00:54, 17 Feb 2005 (UTC)
Thanks for the reference to comp.compression. If found [2] where Charles Bloom gives a reference to his paper Binary Range Encoding - A New Entropy Encoder in which he joins Shannon-Fano, Huffman, and arithmetic Coders into a theoretical framework. I haven't read it yet but I think this will give the prove. I'll report more details if I have done so. -- stef@nstrahl.de 22:01, 16 Feb 2005 (CET)
That probably is it. Bloom was/is one of the most prolific and most knowledgeable c.c posters and I always tried to keep up with his posts. -- Antaeus Feldspar 00:35, 19 Feb 2005 (UTC)

There is a basic flaw to this article. Arithmetic Coding is not a data compression scheme, just a method of encapsulating the result of applying a data prediction mechanism to a data stream. The beauty of it is that it separates out the encoding problem entirely, leaving the focus on the prediction mechanism. For a given model (as it is called in the text) arithmetic encoding will achieve (aside from small implementation costs) 100% of the entropy of any data stream **with respect to the model**.

You can encode "War and Peace" down to a single bit plus a stop code if you use arithmetic coding, with "War and Peace" as the prediction model. There is no global lowest degree of compression achievable ... only a lower bound for a given data stream wrt a given model.

My honours thesis in 1980 was on data compression and I showed that Huffman coding and LZW could both be viewed as a prediction model plus arithmetic coding. The original algorithms result in a discretized version of the output of an arithmetic coding approach. {Paul McGee]Pmcgee2 (talk) 05:11, 22 December 2007 (UTC)

[edit] Range coding and arithmetic coding

The article currently mentions range coding (which is a good thing for it to do), but it does seem to support what I understand to be some common misconceptions about range coding in relation to arithmetic coding.

I've already expounded on this in Talk:Range encoding#Range encoding and arithmetic encoding, but the key points are:-

  1. Range coding seems to be a different interpretation of the same thing, rather than being a different thing itself.
    1. As I understand it, all range coders (decoders/encoders) are the corresponding arithmetic coders, and vice-versa. (It's just a matter of what you interpret the actual range/arithmetic code as being. You could interpret it as being an arithmetic code in the usual way, or you could interpret it as being the common prefix of a range of natural numbers (non-negative integers), in the usual range coding way.)
    2. Differences in renormalisation are implementational, and there's no reason why byte-based renormalisation can't be applied in implementations of arithmetic coding.
  2. Although I am not a lawyer, it does seem that the 'myth' that arithmetic-coding-applicable patents don't apply to range coding relies on range coding being different to arithmetic coding. I strongly suspect it's a myth, a rumour that keeps getting (unwisely) propagated. But, as I said, I am not a lawyer. More on this below.

--Simon G Best 17:24, 20 July 2006 (UTC)

[edit] Range coding and arithmetic coding patents

The article says:-

Range encoding, unlike arithmetic coding, is generally believed not to be covered by any company's patents.

Source? It's a rumour I've come across quite a number of times, now, but I strongly suspect it's a myth (as range coding and arithmetic coding do seem to be the same thing (just interpreted in two, different ways)). I'm certainly doubtful that it's "generally believed", as there seems to be no shortage of people who look at range coding, and conclude that it's really just arithmetic coding interpreted a different way. (In my experience, this kind of observation seems to be made, by someone or other, whenever range coding comes up on the net. (And no, not just by me :o) ).) "Rumoured" might be better than "generally believed".

I should emphasise, of course, that I am not a lawyer.

--Simon G Best 17:24, 20 July 2006 (UTC)

About patent validity. For me it looks like that some patents mentioned in the article are not valid any more (17 years from grated date or 20 years from filind date) + some patens are expiring very soon. During years 2008. —Preceding unsigned comment added by 88.114.252.64 (talk) 20:34, 11 May 2008 (UTC)

[edit] Changes to this subsection of the article

(I've added a subsection heading for this, so that it doesn't seem to be part of the preceding subsection on patents. --Simon G Best 21:38, 20 July 2006 (UTC))

I've now edited the subsection of the article on range encoding to clarify and expand upon the relationship between range coding and arithmetic coding. It was a bit of a rewrite. The article on range encoding needs to be brought into line with these changes, which I may well do. It's just too hot and humid here, right now. Especially humid.

--Simon G Best 21:34, 20 July 2006 (UTC)

[edit] Applications

What common programs/file formats/standards use Arithmetic coding? --Apoc2400 08:18, 10 October 2006 (UTC)

bzip (ok, not exactly common, sadly), PAQ, most PPM implementations, DjVu, JBIG/JBIG2, JPEG 2000, H.264/MPEG-4 AVC... --Piet Delport 01:21, 11 October 2006 (UTC)

[edit] History

The history of AC would be quite dandy. When it started development, first implementations, how it has developed over the years to improve, who were the major contributors (such as Jorma Rissanen and Glen G. Langdon Jr.), etc. Spodi (talk) 11:41, 30 December 2007 (UTC)