Truncated binary encoding
From Wikipedia, the free encyclopedia
Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet with total size of number n. It is a slightly more general form of binary encoding when n is not a power of two.
When n is exactly 2k, truncated binary encoding simply assigns each symbol one of the n codewords of length k. When n is equal instead to 2k+b, truncated binary encoding assigns the first 2k−b symbols the first 2k−b codewords of length k and then assigns the remaining 2b symbols the last 2b codewords of length k+1. Because all the codewords of length k+1 consist of an unassigned codeword of length k with a "0" or "1" appended, the resulting code is a prefix code. For example, for the alphabet {0, 1, 2, 3, 4}, n = 5 = 2²+1. Truncated binary encoding assigns the first three symbols (2²−1) the codewords 00, 01, and 10, all of length 2, then assigns the last two symbols the codewords 110 and 111, the last two codewords of length 3, each of which is the unused two-bit codeword 11 with an extra digit appended.
For example, if n is 5, plain binary encoding and truncated binary encoding allocates these codewords: (RED digits/bits are not transmitted in truncated binary.)
Truncated binary |
Encoding | Standard binary |
---|---|---|
0 | 000 | 0 |
1 | 001 | 1 |
2 | 010 | 2 |
UNUSED | 011 | 3 |
UNUSED | 100 | 4 |
UNUSED | 101 | 5/UNUSED |
3 | 110 | 6/UNUSED |
4 | 111 | 7/UNUSED |
The nearest power of 2 after n = 5 is 2³ = 8, so there are u = 8−5 = 3 unused codes in the straightforward 3-bit binary encoding.
In numerical terms, to send a value 0 ≤ x < n, which is one of 2k ≤ n ≤ 2k+1 symbols, there are u = 2k+1−n unused entries when the alphabet size is rounded up to the nearest power of two. The process to encode the number x in truncated binary is: If x is less than u, encode it in k binary bits. If x is greater than or equal to u, encode the value x+u in k+1 binary bits.
Another example, encoding an alphabet of size 10 (between 0 and 9), there are 6 unused codes, so input values greater than or equal to 6 are offseteby 6 to the end of the binary space, and the first bit is discarded from the truncated binary encoding output (here, the unused patterns are not shown in this table):
Input value |
Offset value |
Standard Binary |
Truncated Binary |
---|---|---|---|
0 | 0 | 0000 | 000 |
1 | 1 | 0001 | 001 |
2 | 2 | 0010 | 010 |
3 | 3 | 0011 | 011 |
4 | 4 | 0100 | 100 |
5 | 5 | 0101 | 101 |
6 | 12 | 1100 | 1100 |
7 | 13 | 1101 | 1101 |
8 | 14 | 1110 | 1110 |
9 | 15 | 1111 | 1111 |
Here is a more extreme case: with n = 7 the next power of 2 is 8 (we will then use 3 bits or 2 bits if the high bit is discarded) and u = 1:
Input value |
Offset value |
Standard Binary |
Truncated Binary |
---|---|---|---|
0 | 0 | 000 | 00 |
1 | 2 | 010 | 010 |
2 | 3 | 011 | 011 |
3 | 4 | 100 | 100 |
4 | 5 | 101 | 101 |
5 | 6 | 110 | 110 |
6 | 7 | 111 | 111 |
This last example demonstrates that a leading zero bit does not always indicate a short code; if u < 2k−1, some long codes will begin with a zero bit.
If n is a power of two, then the encoding may be done with either of two different values of k. Both produce equivalent outputs; one just has u = 2k and encodes all values as "short" k-bit codes, while the other has u = 0 and encodes everything with "long" k+1-bit codes.