Byte pair encoding

From Wikipedia, the free encyclopedia

Byte pair encoding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data.

[edit] Byte pair encoding example

Suppose we wanted to encode the data

aaabaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, "Z". Now we have a replacement table of

Z <- aa

and the data is

ZabZabac

In this data the byte pair "Za" occurs most often, so it will be replaced with a byte that is not used in the data, "Y". (In this case "Za" could be replaced by "Z", since every occurrence of "Z" will be replaced.) The replacement table and data are now

Z <- aa
Y <- Za
YbYbac

Once again we replace the byte pair that occurs most often.

Z <- aa
Y <- Za
X <- Yb
XXac

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.

To decompress the data, simply perform the replacements in the reverse order.

Languages