LZ77 and LZ78

From Wikipedia, the free encyclopedia

LZ77 and LZ78 are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively [1]. These two algorithms form the basis for most of the LZ variations including LZW, LZSS, LZMA and others.

They are both dictionary coders, unlike minimum redundancy coders. LZ77 is the "sliding window" compression algorithm, which was later shown to be equivalent to the explicit dictionary technique first given in LZ78.

[edit] LZ77

LZ77 algorithms achieve compression by replacing portions of the data with references to matching data that has already passed through both encoder and decoder. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the character exactly distance characters behind it in the uncompressed stream." (The "distance" is sometimes called the "offset" instead.)

The encoder and decoder must both keep track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which these data are held is called a sliding window, which is why LZ77 is sometimes called sliding window compression. The encoder needs to keep these data to look for matches, and the decoder needs to keep these data to interpret the matches the encoder refers to. This is why the encoder can use a smaller size sliding window than the decoder, but not vice-versa.

Many documents which talk about LZ77 algorithms describe a length-distance pair as a command to "copy" data from the sliding window: "Go back distance characters in the buffer and copy length characters, starting from that point." While those used to imperative programming may find this model intuitive, it may also make it hard to understand a feature of LZ77 encoding: namely, that it is not only acceptable but frequently useful to have a length-distance pair where the length actually exceeds the distance. As a copy command, this is puzzling: "Go back one character in the buffer and copy seven characters, starting from that point." How can seven characters be copied from the buffer when only one of the specified characters is actually in the buffer? Looking at a length-distance pair as a statement of identity, however, clarifies the confusion: each of the next seven characters is identical to the character that comes one before it. This means that each character can be determined by looking back in the buffer – even if the character looked back to was not in the buffer when the decoding of the current pair began. Since by definition a pair like this will be repeating a sequence of distance characters multiple times, it means that LZ77 incorporates a flexible and easy form of run-length encoding.

Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they output their encoded data – what values are possible for lengths and distances, for example, and how length-distance pairs are distinguished from literals (single characters encoded as themselves, rather than as part of a length-distance pair.) A few examples:

  • The algorithm illustrated in Lempel and Ziv's original 1977 paper outputs all its data three values at a time: the length and distance of the longest match found in the buffer, and the literal which followed that match. If two successive characters in the input stream could only be encoded as literals, the length would be 0.
  • In the PalmDoc format, a length-distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
  • As of 2004, the most popular LZ77 based compression method is called DEFLATE; it combines LZ77 with Huffman coding. Literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances can be safely placed into a separate alphabet; since a distance only occurs just after a length, it cannot be mistaken for another kind of symbol or vice-versa.

[edit] LZ78

While the LZ77 algorithm works on past data, the LZ78 algorithm attempts to work on future data. It does this by forward scanning the input buffer and matching it against a dictionary it maintains. It will scan into the buffer until it cannot find a match in the dictionary. At this point it will output the location of the word in the dictionary, if one is available, the match length and the character that caused a match failure. The resulting word is then added to the dictionary.

Though initially popular, the popularity of LZ78 later dampened, possibly because for the first few decades after it was introduced, parts of LZ78 were patent protected in the United States. The most popular form of LZ78 compression was the LZW algorithm, a modification of the LZ78 algorithm made by Terry Welch.

[edit] References