LZ77 and LZ78

LZ77 and LZ78 are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively.^[1] These two algorithms form the basis for most of the LZ variations including LZW, LZSS, LZMA and others.

They are both dictionary coders. LZ77 is the sliding window compression algorithm, which was later shown to be equivalent to the explicit dictionary compression technique of LZ78—however, they are only equivalent when the entire data is intended to be decompressed. LZ78 decompression allows random access to the input as long as the entire dictionary is available, while LZ77 decompression must always start at the beginning of the input.

The algorithms were named an IEEE Milestone in 2004.^[2]

1 LZ77
- 1.1 Implementations
2 LZ78
3 References

LZ77

LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the characters exactly distance characters behind it in the uncompressed stream". (The "distance" is sometimes called the "offset" instead.)

To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes called sliding window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. The larger the sliding window is, the longer back the encoder may search for creating references.

It is not only acceptable but frequently useful to allow length-distance pairs to specify a length that actually exceeds the distance. As a copy command, this is puzzling: "Go back four characters and copy ten characters from that position into the current position". How can ten characters be copied over when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as a byte is copied over, it may be fed again as input to the copy command. When the copy-from position makes it to the initial destination position, it is consequently fed data that was pasted from the beginning of the copy-from position. The operation is thus equivalent to the statement "copy the data you were given and repetitively paste it until it fits". As this type of pair repeats a single copy of data multiple times, it can be used to incorporate a flexible and easy form of run-length encoding.

Implementations

Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they encode their compressed data to vary the numerical ranges of a length-distance pair, alter the number of bits consumed for a length-distance pair, and distinguish their length-distance pairs from literals (raw data encoded as itself, rather than as part of a length-distance pair). A few examples:

The algorithm illustrated in Lempel and Ziv's original 1977 paper outputs all its data three values at a time: the length and distance of the longest match found in the buffer, and the literal which followed that match. If two successive characters in the input stream could only be encoded as literals, the length of the length-distance pair would be 0.
In the PalmDoc format, a length-distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
In the implementation used for many games by Electronic Arts,^[3] the size in bytes of a length-distance pair can be specified inside the first byte of the length-distance pair itself; depending on if the first byte begins with a 0, 10, 110, or 111 (when read in big-endian bit orientation), the length of the entire length-distance pair can be 1 to 4 bytes large.
As of 2008^[update], the most popular LZ77 based compression method is DEFLATE; it combines LZ77 with Huffman coding.^[4] Literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances can be safely placed into a separate alphabet; since a distance only occurs just after a length, it cannot be mistaken for another kind of symbol or vice-versa.

LZ78

LZ78 algorithms achieve compression by replacing repeated occurrences of data with references to a dictionary that is built based on the input data stream. Each dictionary entry is of the form dictionary[...] = {index, character}, where index is the index to a previous dictionary entry, and character is appended to the string represented by dictionary[index]. For example, "abc" would be stored (in reverse order) as follows: dictionary[k] = {j, 'c'}, dictionary[j] = {i, 'b'}, dictionary[i] = {0, 'a'}, where an index of 0 implies the end of a string. The algorithm initializes last matching index = 0 and next available index = 1. For each character of the input stream, the dictionary is searched for a match: {last matching index, character}. If a match is found, then last matching index is set to the index of the matching entry, and nothing is output. If a match is not found, then a new dictionary entry is created: dictionary[next available index] = {last matching index, character}, and the algorithm outputs last matching index, followed by character, then resets last matching index = 0 and increments next available index. Once the dictionary is full, no more entries are added. When the end of the input stream is reached, the algorithm outputs last matching index. Note that strings are stored in the dictionary in reverse order, which a LZ78 based decoder will have to deal with.

LZW is an LZ78 based algorithm that uses a dictionary pre-initialized with all possible characters (symbols), (or emulation of a pre-initialized dictionary). The main improvement of LZW is that when a match is not found, the current input stream character is assumed that it will be the first character of an existing string in the dictionary (since the dictionary is initialized with all possible characters), so only the last matching index is output (which may be the pre-initialized dictionary index corresponding to the previous (or the initial) input character). Refer to the LZW article for implementation details.

References

Jacob Ziv and Abraham Lempel; A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, 23(3), pp. 337–343, May 1977.
Jacob Ziv and Abraham Lempel; Compression of Individual Sequences via Variable-Rate Coding, IEEE Transactions on Information Theory, 24(5), pp. 530–536, September 1978.

^ "US Patent 5532693 - Adaptive data compression system with systolic string matching logic". http://www.patentstorm.us/patents/5532693-description.html. Retrieved 2008-11-27.
^ "Milestones:Lempel-Ziv Data Compression Algorithm, 1977". IEEE Global History Network. IEEE. http://www.ieeeghn.org/wiki/index.php/Milestones:Lempel-Ziv_Data_Compression_Algorithm,_1977. Retrieved 4 August 2011.
^ "QFS Compression". Niotso Wiki. http://wiki.niotso.org/QFS_compression. Retrieved 2011-11-06.
^ Feldspar, Antaeus (August 23, 1997). "An Explanation of the Deflate Algorithm". comp.compression. zlib.net. http://www.zlib.net/feldspar.html. Retrieved 2011-06-30.

Data compression methods

Information theory

Entropy · Complexity · Redundancy · Lossy · Timeline of information theory · Rate distortion theory

Lossless

Entropy encoding	Shannon–Fano · Shannon–Fano–Elias · Huffman · Adaptive Huffman · Arithmetic · Range · Golomb · Universal (Gamma · Exp-Golomb · Fibonacci · Levenstein)

Dictionary	RLE · Byte pair encoding · DEFLATE · Lempel–Ziv (LZ77/78 · LZSS · LZW · LZWL · LZO · LZMA · LZX · LZRW · LZJB · LZS · LZT · ROLZ) · Statistical Lempel Ziv

Others	CTW · BWT · PPM · DMC · Delta

Audio

Theory	Companding · Convolution · Dynamic range · Latency · Sampling · Nyquist–Shannon theorem · Sound quality

Audio codec parts	LPC (LAR · LSP) · WLPC · CELP · ACELP · A-law · μ-law · ADPCM · DPCM · MDCT · Fourier transform · Psychoacoustic model

Others	Bit rate (CBR · ABR · VBR) · Speech compression · Sub-band coding

Image

Terms	Color space · Pixel · Chroma subsampling · Compression artifact · Image resolution

Methods	RLE · Fractal · Wavelet · EZW · SPIHT · LP · DCT · Chain code · KLT

Others	Test images · PSNR quality measure · Quantization

Video

Terms	Video characteristics · Frame · Frame rate · Interlace · Frame types · Video quality · Video resolution

Video codec parts	Motion compensation · DCT · Quantization

Others	Video codecs · Bit rate (CBR · ABR · VBR)

See Compression formats for formats and Compression software implementations for codecs

LZ77 and LZ78

Contents

LZ77

Implementations

LZ78

References