DEFLATE

From Wikipedia, the free encyclopedia

DEFLATE is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool, and was later specified in RFC 1951.

DEFLATE is widely thought to be free of any subsisting patents, and at a time before the patent on LZW (which is used in the GIF file format) expired, this has led to its use in gzip compressed files and PNG image files, in addition to the ZIP file format for which Katz originally designed it.

Contents

[edit] Using DEFLATE in new software

At least two freely available libraries with source code for DEFLATE compression and decompression exist, making it possible to include compatible functionality in new programs. A C-language version exists inside the zlib library, with a C++ version included as part of 7-Zip/AdvanceCOMP.

[edit] Stream format

A DEFLATE stream consists of a series of blocks. Each block is preceded by a 3-bit header:

  • 1-bit: Last block in stream marker:
* 1: if this is the last-block in the stream
* 0: if there are more blocks to process after this one.
  • 2-bits: Encoding method used for this block type:
* 00: a stored/raw/literal section follows, between 0 and 65535 bytes in length.
* 01: a static Huffman compressed block, using a pre-agreed Huffman tree.
* 10: a compressed block complete with the Huffman table supplied.

Most blocks will end up being encoding using method 10, the dynamic Huffman encoding, which produces an optimised Huffman tree customised for each block of data individually. Instructions to generate the necessary Huffman tree immediate follow the block header.

Compression is achieved through two steps

  • The matching and replacement of duplicate strings with pointers.
  • Replacing symbols with new, weighted symbols based on frequency of use.

[edit] Duplicate string elimination

Main article: LZ77

Within compressed blocks, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of a length (3-258 bytes) and a distance (1-32768 bytes). Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32kB of uncompressed data decoded (termed the sliding window).

[edit] Bit reduction

Main article: Huffman coding

The second compression stage consists of replacing commonly-used symbols with shorter representations and less commonly used symbols with longer representations. The method used is Huffman coding which creates a unprefixed tree of non-overlapping bit-sequences, where the length of each sequence is inversely proportional to the likelyhood of that symbol needing to be encoded.

A tree is created which contains space for 288 symbols:

* 0-255: represent the literal bytes/symbols 0-255.
* 256: end of stream, stop processing.
* 257-285: combined with extra-bits, a match length of 3-258 bytes.
* 286,287: not used, reserved and illegal but still part of the tree.

A match length code will always be followed by a distance code. Based on the distance code read, further "extra" bits may be read in order to produce the final distance. The distance tree contain space for 32 symbols:

* 0-3: distances 1-4
* 4-5: distances 5-8, 1 extra bit
* 6-7: distances 9-12, 2 extra bits
* 8-9: distances 17-32, 3 extra bits
* ...
* 26-27: distances 8193-12288, 12 extra bits
* 28-29: distances 16385-32768, 13 extra bits
* 30-31: not used, reserved and illegal but still part of the tree.

Note that for the match distance symbols 2-29, the number of extra bits can be calculated as n / 2 − 1.

[edit] Encoder / Compressor

During the compression stage, it is the encoder that chooses the amount of time spent looking for matching strings. The zlib/gzip reference implementation allows the user to select from a sliding scale of likely resulting compression-level vs. speed of encoding. Options range from -0 (do not attempt compression, just store uncompressed) to -9 representing the maximum capability of the reference implementation in zlib/gzip.

Other DEFLATE encoders have been produced, all of which will also produce a compatible bit-stream capable of being decompressed by any existing DEFLATE decoder. Differing implementations will likely produce variations on the final encoded bit-stream produced. The focus with non-zlib versions of an encoder has normally been to produce a more efficiently compressed and small encoded stream.

[edit] Encoder Implementations

  • PKZIP: the first implementation, originally done by Phil Katz as part of PKZip.
  • zlib/gzip: standard reference implementation used in a huge amount of software, owing to public availability of the source code and a license allowing inclusion into other software.
    • jzlib: Rewrite/re-implementation/port of the zlib encoder into pure Java and distributed under a BSD license. (Fully-featured replacement for java.util.zip).
    • PasZLIB: Translation/port of the zlib code into Pascal source code by Jacques Nomssi-Nzali.
  • KZIP/PNGOUT: an encoder by the game-programmer Ken Silverman using "an exhaustive search of all patterns" and "[an] advanced block splitter".
  • PuZip: designed for Commodore 64/C128 computers. PuZip is limited to an 8kB LZ77 window size, with only the store (type 00) and fixed Huffman (type 01) methods.
  • BigSpeed Deflate: "Tiny in-memory compression library" available as a MS Windows DLL limited to 32kB blocks at a time and three compression settings.
  • BJWFlate/DeflOpt: Ben Jos Walbeehm's utilities "designed to attempt to squeeze every possible byte out of the files it compresses".
  • Crypto++: contains a public domain implementation in C++ aimed mainly at reducing potential security vulnerabilities. The author, Wei Dai states "This code is less clever, but hopefully more understandable and maintainable [than zlib]".
  • 7-Zip/AdvanceCOMP: written by Igor Pavlov in C++, this version is freely licensed and tends to achieve higher compression than zlib at the expense of CPU usage.

AdvanceCOMP uses the higher compression ratio version of DEFLATE as implemented by 7-Zip to enable recompression of gzip, PNG, MNG and ZIP files with the possibility of achieving smaller file sizes than zlib is able to at maximum settings. An even more effective (but also more user-input-demanding and CPU intensive) DEFLATE encoder is employed inside Ken Silverman's KZIP and PNGOUT utilities.

Other possible focuses for a DEFLATE implementation could be super-fast compression speed, or being designed to have a very small executable code such for use in embedded systems. Further requirements could be to have a separate license on the software, or for the code to be written in a language such Java or Python. In theory it might be possible for an implementation to produce an encoding that attempted to avoid certain symbols appearing in the output stream, avoiding blacklisted sequences or NUL bytes in a particular context.

[edit] Hardware Encoders

  • AHA361-PCIX/AHA362-PCIX from Comtech AHA. Comtech produce a PCI-X only card (PCI-ID: 193f:0001) capable of compressing streams using DEFLATE at a claimed rate of up to 3.0 Gbit/s (375 MB/s) for uncompressed incoming data. Accompanying the Linux kernel driver for the AHA362-PCIX are an 'ahagzip' utility and customised 'mod_deflate_aha' capable of using the hardware compression from Apache.

The company provides benchmarks against the software zlib library running with (the lowest) level 1 compression setting, on a 3.0Ghz Pentium 4. The AHA362 hardware solution appears to process data streams at approximately four times the speed of zlib but with only half of the compression effectiveness (2.59:1 vs. 4.68:1) however it is likely that these benchmarks were specifically optimised for throughput. The second revision of the card, the AHA362 boosts speed from 2.4Gbit/s to 3.0Gbit/s and adds limited decompression support, with the provisio that:

the AHA362-PCIX can only decompress files compressed using an AHA36x-PCIX board or the GZIP software provided by AHA. It will not decompress files from a standard version of gzip.

Based on the decompression restriction of only being able to decompress streams produced by a matching AHA card, it is likely that the hardware setup is not able to use the full DEFLATE feature-set. The hardware manual states that "The history buffer is reset at the beginning of each block" and that blocks (which are the same as DEFLATE blocks) have a limit of 4GB.

No information is available regarding how, or when the Huffman tree is created, if at all. All data for compression is processed through the card in 4kB chunks, so producing an optimised Huffman tree would not be able to use more than 4kB of data for statistics/probability capture. Restrictions on the card DEFLATE implementation could be; only using LZ with only a static Huffman tree, or having a hardware-limited LZ window/history size.

The hardware interface manual states that each card contains four compression channels and two decompression channels. Decompression is likely implemented in the FPGA, whilst each of the compression channels will be handled with the assistance of an individual ASIC. High-resolution photographs of the AHA36x peripheral card show four AHA3601 custom ASICs surrounding a Xilinx Virtex chip.. Supporting documentation cites that the FPGA-based solution enables easy firmware updates, with each ASIC being capable of up to 800Mbit/s throughput. Unit pricing from a 2004-04-20 press-release is given at $750 USD for the original AHA361 card..

In appears that for the statistics provided, a desire to parallelise across all four hardware units—by breaking the input into small 4kB chunks—has caused frequent flushing of the input stream and history, resulting in reduced compression.

[edit] Decoder / Decompressor

INFLATE is the decoding process that takes a DEFLATE bit stream for decompression and correctly produces the original full-size data or file.

[edit] INFLATE-only implementations

The normal intent with an alternative INFLATE implementation is highly optimised decoding speed, or extremely predictable RAM usage for micro-controller embedded systems.

  • inflate.cl by John Foderaro. Self-standing Common Lisp decoder distributed with a GNU LGPL license.
  • kunzip (unrelated to KZIP) by Michael Kohn. Comes with C source-code under the GNU LGPL license. Used in the GIMP installer.
  • lodepng by Lode Vandevenne. A BSD-licensed single file PNG file reader with built-in C++ inflate implementation and no external dependencies.

[edit] See also

[edit] External links