ZPAQ is a proposed open format for highly compressed data, created mainly by Matt Mahoney. It is designed so that new compression algorithms may be developed without breaking compatibility with older versions of ZPAQ-compliant decompression programs. ZPAQ uses a configurable context mixing algorithm (based on PAQ) and sandboxed post-processing code. It uses a streaming data format that supports archives, single file compression, and memory-to-memory compression. Several ZPAQ-compatible compression programs and configurations are available under the GPL. There is also a public domain API in C++ providing compression and decompression services. The format is believed to be unencumbered by patents.
Matt Mahoney began work on ZPAQ on Feb. 15, 2009. After a series of incompatible experimental versions, the level 1 standard was finalized on Mar. 12, 2009. The standard consists of a document describing the format and a reference decoder (unzpaq) written in C++ and released under GPL. The standard does not define a compression algorithm; instead, it requires that any ZPAQ level 1 compliant decoder be able to read the output of any other ZPAQ level 1 compliant compressor. The standard has a provision for higher levels, which must be able to read the output of compressors compliant with lower levels but not vice versa. However, no higher levels have yet been defined.
The following software was released under GPL:

- unzpaq, the reference decoder
- zpaq, an archiver and compression development tool that reads configuration files
- zpipe, a streaming compressor and decompressor
- zp, an archiver with built-in compression levels
In addition, several configurations and preprocessors were published including specialized models for text, BMP images, JPEG images, x86 executable data, and pre/post processors using LZP and BWT compression. Many of these were written by Jan Ondrus.
ZPAQ uses a streaming data format. A ZPAQ stream consists of a sequence of blocks that can be decompressed independently. Blocks can be reordered, which has the effect of reordering the data they represent. Each block consists of a sequence of one or more segments, which must be decompressed in order from the beginning of the block.
Each block header describes the algorithm needed to decompress its contents and restore the original data. Each segment describes a file, part of a file, or an array of bytes. A segment header contains an optional file name and an optional comment to support archives, followed by the compressed data, an end-of-data marker (4 zero bytes), and finally an optional SHA-1 checksum of the original data for integrity checking. The data is compressed using an arithmetic coder that never outputs 4 zero bytes in a row, which could otherwise be confused with an end-of-segment marker.
ZPAQ uses a context mixing algorithm and an optional pre/post processor. A sandboxed assembly-like language called ZPAQL is used to compute the contexts for the context mixing algorithm and to perform any required post-processing. If post-processing is used, then the ZPAQL post-processing code is compressed prior to compressing the first segment of the block. In this case, the remainder of the block after decompression is considered input to the post-processing program.
A context mixing model compresses data one bit at a time. During compression, the model estimates the probability that the next bit will be a 0 or 1. The prediction is arithmetic coded. During decompression, the model makes an identical sequence of predictions based on previously decoded data. This prediction and the compressed data are used to decode the next bit. The decoded bits are then packed 8 at a time into bytes. An extra bit after each byte is used to mark the end of the data segment. It is predicted with a very low probability.
The context mixing model is described as an array of 1 to 255 components. Each component inputs a context (a function of past data computed by the supplied ZPAQL code) and possibly the predictions of earlier components, and outputs a new prediction. The output of the model is the output of the last component. ZPAQ describes 9 component types.
The zpaq compressor reads a configuration file which describes the compression format. The contents are stored in the block header and may be read and executed by any compliant decompresser. Other compressors such as zpipe and zp use a set of built-in compression configurations instead.
The following is the "fast" configuration, which is also built into the zp archiver (compression level 1). It is designed for higher speed at the expense of compression. Generally, speed depends on the number of components, in this case 2.
```
(ZPAQ fast configuration)
comp 1 2 0 0 2 (hh hm ph pm n)
  0 icm 16    (order 2)
  1 isse 19 0 (order 4)
hcomp
  *b=a a=0 (save in rotating buffer M)
  d=0 hash b-- hash *d=a
  d++ b-- hash b-- hash *d=a
  halt
post
  0
end
```
Configuration files are not case sensitive. Comments are enclosed in parentheses.
The COMP section describes the components that make up the model. The HCOMP section contains ZPAQL code which computes the contexts for the components. POST 0 indicates that there is no post-processing. The decompressed data is output directly.
The COMP section describes two components, an ICM and an ISSE. The ICM (component number 0) takes as input an order 2 context hash (depending on the last 2 bytes of context plus any previously decoded bits from the current byte) to look up a bit history. The bit history indexes a table to look up the prediction. That prediction is fed to the ISSE (component 1), which also takes an order 4 context as input. Its output prediction goes to the arithmetic decoder. The arguments to the two components (16 and 19) give the sizes of the tables as 2^(16+6) and 2^(19+6) bit histories, respectively. Each bit history uses one byte. Thus, the model uses about 36 MiB of memory.
The HCOMP section describes the ZPAQL code which computes the context hashes for the two components. The ZPAQL machine model consists of:

- 4 general purpose 32 bit registers: A, B, C, and D
- a 1 bit condition flag F
- 256 additional 32 bit registers: R0 through R255
- H, an array of 32 bit values
- M, an array of 8 bit values (bytes)
The 5 arguments to COMP give the sizes of H and M for the HCOMP and PCOMP code, respectively, and the number of components. In this case, H has 2^1 = 2 elements and M has 2^2 = 4 elements.
The ZPAQL instructions are usually 1 or 2 bytes. Most arithmetic and logic instructions except assignment operate on A, B, C, D, *B, *C, *D or an 8 bit constant and put the result in A. *B and *C refer to elements of M. *D refers to an element of H. The HASH instruction computes A ← (A + *B + 512) * 773. The HCOMP code is called once per byte with the input byte in A and all other registers preserved between calls.
Thus, the code first saves the input byte A in M using B as a pointer, then computes an order 2 context hash, stores it in H[0] (pointed to by *D), then hashes the next two bytes and stores it in H[1]. These hashes are updated only once per byte, so they are combined by adding the bitwise contexts from the current byte for each bit prediction.
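As a worked example, the following C++ sketch (with invented names) mirrors what this HCOMP code computes for each input byte. It assumes M is indexed modulo its power of 2 size, as in ZPAQL; it is an illustration, not the actual implementation.

```cpp
#include <cstdint>

// HASH instruction: A <- (A + *B + 512) * 773
static uint32_t hash_step(uint32_t a, uint8_t mb) { return (a + mb + 512) * 773; }

// One call per input byte, mirroring the "fast" HCOMP code.
// M is the rotating buffer (size a power of 2, mask = size-1),
// B is the buffer pointer, H[0..1] receive the component contexts.
void hcomp_fast(uint8_t in, uint8_t* M, uint32_t mask,
                uint32_t& B, uint32_t H[2]) {
  M[B & mask] = in;                    // *b=a  (save input byte)
  uint32_t a = 0;                      // a=0 d=0
  a = hash_step(a, M[B & mask]);       // hash      (current byte)
  --B; a = hash_step(a, M[B & mask]);  // b-- hash  (previous byte)
  H[0] = a;                            // *d=a: order 2 context for the ICM
  --B; a = hash_step(a, M[B & mask]);  // d++ b-- hash
  --B; a = hash_step(a, M[B & mask]);  // b-- hash
  H[1] = a;                            // *d=a: order 4 context for the ISSE
}
```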
The mid level configuration is shown below. It is the default model for zpaq, zp, and zpipe. It provides a fairly good level of compression for most data at some cost in speed.
```
comp 3 3 0 0 8 (hh hm ph pm n)
  0 icm 5        (order 0...5 chain)
  1 isse 13 0
  2 isse $1+17 1
  3 isse $1+18 2
  4 isse $1+18 3
  5 isse $1+19 4
  6 match $1+22 $1+24 (order 7)
  7 mix 16 0 7 24 255 (order 1)
hcomp
  c++ *c=a b=c a=0 (save in rotating buffer M)
  d= 1 hash *d=a   (orders 1...5 for isse)
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash b-- hash *d=a (order 7 for match)
  d++ a=*c a<<= 8 *d=a       (order 1 for mix)
  halt
post
  0
end
```
The compression model is similar to the one used in PAQ9A. It uses an order 0 ICM, a chain of order 1 through 5 ISSEs, and an order 7 MATCH. All of these predictions are combined using a MIX with an order 1 context. The notation $1 means that an argument may be passed to the configuration file and added to the values shown; it selects the ISSE table sizes and the MATCH index table and buffer sizes, so each increment roughly doubles memory usage. The default uses 111 MB of memory. The arguments to MIX are the table size (2^16), the range of inputs (7 components starting at 0), the learning rate (24), and a bit mask (255) which tells it to include the bits of the current byte in selecting the weight array. The second argument to each ISSE selects the previous component's prediction as input.
The following configuration is the maximum compression level for zp.
```
comp 5 9 0 0 22 (hh hm ph pm n)
  0 const 160
  1 icm 5          (orders 0-6)
  2 isse 13 1      (sizebits j)
  3 isse $1+16 2
  4 isse $1+18 3
  5 isse $1+19 4
  6 isse $1+19 5
  7 isse $1+20 6
  8 match $1+22 $1+24
  9 icm $1+17      (order 0 word)
  10 isse $1+19 9  (order 1 word)
  11 icm 13        (sparse with gaps 1-3)
  12 icm 13
  13 icm 13
  14 icm 14        (pic)
  15 mix 16 0 15 24 255 (mix orders 1 and 0)
  16 mix 8 0 16 10 255  (including last mixer)
  17 mix2 0 15 16 24 0
  18 sse 8 17 32 255    (order 0)
  19 mix2 8 17 18 16 255
  20 sse 16 19 32 255   (order 1)
  21 mix2 0 19 20 16 0
hcomp
  c++ *c=a b=c a=0 (save in rotating buffer)
  d= 2 hash *d=a b-- (orders 1,2,3,4,5,7)
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash b-- hash *d=a b--
  d++ hash *d=a b-- (match, order 8)
  d++ a=*c a&~ 32 (lowercase words)
  a> 64 if
    a< 91 if (if a-z)
      d++ hashd d-- (update order 1 word hash)
      *d<>a a+=*d a*= 20 *d=a (order 0 word hash)
      jmp 9
    endif
  endif
  (else not a letter)
  a=*d a== 0 ifnot (move word order 0 to 1)
    d++ *d=a d--
  endif
  *d=0 (clear order 0 word hash)
  (end else)
  d++
  d++ b=c b-- a=0 hash *d=a (sparse 2)
  d++ b-- a=0 hash *d=a (sparse 3)
  d++ b-- a=0 hash *d=a (sparse 4)
  d++ a=b a-= 212 b=a a=0 hash *d=a
    b<>a a-= 216 b<>a a=*b a&= 60 hashd (pic)
  d++ a=*c a<<= 9 *d=a (mix)
  d++ d++ d++ d++
  d++ *d=a (sse)
  halt
post
  0
end
```
Components 1 through 8 are an ICM-ISSE chain similar to the mid configuration. Components 9 and 10 are order 0 and order 1 word-level contexts which improve text compression. The corresponding ZPAQL code converts lower case letters to upper case, then computes a context hash of bytes that fall in the range 65 to 90 (uppercase ASCII). When a non-letter is found, the order 1 hash is updated and the order 0 hash is cleared. These two components form their own ICM-ISSE chain with contexts of increasing order as before.
Components 11 through 13 are sparse order 1 contexts formed from the current byte plus one preceding byte with a gap of 1 to 3 bytes. These models are not chained. Sparse models can improve compression of some binary files.
Component 14 is specialized for CCITT fax images such as the file PIC in the Calgary Corpus. These files encode a bitmapped image with 1728 pixels (216 bytes) per line. The component takes a context hash of the corresponding bytes in the previous two scan lines.
Components 15 and 16 mix all previous predictions using an order 1 and order 0 context respectively. Component 17 mixes their outputs. This prediction is then passed through 2 SSE stages (components 18 and 20) with order 0 and 1 contexts respectively. For each stage, another mixer combines the input and output, which can sometimes improve compression further.
The following configuration (min.cfg) is part of the zpaq distribution intended for minimal but fast compression. It uses LZP preprocessing followed by a simple order 2 context model. Both the compressor and decompresser maintain a hash table that maps high order context hashes to the last occurrence in a rotating history buffer. If the text that follows the current location matches the text that follows the previous occurrence of the same context hash, then the current text is removed and replaced by a two byte code consisting of an escape character and the length of the match. The effect is to remove duplicate instances of long matches in the input. For example, the string "compression and decompression" might be encoded as "compression and decom,ESC,8" to tell the decompresser to copy the 8 bytes that followed the last instance of the order 3 context "com".
If the ESC byte occurs in the input, then it is encoded as ESC,0. Normally the ESC byte is chosen so that the need for such encodings is rare. Matches are encoded as ESC,1 through ESC,255. Additional parameters to LZP are the sizes of the context hash table and history buffer, the order (length) of the context, and the minimum match length. By default the minimum match length is 3, so ESC,1 tells the decompresser to copy 3 bytes, and ESC,255 tells the decompresser to copy 257 bytes.
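As an illustration, here is a minimal C++ sketch of LZP encoding as described above, assuming the default parameters from the text (ESC = 127, minimum match length 3, an 18 bit rolling context hash with multiplier 96). It is a simplified model, not the actual lzppre source; for example, it treats position 0 as "context not yet seen".

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// table maps a context hash to the position that followed the last
// occurrence of that context in the input seen so far.
std::vector<uint8_t> lzp_encode(const std::vector<uint8_t>& in) {
  const int HBITS = 18, HMUL = 96, ESC = 127;
  const std::size_t MINLEN = 3;
  std::vector<uint32_t> table(1u << HBITS, 0);
  std::vector<uint8_t> out;
  uint32_t h = 0;  // rolling context hash
  for (std::size_t i = 0; i < in.size();) {
    std::size_t pos = table[h], match = 0;
    // Count how far the text at the remembered position matches here.
    while (pos && match < 257 && i + match < in.size() &&
           in[pos + match] == in[i + match])
      ++match;
    if (match >= MINLEN) {  // long enough: emit ESC plus a length code
      out.push_back(ESC);
      out.push_back((uint8_t)(match - MINLEN + 1));  // 1..255 -> 3..257 bytes
    } else {                // literal; ESC itself is escaped as ESC,0
      match = 1;
      if (in[i] == ESC) { out.push_back(ESC); out.push_back(0); }
      else out.push_back(in[i]);
    }
    for (std::size_t j = 0; j < match; ++j, ++i) {  // consume the bytes
      table[h] = (uint32_t)i;  // this context was followed by position i
      h = ((h + in[i]) * HMUL) & ((1u << HBITS) - 1);
    }
  }
  return out;
}
```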
ZPAQ only defines a decompression format. Thus, a compressor would be responsible for pre-processing the input using the LZP algorithm, compressing it, and appending the post-processing code in ZPAQL to the archive header. When used with the zpaq compressor, the pre-processor is an external program written in C++ which is called by the compressor to create a temporary file. The post-processing code is shown below.
```
(zpaq 1.07 minimum (fast) compression.
Uses 4 x 2^$1 MB memory. $2 increases minimum match length.
Requires lzppre as an external preprocessor.
(C) 2009, Ocarina Networks Inc. Written by Matt Mahoney.
This software is free under GPL v3. http://www.gnu.org/copyleft/gpl.html )

comp 3 3 $1+18 $1+20 1 (hh hm PH PM n)
  0 cm $1+19 5 (context model size=2^19, limit=5*4)
hcomp
  *d<>a a^=*d a<<= 8 *d=a (order 2 context)
  halt
pcomp lzppre $1+18 $1+20 127 $2+2 96 ; (lzppre PH PM ESC MINLEN HMUL)
  (If you change these values, then change them in the code too)

  (The sequence ESC 0 codes for ESC. The sequence ESC LEN codes for
   a match of length LEN+MINLEN at the last place in the output
   buffer M (size 2^PM) that had the same context hash in the low
   PH bits of D. D indexes hash table H which points into buffer M,
   which contains B bytes. When called, A contains the byte to be
   decoded and F=true if the last byte was ESC. The rolling context
   hash D is updated by D=D*HMUL+M[B])

  if (last byte was ESC then copy from match)
    a> 0 jf 50 (goto output esc)
    a+= $2+2 (MINLEN)
    r=a 0 (save length in R0)
    c=*d (c points to match)
    do (find match and output it)
      *d=b (update index with last output byte)
      a=*c *b=a b++ c++ out (copy and output matching byte)
      d<>a a*= 96 (HMUL) a+=d d<>a (update context hash)
      a=r 0 a-- r=a 0 (decrement length)
    a> 0 while (repeat until length is 0)
    halt
  endif
  (otherwise, set F for ESC)
  a== 127 (ESC) if halt endif

  (reset state at EOF)
  a> 255 if
    b=0 c=0 a= 1 a<<= $1+18 d=a
    do
      d-- *d=0 a=d (clear index)
    a> 0 while
    halt (F=0)

    (goto here: output esc)
    a= 127 (ESC)
  endif

  *d=b (update index)
  *b=a b++ out (update buffer and output)
  d<>a a*= 96 (HMUL) a+=d d<>a (update context hash)
  halt
end
```
The COMP and HCOMP sections describe a simple context model with one component. Other low order models could be substituted here for better compression at the cost of speed. The purpose of the LZP preprocessor is to compress and remove high order statistics so that a fast, low memory, low order model can be used.
The PCOMP section describes the post-processing code. The external program "lzppre" takes the 5 arguments shown in addition to the names of the input and output (temporary) files. These arguments are the sizes of the context hash table and history buffer (as powers of 2, respectively 2^18 and 2^20), the value of the ESC byte (127), the match length offset (2), and the hash multiplier (HMUL = 96). These values must match the corresponding values in the ZPAQL post-processor for correct decoding. The match length offset is added to the encoded length (1 to 255) to represent matches of length 3 to 257. The compressor arguments $1 and $2 can be passed by the user to control memory allocation and the minimum match length respectively. These values are passed to both the preprocessor and the embedded post-processing code.
The hash function is HASH ← (HASH + A) * HMUL (mod 2^18) (depending on the context hash table size), where A is the next input byte. If HMUL = 96 = 3 * 2^5, then the high 5 bits of the hash are shifted out for each byte. Thus HASH depends on only the last 4 bytes of input, effectively computing an order 4 context hash.
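A quick self-contained check of the order 4 claim: a byte processed 4 or more steps ago has its contribution multiplied by 96^4 = 3^4 · 2^20 ≡ 0 (mod 2^18), so two inputs differing only in such a byte reach the same hash. The following small demonstration (illustrative, not part of lzppre) shows this:

```cpp
#include <cstdint>
#include <cstdio>

// HASH <- (HASH + A) * 96 (mod 2^18): streams differing only in a
// byte 4 positions back converge to the same hash value.
int main() {
  const uint32_t mask = (1u << 18) - 1;
  uint32_t h1 = 0, h2 = 0;
  uint8_t s1[] = {'X', 'c', 'o', 'm', 'p'};
  uint8_t s2[] = {'Y', 'c', 'o', 'm', 'p'};
  for (int i = 0; i < 5; ++i) {
    h1 = ((h1 + s1[i]) * 96) & mask;
    h2 = ((h2 + s2[i]) * 96) & mask;
  }
  std::printf("%u %u\n", h1, h2);  // equal: the differing byte shifted out
  return 0;
}
```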
In the ZPAQL code, M is used as a history buffer and H is used as the context hash table. The flag bit F is used to encode whether the last input byte was an ESC. The C register is used as a pointer into M to where the next byte will be stored. D is the current context hash, which points into H. The OUT instruction outputs the contents of A to the decompressed output file. On input to the PCOMP program, A is the decoded byte from the decompressor (prior to LZP decoding), or 2^32 − 1 at the end of a segment. The program is called once for each input byte.
The configuration bwt_j3 (code not shown) by Jan Ondrus pre-processes the input with a Burrows-Wheeler transform (BWT). The input is sorted by context, which removes any high order statistics. The post-processor performs an inverse transform. The transformed data is compressed using a fast adapting order 0-1-2-4 ICM-ISSE chain followed by an order 0 MIX and an order 0 SSE adaptively bypassed by a MIX2. The model is based on the BBB compressor.
BWT normally divides the input into blocks and uses 5 times the block size in memory to compress and decompress. A second configuration, bwt_slowmodel, uses a slower but more memory efficient algorithm modeled on a similar algorithm used in BBB. Blocks are compressed using 1.25 times the block size by sorting in smaller blocks to temporary files and then merging. The inverse transform, instead of building and traversing a linked list, uses a smaller index to find the approximate location of the next decoded byte, followed by a final linear search. By conserving memory, larger blocks can be used, which improves compression.
The configuration jpg_test2 by Jan Ondrus compresses JPEG files. JPEG is already compressed, but it is possible to compress it further by encoding the Huffman coded coefficients using a better model. jpg_test2 uses a preprocessor to expand the packed Huffman codes to whole bytes or byte pairs. Then it uses a 32 component model to re-compress the codes. The model consists of a CONST and 22 ICM and slow adapting CM components, followed by a hierarchy of 5 mixers and a 2 stage SSE chain with partial adaptive (MIX2) bypass.
The configuration bmp_j4 by Jan Ondrus compresses .BMP bitmapped image files. The model is based on a similar model in paq8px_v64, which is slow but has a good compression ratio. It first uses a preprocessor to implement a color transform: (R, G, B) -> (G, R-G, B-G) on the image data following the 54 byte header. The context model reads the header to get the image width and then computes a slow adapting context mixing model with 33 components using various combinations of neighboring pixels as context. Except for the contexts, the model is similar to jpg_test2, with one CONST, 17 slow adapting CM, 6 ICM, and a hierarchy of 5 mixers and a 2 stage SSE chain with partial adaptive bypass.
The configuration exe_j1 by Jan Ondrus is optimized to compress 32 bit x86 executable code (.exe and .dll files). It uses a preprocessor to implement an E8E9 transform. The transform converts relative CALL and JUMP (opcode 0xE8 and 0xE9) addresses to absolute addresses. Because programs often make several calls or jumps to the same target, converting to absolute addresses can increase redundancy. The transformed data is then compressed using the MAX configuration described above.
All component predictions are represented in the logistic domain as 12 bit fixed point numbers with 6 bits after the decimal point, i.e. in the range -32 to +32 with 1/64 resolution. When a prediction is calculated outside of this range, it is clamped to the nearest representable value.
When a prediction is converted to the linear domain for error estimation or arithmetic coding, it is represented as a 15 bit unsigned integer, an odd multiple of 2^−16 in the range 2^−16 to 1 − 2^−16.
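These conversions are conventionally called squash (logistic to linear) and stretch (its inverse) in the PAQ family. A floating point sketch consistent with the ranges above (real implementations use fixed point lookup tables; the exact rounding here is illustrative):

```cpp
#include <algorithm>
#include <cmath>

// squash: a stretched (logistic) value t, a 12 bit fixed point number
// representing t/64 in (-32, +32), to a linear probability in (0, 1).
double squash(int t) { return 1.0 / (1.0 + std::exp(-t / 64.0)); }

// stretch: inverse of squash, clamped to the representable 12 bit range.
int stretch(double p) {
  int t = (int)std::lround(64.0 * std::log(p / (1.0 - p)));
  return std::max(-2047, std::min(2047, t));
}
```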
A ZPAQ model predicts one bit at a time. Each of the components therefore requires an updated context for each bit prediction and update. As a speed optimization, contexts are only updated once per byte by executing the ZPAQL code in the HCOMP section of the block header. This requires that the contexts be further updated after each bit. That is done using a fixed algorithm that depends on the type of component.
For a MIX, MIX2, or SSE, the bitwise context, which depends on the last 0 to 7 bits since the last byte boundary, is added to the bytewise context computed by the HCOMP program. The addition is modulo the size of the component. Since the component size is always a power of 2, this is equivalent to dropping the high order bits of the sum. The bitwise context is a 1 bit followed by the 0 to 7 bits since the last byte boundary. Thus, it ranges from 1 to 255.
A low order context is usually computed such that the low 8 bits are 0 to leave room for adding the bitwise context. A higher order context is usually a hash using all available bits.
For a CM, the current byte is divided into two 4-bit nibbles, and a leading 1 bit is combined with them to form a 9 bit bitwise context, as shown in the table below. Then the bitwise context is exclusive-ored with the bytewise context (usually a hash). The purpose is to preserve cache locality for better speed. Normally a CM is a table of 4 byte entries (a 22 bit prediction and 10 bit count packed into a 32 bit integer) aligned on a 64 byte boundary. On many processors, a cache line is 64 bytes. Combining the bitwise contexts this way guarantees that 4 consecutive table lookups address the same cache line, resulting in at most 2 cache misses per byte. This method does not help with a MIX, MIX2, or SSE because the combined context normally indexes array elements that are too large to pack into cache lines.
Bits | Direct | Cache aligned |
---|---|---|
0 (empty) | 00000001 | 000000001 |
1 x | 0000001x | 00000001x |
2 xx | 000001xx | 0000001xx |
3 xxx | 00001xxx | 000001xxx |
4 xxxx | 0001xxxx | 1xxxx0001 |
5 xxxxx | 001xxxxx | 1xxxx001x |
6 xxxxxx | 01xxxxxx | 1xxxx01xx |
7 xxxxxxx | 1xxxxxxx | 1xxxx1xxx |
A low order context is usually set directly such that the low 9 bits are 0. A high order context is usually a hash.
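A minimal sketch of the cache aligned form from the table above, with invented names; how the previously seen bits are packed into `bits` is an assumption of this example:

```cpp
#include <cstdint>

// bits holds the 0..7 bits seen since the last byte boundary
// (most recent bit lowest); n is how many have been seen.
uint32_t cache_aligned_ctx(uint32_t bits, int n) {
  if (n < 4)  // first nibble: a 1 bit followed by the bits seen so far
    return (1u << n) | bits;
  // Second nibble: a 1 bit, the completed first nibble, then 1 plus the
  // 0..3 bits of the current nibble. With 4 byte entries this keeps 4
  // successive lookups inside one 64 byte cache line.
  uint32_t nib0 = bits >> (n - 4);              // the first 4 bits
  uint32_t rest = bits & ((1u << (n - 4)) - 1); // bits of current nibble
  return (1u << 8) | (nib0 << 4) | (1u << (n - 4)) | rest;
}
// The result is exclusive-ored with the bytewise context (usually a
// hash) to index the CM table.
```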
For an ICM or ISSE, a context addresses a bit history which occupies one byte in a hash table. The context is normally a hash which is updated on nibble boundaries. The hash indexes a 16 byte array containing a 1 byte checksum (used to detect hash collisions) and 15 one-byte bit histories. The hash table is looked up every 4 bits. Subsequent bitwise contexts are then looked up within the 15 byte array of bit histories. This secondary index is calculated by prefixing a 1 bit to the 0 to 3 bits that follow the nibble boundary, giving an index in the range 1 to 15. The primary hash index, computed every 4 bits, is equal to (bytewise context) + 16*(direct bitwise context).
To preserve cache locality, the hash table can be aligned on a 64 byte boundary. A 32-bit hash index is computed by the HCOMP code. The low bits (depending on size) are used to index the hash table, and the next higher 8 bits are used as a checksum confirmation. A table element is found by comparing the checksum at 3 adjacent locations and picking the first match. The 3 indexes are computed by exclusive-or'ing the hash with 0, 1, and 2. This ensures that the three lookups stay within a cache line.
If no checksum match is found, then the element with the lowest bit history count (n0 + n1) in the first (nibble aligned) context is replaced. This implements a least frequently used cache replacement policy.
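A hedged sketch of this lookup and replacement policy follows; the row layout matches the description above, while the function names and the priority stand-in are invented for illustration:

```cpp
#include <cstdint>
#include <cstring>

struct Row {
  uint8_t check;     // 8 bit hash confirmation
  uint8_t hist[15];  // bit histories for the 15 contexts in a nibble
};

// Placeholder: the real priority is the n0+n1 count of the first
// (nibble aligned) bit history, decoded from the state byte.
static int priority(uint8_t state) { return state; }

Row* find(Row* table, int sizebits, uint32_t hash) {
  uint32_t idx = hash & ((1u << sizebits) - 1);  // low bits index the table
  uint8_t chk = (hash >> sizebits) & 0xFF;       // next 8 bits confirm
  // Probe 3 adjacent slots: idx^0, idx^1, idx^2 share one 64 byte
  // cache line when rows are 16 bytes and the table is 64 byte aligned.
  for (uint32_t i = 0; i < 3; ++i)
    if (table[idx ^ i].check == chk) return &table[idx ^ i];
  // No match: evict the least frequently used of the 3 candidates.
  Row* victim = &table[idx];
  for (uint32_t i = 1; i < 3; ++i)
    if (priority(table[idx ^ i].hist[0]) < priority(victim->hist[0]))
      victim = &table[idx ^ i];
  std::memset(victim, 0, sizeof(Row));
  victim->check = chk;
  return victim;
}
```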
A context model is a table that maps a context to a 22 bit linear prediction p (range 0 to 1 with precision 2^−22) and a 10 bit count n (range 0 to 1023). Initially, p = 1/2 and n = 0. On update with bit y (0 or 1), the prediction error y − p is calculated, and the state is updated:

p := p + (y − p) / (n + 1.5)
n := min(n + 1, limit)
where the limit is specified by the user. Usually a higher limit compresses better for stationary sources and a lower limit is better for fast adaption to changing statistics.
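In code, the update might look like the following sketch; the 32 bit packing of p and n is as described above, but the fixed point rounding details are illustrative:

```cpp
#include <cstdint>

// One CM table entry packs a 22 bit prediction and a 10 bit count.
void cm_update(uint32_t& entry, int y, uint32_t limit) {
  uint32_t p = entry >> 10;  // prediction: p/2^22 in (0, 1)
  uint32_t n = entry & 1023; // count: 0..1023
  int err = (y << 22) - (int)p;                    // y - p, scaled by 2^22
  p = (uint32_t)((int)p + (int)(err / (n + 1.5))); // p += (y-p)/(n+1.5)
  if (n < limit) ++n;                              // n := min(n+1, limit)
  entry = (p << 10) | n;
}
```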
An ICM or ISSE maps a context to a bit history through a hash table as described above. In an ICM, the bit history is mapped to a prediction using a CM with a maximum limit of 1023. In an ISSE, the bit history is used to select a weight table for a 2 input MIX. The MIX takes a prediction from another component and mixes it with a CONST prediction of 1 in the logistic domain. The effect in either case is to model both stationary and nonstationary sources effectively.
For example, suppose we observe a certain context 8 times and record the sequence 00000001. Now we observe the context again and wish to predict the next bit. If the source were stationary, we would assume that P(1) = 1/8. If it were nonstationary, then we would give greater weight to recent events and estimate P(1) > 1/8. If the source were random data, then P(1) = 1/2. An indirect model answers the question adaptively by observing what happened when the same bit history was observed in other contexts, by counting how many times a 0 or 1 was observed in these cases.
A bit history is a tuple (n0, n1, LB) representing a count of 0 and 1 bits and the value of the last bit, respectively. Thus, the sequence 00000001 would be represented as (7, 1, 1). To fit a bit history into a single byte (256 possible values), the values are restricted as indicated in the table below. The restrictions are symmetric with respect to n0 and n1. Furthermore, LB is represented only when 1 ≤ n0 + n1 ≤ 17.
n0 | n1 |
---|---|
0 | 0..20 |
1 | 0..48 |
2 | 0..15 |
3 | 0..8 |
4 | 0..6 |
5 | 0..5 |
6 | 0..4 |
7..8 | 0..3 |
9..15 | 0..2 |
16..20 | 0..1 |
21..48 | 1 |
The update rule is to set LB to the most recent bit and increment the appropriate count. If the count exceeds the bounds above, the state moves to the nearest state that approximately preserves the ratio of n0 to n1. For example, a 1 bit in state (5, 5, 0) would go to (5, 6, 1), but since this exceeds the bounds, the next state is rounded to (4, 5, 1) instead.
Furthermore, when one of the counts is large and the opposite bit is observed, a portion of the larger count is first discarded. The rule is that 6 or 7 is decremented and larger counts are reduced to 7. For example if state (20, 0, 0) is updated by a sequence of 1 bits, then the following states would be (7, 1, 1), (6, 2, 1), (5, 3, 1), (5, 4, 1), (5, 5, 1), (4, 5, 1). The effect is to better model sources that are likely to be nonstationary.
An ICM or ISSE effectively models a wide range of sources, but for stationary sources a CM with a high limit still gives the best results. An ISSE is usually used in a chain of contexts of increasing order, starting with a low order CM or ICM. An ISSE learns to adjust its input prediction, i.e. it learns the weights w1 and w2 such that for input prediction p (in the logistic domain), w1 + w2p gives the best possible prediction.
An AVG averages two predictions in the logistic domain: p = w1p1 + w2p2, where the weights are selected by the user and constrained to add to 1.
A MIX2 allows the weights to adjust, but they are still constrained to add to 1. A weight is represented as an unsigned 16 bit fixed point integer in the range 0 to 1. The initial weights are 1/2.
A MIX allows any number of inputs and does not constrain the weights to add to 1. A weight is represented as a signed 20 bit fixed point number with 16 bits after the decimal point, i.e. a range of −8 to +8 with precision 2^−16. Initially the weights are equal and add to 1. The update for each weight is:

wi := wi + η pi (y − p)
where η is the user specified learning rate, p is the output prediction, y is the actual value of the predicted bit, and pi is the input prediction in the logistic domain. Effectively, a MIX is a simple 2-layer neural network such that the weights converge by gradient descent to minimize coding cost.
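A floating point sketch of a MIX (ZPAQ itself uses the fixed point weight representation described above; the names here are invented):

```cpp
#include <cstddef>
#include <vector>

struct Mix {
  std::vector<double> w;  // one weight per input component

  // Weighted sum of input predictions, all in the logistic domain.
  double predict(const std::vector<double>& p) const {
    double t = 0;
    for (std::size_t i = 0; i < w.size(); ++i) t += w[i] * p[i];
    return t;  // squashed to a linear probability before coding
  }

  // After coding bit y with squashed output prediction pr in (0,1):
  // w_i := w_i + eta * p_i * (y - p), gradient descent on coding cost.
  void update(const std::vector<double>& p, int y, double pr, double eta) {
    double err = (y - pr) * eta;
    for (std::size_t i = 0; i < w.size(); ++i) w[i] += p[i] * err;
  }
};
```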
An SSE inputs a prediction and a context (like an ISSE) and outputs a new prediction. The stretched (logistic domain) prediction is quantized to one of 32 levels in the range -16 to 16 in steps of 1/2 to select two adjacent entries from a CM-like table (a 22 bit prediction and 10 bit count). The prediction is interpolated and only the closer of the two table entries is updated as with a CM. The limit is selectable as with a CM.
A MATCH model predicts a bit by looking up the last point in a history buffer with the same context hash and predicting the next bit that follows. Once a match is found, the length of the match (in bytes) is maintained until a bit is predicted incorrectly. Once a prediction fails, remaining predictions in the current byte are 1/2 (0 in the logistic domain). At the next byte boundary, a new match location is found by computing and looking up the context hash, and the match length is determined by comparing backward from the current and looked-up positions in the history buffer. The context hash table and history buffer are updated after every byte whether or not a lookup occurs.
If the match length is 0 then the prediction is 1/2. Otherwise the next bit is predicted with probability 1 − 1/(16L), where L is the match length in bytes, capped at 255.
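A small sketch of this prediction rule, in the linear domain and floating point for clarity (names invented):

```cpp
// Probability that the next bit is 1, given the bit the match model
// expects (`bit`) and the current match length L in bytes.
double match_predict(int bit, int L) {
  if (L <= 0) return 0.5;                // no match: prediction is 1/2
  if (L > 255) L = 255;                  // L is capped at 255
  double conf = 1.0 - 1.0 / (16.0 * L);  // confidence in the expected bit
  return bit ? conf : 1.0 - conf;
}
```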
ZPAQ specifies only the arithmetic decoder, although the encoder has a similar design. The decoder state is a pair of 32 bit unsigned integers representing a range (LOW, HIGH), initially (1, 2^32 − 1), and a current value CURR, initialized to the first 4 bytes of compressed data. The state is constrained at all times such that LOW ≤ CURR ≤ HIGH and 0 < LOW < HIGH, except at the end of input when CURR = 0 < LOW. The decoding (and encoding) algorithm is designed so that a sequence of 4 zero bytes never appears in the compressed data. This allows a sequence of 4 zeros to mark the end of the compressed data so that streams of compressed and uncompressed data can be freely mixed and the end of the compressed data can be found quickly without decompressing.
The decoding algorithm is as follows. The input is a prediction P, 0 ≤ P < 1, as a 16 bit fixed point integer. The return value is the decoded bit Y, either 0 or 1. All arithmetic is modulo 2^32. input() reads one byte.
```
decode(P)
  MID ← LOW + floor((HIGH − LOW) * P)
  if CURR ≤ MID then Y ← 1, HIGH ← MID
  else Y ← 0, LOW ← MID + 1
  while (LOW >> 24) = (HIGH >> 24) do
    LOW ← LOW * 256
    if LOW = 0 then LOW ← 1
    HIGH ← HIGH * 256 + 255
    CURR ← CURR * 256 + input()
  return Y
```
The corresponding encoding algorithm is as follows. It takes P as before and a bit Y to be encoded. LOW and HIGH are initialized as in decode(). output() writes one byte.
```
encode(P, Y)
  MID ← LOW + floor((HIGH − LOW) * P)
  if Y = 1 then HIGH ← MID
  else LOW ← MID + 1
  while (LOW >> 24) = (HIGH >> 24) do
    output(LOW >> 24)
    LOW ← LOW * 256
    if LOW = 0 then LOW ← 1
    HIGH ← HIGH * 256 + 255
```
Predictions are always encoded with P > 0. After 8 bit predictions, an EOS (end of segment) bit is coded with P = 0. The EOS bit is 1 after the last byte is encoded. In the encoder, this has the effect of setting HIGH ← LOW and flushing 4 bytes to output. After this, 4 zero bytes would be written. Decoding the EOS bit causes those 4 zero bytes to be read into CURR, which would be an invalid state if decoding were to continue.
A ZPAQ decoder has up to two virtual machines called HCOMP and PCOMP. HCOMP is called once for each byte of decoded data and is used to compute the bytewise contexts for the decompression model. PCOMP, which is optional, is called once for each byte of decoded data and once at end of segment (EOS). If present, its output is the decompressed data. Otherwise, the output of the decoder is the decompressed data directly.
A ZPAQL virtual machine has the following state:

- PC, the program counter
- 4 general purpose 32 bit registers: A, B, C, and D
- a 1 bit condition flag F
- 256 additional 32 bit registers: R0 through R255
- H, an array of 32 bit values
- M, an array of 8 bit values (bytes)
In HCOMP, the first n elements of H are the contexts for the first n components described in the COMP section. In PCOMP, H has no special meaning. The sizes of H and M are given in the block header. The sizes are powers of 2.
The A register is used as an accumulator. All binary arithmetic and logic operations except assignment take A and one other register or 8 bit immediate value and put the result in A. The other operand may be A, B, C, D, *B, *C, *D, or a constant in the range 0 to 255. Assignment and unary operators may update any of these registers. *B and *C mean the element of M indexed by the low bits of B or C (depending on the size of M). *D means the element of H indexed by the low bits of D.
The binary operators are = += -= *= /= %= &= &~ ^= |= <<= >>=. These have mostly the same meanings as in C. For example, "A+=B" means add B to A. The operator &~ means &=~ as in "A&~B" means "A &= ~B". Division (/=) and mod (%=) by 0 are defined to give 0. Shifts (>>=, <<=) are explicitly defined to use only the low 5 bits of the right operand so that "A<<= 33" is the same as "A<<= 1".
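For an emulator, these edge cases might be implemented as in the following sketch (names invented; the semantics are as stated above):

```cpp
#include <cstdint>

// Division and mod by 0 are defined to give 0.
uint32_t zdiv(uint32_t a, uint32_t b) { return b ? a / b : 0; }
uint32_t zmod(uint32_t a, uint32_t b) { return b ? a % b : 0; }

// Shifts use only the low 5 bits of the right operand,
// so a shift count of 33 behaves like a shift count of 1.
uint32_t zshl(uint32_t a, uint32_t b) { return a << (b & 31); }
uint32_t zshr(uint32_t a, uint32_t b) { return a >> (b & 31); }
```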
Unary operators and assignment may put the result in A, B, C, D, *B, *C, or *D. The unary operators are ++ (increment), -- (decrement), ! (bitwise complement), =0 (clear), and <>A (swap with A, defined for operands other than A). Assignment (=) copies between any of these operands or from an 8 bit constant.
Comparison operators are ==, <, and >. Comparison is unsigned. The left operand must be A, as with other binary operators. The result is put in F: 1 if true or 0 if false. Other instructions include the conditional jumps JT n and JF n (jump if F is true or false), the unconditional jump JMP n, the 3 byte long jump LJ, transfers between A and the 256 R registers (R=A n and A=R n), HASH and HASHD for computing context hashes, OUT, HALT, and ERROR. n is a one byte unsigned operand (0..255), or a signed operand (−128..127) in the case of the short jump instructions.
Most opcodes are one byte. Opcodes that take a numeric argument are 2 bytes except for the LJ instruction, which is 3 bytes.
The JT, JF, and JMP instruction operands are relative to the next instruction. Thus, JMP 0 simply advances to the next instruction and JMP -2 is an infinite loop. Offsets are in bytes, not instructions.
The OUT instruction outputs the low 8 bits of A to the decompressed data stream in PCOMP. In HCOMP it has no effect.
HCOMP and PCOMP are each called once per encoded or decoded byte with that input byte in the A register and PC = 0 (the first instruction). PCOMP is called once more at end of segment with A = 2^32 − 1. All other machine state is preserved between calls. Before the first call, all state is initialized to 0. It is an error to execute an undefined opcode or ERROR, or for PC to go outside the range of the program.
In a ZPAQ block, HCOMP is stored uncompressed. PCOMP, if present, is compressed at the beginning of the first segment. If it is absent, a single 0 byte is compressed to indicate so; otherwise a 1 byte, a 2 byte length, and the PCOMP code are compressed. The sizes of H and M for both models are stored uncompressed in the block header so that archivers can determine memory requirements without decompressing.
The following programs are ZPAQ level 1 compliant:

- unzpaq, the reference decoder
- zpaq, an archiver and compression development tool
- zp, an archiver with three built-in compression levels (fast, mid, and max)
- zpipe, a streaming compressor using the mid configuration
Only zpaq reads configuration files. Newer versions support if-else-endif and do-while statements, which the program translates into jump instructions. It supports up to 9 numeric arguments which can be added to values in the code. Code may contain comments (in parentheses), is free form, and is not case sensitive. To make opcode lengths explicit, each instruction is written so that its byte count is visible: a 1 byte instruction such as "A=B" is written without spaces, while the operand of a 2 byte instruction is written as a separate token, as in "A= 123". The spacing must be written exactly as shown to emphasize that 123 is the second byte.
zpaq, zp, and zpipe are implemented using libzpaq, a public domain library API written in C++. The library supplies functions to read and write ZPAQ byte streams, which can be files or data structures in memory.