ZPAQ is a proposed open format for highly compressed data, created mainly by Matt Mahoney. It is designed so that new compression algorithms may be developed without breaking compatibility with older versions of ZPAQ-compliant decompression programs. ZPAQ uses a configurable context mixing algorithm (based on PAQ) and sandboxed post-processing code. It uses a streaming data format that supports archives, single file compression, and memory-to-memory compression. Several ZPAQ-compatible compression programs and configurations are available under the GPL. There is also a public domain API in C++ providing compression and decompression services. The format is believed to be unencumbered by patents.
Matt Mahoney began work on ZPAQ on Feb. 15, 2009. After a series of incompatible experimental versions, the level 1 standard was finalized on Mar. 12, 2009. The standard consists of a document describing the format and a reference decoder (unzpaq) written in C++ and released under GPL. The standard does not define a compression algorithm; instead, it requires that any ZPAQ level 1 compliant decoder be able to read the output of any other ZPAQ level 1 compliant compressor. The standard has a provision for higher levels, which must be able to read the output of compressors compliant with lower levels but not vice versa. However, no higher levels have yet been defined.
The following software was released under GPL:

- unzpaq, the reference decoder
- zpaq, an archiver and compression development tool that reads configuration files
- zpipe, a streaming compressor and decompressor
- zp, an archiver with built-in compression levels
In addition, several configurations and preprocessors were published including specialized models for text, BMP images, JPEG images, x86 executable data, and pre/post processors using LZP and BWT compression. Many of these were written by Jan Ondrus.
ZPAQ uses a streaming data format. A ZPAQ stream consists of a sequence of blocks that can be decompressed independently. Blocks can be reordered, which has the effect of reordering the data they represent. Each block consists of a sequence of one or more segments, which must be decompressed in order from the beginning of the block.
Each block header describes the algorithm needed to decompress its contents and restore the original data. Each segment describes a file, part of a file, or an array of bytes. A segment header contains an optional file name and an optional comment to support archives, followed by the compressed data, an end-of-data marker (4 zero bytes), and finally an optional SHA-1 checksum of the original data for integrity checking. The data is compressed using an arithmetic coder that never outputs 4 zero bytes in a row, which could otherwise be confused with an end-of-segment marker.
ZPAQ uses a context mixing algorithm and an optional pre/post processor. A sandboxed assembly-like language called ZPAQL is used to compute the contexts for the context mixing algorithm and to perform any required post-processing. If post-processing is used, then the ZPAQL post-processing code is compressed prior to compressing the first segment of the block. In this case, the remainder of the block after decompression is considered input to the post-processing program.
A context mixing model compresses data one bit at a time. During compression, the model estimates the probability that the next bit will be a 0 or 1. The prediction is arithmetic coded. During decompression, the model makes an identical sequence of predictions based on previously decoded data. This prediction and the compressed data are used to decode the next bit. The decoded bits are then packed 8 at a time into bytes. An extra bit after each byte is used to mark the end of the data segment. It is predicted with a very low probability.
The context mixing model is described as an array of 1 to 255 components. Each component inputs a context (a function of past data computed by the supplied ZPAQL code) and possibly the predictions of earlier components, and outputs a new prediction. The output of the model is the output of the last component. ZPAQ describes 9 component types.
The zpaq compressor reads a configuration file which describes the compression format. The contents are stored in the block header and may be read and executed by any compliant decompresser. Other compressors such as zpipe and zp use a set of built-in compression configurations instead.
The following is the "fast" configuration, which is also built into the zp archiver (compression level 1). It is designed for higher speed at the expense of compression. Generally, speed depends on the number of components, in this case 2.
```
(ZPAQ fast configuration)
comp 1 2 0 0 2 (hh hm ph pm n)
  0 icm 16    (order 2)
  1 isse 19 0 (order 4)
hcomp
  *b=a a=0 (save in rotating buffer M)
  d=0 hash b-- hash *d=a
  d++ b-- hash b-- hash *d=a
  halt
post
  0
end
```
Configuration files are not case sensitive. Comments are enclosed in parentheses.
The COMP section describes the components that make up the model. The HCOMP section contains ZPAQL code which computes the contexts for the components. POST 0 indicates that there is no post-processing. The decompressed data is output directly.
The COMP section describes two components, an ICM and an ISSE. The ICM (component number 0) takes as input an order 2 context hash (depending on the last 2 bytes of context plus any previously decoded bits from the current byte) to look up a bit history. The bit history indexes a table to look up the prediction. That prediction is fed to the ISSE (component 1), which also takes an order 4 context as input. Its output prediction goes to the arithmetic decoder. The arguments to the two components (16 and 19) give the sizes of the tables as 2^(16+6) and 2^(19+6) bit histories, respectively. Each bit history uses one byte. Thus, the model uses about 36 MiB of memory.
The HCOMP section describes the ZPAQL code which computes the context hashes for the two components. The ZPAQL machine model consists of:

- 4 general purpose 32 bit registers: A, B, C, and D
- a 1 bit condition flag F
- 256 additional 32 bit registers: R0 through R255
- H, an array of 32 bit values
- M, an array of 8 bit values (bytes)
The 5 arguments to COMP give the sizes of H and M for the HCOMP and PCOMP code, respectively, and the number of components. In this case, H has 2^1 = 2 elements and M has 2^2 = 4 elements.
The ZPAQL instructions are usually 1 or 2 bytes. Most arithmetic and logic instructions except assignment operate on A, B, C, D, *B, *C, *D or an 8 bit constant and put the result in A. *B and *C refer to elements of M. *D refers to an element of H. The HASH instruction computes A ← (A + *B + 512) * 773. The HCOMP code is called once per byte with the input byte in A and all other registers preserved between calls.
Thus, the code first saves the input byte A in M using B as a pointer, then computes an order 2 context hash, stores it in H[0] (pointed to by *D), then hashes the next two bytes and stores it in H[1]. These hashes are updated only once per byte, so they are combined by adding the bitwise contexts from the current byte for each bit prediction.
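As a worked example, the following C++ sketch (with invented names) mirrors what this HCOMP code computes for each input byte. It assumes M is indexed modulo its power of 2 size, as in ZPAQL; it is an illustration, not the actual implementation.

```cpp
#include <cstdint>

// HASH instruction: A <- (A + *B + 512) * 773
static uint32_t hash_step(uint32_t a, uint8_t mb) { return (a + mb + 512) * 773; }

// One call per input byte, mirroring the "fast" HCOMP code.
// M is the rotating buffer (size a power of 2, mask = size-1),
// B is the buffer pointer, H[0..1] receive the component contexts.
void hcomp_fast(uint8_t in, uint8_t* M, uint32_t mask,
                uint32_t& B, uint32_t H[2]) {
  M[B & mask] = in;                    // *b=a  (save input byte)
  uint32_t a = 0;                      // a=0 d=0
  a = hash_step(a, M[B & mask]);       // hash      (current byte)
  --B; a = hash_step(a, M[B & mask]);  // b-- hash  (previous byte)
  H[0] = a;                            // *d=a: order 2 context for the ICM
  --B; a = hash_step(a, M[B & mask]);  // d++ b-- hash
  --B; a = hash_step(a, M[B & mask]);  // b-- hash
  H[1] = a;                            // *d=a: order 4 context for the ISSE
}
```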
The mid level configuration is shown below. It is the default model for zpaq, zp, and zpipe. It provides a fairly good level of compression for most data at some cost in speed.
```
comp 3 3 0 0 8 (hh hm ph pm n)
  0 icm 5        (order 0...5 chain)
  1 isse 13 0
  2 isse $1+17 1
  3 isse $1+18 2
  4 isse $1+18 3
  5 isse $1+19 4
  6 match $1+22 $1+24 (order 7)
  7 mix 16 0 7 24 255 (order 1)
hcomp
  c++ *c=a b=c a=0 (save in rotating buffer M)
  d= 1 hash *d=a   (orders 1...5 for isse)
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash b-- hash *d=a (order 7 for match)
  d++ a=*c a<<= 8 *d=a       (order 1 for mix)
  halt
post
  0
end
```
The compression model is similar to the one used in PAQ9A. It uses an order 0 ICM, a chain of order 1 through 5 ISSEs, and an order 7 MATCH. All of these predictions are combined using a MIX with an order 1 context. The notation $1 means that an argument may be passed to the configuration file and added to the values shown; it selects the ISSE table sizes and the MATCH index table and buffer sizes, so each increment roughly doubles memory usage. The default uses 111 MB of memory. The arguments to MIX are the table size (2^16), the range of inputs (7 components starting at 0), the learning rate (24), and a bit mask (255) which tells it to include the bits of the current byte in selecting the weight array. The second argument to each ISSE selects the previous component's prediction as input.
The following configuration is the maximum compression level for zp.
```
comp 5 9 0 0 22 (hh hm ph pm n)
  0 const 160
  1 icm 5          (orders 0-6)
  2 isse 13 1      (sizebits j)
  3 isse $1+16 2
  4 isse $1+18 3
  5 isse $1+19 4
  6 isse $1+19 5
  7 isse $1+20 6
  8 match $1+22 $1+24
  9 icm $1+17      (order 0 word)
  10 isse $1+19 9  (order 1 word)
  11 icm 13        (sparse with gaps 1-3)
  12 icm 13
  13 icm 13
  14 icm 14        (pic)
  15 mix 16 0 15 24 255 (mix orders 1 and 0)
  16 mix 8 0 16 10 255  (including last mixer)
  17 mix2 0 15 16 24 0
  18 sse 8 17 32 255    (order 0)
  19 mix2 8 17 18 16 255
  20 sse 16 19 32 255   (order 1)
  21 mix2 0 19 20 16 0
hcomp
  c++ *c=a b=c a=0 (save in rotating buffer)
  d= 2 hash *d=a b-- (orders 1,2,3,4,5,7)
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash b-- hash *d=a b--
  d++ hash *d=a b-- (match, order 8)
  d++ a=*c a&~ 32 (lowercase words)
  a> 64 if
    a< 91 if (if a-z)
      d++ hashd d-- (update order 1 word hash)
      *d<>a a+=*d a*= 20 *d=a (order 0 word hash)
      jmp 9
    endif
  endif
  (else not a letter)
  a=*d a== 0 ifnot (move word order 0 to 1)
    d++ *d=a d--
  endif
  *d=0 (clear order 0 word hash)
  (end else)
  d++
  d++ b=c b-- a=0 hash *d=a (sparse 2)
  d++ b-- a=0 hash *d=a (sparse 3)
  d++ b-- a=0 hash *d=a (sparse 4)
  d++ a=b a-= 212 b=a a=0 hash *d=a
    b<>a a-= 216 b<>a a=*b a&= 60 hashd (pic)
  d++ a=*c a<<= 9 *d=a (mix)
  d++ d++ d++ d++
  d++ *d=a (sse)
  halt
post
  0
end
```
Components 1 through 8 are an ICM-ISSE chain similar to the mid configuration. Components 9 and 10 are order 0 and order 1 word-level contexts which improve text compression. The corresponding ZPAQL code converts lower case letters to upper case, then computes a context hash of bytes that fall in the range 65 to 90 (uppercase ASCII). When a non-letter is found, the order 1 hash is updated and the order 0 hash is cleared. These two components form their own ICM-ISSE chain with contexts of increasing order as before.
Components 11 through 13 are sparse order 1 contexts formed from the current byte plus one preceding byte with a gap of 1 to 3 bytes. These models are not chained. Sparse models can improve compression of some binary files.
Component 14 is specialized for CCITT fax images such as the file PIC in the Calgary Corpus. These files encode a bitmapped image with 1728 pixels (216 bytes) per line. The component takes a context hash of the corresponding bytes in the previous two scan lines.
Components 15 and 16 mix all previous predictions using an order 1 and order 0 context respectively. Component 17 mixes their outputs. This prediction is then passed through 2 SSE stages (components 18 and 20) with order 0 and 1 contexts respectively. For each stage, another mixer combines the input and output, which can sometimes improve compression further.
The following configuration (min.cfg) is part of the zpaq distribution intended for minimal but fast compression. It uses LZP preprocessing followed by a simple order 2 context model. Both the compressor and decompresser maintain a hash table that maps high order context hashes to the last occurrence in a rotating history buffer. If the text that follows the current location matches the text that follows the previous occurrence of the same context hash, then the current text is removed and replaced by a two byte code consisting of an escape character and the length of the match. The effect is to remove duplicate instances of long matches in the input. For example, the string "compression and decompression" might be encoded as "compression and decom,ESC,8" to tell the decompresser to copy the 8 bytes that followed the last instance of the order 3 context "com".
If the ESC byte occurs in the input, then it is encoded as ESC,0. Normally the ESC byte is chosen so that the need for such encodings is rare. Matches are encoded as ESC,1 through ESC,255. Additional parameters to LZP are the sizes of the context hash table and history buffer, the order (length) of the context, and the minimum match length. By default the minimum match length is 3, so ESC,1 tells the decompresser to copy 3 bytes, and ESC,255 tells the decompresser to copy 257 bytes.
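As an illustration, here is a minimal C++ sketch of LZP encoding as described above, assuming the default parameters from the text (ESC = 127, minimum match length 3, an 18 bit rolling context hash with multiplier 96). It is a simplified model, not the actual lzppre source; for example, it treats position 0 as "context not yet seen".

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// table maps a context hash to the position that followed the last
// occurrence of that context in the input seen so far.
std::vector<uint8_t> lzp_encode(const std::vector<uint8_t>& in) {
  const int HBITS = 18, HMUL = 96, ESC = 127;
  const std::size_t MINLEN = 3;
  std::vector<uint32_t> table(1u << HBITS, 0);
  std::vector<uint8_t> out;
  uint32_t h = 0;  // rolling context hash
  for (std::size_t i = 0; i < in.size();) {
    std::size_t pos = table[h], match = 0;
    // Count how far the text at the remembered position matches here.
    while (pos && match < 257 && i + match < in.size() &&
           in[pos + match] == in[i + match])
      ++match;
    if (match >= MINLEN) {  // long enough: emit ESC plus a length code
      out.push_back(ESC);
      out.push_back((uint8_t)(match - MINLEN + 1));  // 1..255 -> 3..257 bytes
    } else {                // literal; ESC itself is escaped as ESC,0
      match = 1;
      if (in[i] == ESC) { out.push_back(ESC); out.push_back(0); }
      else out.push_back(in[i]);
    }
    for (std::size_t j = 0; j < match; ++j, ++i) {  // consume the bytes
      table[h] = (uint32_t)i;  // this context was followed by position i
      h = ((h + in[i]) * HMUL) & ((1u << HBITS) - 1);
    }
  }
  return out;
}
```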
ZPAQ only defines a decompression format. Thus, a compressor would be responsible for pre-processing the input using the LZP algorithm, compressing it, and appending the post-processing code in ZPAQL to the archive header. When used with the zpaq compressor, the pre-processor is an external program written in C++ which is called by the compressor to create a temporary file. The post-processing code is shown below.
```
(zpaq 1.07 minimum (fast) compression.
Uses 4 x 2^$1 MB memory. $2 increases minimum match length.
Requires lzppre as an external preprocessor.
(C) 2009, Ocarina Networks Inc. Written by Matt Mahoney.
This software is free under GPL v3. http://www.gnu.org/copyleft/gpl.html )

comp 3 3 $1+18 $1+20 1 (hh hm PH PM n)
  0 cm $1+19 5 (context model size=2^19, limit=5*4)
hcomp
  *d<>a a^=*d a<<= 8 *d=a (order 2 context)
  halt
pcomp lzppre $1+18 $1+20 127 $2+2 96 ; (lzppre PH PM ESC MINLEN HMUL)
  (If you change these values, then change them in the code too)

  (The sequence ESC 0 codes for ESC. The sequence ESC LEN codes for
   a match of length LEN+MINLEN at the last place in the output
   buffer M (size 2^PM) that had the same context hash in the low
   PH bits of D. D indexes hash table H which points into buffer M,
   which contains B bytes. When called, A contains the byte to be
   decoded and F=true if the last byte was ESC. The rolling context
   hash D is updated by D=D*HMUL+M[B])

  if (last byte was ESC then copy from match)
    a> 0 jf 50 (goto output esc)
    a+= $2+2 (MINLEN)
    r=a 0 (save length in R0)
    c=*d (c points to match)
    do (find match and output it)
      *d=b (update index with last output byte)
      a=*c *b=a b++ c++ out (copy and output matching byte)
      d<>a a*= 96 (HMUL) a+=d d<>a (update context hash)
      a=r 0 a-- r=a 0 (decrement length)
    a> 0 while (repeat until length is 0)
    halt
  endif
  (otherwise, set F for ESC)
  a== 127 (ESC) if halt endif

  (reset state at EOF)
  a> 255 if
    b=0 c=0 a= 1 a<<= $1+18 d=a
    do
      d-- *d=0 a=d (clear index)
    a> 0 while
    halt (F=0)

    (goto here: output esc)
    a= 127 (ESC)
  endif

  *d=b (update index)
  *b=a b++ out (update buffer and output)
  d<>a a*= 96 (HMUL) a+=d d<>a (update context hash)
  halt
end
```
The COMP and HCOMP sections describe a simple context model with one component. Other low order models could be substituted here for better compression at the cost of speed. The purpose of the LZP preprocessor is to compress and remove high order statistics so that a fast, low memory, low order model can be used.
The PCOMP section describes the post-processing code. The external program "lzppre" takes the 5 arguments shown in addition to the names of the input and output (temporary) files. These arguments are the sizes of the context hash table and history buffer (as powers of 2, respectively 2^18 and 2^20), the value of the ESC byte (127), the match length offset (2), and the hash multiplier (HMUL = 96). These values must match the corresponding values in the ZPAQL post-processor for correct decoding. The match length offset is added to the encoded length (1 to 255) to represent matches of length 3 to 257. The compressor arguments $1 and $2 can be passed by the user to control memory allocation and the minimum match length respectively. These values are passed to both the preprocessor and the embedded post-processing code.
The hash function is HASH ← (HASH + A) * HMUL (mod 2^18) (depending on the context hash table size), where A is the next input byte. If HMUL = 96 = 3 * 2^5, then the high 5 bits of the hash are shifted out for each byte. Thus HASH depends on only the last 4 bytes of input, effectively computing an order 4 context hash.
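A quick self-contained check of the order 4 claim: a byte processed 4 or more steps ago has its contribution multiplied by 96^4 = 3^4 · 2^20 ≡ 0 (mod 2^18), so two inputs differing only in such a byte reach the same hash. The following small demonstration (illustrative, not part of lzppre) shows this:

```cpp
#include <cstdint>
#include <cstdio>

// HASH <- (HASH + A) * 96 (mod 2^18): streams differing only in a
// byte 4 positions back converge to the same hash value.
int main() {
  const uint32_t mask = (1u << 18) - 1;
  uint32_t h1 = 0, h2 = 0;
  uint8_t s1[] = {'X', 'c', 'o', 'm', 'p'};
  uint8_t s2[] = {'Y', 'c', 'o', 'm', 'p'};
  for (int i = 0; i < 5; ++i) {
    h1 = ((h1 + s1[i]) * 96) & mask;
    h2 = ((h2 + s2[i]) * 96) & mask;
  }
  std::printf("%u %u\n", h1, h2);  // equal: the differing byte shifted out
  return 0;
}
```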
In the ZPAQL code, M is used as a history buffer and H is used as the context hash table. The flag bit F is used to encode whether the last input byte was an ESC. The C register is used as a pointer into M to where the next byte will be stored. D is the current context hash, which points into H. The OUT instruction outputs the contents of A to the decompressed output file. On input to the PCOMP program, A is the decoded byte from the decompressor (prior to LZP decoding), or 2^32 − 1 at the end of a segment. The program is called once for each input byte.
The configuration bwt_j3 (code not shown) by Jan Ondrus pre-processes the input with a Burrows-Wheeler transform (BWT). The input is sorted by context, which removes any high order statistics. The post-processor performs an inverse transform. The transformed data is compressed using a fast adapting order 0-1-2-4 ICM-ISSE chain followed by an order 0 MIX and an order 0 SSE adaptively bypassed by a MIX2. The model is based on the BBB compressor.
BWT normally divides the input into blocks and uses 5 times the block size in memory to compress and decompress. A second configuration, bwt_slowmodel, uses a slower but more memory efficient algorithm modeled on a similar algorithm used in BBB. Blocks are compressed using 1.25 times the block size by sorting in smaller blocks to temporary files and then merging. The inverse transform, instead of building and traversing a linked list, uses a smaller index to find the approximate location of the next decoded byte, followed by a final linear search. By conserving memory, larger blocks can be used, which improves compression.
The configuration jpg_test2 by Jan Ondrus compresses JPEG files. JPEG is already compressed, but it is possible to compress it further by encoding the Huffman coded coefficients using a better model. jpg_test2 uses a preprocessor to expand the packed Huffman codes to whole bytes or byte pairs. Then it uses a 32 component model to re-compress the codes. The model consists of a CONST and 22 ICM and slow adapting CM components, followed by a hierarchy of 5 mixers and a 2 stage SSE chain with partial adaptive (MIX2) bypass.
The configuration bmp_j4 by Jan Ondrus compresses .BMP bitmapped image files. The model is based on a similar model in paq8px_v64, which is slow but has a good compression ratio. It first uses a preprocessor to implement a color transform: (R, G, B) -> (G, R-G, B-G) on the image data following the 54 byte header. The context model reads the header to get the image width and then computes a slow adapting context mixing model with 33 components using various combinations of neighboring pixels as context. Except for the contexts, the model is similar to jpg_test2, with one CONST, 17 slow adapting CM, 6 ICM, and a hierarchy of 5 mixers and a 2 stage SSE chain with partial adaptive bypass.
The configuration exe_j1 by Jan Ondrus is optimized to compress 32 bit x86 executable code (.exe and .dll files). It uses a preprocessor to implement an E8E9 transform. The transform converts relative CALL and JUMP (opcode 0xE8 and 0xE9) addresses to absolute addresses. Because programs often make several calls or jumps to the same target, converting to absolute addresses can increase redundancy. The transformed data is then compressed using the MAX configuration described above.
All component predictions are represented in the logistic domain as 12 bit fixed point numbers with 6 bits after the decimal point, i.e. in the range -32 to +32 with 1/64 resolution. When a prediction is calculated outside of this range, it is clamped to the nearest representable value.
When a prediction is converted to the linear domain for error estimation or arithmetic coding, it is represented as a 15 bit unsigned integer, an odd multiple of 2^−16 in the range 2^−16 to 1 − 2^−16.
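These conversions are conventionally called squash (logistic to linear) and stretch (its inverse) in the PAQ family. A floating point sketch consistent with the ranges above (real implementations use fixed point lookup tables; the exact rounding here is illustrative):

```cpp
#include <algorithm>
#include <cmath>

// squash: a stretched (logistic) value t, a 12 bit fixed point number
// representing t/64 in (-32, +32), to a linear probability in (0, 1).
double squash(int t) { return 1.0 / (1.0 + std::exp(-t / 64.0)); }

// stretch: inverse of squash, clamped to the representable 12 bit range.
int stretch(double p) {
  int t = (int)std::lround(64.0 * std::log(p / (1.0 - p)));
  return std::max(-2047, std::min(2047, t));
}
```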
A ZPAQ model predicts one bit at a time. Each of the components therefore requires an updated context for each bit prediction and update. As a speed optimization, contexts are only updated once per byte by executing the ZPAQL code in the HCOMP section of the block header. This requires that the contexts be further updated after each bit. That is done using a fixed algorithm that depends on the type of component.
For a MIX, MIX2, or SSE, the bitwise context, which depends on the last 0 to 7 bits since the last byte boundary, is added to the bytewise context computed by the HCOMP program. The addition is modulo the size of the component. Since the component size is always a power of 2, this is equivalent to dropping the high order bits of the sum. The bitwise context is a 1 bit followed by the 0 to 7 bits since the last byte boundary. Thus, it ranges from 1 to 255.
A low order context is usually computed such that the low 8 bits are 0 to leave room for adding the bitwise context. A higher order context is usually a hash using all available bits.
For a CM, the current byte is divided into two 4-bit nibbles, and a leading 1 bit is combined with them to form a 9 bit bitwise context, as shown in the table below. Then the bitwise context is exclusive-ored with the bytewise context (usually a hash). The purpose is to preserve cache locality for better speed. Normally a CM is a table of 4 byte entries (a 22 bit prediction and 10 bit count packed into a 32 bit integer) aligned on a 64 byte boundary. On many processors, a cache line is 64 bytes. Combining the bitwise contexts this way guarantees that 4 consecutive table lookups address the same cache line, resulting in at most 2 cache misses per byte. This method does not help with a MIX, MIX2, or SSE because the combined context normally indexes array elements that are too large to pack into cache lines.
Bits | Direct | Cache aligned |
---|---|---|
0 (empty) | 00000001 | 000000001 |
1 x | 0000001x | 00000001x |
2 xx | 000001xx | 0000001xx |
3 xxx | 00001xxx | 000001xxx |
4 xxxx | 0001xxxx | 1xxxx0001 |
5 xxxxx | 001xxxxx | 1xxxx001x |
6 xxxxxx | 01xxxxxx | 1xxxx01xx |
7 xxxxxxx | 1xxxxxxx | 1xxxx1xxx |
A low order context is usually set directly such that the low 9 bits are 0. A high order context is usually a hash.
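A minimal sketch of the cache aligned form from the table above, with invented names; how the previously seen bits are packed into `bits` is an assumption of this example:

```cpp
#include <cstdint>

// bits holds the 0..7 bits seen since the last byte boundary
// (most recent bit lowest); n is how many have been seen.
uint32_t cache_aligned_ctx(uint32_t bits, int n) {
  if (n < 4)  // first nibble: a 1 bit followed by the bits seen so far
    return (1u << n) | bits;
  // Second nibble: a 1 bit, the completed first nibble, then 1 plus the
  // 0..3 bits of the current nibble. With 4 byte entries this keeps 4
  // successive lookups inside one 64 byte cache line.
  uint32_t nib0 = bits >> (n - 4);              // the first 4 bits
  uint32_t rest = bits & ((1u << (n - 4)) - 1); // bits of current nibble
  return (1u << 8) | (nib0 << 4) | (1u << (n - 4)) | rest;
}
// The result is exclusive-ored with the bytewise context (usually a
// hash) to index the CM table.
```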
For an ICM or ISSE, a context addresses a bit history which occupies one byte in a hash table. The context is normally a hash which is updated on nibble boundaries. The hash indexes a 16 byte array containing a 1 byte checksum (used to detect hash collisions) and 15 one-byte bit histories. The hash table is looked up every 4 bits. Subsequent bitwise contexts are then looked up within the 15 byte array of bit histories. This secondary index is calculated by prefixing a 1 bit to the 0 to 3 bits that follow the nibble boundary, giving an index in the range 1 to 15. The primary hash index, computed every 4 bits, is equal to (bytewise context) + 16*(direct bitwise context).
To preserve cache locality, the hash table can be aligned on a 64 byte boundary. A 32-bit hash index is computed by the HCOMP code. The low bits (depending on size) are used to index the hash table, and the next higher 8 bits are used as a checksum confirmation. A table element is found by comparing the checksum at 3 adjacent locations and picking the first match. The 3 indexes are computed by exclusive-or'ing the hash with 0, 1, and 2. This ensures that the three lookups stay within a cache line.
If no checksum match is found, then the element with the lowest bit history count (n0 + n1) in the first (nibble aligned) context is replaced. This implements a least frequently used cache replacement policy.
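A hedged sketch of this lookup and replacement policy follows; the row layout matches the description above, while the function names and the priority stand-in are invented for illustration:

```cpp
#include <cstdint>
#include <cstring>

struct Row {
  uint8_t check;     // 8 bit hash confirmation
  uint8_t hist[15];  // bit histories for the 15 contexts in a nibble
};

// Placeholder: the real priority is the n0+n1 count of the first
// (nibble aligned) bit history, decoded from the state byte.
static int priority(uint8_t state) { return state; }

Row* find(Row* table, int sizebits, uint32_t hash) {
  uint32_t idx = hash & ((1u << sizebits) - 1);  // low bits index the table
  uint8_t chk = (hash >> sizebits) & 0xFF;       // next 8 bits confirm
  // Probe 3 adjacent slots: idx^0, idx^1, idx^2 share one 64 byte
  // cache line when rows are 16 bytes and the table is 64 byte aligned.
  for (uint32_t i = 0; i < 3; ++i)
    if (table[idx ^ i].check == chk) return &table[idx ^ i];
  // No match: evict the least frequently used of the 3 candidates.
  Row* victim = &table[idx];
  for (uint32_t i = 1; i < 3; ++i)
    if (priority(table[idx ^ i].hist[0]) < priority(victim->hist[0]))
      victim = &table[idx ^ i];
  std::memset(victim, 0, sizeof(Row));
  victim->check = chk;
  return victim;
}
```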
A context model is a table that maps a context to a 22 bit linear prediction p (range 0 to 1 with precision 2^−22) and a 10 bit count n (range 0 to 1023). Initially, p = 1/2 and n = 0. On update with bit y (0 or 1), the prediction error y − p is calculated, and the state is updated:

p := p + (y − p) / (n + 1.5)
n := min(n + 1, limit)
where the limit is specified by the user. Usually a higher limit compresses better for stationary sources and a lower limit is better for fast adaption to changing statistics.
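In code, the update might look like the following sketch; the 32 bit packing of p and n is as described above, but the fixed point rounding details are illustrative:

```cpp
#include <cstdint>

// One CM table entry packs a 22 bit prediction and a 10 bit count.
void cm_update(uint32_t& entry, int y, uint32_t limit) {
  uint32_t p = entry >> 10;  // prediction: p/2^22 in (0, 1)
  uint32_t n = entry & 1023; // count: 0..1023
  int err = (y << 22) - (int)p;                    // y - p, scaled by 2^22
  p = (uint32_t)((int)p + (int)(err / (n + 1.5))); // p += (y-p)/(n+1.5)
  if (n < limit) ++n;                              // n := min(n+1, limit)
  entry = (p << 10) | n;
}
```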
An ICM or ISSE maps a context to a bit history through a hash table as described above. In an ICM, the bit history is mapped to a prediction using a CM with a maximum limit of 1023. In an ISSE, the bit history is used to select a weight table for a 2 input MIX. The MIX takes a prediction from another component and mixes it with a CONST prediction of 1 in the logistic domain. The effect in either case is to model both stationary and nonstationary sources effectively.
For example, suppose we observe a certain context 8 times and record the sequence 00000001. Now we observe the context again and wish to predict the next bit. If the source were stationary, we would assume that P(1) = 1/8. If it were nonstationary, then we would give greater weight to recent events and estimate P(1) > 1/8. If the source were random data, then P(1) = 1/2. An indirect model answers the question adaptively by observing what happened when the same bit history was observed in other contexts, by counting how many times a 0 or 1 was observed in these cases.
A bit history is a tuple (n0, n1, LB) representing a count of 0 and 1 bits and the value of the last bit, respectively. Thus, the sequence 00000001 would be represented as (7, 1, 1). To fit a bit history into a single byte (256 possible values), the values are restricted as indicated in the table below. The restrictions are symmetric with respect to n0 and n1. Furthermore, LB is represented only when 1 ≤ n0 + n1 ≤ 17.
n0 | n1 |
---|---|
0 | 0..20 |
1 | 0..48 |
2 | 0..15 |
3 | 0..8 |
4 | 0..6 |
5 | 0..5 |
6 | 0..4 |
7..8 | 0..3 |
9..15 | 0..2 |
16..20 | 0..1 |
21..48 | 1 |
The update rule is to set LB to the most recent bit and increment the appropriate count. If the count exceeds the bounds above, the state moves to the nearest state that approximately preserves the ratio of n0 to n1. For example, a 1 bit in state (5, 5, 0) would go to (5, 6, 1), but since this exceeds the bounds, the next state is rounded to (4, 5, 1) instead.
Furthermore, when one of the counts is large and the opposite bit is observed, a portion of the larger count is first discarded. The rule is that 6 or 7 is decremented and larger counts are reduced to 7. For example if state (20, 0, 0) is updated by a sequence of 1 bits, then the following states would be (7, 1, 1), (6, 2, 1), (5, 3, 1), (5, 4, 1), (5, 5, 1), (4, 5, 1). The effect is to better model sources that are likely to be nonstationary.
An ICM or ISSE effectively models a wide range of sources, but for stationary sources a CM with a high limit still gives the best results. An ISSE is usually used in a chain of contexts of increasing order, starting with a low order CM or ICM. An ISSE learns to adjust its input prediction, i.e. it learns the weights w1 and w2 such that for input prediction p (in the logistic domain), w1 + w2p gives the best possible prediction.
An AVG averages two predictions in the logistic domain: p = w1p1 + w2p2, where the weights are selected by the user and constrained to add to 1.
A MIX2 allows the weights to adjust, but they are still constrained to add to 1. A weight is represented as an unsigned 16 bit fixed point integer in the range 0 to 1. The initial weights are 1/2.
A MIX allows any number of inputs and does not constrain the weights to add to 1. A weight is represented as a signed 20 bit fixed point number with 16 bits after the decimal point, i.e. a range of −8 to +8 with precision 2^−16. Initially the weights are equal and add to 1. The update for each weight is:

wi := wi + η pi (y − p)
where η is the user specified learning rate, p is the output prediction, y is the actual value of the predicted bit, and pi is the input prediction in the logistic domain. Effectively, a MIX is a simple 2-layer neural network such that the weights converge by gradient descent to minimize coding cost.
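A floating point sketch of a MIX (ZPAQ itself uses the fixed point weight representation described above; the names here are invented):

```cpp
#include <cstddef>
#include <vector>

struct Mix {
  std::vector<double> w;  // one weight per input component

  // Weighted sum of input predictions, all in the logistic domain.
  double predict(const std::vector<double>& p) const {
    double t = 0;
    for (std::size_t i = 0; i < w.size(); ++i) t += w[i] * p[i];
    return t;  // squashed to a linear probability before coding
  }

  // After coding bit y with squashed output prediction pr in (0,1):
  // w_i := w_i + eta * p_i * (y - p), gradient descent on coding cost.
  void update(const std::vector<double>& p, int y, double pr, double eta) {
    double err = (y - pr) * eta;
    for (std::size_t i = 0; i < w.size(); ++i) w[i] += p[i] * err;
  }
};
```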
An SSE inputs a prediction and a context (like an ISSE) and outputs a new prediction. The stretched (logistic domain) prediction is quantized to one of 32 levels in the range -16 to 16 in steps of 1/2 to select two adjacent entries from a CM-like table (a 22 bit prediction and 10 bit count). The prediction is interpolated and only the closer of the two table entries is updated as with a CM. The limit is selectable as with a CM.
A MATCH model predicts a bit by looking up the last point in a history buffer with the same context hash and predicting the next bit that follows. Once a match is found, the length of the match (in bytes) is maintained until a bit is predicted incorrectly. Once a prediction fails, remaining predictions in the current byte are 1/2 (0 in the logistic domain). At the next byte boundary, a new match location is found by computing and looking up the context hash, and the match length is determined by comparing backward from the current and looked-up positions in the history buffer. The context hash table and history buffer are updated after every byte whether or not a lookup occurs.
If the match length is 0 then the prediction is 1/2. Otherwise the next bit is predicted with probability 1 − 1/(16L), where L is the match length in bytes, capped at 255.
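A small sketch of this prediction rule, in the linear domain and floating point for clarity (names invented):

```cpp
// Probability that the next bit is 1, given the bit the match model
// expects (`bit`) and the current match length L in bytes.
double match_predict(int bit, int L) {
  if (L <= 0) return 0.5;                // no match: prediction is 1/2
  if (L > 255) L = 255;                  // L is capped at 255
  double conf = 1.0 - 1.0 / (16.0 * L);  // confidence in the expected bit
  return bit ? conf : 1.0 - conf;
}
```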
ZPAQ specifies only the arithmetic decoder, although the encoder has a similar design. The decoder state is a pair of 32 bit unsigned integers representing a range (LOW, HIGH), initially (1, 2^32 − 1), and a current value CURR, initialized to the first 4 bytes of compressed data. The state is constrained at all times such that LOW ≤ CURR ≤ HIGH and 0 < LOW < HIGH, except at the end of input when CURR = 0 < LOW. The decoding (and encoding) algorithm is designed so that a sequence of 4 zero bytes never appears in the compressed data. This allows a sequence of 4 zeros to mark the end of the compressed data so that streams of compressed and uncompressed data can be freely mixed and the end of the compressed data can be found quickly without decompressing.
The decoding algorithm is as follows. The input is a prediction P, 0 ≤ P < 1, as a 16 bit fixed point integer. The return value is the decoded bit Y, either 0 or 1. All arithmetic is modulo 2^32. input() reads one byte.
```
decode(P)
  MID ← LOW + floor((HIGH − LOW) * P)
  if CURR ≤ MID then Y ← 1, HIGH ← MID
  else Y ← 0, LOW ← MID + 1
  while (LOW >> 24) = (HIGH >> 24) do
    LOW ← LOW * 256
    if LOW = 0 then LOW ← 1
    HIGH ← HIGH * 256 + 255
    CURR ← CURR * 256 + input()
  return Y
```
The corresponding encoding algorithm is as follows. It takes P as before and a bit Y to be encoded. LOW and HIGH are initialized as in decode(). output() writes one byte.
```
encode(P, Y)
  MID ← LOW + floor((HIGH − LOW) * P)
  if Y = 1 then HIGH ← MID
  else LOW ← MID + 1
  while (LOW >> 24) = (HIGH >> 24) do
    output(LOW >> 24)
    LOW ← LOW * 256
    if LOW = 0 then LOW ← 1
    HIGH ← HIGH * 256 + 255
```
Predictions are always encoded with P > 0. After 8 bit predictions, an EOS (end of segment) bit is coded with P = 0. The EOS bit is 1 after the last byte is encoded. In the encoder, this has the effect of setting HIGH ← LOW and flushing 4 bytes to output. After this, 4 zero bytes would be written. Decoding the EOS bit causes those 4 zero bytes to be read into CURR, which would be an invalid state if decoding were to continue.
A ZPAQ decoder has up to two virtual machines called HCOMP and PCOMP. HCOMP is called once for each byte of decoded data and is used to compute the bytewise contexts for the decompression model. PCOMP, which is optional, is called once for each byte of decoded data and once at end of segment (EOS). If present, its output is the decompressed data. Otherwise, the output of the decoder is the decompressed data directly.
A ZPAQL virtual machine has the following state:

- PC, the program counter
- 4 general purpose 32 bit registers: A, B, C, and D
- a 1 bit condition flag F
- 256 additional 32 bit registers: R0 through R255
- H, an array of 32 bit values
- M, an array of 8 bit values (bytes)
In HCOMP, the first n elements of H are the contexts for the first n components described in the COMP section. In PCOMP, H has no special meaning. The sizes of H and M are given in the block header. The sizes are powers of 2.
The A register is used as an accumulator. All binary arithmetic and logic operations except assignment take A and one other register or 8 bit immediate value and put the result in A. The other operand may be A, B, C, D, *B, *C, *D, or a constant in the range 0 to 255. Assignment and unary operators may update any of these registers. *B and *C mean the element of M indexed by the low bits of B or C (depending on the size of M). *D means the element of H indexed by the low bits of D.
The binary operators are = += -= *= /= %= &= &~ ^= |= <<= >>=. These have mostly the same meanings as in C. For example, "A+=B" means add B to A. The operator &~ means &=~ as in "A&~B" means "A &= ~B". Division (/=) and mod (%=) by 0 are defined to give 0. Shifts (>>=, <<=) are explicitly defined to use only the low 5 bits of the right operand so that "A<<= 33" is the same as "A<<= 1".
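For an emulator, these edge cases might be implemented as in the following sketch (names invented; the semantics are as stated above):

```cpp
#include <cstdint>

// Division and mod by 0 are defined to give 0.
uint32_t zdiv(uint32_t a, uint32_t b) { return b ? a / b : 0; }
uint32_t zmod(uint32_t a, uint32_t b) { return b ? a % b : 0; }

// Shifts use only the low 5 bits of the right operand,
// so a shift count of 33 behaves like a shift count of 1.
uint32_t zshl(uint32_t a, uint32_t b) { return a << (b & 31); }
uint32_t zshr(uint32_t a, uint32_t b) { return a >> (b & 31); }
```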
Unary operators and assignment may put the result in A, B, C, D, *B, *C, or *D. The unary operators are ++ (increment), -- (decrement), ! (bitwise complement), =0 (clear), and <>A (swap with A, defined for operands other than A). Assignment (=) copies between any of these operands or from an 8 bit constant.
Comparison operators are ==, <, and >. Comparison is unsigned. The left operand must be A, as with other binary operators. The result is put in F: 1 if true or 0 if false. Other instructions include the conditional jumps JT n and JF n (jump if F is true or false), the unconditional jump JMP n, the 3 byte long jump LJ, transfers between A and the 256 R registers (R=A n and A=R n), HASH and HASHD for computing context hashes, OUT, HALT, and ERROR. n is a one byte unsigned operand (0..255), or a signed operand (−128..127) in the case of the short jump instructions.
Most opcodes are one byte. Opcodes that take a numeric argument are 2 bytes except for the LJ instruction, which is 3 bytes.
The JT, JF, and JMP instruction operands are relative to the next instruction. Thus, JMP 0 simply advances to the next instruction and JMP -2 is an infinite loop. Offsets are in bytes, not instructions.
The OUT instruction outputs the low 8 bits of A to the decompressed data stream in PCOMP. In HCOMP it has no effect.
HCOMP and PCOMP are each called once per encoded or decoded byte with that input byte in the A register and PC = 0 (the first instruction). PCOMP is called once more at end of segment with A = 2^32 − 1. All other machine state is preserved between calls. Before the first call, all state is initialized to 0. It is an error to execute an undefined opcode or ERROR, or for PC to go outside the range of the program.
In a ZPAQ block, HCOMP is stored uncompressed. PCOMP, if present, is compressed at the beginning of the first segment. If it is absent, a single 0 byte is compressed to indicate so; otherwise a 1 byte, a 2 byte length, and the PCOMP code are compressed. The sizes of H and M for both models are stored uncompressed in the block header so that archivers can determine memory requirements without decompressing.
The following programs are ZPAQ level 1 compliant:

- unzpaq, the reference decoder
- zpaq, an archiver and compression development tool
- zp, an archiver with three built-in compression levels (fast, mid, and max)
- zpipe, a streaming compressor using the mid configuration
Only zpaq reads configuration files. Newer versions support if-else-endif and do-while statements, which the program translates into jump instructions. It supports up to 9 numeric arguments which can be added to values in the code. Code may contain comments (in parentheses), is free form, and is not case sensitive. To make opcode lengths explicit, each instruction is written so that its byte count is visible: a 1 byte instruction such as "A=B" is written without spaces, while the operand of a 2 byte instruction is written as a separate token, as in "A= 123". The spacing must be written exactly as shown to emphasize that 123 is the second byte.
zpaq, zp, and zpipe are implemented using libzpaq, a public domain library API written in C++. The library supplies functions to read and write ZPAQ byte streams, which can be files or data structures in memory.