Incremental encoding

From Wikipedia, the free encyclopedia

Incremental encoding, also known as front compression or back compression, is a type of delta encoding compression algorithm whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This algorithm is particularly well-suited for compressing sorted data, e.g., a list of words from a dictionary.

For example:

Input	Common prefix	Compressed output
myxa myxophyta myxopod nab nabbed nabbing nabit nabk nabob nacarat nacelle	no preceding word 'myx' 'myxop' no common prefix 'nab' 'nabb' 'nab' 'nab' 'nab' 'na' 'nac'	0 myxa 3 ophyta 5 od 0 nab 3 bed 4 ing 3 it 3 k 3 ob 2 carat 3 elle
64 bytes		46 bytes

Input

Common prefix

Compressed output

myxa
myxophyta
myxopod
nab
nabbed
nabbing
nabit
nabk
nabob
nacarat
nacelle

no preceding word
'myx'
'myxop'
no common prefix
'nab'
'nabb'
'nab'
'nab'
'nab'
'na'
'nac'

0 myxa
3 ophyta
5 od
0 nab
3 bed
4 ing
3 it
3 k
3 ob
2 carat
3 elle

64 bytes

46 bytes

This is used as a starting point by the GNU locate utility, in an index of filenames & directories. There, delta encoding is used on the common prefix length. This means, in an additional step, the 'change' in the common prefix length is used, rather than the common prefix length.

The GNU locate utility further uses bigram encoding to further shorten popular filepath prefixes.

Despite being very simple, incremental encoding can save much space, especially when used before other compressors, such as gzip or bzip2.