rzip

rzip is a data compression program that removes long-distance redundancy from its input and then compresses the result with bzip2.

Compression algorithm

rzip operates in two stages. The first stage finds and encodes large chunks of duplicated data that may be separated by very long distances (up to 900 MB) in the input file. The second stage uses a standard compression algorithm (bzip2) to compress the output of the first stage.
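The two-stage structure can be illustrated with a short sketch. The following Python fragment is not rzip's actual algorithm or file format; it only mimics the overall shape: a first pass replaces large duplicate blocks (found here with a fingerprint table over a fixed 4 KB block size, chosen purely for illustration) with back-references, and a second pass hands the reduced data to bzip2.

import bz2
import hashlib

BLOCK = 4096  # fixed block size for duplicate detection (an illustrative choice)

def stage1_dedup(data: bytes):
    """Replace repeated BLOCK-sized chunks with back-references to their
    first occurrence; return the remaining literal bytes and the references."""
    seen = {}       # chunk fingerprint -> offset of its first occurrence
    refs = []       # (position of duplicate, offset of original)
    literals = []
    for pos in range(0, len(data), BLOCK):
        chunk = data[pos:pos + BLOCK]
        key = hashlib.sha1(chunk).digest()
        if len(chunk) == BLOCK and key in seen:
            refs.append((pos, seen[key]))   # duplicate found, possibly very far back
        else:
            seen.setdefault(key, pos)
            literals.append(chunk)          # keep literal data
    return b"".join(literals), refs

def two_stage_compress(data: bytes):
    reduced, refs = stage1_dedup(data)
    # Stage 2: ordinary bzip2 applied to the already-reduced literal stream.
    return bz2.compress(reduced), refs

if __name__ == "__main__":
    sample = (b"some repeated content " * 5000 + b"a unique tail") * 10
    packed, refs = two_stage_compress(sample)
    print(f"{len(sample)} -> {len(packed)} bytes, {len(refs)} back-references")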

Files that contain long-distance redundancy are common. For example, when compressing a set of home directories, several users may have copies of the same file or of quite similar files. It is also common for a single file to contain large duplicated chunks separated by long distances, such as a PDF file containing repeated copies of the same image. Most compression programs cannot exploit this redundancy and therefore achieve a much lower compression ratio than rzip.

The long-distance matching algorithm used in the first stage is also used in rsync.
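What makes long-distance matching affordable is a rolling checksum of the kind rsync popularized: the checksum of a sliding window can be updated in constant time as the window advances one byte, so candidate matches anywhere in a very large history can be looked up cheaply in a hash table. The sketch below is a simplified weak checksum in the spirit of rsync's; it is not code taken from rzip or rsync.

def weak_checksum(block: bytes) -> tuple[int, int]:
    """Initial checksum of a window (simplified rsync-style weak sum)."""
    a = sum(block) % 65536
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % 65536
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, blocklen: int) -> tuple[int, int]:
    """Slide the window one byte: remove out_byte, append in_byte, in O(1)."""
    a = (a - out_byte + in_byte) % 65536
    b = (b - blocklen * out_byte + a) % 65536
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
n = 8
a, b = weak_checksum(data[0:n])
for i in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[i - 1], data[i - 1 + n], n)
    # the O(1) rolling update matches a full recomputation of the window
    assert (a, b) == weak_checksum(data[i:i + n])
print("rolling update verified at", len(data) - n, "window positions")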

Advantages

The key difference between rzip and other well-known compression programs is its ability to take advantage of very-long-distance redundancy. The deflate algorithm used in gzip has a maximum history buffer of 32 KB, and the Burrows–Wheeler block-sorting algorithm used in bzip2 is limited to 900 KB of history. The history buffer in rzip can be up to 900 MB long, several orders of magnitude larger than that of gzip or bzip2. rzip is also often much faster than bzip2, despite using the bzip2 library as a back end, because it feeds bzip2 data that has already been reduced by the first stage, leaving bzip2 less work to do. A simple comparison (although too small for an authoritative benchmark) can be found in [1]; another is available on the rzip web page.
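The effect of the history-buffer size can be seen with a small, informal experiment using only the Python standard library (the exact sizes printed will vary). Two identical, incompressible 100 KB blocks placed back to back lie about 100 KB apart: beyond deflate's 32 KB window, but inside bzip2's 900 KB block, and trivially within reach of a long-range first pass that keeps only one copy.

import os
import zlib
import bz2

# An incompressible 100 KB block followed by an exact copy of itself; the
# second copy starts 100 KB after the first, far beyond deflate's 32 KB window.
block = os.urandom(100 * 1024)
data = block + block

print("original size       :", len(data))
print("zlib  (32 KB window):", len(zlib.compress(data, 9)))
print("bz2   (900 KB block):", len(bz2.compress(data, 9)))

# A long-range pre-pass in the spirit of rzip's first stage: detect that the
# second half duplicates the first and keep only one literal copy.  A real
# container format would also record a back-reference so the duplicate
# could be reconstructed.
if data[:len(block)] == data[len(block):]:
    reduced = data[:len(block)]
else:
    reduced = data
print("dedup then bz2      :", len(bz2.compress(reduced, 9)))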

Disadvantages

rzip is not suited to every purpose. Its two biggest disadvantages are that it cannot be pipelined (it cannot read from standard input or write to standard output) and that it uses a large amount of memory: a typical compression run on a large file may use hundreds of megabytes of RAM. If there is plenty of RAM to spare and a very high compression ratio is required, rzip is a good choice; otherwise, less memory-intensive methods such as gzip and bzip2 should be used instead.

History

rzip was originally written by Andrew Tridgell as part of his PhD research.
