Data compression

From Wikipedia, the free encyclopedia

In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. For example, this article could be encoded with fewer bits if one were to accept the convention that the word "compression" be encoded as "comp". One popular instance of compression with which many computer users are familiar is the ZIP file format, which, as well as providing compression, acts as an archiver, storing many files in a single output file.

As is the case with any form of communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.

Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be viewed (or heard), and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it's being decompressed (you always have the option of decompressing the video in full before you watch it, but this is inconvenient and requires storage space to put the uncompressed video). The design of data compression schemes therefore involve trade-offs between various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and uncompress the data.

Contents

[edit] Lossless vs. lossy compression

Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely, but nevertheless perfectly. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.

Another kind of compression, called lossy data compression, is possible if some loss of fidelity is acceptable. For example, a person viewing a picture or television video scene might not notice if some of its finest details are removed or not represented perfectly (i.e. may not even notice compression artifacts). Similarly, two clips of audio may be perceived as the same to a listener even though one is missing details found in the other. Lossy data compression algorithms introduce relatively minor differences and represent the picture, video, or audio using fewer bits.

Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.

However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress encrypted data.

In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, which for example always removes the last byte of a file, will always compress a file up to the point where it is empty.

A good example of lossless vs. lossy compression is the following string -- 888883333333. What you just saw was the string written in an uncompressed form. However, you could save space by writing it 8[5]3[7]. By saying "5 eights, 7 threes", you still have the original string, just written in a smaller form. In a lossy system, using 83 instead, you cannot get the original data back (at the benefit of a smaller filesize).

[edit] Applications

One very simple means of compression is run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and length of the run. This is an example of lossless data compression. It is often used to optimize disk space on office computers, or better use the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).

For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of the limitations of the human sensory system, a great deal of space can be saved while producing an output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way tradeoff between compression speed, compressed data size and quality loss.

Lossy image compression is used in digital cameras, greatly increasing their storage capacities while hardly degrading picture quality at all. Similarly, DVDs use the lossy MPEG-2 codec for video compression.

In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline than "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony for example, while audio compression is used for CD ripping and is decoded by MP3 players.

[edit] Theory

The theoretical background of compression is provided by information theory (which is closely related to algorithmic information theory) and by rate-distortion theory. These fields of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Doyle and Carlson (2000) wrote that data compression "has one of the simplest and most elegant design theories in all of engineering". Cryptography and coding theory are also closely related. The idea of data compression is deeply connected with statistical inference.

Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.

The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, although compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). A current LZ based coding scheme that performs well is LZX, used in Microsoft's CAB format.

The very best compressors use probabilistic models whose predictions are coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard JBIG, and the document-compression standard DjVu. The text entry system, Dasher, is an inverse-arithmetic-coder.

Matt Mahoney, one of the 3 founders of the Hutter Prize, claims that "Compression is Equivalent to General Intelligence"[1].

[edit] Comparative

Independent comparison of different methods of data compression (Airelle, 2007).

  • Text files, such as htm or txt, can be hard compressed.
  • Some files are already compressed (e.g. mp3 or zip), so the compression rate of such files is poor.
  • To be more representative of the performance, the global score (/20) is calculated with a non-parametric formula after the sum of the ranks (1 to 20) for each of the 20 tested methods.


Comparison of different methods of data compression
Files *.avi *.dll *.doc *.exe *.gif *.htm *.jpg *.mp3 *.mpg *.pdf *.txt *.wav *.zip Notation TOTAL
Number
of
files
16 26 138 24 246 79 44 29 8 36 8 1 19   674
Initial
size
5 261 152 5 254 220 5 254 656 5 254 056 5 246 209 5 261 187 5 246 116 5 250 432 5 257 720 5 257 876 5 253 436 5 256 024 5 262 680   68 315 764
7z 4 524 067 (2) 1 543 179 (3) 147 690 (3) 3 910 541 (3) 4 620 354 (1) 341 996 (4) 4 770 061 (4) 5 053 813 (2) 4 879 067 (5) 4 258 863 (3) 1 270 884 (3) 3 670 225 (5) 5 226 742 (14) 16/20 44 217 482
arj 4 696 659 (9) 2 160 530 (15) 1 018 050 (17) 4 130 505 (11) 4 702 449 (12) 898 370 (17) 4 803 740 (11) 5 108 093 (17) 4 910 699 (16) 4 606 736 (15) 1 875 329 (16) 4 450 535 (12) 5 223 905 (13) 6,1/20 48 585 600
bh 4 703 291 (12) 2 156 986 (12) 1 010 284 (15) 4 128 594 (9) 4 693 021 (9) 889 650 (15) 4 806 914 (13) 5 105 811 (13) 4 904 209 (11) 4 601 545 (13) 1 848 972 (13) 4 451 648 (15) 5 201 639 (4) 7,5/20 48 502 564
bz2 4 720 926 (18) 2 095 832 (7) 573 721 (5) 4 273 885 (18) 4 896 084 (18) 645 243 (5) 4 743 918 (2) 5 069 593 (4) 4 888 293 (7) 4 444 829 (5) 1 531 448 (6) 3 771 508 (7) 5 238 677 (16) 11,7/20 46 893 957
bza 4 639 340 (6) 2 166 940 (17) 987 806 (11) 4 231 254 (17) 4 878 327 (17) 783 188 (8) 4 787 973 (7) 5 076 189 (5) 4 873 810 (2) 4 618 970 (17) 1 516 326 (5) 3 770 938 (6) 5 227 572 (15) 9,8/20 47 558 633
cab 4 701 113 (11) 2 148 386 (10) 893 796 (7) 4 127 044 (8) 4 678 810 (5) 842 129 (10) 4 798 500 (8) 5 099 787 (8) 4 900 314 (10) 4 584 969 (8) 1 846 233 (12) 4 451 857 (18) 5 201 717 (5) 10,8/20 48 274 655
gza 4 703 371 (13) 2 157 116 (13) 1 001 990 (13) 4 126 436 (7) 4 693 136 (10) 874 444 (12) 4 803 739 (10) 5 105 765 (12) 4 904 249 (12) 4 597 720 (11) 1 840 188 (11) 4 451 638 (14) 5 201 436 (3) 9,2/20 48 461 228
j 4 678 506 (8) 1 914 777 (5) 703 722 (6) 4 057 445 (5) 4 681 437 (6) 691 916 (6) 4 805 059 (12) 5 092 070 (7) 4 898 847 (8) 4 326 394 (4) 1 629 228 (8) 3 594 954 (4) 5 215 150 (12) 13/20 46 289 505
jar 4 704 088 (14) 2 158 273 (14) 1 017 205 (16) 4 129 816 (10) 4 705 456 (13) 893 622 (16) 4 809 136 (16) 5 107 254 (15) 4 904 615 (13) 4 603 367 (14) 1 849 394 (14) 4 451 718 (16) 5 202 611 (8) 6,2/20 48 536 555
lha 4 711 090 (16) 2 215 476 (18) 1 020 194 (18) 4 204 071 (15) 4 830 501 (15) 913 845 (18) 4 918 792 (19) 5 206 933 (19) 5 066 716 (19) 4 802 049 (19) 1 895 771 (17) 4 447 253 (10) 5 263 136 (18) 6,7/20 49 495 827
lzh 4 711 090 (16) 2 215 476 (18) 1 066 340 (19) 4 143 461 (14) 4 819 157 (14) 971 166 (19) 4 816 349 (18) 5 107 584 (16) 4 924 974 (18) 4 635 416 (18) 1 945 961 (19) 4 449 756 (11) 5 212 837 (11) 5,3/20 49 019 567
pkz 4 899 083 (20) 2 354 373 (20) 1 173 097 (20) 4 401 289 (20) 5 120 590 (19) 1 018 250 (20) 5 162 114 (20) 5 253 006 (20) 5 203 747 (20) 5 076 577 (20) 2 084 290 (20) 5 027 854 (20) 5 264 213 (19) 0,2/20 52 038 483
rar 4 634 009 (5) 1 693 150 (4) 173 313 (4) 3 948 241 (4) 4 639 881 (4) 318 269 (3) 4 780 095 (6) 5 081 085 (6) 4 887 973 (6) 4 258 775 (2) 1 318 381 (4) 2 657 731 (3) 5 202 579 (7) 15,5/20 43 593 482
rk 4 589 894 (3) 1 474 339 (2) 132 629 (1) 3 866 814 (1) 4 628 017 (3) 257 588 (1) 4 434 701 (1) 5 017 545 (1) 4 787 286 (1) 4 498 992 (6) 1 168 720 (1) 1 659 771 (1) 5 183 337 (1) 18,2/20 41 699 633
rs 4 625 725 (4) 2 137 145 (9) 937 954 (10) 4 221 864 (16) 4 850 493 (16) 768 711 (7) 4 776 635 (5) 5 066 886 (3) 4 878 852 (3) 4 612 537 (16) 1 560 879 (7) 3 804 335 (8) 5 240 116 (17) 10,7/20 47 482 132
sqx 4 662 560 (7) 2 078 866 (6) 991 992 (12) 4 105 933 (6) 4 699 518 (11) 878 469 (14) 4 808 697 (15) 5 102 452 (10) 4 908 341 (14) 4 590 245 (10) 1 836 245 (9) 4 415 575 (9) 5 208 275 (10) 9,8/20 48 287 168
tgz 4 707 481 (15) 2 165 409 (16) 907 006 (8) 4 133 949 (12) 4 684 949 (7) 861 638 (11) 4 807 701 (14) 5 105 913 (14) 4 909 789 (15) 4 588 822 (9) 1 853 650 (15) 4 451 792 (17) 5 202 392 (6) 7,8/20 48 380 491
uha 4 498 275 (1) 1 474 005 (1) 136 880 (2) 3 879 360 (2) 4 625 014 (2) 284 363 (2) 4 760 572 (3) 5 104 837 (11) 4 879 047 (4) 4 237 400 (1) 1 233 812 (2) 2 435 124 (2) 5 187 408 (2) 17,3/20 44 736 097
yz1 4 814 935 (19) 2 128 899 (8) 924 706 (9) 4 279 162 (19) 4 686 669 (8) 804 198 (9) 4 810 966 (17) 5 124 596 (18) 4 922 886 (17) 4 568 274 (7) 1 901 300 (18) 4 561 179 (19) 5 207 874 (9) 6,4/20 48 735 644
zip 4 701 064 (10) 2 155 923 (11) 1 009 814 (14) 4 135 619 (13) 5 270 565 (20) 877 679 (13) 4 799 508 (9) 5 101 205 (9) 4 898 961 (9) 4 599 883 (12) 1 839 080 (10) 4 450 719 (13) 5 264 564 (20) 7,5/20 49 104 584
Intermediate
compressed
size
4 701 089 2 152 155 962 880 4 130 160 4 696 327 851 884 4 803 740 5 103 645 4 902 262 4 593 983 1 839 634 4 448 505 5 210 556   48 519 559
Intermediate
compression
rate
10,6 % 59,0 % 81,7 % 21,4 % 10,5 % 83,8 % 8,4 % 2,8 % 6,8 % 12,6 % 65,0 % 15,4 % 1,0 %   29,0 %

Globally, the three best methods tested are rk, uha and 7z.

[edit] See also

[edit] Data compression topics

[edit] Compression algorithms

[edit] Lossless data compression

[edit] Lossy data compression

[edit] Example implementations

  • DEFLATE (a combination of LZ77 and Huffman coding) – used by ZIP, gzip and PNG files
  • LZMA used by 7-Zip and StuffitX[dubious ]
  • LZO (very fast LZ variation, speed oriented)
  • LZX (an LZ77 family compression algorithm)
  • Unix compress utility (the .Z file format), and GIF use LZW
  • Unix pack utility (the .z file format) used Huffman coding
  • bzip2 (a combination of the Burrows-Wheeler transform and Huffman coding)
  • PAQ (very high compression based on context mixing, but extremely slow; competing in the top of the highest compression competitions)
  • JPEG (image compression using a discrete cosine transform, then quantization, then Huffman coding)
  • MPEG (audio and video compression standards family in wide use, using DCT and motion-compensated prediction for video)
    • MP3 (a part of the MPEG-1 standard for sound and music compression, using subbanding and MDCT, perceptual modeling, quantization, and Huffman coding)
    • AAC (part of the MPEG-2 and MPEG-4 audio coding specifications, using MDCT, perceptual modeling, quantization, and Huffman coding)
  • Vorbis (DCT based AAC-alike audio codec, designed with a focus on avoiding patent encumbrance)
  • JPEG 2000 (image compression using wavelets, then quantization, then entropy coding)
  • TTA (uses linear predictive coding for lossless audio compression)
  • FLAC (linear predictive coding for lossless audio compression)

[edit] Corpora

Data collections, commonly used for comparing compression algorithms.

[edit] External links