Talk:Data compression
Cleanup
This page is getting a bit long, and I think that some of the lists could use cleaning up. Zack3rdbb 04:50, 22 December 2006 (UTC)
I've again broken the algorithms into a different list from the implementations. I did this a few years back when I wrote part of the LZ and Huffman pages, but it was reverted back then by someone who seemed unable to tell the difference between an algorithm and an implementation. I also agree about moving the algorithms into the lossy/lossless pages. If my current change sticks I will do that too. Jrlinton
Where should a page referencing lossless compression link? Lossless data compression is more on-topic, but has less information (e.g. about practical algorithms) than data compression.
IMHO data compression should convey the basic concept, discuss the difference between lossy & lossless, and leave it at that. In particular, the algorithm list and discussion should move to the appropriate subtopic. --Robbe
How are we defining "obsolete" here? --Alan D
There's a really detailed public domain article on data compression at http://www.vectorsite.net/ttdcmp1.html#m6
It would be great if someone could expand our articles using this material (unfortunately, I don't really have a clue about this subject!) Enchanter
--- I agree that this page should be about data compression, not about hundreds of different implementations of the half dozen basic algorithms. The discussion of MPEG, DEFLATE, PKZIP, etc. should not be on this page, since they are implementations of one or more of the basic algorithms (LZ77, DCT, etc.)
>> I find it disturbing that a contributor to this page claims to have no knowledge of the subject!
Lossy compression on CD-ROM?
In particular, compression of images and sounds can take advantage of limitations of the human sensory system to compress data in ways that are lossy, but nearly indistinguishable from the original by the human eye. This is used on CD-ROM and DVD, as well as in digital cameras.
This is highly suspect. DVD, yes: the video is encoded in a lossy format. Digital cameras, yes: pictures are recorded into lossy formats. CD-ROMs, however, merely store the information on them, and in fact, not only is the information not reduced by lossy means, but redundancy in the form of error-correcting codes is added to safeguard the information that's there. Even if "audio CDs" are meant rather than CD-ROMs, it's still a stretch; sure, the audio is quantized before it's recorded onto the master, but no subsequent compression, lossless or lossy, takes place after quantization. -- Antaeus Feldspar 17:32, 23 Sep 2004 (UTC)
I agree. It would be better to say that lossy compression is used mainly for images, video and audio. --Matt Mahoney 01:17, 25 Jan 2005 (UTC)
circular links
Self-extraction points to its disambiguation page, which points to uncompression, which returns to this article, which has no information about uncompression.
Perhaps someone "in the know" could do a bit on uncompression (decompression?).--suburbancow 02:14, 26 January 2006 (UTC)
failure to compress
I find this text from the article problematic:
"indeed, any compression algorithm will necessarily fail to compress at least 99.609 375 % (255/256) of all files of a given length."
Clearly it cannot be true that the algs fail to compress 99.6% of the time. I think the intent was to say that any lossless alg will fail to compress on some files of any given length, but the numbers don't make sense. Also, the percentage of files that would fail would definitely be a function of the length of the files. Kryzx 15:41, 29 March 2006 (UTC)
- I agree; I can't really guess what that sentence is trying to say. Only lossless compression has the possibility of failing to reduce file size, and that doesn't happen that often. Algr 17:44, 30 March 2006 (UTC)
- Actually it's true that the vast majority of possible files will get bigger during lossless compression. It's just that the vast majority of possible files are useless random byte sequences, and we seldom try to compress those -- so it seems like compression algorithms almost always make things smaller. You can trivially prove this using the pigeonhole principle:
- There are exactly 256^100 possible 100-byte files.
- A reduction in size will yield a file between 1 and 99 bytes long.
- There are roughly 256^99 possible files between 1 and 99 bytes long.
- Each compressed file uniquely maps to a decompressed file -- that is, you can't (correctly) compress two different files and end up with the same compressed file.
- 256^100 / roughly 256^99 = roughly 256.
- That means that, in order to compress the file by at least one byte, we are reducing the number of possible byte sequences down to roughly 1 in 256 of the original possibilities, or about 0.4%. Only about 0.4% of the input sequences will therefore be able to be compressed.
- The other 99.6% of the byte sequences will not map to a shorter sequence. Ergo, they must map to an equal or greater length sequence.
- So, it is absolutely true that compression algorithms only work 0.4% of the time. BUT, it only applies when you are talking about compressing "all possible byte sequences of a given length". Nothing we ever produce is actually that random, which is why in practice compression algorithms do indeed work. Additionally, by adding even a single byte to the size you multiply your possible sequences by a factor of 256 -- meaning that while only 0.4% of the sequences can be compressed, the remaining 99.6% need not (in theory) grow by more than one byte. Egomaniac 13:54, 20 June 2007 (UTC)
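The counting argument above is easy to check directly. Below is a minimal sketch (Python; the file length n and the variable names are just for illustration) that computes how many n-byte files could possibly shrink under any lossless scheme:

```python
# Pigeonhole sanity check: compare the number of n-byte files with the number
# of strictly shorter (non-empty) files they could be losslessly mapped onto.
n = 100                        # same 100-byte example as above
files_of_length_n = 256 ** n   # each of the n bytes has 256 possible values

# All non-empty files shorter than n bytes: 256^1 + 256^2 + ... + 256^(n-1)
shorter_files = sum(256 ** k for k in range(1, n))

# A lossless compressor must map distinct inputs to distinct outputs, so at
# most `shorter_files` of the n-byte inputs can get smaller at all.
compressible = shorter_files / files_of_length_n
print(f"at most {compressible:.4%} of {n}-byte files can shrink")
print(f"so at least {1 - compressible:.4%} must stay the same size or grow")
```

For n = 100 this prints roughly 0.39% and 99.61%, in line with the roughly 99.6% figure discussed above.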
Source vs Entropy Coding
I have it on reliable authority that Source Coding and Entropy Coding are different sides of the topic of data compression.
Source coding is lossy, entropy coding is lossless - at least, if I answered that way in the exam I'm about to sit, I would get the marks allocated to the question.
My lecturers describe a two-phase compression scheme such as JPEG as involving a source coding phase first - effectively reducing the number of bits required to store the data by finding another representation (such as a DCT transform, which represents the majority of the information in a few terms, leaving the rest nearly zero). The second phase is to take this alternative representation and encode it using an entropy coding scheme (i.e. Huffman and/or RLE). The redundancy exposed by the first phase makes the data more amenable to efficient entropy coding.
They're fairly specific about the difference (to the point of broad hints about the exam paper) - why is this not made clearer in the text?
Another point regarding source coding is that the DCT is COMPLETELY reversible without loss of data. The loss only occurs when the DCT transformed image cell is quantised - this is normally done by point-wise division of the transformed cell with a quantisation matrix, followed by a rounding of the values to integers. This quantisation and subsequent rounding is where the loss occurs. Because the majority of the information is contained in a few terms, the reverse process can "fairly faithfully" reconstruct the original "shape" of the intensity information represented within the cell. The amount of quantisation applied dictates the quality of the decompressed image data.
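The reversible-transform-versus-lossy-quantisation point can be checked numerically. Here is a minimal sketch (Python with NumPy/SciPy; the single uniform quantisation step q is an illustrative stand-in for a real JPEG quantisation matrix):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    # 2-D type-II DCT with orthonormal scaling (rows, then columns)
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    # 2-D inverse DCT; undoes dct2 exactly, up to floating-point error
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

rng = np.random.default_rng(0)
cell = rng.integers(0, 256, size=(8, 8)).astype(float)   # one 8x8 image cell

# 1. Transform only: completely reversible, nothing is lost
coeffs = dct2(cell)
print(np.allclose(idct2(coeffs), cell))        # True

# 2. Quantise and round the coefficients: this is where the loss occurs
q = 16.0                                       # illustrative uniform step size
quantised = np.round(coeffs / q)               # pointwise division + rounding
reconstructed = idct2(quantised * q)
print(np.abs(reconstructed - cell).max())      # nonzero: information was discarded
```

A larger q throws away more coefficient precision and compresses harder, which is exactly the quality/size trade-off described above.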
Entropy coding tries to reduce the overall "energy" represented by a data stream by finding lower-energy representations, i.e. Huffman codes give the most common symbols the shortest codes.
Ian (3rd Yr Undergrad - Computer Science)
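To make the Huffman remark concrete, here is a minimal sketch (Python; the helper huffman_code_lengths and the sample string are purely illustrative) that builds a Huffman tree and reports only the resulting code lengths:

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return a symbol -> code-length map; frequent symbols get shorter codes."""
    freq = Counter(data)
    if len(freq) == 1:                      # degenerate single-symbol input
        return {sym: 1 for sym in freq}
    # Heap entries: (total weight, tiebreaker, {symbol: depth so far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # subtree drops one level
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

lengths = huffman_code_lengths(b"this is an example of a huffman tree")
print(sorted(lengths.items(), key=lambda kv: kv[1]))  # shortest codes first
```

The most frequent byte in this sample (the space) ends up with the shortest code, while rare bytes get the longest ones.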
Predictive coding
I see that predictive coding is presented under lossy encoding, but surely it's possible to use prediction for lossless encoding too.
It seems to me that there are two possibilities:
a. It is possible to do perfect predictive coding, and this will be lossless; or
b. Do a fairly accurate predictive coding, then subtract the prediction from the source and encode the residual. If the residual is encoded without loss then the overall encoding is also lossless.
I am fairly sure that for some applications predictive coding schemes which are completely lossless can be constructed. The article could deal with this. David Martland 09:27, 25 August 2006 (UTC)
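For what it's worth, option (b) is straightforward to demonstrate with the simplest possible predictor. A minimal sketch (plain Python; the previous-sample predictor and the toy sample values are just for illustration):

```python
def predictive_encode(samples):
    # Predict each sample as the previous one; store the first sample plus residuals.
    residuals = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        residuals.append(cur - prev)        # residual of a "previous value" predictor
    return residuals

def predictive_decode(residuals):
    # Undo the prediction exactly: integer arithmetic, so no information is lost.
    samples = [residuals[0]]
    for r in residuals[1:]:
        samples.append(samples[-1] + r)
    return samples

audio = [100, 102, 105, 105, 104, 101, 99]
assert predictive_decode(predictive_encode(audio)) == audio   # perfectly lossless
```

Because everything is integer arithmetic, decoding reproduces the input exactly; the residuals cluster around zero for smooth signals, so a following entropy coder can store them compactly. This is the same idea used by lossless audio codecs such as FLAC and by PNG's prediction filters.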
WHAT, NO IMAGES? FAIL! Trigger hurt 15:33, 6 September 2006 (UTC)
compression and encryption
The article currently claims: "Similarly, compressed data can only be understood if the decoding method is known by the receiver. Some compression algorithms exploit this property in order to encrypt data during the compression process."
I'm deleting the second of those sentences, on the basis of:
It is true that some data compression programs also allow people to produce "password-protected archives" (with the notable exception of gzip, as mentioned in the gzip FAQ). But my understanding is that they do not "exploit this property" -- instead, those programs first compress the data, then use an unrelated encryption algorithm that would encrypt plain text just as well as compressed text. If there actually exists some compression+encryption algorithm that "exploits this property", please tell me about it or link to it. --68.0.120.35 01:01, 13 December 2006 (UTC)
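For reference, the "compress first, then encrypt with an unrelated cipher" pipeline described above looks roughly like the sketch below (Python; zlib plus the third-party `cryptography` package's Fernet recipe are only convenient stand-ins, not what any particular archiver actually uses):

```python
import zlib
from cryptography.fernet import Fernet   # third-party package: cryptography

key = Fernet.generate_key()
plaintext = b"highly redundant data " * 200

compressed = zlib.compress(plaintext)          # step 1: generic lossless compression
encrypted = Fernet(key).encrypt(compressed)    # step 2: unrelated encryption step

# Decryption simply reverses the steps; the cipher neither knows nor cares
# that the payload happens to be DEFLATE-compressed data.
restored = zlib.decompress(Fernet(key).decrypt(encrypted))
assert restored == plaintext
```

The order matters: encrypted output looks random, so compressing after encryption gains essentially nothing.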
Source coding merged here
I've redirected "source coding" to here. The term was already bolded as something the article laid claim to. Data compression generally makes up most of source coding, but perhaps a paragraph might now be worthwhile on other aspects of source coding - e.g. the way source coding/data compression may make surprising symbols or runs of data more apparent, which may be interesting from a detection point of view.
Jheald 21:45, 5 March 2007 (UTC)
LZMA in sitx?
I find the assertion that sitx uses LZMA to be extremely dubious. I can't find any other source on the web that supports this claim, and the Stuffit page seems to indicate that Stuffit X uses a variety of other methods. Anybody know where this idea came from?
-Gsnixon 11:29, 6 March 2007 (UTC)
Citations
I was looking through this article for some basic beginner compression information, and although all of it matches up with what I already knew, or at least makes sense, it appears to be drastically in need of citations. I hesitate to tag it as one of my first Wikipedia actions, so I figured I would post first to see what others thought. Jaidan 16:48, 16 March 2007 (UTC)
Comparative
Copied from the French user talk page of the author of the comparison.
Interesting compression format/software comparison. Thanks for sharing your results. However, it is missing something important that I didn't find anywhere... What version of each program did you use for the tests? Specifically, what version of 7z did you test? Thanks again.
- Thanks. I used the latest versions available at the date of this comparison (January 2007), as specified on the second page. For 7-Zip, I used version 4.43. Sorry for my English: I love French so much that I have not had enough time to learn other languages. ;-) Rlwpx 11:04, 21 April 2007 (UTC)
Oops. The article did mention v4.43... I thought I had looked everywhere. Anyway, thanks for the info. One more question though: did you save the test data, and do you plan to keep updating the results as newer versions come out? For example, I know the new 7z 4.45 runs much faster on my dual-core machine (but I have not tested whether it compresses any better). I would be very interested in checking back here to see the progress of the different archivers. Well, thanks again for a well done comparison :)
- Don't worry about the mistake. Of course, I appreciate 7-Zip and I always use the latest version myself. Since I made a first comparison in 2006 and hope to make a new one next year, you will be happy to know that I have kept a copy of all the test data: I shall post the results for 7-Zip 4.45 here in a few days. Sincerely yours. Rlwpx
- Results:
Comparison of compression efficiency (compressed sizes in bytes) between 7-Zip 4.43 (January 2007) and 7-Zip 4.45 (Rlwpx, April 2007):

| | *.avi | *.dll | *.doc | *.exe | *.gif | *.htm | *.jpg | *.mp3 | *.mpg | *.txt | *.wav | *.zip | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7z 4.43 | 4,524,067 | 1,543,179 | 147,690 | 3,910,541 | 4,620,354 | 341,996 | 4,770,061 | 5,053,813 | 4,879,067 | 4,258,863 | 1,270,884 | 3,670,225 | 5,226,742 |
| 7z 4.45 | 4,523,803 | 1,508,356 | 147,655 | 3,900,446 | 4,620,335 | 333,932 | 4,769,937 | 5,053,700 | 4,878,534 | 4,258,895 | 1,377,502 | 3,668,346 | 5,240,542 |
This graph is incredibly difficult to interpret. The meaning of the numbers is ambiguous. Do the results contain file size as a number split up by spaces? Like 5,245,423 = 5 245 423 (split up on different lines?). And what do the numbers in parenthesis give? Units, descriptions? Could we add some information to aid in interpreting what we see?
- Sorry for the spaces: read them as commas (in French, spaces are used as thousands separators where English uses commas). I'll correct this mistake. For example, 5 245 423 means 5,245,423 bytes, i.e. about 5 MB. Numbers in parentheses indicate the rank of the method for that category: for example, 7z is the second best method for compressing avi files, the best is uha (1) and the worst, 20th, is pkz (20). Rlwpx 18:54, 7 June 2007 (UTC)
I have removed this section in its entirety as it fails to differentiate between compression algorithms and compression software (some of which use a variety of different algorithms), and because most of the data sets contain files which are already compressed. A better comparison would plot individual compression algorithms against uncompressed datasets of particular types (e.g., text, executables, raw bitmaps, raw audio). —Psychonaut 22:07, 12 July 2007 (UTC)
- I don't agree with you: first, the table does not present software packages (for example, zip does not mean WinZip) and, second, it is perfectly reasonable to compress already-compressed files: I doubt that there are many people who compress only txt or wav files! Rlwpx 18:23, 14 July 2007 (UTC)
So I can't agree with "Globally, the three best methods tested are rk, uha and 7z." because the table says that the strongest compression is rk (41,699,633) followed by rar (43,593,482) and 7z (44,217,482). Can someone change the text? —Preceding unsigned comment added by 87.160.124.182 (talk) 13:31, August 30, 2007 (UTC)
I think the AVI column should be removed - AVI is a container file for any number of different compression systems. Contents could have been anything from uncompressed video to some form of MPEG or DV. —Preceding unsigned comment added by 82.5.204.14 (talk) 12:14, 9 September 2007 (UTC)
Comparative: copyright?
P.Table (P.T.): PAQ8 (KGB Archiver is a Windows GUI for the older PAQ7) compresses much better than all of these, but because of the table's copyright it can't be copied here.
If this is true, then the table should be removed. Text in Wikipedia is released under the GFDL. It's also not the best table in the world, as it includes formats like "AVI" which are highly codec-dependent. ⇌Elektron 17:10, 27 August 2007 (UTC)
- I've removed the infringing section. The section without the table is reproduced below. If the original contributor is the copyright holder (the domain name of the site seems to suggest it), then it's original research, and also shouldn't be included. The section isn't written very well, the last paragraph seems biased, and there are a bunch of other technical problems with the data (".avi" doesn't really mean anything). Let's hope bots don't think I'm a vandal. ⇌Elektron 16:00, 4 September 2007 (UTC)
Independent comparison of different methods of data compression (Results & Softwares, in French. Airelle, 2007). Numbers in parenthesis are the rank of the method of compression for the category of file specified above.
- Text files, such as .htm or .txt, can be hard compressed.
- Some files are already compressed (e.g. .mp3 or .zip), so the compression rate of such files is poor. Due to the addition of header data, often there are diminishing returns in such compression, causing the file to actually be slightly larger upon storage.
- To be more representative of the performance, the global score (/20) is calculated with a non-parametric formula after the sum of the ranks (1 to 20) for each of the 20 tested methods.
(Missing table here)
P.Table(P.T.): PAQ8 (kgb archiver is Windows GUI of old PAQ7) is much better than this all, but for the copyright of the table it can't be copied.
Globally, the three best methods tested are rk, rar and 7z. WinRK and WinRar are commercial software, 7-zip is free, open source (LGPL licence) and can be used with Linux.
—Preceding unsigned comment added by Elektron (talk • contribs) 16:00, 4 September 2007 (UTC)
- I, Airelle, the author of this comparison, am the original user who added the table to this article. I agree with the GFDL licence. As you wrote, my user name is part of the URL of the external website. Note that I had initially added a simple link to my website, but a user advised me to include the results in the article, so I did.
- As I said, globally the three best methods tested are rk, uha and 7z, because I used a non-parametric method: these three compression formats obtained the best scores (18.2, 17.3 and 16 out of 20). If you consider the total size of the compressed files, the three best formats are rk (41,699,633) followed by rar (43,593,482) and 7z (44,217,482). There is no contradiction: it depends on the criterion used.
- Lastly, forgive my bad English, but that should not be a problem: anyone can correct the text to improve it. It should not be a reason for removal. Rlwpx 10:41, 8 September 2007 (UTC)
- I mainly removed it because of the line but for the copyright of the table it can't be copied (I'm not sure who added that), suggesting that the table is not GFDL. It's also only one person's results (possibly violating NPOV), and since you added it, it counts as original research. It also can't really be a fair comparison without saying what was in the files you compressed. ⇌Elektron 11:50, 8 September 2007 (UTC)
- I don't know who added the line about the copyright. But I do know that I am the author! I could list the files, but it would just be file.ext of x bits, file2.ext of x2 bits, and so on: hmmm... Is it really needed? I think the global size is enough. Rlwpx 12:09, 8 September 2007 (UTC)
- File extensions are at best indicative, but for the most part meaningless (see the AVI comment in the section above). The entropy of a source depends on the source: text compression algorithms are typically tested on something like the Calgary Corpus. For pictures, there are standard test images, and presumably freely available videos in various formats (but if you, say, turn a TIFF into a JFIF, the particular algorithm and settings used matter too). For video/audio formats, you need to specify what codec/version/settings were used, but how much these compress largely doesn't matter: your average MP3 gets transmitted as an MP3 (this is also largely true for images), and your average WAV either stays like that or turns into an MP3/Vorbis/WMA/AAC/FLAC/etc. PDFs can use zlib compression internally, PNGs do, and GIFs use LZW. How much ZIPs compress is almost completely irrelevant (what was in the ZIP?).
- In fact, I mostly see compression used either in torrents (for no good reason, since compressing MP3s/movies doesn't help much and wastes everyone's CPU time), for precompiled programs (no-install-needed ZIPs, compressed disk images, RPM/DEB packages) or source tarballs, and I only bother if it compresses/uncompresses within reasonable time. Your choice of "file formats" introduces bias into the results, and thus any overall result is meaningless. ⇌Elektron 13:45, 9 September 2007 (UTC)
- I understand what you say, but I have seldom seen a comparison with such precise details. I could add, for example, that the .avi files are x mpeg3 and y mpeg2, or that the .zip files contain x' exe and y' txt, but the table would become incomprehensible with this additional information of limited interest. I think the important thing is to use the same data files for each compression method so that one can compare them. Rlwpx 09:54, 15 September 2007 (UTC)
- The results are meaningless unless you state what data you use. Most respectable comparisons of image/text/whatever compression use standard data. You give a file extension, which has absolutely nothing to do with the data inside the file. The results are also huge and don't really belong in the article - I suggest that they at least be moved to a different article. ⇌Elektron 17:01, 15 September 2007 (UTC)
Comparative (again)
This has too many issues. One of them is the private test corpus (mentioned before). Another is that it doesn't list the archivers along with the archive format; it just displays the archive format. The parameters used for compression are also unknown. These are quite important issues, because different archivers can produce the same format but do so differently, and different parameters tend to lead to different ratios as well. This difference, while possibly small, is actually important. Additionally, as mentioned before, what's up with the copyright thing? And while I agree paq8o* probably compresses far better than everything else, unless we have solid evidence we should not publish these things. The copyright issue probably arose because someone wanted to include GFDL-incompatible info in the table, which, unless it counts as fair use, should not even be in Wikipedia. Someone really should fix this.
And, in case you didn't know, rar, a command-line version of WinRAR, is also available as shareware for Linux, and WinRAR works under Wine. 7-Zip is not native to Linux; p7zip is. I'll fix this. --Yugsdrawkcabeht (talk) 15:41, 29 December 2007 (UTC)