Talk:Burrows-Wheeler transform
From Wikipedia, the free encyclopedia
Contents |
[edit] EOF
The text says that EOF is ignored when sorting, but the example seems to suggest that EOF is considered to come after all normal letters. AxelBoldt 14:14 Aug 26, 2002 (PDT)
The dot wasn't an EOF. It was just a dot that was there to make the rotations stand out visually. Hopefully it's clearer now. --LC 08:06 Aug 27, 2002 (PDT)
Obviously sufficiently many repeats of the transform must eventually restore the original text, and obviously that provides another (usually inefficient?) way of inverting a transform (just store the inputs temporarily, and give back the last input before the text to be inverted turns up again). But one interesting thing is this: since the original is itself the result of a transform on some text, clearly the transform does not necessarily always produce something better suited for compression. It must be something that is statistically likely, based on properties that are common in many inputs. What are these, and what are the odds that a transform won't actually help and might even hurt? If we do get one of these cases, how can we tell (cheaply)? And so on. PML.
How can repeating the transform restore the original text? The transform is a not a bijection between strings of length n. It is a one-to-one function from strings of length n into the set of ordered pairs of a string of length n and an integer less than n. Thus, different inputs can produce the same output string with a different index. For example .BANANA. and ANA..BAN produce the same output string.
[edit] Bad example
This article gives a really bad example for this algorithm. Take the text:
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
and replace all repeated substrings (as deflate would detect them) with a special control character which I'll represent as _
:
SIX.M_ED.P_IES._FT_XTY_.DUS_BOX_
Now take the transformed text:
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT
and do the same:
TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T
Clearly, the original string compresses better. And indeed gzip and bzip2 confirm this: gzip compresses the above to 67 bytes, bzip2 to 69 bytes. (Both can hardly be called compression: the original was only 44 bytes ...) — Timwi 14:14, 9 Nov 2004 (UTC)
That is unfortunately the curse of writing about data compression: in a substantial number of cases, it can be almost impossible to find an example small enough to be practical, and clear enough to show the principle involved, but large enough so that the savings achieved with the method can be immediately seen to be significant. Add in a requirement about 'no other method can do this example better' and I am afraid you're asking for the impossible. IMHO, it's much better to use an example that shows the principle clearly and note in the text if needed that 'in this example, the saving from the run-length encoding doesn't overcome the overhead of the encoding; in a larger example which produced longer runs, it would.' After all, generations of textbooks have used recursive Fibonacci algorithms to demonstrate recursion when that's not the most efficient way to calculate Fibonacci numbers.
By the way, have you noticed that your deflate example is incorrect? Deflate operates on strings of minimum length 3. Even with your simplification of all replaced substrings taking as much space as one literal, here's how it would actually look:
SIX.MIXED.PIXIES.SIFT._TY_.DUST.BOXES
vs.
TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T
Not a stellar improvement. -- Antaeus Feldspar 15:11, 9 Nov 2004 (UTC)
- It doesn't improve deflate, but I think it does improve RLE. My impression was that the point is to create runs. Deco 20:42, 9 Nov 2004 (UTC)
-
- Timwi's point was that in the example given, deflate outperforms BWT. However, his comparison was flawed because he showed deflate getting LZ77 compression from repetition of two-byte sequences. It doesn't; BWT gets compression from repetition of two-byte sequences, but deflate's implementation of LZ77 compression doesn't work on sequences less than three bytes. -- Antaeus Feldspar 02:17, 10 Nov 2004 (UTC)
-
-
- If not deflate, then what does bzip2 apply after the BWT? — Timwi 11:31, 10 Nov 2004 (UTC)
-
-
-
-
- To the best of my knowledge, it used to apply RLE and then arithmetic coding, but because of patent problems, it switched to RLE and then Huffman coding. It wouldn't make sense to apply deflate after the BWT, because while deflate can get compression from repeated byte sequences, long runs of the same character, and relative frequency of some characters over others, the BWT is going to remove almost every instance of repeated byte sequences by re-ordering the characters so that they lengthen runs, instead. Deflate would still be able to work on the runs, but it wouldn't do so as effectively as an encoding step that works only on runs.
-
-
-
-
-
- It may help to remember that lossless compression can't create compression from nothing; it can only remove redundancy that already exists. The following table shows which compression methods exploit which kind of redundancy. (Remember that Deflate is LZ77 coding followed by Huffman coding.)
-
-
LZ77 | BWT | RLE | Huffman/Arithmetic | |
---|---|---|---|---|
repeated multi-byte sequences | yes, at cost of two parameters (offset, length) | converts into single-character runs | no | no |
single-character runs | yes, at cost of two parameters (offset, length) | no | yes, at cost of single parameter (length of run) | no |
non-flat frequency distribution of characters | no | no | no | yes, exploits optimally |
-
-
-
- As you see, BWT destroys one of the two kinds of redundancy LZ77 can work on, and converts it to the other kind; LZ77 can still work on it, but at twice the cost of RLE. So it wouldn't make sense to use deflate after BWT instead of RLE and Huffman. -- Antaeus Feldspar 16:55, 10 Nov 2004 (UTC)
-
-
-
-
-
- In the original paper that Burrows and Wheeler wrote, they suggested that their transform be used in conjuction with a Move to Front encoder and an entropy encoder, such as Huffman; they made no reference to RLE. According to the bzip2 website, RLE is used, but the author indicates that it isn't necessary and is only retained because that's how he originally wrote it and he didn't want to create yet another compression format. -- Zawersh 15:05, 14 May 2006 (UTC)
-
-
This whole discussing about a bad example is rather pointless and can be easily solved by quoting burrows and wheeler form there original report (to read it yourself => you can find it at the sources-list in the article) "The size of the input block must be large (a few kilobytes) to achieve good compression." ~ Band B 22:40, 14 February 2007 (UTC)
[edit] C implementation
Is there any reason that the C implementation given in the Polish Wikipedia shouldn't be included here? --Shutranm 21:47, 20 May 2005 (UTC)
[edit] Collation issues
Nobody reading this page cares about Posix collation. Why don't we pick a sample string with no such issues? --P3d0 17:45, 16 January 2006 (UTC)
- How do you get from "I don't care" to "nobody cares"? People reading this page quite clearly do care about POSIX collation, or there wouldn't be a mention of it in the first place. I certainly care, and I'm equally certainly reading this page. The point that BWT is sensitive to collation issues is worth making, IMO. — Haeleth Talk 20:19, 22 March 2006 (UTC)
-
- BWT isn't sensitive to Posix; the sort will work in whatever way you implement the sort. Practically speaking, it's very unlikely anyone will be doing real BWT with Posix C string sorting functions; I found it to be quite a non-sequitur with little or no apparent relation to the subject at hand. The removed section follows: --Brion 03:31, 19 May 2006 (UTC)
==== Note on sorting convention ==== If you sort with [[Posix]] [[collating]], you get the slightly different string TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT instead of TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT [[ISO 8859]] has complex collating rules, but in this case, periods are ignored. Posix collating treats periods as characters.
[edit] Mistake in example?
Could someone check the example, please? I was trying to verify something and my simple quick implementation of BWT outputs
TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
not
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT
as the article writes.
(And note that I intend to at least replace the black dot with another character, or to remove it altogether. It is not a good style to unnecessarily convey meaning only with a color variation, mind e.g. blind people.)
--Mormegil 18:12, 18 June 2006 (UTC)
- My code gives the same output as Mormegil's, not as the article writes.--Thorwald 06:05, 3 August 2006 (UTC)
- Just tried this out, and I get exactly the same as Mormegil and Thorwald. If you try doing it, and use something like Microsoft Excel (yes, you can stop spitting now), you get the following table:
.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST .DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE .MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX .PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY .PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED .SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES .SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST. D.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXE DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE. E.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXI ED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIX ES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXI ESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOX FT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SI IE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIX IES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIX IFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.S IX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESS IXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.M IXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.P IXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.P IXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.S MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX. OXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.B PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY. PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED. S.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIE SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES. SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT. SSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXE ST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DU T.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUS T.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIF TY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIX UST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.D X.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSI XED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MI XESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BO XIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PI XIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PI XTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SI Y.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXT
Which gives "TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT"
Which is understandable, as a space is a smaller ASCII value than [A..Z|a..z]
--Ambulnick 10:17, 27 Sept 2006 (GMT)
If you take the C code and run it on ".BANANA" instead of "Polska Wikipedia", you don't get the same answer as the article gives: BNN.AA@A, you get ANNB.AA. That's because "." sorts before capital letters in ascii, not after, as the example shows. Changing "." to "x" makes the example work. If someone else will verify that they see the same problem, I'll fix the example. But it's late at night -- I may just be confusing myself :-)
Cbogart2 06:04, 18 December 2006 (UTC)
[edit] compression?
Article says:
The Burrows-Wheeler transform (BWT, also called block-sorting compression)
Technically, BWT is not a compression. Should it be noted? It leads to wrong understanding of the subject —lim 16:54, 16 May 2007 (UTC)
[edit] Bad inefficiency in the example implementation
Putting aside the issue that this is a really horrible example of program code, there's a major inefficiency in it: In the function rotlexcmp() if 'li' is equal to 'ri', it will needlessly traverse the entire data and then just return 0. According to my own experiments this makes it almost 100 times slower than it could be (well, depending on the exact implementation of the sorting function, of course). Simply adding a "if(li==ri) return 0;" at the beginning solves this issue and makes the program a lot faster with large datasets.
[edit] Implementation
We don’t need two “sample implementations” and should not have them here both. We are not a code repository. Maybe we should clean one of them to focus on the algorithm (e.g. remove those #includes and main() in the C version)? --Mormegil 17:54, 9 November 2007 (UTC)