Talk:Bzip2

From Wikipedia, the free encyclopedia

There was also briefly some content at bunzip2:

bzip2 and bunzip2 are free open-source compression utilities.

They're a single utility with different behavior depending on the name used to call it, actually. :)

Many consider them "third-generation" compression utilities, surpassing both first-generation tools (like arc and LHA) and second-generation tools (such as the popular PKZIP and gzip) in compression ability; they "pay" for this extra compression with an increased computational cost. Nonetheless, with the constant effect of Moore's Law making computer time less and less important, compression methods like bzip2 have become more popular.

Of particular note is the fact that, unlike PKZIP, bzip2 is released under a very permissive license, which encourages its use in both open- and closed-source software.

Note that in addition to bzip2, gzip and zlib, there's the PKZIP-compatible Info-ZIP, also under a permissive license. --Brion 04:41 23 Jun 2003 (UTC)
Indeed. I was referring specifically to the PKZIP product, which doesn't have a permissive license. With the move to bzip2, however, and the new article, I don't see much need for my third-generation rambles and jabber about PKZIP's license. The addition of the Moore's Law comment is good enough for me; I was just trying to destubify an article I knew a bit about (bunzip2), but there's no need for that here. Phil Bordelon 04:47 23 Jun 2003 (UTC)
Do feel free to expand on this one, it's still a bit short. :) --Brion 04:49 23 Jun 2003 (UTC)
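
For anyone wondering how "a single utility with different behavior depending on the name used to call it" can work, here is a minimal sketch in Python. The function name mode_from_name and the returned mode strings are made up for illustration, and the substrings checked ("unzip", "zcat", "z2cat") are my recollection of what the real program looks for, so treat them as assumptions rather than a description of the actual source.

```python
import os


def mode_from_name(invoked_as: str) -> str:
    """Pick an operating mode from the name the program was started under.

    Hypothetical helper: the substrings tested here are assumptions,
    not a verified copy of what bzip2 itself checks.
    """
    name = os.path.basename(invoked_as).lower()
    if "zcat" in name or "z2cat" in name:
        # e.g. bzcat: decompress and write the result to standard output
        return "decompress-to-stdout"
    if "unzip" in name:
        # e.g. bunzip2: decompress
        return "decompress"
    # Any other name, e.g. bzip2: compress
    return "compress"
```

A real program would pass in the name it was invoked under (sys.argv[0] in Python), which is why a hard link or symlink named bunzip2 is enough to change the behavior.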

π?

bit-sequences derived from the decimal representation of pi.

Eh? Why? -- Anon.
Indeed, this caught my eye too and is nonsense. -S
/*--
  A 6-byte block header, the value chosen arbitrarily
  as 0x314159265359 :-).  A 32 bit value does not really
  give a strong enough guarantee that the value will not
  appear by chance in the compressed datastream.  Worst-case
  probability of this event, for a 900k block, is about
  2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 for 48 bits.
  For a compressed file of size 100Gb -- about 100000 blocks --
  only a 48-bit marker will do.  NB: normal compression/
  decompression do *not* rely on these statistical properties.
  They are only important when trying to recover blocks from
  damaged files.
--*/
...from [1]. (And look at the fifth thru tenth bytes of a .bz2 file with a hex editor.) Frencheigh 17:51, 24 August 2005 (UTC)
I've also added that the other magic number used (to mark end of stream) is sqrt(π) and that both are in binary-coded decimal format. Sladen 20:21, 10 January 2007 (UTC)
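
The hex-editor check suggested above can also be done in a few lines of Python with the standard bz2 module. This is only a sketch: the names BLOCK_MAGIC, END_MAGIC and describe_stream are mine, and the sqrt(π) value 0x177245385090 is taken from the digits 1.77245385090..., not from the source comment quoted above.

```python
import bz2

# 48-bit markers written as binary-coded decimal digits of pi and sqrt(pi).
BLOCK_MAGIC = bytes.fromhex("314159265359")  # reads "1AY&SY" as ASCII
END_MAGIC = bytes.fromhex("177245385090")    # end-of-stream marker


def describe_stream(stream: bytes) -> dict:
    """Split the start of a .bz2 stream into its leading fields."""
    return {
        "signature": stream[:3],       # b"BZh"
        "level": stream[3:4],          # block size digit, b"1" to b"9"
        "first_marker": stream[4:10],  # fifth through tenth bytes
    }


# A stream with data starts its first block right after the 4-byte header.
nonempty = describe_stream(bz2.compress(b"hello", compresslevel=9))

# A stream with no data has no blocks, so the end-of-stream marker
# follows the header directly.
empty = describe_stream(bz2.compress(b"", compresslevel=9))
```

With non-empty input the fifth through tenth bytes are the π marker; with empty input the same position holds the sqrt(π) marker instead.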

GN00

"In GNU, bzip2 can be used combined or independently of tar"

In GNU, what the heck? No-one says "in GNU" and it should be "In Unix", anyway. Opinions? Jsalomaa 21:55, 29 August 2005 (UTC)

Very well, I fixed it - you really can use it under practically any Unix and not just under GNU. Jsalomaa 19:56, 29 September 2005 (UTC)
Also older versions of GNU tar used the -I switch instead of the -j used currently. This used to be very confusing to me when the older version was still in use. – b_jonas 15:03, 31 January 2006 (UTC)
Except under some older UNIXes where bzip2'ing options (-I or -j) have not made it into tar! Jen 23:12, 15 May 2006 (UTC)
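
To illustrate "combined or independently of tar" without depending on which tar switch a given system supports, here is a rough Python sketch using the standard bz2 and tarfile modules. The function names bzip2_alone and tar_then_bzip2 are hypothetical; the "w:bz2" mode string is the tarfile module's own way of asking for a bzip2-compressed archive.

```python
import bz2
import io
import tarfile


def bzip2_alone(data: bytes) -> bytes:
    """Compress a single byte stream; no archive, no filenames involved."""
    return bz2.compress(data)


def tar_then_bzip2(files: dict) -> bytes:
    """Bundle several named byte strings with tar, compressed with bzip2."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()
```

For example, tar_then_bzip2({"a.txt": b"A", "b.txt": b"B"}) returns bytes that begin with "BZh" like any other bzip2 stream; the file names live in the tar layer inside it.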

Origin of the 'b'?

Just curious -- does anyone know why bzip is called bzip? --babbage 20:40, 5 March 2006 (UTC)

-- the man page says it: bzip2, bunzip2 - a block-sorting file compressor; so the 'b' comes from 'block'
-- Or, just think of the "better ZIP" :-) -- since actually most compressors split their data into blocks before crushing each. Jen 23:10, 15 May 2006 (UTC)

tar?

Why is there information about tar that doesn't relate to bzip? Goffrie 20:37, 3 June 2006 (UTC)

Redundant "Run-length encoding" sections

There are two "Run-length encoding" sections in the article. They need to be merged.--Father Goose 03:03, 23 April 2007 (UTC)

It appears clear from the textual description of the two "RLE" stages that they both function in non-obvious ways. The first stage encodes the length only after four consecutive identical symbols. The later stage encodes the length in binary, as a sum of powers of two, using the additional symbols RUNA and RUNB. Based on the differing ways in which the two operate (even if the heading is the same), I believe it would be inappropriate to merge them. Note also that the two uses of this technique are several stages apart; to merge them would produce an inaccurate reflection of the compression stack. Sladen 09:56, 23 April 2007 (UTC)
Ah, okay, I failed to notice that section was describing the algorithm used, and that RLE is performed twice in the sequence, in two different forms. My error.--Father Goose 10:44, 23 April 2007 (UTC)
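
To make the difference between the two RLE forms concrete, here is a small Python sketch of both. It is based on my understanding of the format, not taken from the bzip2 source: the function names are hypothetical, the first stage is assumed to cap a run at 255 symbols (four literal symbols plus a count byte of 0 to 251), and the second stage is assumed to use bijective base-2 with RUNA worth 1 and RUNB worth 2 at each position, least significant symbol first.

```python
def rle1_encode(data: bytes) -> bytes:
    """First-stage RLE: after four equal bytes, emit a count of extra repeats."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        run = 1
        # Measure the run, capped at 4 literals + 251 counted repeats.
        while i + run < len(data) and data[i + run] == byte and run < 255:
            run += 1
        if run >= 4:
            out += bytes([byte]) * 4
            out.append(run - 4)  # count byte, 0..251
        else:
            out += bytes([byte]) * run
        i += run
    return bytes(out)


def runa_runb(run_length: int) -> list:
    """Later-stage RLE: write a run length as a sequence of RUNA/RUNB symbols."""
    symbols = []
    n = run_length
    while n > 0:
        if n % 2 == 1:
            symbols.append("RUNA")  # contributes 1 * 2**position
            n = (n - 1) // 2
        else:
            symbols.append("RUNB")  # contributes 2 * 2**position
            n = (n - 2) // 2
    return symbols
```

Under these assumptions, seven equal bytes become four literals plus a count of 3 in the first stage, while a run of five in the later stage becomes RUNA RUNB (1×1 + 2×2 = 5). The two encoders share almost nothing beyond the name.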

Technical limitations of bzip2

What are the technical limitations of bzip2? What is the maximum file size, the longest possible contained filename, the maximum number of contained files, etc.? --Loh 12:54, 24 April 2007 (UTC)

As far as I know there is no maximum file size (it is only limited by the containing filesystem, or perhaps [num of 900k blocks]*[maximum size of architecture integer]). Filenames and 'contained files' are not relevant because bzip2 is only a data compressor, not an archiver. It has no concept of combining together multiple files, typically tar is used for that. --Bk0 (Talk) 00:32, 25 April 2007 (UTC)
The Bzip2 format itself does not have support for storing a filename, or even a file timestamp. As noted by Bk0, data is seen just as one long stream, with no knowledge of the contents. The tar (file format) does have limitations, but those do not come from Bzip2, and Bzip2 can be used with an alternative archiving format such as an uncompressed ZIP file. The Bzip2 file format does not contain any limitations (IIRC), though the supplied bzip2recover utility does. Sladen 10:33, 25 April 2007 (UTC)
Well, bzip2recover is designed to search for the two markers (pi and sqrt(pi), 48-bit binary-coded decimal) and to extract usable blocks from that. Incidentally, it should be noted that these markers do *not* have to be byte-aligned, and as such may not be easy to recognize in a file (manually using a hex editor or by something that searches for bytes). The first block marker is clearly recognizable, as it's right after BZh9 and looks like "1AY&SY" in ASCII, but the end-of-stream marker, which looks like ".rE8P." (dots are non-printable bytes), may not be recognizable in a text or hex editor if it isn't on a byte boundary (bzip2recover will still see it). Also, subsequent block markers may be unrecognizable. nneonneo 04:10, 15 July 2007 (UTC)
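
Since the markers need not be byte-aligned, a search has to slide over the stream one bit at a time. The following Python sketch shows the idea; it is not how bzip2recover is actually written, the name find_marker_bits is hypothetical, and it assumes bits are packed most-significant-bit first within each byte.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # pi, 48-bit binary-coded decimal
END_MAGIC = 0x177245385090    # sqrt(pi), 48-bit binary-coded decimal


def find_marker_bits(data: bytes, marker: int, width: int = 48) -> list:
    """Return every bit offset at which the marker starts inside data."""
    mask = (1 << width) - 1
    window = 0
    hits = []
    for bit_index in range(len(data) * 8):
        byte = data[bit_index // 8]
        bit = (byte >> (7 - bit_index % 8)) & 1  # most significant bit first
        window = ((window << 1) | bit) & mask    # keep the last `width` bits
        if bit_index + 1 >= width and window == marker:
            hits.append(bit_index + 1 - width)
    return hits


stream = bz2.compress(b"hello world")
block_offsets = find_marker_bits(stream, BLOCK_MAGIC)
end_offsets = find_marker_bits(stream, END_MAGIC)
```

The first block marker always sits at bit offset 32, straight after the four-byte "BZh9" header. Any offset in end_offsets that is not a multiple of 8 is a marker that a byte-oriented search or a hex editor would miss.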