-
bzip, bunzip - a block-sorting file compressor
-
Bzip compresses files using the Burrows-Wheeler-Fenwick
block-sorting text compression algorithm. Compression is generally
considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and competitive with all but the best
of the PPM family of statistical compressors.
-
The command-line options are deliberately very similar to those
of GNU Gzip, but they are not identical.
-
Bzip expects a list of file names to follow the command-line
flags. Each file is replaced by a compressed version of itself, with
the name "original_name.bz". Each compressed file has the same
modification date and permissions as the corresponding original, so
that these properties can be correctly restored at decompression
time. File name handling is naive in the sense that there is no
mechanism for preserving original file names, permissions and dates
in filesystems which lack these concepts, or have serious file name
length restrictions, such as MS-DOS.
-
Bzip and bunzip will not overwrite existing files; if you want
this to happen, you should delete them first.
-
If no file names are specified, bzip compresses from standard
input to standard output. In this case, bzip will decline to write
compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.
-
Bunzip (or bzip -d) decompresses and restores all
specified files whose names end in ".bz". Files without
this suffix are ignored. Again, supplying no filenames causes
decompression from standard input to standard output.
-
You can also compress or decompress exactly one named file to
the standard output by giving the -c flag.
-
Compression is always performed, even if the compressed file is
slightly larger than the original. The worst case expansion is for
files of zero length, which expand to seventeen bytes. Random data
(including the output of most file compressors) is coded at about
8.1 bits per byte, giving an expansion of around 1%.
-
As a self-check for your protection, bzip uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to
the original. This guards against corruption of the compressed data,
and against undetected bugs in bzip (hopefully very unlikely). The
chances of data corruption going undetected is microscopic, about
one chance in four billion for each file processed. Be aware,
though, that the check occurs upon decompression, so it can only
tell you that that something is wrong. It can't help you recover the
original uncompressed data.
-
Return values: 1 for an abnormal exit, otherwise 0.
- MEMORY MANAGEMENT
-
Bzip compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory needed
both for compression and decompression. The flags -1
through -9 specify the block size to be 100,000 bytes
through 900,000 bytes (the default) respectively. At
decompression-time, the block size used for compression is read from
the header of the compressed file, and bunzip then allocates itself
just enough memory to decompress the file. Since block sizes are
stored in compressed files, it follows that the flags -1
to -9 are irrelevant to and so ignored during
decompression. Compression and decompression requirements, in bytes,
can be estimated as:
Compression: 300k + ( 8 x block size )
Decompression: 6 x block size
The 300k constant is for a frequency-count table, used in the
sorting phase of compression.
- Larger block sizes give rapidly diminishing marginal returns;
most of the compression comes from the first two or three hundred k
of block size, a fact worth bearing in mind when using bzip on small
machines. It is also important to appreciate that the decompression
memory requirement is set at compression-time by the choice of block
size. So, for example, if you are compressing files which you think
might possibly be decompressed on a 4-megabyte machine, you might
want to select a block size of 200k or 300k, so the decompressor
will draw 1200 kbytes or 1800 kbytes respectively, which is probably
the limit of what's comfortable on a 4-meg machine. In general,
though, you should try and use the largest block size memory
constraints allow. Compression and decompression speed is virtually
unaffected by block size.
-
Another significant point applies to files which fit in a single
block — that means most files you'd encounter using a large
block size. The amount of real memory touched is proportional to the
size of the file, since the file is smaller than a block. For
example, compressing a file 20,000 bytes long with the flag
-9 will cause the compressor to allocate [by the formula,
in practice a little more] 7500k of memory, but only touch
300k + 20000 * 8 = 460 kbytes of it. Similarly, the
decompressor will allocate 5400k but only touch
20000 * 6 = 120 kbytes.
-
Here is a table which summarises the maximum memory usage for
different block sizes. Also recorded is the total compressed size
for 14 files of the Calgary Text Compression Corpus totalling
3,141,622 bytes. This column gives some feel for how compression
varies with block size. These figures tend to understate the
advantage of larger block sizes for larger files, since the Corpus
is dominated by smaller files.
-
Compress Decompress Corpus
Flag usage usage Size
-1 1100k 500k 905958
-2 1900k 1000k 870646
-3 2700k 1500k 853650
-4 3500k 2000k 840140
-5 4300k 2500k 838355
-6 5100k 3000k 831695
-7 5900k 3500k 827104
-8 6700k 4000k 821652
-9 7500k 4500k 821652
- OPTIONS
-
-c Compress or decompress to standard output. -c
requires you to supply exactly one file name, and this file
is compressed or decompressed to standard out.
-d Force decompression. Bzip and bunzip are really the same
program, and the decision about whether to compress or
decompress is done on the basis of which name is used. This
flag overrides that mechanism, and forces bzip to decompress.
-f The complement to -d: forces compression,
regardless of the invokation name.
-k Keep (don't delete) input files during compression or
decompression.
-v Verbose mode — show the compression ratio for each file
processed.
-V Be very verbose. This spews out lots of information during
compression which is primarily of interest for debugging
purposes.
-L Display the software license terms and conditions.
-1 to -9
Set the block size to 100 k, 200 k .. 900 k when compressing.
Has no effect when decompressing.
See MEMORY MANAGEMENT above.
- PERFORMANCE NOTES
-
The sorting phase of compression gathers together similar strings
in the file. Because of this, files containing very long runs of
repeated symbols, like "aabaabaabaab..." (repeated several
hundred times) may compress extraordinarily slowly. You can use
the -V option to monitor progress in great detail, if
you want. Decompression speed is unaffected. Such pathological
cases seem rare in practice.
-
Incompressible or virtually-incompressible data may decompress
rather more slowly than one would hope. This is due to naive
implementation of the move-to-front coder, and of the frequency
tables for the arithmetic coder.
-
Decompression on Sun Sparc 1's (and other low-range
Sparcs) can be slow, because of the lack of hardware
implementations of integer multiply and divide in the SPARC v7
instruction set. The situation is much exacerbated if bzip is
compiled for a full SPARC v8 instruction set, since this causes
the machine to trap on each multiply and divide instruction.
These traps take control to the relevant software emulation of
the offending instruction, but it is much quicker for the
compiler simply to plant a call to the emulation routine.
Moral: be careful how you compile bzip for a Sparc. If you use
GNU C, investigate the effects of the -msupersparc and
-mcypress flags.
-
Wildcard expansion for Windows 95 and NT loses leading directory
information. For example, the pathspec "sources*.c" is
searched correctly for matching files, but the "sources"
bit is ignored when the files come to be processed, which means bzip
won't be able to find any of them. This is easy to fix; perhaps some
enterprising soul will send me a patch?
- CAVEATS
-
I/O error messages are not as helpful as they could be. Bzip
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.
-
There is no -t option to test the integrity of a
compressed file. However, UNIX folks can do the following:
bzip -dcV file.bz > /dev/null
which causes bzip to do a trial decompression of file.bz, throwing
away the result. You'll be shown the computed and stored CRCs. If
these are identical, the file is almost certainly OK — see the
discussion above on CRCs for a definition of "almost certainly". If
they're not, bzip will complain loudly. Note that file.bz is left
unchanged regardless of the outcome. Win95/NT folks can do the same,
but /dev/null will have to be replaced with something
suitable, perhaps NUL.
- This manual page pertains to version 0.21 of bzip. It may well
happen that some future version will use a different compressed file
format. If you try to decompress, using 0.21, a .bz file
created with some future version which uses a different compressed
file format, 0.21 will complain that your file
"is not a BZIP file". If that happens, you should obtain a
more recent version of bzip and use that to decompress the
file.
Installation
- Using the FreeBSD
ports system:
cd /usr/ports/archivers/bzip
make install clean
Using the FreeBSD
pkg system:
pkg install archivers/bzip
-