Many compression formats use Lempel Ziv backreferences (a length/distance pair to copy previous output from). There's some more detail in my Zstandard Worked Example. There's much more detail in Matt Mahoney's Data Compression Explained.
Bzip2, based on the BWT (Burrows Wheeler Transform) and other techniques, is interestingly different (but still ballpark-competitive for compression ratio). Wuffs gained a bzip2 decoder earlier this year, and I wrote up a bzip2 worked example as part of that. Joe Tsai has also written a comprehensive bzip2 deconstruction. The original "A Block-sorting Lossless Data Compression Algorithm" Technical Report is also quite readable.
Unlike Wuffs' PNG decoder, I don't
have any special tricks to share about optimizing its performance. Nonetheless,
Wuffs' decoder turned out to be faster than Debian's /usr/bin/bzcat
, which is
based on libbzip2. Both /usr/bin/bzcat
and
Wuffs' equivalent produce the same output for the linux-5.0.1.tar.bz2
input
(120 MiB compressed, 823 MiB uncompressed) but Wuffs' implementation was 1.3x
faster (as well as being written in a memory-safe language plus self-imposing
a SECCOMP_MODE_STRICT
sandbox).
$ git clone --quiet --depth=1 https://github.com/google/wuffs.git
$ gcc -O3 wuffs/example/bzcat/bzcat.c -o my-bzcat
$ /usr/bin/bzcat < linux-5.0.1.tar.bz2 | sha256sum
85435294910b8cdfbb798e8f05f042eadcb938b20ced9f2f65a9b76fafd52792 -
$ ./my-bzcat < linux-5.0.1.tar.bz2 | sha256sum
85435294910b8cdfbb798e8f05f042eadcb938b20ced9f2f65a9b76fafd52792 -
$ time /usr/bin/bzcat < linux-5.0.1.tar.bz2 > /dev/null
real 0m16.310s
user 0m16.281s
sys 0m0.028s
$ time ./my-bzcat < linux-5.0.1.tar.bz2 > /dev/null
real 0m12.665s
user 0m12.644s
sys 0m0.020s
Those "bzip2 decoder" programs above are all single-threaded. There's also the
multi-threaded /usr/bin/lbzcat
program, which is impressively faster (in
terms of real time; slower in terms of user time). That's quite a feat, given
that the bzip2 file format wasn't actually designed for multi-threaded
decoding, but discussing how that works is another story, for another time.
$ /usr/bin/lbzcat < linux-5.0.1.tar.bz2 | sha256sum
85435294910b8cdfbb798e8f05f042eadcb938b20ced9f2f65a9b76fafd52792 -
$ time /usr/bin/lbzcat < linux-5.0.1.tar.bz2 > /dev/null
real 0m5.136s
user 0m40.678s
sys 0m0.184s
Published: 2022-09-04