Data Compression



There are many algorithms—and even more implementations of those algorithms—that accept unencoded (plain) information and encode it to use fewer bits. Perhaps even more important is the paired algorithm that accepts the encoded bits and extracts the information.

Each pair of algorithms—one that creates the encoded form, and the other that accepts the encoded form and extracts the information—is called a data compression algorithm. These type of algorithms are increasing abundant, as are their variations, most utilize dictionary based schemes and statistical methods. They are also becoming increasingly specialized in compressing specific kinds of data—text, speech, music, photos, or video...

Data compression is useful in some situations because "compressed data" will save time (in reading and on transmission) and space if compared to the unencoded information it represent. Many programmers attempt to develop new algorithms to tightly compress the data into as few bits as possible (while still being able to recover the relevant information). However, other programmers—after testing several algorithms—deliberately pick some algorithm *other* than the one that would give them the fewest compressed bits. They deliberately sacrifice a few bits in order to improve latency, reduce compression time, or reduce decompression time.

Some communication systems carefully and deliberately add small amounts of "redundancy"—the Data Coding Theory Wikibook has more details.

There are also several data compression benchmarks available for comparing data compression algorithms—even one 50,000 euro cash prize for compressing one particular benchmark file as small as possible (and, of course, uncompressing it afterwards).

In this book, we describe the decompressor first, for several reasons:
 * Decompression routines are usually much easier to understand than the paired compression routine.
 * Too many people write the compressor first, then later discover that it doesn't save enough information to reconstruct the original file.
 * Once we have a decompressor, we can often improve the compressor alone without changing the file format or the decompressor.
 * Many compression systems only document and standardize the decompressor and the file format. (Probably because of the above reasons).

Contents

 * Coding
 * /Evaluating Compression Effectiveness/
 * Models
 * /Markov models/
 * /The zero-frequency problem/
 * /data differencing/
 * /Dictionary compression/
 * /grammar-based compression/
 * Order/Entropy
 * /Lossy vs. Non-Lossy Compression/
 * Inference Compression
 * /Streaming Compression/
 * /compressed file systems/
 * Asymmetric Compression
 * /Multiple transformations/
 * /Executable Compression/
 * /References/