Data Compression/Dictionary compression

Dictionary Compression
Remember the wealth of information a soldier gets from a mike click? With certain types of data, if you can encode the data using information that is not stored in the actual datafile you are compressing, you can significantly reduce both storage and communication requirements. Some algorithms replicate this idea by including an internal data table, often referred to as a dictionary, hence the term dictionary compression.

As the mike-click example demonstrates, the gains come from a pre-arranged code system shared between the sender and the receiver. It is no accident that most network-oriented compression algorithms rely on some form of dictionary compression.

An example is a simple dictionary built into the compression scheme. If you already know the most compact representation for a specific type of data, then you can simply substitute the compact form for the actual data. Consider the word "supercalifragiliciousexpialadocious". Now suppose that your dictionary contains only 11 words starting with "super", and this is the eleventh. Then "Super11" is unique enough that, given the same dictionary at decompression time, you could recover the full word with no data loss. With short words like "a" and "an", however, it is not so obvious that a dictionary reference could take less space than the word itself.

Now consider that you have the Internet, and you want to store an article by LaVogue. You could store the file on your own computer, or, alternatively, you could store a link to the file, similar to the hypertext links you have on your homepage. Is it compression? Maybe not, but you can recover the original text, with a slight delay while it downloads.

The mind boggles at the efficiency of a compression system that contains all the knowledge in the world on a single server farm and uses it to reduce the storage of information on the client computer. Even more fantastic would be the legal snarl such a system would create, and the privacy concerns and financial costs associated with recovery of compressed information that was proprietary in the first place. But a net-aware compression server might be able to achieve truly radical compression rates. Essentially, all that the home file would have to contain would be the key sequence to trigger the decompression of a specific data cloud. The home user could even compress his compression files, because multiple files, each containing a link to a specific compression file structure, could themselves be turned into a single file that contained the compression file structure of all the composite files. Infinite compression at the user level, but truly massive storage at the server level.

Text compression using 4 bit coding
Pike's 1981 compression algorithm, like several early text compression algorithms, used a more or less literal dictionary of English words.

Pike's algorithm used a fixed dictionary composed of 205 of the most popular English words, and the ASCII characters. By using a variable-length coding of 4, 8, or 12 bits to represent these words and the remaining ASCII characters, and some clever handling of uppercase letters and the space between words, Pike's algorithm achieved a compression ratio of 3.87 bits per character for typical English text.

Adding more English words to the dictionary helps compression, but with diminishing returns. Adding certain non-words to the dictionary -- common letter pairs and triplets, common prefixes, common suffixes, etc. -- helps compression, since they can be used to at least partially compress words that aren't in the dictionary, but adding such non-words also gives diminishing returns.

Later compression algorithms kept the idea of a short codeword representing something longer, but generalized it: a dictionary coder expands codewords to a sequence of bytes, a sequence that may or may not line up on English word boundaries.

Smaz
Smaz is a simple compression library suitable for compressing very short strings, such as URLs, single English sentences, and short HTML files.

Much like Pike's algorithm, Smaz has a fixed, built-in codebook of 254 common English words, word fragments, bigrams, and the lowercase letters (except j, k, q). The inner loop of the Smaz decoder is very simple:
 * Fetch the next byte X from the compressed file.
 * Is X == 254? Single byte literal: fetch the next byte L, and pass it straight through to the decoded text.
 * Is X == 255? Literal string: fetch the next byte L, then pass the following L+1 bytes straight through to the decoded text.
 * Any other value of X: look up the X'th "word" in the codebook (that "word" can be from 1 to 5 letters), and copy that word to the decoded text.
 * Repeat until there are no more compressed bytes left in the compressed file.
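The inner loop above can be sketched in Python. The real Smaz codebook has 254 fixed entries; the two-entry `toy` table here is purely illustrative:

```python
def smaz_like_decode(data, codebook):
    # Decode a Smaz-style byte stream using the given codebook.
    out = []
    i = 0
    while i < len(data):
        x = data[i]
        i += 1
        if x == 254:                    # single-byte literal follows
            out.append(chr(data[i]))
            i += 1
        elif x == 255:                  # literal string: length byte L, then L+1 bytes
            length = data[i] + 1
            i += 1
            out.append(data[i:i + length].decode('latin-1'))
            i += length
        else:                           # codebook "word"
            out.append(codebook[x])
    return ''.join(out)

toy = ["the", " "]                      # entries 0 and 1 of a toy codebook
assert smaz_like_decode(bytes([0, 1, 0, 254, ord('X')]), toy) == "the theX"
```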

Because the codebook is constant, the Smaz decoder is unable to "learn" new words and compress them, no matter how often they appear in the original text.

Adaptive dictionary algorithms
Most dictionary algorithms are adaptive: rather than using a fixed, hard-wired dictionary like Pike's algorithm, the dictionary is built up as the text is processed. When the compressor encounters letter triplets or entire words that are not yet in the dictionary, it adds them to the dictionary, and the decompressor makes exactly the same additions, so the two dictionaries stay in sync.

If there are too many rare triplets, syllables, or words in the dictionary, it increases the codeword size of the more common ones, hurting compression ratio. If some triplet, syllable, or word is left out of the dictionary, then when it does occur, it cannot be represented by a simple dictionary reference and must be spelled out using more bits, also hurting compression ratio. This balancing act makes it difficult to build an optimal dictionary.

The hard-wired dictionary used to initialize the data compression algorithm has a big effect on the compression ratio in the early phase of the compression. An adaptive dictionary algorithm eventually adds to the dictionary all the syllables and common words used in that specific text, and so the initial hard-wired dictionary has little effect on the compression ratio in later phases of the compression of large files.

LZ77 algorithms
LZ77 type decoders expand the compressed file by expanding "copy items" in that compressed file into larger strings. This enables "duplicate string elimination": only the first copy of a string is stored in the compressed file. All other copies refer to a previous copy.

These decoders have the general algorithm:
 * Is the next item a literal? Then copy it directly from the compressed file to the end of the decoded text.
 * Is the next item a copy item? Then break it into a "length" and a "distance". Find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to the end of the decoded text.
 * Repeat from the beginning until there are no more items in the compressed file.
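Assuming the items have already been parsed out of the bit stream into Python tuples, the general decoder above can be sketched as:

```python
def lz77_decode(items):
    # Generic LZ77 decoder over an already-parsed item stream.
    # Each item is either ('lit', byte_value) or ('copy', length, distance).
    # Copies are done byte-by-byte so that overlapping copies
    # (distance < length) work correctly.
    out = bytearray()
    for item in items:
        if item[0] == 'lit':
            out.append(item[1])
        else:
            _, length, distance = item
            start = len(out) - distance
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)

# "abcabcabc": literals "abc", then copy 6 bytes from 3 bytes back
items = [('lit', ord('a')), ('lit', ord('b')), ('lit', ord('c')),
         ('copy', 6, 3)]
assert lz77_decode(items) == b"abcabcabc"
```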

While all LZ77 style algorithms have this same decoder structure, there is a huge variety of ways that decoders decide whether an item is a literal or a copy item, and also a variety of ways that they extract the "length" and the "distance" from the copy item.

Many decoder algorithms combine LZ77 with other compression ideas. As of 2008, the most popular LZ77-based compression method is called DEFLATE; it combines LZ77 with Huffman coding. We discuss DEFLATE in a later chapter of this book, Data Compression/Multiple transformations.

LZSS and LZRW1-A
Lempel-Ziv-Storer-Szymanski (LZSS) was created in 1982 by James Storer and Thomas Szymanski. LZRW1-A was published in 1991 by Ross Williams.

The LZSS and LZRW1-A decompressors have a very similar form:

For each item, fetch a "literal/copy" bit from the compressed file.
 * 0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
 * 1: copy item: the decoder grabs the next 2 bytes from the compressed file and breaks them into a 4 bit "length" and a 12 bit "distance". The 4 "length" bits are decoded into a length from 3 to 18 characters. Then find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to the end of the decoded text.
 * Repeat from the beginning until there are no more items in the compressed file.
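A sketch of how the 2-byte copy item might be unpacked. The exact bit order is not specified above, so the high-4-bits-length, low-12-bits-distance layout here is an assumption for illustration:

```python
def parse_lzss_copy(b0, b1):
    # Split a 16-bit LZSS-style copy item into (length, distance).
    # Assumed layout: top 4 bits = raw length field, low 12 bits = distance.
    # The raw length field 0..15 maps to actual lengths 3..18.
    word = (b0 << 8) | b1
    raw_length = word >> 12          # 4 bits
    distance = word & 0x0FFF         # 12 bits
    return raw_length + 3, distance

assert parse_lzss_copy(0x00, 0x01) == (3, 1)      # raw length 0 -> 3
assert parse_lzss_copy(0xF7, 0xFF) == (18, 0x7FF)
```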

LZJB
The LZJB decoder fetches a "literal/copy" bit from the compressed file.
 * 0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
 * 1: copy item: the decoder grabs the next 2 bytes from the compressed file and breaks them into a 6 bit "length" and a 10 bit "distance". The 6 "length" bits are decoded into a length from 3 to 66 characters. Then find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to the end of the decoded text.
 * Repeat from the beginning until there are no more items in the compressed file.

After pulling a 1 bit from the control word, a copy item is in only one format:
 * LLLLLLdd dddddddd : copy L+3 bytes from d bytes before the most-recently-decoded byte.
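Since the copy-item format above is given explicitly, unpacking it is a couple of shifts (a sketch; the control-word handling around it is omitted):

```python
def parse_lzjb_copy(b0, b1):
    # Decode an LZJB copy item 'LLLLLLdd dddddddd' into (length, distance):
    # length = L + 3 (3..66), distance = 10 bits.
    length = (b0 >> 2) + 3               # top 6 bits of the first byte
    distance = ((b0 & 0x03) << 8) | b1   # low 2 bits + second byte
    return length, distance

assert parse_lzjb_copy(0b00000001, 0x00) == (3, 256)
assert parse_lzjb_copy(0b11111111, 0xFF) == (66, 1023)
```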

LZ77
The original LZ77 decoder fetches a "distance/length/literal" block from the compressed file.

The LZ77 decoder algorithm is:
 * Find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to end of the decoded text. (Sometimes the "length" is 0, so this "copy" does nothing).
 * Then copy the literal character from the compressed file to the end of the decoded text.
 * Repeat from the beginning until there are no more items in the compressed file.

BARF
The "BARF" archiver (when it's not cheating) uses a simple byte-oriented LZ77-like code. The inner loop is:


 * Fetch a byte X.
 * X in the range 0...31? Literal: copy the next X literal bytes from the compressed stream to the end of the decoded text.
 * X in the range 32...255? Copy item: copy the two bytes from X-32 places back in the decoded text.
 * Repeat from the beginning until there are no more items in the compressed file.
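A sketch of that loop, following the description above literally. Note that X = 33 (distance 1) copies from one byte back, so the second byte copied is the one just written, giving run-length behavior:

```python
def barf_like_decode(data):
    # Byte-oriented LZ77-like decoder: X in 0..31 introduces X literal
    # bytes; X in 32..255 copies the byte pair starting X-32 places back.
    out = bytearray()
    i = 0
    while i < len(data):
        x = data[i]
        i += 1
        if x <= 31:                      # literal run of X bytes
            out += data[i:i + x]
            i += x
        else:                            # copy a 2-byte pair
            start = len(out) - (x - 32)
            out.append(out[start])
            out.append(out[start + 1])   # may be the byte just appended
    return bytes(out)

# literals "ab", then X=34: copy the pair from 2 places back
assert barf_like_decode(bytes([2, 97, 98, 34])) == b"abab"
```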

LZF
LZF is a fast compression algorithm that takes very little code space and working memory (assuming we have access to the most recent 8 kByte of decoded text).

LibLZF by Marc Lehmann is designed to be a very small, very fast, very portable data compression library for the LZF compression algorithm.


LZF is used in TuxOnIce and many other applications.

The LZF decoder inner loop is: for each item, the LZF decoder first fetches a byte X from the compressed file, and breaks the byte X into a 3 bit length and a 5 bit high_distance. The decoder has 3 cases: length==0, length==7, and any other length.
 * length == 0? (i.e., X in the range 0...31?) Literal: copy the next X+1 literal bytes from the compressed stream and pass them straight through to the current location in the decompressed stream.
 * length in 1...6 ? short copy: Use that value as length (later interpreted as a length of 3 to 8 bytes).
 * length == 7? long copy: fetch a new length byte and add 7. The new byte could have any value 0 to 255, so the sum is 7 to 262 (which is later interpreted as 9 to 264 bytes).
 * Either kind of copy: Fetch a low_distance byte, and combine it with the 5 high_distance bits to get a 13-bit distance. (This implies a 2^13 = 8KByte window).
 * Either kind of copy: Find the text that starts that "distance" back from the current end of decoded text, and copy "length+2" characters from that previously-decoded text to end of the decoded text.
 * Repeat from the beginning until there are no more items in the compressed file.

Each LZF "item" is one of these 3 formats:
 * 000LLLLL               ; literal reference
 * LLLddddd dddddddd           ; copy L+2 from d bytes before most recently decoded byte
 * 111ddddd LLLLLLLL dddddddd  ; copy L+2+7 from d bytes before most recently decoded byte
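Putting the three item formats together, a decoder sketch. (Note: real LibLZF stores the distance minus one; this sketch uses the distance exactly as described above.)

```python
def lzf_like_decode(data):
    # Decoder sketch for the 3 LZF item formats listed above.
    out = bytearray()
    i = 0
    while i < len(data):
        x = data[i]; i += 1
        length = x >> 5                  # top 3 bits
        if length == 0:                  # 000LLLLL: L+1 literal bytes follow
            n = (x & 0x1F) + 1
            out += data[i:i + n]; i += n
            continue
        if length == 7:                  # long copy: extra length byte
            length += data[i]; i += 1
        distance = ((x & 0x1F) << 8) | data[i]; i += 1   # 13-bit distance
        start = len(out) - distance
        for k in range(length + 2):      # copy length+2 bytes, overlap-safe
            out.append(out[start + k])
    return bytes(out)

# literal "abc" (000 00010 = 3 bytes), then short copy of 3 from 3 back
assert lzf_like_decode(bytes([0x02, 97, 98, 99, 0x20, 0x03])) == b"abcabc"
```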

FastLZ
FastLZ by Ariya Hidayat is a free, open-source, portable real-time compression library for the FastLZ algorithm. (Ariya Hidayat apparently developed the FastLZ algorithm based on the LZF algorithm?) FastLZ is a drop-in replacement for Marc Lehmann's LibLZF. It has pretty much the same compressed size and compression time as LibLZF, and significantly faster decompression -- about 2/3 the time of LibLZF.

The FastLZ decoder inner loop is very similar to the LZF inner loop:


 * Fetch a byte X. Divide the byte X into the 3 bit length and a 5 bit high_distance.
 * length == 0? (i.e., X in the range 0...31?) Literal: copy the next X+1 literal bytes from the compressed stream to the end of the decoded text.
 * length in 1...6 ? Copy: Use that value as length (later interpreted as a length of 3 to 8 bytes).
 * length == 7? Copy: fetch a new length byte and add 7. The new byte could have any value 0 to 254, so the sum is 7 to 261 (which is later interpreted as 9 to 263 bytes).
 * (When the new byte is all-ones (255), it serves as an escape to even longer lengths ...)
 * Fetch a low_distance byte, and combine it with the 5 high_distance bits to get a 13-bit distance. (This implies a 2^13 = 8KByte window).
 * If the distance is all-ones (0x1FFF): fetch 2 new distance bytes to give a full 16 bit distance, and add it to the 0x1FFF all-ones distance. (This implies a 2^16 + 2^13 = 64 KByte + 8 KByte window)
 * find the text that starts that "distance" back from the current end of decoded text, and copy "length+2" characters from that previously-decoded text to end of the decoded text.
 * Repeat from the beginning until there are no more items in the compressed file.

If a compressed file only uses distances within the 2^13-1 window, and lengths less than 2^8+7, both the FastLZ and the LZF decoders will produce the same decompressed plaintext.

Each FastLZ "item" is one of these 5 formats (the LZF formats, plus a few extensions):
 * 000LLLLL               ; literal reference
 * LLLddddd dddddddd           ; copy L+2 bytes from d bytes before most recently decoded byte
 * 111ddddd LLLLLLLL dddddddd  ; copy L+2+7 bytes from d bytes before most recently decoded byte
 * LLL11111 11111111 dddddddd dddddddd ; copy L+2 bytes from d+0x1FFF bytes before most recently decoded byte
 * 11111111 LLLLLLLL 11111111 dddddddd dddddddd ; copy L+2+7 bytes from d+0x1FFF bytes before most recently decoded byte

''It's not clear why FastLZ is so much faster than LZF. Its "extensions" are apparently not being used, since the compressed file sizes on the benchmark files are not significantly different.'' If the same implementation techniques -- 16-bit copies whenever possible, special-case for "run" (pseudo-RLE), etc. -- were applied to LZF, would it be just as fast? Or is it the case that these "extensions" actually are being used a lot, and somehow they are faster (but roughly the same size) as the LZF approach?

miniLZO
LZO is a compressed data format and portable lossless data compression library written in ANSI C, developed and maintained by Markus F.X.J. Oberhumer. LZO is slightly slower than LZF and produces somewhat better compressed file sizes. LZO is much faster than DEFLATE/gzip -- it runs in about half the time -- and produces somewhat worse file sizes than DEFLATE/gzip.

miniLZO is a lightweight subset of the LZO library.

LZ4
LZ4 is a compressed data format and portable lossless data compression library developed and maintained by Yann Collet.

LZ4 is used in many compressed file systems, including OpenZFS, GNU Grub, HAMMER, and Squashfs. Open source implementations are available for a variety of languages and (cross-platform) operating systems, and there is an active discussion group.

The LZ4 decoder inner loop is:

For each item, the LZ4 decoder first fetches a byte X from the compressed file, and breaks the byte X into a 4 bit literal_length and a 4 bit copy_length.

The literal decoder has 2 cases: literal_length==15, and any other length. (When literal_length is 15, one or more "extra literal length" bytes follow: each byte of value 255 adds 255 to the length and means another length byte follows, while the first byte in the range 0...254 adds its value and ends the length.) Then copy that many literal bytes from the compressed stream and pass them straight through to the decompressed stream.
 * literal_length == 0...14 ? copy the next literal_length literal bytes from the compressed stream and pass them straight through to the current location in the decompressed stream. (If the literal_length field is 0, then there is no literal).
 * literal_length == 15? long literal: fetch a new length byte and add 15. The new byte could have any value 0 to 254, so the sum is 15 to 269 bytes of literal.

After the literal bytes (if any) comes the low byte of the distance, followed by the high byte of the distance. (This implies a 2^16 = 64KByte window). A distance of 1 indicates "current position - 1 byte"; a distance of 0xffff indicates "current position - 65535 bytes"; a distance of 0 is invalid and should never happen. The actual number of bytes to copy is decoded similarly to the actual number of literals to copy, except that the minimum length of a match is 4 bytes. (The copy_length == 15 special case is extended the same way as the literal length: each additional 255 byte adds 255 to the length.)
 * copy_length == 0...14 ? length_sum = copy_length + 4 bytes, which is 4 to 18 bytes.
 * copy_length == 15? long copy: fetch a new length byte. length_sum = that new length byte + copy_length + 4 bytes. The new byte could have any value 0 to 254, so the length_sum is 19 to 273 bytes of match.


 * Either kind of copy: Find the text that starts that "distance" back from the current end of decoded text, and copy "length_sum" characters from that previously-decoded text to the end of the decoded text.
 * Repeat from the beginning until there are no more items in the compressed file.
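A sketch of raw-block decoding under the layout described above (real LZ4 frames add block headers and checksums). The 255-continuation handling for long lengths and the byte-by-byte, overlap-safe match copy are the two details most often gotten wrong:

```python
def lz4_like_decode(data):
    # token = (literal_length << 4) | copy_length; little-endian 2-byte
    # distance; 255-continuation bytes extend lengths of 15 or more.
    out = bytearray()
    i = 0
    while i < len(data):
        token = data[i]; i += 1
        lit_len = token >> 4
        if lit_len == 15:                # extended literal length
            while True:
                b = data[i]; i += 1
                lit_len += b
                if b != 255:
                    break
        out += data[i:i + lit_len]; i += lit_len
        if i >= len(data):               # last sequence: literals only
            break
        distance = data[i] | (data[i + 1] << 8); i += 2
        match_len = (token & 0x0F) + 4   # minimum match is 4 bytes
        if (token & 0x0F) == 15:
            while True:
                b = data[i]; i += 1
                match_len += b
                if b != 255:
                    break
        start = len(out) - distance
        for k in range(match_len):       # byte-by-byte: overlap-safe
            out.append(out[start + k])
    return bytes(out)

# token 0x32: 3 literals "abc", then a 6-byte match from 3 bytes back
assert lz4_like_decode(bytes([0x32, 97, 98, 99, 3, 0])) == b"abcabcabc"
```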

Like most LZ77-style algorithms, LZ4 supports pseudo-RLE -- the length_sum can be longer than the offset distance; matches can overlap forward.

A common case of pseudo-RLE is a distance of "1", where the decompressor ends up repeating the last byte "length_sum" more times.

In order to ensure that the decoder will never read beyond the input buffer, nor write beyond the output buffer, each block starts with a "block size" field that indicates exactly how many bytes of *compressed* data are in that block; and the last 5 bytes of the block are always literals. The last copy item is incomplete and stops right after its literals.

The "LZ4-HC" ("high compression") and "LZ4" use exactly the same format and the same decoding function; the difference is in the encoder.

QuickLZ
QuickLZ

LZS
LZS: Lempel–Ziv–Stac compression. LZS compression is specified for various Internet protocols, including RFC 1967, RFC 1974, RFC 2395, and RFC 3943. Stac Electronics sued Microsoft for infringement of LZS patents. Since then, several of those patents have expired.

The LZS decoder fetches a "literal/copy" bit from the compressed file.
 * 0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
 * 1: copy item: the decoder grabs the next bit from the compressed file.
 * That bit selects how many bits of "distance" to pull from the file:
 * 1: the decoder grabs a 7 bit "distance".
 * If this distance is the nonsensical value "0", it really represents the end-of-message. Exit the loop.
 * 0: the decoder grabs an 11 bit "distance" (this implies a 2 kibiByte sliding window).
 * Then the decoder grabs a "length" value from the compressed file.
 * However, the length is not encoded in a fixed-length field, but in a variable-length pattern:
 * 00 represents length 2
 * 01 represents length 3
 * 10 represents length 4
 * 1100 represents length 5
 * 1101 represents length 6
 * 1110 represents length 7
 * 1111 0000 represents length 8
 * 1111 0001 represents length 9
 * 1111 1111 0010 represents length 25
 * In general, except for the first 6 special cases listed above, "1111" repeated N times followed by 4 more bits that represent a value Y in the range 0...14, are decoded into a length value (N*15 + Y - 7).
 * Then find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to end of the decoded text.
 * Repeat from the beginning, until the special "zero distance" value is decoded.
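The variable-length length code can be decoded from a bit stream as follows; `bits` is assumed to be an iterator yielding one bit at a time. The general case matches the table above: "1111 0000" gives 1*15 + 0 - 7 = 8.

```python
def lzs_length(bits):
    # Decode the LZS variable-length "length" code:
    # 00->2, 01->3, 10->4, 1100..1110->5..7,
    # then '1111' repeated N times + 4 bits Y (0..14) -> N*15 + Y - 7.
    def b():
        return next(bits)
    first = (b() << 1) | b()
    if first < 3:                        # 00, 01, 10
        return first + 2
    second = (b() << 1) | b()            # we have '11' so far
    if second < 3:                       # 1100, 1101, 1110
        return second + 5
    n = 1                                # one '1111' group seen
    while True:
        y = 0
        for _ in range(4):
            y = (y << 1) | b()
        if y == 15:                      # another '1111' group
            n += 1
        else:
            return n * 15 + y - 7

assert lzs_length(iter([0, 0])) == 2
assert lzs_length(iter([1, 1, 0, 1])) == 6
assert lzs_length(iter([1, 1, 1, 1, 0, 0, 0, 1])) == 9
assert lzs_length(iter([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0])) == 25
```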

The complex variable-length encoding technique used in LZS allows "recent" byte pairs to be compressed into 11 bits, and even "less recent" byte pairs, as long as they are in the 2 kibiByte sliding window, to be compressed into 15 bits. This variable-length encoding is a kind of "universal code", and can also be viewed as one of many kinds of "fixed Huffman encoding" tables, which we discuss in the Entropy chapter of this Wikibook.

In text files, repeated pairs are far more common than longer repeated strings. LZS gives at least 1 bit of compression on this common form of redundancy. This gives LZS an advantage over compression algorithms which are unable to compress repeated strings shorter than 3 characters, and so must store such short repeats as-is, or even expand them by a bit.

Snappy
The "Snappy" algorithm is tuned for very high speeds (even at the cost of mediocre compression). On benchmark files, Snappy is an order of magnitude faster than zlib but compressed files 20% to 100% bigger than zlib. Snappy is faster than LZF.

The inner loop of the "Snappy" decompressor decodes "elements" that always begin with a tag byte in one of these formats:


 * xxxxxx00 ; literal follows
 * xxxxxx01 ; copy with 1-byte offset
 * xxxxxx10 ; copy with 2-byte offset
 * xxxxxx11 ; copy with 4-byte offset

Details:


 * LLLLLL00  ; where L in 0..59: literal reference for 1...60 (inclusive) literal bytes.
 * LLLLLL00  ; where L in 60..63: these special "lengths" indicate 1, 2, 3, or 4 bytes, respectively, of actual length bytes follow the tag byte. Those length bytes are followed by the literal itself.


 * dddLLL01 dddddddd ; copy with 1-byte offset: copy L+4 bytes from d bytes before the most recently decoded byte. (Why L+4, giving lengths 4..11 ? Would L+3, giving lengths 3..10, give better compression? Perhaps even L+2 allowing us to copy the occasional byte pair?)
 * LLLLLL10 dddddddd dddddddd ; copy with 2-byte offset: copy L+1 bytes from d bytes before the most recently decoded byte.
 * LLLLLL11 dddddddd dddddddd dddddddd dddddddd ; copy with 4-byte offset: copy L+1 bytes from d bytes before the most recently decoded byte.
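A sketch of parsing one tag byte into (kind, length, offset). The little-endian byte order and the minus-one bias of the multi-byte literal length bytes are details of the real Snappy format, stated here as assumptions:

```python
def parse_snappy_tag(data, i):
    # Parse one Snappy element header starting at data[i].
    # Returns (kind, length, offset, next_i); literal payload bytes
    # themselves begin at next_i.
    tag = data[i]; i += 1
    kind = tag & 0x03
    if kind == 0:                        # literal
        L = tag >> 2
        if L < 60:
            length = L + 1
        else:                            # 60..63: 1..4 length bytes follow
            n = L - 59
            length = int.from_bytes(data[i:i + n], 'little') + 1
            i += n
        return 'literal', length, None, i
    if kind == 1:                        # copy, 1-byte offset: dddLLL01
        length = ((tag >> 2) & 0x07) + 4
        offset = ((tag >> 5) << 8) | data[i]; i += 1
        return 'copy', length, offset, i
    if kind == 2:                        # copy, 2-byte offset
        length = (tag >> 2) + 1
        offset = int.from_bytes(data[i:i + 2], 'little'); i += 2
        return 'copy', length, offset, i
    length = (tag >> 2) + 1              # copy, 4-byte offset
    offset = int.from_bytes(data[i:i + 4], 'little'); i += 4
    return 'copy', length, offset, i

assert parse_snappy_tag(bytes([0x00, 97]), 0) == ('literal', 1, None, 1)
assert parse_snappy_tag(bytes([0x01, 0x03]), 0) == ('copy', 4, 3, 2)
```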

PalmDoc
PalmDoc combines LZ77 with a simple kind of byte pair compression.

PalmDoc files are decoded as follows:


 * Read a byte from the compressed stream. If the byte is
 * 0x00: "1 literal" copy that byte unmodified to the decompressed stream.
 * 0x09 to 0x7f: "1 literal" copy that byte unmodified to the decompressed stream.
 * 0x01 to 0x08: "literals": the byte is interpreted as a count from 1 to 8, and that many literals are copied unmodified from the compressed stream to the decompressed stream.
 * 0x80 to 0xbf: "length, distance" pair: the 2 leftmost bits of this byte ('10') are discarded, and the following 6 bits are combined with the 8 bits of the next byte to make a 14 bit "distance, length" item. Those 14 bits are broken into 11 bits of distance backwards from the current location in the uncompressed text, and 3 bits of length to copy from that point (copying length+3 bytes, i.e. 3 to 10 bytes).
 * 0xc0 to 0xff: "byte pair": this byte is decoded into 2 characters: a space character, and a letter formed from this byte XORed with 0x80.
 * Repeat from the beginning until there are no more bytes in the compressed file.
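The rules above, as a decoder sketch:

```python
def palmdoc_decode(data):
    # PalmDoc decoder following the byte ranges described above.
    out = bytearray()
    i = 0
    while i < len(data):
        x = data[i]; i += 1
        if x == 0x00 or 0x09 <= x <= 0x7F:      # single literal byte
            out.append(x)
        elif 0x01 <= x <= 0x08:                 # x literal bytes follow
            out += data[i:i + x]; i += x
        elif 0x80 <= x <= 0xBF:                 # length/distance pair
            bits = ((x & 0x3F) << 8) | data[i]; i += 1
            distance = bits >> 3                # 11 bits of distance
            length = (bits & 0x07) + 3          # 3 bits -> 3..10 bytes
            start = len(out) - distance
            for k in range(length):
                out.append(out[start + k])
        else:                                   # 0xC0..0xFF: space + char
            out.append(0x20)
            out.append(x ^ 0x80)
    return bytes(out)

# literals "abc", then copy 3 bytes from 3 back (bits = 3<<3 = 0x018)
assert palmdoc_decode(bytes([0x61, 0x62, 0x63, 0x80, 0x18])) == b"abcabc"
assert palmdoc_decode(bytes([0xE4])) == b" d"   # byte pair: space + 'd'
```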

LZSA
LZSA is a collection of byte-aligned compression formats that are specifically engineered for very fast decompression on 8-bit systems.

LZSA1
LZSA1

... [FIXME:] ...

LZSA2
The inner loop of the LZSA2 decompressor decodes compressed "commands" that are always byte-aligned, beginning with a "token" byte:


 * token: XYZ LL MMM

LL is a 2-bit literals length with values of 0 to 3. If LL is 0, then there are no literals and this command handles only a match. If LL is 1 or 2, then 1 or 2 literal values follow this token. If LL is 3, then "extra literal length" nybbles or byte(s) follow this token and are decoded into a number of literals from 3 to 0xFFFF, and those 3 to 0xFFFF literal bytes immediately follow the "extra literal length" field.

After handling literals (if any), the decoder handles an LZ77-like match. The match offset is decoded according to the XYZ bits in the token:


 * XYZ
 * 00Z 5-bit offset: read a nybble + Z for offset
 * 01Z 9-bit offset: read a byte + Z for offset
 * 10Z 13-bit offset: read a nybble + a byte + Z for offset
 * 110 16-bit offset: read 2 bytes for the offset
 * 111 repeat offset: reuse offset value of previous command.

MMM is a 3-bit match length with values 0 to 7. The minimum-match length is 2 bytes (encoded as "000"). If M is 0 to 6, the actual match length is 2 to 8. If M is 7, then "extra match length" nybbles or byte(s) (immediately following the literal bytes, if any) are decoded into a match length from 2 to 0xFFFF.

For size-optimized decompressors, the same code can be used to decode both types of lengths -- match-offset nybbles or byte(s), and literal-length nybbles or bytes.

fixed byte pair encoding
The simplest dictionary algorithm is byte pair encoding with a fixed dictionary of byte pairs (a bigram dictionary).

Simple byte-pair encoding, also called dual-tile encoding or DTE, is often used to compress text in video game ROMs.

A byte pair decoder assumes that English text uses only a relatively small "alphabet" of letters and symbols. A byte pair decoder assumes that certain bytes never occur in English language text (the "high" bytes 0x80 to 0xFF and a few other bytes). When the "impossible" bytes occur in the compressed text, the decoder looks that byte up in a fixed dictionary, and emits the corresponding pair of bytes. Otherwise, the decoder copies the compressed byte to the plaintext more or less as-is. Often the 0x00 byte marks the end of the compressed string, so standard C string-handling routines can be re-used to handle compressed text. The dictionary should include all the most-frequent letter pairs used in the plaintext ("th", "he", " t", etc.). There are several different implementations of this algorithm, each one using a different (incompatible) dictionary.
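A minimal decoder sketch, assuming a toy `pairs` table rather than any particular game's dictionary:

```python
def dte_decode(data, pairs):
    # Fixed byte-pair (DTE) decoder. `pairs` maps "impossible" high
    # bytes (0x80..0xFF) to the two plaintext bytes each stands for.
    out = bytearray()
    for x in data:
        if x == 0x00:          # end of compressed string
            break
        if x in pairs:         # "impossible" byte -> byte pair
            out += pairs[x]
        else:                  # ordinary byte passes through
            out.append(x)
    return bytes(out)

pairs = {0x80: b"th", 0x81: b"he"}     # toy dictionary
assert dte_decode(b"\x80e \x81n\x00", pairs) == b"the hen"
```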

byte pair encoding
DTE can be decoded in a single pass -- each byte of compressed text is either a "normal" printable character that stands for itself, or else it is an "unprintable" byte that stands for exactly two normal printable characters. Recursive byte pair encoding loosens that restriction: with recursive byte pair encoding, an "unprintable" byte stands for two bytes. Sometimes both of those two bytes are normal printable characters, just like DTE. Sometimes one or both bytes may be (some other) "unprintable" byte, requiring repeated decoding passes until only printable characters remain.

SCZ byte pair encoding
SCZ is a dynamic byte pair algorithm and compressed format developed by Carl Kindman, who released one implementation and its test suite under an LGPL license.

The SCZ compressed format is a series of segments. Each segment has a few bytes of header and footer, including an iteration count and a checksum, surrounding a repeatedly-compressed block.

The inner loop of the SCZ decoder is given a dictionary of 256 entries, and a special escape symbol. The inner loop scans through a block of "compressed segment data" and generates a new uncompressed block something like this:
 * Read the next byte from the compressed block.
 * If the byte read is the special escape byte, discard the escape byte and read the following compressed byte as a literal byte, and write that literal byte to the decompressed block.
 * Otherwise look up that byte in the dictionary.
 * If the byte is a "marker character", the dictionary has a byte pair to replace it with -- write those 2 bytes to the decompressed block.
 * Otherwise the dictionary indicates the byte is a literal byte -- write the byte read to the decompressed block.
 * Repeat until no more data in the compressed block.

The "escape byte" makes it possible to compress a file that has a byte pair repeated many times, even when the file already uses all 256 possible byte values -- the compressor deliberately expands some rare byte into two bytes (the escape byte and the rare byte), in order to re-use that rare byte as a "marker byte" to represent that byte pair (compressing each instance of those two bytes into one byte).
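A sketch of the inner loop, with `pairs` mapping marker bytes to the byte pair each replaces (a simplification of the full 256-entry dictionary) and `escape` introducing literal bytes:

```python
def scz_inner_decode(block, escape, pairs):
    # One pass of the SCZ inner loop: expand marker bytes to byte
    # pairs, pass escaped and ordinary bytes through as literals.
    out = bytearray()
    i = 0
    while i < len(block):
        x = block[i]; i += 1
        if x == escape:            # escaped literal: next byte is literal
            out.append(block[i]); i += 1
        elif x in pairs:           # marker byte -> byte pair
            out += pairs[x]
        else:                      # ordinary literal byte
            out.append(x)
    return bytes(out)

# marker 0x01 stands for "ab"; 0x02 is the escape byte
assert scz_inner_decode(b"\x01c\x02\x01", 0x02, {0x01: b"ab"}) == b"abc\x01"
```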

The middle loop of the SCZ decoder is given an iteration count and a block of data. The middle loop generates the corresponding plaintext something like this:
 * If the iteration count is zero, the rest of the block is fully decompressed data -- done!
 * Read the header from the compressed block
 * read a byte from the compressed block; this byte becomes the new escape symbol.
 * Read the symbol count N, and read N dictionary entries into a dictionary. (Each entry is 3 bytes: the "marker" byte as stored in the compressed text, and the pair of bytes it represents in the uncompressed text.)
 * Pass the escape symbol, the dictionary, and the rest of the compressed block (the compressed segment data) to the inner loop, decompressing it to an uncompressed block.
 * swap the (now-empty) compressed block with the uncompressed block.
 * Decrement the iteration count and repeat from the beginning.

The outer loop of the SCZ decoder breaks the file up into a series of segments, breaks each segment up into a header (which contains the iteration count, among other things), a block of compressed data (which the outer loop passes to the inner loop for decoding), and a footer (which contains a checksum used to detect problems).

LZ78 algorithms
LZ78

LZW
A well-known example of the dictionary-based technique is LZW data compression, which operates by replacing strings of essentially unlimited length with codes that usually range in size from 9 to 16 bits.

The LZW algorithm, as used in the GIF file format, is perhaps the most famous and controversial compression algorithm.

While decoding a codeword to a "word" of 1 or more letters and emitting those letters to the plaintext stream, a LZW decompressor adds a new "word" to the dictionary. The new word is the concatenation of the previously-decoded word and the first letter of the word represented by the current codeword.

Larger codewords allow larger dictionaries, which typically improves compression.
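The decoder described above can be sketched over a list of already-unpacked integer codewords (real streams pack them into 9 to 16 bit fields):

```python
def lzw_decode(codes):
    # LZW decoder with an initial dictionary of all 256 single bytes.
    # Handles the classic special case where a codeword refers to the
    # dictionary entry currently being built (code == len(dictionary)).
    dictionary = [bytes([b]) for b in range(256)]
    prev = dictionary[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        if code < len(dictionary):
            word = dictionary[code]
        else:                      # the "cScSc" special case
            word = prev + prev[:1]
        out += word
        # new entry: previous word + first byte of current word
        dictionary.append(prev + word[:1])
        prev = word
    return bytes(out)

assert lzw_decode([97, 98, 256, 256]) == b"ababab"
assert lzw_decode([97, 256]) == b"aaa"   # exercises the special case
```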

GIF
The GIF file format typically starts with 9 bit codewords and gradually increases the width of the codeword, as necessary, up to a maximum codeword width of 12 bits. Many GIF implementations statically allocate a dictionary with 2^12 = 4096 dictionary entries, one for each possible codeword.

LZMW
LZMW (1985) by V. Miller, M. Wegman is a modification of LZW.

While decoding a codeword to a "word" of 1 or more letters and emitting those letters to the plaintext stream, a LZMW decompressor adds a new "word" to the dictionary. The new word is the concatenation of the previously-decoded word and the entire word (possibly the same word) represented by the current codeword. This allows dictionary entries to grow in length much more rapidly than LZW.

LZAP
LZAP (1988, by James Storer) is a modification of LZMW.

While decoding a codeword to a "word" of 1 or more letters and emitting those letters to the plaintext stream, a LZAP decompressor adds at least 1 and often more new "words" to the dictionary. The new words are the concatenation of the previous match with every initial substring of the current match. ("AP" refers to "all prefixes"). For example, if the previous match is "com" and the current match is "press", then the LZAP decompressor adds 5 new "words" to the dictionary: "comp", "compr", "compre", "compres", and "compress", where LZW would only add "comp", and LZMW would only add "compress".

This allows the number of words in the dictionary to grow much more rapidly than LZW or LZMW.
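The three dictionary-update rules can be compared side by side (a sketch; `new_entries` is an illustrative helper, not part of any of these algorithms' specifications):

```python
def new_entries(prev, cur):
    """Words each variant adds to the dictionary after decoding `cur`,
    given that the previously-decoded word was `prev`."""
    lzw = [prev + cur[0]]        # previous word + first letter
    lzmw = [prev + cur]          # previous word + entire current word
    lzap = [prev + cur[:i] for i in range(1, len(cur) + 1)]  # all prefixes
    return lzw, lzmw, lzap

lzw, lzmw, lzap = new_entries("com", "press")
# lzw  == ["comp"]
# lzmw == ["compress"]
# lzap == ["comp", "compr", "compre", "compres", "compress"]
```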

LZWL and word-based LZW
In 2005, Jan Lánský introduced syllable-based compression.

The LZWL compressor uses a syllable detector to divide the plain text into syllables. When the LZWL decompressor pulls a "phrase index" from the compressed text, it adds a new phrase to the dictionary -- formed from the entire previous phrase concatenated with the first syllable of the current phrase. When the LZWL decompressor pulls the special "empty syllable" phrase index from the compressed text, it adds a new syllable to the dictionary -- encoded character-by-character in the immediately following compressed text.

A similar "word-based LZW" divides the plain text into English words (in general, alternating maximal strings of alphanumeric characters and non-alphanumeric characters).

Character-based compression seems to give better compression with smaller files; word-based compression seems to give better compression with larger files. Some researchers suggest that syllable-based compression methods might give better compression with intermediate-sized files -- such as HTML pages.

LZW implementation tips
The dictionary used by the decompressor doesn't really need to store a literal string for each dictionary entry. Since each dictionary entry is composed of some previous dictionary entry concatenated with a single character, all the decompressor needs to store for each entry is an integer and a character: the index of some previous dictionary entry, and a literal suffix character.

This same compact dictionary also works with LZAP. With LZMW, rather than each entry storing a single suffix character, it will instead store a second integer: the index to some, possibly the same, previous dictionary entry.
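A compact dictionary entry of this kind, (index of a previous entry, suffix character), can be turned back into a literal string by walking the parent chain (a Python sketch; `entry_to_bytes` is an illustrative name, and indices below 256 are taken to be literal bytes):

```python
def entry_to_bytes(dictionary, index):
    """Reconstruct the string for a dictionary entry stored as
    (parent_index, suffix_byte) pairs. Indices < 256 are literal bytes."""
    out = bytearray()
    while index >= 256:
        parent, suffix = dictionary[index]
        out.append(suffix)       # collect suffix characters back-to-front
        index = parent
    out.append(index)            # the root of the chain is a literal byte
    out.reverse()
    return bytes(out)
```

For example, with entry 256 = (65, 66) ("A" + "B") and entry 257 = (256, 67), entry 257 reconstructs to b"ABC". Note the characters come out in reverse order, which is why LZ78-style decoders often build strings from the tail backwards.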

It's usually simpler and faster for the decoder to special-case decode any codeword with a value less than 256, and immediately output the literal byte it represents, rather than go through the motions of looking up such a codeword in the dictionary.

Some LZW compressors use a hash table to rapidly map the next few plain text characters to a dictionary entry. This works about the same as the hash table used to speed up LZ77-style compressors.

Some implementations of LZW and other LZ78 variants use a special search tree that takes advantage of the dictionary structure.

LZX
(We discuss LZX in a later chapter, Data Compression/Order/Entropy)

Reduced offset Lempel-Ziv (ROLZ)
The LZ77 variants described above encode a linear, uncoded, fixed-length offset. The possible values of that offset at any particular instant of decoding often include many values that will *never* be used. For example, once the decompressor pulls a copy length of 4, it may see dozens of copies of "the " and "and " in the text, but it knows that the compressor will always pick the most recent one -- all the values of the offset that select some other copy of "the " or "and " will never occur.

If we could somehow "skip over" those impossible offsets, we could encode the offset in fewer bits, and still be able to use the same (offset,length) phrase that the above LZ77 encoders would use.

Also, many possible values of the offset are highly unlikely to occur. For example, once the decoder finishes emitting a phrase that ends in "ch", in normal English text the next character is almost never another consonant. So all the offsets that point to a phrase starting with some consonant will almost never be used. By skipping over those highly unlikely offsets, we could encode the offset of the next phrase in fewer bits, and almost always be able to copy the same (offset, length) phrase that the above LZ77 encoders would use. (In some rare cases, the above LZ77 encoder picks some phrase that a ROLZ algorithm deems "unlikely" and skips over, forcing us to copy some other, shorter phrase; but hopefully sacrificing a few bits in those situations is paid back in saving a few bits in almost every other situation).

By "reduced offset", we mean that we have reduced the number of bits in the compressed file used to select a particular offset, when compared to using a linear, uncoded, fixed-length offset. We do *not* mean that the maximum possible offset is reduced -- in fact, most ROLZ compressors can copy a phrase from a much more distant offset than the typical 2^10 or 2^12 byte limit of linear, uncoded offsets.

Using something other than a linear, uncoded, fixed-length offset necessarily requires more computation at decode-time to translate that compressed selection value to the actual position in the decompressed-so-far text. This is yet another tradeoff between speed and compression ratio.

Ross Williams points out that all LZ77 and LZ78 algorithms can be cast into the form of Markov algorithms.

Malcolm Taylor wrote one (the first?) implementation of ROLZ as part of the WinRK compression suite. Ilia Muraviev wrote the first open-source implementation of ROLZ compression.

LZRW3-a
Of all the compressors in the LZRW series of "fast" data compressors, LZRW3-a(8) gives the best compression on the large text benchmark. Unlike most other ROLZ compressors, it completely ignores the most-recently-decoded byte of plaintext context. LZRW3-a(8) tries to (approximately) preserve the 8 most-recent instances of every 3-byte phrase that occurs in the plaintext; in effect, the decompressor expands a 12 bit code to a 20 bit uncoded offset into the 1 MB plaintext buffer. (The specific details of which partition the decompressor adds a pointer to, and the pseudo-random replacement used to discard old pointers when a partition is full, are beyond the scope of this document.) The inner loop of the LZRW3-a decoder first fetches a "literal/copy" bit from the compressed file:
 * 0: literal: the decoder grabs the next byte from the compressed file, which represents itself. Pass it straight through to the decompressed text. Every literal byte is "interesting", so a pointer to this byte is (eventually) added to the "table of 2^12 interesting locations in the previously-decoded text". Hopefully next time this byte can be copied as part of a long phrase via that pointer, rather than laboriously spelled out with literals again.
 * 1: copy item: the decoder grabs the next 2 bytes from the compressed file, which represents a phrase. The decoder breaks it into a 4 bit "length" and a 12 bit selector.
 * The 4 "length" bits are decoded into a length from 3 to 18 characters.
 * It uses that 12 bit selector to pick a location from the "table of 2^12 interesting locations in the previously-decoded text", and copies length characters from that location to the end of the decoded text. Also: add to the table a pointer to the beginning of this phrase.
 * Repeat from the beginning until there are no more items in the compressed file.

LZRW4 (Ross Williams, 1991) can be seen as a kind of ROLZ algorithm. Rather than a linear, uncoded offset for each phrase, the offset is represented by a 5 bit value. At the start of each phrase, the LZRW4 decoder hashes the two previous characters to obtain a partition number. The 5 bit value selects one of the 32 pointers in the context-selected partition. That points to some previously-decoded location in the 1 MB plaintext buffer. (In effect, the decompressor has expanded the 5 coded bits to a 20 bit uncoded offset). Then the decompressor pulls 3 bits from the compressed text and decodes them into one of 8 possible length values (2,3,4,5,6,7,8,16), then copies the selected (offset,length) phrase to the plaintext. Then the context partitions are updated. Because a "copy item" is only 9 bits (flag bit + offset selector + length), it does a good job compressing common byte pairs. Literal items are also 9 bits (flag bit + literal octet). (To increase speed, Williams doesn't store a pointer to *every* location where a 2-byte context occurs in the partition table, but only the 2 byte contexts that occur at the *end* of some phrase). (To reduce memory use, Williams doesn't keep track of the *exact* 2-byte context using 2^16 partitions, but instead hashes the 2 previous bytes of context into a 12 bit context with 2^12 partitions).

Perhaps the most extreme form of "reduced offset" is to reduce the offset to zero bits. This special case of ROLZ is called LZP.

LZP
LZP compression was invented by Charles Bloom. A LZP decompressor examines the current context (perhaps previous 3 bytes), finds the most recent occurrence of that context in the plain text (or some approximation), and uses only that one fixed offset. That gives us an extremely short compressed phrase (a flag bit + a match length). In some cases, that one fixed offset is the same one that LZ77 would have used -- in these cases, we've saved a lot of bits because we didn't store the offset. In other cases, that one fixed offset is not exactly the same one that LZ77 would have used, but a larger offset that gives us exactly the same copied phrase -- likewise, we've saved lots of bits. Inevitably, sometimes that one fixed offset gives us fewer useful bytes than the offset LZ77 would have chosen -- in the worst case, zero useful bytes. LZP requires more bits to encode those bytes than LZ77 would have done in that situation. But overall, LZP typically compresses better than LZ77.

A few implementations make sure the context table always points exactly to the most recent occurrence of every 3 byte context -- this requires updating the context table on every byte the decompressor emits. Many LZP implementations approximate this by only updating the table at the start of a phrase or literal, skipping the updates for all the bytes copied during that phrase. That speeds up decompression, and Arturo San Emeterio Campos claims it actually improves the compression ratio on longer files, although it hurts the compression ratio on shorter files.

A LZP decoder works something like this:
 * Look up the current context in the context table to get the Position. If this context has *never* occurred before, then we have a literal. Otherwise, fetch a "literal/copy" bit from the compressed file.
 * (In either case, remember that Position, but update the context table to point to the current location in the decompressed text. The next time we have the same context, we get this location, rather than that less recent Position)
 * 0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
 * 1: copy item: the decoder grabs a few bits from the compressed file and decodes them into a length. Starting at the Position where this context occurred before, copy "length" characters from that previously-decoded text to end of the decoded text. Then grabs the next literal byte from the compressed file and passes it straight through to the decompressed text.
 * Repeat from the beginning until there are no more items in the compressed file.
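The steps above can be sketched in Python. This is a simplified model (an assumption for illustration): the token stream arrives pre-parsed as ('lit', byte) and ('copy', length) pairs rather than packed flag bits, and the context table is updated only once per item, as in the faster implementations described earlier:

```python
def lzp_decompress(items, context_len=3):
    """Minimal LZP decoder sketch. `items` is a pre-parsed token stream of
    ('lit', byte) or ('copy', length); a 'copy' only occurs when the
    current context has been seen before."""
    table = {}           # maps last `context_len` bytes -> position
    out = bytearray()
    for kind, value in items:
        ctx = bytes(out[-context_len:])
        pos = table.get(ctx)          # the one fixed offset LZP uses
        table[ctx] = len(out)         # point table at the current location
        if kind == "lit":
            out.append(value)
        else:                         # copy `value` bytes from the predicted
            for i in range(value):    # position, one at a time (may overlap)
                out.append(out[pos + i])
    return bytes(out)
```

For example, after the literals "abcdabc" the context "abc" has been seen before, so a single ('copy', 5) token extends the output to "abcdabcdabcd" with no offset stored at all.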

Some documents describe a byte-oriented version of LZP:
 * Look up the current context in the context table to get the Position. (Remember that Position, but update the context table to point to the current location in the decompressed text. The next time we have the same context, we get this location, rather than that less recent Position).
 * Fetch the next byte from the compressed text.
 * If this context has *never* occurred before, *or* the fetched byte is not the escape byte, then we have a literal; otherwise, we have a copy item. (The compressor should have picked a value for the escape byte that occurs rarely in the plain text).
 * literal: pass it straight through to the decompressed text.
 * copy item: Discard the escape byte, and grab the length byte. Starting at the Position where this context occurred before, copy "length" characters from that previously-decoded text to end of the decoded text. Then grab the next literal byte from the compressed file and pass it straight through to the decompressed text, even if it happens to be the escape byte. (Occasionally the length will be 0, such as when the decompressed text *needs* to contain the special byte we use as the escape byte).
 * Repeat from the beginning until there are no more items in the compressed file.

Statistical Lempel Ziv
Statistical Lempel-Ziv was first proposed by Yu Fan Ho as the research topic of his Master's degree in Computer Science, supervised by Dr. Sam Kwong. They first published it as an application for personal digital assistants. Later, Yu Fan Ho applied it to ring tones for mobile phones, where it outperformed other available lossless compressors in compression ratio and decompression speed. The application of statistical Lempel Ziv to melody data is patented.

Traditional LZ-based technologies exploit the repetitive characteristics of the data. The decompression process can be done simply by copying the repeated data from the search window according to an index in the compressed data. Data not found in the window is left uncompressed in the compressed data. The uncompressed data is then shifted into the search window for the next repetition, and so on. The data is shifted into the window unconditionally, without considering any statistical information, and because of the limited size of the search window, the first-in data is shifted out unconditionally when the window is full. There is a high probability that the window is occupied by useless (non-repetitive) data while useful (to-be-repeated) data is banished. To improve the compression ratio, a larger search window must be used, and hence more memory is required in the decompressor.

Statistical Lempel Ziv is an LZ-like lossless compression algorithm, but statistical information is also taken into consideration to identify the useful data that should be put into the dictionary (search window). It improves the compression ratio compared with LZ77 because more useful data can be kept in the dictionary. The dictionary can be smaller while still holding the useful data, and hence less memory is required in the decompressor. Since not all the data has to be shifted into the window, less processing power is required in the decompressor as well.

ACB
Associative Coding by Buyanovsky (ACB) starts with an LZ77-like compressor, and does many things to squeeze out a lot of the redundancy in the (offset, length) items.

Tunstall code
The Huffman algorithm gives a fixed-to-variable-length code that (under certain constraints) is optimal.

The Tunstall algorithm gives a variable-to-fixed-length code that (under certain constraints) is optimal.
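The classic Tunstall construction can be sketched as follows: start with one leaf per source symbol, then repeatedly expand the most probable leaf into |alphabet| children, stopping when another expansion would exceed the 2^n available codewords. This is a sketch for memoryless (i.i.d.) sources, and `tunstall` is an illustrative name:

```python
import heapq

def tunstall(probs, codeword_bits):
    """Build the Tunstall dictionary (the set of variable-length source
    words, each of which maps to one fixed-length codeword)."""
    max_leaves = 2 ** codeword_bits
    # Max-heap of leaves, keyed on negated probability.
    heap = [(-p, word) for word, p in probs.items()]
    heapq.heapify(heap)
    leaves = len(heap)
    # Each expansion replaces 1 leaf with len(probs) leaves.
    while leaves + len(probs) - 1 <= max_leaves:
        neg_p, word = heapq.heappop(heap)
        leaves += len(probs) - 1
        for sym, p in probs.items():
            heapq.heappush(heap, (neg_p * p, word + sym))
    return sorted(word for _, word in heap)
```

For a binary source with P(a)=0.7, P(b)=0.3 and 2 bit codewords, the dictionary comes out as {"aaa", "aab", "ab", "b"}: long runs of the probable symbol "a" are absorbed into a single codeword.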

compressed format tips and tricks
Most decompression algorithms are most naturally described in terms of reading the compressed file as an undifferentiated stream of bits (a bitstream). Most programming languages, however, do file I/O a byte at a time.

So decompression implementations typically keep a separate "bit buffer" (initially empty). When you want to read a bit: if the bit-buffer is empty, read a (byte-aligned) byte from the compressed bytestream, store the 7 not-yet-needed bits in the bit-buffer, and use the first bit; otherwise, read the next bit from the bit-buffer.
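A minimal bit-buffer reader looks like this (a sketch assuming most-significant-bit-first order; real formats differ in bit order and in how many bits they fetch at once):

```python
class BitReader:
    """Reads a byte string as an undifferentiated bitstream, MSB first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0      # index of the next byte to fetch
        self.buf = 0      # byte whose bits are being handed out
        self.nbits = 0    # bits remaining in `buf`

    def read_bit(self):
        if self.nbits == 0:               # bit-buffer empty: refill it
            self.buf = self.data[self.pos]
            self.pos += 1
            self.nbits = 8
        self.nbits -= 1
        return (self.buf >> self.nbits) & 1

    def read_bits(self, n):
        value = 0
        for _ in range(n):
            value = (value << 1) | self.read_bit()
        return value
```

Reading 3 bits and then 5 bits from the byte 0xb4 (binary 10110100) yields 0b101 = 5 and 0b10100 = 20.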

Many file formats are designed to maintain "8 bit alignment" (some even try to maintain "16 bit alignment") wherever possible, in order to speed things up by reducing the amount of bit-shifting.

Often when decompressing, you've read some odd number of bits, and the abstract decompression algorithm mentions something like "read the next 8 bits and pass that literal byte directly to the output". There are three possible ways to design a file format to handle "read the next 8 bits":
 * "undifferentiated bitstream": get the remaining bits in the bit buffer, reload the bit buffer, and get a few more bits, then shift-and-align to re-assemble the byte.
 * Throw away the remaining "padding bits" in the bit buffer, and simply read the next (byte-aligned) byte from the compressed bytestream. Faster than the bitstream approach, but the unused bits hurt the compression ratio, and so we will speak no more of this approach.
 * "byte aligned": Temporarily ignore the bit-buffer; read the (byte-aligned) byte directly from the compressed bytestream. (Only use the bit-buffer when the decompression algorithm reads a bit or two or seven).

Compared to an undifferentiated bitstream format, a byte-aligned format for compressed text is precisely the same number of bytes long and has exactly the same bits; the bits are merely arranged in a slightly different order. So picking a byte-aligned format gives precisely the same compression ratio, and it often makes things faster.

slow naive method
The obvious way to implement a LZ77-style (offset, length) compressor involves scanning the entire window for a match to the following plaintext, for each and every phrase we eventually write into the compressed file, and then picking the longest (and nearest) match.

The obvious way to implement a LZ78-style (dictionary entry) compressor involves scanning the entire dictionary for a match to the following plaintext, for each and every dictionary index we eventually write into the compressed file.

For a file of length N this is O(N) for each compressed phrase, so compressing the entire file requires O(N^2) time (Schlemiel the Painter's algorithm; this assumes the LZ77 window covers at least half of the file and a roughly constant compression ratio). This is excruciatingly slow.

speedup
Most LZ77-style and LZ78-style compressor implementations use a speedup trick: Whenever they output a compressed phrase, they do a little O(1) "extra" work updating some data structure -- typically a trie or hash table. Then, instead of sequentially scanning the entire window for a match, they look up the following plaintext (typically the next 3 bytes) in that data structure.

trie
Some LZ77-style and LZ78-style compressor implementations use a Trie data structure.

hash tables
Most LZ77-style and LZ78-style compressor implementations use a hash table. Then, instead of sequentially scanning the entire window for a match, they look up the following plaintext (typically the next 3 bytes) in the hash table. In a hash table, searches immediately succeed or fail in O(1), and therefore it only takes O(N) to compress the entire file. This is far faster.
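The hash-table lookup can be sketched as follows. This version uses the "direct-indexed cache" variant described below, remembering only the single most recent position for each 3-byte key (`find_match` and `MIN_MATCH` are illustrative names, not any particular library's API):

```python
MIN_MATCH = 3

def find_match(data, pos, head):
    """Look up the next 3 bytes of plaintext in `head`, a dict mapping each
    3-byte key to the most recent position where it occurred. Returns an
    (offset, length) pair, or None if there is no usable match."""
    if pos + MIN_MATCH > len(data):
        return None
    key = bytes(data[pos:pos + MIN_MATCH])
    candidate = head.get(key)
    head[key] = pos                  # remember only the most recent position
    if candidate is None:
        return None
    # Extend the match as far as it goes (may run into the match region
    # itself, which corresponds to a legal overlapping LZ77 copy).
    length = 0
    while (pos + length < len(data)
           and data[candidate + length] == data[pos + length]):
        length += 1
    if length >= MIN_MATCH:
        return (pos - candidate, length)
    return None
```

On the input b"abcabcabc", after the table has been primed at positions 0..2, looking up position 3 finds the earlier "abc" and extends it to the overlapping match (offset 3, length 6).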

Many LZ77-style compressors deliberately sacrifice the maximum possible compression in two ways: (a) Some compressors only update the hash table once per phrase, rather than once per every byte. (b) Some compressors use a direct-indexed cache -- a very simple hash table without collision detection -- where the table remembers only the single most-recent item with that hash index, rather than a standard hash table that remembers *every* item (so far) with that hash index. Typically each of (a) or (b) hurts compression by a few percent, but can roughly double or triple the speed of the compressor. Many LZ77-style compressors do both (a) and (b), hurting compression by a few percent, but getting roughly four to ten times the speed.

pseudo-RLE
RLE can be emulated with overlapping copies in most LZ77-like decompressors ("pseudo-RLE mode"). The decompressor must use a method that handles such overlapping copies properly -- in particular, the decompressor must not use memcpy or memmove on such overlapping copies. On some machines, it's faster to simply use the same "copy one byte at a time" copy_bytes routine for all copies. On other machines, programmers find that memcpy is so much faster that it reduces the total decompression time if the decompressor includes extra code that, for each copy item, decides whether the copy overlaps, and uses the slower copy_bytes only when memcpy won't work.
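The byte-at-a-time rule can be demonstrated directly: when the offset is smaller than the length, each copied byte may itself be a byte the copy just produced, which is exactly what makes pseudo-RLE work (`copy_bytes` here is an illustrative sketch):

```python
def copy_bytes(buf, offset, length):
    """Append `length` bytes copied from `offset` bytes back, one byte at
    a time, so overlapping copies (offset < length) behave like RLE."""
    start = len(buf) - offset
    for i in range(length):
        buf.append(buf[start + i])   # may read a byte appended this copy

buf = bytearray(b"ab")
copy_bytes(buf, 2, 6)   # offset 2, length 6: overlapping copy
# buf == b"abababab"
```

A block memcpy of the same (offset 2, length 6) copy would read past the 2 bytes that existed when the copy began, producing garbage; the one-byte-at-a-time loop instead re-reads its own output and repeats "ab" three more times.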