Data Compression/Executable Compression

Executable software compression
In this chapter, we are careful to distinguish between 3 kinds of software:
 * "data compression software", the implementation of algorithms designed to compress and decompress various kinds of files -- this is what we referred to as simply "the software" in previous chapters.
 * "executable files" -- files on the disk that are not merely some kind of data to be viewed or edited by some other program, but are the programs themselves -- text editors, word processors, image viewers, music players, web browsers, compilers, interpreters, etc.
 * "source code" -- human-readable files on the disk that are intended to be fed into a compiler (to produce executable files) or fed into an interpreter.

Executable files are, in some ways, similar to English text -- and source code is even more similar -- and so data compression software that was designed and intended to be used on English text files often works fairly well with executable files and source code.

However, some techniques and concepts apply to executable software compression that don't really make sense for any other kind of file, such as:
 * self-decompressing executables, including boot-time kernel decompression
 * dead-code elimination and unreachable code elimination (has some analogy to lossy compression)
 * refactoring redundant code, and the opposite process: inlining (has some analogy to dictionary compression)
 * some kinds of code size reduction may make a subroutine, when measured in isolation, appear to run slower, yet improve the net performance of the whole system. In particular, loop unrolling, inlining, "complete jump tables" vs. "sparse tests", and static linking may all make a subroutine appear to run faster when measured in isolation, but may increase instruction cache misses, TLB misses, and virtual memory paging activity enough to reduce the net performance of the whole system.
 * procedural abstraction
 * cross-jumping, also called tail merging
 * pcode and threaded code
 * compressed abstract syntax trees were, as of 1998, the densest known executable format, and yet execute faster than bytecode.
 * JavaScript minifiers (sometimes called "JavaScript compression tools") convert JavaScript source into "minified" JavaScript source that runs the same, but is smaller. CSS minifiers and HTML minifiers work similarly.
 * Simple minifiers strip out comments and unnecessary whitespace.
 * Closure Compiler does more aggressive dead-code elimination and single-use-function inlining.
 * code compression
 * run-time decompression
 * demand paging, also called lazy loading, a way to reduce RAM usage
 * shared library
 * PIC shared libraries and other techniques for squeezing a Linux distribution onto a single floppy, or for building a single-floppy X Window System thin client.
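As a toy illustration of the minification idea mentioned above, the following Python sketch strips comments and collapses whitespace in a JavaScript fragment. This is deliberately naive: real minifiers such as Closure Compiler actually parse the source, whereas these regexes would corrupt string literals that happen to contain comment markers.

```python
import re

def naive_minify(js: str) -> str:
    # Strip /* ... */ block comments and // line comments.
    # (Toy sketch only: these regexes would also mangle string
    # literals that happen to contain comment markers.)
    js = re.sub(r"/\*.*?\*/", "", js, flags=re.DOTALL)
    js = re.sub(r"//[^\n]*", "", js)
    # Collapse runs of whitespace, then drop spaces around punctuation.
    js = re.sub(r"\s+", " ", js)
    js = re.sub(r"\s*([{};=(),+*/-])\s*", r"\1", js)
    return js.strip()

source = """
// add two numbers
function add(a, b) {
    return a + b;  /* sum */
}
"""
print(naive_minify(source))
```

The output runs the same as the original but is considerably smaller; on real code bases the savings compound across thousands of functions.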


 * copy-on-write and other technologies for reducing RAM usage
 * various technologies for reducing disk storage, such as storing only the source code (possibly compressed) and a just-in-time in-memory compiler like Tiny C Compiler (or an interpreter), rather than storing only the native executable or both the source code and the executable.
 * Selecting a machine language or higher-level language with high code density
 * various ways to "optimize for space" (including the "-Os" compiler option)
 * using newlib, uClibc, or sglibc instead of glibc
 * code compression for reducing power
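The idea of storing only (compressed) source code and materializing runnable code on demand can be sketched in a few lines of Python, with Python source standing in for C and `exec` standing in for a just-in-time compiler like Tiny C Compiler. The function name `square` is purely illustrative.

```python
import gzip

# Only the compressed source lives on "disk"; no executable is stored.
source = b"def square(n):\n    return n * n\n"
stored = gzip.compress(source)

# At run time, decompress and hand the source straight to the interpreter.
namespace = {}
exec(gzip.decompress(stored).decode(), namespace)
print(namespace["square"](7))
```

The trade-off is classic space vs. time: disk storage shrinks, but every cold start pays for decompression and (re)compilation.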


 * Multiple "levels" or "waves" of unpacking: a file starts with a very small (machine-language) decompressor -- but instead of decompressing the rest of the file directly into machine language, the decompressor decompresses the rest of the file into a large, sophisticated decompressor (or interpreter or JIT compiler) (in machine language) and further data; then the large decompressor (or interpreter or JIT compiler) converts the rest of the file into machine language.
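A rough sketch of this multi-wave scheme in Python, with `zlib` standing in for the compressor and `exec` standing in for jumping into freshly unpacked machine code; all names here are illustrative.

```python
import zlib

# Wave 2 payload: the actual "program", shipped compressed.
payload_blob = zlib.compress(b"answer = 6 * 7\n")

# Wave 1: a larger, more capable unpacker -- itself shipped compressed.
stage1_src = (
    b"import zlib\n"
    b"def unpack_and_run(blob):\n"
    b"    ns = {}\n"
    b"    exec(zlib.decompress(blob).decode(), ns)\n"
    b"    return ns\n"
)
stage1_blob = zlib.compress(stage1_src)

# Wave 0: the tiny bootstrap decompressor at the front of the file.
boot_ns = {}
exec(zlib.decompress(stage1_blob).decode(), boot_ns)  # unpack the unpacker
result = boot_ns["unpack_and_run"](payload_blob)      # unpack, then run, the payload
print(result["answer"])
```

In a real packer, wave 0 is a few hundred bytes of hand-written machine code, and wave 1 can afford a much stronger decompression algorithm than would be practical to hand-write.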

compress then compile vs compile then compress
Which stage of compilation gives the best compression:
 * compress the final machine-specific binary executable code?
 * compress the original machine-independent text source code? For example, JavaScript minifiers.
 * compress some partially compiled, machine-independent intermediate code? For example, "Slim Binaries" or the "JAR format".

Some very preliminary experiments give the surprising result that compressed high-level source code is about the same size as compressed executable machine code, but that compressing a partially compiled intermediate representation gives a larger file than either one.
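CPython makes it easy to run a small version of this experiment yourself, using marshalled bytecode as the partially compiled intermediate representation. The sample source and the repetition factor are arbitrary choices for illustration.

```python
import marshal, zlib

# Some source text, repeated so the compressor has real redundancy to find.
src = (
    "def fib(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
) * 50

# Intermediate representation: CPython bytecode, serialized with marshal.
ir = marshal.dumps(compile(src, "<src>", "exec"))

compressed_src = zlib.compress(src.encode(), 9)
compressed_ir = zlib.compress(ir, 9)
print(len(compressed_src), len(compressed_ir))
```

Results will vary with the compressor, the language, and the intermediate format, so a single run proves little; the point is that the comparison is cheap to make.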

filtering
Many data compression algorithms "filter", "preprocess", or "decorrelate" raw data before feeding it into an entropy coder. Filters for images and video typically have a geometric interpretation. Filters specialized for executable software include:
 * detecting "CALL" instructions and converting their operands from relative addressing to absolute addressing, so that "calls to the same location resulted in repeated strings that the compressor could match, improving compression of 80x86 binary code" (LZX (algorithm))
 * recoding branches into a PC-relative form
 * Instead of decompressing the latest version of an application in isolation, starting from nothing, start from the old version of an application, and patch it up until it is identical to the new latest version. That enables much smaller update files that contain only the patches -- the differences between the old version of an application and the latest version. This can be seen as a very specific kind of data differencing.
 * The algorithm used by BSDiff 4 uses suffix sorting to build relatively short patch files.
 * Colin Percival, for his doctoral thesis, developed an even more sophisticated algorithm for building short patch files for executable files.
 * "disassemble" the code, converting all absolute addresses and offsets into symbols; then patch the disassembled code; then "reassemble" the patched code. This makes the compressed update files for converting the old version of an application to the new version of an application much smaller.
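A minimal Python sketch of the CALL-instruction filter described above, for 32-bit x86. It naively treats every 0xE8 byte as a CALL opcode, which a production filter would not do, and a real filter must also be exactly invertible so the decompressor can restore the original bytes.

```python
import struct

def e8_filter(code: bytes) -> bytes:
    """Rewrite the rel32 operand of each x86 CALL (opcode 0xE8) as an
    absolute target address, so calls to the same target become
    identical byte strings that an LZ77-style matcher can exploit.
    The inverse filter simply subtracts the instruction position back."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:
            rel = struct.unpack_from("<i", out, i + 1)[0]
            target = (i + 5 + rel) & 0xFFFFFFFF  # address the CALL jumps to
            struct.pack_into("<I", out, i + 1, target)
            i += 5  # skip over the 4-byte operand
        else:
            i += 1
    return bytes(out)

# Two CALLs to the same target (0x100) from different offsets (0 and 8):
code = bytearray(16)
code[0] = 0xE8
struct.pack_into("<i", code, 1, 0x100 - 5)        # rel32 from offset 0
code[8] = 0xE8
struct.pack_into("<i", code, 9, 0x100 - (8 + 5))  # rel32 from offset 8
filtered = e8_filter(bytes(code))
print(filtered[1:5] == filtered[9:13])  # identical operands after filtering
```

Before filtering the two operands differ (0xFB vs. 0xF3); after filtering both read 0x00000100, giving the back-end compressor a repeated string to match.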

Several programmers believe that a hastily written program will be at least 10 times as large as it "needs" to be.

A few programmers believe that 64 lines of source code is more than adequate for many useful tools.

The aim of the STEPS project is "to reduce the amount of code needed to make systems by a factor of 100, 1000, 10,000, or more."

Most other applications of compression -- and even most of these executable compression techniques -- are intended to give results that appear the same to human users, while improving things in the "back end" that most users don't notice. However, some of these program compression ideas (refactoring, shared libraries, using higher-level languages, using domain-specific languages, etc.) reduce the amount of source code that a human must read to understand a program, resulting in a significantly different experience for some people (programmers). That time savings can lead to significant cost reduction. Such "compressed" source code is arguably better than the original -- in contrast to image compression and other fields, where compression gives, at best, something identical to the original, and often something worse.