Reverse Engineering/File Formats

This section will talk about reverse-engineering proprietary file formats. Many software developers need to reverse engineer a proprietary file format, especially for the purposes of interoperability. For example, every year the Open Office project needs to reverse engineer the Microsoft Office file formats. Furthermore, reverse engineering is required for forensics purposes. The chapters in this section will talk about how to understand a proprietary file format.

File Header
Most file formats begin with a "header," a few bytes that describe the file type and version. Because there are several incompatible file formats with the same extension (for example, ".doc" and ".cod"), the header gives a program enough additional information to see if this file is one of the formats that program can handle.

Many programmers package their data in some sort of "container format" before writing it out to disk. If they use the standard zlib to hold their data in compressed form, the file will begin with the 2 bytes 0x1f 0x8b (in decimal, 31 139 ).

Blank Space
Some files are made up largely of blank space, for example, .ds_store files generated by OS X. Blank space will appear as a series of 0's in a hex editor. The creators of a file format may add blank space for a variety of reasons, for example, the author of this study on .ds_store files speculated that they exist to speed up writing data, as other data would not need to be pushed around to make room. They could also serve to prevent fragmentation.

For most purposes, blank space can be ignored.

Tools
File format reverse engineering is the domain of hex editors. Typically they are used more often to display file contents as opposed to editing them. Hex editors allow you to superimpose a data structure on top of the data (sometimes called custom views or similar), which are very helpful. Once a particular structure has been discovered in a file, these mechanisms can be used to document the structure, as well as to provide a more meaningful display of the information than just hex code.

Also useful are Unix/Linux tools like strings(1) and file(1).


 * strings:Finds and prints sequences of printable characters in a file. This can give hints of what data is embedded in the file.
 * file:Attempts to determine a file type. Sometimes file format designers re-use already well known file formats or file compression algorithms. There is a small but notable chance that file(1) can reveal this.

Windows ports of these tools are also available. E.g. as part of the Cygwin environment (strings is part of the binutils, don't ask ...).

Equally important as a hex editor is a brain. File format reverse engineering means to reason about what the hex editor and other tools displays. To guess structures, relations and the meaning of the data, to develop theories and then verify them. Very few tools can help here.

In a few limited cases additional tools are helpful. E.g. for checking brute-force if a particular part of the file consists of some embedded, compressed data. Typically such tools are written or scripted on the fly as custom tools. Another typical set of custom tools are the ones which are used to break up a file into separate components - once it has been discovered that a particular file indeed consists of separate parts, and how they are separated in the file. C/C++, Java, but also scripting languages like Perl are often used here (Perl because it can handle binary data, while classic Unix scripting tools are often limited to text data only).

In some cases a proprietary file format might contain executable code. For example, a firmware update file for some embedded device very likely contains executable code. Typically that code is wrapped into some structure, e.g. a file system, compressed, garnished with boot/flash code etc. In such cases a disassembler/decompiler for the particular executable format might be helpful.

Further, documentation of and familiarity with checksum algorithms, compression algorithms, encoding techniques, and also programming languages is very helpful.

Also very helpful is the availability of the application that produces and reads the proprietary file format. That application can be used to create test files, but also to verify if an own generated file is correct.

Strategies
Look for the obvious first. E.g. magic numbers, a block structure, ASCII text in the file. Anything that can be more or less identified clearly can be the entry ticket to more. Once a particular structure has been identified, look for in-file pointers to that data. E.g. if the data is referenced from some other part of the file with an absolute or relative address. It is also very important to find out the byte order (little endian or big endian).

Choosing the target
If you have access to the software that created the file, you can always create files with the contents of your choosing. This makes reverse engineering substantially easier. In cryptography terms, you are engaging in a chosen-plaintext attack.

Probing
Once you formulate theory as to what some data in the file might mean, you can verify that theory by creating a manipulated file. Replace it with some other data using a hex editor or a custom tool. Then load the manipulated file into the original application. If the application loads the file and displays the intended change, the theory is probably correct. Sometimes it is not trivial to change the application and reload it because of the defense mechanism that may be present. Some application check the hash and signature of the code before running it.

Introduction
File formats which are either in part or completely compressed, encrypted or scrambled are among the toughest nuts to crack. Of course, compression is different from encryption, and typically done for a different purpose. However, the resulting file formats often look similar: A bunch of gibberish. This is the intended result when file format designers go for encryption, but it is also often a desired side effect when compression is applied.

If checking a file with a hex editor or similar reveals that it just contains gibberish and e.g. not any easy to identify text strings, patterns or similar, it might indicate that the particular file is compressed, encrypted or scrambled. The methods for reverse engineering these files are similar. There might, however, be a big difference from a legal point of view. Many countries have laws against circumventing copy protection, and encryption can be seen as some kind of copy protection. See Reverse Engineering/Legal Aspects for some more hints regarding this, and seek qualified legal advice before attempting to reverse engineer an encrypted or otherwise protected file format. Similar issues might arise when a file format just uses scrambling. The format "owner" might argue that the scrambling is used as some kind of copy protection, encryption or whatever, and circumventing it might break some law. Again, seek qualified legal advice.

The remainder of this section only deals with reverse-engineering the compression of a file. This is typically just an initial step in the complete reverse-engineering process. Once it has been successfully decompressed, other reverse engineering methods need to be applied to identify the file contents and structure.

Well-Known Compression Algorithms
Often file format designers apply well known compression algorithms. Either in the form of even using a particular, well known implementation of a certain algorithm (a well-known tool), or by re-implementing a well known algorithm unchanged. In the easiest case this has been documented. For example, it is well documented that the OpenOffice file format uses ZIP archives, and therefore there is no point in reverse engineering that format.

Unfortunately, for many formats we don't have this documentation. In case a well-known implementation of a particular algorithm has been used it is often relatively easy to reverse engineer. Such compressed file formats tend to start with a format identifier (magic number), clearly identifying the particular compressed format. The compression tool has left its "fingerprint" in place.

{{Example|The following is a hex dump of the first few bytes of a fictitious firmware update file for a particular SOHO router 00000000 60 ea 27 00 1e 06 01 00  10 00 02 84 84 86 dc 34  | `.'............4 | 00000010 84 86 dc 34 00 00 00 00  00 00 00 00 00 00 00 00  | ...4............ | 00000020  00 00 44 54 41 2e 41 52  4a 00 00 b1 18 78 a6 00  | ..DTA.ARJ....x.. | 00000030  00 60 ea 27 00 1e 06 01  00 10 01 00 84 4c 86 dc  | .`.'.........L.. | 00000040  34 9b 17 0c 00 e8 a4 25  00 25 10 0d 10 00 00 20  | 4......%.%..... | 00000050  00 00 00 44 54 41 2e 4d  45 4d 00 00 50 98 0b 8f  | ...DTA.MEM..P... | 00000060  00 00 1f 30 84 dd 7b db  48 da 6f fd ee fd bb da  | ...0..{.H.o..... | }}

The file is compressed with ARJ. Not only does the string DTA.ARJ give it away for the human eye, but also the first two bytes 60 ea, which are known to identify ARJ-compressed files.

The Unix/linux tool file(1) is quite aware of many standard compressed file formats.

The next steps after the compression format has been discovered is obvious: To obtain a version of the used compression tool and to use it to decompress the data. The result, however, often needs more reverse engineering. For example, the above mentioned router firmware might contain separate sections for separate areas of the router's flash memory, each guarded with an own checksum.

A variant of using a well known compression algorithm and tool can also sometimes be found, which is more difficult to reverse engineer. In such a case the file is prefixed with some additional data, and the actual compression format can't be identified by just checking the file format. Lets assume, for example, another fictitious SOHO router's firmware update file, which is build as it follows:

Of course, the format can only be known once the file format has been reverse engineered. So how is that done? Well, in the fictitious case we assume that an inspection with the Unix/Linux tool strings(1) reveals the following interesting strings in the file:

The strings are very revealing, and those knowledgeable will recognize the name Mark Adler as one of the authors of zlib zlib, which is the base for info-zip as well as GNU's gzip. Those not so knowledgeable might at least have the idea to search for the name and the keyword compression.

It is a good bet to assume that at least parts of the file are ZIP compressed. Further probing might reveal that the file does not contain a complete ZIP archive, but just a section which is compressed with the ZIP deflate algorithm, and supposed to be decompressed with the ZIP inflate algorithm (likely version 1.1.3, as the output of strings revealed). Therefore, the fictitious file might be further separated into its components by using a custom tool which iteratively applies the <tt>inflate</tt> algorithm to the file, until the generated result makes some sense (e.g. until the result contains some recognizable clear text strings).

Unknown or homemade Compression Algorithms
If the software that either creates or reads the file is available then it is very possible to reverse the file format. You can use live analysis of the running application when reading/writing the file. Doing this is likely the easiest way to determine the data structure of the file.

If the software is not available, all bets are off if there is an unknown or homemade/ad-hoc compression algorithm, or a non-standard implementation of a known algorithm. One has to be exceptionally lucky to figure out the details of the applied algorithm, so the accompanying decompression algorithm can be constructed, although cryptologists strongly discourage the use of ad-hoc encryption schemes, as they typically do not stand up to serious cryptanalysis.

Sometimes additional information can be found. E.g. if a vendor has filed a patent application for a particular algorithm, or is known to have fallen in love with a particular compression technology in other products, e.g. communication protocols. Sometimes it might turn out that the file format actually belongs to some OEM or 3rd party product, and that information about that product is available.

Otherwise, there is a small chance that trial-and-error might reveal something about the file. e.g. run-length encodings are a popular, simple and easy to implement compression algorithm, so they can sometimes be found in homemade implementations. It might be worth a try to investigate if a file might be compressed that way. An investigation of a few other well known compression techniques might also be worth a try.

Last but not least, crypto-analysis techniques might reveal something interesting about the compression. E.g. reoccurring blocks of information might point to a particular compression algorithm. However, this requires a lot of effort, time and skill.