Choosing The Right File Format/Is there a problem

Is there a problem?
If you are one of the many people who used to use WordPerfect or WordStar and have since switched to a different editor, you may already be familiar with the problem of retrieving your own information from certain types of files. Or perhaps you switched from one operating system to another, from Amiga to Windows, or Windows to Macintosh. Stated simply, file formats for different software far too often leave your information scrambled in a way you cannot decipher again years later.

If this seems a bit theoretical, here are two stories that illustrate why choosing the right format for your information matters.

"The English Tourist"

"A tourist walks into a very nice restaurant in a lovely village in the French countryside and mutters in English 'Are you still serving lunch?' No one reacts, so he says louder, 'Do you have a TABLE where I might DINE?' Recognizing a few words and realizing that the tourist must only speak English or isn't interested in trying his French, one of the employees goes off to find someone who might be able to help this ignorant tourist."

"After a long delay, someone comes, interprets his request and finds him a seat in the restaurant. The tourist is handed a menu. 'I can't read this! It is in French! What are Cervelles anyways?' The helpful interpreter is called back and the tourist has the whole menu explained to him and is finally ready to order a meal. By now our hapless tourist is getting hungry and frustrated and, in just the way everyone gets when they are frustrated and hungry forgets their manners and blurts, 'By the way, I am going to order in English so I can be sure of what I am getting - and for the privilege of taking my order, I demand that you pay the Queen of England a small sum for the use of this language which you should really just learn to use like everyone else!'"

"After this last sentence is finally translated back to the previously friendly proprietors, the kitchen is closed and the tourist is sent packing."

Where the tourist went wrong, in file-format terms, is this: although he is happy with the format he is using (unlike the Roman official in the next story), he has forgotten that different people do things differently. In a different context his preferred format (English) is not supported. You are in the same situation if your favourite software company goes bust or stops supporting the software you bought. Files that were once so convenient can become useless over time.

"The Roman Official"

"An official in ancient Rome by the name of Gallus hires a scribe called Taruna who understands Latin but can only write in a rare (and unrecorded) dialect of Sanskrit. After Taruna has been in the job for some years Gallus finds he is actually too slow and keeps losing important documents. Taruna is turned out into the street and goes back to his family in disgrace."

"The following day the official employed a highly regarded new assistant and sent him into the archive. A few minutes later the assistant came out in tears explaining that he only knows a few words of Sanskrit, can't find any references to the dialect used and could never hope to make sense of these documents."

"Frantically they search for Taruna. When they find him they ask him to come back to work, but he sees their problem. So he says with a smile, 'I will happily come back to work; you just need to double my pay and holidays!'"

In modern terms, where the Roman official went wrong was in using an unpublished format (an unrecorded dialect of Sanskrit) to store his information. He was then trapped by this format and forced to keep buying the software (the scribe's services) at ever-increasing cost. He had lost control of his own information!

In a report written for The National Archives (UK) in 2003, Adrian Brown summarises how to proceed.

"The selection of file formats for creating electronic records should ... be determined not only by the immediate and obvious requirements of the situation, but also by longer-term considerations. An electronic record is not fully fit-for-purpose unless it is sustainable throughout its required life cycle. ... It is therefore highly desirable to identify the minimum set of formats which meet both the active business needs and the sustainability criteria below, and restrict data creation to these formats. (PDF)"

Project Gutenberg's approach to this challenge has been a strict criterion that all 15,000+ books in its digital repository be stored in plain ASCII text.

"Whenever possible, Project Gutenberg distributes a plain text version of an eBook. Other formats, such as HTML, XML, RTF, and others are also welcome, but plain text is the 'lowest common denominator.' We stress the inclusion of plain text because of its longevity: Project Gutenberg includes numerous text files that are 20-30 years old. In that time, dozens of widely used file formats have come and gone. Text is accessible on all computers, and is also insurance against future obsolescence."

Does that mean we cannot use word processors if we want long-term access to the information in our documents? Well, yes and no. If you want long-term file readability (of Latin-script languages), as Project Gutenberg does, then ASCII text is the way to go. This might be something to consider for financial records and other valuable information. If, as many people do, you have non-text information, like images and sounds, then this is the article to read. Either way, there are a lot of common errors you can avoid, which will at the very least make future migrations to the next generation of file formats much easier.
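If you are unsure whether a file you already have really is plain ASCII text, a few lines of Python can check. This is only a minimal sketch: the function name is_plain_ascii is made up for this illustration, and it assumes the file is small enough to read into memory in one go.

```python
from pathlib import Path


def is_plain_ascii(path):
    """Return True if every byte in the file is a 7-bit ASCII character."""
    data = Path(path).read_bytes()
    try:
        # Decoding fails on the first byte with a value above 127.
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True
```

A file containing only unaccented Latin letters, digits and common punctuation passes the check; a file containing an accented letter such as é does not, whichever encoding was used to save it.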

Let's now take a real-world scenario. Many people use the Microsoft Windows operating system and the Microsoft Office package, which includes the word processor Microsoft Word (or just MS Word). The default file format of MS Word is DOC. So what is DOC like for long-term storage?

MS Word is a proprietary program, and the .doc file extension denotes a proprietary format. That means that how the software works and stores your information is secret - only Microsoft knows exactly how it all works.
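Even without understanding a file's contents, you can often tell a binary document apart from plain text by looking at its first few bytes. The sketch below assumes the binary .doc layout written by Word 97 to 2003, which is wrapped in an OLE2 compound file container that starts with a fixed eight-byte signature. The function name looks_like_legacy_doc is made up for this illustration. A match only shows that the file is some kind of OLE2 container, not that it is definitely a Word document.

```python
# Eight-byte signature found at the start of OLE2 compound files.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_legacy_doc(path):
    """Return True if the file begins with the OLE2 container signature."""
    with open(path, "rb") as f:
        return f.read(8) == OLE2_SIGNATURE
```

A plain text file will not begin with these bytes, and four of the eight have values above 127, so they are not ASCII characters at all.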