XQuery/DocBook to Microsoft Word

Motivation
You want to create a Microsoft Word document from a DocBook file.

Method
There are two steps to building a high-quality DocBook to MS Word .docx transform, listed below. In this article we will create a zip file using the Microsoft Open Packaging Conventions (OPC) format. OPC files can be opened with any desktop unzip program, but they must be created programmatically to ensure that specific files are placed in the correct order. We will create several small XQuery functions that extract key elements from the input DocBook 5 file and generate the XML files defined by the Office Open XML specification. Then we will assemble all the components into a zip file using a single generate-docx function.
 * 1) Create a docx generator that assembles a zip file with all the correct components
 * 2) Create a typeswitch transform that converts each DocBook element into the appropriate Office Open XML format

The other transformation will follow a very similar pattern.

Zip File Generation
This section shows you the process of generating a zip file using XQuery. It depends on having a zip function that lets you specify each component of the output file and that writes the components to the archive in the order you list them.

The Output File Configuration
The output is a zip file with the following contents:
 * [Content_Types].xml - a single XML file in the root directory. This file MUST be placed first in the zip file collection.
 * _rels - a folder that has a single .rels file, an XML file that describes the relationships between files
 * docProps - a folder that holds the document properties files. These are usually the app.xml and core.xml files
 * word - a folder with all the Word content and two subfolders. Typical content includes:
    * _rels folder with a single file such as document.xml.rels in it
    * theme folder with a single file such as theme1.xml in it
    * document.xml
    * fontTable.xml
    * settings.xml
    * styles.xml
    * webSettings.xml

Sample Use of Zip Function
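The compression:zip($entries, true()) function takes two parameters. The first is a sequence of entry elements, one for each file or collection we are going to create. The second is a boolean that controls whether the collection hierarchy is used for the paths inside the archive.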

Here is the entry for building the main [Content_Types].xml file.
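A sketch of such an entry follows. It assumes the eXist-db compression module, whose zip function accepts entry elements with name, type and method attributes. The Override list is illustrative and should match the parts your package actually contains.

```xquery
xquery version "3.0";

(: Sketch: the zip entry that carries [Content_Types].xml.
   Assumption: the eXist-db <entry> format for compression:zip. :)
let $content-types-entry :=
    <entry name="[Content_Types].xml" type="xml" method="store">
        <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
            <Default Extension="rels"
                ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
            <Default Extension="xml" ContentType="application/xml"/>
            <Override PartName="/word/document.xml"
                ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
            <Override PartName="/word/styles.xml"
                ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
            <Override PartName="/docProps/core.xml"
                ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
            <Override PartName="/docProps/app.xml"
                ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
        </Types>
    </entry>
return $content-types-entry
```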

The following is an example of how we put a file in the _rels folder with the .rels file name:
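In this sketch the folder is expressed through the path in the name attribute, which assumes the eXist-db entry format. The relationship Ids are arbitrary labels chosen for illustration.

```xquery
xquery version "3.0";

(: Sketch: the zip entry for the package-level relationships file.
   The path in @name places the file in the _rels folder. :)
let $rels-entry :=
    <entry name="_rels/.rels" type="xml" method="store">
        <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
            <Relationship Id="rId1"
                Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
                Target="word/document.xml"/>
            <Relationship Id="rId2"
                Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
                Target="docProps/core.xml"/>
            <Relationship Id="rId3"
                Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
                Target="docProps/app.xml"/>
        </Relationships>
    </entry>
return $rels-entry
```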

So to build a docx file we "assemble" each of the entries using the compression:zip function and then return the binary stream to the web browser with the correct MIME type and file name. The browser downloads the file and you can then open it with Microsoft Word.
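The assembly step can be sketched as follows, assuming eXist-db with its compression and response modules enabled. The function names local:xml-entry and local:generate-docx, the file name article.docx and the three-part package are choices made for this illustration; a real document would add the docProps and other word parts listed earlier.

```xquery
xquery version "3.0";

(: Assumption: eXist-db extension modules. :)
import module namespace compression = "http://exist-db.org/xquery/compression";
import module namespace response = "http://exist-db.org/xquery/response";

(: Hypothetical helper: wrap one XML part in a zip entry. :)
declare function local:xml-entry($name as xs:string, $content as element()) as element(entry) {
    <entry name="{$name}" type="xml" method="deflate">{ $content }</entry>
};

(: Build the package; [Content_Types].xml is listed first on purpose. :)
declare function local:generate-docx($document as element()) as xs:base64Binary* {
    let $entries := (
        local:xml-entry("[Content_Types].xml",
            <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
                <Default Extension="rels"
                    ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
                <Default Extension="xml" ContentType="application/xml"/>
                <Override PartName="/word/document.xml"
                    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
            </Types>),
        local:xml-entry("_rels/.rels",
            <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
                <Relationship Id="rId1"
                    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
                    Target="word/document.xml"/>
            </Relationships>),
        local:xml-entry("word/document.xml", $document)
    )
    return compression:zip($entries, true())
};

let $document :=
    <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:body>
            <w:p><w:r><w:t>Hello from XQuery.</w:t></w:r></w:p>
        </w:body>
    </w:document>
return
    response:stream-binary(
        local:generate-docx($document),
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "article.docx"
    )
```

Keeping the entry list in one sequence makes the required ordering explicit, since the entries are passed to the zip function in the order they are written.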

Mapping DocBook 5 Elements to Office Open XML
DocBook files are very easy to work with since the entire document can be stored in a single file. A .docx package has many small files, and these files are stored in many different locations in the zip archive.

Here are some examples:

Core Properties
The core properties are the standard Dublin Core metadata elements that you might see in a bibliographic entry for a book or article.
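A minimal sketch of a function that builds docProps/core.xml from a DocBook 5 document is shown here. The function name, the choice of title and author as the only mapped fields, and the sample input are assumptions made for illustration.

```xquery
xquery version "3.0";

declare namespace db = "http://docbook.org/ns/docbook";

(: Hypothetical helper: map DocBook title and author to core properties.
   $created is passed in so the caller controls the time stamps. :)
declare function local:core-properties($doc as element(), $created as xs:dateTime) as element() {
    <cp:coreProperties
        xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <dc:title>{ normalize-space(($doc/db:info/db:title, $doc/db:title)[1]) }</dc:title>
        <dc:creator>{ normalize-space(($doc/db:info/db:author)[1]) }</dc:creator>
        <dcterms:created xsi:type="dcterms:W3CDTF">{ $created }</dcterms:created>
        <dcterms:modified xsi:type="dcterms:W3CDTF">{ $created }</dcterms:modified>
    </cp:coreProperties>
};

local:core-properties(
    <article xmlns="http://docbook.org/ns/docbook" version="5.0">
        <info>
            <title>Sample Article</title>
            <author><personname>Jane Doe</personname></author>
        </info>
    </article>,
    xs:dateTime("2012-05-01T09:30:00Z")
)
```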

Application Properties
Here is an example of an XQuery function that will fill in the number of sections in the application properties XML file.
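The sketch below is one possible shape for such a function. The source does not show which element of app.xml receives the section count, so this version only computes the count in a separate helper and, as an assumption, writes the Application and Paragraphs elements of the extended properties file. All function names are hypothetical.

```xquery
xquery version "3.0";

declare namespace db = "http://docbook.org/ns/docbook";

(: Hypothetical helper: count every DocBook section element. :)
declare function local:section-count($doc as element()) as xs:integer {
    count($doc//(db:sect1 | db:sect2 | db:section))
};

(: Hypothetical helper: a minimal docProps/app.xml.
   Assumption: only Application and Paragraphs are filled in. :)
declare function local:app-properties($doc as element()) as element() {
    <Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">
        <Application>XQuery DocBook to docx transform</Application>
        <Paragraphs>{ count($doc//db:para) }</Paragraphs>
    </Properties>
};

let $doc :=
    <chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
        <sect1><para>One.</para><para>Two.</para></sect1>
        <sect1><para>Three.</para></sect1>
    </chapter>
return (local:section-count($doc), local:app-properties($doc))
```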

Document Body Element Transforms
Mapping your DocBook elements into Office Open XML format will vary depending on what DocBook elements you use and what your Word template structure is. This tutorial example will demonstrate mapping for the following elements:
 * 1) article
 * 2) article title
 * 3) sect1
 * 4) sect1 title
 * 5) para
 * 6) figure

Sample DocBook 5 Input File
We will begin with a DocBook 5 chapter that has two level 1 sections. Each level 1 section has two level 2 subsections, and each subsection has two paragraphs.
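An input file with that shape could look like the following. The titles and paragraph text are placeholders invented for illustration; the element constructor is written so that it is also a valid XQuery expression.

```xquery
xquery version "3.0";

(: Sample input: 2 sect1 x 2 sect2 x 2 para. All text is placeholder content. :)
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
    <title>Sample Chapter</title>
    <sect1>
        <title>Section 1</title>
        <sect2>
            <title>Section 1.1</title>
            <para>First paragraph of section 1.1.</para>
            <para>Second paragraph of section 1.1.</para>
        </sect2>
        <sect2>
            <title>Section 1.2</title>
            <para>First paragraph of section 1.2.</para>
            <para>Second paragraph of section 1.2.</para>
        </sect2>
    </sect1>
    <sect1>
        <title>Section 2</title>
        <sect2>
            <title>Section 2.1</title>
            <para>First paragraph of section 2.1.</para>
            <para>Second paragraph of section 2.1.</para>
        </sect2>
        <sect2>
            <title>Section 2.2</title>
            <para>First paragraph of section 2.2.</para>
            <para>Second paragraph of section 2.2.</para>
        </sect2>
    </sect1>
</chapter>
```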

Document Body
Office Open XML uses a complex XML structure for storing the body of text. Paragraphs are broken up into "runs", and the text sits within those run elements. The following structure is an example of this:
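A small word/document.xml body with one heading paragraph and one paragraph split into two runs is sketched here. The style name Heading1 is an assumption; it must match a style defined in your styles.xml.

```xquery
xquery version "3.0";

(: Sketch: w:document > w:body > w:p (paragraph) > w:r (run) > w:t (text). :)
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Heading1"/>
            </w:pPr>
            <w:r>
                <w:t>Section 1</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t xml:space="preserve">This run is plain and </w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t>this run is bold.</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>
```

Each change of formatting inside a paragraph starts a new run, which is why a single DocBook para can map to several w:r elements.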

Creating Your Typeswitch Transform
We are now ready to dive into the element-by-element transform.

The structure looks like this:
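A minimal sketch of that structure uses a typeswitch expression to route each node to a function named after its element. The function names and the sample input are assumptions for illustration; elements without their own case, such as title, produce no output in this short version.

```xquery
xquery version "3.0";

declare namespace db = "http://docbook.org/ns/docbook";
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Route each node to the function for its element type. :)
declare function local:dispatch($nodes as node()*) as item()* {
    for $node in $nodes
    return
        typeswitch ($node)
            case element(db:article) return local:article($node)
            case element(db:sect1) return local:sect1($node)
            case element(db:para) return local:para($node)
            (: Unknown elements: keep walking down the tree. :)
            case element() return local:dispatch($node/*)
            default return ()
};

declare function local:article($node as element(db:article)) as element(w:document) {
    <w:document>
        <w:body>{ local:dispatch($node/*) }</w:body>
    </w:document>
};

(: A section adds no wrapper of its own here; it only recurses. :)
declare function local:sect1($node as element(db:sect1)) as item()* {
    local:dispatch($node/*)
};

declare function local:para($node as element(db:para)) as element(w:p) {
    <w:p>
        <w:r>
            <w:t xml:space="preserve">{ normalize-space($node) }</w:t>
        </w:r>
    </w:p>
};

local:dispatch(
    <article xmlns="http://docbook.org/ns/docbook" version="5.0">
        <sect1>
            <para>First paragraph.</para>
            <para>Second paragraph.</para>
        </sect1>
    </article>
)
```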

Sample Recursive Function
The main "dispatch" function is called for the node of every high-level element. It then jumps to the function specifically associated with that element. Usually the function has the same name as the element.

At each level of the transform you put in the data elements you need and then call the main function for each sub-element. This allows you to put in structure that you know exists and avoids having to look up the context of an element depending on where you are in the tree. For example, the title element is used consistently in the chapter, sect1 and sect2 sections. You can look up the parent element name when you get to the title element, but it is often easier to emit the title from within the function for the section you have just arrived at.
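The in-place handling of titles described above can be sketched like this: each section function writes its own heading and then dispatches the remaining children, so the title never needs to inspect its parent. The helper local:styled-paragraph and the style names Heading1 and Heading2 are assumptions that depend on your Word template.

```xquery
xquery version "3.0";

declare namespace db = "http://docbook.org/ns/docbook";
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Hypothetical helper: one paragraph with a named style. :)
declare function local:styled-paragraph($text as xs:string, $style as xs:string) as element(w:p) {
    <w:p>
        <w:pPr><w:pStyle w:val="{$style}"/></w:pPr>
        <w:r><w:t>{ $text }</w:t></w:r>
    </w:p>
};

(: The section knows its own level, so it emits the title itself. :)
declare function local:sect1($node as element(db:sect1)) as item()* {
    (
        local:styled-paragraph(normalize-space($node/db:title), "Heading1"),
        local:dispatch($node/* except $node/db:title)
    )
};

declare function local:sect2($node as element(db:sect2)) as item()* {
    (
        local:styled-paragraph(normalize-space($node/db:title), "Heading2"),
        local:dispatch($node/* except $node/db:title)
    )
};

declare function local:para($node as element(db:para)) as element(w:p) {
    <w:p><w:r><w:t>{ normalize-space($node) }</w:t></w:r></w:p>
};

declare function local:dispatch($nodes as node()*) as item()* {
    for $node in $nodes
    return
        typeswitch ($node)
            case element(db:sect1) return local:sect1($node)
            case element(db:sect2) return local:sect2($node)
            case element(db:para) return local:para($node)
            case element() return local:dispatch($node/*)
            default return ()
};

local:dispatch(
    <sect1 xmlns="http://docbook.org/ns/docbook">
        <title>Section 1</title>
        <sect2>
            <title>Section 1.1</title>
            <para>Body text.</para>
        </sect2>
    </sect1>
)
```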

DocBook Figures
A DocBook figure wraps a title and a mediaobject whose imagedata element points to the image file. In this example all images for an article are stored in an images collection directly inside the collection that holds the main article XML file. The image is either scaled to 50% of its original size or given a fixed content width in pixels.
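A figure of this kind might look like the following. The file names, ids and titles are placeholders; the first figure uses the scale attribute and the second uses contentwidth. The query returns the image references so you can see how the transform would locate the files.

```xquery
xquery version "3.0";

declare namespace db = "http://docbook.org/ns/docbook";

let $figures :=
    <sect1 xmlns="http://docbook.org/ns/docbook">
        <title>Figures</title>
        <!-- scaled to 50% of the original size -->
        <figure xml:id="fig-architecture">
            <title>System architecture</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="images/architecture.png" scale="50"/>
                </imageobject>
            </mediaobject>
        </figure>
        <!-- fixed content width in pixels -->
        <figure xml:id="fig-workflow">
            <title>Workflow</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="images/workflow.png" contentwidth="300px"/>
                </imageobject>
            </mediaobject>
        </figure>
    </sect1>
(: Paths are relative to the collection that holds the article. :)
return $figures//db:imagedata/@fileref/string()
```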

Sample Office Open XML Image
Below is the equivalent structure of an image in docx format:
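The sketch below builds a paragraph with one inline picture. The function name and parameters are hypothetical, the relationship id must match an entry in word/_rels/document.xml.rels, and the pixel sizes are converted to English Metric Units assuming 96 pixels per inch (9525 EMU per pixel).

```xquery
xquery version "3.0";

declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
declare namespace wp = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing";
declare namespace a = "http://schemas.openxmlformats.org/drawingml/2006/main";
declare namespace pic = "http://schemas.openxmlformats.org/drawingml/2006/picture";
declare namespace r = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

(: Hypothetical helper: a paragraph holding one inline image. :)
declare function local:inline-image(
    $rel-id as xs:string,
    $name as xs:string,
    $width-px as xs:integer,
    $height-px as xs:integer
) as element(w:p) {
    (: Assumption: 96 dpi, so 914400 EMU per inch / 96 = 9525 EMU per pixel. :)
    let $cx := $width-px * 9525
    let $cy := $height-px * 9525
    return
        <w:p>
            <w:r>
                <w:drawing>
                    <wp:inline distT="0" distB="0" distL="0" distR="0">
                        <wp:extent cx="{$cx}" cy="{$cy}"/>
                        <wp:docPr id="1" name="{$name}"/>
                        <a:graphic>
                            <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
                                <pic:pic>
                                    <pic:nvPicPr>
                                        <pic:cNvPr id="1" name="{$name}"/>
                                        <pic:cNvPicPr/>
                                    </pic:nvPicPr>
                                    <pic:blipFill>
                                        <a:blip r:embed="{$rel-id}"/>
                                        <a:stretch><a:fillRect/></a:stretch>
                                    </pic:blipFill>
                                    <pic:spPr>
                                        <a:xfrm>
                                            <a:off x="0" y="0"/>
                                            <a:ext cx="{$cx}" cy="{$cy}"/>
                                        </a:xfrm>
                                        <a:prstGeom prst="rect"><a:avLst/></a:prstGeom>
                                    </pic:spPr>
                                </pic:pic>
                            </a:graphicData>
                        </a:graphic>
                    </wp:inline>
                </w:drawing>
            </w:r>
        </w:p>
};

local:inline-image("rId4", "architecture.png", 300, 200)
```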

The binary images must be placed in the word/media collection.
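One way to do this is sketched here, assuming eXist-db: the util:binary-doc function reads the stored image and the result is placed in a zip entry of type binary. The function names, database path and relationship id are hypothetical.

```xquery
xquery version "3.0";

(: Assumption: eXist-db util module. :)
import module namespace util = "http://exist-db.org/xquery/util";

(: Hypothetical helper: copy one stored image into word/media. :)
declare function local:media-entry($db-path as xs:string, $file-name as xs:string) as element(entry) {
    <entry name="word/media/{$file-name}" type="binary" method="store">{
        util:binary-doc($db-path)
    }</entry>
};

(: Hypothetical helper: the matching line for word/_rels/document.xml.rels.
   Target is relative to the word folder. :)
declare function local:image-relationship($rel-id as xs:string, $file-name as xs:string) as element() {
    <Relationship xmlns="http://schemas.openxmlformats.org/package/2006/relationships"
        Id="{$rel-id}"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
        Target="media/{$file-name}"/>
};

(: The media entry needs a real resource, for example:
   local:media-entry("/db/articles/images/architecture.png", "architecture.png") :)
local:image-relationship("rId4", "architecture.png")
```

The package also needs a content type for the image extension in [Content_Types].xml, for example a Default element that maps the png extension to image/png.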

Revision Identifiers (RSIDs)
Microsoft Word documents also carry a large number of revision identifier attributes, or "RSIDs", on paragraphs, runs and text. These are used when multiple authors make changes and the changes must be tracked using a revision reviewing system. Assigning random ID numbers to each component of text makes it easier to match up and review the tracked changes.

You can disable the generation of RSIDs in Microsoft Word: open Word Options, go to the Trust Center, select "Privacy Settings" (although this has nothing to do with privacy) and uncheck "Store random numbers to improve Combine accuracy".