ETD Guide/Technical Issues/Conversions from Word, WordPerfect or other RTF-compatible tools to SGML/XML

Converting MS Word documents into instances of a specified SGML or XML DTD is a very complex task. You will need the following:
 * An SGML or XML document type definition (DTD) that serves as the structure model for the output. The output SGML document is said to be valid against the specified DTD, or to be an instance of this DTD.
 * A Word stylesheet that holds paragraph and character styles corresponding to the structures in the DTD. For example, if the DTD defines a structure for an author, so that a name such as "Dr. Peter Fox" is marked up in the output file with its title, first name and surname, you have to provide matching styles in Word: the paragraph style "author", and the character styles "first name", "surname" and "title", which are to be used only within an author paragraph.
 * A configuration file that maps the DTD elements to Word elements and vice versa.
 * An SGML or XML parser to check the output SGML/XML document against the DTD.
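The mapping between Word styles and DTD elements can be pictured with a minimal Python sketch. It is not the configuration format of any tool described here; the dictionaries, the function name convert_paragraph and the element names (author, title, firstname, surname) are assumptions chosen to match the author example above.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from Word style names to DTD element names.
PARAGRAPH_STYLES = {"author": "author"}
CHARACTER_STYLES = {
    "title": "title",
    "first name": "firstname",
    "surname": "surname",
}

def convert_paragraph(para_style, runs):
    """Turn one styled paragraph into markup.

    runs is a list of (character_style_or_None, text) pairs.
    An unmapped style raises KeyError, which flags a gap in the mapping.
    """
    element = PARAGRAPH_STYLES[para_style]
    parts = []
    for char_style, text in runs:
        if char_style is None:
            # Text without a character style is copied as plain content.
            parts.append(escape(text))
        else:
            tag = CHARACTER_STYLES[char_style]
            parts.append(f"<{tag}>{escape(text)}</{tag}>")
    return f"<{element}>{''.join(parts)}</{element}>"

runs = [("title", "Dr."), (None, " "), ("first name", "Peter"),
        (None, " "), ("surname", "Fox")]
print(convert_paragraph("author", runs))
```

Keeping the mapping in data rather than in code is what allows the same converter to be reused with another DTD and another stylesheet.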

Conversion Methods

A conversion is often done with a plug-in that works directly inside MS Word, but other tools work on RTF (Rich Text Format), Microsoft's exchange format. These tools interpret the RTF file, in which the MS Word styles are still encoded, and export it as an SGML document. This process mostly runs in batch mode, with little use of graphical user interfaces.
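To illustrate how style information survives in RTF, the following Python sketch lists the styles declared in the stylesheet group of an RTF file. It is a simplified illustration, not a full RTF parser and not part of any tool mentioned here: the regular expression covers only simple entries, and the function name list_styles and the sample string are made up for this example.

```python
import re

# Matches one stylesheet entry such as {\s1\sbasedon0 author;}
# or {\*\cs10 \additive surname;}. Real RTF allows more variation.
STYLE_ENTRY = re.compile(
    r"\{(?:\\\*)?\\(s|cs)(\d+)"   # \sN = paragraph style, \csN = character style
    r"(?:\s*\\[a-z]+-?\d*)*"      # further control words, e.g. \sbasedon0
    r"\s*([^;{}\\]+);\}"          # style name, terminated by ';'
)

def list_styles(rtf_text):
    """Return (kind, number, name) tuples for the styles declared in rtf_text."""
    kinds = {"s": "paragraph", "cs": "character"}
    return [(kinds[kind], int(number), name.strip())
            for kind, number, name in STYLE_ENTRY.findall(rtf_text)]

# Hypothetical, heavily shortened RTF header.
sample = (r"{\rtf1\ansi{\stylesheet{\s1\sbasedon0 author;}"
          r"{\*\cs10 \additive surname;}}}")
print(list_styles(sample))
```

A converter can compare such a list with its mapping table before it starts, so that styles without a corresponding DTD element are reported early.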

Within the following paragraphs we describe several approaches:
 * 1) Approach of the Université de Montréal, Université de Lyon 2, Universidad de Chile
 * 2) Humboldt-University Berlin and the German-wide Dissertation Online project

Other approaches are in development as well, especially in Scandinavia, for example at the University of Oslo, Norway. Their solutions are not covered here yet.

Conversion method of the Cybertheses project

The process line for converting Word files into SGML documents developed within the CyberThèses project uses scripts written in the Omnimark language.

Conversion Process

The input of the process line is an RTF file with a "structuring style sheet", and the output is an SGML document encoded according to the TEI Lite DTD (see the TEI web site at http://etext.virginia.edu/TEI.html).

The conversion process is made up of three main steps:
 * First, the RTF file is converted into a flat XML file encoded according to a DTD for RTF. The resulting file is a linear sequence of paragraph elements, each with an explicit "stylename" attribute that corresponds to the RTF style name.
 * Second, the hierarchical and logical structure of the document is regenerated by analysing the stylename attributes.
 * Last, an SGML parser validates the conformity of the resulting SGML document with the TEI Lite DTD. Some supplementary scripts then allow the SGML document to be exported to other formats (HTML, XML).
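The second step can be sketched in Python as follows. This is not the CyberThèses Omnimark code; it only illustrates, under stated assumptions, how a flat sequence of paragraphs can be nested by looking at their stylename attributes. The style names Heading1, Heading2 and Normal, the function name rebuild_structure and the output elements body, div, head and p are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical style names and their heading depth; a real style sheet may differ.
HEADING_LEVELS = {"Heading1": 1, "Heading2": 2}

def rebuild_structure(flat_root):
    """Nest a flat sequence of <p stylename="..."> elements into <div> sections."""
    body = ET.Element("body")
    stack = [(0, body)]                     # (heading level, container element)
    for para in flat_root.findall("p"):
        level = HEADING_LEVELS.get(para.get("stylename"))
        if level is None:
            # Ordinary paragraph: append to the innermost open section.
            ET.SubElement(stack[-1][1], "p").text = para.text
            continue
        # Close sections at the same or a deeper level before opening a new one.
        while stack[-1][0] >= level:
            stack.pop()
        div = ET.SubElement(stack[-1][1], "div")
        ET.SubElement(div, "head").text = para.text
        stack.append((level, div))
    return body

flat = ET.fromstring(
    '<doc>'
    '<p stylename="Heading1">Introduction</p>'
    '<p stylename="Normal">First paragraph.</p>'
    '<p stylename="Heading2">Background</p>'
    '<p stylename="Normal">Details.</p>'
    '<p stylename="Heading1">Methods</p>'
    '</doc>'
)
print(ET.tostring(rebuild_structure(flat), encoding="unicode"))
```

The stack holds the currently open sections, so a heading of level n closes every open section of level n or deeper before a new one is opened.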

Most of the scripts will soon be available from the CyberTheses web site: http://www.cybertheses.org. This system is devoted to a particular DTD, but generalizing it to other document models should not pose any difficulty.

Using SGML Author for Word (Humboldt-University Berlin)

Why did we use the SGML Author for Word?

The "Dissertation Online" project implemented and refined a conversion strategy that allows writers to convert documents written in MS Word with a special stylesheet (dissertation.dot) into an SGML instance of the DiML.dtd.

We used this product from Microsoft, the SGML Author for Word, for several reasons:
 * 1) SGML Author is quite easy to configure
 * 2) It is easy to use.
 * 3) It is less expensive than other software producing SGML files with the same quality.
 * 4) It supports an international standard for tables: CALS.
 * 5) As it is a Word add-on, it handles documents in the MS Word doc format better than other tools.
 * 6) When we started using this technology in 1997, it supported Word 97, the current version of Word at that time, from the very beginning.

Unfortunately, Microsoft did not continue the development of this tool, so no new versions are available for Office 2000 or Office XP. However, as far as the conversion into SGML is concerned, the internal document formats of MS Word 97, MS Word 2000 and Office XP are the same. Documents written in Word 2000 or Office XP can therefore be opened in Word 97 and converted there.

Preconditions

For a successful conversion from a Word document into a DiML document you will need:
 * The DiML document type definition (diml20.dtd, calstb.dtd)
 * The SGML Author for Word 97 (it may no longer be available from Microsoft shops, but NDLTD, especially Prof. Dr. Edward Fox, may provide English versions of it that work with English Word)
 * The association file for the Microsoft SGML Author for Word (diml20.dta)
 * The converter stylesheet, which consists of several macros programmed to make the preconversion process easier
 * The Perl programming language (free software)
 * The nsgmls parser (free software)
 * Several Perl scripts to correct the transformation of tables

Software

You must have the following software installed on your computer:
 * SP (NSGMLS) (parser for SGML files by James Clark). Newer versions are available at http://openjade.sourceforge.net/doc-1.4/index.htm, but we have not tested them.
 * Run SP (A WYSIWYG tool for SP by Richard Light). http://www.light.demon.co.uk/runsp/index.htm
 * Perl (the scripting language needed to run the Perl scripts).

The converter stylesheet and the authors' stylesheet can be obtained from the following website: http://dochost.rz.hu-berlin.de/epdiss/vorlage.html

Converter scripts and Perl scripts can be obtained from http://www.educat.hu-berlin.de/diss_online/software/tools.exe (Perl scripts, DTD and converter file for MS SGML Author for Word: KonverterDiML2_0.dta)

Conversions

The conversion from a Microsoft Word document into an SGML document that is an instance of the DiML.dtd used at Humboldt-University takes several steps:

Step 1

Preparing the conversion without using the converter Microsoft SGML Author for Word directly
 * Check the correct usage
 * Load the stylesheet for conversion (NOT the one for the authors), see figure below.
 * A special feature extracts the page numbers from the Word document by using certain Word-specific text anchors. These have to be converted into hard-coded information using a page number stylesheet.
 * Formatting that the author has applied without using styles has to be replaced by the correct styles.
 * So that tables are later displayed correctly with CSS stylesheets in common browsers, empty table cells have to be filled with a single space character.
 * Soft line breaks have to be preserved for the conversion. This is done by inserting the special marker #BR# in their place. The marker is later replaced by a special SGML tag for soft line breaks.
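The last two preparations can be pictured with a short Python sketch. In the workflow described here they are carried out by Word macros in the converter stylesheet; the code below only illustrates the idea on plain strings. It assumes that a manual line break appears in Word text as character code 11; the tag name br and all function names are hypothetical.

```python
SOFT_BREAK = "\x0b"   # assumption: Word's manual line break, character code 11
MARKER = "#BR#"       # marker named in the text above
BREAK_TAG = "<br>"    # hypothetical tag; the element in the real DTD may differ

def protect_soft_breaks(text):
    """Before conversion: replace soft line breaks by a marker that survives it."""
    return text.replace(SOFT_BREAK, MARKER)

def restore_soft_breaks(sgml):
    """After conversion: turn the marker into the tag for soft line breaks."""
    return sgml.replace(MARKER, BREAK_TAG)

def fill_empty_cells(rows):
    """Replace empty table cells by a single space so that browsers render them."""
    return [[cell if cell != "" else " " for cell in row] for row in rows]

print(protect_soft_breaks("line one\x0bline two"))
print(fill_empty_cells([["a", ""], ["", "b"]]))
```

The marker is chosen so that it is unlikely to occur in normal text and passes through the converter as ordinary characters.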

Step 2

Converting with Microsoft SGML Author for Word
 * Choose "Save as SGML" from the File menu.
 * Load the converter file KonverterDiML2_0.DTA.
 * Check the XML/SGML output using the feedback file (fbk), see figure below.

Step 3

Post-process the output file (output according to the DiML.dtd) automatically.
 * Run the Perl scripts using the batch file preprocessor.bat.
 * Parse the DiML file.
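The parsing step can also be scripted. The following Python sketch calls the nsgmls parser from SP and collects its error messages. It is not part of the tools distributed by the project; the file names thesis.sgm and catalog are hypothetical, and the sketch assumes that nsgmls is installed and on the search path.

```python
import shutil
import subprocess

def build_parse_command(sgml_file, catalog=None, parser="nsgmls"):
    """Build the command line for a validation run."""
    cmd = [parser, "-s"]           # -s: suppress normal output, report errors only
    if catalog:
        cmd += ["-c", catalog]     # -c: catalog file used to locate the DTD
    cmd.append(sgml_file)
    return cmd

def parse_file(sgml_file, catalog=None):
    """Run the parser and return (ok, list of messages)."""
    cmd = build_parse_command(sgml_file, catalog)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on the search path")
    result = subprocess.run(cmd, capture_output=True, text=True)
    messages = result.stderr.splitlines()
    return result.returncode == 0 and not messages, messages

# Hypothetical file names, shown without running the parser.
print(build_parse_command("thesis.sgm", catalog="catalog"))
```

Calling parse_file("thesis.sgm", "catalog") returns the messages line by line, so they can be worked through before the transformation in Step 4.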

Step 4

Transforming the DiML file into an HTML file
 * Run the Perl scripts using the batch file did2html.bat.
 * Check the HTML output.
 * Correct any errors manually in the SGML file and repeat the transformation.

A demonstration QuickTime video can also be found on the ETD-Guide server (see http://www.educat.hu-berlin.de/diss_online/software/didi.mov).

Using FrameMaker+SGML6.0 for a conversion of MS Word documents into SGML instances.

Editing or converting with FrameMaker is much more complex than the methods described above. FrameMaker can import formatted Word documents while keeping the stylesheet information, and it can export them, via an internal FrameMaker format, as SGML or XML documents. To carry out a conversion with FrameMaker you will need the following configuration files:
 * 1) A conversion table. This contains the list of the Word styles and the corresponding elements in the FrameMaker internal format. The table is saved in the FrameMaker internal format (*.frm).
 * 2) The document type definition is stored within FrameMaker as an EDD (Element Definition Document), which is saved in the FrameMaker internal document format (*.edd).
 * 3) FrameMaker uses layout rules for the internal layout of documents. This layout definition describes the layout of documents just as in MS Word documents: individual formats and their appearance, such as text height, are defined. This file is also stored as a *.frm file.
 * 4) The read/write rules define which FrameMaker format is exported to which SGML/XML element.
 * 5) The SGML or XML DTD has to be used as well, including catalog or entity files and sub-DTDs such as CALS for tables.
 * 6) To carry out a conversion, a new SGML application has to be defined within FrameMaker+SGML. This application links all the files needed for a conversion as described above. It enables FrameMaker to parse the output file when exporting a document to SGML or XML.

A workflow and a technology for converting ETDs using FrameMaker+SGML6.0 were first developed at the Technical University Helsinki, within the HUTPubl project (1997–2000), see http://www.hut.fi/Yksikot/Kirjasto/HUTpubl

Other Tools

Text editors and desktop publishing systems that can export SGML/XML documents

Tools that export using a user-specified DTD:
 * WordPerfect since version 7.0 (Corel) (http://www.corel.com)
 * FrameMaker+SGML6.0 (Adobe) (http://www.adobe.com)

Tools that export using their own native DTD:
 * OpenOffice (Sun/open source) (http://www.openoffice.org)
 * AbiWord (AbiWord/open source) (http://www.abisource.com)
 * KWord (KOffice, KDE Project/open source) (http://www.kde.org)

Converter tools:
 * Omnimark (Omnimark) (http://www.omnimark.com)
 * MarkupKit (Schema) (http://www.schema.de)
 * Majix (Tetrasix) (http://www.tetrasix.com)
 * TuSTEP (RZ Uni Tübingen) (http://www.uni-tuebingen.de/zdv/tustep/index.html)
