XQuery/Parsing CSV

Motivation
You want to parse a file of comma-separated values (CSV) text into an xml structure or you have a flat file with very similar line/field structure that you want to convert into XML format.

Method
We will use the tokenize($input, '\n') function to break the input file into separate lines. We will then use the tokenize($input, '\s*,\s*') function to parse each line into separate fields. The regexp term \s* will remove whitespace.

Basic Example
Execute

Example With Data Dictionary in Row 1
This second example will use the first row of the CSV file as a data dictionary of the element names for each column of each row.

Execute

Adding Configuration File Options
Many times you have a family of CSV files that all may have very similar options for import. In this case it is useful to be able to pass a series of configuration parameters to a single XQuery function. These configuration parameters include:


 * 1) What the field delimiter is (comma is the default)
 * 2) The element name of the root node
 * 3) The element name of each line or row

You can then use this configuration file in the CSV parser:

Execute

CSV complications
The code above assumes a simple form of CSV. In practice CSV must handle more complex cases for which as simple use of tokeniser is insufficient to parse a line. Strings containing one or more separators will be double-quoted. Quotes within quoted strings also need to be handled.

The following implementations handle some or all of these complications, although it is often not clear which complications are considered.
 * XSLT 2.0 Stylesheet for transforming CSV files to XML
 * Matthew Royal's parser Sadly this Marklogic only since it uses mutable maps.
 * Zorba CSV importer
 * BaseX CSV module
 * David Cassel another Marklogic specific implementation

Related work

 * Adam Retter's work on CSV schemas and Validation