XQuery/Get zipped XML file

= Motivation =

You want to process XML documents from the web that are contained in a zip file.

= Implementation =

This script uses the unzip function in the eXist compression module. The unzip function is a higher-order function: it takes two user-supplied functions as arguments, one to filter the required components of the zipped file and one to process each component.

== The Unzip Function ==

The unzip function has five input parameters, two of which are XQuery functions that are passed to the unzip function. Each of these functions in turn has its own parameters.

Here is the general layout of the compression function:

    compression:unzip(
        $zip-data as xs:base64Binary,
        $entry-filter as function,
        $entry-filter-param as xs:anyType*,
        $entry-data as function,
        $entry-data-param as xs:anyType*
    ) as item()*

The function unzips all the resources and folders from the provided data, calling the user-defined functions to determine which resources and folders to extract and how to store them.


 * $zip-data: The zip file data.
 * $entry-filter: A user-defined function for filtering resources from the zip file. The function takes 3 parameters, e.g. user:unzip-entry-filter($path as xs:string, $data-type as xs:string, $param as item()*) as xs:boolean. $data-type may be 'resource' or 'folder'. $param is a sequence with any additional parameters, for example a list of extracted files. If the function returns true, the entry is processed and passed to the entry-data function; otherwise the entry is skipped.
 * $entry-filter-param: A sequence with any additional parameters for the filtering function.
 * $entry-data: A user-defined function for storing an extracted resource from the zip file. The function takes 4 parameters, e.g. user:unzip-entry-data($path as xs:string, $data-type as xs:string, $data as item()?, $param as item()*). $data-type may be 'resource' or 'folder'. $param is a sequence with any additional parameters.
 * $entry-data-param: A sequence with any additional parameters for the storing function.
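To make the role of $entry-filter-param concrete, here is a minimal sketch of a filter that accepts only resources with a given suffix. The function name fw:suffix-filter and the suffix list are hypothetical. The sketch assumes that eXist hands the sequence given as $entry-filter-param to the filter's $param argument, as the description above states, and it follows this page's util:function style; newer eXist releases may expect a function reference such as fw:suffix-filter#3 instead.

```xquery
declare namespace fw = "http://www.cems.uwe.ac.uk/xmlwiki/fw";

(: Hypothetical filter: accept only resources (not folders) whose path ends
   in one of the suffixes supplied through $entry-filter-param :)
declare function fw:suffix-filter($path as xs:string, $type as xs:string, $param as item()*) as xs:boolean {
    $type eq "resource"
    and (some $suffix in $param satisfies ends-with(lower-case($path), $suffix))
};

(: Process callback: return each accepted entry unchanged :)
declare function fw:process($path as xs:string, $type as xs:string, $data as item()?, $param as item()*) {
    $data
};

let $uri := "http://www.iso.org/iso/iso_3166-1_list_en.zip"
let $zip := httpclient:get(xs:anyURI($uri), true(), ())/httpclient:body/text()
let $filter := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:suffix-filter"), 3)
let $process := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:process"), 4)
(: the third argument is the suffix list handed on to fw:suffix-filter :)
return compression:unzip($zip, $filter, (".xml"), $process, ())
```

Keeping the suffix list in the parameter rather than in the function body lets the same filter be reused for different archives.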

In the first example, we know that there is only one XML file and we intend to process the XML in the script. Later examples store the file or files for later processing.

= Extracting a single zipped file =

    declare namespace fw = "http://www.cems.uwe.ac.uk/xmlwiki/fw";

    (: Filter callback: accept every entry in the archive :)
    declare function fw:filter($path as xs:string, $type as xs:string, $param as item()*) as xs:boolean {
        true()
    };

    (: Process callback: return the entry's content (the XML) unchanged :)
    declare function fw:process($path as xs:string, $type as xs:string, $data as item()?, $param as item()*) {
        $data
    };

    let $uri := request:get-parameter("uri", "http://www.iso.org/iso/iso_3166-1_list_en.zip")
    let $zip := httpclient:get(xs:anyURI($uri), true(), ())/httpclient:body/text()
    let $filter := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:filter"), 3)
    let $process := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:process"), 4)
    let $xml := compression:unzip($zip, $filter, (), $process, ())
    return $xml


== How the Process Function Works ==

The compression:unzip function calls the process function once for each component of the zip archive that passes the filter. A function that is passed in and called in this way is known as a callback function. You can place any valid XQuery code in the process function to do what you would like with each component, such as listing or storing it.

For example, a process function can list all the items in a zip file, giving each item's path, its type and, if the item is an XML file, its root node.
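A minimal sketch of such a listing function is shown here. The name fw:list and the item element are hypothetical, and the sketch assumes that eXist passes XML entries to the callback as nodes and everything else as binary data or an empty sequence.

```xquery
declare namespace fw = "http://www.cems.uwe.ac.uk/xmlwiki/fw";

(: Filter callback: accept every entry :)
declare function fw:filter($path as xs:string, $type as xs:string, $param as item()*) as xs:boolean {
    true()
};

(: Hypothetical listing callback: one item element per entry, holding the
   name of the root element when the entry arrived as an XML node :)
declare function fw:list($path as xs:string, $type as xs:string, $data as item()?, $param as item()*) as element(item) {
    <item path="{$path}" type="{$type}">{
        if ($data instance of node())
        then name(($data/descendant-or-self::*)[1])
        else ()
    }</item>
};

let $uri := request:get-parameter("uri", "http://www.iso.org/iso/iso_3166-1_list_en.zip")
let $zip := httpclient:get(xs:anyURI($uri), true(), ())/httpclient:body/text()
let $filter := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:filter"), 3)
let $process := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:list"), 4)
return <items>{compression:unzip($zip, $filter, (), $process, ())}</items>
```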

Running this on an Office Open XML file returns the following:

= Storing the unzipped File =

You probably want to store the unzipped documents in the database. We can modify the process function to do this. We can use the final $param parameter of the process function, which receives the value given as $entry-data-param, to pass in the directory in which to store each file. In addition, we need to create a collection to hold the unzipped files.
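The following is a sketch of this approach rather than a definitive script. The parent collection /db, the dir request parameter and the function name fw:store are assumptions made for illustration, and the script must run as a user who is allowed to write to the parent collection.

```xquery
declare namespace fw = "http://www.cems.uwe.ac.uk/xmlwiki/fw";

(: Filter callback: accept resources only, since a folder entry has no data to store :)
declare function fw:filter($path as xs:string, $type as xs:string, $param as item()*) as xs:boolean {
    $type eq "resource"
};

(: Hypothetical storing callback: $param carries the target collection path :)
declare function fw:store($path as xs:string, $type as xs:string, $data as item()?, $param as item()*) {
    xmldb:store(string($param[1]), $path, $data)
};

let $base := "/db"                                  (: assumed parent collection :)
let $dir := request:get-parameter("dir", "temp")    (: assumed collection name :)
let $target := concat($base, "/", $dir)
let $uri := request:get-parameter("uri", "http://www.iso.org/iso/iso_3166-1_list_en.zip")
let $zip := httpclient:get(xs:anyURI($uri), true(), ())/httpclient:body/text()
let $filter := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:filter"), 3)
let $process := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:store"), 4)
(: create the target collection if it does not exist yet :)
let $mkdir :=
    if (xmldb:collection-available($target)) then ()
    else xmldb:create-collection($base, $dir)
(: the fifth argument is handed to fw:store as $param :)
return compression:unzip($zip, $filter, (), $process, $target)
```

xmldb:store returns the path of each stored resource, so the script's result is the list of documents created.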

= Unzipping a zip archive =

Zip files commonly contain multiple files. In particular, Microsoft Word .docx and Excel .xlsx files are zipped collections of XML files which together define the document or spreadsheet.

When documents are stored in the eXist database, the mime type (media type) is inferred from the file suffix using the mime-types.xml file. Alternatively the mime type can be set explicitly when the document is stored.

We assume here that filenames in the zip file are simple. If there is a directory structure, this needs additional coding.
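As a sketch of setting the mime type explicitly, the process function below chooses a type from the kind of data it receives and passes it as the fourth argument of xmldb:store. The function name fw:store-typed and the collection /db/unzipped are hypothetical, the collection is assumed to exist already, and XML entries are assumed to arrive as nodes.

```xquery
declare namespace fw = "http://www.cems.uwe.ac.uk/xmlwiki/fw";

(: Filter callback: accept resources only :)
declare function fw:filter($path as xs:string, $type as xs:string, $param as item()*) as xs:boolean {
    $type eq "resource"
};

(: Hypothetical storing callback: XML nodes are stored as XML documents,
   anything else as a generic binary resource :)
declare function fw:store-typed($path as xs:string, $type as xs:string, $data as item()?, $param as item()*) {
    let $mime :=
        if ($data instance of node()) then "application/xml"
        else "application/octet-stream"
    return xmldb:store(string($param[1]), $path, $data, $mime)
};

let $target := "/db/unzipped"    (: assumed to exist already :)
let $uri := request:get-parameter("uri", "http://www.iso.org/iso/iso_3166-1_list_en.zip")
let $zip := httpclient:get(xs:anyURI($uri), true(), ())/httpclient:body/text()
let $filter := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:filter"), 3)
let $process := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:store-typed"), 4)
return compression:unzip($zip, $filter, (), $process, $target)
```

Deciding the type from the data rather than from the suffix means that XML components with unfamiliar suffixes are still stored as XML.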

= Zips with a directory structure =

Most zip files contain a directory tree of files. This directory structure needs to be recreated in the database as the files are unzipped. We can modify the process function to create database collections as necessary, assuming that higher directories are referenced before subdirectories.
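One way to do this is sketched below. Instead of relying on the order of the entries, the hypothetical helper fw:ensure-collection creates every missing collection on the path, one step at a time. The base collection /db/unzipped is an assumption and must already exist, and xmldb:encode is used on the assumption that it makes names such as [Content_Types].xml acceptable as resource names.

```xquery
declare namespace fw = "http://www.cems.uwe.ac.uk/xmlwiki/fw";

(: Hypothetical helper: create each missing collection below $base in turn
   and return the full path of the deepest one :)
declare function fw:ensure-collection($base as xs:string, $steps as xs:string*) as xs:string {
    if (empty($steps)) then $base
    else
        let $next := concat($base, "/", $steps[1])
        let $create :=
            if (xmldb:collection-available($next)) then ()
            else xmldb:create-collection($base, $steps[1])
        return fw:ensure-collection($next, subsequence($steps, 2))
};

(: Filter callback: accept every entry, folders included :)
declare function fw:filter($path as xs:string, $type as xs:string, $param as item()*) as xs:boolean {
    true()
};

(: Process callback: mirror the zip directory tree below the collection in $param :)
declare function fw:store-tree($path as xs:string, $type as xs:string, $data as item()?, $param as item()*) {
    let $base := string($param[1])
    (: drop empty steps such as the one left by a trailing slash :)
    let $steps := tokenize($path, "/")[. ne ""]
    return
        if ($type eq "folder") then
            fw:ensure-collection($base, $steps)
        else
            let $collection := fw:ensure-collection($base, $steps[position() lt last()])
            return xmldb:store($collection, xmldb:encode($steps[last()]), $data)
};

let $base := "/db/unzipped"    (: assumed to exist already :)
let $uri := request:get-parameter("uri", "http://www.iso.org/iso/iso_3166-1_list_en.zip")
let $zip := httpclient:get(xs:anyURI($uri), true(), ())/httpclient:body/text()
let $filter := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:filter"), 3)
let $process := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:store-tree"), 4)
return compression:unzip($zip, $filter, (), $process, $base)
```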

= Processing stored zip files =

It may be desirable to store the zip files in the database as binary resources before they are unzipped. By default, files with a .zip suffix are stored as binary data. To store .docx and .xlsx files in eXist, you will need to add these suffixes to the entry for zip files in the $EXIST_HOME/mime-types.xml configuration file.

Change

to

You will need to restart the eXist server for this change to take effect.

The basic script remains the same with minor modifications.
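A sketch of the modified script follows. The only change from the first example is that the zip data is read from the database with util:binary-doc rather than fetched over HTTP. The path request parameter and its default value /db/zips/example.zip are hypothetical.

```xquery
declare namespace fw = "http://www.cems.uwe.ac.uk/xmlwiki/fw";

(: Filter callback: accept every entry :)
declare function fw:filter($path as xs:string, $type as xs:string, $param as item()*) as xs:boolean {
    true()
};

(: Process callback: return each entry's content unchanged :)
declare function fw:process($path as xs:string, $type as xs:string, $data as item()?, $param as item()*) {
    $data
};

(: hypothetical path of a zip file already stored as a binary resource :)
let $path := request:get-parameter("path", "/db/zips/example.zip")
(: read the stored binary resource as xs:base64Binary :)
let $zip := util:binary-doc($path)
let $filter := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:filter"), 3)
let $process := util:function(QName("http://www.cems.uwe.ac.uk/xmlwiki/fw", "fw:process"), 4)
return compression:unzip($zip, $filter, (), $process, ())
```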