XQuery/Lorum Ipsum text

Motivation
You want to create realistically sized example XML for testing or demonstration. Lorem ipsum text is often used as filler content, and it would be useful to insert this text wherever it is needed in an XML file.

We explore two approaches: one modifies the XML after serialising it to text, the other modifies the XML tree directly.

Approach 1: string replacement
The places in the incomplete XML file where lorem ipsum text is to be inserted are marked with an ellipsis ("..."). The XML file is read, serialised to a string and split into parts at each ellipsis; the parts are then re-assembled with a randomly chosen stretch of the lorem ipsum text in place of each ellipsis. The resulting string is parsed back into XML for output. The base lorem ipsum text is stored as an XML file:

http://www.cems.uwe.ac.uk/xmlwiki/apps/lorumipsum/words.xml

Concepts used

 * XML <> string conversion: the script uses a pair of functions from the eXist util module (util:serialize and util:parse) to convert back and forth between XML and a string. This allows the XML to be manipulated as a plain string before being converted back to XML
 * recursion: interpolating the random text into the original string requires a recursive function
 * regular expressions: these are used to tokenise both the lorem ipsum text and the incomplete XML file containing the ellipses
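The round trip between XML and a string can be illustrated with a minimal sketch. It assumes an eXist-db environment in which util:serialize and util:parse are available as described above; the sample fragment and the fixed filler text are hypothetical and only stand in for the real input. If util:parse is not available in your eXist version, the standard fn:parse-xml function does the same job.

```xquery
xquery version "3.0";

import module namespace util = "http://exist-db.org/xquery/util";

(: Hypothetical fragment with ellipsis markers :)
let $xml := <note><heading>...</heading><body>...</body></note>

(: XML to string :)
let $text := util:serialize($xml, "method=xml")

(: Split at each ellipsis. The dots are escaped because an
   unescaped "." matches any single character. :)
let $parts := tokenize($text, "\.\.\.")

(: Re-join with fixed filler text, then parse the string back to XML :)
return util:parse(string-join($parts, "lorem ipsum"))
```

Here every ellipsis receives the same filler; the recursive function is what makes it possible to insert a different random stretch of text at each marker.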

Example

 * incomplete XML
 * XML with each ellipsis replaced with lorem ipsum text

Explanation

 * the lorem ipsum text is split into words by tokenising on whitespace
 * the incomplete XML is fetched and the root element accessed
 * this element is converted to a string using the util:serialize function, then tokenised with the pattern "\.\.\." (not "..." since . matches any single character in regular expressions)
 * the recursive function join-random joins the first string in the sequence to a random stretch of the lorem ipsum text, followed by the result of joining the remaining strings in the same way
 * the expanded text is converted back to an XML element using util:parse
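The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the original script: the word list and the incomplete XML are inlined as hypothetical stand-ins for the fetched files, the variable names and the maximum stretch length of 10 words are invented for the example, and util:random($n) is assumed to return an integer from 0 up to $n - 1.

```xquery
xquery version "3.0";

import module namespace util = "http://exist-db.org/xquery/util";

(: Join the strings, placing a random stretch of words in each gap. :)
declare function local:join-random(
    $strings as xs:string*,
    $words as xs:string*
) as xs:string {
    if (count($strings) > 1)
    then
        (: assumption: util:random($n) yields 0 .. $n - 1 :)
        let $start := util:random(count($words)) + 1
        let $length := util:random(10) + 1
        return concat(
            $strings[1],
            string-join(subsequence($words, $start, $length), " "),
            local:join-random(subsequence($strings, 2), $words)
        )
    else
        (: zero or one string left: nothing more to interpolate :)
        string-join($strings, "")
};

(: Hypothetical stand-ins for the stored word list and the incomplete XML :)
let $words := tokenize("lorem ipsum dolor sit amet consectetur adipiscing elit", "\s+")
let $incomplete := <page><title>...</title><para>...</para></page>

let $text := util:serialize($incomplete, "method=xml")
let $parts := tokenize($text, "\.\.\.")
return util:parse(local:join-random($parts, $words))
```

Because a stretch that starts near the end of the word list is cut short by subsequence, the inserted text varies in length from one word up to the chosen maximum.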

Improvements

 * the lorem ipsum text itself could be generated rather than stored
 * the script could be parameterised for the lorem ipsum file, allowing different, perhaps more realistic, text to be used
 * the lorem ipsum words are passed as a parameter to the recursive function. They could be defined in a global variable instead
 * it would be better to use the httpclient module to fetch the files and control caching via headers; as written, the fetched file is cached

Approach 2: XML replacement
Using an ellipsis as the marker is problematic if an ellipsis also needs to appear in the text itself. Converting the XML to text and back again is also an overhead.

An alternative approach is to use an XML element, for example <ipsum/>, to mark the places where lorem ipsum text is to appear, and to replace every occurrence with random text. Replacing a specific element anywhere in the XML tree can be accomplished by modifying the identity transformation discussed in XQuery/Filtering_Nodes.

Concepts

 * recursion - to copy an arbitrary XML tree, replacing a given element with random text.

Explanation

 * the sequence of lorem ipsum words is held in a global variable to avoid passing it as a parameter to the recursive function
 * the copy-with-random function recursively copies the elements and other items in a tree to a new tree
 * when an element with the name "ipsum" is encountered, a selection of lorem ipsum text is returned instead of the original element
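A minimal sketch of such a function, written as a modified identity transformation, might look like this. It is not the original script: the inlined word list, the sample input, the variable names and the maximum stretch length of 10 words are hypothetical, and util:random($n) is assumed to return an integer from 0 up to $n - 1.

```xquery
xquery version "3.0";

import module namespace util = "http://exist-db.org/xquery/util";

(: Hypothetical word list, held globally so it need not be passed around :)
declare variable $words as xs:string* :=
    tokenize("lorem ipsum dolor sit amet consectetur adipiscing elit", "\s+");

(: Copy a tree, replacing each ipsum element with random filler text. :)
declare function local:copy-with-random($node as node()) as item()* {
    typeswitch ($node)
        case element(ipsum) return
            (: assumption: util:random($n) yields 0 .. $n - 1 :)
            let $start := util:random(count($words)) + 1
            let $length := util:random(10) + 1
            return text {
                string-join(subsequence($words, $start, $length), " ")
            }
        case element() return
            (: rebuild the element, keep its attributes, recurse into children :)
            element { node-name($node) } {
                $node/@*,
                for $child in $node/node()
                return local:copy-with-random($child)
            }
        default return
            (: text, comments and processing instructions are copied as-is :)
            $node
};

local:copy-with-random(
    <page id="p1"><title><ipsum/></title><para>Intro: <ipsum/></para></page>
)
```

Since the tree is never serialised, an ellipsis in the source text is left untouched, and existing content such as "Intro: " is preserved next to the inserted filler.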

Example

 * incomplete XML
 * XML with ipsum elements replaced with lorem ipsum text

Discussion
The second approach is simpler: it avoids the round trip through a string and does not reserve the ellipsis as a marker. Performance is about the same.

Acknowledgements

 * the sample XML is an extract from "Search: The Graphics Web Guide", Ken Coupland, Laurence King Publishing (2002)