XQuery/XML Differences

Motivation
You want to find the differences between two XML files and output a "colored diff" file of the differences.

Background on XML Differences
Unlike plain text files, XML files have structure, and that structure must be taken into account when comparing two of them.

For example, when comparing the attributes of an element, the order in which the attributes appear in the file is not significant. The following two lines are technically the same even though the order of the attributes is different:
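A minimal XQuery sketch of this point, using the standard deep-equal function, which treats the attributes of an element as an unordered set. The element name, attribute names, and values are made up for illustration:

```xquery
xquery version "1.0";

(: Two elements with the same attributes written in a different order.
   The names "person", "id" and "role" are illustrative only. :)
let $first  := <person id="42" role="editor"/>
let $second := <person role="editor" id="42"/>

(: deep-equal ignores attribute order, so this evaluates to true :)
return deep-equal($first, $second)
```

A line-based text comparison of the same two lines would report them as different.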

XML differences also tend to ignore the spaces and tabs used when indenting an XML file to make it more readable.

As a result, the traditional line-based Longest Common Subsequence (LCS) algorithms used by tools such as UNIX diff, GNU diff, or Subversion diff will not usually give us the results that we want.

XML Differencing Algorithms
There are many different algorithms for comparing tree-structured data. Because hierarchical data can be so complex, each algorithm has different precision and performance trade-offs. There are also many options to consider. For example:


 * Do you want to ignore XML comments?
 * Do you want to ignore processing instructions (PIs)?
 * Do you want to ignore case (uppercase/lowercase) differences?
 * Do you want to ignore whitespace between elements?
 * Can you assume that the structure of the XML documents being compared is identical and only the text is different?
 * Are you interested in whether the order of attributes changes?
 * Do you want your differences algorithm to output a list of changes to be made on the first or second file?

For our first version we will just do a simple scan of the elements and text within the elements.

Method
We will create a recursive XQuery function that compares the nodes of two XML files.
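The following is a minimal sketch of such a function, not a finished implementation. It assumes that child elements can be matched by position, it compares only element names and leaf text (attributes and mixed content are ignored), and it normalizes whitespace before comparing text. The function name local:diff and the wrapper elements removed, added, changed, old, and new are hypothetical names chosen for this sketch:

```xquery
xquery version "3.0";

(: Hypothetical helper: walks two element trees in parallel and wraps every
   difference in a marker element. Children are paired by position. :)
declare function local:diff($a as element()?, $b as element()?) as element()* {
  if (empty($a) and empty($b)) then
    ()
  else if (empty($b)) then
    (: present only in the first tree :)
    <removed>{ $a }</removed>
  else if (empty($a)) then
    (: present only in the second tree :)
    <added>{ $b }</added>
  else if (node-name($a) ne node-name($b)) then
    (: different element names: report as a removal plus an addition :)
    (<removed>{ $a }</removed>, <added>{ $b }</added>)
  else if (exists($a/*) or exists($b/*)) then
    (: same name, at least one side has child elements: recurse pairwise :)
    element { node-name($a) } {
      for $i in 1 to max((count($a/*), count($b/*)))
      return local:diff($a/*[$i], $b/*[$i])
    }
  else if (normalize-space($a) eq normalize-space($b)) then
    (: leaf elements whose text is equal after whitespace normalization :)
    $a
  else
    <changed><old>{ $a }</old><new>{ $b }</new></changed>
};

(: Illustrative input documents :)
let $old := <book><title>XQuery</title><year>2009</year></book>
let $new := <book><title>XQuery</title><year>2010</year><isbn>123</isbn></book>
return local:diff($old, $new)
```

For this input the result is a book element in which title is copied unchanged, year is wrapped in changed, and isbn is wrapped in added. Positional matching is the simplest choice, but it means that inserting one element near the top of a file marks every later sibling as changed.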

XML Difference Output Format
We want to create an XML output format that allows the user to easily display the output using a side-by-side file comparison method.

For example the output might look like:

Formatting the output for HTML and CSS
The above output can be considered raw semantic markup, with no concern for how a web site will display it using standard HTML div blocks and CSS. As a second step, we can place the output in two HTML blocks: one for the initial file, usually on the left, and one for the second file, usually on the right, with the changes wrapped in marker tags. Each div will have a class attribute that allows the CSS file to place the output anywhere on an HTML page. For example, one element may be placed on the left and another may be styled in green.
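As a rough sketch of this second step, the query below turns a raw diff into two div blocks. Everything about the vocabulary is an assumption made for illustration: the raw diff is taken to wrap differences in removed, added, and changed (with old and new children) elements, the changes are marked with span tags, and the class names left, right, removed, added, and changed are invented here:

```xquery
xquery version "3.0";

(: Hypothetical helper: renders one side ("left" or "right") of a raw diff.
   The raw diff vocabulary and all class names are assumptions. :)
declare function local:side($nodes as node()*, $side as xs:string) as node()* {
  for $n in $nodes
  return
    typeswitch ($n)
      case element(removed) return
        (: removals are shown only on the left :)
        if ($side eq "left")
        then <span class="removed">{ string($n) }</span>
        else ()
      case element(added) return
        (: additions are shown only on the right :)
        if ($side eq "right")
        then <span class="added">{ string($n) }</span>
        else ()
      case element(changed) return
        (: old text on the left, new text on the right :)
        <span class="changed">{
          string(if ($side eq "left") then $n/old else $n/new)
        }</span>
      case element() return
        (: any other element becomes a div named after it; recurse :)
        <div class="{ local-name($n) }">{ local:side($n/node(), $side) }</div>
      default return
        $n
};

(: Illustrative raw diff :)
let $raw :=
  <book>
    <title>XQuery</title>
    <changed><old><year>2009</year></old><new><year>2010</year></new></changed>
    <added><isbn>123</isbn></added>
  </book>
return
  <div class="diff">
    <div class="left">{ local:side($raw, "left") }</div>
    <div class="right">{ local:side($raw, "right") }</div>
  </div>
```

A stylesheet can then float the left and right blocks next to each other and color the spans by class.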

Algorithm
The O(ND) difference algorithm was originally designed to compare text files, using line breaks to delimit the fundamental unit of comparison. We will need to modify it to recursively compare XML elements and attributes. The XML comparison should also not report differences in the order of attributes.
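One way to keep attribute order out of the comparison is to match attributes by name rather than by position. The sketch below shows only that part and is not the O(ND) algorithm itself. The function name local:attribute-changes and the output element names are hypothetical:

```xquery
xquery version "3.0";

(: Hypothetical helper: compares the attributes of two elements by name,
   so the order in which they are written never matters. :)
declare function local:attribute-changes($a as element(), $b as element()) as element()* {
  let $removed-or-changed :=
    for $att in $a/@*
    let $other := $b/@*[node-name(.) eq node-name($att)]
    return
      if (empty($other)) then
        <attribute-removed name="{ name($att) }" value="{ $att }"/>
      else if (string($att) ne string($other)) then
        <attribute-changed name="{ name($att) }" old="{ $att }" new="{ $other }"/>
      else
        ()
  let $added :=
    for $att in $b/@*
    where empty($a/@*[node-name(.) eq node-name($att)])
    return <attribute-added name="{ name($att) }" value="{ $att }"/>
  return ($removed-or-changed, $added)
};

(: Illustrative input: "size" is unchanged but written in a different position :)
local:attribute-changes(
  <item id="1" color="red" size="L"/>,
  <item size="L" color="blue" weight="2"/>
)
```

For this input the function reports id as removed, color as changed, and weight as added, and it reports nothing for size. The order in which a processor returns attributes is implementation-dependent, so the order of the reported changes may vary.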

To be continued...