XQuery/Finding Duplicate Documents

Motivation
You have a set of documents that may contain identical copies of some documents. You want to detect and remove all duplicate documents.

Method
There are two approaches. One is to write a "compare" function that compares every document in the set against every other document. The problem with this approach is that it runs in roughly n-squared time, which can be very slow for larger data sets. This "pairwise compare" is best when you have a very small set of data.
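As a rough illustration of the pairwise approach, the following sketch uses the standard deep-equal() function to flag any item that matches an earlier item in the sequence. The three inline elements are made-up sample data, not part of the original example.

```xquery
xquery version "1.0";

(: Hypothetical sample data: the first and third items are identical. :)
let $docs := (<a>1</a>, <b>2</b>, <a>1</a>)
return
    for $doc at $i in $docs
    (: Compare each item against every item that came before it. :)
    where some $earlier in subsequence($docs, 1, $i - 1)
          satisfies deep-equal($doc, $earlier)
    return <duplicate position="{$i}"/>
```

With this sample data only the third item is reported. Every item is compared with all earlier items, which is where the n-squared cost comes from.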

In this example we will create a unique "hash" value for each document. You can then remove all documents that have the same hash value. Hashing is useful because once a hash has been created it can be stored and used for future comparisons. You thus have a very consistent way of finding out whether a new document is unique. Comparing document hashes gives users a very quick answer to the question: have we seen this document before?

About Hash Functions
A hash function takes an input string and applies a mathematical function to it, producing a fairly short string that is, in practical terms, unique: the probability that two different documents have the same hash value is very, very low. For example, with a 128-bit hash such as MD5 you would need on the order of 2^64 documents (about 1.8 × 10^19) before an accidental collision becomes likely; at a million hashes per second, that would take several hundred thousand years. This is good enough for most business applications. The key property is that if even a single character differs between two documents, their hashes will be completely different.

The way we use hash functions is to use an XQuery function such as

util:hash($input, $algorithm) (see eXist documentation)

where $input is your input document and $algorithm is a string that identifies which hash function you will use. Common algorithms include 'md5', 'sha1' and 'sha256'. Hash functions always return a single string called a hash value, hash code, hash sum, checksum or simply a hash.

Here is a sample of how to use the md5 hash function.
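A minimal sketch of such a call, assuming an eXist-db installation where the util module is available; the input string and the wrapper element names are illustrative choices:

```xquery
xquery version "1.0";

(: Illustrative input string; any string will do. :)
let $input := 'Hello World!'
return
    <results>
        <input>{$input}</input>
        <md5>{util:hash($input, 'md5')}</md5>
    </results>
```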

returns:

Timing Your Hash Function
MD5 is a popular hash algorithm because it is very fast and always returns strings of a consistent length that are easy to store, pass as REST parameters and compare. Here is a sample XQuery that calculates a hash for the entire XML file of the play "Hamlet", which is about 7,842 lines of XML.
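A sketch of what such a timing query might look like, assuming eXist's util:system-time() function and a copy of the play stored at the path shown (the path is an assumption; adjust it to wherever the file lives in your database):

```xquery
xquery version "1.0";

(: Assumed location of the play in the database. :)
let $doc-path := '/db/shakespeare/plays/hamlet.xml'
let $start := util:system-time()
let $hash := util:hash(doc($doc-path), 'md5')
let $end := util:system-time()
(: Subtracting two xs:time values gives a duration; dividing by
   one millisecond converts it to a plain number. :)
let $elapsed-ms := ($end - $start) div xs:dayTimeDuration('PT0.001S')
return
    <results>
        <hash>{$hash}</hash>
        <elapsed-ms>{$elapsed-ms}</elapsed-ms>
    </results>
```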

This program uses a "system timer" and returns the following result:

When run multiple times, the example above returns either 0 or 15 milliseconds on my local system. This suggests that the measured time mostly reflects disk I/O and that the hash itself runs very quickly, usually in under 10 milliseconds even for large files on a slow computer.

Note that a hash function is very sensitive to the way documents are "serialized", and the default behavior may not be what you expect. By default, the hash works only on the string value, or element "content", of a document. The string value of an XML node does not include the attributes or element names of the document. This is not an error, since hash functions are designed to work on strings, not XML documents. The following example hashes two documents that have identical content but different element names:
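One way to see this effect is sketched below. The two inline documents are made-up sample data with the same text content but different element names; if the hash is taken over the string value as described above, both hashes should come out identical.

```xquery
xquery version "1.0";

(: Hypothetical sample documents: same text, different element names. :)
let $doc1 := <a><b>Hello</b><c>World</c></a>
let $doc2 := <x><y>Hello</y><z>World</z></x>
let $hash1 := util:hash($doc1, 'md5')
let $hash2 := util:hash($doc2, 'md5')
return
    <results>
        <hash1>{$hash1}</hash1>
        <hash2>{$hash2}</hash2>
        <same>{$hash1 eq $hash2}</same>
    </results>
```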

returns the following:

Carefully Define Document Equality
When we ask the question "Do I already have a copy of this document?" we first need to define what a duplicate document means. In this case we will define it as an XML document that has exactly the same attributes and elements in exactly the same order. Technically, two XML documents may list their attributes in a different order; depending on your needs, they might or might not be considered duplicates. The number of spaces before and after an element may or may not be significant. Whatever your method, you should define the concept of "sameness" carefully for your situation.

Create a Consistent Serialization for Each Document
We would like to have a function that turns each XML document into a single string with all indentation removed. This is sometimes called the "canonical" version of an XML document. The following function creates a single string for each XML document:

If you are running eXist 1.5 the util module has a "serialize" function that will convert an entire XML document to a single string.

let $string-version-of-xml-file := util:serialize($input-xml-file, ())

In the following example the files are identical except for the name of a single attribute.
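A minimal sketch of this comparison, assuming util:serialize() accepts an empty sequence as its parameter argument and using two made-up inline documents that differ only in the name of one attribute:

```xquery
xquery version "1.0";

(: Hypothetical sample documents: only the attribute name differs. :)
let $doc1 := <person id="42"><name>Ann</name></person>
let $doc2 := <person key="42"><name>Ann</name></person>
(: Serialize first so that element and attribute names are part of
   the string that gets hashed. :)
let $hash1 := util:hash(util:serialize($doc1, ()), 'md5')
let $hash2 := util:hash(util:serialize($doc2, ()), 'md5')
return
    <results>
        <hash1>{$hash1}</hash1>
        <hash2>{$hash2}</hash2>
        <same>{$hash1 eq $hash2}</same>
    </results>
```

Because the serialized strings differ, the two hashes should differ as well, unlike a hash taken over the string value alone.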

If your system does not have a serialize function you can create your own using a simple recursive function.

Converting an XML Document to a String
The following function can be used to copy an XML file in a consistent format. The advantage of this is that all attributes will come out in a consistent order.
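One possible shape for such a function is sketched below. The name local:normalize, the choice to sort attributes by name, and the choice to drop whitespace-only text nodes are all assumptions made for illustration; whether the constructed attribute order is preserved on output can depend on the processor.

```xquery
xquery version "1.0";

(: Hypothetical helper: copies a node recursively, emitting attributes
   sorted by name and dropping whitespace-only text nodes. :)
declare function local:normalize($node as node()) as node()? {
    typeswitch ($node)
        case document-node() return
            document { for $child in $node/node() return local:normalize($child) }
        case element() return
            element { node-name($node) } {
                (for $att in $node/@* order by name($att) return $att),
                (for $child in $node/node() return local:normalize($child))
            }
        case text() return
            if (normalize-space($node) eq '') then () else $node
        default return $node
};

(: Made-up sample input with attributes in non-alphabetical order. :)
let $doc := <item b="2" a="1"><title>Hamlet</title></item>
return local:normalize($doc)
```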

This function can also be modified to add or remove elements or attributes that are not relevant to your comparison. See: Identity Transform with XQuery

Comparing All Files In A Collection
The following program will calculate the hash for all files in a collection. It will then report which hash values are the same.
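A sketch of such a report, assuming eXist's util:hash() and util:serialize() functions; the collection path and the output element names are hypothetical:

```xquery
xquery version "1.0";

(: Assumed collection path; change it to the collection you want to scan. :)
let $collection := '/db/test/duplicates'

(: Build one small record per document holding its URI and hash. :)
let $entries :=
    for $doc in collection($collection)
    return
        <entry uri="{document-uri($doc)}"
               hash="{util:hash(util:serialize($doc, ()), 'md5')}"/>
return
    <duplicates>{
        (: Report only hash values shared by more than one document. :)
        for $hash in distinct-values($entries/@hash)
        let $matches := $entries[@hash = $hash]
        where count($matches) gt 1
        return
            <group hash="{$hash}">{
                for $match in $matches
                return <doc>{string($match/@uri)}</doc>
            }</group>
    }</duplicates>
```

Each document is hashed once, so the cost grows roughly linearly with the number of documents rather than quadratically as in the pairwise approach.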