XQuery/Lucene Search

Motivation
You want to perform a full text keyword search on one or more XML documents. This is done using the Lucene index extensions to eXist.

Background
The Apache Lucene full text search framework was added to eXist 1.4 as a full text index, replacing the previous native full text index. The new Lucene full text search framework is faster, more configurable, and more feature-rich than eXist's legacy full text index. It will also be the basis for an implementation of the W3C's full text extensions for XQuery.

eXist associates a distinct node-id with each node in an XML document. This node-id is used as the Lucene document ID in the Lucene index files; that is, each XML node becomes a Lucene document. As a result, you can customize the search weight of keyword matches at the level of individual nodes. For example, a match of a keyword within a title can be given a higher score than a match in the body of a document, so that a hit in a document title is more likely to be ranked first when searching a large number of documents. Because document structure is retained, searches can achieve higher precision and recall than in search systems that discard it.

eXist and Lucene Documentation
The following is the eXist documentation on how to use Lucene:

eXist-db Lucene Documentation

eXist supports the full Lucene Query Parser Syntax (with the exception of "fielded search"):

Lucene Query Parser Syntax

Setting up a Lucene Index
In order to perform Lucene-indexed, full text searching of this document, we need to create an index configuration file, collection.xconf, describing which elements and attributes should be indexed, and the various details of that indexing:

Notes:
 * If your test data are saved in db/test, you should save collection.xconf in db/system/config/db/test. Index configuration files are always saved in a directory structure inside system/config/db which is isomorphic to the directory structure of db.
 * After you create or update this index configuration file, you will need to reindex the data. You can do this either by using the eXist Java-based admin client, selecting the test collection and choosing "Reindex collection", or by using the xmldb:reindex function, supplying xmldb:reindex('/db/test') in eXide or in the XQuery Sandbox.
 * Although the legacy full text index is not needed for Lucene-based search, we have explicitly enabled it here for this example configuration in order to point out the expressive similarities between the Lucene and legacy search functions/operators (i.e. Lucene's ft:query vs. the legacy full text index's &=, |=, near, text:match-all, text:match-any).
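
As a rough sketch of the steps described in the notes above, the following XQuery stores a minimal Lucene configuration for /db/test and then reindexes the collection. The configuration contents are an assumption made for illustration (a standard analyzer and a single path index on test); it is not the original example file and it leaves out the legacy full text index. The sketch also assumes that /db/system/config/db already exists and that the query runs as a user allowed to write there.

```xquery
(: Minimal index configuration; the indexed path //test is illustrative only :)
let $config :=
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <lucene>
                <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
                <text match="//test"/>
            </lucene>
        </index>
    </collection>
(: The configuration collection mirrors the data path under /db/system/config :)
let $config-collection := xmldb:create-collection('/db/system/config/db', 'test')
let $stored := xmldb:store($config-collection, 'collection.xconf', $config)
return
    (: Rebuild the indexes so that the new configuration takes effect :)
    xmldb:reindex('/db/test')
```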

Indexing Strategies
You can either define a Lucene index on a single element or attribute name (qname="...") or on a node path (match="...").

If you define an index on a qname, such as test, an index is created on <test> alone. What is passed to Lucene is the string value of <test>, which includes the text of all its descendant text nodes. With such an index, one cannot search for the nodes below <test>, e.g. for <p> or <name>, since such nodes have all been collapsed. If you want to be able to query descendant nodes, you should set up additional indexes on these, such as on the qnames p or name.

If you define an index on a node path, as above with the path to the test element, the node structure below <test> is maintained in the index and you can still query descendant nodes, such as <p> or <name>. This can be seen as a shortcut to establishing an index on all elements below <test>. Be aware that, according to the documentation, this feature is "subject to change".

When deciding which approach to use, you should consider which parts of your document will be of interest as context for full text query. How narrow or broad to make it is best decided when considering concrete search scenarios.

Standard Lucene query syntax
eXist can process Lucene searches expressed in two kinds of query syntax, Lucene's standard query syntax and an XML syntax specific to eXist. In this section the standard query syntax is presented. This is the syntax one can expect a user to input in a search field.

A search for "Ron" in the current context will be expressed as [ft:query(., 'ron')]. The first argument holds the nodes to be searched, here ".", the current context node. The second argument supplies the query string, here simply the word "ron".
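
A minimal sketch of such a query in context, assuming the test data live in /db/test and that the p elements are covered by a Lucene index:

```xquery
(: Return every p element in /db/test whose indexed text contains "ron" :)
collection('/db/test')//p[ft:query(., 'ron')]
```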

The ft:query function allows the use of Lucene wildcards.

"?" can be used for a single character and "*" for zero, one or more characters: "edward" is found with "ed?ard" and "e*d". Lucene standard query syntax does not allow "*" and "?" to occur at the beginning of a word. In eXist, however, it is possible to add an option to the query to allow leading wildcards in searches; see the eXist Lucene Documentation.

Fuzzy searches, with "~" at the end of a word, make it possible to retrieve "ron" through "don~". One can quantify the fuzziness by appending a number between 0.0 and 1.0, making it possible to retrieve "ron" by [ft:query(., 'don~0.6')], but not by [ft:query(., 'don~0.7')]. Values closer to 1.0 require a closer match. The amount of fuzziness is based on the Levenshtein distance (edit distance) algorithm. The default is 0.5.

The boolean operators "AND" and "OR" can be used, with the expected semantics. There is a variant notation for this: [ft:query(., 'edward AND ron')] can also be written [ft:query(., '+edward +ron')]. [ft:query(., '+edward ron')] would require "edward", but not "ron", to be present. "NOT" can also be used: [ft:query(., 'edward NOT ron')] finds "edward" without "ron". "NOT" can also be represented with "-": [ft:query(., '+edward -ron')]. Operators can be grouped with parentheses, as in [ft:query(., '(edward OR ron) NOT things')].
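
The operator variants above can be compared side by side. The following is only a sketch, assuming the test data in /db/test contain indexed p elements; it counts the hits for three of the query strings discussed:

```xquery
let $paragraphs := collection('/db/test')//p
return
    <results>
        <!-- both terms required -->
        <both>{ count($paragraphs[ft:query(., 'edward AND ron')]) }</both>
        <!-- at least one of the terms -->
        <either>{ count($paragraphs[ft:query(., 'edward OR ron')]) }</either>
        <!-- "edward" required, "ron" excluded -->
        <edward-only>{ count($paragraphs[ft:query(., '+edward -ron')]) }</edward-only>
    </results>
```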

Phrases can be searched for by putting them in quotation marks: [ft:query(., '"other issues"')].

Fields, proximity searches, range searches, boosting, and escaped reserved characters are not supported in eXist with queries using Lucene's standard query syntax. Boosting can, however, be applied during indexing; see the eXist Lucene Documentation.

See Lucene Query Parser Syntax

Indexing
Since we have indexed the test element as a path, the index includes descendant nodes, and queries for nested elements therefore also return hits:

collection('/db/test')/test/p/name[ft:query(., 'edward')]
collection('/db/test')/test/p[ft:query(name, 'edward')]

If we had instead indexed test by qname, we would not be able to do so.

Stopwords
The standard Lucene analyzer (StandardAnalyzer), activated in the above collection.xconf file, applies the Lucene default list of English stop words and removes the following words from the index: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.

If you wish to customize the list of stopwords, specify an analyzer with the absolute location on your file system of a text file (called stop.txt here) in which the stopwords you wish to apply are listed, separated by newlines, and reindex the collection.

If you wish to make all words searchable, you can leave stop.txt empty or omit the reference to stop.txt.

After making these changes, restart eXist and reindex.

Ranking
Lucene assigns a relevance score or rank to each match. The more frequently a word occurs in a document, the higher the score. This score is preserved by eXist and can be accessed through the ft:score function, which returns a decimal value.

The higher the score, the more relevant the hit.
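
As an illustration, the score can be used to sort hits so that the most relevant come first. This sketch assumes indexed p elements in /db/test; the hit wrapper element is a hypothetical name chosen for the example:

```xquery
for $hit in collection('/db/test')//p[ft:query(., 'edward')]
(: ft:score returns the relevance score Lucene computed for this hit :)
let $score := ft:score($hit)
order by $score descending
return
    <hit score="{$score}">{ string($hit) }</hit>
```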

Boosting Values
The configuration file can be set up to apply higher search weights to specific elements within your document. For example, a match of a keyword in the title of a book can be made to rank higher than matches in the body of the book.

Legacy Full Text Search Vs. Lucene XML Search
The following queries are equivalent (apart from the index used):

Matching any terms
To express the legacy "match any" (|=) full text query, an equivalent query can be written using the new Lucene query function.
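
A sketch of what such a pair might look like, using the terms "edward" and "ron" against p elements in /db/test (both assumptions for illustration). The legacy form is shown only in a comment, since the legacy operators depend on the legacy index being available:

```xquery
(: Legacy form, for comparison only: collection('/db/test')//p[. |= 'edward ron'] :)
(: Lucene form: terms separated by whitespace are combined with OR by default :)
collection('/db/test')//p[ft:query(., 'edward ron')]
```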

Matching all terms
To express the legacy "match all" (&=) full text query, an equivalent query can be written using the new Lucene query function.
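
A sketch of a "match all" pair, under the same assumptions about the test data (p elements in /db/test, terms "edward" and "ron"):

```xquery
(: Legacy form, for comparison only: collection('/db/test')//p[. &= 'edward ron'] :)
(: Lucene form: "+" marks each term as required; 'edward AND ron' is the variant notation :)
collection('/db/test')//p[ft:query(., '+edward +ron')]
```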

Matching no terms
To express the legacy "match none" (not + |=) full text query, an equivalent query can be written using the new Lucene query function. Note that it could not be expressed as a Lucene query string containing only negated terms, because Lucene's NOT operator can't be used on its own, without the presence of a 'positive' search term.
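
A sketch of a "match none" query, again assuming p elements in /db/test and the terms "edward" and "ron". The negation is applied in XQuery with not(), rather than inside the Lucene query string:

```xquery
(: Legacy form, for comparison only: collection('/db/test')//p[not(. |= 'edward ron')] :)
(: Lucene form: match either term, then negate the result outside the query string :)
collection('/db/test')//p[not(ft:query(., 'edward ron'))]
```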

XML Query Syntax vs. Default Lucene Syntax
The following queries are equivalent, and can be tested against the Shakespeare examples shipped with eXist, by supplying them as the value of $query in an XQuery snippet.
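
A test harness along these lines might look as follows. This is only a sketch: the collection path /db/shakespeare is an assumption (the location of the sample plays differs between eXist versions), and the query string is a placeholder to be replaced by any of the queries under test, in either standard or XML syntax.

```xquery
(: Placeholder query; replace with a standard query string or a <query> element :)
let $query := 'fenny snake'
for $speech in collection('/db/shakespeare')//SPEECH[ft:query(., $query)]
return $speech
```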

Mind the gaps in the table above! In standard Lucene syntax you can't express:

 * regular expressions: this is a unique feature of eXist's XML query syntax, by means of the <regex> element
 * ordering of proximity search terms: this is a unique feature of eXist's XML query syntax, by means of the @ordered attribute on <near>

Finally, a more complex case, in which boolean operators are grouped to override default priority rules:

Note how:
 * grouping in standard Lucene syntax can be expressed by nesting <bool> elements in XML syntax
 * for nested <bool> operators, the @occur attribute can be specified as well
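
As a hedged illustration of grouping (not the original example), the following sketch pairs a grouped standard query with an XML query intended to correspond to it, and counts the hits for each so the two can be compared. The search terms and the collection path /db/shakespeare are assumptions.

```xquery
let $speeches := collection('/db/shakespeare')//SPEECH
(: Standard syntax: parentheses group the OR before NOT is applied :)
let $standard := '(fenny OR adder) NOT snake'
(: XML syntax: the group becomes a nested bool with its own occur attribute :)
let $xml :=
    <query>
        <bool>
            <bool occur="must">
                <term occur="should">fenny</term>
                <term occur="should">adder</term>
            </bool>
            <term occur="not">snake</term>
        </bool>
    </query>
return
    <counts>
        <standard>{ count($speeches[ft:query(., $standard)]) }</standard>
        <xml>{ count($speeches[ft:query(., $xml)]) }</xml>
    </counts>
```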

Notes on Using Wildcards
Note that if you include a wildcard in a search string in XML query syntax, the <wildcard> element must be used to enclose the string:

The following:

//SPEECH[ft:query(., 'fenny sna*')]

is equivalent to:
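
A sketch of the XML form, assuming that terms inside <bool> without an @occur attribute are treated as optional ("should"), which matches the default OR behaviour of the standard syntax:

```xquery
(: The plain term goes in <term>; the wildcard pattern must go in <wildcard> :)
//SPEECH[ft:query(.,
    <query>
        <bool>
            <term>fenny</term>
            <wildcard>sna*</wildcard>
        </bool>
    </query>)]
```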