XQuery/Keyword Search

Motivation
You want to create a Google-style keyword search interface to an XML database with relevance-ranked, full-text search of selected nodes and search results in which the keyword in context is highlighted, as shown below.



Method
Our search engine will receive keywords from a simple HTML form, assigning them to the variable $q. Then it (1) parses the keywords, (2) constructs the scope of the query, (3) executes the query, (4) scores and sorts the hits according to the score, (5) shows the linked results with a summary containing the keyword highlighted in context, and (6) paginates the results.

Note: This tutorial was written against eXist 1.3, which was a development version of eXist. eXist 1.4 has since been released, and it changed several aspects of eXist slightly. This article has not yet been fully updated to account for the changes. The most notable changes are that (1) the kwic.xql file referenced here is now a built-in module and (2) the previous default fulltext search index (whose search operator appears below as &=) is disabled by default in favor of the new, Lucene-based fulltext index, which speeds up both search and scoring considerably. The changes required to make the code work with 1.4 will be extensive, but the article is nonetheless instructive in its current form. Lastly, this example will not run under versions prior to 1.3.

Example Collections and Data
Let's assume that you have three collections: /db/test, /db/test/articles, and /db/test/people.

The articles and people collections contain XML files with different schemas: "articles" contains structured content, and "people" contains biographical information about people mentioned in the articles. We want to search both collections using a full-text keyword search, and we want to search specific nodes of each collection: the body of the articles and the biographies of the people. Fundamentally, our search expression is:

for $hit in (collection('/db/test/articles')/article/body,
             collection('/db/test/people')/person/biography)[. &= $q]

Note: "&=" is an eXist fulltext search operator. It returns the nodes that match the tokenized contents of $q. See the eXist documentation for more information.
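For readers on eXist 1.4 or later, where the older fulltext index is disabled by default, the same search can be sketched with the Lucene module's ft:query function. This is only a rough equivalent, not the tutorial's own code: it assumes a Lucene index has been configured on the body and biography elements, and the hard-coded value of $q is a placeholder for the submitted keywords.

```xquery
xquery version "1.0";

(: Assumption: runs inside eXist 1.4 or later, with a Lucene index defined
   on article/body and person/biography in the collection configuration. :)
declare namespace ft = "http://exist-db.org/xquery/lucene";

(: Assumption: a fixed keyword stands in for the value received from the form. :)
let $q := "xquery"
for $hit in (collection('/db/test/articles')/article/body,
             collection('/db/test/people')/person/biography)[ft:query(., $q)]
return $hit
```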

Assume the two data collections contain the following files:

Collection A
File='/db/test/articles/1.xml'

Collection B
File='/db/test/people/2.xml'

Search Form
File='/db/test/search.xq'

Note that the form element can also contain an action attribute such as action="search.xq" to specify the XQuery script that should receive the submission.
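A minimal sketch of such a form, written as an XQuery that returns HTML, might look like the following. The field name q is chosen to match the $q variable used in the rest of this article; the page title and label text are illustrative assumptions.

```xquery
xquery version "1.0";

(: A bare search page. The action attribute is optional: without it the
   browser submits the form back to the same URL. :)
<html>
    <head>
        <title>Keyword Search</title>
    </head>
    <body>
        <form method="GET" action="search.xq">
            <p>
                <strong>Keyword Search:</strong>
                <input name="q" type="text"/>
                <input type="submit" value="Search"/>
            </p>
        </form>
    </body>
</html>
```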

Receive Search Submission
It's nice to show the submitted keywords in the search field, so we capture the search submission in the variable $q using the request:get-parameter function. We then change the input element so that it contains the value of $q whenever there is one.
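As a sketch of this step, the fragment below reads the q parameter and echoes it back into the form. It only runs inside eXist when called over HTTP, because the request module needs a request to read from. The empty-string default is an assumption made so that the field is simply blank on the first visit.

```xquery
xquery version "1.0";

declare namespace request = "http://exist-db.org/xquery/request";

(: The second argument is the value returned when no "q" parameter was sent. :)
let $q := request:get-parameter("q", "")
return
    <form method="GET">
        <p>
            <strong>Keyword Search:</strong>
            <input name="q" type="text" value="{$q}"/>
            <input type="submit" value="Search"/>
        </p>
    </form>
```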

Filter Search Parameters
To prevent XQuery injection attacks, it is good practice to cast the $q variable to xs:string and to filter unwanted characters out of the search parameters.
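One way to sketch this filter is a small function that casts its input to a string and deletes a fixed set of characters. The function name local:strip-unwanted and the particular characters removed are illustrative assumptions, not a vetted security list.

```xquery
xquery version "1.0";

(: Cast the raw value to xs:string, then delete every character in the class:
   ampersand, quotes, angle brackets, semicolon, and all bracket types. :)
declare function local:strip-unwanted($raw as item()?) as xs:string {
    replace(xs:string($raw), "[&amp;&quot;'<>;\(\)\{\}\[\]]", "")
};

(: Assumption: a literal stands in for the submitted parameter. :)
local:strip-unwanted("cats &amp; <dogs>; (birds)")
```

Deleting characters can leave doubled spaces behind; wrapping the result in normalize-space() would tidy them if that matters.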

An alternative method of filtering is to only allow characters that are in a whitelist:
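A whitelist can be sketched with a negated character class, which deletes everything that is not explicitly allowed. The allowed set here (ASCII letters, digits, hyphen, comma, period, and space) and the name local:keep-allowed are assumptions for illustration; a real search form would probably need to admit accented and non-Latin letters as well.

```xquery
xquery version "1.0";

(: Keep only characters in the allowed set; the leading ^ negates the class,
   so anything not listed is replaced with nothing. :)
declare function local:keep-allowed($raw as item()?) as xs:string {
    replace(xs:string($raw), "[^0-9a-zA-Z\-,. ]", "")
};

(: Assumption: a literal stands in for the submitted parameter. :)
local:keep-allowed("O'Brien; drop(table)")
```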

Construct Search Scope
In the context of a native XML database, the scope of a search can be very fine-grained, using the full expressive power of XPath. We can choose to target specific collections, documents, and nodes within documents. We can also target specific element namespaces, and we can use predicates to limit results to elements with a specific attribute. In our example, we will target two collections, with a specific XPath for each. We create this search scope as a sequence of XPath expressions:
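A sketch of the scope, reusing the two paths from the search expression shown earlier, could be written as follows. It assumes the two collections exist in the database; the final count is there only so that the query returns something visible.

```xquery
xquery version "1.0";

(: Each item in the sequence is an XPath expression selecting the nodes
   that should be searchable in one collection. :)
let $scope :=
    (
        collection('/db/test/articles')/article/body,
        collection('/db/test/people')/person/biography
    )
return count($scope)
```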

Construct Search String and Execute Search
Although we could execute our search directly using the example above (under "Example Collections and Data"), we'll have much more flexibility if we first construct our search as a string and then execute it using the util:eval function.
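The idea can be sketched as below: the predicate is assembled as a string and handed to util:eval. This relies on two assumptions: that the evaluated string can see the $scope and $q variables of the calling query, and that the older fulltext index behind the &= operator is enabled. The hard-coded $q is a placeholder for the filtered user input.

```xquery
xquery version "1.0";

declare namespace util = "http://exist-db.org/xquery/util";

(: Assumption: $q has already been received and filtered. :)
let $q := "xquery"
let $scope :=
    (
        collection('/db/test/articles')/article/body,
        collection('/db/test/people')/person/biography
    )
(: Inside an XQuery string literal an ampersand must be written as &amp;,
   so the string built here reads: $scope[. &= $q] :)
let $search-string := concat('$scope', '[. &amp;= $q]')
let $hits := util:eval($search-string)
return count($hits)
```

Because the predicate is only a string at this point, it could be swapped for a different one (for example, depending on a form option) before the search is executed.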

Score and Sort Search Results
Without sorting, the results would come back in "document order", that is, the order in which the matching nodes appear in the documents. Results can be sorted according to any criterion: alphabetical order, date order, the number of keyword matches, and so on. We will use a simple relevance algorithm to score our results: the number of keyword matches divided by the string length of the matching node. Using this algorithm, a hit with 1 match that is 10 characters long (score 0.1) will score higher than a hit with 2 matches that is 100 characters long (score 0.02).
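The scoring rule can be sketched in plain XQuery as follows. This is an approximation that counts whole-word matches by tokenizing the text, rather than asking the fulltext index for its match count. The function name local:score and the inline sample hits are assumptions made so that the example runs without a database.

```xquery
xquery version "1.0";

(: Score = number of words equal to a keyword, divided by the text length. :)
declare function local:score($text as xs:string, $keywords as xs:string*) as xs:decimal {
    let $words := tokenize(lower-case($text), "\W+")
    let $wanted := for $k in $keywords return lower-case($k)
    let $matches := count($words[. = $wanted])
    let $length := string-length($text)
    return
        if ($length eq 0) then 0.0 else $matches div $length
};

(: Assumption: inline elements stand in for real search hits. :)
let $keywords := ("xquery")
let $hits :=
    (
        <body>XQuery is a query language for XML. Many databases support XQuery.</body>,
        <body>XQuery fun</body>
    )
for $hit in $hits
let $score := local:score(string($hit), $keywords)
order by $score descending
return <hit score="{$score}">{string($hit)}</hit>
```

The short second hit is returned first: its single match in 10 characters outweighs the two matches in the much longer first hit.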

Show Results with Highlighted Keyword in Context
We want to show each result as an HTML div element containing three components: the title of the hit, a summary with an excerpt of the hit showing the keywords highlighted in context, and a link to display the full hit. Depending on the collection, these components will be constructed differently; we use the collection as the 'hook' to drive the display of each type of result. (Note: Other 'hooks' could be used, including namespace, node name, etc.)

We will create our highlighted keyword search summary by importing a module called kwic.xql and using a function inside called kwic:summarize. The kwic:summarize function highlights the first matching keyword term in a hit, and returns the surrounding text. kwic.xql was written by Wolfgang Meier and is distributed in eXist version 1.3b. We will place kwic.xql in the eXist database inside the /db/test/ collection.
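To illustrate what a keyword-in-context summary does, here is a plain XQuery stand-in. It is not the kwic.xql module and does not use the index's match information: it simply finds the first case-insensitive occurrence of one keyword and wraps it in a span. The function name local:summarize, the width parameter, and the class name hi are all assumptions.

```xquery
xquery version "1.0";

(: Return up to $width characters on either side of the first occurrence of
   $keyword, with the keyword itself wrapped in a span for CSS highlighting.
   Limitation: assumes lower-casing does not change the length of the text. :)
declare function local:summarize(
    $text as xs:string, $keyword as xs:string, $width as xs:integer
) as element(p) {
    let $lower := lower-case($text)
    let $kw := lower-case($keyword)
    return
        if ($kw eq "" or not(contains($lower, $kw))) then
            (: No match: fall back to the start of the text. :)
            <p>{substring($text, 1, 2 * $width)}</p>
        else
            let $offset := string-length(substring-before($lower, $kw))
            let $from := max((1, $offset - $width + 1))
            return
                <p>{
                    substring($text, $from, $offset - $from + 1),
                    <span class="hi">{substring($text, $offset + 1, string-length($kw))}</span>,
                    substring($text, $offset + string-length($kw) + 1, $width)
                }</p>
};

local:summarize("The XQuery language is fun", "xquery", 4)
```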

Paginate and Summarize Results
In order to reduce the result list to a manageable number, we can use URL parameters and XPath predicates to return only 10 results at a time. To do so, we need to define two new variables: $perpage and $start. As the user retrieves each page of results, the $start value will be passed to the server as a URL parameter, driving a new set of results using the XPath predicate.
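The predicate can be sketched as follows. The generated result elements and the fixed value of $start are stand-ins; in the real script $start would be read from the URL and the results would be the sorted hits.

```xquery
xquery version "1.0";

let $perpage := 10
(: Assumption: a fixed value stands in for the "start" URL parameter. :)
let $start := 11
(: Assumption: 25 dummy results stand in for the sorted hits. :)
let $results := for $i in 1 to 25 return <result n="{$i}"/>
(: Keep only the items whose position falls on the requested page. :)
return $results[position() = ($start to $start + $perpage - 1)]
```

With these values the query returns results 11 through 20, the second page.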

We also need to provide links to each page of results. To do so, we will mimic Google's pagination links, which start by displaying links to 10 pages of results, grow to as many as 20 page links, and include links to the previous and next pages. Our pagination links will only show if there are more than 10 results, and will be a simple HTML list that can be styled with CSS.
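One possible sketch of such a list is shown below. The window of up to 10 pages before and 9 pages after the current page is an assumption chosen to give that growing behaviour, and the function name, the class names, and the q and start parameter names are illustrative.

```xquery
xquery version "1.0";

declare function local:page-links(
    $q as xs:string, $total as xs:integer, $start as xs:integer, $perpage as xs:integer
) as element(ul)? {
    (: No links at all when everything fits on one page. :)
    if ($total le $perpage) then ()
    else
        let $pages := xs:integer(ceiling($total div $perpage))
        let $current := ($start - 1) idiv $perpage + 1
        let $first := max((1, $current - 10))
        let $last := min(($pages, $current + 9))
        let $base := concat("?q=", encode-for-uri($q), "&amp;start=")
        return
            <ul class="pagination">{
                if ($current gt 1)
                then <li><a href="{$base}{$start - $perpage}">Previous</a></li>
                else (),
                for $page in $first to $last
                return
                    if ($page eq $current)
                    then <li class="current">{$page}</li>
                    else <li><a href="{$base}{($page - 1) * $perpage + 1}">{$page}</a></li>,
                if ($current lt $pages)
                then <li><a href="{$base}{$start + $perpage}">Next</a></li>
                else ()
            }</ul>
};

(: Assumption: sample values; 1200 results, currently on page 11. :)
local:page-links("xquery", 1200, 101, 10)
```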

We should also provide a plain English summary of the search results, in the form "Showing all 5 of 5 results", or "Showing 10 of 1200 results."
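The summary sentence can be sketched with a small helper. The function name local:summary and the wording for the zero-result case are assumptions; the other two messages follow the forms quoted above.

```xquery
xquery version "1.0";

declare function local:summary(
    $total as xs:integer, $start as xs:integer, $perpage as xs:integer
) as xs:string {
    (: Number of results actually shown on the current page. :)
    let $shown := max((0, min(($perpage, $total - $start + 1))))
    return
        if ($total eq 0) then
            "No results found"
        else if ($shown eq $total) then
            concat("Showing all ", $total, " of ", $total, " results")
        else
            concat("Showing ", $shown, " of ", $total, " results")
};

(local:summary(5, 1, 10), local:summary(1200, 1, 10))
```

The two calls produce "Showing all 5 of 5 results" and "Showing 10 of 1200 results".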

Putting it All Together
Here is the complete search.xq, with some CSS to make the results look nice. This search XQuery is quite long, and lends itself well to refactoring by moving sections of code into separate functions.

File='/db/test/search.xq'