XQuery/Filtering Words

Motivation
Sometimes you have a text body and you want to filter out words that are on a given list, often called a stoplist.

Sample Program
The program takes a sample input text, bound to $input-text, and reports for each word whether it is on the stop list.
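The original listing is not reproduced here. A minimal sketch consistent with the discussion that follows might look like the code below. The stop list contents, the sample sentence, and the output element names (results, input, word, on-list) are all assumptions made for illustration.

```xquery
xquery version "1.0";

(: Illustrative stop list; the words chosen here are an assumption :)
let $stopwords :=
  <words>
    <word>a</word>
    <word>and</word>
    <word>in</word>
    <word>the</word>
    <word>or</word>
    <word>over</word>
  </words>

(: Illustrative input text :)
let $input-text := 'a quick brown fox jumps over the lazy dog'

return
  <results>
    <input>{$input-text}</input>
    {
      (: split on runs of whitespace, then test each word against the list :)
      for $word in tokenize($input-text, '\s+')
      return
        <word on-list="{$stopwords/word = $word}">{$word}</word>
    }
  </results>
```

With this input, the words a, over and the are marked on-list="true" and every other word is marked "false".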

Discussion
The input string is split into words using the tokenize function, which accepts two parameters: the string to be parsed and a separator expressed as a regular expression. Here words are separated by one or more spaces. The result is a sequence of words.

This program uses XPath generalized equality (the = operator) to compare the sequence $stopwords/word with the one-item sequence $word. The comparison is true if the two sequences have at least one item in common, that is, if the stop list contains the word.
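The behaviour of generalized equality on sequences can be seen in isolation with a small sketch. The two-word stop list here is only an illustrative assumption.

```xquery
xquery version "1.0";

(: Illustrative two-word stop list :)
let $stopwords := <words><word>the</word><word>over</word></words>
return (
  $stopwords/word = 'over',          (: true: one item on the left matches :)
  $stopwords/word = 'fox',           (: false: no item matches :)
  $stopwords/word = ('fox', 'the')   (: true: the two sequences share 'the' :)
)
```

The last line shows that both operands may be sequences; the comparison succeeds as soon as any pair of items is equal.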

Alternative coding
You can also perform the stopword lookup with a quantified expression of the form some ... satisfies (see XQuery/Quantified Expressions).
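A quantified lookup might be sketched as follows. The variable name $stop, the shortened stop list and the output element are assumptions for illustration.

```xquery
xquery version "1.0";

(: Illustrative stop list and input text :)
let $stopwords := <words><word>a</word><word>the</word><word>over</word></words>
let $input-text := 'a quick brown fox jumps over the lazy dog'
for $word in tokenize($input-text, '\s+')
return
  (: true if at least one stop word equals the current word :)
  <word on-list="{(some $stop in $stopwords/word satisfies $stop = $word)}">{$word}</word>
```

This gives the same answers as the generalized equality test, but spells out the iteration over the stop list explicitly.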

There are other alternatives: hold the stop words as a sequence of strings, hold them as one long string and use contains, or store them as an element in the database.
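As one illustration, the long-string alternative might be sketched like this. Padding both the stop list and the candidate word with spaces is an assumption added here so that only whole words match; without it, contains would also report a word such as "he" as present because it is a substring of "the".

```xquery
xquery version "1.0";

(: Stop words in a single string, with a space before and after every word :)
let $stopwords := ' a and in the or over '
let $input-text := 'a quick brown fox jumps over the lazy dog'
for $word in tokenize($input-text, '\s+')
return
  (: look for the word surrounded by spaces, so partial matches are excluded :)
  <word on-list="{contains($stopwords, concat(' ', $word, ' '))}">{$word}</word>
```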

There are, however, significant differences in performance. A set of unit tests shows the differences between a number of alternatives.

What these tests reveal is that, on the eXist db platform, both of the suggested implementations are far from optimal. Testing against a sequence of strings takes about a fifth of the time needed to test against elements. Generalized equality is similarly superior to the use of a quantified expression.

Recommended Practice
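It would appear that the preferable approach is to hold the stop words as a sequence of strings and to test each word against it with generalized equality.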
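A minimal sketch of this approach, assuming an illustrative stop list and input sentence, filters the stop words out of the text:

```xquery
xquery version "1.0";

(: Stop words held as a plain sequence of strings (illustrative list) :)
let $stopwords := ('a', 'and', 'in', 'the', 'or', 'over')
let $input-text := 'a quick brown fox jumps over the lazy dog'
for $word in tokenize($input-text, '\s+')
(: keep only the words that are not on the stop list :)
where not($stopwords = $word)
return $word
```

For this input the query returns the sequence quick, brown, fox, jumps, lazy, dog.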

If the stop words are held as an element, it is better to convert them to a sequence of atomic values first.
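The conversion might be sketched as below. The variable names $stoplist and $stopwords and the list contents are assumptions; the point is that the word elements are turned into strings once, before the loop, rather than on every comparison.

```xquery
xquery version "1.0";

(: Stop list held as an element (illustrative content) :)
let $stoplist :=
  <words>
    <word>a</word>
    <word>the</word>
    <word>over</word>
  </words>

(: Convert the word elements to a sequence of strings once, up front :)
let $stopwords := $stoplist/word/string(.)

let $input-text := 'a quick brown fox jumps over the lazy dog'
for $word in tokenize($input-text, '\s+')
where not($stopwords = $word)
return $word
```

If the stop list is stored in the database, $stoplist could instead be bound with the doc function to wherever the document is kept; that path depends on the installation and is not shown here.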

Note that referencing the stop list in the database slightly improved performance.