XQuery/Latent Semantic Indexing

Motivation
You have a collection of documents and, for any given document, you want to find the documents that are most similar to it.

Method
We will use a text-mining technique called "Latent Semantic Indexing". We first create a matrix of all concept words (terms) by all the documents. Each cell holds the number of times a term occurs in a document. We then send this term-document matrix to a service that performs a standard Singular Value Decomposition (SVD). SVD is very compute-intensive and can take many hours or days of calculation if you have a large number of words and documents. The SVD service then returns a set of "Concept Vectors" that can be used to group related documents.

Sample Data
To keep the example simple, we will just use the document titles, not the full documents.

Here are some document titles:

 * XQuery Tutorial and Cookbook
 * XForms Tutorial and Cookbook
 * Auto-generation of XForms with XQuery
 * Building RESTful Web Applications with XRX
 * XRX Tutorial and Cookbook
 * XRX Architectural Overview
 * The Return on Investment of XRX

Our first step will be to build a Word-Document Matrix. This matrix has one row for each word and one column for each document.

We will do this in several steps.


 * 1) Get all the words from all the documents and put them into a single sequence
 * 2) Create a list of the distinct words that are not "stop words"
 * 3) For each word:
 * 4) For each document, count how many times the word appears in that document
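
The four steps above can be sketched in XQuery as follows. This is a minimal illustration, not a definitive implementation: the stop-word list, the helper function local:words, the rule of splitting on any non-letter character (so "Auto-generation" becomes two words), and the output element names matrix, row and cell are all assumptions made for this example.

```xquery
xquery version "1.0";

(: The sample document titles listed above :)
declare variable $titles as xs:string+ := (
    'XQuery Tutorial and Cookbook',
    'XForms Tutorial and Cookbook',
    'Auto-generation of XForms with XQuery',
    'Building RESTful Web Applications with XRX',
    'XRX Tutorial and Cookbook',
    'XRX Architectural Overview',
    'The Return on Investment of XRX'
);

(: Illustrative stop-word list; an assumption, not part of the original text :)
declare variable $stop-words as xs:string+ := ('and', 'of', 'on', 'the', 'with');

(: Hypothetical helper: lower-case the text and split it on runs of non-letters :)
declare function local:words($text as xs:string) as xs:string* {
    tokenize(lower-case($text), '[^a-z]+')[. ne '']
};

(: Step 1: all the words from all the documents in a single sequence :)
let $all-words := for $title in $titles return local:words($title)

(: Step 2: the distinct words that are not stop words :)
let $terms :=
    for $word in distinct-values($all-words)
    where not($word = $stop-words)
    order by $word
    return $word

return
    <matrix>{
        (: Step 3: one row per word :)
        for $term in $terms
        return
            <row term="{$term}">{
                (: Step 4: one cell per document with the count of this word :)
                for $title at $doc in $titles
                return
                    <cell doc="{$doc}">{count(local:words($title)[. = $term])}</cell>
            }</row>
    }</matrix>
```

With the sample titles, the row for "xrx" should contain a count of 1 for documents 4 through 7 and 0 for the others. In a real collection you would replace $titles with a sequence drawn from your stored documents.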

Creating Sigma Values
The Sigma matrix is a diagonal matrix of singular values. It sits between the word vectors and the document vectors in the product:

[Word Document Matrix] = [Word Vectors] X [Sigma Values] X [Document Vectors]
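
In standard SVD notation the same relationship can be written as shown below. The symbols are assumptions introduced here for illustration and do not appear in the original text: A is the word-document matrix with m words and n documents, U holds the word vectors, Sigma holds the singular values, and V^T holds the document vectors, with r = min(m, n) in the reduced form.

```latex
A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^{T}_{r \times n},
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r),
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0
```

For the seven sample titles, n = 7 and m is the number of distinct non-stop words. Latent Semantic Indexing typically keeps only the few largest singular values and the matching columns of U and rows of V^T, which gives a lower-rank approximation of A.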