XQuery/Overview of Page Scraping Techniques

Motivation
You want a toolkit for pulling information out of web pages, even if those pages are not well formed XML files.

Method
XQuery is an ideal toolkit for manipulating well-formed HTML: you need only pass a URL or database path to the doc function, e.g. doc('http://www.example.org/index.html') or doc('/db/path/to/index.html'). If a web page is not well-formed XML, however, doc will fail with an error reporting that the source is not well-formed.
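For a well-formed page, extracting data is just a path expression over the document. A minimal sketch (the URL is a placeholder; the wildcard *:a sidesteps the XHTML namespace):

```xquery
xquery version "1.0";

(: Sketch: list the link targets on a well-formed (X)HTML page.
   'http://www.example.org/index.html' is a placeholder URL. :)
let $page := doc('http://www.example.org/index.html')
for $link in $page//*:a
return $link/@href/string()
```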

Luckily, there are programs that transform HTML files into well-formed XML files.

eXist provides several such tools. One is the httpclient module's get function, httpclient:get, which fetches a page over HTTP. To use this function you must enable the httpclient module by editing the conf.xml file; the module will then be loaded the next time you start eXist. Uncomment the following line:
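The entry to uncomment sits in the builtin-modules section of conf.xml and typically looks like the following (the exact class name may vary between eXist versions; check your own conf.xml):

```xml
<!-- httpclient module entry in $EXIST_HOME/conf.xml (builtin-modules section) -->
<module uri="http://exist-db.org/xquery/httpclient"
        class="org.exist.xquery.modules.httpclient.HTTPClientModule"/>
```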

For example, the following query performs an HTTP GET on the list of all the feeds from the IBM web site:
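A sketch of such a call is shown below. The feed-list URL is a placeholder rather than the actual IBM address; httpclient:get takes the URI, a boolean controlling whether the connection state persists, and an optional element of request headers, and returns an httpclient:response element whose body holds the tidied page:

```xquery
xquery version "1.0";

declare namespace httpclient = "http://exist-db.org/xquery/httpclient";

(: Placeholder URL standing in for the IBM feed list :)
let $url := xs:anyURI('http://www.example.com/rss/feeds.xml')
let $response := httpclient:get($url, false(), ())
return $response//httpclient:body/*
```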

Sometimes the HTML is so malformed that even httpclient:get will not be able to salvage it. For example, if an element has two @id attributes, you will get the error "Error XQDY0025: element has more than one attribute 'id'". In this case, you may need to download the HTML source and clean it up just enough that eXist can parse the rest. Then store the file in your database and use the util:parse-html function, which passes the text through the Neko HTML parser to make it well-formed.

The following XQuery will clean up such HTML (stored as a text file, because it is still not well-formed):
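A minimal sketch of this cleanup step, assuming the malformed page has been stored at a placeholder database path as a binary/text resource:

```xquery
xquery version "1.0";

(: Sketch: load a malformed HTML file stored as a binary resource,
   then tidy it into well-formed XML via the Neko parser.
   '/db/scraping/page.html' is a placeholder path. :)
let $raw := util:binary-to-string(util:binary-doc('/db/scraping/page.html'))
return util:parse-html($raw)
```

The result is a well-formed document node that you can query with ordinary path expressions or store back into the database.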

Testing your HTTP Client with a Simple Echo Script
When debugging an HTTP client it is useful to see exactly what your requests contain. A simple way to do this is to point the client at an echo script that returns the request it received.

Source code for echo.xq
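A sketch of such an echo script, assuming eXist's request module; it reports the request URI and each parameter the client sent:

```xquery
xquery version "1.0";

(: echo.xq - returns the parameters of the incoming request so you can
   inspect what your HTTP client actually sent. :)
declare namespace request = "http://exist-db.org/xquery/request";

<echo>
    <uri>{request:get-uri()}</uri>
    {
        for $name in request:get-parameter-names()
        return
            <parameter name="{$name}">{request:get-parameter($name, ())}</parameter>
    }
</echo>
```

Store this query in the database, call it from your client with a few test parameters, and compare the echoed output against what you intended to send.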