XQuery/Link gathering

Motivation
You want to gather the links on a blog page.

Method
We use the doc function to perform an HTTP GET on a remote web page. If the page is well-formed XML, you can then extract all the unordered list items by appending a path step such as //ul/li to the doc call.
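A minimal sketch of this selection, under stated assumptions: the page structure is hypothetical, and an in-memory document stands in for the fetched page so that the query runs without network access. With a live page the variable would instead be bound to a call such as doc("http://example.org/blog/"), where the URL is a placeholder.

```xquery
xquery version "1.0";

(: In-memory stand-in for a fetched page (hypothetical content). :)
let $page :=
  document {
    <html>
      <body>
        <ul>
          <li><a href="http://example.org/a">First</a></li>
          <li><a href="http://example.org/b">Second</a></li>
        </ul>
      </body>
    </html>
  }
(: The path step selects every item of every unordered list in the page. :)
return $page//ul/li
```

Real XHTML pages usually place their elements in the http://www.w3.org/1999/xhtml namespace, in which case the path needs a matching namespace declaration or a wildcard such as //*:ul/*:li.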

This script fetches the blog page and selects the URLs in the link section, which reference other blog articles. Each referenced article is then fetched and the URLs marked as external are selected. The result is returned as XML.
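The steps above might be sketched as follows. This is not the original script: the URLs, the class names "links" and "external", and the result element names are assumptions, and local:fetch is a hypothetical stand-in for doc that reads from an in-memory site so the example runs anywhere.

```xquery
xquery version "1.0";

(: Hypothetical in-memory site; each page element stands in for one web page. :)
declare variable $local:site :=
  <site>
    <page url="http://example.org/blog/">
      <div class="links">
        <a href="http://example.org/blog/a1">Article 1</a>
        <a href="http://example.org/blog/a2">Article 2</a>
      </div>
    </page>
    <page url="http://example.org/blog/a1">
      <a class="external" href="http://example.com/x">x</a>
    </page>
    <page url="http://example.org/blog/a2">
      <a class="external" href="http://example.com/x">x</a>
      <a class="external" href="http://example.com/y">y</a>
    </page>
  </site>;

(: Stand-in for doc($url): look the page up in the in-memory site. :)
declare function local:fetch($url as xs:string) as element(page)? {
  $local:site/page[@url = $url]
};

let $blog := local:fetch("http://example.org/blog/")
(: URLs in the link section, each referencing another article. :)
let $article-urls := $blog//div[@class = "links"]//a/@href
(: Fetch every referenced article and collect its external URLs. :)
let $external-urls :=
  for $url in $article-urls
  return local:fetch($url)//a[@class = "external"]/@href
return
  <links>{
    for $u in distinct-values($external-urls)
    return <link>{$u}</link>
  }</links>
```

Against a live site, local:fetch($url) would be replaced by doc($url).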

Execute

Version 2
Dropping the intermediate variables allows the structure to be seen more clearly:
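A sketch of what such a condensed form could look like, again with hypothetical URLs and class names and with local:fetch standing in for doc. The two nested for expressions mirror the two levels of navigation: blog page to articles, then articles to external links.

```xquery
xquery version "1.0";

(: Hypothetical in-memory site standing in for the remote pages. :)
declare variable $local:site :=
  <site>
    <page url="http://example.org/blog/">
      <div class="links">
        <a href="http://example.org/blog/a1">Article 1</a>
      </div>
    </page>
    <page url="http://example.org/blog/a1">
      <a class="external" href="http://example.com/x">x</a>
    </page>
  </site>;

(: Stand-in for doc($url). :)
declare function local:fetch($url as xs:string) as element(page)? {
  $local:site/page[@url = $url]
};

<links>{
  for $u in distinct-values(
    for $url in local:fetch("http://example.org/blog/")//div[@class = "links"]//a/@href
    return local:fetch($url)//a[@class = "external"]/@href
  )
  return <link>{$u}</link>
}</links>
```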

Execute

Repository Schemas
Daniel Bennett is proposing a standard for supporting the extraction of data such as this from a site. Such a schema would define a view of a set of documents sufficient to allow the extraction above to be based on the schema.

We can go some way towards this with a view schema represented as an ER model, with added implementation-dependent paths.
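To make the idea concrete, here is one possible shape for such a view schema. The element and attribute names (entity, attribute, url, target, path) and the paths themselves are illustrative assumptions, not a published format. The query simply lists each attribute of the model with its implementation-dependent path.

```xquery
xquery version "1.0";

(: Hypothetical view schema: entities and attributes form the ER model,
   the path attributes carry the site-specific XPath. :)
let $schema :=
  <schema>
    <entity name="blog" url="http://example.org/blog/">
      <attribute name="article" target="article" path="//div[@class='links']//a/@href"/>
    </entity>
    <entity name="article">
      <attribute name="external" path="//a[@class='external']/@href"/>
    </entity>
  </schema>
(: One line per attribute, in the form entity.attribute = path. :)
for $a in $schema/entity/attribute[@path]
return concat($a/../@name, ".", $a/@name, " = ", $a/@path)
```

Keeping the model (entity and attribute names) separate from the paths means a second site with the same model needs only a different set of path values.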

This schema can then be used by a generic link-gathering script:

This script now performs the task of link gathering on any site whose page structure can be defined in terms of the schema with appropriate paths.
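A sketch of such a generic script, under several assumptions. The schema format is hypothetical, local:fetch stands in for doc, and the paths are evaluated with util:eval, which is an eXist-db extension; standard XQuery has no dynamic evaluation, so other processors would need their own equivalent. The sketch also relies on eXist making in-scope variables visible to the evaluated expression.

```xquery
xquery version "1.0";

(: eXist-db extension module providing util:eval. :)
import module namespace util = "http://exist-db.org/xquery/util";

(: Hypothetical view schema: same model as before, paths specific to this site. :)
declare variable $local:schema :=
  <schema>
    <entity name="blog" url="http://example.org/blog/">
      <attribute name="article" path="//div[@class='links']//a/@href"/>
    </entity>
    <entity name="article">
      <attribute name="external" path="//a[@class='external']/@href"/>
    </entity>
  </schema>;

(: Hypothetical in-memory site standing in for the remote pages. :)
declare variable $local:site :=
  <site>
    <page url="http://example.org/blog/">
      <div class="links"><a href="http://example.org/blog/a1">Article 1</a></div>
    </page>
    <page url="http://example.org/blog/a1">
      <a class="external" href="http://example.com/x">x</a>
    </page>
  </site>;

(: Stand-in for doc($url). :)
declare function local:fetch($url as xs:string?) as element(page)? {
  $local:site/page[@url = $url]
};

(: Look up the path for entity/attribute in the schema and evaluate it on $doc. :)
declare function local:select($doc as node()*, $entity as xs:string, $name as xs:string) as item()* {
  let $path := string($local:schema/entity[@name = $entity]/attribute[@name = $name]/@path)
  return util:eval(concat("$doc", $path))
};

<links>{
  for $u in distinct-values(
    for $url in local:select(local:fetch($local:schema/entity[@name = "blog"]/@url), "blog", "article")
    return local:select(local:fetch($url), "article", "external")
  )
  return <link>{$u}</link>
}</links>
```

The script itself mentions only model names ("blog", "article", "external"); everything site-specific lives in the schema.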

Execute

Relative and absolute URIs
The previous version works only if the URIs are absolute. A little more work is needed if they are relative:
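One way to handle relative references is the standard resolve-uri function, which resolves a reference against a base URI and leaves absolute URIs unchanged. The base URI and hrefs below are hypothetical; in a link-gathering script the base would be the URI of the page the link was found on.

```xquery
xquery version "1.0";

(: Hypothetical base: the URI of the page that contains the links. :)
let $base := "http://example.org/site/index.html"
for $href in ("page1.html", "../other/page2.html", "http://example.com/abs.html")
(: resolve-uri applies the usual relative-reference rules against $base. :)
return string(resolve-uri($href, $base))
```

This yields http://example.org/site/page1.html, http://example.org/other/page2.html and http://example.com/abs.html, so the resolved value can be passed to doc whether the original link was relative or absolute.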

So with a different schema - same model, different paths:

which is a view schema of this test site.

Execute

Virtual Paths
The navigation path is still hard-coded in the script. We would like to write path expressions where the steps are defined in the schema. This path would then be interpreted in the context of the schema.

View Schema
In this example, the test site has been expanded to include a separate index page and some additional components in the view:

Index

Path language
This prototype uses a simple path language in which steps are separated by /. The first step identifies the (entity) type of the initial document. The step -> dereferences a relative or absolute URL. Where a later step is recognised as an attribute of the current entity, the path expression associated with that attribute in the schema is used; otherwise the step is executed as XPath.

For example:

index/link/->/title
List the titles of the pages in the index.

Run

index/link/->/author/string(.)
List the authors of the pages referenced in the index.

Run

page/inner/->/external
List the URLs of all distinct external links in all pages referenced by the index page.

Run: http://www.cems.uwe.ac.uk/xmlwiki/Gov/test3.xq?

page/inner/->/inner/->/title
List the titles of pages linked to the initial page.

Run

Script
The core function processes a virtual path in the context of a schema.
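The following is a minimal sketch of how such a function could work, not the original script. It assumes a hypothetical schema format in which an attribute may name a target entity, uses local:fetch as an in-memory stand-in for doc, and relies on eXist-db's util:eval extension to evaluate path fragments dynamically. Because the virtual path is split on /, an XPath step in this sketch cannot itself contain a slash.

```xquery
xquery version "1.0";

(: eXist-db extension module providing util:eval. :)
import module namespace util = "http://exist-db.org/xquery/util";

(: Hypothetical view schema; url marks the entry page, target the entity a link leads to. :)
declare variable $local:schema :=
  <schema>
    <entity name="index" url="http://example.org/index">
      <attribute name="link" target="page" path="//a/@href"/>
    </entity>
    <entity name="page">
      <attribute name="title" path="//h1"/>
      <attribute name="inner" target="page" path="//a[@class='inner']/@href"/>
    </entity>
  </schema>;

(: Hypothetical in-memory site standing in for the remote pages. :)
declare variable $local:site :=
  <site>
    <page url="http://example.org/index">
      <a href="http://example.org/p1">one</a>
      <a href="http://example.org/p2">two</a>
    </page>
    <page url="http://example.org/p1"><h1>Page one</h1></page>
    <page url="http://example.org/p2"><h1>Page two</h1></page>
  </site>;

(: Stand-in for doc($url). :)
declare function local:fetch($url as xs:string?) as element(page)? {
  $local:site/page[@url = $url]
};

(: Evaluate an XPath fragment against $nodes (eXist-specific). :)
declare function local:eval($nodes as item()*, $path as xs:string) as item()* {
  util:eval(concat("$nodes", $path))
};

(: Process the remaining steps; $entity is the current entity type and
   $target the entity that the most recent schema attribute points to. :)
declare function local:walk(
  $nodes as item()*, $entity as xs:string,
  $target as xs:string?, $steps as xs:string*
) as item()* {
  if (empty($steps)) then $nodes
  else
    let $step := $steps[1]
    let $rest := subsequence($steps, 2)
    let $attribute := $local:schema/entity[@name = $entity]/attribute[@name = $step]
    return
      if ($step = "->") then
        (: Dereference each URL and switch to the target entity. :)
        local:walk(for $n in $nodes return local:fetch(string($n)),
                   ($target, $entity)[1], (), $rest)
      else if (exists($attribute)) then
        (: Schema attribute: use its associated path expression. :)
        local:walk(local:eval($nodes, $attribute/@path), $entity, $attribute/@target, $rest)
      else
        (: Anything else is executed as a plain XPath step. :)
        local:walk(local:eval($nodes, concat("/", $step)), $entity, $target, $rest)
};

(: The first step names the entity type of the initial document. :)
declare function local:run($path as xs:string) as item()* {
  let $steps := tokenize($path, "/")
  let $start := $local:schema/entity[@name = $steps[1]]
  return local:walk(local:fetch($start/@url), $steps[1], (), subsequence($steps, 2))
};

local:run("index/link/->/title")
```

With the sample data, the call follows both links from the index page and returns the h1 element of each referenced page. A path ending in a step such as string(.) would be handled by the final branch.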

Acknowledgments
This example is based on an article by Daniel Bennett.