XQuery/Page scraping and Yahoo Weather

Background
Yahoo provides a world weather forecast service via a REST API that delivers RSS. The service is described in the API documentation.

However, the key to each feed for UK towns is a Yahoo Location ID such as UKXX0953, and there is no service available to convert location names to Yahoo codes. Yahoo does provide alphabetical index pages of locations, which contain links to the feeds themselves.

Yahoo Pipe
This task can be accomplished, up to the extraction of the location ID, by the Yahoo Pipe written by Paul Daniel. However, the inherent instability of HTML markup means that this pipe currently fails.

XQuery
This script takes a location parameter, extracts the first letter of the location, constructs the URL of the Yahoo Weather index page for that letter (for example, the index page for the letter B), and fetches the page via the httpclient module in eXist. The page is not valid XHTML, but the httpclient:get function cleans up the markup so that the result is well-formed XML.
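The first step might be sketched as follows. This is not the original script: the function name local:index-url is hypothetical, the URL stem is taken from the XPL pipeline later on this page, and the use of an upper-case letter in the page name is an assumption. The eXist-specific calls are shown only in a comment, so the sketch itself should run in any XQuery 1.0 processor.

```xquery
xquery version "1.0";

(: Hypothetical helper: build the URL of the alphabetical index page
   for the first letter of a location name.
   Assumption: page names use an upper-case letter, e.g. UKXX_B.html. :)
declare function local:index-url($location as xs:string) as xs:string {
    let $letter := upper-case(substring(normalize-space($location), 1, 1))
    return concat("http://weather.yahoo.com/regional/UKXX_", $letter, ".html")
};

(: In eXist the location would typically come from the request, and the
   page would then be fetched with the httpclient module, along the lines of:
     let $location := request:get-parameter("location", "Bristol")
     let $page := httpclient:get(xs:anyURI(local:index-url($location)), false(), ())
   Those calls need a running eXist instance, so they are left as comments. :)
local:index-url("Bristol")
```

For the location Bristol this evaluates to http://weather.yahoo.com/regional/UKXX_B.html.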

HTML page

The page structure can be seen in the tree view.

Next, the script navigates this XML to locate the li element containing the location and extracts the code for that location. Finally, this code is appended to the stem of the RSS URL to create the URL of the RSS feed for that location, and the script redirects to that URL.

This process can be visualized using a data flow diagram.
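The extraction step could look roughly like the sketch below. The path expression and the RSS stem are borrowed from the XPL pipeline later on this page. The function name local:location-code, the sample page fragment and the place name in it are invented for illustration, and the real cleaned-up page may well be in the XHTML namespace, which this sketch ignores.

```xquery
xquery version "1.0";

(: Assumed stem of the RSS feed URL. An ampersand has to be written
   as an entity reference inside an XQuery string literal. :)
declare variable $rss-stem as xs:string :=
    "http://weather.yahooapis.com/forecastrss?u=c&amp;p=";

(: Hypothetical helper: find the link for $location in the tidied index
   page and return the location code embedded in its href, or the empty
   sequence if the location is not listed. :)
declare function local:location-code($page as node(), $location as xs:string)
        as xs:string? {
    let $href := ($page//div[@id = "yw-regionalloc"]//li/a[normalize-space(.) = $location]/@href)[1]
    return
        if (exists($href))
        then substring-before(substring-after($href, "forecast/"), ".html")
        else ()
};

(: Invented stand-in for the cleaned-up index page. :)
let $page :=
    <html>
        <body>
            <div id="yw-regionalloc">
                <ul>
                    <li><a href="/forecast/UKXX0953.html">Sampletown</a></li>
                    <li><a href="/forecast/UKXX0000.html">Othertown</a></li>
                </ul>
            </div>
        </body>
    </html>
let $code := local:location-code($page, "Sampletown")
(: In eXist the script could finish with
     response:redirect-to(xs:anyURI(concat($rss-stem, $code)))
   Here the URL is simply returned. :)
return concat($rss-stem, $code)
```

With the sample fragment, the code UKXX0953 is extracted and appended to the stem. A production script would also need to decide what to do when the location is not found, since local:location-code then returns the empty sequence.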


 * Bristol RSS feed
 * Cardiff RSS feed

XSLT
For comparison, here is the equivalent XSLT script, using xsl:analyze-string.

Bristol Weather - but currently broken

XPL
Another approach is to use XPL, the XML Pipeline Language developed by Erik Bruchez and Alessandro Vernet at Orbeon, to describe the sequence of transformations as a pipeline. Here the pipeline is extended to create a custom HTML page from the RSS feed.

    construct the index page url from the parameter
    http://weather.yahoo.com/regional/UKXX_  .html
                </xsl:template>
            </p:input>
            <p:output name="result" id="indexUrl"/>
        </p:processor>
        <p:processor name="tidy">
            <p:annotation>tidy the index page</p:annotation>
            <p:input name="url" id="indexUrl"/>
            <p:output name="xhtml" id="indexXhtml"/>
        </p:processor>
        <p:processor name="xslt">
            <p:annotation>parse the index page and construct the URL for the RSS feed</p:annotation>
            <p:input name="xml" id="indexXhtml"/>
            <p:input name="parameter" id="location"/>
            <p:input name="xslt">
                <xsl:template match="/">
                    <xsl:variable name="href" select="//div[@id='yw-regionalloc']//li/a[.= $location]/@href"/>
                    <xsl:text>http://weather.yahooapis.com/forecastrss?u=c%26p=</xsl:text>
                    <xsl:value-of select="substring-before(substring-after($href,'forecast/'),'.html')" />
                </xsl:template>
            </p:input>
            <p:output name="result" id="rssUrl"/>
        </p:processor>
        <p:processor name="fetch">
            <p:annotation>fetch the RSS feed</p:annotation>
            <p:input name="url" id="rssUrl"/>
            <p:output name="result" id="RSSFeed"/>
        </p:processor>
        <p:processor name="xslt">
            <p:annotation>Convert RSS to an HTML page</p:annotation>
            <p:input name="xml" id="RSSFeed"/>
            <p:input name="xslt" href="http://www.cems.uwe.ac.uk/xmlwiki/weather/yahooRSS2HTML.xsl"/>
            <p:output name="result" id="weatherPage"/>
        </p:processor>
    </p:pipeline>

Given implementations for each of the named processor types, this pipeline can be executed (albeit rather slowly in this prototype XQuery processor).


 * This is a work in progress: at present this XPL engine is only a very simple, partial prototype, and even this simple sequential example does not conform to the XPL schema (hence the local namespace).

The pipeline can be visualized using GraphViz.


 * The intention is to generate an additional image map to support linking to the underlying processes, and to support the full XPL language.
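One way such a GraphViz description could be derived from a pipeline is sketched below. It is not the code behind the visualization on this page: the function name local:pipeline-to-dot and the namespace URI are placeholders, and the sample pipeline is a trimmed copy of three processors from the example above. Each processor becomes a node, and each id shared between an output and an input becomes an edge.

```xquery
xquery version "1.0";

(: Placeholder URI standing in for the local namespace used by the prototype. :)
declare namespace p = "http://example.org/xpl-sketch";

(: Hypothetical helper: emit a DOT digraph in which each processor is a
   node and each shared id is an edge from the processor that outputs it
   to the processor that reads it. :)
declare function local:pipeline-to-dot($pipeline as element()) as xs:string {
    let $processors := $pipeline/p:processor
    let $nodes :=
        for $proc at $i in $processors
        return concat('  n', $i, ' [label="', $proc/@name, '"];')
    let $edges :=
        for $consumer at $j in $processors
        for $input in $consumer/p:input[@id]
        for $producer at $i in $processors
        where $producer/p:output/@id = $input/@id
        return concat('  n', $i, ' -> n', $j, ' [label="', $input/@id, '"];')
    return string-join(("digraph pipeline {", $nodes, $edges, "}"), "&#10;")
};

let $pipeline :=
    <p:pipeline>
        <p:processor name="tidy">
            <p:input name="url" id="indexUrl"/>
            <p:output name="xhtml" id="indexXhtml"/>
        </p:processor>
        <p:processor name="xslt">
            <p:input name="xml" id="indexXhtml"/>
            <p:output name="result" id="rssUrl"/>
        </p:processor>
        <p:processor name="fetch">
            <p:input name="url" id="rssUrl"/>
            <p:output name="result" id="RSSFeed"/>
        </p:processor>
    </p:pipeline>
return local:pipeline-to-dot($pipeline)
```

Inputs whose id is not produced by any processor in the pipeline, such as indexUrl in this trimmed sample, simply yield no edge. The resulting text can be passed to the GraphViz dot tool to render an image.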