XQuery/Uptime monitor

Motivation
You would like to monitor the service availability of several web sites or web services. You would like to do this all with XQuery and store the results in XML files. You would also like to see "dashboard" graphical displays of uptime.

There are several commercial services (Pingdom, Host-tracker) which will monitor the performance of your web sites in terms of uptime and response time.

Although the production of a reliable service requires a network of servers, the basic functionality can be performed using XQuery in a few scripts.

Method
This approach focuses on the uptime and response time of web pages. The core approach is to use the eXist job scheduler to execute an XQuery script at regular time intervals. This script performs an HTTP GET on a URI and records the statusCode of the site in an XML data file.

The operation is timed so that response times can be derived from elapsed time (valid on a lightly used server), and the test results are stored. Reports can then be run from the test results and alerts sent when a site is observed to be down.

Even as a prototype, the access to fine-grained data has already revealed some response time issues on one of the sites at the University.

Watch list

Conceptual Model
This ER model was created in QSEE, which can also generate SQL or XSD.



In this notation the bar indicates that Test is a weak entity with existence dependence on Watch.

Watch-Test relationship
Since Test is dependent on Watch, the Watch-Test relationship can be implemented as composition, with the multiple Test elements contained in a Log element which itself is a child of the Watch element. Tests are stored in chronological order.

Watch Composition
Two possible approaches:
 * add the Log as an element amongst the base data for the Watch, giving the structure Watch (uri, name, Log (Test))
 * construct a Watch element which contains the Watch base data as WatchSpec (the Watch entity) and the Log, giving the structure Watch (WatchSpec (uri, name), Log)

The second approach preserves the original Watch entity as a node, and also fits with the use of XForms, allowing the whole WatchSpec node to be included in a form. However it introduces a difficult-to-name intermediate, and results in paths like $watch/WatchSpec/uri when $watch/uri would be more natural.

Here we choose the first approach on the grounds that it is not desirable to introduce intermediate elements in anticipation of simpler implementation of a particular interface.

Watch entity
A Watch entity may be implemented as a file or as an element in a collection. Here we choose to implement Watch as an element in a Monitor container in a document. However this is a difficult decision and the XQuery code should hide this decision as much as possible.

Attribute implementation
Watch attributes are mapped to elements. Test attributes are mapped to attributes.
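To make the mapping concrete, the following sketch builds a small Monitor document in memory and reads it back. The uri, name and Test values are invented for illustration; only the element and attribute names come from the model described here.

```xquery
xquery version "1.0";

(: Sample data: all values below are invented for illustration. :)
let $monitor :=
    <Monitor>
        <Watch>
            <uri>http://www.example.org/</uri>
            <name>Example site</name>
            <Log>
                <Test at="2009-01-01T12:00:00Z" responseTime="85" statusCode="200"/>
                <Test at="2009-01-01T12:05:00Z" responseTime="92" statusCode="200"/>
            </Log>
        </Watch>
    </Monitor>
for $watch in $monitor/Watch
return
    (: Watch attributes are child elements; Test attributes are XML attributes. :)
    concat($watch/name, " (", $watch/uri, "): ",
           count($watch/Log/Test), " tests, last at ",
           $watch/Log/Test[last()]/@at)
```

Because Tests are stored in chronological order, the most recent test is simply the last Test child of the Log.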

Model Generated
QSEE will generate an XML Schema. In this mapping, all relationships are implemented with foreign keys, with key and keyref used to describe the relationship. In this case, the schema would need to be edited to implement the Watch-Test relationship by composition.

By Inference
This schema has been generated by Trang (in Oxygen) from an example document, created as the system runs.

 element Monitor {
     element Watch {
         element uri { xsd:anyURI },
         element name { text },
         element Log {
             element Test {
                 attribute at { xsd:dateTime },
                 attribute responseTime { xsd:integer },
                 attribute statusCode { xsd:integer }
             }+
         }
     }+
 }
 * Compact Relax NG

XML Schema
 * XML Schema

Designed Schema
Editing the QSEE generated schema results in a schema which includes the restriction on statusCodes.

XML Schema

Test Data
An XQuery script transforms an XML Schema (or a subset thereof) to a random instance of a conforming document.

Random Document

The constraint that Tests are in ascending order of the attribute at is not defined in this schema. To generate useful test data, the generator needs additional information about the length of strings and the probability distribution of enumerated values, iterations and optional elements.

Equivalent SQL implementation
In the relational implementation the primary key uri of Watch is the foreign key of Test. There would be an advantage to adding a system-generated id to use in place of this meaningful URI, both to remove the redundancy created and to reduce the size of the foreign key. However a mechanism is then needed to allocate unique ids.

eXistdb modules

 * xmldb for database update and login
 * datetime for date formatting
 * util - for system-time function
 * httpclient - for HTTP GET
 * scheduler - to schedule the monitoring task
 * validation - for database validation

other

 * Google Charts

Functions
Functions in a single XQuery module.

Database Access
Access to the Monitor database which may be a local database document, or a remote document.

A specific Watch entity is identified by its URI:
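A lookup of this kind might be sketched as below. The function name local:get-watch and the sample data are hypothetical; in the running system the Monitor element would come from doc() applied to the local or remote document rather than from an in-memory constructor.

```xquery
xquery version "1.0";

(: Hypothetical helper: find the Watch whose uri child equals $uri.
   uri is the key of Watch, so at most one match is expected; [1] guards
   against accidental duplicates. :)
declare function local:get-watch($monitor as element(Monitor), $uri as xs:string)
        as element(Watch)? {
    ($monitor/Watch[uri = $uri])[1]
};

(: Invented sample data standing in for doc(...)/Monitor. :)
let $monitor :=
    <Monitor>
        <Watch><uri>http://www.example.org/</uri><name>Example site</name><Log/></Watch>
        <Watch><uri>http://www.example.com/</uri><name>Another site</name><Log/></Watch>
    </Monitor>
return string(local:get-watch($monitor, "http://www.example.com/")/name)
```

Keeping the lookup in one function is one way to hide the decision about where Watch entities are stored from the rest of the code.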

Further references to a Watch are by reference. e.g.

Executing Tests
The test does an HTTP GET on the uri. The GET is bracketed by calls to util:system-time to compute the elapsed wall-clock time in milliseconds. The test report includes the statusCode.

The generated test is appended to the end of the log:

To execute the test, a script logs in, iterates through the Watch entities and for each, executes the test and stores the result:
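The three steps described in this section (timed GET, append to the Log, iterate over the Watches) can be sketched in one script. This is a sketch under assumptions, not the original listing: the collection path, document path, user name and password are placeholders, local:run-test is a hypothetical name, and the code relies on eXist's httpclient module and its non-standard update extension.

```xquery
xquery version "1.0";

import module namespace httpclient = "http://exist-db.org/xquery/httpclient";
import module namespace util = "http://exist-db.org/xquery/util";
import module namespace xmldb = "http://exist-db.org/xquery/xmldb";

(: Assumed locations; replace with your own. :)
declare variable $local:collection := "/db/monitor";
declare variable $local:document := "/db/monitor/monitor.xml";

(: Time an HTTP GET on the Watch uri and report the outcome as a Test element.
   util:system-time() returns xs:time, so a test that spans midnight would
   yield a negative elapsed time in this simple version. :)
declare function local:run-test($watch as element(Watch)) as element(Test) {
    let $start := util:system-time()
    let $response := httpclient:get(xs:anyURI($watch/uri), false(), ())
    let $end := util:system-time()
    let $ms := xs:integer(round(($end - $start) div xs:dayTimeDuration("PT0.001S")))
    return
        <Test at="{current-dateTime()}" responseTime="{$ms}"
              statusCode="{$response/@statusCode}"/>
};

(: Placeholder credentials. :)
let $login := xmldb:login($local:collection, "monitor-user", "monitor-password")
for $watch in doc($local:document)/Monitor/Watch
let $test := local:run-test($watch)
return
    (: eXist update extension: "into" appends the Test as the last child of Log,
       which keeps the Log in chronological order. :)
    update insert $test into $watch/Log
```

The update only works on nodes stored in the database, which is why the Watches are read with doc() rather than built in memory.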

Job scheduling
A job is scheduled to run this script every 5 minutes.
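With eXist's scheduler module, such a job might be registered as sketched below. The script path is a placeholder, and the call assumes the scheduler module is enabled in the eXist configuration and that the query runs with sufficient privileges.

```xquery
xquery version "1.0";

import module namespace scheduler = "http://exist-db.org/xquery/scheduler";

(: Quartz cron expression: at second 0 of every fifth minute.
   "/db/monitor/run-tests.xq" is an assumed path to the stored test script. :)
scheduler:schedule-xquery-cron-job("/db/monitor/run-tests.xq", "0 0/5 * * * ?")
```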

Index page
The index page is based on a supplied Monitor document, by default the production database.

In this implementation, the URI of the monitor document is passed to dependent scripts in the URI. An alternative would be to pass this data via a session variable.

View

Reporting
Reporting draws on the log of Tests for a Watch.

Overview Report
The basic report shows summary data about the watched URI and an embedded chart of response time over time. Up-time is the ratio of tests with a status code of 200 to the total number of tests.
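The up-time ratio can be computed directly from the Test elements. The sketch below uses a hypothetical function name, local:uptime, and invented sample tests.

```xquery
xquery version "1.0";

(: Ratio of tests with statusCode 200 to all tests; empty when there are no tests.
   statusCode is compared as a string so that a missing or empty value
   does not raise a cast error. :)
declare function local:uptime($tests as element(Test)*) as xs:decimal? {
    if (empty($tests)) then ()
    else count($tests[@statusCode = "200"]) div count($tests)
};

(: Invented sample: three successes out of four tests. :)
let $tests := (
    <Test at="2009-01-01T12:00:00Z" responseTime="85" statusCode="200"/>,
    <Test at="2009-01-01T12:05:00Z" responseTime="92" statusCode="200"/>,
    <Test at="2009-01-01T12:10:00Z" responseTime="0" statusCode="500"/>,
    <Test at="2009-01-01T12:15:00Z" responseTime="88" statusCode="200"/>
)
return local:uptime($tests)   (: 0.75 :)
```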

 Last 24 hours {monitor:responseTime-chart($last24hrs)}
 1 hour averages {monitor:responseTime-chart(monitor:average($tests,12))}

View

Response time graph
The graph is generated using the Google Chart API. The default vertical scale from 0 to 100 fits the typical response time. In this simple example, the graph is neither adorned nor explained.
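A chart of this kind is requested by building a URL whose parameters describe the chart. The sketch below assumes the Google image chart URL format (cht for chart type, chs for size, chd=t: for comma-separated data on the default 0 to 100 scale); the function name is hypothetical, and since Google has deprecated this service the URL should be treated as illustrative.

```xquery
xquery version "1.0";

(: Build an image chart URL for a line chart of response times.
   Values are passed unscaled, so anything above 100 would be clipped. :)
declare function local:responseTime-chart-url($tests as element(Test)*) as xs:string {
    let $data := string-join(for $t in $tests return string($t/@responseTime), ",")
    return concat("http://chart.apis.google.com/chart",
                  "?cht=lc&amp;chs=400x200&amp;chd=t:", $data)
};

(: Invented sample tests. :)
let $tests := (
    <Test at="2009-01-01T12:00:00Z" responseTime="85" statusCode="200"/>,
    <Test at="2009-01-01T12:05:00Z" responseTime="92" statusCode="200"/>,
    <Test at="2009-01-01T12:10:00Z" responseTime="40" statusCode="200"/>
)
return <img src="{local:responseTime-chart-url($tests)}" alt="Response time"/>
```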

Response Time Frequency Distribution
The frequency distribution of response times summarises the response times. First the distribution itself is computed as a sequence of groups. The interval calculation is crude and uses 11 groups to fit with Google Chart.

This grouped distribution can then be charted as a bar chart. Scaling is needed in this case.

Finally, a script creates the page:
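One crude way to compute such a grouping is sketched below. The function name, the group element and its low and count attributes are all hypothetical; the interval width is chosen so that the requested number of equal-width groups covers every value from 0 up to the maximum.

```xquery
xquery version "1.0";

(: Count response times in $groups equal-width intervals starting at 0.
   The width is rounded up so that the largest value falls inside the last group. :)
declare function local:distribution($times as xs:double*, $groups as xs:integer)
        as element(group)* {
    if (empty($times)) then ()
    else
        let $width := ceiling((max($times) + 1) div $groups)
        for $i in 0 to $groups - 1
        let $low := $i * $width
        return
            <group low="{$low}"
                   count="{count($times[. ge $low and . lt $low + $width])}"/>
};

(: Invented sample response times in milliseconds, 11 groups as in the text. :)
local:distribution((85, 92, 110, 40, 87), 11)
```

The counts, rather than the raw response times, are what would then be passed on for charting.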

Validation
The eXist validation module provides functions for validating a document against a schema. The Monitor document links to a schema:

Execute

Alternatively, a document can be validated against any schema:
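As a sketch of explicit validation, the query below uses validation:jaxv, one of the functions offered by eXist's validation module, which returns true when the instance is valid against the supplied grammar. Both database paths are placeholders, and this is not necessarily the function used in the original scripts.

```xquery
xquery version "1.0";

import module namespace validation = "http://exist-db.org/xquery/validation";

(: Assumed database paths for the instance document and its XML Schema. :)
let $instance := doc("/db/monitor/monitor.xml")
let $schema := doc("/db/monitor/monitor.xsd")
return validation:jaxv($instance, $schema)
```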

Execute

This is used to check that the randomly generated instance is valid:

Execute

Downtime alerts
The purpose of a monitor is to alert those responsible for a site to its failure. Such an alert might be by SMS, email or some other channel. The Watch entity will need to be augmented with configuration parameters.

Check if failed
First it is necessary to calculate whether the site is down. monitor:failing returns true if none of the tests in the past $watch/fail-minutes minutes returned a statusCode of 200.
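A check along these lines might look as follows. This is a sketch, not the original monitor:failing: it is written as local:failing, takes the current time as a parameter so it can be tried on fixed data, and adds the assumption that a Watch with no tests at all in the period is not reported as failing.

```xquery
xquery version "1.0";

(: True when there is at least one test in the last fail-minutes minutes
   and none of those tests returned statusCode 200. :)
declare function local:failing($watch as element(Watch), $now as xs:dateTime)
        as xs:boolean {
    let $since := $now - xs:dayTimeDuration(concat("PT", $watch/fail-minutes, "M"))
    let $recent := $watch/Log/Test[xs:dateTime(@at) ge $since]
    return exists($recent) and empty($recent[@statusCode = "200"])
};

(: Invented sample: the last success is older than the 10-minute window. :)
let $watch :=
    <Watch>
        <uri>http://www.example.org/</uri>
        <name>Example site</name>
        <fail-minutes>10</fail-minutes>
        <Log>
            <Test at="2009-01-01T11:55:00Z" responseTime="85" statusCode="200"/>
            <Test at="2009-01-01T12:05:00Z" responseTime="0" statusCode="500"/>
            <Test at="2009-01-01T12:10:00Z" responseTime="0" statusCode="500"/>
        </Log>
    </Watch>
return local:failing($watch, xs:dateTime("2009-01-01T12:10:00Z"))   (: true :)
```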

Check if alert already sent
If this test is executed repeatedly by a scheduled job, an Alert message on the appropriate channel can be generated. However, the Alert message would be sent every time the condition is true. It would be better to send an Alert less frequently. One approach would add Alert elements to the log, interspersed with the Tests. This does not affect the code which accesses Tests, but allows us to inhibit Alerts when one has been sent recently. alert-sent will be true if an alert has been sent in the last $watch/alert-minutes minutes.
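The alert-sent check can be sketched in the same style. The at attribute on the Alert element is an assumption (mirroring Test), as is the local:alert-sent name and the explicit $now parameter.

```xquery
xquery version "1.0";

(: True when the Log holds an Alert dated within the last alert-minutes minutes. :)
declare function local:alert-sent($watch as element(Watch), $now as xs:dateTime)
        as xs:boolean {
    let $since := $now - xs:dayTimeDuration(concat("PT", $watch/alert-minutes, "M"))
    return exists($watch/Log/Alert[xs:dateTime(@at) ge $since])
};

(: Invented sample: an Alert was logged 30 minutes before $now. :)
let $watch :=
    <Watch>
        <uri>http://www.example.org/</uri>
        <name>Example site</name>
        <alert-minutes>60</alert-minutes>
        <Log>
            <Test at="2009-01-01T11:55:00Z" responseTime="0" statusCode="500"/>
            <Alert at="2009-01-01T12:00:00Z"/>
            <Test at="2009-01-01T12:05:00Z" responseTime="0" statusCode="500"/>
        </Log>
    </Watch>
return local:alert-sent($watch, xs:dateTime("2009-01-01T12:30:00Z"))   (: true :)
```

Paths such as $watch/Log/Test are unaffected by the interspersed Alert elements, because they select by element name.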

Alert notification task
The task to check the monitor log iterates through the Watches and for each checks if it is failing but no Alert has been sent in the period. If so, a message is constructed and an Alert element is added to the Log. The use of the Log to record Alert events means that no other state needs to be held, and the period with which this task is executed is unrelated to the Alert period.
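A compact version of such a task is sketched below, with the two checks written inline. The document path, the local:since helper, the Alert content and its at attribute are assumptions; delivery of the message over SMS or email is deliberately left out, and the update statement is eXist's non-standard update extension.

```xquery
xquery version "1.0";

(: Hypothetical helper: the dateTime $minutes minutes before $now. :)
declare function local:since($now as xs:dateTime, $minutes as xs:string) as xs:dateTime {
    $now - xs:dayTimeDuration(concat("PT", $minutes, "M"))
};

let $now := current-dateTime()
(: Assumed path to the stored Monitor document. :)
for $watch in doc("/db/monitor/monitor.xml")/Monitor/Watch
let $recent := $watch/Log/Test[xs:dateTime(@at) ge local:since($now, $watch/fail-minutes)]
let $failing := exists($recent) and empty($recent[@statusCode = "200"])
let $alerted := exists($watch/Log/Alert[xs:dateTime(@at)
                                         ge local:since($now, $watch/alert-minutes)])
where $failing and not($alerted)
return
    (: Record the Alert in the Log; sending the message itself is not shown. :)
    update insert
        <Alert at="{$now}">{concat($watch/name, " (", $watch/uri, ") appears to be down")}</Alert>
    into $watch/Log
```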

Discussion
Alert events could be added to a separate AlertLog but it is arguably easier to add a new class of Events than create a separate sequence for each. There may also be cases where the sequential relationship between Tests and Events is useful.

[ Re-designed Schema]

To do

 * add create/edit Watch
 * detect missing tests
 * Support analysis for date ranges by filtering tests by date prior to analysis
 * improve the appearance of the charts