= Working with XML =

== Converting XML into lists ==
XML files are widely used these days, and you might have noticed that the highly organized tree structure of an XML file is similar to the nested list structures that we've met in newLISP. So wouldn't it be good if you could work with XML files as easily as you can work with lists?

You've already met the two main XML-handling functions (see the section on ref and ref-all). The xml-parse and xml-type-tags functions are all you need to convert XML into a newLISP list, and xml-error helps you diagnose parsing errors. xml-type-tags determines how xml-parse labels the different kinds of XML tag; xml-parse does the actual work of turning the XML source into a list.

To illustrate the use of these functions, we'll use the RSS newsfeed from the newLISP forum:

and store the retrieved XML in a file, to save hitting the server repeatedly:

The XML starts like this:

If you use xml-parse to parse the XML, but don't use xml-type-tags first, the output looks like this:

Although it looks a bit LISP-y already, you can see that the elements have been labelled with strings such as "ELEMENT" and "TEXT". For now, we could do without all these labels, and that's basically the job that xml-type-tags does. It lets you choose the labels for four types of XML tag: TEXT, CDATA, COMMENT, and ELEMENT. We'll hide them altogether with four nils. We'll also use some options for xml-parse to tidy up the output even more.

This is now a useful newLISP list, albeit quite a complicated one, stored in a symbol called sxml. (This way of representing XML is called S-XML.)
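To see the difference on something small, here is a minimal sketch that parses a tiny hand-written XML string (a stand-in for the forum feed, not the real thing) first with the default labels and then with the labels hidden and option 15. The string and the symbol names tagged and sxml are assumptions made for illustration.

```newlisp
;; Restore the default labels so the first parse is predictable.
(xml-type-tags "TEXT" "CDATA" "COMMENT" "ELEMENT")

;; A tiny made-up XML string standing in for the real feed (assumption).
(set 'xml "<rss><channel><title>newLISP forum</title></channel></rss>")

;; Default labels: every node carries a string such as "ELEMENT" or "TEXT".
(set 'tagged (xml-parse xml))
(println tagged)

;; Hide all four labels, then parse with options 1 + 2 + 4 + 8.
(xml-type-tags nil nil nil nil)
(set 'sxml (xml-parse xml 15))
(println sxml)   ; ((rss (channel (title "newLISP forum"))))
```

Note that xml-type-tags changes a global setting, so every later call to xml-parse is affected until you change it again.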

If you're wondering what that 15 was doing in the xml-parse expression, it's just a way of controlling how much of the auxiliary XML information is translated. The options are as follows:


 * 1 - suppress whitespace text tags
 * 2 - suppress empty attribute lists
 * 4 - suppress comment tags
 * 8 - translate string tags into symbols
 * 16 - add SXML (S-expression XML) attribute tags

You add them up to get the options code number, so 15 (+ 1 2 4 8) uses the first four of these options: suppress unwanted stuff, and translate string tags to symbols. As a result of this, new symbols have been added to newLISP's symbol table:

These correspond to the string tags in the XML file, and they'll be useful almost immediately.
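The option arithmetic can be sketched on a one-element document. The XML string below is made up for illustration; the point is only how the sum of option values changes the result, and that the tags really do come back as symbols.

```newlisp
(xml-type-tags nil nil nil nil)

;; Made-up one-element document with a single attribute (assumption).
(set 'xml "<fruit colour='red'>apple</fruit>")

(set 'options (+ 1 2 4 8))                 ; the 15 used in the text
(set 'plain (xml-parse xml options))
(println plain)                            ; ((fruit ((colour "red")) "apple"))

;; Adding 16 marks the attribute list with the @ symbol.
(println (xml-parse xml (+ options 16)))

;; Option 8 turned the tag name into a symbol.
(println (symbol? (first (first plain))))  ; true
```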

== Now what? ==
The story so far is basically this:

which has given us a list version of the news feed stored in the sxml symbol.

Because this list has a complicated nested structure, it's best to use ref and ref-all rather than find to look for things. ref finds the first occurrence of an expression in a list and returns the address:

These numbers are the address in the list of the first occurrence of the symbol item: (0 9 0) means start at item 0 of the whole list, then go to item 9 of that item, then go to item 0 of that one. (0-based indexing, of course!)
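Here is a small sketch of ref at work. The list is hand-built in the shape that xml-parse produces, and is far smaller than the real feed, so the address differs from the (0 9 0) discussed above.

```newlisp
;; Hand-built stand-in for the parsed feed (assumption).
(set 'sxml
  '((rss
      (channel
        (title "Forum")
        (item (title "First post") (link "a"))
        (item (title "Second post") (link "b"))))))

;; Address of the first occurrence of the symbol item.
(set 'addr (ref 'item sxml))
(println addr)   ; (0 1 2 0)
```

Reading the address from left to right: element 0 of the whole list, element 1 of that (the channel sublist), element 2 of that (the first item sublist), and element 0 of that, which is the symbol item itself.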

To find the higher-level or enclosing item, use chop to remove the last level of the address:

This now points to the level that contains the first item. It's like chopping the house number off an address, leaving the street name.

Now you can use this address with other expressions that accept a list of indices. The most convenient and compact form is probably the implicit address, which is just the name of the source list followed by the set of indices in a list:

That found the first occurrence of item, and returned the enclosing portion of the SXML.
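The chop step and the implicit address can be sketched together on a small hand-built list (an assumption standing in for the real feed):

```newlisp
;; Hand-built stand-in for the parsed feed (assumption).
(set 'sxml
  '((rss
      (channel
        (title "Forum")
        (item (title "First post") (link "a"))
        (item (title "Second post") (link "b"))))))

;; Drop the last index to get the address of the enclosing sublist.
(set 'addr (chop (ref 'item sxml)))
(println addr)          ; (0 1 2)

;; Implicit addressing: the list name followed by a list of indices.
(println (sxml addr))   ; (item (title "First post") (link "a"))
```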

Another technique available to you is to treat sections of the list as association lists:

Here we've found the first item, as before, and then looked up the first occurrence of title using lookup.
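As a sketch of the association-list approach, again on a hand-built list rather than the real feed (the name first-item is an assumption):

```newlisp
;; Hand-built stand-in for the parsed feed (assumption).
(set 'sxml
  '((rss
      (channel
        (title "Forum")
        (item (title "First post") (link "a"))
        (item (title "Second post") (link "b"))))))

;; The enclosing sublist of the first item ...
(set 'first-item (sxml (chop (ref 'item sxml))))

;; ... can be searched like an association list.
(println (lookup 'title first-item))   ; First post
(println (lookup 'link first-item))    ; a
```

The leading item symbol is not itself a sublist, so the search passes over it and finds the (title ...) and (link ...) pairs that follow.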

Use ref-all to find all occurrences of a symbol in a list. It returns a list of addresses:

With a simple list traversal, you can quickly show all the titles in the file, at whatever level they may be:

Without the two rests in there, you would have seen this:
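A minimal sketch of the ref-all traversal, using a hand-built list instead of the real feed. It takes the last element of each enclosing sublist rather than the two rests mentioned above, which is a simplification that works here because these sublists have no attribute lists.

```newlisp
;; Hand-built stand-in for the parsed feed (assumption).
(set 'sxml
  '((rss
      (channel
        (title "Forum")
        (item (title "First post") (link "a"))
        (item (title "Second post") (link "b"))))))

;; Every address at which the symbol title occurs, at any depth.
(set 'addresses (ref-all 'title sxml))
(println addresses)   ; ((0 1 1 0) (0 1 2 1 0) (0 1 3 1 0))

;; Chop each address to reach the enclosing (title ...) sublist,
;; then take its last element, the text.
(set 'titles (map (fn (addr) (last (sxml (chop addr)))) addresses))
(println titles)      ; ("Forum" "First post" "Second post")
```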

As you can see, there are many different ways to access the information in the SXML data. To produce a concise summary of the news in the XML file, one approach is to go through all the items, and extract the title and description entries. Because the description elements are a mass of escaped entities, we'll write a quick and dirty tidying routine as well:

Author: kosh
Post: hello kukma. I tried to make the newLISP.chm, and this is it...

Author: Lutz
Post: ... also, there was a sign-extension error in the newLISP co...

Author: kukma
Post: Thank you Lutz and welcome home again, the principle has bec...

Author: Kazimir Majorinc
Post: Apparently, Aparecido Valdemir de Freitas  completed his Dr...

Author: cormullion
Post: Upgrade seemed to go well - I think I found most of the file...

Author: Kazimir Majorinc
Post:  http://github.com/mtakuya/gauche-nl-lib   Statistics: Post...

Author: itistoday
Post: As part of my work on Dragonfly, I've updated newLISP's SMTP...

Author: Tim Johnson
Post:  itistoday wrote:     Tim Johnson wrote:  Have you done any...
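One way such a summary might be produced is sketched below. The tidy function is a hypothetical quick and dirty helper, the list is a hand-built stand-in for the parsed feed, and the labels printed are illustrative, so this is not the routine that produced the output above.

```newlisp
;; Hypothetical helper: unescape a few common entities, then strip tags.
(define (tidy str)
  (dolist (pair '(("&lt;" "<") ("&gt;" ">") ("&quot;" "\"") ("&amp;" "&")))
    (replace (first pair) str (last pair)))
  (replace "<[^>]*>" str " " 0)   ; anything that looks like a tag becomes a space
  (replace {\s+} str " " 0)       ; collapse runs of whitespace
  (trim str))

;; Hand-built stand-in for the parsed feed (assumption).
(set 'sxml
  '((rss
      (channel
        (item (title "Hello") (description "&lt;p&gt;First  post&lt;/p&gt;"))
        (item (title "Again") (description "Plain &amp; simple"))))))

;; Print a title and a shortened, tidied description for every item.
(dolist (addr (ref-all 'item sxml))
  (let (node (sxml (chop addr)))
    (println "Title: " (lookup 'title node))
    (println "Post: " (slice (tidy (lookup 'description node)) 0 60) "...")))
```

Because newLISP passes arguments by value, the replace calls inside tidy change only the function's own copy of the string.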

== Changing SXML ==
You can use similar techniques to modify data in XML format. For example, suppose you keep the periodic table of elements in an XML file, and you want to change the data about elements' melting points, currently stored in kelvins, to degrees Celsius. The XML data looks like this:

When the table has been loaded into the symbol sxml, using (set 'sxml (xml-parse xml 15)) (where xml contains the XML source), we want to change each sublist that has the following form:

You can use the set-ref-all function to find and replace elements in one expression. First, here's a function to convert a temperature from Kelvin to Celsius:

Now the set-ref-all function can be called just once to find all references and modify them in place, so that every melting-point is converted to Celsius. The form is:

where the function is the method used to find list elements given the key.

Here the match function searches the SXML list using a wild-card construction (MELTING_POINT ( (UNITS "Kelvin") ) *) to find every occurrence. The replacement expression builds a replacement sublist from the matched expression stored in $0. After this has been evaluated, the SXML changes from this:

to this:

XML isn't always as easy to manipulate as this - there are attributes, CDATA sections, and so on.
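The Kelvin-to-Celsius conversion described above can be sketched end to end on a small hand-written fragment. The surrounding PERIODIC_TABLE, ATOM and NAME tags, the two sample entries, and the function name kelvin-to-celsius are assumptions made for illustration. The sketch refers to the matched expression as $it, the name used in recent newLISP releases; if your version follows the $0 convention mentioned above, adjust accordingly.

```newlisp
;; Hand-written SXML fragment in the shape described above (assumption).
(set 'sxml
  '((PERIODIC_TABLE
      (ATOM (NAME "Mercury") (MELTING_POINT ((UNITS "Kelvin")) "234.32"))
      (ATOM (NAME "Gold") (MELTING_POINT ((UNITS "Kelvin")) "1337.33")))))

(define (kelvin-to-celsius k)
  (sub k 273.15))

;; General shape: (set-ref-all key list replacement comparison-function)
(set-ref-all
  '(MELTING_POINT ((UNITS "Kelvin")) *)      ; key: a match pattern
  sxml                                       ; list to search and modify
  (list (first $it)                          ; keep the tag
        '((UNITS "Celsius"))                 ; new attribute list
        (format "%.2f" (kelvin-to-celsius (float (last $it)))))
  match)                                     ; compare using match

(println sxml)
```

The values stay as strings, as they were in the parsed XML, so the replacement formats the converted number back into a string.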

== Outputting SXML to XML ==
If you want to go the other way and convert a newLISP list to XML, the following function suggests one possible approach. It recursively works through the list:

Which is - almost - where we started from!
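As a rough sketch of the recursive idea, the hypothetical function below handles only the simplest kind of SXML: elements written as (tag child ...) with string text and no attribute lists. It does no escaping of special characters, so treat it as a starting point rather than a complete serializer.

```newlisp
;; Hypothetical helper: turn attribute-free SXML back into an XML string.
(define (sxml-to-xml node)
  (cond
    ;; text node: emitted as-is (no entity escaping in this sketch)
    ((string? node) node)
    ;; element: a non-empty list whose first member is the tag symbol
    ((and (list? node) node (symbol? (first node)))
      (string "<" (first node) ">"
              (join (map sxml-to-xml (rest node)))
              "</" (first node) ">"))
    ;; a bare list of nodes, such as the outermost list from xml-parse
    ((list? node) (join (map sxml-to-xml node)))
    ;; anything else (numbers, for example) is converted to a string
    (true (string node))))

(println (sxml-to-xml '((rss (channel (title "Forum"))))))
; <rss><channel><title>Forum</title></channel></rss>
```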

== A simple practical example ==
The following example was originally set in the shipping department of a small business. I've changed the items to be pieces of fruit. The XML data file contains entries for all items sold, and the charge for each. We want to produce a report that lists how many at each price were sold, and the total value.

Here's an extract from the XML data:

This is the main function that defines and organizes the tasks:

Two functions are called: scan-file, which scans an XML file and stores the required information in a table, which is going to be some sort of newLISP list, and write-report, which scans this table and outputs a report.

The scan-file function receives a pathname, converts the file to SXML, finds all the charge items (using ref-all), and keeps a count of each value. We allow for the fact that some of the free items are marked variously as No Charge or no charge or nocharge:
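The counting step can be sketched as follows. To stay self-contained it parses an XML string rather than reading a file, and the tag names (sales, item, name, charge) and the helper names normalize and count-charges are assumptions, since the real data file may be laid out differently.

```newlisp
(xml-type-tags nil nil nil nil)

;; Made-up data in a plausible shape (assumption).
(set 'xml (string
  "<sales>"
  "<item><name>apple</name><charge>1.89</charge></item>"
  "<item><name>pear</name><charge>No Charge</charge></item>"
  "<item><name>plum</name><charge>1.89</charge></item>"
  "<item><name>fig</name><charge>nocharge</charge></item>"
  "</sales>"))

;; Fold the different spellings of free items into one key.
(define (normalize text)
  (if (= (lower-case (replace " " text "")) "nocharge")
      "No charge"
      text))

;; Build an association list of (charge count) pairs.
(define (count-charges sxml)
  (let (table '())
    (dolist (addr (ref-all 'charge sxml))
      (let (key (normalize (last (sxml (chop addr)))))
        (if (assoc key table)
            (setf (assoc key table) (list key (+ 1 (lookup key table))))
            (push (list key 1) table -1))))
    table))

(set 'table (count-charges (xml-parse xml 15)))
(println table)   ; (("1.89" 2) ("No charge" 2))
```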

The write-report function sorts and analyses the table, keeping running totals as it goes:

The report requires a bit more fiddling about than the scan-file function, particularly as the user wanted - for some reason - the 0 and no charge items to be kept separate.

Charge          Quantity     Subtotal
No charge            138
0.00                 145
0.11                   1         0.11
0.29                   1         0.29
1.89                  72       136.08
1.99                  17        33.83
2.99                  18        53.82
12.99                 55       714.45
17.99                  1        17.99
Total charged        165       956.57
Grand Total          448       956.57
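The running totals can be sketched like this. The function name report and the layout are assumptions, the table holds just three of the rows shown above, and the sketch does not try to keep the 0.00 and no charge rows apart as the real report does.

```newlisp
;; Three rows taken from the report above, as (charge quantity) pairs.
(set 'table '(("1.99" 17) ("No charge" 138) ("1.89" 72)))

;; Hypothetical report routine: prints one line per row and
;; returns (total-quantity total-value).
(define (report table)
  (let (total-qty 0 total-value 0)
    (println (format "%-12s %8s %10s" "Charge" "Quantity" "Subtotal"))
    (dolist (row (sort table))
      ;; non-numeric charges such as "No charge" count as 0
      (let (price (float (row 0) 0) qty (row 1))
        (set 'total-qty (+ total-qty qty))
        (set 'total-value (add total-value (mul price qty)))
        (println (format "%-12s %8d %10.2f" (row 0) qty (mul price qty)))))
    (println (format "%-12s %8d %10.2f" "Grand Total" total-qty total-value))
    (list total-qty total-value)))

(set 'totals (report table))
```

The rows are sorted as strings here, which happens to put these prices in numeric order; a fuller version would sort on the numeric value instead.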