Introduction to newLISP/Strings

= Strings =

String-handling tools are an important part of a programming language. newLISP has many easy-to-use and powerful string-handling tools, and you can easily add more to your toolbox if your particular needs aren't met.

Here's a guided tour of newLISP's string orchestra.

== Strings in newLISP code ==
You can write strings in three ways:

 * enclosed in double quotes
 * embraced by curly braces
 * enclosed in [text] and [/text] markup tags

like this:
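A minimal sketch of the three forms; the symbol names a, b, and c are chosen here purely for illustration:

```newlisp
(set 'a "this is a string")               ; double quotes
(set 'b {this is a string})               ; braces
(set 'c [text]this is a string[/text])    ; markup tags
(println (= a b c))                       ; all three hold the same text
```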

All three methods can handle strings of up to 2048 characters. For strings longer than 2048 characters, always use the [text] and [/text] tags to enclose the string.

Always use the first method, quotation marks, if you want escaped characters such as \n and \t, or code numbers (\046), to be processed.
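For instance, expressions along these lines should produce the output that follows (the decimal and hexadecimal codes are simply the character codes of one word, picked for illustration):

```newlisp
(set 's "this is a string \n with two lines")
(println s)
(println "\110\101\119\076\073\083\080")    ; decimal character codes
(println "\x6e\x65\x77\x4c\x49\x53\x50")    ; hexadecimal character codes
```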

this is a string with two lines

newLISP

newLISP

A double quotation character must be escaped with a backslash, as must a backslash itself, if you want either of them to appear inside a quoted string.

Use the second method, braces (or 'curly brackets'), for strings shorter than 2048 characters when you don't want any escaped characters to be processed:
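A short sketch of a braced string, in which the backslashes and quotation marks are kept exactly as typed:

```newlisp
(set 's {strings can be enclosed in \n"quotation marks" \n})
(println s)    ; the \n sequences are printed literally, not as line breaks
```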

strings can be enclosed in \n"quotation marks" \n

This is a really useful way of writing strings, because you don't have to worry about putting backslashes before every quotation character, or backslashes before other backslashes. You can nest pairs of braces inside a braced string, but you can't have an unmatched brace. I like to use braces for strings, because they face the correct way (which plain dumb quotation marks don't) and because your text editor might be able to balance and match them.

The third method, using [text] and [/text] markup tags, is intended for longer text strings running over many lines, and is used automatically by newLISP when it outputs large amounts of text. Again, you don't have to worry about which characters you can and can't include - you can put anything you like in, with the obvious exception of [/text]. Escape characters such as \n or \046 aren't processed either.

If you want to know the length of a string, use length:

Strings of millions of characters can be handled easily by newLISP.

Rather than length, use utf8len to get the length of a Unicode string:
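A small sketch contrasting the two functions. It assumes a UTF-8 enabled build of newLISP, and the sample word is made up for illustration:

```newlisp
(set 's (append "caf" (char 233)))   ; char 233 is the letter e with an acute accent
(println (length s))     ; counts bytes: 5 on a UTF-8 build
(println (utf8len s))    ; counts characters: 4
```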

== Making strings ==
Many functions, such as the file-reading ones, return strings or lists of strings for you. But if you want to build a string from scratch, one way is to start with the char function. This converts the supplied number to the character string with that code number. It can also reverse the operation, converting the supplied character string to its equivalent code number.
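A sketch of both directions. The results are written as newLISP displays them at the interactive prompt, and the last two assume a UTF-8 enabled build:

```newlisp
(char 33)        ;-> "!"
(char "!")       ;-> 33
(char 955)       ;-> "\206\187"       ; Greek lambda, decimal code
(char 0x2643)    ;-> "\226\153\131"   ; symbol for Jupiter, hex code
```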

These last two examples are available when you're running the Unicode-capable version of newLISP. Since Unicode is hexadecimally inclined, you can give a hex number, starting with 0x, to char. To see the actual characters, use a printing command:
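For example (again assuming a UTF-8 enabled build; the third line is just one possible way to assemble the hex code from strings):

```newlisp
(println (char 955))
(println (char 0x2643))
(println (char (int (string "0x" "2643"))))   ; int understands the 0x prefix
```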

λ

♃

♃

The backslashed numbers that appear when newLISP displays the value returned by char, rather than printing it, are the decimal values of the individual bytes in the multi-byte UTF-8 encoding of the Unicode glyph.

You can use char to build strings in other ways:
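One way to write this, as a sketch:

```newlisp
(join (map char (sequence (char "a") (char "z"))))
;-> "abcdefghijklmnopqrstuvwxyz"
```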

This uses char to find out the ASCII code numbers for a and z, and then uses sequence to generate a list of code numbers between the two. Then the char function is mapped onto every element of the list, so producing a list of strings. Finally, this list is converted to a single string by join.

join can also take a separator when building strings:
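A shortened variation with a hyphen as the separator (the range a to e is an arbitrary choice):

```newlisp
(join (map char (sequence (char "a") (char "e"))) "-")
;-> "a-b-c-d-e"
```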

Similar to join is append, which works directly on strings:
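For instance:

```newlisp
(append "con" "cat" "e" "nation")
;-> "concatenation"
```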

but even more useful is string, which turns any collection of numbers, lists, and strings into a single string.
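A sketch mixing a quoted list, braced strings, an evaluated list, and an escaped newline:

```newlisp
(string '(sequence 1 10) { produces } (sequence 1 10) "\n")
;-> "(sequence 1 10) produces (1 2 3 4 5 6 7 8 9 10)\n"
```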

Notice that the first list wasn't evaluated (because it was quoted) but that the second list was evaluated to produce a list of numbers, and the resulting list - including the parentheses - was converted to a string.

The string function, combined with the various string markers such as braces and markup tags, is one way to include the values of variables inside strings:
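A minimal sketch, with x as an illustrative variable:

```newlisp
(set 'x 42)
(string {the value of } 'x { is } x)
;-> "the value of x is 42"
```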

You can also use format to combine strings and symbol values. See Formatting strings.

dup makes copies:
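For example:

```newlisp
(dup "spam" 3)
;-> "spamspamspam"
```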

And date makes a date string:

or you can give it a number of seconds since 1970 to convert:
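A sketch of both uses. The exact text returned depends on your locale and time zone, so no output is shown; the number of seconds is an arbitrary example:

```newlisp
(println (date))               ; the current date and time
(println (date 1230000000))    ; a moment in late December 2008
```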

See Working with dates and times.

== String surgery ==
Now you've got your string, there are plenty of functions for operating on it. Some of these are destructive functions - they change the string permanently, possibly losing information for ever. Others are constructive, producing a new string and leaving the old one unharmed. See Destructive functions.

reverse is destructive:
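A sketch using a sample phrase stored in the symbol t:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(reverse t)
;-> "elcitrap cimotabus lanoisnemid-eno lacitehtopyh a"
```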

Now t has changed for ever. However, the case-changing functions aren't destructive, producing new strings without harming the old ones:
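For instance, with the sample phrase set afresh so that the sketch stands on its own:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(upper-case t)
;-> "A HYPOTHETICAL ONE-DIMENSIONAL SUBATOMIC PARTICLE"
(title-case t)
;-> "A hypothetical one-dimensional subatomic particle"
(lower-case "NewLISP")
;-> "newlisp"
t    ; still the original lower-case phrase
```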

=== Substrings ===
If you know which part of a string you want to extract, use one of the following constructive functions:
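A sketch of the most common ones, including indexing the string directly:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(first t)    ;-> "a"
(rest t)     ;-> " hypothetical one-dimensional subatomic particle"
(last t)     ;-> "e"
(t 2)        ;-> "h"   ; the character at index 2, counting from 0
```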

You can also use this technique with lists. See Selecting items from lists.

=== String slices ===
slice gives you a new slice of an existing string, counting either forwards from the start (positive integers) or backwards from the end (negative integers), for the given number of characters or to the specified position:
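A sketch of the three variations, using the same sample phrase:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(slice t 15 13)    ;-> "one-dimension"   ; 13 characters from index 15
(slice t -8 8)     ;-> "particle"        ; start 8 characters from the end
(slice t 2 -9)     ;-> "hypothetical one-dimensional subatomic"
                   ; from index 2, stopping 9 characters before the end
```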

There's a shortcut to do this, too. Put the required start and length before the string in a list:
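This is sometimes called implicit slicing. A sketch:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(15 13 t)    ;-> "one-dimension"
(0 14 t)     ;-> "a hypothetical"
```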

If you don't want a continuous run of characters, but want to cherry-pick some of them for a new string, use select followed by a sequence of character index numbers:
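For example, these particular index numbers pick out a hidden word from the sample phrase:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(select t 3 5 24 48 21 10 44 8)
;-> "yosemite"
```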

which is good for finding secret coded messages buried in text.

=== Changing the ends of strings ===
trim and chop are both constructive string-editing functions that work from the ends of the original strings inwards.

chop works from the end:
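A sketch, first with the default of one character and then with a count:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(chop t)      ;-> "a hypothetical one-dimensional subatomic particl"
(chop t 9)    ;-> "a hypothetical one-dimensional subatomic"
```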

trim can remove characters from both ends:
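A sketch with made-up strings: by default trim removes surrounding space, and you can name the characters to remove from the left and right ends:

```newlisp
(trim "    centred    ")            ;-> "centred"
(trim "----centred----" "-")        ;-> "centred"
(trim "----centred****" "-" "*")    ;-> "centred"
```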

=== push and pop work on strings too ===
You've seen push and pop adding and removing items from lists. They work on strings too. Use push to add characters to a string, and pop to remove one character from a string. Strings are added to or removed from the start of the string, unless you specify an index.
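A sketch with a short made-up string; the comments track the contents of t after each step:

```newlisp
(set 't "some ")
(push "this is " t)     ; t is now "this is some "
(push "text" t -1)      ; an index of -1 adds at the end: "this is some text"
(pop t)                 ;-> "t"     ; t is now "his is some text"
(pop t -4 4)            ;-> "text"  ; four characters, starting four from the end
```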

pop always returns what was popped, but push returns the modified target of the action. pop is useful when you want to break up a string and process the pieces as you go. For example, to print the newLISP version number, which is stored as a 4 or 5 digit integer, use something like this:
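A sketch of the idea. To keep it reproducible, a literal version string is assumed here; in practice the integer would come from newLISP itself, for example via sys-info:

```newlisp
(set 'version-string "10402")                  ; assumed sample value
(set 'dev-version (pop version-string -2 2))   ; last two digits: "02"
(set 'point-version (pop version-string -1))   ; next digit: "4"
(println version-string "." point-version "." dev-version)
; prints 10.4.02
```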

It's easier to work from the right-hand side of the string and use pop to extract the information and remove it in one operation.

== Modifying strings ==
There are two approaches to changing characters inside a string. Either use the index numbers of the characters, or specify the substring you want to find or change.

=== Using index numbers in strings ===
To change characters by their index numbers, use setf, the general purpose function for changing strings, lists, and arrays:

You could also use nth with setf to specify the location:
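A sketch of both spellings, which change the same character:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(setf (t 0) "A")          ; implicit indexing
(setf (nth 0 t) "A")      ; the same location, named with nth
t
;-> "A hypothetical one-dimensional subatomic particle"
```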

Here's how to 'increment' the first (zeroth) letter of a string:
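One possible way to write it:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(setf (t 0) (char (inc (char $it))))   ; "a" -> 97 -> 98 -> "b"
t
;-> "b hypothetical one-dimensional subatomic particle"
```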

$it contains the value found by the first part of the setf expression, and its numeric value is incremented to form the second part.

=== Changing substrings ===
If you don't want to - or can't - use index numbers or character positions, use replace, a powerful destructive function that does all kinds of useful operations on strings. Use it in the form:

So:
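A sketch of the general shape (shown as a comment) followed by a concrete call on the sample phrase:

```newlisp
; general shape: (replace old-string source-string replacement)
(set 't "a hypothetical one-dimensional subatomic particle")
(replace "hypoth" t "theor")
;-> "a theoretical one-dimensional subatomic particle"
```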

replace is destructive, but if you want to use replace or another destructive function constructively for its side effects, without modifying the original string, use the copy function:
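For instance (the replacement word is an arbitrary choice):

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(replace "particle" (copy t) "lemon")
;-> "a hypothetical one-dimensional subatomic lemon"
t
;-> "a hypothetical one-dimensional subatomic particle"
```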

The copy is modified by replace. The original string t is unaffected.

=== Regular expressions ===
replace is one of a group of newLISP functions that accept regular expressions for defining patterns in text. For most of them, you add an extra number at the end of the expression which specifies options for the regular expression operation: 0 means basic matching, 1 means case-insensitive matching, and so on.

Sometimes I put comments inside regular expressions, so that I know what I was trying to do when I read the code some days later. Text between (?# and the following closing parenthesis is ignored.
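A sketch of an embedded comment. The pattern, chosen for illustration, removes the first word that starts with h and ends with l:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(replace {h.*?l(?# h followed by l but not too greedy)\b} t {} 0)
;-> "a  one-dimensional subatomic particle"
```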

If you're happy working with Perl-compatible Regular Expressions (PCRE), you'll be happy with replace and its regex-using cousins (find, regex, find-all, parse, starts-with, ends-with, directory, and search). Full details are in the newLISP reference manual.

You have to steer your pattern through both the newLISP reader and the regular expression processor. Remember the difference between strings enclosed in quotes and strings enclosed in braces? Quotes allow the processing of escaped characters, whereas braces don't. Braces have some advantages: they face each other visually, they don't have smart and dumb versions to confuse you, your text editor might balance them for you, and they let you use the more commonly occurring quotation characters in strings without having to escape them all the time. But if you use quotes, you must double the backslashes, so that a single backslash survives intact as far as the regular expression processor:
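A sketch comparing the two spellings of the same whitespace pattern (the phrase is made up):

```newlisp
(replace "\\s" "this is a phrase" "|" 0)   ; quotes: backslash doubled
;-> "this|is|a|phrase"
(replace {\s} "this is a phrase" "|" 0)    ; braces: backslash as is
;-> "this|is|a|phrase"
```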

=== System variables: $0, $1 ... ===
replace updates a set of system variables $0, $1, $2, up to $15, with the matches. These refer to the parenthesized expressions in the pattern, and are the equivalent of the \1, \2 that you might be familiar with if you've used grep. For example:
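The following is a reconstruction, not necessarily the exact call used originally, of the kind of expression that yields output like the line after it. The symbol name quotation and the pattern are assumptions; option 4 lets the dot match across line breaks:

```newlisp
(set 'quotation {"I cannot explain." She spoke in a low, eager voice, with a curious lisp in her utterance. "But for God's sake do what I ask you. Go back and never set foot upon the moor again."})
(replace {(.*?),.*?curious\s*(l.*p\W)(.*?)(moor)(.*)}
    quotation
    (println { $1 } $1 { $2 } $2 { $3 } $3 { $4 } $4 { $5 } $5)
    4)
```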

$1 "I cannot explain." She spoke in a low $2 lisp $3 in her utterance. "But for God's sake do what I ask you. Go back and never set foot upon the $4 moor $5 again."

Here we've looked for five patterns, separated by any string starting with a comma and ending with the word curious. $0 stores the matched expression, $1 stores the first parenthesized sub-expression, and so on.

If you prefer to use quotation marks rather than the braces I used here, remember that certain characters have to be escaped with a backslash.

=== The replacement expression ===
The previous example demonstrates that an important feature of replace is that the replacement doesn't have to be just a simple string or list, it can be any newLISP expression. Each time the pattern is found, the replacement expression is evaluated. You can use this to provide a replacement value that's calculated dynamically, or you could do anything else you wanted to with the found text. It's even possible to evaluate an expression that's got nothing to do with found text at all.

Here's another example: search for the letter t followed either by the letter h or by any vowel, and print out the combinations that replace found:
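A sketch of such a call on the sample phrase:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(replace {t[h]|t[aeiou]} t (println $0) 0)
```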

th ti to ti

For every matching piece of text found, the third expression, (println $0), was evaluated. This is a good way of seeing what the regular expression engine is up to while the function is running. In this example, the original string appears to be unchanged, but in fact it did change, because (println $0) did two things: it printed the string, and it returned the value to replace, thus replacing the found text with itself. Invisible mending! If the replacement expression doesn't return a string, no replacement occurs.

You could do other useful things too, such as build a list of matches for later processing, and you can use the newLISP system variables and any other function to use any of the text that was found.

In the next example, we look for the letters a, e, or c, and force each occurrence to upper-case:
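For instance (the sentence is a made-up sample):

```newlisp
(replace "a|e|c" "This is a sentence" (upper-case $0) 0)
;-> "This is A sEntEnCE"
```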

As another example, here's a simple search and replace operation that keeps count of how many times the letter 'o' has been found in a string, and replaces each occurrence in the original string with the count so far. The replacement is a block of expressions grouped into a single begin expression. This block is evaluated every time a match is found:
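A sketch of how this could be written; the symbol name counter is an assumption:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(set 'counter 0)
(replace "o" t
    (begin
        (inc counter)                               ; count this match
        (println {replacing "o" number } counter)   ; report progress
        (string counter))                           ; value used as the replacement
    0)
```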

replacing "o" number 1
replacing "o" number 2
replacing "o" number 3
replacing "o" number 4
"a hyp1thetical 2ne-dimensi3nal subat4mic particle"

The output from println doesn't appear in the string; the final value of the entire begin expression is a string version of the counter, so that gets inserted into the string.

Here's yet another example of replace in action. Suppose I have a text file, consisting of the following:

1 a = 15
2 another_variable = "strings"
4 x2 = "another string"
5 c = 25
3x=9

I want to write a newLISP script that re-numbers the lines in multiples of 10, starting at 10, and aligns the text so that the equals signs line up, like this:

10 a                  = 15
20 another_variable   = "strings"
30 x2                 = "another string"
40 c                  = 25
50 x                  = 9

(I don't know what language this is!)

The following script will do this:
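What follows is a self-contained sketch of such a script rather than the original listing. It assumes the lines are held in a list called lines instead of being read from a file, so a dolist loop stands in for the while loop; the names counter and temp match the description below:

```newlisp
(set 'lines '({1 a = 15}
              {2 another_variable = "strings"}
              {4 x2 = "another string"}
              {5 c = 25}
              {3x=9}))
(set 'counter 0)
(dolist (line lines)
    (set 'temp
        (replace {^(\d*)(\s*)(.*)}              ; the numbering
            (copy line)
            (string (inc counter 10) " " $3)
            0))
    (println
        (replace {(\S*)(\s*)(=)(\s*)(.*)}       ; the spaces around =
            temp
            (string $1 (dup " " (- 20 (length $1))) $3 " " $5)
            0)))
```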

I've used two replace operations inside the while loop, to keep things clearer. The first one sets a temporary variable to the result of a replace operation. The search string ({^(\d*)(\s*)(.*)}) is a regular expression that's looking for any number at the start of a line, followed by some space, followed by anything. The replacement expression, (string (inc counter 10) " " $3), consists of an incremented counter value, followed by the third match (i.e. the anything I just looked for).

The result of the second replace operation is printed. I'm searching the temporary variable temp for more strings and spaces with an equals sign in the middle:

The replacement expression is built up from the important found elements ($1, $3, $5) but it also includes a quick calculation of the amount of space required to bring the equals sign across to character 20, which should be the difference between the first item's width and position 20 (which I've chosen arbitrarily as the location for the equals sign).

Regular expressions aren't very easy for the newcomer, but they're very powerful, particularly with newLISP's replace function, so they're worth learning.

== Testing and comparing strings ==
There are various tests that you can run on strings. newLISP's comparison operators work by finding and comparing the code numbers of the characters until a decision can be made:
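A few sketched comparisons, with strings chosen for illustration:

```newlisp
(> {Higgs Boson} {Higgs boson})   ;-> nil    ; B comes before b
(> {Higgs Boson} {Higgs})         ;-> true   ; same start, but longer
(< {dollar} {euro})               ;-> true   ; d comes before e
(> {newLISP} {LISP})              ;-> true   ; n comes after L
(= {fred} {Fred})                 ;-> nil
(= {fred} {fred})                 ;-> true
```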

and of course newLISP's flexible argument handling lets you test loads of strings at the same time:
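For example:

```newlisp
(< "a" "c" "d" "f" "h")    ;-> true   ; each string is less than the next
(< "a" "c" "b" "f" "h")    ;-> nil    ; "c" is not less than "b"
```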

You can also call these comparison functions with a single argument. If you supply only one argument, newLISP helpfully assumes that you mean 0 or "", depending on the type of the first argument:
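A sketch of the single-argument form, following the description above:

```newlisp
(> 1)         ;-> true   ; treated as (> 1 0)
(> -1)        ;-> nil
(> "fred")    ;-> true   ; treated as (> "fred" "")
```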

To check whether two strings share common features, you can either use starts-with and ends-with, or the more general pattern matching commands member, regex, find, and find-all. starts-with and ends-with are simple enough:
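For example:

```newlisp
(starts-with "newLISP" "new")    ;-> true
(ends-with "newLISP" "LISP")     ;-> true
(ends-with "newLISP" "lisp")     ;-> nil    ; plain matching is case-sensitive
```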

They can also accept regular expressions, using one of the regex options (0 being the most commonly used):
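A sketch with simple patterns, which are assumptions chosen for illustration:

```newlisp
(starts-with "newLISP" "[a-z]+" 0)    ;-> true   ; begins with lower-case letters
(ends-with "newLISP" "lisp" 1)        ;-> true   ; option 1: ignore case
```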

find, find-all, member, and regex look everywhere in a string. find returns the index of the matching substring:
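For instance:

```newlisp
(find "cat" "concatenation")    ;-> 3
(find "dog" "concatenation")    ;-> nil
```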

member looks to see if one string is in another. It returns the rest of the string, including the search string, rather than the index of the first occurrence.
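For instance:

```newlisp
(member "cat" "concatenation")    ;-> "catenation"
```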

Both find and member let you use regular expressions:
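A sketch with a small pattern, c followed by a vowel and t, chosen for illustration:

```newlisp
(find {c[aeiou]t} "concatenation" 0)      ;-> 3
(member {c[aeiou]t} "concatenation" 0)    ;-> "catenation"
```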

find-all works like find, but returns a list of all matching strings, rather than the index of just the first match. It always takes regular expressions, so - for once - you don't have to put regex option numbers at the end.
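For example, collecting every t that is followed by a vowel:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(find-all {t[aeiou]} t)
;-> ("ti" "to" "ti")
```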

Or you could use regex. This returns nil if the string doesn't contain the pattern, but, if it does contain the pattern, it returns a list with the matched strings and substrings, and the start and length of each string. The results can be quite complicated:
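A sketch of a call that gives a result of this shape; the symbol name quotation and the pattern are assumptions:

```newlisp
(set 'quotation {She spoke in a low, eager voice, with a curious lisp in her utterance.})
(println (regex {(.*)(l.*)(l.*p)(.*)} quotation 0))
```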

("She spoke in a low, eager voice, with a curious lisp in her utterance." 0 70 "She spoke in a " 0 15 "low, eager voice, with a curious " 15 33 "lisp" 48 4 " in her utterance." 52 18)

This results list can be interpreted as 'the first match was from character 0 continuing for 70 characters, the second from character 0 continuing for 15 characters, another from character 15 for 33 characters', and so on.

The matches are also stored in the system variables ($0, $1, ...) which you can inspect easily with a simple loop:
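A sketch of such a loop, with the regex call repeated so that the example stands on its own (quotation and the pattern are assumptions):

```newlisp
(set 'quotation {She spoke in a low, eager voice, with a curious lisp in her utterance.})
(regex {(.*)(l.*)(l.*p)(.*)} quotation 0)
(for (x 1 4)
    (println {$} x ": " ($ x)))    ; ($ x) fetches system variable number x
```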

$1: She spoke in a
$2: low, eager voice, with a curious
$3: lisp
$4: in her utterance.

== Strings to lists ==
Two functions let you convert strings to lists, ready for manipulation with newLISP's extensive list-processing powers. The well-named explode function cracks open a string and returns a list of single characters:
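For example:

```newlisp
(explode "newLISP")
;-> ("n" "e" "w" "L" "I" "S" "P")
```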

The explosion is easily reversed with join. explode can also take an integer. This defines the size of the fragments. For example, to divide up a string into cryptographer-style 5 letter groups, remove the spaces and use explode like this:
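A sketch using the sample phrase. Note that replace removes the spaces from t itself:

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")
(explode (replace " " t "") 5)
;-> ("ahypo" "theti" "calon" "e-dim" "ensio" "nalsu" "batom" "icpar" "ticle")
```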

You can do similar tricks with find-all. Watch the end, though:
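A sketch with a short made-up string, showing that leftover characters which don't fill a whole group are dropped:

```newlisp
(find-all {.{3}} "abcdefgh")
;-> ("abc" "def")      ; the trailing "gh" is lost
(explode "abcdefgh" 3)
;-> ("abc" "def" "gh")
```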

== Parsing strings ==
parse is a powerful way of breaking strings up and returning the pieces. Used on its own, it breaks strings apart, usually at word boundaries, eats the boundaries, and returns a list of the remaining pieces:
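For example, with the sample phrase:

```newlisp
(parse "a hypothetical one-dimensional subatomic particle")
;-> ("a" "hypothetical" "one-dimensional" "subatomic" "particle")
```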

Or you can supply a delimiting character, and parse breaks the string whenever it meets that character:
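A sketch using a file path, which is only an illustrative string here and needn't exist:

```newlisp
(set 'pathname {/System/Library/Fonts/Courier.dfont})
(parse pathname {/})
;-> ("" "System" "Library" "Fonts" "Courier.dfont")
```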

By the way, I could eliminate that first empty string from the list by filtering it out:
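One way to do that, sketched with clean and empty? on an illustrative path:

```newlisp
(set 'pathname {/System/Library/Fonts/Courier.dfont})
(clean empty? (parse pathname {/}))
;-> ("System" "Library" "Fonts" "Courier.dfont")
```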

You can also specify a delimiter string rather than a delimiter character:
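For instance, splitting on a comma followed by a space:

```newlisp
(parse "one, two, three" ", ")
;-> ("one" "two" "three")
```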

Best of all, though, you can specify a regular expression delimiter. Make sure you supply the options flag (0 or whatever), as with most of the regex functions in newLISP:
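A sketch in which any run of commas, semicolons, or white space counts as a delimiter (the input is made up):

```newlisp
(parse "apples, pears;plums  figs" {[,;\s]+} 0)
;-> ("apples" "pears" "plums" "figs")
```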

Here's that well-known quick and not very reliable HTML-tag stripper:
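A sketch applied to a short inline string rather than a real file. The html symbol and its contents are assumptions; option 4 lets the dot match line breaks too:

```newlisp
(set 'html {<p>Hello <b>world</b></p>})
(join (parse html {<.*?>} 4))    ; the tags are the delimiters, so they vanish
;-> "Hello world"
```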

For parsing XML strings, newLISP provides the function xml-parse. See Working with XML.

Take care when using parse on text. Unless you specify exactly what you want, it thinks you're passing it newLISP source code. This can produce surprising results:
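For example, try something like this and look at what is missing from the result (the sentence is a made-up sample):

```newlisp
(set 't {Eats, shoots, and leaves ; a book by Lynn Truss})
(println (parse t))
```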

The semicolon is considered a comment character in newLISP, so parse has ignored it and everything that followed on that line. Tell it what you really want, using delimiters or regular expressions:

or
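Both alternatives, sketched on a sample sentence; the first uses a space as the delimiter and the second a regular expression:

```newlisp
(set 't {Eats, shoots, and leaves ; a book by Lynn Truss})
(parse t " ")
;-> ("Eats," "shoots," "and" "leaves" ";" "a" "book" "by" "Lynn" "Truss")
(parse t "\\s" 0)
;-> ("Eats," "shoots," "and" "leaves" ";" "a" "book" "by" "Lynn" "Truss")
```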

If you want to chop strings up in other ways, consider using find-all, which returns a list of strings that match a pattern. If you can specify the chopping operation as a regular expression, you're in luck. For example, if you want to split a number into groups of three digits, use this technique:

(set 'a "1212374192387562311")
(println (find-all {\d{3}|\d{2}$|\d$} a))
;-> ("121" "237" "419" "238" "756" "231" "1")

; alternatively
(explode a 3)
;-> ("121" "237" "419" "238" "756" "231" "1")

The pattern has to consider cases where there are 2 or 1 digits left over at the end.

parse eats the delimiters once they've done their work - find-all finds things and returns what it finds.

== Other string functions ==
There are other functions that work with strings. search looks for a string inside a file on disk:
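A sketch of how this might look. The log file path is an assumption and will differ between systems, and the code takes for granted that the search succeeds:

```newlisp
(set 'f (open {/private/var/log/system.log} {read}))   ; assumed path
(search f {kernel})                  ; move to the first occurrence
(seek f (- (seek f) 64))             ; rewind the file pointer
(dotimes (n 3)
    (println (read-line f)))
(close f)
```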

This example looks in system.log for the string kernel. If it's found, newLISP rewinds the file pointer by 64 characters, then prints out three lines, showing the line in context.

There are also functions for working with base64-encoded files, and for encrypting strings.

== Formatting strings ==
It's worth mentioning the format function, which lets you insert the values of newLISP expressions into a pre-defined template string. Use %s to represent the location of a string expression inside the template, and other % codes to include numbers. For example, suppose you want to display a list of files like this:

A suitable template for folders (directories) looks like this:

Give the format function a template string, followed by the expression (f) that produces a file or folder name:

When this is evaluated, the contents of f is inserted into the string where the %s is. The code to generate a directory listing in this format, using the directory function, looks like this:
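A sketch of such a listing for the current directory. The exact wording and spacing of the two template strings are assumptions:

```newlisp
(dolist (f (directory))
    (if (directory? f)
        (println (format "folder: %s" f))
        (println (format "file: %s" f))))
```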

I'm using the directory? function to choose the right template string. A typical listing looks like this:

folder: .
folder: ..
file: .DS_Store
file: .hotfiles.btree
folder: .Spotlight-V100
folder: .Trashes
folder: .vol
file: .VolumeIcon.icns
folder: Applications
folder: Applications (Mac OS 9)
folder: automount
folder: bin
folder: Cleanup At Startup
folder: cores
...

There are lots of formatting codes that you can use to produce the output you want. You use numbers to control the alignment and precision of the strings and numbers. Just make sure that the % constructions in the format string match the expressions or symbols that appear after it, and that there are the same number of each.

Here's another example. We'll display the first 400 or so Unicode characters in decimal, hexadecimal, and binary. We'll use the bits function to generate a binary string. We feed a list of three values to format after the format string, which has three entries:
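A sketch of such a loop. It assumes a UTF-8 enabled build, and the field widths and the upper limit of 0x01a0 are illustrative choices:

```newlisp
(for (x 32 0x01a0)
    (println (char x)                   ; the character itself, then
        (format "%4d\t%4x\t%10s"        ; decimal, hex, and binary string
            (list x x (bits x)))))
```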

   32      20     100000
!  33      21     100001
"  34      22     100010
#  35      23     100011
$  36      24     100100
%  37      25     100101
&  38      26     100110
'  39      27     100111
(  40      28     101000
)  41      29     101001
...

== Strings that make newLISP think ==
Lastly, I must mention eval and eval-string. Both of these let you give newLISP code to newLISP for evaluation. If it's valid newLISP, you'll see the result of the evaluation. eval wants an expression:
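For instance, with an expression stored in an illustrative symbol:

```newlisp
(set 'expr '(+ 1 2))    ; a quoted expression, not yet evaluated
(eval expr)
;-> 3
```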

eval-string wants a string:
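For instance:

```newlisp
(eval-string "(+ 1 2)")
;-> 3
```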

This means that you can build newLISP code, using any of the functions we've met, and then have it evaluated by newLISP. eval is particularly useful when you're defining macros - functions that delay evaluation until you choose to do it. See Macros.

You could use eval and eval-string to write programs that write programs.

The following curious piece of newLISP continually and mindlessly rearranges a few strings and tries to evaluate the result. Unsuccessful attempts are safely caught. When it finally becomes valid newLISP, it will be evaluated successfully and the result will satisfy the finishing condition and finish the loop.
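A sketch of how such a program could be written; the symbol names code, valid, and result are assumptions. Each pass shuffles the five pieces, prints them, and tries to evaluate them, with catch absorbing any errors:

```newlisp
(set 'code '(")" "set" "'valid" "true" "("))
(set 'valid nil)
(until valid
    (set 'code (randomize code))                      ; shuffle the pieces
    (println (join code " "))                         ; show this attempt
    (catch (eval-string (join code " ")) 'result))    ; errors are caught here
```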

true 'valid set ) ( ) ( set true 'valid 'valid ( set true ) set 'valid true 'valid ) ( true set set true ) ( 'valid true ) ( set 'valid 'valid ( true ) set true 'valid set 'valid ) ( true set true ( 'valid ) set set ( 'valid ) true set true 'valid ( set 'valid true )

I've used programs that were obviously written using this programming technique...