Visual Basic/Regular Expressions

Sometimes the built-in string functions are not the most convenient or elegant solution to the problem at hand. If the task involves manipulating complicated patterns of characters, regular expressions can be a more effective tool than sequences of simple string functions.

Visual Basic has no built-in support for regular expressions. It can, however, use them via the VBScript Regular Expressions library. If you have Internet Explorer installed, you almost certainly have the library. To use it, you must add a reference to the project: on the Project menu, choose References and scroll down to Microsoft VBScript Regular Expressions. There might be more than one version; if so, choose the one with the highest version number, unless you have a particular reason to choose an older one, such as compatibility with the version installed on another machine.

Class outline
Class outline of VBScript.RegExp class:
 * Attributes
   * RegExp.Pattern
   * RegExp.Global
   * RegExp.IgnoreCase
   * RegExp.MultiLine
 * Methods
   * RegExp.Test
   * RegExp.Replace
   * RegExp.Execute

Constructing a regexp
A method of constructing a regular expression object:
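A minimal sketch using late binding through CreateObject, which needs no project reference. The helper name NewRegExpLateBound and the digit pattern are illustrative assumptions, not part of the library.

```vb
' Hypothetical helper: builds a regular expression object by late binding.
Public Function NewRegExpLateBound(ByVal PatternText As String) As Object
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = PatternText
    Set NewRegExpLateBound = Re
End Function

Public Sub DemoLateBound()
    Dim Re As Object
    Set Re = NewRegExpLateBound("[0-9][0-9]*")
    Debug.Print Re.Pattern      ' prints [0-9][0-9]*
End Sub
```

Late binding defers the lookup of Pattern, Test and the other members to run time, so the code compiles even on a machine where the reference has not been set.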

A method of constructing a regular expression object that requires that, in Excel, you set a reference to Microsoft VBScript Regular Expressions:
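A minimal sketch using early binding, which compiles only once the reference to Microsoft VBScript Regular Expressions is set. The helper name NewRegExpEarlyBound is a hypothetical name chosen for illustration.

```vb
' Requires a project reference to Microsoft VBScript Regular Expressions.
' Hypothetical helper: builds a typed RegExp object.
Public Function NewRegExpEarlyBound(ByVal PatternText As String) As RegExp
    Dim Re As RegExp
    Set Re = New RegExp
    Re.Pattern = PatternText
    Set NewRegExpEarlyBound = Re
End Function

Public Sub DemoEarlyBound()
    Dim Re As RegExp
    Set Re = NewRegExpEarlyBound("[0-9][0-9]*")
    Debug.Print Re.Test("abc123")   ' prints True
End Sub
```

With the typed RegExp variable, the editor can offer member completion and the compiler can check member names.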

Testing for match
An example of testing for a match of a regular expression:
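A minimal sketch of the Test method, assuming late binding; the function name ContainsDigits and the sample strings are illustrative.

```vb
' Hypothetical helper: True if Source contains at least one digit.
Public Function ContainsDigits(ByVal Source As String) As Boolean
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = "[0-9][0-9]*"
    ContainsDigits = Re.Test(Source)
End Function

Public Sub DemoTest()
    Debug.Print ContainsDigits("354647")    ' True
    Debug.Print ContainsDigits("a354647")   ' True: a match anywhere in the string is enough
    Debug.Print ContainsDigits("abc")       ' False
End Sub
```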

An example of testing for a match in which the whole string has to match:
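A minimal sketch that anchors the pattern with ^ and $ so that the whole string must consist of digits; the function name IsAllDigits is an illustrative assumption.

```vb
' Hypothetical helper: True only if Source consists entirely of digits.
Public Function IsAllDigits(ByVal Source As String) As Boolean
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = "^[0-9][0-9]*$"    ' ^ anchors at the start, $ at the end
    IsAllDigits = Re.Test(Source)
End Function

Public Sub DemoWholeStringTest()
    Debug.Print IsAllDigits("354647")    ' True
    Debug.Print IsAllDigits("a354647")   ' False: the leading letter breaks the match
End Sub
```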

Finding matches
An example of iterating through the collection of all the matches of a regular expression in a string:
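A minimal sketch of iterating over the MatchCollection returned by Execute. The function name FindAll, the separator character and the sample text are assumptions made for illustration.

```vb
' Hypothetical helper: returns all matches joined with "|".
Public Function FindAll(ByVal PatternText As String, ByVal Source As String) As String
    Dim Re As Object
    Dim Matches As Object
    Dim M As Object
    Dim Result As String

    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = PatternText
    Re.Global = True    ' without Global, only the first match is returned

    Set Matches = Re.Execute(Source)
    For Each M In Matches
        If Len(Result) > 0 Then Result = Result & "|"
        Result = Result & M.Value
    Next M
    FindAll = Result
End Function

Public Sub DemoFindAll()
    Debug.Print FindAll("a.*?z", "aaz abz acz ad1z")   ' aaz|abz|acz|ad1z
End Sub
```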

Finding groups
An example of accessing matched groups:
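A minimal sketch of reading captured groups through the SubMatches collection of a match. The function name ParsePair, the pattern and the "name=number" input format are illustrative assumptions.

```vb
' Hypothetical helper: splits text such as "width=120" into its two groups.
Public Function ParsePair(ByVal Source As String) As String
    Dim Re As Object
    Dim Matches As Object

    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = "([a-z]+)=([0-9]+)"

    Set Matches = Re.Execute(Source)
    If Matches.Count = 0 Then
        ParsePair = ""
    Else
        ' SubMatches is zero-based: item 0 is the first bracketed group
        ParsePair = Matches(0).SubMatches(0) & " -> " & Matches(0).SubMatches(1)
    End If
End Function

Public Sub DemoGroups()
    Debug.Print ParsePair("width=120")   ' width -> 120
End Sub
```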

Replacing
An example of replacing all sequences of dashes with a single dash:
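A minimal sketch of that replacement using the Replace method with Global switched on; the function name CollapseDashes is an illustrative assumption.

```vb
' Hypothetical helper: collapses each run of dashes into a single dash.
Public Function CollapseDashes(ByVal Source As String) As String
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = "-+"
    Re.Global = True    ' replace every run, not just the first
    CollapseDashes = Re.Replace(Source, "-")
End Function

Public Sub DemoCollapseDashes()
    Debug.Print CollapseDashes("A-B--C----D")   ' A-B-C-D
End Sub
```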

An example of replacing doubled strings with their single version with the use of two sorts of backreference:
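A minimal sketch showing both forms of backreference: \1 inside the pattern refers to the text already captured by the first group, and $1 inside the replacement string inserts that captured text. The function name Undouble and the sample strings are illustrative assumptions.

```vb
' Hypothetical helper: replaces any doubled substring with a single copy.
Public Function Undouble(ByVal Source As String) As String
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = "(.+)\1"    ' \1 must repeat exactly what group 1 captured
    Re.Global = True
    Undouble = Re.Replace(Source, "$1")   ' $1 is the text of group 1
End Function

Public Sub DemoUndouble()
    Debug.Print Undouble("hellohello")   ' hello
    Debug.Print Undouble("abab cdcd")    ' ab cd
End Sub
```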

Splitting
There is no direct support for splitting by a regular expression, but there is a workaround. If you can assume that the string to be split does not contain Chr(1), you can first replace every match of the separator regular expression with Chr(1), and then use the ordinary Split function on Chr(1).

An example of splitting by a non-zero number of spaces:
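A minimal sketch of the Chr(1) workaround. The function name RegExpSplit is hypothetical, and the code assumes that the input never contains Chr(1).

```vb
' Hypothetical helper: splits Source wherever SeparatorPattern matches.
' Assumes Source does not contain Chr(1).
Public Function RegExpSplit(ByVal Source As String, _
                            ByVal SeparatorPattern As String) As Variant
    Dim Re As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = SeparatorPattern
    Re.Global = True
    ' Replace every separator with Chr(1), then split on that marker
    RegExpSplit = Split(Re.Replace(Source, Chr(1)), Chr(1))
End Function

Public Sub DemoSplit()
    Dim Parts As Variant
    Parts = RegExpSplit("a b  c   d", " +")
    Debug.Print UBound(Parts) + 1    ' 4
    Debug.Print Join(Parts, ",")     ' a,b,c,d
End Sub
```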

Example application
For many beginning programmers, the ideas behind regular expressions are so foreign that it is worth presenting a simple example before discussing the theory. The example is in fact the beginning of an application for scraping web pages to retrieve source code, so it is relevant too.

Imagine that you need to parse a web page to pick up the major headings and the content to which the headings refer. Such a web page might look like this:



 <html>
   <head>
     <title>RegEx Example</title>
   </head>
   <body>
     <h1> RegEx Example </h1>
     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
     <h2>Level Two in RegEx Example</h2>
     bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
     <h1> Level One </h1>
     cccccccccccccccccccccccccccccccccccccc
     <h2>Level Two in Level One</h2>
     dddddddddddddddddddddddddddddddddddd
   </body>
 </html>

What we want to do is extract the text in the two h1 elements and all the text between the first h1 and the second h1 as well as all the text between the second h1 element and the end of body tag.

We could store the results in an array that looks like this:

The \n character sequences represent end of line marks. These could be any of carriage return, line feed or carriage return followed by line feed.

A regular expression specifies patterns of characters to be matched, and the result of the matching process is a list of substrings that match either the whole expression or some parts of it. An expression that does what we want might look like this:

"<h1>\s*([\s\S]*?)\s*</h1>"

Actually it doesn't quite do it, but it is close. The result is a collection of matches in an object of type MatchCollection:

 Item 0
   .FirstIndex: 89
   .Length: 24
   .Value: "<h1> RegEx Example </h1>"
   .SubMatches:
     .Count: 1
     Item 0: "RegEx Example"
 Item 1
   .FirstIndex: 265
   .Length: 20
   .Value: "<h1> Level One </h1>"
   .SubMatches:
     .Count: 1
     Item 0: "Level One"

The name of the item is in the SubMatches of each item, but where is the text? To get that, we can simply use Mid$ together with the FirstIndex and Length properties of each match to find the start and finish of the text between the end of one h1 and the start of the next.

However, as usual, there is a problem. The last match is terminated not by another h1 element but by the end of body tag, so our last match will include that tag and everything that can follow the body. The solution is to use another expression to get just the body first:

"<body>([\s\S]*)</body>"

This returns just one match with one sub-match, and the sub-match is everything between the body and end body tags. Now we can use our original expression on this new string and it should work.
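The two-step approach described above can be sketched as follows. This is one possible arrangement, not necessarily the one used by the application mentioned earlier: the function name ExtractSections and the choice of a Collection of two-element arrays are assumptions, and the literal patterns only match body and h1 tags written without attributes.

```vb
' Hypothetical helper: returns a Collection in which each item is
' Array(HeadingText, TextUpToNextHeading).
Public Function ExtractSections(ByVal Html As String) As Collection
    Dim Re As Object
    Dim Matches As Object
    Dim Body As String
    Dim i As Long
    Dim StartPos As Long
    Dim EndPos As Long
    Dim Result As New Collection

    Set ExtractSections = Result
    Set Re = CreateObject("VBScript.RegExp")

    ' Step 1: keep only the text between the body tags
    Re.Pattern = "<body>([\s\S]*)</body>"
    Set Matches = Re.Execute(Html)
    If Matches.Count = 0 Then Exit Function
    Body = Matches(0).SubMatches(0)

    ' Step 2: find every h1 element inside the body
    Re.Pattern = "<h1>\s*([\s\S]*?)\s*</h1>"
    Re.Global = True
    Set Matches = Re.Execute(Body)

    For i = 0 To Matches.Count - 1
        ' FirstIndex is zero-based, whereas Mid$ counts from one
        StartPos = Matches(i).FirstIndex + Matches(i).Length + 1
        If i < Matches.Count - 1 Then
            EndPos = Matches(i + 1).FirstIndex   ' last character before the next h1
        Else
            EndPos = Len(Body)                   ' last section runs to the end of the body
        End If
        Result.Add Array(Matches(i).SubMatches(0), _
                         Mid$(Body, StartPos, EndPos - StartPos + 1))
    Next i
End Function
```

Each item can then be read as, for example, Sections(1)(0) for the first heading and Sections(1)(1) for its text, where Sections is the returned Collection.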

Now that you have seen an example here is a detailed description of the expressions used and the property settings of the Regular Expression object used.

A regular expression is simply a string of characters, but some characters have special meanings. In this expression:

"<body>([\s\S]*)</body>"

there are three principal parts:

"<body>" "([\s\S]*)" "</body>"

Each of these parts is also a regular expression. The first and last are simple strings with no meaning beyond the identity of the characters; they will match any string that includes them as a substring.

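The middle expression is rather more obscure. The character class [\s\S] matches any character that is either whitespace or not whitespace, that is, any character at all, including the line breaks that the dot does not match. Followed by *, it therefore matches absolutely any sequence of characters. The round brackets surrounding the expression indicate capturing: the text that is captured is returned as one of the SubMatches of a match.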
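A minimal sketch of capturing, which also compares [\s\S]* with .* on text that spans two lines. The function name FirstCapture and the sample page are illustrative assumptions.

```vb
' Hypothetical helper: returns the first captured group of the first match,
' or an empty string when the pattern does not match.
Public Function FirstCapture(ByVal PatternText As String, _
                             ByVal Source As String) As String
    Dim Re As Object
    Dim Matches As Object
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = PatternText
    Set Matches = Re.Execute(Source)
    If Matches.Count > 0 Then FirstCapture = Matches(0).SubMatches(0)
End Function

Public Sub DemoCapture()
    Dim Page As String
    Page = "<body>line one" & vbCrLf & "line two</body>"

    ' [\s\S]* crosses the line break, so both lines are captured
    Debug.Print Len(FirstCapture("<body>([\s\S]*)</body>", Page))   ' 18

    ' .* stops at the line break, so the pattern fails to match
    Debug.Print Len(FirstCapture("<body>(.*)</body>", Page))        ' 0
End Sub
```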

In the case studies section of this book there is a simple application that you can use to test regular expressions: Regular Expression Tester.