F Sharp Programming/Lexing and Parsing

Lexing and parsing are a handy way to convert source code (or other human-readable input which has a well-defined syntax) into an abstract syntax tree (AST) which represents that source code. F# comes with two tools, FsLex and FsYacc, which are used to convert input into an AST.

FsLex and FsYacc have more or less the same specification as OCamlLex and OCamlYacc, which in turn are based on the Lex and Yacc family of lexer/parser generators. Virtually all material concerned with OCamlLex/OCamlYacc can transfer seamlessly over to FsLex/FsYacc. With that in mind, SooHyoung Oh's OCamlYacc tutorial and companion OCamlLex Tutorial are the single best online resources to learn how to use the lexing and parsing tools which come with F# (and OCaml for that matter!).

Lexing and Parsing from a High-Level View
Transforming input into a well-defined abstract syntax tree requires (at minimum) two transformations:
 * 1) A lexer uses regular expressions to convert each syntactical element from the input into a token, essentially mapping the input to a stream of tokens.
 * 2) A parser reads in a stream of tokens and attempts to match tokens to a set of rules, where the end result maps the token stream to an abstract syntax tree.
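The two stages can be sketched by hand in plain F#, without FsLex or FsYacc. This is only an illustration under simplifying assumptions: the `Token` type, `tokenize`, and `parseList` are hypothetical names, and the "grammar" is just a comma-separated list of identifiers.

```fsharp
open System.Text.RegularExpressions

// Hypothetical token type, used only in this sketch.
type Token =
    | Id of string
    | Comma
    | Eof

// Stage 1 (lexer): a regular expression picks out each syntactic element
// and maps it to a token, ending with an explicit end-of-input token.
// Characters that match neither alternative (such as spaces) are skipped.
let tokenize (input : string) : Token list =
    let matched =
        Regex.Matches(input, @"[A-Za-z_][A-Za-z0-9_]*|,")
        |> Seq.cast<Match>
        |> Seq.map (fun m -> if m.Value = "," then Comma else Id m.Value)
        |> List.ofSeq
    matched @ [ Eof ]

// Stage 2 (parser): match the token stream against the recursive rule
//     list := Id Eof | Id Comma list
// and build the result (here simply a list of names).
let rec parseList (tokens : Token list) : string list =
    match tokens with
    | [ Id name; Eof ] -> [ name ]
    | Id name :: Comma :: rest -> name :: parseList rest
    | _ -> failwith "syntax error"

let columns = "x, y, z" |> tokenize |> parseList
printfn "%A" columns
```

The parser never looks at characters, only at tokens, which is what keeps the two stages independent of each other.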

It is certainly possible to write a lexer which generates the abstract syntax tree directly, but this only works for the simplest grammars. If a grammar contains balanced parentheses or other recursive constructs, optional tokens, repeating groups of tokens, operator precedence, or anything which can't be captured by regular expressions, then it is easiest to write a parser in addition to a lexer.

With F#, it is possible to write custom file formats, domain specific languages, and even full-blown compilers for your new language.

Extended Example: Parsing SQL
The following code will demonstrate step by step how to define a simple lexer/parser for a subset of SQL. If you're using Visual Studio, you should add a reference to the F# PowerPack assembly to your project. If you're compiling on the command line, use the compiler's reference flag to reference the same assembly.

Step 1: Define the Abstract Syntax Tree
We want to parse the following SQL statement:

This is a pretty simple query, and while it doesn't demonstrate everything you can do with the language, it's a good start. We can model everything in this query using the following definitions in F#:

A record type neatly groups all of our related values into a single object. When we finish our parser, it should be able to convert a string into an object of this record type.
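As a sketch of what such a model might look like, here is one possible set of definitions. Every type, case, and field name below is an assumption made for illustration, as is the small query in the comment.

```fsharp
// All type, case, and field names below are assumptions made for this sketch.
type Value =
    | Int of int
    | Float of float
    | Name of string                          // an identifier such as a column name

type Direction = Asc | Desc
type Operator = Eq | Gt | Ge | Lt | Le        // =, >, >=, <, <=

type Order = string * Direction               // column name, sort direction

type Where =
    | Cond of Value * Operator * Value
    | And of Where * Where
    | Or of Where * Where

type JoinType = Inner | Left | Right

type Join = string * JoinType * Where option  // table, join kind, optional ON clause

type SqlStatement =
    { Table : string
      Columns : string list
      Joins : Join list
      Where : Where option
      OrderBy : Order list }

// A hand-built value for a hypothetical query:
//   SELECT x, y FROM t1 LEFT JOIN t2 WHERE x = 50 ORDER BY x ASC
let example =
    { Table = "t1"
      Columns = [ "x"; "y" ]
      Joins = [ ("t2", Left, None) ]
      Where = Some (Cond (Name "x", Eq, Int 50))
      OrderBy = [ ("x", Asc) ] }

printfn "%A" example
```

Option types model the clauses that may be absent, and lists model the clauses that may repeat, so the shape of the record mirrors the shape of the grammar.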

Step 2: Define the parser tokens
A token is any single identifiable element in a grammar. Let's look at the string we're trying to parse:

So far, we have several keywords (by convention, all keywords are uppercase).

There are also a few comparison operators.

We also have non-keyword identifiers, made up of strings, and numeric literals, each of which is represented by its own token.

Finally, there is one more token, which indicates the end of our input stream.

Now we can create a basic parser file for FsYacc, name the file SqlParser.fsp:

This is boilerplate code with the section for tokens filled in.

Compile the parser by running FsYacc on SqlParser.fsp from the command line.


 * Tips:
 * If you haven't done so already, you should add the F# bin directory to your PATH environment variable.
 * If you're using Visual Studio, you can automatically generate your parser code on each compile. Right-click on your project file and choose "Properties". Navigate to the Build Events tab and add the FsYacc command to the 'Pre-build event command line'. Also remember to exclude the .fsp file from the build process: right-click the file, choose "Properties" and select "None" against "Build Action".

If everything works, FsYacc will generate two files, SqlParser.fsi and SqlParser.fs. You'll need to add these files to your project if they don't already exist. If you open the SqlParser.fsi file, you'll notice the tokens you defined in your .fsp file have been converted into a union type.
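The generated union looks roughly like the following sketch. The token names here are assumptions chosen for illustration; the real type is produced by FsYacc from the token declarations in SqlParser.fsp.

```fsharp
// Hypothetical token names; the real union is generated by FsYacc from the
// %token declarations in the parser file.
type token =
    | EOF
    | SELECT | FROM | WHERE
    | COMMA
    | ID of string      // from a declaration such as: %token <string> ID
    | INT of int        // from a declaration such as: %token <int> INT

// Tokens that carry a value are built like any other union case.
let sample = [ SELECT; ID "x"; COMMA; ID "y"; FROM; ID "t1"; EOF ]
printfn "%A" sample
```

Because tokens are an ordinary union type, the lexer and any test code can construct and compare them directly.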

Step 3: Defining the lexer rules
Lexers convert text inputs into a stream of tokens. We can start with the following boilerplate code:

This is not "real" F# code, but rather a special language used by FsLex.

The bindings at the top of the file define regular expression macros. A special end-of-file marker identifies the end of a string buffer input.

The final part of the file defines our lexing function. It consists of a series of rules, each of which has two pieces: 1) a regular expression, and 2) an expression to evaluate if the regex matches, such as returning a token. Text is read from the input one character at a time until it matches a regular expression and returns a token.
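To make the idea of "a regular expression plus an action" concrete, here is a rough hand-written equivalent in plain F#. It is not what FsLex generates; the `Token` type, the `rules` table, and `tokenize` are hypothetical, and the sketch assumes the common lex-style convention of keeping the longest match at each position.

```fsharp
open System.Text.RegularExpressions

// Hypothetical token type for this sketch.
type Token =
    | Id of string
    | Int of int
    | Eof

// Each rule pairs a regular expression with an action that is run on the
// matched text. An action returns None when the text should be skipped.
// \G anchors every pattern at the position where matching starts.
let rules : (Regex * (string -> Token option)) list =
    [ (Regex @"\G[ \t\r\n]+", (fun _ -> None))                // whitespace: skip
      (Regex @"\G[A-Za-z]+",  (fun s -> Some (Id s)))
      (Regex @"\G-?[0-9]+",   (fun s -> Some (Int (int s)))) ]

let tokenize (input : string) : Token list =
    let rec loop pos acc =
        if pos >= input.Length then
            List.rev (Eof :: acc)
        else
            // Try every rule at the current position and keep the longest match.
            let candidates =
                rules
                |> List.map (fun (regex, action) -> regex.Match(input, pos), action)
                |> List.filter (fun (m, _) -> m.Success && m.Length > 0)
            match candidates with
            | [] -> failwithf "unexpected character at position %d" pos
            | _ ->
                let (m, action) = candidates |> List.maxBy (fun (m, _) -> m.Length)
                match action m.Value with
                | Some token -> loop (pos + m.Length) (token :: acc)
                | None -> loop (pos + m.Length) acc
    loop 0 []

printfn "%A" (tokenize "x 42 -7")
```

The whitespace rule shows the other common kind of action: instead of returning a token, it simply continues lexing from the next position.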

We can fill in the remainder of our lexer by adding more matching expressions:

Notice the code between the curly braces consists of plain old F# code. Also notice we are returning the same tokens that we defined in SqlParser.fsp. As you can probably infer, the code lexeme lexbuf returns the string our lexer matched. The lexing function will be converted into an F# function whose return type is the token type generated from the parser file.

We can fill in the rest of the lexer rules fairly easily:

Notice we've created a few maps, one for keywords and one for operators. While we certainly can define these as rules in our lexer, it's generally recommended to have a very small number of rules to avoid a "state explosion".
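A minimal sketch of the map-based approach, in plain F#, might look like this. The token type, the map entries, and the `classify` helper are assumptions for illustration; in the tutorial the tokens come from SqlParser.fsp.

```fsharp
// Hypothetical token type; in the tutorial these tokens come from SqlParser.fsp.
type Token =
    | SELECT | FROM | WHERE
    | EQ | LT | GT
    | ID of string

// One map for keywords and one for operators (entries are illustrative).
let keywords =
    [ "SELECT", SELECT
      "FROM", FROM
      "WHERE", WHERE ] |> Map.ofList

let ops =
    [ "=", EQ
      "<", LT
      ">", GT ] |> Map.ofList

// A single identifier rule can then decide between a keyword and a plain ID.
let classify (text : string) : Token =
    match Map.tryFind text keywords with
    | Some keyword -> keyword
    | None -> ID text

printfn "%A" (classify "SELECT")
printfn "%A" (classify "users")
```

With this arrangement the lexer needs only one rule for identifier-like text and one for operator-like text, and adding a keyword means adding a map entry rather than a new rule.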

To compile this lexer, run FsLex on SqlLexer.fsl from the command line. (Try adding this command to your project's Build Events as well.) Then, add the file SqlLexer.fs to the project. We can experiment with the lexer now with some sample input:

This program will print out a list of tokens matched by the string above.

Step 4: Define the parser rules
A parser converts a stream of tokens into an abstract syntax tree. We can modify our boilerplate parser as follows (will not compile):

Let's examine the top-level rule. You can immediately see that we have a list of tokens which gives a rough outline of a select statement. In addition to that, you can see the F# code contained between curly braces, which will be executed when the rule successfully matches; in this case, it's returning an instance of our record type.

The F# code contains "$1", "$2", "$3", etc., which vaguely resembles regex replace syntax. Each "$#" corresponds to the index (starting at 1) of the token in our matching rule. The indexes become obvious when they're annotated as follows:

So, the top-level rule breaks our tokens into a basic shape, which we then use to map to our record. You're probably wondering where the other names in the rule come from: these are not tokens, but rather additional parse rules which we'll have to define. Let's start with the first rule:

The columnList rule matches a comma-separated list of identifiers and returns the results as a list. Notice this rule is defined recursively (also notice the order of rules is not significant). FsYacc's match algorithm is "greedy", meaning it will try to match as many tokens as possible. When FsYacc receives an identifier token, it will match the first rule, but it also matches part of the second rule as well. FsYacc then performs a one-token lookahead: if the next token is a comma, then it will attempt to match additional tokens until the full rule can be satisfied.


 * Note: The definition of columnList above is not tail recursive, so it may throw a stack overflow exception for exceptionally large inputs. A tail recursive version of this rule can be defined as follows:


 * The tail-recursive version creates the list backwards, so we have to reverse when we return our final output from the parser.
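The difference between the two versions can be mimicked in plain F#. This is an analogy rather than FsYacc output: the rule shapes in the comments, and the token names ID and COMMA, are assumptions about how the two definitions of columnList might be written.

```fsharp
// Right-recursive shape:  columnList: ID | ID COMMA columnList   (action: $1 :: $3)
// The list comes out in source order, but nothing can be completed until the
// last ID has been read, so pending work piles up for long inputs.
let rec buildRight (names : string list) : string list =
    match names with
    | [] -> []
    | name :: rest -> name :: buildRight rest     // not a tail call

// Left-recursive shape:   columnList: ID | columnList COMMA ID   (action: $3 :: $1)
// Each new ID is consed onto what has been built so far, so the result is
// reversed and needs one List.rev at the end.
let buildLeft (names : string list) : string list =
    names |> List.fold (fun soFar name -> name :: soFar) []

let names = [ "x"; "y"; "z" ]
printfn "%A" (buildRight names)
printfn "%A" (buildLeft names)
printfn "%A" (List.rev (buildLeft names))
```

Reversing once at the top-level rule is cheap, which is why the accumulate-then-reverse form is the usual choice for long lists.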

We can treat the JOIN clause in the same way; however, it's a little more complicated:

The JOIN clause is defined in terms of several rules. This is because there are repeating groups of tokens (such as multiple tables being joined) and optional tokens (the optional "ON" clause). You've already seen that we handle repeating groups of tokens using recursive rules. To handle optional tokens, we simply break the optional syntax into a separate rule, and create an empty rule to represent 0 tokens.
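The same strategy can be sketched as a hand-written parser in plain F#. The token type, the `Join` record, and both functions are hypothetical, and the ON clause is simplified to a single identifier to keep the example short.

```fsharp
// Hypothetical tokens and result type for this sketch.
type Token =
    | JOIN
    | ON
    | ID of string

type Join = { Table : string; On : string option }

// Optional syntax gets its own function with an "empty" case that consumes
// nothing, mirroring an empty grammar rule.
let parseOn (tokens : Token list) : string option * Token list =
    match tokens with
    | ON :: ID column :: rest -> Some column, rest
    | _ -> None, tokens                          // empty rule: zero tokens

// Repeating groups are handled by recursion: zero or more JOIN clauses.
let rec parseJoins (tokens : Token list) : Join list * Token list =
    match tokens with
    | JOIN :: ID table :: rest ->
        let onClause, afterOn = parseOn rest
        let more, remaining = parseJoins afterOn
        { Table = table; On = onClause } :: more, remaining
    | _ -> [], tokens                            // empty rule: no more joins

let joins, leftover = parseJoins [ JOIN; ID "t2"; JOIN; ID "t3"; ON; ID "id" ]
printfn "%A" joins
```

Each function returns the unconsumed tokens alongside its result, which plays the role that the parser's internal stack plays in FsYacc.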

With this strategy in mind, we can write the remaining parser rules:

Step 5: Piecing Everything Together
Here is the complete code for our lexer/parser:

SqlParser.fsp

SqlLexer.fsl

Program.fs

Altogether, our minimal SQL lexer/parser is about 150 lines of code (including non-trivial lines of code and whitespace). I'll leave it as an exercise for the reader to implement the remainder of the SQL language spec.

2011-03-06: I tried the above instructions with VS2010 and F# 2.0 and PowerPack 2.0. I had to make a few changes:
 * Add "module SqlLexer" on the 2nd line of SqlLexer.fsl
 * Change Map.of_list to Map.ofList
 * Add " --module SqlParser" to the command line of fsyacc
 * Add FSharp.PowerPack to get Lexing module

2011-07-06: (Sjuul Janssen) These were the steps I had to take in order to make this work.

If you get the message "Expecting a LexBuffer<char> but given a LexBuffer<byte> The type 'char' does not match the type 'byte'":
 * Add "fslex.exe "$(ProjectDir)SqlLexer.fsl" --unicode" to the pre-build
 * in program.fs change "let lexbuf = Lexing.from_string x" to "let lexbuf = Lexing.LexBuffer<_>.FromString x"
 * in SqlLexer.fsi change "lexeme lexbuf" to "LexBuffer<_>.LexemeString lexbuf"

If you get the message that some module doesn't exist or that some module is declared multiple times, make sure that in the solution explorer the files come in this order:
 * Sql.fs
 * SqlParser.fsp
 * SqlLexer.fsl
 * SqlParser.fsi
 * SqlParser.fs
 * SqlLexer.fs
 * Program.fs

If you get the message "Method not found: 'System.Object Microsoft.FSharp.Text.Parsing.Tables`1.Interpret(Microsoft.FSharp.Core.FSharpFunc`2,!0>, ...", go to http://www.microsoft.com/download/en/details.aspx?id=15834 and reinstall Visual Studio 2010 F# 2.0 Runtime SP1 (choose the repair option).

2011-07-06: (mkduffi) Could someone please provide a sample project. I have followed all of your changes but still can not build. Thanks.

Sample
https://github.com/obeleh/FsYacc-Example

2011-07-07 (mkduffi) Thanks for posting the sample. Here is what I did to the Program.fs file:

I added a C# console project for testing and this is what is in the Program.cs file:

I had to add the YaccSample project reference as well as a reference to the FSharp.Core assembly to get this to work.

If anyone could help me figure out how to support table aliases that would be awesome.

Thanks

2011-07-08 (Sjuul Janssen) Contact me through my github account. I'm working on this and some other stuff.