Parrot Virtual Machine/Squaak Tutorial/Poking in Compiler Guts

In the first episode, we introduced the Parrot Compiler Tools, and generated a very simple language using a shell script provided with the Parrot distribution. We also announced Squaak, a simple programming language developed specially for this tutorial. Squaak will be the case study to show how PCT can be used as a very effective set of tools to implement a language for Parrot. A list of features of Squaak was specified. If you felt lucky, you might even have tried to do the exercise at the end of the previous episode.

In this episode, we will take a closer look at the generated compiler. We shall check out the different stages of the compilation process, and show what's going on in PCT-based compilers.

Under The Hood
Remember how we invoked our compiler in the previous episode? We can pass a file, or invoke the compiler without a command line argument, in which case our compiler enters the interactive mode. Consider the first case, passing the file test.sq, just as we did before:

$ parrot squaak.pbc test.sq

When invoking our compiler like this, the file test.sq is compiled and the generated code (bytecode) is executed immediately by Parrot. How does this work, you might wonder. The interpretation of a script is done through a series of transformations, starting at the script source and ending in a format that can be executed by Parrot. Compilers built with the PCT (based on the HLL::Compiler class) can take a target option, to show one of the intermediate representations. This option can have the following values, corresponding to the four default compilation phases of an HLLCompiler object:


 * --target=parse
 * --target=past
 * --target=post
 * --target=pir

This is an example of using the target option set to "parse", which will print the parse tree of the input to stdout:

$ parrot squaak.pbc --target=parse test.sq

In interactive mode, giving this input:

say 42;

will print this parse tree (without the line numbers):

1 "parse" => PMC 'Regex;Match' => "say 42;\n" @ 0 { 2    => PMC 'Regex;Match' => "say 42;\n" @ 0 { 3        => ResizablePMCArray (size:1) [ 4            PMC 'Regex;Match' => "say 42" @ 0 { 5                 => PMC 'Regex;Match' => "say 42" @ 0 { 6                     => ResizablePMCArray (size:1) [ 7                        PMC 'Regex;Match' => "42" @ 4 { 8                            => PMC 'Regex;Match' => "42" @ 4 { 9                                => PMC 'Regex;Match' => "42" @ 4 10                                 => \parse[0][0] 11                            }    12                         }    13                     ]    14                     => PMC 'Regex;Match' => "say" @ 0 15                }    16             }    17         ]    18     }    19 }

When changing the value of the target option, the output changes into a different representation of the input. Why don't you try that right now?

So, a HLL::Compiler object has four compilation phases: parsing, construction of a Parrot Abstract Syntax Tree (PAST), construction of a Parrot Opcode Syntax Tree (POST), generation of Parrot Intermediate Representation (PIR). After compilation, the generated PIR is executed immediately.

If your compiler needs additional stages, you can add them to your HLL::Compiler object. For Squaak, we will not need this, but for details, check out compilers/pct/src/PCT/HLLCompiler.pir.

We shall now discuss each compilation phase in more detail. The first two phases, parsing the input and construction of the PAST are executed simultaneously. Therefore, these are discussed together.

Parse phase: match objects and PAST construction
During the parsing phase, the input is analyzed using Perl 6's extended regular expressions, known as Rules (see Synopsis 5 for details). When a rule matches some input string, a so-called Match object is created. A Match object is a combined array and hashtable, implying it can be indexed by integers as well as strings. As rules typically consist of other (sub) rules, it is easy to retrieve a certain part of the match. For instance, this rule:

rule if_statement { 'if' 'then' 'end' {*}   }

has two other subrules: expression and statement. The match object for the rule if_statement represents the whole string from if to end. When you're interested only in the expression or statement part, you can retrieve that by indexing the match object by the name of the subrule (in this case, expression and statement, respectively).

During the parse phase, the PAST is constructed. There is a small set of PAST node types, for instance, PAST::Var to represent variables (identifiers, such as print), PAST::Val to represent literal values (for instance, "hello" and 42), and so on. Later we shall discuss the various PAST nodes in more detail.

Now, you might wonder, at which point exactly is this PAST construction happening? This is where the special {*} symbol comes in, just below the string 'if' in the if_statement rule shown above. These special markers indicate that a parse action should be invoked. Such a parse action is just a method that has the same name as the rule in which it is written (in this case: if_statement). So, during the parsing phase, several parse actions are executed, each of which builds a piece of the total PAST representing the input string. More on this will be explained later.

The Parrot Abstract Syntax Tree is just a different representation of the same input string (your program being compiled). It is a convenient data structure to transform into something different (such as executable Parrot code) but also to do all sorts of analysis, such as compile-time type checking.

PAST to POST
After the parse phase during which the PAST is constructed, the  transforms this PAST into something called a Parrot Opcode Syntax Tree (POST). The POST representation is also a tree structure, but these nodes are on a lower abstraction level. For instance, on the PAST level there is a node type to represent a  statement (constructed as   ). The template for a while statement typically consists of a number of labels and jump instructions. On the POST level, the same while statement is represented by a set of nodes, each representing a one instruction or a label. Therefore, it is much easier to transform a POST into something executable than when this is done from the PAST level.

Usually, as a user of the PCT, you don't need to know details of POST nodes, which is why this will not be discussed in further detail. Use the target option to see what a POST looks like.

POST to PIR
In the fourth (and final) stage, the POST is transformed into Parrot Intermediate Representation (PIR). As mentioned, transforming a POST into something executable is rather straightforward, as POST nodes already represent individual instructions and labels. Again, normal usage of the PCT does not require you to know any details about this transformation.

And now for the good news...
We established the general data flow of PCT-based compilers, which consists of four stages:


 * 1) source to parse tree
 * 2) parse tree to PAST
 * 3) PAST to POST
 * 4) POST to PIR

where we noted that the first two are done during the parse stage. Now, as you're reading this tutorial, you're probably interested in using the PCT for implementing Your Favorite Language for Parrot. We already saw that a language grammar is expressed in Perl 6 Rules. What about the other transformations?

Well, earlier in this episode we mentioned the term parse actions, and that these actions create PAST nodes. After you have written a parse action for each grammar rule, you're done!

Say what?
That's right. Once you have correctly constructed a PAST, your compiler can generate executable PIR, which means you just implemented your first language for Parrot. Of course, you still need to implement any language specific libraries, but that's besides the point.

PCT-based compilers already know how to transform a PAST into a POST, and how to transform a POST into PIR. These transformation stages are already provided by the PCT.

What's next?
In this episode we took a closer look at the internals of a PCT-based compiler. We discussed the four compilation stages, that transform an input string (a program, or script, depending on your definition) into a PAST, a POST and finally executable PIR. The next episodes is where the Fun Stuff is: we will be implementing Squaak for Parrot. Piece by piece, we will implement the parser and the parse actions. Finally, we shall demonstrate John Conway's "Game of Life" running on Parrot, implemented in Squaak.

Exercises
Starting in the next episode, the exercises will be more interesting. For now, it would be useful to browse around through the source files of the compiler, and see if you understand the relation between the grammar rules in Grammar.pg and the methods in Actions.pm.

It's also useful to experiment with the --target option described in this episode. If you don't know PIR, now is the time to do some preparation for that. There's sufficient information to be found on PIR, see the References section for details. In the mean time, if you have any suggestions, questions and whatnot, don't hesitate to leave a comment.