Case Study 1B - C front-end (Lex and Yacc)
The purpose of this case study is to give an example of a compiler/interpreter front-end written in C using Lex and Yacc. An interpreter is used since it allows a working program to be created with minimal extra effort (after the construction of the front-end). This code could be developed into a compiler by replacing the last phase with a compiler back-end.

The code is presented in the order in which a compiler is built, so the same file appears several times as it is developed.

The case study develops an interpreter for Very Tiny Basic, which is specified in the Case Study 1 section of the book.

The following steps shall be taken to complete the interpreter:
 * Lexical Analysis
 * Syntax Analysis
 * Lexical Analysis with Semantics
 * Abstract Syntax Trees
 * Generating Abstract Syntax Trees
 * Interpreting

Many important features of a useful compiler/interpreter have been left out, for brevity and simplicity, including:
 * Dealing with Errors
 * Optimization

Some processes also do not apply to a language as simple as Very Tiny Basic. Name tables, for example, are unnecessary because every variable is named by a single character.

Requirements
 * A C compiler
 * Lex or equivalent (flex).
 * Yacc or equivalent (bison).
 * (preferred) The Make utility.

Lexical Analysis
The first draft of the lex file identifies each kind of token by returning a distinct value from the associated C action. This is simple for keywords and operators; identifiers, values and comments are trickier.

VTB.l - Version 1

%{

%}

DIGIT [0-9]
LETTER [A-Z]

%%
"LET"          {return LET;}
"IF"           {return IF;}
"THEN"         {return THEN;}
"PRINT"        {return PRINT;}
"REM"[^\n]*    {return REM;}
"GOTO"         {return GOTO;}
"GOSUB"        {return GOSUB;}
"RETURN"       {return RETURN;}
"STOP"         {return STOP;}
"FOR"          {return FOR;}
"TO"           {return TO;}
"STEP"         {return STEP;}
"NEXT"         {return NEXT;}
\"[^\"]*\"     {return STRING;}
"="            {return EQUAL;}
"<>"           {return NEQUAL;}
%%

This file can be processed using one of the following commands:

lex VTB.l
flex VTB.l

You may wonder where all those returned values come from. Yacc will create them when it processes the grammar file.

There are some differences from the Very Tiny Basic - Specification in Case Study 1, for instance MUL and DIV in place of MULDIV. This is because later stages need to tell multiplication and division apart. LineNumber and WholeNumber are lexically identical, and so cannot be separated at this time. Defining the use and category of tokens is left until the next stage.

Syntax Analysis
VTB.y - Version 1

%{
int yylex(void);

void yyerror(char* str)
{
  /*This should handle errors!*/
}

%}

%union {
  int number;
  char *string;
  char var;
}

%token NUMBER
%token STRING
%token VAR_NUM
%token VAR_STR
%token LET IF THEN PRINT REM GOTO GOSUB RETURN STOP FOR EQUAL TO STEP NEXT SEPARATE NEQUAL LT GT LTEQ GTEQ ADD SUB MUL DIV LPARAN RPARAN

%start lines

%%

lines: line lines
     | /*Nothing*/
     ;

line: NUMBER statement
    ;

statement: LET variable EQUAL exp
         | IF test THEN NUMBER
         | PRINT printItem printList
         | REM
         | GOTO NUMBER
         | GOSUB NUMBER
         | RETURN
         | STOP
         | FOR variable EQUAL exp TO exp STEP exp
         | FOR variable EQUAL exp TO exp
         | NEXT variable
         ;

variable: VAR_NUM
        | VAR_STR
        ;

printList: SEPARATE printItem printList
         | /* Nothing */
         ;

printItem: exp
/*       | STRING */
         ;

test: exp comparison exp
    ;

comparison: EQUAL
          | NEQUAL
          | LT
          | GT
          | LTEQ
          | GTEQ
          ;

exp: addsub secondary secList
   | secondary secList
   ;

secList: addsub secondary secList
       | /* Nothing */
       ;

addsub : ADD
       | SUB
       ;

secondary : primary priList
          ;

priList : muldiv primary priList
        | /* Nothing */
        ;

muldiv : MUL
       | DIV
       ;

primary : LPARAN exp RPARAN
        | variable
        | NUMBER
        | STRING
        ;

Lexical Analysis with Semantics
In this version of the lexer, the header file generated by yacc/bison is included. The header file defines the return values and the union that is used to store the semantic values of tokens. Both were created from the %token declarations and the %union part of the grammar file. The lexer now extracts values from some types of tokens and stores them in the yylval union.

VTB.l - Version 2

%{
#include "y.tab.h"
#include <stdlib.h>
#include <string.h>
%}

DIGIT [0-9]
LETTER [A-Z]

%%
"LET"          {return LET;}
"IF"           {return IF;}
"THEN"         {return THEN;}
"PRINT"        {return PRINT;}
"REM"[^\n]*    {return REM;}
"GOTO"         {return GOTO;}
"GOSUB"        {return GOSUB;}
"RETURN"       {return RETURN;}
"STOP"         {return STOP;}
"FOR"          {return FOR;}
"TO"           {return TO;}
"STEP"         {return STEP;}
"NEXT"         {return NEXT;}
\"[^\"]*\"     {yylval.string=malloc(strlen(yytext)+1);
                strcpy(yylval.string,yytext); return STRING;}
"="            {return EQUAL;}
"<>"           {return NEQUAL;}
"<"            {return LT;}
">"            {return GT;}
"<="           {return LTEQ;}
">="           {return GTEQ;}
{LETTER}       {yylval.var=yytext[0]; return VAR_NUM;}
{LETTER}"$"    {yylval.var=yytext[0]; return VAR_STR;}
{DIGIT}+       {yylval.number=atoi(yytext); return NUMBER;}
","            |
";"            {return SEPARATE;}
"+"            {return ADD;}
"-"            {return SUB;}
"*"            {return MUL;}
"/"            {return DIV;}
"("            {return LPARAN;}
")"            {return RPARAN;}
.              {/*This is an error!*/}
%%

Abstract Syntax Trees
An abstract syntax tree is an intermediate representation of the code, built in memory out of data structures. The grammatical structure of the language, which has already been defined and written down as a Yacc grammar file, is translated into a tree structure. Using Yacc and C, this means that most grammar rules and tokens become nodes. Comments are an example of what is not put in the tree.

The grouping of operands is clear within the structure, so tokens such as parentheses do not have to be present in the tree. The same applies to tokens which end blocks of code. This is possible because the rules in the grammar file can use these tokens to create the tree in the correct shape.

Illustration of grouping in abstract syntax trees

(1+3)*4        1+3*4

    *            +
   / \          / \
  4   +        1   *
     / \          / \
    1   3        3   4

In this interpreter the Primary/Secondary expression structures could be discarded by collapsing them. This would add complexity to the code, so it is currently not implemented. Rem statements are also kept (without the comment text) since the definition of VTB implies that they are a valid target for Goto. In fact, a goto to a non-existent line is undefined, so this interpreter will issue an error.

absyn.h
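Pointers and dynamic allocation are a must for the tree structure: a node type that held its children by value would contain itself, and so would be infinite in size (think about it). In order to keep the code nice and neat, each pointer type is given a typedef and a constructor function is declared for every kind of node.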

/*
 * Abstract Syntax Trees for Very Tiny Basic
 */

#ifndef ABSYN_H
#define ABSYN_H

/* The enums are defined before the typedefs that name them. */
enum comparison_ {equal, nequal, lt, gt, lteq, gteq};
enum muldiv_ {mul, divd};
enum addsub_ {add, sub};

typedef struct statement_ *statement;
typedef struct exp_ *exp;
typedef struct variable_ *variable;
typedef exp printItem; /*Not required since it is always an expression.*/
typedef struct printList_ *printList;
typedef struct test_ *test;
typedef enum comparison_ comparison;
typedef struct secondary_ *secondary;
typedef struct secList_ *secList;
typedef struct primary_ *primary;
typedef struct priList_ *priList;
typedef enum muldiv_ muldiv;
typedef enum addsub_ addsub;

struct statement_ {
  enum { let, ifThen, print, rem, goto_, gosub,
         return_, stop, for_step, for_, next } type;
  union {
    struct {variable var; exp e;} let_val;
    struct {test t; int target;} ifThen_val;
    struct {printItem item; printList lst;} print_val;
    /* REM has no value */
    int goto_val;
    int gosub_val;
    /* RETURN has no value */
    /* STOP has no value */
    struct {variable var; exp assign; exp to; exp step;} for_val; /* Use for both.*/
    variable next_val;
  } u;
};

struct exp_ {
  int fHasAddSub; /* Flag */
  addsub as;
  secondary sec;
  secList list;
};

struct variable_ {
  enum { stringV, integerV } type;
  char v;
};

struct printList_ {
  printItem item;
  printList next;
};

struct test_ {
  exp a, b;
  comparison c;
};

struct secondary_ {
  primary p;
  priList list;
};

struct secList_ {
  addsub as;
  secondary s;
  secList next;
};

struct primary_ {
  enum { bracket, var, integerP, stringP } type;
  union {
    exp bracket_val;
    variable var_val;
    int integer_val;
    char *string_val;
  } u;
};

struct priList_ {
  muldiv md;
  primary p;
  priList next;
};

statement LetStatement(variable var_, exp e_);
statement IfThenStatement(test t_, int target_);
statement PrintStatement(printItem item_, printList lst_);
statement RemStatement(void);
statement GotoStatement(int line);
statement GosubStatement(int line);
statement ReturnStatement(void);
statement StopStatement(void);
statement ForStatement(variable var_, exp assign_, exp to_, exp step_); /* Step can be NULL */
statement NextStatement(variable var);
exp AddSubExp(addsub as_, secondary sec_, secList list_);
exp PlainExp(secondary sec_, secList list_);
variable StringVariable(char v_);
variable IntegerVariable(char v_);
printItem NewPrintItem(exp e_);
printList NewPrintList(printItem item_, printList next_);
test NewTest(exp a_, exp b_, comparison c_);
secondary NewSecondary(primary p_, priList list_);
secList NewSecList(addsub as_, secondary s_, secList next_);
primary BracketPrimary(exp e_);
primary VarPrimary (variable v_);
primary IntegerPrimary(int n_);
primary StringPrimary(char *str_);
priList NewPriList(muldiv md_, primary p_, priList next_);

#endif /* ABSYN_H */

absyn.c
The purpose of absyn.c is to provide initializing routines for all of the structures that are used in the abstract syntax trees. This code is not very interesting, and it is very repetitive.

/*
 * Abstract Syntax Trees for Very Tiny Basic
 */

#include <stdlib.h>
#include "absyn.h"

#define safe_malloc malloc

statement LetStatement(variable var_, exp e_)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = let;
  ret->u.let_val.var = var_;
  ret->u.let_val.e = e_;
  return ret;
}

statement IfThenStatement(test t_, int target_)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = ifThen;
  ret->u.ifThen_val.t = t_;
  ret->u.ifThen_val.target = target_;
  return ret;
}

statement PrintStatement(printItem item_, printList lst_)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = print;
  ret->u.print_val.item = item_;
  ret->u.print_val.lst = lst_;
  return ret;
}

statement RemStatement(void)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = rem;
  return ret;
}

statement GotoStatement(int line)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = goto_;
  ret->u.goto_val = line;
  return ret;
}

statement GosubStatement(int line)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = gosub;
  ret->u.gosub_val = line;
  return ret;
}

statement ReturnStatement(void)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = return_;
  return ret;
}

statement StopStatement(void)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = stop;
  return ret;
}

/* Step can be NULL */
statement ForStatement(variable var_, exp assign_, exp to_, exp step_)
{
  statement ret = safe_malloc(sizeof(*ret));
  if(step_ == NULL)
  {
    ret->type = for_;
  }
  else
  {
    ret->type = for_step;
  }
  ret->u.for_val.var = var_;
  ret->u.for_val.assign = assign_;
  ret->u.for_val.to = to_;
  ret->u.for_val.step = step_;
  return ret;
}

statement NextStatement(variable var)
{
  statement ret = safe_malloc(sizeof(*ret));
  ret->type = next;
  ret->u.next_val = var;
  return ret;
}

exp AddSubExp(addsub as_, secondary sec_, secList list_)
{
  exp ret = safe_malloc(sizeof(*ret));
  ret->fHasAddSub = 1;
  ret->as = as_;
  ret->sec = sec_;
  ret->list = list_;
  return ret;
}

exp PlainExp(secondary sec_, secList list_)
{
  exp ret = safe_malloc(sizeof(*ret));
  ret->fHasAddSub = 0;
  ret->sec = sec_;
  ret->list = list_;
  return ret;
}

variable StringVariable(char v_)
{
  variable ret = safe_malloc(sizeof(*ret));
  ret->type = stringV;
  ret->v = v_;
  return ret;
}

variable IntegerVariable(char v_)
{
  variable ret = safe_malloc(sizeof(*ret));
  ret->type = integerV;
  ret->v = v_;
  return ret;
}

printItem NewPrintItem(exp e_)
{
  return e_;
}

printList NewPrintList(printItem item_, printList next_)
{
  printList ret = safe_malloc(sizeof(*ret));
  ret->item = item_;
  ret->next = next_;
  return ret;
}

test NewTest(exp a_, exp b_, comparison c_)
{
  test ret = safe_malloc(sizeof(*ret));
  ret->a = a_;
  ret->b = b_;
  ret->c = c_;
  return ret;
}

secondary NewSecondary(primary p_, priList list_)
{
  secondary ret = safe_malloc(sizeof(*ret));
  ret->p = p_;
  ret->list = list_;
  return ret;
}

secList NewSecList(addsub as_, secondary s_, secList next_)
{
  secList ret = safe_malloc(sizeof(*ret));
  ret->as = as_;
  ret->s = s_;
  ret->next = next_;
  return ret;
}

primary BracketPrimary(exp e_)
{
  primary ret = safe_malloc(sizeof(*ret));
  ret->type = bracket;
  ret->u.bracket_val = e_;
  return ret;
}

primary VarPrimary (variable v_)
{
  primary ret = safe_malloc(sizeof(*ret));
  ret->type = var;
  ret->u.var_val = v_;
  return ret;
}

primary IntegerPrimary(int n_)
{
  primary ret = safe_malloc(sizeof(*ret));
  ret->type = integerP;
  ret->u.integer_val = n_;
  return ret;
}

primary StringPrimary(char *str_)
{
  primary ret = safe_malloc(sizeof(*ret));
  ret->type = stringP;
  ret->u.string_val = str_;
  return ret;
}

priList NewPriList(muldiv md_, primary p_, priList next_)
{
  priList ret = safe_malloc(sizeof(*ret));
  ret->md = md_;
  ret->p = p_;
  ret->next = next_;
  return ret;
}

Makefile
Since we are working with standard *nix tools and the normal build system used on *nix is make, it is useful to write a Makefile for the interpreter. Keep in mind that most compilers/interpreters are very large and need a more advanced build system than this example. They may require CVS, autoconf and many makefiles distributed across different directories. Since this one only uses five files, it is quite trivial.

DEBUG_CFLAGS    := -Wall -Wno-unknown-pragmas -Wno-format -g -DDEBUG -O0
DEBUG_LDFLAGS   := -g
RELEASE_CFLAGS  := -O3
RELEASE_LDFLAGS :=

# Comment out following line if you need optimization ;-)
DEBUG := YES

# Comment out following line if you're not using Windows/Dos
POSTFIX := .exe

ifeq (YES, ${DEBUG})
CFLAGS      := ${DEBUG_CFLAGS}
LDFLAGS     := ${DEBUG_LDFLAGS}
else
CFLAGS      := ${RELEASE_CFLAGS}
LDFLAGS     := ${RELEASE_LDFLAGS}
endif

.PHONY : all clean

all : vtbi${POSTFIX}

vtbi${POSTFIX} : vtbi.o y.tab.o lex.yy.o absyn.o
	gcc ${LDFLAGS} ${CFLAGS} -o $@ vtbi.o y.tab.o lex.yy.o absyn.o

clean :
	-rm *.o y.tab.h y.tab.c lex.yy.c *.exe

y.tab.c : VTB.y
	bison -yd VTB.y -r state

y.tab.h : y.tab.c
	echo "y.tab.h created by Bison as well as y.tab.c"

lex.yy.c : VTB.l
	flex VTB.l

%.o : %.c
	gcc -c ${CFLAGS} $< -o $@

Generating Abstract Syntax Trees
VTB.y - Version 2

The lexer now supplies values for tokens and the abstract syntax tree structures have been written. Next, the grammar file is updated to construct the trees from the rules and their semantic values. All the tree node types are added to the %union declaration. Each rule must be given a type and must return a value of that type. Because the generated y.tab.h now refers to the tree node types, any file that includes it (such as the lexer) must include absyn.h first.

%{
int yylex(void);

#include <stdlib.h>
#include "absyn.h"

void yyerror(char* str)
{
  /*This should handle errors!*/
}

void addLine(int line, statement stm);

%}

%union {
  /* Token Types */
  int number;
  char *string;
  char var;
  /* Abstract Syntax Tree Types */
  statement stm;
  exp exp;
  variable variable;
  printList printList;
  test test;
  comparison comp;
  secondary sec;
  secList secList;
  primary pri;
  priList priList;
  muldiv muldiv;
  addsub addsub;
}

%token <number> NUMBER
%token <string> STRING
%token <var> VAR_NUM
%token <var> VAR_STR
%token LET IF THEN PRINT REM GOTO GOSUB RETURN STOP FOR EQUAL TO STEP NEXT SEPARATE NEQUAL LT GT LTEQ GTEQ ADD SUB MUL DIV LPARAN RPARAN

%start lines

%type <stm> statement
%type <exp> exp
%type <exp> printItem
%type <variable> variable
%type <printList> printList
%type <test> test
%type <comp> comparison
%type <sec> secondary
%type <secList> secList
%type <pri> primary
%type <priList> priList
%type <muldiv> muldiv
%type <addsub> addsub

%%

lines: line lines
     | /*Nothing*/
     ;

line: NUMBER statement                  {addLine($1, $2);}
    ;

statement: LET variable EQUAL exp       {$$=LetStatement($2,$4);}
         | IF test THEN NUMBER          {$$=IfThenStatement($2, $4);}
         | PRINT printItem printList    {$$=PrintStatement($2, $3);}
         | REM                          {$$=RemStatement();}
         | GOTO NUMBER                  {$$=GotoStatement($2);}
         | GOSUB NUMBER                 {$$=GosubStatement($2);}
         | RETURN                       {$$=ReturnStatement();}
         | STOP                         {$$=StopStatement();}
         | FOR variable EQUAL exp TO exp STEP exp {$$=ForStatement($2, $4, $6, $8);}
         | FOR variable EQUAL exp TO exp {$$=ForStatement($2, $4, $6, NULL);}
         | NEXT variable                {$$=NextStatement($2);}
         ;

variable: VAR_NUM                       {$$=IntegerVariable($1);}
        | VAR_STR                       {$$=StringVariable($1);}
        ;

printList: SEPARATE printItem printList {$$=NewPrintList($2,$3);}
         | /* Nothing */                {$$=NULL;}
         ;

printItem: exp                          {$$=NewPrintItem($1);}
/*       | STRING */
         ;

test: exp comparison exp                {$$=NewTest($1, $3, $2);}
    ;

comparison: EQUAL                       {$$=equal;}
          | NEQUAL                      {$$=nequal;}
          | LT                          {$$=lt;}
          | GT                          {$$=gt;}
          | LTEQ                        {$$=lteq;}
          | GTEQ                        {$$=gteq;}
          ;

exp: addsub secondary secList           {$$=AddSubExp($1, $2, $3);}
   | secondary secList                  {$$=PlainExp($1, $2);}
   ;

secList: addsub secondary secList       {$$=NewSecList($1, $2, $3);}
       | /* Nothing */                  {$$=NULL;}
       ;

addsub : ADD                            {$$=add;}
       | SUB                            {$$=sub;}
       ;

secondary : primary priList             {$$=NewSecondary($1, $2);}
          ;

priList : muldiv primary priList        {$$=NewPriList($1, $2, $3);}
        | /* Nothing */                 {$$=NULL;}
        ;

muldiv : MUL                            {$$=mul;}
       | DIV                            {$$=divd;}
       ;

primary : LPARAN exp RPARAN             {$$=BracketPrimary($2);}
        | variable                      {$$=VarPrimary($1);}
        | NUMBER                        {$$=IntegerPrimary($1);}
        | STRING                        {$$=StringPrimary($1);}
        ;