Compiler Construction/Describing a Programming Language

Background Information
Over the years it has been found that trying to describe a programming language using some natural language (e.g. English) is unsatisfactory due to possible omissions, contradictions, ambiguities, and vagueness. A significant milestone was the publication of the Algol 60 Report which used a formal grammar (BNF) to describe the syntax of the language. The semantics (meaning) was carefully described using English. A variation of Extended BNF (EBNF) will be used in this book as described later in this section.

There have been several attempts to formalise the semantics, including the grammar generator used for defining Algol 68, and the Vienna Definition Language which was used at one stage to define the meaning of PL/1. While theoretically adequate, these attempts are so complicated that they have proved to be of little practical use.

In 1956, Chomsky (a linguist) developed a hierarchy of formal grammars (types 0 to 3) for use in the study of English and other natural languages. Type 0 is the most powerful and type 3 the least powerful. It turns out that type-3 regular grammars are just what is needed to describe the requirements of lexical analysis, and that type-2 context-free grammars are just what is needed to describe the syntax of many programming languages. But even type 0 grammars are inadequate for describing the English language.

Analysis Tools
Programming tools have been developed which can accept the description of a grammar as input, and which produce as output the code for using that grammar for analysis. Note that there may be several different grammars which can be used to describe a particular programming language, and that not all of these grammars are usable by these tools.

Lexical analysis code using regular grammars can be produced by tools such as lex, flex, and JavaCC.

Syntax analysis code using context-free grammars can be produced by tools such as yacc, bison, and JavaCC.

Even if you have a suitable grammar, such tools only automate a relatively small part of the job of writing a compiler or interpreter. Simple lexical and syntax analysis code can in fact be written by hand without undue effort; we will see some examples in later chapters. Good hand-written code may be faster than the code produced by the tools.

Extended BNF
Words in italics (e.g. WholeNumber) are used to name grammatical concepts. It is a good idea to select names for these concepts which are semantically meaningful to a human reader, to make things easier to read and understand. The Algol 60 report used diamond brackets '<' and '>' rather than italics to flag grammatical concepts.

A vertical bar | is used to separate alternatives.

Single quotation marks are used to indicate characters to be used 'as is'.

The compound symbol ::= is used to mean 'is defined as' and a semicolon ; is used to end a definition.

Square brackets [ ] are used to indicate options and/or repetition. [ xxx ]    xxx is optional. [ yyy ]*   yyy may be repeated zero or more times. [ zzz ]+   zzz may be repeated one or more times.

Examples for Lexical Analysis
We will start with some simple examples which satisfy the rules for regular grammars.

The following defines a single decimal Digit as being any one of the characters from '0' to '9'.

Digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;

Understood as

Digit is defined as: '0' OR '1' OR '2' OR '3' OR '4' OR '5' OR '6' OR '7' OR '8' OR '9'.

A similar (somewhat tedious) definition can be given for Letter to cover 'a' to 'z' and 'A' to 'Z'.

A WholeNumber is a sequence of one or more decimal digits, which can be expressed as one of two ways

First way

WholeNumber ::= [ Digit ]+ ;

Understood as

WholeNumber is defined as: Digit may be repeated one or more times

Second way

WholeNumber ::= Digit [ Digit ]* ;

Understood as

WholeNumber is defined as: Digit AND Digit repeated zero or more times.

These definitions allow whole numbers such as 007. Some people dislike the idea of allowing whole numbers to have leading zeroes. We can handle this way of looking at things by using the following definitions, but we have to include the number zero (just 0) as a special case.

NonZeroDigit ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ; Digit ::= '0' | NonZeroDigit ; WholeNumber ::= '0' | NonZeroDigit [ Digit ]* ;

Understood as

NonZeroDigit is defined as: '1' OR '2' OR '3' OR '4' OR '5' OR '6' OR '7' OR '8' OR '9'.

Digit is defined as: 0 OR NonZeroDigit.

WholeNumber is defined as: 0 OR as NonZeroDigit AND Digit repeated zero or more times.

This last definition is a formal way of saying that a whole number is either 0  or starts with a non-zero digit, and that this non-zero digit may be followed by any number of digits (including none).

In most programming languages an Identifier is any sequence of letters and digits which starts with a letter. Some programming languages may allow a few special characters as well (e.g. underline _ or hash #). There may also be some restrictions on the length of an Identifier, but such restrictions are normally given as semantic constraints expressed in a natural language such as English rather than as part of the formal grammar (e.g. up to Fortran 77, Identifiers in Fortran were limited to a maximum of 6 characters).

Example for Syntax Analysis
We now look at some examples which, in combination, satisfy the rules for context-free grammars.

Consider an arithmetic expression such as  3 + 4 * 5. The normal convention is that multiplication and division are done before addition and subtraction, so the value of  3 + 4 * 5   is   23. If we wanted the addition to be done first in this example, we would use brackets, so that  (3 + 4) * 5    has value   35.

We can express these conventions regarding brackets and operator priority as follows. These definitions also specify that operators with the same priority are applied from left to right.

AddSub ::= '+' | '-' ; MulDiv ::= '*' | '/' ; Expression ::= Secondary [ AddSub Secondary ]* ; Secondary ::= Primary   [ MulDiv Primary ]* ; Primary   ::= '(' Expression ')' | Secondary | WholeNumber | Identifier ;

There are some extremely simple expressions which are not covered by the above grammar. Can you work out what they might be before reading the third paragraph below?

Note that this is a relatively simple example, with only two priority levels for operators. Languages such as C/C++/Java have more than a dozen priority levels.

The particular aspect that makes this context-free rather than regular is the way in which an Expression can be a Secondary, which can be a Primary, which can be a bracketed Expression, i.e. Expression can be partially defined in terms of itself; this is what is known as a recursive definition.

The expressions which are not covered include such things as -5  or  +4. To handle these we need to change one definition: Expression ::= [AddSub] Secondary [ AddSub Secondary ]* ;

This grammatical rule allows -(+(-4)), but it does not allow   -+-4.