Compiler Construction/Dealing with errors

Even experienced programmers make mistakes, so they appreciate any help a compiler can provide in identifying the mistakes. Novice programmers may make lots of mistakes, and may not understand the programming language very well, so they need clear, precise, and jargon-free error reports. Especially in a learning environment, the main function of a compiler is to report errors in source programs; as an occasional side-effect you might actually get a program translated and run.

As a general rule, compiler writers should attempt to express error messages in moderately plain English, rather than with reference to the official programming language definition (some language definitions use somewhat obscure or specialized terminology).

For example, a message "can't convert string to integer" is probably clearer than "no coercion found".

Historical Notes
In the 1960's and much of the 1970's, batch processing was the normal way of using a (large) mainframe computer (personal computers only started to become household items in the early 1980's). It could well be several hours, or even a day, from when you handed your deck of punched cards to a receptionist until you could collect the card deck along with a printed listing of your program, accompanied either by error messages or by some useful results.

Under such circumstances, it was important that compilers report as many errors as possible, so part of the job of writing a compiler was to 'recover' from an error and continue checking (but not translating) in the hope of finding more errors. Unfortunately, once an error has occurred (especially if the error affects a declaration), it is quite possible for the compiler to get confused and produce a host of spurious error reports.

Programmers then had the task of deciding which errors to try and fix, and which ones to ignore in the hope that they would vanish once earlier errors were fixed. Some compilers were particularly prone to producing spurious error reports. The only useful advice that helpdesk staff could provide was: fix the first error, since the compiler hasn't had a chance to confuse itself at that point.

A significant amount of compiler development effort was often devoted to attempts at error recovery. You could try and guess what the programmer might have intended, or insert some token to at least allow parsing to continue, or just give up on that statement and skip to the next semicolon. The latter action could skip an end or other significant program structure token and so get the compiler even more confused.

Integrated Development Environment (IDE)
Fast personal computers are now available, so IDEs are becoming more popular, with an editor and compiler tightly coupled and usable from a single graphical interface. Many IDEs also include a debugger. In some cases the editor is language-sensitive, so it can supply matching brackets and/or statement schemas to help reduce the number of trivial errors. An IDE may also use different colours for different concepts within a source language, e.g. reserved words in bold, comments in green, constants in blue, or whatever.

This speed and tight coupling allows the compiler writer to adopt a much simpler approach to errors: the compiler just stops as soon as it finds an error, and the editor then places the cursor at the point in the source text where the error was detected and displays some specific error message. Note that the point where an error was detected could well be some distance after the point where the error actually occurred.

Line-mode IDEs existed as far back as 1964; many BASIC systems were examples of such systems. We will implement something like this in the book section Case study - a simple interpreter.

Compile-time Errors
During compilation it is always possible to give the precise position at which the error was detected. This position could be shown by placing the editor cursor at the precise point, or (in batch mode) by listing the offending line followed by a line containing some sort of flag (e.g. '|') positioned under the point of error, or (less conveniently) by providing the line number and column number of that point.

Remember that the actual position of the error (as distinct from where it was detected) may well be at some earlier point in the program; in some cases (e.g. bracket mismatch) the compiler may be able to indicate the nature of the earlier error.

It is important that error messages be clear, correct, and relevant.
 * The worst counter-example that Murray Langton has encountered was a compiler which reported "Missing semicolon" when the actual error was an extra space in the wrong place. To further confuse matters, no indication was given as to where in the program the error was.  Just to add insult to injury, the source language didn't even use semicolons!

Errors during Lexical Analysis
There are relatively few errors which can be detected during lexical analysis.

Strange characters
 * Some programming languages do not use all possible characters, so any strange ones which appear can be reported. Note however that almost any character is allowed within a quoted string.

Long quoted strings (1)
 * Many programming languages do not allow quoted strings to extend over more than one line; in such cases a missing quote can be detected. Languages of this type often have some way of automatically joining consecutive quoted strings together to allow for really long strings.

Long quoted strings (2)
 * If quoted strings can extend over multiple lines then a missing quote can cause quite a lot of text to be 'swallowed up' before an error is detected. The error will probably then be reported as somewhere in the text of the next quoted string, which is unlikely to make sense as part of a program.

Invalid numbers
 * A number such as 123.45.67 could be detected as invalid during lexical analysis (provided the language does not allow a full stop to appear immediately after a number). Some compiler writers prefer to treat this as two consecutive numbers 123.45 and .67 as far as lexical analysis is concerned and leave it to the syntax analyser to report an error.  Some languages do not allow a number to start with a full stop/decimal point, in which case the lexical analyser can easily detect this situation.
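The invalid-number check can be sketched as a small lexer rule. This is an illustration under assumptions (the regular expression and error wording are invented): it takes the first approach, rejecting a full stop that immediately follows a number, and it also rejects a number that starts with a full stop.

```python
import re

# Integer or decimal number: digits, optionally followed by '.' and digits.
NUMBER = re.compile(r'\d+(\.\d+)?')

def lex_number(text, pos):
    """Return (lexeme, new_pos), or raise on an invalid number."""
    m = NUMBER.match(text, pos)
    if m is None:
        # e.g. '.67' -- this language does not allow a leading full stop
        raise SyntaxError(f"invalid number at column {pos + 1}")
    end = m.end()
    if end < len(text) and text[end] == '.':
        # e.g. 123.45.67 -- a second decimal point follows the number
        raise SyntaxError(
            f"invalid number '{text[pos:end]}.' at column {pos + 1}")
    return text[pos:end], end
```

A compiler taking the other approach would simply return `123.45` here and let the syntax analyser complain about the unexpected `.67` token.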

Errors during Syntax Analysis
During syntax analysis, the compiler is usually trying to decide what to do next on the basis of expecting one of a small number of tokens. Hence in most cases it is possible to automatically generate a useful error message just by listing the tokens which would be acceptable at that point.

Source: A + * B
Error:      | Found '*', expect one of: Identifier, Constant, '('

More specific hand-tailored error messages may be needed in cases of bracket mismatch.

Source: C := ( A + B * 3 ;
Error:                   | Missing ')' or earlier surplus '('
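Both reports above follow the batch-style layout described earlier: echo the offending line, then a flag line with '|' under the point of detection. A hedged sketch of generating such a report automatically from the set of acceptable tokens (the formatting details and function name are assumptions, not from any particular compiler):

```python
def report_error(source, col, found, expected):
    """Format a batch-style error report with a '|' flag under the
    offending token; col is the 0-based column of that token."""
    # 'Source: ' and 'Error:  ' are both 8 characters wide, so col spaces
    # put the flag directly under column col of the echoed source line.
    flag = (" " * col + "| Found '" + found + "', expect one of: "
            + ", ".join(expected))
    return "Source: " + source + "\n" + "Error:  " + flag
```

For example, `report_error("A + * B", 4, "*", ["Identifier", "Constant", "'('"])` reproduces the first report above, with the expected-token list taken straight from the parser's own tables.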

Errors during Semantic Analysis
One of the most common errors reported during semantic analysis is "identifier not declared"; either you have omitted a declaration or you have misspelt an identifier.

Other errors commonly detected during semantic analysis relate to incompatible use of types, e.g. attempt to assign a logical value such as true to a string of characters. Some of these errors can be quite subtle, but again it is easy to automatically generate fairly precise error messages.

Source: SomeString := true; Error: Can't assign logical value to character string
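A hedged sketch of such a check in the semantic analyser; the symbol table contents and type names are invented for illustration:

```python
# Symbol table mapping declared identifiers to their types (hypothetical).
SYMBOLS = {"SomeString": "character string", "Count": "integer"}

def check_assignment(name, value_type):
    """Return an error message, or None if the assignment type-checks."""
    if name not in SYMBOLS:
        return f"identifier '{name}' not declared"
    target_type = SYMBOLS[name]
    if target_type != value_type:
        # The message is built from the two type names, so it stays
        # fairly precise without any hand-tailoring.
        return f"Can't assign {value_type} value to {target_type} '{name}'"
    return None
```

Because the message is assembled from the types recorded in the symbol table, each new type added to the language gets reasonable messages for free.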

The extent to which such type checking is possible depends very much on the source language.
 * PL/1 allows an amazingly wide variety of automatic type conversions, so relatively little checking is possible.
 * Pascal is much more fussy; you can't even assign a real value to an integer variable without explicitly specifying whether you want the value to be rounded or truncated.
 * Some writers have argued that type checking should be extended to cover the appropriate units as well for even more checking, e.g. it doesn't make sense to multiply a distance by a temperature.

Other possible sources of semantic errors are parameter miscount and subscript miscount. It is generally an error to declare a subroutine as having 4 parameters and then call that routine with 5 parameters (but some languages do allow routines to have a variable number of parameters). It is also generally an error to declare an array as having 2 subscripts and then try and access an array element using 3 subscripts (but some languages may allow the use of fewer subscripts than declared in order to select a 'slice' of the array).
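The parameter-count check can be sketched as follows; the signature table and the varargs flag are assumptions, included because, as noted above, some languages do allow variable-length parameter lists:

```python
# Declared parameter counts, gathered earlier from declarations (hypothetical).
SIGNATURES = {"plot": 4}

def check_call(name, actual_args, varargs=False):
    """Return an error message, or None if the call is consistent."""
    declared = SIGNATURES.get(name)
    if declared is None:
        return f"routine '{name}' not declared"
    if not varargs and len(actual_args) != declared:
        return (f"routine '{name}' declared with {declared} parameters "
                f"but called with {len(actual_args)}")
    return None
```

The subscript-count check is the same shape: compare the number of subscripts in the use against the number in the declaration, with an exception for languages that permit array slicing.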

Reporting the Position of Run-Time Errors
There is general agreement that run-time errors such as division by 0 should be detected and reported. However, there is considerable variation as to how the location of the error is reported.

 * Some systems merely provide the hexadecimal address of the offending instruction. If your compiler/linker produced a load map you might then be able to do some hexadecimal arithmetic to identify which routine it is in.

 * Some systems do tell you the name of the routine the error was in, and possibly the names of all the routines which were active at the time.

 * A few kind systems give you the source line number, which is very helpful. Note however that extensive program optimization can move code around and intermingle statements, in which case line numbers may only be approximate.

From the implementor's viewpoint there are several ways in which line number details or equivalent can be provided:
 * The compiled program can contain instructions which place the current line number in some fixed place; this makes the program longer and slower. Of course the compiler need only add these instructions for statements which can actually cause an error.
 * The compiled program can contain a table indicating the position at which each source line starts in the compiled code. In the event of an error, special code can then consult this table and determine the source line involved.  This makes the compiled code longer but doesn't slow it down.
 * In some unoptimized systems, it may be possible to deduce some source information from the compiled code, e.g. the Elliott 503 Algol 60 compiler could report: "divide by 0 at second division after third begin of routine 'xyz'". This doesn't affect code size or speed, but may not always be feasible to implement.
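The table approach in the second point can be sketched as follows. The addresses here are hypothetical; a real implementation would store the table in the compiled binary, and the run-time error handler would binary-search it with the faulting instruction's address:

```python
import bisect

# (start address of the code for each source line, line number) -- hypothetical.
LINE_TABLE = [(0x0000, 1), (0x0010, 2), (0x0028, 3), (0x0040, 4)]

def source_line_of(fault_address):
    """Find the line whose compiled code contains fault_address."""
    starts = [addr for addr, _ in LINE_TABLE]
    # Rightmost table entry whose start address is <= fault_address.
    i = bisect.bisect_right(starts, fault_address) - 1
    return LINE_TABLE[i][1]
```

Since the table is sorted by address, the lookup costs O(log n) at error time and nothing during normal execution, which is why this scheme doesn't slow the program down.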

Run-Time Speed versus Safety
Some of the material in this section may be controversial.

There are some potential run-time errors which many systems do not even try to detect. The language definition may merely say that the result of breaking a certain language rule is undefined, i.e. you might get an error message, or you might get the wrong answer without any warning, or you might on some occasions get the right answer, or you might get a different answer every time you run the program, or you might trigger off World War III ('undefined' does mean that anything could happen).

In the past there have been some computers (Burroughs 5000+, Elliott 4130) which had hardware support for fast detection of some of these errors. Many current IDEs do have a debugging option which may help detect some of these run-time errors:

 * Attempt to divide by 0.
 * Overflow (and possibly underflow) during arithmetic operations.
 * Attempt to use a variable before it has been set to some sensible value (undefined variable).
 * Attempt to refer to a non-existent array element (invalid subscript).
 * Attempt to set a variable (defined as having a limited range) to some value outside this range.
 * Various errors related to pointers:
   * Attempt to use a pointer before it has been set to point to somewhere useful.
   * Attempt to use a nil pointer, which explicitly doesn't point anywhere useful.
   * Attempt to use a pointer which points outside the array it should point to.
   * Attempt to use a pointer after the memory it points to has been released.

Historically, the main reason for not doing these checks is the effect on performance. When FORTRAN was first developed (circa 1957), it had to compete with code written in assembler; indeed many of the optimizing techniques used in modern compilers were first developed and used at that time. C was developed (circa 1971) initially as a replacement for assembler for use by experienced system programmers when writing operating systems.

In both the above cases there was a justifiable reason for not doing these checks. Nowadays, computer hardware is very much faster than it was in 1957 or 1971, and there are many more less-experienced programmers writing code, so the arguments for avoiding checks are much weaker. Actually adding the checks on a supposedly working program can be enlightening/surprising/embarrassing; even programs which have been 'working' for years may turn out to have a surprising number of bugs.

Hoare (inventor of quicksort) was responsible for an Algol 60 compiler in the early 1960's; subscript checking was always done. Indeed Hoare has said in "Hints on Programming Language Design" that: "Carrying out checks during testing and then suppressing them in production is like a sailor who wears a lifejacket when training on dry land and then removes the lifejacket when going to sea."

In his book "The Psychology of Computer Programming", Weinberg recounts the following anecdote:
 * After months of effort, a particular application was still not working, so a consultant was called in from another part of the company. He concluded that the existing approach could never be made to work reliably.  While on his way home he realized how it could be done.  After a few days work he had a demonstration program working and presented it to the original programming team.
 * Team leader: How long does your program take when processing?
 * Consultant: About 10 seconds per case.
 * Team leader: But our program only takes 1 second. {Team look smug at this point}
 * Consultant: But your program doesn't work. If the program doesn't have to work then I can make it as fast as you like.

Wirth designed Pascal as a teaching language (circa 1972); for many Pascal compilers the default was to perform all safety checks. Some Pascal systems had an option to suppress the checks for some limited part of the program.

When a programming language allows the use of pointers and pointer arithmetic for accessing array elements, the cost of doing checks for access to non-existent array elements might be significant. Note that it can indeed be done: each pointer is large enough to contain three addresses, the first being the one which is directly manipulated and used by the programmer, and the other two addresses being the lower and upper limits on the first. This approach may have problems when the language allows interconversion between integers and pointers.
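The three-address scheme just described is often called a 'fat pointer'; a sketch of the idea follows (names invented; a real compiler would emit the bounds test inline as a couple of machine instructions):

```python
class FatPointer:
    """A pointer carrying the lower and upper limits it may legally address."""

    def __init__(self, current, lower, upper):
        self.current, self.lower, self.upper = current, lower, upper

    def add(self, offset):
        # Pointer arithmetic moves only the working address; the bounds
        # travel along unchanged, so a later dereference can be checked.
        return FatPointer(self.current + offset, self.lower, self.upper)

    def load(self, memory):
        if not (self.lower <= self.current <= self.upper):
            raise RuntimeError(
                f"pointer at {self.current} outside [{self.lower}, {self.upper}]")
        return memory[self.current]
```

Note that the check happens at the dereference, not at the arithmetic: forming an out-of-range pointer is harmless until it is used, which matches how pointer arithmetic is typically employed to step through arrays.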

In the case of 'undefined variables', note that setting all variables initially to 0 is a really bad idea (unless the language mandates this of course). Such an initial setting reduces program portability and may also disguise serious logic errors.

Cheap detection of 'undefined'
Murray Langton (main author of this wikibook) has had some success in checking for 'undefined' in a 140,000 line safety-critical legacy Fortran program. The fundamental idea is to set all global variables to recognizably strange values which are highly likely to produce visibly strange results if used.

For an IBM mainframe, the strange values were:
 * REAL set to -9.87654E70
 * INTEGER set to -123456789
 * CHAR set to '?'

Note that the particular values used depend on your system; in particular, the large number used for REAL is definitely hardware-dependent. For a machine with IEEE floating point arithmetic (most PCs) the best choice for REAL is NaN (not a number), with a possible alternative being -9.87654E37.

The reason for choosing large negative numerical values is that they tend to be very obvious when printed or displayed as output, and they tend to cause numerical errors (overflow) if used for arithmetic. Also, in Fortran, all output is in fixed-width fields, and any output which won't fit in the field is displayed as a field full of asterisks instead, which is very easy to spot.

In the safety-critical example quoted above, a program was written which identified all global variables (by analyzing COMMON blocks), excluded those (in BLOCK DATA) which were explicitly initialized, and then wrote a Fortran routine which set all these silly values. If any changes were made to a COMMON block, it was a simple matter to rerun this analysis program.

During execution, the routine which sets silly values uses less than 0.1% of the total CPU time. When these silly values were first used, it took several months to track down and eliminate the resulting flood of asterisks and question marks which appeared in the output, despite the fact that the program had been 'working' for over 20 years.
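The analysis-and-generation step can be sketched as follows; this is written in Python rather than Fortran, and the representation of the COMMON-block analysis results is invented for illustration:

```python
# Strange values per type, following the IBM mainframe choices quoted above.
SILLY = {"REAL": -9.87654e70, "INTEGER": -123456789, "CHAR": "?"}

def build_initializer(globals_by_type, explicitly_initialized):
    """Map every global variable not already covered by BLOCK DATA
    to the silly value for its type."""
    return {name: SILLY[ty]
            for name, ty in globals_by_type.items()
            if name not in explicitly_initialized}
```

For example, `build_initializer({"X": "REAL", "N": "INTEGER", "C": "CHAR"}, {"N"})` would set only X and C, leaving the explicitly initialized N alone; the real analysis program then wrote the equivalent assignments out as a Fortran routine.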

How to check for 'undefined'
The basic idea is to ensure that all variables are flagged as 'undefined' when declared. Some languages allow simultaneous declaration and initialization, in which case a variable is flagged as 'defined'. Whenever a value is assigned to a variable the flag is changed to 'defined'. Whenever a variable is used the flag is checked and an error is reported if it is 'undefined'.

In the past a few lucky implementors have had hardware assistance in the form of an extra bit attached to each word in memory (Burroughs 5000+). On modern byte-addressable machines you could attach an extra byte to each variable to hold the flag. Unfortunately, due to alignment requirements, this would tend to double the amount of memory needed for data (many systems require 4-byte items such as numbers to have an address which is a multiple of 4; even if misalignment is allowed its use may slow the program down significantly).

The simplest way of providing a flag is to use some specific value which is (hopefully) unlikely to appear in practice. Particular values depend on the type of the variable involved.

boolean
 * Such variables are most likely to be allocated one byte of storage with 0 for false and 1 for true. A value such as 255 or 128 is a suitable flag.

character
 * When used for binary input/output, any value could appear, so no checking is possible. Hence it must be possible to switch off checking in such cases.
 * When used as a character there are many possible non-printing characters. 127 or 128 or 255 may be suitable choices.

integer
 * Most computer systems use two's complement representation for negative numbers which gives an asymmetric range (for 16-bits, range is -32768 to +32767). We can restore symmetry by using the largest negative number as the 'undefined' flag.

real
 * If your hardware conforms to the IEEE standard (most PCs do) you can use NaN (not a number).
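Putting the suggested flag values together, the use-time check might look like this sketch (the sentinel choices mirror the lists above for one-byte booleans, 16-bit integers, and IEEE reals; a compiler would emit the equivalent test inline before each use):

```python
import math

UNDEFINED_FLAGS = {
    "boolean": 255,      # bytes 0 and 1 are the only legal values
    "integer": -32768,   # the asymmetric largest negative 16-bit value
}

def is_undefined(value, ty):
    if ty == "real":
        # NaN cannot be tested with ==, so use math.isnan.
        return isinstance(value, float) and math.isnan(value)
    return value == UNDEFINED_FLAGS.get(ty)

def checked_use(value, ty, name):
    """Check the flag before every use of a variable."""
    if is_undefined(value, ty):
        raise RuntimeError(f"variable '{name}' used before being given a value")
    return value
```

As noted above for characters used in binary input/output, any such scheme must be switchable off where legitimate values can collide with the sentinel.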

How to check at compile-time
You may well be thinking that all this checking (for undefined, bad subscript, out of range, etc.) is going to slow a program down quite a lot. Things are not as bad as you think, since a lot of the checking can actually be done at compile-time, as detailed below.

First, some statistics to show you what can be done:
 * Just adding checking to an existing compiler resulted in 1800 checks being generated for a 6000-line program.
 * Adding a few hundred lines to the compiler allowed it to do many checks at compile-time, and reduced the number of run-time checks to just 70. The program then ran more than 20% faster than the version with all checks included.

We have already mentioned that variables which are given an initial value when declared need never be checked for undefined.

The next few tests require some simple control-flow analysis, e.g. variables which are set in only one branch of an if statement become undefined again after the if statement, unless you can determine that the variable is defined on all possible branches.

 * Once a variable has been set (by assignment or by reading it from a file) it is then known to be defined and need not be tested thereafter.
 * Once a variable has been tested for 'undefined', it can be assumed to be defined thereafter.

If your programming language allows you to distinguish between input and output parameters for a routine, you can check as necessary before a call that all input parameters are defined. Within a routine you can then assume that all input parameters are defined.

For discrete variables such as integers and enumerations, you can often keep track at compile time of the maximum and minimum values which that variable can have at any point in the program. This is particularly easy if your source language allows variables to be declared as having some limited range (e.g. Pascal). Of course any assignment to such a bounded variable must be checked to ensure that the value is within the specified range.

For many uses of a bounded variable as a subscript, it often turns out that the known limits on the variable are within the subscript range and hence need not be checked.

In a count-controlled loop you can often check the range of the control variable by checking the loop bounds before entering the loop, which may well reduce the subscript checking needed within the loop.
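The range-tracking idea can be sketched as simple interval bookkeeping. This is a hedged illustration; a real compiler tracks an interval per variable per program point, but the core decision is just an interval comparison:

```python
def subscript_check_needed(known_range, declared_range):
    """True if the variable's known (low, high) interval can escape the
    declared subscript range, so a run-time check must be emitted."""
    lo, hi = known_range
    dlo, dhi = declared_range
    return lo < dlo or hi > dhi

# Example of the loop case: in 'for i := 1 to n', checking once before the
# loop that n <= 10 gives i the interval (1, 10), so indexing an array
# declared [1..10] inside the loop needs no per-iteration check.
```

This is why the bounded-variable declarations mentioned above (as in Pascal) help so much: every declaration hands the compiler a ready-made interval to propagate.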

Glossary
This glossary is intended to provide definitions of words or phrases which relate particularly to compiling. It is not intended to provide definitions of general computing jargon, for which a reference to Wikipedia may be more appropriate.