Compiler Construction/Run-time Considerations

Storage Management - Garbage Collector
In computing, some tools wait to emerge until developers find technology capable of supporting them. One good example dates from the late 1950s, when John McCarthy came up with the idea of automatically reclaiming the memory of objects that are no longer needed during the execution of Lisp programs. This was the origin of the garbage collection concept in computer science.

One of the tasks performed at runtime (that is, while the compiled program is being run by the user) is the management of allocated memory blocks, so as to minimize memory usage by deallocating blocks that will no longer be used. This is referred to as "garbage collection." Garbage collection is not available in all languages and compilers, but it is implemented in many of the most widely used. When implemented correctly, it takes much of the burden of memory management off the programmer and can also improve performance.

Ideally, a garbage collector (hereinafter GC) would remove every allocation that will never be used again. In practice, we assume that if there is any way to reference a block, it may be used again; if there is none, it cannot be used (except accidentally). So GC works by tracing every reference path at a given moment, retaining the memory blocks that are reachable, and freeing the rest. For example,

    function concat (string1, string2) {
      new_string = alloc (string1.length + string2.length);
      copy (string1, new_string);
      copy (string2, new_string + string1.length);
      return new_string;
    }

This function returns a newly allocated memory block that holds the concatenation of the two given strings. Because the function only returns the new string and does not know how it is going to be used, the caller of this function is responsible for freeing it, like:

    var old = null
    var string = 'Writing compilers is '
    if you have not finished this book {
      string := concat (string, 'anything but ')
      old := string
    }
    string := concat (string, 'fun!')
    if old != null {
      free (old)
    }

By letting GC free memory blocks when needed, the above can be simplified to:

    string = 'Writing compilers is '
    if you have not finished this book {
      string := concat (string, 'anything but ')
    }
    string := concat (string, 'fun!')

In practice, an execution path and reference dependencies can be far more complicated than the above; thus, it should be easy to imagine how GC would be a great help.

Brief History of Garbage Collection
In the Beginning, there was Static Allocation, and for a Time, it was Good. FORTRAN, circa the mid-1950s, had no garbage collection at all. Once a block of memory was allocated, it stayed allocated. The programmer could not deallocate it, even if they tried.

Circa 1958, Algol implemented a form of memory management called stack allocation. Following that, languages like C implemented heap allocation, which allows the programmer to arbitrarily allocate and deallocate memory from the heap of available memory. In C, this is done with the malloc and free calls. While this allows very flexible memory allocation and deallocation by the programmer, careless (mis)use often results in memory leaks and dangling pointers; programmer errors are not handled by the compiler or runtime environment.

To overcome the barriers of programmer error, we must either better instruct our programmers or provide better systems for memory management. The string of non-accredited technical institutes that seems to have sprung up in the United States obviously does nothing to solve the former problem, so it is up to us to approach the latter.

Implementing Garbage Collection
There are two basic approaches to garbage collection: reference counting and batching (or tracing).

Reference Counting
In reference counting, a reference counter is associated with every allocated object. Whenever a reference to that object is created, the counter is incremented. Whenever a reference to that object is destroyed, the counter is decremented. When the counter reaches 0, there are no remaining references to the object, so it can be safely removed (because there is no way for it to be accessed in the future).

When an object O is created, we also create a reference counter for O, calling it rc(O). At creation, no references to O exist yet, so rc(O) = 0. When we create a reference to O, we perform rc(O) = rc(O) + 1. When we destroy a reference to O, we perform rc(O) = rc(O) - 1. Whenever we decrement, we check to see if the counter is now 0. If it is, we free O.
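The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `RefCountedHeap`, its method names, and the choice to start the counter at 0 are hypothetical and not taken from any particular runtime.

```python
class RefCountedHeap:
    """Toy heap that keeps one reference counter per object (illustrative)."""

    def __init__(self):
        self.rc = {}       # object name -> current reference count
        self.freed = []    # names of freed objects, in the order they were freed

    def create(self, name):
        # Assumption: a new object starts with no references to it.
        self.rc[name] = 0

    def add_ref(self, name):
        # Creating a reference increments the counter.
        self.rc[name] += 1

    def drop_ref(self, name):
        # Destroying a reference decrements the counter; at 0 the object is freed.
        self.rc[name] -= 1
        if self.rc[name] == 0:
            del self.rc[name]
            self.freed.append(name)


heap = RefCountedHeap()
heap.create("O")
heap.add_ref("O")     # first reference to O
heap.add_ref("O")     # second reference to O
heap.drop_ref("O")    # one reference remains, so O stays alive
alive_after_first_drop = "O" in heap.rc
heap.drop_ref("O")    # last reference destroyed, so O is freed
```

Note that the check for 0 happens only on decrement, which is why no separate collection pass is needed in this scheme.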

To allow allocation of memory, we maintain a list of free memory blocks. When we allocate blocks, we remove them from the free list. When we deallocate, we add them back to the list. A major drawback of this method of allocation is fragmentation: free memory ends up scattered in small pieces, and having to defragment it before a block of the requested size can be allocated causes unpredictable slowdowns. To prevent this, one can defragment the memory at some set interval, or perform more complex deallocation to keep free memory contiguous, but the solution to this problem is likely to be non-trivial.
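To make the fragmentation problem concrete, here is a minimal sketch of a free-list allocator. The first-fit policy, the `(start, length)` block representation, and the name `FreeListAllocator` are all assumptions chosen for illustration; the text does not prescribe a policy, and this version deliberately does not merge neighbouring free blocks.

```python
class FreeListAllocator:
    """First-fit allocator over a flat address range (illustrative only)."""

    def __init__(self, size):
        self.free = [(0, size)]    # free list of (start, length) blocks

    def alloc(self, n):
        for i, (start, length) in enumerate(self.free):
            if length >= n:
                # Take the front of this block; keep any remainder on the list.
                if length == n:
                    del self.free[i]
                else:
                    self.free[i] = (start + n, length - n)
                return start
        return None                # no single free block is large enough

    def dealloc(self, start, n):
        # Return the block to the free list without merging neighbours.
        self.free.append((start, n))
        self.free.sort()


heap = FreeListAllocator(8)
a = heap.alloc(2)    # occupies addresses 0..1
b = heap.alloc(2)    # occupies addresses 2..3
c = heap.alloc(2)    # occupies addresses 4..5
d = heap.alloc(2)    # occupies addresses 6..7
heap.dealloc(a, 2)
heap.dealloc(c, 2)
total_free = sum(length for _, length in heap.free)   # 4 units are free in total
big = heap.alloc(3)                                   # but no one block holds 3
```

Here half of the heap is free, yet a request for 3 units fails because the free space is split into two separate 2-unit blocks.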

Reference counting has the primary advantages of being simple to implement and of requiring no work at intervals, thus avoiding the need to pause the running program to perform garbage collection (which can lead to seemingly inexplicable pauses during ordinary operation).

However, reference counting has a number of severe limitations: its inability to detect cyclical pointers (e.g. A references B and B references A, in which case neither A nor B will ever be collected), the cost it adds to pointer operations and to space usage, and the computational cost it adds to allocation as a result of fragmentation. Some of these problems have solutions in the form of augmented reference counting, but some are best solved by using a different form of garbage collection.
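The cycle problem can be traced by hand with two counters. The sketch below uses a plain dictionary as a stand-in for the per-object counters; the single program variable that initially refers to A is an assumption added to make the example complete.

```python
# Reference counts for two objects, A and B (names taken from the text).
rc = {"A": 0, "B": 0}

rc["A"] += 1    # a program variable refers to A (assumed for illustration)
rc["B"] += 1    # A holds a reference to B
rc["A"] += 1    # B holds a reference back to A

rc["A"] -= 1    # the program variable goes away; A and B are now unreachable

# Neither counter has reached 0, so pure reference counting frees nothing,
# even though no live part of the program can reach A or B any more.
collectable = [name for name, count in rc.items() if count == 0]
```

Both counters are stuck at 1, each kept alive only by the other object.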

Batching/Tracing
In batching, the heap is viewed as a directed graph, with allocated blocks as nodes and pointers as edges between them. To do garbage collection, we traverse the graph, starting from the references the program can use directly (such as global variables and the stack), and mark each node we reach. If a node is not marked, it cannot be reached and can be safely deallocated.
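The traversal just described is commonly split into a mark phase and a sweep phase. The following is a minimal sketch, assuming the heap is given as a dictionary from node name to the list of nodes it points to, and assuming a hypothetical root set containing a single node; real collectors work on raw memory rather than on named nodes.

```python
def mark(heap, roots):
    """Return the set of nodes reachable from the roots by following edges."""
    marked = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node not in marked:
            marked.add(node)
            stack.extend(heap[node])    # follow every pointer out of this node
    return marked


def sweep(heap, marked):
    """Delete every unmarked node from the heap and return the deleted names."""
    garbage = sorted(node for node in heap if node not in marked)
    for node in garbage:
        del heap[node]
    return garbage


# Heap as a directed graph: node -> nodes it points to.
heap = {
    "a": ["b"],
    "b": [],
    "c": ["d"],    # c and d point at each other, but nothing reaches them
    "d": ["c"],
}
roots = ["a"]      # assumed root set, e.g. a single global variable

freed = sweep(heap, mark(heap, roots))
```

Because reachability is computed from the roots, the mutually referencing nodes `c` and `d` are collected even though each still has an incoming pointer.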

Unlike reference counting, which happens as part of ordinary allocation and deallocation, this method must be run at specific times. Collection can be performed when storage is exhausted, before a segment of code that needs to run quickly is executed, or simply during idle time when there is nothing else for the computer to do. The best choice depends somewhat on the type of program running; for example, in Microsoft Word, there is enough idle time (while the user sits thinking up funny parenthetical statements) that the third option is best. On a real-time system, however, it might be better to always do garbage collection before certain code runs.
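As a rough sketch of the first policy (collect when storage is exhausted), the allocator below runs a collection only when a request would exceed its capacity. Everything here is hypothetical: the class `ExhaustionTriggeredHeap`, its fixed block-count capacity, and the `live` set, which stands in for the result of a real reachability trace.

```python
class ExhaustionTriggeredHeap:
    """Toy heap that collects only when it runs out of room (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity    # maximum number of blocks held at once
        self.blocks = set()         # every block currently allocated
        self.live = set()           # blocks the program can still reach
        self.collections = 0        # how many times collect() has run

    def collect(self):
        # Keep only reachable blocks; a real GC would trace pointers here.
        self.blocks &= self.live
        self.collections += 1

    def alloc(self, name):
        if len(self.blocks) >= self.capacity:
            self.collect()          # storage exhausted: collect, then retry
        if len(self.blocks) >= self.capacity:
            raise MemoryError("heap is full of live blocks")
        self.blocks.add(name)
        self.live.add(name)

    def unreference(self, name):
        # The program drops its last reference; the block becomes garbage.
        self.live.discard(name)


heap = ExhaustionTriggeredHeap(capacity=2)
heap.alloc("x")
heap.alloc("y")
heap.unreference("x")    # x is garbage, but nothing is collected yet
heap.alloc("z")          # heap is full, so this triggers one collection
```

The other two policies would call `collect()` from elsewhere, for example from an idle-time hook or just before a time-critical routine.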

Batching relies on accurate identification of memory pointers. There are a number of issues here; something that appears to be an integer may be a pointer, while something that appears to be a pointer may not be. A language could be designed specifically to avoid this confusion, but many popular languages, such as C and C++, are not designed in such a way. The compiler could record at compile time which values are used as pointers, but this adds overhead. Some of the same methodology could be applied at runtime, but with additional runtime costs. Regardless of how this is done, however, it is important to be conservative with pointer identification. When in doubt, it is better to assume that something is a pointer than that it is not. After all, it is far better to have extra uncollected garbage than to eliminate blocks that will later be used.
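A conservative scan can be sketched as follows. The heap address range, the block addresses, and the stack contents are all made-up values for illustration; the point is only that any word whose value matches an allocated block is treated as a pointer, whether or not it really is one.

```python
HEAP_START, HEAP_END = 0x1000, 0x2000    # assumed heap address range

# Start addresses of currently allocated blocks (hypothetical values).
allocated = {0x1000, 0x1040, 0x1080}


def possible_pointers(words):
    """Treat any word that matches an allocated block address as a pointer."""
    return {w for w in words if HEAP_START <= w < HEAP_END and w in allocated}


# Words found on the stack. The value 4160 is meant as a plain integer here,
# but it equals 0x1040, so the collector must assume it may be a pointer.
stack_words = [42, 0x1000, 4160, 0x3000]

retained = possible_pointers(stack_words)
garbage = allocated - retained
```

The block at `0x1040` survives only because an unrelated integer happens to look like its address. That is wasted memory, but it is the safe direction in which to be wrong.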