GNU C Compiler Internals/GNU C Compiler Architecture

An Overview of GCC Architecture
This section is based on a Red Hat magazine article.

GNU Compiler Collection (GCC) comprises a number of compilers for different programming languages. The main GCC executable gcc processes source files written in C, C++, Objective-C, Objective-C++, Java, Fortran, or Ada and produces an assembly file for each source file. It is a driver program that invokes the appropriate compilation programs depending on the language of the source file. For a C source file they are the preprocessor and compiler cc1, the assembler as, and the linker collect2. The first and the third programs come with a GCC distribution, the assembler is a part of the GNU binutils package. This book describes the internals of the preprocessor and compiler cc1.

Each compiler includes the following three components: a front end, a middle end, and a back end. GCC compiles one file at a time. A source file goes through all three components one after another. Figure 1 illustrates the components and the source file representations associated with each component.



The purpose of the front end is to read the source file, parse it, and convert it into the standard abstract syntax tree (AST) representation. There is one front end for each programming language. Because of the differences in languages, the format of the generated ASTs is slightly different for each language. The next step after AST generation is the unification step in which the AST tree is converted into a unified form called generic. After this, the middle end part of the compiler takes control. First, the tree is converted into another representation called GIMPLE. In this form, each expression contains no more than three operands, all control flow constructs are represented as combinations of conditional statements and goto operators, arguments of a function call can only be variables, etc. Figure 2 illustrates the differences between a tree in generic form and a tree in GIMPLE form. GIMPLE is a convenient representation for optimizing the source code.

After GIMPLE, the source code is converted into the static single assignment (SSA) representation. The central idea of this form is the fact that each variable is assigned to only once, but can be used at the right hand side of an expression many times. Every time the same variable of a tree in the GIMPLE form is reassigned, the compiler creates a new version of that variable and stores the new value into it. When the same variable is assigned to in both branches of a conditional expression, one needs to merge the two possible values of the variable into a single variable. This operation is denoted as PHI function in the SSA form.

The SSA form is also used for optimizations. GCC performs more than 20 different optimizations on SSA trees. After the SSA optimization pass, the tree is converted back to the GIMPLE form which is then used to generate a register-transfer language (RTL) form of a tree. RTL is a hardware-based representation that corresponds to an abstract target architecture with an infinite number of registers. An RTL optimization pass optimizes the tree in the RTL form. Finally, a GCC back-end generates the assembly code for the target architecture using the RTL representation. Examples of back-ends are x86 back end, mips back end, etc.

In the next sections we describe the internals of the C front end and the x86 back end. The compiler starts with its initialization and command line options processing. After that the C front end preprocesses the source file, parses it and performs a number of optimizations. The back end then generates the assembly code for the target platform and saves it to a file.

GCC Initialization
The C front end includes the C/C++ preprocessor and the C compiler. Program cc1 includes both the preprocessor and C compiler. It compiles a C source file and generates an assembly (.S) file.

The compiler frontend and backend interact with each other using callback functions called language hooks. All hooks are included into a global variable struct lang_hooks lang_hooks that is defined in file langhooks.h. There are the following types of hooks: hooks for tree inlining, hooks for call graph, hooks for functions, hooks for tree dump, hooks for types, hooks for declarations, and language-specific hooks. The default values for the hooks are defined in file langhooks-def.h.

GCC initialization consists of command line option parsing, initializing the back end, creating the global scope, and initializing the built-in data types and functions.

Each declaration is associated with a scope. For example, a local variable is associated with its function's scope. A global declaration is associated with the global scope.

Function is the function that processes the command-line options, initializes the compiler, compiles a file, and frees up the allocated resources. Function processes the command-line options and sets the corresponding variables within the compiler.

Following the command line option parsing function is called. It performs the back-end initialization by calling function. After that, function performs language-dependent initialization which includes the initialization of a front-end and a back-end. The C initialization function creates built-in data types, initializes the global scope and performs other initialization tasks. Function creates pre-defined types described in file builtin-types.def.

The standard C types are created at the initialization time. The following table presents several types:

GCC built-in functions are the functions that are evaluated at compile time. For example, if the size argument of a strcpy function is a constant then GCC replaces the function call with the required number of assignments. The compiler replaces standard library calls with built-in functions and then evaluates them once the function's AST is constructed. In case of strcpy, the compiler checks the size argument and uses the optimized built-in version of strcpy if the argument is constant. Builtin allows to find out if the value of its argument is known at compile time. GCC builtins are used beyond GCC. For example, the string processing library of Linux kernel uses builtin_constant_p to invoke the optimized version of a string processing function if the string size is known at compile time.

GCC evaluates each builtin function using a corresponding expand_builtin function. For example, builtin_strcmp is evaluated using. The following table gives examples of GCC builtins:

C Preprocessor
Following the initialization, function calls function. This function invokes parse_file front-end language hook which is set to function for C language. The latter function invokes function which initializes the preprocessor and handles -D, -U, and -A command line options (which are equivalent to #define, #undef, and #assert respectively). The C preprocessor handles the preprocessor directives such as #define, #include in the source code.

Following the preprocessor initialization, function is invoked. This function uses the standard lex/bison tools to parse the file. The preprocessor is implemented as a part of the lexer. The C language lexer function calls the libcpp function  which handles the preprocessor keywords. The state of the preprocessor is defined by variable. Type contains most importantly the list of text buffers being processed. Each buffer corresponds to a source file (.c or .h). Function calls appropriate functions for legitimate preprocessor keywords. For example, when #include is encountered, function is invoked. It allocates a new buffer and places it on top of the buffer stack making it the current buffer. When a new file is compiled the buffer is popped off the stack and the compilation of the old file continues.

Function is called whenever a new macro is defined using #define keyword.

Source Code Parsing
After running the preprocessor GCC constructs an abstract syntax tree (AST) for each function of the source file. An AST is a number of connected nodes of type. Each node has a tree code that defines the type of the tree. Macro is used to refer to the code. Tree codes are defined in files and. Trees with different tree codes are grouped into tree classes. The following tree classes are defined among others in GCC:

There are two types of trees: statements and expressions. Statements correspond to the C constructs, for example an expression followed by a ';', a conditional statement, a loop statement, etc. Expressions are what the statements are build up from. Examples of expressions are an assignment expression, an arithmetic expression, a function call, etc. Examples of tree codes are given in the following table:

In addition to the tree code that defines the type of the tree, a number of operands different for each tree type are available. For example, an assignment expression has two operands which correspond to the left and right sides of the expression. Macro is used to refer to the operands. Macro is used to find the name of an identifier that an  tree represents. The following table presents several tree nodes, their purpose, and their operands.

Each tree has a type that corresponds to the type of C expression that it represents For example, the type of the node is the type of the left-hand side operand. and trees are used for type casting.

Tree NULL_TREE is equivalent to NULL. Function prints out the tree to stderr.

The parser is implemented as a bison grammar. The grammar invokes the GCC functions that construct the AST. When a new identifier is lexed, it is added to the pool of strings that GCC maintains. The tree code of an identifier is. When the same identifier is lexed again, the same tree node is returned. Function returns the tree node for an identifier.

A new variable declaration is processed in a number of function calls. First, function is called with the name of the declaration, its type as returned by the lexer, and its attributes. The function calls that checks the type and argument nodes and returns a tree with the code appropriate for the declaration:  for variables,  for types, etc. The declarartion is then added to the scope. A scope contains all declarations created within a function, but does not contain global declarations. There is also the global scope that contains global declarations. When a function is parsed, its declarations are attached to its body as a node. When a declaration is created, the identifier node is associated with the declaration node using. Function returns the declaration for a given identifier. When a declaration leaves the scope the tree attributes is asserted.

A symbol table is not maintained in GCC. Instead, the compiler uses the pool of identifiers and attribute. Language hook lang_hooks.decls.getdecls returns the variables in the scope chained together.

Functions and  are called for an initialized declaration. Function finalizes the declaration. For an array declaration it computes the size of an initialized array. Then function is called. It computes the size and alignment of the declaration.

Parsing a function depends on the presence of its body. A function declaration is parsed using the same functions as those used for variable declarations. For a function definition, function is called. Then the compiler parses the body of the function. When the function ends function is called.

Functions and  take declaration's attributes parameter as one of their arguments. The attributes are described in the GCC Manual. Attributes are extensions of GNU implementation of C. The following table presents a few of them and explains their purpose:

For each type of C statements there is a function that constructs a tree node of the corresponding type. For example, function builds a CALL_EXPR node for a function call. It takes the identifier of the function name and the arguments as its parameters. The function finds the function declaration using and type casts the arguments using.

The ASTs are then converted into the SSA and eventually to the RTL representations. Whether the conversion occurs after parsing each function or when the whole file is parsed is controlled by the compiler option -funit-at-a-time. It is false by default.

Assembly Code Generation
The backend generates the assembly code for the specified target platform. Function is called for each instruction written to the assembly file. Function generates the prolog of the function before it is saved to the assembly file.


 * 1) https://web.archive.org/web/20160410185222/https://www.redhat.com/magazine/002dec04/features/gcc/