Polymorphic Data Structures in C/Introduction to C Constructs

This book assumes a working understanding of the constructs that C provides for building data structures. In this chapter, we examine those constructs, their uses, and the pitfalls a new C programmer can encounter.

Structures
A structure is a simple C construct that groups several values, possibly of different types, under a single name. In other programming languages (such as Pascal), these are called records or clusters. Any type can be stored in a structure, including another structure. A structure definition has the form struct tag { member declarations };, where struct is the keyword that introduces the structure, tag is the name of the structure (this is optional, but very helpful), and the member declarations are normal variable declarations that name the "fields" inside the struct. Note, however, that defining a structure does not declare any variables. Defining a structure only tells the compiler to record the information about the new type for use when variables of that type are actually declared. To declare a variable of a structure type, the standard C declaration syntax is used: struct tag variablename;, where variablename is the name of the new variable.

Structures are useful when a developer needs to keep track of several related pieces of data. For example, say a developer needs to count how many times a particular word occurs in a given string. They would define a structure, tagged word_count, that holds the search word along with its count, and then declare a variable of that type, here called search. As with all types in C, it is possible to define the structure and declare the variable in the same statement by placing the variable name between the closing brace and the semicolon. If the structure is only ever going to be used for that one declaration, the tag (word_count) may be omitted. This usage is most helpful when defining nested structures, where an untagged structure serves as a member of another structure.

Operations on Structures
The primary operation performed on structures is member reference. This is achieved using the member reference operator, "." (the period), the same notation used in object-oriented languages such as C++. To get the frequency of the search term declared above, one would write search.frequency. The first character of the word would be search.word[0]. Similarly, given an employee structure d that contains a nested address structure, the developer would write d.address.zipcode to access the ZIP code of the employee.

Reading data into a structure must be performed at the member level: to input data into a structure, a developer must fully reference each member being filled. For example, reading the name, address ZIP code, and salary of the structure d above from a single line of input takes a single input call (such as scanf) that names d.name, d.address.zipcode, and d.salary individually. However, it is possible to copy an entire structure with a single assignment, without copying its members individually.

Unions
Unions are very similar to structures in that they too can store multiple types of data. However, a union can only hold data for a single member at any given time. For example, a union with two members, an array of four characters and an integer, defines a type that can hold either the array or the integer, but not both simultaneously. A union is allocated at least as much memory as its largest member needs, and all of its members are overlaid at the same starting address. Assuming sizeof ( int ) is 2 (sizeof ( char ) is always 1), the union above occupies four bytes: the character array uses all four, and the integer shares the first two. Unions are most practical when members use roughly the same amount of space, or when larger members are the ones used most often. Otherwise, unions are wasteful in terms of memory management and should be avoided in favor of more sophisticated methods.

Unions are not type-aware; that is, a union does not record which of its members is currently in use. Therefore, it is always helpful to keep track of the type in use through an external variable. A simple way of doing this is to define a structure containing two fields: the union, and a type field that specifies which member of the union is in use.

Enumerated Types
Enumerated types are types whose values (also called tokens) are listed by name in a specified order. An enumerated type is defined using the syntax enum tag { tokens };, where enum is the keyword that introduces the enumerated type, tag is the name of the type (as with structures, the tag is optional, but without it the type cannot be referred to by name later), and tokens are the possible values, separated by commas. For example, to make a type field for the union used above, one could define an enumerated type with one token for each member of the union. A structure, tagged symbol, could then be built around the union, with a type field holding the enumerated value and a symbol field holding the union. Given the declaration struct symbol sym ; the programmer could then assign a token to sym.type and a value to the matching member of sym.symbol. Remember that the type and symbol fields are syntactically independent (that is, the compiler does not tie the value of one to the contents of the other), so it is up to the programmer to ensure their semantic relationship.

Enumerated types can be used for any purpose that needs a fixed set of values. Their main limitation is that C's formatted input and output functions cannot read or print a token by its name (that is, in string format). Instead, enumerated types are defined in C to hold integer values. The first token in the list is 0, and each token following it is numbered one higher than the one before it. It is also possible to give an explicit value to a particular token by following its name with an assignment operator (=) and the value. Assigning an explicit value to a token causes the numbering to be "re-set" at that number; in a list such as zero, one, two, three, ten = 10, eleven, the tokens three and eleven hold the expected values (3 and 11).

Enumerated types can also be handled using preprocessor directives (discussed next in this chapter), but this method is far less flexible and the error messages caused by faulty directives tend to be more complex to debug.

Preprocessor Directives
Preprocessor directives are instructions that are carried out before the code itself is compiled. Among other things, they allow the developer to perform a "find-and-replace" on the code at the time of compilation. They are usually placed above the function main or in a global header file (in a multi-file project). A directive begins with a # character and the directive name, followed by any parameters and an actual_expression. The parameters may be omitted if nothing needs to be passed to the directive, and the actual_expression only needs to be included for certain directives.

One of the most commonly used preprocessor directives is the <tt>#include</tt> statement. It takes a header file (stored either in the default libraries directory or in the local directory) and inserts it into the program at that location. An <tt>#include</tt> line that names <tt>stdio.h</tt> in angle brackets causes the file <tt>stdio.h</tt> to be fetched from the C library directory (<tt>/usr/include/</tt> on Unix-based systems) and inserted where the directive appears, usually at the top of the program. Note the lack of a semicolon at the end of a directive.

A user-created directive that will be used later on in the book uses <tt>#define</tt> so that the statement <tt>DATA( L )</tt> can be written in place of the normal statement <tt>( ( L ) -> datapointer )</tt>. This system can make many complex statements much easier to write, since they only need to be written out once. The preprocessor (of the C compiler) performs a "blind search" of all of the programmer's code and replaces code according to the <tt>#define</tt> rules. Because the compiler only sees the code after replacement, error messages that result from preprocessor directives may be somewhat cryptic.

Type Definition
C provides a tool for naming new types: the <tt>typedef</tt> statement. Unlike preprocessor directives, type definitions are processed by the C compiler itself, which recognizes the names used in a type definition as types and can therefore perform more complex substitutions than plain text replacement. Type definition is generally used to create type aliases for complex constructs such as structures and unions, as well as for arrays of predetermined size. A less common use is to give a new name to an existing simple type (though this is generally considered bad form).

For example, consider a structure tagged <tt>student</tt>. Normally, declaring a variable of that structure type would require the statement <tt>struct student student_data</tt>. Instead, using type definition, we can create a type called <tt>student</tt> to make variable declarations easier. Now, instead of typing <tt>struct student student_data</tt>, the programmer would only need to type <tt>student student_data</tt>.

Type definition can help developers overcome some of C's inherent weaknesses, such as the lack of a <tt>string</tt> type. A type definition can define the <tt>string</tt> type to be an array that holds 127 characters plus one null byte (128 elements in total). Since strings in C are handled as arrays of characters terminated with a null byte, this implementation of strings is acceptable.