C Programming/String manipulation

A string in C is merely an array of characters. The length of a string is determined by a terminating null character: <tt>'\0'</tt>. So, a string with the contents, say, <tt>"abc"</tt> has four characters: <tt>'a'</tt>, <tt>'b'</tt>, <tt>'c'</tt>, and the terminating null (<tt>'\0'</tt>) character.

The terminating null character has the value zero.

Syntax
In C, string constants (literals) are surrounded by double quotes ("), e.g. "Hello world!", and are compiled to an array of the specified char values with an additional null terminating character (value 0) to mark the end of the string. The type of a string constant is char [], an array of char whose size includes the terminating null character.
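As a small illustration of the extra terminating character, the sketch below measures an array initialized from the literal used above. The names <tt>greeting</tt> and <tt>greeting_size</tt> are hypothetical and chosen only for this example.

```c
#include <stddef.h>

/* Array initialized from a literal: 12 visible characters plus the
   terminating null character that the compiler appends. */
static const char greeting[] = "Hello world!";

/* Returns the number of bytes the array occupies, terminator included. */
size_t greeting_size(void)
{
    return sizeof greeting;
}
```

Under these assumptions <tt>greeting_size()</tt> yields 13, one more than the number of visible characters.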

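Backslash escapes
String literals may not directly contain embedded newlines or other control characters in the source code, nor certain characters that have a special meaning inside a string, such as the double quote and the backslash itself.

To include such characters in a string, backslash escapes may be used, for example <tt>\n</tt> for a newline, <tt>\t</tt> for a tab, <tt>\"</tt> for a double quote, and <tt>\\</tt> for a backslash.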
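A short sketch of escapes in use follows. The array name <tt>escaped</tt> and the helper <tt>escaped_length</tt> are hypothetical; the point is only that each escape sequence becomes a single character in the compiled string.

```c
#include <stddef.h>
#include <string.h>

/* Six characters in total: 'a', tab, 'b', newline, double quote, backslash. */
static const char escaped[] = "a\tb\n\"\\";

/* Returns the number of characters before the terminating null. */
size_t escaped_length(void)
{
    return strlen(escaped);
}
```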

Wide character strings
C supports wide character strings, defined as arrays of the type wchar_t, whose values are at least 16 bits wide on most implementations. They are written with an L before the string, like this:
 * wchar_t *p = L"Hello world!";

This feature allows strings that need more than 256 different possible characters (although variable-length char strings can also be used for this). They end with a zero-valued <tt>wchar_t</tt>. These strings are not supported by the <tt>&lt;string.h&gt;</tt> functions. Instead they have their own functions, declared in <tt>&lt;wchar.h&gt;</tt>.
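As a minimal sketch, the standard <tt>wcslen</tt> function from <tt>&lt;wchar.h&gt;</tt> plays the role that <tt>strlen</tt> plays for char strings. The names <tt>wide_greeting</tt> and <tt>wide_greeting_length</tt> are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* A wide string literal: 12 wide characters plus a zero-valued wchar_t. */
static const wchar_t wide_greeting[] = L"Hello world!";

/* Counts wide characters, not bytes, up to the zero-valued terminator. */
size_t wide_greeting_length(void)
{
    return wcslen(wide_greeting);
}
```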

Character encodings
Which character encoding <tt>char</tt> and <tt>wchar_t</tt> represent is not specified by the C standard, except that the values 0x00 and 0x0000 mark the end of the string and are not characters. It is the input and output code that is directly affected by the character encoding; other code should be largely unaffected. The editor must also be able to handle the encoding if such strings are to be written in the source code.

There are three major types of encodings:
 * One byte per character. Normally based on ASCII. There is a limit of 255 different characters plus the zero termination character.
 * Variable-length <tt>char</tt> strings, which allow many more than 255 different characters. Such strings are written as normal <tt>char</tt>-based arrays. These encodings are normally ASCII-based; examples are UTF-8 and Shift JIS.
 * Wide character strings. They are arrays of <tt>wchar_t</tt> values. UTF-16 is the most common such encoding, and it is also variable-length, meaning that a character can occupy two <tt>wchar_t</tt> values.

The <tt>&lt;string.h&gt;</tt> Standard Header
Because programmers find raw strings cumbersome to deal with, they wrote the code in the <tt>&lt;string.h&gt;</tt> library. It represents not a concerted design effort but rather the accretion of contributions made by various authors over a span of years.

First, three types of functions exist in the string library:


 * the <tt>mem</tt> functions manipulate sequences of arbitrary characters without regard to the null character;
 * the <tt>str</tt> functions manipulate null-terminated sequences of characters;
 * the <tt>strn</tt> functions manipulate sequences of non-null characters.

The more commonly-used string functions
The nine most commonly used functions in the string library are:


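 * <tt>strcat</tt> - concatenate two strings
 * <tt>strchr</tt> - string scanning operation
 * <tt>strcmp</tt> - compare two strings
 * <tt>strcpy</tt> - copy a string
 * <tt>strlen</tt> - get string length
 * <tt>strncat</tt> - concatenate one string with part of another
 * <tt>strncmp</tt> - compare parts of two strings
 * <tt>strncpy</tt> - copy part of a string
 * <tt>strrchr</tt> - string scanning operation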

Other functions, such as <tt>strlwr</tt> (convert to lower case), <tt>strrev</tt> (return the string reversed), and <tt>strupr</tt> (convert to upper case) may be popular; however, they are specified neither by the C Standard nor by the Single Unix Specification. It is also unspecified whether these functions return copies of the original strings or convert the strings in place.

The strcat function
Some people recommend using <tt>strncat()</tt> or <tt>strlcat()</tt> instead of strcat, in order to avoid buffer overflow.

The <tt>strcat()</tt> function shall append a copy of the string pointed to by <tt>s2</tt> (including the terminating null byte) to the end of the string pointed to by <tt>s1</tt>. The initial byte of <tt>s2</tt> overwrites the null byte at the end of <tt>s1</tt>. If copying takes place between objects that overlap, the behavior is undefined. The function returns <tt>s1</tt>.

This function is used to attach one string to the end of another string. It is imperative that the first string have the space needed to store both strings.

Example:
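A minimal usage sketch, assuming the caller supplies a buffer of at least 13 bytes; the helper name <tt>build_greeting</tt> is hypothetical.

```c
#include <string.h>

/* Builds "Hello, world" in dest.
   Assumption: dest has room for at least 13 bytes (12 characters + '\0'). */
char *build_greeting(char *dest)
{
    dest[0] = '\0';             /* start from an empty string */
    strcat(dest, "Hello, ");    /* dest is now "Hello, " */
    strcat(dest, "world");      /* dest is now "Hello, world" */
    return dest;
}
```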

Before calling <tt>strcat()</tt>, the destination must already contain a null-terminated string, or its first character must have been initialized with the null character (e.g. <tt>s1[0] = '\0';</tt>).

The following is a public-domain implementation of <tt>strcat</tt>:
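One way such an implementation might look is sketched here. The name <tt>my_strcat</tt> is hypothetical, chosen to avoid clashing with the library function; the parameter names follow the description above.

```c
/* Sketch of strcat: find the end of s1, then copy s2 there,
   including its terminating null byte. */
char *my_strcat(char *s1, const char *s2)
{
    char *end = s1;

    /* Advance to the terminating null byte of s1. */
    while (*end != '\0')
        end++;

    /* Copy s2 byte by byte; the loop stops after copying '\0'. */
    while ((*end++ = *s2++) != '\0')
        ;

    return s1;
}
```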

The strchr function
The <tt>strchr()</tt> function shall locate the first occurrence of <tt>c</tt> (converted to a <tt>char</tt>) in the string pointed to by <tt>s</tt>. The terminating null byte is considered to be part of the string. The function returns the location of the found character, or a null pointer if the character was not found.

This function is used to find certain characters in strings.

At one point in history, this function was named <tt>index</tt>. The <tt>strchr</tt> name, however cryptic, fits the general pattern for naming.

The following is a public-domain implementation of <tt>strchr</tt>:
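A possible implementation is sketched below under the hypothetical name <tt>my_strchr</tt>. Because the terminating null byte counts as part of the string, the scan checks for a match before checking for the end.

```c
#include <stddef.h>

/* Sketch of strchr: returns a pointer to the first occurrence of c in s,
   or NULL if c does not occur. Searching for '\0' finds the terminator. */
char *my_strchr(const char *s, int c)
{
    char ch = (char)c;

    for (;; s++) {
        if (*s == ch)
            return (char *)s;   /* match, possibly the terminator itself */
        if (*s == '\0')
            return NULL;        /* reached the end without a match */
    }
}
```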

The strcmp function
A rudimentary form of string comparison is done with the strcmp function. It takes two strings as arguments and returns a value less than zero if the first is lexicographically less than the second, a value greater than zero if the first is lexicographically greater than the second, or zero if the two strings are equal. The comparison is done by comparing the coded (e.g. ASCII) values of the characters, character by character.

This simple type of string comparison is nowadays generally considered unacceptable when sorting lists of strings. More advanced algorithms exist that are capable of producing lists in dictionary sorted order. They can also fix problems such as strcmp considering the string "Alpha2" greater than "Alpha12". (In the previous example, "Alpha2" compares greater than "Alpha12" because '2' comes after '1' in the character set.) What we're saying is, don't use <tt>strcmp</tt> alone for general string sorting in any commercial or professional code.

The <tt>strcmp()</tt> function shall compare the string pointed to by <tt>s1</tt> to the string pointed to by <tt>s2</tt>. The sign of a non-zero return value shall be determined by the sign of the difference between the values of the first pair of bytes (both interpreted as type <tt>unsigned char</tt>) that differ in the strings being compared. Upon completion, <tt>strcmp()</tt> shall return an integer greater than, equal to, or less than 0, if the string pointed to by <tt>s1</tt> is greater than, equal to, or less than the string pointed to by <tt>s2</tt>, respectively.

Since comparing pointers by themselves is not practically useful unless one is comparing pointers within the same array, this function lexically compares the strings that two pointers point to.

This function is useful in comparisons, e.g.

if (strcmp(s, "whatever") == 0) /* do something */ ;

The collating sequence used by <tt>strcmp()</tt> is equivalent to the machine's native character set. The only guarantee about the order is that the digits from <tt>'0'</tt> to <tt>'9'</tt> are in consecutive order.

The following is a public-domain implementation of <tt>strcmp</tt>:
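A sketch of how this might be written follows, under the hypothetical name <tt>my_strcmp</tt>. It compares bytes as <tt>unsigned char</tt>, as the description requires, and returns -1, 0, or 1; the standard only fixes the sign of the result, so that exact choice is an assumption of this sketch.

```c
/* Sketch of strcmp: walk both strings until they differ or s1 ends. */
int my_strcmp(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }

    /* Yields -1, 0, or 1 without risking overflow in a subtraction. */
    return (*a > *b) - (*a < *b);
}
```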

The strcpy function
Some people recommend always using <tt>strncpy()</tt> instead of strcpy, to avoid buffer overflow.

The <tt>strcpy()</tt> function shall copy the C string pointed to by <tt>s2</tt> (including the terminating null byte) into the array pointed to by <tt>s1</tt>. If copying takes place between objects that overlap, the behavior is undefined. The function returns <tt>s1</tt>. There is no value used to indicate an error: if the arguments to <tt>strcpy()</tt> are correct, and the destination buffer is large enough, the function will never fail.

Example:
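A minimal usage sketch: the hypothetical helper <tt>copy_if_fits</tt> checks that the source, including its terminating null byte, fits in the destination before calling <tt>strcpy()</tt>. Returning 1 or 0 is a design choice made only for this example.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dest only if it fits, terminator included.
   Returns 1 on success and 0 if dest is too small (dest is untouched). */
int copy_if_fits(char *dest, size_t dest_size, const char *src)
{
    if (strlen(src) + 1 > dest_size)
        return 0;
    strcpy(dest, src);
    return 1;
}
```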

Important: You must ensure that the destination buffer is able to contain all the characters in the source array, including the terminating null byte. Otherwise, <tt>strcpy()</tt> will overwrite memory past the end of the buffer, causing a buffer overflow, which can cause the program to crash, or can be exploited by attackers to compromise the security of the computer.

The following is a public-domain implementation of <tt>strcpy</tt>:
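A possible implementation, sketched under the hypothetical name <tt>my_strcpy</tt>:

```c
/* Sketch of strcpy: copy bytes from s2 to s1 up to and including
   the terminating null byte. The caller must provide enough room. */
char *my_strcpy(char *s1, const char *s2)
{
    char *dst = s1;

    while ((*dst++ = *s2++) != '\0')
        ;

    return s1;
}
```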

The strlen function
The <tt>strlen()</tt> function shall compute the number of bytes in the string to which <tt>s</tt> points, not including the terminating null byte. It returns the number of bytes in the string. No value is used to indicate an error.

The following is a public-domain implementation of <tt>strlen</tt>:
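One way to write it is sketched here under the hypothetical name <tt>my_strlen</tt>; it walks a second pointer to the terminator and returns the distance travelled.

```c
#include <stddef.h>

/* Sketch of strlen: count the bytes before the terminating null byte. */
size_t my_strlen(const char *s)
{
    const char *p = s;

    /* Advance p until it points at the terminator. */
    while (*p != '\0')
        p++;

    return (size_t)(p - s);
}
```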

Note how a line such as <tt>const char *p = s;</tt> declares and initializes a pointer <tt>p</tt> to a constant character, i.e. <tt>p</tt> cannot be used to change the value it points to.

The strncat function
The <tt>strncat()</tt> function shall append not more than <tt>n</tt> bytes (a null byte and bytes that follow it are not appended) from the array pointed to by <tt>s2</tt> to the end of the string pointed to by <tt>s1</tt>. The initial byte of <tt>s2</tt> overwrites the null byte at the end of <tt>s1</tt>. A terminating null byte is always appended to the result. If copying takes place between objects that overlap, the behavior is undefined. The function returns <tt>s1</tt>.

The following is a public-domain implementation of <tt>strncat</tt>:
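A sketch under the hypothetical name <tt>my_strncat</tt> follows. Note that the destination needs room for up to <tt>n</tt> appended bytes plus the terminator, which is always written.

```c
#include <stddef.h>

/* Sketch of strncat: append at most n bytes of s2 to s1,
   then always write a terminating null byte. */
char *my_strncat(char *s1, const char *s2, size_t n)
{
    char *end = s1;

    /* Find the end of the existing string. */
    while (*end != '\0')
        end++;

    /* Copy until n bytes are appended or s2 ends. */
    while (n != 0 && *s2 != '\0') {
        *end++ = *s2++;
        n--;
    }

    *end = '\0';
    return s1;
}
```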

The strncmp function
The <tt>strncmp()</tt> function shall compare not more than <tt>n</tt> bytes (bytes that follow a null byte are not compared) from the array pointed to by <tt>s1</tt> to the array pointed to by <tt>s2</tt>. The sign of a non-zero return value is determined by the sign of the difference between the values of the first pair of bytes (both interpreted as type <tt>unsigned char</tt>) that differ in the strings being compared. See <tt>strcmp</tt> for an explanation of the return value.

This function is useful in comparisons, as the <tt>strcmp</tt> function is.

The following is a public-domain implementation of <tt>strncmp</tt>:
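A possible implementation is sketched below as the hypothetical <tt>my_strncmp</tt>. As with the <tt>strcmp</tt> description, only the sign of the result is meaningful; returning exactly -1, 0, or 1 is an assumption of this sketch.

```c
#include <stddef.h>

/* Sketch of strncmp: compare at most n bytes, stopping early at a
   difference or at a null byte. */
int my_strncmp(const char *s1, const char *s2, size_t n)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    for (; n != 0; a++, b++, n--) {
        if (*a != *b)
            return (*a > *b) - (*a < *b);
        if (*a == '\0')
            break;              /* both strings ended together */
    }
    return 0;
}
```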

The strncpy function
The <tt>strncpy()</tt> function shall copy not more than <tt>n</tt> bytes (bytes that follow a null byte are not copied) from the array pointed to by <tt>s2</tt> to the array pointed to by <tt>s1</tt>. If copying takes place between objects that overlap, the behavior is undefined. If the array pointed to by <tt>s2</tt> is a string that is shorter than <tt>n</tt> bytes, null bytes shall be appended to the copy in the array pointed to by <tt>s1</tt>, until <tt>n</tt> bytes in all are written. The function shall return <tt>s1</tt>; no return value is reserved to indicate an error.

It is possible that the function will not produce a null-terminated string, which happens if the string pointed to by <tt>s2</tt> is <tt>n</tt> bytes long or longer.

The following is a public-domain version of <tt>strncpy</tt>:
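A sketch under the hypothetical name <tt>my_strncpy</tt> shows both behaviors described above: padding with null bytes when the source is short, and no terminator when the source fills all <tt>n</tt> bytes.

```c
#include <stddef.h>

/* Sketch of strncpy: write exactly n bytes to s1. */
char *my_strncpy(char *s1, const char *s2, size_t n)
{
    size_t i = 0;

    /* Copy bytes from s2 until n bytes are written or s2 ends. */
    for (; i < n && s2[i] != '\0'; i++)
        s1[i] = s2[i];

    /* Pad the remainder, if any, with null bytes. */
    for (; i < n; i++)
        s1[i] = '\0';

    return s1;
}
```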

The strrchr function
The <tt>strrchr()</tt> function is similar to the <tt>strchr()</tt> function, except that <tt>strrchr()</tt> returns a pointer to the last occurrence of <tt>c</tt> within <tt>s</tt> instead of the first.

The <tt>strrchr()</tt> function shall locate the last occurrence of <tt>c</tt> (converted to a <tt>char</tt>) in the string pointed to by <tt>s</tt>. The terminating null byte is considered to be part of the string. Its return value is similar to <tt>strchr()</tt>'s return value.

At one point in history, this function was named <tt>rindex</tt>. The <tt>strrchr</tt> name, however cryptic, fits the general pattern for naming.

The following is a public-domain implementation of <tt>strrchr</tt>:
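One possible approach, sketched as the hypothetical <tt>my_strrchr</tt>, is a single forward pass that remembers the most recent match.

```c
#include <stddef.h>

/* Sketch of strrchr: returns a pointer to the last occurrence of c in s,
   or NULL if c does not occur. The terminator is part of the string. */
char *my_strrchr(const char *s, int c)
{
    const char *last = NULL;
    char ch = (char)c;

    for (;; s++) {
        if (*s == ch)
            last = s;           /* remember the latest match so far */
        if (*s == '\0')
            break;
    }
    return (char *)last;
}
```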

The less commonly-used string functions
The less-used functions are:


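 * <tt>memchr</tt> - Find a byte in memory
 * <tt>memcmp</tt> - Compare bytes in memory
 * <tt>memcpy</tt> - Copy bytes in memory
 * <tt>memmove</tt> - Copy bytes in memory with overlapping areas
 * <tt>memset</tt> - Set bytes in memory
 * <tt>strcoll</tt> - Compare bytes according to a locale-specific collating sequence
 * <tt>strcspn</tt> - Get the length of a complementary substring
 * <tt>strerror</tt> - Get error message
 * <tt>strpbrk</tt> - Scan a string for a byte
 * <tt>strspn</tt> - Get the length of a substring
 * <tt>strstr</tt> - Find a substring
 * <tt>strtok</tt> - Split a string into tokens
 * <tt>strxfrm</tt> - Transform string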

The memcpy function
The <tt>memcpy()</tt> function shall copy <tt>n</tt> bytes from the object pointed to by <tt>s2</tt> into the object pointed to by <tt>s1</tt>. If copying takes place between objects that overlap, the behavior is undefined. The function returns <tt>s1</tt>.

Because the function does not have to worry about overlap, it can do the simplest copy it can.

The following is a public-domain implementation of <tt>memcpy</tt>:
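The simplest copy can be sketched as a byte-by-byte ascending loop; the name <tt>my_memcpy</tt> is hypothetical. Real library versions are usually more heavily optimized.

```c
#include <stddef.h>

/* Sketch of memcpy: copy n bytes from s2 to s1 in ascending order.
   The objects must not overlap. */
void *my_memcpy(void *s1, const void *s2, size_t n)
{
    unsigned char *dst = s1;
    const unsigned char *src = s2;

    while (n-- != 0)
        *dst++ = *src++;

    return s1;
}
```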

The memmove function
The <tt>memmove()</tt> function shall copy <tt>n</tt> bytes from the object pointed to by <tt>s2</tt> into the object pointed to by <tt>s1</tt>. Copying takes place as if the <tt>n</tt> bytes from the object pointed to by <tt>s2</tt> are first copied into a temporary array of <tt>n</tt> bytes that does not overlap the objects pointed to by <tt>s1</tt> and <tt>s2</tt>, and then the <tt>n</tt> bytes from the temporary array are copied into the object pointed to by <tt>s1</tt>. The function returns the value of <tt>s1</tt>.

The easy way to implement this without using a temporary array is to check for a condition that would prevent an ascending copy, and if found, do a descending copy.

The following is a public-domain, though not completely portable, implementation of <tt>memmove</tt>:
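The ascending-or-descending idea can be sketched as follows, under the hypothetical name <tt>my_memmove</tt>. It is not completely portable because comparing pointers into unrelated objects with <tt>&lt;</tt> is not defined by the C standard, although it works on common flat-memory platforms.

```c
#include <stddef.h>

/* Sketch of memmove: pick the copy direction that never overwrites
   source bytes before they have been read. */
void *my_memmove(void *s1, const void *s2, size_t n)
{
    unsigned char *dst = s1;
    const unsigned char *src = s2;

    if (dst < src) {
        /* Destination starts first: ascending copy is safe. */
        while (n-- != 0)
            *dst++ = *src++;
    } else if (dst > src) {
        /* Destination starts later: copy from the end downwards. */
        dst += n;
        src += n;
        while (n-- != 0)
            *--dst = *--src;
    }
    return s1;
}
```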

The memcmp function
The <tt>memcmp()</tt> function shall compare the first <tt>n</tt> bytes (each interpreted as <tt>unsigned char</tt>) of the object pointed to by <tt>s1</tt> to the first <tt>n</tt> bytes of the object pointed to by <tt>s2</tt>. The sign of a non-zero return value shall be determined by the sign of the difference between the values of the first pair of bytes (both interpreted as type <tt>unsigned char</tt>) that differ in the objects being compared.

The following is a public-domain implementation of <tt>memcmp</tt>:
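A sketch under the hypothetical name <tt>my_memcmp</tt>; unlike the string comparison functions it does not stop at a null byte.

```c
#include <stddef.h>

/* Sketch of memcmp: compare exactly n bytes as unsigned char values. */
int my_memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *a = s1;
    const unsigned char *b = s2;

    for (; n != 0; a++, b++, n--) {
        if (*a != *b)
            return (*a > *b) - (*a < *b);
    }
    return 0;
}
```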

The strcoll and strxfrm functions
The ANSI C Standard specifies two locale-specific comparison functions.

The <tt>strcoll()</tt> function compares the string pointed to by <tt>s1</tt> to the string pointed to by <tt>s2</tt>, both interpreted as appropriate to the <tt>LC_COLLATE</tt> category of the current locale. The return value is similar to that of <tt>strcmp()</tt>.

The <tt>strxfrm()</tt> function transforms the string pointed to by <tt>s2</tt> and places the resulting string into the array pointed to by <tt>s1</tt>. The transformation is such that if the <tt>strcmp()</tt> function is applied to the two transformed strings, it returns a value greater than, equal to, or less than zero, corresponding to the result of the <tt>strcoll()</tt> function applied to the same two original strings. No more than <tt>n</tt> characters are placed into the resulting array pointed to by <tt>s1</tt>, including the terminating null character. If <tt>n</tt> is zero, <tt>s1</tt> is permitted to be a null pointer. If copying takes place between objects that overlap, the behavior is undefined. The function returns the length of the transformed string.

These functions are rarely used and nontrivial to code, so there is no code for this section.

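The memchr function
The <tt>memchr()</tt> function shall locate the first occurrence of <tt>c</tt> (converted to an <tt>unsigned char</tt>) in the initial <tt>n</tt> bytes (each interpreted as <tt>unsigned char</tt>) of the object pointed to by <tt>s</tt>. If <tt>c</tt> is not found, <tt>memchr()</tt> returns a null pointer.

The following is a public-domain implementation of <tt>memchr</tt>: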
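A possible implementation, sketched as the hypothetical <tt>my_memchr</tt>:

```c
#include <stddef.h>

/* Sketch of memchr: search the first n bytes of s for the byte value c.
   Null bytes are treated like any other byte. */
void *my_memchr(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    unsigned char uc = (unsigned char)c;

    for (; n != 0; p++, n--) {
        if (*p == uc)
            return (void *)p;
    }
    return NULL;
}
```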

The strcspn, strpbrk, and strspn functions
The <tt>strcspn()</tt> function computes the length of the maximum initial segment of the string pointed to by <tt>s1</tt> which consists entirely of characters not from the string pointed to by <tt>s2</tt>.

The <tt>strpbrk()</tt> function locates the first occurrence in the string pointed to by <tt>s1</tt> of any character from the string pointed to by <tt>s2</tt>, returning a pointer to that character or a null pointer if not found.

The <tt>strspn()</tt> function computes the length of the maximum initial segment of the string pointed to by <tt>s1</tt> which consists entirely of characters from the string pointed to by <tt>s2</tt>.

All of these functions are similar except in the test and the return value.

The following are public-domain implementations of <tt>strcspn</tt>, <tt>strpbrk</tt>, and <tt>strspn</tt>:
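The three functions can be sketched around the standard <tt>strchr()</tt> as a membership test. The <tt>my_</tt> names are hypothetical. Each loop checks for the end of <tt>s1</tt> first, because <tt>strchr()</tt> would otherwise report the terminator as a member of <tt>s2</tt>.

```c
#include <stddef.h>
#include <string.h>

/* Length of the initial segment of s1 with no characters from s2. */
size_t my_strcspn(const char *s1, const char *s2)
{
    const char *p = s1;
    while (*p != '\0' && strchr(s2, *p) == NULL)
        p++;
    return (size_t)(p - s1);
}

/* Pointer to the first character of s1 that occurs in s2, or NULL. */
char *my_strpbrk(const char *s1, const char *s2)
{
    for (; *s1 != '\0'; s1++) {
        if (strchr(s2, *s1) != NULL)
            return (char *)s1;
    }
    return NULL;
}

/* Length of the initial segment of s1 made only of characters from s2. */
size_t my_strspn(const char *s1, const char *s2)
{
    const char *p = s1;
    while (*p != '\0' && strchr(s2, *p) != NULL)
        p++;
    return (size_t)(p - s1);
}
```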

The strstr function
The <tt>strstr()</tt> function shall locate the first occurrence in the string pointed to by <tt>s1</tt> of the sequence of bytes (excluding the terminating null byte) in the string pointed to by <tt>s2</tt>. The function returns the pointer to the matching string in <tt>s1</tt> or a null pointer if a match is not found. If <tt>s2</tt> is an empty string, the function returns <tt>s1</tt>.

The following is a public-domain implementation of <tt>strstr</tt>:
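A straightforward sketch, under the hypothetical name <tt>my_strstr</tt>, tries a match at every position using the standard <tt>strncmp()</tt>. This is a simple quadratic approach; library versions often use faster search algorithms.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of strstr: returns a pointer to the first occurrence of s2
   within s1, s1 itself if s2 is empty, or NULL if there is no match. */
char *my_strstr(const char *s1, const char *s2)
{
    size_t len = strlen(s2);

    if (len == 0)
        return (char *)s1;

    for (; *s1 != '\0'; s1++) {
        /* Cheap first-byte check before the full comparison. */
        if (*s1 == *s2 && strncmp(s1, s2, len) == 0)
            return (char *)s1;
    }
    return NULL;
}
```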

The strtok function
A sequence of calls to <tt>strtok()</tt> breaks the string pointed to by <tt>s1</tt> into a sequence of tokens, each of which is delimited by a byte from the string pointed to by <tt>s2</tt>. The first call in the sequence has <tt>s1</tt> as its first argument, and is followed by calls with a null pointer as their first argument. The separator string pointed to by <tt>s2</tt> may be different from call to call.

The first call in the sequence searches the string pointed to by <tt>s1</tt> for the first byte that is not contained in the current separator string pointed to by <tt>s2</tt>. If no such byte is found, then there are no tokens in the string pointed to by <tt>s1</tt> and <tt>strtok()</tt> shall return a null pointer. If such a byte is found, it is the start of the first token.

The <tt>strtok()</tt> function then searches from there for a byte (or multiple, consecutive bytes) that is contained in the current separator string. If no such byte is found, the current token extends to the end of the string pointed to by <tt>s1</tt>, and subsequent searches for a token shall return a null pointer. If such a byte is found, it is overwritten by a null byte, which terminates the current token. The <tt>strtok()</tt> function saves a pointer to the following byte, from which the next search for a token shall start.

Each subsequent call, with a null pointer as the value of the first argument, starts searching from the saved pointer and behaves as described above.

The <tt>strtok()</tt> function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe.

Because the <tt>strtok()</tt> function must save state between calls, and you could not have two tokenizers going at the same time, the Single Unix Specification defined a similar function, <tt>strtok_r()</tt>, that does not need to save state internally. Its prototype is this:
char *strtok_r(char *s, const char *sep, char **lasts);

The <tt>strtok_r()</tt> function considers the null-terminated string <tt>s</tt> as a sequence of zero or more text tokens separated by spans of one or more characters from the separator string <tt>sep</tt>. The argument <tt>lasts</tt> points to a user-provided pointer which points to stored information necessary for <tt>strtok_r()</tt> to continue scanning the same string.

In the first call to <tt>strtok_r()</tt>, <tt>s</tt> points to a null-terminated string, <tt>sep</tt> to a null-terminated string of separator characters, and the value pointed to by <tt>lasts</tt> is ignored. The <tt>strtok_r()</tt> function shall return a pointer to the first character of the first token, write a null character into <tt>s</tt> immediately following the returned token, and update the pointer to which <tt>lasts</tt> points.

In subsequent calls, <tt>s</tt> is a null pointer and <tt>lasts</tt> shall be unchanged from the previous call so that subsequent calls shall move through the string <tt>s</tt>, returning successive tokens until no tokens remain. The separator string <tt>sep</tt> may be different from call to call. When no token remains in <tt>s</tt>, a NULL pointer shall be returned.

The following public-domain code for <tt>strtok</tt> and <tt>strtok_r</tt> codes the former as a special case of the latter:
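The arrangement can be sketched as below, with the hypothetical names <tt>my_strtok_r</tt> and <tt>my_strtok</tt>. The sketch leans on the standard <tt>strspn()</tt> and <tt>strcspn()</tt> to skip separators and measure tokens, and the non-reentrant version simply keeps its saved pointer in a static variable. The early return for a null saved pointer is a defensive choice of this sketch, not something the specification requires.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of strtok_r: the caller owns the saved position in *lasts. */
char *my_strtok_r(char *s, const char *sep, char **lasts)
{
    char *token;

    if (s == NULL)
        s = *lasts;             /* continue where the last call stopped */
    if (s == NULL)
        return NULL;            /* defensive: nothing was ever saved */

    s += strspn(s, sep);        /* skip leading separators */
    if (*s == '\0') {
        *lasts = s;
        return NULL;            /* no tokens remain */
    }

    token = s;
    s += strcspn(s, sep);       /* find the end of the token */
    if (*s != '\0')
        *s++ = '\0';            /* terminate the token in place */
    *lasts = s;
    return token;
}

/* Sketch of strtok: the same logic with one shared, static saved pointer,
   which is why it is neither reentrant nor thread-safe. */
char *my_strtok(char *s1, const char *s2)
{
    static char *saved;
    return my_strtok_r(s1, s2, &saved);
}
```

Because the input string is modified in place, it must be a writable array rather than a string literal.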

Miscellaneous functions
These functions do not fit into one of the above categories.

The memset function
The <tt>memset()</tt> function converts <tt>c</tt> into <tt>unsigned char</tt>, then stores the character into the first <tt>n</tt> bytes of memory pointed to by <tt>s</tt>.

The following is a public-domain implementation of <tt>memset</tt>:
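A byte-at-a-time sketch under the hypothetical name <tt>my_memset</tt>:

```c
#include <stddef.h>

/* Sketch of memset: store (unsigned char)c into the first n bytes of s. */
void *my_memset(void *s, int c, size_t n)
{
    unsigned char *p = s;
    unsigned char uc = (unsigned char)c;

    while (n-- != 0)
        *p++ = uc;

    return s;
}
```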

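The strerror function
This function returns a locale-specific error message corresponding to the error number passed as its parameter. Depending on the circumstances, this function could be trivial to implement, but no implementation is given here because it varies between systems.

The Single Unix Specification, Version 3 has a variant, <tt>strerror_r</tt>, with this prototype:
int strerror_r(int errnum, char *strerrbuf, size_t buflen);
This function stores the message in <tt>strerrbuf</tt>, which has a length of <tt>buflen</tt> bytes.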

Examples
To determine the number of characters in a string, the <tt>strlen()</tt> function is used:
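A minimal sketch that combines <tt>strlen()</tt> with <tt>strcpy()</tt> and <tt>strcat()</tt> to build a concatenated string in freshly allocated memory. The function name <tt>join_strings</tt> and its parameter names are hypothetical; the buffer is named <tt>turkey</tt> to match the note that follows.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated string holding first followed by second,
   or NULL if allocation fails. The caller must free() the result. */
char *join_strings(const char *first, const char *second)
{
    size_t length = strlen(first);      /* terminator not counted */
    size_t length2 = strlen(second);

    /* One extra byte for the terminating null character. */
    char *turkey = malloc(length + length2 + 1);

    if (turkey != NULL) {
        strcpy(turkey, first);
        strcat(turkey, second);
    }
    return turkey;
}
```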

Note that the amount of memory allocated for 'turkey' is one plus the sum of the lengths of the strings to be concatenated. This is for the terminating null character, which is not counted in the lengths of the strings.

Exercises

 * 1) The string functions use a lot of looping constructs. Is there some way to portably unravel the loops?
 * 2) What functions are possibly missing from the library as it stands now?