Criticism of the C programming language

From Wikipedia, the free encyclopedia

The programming language C is widely used, minimalistic, and low-level by design. C first appeared on larger computers but, like Pascal, soon found its way to CP/M and MS-DOS based microcomputers, where it rapidly gained acceptance in the industry. Despite (or because of) its popularity, C's characteristics have led to much criticism of the language. This article documents many of these criticisms.

Many beginning programmers have difficulty learning C's syntax and peculiarities (although this is not unique to C), and even many expert programmers find C programs difficult to maintain and debug unless exceptional care was taken in their production. A popular saying, repeated by such notable language designers as Bjarne Stroustrup, is that "C makes it easy to shoot yourself in the foot." [1] In other words, C permits many operations that are sometimes not desirable, and thus many simple programming errors are not detected by the compiler and may not even be readily apparent at runtime. This can lead to programs with unpredictable behavior and security holes if sufficient care and discipline are not used in programming and maintenance. (Although again this is not unique to C, C provides less protection than do many other programming languages.)

The designers wanted to avoid compile- and run-time checks that were too expensive when C was first implemented. With time, external tools were developed to perform some of these checks. Nothing prevents an implementation from providing such checks, but nothing requires it to, either.

Kernighan and Ritchie made reference to the basic design philosophy of C in their response to criticism of C not being a strongly-typed language: "Nevertheless, C retains the basic philosophy that programmers know what they are doing; it only requires that they state their intentions explicitly."[1][2]

Absent features

C has been criticized for lacking features found in some other programming languages.

Although the list of built-in features C lacks is long, this has contributed significantly to its acceptance, as new C compilers can be developed quickly for new platforms. The relatively low-level nature of the language affords the programmer close control over what the program is doing, while allowing solutions that can be specially tailored and aggressively optimized for a particular platform. This allows the code to run efficiently on very limited hardware, such as embedded systems, many of which today are as capable as the general-purpose machines originally used to implement C.

A number of these features are available as extensions in some compilers, or can be supplied by third-party libraries, or can be simulated by adopting certain coding disciplines. For example, in most object-oriented languages, method functions include a special "this" pointer which refers to the current object. By passing this pointer as an explicit function argument, similar functionality can be achieved in C. Whereas in C++ one might write:

stack->push(val);

one could write in C:

push(stack,val);
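The coding discipline described above can be sketched as follows. This is only an illustration, not a canonical implementation: the struct layout, the fixed capacity `STACK_CAPACITY`, and the helper names `stack_init` and `pop` are assumptions introduced here; only `push(stack, val)` comes from the text.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_CAPACITY 16  /* hypothetical fixed capacity for this sketch */

/* A small fixed-size stack of ints. */
struct stack {
    int items[STACK_CAPACITY];
    size_t count;
};

/* The "object" is passed explicitly as the first argument,
   playing the role of the implicit "this" pointer in C++. */
void stack_init(struct stack *self)
{
    assert(self != NULL);
    self->count = 0;
}

/* Returns 1 on success, 0 if the stack is full. */
int push(struct stack *self, int val)
{
    assert(self != NULL);
    if (self->count == STACK_CAPACITY)
        return 0;
    self->items[self->count++] = val;
    return 1;
}

/* Returns 1 and stores the top value in *out, or 0 if the stack is empty. */
int pop(struct stack *self, int *out)
{
    assert(self != NULL && out != NULL);
    if (self->count == 0)
        return 0;
    *out = self->items[--self->count];
    return 1;
}
```

Nothing in the language ties these functions to `struct stack`; the association is maintained purely by naming and argument conventions.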

Undefined behaviour

Many "illegal" operations in C are not required to be diagnosed at compile time, and can produce undefined behavior at run time. This means that the exact behavior which arises is not specified by the standard, and exactly what will happen does not have to be documented by the C implementor. A famous, although misleading, expression in comp.std.c is that the compiler could cause "demons to fly out of your nose". In practice, an instance of undefined behavior sometimes produces a bug which is hard to track down and which may corrupt the contents of memory. Sometimes a particular compiler generates well-behaved actions that are not the same as would be obtained using a different C compiler. The main purpose of leaving some behavior undefined is to allow the compiler to generate more efficient executable code for well-defined behavior, which is important for C's primary role as a systems implementation language; it is the programmer's responsibility to avoid undefined behavior. Examples of undefined behavior are:

  • Accessing outside the bounds of an array.
  • Overflowing a signed integer.
  • Reaching the end of a function without finding a return statement, when the return value is used.
  • Reading the value of a variable before writing to it.

These operations are all programming errors that could occur using many programming languages; C draws criticism because its standard explicitly identifies numerous cases of undefined behavior, including some where the behavior could have been made well defined, and does not specify any run-time error handling mechanism.
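Because the standard provides no run-time error handling for these cases, any check has to be written by the programmer. As one possible illustration, the sketch below avoids signed integer overflow by testing the operands before adding them; the function name `checked_add` and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Adds a and b, storing the sum in *result.
   Returns 1 on success, or 0 if the mathematical sum would not fit in an int.
   The range test is made before the addition, so the overflowing
   addition itself is never executed. */
int checked_add(int a, int b, int *result)
{
    assert(result != NULL);
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b))
        return 0;
    *result = a + b;
    return 1;
}
```

The extra comparisons are exactly the kind of cost that the language leaves to the programmer's discretion rather than imposing on every addition.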

Memory allocation

Automatically and dynamically allocated objects are not necessarily initialized; they initially have "indeterminate" values (typically, whatever bit pattern happens to be present in the storage, which might not even represent a legal value for that type). If the program attempts to use such an uninitialized value, the results are undefined. Many modern compilers try to detect and warn about this problem, but both false positives and false negatives occur.
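A common way to avoid indeterminate values is to initialize every object explicitly. The following sketch shows one approach for an automatic object and one for a dynamically allocated object; the type `struct point` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct point { int x; int y; };

/* Automatic object: the initializer gives every member a defined value
   (members not listed in the braces are set to zero). */
int sum_of_fresh_point(void)
{
    struct point p = {0};
    return p.x + p.y;
}

/* Dynamic object: calloc returns storage with all bits zero, whereas
   malloc leaves the contents indeterminate. For integer types, all bits
   zero represents the value 0. Returns NULL on allocation failure;
   the caller must free() the result. */
int *make_zeroed_ints(size_t n)
{
    assert(n > 0);
    return calloc(n, sizeof(int));
}
```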

Another common problem is that heap memory must be managed manually: the program has to release each allocation once it is no longer needed so that the memory can be reused. For example, if an automatic pointer variable goes out of scope or has its value overwritten while still referencing a particular allocation that is not freed via a call to free(), then that memory cannot be recovered for later reuse and is essentially lost to the program, a phenomenon known as a memory leak. Conversely, it is possible to release memory too soon and continue to access it, but since the allocation system can modify or re-allocate the memory at any time, this results in unpredictable behavior, typically manifested in a portion of the program far removed from the actual error. Such issues are ameliorated in languages with automatic garbage collection or RAII.
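One discipline sometimes used to reduce leaks and premature releases is to pair every allocation with a single release routine that also clears the owner's pointer. The sketch below assumes two hypothetical helpers, `copy_string` and `release_string`; it protects only the pointer passed in, not any other copies of it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of s, or NULL on allocation failure.
   Ownership passes to the caller, who must eventually release it. */
char *copy_string(const char *s)
{
    size_t n;
    char *copy;

    assert(s != NULL);
    n = strlen(s) + 1;          /* include the terminating null */
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Frees *p and resets it to NULL, so the caller's pointer no longer
   dangles and a repeated call is harmless (free(NULL) does nothing). */
void release_string(char **p)
{
    assert(p != NULL);
    free(*p);
    *p = NULL;
}
```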

Pointers

Pointers are a primary source of potential danger. Because they are typically unchecked, a pointer can be made to point to any arbitrary location, causing undesirable effects. Although properly-used pointers point to safe places, they can be moved to unsafe places using invalid pointer arithmetic operations; the memory they point to may be deallocated and reused (dangling pointers); they may be uninitialized (wild pointers); or they may be directly assigned a value using a cast, union, or through another corrupt pointer. In general, C is permissive in allowing manipulation of and conversion between pointer types, although compilers typically provide options for various levels of checking. Other languages attempt to address these problems by using more restrictive reference types.

Arrays

Although C supports static arrays, it is not required that array indices be validated (bounds checking). For example, one can try to write to the sixth element of an array with five elements, yielding generally undesirable results. This type of bug, called a buffer overflow, has been notorious as the source of a number of security problems. On the other hand, since bounds checking elimination technology was largely nonexistent when C was defined, bounds checking came with a severe performance penalty, particularly in numerical computation. A few years earlier, some Fortran compilers had a switch to toggle bounds checking on or off; however, this would have been much less useful for C, where array arguments are passed as simple pointers.
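Where bounds checking is wanted, it has to be written by hand, and the array length must travel with the pointer. A minimal sketch, assuming a hypothetical helper named `checked_store`:

```c
#include <assert.h>
#include <stddef.h>

/* Stores value at a[index] only if index lies within the n elements of a.
   Returns 1 on success, 0 if the index is out of range.
   The length n is a separate parameter because a is only a pointer here;
   the function cannot discover the size of the caller's array. */
int checked_store(int *a, size_t n, size_t index, int value)
{
    assert(a != NULL);
    if (index >= n)
        return 0;
    a[index] = value;
    return 1;
}
```

With a five-element array, `checked_store(a, 5, 5, v)` is rejected, whereas the plain assignment `a[5] = v` would be compiled without complaint by many compilers.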

Multidimensional arrays are commonly used in numerical algorithms (mainly from applied linear algebra) to store matrices. The structure of the C array is particularly well suited to this task, provided one remembers to count indices starting from 0 instead of 1. However, since arrays are passed merely as pointers, the bounds of the array must be known fixed values or else explicitly passed to any subroutine that requires them, and dynamically sized arrays of arrays cannot be accessed using double indexing. (A workaround for this is to allocate the array together with an additional vector of pointers, one to the start of each row.) These issues are discussed in the book Numerical Recipes in C, chapter 1.2, page 20ff. That book also suggests a way to use 1-based indexing by offsetting the base pointers, but that technique violates the C standard.
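The pointer-vector workaround might look roughly like the following. This is a sketch under stated assumptions rather than the code from Numerical Recipes: the names `matrix_alloc` and `matrix_free` are hypothetical, and indexing is 0-based.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates a rows-by-cols matrix of doubles that can be indexed as
   m[i][j]. One block holds the elements contiguously; a second block
   holds a pointer to the start of each row. The elements are not
   initialized. Returns NULL on failure. */
double **matrix_alloc(size_t rows, size_t cols)
{
    double **m;
    double *data;
    size_t i;

    assert(rows > 0 && cols > 0);
    /* Refuse sizes whose byte count would overflow size_t. */
    if (rows > SIZE_MAX / sizeof *m ||
        cols > SIZE_MAX / sizeof *data / rows)
        return NULL;

    m = malloc(rows * sizeof *m);
    data = malloc(rows * cols * sizeof *data);
    if (m == NULL || data == NULL) {
        free(m);
        free(data);
        return NULL;
    }
    for (i = 0; i < rows; i++)
        m[i] = data + i * cols;
    return m;
}

/* Releases a matrix obtained from matrix_alloc. */
void matrix_free(double **m)
{
    if (m != NULL) {
        free(m[0]);  /* the element block starts at row 0 */
        free(m);
    }
}
```

The price of the convenient `m[i][j]` syntax is an extra allocation and an extra memory load per access, and the dimensions still have to be passed separately to any function that needs them.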

C99 introduced "variable-length arrays" which address some, but not all, of the issues with ordinary C arrays.
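As an illustration of what C99 makes possible, a function parameter may be declared with dimensions taken from earlier parameters, which restores double indexing for arrays whose size is known only at run time. The function name `matrix_sum` is an assumption for this sketch, and it requires a compiler that supports variable-length arrays (the feature was made optional in a later revision of the standard).

```c
#include <assert.h>
#include <stddef.h>

/* Sums a rows-by-cols matrix. Because rows and cols precede the array
   parameter, a[i][j] works even though neither dimension is a
   compile-time constant. The bounds are still not checked. */
double matrix_sum(size_t rows, size_t cols, double a[rows][cols])
{
    double total = 0.0;
    size_t i, j;

    assert(a != NULL);
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            total += a[i][j];
    return total;
}
```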

Variadic functions

Another potential source of bugs is variadic functions, which take a variable number of arguments. Unlike other prototyped C functions, checking the types of arguments to variadic functions at compile-time is, in general, impossible without additional information. If the wrong type of data is passed, the effect is unpredictable, and often fatal. Variadic functions also handle null pointer constants in a way which is often surprising to those unfamiliar with the language semantics. For example, NULL must be cast to the desired pointer type when passed to a variadic function. The printf family of functions supplied by the standard library, used to generate formatted text output, has been noted for its error-prone variadic interface, which relies on a format string to specify the number and types of trailing arguments.

However, type-checking of variadic functions from the standard library is a quality-of-implementation issue; many modern compilers do type-check printf calls, producing warnings if the argument list is inconsistent with the format string. Even so, not all printf calls can be checked statically since the format string can be built at runtime, and other variadic functions typically remain unchecked.
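The two usual conventions for telling a variadic function how many arguments it received, an explicit count and a terminating sentinel, can be sketched as follows. Both function names are hypothetical. In each case the compiler generally cannot verify that the caller honoured the convention.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Sums count int arguments. The count must match the number of ints
   actually passed; a mismatch is not detected. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    int i;

    assert(count >= 0);
    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

/* Returns the combined length of a list of strings. The list must end
   with (const char *)NULL: the cast matters because a bare NULL may be
   an integer constant that is not passed as a pointer. */
size_t total_length(const char *first, ...)
{
    va_list ap;
    size_t total = 0;
    const char *s;

    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *))
        total += strlen(s);
    va_end(ap);
    return total;
}
```

A call such as `total_length("ab", "cde", (const char *)NULL)` follows the sentinel convention; omitting the final argument would compile but read past the supplied arguments.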

Syntax

Although mimicked by many languages because of its widespread familiarity, C's syntax has often been targeted as one of its weakest points. For example, Kernighan and Ritchie say in the second edition of The C Programming Language, "C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better."

Some specific problems worth noting are:

  • Not checking number and types of arguments when the function declaration has an empty parameter list. (This provides backward compatibility with K&R C, which lacked prototypes.)
  • Some questionable choices of operator precedence, as mentioned by Kernighan and Ritchie above, such as == binding more tightly than & and | in expressions like x & 1 == 0.
  • The use of the = operator, used in mathematics for equality, to indicate assignment, following the precedent of Fortran, PL/I, and BASIC, but unlike ALGOL and its derivatives. Ritchie made this syntax design decision consciously, based primarily on the argument that assignment occurs more often than comparison.
  • Similarity of the assignment and equality operators (= and ==), making it easy to substitute one for the other. C's weak type system permits each to be used in the context of the other without a compilation error (although some compilers produce warnings).[3] [2]
  • A lack of infix operators for complex objects, particularly for string operations, making programs which rely heavily on these operations difficult to read. (The Lisp language, with no infix operators whatsoever, exhibits this problem to an even greater extent.)
  • Heavy reliance on punctuation-based symbols even where this is arguably less clear, such as "&&" and "||" instead of "and" and "or," respectively. Some are also confused about the difference between bit-wise operators ("&" and "|") and logical operators ("&&" and "||"), especially since the former can frequently, but not always, be inserted in place of the latter with no change in behavior.
  • A declaration syntax that some find unintuitive, particularly for function pointers. (Ritchie's idea was to declare identifiers in contexts resembling their use.)
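Two of the points above, operator precedence and declaration syntax, can be made concrete with a short sketch. All names here are hypothetical. The second function writes out explicitly the grouping that the compiler applies to the unparenthesized expression x & 1 == 0.

```c
#include <assert.h>
#include <stddef.h>

/* What the programmer usually means: "is the low bit of x clear?" */
int is_even(unsigned x)
{
    return (x & 1u) == 0;
}

/* What x & 1 == 0 actually means, with the implicit grouping shown:
   == binds more tightly than &, so 1 == 0 is evaluated first (giving 0),
   and x & 0 is 0 for every x. */
int unparenthesized_meaning(unsigned x)
{
    return x & (1 == 0);
}

/* Declaration mirrors use: (*op)(a, b) yields an int, so op is declared
   as int (*op)(int, int). A typedef gives the same type a readable name. */
typedef int (*binary_op)(int, int);

int add(int a, int b)
{
    return a + b;
}

int apply(binary_op op, int a, int b)
{
    assert(op != NULL);
    return (*op)(a, b);
}
```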

Economy of expression

One occasional criticism of C is that it can be concise to the point of being cryptic. A classic example that appears in K&R[4] is the following function to copy the contents of string t to string s:

void strcpy(char *s, char *t)
{
    while (*s++ = *t++);
}

In this example, t points to the first element of a null-terminated array of characters, and s points to the first element of an array of characters to be written. Each iteration of the while statement does the following:

  • Copies the character pointed to by t (initially set to point to the first character of the string to be copied) to the corresponding character position pointed to by s (initially set to point to the first character of the character array to be copied to)
  • Advances the pointers s and t to point to the next character. Note that the values of s and t can safely be changed, because they are local copies of the pointers to the corresponding arrays
  • Tests whether the character copied (the result of the assignment statement) is a null character signifying the end of the string. Note that the test could have been written "((*s++ = *t++) != '\0')" (where '\0' is the null character); however, in C, a Boolean test is actually a test for any non-zero value; consequently the test is true as long as the character is any character other than a string-terminating null
  • As long as the character is not a null, the condition is true, causing the while loop to repeat. (In particular, because the character copy occurs before the condition is evaluated, the final terminating null is guaranteed to be copied as well)
  • The repeatedly executed body of the while loop is an empty statement, signified by the semicolon (which despite appearances is not part of the while syntax). (It is not uncommon for the body of while or for loops to be empty.)

The above code is functionally equivalent to:

void strcpy(char *s, char *t)
{
    char aux;
    do {
        *s = *t;
        aux = *s;
        s++;
        t++;
    } while (aux != '\0');
}

With a modern optimising compiler, these two pieces of source code typically produce identical machine instruction sequences, so the shorter source does not produce smaller output. In more verbose languages such as Pascal, a similar iteration would require several statements. For C programmers, the economy of style is idiomatic and leads to shorter expressions; for critics, being able to do too much with a single line of C code can lead to problems in comprehension.

Internal consistency

Some features of C, its preprocessor, and/or implementation are inconsistent. One of C's features is three distinct classes of non-wide string literals. One is for run-time data, another is for #include files with quotation marks around the filename, and the third is for #include filenames in angle brackets. The allowed symbol set, and its interpretation, is not consistent among the three. To some extent this arose from the need to accommodate a wide variety of file naming conventions, such as MS-DOS's use of backslash as a path separator.

Another consistency problem stems from shortcomings in C's preprocessor, which was originally implemented as a separate, relatively simple process only loosely connected with the semantics of the rest of the language. The following code is not legal Standard C:

int sixteen = 0x3e-0x2e;

The reason is that 0x3e-0x2e matches the form of a "preprocessing number", and, since token-matching is greedy, is converted to a single preprocessing token. The subsequent conversion of that to a token in a later phase of translation is ill-defined, so the compiler will not obtain the intended tokenization of

int sixteen = 0x3e - 0x2e ;

even though spaces around the minus sign would not otherwise be required.

Standardization

The C programming language was standardized by ANSI in 1989 and adopted as an ISO standard in 1990; the standard has subsequently been extended twice. Some features of the C standard, such as trigraphs and complex arithmetic, have been challenged on the ground of questionable user demand. Some major C compilers have not yet become fully conformant to later versions of the C standard.

The C standards have been accompanied by Rationale documents which describe the considerations behind many of the choices made by the standards committee. Frequently there were trade-offs among competing requirements, and not everybody weighs the factors the same as did the C standards committee.

As well, more than most other language standards, the C standard leaves some behavior unspecified, such as the order of evaluation of arguments to a function, to allow compilers to have them evaluated in whatever way they believe will be optimal for their target platforms. This can result in code fragments which behave differently when compiled by different compilers, by different versions of the same compiler, or on different architectures; these can be avoided by careful programming.
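As an example of the careful programming referred to above, calls whose order matters can be placed in separate statements instead of in a single argument list. The function names in this sketch are hypothetical.

```c
#include <assert.h>

static int counter = 0;

/* Returns 1, 2, 3, ... on successive calls (a function with a side effect). */
int next_id(void)
{
    return ++counter;
}

int subtract(int a, int b)
{
    return a - b;
}

/* subtract(next_id(), next_id()) could yield -1 or 1, because the order
   in which the two arguments are evaluated is unspecified. Separate
   statements fix the order, so the result is always -1. */
int ordered_difference(void)
{
    int first = next_id();
    int second = next_id();

    assert(second == first + 1);
    return subtract(first, second);
}
```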

Maintenance

There are other problems in C that don't directly result in bugs or errors, but make it harder for programmers to build a robust, maintainable, large-scale system. Examples of these include:

  • A fragile system for importing definitions (#include) that relies on literal text inclusion and redundantly keeping prototypes and function definitions in sync.
  • A cumbersome compilation model that complicates dependency tracking and compiler optimizations between modules.
  • A weak type system that lets many erroneous programs compile without diagnostic messages.

Tools for mitigating issues with C

There are many C programmers who have learned to cope with C's quirks. However, some programmers may wish to use tools that have been created to help them overcome such problems.

Automated source code checking and auditing are beneficial in any language, and for C many such tools exist, such as Lint. A common practice is to use Lint to detect questionable code when a program is first written. Once a program passes Lint, it is then compiled using the C compiler.

There are also compilers, libraries and operating system level mechanisms for performing array bounds checking, buffer overflow detection, and automatic garbage collection, that are not a standard part of C.

There are dialects of C, such as Objective-C and Cyclone, that address some of these concerns.

Many compilers and development environments, notably Visual C++ and Xcode, deal with the long compilation times inflicted by header file inclusion by using precompiled headers, a system where declarations are stored in an intermediate format that is quick to parse. Building the precompiled header files in the first place is expensive, but this is generally done only for system header files, which are larger and more numerous than most application header files and also change much less often.

Cproto is a program that reads a C source file and outputs prototypes of all the functions within it. This program can be used in conjunction with the "make" command to create new files containing prototypes each time the source file has been changed. These prototype files can be included by the original source file (e.g., as "filename.p"), which reduces the problem of keeping function definitions and their prototypes in agreement.

It should be recognized that these tools are not a panacea. Because of C's flexibility, some types of errors involving misuse of variadic functions, out-of-bounds array indexing, and incorrect memory management cannot be detected on some architectures without incurring a significant performance penalty. However, some common cases can be recognized and accounted for.

Notes

  1. ^ Dennis Ritchie. The Development of the C Language. Retrieved on July 26, 2006.
  2. ^ Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, 2nd ed., Prentice Hall, 1988, p. 3.
  3. ^ For example, the conditional expression if (a=b) is only true if b is not zero.
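  4. ^ Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, 2nd ed., p. 106. Note that this example fails if the string t is larger than the array s, a complication that is partly addressed by the library function strncpy, which limits the number of characters copied but does not null-terminate the destination when the source is too long.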
