Criticism of the C programming language

From Wikipedia, the free encyclopedia

Criticism of the C programming language refers to critical commentary directed at the C programming language. This widely used language first appeared on larger computers but, like Pascal, soon found its way to CP/M and MS-DOS based microcomputers, where it achieved a rapid acceptance in the industry. Despite (or due to) its popularity, C's characteristics have led to much criticism.

Minimalist design

A popular saying, repeated by such notable language designers as Bjarne Stroustrup, is that "C makes it easy to shoot yourself in the foot."[1] In other words, C permits some operations that are sometimes not desirable, and thus many simple programming errors are not detected by the compiler and may not be readily apparent at runtime. If sufficient care and discipline are not used in programming and maintenance, this may lead to programs with unpredictable behavior and security holes. (Although this is not unique to C, C provides less protection than do many other programming languages.)

The designers wanted to avoid compile- and run-time checks that were too expensive when C was first implemented. With time, external tools were developed to perform some of these checks. Nothing prevents an implementation from providing such checks, but nothing requires it, either.

In their response to criticism of C not being a strongly-typed language, Kernighan and Ritchie made reference to the basic design philosophy of C: "Nevertheless, C retains the basic philosophy that programmers know what they are doing; it only requires that they state their intentions explicitly."[2][3]

Absent features

C was designed to be a small, simple language, which has contributed significantly to its acceptance, as new C compilers can be developed quickly for new platforms. The relatively low-level nature of the language affords the programmer close control over what the computer does, while allowing special tailoring and aggressive optimization for a particular platform. This allows the code to run efficiently on very limited hardware, such as embedded systems.

C lacks some features that are available in other programming languages.

A number of these features are available as extensions in some compilers, or can be supplied by third-party libraries, or can be simulated by adopting certain coding disciplines. For example, in most object-oriented languages, method functions include a special "this" pointer which refers to the current object. By passing this pointer as an explicit function argument, similar functionality can be achieved in C. Whereas in C++ one might write:

stack.push(val);

one would write in C:

push(&stack,val);
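The convention can be sketched as follows. The Stack type, its fixed capacity, and the stack_push and stack_pop names are hypothetical, chosen only for illustration; the point is that the object is passed explicitly as the first argument.

```c
#include <stdbool.h>
#include <stddef.h>

#define STACK_CAPACITY 16   /* arbitrary size chosen for this sketch */

/* Hypothetical stack type; the caller owns the object. */
typedef struct {
    int    items[STACK_CAPACITY];
    size_t count;
} Stack;

/* The first parameter plays the role of the implicit "this" pointer. */
bool stack_push(Stack *self, int val)
{
    if (self->count == STACK_CAPACITY)
        return false;                   /* full: report failure, do not overflow */
    self->items[self->count++] = val;
    return true;
}

bool stack_pop(Stack *self, int *out)
{
    if (self->count == 0)
        return false;                   /* empty: nothing to pop */
    *out = self->items[--self->count];
    return true;
}
```

Nothing in the language ties these functions to the Stack type; the association is a naming and calling discipline that the programmer has to maintain.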

Undefined behaviour

Many operations in C that have undefined behavior are not required to be diagnosed at compile time. In the case of C, "undefined behavior" means that the exact behavior which arises is not specified by the standard, and exactly what will happen does not have to be documented by the C implementation. A famous, although misleading, expression in the newsgroups comp.std.c and comp.lang.c is that the program could cause "demons to fly out of your nose".[5] In practice, an instance of undefined behavior sometimes produces a bug that is hard to track down and that may corrupt the contents of memory; sometimes a particular compiler generates well-behaved actions that differ from those obtained using a different C compiler. Some behavior has been left undefined to allow the compiler to generate more efficient executable code for well-defined behavior, which was deemed important for C's primary role as a systems implementation language; it is the programmer's responsibility to avoid undefined behavior. Examples of undefined behavior are:

  • accessing outside the bounds of an array
  • overflowing a signed integer
  • reaching the end of a function without finding a return statement, when the return value is used
  • reading the value of a variable before initializing it

These operations are all programming errors that could occur using many programming languages; C draws criticism because its standard explicitly identifies numerous cases of undefined behavior, including some where the behavior could have been made well defined, and does not specify any run-time error handling mechanism.
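Because the language supplies no run-time check, avoiding a case such as signed integer overflow is left to the program. A minimal sketch of one way to do this is shown here; the checked_add name and its interface are hypothetical, not part of the standard library.

```c
#include <limits.h>
#include <stdbool.h>

/* Adds a and b only if the sum fits in an int; returns false otherwise.
   The range test is done before the addition, so the addition itself
   can never overflow. */
bool checked_add(int a, int b, int *result)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b))
        return false;
    *result = a + b;
    return true;
}
```

Testing the result after the addition would be too late, since the overflow would already have occurred and the behavior would already be undefined.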

Invoking fflush() on a stream opened for input is an example of a different kind of undefined behavior, not necessarily a programming error but a case for which some conforming implementations may provide well-defined, useful semantics (in this example, presumably discarding input through the next new-line) as an allowed extension.

Memory allocation

Automatically and dynamically allocated objects are not necessarily initialized; they initially have indeterminate values (typically, whatever bit pattern happens to be present in the storage, which might not even represent a valid value for that type). If the program attempts to use such an uninitialized value, the results are undefined. Many modern compilers try to detect and warn about this problem, but both false positives and false negatives occur.

Another common problem is that heap memory must be managed manually: the program has to release each allocation once it is no longer needed so that the memory can be reused. For example, if the only pointer to a memory allocation goes out of scope or has its value overwritten before free() has been called, then that memory cannot be recovered for later reuse and is essentially lost to the program, a phenomenon known as a memory leak. Conversely, it is possible to release memory too soon and continue to access it; however, since the allocation system can re-allocate or itself use the freed memory, unpredictable behavior is likely to occur when the multiple users corrupt each other's data. Typically, the symptoms will appear in a portion of the program far removed from the actual error. Such issues are ameliorated in languages with automatic garbage collection or RAII.
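One coding discipline sometimes used against both problems is to pair every allocation with a single release routine that also clears the pointer. The sketch below assumes hypothetical helper names (duplicate_text and release_text); it illustrates the idea rather than prescribing a technique.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of text, or NULL on allocation failure.
   The caller owns the result and must release it exactly once. */
char *duplicate_text(const char *text)
{
    size_t size = strlen(text) + 1;     /* include the terminating null */
    char *copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, text, size);
    return copy;
}

/* Frees *p and resets it to NULL, so that a later accidental use goes
   through a null pointer rather than a dangling one, and a second call
   is harmless. */
void release_text(char **p)
{
    free(*p);       /* free(NULL) is defined to do nothing */
    *p = NULL;
}
```

This protects only the pointer that is passed in; any other copies of the same pointer still dangle after the release.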

Pointers

Pointers are a primary source of potential danger. Because they are typically unchecked, a pointer can be made to point to any arbitrary location, causing undesirable effects. Although properly-used pointers point to safe places, they can be moved to unsafe places using invalid pointer arithmetic; the memory they point to may be deallocated and reused (dangling pointers); they may be uninitialized (wild pointers); or they may be directly assigned a value using a cast, union, or through another corrupt pointer. In general, C is permissive in allowing manipulation of and conversion between pointer types, although compilers typically provide options for various levels of checking. Other languages address these problems by using more restrictive reference types.

Arrays

Although C supports static arrays, it is not required that array indices be validated (bounds checking). For example, one can try to write to the sixth element of an array with five elements, yielding generally undesirable results. This type of bug, called a buffer overflow, has been notorious as the source of a number of security problems. On the other hand, since bounds checking elimination technology was largely nonexistent when C was defined, bounds checking came with a severe performance penalty, particularly in numerical computation. A few years earlier, some Fortran compilers had a switch to toggle bounds checking on or off; however, this would have been much less useful for C, where array arguments are passed as simple pointers.
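Since the language does not validate indices, a program that wants the check has to perform it itself and pass the array length alongside the pointer. A minimal sketch, with a hypothetical checked_store helper:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stores value at index i only if i lies within the n-element array.
   The length n must be supplied by the caller, because the pointer
   alone carries no size information. */
bool checked_store(int *array, size_t n, size_t i, int value)
{
    if (i >= n)
        return false;       /* out of bounds: refuse instead of overflowing */
    array[i] = value;
    return true;
}
```

With a five-element array, a store to index 5 (the sixth element) is rejected instead of writing past the end.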

Multidimensional arrays are commonly used in numerical algorithms (mainly from applied linear algebra) to store matrices. The structure of the C array is well suited to this task. However, since arrays are passed merely as pointers, the bounds of the array must be known fixed values or else explicitly passed to any subroutine that requires them, and dynamically sized arrays of arrays cannot be accessed using double indexing. (A workaround for this is to allocate the array with an additional vector of pointers to the rows.) These issues are discussed in the book Numerical Recipes in C, chapter 1.2, page 20ff.[6]
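The pointer-vector workaround can be sketched as follows. The matrix_alloc and matrix_free names are hypothetical, and the layout (one contiguous data block plus one pointer per row) is one common variant, not necessarily the exact scheme used in the cited book.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates a zero-filled rows x cols matrix that can be indexed as
   m[i][j]. Returns NULL on failure or if the size would overflow. */
double **matrix_alloc(size_t rows, size_t cols)
{
    if (rows == 0 || cols == 0 ||
        rows > SIZE_MAX / cols ||
        rows > SIZE_MAX / sizeof(double *))
        return NULL;

    double **m = malloc(rows * sizeof *m);              /* vector of row pointers */
    double *data = calloc(rows * cols, sizeof *data);   /* contiguous, zeroed storage */
    if (m == NULL || data == NULL) {
        free(m);
        free(data);
        return NULL;
    }
    for (size_t i = 0; i < rows; i++)
        m[i] = data + i * cols;                         /* row i starts here */
    return m;
}

void matrix_free(double **m)
{
    if (m != NULL) {
        free(m[0]);     /* the contiguous data block */
        free(m);        /* the vector of row pointers */
    }
}
```

The dimensions still have to be passed separately to any function that receives the matrix, and nothing checks that the indices stay within them.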

C99 introduced "variable-length arrays" which address some, but not all, of the issues with ordinary C arrays.
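As a rough illustration, a C99 function can declare an array parameter whose dimensions are given by earlier parameters, which restores double indexing for run-time sizes. The matrix_sum name is hypothetical, and the sketch assumes a compiler that supports variable-length array types, which later revisions of the standard made optional.

```c
#include <stddef.h>

/* Sums a rows x cols matrix whose dimensions are known only at run time.
   The dimension parameters must appear before the array parameter that
   uses them. */
double matrix_sum(size_t rows, size_t cols, double m[rows][cols])
{
    double total = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            total += m[i][j];
    return total;
}
```

The indices are still not checked against the declared bounds, which is one of the issues that variable-length arrays leave unaddressed.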

Variadic functions

Another potential source of bugs is variadic functions, which take a variable number of arguments. Unlike other prototyped C functions, checking the types of arguments to variadic functions at compile-time is, in general, impossible without additional information. If the wrong type of data is passed, the effect is unpredictable, and often fatal. Variadic functions also handle null pointer constants in a way which is often surprising to those unfamiliar with the language semantics. For example, NULL must be cast to the desired pointer type when passed to a variadic function. The printf family of functions supplied by the standard library, used to generate formatted text output, has been noted for its error-prone variadic interface, which relies on a format string to specify the number and types of trailing arguments.

However, type-checking of variadic functions from the standard library is a quality-of-implementation issue; many modern compilers do type-check calls to functions in the printf family, producing warnings if the argument list is inconsistent with the format string. Even so, not all printf calls can be checked statically since the format string can be built at runtime, and other variadic functions typically remain unchecked.
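The null pointer issue can be seen in a small variadic function that uses a null pointer as its end-of-list marker. The total_length function below is a hypothetical example written for this purpose, not a library routine.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Returns the combined length of a list of strings.
   The argument list must end with (const char *)NULL. */
size_t total_length(const char *first, ...)
{
    size_t total = 0;
    va_list args;

    va_start(args, first);
    /* Nothing tells the function how many arguments follow or what
       their types are; it simply trusts that each one is a string
       pointer until it reads a null pointer. */
    for (const char *s = first; s != NULL; s = va_arg(args, const char *))
        total += strlen(s);
    va_end(args);
    return total;
}
```

A call such as total_length("ab", "cde", (const char *)NULL) needs the cast because the compiler does not know the expected type of the trailing arguments; an unadorned NULL may be passed as an integer, which need not have the same size or representation as a pointer.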

Syntax

Although mimicked by many languages because of its widespread familiarity, C's syntax has often been criticized. For example, Kernighan and Ritchie say in the second edition of The C Programming Language, "C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better."

Some specific problems worth noting are:

  • Not checking number and types of arguments when the function declaration has an empty parameter list. (This provides backward compatibility with K&R C, which lacked prototypes.)
  • Some questionable choices of operator precedence, as mentioned by Kernighan and Ritchie above, such as == binding more tightly than & and | in expressions like x & 1 == 0.
  • The use of the = operator, used in mathematics for equality, to indicate assignment, following the precedent of Fortran, PL/I, and BASIC, but unlike ALGOL and its derivatives. Ritchie made this syntax design decision consciously, based primarily on the argument that assignment occurs more often than comparison.
  • Similarity of the assignment and equality operators (= and ==), making it easy to substitute one for the other. C's weak type system permits each to be used in the context of the other without a compilation error (although some compilers produce warnings). For example, the conditional expression in if (a=b) is only true if a is not zero after the assignment.[7]
  • A lack of infix operators for complex objects, particularly for string operations, making programs which rely heavily on these operations difficult to read.
  • A declaration syntax that some find unintuitive, particularly for function pointers. (Ritchie's idea was to declare identifiers in contexts resembling their use: "declaration reflects use".)
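Two of the points in the list above, the precedence of == relative to & and the function pointer declaration syntax, can be illustrated with a short sketch. All function names here are hypothetical.

```c
#include <stdbool.h>

/* How the compiler groups the unparenthesised x & 1 == 0: the comparison
   binds first, so this is x & 0, which is 0 for every x. The parentheses
   are written out here only to make that grouping visible. */
int parsed_as_written(int x)
{
    return x & (1 == 0);
}

/* The test that was presumably intended. */
bool is_even(int x)
{
    return (x & 1) == 0;
}

/* "Declaration reflects use": the expression (*op)(a, b) has type int,
   so op is declared as int (*op)(int, int). */
int apply(int (*op)(int, int), int a, int b)
{
    return (*op)(a, b);
}

int add(int a, int b)
{
    return a + b;
}
```

Many compilers can warn about the unparenthesised form, but the warning is optional and the expression is valid C.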

Economy of expression

One occasional criticism of C is that it can be concise to the point of being cryptic. A classic example that appears in K&R[8] is the following function to copy the contents of string t to string s:

void strcpy(char *s, char *t)
{
    while (*s++ = *t++);
}

In this example, t points to the first of a null-terminated array of characters, and s points to the first of an array of characters to be written. Each iteration of the while statement does the following:

  • Copies the character pointed to by t (initially set to point to the first character of the string to be copied) to the corresponding character position pointed to by s (initially set to point to the first character of the character array to be copied to)
  • Advances the pointers s and t to point to the next character. Note that the values of s and t can safely be changed, because they are local copies of the pointers to the corresponding arrays
  • Tests whether the character copied (the result of the assignment statement) is a null character signifying the end of the string. Note that the test could have been written "((*s++ = *t++) != '\0')" (where '\0' is the null character); however, in C, a Boolean test is actually a test for any non-zero value; consequently the test is true as long as the character is any character other than a string-terminating null
  • As long as the character is not a null, the condition is true, causing the while loop to repeat. (In particular, because the character copy occurs before the condition is evaluated, the final terminating null is guaranteed to be copied as well)
  • The repeatedly executed body of the while loop is an empty statement, signified by the semicolon (which despite appearances is not part of the while syntax). (It is not uncommon for the body of while or for loops to be empty.)

In more verbose languages such as Pascal, a similar iteration would require several statements. The above code is functionally equivalent to:

void strcpy(char *s, char *t)
{
    char aux;
    do {
        *s = *t;
        aux = *s;
        s++;
        t++;
    } while (aux != '\0');
}

A modern optimising compiler typically produces identical machine instruction sequences for these two pieces of source code, so the smaller code does not produce smaller output. For C programmers, the economy of style is idiomatic and leads to shorter expressions; for critics, being able to do too much with a single line of C code can lead to problems in comprehension.

Internal consistency

Some features of C, its preprocessor, and/or implementation are inconsistent. One of C's features is three distinct classes of non-wide string literals. One is for run-time data, another is for #include files with quotation marks around the filename, and the third is for #include filenames in angle brackets. The allowed symbol set, and its interpretation, is not consistent among the three. To some extent this arose from the need to accommodate a wide variety of file naming conventions, such as MS-DOS's use of backslash as a path separator.

Another consistency problem stems from shortcomings in C's preprocessor, which was originally implemented as a separate, relatively simple process only loosely connected with the semantics of the rest of the language. The following code is not legal Standard C:

int sixteen = 0x3e-0x2e;

The reason is that 0x3e-0x2e matches the form of a "preprocessing number" ("e-" could be part of a number in scientific notation), and, since token-matching is greedy, is converted to a single preprocessing token. The subsequent conversion of that to a token in a later phase of translation is ill-defined, so the compiler will not obtain the intended tokenization of

int sixteen = 0x3e - 0x2e ;

even though spaces around the minus sign would not otherwise be required.

Standardization

The C programming language was standardized by ANSI in 1989 and adopted as an ISO standard in 1990; the standard has subsequently been extended twice. Some features of the C standard, such as trigraphs and complex arithmetic, have been challenged on the ground of questionable user demand. Some major C compilers have not yet become fully conformant to later versions of the C standard.

The C standards have been accompanied by Rationale documents which describe the considerations behind many of the choices made by the standards committee. Frequently there were trade-offs among competing requirements, and not everybody weighs the factors the same as did the C standards committee.

As well, more than most other language standards, the C standard leaves some behavior unspecified, such as the order of evaluation of arguments to a function, to allow compilers to have them evaluated in whatever way they believe will be optimal for their target platforms. This can result in code fragments which behave differently when compiled by different compilers, by different versions of the same compiler, or on different architectures; these can be avoided by careful programming.
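A small sketch of the argument-evaluation case, and of the kind of careful programming that avoids it, is given here. The function names are hypothetical; the counter exists only to give the calls a visible side effect.

```c
static int counter = 0;

/* Returns 1, 2, 3, ... on successive calls. */
int next_id(void)
{
    return ++counter;
}

int difference(int a, int b)
{
    return a - b;
}

/* The two calls to next_id() may be evaluated in either order, so this
   may return -1 with one compiler and 1 with another. */
int unsequenced_difference(void)
{
    return difference(next_id(), next_id());
}

/* Separate statements fix the order: first is always the smaller id,
   so the result is always -1. */
int sequenced_difference(void)
{
    int first = next_id();
    int second = next_id();
    return difference(first, second);
}
```

The first version is valid C and is usually accepted without a diagnostic; only its result depends on the implementation.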

Maintenance

There are other problems in C that don't directly result in bugs or errors, but make it harder for programmers to build a robust, maintainable, large-scale system. Examples of these include:

  • A fragile system for importing definitions (#include) that relies on literal text inclusion and redundantly keeping prototypes and function definitions in sync.
  • A cumbersome compilation model that complicates dependency tracking and compiler optimizations between modules.
  • A weak type system that lets many erroneous programs compile without diagnostic messages.

Tools for mitigating issues with C

There are many C programmers who have learned to cope with C's quirks. However, some programmers may wish to use tools that have been created to help them overcome such problems.

Automated source code checking and auditing are beneficial in any language, and for C many such tools exist, such as Lint. A common practice is to use Lint to detect questionable code when a program is first written. Once a program passes Lint, it is then compiled using the C compiler. Also, many compilers can optionally warn about syntactically valid constructs that are likely to actually be errors.

There are also compilers, libraries and operating system level mechanisms for performing array bounds checking, buffer overflow detection, and automatic garbage collection, that are not a standard part of C.

There are dialects of C, such as Objective-C and Cyclone, that address some of these concerns.

Many compilers, notably Visual C++ and Xcode, reduce the long compilation times caused by very large header files by using precompiled headers, a system where the contents of a header are stored in a form designed to be much quicker to process than source text. The one-time cost of building a precompiled header file is offset by the savings from multiple uses of the faster version.

It should be recognized that these tools are not a panacea. Because of C's flexibility, some types of errors involving misuse of variadic functions, out-of-bounds array indexing, and incorrect memory management cannot be detected on some architectures without incurring a significant performance penalty. However, some common cases can be recognized and accounted for.

References

  1. Stroustrup: FAQ. http://www.research.att.com/~bs/bs_faq.html#really-say-that
  2. Dennis Ritchie. The Development of the C Language. Retrieved on 2006-07-26.
  3. Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, 2nd ed., Prentice Hall, 1988, p. 3.
  4. The 1999 revision of the C standard added a type _Bool, but it was not retrofitted into the language's existing Boolean contexts.
  5. Jargon File entry for nasal demons.
  6. http://www.nrbook.com/a/bookcpdf/c1-2.pdf
  7. 10 Common Programming Mistakes in C. http://www.cs.ucr.edu/~nxiao/cs10/errors.htm
  8. Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, 2nd ed., p. 106.