Criticism of the C programming language

From Wikipedia, the free encyclopedia

The programming language C is widely used, minimalistic and low-level by design. Despite its popularity, C's characteristics have led to much criticism of the language. Some criticisms have arisen from misconceptions or misinterpretations of the C standard, while others have some degree of validity. This article is concerned with the latter.

Many beginning programmers have difficulty learning C's syntax and peculiarities, and even many expert programmers find C programs difficult to maintain and debug. A popular saying, repeated by such notable language designers as Bjarne Stroustrup, is that "C makes it easy to shoot yourself in the foot."[1] In other words, C permits many operations that are generally not desirable; thus many simple programming errors are not detected by the compiler and may not even be readily apparent at runtime. Without sufficient care and discipline in programming and maintenance, this can lead to programs with unpredictable behavior and security holes.

When C was first implemented, its designers wanted to avoid compile- and run-time checks that were then too expensive. With time, external tools were developed to perform some of these checks. Nothing prevents an implementation from providing such checks, but nothing requires it to, either. The safe C dialect Cyclone addresses some of these concerns.

Kernighan and Ritchie pointed to the basic design philosophy of C in their response to criticism that C is not a strongly typed language[1]: "Nevertheless, C retains the basic philosophy that programmers know what they are doing; it only requires that they state their intentions explicitly."[2]

Undefined behavior

Many "illegal" operations in C are not required to be diagnosed at compile time, and can produce undefined behavior at run time. This means that the exact behavior which arises is not specified by the standard, and exactly what will happen does not have to be documented by the C implementor. A famous, although misleading, expression on comp.std.c is that the compiler could cause "demons to fly out of your nose". In practice, an instance of undefined behavior sometimes manifests as a bug that is hard to track down and that may corrupt the contents of memory; sometimes a particular compiler generates well-behaved actions that differ from those produced by a different C compiler. The main purpose of leaving some behavior undefined is to allow the compiler to generate more efficient executable code for well-defined behavior, which is important for C's primary role as a systems implementation language; it is the programmer's responsibility to avoid undefined behavior. Examples of undefined behavior are:

  • Accessing outside the bounds of an array.
  • Overflowing a signed integer.
  • Reaching the end of a function without executing a return statement, when the return value is used.
  • Reading the value of a variable before writing to it.

These operations are all programming errors that could occur using many programming languages; C draws criticism because its standard explicitly identifies numerous cases of undefined behavior, including some where the behavior could have been made well defined, and does not specify any run-time error handling mechanism.
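Each of these cases can be guarded against explicitly in portable code. A minimal sketch, not a definitive implementation (the helper names checked_add and checked_get are illustrative, not standard library functions):

```c
#include <limits.h>
#include <stddef.h>

/* Signed addition that refuses to overflow: the bounds are
   checked *before* adding, because the overflow itself would
   be undefined behavior. */
int checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b))
        return 0;              /* would overflow: report failure */
    *out = a + b;
    return 1;
}

/* Bounds-checked array read: valid indices are 0 .. len-1. */
int checked_get(const int *arr, size_t len, size_t i, int *out)
{
    if (i >= len)
        return 0;              /* out of bounds */
    *out = arr[i];
    return 1;
}
```

Both helpers return a status instead of relying on the hardware to trap the error, mirroring the discipline the standard expects from the programmer.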

Memory allocation

One issue to be aware of when using C is that automatically and dynamically allocated objects are not necessarily initialized (depending on what facility is used to allocate memory); they initially have an indeterminate value (typically whatever values are present in the memory space they occupy, which might not even be a legal bit pattern for that type). This value is highly unpredictable and can vary between two machines, two program runs, or even two calls to the same function. If the program attempts to use such an uninitialized value, the results are undefined. Many modern compilers try to detect and warn about this problem, but both false positives and false negatives occur.
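The distinction shows up directly in the standard allocation functions: malloc leaves the new block indeterminate, while calloc zero-fills it. A small sketch (the helper name alloc_zeroed is illustrative):

```c
#include <stdlib.h>
#include <string.h>

/* Allocate n ints with a defined initial value. malloc alone
   would leave the contents indeterminate; reading them before
   writing would invoke undefined behavior. */
int *alloc_zeroed(size_t n)
{
    int *p = malloc(n * sizeof *p);
    if (p != NULL)
        memset(p, 0, n * sizeof *p);  /* give every byte a value */
    return p;
}
```

calloc(n, sizeof(int)) achieves the same effect in a single call.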

Another common problem is that heap memory must be manually matched to its actual usage if it is to be reused correctly. For example, if the only pointer to a heap allocation goes out of scope or has its value overwritten before free() is called, then that memory cannot be recovered for later reuse and is essentially lost to the program, a phenomenon known as a memory leak. Conversely, it is possible to release memory too soon and continue to use it; since the allocation system can re-allocate that memory at any time for unrelated reasons, this results in unpredictable behavior, typically manifesting in portions of the program far removed from the erroneous code. Such issues are ameliorated in languages with automatic garbage collection or RAII.
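A common defensive idiom against both problems is to free through a helper that also nulls the pointer, so a stale pointer fails loudly rather than silently aliasing reused memory. A sketch (the helper free_and_null is illustrative, not a standard function):

```c
#include <stdlib.h>

/* Free the allocation and clear the caller's pointer, so any
   later dereference is a null-pointer fault rather than a silent
   use-after-free. free(NULL) is defined as a no-op, so calling
   this twice on the same pointer variable is harmless. */
void free_and_null(char **pp)
{
    free(*pp);
    *pp = NULL;
}
```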

Pointers

Pointers are a primary source of potential danger. Because they are typically unchecked, a pointer can be made to point to any arbitrary location (even within code), causing unpredictable effects. Although properly-used pointers point to safe places, they can be moved to unsafe places using pointer arithmetic; the memory they point to may be deallocated and reused (dangling pointers); they may be uninitialized (wild pointers); or they may be directly assigned a value using a cast, union, or through another corrupt pointer. In general, C is permissive in allowing manipulation of and conversion between pointer types, although compilers typically provide options for various levels of checking. Other languages attempt to address these problems by using more restrictive reference types.
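The danger is that valid and invalid pointer arithmetic look identical at the call site. A sketch (the function name is illustrative):

```c
#include <stddef.h>

/* Reads a[i] through explicit pointer arithmetic. Nothing in the
   language checks that a + i stays inside the array: the same
   call compiles equally well for an out-of-range offset, which
   would be undefined behavior. */
int offset_read(const int *a, size_t i)
{
    return *(a + i);   /* identical to a[i]; entirely unchecked */
}
```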

Arrays

Although C supports static arrays, it is not required that array indices be validated (bounds checking). For example, one can write to the sixth element of an array with five elements, yielding generally undesirable results. This type of bug, called a buffer overflow, has been notorious as the source of a number of security problems. On the other hand, since bounds checking elimination technology was largely nonexistent when C was defined, bounds checking came with a severe performance penalty, particularly in numerical computation. By comparison, a few years earlier some Fortran compilers had a switch to toggle bounds checking on or off; however, this would have been much less useful for C, where array arguments are passed as simple pointers.
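In the absence of language-level bounds checking, the usual defense is to make every copy into a fixed buffer carry the buffer's size. A minimal sketch using snprintf, which never writes beyond the stated size and always null-terminates (the wrapper name safe_copy is illustrative):

```c
#include <stdio.h>

/* Bounded string copy: at most dstsize-1 characters are copied
   and the result is always null-terminated, unlike a raw strcpy
   into a too-small buffer. */
void safe_copy(char *dst, size_t dstsize, const char *src)
{
    snprintf(dst, dstsize, "%s", src);
}
```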

Multidimensional arrays are commonly used in numerical algorithms (mainly from applied linear algebra) to store matrices. The structure of the C array is well suited to this particular task, provided one remembers to count indices starting from 0 instead of 1. However, since arrays are passed merely as pointers, the bounds of the array must be known fixed values or else explicitly passed to any subroutine that requires them, and dynamically sized arrays of arrays cannot be accessed using standard matrix notation. (A workaround is to allocate the array with an additional "row vector" of pointers to the rows.) These issues are discussed in the book Numerical Recipes in C, chapter 1.2, page 20ff. That book also suggests a way to use 1-based indexing by offsetting the base pointers, but that technique violates the C standard.

C99 introduced "variable-length arrays" which address some, but not all, of the issues with ordinary C arrays.
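With C99 variable-length array parameters, a routine can receive a matrix whose dimensions are runtime values and still use ordinary m[i][j] indexing, avoiding the row-pointer workaround. A C99 sketch:

```c
#include <stddef.h>

/* Sum of the main diagonal of an n-by-n matrix. The dimension n
   is an ordinary runtime argument; the compiler uses it to scale
   the index arithmetic for m[i][i]. */
double trace(size_t n, double m[n][n])
{
    double t = 0.0;
    for (size_t i = 0; i < n; i++)
        t += m[i][i];
    return t;
}
```

Bounds are still not checked; variable-length arrays fix the notation, not the safety.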

Variadic functions

Another source of bugs is variadic functions, which take a variable number of arguments. Unlike other prototyped C functions, checking the types of arguments to variadic functions at compile-time is, in general, impossible without additional information. If the wrong type of data is passed, the effect is unpredictable, and often fatal. Variadic functions also handle null pointer constants in a way which is often surprising to those unfamiliar with the language semantics. For example, NULL must be cast to the desired pointer type when passed to a variadic function. The printf family of functions supplied by the standard library, used to generate formatted text output, has been noted for its error-prone variadic interface, which relies on a format string to specify the number and types of trailing arguments.

However, type-checking of variadic functions from the standard library is a quality-of-implementation issue; many modern compilers do type-check printf calls, producing warnings if the argument list is inconsistent with the format string. Even so, not all printf calls can be checked statically since the format string can be built at runtime, and other variadic functions typically remain unchecked.
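The underlying problem is visible in any hand-written variadic function: the callee must simply trust the caller about the number and types of the arguments. A sketch (the convention of terminating the list with -1 is an assumption of this example, not part of the language):

```c
#include <stdarg.h>

/* Sum a -1-terminated list of ints. va_arg blindly reinterprets
   the next argument as an int; if the caller passes a double or
   forgets the sentinel, the behavior is undefined and no compiler
   diagnostic is required. */
int sum_ints(int first, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, first);
    for (int v = first; v != -1; v = va_arg(ap, int))
        total += v;
    va_end(ap);
    return total;
}
```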

Syntax

Although mimicked by many languages because of its widespread familiarity, C's syntax has been often targeted as one of its weakest points. For example, Kernighan and Ritchie say in the second edition of The C Programming Language, "C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better." Bjarne Stroustrup said of C++ (which is superficially similar to C): "Within C++, there is a much smaller and cleaner language struggling to get out. […] the C++ semantics is much cleaner than its syntax." [2] Some specific problems worth noting are:

  • A function declaration with an empty parameter list, which is not a prototype and so allows calls with any set of arguments; this syntax was retained for backward compatibility with K&R C, which lacked prototypes.
  • Some questionable choices of operator precedence, as mentioned by Kernighan and Ritchie above, such as == binding more tightly than & and | in expressions like x & 1 == 0.
  • The use of the = operator, used in mathematics for equality, to indicate assignment. Ritchie made this syntax design decision consciously, based primarily on the argument that assignment occurs more often than comparison. However, as explained by computer scientist Damian Conway in his "Seven Deadly Sins of Introductory Programming Language Design": "Many students, when confronted with this operator, become confused as to the nature of assignment and its relationship to equality. […] [A different syntax] seems to evoke less confusion, [because it] reinforces the notion of procedural transfer of value, rather than transitive equality of value.".[3]
  • Similarly, the similarity of the assignment and equality operators (= and ==) makes it easy to substitute one for the other, and C's weak type system permits each to be used in the context of the other without a compilation error (although some compilers produce warnings).[3] [4]
  • A lack of infix operators for complex objects, particularly for string operations, making programs which rely heavily on these operations difficult to read. The Lisp language, with no infix operators whatsoever, exhibits this problem to an even greater extent.
  • Heavy reliance on punctuation-based symbols even where this is arguably less clear, such as "&&" and "||" instead of "and" and "or," respectively. Some are also confused about the difference between bit-wise operators ("&" and "|") and logical operators ("&&" and "||"), especially since the former can frequently, but not always, be inserted in place of the latter with no change in behavior.
  • Unintuitive declaration syntax, particularly for function pointers. In the words of Conway, speaking about the very similar C++ declaration syntax:

Specifying a type in C++ is made difficult by the fact that some of the components of a declaration (such as the pointer specifier) are prefix operators while others (such as the array specifier) are postfix. These declaration operators are also of varying precedence, necessitating careful bracketing to achieve the desired declaration. Furthermore, if the type ID is to apply to an identifier, this identifier ends up at somewhere between these operators, and is therefore obscured in even moderately complicated examples (see Appendix A for instance). The result is that the clarity of such declarations is greatly diminished. Ben Werther & Damian Conway. A Modest Proposal: C++ Resyntaxed. Section 3.1.1. 1996.
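The precedence and assignment pitfalls above can be made concrete in a few lines (the function names are illustrative):

```c
/* == binds more tightly than &, so this parses as x & (1 == 0),
   i.e. x & 0, which is 0 for every x -- not an evenness test. */
int as_written(int x)  { return x & 1 == 0; }

/* The parenthesized version is what the programmer meant. */
int as_intended(int x) { return (x & 1) == 0; }

/* Assignment in a condition compiles silently: the test uses the
   assigned value, so this "comparison" always stores b into *a. */
int compare_typo(int *a, int b)
{
    if (*a = b)          /* intended: *a == b */
        return 1;        /* taken whenever b is nonzero */
    return 0;
}
```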

Economy of expression

One occasional criticism of C is that it can be concise to the point of being cryptic. A classic example that appears in K&R[4] is the following function to copy the contents of string t to string s:

void strcpy(char *s, char *t)
{
    while (*s++ = *t++);
}

In this example, t is a pointer to a null-terminated array of characters, and s is a pointer to an array of characters. Each iteration of the single while statement does the following:

  • Copies the character pointed to by t (initially set to point to the first character of the string to be copied) to the corresponding character position pointed to by s (initially set to point to the first character of the character array to be copied to)
  • Advances the pointers s and t to point to the next character. Note that the values of s and t can safely be changed, because they are local copies of the pointers to the corresponding arrays
  • Tests whether the character copied (the result of the assignment statement) is a null character signifying the end of the string. Note that the test could have been written "((*s++ = *t++) != '\0')" (where '\0' is the null character); however, in C, a Boolean test is actually a test for any non-zero value; consequently the test is true as long as the character is any character other than a string-terminating null
  • As long as the character is not a null, the condition is true, causing the while loop to repeat. (In particular, because the character copy occurs before the condition is evaluated, the final terminating null is guaranteed to be copied as well)
  • The repeatedly executed body of the while loop is an empty statement, signified by the semicolon (which despite appearances is not part of the while syntax). (It is not uncommon for the body of while or for loops to be empty.)

The above code is functionally equivalent to:

void strcpy(char *s, char *t)
{
    char aux;
    do {
        *s = *t;
        aux = *s;
        s++;
        t++;
    } while (aux != '\0');
}

In a modern optimizing compiler, these two pieces of code produce identical machine code, so the terser source does not yield a smaller executable. In more verbose languages such as Pascal, a similar iteration would require several statements. For C programmers, this economy of style is idiomatic and leads to shorter expressions; for critics, being able to do so much in a single line of C code can lead to problems of comprehension.

Internal consistency

Some features of C, its preprocessor, and its implementations are inconsistent with one another. For example, C has three distinct classes of non-wide string literal: one for run-time data, another for include files named with quotation marks, and a third for include filenames in angle brackets. The allowed character set, and its interpretation, is not consistent among the three. To some extent this arose from the need to accommodate a wide variety of file naming conventions, such as MS-DOS's use of the backslash as a path separator.

Another consistency problem stems from shortcomings in C's preprocessor, which was originally implemented as a separate, relatively simple process only loosely connected with the semantics of the rest of the language. Consider the following code, which appears to initialize sixteen to 0x3e - 0x2e = 16:

int sixteen = 0x3e-0x2e;

Unfortunately, a Standard C preprocessor will convert "0x3e-0x2e" to a single preprocessing token, and the subsequent conversion of that to a token in a later phase of translation is ill-defined, so the compiler might not obtain the intended tokenization of

int sixteen = 0x3e - 0x2e ;

even though spaces around the minus sign would not otherwise be required.

Standardization

The C programming language was standardized by ANSI in 1989 and adopted as an ISO standard in 1990; the standard has subsequently been extended twice. Some features of the C standard, such as trigraphs and complex arithmetic, have been challenged on the grounds of questionable user demand. Some major C compilers have not yet become fully conformant to later versions of the C standard.

The C standards have been accompanied by Rationale documents which describe the considerations behind many of the choices made by the standards committee. Frequently there were trade-offs among competing requirements, and not everyone weighs the factors the same way the committee did.

The C standard also leaves some behavior unspecified, such as the order of evaluation of arguments to a function call, to allow compilers to choose whatever order they believe will be optimal. However, this can result in code fragments which behave differently when compiled by different compilers, by different versions of the same compiler, or on different architectures.
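For example, in f(i++, i++) the argument evaluation order is unspecified (and the unsequenced double modification of i is itself undefined); splitting the arguments into separately sequenced statements makes the result identical on every conforming compiler. A sketch (function names are illustrative):

```c
/* Combine two arguments so that their order is observable. */
int apply(int a, int b) { return a * 10 + b; }

/* apply(i++, i++) would be unportable at best. Sequencing the
   increments into separate statements pins down the result. */
int call_sequenced(int i)
{
    int first  = i++;   /* old value of i */
    int second = i++;   /* incremented value */
    return apply(first, second);
}
```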

Maintenance

There are other problems in C that don't directly result in bugs or errors, but make it harder for programmers to build a robust, maintainable, large-scale system. Examples of these include:

  • A fragile system for importing definitions (#include) that relies on literal text inclusion, requires prototypes and function definitions to be kept in sync by hand, and increases build times.
  • A cumbersome compilation model that complicates dependency tracking and compiler optimizations between modules.
  • A weak type system that lets many clearly erroneous programs compile without errors.

Tools for mitigating issues with C

There are many C programmers who have learned to cope with C's quirks. However, some programmers may wish to use tools that have been created to help them overcome such problems.

Automated source code checking and auditing are beneficial in any language, and for C many such tools exist, such as Lint. A common practice is to use Lint to detect questionable code when a program is first written. Once a program passes Lint, it is then compiled using the C compiler.

There are also compilers, libraries and operating system level mechanisms for performing array bounds checking, buffer overflow detection, and automatic garbage collection, that are not a standard part of C.

Many development environments, notably Visual C++ and Xcode, deal with the long compilation times caused by header file inclusion by using precompiled headers, a system in which declarations are stored in an intermediate format that is quick to parse. Building the precompiled header files in the first place is expensive, but this is generally done only for system header files, which are larger and more numerous than most application header files and also change much less often.

Cproto is a program that reads a C source file and outputs prototypes of all the functions within it. It can be used in conjunction with make to regenerate prototype files each time the source file changes. These prototype files can then be included by the original source file (e.g., as "filename.p"), which reduces the problem of keeping prototypes and function definitions in agreement.

It should be recognized that these tools are not a panacea. Because of C's flexibility, some types of errors involving misuse of variadic functions, out-of-bounds array indexing, and incorrect memory management cannot be detected on some architectures without incurring a significant performance penalty. However, some common cases can be recognized and accounted for.

Notes

  1. ^ Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, 2nd ed., Prentice Hall, 1988, p. 3.
  2. ^ Dennis Ritchie. The Development of the C Language. Retrieved on 2006-07-26.
  3. ^ For example, the test in if (a = b) is true only if b is nonzero.
  4. ^ Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, 2nd ed., p. 106. Note that this example fails if the string t is longer than the array s, a complication that is addressed by the safer library function strncpy.
