C trigraph

In the C family of programming languages, a trigraph is a sequence of three characters, the first two of which are question marks, that represents a single character.

Trigraphs exist because the basic character set of C is a subset of the ASCII character set, but nine of its characters lie outside the smaller ISO 646 character set. ISO 646 is largely equivalent to ASCII, except that certain punctuation characters present in ASCII may be removed to leave room for "national characters". In other words, users of non-English languages were free to reassign those code positions to other symbols needed in their language. This posed a problem for C programming, since the removed punctuation characters are used in C. The ANSI C committee invented trigraphs so that programs could be written using just the ISO 646 character set. The non-ASCII ISO 646 character sets have fallen out of use, but trigraphs outlived them in the C standard and were removed only in C23.

Trigraphs are rarely used outside compiler test suites. Many compilers either have an option to turn recognition of trigraphs off, or disable trigraphs by default and have an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. A compiler must nevertheless support them to claim conformance to a standard that defines them, and doing so has a cost: every character in every source file has to be checked to see whether it introduces a trigraph. Borland supplied a separate program, the trigraph preprocessor, to deal with the problem.

Trigraph sequences

The C preprocessor replaces all occurrences of the following nine trigraph sequences with their single-character equivalents before any other processing, so the replacement also applies inside string literals and comments.

    Trigraph     Equivalent
    ========     ==========
      ??=            #
      ??/            \
      ??'            ^
      ??(            [
      ??)            ]
      ??!            |
      ??<            {
      ??>            }
      ??-            ~

Note that ??? is not a trigraph sequence.
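The replacement step can be sketched as a small routine. This is only an illustration of the table above, not how any particular compiler implements it; the function names `trigraph_replacement` and `replace_trigraphs` are hypothetical, and the sketch assumes a NUL-terminated input string and a caller-supplied output buffer. It scans left to right, so a run of three question marks is copied one character at a time and a trigraph that starts at the second question mark is still recognized.

```c
#include <assert.h>
#include <stddef.h>

/* Map the third character of a candidate trigraph to its replacement.
   Returns 0 when the character does not complete a trigraph. */
char trigraph_replacement(char third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy `in` to `out`, replacing every trigraph in one left-to-right scan.
   `out` needs room for strlen(in) + 1 bytes; returns the output length. */
size_t replace_trigraphs(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        char r = 0;

        /* in[2] is at worst the terminating NUL, which maps to 0. */
        if (in[0] == '?' && in[1] == '?')
            r = trigraph_replacement(in[2]);

        if (r != 0) {
            out[n++] = r;      /* three input characters become one */
            in += 3;
        } else {
            out[n++] = *in++;  /* ordinary character, copied as-is */
        }
    }
    out[n] = '\0';
    return n;
}
```

Given the text ??=include, the routine produces #include. Given ???=, it copies the first question mark and then replaces the remaining ??= with #.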

The ??/ trigraph can be used to introduce an escaped newline for line splicing; this makes correct and efficient handling of trigraphs within the preprocessor particularly problematic.

Example

An example of a C program that uses all the defined trigraphs:

??=include <stdio.h>                 /* #           */

int main(void)
??<                                  /* {           */
        char n??(5??);               /* [ and ]     */

        n??(4??) = 'a';
        printf("%c??/n", n??(4??));  /* ??/ = \     */
        
        return ??- 0 ??' 1 ??! 2;    /* ~, ^ and |  */
??>                                  /* }           */

Disambiguation

A programmer may want to place two question marks together without having the compiler treat them as the start of a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row might be wanted are character constants, string literals, and comments.

To place two consecutive question marks safely within a literal, escape the second one with the \? escape sequence, writing ?\?.
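As a minimal sketch of this advice, the functions below (hypothetical names, chosen only for illustration) spell a string ending in two question marks and an exclamation mark in two ways. The first uses the escape described above. The second relies on adjacent string literals being concatenated after trigraph replacement has already taken place, which is an alternative not discussed in the text above. Written without either precaution, the same literal would end in a vertical bar on a compiler with trigraphs enabled.

```c
#include <assert.h>
#include <string.h>

/* Escaping the second question mark keeps the two marks apart in the
   source text, so trigraph replacement never matches them. */
const char *escaped_marks(void)
{
    return "Really?\?!";
}

/* Adjacent string literals are joined only after trigraph replacement
   has run, so splitting the literal between the marks also works. */
const char *split_marks(void)
{
    return "Really?" "?!";
}

/* Sanity check: both spellings are the same nine characters. */
void check_spellings(void)
{
    assert(strlen(escaped_marks()) == 9);
    assert(strcmp(escaped_marks(), split_marks()) == 0);
}
```

Both functions return the same text whether or not the compiler has trigraph recognition switched on, which is the point of the escape.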

The ??/ trigraph forms an escaped newline when it is immediately followed by a newline. This can cause surprises, particularly within comments. For example:

 // Will the next line be executed????????????????/
 a++;

which is a single logical comment line, so the increment is never executed, and

 /??/
 * A comment *??/
 /

which is a correctly formed block comment.
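Both surprises follow from trigraph replacement being applied before line splicing. The sketch below models those two steps on a string so the effect can be inspected; it is an illustration under stated assumptions, not a description of any real preprocessor. The names `decode_char` and `replace_and_splice` are hypothetical, the input is assumed to be NUL-terminated, and the two steps are folded into a single pass for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode one source character at `p`, folding a trigraph into the single
   character it stands for. Stores the character in `*c` and returns how
   many input bytes were used (3 for a trigraph, otherwise 1). */
size_t decode_char(const char *p, char *c)
{
    static const char thirds[] = "=/'()!<>-";
    static const char plain[]  = "#\\^[]|{}~";

    if (p[0] == '?' && p[1] == '?' && p[2] != '\0') {
        const char *hit = strchr(thirds, p[2]);
        if (hit != NULL) {
            *c = plain[hit - thirds];
            return 3;
        }
    }
    *c = p[0];
    return 1;
}

/* Apply trigraph replacement, then delete each backslash-newline pair.
   `out` needs strlen(in) + 1 bytes; returns the output length. */
size_t replace_and_splice(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        char c;
        size_t used = decode_char(in, &c);

        if (c == '\\' && in[used] == '\n') {
            in += used + 1;   /* drop the backslash and the newline */
            continue;
        }
        out[n++] = c;
        in += used;
    }
    out[n] = '\0';
    return n;
}
```

Fed the two-line comment example, the routine returns one line in which the increment has become part of the comment. Fed the three-line example, it returns an ordinary one-line block comment.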