Digraphs and trigraphs
In computer programming, digraphs and trigraphs are sequences of two and three characters respectively, appearing in source code, which a programming language specification requires an implementation of that language to treat as if they were one other character.
Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use, and so on. Trigraphs might also be used for some EBCDIC code pages that lack characters such as { and }.
History
The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set.
Implementations
Trigraphs are not commonly encountered outside compiler test suites.[1] Some compilers support an option to turn recognition of trigraphs off, or disable trigraphs by default and require an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. Borland supplied a separate program, the trigraph preprocessor, to be used only when trigraph processing is desired (the rationale was to maximise speed of compilation).
Language support
Different systems have different sets of defined trigraphs:
Pascal
The Pascal programming language supports the digraphs (., .), (* and *) for [, ], { and } respectively. Unlike all other cases mentioned here, (* and *) were in wide use.
Vim
The Vim text editor supports digraphs for actual entry of text characters, following RFC 1345.
GNU Screen
GNU Screen has a digraph command, bound to ^A ^V by default.
J
The J programming language is a descendant of APL using the ASCII character set rather than traditional APL symbols. To handle the fact that the printable range of ASCII is smaller than APL's specialised set of symbols, the . (dot) and : (colon) characters are used to inflect ASCII symbols, in effect interpreting unigraphs, digraphs or rarely trigraphs as standalone "symbols".[2]
Unlike the use of di- and trigraphs in C and C++, there is no single-character equivalent to these.
C
Trigraph | Equivalent |
---|---|
??= | # |
??/ | \ |
??' | ^ |
??( | [ |
??) | ] |
??! | | |
??< | { |
??> | } |
??- | ~ |
The C preprocessor replaces all occurrences of the nine trigraph sequences listed in the table above with their single-character equivalents before any other processing.
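The replacement step can be illustrated with a small stand-alone routine. This is only a sketch of the idea, not how any particular compiler implements it; the names `trigraph_char` and `replace_trigraphs` are hypothetical, and the input is assumed to be a NUL-terminated string that fits in memory.

```c
#include <stddef.h>

/* Map the third character of a "??x" sequence to its replacement,
   or return 0 if "??x" is not one of the nine trigraphs. */
static char trigraph_char(char third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy src to dst, replacing every trigraph in one left-to-right pass.
   dst must have room for strlen(src) + 1 bytes; the output is never
   longer than the input. Returns the length of the result. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t out = 0;
    while (*src != '\0') {
        char repl = 0;
        if (src[0] == '?' && src[1] == '?')
            repl = trigraph_char(src[2]);
        if (repl != 0) {
            dst[out++] = repl;      /* three input characters become one */
            src += 3;
        } else {
            dst[out++] = *src++;    /* ordinary character, copied as is */
        }
    }
    dst[out] = '\0';
    return out;
}
```

Because the scan moves strictly left to right and only consumes a question-mark pair when the third character completes a trigraph, a lone leading question mark in a longer run is copied through unchanged.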
A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row may be used are in multi-character constants, string literals, and comments. To safely place two consecutive question marks within a string literal, the programmer can use string concatenation "...?""?..." or an escape sequence "...?\?...".
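Both techniques can be checked side by side. The following is a minimal sketch; the identifiers `concatenated`, `escaped` and `spellings_agree` are made up for illustration.

```c
#include <string.h>

/* Each literal spells the word What followed by two question marks and an
   exclamation mark, yet the source never has two adjacent question marks,
   so no trigraph can form. */
static const char concatenated[] = "What?" "?!";  /* adjacent literals are joined */
static const char escaped[]      = "What?\?!";    /* \? is an escape for a question mark */

/* Returns 1 if both spellings yield the same 7-character string. */
int spellings_agree(void)
{
    return strlen(concatenated) == 7 && strcmp(concatenated, escaped) == 0;
}
```

The concatenation form works because adjacent string literals are joined only after trigraph replacement has already taken place.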
??? is not itself a trigraph sequence, but when followed by a character such as - it will be interpreted as ? + ??-, as in the example below which has 16 (5x3+1) ?s before the /.
The ??/ trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example:
// Will the next line be executed????????????????/
a++;
which is a single logical comment line (used in C++ and C99), and
/??/
* A comment *??/
/
which is a correctly formed block comment.
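The surprise comes from the step that follows trigraph replacement: a backslash immediately followed by a newline is deleted, joining the two lines. The routine below is a rough sketch of that splicing step alone, under the assumption that its input has already had trigraphs replaced; `splice_lines` is a hypothetical name.

```c
#include <stddef.h>

/* Copy src to dst, deleting every backslash that is immediately followed
   by a newline, together with that newline. dst needs strlen(src) + 1
   bytes. Returns the length of the result. */
size_t splice_lines(const char *src, char *dst)
{
    size_t out = 0;
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;               /* join this line with the next one */
        } else {
            dst[out++] = *src++;
        }
    }
    dst[out] = '\0';
    return out;
}
```

Applied to a line comment that ends in a backslash, the following statement becomes part of the comment; applied to the three-line block comment, the pieces join into an ordinary /* A comment */.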
Digraph | Equivalent |
---|---|
<: | [ |
:> | ] |
<% | { |
%> | } |
%: | # |
In 1994, a normative amendment to the C standard, later included in C99, supplied digraphs as more readable alternatives to five of the trigraphs. They are listed in the table above.
Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token %:%: replacing the preprocessor concatenation token ##. If a digraph sequence occurs inside another token, for example a quoted string, or a character constant, it will not be replaced.
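As an illustration, here is a short fragment written with digraphs throughout. It is a sketch that assumes a compiler accepting C99 or later with digraph support left enabled; the function names `sum` and `not_a_digraph` are invented for the example.

```c
%:include <stddef.h>

/* Sum an array; digraphs stand in for the bracket, brace and hash tokens. */
int sum(const int values<::>, size_t count)
<%
    int total = 0;
    for (size_t i = 0; i < count; i++) <%
        total += values<:i:>;
    %>
    return total;
%>

/* Inside a string literal the two characters stay exactly as written,
   because they are part of a larger token. */
const char *not_a_digraph(void)
<%
    return "<:";
%>
```

The string returned by `not_a_digraph` still holds a less-than sign followed by a colon, which shows the difference from trigraphs: those would have been replaced even inside the literal.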
C++
Token | Equivalent |
---|---|
%:%: | ## |
compl | ~ |
not | ! |
bitand | & |
bitor | | |
and | && |
or | || |
xor | ^ |
and_eq | &= |
or_eq | |= |
xor_eq | ^= |
not_eq | != |
C++ (through C++14, see below) behaves like C, including the C99 additions, but with additional tokens listed in the table above.
Note that %:%: is treated as a single token, rather than two occurrences of %:.
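The word-like spellings in the table can be tried from C as well, with one difference worth stating: in C++ they are built-in tokens, whereas in C they are macros provided by the standard header `<iso646.h>`. The sketch below uses that header; the function names `exactly_one` and `clear_bits` are hypothetical.

```c
#include <iso646.h>

/* Returns 1 when exactly one of the two flags is non-zero, using the
   alternative spellings for &&, || and !. */
int exactly_one(int a, int b)
{
    return (a or b) and not (a and b);
}

/* Clears the bits of mask from value, using the spellings for & and ~. */
unsigned clear_bits(unsigned value, unsigned mask)
{
    return value bitand compl mask;
}
```

The same function bodies should also compile as C++ without the header doing any work, since there the spellings are recognised by the language itself.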
The C++ Standard makes this comment with regard to the term "digraph":[3]
The term "digraph" (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren’t lexical keywords are colloquially known as "digraphs".
Removal of trigraphs
Trigraphs were proposed for deprecation in C++0x, which was released as C++11.[4] This was opposed by IBM, speaking on behalf of itself and other users of C++,[5] and as a result trigraphs were retained in C++0x. Removal of trigraphs (not merely deprecation) was then proposed for C++17.[6] This passed a committee vote, and trigraphs (but not the additional tokens) were removed in C++17 despite the opposition from IBM.[7] Existing code that uses trigraphs can still be supported by having the implementation translate trigraphs while mapping the physical source files to the basic source character set, which does not include trigraphs.[6]
RPL
Hewlett-Packard calculators supporting the RPL language and input method provide support for a large number of trigraphs (also called TIO codes) to reliably transcribe non-seven-bit ASCII characters of the calculators' extended character set on foreign platforms, and to ease keyboard input without using the CHARS application.[8]
References
1. Jones, Derek M. "sentence 117". The New C Standard: An Economic and Cultural Commentary.
2. Hui, Roger. "Vocabulary". jsoftware.com. Retrieved 16 April 2015.
3. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf
4. C++0X, CD 1, National Body Comments, SC22/WG21 N2837, 2009-01-30, comment UK 11.
5. Wong, Michael; Tong, Hubert; Klarer, Robert; McIntosh, Ian; Mak, Raymond; Cambly, Christopher; LaBonté, Alain. "Comment on Proposed Trigraph Deprecation". N2910, 2009-06-19.
6. Smith, Richard. "Removing trigraphs??!". N3981, 2014-05-06.
7. Wong, Michael; Tong, Hubert; Bhakta, Rajan; Inglis, Derek. "IBM comment on preparing for a Trigraph-adverse future in C++17". IBM paper N4210, 2014-10-10.
8. "HP RPL TIO Table". holyjoe.org. Retrieved 23 January 2015.