Terminal and nonterminal symbols

In computer science, terminal and nonterminal symbols are the lexical elements used in specifying the production rules that constitute a formal grammar. The terminals and nonterminals of a particular grammar are two disjoint sets.

Contents

Terminal symbols

Terminal symbols are literal characters that can appear in the inputs to or outputs from the production rules of a formal grammar and that cannot be broken down into "smaller" units. To be precise, terminal symbols cannot be changed using the rules of the grammar. For example, a grammar that is defined by two rules:

  1. x can become xa
  2. x can become ax

has a as a terminal symbol because no rule exists that would change it to something else. (On the other hand, x has two rules that can change it, so it is nonterminal.) A formal language defined (or generated) by a particular grammar is the set of strings that can be produced by the grammar and that consist only of terminal symbols; nonterminals that do not consist entirely of terminals may not appear in the lexemes that are said to belong to the language.

In the context of syntax analysis, as opposed to the theory of programming languages and compilers, the terms "terminal symbol" and "token" are often treated as synonymous. Quoting the so-called Dragon Book (a standard reference on the latter subject):

In a compiler, the lexical analyzer reads the characters of the source program, groups them into lexically meaningful units called lexemes, and produces as output tokens representing these lexemes. A token consists of two components, a token name and an attribute value. The token names are abstract symbols that are used by the parser for syntax analysis. Often, we shall call these token names terminals, since they appear as terminal symbols in the grammar for a programming language. The attribute value, if present, is a pointer to the symbol table that contains additional information about the token. This additional information is not part of the grammar, so in our discussion of syntax analysis, often we refer to tokens and terminals synonymously. [1]

Terminal symbols, or just terminals, are the elementary symbols of the language defined by a formal grammar.

Nonterminal symbols

Nonterminal symbols, or just nonterminals, are the symbols which can be replaced; thus there are strings composed of some combination of terminal and nonterminal symbols. They may also be called simply syntactic variables. A formal grammar includes a start symbol, a designated member of the set of nonterminals from which all the strings in the language may be derived by successive applications of the production rules. In fact, the language defined by a grammar is precisely the set of terminal strings that can be so derived.

Context-free grammars are those grammars in which the left-hand side of each production rule consists of only a single nonterminal symbol. This restriction is non-trivial; not all languages can be generated by context-free grammars. Those that can are called context-free languages. These are exactly the languages that can be recognized by a non-deterministic pushdown automaton. Context-free languages are the theoretical basis for the syntax of most programming languages.

Production rules

A grammar is defined by production rules that specify which lexemes may replace which other lexemes; these rules may be used to generate strings, or to parse them. Each such rule has a head, or left-hand side, which consists of the string that may be replaced, and a body, or right-hand side, which consists of a string that may replace it. Rules are often written in the form <head \rarr body>; e.g., the rule z0 → z1 specifies that z0 can be replaced by z1.

In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s,[2][3] a grammar G consists of the following components:

(\Sigma \cup N)^{*} N (\Sigma \cup N)^{*} \rightarrow (\Sigma \cup N)^{*}
where {}^{*} is the Kleene star operator and \cup denotes set union, so (\Sigma \cup N)^{*} represents zero or more symbols, and N means one nonterminal symbol. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. In the case that the body consists solely of the empty string—i.e., that it contains no symbols at all—it may be denoted with a special notation (often \Lambda, e or \epsilon) in order to avoid confusion.

A grammar is formally defined as the ordered quadruple <N, \Sigma, P, S>. Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature.[4][5]

Example

For instance, the following represents an integer (which may be signed) expressed in a variant of Backus–Naur form:

<integer> ::= ['-'] <digit> {<digit>}
<digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

In this example, the symbols (-,0,1,2,3,4,5,6,7,8,9) are terminal symbols and <digit> and <integer> are nonterminal symbols.

Notes

  1. ^ Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools, second edition; Pearson/Addison-Wesley, 2006. Box, p. 43.
  2. ^ Chomsky, Noam (1956). "Three Models for the Description of Language". IRE Transactions on Information Theory 2 (2): 113–123. doi:10.1109/TIT.1956.1056813. 
  3. ^ Chomsky, Noam (1957). Syntactic Structures. The Hague: Mouton. 
  4. ^ Ginsburg, Seymour (1975). Algebraic and automata theoretic properties of formal languages. North-Holland. pp. 8–9. ISBN 0720425069. 
  5. ^ Harrison, Michael A. (1978). Introduction to Formal Language Theory. Reading, Mass.: Addison-Wesley Publishing Company. pp. 13. ISBN 0201029553. 

References