Chomsky normal form

From Wikipedia, the free encyclopedia

In computer science, a formal grammar is in Chomsky normal form if and only if all production rules are of the form:

ABC or
A → α or
S → ε

where A, B and C are nonterminal symbols, α is a terminal symbol (a symbol that represents a constant value), S is the start symbol, and ε is the empty string. Also, neither B nor C may be the start symbol.

Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be efficiently transformed into an equivalent one which is in Chomsky normal form.

With the exception of the optional rule S → ε (included when the grammar may generate the empty string), all rules of a grammar in Chomsky normal form are expansive; thus, throughout the derivation of a string, each string of terminals and nonterminals is always either the same length or one element longer than the previous such string. The derivation of a string of length n is always exactly 2n-1 steps long. Furthermore, since all rules deriving nonterminals transform one nonterminal to exactly two nonterminals, a parse tree based on a grammar in Chomsky normal form is a binary tree, and the height of this tree is limited to at most the length of the string.

Because of these properties, many proofs in the field of languages and computability make use of the Chomsky normal form. These properties also yield various efficient algorithms based on grammars in Chomsky normal form; for example, the CYK algorithm that decides whether a given string can be generated by a given grammar uses the Chomsky normal form.

The Chomsky normal form is named after Noam Chomsky, the US linguist who invented the Chomsky hierarchy.

[edit] Alternative Definition

Some sources define Chomsky normal form in the following slightly different way:

A formal grammar is in Chomsky normal form if and only if all production rules are of the form:

ABC or
A → α

where A, B and C are nonterminal symbols, and α is a terminal symbol. When using this definition, B or C may be the start symbol.

This definition differs from the previous one in that it precludes the possibility that the grammar will generate the empty string, ε. It remains true that any context-free accepting a language L can be efficiently transformed into a grammar in Chomsky normal form that accepts L - {ε}. The principle advantage of this later definition is that proofs are generally marginally simpler, due to the fact that each step in a derivation never decreases the length of the resulting string. Its disadvantage, of course, is that special consideration is needed if the original grammar generated ε.

[edit] See also

[edit] References

  • Michael Sipser (1997). Introduction to the Theory of Computation. PWS Publishing. ISBN 0-534-94728-X. Pages 98–101 of section 2.1: context-free grammars. Page 156.
  • John Martin (2003). Introduction to Languages and the Theory of Computation. McGraw Hill. ISBN 0-07-232200-4. Pages 237–240 of section 6.6: simplified forms and normal forms.
In other languages