Chomsky normal form

In formal language theory, a context-free grammar is said to be in Chomsky normal form (discovered by Noam Chomsky^[1]) if all of its production rules are of the form:^[2]

A \rightarrow BC

A \rightarrow a

S \rightarrow \varepsilon

where $A$ , $B$ and $C$ are nonterminal symbols, $a$ is a terminal symbol (a symbol that represents a constant value), $S$ is the start symbol, and $\varepsilon$ is the empty string. Also, neither $B$ nor $C$ may be the start symbol, and the third production rule can only appear if $\varepsilon$ is in $L(G)$ , namely, the language produced by the context-free grammar $G$ .

Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one which is in Chomsky normal form. Several algorithms for performing such a transformation are known. Transformations are described in most textbooks on automata theory, such as Hopcroft et al., 2006.^[3] As pointed out by Lange and Leiß,^[4] the drawback of these transformations is that they can lead to an undesirable bloat in grammar size. The size of a grammar is the sum of the sizes of its production rules, where the size of a rule is one plus the length of its right-hand side. Using $|G|$ to denote the size of the original grammar $G$ , the size blow-up in the worst case may range from $|G|^2$ to $2^{2 |G|}$ , depending on the transformation algorithm used.^[5]

Alternative definition

Chomsky reduced form

Another way^[6] to define the Chomsky normal form is:

A formal grammar is in Chomsky reduced form if all of its production rules are of the form:

A \rightarrow\, BC

A \rightarrow\, a

where $A$ , $B$ and $C$ are nonterminal symbols, and $a$ is a terminal symbol. When using this definition, $B$ or $C$ may be the start symbol. Only those context-free grammars which do not generate the empty string can be transformed into Chomsky reduced form.

Floyd normal form

In a paper where he proposed a term Backus-Naur Form (BNF), Donald E. Knuth implied a BNF "syntax in which all definitions have such a form may be said to be in "Floyd Normal Form","

\langle A \rangle ::= \, \langle B \rangle \mid \langle C \rangle

\langle A \rangle ::= \, \langle B \rangle \langle C \rangle

\langle A \rangle ::=\, a

where $\langle A \rangle$ , $\langle B \rangle$ and $\langle C \rangle$ are nonterminal symbols, and $a$ is a terminal symbol, because Robert W. Floyd found any BNF syntax can be converted to the above one in 1961.^[7] But he withdrew this term, "since doubtless many people have independently used this simple fact in their own work, and the point is only incidental to the main considerations of Floyd's note."^[7]

Converting a grammar to Chomsky Normal Form

Introduce $S_0$

Introduce a new start variable, $S_0$ and a new rule $S_0 \rightarrow S$ where $S$ is the previous start variable.

Eliminate all $\varepsilon$ rules

$\varepsilon$ rules are rules of the form $A \rightarrow \varepsilon$ , where $A \not= S_0$ and $A \in V$ , where $V$ is the CFG's variable alphabet.

Remove every rule with $\varepsilon$ on its right hand side (RHS). For each rule with $A$ in its RHS, add a set of new rules consisting of the different possible combinations of $A$ replaced or not replaced with $\varepsilon$ . If a rule has $A$ as a singleton on its RHS, add a new rule $R = A \rightarrow \varepsilon$ unless $R$ has already been removed through this process. For example, examine the following grammar $G$ :
$S \rightarrow AbA \mid B$

$B \rightarrow b \mid c$

$A \rightarrow \varepsilon$

$G$ has one $\varepsilon$ rule. When the $A \rightarrow \varepsilon$ is removed, we get the following:
$S \rightarrow AbA \mid Ab \mid bA \mid b \mid B$

$B \rightarrow b \mid c$

Notice that we have to account for all possibilities of $A \rightarrow \varepsilon$ and so we actually end up adding 3 rules.
Eliminate all unit rules
$A \rightarrow B; A,B \in V$

After all the $\varepsilon$ rules have been removed, you can begin removing unit rules, or rules whose RHS contains one variable and no terminals (which is inconsistent with CNF).
To remove $A \rightarrow B$

$\forall B \rightarrow U$ , where $U$ is a string of variables and terminals, add rule $A \rightarrow U$ unless this is a unit rule which has already been removed.

Clean up the remaining rules that are not in Chomsky normal form.
Replace $A \rightarrow u_1 u_2 \dotso u_k, k \ge 3, u_1 \in V \cup \Sigma$ with $A \rightarrow u_1 A_1 , A_1 \rightarrow u_2 A_2 , \dotsc , A_{k-2} \rightarrow u_{k-1} u_k$ , where $A_i$ are new variables.

If $u_i \in \Sigma$ , replace $u_i$ in above rules with some new variable $V_i$ and add rule $V_i \rightarrow u_i$ .

Example

Abstract syntax tree of the arithmetic expression "a^2+4*b" wrt. the example grammar (top) and its Chomsky normal form (bottom)

The following grammar, with start symbol Expr, describes a simplified version of the set of all syntactical valid arithmetic expressions in imperative programming languages like C or Algol60. Both Number and Variable are considered terminal symbols here for simplicity, since in a compiler front-end their internal structure is usually not considered by the parser. The terminal symbol "^" denoted exponentiation in Algol60.

Expr	→ Term	\| Expr AddOp Term	\| AddOp Term
Term	→ Factor	\| Term MulOp Factor
Factor	→ Primary	\| Factor ^ Primary
Primary	→ Number	\| Variable	\| ( Expr )
AddOp	→ +	\| −
MulOp	→ *	\| /

In step 1 of the above conversion algorithm, just a rule S₀→Expr is added to the grammar. Since there are no ε rules, step 2 can be skipped. After step 3, the following grammar is obtained:

S₀	→ Number	\| Variable	\| ( Expr )	\| Factor ^ Primary	\| Term MulOp Factor	\| Expr AddOp Term	\| AddOp Term
Expr	→ Number	\| Variable	\| ( Expr )	\| Factor ^ Primary	\| Term MulOp Factor	\| Expr AddOp Term	\| AddOp Term
Term	→ Number	\| Variable	\| ( Expr )	\| Factor ^ Primary	\| Term MulOp Factor
Factor	→ Number	\| Variable	\| ( Expr )	\| Factor ^ Primary
Primary	→ Number	\| Variable	\| ( Expr )
AddOp	→ +	\| −
MulOp	→ *	\| /

After step 4, the following grammar is obtained, which is in Chomsky normal form:

S₀	→ Number	\| Variable	\| Open Expr_Close	\| Factor PowOp_Primary	\| Term MulOp_Factor	\| Expr AddOp_Term	\| AddOp Term
Expr	→ Number	\| Variable	\| Open Expr_Close	\| Factor PowOp_Primary	\| Term MulOp_Factor	\| Expr AddOp_Term	\| AddOp Term
Term	→ Number	\| Variable	\| Open Expr_Close	\| Factor PowOp_Primary	\| Term MulOp_Factor
Factor	→ Number	\| Variable	\| Open Expr_Close	\| Factor PowOp_Primary
Primary	→ Number	\| Variable	\| Open Expr_Close
AddOp	→ +	\| −
MulOp	→ *	\| /
Expr_Close	→ Expr Close
PowOp_Primary	→ PowOp Primary
MulOp_Factor	→ MulOp Factor
AddOp_Term	→ AddOp Term
Open	→ (
Close	→ )
PowOp	→ ^

The A_i introduced in step 4 are Expr_Close, PowOp_Primary, MulOp_Factor, and AddOp_Term. The V_i are Open, Close, and PowOp. The new rules have been appended at the end of the grammar.

Application

Besides its theoretical significance, CNF conversion is used in some algorithms as a preprocessing step, e.g., the CYK algorithm, a bottom-up parsing for context-free grammars, and its variant probabilistic CKY.^[8]

References

↑ Chomsky, Noam (1959). "On Certain Formal Properties of Grammars" (PDF). Information and Control 2: 137–167. doi:10.1016/s0019-9958(59)90362-6.
↑ Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages and Computation. Reading, Massachusetts: Addison-Wesley Publishing. ISBN 0-201-02988-X. Section 4.5, Theorem 4.5, p.92; see also the bibliographic notes on p.106
↑ Hopcroft, John E.; Motwani, Rajeev; Ullman, Jeffrey D. (2006). Introduction to Automata Theory, Languages, and Computation (3rd ed.). Addison-Wesley. ISBN 0-321-45536-3. Section 7.1.5, p.272
↑ Lange, Martin; Leiß, Hans (2009). "To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm" (PDF). Informatica Didactica 8.
↑ Lange, Leiß (2009), Table 3, p.7
↑ Hopcroft, Ullman (1979); Hopcroft et al. (2006)
↑ 7.0 7.1 Knuth, Donald E. 1964. Backus Normal Form vs. Backus Naur Form. Communications of the ACM, 7(12): 735–736, December.
↑ Jurafsky, Daniel; Martin, James H. (2008). Speech and Language Processing (2nd ed.). Pearson Prentice Hall. p. 465. ISBN 978-0-13-187321-6.