Chomsky–Schützenberger enumeration theorem

In formal language theory, the ChomskySchützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed.

A power series over \mathbb{N} is an infinite series of the form

f = f(x) = \sum_{k=0}^\infty a_k x^k = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots

with coefficients a_k in \mathbb{N}. The multiplication of two formal power series f and g is defined in the expected way as the convolution of the sequences a_n and b_n:

f(x)\cdot g(x) = \sum_{k=0}^\infty \left(\sum_{i=0}^k a_i b_{k-i}\right) x^k.

In particular, we write f^2 = f(x)\cdot f(x), f^3 = f(x)\cdot f(x)\cdot f(x), and so on. In analogy to algebraic numbers, a power series f(x) is called algebraic over \mathbb{Q}(x), if there exists a finite set of polynomials p_0(x), p_1(x), p_2(x), \ldots , p_n(x) each with rational coefficients such that

p_0(x) + p_1(x) \cdot f + p_2(x)\cdot f^2 + \cdots + p_n(x)\cdot f^n = 0.

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. Having established the necessary notions, the theorem is stated as follows.

ChomskySchützenberger theorem. If L is a context-free language admitting an unambiguous context-free grammar, and a_k := | L \ \cap \Sigma^k | is the number of words of length k in L, then G(x)=\sum_{k = 0}^\infty a_k x^k is a power series over \mathbb{N} that is algebraic over \mathbb{Q}(x).

Proofs of this theorem are given by Kuich & Salomaa (1985), and by Panholzer (2005).

Usage

Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by Gruber, Lee & Shallit (2012): The unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

S → M | U
M → 0M1M | ε
U → S | 0M1U.


To obtain an algebraic representation of the power series G(x) associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by 'x', each occurrence of 'ε' by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:

S = M + U
M = M²x² + 1
U = Sx + MUx²

In this system of equations, S, M, and U are functions of x, so one could also write S(x), M(x), and U(x). The equation system can be resolved after S, resulting in a single algebraic equation:

x(2x-1)S^2 + (2x-1)S +1 = 0.

This quadratic equation has two solutions for S, one of which is the algebraic power series G(x). By applying methods from complex analysis to this equation, the number a_n of words of length n generated by G can be estimated, as n grows large. In this case, one obtains a_n \in O(2+\epsilon)^n but a_n \notin O(2-\epsilon)^n for each \epsilon>0. See (Gruber, Lee & Shallit 2012) for a detailed exposition.

Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language L_G over the alphabet \{a,b\} consists of the words a^{n_1}ba^{n_2}b\cdots a^{n_p}b with p\ge 1, n_i>0 for i \in \{1,2,\ldots,p\}, and n_j \neq j for some j \in \{1,2,\ldots,p\}.

It is comparably easy to show that the language L_G is context-free (Berstel & Boasson 1990). The harder part is to show that there does not exist an unambiguous grammar that generates L_G. This can be proved as follows: If g_k denotes the number of words of length k in L_G, then for the associated power series holds G(x) = \sum_{k=0}^\infty g_k x^k = \frac{1-x}{1-2x}- \frac1x \sum_{k \ge 1} x^{k(k+1)/2-1} . Using methods from complex analysis, one can prove that this function is not algebraic over \mathbb{Q}(x). By the Chomsky-Schützenberger theorem, one can conclude that L_G does not admit an unambiguous context-free grammar. See (Berstel & Boasson 1990) for detailed account.

References

Berstel, Jean; Boasson, Luc (1990). "Context-free languages" (PDF). In van Leeuwen, Jan. Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics. Elsevier and MIT press. pp. 59–102. ISBN 0-444-88074-7. 
Chomsky, Noam; Schützenberger, Marcel-Paul (1963). "The Algebraic Theory of Context-Free Languages" (PDF). In P. Braffort and D. Hirschberg, eds., Computer Programming and Formal Systems (pp. 118161). Amsterdam: North-Holland. 
Flajolet, Philippe; Sedgewick, Robert (2009). Analytic Combinatorics (PDF). Cambridge: Cambridge University Press. ISBN 978-0-521-89806-5. 
Gruber, Hermann; Lee, Jonathan; Shallit, Jeffrey (2012). "Enumerating regular expressions and their languages". arXiv:1204.4982 [cs.FL]. 
Kuich, Werner; Salomaa, Arto (1985). Semirings, Automata, Languages. Berlin: Springer-Verlag. ISBN 978-3-642-69961-0. 
Panholzer, Alois (2005). "Gröbner Bases and the Defining Polynomial of a Context-free Grammar Generating Function". Journal of Automata, Languages and Combinatorics 10: 7997. 
This article is issued from Wikipedia - version of the Saturday, February 21, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.