Chomsky Normal Form
Chomsky Normal Form
Chomsky Normal Form
There are plenty of proofs around context-free grammars, including reducability and
equivalence to automata. Those are the simpler the more restricted the set of
grammars you have to deal with is. Therefore, normal forms can be helpful there.
As a concretae example, Greibach normal form is used to show (constructively) that
there is an εε-transition-free PDA for every CFL (that does not contain εε).
2. Enables parsing
While PDAs can be used to parse words with any grammar, this is often
inconvenient. Normal forms can give us more structure to work with, resulting in
easier parsing algorithms.
For real-world programming languages, the reference CFG is often ambiguous, due to
issues such as the dangling else problem. If present, these ambiguities are generally
resolved by adding precedence rules or other context-sensitive parsing rules, so the
overall phrase grammar is unambiguous.
Example 2 says that the grammar in example 1 is ambiguous grammar. There are two
causes of ambiguity in grammar in example 1.
1. The precedence of operators is not respected. While Figure 1(a) properly groups
* before + operator, Figure 1(b) is also a valid perse tree and groups the + ahead of the
*. We need to force only the structure of Figure 1(a) to be legal
in an unambiguous grammar.
2. A sequence of identical operators can group either from the left or from the right.
For example, if *’s in Figure 1 were replaced by +’s, we would see two different parse
trees for the string E + E + E. Since addition and multiplication are associative, it does
not matter whether we group from the left or the right, but to eliminate ambiguity, we
must pick one. The conventional approach is to insist on grouping from the left, so the
structure of Figure 1(b) is the only correct grouping of two + signs.
Example 3: The grammar in Table 1 allows only one parse tree for the string
a + a ∗ a; it is shown in the Figure 2.
The fact that the grammar in Table 1 is unambiguous may be far from obvious.
Here are the key observations that explain why no string in the language can have two
different parse trees.
• Any string derived from T, a term, must be a sequence of one or more factors,
connected by *’s. A factor is either a single identifier or any parenthesized expression.
• Because of the form of the two rules for T, the only parse tree for a sequence of
factors is the one that breaks f1 ∗ f2 ∗ . . . ∗ fn, for n > 1 into a term f1 ∗ f2 ∗ . . . ∗ fn−1
and a factor fn. The reason is that F cannot derive expression like fn−1 ∗ fn without
introducing parentheses around them. Thus, it is not possible that when using the
production T → T ∗ F, the F derives anything but the last of the factor. That is, the parse
tree for a term can only look like Figure 3.
Resources:
https://en.wikipedia.org/wiki/Ambiguous_grammar#Recognizing_ambiguous_grammars
http://cs.stackexchange.com/questions/10468/the-importance-of-normal-forms-like-
chomsky-normal-form-for-cfgs
https://courses.cs.washington.edu/courses/cse322/08au/lec14.pdf
http://www.iitg.ernet.in/gkd/ma513/sep/sep27/note-0.pdf