Chomsky Normal Form

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 6

1.

Chomsky Normal form


- A CFG is in Chomsky normal form when every rule is of the form A → BC and
A → a, where a is a terminal, and A, B, and C are variables. Further B and C
are not the start variable. Additionally we permit the rule S → ε where S is the
start variable, for technical reasons.
Why:
1. Simplicity of proofs

There are plenty of proofs around context-free grammars, including reducability and
equivalence to automata. Those are the simpler the more restricted the set of
grammars you have to deal with is. Therefore, normal forms can be helpful there.
As a concretae example, Greibach normal form is used to show (constructively) that
there is an εε-transition-free PDA for every CFL (that does not contain εε).
2. Enables parsing

While PDAs can be used to parse words with any grammar, this is often
inconvenient. Normal forms can give us more structure to work with, resulting in
easier parsing algorithms.

EXAMPLE CONVERSION TO CHOMSKY NORMAL FORM


Let’s work out an example. Consider the grammar
S → ASB
A → aAS|a|ε
B → SbS|A|bb
First we add a new start state:
S0 → S
S → ASB
A → aAS|a|ε
B → SbS|A|bb
Next we need to eliminate the ε rules. Eliminating A → ε yields
S0 → S
S → ASB|SB
A → aAS|a|aS
B → SbS|A|bb|ε
Now we have a new ε rule., B → ε. Lets remove it
S0 → S
S → ASB|SB|S|AS
A → aAS|a|aS
B → SbS|A|bb
Next we need to remove all unit rules. Lets begin by removing B → A:
S0 → S
S → ASB|SB|S|AS
A → aAS|a|aS
B → SbS|bb|aAS|a|aS
Next lets remove S → S:
S0 → S
S → ASB|SB|AS
A → aAS|a|aS
B → SbS|bb|aAs|a|aS
Further we can eliminate S0 → S:
S0 → ASB|SB|AS
S → ASB|SB|AS
A → aAS|a|aS
B → SbS|bb|aAs|a|aS
3
Now we need to take care of the rules with more than three symbols. First replace S0 →
ASB by S0 → AU1 and
U1 → SB:
S0 → AU1|SB|AS
S → ASB|SB|AS
A → aAS|a|aS
B → SbS|bb|aAs|a|aS
U1 → SB
Next eliminate S → ASB in a similar form (technically we could reuse U1, but lets not):
S0 → AU1|SB|AS
S → AU2|SB|AS
A → aAS|a|aS
B → SbS|bb|aAs|a|aS
U1 → SB
U2 → SB
Onward and upward, now fix A → aAS by introducing A → aU3 and U3 → AS.
S0 → AU1|SB|AS
S → AU2|SB|AS
A → aU3|a|aS
B → SbS|bb|aAs|a|aS
U1 → SB
U2 → SB
U3 → AS
Finally, fix the two B → rules:
S0 → AU1|SB|AS
S → AU2|SB|AS
A → aU3|a|aS
B → SU4|bb|aU5|a|aS
U1 → SB
U2 → SB
U3 → AS
U4 → bS
U5 → AS
Finally we need to work with the rules which have terminals and variables or two
terminals. We need to introduce
new variables for these. Let these be V1 → a and V2 → b:
S0 → AU1|SB|AS
S → AU2|SB|AS
A → V1U3|a|V1S
B → SU4|V2V2|V1U5|a|V1S
U1 → SB
U2 → SB
U3 → AS
U4 → V2S
U5 → AS
V1 → a
V2 → b

2. AMBIGUITY in Context Free Language


In computer science, an ambiguous grammar is a context-free grammar for which there
exists a string that can have more than one leftmost derivation, while an unambiguous
grammar is a context-free grammar for which every valid string has a unique leftmost
derivation. Many languages admit both ambiguous and unambiguous grammars, while
some languages admit only ambiguous grammars. Any non-empty language admits an
ambiguous grammar by taking an unambiguous grammar and introducing a duplicate
rule or synonym (the only language without ambiguous grammars is the empty
language). A language that only admits ambiguous grammars is called an inherently
ambiguous language, and there are inherently ambiguous context-free languages.
Deterministic context-free grammars are always unambiguous, and are an important
subclass of unambiguous CFGs; there are non-deterministic unambiguous CFGs,
however.

For real-world programming languages, the reference CFG is often ambiguous, due to
issues such as the dangling else problem. If present, these ambiguities are generally
resolved by adding precedence rules or other context-sensitive parsing rules, so the
overall phrase grammar is unambiguous.

Recognizing ambiguous grammars


The general decision problem of whether a grammar is ambiguous is undecidable
because it can be shown that it is equivalent to the Post correspondence problem. At
least, there are tools implementing some semi-decision procedure for detecting
ambiguity of context-free grammars.

The efficiency of context-free grammar parsing is determined by the automaton that


accepts it. Deterministic context-free grammars are accepted by deterministic
pushdown automata and can be parsed in linear time, for example by the LR parser.
This is a subset of the context-free grammars which are accepted by the pushdown
automaton and can be parsed in polynomial time, for example by the CYK algorithm.
Unambiguous context-free grammars can be nondeterministic. For example, the
language of even-length palindromes on the alphabet of 0 and 1 has the unambiguous
context-free grammar S → 0S0 | 1S1 | ε. An arbitrary string of this language cannot be
parsed without reading all its letters first which means that a pushdown automaton has
to try alternative state transitions to accommodate for the different possible lengths of a
semi-parsed string.[4] Nevertheless, removing grammar ambiguity may produce a
deterministic context-free grammar and thus allow for more efficient parsing. Compiler
generators such as YACC include features for resolving some kinds of ambiguity, such
as by using the precedence and associativity constraints.

Removing Ambiguity From Grammars


Example 1: Consider the following grammar for expressions:
E → I/E + E/E ∗ E/(E)
I → a/b/Ia/Ib/I0/I1
Example 2: Consider the sentential form E + E ∗ E. It has two derivations from
E (see grammar in Example 1):
1. E ⇒ E + E ⇒ E + E ∗ E.
2. E ⇒ E ∗ E ⇒ E + E ∗ E.

Figure 1: Two parse trees with the same yield E + E ∗ E

Example 2 says that the grammar in example 1 is ambiguous grammar. There are two
causes of ambiguity in grammar in example 1.

1. The precedence of operators is not respected. While Figure 1(a) properly groups
* before + operator, Figure 1(b) is also a valid perse tree and groups the + ahead of the
*. We need to force only the structure of Figure 1(a) to be legal
in an unambiguous grammar.
2. A sequence of identical operators can group either from the left or from the right.
For example, if *’s in Figure 1 were replaced by +’s, we would see two different parse
trees for the string E + E + E. Since addition and multiplication are associative, it does
not matter whether we group from the left or the right, but to eliminate ambiguity, we
must pick one. The conventional approach is to insist on grouping from the left, so the
structure of Figure 1(b) is the only correct grouping of two + signs.

The solution to the problem of enforcing precedence is to introduce several different


variables, each of which represents those expressions that share a level of binding
strength. Specially:
1. A factor is an expression that cannot be broken apart by any adjacent operator, either
a * or a +. The only factors in our expression language are:
(a) Identifiers. It is not possible to separate the letters of an identifier by attaching an
operator.
(b) Any parenthesized expression, no matter what appears inside the parentheses.
It is the purpose of parentheses to prevent what is inside from becoming the operand of
any operator outside the parentheses.
2. A term is an expression that cannot be broken by the + operator. In our example,
where + and * are the only operators, a term is a product of one or more factors. For
instance, the term a ∗ b can be broken if we use left associativity and place a1∗ to its
left. That is, a1 ∗ a ∗ b is grouped (a1 ∗ a) ∗ b, which breaks apart the a ∗ b. However,
placing additive term, such as a1+ to its left or +a1 to its right cannot break a ∗ b. The
proper grouping of a1 +a ∗ b is a1 + (a ∗ b), and the proper grouping of a ∗ b + a1 is
(a*b)+a1.
3. An expression will henceforth refer to any possible expression, including those that
can be broken by either an adjacent * or an adjacent +. Thus, an expression for our
example is a sum of one or more terms.
From the above discussion, we can write an unambiguous expression grammar as
follows:
Table 1: An unambiguous expression grammar
E → T/E + T
T → F/T ∗ F
F → I/(E)
I → a/b/Ia/Ib/I0/I1

Example 3: The grammar in Table 1 allows only one parse tree for the string
a + a ∗ a; it is shown in the Figure 2.
The fact that the grammar in Table 1 is unambiguous may be far from obvious.
Here are the key observations that explain why no string in the language can have two
different parse trees.
• Any string derived from T, a term, must be a sequence of one or more factors,
connected by *’s. A factor is either a single identifier or any parenthesized expression.

Figure 2: The sole parse tree for a + a ∗ a

• Because of the form of the two rules for T, the only parse tree for a sequence of
factors is the one that breaks f1 ∗ f2 ∗ . . . ∗ fn, for n > 1 into a term f1 ∗ f2 ∗ . . . ∗ fn−1
and a factor fn. The reason is that F cannot derive expression like fn−1 ∗ fn without
introducing parentheses around them. Thus, it is not possible that when using the
production T → T ∗ F, the F derives anything but the last of the factor. That is, the parse
tree for a term can only look like Figure 3.

Figure 3: The form of all parse trees for a term


• Likewise an expression is a sequence of terms connected by +. When we use the
production E → E + T to derive t1 + t2 + . . ., tn, then T must derive only tn, and the E in
the body derives t1 + t2 + . . .tn−1. The reason, again, is that T cannot derive the sum of
two or more terms without putting parentheses around them.

Resources:
https://en.wikipedia.org/wiki/Ambiguous_grammar#Recognizing_ambiguous_grammars
http://cs.stackexchange.com/questions/10468/the-importance-of-normal-forms-like-
chomsky-normal-form-for-cfgs
https://courses.cs.washington.edu/courses/cse322/08au/lec14.pdf
http://www.iitg.ernet.in/gkd/ma513/sep/sep27/note-0.pdf

You might also like