CFG: Parsing
Recognition of strings in a language

• Generative aspect of CFG: By now it should be clear how, from a CFG G, you can derive strings w ∈ L(G).
• Analytical aspect: Given a CFG G and a string w, how do you decide if w ∈ L(G) and, if so, how do you determine the derivation tree or the sequence of production rules that produce w? This is called the problem of parsing.

Parser
• A program that determines if a string w ∈ L(G) by constructing a derivation. Equivalently, it searches the graph of G.
  – Top-down parsers
    • Construct the derivation tree from root to leaves.
    • Leftmost derivation.
  – Bottom-up parsers
    • Construct the derivation tree from leaves to root.
    • Rightmost derivation in reverse.

Parse trees (= Derivation Trees)
A parse tree is a graphical representation of a derivation sequence of a sentential form.
Tree nodes represent symbols of the grammar (nonterminals or terminals) and tree edges represent derivation steps.

Parse Tree: Example 1
Given the following grammar:
  E → E + E | E * E | ( E ) | - E | id
Is the string -(id + id) a sentence in this grammar?
Yes, because there is the following derivation:
  E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + id)

Let's examine this derivation:
  E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + id)
[Figure: the parse tree after each derivation step, growing from E to - E, then - ( E ), then - ( E + E ), and finally - ( id + id ).]
This is a top-down derivation because we start building the parse tree at the top.

Parse Tree: Example 2
  S → SS | a | b
  ab ∈ L(S)
Leftmost derivation: S ⇒ SS ⇒ aS ⇒ ab
Rightmost derivation: S ⇒ SS ⇒ Sb ⇒ ab
[Figure: derivation trees for the leftmost derivation, for the rightmost derivation, and for the rightmost derivation in reverse (built bottom-up from the leaves a and b).]

Practical Parsers
• Language/Grammar designed to enable deterministic (directed and backtrack-free) searches.
  – Top-down parsers: LL(k) languages
    • E.g., Pascal, Ada, etc.
    • Better error diagnosis and recovery.
  – Bottom-up parsers: LALR(1), LR(k) languages
    • E.g., C/C++, Java, etc.
    • Handles in the grammar.
  – [...] parsers
    • E.g., Prolog interpreter.

Example 3
Consider the CFG G:
  S → A
  A → T | A + T
  T → b | ( A )
Show that (b)+b ∈ L(G).
[Figure: the parse tree for (b)+b built up step by step.]
One leftmost derivation: S ⇒ A ⇒ A + T ⇒ T + T ⇒ (A) + T ⇒ (T) + T ⇒ (b) + T ⇒ (b) + b.

Top-down Exhaustive Parsing
• Exhaustive parsing is a form of top-down parsing where you start with S and systematically go through all possible (say leftmost) derivations until you produce the string w.
• (You can remove sentential forms that will not work.)
• Example: Can the CFG S → SS | aSb | bSa | λ produce the string w = aabb, and how?
• After one step: S ⇒ SS or aSb or bSa or λ.
• After two steps: S ⇒ SSS or aSbS or bSaS or S, or S ⇒ aSSb or aaSbb or abSab or ab.
• After three steps we see that: S ⇒ aSb ⇒ aaSbb ⇒ aabb.

Flaws of Top-down Exhaustive Parsing
• Obvious flaw: it will take a long time and a lot of memory, even for moderately long strings w. It is inefficient.
• For cases w ∉ L(G) exhaustive parsing may never end. This will especially happen if we have rules like A → λ that make the sentential forms ‘shrink’, so that we will never know if we went ‘too far’ with our parsing attempts.
• Similar problems occur if the parsing can get in a loop according to A ⇒ B ⇒ A ⇒ B…
• Fortunately, it is always possible to remove problematic rules like A → λ and A → B from a CFG G.
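The exhaustive search for w = aabb can be sketched in a few lines of Python. This is a minimal illustration, not a practical parser: the function name `exhaustive_parse`, the pruning tests, and the `max_steps` cut-off are assumptions added here. The cut-off is needed precisely because of the flaw described on the slide: with the rule S → λ the search has no natural stopping point when w ∉ L(G).

```python
from collections import deque

# Grammar from the slide: S -> SS | aSb | bSa | lambda ("" stands for lambda).
RULES = {"S": ["SS", "aSb", "bSa", ""]}

def exhaustive_parse(w, rules=RULES, start="S", max_steps=8):
    """Breadth-first search over leftmost derivations of w.

    Returns the list of sentential forms of one shortest derivation,
    or None if none is found within max_steps (an assumed cut-off).
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        form = path[-1]
        if len(path) - 1 >= max_steps:
            continue
        # Position of the leftmost variable, if any.
        i = next((k for k, s in enumerate(form) if s in rules), None)
        if i is None:
            continue  # all terminals but not w: dead end
        for rhs in rules[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            if new == w:
                return path + [new]
            if new in seen:
                continue
            # "Remove sentential forms that will not work":
            # too many terminals, or the terminal prefix disagrees with w.
            j = next((k for k, s in enumerate(new) if s in rules), len(new))
            terminals = sum(1 for s in new if s not in rules)
            if terminals > len(w) or not w.startswith(new[:j]):
                continue
            seen.add(new)
            queue.append(path + [new])
    return None

print(exhaustive_parse("aabb"))
```

For w = aabb the search returns the three-step derivation shown on the slide. For a string outside the language, such as aab, it only gives up because of the step limit.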

Grammar Ambiguity
Definition: a string is derived ambiguously in a context-free grammar if it has two or more different parse trees.
Definition: a grammar is ambiguous if it generates some string ambiguously.

Grammar Ambiguity
Definition: A string w ∈ L(G) is derived ambiguously if it has more than one derivation tree (or equivalently: if it has more than one leftmost derivation (or rightmost)).
A grammar is ambiguous if some strings are derived ambiguously.
Typical example: rule S → 0 | 1 | S+S | S×S
  S ⇒ S+S ⇒ S×S+S ⇒ 0×S+S ⇒ 0×1+S ⇒ 0×1+1
versus
  S ⇒ S×S ⇒ 0×S ⇒ 0×S+S ⇒ 0×1+S ⇒ 0×1+1
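One way to make "more than one derivation tree" concrete is to count the trees. The sketch below does this for the grammar S → 0 | 1 | S+S | S×S only; the function name `count_parse_trees` is an assumption, and `*` is written in place of × so the input stays plain ASCII.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees of `tokens` for S -> 0 | 1 | S+S | S*S."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of trees deriving tokens[i:j] from S.
        if j - i == 1:
            return 1 if tokens[i] in ("0", "1") else 0
        total = 0
        # Try every operator position k as the root production S -> S op S.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens)) if tokens else 0

print(count_parse_trees("0*1+1"))  # two trees: the string is derived ambiguously
print(count_parse_trees("0+1"))    # one tree, even though it has several derivations
```

A count above 1 for any string is enough to show that the grammar is ambiguous.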

Grammar Ambiguity
The ambiguity of 0×1+1 is shown by the two different parse trees:
[Figure: one tree with + at the root and 0×1 as its left subtree; one tree with × at the root and 1+1 as its right subtree.]

Grammar Ambiguity
Note that the two different derivations
  S ⇒ S+S ⇒ 0+S ⇒ 0+1
and
  S ⇒ S+S ⇒ S+1 ⇒ 0+1
do not make 0+1 an ambiguous string, as they have the same parse tree:
[Figure: the single parse tree of 0+1.]
Ambiguity causes trouble when trying to interpret strings like: “She likes men who love women who don't smoke.”
Solutions: Use parentheses, or use precedence rules such as a+(b×c) = a+b×c ≠ (a+b)×c.

Grammar Ambiguity: Example
  … → … + …
  … → … * …
  … → ( … )
  … → a
Build a parse tree for a + a * a
[Figure: two copies of the token row a + a * a, one for each parse tree.]

Inherently Ambiguous
• Languages that can only be generated by ambiguous grammars are inherently ambiguous.
• Example 5.13: L = {a^n b^n c^m} ∪ {a^n b^m c^m}, that is, L = { a^i b^j c^k | i = j ∨ j = k }.
• The way to make a CFG for this L somehow has to involve the step S → S1 | S2, where S1 produces the strings a^n b^n c^m and S2 the strings a^n b^m c^m.
• This will be ambiguous on strings a^n b^n c^n.

Grammar Ambiguity: Example
  E → E + E | E * E | ( E ) | - E | id
Find a derivation for the expression: id + id * id
[Figure: two parse trees for id + id * id, one with + at the root and one with * at the root.]
Which derivation tree is correct?

Grammar Ambiguity: Example
  E → E + E | E * E | ( E ) | - E | id
Find a derivation for the expression: id + id * id
According to the grammar, both are correct.
A grammar that produces more than one parse tree for some input sentence is said to be an ambiguous grammar.

Grammar Ambiguity: Example
Grammar:
  stm → if expr then stm
      | if expr then stm else stm
Ambiguity:
  if B1 then (if B2 then S1 else S2)
vs
  if B1 then (if B2 then S1) else S2

Grammar Ambiguity
One way to resolve ambiguity is to associate precedence to the operators.
Example:
• * has precedence over +
    1 + 2 * 3 = 1 + (2 * 3)
    1 + 2 * 3 ≠ (1 + 2) * 3
• Associativity and precedence information is typically used to disambiguate non-fully parenthesized expressions containing unary prefix/postfix operators or binary infix operators.
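A common way to build precedence into a grammar is to stratify it, for example expr → term (+ term)*, term → factor (* factor)*, factor → number | ( expr ). The small recursive-descent sketch below follows that assumed stratified grammar (it is not taken from the slides) and returns the fully parenthesized reading of an expression. The function name `parenthesize` is hypothetical, and malformed input is not handled.

```python
import re

def parenthesize(text):
    """Return the fully parenthesized form of a +/* expression,
    with * binding tighter than + and both operators left-associative."""
    tokens = re.findall(r"\d+|[+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        tok = take()
        if tok == "(":
            inner = expr()
            take()  # consume the closing ")"
            return inner
        return tok

    def term():
        left = factor()
        while peek() == "*":
            take()
            left = f"({left} * {factor()})"
        return left

    def expr():
        left = term()
        while peek() == "+":
            take()
            left = f"({left} + {term()})"
        return left

    return expr()

print(parenthesize("1 + 2 * 3"))
```

Because every * is consumed inside `term` before `expr` sees the next +, only one parse is possible for each input.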

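Grammar Ambiguity: Quiz 1
Is the following grammar ambiguous?
  S → PC | AQ
  P → aPb | λ
  C → cC | λ
  Q → bQc | λ
  A → aA | λ
Yes: consider the string abc

Grammar Ambiguity: Quiz 2
Is the following grammar ambiguous?
  S → aS | Sb | ab
Yes: consider aabb, which has two leftmost derivations:
  S ⇒ aS ⇒ aSb ⇒ aabb
  S ⇒ Sb ⇒ aSb ⇒ aabb
(The string ab itself has only the derivation S ⇒ ab.)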

Grammar Ambiguity: Quiz
Is the following grammar ambiguous?
  S → SS | λ
Yes.
[Figure: S derives SS, which derives SSS or returns to S via λ: a cyclic structure.]
(Illustrates ambiguous grammar with cycles.)

Simple Grammar
Definition: A CFG (V,T,S,P) is a simple grammar (s-grammar) if and only if all its productions are of the form
  A → ax
with A ∈ V, a ∈ T, x ∈ V* and any pair (A,a) occurs at most once.
• Note, for simple grammars a leftmost derivation of a string w ∈ L(G) is straightforward and requires time |w|.
• Example: Take the s-grammar S → aS | bSS | c with aabcc:
  S ⇒ aS ⇒ aaS ⇒ aabSS ⇒ aabcS ⇒ aabcc.
Quiz: is the grammar S → aS | bSS | aSS | c an s-grammar?
NO. Why? The pair (S,a) occurs twice.
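The claim that parsing an s-grammar takes time |w| can be illustrated with a sketch: since each pair (A,a) selects at most one production, every input symbol determines the next derivation step. The function name `s_grammar_parse` and the dictionary encoding of the rules are assumptions made for this example.

```python
def s_grammar_parse(w, rules, start="S"):
    """Leftmost derivation of w in an s-grammar, or None if w is rejected.

    `rules` maps (variable, terminal) to the string of variables x
    in the production  variable -> terminal x.
    """
    stack = [start]      # pending variables, leftmost one on top
    derived = ""         # terminals produced so far
    steps = [start]
    for ch in w:
        if not stack:
            return None  # input left over but nothing to expand
        var = stack.pop()
        x = rules.get((var, ch))
        if x is None:
            return None  # no production for this (variable, terminal) pair
        stack.extend(reversed(x))
        derived += ch
        steps.append(derived + "".join(reversed(stack)))
    return steps if not stack else None

# s-grammar from the slide: S -> aS | bSS | c
RULES = {("S", "a"): "S", ("S", "b"): "SS", ("S", "c"): ""}
print(" => ".join(s_grammar_parse("aabcc", RULES)))
```

The loop runs once per input symbol, and the list it returns for aabcc matches the derivation on the slide.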

Chomsky Normal Form (CNF)
Even though we can’t get every grammar into right-linear form, or in general even get rid of ambiguity, there is an especially simple form that general CFGs can be converted into.

Normal Forms
• Chomsky Normal Form
• Greibach Normal Form

Chomsky Normal Form (CNF)
Noam Chomsky came up with an especially simple type of context free grammars which is able to capture all context free languages.
Chomsky's grammatical form is particularly useful when one wants to prove certain facts about context free languages. This is because assuming a much more restrictive kind of grammar can often make it easier to prove that the generated language has whatever property you are interested in.

Chomsky Normal Form
Definition 6.4: A CFG is in Chomsky normal form if and only if all production rules are of the form
  A → BC
or
  A → x
with variables A,B,C ∈ V and x ∈ T.
(Sometimes rule S → λ is also allowed.)
CFGs in CNF can be parsed in time O(|w|³).
Named after Noam Chomsky, who in the 60s made seminal contributions to the field of theoretical linguistics (cf. Chomsky hierarchy of languages).
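The O(|w|³) bound is usually obtained with a table-filling method; the slides do not name one, so the sketch below uses the standard CYK (Cocke-Younger-Kasami) algorithm as an assumed illustration. The sample grammar is a CNF grammar for non-empty palindromes over {a, b}, and the function and variable names are hypothetical.

```python
def cyk(w, terminal_rules, binary_rules, start):
    """Decide w in L(G) for a CNF grammar without the lambda rule.

    terminal_rules: list of (A, a) for rules A -> a
    binary_rules:   list of (A, B, C) for rules A -> BC
    """
    n = len(w)
    if n == 0:
        return False  # the empty string needs the separate S -> lambda rule
    # table[i][l] = set of variables deriving the substring w[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][1] = {A for A, a in terminal_rules if a == ch}
    for length in range(2, n + 1):            # substring length
        for i in range(n - length + 1):       # start position
            for k in range(1, length):        # split point
                for A, B, C in binary_rules:
                    if B in table[i][k] and C in table[i + k][length - k]:
                        table[i][length].add(A)
    return start in table[0][n]

# CNF grammar for non-empty palindromes over {a, b}:
# S -> a | b | AB | CD | AA | CC,  A -> a,  B -> SA,  C -> b,  D -> SC
TERMINAL = [("S", "a"), ("S", "b"), ("A", "a"), ("C", "b")]
BINARY = [("S", "A", "B"), ("S", "C", "D"), ("S", "A", "A"),
          ("S", "C", "C"), ("B", "S", "A"), ("D", "S", "C")]

print(cyk("abba", TERMINAL, BINARY, "S"))
print(cyk("abab", TERMINAL, BINARY, "S"))
```

The three nested loops over length, start position and split point give the cubic running time; the grammar size contributes only a constant factor.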

Chomsky Normal Form (CNF)
A CFG is said to be in Chomsky Normal Form if every rule in the grammar has one of the following forms:
  A → BC   (dyadic variable productions)
  A → a    (unit terminal productions)
  S → λ    (for the sake of the empty string only)
where B,C ∈ V − {S}.
Here S is the start variable, A,B,C are variables and a is a terminal. Thus λ may only appear on the right hand side of a rule for the start symbol, and every other right hand side is either 2 variables or a single terminal.

Significance of CNF
• Length of derivation of a string of length n in CNF = 2n − 1
  (Cf. number of nodes of a strictly binary tree with n leaves)
• Maximum depth of a parse tree = n
• Minimum depth of a parse tree = ⌈log₂ n⌉ + 1
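The 2n − 1 count follows from a short counting argument, sketched here under the assumption that n ≥ 1 and the rule S → λ is not used:

```latex
\begin{align*}
\#(A \to BC \text{ steps}) &= n - 1 && \text{(each step adds one symbol, going from 1 symbol to } n\text{)}\\
\#(A \to a \text{ steps})  &= n     && \text{(one step per terminal of } w\text{)}\\
\text{derivation length}   &= (n - 1) + n = 2n - 1
\end{align*}
```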

CFG → CNF
• Theorem: There is an algorithm to construct a grammar G’ in CNF that is equivalent to a CFG G.

CFG → CNF: Construction
• Obtain an equivalent grammar that does not contain λ-rules, chain rules, and useless variables.
• Apply the following conversion on rules of the form A → bBcC:
    A → PQ
    P → b
    Q → BR
    R → WC
    W → c

CFG → CNF: Construction
Converting a general grammar into Chomsky Normal Form works in four steps:
1. Ensure that the start variable doesn't appear on the right hand side of any rule.
2. Remove all λ-productions, except from the start variable.
3. Remove unit variable productions of the form A → B where A and B are variables.
4. Add variables and dyadic variable rules to replace any longer non-dyadic or non-variable productions.

CFG → CNF: Example 1
Let’s see how this works on the following example grammar:
  S → λ | a | b | aSa | bSb

CFG → CNF: Example 1
1. Start Variable
Ensure that the start variable doesn't appear on the right hand side of any rule.
  S’ → S
  S → λ | a | b | aSa | bSb

CFG → CNF: Example 1
2. Remove λ-rules
Remove all epsilon productions, except from the start variable.
  S’ → S | λ
  S → a | b | aSa | bSb | aa | bb

CFG → CNF: Example 1
3. Remove variable units
Remove unit variable productions of the form A → B.
  S’ → λ | a | b | aSa | bSb | aa | bb
  S → a | b | aSa | bSb | aa | bb

CFG → CNF: Example 1
4. Longer production rules
Add variables and dyadic variable rules to replace any longer productions.
  S’ → λ | a | b | AB | CD | AA | CC
  S → a | b | AB | CD | AA | CC
  A → a
  B → SA
  C → b
  D → SC

CFG → CNF: Example 1
5. Result
CFG:
  S → λ | a | b | aSa | bSb
CNF:
  S’ → λ | a | b | AB | CD | AA | CC
  S → a | b | AB | CD | AA | CC
  A → a
  B → SA
  C → b
  D → SC

CFG → CNF: Example 2
Grammar:
  S → 0S1
  S → T#T
  S → T
  T → λ
1. Add a new start variable S0 and add the rule S0 → S.
2. Remove all A → λ rules (where A is not S0).
   For each occurrence of A on the right hand side of a rule, add a new rule with the occurrence deleted.
   If we have the rule B → A, add B → λ, unless we have previously removed B → λ.
   Rules added in this step: S → T#, S → #T, S → #, S → λ, and then S → 01 and S0 → λ.
3. Remove unit rules A → B.
   Whenever B → w appears, add the rule A → w unless this was a unit rule previously removed.
4. Convert all remaining rules into the proper form.
   Rules before the conversion:
     S0 → λ | 0S1 | T#T | T# | #T | # | 01
     S → 0S1 | T#T | T# | #T | # | 01
   Conversions shown on the slide:
     S0 → 0S1 becomes S0 → A1A2, A1 → 0, A2 → SA3, A3 → 1
     S0 → T# becomes S0 → TA4, A4 → #

CFG → CNF: Example 2
Convert the following into Chomsky normal form:
  A → BAB | B | λ
  B → 00 | λ
Add a new start variable:
  S0 → A
  A → BAB | B | λ
  B → 00 | λ
Remove the λ-rules:
  S0 → A | λ
  A → BAB | B | BB | AB | BA
  B → 00
Remove the unit rules:
  S0 → BAB | 00 | BB | AB | BA | λ
  A → BAB | 00 | BB | AB | BA
  B → 00
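The λ-rule removal step can be sketched as follows for the grammar S0 → A, A → BAB | B | λ, B → 00 | λ. This is an illustrative implementation under assumptions: the function name `remove_lambda_rules` and the encoding of rules as a dictionary from variables to sets of symbol tuples are hypothetical, and trivial rules of the form A → A are dropped on the spot.

```python
from itertools import product

def remove_lambda_rules(rules, start):
    """Return an equivalent rule set whose only lambda rule is start -> lambda.

    `rules` maps each variable to a set of right-hand sides,
    each a tuple of symbols; () stands for lambda.
    """
    # 1. Find the nullable variables (those deriving lambda) by fixpoint.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in rules.items():
            if var not in nullable and any(
                all(s in nullable for s in body) for body in bodies
            ):
                nullable.add(var)
                changed = True
    # 2. For every rule, add all variants with nullable occurrences deleted.
    new_rules = {}
    for var, bodies in rules.items():
        out = set()
        for body in bodies:
            options = [((s,), ()) if s in nullable else ((s,),) for s in body]
            for choice in product(*options):
                new = tuple(s for part in choice for s in part)
                if new and new != (var,):   # skip lambda and A -> A
                    out.add(new)
        new_rules[var] = out
    # 3. Keep lambda only for the start variable, if it was derivable.
    if start in nullable:
        new_rules[start].add(())
    return new_rules

grammar = {
    "S0": {("A",)},
    "A": {("B", "A", "B"), ("B",), ()},
    "B": {("0", "0"), ()},
}
for var, bodies in remove_lambda_rules(grammar, "S0").items():
    print(var, "->", sorted(bodies))
```

On this grammar the result agrees with the slide: S0 → A | λ, A → BAB | B | BB | AB | BA, B → 00.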

Chomsky Normal Form (CNF): Exercise
• Write into Chomsky Normal Form the CFG:
    S → aA | aBB
    A → aaA | λ
    B → bC | bbC
    C → B

Answer
• Answer (1): First you remove the λ-productions (A ⇒ λ):
    S → aA | aBB | a
    A → aaA | aa
    B → bC | bbC
    C → B

Answer
• Answer (2): Next you remove the unit-productions from:
    S → aA | aBB | a
    A → aaA | aa
    B → bC | bbC
    C → B
• Removing C → B, we have to include the C ⇒* B possibility, which can be done by substitution and gives:
    S → aA | aBB | a
    A → aaA | aa
    B → bC | bbC
    C → bC | bbC

Answer
• Answer (3): Next, we determine the useless variables in
    S → aA | aBB | a
    A → aaA | aa
    B → bC | bbC
    C → bC | bbC
  The variables B and C cannot terminate and are therefore useless. So, removing B and C gives:
    S → aA | a
    A → aaA | aa
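Finding the variables that "cannot terminate" is a small fixpoint computation. The sketch below is an assumed illustration with the hypothetical name `generating_variables`; any symbol that does not have its own rules is treated as a terminal.

```python
def generating_variables(rules):
    """Return the variables that can derive some string of terminals."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in rules.items():
            if var in generating:
                continue
            for body in bodies:
                # A body works if every symbol is a terminal
                # or a variable already known to be generating.
                if all(s in generating or s not in rules for s in body):
                    generating.add(var)
                    changed = True
                    break
    return generating

# Grammar after removing the unit production C -> B
grammar = {
    "S": ["aA", "aBB", "a"],
    "A": ["aaA", "aa"],
    "B": ["bC", "bbC"],
    "C": ["bC", "bbC"],
}
print(sorted(generating_variables(grammar)))
```

B and C never enter the set because each of their productions contains C, so every rule mentioning them can be deleted.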

Answer
• Answer (4): To make the CFG in Chomsky normal form, we have to introduce terminal producing variables for
    S → aA | a
    A → aaA | aa,
  which gives
    S → XaA | a
    A → XaXaA | XaXa
    Xa → a.

Answer
• Answer (5): Finally, we have to ‘chain’ the variables in
    S → XaA | a
    A → XaXaA | XaXa
    Xa → a,
  which gives
    S → XaA | a
    A → XaA2 | XaXa
    A2 → XaA
    Xa → a.

Greibach Normal Form (GNF)
• A CFG is in Greibach Normal Form if each rule is of the form
    A → aA1A2...An
    A → a
    S → λ
  where Ai ∈ V − {S}.

Greibach Normal Form (GNF)
• The size of the equivalent GNF can be large compared to the original grammar.
• The next example CFG has 5 rules, but the corresponding GNF has 24 rules!
• Length of the derivation in GNF = length of the string.
• GNF is useful in relating CFGs (“generators”) to pushdown automata (“recognizers”/“acceptors”).

CFG → GNF
• Theorem: There is an algorithm to construct a grammar G’ in GNF that is equivalent to a CFG G.

CFG → GNF: Example
Original grammar:
  A → BC
  B → CA | b
  C → AB | a
Substitute for A in the C rule:
  C → BCB | a
Substitute for B in the C rule:
  C → CACB | bCB | a
Remove the left recursion on C with a new variable R:
  C → (bCB | a)R | bCB | a
  R → ACBR | ACB
Substitute back so that every right hand side starts with a terminal:
  C → bCBR | aR | bCB | a
  B → bCBRA | aRA | bCBA | aA | b
  A → bCBRAC | aRAC | bCBAC | aAC | bC
  R → (bCBRAC | ... | bC)(CBR | CB)

Context Sensitive Grammar (CSG)
An even more general form of grammars exists. In general, a non-context free grammar is one in which whole mixed variable/terminal substrings are replaced at a time. For example with Σ = {a,b,c} consider:
  S → λ | ASBC
  A → a
  CB → BC
  aB → ab
  bB → bb
  bC → bc
  cC → cc
For technical reasons, when the length of the LHS is always ≤ the length of the RHS, these general grammars are called context sensitive.

Context Sensitive Grammar: Example
Find the language generated by the CSG:
  S → λ | ASBC
  A → a
  CB → BC
  aB → ab
  bB → bb
  bC → bc
  cC → cc
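To experiment with this grammar, one can search its rewriting steps directly. The sketch below is a brute-force membership test with the assumed name `derives`; it relies on the observation that S → λ is the only rule that shortens a sentential form and that at most one S is ever present, so forms longer than |w| + 1 can be discarded.

```python
from collections import deque

# Rules of the CSG from the slide; "" stands for lambda.
RULES = [("S", ""), ("S", "ASBC"), ("A", "a"), ("CB", "BC"),
         ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]

def derives(w, rules=RULES, start="S"):
    """Breadth-first search: can `start` be rewritten into w?"""
    limit = len(w) + 1          # longer forms can never shrink back to w
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for lhs, rhs in rules:
            pos = form.find(lhs)
            while pos != -1:    # rewrite every occurrence of lhs
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= limit and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return False

for word in ["", "abc", "aabbcc", "aabbc", "abcabc"]:
    print(repr(word), derives(word))
```

Trying a few strings this way is a quick sanity check on a guess for the generated language before attempting a proof.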

Context Sensitive Grammar (CSG): Example
Answer is {a^n b^n c^n}.
In a future class we’ll see that this language is not context free. Thus perturbing context free-ness by allowing context sensitive productions expands the class.

Relations between Grammars
So far we studied 3 grammars:
1. Regular Grammars (RG)
2. Context Free Grammars (CFG)
3. Context Sensitive Grammars (CSG)
The relation between these 3 grammars is as follows:
[Figure: nested sets, RG inside CFG inside CSG.]

Grammar Applications: Programming Languages
Programming languages are often defined as Context Free Grammars in Backus-Naur Form (BNF).
Example:
  … ::= IF …
  … ::= … | … + …
  … ::= … | … * …
The variables are indicated by angle brackets.
The arrow → is replaced by ::=
Here, IF, + and * are terminals.
“Syntax Checking” is checking if a program is an element of the CFG of the programming language.

Grammar Applications: Compiler
[Figure: compiler pipeline. Source Program → Scanner → Parser → Semantic Analysis → Intermediate Code Generation → Optimizer → Code Generation → Target Program. The Scanner and Parser form the analysis part of the compiler, which uses the grammar.]

Applications of CFG

Parsing is where we use the theory of CFGs.

The theory is especially relevant when dealing with Extensible Markup Language (XML) files and their corresponding Document Type Definitions (DTDs).

Document Type Definitions define the grammar that the XML files have to adhere to. Validating XML files equals parsing them against the grammar of the DTD.

The nondeterminism of NPDAs can make parsing slow. What about deterministic PDAs?
