Acknowledgement

CS3300 - Compiler Design: Parsing

These slides borrow liberal portions of text verbatim from Antony L. Hosking @ Purdue, Jens Palsberg @ UCLA, and the Dragon book.

V. Krishna Nandivada

IIT Madras

Copyright 2019 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].


V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 2 / 98

The role of the parser

source code → scanner → tokens → parser → IR (errors reported along the way)

A parser
- performs context-free syntax analysis
- guides context-sensitive analysis
- constructs an intermediate representation
- produces meaningful error messages
- attempts error correction

For the next several classes, we will look at parser construction.

Syntax analysis by using a CFG

Context-free syntax is specified with a context-free grammar. Formally, a CFG G is a 4-tuple (Vt, Vn, S, P), where:
- Vt is the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.
- Vn, the nonterminals, is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.
- S is a distinguished nonterminal (S ∈ Vn) denoting the entire set of strings in L(G). This is sometimes called the goal symbol.
- P is a finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left-hand side.

The set V = Vt ∪ Vn is called the vocabulary of G.
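The 4-tuple (Vt, Vn, S, P) maps directly onto data. A minimal Python sketch, using the expression grammar that appears on the next slides; the encoding (sets plus a dict of tuple bodies) is our own, not part of the slides:

```python
# A CFG as a 4-tuple (Vt, Vn, S, P). Productions map each
# non-terminal to a list of right-hand sides (tuples of symbols).
Vt = {"num", "id", "+", "-", "*", "/"}
Vn = {"goal", "expr", "op"}
S = "goal"
P = {
    "goal": [("expr",)],
    "expr": [("expr", "op", "expr"), ("num",), ("id",)],
    "op":   [("+",), ("-",), ("*",), ("/",)],
}

# Sanity checks: terminals and non-terminals are disjoint, every
# production has a single non-terminal on its left-hand side, and
# bodies use only vocabulary symbols.
assert Vt & Vn == set()
assert S in Vn
for lhs, bodies in P.items():
    assert lhs in Vn
    for body in bodies:
        assert all(sym in Vt | Vn for sym in body)
print("grammar is well-formed")
```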

Notation and terminology

- a, b, c, ... ∈ Vt
- A, B, C, ... ∈ Vn
- U, V, W, ... ∈ V
- α, β, γ, ... ∈ V∗
- u, v, w, ... ∈ Vt∗

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ.
Similarly, ⇒∗ and ⇒+ denote derivations of ≥ 0 and ≥ 1 steps.

If S ⇒∗ β then β is said to be a sentential form of G.

L(G) = {w ∈ Vt∗ | S ⇒+ w}; w ∈ L(G) is called a sentence of G.

Note, L(G) = {β ∈ V∗ | S ⇒∗ β} ∩ Vt∗.

Syntax analysis

Grammars are often written in Backus-Naur form (BNF). Example:

1 ⟨goal⟩ ::= ⟨expr⟩
2 ⟨expr⟩ ::= ⟨expr⟩⟨op⟩⟨expr⟩
3   | num
4   | id
5 ⟨op⟩ ::= +
6   | −
7   | ∗
8   | /

This describes simple expressions over numbers and identifiers.

In a BNF for a grammar, we represent:
1 non-terminals with angle brackets or capital letters
2 terminals with typewriter font or underline
3 productions as in the example


Derivations

We can view the productions of a CFG as rewriting rules. At each step, we choose a non-terminal to replace. This choice can lead to different derivations. Two are of particular interest:
- leftmost derivation: the leftmost non-terminal is replaced at each step
- rightmost derivation: the rightmost non-terminal is replaced at each step

Using our example CFG (for x + 2 ∗ y):

⟨goal⟩ ⇒ ⟨expr⟩
  ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
  ⇒ ⟨id,x⟩⟨op⟩⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨expr⟩⟨op⟩⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩⟨op⟩⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

We have derived the sentence x + 2 ∗ y. We denote this ⟨goal⟩ ⇒∗ id + num ∗ id. Such a sequence of rewrites is a derivation or a parse. The previous example was a leftmost derivation. The process of discovering a derivation is called parsing.

Rightmost derivation

For the string x + 2 ∗ y:

⟨goal⟩ ⇒ ⟨expr⟩
  ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
  ⇒ ⟨expr⟩⟨op⟩⟨id,y⟩
  ⇒ ⟨expr⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩⟨op⟩⟨expr⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩⟨op⟩⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id.

Precedence

[Figure: the parse tree for this derivation; the + sits below the ∗, so the addition is grouped first.]

Treewalk evaluation computes (x + 2) ∗ y, the "wrong" answer! Should be x + (2 ∗ y).


Precedence

These two derivations point out a problem with the grammar. It has no notion of precedence, or implied order of evaluation. To add precedence takes additional machinery:

1 ⟨goal⟩ ::= ⟨expr⟩
2 ⟨expr⟩ ::= ⟨expr⟩ + ⟨term⟩
3   | ⟨expr⟩ − ⟨term⟩
4   | ⟨term⟩
5 ⟨term⟩ ::= ⟨term⟩ ∗ ⟨factor⟩
6   | ⟨term⟩/⟨factor⟩
7   | ⟨factor⟩
8 ⟨factor⟩ ::= num
9   | id

This grammar enforces a precedence on the derivation:
- terms must be derived from expressions
- forces the "correct" tree

Precedence

Now, for the string x + 2 ∗ y:

⟨goal⟩ ⇒ ⟨expr⟩
  ⇒ ⟨expr⟩ + ⟨term⟩
  ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨factor⟩
  ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩ + ⟨factor⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨term⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨factor⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id, but this time, we build the desired tree.

Precedence

[Figure: the parse tree for x + 2 ∗ y under the precedence grammar; ∗ now sits below +, grouping 2 ∗ y first.]

Treewalk evaluation now computes x + (2 ∗ y).

Ambiguity

If a grammar has more than one derivation for a single sentential form, then it is ambiguous.

Example:
⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
  | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
  | other stmts

Consider deriving the sentential form:

if E1 then if E2 then S1 else S2

It has two derivations. This ambiguity is purely grammatical. It is a context-free ambiguity.


Ambiguity

May be able to eliminate ambiguity by rearranging the grammar:

⟨stmt⟩ ::= ⟨matched⟩
  | ⟨unmatched⟩
⟨matched⟩ ::= if ⟨expr⟩ then ⟨matched⟩ else ⟨matched⟩
  | other stmts
⟨unmatched⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
  | if ⟨expr⟩ then ⟨matched⟩ else ⟨unmatched⟩

This generates the same language as the ambiguous grammar, but applies the common sense rule: match each else with the closest unmatched then. This is most likely the language designer's intent.

Ambiguity

Ambiguity is often due to confusion in the context-free specification. Context-sensitive confusions can arise from overloading.

Example: a = f(17)

In many Algol/Scala-like languages, f could be a function or a subscripted variable. Disambiguating this statement requires context:
- need values of declarations
- not context-free
- really an issue of type

Rather than complicate parsing, we will handle this separately.


Scanning vs. parsing

Where do we draw the line?

term ::= [a-zA-Z]([a-zA-Z] | [0-9])∗
  | 0 | [1-9][0-9]∗
op ::= + | − | ∗ | /
expr ::= (term op)∗ term

Regular expressions are used to classify:
- identifiers, numbers, keywords
- REs are more concise and simpler for tokens than a grammar
- more efficient scanners can be built from REs (DFAs) than from grammars

Context-free grammars are used to count:
- brackets: (), begin...end, if...then...else
- imparting structure: expressions

Syntactic analysis is complicated enough: the grammar for C has around 200 productions. Factoring lexical analysis out as a separate phase makes the compiler more manageable.

Parsing: the big picture

[Diagram: a grammar is fed to a parser generator, which produces a parser; at compile time the parser consumes tokens and produces IR.]

Our goal is a flexible parser generator system.


Different ways of parsing: top-down vs. bottom-up

Top-down parsers
- start at the root of the derivation tree and fill in
- pick a production and try to match the input
- may require backtracking
- some grammars are backtrack-free (predictive)

Bottom-up parsers
- start at the leaves and fill in
- start in a state valid for legal first tokens
- as input is consumed, change state to encode possibilities (recognize valid prefixes)
- use a stack to store both state and sentential forms

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar. To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string:
1 At a node labelled A, select a production A → α and construct the appropriate child for each symbol of α
2 When a terminal is added to the fringe that doesn't match the input string, backtrack
3 Find the next node to be expanded (must have a label in Vn)

The key is selecting the right production in step 1. If the parser makes a wrong step, the "derivation" process may not terminate. Why is this bad?


Left-recursion

Top-down parsers cannot handle left-recursion in a grammar. Formally, a grammar is left-recursive if

∃A ∈ Vn such that A ⇒+ Aα for some string α

Our simple expression grammar is left-recursive.

Eliminating left-recursion

To remove left-recursion, we can transform the grammar. Consider the grammar fragment:

⟨foo⟩ ::= ⟨foo⟩α
  | β

where α and β do not start with ⟨foo⟩. We can rewrite this as:

⟨foo⟩ ::= β⟨bar⟩
⟨bar⟩ ::= α⟨bar⟩
  | ε

where ⟨bar⟩ is a new non-terminal. This fragment contains no left-recursion.
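The ⟨foo⟩/⟨bar⟩ rewrite is mechanical enough to code directly. A hedged Python sketch of immediate left-recursion elimination; the grammar encoding (bodies as tuples, `()` for ε) and the helper name are our own:

```python
# Rewrite A ::= A a1 | ... | b1 | ...  as
#         A ::= b1 A' | ...   and   A' ::= a1 A' | ... | epsilon
EPS = ()   # the empty body plays the role of epsilon

def eliminate_immediate_left_recursion(grammar, A):
    """Return a new grammar with immediate left-recursion on A removed."""
    recursive = [body[1:] for body in grammar[A] if body[:1] == (A,)]
    others    = [body for body in grammar[A] if body[:1] != (A,)]
    if not recursive:
        return grammar                  # nothing to do
    A2 = A + "'"                        # the fresh non-terminal <bar>
    out = dict(grammar)
    out[A]  = [beta + (A2,) for beta in others]
    out[A2] = [alpha + (A2,) for alpha in recursive] + [EPS]
    return out

g = {"foo": [("foo", "a"), ("b",)]}
g2 = eliminate_immediate_left_recursion(g, "foo")
print(g2)   # {'foo': [('b', "foo'")], "foo'": [('a', "foo'"), ()]}
```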



How much lookahead is needed?

We saw that top-down parsers may need to backtrack when they select the wrong production. Do we need arbitrary lookahead to parse CFGs?
- in general, yes: use the Earley or Cocke-Younger-Kasami algorithms

Fortunately,
- large subclasses of CFGs can be parsed with limited lookahead
- most programming language constructs can be expressed in a grammar that falls in these subclasses

Among the interesting subclasses are:
- LL(1): left to right scan, left-most derivation, 1-token lookahead; and
- LR(1): left to right scan, reversed right-most derivation, 1-token lookahead

Predictive parsing

Basic idea: for any two productions A → α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α ∈ G, define FIRST(α) as the set of tokens that appear first in some string derived from α. That is, for some w ∈ Vt∗, w ∈ FIRST(α) iff α ⇒∗ wγ.

Key property: whenever two productions A → α and A → β both appear in the grammar, we would like

FIRST(α) ∩ FIRST(β) = φ

This would allow the parser to make a correct choice with a lookahead of only one symbol!


Left factoring

What if a grammar does not have this property? Sometimes, we can transform a grammar to have this property.

To left factor: for each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε, then replace all of the A productions

A → αβ1 | αβ2 | ··· | αβn

with

A → αA′
A′ → β1 | β2 | ··· | βn

where A′ is a new non-terminal. Repeat until no two alternatives for a single non-terminal have a common prefix.

Example

There are two non-terminals to left factor:

⟨expr⟩ ::= ⟨term⟩ + ⟨expr⟩
  | ⟨term⟩ − ⟨expr⟩
  | ⟨term⟩

⟨term⟩ ::= ⟨factor⟩ ∗ ⟨term⟩
  | ⟨factor⟩/⟨term⟩
  | ⟨factor⟩

Applying the transformation:

⟨expr⟩ ::= ⟨term⟩⟨expr′⟩
⟨expr′⟩ ::= +⟨expr⟩
  | −⟨expr⟩
  | ε

⟨term⟩ ::= ⟨factor⟩⟨term′⟩
⟨term′⟩ ::= ∗⟨term⟩
  | /⟨term⟩
  | ε
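One round of this transformation can be sketched in Python. This is an illustrative encoding, not from the slides: bodies are tuples of symbols, `()` stands for ε, and the helper names are our own:

```python
def common_prefix(a, b):
    """Longest common prefix of two tuples of symbols."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def left_factor_once(grammar, A):
    """One left-factoring step for non-terminal A: pull out the longest
    prefix alpha shared by two or more of its alternatives."""
    bodies = grammar[A]
    best = ()
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            p = common_prefix(bodies[i], bodies[j])
            if len(p) > len(best):
                best = p
    if not best:
        return grammar                      # nothing to factor
    A2 = A + "'"                            # fresh non-terminal A'
    out = dict(grammar)
    out[A]  = [best + (A2,)] + [b for b in bodies if b[:len(best)] != best]
    out[A2] = [b[len(best):] for b in bodies if b[:len(best)] == best]
    return out

g = {"expr": [("term", "+", "expr"), ("term", "-", "expr"), ("term",)]}
g2 = left_factor_once(g, "expr")
print(g2["expr"], g2["expr'"])
```

Running this on the ⟨expr⟩ alternatives above yields ⟨expr⟩ ::= ⟨term⟩⟨expr′⟩ with ⟨expr′⟩ ::= +⟨expr⟩ | −⟨expr⟩ | ε, matching the slide.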



Indirect left-recursion elimination

Given a left-factored CFG, to eliminate left-recursion:

Input: Grammar G with no cycles and no ε productions.
Output: Equivalent grammar with no left-recursion.
begin
  Arrange the non-terminals in some order A1, A2, ··· An;
  foreach i = 1···n do
    foreach j = 1···i − 1 do
      Say the i-th production is: Ai → Aj γ,
      and Aj → δ1 | δ2 | ··· | δk;
      Replace the i-th production by:
        Ai → δ1γ | δ2γ | ··· | δkγ;
    Eliminate immediate left-recursion in Ai;

Generality

Question: By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token lookahead?

Answer: Given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:

{aⁿ 0 bⁿ | n ≥ 1} ∪ {aⁿ 1 b²ⁿ | n ≥ 1}

Must look past an arbitrary number of a's to discover the 0 or the 1 and so determine the derivation.


Recursive descent parsing

int A()
begin
  foreach production of the form A → X1X2X3···Xk do
    for i = 1 to k do
      if Xi is a non-terminal then
        if (Xi() ≠ 0) then
          backtrack; break;   // Try the next production
      else if Xi matches the current input symbol a then
        advance the input to the next symbol;
      else
        backtrack; break;     // Try the next production
    if i == k + 1 then
      return 0;               // Success
  return 1;                   // Failure

Notes:
- Backtracks in general; in practice it may not do much.
- How to backtrack?
- A left-recursive grammar will lead to an infinite loop.
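For an LL(1) grammar the backtracking above disappears: each procedure can commit after one token of lookahead. A Python sketch for the left-factored expression grammar from the earlier slide; token kinds "num"/"id" are assumed to be pre-classified by a scanner, and the class design is our own:

```python
# Recursive-descent recognizer for the left-factored grammar:
#   expr   ::= term expr'      expr'  ::= + expr | - expr | eps
#   term   ::= factor term'    term'  ::= * term | / term | eps
#   factor ::= num | id
# No backtracking is needed because the grammar is LL(1).

class Parser:
    def __init__(self, tokens):
        self.toks = tokens + ["$"]      # "$" marks end of input
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def eat(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, got {self.peek()}")
        self.pos += 1

    def expr(self):                     # term expr'
        self.term()
        if self.peek() in ("+", "-"):   # expr' chosen by one-token lookahead
            self.eat(self.peek())
            self.expr()

    def term(self):                     # factor term'
        self.factor()
        if self.peek() in ("*", "/"):
            self.eat(self.peek())
            self.term()

    def factor(self):
        if self.peek() in ("num", "id"):
            self.eat(self.peek())
        else:
            raise SyntaxError(f"unexpected {self.peek()}")

    def parse(self):
        self.expr()
        self.eat("$")
        return True

print(Parser(["id", "+", "num", "*", "id"]).parse())   # True
```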



Non-recursive predictive parsing

Now, a predictive parser looks like:

[Diagram: source code → scanner → tokens → table-driven parser → IR; the parser consults parsing tables and a stack.]

Rather than writing recursive code, we build tables. This also uses a stack, but mainly to remember part of the input string; there is no recursion.

Table-driven parsers

A parser generator system often looks like:

[Diagram: grammar → parser generator → parsing tables; source code → scanner → tokens → table-driven parser → IR.]

This is true for both top-down (LL) and bottom-up (LR) parsers. Why tables? Building tables can be automated easily.

FIRST

For a string of grammar symbols α, define FIRST(α) as:
- the set of terminals that begin strings derived from α: {a ∈ Vt | α ⇒∗ aβ}
- if α ⇒∗ ε then ε ∈ FIRST(α)

FIRST(α) contains the tokens valid in the initial position in α.

To build FIRST(X):
1 If X ∈ Vt then FIRST(X) is {X}
2 If X → ε then add ε to FIRST(X)
3 If X → Y1Y2···Yk:
  1 Put FIRST(Y1) − {ε} in FIRST(X)
  2 ∀i : 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ ··· ∩ FIRST(Yi−1) (i.e., Y1···Yi−1 ⇒∗ ε), then put FIRST(Yi) − {ε} in FIRST(X)
  3 If ε ∈ FIRST(Y1) ∩ ··· ∩ FIRST(Yk) then put ε in FIRST(X)

Repeat until no more additions can be made.

FOLLOW

For a non-terminal A, define FOLLOW(A) as:
- the set of terminals that can appear immediately to the right of A in some sentential form

Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it. A terminal symbol has no FOLLOW set.

To build FOLLOW(A):
1 Put $ in FOLLOW(⟨goal⟩)
2 If A → αBβ:
  1 Put FIRST(β) − {ε} in FOLLOW(B)
  2 If β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒∗ ε), then put FOLLOW(A) in FOLLOW(B)

Repeat until no more additions can be made.
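Both constructions are fixed-point iterations and can be coded directly. A Python sketch with our own encoding (bodies as tuples, the string "eps" for ε, "$" for end of input), run on the left-factored expression grammar from the earlier slides:

```python
EPS = "eps"

def first_of(symbols, FIRST, Vt):
    """FIRST of a string alpha = X1 X2 ... Xk (rule 3 above)."""
    out = set()
    for X in symbols:
        fx = {X} if X in Vt else FIRST[X]
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)                 # every Xi can derive epsilon
    return out

def first_follow(Vt, start, P):
    FIRST = {A: set() for A in P}
    changed = True
    while changed:               # iterate to a fixed point
        changed = False
        for A, bodies in P.items():
            for body in bodies:
                f = first_of(body, FIRST, Vt)
                if not f <= FIRST[A]:
                    FIRST[A] |= f; changed = True
    FOLLOW = {A: set() for A in P}
    FOLLOW[start].add("$")
    changed = True
    while changed:
        changed = False
        for A, bodies in P.items():
            for i_pos, B in [(i, s) for body in bodies
                             for i, s in enumerate(body) if s in P
                             for body in [body]]:
                pass             # (expanded below for clarity)
        for A, bodies in P.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B in Vt:
                        continue
                    f = first_of(body[i + 1:], FIRST, Vt)
                    add = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add; changed = True
    return FIRST, FOLLOW

Vt = {"+", "-", "*", "/", "num", "id"}
P = {"S": [("E",)],
     "E": [("T", "E'")],
     "E'": [("+", "E"), ("-", "E"), ()],
     "T": [("F", "T'")],
     "T'": [("*", "T"), ("/", "T"), ()],
     "F": [("num",), ("id",)]}
FIRST, FOLLOW = first_follow(Vt, "S", P)
print(sorted(FOLLOW["F"]))   # ['$', '*', '+', '-', '/']
```

The computed sets match the FIRST/FOLLOW table on the next slide.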



LL(1) grammars

Previous definition: A grammar G is LL(1) iff for all non-terminals A, each distinct pair of productions A → β and A → γ satisfies the condition FIRST(β) ∩ FIRST(γ) = φ.

What if A ⇒∗ ε?

Revised definition: A grammar G is LL(1) iff for each set of productions A → α1 | α2 | ··· | αn:
1 FIRST(α1), FIRST(α2), ..., FIRST(αn) are all pairwise disjoint
2 If αi ⇒∗ ε then FIRST(αj) ∩ FOLLOW(A) = φ, ∀1 ≤ j ≤ n, j ≠ i.

If G is ε-free, condition 1 is sufficient.

LL(1) grammars

Provable facts about LL(1) grammars:
1 No left-recursive grammar is LL(1)
2 No ambiguous grammar is LL(1)
3 Some languages have no LL(1) grammar
4 An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example:
S → aS | a is not LL(1) because FIRST(aS) = FIRST(a) = {a}

S → aS′
S′ → aS′ | ε

accepts the same language and is LL(1).


LL(1) parse table construction

Input: Grammar G
Output: Parsing table M
Method:
1 ∀ productions A → α:
  1 ∀a ∈ FIRST(α), add A → α to M[A,a]
  2 If ε ∈ FIRST(α):
    1 ∀b ∈ FOLLOW(A), add A → α to M[A,b]
    2 If $ ∈ FOLLOW(A), then add A → α to M[A,$]
2 Set each undefined entry of M to error.

If ∃M[A,a] with multiple entries, then the grammar is not LL(1).
Note: recall a, b ∈ Vt, so a, b ≠ ε.

Example

Our long-suffering expression grammar:

1. S → E         6. T → FT′
2. E → TE′       7. T′ → ∗T
3. E′ → +E       8.    | /T
4.    | −E       9.    | ε
5.    | ε        10. F → num
                 11.    | id

FIRST/FOLLOW sets and the resulting parse table (entries are production numbers):

      FIRST           FOLLOW          | id  num  +  −  ∗  /  $
S     num, id         $               |  1   1   −  −  −  −  −
E     num, id         $               |  2   2   −  −  −  −  −
E′    ε, +, −         $               |  −   −   3  4  −  −  5
T     num, id         +, −, $         |  6   6   −  −  −  −  −
T′    ε, ∗, /         +, −, $         |  −   −   9  9  7  8  9
F     num, id         +, −, ∗, /, $   | 11  10   −  −  −  −  −

(Each terminal a has FIRST(a) = {a} and no FOLLOW set.)
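The method above, run over the slide's FIRST/FOLLOW sets (hard-coded here so the sketch stays short), reproduces the table entries. A Python sketch; the dict encoding is our own:

```python
# Build the LL(1) table M from FIRST/FOLLOW. Productions are
# numbered 1..11 as on the slide; "eps" marks epsilon.
EPS = "eps"
prods = {
    1: ("S", ("E",)),        2: ("E", ("T", "E'")),
    3: ("E'", ("+", "E")),   4: ("E'", ("-", "E")),
    5: ("E'", ()),           6: ("T", ("F", "T'")),
    7: ("T'", ("*", "T")),   8: ("T'", ("/", "T")),
    9: ("T'", ()),           10: ("F", ("num",)),
    11: ("F", ("id",)),
}
FIRST = {("E",): {"num", "id"}, ("T", "E'"): {"num", "id"},
         ("+", "E"): {"+"}, ("-", "E"): {"-"}, (): {EPS},
         ("F", "T'"): {"num", "id"}, ("*", "T"): {"*"},
         ("/", "T"): {"/"}, ("num",): {"num"}, ("id",): {"id"}}
FOLLOW = {"S": {"$"}, "E": {"$"}, "E'": {"$"},
          "T": {"+", "-", "$"}, "T'": {"+", "-", "$"},
          "F": {"+", "-", "*", "/", "$"}}

M = {}
for n, (A, alpha) in prods.items():
    f = FIRST[alpha]
    for a in f - {EPS}:                  # rule 1.1
        assert (A, a) not in M, "grammar is not LL(1)"
        M[A, a] = n
    if EPS in f:                         # rules 1.2.1 and 1.2.2
        for b in FOLLOW[A]:              # FOLLOW(A) already contains $ if needed
            assert (A, b) not in M, "grammar is not LL(1)"
            M[A, b] = n

print(M["E'", "+"], M["T'", "*"], M["F", "id"])   # 3 7 11
```

Undefined entries (the dashes in the slide's table) are simply absent from `M`, which plays the role of "error".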


Table-driven predictive parsing

Input: A string w and a parsing table M for a grammar G
Output: If w is in L(G), a leftmost derivation of w; otherwise, indicate an error

push $ onto the stack; push S onto the stack;
let a point to the first input symbol;
X = stack.top();
while X ≠ $ do
  if X = a then
    stack.pop(); advance a;
  else if X is a terminal then
    error();
  else if M[X,a] is an error entry then
    error();
  else if M[X,a] = X → Y1Y2···Yk then
    output the production X → Y1Y2···Yk;
    stack.pop();
    push Yk, Yk−1, ···, Y1 in that order;
  X = stack.top();

A grammar that is not LL(1)

⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
  | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
  | ...

Left-factored:

⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩ ⟨stmt′⟩ | ...
⟨stmt′⟩ ::= else ⟨stmt⟩ | ε

Now, FIRST(⟨stmt′⟩) = {ε, else}.
Also, FOLLOW(⟨stmt′⟩) = {else, $}.
But FIRST(⟨stmt′⟩) ∩ FOLLOW(⟨stmt′⟩) = {else} ≠ φ.

On seeing else, there is a conflict between choosing ⟨stmt′⟩ ::= else ⟨stmt⟩ and ⟨stmt′⟩ ::= ε ⇒ the grammar is not LL(1)!

The fix: put priority on ⟨stmt′⟩ ::= else ⟨stmt⟩ to associate else with the closest previous then.

Here is a typical example where a programming language fails to An error is detected when the terminal on top of the stack does be LL(1): not match the next input symbol or M[A,a] = error. stmt → asginment | call | other Panic mode error recovery assignment → id := exp Skip input symbols till a “synchronizing” token appears. call → id (exp-list) Q: How to identify a synchronizing token? Some heuristics: This grammar is not in a form that can be left factored. We must first replace assignment and call by the right-hand sides of their All symbols in FOLLOW(A) in the synchronizing set for the defining productions: non-terminal A. statement → id := exp | id( exp-list ) | other Semicolon after a Stmt production: assgignmentStmt; assignmentStmt; We left factor: If a terminal on top of the stack cannot be matched? – statement → id stmt’ | other pop the terminal. stmt’ → := exp | (exp-list) issue a message that the terminal was inserted. Q: How about error messages? See how the grammar obscures the language . * * V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 41 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 42 / 98

Some definitions Bottom-up parsing

Some definitions

Recall: for a grammar G, with start symbol S, any string α such that S ⇒∗ α is called a sentential form.
- If α ∈ Vt∗, then α is called a sentence in L(G)
- Otherwise it is just a sentential form (not a sentence in L(G))

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence. A right-sentential form is a sentential form that occurs in the rightmost derivation of some sentence.

An unambiguous grammar will have a unique leftmost/rightmost derivation.

Bottom-up parsing

Goal: given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.


Reductions vs. derivations

Reduction: at each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of that production.

Key decisions:
- When to reduce?
- What production rule to apply?

Recall: in a derivation, a non-terminal in a sentential form is replaced by the body of one of its productions. A reduction is the reverse of a step in derivation. Bottom-up parsing is the process of "reducing" a string w to the start symbol. The goal of bottom-up parsing: build the derivation tree in reverse.

Example

Consider the grammar

1 S → aABe
2 A → Abc
3   | b
4 B → d

and the input string abbcde.

Prod'n.   Sentential Form
3         a b bcde
2         a Abc de
4         aA d e
1         aABe
–         S

The trick appears to be scanning the input and finding valid sentential forms.


Handles

Informally, a "handle" is:
- a substring that matches the body of a production (not necessarily the first one),
- and reducing this handle represents one step of reduction (or reverse rightmost derivation).

[Figure: the parse tree for αβw, with the handle A → β sitting above the substring β and w to its right.]

Handles

Theorem: If G is unambiguous, then every right-sentential form has a unique handle.

Proof: (by definition)
1 G is unambiguous ⇒ the rightmost derivation is unique
2 ⇒ a unique production A → β applied to take γi−1 to γi
3 ⇒ a unique position k at which A → β is applied
4 ⇒ a unique handle A → β

Example

The left-recursive expression grammar (original form):

1 ⟨goal⟩ ::= ⟨expr⟩
2 ⟨expr⟩ ::= ⟨expr⟩ + ⟨term⟩
3   | ⟨expr⟩ − ⟨term⟩
4   | ⟨term⟩
5 ⟨term⟩ ::= ⟨term⟩ ∗ ⟨factor⟩
6   | ⟨term⟩/⟨factor⟩
7   | ⟨factor⟩
8 ⟨factor⟩ ::= num
9   | id

Prod'n.   Sentential Form
–         ⟨goal⟩
1         ⟨expr⟩
3         ⟨expr⟩ − ⟨term⟩
5         ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩
9         ⟨expr⟩ − ⟨term⟩ ∗ id
7         ⟨expr⟩ − ⟨factor⟩ ∗ id
8         ⟨expr⟩ − num ∗ id
4         ⟨term⟩ − num ∗ id
7         ⟨factor⟩ − num ∗ id
9         id − num ∗ id

Handle-pruning

The process to construct a bottom-up parse is called handle-pruning. To construct a rightmost derivation

S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ··· ⇒ γn−1 ⇒ γn = w

we set i to n and apply the following simple algorithm:

for i = n downto 1
  1 find the handle Ai → βi in γi
  2 replace βi with Ai to generate γi−1

This takes 2n steps, where n is the length of the derivation.



Stack implementation

One scheme to implement a handle-pruning, bottom-up parser is called a shift-reduce parser. Shift-reduce parsers use a stack and an input buffer:
1 initialize the stack with $
2 repeat until the top of the stack is the goal symbol and the input token is $:
  a) find the handle: if we don't have a handle on top of the stack, shift an input symbol onto the stack
  b) prune the handle: if we have a handle A → β on the stack, reduce:
     i) pop |β| symbols off the stack
     ii) push A onto the stack

Example: back to x − 2 ∗ y

The grammar (1 S → E, 2 E → E + T, 3 E → E − T, 4 E → T, 5 T → T ∗ F, 6 T → T/F, 7 T → F, 8 F → num, 9 F → id; here S, E, T, F abbreviate ⟨goal⟩, ⟨expr⟩, ⟨term⟩, ⟨factor⟩):

Stack                             Input            Action
$                                 id − num ∗ id    S
$ id                              − num ∗ id       R9
$ ⟨factor⟩                        − num ∗ id       R7
$ ⟨term⟩                          − num ∗ id       R4
$ ⟨expr⟩                          − num ∗ id       S
$ ⟨expr⟩ −                        num ∗ id         S
$ ⟨expr⟩ − num                    ∗ id             R8
$ ⟨expr⟩ − ⟨factor⟩               ∗ id             R7
$ ⟨expr⟩ − ⟨term⟩                 ∗ id             S
$ ⟨expr⟩ − ⟨term⟩ ∗               id               S
$ ⟨expr⟩ − ⟨term⟩ ∗ id                             R9
$ ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩                       R5
$ ⟨expr⟩ − ⟨term⟩                                  R3
$ ⟨expr⟩                                           R1
$ ⟨goal⟩                                           A (accept)


Shift-reduce parsing

Shift-reduce parsers are simple to understand. A shift-reduce parser has just four canonical actions:
1 shift: the next input symbol is shifted onto the top of the stack
2 reduce: the right end of the handle is on top of the stack; locate the left end of the handle within the stack; pop the handle off the stack and push the appropriate non-terminal LHS
3 accept: terminate parsing and signal success
4 error: call an error recovery routine

Key insight: recognize handles with a DFA:
- DFA transitions shift states instead of symbols
- accepting states trigger reductions

May have shift-reduce conflicts. "How many ops?": k shifts, l reduces, and 1 accept, where k is the length of the input string and l is the length of the reverse rightmost derivation.

LR parsing

The skeleton parser:

push s0
token ← next_token()
repeat forever
  s ← top of stack
  if action[s, token] = "shift si" then
    push si
    token ← next_token()
  else if action[s, token] = "reduce A → β" then
    pop |β| states
    s′ ← top of stack
    push goto[s′, A]
  else if action[s, token] = "accept" then
    return
  else error()



Example tables

The grammar:

1 S → E
2 E → T + E
3   | T
4 T → F ∗ T
5   | F
6 F → id

state |        ACTION        |   GOTO
      |  id    +    ∗    $   |  E  T  F
  0   |  s4    –    –    –   |  1  2  3
  1   |  –     –    –    acc |  –  –  –
  2   |  –     s5   –    r3  |  –  –  –
  3   |  –     r5   s6   r5  |  –  –  –
  4   |  –     r6   r6   r6  |  –  –  –
  5   |  s4    –    –    –   |  7  2  3
  6   |  s4    –    –    –   |  –  8  3
  7   |  –     –    –    r2  |  –  –  –
  8   |  –     r4   –    r4  |  –  –  –

Note: this is a simple little right-recursive grammar. It is not the same grammar as in previous lectures.

Example using the tables

Stack          Input             Action
$ 0            id ∗ id + id $    s4
$ 0 4          ∗ id + id $       r6
$ 0 3          ∗ id + id $       s6
$ 0 3 6        id + id $         s4
$ 0 3 6 4      + id $            r6
$ 0 3 6 3      + id $            r5
$ 0 3 6 8      + id $            r4
$ 0 2          + id $            s5
$ 0 2 5        id $              s4
$ 0 2 5 4      $                 r6
$ 0 2 5 3      $                 r5
$ 0 2 5 2      $                 r3
$ 0 2 5 7      $                 r2
$ 0 1          $                 acc
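Plugging the ACTION/GOTO tables above into the skeleton parser from the previous slide gives a runnable sketch. Python, with the tables transcribed by hand; the dict encoding is our own:

```python
# Table-driven LR parser for the little right-recursive grammar.
prods = {1: ("S", ["E"]), 2: ("E", ["T", "+", "E"]), 3: ("E", ["T"]),
         4: ("T", ["F", "*", "T"]), 5: ("T", ["F"]), 6: ("F", ["id"])}
ACTION = {(0, "id"): ("s", 4), (1, "$"): ("acc", None),
          (2, "+"): ("s", 5), (2, "$"): ("r", 3),
          (3, "+"): ("r", 5), (3, "*"): ("s", 6), (3, "$"): ("r", 5),
          (4, "+"): ("r", 6), (4, "*"): ("r", 6), (4, "$"): ("r", 6),
          (5, "id"): ("s", 4), (6, "id"): ("s", 4),
          (7, "$"): ("r", 2), (8, "+"): ("r", 4), (8, "$"): ("r", 4)}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (5, "E"): 7, (5, "T"): 2, (5, "F"): 3,
        (6, "T"): 8, (6, "F"): 3}

def parse(tokens):
    """Return the reductions performed, in order, or raise SyntaxError."""
    stack = [0]                       # stack of states
    inp = tokens + ["$"]
    pos, reductions = 0, []
    while True:
        s, tok = stack[-1], inp[pos]
        act = ACTION.get((s, tok))
        if act is None:               # undefined entry means error
            raise SyntaxError(f"error in state {s} on {tok}")
        kind, arg = act
        if kind == "s":               # shift: push state, consume token
            stack.append(arg); pos += 1
        elif kind == "r":             # reduce by production arg
            A, beta = prods[arg]
            del stack[len(stack) - len(beta):]   # pop |beta| states
            stack.append(GOTO[stack[-1], A])     # then take the goto
            reductions.append(arg)
        else:                         # accept
            return reductions

print(parse(["id", "*", "id", "+", "id"]))   # [6, 6, 5, 4, 6, 5, 3, 2]
```

The printed reduction sequence matches the r-actions in the trace above.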



Formally, a grammar G is LR(k) iff.: Informally, we say that a grammar G is LR(k) if, given a rightmost 1 ∗ S ⇒rm αAw ⇒rm αβw, and derivation 2 ∗ S ⇒rm γBx ⇒rm αβy, and S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ··· ⇒ γn = w, 3 FIRSTk(w) = FIRSTk(y) we can, for each right-sentential form in the derivation: ⇒ αAy = γBx 1 isolate the handle of each right-sentential form, and i.e., Assume sentential forms αβw and αβy, with common prefix αβ 2 determine the production by which to reduce and common k-symbol lookahead FIRSTk(y) = FIRSTk(w), such that by scanning γi from left to right, going at most k symbols beyond the αβw reduces to αAw and αβy reduces to γBx. right end of the handle of γi. But, the common prefix means αβy also reduces to αAy, for the same result. Thus αAy = γBx.

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 57 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 58 / 98

Why study LR grammars? LR parsing

LR(1) grammars are often used to construct parsers. Three common algorithms to build tables for an “LR” parser: We call these parsers LR(1) parsers. 1 SLR(1) virtually all context-free programming language constructs can be smallest class of grammars expressed in an LR(1) form smallest tables (number of states) LR grammars are the most general grammars parsable by a simple, fast construction deterministic, bottom-up parser 2 LR(1) efficient parsers can be implemented for LR(1) grammars full set of LR(1) grammars LR parsers detect an error as soon as possible in a left-to-right largest tables (number of states) scan of the input slow, large construction LR grammars describe a proper superset of the languages recognized by predictive (i.e., LL) parsers 3 LALR(1) intermediate sized set of grammars LL(k): recognize use of a production A → β seeing first k same number of states as SLR(1) symbols derived from β canonical construction is slow and large LR(k): recognize the handle β after seeing everything better construction techniques exist derived from β plus k lookahead symbols

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 59 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 60 / 98 SLR vs. LR/LALR LR(k) items

The table construction algorithms use sets of LR(k) items or configurations to represent the possible states in a parse. An LR(k) item is a pair [ , ], where An LR(1) parser for either Algol or Pascal has several thousand states, α β G • while an SLR(1) or LALR(1) parser for the same language may have α is a production from with a at some position in the RHS, marking how much of the RHS of a production has already been several hundred states. seen β is a lookahead string containing k symbols (terminals or $) Two cases of interest are k = 0 and k = 1: LR(0) items play a key role in the SLR(1) table construction algorithm. LR(1) items play a key role in the LR(1) and LALR(1) table construction algorithms.

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 61 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 62 / 98

Example The characteristic finite state machine (CFSM)

The • indicates how much of an item we have seen at a given state in The CFSM for a grammar is a DFA which recognizes viable prefixes of the parse: right-sentential forms: [A → •XYZ] indicates that the parser is looking for a string that can be derived from XYZ A viable prefix is any prefix that does not extend beyond the [A → XY • Z] indicates that the parser has seen a string derived from handle. XY and is looking for one derivable from Z It accepts when a handle has been discovered and needs to be LR(0) items: (no lookahead) reduced. A → XYZ generates 4 LR(0) items: To construct the CFSM we need two functions: 1 [A → •XYZ] CLOSURE(I) to build its states 2 [A → X • YZ] 3 [A → XY • Z] GOTO(I,X) to determine its transitions 4 [A → XYZ•]

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 63 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 64 / 98 CLOSURE GOTO

Given an item [A → α • Bβ], its closure contains the item and any other Let I be a set of LR(0) items and X be a grammar symbol. items that can generate legal substrings to follow α. Then, GOTO(I,X) is the closure of the set of all items Thus, if the parser has viable prefix α on its stack, the input should [A → αX • β] such that [A → α • Xβ] ∈ I reduce to Bβ (or γ for some other item [B → •γ] in the closure). If I is the set of valid items for some viable prefix γ, then GOTO(I,X) is function CLOSURE(I) the set of valid items for the viable prefix γX. repeat GOTO(I,X) represents state after recognizing X in state I. if [A → α • Bβ] ∈ I add [B → •γ] to I function GOTO(I,X) until no more items can be added to I let J be the set of items [A → αX • β] return I such that [A → α • Xβ] ∈ I return CLOSURE(J)

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 65 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 66 / 98

Building the LR(0) item sets LR(0) example

We start the construction with the item [S0 → •S$], where 1 S → E$ I0 : S → •E$ I4 : E → E + T• E → E + T E → •E + T I : T → id• S0 is the start symbol of the augmented grammar G0 2 5 S is the start symbol of G 3 | T E → •T I6 : T → (•E) $ represents EOF 4 T → id T → •id E → •E + T To compute the collection of sets of LR(0) items 5 | (E) T → •(E) E → •T I : S → E • $ T → •id function items(G0) The corresponding CFSM: 1 0 s0 ← CLOSURE({[S → •S$]}) 9 E → E • +T T → •(E) T C ← {s0} I2 : S → E$• I7 : T → (E•) repeat ( T id id I3 : E → E + •T E → E • +T for each set of items s ∈ C 0 5 6 ( for each grammar symbol X T → •id I8 : T → (E)• if GOTO(s,X) 6= and GOTO(s,X) 6∈ C E id ( E φ T → •(E) I9 : E → T• add GOTO(s,X) to C 1 +3 + 7 until no more item sets can be added to C return C $ T ) 2 4 8

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 67 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 68 / 98 Constructing the LR(0) parsing table LR(0) example

state ACTION GOTO 1 construct the collection of sets of LR(0) items for G0 9 T id () + $ SET 2 state i of the CFSM is constructed from Ii ( 1 [A → α • aβ] ∈ Ii and GOTO(Ii,a) = Ij T 0 s5 s6 – – – – 1 9 ⇒ ACTION[i,a] ← “shift j” id id 1 – – – s3 s2 ––– 0 0 5 6 ( 2 [A → α•] ∈ Ii,A 6= S 2 acc acc acc acc acc ––– ⇒ ACTION[i,a] ← “reduce A → α”, ∀a 0 E id E 3 s5 s6 – – – – – 4 3 [S → S$•] ∈ I ( i 4 r2 r2 r2 r2 r2 ––– ⇒ ACTION[i,a] ← “accept”, ∀a 1 +3 + 7 5 r4 r4 r4 r4 r4 ––– 3 GOTO(I ,A) = I i j 6 s5 s6 – – – – 7 9 ⇒ GOTO[i,A] ← j T ) $ 7 – – s8 s3 – ––– 4 set undefined entries in ACTION and GOTO to “error” 0 5 initial state of parser s0 is CLOSURE([S → •S$]) 2 4 8 8 r5 r5 r5 r5 r5 ––– 9 r3 r3 r3 r3 r3 –––
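The five steps above can be turned into a small table builder. Here is a Python sketch for this grammar; the state numbering comes from a sorted order and so will not match the slide's numbering, and the encodings are illustrative choices.

```python
PRODS = [("S", ("E", "$")), ("E", ("E", "+", "T")), ("E", ("T",)),
         ("T", ("id",)), ("T", ("(", "E", ")"))]
NONTERMS = {"S", "E", "T"}
SYMBOLS = {s for _, rhs in PRODS for s in rhs}
TERMS = SYMBOLS - NONTERMS

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot) in list(items):
            rhs = PRODS[p][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for q, (lhs, _) in enumerate(PRODS):
                    if lhs == rhs[dot] and (q, 0) not in items:
                        items.add((q, 0))
                        changed = True
    return frozenset(items)

def goto(items, X):
    return closure({(p, d + 1) for (p, d) in items
                    if d < len(PRODS[p][1]) and PRODS[p][1][d] == X})

def items_lr0():
    s0 = closure({(0, 0)})
    C, work = {s0}, [s0]
    while work:
        s = work.pop()
        for X in SYMBOLS:
            t = goto(s, X)
            if t and t not in C:
                C.add(t)
                work.append(t)
    return C

def build_lr0_table():
    states = sorted(items_lr0(), key=sorted)   # arbitrary but stable numbering
    index = {s: i for i, s in enumerate(states)}
    ACTION, GOTO_T, conflicts = {}, {}, []

    def set_action(i, a, act):
        if ACTION.get((i, a), act) != act:     # multiply-defined entry
            conflicts.append((i, a))
        ACTION[i, a] = act

    for s in states:
        i = index[s]
        for (p, dot) in s:
            rhs = PRODS[p][1]
            if dot < len(rhs):                          # dot before a symbol
                X = rhs[dot]
                j = index[goto(s, X)]
                if X in NONTERMS:
                    GOTO_T[i, X] = j                    # step 3: GOTO[i,A] <- j
                else:
                    set_action(i, X, ("shift", j))      # step 2.1
            elif p == 0:                                # [S' -> S $ .]
                for a in TERMS:
                    set_action(i, a, ("accept",))       # step 2.3
            else:
                for a in TERMS:
                    set_action(i, a, ("reduce", p))     # step 2.2
    return ACTION, GOTO_T, states, conflicts

ACTION, GOTO_T, states, conflicts = build_lr0_table()
print(len(states), len(conflicts))   # 10 states, 0 conflicts: G is LR(0)
```

The `conflicts` list implements the "multiply-defined ACTION entries" test from the next slide: it stays empty exactly when the grammar is LR(0).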


Conflicts in the ACTION table

If the LR(0) parsing table contains any multiply-defined ACTION
entries, then G is not LR(0).
Two conflicts arise:
  shift-reduce: both shift and reduce possible in the same item set
  reduce-reduce: more than one distinct reduce action possible in the
  same item set

Conflicts can be resolved through lookahead in ACTION. Consider:
  A → ε | aα  ⇒  shift-reduce conflict
  a:=b+c*d requires lookahead to avoid a shift-reduce conflict after
  shifting c (need to see * to give it precedence over +)

SLR(1): simple lookahead LR

Add lookaheads after building the LR(0) item sets.
Constructing the SLR(1) parsing table:
1 construct the collection of sets of LR(0) items for G′
2 state i of the CFSM is constructed from Ii
    1 [A → α • aβ] ∈ Ii and GOTO(Ii,a) = Ij
        ⇒ ACTION[i,a] ← “shift j”, ∀a ≠ $
    2 [A → α•] ∈ Ii, A ≠ S′
        ⇒ ACTION[i,a] ← “reduce A → α”, ∀a ∈ FOLLOW(A)
    3 [S′ → S • $] ∈ Ii
        ⇒ ACTION[i,$] ← “accept”
3 GOTO(Ii,A) = Ij
    ⇒ GOTO[i,A] ← j
4 set undefined entries in ACTION and GOTO to “error”
5 initial state of parser s0 is CLOSURE([S′ → •S$])


From previous example

1 S → E$        FOLLOW(E) = FOLLOW(T) = {$, +, )}
2 E → E + T
3   | T
4 T → id
5   | (E)

        ACTION                     GOTO
state   id   (    )    +    $      S  E  T
0       s5   s6   –    –    –      –  1  9
1       –    –    –    s3   acc    –  –  –
2       –    –    –    –    –      –  –  –
3       s5   s6   –    –    –      –  –  4
4       –    –    r2   r2   r2     –  –  –
5       –    –    r4   r4   r4     –  –  –
6       s5   s6   –    –    –      –  7  9
7       –    –    s8   s3   –      –  –  –
8       –    –    r5   r5   r5     –  –  –
9       –    –    r3   r3   r3     –  –  –

Example: A grammar that is not LR(0)

1 S → E$
2 E → E + T
3   | T
4 T → T ∗ F
5   | F
6 F → id
7   | (E)

I0: S → •E$         I6: F → (•E)
    E → •E + T          E → •E + T
    E → •T              E → •T
    T → •T ∗ F          T → •T ∗ F
    T → •F              T → •F
    F → •id             F → •id
    F → •(E)            F → •(E)
I1: S → E • $       I7: E → T•
    E → E • +T          T → T • ∗F
I2: S → E$•         I8: T → T ∗ •F
I3: E → E + •T          F → •id
    T → •T ∗ F          F → •(E)
    T → •F          I9: T → T ∗ F•
    F → •id         I10: F → (E)•
    F → •(E)        I11: E → E + T•
I4: T → F•               T → T • ∗F
I5: F → id•         I12: F → (E•)
                         E → E • +T

FOLLOW:  E {+, ), $}   T {+, ∗, ), $}   F {+, ∗, ), $}
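The FOLLOW sets quoted above can be recomputed with a short fixpoint sketch. The function names are illustrative; since this grammar has no ε-productions, FIRST of a symbol string is just the FIRST set of its leading symbol.

```python
# Grammar on the right: 0: S -> E $  1: E -> E + T  2: E -> T
#                       3: T -> T * F  4: T -> F  5: F -> id  6: F -> ( E )
PRODS = [("S", ("E", "$")), ("E", ("E", "+", "T")), ("E", ("T",)),
         ("T", ("T", "*", "F")), ("T", ("F",)),
         ("F", ("id",)), ("F", ("(", "E", ")"))]
NONTERMS = {"S", "E", "T", "F"}

def compute_first():
    """FIRST sets by fixpoint. With no epsilon-productions, only the
    leading symbol of each RHS contributes."""
    first = {A: set() for A in NONTERMS}
    for _, rhs in PRODS:
        for X in rhs:
            if X not in NONTERMS:
                first[X] = {X}          # FIRST of a terminal is itself
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            before = len(first[lhs])
            first[lhs] |= first[rhs[0]]
            if len(first[lhs]) != before:
                changed = True
    return first

def compute_follow(first):
    """FOLLOW(B) gathers FIRST of what follows B in a RHS, or FOLLOW of
    the LHS when B ends the RHS."""
    follow = {A: set() for A in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            for i, B in enumerate(rhs):
                if B not in NONTERMS:
                    continue
                new = first[rhs[i + 1]] if i + 1 < len(rhs) else follow[lhs]
                before = len(follow[B])
                follow[B] |= new
                if len(follow[B]) != before:
                    changed = True
    return follow

FOLLOW = compute_follow(compute_first())
print(sorted(FOLLOW["E"]))   # ['$', ')', '+']
print(sorted(FOLLOW["T"]))   # ['$', ')', '*', '+']
```

The output matches the FOLLOW table on the slide: FOLLOW(T) and FOLLOW(F) additionally contain ∗, which is exactly the lookahead that lets SLR(1) resolve the shift-reduce conflict in I7.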


Example: But it is SLR(1)

        ACTION                          GOTO
state   +    ∗    id   (    )    $      S  E  T  F
0       –    –    s5   s6   –    –      –  1  7  4
1       s3   –    –    –    –    acc    –  –  –  –
2       –    –    –    –    –    –      –  –  –  –
3       –    –    s5   s6   –    –      –  –  11 4
4       r5   r5   –    –    r5   r5     –  –  –  –
5       r6   r6   –    –    r6   r6     –  –  –  –
6       –    –    s5   s6   –    –      –  12 7  4
7       r3   s8   –    –    r3   r3     –  –  –  –
8       –    –    s5   s6   –    –      –  –  –  9
9       r4   r4   –    –    r4   r4     –  –  –  –
10      r7   r7   –    –    r7   r7     –  –  –  –
11      r2   s8   –    –    r2   r2     –  –  –  –
12      s3   –    –    –    s10  –      –  –  –  –

Example: A grammar that is not SLR(1)

Consider:             Its LR(0) item sets:
  S → L = R           I0: S′ → •S$        I5: L → ∗ • R
    | R                   S → •L = R          R → •L
  L → ∗R                  S → •R              L → • ∗ R
    | id                  L → • ∗ R           L → •id
  R → L                   L → •id         I6: S → L = •R
                          R → •L              R → •L
                      I1: S′ → S • $         L → • ∗ R
                      I2: S → L• = R         L → •id
                          R → L•         I7: L → ∗R•
                      I3: S → R•         I8: R → L•
                      I4: L → id•        I9: S → L = R•

Now consider I2:  = ∈ FOLLOW(R)  (S ⇒ L = R ⇒ ∗R = R)


LR(1) items

Recall: An LR(k) item is a pair [α, β], where
  α is a production from G with a • at some position in the RHS,
  marking how much of the RHS of a production has been seen
  β is a lookahead string containing k symbols (terminals or $)

What about LR(1) items?
  All the lookahead strings are constrained to have length 1
  They look something like [A → X • YZ, a]

LR(1) items

What is the point of the lookahead symbols?
  carried along to choose the correct reduction when there is a choice
  lookaheads are bookkeeping, unless the item has • at the right end:
    in [A → X • YZ, a], a has no direct use
    in [A → XYZ•, a], a is useful
  they allow the use of grammars that are not uniquely invertible†

The point: for [A → α•, a] and [B → α•, b], we can decide between
reducing to A or B by looking at limited right context.

†No two productions have the same RHS


closure1(I)

Given an item [A → α • Bβ, a], its closure contains the item and any
other items that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on its stack, the input should
reduce to Bβ (or γ for some other item [B → •γ, b] in the closure).

function closure1(I)
  repeat
    if [A → α • Bβ, a] ∈ I
      add [B → •γ, b] to I, where b ∈ FIRST(βa)
  until no more items can be added to I
  return I

goto1(I)

Let I be a set of LR(1) items and X be a grammar symbol.
Then, GOTO(I,X) is the closure of the set of all items
[A → αX • β, a] such that [A → α • Xβ, a] ∈ I.
If I is the set of valid items for some viable prefix γ, then GOTO(I,X)
is the set of valid items for the viable prefix γX.
goto1(I,X) represents the state after recognizing X in state I.

function goto1(I,X)
  let J be the set of items [A → αX • β, a]
    such that [A → α • Xβ, a] ∈ I
  return closure1(J)
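The only new ingredient compared with the LR(0) CLOSURE is the lookahead computation b ∈ FIRST(βa). A Python sketch of closure1, using the small grammar S′ → S, S → CC, C → cC | d that reappears in the LALR example later; items are (production index, dot, lookahead) triples, an illustrative encoding.

```python
# 0: S' -> S   1: S -> C C   2: C -> c C   3: C -> d
PRODS = [("S'", ("S",)), ("S", ("C", "C")),
         ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {"S'", "S", "C"}

def first_of(symbols):
    """FIRST of a nonempty symbol string. This grammar has no
    epsilon-productions and no left recursion, so the straightforward
    recursion terminates and only the leading symbol matters."""
    X = symbols[0]
    if X not in NONTERMS:
        return {X}
    return {t for lhs, rhs in PRODS if lhs == X for t in first_of(rhs)}

def closure1(items):
    """closure1(I): for [A -> alpha . B beta, a], add [B -> . gamma, b]
    for every b in FIRST(beta a), until no more items can be added."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot, a) in list(items):
            rhs = PRODS[p][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for b in first_of(rhs[dot + 1:] + (a,)):
                    for q, (lhs, _) in enumerate(PRODS):
                        if lhs == rhs[dot] and (q, 0, b) not in items:
                            items.add((q, 0, b))
                            changed = True
    return frozenset(items)

I0 = closure1({(0, 0, "$")})   # closure1 of [S' -> . S, $]
print(len(I0))                 # 6 item/lookahead pairs
```

From [S → •CC, $] the step β = C, a = $ gives FIRST(C$) = {c, d}, so the C-productions enter with lookaheads c and d, exactly as in the I0 of the LALR example later in the slides.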


Building the LR(1) item sets for grammar G

We start the construction with the item [S′ → •S, $], where
  S′ is the start symbol of the augmented grammar G′
  S is the start symbol of G
  $ represents EOF

To compute the collection of sets of LR(1) items:

function items(G′)
  s0 ← closure1({[S′ → •S, $]})
  C ← {s0}
  repeat
    for each set of items s ∈ C
      for each grammar symbol X
        if goto1(s,X) ≠ φ and goto1(s,X) ∉ C
          add goto1(s,X) to C
  until no more item sets can be added to C
  return C

Constructing the LR(1) parsing table

Build lookahead into the DFA to begin with:
1 construct the collection of sets of LR(1) items for G′
2 state i of the LR(1) machine is constructed from Ii
    1 [A → α • aβ, b] ∈ Ii and goto1(Ii,a) = Ij
        ⇒ ACTION[i,a] ← “shift j”
    2 [A → α•, a] ∈ Ii, A ≠ S′
        ⇒ ACTION[i,a] ← “reduce A → α”
    3 [S′ → S•, $] ∈ Ii
        ⇒ ACTION[i,$] ← “accept”
3 goto1(Ii,A) = Ij
    ⇒ GOTO[i,A] ← j
4 set undefined entries in ACTION and GOTO to “error”
5 initial state of parser s0 is closure1([S′ → •S, $])


Back to previous example (∉ SLR(1))

  S → L = R       I0: S′ → •S, $          I5: L → id•, =$
    | R               S → •L = R, $       I6: S → L = •R, $
  L → ∗R              S → •R, $               R → •L, $
    | id              L → • ∗ R, =$           L → • ∗ R, $
  R → L               L → •id, =$             L → •id, $
                      R → •L, $           I7: L → ∗R•, =$
                  I1: S′ → S•, $          I8: R → L•, =$
                  I2: S → L• = R, $       I9: S → L = R•, $
                      R → L•, $           I10: R → L•, $
                  I3: S → R•, $           I11: L → ∗ • R, $
                  I4: L → ∗ • R, =$            R → •L, $
                      R → •L, =$               L → • ∗ R, $
                      L → • ∗ R, =$            L → •id, $
                      L → •id, =$         I12: L → id•, $
                                          I13: L → ∗R•, $

I2 no longer has a shift-reduce conflict: reduce on $, shift on =.

Example: back to SLR(1) expression grammar

In general, LR(1) has many more states than LR(0)/SLR(1):

1 S → E         4 T → T ∗ F
2 E → E + T     5   | F
3   | T         6 F → id
                7   | (E)

LR(1) item sets:

I0:                  I0′: shifting (        I0″: shifting ( again
  S → •E, $            F → (•E), ∗+$          F → (•E), ∗+)
  E → •E + T, +$       E → •E + T, +)         E → •E + T, +)
  E → •T, +$           E → •T, +)             E → •T, +)
  T → •T ∗ F, ∗+$      T → •T ∗ F, ∗+)        T → •T ∗ F, ∗+)
  T → •F, ∗+$          T → •F, ∗+)            T → •F, ∗+)
  F → •id, ∗+$         F → •id, ∗+)           F → •id, ∗+)
  F → •(E), ∗+$        F → •(E), ∗+)          F → •(E), ∗+)

Another example

Consider:          LR(1) item sets:
0 S′ → S           I0: S′ → •S, $          I4: C → d•, cd
1 S → CC               S → •CC, $          I5: S → CC•, $
2 C → cC               C → •cC, cd         I6: C → c • C, $
3   | d                C → •d, cd              C → •cC, $
                   I1: S′ → S•, $              C → •d, $
                   I2: S → C • C, $        I7: C → d•, $
                       C → •cC, $          I8: C → cC•, cd
                       C → •d, $           I9: C → cC•, $
                   I3: C → c • C, cd
                       C → •cC, cd
                       C → •d, cd

        ACTION         GOTO
state   c    d    $    S  C
0       s3   s4   –    1  2
1       –    –    acc  –  –
2       s6   s7   –    –  5
3       s3   s4   –    –  8
4       r3   r3   –    –  –
5       –    –    r1   –  –
6       s6   s7   –    –  9
7       –    –    r3   –  –
8       r2   r2   –    –  –
9       –    –    r2   –  –

LALR(1) parsing

Define the core of a set of LR(1) items to be the set of LR(0) items
derived by ignoring the lookahead symbols.
Thus, the two sets
  {[A → α • β, a], [A → α • β, b]}, and
  {[A → α • β, c], [A → α • β, d]}
have the same core.

Key idea:
  If two sets of LR(1) items, Ii and Ij, have the same core, we can
  merge the states that represent them in the ACTION and GOTO tables.
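The ten item sets above can be reproduced mechanically by combining closure1, goto1, and the items(G′) worklist. A self-contained Python sketch for this grammar (the triple encoding and function names are illustrative choices):

```python
# 0: S' -> S   1: S -> C C   2: C -> c C   3: C -> d
PRODS = [("S'", ("S",)), ("S", ("C", "C")),
         ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {"S'", "S", "C"}
SYMBOLS = {s for _, rhs in PRODS for s in rhs}

def first_of(symbols):
    # safe here: no epsilon-productions, no left recursion
    X = symbols[0]
    if X not in NONTERMS:
        return {X}
    return {t for lhs, rhs in PRODS if lhs == X for t in first_of(rhs)}

def closure1(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot, a) in list(items):
            rhs = PRODS[p][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for b in first_of(rhs[dot + 1:] + (a,)):
                    for q, (lhs, _) in enumerate(PRODS):
                        if lhs == rhs[dot] and (q, 0, b) not in items:
                            items.add((q, 0, b))
                            changed = True
    return frozenset(items)

def goto1(items, X):
    return closure1({(p, d + 1, a) for (p, d, a) in items
                     if d < len(PRODS[p][1]) and PRODS[p][1][d] == X})

def items_lr1():
    """Worklist construction of the canonical collection of LR(1) item sets."""
    s0 = closure1({(0, 0, "$")})
    C, work = {s0}, [s0]
    while work:
        s = work.pop()
        for X in SYMBOLS:
            t = goto1(s, X)
            if t and t not in C:
                C.add(t)
                work.append(t)
    return C

states = items_lr1()
print(len(states))   # 10 sets of LR(1) items, matching I0..I9 above
```

Note how I3 and I6 come out with the same productions and dot positions but different lookaheads — the observation the LALR(1) merging on the right exploits.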


LALR(1) table construction

To construct LALR(1) parsing tables, we can insert a single step into
the LR(1) algorithm:

(1.5) For each core present among the set of LR(1) items, find
      all sets having that core and replace these sets by their
      union. The goto function must be updated to reflect the
      replacement sets.

The resulting algorithm has large space requirements, as we are still
required to build the full set of LR(1) items.

LALR(1) table construction

The revised (and renumbered) algorithm:
1 construct the collection of sets of LR(1) items for G′
2 for each core present among the set of LR(1) items, find all sets
  having that core and replace these sets by their union (update the
  goto1 function incrementally)
3 state i of the LALR(1) machine is constructed from Ii
    1 [A → α • aβ, b] ∈ Ii and goto1(Ii,a) = Ij
        ⇒ ACTION[i,a] ← “shift j”
    2 [A → α•, a] ∈ Ii, A ≠ S′
        ⇒ ACTION[i,a] ← “reduce A → α”
    3 [S′ → S•, $] ∈ Ii
        ⇒ ACTION[i,$] ← “accept”
4 goto1(Ii,A) = Ij ⇒ GOTO[i,A] ← j
5 set undefined entries in ACTION and GOTO to “error”
6 initial state of parser s0 is closure1([S′ → •S, $])


Example

Reconsider:
0 S′ → S      I0: S′ → •S, $      I3: C → c • C, cd    I6: C → c • C, $
1 S → CC          S → •CC, $          C → •cC, cd          C → •cC, $
2 C → cC          C → •cC, cd         C → •d, cd           C → •d, $
3   | d           C → •d, cd      I4: C → d•, cd       I7: C → d•, $
              I1: S′ → S•, $      I5: S → CC•, $       I8: C → cC•, cd
              I2: S → C • C, $                         I9: C → cC•, $
                  C → •cC, $
                  C → •d, $

Merged states:
I36: C → c • C, cd$       I47: C → d•, cd$
     C → •cC, cd$         I89: C → cC•, cd$
     C → •d, cd$

        ACTION          GOTO
state   c     d     $   S  C
0       s36   s47   –   1  2
1       –     –   acc   –  –
2       s36   s47   –   –  5
36      s36   s47   –   –  8
47      r3    r3   r3   –  –
5       –     –    r1   –  –
89      r2    r2   r2   –  –

More efficient LALR(1) construction

Observe that we can:
  represent Ii by its basis or kernel: items that are either
  [S′ → •S, $] or do not have • at the left of the RHS
  compute shift, reduce and goto actions for the state derived from Ii
  directly from its kernel

This leads to a method that avoids building the complete canonical
collection of sets of LR(1) items.

Self reading: Section 4.7.5 of the Dragon book
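The merging step (1.5) can be sketched directly on the ten item sets listed above, transcribed as (production, dot, lookahead) triples for productions 0: S′ → S, 1: S → CC, 2: C → cC, 3: C → d; the encoding is an illustrative choice.

```python
# The ten LR(1) item sets I0..I9 from the slide, as triples.
I = [
    {(0,0,"$"), (1,0,"$"), (2,0,"c"), (2,0,"d"), (3,0,"c"), (3,0,"d")},  # I0
    {(0,1,"$")},                                                         # I1
    {(1,1,"$"), (2,0,"$"), (3,0,"$")},                                   # I2
    {(2,1,"c"), (2,1,"d"), (2,0,"c"), (2,0,"d"), (3,0,"c"), (3,0,"d")},  # I3
    {(3,1,"c"), (3,1,"d")},                                              # I4
    {(1,2,"$")},                                                         # I5
    {(2,1,"$"), (2,0,"$"), (3,0,"$")},                                   # I6
    {(3,1,"$")},                                                         # I7
    {(2,2,"c"), (2,2,"d")},                                              # I8
    {(2,2,"$")},                                                         # I9
]

def core(items):
    """The LR(0) core: drop the lookahead component of every item."""
    return frozenset((p, d) for (p, d, _) in items)

def merge_by_core(states):
    """Union together all item sets sharing a core (step 1.5)."""
    merged = {}
    for s in states:
        merged.setdefault(core(s), set()).update(s)
    return merged

lalr = merge_by_core(I)
print(len(lalr))                 # 7 LALR states: I3/I6, I4/I7, I8/I9 merge
assert core(I[3]) == core(I[6])  # these two become I36 on the slide
```

The union of I4 and I7 is exactly I47 from the slide: the single item C → d• now carrying lookaheads c, d, and $.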


The role of precedence

Precedence and associativity can be used to resolve shift/reduce
conflicts in ambiguous grammars:
  lookahead with higher precedence ⇒ shift
  same precedence, left associative ⇒ reduce

Advantages:
  more concise, albeit ambiguous, grammars
  shallower parse trees ⇒ fewer reductions

Classic application: expression grammars

The role of precedence

With precedence and associativity, we can use:

  E → E ∗ E
    | E/E
    | E + E
    | E − E
    | (E)
    | -E
    | id
    | num

This eliminates useless reductions (single productions).
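The two resolution rules (lookahead with higher precedence ⇒ shift; same precedence, left associative ⇒ reduce) can be sketched as a tiny decision function. The precedence table below is an illustrative encoding, not yacc's actual internals.

```python
# Precedence levels and associativity for the ambiguous expression grammar.
PREC = {"+": (1, "left"), "-": (1, "left"),
        "*": (2, "left"), "/": (2, "left")}

def resolve(op_on_stack, lookahead):
    """Decide a shift/reduce conflict between the operator already parsed
    (about to be reduced) and the lookahead operator."""
    p_stack, assoc = PREC[op_on_stack]
    p_look, _ = PREC[lookahead]
    if p_look > p_stack:
        return "shift"      # lookahead binds tighter: in 1 + 2 * 3, keep 2
    if p_look < p_stack:
        return "reduce"     # stack op binds tighter: in 1 * 2 + 3, fold 1 * 2
    return "reduce" if assoc == "left" else "shift"

print(resolve("+", "*"))    # shift
print(resolve("*", "+"))    # reduce
print(resolve("+", "+"))    # reduce (left associative)
```

This is the same decision a yacc-style generator makes from %left/%right declarations when it meets a shift/reduce conflict in the ambiguous grammar above.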


Error recovery in shift-reduce parsers

The problem:
  encounter an invalid token
  bad pieces of tree hanging from the stack
  incorrect entries in the symbol table

We want to parse the rest of the file.

Restarting the parser:
  find a restartable state on the stack
  move to a consistent place in the input
  print an informative message to stderr (line number)

Error recovery in yacc/bison/Java CUP

The error mechanism:
  designated token error
  valid in any production
  error shows synchronization points

When an error is discovered:
  pops the stack until error is legal
  skips input tokens until it successfully shifts 3 (some default value)
  error productions can have actions

This mechanism is fairly general.

Read the section on Error Recovery of the on-line CUP manual


Example

Using error:

  stmt list : stmt
    | stmt list ; stmt

can be augmented with error:

  stmt list : stmt
    | error
    | stmt list ; stmt

This should work:
  throw out the erroneous statement
  synchronize at “;” or “end”
  invoke yyerror("syntax error")

Other “natural” places for errors:
  all the “lists”: FieldList, CaseList
  missing parentheses or brackets (yychar)
  extra operator or missing operator

Left versus right recursion

Right recursion:
  needed for termination in predictive parsers
  requires more stack space
  right associative operators

Left recursion:
  works fine in bottom-up parsers
  limits required stack space
  left associative operators

Rule of thumb:
  right recursion for top-down parsers
  left recursion for bottom-up parsers

Left recursive grammar:
  E → E + T | T
  T → T ∗ F | F
  F → (E) | Int

After left recursion removal:
  E → TE′
  E′ → +TE′ | ε
  T → FT′
  T′ → ∗FT′ | ε
  F → (E) | Int

Parse the string 3 + 4 + 5.
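The left-recursion-removal step used above can be sketched for the immediate case A → Aα | β, which becomes A → βA′, A′ → αA′ | ε. The primed-name convention and tuple encoding are illustrative choices.

```python
def remove_left_recursion(A, prods):
    """prods: list of RHS tuples for nonterminal A. Handles only immediate
    left recursion (A -> A alpha | beta)."""
    rec = [rhs[1:] for rhs in prods if rhs and rhs[0] == A]     # the alphas
    nonrec = [rhs for rhs in prods if not rhs or rhs[0] != A]   # the betas
    if not rec:
        return {A: prods}
    A1 = A + "'"                                # fresh nonterminal A'
    return {A: [beta + (A1,) for beta in nonrec],
            A1: [alpha + (A1,) for alpha in rec] + [()]}        # () = epsilon

new = remove_left_recursion("E", [("E", "+", "T"), ("T",)])
print(new["E"])    # E  -> T E'
print(new["E'"])   # E' -> + T E' | epsilon
```

Applied to E → E + T | T this reproduces the slide's E → TE′, E′ → +TE′ | ε; T → T ∗ F | F transforms the same way.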


Parsing review

Recursive descent
  A hand-coded recursive descent parser directly encodes a grammar
  (typically an LL(1) grammar) into a series of mutually recursive
  procedures. It has most of the linguistic limitations of LL(1).

LL(k)
  An LL(k) parser must be able to recognize the use of a production
  after seeing only the first k symbols of its right hand side.

LR(k)
  An LR(k) parser must be able to recognize the occurrence of the
  right hand side of a production after having seen all that is
  derived from that right hand side with k symbols of lookahead.

Grammar hierarchy

  LR(k) > LR(1) > LALR(1) > SLR(1) > LR(0)
  LL(k) > LL(1) > LL(0)
  LR(0) > LL(0)
  LR(1) > LL(1)
  LR(k) > LL(k)
