Outline

• Review of bottom-up

LR Parsing, LALR Parser Generators • Computing the parsing DFA

• Using parser generators Lecture 10

Prof. Fateman CS 164 Lecture 10 1 Prof. Fateman CS 164 Lecture 10 2

Bottom-up Parsing (Review) The Shift and Reduce Actions (Review)

• A bottom-up parser rewrites the input string • Recall the CFG: E → int | E + (E) to the start symbol • A bottom-up parser uses two kinds of actions: • The state of the parser is described as •Shiftpushes a terminal from input on the stack α I γ – α is a stack of terminals and non-terminals E + (I int ) ⇒ E + (int I ) – γ is the string of terminals not yet examined •Reducepops 0 or more symbols off the stack (the rule’s rhs) and pushes a non-terminal on • Initially: I x1x2 . . . xn the stack (the rule’s lhs)

E + (E + ( E ) I ) ⇒ E +(E I ) Prof. Fateman CS 164 Lecture 10 3 Prof. Fateman CS 164 Lecture 10 4 LR(1) Parsing. An Example Key Issue: When to Shift or Reduce? int int + (int) + (int)$ shift 0 1 I int + (int) + (int)$ E → int • Idea: use a finite automaton (DFA) to decide E E → int I ( on $, + E I + (int) + (int)$ shift(x3) when to shift or reduce 2 + 3 4 E + (int I ) + (int)$ E → int – The input is the stack accept E int E + (E ) + (int)$ shift – The language consists of terminals and non-terminals on $ I ) E → int E + (E) I + (int)$ E → E+(E) 7 6 5 on ), + E I + (int)$ shift (x3) • We run the DFA on the stack and we examine E → E + (E) + E + (int )$ E → int the resulting state X and the token tok after I on $, + int I E + (E )$ shift –If X has a transition labeled tok then shift ( I 8 9 β E + (E) I $ E → E+(E) –If X is labeled with “A → on tok”then reduce E + E I $ accept 10 11 E → E + (E) Prof. Fateman CS 164 Lecture 10 5 ) on ), +

Representing the DFA Representing the DFA. Example

• Parsers represent the DFA as a 2D table • The table for a fragment of our DFA: – Recall table-driven int + ( ) $ E ( • Lines correspond to DFA states 3 4 … int 3s4 • Columns correspond to terminals and non- E 4s5 g6 terminals 5 6 5 rE→ int rE→ int • Typically columns are split into: E → int 6s8 s7 – Those for terminals: the action table ) on ), + 7 rE→ E+(E) rE→ E+(E) – Those for non-terminals: the goto table … sk is shift and goto state k

7 rX → α is reduce E → E + (E) gk is goto state k, Prof. Fateman CS 164 Lecture 10 7 on $, + Prof. Fateman CS 164 Lecture 10 8 The LR Parsing Algorithm The LR Parsing Algorithm

• After a shift or reduce action we rerun the Let I = w$ be initial input DFA on the entire stack Let j = 0 Let DFA state 0 be the start state – This is wasteful, since most of the work is repeated Let stack = 〈 dummy, 0 〉 repeat • Remember for each stack element on which case action[top_state(stack), I[j]] of state it brings the DFA; use extra memory. shift k: push 〈 I[j++], k 〉 reduce X → A: pop |A| pairs, • LR parser maintains a stack push 〈X, Goto[top_state(stack), X]〉 〈 sym1, state1 〉 . . . 〈 symn, staten 〉 accept: halt normally

statek is the final state of the DFA on sym1 …symk error: halt and report error

Prof. Fateman CS 164 Lecture 10 9 Prof. Fateman CS 164 Lecture 10 10

Key Issue: How is the DFA Constructed? LR(1) Items

• The stack describes the context of the parse • An LR(1) item is a pair: – What non-terminal we are looking for X →α•β, a – What production rhs we are looking for –X → αβ is a production – What we have seen so far from the rhs –ais a terminal (the lookahead terminal) – LR(1) means 1 lookahead terminal • Each DFA state describes several such contexts •[X →α•β, a] describes a context of the parser – E.g., when we are looking for non-terminal E, we – We are trying to find an X followed by an a, and might be looking either for an int or a E + (E) rhs – We have (at least) α already on top of the stack – Thus we need to see next a prefix derived from βa Prof. Fateman CS 164 Lecture 10 11 Prof. Fateman CS 164 Lecture 10 12 Note Convention

•The symbol I was used before to separate the • We add to our grammar a fresh new start stack from the rest of input symbol S and a production S → E – α I γ, where α is the stack and γ is the remaining –Where E is the old start symbol string of terminals •In items • is used to mark a prefix of a • The initial parsing context contains: production rhs: S → •E, $ →α β X • , a – Trying to find an S as a string derived from E$ β –Here might contain terminals as well –The stack is empty • In both case the stack is on the left

Prof. Fateman CS 164 Lecture 10 13 Prof. Fateman CS 164 Lecture 10 14

LR(1) Items (Cont.) LR(1) Items (Cont.)

• In context containing •Consider the item E → E + • ( E ), + E → E + (• E ) , + –If ( follows then we can perform a shift to context • We expect a string derived from E ) + containing • There are two productions for E → • E E + ( E ), + E → int and E → E + ( E) • In context containing • We describe this by extending the context E → E + ( E ) •, + with two more items: – We can perform a reduction with E → E + ( E ) E → • int, ) –But only if a + follows E → • E + ( E ) , )

Prof. Fateman CS 164 Lecture 10 15 Prof. Fateman CS 164 Lecture 10 16 The Closure Operation Constructing the Parsing DFA (1)

• Extending the context with items is called the • Construct the start context: Closure({S → •E, $}) closure operation. S → •E, $ E → •E+(E), $ Closure(Items) = E → •int, $ repeat E → •E+(E), + for each [X → α•Yβ, a] in Items E → •int, + for each production Y → γ • We abbreviate as: for each b ∈ First(βa) S → •E, $ γ add [Y → • , b] to Items E → •E+(E), $/+ until Items is unchanged E → •int, $/+

Prof. Fateman CS 164 Lecture 10 17 Prof. Fateman CS 164 Lecture 10 18

Constructing the Parsing DFA (2) The DFA Transitions

• A DFA state is a closed set of LR(1) items • A state “State” that contains [X → α•yβ, b] has a transition labeled y to a state that the • The start state contains [S → •E, $] items “Transition(State, y)” – y can be a terminal or a non-terminal

• A state that contains [X → α•, b] is labeled Transition(State, y) α with “reduce with X → on b” Items ← ∅ for each [X → α•yβ, b] ∈ State • And now the transitions … add [X → αy•β, b] to Items return Closure(Items)

Prof. Fateman CS 164 Lecture 10 19 Prof. Fateman CS 164 Lecture 10 20 Constructing the Parsing DFA. Example. LR Parsing Tables. Notes

→ • 0 1 S E, $ E → int•, $/+ E → int • Parsing tables (i.e. the DFA) can be E → •E+(E), $/+ int on $, + constructed automatically for a CFG E → •int, $/+ E → E+• (E), $/+ 3 E + 2 • Why study this at all in CS164? We still need S → E•, $ ( to understand the construction to work with E → E•+(E), $/+ E → E+(•E), $/+ 4 parser generators accept E E → •E+(E), )/+ on $ E → •int, )/+ – E.g., they report errors in terms of sets of items

E → E+(E•), $/+ int 5 6 • What kind of errors can we expect? E → E•+(E), )/+ E → int•, )/+ E → int on ), + and so on… Prof. Fateman CS 164 Lecture 10 21 Prof. Fateman CS 164 Lecture 10 22

Shift/Reduce Conflicts Shift/Reduce Conflicts

• If a DFA state contains both • They are a typical symptom if there is an [X → α•aβ, b] and [Y → γ•, a] ambiguity in the grammar • Classic example: the dangling else S → if E then S | if E then S else S | OTHER •Then on input “a” we could either • Will have DFA state containing –Shift into state [X → αa•β, b], or [S → if E then S•, else] – Reduce with Y → γ [S → if E then S• else S, x] •Ifelse follows then we can shift or reduce • This is called a shift-reduce conflict • Default (bison, CUP, JLALR, etc.) is to shift – Default behavior is right in this case

Prof. Fateman CS 164 Lecture 10 23 Prof. Fateman CS 164 Lecture 10 24 More Shift/Reduce Conflicts More Shift/Reduce Conflicts

• Consider the Some parser generators (, BISON) → E E + E | E * E | int provide precedence declarations. • We will have the states containing – Precedence left PLUS, → → [E E * • E, +] [E E * E•, +] – Precedence left TIMES [E → • E + E, +] ⇒E [E → E • + E, +] – Precedence right EXP …… • Again we have a shift/reduce on input + •Bison, YACC – We need to reduce (* binds more tightly than +) – Declare precedence and associativity: – Solution: somehow impose the precedence of * and %left + + %left *

Prof. Fateman CS 164 Lecture 10 25 Prof. Fateman CS 164 Lecture 10 26

More Shift/Reduce Conflicts Reduce/Reduce Conflicts

Our LALR generator doesn’t do this. Instead, we • If a DFA state contains both “Stratify” the grammar. (Less explanation!) [X → α•, a] and [Y → β•, a] E → E + E | E * E | int ;; original New –Then on input “a”we don’t know which E → E + E1 | E1 ;; E1+E would be right associative production to reduce E1 → E1 * int | int (Many “layers” may be necessary for elaborate languages. (13 in C++, and some operators • This is called a reduce/reduce conflict appear at several levels, e.g. “(“. Some operators are right-associative like =, +=; most are left associative.)

Prof. Fateman CS 164 Lecture 10 27 Prof. Fateman CS 164 Lecture 10 28 Reduce/Reduce Conflicts More on Reduce/Reduce Conflicts

• Usually due to gross ambiguity in the grammar • Consider the states [S → id •, $] • Example: a sequence of identifiers [S’ → • S, $] [S → id • S, $] → id → S →ε | id | id S [S •, $] ⇒ [S •, $] [S → • id, $] [S → • id, $] [S → • id S, $] [S → • id S, $] • There are two parse trees for the string id • Reduce/reduce conflict on input $ S → id S’ → S → id S → id S → id S’ → S → id S → id • How does this confuse the parser? • Better rewrite the grammar: S →ε | id S

Prof. Fateman CS 164 Lecture 10 29 Prof. Fateman CS 164 Lecture 10 30

Using Parser Generators LR(1) Parsing Tables are Big

• Parser generators construct the parsing DFA • But many states are similar, e.g. given a CFG 1 5 – Use precedence declarations and default E → int E → int•, $/+ and E → int•, )/+ E → int conventions to resolve conflicts on $, + on ), + – The parser algorithm is the same for all grammars (and is provided as a library function) • Idea: merge the DFA states whose items differ only in the lookahead tokens • But most parser generators do not construct – We say that such states have the same core the DFA as described before – Because the LR(1) parsing DFA has 1000s of states • We obtain 1’ even for a simple language E → int•, $/+/) E → int on $, +, )

Prof. Fateman CS 164 Lecture 10 31 Prof. Fateman CS 164 Lecture 10 32 The Core of a Set of LR Items LALR States

• Definition: The core of a set of LR items is • Consider for example the LR(1) states the set of first components {[X →α•, a], [Y →β•, c]} – Without the lookahead terminals {[X →α•, b], [Y →β•, d]} • They have the same core and can be merged • Example: the core of and the merged state contains: →αβ →γδ { [X • , b], [Y • , d]} {[X →α•, a/b], [Y →β•, c/d]} is • These are called LALR(1) states →α β →γδ {X • , Y • } – Stands for LookAhead LR – Typically 10 times fewer LALR(1) states than LR(1)

Prof. Fateman CS 164 Lecture 10 33 Prof. Fateman CS 164 Lecture 10 34

A LALR(1) DFA Conversion LR(1) to LALR(1). Example. int int 0 1 0 1,5 E E → int • Repeat until all states have distinct core E → int E ( on $, + int on $, +, ) – Choose two distinct states with same core 2 + 3 4 – Merge the states by creating a new one with the accept E int + ( union of all the items on $ 2 3,8 4,9 ) E → int accept – Point edges from predecessors to new state 7 6 5 on ), + on $ + E – New state points to all the previous successors E → E + (E) on $, + + int A ) A CB C ( 7,11 6,10 BE 8 9 E → E + (E) D E F D F + E on $, +, ) 10 11 E → E + (E) Prof. Fateman CS 164 Lecture 10 35 ) on ), + The LALR Parser Can Have Conflicts LALR vs. LR Parsing

• Consider for example the LR(1) states • LALR languages are not “natural” – They are an efficiency hack on LR languages {[X →α•, a], [Y →β•, b]} →α• →β• {[X , b], [Y , a]} • You may see claims that any reasonable programming • And the merged LALR(1) state language has a LALR(1) grammar, {Arguably this is {[X →α•, a/b], [Y →β•, a/b]} done by defining languages without an LALR(1) grammar as unreasonable ☺ }. • Has a new reduce-reduce conflict • In any case, LALR(1) has become a standard for • In practice such cases are rare programming languages and for parser generators, in spite of its apparent complexity.

Prof. Fateman CS 164 Lecture 10 37 Prof. Fateman CS 164 Lecture 10 38

A Hierarchy of Grammar Classes Notes on Parsing

•Parsing – A solid foundation: context-free grammars – A simple parser: LL(1)

From Andrew Appel, – A more powerful parser: LR(1) “Modern Compiler – An efficiency hack: LALR(1) Implementation in Java” – LALR(1) parser generators

• Next time we move on to semantic analysis

Prof. Fateman CS 164 Lecture 10 39 Prof. Fateman CS 164 Lecture 10 40 Notes on Lisp LALR generator Sample input for lalr generator

(defparameter G2 '( • lalr.cl is source code; lalr.doc additional (exp --> exp + term #'(lambda(exp n term)(list '+ exp term))) (exp --> exp - term #'(lambda(exp n term)(list '- exp term))) documentation. (exp --> term #'(lambda(term) term)) (term --> term * factor #'(lambda(term n fac)(list '* term fac))) • A complete parse table can be viewed by (term --> factor #'(lambda(factor) factor)) (factor --> id #'(lambda(id) (const-value id))) • (setf p (makeparser G lexforms nil)) (factor --> |(| exp |)| #'(lambda(p1 exp p2) exp)) (factor --> iconst #'(lambda(iconst) (const-value iconst))) • (Print-Table stateList) (factor --> bconst #'(lambda(bconst) (const-value bconst))) (factor --> fconst #'(lambda(fconst) (const-value fconst))))) • (eval p) ;; create the parser named LALR- parser (defparameter lexforms ‘( + - * |(| |)| id iconst bconst fconst)) (make-parser G2 lexforms nil)

Prof. Fateman CS 164 Lecture 10 41 Prof. Fateman CS 164 Lecture 10 42

Sample table-output for lisp lalr generator Each state is embodied in a subroutine

Defined locally in one main program via “labels” STATE-0: $Start --> . exp, nil Using local subroutines that shift, reduce, peek at next input On fconst shift STATE-14 Main parser is called by (lalr-parser #’next-input #’error) On bconst shift STATE-13 On iconst shift STATE-12 Any number of parsers can be set up in the same environment, though On ( shift STATE-7 On id shift STATE-6 usually only one is tested… I just try out some input On factor shift STATE-11 On term shift STATE-16 (parse-fl ‘( (id a) + (id b))) On exp shift STATE-1

STATE-1: ;; if there is a problem, edit the grammar, say G2, then $Start --> exp ., nil exp --> exp . + term, + - nil (remake G2) exp --> exp . - term, + - nil On + shift STATE-9 On - shift STATE-2 (remakec G2) ;; COMPILES lalr-parser. Parser runs 20X faster or so. On nil reduce exp --> $Start … up to state 16 Prof. Fateman CS 164 Lecture 10 43 Prof. Fateman CS 164 Lecture 10 44 Sample output program for lalr generator

(defun lalr-parser (next-input parse-error) (let ((cat-la 'nil) (val-la 'nil) (val-stack 'nil) (state-stack 'nil)) (labels ((input-peek nil …;;these 3 subprograms are standard (shift-from (name) … (reduce-cat (name cat ndaughters action)… (STATE-0 nil ;; generated specifically from grammar Supplement to LR Parsing (case (input-peek) (fconst (shift-from #'STATE-0) (STATE-14)) … (exp (shift-from #'STATE-0) (STATE-1)) (otherwise (funcall parse-error)))) (STATE-1 nil Strange Reduce/Reduce Conflicts (case (input-peek) (+ (shift-from #'STATE-1) (STATE-9)) Due to LALR Conversion (- (shift-from #'STATE-1) (STATE-2)) ((nil) (reduce-cat #'STATE-1 '$Start 1 nil)) (otherwise (funcall parse-error)))) …;etc etc

Prof. Fateman CS 164 Lecture 10 45 Prof. Fateman CS 164 Lecture 10 46

Strange Reduce/Reduce Conflicts Strange Reduce/Reduce Conflicts

• Consider the grammar •In P an id is a S → P R , NL → N | N , NL –Nwhen followed by , or : P → T | NL : T R → T | N : T –Twhen followed by id N → id T → id •In R an id is a • P - parameters specification –Nwhen followed by : • R - result specification –Twhen followed by , • N - a parameter or result name • This is an LR(1) grammar. • T -a type name • But it is not LALR(1). Why? • NL - a list of names – For obscure reasons

Prof. Fateman CS 164 Lecture 10 47 Prof. Fateman CS 164 Lecture 10 48 A Few LR(1) States What Happened?

P → • T id 1 → P → • NL : T id T id • id 3 LALR reduce/reduce • Two distinct states were confused because → conflict on “,” NL → • N : id N id • : they have the same core → NL → • N , NL : N id • , • Fix: add dummy productions to distinguish the N → • id : two confused states → N • id , T → id • id/, • E.g., add → LALR merge T • id id N → id • :/, R → id bogus –bogus is a terminal not used by the lexer R → • T , 2 T → id • , 4 id – This production will never be used during parsing → → R • N : T , N id • : – But it distinguishes R from P T → • id ,

N → • id : Prof. Fateman CS 164 Lecture 10 49 Prof. Fateman CS 164 Lecture 10 50

A Few LR(1) States After Fix

P → • T id 1 P → • NL : T id T → id • id 3 NL → • N : id N → id • : NL → • N , NL : N → id • , N → • id : N → • id , Different cores ⇒ no LALR merging T → • id id → R → . T , T id • , 4 → R → . N : T , 2 N id • : → R →.id bogus , id R id • bogus , T → . id , N → . id : Prof. Fateman CS 164 Lecture 10 51