LR, LALR Parser Generators

Outline

• Review of bottom-up parsing
• Computing the parsing DFA
• Using parser generators

Bottom-up Parsing (Review)

• A bottom-up parser rewrites the input string to the start symbol
• The state of the parser is described as α I γ
  – α is a stack of terminals and non-terminals
  – γ is the string of terminals not yet examined
• Initially: I x1 x2 . . . xn

The Shift and Reduce Actions (Review)

• Recall the CFG: E → int | E + (E)
• A bottom-up parser uses two kinds of actions:
• Shift pushes a terminal from the input on the stack
      E + ( I int )  ⇒  E + ( int I )
• Reduce pops 0 or more symbols off of the stack (the production RHS) and pushes a non-terminal on the stack (the production LHS)
      E + ( E + ( E ) I )  ⇒  E + ( E I )

Key Issue: When to Shift or Reduce?

• Idea: use a deterministic finite automaton (DFA) to decide when to shift or reduce
  – The input is the stack
  – The language consists of terminals and non-terminals
• We run the DFA on the stack and we examine the resulting state X and the token tok after I
  – If X has a transition labeled tok, then shift
  – If X is labeled with “A → β on tok”, then reduce

LR(1) Parsing: An Example

[DFA diagram: states 0–11 for E → int | E + (E); e.g., state 1 is labeled “E → int on $, +”, state 5 “E → int on ), +”, state 7 “E → E+(E) on $, +”, state 11 “E → E+(E) on ), +”, and state 2 “accept on $”]

  I int + (int) + (int)$      shift
  int I + (int) + (int)$      E → int
  E I + (int) + (int)$        shift (x3)
  E + (int I ) + (int)$       E → int
  E + (E I ) + (int)$         shift
  E + (E) I + (int)$          E → E+(E)
  E I + (int)$                shift (x3)
  E + (int I )$               E → int
  E + (E I )$                 shift
  E + (E) I $                 E → E+(E)
  E I $                       accept

Representing the DFA

• Parsers represent the DFA as a 2D table
  – Recall the table-driven lexical analyzer
• Lines correspond to DFA states
• Columns correspond to terminals and non-terminals
• Typically columns are split into:
  – Those for terminals: the action table
  – Those for non-terminals: the goto table

Representing the DFA: Example

The table for a fragment of our DFA:

        int     +            (     )            $            E
  …
  3                          s4
  4     s5                                                   g6
  5             rE→int             rE→int
  6             s8                 s7
  7             rE→E+(E)                        rE→E+(E)
  …

  sk is shift and goto state k
  rX→α is reduce
  gk is goto state k

The LR Parsing Algorithm

• After a shift or reduce action we rerun the DFA on the entire stack
  – This is wasteful, since most of the work is repeated
• Remember for each stack element on which state it brings the DFA
• The LR parser maintains a stack 〈 sym1, state1 〉 . . . 〈 symn, staten 〉
  – statek is the final state of the DFA on sym1 . . . symk

The LR Parsing Algorithm

  let I = w$ be the initial input
  let j = 0
  let DFA state 0 be the start state
  let stack = 〈 dummy, 0 〉
  repeat
    case action[top_state(stack), I[j]] of
      shift k:      push 〈 I[j++], k 〉
      reduce X → α: pop |α| pairs,
                    push 〈 X, goto[top_state(stack), X] 〉
      accept:       halt normally
      error:        halt and report error
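The loop above can be sketched as a runnable table-driven parser. The table below is a minimal sketch for the example grammar E → int | E + (E), using the merged state numbering that appears later in the LALR discussion; the exact entries are a reconstruction, not the output of a generator.

```python
# Table-driven LR parsing for E -> int | E + ( E ), per the algorithm above.
# States (merged/LALR numbering): 0 start, 1 after int, 2 after E,
# 3 after +, 4 after (, 6 after the inner E, 7 after ).
ACTION = {
    (0, 'int'): ('shift', 1),
    (2, '+'):   ('shift', 3),
    (2, '$'):   ('accept',),
    (3, '('):   ('shift', 4),
    (4, 'int'): ('shift', 1),
    (6, ')'):   ('shift', 7),
    (6, '+'):   ('shift', 3),
}
# State 1 reduces E -> int; state 7 reduces E -> E + ( E )
for tok in ('+', ')', '$'):
    ACTION[(1, tok)] = ('reduce', 'E', 1)   # pop |int| = 1 pair
    ACTION[(7, tok)] = ('reduce', 'E', 5)   # pop |E + ( E )| = 5 pairs

GOTO = {(0, 'E'): 2, (4, 'E'): 6}

def lr_parse(tokens):
    """Run the LR loop on a stack of <symbol, state> pairs."""
    stack = [('dummy', 0)]                  # <dummy, start state>
    j = 0
    while True:
        state = stack[-1][1]
        act = ACTION.get((state, tokens[j]))
        if act is None:
            raise SyntaxError(f"unexpected {tokens[j]!r}")
        if act[0] == 'shift':
            stack.append((tokens[j], act[1]))
            j += 1
        elif act[0] == 'reduce':
            _, lhs, n = act
            del stack[-n:]                  # pop |RHS| pairs
            stack.append((lhs, GOTO[(stack[-1][1], lhs)]))
        else:                               # accept
            return True

print(lr_parse(['int', '+', '(', 'int', ')', '$']))  # True
```

Note how a reduce never reruns the DFA: the state stored under the popped pairs is exactly the state the DFA would reach on the remaining stack.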


Key Issue: How is the DFA Constructed?

• The stack describes the context of the parse
  – What non-terminal we are looking for
  – What production RHS we are looking for
  – What we have seen so far from the RHS
• Each DFA state describes several such contexts
  – E.g., when we are looking for non-terminal E, we might be looking either for an int or an E + (E) RHS

LR(0) Items

• An LR(0) item is a production with a “I” somewhere on the RHS
• The items for T → (E) are
      T → I (E)
      T → ( I E)
      T → (E I )
      T → (E) I
• The only item for X → ε is X → I

LR(0) Items: Intuition

• An item [X → α I β] says that
  – the parser is looking for an X
  – it has an α on top of the stack
  – it expects to find a string derived from β next in the input
• Notes:
  – [X → α I aβ] means that a should follow. Then we can shift it and still have a viable prefix
  – [X → α I] means that we could reduce X
  – But this is not always a good idea!

LR(1) Items

• An LR(1) item is a pair:
      X → α I β, a
  – X → αβ is a production
  – a is a terminal (the lookahead terminal)
  – LR(1) means 1 lookahead terminal
• [X → α I β, a] describes a context of the parser
  – We are trying to find an X followed by an a, and
  – We have (at least) α already on top of the stack
  – Thus we need to see next a prefix derived from βa
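A concrete representation of LR(1) items might look like the following sketch (the field names are hypothetical; parser generators use their own encodings):

```python
# An LR(1) item [X -> alpha I beta, a] as a small immutable record.
from collections import namedtuple

# lhs: X;  rhs: the full production RHS;  dot: position of I;  la: lookahead a
Item = namedtuple('Item', 'lhs rhs dot la')

# The item [E -> E + I ( E ), +] from the slides:
it = Item('E', ('E', '+', '(', 'E', ')'), 2, '+')
print(it.rhs[:it.dot])   # ('E', '+')      -- already on the stack
print(it.rhs[it.dot:])   # ('(', 'E', ')') -- expected next, followed by la
```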


Note

• The symbol I was used before to separate the stack from the rest of the input
  – α I γ, where α is the stack and γ is the remaining string of terminals
• In items, I is used to mark a prefix of a production RHS:
      X → α I β, a
  – Here β might contain terminals as well
• In both cases the stack is on the left of I

Convention

• We add to our grammar a fresh new start symbol S and a production S → E
  – Where E is the old start symbol
• The initial parsing context contains:
      S → I E , $
  – Trying to find an S as a string derived from E$
  – The stack is empty

LR(1) Items (Cont.)

• In a context containing
      E → E + I ( E ) , +
  – If ( follows, then we can perform a shift to a context containing
      E → E + ( I E ) , +
• In a context containing
      E → E + ( E ) I , +
  – We can perform a reduction with E → E + ( E )
  – But only if a + follows

LR(1) Items (Cont.)

• Consider the item
      E → E + ( I E ) , +
• We expect a string derived from E ) +
• There are two productions for E:
      E → int  and  E → E + ( E )
• We describe this by extending the context with two more items:
      E → I int , )
      E → I E + ( E ) , )


The Closure Operation

• The operation of extending the context with items is called the closure operation

  Closure(Items) =
    repeat
      for each [X → α I Yβ, a] in Items
        for each production Y → γ
          for each b in First(βa)
            add [Y → I γ, b] to Items
    until Items is unchanged

Constructing the Parsing DFA (1)

• Construct the start context (for our grammar E → E + (E) | int):
      Closure({S → I E, $}) =
        S → I E , $
        E → I E+(E) , $
        E → I int , $
        E → I E+(E) , +
        E → I int , +
• We abbreviate as:
        S → I E , $
        E → I E+(E) , $/+
        E → I int , $/+

Constructing the Parsing DFA (2)

• A DFA state is a closed set of LR(1) items
• The start state contains [S → I E , $]
• A state that contains [X → α I, b] is labeled with “reduce with X → α on b”
• And now the transitions …

The DFA Transitions

• A state “State” that contains [X → α I yβ, b] has a transition labeled y to a state that contains the items “Transition(State, y)”
  – y can be a terminal or a non-terminal

  Transition(State, y)
    Items = ∅
    for each [X → α I yβ, b] in State
      add [X → αy I β, b] to Items
    return Closure(Items)
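Closure and Transition can be sketched directly from the two definitions above. This is a toy for the example grammar only; in particular, First is hard-coded (FIRST(E) = {int}), an assumption that holds for this grammar but is computed properly by a real generator.

```python
# Closure and Transition for S -> E, E -> int | E + ( E ).
# Items are (lhs, rhs, dot, lookahead) tuples.
GRAMMAR = {'S': [('E',)], 'E': [('int',), ('E', '+', '(', 'E', ')')]}

def first_of(symbols):
    # FIRST for this grammar only: FIRST(E) = {int}; a terminal is itself
    return {'int'} if symbols[0] in GRAMMAR else {symbols[0]}

def closure(items):
    items = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:       # [X -> a I Yb, la]
            for b in first_of(rhs[dot + 1:] + (la,)):    # b in First(b la)
                for gamma in GRAMMAR[rhs[dot]]:          # Y -> gamma
                    item = (rhs[dot], gamma, 0, b)
                    if item not in items:
                        items.add(item)
                        work.append(item)
    return frozenset(items)

def transition(state, y):
    # advance the dot past y in every item that expects y, then close
    moved = {(lhs, rhs, dot + 1, la)
             for (lhs, rhs, dot, la) in state
             if dot < len(rhs) and rhs[dot] == y}
    return closure(moved)

start = closure({('S', ('E',), 0, '$')})     # the 5 items of state 0
after_int = transition(start, 'int')          # state 1: E -> int I , $/+
print(len(start), len(after_int))             # 5 2
```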


Constructing the Parsing DFA: Example

  State 0:           S → I E , $        E → I E+(E) , $/+     E → I int , $/+
  State 1 (on int):  E → int I , $/+                          (E → int on $, +)
  State 2 (on E):    S → E I , $        E → E I +(E) , $/+    (accept on $)
  State 3 (on +):    E → E+ I (E) , $/+
  State 4 (on ():    E → E+( I E) , $/+   E → I E+(E) , )/+   E → I int , )/+
  State 5 (on int):  E → int I , )/+                          (E → int on ), +)
  State 6 (on E):    E → E+(E I ) , $/+   E → E I +(E) , )/+
  and so on…

LR Parsing Tables: Notes

• Parsing tables (i.e., the DFA) can be constructed automatically for a CFG
• But we still need to understand the construction to work with parser generators
  – E.g., they report errors in terms of sets of items
• What kind of errors can we expect?

Shift/Reduce Conflicts

• If a DFA state contains both
      [X → α I aβ, b]  and  [Y → γ I, a]
• Then on input “a” we could either
  – Shift into state [X → αa I β, b], or
  – Reduce with Y → γ
• This is called a shift-reduce conflict

Shift/Reduce Conflicts

• Typically due to ambiguities in the grammar
• Classic example: the dangling else
      S → if E then S | if E then S else S | OTHER
• Will have a DFA state containing
      [S → if E then S I, else]
      [S → if E then S I else S, x]
• If else follows, then we can shift or reduce
• The default (ML-yacc, etc.) is to shift
  – The default behavior is as needed in this case


More Shift/Reduce Conflicts

• Consider the ambiguous grammar
      E → E + E | E * E | int
• We will have the states containing
      [E → E * I E, +]            [E → E * E I, +]
      [E → I E + E, +]     ⇒E     [E → E I + E, +]
      …                           …
• Again we have a shift/reduce conflict on input +
  – We need to reduce (* binds more tightly than +)
  – Recall the solution: declare the precedence of * and +

More Shift/Reduce Conflicts

• In yacc, declare precedence and associativity:
      %left +
      %left *
• Precedence of a rule = that of its last terminal
  – See the yacc manual for ways to override this default
• Resolve a shift/reduce conflict with a shift if:
  – no precedence is declared for either the rule or the terminal
  – the input terminal has higher precedence than the rule
  – the precedences are the same and right-associative
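The three resolution rules above can be sketched as a small decision function (a simplification: real yacc also handles %nonassoc and reports unresolved conflicts):

```python
# Resolve a shift/reduce conflict from precedence declarations.
# rule_prec / term_prec: higher number = higher precedence, None = undeclared.
def resolve(rule_prec, rule_assoc, term_prec):
    if rule_prec is None or term_prec is None:
        return 'shift'       # no precedence declared: default to shift
    if term_prec > rule_prec:
        return 'shift'       # input terminal binds tighter than the rule
    if term_prec < rule_prec:
        return 'reduce'      # rule binds tighter than the input terminal
    # same precedence: associativity decides
    return 'shift' if rule_assoc == 'right' else 'reduce'

# %left +  %left *   =>  prec(+) = 1, prec(*) = 2, both left-associative
print(resolve(2, 'left', 1))   # rule E -> E * E vs input '+': reduce
print(resolve(1, 'left', 1))   # rule E -> E + E vs input '+': reduce
```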

Using Precedence to Solve S/R Conflicts

• Back to our example:
      [E → E * I E, +]            [E → E * E I, +]
      [E → I E + E, +]     ⇒E     [E → E I + E, +]
      …                           …
• Will choose reduce because the precedence of the rule E → E * E is higher than that of the terminal +

Using Precedence to Solve S/R Conflicts

• Same grammar as before
      E → E + E | E * E | int
• We will also have the states
      [E → E + I E, +]            [E → E + E I, +]
      [E → I E + E, +]     ⇒E     [E → E I + E, +]
      …                           …
• Now we also have a shift/reduce conflict on input +
  – We choose reduce because E → E + E and + have the same precedence and + is left-associative


Using Precedence to Solve S/R Conflicts

• Back to our dangling else example
      [S → if E then S I, else]
      [S → if E then S I else S, x]
• Can eliminate the conflict by declaring else with higher precedence than then
• But this starts to look like “hacking the tables”
• Best to avoid overuse of precedence declarations, or we will end up with unexpected parse trees

Precedence Declarations Revisited

• The term “precedence declaration” is misleading!
• These declarations do not define precedence: they define conflict resolutions
• I.e., they instruct shift-reduce parsers to resolve conflicts in certain ways
• The two are not quite the same thing!

Reduce/Reduce Conflicts

• If a DFA state contains both
      [X → α I, a]  and  [Y → β I, a]
  – Then on input “a” we don’t know which production to reduce
• This is called a reduce/reduce conflict

Reduce/Reduce Conflicts

• Usually due to gross ambiguity in the grammar
• Example: a sequence of identifiers
      S → ε | id | id S
• There are two parse trees for the string id
      S → id
      S → id S → id
• How does this confuse the parser?
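Both kinds of conflict can be detected mechanically in a state. A sketch, with items represented as (lhs, rhs, dot, lookahead) tuples:

```python
# Detect shift/reduce and reduce/reduce conflicts in one parser state.
def conflicts(state):
    # symbols some item can shift next
    shift_syms = {rhs[dot] for (_x, rhs, dot, _a) in state if dot < len(rhs)}
    # lookaheads of complete items (one per possible reduction)
    reduce_las = [la for (_x, rhs, dot, la) in state if dot == len(rhs)]
    sr = shift_syms & set(reduce_las)                           # shift/reduce
    rr = {la for la in reduce_las if reduce_las.count(la) > 1}  # reduce/reduce
    return sr, rr

# [X -> a I b, $] and [Y -> a I, b]: shift/reduce conflict on 'b'
print(conflicts({('X', ('a', 'b'), 1, '$'), ('Y', ('a',), 1, 'b')}))
# [X -> a I, $] and [Y -> b I, $]: reduce/reduce conflict on '$'
print(conflicts({('X', ('a',), 1, '$'), ('Y', ('b',), 1, '$')}))
```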


More on Reduce/Reduce Conflicts

• Consider the states
      [S’ → I S, $]               [S → id I, $]
      [S → I, $]                  [S → id I S, $]
      [S → I id, $]        ⇒id    [S → I, $]
      [S → I id S, $]             [S → I id, $]
                                  [S → I id S, $]
• Reduce/reduce conflict on input $
      S’ → S → id
      S’ → S → id S → id
• Better to rewrite the grammar as: S → ε | id S

Using Parser Generators

• Parser generators automatically construct the parsing DFA given a CFG
  – They use precedence declarations and default conventions to resolve conflicts
  – The parser algorithm is the same for all grammars (and is provided as a library function)
• But most parser generators do not construct the DFA as described before
  – Because the LR(1) parsing DFA has 1000s of states even for a simple language

LR(1) Parsing Tables are Big

• But many states are similar, e.g.
      state 1:  E → int I , $/+    (E → int on $, +)
      state 5:  E → int I , )/+    (E → int on ), +)
• Idea: merge the DFA states whose items differ only in the lookahead tokens
  – We say that such states have the same core
• We obtain
      state 1’:  E → int I , $/+/)    (E → int on $, +, ))

The Core of a Set of LR Items

• Definition: the core of a set of LR items is the set of first components
  – Without the lookahead terminals
• Example: the core of
      {[X → α I β, b], [Y → γ I δ, d]}
  is
      {X → α I β, Y → γ I δ}


LALR States

• Consider for example the LR(1) states
      {[X → α I, a], [Y → β I, c]}
      {[X → α I, b], [Y → β I, d]}
• They have the same core and can be merged
• The merged state contains:
      {[X → α I, a/b], [Y → β I, c/d]}
• These are called LALR(1) states
  – Stands for LookAhead LR
  – Typically 10 times fewer LALR(1) states than LR(1)

A LALR(1) DFA

• Repeat until all states have distinct cores
  – Choose two distinct states with the same core
  – Merge the states by creating a new one with the union of all the items
  – Point edges from predecessors to the new state
  – The new state points to all the previous successors

  [Diagram: states A, B, C, D, E, F; B and E have the same core and are merged into a single state BE]
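The core computation and the merge loop can be sketched as follows (items as (lhs, rhs, dot, lookahead) tuples; grouping by core replaces the pairwise merging described above and yields the same result):

```python
def core(state):
    # drop the lookahead component of every item in the state
    return frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _la) in state)

def merge_lalr(states):
    # group LR(1) states by core and union the items in each group
    merged = {}
    for st in states:
        merged.setdefault(core(st), set()).update(st)
    return list(merged.values())

# states 1 and 5 from the example: E -> int I with lookaheads $/+ and )/+
s1 = {('E', ('int',), 1, '$'), ('E', ('int',), 1, '+')}
s5 = {('E', ('int',), 1, ')'), ('E', ('int',), 1, '+')}
merged = merge_lalr([s1, s5])
print(len(merged))      # 1 -- same core, merged into a single state
print(len(merged[0]))   # 3 -- lookaheads $, +, )
```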

Conversion LR(1) to LALR(1): Example

  [Diagram: the 12-state LR(1) DFA for E → int | E + (E) collapses to an LALR(1) DFA by merging states with equal cores: 1 and 5 (E → int, now on $, +, )), 3 and 8, 4 and 9, 6 and 10, and 7 and 11 (E → E + (E), now on $, +, ))]

The LALR Parser Can Have Conflicts

• Consider for example the LR(1) states
      {[X → α I, a], [Y → β I, b]}
      {[X → α I, b], [Y → β I, a]}
• The merged LALR(1) state
      {[X → α I, a/b], [Y → β I, a/b]}
  has a new reduce/reduce conflict
• In practice such cases are rare

LALR vs. LR Parsing: Things to Keep in Mind

• LALR languages are not natural
  – They are an efficiency hack on LR languages
• Any reasonable programming language has a LALR(1) grammar
• LALR(1) parsing has become a standard for programming languages and for parser generators

A Hierarchy of Grammar Classes

  [Diagram: containment of grammar classes, from Andrew Appel, “Modern Compiler Implementation in ML”]

Semantic Actions in LR Parsing

• We can now illustrate how semantic actions are implemented for LR parsing
• Keep attributes on the stack
• On shifting a, push the attribute for a on the stack
• On reduce X → α
  – pop the attributes for α
  – compute the attribute for X
  – and push it on the stack

Performing Semantic Actions: Example

• Recall the example
      E → T + E1     { E.val = T.val + E1.val }
        | T          { E.val = T.val }
      T → int * T1   { T.val = int.val * T1.val }
        | int        { T.val = int.val }
• Consider the parsing of the string 4 * 9 + 6


Performing Semantic Actions: Example

Parsing 4 * 9 + 6:

  I int * int + int          shift
  int4 I * int + int         shift
  int4 * I int + int         shift
  int4 * int9 I + int        reduce T → int
  int4 * T9 I + int          reduce T → int * T
  T36 I + int                shift
  T36 + I int                shift
  T36 + int6 I               reduce T → int
  T36 + T6 I                 reduce E → T
  T36 + E6 I                 reduce E → T + E
  E42 I                      accept

Notes

• The previous example shows how synthesized attributes are computed by LR parsers
• It is also possible to compute inherited attributes in an LR parser
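Replaying the trace above with an explicit attribute stack; the shift/reduce helpers are hypothetical names for this sketch, and the reduce sequence is hard-wired to follow the trace rather than driven by a parsing table:

```python
stack = []   # entries are (symbol, attribute) pairs, as on the slides

def shift(sym, val=None):
    stack.append((sym, val))

def reduce(n, lhs, action):
    # pop the attributes for the RHS, compute the LHS attribute, push it
    rhs_vals = [stack.pop()[1] for _ in range(n)][::-1]
    stack.append((lhs, action(rhs_vals)))

# parse 4 * 9 + 6
shift('int', 4); shift('*'); shift('int', 9)
reduce(1, 'T', lambda v: v[0])            # T -> int        T.val = 9
reduce(3, 'T', lambda v: v[0] * v[2])     # T -> int * T    T.val = 36
shift('+'); shift('int', 6)
reduce(1, 'T', lambda v: v[0])            # T -> int        T.val = 6
reduce(1, 'E', lambda v: v[0])            # E -> T          E.val = 6
reduce(3, 'E', lambda v: v[0] + v[2])     # E -> T + E      E.val = 42
print(stack)                              # [('E', 42)]
```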

Notes on Parsing

• Parsing
  – A solid foundation: context-free grammars
  – A simple parser: LL(1)
  – A more powerful parser: LR(1)
  – An efficiency hack: LALR(1)
  – LALR(1) parser generators
• Next time we move on to semantic analysis

Supplement to LR Parsing

Strange Reduce/Reduce Conflicts due to LALR Conversion (and how to handle them)


Strange Reduce/Reduce Conflicts

• Consider the grammar
      S  → P R ,
      NL → N | N , NL
      P  → T | NL : T
      R  → T | N : T
      N  → id
      T  → id
  – P: parameters specification
  – R: result specification
  – N: a parameter or result name
  – T: a type name
  – NL: a list of names

Strange Reduce/Reduce Conflicts

• In P an id is a
  – N when followed by , or :
  – T when followed by id
• In R an id is a
  – N when followed by :
  – T when followed by ,
• This is an LR(1) grammar
• But it is not LALR(1). Why?
  – For obscure reasons

A Few LR(1) States

  State 1:                        State 2:
      P  → I T          id            R → I T          ,
      P  → I NL : T     id            R → I N : T      ,
      NL → I N          :             T → I id         ,
      NL → I N , NL     :             N → I id         :
      N  → I id         :
      N  → I id         ,
      T  → I id         id

  On id, state 1 goes to state 3 and state 2 goes to state 4:

  State 3:                        State 4:
      T → id I          id            T → id I         ,
      N → id I          :             N → id I         :
      N → id I          ,

  States 3 and 4 have the same core, so the LALR merge produces
      T → id I          id/,
      N → id I          :/,
  — an LALR reduce/reduce conflict on “,”

What Happened?

• Two distinct states were confused because they have the same core
• Fix: add dummy productions to distinguish the two confused states
• E.g., add
      R → id bogus
  – bogus is a terminal not used by the lexer
  – This production will never be used during parsing
  – But it distinguishes R from P


A Few LR(1) States After Fix

  State 1:                        State 2:
      P  → I T          id            R → I T          ,
      P  → I NL : T     id            R → I N : T      ,
      NL → I N          :             R → I id bogus   ,
      NL → I N , NL     :             T → I id         ,
      N  → I id         :             N → I id         :
      N  → I id         ,
      T  → I id         id

  On id:

  State 3:                        State 4:
      T → id I          id            T → id I         ,
      N → id I          :             N → id I         :
      N → id I          ,             R → id I bogus   ,

  Different cores ⇒ no LALR merging