Outline

• Review of SLR LR Parsing • Limits of SLR parsing • LR parsing Lecture Notes by • LALR parsing Profs Aiken and Necula (UCB) • Implementation of semantic actions

• Using parser generators

CS780(Prasad) L16LR 1 CS780(Prasad) L16LR 2

Review of SLR(1) Parsing LR Parsing Algorithm

• LR parser maintains a stack Let I = w$ be initial input Let j = 0 á sym1, state1 ñ . . . á symn, staten ñ Let DFA state 1 have item S’ ® .S staten is the final state of the DFA on sym1 … symn Let stack = á dummy, 1 ñ • Goto table: the transition function of the DFA repeat A – Goto[i,A] = j if statei ® statej case action[top_state(stack), I[j]] of • Action table: for each state and terminal: shift k: push á I[j++], k ñ reduce X ® A: Shift j pop |A| pairs, Reduce X ® a push áX, Goto[X,top_state(stack)]ñ Action[i, a] = Accept accept: halt normally Error error: halt and report error CS780(Prasad) L16LR 3 CS780(Prasad) L16LR 4

Review. Items SLR(1) Action Table

• An item [X ® a.b] says that For each state si and terminal a

– the parser is looking for an X – If si has item X ® a.ab and Goto[i,a] = j then – it has an a on top of the stack Action[i,a] = shift j – Expects to find a string derived from b next in the input – If si has item S’ ® S. then Action[i,$] = accept • Notes: – [X ® a.ab] means that a should follow. Then we can – If si has item X ® a. and a Î Follow(X) and X ¹ S’ shift it and still have a viable prefix. then Action[i,a] = reduce X ® a – [X ®a.] means that we could reduce X • But this is not always a good idea ! – Otherwise, Action[i,a] = error

CS780(Prasad) L16LR 5 CS780(Prasad) L16LR 6

1 Limits of SLR Parsing Limits of SLR Parsing (cont.)

• SLR(1) is the simplest LR parsing method • Consider two states of the DFA for • SLR(1) is almost powerful enough, but … recognizing viable prefixes • … some common programming language ® ® constructs are not SLR(1). S’ . S S L . = E L • Consider the grammar S ® . L = E Þ E ® L . ® S ® L = E | E S . E SLR(1) parser on input “=“ ® L ® * E | id L . * E • shift (item L . = E ) L ® . id • reduce by E ® L E ® L (since “=“ Î Follow(E)) E ® . L

CS780(Prasad) L16LR 7 CS780(Prasad) L16LR 8

What’s The Problem? What’s The Problem? (Cont.)

• The grammar is not SLR(1), but why? • Problem: the SLR table has too many reduce • Focus on the reduce move in the second state actions. – We are in the context of S ® E ® L – Using Follow is too coarse. – No = can follow E in this context • In any given context, only some elements of – Even though = Î Follow(E) (in S ® L = E ® *E = E) Follow can actually follow a non-terminal. – The reduce move should not happen if an = follows • For example: in this context. Follow(E) = {=, $}, but In context S ® E only $ can follow E

In context S ® L = E ® * E1 = E only = can follow E1

CS780(Prasad) L16LR 9 CS780(Prasad) L16LR 10

One Way to Fix The Problem: LR(1) Items LR(1) Items. Intuition

• Idea: • [X ® a .b, a] describes a state of the parser: – refine Follow based on context. – We are trying to find an X, and – The context is described through items. – We have a already on top of the stack, and • An LR(1) item is a pair – We expect to see a prefix derived from ba [X ® a .b, a] ® a where X ® ab is a production and a is the • Back to reduce actions: have an [X ., a] lookahead token or $ – Perform the reduce only if next token is a ! – Will have fewer reduce actions • LR(k) is similar but with k tokens of lookahead – Not for all b Î Follow(X) – In practice, k = 1

CS780(Prasad) L16LR 11 CS780(Prasad) L16LR 12

2 Constructing Sets of LR(1) Items (1) Constructing Sets of LR(1) Items (2)

• Similar to construction for LR(0). 1. For each LR(1) item [Y ® a.Xb, a] Add an X-transition [Y ® a.Xb, a] ®X [Y ® aX.b, a] • The states of the NFA are the LR(1) items of G. 2. For each LR(1) item [Y ® a.Xb, a] • The start state is [S’ ® . S, $] For each production X ® g For each terminal b Î First(ba) Add an e transition [Y ® a.Xb, a] ®e [X ® .g, b]

CS780(Prasad) L16LR 13 CS780(Prasad) L16LR 14

NFA for Viable Prefixes in Detail (1) NFA for Viable Prefixes in Detail (2)

S’ ® S . $ S’ ® S . $ L ® . id = S S e S ® . L = E $ S ® . L = E $ L e e e S’ ® . S $ S’ ® . S $ S ® L . = E $ L ® . * E = e e

S ® . E $ S ® . E $

CS780(Prasad) L16LR 15 CS780(Prasad) L16LR 16

NFA for Viable Prefixes in Detail (3) NFA for Viable Prefixes in Detail (4)

S’ ® S . $ S’ ® S . $ L ® . id = L ® . id = S e S e S ® . L = E $ L S ® . L = E $ L e e e e S’ ® . S $ S ® L . = E $ S’ ® . S $ S ® L . = E $ L ® . * E = L ® . * E = L e e E ® L . $ e E ® . L $ e E ® . L $ e S ® . E $ S ® . E $ e E E L ® . * E $ L ® . id $ CS780(Prasad) L16LR 17 CS780(Prasad) L16LR 18

3 An Example Revisited Constructing LR(1) Parsing Tables

• Consider the state from last slide 1. Add a dummy S’ ® S production

S ® L . = E $ 2. Construct the NFA of LR(1) items as before 3. Convert the NFA into a DFA E ® L . $ 4. Goto is defined exactly as before:

A Goto[i, A] = j if statei ® statej • LR(1) parser on input “ =“ (the transition function of the DFA) • only shift (item L . = E )

CS780(Prasad) L16LR 19 CS780(Prasad) L16LR 20

Constructing LR(1) Parsing Tables (Cont.) LALR Parsing

5. For each state si of the DFA and terminal a • Two bottom-up parsing methods: SLR and LR

– If si has item [X ® a.ab, c] and Goto[i, a] = j then action[i,a] = shift j • Which one we use? Neither – SLR is not powerful enough. – If si has item [X ® a., a] and X ¹ S’ then action[i,a] = reduce X ® a – LR parsing tables are too big (1000’s of states vs. 100’s of states for SLR). – If si has item [S’ ® S., $] then action[i,$] = accept • In practice, use LALR(1) – Otherwise, – Stands for Look-Ahead LR action[i,a] = error – A compromise between SLR(1) and LR(1) • LR(1) grammar Û action[i,a] uniquely defined CS780(Prasad) L16LR 21 CS780(Prasad) L16LR 22

LALR Parsing (Cont.) The Core of a Set of LR Item

• Rough intuition: A LALR(1) parser for G has • Definition: The core of a set of LR items is – The number of states of an SLR parser. the set of first components. – Some of the lookahead discrimination of LR(1). • Example: the core of • Idea: construct the DFA for the LR(1). { [X ® a.b, b], [Y ® g.d, d]} • Then merge the DFA states whose items is ® a b ® g d differ only in the lookahead tokens {X . , Y . } – We say that such states have the same core. • The core of an LR item is an LR(0) item.

CS780(Prasad) L16LR 23 CS780(Prasad) L16LR 24

4 A LALR(1) DFA The LALR Parser Can Have Conflicts

• Repeat until all states have distinct core. • Consider for example the LR(1) states – Choose two distinct states with same core. {[X ® a., a], [Y ® b., b]} – Merge the states by creating a new one with the {[X ® a., b], [Y ® b., a]} union of all the items. – Point edges from predecessors to new state. • And the merged LALR(1) state – New state points to all the previous successors. {[X ® a., a/b], [Y ® b., a/b]} • Has a new reduce-reduce conflict. A B C A C BE • In practice such cases are rare. D E F D F

CS780(Prasad) L16LR 25 CS780(Prasad) L16LR 26

LALR vs. LR Parsing A Hierarchy of Grammar Classes

• LALR languages are not natural. – They are an efficiency hack on LR languages

• Any reasonable programming language has an LALR(1) grammar.

• LALR(1) has become a standard for programming languages and for parser generators.

CS780(Prasad) L16LR 27 CS780(Prasad) L16LR 28

Semantic Actions Performing Semantic Actions. Example

• We can now illustrate how semantic actions • Recall the example from earlier lecture are implemented for LR parsing. • Keep attributes on the stack. E ® T + E1 { E.val = T.val + E1.val } | T { E.val = T.val } • On shift a, push attribute for a on stack. T ® int * T1 { T.val = int.val * T1.val } • On reduce X ® a | int { T.val = int.val } – pop attributes for a – compute attribute for X • Consider the parsing of the string 3 * 5 + 8 – and push it on the stack

CS780(Prasad) L16LR 29 CS780(Prasad) L16LR 30

5 Performing Semantic Actions. Example Notes

| int * int + int shift • The previous discussion shows how int3 | * int + int shift synthesized attributes are computed by LR int3 * | int + int shift parsers. int3 * int5 | + int reduce T ® int int3 * T5 | + int reduce T ® int * T T15 | + int shift • It is also possible to compute inherited T15 + | int shift attributes in an LR parser. T15 + int8 | reduce T ® int T15 + T8 | reduce E ® T T15 + E8 | reduce E ® T + E E23 | accept

CS780(Prasad) L16LR 31 CS780(Prasad) L16LR 32

Using Parser Generators Shift/Reduce Conflicts

• Most common parser generators are LALR(1). • Typically due to ambiguities in the grammar. • A parser generator constructs a LALR(1) table. • Classic example: the dangling else • And reports an error when a table entry is multiply S ® if E then S | if E then S else S | OTHER defined: • Will have DFA state containing – A shift and a reduce. Called shift/reduce conflict [S ® if E then S., else] – Multiple reduces. Called reduce/reduce conflict [S ® if E then S. else S, x] • An will generate conflicts. • if else follows, then we can shift or reduce • What do we do in that case? • Default (bison, CUP, etc.) is to shift – Default behavior is as needed in this case.

CS780(Prasad) L16LR 33 CS780(Prasad) L16LR 34

More Shift/Reduce Conflicts More Shift/Reduce Conflicts

• Consider the ambiguous grammar • In bison, declare precedence and associativity: E ® E + E | E * E | int %left + %left * • We will have the states containing [E ® E * . E, +] [E ® E * E., +] • Precedence of a rule = that of its last terminal [E ® . E + E, +] ÞE [E ® E . + E, +] – See bison manual for ways to override this default. … … • Resolve shift/reduce conflict with a shift if: • Again we have a shift/reduce on input + – no precedence declared for either rule or terminal – We need to reduce (* binds more tightly that =) – input terminal has higher precedence than the rule – Recall solution: declare the precedence of * and = – the precedences are the same and right associative

CS780(Prasad) L16LR 35 CS780(Prasad) L16LR 36

6 Using Precedence to Solve S/R Conflicts Using Precedence to Solve S/R Conflicts

• Back to our example: • Same grammar as before [E ® E * . E, +] [E ®E * E., +] E ® E + E | E * E | int [E ® . E + E, +] ÞE [E ®E . + E, +] • We will also have the states … … [E ® E + . E, +] [E ® E + E., +] [E ® . E + E, +] ÞE [E ® E . + E, +] • Will choose reduce because precedence of … … rule E ® E * E is higher than of terminal + • Now we also have an S/R conflict on input + – We choose reduce because E ® E + E and + have the same precedence and + is left-associative.

CS780(Prasad) L16LR 37 CS780(Prasad) L16LR 38

Using Precedence to Solve S/R Conflicts Reduce/Reduce Conflicts

• Back to our dangling else example • Usually due to gross ambiguity in the grammar [S ® if E then S., else] • Example: a sequence of identifiers [S ® if E then S. else S, x] S ® e | id | id S • Can eliminate conflict by declaring else with higher precedence than then. • There are two parse trees for the string id • But this starts to look like “hacking the tables”. S ® id • Best to avoid overuse of precedence S ® id S ® id declarations, or you’ll end with unexpected • How does this confuse the parser? parse trees.

CS780(Prasad) L16LR 39 CS780(Prasad) L16LR 40

More on Reduce/Reduce Conflicts Strange Reduce/Reduce Conflicts

• Consider the states [S ® id ., $] • Consider the grammar [S’ ® . S, $] [S ® id . S, $] S ® P R , NL ® N | N , NL [S ® ., $] Þid [S ® ., $] P ® T | NL : T R ® T | N : T [S ® . id, $] [S ® . id, $] N ® id T ® id [S ® . id S, $] [S ® . id S, $] • P - parameters specification • Reduce/reduce conflict on input $ • R - result specification S’ ® S ® id • N - a parameter or result name S’ ® S ® id S ® id • T - a type name • Better rewrite the grammar: S ® e | id S • NL - a list of names

CS780(Prasad) L16LR 41 CS780(Prasad) L16LR 42

7 Strange Reduce/Reduce Conflicts A Few LR(1) States

P ® . T id 1 ® • In P an id is a P ® . NL : T id T id . id 3 LALR reduce/reduce ® conflict on “,” – N when followed by , or : NL ® . N : id N id . : ® – T when followed by id NL ® . N , NL : N id . , • In R an id is a N ® . id : – N when followed by : N ® . id , T ® id . id/, LALR merge – T when followed by , T ® . id id N ® id . :/, • This is an LR(1) grammar. R ® . T , 2 T ® id . , 4 • But it is not LALR(1). Why? id ® ® – For obscure reasons R . N : T , N id . : T ® . id , CS780(Prasad) L16LR 43 NCS780(Prasad) ® . id : L16LR 44

What Happened? A Few LR(1) States After Fix

P ® . T id 1 • Two distinct states were confused because P ® . NL : T id T ® id . id 3 they have the same core. NL ® . N : id N ® id . : • Fix: add dummy productions to distinguish the NL ® . N , NL : N ® id . , two confused states. N ® . id : • E.g., add N ® . id , Different cores Þ no LALR merging R ® id bogus T ® . id id – bogus is a terminal not used by the lexer. R ® . T , T ® id . , 4 ® – This production will never be used during parsing. R ® . N : T , 2 N id . : id R ® id . bogus , – But it distinguishes R from P. R ® . id bogus , T ® . id , CS780(Prasad) L16LR 45 NCS780(Prasad) ® . id : L16LR 46

Notes on Parsing

• Parsing – A solid foundation: context-free grammars – A simple parser: LL(1) – A more powerful parser: LR(1) – An efficiency hack: LALR(1) – LALR(1) parser generators

• Next time we move on to semantic analysis.

CS780(Prasad) L16LR 47

8