Compilation 0368-3133

Lecture 4: Syntax Analysis: Earley Bottom-Up Parsing

Noam Rinetzky

1 The Real Anatomy of a Compiler

[Compiler pipeline diagram: source text (characters) → Lexical Analysis (tokens) → Syntax Analysis (AST) → Semantic Analysis (annotated AST) → Intermediate code generation (IR) → IR optimization → Code generation (symbolic instructions) → Target/machine-code optimization → write executable → executable output code]

2 Broad kinds of parsers

• Parsers for arbitrary grammars
  – Earley's method, CYK method
  – Usually not used in practice (though this might change)
• Top-Down parsers
  – Construct the parse tree in a top-down manner
  – Find the leftmost derivation
• Bottom-Up parsers
  – Construct the parse tree in a bottom-up manner
  – Find the rightmost derivation, in reverse order

3 Intuition: Top-down parsing

• Begin with the start symbol
• Guess the productions
• Check whether the parse tree yields the user's program

4 Intuition: Top-down parsing

Unambiguous grammar:
E → E + T
E → T
T → T * F
T → F
F → id
F → num
F → ( E )

[Parse tree for the input 1 * 2 + 3]

1 * 2 + 3

5 Intuition: Top-down parsing

[Same unambiguous grammar; the left-most derivation is used]

1 * 2 + 3

6 Intuition: Top-down parsing

[Same grammar; begin with the start symbol E and guess a production]

1 * 2 + 3

7 Intuition: Top-down parsing

[Same grammar; the parse tree for 1 * 2 + 3 grows as productions are guessed]

1 * 2 + 3

8 Intuition: Top-Down parsing

Unambiguous grammar:
E → E + T
E → T
T → T * F
T → F
F → id
F → num
F → ( E )

[Parse tree for 1 * 2 + 3 under construction]

1 * 2 + 3

9 Intuition: Top-Down parsing

[Same grammar; next step of the top-down construction for 1 * 2 + 3]

1 * 2 + 3

10 Intuition: Top-Down parsing

[Same grammar; next step of the top-down construction for 1 * 2 + 3]

1 * 2 + 3

11 Intuition: Top-Down parsing

[Same grammar; next step of the top-down construction for 1 * 2 + 3]

1 * 2 + 3

12 Intuition: Top-Down parsing

[Same grammar; next step of the top-down construction for 1 * 2 + 3]

1 * 2 + 3

13 Intuition: Top-Down parsing

[Same grammar; the completed parse tree for 1 * 2 + 3]

1 * 2 + 3

14 Ambiguity

Ambiguous grammar:
E → E + E
E → E * E
E → id
E → num
E → ( E )

[Partial parse tree for 1 * 2 + 3]

1 * 2 + 3

15 Ambiguity

[Same ambiguous grammar; partial parse tree for 1 * 2 + 3]

1 * 2 + 3

16 Ambiguity

[Same ambiguous grammar; two distinct parse trees for 1 * 2 + 3: the grammar is ambiguous]

1 * 2 + 3

17 Earley Parsing

Jay Earley, PhD

18 Earley Parsing

• Invented by Jay Earley [PhD, 1968]

• Handles arbitrary context-free grammars
  – Can handle ambiguous grammars

• Complexity O(N³), where N = |input|

• Uses:
  – Compactly encodes ambiguity

19 Dynamic programming

• Break a problem P into sub-problems P1…Pk

  – Solve P by combining solutions for P1…Pk
  – Remember solutions to sub-problems instead of re-computing them
• Example: Bellman-Ford shortest path
  – Sol(x,y,i) = minimum of
    • Sol(x,y,i-1)
    • Sol(t,y,i-1) + weight(x,t) for edges (x,t)
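To make the memoization idea concrete, here is a small Python sketch (mine, not from the slides) of the Bellman-Ford recurrence above; the example graph and the names WEIGHT and sol are illustrative assumptions.

import math
from functools import lru_cache

# hypothetical example graph: edge (u, v) -> weight
WEIGHT = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 10}

@lru_cache(maxsize=None)            # remember solutions to sub-problems
def sol(x, y, i):
    # shortest x -> y path weight using at most i edges
    if x == y:
        return 0
    if i == 0:
        return math.inf
    best = sol(x, y, i - 1)                      # Sol(x, y, i-1)
    for (u, t), w in WEIGHT.items():             # Sol(t, y, i-1) + weight(x, t)
        if u == x:
            best = min(best, w + sol(t, y, i - 1))
    return best

print(sol("a", "c", 2))   # 3: a -> b -> c
print(sol("a", "c", 1))   # 10: only the direct edge fits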

20 Earley Parsing

• Dynamic programming implementation of a recursive descent parser
  – S[N+1]: a sequence of sets of "Earley states"
    • N = |INPUT|
    • An Earley state (item) s is a sentential form + auxiliary info
  – S[i]: all parse trees that can be produced (by an RDP) after reading the first i tokens
    • S[i+1] is built using S[0] … S[i]

21 Earley Parsing

• Parses arbitrary grammars in O(|input|³)
  – O(|input|²) for unambiguous grammars
  – Linear for most LR(k) languages
• Dynamic programming implementation of a recursive descent parser
  – S[N+1]: a sequence of sets of "Earley states"
    • N = |INPUT|
    • An Earley state is a sentential form + auxiliary info
  – S[i]: all parse trees that can be produced (by an RDP) after reading the first i tokens
    • S[i+1] is built using S[0] … S[i]

22 Earley States

• s = < constituent, back >
  – constituent (dotted rule) for A → αβ:
    A → •αβ   predicted constituents
    A → α•β   in-progress constituents
    A → αβ•   completed constituents
  – back: the previous Earley state in the derivation

23 Earley States

• s = < constituent, back >
  – constituent (dotted rule) for A → αβ:
    A → •αβ   predicted constituents
    A → α•β   in-progress constituents
    A → αβ•   completed constituents
  – back: the previous Earley state in the derivation
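A possible concrete encoding of such a state, as a Python sketch (mine, not the course's code); the class name EarleyItem and its fields are assumptions.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class EarleyItem:
    lhs: str                 # A
    rhs: Tuple[str, ...]     # alpha beta
    dot: int                 # position of the dot
    back: int                # index of the Earley set where this constituent started

    def is_predicted(self):  # A -> . alpha beta
        return self.dot == 0

    def is_completed(self):  # A -> alpha beta .
        return self.dot == len(self.rhs)

item = EarleyItem("E", ("E", "+", "T"), 1, 0)    # E -> E . + T, started at set 0
print(item.is_predicted(), item.is_completed())  # False False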

24 Earley Parser

Input = x[1…N]
S[0] = { <S' → •S, 0> };  S[1] = … = S[N] = {}
for i = 0 … N do
  until S[i] does not change do
    foreach s ∈ S[i]
      if s = <A → α•aβ, b> and a = x[i+1] then              // scan
        S[i+1] = S[i+1] ∪ { <A → αa•β, b> }
      if s = <A → α•Xβ, b> and X → γ is a production then    // predict
        S[i] = S[i] ∪ { <X → •γ, i> }
      if s = <A → γ•, b> and <B → α•Aβ, k> ∈ S[b] then       // complete
        S[i] = S[i] ∪ { <B → αA•β, k> }
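Below is a small runnable Python sketch (mine, not the course's implementation) of the recognizer above. Items are plain tuples (lhs, rhs, dot, back); the grammar encoding, the names GRAMMAR and recognize, and the synthetic start symbol "$START" are assumptions. It follows the scan/predict/complete rules literally and assumes a grammar without ε-rules (the ε-rule subtlety is discussed on the next slides).

GRAMMAR = {                      # hypothetical example grammar
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["num"]],
}

def recognize(tokens, start="E"):
    n = len(tokens)
    S = [set() for _ in range(n + 1)]
    S[0] = {("$START", (start,), 0, 0)}          # <S' -> .S, 0>
    for i in range(n + 1):
        changed = True
        while changed:                           # "until S[i] does not change"
            changed = False
            for (lhs, rhs, dot, b) in list(S[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in GRAMMAR:           # predict
                        for alt in GRAMMAR[sym]:
                            item = (sym, tuple(alt), 0, i)
                            if item not in S[i]:
                                S[i].add(item); changed = True
                    elif i < n and tokens[i] == sym:   # scan
                        S[i + 1].add((lhs, rhs, dot + 1, b))
                else:                            # complete
                    for (l2, r2, d2, b2) in list(S[b]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, b2)
                            if item not in S[i]:
                                S[i].add(item); changed = True
    return ("$START", (start,), 1, 0) in S[n]

print(recognize(["num", "*", "num", "+", "num"]))   # True
print(recognize(["num", "+", "*"]))                 # False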

25 Example

[Embedded page (p. 621) from J. Aycock and R. N. Horspool, "Practical Earley Parsing", The Computer Journal 45(6), 2002:
Figure 1 shows the Earley sets for the grammar E → E + E | n and the input n+n; items in bold are the ones corresponding to the input's derivation.
Figure 2 shows an unadulterated Earley parser, representing sets as lists, rejecting the valid input a; missing items in S0 sound the death knell for this parse.
The scan/predict/complete rules of the previous slide are shown alongside the figures.]

26 Example (cont.)

[Embedded paper text, summarized: when COMPLETER processes an item [A → •, j] that corresponds to an ε-rule A → ε, j is always equal to i, so it must look through the partially constructed set Si. If an item [B → …•A…, k] is added to Si after COMPLETER has processed [A → •, j], the advanced item [B → …A•…, k] is never added, which prunes potential derivation paths and can cause correct input to be rejected (Figure 2). Two known fixes: keep running PREDICTOR and COMPLETER in turn until neither has anything more to add (Grune & Jacobs; Aho & Ullman; used by the ACCENT compiler-compiler), or Earley's suggestion of recording which non-terminals COMPLETER still needs to watch for. The paper's solution: precompute which non-terminals are nullable (A ⇒* ε) and modify PREDICTOR so that, when processing [A → …•B…, j], it adds [B → •α, i] for all rules B → α and, if B is nullable, also adds [A → …B•…, j] to Si.]

27 Broad kinds of parsers

• Parsers for arbitrary grammars
  – Earley's method, CYK method
  – Usually not used in practice (though this might change)
• Top-Down parsers
  – Construct the parse tree in a top-down manner
  – Find the leftmost derivation
• Bottom-Up parsers
  – Construct the parse tree in a bottom-up manner
  – Find the rightmost derivation, in reverse order

28 Bottom-Up Parsing

• Goal: build a parse tree
  – Report an error if the input is not a legal program

• How:
  – Read the input left-to-right
  – Construct a subtree for the first (left-most) tree node whose children have all been constructed

29 Bottom-up parsing

E → E + T
E → T
T → T * F
T → F
F → id
F → num
F → ( E )

[Bottom-up construction of the parse tree for 1 * 2 + 3]

1 * 2 + 3

31 Bottom-up parsing

[Same grammar; next step of the bottom-up construction for 1 * 2 + 3]

1 * 2 + 3

32 Bottom-up parsing: LR(k) Grammars

• A grammar is in the class LR(k) when it can be derived via:
  – Bottom-up derivation
  – Scanning the input from left to right (L)
  – Producing the rightmost derivation (R)
    • In reverse order
  – With a lookahead of k tokens (k)
• A language is said to be LR(k) if it has an LR(k) grammar
• The simplest case is LR(0), which we will discuss

33 Terminology: Reductions & Handles

• The opposite of derivation is called reduction
  – Let A → α be a production rule
  – Derivation: βAµ ⇒ βαµ
  – Reduction: βαµ ⇒ βAµ

• A handle is the reduced substring
  – α is the handle of βαµ

34 Use Shift & Reduce

In each stage we either shift a symbol from the input onto the stack, or reduce according to one of the rules.

35 How does the parser know what to do?

Input: ( ( 23 + 7 ) * x )
Token stream: LP LP Num OP Num RP OP Id RP

[Diagram: the Parser reads the token stream and consults an Action table and a Goto table; it maintains a Stack and produces Output: the AST nodes Op(*), Op(+), Id(b), Num(23), Num(7)]

36 How does the parser know what to do?

• A state keeps the info gathered on handle(s)
  – A state in the "control" of the PDA
  – Also (part of) the stack alphabet: a set of LR(0) items
• A table tells it "what to do" based on the current state and the next token
  – The transition function of the PDA
• A stack records the "nesting level"
  – The stack contains a sequence of prefixes of handles

37 Important Bottom-Up LR-Parsers

• LR(0) – simplest, explains the basic ideas
• SLR(1) – simple, explains lookahead
• LR(1) – complicated, very powerful, expensive
• LALR(1) – complicated, powerful enough, used by automatic tools

38 LR(0) vs SLR(1) vs LR(1) vs LALR(1)

• All use shift / reduce
• Main difference: how a handle is identified
  – Technically: using different sets of states
  – More expensive ⇒ more states ⇒ more specific choice of which reduction rule to use
  – But the usage of the states is the same in all parsers
• Reduction is the same in all techniques
  – Once the handle is determined

39 LR(0) Parsing

40 LR item

N → α•β
(α: already matched; β: to be matched against the remaining input)

Hypothesis about αβ being a possible handle: so far we've matched α, expecting to see β

41 Example: LR(0) Items

• All items can be obtained by placing a dot at every position for every production:

Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

LR(0) items:
1:  S → • E $
2:  S → E • $
3:  S → E $ •
4:  E → • T
5:  E → T •
6:  E → • E + T
7:  E → E • + T
8:  E → E + • T
9:  E → E + T •
10: T → • id
11: T → id •
12: T → • ( E )
13: T → ( • E )
14: T → ( E • )
15: T → ( E ) •

42 Example: LR(0) Items

• All items can be obtained by placing a dot at every position for every production:

Grammar and LR(0) items 1–15 as on the previous slide.

• Before the • : already reduced – the matched prefix
• After the • : may still be reduced – may be matched by the suffix

43 LR(0) items

N → α•β    Shift item

N → αβ•    Reduce item

States are sets of items
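As a small illustration (not from the slides), the following Python sketch enumerates all LR(0) items of the example grammar by placing the dot at every position of every production and classifies each as a shift or a reduce item; the grammar encoding and the names GRAMMAR and lr0_items are assumptions.

GRAMMAR = [                      # the example grammar of the previous slides
    ("S", ("E", "$")),
    ("E", ("T",)),
    ("E", ("E", "+", "T")),
    ("T", ("id",)),
    ("T", ("(", "E", ")")),
]

def lr0_items(grammar):
    for lhs, rhs in grammar:
        for dot in range(len(rhs) + 1):
            kind = "reduce" if dot == len(rhs) else "shift"
            yield lhs, rhs, dot, kind

for lhs, rhs, dot, kind in lr0_items(GRAMMAR):
    body = " ".join(rhs[:dot]) + " . " + " ".join(rhs[dot:])
    print(f"{lhs} -> {body.strip()}   ({kind} item)")
# prints 15 items in total, matching the enumeration on the 'Example: LR(0) Items' slide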

44 LR(0) Items

E → E * B | E + B | B
B → 0 | 1

• A derivation rule with a location marker (●) is called an LR(0) item

45 PDA States

E → E * B | E + B | B
B → 0 | 1

• A PDA state is a set of LR(0) items. E.g.,

q13 = { E → E ● * B , E → E ● + B , B → 1 ● }

• Intuitively, if we have just matched 1, then the state remembers the three possible alternative rules and where we are in each of them:

(1) E → E ● * B   (2) E → E ● + B   (3) B → 1 ●

46 LR(0) Shift/Reduce Items

N → α•β Shift Item

N → αβ• Reduce Item

47 Intuition

• Read input tokens left-to-right and remember them on the stack
• When a right-hand side of a rule is found, remove it from the stack and replace it with the non-terminal it derives

• Remembering a token is called shift
  – Each shift moves to a state that remembers what we've seen so far
• Replacing an RHS with its LHS is called reduce
  – Each reduce goes to a state that determines the context of the derivation

48 Model of an LR parser

Input: id + id + id $

[Diagram: the LR parser reads the input, consults an Action table and a Goto table, and maintains a stack of alternating states and symbols (e.g. 0 T 2 + 7 T 5 from bottom to top); it outputs terminals and non-terminals]

49 LR parser stack

• A sequence made of (state, symbol) pairs
• For instance, a possible stack for the grammar
  S → E $
  E → T
  E → E + T
  T → id
  T → ( E )
  could be:  0 E 2 + 7 T 5

(The stack grows to the right.)

50 Form of LR parsing table

[Table layout: rows are states 0, 1, …; the columns are split into terminals (the Shift/Reduce action part) and non-terminals (the Goto part). Entry sn = shift and go to state n; rk = reduce by rule k; gm = goto state m; acc = accept; an empty entry = error.]

51 LR parser table example

STATE |        action             |  goto
      | id    +     (    )    $   |  E    T
  0   | s5          s7            |  g1   g6
  1   |       s3              acc |
  2   |                           |
  3   | s5          s7            |       g4
  4   | r3    r3    r3   r3   r3  |
  5   | r4    r4    r4   r4   r4  |
  6   | r2    r2    r2   r2   r2  |
  7   | s5          s7            |  g8   g6
  8   |       s3         s9       |
  9   | r5    r5    r5   r5   r5  |

52 Shift move

[Configuration before a shift: the next input token is a, state q is on top of the stack; the parser consults the action/goto tables]

• action[q, a] = sn

53 Result of shift

[Configuration after the shift: token a and the new state n have been pushed on top of q]

• action[q, a] = sn

54 Reduce move

[Configuration before a reduce: the top of the stack holds σ1 q1 … σn qn (2·n entries) above some state q; the next input token is a]

• action[qn, a] = rk

• Production: (k) A → σ1 … σn

• The top of the stack looks like q1 σ1 … qn σn for some q1 … qn

• goto[q, A] = qm

55 Result of reduce move

[Configuration after the reduce: σ1 q1 … σn qn have been popped; A and then the new state qm are pushed above q]

• action[qn, a] = rk

• Production: (k) A → σ1 … σn

• The top of the stack looks like q1 σ1 … qn σn for some q1 … qn

• goto[q, A] = qm

56 Accept move

[Configuration: state q on top of the stack, next input token a]

If action[q, a] = accept, parsing is completed

57 Error move

[Configuration: state q on top of the stack, next input token a]

If action[q, a] = error (usually an empty entry), the parser has discovered a syntactic error

58 Example

Z → E $
E → T | E + T
T → i | ( E )

59 Example: parsing with LR items

Grammar:
Z -> E $
E -> T | E + T
T -> i | ( E )

Input: i + i $

Item set:
Z -> •E $
E -> •T
E -> •E + T
T -> •i
T -> •( E )

Why do we need these additional LR items? Where do they come from? What do they mean?

60 e-closure

• Given a set S of LR(0) items

• If P -> α•Nβ is in state S
• then for each rule N -> X in the grammar, state S must also contain N -> •X

Example, for the grammar
Z -> E $
E -> T | E + T
T -> i | ( E )

e-closure({Z -> •E $}) = { Z -> •E $,
                           E -> •T,
                           E -> •E + T,
                           T -> •i,
                           T -> •( E ) }
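A minimal Python sketch (mine) of the e-closure computation just described; the item encoding (lhs, rhs, dot), the dict PRODS and the function name e_closure are assumptions.

PRODS = {
    "Z": [("E", "$")],
    "E": [("T",), ("E", "+", "T")],
    "T": [("i",), ("(", "E", ")")],
}

def e_closure(items):
    closure = set(items)
    worklist = list(items)
    while worklist:
        lhs, rhs, dot = worklist.pop()
        if dot < len(rhs) and rhs[dot] in PRODS:        # dot is before a non-terminal N
            for alt in PRODS[rhs[dot]]:                 # add N -> .X for every rule N -> X
                item = (rhs[dot], alt, 0)
                if item not in closure:
                    closure.add(item)
                    worklist.append(item)
    return closure

for lhs, rhs, dot in sorted(e_closure({("Z", ("E", "$"), 0)})):
    print(lhs, "->", " ".join(rhs[:dot]), ".", " ".join(rhs[dot:]))
# yields Z -> .E$, E -> .T, E -> .E+T, T -> .i, T -> .(E)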

61 Example: parsing with LR items

Grammar:
Z → E $
E → T | E + T
T → i | ( E )

Input: i + i $

Remember the position from which we're trying to reduce. Items denote possible future handles.

Current item set:
Z → •E $,  E → •T,  E → •E + T,  T → •i,  T → •( E )

62 Example: parsing with LR items

Match the items with the current token (i). After shifting i we get the reduce item  T → i•

63 Example: parsing with LR items

Reduce by T → i. Stack: T (over i). Remaining input: + i $
Reduce item:  E → T•

64 Example: parsing with LR items

Reduce by E → T. Stack: E (over T over i). Remaining input: + i $

65 Example: parsing with LR items

Item set after recognizing E:
Z → E•$,  E → E•+ T
(together with the initial items Z → •E $, E → •T, E → •E + T, T → •i, T → •( E ))

66 Example: parsing with LR items

Shift +. Item set after E + :
E → E+•T,  T → •i,  T → •( E )

67 Example: parsing with LR items

Shift i and reduce by T → i. Stack: E + T. Remaining input: $

68 Example: parsing with LR items

Reduce item:  E → E+T•

69 Example: parsing with LR items

Reduce by E → E + T. Stack: E. Remaining input: $
Item set: Z → E•$,  E → E•+ T

70 Example: parsing with LR items

Shift $. Reduce item:  Z → E$•

71 Example: parsing with LR items

Reduce by Z → E $. The parse is complete.

[Final parse tree: Z over (E $); E over (E + T); the inner E over T over i; and T over i]

72 GOTO/ACTION tables

(an empty entry = error move)

GOTO table                            ACTION table
State |  i    +    (    )    $    E    T   | action
q0    |  q5        q7             q1   q6  | shift
q1    |       q3             q2            | shift
q2    |                                    | Z → E$
q3    |  q5        q7                  q4  | shift
q4    |                                    | E → E+T
q5    |                                    | T → i
q6    |                                    | E → T
q7    |  q5        q7             q8   q6  | shift
q8    |       q3        q9                 | shift
q9    |                                    | T → (E)

73 LR(0) parser tables

• Two types of rows:
  – Shift row – tells which state to GOTO for the current token
  – Reduce row – tells which rule to reduce by (independent of the current token)
    • GOTO entries are blank

74 LR parser data structures

• Input – the remainder of the text to be processed
• Stack – a sequence of pairs N, qi
  – N – a symbol (terminal or non-terminal)
  – qi – the state at which decisions are made

Example: input suffix  + i $ ;  stack  q0 i q5  (the stack grows to the right)

• The initial stack contains q0

75 LR(0) pushdown automaton

• Two moves: shift and reduce
• Shift move
  – Remove the first token from the input
  – Push it on the stack
  – Compute the next state based on the GOTO table
  – Push the new state on the stack
  – If the new state is error – report an error

Example (shift):  input  i + i $ , stack  q0   ⇒   input  + i $ , stack  q0 i q5
(the stack grows to the right; table row used:  q0:  i → q5, ( → q7, E → q1, T → q6, action = shift)

76 LR(0) pushdown automaton

• Reduce move
  – Using a rule N → α
  – The symbols in α and their following states are removed from the stack
  – The new state is computed from the GOTO table (using the top of the stack, before pushing N)
  – N is pushed on the stack
  – The new state is pushed on top of N

Example (reduce by T → i):  input  + i $ , stack  q0 i q5   ⇒   input  + i $ , stack  q0 T q6
(table row used:  q0:  i → q5, ( → q7, E → q1, T → q6, action = shift)

77 GOTO/ACTION table

State |  i    +    (    )    $    E    T
q0    |  s5        s7             s1   s6
q1    |       s3             s2
q2    |  r1   r1   r1   r1   r1   r1   r1
q3    |  s5        s7                  s4
q4    |  r3   r3   r3   r3   r3   r3   r3
q5    |  r4   r4   r4   r4   r4   r4   r4
q6    |  r2   r2   r2   r2   r2   r2   r2
q7    |  s5        s7             s8   s6
q8    |       s3        s9
q9    |  r5   r5   r5   r5   r5   r5   r5

Grammar:
(1) Z → E $
(2) E → T
(3) E → E + T
(4) T → i
(5) T → ( E )

Warning: the numbers mean different things!
rn = reduce using rule number n
sm = shift to state m
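For concreteness, here is a small runnable LR driver sketch in Python (mine, not the course's code), driven by an ACTION/GOTO table for the running example grammar; the dict encoding and the names ACTION, GOTO, RULES and parse are assumptions. It performs exactly the shift, reduce, accept and error moves described earlier, and its table contents mirror the action/goto table used in the trace on the next slides.

RULES = {2: ("E", 1), 3: ("E", 3), 4: ("T", 1), 5: ("T", 3)}   # rule -> (LHS, |RHS|); rule 1 ends in accept

ACTION = {
    (0, "id"): ("s", 5), (0, "("): ("s", 7),
    (1, "+"): ("s", 3), (1, "$"): "acc",
    (3, "id"): ("s", 5), (3, "("): ("s", 7),
    (7, "id"): ("s", 5), (7, "("): ("s", 7),
    (8, "+"): ("s", 3), (8, ")"): ("s", 9),
}
for q, k in [(4, 3), (5, 4), (6, 2), (9, 5)]:                  # reduce rows
    for t in ["id", "+", "(", ")", "$"]:
        ACTION[(q, t)] = ("r", k)

GOTO = {(0, "E"): 1, (0, "T"): 6, (3, "T"): 4, (7, "E"): 8, (7, "T"): 6}

def parse(tokens):
    stack = [0]                      # alternating states and symbols; starts with state 0
    i = 0
    while True:
        q, a = stack[-1], tokens[i]
        act = ACTION.get((q, a))
        if act == "acc":
            return True
        if act is None:              # empty entry = error move
            return False
        kind, n = act
        if kind == "s":              # shift: push the token and the new state
            stack += [a, n]
            i += 1
        else:                        # reduce by rule n: pop 2*|RHS|, push LHS and goto state
            lhs, size = RULES[n]
            del stack[len(stack) - 2 * size:]
            stack += [lhs, GOTO[(stack[-1], lhs)]]

print(parse(["id", "+", "id", "$"]))     # True
print(parse(["id", "+", "+", "$"]))      # False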

78 Parsing id+id$

Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

(The stack grows to the right. Initialize the stack with state 0.)

Stack                 Input       Action
0                     id + id $   s5
0 id 5                + id $      r4    (pop id 5, push T, goto 6)
0 T 6                 + id $      r2
0 E 1                 + id $      s3
0 E 1 + 3             id $        s5
0 E 1 + 3 id 5        $           r4
0 E 1 + 3 T 4         $           r3
0 E 1                 $           s2

(Slides 79–89 build this trace one row at a time, using the following action/goto table.)

S |        action            |  goto
  | id    +     (    )   $   |  E    T
0 | s5          s7           |  g1   g6
1 |       s3             acc |
2 |                          |
3 | s5          s7           |       g4
4 | r3    r3    r3   r3  r3  |
5 | r4    r4    r4   r4  r4  |
6 | r2    r2    r2   r2  r2  |
7 | s5          s7           |  g8   g6
8 |       s3         s9      |
9 | r5    r5    r5   r5  r5  |

rn = reduce using rule number n
sm = shift to state m

89 LR(0) automaton example

States (reduce states and shift states):
q0: Z → •E$,  E → •T,  E → •E + T,  T → •i,  T → •(E)
q1: Z → E•$,  E → E•+ T
q2: Z → E$•
q3: E → E+•T,  T → •i,  T → •(E)
q4: E → E + T•
q5: T → i•
q6: E → T•
q7: T → (•E),  E → •T,  E → •E + T,  T → •i,  T → •(E)
q8: T → (E•),  E → E•+ T
q9: T → (E)•

Transitions:
q0: E → q1 ("managed to reduce E"), T → q6, i → q5, ( → q7 ("read input '('")
q1: + → q3, $ → q2
q3: T → q4, i → q5, ( → q7
q7: E → q8, T → q6, i → q5, ( → q7
q8: + → q3, ) → q9

90 States and LR(0) Items

E → E * B | E + B | B
B → 0 | 1

• The state will "remember" the potential derivation rules, given the part that was already identified
• For example, if we have already identified E, then the state will remember the two alternatives:
  (1) E → E * B   (2) E → E + B
• Actually, we will also remember where we are in each of them:
  (1) E → E ● * B   (2) E → E ● + B
• A derivation rule with a location marker is called an LR(0) item
• The state is actually a set of LR(0) items. E.g., q13 = { E → E ● * B , E → E ● + B }

91 Constructing an LR parsing table

• Construct a (determinized) transition diagram from the LR items
• If there are conflicts – stop
• Fill the table entries from the diagram

97 LR item

N → α•β
(α: already matched; β: to be matched against the remaining input)

Hypothesis about αβ being a possible handle: so far we've matched α, expecting to see β

98 Types of LR(0) items

N → α•β    Shift item

N → αβ•    Reduce item

99 LR(0) automaton example

[The LR(0) automaton for the example grammar, repeated: states q0–q9 with their item sets and transitions, as on the earlier 'LR(0) automaton example' slide]

100 Computing item sets

• Initial set
  – Z is the start symbol
  – e-closure({ Z → •α | Z → α is in the grammar })

• Next set, from a set S and the next symbol X
  – step(S,X) = { N → αX•β | N → α•Xβ is in the item set S }
  – nextSet(S,X) = e-closure(step(S,X))

101 Operations for transition diagram construction

• Initial = { S' → •S$ }

• For an item set I:
  Closure(I) = Closure(I) ∪ { X → •µ | X → µ is in the grammar, N → α•Xβ in Closure(I) }

• Goto(I, X) = { N → αX•β | N → α•Xβ in I }

102 Initial example

Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

• Initial = { S → •E $ }

103 Closure example

Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

• Initial = { S → •E $ }
• Closure({S → •E $}) = {
    S → •E $,
    E → •T,
    E → •E + T,
    T → •id,
    T → •( E ) }

104 Goto example

Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

• Initial = { S → •E $ }
• Closure({S → •E $}) = {
    S → •E $,
    E → •T,
    E → •E + T,
    T → •id,
    T → •( E ) }
• Goto({S → •E $ , E → •E + T, T → •id}, E) = { S → E• $, E → E• + T }

105 Constructing the transition diagram

• Start with state 0, containing the item set Closure({S → •E $})
• Repeat until no new states are discovered
  – For every state p containing item set Ip, and symbol N, compute the state q containing the item set Iq = Closure(Goto(Ip, N))
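The construction above can be sketched directly in Python (again, an illustration of mine rather than the course's code); closure and goto mirror the Closure/Goto operations of the previous slides, and build_states repeats Goto until no new states are discovered. The grammar encoding and the function names are assumptions.

PRODS = {
    "S": [("E", "$")],
    "E": [("T",), ("E", "+", "T")],
    "T": [("id",), ("(", "E", ")")],
}

def closure(items):
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in PRODS:
            for alt in PRODS[rhs[dot]]:
                item = (rhs[dot], alt, 0)
                if item not in result:
                    result.add(item); work.append(item)
    return frozenset(result)

def goto(state, X):
    # advance the dot over X, then close the resulting item set
    return closure({(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == X})

def build_states():
    q0 = closure({("S", ("E", "$"), 0)})
    states, work = {q0}, [q0]
    symbols = {s for alts in PRODS.values() for alt in alts for s in alt}
    while work:                              # repeat until no new states are discovered
        p = work.pop()
        for X in symbols:
            q = goto(p, X)
            if q and q not in states:
                states.add(q); work.append(q)
    return states

print(len(build_states()))   # 10 states, matching q0..q9 in the automaton example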

106 LR(0) automaton example

[The LR(0) automaton for the example grammar again: states q0–q9 with their item sets and transitions, as constructed by the procedure above]

107 Automaton construction example

Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

q0: S → •E$

Initialize

108 Automaton construction example

Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

Apply Closure:
q0: S → •E$
    E → •T
    E → •E + T
    T → •i
    T → •(E)

109 Automaton construction example

[Partially constructed automaton:
q0: S → •E$, E → •T, E → •E + T, T → •i, T → •(E)
q0 on T → q6: E → T•
q0 on i → q5: T → i•
q0 on ( → q7: T → (•E), E → •T, E → •E + T, T → •i, T → •(E)
q0 on E → q1: S → E•$, E → E•+ T]

110 Automaton construction example

[The complete automaton: states q0–q9 as before. Annotations:
– a non-terminal transition corresponds to a goto action in the parse table
– a terminal transition corresponds to a shift action in the parse table
– a state consisting of a single reduce item corresponds to a reduce action]

111 Are we done?

• Can make a transition diagram for any grammar
• Can make a GOTO table for every grammar

• Cannot make a deterministic ACTION table for every grammar

112 LR(0) conflicts

[Grammar:
Z → E $
E → T
E → E + T
T → i
T → ( E )
T → i[E]

In the automaton, q0 = { Z → •E$, E → •T, E → •E + T, T → •i, T → •(E), T → •i[E] };
on i we reach q5 = { T → i• , T → i•[E] } : a shift/reduce conflict]

113 LR(0) conflicts

[Grammar:
Z → E $
E → T
E → E + T
T → i
V → i
T → ( E )

Now, on i we reach q5 = { T → i• , V → i• } : a reduce/reduce conflict]

114 LR(0) conflicts

• Any grammar with an ε-rule cannot be LR(0)
• Inherent shift/reduce conflict
  – A → ε• – a reduce item
  – P → α•Aβ – a shift item
  – A → ε• can always be predicted from P → α•Aβ

115 Conflicts

• Can construct a diagram for every grammar, but some may introduce conflicts
• Shift-reduce conflict: an item set contains at least one shift item and one reduce item
• Reduce-reduce conflict: an item set contains two reduce items
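A tiny Python sketch (mine) of these definitions: it classifies an item set as having a shift-reduce and/or a reduce-reduce conflict; the item encoding and the function name conflicts are assumptions.

def conflicts(state):
    # items are (lhs, rhs, dot); dot at the end = reduce item, otherwise shift item
    shift_items  = [it for it in state if it[2] < len(it[1])]
    reduce_items = [it for it in state if it[2] == len(it[1])]
    found = []
    if shift_items and reduce_items:
        found.append("shift-reduce")
    if len(reduce_items) >= 2:
        found.append("reduce-reduce")
    return found

# the conflicted state q5 from the earlier slides: { T -> i. , T -> i.[E] }
q5 = {("T", ("i",), 1), ("T", ("i", "[", "E", "]"), 1)}
print(conflicts(q5))                                      # ['shift-reduce']
print(conflicts({("T", ("i",), 1), ("V", ("i",), 1)}))    # ['reduce-reduce']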

116 LR variants

• LR(0) – what we've seen so far
• SLR(1) – removes infeasible reduce actions via FOLLOW-set reasoning
• LR(1) – LR(0) with one lookahead token in the items
• LALR(1) – LR(1) with merging of states that have the same LR(0) component

117 LR(0) GOTO/ACTION tables

The GOTO table is indexed by a state and a grammar symbol from the stack.

GOTO table                            ACTION table
State |  i    +    (    )    $    E    T   | action
q0    |  q5        q7             q1   q6  | shift
q1    |       q3             q2            | shift
q2    |                                    | Z → E$
q3    |  q5        q7                  q4  | shift
q4    |                                    | E → E+T
q5    |                                    | T → i
q6    |                                    | E → T
q7    |  q5        q7             q8   q6  | shift
q8    |       q3        q9                 | shift
q9    |                                    | T → (E)

The ACTION table is determined only by the state; it ignores the input.

118 SLR parsing

• A handle should not be reduced to a non-terminal N if the lookahead is a token that cannot follow N
• A reduce item N → α• is applicable only when the lookahead is in FOLLOW(N)
  – If b is not in FOLLOW(N), we proved there is no derivation S ⇒* βNb
  – Thus, it is safe to remove the reduce item from the conflicted state
• Differs from LR(0) only in the ACTION table
  – Now a row in the parsing table may contain both shift actions and reduce actions, and we need to consult the current token to decide which one to take

119 SLR action table

(lookahead: a token from the input)

State |  i      +      (      )      [      ]    $       ||  state | action
  0   | shift          shift                             ||  q0    | shift
  1   |         shift                             accept ||  q1    | shift
  2   |                                                  ||  q2    |
  3   | shift          shift                             ||  q3    | shift
  4   |         E→E+T         E→E+T               E→E+T  ||  q4    | E→E+T
  5   |         T→i           T→i    shift        T→i    ||  q5    | T→i
  6   |         E→T           E→T                 E→T    ||  q6    | E→T
  7   | shift          shift                             ||  q7    | shift
  8   |         shift         shift                      ||  q8    | shift
  9   |         T→(E)         T→(E)               T→(E)  ||  q9    | T→(E)

SLR – uses 1 token of look-ahead          LR(0) – no look-ahead

(Grammar: … as before … with  T → i  and  T → i[E])

120 LR(1) grammars

• In SLR: a reduce item N → α• is applicable only when the lookahead is in FOLLOW(N)
• But FOLLOW(N) merges the lookaheads of all alternatives for N
  – Insensitive to the context of a given production

• LR(1) keeps a lookahead with each LR item
• Idea: a more refined notion of follows, computed per item

121 LR(1) items

• An LR(1) item is a pair:
  – an LR(0) item
  – a lookahead token
• Meaning
  – We matched the part left of the dot, looking to match the part on the right of the dot, followed by the lookahead token
• Example
  – For the grammar
    (0) S' → S$
    (1) S → L = R
    (2) S → R
    (3) L → * R
    (4) L → id
    (5) R → L
    the production L → id yields the LR(0) items [L → ● id] and [L → id ●], and the LR(1) items
    [L → ● id, *], [L → ● id, =], [L → ● id, id], [L → ● id, $],
    [L → id ●, *], [L → id ●, =], [L → id ●, id], [L → id ●, $]

122 LR(1) items

• An LR(1) item is a pair:
  – an LR(0) item
  – a lookahead token
• Meaning
  – We matched the part left of the dot, looking to match the part on the right of the dot, followed by the lookahead token
• Example
  – The production L → id yields the LR(1) items listed above

• Reduce only if the expected lookahead matches the input
  – [L → id ●, =] will be used only if the next input token is =
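A small illustration (mine, not from the slides) of the difference: how an SLR parser and an LR(1) parser decide whether a reduce item applies for a given lookahead. The FOLLOW sets and function names below are example assumptions.

FOLLOW = {"L": {"=", "$"}, "R": {"=", "$"}, "S": {"$"}}   # example FOLLOW sets

def slr_reduce_ok(lhs, lookahead):
    # SLR: reduce by N -> alpha only if the lookahead is in FOLLOW(N)
    return lookahead in FOLLOW[lhs]

def lr1_reduce_ok(item_lookahead, lookahead):
    # LR(1): each item carries its own lookahead; reduce only on an exact match
    return lookahead == item_lookahead

print(slr_reduce_ok("L", "="))        # True: '=' is in FOLLOW(L)
print(lr1_reduce_ok("$", "="))        # False: item [L -> id ., $] does not apply on '='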

123 LALR(1)

• LR(1) tables have a huge number of entries
• Often we don't need such a refined observation (and its cost)
• Idea: find states with the same LR(0) component and merge their lookahead components, as long as there are no conflicts
• LALR(1) is not as powerful as LR(1) in theory, but works quite well in practice
  – Merging cannot introduce new shift-reduce conflicts, only reduce-reduce ones, which are unlikely in practice

124 Summary

125 LR is More Powerful than LL

• Any LL(k) language is also LR(k), i.e., LL(k) ⊂ LR(k)
  – LR is more popular in automatic tools
  – But less intuitive

• Also, the lookahead is counted differently in the two cases
  – In an LL(k) derivation the algorithm sees the left-hand side of the rule + k input tokens and then must select the derivation rule
  – In LR(k), the algorithm "sees" the entire right-hand side of the derivation rule + k input tokens and then reduces
    • LR(0) sees the entire right-hand side, but no input token

126 Using tools to parse + create AST

terminal Integer NUMBER;
terminal PLUS, MINUS, MULT, DIV;
terminal LPAREN, RPAREN;
terminal UMINUS;
nonterminal Integer expr;

precedence left PLUS, MINUS;
precedence left DIV, MULT;
precedence left UMINUS;

%%

expr ::= expr:e1 PLUS expr:e2
         {: RESULT = new Integer(e1.intValue() + e2.intValue()); :}
       | expr:e1 MINUS expr:e2
         {: RESULT = new Integer(e1.intValue() - e2.intValue()); :}
       | expr:e1 MULT expr:e2
         {: RESULT = new Integer(e1.intValue() * e2.intValue()); :}
       | expr:e1 DIV expr:e2
         {: RESULT = new Integer(e1.intValue() / e2.intValue()); :}
       | MINUS expr:e1 %prec UMINUS
         {: RESULT = new Integer(0 - e1.intValue()); :}
       | LPAREN expr:e1 RPAREN
         {: RESULT = e1; :}
       | NUMBER:n
         {: RESULT = n; :}
       ;

127 Grammar Hierarchy

[Diagram: the grammar hierarchy. LR(0) ⊆ SLR(1) ⊆ LALR(1) ⊆ LR(1), all contained in the non-ambiguous CFGs; LL(1) is also shown, contained in LR(1).]