LR Parsing, LALR Parser Generators

Outline

• Review of bottom-up parsing
• Computing the parsing DFA
• Using parser generators
Bottom-up Parsing (Review)

• A bottom-up parser rewrites the input string to the start symbol
• The state of the parser is described as α I γ
  – α is a stack of terminals and non-terminals
  – γ is the string of terminals not yet examined
• Initially: I x1 x2 . . . xn

The Shift and Reduce Actions (Review)

• Recall the CFG: E → int | E + (E)
• A bottom-up parser uses two kinds of actions:
• Shift pushes a terminal from the input onto the stack
      E + ( I int )  ⇒  E + ( int I )
• Reduce pops 0 or more symbols off the stack (the production's RHS) and pushes a non-terminal onto the stack (the production's LHS)
      E + ( E + ( E ) I )  ⇒  E + ( E I )
Key Issue: When to Shift or Reduce?

• Idea: use a deterministic finite automaton (DFA) to decide when to shift or reduce
  – The input is the stack
  – The language consists of terminals and non-terminals
• We run the DFA on the stack and we examine the resulting state X and the token tok after I
  – If X has a transition labeled tok then shift
  – If X is labeled with "A → β on tok" then reduce

[DFA diagram for E → int | E + (E): states 0–11; state 1 is "E → int on $, +", state 5 is "E → int on ), +", states 7 and 11 are "E → E + (E)" reduce states, and state 2 accepts on $.]

LR(1) Parsing: An Example

      I int + (int) + (int)$       shift
      int I + (int) + (int)$       E → int
      E I + (int) + (int)$         shift (x3)
      E + (int I ) + (int)$        E → int
      E + (E I ) + (int)$          shift
      E + (E) I + (int)$           E → E+(E)
      E I + (int)$                 shift (x3)
      E + (int I )$                E → int
      E + (E I )$                  shift
      E + (E) I $                  E → E+(E)
      E I $                        accept
Representing the DFA

• Parsers represent the DFA as a 2D table
  – Recall table-driven lexical analysis
• Lines correspond to DFA states
• Columns correspond to terminals and non-terminals
• Typically columns are split into:
  – Those for terminals: the action table
  – Those for non-terminals: the goto table

Representing the DFA: Example

The table for a fragment of our DFA:

           int    +            (     )           $            E
      3                        s4
      4    s5                                                 g6
      5           r E→int            r E→int
      6           s8                 s7
      7           r E→E+(E)                      r E→E+(E)
      …

  – sk is shift and goto state k
  – rX → α is reduce with X → α
  – gk is goto state k
The LR Parsing Algorithm

• After a shift or reduce action we rerun the DFA on the entire stack
  – This is wasteful, since most of the work is repeated
• Remember for each stack element which state it brings the DFA to
• The LR parser maintains a stack
      〈 sym1, state1 〉 . . . 〈 symn, staten 〉
  where statek is the final state of the DFA on sym1 … symk

      let I = w$ be the initial input
      let j = 0
      let DFA state 0 be the start state
      let stack = 〈 dummy, 0 〉
      repeat
        case action[top_state(stack), I[j]] of
          shift k:       push 〈 I[j++], k 〉
          reduce X → α:  pop |α| pairs,
                         push 〈 X, goto[top_state(stack), X] 〉
          accept:        halt normally
          error:         halt and report error
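As a concrete sketch of this loop, here is a minimal table-driven LR parser in Python for the running grammar E → int | E + (E). The ACTION/GOTO tables are hand-derived LALR tables for this grammar; the state numbering is illustrative and not exactly the numbering on the slides.

```python
# Grammar (with the added start production):
#   0: S -> E     1: E -> int     2: E -> E + ( E )
PRODS = {1: ("E", 1), 2: ("E", 5)}   # prod number -> (LHS, length of RHS)

ACTION = {
    (0, "int"): ("shift", 1),
    (1, "$"): ("reduce", 1), (1, "+"): ("reduce", 1), (1, ")"): ("reduce", 1),
    (2, "$"): ("accept", None), (2, "+"): ("shift", 3),
    (3, "("): ("shift", 4),
    (4, "int"): ("shift", 1),
    (6, ")"): ("shift", 7), (6, "+"): ("shift", 3),
    (7, "$"): ("reduce", 2), (7, "+"): ("reduce", 2), (7, ")"): ("reduce", 2),
}
GOTO = {(0, "E"): 2, (4, "E"): 6}

def lr_parse(tokens):
    """Run the shift/reduce loop; return True iff the input is accepted."""
    tokens = tokens + ["$"]
    stack = [("dummy", 0)]            # pairs <symbol, state>, as on the slides
    j = 0
    while True:
        state = stack[-1][1]
        act = ACTION.get((state, tokens[j]))
        if act is None:
            return False              # missing entry = error
        kind, arg = act
        if kind == "shift":
            stack.append((tokens[j], arg))
            j += 1
        elif kind == "reduce":
            lhs, n = PRODS[arg]
            del stack[len(stack) - n:]            # pop |RHS| pairs
            stack.append((lhs, GOTO[(stack[-1][1], lhs)]))
        else:                         # accept
            return True

print(lr_parse(["int", "+", "(", "int", ")"]))   # True
print(lr_parse(["int", "+", "int"]))             # False
```

Note how a reduce never reruns the DFA: popping |RHS| pairs exposes the state the DFA was in before the RHS was read, and one GOTO lookup finishes the job.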
Key Issue: How is the DFA Constructed?

• The stack describes the context of the parse
  – What non-terminal we are looking for
  – What production RHS we are looking for
  – What we have seen so far from the RHS
• Each DFA state describes several such contexts
  – E.g., when we are looking for non-terminal E, we might be looking either for an int or an E + (E) RHS

LR(0) Items

• An LR(0) item is a production with a "I" somewhere on the RHS
• The items for T → (E) are:
      T → I (E)
      T → ( I E)
      T → (E I )
      T → (E) I
• The only item for X → ε is X → I
LR(0) Items: Intuition

• An item [X → α I β] says that
  – the parser is looking for an X
  – it has an α on top of the stack
  – it expects to find a string derived from β next in the input
• Notes:
  – [X → α I aβ] means that a should follow; then we can shift it and still have a viable prefix
  – [X → α I] means that we could reduce X
  – But this is not always a good idea!

LR(1) Items

• An LR(1) item is a pair:
      X → α I β , a
  – X → αβ is a production
  – a is a terminal (the lookahead terminal)
  – LR(1) means 1 lookahead terminal
• [X → α I β, a] describes a context of the parser
  – We are trying to find an X followed by an a, and
  – We have (at least) α already on top of the stack
  – Thus we need to see next a prefix derived from βa
Note

• The symbol I was used before to separate the stack from the rest of the input
  – α I γ, where α is the stack and γ is the remaining string of terminals
• In items, I is used to mark a prefix of a production RHS:
      X → α I β , a
  – Here β might contain terminals as well
• In both cases the stack is on the left of I

Convention

• We add to our grammar a fresh new start symbol S and a production S → E
  – Where E is the old start symbol
• The initial parsing context contains:
      S → I E , $
  – Trying to find an S as a string derived from E$
  – The stack is empty
LR(1) Items (Cont.)

• In a context containing
      E → E + I ( E ) , +
  – If ( follows, then we can perform a shift to a context containing
      E → E + ( I E ) , +
• In a context containing
      E → E + ( E ) I , +
  – We can perform a reduction with E → E + ( E )
  – But only if a + follows

• Consider the item
      E → E + ( I E ) , +
• We expect a string derived from E ) +
• There are two productions for E:
      E → int   and   E → E + ( E )
• We describe this by extending the context with two more items:
      E → I int , )
      E → I E + ( E ) , )
The Closure Operation

• The operation of extending the context with items is called the closure operation

      Closure(Items) =
        repeat
          for each [X → α I Yβ, a] in Items
            for each production Y → γ
              for each b in First(βa)
                add [Y → I γ, b] to Items
        until Items is unchanged

Constructing the Parsing DFA (1)

• For the grammar E → E + (E) | int, construct the start context: Closure({S → I E, $})

      S → I E , $
      E → I E+(E) , $
      E → I int , $
      E → I E+(E) , +
      E → I int , +

• We abbreviate as:

      S → I E , $
      E → I E+(E) , $/+
      E → I int , $/+
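The closure pseudocode above can be sketched in Python for this grammar. The item representation (tuples of LHS, RHS, dot position, lookahead) and the hand-coded First computation are my own choices for the sketch, not from the slides; First is trivial here because no symbol derives ε and FIRST(E) = {int}.

```python
# Grammar:  S -> E,   E -> E + ( E ) | int
GRAMMAR = {
    "S": [("E",)],
    "E": [("E", "+", "(", "E", ")"), ("int",)],
}
NONTERMINALS = set(GRAMMAR)

def first_of_string(symbols):
    """FIRST of a string of symbols; no symbol in this grammar derives epsilon."""
    for s in symbols:
        if s in NONTERMINALS:
            return {"int"}          # FIRST(E) = {int} in this grammar
        return {s}                  # a terminal starts the string
    return set()

def closure(items):
    """Repeatedly add [Y -> .gamma, b] for each [X -> alpha . Y beta, a]."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, la) in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                Y, beta = rhs[dot], rhs[dot + 1:]
                for gamma in GRAMMAR[Y]:
                    for b in first_of_string(beta + (la,)):
                        item = (Y, gamma, 0, b)
                        if item not in items:
                            items.add(item)
                            changed = True
    return items

start = closure({("S", ("E",), 0, "$")})
# Should yield the five items abbreviated on the slide as
#   S -> .E, $     E -> .E+(E), $/+     E -> .int, $/+
for it in sorted(start):
    print(it)
```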
Constructing the Parsing DFA (2)

• A DFA state is a closed set of LR(1) items
• The start state contains [S → I E , $]
• A state that contains [X → α I, b] is labeled with "reduce with X → α on b"
• And now the transitions …

The DFA Transitions

• A state "State" that contains [X → α I yβ, b] has a transition labeled y to a state that contains the items "Transition(State, y)"
  – y can be a terminal or a non-terminal

      Transition(State, y)
        Items = ∅
        for each [X → α I yβ, b] in State
          add [X → αy I β, b] to Items
        return Closure(Items)
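The core of Transition(State, y) is the dot-advancing loop; a full implementation would then apply Closure to the result, as in the pseudocode. A minimal self-contained sketch (same item-as-tuple representation as before, written out literally here):

```python
def transition_kernel(state, y):
    """Advance the dot over y in every item of state that has y after the dot.
    A complete Transition would return Closure(transition_kernel(state, y))."""
    items = set()
    for (lhs, rhs, dot, la) in state:
        if dot < len(rhs) and rhs[dot] == y:
            items.add((lhs, rhs, dot + 1, la))
    return items

# The start state for S -> E, E -> E+(E) | int, from the previous slide:
start = {
    ("S", ("E",), 0, "$"),
    ("E", ("E", "+", "(", "E", ")"), 0, "$"),
    ("E", ("E", "+", "(", "E", ")"), 0, "+"),
    ("E", ("int",), 0, "$"),
    ("E", ("int",), 0, "+"),
}

# Following the E edge yields the kernel of state 2 on the example slide:
#   S -> E. , $    and    E -> E.+(E) , $/+
for item in sorted(transition_kernel(start, "E")):
    print(item)
```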
Constructing the Parsing DFA: Example

      State 0:                 S → I E , $
                               E → I E+(E) , $/+
                               E → I int , $/+
      State 1 (from 0 on int): E → int I , $/+        reduce E → int on $, +
      State 2 (from 0 on E):   S → E I , $            accept on $
                               E → E I +(E) , $/+
      State 3 (from 2 on +):   E → E+ I (E) , $/+
      State 4 (from 3 on ():   E → E+( I E) , $/+
                               E → I E+(E) , )/+
                               E → I int , )/+
      State 5 (from 4 on int): E → int I , )/+        reduce E → int on ), +
      State 6 (from 4 on E):   E → E+(E I ) , $/+
                               E → E I +(E) , )/+
      and so on…

LR Parsing Tables: Notes

• Parsing tables (i.e., the DFA) can be constructed automatically for a CFG
• But we still need to understand the construction to work with parser generators
  – E.g., they report errors in terms of sets of items
• What kind of errors can we expect?
Shift/Reduce Conflicts

• If a DFA state contains both
      [X → α I aβ, b]   and   [Y → γ I, a]
• Then on input "a" we could either
  – Shift into state [X → αa I β, b], or
  – Reduce with Y → γ
• This is called a shift-reduce conflict

• Typically due to ambiguities in the grammar
• Classic example: the dangling else
      S → if E then S | if E then S else S | OTHER
• The DFA will have a state containing
      [S → if E then S I , else]
      [S → if E then S I else S , x]
• If else follows then we can shift or reduce
• Default (yacc, ML-yacc, etc.) is to shift
  – The default behavior is as needed in this case
More Shift/Reduce Conflicts

• Consider the ambiguous grammar
      E → E + E | E * E | int
• We will have the states containing
      [E → E * I E , +]           [E → E * E I , +]
      [E → I E + E , +]    ⇒E     [E → E I + E , +]
      ……
• Again we have a shift/reduce conflict on input +
  – We need to reduce (* binds more tightly than +)
  – Recall the solution: declare the precedence of * and +

• In yacc, declare precedence and associativity:
      %left +
      %left *
• Precedence of a rule = that of its last terminal
  – See the yacc manual for ways to override this default
• Resolve a shift/reduce conflict with a shift if:
  – no precedence is declared for either the rule or the terminal, or
  – the input terminal has higher precedence than the rule, or
  – the precedences are the same and the terminal is right-associative
Using Precedence to Solve S/R Conflicts

• Back to our example:
      [E → E * I E , +]           [E → E * E I , +]
      [E → I E + E , +]    ⇒E     [E → E I + E , +]
      ……
• Will choose reduce because the precedence of rule E → E * E is higher than that of terminal +

• Same grammar as before:
      E → E + E | E * E | int
• We will also have the states
      [E → E + I E , +]           [E → E + E I , +]
      [E → I E + E , +]    ⇒E     [E → E I + E , +]
      ……
• Now we also have a shift/reduce conflict on input +
  – We choose reduce because E → E + E and + have the same precedence and + is left-associative
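The resolution policy described above can be sketched as a small decision function. This is a paraphrase of the rules on the slides, not yacc's actual implementation; the table below encodes %left + followed by %left * (so * has higher precedence).

```python
# precedence table: terminal -> (level, associativity); higher level binds tighter
PREC = {"+": (1, "left"), "*": (2, "left")}

def resolve(rule_last_terminal, input_terminal):
    """Decide a shift/reduce conflict: the rule's precedence is that of its
    last terminal; shift wins only in the cases listed on the slide."""
    if rule_last_terminal not in PREC or input_terminal not in PREC:
        return "shift"                 # no precedence declared: default shift
    rule_level, _ = PREC[rule_last_terminal]
    in_level, in_assoc = PREC[input_terminal]
    if in_level > rule_level:
        return "shift"                 # input terminal binds tighter
    if in_level == rule_level and in_assoc == "right":
        return "shift"                 # same precedence, right-associative
    return "reduce"

print(resolve("*", "+"))   # E -> E * E . with + ahead: reduce
print(resolve("+", "+"))   # E -> E + E . with + ahead: left-assoc, reduce
print(resolve("+", "*"))   # E -> E + E . with * ahead: * binds tighter, shift
```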
Using Precedence to Solve S/R Conflicts

• Back to our dangling else example:
      [S → if E then S I , else]
      [S → if E then S I else S , x]
• We can eliminate the conflict by declaring else to have higher precedence than then
• But this starts to look like "hacking the tables"
• Best to avoid overuse of precedence declarations, or we will end up with unexpected parse trees

Precedence Declarations Revisited

• The term "precedence declaration" is misleading!
• These declarations do not define precedence: they define conflict resolutions
• I.e., they instruct shift-reduce parsers to resolve conflicts in certain ways
• The two are not quite the same thing!
Reduce/Reduce Conflicts

• If a DFA state contains both
      [X → α I , a]   and   [Y → β I , a]
  – Then on input "a" we don't know which production to reduce with
• This is called a reduce/reduce conflict

• Usually due to gross ambiguity in the grammar
• Example: a sequence of identifiers
      S → ε | id | id S
• There are two parse trees for the string id:
      S → id
      S → id S → id
• How does this confuse the parser?
More on Reduce/Reduce Conflicts

• Consider the states
      [S’ → I S , $]                  [S → id I , $]
      [S  → I , $]                    [S → id I S , $]
      [S  → I id , $]       ⇒id       [S → I , $]
      [S  → I id S , $]               [S → I id , $]
                                      [S → I id S , $]
• Reduce/reduce conflict on input $
      S’ → S → id
      S’ → S → id S → id
• Better to rewrite the grammar as:  S → ε | id S

Using Parser Generators

• Parser generators automatically construct the parsing DFA given a CFG
  – They use precedence declarations and default conventions to resolve conflicts
  – The parser algorithm is the same for all grammars (and is provided as a library function)
• But most parser generators do not construct the DFA as described before
  – Because the LR(1) parsing DFA has thousands of states even for a simple language
LR(1) Parsing Tables are Big

• But many states are similar, e.g.
      state 1:  E → int I , $/+     (reduce E → int on $, +)
      state 5:  E → int I , )/+     (reduce E → int on ), +)
• Idea: merge the DFA states whose items differ only in the lookahead tokens
  – We say that such states have the same core
• We obtain
      state 1’: E → int I , $/+/)   (reduce E → int on $, +, ))

The Core of a Set of LR Items

• Definition: the core of a set of LR items is the set of first components
  – Without the lookahead terminals
• Example: the core of
      {[X → α I β, b], [Y → γ I δ, d]}
  is
      {X → α I β, Y → γ I δ}
LALR States

• Consider for example the LR(1) states
      {[X → α I , a], [Y → β I , c]}
      {[X → α I , b], [Y → β I , d]}
• They have the same core and can be merged
• The merged state contains:
      {[X → α I , a/b], [Y → β I , c/d]}
• These are called LALR(1) states
  – Stands for LookAhead LR
  – Typically 10 times fewer LALR(1) states than LR(1)

A LALR(1) DFA

• Repeat until all states have distinct cores
  – Choose two distinct states with the same core
  – Merge the states by creating a new one with the union of all their items
  – Point edges from predecessors to the new state
  – The new state points to all the previous successors

[Diagram: states A → B → C and D → E → F, where B and E share a core, become A → BE → C and D → BE → F.]
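The merge step can be sketched as grouping LR(1) states by core. In this sketch a merged state simply holds the union of the items (so the a/b lookahead sets appear as separate items with the same core); the item tuples mirror the earlier sketches.

```python
def core(state):
    """The core of a set of LR(1) items: drop the lookahead components."""
    return frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _la) in state)

def merge_by_core(states):
    """Union the items of all LR(1) states that share a core."""
    merged = {}
    for st in states:
        merged.setdefault(core(st), set()).update(st)
    return list(merged.values())

# The two LR(1) states from the slide, {[X -> a., a], [Y -> b., c]} and
# {[X -> a., b], [Y -> b., d]} (alpha/beta written as the symbols 'a'/'b'):
s1 = {("X", ("a",), 1, "a"), ("Y", ("b",), 1, "c")}
s2 = {("X", ("a",), 1, "b"), ("Y", ("b",), 1, "d")}

(lalr_state,) = merge_by_core([s1, s2])
# One LALR(1) state: X -> a. with lookaheads {a, b}, Y -> b. with {c, d}
print(sorted(lalr_state))
```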
Conversion LR(1) to LALR(1): Example

[Diagram: the LR(1) DFA for E → int | E + (E) from before (states 0–11) collapses to a LALR(1) DFA in which states 1 and 5 merge into "1,5" (E → int on $, +, )), 3 and 8 into "3,8", 4 and 9 into "4,9", 6 and 10 into "6,10", and 7 and 11 into "7,11" (E → E + (E) on $, +, )).]

The LALR Parser Can Have Conflicts

• Consider for example the LR(1) states
      {[X → α I , a], [Y → β I , b]}
      {[X → α I , b], [Y → β I , a]}
• The merged LALR(1) state
      {[X → α I , a/b], [Y → β I , a/b]}
  has a new reduce/reduce conflict
• In practice such cases are rare
LALR vs. LR Parsing: Things to keep in mind

• LALR languages are not natural
  – They are an efficiency hack on LR languages
• Any reasonable programming language has a LALR(1) grammar
• LALR(1) parsing has become a standard for programming languages and for parser generators

A Hierarchy of Grammar Classes

[Figure: hierarchy of grammar classes, from Andrew Appel, "Modern Compiler Implementation in ML".]
Semantic Actions in LR Parsing

• We can now illustrate how semantic actions are implemented for LR parsing
• Keep attributes on the stack
• On shifting a, push the attribute for a on the stack
• On reducing X → α
  – pop the attributes for α
  – compute the attribute for X
  – and push it on the stack

Performing Semantic Actions: Example

Recall the example:
      E → T + E1    { E.val = T.val + E1.val }
        | T         { E.val = T.val }
      T → int * T1  { T.val = int.val * T1.val }
        | int       { T.val = int.val }

Consider the parsing of the string: 4 * 9 + 6
Performing Semantic Actions: Example

                  | int * int + int      shift
      int4        | * int + int          shift
      int4 *      | int + int            shift
      int4 * int9 | + int                reduce T → int
      int4 * T9   | + int                reduce T → int * T
      T36         | + int                shift
      T36 +       | int                  shift
      T36 + int6  |                      reduce T → int
      T36 + T6    |                      reduce E → T
      T36 + E6    |                      reduce E → T + E
      E42         |                      accept

Notes

• The previous example shows how synthesized attributes are computed by LR parsers
• It is also possible to compute inherited attributes in an LR parser
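The trace above can be sketched directly: attributes ride on the parser stack, shifts push a token's value, and each reduce pops the RHS attributes and pushes the computed LHS attribute. (This replays the trace by hand rather than rebuilding the full table-driven parser.)

```python
stack = []  # pairs (symbol, attribute), as in the trace above

def shift(tok, val=None):
    stack.append((tok, val))

def reduce(lhs, n, compute):
    """Pop n attribute pairs, compute the LHS attribute, push it."""
    vals = [v for (_sym, v) in stack[len(stack) - n:]]
    del stack[len(stack) - n:]
    stack.append((lhs, compute(vals)))

# Parse 4 * 9 + 6, following the trace:
shift("int", 4); shift("*"); shift("int", 9)
reduce("T", 1, lambda v: v[0])                # T -> int       ... T9
reduce("T", 3, lambda v: v[0] * v[2])         # T -> int * T   ... T36
shift("+"); shift("int", 6)
reduce("T", 1, lambda v: v[0])                # T -> int       ... T6
reduce("E", 1, lambda v: v[0])                # E -> T         ... E6
reduce("E", 3, lambda v: v[0] + v[2])         # E -> T + E     ... E42
print(stack)   # [('E', 42)]
```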
Notes on Parsing

• Parsing
  – A solid foundation: context-free grammars
  – A simple parser: LL(1)
  – A more powerful parser: LR(1)
  – An efficiency hack: LALR(1)
  – LALR(1) parser generators
• Next time we move on to semantic analysis

Supplement to LR Parsing

Strange Reduce/Reduce Conflicts due to LALR Conversion (and how to handle them)
Strange Reduce/Reduce Conflicts

• Consider the grammar
      S  → P R ,
      NL → N | N , NL
      P  → T | NL : T
      R  → T | N : T
      N  → id
      T  → id
• P – parameters specification
• R – result specification
• N – a parameter or result name
• T – a type name
• NL – a list of names

• In P an id is a
  – N when followed by , or :
  – T when followed by id
• In R an id is a
  – N when followed by :
  – T when followed by ,
• This is an LR(1) grammar
• But it is not LALR(1). Why?
  – For obscure reasons
A Few LR(1) States

      State 1:                lookahead
          P  → I T            id
          P  → I NL : T       id
          NL → I N            :
          NL → I N , NL       :
          N  → I id           :
          N  → I id           ,
          T  → I id           id

      State 2:                lookahead
          R → I T             ,
          R → I N : T         ,
          T → I id            ,
          N → I id            :

      On id, state 1 goes to state 3:
          T → id I            id
          N → id I            :
          N → id I            ,

      On id, state 2 goes to state 4:
          T → id I            ,
          N → id I            :

      The LALR merge of states 3 and 4 yields
          T → id I            id/,
          N → id I            :/,
      a LALR reduce/reduce conflict on ","

What Happened?

• Two distinct states were confused because they have the same core
• Fix: add dummy productions to distinguish the two confused states
• E.g., add R → id bogus
  – bogus is a terminal not used by the lexer
  – This production will never be used during parsing
  – But it distinguishes R from P
A Few LR(1) States After Fix

      State 1:                lookahead
          P  → I T            id
          P  → I NL : T       id
          NL → I N            :
          NL → I N , NL       :
          N  → I id           :
          N  → I id           ,
          T  → I id           id

      State 2:                lookahead
          R → I T             ,
          R → I N : T         ,
          R → I id bogus      ,
          T → I id            ,
          N → I id            :

      On id, state 1 goes to state 3:
          T → id I            id
          N → id I            :
          N → id I            ,

      On id, state 2 goes to state 4:
          T → id I            ,
          N → id I            :
          R → id I bogus      ,

      Different cores ⇒ no LALR merging