
LL and LR Lecture 6

February 5, 2018

Context-free Grammars

A context-free grammar consists of

- A set of non-terminals N
  - Written in uppercase throughout these notes
- A set of terminals T comprised of tokens
  - Lowercase or punctuation throughout these notes
- A start symbol S (a non-terminal)
- A set of productions (rewrite rules)

Assuming E ∈ N, each production has the form

  E → ε, or
  E → Y1 Y2 ... Yn  where Yi ∈ N ∪ T

Compiler Construction

Context-free ?

Production rules hint at expressiveness!

  Regular            A → aB, C → ε
  Context-free       A → α
  Context-sensitive  αAβ → αγβ
  Type-0             α → β

  where α, β, γ ∈ (N ∪ T)*

“What just happened? We must be missing some context...”

Parsing and Context-free Grammars

- Lexing
  - Regular Expressions specify a Regular Language containing strings of characters (lexemes) that correspond to a token

- Parsing
  - Context-free Grammars specify a Context-free Language containing strings of tokens that correspond to a grammatical rule (production)

Generativeness

- Regular expressions and context-free grammars are generative
  - You can generate every string in the language using the regex or grammar!

Generating Strings

- Consider regex: ab*a
  - You can generate aa, aba, abba, abbba, ...

- Consider context-free grammar: E → (E)E | ε
  - You can generate ε, (), (()), (())(), ...

- Generating strings with a grammar can be thought of as creating a parse tree!
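To make "generative" concrete, here is a minimal Python sketch that enumerates strings from the regex ab*a and from the grammar E → (E)E | ε. The function names `regex_strings` and `grammar_strings` and the `depth` bound are assumptions made for illustration; they are not part of the lecture.

```python
def regex_strings(max_b):
    """Strings matched by the regex ab*a with at most max_b copies of b."""
    return ["a" + "b" * n + "a" for n in range(max_b + 1)]


def grammar_strings(depth):
    """Strings derivable from E -> (E)E | ε when E is expanded
    to a recursion depth of at most `depth`."""
    if depth == 0:
        return {""}                      # only E -> ε remains
    smaller = grammar_strings(depth - 1)
    result = {""}                        # E -> ε
    for inner in smaller:                # E -> ( E ) E
        for rest in smaller:
            result.add("(" + inner + ")" + rest)
    return result


print(regex_strings(3))
# ['aa', 'aba', 'abba', 'abbba']
print(sorted(grammar_strings(2), key=lambda s: (len(s), s)))
# ['', '()', '(())', '()()', '(())()']
```

The empty Python string stands for ε. Raising `depth` produces longer balanced-parenthesis strings, but the language is infinite, so any enumeration has to be bounded in some way.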

Language membership

- We care about whether an input string of tokens is syntactically correct (i.e., obeys our language’s grammar)

- So far, we have looked at theoretical implications of grammars

  L(G) = { a1...an | S →* a1...an }

For an input string x, is x ∈ L(G)?

Parsing part 1: We need a yes/no answer!

Language membership

  S → a B | b C
  B → b b C
  C → c c

What strings are in this language? (Hint: there are only two!)

If my input string is “dabc”, we ask: can the grammar generate this string? (No)

- N.B. From a theoretical perspective it doesn’t matter how we find the answer; that’s the job of the parsing algorithm!
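Because this grammar has no recursion, its whole language can be enumerated, which gives a brute-force membership test. The sketch below is only an illustration under that assumption; the names `GRAMMAR`, `language`, and `is_member` are hypothetical, and this approach would not terminate for a recursive grammar.

```python
# Each non-terminal maps to its list of productions (right-hand sides).
GRAMMAR = {
    "S": [["a", "B"], ["b", "C"]],
    "B": [["b", "b", "C"]],
    "C": [["c", "c"]],
}


def language(symbols):
    """Return every terminal string derivable from a list of symbols.
    Terminates only because this grammar is not recursive."""
    if not symbols:
        return {""}
    head, tail = symbols[0], symbols[1:]
    if head in GRAMMAR:                  # non-terminal: try every production
        heads = set()
        for rhs in GRAMMAR[head]:
            heads |= language(rhs)
    else:                                # terminal: stands for itself
        heads = {head}
    return {h + t for h in heads for t in language(tail)}


L = language(["S"])


def is_member(x):
    """Yes/no answer to: is x in L(G)?"""
    return x in L


print(sorted(L))           # the two strings in the language
print(is_member("dabc"))   # False
```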

Parsing

- LL (top down)
  - Reads input from left to right and uses left-most derivations to construct a parse tree

- LR (bottom up)
  - Reads input from left to right and uses right-most derivations (traced in reverse) to construct a parse tree

- Both algorithms are driven by the input grammar and the input to be parsed.

Parsing Algorithm Intuition

- You start with a sequence of tokens, t1 t2 t3 t4 t5
  - and also a grammar!

- Two general approaches to constructing the parse tree

  - top-down parsing is when you predict the grammatical rule used to produce the tokens seen so far

  - bottom-up parsing is when you consider tokens one at a time until you match a grammatical rule

Top Down Parsing

  S → a B c
  B → C x B | ε
  C → d

Input string: “adxdxc”

The parse tree grows from the root S down to the tokens, one expansion at a time:

  S
  ├── a
  ├── B
  │   ├── C
  │   │   └── d
  │   ├── x
  │   └── B
  │       ├── C
  │       │   └── d
  │       ├── x
  │       └── B
  │           └── ε
  └── c

Bottom-up Parsing

  S → a B c
  B → C x B | ε
  C → d

Input string: “adxdxc”

  Tokens right now    What happened
  a                   read a
  ad                  read d
  aC                  d matches C → d
  aCx                 read x
  aCxd                read d
  aCxC                d matches C → d
  aCxCx               read x
  aCxCxε              the next token is c, so consider B → ε
  aCxCxB              ε matches B → ε
  aCxB                C x B matches B → C x B
  aB                  C x B matches B → C x B
  aBc                 read c
  S                   a B c matches S → a B c
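The bottom-up walk above alternates between reading a token (commonly called a shift) and replacing a matched right-hand side by its non-terminal (a reduce). The Python sketch below only replays a hand-written list of such steps for “adxdxc” and checks that each reduction really matches the top of the stack. Deciding which step to take is the job of an LR parser and is not attempted here; the names `replay` and `ACTIONS` are assumptions for illustration.

```python
def replay(tokens, actions):
    """Apply shift/reduce steps and return the stack contents after each."""
    stack, pos, trace = [], 0, []
    for action in actions:
        if action[0] == "shift":
            stack.append(tokens[pos])    # read the next input token
            pos += 1
        else:
            _, lhs, rhs = action
            start = len(stack) - len(rhs)
            # The right-hand side must sit on top of the stack.
            if stack[start:] != rhs:
                raise ValueError(f"cannot reduce {rhs} on stack {stack}")
            del stack[start:]
            stack.append(lhs)            # replace it by the non-terminal
        trace.append("".join(stack))
    return trace


ACTIONS = [
    ("shift",), ("shift",), ("reduce", "C", ["d"]),
    ("shift",), ("shift",), ("reduce", "C", ["d"]),
    ("shift",),
    ("reduce", "B", []),                 # B -> ε: nothing is popped
    ("reduce", "B", ["C", "x", "B"]),
    ("reduce", "B", ["C", "x", "B"]),
    ("shift",),
    ("reduce", "S", ["a", "B", "c"]),
]

for state in replay(list("adxdxc"), ACTIONS):
    print(state)
```

The printed states match the “Tokens right now” sequence, except that the intermediate aCxCxε state is folded into the B → ε reduction.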

LL(k) parsing

An LL parser reads tokens from left to right and constructs a top-down leftmost derivation. LL(k) parsing predicts which production rule to use from k tokens of lookahead. LL(1) parsing is a special case using one token of lookahead. LL(1) parsing is fast and easy, but does not work if the grammar is ambiguous, left-recursive, or non-left-factored.

General LL(1) Algorithm

- Process 1 token at a time
- Consider a ‘current’ non-terminal symbol, start with S
- While input is not empty
  - Given next 1 token (t) and ‘current’ non-terminal N, choose a rule R s.t. (N → α)
  - For each element X in rule R from left to right
    - If X is a non-terminal, ‘expand’ X by recursing! Set ‘current’ to X and consider same token t.
    - If X is a terminal, check whether t matches. If it matches, consume t from input, loop
- Note the need for particular types of grammars! What if we have a rule S → Sα?

Recursive Descent Parsing

- Recursive Descent Parsing can parse LL(k) grammars with backtracking

- We can use RDP to parse LL(1) grammars by recursing through the rules of the grammar based upon the next available token

- Intuition: Construct mutually-recursive functions that consume tokens according to the grammar rules!

- TL;DR “Try all productions exhaustively, backtrack”

Recursive Descent Parsing

  E → T + E | T
  T → (E) | int | int * T

Input: int * int

1. Try E0 → T1 + E2
2. Try T1 → (E3)
   - Nope! token ‘int’ does not match ‘(’ in T1 → (E3)
3. Try T1 → int. Match!
   - But the next token ‘*’ does not match ‘+’ from E0
4. Try T1 → int * T2
   - Matches ‘int * int’, but ‘+’ from E0 remains unmatched
5. Exhausted choices for T1, so we backtrack to E0

Recursive Descent Parsing (2)

  E → T + E | T
  T → (E) | int | int * T

Input: int * int

6. Try E0 → T1
7. Exhaustively try T1 → α productions
   - Succeed with T1 → int * T2 and T2 → int

  E → T → int * T → int * int
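The “try all productions exhaustively, backtrack” strategy walked through above can be sketched in Python with generators: each call yields every input position at which a symbol could finish, so a caller that fails later simply asks for the next alternative. This is one possible way to organize backtracking, not necessarily how the course assignments do it; `GRAMMAR`, `parse`, `parse_sequence`, and `accepts` are names assumed for illustration.

```python
GRAMMAR = {
    "E": [["T", "+", "E"], ["T"]],
    "T": [["(", "E", ")"], ["int"], ["int", "*", "T"]],
}


def parse(symbol, tokens, pos):
    """Yield every position where `symbol` can end when it starts at `pos`."""
    if symbol not in GRAMMAR:                        # terminal
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:                      # try productions in order
        yield from parse_sequence(rhs, tokens, pos)


def parse_sequence(symbols, tokens, pos):
    """Yield every end position for a sequence of grammar symbols."""
    if not symbols:
        yield pos
        return
    for mid in parse(symbols[0], tokens, pos):       # each way to match the head
        yield from parse_sequence(symbols[1:], tokens, mid)


def accepts(tokens):
    """True if some derivation from E consumes the whole input."""
    return any(end == len(tokens) for end in parse("E", tokens, 0))


print(accepts(["int", "*", "int"]))   # True
print(accepts(["int", "+"]))          # False
```

This terminates only because the grammar is not left-recursive; a production whose right-hand side begins with its own left-hand side would make `parse` call itself forever at the same position.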

Recursive Descent Parsing

  S → a B | b C
  B → b b C
  C → c c

  void S() {
    if (next_char() == 'a')      { consume('a'); B(); }
    else if (next_char() == 'b') { consume('b'); C(); }
    else                         { error(); }
  }

  void B() {
    if (next_char() == 'b') { consume('b'); consume('b'); C(); }
    else                    { error(); }
  }

  void C() {
    if (next_char() == 'c') { consume('c'); consume('c'); }
    else                    { error(); }
  }

Recursive Descent Parsing
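The S(), B(), and C() functions shown for the grammar S → a B | b C, B → b b C, C → c c leave next_char(), consume(), and error() undefined. A minimal runnable Python version might look as follows; the `Parser` class, the `ParseError` exception, and the end-of-input check in `accepts` are assumptions added so the sketch is self-contained.

```python
class ParseError(Exception):
    """Raised where the slide code calls error()."""


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_char(self):
        """Peek at the next character, or None at end of input."""
        return self.text[self.pos] if self.pos < len(self.text) else None

    def consume(self, ch):
        if self.next_char() != ch:
            raise ParseError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def S(self):                       # S -> a B | b C
        if self.next_char() == "a":
            self.consume("a")
            self.B()
        elif self.next_char() == "b":
            self.consume("b")
            self.C()
        else:
            raise ParseError("S must start with 'a' or 'b'")

    def B(self):                       # B -> b b C
        self.consume("b")
        self.consume("b")
        self.C()

    def C(self):                       # C -> c c
        self.consume("c")
        self.consume("c")


def accepts(text):
    parser = Parser(text)
    try:
        parser.S()
    except ParseError:
        return False
    return parser.pos == len(text)     # reject trailing input


print(accepts("abbcc"), accepts("bcc"), accepts("dabc"))
```

No backtracking is needed here because the next character alone decides which production of S to use.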

  T → line_number\n B | ε
  B → if\n T
    | else\n T
    | class\n C
    | string\n C
  C → text\n T

That’s right, subsequent assignments PA3 through PA6 provide inputs that can be parsed through recursive descent!

Recursive Descent Parsing

Observations

- At any given moment, the fringe of the parse tree is: t1 t2 ... tk A ...
- Try all productions for A: if A → BC is a production, the new fringe is t1 t2 ... tk B C ...
- Backtrack when the fringe does not match the input string

What Could Go Wrong?

Recursive Descent Failure

  S → S a

  void S() {
    S();  // recurses before consuming any input, so it never returns
    if (next_char() == 'a') { consume('a'); }
  }

Eliminating Left Recursion

- Left-recursive grammars have some non-terminal S with a derivation

  S →+ S α

Recursive Descent (and LL(k)) parsers cannot parse left-recursive grammars!

Eliminating Left Recursion

Consider the left-recursive grammar:

  S → S α | β

S generates all strings starting with β followed by any number of α (zero or more).

Rewrite using right-recursion:

  S → β T
  T → α T | ε

Concrete Left Recursion Elimination

  S → 1 | S 0

Can be rewritten as

  S → 1 T
  T → 0 T | ε
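After the rewrite, the grammar S → 1 T, T → 0 T | ε can be parsed by plain recursive descent, because every recursive call happens only after a character has been consumed. A small Python sketch (the function names `parse_S`, `parse_T`, and `accepts` are assumptions for illustration):

```python
def parse_S(s, pos=0):
    """S -> 1 T. Return the position after S, or None on failure."""
    if pos < len(s) and s[pos] == "1":
        return parse_T(s, pos + 1)
    return None


def parse_T(s, pos):
    """T -> 0 T | ε. Always succeeds; returns the position after T."""
    if pos < len(s) and s[pos] == "0":
        return parse_T(s, pos + 1)     # T -> 0 T, after consuming '0'
    return pos                         # T -> ε


def accepts(s):
    return parse_S(s) == len(s)


for text in ["1", "1000", "01", "101", ""]:
    print(repr(text), accepts(text))
```

The accepted strings are a 1 followed by any number of 0s, which is the same language the original left-recursive grammar S → 1 | S 0 describes.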

More Left Recursion Elimination

In general

  S → S α1 | ... | S αn | β1 | ... | βm

All strings derived from S start with one of β1, ..., βm and continue with several instances of α1, ..., αn.

Rewrite as

  S → β1 T | ... | βm T
  T → α1 T | ... | αn T | ε
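The general rewrite can be sketched as a small Python function. It handles only immediate left recursion on a single non-terminal, exactly the S → Sα | β pattern above; indirect left recursion needs more work and is not covered. The function name `eliminate_left_recursion` and the representation of productions as lists of symbols (with the empty list standing for ε) are assumptions for illustration.

```python
def eliminate_left_recursion(nt, productions, new_nt):
    """Rewrite  nt -> nt α1 | ... | nt αn | β1 | ... | βm  as
         nt     -> β1 new_nt | ... | βm new_nt
         new_nt -> α1 new_nt | ... | αn new_nt | ε
    Productions are lists of symbols; [] stands for ε."""
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not alphas:                      # nothing to do: no left recursion
        return {nt: productions}
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],
    }


# S -> 1 | S 0   becomes   S -> 1 T,  T -> 0 T | ε
print(eliminate_left_recursion("S", [["1"], ["S", "0"]], "T"))
```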

Recursive Descent Summary

- Simple and general parsing strategy
  - Left-recursion must be eliminated first!
  - There’s an algorithm for that
  - Requires significant backtracking

- Backtracking is avoidable for some grammars!

LL(1) Predictive Parsing

- LL(1) parsing assumes that for each non-terminal and token there is only one production that could lead to success
  - This sounds deterministic! We can use a table-based approach like with lexing

- One dimension for current non-terminal to expand

- One dimension for next token seen on the input
  - Each table entry contains one production

Predictive Parsing and Left Factoring

  S → a B | b C              E → T + E | T
  B → b b C          vs.     T → int | int * T | ( E )
  C → c c

- Left grammar: Easy! One token → One rule
- Right grammar: Hard! Two T productions start with ‘int’

- We must left-factor before using LL(1) predictive parsing

Left Factoring

  E → T + E | T
  T → int | int * T | ( E )

Factor out the common prefixes of production rules

  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε

Parse Tables!

- Parse tables are a fast implementation of LL(1) parsers
  - N.B. LL(1) grammars represent a subset of context-free grammars

- Restrict ambiguities in resolving rules to make a table possible!

- Table T is 2-dimensional: T[A][t] = A → Y1 Y2 ... Ym means “when you are expanding non-terminal A and see token t, start considering A → Y1 Y2 ... Ym”

Parse Tables!

  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε

LL(1) Parsing Table ($ means end of input)

        int      *      +      (      )      $
  T     int Y                  (E)
  E     TX                     TX
  X                     + E           ε      ε
  Y              *T     ε             ε      ε

Parse Tables!

- T[E][int] = TX
  - Interpretation: “If I’m considering nonterminal E and I see ‘int’, follow production E → TX”

Parse Tables!

- T[Y][+] = ε
  - Interpretation: “If I’m considering nonterminal Y and I see a ‘+’, get rid of the Y”

Parse Tables!

- Blank entries indicate errors! Consider T[E][*]
  - Interpretation: “There is no way to derive a string starting with * from non-terminal E.”

Using Parse Tables

- Much like recursive descent

- For each non-terminal S
  - Look at next token a
  - Choose production shown in T[S][a]
- We use a stack to track pending non-terminals
- Reject when we encounter an error state (a blank)
- Accept when we pop $ off the stack at end-of-input

LL(1) Predictive Parsing with Table

  push($);   // we succeed if we get to the end
  push(S);   // start symbol
  do {
    X = pop();
    if (X == $) {
      // accept only if the input is exhausted as well
      if (next_token() == $) { accept(); } else { error(); }
    } else if (is_terminal(X)) {
      if (X == next_token()) { consume(next_token()); }
      else { error(); }
    } else {  // X is a non-terminal
      if (T[X][next_token()] == "X → Y1 Y2 ... Ym") {
        push(Ym); ... push(Y2); push(Y1);  // reverse order, so Y1 ends up on top
      } else { error(); }                  // blank table entry
    }
  } while (X != $);

  Stack          Input          Action
  E $            int * int $    TX
  T X $          int * int $    int Y
  int Y X $      int * int $    consume
  Y X $          * int $        *T
  * T X $        * int $        consume
  T X $          int $          int Y
  int Y X $      int $          consume
  Y X $          $              ε
  X $            $              ε
  $              $              ACCEPT
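The stack-driven loop and the trace for int * int can be reproduced with a short Python sketch. The dictionary `TABLE` encodes the LL(1) parsing table for the grammar E → TX, X → +E | ε, T → (E) | int Y, Y → *T | ε, with an empty list standing for ε and a missing key standing for a blank (error) entry. The names `TABLE`, `NONTERMINALS`, and `ll1_parse` are assumptions for illustration.

```python
TABLE = {
    ("E", "int"): ["T", "X"],   ("E", "("): ["T", "X"],
    ("X", "+"): ["+", "E"],     ("X", ")"): [],  ("X", "$"): [],
    ("T", "int"): ["int", "Y"], ("T", "("): ["(", "E", ")"],
    ("Y", "*"): ["*", "T"],     ("Y", "+"): [],  ("Y", ")"): [], ("Y", "$"): [],
}
NONTERMINALS = {"E", "X", "T", "Y"}


def ll1_parse(tokens, start="E"):
    """Return True if the token list is accepted by the LL(1) table."""
    tokens = list(tokens) + ["$"]        # $ marks end of input
    stack = ["$", start]                 # top of the stack is the list's end
    pos = 0
    while stack:
        top = stack.pop()
        tok = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, tok))
            if rhs is None:              # blank entry: syntax error
                return False
            stack.extend(reversed(rhs))  # push Ym ... Y1 so Y1 is on top
        elif top == tok:
            if top == "$":               # stack and input both exhausted
                return True
            pos += 1                     # consume the matched terminal
        else:
            return False                 # terminal mismatch
    return False


print(ll1_parse(["int", "*", "int"]))    # True
print(ll1_parse(["*", "int"]))           # False: T[E][*] is blank
```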

LL(1) Languages

- LL(1) languages can be LL(1) parsed
  - Formally, a language Q is LL(1) if there exists an LL(1) table such that the LL(1) parsing algorithm using that table accepts exactly the strings in Q.
- No table entry can be multiply defined
  - This restricts the grammar!
- Once we construct the table
  1. The parsing algorithm is simple and fast
  2. No backtracking is necessary
- Wouldn’t it be nice to generate a parsing table from a CFG?

FIRST and FOLLOW sets

- FIRST(α) is the set of all terminal symbols that can begin some derivation starting with α (plus ε if α can derive ε)

  α → ... → aβ

  FIRST(α) = { a ∈ T | α →* aβ } ∪ { ε | α →* ε }

Example:

  S → a | b S c

  FIRST(S) = {a, b}

Example FIRST sets

  S → a S e | S T T
  T → R S e | Q
  R → x S x | ε
  Q → S T | ε

  FIRST(S) = ?
  FIRST(T) = ?
  FIRST(R) = ?
  FIRST(Q) = ?
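One way to check answers to the exercise is to compute FIRST sets by fixed-point iteration: keep applying the definition until no set grows. The Python sketch below does this for the example grammar as read from the slide (S → aSe | STT, T → RSe | Q, R → xSx | ε, Q → ST | ε); the names `first_of_sequence` and `compute_first` are assumptions for illustration, and an empty list stands for an ε production.

```python
EPS = "ε"
GRAMMAR = {
    "S": [["a", "S", "e"], ["S", "T", "T"]],
    "T": [["R", "S", "e"], ["Q"]],
    "R": [["x", "S", "x"], []],
    "Q": [["S", "T"], []],
}


def first_of_sequence(symbols, first):
    """FIRST of a sequence of symbols, given the FIRST sets known so far."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}   # terminal: itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:       # sym cannot vanish, so stop here
            return result
    result.add(EPS)                    # every symbol can derive ε
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                     # repeat until nothing grows
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


for nt, symbols in compute_first(GRAMMAR).items():
    print(f"FIRST({nt}) = {sorted(symbols)}")
```

The left-recursive production S → S T T is harmless here, since the iteration only ever adds symbols and stops once the sets are stable.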

FOLLOW sets

- FOLLOW(A) is the set of terminals (including $) that can immediately follow a non-terminal A

  FOLLOW(A) = { a ∈ T | S →+ ...Aa... } ∪ { $ | S →+ ...A }

- Compute FIRST sets for all non-terminals
- Add $ to FOLLOW(S) (the start symbol always ends with end-of-input)
- For all productions Y → ... X A1 ... An
  - Add FIRST(Ai) − {ε} to FOLLOW(X) for i = 1, 2, ...; stop at the first Ai with ε ∉ FIRST(Ai)
  - If every Ai can derive ε (or nothing follows X), add FOLLOW(Y) to FOLLOW(X)

Example FOLLOW Set

  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε

  FOLLOW(“+”) = { int, ( }
  FOLLOW(“(”) = { int, ( }
  FOLLOW(X) = { $, ) }
  FOLLOW(Y) = { +, ), $ }
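The FOLLOW rules can also be run as a fixed-point computation. The Python sketch below first computes FIRST for the grammar E → TX, X → +E | ε, T → (E) | int Y, Y → *T | ε and then applies the two FOLLOW rules until nothing changes. It computes FOLLOW for non-terminals only; the names `first_of`, `compute_first`, and `compute_follow` are assumptions for illustration.

```python
EPS, END = "ε", "$"
GRAMMAR = {
    "E": [["T", "X"]],
    "X": [["+", "E"], []],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "Y": [["*", "T"], []],
}


def first_of(symbols, first):
    """FIRST of a symbol sequence; contains ε if the whole sequence can vanish."""
    out = set()
    for s in symbols:
        f = first.get(s, {s})            # a terminal's FIRST set is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for rhs in prods:
                new = first_of(rhs, first) - first[nt]
                if new:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start, first):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)               # $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, prods in grammar.items():
            for rhs in prods:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue         # only non-terminals get FOLLOW sets here
                    rest = first_of(rhs[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:      # everything after sym can vanish
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, "E", FIRST)
for nt in GRAMMAR:
    print(f"FOLLOW({nt}) = {sorted(FOLLOW[nt])}")
```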

Back to Parsing Tables

- Recall: We want to build an LL(1) Parsing Table

For each production A → α in G do:
  - For each terminal b ∈ FIRST(α) do
    - T[A][b] = α
  - If α →* ε, for each b ∈ FOLLOW(A) do
    - T[A][b] = α

Parsing Table

  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε

Where do we put Y → * T ?

- Well, FIRST(*T) = {*}, thus column * of row Y gets *T

Parsing Table

Where do we put Y → ε ?

- Well, FOLLOW(Y) = {$, +, )}, thus columns $, +, and ) in row Y get Y → ε
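The table-filling rule (FIRST(α) columns, plus FOLLOW(A) columns when α can derive ε) can be sketched in Python. To keep the example short, the FIRST and FOLLOW sets are written out by hand: FOLLOW(X) and FOLLOW(Y) are the values given in these notes, while FOLLOW(E) and FOLLOW(T) are values worked out for this sketch and should be treated as assumptions. The names `PRODUCTIONS`, `first_of`, and `build_table` are likewise hypothetical.

```python
EPS = "ε"
PRODUCTIONS = [
    ("E", ["T", "X"]),
    ("X", ["+", "E"]), ("X", []),
    ("T", ["(", "E", ")"]), ("T", ["int", "Y"]),
    ("Y", ["*", "T"]), ("Y", []),
]
FIRST = {"E": {"int", "("}, "X": {"+", EPS}, "T": {"int", "("}, "Y": {"*", EPS}}
FOLLOW = {"E": {"$", ")"}, "X": {"$", ")"},
          "T": {"+", "$", ")"}, "Y": {"+", "$", ")"}}


def first_of(symbols):
    """FIRST of a right-hand side, using the hand-written FIRST sets."""
    out = set()
    for s in symbols:
        f = FIRST.get(s, {s})            # a terminal's FIRST set is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}


def build_table(productions, follow):
    """Fill T[A][b]; raise if some entry would be defined twice."""
    table = {}
    for lhs, rhs in productions:
        f = first_of(rhs)
        lookaheads = f - {EPS}
        if EPS in f:                     # rhs can derive ε: use FOLLOW(lhs) too
            lookaheads |= follow[lhs]
        for tok in lookaheads:
            if (lhs, tok) in table:
                raise ValueError(f"T[{lhs}][{tok}] is multiply defined")
            table[(lhs, tok)] = rhs
    return table


TABLE = build_table(PRODUCTIONS, FOLLOW)
print(TABLE[("Y", "*")])                 # ['*', 'T']
print(TABLE[("Y", "+")])                 # []  (the ε production)
```

A `ValueError` from `build_table` corresponds to a multiply defined entry, which means the grammar is not LL(1).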

Notes on LL(1) Parsing Tables

- If any entry is multiply defined then G is not LL(1)

- This happens, for example, when
  - G is ambiguous
  - G is left-recursive
  - G is not left-factored

Ambiguity in parse tables

  E → E + T
  E → T
  T → T * F
  T → F
  F → id
  F → (E)

For the E productions, we need FIRST(T) = {(, id} and FIRST(E) = {(, id}

But now, which rule (E → E + T or E → T) gets put in T[E][(] and T[E][id]??

        +     *     (     )     id    $
  E                 ?           ?
  T
  F
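The clash can be made visible by collecting every production that wants a given table cell instead of storing just one. The Python sketch below does this for the grammar above. FIRST(E) and FIRST(T) are the sets stated in the notes; FIRST(F) is filled in by hand as an assumption. Since no symbol in this grammar derives ε, FIRST of a right-hand side is simply FIRST of its first symbol. The names `candidate_entries` and `CONFLICTS` are hypothetical.

```python
from collections import defaultdict

FIRST = {"E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"}}
PRODUCTIONS = [
    ("E", ["E", "+", "T"]), ("E", ["T"]),
    ("T", ["T", "*", "F"]), ("T", ["F"]),
    ("F", ["id"]), ("F", ["(", "E", ")"]),
]


def candidate_entries(productions, first):
    """Map each (non-terminal, token) cell to all productions competing for it."""
    table = defaultdict(list)
    for lhs, rhs in productions:
        head = rhs[0]                            # no ε productions in this grammar
        for tok in first.get(head, {head}):      # a terminal's FIRST set is itself
            table[(lhs, tok)].append(rhs)
    return table


CONFLICTS = {cell: rules
             for cell, rules in candidate_entries(PRODUCTIONS, FIRST).items()
             if len(rules) > 1}

for (nt, tok), rules in sorted(CONFLICTS.items()):
    print(f"T[{nt}][{tok}] has {len(rules)} candidates: {rules}")
```

Both T[E][(] and T[E][id] end up with two candidate productions, so no LL(1) table exists for this grammar as written; the same happens in row T.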

Simple Parsing Strategies

- Recursive Descent Parsing
  - Backtracking is annoying, BUT super useful for PA3-6

- Predictive Parsing a.k.a. LL(k)
  - Predict production from k tokens of lookahead
  - Build LL(1) table
  - Parsing is now fast and easy!

- Next up, LR Parsing, a more powerful strategy for parsing non-LL(1) grammars
