Top-Down Parsing

Adapted from Lecture by Profs. Alex Aiken & George Necula (UCB)

Lecture Outline

• Implementation of parsers
• Two approaches
  – Top-down
  – Bottom-up
• Top-Down
  – Easier to understand and program manually
• Bottom-Up
  – More powerful and used by most parser generators

CS780(Prasad) L101TDP 1 CS780(Prasad) L101TDP 2

Intro to Top-Down Parsing

• The parse tree is constructed
  – From the top
  – From left to right
• Terminals are seen in order of appearance in the token stream:
    t2 t5 t6 t8 t9

[Figure: parse tree with root 1, internal nodes 3, 4, 7, and leaves t2, t5, t6, t8, t9]

Recursive Descent Parsing

• Consider the grammar
    E → T + E | T
    T → ( E ) | int | int * T
• Token stream is: int5 * int2
• Start with top-level non-terminal E
• Try the rules for E in order


Recursive Descent Parsing. Example (Cont.)

• Try E0 → T1 + E2
• Then try a rule for T1 → ( E3 )
  – But ( does not match input token int5
• Try T1 → int. Token matches.
  – But + after T1 does not match input token *
• Try T1 → int * T2
  – This will match but + after T1 will be unmatched
• Has exhausted the choices for T1
  – Backtrack to choice for E0

Recursive Descent Parsing. Example (Cont.)

• Try E0 → T1
• Follow same steps as before for T1
  – And succeed with T1 → int * T2 and T2 → int
  – With the following parse tree

        E0
        |
        T1
      / | \
  int5  *  T2
           |
          int2

A Recursive Descent Parser. Preliminaries

• Let TOKEN be the type of tokens
  – Special tokens INT, OPEN, CLOSE, PLUS, TIMES
• Let the global next point to the next token

A Recursive Descent Parser (2)

• Define boolean functions that check the token string for a match of
  – A given token terminal
      bool term(TOKEN tok) { return *next++ == tok; }
  – A given production of S (the nth)
      bool Sn() { … }
  – Any production of S:
      bool S() { … }
• These functions advance next

A Recursive Descent Parser (3)

• For production E → T + E
    bool E1() { return T() && term(PLUS) && E(); }
• For production E → T
    bool E2() { return T(); }
• For all productions of E (with backtracking)
    bool E() {
      TOKEN *save = next;
      return (next = save, E1())
          || (next = save, E2()); }

A Recursive Descent Parser (4)

• Functions for non-terminal T
    bool T1() { return term(OPEN) && E() && term(CLOSE); }
    bool T2() { return term(INT) && term(TIMES) && T(); }
    bool T3() { return term(INT); }

    bool T() {
      TOKEN *save = next;
      return (next = save, T1())
          || (next = save, T2())
          || (next = save, T3()); }
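The fragments from these slides can be assembled into one compilable file. What follows is a minimal sketch, not the lecture's own code: the END sentinel token, the end-of-input check inside term, and the parse driver are assumptions added for illustration. As in the slide code, T() tries int * T before int. That order matters here, because E() and T() never revisit a non-terminal once it has succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Token kinds from the slides; END is an added sentinel (assumption). */
typedef enum { INT, OPEN, CLOSE, PLUS, TIMES, END } TOKEN;

static const TOKEN *next;            /* points at the next unread token */

static bool E(void);
static bool T(void);

/* Match one terminal. Unlike the one-line slide version, never step past END. */
static bool term(TOKEN tok) {
    if (*next == END) return false;
    return *next++ == tok;
}

/* E -> T + E | T */
static bool E1(void) { return T() && term(PLUS) && E(); }
static bool E2(void) { return T(); }
static bool E(void) {
    const TOKEN *save = next;        /* restore point for backtracking */
    return (next = save, E1())
        || (next = save, E2());
}

/* T -> ( E ) | int * T | int */
static bool T1(void) { return term(OPEN) && E() && term(CLOSE); }
static bool T2(void) { return term(INT) && term(TIMES) && T(); }
static bool T3(void) { return term(INT); }
static bool T(void) {
    const TOKEN *save = next;
    return (next = save, T1())
        || (next = save, T2())
        || (next = save, T3());
}

/* Hypothetical driver: accept only if E matches the whole END-terminated array. */
bool parse(const TOKEN *tokens) {
    assert(tokens != NULL);
    next = tokens;
    return E() && *next == END;
}
```

Calling parse on the array {INT, TIMES, INT, END} follows the same sequence of attempts as the worked example: E1 fails on the missing +, then E2 succeeds.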


Recursive Descent Parsing. Notes.

• To start the parser
  – Initialize next to point to first token
  – Invoke E()
• Notice how this simulates our previous example.
• Easy to implement by hand
• But does not always work …

When Recursive Descent Does Not Work

• Consider a production S → S a
    bool S1() { return S() && term(a); }
    bool S() { return S1(); }
• S() will get into an infinite loop
• A left-recursive grammar has a non-terminal S
    S →+ Sα for some α
• Recursive descent does not work in such cases.


Elimination of Left-Recursion

• Consider the left-recursive grammar
    S → S α | β
• S generates all strings starting with a β and followed by a number of α
    β α*
• Can rewrite using right-recursion
    S → β S’
    S’ → α S’ | ε

More Elimination of Left-Recursion

• In general
    S → S α1 | … | S αn | β1 | … | βm
• All strings derived from S start with one of β1,…,βm and continue with several instances of α1,…,αn
• Rewrite as
    S → β1 S’ | … | βm S’
    S’ → α1 S’ | … | αn S’ | ε
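To see that the right-recursive form is safe for recursive descent, here is a minimal sketch for one concrete instance. It assumes α is the single character a and β is the single character b, so the grammar S → S a | b becomes S → b S’ and S’ → a S’ | ε. The names Sprime and accepts are hypothetical. Every recursive call happens only after a character has been consumed, so the recognizer terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static const char *next;   /* next unread input character; '\0' marks the end */

/* Match one terminal character, advancing only on success. */
static bool term(char c) {
    if (*next != c) return false;
    next++;
    return true;
}

/* S' -> a S' | epsilon */
static bool Sprime(void) {
    if (term('a')) return Sprime();   /* recursion only after consuming an 'a' */
    return true;                      /* epsilon alternative */
}

/* S -> b S' */
static bool S(void) { return term('b') && Sprime(); }

/* Hypothetical driver: accepts exactly the strings of the form b a* */
bool accepts(const char *input) {
    assert(input != NULL);
    next = input;
    return S() && *next == '\0';
}
```

With these assumptions, accepts("baaa") succeeds and accepts("ab") fails, matching the language β α* described above.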


General Left Recursion

• The grammar
    S → A α | δ
    A → S β
  is also left-recursive because
    S →+ S β α
• This left-recursion can also be eliminated.
• More examples on the following slides.

(Cf. Gaussian Elimination)

    A → Bb | a
    B → Aa | b

  Substitute A into B:

    A → Bb | a
    B → (Bb | a)a | b

  Expand:

    A → Bb | a
    B → Bba | aa | b

  Eliminate the immediate left-recursion in B:

    A → Bb | a
    B → (aa | b)Z | (aa | b)
    Z → baZ | ba


Example: Related to conversion to Greibach Normal Form

Order of non-terminals: A B C

Eliminating left recursion:

    A → BC
    B → CA | b
    C → AB | a

    A → BC
    B → CA | b
    C → BCB | a

    C → CACB | bCB | a

    C → (bCB | a)R | bCB | a
    R → ACBR | ACB

Introducing terminals as first element on RHS:

    C → bCBR | aR | bCB | a
    B → bCBRA | aRA | bCBA | aA | b
    A → bCBRAC | aRAC | bCBAC | aAC | bC
    R → (bCBRAC | ... | bC)(CBR | CB)

Summary of Recursive Descent

• Simple and general parsing strategy
  – Left-recursion must be eliminated first
  – … but that can be done automatically
• Unpopular because of backtracking
  – Thought to be too inefficient
  – Cf. Prolog execution strategy
• In practice, backtracking is eliminated by restricting the grammar
  – To enable “look-before-you-leap” strategy


Predictive Parsers

• Like recursive-descent but parser can “predict” which production to use.
  – By looking at the next few tokens.
  – No backtracking.
• Predictive parsers accept LL(k) grammars.
  – L means “left-to-right” scan of input.
  – L means “leftmost derivation”.
  – k means “predict based on k tokens of lookahead”.
• In practice, LL(1) is used.

• LL(k) grammars
• LR(k) grammars
  – L means “left-to-right” scan of input
  – R means “rightmost derivation”
  – k means “predict based on k tokens of lookahead”
• RL(1) grammars
  – R means “right-to-left” scan of input
• LR(0), LR(1) grammars
• SLR(1) grammars, LALR(1) grammars


LL(1) Languages

• In recursive-descent, for each non-terminal and input token there may be a choice of production.
• LL(1) means that for each non-terminal and token there is only one production.
• Can be specified via 2D tables.
  – One dimension for current non-terminal to expand.
  – One dimension for next token.
  – A table entry contains one production.

Predictive Parsing and Left Factoring

• Recall the grammar
    E → T + E | T
    T → int | int * T | ( E )
• Hard to predict because
  – For T, two productions start with int.
  – For E, it is not clear how to predict.
• A grammar must be left-factored before use for predictive parsing.


Left-Factoring Example

• Recall the grammar
    E → T + E | T
    T → int | int * T | ( E )
• Factor out common prefixes of productions, possibly introducing ε-productions
    E → T X
    X → + E | ε
    T → ( E ) | int Y
    Y → * T | ε

LL(1) Parsing Table Example

• Left-factored grammar
    E → T X            X → + E | ε
    T → ( E ) | int Y  Y → * T | ε
• The LL(1) parsing table:

         int     *      +      (       )      $
    E    T X                   T X
    X                   + E            ε      ε
    T    int Y                 ( E )
    Y            * T    ε              ε      ε


LL(1) Parsing Table Example (Cont.)

• Consider the [E, int] entry
  – “When current non-terminal is E and next input is int, use production E → T X.”
  – This production can generate an int in the first place.
• Consider the [Y,+] entry
  – “When current non-terminal is Y and current token is +, get rid of Y”.
  – Y can be followed by + only in a derivation in which Y → ε.

LL(1) Parsing Tables. Errors

• Blank entries indicate error situations
  – Consider the [E,*] entry
  – “There is no way to derive a string starting with * from non-terminal E”


Using Parsing Tables

• Method similar to recursive descent, except
  – For each non-terminal X
  – We look at the next token t
  – And choose the production shown at [X,t]
• We use a stack to keep track of pending non-terminals.
• We reject when we encounter an error state.
• We accept when we encounter end-of-input.

LL(1) Parsing Algorithm

    initialize stack = <S $> and next
    repeat
      case stack of
        <X, rest> : if T[X,*next] = Y1…Yn
                      then stack ← <Y1…Yn rest>;
                      else error ();
        <t, rest> : if t == *next ++
                      then stack ← <rest>;
                      else error ();
    until stack == < >
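The stack-based loop can be made concrete for the left-factored grammar E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε. The sketch below is one possible encoding, not the lecture's code: the Symbol enum, the prods array, the production numbering, the fixed stack size, and the function name ll1_parse are all assumptions. The table entries are copied from the LL(1) parsing table for this grammar, with ERR standing for a blank entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Terminals first, then non-terminals, so one enum covers every stack symbol. */
typedef enum { INT, TIMES, PLUS, OPEN, CLOSE, END,   /* END plays the role of $ */
               E, X, T, Y } Symbol;
enum { NUM_TERMS = 6, NUM_NONTERMS = 4, MAX_RHS = 3, MAX_STACK = 256, ERR = -1 };

typedef struct { int len; Symbol rhs[MAX_RHS]; } Production;

static const Production prods[] = {
    {2, {T, X}},            /* 0: E -> T X   */
    {2, {PLUS, E}},         /* 1: X -> + E   */
    {0, {INT}},             /* 2: X -> eps   (rhs unused when len is 0) */
    {3, {OPEN, E, CLOSE}},  /* 3: T -> ( E ) */
    {2, {INT, Y}},          /* 4: T -> int Y */
    {2, {TIMES, T}},        /* 5: Y -> * T   */
    {0, {INT}},             /* 6: Y -> eps   */
};

/* table[non-terminal - E][terminal] = production index, or ERR if blank. */
static const int table[NUM_NONTERMS][NUM_TERMS] = {
    /*         int   *    +    (    )    $  */
    /* E */ {   0,  ERR, ERR,   0, ERR, ERR },
    /* X */ { ERR,  ERR,   1, ERR,   2,   2 },
    /* T */ {   4,  ERR, ERR,   3, ERR, ERR },
    /* Y */ { ERR,    5,   6, ERR,   6,   6 },
};

/* Input must be an array of terminals ending in END. */
bool ll1_parse(const Symbol *next) {
    assert(next != NULL);
    Symbol stack[MAX_STACK];
    int top = 0;
    stack[top++] = END;               /* stack = <E $>, top at highest index */
    stack[top++] = E;
    while (top > 0) {
        Symbol s = stack[--top];
        if (*next > END) return false;            /* not a terminal */
        if (s >= E) {                             /* <X, rest>: expand X */
            int p = table[s - E][*next];
            if (p == ERR) return false;
            if (top + prods[p].len > MAX_STACK) return false;
            for (int i = prods[p].len - 1; i >= 0; i--)
                stack[top++] = prods[p].rhs[i];   /* push Y1..Yn, Y1 on top */
        } else {                                  /* <t, rest>: match t */
            if (s != *next) return false;
            if (s != END) next++;                 /* never step past $ */
        }
    }
    return true;                                  /* stack == < > */
}
```

On the input int * int $ this loop pops and pushes symbols in the same order as the Stack column of an LL(1) parsing trace for that input.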


LL(1) Parsing Example

    Stack          Input          Action
    E $            int * int $    T X
    T X $          int * int $    int Y
    int Y X $      int * int $    terminal
    Y X $          * int $        * T
    * T X $        * int $        terminal
    T X $          int $          int Y
    int Y X $      int $          terminal
    Y X $          $              ε
    X $            $              ε
    $              $              ACCEPT

Constructing Parsing Tables

• LL(1) languages are those defined by a parsing table for the LL(1) algorithm.
• No table entry can be multiply defined.
• We want to generate parsing tables from CFG.


Constructing Parsing Tables (Cont.)

• If A → α, where in the line of A do we place α?
• In the column of t where t can start a string derived from α.
  – α →* t β
  – We say that t ∈ First(α).
• In the column of t if α is or derives ε and t can follow an A.
  – S →* β A t δ
  – We say t ∈ Follow(A).

Computing First Sets

Definition
    First(X) = { t | X →* t α } ∪ { ε | X →* ε }

Algorithm sketch:
1. First(t) = { t }
2. ε ∈ First(X) if X → ε is a production
3. ε ∈ First(X) if X → A1 … An
   – and ε ∈ First(Ai) for 1 ≤ i ≤ n
4. First(α) – {ε} ⊆ First(X) if X → A1 … An α
   – and ε ∈ First(Ai) for 1 ≤ i ≤ n
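The four rules can be applied repeatedly until no First set grows. The sketch below does this for the left-factored grammar E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε. It is one possible encoding under stated assumptions: sets are bitmasks with one bit per terminal, EPS is a hypothetical extra bit standing for ε, and the Symbol enum, prods array, and compute_first name are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { INT, TIMES, PLUS, OPEN, CLOSE, END,   /* terminals; END is $ */
               E, X, T, Y, NUM_SYMS } Symbol;
#define BIT(t) (1u << (t))
#define EPS    (1u << 15)        /* marker bit for epsilon (assumption) */

typedef struct { Symbol lhs; int len; Symbol rhs[3]; } Production;

static const Production prods[] = {
    {E, 2, {T, X}},            /* E -> T X   */
    {X, 2, {PLUS, E}},         /* X -> + E   */
    {X, 0, {INT}},             /* X -> eps   (rhs unused when len is 0) */
    {T, 3, {OPEN, E, CLOSE}},  /* T -> ( E ) */
    {T, 2, {INT, Y}},          /* T -> int Y */
    {Y, 2, {TIMES, T}},        /* Y -> * T   */
    {Y, 0, {INT}},             /* Y -> eps   */
};
enum { NUM_PRODS = sizeof prods / sizeof prods[0] };

/* Fill first[s] for every symbol s by iterating to a fixed point. */
void compute_first(unsigned first[NUM_SYMS]) {
    assert(first != NULL);
    for (int s = 0; s < NUM_SYMS; s++)
        first[s] = (s < E) ? BIT(s) : 0u;        /* rule 1: First(t) = { t } */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int p = 0; p < NUM_PRODS; p++) {
            unsigned add = 0u;
            int i = 0;
            /* rule 4: scan the RHS while all earlier symbols can derive eps */
            for (; i < prods[p].len; i++) {
                unsigned f = first[prods[p].rhs[i]];
                add |= f & ~EPS;
                if (!(f & EPS)) break;
            }
            if (i == prods[p].len) add |= EPS;   /* rules 2 and 3 */
            unsigned old = first[prods[p].lhs];
            first[prods[p].lhs] = old | add;
            if (first[prods[p].lhs] != old) changed = true;
        }
    }
}
```

Under this encoding, first[E] and first[T] come out as the bits for int and (, first[X] as + plus EPS, and first[Y] as * plus EPS.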

First Sets. Example

• Recall the grammar
    E → T X            X → + E | ε
    T → ( E ) | int Y  Y → * T | ε
• First sets
    First( ( ) = { ( }        First( T ) = { int, ( }
    First( ) ) = { ) }        First( E ) = { int, ( }
    First( int ) = { int }    First( X ) = { +, ε }
    First( + ) = { + }        First( Y ) = { *, ε }
    First( * ) = { * }

Computing Follow Sets

• Definition:
    Follow(X) = { t | S →* β X t δ }
• Intuition
  – If X → A B then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B)
  – Also if B →* ε then Follow(X) ⊆ Follow(A)
  – If S is the start symbol then $ ∈ Follow(S)


Computing Follow Sets (Cont.)

Algorithm sketch:
1. $ ∈ Follow(S)
2. First(β) – {ε} ⊆ Follow(X)
   – For each production A → α X β
3. Follow(A) ⊆ Follow(X)
   – For each production A → α X β where ε ∈ First(β)

Follow Sets. Example

• Recall the grammar
    E → T X            X → + E | ε
    T → ( E ) | int Y  Y → * T | ε
• Follow sets
    Follow( + ) = { int, ( }      Follow( * ) = { int, ( }
    Follow( ( ) = { int, ( }      Follow( E ) = { ), $ }
    Follow( X ) = { $, ) }        Follow( T ) = { +, ), $ }
    Follow( ) ) = { +, ), $ }     Follow( Y ) = { +, ), $ }
    Follow( int ) = { *, +, ), $ }
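The three Follow rules can also be run to a fixed point. The sketch below does so for the same left-factored grammar, for terminals as well as non-terminals. It is illustrative only: the bitmask encoding, the EPS marker bit, the prods array, and the compute_follow name are assumptions, and the First sets are hard-coded from the First-set example rather than computed. An empty β is treated as deriving ε, so rule 3 also covers a symbol at the end of a production.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { INT, TIMES, PLUS, OPEN, CLOSE, END,   /* terminals; END is $ */
               E, X, T, Y, NUM_SYMS } Symbol;
#define BIT(t) (1u << (t))
#define EPS    (1u << 15)        /* marker bit for epsilon (assumption) */

typedef struct { Symbol lhs; int len; Symbol rhs[3]; } Production;

static const Production prods[] = {
    {E, 2, {T, X}},   {X, 2, {PLUS, E}},  {X, 0, {INT}},   /* len 0: epsilon */
    {T, 3, {OPEN, E, CLOSE}},             {T, 2, {INT, Y}},
    {Y, 2, {TIMES, T}},                   {Y, 0, {INT}},
};
enum { NUM_PRODS = sizeof prods / sizeof prods[0] };

/* First sets taken from the worked example, as bitmasks. */
static const unsigned first[NUM_SYMS] = {
    [INT] = BIT(INT),   [TIMES] = BIT(TIMES), [PLUS] = BIT(PLUS),
    [OPEN] = BIT(OPEN), [CLOSE] = BIT(CLOSE), [END] = BIT(END),
    [E] = BIT(INT) | BIT(OPEN), [X] = BIT(PLUS) | EPS,
    [T] = BIT(INT) | BIT(OPEN), [Y] = BIT(TIMES) | EPS,
};

/* Fill follow[s] for every symbol s; E is the start symbol. */
void compute_follow(unsigned follow[NUM_SYMS]) {
    assert(follow != NULL);
    for (int s = 0; s < NUM_SYMS; s++) follow[s] = 0u;
    follow[E] = BIT(END);                         /* rule 1 */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int p = 0; p < NUM_PRODS; p++) {
            const Production *pr = &prods[p];
            for (int i = 0; i < pr->len; i++) {   /* X = rhs[i], beta = rest */
                unsigned add = 0u;
                int j = i + 1;
                for (; j < pr->len; j++) {        /* rule 2: First(beta) - eps */
                    unsigned f = first[pr->rhs[j]];
                    add |= f & ~EPS;
                    if (!(f & EPS)) break;
                }
                if (j == pr->len)                 /* rule 3: beta derives eps */
                    add |= follow[pr->lhs];
                Symbol x = pr->rhs[i];
                if ((follow[x] | add) != follow[x]) {
                    follow[x] |= add;
                    changed = true;
                }
            }
        }
    }
}
```

The resulting masks agree with the Follow sets listed in the example, for instance follow[INT] holds the bits for *, +, ) and $.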


Constructing LL(1) Parsing Tables

• Construct a parsing table T for CFG G
• For each production A → α in G do:
  – For each terminal t ∈ First(α) do
    • T[A, t] = α
  – If ε ∈ First(α), for each t ∈ Follow(A) do
    • T[A, t] = α
  – If ε ∈ First(α) and $ ∈ Follow(A) do
    • T[A, $] = α

Notes on LL(1) Parsing Tables

• If any entry is multiply defined then G is not LL(1).
  – If G is ambiguous.
  – If G is left recursive.
  – If G is not left-factored.
  – And in other cases as well.
• Most programming language grammars are not LL(1). (Cf. Wirth’s Pascal)
• There are tools that build LL(1) tables.
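The construction can be sketched for the left-factored grammar E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε. This is a sketch under assumptions, not a general tool: the First and Follow sets are hard-coded as bitmasks from the worked examples, $ is stored as an ordinary member of Follow(A) so the separate $ rule needs no special case, and the names build_table, first_of_rhs, and ERR are hypothetical. The return value counts multiply defined entries, so a nonzero result would mean the grammar is not LL(1).

```c
#include <assert.h>
#include <stddef.h>

typedef enum { INT, TIMES, PLUS, OPEN, CLOSE, END,   /* terminals; END is $ */
               E, X, T, Y, NUM_SYMS } Symbol;
enum { NUM_TERMS = 6, NUM_NONTERMS = 4, ERR = -1 };
#define BIT(t) (1u << (t))
#define EPS    (1u << 15)        /* marker bit for epsilon (assumption) */

typedef struct { Symbol lhs; int len; Symbol rhs[3]; } Production;

static const Production prods[] = {
    {E, 2, {T, X}},            /* 0 */
    {X, 2, {PLUS, E}},         /* 1 */
    {X, 0, {INT}},             /* 2: X -> eps (rhs unused when len is 0) */
    {T, 3, {OPEN, E, CLOSE}},  /* 3 */
    {T, 2, {INT, Y}},          /* 4 */
    {Y, 2, {TIMES, T}},        /* 5 */
    {Y, 0, {INT}},             /* 6: Y -> eps */
};
enum { NUM_PRODS = sizeof prods / sizeof prods[0] };

/* First and Follow sets copied from the worked examples. */
static const unsigned first[NUM_SYMS] = {
    [INT] = BIT(INT),   [TIMES] = BIT(TIMES), [PLUS] = BIT(PLUS),
    [OPEN] = BIT(OPEN), [CLOSE] = BIT(CLOSE), [END] = BIT(END),
    [E] = BIT(INT) | BIT(OPEN), [X] = BIT(PLUS) | EPS,
    [T] = BIT(INT) | BIT(OPEN), [Y] = BIT(TIMES) | EPS,
};
static const unsigned follow[NUM_SYMS] = {
    [E] = BIT(CLOSE) | BIT(END),
    [X] = BIT(CLOSE) | BIT(END),
    [T] = BIT(PLUS) | BIT(CLOSE) | BIT(END),
    [Y] = BIT(PLUS) | BIT(CLOSE) | BIT(END),
};

/* First(alpha) for the right-hand side alpha of production p. */
static unsigned first_of_rhs(const Production *p) {
    unsigned set = 0u;
    for (int i = 0; i < p->len; i++) {
        unsigned f = first[p->rhs[i]];
        set |= f & ~EPS;
        if (!(f & EPS)) return set;
    }
    return set | EPS;            /* alpha is empty or entirely nullable */
}

/* Fill table with production indices (ERR = blank entry).
   Returns the number of multiply defined entries. */
int build_table(int table[NUM_NONTERMS][NUM_TERMS]) {
    assert(table != NULL);
    int conflicts = 0;
    for (int a = 0; a < NUM_NONTERMS; a++)
        for (int t = 0; t < NUM_TERMS; t++)
            table[a][t] = ERR;
    for (int p = 0; p < NUM_PRODS; p++) {
        unsigned fa = first_of_rhs(&prods[p]);
        unsigned cols = fa & ~EPS;                    /* t in First(alpha) */
        if (fa & EPS) cols |= follow[prods[p].lhs];   /* t in Follow(A)    */
        for (int t = 0; t < NUM_TERMS; t++) {
            if (!(cols & BIT(t))) continue;
            int *entry = &table[prods[p].lhs - E][t];
            if (*entry != ERR) conflicts++;
            *entry = p;
        }
    }
    return conflicts;
}
```

For this grammar build_table reports zero conflicts, and the filled entries correspond to the LL(1) parsing table shown for the left-factored grammar.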

