Top Down Parsing Topdown Parsing with Backtracking

Home , Backtracking, Parse tree, Parsing, Recursive descent parser

Top down parsing Topdown parsing with backtracking

• Types of parsers: • ScAd • Top down: • Aab|a – repeatedly rewrite the start symbol; – find a left-most derivation of the input string; • w=cad – easy to implement; S S – not all context-free grammars are suitable. S • Bottom up: c A d c A d – start with tokens and combine them to form interior nodes of the c A d parse tree; a b a – find a right-most derivation of the input string; – accept when the start symbol is reached; cad – it is more prevalent. cad cad cad

1 2

Parsing trace Parsing trace

Expansion Remaining input Action Expansion Remaining input Action S cad Try ScAd S cad Try ScAd cAd cad Match c cAd cad Match c Ad ad Try Aab Ad ad Try Aab abd ad Match a abd ad Match a bd d Dead end, backtrack bd d Dead end, backtrack ad ad Try Aa ad ad Match a ad ad Try Aa d d Match d ad ad Match a Success d d Match d Success

ScAd • ScAd Aab|a • Aab|a

3 4

SAB Another example AaA|ε Top down vs. bottom up parsing BbB|b Expansion Remaining input Action • Bottom up approach • Given the rules aabb S aabb Try SAB SAB – SAB a abb aAB – AaA|ε aa bb AB aabb Try AaA aaAB aaε bb aaεB – BbB|b aaA bb aAB aabb match a aabB aabb aA bb AB abb AaA • How to parse aabb ? A bb • Topdown approach Ab b aAB abb match a SAB Abb AbB AB bb Aepsilon aAB AB aaAB S B bb BbB aaεB aabB • If read backwards, the derivation is bB bb match b aabb right most B b Bb • In both topdown and bottom up Note that it is a left most approaches, the input is scanned b b match derivation from left to right

5 6

1 Recursive descent parsing

• Each method corresponds to a non-terminal //BbB|b static boolean checkS() { int savedPointer = pointer; SAB if (checkA() && checkB()) static boolean checkB() { return true; int savedPointer = pointer; pointer = savedPointer; return false; if (nextToken().equals(‘b’) && checkB()) } return true; pointer = savedPointer; static boolean checkA() { AaA|ε int savedPointer = pointer; if(nextToken().equals(‘b’)) return true; if (nextToken().equals(‘a’) && checkA()) pointer = savedPointer; return true; return false; pointer = savedPointer; return true; } }

7 8

Recursive descent parsing (a complete Left recursion example) • What if the grammar is changed to SAB • Grammar AAa|ε program statement program | statement Bb|bB statement assignment • The corresponding methods are static boolean checkA() { assignment ID EQUAL expr int savedPointer = pointer; if (checkA() && nextToken().equals(‘a’)) ...... return true; pointer = savedPointer; • Task: return true; } – Write a java program that can judge whether a program is static boolean checkB() { syntactically correct. int savedPointer = pointer; if (nextToken().equals(‘b’)) – This time we will write the parser manually. return true; – We can use the scanner pointer = savedPointer; if(nextToken().equals(‘b’) && checkB()) return true; return false; • How to do it? pointer = savedPointer; }

9 10

RecursiveDescent.java outline One of the methods

1. static int pointer=-1; 1. /** program-->statement program

2. static ArrayList tokens=new ArrayList(); 2. program-->statement 3. static Symbol nextToken() { }

4. public static void main(String[] args) { 3. */ 5. Calc3Scanner scanner=new 4. static boolean program() throws Exception { 6. Calc3Scanner(new FileReader(”calc2.input")); 5. int savedPointer = pointer; 7. Symbol token; 6. if (statement() && program()) return true;

8. while(token=scanner.yylex().sym!=Calc2Symbol.EOF) 7. pointer = savedPointer; 9. tokens.add(token);

10. boolean legal= program() && nextToken()==null; 8. if (statement()) return true; 11. System.out.println(legal); 9. pointer = savedPointer; 12.} 10. return false; 11.} 13.static boolean program() throws Exception {…} 14.static boolean statement() throws Exception {…} 15.static boolean assignment () throws Exception {…} 16.static boolean expr() {…}

11 12

2 Recursive Descent parsing Summary of recursive descent parser

• Recursive descent parsing is an easy, natural way to code top-down • Simple enough that it can easily be constructed by hand; parsers. • Not efficient; – All non terminals become procedure calls that return true or false; • Limitations: – all terminals become matches against the input stream. /** E E+T | T **/ • Example: static boolean expr() throws Exception { /** assignment--> ID=exp **/ int savePointer = pointer; if ( expr() static boolean assignment() throws Exception{ && nextToken().sym==Calc2Symbol.PLUS int savePointer= pointer; && term()) if ( nextToken().sym==Calc2Symbol.ID return true; && nextToken().sym==Calc2Symbol.EQUAL pointer = savePointer; && expr()) if (term()) return true; return true; pointer = savePointer; pointer = savePointer; return false; return false; } } • A recursive descent parser can enter into infinite loop.

13 14

Left recursion has to be removed for recursive Left recursion descent parser • Deﬁnion Look at the previous example that works: E T+E | T – A grammar is le recursive if it has a nonterminal A such that there is a static boolean expr() throws Exception { derivaon A ⇒+ Aα int savePointer = pointer; – There are tow kinds of le recursion: if (term() && nextToken().sym == Calc2Symbol.PLUS && expr()) return true; • direct le recursive : A ⇒ Aα pointer = savePointer; • in-direct le-recursive: A ⇒+ Aα, but not A ⇒ Aα if (term()) return true; • Example: pointer = savePointer; return false; EE+T|T is direct le recursive } • Indirect le-recursive What if the grammar is left recursive? SAa|b E E+T | T AAc|Sd|ε static boolean expr() throws Exception { int savePointer = pointer; Is S le recursive? if (expr() && nextToken().sym == Calc2Symbol.PLUS && term ()) return true; S⇒ Aa pointer = savePointer; ⇒Sda if (term()) return true; S⇒+Sda pointer = savePointer; return false; }

There will be infinite loop! 15 16

Remove left recursion Remove left recursion

• Direct left recursion • In general, for a production

AAα|β AAα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn expanded form: Aβ α α ... α where no βi begins with A. Left recursion removed: Aβ Z It can be replaced by:

Zα Z|ε Aβ1A’ | β2A’|... | βnA’ A’ α1A’ |α2A’| ... |αmA’| ε • Example: EE+T |T expanded form: ET +T +T ... +T ETZ Z+TZ| ε

17 18

3 Predictive parsing Left factoring

• Predictive parser is a special case of top-down parsing when no • General method backtracking is required; For a production Aα β1 | αβ2| ... | αβn | γ • At each non-terminal node, the action to undertake is unambiguous; STATif ... where γ represents all alternatives that do not begin with α, | while ... it can be replaced by | for ... Aα B | γ • Not general enough to handle real programming languages; Bβ1 | β2 | ... | βn • Grammar must be left factored; • Example IFSTATif EXPR then STAT | if EXPR then STAT else STAT ET+E | T – A predictive parser must choose the correct version of the IFSTAT Can be transformed into: before seeing the entire input ET E’ – The solution is to factor out common terms: IFSTATif EXPR then STAT IFREST E’+E| ε IFRESTelse STAT | ε • Consider another familiar example: ET+E | T

19 20

Predictive parsing More on LL(1)

• The recursive descent parser is not efficient because of • LL(1) grammar has no left-recursive productions and has the backtracking and recursive calls. been left factored. • a predictive parser does not require backtracking. – left factored grammar with no left recursion may not be LL(1) – able to choose the production to apply solely on the basis of the • there are grammars that cannot be modified to become next input symbol and the current nonterminal being processed LL(1). • In such cases, another parsing technique must be • To enable this, the grammar must be LL(1). employed, or special rules must be embedded into the – The first “L” means we scan the input from left to right; predictive parser. – the second “L” means we create a leftmost derivation; – the 1 means one input symbol of lookahead.

21 22

First() set--motivation Fisrt(): motivating example

• Navigating through two choices seemed simple enough, however, • On many cases, rules starts with non- S ⇒Bc terminals ⇒gAc what happens where we have many alternatives on the right side? SAb|Bc – statement  assignment | returnStatement | ifStatement | ADf|CA ⇒gCAc whileStatement | blockStatement BgA|e ⇒gcAc CdC|c • When implementing the statement() method, how are we going to be ⇒gcDfc Dh|i ⇒gchfc able to determine which of the 5 options to match for any given How to parse “gchfc”? input? Dfb hfb if the next token is h, i, d, or c, • Remember, we are trying to do this without backtracking, and just ⇒ ⇒ / ⇒ifb alternative Ab should be selected. one token of lookahead, so we have to be able to make immediate ⇒Ab decision with minimal information— this can be a challenge! / \ If the next token is g or e, / CAb dCAb …. alternative Bc should be • Fortunately, many production rules starts with terminals, which can ⇒ ⇒ ⇒ selected. S ⇒ cAb ⇒ … help in deciding which rule to use. \ In this way, by looking at the next token, the parser is – For example, if the input token is ‘while’, the program should know that \ able to decide which rule the whileStatement rule will be used. ⇒ Bc ⇒ gAc ⇒ … to use without exhaustive ⇒ ec searching.

23 24

4 First(): Definition First() algorithm • First(): compute the set of terminals that can begin a rule • The First set of a sequence of symbols α, written as First 1. if a is a terminal, then first(a) is {a}. (α), is the set of terminals which start the sequences of 2. if A is a non-terminal and Aaα is a production, then add a to first(A). symbols derivable from α. if Aε is a production, add ε to first(A). 3. if Aα α ... α is a production, add Fisrt(α1)-ε to First(A). – If α =>* aβ, then a is in First(α). 1 2 m If α can derive ε, add First(α )-ε to First(A). – If α =>* ε , then ε is in First(α). 1 2 If both α1 and α2 derives ε, add First(α3)-ε to First(A). and so on.

If α1 α2 ... αm =>*ε , add ε to First(A). • Given a production with a number of alternatives: • Example S  Aa | b – A  α1 | α2 | ..., A  bdZ | eZ – we can write a predicative parser only if all the sets First(αi) are Z  cZ | adZ | ε disjoint. First(A) = {First(b), First(e)}= {b, e} (by rule 2, rule 1) First(Z) = {a, c, ε } (by rule 2, rule 1) First(S) = {First(A), First(b)} (by rule 3) = {First(A), b} (by rule 1) 25 = {b, e, b} = {b, e} (by rule 2) 26

A slightly modified example Follow()– motivation

S  Aa | b • Consider A  bdZ | eZ| ε – S*aaAb Z  cZ | adZ | ε – Where Aε|aA . • When can A ε be used? What is the next token First(S) expected? = {First(A), First(b)} (by rule 3) = {First(A), b} (by rule 1) = {b, e, b} = {b, e} (by rule 2) ? • In general, when A is nullable, what is the next token we expect to see? First(S) = {First(A), First(a), b} = { a, b, e, ε } ? – A non-terminal A is nullable if ε in First(A), or – A*ε Answer: First(S) = { a, b, e} • the next token would be the first token of the symbol following A in the sentence being parsed.

27 28

Follow() Compute First() and Follow()

• Follow(): find the set of terminals that can immediately follow a non- terminal 1. ETE’ First (E) 1. $(end of input) is in Follow(S), where S is the start symbol; 2. E’+TE’|ε = First(T) 2. For the productions of the form AαBβ then everything in First(β) but ε 3. TFT’ = First(F) is in Follow(B). 4. T’*FT’|ε ={ (, id } 3. For productions of the form AαB or AαBβ where First(β) contains ε, 5. F(E)|id First (E’) then everything in Follow(A) is in Follow(B). = {+, ε } – aAb => aαBb First (T’)= { *, ε } • Example  S Aa | b Follow (E) A bdZ |eZ = { ), $ } Z cZ | adZ | ε =Follow(E’) Follow(S) Follow(T) = {$} (by rule 1) = Follow(T’) Follow(A) = { +, ), $ } First(E’) except ε plus Follow(E) = {a} (by rule 2) Follow (F) Follow(Z) = { *, +, ), $} First(T’) except ε plus Follow(T’) = {Follow(A)} = {a} (by rule 3) 29 30

5 The use of First() and Follow() LL(1) parse table construction • Construct a parse table (PT) with one axis the set of terminals, and • If we want to expand S in this grammar: the other the set of non-terminals. S  A ... | B ... • For all productions of the form Aα A  a... – Add Aα to entry PT[A,b] for each token b in First(α); B  b ... | a... – add Aα to entry PT[A,b] for each token b in Follow(A) if First(α) • If the next input character is b, we should rewrite S with contains ε; A... or B ....? – add Aα to entry PT[A,$] if First(α) contains ε and Follow(A) contains $. S  Aa | b – since First(B) ={a, b}, and First(A)= {a}, we know to rewrite S with A  b d Z | eZ First Follow B; Z  cZ | a d Z | ε S Aa b, e $ – First and Follow gives us information about the next characters b b a b c d e $ expected in the grammar. A bdZ b a • If the next input character is a, how to rewrite S? S SAa SAa eZ e Sb – a is in both First(A) and First(B); Z cZ c a A AbdZ AeZ – The grammar is not suitable for predictive parsing. adZ a Z Zε ZcZ ε ε ZadZ 31 32

a b c d e $ Construct the parsing table S SAa SAa S  Aa | b Sb A  b d Z | eZ Z  cZ | a d Z | ε • if Aα, which column we place Aα in row A? A AbdZ AeZ – in the column of t, if t can start a string derived from α, i.e., t in Z Zε ZcZ First(α). ZadZ – what if α is empty? put Aα in the column of t if t can follow an A, i.e., t in Follow(A). Stack RemainingInput Action

S$ bda$ Predict SAa or Sb? suppose Aa is used Aa$ bda$ Predict AbdZ bdZa$ bda$ match dZa$ da$ match Za$ a$ Predict Zε a$ a$ match $ $ accept

– Note that it is not LL(1) because there are more than one rule can be selected. – The correspondent (leftmost) derivation SAabdZabdε a – Note when Zε rule is used. 33 34

LL(1) grammar A complete example for LL(1) parsing • If the table entries are unique, the grammar is said to be LL(1): SP – Scan the input from Left to right; P { D; C} – performing a Leftmost derivation. D d, D | d • LL(1) grammars can have all their parser decisions made using one C c, C| c token look ahead. • In principle, can have LL(k) parsers with k>1. • The above grammar corresponds loosely to the structure of • Properties of LL(1) programs. (program is a sequence of declarations followed – Ambiguous grammar is never LL(1); by a sequence of commands). – Grammar with left recursion is never LL(1); • Need to left factor the grammar first. • A grammar G is LL(1) iff whenever A –> α | β are two distinct productions of G, the following conditions hold: SP First Follow – For no terminal a do both α and β derive strings beginning with a (i.e., P { D; C} S { $ First sets are disjoint); D d D2 – At most one of α and β can derive the empty string P { $ D2 , D | ε – If β =>* ε then α does not derive any string beginning with a terminal in D d ; Follow(A) Cc C2 D2 , ε ; C2  , C | ε C c } C2 , ε } 35 36

6 Construct LL(1) parse table LL(1) parse program

{ } ; , c d $ input$ S SP$ P P{D;C} S D DdD2 t Program a D2 D2ε D2,D c parse C CcC2 k table C2 C2ε C2,C $

First Follow SP S { $ • Stack: contain the current rewrite of the start symbol. P { D; C} P { $ D d D2 • Input: left to right scan of input. D2 , D | ε D d ; • Parse table: contain the LL(k) parse table. Cc C2 D2 , ε ; C2  , C | ε C c } C2 , ε } 37 38

SP LL(1) parsing algorithm Running LL(1) parser P { D; C} D d D2 Stack Remaining Input Action D2 , D | ε • Use the stack, input, and parse table with the following S {d,d;c}$ predict SP$ Cc C2 rules: P$ {d,d;c}$ predict P{D;C} C2  , C | ε { D ; C } $ {d,d;c}$ match { – Accept: if the symbol on the top of the stack is $ and the input D ; C } $ d,d;c}$ predict Dd D2 symbol is $, successful parse d D2 ; C } $ d,d;c}$ match d Derivation – match: if the symbol on the top of the stack is the same as the D2 ; C } $ ,d;c}$ predict D2,D SP$ next input token, pop the stack and advance the input , D ; C } $ ,d;c}$ match , { D;C}$ D ; C } $ d;c}$ predict Dd D2 – predict: if the top of the stack is a non-terminal M and the next {d D2 ; C } $ d D2 ; C } $ d;c}$ match d {d, D ; C } $ input token is a, remove M from the stack and push entry PT[M,a] D2 ; C } $ ;c}$ predict D2ε {d,d D2; C } $ on to the stack in reverse order ε ; C } $ ;c} $ match ; {d,d; C} $ – Error: Anything else is a syntax error C } $ c}$ predict Cc C2 {d,d; c C2}$ c C2 } $ c}$ match c {d,d; c} $ C2 } $ }$ predict C2ε } $ }$ match } $ $ accept Note that it is leftmost derivation

39 40

The expression example Parsing int*int

Stack Remaining input Action 1. ETE’ 2. E’+TE’|ε E$ int*int $ predicate ETE’ 3. TFT’ TE’$ int*int $ predicate TFT’ 4. T’*FT’|ε FT’E’ $ int*int $ predicate Fint 5. F(E)|int int T’ E’ $ int*int $ match int T’ E’ $ * int $ predicate T’*FT’ * F T’ E’ $ * int $ match * F T’ E’ $ int $ predicate Fint int T’ E’ $ int $ match int + * ( ) int $ T’ E’ $ $ predicate T’ε E ETE’ ETE’ E’ $ $ predicate E’ε E’ E’+TE’ E’ε E’ε $ $ match $. success. T TFT’ TFT’ T’ T’ε T’*FT’ T’ε T’ε F F(E) Fint 41 42

7 Parsing the wrong expression int*]+int Error handling Stack Remaining input Action • There are three types of error processing: E$ int*]+int $ predicate ETE’ – report, recovery, repair TE’$ int*]+int $ predicate TFT’ FT’E’ $ int*]+int $ predicate Fint • general principles int T’ E’ $ int*]+int $ match int – try to determine that an error has occurred as soon as possible. Waiting T’ E’ $ * ]+int $ predicate T’*FT’ too long before declaring an error can cause the parser to lose the actual location of the error. * F T’ E’ $ *]+ int $ match * F T’ E’ $ ]+int $ error, skip ] – Error report: A suitable and comprehensive message should be reported. F T’ E’$ + int $ PT[F, +] is sync, pop F “Missing semicolon on line 36” is helpful, “unable to shift in state 425” is not. T’E’$ +int$ predicate T’ ε – Error recovery: After an error has occurred, the parser must pick a likely E’ $ +int$ predicate E’+TE’ place to resume the parse. Rather than giving up at the first problem, a ...... parser should always try to parse as much of the code as possible in It is easy for LL(1) parsers to skip error order to find as many real errors as possible during a single run. – A parser should avoid cascading errors, which is when one error + * ( ) int ] $ generates a lengthy sequence of spurious error messages. E ETE’ ETE’ error E’ E’+TE’ E’ε error E’ε T TFT’ TFT’ error T’ T’ε T’*FT’ T’ε error T’ε

F sync (Follow(F)) sync F(E) sync Fint error sync 43 44

Error report Error Recovery

• report an error occurred and what and where possibly the error is; • Error recovery: a single error won’t stop the whole parsing. Instead, – Report expected vs found tokens by filling holes of parse table with error the parser will be able to resume the parsing at certain place after messages) the error; – Give up on current construct and restart later: – Delimiters help parser synch back up – Skip until find matching ), end, ], whatever – Use First and Follow to choose good synchronizing tokens + * ( ) int $ – Example: E ETE’ Err, int or ETE’ • duoble d; use Follow(TYPE) D  TYPE ID SEMI ( expected • junk double d; use First(D) in line … • Error repair: Patch up simple errors and continue . E’ E’+TE’ E’ε E’ε – Insert missing token (;) T TFT’ TFT’ – Add declaration for undeclared name T’ T’ε T’*FT’ T’ε T’ε F F(E) Fint

45 46

Types of errors Summarize LL(1) parsing

• Types of errors • Massage the grammar – Lexical: @+2 • Captured by JLex – Remove ambiguity – Syntactical: x=3+*4; – Remove left recursion • Captured by javacup – Left factoring – Semantic: boolean x; x = 3+4; • Captured by type checker, not implemented in parser generators • Construct the LL(1) parse table – Logical: infinite loop – First(), Follow() • Not implemented in compilers – Fill in the table entry • Run the LL(1) parse program

47 48