Top down parsing Topdown parsing with backtracking
• Types of parsers: • ScAd • Top down: • Aab|a – repeatedly rewrite the start symbol; – find a left-most derivation of the input string; • w=cad – easy to implement; S S – not all context-free grammars are suitable. S • Bottom up: c A d c A d – start with tokens and combine them to form interior nodes of the c A d parse tree; a b a – find a right-most derivation of the input string; – accept when the start symbol is reached; cad – it is more prevalent. cad cad cad
1 2
Parsing trace Parsing trace
Expansion Remaining input Action Expansion Remaining input Action S cad Try ScAd S cad Try ScAd cAd cad Match c cAd cad Match c Ad ad Try Aab Ad ad Try Aab abd ad Match a abd ad Match a bd d Dead end, backtrack bd d Dead end, backtrack ad ad Try Aa ad ad Match a ad ad Try Aa d d Match d ad ad Match a Success d d Match d Success
ScAd • ScAd Aab|a • Aab|a
3 4
SAB Another example AaA|ε Top down vs. bottom up parsing BbB|b Expansion Remaining input Action • Bottom up approach • Given the rules aabb S aabb Try SAB SAB – SAB a abb aAB – AaA|ε aa bb AB aabb Try AaA aaAB aaε bb aaεB – BbB|b aaA bb aAB aabb match a aabB aabb aA bb AB abb AaA • How to parse aabb ? A bb • Topdown approach Ab b aAB abb match a SAB Abb AbB AB bb Aepsilon aAB AB aaAB S B bb BbB aaεB aabB • If read backwards, the derivation is bB bb match b aabb right most B b Bb • In both topdown and bottom up Note that it is a left most approaches, the input is scanned b b match derivation from left to right
5 6
1 Recursive descent parsing
• Each method corresponds to a non-terminal //BbB|b static boolean checkS() { int savedPointer = pointer; SAB if (checkA() && checkB()) static boolean checkB() { return true; int savedPointer = pointer; pointer = savedPointer; return false; if (nextToken().equals(‘b’) && checkB()) } return true; pointer = savedPointer; static boolean checkA() { AaA|ε int savedPointer = pointer; if(nextToken().equals(‘b’)) return true; if (nextToken().equals(‘a’) && checkA()) pointer = savedPointer; return true; return false; pointer = savedPointer; return true; } }
7 8
Recursive descent parsing (a complete Left recursion example) • What if the grammar is changed to SAB • Grammar AAa|ε program statement program | statement Bb|bB statement assignment • The corresponding methods are static boolean checkA() { assignment ID EQUAL expr int savedPointer = pointer; if (checkA() && nextToken().equals(‘a’)) ...... return true; pointer = savedPointer; • Task: return true; } – Write a java program that can judge whether a program is static boolean checkB() { syntactically correct. int savedPointer = pointer; if (nextToken().equals(‘b’)) – This time we will write the parser manually. return true; – We can use the scanner pointer = savedPointer; if(nextToken().equals(‘b’) && checkB()) return true; return false; • How to do it? pointer = savedPointer; }
9 10
RecursiveDescent.java outline One of the methods
1. static int pointer=-1; 1. /** program-->statement program
2. static ArrayList tokens=new ArrayList(); 2. program-->statement 3. static Symbol nextToken() { }
4. public static void main(String[] args) { 3. */ 5. Calc3Scanner scanner=new 4. static boolean program() throws Exception { 6. Calc3Scanner(new FileReader(”calc2.input")); 5. int savedPointer = pointer; 7. Symbol token; 6. if (statement() && program()) return true;
8. while(token=scanner.yylex().sym!=Calc2Symbol.EOF) 7. pointer = savedPointer; 9. tokens.add(token);
10. boolean legal= program() && nextToken()==null; 8. if (statement()) return true; 11. System.out.println(legal); 9. pointer = savedPointer; 12.} 10. return false; 11.} 13.static boolean program() throws Exception {…} 14.static boolean statement() throws Exception {…} 15.static boolean assignment () throws Exception {…} 16.static boolean expr() {…}
11 12
2 Recursive Descent parsing Summary of recursive descent parser
• Recursive descent parsing is an easy, natural way to code top-down • Simple enough that it can easily be constructed by hand; parsers. • Not efficient; – All non terminals become procedure calls that return true or false; • Limitations: – all terminals become matches against the input stream. /** E E+T | T **/ • Example: static boolean expr() throws Exception { /** assignment--> ID=exp **/ int savePointer = pointer; if ( expr() static boolean assignment() throws Exception{ && nextToken().sym==Calc2Symbol.PLUS int savePointer= pointer; && term()) if ( nextToken().sym==Calc2Symbol.ID return true; && nextToken().sym==Calc2Symbol.EQUAL pointer = savePointer; && expr()) if (term()) return true; return true; pointer = savePointer; pointer = savePointer; return false; return false; } } • A recursive descent parser can enter into infinite loop.
13 14
Left recursion has to be removed for recursive Left recursion descent parser • Defini on Look at the previous example that works: E T+E | T – A grammar is le recursive if it has a nonterminal A such that there is a static boolean expr() throws Exception { deriva on A ⇒+ Aα int savePointer = pointer; – There are tow kinds of le recursion: if (term() && nextToken().sym == Calc2Symbol.PLUS && expr()) return true; • direct le recursive : A ⇒ Aα pointer = savePointer; • in-direct le -recursive: A ⇒+ Aα, but not A ⇒ Aα if (term()) return true; • Example: pointer = savePointer; return false; EE+T|T is direct le recursive } • Indirect le -recursive What if the grammar is left recursive? SAa|b E E+T | T AAc|Sd|ε static boolean expr() throws Exception { int savePointer = pointer; Is S le recursive? if (expr() && nextToken().sym == Calc2Symbol.PLUS && term ()) return true; S⇒ Aa pointer = savePointer; ⇒Sda if (term()) return true; S⇒+Sda pointer = savePointer; return false; }
There will be infinite loop! 15 16
Remove left recursion Remove left recursion
• Direct left recursion • In general, for a production
AAα|β AAα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn expanded form: Aβ α α ... α where no βi begins with A. Left recursion removed: Aβ Z It can be replaced by:
Zα Z|ε Aβ1A’ | β2A’|... | βnA’ A’ α1A’ |α2A’| ... |αmA’| ε • Example: EE+T |T expanded form: ET +T +T ... +T ETZ Z+TZ| ε
17 18
3 Predictive parsing Left factoring
• Predictive parser is a special case of top-down parsing when no • General method backtracking is required; For a production Aα β1 | αβ2| ... | αβn | γ • At each non-terminal node, the action to undertake is unambiguous; STATif ... where γ represents all alternatives that do not begin with α, | while ... it can be replaced by | for ... Aα B | γ • Not general enough to handle real programming languages; Bβ1 | β2 | ... | βn • Grammar must be left factored; • Example IFSTATif EXPR then STAT | if EXPR then STAT else STAT ET+E | T – A predictive parser must choose the correct version of the IFSTAT Can be transformed into: before seeing the entire input ET E’ – The solution is to factor out common terms: IFSTATif EXPR then STAT IFREST E’+E| ε IFRESTelse STAT | ε • Consider another familiar example: ET+E | T
19 20
Predictive parsing More on LL(1)
• The recursive descent parser is not efficient because of • LL(1) grammar has no left-recursive productions and has the backtracking and recursive calls. been left factored. • a predictive parser does not require backtracking. – left factored grammar with no left recursion may not be LL(1) – able to choose the production to apply solely on the basis of the • there are grammars that cannot be modified to become next input symbol and the current nonterminal being processed LL(1). • In such cases, another parsing technique must be • To enable this, the grammar must be LL(1). employed, or special rules must be embedded into the – The first “L” means we scan the input from left to right; predictive parser. – the second “L” means we create a leftmost derivation; – the 1 means one input symbol of lookahead.
21 22
First() set--motivation Fisrt(): motivating example
• Navigating through two choices seemed simple enough, however, • On many cases, rules starts with non- S ⇒Bc terminals ⇒gAc what happens where we have many alternatives on the right side? SAb|Bc – statement assignment | returnStatement | ifStatement | ADf|CA ⇒gCAc whileStatement | blockStatement BgA|e ⇒gcAc CdC|c • When implementing the statement() method, how are we going to be ⇒gcDfc Dh|i ⇒gchfc able to determine which of the 5 options to match for any given How to parse “gchfc”? input? Dfb hfb if the next token is h, i, d, or c, • Remember, we are trying to do this without backtracking, and just ⇒ ⇒ / ⇒ifb alternative Ab should be selected. one token of lookahead, so we have to be able to make immediate ⇒Ab decision with minimal information— this can be a challenge! / \ If the next token is g or e, / CAb dCAb …. alternative Bc should be • Fortunately, many production rules starts with terminals, which can ⇒ ⇒ ⇒ selected. S ⇒ cAb ⇒ … help in deciding which rule to use. \ In this way, by looking at the next token, the parser is – For example, if the input token is ‘while’, the program should know that \ able to decide which rule the whileStatement rule will be used. ⇒ Bc ⇒ gAc ⇒ … to use without exhaustive ⇒ ec searching.
23 24
4 First(): Definition First() algorithm • First(): compute the set of terminals that can begin a rule • The First set of a sequence of symbols α, written as First 1. if a is a terminal, then first(a) is {a}. (α), is the set of terminals which start the sequences of 2. if A is a non-terminal and Aaα is a production, then add a to first(A). symbols derivable from α. if Aε is a production, add ε to first(A). 3. if Aα α ... α is a production, add Fisrt(α1)-ε to First(A). – If α =>* aβ, then a is in First(α). 1 2 m If α can derive ε, add First(α )-ε to First(A). – If α =>* ε , then ε is in First(α). 1 2 If both α1 and α2 derives ε, add First(α3)-ε to First(A). and so on.
If α1 α2 ... αm =>*ε , add ε to First(A). • Given a production with a number of alternatives: • Example S Aa | b – A α1 | α2 | ..., A bdZ | eZ – we can write a predicative parser only if all the sets First(αi) are Z cZ | adZ | ε disjoint. First(A) = {First(b), First(e)}= {b, e} (by rule 2, rule 1) First(Z) = {a, c, ε } (by rule 2, rule 1) First(S) = {First(A), First(b)} (by rule 3) = {First(A), b} (by rule 1) 25 = {b, e, b} = {b, e} (by rule 2) 26
A slightly modified example Follow()– motivation
S Aa | b • Consider A bdZ | eZ| ε – S*aaAb Z cZ | adZ | ε – Where Aε|aA . • When can A ε be used? What is the next token First(S) expected? = {First(A), First(b)} (by rule 3) = {First(A), b} (by rule 1) = {b, e, b} = {b, e} (by rule 2) ? • In general, when A is nullable, what is the next token we expect to see? First(S) = {First(A), First(a), b} = { a, b, e, ε } ? – A non-terminal A is nullable if ε in First(A), or – A*ε Answer: First(S) = { a, b, e} • the next token would be the first token of the symbol following A in the sentence being parsed.
27 28
Follow() Compute First() and Follow()
• Follow(): find the set of terminals that can immediately follow a non- terminal 1. ETE’ First (E) 1. $(end of input) is in Follow(S), where S is the start symbol; 2. E’+TE’|ε = First(T) 2. For the productions of the form AαBβ then everything in First(β) but ε 3. TFT’ = First(F) is in Follow(B). 4. T’*FT’|ε ={ (, id } 3. For productions of the form AαB or AαBβ where First(β) contains ε, 5. F(E)|id First (E’) then everything in Follow(A) is in Follow(B). = {+, ε } – aAb => aαBb First (T’)= { *, ε } • Example S Aa | b Follow (E) A bdZ |eZ = { ), $ } Z cZ | adZ | ε =Follow(E’) Follow(S) Follow(T) = {$} (by rule 1) = Follow(T’) Follow(A) = { +, ), $ } First(E’) except ε plus Follow(E) = {a} (by rule 2) Follow (F) Follow(Z) = { *, +, ), $} First(T’) except ε plus Follow(T’) = {Follow(A)} = {a} (by rule 3) 29 30
5 The use of First() and Follow() LL(1) parse table construction • Construct a parse table (PT) with one axis the set of terminals, and • If we want to expand S in this grammar: the other the set of non-terminals. S A ... | B ... • For all productions of the form Aα A a... – Add Aα to entry PT[A,b] for each token b in First(α); B b ... | a... – add Aα to entry PT[A,b] for each token b in Follow(A) if First(α) • If the next input character is b, we should rewrite S with contains ε; A... or B ....? – add Aα to entry PT[A,$] if First(α) contains ε and Follow(A) contains $. S Aa | b – since First(B) ={a, b}, and First(A)= {a}, we know to rewrite S with A b d Z | eZ First Follow B; Z cZ | a d Z | ε S Aa b, e $ – First and Follow gives us information about the next characters b b a b c d e $ expected in the grammar. A bdZ b a • If the next input character is a, how to rewrite S? S SAa SAa eZ e Sb – a is in both First(A) and First(B); Z cZ c a A AbdZ AeZ – The grammar is not suitable for predictive parsing. adZ a Z Zε ZcZ ε ε ZadZ 31 32
a b c d e $ Construct the parsing table S SAa SAa S Aa | b Sb A b d Z | eZ Z cZ | a d Z | ε • if Aα, which column we place Aα in row A? A AbdZ AeZ – in the column of t, if t can start a string derived from α, i.e., t in Z Zε ZcZ First(α). ZadZ – what if α is empty? put Aα in the column of t if t can follow an A, i.e., t in Follow(A). Stack RemainingInput Action
S$ bda$ Predict SAa or Sb? suppose Aa is used Aa$ bda$ Predict AbdZ bdZa$ bda$ match dZa$ da$ match Za$ a$ Predict Zε a$ a$ match $ $ accept
– Note that it is not LL(1) because there are more than one rule can be selected. – The correspondent (leftmost) derivation SAabdZabdε a – Note when Zε rule is used. 33 34
LL(1) grammar A complete example for LL(1) parsing • If the table entries are unique, the grammar is said to be LL(1): SP – Scan the input from Left to right; P { D; C} – performing a Leftmost derivation. D d, D | d • LL(1) grammars can have all their parser decisions made using one C c, C| c token look ahead. • In principle, can have LL(k) parsers with k>1. • The above grammar corresponds loosely to the structure of • Properties of LL(1) programs. (program is a sequence of declarations followed – Ambiguous grammar is never LL(1); by a sequence of commands). – Grammar with left recursion is never LL(1); • Need to left factor the grammar first. • A grammar G is LL(1) iff whenever A –> α | β are two distinct productions of G, the following conditions hold: SP First Follow – For no terminal a do both α and β derive strings beginning with a (i.e., P { D; C} S { $ First sets are disjoint); D d D2 – At most one of α and β can derive the empty string P { $ D2 , D | ε – If β =>* ε then α does not derive any string beginning with a terminal in D d ; Follow(A) Cc C2 D2 , ε ; C2 , C | ε C c } C2 , ε } 35 36
6 Construct LL(1) parse table LL(1) parse program
{ } ; , c d $ input$ S SP$ P P{D;C} S D DdD2 t Program a D2 D2ε D2,D c parse C CcC2 k table C2 C2ε C2,C $
First Follow SP S { $ • Stack: contain the current rewrite of the start symbol. P { D; C} P { $ D d D2 • Input: left to right scan of input. D2 , D | ε D d ; • Parse table: contain the LL(k) parse table. Cc C2 D2 , ε ; C2 , C | ε C c } C2 , ε } 37 38
SP LL(1) parsing algorithm Running LL(1) parser P { D; C} D d D2 Stack Remaining Input Action D2 , D | ε • Use the stack, input, and parse table with the following S {d,d;c}$ predict SP$ Cc C2 rules: P$ {d,d;c}$ predict P{D;C} C2 , C | ε { D ; C } $ {d,d;c}$ match { – Accept: if the symbol on the top of the stack is $ and the input D ; C } $ d,d;c}$ predict Dd D2 symbol is $, successful parse d D2 ; C } $ d,d;c}$ match d Derivation – match: if the symbol on the top of the stack is the same as the D2 ; C } $ ,d;c}$ predict D2,D SP$ next input token, pop the stack and advance the input , D ; C } $ ,d;c}$ match , { D;C}$ D ; C } $ d;c}$ predict Dd D2 – predict: if the top of the stack is a non-terminal M and the next {d D2 ; C } $ d D2 ; C } $ d;c}$ match d {d, D ; C } $ input token is a, remove M from the stack and push entry PT[M,a] D2 ; C } $ ;c}$ predict D2ε {d,d D2; C } $ on to the stack in reverse order ε ; C } $ ;c} $ match ; {d,d; C} $ – Error: Anything else is a syntax error C } $ c}$ predict Cc C2 {d,d; c C2}$ c C2 } $ c}$ match c {d,d; c} $ C2 } $ }$ predict C2ε } $ }$ match } $ $ accept Note that it is leftmost derivation
39 40
The expression example Parsing int*int
Stack Remaining input Action 1. ETE’ 2. E’+TE’|ε E$ int*int $ predicate ETE’ 3. TFT’ TE’$ int*int $ predicate TFT’ 4. T’*FT’|ε FT’E’ $ int*int $ predicate Fint 5. F(E)|int int T’ E’ $ int*int $ match int T’ E’ $ * int $ predicate T’*FT’ * F T’ E’ $ * int $ match * F T’ E’ $ int $ predicate Fint int T’ E’ $ int $ match int + * ( ) int $ T’ E’ $ $ predicate T’ε E ETE’ ETE’ E’ $ $ predicate E’ε E’ E’+TE’ E’ε E’ε $ $ match $. success. T TFT’ TFT’ T’ T’ε T’*FT’ T’ε T’ε F F(E) Fint 41 42
7 Parsing the wrong expression int*]+int Error handling Stack Remaining input Action • There are three types of error processing: E$ int*]+int $ predicate ETE’ – report, recovery, repair TE’$ int*]+int $ predicate TFT’ FT’E’ $ int*]+int $ predicate Fint • general principles int T’ E’ $ int*]+int $ match int – try to determine that an error has occurred as soon as possible. Waiting T’ E’ $ * ]+int $ predicate T’*FT’ too long before declaring an error can cause the parser to lose the actual location of the error. * F T’ E’ $ *]+ int $ match * F T’ E’ $ ]+int $ error, skip ] – Error report: A suitable and comprehensive message should be reported. F T’ E’$ + int $ PT[F, +] is sync, pop F “Missing semicolon on line 36” is helpful, “unable to shift in state 425” is not. T’E’$ +int$ predicate T’ ε – Error recovery: After an error has occurred, the parser must pick a likely E’ $ +int$ predicate E’+TE’ place to resume the parse. Rather than giving up at the first problem, a ...... parser should always try to parse as much of the code as possible in It is easy for LL(1) parsers to skip error order to find as many real errors as possible during a single run. – A parser should avoid cascading errors, which is when one error + * ( ) int ] $ generates a lengthy sequence of spurious error messages. E ETE’ ETE’ error E’ E’+TE’ E’ε error E’ε T TFT’ TFT’ error T’ T’ε T’*FT’ T’ε error T’ε
F sync (Follow(F)) sync F(E) sync Fint error sync 43 44
Error report Error Recovery
• report an error occurred and what and where possibly the error is; • Error recovery: a single error won’t stop the whole parsing. Instead, – Report expected vs found tokens by filling holes of parse table with error the parser will be able to resume the parsing at certain place after messages) the error; – Give up on current construct and restart later: – Delimiters help parser synch back up – Skip until find matching ), end, ], whatever – Use First and Follow to choose good synchronizing tokens + * ( ) int $ – Example: E ETE’ Err, int or ETE’ • duoble d; use Follow(TYPE) D TYPE ID SEMI ( expected • junk double d; use First(D) in line … • Error repair: Patch up simple errors and continue . E’ E’+TE’ E’ε E’ε – Insert missing token (;) T TFT’ TFT’ – Add declaration for undeclared name T’ T’ε T’*FT’ T’ε T’ε F F(E) Fint
45 46
Types of errors Summarize LL(1) parsing
• Types of errors • Massage the grammar – Lexical: @+2 • Captured by JLex – Remove ambiguity – Syntactical: x=3+*4; – Remove left recursion • Captured by javacup – Left factoring – Semantic: boolean x; x = 3+4; • Captured by type checker, not implemented in parser generators • Construct the LL(1) parse table – Logical: infinite loop – First(), Follow() • Not implemented in compilers – Fill in the table entry • Run the LL(1) parse program
47 48
8