CMSC 330: Organization of Programming Languages Recall

Recall: Steps of Compilation CMSC 330: Organization of Programming Languages Parsing (aka scanning) Converts Converts strings to tokens to tokens trees CMSC 330 1 CMSC 330 2 Implementing Parsers Top-Down Parsing E → id = n | { L } Many efficient techniques for parsing E • I.e., turning strings (token lists) into parse trees L → E ; L | ε • Examples L ! LL(k), SLR(k), LR(k), LALR(k)… (Assume: id is E L ! Take CMSC 430 for more details variable name, n is integer) E L One simple technique: recursive descent parsing ε • This is a top-down parsing algorithm L Show parse tree for • Other algorithms are bottom-up E { x = 3 ; { y = 4 ; } ; } L ε { x = 3 ; { y = 4 ; } ; } CMSC 330 3 CMSC 330 4 Bottom-up Parsing Example: Shift-Reduce Parsing E → id = n | { L } Replaces RHS of production with LHS L → E ; L | ε E (nonterminal) Show parse tree for L { x = 3 ; { y = 4 ; } ; } Example grammar E L • S → aA, A → Bc, B → b Note that final trees constructed are same E L Example parse as for top-down; only order in which nodes ε • abc aBc aA S are added to tree is L different • Derivation happens in reverse E L Something to look forward to in CMSC 430 ε { x = 3 ; { y = 4 ; } ; } CMSC 330 5 CMSC 330 6 Tradeoffs Recursive Descent Parsing Recursive descent parsers: easy to write & fast Goal • The formal definition is a little clunky • Determine if we can produce the string to be parsed ! but if you follow the code then it’s almost what you might from the grammar's start symbol have done if you weren't told about grammars formally Approach • Implemented with a simple table (and/or recursion) • Recursively replace nonterminal with RHS of production Shift-reduce parsers handle more grammars At each step, we'll keep track of two facts • Complicated to write; requires tools • What tree node are we trying to match? ! yacc, bison, etc. convert a CFG to a shift-reduce parser • What is the lookahead (next token of the input string)? • Error messages may be confusing ! Helps guide selection of production used to replace nonterminal Most languages use hacked parsers (!) • Strange combination of the two CMSC 330 7 CMSC 330 8 Recursive Descent Parsing (cont.) Parsing Example At each step, 3 possible cases E → id = n | { L } • If were trying to match a terminal L → E ; L | ε ! If the lookahead is that token, then succeed, advance the • Here n is an integer and id is an identifier lookahead, and continue • If were trying to match a nonterminal ! Pick which production to apply based on the lookahead One input might be • Otherwise fail with a parsing error • { x = 3; { y = 4; }; } • This would get turned into a list of tokens { x = 3 ; { y = 4 ; } ; } • And we want to turn it into a parse tree CMSC 330 9 CMSC 330 10 Parsing Example (cont.) Recursive Descent Parsing (cont.) E E → id = n | { L } Key step { L } L → E ; L | ε • Choosing which production should be selected E ; L Two approaches { x = 3 ; { y = 4 ; } ; } x = 3 E ; L • Backtracking ! Choose some production { L } ε ! If fails, try different production ! Parse fails if all choices fail E ; L • Predictive parsing y = 4 ε ! Analyze grammar to find FIRST sets for productions ! Compare with lookahead to decide which production to select ! Parse fails if lookahead does not match FIRST CMSC 330 11 CMSC 330 12 First Sets First Sets Motivating example Definition • The lookahead is x • First(γ), for any terminal or nonterminal γ, is the set of initial terminals of all strings that γ may expand to • Given grammar S → xyz | abc • Well use this to decide what production to apply ! Select S → xyz since 1st terminal in RHS matches x • Given grammar S → A | B A → x | y B → z Examples ! Select S → A, since A can derive string beginning with x • Given grammar S → xyz | abc ! First(xyz) = { x }, First(abc) = { a } In general ! First(S) = First(xyz) U First(abc) = { x, a } • Choose a production that can derive a sentential form • Given grammar S → A | B A → x | y B → z beginning with the lookahead ! First(x) = { x }, First(y) = { y }, First(A) = { x, y } • Need to know what terminal may be first in any ! First(z) = { z }, First(B) = { z } sentential form derived from a nonterminal / production ! First(S) = { x, y, z } CMSC 330 13 CMSC 330 14 Calculating First(γ) First( ) Examples For a terminal a E → id = n | { L } E → id = n | { L } | ε • First(a) = { a } L → E ; L | ε L → E ; L For a nonterminal N • If N → ε, then add ε to First(N) First(id) = { id } First(id) = { id } • If N → α α ... α , then (note the α are all the 1 2 n i First("=") = { "=" } First("=") = { "=" } symbols on the right side of one single production): First(n) = { n } First(n) = { n } ! Add First(α1α2 ... αn) to First(N), where First(α1 α2 ... αn) is defined as First("{")= { "{" } First("{")= { "{" } • First(α ) if ε First(α ) 1 ∉ 1 First("}")= { "}" } First("}")= { "}" } • Otherwise (First(α1) – ε) First(α2 ... αn) ! If ε ∈ First(αi) for all i, 1 ≤ i ≤ k, then add ε to First(N) First(";")= { ";" } First(";")= { ";" } First(E) = { id, "{" } First(E) = { id, "{", ε } First(L) = { id, "{", ε } First(L) = { id, "{", ";" } CMSC 330 15 CMSC 330 16 Recursive Descent Parser Implementation Parser Implementation (cont.) For terminals, create function match(a) The body of parse_N for a nonterminal N does • If lookahead is a it consumes the lookahead by the following advancing the lookahead to the next token, and returns • Let N → β1 | ... | βk be the productions of N • Otherwise fails with a parse error if lookahead is not a ! Here βi is the entire right side of a production- a sequence of • In algorithm descriptions, consider parse_a, terminals and nonterminals parse_term(a) to be aliases for match(a) • Pick the production N → βi such that the lookahead is in First(βi) For each nonterminal N, create a function parse_N ! It must be that First(βi) ∩ First(βj) = ∅ for i ≠ j • Called when were trying to parse a part of the input ! If there is no such production, but N → ε then return which corresponds to (or can be derived from) N ! Otherwise fail with a parse error • parse_S for the start symbol S begins the parse • Suppose βi = α1 α2 ... αn. Then call parse_α1(); ... ; parse_αn() to match the expected right-hand side, and return CMSC 330 17 CMSC 330 18 Parser Implementation (cont.) Recursive Descent Parser Parse is built on procedure calls Given grammar S → xyz | abc Procedures may be (mutually) recursive • First(xyz) = { x }, First(abc) = { a } Parser parse_S( ) { Note: if (lookahead == x) { • These procedures are imperative: assume there is a match(x); match(y); match(z); // S → xyz global list of tokens we are consuming } • In Ocaml, the list of tokens would be passed as an else if (lookahead == a) { argument, and returned in the result match(a); match(b); match(c); // S → abc ! E.g., match(a) becomes match_tok a l } • Argument l is list of tokens else error( ); • Return value is token list minus a, if a was at the front, } or throws an exception if not CMSC 330 19 CMSC 330 20 Recursive Descent Parser Example Given grammar S → A | B A → x | y B → z E → id = n | { L } First(E) = { id, "{" } • First(A) = { x, y }, First(B) = { z } L → E ; L | ε Parser parse_E( ) { parse_L( ) { parse_S( ) { parse_A( ) { if (lookahead == id) { if ((lookahead == id) || if (lookahead == x) if ((lookahead == x) || match(id); (lookahead == {)) { match(x); // A → x match(=); // E → id = n parse_E( ); (lookahead == y)) else if (lookahead == y) match(n); match(;); // L → E ; L parse_A( ); // S → A match(y); // A → y } parse_L( ); else if (lookahead == z) else error( ); else if (lookahead == {) { } parse_B( ); // S → B } match({); else ; // L → ε parse_B( ) { else error( ); parse_L( ); // E → { L } } if (lookahead == z) match(}); } match(z); // B → z } else error( ); else error( ); } } CMSC 330 21 CMSC 330 22 Things to Notice Things to Notice (cont.) If you draw the execution trace of the parser This is a predictive parser • You get the parse tree • Because the lookahead determines exactly which production to use Examples • Grammar • Grammar This parsing strategy may fail on some grammars S → xyz S → A | B • Production First sets overlap S → abc A → x | y • Production First sets contain ε • String xyz B → z • Possible infinite recursion parse_S( ) • String x S S Does not mean grammar is not usable match(x) parse_S( ) /|\ | • Just means this parsing method not powerful enough match(y) parse_A( ) A x y z • May be able to change grammar match(z) match(x) | x CMSC 330 23 CMSC 330 24 Conflicting FIRST Sets Left Factoring Algorithm Consider parsing the grammar E → ab | ac Given grammar • First(ab) = a Parser cannot choose between • A → xα1 | xα2 | … | xαn | β RHS based on lookahead! • First(ac) = a Rewrite grammar as Parser fails whenever A → α1 | α2 and • A → xL | β • First(α1) ∩ First(α2) != ε or • L → α1 | α2 | … | αn Solution Repeat as necessary • Rewrite grammar using left factoring Examples • S → ab | ac S → aL L → b | c • S → abcA | abB | a S → aL L → bcA | bB | ε • L → bcA | bB | ε L → bL | ε L → cA | B CMSC 330 25 CMSC 330 26 Alternative Approach Left Recursion Change structure of parser Consider grammar S → Sa | ε • First match common prefix of productions • Try writing parser parse_S( ) { • Then use lookahead to choose between productions if (lookahead == a) { parse_S( ); Example match(a); // S → Sa Consider parsing the grammar E → a+b | a*b | a } • else { } parse_E( ) { } match(a); // common prefix if (lookahead == +) { // E → a+b match(+); match(b); } • Body of parse_S( ) has an infinite loop if (lookahead == *) { // E → a*b ! if (lookahead = "a") then parse_S( ) match(*);

Load more