CMSC 330: Organization of Programming Languages
Parsing

Recall: Steps of Compilation
• Lexing (aka scanning): converts strings to tokens
• Parsing: converts tokens to trees
Implementing Parsers
Many efficient techniques for parsing
• I.e., turning strings (token lists) into parse trees
• Examples
  - LL(k), SLR(k), LR(k), LALR(k)…
  - Take CMSC 430 for more details
One simple technique: recursive descent parsing
• This is a top-down parsing algorithm
• Other algorithms are bottom-up

Top-Down Parsing
E → id = n | { L }
L → E ; L | ε
(Assume: id is variable name, n is integer)
Show parse tree for
{ x = 3 ; { y = 4 ; } ; }
[Figure: parse tree for { x = 3 ; { y = 4 ; } ; }, built top-down starting from the root E]

Bottom-up Parsing
E → id = n | { L }
L → E ; L | ε
Show parse tree for
{ x = 3 ; { y = 4 ; } ; }
Note that the final trees constructed are the same as for top-down; only the order in which nodes are added to the tree is different
[Figure: parse tree for { x = 3 ; { y = 4 ; } ; }, built bottom-up starting from the leaves]

Example: Shift-Reduce Parsing
Replaces RHS of production with LHS (nonterminal)
Example grammar
• S → aA, A → Bc, B → b
Example parse
• abc → aBc → aA → S
• Derivation happens in reverse
Something to look forward to in CMSC 430
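
The example parse above can be reproduced with a small stack-based sketch. This is only an illustration under simplifying assumptions: the helper name shift_reduce is hypothetical, every symbol is a single character, and the code reduces greedily whenever the top of the stack matches a right-hand side. That happens to work for this grammar, but a real shift-reduce parser uses a table built from the grammar to choose between shifting and reducing.

```python
# Productions from the slide: S → aA, A → Bc, B → b,
# stored as (right-hand side, left-hand side) pairs.
PRODUCTIONS = [("aA", "S"), ("Bc", "A"), ("b", "B")]


def shift_reduce(tokens, start="S"):
    """Reduce tokens to the start symbol, returning each sentential form seen.

    Hypothetical helper: reduces greedily, which is enough for this grammar
    but is not how a table-driven LR parser makes its decisions.
    """
    stack = []
    rest = list(tokens)
    forms = ["".join(rest)]
    while True:
        for rhs, lhs in PRODUCTIONS:
            # Reduce: the top of the stack equals the RHS of a production.
            if stack[-len(rhs):] == list(rhs):
                stack[-len(rhs):] = [lhs]
                forms.append("".join(stack + rest))
                break
        else:
            # No reduction applies: shift the next token, or stop at end of input.
            if not rest:
                break
            stack.append(rest.pop(0))
    if stack != [start]:
        raise SyntaxError("input cannot be reduced to " + start)
    return forms
```

Calling shift_reduce("abc") walks through the same sentential forms as the slide, in the same right-to-left order of the derivation.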

Tradeoffs
Recursive descent parsers: easy to write & fast
• The formal definition is a little clunky
  - but if you follow the code then it's almost what you might have done if you weren't told about grammars formally
• Implemented with a simple table (and/or recursion)
Shift-reduce parsers handle more grammars
• Complicated to write; requires tools
  - yacc, bison, etc. convert a CFG to a shift-reduce parser
• Error messages may be confusing
Most languages use hacked parsers (!)
• Strange combination of the two

Recursive Descent Parsing
Goal
• Determine if we can produce the string to be parsed from the grammar's start symbol
Approach
• Recursively replace nonterminal with RHS of production
At each step, we'll keep track of two facts
• What tree node are we trying to match?
• What is the lookahead (next token of the input string)?
  - Helps guide selection of production used to replace nonterminal

Recursive Descent Parsing (cont.)
At each step, 3 possible cases
• If we're trying to match a terminal
  - If the lookahead is that token, then succeed, advance the lookahead, and continue
• If we're trying to match a nonterminal
  - Pick which production to apply based on the lookahead
• Otherwise fail with a parsing error

Parsing Example
E → id = n | { L }
L → E ; L | ε
• Here n is an integer and id is an identifier
One input might be
• { x = 3; { y = 4; }; }
• This would get turned into a list of tokens
  { x = 3 ; { y = 4 ; } ; }
• And we want to turn it into a parse tree
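
A minimal sketch of the string-to-tokens step for this example, assuming identifiers are letters, digits and underscores, integers are digit runs, and the only punctuation is = { } ;. The function name tokenize and the (kind, text) token shape are illustrative choices, not something the slides prescribe.

```python
import re

# One alternative per token kind. The group names "id" and "n" mirror the
# grammar's terminals; punctuation tokens use the character itself as kind.
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<n>\d+)|(?P<punct>[={};]))")


def tokenize(text):
    """Turn an input string into a list of (kind, text) pairs."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"no valid token at or after position {pos}")
        kind = m.lastgroup            # name of the alternative that matched
        value = m.group(kind)
        tokens.append((value, value) if kind == "punct" else (kind, value))
        pos = m.end()
    return tokens
```

For the input { x = 3; { y = 4; }; } this yields thirteen tokens whose text, read in order, is { x = 3 ; { y = 4 ; } ; }.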

Parsing Example (cont.)
E → id = n | { L }
L → E ; L | ε
{ x = 3 ; { y = 4 ; } ; }
Parse tree (children indented under their parent):
E
  {
  L
    E
      x = 3
    ;
    L
      E
        {
        L
          E
            y = 4
          ;
          L
            ε
        }
      ;
      L
        ε
  }

Recursive Descent Parsing (cont.)
Key step
• Choosing which production should be selected
Two approaches
• Backtracking
  - Choose some production
  - If fails, try different production
  - Parse fails if all choices fail
• Predictive parsing
  - Analyze grammar to find FIRST sets for productions
  - Compare with lookahead to decide which production to select
  - Parse fails if lookahead does not match FIRST
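
The backtracking approach can be sketched as a recognizer that tries the productions of a nonterminal in order and rewinds the input position when one fails. The names GRAMMAR, derive and accepts are hypothetical, and tokens are assumed to be already classified (every identifier appears as "id", every integer as "n"). This simple version commits to the first production that succeeds, which is enough for the example grammar but is not full backtracking over all choices.

```python
# E → id = n | { L }      L → E ; L | ε   (the empty list is the ε production)
GRAMMAR = {
    "E": [["id", "=", "n"], ["{", "L", "}"]],
    "L": [["E", ";", "L"], []],
}


def derive(symbol, tokens, pos):
    """Try to match symbol starting at tokens[pos]; return new position or None."""
    if symbol not in GRAMMAR:                       # terminal
        if pos < len(tokens) and tokens[pos] == symbol:
            return pos + 1
        return None
    for rhs in GRAMMAR[symbol]:                     # choose some production
        cur = pos                                   # each attempt restarts at pos
        for sym in rhs:
            cur = derive(sym, tokens, cur)
            if cur is None:
                break                               # failed: try a different production
        else:
            return cur                              # every symbol of rhs matched
    return None                                     # all choices failed


def accepts(tokens):
    """True if the whole token list can be derived from the start symbol E."""
    return derive("E", tokens, 0) == len(tokens)
```

A predictive parser avoids the retries entirely by using the lookahead to pick the production before descending.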

First Sets
Motivating example
• The lookahead is x
• Given grammar S → xyz | abc
  - Select S → xyz since 1st terminal in RHS matches x
• Given grammar S → A | B   A → x | y   B → z
  - Select S → A, since A can derive string beginning with x
In general
• Choose a production that can derive a sentential form beginning with the lookahead
• Need to know what terminal may be first in any sentential form derived from a nonterminal / production

First Sets
Definition
• First(γ), for any terminal or nonterminal γ, is the set of initial terminals of all strings that γ may expand to
• We'll use this to decide what production to apply
Examples
• Given grammar S → xyz | abc
  - First(xyz) = { x }, First(abc) = { a }
  - First(S) = First(xyz) ∪ First(abc) = { x, a }
• Given grammar S → A | B   A → x | y   B → z
  - First(x) = { x }, First(y) = { y }, First(A) = { x, y }
  - First(z) = { z }, First(B) = { z }
  - First(S) = { x, y, z }
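
One way to compute First sets mechanically is to repeat the definition until nothing changes (a fixed-point iteration). The sketch below is one such approach, not necessarily the one used in the course; the names EPS, first_of_sequence and first_sets are hypothetical, and a grammar is assumed to be a dict mapping each nonterminal to a list of right-hand sides (lists of symbols, with the empty list standing for ε).

```python
EPS = "ε"


def first_of_sequence(symbols, first):
    """First set of a sequence of symbols, given the First set of each symbol."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result            # sym cannot vanish, so later symbols never lead
    result.add(EPS)                  # every symbol (or none at all) can derive ε
    return result


def first_sets(grammar, terminals):
    """Compute First for every terminal and nonterminal of the grammar."""
    first = {t: {t} for t in terminals}
    first.update({n: set() for n in grammar})
    changed = True
    while changed:                   # iterate until no set grows any further
        changed = False
        for n, productions in grammar.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[n]:
                    first[n] |= new
                    changed = True
    return first
```

For the grammar S → A | B, A → x | y, B → z, written as {"S": [["A"], ["B"]], "A": [["x"], ["y"]], "B": [["z"]]} with terminals "xyz", the result for S is the set containing x, y and z, matching the example above.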

Calculating First(γ)
For a terminal a
• First(a) = { a }
For a nonterminal N
• If N → ε, then add ε to First(N)
• If N → α1 α2 ... αn, then (note the αi are all the symbols on the right side of one single production):
  - Add First(α1 α2 ... αn) to First(N), where First(α1 α2 ... αn) is defined as
    - First(α1) if ε ∉ First(α1)
    - Otherwise (First(α1) – {ε}) ∪ First(α2 ... αn)
  - If ε ∈ First(αi) for all i, 1 ≤ i ≤ n, then add ε to First(N)

First(γ) Examples
E → id = n | { L }
L → E ; L | ε
First(id) = { id }
First("=") = { "=" }
First(n) = { n }
First("{") = { "{" }
First("}") = { "}" }
First(";") = { ";" }
First(E) = { id, "{" }
First(L) = { id, "{", ε }

E → id = n | { L } | ε
L → E ; L
First(id) = { id }
First("=") = { "=" }
First(n) = { n }
First("{") = { "{" }
First("}") = { "}" }
First(";") = { ";" }
First(E) = { id, "{", ε }
First(L) = { id, "{", ";" }

Recursive Descent Parser Implementation
For terminals, create function match(a)
• If lookahead is a, it consumes the lookahead by advancing the lookahead to the next token, and returns
• Otherwise fails with a parse error if lookahead is not a
• In algorithm descriptions, consider parse_a, parse_term(a) to be aliases for match(a)
For each nonterminal N, create a function parse_N
• Called when we're trying to parse a part of the input which corresponds to (or can be derived from) N
• parse_S for the start symbol S begins the parse

Parser Implementation (cont.)
The body of parse_N for a nonterminal N does the following
• Let N → β1 | ... | βk be the productions of N
  - Here βi is the entire right side of a production: a sequence of terminals and nonterminals
• Pick the production N → βi such that the lookahead is in First(βi)
  - It must be that First(βi) ∩ First(βj) = ∅ for i ≠ j
  - If there is no such production, but N → ε then return
  - Otherwise fail with a parse error
• Suppose βi = α1 α2 ... αn. Then call parse_α1(); ... ; parse_αn() to match the expected right-hand side, and return
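
The recipe above can be sketched for the grammar E → id = n | { L }, L → E ; L | ε. This is an illustrative sketch, not the course's reference code: the class name Parser, the "EOF" marker and the tuple-shaped tree are assumptions, and tokens are assumed to be already classified so that identifiers appear as "id" and integers as "n".

```python
class Parser:
    """Predictive recursive descent parser for E → id = n | { L }, L → E ; L | ε."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def lookahead(self):
        # "EOF" is an assumed marker for running out of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "EOF"

    def match(self, a):
        """Consume terminal a if it is the lookahead; otherwise fail."""
        if self.lookahead() != a:
            raise SyntaxError(f"expected {a!r}, found {self.lookahead()!r}")
        self.pos += 1
        return a

    def parse_E(self):
        if self.lookahead() == "id":            # First(id = n) = { id }
            return ("E", self.match("id"), self.match("="), self.match("n"))
        if self.lookahead() == "{":             # First({ L }) = { "{" }
            return ("E", self.match("{"), self.parse_L(), self.match("}"))
        raise SyntaxError(f"unexpected {self.lookahead()!r} at start of E")

    def parse_L(self):
        if self.lookahead() in ("id", "{"):     # First(E ; L) = { id, "{" }
            return ("L", self.parse_E(), self.match(";"), self.parse_L())
        return ("L", "ε")                       # no production matches, but L → ε


def parse(tokens):
    """Parse a whole token list from the start symbol E and return the tree."""
    parser = Parser(tokens)
    tree = parser.parse_E()
    if parser.lookahead() != "EOF":
        raise SyntaxError(f"unexpected {parser.lookahead()!r} after end of E")
    return tree
```

Each parse_N returns a tuple whose first element is the nonterminal and whose remaining elements are its children, so parse("{ id = n ; { id = n ; } ; }".split()) returns a nested tuple with the same shape as the parse tree for { x = 3 ; { y = 4 ; } ; }.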

Parser Implementation (cont.)
Parse is built on procedure calls
Procedures may be (mutually) recursive
Note:
• These procedures are imperative: assume there is a global list of tokens we are consuming
• In OCaml, the list of tokens would be passed as an argument, and returned in the result

Recursive Descent Parser
Given grammar S → xyz | abc
• First(xyz) = { x }, First(abc) = { a }
Parser
parse_S( ) {
  if (lookahead == x ) {
    match( x ); match( y ); match( z ); // S → xyz
  }
  else if (lookahead == a ) {
    match( a ); match( b ); match( c