CMSC 330: Organization of Programming Languages

Parsing

Recall: Steps of Compilation

[Figure: compilation pipeline. Lexing (aka scanning) converts strings to tokens; parsing converts tokens to trees.]

Implementing Parsers

Many efficient techniques for parsing
• I.e., turning strings (token lists) into parse trees
• Examples
  - LL(k), SLR(k), LR(k), LALR(k)…
  - Take CMSC 430 for more details

One simple technique: recursive descent parsing
• This is a top-down parsing algorithm
• Other algorithms are bottom-up

Top-Down Parsing

E → id = n | { L }
L → E ; L | ε
(Assume: id is a variable name, n is an integer)

Show the parse tree for { x = 3 ; { y = 4 ; } ; }

[Figure: parse tree for { x = 3 ; { y = 4 ; } ; }, built from the root E downward]

Bottom-up Parsing

E → id = n | { L }
L → E ; L | ε

Show the parse tree for { x = 3 ; { y = 4 ; } ; }

Note that the final trees constructed are the same as for top-down; only the order in which nodes are added to the tree is different

[Figure: parse tree for { x = 3 ; { y = 4 ; } ; }, built from the leaves upward]

Example: Shift-Reduce Parsing

Replaces RHS of production with LHS (nonterminal)

Example grammar
• S → aA, A → Bc, B → b

Example parse
• abc → aBc → aA → S
• Derivation happens in reverse

Something to look forward to in CMSC 430


Tradeoffs

Recursive descent parsers: easy to write & fast
• The formal definition is a little clunky
  - but if you follow the code then it's almost what you might have done if you weren't told about grammars formally
• Implemented with a simple table (and/or recursion)

Shift-reduce parsers handle more grammars
• Complicated to write; requires tools
  - yacc, bison, etc. convert a CFG to a shift-reduce parser
• Error messages may be confusing

Most languages use hacked parsers (!)
• Strange combination of the two

Recursive Descent Parsing

Goal
• Determine if we can produce the string to be parsed from the grammar's start symbol

Approach
• Recursively replace nonterminal with RHS of production

At each step, we'll keep track of two facts
• What tree node are we trying to match?
• What is the lookahead (next token of the input string)?
  - Helps guide selection of production used to replace nonterminal

Recursive Descent Parsing (cont.)

At each step, 3 possible cases
• If we're trying to match a terminal
  - If the lookahead is that token, then succeed, advance the lookahead, and continue
• If we're trying to match a nonterminal
  - Pick which production to apply based on the lookahead
• Otherwise fail with a parsing error

Parsing Example

E → id = n | { L }
L → E ; L | ε
• Here n is an integer and id is an identifier

One input might be
• { x = 3; { y = 4; }; }
• This would get turned into a list of tokens
  { x = 3 ; { y = 4 ; } ; }
• And we want to turn it into a parse tree
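
The step from the input string to a list of tokens can be sketched as follows. This is a minimal Python illustration, assuming tokens are represented as (kind, text) pairs and that identifiers and integers have the usual shapes; the names `TOKEN_RE` and `tokenize` are hypothetical and not part of the course code.

```python
import re

# Assumed token shapes: identifiers, unsigned integers, and the punctuation { } = ;
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<n>\d+)|(?P<punct>[{}=;]))")

def tokenize(text):
    """Turn a string into a list of (kind, text) tokens."""
    tokens = []
    pos = 0
    text = text.rstrip()                       # trailing whitespace is not a token
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if m.lastgroup == "punct":
            # punctuation tokens use the character itself as their kind
            tokens.append((m.group("punct"), m.group("punct")))
        else:
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens
```

For the input `{ x = 3; { y = 4; }; }` this yields thirteen tokens, starting with `("{", "{")`, `("id", "x")`, `("=", "=")`, `("n", "3")`, `(";", ";")`.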


Parsing Example (cont.)

E → id = n | { L }
L → E ; L | ε

{ x = 3 ; { y = 4 ; } ; }

Parse tree (children indented under their parent):

E
  {
  L
    E
      x = 3
    ;
    L
      E
        {
        L
          E
            y = 4
          ;
          L
            ε
        }
      ;
      L
        ε
  }

Recursive Descent Parsing (cont.)

Key step
• Choosing which production should be selected

Two approaches
• Backtracking
  - Choose some production
  - If fails, try different production
  - Parse fails if all choices fail
• Predictive parsing
  - Analyze grammar to find FIRST sets for productions
  - Compare with lookahead to decide which production to select
  - Parse fails if lookahead does not match FIRST

First Sets

Motivating example
• The lookahead is x
• Given grammar S → xyz | abc
  - Select S → xyz since 1st terminal in RHS matches x
• Given grammar S → A | B   A → x | y   B → z
  - Select S → A, since A can derive string beginning with x

In general
• Choose a production that can derive a sentential form beginning with the lookahead
• Need to know what terminal may be first in any sentential form derived from a nonterminal / production

Definition
• First(γ), for any terminal or nonterminal γ, is the set of initial terminals of all strings that γ may expand to
• We'll use this to decide what production to apply

Examples
• Given grammar S → xyz | abc
  - First(xyz) = { x }, First(abc) = { a }
  - First(S) = First(xyz) ∪ First(abc) = { x, a }
• Given grammar S → A | B   A → x | y   B → z
  - First(x) = { x }, First(y) = { y }, First(A) = { x, y }
  - First(z) = { z }, First(B) = { z }
  - First(S) = { x, y, z }
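
The First sets in these examples can be checked mechanically. Below is a minimal Python sketch that repeats a pass over all productions until no First set grows (a fixed-point iteration). It assumes a grammar is encoded as a dict from each nonterminal to a list of right-hand sides (tuples of symbols, with the empty tuple standing for ε); that encoding and the names `EPS`, `first_of_sequence`, and `compute_first` are choices made for this illustration. The ε handling is not needed for the two grammars above but is included so the sketch also covers grammars with ε-productions.

```python
EPS = "ε"  # marker for the empty string

def first_of_sequence(symbols, first):
    """First set of a sequence of symbols, given the First sets known so far."""
    result = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})   # a terminal's First set is itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result                   # this symbol cannot vanish: stop here
    result.add(EPS)                         # every symbol can vanish (or the RHS is ε)
    return result

def compute_first(grammar):
    """Return a dict mapping each nonterminal to its First set."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                          # iterate until nothing changes
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first
```

For example, `compute_first({"S": [("A",), ("B",)], "A": [("x",), ("y",)], "B": [("z",)]})` gives `{ x, y, z }` for S, `{ x, y }` for A, and `{ z }` for B.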


Calculating First(γ)

For a terminal a
• First(a) = { a }

For a nonterminal N
• If N → ε, then add ε to First(N)
• If N → α1 α2 ... αn, then (note the αi are all the symbols on the right side of one single production):
  - Add First(α1 α2 ... αn) to First(N), where First(α1 α2 ... αn) is defined as
    • First(α1) if ε ∉ First(α1)
    • Otherwise (First(α1) - {ε}) ∪ First(α2 ... αn)
  - If ε ∈ First(αi) for all i, 1 ≤ i ≤ n, then add ε to First(N)

First( ) Examples

Grammar 1
E → id = n | { L }
L → E ; L | ε

Grammar 2
E → id = n | { L } | ε
L → E ; L

For both grammars
First(id) = { id }   First("=") = { "=" }   First(n) = { n }
First("{") = { "{" }   First("}") = { "}" }   First(";") = { ";" }

Grammar 1: First(E) = { id, "{" }   First(L) = { id, "{", ε }
Grammar 2: First(E) = { id, "{", ε }   First(L) = { id, "{", ";" }

Recursive Descent Parser Implementation

For terminals, create function match(a)
• If lookahead is a, it consumes the lookahead by advancing the lookahead to the next token, and returns
• Otherwise fails with a parse error if lookahead is not a
• In algorithm descriptions, consider parse_a, parse_term(a) to be aliases for match(a)

For each nonterminal N, create a function parse_N
• Called when we're trying to parse a part of the input which corresponds to (or can be derived from) N
• parse_S for the start symbol S begins the parse

Parser Implementation (cont.)

The body of parse_N for a nonterminal N does the following
• Let N → β1 | ... | βk be the productions of N
  - Here βi is the entire right side of a production: a sequence of terminals and nonterminals
• Pick the production N → βi such that the lookahead is in First(βi)
  - It must be that First(βi) ∩ First(βj) = ∅ for i ≠ j
  - If there is no such production, but N → ε then return
  - Otherwise fail with a parse error
• Suppose βi = α1 α2 ... αn. Then call parse_α1(); ... ; parse_αn() to match the expected right-hand side, and return

Parser Implementation (cont.)

Parse is built on procedure calls

Procedures may be (mutually) recursive

Note:
• These procedures are imperative: assume there is a global list of tokens we are consuming
• In OCaml, the list of tokens would be passed as an argument, and returned in the result
  - E.g., match(a) becomes match_tok a l
    • Argument l is list of tokens
    • Return value is token list minus a, if a was at the front, or throws an exception if not

Recursive Descent Parser

Given grammar S → xyz | abc
• First(xyz) = { x }, First(abc) = { a }

Parser

    parse_S( ) {
      if (lookahead == x) {
        match(x); match(y); match(z);   // S → xyz
      }
      else if (lookahead == a) {
        match(a); match(b); match(c);   // S → abc
      }
      else error( );
    }

Given grammar S → A | B   A → x | y   B → z
• First(A) = { x, y }, First(B) = { z }

Parser

    parse_S( ) {
      if ((lookahead == x) || (lookahead == y))
        parse_A( );                     // S → A
      else if (lookahead == z)
        parse_B( );                     // S → B
      else error( );
    }

    parse_A( ) {
      if (lookahead == x)
        match(x);                       // A → x
      else if (lookahead == y)
        match(y);                       // A → y
      else error( );
    }

    parse_B( ) {
      if (lookahead == z)
        match(z);                       // B → z
      else error( );
    }

Example

E → id = n | { L }     First(E) = { id, "{" }
L → E ; L | ε

Parser

    parse_E( ) {
      if (lookahead == id) {
        match(id);
        match(=);                       // E → id = n
        match(n);
      }
      else if (lookahead == {) {
        match({);
        parse_L( );                     // E → { L }
        match(});
      }
      else error( );
    }

    parse_L( ) {
      if ((lookahead == id) || (lookahead == {)) {
        parse_E( );
        match(;);                       // L → E ; L
        parse_L( );
      }
      else ;                            // L → ε
    }
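
The parser for E and L above can be written in the token-passing style described for OCaml, where each function takes the remaining token list and returns what is left after consuming its part. The following is a minimal Python sketch of that idea, not the course's implementation. It assumes tokens are plain strings already separated by whitespace, and that `str.isidentifier` and `str.isdigit` are adequate tests for id and n; `ParseError`, `lookahead`, and `accepts` are names invented for this example.

```python
class ParseError(Exception):
    pass

def is_id(tok):
    return tok is not None and tok.isidentifier()

def is_n(tok):
    return tok is not None and tok.isdigit()

def lookahead(toks):
    return toks[0] if toks else None

def match_tok(expected, toks):
    """Consume one token matching `expected` (a string or a predicate)."""
    tok = lookahead(toks)
    ok = expected(tok) if callable(expected) else tok == expected
    if not ok:
        raise ParseError(f"unexpected token {tok!r}")
    return toks[1:]                       # token list minus the matched token

def parse_E(toks):
    tok = lookahead(toks)
    if is_id(tok):                        # E → id = n
        toks = match_tok(is_id, toks)
        toks = match_tok("=", toks)
        return match_tok(is_n, toks)
    if tok == "{":                        # E → { L }
        toks = match_tok("{", toks)
        toks = parse_L(toks)
        return match_tok("}", toks)
    raise ParseError(f"unexpected token {tok!r}")

def parse_L(toks):
    tok = lookahead(toks)
    if is_id(tok) or tok == "{":          # L → E ; L (lookahead in First(E))
        toks = parse_E(toks)
        toks = match_tok(";", toks)
        return parse_L(toks)
    return toks                           # L → ε

def accepts(text):
    """True if the whole input can be derived from E."""
    try:
        return parse_E(text.split()) == []
    except ParseError:
        return False
```

With these definitions, `accepts("{ x = 3 ; { y = 4 ; } ; }")` is true, while `accepts("{ x = 3 }")` is false because the `;` required by L → E ; L is missing.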

Things to Notice

If you draw the execution trace of the parser
• You get the parse tree

Examples
• Grammar S → xyz, S → abc; string xyz

  Execution trace:
      parse_S( )
        match(x)
        match(y)
        match(z)

  Parse tree:
        S
       /|\
      x y z

• Grammar S → A | B, A → x | y, B → z; string x

  Execution trace:
      parse_S( )
        parse_A( )
          match(x)

  Parse tree:
      S
      |
      A
      |
      x

Things to Notice (cont.)

This is a predictive parser
• Because the lookahead determines exactly which production to use

This parsing strategy may fail on some grammars
• Production First sets overlap
• Production First sets contain ε
• Possible infinite recursion

Does not mean grammar is not usable
• Just means this parsing method is not powerful enough
• May be able to change grammar

Conflicting FIRST Sets

Consider parsing the grammar E → ab | ac
• First(ab) = { a }
• First(ac) = { a }
• Parser cannot choose between RHS based on lookahead!

Parser fails whenever A → α1 | α2 and
• First(α1) ∩ First(α2) ≠ ∅

Solution
• Rewrite grammar using left factoring

Left Factoring Algorithm

Given grammar
• A → xα1 | xα2 | … | xαn | β

Rewrite grammar as
• A → xL | β
• L → α1 | α2 | … | αn

Repeat as necessary

Examples
• S → ab | ac   becomes   S → aL,  L → b | c
• S → abcA | abB | a   becomes   S → aL,  L → bcA | bB | ε
• L → bcA | bB | ε   becomes   L → bL' | ε,  L' → cA | B
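
One step of the left factoring rewrite can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: right-hand sides are tuples of symbols (the empty tuple stands for ε), only a single shared first symbol x is factored per call as on the slide, and the caller supplies the name of the new nonterminal. The function name `left_factor` and its parameters are hypothetical.

```python
def left_factor(nonterminal, alternatives, fresh_name):
    """Apply one left-factoring step to the alternatives of `nonterminal`.

    Returns a dict of productions. If no two alternatives share a first
    symbol, the alternatives are returned unchanged.
    """
    groups = {}
    for rhs in alternatives:
        key = rhs[0] if rhs else None          # ε has no first symbol
        groups.setdefault(key, []).append(rhs)

    # pick the first symbol x shared by at least two alternatives
    shared = next((x for x, g in groups.items()
                   if x is not None and len(g) > 1), None)
    if shared is None:
        return {nonterminal: list(alternatives)}

    kept = [rhs for rhs in alternatives if not rhs or rhs[0] != shared]
    tails = [rhs[1:] for rhs in groups[shared]]
    return {
        nonterminal: [(shared, fresh_name)] + kept,   # A → xL | β
        fresh_name: tails,                            # L → α1 | ... | αn
    }
```

Applied to S → abcA | abB | a with new nonterminal L, this gives S → aL and L → bcA | bB | ε; applying it again to L with new nonterminal L' gives L → bL' | ε and L' → cA | B, matching the examples.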


Alternative Approach

Change structure of parser
• First match common prefix of productions
• Then use lookahead to choose between productions

Example
• Consider parsing the grammar E → a+b | a*b | a

    parse_E( ) {
      match(a);                         // common prefix
      if (lookahead == +) {             // E → a+b
        match(+);
        match(b);
      }
      else if (lookahead == *) {        // E → a*b
        match(*);
        match(b);
      }
      else { }                          // E → a
    }

Left Recursion

Consider grammar S → Sa | ε
• Try writing parser

    parse_S( ) {
      if (lookahead == a) {
        parse_S( );
        match(a);                       // S → Sa
      }
      else { }
    }

• Body of parse_S( ) has an infinite loop
  - if (lookahead == "a") then parse_S( ), which is called again with the same lookahead
• Infinite loop occurs in grammar with left recursion

Right Recursion

Consider grammar S → aS | ε
• Again, First(aS) = { a }
• Try writing parser

    parse_S( ) {
      if (lookahead == a) {
        match(a);
        parse_S( );                     // S → aS
      }
      else { }
    }

• Will parse_S( ) infinite loop?
  - No: invoking match( ) advances the lookahead, so the recursion eventually stops
• Top-down parsers handle grammars with right recursion

Algorithm To Eliminate Left Recursion

Given grammar
• A → Aα1 | Aα2 | … | Aαn | β
  - β must exist or derivation will not yield string

Rewrite grammar as (repeat as needed)
• A → βL
• L → α1L | α2L | … | αnL | ε

Replaces left recursion with right recursion

Examples
• S → Sa | ε   becomes   S → L,  L → aL | ε
• S → Sa | Sb | c   becomes   S → cL,  L → aL | bL | ε
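
The rewrite above is mechanical enough to express as a short function. The following Python sketch assumes right-hand sides are tuples of symbols (the empty tuple stands for ε), handles only immediate left recursion, and lets the caller name the new nonterminal. Where the slide shows a single β, the sketch allows several non-recursive alternatives and appends L to each, which is a generalization made here and not stated in the source. The name `eliminate_left_recursion` is hypothetical.

```python
def eliminate_left_recursion(nonterminal, alternatives, fresh_name):
    """Rewrite A → Aα1 | ... | Aαn | β as A → βL, L → α1L | ... | αnL | ε."""
    # the α parts: what follows the leading A in each left-recursive alternative
    recursive = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == nonterminal]
    # the β parts: alternatives that do not start with A (possibly ε)
    others = [rhs for rhs in alternatives if not rhs or rhs[0] != nonterminal]

    if not recursive:
        return {nonterminal: list(alternatives)}      # nothing to do
    if not others:
        raise ValueError("no non-recursive alternative: derivation yields no string")

    return {
        nonterminal: [beta + (fresh_name,) for beta in others],                 # A → βL
        fresh_name: [alpha + (fresh_name,) for alpha in recursive] + [()],      # L → αL | ε
    }
```

For S → Sa | Sb | c with new nonterminal L, the result is S → cL and L → aL | bL | ε, as in the second example.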


What's Wrong With Parse Trees?

Parse trees contain too much information
• Example
  - Parentheses
  - Extra nonterminals for precedence
• This extra stuff is needed for parsing

But when we want to reason about languages
• Extra information gets in the way (too much detail)

Abstract Syntax Trees (ASTs)

An abstract syntax tree is a more compact, abstract representation of a parse tree, with only the essential parts

[Figure: a parse tree alongside the corresponding AST]

Abstract Syntax Trees (cont.)

Intuitively, ASTs correspond to the data structure you'd use to represent strings in the language
• Note that grammars describe trees
• So do OCaml datatypes (which we'll see later)
• E → a | b | c | E+E | E-E | E*E | (E)

Producing an AST

To produce an AST, we can modify the parse() functions to construct the AST along the way
• match(a) returns an AST node (leaf) for a
• parse_A returns an AST node for A
  - AST nodes for RHS of production become children of LHS node

Example
• S → aA

        S
       / \
      a   A
          |

    Node parse_S( ) {
      Node n1, n2;
      if (lookahead == a) {
        n1 = match(a);
        n2 = parse_A( );
        return new Node(n1,n2);
      }
      else error( );
    }
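
To make the idea concrete, here is a minimal Python sketch of parse functions that return AST nodes, using the grammar E → id = n | { L }, L → E ; L | ε from earlier in these slides. The node types `Assign` and `Block`, the `Parser` class, and the choice to drop punctuation from the tree are assumptions made for this illustration; the slides do not prescribe a particular AST shape. Tokens are assumed to be whitespace-separated strings.

```python
from dataclasses import dataclass

@dataclass
class Assign:                 # AST node for E → id = n
    name: str
    value: int

@dataclass
class Block:                  # AST node for E → { L }; body holds the E nodes of L
    body: list

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def lookahead(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected=None):
        """Consume and return the lookahead (checking it if `expected` is given)."""
        tok = self.lookahead()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"unexpected token {tok!r}")
        self.pos += 1
        return tok

    def parse_E(self):
        tok = self.lookahead()
        if tok == "{":
            self.match("{")
            body = self.parse_L()
            self.match("}")
            return Block(body)                 # braces are not kept in the AST
        if tok is not None and tok.isidentifier():
            name = self.match()
            self.match("=")
            value = self.lookahead()
            if value is None or not value.isdigit():
                raise SyntaxError(f"expected an integer, got {value!r}")
            self.match()
            return Assign(name, int(value))
        raise SyntaxError(f"unexpected token {tok!r}")

    def parse_L(self):
        tok = self.lookahead()
        if tok is not None and (tok == "{" or tok.isidentifier()):
            first = self.parse_E()
            self.match(";")
            return [first] + self.parse_L()    # L → E ; L
        return []                              # L → ε

def parse(text):
    parser = Parser(text.split())
    ast = parser.parse_E()
    if parser.lookahead() is not None:
        raise SyntaxError("trailing input")
    return ast
```

For instance, `parse("{ x = 3 ; { y = 4 ; } ; }")` returns `Block([Assign("x", 3), Block([Assign("y", 4)])])`: the braces, semicolons, and the chain of L nodes from the parse tree are gone, and only the essential structure remains.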

The Compilation Process
