
COMP 181
Lecture 6
Top-down Parsing
September 21, 2006

Prelude

- What is the Tufts mascot? “Jumbo” the elephant
- Why?
  - P. T. Barnum was an original trustee of Tufts
  - 1884: donated $50,000 for a natural museum on campus (Barnum Museum, later Barnum Hall)
  - “Jumbo”: famous circus elephant
  - 1885: Jumbo died, was stuffed, donated to Tufts
  - 1975: Fire destroyed Barnum Hall, Jumbo

Tufts University Computer Science 2

Last time

- Finished scanning
  - Produces a stream of tokens
  - Removes things we don’t care about, like white space and comments
- Context-free grammars
  - Formal description of language syntax
  - Deriving strings using CFG
  - Depicting derivation as a parse tree

Grammar issues

- Often: more than one way to derive a string
  - Why is this a problem?
- Parsing: is string a member of L(G)?
  - We want more than a yes or no answer
- Key:
  - Represent the derivation as a parse tree
  - We want the structure of the parse tree to capture the meaning of the sentence


Parse tree: x - 2 * y

    #   Production rule
    1   expr → expr op expr
    2        | number
    3        | identifier
    4   op   → +
    5        | -
    6        | *
    7        | /

Right-most derivation

    Rule   Sentential form
    -      expr
    1      expr op expr
    3      expr op y
    6      expr * y
    1      expr op expr * y
    2      expr op 2 * y
    5      expr - 2 * y
    3      x - 2 * y

Parse tree

                expr
              /  |   \
          expr   op   expr
         / | \    |     |
     expr op expr *     y
       |   |   |
       x   -   2


Abstract syntax tree

- Parse tree contains extra junk
  - Eliminate intermediate nodes
  - Move operators up to parent nodes
- Result: abstract syntax tree

          *
         / \
        -   y
       / \
      x   2

Left vs right derivations

- Two derivations of “x - 2 * y”

    Left-most derivation            Right-most derivation
    Rule   Sentential form          Rule   Sentential form
    -      expr                     -      expr
    1      expr op expr             1      expr op expr
    3      x op expr                3      expr op y
    5      x - expr                 6      expr * y
    1      x - expr op expr         1      expr op expr * y
    2      x - 2 op expr            2      expr op 2 * y
    6      x - 2 * expr             5      expr - 2 * y
    3      x - 2 * y                3      x - 2 * y


Derivations

- One captures meaning, the other doesn’t

    Left-most derivation        Right-most derivation

          -                           *
         / \                         / \
        x   *                       -   y
           / \                     / \
          2   y                   x   2

With precedence

- Last time: ways to force the right tree shape
- Add productions to represent precedence

    Original grammar                With precedence
    #   Production rule             #   Production rule
    1   expr → expr op expr         1   expr   → expr + term
    2        | number               2          | expr - term
    3        | identifier           3          | term
    4   op   → +                    4   term   → term * factor
    5        | -                    5          | term / factor
    6        | *                    6          | factor
    7        | /                    7   factor → number
                                    8          | identifier


With precedence

    Without precedence              With precedence

            expr                          expr
          /  |   \                      /  |   \
      expr   op   expr              expr   -    term
     / | \    |     |                 |        / |  \
 expr op expr *     y               term   term  *  fact
   |   |   |                          |      |        |
   x   -   2                        fact   fact       y
                                      |      |
                                      x      2

Parsing

- What is parsing?
  - Discovering the derivation of a string, if one exists
  - Harder than generating strings (not surprisingly)
- Two major approaches
  - Top-down parsing
  - Bottom-up parsing
- Don’t work on all context-free grammars
  - Properties of grammar determine parse-ability
  - Our goal: make parsing efficient
  - We may be able to transform a grammar


Two approaches

- Top-down parsers: LL(1), recursive descent
  - Start at the root of the parse tree and grow toward leaves
  - Pick a production & try to match the input
  - Bad “pick” → may need to backtrack
- Bottom-up parsers: LR(1), operator precedence
  - Start at the leaves and grow toward root
  - As input is consumed, encode possible parse trees in an internal state (similar to our NFA → DFA conversion)
  - Bottom-up parsers handle a large class of grammars

Grammars and parsers

- LL(1) parsers
  - Left-to-right input
  - Leftmost derivation
  - 1 symbol of look-ahead
  - Grammars that this can handle are called LL(1) grammars
- LR(1) parsers
  - Left-to-right input
  - Rightmost derivation (constructed in reverse)
  - 1 symbol of look-ahead
  - Grammars that this can handle are called LR(1) grammars
- Also: LL(k), LR(k), SLR, LALR, …


Top-down parsing

- Start with the root of the parse tree
  - Root of the tree: node labeled with the start symbol
- Algorithm: repeat until the fringe of the parse tree matches the input string
  - At a node A, select a production for A and add a child node for each symbol on its rhs
  - If a terminal symbol is added that doesn’t match, backtrack
  - Find the next node to be expanded (a non-terminal)
- Done when:
  - Leaves of parse tree match input string (success)
  - All productions exhausted in backtracking (failure)

Example

- Expression grammar (with precedence)

    #   Production rule
    1   expr   → expr + term
    2          | expr - term
    3          | term
    4   term   → term * factor
    5          | term / factor
    6          | factor
    7   factor → number
    8          | identifier

- Input string: x - 2 * y


Example

    Rule   Sentential form   Input string
    -      expr              ↑ x - 2 * y
    1      expr + term       ↑ x - 2 * y
    3      term + term       ↑ x - 2 * y
    6      factor + term     ↑ x - 2 * y
    8      x + term          x ↑ - 2 * y
    -      x + term          x ↑ - 2 * y

    (↑ marks the current position in the input stream)

    Partial parse tree:

           expr
         /  |   \
      expr  +   term
        |
      term
        |
      fact
        |
        x

- Problem:
  - Can’t match next terminal: the parser expects + but the input has -
  - We guessed wrong at the second step (expr → expr + term)

Backtracking

    Rule   Sentential form   Input string
    -      expr              ↑ x - 2 * y
    1      expr + term       ↑ x - 2 * y     \
    3      term + term       ↑ x - 2 * y      |  Undo all these
    6      factor + term     ↑ x - 2 * y      |  productions
    8      x + term          x ↑ - 2 * y     /
    ?      x + term          x ↑ - 2 * y

- Rollback productions
- Choose a different production for expr
- Continue


Retrying

    Rule   Sentential form   Input string
    -      expr              ↑ x - 2 * y
    2      expr - term       ↑ x - 2 * y
    3      term - term       ↑ x - 2 * y
    6      factor - term     ↑ x - 2 * y
    8      x - term          x ↑ - 2 * y
    -      x - term          x - ↑ 2 * y
    6      x - factor        x - ↑ 2 * y
    7      x - 2             x - 2 ↑ * y

- Problem:
  - More input to read
  - Another cause of backtracking

Successful parse

    Rule   Sentential form       Input string
    -      expr                  ↑ x - 2 * y
    2      expr - term           ↑ x - 2 * y
    3      term - term           ↑ x - 2 * y
    6      factor - term         ↑ x - 2 * y
    8      x - term              x ↑ - 2 * y
    -      x - term              x - ↑ 2 * y
    4      x - term * factor     x - ↑ 2 * y
    6      x - factor * factor   x - ↑ 2 * y
    7      x - 2 * factor        x - 2 ↑ * y
    -      x - 2 * factor        x - 2 * ↑ y
    8      x - 2 * y             x - 2 * y ↑

    Parse tree:

           expr
         /  |   \
      expr  -    term
        |       / |  \
      term   term  *  fact
        |      |        |
      fact   fact       y
        |      |
        x      2

- All terminals match: we’re done


Other possible parses

    Rule   Sentential form                      Input string
    -      expr                                 ↑ x - 2 * y
    1      expr + term                          ↑ x - 2 * y
    1      expr + term + term                   ↑ x - 2 * y
    1      expr + term + term + term            ↑ x - 2 * y
    1      expr + term + term + term + term     ↑ x - 2 * y

- Problem: termination
  - Wrong choice leads to infinite expansion (more importantly: without consuming any input!)
  - May not be as obvious as this
  - Our grammar is left recursive

Left recursion

- Formally, a grammar is left recursive if ∃ a non-terminal A such that
  A →+ A α (for some string of symbols α)
  - What does →+ mean? “Derives in one or more steps”, so the recursion may be indirect:
      A → B x
      B → A y
- Bad news: top-down parsers cannot handle left recursion
- Good news: we can systematically eliminate left recursion
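
As a rough illustration of the definition of left recursion, the sketch below checks whether a non-terminal can reach itself by repeatedly following the first symbol of its productions. The dictionary encoding of the grammar and the name `left_recursive_nonterminals` are assumptions made for this example, and for simplicity it ignores ε productions, which a complete check would have to handle.

```python
def left_recursive_nonterminals(grammar):
    """Return the non-terminals A for which A derives A alpha in one or more steps.

    grammar maps each non-terminal to a list of right-hand sides, each a
    list of symbols. Simplification (assumption): epsilon productions are
    ignored, so only the first symbol of each right-hand side is followed.
    """
    # leftmost[A] = non-terminals that can appear first after one derivation step
    leftmost = {
        nt: {rhs[0] for rhs in rules if rhs and rhs[0] in grammar}
        for nt, rules in grammar.items()
    }
    result = set()
    for start in grammar:
        seen, todo = set(), list(leftmost[start])
        while todo:
            nt = todo.pop()
            if nt in seen:
                continue
            seen.add(nt)
            todo.extend(leftmost[nt])
        if start in seen:  # start reached itself through leftmost symbols
            result.add(start)
    return result


# Indirect left recursion: A -> B x, B -> A y
indirect = {"A": [["B", "x"]], "B": [["A", "y"]]}

# Expression grammar with precedence (directly left recursive)
expr_grammar = {
    "expr": [["expr", "+", "term"], ["expr", "-", "term"], ["term"]],
    "term": [["term", "*", "factor"], ["term", "/", "factor"], ["factor"]],
    "factor": [["number"], ["identifier"]],
}

print(sorted(left_recursive_nonterminals(indirect)))      # ['A', 'B']
print(sorted(left_recursive_nonterminals(expr_grammar)))  # ['expr', 'term']
```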


Notation

- Non-terminals
  - Capital letter: A, B, C
- Terminals
  - Lowercase, underline: x, y, z
- Some mix of terminals and non-terminals
  - Greek letters: α, β, γ
- Example:

    #   Production rule
    1   A → B + x
    1   A → B α          α = + x

Eliminating left recursion

- Consider this grammar:

    #   Production rule
    1   foo → foo α
    2       | β

  Language is β followed by zero or more α

- Rewrite as

    #   Production rule
    1   foo → β bar      This production gives you one β
    2   bar → α bar      These two productions give you zero or more α
    3       | ε

  bar is a new non-terminal


Back to expressions

- Two cases of left recursion:

    #   Production rule             #   Production rule
    1   expr → expr + term          4   term → term * factor
    2        | expr - term          5        | term / factor
    3        | term                 6        | factor

- Transform as follows:

    #   Production rule             #   Production rule
    1   expr  → term expr2          4   term  → factor term2
    2   expr2 → + term expr2        5   term2 → * factor term2
    3         | - term expr2        6         | / factor term2
    4         | ε                             | ε

Eliminating left recursion

- Resulting grammar

    #    Production rule
    1    expr   → term expr2
    2    expr2  → + term expr2
    3           | - term expr2
    4           | ε
    5    term   → factor term2
    6    term2  → * factor term2
    7           | / factor term2
    8           | ε
    9    factor → number
    10          | identifier

  - All right recursive
  - Retain original language and associativity
  - Not as intuitive to read
- Top-down parser
  - Will always terminate
  - May still backtrack

There’s a lovely algorithm to do this automatically, which we will skip
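
The rewrite of `foo → foo α | β` into `foo → β bar`, `bar → α bar | ε` can be sketched in a few lines. This covers only direct left recursion on a single non-terminal, not the general algorithm the lecture skips; the function name and the list-of-lists grammar encoding are assumptions made for illustration.

```python
def eliminate_immediate_left_recursion(nt, rules, new_nt):
    """Rewrite  nt -> nt a1 | nt a2 | ... | b1 | b2 | ...  as
           nt     -> b1 new_nt | b2 new_nt | ...
           new_nt -> a1 new_nt | a2 new_nt | ... | epsilon
    Right-hand sides are lists of symbols; [] stands for epsilon.
    Handles only direct left recursion (assumption of this sketch).
    """
    # Tails (the alpha parts) of the left-recursive alternatives
    recursive = [rhs[1:] for rhs in rules if rhs and rhs[0] == nt]
    # Alternatives that do not start with nt (the beta parts)
    others = [rhs for rhs in rules if not rhs or rhs[0] != nt]
    if not recursive:
        return {nt: rules}
    return {
        nt: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],
    }


expr_rules = [["expr", "+", "term"], ["expr", "-", "term"], ["term"]]
result = eliminate_immediate_left_recursion("expr", expr_rules, "expr2")
for lhs, alternatives in result.items():
    for rhs in alternatives:
        print(lhs, "->", " ".join(rhs) or "ε")
```

Applied to the three `expr` productions, this yields the same four productions for `expr` and `expr2` shown in the transformed grammar.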


Top-down parsers

- Problem: left recursion
- Solution: technique to remove it
- What about backtracking?
  - Current algorithm is brute force
- Problem: how to choose the right production?
- Idea: use the next input token (duh)
- How? Look at our right-recursive grammar…

Right-recursive grammar

    #    Production rule
    1    expr   → term expr2
    2    expr2  → + term expr2
    3           | - term expr2
    4           | ε
    5    term   → factor term2
    6    term2  → * factor term2
    7           | / factor term2
    8           | ε
    9    factor → number
    10          | identifier

- Two productions with no choice at all
- All other productions are uniquely identified by a terminal symbol at the start of RHS
- We can choose the right production by looking at the next input symbol
  - This is called lookahead
  - BUT, this can be tricky…


Lookahead

- Goal: avoid backtracking
  - Look at future input symbols
  - Use extra context to make right choice
- How much lookahead is needed?
  - In general, an arbitrary amount is needed for the full class of context-free grammars
  - Use fancy-dancy algorithm: CYK algorithm, O(n^3)
- Fortunately,
  - Many CFGs can be parsed with limited lookahead
  - Covers most programming languages (but not C++ or Perl)

Top-down parsing

- Goal: given productions A → α | β, the parser should be able to choose between α and β
- Trying to match A
  - How can the next input token help us decide?
- Solution: FIRST sets (almost a solution)
- Informally: FIRST(α) is the set of tokens that could appear as the first symbol in a string derived from α
- Def: x in FIRST(α) iff α →* x γ
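
To make the definition of FIRST concrete, here is a minimal sketch that computes FIRST sets for the right-recursive expression grammar by iterating to a fixed point. This is one common formulation, not necessarily the algorithm presented later in the course; the names `compute_first` and `first_of_string`, the grammar encoding, and the use of the string "ε" as a marker are all assumptions of this example.

```python
EPS = "ε"  # marker for the empty string (assumption of this sketch)

# Right-recursive expression grammar; [] is an epsilon production.
GRAMMAR = {
    "expr": [["term", "expr2"]],
    "expr2": [["+", "term", "expr2"], ["-", "term", "expr2"], []],
    "term": [["factor", "term2"]],
    "term2": [["*", "factor", "term2"], ["/", "factor", "term2"], []],
    "factor": [["number"], ["identifier"]],
}


def first_of_string(symbols, first):
    """FIRST of a sequence of symbols, given FIRST sets of the non-terminals."""
    out = set()
    for sym in symbols:
        # A terminal's FIRST set is just the terminal itself.
        sym_first = first[sym] if sym in first else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol (or the empty sequence) can derive epsilon
    return out


def compute_first(grammar):
    """Grow the FIRST sets until nothing changes (fixed point)."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
for nt in GRAMMAR:
    print(nt, sorted(FIRST[nt]))
```

Note that `expr2` and `term2` end up with ε in their FIRST sets, which is exactly the case a plain FIRST-set test cannot resolve on its own.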


Top-down parsing

- Building FIRST sets
  - We’ll look at this algorithm later
- The LL(1) property
  - Given A → α and A → β, we would like: FIRST(α) ∩ FIRST(β) = ∅
  - Parser can make right choice by looking at one lookahead token
  - ..almost..

Top-down parsing

- What about ε productions?
  - Complicates the definition of LL(1)
  - Consider A → α and A → β and α may be empty
  - In this case there is no symbol to identify α
- Example:

    #   Production rule
    1   A → x B
    2     | y C
    3     | ε

  - What is FIRST(3)? = { ε }
  - What lookahead symbol tells us we are matching production 3?


Top-down parsing

- If A was empty
  - What will the next symbol be?
  - Must be one of the symbols that immediately follow an A
- Solution
  - Build a FOLLOW set for each production with ε
  - Extra condition for LL: when α can derive ε, FIRST(β) must be disjoint from both FIRST(α) and FOLLOW(A)

FOLLOW sets

- Example:

    #   Production rule
    1   A → x B
    2     | y C
    3     | ε
    4   E → A z

  - FIRST(1) = { x }
  - FIRST(2) = { y }
  - FIRST(3) = { ε }
- What can follow A?
  - Look at the context of all uses of A
  - FOLLOW(A) = { z }
- Now we can uniquely identify each production:
  - If we are trying to match an A and the next token is z, then we matched production 3
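
The FOLLOW(A) = { z } result for the example grammar can be reproduced with a small fixed-point computation. The sketch below is only an illustration: the rules for B and C are not given in the slides, so the ones used here (`B → b`, `C → c`) are hypothetical placeholders, and no end-of-input marker is added to the start symbol's FOLLOW set.

```python
EPS = "ε"  # marker for the empty string (assumption of this sketch)

# Example grammar. The productions for B and C are hypothetical
# placeholders; the slides leave them unspecified.
GRAMMAR = {
    "E": [["A", "z"]],
    "A": [["x", "B"], ["y", "C"], []],
    "B": [["b"]],  # assumption
    "C": [["c"]],  # assumption
}


def first_of(symbols, first):
    """FIRST of a symbol sequence; terminals stand for themselves."""
    out = set()
    for sym in symbols:
        f = first.get(sym, {sym})
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                new = first_of(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, first):
    """FOLLOW sets by fixed point (no end-of-input marker is added)."""
    follow = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, rules in grammar.items():
            for rhs in rules:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # only non-terminals have FOLLOW sets
                    rest = first_of(rhs[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:
                        # Nothing (or only nullable symbols) after sym:
                        # whatever follows lhs can follow sym too.
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST)
print(sorted(FIRST["A"]))   # ['x', 'y', 'ε']
print(sorted(FOLLOW["A"]))  # ['z']
```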


More on FIRST and FOLLOW

- Notice:
  - FIRST and FOLLOW are sets (they may contain more than one symbol)
  - FIRST may contain ε in addition to other symbols
- Example:

    #   Production rule
    1   A → B C
    2   B → x
    3     | y
    4     | ε
    5   E → A z
    6   F → A w

  - FIRST(1) = { x, y, ε }
  - FOLLOW(A) = { z, w }
- Question: when would we care about FOLLOW(A)?
  - Answer: if FIRST(C) contains ε

LL(1) property

- Including ε productions
  - FOLLOW(A) = the set of terminal symbols that can immediately follow A
- Def: FIRST+(A → α) as
  - FIRST(α) ∪ FOLLOW(A), if ε ∈ FIRST(α)
  - FIRST(α), otherwise
- Def: a grammar is LL(1) iff, for every pair of distinct productions A → α and A → β,
  FIRST+(A → α) ∩ FIRST+(A → β) = ∅
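
The FIRST+ definition and the pairwise-disjointness test translate almost directly into code. The sketch below assumes the FIRST and FOLLOW sets have already been computed and are passed in by hand; the function names `first_plus` and `is_ll1` are made up for this example.

```python
from itertools import combinations

EPS = "ε"  # marker for the empty string (assumption of this sketch)


def first_plus(first_alpha, follow_lhs):
    """FIRST+(A -> alpha): add FOLLOW(A) when alpha can derive epsilon."""
    if EPS in first_alpha:
        return first_alpha | follow_lhs
    return set(first_alpha)


def is_ll1(alternative_firsts, follow_lhs):
    """Check the LL(1) condition for the alternatives of one non-terminal.

    alternative_firsts: one FIRST set per alternative of the non-terminal.
    follow_lhs: FOLLOW set of that non-terminal.
    """
    sets = [first_plus(f, follow_lhs) for f in alternative_firsts]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


# A -> x B | y C | ε  with FOLLOW(A) = { z }
print(is_ll1([{"x"}, {"y"}, {EPS}], {"z"}))  # True

# Hypothetical conflicting case: two alternatives can both start with x
print(is_ll1([{"x"}, {"x", "y"}], {"z"}))    # False
```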


LL(1) property

- Question
  - Can there be two rules A → α and A → β in a LL(1) grammar such that ε ∈ FIRST(α) and ε ∈ FIRST(β)?
- Answer
  - No. Both rules have the same left-hand side A, so FIRST+(A → α) and FIRST+(A → β) both contain FOLLOW(A) and cannot be disjoint. A could also derive ε in two different ways.

Parsing LL(1) grammar

- Given an LL(1) grammar
  - Code: simple, fast routine to recognize each production
  - Given A → β1 | β2 | β3, with FIRST+(βi) ∩ FIRST+(βj) = ∅ for all i != j

    /* find rule for A */
    if (current token ∈ FIRST+(β1))
        select A → β1
    else if (current token ∈ FIRST+(β2))
        select A → β2
    else if (current token ∈ FIRST+(β3))
        select A → β3
    else
        report an error and return false
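
The rule-selection pseudocode generalizes to any number of alternatives. A minimal sketch, assuming the FIRST+ sets are supplied alongside each right-hand side (the function name `select_production` and the data layout are illustrative choices, not part of the lecture):

```python
def select_production(token, alternatives):
    """Return the right-hand side whose FIRST+ set contains the token.

    alternatives: list of (rhs, first_plus_set) pairs for one non-terminal.
    In an LL(1) grammar the FIRST+ sets are pairwise disjoint, so at most
    one alternative can match. Raises SyntaxError when none applies.
    """
    for rhs, first_plus in alternatives:
        if token in first_plus:
            return rhs
    raise SyntaxError(f"unexpected token {token!r}")


# A -> x B | y C | ε  with FOLLOW(A) = { z }, so FIRST+ of the
# epsilon alternative is { ε, z }.
A_ALTERNATIVES = [
    (["x", "B"], {"x"}),
    (["y", "C"], {"y"}),
    ([], {"ε", "z"}),
]

print(select_production("x", A_ALTERNATIVES))  # ['x', 'B']
print(select_production("z", A_ALTERNATIVES))  # []
```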


Predictive parsing

- Predictive parsing
  - The parser can “predict” the correct expansion
  - Using lookahead and FIRST and FOLLOW sets
- Two kinds of predictive parsers
  - Recursive descent: often hand-written
  - Table-driven: generate tables from FIRST and FOLLOW sets

Recursive descent

    #    Production rule
    1    goal   → expr
    2    expr   → term expr2
    3    expr2  → + term expr2
    4           | - term expr2
    5           | ε
    6    term   → factor term2
    7    term2  → * factor term2
    8           | / factor term2
    9           | ε
    10   factor → number
    11          | identifier
    12          | ( expr )

- This produces a parser with six mutually recursive routines:
  - Goal
  - Expr
  - Expr2
  - Term
  - Term2
  - Factor
- Each recognizes one NT or T
- The term descent refers to the direction in which the parse tree is built.


Example code

- Goal symbol:

    main()
        /* Match goal --> expr */
        tok = nextToken();
        if (expr() && tok == EOF)
            then proceed to next step;
        else
            return false;

- Top-level expression

    expr()
        /* Match expr --> term expr2 */
        if (term() && expr2())
            return true;
        else
            return false;

Example code

- Match expr2 (check FIRST and FOLLOW sets to distinguish the alternatives)

    expr2()
        /* Match expr2 --> + term expr2 */
        /* Match expr2 --> - term expr2 */
        if (tok == '+' or tok == '-')
            tok = nextToken();
            if (term())
                then return expr2();
            else
                return false;

        /* Match expr2 --> empty */
        return true;


Example code

    factor()
        /* Match factor --> ( expr ) */
        if (tok == '(')
            tok = nextToken();
            if (expr() && tok == ')')
                tok = nextToken();    /* consume the ')' */
                return true;
            else
                syntax error: expecting )
                return false;

        /* Match factor --> num */
        if (tok is a num)
            tok = nextToken();        /* consume the number */
            return true;

        /* Match factor --> id */
        if (tok is an id)
            tok = nextToken();        /* consume the identifier */
            return true;

        /* No alternative matched */
        return false;

Top-down parsing

- So far:
  - Gives us a yes or no answer
- We want to build the parse tree
  - How?
- Add actions to matching routines
  - Create a node for each production
  - How do we assemble the tree?
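
For readers who want to run the recognizer, here is one possible Python rendering of the six routines for the grammar with parentheses. It is a sketch rather than the course's reference code: the tokenizer, the `Parser` class, and the token kind names are assumptions, and numbers are limited to unsigned integers.

```python
import re

# One token per match: an integer, an identifier, or any other single character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")


def tokenize(text):
    """Return a list of (kind, lexeme) pairs ending with ('eof', '')."""
    tokens = []
    for number, ident, other in TOKEN_RE.findall(text):
        if number:
            tokens.append(("number", number))
        elif ident:
            tokens.append(("identifier", ident))
        elif other.strip():  # skip stray whitespace
            tokens.append((other, other))
    tokens.append(("eof", ""))
    return tokens


class Parser:
    """Recursive-descent recognizer: one method per non-terminal."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    @property
    def tok(self):
        return self.tokens[self.pos][0]  # kind of the lookahead token

    def advance(self):
        self.pos += 1

    def goal(self):    # goal -> expr
        return self.expr() and self.tok == "eof"

    def expr(self):    # expr -> term expr2
        return self.term() and self.expr2()

    def expr2(self):   # expr2 -> + term expr2 | - term expr2 | ε
        if self.tok in ("+", "-"):
            self.advance()
            return self.term() and self.expr2()
        return True    # ε: consume nothing

    def term(self):    # term -> factor term2
        return self.factor() and self.term2()

    def term2(self):   # term2 -> * factor term2 | / factor term2 | ε
        if self.tok in ("*", "/"):
            self.advance()
            return self.factor() and self.term2()
        return True    # ε: consume nothing

    def factor(self):  # factor -> number | identifier | ( expr )
        if self.tok == "(":
            self.advance()
            if self.expr() and self.tok == ")":
                self.advance()
                return True
            return False
        if self.tok in ("number", "identifier"):
            self.advance()
            return True
        return False


print(Parser("x - 2 * y").goal())    # True
print(Parser("(x - 2) * y").goal())  # True
print(Parser("x - * y").goal())      # False
```

Like the pseudocode, this only answers yes or no; it never backtracks because each routine decides what to do from the single lookahead token.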


Building a parse tree

- Notice:
  - Recursive calls match the shape of the tree

    main
      expr
        term
          factor
        expr2
          term

- Idea: use a stack
  - Each routine:
    - Pops off the children it needs
    - Creates its own node
    - Pushes that node back on the stack

Building a parse tree

- With stack operations

    expr()
        /* Match expr --> term expr2 */
        if (term() && expr2())
            expr2_node = pop();
            term_node = pop();
            expr_node = new exprNode(term_node, expr2_node);
            push(expr_node);
            return true;
        else
            return false;
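
The stack discipline can be tried out with a small Python sketch in which every routine pushes exactly one node when it succeeds. The class name `TreeBuilder`, the tuple representation of nodes, and the whitespace-separated input format are assumptions made for this example; parentheses are left out to keep it short.

```python
class TreeBuilder:
    """Recursive descent that assembles a parse tree on an explicit stack."""

    def __init__(self, text):
        # Assumption: tokens in the input are separated by whitespace.
        self.tokens = text.split() + ["<eof>"]
        self.pos = 0
        self.stack = []

    @property
    def tok(self):
        return self.tokens[self.pos]

    def expr(self):    # expr -> term expr2
        if self.term() and self.expr2():
            expr2_node = self.stack.pop()  # children come off in reverse order
            term_node = self.stack.pop()
            self.stack.append(("expr", term_node, expr2_node))
            return True
        return False

    def expr2(self):   # expr2 -> + term expr2 | - term expr2 | ε
        if self.tok in ("+", "-"):
            op = self.tok
            self.pos += 1
            if self.term() and self.expr2():
                rest = self.stack.pop()
                term_node = self.stack.pop()
                self.stack.append(("expr2", op, term_node, rest))
                return True
            return False
        self.stack.append(("expr2", "ε"))
        return True

    def term(self):    # term -> factor term2
        if self.factor() and self.term2():
            term2_node = self.stack.pop()
            factor_node = self.stack.pop()
            self.stack.append(("term", factor_node, term2_node))
            return True
        return False

    def term2(self):   # term2 -> * factor term2 | / factor term2 | ε
        if self.tok in ("*", "/"):
            op = self.tok
            self.pos += 1
            if self.factor() and self.term2():
                rest = self.stack.pop()
                factor_node = self.stack.pop()
                self.stack.append(("term2", op, factor_node, rest))
                return True
            return False
        self.stack.append(("term2", "ε"))
        return True

    def factor(self):  # factor -> number | identifier
        if self.tok.isdigit() or self.tok.isidentifier():
            self.stack.append(("factor", self.tok))
            self.pos += 1
            return True
        return False


def parse(text):
    """Return the parse tree as nested tuples, or None on a syntax error."""
    builder = TreeBuilder(text)
    if builder.expr() and builder.tok == "<eof>":
        return builder.stack.pop()
    return None


print(parse("x - 2 * y"))
```

On a failed parse the stack may still hold partial nodes; this sketch simply discards the builder in that case.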


Next time…

- Finish top-down parsing
  - Table-driven parsers
  - Building FIRST and FOLLOW sets
- Start bottom-up parsing
