MIT 6.035 Top-Down Parsing

Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology

Orientation
• Language specification
  • Lexical structure - regular expressions
  • Syntactic structure - grammar
• This lecture - recursive descent parsers
  • Code parser as set of mutually recursive procedures
  • Structure of program matches structure of grammar

Starting Point
• Assume lexical analysis has produced a sequence of tokens
  • Each token has a type and value
  • Types correspond to terminals
  • Values correspond to contents of token read in
• Examples
  • Int 549 - integer token with value 549 read in
  • if - if keyword, no need for a value
  • AddOp + - add operator, value +

Basic Approach
• Start with Start symbol
• Build a leftmost derivation
  • If leftmost symbol is nonterminal, choose a production and apply it
  • If leftmost symbol is terminal, match against input
  • If all terminals match, have found a parse!
• Key: find correct productions for nonterminals

Graphical Illustration of Leftmost Derivation
Sentential Form
NT1 T1 T2 T3 NT2 NT3
• Apply production here: NT1 (the leftmost nonterminal)
• Not here: NT2, NT3

Grammar for Parsing Example
Start → Expr
Expr → Expr + Term
Expr → Expr - Term
Expr → Term
Term → Term * Int
Term → Term / Int
Term → Int

• Set of tokens is { +, -, *, /, Int }, where Int = [0-9][0-9]*
• For convenience, may represent each Int n token by n

Parsing Example
(Each slide shows the parse tree built so far, the remaining input, and the current sentential form for the input 2-2*2. One row per slide.)

Remaining Input   Sentential Form    Annotation on slide
2-2*2             Start              Current position in parse tree: Start
2-2*2             Expr               Applied production: Start → Expr
2-2*2             Expr - Term        Applied production: Expr → Expr - Term
2-2*2             Term - Term        Applied production: Expr → Term
2-2*2             Int - Term         Applied production: Term → Int
2-2*2             2 - Term           Match input token!
-2*2              2 - Term           Match input token!
2*2               2 - Term           Match input token!
2*2               2 - Term * Int     Applied production: Term → Term * Int
2*2               2 - Int * Int      Applied production: Term → Int
2*2               2 - 2 * Int        Match input token!
*2                2 - 2 * Int        Match input token!
2                 2 - 2 * Int        Match input token!
(empty)           2 - 2 * 2          Parse complete!

Final parse tree:
Start
  Expr
    Expr
      Term
        Int 2
    -
    Term
      Term
        Int 2
      *
      Int 2

Summary
• Three actions (mechanisms)
  • Apply production to expand current nonterminal in parse tree
  • Match current terminal (consuming input)
  • Accept the parse as correct
• Parser generates preorder traversal of parse tree
  • Visit parents before children
  • Visit siblings from left to right

Policy Problem
• Which production to use for each nonterminal?
• Classical separation of policy and mechanism
• One approach: backtracking
  • Treat it as a search problem
  • At each choice point, try next alternative
  • If it is clear that current try fails, go back to previous choice and try something different
• General technique for searching
• Used a lot in classical AI and natural language processing (parsing, speech recognition)

Backtracking Example
(Each slide shows the parse tree built so far, the remaining input, and the current sentential form for the input 2-2*2. One row per slide.)

Remaining Input   Sentential Form    Annotation on slide
2-2*2             Start
2-2*2             Expr               Applied production: Start → Expr
2-2*2             Expr + Term        Applied production: Expr → Expr + Term
2-2*2             Term + Term        Applied production: Expr → Term
2-2*2             Int + Term         Applied production: Term → Int. Match input token!
-2*2              2 + Term           Can't match input token! (+ expected, - found)
2-2*2             Expr               So backtrack! Back to the state after Start → Expr
2-2*2             Expr - Term        Applied production: Expr → Expr - Term
2-2*2             Term - Term        Applied production: Expr → Term
2-2*2             Int - Term         Applied production: Term → Int
-2*2              2 - Term           Match input token!
2*2               2 - Term           Match input token!
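The backtracking policy traced above can be sketched as a depth-first search over leftmost derivations. This is a minimal illustration, not the lecture's code: the names GRAMMAR, token_type, search, and parse are hypothetical, and tokens are assumed to be single characters. Because the grammar is left-recursive, an unbounded search would not terminate, so the sketch adds a length-based pruning rule (an assumption that holds here because the grammar has no ε productions).

```python
# Grammar from the lecture, written as nonterminal -> list of alternatives.
GRAMMAR = {
    "Start": [["Expr"]],
    "Expr": [["Expr", "+", "Term"], ["Expr", "-", "Term"], ["Term"]],
    "Term": [["Term", "*", "Int"], ["Term", "/", "Int"], ["Int"]],
}


def token_type(tok):
    """Map a token's text to its terminal: digit strings are Int tokens."""
    return "Int" if tok.isdigit() else tok


def search(form, pos, tokens, trace):
    """Depth-first search for a leftmost derivation of tokens[pos:] from form.

    form is the not-yet-matched suffix of the sentential form (a tuple).
    Returns the list of applied productions on success, or None so that
    the caller moves on to its next alternative (backtracking).
    """
    if not form:
        return trace if pos == len(tokens) else None
    # Pruning (assumption): with no epsilon productions every symbol yields at
    # least one token, so a form longer than the remaining input cannot match.
    if len(form) > len(tokens) - pos:
        return None
    head, rest = form[0], form[1:]
    if head not in GRAMMAR:  # terminal: match against input
        if token_type(tokens[pos]) == head:
            return search(rest, pos + 1, tokens, trace)
        return None  # mismatch: backtrack
    for rhs in GRAMMAR[head]:  # nonterminal: try each production in order
        result = search(tuple(rhs) + rest, pos, tokens, trace + [(head, rhs)])
        if result is not None:
            return result
    return None  # every alternative failed: backtrack further


def parse(tokens):
    """Return the productions of a leftmost derivation, or None if no parse."""
    return search(("Start",), 0, tokens, [])
```

Calling parse(list("2-2*2")) returns the productions in the order the leftmost derivation applies them, starting with Start → Expr and Expr → Expr - Term; an input outside the language, such as list("2-"), returns None.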
Left Recursion + Top-Down Parsing = Infinite Loop
• Example production: Term → Term * Num
• Potential parsing steps:
  Term
  Term * Num
  Term * Num * Num
  ...

General Search Issues
• Three components
  • Search space (parse trees)
  • Search algorithm (parsing algorithm)
  • Goal to find (parse tree for input program)
• Would like to (but can’t always) ensure that
  • Find goal (hopefully quickly) if it exists
  • Search terminates if it does not
• Handled in various ways in various contexts
  • Finite search space makes it easy
  • Exploration strategies for infinite search space
  • Sometimes one goal more important (model checking)
  • For parsing, hack grammar to remove left recursion

Eliminating Left Recursion
• Start with productions of form
  • A → A α
  • A → β
  • α, β sequences of terminals and nonterminals that do not start with A
• Repeated application of A → A α builds a left-leaning parse tree, for example
  A ⇒ A α ⇒ A α α ⇒ β α α
  with parse tree A( A( A(β) α ) α )

Eliminating Left Recursion
• Replacement productions
  Old productions    New productions
  A → A α            A → β R      (R is a new nonterminal)
  A → β              R → α R
                     R → ε
• Parse trees for β α α
  Old parse tree:  A( A( A(β) α ) α )
  New parse tree:  A( β R( α R( α R(ε) ) ) )
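The replacement above can be sketched as a small grammar transformation. This is an illustrative sketch rather than code from the lecture: the function name eliminate_left_recursion, the list-of-symbols representation, and the convention of naming the new nonterminal by appending an ASCII apostrophe are all assumptions. It handles only direct left recursion of the form A → A α, as on the slides.

```python
def eliminate_left_recursion(nt, alternatives):
    """Rewrite A -> A alpha | beta as A -> beta R, R -> alpha R | epsilon.

    alternatives is a list of right-hand sides (lists of symbols) for nt.
    Returns a dict mapping nt and the new nonterminal R to their
    alternatives; the empty list [] stands for epsilon.
    """
    new_nt = nt + "'"  # hypothetical naming convention for R
    # alpha parts: what follows the leading nt in left-recursive alternatives
    alphas = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == nt]
    # beta parts: alternatives that do not start with nt
    betas = [rhs for rhs in alternatives if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: alternatives}  # no direct left recursion: unchanged
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],
    }
```

Applied to the Term productions of the example grammar, eliminate_left_recursion("Term", [["Term", "*", "Int"], ["Term", "/", "Int"], ["Int"]]) yields Term → Int Term' together with Term' → * Int Term', Term' → / Int Term', and Term' → ε.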
Hacked Grammar
Original grammar fragment      New grammar fragment
Term → Term * Int              Term → Int Term’
Term → Term / Int              Term’ → * Int Term’
Term → Int                     Term’ → / Int Term’
                               Term’ → ε

Parse Tree Comparisons
Parse trees for Int * Int * Int:
  Original grammar:  Term( Term( Term(Int) * Int ) * Int )
  New grammar:       Term( Int Term’( * Int Term’( * Int Term’(ε) ) ) )

Eliminating Left Recursion
• Changes search space exploration algorithm
• Eliminates direct infinite recursion
• But grammar less intuitive
• Sets things up for predictive parsing

Predictive Parsing
• Alternative to backtracking
• Useful for programming languages, which can be designed to make parsing easier
• Basic idea
  • Look ahead in input stream
  • Decide which production to apply based on next tokens in input stream
  • We will use one token of lookahead
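One token of lookahead needs a token source that can be inspected without being consumed. The following is a minimal sketch under stated assumptions: the names Token, tokenize, and TokenStream and their methods are hypothetical, and the lexer covers only the token set { +, -, *, /, Int } used in this lecture.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    type: str             # terminal: "Int", "+", "-", "*", or "/"
    value: Optional[int]  # integer value for Int tokens, otherwise None


# Optional whitespace, then either a run of digits or a single operator.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([-+*/]))")


def tokenize(text):
    """Split text into tokens over { +, -, *, /, Int }."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if m.group(1) is not None:
            tokens.append(Token("Int", int(m.group(1))))
        else:
            tokens.append(Token(m.group(2), None))
        pos = m.end()
    return tokens


class TokenStream:
    """Token sequence with one token of lookahead."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0

    def peek(self):
        """Return the next token without consuming it (None at end of input)."""
        return self._tokens[self._pos] if self._pos < len(self._tokens) else None

    def advance(self):
        """Consume and return the next token (None at end of input)."""
        tok = self.peek()
        if tok is not None:
            self._pos += 1
        return tok
```

A predictive parser would call peek() to decide which production to apply and advance() when it matches a terminal.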
Predictive Parsing Example Grammar
Start → Expr
Expr → Term Expr’
Expr’ → + Term Expr’
Expr’ → - Term Expr’
Expr’ → ε
Term → Int Term’
Term’ → * Int Term’
Term’ → / Int Term’
Term’ → ε

Choice Points
• Assume Term’ is current position in parse tree
• Have three possible productions to apply
  Term’ → * Int Term’
  Term’ → / Int Term’
  Term’ → ε
• Use next token to decide
  • If next token is *, apply Term’ → * Int Term’
  • If next token is /, apply Term’ → / Int Term’
  • Otherwise, apply Term’ → ε

Predictive Parsing + Hand Coding = Recursive Descent Parser
• One procedure per nonterminal NT
  • Productions NT → β1, …, NT → βn
  • Procedure examines the current input symbol T to determine which production to apply
    • If T ∈ First(βk)
    • Apply production k
    • Consume terminals in βk (check for correct terminal)
    • Recursively call procedures for nonterminals in βk
• Current input symbol stored in global variable token
• Procedures return
  • true if parse succeeds
  • false if parse fails

Example
Grammar:
  Term → Int Term’
  Term’ → * Int Term’
  Term’ → / Int Term’
  Term’ → ε

Boolean Term()
  if (token == Int n)
    token = NextToken();
    return(TermPrime())
  else
    return(false)

Boolean TermPrime()
  if (token == *)
    token = NextToken();
    if (token == Int n)
      token = NextToken();
      return(TermPrime())
    else
      return(false)
  else if (token == /)
    token = NextToken();
    if (token == Int n)
      token = NextToken();
      return(TermPrime())
    else
      return(false)
  else
    return(true)

Multiple Productions With Same Prefix in RHS
• Example grammar
  NT → if then
  NT → if then else
• Assume NT is current position in parse tree, and if is the next token
• Unclear which production to apply
  • Multiple k such that T ∈ First(βk)
  • if ∈ First(if then)
  • if ∈ First(if then else)

Solution: Left Factor the Grammar
• New grammar factors common prefix into single production
  NT → if then NT’
  NT’ → else
  NT’ → ε
• No choice when next token is if!
• All choices have been unified in one production.
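Left factoring can itself be written as a small grammar transformation. The sketch below is illustrative only: the function name left_factor and the list-of-symbols representation are assumptions, and it factors the prefix common to all given alternatives rather than first grouping alternatives by their leading symbol, which a general implementation would need to do.

```python
def left_factor(nt, alternatives):
    """Factor the longest common prefix out of all alternatives of nt.

    alternatives is a list of right-hand sides (lists of symbols).
    Returns a dict of productions; the empty list [] stands for epsilon.
    Simplifying assumption: every alternative shares the prefix.
    """
    prefix = []
    # zip stops at the shortest alternative, so the prefix never overruns one.
    for symbols in zip(*alternatives):
        if any(s != symbols[0] for s in symbols):
            break
        prefix.append(symbols[0])
    if not prefix or len(alternatives) < 2:
        return {nt: alternatives}  # nothing to factor
    new_nt = nt + "'"  # hypothetical naming convention for the new nonterminal
    return {
        nt: [prefix + [new_nt]],
        new_nt: [rhs[len(prefix):] for rhs in alternatives],
    }
```

For the example grammar, left_factor("NT", [["if", "then"], ["if", "then", "else"]]) produces NT → if then NT' with NT' → ε and NT' → else, so a parser that sees if as the next token has only one production to apply.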
Nonterminals
• What about productions with nonterminals?
  NT → NT1 α1
  NT → NT2 α2
• Must choose based on possible first terminals that NT1 α1 and NT2 α2 can generate
• What if NT1 or NT2 can generate ε?
  • Must choose based on α1 and α2

NT derives ε
• Two rules
  • NT → ε implies NT derives ε
  • NT → NT1 ... NTn and for all 1 ≤ i ≤ n NTi derives ε implies NT derives ε

Fixed Point Algorithm for Derives ε
for all nonterminals NT
  set NT derives ε to be false
for all productions of the form NT → ε
  set NT derives ε to be true
while (some NT derives ε changed in last iteration)
  for all productions of the form NT → NT1 ... NTn
    if (for all 1 ≤ i ≤ n NTi derives ε)
      set NT derives ε to be true

First(β)
• T ∈ First(β) if T can appear as the first symbol in a derivation starting from β
  1) T ∈ First(T)
  2) First(S) ⊆ First(S β)
  3) NT derives ε implies First(β) ⊆ First(NT β)
  4) NT → S β implies First(S β) ⊆ First(NT)
• Notation
  • T is a terminal, NT is a nonterminal, S is a terminal or nonterminal, and β is a sequence of terminals or nonterminals
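The derives-ε rules and the four First rules can both be computed by iterating to a fixed point. The sketch below is one possible rendering, not the lecture's code: the names GRAMMAR, derives_epsilon, and first_sets are hypothetical, the grammar is the Term fragment with Term' written with an ASCII apostrophe, and ε is represented by an empty right-hand side. It folds the ε-production case into the main loop instead of handling it in a separate initialization pass.

```python
# Grammar fragment from the lecture; [] is the epsilon production.
GRAMMAR = {
    "Term": [["Int", "Term'"]],
    "Term'": [["*", "Int", "Term'"], ["/", "Int", "Term'"], []],
}


def derives_epsilon(grammar):
    """Fixed point: the set of nonterminals that can derive epsilon."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            if nt in nullable:
                continue
            # all() over an empty right-hand side is True, covering NT -> epsilon.
            if any(all(s in nullable for s in rhs) for rhs in alternatives):
                nullable.add(nt)
                changed = True
    return nullable


def first_sets(grammar):
    """Fixed point: First(NT) for every nonterminal NT."""
    nullable = derives_epsilon(grammar)
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                for symbol in rhs:
                    # Rule 1 for a terminal; rules 2 and 4 pull First(symbol)
                    # into First(nt) when the production starts with symbol.
                    new = first[symbol] if symbol in grammar else {symbol}
                    if not new <= first[nt]:
                        first[nt] |= new
                        changed = True
                    # Rule 3: look past symbol only if it derives epsilon.
                    if symbol not in nullable:
                        break
    return first
```

On this fragment, derives_epsilon(GRAMMAR) contains only Term', and first_sets(GRAMMAR) gives First(Term) = {Int} and First(Term') = {*, /}.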
Rules + Request Generate System of Subset Inclusion Constraints
Grammar
  Term’ → * Int Term’
  Term’ → / Int Term’
  Term’ → ε
Request: What is First(Term’)?
Rules
  1) T ∈ First(T)
  2) First(S) ⊆ First(S β)
  3) NT derives ε implies First(β) ⊆ First(NT β)
  4) NT → S β implies First(S β) ⊆ First(NT)
Constraints
  First(* Int Term’) ⊆ First(Term’)
  First(/ Int Term’) ⊆ First(Term’)
  First(*) ⊆ First(* Int Term’)
  First(/) ⊆ First(/ Int Term’)
  * ∈ First(*)
  / ∈ First(/)

Constraint Propagation Algorithm
• Initialize sets to {}
• Propagate constraints until fixed point

Solution after each propagation step (same grammar and constraints as above):
Set                   Start    Step 1   Step 2 (fixed point)
First(*)              {*}      {*}      {*}
First(/)              {/}      {/}      {/}
First(* Int Term’)    {}       {*}      {*}
First(/ Int Term’)    {}       {/}      {/}
First(Term’)          {}       {}       {*,/}

Building A Parse Tree
• Have each procedure return the section of the parse tree for the part of the string it parsed
• Use exceptions to make code structure clean

Building Parse Tree In Example
Term()
  if (token == Int n)
    oldToken = token;
    token = NextToken();
    node = TermPrime();
    if (node == NULL) return oldToken;
    else return(new TermNode(oldToken, node));
  else throw SyntaxError

TermPrime()
  if ((token == *) || (token == /))
    first = token;
    next = NextToken();
    if (next == Int n)
      token = NextToken();
      return(new TermPrimeNode(first, next, TermPrime()))
    else throw SyntaxError
  else return(NULL)

Parse Tree for 2*3*4
Concrete parse tree:          Term( Int 2  Term’( * Int 3  Term’( * Int 4  Term’(ε) ) ) )
Desired abstract parse tree:  Term( Term( Int 2 * Int 3 ) * Int 4 )

Why Use Hand-Coded Parser?
• Why not use parser generator?
• What do you do if your parser doesn’t work?
  • Recursive descent parser - write more code
  • Parser generator
    • Hack grammar
    • But if parser generator doesn’t work, nothing you can do
• If you have complicated grammar
  • Increase chance of going outside comfort zone of parser generator
  • Your parser may NEVER work

Bottom Line
• Recursive descent parser properties
  • Probably more work
  • But less risk of a disaster - you can almost always make a recursive descent parser work
  • May have easier time dealing with resulting code
    • Single language system
    • No need to deal with potentially flaky parser generator
    • No integration issues with automatically generated code
• If your parser development time is small compared to rest of project, or you have a really complicated language, use hand-coded recursive descent parser

Summary
• Top-Down Parsing
• Use Lookahead to Avoid Backtracking
• Parser is
  • Hand-Coded
  • Set of Mutually Recursive Procedures

Direct Generation of Abstract Tree
• TermPrime builds an incomplete tree
  • Missing leftmost child
  • Returns root and incomplete node
• (root, incomplete) = TermPrime()
  • Called with token = *
  • Remaining tokens = 3 * 4
• Result (the missing left child of incomplete is to be filled in by the caller):
  root       = Term( incomplete * Int 4 )
  incomplete = Term( <missing left child> * Int 3 )

Code for Term
Term()
  if (token == Int n)
    leftmostInt = token;
    token = NextToken();
    (root, incomplete) = TermPrime();
    if (root == NULL) return leftmostInt;
    incomplete.leftChild = leftmostInt;
    return root;
  else throw SyntaxError

• Input to parse: 2*3*4
• On entry, token = Int 2
• After TermPrime() returns, leftmostInt (Int 2) becomes the left child of incomplete, giving Term( Term( Int 2 * Int 3 ) * Int 4 )

Code for TermPrime
TermPrime()
  if ((token == *) || (token == /))
    op = token;
    next = NextToken();
    if (next == Int n)
      token = NextToken();
      (root, incomplete) = TermPrime();
      if (root == NULL)
        root = new ExprNode(NULL, op, next);
        return (root, root);
      else
        newChild = new ExprNode(NULL, op, next);
        incomplete.leftChild = newChild;
        return (root, newChild);
    else throw SyntaxError
  else return (NULL, NULL)

(The NULL first argument of ExprNode is the missing left child, to be filled in by the caller.)

MIT OpenCourseWare http://ocw.mit.edu
6.035 Computer Language Engineering Spring 2010
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
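The Term and TermPrime procedures from the Direct Generation of Abstract Tree slides can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's implementation: the names Node, Parser, expect_int, and parse_term are hypothetical, tokens are given as a list of Python ints and operator strings, and Python's built-in SyntaxError stands in for the slides' SyntaxError.

```python
class Node:
    """Binary operator node of the abstract tree."""

    def __init__(self, left, op, right):
        self.left, self.op, self.right = left, op, right

    def __repr__(self):
        return f"({self.left!r} {self.op} {self.right!r})"


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)  # e.g. [2, "*", 3, "*", 4]
        self.pos = 0

    @property
    def token(self):
        """Current input symbol (None at end of input)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect_int(self):
        """Consume and return an Int token, or raise SyntaxError."""
        tok = self.token
        if not isinstance(tok, int):
            raise SyntaxError(f"expected Int, found {tok!r}")
        self.pos += 1
        return tok

    def term(self):
        leftmost = self.expect_int()
        root, incomplete = self.term_prime()
        if root is None:
            return leftmost
        incomplete.left = leftmost  # fill in the missing leftmost child
        return root

    def term_prime(self):
        """Return (root, incomplete); incomplete lacks its left child."""
        if self.token not in ("*", "/"):
            return None, None  # Term' -> epsilon
        op = self.token
        self.pos += 1
        operand = self.expect_int()
        node = Node(None, op, operand)  # left child filled in later
        root, incomplete = self.term_prime()
        if root is None:
            return node, node
        incomplete.left = node
        return root, node


def parse_term(tokens):
    """Parse a complete Term and return its abstract tree."""
    parser = Parser(tokens)
    tree = parser.term()
    if parser.token is not None:
        raise SyntaxError(f"unexpected token {parser.token!r}")
    return tree
```

For the input 2*3*4, repr(parse_term([2, "*", 3, "*", 4])) is ((2 * 3) * 4), the left-associative abstract tree, even though the grammar being parsed is the right-recursive one.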