Lecture 11: Parsing • Build parse from an expression CSC 131 Interested in abstract tree Fall, 2014 • - drops irrelevant details from Kim Bruce

Arithmetic Base recognizer (ignore building tree now) on productions: ::= ! ::= ! | ! ::= ! addop (fst:rest) = if fst==’+’ or fst==’-‘ then rest! | ! else error ...! ::= ( ) ! | NUM ! exp input = let! | ID ! inputAfterExp = exp input! ::= + | -! inputAfterAddop = addOp inputAfterExp! ::= * | / rest = term inputAfterAddop! in ! rest or Look at parse tree & for 2 * 3 + 7 fun exp input = term(addOp(exp input)); Problems Rewrite Grammar ::= (1)! ::= (2)! | ε (3)! ::= (4)! ::= (5)! • How do we select which production to use | ε (6)! when alternatives? ::= ( ) (7)! | NUM (8)! Left-recursive - never terminates • | ID (9)! ::= + | - (10)! ::= * | / (11) No lef recursion How do we know which production to take?

Predictive Parsing FIRST

Goal: a1a2...an • Intuition: b ∈ First(X) iff there is a derivation X →* bω for some ω. S → α 1.First(b) = b for b a terminal or the empty string ... 2.If have X ::= ω1 | ω2 | ... | ωn then → a1a2Xβ First(X) = First(ω1) ∪ ... ∪ First(ωn) Want next terminal character derived to be a3 3.For any right hand side u1u2...un a3 in First(γ) - First(u1) ⊆ First(u1u2...un) Need to apply a production X ::= γ where if all of u , u ..., u can derive the empty string then 1) γ can eventually derive a string starting with a3 or - 1 2 i-1 2) If X can derive the empty string, and also also First(ui) ⊆ First(u1u2...un) if β can derive a string starting with a3. - empty string is in First(u1u2...un) iff all of u1, u2..., un can derive the empty string a3 in Folow(X) First for Arithmetic Follow

• Intuition: A terminal b ∈ Follow(X) iff there is a FIRST() = { +, - } derivation S →* vXbω for some v and ω. FIRST() = { *, / } 1.If S is the start symbol then put EOF ∈ Follow(S) FIRST() = { (, NUM, ID } 2.For all rules of the form A ::= wXv, FIRST() = { (, NUM, ID } a.Add all elements of First(v) to Follow(X) FIRST() = { (, NUM, ID } If v can derive the empty string then add all elts of FIRST() = { +, -, ε } b. Follow(A) to Follow(X) FIRST() = { *, /, ε } • Follow(X) only used if can derive empty string from X.

Follow for ArithmeticOnly needed to Predictive Parsing, redux calculate for , Goal: a1a2...an FOLLOW() = { EOF, ) } ! FOLLOW() = FOLLOW() = { EOF, ) } S → α FOLLOW() = FIRST() ∪ ... FOLLOW() ∪ FOLLOW() → a1a2Xβ = { +, -, EOF, ) } Want next terminal character derived to be a FOLLOW() = { +, -, EOF, ) } 3 FOLLOW() = { *, /, +, -, EOF } Need to apply a production X ::= γ where FOLLOW() = { (, NUM, ID } Not needed! 1) γ can eventually derive a string starting with a3 or FOLLOW() = { (, NUM, ID } } 2) If X can derive the empty string, then see if β can derive a string starting with a3. Building Table Need Unambiguous

• Put X ::= α in entry (X,a) if either • No table entry should have more than one production to ensure it’s unambiguous, as otherwise we - a in First(α), or don’t know which rule to apply. - e in First(α) and a in Follow(X) • Laws of predictive parsing: • Consequence: X ::= α in entry (X,a) iff there is - If A ::= α1 | ...| αn then for all i̸≠ j, a derivation s.t. applying production can First(αi) ∩ First(αj) = ∅. eventually lead to string starting with a. - If X →* ε, then First(X) ∩ Follow(X) = ∅.

See ArithParse.hs

Non- ID NUM Addop Mulop ( ) EOF • Laws of predictive parsing: terminals 1 1 1 - If A ::= α1 | ...| αn then for all i̸≠ j, First(αi) ∩ First(αj) = ∅. 2 3 3 - If X →* ε, then First(X) ∩ Follow(X) = ∅. 4 4 4 • 2nd is OK for arithmetic: 6 5 6 6 - FIRST() = { +, -, ε } } 9 8 7 - FOLLOW() = { EOF, ) } no overlap! 10 - FIRST() = { *, /, ε } 11 - FOLLOW() = { +, -, EOF, ) } } Read off fom table which production to apply! Table-Driven Stack-based Parser • http://en.wikipedia.org/wiki/LL_parser • Start with “S $” on stack and “input $” to be recognized. Alternatives to Recursive Descent • Use table to replace non-terminals on top of Parsers stack. • If terminal on top of stack matches next input then erase both and proceed. • Success if end up clearing stack and input • Show with ID * (NUM + NUM)$

Another alternative More Options

• LR(1) parsers -- bottom up, gives right-most derivation. Also stack-based. Parser Combinators • is LR(1). ANTLR is LL(1). • - Domain specific language for parsing. k in LL(k) and LR(k) indicates how many • - Even easier to tie to grammar than recursive descent letters of look ahead are necessary -- e.g. length of strings in columns of table. - Build into Haskell and Scala, definable elsewhere • Talk about when cover Scala • writers are happiest with k=1 to avoid exponential blow-up of table. May have to rewrite .