LL Parsing: continued
David Notkin
Autumn 2008

Algorithms

• Earley’s algorithm (1970) works for all CFGs
  – O(N^3) worst case performance
  – O(N^2) for unambiguous grammars
  – Based on dynamic programming, used primarily for computational linguistics
• Different parsing algorithms generally place various restrictions on the grammar of the language to be parsed
• Top-down
• Bottom-up
• Recursive descent
• LL
• LR
• LALR
• SLR
• CYK
• GLR
• Simple precedence parser
• Bounded context
• …
• ACM digital library returned 5600+ articles matching “parsing algorithm”
• Google Scholar almost 34,000

CSE401 Au08 2

Top Down Parsing

• Build parse tree from the top (start symbol) down to leaves (terminals)
• Basic issue: when expanding a nonterminal, which right hand side should be selected?
• Solution: look at input tokens to decide

    Stmts  ::= Call | Assign | If | While
    Call   ::= Id ( Expr {,Expr} )
    Assign ::= Id := Expr ;
    If     ::= if Test then Stmts end
             | if Test then Stmts else Stmts end
    While  ::= while Test do Stmts end

Predictive Parser

• Predictive parser: top-down parser that uses at most the next k tokens to select production (the lookahead)
• Efficient: no backtracking needed, linear time to parse
• Implementations (analogous to lexing)
  – recursive-descent parser
    • each nonterminal parsed by a procedure
    • call other procedures to parse sub-nonterminals, recursively
    • typically written by hand
  – table-driven parser
    • push-down automata: essentially a table-driven FSA, plus stack to do recursive calls
    • typically generated by a tool from a grammar specification

LL(k) Grammars

• Can construct predictive parser automatically and easily if grammar is LL(k)
  – Left-to-right scan of input, finds leftmost derivation
  – k tokens of look ahead needed
• Some restrictions including
  – no ambiguity
  – no common prefixes of length ≥ k:
        If ::= if Test then Stmts end
             | if Test then Stmts else Stmts end
  – no left recursion (e.g., E ::= E Op E | ...)
• Restrictions guarantee that, given k input tokens, can always select correct right hand side to expand nonterminal.

What if there are common prefixes?

• Left factor common prefixes to eliminate them
  – create new nonterminal for different suffixes
  – delay choice until after common prefix
• Before
    If ::= if Test then Stmts end
         | if Test then Stmts else Stmts end
• After
    If     ::= if Test then Stmts IfCont
    IfCont ::= end | else Stmts end
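The left-factored If rule can be parsed with a single token of lookahead, because the choice between the two suffixes is delayed until IfCont. The following is a minimal sketch under stated assumptions: the Token enum, the Parser struct, and the function names are hypothetical, and Test and Stmts are each stood in for by one placeholder token rather than being parsed.

```c
#include <stddef.h>

/* Hypothetical token kinds; TOK_TEST and TOK_STMTS stand in for whole
   Test and Stmts phrases, which this sketch does not parse. */
typedef enum { TOK_IF, TOK_THEN, TOK_ELSE, TOK_END,
               TOK_TEST, TOK_STMTS, TOK_EOF } Token;

typedef struct { const Token *toks; size_t pos; } Parser;

static Token peek(const Parser *p) { return p->toks[p->pos]; }

/* Consume the next token if it is t; never advance past TOK_EOF. */
static int match(Parser *p, Token t) {
    if (peek(p) != t) return 0;
    if (t != TOK_EOF) p->pos++;
    return 1;
}

/* IfCont ::= end | else Stmts end   (one token decides the alternative) */
static int parse_if_cont(Parser *p) {
    switch (peek(p)) {
    case TOK_END:
        return match(p, TOK_END);
    case TOK_ELSE:
        return match(p, TOK_ELSE) && match(p, TOK_STMTS) && match(p, TOK_END);
    default:
        return 0;
    }
}

/* If ::= if Test then Stmts IfCont */
static int parse_if(Parser *p) {
    return match(p, TOK_IF) && match(p, TOK_TEST) && match(p, TOK_THEN)
        && match(p, TOK_STMTS) && parse_if_cont(p);
}

/* toks must end with TOK_EOF. Returns 1 if the whole array is one If. */
int parse_if_statement(const Token *toks) {
    Parser p = { toks, 0 };
    return parse_if(&p) && match(&p, TOK_EOF);
}
```

Without left factoring, the parser would have to choose between the two If alternatives before seeing whether `end` or `else` follows the first Stmts.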


Left recursion? Rewrite…

Before
    E ::= E + T | T
    T ::= T * F | F
    F ::= id | ...

After
    E    ::= T ECon
    ECon ::= + T ECon | ε
    T    ::= F TCon
    TCon ::= * F TCon | ε
    F    ::= id | ...

• May not be as clear; can sugar it
    E ::= T { + T }
    T ::= F { * F }
    F ::= id | ( E ) | …
• Greater distance from concrete syntax to abstract syntax

Table-driven predictive parser

• Automatically compute PREDICT table from grammar
• PREDICT(nonterminal,input-token) => right hand side
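A table-driven predictive parser for the rewritten (non-left-recursive) expression grammar can be sketched as a loop over an explicit stack. This is an illustration under assumptions, not a generated parser: the PREDICT entries are worked out by hand, F is taken to be `id | ( E )`, and each symbol is encoded as one character (`e` for ECon, `t` for TCon, `i` for id). The names `predict` and `ll1_parse` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Nonterminals: E, e (ECon), T, t (TCon), F. Terminals: i + * ( ).
   The terminating '\0' of the input plays the role of end-of-input. */
static int is_nonterminal(char c) {
    return c != '\0' && strchr("EeTtF", c) != NULL;
}

/* PREDICT(nonterminal, lookahead) => right hand side, or NULL on error.
   The empty string stands for an epsilon production. */
static const char *predict(char nonterminal, char lookahead) {
    switch (nonterminal) {
    case 'E':
        return (lookahead == 'i' || lookahead == '(') ? "Te" : NULL;
    case 'e':
        if (lookahead == '+') return "+Te";
        return (lookahead == ')' || lookahead == '\0') ? "" : NULL;
    case 'T':
        return (lookahead == 'i' || lookahead == '(') ? "Ft" : NULL;
    case 't':
        if (lookahead == '*') return "*Ft";
        return (lookahead == '+' || lookahead == ')' || lookahead == '\0')
                   ? "" : NULL;
    case 'F':
        if (lookahead == 'i') return "i";
        return lookahead == '(' ? "(E)" : NULL;
    default:
        return NULL;
    }
}

/* Returns 1 if input is a sentence of the grammar, 0 otherwise. */
int ll1_parse(const char *input) {
    char stack[256];
    size_t sp = 0, pos = 0;
    stack[sp++] = 'E';                          /* start symbol */
    while (sp > 0) {
        char top = stack[--sp];
        char look = input[pos];
        if (is_nonterminal(top)) {
            const char *rhs = predict(top, look);
            size_t n;
            if (rhs == NULL) return 0;          /* empty table cell */
            n = strlen(rhs);
            if (sp + n > sizeof stack) return 0;
            while (n > 0) stack[sp++] = rhs[--n];  /* push RHS reversed */
        } else {
            if (top != look) return 0;          /* terminal mismatch */
            pos++;
        }
    }
    return input[pos] == '\0';                  /* all input consumed */
}
```

For example, `ll1_parse("i+i*i")` and `ll1_parse("(i+i)*i")` return 1, while `ll1_parse("i+")` and `ll1_parse("i)")` return 0.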

Compute PREDICT table

• Compute FIRST set for each right hand side
  – All tokens that can appear first in a derivation from that right hand side
• In case right hand side can be empty
  – Compute FOLLOW set for each non-terminal
    • All tokens that can appear immediately after that non-terminal in a derivation
• Compute FIRST and FOLLOW sets mutually recursively
• PREDICT then depends on the FIRST set

Example for you to do: if you want
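As one worked instance of the FIRST and FOLLOW computation, the sketch below handles the rewritten expression grammar (E, ECon, T, TCon, F). Note the swapped-in technique: instead of mutual recursion it iterates to a fixed point, repeating until no set grows. The one-character symbol encoding (`e` = ECon, `t` = TCon, `i` = id, `$` = end of input), the choice F ::= id | ( E ), and all function names are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

typedef struct { char lhs; const char *rhs; } Production;

static const Production GRAMMAR[] = {
    {'E', "Te"}, {'e', "+Te"}, {'e', ""},      /* "" is an empty RHS */
    {'T', "Ft"}, {'t', "*Ft"}, {'t', ""},
    {'F', "i"},  {'F', "(E)"},
};
enum { N_PRODS = sizeof GRAMMAR / sizeof GRAMMAR[0], N_NONTERMS = 5 };

static const char NONTERMS[] = "EeTtF";
static const char TERMS[] = "i+*()$";          /* '$' marks end of input */

static unsigned first_set[N_NONTERMS];         /* bit k set: TERMS[k] in set */
static unsigned follow_set[N_NONTERMS];
static int nullable[N_NONTERMS];

static int nt_index(char c) {
    const char *p = strchr(NONTERMS, c);
    return (c != '\0' && p != NULL) ? (int)(p - NONTERMS) : -1;
}

static unsigned term_bit(char c) {             /* c must be in TERMS */
    return 1u << (strchr(TERMS, c) - TERMS);
}

/* FIRST of a symbol string; *empty is set if the string can derive empty. */
static unsigned first_of_string(const char *s, int *empty) {
    unsigned set = 0;
    for (; *s != '\0'; s++) {
        int nt = nt_index(*s);
        if (nt < 0) { *empty = 0; return set | term_bit(*s); }
        set |= first_set[nt];
        if (!nullable[nt]) { *empty = 0; return set; }
    }
    *empty = 1;
    return set;
}

void compute_first_follow(void) {
    int changed = 1;
    memset(first_set, 0, sizeof first_set);
    memset(follow_set, 0, sizeof follow_set);
    memset(nullable, 0, sizeof nullable);
    follow_set[nt_index('E')] = term_bit('$'); /* end marker follows start */
    while (changed) {
        changed = 0;
        for (int p = 0; p < N_PRODS; p++) {
            int lhs = nt_index(GRAMMAR[p].lhs), empty;
            const char *rhs = GRAMMAR[p].rhs;
            unsigned f = first_of_string(rhs, &empty);
            if ((first_set[lhs] | f) != first_set[lhs]) {
                first_set[lhs] |= f; changed = 1;
            }
            if (empty && !nullable[lhs]) { nullable[lhs] = 1; changed = 1; }
            for (size_t k = 0; rhs[k] != '\0'; k++) {
                int nt = nt_index(rhs[k]);
                unsigned add;
                if (nt < 0) continue;          /* FOLLOW is for nonterminals */
                add = first_of_string(rhs + k + 1, &empty);
                if (empty) add |= follow_set[lhs];
                if ((follow_set[nt] | add) != follow_set[nt]) {
                    follow_set[nt] |= add; changed = 1;
                }
            }
        }
    }
}

unsigned first_of(char nt)  { int k = nt_index(nt); return k < 0 ? 0 : first_set[k]; }
unsigned follow_of(char nt) { int k = nt_index(nt); return k < 0 ? 0 : follow_set[k]; }
int is_nullable(char nt)    { int k = nt_index(nt); return k < 0 ? 0 : nullable[k]; }

int in_set(unsigned set, char terminal) {
    return terminal != '\0' && strchr(TERMS, terminal) != NULL
        && (set & term_bit(terminal)) != 0;
}
```

After `compute_first_follow()`, FIRST(E) contains `i` and `(`, and FOLLOW(F) contains `*`, `+`, `)` and `$`. A production with an empty right hand side is then predicted on the tokens in the FOLLOW set of its left hand side.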


PREDICT and LL(1)

• If PREDICT table has at most one entry per cell
  – Then the grammar is LL(1)
  – There is always exactly one right choice
• So it’s fast to parse and easy to implement
• If multiple entries in a cell
  – Ex: common prefixes, left recursion, ambiguity
  – Can rewrite grammar (sometimes)
  – Can patch table manually, if you “know” what to do
  – Or can use more powerful parsing technique

Top down implementation

• For years the 401 parser was a top-down predictive parser, implemented by a method for each nonterminal
  – We have shifted to a bottom-up, automatically generated parser
  – But if you’re going to build a simple one, this is usually best
• Examples from http://en.wikibooks.org/wiki/Compiler_construction
  – Helper functions:

    int accept(Symbol s) {
        if (sym == s) {
            getsym();
            return 1;
        }
        return 0;
    }

    int expect(Symbol s) {
        if (accept(s))
            return 1;
        error("expect: unexpected symbol");
        return 0;
    }

Example method

    void statement(void) {
        if (accept(ident)) {
            expect(becomes);
            expression();
        …
        } else if (accept(ifsym)) {
            condition();
            expect(thensym);
            statement();
        } else if (accept(whilesym)) {
            condition();
            expect(dosym);
            statement();
        }
    }

Example method

    void factor(void) {
        if (accept(ident)) {
            ;
        } else if (accept(number)) {
            ;
        } else if (accept(lparen)) {
            expression();
            expect(rparen);
        } else {
            error("factor: syntax error");
            getsym();
        }
    }
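The example methods rely on `sym`, `getsym()`, and the other nonterminal procedures, which the slides do not show. The following self-contained sketch fills those in under assumptions of its own: symbols are single characters read from a string, numbers are single digits, and the grammar is the sugared form E ::= T { + T }, T ::= F { * F }, F ::= digit | ( E ). It evaluates the expression while parsing, which the slide code does not do. `evaluate` and `failed` are hypothetical names.

```c
#include <stddef.h>

static const char *src;   /* remaining input */
static char sym;          /* current symbol, '\0' at end of input */
static int failed;        /* set once any syntax error is seen */

/* Load the next symbol; stay put once the end of input is reached. */
static void getsym(void) {
    sym = *src;
    if (sym != '\0') src++;
}

static int accept(char s) {
    if (sym == s) { getsym(); return 1; }
    return 0;
}

static int expect(char s) {
    if (accept(s)) return 1;
    failed = 1;           /* record the error instead of printing it */
    return 0;
}

static int expression(void);

/* F ::= digit | ( E ) */
static int factor(void) {
    if (sym >= '0' && sym <= '9') {
        int v = sym - '0';
        getsym();
        return v;
    }
    if (accept('(')) {
        int v = expression();
        expect(')');
        return v;
    }
    failed = 1;
    return 0;
}

/* T ::= F { * F }   (the repetition becomes a loop, not left recursion) */
static int term(void) {
    int v = factor();
    while (accept('*')) v *= factor();
    return v;
}

/* E ::= T { + T } */
static int expression(void) {
    int v = term();
    while (accept('+')) v += term();
    return v;
}

/* Returns 1 and stores the result in *value on success, 0 on syntax error. */
int evaluate(const char *text, int *value) {
    src = text;
    failed = 0;
    getsym();
    *value = expression();
    if (sym != '\0') failed = 1;   /* trailing input is an error */
    return !failed;
}
```

With `int v;`, the call `evaluate("2+3*4", &v)` returns 1 and sets `v` to 14, and `evaluate("(2+3)*4", &v)` sets it to 20. Inputs such as `"2+"` or `"2)"` make it return 0.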

Bottom up parsing

• Construct parse tree for input from leaves up
  – reducing a string of tokens to single start symbol by inverting productions
• Bottom-up parsing is more general than top-down parsing and just as efficient
  – generally preferred in practice

    int * int + int     T ::= int
    int * T + int       T ::= int * T
    T + int             T ::= int
    T + T               E ::= T
    T + E               E ::= T + E
    E

Read the productions found by bottom-up parse bottom to top; this is a rightmost derivation!

“Shift-reduce” strategy

• read (“shift”) tokens until the right hand side of “correct” production has been seen
• reduce handle to nonterminal, then continue
• done when all input read and reduced to start nonterminal

(Figure: input string xyzabcdef with a position marker ^, alongside A ::= bc.D)
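The shift-reduce strategy can be sketched for the example grammar E ::= T + E | T, T ::= int * T | int. The shift and reduce decisions below are hand-coded for this one grammar with one token of lookahead, which is an assumption of the sketch and not how a generated LR parser decides. Tokens are single characters, with `n` standing for int. The function records each reduction as a rule number so the order can be compared with the trace above.

```c
#include <stddef.h>
#include <string.h>

/* Rules: 1: T ::= int   2: T ::= int * T   3: E ::= T   4: E ::= T + E
   Returns 1 if input reduces to E; trace receives the rule numbers applied,
   in order, as a string of digits. */
int shift_reduce(const char *input, char *trace, size_t trace_cap) {
    char stack[256];
    size_t sp = 0, pos = 0, nt = 0;
    if (trace_cap == 0) return 0;
    trace[0] = '\0';
    for (;;) {
        char look = input[pos];
        int rule = 0;
        /* Is there a handle on top of the stack? */
        if (sp >= 3 && memcmp(stack + sp - 3, "n*T", 3) == 0) rule = 2;
        else if (sp >= 3 && memcmp(stack + sp - 3, "T+E", 3) == 0) rule = 4;
        else if (sp >= 1 && stack[sp - 1] == 'n' && look != '*') rule = 1;
        else if (sp >= 1 && stack[sp - 1] == 'T' && look != '+') rule = 3;

        if (rule != 0) {                       /* reduce */
            if (nt + 1 >= trace_cap) return 0;
            trace[nt++] = (char)('0' + rule);
            trace[nt] = '\0';
            switch (rule) {
            case 1: stack[sp - 1] = 'T'; break;
            case 2: sp -= 2; stack[sp - 1] = 'T'; break;
            case 3: stack[sp - 1] = 'E'; break;
            case 4: sp -= 2; stack[sp - 1] = 'E'; break;
            }
        } else if (look != '\0') {             /* shift */
            if (strchr("n+*", look) == NULL) return 0;  /* not a token */
            if (sp == sizeof stack) return 0;
            stack[sp++] = look;
            pos++;
        } else {
            break;                             /* nothing left to do */
        }
    }
    /* Done when all input is read and reduced to the start nonterminal. */
    return sp == 1 && stack[0] == 'E';
}
```

For the input `"n*n+n"` (int * int + int) the trace is `"12134"`: T ::= int, T ::= int * T, T ::= int, E ::= T, E ::= T + E, the same order as the reductions listed in the example.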


LR(k)

• LR(k) parsing
  – Left-to-right scan of input, rightmost derivation
  – k tokens of look ahead
• Strictly more general than LL(k)
  – Gets to look at whole right hand side of production before deciding what to do, not just first k tokens
  – Can handle left recursion and common prefixes
  – As efficient as any top-down parsing
• Complex to implement
  – Generally need automatic tools to construct parser from grammar

LR Parsing Tables

• Construct parsing tables implementing a FSA with a stack
  – rows: states of parser
  – columns: token(s) of lookahead
  – entries: action of parser
    • shift, goto state X
    • reduce production “X ::= RHS”
    • accept
    • error
• Algorithm to construct FSA similar to algorithm to build DFA from NFA
  – each state represents set of possible places in parsing
• LR(k) algorithm may build huge tables
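To make the table layout concrete, here is a minimal sketch of a table-driven LR parser for a tiny grammar chosen for illustration (it does not appear in the slides): S ::= ( S ) | x. The six states and their entries were derived by hand from the LR(0) item sets, with reductions restricted to the tokens in FOLLOW(S), so the tables are SLR(1)-style. All names are hypothetical.

```c
#include <stddef.h>

enum { T_LP, T_RP, T_X, T_END, N_TOKENS };       /* columns: lookahead */
enum { A_ERR, A_SHIFT, A_REDUCE, A_ACCEPT };     /* entry kinds */

typedef struct { int kind; int arg; } Action;    /* arg: state or rule */

/* Rules: 1: S ::= ( S )   2: S ::= x.  Rows are parser states. */
static const Action ACTION[6][N_TOKENS] = {
    /*        (               )               x               end        */
    /* 0 */ { {A_SHIFT, 2},  {A_ERR, 0},     {A_SHIFT, 3},  {A_ERR, 0}    },
    /* 1 */ { {A_ERR, 0},    {A_ERR, 0},     {A_ERR, 0},    {A_ACCEPT, 0} },
    /* 2 */ { {A_SHIFT, 2},  {A_ERR, 0},     {A_SHIFT, 3},  {A_ERR, 0}    },
    /* 3 */ { {A_ERR, 0},    {A_REDUCE, 2},  {A_ERR, 0},    {A_REDUCE, 2} },
    /* 4 */ { {A_ERR, 0},    {A_SHIFT, 5},   {A_ERR, 0},    {A_ERR, 0}    },
    /* 5 */ { {A_ERR, 0},    {A_REDUCE, 1},  {A_ERR, 0},    {A_REDUCE, 1} },
};

/* State to enter after reducing to S, indexed by the uncovered state. */
static const int GOTO_S[6] = { 1, -1, 4, -1, -1, -1 };

/* Number of right hand side symbols per rule (index 0 unused). */
static const int RULE_LEN[3] = { 0, 3, 1 };

static int token_of(char c) {
    switch (c) {
    case '(':  return T_LP;
    case ')':  return T_RP;
    case 'x':  return T_X;
    case '\0': return T_END;
    default:   return -1;
    }
}

/* Returns 1 if input is a sentence of S ::= ( S ) | x, 0 otherwise. */
int lr_parse(const char *input) {
    int stack[256];                  /* stack of states */
    int sp = 0;
    size_t pos = 0;
    stack[0] = 0;
    for (;;) {
        int tok = token_of(input[pos]);
        Action a;
        if (tok < 0) return 0;
        a = ACTION[stack[sp]][tok];
        switch (a.kind) {
        case A_SHIFT:
            if (sp + 1 >= 256) return 0;
            stack[++sp] = a.arg;     /* push the new state */
            pos++;
            break;
        case A_REDUCE: {
            int next;
            sp -= RULE_LEN[a.arg];   /* pop one state per RHS symbol */
            next = GOTO_S[stack[sp]];
            if (next < 0) return 0;
            stack[++sp] = next;
            break;
        }
        case A_ACCEPT:
            return 1;
        default:
            return 0;                /* error entry */
        }
    }
}
```

For example, `lr_parse("((x))")` returns 1, while `lr_parse("(x")` and `lr_parse("x)")` return 0. Even this two-rule grammar needs six states, which hints at why tables for realistic grammars are built by tools.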


Questions?

Ada language/compiler color

• US DoD wanted (roughly) a single, high-level programming language
• They wrote requirements for this language and received 14 bids (1977)
• Four semi-finalists (1978): green (Cii), red (Intermetrics), blue (SofTech), yellow (SRI)
• Two finalists: green and red
  – requirements finalized as Steelman document


General syntax: examples from Steelman

• 2A. Character Set. The full set of character graphics that may be used in source programs shall be given in the language definition. Every source program shall also have a representation that uses only the following 55 character subset of the ASCII graphics: …
• 2B. Grammar. The language should have a simple, uniform, and easily parsed grammar and lexical structure. The language shall have free form syntax and should use familiar notations where such use does not conflict with other goals.
• 2D. Other Syntactic Issues. Multiple occurrences of a language defined symbol appearing in the same context shall not have essentially different meanings. …
• 2E. Mnemonic identifiers. Mnemonically significant identifiers shall be allowed. There shall be a break character for use within identifiers. The language and its translators shall not permit identifiers or reserved words to be abbreviated. …
• 2G. Numeric Literals. There shall be built-in decimal literals. There shall be no implicit truncation or rounding of integer and fixed point literals.

York Ada compiler (c. 1986)
“Facts and Figures About the York Ada Compiler” (Wand et al.)

• Written in C
• About 80 KLOC for compiler
  – Front-end about 57 KLOC, code gen about 20 KLOC, VAX-specific code gen about 3 KLOC
• 7 KLOC for run-time
• “It is difficult to make an accurate estimate of the time taken to write the compiler because the compiler writers had other demands on their time (completing PhDs, teaching, etc.). Fourteen individuals have been involved at various times during the project and have contributed approximately 20 man years to the design and construction of the software. The money spent directly to support the construction of the compiler was [approximately $340k], however this included neither the salaries of four members of the project nor the cost of computer time (we used approximately 30% of a VAX-11/780 over a five year period).”
