CS164: Introduction to Programming Languages and Compilers, Fall 2010
Parsers, Grammars and Logic Programming
Ras Bodik, Thibaud Hottelier, James Ide
UC Berkeley

Administrivia

• Project regrades: submit a patch to your solution; a small penalty is assessed, a function of the patch size.
• PA extra credit for bugs in solutions, starter kits, and handouts; we keep track of the bugs you find on the Errata wiki page.
• How to obtain a solution to a programming assignment: see the staff for a hardcopy.
• Office hours today: an "HW3 clinic". For today only, the room changes to 511, with extended hours, 4 to 5:30.

PA2 feedback: How to desugar

Desugaring proceeds in stages:

    program in source language (text)
    --> program in source language (AST)
    --> program in core language (AST)

PA2 feedback: How to specify desugaring

Desugaring is a tree-to-tree transformation: you translate abstract syntax to abstract syntax. But it is easier to specify the transformation in text, using concrete syntax, as a text-to-text transformation.

Example:

    while (E) { StmtList }
    --> _WhileFun_(lambda(){E}, lambda(){StmtList})

Outline (hidden slide)

Goal of the lecture:
• languages vs. grammars (a language can be described by many grammars)
• CYK and Earley parsers and their complexity (Earley is needed for HW4)
• Prolog (backtracking), Datalog (forward evaluation = dynamic programming)

Grammars:
• generation vs. recognition
• the random generator and its dual, the oracular recognizer (show their symmetry)
• write the backtracking recognizer nicely with amb

Parse trees:
• the result of parsing is a parse tree
• the oracle reconstructs the parse tree
• add tree reconstruction with amb

Prolog refresher:
• Example 1: something cool
• Example 2: something with negation

Switch to Prolog:
• rewrite the amb recognizer (no parse-tree reconstruction) in Prolog
• rewrite the amb recognizer (with parse-tree reconstruction) in Prolog
• analyze its running-time complexity (2^n)

Change this algorithm to obtain CYK, once evaluated as Datalog:
• facts should be parse(nonterminal, startPosition, endPosition)
• now ask whether the evaluation needs to be exponential. What if we instead compute all the facts bottom-up? How much work will we do? Visualize the work as a graph; this is CYK.
• analyze CYK's running time (N^3)
• analyze CYK's inefficiency; motivate Earley

Earley algorithm:
• in-progress edges
• the three actions
• give pseudocode
• explain why it gets you N^2 on unambiguous grammars
• explain HW3

Recursive definition of a language

A language is a set of (desired) strings.

Example: the language of regular expressions. Let's define this language:
• base case: any character c is a regular expression
• inductive case: if e1, e2 are regular expressions, then the following are also regular expressions:

    e1 | e2
    e1 e2
    e1 *
    ( e1 )

Grammars

Describing a language in English is tedious. We need a handy formal notation. We have been using it already; it's called a grammar.

    R ::= c | R R | R '|' R | R '*' | ( R )

Grammars vs. languages

Distinct grammars can describe the same language. L(G) denotes the language described by grammar G.

Example: the language of all strings of the form ba^i, i > 0:

    grammar 1:  S ::= S a | ba
    grammar 2:  S ::= ba A
                A ::= A a | ε

Many more grammars for this language exist.

Generate a string from L(G)

Is there a recipe to print all strings from L(G)? That depends on whether you are willing to wait: L(G) may be infinite.

Let's write a function gen(G) that prints a string from L(G). Can we do it so that, if L(G) is finite, rerunning gen(G) will eventually print each string in L(G)?

gen(G)

Grammar G and its language L(G):

    G:     E ::= a | E + E | E * E
    L(G) = { a, a+a, a*a, a*a+a, ... }

For simplicity, we hardcode G into gen():

    def gen() { E(); print EOF }
    def E() {
        switch (choice()) {
            case 1: print "a"
            case 2: E(); print "+"; E()
            case 3: E(); print "*"; E()
    }}
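To make this concrete, here is a minimal runnable Python sketch of gen() (the names and the depth cutoff are my own additions, not part of the lecture; the cutoff biases choice() toward the base case so that a random run cannot recurse forever):

    import random

    def gen():
        E()
        print("<EOF>")

    def E(depth=0):
        # choice(): pick one of E's three productions at random; past a
        # cutoff depth, force the base case so generation terminates.
        choice = 1 if depth > 4 else random.randint(1, 3)
        if choice == 1:
            print("a", end="")
        elif choice == 2:
            E(depth + 1); print("+", end=""); E(depth + 1)
        else:
            E(depth + 1); print("*", end=""); E(depth + 1)

    gen()   # prints, e.g., a+a*a<EOF>; rerunning prints other strings of L(G)

Because every run terminates, rerunning gen() samples L(G) over and over, and each string derivable within the depth cutoff is eventually printed, which is the "rerunning eventually prints each string" behavior asked for above.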
Visualizing string generation

Are we generating the string top-down or bottom-up? The tree that describes the string's derivation is the parse tree.

Parsing

Parsing is the inverse of string generation: given a string, we want to find the parse tree. If parsing is the inverse of generation, let's obtain the parser mechanically from the generator above!

Generator vs. parser

In the generator, replace print with scan, and choice() with an oracle that picks an alternative leading to a successful parse:

    def parse() { E(); scan(EOF) }
    def E() {
        switch (oracle()) {
            case 1: scan("a")
            case 2: E(); scan("+"); E()
            case 3: E(); scan("*"); E()
    }}
    def scan(s) { if input starts with s, consume s; else abort }

Parsing == reconstruction of the parse tree

Why do we need the parse tree? We evaluate it to obtain the AST, or perhaps to directly compute the value of the program.

Example 1: evaluate an expression (calculator)

Input: 2 * (4 + 5)

Annotated parse tree (each node carries the value it evaluates to):

    E (18)
      T (18)
        T (2)
          F (2)
            int (2)
        *
        F (9)
          (
          E (9)
            E (4)
              T (4)
                F (4)
                  int (4)
            +
            T (5)
              F (5)
                int (5)
          )

Parse tree vs. abstract syntax tree

• Parse tree = concrete syntax tree
  – contains all syntactic symbols from the input
  – including those that the parser needs "only" to discover
    • the intended nesting: parentheses, curly braces
    • statement termination: semicolons
• Abstract syntax tree (AST)
  – abstracts away these artifacts of parsing
  – abstraction compresses the parse tree
    • flattens parse tree hierarchies
    • drops tokens

Add parse tree reconstruction to parser

    def parse() { root = E(); scan(EOF); return root }
    def E() {
        switch (oracle()) {
            case 1: scan("a")
                    return ("a",)
            case 2: left = E()
                    scan("+")
                    right = E()
                    return ("+", left, right)
            case 3: // analogous
    }}

How to implement our oracle?

Recall the nondeterministic evaluator from CS61A: (amb 1 2 3 4 5) evaluates to 1 or ... or 5.

Which option does amb choose? One leading to success; in our case, success means parsing successfully. How was amb implemented? By backtracking.

Our parser with amb:

    def E() {
        switch (amb(1,2,3)) {
            case 1: scan("a")
            case 2: E(); scan("+"); E()
            case 3: E(); scan("*"); E()
    }}

How do we implement amb in cs164?

We won't. We could implement it with coroutines, but instead we'll move on to logic programming. We will define a parser as a backtracking logic program with exponential time complexity, and then observe that its structure actually permits a polynomial-time algorithm.

Prolog refresher

Example with lists

Backtracking parser in Prolog

Our grammar, again:

    E ::= a | E+E | E*E

A backtracking parser for this grammar in Prolog; e(In,Out) succeeds when a prefix of the token list In derives from E, leaving Out as the unconsumed suffix:

    e(In,Out) :- In = [a|Out].
    e(In,Out) :- e(In,T1), T1 = [+|T2], e(T2,Out).
    e(In,Out) :- e(In,T1), T1 = [*|T2], e(T2,Out).

This can be written more concisely as:

    e([a|T],T).
    e(In,Out) :- e(In,[+|T]), e(T,Out).
    e(In,Out) :- e(In,[*|T]), e(T,Out).

How does this parser work?

Let's start with this (incomplete) program:

    e([a|T],T).

Sample queries:

    ?- e([a,+,a],Rest).   %  Rest = [+,a]
    ?- e([a],Rest).       %  Rest = []
    ?- e([a],[]).         %  true: parsed successfully

A full example

Consider the input a+a*a, parsed with the full program:

    e(In,Out) :- In = [a|Out].
    e(In,Out) :- e(In,T1), T1 = [+|T2], e(T2,Out).
    e(In,Out) :- e(In,T1), T1 = [*|T2], e(T2,Out).
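As an aside, the same backtracking search is easy to express in Python, with generators playing the role of amb/Prolog choice points. This is my own illustrative sketch, not part of the lecture. Note that it factors the grammar as E ::= a | a + E | a * E, scanning the leading "a" before recursing, because a direct transcription of the left-recursive clauses above would recurse without bound when it backtracks; this factoring recognizes the same language but builds right-leaning trees. Like the oracle-based parser, it also reconstructs the parse tree:

    def e(inp, i):
        # Yield (tree, next_position) for every way to derive an E
        # starting at position i of the token list inp.
        if i < len(inp) and inp[i] == "a":
            # E ::= a
            yield ("a",), i + 1
            # E ::= a op E, where op is + or *
            if i + 1 < len(inp) and inp[i + 1] in ("+", "*"):
                op = inp[i + 1]
                for right, j in e(inp, i + 2):
                    yield (op, ("a",), right), j

    def parse(tokens):
        # Succeed if some derivation consumes the entire input.
        for tree, j in e(tokens, 0):
            if j == len(tokens):
                return tree
        return None

    print(parse(["a", "+", "a", "*", "a"]))
    # ('+', ('a',), ('*', ('a',), ('a',)))

Each yield is one answer of the nondeterministic search; exhausting the generator corresponds to Prolog exhausting its choice points on failure.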
Running time of the backtracking parser

We can analyze either version, the amb parser or its Prolog transcription above; they explore the same choice tree. Backtracking re-derives the same substrings over and over, so the worst-case running time is exponential, O(2^n).

CYK parser

Can we run our Prolog parser in polynomial time? Yes, with dynamic programming (an algorithmic technique). Let's refactor the program a little. Here, e(i,j) is true iff input[i:j] can be derived (i.e., generated) from E:

    e(i,i+1) :- input[i]=='a'.
    e(i,j)   :- e(i,k), input[k]=='+', e(k+1,j).
    e(i,j)   :- e(i,k), input[k]=='*', e(k+1,j).

Now, instead of top-down backtracking, let's compute all the facts bottom-up.

Bottom-up evaluation of a Prolog program

Input: a + a * a

Let's compute gradually which facts we know hold.

Step 1, base case (input segments of length 1):

    e(0,1) = e(2,3) = e(4,5) = true

Step 2, inductive case (input segments of length 3):

    e(0,3) = true    -- using rule #2
    e(2,5) = true    -- using rule #3

Step 2 again, inductive case (segments of length 5):

    e(0,5) = true    -- using either rule #2 or rule #3

This evaluation takes O(n^3) time for grammars with at most two symbols on the right-hand side.

A graphical way to visualize this evaluation

The initial graph is the input: one terminal edge per token, over positions 0..5. Then keep adding nonterminal edges until no more can be added; an edge is added when some adjacent edges form the right-hand side of a production. For the input a + a * a, the edges arrive in this order:

    terminals:  a(0,1)  +(1,2)  a(2,3)  *(3,4)  a(4,5)
    E(0,1), E(2,3), E(4,5)    -- base case, E ::= a
    E(0,3)                    -- E ::= E + E
    E(2,5)                    -- E ::= E * E
    E(0,5)                    -- spans the entire input: parse succeeded

CYK: the algorithm (can you find the bug?)

    for i = 0 .. N-1 do
        add (i, i+1, nonterm(input[i])) to graph   -- create nonterminal edges for rules A ::= d
        enqueue( (i, i+1, nonterm(input[i])) )     -- nonterm() maps terminal d to its nonterminal A

    while queue not empty do
        (j,k,B) = dequeue()
        for each edge (i,j,A) do                   -- for each edge "left-adjacent" to (j,k,B)
            if rule T ::= A B exists then
                if edge e = (i,k,T) does not exist then
                    add e to graph; enqueue(e)
        for each edge (k,l,C) do                   -- for each edge "right-adjacent" to (j,k,B)
            ..
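To make the bottom-up evaluation concrete, here is a minimal Python sketch that computes exactly the e(i,j) facts above for the specific grammar E ::= a | E+E | E*E. It is my own reference version, organized by segment length rather than by a worklist, and the function name is hypothetical:

    def cyk_e(tokens):
        # Compute all facts e(i, j): tokens[i:j] is derivable from E,
        # for the grammar E ::= a | E+E | E*E.
        n = len(tokens)
        e = set()
        # Base case, segments of length 1: rule E ::= a.
        for i in range(n):
            if tokens[i] == "a":
                e.add((i, i + 1))
        # Inductive case: consider segments in order of increasing length,
        # so both sub-facts are already available (dynamic programming).
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length
                for k in range(i + 1, j - 1):   # split: e(i,k), op at k, e(k+1,j)
                    if (i, k) in e and tokens[k] in ("+", "*") and (k + 1, j) in e:
                        e.add((i, j))
                        break
        return e

    facts = cyk_e(["a", "+", "a", "*", "a"])
    print((0, 5) in facts)    # True: the whole input derives from E
    print(sorted(facts))      # [(0, 1), (0, 3), (0, 5), (2, 3), (2, 5), (4, 5)]

The three nested loops over (length, i, k) are what bounds the work by O(n^3), in contrast to the exponential backtracking search.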