Building a Parser II CS164 3:30-5:00 TT 10 Evans Administrativia • PA2

Administrativia • PA2 assigned today –due in 12 days Building a Parser II • WA1 assigned today –due in a week – it’s a practice for the exam CS164 3:30-5:00 TT •First midterm 10 Evans –Oct 5 – will contain some project-inspired questions Prof. Bodik CS 164 Lecture 6 1 2 Prof. Bodik CS 164 Lecture 6 Overview Grammars • Grammars • Programming language constructs have recursive structure. • derivations – which is why our hand-written parser had this structure, too • Recursive descent parser • Eliminating left recursion •An expression is either: •number, or •variable, or • expression + expression, or • expression - expression, or • ( expression ), or •… 3 4 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 Context-free grammars (CFG) Example: arithmetic expressions • a natural notation for this recursive structure • Simple arithmetic expressions: E d n | id | ( E ) | E + E | E * E • grammar for our balanced parens expressions: BalancedExpression d a | ( BalancedExpression ) • Some elements of this language: • describes (generates) strings of symbols: –id – a, (a), ((a)), (((a))), … –n • like regular expressions but can refer to –( n ) – other expressions (here, BalancedExpression) –n + id – and do this recursively (giving is “non-finite state”) –id * ( id + id ) 5 6 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 1 Symbols: Terminals and Nonterminals Notational Conventions • grammars use two kinds of symbols • In these lecture notes, let’s adopt a notation: •terminals: – no rules for replacing them – Non-terminals are written upper-case – once generated, terminals are permanent – these are tokens of our language – Terminals are written lower-case •nonterminals: or as symbols, e.g., token LPAR is written as ( – to be replaced (expanded) – in regular expression lingo, these serve as names of – The start symbol is the left-hand side of the first expressions production – start non-terminal: the first symbol to be expanded 7 8 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 Derivations Example: derivation • This is how a grammar generates strings: Grammar: E d n | id | ( E ) | E + E | E * E – think of grammar rules (called productions) as rewrite rules •a derivation: • Derivation: the process of generating a string E rewrite E with ( E ) 1. begin with the start non-terminal ( E ) rewrite E with n ( n ) this is the final string of terminals 2. rewrite the non-terminal with some of its productions • another derivation (written more concisely): 3. select a non-terminal in your current string E d ( E ) d ( E * E ) d ( E + E * E ) d ( n + E * E ) d ( n + id * E ) i. if no non-terminal left, done. d ( n + id * id ) ii. otherwise go to step 2. 9 10 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 So how do derivations help us in parsing? Decaf Example • A program (a string of tokens) has no syntax A fragment of Decaf: error if it can be derived from the grammar. – but so far you only know how to derive some (any) STMT d while ( EXPR ) STMT string, not how to check if a given string is | id ( EXPR ) ; derivable EXPR d EXPR + EXPR • So how to do parsing? | EXPR – EXPR – a naïve solution: derive all possible strings and | EXPR < EXPR check if your program is among them | ( EXPR ) – not as bad as it sounds: there are parsers that do | id this, kind of. Coming soon. 11 12 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 2 Decaf Example (Cont.) CFGs (definition) Some elements of the (fragment of) language: •A CFG consists of – A set of terminal symbols T id ( id ) ; id ( ( ( ( id ) ) ) ) ; – A set of non-terminal symbols N while ( id < id ) id ( id ) ; – A start symbol S (a non-terminal) while ( while ( id ) ) id ( id ) ; – A set of productions: while ( id ) while ( id ) while ( id ) id ( id ) ; produtions are of two forms (X N) Question: One of the STMT d while ( EXPR ) STMT X d ε , or strings is not from | id ( EXPR ) ; the language. EXPR d EXPR + EXPR | EXPR – EXPR X d Y1 Y2 ... Yn where Yi N 4 T Which one? | EXPR < EXPR | ( EXPR ) | id 13 14 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 context-free grammars Now let’s parse a string • what is “context-free”? • recursive descent parser derives all strings – means the grammar is not context-sensitive – until it matches derived string with the input string • context-sensitive gramars – or until it is sure there is a syntax error – can describe more languages than CFGs – because their productions restrict when a non- terminal can be rewritten. An example production: d N d d A B c – meaning: N can be rewritten into ABc only when preceded by d – can be used to encode semantic checks, but parsing is hard 15 16 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 Recursive Descent Parsing Recursive Descent Parsing. Example (Cont.) • Consider the grammar •Try E0 → T1 + E2 E → T + E | T • Then try a rule for T1 → ( E3 ) T → int | int * T | ( E ) – But ( does not match input token int5 • Token stream is: int * int 5 2 •TryT1 → int . Token matches. • Start with top-level non-terminal E – But + after T1 does not match input token * •Try T1 → int * T2 • Try the rules for E in order – This will match but + after T1 will be unmatched • Have exhausted the choices for T1 – Backtrack to choice for E0 17 18 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 3 Recursive Descent Parsing. Example (Cont.) A Recursive Descent Parser (2) •Try E0 → T1 • Define boolean functions that check the token string for a match of • Follow same steps as before for T1 – A given token terminal – And succeed with T1 → int * T2 and T2 → int – With the following parse tree bool term(TOKEN tok) { return in[next++] == tok; } th E0 – A given production of S (the n ) bool Sn() { … } T 1 – Any production of S: bool S() { … } int5 * T2 int2 • These functions advance next 19 20 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 A Recursive Descent Parser (3) A Recursive Descent Parser (4) • For production E → T + E • Functions for non-terminal T bool T () { return term(OPEN) && E() && term(CLOSE); } bool E1() { return T() && term(PLUS) && E(); } 1 • For production E → T bool T2() { return term(INT) && term(TIMES) && T(); } bool T3() { return term(INT); } bool E2() { return T(); } • For all productions of E (with backtracking) bool T() { bool E() { int save = next; int save = next; return (next = save, T1()) return (next = save, E ()) 1 || (next = save, T2()) || (next = save, E ()); } 2 || (next = save, T3()); } 21 22 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 Recursive Descent Parsing. Notes. Recursive Descent Parsing. Notes. • To start the parser • Easy to implement by hand – Initialize next to point to first token – An example implementation is provided as a – Invoke E() supplement “Recursive Descent Parsing” • Notice how this simulates our backtracking example from lecture • Easy to implement by hand • But does not always work … • Predictive parsing is more efficient 23 24 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 4 Recursive-Descent Parsing When Recursive Descent Does Not Work • Parsing: given a string of tokens t1 t2 ... tn, find • Consider a production S → S a: its parse tree – In the process of parsing S we try the above rule • Recursive-descent parsing: Try all the – What goes wrong? productions exhaustively – At a given moment the fringe of the parse tree is: • A left-recursive grammar has a non-terminal S t t … t A … 1 2 k S →+ Sα for some α – Try all the productions for A: if A d BC is a production, the new fringe is t1 t2 … tk B C … – Backtrack when the fringe doesn’t match the string • Recursive descent does not work in such cases – Stop when there are no more non-terminals – It goes into an ∞ loop 25 26 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 Elimination of Left Recursion Elimination of Left-Recursion. Example • Consider the left-recursive grammar • Consider the grammar S → S α | β S d 1 | S 0 ( β = 1 and α = 0 ) •Sgenerates all strings starting with a β and can be rewritten as followed by a number of α S d 1 S’ S’ d 0 S’ | ε • Can rewrite using right-recursion S →βS’ S’ →αS’ | ε 27 28 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 More Elimination of Left-Recursion General Left Recursion • In general • The grammar S → A α | δ S → S α1 | … | S αn | β1 | … | βm • All strings derived from S start with one of A → S β is also left-recursive because β1,…,βm and continue with several instances of →+ βα α1,…,αn S S •Rewrite as S →β1 S’ | … | βm S’ • This left-recursion can also be eliminated S’ →α1 S’ | … | αn S’ | ε • See [ASU], Section 4.3 for general algorithm 29 30 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6 5 Summary of Recursive Descent • Simple and general parsing strategy – Left-recursion must be eliminated first – … but that can be done automatically • Unpopular because of backtracking – Thought to be too inefficient • In practice, backtracking is eliminated by restricting the grammar 31 Prof. Bodik CS 164 Lecture 6 6.

Building a Parser II CS164 3:30-5:00 TT 10 Evans Administrativia • PA2

A Grammar-Based Approach to Class Diagram Validation Faizan Javed Marjan Mernik Barrett R

Section 12.3 Context-Free Parsing We Know (Via a Theorem) That the Context-Free Languages Are Exactly Those Languages That Are Accepted by Pdas

Formal Grammar Specifications of User Interface Processes

Topics in Context-Free Grammar CFG's

15–212: Principles of Programming Some Notes on Grammars and Parsing

COMP 181 Z What Is the Tufts Mascot? “Jumbo” the Elephant

QUESTION BANK SOLUTION Unit 1 Introduction to Finite Automata

Xtext User Guide

Compiler Construction

On the Covering Problem for Left-Recursive

Generalised Recursive Descent Parsing and Follow-Determinism

The Modelcc Model-Based Parser Generator