Administrativia
• PA2 assigned today –due in 12 days Building a Parser II • WA1 assigned today –due in a week – it’s a practice for the exam CS164 3:30-5:00 TT •First midterm 10 Evans –Oct 5 – will contain some project-inspired questions
Prof. Bodik CS 164 Lecture 6 1 2 Prof. Bodik CS 164 Lecture 6
Overview Grammars
• Grammars • Programming language constructs have recursive structure. • derivations – which is why our hand-written parser had this structure, too • Recursive descent parser • Eliminating left recursion •An expression is either: •number, or •variable, or • expression + expression, or • expression - expression, or • ( expression ), or •…
3 4 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
Context-free grammars (CFG) Example: arithmetic expressions
• a natural notation for this recursive structure • Simple arithmetic expressions: E d n | id | ( E ) | E + E | E * E • grammar for our balanced parens expressions: BalancedExpression d a | ( BalancedExpression ) • Some elements of this language: • describes (generates) strings of symbols: –id – a, (a), ((a)), (((a))), … –n • like regular expressions but can refer to –( n ) – other expressions (here, BalancedExpression) –n + id – and do this recursively (giving is “non-finite state”) –id * ( id + id )
5 6 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
1 Symbols: Terminals and Nonterminals Notational Conventions
• grammars use two kinds of symbols • In these lecture notes, let’s adopt a notation: •terminals: – no rules for replacing them – Non-terminals are written upper-case – once generated, terminals are permanent – these are tokens of our language – Terminals are written lower-case •nonterminals: or as symbols, e.g., token LPAR is written as ( – to be replaced (expanded) – in regular expression lingo, these serve as names of – The start symbol is the left-hand side of the first expressions production – start non-terminal: the first symbol to be expanded
7 8 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
Derivations Example: derivation
• This is how a grammar generates strings: Grammar: E d n | id | ( E ) | E + E | E * E – think of grammar rules (called productions) as rewrite rules •a derivation: • Derivation: the process of generating a string E rewrite E with ( E ) 1. begin with the start non-terminal ( E ) rewrite E with n ( n ) this is the final string of terminals 2. rewrite the non-terminal with some of its productions • another derivation (written more concisely): 3. select a non-terminal in your current string E d ( E ) d ( E * E ) d ( E + E * E ) d ( n + E * E ) d ( n + id * E ) i. if no non-terminal left, done. d ( n + id * id ) ii. otherwise go to step 2.
9 10 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
So how do derivations help us in parsing? Decaf Example
• A program (a string of tokens) has no syntax A fragment of Decaf: error if it can be derived from the grammar. – but so far you only know how to derive some (any) STMT d while ( EXPR ) STMT string, not how to check if a given string is | id ( EXPR ) ; derivable EXPR d EXPR + EXPR • So how to do parsing? | EXPR – EXPR – a naïve solution: derive all possible strings and | EXPR < EXPR check if your program is among them | ( EXPR ) – not as bad as it sounds: there are parsers that do | id this, kind of. Coming soon.
11 12 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
2 Decaf Example (Cont.) CFGs (definition)
Some elements of the (fragment of) language: •A CFG consists of – A set of terminal symbols T id ( id ) ; id ( ( ( ( id ) ) ) ) ; – A set of non-terminal symbols N while ( id < id ) id ( id ) ; – A start symbol S (a non-terminal) while ( while ( id ) ) id ( id ) ; – A set of productions: while ( id ) while ( id ) while ( id ) id ( id ) ; produtions are of two forms (X N) Question: One of the STMT d while ( EXPR ) STMT X d ε , or strings is not from | id ( EXPR ) ; the language. EXPR d EXPR + EXPR | EXPR – EXPR X d Y1 Y2 ... Yn where Yi N 4 T Which one? | EXPR < EXPR | ( EXPR ) | id
13 14 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
context-free grammars Now let’s parse a string
• what is “context-free”? • recursive descent parser derives all strings – means the grammar is not context-sensitive – until it matches derived string with the input string • context-sensitive gramars – or until it is sure there is a syntax error – can describe more languages than CFGs – because their productions restrict when a non- terminal can be rewritten. An example production: d N d d A B c – meaning: N can be rewritten into ABc only when preceded by d – can be used to encode semantic checks, but parsing is hard
15 16 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
Recursive Descent Parsing Recursive Descent Parsing. Example (Cont.)
• Consider the grammar •Try E0 → T1 + E2 E → T + E | T • Then try a rule for T1 → ( E3 ) T → int | int * T | ( E ) – But ( does not match input token int5 • Token stream is: int * int 5 2 •TryT1 → int . Token matches.
• Start with top-level non-terminal E – But + after T1 does not match input token *
•Try T1 → int * T2 • Try the rules for E in order – This will match but + after T1 will be unmatched
• Have exhausted the choices for T1
– Backtrack to choice for E0
17 18 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
3 Recursive Descent Parsing. Example (Cont.) A Recursive Descent Parser (2)
•Try E0 → T1 • Define boolean functions that check the token string for a match of • Follow same steps as before for T1 – A given token terminal – And succeed with T1 → int * T2 and T2 → int – With the following parse tree bool term(TOKEN tok) { return in[next++] == tok; } th E0 – A given production of S (the n )
bool Sn() { … } T 1 – Any production of S: bool S() { … } int5 * T2
int2 • These functions advance next 19 20 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
A Recursive Descent Parser (3) A Recursive Descent Parser (4)
• For production E → T + E • Functions for non-terminal T bool T () { return term(OPEN) && E() && term(CLOSE); } bool E1() { return T() && term(PLUS) && E(); } 1 • For production E → T bool T2() { return term(INT) && term(TIMES) && T(); } bool T3() { return term(INT); } bool E2() { return T(); }
• For all productions of E (with backtracking) bool T() { bool E() { int save = next; int save = next; return (next = save, T1()) return (next = save, E ()) 1 || (next = save, T2()) || (next = save, E ()); } 2 || (next = save, T3()); } 21 22 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
Recursive Descent Parsing. Notes. Recursive Descent Parsing. Notes.
• To start the parser • Easy to implement by hand – Initialize next to point to first token – An example implementation is provided as a – Invoke E() supplement “Recursive Descent Parsing” • Notice how this simulates our backtracking example from lecture • Easy to implement by hand • But does not always work … • Predictive parsing is more efficient
23 24 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
4 Recursive-Descent Parsing When Recursive Descent Does Not Work
• Parsing: given a string of tokens t1 t2 ... tn, find • Consider a production S → S a: its parse tree – In the process of parsing S we try the above rule • Recursive-descent parsing: Try all the – What goes wrong? productions exhaustively – At a given moment the fringe of the parse tree is: • A left-recursive grammar has a non-terminal S t t … t A … 1 2 k S →+ Sα for some α – Try all the productions for A: if A d BC is a
production, the new fringe is t1 t2 … tk B C … – Backtrack when the fringe doesn’t match the string • Recursive descent does not work in such cases – Stop when there are no more non-terminals – It goes into an ∞ loop
25 26 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
Elimination of Left Recursion Elimination of Left-Recursion. Example
• Consider the left-recursive grammar • Consider the grammar S → S α | β S d 1 | S 0 ( β = 1 and α = 0 )
•Sgenerates all strings starting with a β and can be rewritten as followed by a number of α S d 1 S’ S’ d 0 S’ | ε • Can rewrite using right-recursion S →βS’ S’ →αS’ | ε
27 28 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
More Elimination of Left-Recursion General Left Recursion
• In general • The grammar S → A α | δ S → S α1 | … | S αn | β1 | … | βm • All strings derived from S start with one of A → S β is also left-recursive because β1,…,βm and continue with several instances of →+ βα α1,…,αn S S •Rewrite as
S →β1 S’ | … | βm S’ • This left-recursion can also be eliminated S’ →α1 S’ | … | αn S’ | ε • See [ASU], Section 4.3 for general algorithm
29 30 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6
5 Summary of Recursive Descent
• Simple and general parsing strategy – Left-recursion must be eliminated first – … but that can be done automatically • Unpopular because of backtracking – Thought to be too inefficient
• In practice, backtracking is eliminated by restricting the grammar
31 Prof. Bodik CS 164 Lecture 6
6