<<

Administrativia

• PA2 assigned today –due in 12 days Building a Parser II • WA1 assigned today –due in a week – it’s a practice for the exam CS164 3:30-5:00 TT •First midterm 10 Evans –Oct 5 – will contain some project-inspired questions

Prof. Bodik CS 164 Lecture 6 1 2 Prof. Bodik CS 164 Lecture 6

Overview Grammars

• Grammars • Programming language constructs have recursive structure. • derivations – which is why our hand-written parser had this structure, too • Recursive descent parser • Eliminating •An expression is either: •number, or •variable, or • expression + expression, or • expression - expression, or • ( expression ), or •…

3 4 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

Context-free grammars (CFG) Example: arithmetic expressions

• a natural notation for this recursive structure • Simple arithmetic expressions: E d n | id | ( E ) | E + E | E * E • grammar for our balanced parens expressions: BalancedExpression d a | ( BalancedExpression ) • Some elements of this language: • describes (generates) strings of symbols: –id – a, (a), ((a)), (((a))), … –n • like regular expressions but can refer to –( n ) – other expressions (here, BalancedExpression) –n + id – and do this recursively (giving is “non-finite state”) –id * ( id + id )

5 6 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

1 Symbols: Terminals and Nonterminals Notational Conventions

• grammars use two kinds of symbols • In these lecture notes, let’s adopt a notation: •terminals: – no rules for replacing them – Non-terminals are written upper-case – once generated, terminals are permanent – these are tokens of our language – Terminals are written lower-case •nonterminals: or as symbols, e.g., token LPAR is written as ( – to be replaced (expanded) – in lingo, these serve as names of – The start symbol is the left-hand side of the first expressions production – start non-terminal: the first symbol to be expanded

7 8 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

Derivations Example: derivation

• This is how a grammar generates strings: Grammar: E d n | id | ( E ) | E + E | E * E – think of grammar rules (called productions) as rewrite rules •a derivation: • Derivation: the process of generating a string E rewrite E with ( E ) 1. begin with the start non-terminal ( E ) rewrite E with n ( n ) this is the final string of terminals 2. rewrite the non-terminal with some of its productions • another derivation (written more concisely): 3. select a non-terminal in your current string E d ( E ) d ( E * E ) d ( E + E * E ) d ( n + E * E ) d ( n + id * E ) i. if no non-terminal left, done. d ( n + id * id ) ii. otherwise go to step 2.

9 10 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

So how do derivations help us in parsing? Decaf Example

• A program (a string of tokens) has no syntax A fragment of Decaf: error if it can be derived from the grammar. – but so far you only know how to derive some (any) STMT d while ( EXPR ) STMT string, not how to check if a given string is | id ( EXPR ) ; derivable EXPR d EXPR + EXPR • So how to do parsing? | EXPR – EXPR – a naïve solution: derive all possible strings and | EXPR < EXPR check if your program is among them | ( EXPR ) – not as bad as it sounds: there are parsers that do | id this, kind of. Coming soon.

11 12 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

2 Decaf Example (Cont.) CFGs (definition)

Some elements of the (fragment of) language: •A CFG consists of – A set of terminal symbols T id ( id ) ; id ( ( ( ( id ) ) ) ) ; – A set of non-terminal symbols N while ( id < id ) id ( id ) ; – A start symbol S (a non-terminal) while ( while ( id ) ) id ( id ) ; – A set of productions: while ( id ) while ( id ) while ( id ) id ( id ) ; produtions are of two forms (X  N) Question: One of the STMT d while ( EXPR ) STMT X d ε , or strings is not from | id ( EXPR ) ; the language. EXPR d EXPR + EXPR | EXPR – EXPR X d Y1 Y2 ... Yn where Yi  N 4 T Which one? | EXPR < EXPR | ( EXPR ) | id

13 14 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

context-free grammars Now let’s parse a string

• what is “context-free”? • recursive descent parser derives all strings – means the grammar is not context-sensitive – until it matches derived string with the input string • context-sensitive gramars – or until it is sure there is a syntax error – can describe more languages than CFGs – because their productions restrict when a non- terminal can be rewritten. An example production: d N d d A B c – meaning: N can be rewritten into ABc only when preceded by d – can be used to encode semantic checks, but parsing is hard

15 16 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

Recursive Descent Parsing Recursive Descent Parsing. Example (Cont.)

• Consider the grammar •Try E0 → T1 + E2 E → T + E | T • Then try a rule for T1 → ( E3 ) T → int | int * T | ( E ) – But ( does not match input token int5 • Token stream is: int * int 5 2 •TryT1 → int . Token matches.

• Start with top-level non-terminal E – But + after T1 does not match input token *

•Try T1 → int * T2 • Try the rules for E in order – This will match but + after T1 will be unmatched

• Have exhausted the choices for T1

– Backtrack to choice for E0

17 18 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

3 Recursive Descent Parsing. Example (Cont.) A Recursive Descent Parser (2)

•Try E0 → T1 • Define boolean functions that check the token string for a match of • Follow same steps as before for T1 – A given token terminal – And succeed with T1 → int * T2 and T2 → int – With the following parse tree bool term(TOKEN tok) { return in[next++] == tok; } th E0 – A given production of S (the n )

bool Sn() { … } T 1 – Any production of S: bool S() { … } int5 * T2

int2 • These functions advance next 19 20 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

A Recursive Descent Parser (3) A Recursive Descent Parser (4)

• For production E → T + E • Functions for non-terminal T bool T () { return term(OPEN) && E() && term(CLOSE); } bool E1() { return T() && term(PLUS) && E(); } 1 • For production E → T bool T2() { return term(INT) && term(TIMES) && T(); } bool T3() { return term(INT); } bool E2() { return T(); }

• For all productions of E (with backtracking) bool T() { bool E() { int save = next; int save = next; return (next = save, T1()) return (next = save, E ()) 1 || (next = save, T2()) || (next = save, E ()); } 2 || (next = save, T3()); } 21 22 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

Recursive Descent Parsing. Notes. Recursive Descent Parsing. Notes.

• To start the parser • Easy to implement by hand – Initialize next to point to first token – An example implementation is provided as a – Invoke E() supplement “Recursive Descent Parsing” • Notice how this simulates our backtracking example from lecture • Easy to implement by hand • But does not always work … • Predictive parsing is more efficient

23 24 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

4 Recursive-Descent Parsing When Recursive Descent Does Not Work

• Parsing: given a string of tokens t1 t2 ... tn, find • Consider a production S → S a: its parse tree – In the process of parsing S we try the above rule • Recursive-descent parsing: Try all the – What goes wrong? productions exhaustively – At a given moment the fringe of the parse tree is: • A left- has a non-terminal S t t … t A … 1 2 k S →+ Sα for some α – Try all the productions for A: if A d BC is a

production, the new fringe is t1 t2 … tk B C … – Backtrack when the fringe doesn’t match the string • Recursive descent does not work in such cases – Stop when there are no more non-terminals – It goes into an ∞ loop

25 26 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

Elimination of Left Recursion Elimination of Left-Recursion. Example

• Consider the left-recursive grammar • Consider the grammar S → S α | β S d 1 | S 0 ( β = 1 and α = 0 )

•Sgenerates all strings starting with a β and can be rewritten as followed by a number of α S d 1 S’ S’ d 0 S’ | ε • Can rewrite using right-recursion S →βS’ S’ →αS’ | ε

27 28 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

More Elimination of Left-Recursion General Left Recursion

• In general • The grammar S → A α | δ S → S α1 | … | S αn | β1 | … | βm • All strings derived from S start with one of A → S β is also left-recursive because β1,…,βm and continue with several instances of →+ βα α1,…,αn S S •Rewrite as

S →β1 S’ | … | βm S’ • This left-recursion can also be eliminated S’ →α1 S’ | … | αn S’ | ε • See [ASU], Section 4.3 for general

29 30 Prof. Bodik CS 164 Lecture 6 Prof. Bodik CS 164 Lecture 6

5 Summary of Recursive Descent

• Simple and general parsing strategy – Left-recursion must be eliminated first – … but that can be done automatically • Unpopular because of backtracking – Thought to be too inefficient

• In practice, backtracking is eliminated by restricting the grammar

31 Prof. Bodik CS 164 Lecture 6

6