Acknowledgement

CS3300 - Compiler Design: Parsing

These slides borrow liberal portions of text verbatim from Antony L. Hosking @ Purdue, Jens Palsberg @ UCLA, and the Dragon book.

V. Krishna Nandivada

IIT Madras

Copyright 2019 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].


V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 2 / 98

The role of the parser

source code → scanner → tokens → parser → IR (errors reported along the way)

A parser
- performs context-free syntax analysis
- guides context-sensitive analysis
- constructs an intermediate representation
- produces meaningful error messages
- attempts error correction

For the next several classes, we will look at parser construction.

Syntax analysis by using a CFG

Context-free syntax is specified with a context-free grammar. Formally, a CFG G is a 4-tuple (Vt, Vn, S, P), where:
- Vt is the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.
- Vn, the nonterminals, is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.
- S is a distinguished nonterminal (S ∈ Vn) denoting the entire set of strings in L(G). This is sometimes called the goal symbol.
- P is a finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left-hand side.

The set V = Vt ∪ Vn is called the vocabulary of G.
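The 4-tuple (Vt, Vn, S, P) maps directly onto data. A minimal Python sketch, using the expression grammar that appears on the next slides; the encoding (sets plus a dict of tuple bodies) is our own, not part of the slides:

```python
# A CFG as a 4-tuple (Vt, Vn, S, P). Productions map each
# non-terminal to a list of right-hand sides (tuples of symbols).
Vt = {"num", "id", "+", "-", "*", "/"}
Vn = {"goal", "expr", "op"}
S = "goal"
P = {
    "goal": [("expr",)],
    "expr": [("expr", "op", "expr"), ("num",), ("id",)],
    "op":   [("+",), ("-",), ("*",), ("/",)],
}

# Sanity checks: terminals and non-terminals are disjoint, every
# production has a single non-terminal on its left-hand side, and
# bodies use only vocabulary symbols.
assert Vt & Vn == set()
assert S in Vn
for lhs, bodies in P.items():
    assert lhs in Vn
    for body in bodies:
        assert all(sym in Vt | Vn for sym in body)
print("grammar is well-formed")
```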

Notation and terminology

- a, b, c, ... ∈ Vt
- A, B, C, ... ∈ Vn
- U, V, W, ... ∈ V
- α, β, γ, ... ∈ V∗
- u, v, w, ... ∈ Vt∗

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ.
Similarly, ⇒∗ and ⇒+ denote derivations of ≥ 0 and ≥ 1 steps.

If S ⇒∗ β then β is said to be a sentential form of G.

L(G) = {w ∈ Vt∗ | S ⇒+ w}; w ∈ L(G) is called a sentence of G.

Note, L(G) = {β ∈ V∗ | S ⇒∗ β} ∩ Vt∗.

Syntax analysis

Grammars are often written in Backus-Naur form (BNF). Example:

1 ⟨goal⟩ ::= ⟨expr⟩
2 ⟨expr⟩ ::= ⟨expr⟩⟨op⟩⟨expr⟩
3   | num
4   | id
5 ⟨op⟩ ::= +
6   | −
7   | ∗
8   | /

This describes simple expressions over numbers and identifiers.

In a BNF for a grammar, we represent:
1 non-terminals with angle brackets or capital letters
2 terminals with typewriter font or underline
3 productions as in the example


Derivations

We can view the productions of a CFG as rewriting rules. At each step, we choose a non-terminal to replace. This choice can lead to different derivations. Two are of particular interest:
- leftmost derivation: the leftmost non-terminal is replaced at each step
- rightmost derivation: the rightmost non-terminal is replaced at each step

Using our example CFG (for x + 2 ∗ y):

⟨goal⟩ ⇒ ⟨expr⟩
  ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
  ⇒ ⟨id,x⟩⟨op⟩⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨expr⟩⟨op⟩⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩⟨op⟩⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨expr⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

We have derived the sentence x + 2 ∗ y. We denote this ⟨goal⟩ ⇒∗ id + num ∗ id. Such a sequence of rewrites is a derivation or a parse. The previous example was a leftmost derivation. The process of discovering a derivation is called parsing.

Rightmost derivation

For the string x + 2 ∗ y:

⟨goal⟩ ⇒ ⟨expr⟩
  ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
  ⇒ ⟨expr⟩⟨op⟩⟨id,y⟩
  ⇒ ⟨expr⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩⟨op⟩⟨expr⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩⟨op⟩⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id.

Precedence

[Figure: the parse tree for this derivation; the + sits below the ∗, so the addition is grouped first.]

Treewalk evaluation computes (x + 2) ∗ y, the "wrong" answer! Should be x + (2 ∗ y).


Precedence

These two derivations point out a problem with the grammar. It has no notion of precedence, or implied order of evaluation. To add precedence takes additional machinery:

1 ⟨goal⟩ ::= ⟨expr⟩
2 ⟨expr⟩ ::= ⟨expr⟩ + ⟨term⟩
3   | ⟨expr⟩ − ⟨term⟩
4   | ⟨term⟩
5 ⟨term⟩ ::= ⟨term⟩ ∗ ⟨factor⟩
6   | ⟨term⟩/⟨factor⟩
7   | ⟨factor⟩
8 ⟨factor⟩ ::= num
9   | id

This grammar enforces a precedence on the derivation:
- terms must be derived from expressions
- forces the "correct" tree

Precedence

Now, for the string x + 2 ∗ y:

⟨goal⟩ ⇒ ⟨expr⟩
  ⇒ ⟨expr⟩ + ⟨term⟩
  ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨factor⟩
  ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩ + ⟨factor⟩ ∗ ⟨id,y⟩
  ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨term⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨factor⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
  ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id, but this time, we build the desired tree.

Precedence

[Figure: the parse tree for x + 2 ∗ y under the precedence grammar; ∗ now sits below +, grouping 2 ∗ y first.]

Treewalk evaluation now computes x + (2 ∗ y).

Ambiguity

If a grammar has more than one derivation for a single sentential form, then it is ambiguous.

Example:
⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
  | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
  | other stmts

Consider deriving the sentential form:

if E1 then if E2 then S1 else S2

It has two derivations. This ambiguity is purely grammatical. It is a context-free ambiguity.


Ambiguity

May be able to eliminate ambiguity by rearranging the grammar:

⟨stmt⟩ ::= ⟨matched⟩
  | ⟨unmatched⟩
⟨matched⟩ ::= if ⟨expr⟩ then ⟨matched⟩ else ⟨matched⟩
  | other stmts
⟨unmatched⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
  | if ⟨expr⟩ then ⟨matched⟩ else ⟨unmatched⟩

This generates the same language as the ambiguous grammar, but applies the common sense rule: match each else with the closest unmatched then. This is most likely the language designer's intent.

Ambiguity

Ambiguity is often due to confusion in the context-free specification. Context-sensitive confusions can arise from overloading.

Example: a = f(17)

In many Algol/Scala-like languages, f could be a function or a subscripted variable. Disambiguating this statement requires context:
- need values of declarations
- not context-free
- really an issue of type

Rather than complicate parsing, we will handle this separately.


Scanning vs. parsing

Where do we draw the line?

term ::= [a-zA-Z]([a-zA-Z] | [0-9])∗
  | 0 | [1-9][0-9]∗
op ::= + | − | ∗ | /
expr ::= (term op)∗ term

Regular expressions are used to classify:
- identifiers, numbers, keywords
- REs are more concise and simpler for tokens than a grammar
- more efficient scanners can be built from REs (DFAs) than from grammars

Context-free grammars are used to count:
- brackets: (), begin...end, if...then...else
- imparting structure: expressions

Syntactic analysis is complicated enough: the grammar for C has around 200 productions. Factoring lexical analysis out as a separate phase makes the compiler more manageable.

Parsing: the big picture

[Diagram: a grammar is fed to a parser generator, which produces a parser; at compile time the parser consumes tokens and produces IR.]

Our goal is a flexible parser generator system.


Different ways of parsing: top-down vs. bottom-up

Top-down parsers
- start at the root of the derivation tree and fill in
- pick a production and try to match the input
- may require backtracking
- some grammars are backtrack-free (predictive)

Bottom-up parsers
- start at the leaves and fill in
- start in a state valid for legal first tokens
- as input is consumed, change state to encode possibilities (recognize valid prefixes)
- use a stack to store both state and sentential forms

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar. To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string:
1 At a node labelled A, select a production A → α and construct the appropriate child for each symbol of α
2 When a terminal is added to the fringe that doesn't match the input string, backtrack
3 Find the next node to be expanded (must have a label in Vn)

The key is selecting the right production in step 1. If the parser makes a wrong step, the "derivation" process may not terminate. Why is this bad?


Left-recursion

Top-down parsers cannot handle left-recursion in a grammar. Formally, a grammar is left-recursive if

∃A ∈ Vn such that A ⇒+ Aα for some string α

Our simple expression grammar is left-recursive.

Eliminating left-recursion

To remove left-recursion, we can transform the grammar. Consider the grammar fragment:

⟨foo⟩ ::= ⟨foo⟩α
  | β

where α and β do not start with ⟨foo⟩. We can rewrite this as:

⟨foo⟩ ::= β⟨bar⟩
⟨bar⟩ ::= α⟨bar⟩
  | ε

where ⟨bar⟩ is a new non-terminal. This fragment contains no left-recursion.
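The ⟨foo⟩/⟨bar⟩ rewrite is mechanical enough to code directly. A hedged Python sketch of immediate left-recursion elimination; the grammar encoding (bodies as tuples, `()` for ε) and the helper name are our own:

```python
# Rewrite A ::= A a1 | ... | b1 | ...  as
#         A ::= b1 A' | ...   and   A' ::= a1 A' | ... | epsilon
EPS = ()   # the empty body plays the role of epsilon

def eliminate_immediate_left_recursion(grammar, A):
    """Return a new grammar with immediate left-recursion on A removed."""
    recursive = [body[1:] for body in grammar[A] if body[:1] == (A,)]
    others    = [body for body in grammar[A] if body[:1] != (A,)]
    if not recursive:
        return grammar                  # nothing to do
    A2 = A + "'"                        # the fresh non-terminal <bar>
    out = dict(grammar)
    out[A]  = [beta + (A2,) for beta in others]
    out[A2] = [alpha + (A2,) for alpha in recursive] + [EPS]
    return out

g = {"foo": [("foo", "a"), ("b",)]}
g2 = eliminate_immediate_left_recursion(g, "foo")
print(g2)   # {'foo': [('b', "foo'")], "foo'": [('a', "foo'"), ()]}
```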



How much lookahead is needed?

We saw that top-down parsers may need to backtrack when they select the wrong production. Do we need arbitrary lookahead to parse CFGs?
- in general, yes: use the Earley or Cocke-Younger-Kasami algorithms

Fortunately,
- large subclasses of CFGs can be parsed with limited lookahead
- most programming language constructs can be expressed in a grammar that falls in these subclasses

Among the interesting subclasses are:
- LL(1): left to right scan, left-most derivation, 1-token lookahead; and
- LR(1): left to right scan, reversed right-most derivation, 1-token lookahead

Predictive parsing

Basic idea: for any two productions A → α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α ∈ G, define FIRST(α) as the set of tokens that appear first in some string derived from α. That is, for some w ∈ Vt∗, w ∈ FIRST(α) iff α ⇒∗ wγ.

Key property: whenever two productions A → α and A → β both appear in the grammar, we would like

FIRST(α) ∩ FIRST(β) = φ

This would allow the parser to make a correct choice with a lookahead of only one symbol!


Left factoring

What if a grammar does not have this property? Sometimes, we can transform a grammar to have this property.

To left factor: for each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε, then replace all of the A productions

A → αβ1 | αβ2 | ··· | αβn

with

A → αA′
A′ → β1 | β2 | ··· | βn

where A′ is a new non-terminal. Repeat until no two alternatives for a single non-terminal have a common prefix.

Example

There are two non-terminals to left factor:

⟨expr⟩ ::= ⟨term⟩ + ⟨expr⟩
  | ⟨term⟩ − ⟨expr⟩
  | ⟨term⟩

⟨term⟩ ::= ⟨factor⟩ ∗ ⟨term⟩
  | ⟨factor⟩/⟨term⟩
  | ⟨factor⟩

Applying the transformation:

⟨expr⟩ ::= ⟨term⟩⟨expr′⟩
⟨expr′⟩ ::= +⟨expr⟩
  | −⟨expr⟩
  | ε

⟨term⟩ ::= ⟨factor⟩⟨term′⟩
⟨term′⟩ ::= ∗⟨term⟩
  | /⟨term⟩
  | ε
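One round of this transformation can be sketched in Python. This is an illustrative encoding, not from the slides: bodies are tuples of symbols, `()` stands for ε, and the helper names are our own:

```python
def common_prefix(a, b):
    """Longest common prefix of two tuples of symbols."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def left_factor_once(grammar, A):
    """One left-factoring step for non-terminal A: pull out the longest
    prefix alpha shared by two or more of its alternatives."""
    bodies = grammar[A]
    best = ()
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            p = common_prefix(bodies[i], bodies[j])
            if len(p) > len(best):
                best = p
    if not best:
        return grammar                      # nothing to factor
    A2 = A + "'"                            # fresh non-terminal A'
    out = dict(grammar)
    out[A]  = [best + (A2,)] + [b for b in bodies if b[:len(best)] != best]
    out[A2] = [b[len(best):] for b in bodies if b[:len(best)] == best]
    return out

g = {"expr": [("term", "+", "expr"), ("term", "-", "expr"), ("term",)]}
g2 = left_factor_once(g, "expr")
print(g2["expr"], g2["expr'"])
```

Running this on the ⟨expr⟩ alternatives above yields ⟨expr⟩ ::= ⟨term⟩⟨expr′⟩ with ⟨expr′⟩ ::= +⟨expr⟩ | −⟨expr⟩ | ε, matching the slide.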



Indirect left-recursion elimination

Given a left-factored CFG, to eliminate left-recursion:

Input: Grammar G with no cycles and no ε productions.
Output: Equivalent grammar with no left-recursion.
begin
  Arrange the non-terminals in some order A1, A2, ··· An;
  foreach i = 1···n do
    foreach j = 1···i − 1 do
      Say the i-th production is: Ai → Aj γ,
      and Aj → δ1 | δ2 | ··· | δk;
      Replace the i-th production by:
        Ai → δ1γ | δ2γ | ··· | δkγ;
    Eliminate immediate left-recursion in Ai;

Generality

Question: By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token lookahead?

Answer: Given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:

{aⁿ 0 bⁿ | n ≥ 1} ∪ {aⁿ 1 b²ⁿ | n ≥ 1}

Must look past an arbitrary number of a's to discover the 0 or the 1 and so determine the derivation.


Recursive descent parsing

int A()
begin
  foreach production of the form A → X1X2X3···Xk do
    for i = 1 to k do
      if Xi is a non-terminal then
        if (Xi() ≠ 0) then
          backtrack; break;   // Try the next production
      else if Xi matches the current input symbol a then
        advance the input to the next symbol;
      else
        backtrack; break;     // Try the next production
    if i == k + 1 then
      return 0;               // Success
  return 1;                   // Failure

Notes:
- Backtracks in general; in practice it may not do much.
- How to backtrack?
- A left-recursive grammar will lead to an infinite loop.
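For an LL(1) grammar the backtracking above disappears: each procedure can commit after one token of lookahead. A Python sketch for the left-factored expression grammar from the earlier slide; token kinds "num"/"id" are assumed to be pre-classified by a scanner, and the class design is our own:

```python
# Recursive-descent recognizer for the left-factored grammar:
#   expr   ::= term expr'      expr'  ::= + expr | - expr | eps
#   term   ::= factor term'    term'  ::= * term | / term | eps
#   factor ::= num | id
# No backtracking is needed because the grammar is LL(1).

class Parser:
    def __init__(self, tokens):
        self.toks = tokens + ["$"]      # "$" marks end of input
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def eat(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, got {self.peek()}")
        self.pos += 1

    def expr(self):                     # term expr'
        self.term()
        if self.peek() in ("+", "-"):   # expr' chosen by one-token lookahead
            self.eat(self.peek())
            self.expr()

    def term(self):                     # factor term'
        self.factor()
        if self.peek() in ("*", "/"):
            self.eat(self.peek())
            self.term()

    def factor(self):
        if self.peek() in ("num", "id"):
            self.eat(self.peek())
        else:
            raise SyntaxError(f"unexpected {self.peek()}")

    def parse(self):
        self.expr()
        self.eat("$")
        return True

print(Parser(["id", "+", "num", "*", "id"]).parse())   # True
```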



Non-recursive predictive parsing

Now, a predictive parser looks like:

[Diagram: source code → scanner → tokens → table-driven parser → IR; the parser consults parsing tables and a stack.]

Rather than writing recursive code, we build tables. This also uses a stack, but mainly to remember part of the input string; there is no recursion.

Table-driven parsers

A parser generator system often looks like:

[Diagram: grammar → parser generator → parsing tables; source code → scanner → tokens → table-driven parser → IR.]

This is true for both top-down (LL) and bottom-up (LR) parsers. Why tables? Building tables can be automated easily.

FIRST

For a string of grammar symbols α, define FIRST(α) as:
- the set of terminals that begin strings derived from α: {a ∈ Vt | α ⇒∗ aβ}
- if α ⇒∗ ε then ε ∈ FIRST(α)

FIRST(α) contains the tokens valid in the initial position in α.

To build FIRST(X):
1 If X ∈ Vt then FIRST(X) is {X}
2 If X → ε then add ε to FIRST(X)
3 If X → Y1Y2···Yk:
  1 Put FIRST(Y1) − {ε} in FIRST(X)
  2 ∀i : 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ ··· ∩ FIRST(Yi−1) (i.e., Y1···Yi−1 ⇒∗ ε), then put FIRST(Yi) − {ε} in FIRST(X)
  3 If ε ∈ FIRST(Y1) ∩ ··· ∩ FIRST(Yk) then put ε in FIRST(X)

Repeat until no more additions can be made.

FOLLOW

For a non-terminal A, define FOLLOW(A) as:
- the set of terminals that can appear immediately to the right of A in some sentential form

Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it. A terminal symbol has no FOLLOW set.

To build FOLLOW(A):
1 Put $ in FOLLOW(⟨goal⟩)
2 If A → αBβ:
  1 Put FIRST(β) − {ε} in FOLLOW(B)
  2 If β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒∗ ε), then put FOLLOW(A) in FOLLOW(B)

Repeat until no more additions can be made.
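Both constructions are fixed-point iterations and can be coded directly. A Python sketch with our own encoding (bodies as tuples, the string "eps" for ε, "$" for end of input), run on the left-factored expression grammar from the earlier slides:

```python
EPS = "eps"

def first_of(symbols, FIRST, Vt):
    """FIRST of a string alpha = X1 X2 ... Xk (rule 3 above)."""
    out = set()
    for X in symbols:
        fx = {X} if X in Vt else FIRST[X]
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)                 # every Xi can derive epsilon
    return out

def first_follow(Vt, start, P):
    FIRST = {A: set() for A in P}
    changed = True
    while changed:               # iterate to a fixed point
        changed = False
        for A, bodies in P.items():
            for body in bodies:
                f = first_of(body, FIRST, Vt)
                if not f <= FIRST[A]:
                    FIRST[A] |= f; changed = True
    FOLLOW = {A: set() for A in P}
    FOLLOW[start].add("$")
    changed = True
    while changed:
        changed = False
        for A, bodies in P.items():
            for i_pos, B in [(i, s) for body in bodies
                             for i, s in enumerate(body) if s in P
                             for body in [body]]:
                pass             # (expanded below for clarity)
        for A, bodies in P.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B in Vt:
                        continue
                    f = first_of(body[i + 1:], FIRST, Vt)
                    add = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add; changed = True
    return FIRST, FOLLOW

Vt = {"+", "-", "*", "/", "num", "id"}
P = {"S": [("E",)],
     "E": [("T", "E'")],
     "E'": [("+", "E"), ("-", "E"), ()],
     "T": [("F", "T'")],
     "T'": [("*", "T"), ("/", "T"), ()],
     "F": [("num",), ("id",)]}
FIRST, FOLLOW = first_follow(Vt, "S", P)
print(sorted(FOLLOW["F"]))   # ['$', '*', '+', '-', '/']
```

The computed sets match the FIRST/FOLLOW table on the next slide.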



LL(1) grammars

Previous definition: A grammar G is LL(1) iff for all non-terminals A, each distinct pair of productions A → β and A → γ satisfies the condition FIRST(β) ∩ FIRST(γ) = φ.

What if A ⇒∗ ε?

Revised definition: A grammar G is LL(1) iff for each set of productions A → α1 | α2 | ··· | αn:
1 FIRST(α1), FIRST(α2), ..., FIRST(αn) are all pairwise disjoint
2 If αi ⇒∗ ε then FIRST(αj) ∩ FOLLOW(A) = φ, ∀1 ≤ j ≤ n, j ≠ i.

If G is ε-free, condition 1 is sufficient.

LL(1) grammars

Provable facts about LL(1) grammars:
1 No left-recursive grammar is LL(1)
2 No ambiguous grammar is LL(1)
3 Some languages have no LL(1) grammar
4 An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example:
S → aS | a is not LL(1) because FIRST(aS) = FIRST(a) = {a}

S → aS′
S′ → aS′ | ε

accepts the same language and is LL(1).


LL(1) parse table construction

Input: Grammar G
Output: Parsing table M
Method:
1 ∀ productions A → α:
  1 ∀a ∈ FIRST(α), add A → α to M[A,a]
  2 If ε ∈ FIRST(α):
    1 ∀b ∈ FOLLOW(A), add A → α to M[A,b]
    2 If $ ∈ FOLLOW(A), then add A → α to M[A,$]
2 Set each undefined entry of M to error.

If ∃M[A,a] with multiple entries, then the grammar is not LL(1).
Note: recall a, b ∈ Vt, so a, b ≠ ε.

Example

Our long-suffering expression grammar:

1. S → E         6. T → FT′
2. E → TE′       7. T′ → ∗T
3. E′ → +E       8.    | /T
4.    | −E       9.    | ε
5.    | ε        10. F → num
                 11.    | id

FIRST/FOLLOW sets and the resulting parse table (entries are production numbers):

      FIRST           FOLLOW          | id  num  +  −  ∗  /  $
S     num, id         $               |  1   1   −  −  −  −  −
E     num, id         $               |  2   2   −  −  −  −  −
E′    ε, +, −         $               |  −   −   3  4  −  −  5
T     num, id         +, −, $         |  6   6   −  −  −  −  −
T′    ε, ∗, /         +, −, $         |  −   −   9  9  7  8  9
F     num, id         +, −, ∗, /, $   | 11  10   −  −  −  −  −

(Each terminal a has FIRST(a) = {a} and no FOLLOW set.)
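The method above, run over the slide's FIRST/FOLLOW sets (hard-coded here so the sketch stays short), reproduces the table entries. A Python sketch; the dict encoding is our own:

```python
# Build the LL(1) table M from FIRST/FOLLOW. Productions are
# numbered 1..11 as on the slide; "eps" marks epsilon.
EPS = "eps"
prods = {
    1: ("S", ("E",)),        2: ("E", ("T", "E'")),
    3: ("E'", ("+", "E")),   4: ("E'", ("-", "E")),
    5: ("E'", ()),           6: ("T", ("F", "T'")),
    7: ("T'", ("*", "T")),   8: ("T'", ("/", "T")),
    9: ("T'", ()),           10: ("F", ("num",)),
    11: ("F", ("id",)),
}
FIRST = {("E",): {"num", "id"}, ("T", "E'"): {"num", "id"},
         ("+", "E"): {"+"}, ("-", "E"): {"-"}, (): {EPS},
         ("F", "T'"): {"num", "id"}, ("*", "T"): {"*"},
         ("/", "T"): {"/"}, ("num",): {"num"}, ("id",): {"id"}}
FOLLOW = {"S": {"$"}, "E": {"$"}, "E'": {"$"},
          "T": {"+", "-", "$"}, "T'": {"+", "-", "$"},
          "F": {"+", "-", "*", "/", "$"}}

M = {}
for n, (A, alpha) in prods.items():
    f = FIRST[alpha]
    for a in f - {EPS}:                  # rule 1.1
        assert (A, a) not in M, "grammar is not LL(1)"
        M[A, a] = n
    if EPS in f:                         # rules 1.2.1 and 1.2.2
        for b in FOLLOW[A]:              # FOLLOW(A) already contains $ if needed
            assert (A, b) not in M, "grammar is not LL(1)"
            M[A, b] = n

print(M["E'", "+"], M["T'", "*"], M["F", "id"])   # 3 7 11
```

Undefined entries (the dashes in the slide's table) are simply absent from `M`, which plays the role of "error".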


Table-driven predictive parsing

Input: A string w and a parsing table M for a grammar G
Output: If w is in L(G), a leftmost derivation of w; otherwise, indicate an error

push $ onto the stack; push S onto the stack;
let a point to the first input symbol;
X = stack.top();
while X ≠ $ do
  if X = a then
    stack.pop(); advance a;
  else if X is a terminal then
    error();
  else if M[X,a] is an error entry then
    error();
  else if M[X,a] = X → Y1Y2···Yk then
    output the production X → Y1Y2···Yk;
    stack.pop();
    push Yk, Yk−1, ···, Y1 in that order;
  X = stack.top();

A grammar that is not LL(1)

⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
  | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
  | ...

Left-factored:

⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩ ⟨stmt′⟩ | ...
⟨stmt′⟩ ::= else ⟨stmt⟩ | ε

Now, FIRST(⟨stmt′⟩) = {ε, else}.
Also, FOLLOW(⟨stmt′⟩) = {else, $}.
But FIRST(⟨stmt′⟩) ∩ FOLLOW(⟨stmt′⟩) = {else} ≠ φ.

On seeing else, there is a conflict between choosing ⟨stmt′⟩ ::= else ⟨stmt⟩ and ⟨stmt′⟩ ::= ε ⇒ the grammar is not LL(1)!

The fix: put priority on ⟨stmt′⟩ ::= else ⟨stmt⟩ to associate else with the closest previous then.

Here is a typical example where a programming language fails to An error is detected when the terminal on top of the stack does be LL(1): not match the next input symbol or M[A,a] = error. stmt → asginment | call | other Panic mode error recovery assignment → id := exp Skip input symbols till a “synchronizing” token appears. call → id (exp-list) Q: How to identify a synchronizing token? Some heuristics: This grammar is not in a form that can be left factored. We must first replace assignment and call by the right-hand sides of their All symbols in FOLLOW(A) in the synchronizing set for the defining productions: non-terminal A. statement → id := exp | id( exp-list ) | other Semicolon after a Stmt production: assgignmentStmt; assignmentStmt; We left factor: If a terminal on top of the stack cannot be matched? – statement → id stmt’ | other pop the terminal. stmt’ → := exp | (exp-list) issue a message that the terminal was inserted. Q: How about error messages? See how the grammar obscures the language . * * V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 41 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 42 / 98

Some definitions Bottom-up parsing

Some definitions

Recall: for a grammar G, with start symbol S, any string α such that S ⇒∗ α is called a sentential form.
- If α ∈ Vt∗, then α is called a sentence in L(G)
- Otherwise it is just a sentential form (not a sentence in L(G))

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence. A right-sentential form is a sentential form that occurs in the rightmost derivation of some sentence.

An unambiguous grammar will have a unique leftmost/rightmost derivation.

Bottom-up parsing

Goal: given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.


Reductions vs. derivations

Reduction: at each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of that production.

Key decisions:
- When to reduce?
- What production rule to apply?

Recall: in a derivation, a non-terminal in a sentential form is replaced by the body of one of its productions. A reduction is the reverse of a step in derivation. Bottom-up parsing is the process of "reducing" a string w to the start symbol. The goal of bottom-up parsing: build the derivation tree in reverse.

Example

Consider the grammar

1 S → aABe
2 A → Abc
3   | b
4 B → d

and the input string abbcde.

Prod'n.   Sentential Form
3         a b bcde
2         a Abc de
4         aA d e
1         aABe
–         S

The trick appears to be scanning the input and finding valid sentential forms.


Handles

Informally, a "handle" is:
- a substring that matches the body of a production (not necessarily the first one),
- and reducing this handle represents one step of reduction (or reverse rightmost derivation).

[Figure: the parse tree for αβw, with the handle A → β sitting above the substring β and w to its right.]

Handles

Theorem: If G is unambiguous, then every right-sentential form has a unique handle.

Proof: (by definition)
1 G is unambiguous ⇒ the rightmost derivation is unique
2 ⇒ a unique production A → β applied to take γi−1 to γi
3 ⇒ a unique position k at which A → β is applied
4 ⇒ a unique handle A → β

Example

The left-recursive expression grammar (original form):

1 ⟨goal⟩ ::= ⟨expr⟩
2 ⟨expr⟩ ::= ⟨expr⟩ + ⟨term⟩
3   | ⟨expr⟩ − ⟨term⟩
4   | ⟨term⟩
5 ⟨term⟩ ::= ⟨term⟩ ∗ ⟨factor⟩
6   | ⟨term⟩/⟨factor⟩
7   | ⟨factor⟩
8 ⟨factor⟩ ::= num
9   | id

Prod'n.   Sentential Form
–         ⟨goal⟩
1         ⟨expr⟩
3         ⟨expr⟩ − ⟨term⟩
5         ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩
9         ⟨expr⟩ − ⟨term⟩ ∗ id
7         ⟨expr⟩ − ⟨factor⟩ ∗ id
8         ⟨expr⟩ − num ∗ id
4         ⟨term⟩ − num ∗ id
7         ⟨factor⟩ − num ∗ id
9         id − num ∗ id

Handle-pruning

The process to construct a bottom-up parse is called handle-pruning. To construct a rightmost derivation

S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ··· ⇒ γn−1 ⇒ γn = w

we set i to n and apply the following simple algorithm:

for i = n downto 1
  1 find the handle Ai → βi in γi
  2 replace βi with Ai to generate γi−1

This takes 2n steps, where n is the length of the derivation.



Stack implementation

One scheme to implement a handle-pruning, bottom-up parser is called a shift-reduce parser. Shift-reduce parsers use a stack and an input buffer:
1 initialize the stack with $
2 repeat until the top of the stack is the goal symbol and the input token is $:
  a) find the handle: if we don't have a handle on top of the stack, shift an input symbol onto the stack
  b) prune the handle: if we have a handle A → β on the stack, reduce:
     i) pop |β| symbols off the stack
     ii) push A onto the stack

Example: back to x − 2 ∗ y

The grammar (1 S → E, 2 E → E + T, 3 E → E − T, 4 E → T, 5 T → T ∗ F, 6 T → T/F, 7 T → F, 8 F → num, 9 F → id; here S, E, T, F abbreviate ⟨goal⟩, ⟨expr⟩, ⟨term⟩, ⟨factor⟩):

Stack                             Input            Action
$                                 id − num ∗ id    S
$ id                              − num ∗ id       R9
$ ⟨factor⟩                        − num ∗ id       R7
$ ⟨term⟩                          − num ∗ id       R4
$ ⟨expr⟩                          − num ∗ id       S
$ ⟨expr⟩ −                        num ∗ id         S
$ ⟨expr⟩ − num                    ∗ id             R8
$ ⟨expr⟩ − ⟨factor⟩               ∗ id             R7
$ ⟨expr⟩ − ⟨term⟩                 ∗ id             S
$ ⟨expr⟩ − ⟨term⟩ ∗               id               S
$ ⟨expr⟩ − ⟨term⟩ ∗ id                             R9
$ ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩                       R5
$ ⟨expr⟩ − ⟨term⟩                                  R3
$ ⟨expr⟩                                           R1
$ ⟨goal⟩                                           A (accept)


Shift-reduce parsing

Shift-reduce parsers are simple to understand. A shift-reduce parser has just four canonical actions:
1 shift: the next input symbol is shifted onto the top of the stack
2 reduce: the right end of the handle is on top of the stack; locate the left end of the handle within the stack; pop the handle off the stack and push the appropriate non-terminal LHS
3 accept: terminate parsing and signal success
4 error: call an error recovery routine

Key insight: recognize handles with a DFA:
- DFA transitions shift states instead of symbols
- accepting states trigger reductions

May have shift-reduce conflicts. "How many ops?": k shifts, l reduces, and 1 accept, where k is the length of the input string and l is the length of the reverse rightmost derivation.

LR parsing

The skeleton parser:

push s0
token ← next_token()
repeat forever
  s ← top of stack
  if action[s, token] = "shift si" then
    push si
    token ← next_token()
  else if action[s, token] = "reduce A → β" then
    pop |β| states
    s′ ← top of stack
    push goto[s′, A]
  else if action[s, token] = "accept" then
    return
  else error()



Example tables

The grammar:

1 S → E
2 E → T + E
3   | T
4 T → F ∗ T
5   | F
6 F → id

state |        ACTION        |   GOTO
      |  id    +    ∗    $   |  E  T  F
  0   |  s4    –    –    –   |  1  2  3
  1   |  –     –    –    acc |  –  –  –
  2   |  –     s5   –    r3  |  –  –  –
  3   |  –     r5   s6   r5  |  –  –  –
  4   |  –     r6   r6   r6  |  –  –  –
  5   |  s4    –    –    –   |  7  2  3
  6   |  s4    –    –    –   |  –  8  3
  7   |  –     –    –    r2  |  –  –  –
  8   |  –     r4   –    r4  |  –  –  –

Note: this is a simple little right-recursive grammar. It is not the same grammar as in previous lectures.

Example using the tables

Stack          Input             Action
$ 0            id ∗ id + id $    s4
$ 0 4          ∗ id + id $       r6
$ 0 3          ∗ id + id $       s6
$ 0 3 6        id + id $         s4
$ 0 3 6 4      + id $            r6
$ 0 3 6 3      + id $            r5
$ 0 3 6 8      + id $            r4
$ 0 2          + id $            s5
$ 0 2 5        id $              s4
$ 0 2 5 4      $                 r6
$ 0 2 5 3      $                 r5
$ 0 2 5 2      $                 r3
$ 0 2 5 7      $                 r2
$ 0 1          $                 acc
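Plugging the ACTION/GOTO tables above into the skeleton parser from the previous slide gives a runnable sketch. Python, with the tables transcribed by hand; the dict encoding is our own:

```python
# Table-driven LR parser for the little right-recursive grammar.
prods = {1: ("S", ["E"]), 2: ("E", ["T", "+", "E"]), 3: ("E", ["T"]),
         4: ("T", ["F", "*", "T"]), 5: ("T", ["F"]), 6: ("F", ["id"])}
ACTION = {(0, "id"): ("s", 4), (1, "$"): ("acc", None),
          (2, "+"): ("s", 5), (2, "$"): ("r", 3),
          (3, "+"): ("r", 5), (3, "*"): ("s", 6), (3, "$"): ("r", 5),
          (4, "+"): ("r", 6), (4, "*"): ("r", 6), (4, "$"): ("r", 6),
          (5, "id"): ("s", 4), (6, "id"): ("s", 4),
          (7, "$"): ("r", 2), (8, "+"): ("r", 4), (8, "$"): ("r", 4)}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (5, "E"): 7, (5, "T"): 2, (5, "F"): 3,
        (6, "T"): 8, (6, "F"): 3}

def parse(tokens):
    """Return the reductions performed, in order, or raise SyntaxError."""
    stack = [0]                       # stack of states
    inp = tokens + ["$"]
    pos, reductions = 0, []
    while True:
        s, tok = stack[-1], inp[pos]
        act = ACTION.get((s, tok))
        if act is None:               # undefined entry means error
            raise SyntaxError(f"error in state {s} on {tok}")
        kind, arg = act
        if kind == "s":               # shift: push state, consume token
            stack.append(arg); pos += 1
        elif kind == "r":             # reduce by production arg
            A, beta = prods[arg]
            del stack[len(stack) - len(beta):]   # pop |beta| states
            stack.append(GOTO[stack[-1], A])     # then take the goto
            reductions.append(arg)
        else:                         # accept
            return reductions

print(parse(["id", "*", "id", "+", "id"]))   # [6, 6, 5, 4, 6, 5, 3, 2]
```

The printed reduction sequence matches the r-actions in the trace above.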



Formally, a grammar G is LR(k) iff.: Informally, we say that a grammar G is LR(k) if, given a rightmost 1 ∗ S ⇒rm αAw ⇒rm αβw, and derivation 2 ∗ S ⇒rm γBx ⇒rm αβy, and S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ··· ⇒ γn = w, 3 FIRSTk(w) = FIRSTk(y) we can, for each right-sentential form in the derivation: ⇒ αAy = γBx 1 isolate the handle of each right-sentential form, and i.e., Assume sentential forms αβw and αβy, with common prefix αβ 2 determine the production by which to reduce and common k-symbol lookahead FIRSTk(y) = FIRSTk(w), such that by scanning γi from left to right, going at most k symbols beyond the αβw reduces to αAw and αβy reduces to γBx. right end of the handle of γi. But, the common prefix means αβy also reduces to αAy, for the same result. Thus αAy = γBx.

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 57 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 58 / 98

Why study LR grammars? LR parsing

LR(1) grammars are often used to construct parsers. Three common algorithms to build tables for an “LR” parser: We call these parsers LR(1) parsers. 1 SLR(1) virtually all context-free programming language constructs can be smallest class of grammars expressed in an LR(1) form smallest tables (number of states) LR grammars are the most general grammars parsable by a simple, fast construction deterministic, bottom-up parser 2 LR(1) efficient parsers can be implemented for LR(1) grammars full set of LR(1) grammars LR parsers detect an error as soon as possible in a left-to-right largest tables (number of states) scan of the input slow, large construction LR grammars describe a proper superset of the languages recognized by predictive (i.e., LL) parsers 3 LALR(1) intermediate sized set of grammars LL(k): recognize use of a production A → β seeing first k same number of states as SLR(1) symbols derived from β canonical construction is slow and large LR(k): recognize the handle β after seeing everything better construction techniques exist derived from β plus k lookahead symbols

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 59 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 60 / 98 SLR vs. LR/LALR LR(k) items

The table construction algorithms use sets of LR(k) items or configurations to represent the possible states in a parse. An LR(k) item is a pair [ , ], where An LR(1) parser for either Algol or Pascal has several thousand states, α β G • while an SLR(1) or LALR(1) parser for the same language may have α is a production from with a at some position in the RHS, marking how much of the RHS of a production has already been several hundred states. seen β is a lookahead string containing k symbols (terminals or $) Two cases of interest are k = 0 and k = 1: LR(0) items play a key role in the SLR(1) table construction algorithm. LR(1) items play a key role in the LR(1) and LALR(1) table construction algorithms.

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 61 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 62 / 98

Example The characteristic finite state machine (CFSM)

The • indicates how much of an item we have seen at a given state in The CFSM for a grammar is a DFA which recognizes viable prefixes of the parse: right-sentential forms: [A → •XYZ] indicates that the parser is looking for a string that can be derived from XYZ A viable prefix is any prefix that does not extend beyond the [A → XY • Z] indicates that the parser has seen a string derived from handle. XY and is looking for one derivable from Z It accepts when a handle has been discovered and needs to be LR(0) items: (no lookahead) reduced. A → XYZ generates 4 LR(0) items: To construct the CFSM we need two functions: 1 [A → •XYZ] CLOSURE(I) to build its states 2 [A → X • YZ] 3 [A → XY • Z] GOTO(I,X) to determine its transitions 4 [A → XYZ•]

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 63 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 64 / 98 CLOSURE GOTO

Given an item [A → α • Bβ], its closure contains the item and any other Let I be a set of LR(0) items and X be a grammar symbol. items that can generate legal substrings to follow α. Then, GOTO(I,X) is the closure of the set of all items Thus, if the parser has viable prefix α on its stack, the input should [A → αX • β] such that [A → α • Xβ] ∈ I reduce to Bβ (or γ for some other item [B → •γ] in the closure). If I is the set of valid items for some viable prefix γ, then GOTO(I,X) is function CLOSURE(I) the set of valid items for the viable prefix γX. repeat GOTO(I,X) represents state after recognizing X in state I. if [A → α • Bβ] ∈ I add [B → •γ] to I function GOTO(I,X) until no more items can be added to I let J be the set of items [A → αX • β] return I such that [A → α • Xβ] ∈ I return CLOSURE(J)

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 65 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 66 / 98

Building the LR(0) item sets LR(0) example

We start the construction with the item [S0 → •S$], where 1 S → E$ I0 : S → •E$ I4 : E → E + T• E → E + T E → •E + T I : T → id• S0 is the start symbol of the augmented grammar G0 2 5 S is the start symbol of G 3 | T E → •T I6 : T → (•E) $ represents EOF 4 T → id T → •id E → •E + T To compute the collection of sets of LR(0) items 5 | (E) T → •(E) E → •T I : S → E • $ T → •id function items(G0) The corresponding CFSM: 1 0 s0 ← CLOSURE({[S → •S$]}) 9 E → E • +T T → •(E) T C ← {s0} I2 : S → E$• I7 : T → (E•) repeat ( T id id I3 : E → E + •T E → E • +T for each set of items s ∈ C 0 5 6 ( for each grammar symbol X T → •id I8 : T → (E)• if GOTO(s,X) 6= and GOTO(s,X) 6∈ C E id ( E φ T → •(E) I9 : E → T• add GOTO(s,X) to C 1 +3 + 7 until no more item sets can be added to C return C $ T ) 2 4 8

* *

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 67 / 98 V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 68 / 98 Constructing the LR(0) parsing table LR(0) example

state ACTION GOTO 1 construct the collection of sets of LR(0) items for G0 9 T id () + $ SET 2 state i of the CFSM is constructed from Ii ( 1 [A → α • aβ] ∈ Ii and GOTO(Ii,a) = Ij T 0 s5 s6 – – – – 1 9 ⇒ ACTION[i,a] ← “shift j” id id 1 – – – s3 s2 ––– 0 0 5 6 ( 2 [A → α•] ∈ Ii,A 6= S 2 acc acc acc acc acc ––– ⇒ ACTION[i,a] ← “reduce A → α”, ∀a 0 E id E 3 s5 s6 – – – – – 4 3 [S → S$•] ∈ I ( i 4 r2 r2 r2 r2 r2 ––– ⇒ ACTION[i,a] ← “accept”, ∀a 1 +3 + 7 5 r4 r4 r4 r4 r4 ––– 3 GOTO(I ,A) = I i j 6 s5 s6 – – – – 7 9 ⇒ GOTO[i,A] ← j T ) $ 7 – – s8 s3 – ––– 4 set undefined entries in ACTION and GOTO to “error” 0 5 initial state of parser s0 is CLOSURE([S → •S$]) 2 4 8 8 r5 r5 r5 r5 r5 ––– 9 r3 r3 r3 r3 r3 –––
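The five steps above can be turned into a small table builder. Here is a Python sketch for this grammar; the state numbering comes from a sorted order and so will not match the slide's numbering, and the encodings are illustrative choices.

```python
PRODS = [("S", ("E", "$")), ("E", ("E", "+", "T")), ("E", ("T",)),
         ("T", ("id",)), ("T", ("(", "E", ")"))]
NONTERMS = {"S", "E", "T"}
SYMBOLS = {s for _, rhs in PRODS for s in rhs}
TERMS = SYMBOLS - NONTERMS

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot) in list(items):
            rhs = PRODS[p][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for q, (lhs, _) in enumerate(PRODS):
                    if lhs == rhs[dot] and (q, 0) not in items:
                        items.add((q, 0))
                        changed = True
    return frozenset(items)

def goto(items, X):
    return closure({(p, d + 1) for (p, d) in items
                    if d < len(PRODS[p][1]) and PRODS[p][1][d] == X})

def items_lr0():
    s0 = closure({(0, 0)})
    C, work = {s0}, [s0]
    while work:
        s = work.pop()
        for X in SYMBOLS:
            t = goto(s, X)
            if t and t not in C:
                C.add(t)
                work.append(t)
    return C

def build_lr0_table():
    states = sorted(items_lr0(), key=sorted)   # arbitrary but stable numbering
    index = {s: i for i, s in enumerate(states)}
    ACTION, GOTO_T, conflicts = {}, {}, []

    def set_action(i, a, act):
        if ACTION.get((i, a), act) != act:     # multiply-defined entry
            conflicts.append((i, a))
        ACTION[i, a] = act

    for s in states:
        i = index[s]
        for (p, dot) in s:
            rhs = PRODS[p][1]
            if dot < len(rhs):                          # dot before a symbol
                X = rhs[dot]
                j = index[goto(s, X)]
                if X in NONTERMS:
                    GOTO_T[i, X] = j                    # step 3: GOTO[i,A] <- j
                else:
                    set_action(i, X, ("shift", j))      # step 2.1
            elif p == 0:                                # [S' -> S $ .]
                for a in TERMS:
                    set_action(i, a, ("accept",))       # step 2.3
            else:
                for a in TERMS:
                    set_action(i, a, ("reduce", p))     # step 2.2
    return ACTION, GOTO_T, states, conflicts

ACTION, GOTO_T, states, conflicts = build_lr0_table()
print(len(states), len(conflicts))   # 10 states, 0 conflicts: G is LR(0)
```

The `conflicts` list implements the "multiply-defined ACTION entries" test from the next slide: it stays empty exactly when the grammar is LR(0).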


Conflicts in the ACTION table

If the LR(0) parsing table contains any multiply-defined ACTION
entries, then G is not LR(0).
Two conflicts arise:
  shift-reduce: both shift and reduce possible in the same item set
  reduce-reduce: more than one distinct reduce action possible in the
  same item set

Conflicts can be resolved through lookahead in ACTION. Consider:
  A → ε | aα  ⇒  shift-reduce conflict
  a:=b+c*d requires lookahead to avoid a shift-reduce conflict after
  shifting c (need to see * to give it precedence over +)

SLR(1): simple lookahead LR

Add lookaheads after building the LR(0) item sets.
Constructing the SLR(1) parsing table:
1 construct the collection of sets of LR(0) items for G′
2 state i of the CFSM is constructed from Ii
    1 [A → α • aβ] ∈ Ii and GOTO(Ii,a) = Ij
        ⇒ ACTION[i,a] ← “shift j”, ∀a ≠ $
    2 [A → α•] ∈ Ii, A ≠ S′
        ⇒ ACTION[i,a] ← “reduce A → α”, ∀a ∈ FOLLOW(A)
    3 [S′ → S • $] ∈ Ii
        ⇒ ACTION[i,$] ← “accept”
3 GOTO(Ii,A) = Ij
    ⇒ GOTO[i,A] ← j
4 set undefined entries in ACTION and GOTO to “error”
5 initial state of parser s0 is CLOSURE([S′ → •S$])


From previous example

1 S → E$        FOLLOW(E) = FOLLOW(T) = {$, +, )}
2 E → E + T
3   | T
4 T → id
5   | (E)

        ACTION                     GOTO
state   id   (    )    +    $      S  E  T
0       s5   s6   –    –    –      –  1  9
1       –    –    –    s3   acc    –  –  –
2       –    –    –    –    –      –  –  –
3       s5   s6   –    –    –      –  –  4
4       –    –    r2   r2   r2     –  –  –
5       –    –    r4   r4   r4     –  –  –
6       s5   s6   –    –    –      –  7  9
7       –    –    s8   s3   –      –  –  –
8       –    –    r5   r5   r5     –  –  –
9       –    –    r3   r3   r3     –  –  –

Example: A grammar that is not LR(0)

1 S → E$
2 E → E + T
3   | T
4 T → T ∗ F
5   | F
6 F → id
7   | (E)

I0: S → •E$         I6: F → (•E)
    E → •E + T          E → •E + T
    E → •T              E → •T
    T → •T ∗ F          T → •T ∗ F
    T → •F              T → •F
    F → •id             F → •id
    F → •(E)            F → •(E)
I1: S → E • $       I7: E → T•
    E → E • +T          T → T • ∗F
I2: S → E$•         I8: T → T ∗ •F
I3: E → E + •T          F → •id
    T → •T ∗ F          F → •(E)
    T → •F          I9: T → T ∗ F•
    F → •id         I10: F → (E)•
    F → •(E)        I11: E → E + T•
I4: T → F•               T → T • ∗F
I5: F → id•         I12: F → (E•)
                         E → E • +T

FOLLOW:  E {+, ), $}   T {+, ∗, ), $}   F {+, ∗, ), $}
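The FOLLOW sets quoted above can be recomputed with a short fixpoint sketch. The function names are illustrative; since this grammar has no ε-productions, FIRST of a symbol string is just the FIRST set of its leading symbol.

```python
# Grammar on the right: 0: S -> E $  1: E -> E + T  2: E -> T
#                       3: T -> T * F  4: T -> F  5: F -> id  6: F -> ( E )
PRODS = [("S", ("E", "$")), ("E", ("E", "+", "T")), ("E", ("T",)),
         ("T", ("T", "*", "F")), ("T", ("F",)),
         ("F", ("id",)), ("F", ("(", "E", ")"))]
NONTERMS = {"S", "E", "T", "F"}

def compute_first():
    """FIRST sets by fixpoint. With no epsilon-productions, only the
    leading symbol of each RHS contributes."""
    first = {A: set() for A in NONTERMS}
    for _, rhs in PRODS:
        for X in rhs:
            if X not in NONTERMS:
                first[X] = {X}          # FIRST of a terminal is itself
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            before = len(first[lhs])
            first[lhs] |= first[rhs[0]]
            if len(first[lhs]) != before:
                changed = True
    return first

def compute_follow(first):
    """FOLLOW(B) gathers FIRST of what follows B in a RHS, or FOLLOW of
    the LHS when B ends the RHS."""
    follow = {A: set() for A in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            for i, B in enumerate(rhs):
                if B not in NONTERMS:
                    continue
                new = first[rhs[i + 1]] if i + 1 < len(rhs) else follow[lhs]
                before = len(follow[B])
                follow[B] |= new
                if len(follow[B]) != before:
                    changed = True
    return follow

FOLLOW = compute_follow(compute_first())
print(sorted(FOLLOW["E"]))   # ['$', ')', '+']
print(sorted(FOLLOW["T"]))   # ['$', ')', '*', '+']
```

The output matches the FOLLOW table on the slide: FOLLOW(T) and FOLLOW(F) additionally contain ∗, which is exactly the lookahead that lets SLR(1) resolve the shift-reduce conflict in I7.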


Example: But it is SLR(1)

        ACTION                          GOTO
state   +    ∗    id   (    )    $      S  E  T  F
0       –    –    s5   s6   –    –      –  1  7  4
1       s3   –    –    –    –    acc    –  –  –  –
2       –    –    –    –    –    –      –  –  –  –
3       –    –    s5   s6   –    –      –  –  11 4
4       r5   r5   –    –    r5   r5     –  –  –  –
5       r6   r6   –    –    r6   r6     –  –  –  –
6       –    –    s5   s6   –    –      –  12 7  4
7       r3   s8   –    –    r3   r3     –  –  –  –
8       –    –    s5   s6   –    –      –  –  –  9
9       r4   r4   –    –    r4   r4     –  –  –  –
10      r7   r7   –    –    r7   r7     –  –  –  –
11      r2   s8   –    –    r2   r2     –  –  –  –
12      s3   –    –    –    s10  –      –  –  –  –

Example: A grammar that is not SLR(1)

Consider:             Its LR(0) item sets:
  S → L = R           I0: S′ → •S$        I5: L → ∗ • R
    | R                   S → •L = R          R → •L
  L → ∗R                  S → •R              L → • ∗ R
    | id                  L → • ∗ R           L → •id
  R → L                   L → •id         I6: S → L = •R
                          R → •L              R → •L
                      I1: S′ → S • $         L → • ∗ R
                      I2: S → L• = R         L → •id
                          R → L•         I7: L → ∗R•
                      I3: S → R•         I8: R → L•
                      I4: L → id•        I9: S → L = R•

Now consider I2:  = ∈ FOLLOW(R)  (S ⇒ L = R ⇒ ∗R = R)


LR(1) items

Recall: An LR(k) item is a pair [α, β], where
  α is a production from G with a • at some position in the RHS,
  marking how much of the RHS of a production has been seen
  β is a lookahead string containing k symbols (terminals or $)

What about LR(1) items?
  All the lookahead strings are constrained to have length 1
  They look something like [A → X • YZ, a]

LR(1) items

What is the point of the lookahead symbols?
  carried along to choose the correct reduction when there is a choice
  lookaheads are bookkeeping, unless the item has • at the right end:
    in [A → X • YZ, a], a has no direct use
    in [A → XYZ•, a], a is useful
  they allow the use of grammars that are not uniquely invertible†

The point: for [A → α•, a] and [B → α•, b], we can decide between
reducing to A or B by looking at limited right context.

†No two productions have the same RHS


closure1(I)

Given an item [A → α • Bβ, a], its closure contains the item and any
other items that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on its stack, the input should
reduce to Bβ (or γ for some other item [B → •γ, b] in the closure).

function closure1(I)
  repeat
    if [A → α • Bβ, a] ∈ I
      add [B → •γ, b] to I, where b ∈ FIRST(βa)
  until no more items can be added to I
  return I

goto1(I)

Let I be a set of LR(1) items and X be a grammar symbol.
Then, GOTO(I,X) is the closure of the set of all items
[A → αX • β, a] such that [A → α • Xβ, a] ∈ I.
If I is the set of valid items for some viable prefix γ, then GOTO(I,X)
is the set of valid items for the viable prefix γX.
goto1(I,X) represents the state after recognizing X in state I.

function goto1(I,X)
  let J be the set of items [A → αX • β, a]
    such that [A → α • Xβ, a] ∈ I
  return closure1(J)
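The only new ingredient compared with the LR(0) CLOSURE is the lookahead computation b ∈ FIRST(βa). A Python sketch of closure1, using the small grammar S′ → S, S → CC, C → cC | d that reappears in the LALR example later; items are (production index, dot, lookahead) triples, an illustrative encoding.

```python
# 0: S' -> S   1: S -> C C   2: C -> c C   3: C -> d
PRODS = [("S'", ("S",)), ("S", ("C", "C")),
         ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {"S'", "S", "C"}

def first_of(symbols):
    """FIRST of a nonempty symbol string. This grammar has no
    epsilon-productions and no left recursion, so the straightforward
    recursion terminates and only the leading symbol matters."""
    X = symbols[0]
    if X not in NONTERMS:
        return {X}
    return {t for lhs, rhs in PRODS if lhs == X for t in first_of(rhs)}

def closure1(items):
    """closure1(I): for [A -> alpha . B beta, a], add [B -> . gamma, b]
    for every b in FIRST(beta a), until no more items can be added."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot, a) in list(items):
            rhs = PRODS[p][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for b in first_of(rhs[dot + 1:] + (a,)):
                    for q, (lhs, _) in enumerate(PRODS):
                        if lhs == rhs[dot] and (q, 0, b) not in items:
                            items.add((q, 0, b))
                            changed = True
    return frozenset(items)

I0 = closure1({(0, 0, "$")})   # closure1 of [S' -> . S, $]
print(len(I0))                 # 6 item/lookahead pairs
```

From [S → •CC, $] the step β = C, a = $ gives FIRST(C$) = {c, d}, so the C-productions enter with lookaheads c and d, exactly as in the I0 of the LALR example later in the slides.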


Building the LR(1) item sets for grammar G

We start the construction with the item [S′ → •S, $], where
  S′ is the start symbol of the augmented grammar G′
  S is the start symbol of G
  $ represents EOF

To compute the collection of sets of LR(1) items:

function items(G′)
  s0 ← closure1({[S′ → •S, $]})
  C ← {s0}
  repeat
    for each set of items s ∈ C
      for each grammar symbol X
        if goto1(s,X) ≠ φ and goto1(s,X) ∉ C
          add goto1(s,X) to C
  until no more item sets can be added to C
  return C

Constructing the LR(1) parsing table

Build lookahead into the DFA to begin with:
1 construct the collection of sets of LR(1) items for G′
2 state i of the LR(1) machine is constructed from Ii
    1 [A → α • aβ, b] ∈ Ii and goto1(Ii,a) = Ij
        ⇒ ACTION[i,a] ← “shift j”
    2 [A → α•, a] ∈ Ii, A ≠ S′
        ⇒ ACTION[i,a] ← “reduce A → α”
    3 [S′ → S•, $] ∈ Ii
        ⇒ ACTION[i,$] ← “accept”
3 goto1(Ii,A) = Ij
    ⇒ GOTO[i,A] ← j
4 set undefined entries in ACTION and GOTO to “error”
5 initial state of parser s0 is closure1([S′ → •S, $])


Back to previous example (∉ SLR(1))

  S → L = R       I0: S′ → •S, $          I5: L → id•, =$
    | R               S → •L = R, $       I6: S → L = •R, $
  L → ∗R              S → •R, $               R → •L, $
    | id              L → • ∗ R, =$           L → • ∗ R, $
  R → L               L → •id, =$             L → •id, $
                      R → •L, $           I7: L → ∗R•, =$
                  I1: S′ → S•, $          I8: R → L•, =$
                  I2: S → L• = R, $       I9: S → L = R•, $
                      R → L•, $           I10: R → L•, $
                  I3: S → R•, $           I11: L → ∗ • R, $
                  I4: L → ∗ • R, =$            R → •L, $
                      R → •L, =$               L → • ∗ R, $
                      L → • ∗ R, =$            L → •id, $
                      L → •id, =$         I12: L → id•, $
                                          I13: L → ∗R•, $

I2 no longer has a shift-reduce conflict: reduce on $, shift on =.

Example: back to SLR(1) expression grammar

In general, LR(1) has many more states than LR(0)/SLR(1):

1 S → E         4 T → T ∗ F
2 E → E + T     5   | F
3   | T         6 F → id
                7   | (E)

LR(1) item sets:

I0:                  I0′: shifting (        I0″: shifting ( again
  S → •E, $            F → (•E), ∗+$          F → (•E), ∗+)
  E → •E + T, +$       E → •E + T, +)         E → •E + T, +)
  E → •T, +$           E → •T, +)             E → •T, +)
  T → •T ∗ F, ∗+$      T → •T ∗ F, ∗+)        T → •T ∗ F, ∗+)
  T → •F, ∗+$          T → •F, ∗+)            T → •F, ∗+)
  F → •id, ∗+$         F → •id, ∗+)           F → •id, ∗+)
  F → •(E), ∗+$        F → •(E), ∗+)          F → •(E), ∗+)

Another example

Consider:          LR(1) item sets:
0 S′ → S           I0: S′ → •S, $          I4: C → d•, cd
1 S → CC               S → •CC, $          I5: S → CC•, $
2 C → cC               C → •cC, cd         I6: C → c • C, $
3   | d                C → •d, cd              C → •cC, $
                   I1: S′ → S•, $              C → •d, $
                   I2: S → C • C, $        I7: C → d•, $
                       C → •cC, $          I8: C → cC•, cd
                       C → •d, $           I9: C → cC•, $
                   I3: C → c • C, cd
                       C → •cC, cd
                       C → •d, cd

        ACTION         GOTO
state   c    d    $    S  C
0       s3   s4   –    1  2
1       –    –    acc  –  –
2       s6   s7   –    –  5
3       s3   s4   –    –  8
4       r3   r3   –    –  –
5       –    –    r1   –  –
6       s6   s7   –    –  9
7       –    –    r3   –  –
8       r2   r2   –    –  –
9       –    –    r2   –  –

LALR(1) parsing

Define the core of a set of LR(1) items to be the set of LR(0) items
derived by ignoring the lookahead symbols.
Thus, the two sets
  {[A → α • β, a], [A → α • β, b]}, and
  {[A → α • β, c], [A → α • β, d]}
have the same core.

Key idea:
  If two sets of LR(1) items, Ii and Ij, have the same core, we can
  merge the states that represent them in the ACTION and GOTO tables.
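The ten item sets above can be reproduced mechanically by combining closure1, goto1, and the items(G′) worklist. A self-contained Python sketch for this grammar (the triple encoding and function names are illustrative choices):

```python
# 0: S' -> S   1: S -> C C   2: C -> c C   3: C -> d
PRODS = [("S'", ("S",)), ("S", ("C", "C")),
         ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {"S'", "S", "C"}
SYMBOLS = {s for _, rhs in PRODS for s in rhs}

def first_of(symbols):
    # safe here: no epsilon-productions, no left recursion
    X = symbols[0]
    if X not in NONTERMS:
        return {X}
    return {t for lhs, rhs in PRODS if lhs == X for t in first_of(rhs)}

def closure1(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot, a) in list(items):
            rhs = PRODS[p][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for b in first_of(rhs[dot + 1:] + (a,)):
                    for q, (lhs, _) in enumerate(PRODS):
                        if lhs == rhs[dot] and (q, 0, b) not in items:
                            items.add((q, 0, b))
                            changed = True
    return frozenset(items)

def goto1(items, X):
    return closure1({(p, d + 1, a) for (p, d, a) in items
                     if d < len(PRODS[p][1]) and PRODS[p][1][d] == X})

def items_lr1():
    """Worklist construction of the canonical collection of LR(1) item sets."""
    s0 = closure1({(0, 0, "$")})
    C, work = {s0}, [s0]
    while work:
        s = work.pop()
        for X in SYMBOLS:
            t = goto1(s, X)
            if t and t not in C:
                C.add(t)
                work.append(t)
    return C

states = items_lr1()
print(len(states))   # 10 sets of LR(1) items, matching I0..I9 above
```

Note how I3 and I6 come out with the same productions and dot positions but different lookaheads — the observation the LALR(1) merging on the right exploits.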


LALR(1) table construction

To construct LALR(1) parsing tables, we can insert a single step into
the LR(1) algorithm:

(1.5) For each core present among the set of LR(1) items, find
      all sets having that core and replace these sets by their
      union. The goto function must be updated to reflect the
      replacement sets.

The resulting algorithm has large space requirements, as we are still
required to build the full set of LR(1) items.

LALR(1) table construction

The revised (and renumbered) algorithm:
1 construct the collection of sets of LR(1) items for G′
2 for each core present among the set of LR(1) items, find all sets
  having that core and replace these sets by their union (update the
  goto1 function incrementally)
3 state i of the LALR(1) machine is constructed from Ii
    1 [A → α • aβ, b] ∈ Ii and goto1(Ii,a) = Ij
        ⇒ ACTION[i,a] ← “shift j”
    2 [A → α•, a] ∈ Ii, A ≠ S′
        ⇒ ACTION[i,a] ← “reduce A → α”
    3 [S′ → S•, $] ∈ Ii
        ⇒ ACTION[i,$] ← “accept”
4 goto1(Ii,A) = Ij ⇒ GOTO[i,A] ← j
5 set undefined entries in ACTION and GOTO to “error”
6 initial state of parser s0 is closure1([S′ → •S, $])


Example

Reconsider:
0 S′ → S      I0: S′ → •S, $      I3: C → c • C, cd    I6: C → c • C, $
1 S → CC          S → •CC, $          C → •cC, cd          C → •cC, $
2 C → cC          C → •cC, cd         C → •d, cd           C → •d, $
3   | d           C → •d, cd      I4: C → d•, cd       I7: C → d•, $
              I1: S′ → S•, $      I5: S → CC•, $       I8: C → cC•, cd
              I2: S → C • C, $                         I9: C → cC•, $
                  C → •cC, $
                  C → •d, $

Merged states:
I36: C → c • C, cd$       I47: C → d•, cd$
     C → •cC, cd$         I89: C → cC•, cd$
     C → •d, cd$

        ACTION          GOTO
state   c     d     $   S  C
0       s36   s47   –   1  2
1       –     –   acc   –  –
2       s36   s47   –   –  5
36      s36   s47   –   –  8
47      r3    r3   r3   –  –
5       –     –    r1   –  –
89      r2    r2   r2   –  –

More efficient LALR(1) construction

Observe that we can:
  represent Ii by its basis or kernel: items that are either
  [S′ → •S, $] or do not have • at the left of the RHS
  compute shift, reduce and goto actions for the state derived from Ii
  directly from its kernel

This leads to a method that avoids building the complete canonical
collection of sets of LR(1) items.

Self reading: Section 4.7.5 of the Dragon book
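The merging step (1.5) can be sketched directly on the ten item sets listed above, transcribed as (production, dot, lookahead) triples for productions 0: S′ → S, 1: S → CC, 2: C → cC, 3: C → d; the encoding is an illustrative choice.

```python
# The ten LR(1) item sets I0..I9 from the slide, as triples.
I = [
    {(0,0,"$"), (1,0,"$"), (2,0,"c"), (2,0,"d"), (3,0,"c"), (3,0,"d")},  # I0
    {(0,1,"$")},                                                         # I1
    {(1,1,"$"), (2,0,"$"), (3,0,"$")},                                   # I2
    {(2,1,"c"), (2,1,"d"), (2,0,"c"), (2,0,"d"), (3,0,"c"), (3,0,"d")},  # I3
    {(3,1,"c"), (3,1,"d")},                                              # I4
    {(1,2,"$")},                                                         # I5
    {(2,1,"$"), (2,0,"$"), (3,0,"$")},                                   # I6
    {(3,1,"$")},                                                         # I7
    {(2,2,"c"), (2,2,"d")},                                              # I8
    {(2,2,"$")},                                                         # I9
]

def core(items):
    """The LR(0) core: drop the lookahead component of every item."""
    return frozenset((p, d) for (p, d, _) in items)

def merge_by_core(states):
    """Union together all item sets sharing a core (step 1.5)."""
    merged = {}
    for s in states:
        merged.setdefault(core(s), set()).update(s)
    return merged

lalr = merge_by_core(I)
print(len(lalr))                 # 7 LALR states: I3/I6, I4/I7, I8/I9 merge
assert core(I[3]) == core(I[6])  # these two become I36 on the slide
```

The union of I4 and I7 is exactly I47 from the slide: the single item C → d• now carrying lookaheads c, d, and $.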


The role of precedence

Precedence and associativity can be used to resolve shift/reduce
conflicts in ambiguous grammars:
  lookahead with higher precedence ⇒ shift
  same precedence, left associative ⇒ reduce

Advantages:
  more concise, albeit ambiguous, grammars
  shallower parse trees ⇒ fewer reductions

Classic application: expression grammars

The role of precedence

With precedence and associativity, we can use:

  E → E ∗ E
    | E/E
    | E + E
    | E − E
    | (E)
    | -E
    | id
    | num

This eliminates useless reductions (single productions).
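The two resolution rules (lookahead with higher precedence ⇒ shift; same precedence, left associative ⇒ reduce) can be sketched as a tiny decision function. The precedence table below is an illustrative encoding, not yacc's actual internals.

```python
# Precedence levels and associativity for the ambiguous expression grammar.
PREC = {"+": (1, "left"), "-": (1, "left"),
        "*": (2, "left"), "/": (2, "left")}

def resolve(op_on_stack, lookahead):
    """Decide a shift/reduce conflict between the operator already parsed
    (about to be reduced) and the lookahead operator."""
    p_stack, assoc = PREC[op_on_stack]
    p_look, _ = PREC[lookahead]
    if p_look > p_stack:
        return "shift"      # lookahead binds tighter: in 1 + 2 * 3, keep 2
    if p_look < p_stack:
        return "reduce"     # stack op binds tighter: in 1 * 2 + 3, fold 1 * 2
    return "reduce" if assoc == "left" else "shift"

print(resolve("+", "*"))    # shift
print(resolve("*", "+"))    # reduce
print(resolve("+", "+"))    # reduce (left associative)
```

This is the same decision a yacc-style generator makes from %left/%right declarations when it meets a shift/reduce conflict in the ambiguous grammar above.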


Error recovery in shift-reduce parsers

The problem:
  encounter an invalid token
  bad pieces of tree hanging from the stack
  incorrect entries in the symbol table

We want to parse the rest of the file.

Restarting the parser:
  find a restartable state on the stack
  move to a consistent place in the input
  print an informative message to stderr (line number)

Error recovery in yacc/bison/Java CUP

The error mechanism:
  designated token error
  valid in any production
  error shows synchronization points

When an error is discovered:
  pops the stack until error is legal
  skips input tokens until it successfully shifts 3 (some default value)
  error productions can have actions

This mechanism is fairly general.

Read the section on Error Recovery of the on-line CUP manual


Example

Using error:

  stmt list : stmt
    | stmt list ; stmt

can be augmented with error:

  stmt list : stmt
    | error
    | stmt list ; stmt

This should work:
  throw out the erroneous statement
  synchronize at “;” or “end”
  invoke yyerror("syntax error")

Other “natural” places for errors:
  all the “lists”: FieldList, CaseList
  missing parentheses or brackets (yychar)
  extra operator or missing operator

Left versus right recursion

Right recursion:
  needed for termination in predictive parsers
  requires more stack space
  right associative operators

Left recursion:
  works fine in bottom-up parsers
  limits required stack space
  left associative operators

Rule of thumb:
  right recursion for top-down parsers
  left recursion for bottom-up parsers

Left recursive grammar:
  E → E + T | T
  T → T ∗ F | F
  F → (E) | Int

After left recursion removal:
  E → TE′
  E′ → +TE′ | ε
  T → FT′
  T′ → ∗FT′ | ε
  F → (E) | Int

Parse the string 3 + 4 + 5.
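The left-recursion-removal step used above can be sketched for the immediate case A → Aα | β, which becomes A → βA′, A′ → αA′ | ε. The primed-name convention and tuple encoding are illustrative choices.

```python
def remove_left_recursion(A, prods):
    """prods: list of RHS tuples for nonterminal A. Handles only immediate
    left recursion (A -> A alpha | beta)."""
    rec = [rhs[1:] for rhs in prods if rhs and rhs[0] == A]     # the alphas
    nonrec = [rhs for rhs in prods if not rhs or rhs[0] != A]   # the betas
    if not rec:
        return {A: prods}
    A1 = A + "'"                                # fresh nonterminal A'
    return {A: [beta + (A1,) for beta in nonrec],
            A1: [alpha + (A1,) for alpha in rec] + [()]}        # () = epsilon

new = remove_left_recursion("E", [("E", "+", "T"), ("T",)])
print(new["E"])    # E  -> T E'
print(new["E'"])   # E' -> + T E' | epsilon
```

Applied to E → E + T | T this reproduces the slide's E → TE′, E′ → +TE′ | ε; T → T ∗ F | F transforms the same way.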


Parsing review

Recursive descent
  A hand-coded recursive descent parser directly encodes a grammar
  (typically an LL(1) grammar) into a series of mutually recursive
  procedures. It has most of the linguistic limitations of LL(1).

LL(k)
  An LL(k) parser must be able to recognize the use of a production
  after seeing only the first k symbols of its right hand side.

LR(k)
  An LR(k) parser must be able to recognize the occurrence of the
  right hand side of a production after having seen all that is
  derived from that right hand side with k symbols of lookahead.

Grammar hierarchy

  LR(k) > LR(1) > LALR(1) > SLR(1) > LR(0)
  LL(k) > LL(1) > LL(0)
  LR(0) > LL(0)
  LR(1) > LL(1)
  LR(k) > LL(k)
