Lexical and Syntax Analysis
Top-Down Parsing

String of characters (easy for humans to write and understand)
  -> lexical analysis ->
String of lexemes (identified tokens)
  -> syntax analysis ->
Data structure (easy for programs to transform)

Syntax
A syntax is a set of rules defining the valid strings of a language, often specified by a context-free grammar.
For example, a grammar E for arithmetic expressions:

e → x | y | e + e | e – e | e * e | ( e )

Derivations
A derivation is a proof that some string conforms to a grammar.
A leftmost derivation:

e ⇒ e + e ⇒ x + e ⇒ x + ( e ) ⇒ x + ( e * e ) ⇒ x + ( y * e ) ⇒ x + ( y * x )
A rightmost derivation:
e ⇒ e + e ⇒ e + ( e ) ⇒ e + ( e * e ) ⇒ e + ( e * x ) ⇒ e + ( y * x ) ⇒ x + ( y * x )
Many ways to derive the same string: many ways to write the same proof.

Parse tree: motivation
A parse tree is also a proof that a given input is valid according to the grammar. But a parse tree:

• is more concise: we don’t write out the sentence every time a non-terminal is expanded;
• abstracts over the order in which rules are applied.

Parse tree: intuition
If non-terminal n has a production

n → X Y Z

where X, Y, and Z are terminals or non-terminals, then a parse tree may have an interior node labelled n with three children labelled X, Y, and Z:

      n
    / | \
   X  Y  Z

Parse tree: definition
A parse tree is a tree in which:
• the root is labelled by the start symbol;
• each leaf is labelled by a terminal symbol, or ε;
• each interior node is labelled by a non-terminal;
• if n is a non-terminal labelling an interior node whose children are X1, X2, ⋯, Xn, then there must exist a production n → X1 X2 ⋯ Xn.

Example 1
Example input string: x + y * x
A resulting parse tree according to grammar E:

          e
        / | \
       e  +  e
       |   / | \
       x  e  *  e
          |     |
          y     x

Example 2
The following is not a parse tree according to grammar E:

          e
        / | \
       x  +  e
           / | \
          e  *  e
          |     |
          y     x

Why? Because e → x + e is not a production in grammar E.

Grammar notation
Non-terminals are underlined.
Rather than writing

e → x
e → e + e

we may write:

e → x | e + e

(Also, the symbols → and ::= will be used interchangeably.)

Syntax Analysis
String of symbols -> syntax analysis -> Parse tree
A parse tree is:

1. A proof that a given input is valid according to the grammar;
2. A data structure that is convenient for compilers to process.

(Syntax analysis may also report that the input string is invalid.)

Ambiguity
If there exists more than one parse tree for some string, then the grammar is ambiguous. For example, the string x+y*x has two parse trees:

        e                      e
      / | \                  / | \
     e  +  e                e  *  e
     |   / | \            / | \   |
     x  e  *  e          e  +  e  x
        |     |          |     |
        y     x          x     y

Operator precedence
Different parse trees often have different meanings, so we usually want unambiguous grammars.
Conventionally, * has a higher precedence (binds tighter) than +, so there is only one interpretation of x+y*x, namely x+(y*x).
Operator associativity
Even with precedence rules, ambiguity remains, e.g. x-x-x-x.
Binary operators are either:

• left-associative;
• right-associative;
• non-associative.

Conventionally, - is left-associative, so there is only one interpretation of x-x-x-x, namely ((x-x)-x)-x.
Ambiguity removal
Example input:
e → x | y | e + e | e – e | e * e | ( e )
All operators are left-associative, and * binds tighter than + and –.

Ambiguity removal
Example output:
e → e + e1 | e – e1 | e1
e1 → e1 * e2 | e2
e2 → ( e ) | x | y

Note (ignoring bracketed expressions):

• e1 disallows + and –;
• e2 disallows +, -, and *.

Disallowed parse trees
After disambiguation, there are no parse trees corresponding to the following originals:

        e                      e
      / | \                  / | \
     e  *  e                e  +  e
   / | \   |                |   / | \
  e  +  e  x                x  e  -  e
  |     |                      |     |
  x     y                      y     x

The LHS of * cannot contain a +. The RHS of + cannot contain a -.

Ambiguity removal: step-by-step
Given a non-terminal e which involves operators at n levels of precedence:
Step 1: introduce n+1 new non-terminals, e0 ⋯ en. Let opi denote an operator with precedence i.

Step 2a: replace each production e → e opi e with

ei → ei opi ei+1 | ei+1   if opi is left-associative, or
ei → ei+1 opi ei | ei+1   if opi is right-associative.

Step 2b: replace each production e → opi e with

ei → opi ei | ei+1

Step 2c: replace each production e → e opi with

ei → ei opi | ei+1

Construct the precedence table:

Operator   Precedence
+, -       0
*          1

Grammar E after step 2 becomes:

e0 → e0 + e1 | e0 – e1 | e1
e1 → e1 * e2 | e2
e → ( e ) | x | y

Step 3: replace each remaining production e → ⋯ with en → ⋯.

After step 3:

e0 → e0 + e1 | e0 – e1 | e1
e1 → e1 * e2 | e2
e2 → ( e ) | x | y

Step 4: replace all occurrences of e0 with e.
After step 4:
e → e + e1 | e – e1 | e1
e1 → e1 * e2 | e2
e2 → ( e ) | x | y

Exercise 1
Consider the following ambiguous grammar for logical propositions.
p → 0       (Zero)
  | 1       (One)
  | ~ p     (Negation)
  | p + p   (Disjunction)
  | p * p   (Conjunction)

Now let + and * be right-associative, and let the operators in increasing order of binding strength be: +, *, ~.
Give an unambiguous grammar for logical propositions.
Exercise 2
Which of the following grammars are ambiguous?

b → 0 b 1 | 0 1
e → + e e | – e e | x
s → if b then s | if b then s else s | skip

Homework exercise
Consider the following ambiguous grammar G. s → if b then s | if b then s else s | skip
Give an unambiguous grammar that accepts the same language as G.

Summary so far
• The syntax of a language is often specified by a context-free grammar.
• Derivations and parse trees are proofs.
• Parse trees lead to a concise definition of ambiguity.
• Unambiguous grammars can be constructed using rules of precedence and associativity.

PART 2: TOP-DOWN PARSING
• Recursive-Descent
• Backtracking
• Left-Factoring
• Predictive Parsing
• Left-Recursion Removal
• First and Follow Sets
• Parsing tables and LL(1)
Top-down parsing
Top-down: begin with the start symbol and expand non-terminals, succeeding when the input string is matched.
A good strategy for writing parsers:

1. Implement a syntax checker to accept or reject input strings.
2. Modify the checker to construct a parse tree (straightforward).

RECURSIVE DESCENT
A popular top-down parsing technique.

Recursive descent
A recursive descent parser consists of a set of functions, one for each non-terminal.
The function for non-terminal n returns true if some prefix of the input string can be derived from n, and false otherwise.

Consuming the input
We assume a global variable next points to the input string.

char* next;

The function eat consumes c from the input if possible:

int eat(char c) {
  if (*next == c) { next++; return 1; }
  return 0;
}

Recursive descent
Let parse(X) denote:

• X() if X is a non-terminal;
• eat(X) if X is a terminal.

For each non-terminal N, introduce:

int N() {
  char* save = next;
  for each production N → X1 X2 ⋯ Xn:
    if (parse(X1) && parse(X2) && ⋯ && parse(Xn))
      return 1;
    else
      next = save;  /* backtrack */
  return 0;
}

Exercise 4
Consider the following grammar G with start symbol e.
e → ( e + e ) | ( e * e ) | v
v → x | y
Using recursive descent, write a syntax checker for grammar G.

Answer (part 1)

int e() {
  char* save = next;
  if (eat('(') && e() && eat('+') && e() && eat(')')) return 1; else next = save;
  if (eat('(') && e() && eat('*') && e() && eat(')')) return 1; else next = save;
  if (v()) return 1; else next = save;
  return 0;
}

Answer (part 2)
int v() {
  char* save = next;
  if (eat('x')) return 1; else next = save;
  if (eat('y')) return 1; else next = save;
  return 0;
}

Exercise 5
How many function calls are made by the recursive descent parser to parse the following strings?
(x*x)
((x*x)*x)
(((x*x)*x)*x)
(See animation of backtracking.) Answer
Number of calls is quadratic in the length of the input string.
Input string      Length   Calls
(x*x)             5        21
((x*x)*x)         9        53
(((x*x)*x)*x)     13       117

Lesson: backtracking is expensive!

LEFT FACTORING

Reducing backtracking!

Left factoring
When two productions for a non-terminal share a common prefix, expensive backtracking can be avoided by left-factoring the grammar.
Idea: introduce a new non-terminal that accepts each of the different suffixes.

Example 3
Left-factoring grammar G by introducing non-terminal r:

e → ( e r | v
r → + e ) | * e )
v → x | y

Here ( e is the common prefix of the two bracketed productions, and r accepts the different suffixes.

Effect of left-factoring
Number of calls is now linear in the length of the input string.

Input string      Length   Calls
(x*x)             5        13
((x*x)*x)         9        22
(((x*x)*x)*x)     13       31

Lesson: left-factoring a grammar reduces backtracking.

PREDICTIVE PARSING

Eliminating backtracking!

Predictive parsing
Idea: know which production of a non-terminal to choose based solely on the next input symbol.
Advantage: very efficient since it eliminates all backtracking.
Disadvantage: not all grammars can be parsed in this way. (But many useful ones can.)

Running example
The following grammar H will be used as a running example to demonstrate predictive parsing.
e → e + e | e * e | ( e ) | x | y

Example: x+y*(y+x)

Removing ambiguity
Since + and * are left-associative and * binds tighter than +, we can derive an unambiguous variant of H:

e → e + t | t
t → t * f | f
f → ( e ) | x | y

Left recursion
Problem: left-recursive grammars cause recursive descent parsers to loop forever.

int e() {
  char* save = next;
  if (e() && eat('+') && t()) return 1;  /* calls itself without consuming any input */
  next = save;
  if (t()) return 1;
  next = save;
  return 0;
}

Eliminating left recursion
Let α denote any sequence of grammar symbols.

Rule 1: n → n α  ⟹  n' → α n'
Rule 2: n → α    ⟹  n → α n'   (where α does not begin with n)
Rule 3: introduce the new production n' → ε
Example before:

e → e + v | v
v → x | y

and after:

e → v e'
v → x | y
e' → ε | + v e'

Example 4
Running example, after eliminating left recursion:

e → t e'
e' → + t e' | ε
t → f t'
t' → * f t' | ε
f → ( e ) | x | y

FIRST AND FOLLOW SETS
Predictive parsers are built using the first and follow sets of each non-terminal in a grammar.

Definition of first sets
Let α denote any sequence of grammar symbols.

If α can derive a string beginning with terminal a, then a ∊ first(α).

If α can derive ε, then ε ∊ first(α).

Computing first sets

If a is a terminal then a ∊ first(a α).

The empty string ε ∊ first(ε).

If X1 X2 ⋯ Xn is a sequence of grammar symbols and ∃i · a ∊ first(Xi) and ∀j < i · ε ∊ first(Xj), then a ∊ first(X1 X2 ⋯ Xn).

If n → α is a production then first(α) ⊆ first(n).

Exercise 6
Give all members of the sets:

• first( v )
• first( e )
• first( v e )

for the grammar:

e → ( e + e ) | ( e * e ) | v
v → x | ε

Exercise 7
What are the first sets for each non-terminal in the following grammar?

e → t e'
e' → + t e' | ε
t → f t'
t' → * f t' | ε
f → ( e ) | x | y

Answer

first( f ) = { '(', 'x', 'y' }
first( t' ) = { '*', ε }
first( t ) = { '(', 'x', 'y' }
first( e' ) = { '+', ε }
first( e ) = { '(', 'x', 'y' }

Definition of follow sets
Let α and β denote any sequences of grammar symbols.

Terminal a ∊ follow(n) if the start symbol of the grammar can derive a string of grammar symbols in which a immediately follows n.

The set follow(n) never contains ε.

End markers
In predictive parsing, it is useful to mark the end of the input string with a $ symbol.
((x*x)*x)$
$ is equivalent to '\0' in C.

Computing follow sets
If s is the start symbol of the grammar then $ ∊ follow(s).

If n → α x β then everything in first(β) except ε is in follow(x).

If n → α x, or n → α x β where ε ∊ first(β), then everything in follow(n) is in follow(x).

Exercise
Give all members of the sets:

• follow( e )
• follow( v )

for the grammar:

e → ( e + e ) | ( e * e ) | v
v → x | ε

Exercise 8
What are the follow sets for each non-terminal in the following grammar?

e → t e'
e' → + t e' | ε
t → f t'
t' → * f t' | ε
f → ( e ) | x | y

Answer

follow( e' ) = { $, ')' }
follow( e ) = { $, ')' }
follow( t' ) = { '+', $, ')' }
follow( t ) = { '+', $, ')' }
follow( f ) = { '*', '+', ')', $ }

Predictive parsing table
For each non-terminal n, a parse table T defines which production of n should be chosen, based on the next input symbol a. For example, part of the table for the left-factored grammar of Example 3 (rows are labelled by non-terminals, columns by terminals; each entry is a production, and - marks an empty cell):

      (           +           ...
e     e → ( e r   -
r     -           r → + e )
v     -           -
The table is constructed as follows:

for each production n → α:
  for each a ∊ first(α):
    add n → α to T[n, a]
  if ε ∊ first(α) then
    for each b ∊ follow(n):
      add n → α to T[n, b]

Exercise 9
Construct a predictive parsing table for the following grammar.
e → t e'
e' → + t e' | ε
t → f t'
t' → * f t' | ε
f → ( e ) | x | y

LL(1) grammars
If each cell in the parse table contains at most one entry, then a non-backtracking parser can be constructed and the grammar is said to be LL(1).
• First L: left-to-right scanning of the input.
• Second L: a leftmost derivation is constructed.
• The (1): using one input symbol of look-ahead to decide which grammar production to choose.

Exercise 10
Write a syntax checker for the grammar of Exercise 9, utilising the predictive parsing table.
int e() { ... }
It should return a non-zero value if some prefix of the string pointed to by next conforms to the grammar; otherwise it should return zero.

Answer (part 1)

int e() {
  if (*next == 'x') return t() && e1();
  if (*next == 'y') return t() && e1();
  if (*next == '(') return t() && e1();
  return 0;
}

int e1() {
  if (*next == '+') return eat('+') && t() && e1();
  if (*next == ')') return 1;
  if (*next == '\0') return 1;
  return 0;
}

Answer (part 2)

int t() {
  if (*next == 'x') return f() && t1();
  if (*next == 'y') return f() && t1();
  if (*next == '(') return f() && t1();
  return 0;
}

int t1() {
  if (*next == '+') return 1;
  if (*next == '*') return eat('*') && f() && t1();
  if (*next == ')') return 1;
  if (*next == '\0') return 1;
  return 0;
}

Answer (part 3)
int f() {
  if (*next == 'x') return eat('x');
  if (*next == 'y') return eat('y');
  if (*next == '(') return eat('(') && e() && eat(')');
  return 0;
}
(Notice how backtracking is not required.)

Predictive parsing algorithm
Let s be a stack, initially containing the start symbol of the grammar, and let next point to the input string.

while (top(s) != $)
  if (top(s) is a terminal) {
    if (top(s) == *next) { pop(s); next++; }
    else error();
  }
  else if (T[top(s), *next] == X → Y1 ⋯ Yn) {
    pop(s);
    push(s, Yn ⋯ Y1)  /* Y1 on top */
  }

Exercise 11
Give the steps that a predictive parser takes to parse the following input.
x + x * y
For each step (loop iteration), show the input stream, the stack, and the parser action.

Acknowledgements
Plus Stanford University lecture notes by Maggie Johnson and Julie Zelenski.

APPENDIX

Context-free grammars
Have four components:
1. A set of terminal symbols.
2. A set of non-terminal symbols.
3. A set of productions (or rules) of the form:

   n → X1 ⋯ Xn

   where n is a non-terminal and X1 ⋯ Xn is any sequence of terminals, non-terminals, and ε.

4. The start symbol (one of the non-terminals).

Notation
Non-terminals are underlined.
Rather than writing

e → x
e → e + e

we may write:

e → x | e + e

(Also, the symbols → and ::= will be used interchangeably.)

Why context-free?
Unrestricted ⊃ Context-Sensitive ⊃ Context-Free ⊃ Regular

Context-free grammars strike a nice balance between expressive power and efficiency of parsing.

Chomsky hierarchy
Let t range over terminals, x and z over non-terminals, and α, β, and γ over sequences of terminals, non-terminals, and ε.

Grammar             Valid productions
Unrestricted        α → β
Context-Sensitive   α x γ → α β γ
Context-Free        x → β
Regular             x → t, x → t z, x → ε

Backus-Naur Form
BNF is a standard ASCII notation for specification of context-free grammars whose terminals are ASCII characters. For example:
<v> ::= "x" | "y"
The BNF notation can itself be specified in BNF.