Introduction to Parsing: Ambiguity and Syntax Errors
Outline
• Regular languages revisited
• Parser overview
• Context-free grammars (CFGs)
• Derivations
• Ambiguity
• Syntax errors

Languages and Automata
• Formal languages are very important in CS
  – Especially in programming languages and compilers
• Regular languages
  – The weakest formal languages widely used
  – Many applications
• We will also study context-free languages

Limitations of Regular Languages
Intuition: A finite automaton that runs long enough must repeat states
• A finite automaton cannot remember the number of times it has visited a particular state
• because a finite automaton has finite memory
  – Only enough to store in which state it is
  – Cannot count, except up to a finite limit
• Many languages are not regular
• E.g., the language of balanced parentheses is not regular: { (^i )^i | i ≥ 0 }

The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: parse tree of the program

Example
• If-then-else statement
    if (x == y) then z = 1; else z = 2;
• Parser input
    IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output
    IF-THEN-ELSE
      ==
        ID
        ID
      =
        ID
        INT
      =
        ID
        INT

Comparison with Lexical Analysis
    Phase    Input                    Output
    Lexer    Sequence of characters   Sequence of tokens
    Parser   Sequence of tokens       Parse tree

The Role of the Parser
• Not all sequences of tokens are programs ...
• Parser must distinguish between valid and invalid sequences of tokens
• We need
  – A language for describing valid sequences of tokens
  – A method for distinguishing valid from invalid sequences of tokens

Context-Free Grammars
• Many programming language constructs have a recursive structure
• E.g. A STMT is of the form
    if COND then STMT else STMT , or
    while COND do STMT , or
    …
• Context-free grammars are a natural notation for this recursive structure

CFGs (Cont.)
A CFG consists of
  – A set of terminals T
  – A set of non-terminals N
  – A start symbol S (a non-terminal)
  – A set of productions
Assuming X ∈ N the productions are of the form
    X → ε , or
    X → Y_1 Y_2 ... Y_n where Y_i ∈ N ∪ T

Notational Conventions
• In these lecture notes
  – Non-terminals are written upper-case
  – Terminals are written lower-case
  – The start symbol is the left-hand side of the first production

Examples of CFGs
A fragment of an example language (simplified):
    STMT → if COND then STMT else STMT
         | while COND do STMT
         | id = int

Examples of CFGs (cont.)
Grammar for simple arithmetic expressions:
    E → E * E
      | E + E
      | ( E )
      | id

The Language of a CFG
Read productions as replacement rules:
    X → Y_1 ... Y_n
      Means X can be replaced by Y_1 ... Y_n (in this order)
    X → ε
      Means X can be erased (replaced with empty string)

Key Idea
(1) Begin with a string consisting of the start symbol “S”
(2) Replace any non-terminal X in the string by a right-hand side of some production X → Y_1 ... Y_n
(3) Repeat (2) until there are no non-terminals in the string

The Language of a CFG (Cont.)
More formally, we write
    X_1 ... X_i ... X_n → X_1 ... X_{i-1} Y_1 ... Y_m X_{i+1} ... X_n
if there is a production
    X_i → Y_1 ... Y_m

The Language of a CFG (Cont.)
Write
    X_1 ... X_n →* Y_1 ... Y_m
if
    X_1 ... X_n → ... → ... → Y_1 ... Y_m
in 0 or more steps

The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
    { a_1 ... a_n | S →* a_1 ... a_n and every a_i is a terminal }

Terminals
• Terminals are called so because there are no rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the language

Examples
L(G) is the language of the CFG G
Strings of balanced parentheses { (^i )^i | i ≥ 0 }
Two equivalent ways of writing the grammar G:
    S → ( S )
    S → ε
or
    S → ( S ) | ε
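The three steps of the Key Idea can be tried out on the balanced-parentheses grammar. The following is a minimal sketch, not part of the lecture: the names GRAMMAR and language_up_to, the dictionary encoding of productions, and the length bound are all assumptions made for illustration. It always replaces the left-most non-terminal and stops exploring a string once it holds more than max_len terminals, which is safe because terminals are permanent; it assumes every recursive production adds at least one terminal, otherwise the search would not terminate.

```python
from collections import deque

# Grammar for balanced parentheses from the slides: S -> ( S ) | epsilon.
# Non-terminals are the dict keys; every other symbol is a terminal.
# (This encoding is an illustrative assumption, not the lecture's notation.)
GRAMMAR = {"S": [["(", "S", ")"], []]}


def language_up_to(grammar, start, max_len):
    """Return all terminal strings with at most max_len symbols derivable from start."""
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])          # step (1): begin with the start symbol
    while queue:
        form = queue.popleft()
        # Step (2): locate the left-most non-terminal in the current string.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            # Step (3) is finished: no non-terminals remain.
            results.add("".join(form))
            continue
        for rhs in grammar[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            # Terminals are permanent, so a string with too many of them
            # can never lead to a result within the bound.
            terminals = sum(1 for sym in new_form if sym not in grammar)
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=lambda s: (len(s), s))


print(language_up_to(GRAMMAR, "S", 6))
# prints ['', '()', '(())', '((()))']
```

The output lists the strings (^i )^i for i = 0..3, which are exactly the members of L(G) with at most six symbols.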
Example
A fragment of our example language (simplified):
    STMT → if COND then STMT
         | if COND then STMT else STMT
         | while COND do STMT
         | id = int
    COND → (id == id)
         | (id != id)

Example (Cont.)
Some elements of the language:
    id = int
    if (id == id) then id = int else id = int
    while (id != id) do id = int
    while (id == id) do while (id != id) do id = int
    if (id != id) then if (id == id) then id = int else id = int

Arithmetic Example
Simple arithmetic expressions:
    E → E + E | E * E | (E) | id
Some elements of the language:
    id            id + id
    (id)          id * id
    (id) * id     id * (id)

Notes
The idea of a CFG is a big step. But:
• Membership in a language is just “yes” or “no”; we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFGs
  – e.g., yacc/bison/ML-yacc/...

More Notes
• Form of the grammar is important
  – Many grammars generate the same language
  – Parsing tools are sensitive to the grammar
Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

Derivations and Parse Trees
A derivation is a sequence of productions
    S → ... → ... → ...
A derivation can be drawn as a tree
  – Start symbol is the tree’s root
  – For a production X → Y_1 ... Y_n add children Y_1 ... Y_n to node X
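The rule for drawing a derivation as a tree (root is the start symbol; applying X → Y_1 ... Y_n adds children Y_1 ... Y_n to node X) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the Node class, build_tree, and show are hypothetical names, and the productions are assumed to be listed in the order in which they would be applied to the left-most unexpanded non-terminal.

```python
class Node:
    """A parse-tree node: a grammar symbol plus its ordered children."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.children = []
        self.expanded = False  # True once a production has been applied here


def leftmost_unexpanded(node, nonterminals):
    """Return the left-most non-terminal node that has not been expanded yet."""
    if node.symbol in nonterminals and not node.expanded:
        return node
    for child in node.children:
        found = leftmost_unexpanded(child, nonterminals)
        if found is not None:
            return found
    return None


def build_tree(start, productions, nonterminals):
    """Apply (lhs, rhs) productions in order, each to the left-most open node."""
    root = Node(start)  # the start symbol is the tree's root
    for lhs, rhs in productions:
        node = leftmost_unexpanded(root, nonterminals)
        if node is None or node.symbol != lhs:
            raise ValueError(f"cannot apply {lhs} -> {' '.join(rhs)}")
        # For a production X -> Y_1 ... Y_n, add children Y_1 ... Y_n to X.
        node.children = [Node(sym) for sym in rhs]
        node.expanded = True
    return root


def show(node):
    """Render the tree as a nested bracket string."""
    if not node.children:
        return node.symbol
    return node.symbol + "[" + " ".join(show(c) for c in node.children) + "]"


# Derivation of "id * id" in the arithmetic grammar: E -> E * E, E -> id, E -> id.
steps = [("E", ["E", "*", "E"]), ("E", ["id"]), ("E", ["id"])]
print(show(build_tree("E", steps, {"E"})))
# prints E[E[id] * E[id]]
```

Reading the leaves of the printed tree from left to right gives back the derived string id * id.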
Derivation Example
• Grammar
    E → E+E | E*E | (E) | id
• String
    id * id + id

Derivation Example (Cont.)
    E
    → E + E
    → E * E + E
    → id * E + E
    → id * id + E
    → id * id + id
The corresponding parse tree:
    E
      E
        E
          id
        *
        E
          id
      +
      E
        id

Derivation in Detail (1) to (6)
    E → E+E | E*E | (E) | id
The six slides replay the derivation one step at a time; the parse tree grows as each production is applied:
    (1) E                  the tree is the single root E
    (2) → E + E            the root gets children E, +, E
    (3) → E * E + E        the left child E gets children E, *, E
    (4) → id * E + E       the first operand of * gets child id
    (5) → id * id + E      the second operand of * gets child id
    (6) → id * id + id     the right operand of + gets child id

Notes on Derivations
• A parse tree has
  – Terminals at the leaves
  – Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations; the input string does not!

Left-most and Right-most Derivations
    E → E+E | E*E | (E) | id
• What was shown before was a left-most derivation
  – At each step, we replaced the left-most non-terminal
• There is an equivalent notion of a right-most derivation
  – Shown here:
    E
    → E + E
    → E + id
    → E * E + id
    → E * id + id
    → id * id + id

Right-most Derivation in Detail (1) to (6)
    E → E+E | E*E | (E) | id
Again one step per slide, with the parse tree growing as each production is applied:
    (1) E                  the tree is the single root E
    (2) → E + E            the root gets children E, +, E
    (3) → E + id           the right operand of + gets child id
    (4) → E * E + id       the left child E gets children E, *, E
    (5) → E * id + id      the second operand of * gets child id
    (6) → id * id + id     the first operand of * gets child id
The finished tree is the same as for the left-most derivation.

Derivations and Parse Trees
• Note that:
  – the right-most and left-most derivations above have the same parse tree
  – for each parse tree, there is a right-most and a left-most derivation
• The difference is just in the order in which branches are added

Summary of Derivations
• We are not just interested in whether s ∈ L(G)
  – We need a parse tree for s
• A derivation defines a parse tree
  – But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation
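The claim that one parse tree has both a left-most and a right-most derivation, differing only in the order in which branches are added, can be checked mechanically. The sketch below is an assumption-laden illustration (the nested-tuple tree encoding and the names TREE, derivation, and render are hypothetical): it reads both derivations of id * id + id off a single parse tree by always expanding the left-most or the right-most remaining non-terminal.

```python
# Parse tree for id * id + id, encoded as nested (symbol, children) tuples.
# Terminals are plain strings. This encoding is an illustrative assumption.
TREE = ("E", [
    ("E", [("E", ["id"]), "*", ("E", ["id"])]),
    "+",
    ("E", ["id"]),
])


def render(form):
    """Show a sentential form: subtrees print as their root symbol."""
    return " ".join(item[0] if isinstance(item, tuple) else item for item in form)


def derivation(tree, rightmost=False):
    """List the sentential forms produced by expanding the tree's nodes,
    always choosing the left-most (or right-most) non-terminal."""
    form = [tree]  # items are subtrees (non-terminals) or terminal strings
    steps = [render(form)]
    while True:
        positions = [i for i, item in enumerate(form) if isinstance(item, tuple)]
        if not positions:
            return steps  # no non-terminals left
        i = positions[-1] if rightmost else positions[0]
        _, children = form[i]
        form[i:i + 1] = children  # replace the node by its children
        steps.append(render(form))


print(derivation(TREE))
# prints ['E', 'E + E', 'E * E + E', 'id * E + E', 'id * id + E', 'id * id + id']
print(derivation(TREE, rightmost=True))
# prints ['E', 'E + E', 'E + id', 'E * E + id', 'E * id + id', 'id * id + id']
```

Both lists start from the same tree and end in the same string; only the order of the intermediate steps differs, matching the two derivations shown on the slides.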
Ambiguity
• Grammar:
    E → E + E | E * E | ( E ) | int
• The string int * int + int has two parse trees

    Tree with + at the root:        Tree with * at the root:
    E                               E
      E                               E
        E                               int
          int                         *
        *                             E
        E                               E
          int                             int
      +                                 +
      E                                 E
        int                               int

Ambiguity (Cont.)
• A grammar is ambiguous if it has more than one parse tree for some string
  – Equivalently, if there is more than one right-most or left-most derivation for some string
• Ambiguity is bad
  – Leaves meaning of some programs ill-defined
• Ambiguity is common in programming languages
  – Arithmetic expressions
  – IF-THEN-ELSE

Dealing with Ambiguity
• There are several ways to handle ambiguity
• Most direct method is to rewrite the grammar unambiguously
    E → T + E | T
    T → int * T | int | ( E )
• This grammar enforces precedence of * over +

Ambiguity: The Dangling Else
• Consider the following grammar
    S → if C then S
      | if C then S else S
      | OTHER
• This grammar is also ambiguous

The Dangling Else: Example
• The expression
    if C1 then if C2 then S3 else S4
  has two parse trees

    else attached to the outer if:    else attached to the inner if:
    if                                if
      C1                                C1
      if                                if
        C2                                C2
        S3                                S3
      S4                                  S4

• Typically we want the second form, in which the else matches the closest then

The Dangling Else: A Fix
• else should match the closest unmatched then
• We can describe this in the grammar
    S → MIF       /* all then are matched */
      | UIF       /* some then are unmatched */
    MIF → if C then MIF else MIF
        | OTHER
    UIF → if C then S
        | if C then MIF else UIF
• Describes the same set of strings

The Dangling Else: Example Revisited
• The expression if C1 then if C2 then S3 else S4

    else attached to the inner if:    else attached to the outer if:
    if                                if
      C1                                C1
      if                                if
        C2                                C2
        S3                                S3
        S4                              S4

• The tree with the else attached to the inner if is a valid parse tree (for a UIF)
• The tree with the else attached to the outer if is not valid because the then expression is not a MIF

Ambiguity
• No general techniques for handling ambiguity
• Impossible to convert automatically an ambiguous grammar to an unambiguous one
• Used with care, ambiguity can simplify the grammar
  – Sometimes allows more natural definitions
  – However, we need disambiguation mechanisms

Precedence and Associativity Declarations
• Instead of rewriting the grammar
  – Use the more natural (ambiguous) grammar
  – Along with disambiguating declarations
• Most tools allow precedence and associativity declarations to disambiguate grammars
• Examples …

Associativity Declarations
• Consider the grammar E → E + E | int
• Ambiguous: two parse trees of int + int + int

    Left-associative tree:            Right-associative tree:
    E                                 E
      E                                 E
        E                                 int
          int                           +
        +                               E
        E                                 E
          int                               int
      +                                   +
      E                                   E
        int                                 int

• Left associativity declaration: %left +

Precedence Declarations
• Consider the grammar E → E + E | E * E | int
  and the string int + int * int

Error Handling
• Purpose of the compiler is
  – To detect non-valid programs
  – To translate the valid ones
• Many kinds of possible errors (e.g.