
Introduction to Parsing
Ambiguity and Syntax Errors

Compiler Design 1 (2011)

Outline
• Regular languages revisited
• Parser overview
• Context-free grammars (CFG’s)
• Derivations
• Ambiguity
• Syntax errors

Languages and Automata
• Formal languages are very important in CS
  – Especially in programming languages
• Regular languages
  – The weakest formal languages widely used
  – Many applications
• We will also study context-free languages

Limitations of Regular Languages
Intuition: A finite automaton that runs long enough must repeat states
• A finite automaton cannot remember # of times it has visited a particular state
• because a finite automaton has finite memory
  – Only enough to store in which state it is
  – Cannot count, except up to a finite limit
• Many languages are not regular
• E.g., language of balanced parentheses is not regular: { (^i )^i | i ≥ 0 }

The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: parse tree of the program

Example
• If-then-else statement
    if (x == y) then z = 1; else z = 2;
• Parser input
    IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output

              IF-THEN-ELSE
             /      |      \
           ==       =       =
          /  \     / \     / \
        ID    ID  ID INT  ID INT

Comparison with Lexical Analysis

    Phase    Input                    Output
    Lexer    Sequence of characters   Sequence of tokens
    Parser   Sequence of tokens       Parse tree

The Role of the Parser
• Not all sequences of tokens are programs ...
• ... Parser must distinguish between valid and invalid sequences of tokens
• We need
  – A language for describing valid sequences of tokens
  – A method for distinguishing valid from invalid sequences of tokens

Context-Free Grammars
• Many programming language constructs have a recursive structure
• A STMT is of the form
    if COND then STMT else STMT , or
    while COND do STMT , or
    …
• Context-free grammars are a natural notation for this recursive structure

CFGs (Cont.)
• A CFG consists of
  – A set of terminals T
  – A set of non-terminals N
  – A start symbol S (a non-terminal)
  – A set of productions
Assuming X ∈ N the productions are of the form
    X → ε , or
    X → Y1 Y2 ... Yn where Yi ∈ N ∪ T

Notational Conventions
• In these lecture notes
  – Non-terminals are written upper-case
  – Terminals are written lower-case
  – The start symbol is the left-hand side of the first production

Examples of CFGs
A fragment of our example language (simplified):
    STMT → if COND then STMT else STMT
         | while COND do STMT
         | id = int

Examples of CFGs (cont.)
Grammar for simple arithmetic expressions:
    E → E * E
      | E + E
      | ( E )
      | id

The Language of a CFG
Read productions as replacement rules:
    X → Y1 ... Yn
      Means X can be replaced by Y1 ... Yn
    X → ε
      Means X can be erased (replaced with empty string)

Key Idea
(1) Begin with a string consisting of the start symbol “S”
(2) Replace any non-terminal X in the string by a right-hand side of some production
    X → Y1 ... Yn
(3) Repeat (2) until there are no non-terminals in the string

The Language of a CFG (Cont.)
More formally, we write
    X1 ... Xi ... Xn → X1 ... Xi−1 Y1 ... Ym Xi+1 ... Xn
if there is a production
    Xi → Y1 ... Ym

The Language of a CFG (Cont.)
Write
    X1 ... Xn →* Y1 ... Ym
if
    X1 ... Xn → ... → ... → Y1 ... Ym
in 0 or more steps

The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
    { a1 ... an | S →* a1 ... an and every ai is a terminal }

Terminals
• Terminals are called so because there are no rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the language

Examples
L(G) is the language of the CFG G
Strings of balanced parentheses { (^i )^i | i ≥ 0 }
Two grammars:
    S → ( S )
    S → ε
  OR
    S → ( S ) | ε

Example
A fragment of our example language (simplified):
    STMT → if COND then STMT
         | if COND then STMT else STMT
         | while COND do STMT
         | id = int
    COND → (id == id)
         | (id != id)

Example (Cont.)
Some elements of our language
    id = int
    if (id == id) then id = int else id = int
    while (id != id) do id = int
    while (id == id) do while (id != id) do id = int
    if (id != id) then if (id == id) then id = int else id = int

Arithmetic Example
Simple arithmetic expressions:
    E → E + E | E * E | (E) | id
Some elements of the language:
    id           id + id
    (id)         id * id
    (id) * id    id * (id)

Notes
The idea of a CFG is a big step. But:
• Membership in a language is just “yes” or “no”; we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFG’s (e.g., yacc)

More Notes
• Form of the grammar is important
  – Many grammars generate the same language
  – Parsing tools are sensitive to the grammar
Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

Derivations and Parse Trees
A derivation is a sequence of productions
    S → ... → ... → ...
A derivation can be drawn as a tree
  – Start symbol is the tree’s root
  – For a production X → Y1 ... Yn add children Y1 ... Yn to node X

Derivation Example
• Grammar
    E → E + E | E * E | (E) | id
• String
    id * id + id

Derivation Example (Cont.)
    E
    → E + E
    → E * E + E
    → id * E + E
    → id * id + E
    → id * id + id

              E
            / | \
           E  +  E
         / | \   |
        E  *  E  id
        |     |
        id    id

Derivation in Detail (1)
    E

        E

Derivation in Detail (2)
    E
    → E + E

          E
        / | \
       E  +  E

Derivation in Detail (3)
    E
    → E + E
    → E * E + E

              E
            / | \
           E  +  E
         / | \
        E  *  E

Derivation in Detail (4)
    E
    → E + E
    → E * E + E
    → id * E + E

              E
            / | \
           E  +  E
         / | \
        E  *  E
        |
        id

Derivation in Detail (5)
    E
    → E + E
    → E * E + E
    → id * E + E
    → id * id + E

              E
            / | \
           E  +  E
         / | \
        E  *  E
        |     |
        id    id

Derivation in Detail (6)
    E
    → E + E
    → E * E + E
    → id * E + E
    → id * id + E
    → id * id + id

              E
            / | \
           E  +  E
         / | \   |
        E  *  E  id
        |     |
        id    id

Notes on Derivations
• A parse tree has
  – Terminals at the leaves
  – Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations, the input string does not

Left-most and Right-most Derivations
• The example is a left-most derivation
  – At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation
    E
    → E + E
    → E + id
    → E * E + id
    → E * id + id
    → id * id + id

Right-most Derivation in Detail (1)
    E

        E

Right-most Derivation in Detail (2)
    E
    → E + E

          E
        / | \
       E  +  E

Right-most Derivation in Detail (3)
    E
    → E + E
    → E + id

          E
        / | \
       E  +  E
             |
             id

Right-most Derivation in Detail (4)
    E
    → E + E
    → E + id
    → E * E + id

              E
            / | \
           E  +  E
         / | \   |
        E  *  E  id

Right-most Derivation in Detail (5)
    E
    → E + E
    → E + id
    → E * E + id
    → E * id + id

              E
            / | \
           E  +  E
         / | \   |
        E  *  E  id
              |
              id

Right-most Derivation in Detail (6)
    E
    → E + E
    → E + id
    → E * E + id
    → E * id + id
    → id * id + id

              E
            / | \
           E  +  E
         / | \   |
        E  *  E  id
        |     |
        id    id

Derivations and Parse Trees
• Note that right-most and left-most derivations have the same parse tree
• The difference is just in the order in which branches are added

Summary of Derivations
• We are not just interested in whether s ∈ L(G)
  – We need a parse tree for s
• A derivation defines a parse tree
  – But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation

Ambiguity
• Grammar
    E → E + E | E * E | ( E ) | int
• String
    int * int + int

Ambiguity (Cont.)
This string has two parse trees

              E                        E
            / | \                    / | \
           E  +  E                  E  *  E
         / | \   |                  |   / | \
        E  *  E  int               int E  +  E
        |     |                        |     |
       int   int                      int   int

Ambiguity (Cont.)
• A grammar is ambiguous if it has more than one parse tree for some string
  – Equivalently, there is more than one right-most or left-most derivation for some string
• Ambiguity is bad
  – Leaves meaning of some programs ill-defined
• Ambiguity is common in programming languages
  – Arithmetic expressions
  – IF-THEN-ELSE

Dealing with Ambiguity
• There are several ways to handle ambiguity
• Most direct method is to rewrite grammar unambiguously
    E → T + E | T
    T → int * T | int | ( E )
• Enforces precedence of * over +

Ambiguity: The Dangling Else
• Consider the following grammar
    S → if C then S
      | if C then S else S
      | OTHER
• This grammar is also ambiguous

The Dangling Else: Example
• The expression
    if C1 then if C2 then S3 else S4
  has two parse trees

            if                      if
          / | \                    /  \
        C1  if  S4               C1    if
           /  \                      / | \
         C2    S3                  C2  S3  S4

• Typically we want the second form

The Dangling Else: A Fix
• else matches the closest unmatched then
• We can describe this in the grammar
    S → MIF      /* all then are matched */
      | UIF      /* some then are unmatched */
    MIF → if C then MIF else MIF
        | OTHER
    UIF → if C then S
        | if C then MIF else UIF
• Describes the same set of strings

The Dangling Else: Example Revisited
• The expression if C1 then if C2 then S3 else S4

            if                      if
           /  \                   / | \
         C1    if               C1  if  S4
             / | \                 /  \
           C2  S3  S4            C2    S3

• A valid parse tree (for a UIF)      • Not valid because the then expression is not a MIF
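
The rule that else matches the closest unmatched then can also be tried out in code. The following is a minimal Python sketch, not part of the original slides: it is a hand-written recursive-descent parser for the dangling-else grammar that resolves the ambiguity greedily, which yields the same tree the MIF/UIF grammar accepts. The names `parse`, `parse_stmt` and `KEYWORDS`, the whitespace-separated token format, and the tuple shape of the tree are all assumptions made for illustration; conditions and OTHER statements are treated as single opaque tokens.

```python
# Illustrative sketch only: names, token format and tree shape are assumptions.
KEYWORDS = {"if", "then", "else"}


def parse_stmt(tokens, pos):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok != "if":
        if tok in KEYWORDS:
            raise SyntaxError(f"unexpected {tok!r} at token {pos}")
        return tok, pos + 1  # OTHER: any single non-keyword token

    # "if" must be followed by a condition token and the keyword "then".
    if (pos + 2 >= len(tokens)
            or tokens[pos + 1] in KEYWORDS
            or tokens[pos + 2] != "then"):
        raise SyntaxError(f"malformed if at token {pos}")
    cond = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 3)

    # Greedy step: the innermost "if" still being parsed sees the "else"
    # first, so the else attaches to the closest unmatched then.
    else_branch = None
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ("if", cond, then_branch, else_branch), pos


def parse(text):
    """Parse a whole whitespace-separated token string into a nested tuple."""
    tokens = text.split()
    tree, pos = parse_stmt(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r} at token {pos}")
    return tree


if __name__ == "__main__":
    print(parse("if C1 then if C2 then S3 else S4"))
    # prints ('if', 'C1', ('if', 'C2', 'S3', 'S4'), None)
```

In the printed tree the else branch S4 belongs to the inner if and the outer if has no else branch (None), which is the parse marked valid for a UIF. The other tree, with S4 attached to the outer if, can never be produced by this parser, because the inner call always consumes a following else before returning.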