Introduction to Parsing
Ambiguity and Syntax Errors

Outline
• Regular languages revisited
• Parser overview
• Context-free grammars (CFG's)
• Derivations
• Ambiguity
• Syntax errors
Compiler Design 1 (2011) 2
Languages and Automata
• Formal languages are very important in CS
– Especially in programming languages
• Regular languages
– The weakest formal languages widely used
– Many applications
• We will also study context-free languages

Limitations of Regular Languages
Intuition: A finite automaton that runs long enough must repeat states
• A finite automaton cannot remember the number of times it has visited a particular state, because a finite automaton has finite memory
– Only enough to store in which state it is
– Cannot count, except up to a finite limit
• Many languages are not regular
• E.g., language of balanced parentheses is not regular: { (^i )^i | i ≥ 0 }
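The counting argument above can be made concrete with a minimal sketch. The function name `is_nested_parens` is made up for illustration. It recognizes { (^i )^i | i ≥ 0 } with an unbounded integer counter, which is exactly the kind of memory a finite automaton lacks.

```python
def is_nested_parens(s: str) -> bool:
    """Return True if s is i opening parentheses followed by i closing ones."""
    i = 0
    # Count the leading run of '('. The counter can grow without bound,
    # unlike the fixed set of states of a finite automaton.
    while i < len(s) and s[i] == "(":
        i += 1
    # The remainder must be exactly i closing parentheses.
    return s[i:] == ")" * i
```

For instance, `is_nested_parens("(())")` holds, while `is_nested_parens("(()")` does not.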
The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: parse tree of the program

Example
• If-then-else statement
    if (x == y) then z = 1; else z = 2;
• Parser input
    IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output

            IF-THEN-ELSE
           /      |      \
         ==       =       =
        /  \     / \     / \
      ID    ID ID  INT ID  INT
Comparison with Lexical Analysis

Phase     Input                     Output
Lexer     Sequence of characters    Sequence of tokens
Parser    Sequence of tokens        Parse tree

The Role of the Parser
• Not all sequences of tokens are programs . . .
• . . . Parser must distinguish between valid and invalid sequences of tokens
• We need
– A language for describing valid sequences of tokens
– A method for distinguishing valid from invalid sequences of tokens
Context-Free Grammars
• Many programming language constructs have a recursive structure
• A STMT is of the form
    if COND then STMT else STMT , or
    while COND do STMT , or
    …
• Context-free grammars are a natural notation for this recursive structure

CFGs (Cont.)
• A CFG consists of
– A set of terminals T
– A set of non-terminals N
– A start symbol S (a non-terminal)
– A set of productions
Assuming X ∈ N the productions are of the form
    X → ε , or
    X → Y1 Y2 ... Yn where Yi ∈ N ∪ T
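As a rough illustration of the four components of a CFG, here is one possible encoding. The class name `Grammar`, its field names, and the `check` method are assumptions made for this sketch and do not come from any parser tool. The empty tuple stands for an ε right-hand side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset
    nonterminals: frozenset
    start: str
    productions: tuple  # pairs (X, (Y1, ..., Yn)); () encodes X -> epsilon

    def check(self) -> bool:
        """Verify the side conditions: S is in N, T and N are disjoint,
        and every production has the form X -> Y1 ... Yn with Yi in N or T."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                lhs in self.nonterminals and all(y in symbols for y in rhs)
                for lhs, rhs in self.productions
            )
        )


# Balanced parentheses: S -> ( S ) | epsilon
PARENS = Grammar(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"S"}),
    start="S",
    productions=(("S", ("(", "S", ")")), ("S", ())),
)
```

`PARENS.check()` succeeds, whereas a grammar whose start symbol is a terminal would be rejected.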
Notational Conventions
• In these lecture notes
– Non-terminals are written upper-case
– Terminals are written lower-case
– The start symbol is the left-hand side of the first production

Examples of CFGs
A fragment of our example language (simplified):
    STMT → if COND then STMT else STMT
         | while COND do STMT
         | id = int
Examples of CFGs (cont.)
Grammar for simple arithmetic expressions:
    E → E * E
      | E + E
      | ( E )
      | id

The Language of a CFG
Read productions as replacement rules:
    X → Y1 ... Yn
        Means X can be replaced by Y1 ... Yn
    X → ε
        Means X can be erased (replaced with empty string)
Key Idea
(1) Begin with a string consisting of the start symbol "S"
(2) Replace any non-terminal X in the string by a right-hand side of some production X → Y1 ... Yn
(3) Repeat (2) until there are no non-terminals in the string

The Language of a CFG (Cont.)
More formally, we write
    X1 ... Xi ... Xn → X1 ... X(i-1) Y1 ... Ym X(i+1) ... Xn
if there is a production
    Xi → Y1 ... Ym
The Language of a CFG (Cont.)
Write
    X1 ... Xn →* Y1 ... Ym
if
    X1 ... Xn → ... → ... → Y1 ... Ym
in 0 or more steps

The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
    { a1 ... an | S →* a1 ... an and every ai is a terminal }
Terminals
• Terminals are called so because there are no rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the language

Examples
L(G) is the language of the CFG G
Strings of balanced parentheses { (^i )^i | i ≥ 0 }
Two grammars:
    S → ( S )
    S → ε
OR
    S → ( S ) | ε
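The definition of L(G) can be explored with a small sketch that repeatedly applies replacement steps and collects the terminal strings it reaches. The function name `language` and the dictionary encoding of productions are illustrative choices. The sketch assumes every right-hand side is either empty or contains at least one terminal, which is what guarantees that the bounded search stops.

```python
from collections import deque


def language(productions, start, max_len):
    """Terminal strings with at most max_len tokens derivable from start.

    productions maps each non-terminal to a list of right-hand sides
    (tuples of symbols); the empty tuple encodes epsilon.
    Assumption: every right-hand side is empty or contains a terminal.
    """
    seen = {(start,)}
    queue = deque(seen)
    strings = set()
    while queue:
        form = queue.popleft()
        # Pick a non-terminal to replace (here always the left-most one).
        index = next((i for i, sym in enumerate(form) if sym in productions), None)
        if index is None:
            # No non-terminals left: the form is a string of the language.
            strings.add(" ".join(form))
            continue
        for rhs in productions[form[index]]:
            new_form = form[:index] + tuple(rhs) + form[index + 1:]
            terminal_count = sum(sym not in productions for sym in new_form)
            if terminal_count <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return strings
```

With `{"S": [("(", "S", ")"), ()]}` and a bound of 4 tokens, the result contains the empty string, `( )` and `( ( ) )`.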
Example
A fragment of our example language (simplified):
    STMT → if COND then STMT
         | if COND then STMT else STMT
         | while COND do STMT
         | id = int
    COND → (id == id)
         | (id != id)

Example (Cont.)
Some elements of our language
    id = int
    if (id == id) then id = int else id = int
    while (id != id) do id = int
    while (id == id) do while (id != id) do id = int
    if (id != id) then if (id == id) then id = int else id = int
Arithmetic Example
Simple arithmetic expressions:
    E → E + E | E * E | ( E ) | id
Some elements of the language:
    id             id + id
    (id)           id * id
    (id) * id      id * (id)

Notes
The idea of a CFG is a big step. But:
• Membership in a language is just "yes" or "no"; we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFG's (e.g., yacc)
More Notes
• Form of the grammar is important
– Many grammars generate the same language
– Parsing tools are sensitive to the grammar
Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

Derivations and Parse Trees
A derivation is a sequence of productions
    S → ... → ... → ...
A derivation can be drawn as a tree
– Start symbol is the tree's root
– For a production X → Y1 ... Yn add children Y1 ... Yn to node X
Derivation Example
• Grammar
    E → E + E | E * E | ( E ) | id
• String
    id * id + id

Derivation Example (Cont.)
    E
    → E + E
    → E * E + E
    → id * E + E
    → id * id + E
    → id * id + id

Parse tree:

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id
Derivation in Detail (1)
    E
Tree so far: E

Derivation in Detail (2)
    E
    → E + E
Tree so far: E[ E + E ]

Derivation in Detail (3)
    E
    → E + E
    → E * E + E
Tree so far: E[ E[ E * E ] + E ]

Derivation in Detail (4)
    E
    → E + E
    → E * E + E
    → id * E + E
Tree so far: E[ E[ E[id] * E ] + E ]

Derivation in Detail (5)
    E
    → E + E
    → E * E + E
    → id * E + E
    → id * id + E
Tree so far: E[ E[ E[id] * E[id] ] + E ]

Derivation in Detail (6)
    E
    → E + E
    → E * E + E
    → id * E + E
    → id * id + E
    → id * id + id
Tree so far: E[ E[ E[id] * E[id] ] + E[id] ]
Notes on Derivations
• A parse tree has
– Terminals at the leaves
– Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations, the input string does not

Left-most and Right-most Derivations
• The example is a left-most derivation
– At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation
    E
    → E + E
    → E + id
    → E * E + id
    → E * id + id
    → id * id + id
Right-most Derivation in Detail (1)
    E
Tree so far: E

Right-most Derivation in Detail (2)
    E
    → E + E
Tree so far: E[ E + E ]

Right-most Derivation in Detail (3)
    E
    → E + E
    → E + id
Tree so far: E[ E + E[id] ]

Right-most Derivation in Detail (4)
    E
    → E + E
    → E + id
    → E * E + id
Tree so far: E[ E[ E * E ] + E[id] ]

Right-most Derivation in Detail (5)
    E
    → E + E
    → E + id
    → E * E + id
    → E * id + id
Tree so far: E[ E[ E * E[id] ] + E[id] ]

Right-most Derivation in Detail (6)
    E
    → E + E
    → E + id
    → E * E + id
    → E * id + id
    → id * id + id
Tree so far: E[ E[ E[id] * E[id] ] + E[id] ]
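The left-most and right-most derivations of id * id + id can be replayed mechanically. The sketch below is only an illustration: the helper names `leftmost_step`, `rightmost_step` and `derive` are invented here, and the caller supplies the sequence of right-hand sides to apply without any check that they belong to the grammar.

```python
NONTERMINALS = {"E"}


def leftmost_step(form, rhs):
    """Replace the left-most non-terminal in form by rhs."""
    for i, sym in enumerate(form):
        if sym in NONTERMINALS:
            return form[:i] + list(rhs) + form[i + 1:]
    raise ValueError("no non-terminal left to replace")


def rightmost_step(form, rhs):
    """Replace the right-most non-terminal in form by rhs."""
    for i in range(len(form) - 1, -1, -1):
        if form[i] in NONTERMINALS:
            return form[:i] + list(rhs) + form[i + 1:]
    raise ValueError("no non-terminal left to replace")


def derive(step, choices):
    """Apply the chosen right-hand sides in order, starting from E,
    and return every sentential form along the way."""
    form = ["E"]
    trace = [" ".join(form)]
    for rhs in choices:
        form = step(form, rhs)
        trace.append(" ".join(form))
    return trace


PLUS, TIMES, ID = ["E", "+", "E"], ["E", "*", "E"], ["id"]
LEFT = derive(leftmost_step, [PLUS, TIMES, ID, ID, ID])
RIGHT = derive(rightmost_step, [PLUS, ID, TIMES, ID, ID])
```

Both traces end in `id * id + id`. They differ only in the order in which the non-terminals are expanded, which is why they describe the same parse tree.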
Derivations and Parse Trees
• Note that right-most and left-most derivations have the same parse tree
• The difference is just in the order in which branches are added

Summary of Derivations
• We are not just interested in whether s ∈ L(G)
– We need a parse tree for s
• A derivation defines a parse tree
– But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation
Ambiguity
• Grammar
    E → E + E | E * E | ( E ) | int
• String
    int * int + int

Ambiguity (Cont.)
This string has two parse trees

            E                      E
          / | \                  / | \
         E  +  E                E  *  E
       / | \   |                |   / | \
      E  *  E  int             int E  +  E
      |     |                      |     |
     int   int                    int   int
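One way to see the ambiguity is to count parse trees directly. The following sketch, with the invented name `count_parse_trees`, counts the parse trees of a token list under E → E + E | E * E | ( E ) | int by trying every operator as the root of the tree. It is a brute-force illustration, not a parsing algorithm taken from these notes.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Number of parse trees for tokens under
    E -> E + E | E * E | ( E ) | int."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways to derive tokens[i:j] from E.
        n = 0
        if j - i == 1 and tokens[i] == "int":
            n += 1  # E -> int
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            n += count(i + 1, j - 1)  # E -> ( E )
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                # E -> E op E with the operator at position k as the root.
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(tokens))
```

`count_parse_trees("int * int + int".split())` yields 2, matching the two trees of the example, while a single `int` has exactly one tree.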
Ambiguity (Cont.)
• A grammar is ambiguous if it has more than one parse tree for some string
– Equivalently, there is more than one right-most or left-most derivation for some string
• Ambiguity is bad
– Leaves meaning of some programs ill-defined
• Ambiguity is common in programming languages
– Arithmetic expressions
– IF-THEN-ELSE

Dealing with Ambiguity
• There are several ways to handle ambiguity
• Most direct method is to rewrite grammar unambiguously
    E → T + E | T
    T → int * T | int | ( E )
• Enforces precedence of * over +
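To see how the rewritten grammar enforces precedence, here is a hand-written recursive-descent sketch with one function per non-terminal. The function name `parse_unambiguous` and the tuple encoding of trees are choices made for this illustration.

```python
def parse_unambiguous(tokens):
    """Parse tokens with  E -> T + E | T,  T -> int * T | int | ( E )
    and return a nested tuple (operator, left, right) or the leaf "int"."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def parse_e():
        left = parse_t()
        if peek() == "+":  # E -> T + E
            expect("+")
            return ("+", left, parse_e())
        return left  # E -> T

    def parse_t():
        if peek() == "int":
            expect("int")
            if peek() == "*":  # T -> int * T
                expect("*")
                return ("*", "int", parse_t())
            return "int"  # T -> int
        expect("(")  # T -> ( E )
        inner = parse_e()
        expect(")")
        return inner

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree
```

For `int * int + int` the result is `("+", ("*", "int", "int"), "int")`: the multiplication ends up below the addition because a T can never contain a bare `+`.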
Ambiguity: The Dangling Else
• Consider the following grammar
    S → if C then S
      | if C then S else S
      | OTHER
• This grammar is also ambiguous

The Dangling Else: Example
• The expression
    if C1 then if C2 then S3 else S4
  has two parse trees

          if                    if
        / |  \                 /  \
      C1  if  S4             C1    if
         /  \                    / |  \
       C2    S3                C2  S3  S4

• Typically we want the second form
The Dangling Else: A Fix
• else matches the closest unmatched then
• We can describe this in the grammar
    S   → MIF    /* all then are matched */
        | UIF    /* some then are unmatched */
    MIF → if C then MIF else MIF
        | OTHER
    UIF → if C then S
        | if C then MIF else UIF
• Describes the same set of strings

The Dangling Else: Example Revisited
• The expression if C1 then if C2 then S3 else S4

          if                    if
         /  \                 / |  \
       C1    if             C1  if  S4
           / |  \              /  \
         C2  S3  S4          C2    S3

• First tree: a valid parse tree (for a UIF)
• Second tree: not valid because the then expression is not a MIF
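The rule that else matches the closest unmatched then can also be obtained by a greedy hand-written parser. The sketch below does not implement the MIF/UIF grammar itself. It parses the original ambiguous grammar and always attaches an else to the innermost open if. The name `parse_stmt` and the token conventions (any non-keyword token counts as a condition or as OTHER) are assumptions of the sketch.

```python
KEYWORDS = {"if", "then", "else"}


def parse_stmt(tokens, pos=0):
    """Parse one statement starting at pos; return (tree, next_pos)."""

    def token(p):
        return tokens[p] if p < len(tokens) else None

    if token(pos) != "if":
        # S -> OTHER: any single non-keyword token.
        if token(pos) is None or token(pos) in KEYWORDS:
            raise SyntaxError(f"statement expected at position {pos}")
        return token(pos), pos + 1
    cond = token(pos + 1)
    if cond is None or cond in KEYWORDS or token(pos + 2) != "then":
        raise SyntaxError(f"malformed if at position {pos}")
    then_branch, pos = parse_stmt(tokens, pos + 3)
    # Greedy choice: an else directly after the then-branch belongs to
    # this if, which is the closest one that is still unmatched.
    if token(pos) == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("if", cond, then_branch), pos
```

On `if C1 then if C2 then S3 else S4` this produces `("if", "C1", ("if", "C2", "S3", "S4"))`, the tree in which the else belongs to the inner if.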
Ambiguity
• No general techniques for handling ambiguity
• Impossible to convert automatically an ambiguous grammar to an unambiguous one
• Used with care, ambiguity can simplify the grammar
– Sometimes allows more natural definitions
– We need disambiguation mechanisms

Precedence and Associativity Declarations
• Instead of rewriting the grammar
– Use the more natural (ambiguous) grammar
– Along with disambiguating declarations
• Most tools allow precedence and associativity declarations to disambiguate grammars
• Examples …
Associativity Declarations
• Consider the grammar E → E + E | int
• Ambiguous: two parse trees of int + int + int

            E                      E
          / | \                  / | \
         E  +  E                E  +  E
       / | \   |                |   / | \
      E  +  E  int             int E  +  E
      |     |                      |     |
     int   int                    int   int

• Left associativity declaration: %left +

Precedence Declarations
• Consider the grammar E → E + E | E * E | int
– And the string int + int * int

            E                      E
          / | \                  / | \
         E  *  E                E  +  E
       / | \   |                |   / | \
      E  +  E  int             int E  *  E
      |     |                      |     |
     int   int                    int   int

• Precedence declarations:
    %left +
    %left *
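The effect of such declarations can be imitated by a precedence-climbing parser driven by a small table. This is a hand-written stand-in and not how a parser generator processes `%left` internally. The names `DECLARATIONS`, `PREC`, `ASSOC` and `parse_expr` are invented for the sketch, and later table entries bind tighter, mirroring the order of the declarations.

```python
# Mirrors "%left +" followed by "%left *": later entries bind tighter.
DECLARATIONS = [("left", "+"), ("left", "*")]
PREC = {op: level for level, (_, op) in enumerate(DECLARATIONS, start=1)}
ASSOC = {op: assoc for assoc, op in DECLARATIONS}


def parse_expr(tokens, pos=0, min_prec=1):
    """Parse E -> E + E | E * E | int from pos; return (tree, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "int":
        raise SyntaxError(f"int expected at position {pos}")
    left, pos = "int", pos + 1
    while pos < len(tokens) and tokens[pos] in PREC and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # For a left-associative operator the right operand may only
        # contain operators that bind strictly tighter.
        next_min = PREC[op] + 1 if ASSOC[op] == "left" else PREC[op]
        right, pos = parse_expr(tokens, pos + 1, next_min)
        left = (op, left, right)
    return left, pos
```

With this table, `int + int + int` groups to the left and `int + int * int` places the multiplication below the addition. The caller is expected to check that the returned position equals the number of tokens.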
Error Handling
• Purpose of the compiler is
– To detect non-valid programs
– To translate the valid ones
• Many kinds of possible errors (e.g. in C)

Error kind     Example                  Detected by …
Lexical        … $ …                    Lexer
Syntax         … x *% …                 Parser
Semantic       … int x; y = x(3); …     Type checker
Correctness    your favorite program    Tester/User

Syntax Error Handling
• Error handler should
– Report errors accurately and clearly
– Recover from an error quickly
– Not slow down compilation of valid code
• Good error handling is not easy to achieve
Approaches to Syntax Error Recovery
• From simple to complex
– Panic mode
– Error productions
– Automatic local or global correction
• Not all are supported by all parser generators

Error Recovery: Panic Mode
• Simplest, most popular method
• When an error is detected:
– Discard tokens until one with a clear role is found
– Continue from there
• Such tokens are called synchronizing tokens
– Typically the statement or expression terminators
Syntax Error Recovery: Panic Mode (Cont.)
• Consider the erroneous expression
    (1 + + 2) + 3
• Panic-mode recovery:
– Skip ahead to next integer and then continue
• (ML)-Yacc: use the special terminal error to describe how much input to skip
    E → int | E + E | ( E ) | error int | ( error )

Syntax Error Recovery: Error Productions
• Idea: specify in the grammar known common mistakes
• Essentially promotes common errors to alternative syntax
• Example:
– Write 5 x instead of 5 * x
– Add the production E → … | E E
• Disadvantage
– Complicates the grammar
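Panic-mode recovery can be sketched on a toy language of assignment statements, using the statement terminator `;` as the synchronizing token. Everything here is an assumption made for illustration: the function name `parse_statements`, the statement shape `id = int ;`, and the choice to report only the position where each malformed statement starts.

```python
SYNC = ";"  # synchronizing token: the statement terminator


def parse_statements(tokens):
    """Parse a sequence of `id = int ;` statements.

    Returns (statements, errors): the start positions of the well-formed
    statements and the start positions of the malformed ones.
    """
    statements, errors = [], []
    pos = 0
    while pos < len(tokens):
        if tokens[pos:pos + 4] == ["id", "=", "int", SYNC]:
            statements.append(("assign", pos))
            pos += 4
            continue
        errors.append(pos)
        # Panic mode: discard tokens up to the next synchronizing token,
        # skip that token too, and continue with the following statement.
        while pos < len(tokens) and tokens[pos] != SYNC:
            pos += 1
        pos += 1
    return statements, errors
```

On `id = int ; id = = int ; id = int ;` the middle statement is reported once and the third statement is still parsed, so a single run can surface errors in later statements as well.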
Syntax Error Recovery: Past and Present
• Past
– Slow recompilation cycle (even once a day)
– Find as many errors in one cycle as possible
– Researchers could not let go of the topic
• Present
– Quick recompilation cycle
– Users tend to correct one error/cycle
– Complex error recovery is needed less
– Panic-mode seems enough