Outline Introduction to Parsing • Regular languages revisited

Ambiguity and Syntax Errors • Parser overview

• Context-free grammars (CFG’s)

• Derivations

• Ambiguity

• Syntax errors

Compiler Design 1 (2011) 2

Languages and Automata Limitations of Regular Languages

• Formal languages are very important in CS Intuition: A finite automaton that runs long – Especially in programming languages enough must repeat states • A finite automaton cannot remember # of • Regular languages times it has visited a particular state – The weakest formal languages widely used • because a finite automaton has finite memory – Many applications – Only enough to store in which state it is – Cannot count, except up to a finite limit

• We will also study context-free languages • Many languages are not regular • E.g., language of balanced parentheses is not regular: { (i )i | i ≥ 0}

Compiler Design 1 (2011) 3 Compiler Design 1 (2011) 4 The Functionality of the Parser Example

• Input: sequence of tokens from lexer • If-then-else statement if (x == y) then z =1; else z = 2; • Output: of the program • Parser input IF (ID == ID) THEN ID = INT; ELSE ID = INT; • Possible parser output

IF-THEN-ELSE

== = =

ID ID ID INT ID INT

Compiler Design 1 (2011) 5 Compiler Design 1 (2011) 6

Comparison with Lexical Analysis The Role of the Parser

• Not all sequences of tokens are programs . . . • . . . Parser must distinguish between valid and Phase Input Output invalid sequences of tokens

Lexer Sequence of Sequence of characters tokens • We need – A language for describing valid sequences of tokens Parser Sequence of Parse tree – A method for distinguishing valid from invalid tokens sequences of tokens

Compiler Design 1 (2011) 7 Compiler Design 1 (2011) 8 Context-Free Grammars CFGs (Cont.)

• Many constructs have a • A CFG consists of recursive structure – A set of terminals T – A set of non-terminals N • A STMT is of the form – A start symbol S (a non-terminal) if COND then STMT else STMT , or – A set of productions while COND do STMT , or … Assuming X ∈ N the productions are of the form • Context-free grammars are a natural notation X →ε , or

for this recursive structure X → Y1 Y2 ... Yn where Yi ∈ N ∪T

Compiler Design 1 (2011) 9 Compiler Design 1 (2011) 10

Notational Conventions Examples of CFGs

• In these lecture notes A fragment of our example language (simplified): – Non-terminals are written upper-case

– Terminals are written lower-case STMT → if COND then STMT else STMT

– The start symbol is the left-hand side of the first ⏐ production while COND do STMT ⏐ id = int

Compiler Design 1 (2011) 11 Compiler Design 1 (2011) 12 Examples of CFGs (cont.) The Language of a CFG

Grammar for simple arithmetic expressions: Read productions as replacement rules:

→ E → E * E X Y1 ... Yn

Means X can be replaced by Y1 ... Yn ⏐ E + E X →ε ⏐ ( E ) Means X can be erased (replaced with empty string)

⏐ id

Compiler Design 1 (2011) 13 Compiler Design 1 (2011) 14

Key Idea The Language of a CFG (Cont.)

(1) Begin with a string consisting of the start More formally, we write symbol “S” X XX→ XXYYXX (2) Replace any non-terminal X in the string by 11LLin L i−+11 min L1 L a right-hand side of some production if there is a production

XYY→ 1 n L XYYim→ 1 L (3) Repeat (2) until there are no non-terminals in the string

Compiler Design 1 (2011) 15 Compiler Design 1 (2011) 16 The Language of a CFG (Cont.) The Language of a CFG

Write Let G be a context-free grammar with start ∗ symbol S. Then the language of G is: XXYY11LLnm→ if ∗ {aaSaa11KKnn|→ and every a i is a terminal} XX11LLLLnm→→→ YY

in 0 or more steps

Compiler Design 1 (2011) 17 Compiler Design 1 (2011) 18

Terminals Examples

• Terminals are called so because there are no L(G) is the language of the CFG G rules for replacing them ii Strings of balanced parentheses {() |i ≥ 0} • Once generated, terminals are permanent

Two grammars: • Terminals ought to be tokens of the language SS→ () SS→ () OR S → ε | ε

Compiler Design 1 (2011) 19 Compiler Design 1 (2011) 20 Example Example (Cont.)

A fragment of our example language (simplified): Some elements of the our language

STMT → if COND then STMT id = int ⏐ if COND then STMT else STMT if (id == id) then id = int else id = int ⏐ while COND do STMT while (id != id) do id = int ⏐ id = int while (id == id) do while (id != id) do id = int COND → (id == id) if (id != id) then if (id == id) then id = int else id = int ⏐ (id != id)

Compiler Design 1 (2011) 21 Compiler Design 1 (2011) 22

Arithmetic Example Notes

Simple arithmetic expressions: The idea of a CFG is a big step. E→∗ E+E | E E | (E) | id But:

Some elements of the language: • Membership in a language is just “yes” or “no”;

id id + id we also need the parse tree of the input (id) id ∗ id • Must handle errors gracefully

(id) ∗∗ id id (id) • Need an implementation of CFG’s (e.g., yacc)

Compiler Design 1 (2011) 23 Compiler Design 1 (2011) 24 More Notes Derivations and Parse Trees

• Form of the grammar is important A derivation is a sequence of productions – Many grammars generate the same language S →→→ – Parsing tools are sensitive to the grammar LLL

A derivation can be drawn as a tree Note: Tools for regular languages (e.g., lex/ML-Lex) – Start symbol is the tree’s root are also sensitive to the form of the regular expression, but this is rarely a problem in practice – For a production XYY → 1 L n add children YY 1 L n to node X

Compiler Design 1 (2011) 25 Compiler Design 1 (2011) 26

Derivation Example Derivation Example (Cont.)

• Grammar

E→∗ E+E | E E | (E) | id E E → E+E • String E + E

id ∗ id + id →∗EE+E →∗id E + E E * E id →∗id id + E id id →∗id id + id

Compiler Design 1 (2011) 27 Compiler Design 1 (2011) 28 Derivation in Detail (1) Derivation in Detail (2)

E E

E + E E E → E+E

Compiler Design 1 (2011) 29 Compiler Design 1 (2011) 30

Derivation in Detail (3) Derivation in Detail (4)

E E E E E + E E + E → E+E → E+E E * E →∗EE+E E * E →∗E+EE →∗id E + E id

Compiler Design 1 (2011) 31 Compiler Design 1 (2011) 32 Derivation in Detail (5) Derivation in Detail (6)

E E E E → E+E → E+E E + E E + E →∗EE+E →∗EE+E E * E →∗id E + E E * E id →∗id E + E → id∗ id + E →∗id id + E id id id id →∗id id + id

Compiler Design 1 (2011) 33 Compiler Design 1 (2011) 34

Notes on Derivations Left-most and Right-most Derivations

• A parse tree has • The example is a left-most – Terminals at the leaves derivation – At each step, replace the E – Non-terminals at the interior nodes left-most non-terminal → E+E • An in-order traversal of the leaves is the • There is an equivalent → E+id original input notion of a right-most derivation →∗EE + id • The parse tree shows the association of →∗Eid + id operations, the input string does not →∗id id + id

Compiler Design 1 (2011) 35 Compiler Design 1 (2011) 36 Right-most Derivation in Detail (1) Right-most Derivation in Detail (2)

E E

E + E E E → E+E

Compiler Design 1 (2011) 37 Compiler Design 1 (2011) 38

Right-most Derivation in Detail (3) Right-most Derivation in Detail (4)

E E E E E + E E + E → E+E → E+E id → E+id E * E id → E+id → EE∗ + id

Compiler Design 1 (2011) 39 Compiler Design 1 (2011) 40 Right-most Derivation in Detail (5) Right-most Derivation in Detail (6)

E E E E → E+E → E+E E + E E + E → E+id → E+id E * E id →∗EE + id E * E id → EE∗ + id →∗Eid + id → E ∗id + id id id id →∗id id + id

Compiler Design 1 (2011) 41 Compiler Design 1 (2011) 42

Derivations and Parse Trees Summary of Derivations

• Note that right-most and left-most • We are not just interested in whether derivations have the same parse tree s ∈ L(G) – We need a parse tree for s • The difference is just in the order in which branches are added • A derivation defines a parse tree – But one parse tree may have many derivations

• Left-most and right-most derivations are important in parser implementation

Compiler Design 1 (2011) 43 Compiler Design 1 (2011) 44 Ambiguity Ambiguity (Cont.)

• Grammar This string has two parse trees E → E + E | E * E | ( E ) | int E E • String E + E E * E int * int + int

E * E int int E + E int int int int

Compiler Design 1 (2011) 45 Compiler Design 1 (2011) 46

Ambiguity (Cont.) Dealing with Ambiguity

• A grammar is ambiguous if it has more than • There are several ways to handle ambiguity one parse tree for some string – Equivalently, there is more than one right-most or • Most direct method is to rewrite grammar left-most derivation for some string unambiguously • Ambiguity is bad

– Leaves meaning of some programs ill-defined → E T + E | T • Ambiguity is common in programming languages T → int* T | int| ( E ) – Arithmetic expressions – IF-THEN-ELSE • Enforces precedence of * over +

Compiler Design 1 (2011) 47 Compiler Design 1 (2011) 48 Ambiguity: The Dangling Else The Dangling Else: Example

• Consider the following grammar • The expression

if C1 then if C2 then S3 else S4 S → if then S has two parse trees | if C then S else S if if | OTHER C if C1 if S4 1 • This grammar is also ambiguous

C2 S3 C2 S3 S4

• Typically we want the second form

Compiler Design 1 (2011) 49 Compiler Design 1 (2011) 50

The Dangling Else: A Fix The Dangling Else: Example Revisited

• else matches the closest unmatched then • The expression if C1 then if C2 then S3 else S4 • We can describe this in the grammar

if if S → MIF /* all then are matched */ | UIF /* some then are unmatched */ C1 if C1 if S MIF → if C then MIF else MIF 4

| OTHER C S UIF → if C then S C2 S3 S4 2 3

| if C then MIF else UIF • A valid parse tree • Not valid because the (for a UIF) then expression is not • Describes the same set of strings a MIF

Compiler Design 1 (2011) 51 Compiler Design 1 (2011) 52 Ambiguity Precedence and Associativity Declarations

• No general techniques for handling ambiguity • Instead of rewriting the grammar – Use the more natural (ambiguous) grammar

• Impossible to convert automatically an – Along with disambiguating declarations

ambiguous grammar to an unambiguous one • Most tools allow precedence and associativity

• Used with care, ambiguity can simplify the declarations to disambiguate grammars

grammar – Sometimes allows more natural definitions • Examples … – We need disambiguation mechanisms

Compiler Design 1 (2011) 53 Compiler Design 1 (2011) 54

Associativity Declarations Precedence Declarations

• Consider the grammar E → E + E | int • Consider the grammar E → E + E | E * E | int • Ambiguous: two parse trees of int + int + int – And the string int + int * int

E E E E E * E E E E + E E + E + E E int int E * E E + E int int E + E +

int int int int int int int int • Precedence declarations: %left + • Left associativity declaration: %left + %left *

Compiler Design 1 (2011) 55 Compiler Design 1 (2011) 56 Error Handling Syntax Error Handling

• Purpose of the compiler is • Error handler should – To detect non-valid programs – Report errors accurately and clearly – To translate the valid ones – Recover from an error quickly • Many kinds of possible errors (e.g. in C) – Not slow down compilation of valid code

Error kind Example Detected by … Lexical … $ … Lexer • Good error handling is not easy to achieve

Syntax … x *% … Parser Semantic … int x; y = x(3); … Type checker Correctness your favorite program Tester/User

Compiler Design 1 (2011) 57 Compiler Design 1 (2011) 58

Approaches to Syntax Error Recovery Error Recovery: Panic Mode

• From simple to complex • Simplest, most popular method – Panic mode

– Error productions • When an error is detected:

– Automatic local or global correction – Discard tokens until one with a clear role is found

– Continue from there

• Not all are supported by all parser generators • Such tokens are called synchronizing tokens

– Typically the statement or expression terminators

Compiler Design 1 (2011) 59 Compiler Design 1 (2011) 60 Syntax Error Recovery: Panic Mode (Cont.) Syntax Error Recovery: Error Productions

• Consider the erroneous expression • Idea: specify in the grammar known common (1 + + 2) + 3 mistakes • Panic-mode recovery: • Essentially promotes common errors to – Skip ahead to next integer and then continue alternative syntax • Example:

• (ML)-Yacc: use the special terminal error to – Write 5 x instead of 5 * x describe how much input to skip – Add the production E → …| E E

E → int | E + E | ( E ) | error int | ( error ) • Disadvantage – Complicates the grammar

Compiler Design 1 (2011) 61 Compiler Design 1 (2011) 62

Syntax Error Recovery: Past and Present

• Past – Slow recompilation cycle (even once a day) – Find as many errors in one cycle as possible – Researchers could not let go of the topic

• Present – Quick recompilation cycle – Users tend to correct one error/cycle – Complex error recovery is needed less – Panic-mode seems enough

Compiler Design 1 (2011) 63