Outline Introduction to Parsing • Regular languages revisited

Ambiguity and Syntax Errors • Parser overview

• Context-free grammars (CFG’s)

• Derivations

• Ambiguity

• Syntax errors

2

Languages and Automata Limitations of Regular Languages

• Formal languages are very important in CS Intuition: A finite automaton that runs long – Especially in programming languages and enough must repeat states • A finite automaton cannot remember number • Regular languages of times it has visited a particular state – The weakest formal languages widely used • because a finite automaton has finite memory – Many applications – Only enough to store in which state it is – Cannot count, except up to a finite limit

• We will also study context-free languages • Many languages are not regular • E.g., the language of balanced parentheses is not regular: { (i )i | i ≥ 0}

3 4 The Functionality of the Parser Example

• Input: sequence of tokens from lexer • If-then-else statement if (x == y) then z = 1; else z = 2; • Output: of the program • Parser input IF (ID == ID) THEN ID = INT; ELSE ID = INT; • Possible parser output

IF-THEN-ELSE

== = =

ID ID ID INT ID INT

5 6

Comparison with Lexical Analysis The Role of the Parser

• Not all sequences of tokens are programs ... • Parser must distinguish between valid and Phase Input Output invalid sequences of tokens

Lexer Sequence of Sequence of characters tokens • We need – A language for describing valid sequences of tokens Parser Sequence of Parse tree – A method for distinguishing valid from invalid tokens sequences of tokens

7 8 Context-Free Grammars CFGs (Cont.)

• Many constructs have a A CFG consists of recursive structure – A set of terminals T – A set of non-terminals N • E.g. A STMT is of the form – A start symbol S (a non-terminal) if COND then STMT else STMT , or – A set of productions while COND do STMT , or … Assuming X ∈ N the productions are of the form • Context-free grammars are a natural notation X →ε , or

for this recursive structure X → Y1 Y2 ... Yn where Yi ∈ N ∪T

9 10

Notational Conventions Examples of CFGs

• In these lecture notes A fragment of an example language (simplified): – Non-terminals are written upper-case

– Terminals are written lower-case STMT → if COND then STMT else STMT

– The start symbol is the left-hand side of the first ⏐ production while COND do STMT ⏐ id = int

11 12 Examples of CFGs (cont.) The Language of a CFG

Grammar for simple arithmetic expressions: Read productions as replacement rules:

→ E → E * E X Y1 ... Yn

Means X can be replaced by Y1 ... Yn (in this order) ⏐ E + E X →ε ⏐ ( E ) Means X can be erased (replaced with empty string)

⏐ id

13 14

Key Idea The Language of a CFG (Cont.)

(1) Begin with a string consisting of the start More formally, we write symbol “S” X XX→ XXYYXX (2) Replace any non-terminal X in the string by 11LLin L i−+11 min L1 L a right-hand side of some production if there is a production

XYY→ 1 n L XYYim→ 1 L (3) Repeat (2) until there are no non-terminals in the string

15 16 The Language of a CFG (Cont.) The Language of a CFG

Write Let G be a context-free grammar with start ∗ symbol S. Then the language of G is: XXYY11LLnm→ if ∗ {aaSaa11KKnn|→ and every a i is a terminal} XX11LLLLnm→→→ YY in 0 or more steps

17 18

Terminals Examples

• Terminals are called so because there are no L(G) is the language of the CFG G rules for replacing them ii Strings of balanced parentheses {() |i ≥ 0} • Once generated, terminals are permanent

Two equivalent ways of writing the grammar G: • Terminals ought to be tokens of the language SS→ () SS→ () or S → ε | ε

19 20 Example Example (Cont.)

A fragment of our example language (simplified): Some elements of the our language

STMT → if COND then STMT id = int ⏐ if COND then STMT else STMT if (id == id) then id = int else id = int ⏐ while COND do STMT while (id != id) do id = int ⏐ id = int while (id == id) do while (id != id) do id = int COND → (id == id) if (id != id) then if (id == id) then id = int else id = int ⏐ (id != id)

21 22

Arithmetic Example Notes

Simple arithmetic expressions: The idea of a CFG is a big step. E→∗ E+E | E E | (E) | id But:

Some elements of the language: • Membership in a language is just “yes” or “no”;

id id + id we also need the parse tree of the input (id) id ∗ id • Must handle errors gracefully

(id) ∗∗ id id (id) • Need an implementation of CFG’s – e.g., yacc/bison/ML-yacc/...

23 24 More Notes Derivations and Parse Trees

• Form of the grammar is important A derivation is a sequence of productions – Many grammars generate the same language S →→→ – Parsing tools are sensitive to the grammar LLL

A derivation can be drawn as a tree – Start symbol is the tree’s root

– For a production XYY → 1 L n add children YY 1 L n to node X

Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

25 26

Derivation Example Derivation Example (Cont.)

• Grammar Ε → E+E | E*E | (E) | id

E→∗ E+E | E E | (E) | id E E → E+E • String E + E

id ∗ id + id →∗EE+E →∗id E + E E * E id →∗id id + E id id →∗id id + id

27 28 Derivation in Detail (1) Derivation in Detail (2)

Ε → E+E | E*E | (E) | id Ε → E+E | E*E | (E) | id E E

E + E E E → E+E

29 30

Derivation in Detail (3) Derivation in Detail (4)

Ε → E+E | E*E | (E) | id Ε → E+E | E*E | (E) | id E E E E E + E E + E → E+E → E+E E * E →∗EE+E E * E →∗E+EE →∗id E + E id

31 32 Derivation in Detail (5) Derivation in Detail (6)

Ε → E+E | E*E | (E) | id Ε → E+E | E*E | (E) | id E E E E → E+E → E+E E + E E + E →∗EE+E →∗EE+E E * E →∗id E + E E * E id →∗id E + E → id∗ id + E →∗id id + E id id id id →∗id id + id

33 34

Notes on Derivations Left-most and Right-most Derivations

• A parse tree has • What was shown before Ε → E+E | E*E | (E) | id – Terminals at the leaves was a left-most derivation – At each step, we replaced E – Non-terminals at the interior nodes the left-most non-terminal → E+E • An in-order traversal of the leaves is the • There is an equivalent → E+id original input notion of a right-most derivation →∗EE + id – Shown on the right • The parse tree shows the association of →∗Eid + id operations; the input string does not ! →∗id id + id

35 36 Right-most Derivation in Detail (1) Right-most Derivation in Detail (2)

Ε → E+E | E*E | (E) | id Ε → E+E | E*E | (E) | id E E

E + E E E → E+E

37 38

Right-most Derivation in Detail (3) Right-most Derivation in Detail (4)

Ε → E+E | E*E | (E) | id Ε → E+E | E*E | (E) | id E E E E E + E E + E → E+E → E+E id → E+id E * E id → E+id → EE∗ + id

39 40 Right-most Derivation in Detail (5) Right-most Derivation in Detail (6)

Ε → E+E | E*E | (E) | id Ε → E+E | E*E | (E) | id E E E E → E+E → E+E E + E E + E → E+id → E+id E * E id →∗EE + id E * E id → EE∗ + id →∗Eid + id → E ∗id + id id id id →∗id id + id

41 42

Derivations and Parse Trees Summary of Derivations

• Note that: • We are not just interested in whether – right-most and left-most derivations have the same s ∈ L(G)

parse tree – We need a parse tree for s – for each parse tree, there is a right-most and a left-most derivation • A derivation defines a parse tree – But one parse tree may have many derivations • The difference is just in the order in which branches are added • Left-most and right-most derivations are important in parser implementation

43 44 Ambiguity Ambiguity (Cont.)

• Grammar: • A grammar is ambiguous if it has more than

E → E + E | E * E | ( E ) | int one parse tree for some string – Equivalently, if there is more than one right-most or left-most derivation for some string • The string int * int + int has two parse trees • Ambiguity is bad

E E – Leaves meaning of some programs ill-defined • Ambiguity is common in programming languages E + E E * E – Arithmetic expressions

E E int int E + E – IF-THEN-ELSE * int int int int

45 46

Dealing with Ambiguity Ambiguity: The Dangling Else

• There are several ways to handle ambiguity • Consider the following grammar

• Most direct method is to rewrite the grammar S → if then S unambiguously | if C then S else S | OTHER E → T + E | T

T → int * T | int | ( E ) • This grammar is also ambiguous

• This grammar enforces precedence of * over +

47 48 The Dangling Else: Example The Dangling Else: A Fix

• The expression • else should match the closest unmatched then

if C1 then if C2 then S3 else S4 • We can describe this in the grammar

has two parse trees S → MIF /* all then are matched */

if if | UIF /* some then are unmatched */ MIF → if C then MIF else MIF C if C1 if S4 1 | OTHER UIF → if C then S | if C then MIF else UIF C2 S3 C2 S3 S4

• Typically we want the second form • Describes the same set of strings

49 50

The Dangling Else: Example Revisited Ambiguity

• The expression if C1 then if C2 then S3 else S4 • No general techniques for handling ambiguity

if if • Impossible to convert automatically an ambiguous grammar to an unambiguous one C1 if C1 if S4 • Used with care, ambiguity can simplify the

C2 S3 S4 C2 S3 grammar • A valid parse tree – Sometimes allows more natural definitions • Not valid because the (for a UIF) then expression is not – However, we need disambiguation mechanisms a MIF

51 52 Precedence and Associativity Declarations Associativity Declarations

• Instead of rewriting the grammar • Consider the grammar E → E + E | int – Use the more natural (ambiguous) grammar • Ambiguous: two parse trees of int + int + int – Along with disambiguating declarations E E • Most tools allow precedence and associativity E + E E E declarations to disambiguate grammars +

E + E int int E + E • Examples … int int int int • Left associativity declaration: %left +

53 54

Precedence Declarations Error Handling

• Consider the grammar E → E + E | E * E | int • Purpose of the is And the string int + int * int – To detect non-valid programs – To translate the valid ones E E • Many kinds of possible errors (e.g. in C)

E * E E + E Error kind Example Detected by … E + E int int E * E Lexical … $ … Lexer Syntax …x *% … Parser int int int int Semantic … int x; y = x(3); … Type checker

• Precedence declarations: %left + Correctness your favorite program Tester/User

%left * 55 56 Syntax Error Handling Approaches to Syntax Error Recovery

• Error handler should • From simple to complex – Report errors accurately and clearly – Panic mode – Recover from an error quickly – Error productions – Not slow down compilation of valid code – Automatic local or global correction

• Good error handling is not easy to achieve • Not all are supported by all parser generators

57 58

Error Recovery: Panic Mode Syntax Error Recovery: Panic Mode (Cont.)

• Simplest, most popular method • Consider the erroneous expression (1 + + 2) + 3

• When an error is detected: • Panic-mode recovery:

– Discard tokens until one with a clear role is found – Skip ahead to next integer and then continue

– Continue from there • (ML)-Yacc: use the special terminal error to • Such tokens are called synchronizing tokens describe how much input to skip – Typically the statement or expression terminators E → int | E + E | ( E ) | error int | ( error )

59 60 Syntax Error Recovery: Error Productions Syntax Error Recovery: Past and Present

• Idea: specify some recovery rules in the • (Distant) Past grammar based on known common mistakes – Slow recompilation cycle (even once a day) • Essentially promotes common errors to – Find as many errors in one cycle as possible alternative syntax – Researchers could not let go of the topic

• Example: – Write 5 x instead of 5 * x • Present – Add the production E → …| E E – Quick recompilation cycle

• Disadvantage – Users tend to correct one error/cycle

– Complicates the grammar – Complex error recovery is needed less – Panic-mode seems enough

61 62