Parsing

• A grammar describes the syntactically legal strings in a language
• A recogniser simply accepts or rejects strings
• A generator produces strings
• A parser constructs a parse tree for a string
• Two common types of parsers:
  – bottom-up, or data driven
  – top-down, or hypothesis driven
• A top-down parser for simple grammars is easy to implement by hand

Top down vs. bottom up parsing

• The parsing problem is to connect the root node S with the tree leaves, the input (e.g., A = 1 + 3 * 4 / 5)
• Top-down parsers start constructing the parse tree at the top (root) and move down towards the leaves. Easy to implement by hand, but requires restricted grammars. E.g.:
  – predictive parsers (e.g., LL(k))
• Bottom-up parsers build nodes on the bottom of the parse tree first. Suitable for automatic parser generation; handles a larger class of grammars. E.g.:
  – shift-reduce parsers (or LR(k) parsers)
• Both are general techniques that can be made to work for all languages (but not all grammars!)
• Recall that a given language can be described by several grammars
• Both of these grammars describe the same language:

    E -> E + Num        E -> Num + E
    E -> Num            E -> Num

• The first one, with its left recursion, causes problems for top-down parsers
• For a given parsing technique, we may have to transform the grammar to work with it

Parsing complexity

• How hard is the parsing task? How do we measure that?
• Parsing an arbitrary CFG is O(n3) -- it can take time proportional to the cube of the number of input symbols
• This is bad! (why?)
• If we constrain the grammar somewhat, we can always parse in linear time. This is good! (why?)
• Linear-time parsing
  – LL parsers
    • Recognize LL grammars
    • Use a top-down strategy
  – LR parsers
    • Recognize LR grammars
    • Use a bottom-up strategy
• LL(n): Left to right, Leftmost derivation, look ahead at most n symbols
• LR(n): Left to right, Rightmost derivation, look ahead at most n symbols

Top Down Parsing Methods

• Simplest method is a full-backup, recursive descent parser
• Often used for parsing simple languages
• Write recursive recognizers (subroutines) for each grammar rule
  – If a rule succeeds, perform some action (i.e., build a tree node, emit code, etc.)
  – If a rule fails, return failure. Caller may try another choice or fail
  – On failure it "backs up"

Top Down Parsing Methods: Problems

• When going forward, the parser consumes tokens from the input, so what happens if we have to back up?
  – suggestions?
• Algorithms that use backup tend to be, in general, inefficient
  – There might be a large number of possibilities to try before finding the right one or giving up
• Grammar rules which are left-recursive lead to non-termination!

Recursive Descent Parsing: Example

For the grammar:

    <term> -> <factor> {(*|/) <factor>}*

we could use the following recursive descent parsing subprogram (this one is written in C):

    void term() {
        factor();    /* parse first factor */
        while (next_token == ast_code ||
               next_token == slash_code) {
            lexical();   /* get next token */
            factor();    /* parse next factor */
        }
    }
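To make the fragment above runnable, here is a self-contained sketch. The slide leaves lexical(), factor(), and the token codes undefined, so minimal assumed definitions are filled in: tokens are single characters and a factor is one digit.

```c
/* Self-contained sketch of the slide's <term> recognizer.
 * Assumptions (not from the slide): single-character tokens,
 * a <factor> is a single digit, end of input is token 0. */
#include <assert.h>

static const char *input;   /* remaining input */
static int next_token;      /* one-token lookahead */
static int ok;              /* did the parse succeed? */

enum { ast_code = '*', slash_code = '/', digit_code = 'd', end_code = 0 };

static void lexical(void) {            /* get next token */
    char c = *input;
    if (c == '\0')                 next_token = end_code;
    else if (c >= '0' && c <= '9') { next_token = digit_code; input++; }
    else                           { next_token = c;          input++; }
}

static void factor(void) {             /* <factor> -> digit */
    if (next_token == digit_code) lexical();
    else ok = 0;                       /* report failure */
}

static void term(void) {               /* <term> -> <factor> {(*|/) <factor>}* */
    factor();                          /* parse first factor */
    while (next_token == ast_code || next_token == slash_code) {
        lexical();                     /* consume the * or / */
        factor();                      /* parse next factor */
    }
}

int parse_term(const char *s) {        /* 1 iff s is a valid <term> */
    input = s; ok = 1;
    lexical();                         /* prime the lookahead */
    term();
    return ok && next_token == end_code;
}
```

Note that because this grammar uses iteration ({…}*) rather than left recursion, no backup is ever needed.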

Problems

• Some grammars cause problems for top down parsers
• Top down parsers do not work with left-recursive grammars
  – E.g., one with a rule like: E -> E + T
  – We can transform a left-recursive grammar into one which is not
• A top down grammar can limit backtracking if it only has one rule per non-terminal
  – The technique of rule factoring can be used to eliminate multiple rules for a non-terminal

Left-recursive grammars

• A grammar is left recursive if it has rules like

    X -> X α

• Or if it has indirect left recursion, as in

    X -> A β
    A -> X δ

• Q: Why is this a problem?
  – A: it can lead to non-terminating recursion!

Direct Left-Recursive Grammars

• Consider the left-recursive grammar

    S -> S α
    S -> β

• S generates strings

    β
    β α
    β α α
    …

• Rewrite using right-recursion:

    S -> β S'
    S' -> α S' | ε

Elimination of Direct Left-Recursion

• Concretely, consider

    T -> T + id
    T -> id

• T generates strings

    id
    id + id
    id + id + id
    …

• Rewrite using right-recursion:

    T -> id T'
    T' -> + id T' | ε

• We can manually or automatically rewrite a grammar removing left-recursion, making it OK for a top-down parser
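The rewritten grammar can be run directly by a recursive descent parser. A minimal sketch, assuming single-character tokens where 'i' stands for id:

```c
/* Recognizer for the right-recursive rewrite:
 *   T  -> id T'
 *   T' -> + id T' | eps
 * The left-recursive original (T -> T + id) would recurse forever. */
#include <assert.h>

static const char *p;                /* cursor into the input */

static int t_prime(void) {           /* T' -> + id T' | eps */
    if (*p == '+') {
        p++;
        if (*p != 'i') return 0;     /* expect id after + */
        p++;
        return t_prime();
    }
    return 1;                        /* eps: match nothing */
}

static int t(void) {                 /* T -> id T' */
    if (*p != 'i') return 0;
    p++;
    return t_prime();
}

int parse_t(const char *s) {         /* 1 iff s matches T exactly */
    p = s;
    return t() && *p == '\0';
}
```

The rewrite preserves the language: both grammars generate id, id+id, id+id+id, and so on.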

General Left Recursion

• The grammar

    S -> A α | δ
    A -> S β

  is also left-recursive because

    S ->+ S β α

  where ->+ means "can be rewritten in one or more steps"
• This indirect left-recursion can also be automatically eliminated (not covered)

Summary of Recursive Descent

• Simple and general parsing strategy
  – Left-recursion must be eliminated first
  – … but that can be done automatically
• Unpopular because of backtracking
  – Thought to be too inefficient
• In practice, backtracking is eliminated by further restricting the grammar to allow us to successfully predict which rule to use

Predictive Parsers

• That there can be many rules for a non-terminal is what makes parsing hard
• A predictive parser processes the input stream, typically from left to right
  – Is there any other way to do it? Yes, for programming languages!
• It uses information from peeking ahead at the upcoming terminal symbols to decide which grammar rule to use next
• And it always makes the right choice of which rule to use
• How far it can peek ahead is an issue
• An important class of predictive parsers peek ahead only one token into the stream
• An LL(k) parser does a Left-to-right parse, a Leftmost-derivation, and k-symbol lookahead
• Grammars where one can decide which rule to use by examining only the next token are LL(1)
• LL(1) grammars are widely used in practice
  – The syntax of a PL can usually be adjusted to enable it to be described with an LL(1) grammar

Predictive Parser

Example: consider the grammar

    S -> if E then S else S
    S -> begin S L
    S -> print E
    L -> end
    L -> ; S L
    E -> num = num

An S expression starts with either an IF, BEGIN, or PRINT token; an L expression starts with an END or a SEMICOLON token; and an E expression has only one production.

Remember…

• Given a grammar and a string in the language defined by the grammar …
• There may be more than one way to derive the string leading to the same parse tree
  – It depends on the order in which you apply the rules
  – And on what parts of the string you choose to rewrite next
• All of the derivations are valid
• To simplify the problem and the algorithms, we often focus on one of two simple derivation strategies
  – A leftmost derivation
  – A rightmost derivation
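For this grammar, one token of lookahead always picks the right rule. A sketch in C, with tokens abbreviated to single characters (an assumption for the demo): i=if, t=then, l=else, b=begin, p=print, e=end, n=num.

```c
/* Predictive recognizer for the slide's grammar: the first
 * token of S (i, b, or p) and of L (e or ;) selects the rule. */
#include <assert.h>

static const char *p;

static int tok(char c) { if (*p == c) { p++; return 1; } return 0; }

static int S(void);                      /* forward declaration */

static int E(void) {                     /* E -> num = num */
    return tok('n') && tok('=') && tok('n');
}

static int L(void) {
    if (tok('e')) return 1;              /* L -> end */
    if (tok(';')) return S() && L();     /* L -> ; S L */
    return 0;
}

static int S(void) {
    if (tok('i'))                        /* S -> if E then S else S */
        return E() && tok('t') && S() && tok('l') && S();
    if (tok('b'))                        /* S -> begin S L */
        return S() && L();
    if (tok('p'))                        /* S -> print E */
        return E();
    return 0;
}

int parse_S(const char *s) { p = s; return S() && *p == '\0'; }
```

No backtracking ever occurs: each rule's first token is distinct, which is exactly the LL(1) property.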

LL(k) and LR(k) parsers

• Two important parser classes are LL(k) and LR(k)
• The name LL(k) means:
  – L: Left-to-right scanning of the input
  – L: Constructing a leftmost derivation
  – k: max # of input symbols needed to predict parser action
• The name LR(k) means:
  – L: Left-to-right scanning of the input
  – R: Constructing a rightmost derivation in reverse
  – k: max # of input symbols needed to select parser action
• An LR(1) or LL(1) parser never needs to "look ahead" more than one input token to know what production rule applies

Predictive Parsing and Left Factoring

• Consider the grammar

    E -> T + E
    E -> T
    T -> int
    T -> int * T
    T -> ( E )

• Even if left recursion is removed, a grammar may not be parsable with an LL(1) parser
• This grammar is hard to predict because
  – For T, two productions start with int
  – For E, it is not clear how to predict which rule to use
• Must left-factor the grammar before use for predictive parsing
• Left-factoring involves rewriting rules so that, if a non-terminal has > 1 rule, each begins with a terminal

Left-Factoring Example

Add new non-terminals X and Y to factor out common prefixes of rules:

    E -> T + E          E -> T X
    E -> T              X -> + E
    T -> int            X -> ε
    T -> int * T        T -> ( E )
    T -> ( E )          T -> int Y
                        Y -> * T
                        Y -> ε

For each non-terminal in the revised grammar, there is either only one rule or every rule begins with a terminal or ε.

Using Parsing Tables

• LL(1) means that for each non-terminal and token there is only one production
• Can be represented as a simple table
  – One dimension for the current non-terminal to expand
  – One dimension for the next token
  – A table entry contains one rule's action, or is empty if error
• Method similar to recursive descent, except
  – For each non-terminal S
  – We look at the next token a
  – And choose the production shown at table cell [S, a]
• Use a stack to keep track of pending non-terminals
• Reject when we encounter an error state; accept when we encounter end-of-input

E  T X LL(1) Parsing Table Example LL(1) Parsing Table Example X  + E |  T  ( E ) | int Y Left-factored grammar •Consider the [E, int] entry Y  * T |  E  T X – “When current non-terminal is E & next input int, use production E T X” X  + E |  – It’s the only production that can generate an int in next place T  ( E ) | int Y •Consider the [Y, +] entry Y  * T |  – “When current non-terminal is Y and current token is +, get rid of Y” – Y can be followed by + only in a derivation where Y End of input symbol •Consider the [E, *] entry – Blank entries indicate error situations The LL(1) parsing table – “There is no way to derive a string starting with * from non-terminal E” int * + ( ) $ E T X T X X + E   int * + ( ) $ T int Y ( E ) E T X T X Y * T    X + E   T int Y ( E ) Y * T   

LL(1) Parsing Algorithm

    initialize stack = <S $> and next
    repeat
      case stack of
        <X, rest> : if T[X, *next] = Y1…Yn
                      then stack <- <Y1…Yn rest>
                      else error();
        <t, rest> : if t == *next++
                      then stack <- <rest>
                      else error();
    until stack == < >

where:
  (1) next points to the next input token
  (2) X matches some non-terminal
  (3) t matches some terminal

LL(1) Parsing Example

Using the table above for the grammar E -> T X, X -> + E | ε, T -> ( E ) | int Y, Y -> * T | ε:

    Stack          Input          Action
    E $            int * int $    pop(); push(T X)
    T X $          int * int $    pop(); push(int Y)
    int Y X $      int * int $    pop(); next++
    Y X $          * int $        pop(); push(* T)
    * T X $        * int $        pop(); next++
    T X $          int $          pop(); push(int Y)
    int Y X $      int $          pop(); next++
    Y X $          $              pop()
    X $            $              pop()
    $              $              ACCEPT!
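The stack-plus-table algorithm can be sketched as a small C program. Symbol names are single characters chosen for the demo ('i' stands for int); the table is encoded as a function rather than a 2-D array, but the entries match the table above.

```c
/* Table-driven LL(1) parser for E->TX, X->+E|eps, T->(E)|int Y,
 * Y->*T|eps. Nonterminals: E X T Y; terminals: i * + ( ) $. */
#include <assert.h>
#include <string.h>

/* T[nonterminal][token] -> right-hand side, or NULL on error */
static const char *rule(char nt, char tok) {
    switch (nt) {
    case 'E': return (tok=='i'||tok=='(') ? "TX" : NULL;
    case 'X': return tok=='+' ? "+E" : (tok==')'||tok=='$') ? "" : NULL;
    case 'T': return tok=='i' ? "iY" : tok=='(' ? "(E)" : NULL;
    case 'Y': return tok=='*' ? "*T" : (tok=='+'||tok==')'||tok=='$') ? "" : NULL;
    }
    return NULL;
}

int ll1_parse(const char *input) {      /* input must end with '$' */
    char stack[128] = "E$";             /* start symbol on top, then $ */
    const char *next = input;
    while (stack[0] != '\0') {
        char top = stack[0];
        if (strchr("EXTY", top)) {      /* nonterminal: expand via table */
            const char *rhs = rule(top, *next);
            if (!rhs) return 0;         /* blank entry: error */
            char rest[128];
            strcpy(rest, stack + 1);    /* pop() */
            strcpy(stack, rhs);         /* push(rhs) */
            strcat(stack, rest);
        } else {                        /* terminal: must match input */
            if (top != *next) return 0;
            next++;                     /* next++ */
            memmove(stack, stack + 1, strlen(stack));
        }
    }
    return *next == '\0';               /* ACCEPT when both are consumed */
}
```

Running ll1_parse("i*i$") performs exactly the pop/push sequence traced in the example.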

Constructing Parsing Tables

• No table entry can be multiply defined
• If A -> α, where in the row for A do we place α?
• In column t, where t can start a string derived from α
  – α ->* t β
  – We say that t ∈ First(α)
• In column t, if α is ε and t can follow an A
  – S ->* β A t δ
  – We say that t ∈ Follow(A)

Computing First Sets

Definition: First(X) = { t | X ->* t α } ∪ { ε | X ->* ε }

Algorithm sketch (see book for details):
  1. for all terminals t do First(t) <- { t }
  2. for each production X -> ε do First(X) <- { ε }
  3. if X -> A1 … An α and ε ∈ First(Ai), 1 ≤ i ≤ n,
     then add First(α) to First(X)
  4. for each X -> A1 … An s.t. ε ∈ First(Ai), 1 ≤ i ≤ n,
     then add ε to First(X)
  5. repeat steps 3 and 4 until no First set can be grown

First Sets. Example

Recall the grammar

    E -> T X            X -> + E | ε
    T -> ( E ) | int Y  Y -> * T | ε

First sets:

    First( ( )   = { ( }        First( T ) = { int, ( }
    First( ) )   = { ) }        First( E ) = { int, ( }
    First( int ) = { int }      First( X ) = { +, ε }
    First( + )   = { + }        First( Y ) = { *, ε }
    First( * )   = { * }

Computing Follow Sets

• Definition:
    Follow(X) = { t | S ->* β X t δ }
• Intuition
  – If S is the start symbol, then $ ∈ Follow(S)
  – If X -> A B then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B)
  – Also, if B ->* ε then Follow(X) ⊆ Follow(A)
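The First-set fixpoint can be sketched in C for this grammar, representing each set as a bit mask. The terminal encoding and the grammar tables are assumptions made for the demo; the computed sets match the ones listed above.

```c
/* Fixpoint computation of First sets for:
 *   E->TX  X->+E|eps  T->(E)|iY  Y->*T|eps   (i stands for int) */
#include <assert.h>

enum { T_i = 1, T_lp = 2, T_rp = 4, T_plus = 8, T_star = 16, T_eps = 32 };
enum { E, X, T, Y, NT };             /* nonterminal indices */

unsigned first[NT];                  /* First sets as bit masks */

/* First of a right-hand side over the symbols {E,X,T,Y,i,(,),+,*} */
static unsigned first_of(const char *rhs) {
    unsigned f = 0;
    for (; *rhs; rhs++) {
        unsigned s;
        switch (*rhs) {              /* a terminal contributes itself */
        case 'i': return f | T_i;
        case '(': return f | T_lp;
        case ')': return f | T_rp;
        case '+': return f | T_plus;
        case '*': return f | T_star;
        case 'E': s = first[E]; break;
        case 'X': s = first[X]; break;
        case 'T': s = first[T]; break;
        default : s = first[Y]; break;
        }
        f |= s & ~T_eps;             /* add First(Ai) minus eps */
        if (!(s & T_eps)) return f;  /* Ai cannot vanish: stop here */
    }
    return f | T_eps;                /* the whole rhs can derive eps */
}

void compute_first(void) {
    static const struct { int lhs; const char *rhs; } g[] = {
        {E,"TX"}, {X,"+E"}, {X,""}, {T,"(E)"}, {T,"iY"}, {Y,"*T"}, {Y,""},
    };
    int changed = 1;
    while (changed) {                /* iterate to a fixed point */
        changed = 0;
        for (int k = 0; k < 7; k++) {
            unsigned add = first_of(g[k].rhs);
            if ((first[g[k].lhs] | add) != first[g[k].lhs]) {
                first[g[k].lhs] |= add;
                changed = 1;
            }
        }
    }
}
```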

Computing Follow Sets

Algorithm sketch:
  1. Follow(S) <- { $ }
  2. For each production A -> α X β
     • add First(β) - {ε} to Follow(X)
  3. For each A -> α X β where ε ∈ First(β)
     • add Follow(A) to Follow(X)
  • repeat steps 2 and 3 until no Follow set grows

Follow Sets. Example

• Recall the grammar

    E -> T X            X -> + E | ε
    T -> ( E ) | int Y  Y -> * T | ε

• Follow sets:

    Follow( + )   = { int, ( }     Follow( * ) = { int, ( }
    Follow( ( )   = { int, ( }     Follow( E ) = { ), $ }
    Follow( X )   = { $, ) }       Follow( T ) = { +, ), $ }
    Follow( ) )   = { +, ), $ }    Follow( Y ) = { +, ), $ }
    Follow( int ) = { *, +, ), $ }

Constructing LL(1) Parsing Tables

• Construct a parsing table T for CFG G
• For each production A -> α in G do:
  – For each terminal t ∈ First(α) do
    • T[A, t] = α
  – If ε ∈ First(α), for each t ∈ Follow(A) do
    • T[A, t] = α
  – If ε ∈ First(α) and $ ∈ Follow(A) do
    • T[A, $] = α

Notes on LL(1) Parsing Tables

• If any entry is multiply defined then G is not LL(1)
• Reasons why a grammar is not LL(1) include
  – G is ambiguous
  – G is left recursive
  – G is not left-factored
• Most programming language grammars are not strictly LL(1)
• There are tools that build LL(1) tables

Bottom-up Parsing

• YACC uses bottom-up parsing. There are two important operations that bottom-up parsers use: shift and reduce
  – In abstract terms, we do a simulation of a Push Down Automaton as a finite state automaton
• Input: the string to be parsed and the set of productions
• Goal: Trace a rightmost derivation in reverse by starting with the input string and working backwards to the start symbol
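A minimal sketch of shift and reduce, for an assumed toy grammar E -> E + i | i (i stands for int). This is not YACC's algorithm: a real LR parser drives shifts and reductions from a generated state machine, while this demo simply pattern-matches handles on top of the stack.

```c
/* Shift-reduce recognizer for E -> E + i | i, tracing a rightmost
 * derivation in reverse: shift tokens onto a stack, and reduce
 * whenever a rule's right-hand side sits on top of the stack. */
#include <assert.h>

int shift_reduce(const char *input) {
    char stack[128];
    int top = 0;                        /* stack[0..top) holds symbols */
    for (;;) {
        /* reduce while a handle is on top of the stack */
        if (top >= 1 && stack[top-1] == 'i' &&
            (top == 1 || (top >= 3 && stack[top-2] == '+'
                                   && stack[top-3] == 'E'))) {
            if (top == 1) top = 0;      /* reduce by E -> i    */
            else          top -= 3;     /* reduce by E -> E+i  */
            stack[top++] = 'E';
        } else if (*input) {
            stack[top++] = *input++;    /* shift the next token */
        } else {
            break;                      /* no input left, no handle */
        }
    }
    return top == 1 && stack[0] == 'E'; /* accept on lone start symbol */
}
```

Note the grammar is left-recursive, which is no problem bottom-up: on "i+i" the stack evolves i -> E -> E+ -> E+i -> E, i.e., the rightmost derivation E => E+i => i+i read backwards.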
