Parsing

• A grammar describes the syntactically legal strings in a language
• A recogniser simply accepts or rejects strings
• A generator produces strings
• A parser constructs a parse tree for a string
• Two common types of parsers:
  – bottom-up, or data driven
  – top-down, or hypothesis driven
• A recursive descent parser easily implements a top-down parser for simple grammars
Top down vs. bottom up parsing

• The parsing problem is to connect the root node S with the tree leaves, the input, e.g. A = 1 + 3 * 4 / 5
• Top-down parsers start constructing the parse tree at the top (root) and move down towards the leaves. Easy to implement by hand, but requires restricted grammars. E.g.:
  – Predictive parsers (e.g., LL(k))
• Bottom-up parsers build nodes on the bottom of the parse tree first. Suitable for automatic parser generation; handles a larger class of grammars. E.g.:
  – shift-reduce parsers (or LR(k) parsers)

Top down vs. bottom up parsing

• Both are general techniques that can be made to work for all languages (but not all grammars!)
• Recall that a given language can be described by several grammars
• Both of these grammars describe the same language:
    E -> E + Num        E -> Num + E
    E -> Num            E -> Num
• The first one, with its left recursion, causes problems for top down parsers
• For a given parsing technique, we may have to transform the grammar to work with it
Parsing complexity

• How hard is the parsing task? How do we measure that?
• Parsing an arbitrary CFG is O(n^3) -- it can take time proportional to the cube of the number of input symbols
• This is bad! (why?)
• If we constrain the grammar somewhat, we can always parse in linear time. This is good! (why?)
• Linear-time parsing
  – LL parsers
    • Recognize LL grammars
    • Use a top-down strategy
  – LR parsers
    • Recognize LR grammars
    • Use a bottom-up strategy
• LL(n): Left to right, Leftmost derivation, look ahead at most n symbols
• LR(n): Left to right, Rightmost derivation (in reverse), look ahead at most n symbols

Top Down Parsing Methods

• Simplest method is a full-backup, recursive descent parser
• Often used for parsing simple languages
• Write recursive recognizers (subroutines) for each grammar rule
  – If the rule succeeds, perform some action (i.e., build a tree node, emit code, etc.)
  – If the rule fails, return failure. Caller may try another choice or fail
  – On failure it “backs up”
Top Down Parsing Methods: Problems

• When going forward, the parser consumes tokens from the input, so what happens if we have to back up?
• Backtracking parsers are, in general, inefficient
  – There might be a large number of possibilities to try before finding the right one or giving up
• Grammar rules which are left-recursive lead to non-termination!

Recursive Descent Parsing: Example

For the grammar term -> factor { ( * | / ) factor } :

    void term() {
      factor();                  /* parse first factor */
      while (next_token == ast_code ||
             next_token == slash_code) {
        lexical();               /* get next token */
        factor();                /* parse next factor */
      }
    }
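A sketch of how the term() routine above can be completed into a runnable recognizer. The following assumptions are not from the slides: factor matches a single digit, ast_code and slash_code are the characters '*' and '/', and lexical() reads one character of a string at a time.

```c
static const char *input_pos;   /* remaining input */
static int next_token;          /* one-token lookahead */
static int ok;                  /* did the parse succeed so far? */

static void lexical(void) {     /* get next token */
    next_token = *input_pos ? *input_pos++ : '\0';
}

static void factor(void) {      /* factor -> digit (an assumption) */
    if (next_token >= '0' && next_token <= '9')
        lexical();
    else
        ok = 0;                 /* rule fails: report failure */
}

static void term(void) {        /* term -> factor { (* | /) factor } */
    factor();                                    /* parse first factor */
    while (ok && (next_token == '*' || next_token == '/')) {
        lexical();                               /* get next token */
        factor();                                /* parse next factor */
    }
}

static int parse_term(const char *s) {
    input_pos = s;
    ok = 1;
    lexical();                                   /* prime the lookahead */
    term();
    return ok && next_token == '\0';             /* all input consumed? */
}
```

With this encoding, parse_term accepts 3*4/5 but rejects 3+4, since no rule here consumes a +.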
Problems

• Some grammars cause problems for top down parsers
• Top down parsers do not work with left-recursive grammars
  – E.g., one with a rule like: E -> E + T
  – We can transform a left-recursive grammar into one which is not
• A top down grammar can limit backtracking if it only has one rule per non-terminal
  – The technique of rule factoring can be used to eliminate multiple rules for a non-terminal

Left-recursive grammars

• A grammar is left recursive if it has rules like
    X -> X β
• Or if it has indirect left recursion, as in
    X -> A
    A -> X
• Q: Why is this a problem?
  – A: It can lead to non-terminating recursion!
Direct Left-Recursive Grammars

• Consider the left-recursive grammar
    S -> S α
    S -> β
• S generates strings
    β
    βα
    βαα
    …
• Rewrite using right-recursion:
    S -> β S’
    S’ -> α S’ | ε

Elimination of Direct Left-Recursion

• Concretely, consider
    E -> E + Num        T -> T + id
    E -> Num            T -> id
• T generates strings
    id
    id+id
    id+id+id
    …
• Rewrite using right-recursion:
    T -> id T’
    T’ -> + id T’ | ε
• We can manually or automatically rewrite a grammar removing left-recursion, making it OK for a top-down parser
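The payoff of the rewrite can be seen in code. A minimal sketch, assuming the terminal id is the literal text "id" (all names are assumptions): the routines mirror T -> id T' and T' -> + id T' | ε and always consume input before recursing, whereas a routine for the left-recursive T -> T + id would call itself before consuming anything and never return.

```c
#include <string.h>

static const char *p;                   /* remaining input */

static int match_id(void) {             /* the terminal id, literally "id" */
    if (strncmp(p, "id", 2) == 0) { p += 2; return 1; }
    return 0;
}

static int t_prime(void) {              /* T' -> + id T' | epsilon */
    if (*p == '+') { p++; return match_id() && t_prime(); }
    return 1;                           /* epsilon alternative: always succeeds */
}

static int parse_T(const char *s) {     /* T -> id T' */
    p = s;
    return match_id() && t_prime() && *p == '\0';
}
```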
General Left Recursion

• The grammar
    S -> A α | δ
    A -> S β
  is also left-recursive, because
    S ⇒+ S β α
  where ⇒+ means “can be rewritten in one or more steps”
• This indirect left-recursion can also be automatically eliminated (not covered)

Summary of Recursive Descent

• Simple and general parsing strategy
  – Left-recursion must be eliminated first
  – … but that can be done automatically
• Unpopular because of backtracking
  – Thought to be too inefficient
• In practice, backtracking is eliminated by further restricting the grammar to allow us to successfully predict which rule to use
Predictive Parsers

• That there can be many rules for a non-terminal is what makes parsing hard
• A predictive parser processes the input stream, typically from left to right
  – Is there any other way to do it? Yes, for programming languages!
• It uses information from peeking ahead at the upcoming terminal symbols to decide which grammar rule to use next
• And it always makes the right choice of which rule to use
• How much it can peek ahead is an issue

Predictive Parsers

• An important class of predictive parsers only peek ahead one token into the stream
• An LL(k) parser does a Left-to-right parse, a Leftmost-derivation, and k-symbol lookahead
• Grammars where one can decide which rule to use by examining only the next token are LL(1)
• LL(1) grammars are widely used in practice
  – The syntax of a PL can usually be adjusted to enable it to be described with an LL(1) grammar
Predictive Parser

• Example: consider the grammar
    S -> if E then S else S
    S -> begin S L
    S -> print E
    L -> end
    L -> ; S L
    E -> num = num
• An S expression starts with an IF, BEGIN, or PRINT token; an L expression starts with an END or a SEMICOLON token; and an E expression has only one production
• So one token of lookahead always determines which rule to use

Remember…

• Given a grammar and a string in the language defined by the grammar …
• There may be more than one way to derive the string leading to the same parse tree
  – It depends on the order in which you apply the rules
  – And what parts of the string you choose to rewrite next
• All of the derivations are valid
• To simplify the problem and the algorithms, we often focus on one of two simple derivation strategies
  – A leftmost derivation
  – A rightmost derivation
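A sketch of a one-token-lookahead recognizer for this grammar, where tokens are passed as a NULL-terminated array of strings and "num" stands for any numeric token (the helper names are assumptions, not from the slides). Each routine chooses its rule by inspecting only the next token.

```c
#include <string.h>

static const char **tok;                      /* next token to read */

static int eat(const char *t) {               /* consume one expected token */
    if (*tok && strcmp(*tok, t) == 0) { tok++; return 1; }
    return 0;
}

static int parse_S(void);

static int parse_E(void) {                    /* E -> num = num */
    return eat("num") && eat("=") && eat("num");
}

static int parse_L(void) {                    /* L -> end | ; S L */
    if (*tok && strcmp(*tok, "end") == 0) return eat("end");
    if (*tok && strcmp(*tok, ";") == 0)
        return eat(";") && parse_S() && parse_L();
    return 0;
}

static int parse_S(void) {                    /* choose by one lookahead token */
    if (!*tok) return 0;
    if (strcmp(*tok, "if") == 0)              /* S -> if E then S else S */
        return eat("if") && parse_E() && eat("then") && parse_S()
            && eat("else") && parse_S();
    if (strcmp(*tok, "begin") == 0)           /* S -> begin S L */
        return eat("begin") && parse_S() && parse_L();
    if (strcmp(*tok, "print") == 0)           /* S -> print E */
        return eat("print") && parse_E();
    return 0;
}

static int recognize(const char **tokens) {   /* tokens is NULL-terminated */
    tok = tokens;
    return parse_S() && *tok == NULL;
}
```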
LL(k) and LR(k) parsers

• Two important parser classes are LL(k) and LR(k)
• The name LL(k) means:
  – L: Left-to-right scanning of the input
  – L: Constructing a leftmost derivation
  – k: max # of input symbols needed to predict a parser action
• The name LR(k) means:
  – L: Left-to-right scanning of the input
  – R: Constructing a rightmost derivation in reverse
  – k: max # of input symbols needed to select a parser action
• An LL(1) or LR(1) parser never needs to “look ahead” more than one input token to know what production rule applies

Predictive Parsing and Left Factoring

• Consider the grammar
    E -> T + E
    E -> T
    T -> int
    T -> int * T
    T -> ( E )
• Even if left recursion is removed, a grammar may not be parsable with an LL(1) parser
• Hard to predict because
  – For T, two productions start with int
  – For E, it is not clear how to predict which rule to use
• Must left-factor the grammar before using it for predictive parsing
• Left-factoring involves rewriting rules so that, if a non-terminal has > 1 rule, each begins with a terminal
Left-Factoring Example

• Add new non-terminals X and Y to factor out common prefixes of rules:

    E -> T + E          E -> T X
    E -> T              X -> + E
    T -> int            X -> ε
    T -> int * T        T -> ( E )
    T -> ( E )          T -> int Y
                        Y -> * T
                        Y -> ε

• For each non-terminal in the revised grammar, there is either only one rule or every rule begins with a terminal or ε

Using Parsing Tables

• LL(1) means that for each non-terminal and token there is only one production
• It can be represented as a simple table
  – One dimension for the current non-terminal to expand
  – One dimension for the next token
  – A table entry contains one rule’s action, or is empty if error
• Method similar to recursive descent, except
  – For each non-terminal S
  – We look at the next token a
  – And choose the production shown at table cell [S, a]
• Use a stack to keep track of pending non-terminals
• Reject when we encounter an error state; accept when we encounter end-of-input
LL(1) Parsing Table Example

• Left-factored grammar:
    E -> T X
    X -> + E | ε
    T -> ( E ) | int Y
    Y -> * T | ε
• The LL(1) parsing table ($ is the end-of-input symbol; blank entries indicate error situations):

            int     *      +      (      )      $
    E       T X                  T X
    X                     + E           ε      ε
    T       int Y                ( E )
    Y               * T   ε             ε      ε

LL(1) Parsing Table Example

• Consider the [E, int] entry
  – “When the current non-terminal is E and the next input is int, use production E -> T X”
  – It is the only production that can generate an int in the next place
• Consider the [Y, +] entry
  – “When the current non-terminal is Y and the current token is +, get rid of Y” (use Y -> ε)
  – Y can be followed by + only in a derivation where Y -> ε
• Consider the [E, *] entry
  – “There is no way to derive a string starting with * from non-terminal E”
LL(1) Parsing Algorithm

    initialize stack = <S $> and next (pointer to the input tokens)
    repeat
      case stack of
        <X, rest> : if T[X, *next] == Y1…Yn
                      then stack <- <Y1…Yn, rest>
                      else error();
        <t, rest> : if t == *next++
                      then stack <- <rest>
                      else error();
    until stack == < >

LL(1) Parsing Example

    Stack          Input          Action
    E $            int * int $    pop(); push(T X)
    T X $          int * int $    pop(); push(int Y)
    int Y X $      int * int $    pop(); next++
    Y X $          * int $        pop(); push(* T)
    * T X $        * int $        pop(); next++
    T X $          int $          pop(); push(int Y)
    int Y X $      int $          pop(); next++
    Y X $          $              pop()
    X $            $              pop()
    $              $              accept
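The algorithm can be sketched as a small table-driven parser in C for this grammar, with 'i' abbreviating the token int, an explicit character stack, and rule() transcribing the parsing table from the example (the names and encodings are assumptions, not slide code).

```c
#include <string.h>

static const char *rule(char nt, char tok) {     /* table cell [nt, tok] */
    switch (nt) {
    case 'E': if (tok=='i' || tok=='(') return "TX"; break;
    case 'X': if (tok=='+') return "+E";
              if (tok==')' || tok=='$') return "";   /* X -> epsilon */
              break;
    case 'T': if (tok=='i') return "iY";
              if (tok=='(') return "(E)"; break;
    case 'Y': if (tok=='*') return "*T";
              if (tok=='+' || tok==')' || tok=='$') return "";  /* Y -> eps */
              break;
    }
    return NULL;                                  /* blank cell: error */
}

static int ll1_parse(const char *input) {         /* input ends with '$' */
    char stack[64] = "$E";                        /* stack top = last char */
    int top = 1;
    const char *next = input;
    while (top >= 0) {
        char s = stack[top];
        if (s == '$' && *next == '$') return 1;   /* accept */
        if (strchr("EXTY", s)) {                  /* non-terminal: expand */
            const char *r = rule(s, *next);
            if (!r) return 0;                     /* blank entry: reject */
            top--;                                /* pop the non-terminal */
            for (int i = (int)strlen(r) - 1; i >= 0; i--)
                stack[++top] = r[i];              /* push rhs reversed */
        } else {                                  /* terminal: match input */
            if (s != *next) return 0;
            top--; next++;                        /* pop(); next++ */
        }
    }
    return 0;
}
```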
Constructing Parsing Tables

• No table entry can be multiply defined
• If A -> α, where in the row for A do we place α?
  – In the column for t, where t can start a string derived from α:
      α ⇒* t β
    we say that t ∈ First(α)
  – In the column for t, if α is ε and t can follow an A:
      S ⇒* β A t δ
    we say that t ∈ Follow(A)

Computing First Sets

• Definition: First(X) = { t | X ⇒* t α } ∪ { ε | X ⇒* ε }
• Algorithm sketch (see book for details):
  1. for all terminals t do First(t) <- { t }
  2. for each production X -> ε do First(X) <- { ε }
  3. if X -> A1 … An α and ε ∈ First(Ai), 1 ≤ i ≤ n, do
       add First(α) to First(X)
  4. for each X -> A1 … An s.t. ε ∈ First(Ai), 1 ≤ i ≤ n, do
       add ε to First(X)
  5. repeat steps 3 and 4 until no First set can be grown
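The fixpoint iteration can be sketched in C for the running grammar E -> T X, X -> + E | ε, T -> ( E ) | int Y, Y -> * T | ε, representing each First set as a bitmask over the terminals + * ( ) int, with an extra bit for ε. Everything here is an illustrative assumption, not slide code.

```c
#include <string.h>

static const char *terms = "+*()i";   /* 'i' abbreviates int */
static const char *nts   = "EXTY";
static const char *prods[] = { "E:TX", "X:+E", "X:", "T:(E)", "T:iY", "Y:*T", "Y:" };
#define EPS (1u << 5)                 /* bit 5 records epsilon */
static unsigned first[4];             /* First(E), First(X), First(T), First(Y) */

static unsigned first_of(char s) {    /* First of one grammar symbol */
    const char *t = strchr(terms, s);
    if (t) return 1u << (t - terms);  /* step 1: First(t) = { t } */
    return first[strchr(nts, s) - nts];
}

static void compute_first(void) {
    int changed = 1;
    while (changed) {                 /* repeat until no First set grows */
        changed = 0;
        for (int i = 0; i < 7; i++) {
            unsigned *dst = &first[strchr(nts, prods[i][0]) - nts];
            unsigned old = *dst, add = 0;
            int all_nullable = 1;
            for (const char *s = prods[i] + 2; *s; s++) {
                unsigned f = first_of(*s);
                add |= f & ~EPS;                  /* First(sym) minus epsilon */
                if (!(f & EPS)) { all_nullable = 0; break; }
            }
            if (all_nullable) add |= EPS;         /* whole rhs can vanish */
            *dst |= add;
            if (*dst != old) changed = 1;
        }
    }
}
```

Running compute_first() reproduces the sets on the example slide: First(E) = First(T) = { (, int }, First(X) = { +, ε }, First(Y) = { *, ε }.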
First Sets. Example

• Recall the grammar
    E -> T X              X -> + E | ε
    T -> ( E ) | int Y    Y -> * T | ε
• First sets:
    First( ( )   = { ( }        First( T ) = { int, ( }
    First( ) )   = { ) }        First( E ) = { int, ( }
    First( int ) = { int }      First( X ) = { +, ε }
    First( + )   = { + }        First( Y ) = { *, ε }
    First( * )   = { * }

Computing Follow Sets

• Definition: Follow(X) = { t | S ⇒* β X t δ }
• Intuition
  – If S is the start symbol then $ ∈ Follow(S)
  – If X -> A B then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B)
  – Also, if B ⇒* ε then Follow(X) ⊆ Follow(A)
Computing Follow Sets

• Algorithm sketch:
  1. Follow(S) <- { $ }
  2. For each production A -> α X β
       add First(β) − {ε} to Follow(X)
  3. For each production A -> α X β where ε ∈ First(β)
       add Follow(A) to Follow(X)
  – repeat steps 2 and 3 until no Follow set grows

Follow Sets. Example

• Recall the grammar
    E -> T X              X -> + E | ε
    T -> ( E ) | int Y    Y -> * T | ε
• Follow sets:
    Follow( + )   = { int, ( }     Follow( * ) = { int, ( }
    Follow( ( )   = { int, ( }     Follow( E ) = { ), $ }
    Follow( X )   = { $, ) }       Follow( T ) = { +, ), $ }
    Follow( ) )   = { +, ), $ }    Follow( Y ) = { +, ), $ }
    Follow( int ) = { *, +, ), $ }
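The Follow-set algorithm admits a similar sketch in C for the non-terminals of this grammar, taking the First sets from the First-set example slide as given inputs (again a bitmask encoding over + * ( ) int $; all names are assumptions).

```c
#include <string.h>

static const char *terms = "+*()i$";      /* 'i' abbreviates int */
static const char *nts   = "EXTY";
static const char *prods[] = { "E:TX", "X:+E", "X:", "T:(E)", "T:iY", "Y:*T", "Y:" };
#define EPS (1u << 6)
/* First(E)={(,int}, First(X)={+,eps}, First(T)={(,int}, First(Y)={*,eps} */
static const unsigned FIRST[4] =
    { (1u<<2)|(1u<<4), (1u<<0)|EPS, (1u<<2)|(1u<<4), (1u<<1)|EPS };
static unsigned follow[4];                /* Follow(E), (X), (T), (Y) */

static unsigned first_of(char s) {
    const char *t = strchr(terms, s);
    return t ? 1u << (t - terms) : FIRST[strchr(nts, s) - nts];
}

static void compute_follow(void) {
    follow[0] = 1u << 5;                  /* step 1: $ in Follow(E), the start */
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int i = 0; i < 7; i++) {
            int A = strchr(nts, prods[i][0]) - nts;
            for (const char *s = prods[i] + 2; *s; s++) {
                const char *nt = strchr(nts, *s);
                if (!nt) continue;        /* Follow is only for non-terminals */
                int X = nt - nts;
                unsigned old = follow[X];
                int beta_nullable = 1;
                for (const char *b = s + 1; *b; b++) {  /* beta after X */
                    unsigned f = first_of(*b);
                    follow[X] |= f & ~EPS;    /* step 2: First(beta) - {eps} */
                    if (!(f & EPS)) { beta_nullable = 0; break; }
                }
                if (beta_nullable)
                    follow[X] |= follow[A];   /* step 3: add Follow(A) */
                if (follow[X] != old) changed = 1;
            }
        }
    }
}
```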
Constructing LL(1) Parsing Tables

• Construct a parsing table T for CFG G
• For each production A -> α in G do:
  – For each terminal t ∈ First(α) do
      T[A, t] = α
  – If ε ∈ First(α), then for each t ∈ Follow(A) do
      T[A, t] = α
  – If ε ∈ First(α) and $ ∈ Follow(A) do
      T[A, $] = α

Notes on LL(1) Parsing Tables

• If any entry is multiply defined then G is not LL(1)
• Reasons why a grammar is not LL(1) include
  – G is ambiguous
  – G is left recursive
  – G is not left-factored
• Most programming language grammars are not strictly LL(1)
• There are tools that build LL(1) tables
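Applying these rules by hand to the running grammar gives the table from the earlier example. The sketch below transcribes that construction in C, using an empty string for an ε action and NULL for a blank (error) cell; 'i' abbreviates int, and all names are assumptions.

```c
#include <string.h>

static const char *toks = "+*()i$";
static const char *nts  = "EXTY";
static const char *table[4][6];                  /* [non-terminal][token] */

static void set_cell(char A, char t, const char *rhs) {
    table[strchr(nts, A) - nts][strchr(toks, t) - toks] = rhs;
}

static void build_table(void) {
    /* E -> T X : place under First(T X) = { (, int } */
    set_cell('E', '(', "T X");  set_cell('E', 'i', "T X");
    /* X -> + E : First = { + };  X -> eps : every t in Follow(X) = { ), $ } */
    set_cell('X', '+', "+ E");  set_cell('X', ')', "");  set_cell('X', '$', "");
    /* T -> ( E ) : First = { ( };  T -> int Y : First = { int } */
    set_cell('T', '(', "( E )");  set_cell('T', 'i', "int Y");
    /* Y -> * T : First = { * };  Y -> eps : every t in Follow(Y) = { +, ), $ } */
    set_cell('Y', '*', "* T");
    set_cell('Y', '+', "");  set_cell('Y', ')', "");  set_cell('Y', '$', "");
}
```

Since no set_cell call targets an already-filled cell, no entry is multiply defined, confirming that the left-factored grammar is LL(1).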
Bottom-up Parsing

• YACC uses bottom-up parsing
• There are two important operations that bottom-up parsers use: shift and reduce
  – In abstract terms, we simulate a pushdown automaton (a finite state automaton plus a stack)
• Input: the string to be parsed and the set of productions
• Goal: Trace a rightmost derivation in reverse by starting with the input string and working backwards to the start symbol
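Shift and reduce can be illustrated on the earlier left-recursive grammar E -> E + Num | Num, which bottom-up parsers handle directly. A minimal sketch, with 'n' standing for a Num token (the function name and encoding are assumptions): tokens are shifted onto a stack, and the stack top is reduced whenever it matches a rule's right-hand side (a handle).

```c
static int shift_reduce(const char *input) {     /* e.g. "n+n+n" */
    char stack[64];
    int top = 0;                                 /* stack is stack[0..top-1] */
    for (;;) {
        if (top >= 3 && stack[top-3] == 'E' && stack[top-2] == '+'
                     && stack[top-1] == 'n') {
            top -= 2;                            /* reduce by E -> E + Num */
            stack[top-1] = 'E';
        } else if (top >= 1 && stack[top-1] == 'n') {
            stack[top-1] = 'E';                  /* reduce by E -> Num */
        } else if (*input) {
            stack[top++] = *input++;             /* shift the next token */
        } else {
            break;                               /* no shift or reduce possible */
        }
    }
    return top == 1 && stack[0] == 'E';          /* reduced to the start symbol? */
}
```

Note that the reductions trace the rightmost derivation E => E + Num => E + Num + Num => … in reverse, exactly as the slide describes.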