Parsing

• A grammar describes the syntactically legal strings in a language
• A recogniser simply accepts or rejects strings
• A generator produces strings
• A parser constructs a parse tree for a string
• Two common types of parsers:
  – bottom-up, or data driven
  – top-down, or hypothesis driven
• A top-down parser for simple grammars is easy to implement by hand

Top down vs. bottom up parsing

• The parsing problem is to connect the root node S with the tree leaves, the input (e.g., A = 1 + 3 * 4 / 5)
• Top-down parsers start constructing the parse tree at the top (root) and move down towards the leaves. Easy to implement by hand, but requires restricted grammars. E.g.:
  – predictive parsers (e.g., LL(k))
• Bottom-up parsers build nodes on the bottom of the parse tree first. Suitable for automatic parser generation; handles a larger class of grammars. E.g.:
  – shift-reduce parsers (or LR(k) parsers)
• Both are general techniques that can be made to work for all languages (but not all grammars!)
• Recall that a given language can be described by several grammars
• Both of these grammars describe the same language:

    E -> E + Num        E -> Num + E
    E -> Num            E -> Num

• The first one, with its left recursion, causes problems for top-down parsers
• For a given parsing technique, we may have to transform the grammar to work with it

Parsing complexity

• How hard is the parsing task? How do we measure that?
• Parsing an arbitrary CFG is O(n3) -- it can take time proportional to the cube of the number of input symbols
• This is bad! (why?)
• If we constrain the grammar somewhat, we can always parse in linear time. This is good! (why?)
• Linear-time parsing
  – LL parsers
    • Recognize LL grammars
    • Use a top-down strategy
  – LR parsers
    • Recognize LR grammars
    • Use a bottom-up strategy
• LL(n): Left to right, Leftmost derivation, look ahead at most n symbols
• LR(n): Left to right, Rightmost derivation, look ahead at most n symbols

Top Down Parsing Methods

• Simplest method is a full-backup, recursive descent parser
• Often used for parsing simple languages
• Write recursive recognizers (subroutines) for each grammar rule
  – If a rule succeeds, perform some action (i.e., build a tree node, emit code, etc.)
  – If a rule fails, return failure. Caller may try another choice or fail
  – On failure it "backs up"

Top Down Parsing Methods: Problems

• When going forward, the parser consumes tokens from the input, so what happens if we have to back up?
  – suggestions?
• Algorithms that use backup tend to be, in general, inefficient
  – There might be a large number of possibilities to try before finding the right one or giving up
• Grammar rules which are left-recursive lead to non-termination!

Recursive Descent Parsing: Example

For the grammar:

    <term> -> <factor> {(*|/) <factor>}*

we could use the following recursive descent parsing subprogram (this one is written in C):

    void term() {
        factor();    /* parse first factor */
        while (next_token == ast_code ||
               next_token == slash_code) {
            lexical();   /* get next token */
            factor();    /* parse next factor */
        }
    }
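To make the fragment above runnable, here is a self-contained sketch. The slide leaves lexical(), factor(), and the token codes undefined, so minimal assumed definitions are filled in: tokens are single characters and a factor is one digit.

```c
/* Self-contained sketch of the slide's <term> recognizer.
 * Assumptions (not from the slide): single-character tokens,
 * a <factor> is a single digit, end of input is token 0. */
#include <assert.h>

static const char *input;   /* remaining input */
static int next_token;      /* one-token lookahead */
static int ok;              /* did the parse succeed? */

enum { ast_code = '*', slash_code = '/', digit_code = 'd', end_code = 0 };

static void lexical(void) {            /* get next token */
    char c = *input;
    if (c == '\0')                 next_token = end_code;
    else if (c >= '0' && c <= '9') { next_token = digit_code; input++; }
    else                           { next_token = c;          input++; }
}

static void factor(void) {             /* <factor> -> digit */
    if (next_token == digit_code) lexical();
    else ok = 0;                       /* report failure */
}

static void term(void) {               /* <term> -> <factor> {(*|/) <factor>}* */
    factor();                          /* parse first factor */
    while (next_token == ast_code || next_token == slash_code) {
        lexical();                     /* consume the * or / */
        factor();                      /* parse next factor */
    }
}

int parse_term(const char *s) {        /* 1 iff s is a valid <term> */
    input = s; ok = 1;
    lexical();                         /* prime the lookahead */
    term();
    return ok && next_token == end_code;
}
```

Note that because this grammar uses iteration ({…}*) rather than left recursion, no backup is ever needed.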

Problems

• Some grammars cause problems for top down parsers
• Top down parsers do not work with left-recursive grammars
  – E.g., one with a rule like: E -> E + T
  – We can transform a left-recursive grammar into one which is not
• A top down grammar can limit backtracking if it only has one rule per non-terminal
  – The technique of rule factoring can be used to eliminate multiple rules for a non-terminal

Left-recursive grammars

• A grammar is left recursive if it has rules like

    X -> X α

• Or if it has indirect left recursion, as in

    X -> A β
    A -> X δ

• Q: Why is this a problem?
  – A: it can lead to non-terminating recursion!

Direct Left-Recursive Grammars

• Consider the left-recursive grammar

    S -> S α
    S -> β

• S generates strings

    β
    β α
    β α α
    …

• Rewrite using right-recursion:

    S -> β S'
    S' -> α S' | ε

Elimination of Direct Left-Recursion

• Concretely, consider

    T -> T + id
    T -> id

• T generates strings

    id
    id + id
    id + id + id
    …

• Rewrite using right-recursion:

    T -> id T'
    T' -> + id T' | ε

• We can manually or automatically rewrite a grammar removing left-recursion, making it OK for a top-down parser
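The rewritten grammar can be run directly by a recursive descent parser. A minimal sketch, assuming single-character tokens where 'i' stands for id:

```c
/* Recognizer for the right-recursive rewrite:
 *   T  -> id T'
 *   T' -> + id T' | eps
 * The left-recursive original (T -> T + id) would recurse forever. */
#include <assert.h>

static const char *p;                /* cursor into the input */

static int t_prime(void) {           /* T' -> + id T' | eps */
    if (*p == '+') {
        p++;
        if (*p != 'i') return 0;     /* expect id after + */
        p++;
        return t_prime();
    }
    return 1;                        /* eps: match nothing */
}

static int t(void) {                 /* T -> id T' */
    if (*p != 'i') return 0;
    p++;
    return t_prime();
}

int parse_t(const char *s) {         /* 1 iff s matches T exactly */
    p = s;
    return t() && *p == '\0';
}
```

The rewrite preserves the language: both grammars generate id, id+id, id+id+id, and so on.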

General Left Recursion

• The grammar

    S -> A α | δ
    A -> S β

  is also left-recursive because

    S ->+ S β α

  where ->+ means "can be rewritten in one or more steps"
• This indirect left-recursion can also be automatically eliminated (not covered)

Summary of Recursive Descent

• Simple and general parsing strategy
  – Left-recursion must be eliminated first
  – … but that can be done automatically
• Unpopular because of backtracking
  – Thought to be too inefficient
• In practice, backtracking is eliminated by further restricting the grammar to allow us to successfully predict which rule to use

Predictive Parsers

• That there can be many rules for a non-terminal is what makes parsing hard
• A predictive parser processes the input stream, typically from left to right
  – Is there any other way to do it? Yes, for programming languages!
• It uses information from peeking ahead at the upcoming terminal symbols to decide which grammar rule to use next
• And it always makes the right choice of which rule to use
• How far it can peek ahead is an issue
• An important class of predictive parsers peek ahead only one token into the stream
• An LL(k) parser does a Left-to-right parse, a Leftmost-derivation, and k-symbol lookahead
• Grammars where one can decide which rule to use by examining only the next token are LL(1)
• LL(1) grammars are widely used in practice
  – The syntax of a PL can usually be adjusted to enable it to be described with an LL(1) grammar

Predictive Parser

Example: consider the grammar

    S -> if E then S else S
    S -> begin S L
    S -> print E
    L -> end
    L -> ; S L
    E -> num = num

An S expression starts with either an IF, BEGIN, or PRINT token; an L expression starts with an END or a SEMICOLON token; and an E expression has only one production.

Remember…

• Given a grammar and a string in the language defined by the grammar …
• There may be more than one way to derive the string leading to the same parse tree
  – It depends on the order in which you apply the rules
  – And on what parts of the string you choose to rewrite next
• All of the derivations are valid
• To simplify the problem and the algorithms, we often focus on one of two simple derivation strategies
  – A leftmost derivation
  – A rightmost derivation
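For this grammar, one token of lookahead always picks the right rule. A sketch in C, with tokens abbreviated to single characters (an assumption for the demo): i=if, t=then, l=else, b=begin, p=print, e=end, n=num.

```c
/* Predictive recognizer for the slide's grammar: the first
 * token of S (i, b, or p) and of L (e or ;) selects the rule. */
#include <assert.h>

static const char *p;

static int tok(char c) { if (*p == c) { p++; return 1; } return 0; }

static int S(void);                      /* forward declaration */

static int E(void) {                     /* E -> num = num */
    return tok('n') && tok('=') && tok('n');
}

static int L(void) {
    if (tok('e')) return 1;              /* L -> end */
    if (tok(';')) return S() && L();     /* L -> ; S L */
    return 0;
}

static int S(void) {
    if (tok('i'))                        /* S -> if E then S else S */
        return E() && tok('t') && S() && tok('l') && S();
    if (tok('b'))                        /* S -> begin S L */
        return S() && L();
    if (tok('p'))                        /* S -> print E */
        return E();
    return 0;
}

int parse_S(const char *s) { p = s; return S() && *p == '\0'; }
```

No backtracking ever occurs: each rule's first token is distinct, which is exactly the LL(1) property.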

LL(k) and LR(k) parsers

• Two important parser classes are LL(k) and LR(k)
• The name LL(k) means:
  – L: Left-to-right scanning of the input
  – L: Constructing a leftmost derivation
  – k: max # of input symbols needed to predict parser action
• The name LR(k) means:
  – L: Left-to-right scanning of the input
  – R: Constructing a rightmost derivation in reverse
  – k: max # of input symbols needed to select parser action
• An LR(1) or LL(1) parser never needs to "look ahead" more than one input token to know what production rule applies

Predictive Parsing and Left Factoring

• Consider the grammar

    E -> T + E
    E -> T
    T -> int
    T -> int * T
    T -> ( E )

• Even if left recursion is removed, a grammar may not be parsable with an LL(1) parser
• This grammar is hard to predict because
  – For T, two productions start with int
  – For E, it is not clear how to predict which rule to use
• Must left-factor the grammar before use for predictive parsing
• Left-factoring involves rewriting rules so that, if a non-terminal has > 1 rule, each begins with a terminal

Left-Factoring Example

Add new non-terminals X and Y to factor out common prefixes of rules:

    E -> T + E          E -> T X
    E -> T              X -> + E
    T -> int            X -> ε
    T -> int * T        T -> ( E )
    T -> ( E )          T -> int Y
                        Y -> * T
                        Y -> ε

For each non-terminal in the revised grammar, there is either only one rule or every rule begins with a terminal or ε.

Using Parsing Tables

• LL(1) means that for each non-terminal and token there is only one production
• Can be represented as a simple table
  – One dimension for the current non-terminal to expand
  – One dimension for the next token
  – A table entry contains one rule's action, or is empty if error
• Method similar to recursive descent, except
  – For each non-terminal S
  – We look at the next token a
  – And choose the production shown at table cell [S, a]
• Use a stack to keep track of pending non-terminals
• Reject when we encounter an error state; accept when we encounter end-of-input

E  T X LL(1) Parsing Table Example LL(1) Parsing Table Example X  + E |  T  ( E ) | int Y Left-factored grammar •Consider the [E, int] entry Y  * T |  E  T X – “When current non-terminal is E & next input int, use production E T X” X  + E |  – It’s the only production that can generate an int in next place T  ( E ) | int Y •Consider the [Y, +] entry Y  * T |  – “When current non-terminal is Y and current token is +, get rid of Y” – Y can be followed by + only in a derivation where Y End of input symbol •Consider the [E, *] entry – Blank entries indicate error situations The LL(1) parsing table – “There is no way to derive a string starting with * from non-terminal E” int * + ( ) $ E T X T X X + E   int * + ( ) $ T int Y ( E ) E T X T X Y * T    X + E   T int Y ( E ) Y * T   

LL(1) Parsing Algorithm

    initialize stack = <S $> and next
    repeat
      case stack of
        <X, rest> : if T[X, *next] = Y1…Yn
                      then stack <- <Y1…Yn rest>
                      else error();
        <t, rest> : if t == *next++
                      then stack <- <rest>
                      else error();
    until stack == < >

where:
  (1) next points to the next input token
  (2) X matches some non-terminal
  (3) t matches some terminal

LL(1) Parsing Example

Using the table above for the grammar E -> T X, X -> + E | ε, T -> ( E ) | int Y, Y -> * T | ε:

    Stack          Input          Action
    E $            int * int $    pop(); push(T X)
    T X $          int * int $    pop(); push(int Y)
    int Y X $      int * int $    pop(); next++
    Y X $          * int $        pop(); push(* T)
    * T X $        * int $        pop(); next++
    T X $          int $          pop(); push(int Y)
    int Y X $      int $          pop(); next++
    Y X $          $              pop()
    X $            $              pop()
    $              $              ACCEPT!
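The stack-plus-table algorithm can be sketched as a small C program. Symbol names are single characters chosen for the demo ('i' stands for int); the table is encoded as a function rather than a 2-D array, but the entries match the table above.

```c
/* Table-driven LL(1) parser for E->TX, X->+E|eps, T->(E)|int Y,
 * Y->*T|eps. Nonterminals: E X T Y; terminals: i * + ( ) $. */
#include <assert.h>
#include <string.h>

/* T[nonterminal][token] -> right-hand side, or NULL on error */
static const char *rule(char nt, char tok) {
    switch (nt) {
    case 'E': return (tok=='i'||tok=='(') ? "TX" : NULL;
    case 'X': return tok=='+' ? "+E" : (tok==')'||tok=='$') ? "" : NULL;
    case 'T': return tok=='i' ? "iY" : tok=='(' ? "(E)" : NULL;
    case 'Y': return tok=='*' ? "*T" : (tok=='+'||tok==')'||tok=='$') ? "" : NULL;
    }
    return NULL;
}

int ll1_parse(const char *input) {      /* input must end with '$' */
    char stack[128] = "E$";             /* start symbol on top, then $ */
    const char *next = input;
    while (stack[0] != '\0') {
        char top = stack[0];
        if (strchr("EXTY", top)) {      /* nonterminal: expand via table */
            const char *rhs = rule(top, *next);
            if (!rhs) return 0;         /* blank entry: error */
            char rest[128];
            strcpy(rest, stack + 1);    /* pop() */
            strcpy(stack, rhs);         /* push(rhs) */
            strcat(stack, rest);
        } else {                        /* terminal: must match input */
            if (top != *next) return 0;
            next++;                     /* next++ */
            memmove(stack, stack + 1, strlen(stack));
        }
    }
    return *next == '\0';               /* ACCEPT when both are consumed */
}
```

Running ll1_parse("i*i$") performs exactly the pop/push sequence traced in the example.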

Constructing Parsing Tables

• No table entry can be multiply defined
• If A -> α, where in the row for A do we place α?
• In column t, where t can start a string derived from α
  – α ->* t β
  – We say that t ∈ First(α)
• In column t, if α is ε and t can follow an A
  – S ->* β A t δ
  – We say that t ∈ Follow(A)

Computing First Sets

Definition: First(X) = { t | X ->* t α } ∪ { ε | X ->* ε }

Algorithm sketch (see book for details):
  1. for all terminals t do First(t) <- { t }
  2. for each production X -> ε do First(X) <- { ε }
  3. if X -> A1 … An α and ε ∈ First(Ai), 1 ≤ i ≤ n,
     then add First(α) to First(X)
  4. for each X -> A1 … An s.t. ε ∈ First(Ai), 1 ≤ i ≤ n,
     then add ε to First(X)
  5. repeat steps 3 and 4 until no First set can be grown

First Sets. Example

Recall the grammar

    E -> T X            X -> + E | ε
    T -> ( E ) | int Y  Y -> * T | ε

First sets:

    First( ( )   = { ( }        First( T ) = { int, ( }
    First( ) )   = { ) }        First( E ) = { int, ( }
    First( int ) = { int }      First( X ) = { +, ε }
    First( + )   = { + }        First( Y ) = { *, ε }
    First( * )   = { * }

Computing Follow Sets

• Definition:
    Follow(X) = { t | S ->* β X t δ }
• Intuition
  – If S is the start symbol, then $ ∈ Follow(S)
  – If X -> A B then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B)
  – Also, if B ->* ε then Follow(X) ⊆ Follow(A)
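The First-set fixpoint can be sketched in C for this grammar, representing each set as a bit mask. The terminal encoding and the grammar tables are assumptions made for the demo; the computed sets match the ones listed above.

```c
/* Fixpoint computation of First sets for:
 *   E->TX  X->+E|eps  T->(E)|iY  Y->*T|eps   (i stands for int) */
#include <assert.h>

enum { T_i = 1, T_lp = 2, T_rp = 4, T_plus = 8, T_star = 16, T_eps = 32 };
enum { E, X, T, Y, NT };             /* nonterminal indices */

unsigned first[NT];                  /* First sets as bit masks */

/* First of a right-hand side over the symbols {E,X,T,Y,i,(,),+,*} */
static unsigned first_of(const char *rhs) {
    unsigned f = 0;
    for (; *rhs; rhs++) {
        unsigned s;
        switch (*rhs) {              /* a terminal contributes itself */
        case 'i': return f | T_i;
        case '(': return f | T_lp;
        case ')': return f | T_rp;
        case '+': return f | T_plus;
        case '*': return f | T_star;
        case 'E': s = first[E]; break;
        case 'X': s = first[X]; break;
        case 'T': s = first[T]; break;
        default : s = first[Y]; break;
        }
        f |= s & ~T_eps;             /* add First(Ai) minus eps */
        if (!(s & T_eps)) return f;  /* Ai cannot vanish: stop here */
    }
    return f | T_eps;                /* the whole rhs can derive eps */
}

void compute_first(void) {
    static const struct { int lhs; const char *rhs; } g[] = {
        {E,"TX"}, {X,"+E"}, {X,""}, {T,"(E)"}, {T,"iY"}, {Y,"*T"}, {Y,""},
    };
    int changed = 1;
    while (changed) {                /* iterate to a fixed point */
        changed = 0;
        for (int k = 0; k < 7; k++) {
            unsigned add = first_of(g[k].rhs);
            if ((first[g[k].lhs] | add) != first[g[k].lhs]) {
                first[g[k].lhs] |= add;
                changed = 1;
            }
        }
    }
}
```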

Computing Follow Sets

Algorithm sketch:
  1. Follow(S) <- { $ }
  2. For each production A -> α X β
     • add First(β) - {ε} to Follow(X)
  3. For each A -> α X β where ε ∈ First(β)
     • add Follow(A) to Follow(X)
  • repeat steps 2 and 3 until no Follow set grows

Follow Sets. Example

• Recall the grammar

    E -> T X            X -> + E | ε
    T -> ( E ) | int Y  Y -> * T | ε

• Follow sets:

    Follow( + )   = { int, ( }     Follow( * ) = { int, ( }
    Follow( ( )   = { int, ( }     Follow( E ) = { ), $ }
    Follow( X )   = { $, ) }       Follow( T ) = { +, ), $ }
    Follow( ) )   = { +, ), $ }    Follow( Y ) = { +, ), $ }
    Follow( int ) = { *, +, ), $ }

Constructing LL(1) Parsing Tables

• Construct a parsing table T for CFG G
• For each production A -> α in G do:
  – For each terminal t ∈ First(α) do
    • T[A, t] = α
  – If ε ∈ First(α), for each t ∈ Follow(A) do
    • T[A, t] = α
  – If ε ∈ First(α) and $ ∈ Follow(A) do
    • T[A, $] = α

Notes on LL(1) Parsing Tables

• If any entry is multiply defined then G is not LL(1)
• Reasons why a grammar is not LL(1) include
  – G is ambiguous
  – G is left recursive
  – G is not left-factored
• Most programming language grammars are not strictly LL(1)
• There are tools that build LL(1) tables

Bottom-up Parsing

• YACC uses bottom-up parsing. There are two important operations that bottom-up parsers use: shift and reduce
  – In abstract terms, we do a simulation of a Push Down Automaton as a finite state automaton
• Input: the string to be parsed and the set of productions
• Goal: Trace a rightmost derivation in reverse by starting with the input string and working backwards to the start symbol
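A minimal sketch of shift and reduce, for an assumed toy grammar E -> E + i | i (i stands for int). This is not YACC's algorithm: a real LR parser drives shifts and reductions from a generated state machine, while this demo simply pattern-matches handles on top of the stack.

```c
/* Shift-reduce recognizer for E -> E + i | i, tracing a rightmost
 * derivation in reverse: shift tokens onto a stack, and reduce
 * whenever a rule's right-hand side sits on top of the stack. */
#include <assert.h>

int shift_reduce(const char *input) {
    char stack[128];
    int top = 0;                        /* stack[0..top) holds symbols */
    for (;;) {
        /* reduce while a handle is on top of the stack */
        if (top >= 1 && stack[top-1] == 'i' &&
            (top == 1 || (top >= 3 && stack[top-2] == '+'
                                   && stack[top-3] == 'E'))) {
            if (top == 1) top = 0;      /* reduce by E -> i    */
            else          top -= 3;     /* reduce by E -> E+i  */
            stack[top++] = 'E';
        } else if (*input) {
            stack[top++] = *input++;    /* shift the next token */
        } else {
            break;                      /* no input left, no handle */
        }
    }
    return top == 1 && stack[0] == 'E'; /* accept on lone start symbol */
}
```

Note the grammar is left-recursive, which is no problem bottom-up: on "i+i" the stack evolves i -> E -> E+ -> E+i -> E, i.e., the rightmost derivation E => E+i => i+i read backwards.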
