
Parsing Technologies

Outline

Simple top-down and bottom-up stack-based parsing
    Top Down Parsing
    Bottom Up Parsing


The basic parsing task

◮ Given a grammar G, a category x and an input string w1 . . . wn, the job of a parser is to discover whether G categorises w1 . . . wn as x,
◮ or equivalently, whether it permits any analysis tree whose topmost node is x and whose leaves are w1 . . . wn.
◮ Variants of this:
    ◮ find all parse trees, if there is more than one
    ◮ find also the x's which categorise the input, rather than assuming this is given

Top-down vs Bottom-up

◮ There are many ways a parser might manage the search process.
◮ If a parser expands a tree down towards its leaves it is said to be working top-down.
◮ By contrast a bottom-up parser fuses subtrees together with the aim of making a single encompassing tree.

As a beginning example take the following grammar:

    s ⇒ sadv, s
    s ⇒ np, vp
    np ⇒ john
    vp ⇒ iv
    iv ⇒ walks
    sadv ⇒ maybe

maybe john walks is an s according to this grammar, with the analysis

    [s [sadv maybe] [s [np john] [vp [iv walks]]]]

◮ A top-down parser will effectively derive a tree in a succession of stages, starting with just a single node s-tree and ending with the complete tree
◮ At every stage of this process of tree derivation, there are choices to be made
◮ One choice is which node to expand
◮ the other choice is how to expand each node

Which node to work on?

To illustrate the first kind of choice, consider the following two derivations:

[Figure: two step-by-step derivations of the tree for maybe john walks, each starting from a single s node and ending with the complete tree.]

◮ In the first derivation, there is a system to the way the tree is grown
◮ in the second derivation, the tree growth is random.
◮ in the first derivation at every step the leftmost expandable leaf node is expanded
◮ The key fact is this: if there is an analysis tree for some input, then it can be generated by applying leftmost expansion

◮ so an algorithm to explore the space of tree derivations can restrict attention to the derivations which use leftmost expansion
◮ this means the 'which node' source of choice can be eliminated: always deterministically choose the leftmost unexpanded node.
◮ there is still the other source of choice, of non-determinism: more than one way to expand a given node. This still has to be dealt with, but to begin we will get familiar with the deterministic case.

The frontier

Summarising the first derivation as a series of snapshots of the leaf nodes, and using the term Frontier for the subset of the leaf-nodes which are expandable, you have:

    Leaf nodes          Frontier
    s                   s
    sadv s              sadv s
    maybe s             s
    maybe np vp         np vp
    maybe john vp       vp
    maybe john iv       iv
    maybe john walks    (empty)

The frontier as a stack

◮ Because of the choice to always take the leftmost unexpanded node, the frontier operates in the fashion of a stack, with a last-in/first-out (LIFO) behaviour.
◮ You can keep adding to the top of a stack (pushing),
◮ and it's the most recently added things that you can remove (popping) and replace (more pushing).

◮ this leads to the idea that one can manage the search through the space of possible tree derivations by managing a search through a space of possible stack states.
◮ can now give an outline of a top-down algorithm.

Let w be an array representing the input,
let i be the index of the current word,
use F for the frontier of nodes in the tree that are due to be expanded.

Top-down parsing algorithm (without backtracking)

    set F to start symbol, progress indicator i = 0

    MOVES:
    let A = top(F)
    loop thru the rules {
      if (rule is A → w[i]) {               // LEAF CANCELLATION
        pop top of F
        set i = i+1
        goto MOVES
      }
      else if (rule is A → D1 . . . Dn) {   // LEFT EXPANSION
        pop top of F
        push Dn ... push D1                 // note order
        goto MOVES
      }
    }

    YES_NO:
    if ((F is empty) && (i == size of input)) { succeed }
    else { fail }

An example

parsing the man hit the dog (top of stack shown at left), with the grammar:

    s → np, vp
    np → det, n
    det → the
    n → man
    n → dog
    vp → tv, np
    tv → hit

    WORDS                  STACK
    the man hit the dog    s
    the man hit the dog    np vp
    the man hit the dog    det n vp
    man hit the dog        n vp
    hit the dog            vp
    hit the dog            tv np
    the dog                np
    the dog                det n
    dog                    n
    (empty)                (empty)
    SUCCEED

About the top-down algorithm

◮ algorithm keeps looking for a move it can make to update its progress through the input and the stack of categories F.
◮ first kind of move, leaf cancellation, recognises that the top of the stack represents a node which could have the current word attached underneath it. Doing so removes a category off the stack and moves progress through the input by 1.
◮ second kind of move, left expansion, recognises that the top of the stack represents a node which could have a sequence of daughters corresponding to the right-hand side of a rule attached underneath it.
◮ in checking if a move is possible, the grammar rules are considered in order from top to bottom
◮ note in left expansion rules daughters must be pushed in a last-to-first order, to guarantee that the first daughter ends up on top of the stack.
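The deterministic top-down algorithm can be sketched in C++ roughly as follows. This is a minimal illustration, not the C++ code the slides refer to: the `Rule` type, its `lexical` flag, and the names `manDogGrammar` and `topDownRecognise` are assumptions made here. It also assumes, as the algorithm above does, that the first applicable rule is always taken, so it would loop forever on a left-recursive grammar.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule type: a lexical rule is A -> word, a syntactic rule is
// A -> D1 ... Dn where every Dk is a category.
struct Rule {
    std::string lhs;
    std::vector<std::string> rhs;
    bool lexical;
};

// The grammar of the "the man hit the dog" example, in the same rule order.
std::vector<Rule> manDogGrammar() {
    return {
        {"s", {"np", "vp"}, false}, {"np", {"det", "n"}, false},
        {"det", {"the"}, true},     {"n", {"man"}, true},
        {"n", {"dog"}, true},       {"vp", {"tv", "np"}, false},
        {"tv", {"hit"}, true},
    };
}

// Deterministic top-down recognition: no backtracking, first applicable rule wins.
bool topDownRecognise(const std::vector<Rule>& rules, const std::string& start,
                      const std::vector<std::string>& w) {
    std::vector<std::string> F{start};  // frontier; back() is the top of the stack
    std::size_t i = 0;                  // progress through the input
    for (;;) {
        bool moved = false;
        if (!F.empty()) {
            const std::string A = F.back();
            for (const Rule& r : rules) {
                assert(!r.rhs.empty());
                if (r.lhs != A) continue;
                if (r.lexical) {
                    // LEAF CANCELLATION: only if the rule's word is the current word
                    if (i < w.size() && r.rhs[0] == w[i]) {
                        F.pop_back();
                        ++i;
                        moved = true;
                        break;
                    }
                } else {
                    // LEFT EXPANSION: push Dn ... D1 so that D1 ends up on top
                    F.pop_back();
                    for (auto it = r.rhs.rbegin(); it != r.rhs.rend(); ++it) {
                        F.push_back(*it);
                    }
                    moved = true;
                    break;
                }
            }
        }
        if (!moved) break;  // no move possible: fall through to the yes/no test
    }
    return F.empty() && i == w.size();
}
```

Calling `topDownRecognise(manDogGrammar(), "s", {"the", "man", "hit", "the", "dog"})` follows the same sequence of stack states as the example trace. Keeping the top of the stack at the back of a `std::vector` is a design choice made here so that push and pop are cheap.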

C++ code

this is a more detailed spelling out of the top-down parser as C++ code

What about rule choice?

◮ Often more than one move will be possible
◮ So need either
    ◮ a mechanism for exploring all choices: backtracking
    ◮ or a way to guide choices correctly by referring to something other than just the top of the stack

Adding backtracking

The backtracking idea is very simple
◮ when the parser is about to make a move [1] it pushes on to a history stack:
    the current progress,
    the current frontier,
    and a record of the move being made.
◮ when the parser runs into a dead-end it
    pops the most recently added history item,
    restores the progress and frontier from this,
    considers alternative moves later than the move which was stored.

[1] for which there might be alternatives

Top down with backtracking

    set F to start symbol, progress indicator i = 0

    TRY_AGAIN:
    MOVES:
    let A = top(F)
    loop thru the rules
      if restored from H, start after recorded rule
    {
      if (rule is A → w[i]) {               // LEAF CANCELLATION
        add (F, i, (A → w[i])) to H
        pop top of F; set i = i+1
        goto MOVES
      }
      else if (rule is A → D1 . . . Dn) {   // LEFT EXPANSION
        add (F, i, (A → D1 . . . Dn)) to H
        pop top of F; push Dn ... push D1
        goto MOVES
      }
    }

    YES_BACKTRACK_NO:
    if ((F is empty) && (i == size of input)) { succeed }
    else if (H is not empty) {
      pop top of H; restore F and i from this;
      goto TRY_AGAIN
    }
    else { fail }

Top-down backtracking example

suppose the grammar:

    s --> np,vp
    s --> sadv,s
    np --> [john]
    np --> det,n
    np --> n
    vp --> iv
    iv --> [walks]
    sadv --> [maybe]
    det --> [the]
    n --> [man]
    n --> [men]

maybe john walks is accepted, but it takes a bit of backtracking. The parse starts with

    WORDS               STACK
    maybe john walks    s
    maybe john walks    np vp
    maybe john walks    det n vp
    a dead end

Example continued

at a dead-end, so back up to the most recent recorded choice point. The history records the 2 choices made so far:

    0: (i=0, STACK=s, (s --> np vp))
    1: (i=0, STACK=np vp, (np --> det n))

so

    WORDS               STACK
    backtracking to use of rule: np ⇒ det, n
    1  maybe john walks    np vp
       maybe john walks    n vp
    another dead end

Again at a dead end:

    WORDS               STACK
    :                   :
    1  maybe john walks    np vp
       maybe john walks    n vp

The (np --> n) rule was the final option for the np vp stack, so its use was not recorded as a choice point. So at this point the backtrack history just contains the very first choice, to use (s --> np vp), so backing up to there:

    WORDS               STACK
    backtracking to use of rule: s ⇒ np, vp
    0  maybe john walks    s
       maybe john walks    sadv s
       john walks          s
       john walks          np vp
       walks               vp
       walks               iv
       (empty)             (empty)
    SUCCEED

C++ code

this is a more detailed spelling out of the top-down backtracking parser as C++ code
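The backtracking version can be sketched in C++ along the following lines. Again this is only an illustrative sketch under stated assumptions, not the slides' own C++ code: `Rule`, `Choice`, `maybeGrammar` and `topDownBacktrack` are names invented here. Unlike the worked example, it records every move in the history, including the last available option for a state; that only costs extra pops at dead ends. It assumes a grammar without left recursion, since otherwise the search may not terminate.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule type: lexical rules are A -> word, others A -> D1 ... Dn.
struct Rule {
    std::string lhs;
    std::vector<std::string> rhs;
    bool lexical;
};

// One history item: the state before a move, plus which rule the move used.
struct Choice {
    std::vector<std::string> F;
    std::size_t i;
    std::size_t rule;
};

// The grammar of the backtracking example, in the same rule order.
std::vector<Rule> maybeGrammar() {
    return {
        {"s", {"np", "vp"}, false}, {"s", {"sadv", "s"}, false},
        {"np", {"john"}, true},     {"np", {"det", "n"}, false},
        {"np", {"n"}, false},       {"vp", {"iv"}, false},
        {"iv", {"walks"}, true},    {"sadv", {"maybe"}, true},
        {"det", {"the"}, true},     {"n", {"man"}, true},
        {"n", {"men"}, true},
    };
}

bool topDownBacktrack(const std::vector<Rule>& rules, const std::string& start,
                      const std::vector<std::string>& w) {
    std::vector<std::string> F{start};  // frontier; back() is the top of the stack
    std::size_t i = 0;                  // progress through the input
    std::size_t first = 0;              // first rule index to consider in this state
    std::vector<Choice> H;              // history stack
    for (;;) {
        bool moved = false;
        if (!F.empty()) {
            const std::string A = F.back();
            for (std::size_t k = first; k < rules.size(); ++k) {
                const Rule& r = rules[k];
                assert(!r.rhs.empty());
                if (r.lhs != A) continue;
                if (r.lexical && !(i < w.size() && r.rhs[0] == w[i])) continue;
                H.push_back({F, i, k});  // record the state and the move
                F.pop_back();
                if (r.lexical) {
                    ++i;                 // LEAF CANCELLATION
                } else {                 // LEFT EXPANSION, last daughter first
                    for (auto it = r.rhs.rbegin(); it != r.rhs.rend(); ++it) {
                        F.push_back(*it);
                    }
                }
                moved = true;
                break;
            }
        }
        first = 0;                       // a fresh state considers all rules
        if (moved) continue;
        if (F.empty() && i == w.size()) return true;
        if (H.empty()) return false;
        F = H.back().F;                  // dead end: restore the recorded state
        i = H.back().i;
        first = H.back().rule + 1;       // and consider only later rules
        H.pop_back();
    }
}
```

On maybe john walks this sketch first explores s --> np,vp, runs into the two dead ends shown in the example, and then succeeds through s --> sadv,s.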

Questions about the backtracking, top-down algorithm

◮ does the algorithm always terminate?
◮ when it terminates, what is the relationship between the time taken and the size of the input?

If a grammar allows trees of the form

    [Figure: a tree with root A which has another A as its leftmost descendant]

it is left recursive

left-recursion causes the top-down parsers not to terminate

the main operation of the top-down parser is left-most expansion, a rule-driven tree expansion:

    [Figure: a leaf x expanded into a node x with daughters y1 ... yn]

and left-recursion allows this operation to be repeated indefinitely often.

A success using a left recursive grammar

    E --> x
    E --> E,+,E
    E --> E,−,E
    initial(E)

input is: x + x

    WORDS: x + x    STACK: E
    WORDS: + x      STACK:
    BACKTRACKING to use of rule: E --> x
    0  WORDS: x + x    STACK: E
       WORDS: x + x    STACK: E + E
       WORDS: + x      STACK: + E
       WORDS: x        STACK: E
       WORDS:          STACK:
    SUCCESS

An endless loop using same left recursive grammar

    E --> x
    E --> E,+,E
    E --> E,−,E
    initial(E)

input is: x − x

    WORDS: x − x    STACK: E
    WORDS: − x      STACK:
    BACKTRACKING to use of rule: E --> x
    0  WORDS: x − x    STACK: E
       WORDS: x − x    STACK: E + E
       WORDS: − x      STACK: + E
    BACKTRACKING to use of rule: E --> x
    1  WORDS: x − x    STACK: E + E
       WORDS: x − x    STACK: E + E + E
       WORDS: − x      STACK: + E + E
    BACKTRACKING to use of rule: E --> x
    2  WORDS: x − x    STACK: E + E + E
       WORDS: x − x    STACK: E + E + E + E
       :

Top-down: pre-ordering

The sentence john walks has the parse tree

    1:s
        2:np
            3:john
        4:vp
            5:iv
                6:walks

according to the grammar

    s ⇒ sadv, s
    s ⇒ np, vp
    np ⇒ john
    vp ⇒ iv
    iv ⇒ walks
    sadv ⇒ maybe

the numbering orders the nodes so that
◮ mother precedes all dtrs (and descendants)
◮ nodes in a dtr tree come before nodes in dtr to the right
this ordering reflects the actions of the top-down parser

Bottom up parsing: post-ordering

The sentence john walks has the parse tree

    6:s
        2:np
            1:john
        5:vp
            4:iv
                3:walks

the numbering this time orders the nodes so that
◮ a mother node follows dtr nodes (and descendants)
◮ once again nodes in a dtr tree come before nodes in dtr to the right
there is a standard shift-reduce bottom-up parser whose actions reflect this post-order traversal of the tree

Steps in a shift-reduce parse

[Figure: pictures 1 and 2 of the forest for john walks, drawn against the eventual tree. In picture 1 the forest is just john; in picture 2 an np node has been added above john.]

◮ In each picture the dotted line encloses a forest: a collection of subtrees of the eventual tree.
◮ The box part shows the tree-tops of the forest
◮ it starts at the bottom left and in the first step adds height in accordance with a rule of the grammar.
◮ conventionally called a 'reduction'

Steps in a shift-reduce parse

[Figure: pictures 3 to 6 of the forest for john walks, continuing the sequence.]

    picture   tree-tops   move
    3         np walks    shift
    4         np iv       reduction
    5         np vp       reduction
    6         s           final reduction

Picture 3:
◮ besides trying to add height to the trees in the forest, the parser also sometimes adds the next lexical item to the forest
◮ conventionally called a shift

Picture 4:
◮ another reduction
◮ added height to a tree in the forest: note a final tree in the forest

Picture 5:
◮ another reduction
◮ again added height to a tree in the forest: note again a final tree in the forest

Picture 6:
◮ final reduction
◮ height added above the two trees in the forest giving one final tree
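The pre-order and post-order numberings of the john walks tree can be reproduced with two short recursive traversals. The sketch below is only an illustration: the `Node` type and the helper names `johnWalksTree`, `preOrder`, `postOrder` and `numberOf` are assumptions introduced here, and `numberOf` relies on every label being unique, which holds for this small tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tree type: a label plus an ordered list of daughters.
struct Node {
    std::string label;
    std::vector<Node> dtrs;
};

// The analysis tree for "john walks".
Node johnWalksTree() {
    Node john{"john", {}};
    Node walks{"walks", {}};
    Node np{"np", {john}};
    Node iv{"iv", {walks}};
    Node vp{"vp", {iv}};
    return Node{"s", {np, vp}};
}

// Mother first, then each daughter subtree from left to right.
void preOrderInto(const Node& n, std::vector<std::string>& out) {
    out.push_back(n.label);
    for (const Node& d : n.dtrs) preOrderInto(d, out);
}

// Each daughter subtree from left to right, then the mother.
void postOrderInto(const Node& n, std::vector<std::string>& out) {
    for (const Node& d : n.dtrs) postOrderInto(d, out);
    out.push_back(n.label);
}

std::vector<std::string> preOrder(const Node& n) {
    std::vector<std::string> out;
    preOrderInto(n, out);
    return out;
}

std::vector<std::string> postOrder(const Node& n) {
    std::vector<std::string> out;
    postOrderInto(n, out);
    return out;
}

// 1-based number of a label in an ordering; the label is assumed to be present.
std::size_t numberOf(const std::vector<std::string>& order, const std::string& label) {
    auto it = std::find(order.begin(), order.end(), label);
    assert(it != order.end());
    return static_cast<std::size_t>(it - order.begin()) + 1;
}
```

With these definitions, `numberOf(preOrder(johnWalksTree()), "vp")` is 4 and `numberOf(postOrder(johnWalksTree()), "vp")` is 5, matching the two numbered trees. The post-order sequence is also the order in which the shift-reduce steps put nodes into the forest.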

clearly possible to have different evolution of the forest

[Figure: two sequences of six pictures each, showing two different ways the forest for john walks can evolve into the complete tree.]

First sequence: In pictures 2, 4 and 5, growth is purely vertical, first above the word john, then above walks. In picture 6, the two vertical trees constructed so far are joined together. In all of 2, 4, 5, 6, the step of adding some height to the collection of trees by using a grammar rule is called a reduction: it's not the height which is reduced, but (sometimes) the number of trees in the forest.

Second sequence: instead of building up at once from john, first walks is added to the forest, and then iv is added above it. Then some height is added above john, with the np node, before some more height is added above walks with the vp node. Then you get the final step merging the np and vp trees.

Picking a particular forest-growth regime

    first (red where height is built)    second
    top nodes                            top nodes
    john                                 john
    np                                   john walks
    np walks                             john iv
    np iv                                np iv
    np vp                                np vp
    s                                    s

◮ the first always builds height on the last n trees in the forest: a suffix of the forest
◮ the second does not: builds height on 'john' when it's not the last tree in the forest
◮ The key fact is this: if the forest can be evolved at all to a successful conclusion then it can be evolved by building on suffixes only

Tree tops as a stack

◮ Once you choose the suffixes-only regime for evolving the forest, the box representing the tree-tops behaves like a stack.
◮ recall for top-down parsing, a particular search regime also gave a stack: the sequence of nodes available for left-expansion/leaf-cancellation behaved like a stack at its left end.
◮ For bottom-up parsing, the sequence of nodes available for suffix-only reduction/shift behaves like a stack at its right end
◮ (once again) this leads to the idea that one can manage the search for a parse tree by managing a search through a space of stack states
◮ can now give outline of bottom-up algorithm
    let w be an array representing the input
    i be the index of the current word
    use T for the tops of the trees in the forest

Bottom-up algorithm (without backtracking)

    set T to empty, progress indicator i = 0

    MOVES:
    loop thru the rules {                                  // REDUCTION
      if (rule is A → D1 . . . Dn
          and T's top-most elements are Dn . . . D1) {     // nb. order
        pop Dn . . . D1 from T, push A on T
        goto MOVES
      }
    }
    if (could not reduce stack T and i < size of input) {  // SHIFT
      push w[i] on T
      set i = i + 1
      goto MOVES
    }

    YES_NO:
    if ((T is just initial symbol) && (i == size of input)) { succeed }
    else { fail }

◮ there may be more than one way to reduce the stack: the preceding algorithm just deterministically picks the first possible syntax rule
◮ the correct parse might require a shift even though a reduce is possible: the preceding algorithm just deterministically opts to reduce if it is at all possible
◮ these are short-cuts which have to be addressed
    ◮ either add backtracking to revisit all choices
    ◮ or try to control choices by looking ahead in the input
◮ first look at a few examples assuming this deterministic algorithm

An example

suppose grammar

    s --> a,s,b
    s --> a,b
    initial(s)

input is: a a b b, showing top of stack at the right

    STACK      WORDS      type of move
    (empty)    a a b b
    a          a b b      shift
    a a        b b        shift
    a a b      b          shift
    a s        b          reduce, using s ⇒ a, b
    a s b                 shift
    s                     reduce, using s ⇒ a, s, b
    SUCCEED

For this grammar, the short-cuts work

Using the parser with the E-T-F grammar: ID

    E ⇒ E + T
    E ⇒ T
    T ⇒ T * F
    T ⇒ F
    F ⇒ ( E )
    F ⇒ ID

succeeds on ID

    STACK      WORDS    type of move
    (empty)    ID
    ID                  shift
    F                   reduce
    T                   reduce
    E                   reduce
    SUCCEED
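The deterministic shift-reduce algorithm can be sketched in C++ as below. This is a minimal sketch under assumptions, not the slides' own code: `Rule`, `topMatches`, `aSbGrammar`, `etfGrammar` and `shiftReduceRecognise` are names made up here, words and categories are both plain strings on the same stack, and the grammar is assumed to have no empty right-hand sides and no cycle of single-daughter rules (either would keep the reduce loop going forever).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule type: A -> D1 ... Dn, where each Dk is a category or a word.
struct Rule {
    std::string lhs;
    std::vector<std::string> rhs;
};

// True if the top-most elements of T (top at back()) are D1 ... Dn in order.
bool topMatches(const std::vector<std::string>& T, const std::vector<std::string>& rhs) {
    return rhs.size() <= T.size() &&
           std::equal(rhs.begin(), rhs.end(),
                      T.end() - static_cast<std::ptrdiff_t>(rhs.size()));
}

std::vector<Rule> aSbGrammar() {
    return {{"s", {"a", "s", "b"}}, {"s", {"a", "b"}}};
}

std::vector<Rule> etfGrammar() {
    return {{"E", {"E", "+", "T"}}, {"E", {"T"}},           {"T", {"T", "*", "F"}},
            {"T", {"F"}},           {"F", {"(", "E", ")"}}, {"F", {"ID"}}};
}

// Deterministic bottom-up recognition: reduce with the first rule that fits,
// and shift only when no reduction is possible.
bool shiftReduceRecognise(const std::vector<Rule>& rules, const std::string& start,
                          const std::vector<std::string>& w) {
    std::vector<std::string> T;  // tree-tops; back() is the top (right end)
    std::size_t i = 0;           // progress through the input
    for (;;) {
        bool reduced = false;
        for (const Rule& r : rules) {
            assert(!r.rhs.empty());
            if (topMatches(T, r.rhs)) {              // REDUCTION
                T.resize(T.size() - r.rhs.size());   // pop Dn ... D1
                T.push_back(r.lhs);                  // push A
                reduced = true;
                break;
            }
        }
        if (reduced) continue;
        if (i < w.size()) {                          // SHIFT
            T.push_back(w[i]);
            ++i;
            continue;
        }
        break;                                       // no move left
    }
    return T.size() == 1 && T[0] == start && i == w.size();
}
```

With `aSbGrammar()` and start symbol `"s"` this accepts a a b b through the same stack states as the example. With `etfGrammar()` it accepts ID and ID + ID but rejects ID * ID, because it reduces the first T to E instead of shifting.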

Using the parser with the E-T-F grammar: ID + ID

    E ⇒ E + T
    E ⇒ T
    T ⇒ T * F
    T ⇒ F
    F ⇒ ( E )
    F ⇒ ID

succeeds on ID + ID

    STACK       WORDS
    (empty)     ID + ID
    ID          + ID
    F           + ID
    T           + ID
    E           + ID
    E +         ID
    E + ID
    E + F
    E + T
    E

succeeded because tried E --> E + T before E --> T

Using the parser with the E-T-F grammar: ID * ID

fails on ID * ID

    STACK       WORDS
    (empty)     ID * ID
    ID          * ID
    F           * ID
    T           * ID
    E           * ID       reduced but should have shifted
    E *         ID
    E * ID
    E * F
    E * T
    E * E

fails because turns first ID into an E, but it has to be left a T

Adding backtracking

The backtracking idea is very simple
◮ when the parser is about to make a move [2] it pushes on to a history stack:
    the current progress,
    the current stack T,
    and a record of the move being made.
◮ when the parser runs into a dead-end it
    pops the most recently added history item,
    restores the progress and stack from this,
    considers alternative moves later than the move which was stored.

[2] for which there might be alternatives

C++ code

this is a more detailed spelling out of bottom-up parser as C++ code

Bottom-up algorithm with backtracking

    set T to empty, progress indicator i = 0

    TRY_AGAIN:
    MOVES:
    loop thru the rules
      if just restored from H, start after the recorded rule
    {
      if (rule is A → D1 . . . Dn
          and T's top-most elements are Dn . . . D1) {     // REDUCE
        add (T, i, (A → D1 . . . Dn)) to H
        pop Dn . . . D1 from T, push A on T
        goto MOVES
      }
    }
    if (could not reduce stack T and i < size of input) {  // SHIFT
      push w[i] on T
      set i = i + 1
      goto MOVES
    }

    YES_BACKTRACK_NO:
    if ((T is just initial symbol) && (i == size of input)) { succeed }
    else if (H is not empty) {
      pop top of H; restore T and i from this;
      goto TRY_AGAIN
    }
    else { fail }

now succeeds on ID * ID

    STACK       WORDS
    (empty)     ID * ID
    ID          * ID       (0)
    F           * ID       (1)
    T           * ID       (2)
    E           * ID       should have shifted
    E *         ID
    E * ID
    E * F                  (3)
    E * T                  (4)
    E * E
    BACKTRACKING to use of rule: E ⇒ T in (4)
    4: E * T               but no other way forward
    BACKTRACKING to use of rule: T ⇒ F in (3)
    3: E * F               but no other way forward
    BACKTRACKING to use of rule: E ⇒ T in (2)
    2: T        * ID
    T *         ID         this time shifts
    T * ID
    T * F
    T
    E

◮ whereas the backtracking top-down algorithm was liable to not terminate with left-recursive rules, left-recursive rules are not a problem for the backtracking bottom-up algorithm
◮ it does terminate, but the backtracking is too costly

C++ code

this is a more detailed spelling out of backtracking bottom-up parser as C++ code
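A backtracking shift-reduce recogniser can be sketched in C++ as follows. As before this is an illustration under assumptions rather than the slides' own C++ code: `Rule`, `Choice`, `topMatches`, `etfGrammar`, `leftRecursiveGrammar` and `shiftReduceBacktrack` are names invented here. Only reductions are recorded in the history, as in the pseudocode; a shift is what happens once no remaining rule can reduce. One small departure is that success is tested at the start of every round rather than only when no move is left. The grammar is assumed to have no empty right-hand sides and no cycle of single-daughter rules.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule type: A -> D1 ... Dn, where each Dk is a category or a word.
struct Rule {
    std::string lhs;
    std::vector<std::string> rhs;
};

// One history item: the state before a reduction, plus which rule was used.
struct Choice {
    std::vector<std::string> T;
    std::size_t i;
    std::size_t rule;
};

// True if the top-most elements of T (top at back()) are D1 ... Dn in order.
bool topMatches(const std::vector<std::string>& T, const std::vector<std::string>& rhs) {
    return rhs.size() <= T.size() &&
           std::equal(rhs.begin(), rhs.end(),
                      T.end() - static_cast<std::ptrdiff_t>(rhs.size()));
}

std::vector<Rule> etfGrammar() {
    return {{"E", {"E", "+", "T"}}, {"E", {"T"}},           {"T", {"T", "*", "F"}},
            {"T", {"F"}},           {"F", {"(", "E", ")"}}, {"F", {"ID"}}};
}

// The left-recursive grammar used in the top-down examples.
std::vector<Rule> leftRecursiveGrammar() {
    return {{"E", {"x"}}, {"E", {"E", "+", "E"}}, {"E", {"E", "-", "E"}}};
}

bool shiftReduceBacktrack(const std::vector<Rule>& rules, const std::string& start,
                          const std::vector<std::string>& w) {
    std::vector<std::string> T;  // tree-tops; back() is the top (right end)
    std::size_t i = 0;           // progress through the input
    std::size_t first = 0;       // first rule index to consider in this state
    std::vector<Choice> H;       // history stack
    for (;;) {
        if (T.size() == 1 && T[0] == start && i == w.size()) return true;
        bool moved = false;
        for (std::size_t k = first; k < rules.size(); ++k) {
            assert(!rules[k].rhs.empty());
            if (topMatches(T, rules[k].rhs)) {             // REDUCE
                H.push_back({T, i, k});                    // record state and move
                T.resize(T.size() - rules[k].rhs.size());
                T.push_back(rules[k].lhs);
                moved = true;
                break;
            }
        }
        first = 0;                                         // fresh states try all rules
        if (!moved && i < w.size()) {                      // SHIFT
            T.push_back(w[i]);
            ++i;
            moved = true;
        }
        if (moved) continue;
        if (H.empty()) return false;
        T = H.back().T;                                    // dead end: restore
        i = H.back().i;
        first = H.back().rule + 1;                         // only later rules
        H.pop_back();
    }
}
```

With `etfGrammar()` this sketch accepts ID * ID after backing up to the point where T was reduced to E, as in the trace above. With `leftRecursiveGrammar()` it accepts x - x, the input on which the top-down parser looped.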

Further Reading

Compilers, Aho, Sethi and Ullman ('Dragon' book), Chap 2

Chap 2 has a first look at grammars and parsing
    p26-32: introduction to grammars, precedence. Alternative ways to define the strings belonging to a category:
        ◮ p27: rewriting non-terminals using syntax rules: derivation
        ◮ p28: deduction, using each rule to reason that a string belongs to a category, using the tree to illustrate the reasoning
        ◮ p29: allowed trees, by requiring mother/dtrs to match rules; the language is the leaves
    p41-43: example of top-down parsing, selecting left-most child: no ambiguity

Compilers, Aho, Sethi and Ullman ('Dragon' book), Chap 4

The 'Dragon' book chap 4 goes over the same ground more thoroughly
    p167-170: the rewriting or 'derivation' angle on defining the strings belonging to a category
    p195-203: bottom-up parsing: introduces the reverse, right-most derivation
    p198: stacks and shift-reduce parsing, arithmetic example