Dr. Ernesto Gomez : CSE 570/670 We Now Consider Parsing Algorithms
We now consider parsing algorithms. This material is in chapter 4.

1. An LR parser

Having developed algorithms for the FIRST and FOLLOW sets, we have seen how to construct LR(0) sets with the CLOSURE and GOTO functions, such that the LR(0) sets are states and the GOTO functions are transitions between states, which occur when we “read” a specific symbol, as in our finite automata. We extend the meaning of “read” to mean “PUSH a symbol onto the stack top”. This happens in two cases: when we remove the first symbol of the (unprocessed) input text and push it onto the stack, or when we pop the symbols corresponding to a handle (the right-hand side of a production) from the stack and PUSH the left-hand symbol of that production onto the stack. The first case (a standard read action) puts a terminal symbol on the stack; the second case puts a non-terminal symbol on the stack. Terminals are handled just as they would be in a finite automaton. Non-terminals are pushed on the stack when we reach a state that has an item of the form A → α. (dot at the end); an item of this form means that we have finished processing a handle corresponding to α, that it is on stack top, and that we use the production A → α to POP α and PUSH A.

We will continue to use our expression grammar for examples:

S → E
E → E + T | T
T → T ∗ F | F
F → id | num | (E)

with N = {S, E, T, F} and T = {+, ∗, (, ), id, num}.

We will also make reference to figure 4.31, page 244 of the text, which gives an LR(0) automaton for this grammar, built using the algorithms in the previous set of notes and in section 4.6.2 in Aho and Ullman.

1.1. Parsing with a shift-reduce automaton.

Our automaton is table driven, much like the Deterministic Finite Automata you worked on previously in Lab 1.
The difference is that our control table is divided into two sections, called the Action table and the Goto table. (These names are somewhat confusing; for example, the Goto table is not identical to the GOTO function. Also note that since the rows are the same for both tables, they are usually written as two sections of the same control table.) The states of the parser are given by the LR sets (in our example these are LR(0) sets, but the same method works for LR(k), the only difference being the parse tables).

Rows of our control table are numbered with the state numbers we assigned to the LR sets when we constructed them. The numbering depends on the order in which we generate the sets and makes no difference to the parse function; the only fixed thing is that the start state (state 0, in row 0) is generated from the single item S → .α corresponding to the start production. Columns of the Action section of the control table are labelled with all the symbols in T, and there is an added column for the symbol $, which denotes end of input text (we have seen this convention when we generated the FOLLOW sets). Columns of the Goto section are labelled with all the symbols in N.

Notice that every combination (state, X) with X ∈ N ∪ T ∪ {$} is represented in the control table, so there is an entry for every combination of state and symbol that could occur in a derivation or parse for the grammar we used to generate the table. (Any symbol that is not a terminal, a non-terminal, or $ could never appear in a derivation, and if it appears in the input it immediately leads to reject.) We can see an example of what the table looks like (for an SLR parser, the next level up from LR(0)) on page 252, fig 4.37.
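The layout just described can be pictured with a small Python sketch. It is only one possible representation: the names `empty_control_table` and `cell` are made up for illustration, a blank cell is modelled as `None`, and the count of 12 states used in the usage note is an assumption based on the automaton of figure 4.31.

```python
# Symbols of the expression grammar used in these notes.
TERMINALS = ["+", "*", "(", ")", "id", "num"]
NONTERMINALS = ["S", "E", "T", "F"]
END = "$"  # end-of-input marker, an extra Action column


def empty_control_table(num_states):
    """Return (action, goto): one row per state, every cell blank (None)."""
    # Action section: one column per terminal, plus the $ column.
    action = [{a: None for a in TERMINALS + [END]} for _ in range(num_states)]
    # Goto section: one column per non-terminal, same row numbering.
    goto = [{A: None for A in NONTERMINALS} for _ in range(num_states)]
    return action, goto


def cell(action, goto, state, symbol):
    """Look up (state, symbol) in whichever section owns that column."""
    if symbol in action[state]:
        return action[state][symbol]
    if symbol in goto[state]:
        return goto[state][symbol]
    # A symbol outside N, T and $ has no column at all: reject at once.
    raise ValueError(f"reject: {symbol!r} is not in N, T, or $")
```

Calling `empty_control_table(12)` gives a table with rows 0 to 11; filling in the cells is the subject of section 1.2.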
The Action table looks a lot like the table for a finite automaton. For example, the entry for (row=0, column=“id”) says “s5”. On a finite automaton, this would mean: get “id” from the input and move to row 5 of the table. On a pushdown automaton, we also want to push (5, “id”) on stack top. (It helps to create a class that contains a string and a number, then create a stack of objects of that class; you can get the same effect without objects, but the object method is cleaner.)

We also have entries like “r4” at position (3, “+”); this means reduce using rule 4. In building the table, when we find an item like T → F. in the LR set for state 3, we look up the grammar rule T → F, which in the example grammar on page 251 happens to be numbered 4. (The rule number and the state number are not related: if we write the rules in a different order, the rule number might be something other than 4, and the item appears in a state labelled 3 only because of the order in which the states were generated; with a different order the state number might be different.)

The reduce action works as follows. Find the length L of the right-hand side of rule 4; in this case L = 1. POP L items off the stack. Read the state number S from the item that is now on top of the stack, and let A be the single non-terminal on the left side of rule 4. Now look at position (S, A) in the Goto table; it shows either a state number or a blank. If it is blank, reject. If it contains a state number N, push (N, A) on the stack.

An example to illustrate what is going on (using the control table on page 252): suppose you are parsing the string 4*(3+5), and assume you have read and pushed 4* onto the stack. At this point, when you read the *, you have identified 4 first as a factor F and then as a term T, so after you push the * the stack looks like (from bottom to top) T, * and the state on stack top is 7 (looking at the table, all the actions that shift on *, which are in states 2 and 9, send you to state 7).
So now, in state 7, the only things that can happen are a shift on “(” or on “id”. The only other possibility in state 7 is the Goto entry, which says you might be seeing an F instead of a terminal. Now we process the parenthesis: after multiple steps the stack will look like T, *, (, E, ), and each item has a state attached to it; the * has state 7. (E) is the handle of production 5, F → (E), so we reduce by 5: we pop 3 items off the stack and push F. Looking at the F column of the Goto table, we see that there are multiple states we could go to, depending on what state we are in, that is, what state we were in when we started translating the parenthesis. The idea is that, having translated (E) to F, it is like seeing F instead of ( in that previous state. That previous state is the state attached to *, which is now on stack top, and it is 7. Entry (7, F) in the Goto table says we go to state 10, so we push (10, F) and continue from state 10. In general, after a reduce pushes (N, A), we look at the next input symbol s and do the action in (N, s).

The algorithm that runs a parser from this table is given on pages 250-251. The same algorithm is used for all LR(k) parsers for any value of k; the only difference is the tables. SLR uses the LR(0) sets, but its table is different because it uses the FOLLOW sets as a tiebreaker; LR(1) and higher methods build larger tables because they use lookahead in building the LR sets.

1.2. Building the parse table - LR(0) and SLR.

Algorithm 4.46, on page 253, shows how to build the parse table. It is fairly straightforward. First you build the shift entries of the Action table from the GOTO functions that use terminals: for example, GOTO(0, id) is state 5, so you write s5 at position (0, id), that is, row 0, column id. Next, build the Goto table from the GOTO functions on non-terminals: for example, GOTO(0, F) is state 3, so you write 3 at position (0, F). You don't have to specify shift here, because you reach this entry from a reduction that pushes F on the stack. Then look at states that contain items with a dot at the end.
For example, assume state s includes the item A → α. and let N be the number of the production A → α. Then:

For an LR(0) parser: set every entry (s, a), that is, every column of the Action section in row s, to rN (reduce by production N). This will almost always produce a conflict, because some cells in the table will already have a shift action in them; when that happens the grammar is not LR(0) (this is the case for our sample grammar in states 1 and 2).

For an SLR parser: set every entry (s, a) such that a ∈ FOLLOW(A) to rN. This excludes columns labelled with a terminal not in FOLLOW(A). The reasoning is that we get to this point right after we reduce by A → α, so A is on stack top; if the next symbol we see is not in FOLLOW(A) we would reject, so we cannot have a transition on that column label right after pushing A on the stack.
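The two fill rules above, together with the parsing loop of section 1.1, can be tried out with the following Python sketch. It makes several assumptions that should be checked against the text:

- The shift and Goto entries, the rule numbering (1: E → E + T, 2: E → T, 3: T → T ∗ F, 4: T → F, 5: F → (E), 6: F → id) and the FOLLOW sets are hard-coded from the id-only grammar behind fig 4.37, so num is left out.
- The completed start item in state 1 is treated as accept on $ rather than as a reduce, so this sketch reports no conflict for state 1.
- The names `build_action` and `parse` are made up for illustration, and on a conflict the sketch simply keeps the shift.

```python
# Rule number -> (left-hand side, length of right-hand side). Assumed numbering.
RULES = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}
FOLLOW = {"E": {"+", ")", "$"}, "T": {"+", "*", ")", "$"}, "F": {"+", "*", ")", "$"}}
TERMINALS = ["+", "*", "(", ")", "id"]

# GOTO on terminals (these become shift entries) and on non-terminals.
SHIFTS = {(0, "id"): 5, (0, "("): 4, (1, "+"): 6, (2, "*"): 7, (4, "id"): 5,
          (4, "("): 4, (6, "id"): 5, (6, "("): 4, (7, "id"): 5, (7, "("): 4,
          (8, "+"): 6, (8, ")"): 11, (9, "*"): 7}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}
# States holding an item with the dot at the end -> number of that production.
COMPLETE = {2: 2, 3: 4, 5: 6, 9: 1, 10: 3, 11: 5}


def build_action(slr):
    """Fill the Action section; return it with the list of conflicting cells."""
    action = {key: ("s", n) for key, n in SHIFTS.items()}
    action[(1, "$")] = ("acc", 0)
    conflicts = []
    for state, rule in COMPLETE.items():
        lhs, _ = RULES[rule]
        # SLR: only columns in FOLLOW(lhs). LR(0): every Action column.
        columns = FOLLOW[lhs] if slr else set(TERMINALS) | {"$"}
        for a in columns:
            if (state, a) in action:
                conflicts.append((state, a))  # cell already holds a shift
            else:
                action[(state, a)] = ("r", rule)
    return action, sorted(conflicts)


def parse(tokens, action):
    """Run the shift-reduce loop; return True on accept, False on reject."""
    stack = [(0, None)]  # (state, symbol) pairs, bottom first
    tokens = tokens + ["$"]
    pos = 0
    while True:
        state = stack[-1][0]
        kind, n = action.get((state, tokens[pos]), ("reject", 0))
        if kind == "s":  # shift: push (new state, terminal), consume input
            stack.append((n, tokens[pos]))
            pos += 1
        elif kind == "r":  # reduce: pop the handle, then consult Goto
            lhs, length = RULES[n]
            del stack[len(stack) - length:]
            target = GOTO.get((stack[-1][0], lhs))
            if target is None:
                return False  # blank Goto cell
            stack.append((target, lhs))
        else:
            return kind == "acc"
```

With `slr=False` the conflict list contains the cells in states 2 and 9 where a shift on ∗ was already present; with `slr=True` the list is empty, and the resulting table can be passed to `parse` together with a token list such as `"id * ( id + id )".split()`.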