Dr. Ernesto Gomez : CSE 570/670. We now consider parsing. This material is in Chapter 4 of the text.

1. An LR parser

Having developed algorithms for FIRST and FOLLOW sets, we have seen how to construct LR(0) sets with the CLOSURE and GOTO functions, such that our LR(0) sets are states, and the GOTO functions are transitions between states which occur when we "read" a specific symbol, as in our finite automata. We extend the meaning of "read" to mean "PUSH a symbol onto stack top". This happens when we take the first symbol of the (unprocessed) input text and push it onto the stack, or when we pop symbols from the stack corresponding to a handle (the right-hand side of a production) and PUSH the left-hand symbol of that production onto the stack. The first case (a standard read action) puts a terminal symbol on the stack; the second case puts a non-terminal on the stack. Terminal characters are handled just as they would be in a finite automaton. Non-terminals are pushed onto the stack when we reach a state which has an item of the form A → α. ; an item of this form means we have finished processing a handle that corresponds to α, that it is on stack top, and that we use the production A → α to POP α and PUSH A.

We will continue to use our expression grammar for examples:

S → E
E → E + T | E → T
T → T ∗ F | T → F
F → id | F → num | F → (E)

with N = {S, E, T, F }, T = {+, ∗, (, ), id, num}. We will also make reference to figure 4.31, page 244 of the text, which gives an LR(0) automaton for this grammar, built using the algorithms in the previous set of notes and in section 4.6.2 of Aho and Ullman.

1.1. Parsing with a shift-reduce automaton. Our automaton is table driven, much like the Deterministic Finite Automata you have worked with previously in Lab 1. The difference is that our control table is divided into two sections, called the Action table and the Goto table (these names are somewhat confusing - for example, the Goto table is not identical to the GOTO function. Also note that since the rows are the same for both tables, they are usually written as two sections of the same control table). The states of the parser are given by the LR sets (in our example these are LR(0) sets, but the same method works for LR(k); the only difference is the parse tables). Rows of our control table are numbered with the state numbers we assigned to the LR sets when we constructed them. The numbering depends on the order in which we generate the sets and makes no difference to the parse function; the only fixed thing is that the start state - state 0, in row 0 - is generated from a single item S → .α corresponding to the start production.

Columns of the Action section of the control table are labelled with all the symbols ∈ T , plus an added column for the symbol $ which denotes end of input text (we have seen this convention when we generated the FOLLOW sets). Columns of the Goto section are labelled with all the symbols ∈ N . Notice that every combination (state, X ∈ N ∪ T ∪ {$}) is represented in the control table, so there is an entry for every possible combination of state and symbol that could occur in a derivation or parse for the grammar we used to generate the table (any symbol that is not a terminal, a non-terminal, or $ could never appear in a derivation, and if one appears in the input it immediately leads to reject). We can see an example of what the table looks like (for an SLR parser, the next level up from LR(0)) on page 252, fig 4.37.
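To make this concrete, here is a minimal sketch in Python of how such a control table might be stored, using the six-production grammar of page 251 and the state numbering of fig 4.37. The tuple encodings ("s", n) for shift, ("r", n) for reduce, and ("acc",) for accept are my own convention, not the text's; a missing key is a blank cell, meaning reject:

    # Productions, numbered as on page 251: number -> (left-hand side,
    # length of the right-hand side).
    PRODUCTIONS = {
        1: ("E", 3),   # E -> E + T
        2: ("E", 1),   # E -> T
        3: ("T", 3),   # T -> T * F
        4: ("T", 1),   # T -> F
        5: ("F", 3),   # F -> ( E )
        6: ("F", 1),   # F -> id
    }

    # Action section: keyed by (state, terminal); terminals include "$".
    ACTION = {
        (0, "id"): ("s", 5), (0, "("): ("s", 4),
        (1, "+"): ("s", 6), (1, "$"): ("acc",),
        (2, "+"): ("r", 2), (2, "*"): ("s", 7), (2, ")"): ("r", 2), (2, "$"): ("r", 2),
        (3, "+"): ("r", 4), (3, "*"): ("r", 4), (3, ")"): ("r", 4), (3, "$"): ("r", 4),
        # ... the remaining rows are filled in the same way from fig 4.37
    }

    # Goto section: keyed by (state, non-terminal).
    GOTO = {
        (0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (7, "F"): 10,   # the entry used in the worked example below
        # ...
    }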
The Action table looks a lot like the table for a finite automaton. For example, the entry at (row=0, column="id") says "s5" - in a finite automaton this would mean: get "id" from the input and point to the cell at position (5,"id"); in a pushdown automaton we also want to push (5,"id") onto stack top (it helps to create a class that contains a string and a number, then create a stack of objects of that class - you can get the same effect without objects, but the object method is cleaner).

We also have entries like "r4" at position (3,"+"); this means reduce using rule 4. In building the table, when we find an item like T → F. in the LR set for state 3, we look at the grammar rule T → F , which in the example grammar on page 251 happens to be numbered 4 (the state number and rule number are not related - if we write the rules in a different order, the rule number might be something other than 4, and this rule appears in state 3 because of the order in which the states were generated; with a different order the state number might be different). The reduce action looks like this: find the length L of the right-hand side of rule 4 - in this case L=1. POP L items off the stack. Read the state number S from the item now on stack top, and let A be the single non-terminal on the left side of rule 4. Now look at position (S, A) in the Goto table; it shows either a state number or a blank. If it is blank, reject. If it contains a state number N, push (N, A) on the stack, then read the next symbol s and do the action at (N, s).

An example to illustrate what is going on (using the control table on page 252): suppose you are parsing the string 4*(3+5), and assume you have read and pushed 4* onto the stack. At this point, when you read the * you have already identified 4 first as a factor F and then as a term T, so when you push the * the stack looks like (from bottom to top) T, *, and the state on stack top is 7 (looking at the table, all the actions that shift on *, which are in states 2 and 9, send you to state 7). Now, in state 7, the only things that can happen are a shift on "(" or on "id" - the only other possibility in state 7 is the Goto function, which says you might be seeing an F instead of a terminal. Now we process the parenthesis. After multiple steps the stack will look like T,*,(,E,) and each item has a state attached to it; the * was state 7. (E) is the handle of production 5: F → (E), so we reduce by 5: pop 3 items off the stack and push F. Now we look at the F column of the Goto table and see that there are multiple states we could go to, depending on what state we are in - that is, what state we were in when we started translating the parenthesis. The idea is, having translated (E) to F, it is as if we saw F instead of ( in the previous state. That previous state is the state attached to the * on stack top, which is 7. Entry (7,F) in the Goto table says we go to state 10, so we read the next input character and continue from state 10.

The algorithm that runs a parser from this table is given on pages 250-251. The same algorithm is used for all LR(k) parsers, for any value of k; the only difference is the tables. SLR uses the LR(0) sets, but its table is different because it uses the FOLLOW sets as a tiebreaker; LR(1) and higher methods build larger tables because they use lookahead in building the LR sets.
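The following is a minimal sketch, in the same Python convention as above, of the driver loop the text gives on pages 250-251. It assumes the ACTION, GOTO and PRODUCTIONS dictionaries sketched earlier have been filled in completely from fig 4.37:

    def lr_parse(tokens):
        tokens = list(tokens) + ["$"]        # append the end-of-input marker
        stack = [(0, None)]                  # start in state 0
        i = 0
        while True:
            state = stack[-1][0]
            act = ACTION.get((state, tokens[i]))
            if act is None:
                return False                 # blank cell: reject
            if act[0] == "acc":
                return True                  # accept
            if act[0] == "s":                # shift: push (new state, terminal)
                stack.append((act[1], tokens[i]))
                i += 1
            else:                            # reduce by production act[1]
                head, length = PRODUCTIONS[act[1]]
                del stack[-length:]          # pop one entry per handle symbol
                goto = GOTO.get((stack[-1][0], head))
                if goto is None:
                    return False             # blank Goto cell: reject
                stack.append((goto, head))   # push the left-hand non-terminal

    # e.g. lr_parse(["id", "*", "(", "id", "+", "id", ")"]) -> True,
    # once the tables above are complete.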

1.2. Building the parse table - LR(0) and SLR. Algorithm 4.46, on page 253, shows how to build the parse table. It is fairly straightforward.

First you build the Action table from the GOTO functions on terminals - for example, GOTO(0,id) is state 5, so you write s5 at position (0,id) - row 0, column id. Next build the Goto table from the GOTO functions on non-terminals - for example, GOTO(0,F) is state 3, so you write 3 at position (0,F) - you don't have to specify shift, because you get here from a reduction that pushes F on the stack.

Then look at states that contain items with a dot at the end. For example, assume state s includes the item A → α. and let N be the number of the production A → α ; then:

For an LR(0) parser: set every entry (s,a) - that is, all columns labelled a ∈ T - to rN (reduce by production N). This will almost always produce a conflict, because some cells in the table will already have a shift action in them - the grammar is ambiguous in LR(0) if we have this (which happens for our sample grammar in states 1 and 2).

For an SLR parser: set every entry (s,a) such that a ∈ FOLLOW(A) to rN - this excludes columns labelled with a terminal not in FOLLOW(A). Since we get to this point right after we reduce by A → α , A is on stack top, and if the next symbol we see is not in FOLLOW(A) we would reject, so we can't have a transition to that column right after pushing A on the stack. By using the FOLLOW sets in this way we restrict the possibility of having a shift-reduce conflict giving ambiguity in our table; SLR resolves the ambiguity, and our sample grammar is not ambiguous in SLR.

If we have an item with the start production and a dot at the end - in our example this is S → E. , which appears in s=1 - then we place "accept" in position (s,$).
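A hedged sketch of this table-filling step, continuing in Python. Everything it consumes - the list STATES of LR(0) item sets, a goto_fn on set indices, precomputed FOLLOW sets, and the (number, head, body, dot) item encoding - is an assumed representation for things constructed in the previous notes, not code from the text:

    def build_slr_table(STATES, goto_fn, FOLLOW, TERMINALS, NONTERMINALS, start_prod):
        ACTION, GOTO = {}, {}
        for s, items in enumerate(STATES):
            for a in TERMINALS:                    # shift entries, from GOTO on terminals
                t = goto_fn(s, a)
                if t is not None:
                    ACTION[(s, a)] = ("s", t)
            for A in NONTERMINALS:                 # Goto section, from GOTO on non-terminals
                t = goto_fn(s, A)
                if t is not None:
                    GOTO[(s, A)] = t
            for (n, head, body, dot) in items:     # items with the dot at the end
                if dot == len(body):
                    if n == start_prod:            # S -> E. : accept on $
                        ACTION[(s, "$")] = ("acc",)
                    else:
                        for a in FOLLOW[head]:     # SLR: reduce only on FOLLOW(head)
                            if (s, a) in ACTION:   # cell already filled: conflict
                                raise ValueError(f"conflict in state {s} on {a}")
                            ACTION[(s, a)] = ("r", n)
        return ACTION, GOTO

For the LR(0) version, replace the FOLLOW[head] loop with a loop over all of TERMINALS plus "$", which is exactly why it almost always raises a conflict.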

2. More stuff from chapter 4

2.1. LR(1). This explicitly uses a lookahead of 1 to generate the states, unlike SLR, which just uses FOLLOW sets as a tiebreaker for the LR(0) sets when building the parse control table. The algorithm to build the LR(1) table is a simple extension of the LR(0) algorithm - pseudocode is in section 4.7.2, page 261. As before, the state 0 kernel is a single item made from the start production: S → .E, $ , but in LR(1) we add the lookahead $ (the special end-of-text character), because after we have reduced by this production and pushed the start symbol on the stack we should be at the end of input.

CLOSURE: find items that look like A → α.Bβ, a (remember that α or β or both could be the empty string). Since the dot is in front of B, in LR(0) we would add a new item B → .γ for every production with B on the left-hand side. In LR(1), for each such production, you add items B → .γ, b , where b ∈ FIRST(βa). What is happening here: in the item A → α.Bβ, a , assume we process the B, that is, push it on the stack. Having done this, the item would be A → αB.β, a , and for this to be correct, the next thing we see should be in FIRST(whatever follows the B). Since β is what follows B, this is FIRST(β). We have previously only calculated FIRST sets for symbols in N ∪ T , but the algorithm is defined for strings: given a string β = X1X2...Xn, FIRST(β) gets whatever is in FIRST(X1). If FIRST(X1) includes ε, then add FIRST(X2) to FIRST(β). As long as you keep seeing ε, go on to the next term and add FIRST of that. If every term in the string has ε, then add ε to FIRST(β). Suppose we start from an item A → α.B, a - there is no β (β is ε). In this case FIRST(βa) is just {a}. Since there can be multiple symbols in FIRST(βa), the closure can generate multiple items of the form B → .γ, b with different values of b - so we get a bigger table.

From any item in a state, the LR(1) algorithm can generate multiple items that look the same except for the ",a" at the end. The term after the comma is the lookahead. For brevity, B → .γ, b/c/... can be used to write down a production with multiple possible lookaheads, but when storing this in a program or describing its use in an algorithm, each alternative is a different item, with one of the possible lookahead characters.

The lookaheads are what resolve our earlier conflict. The transition GOTO(I0,E), which in our LR(0) example gave us the two kernel items S → E. and E → E. + T , resulting in a shift-reduce conflict in state I1 (see Lecture notes 7), still produces a single state, but in LR(1) its items carry lookaheads: S → E., $ and E → E. + T, + . The conflict disappears in the table, because we reduce by S → E only when the lookahead is $ and shift only on +, so the two actions land in different columns of the Action table. A description of how to build the parser control table is in 4.7.3; remember the reference in case you ever need to build something more powerful than SLR, but we will not go into detail on this in class.

Feel free to ignore LALR(1). It is almost as powerful as LR(1) (the Canonical LR(1) Method, to give its full name), and its purpose is to save memory. To use it, you need to calculate the LR(0) sets and GOTO functions anyway, and then use them to build a smaller parser; if you are doing it by hand it is more effort on your part than LR(1), and current computers are not short of memory, so there is no real advantage to LALR.
Yacc uses LALR; it was written in the days when memory was a real constraint.
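Going back to the CLOSURE step above, here is a minimal sketch of FIRST for a string, which is the piece the LR(1) closure needs for FIRST(βa). The precomputed FIRST dictionary and the "eps" marker for ε are assumed names, not from the text; terminals are assumed to satisfy FIRST[t] = {t}:

    def first_of_string(symbols, FIRST):
        result = set()
        for X in symbols:
            result |= FIRST[X] - {"eps"}
            if "eps" not in FIRST[X]:
                return result        # X cannot vanish, later symbols don't matter
        result.add("eps")            # every symbol in the string can derive ε
        return result

    # Closure of [A -> alpha . B beta, a]: for each production B -> gamma,
    # add [B -> . gamma, b] for every b in first_of_string(list(beta) + [a], FIRST).
    # Since the lookahead a is a terminal, the result never contains "eps" here.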

2.2. Ambiguous Grammars. Ambiguous grammars are not LR; they cannot be parsed by a deterministic push-down automaton. Nevertheless, we have methods to allow us to use them - essentially we force determinism by specifying a resolution of the conflict in the compiler (for example, always shift to resolve a shift-reduce conflict). This actually changes the grammar: if we have a definition that is ambiguous and force it to choose in a particular way to resolve the ambiguity, we are actually building a parser for a subset of the grammar. Whatever language it is translating is not what the grammar specifies.

A personal note: ambiguity will be necessary and desirable when building a human (or higher) level AI, but when we want a computer to do what we mean and nothing else, it is not a good thing. My preference is to fix the grammar so it is not ambiguous. This can even be done to extend our parsers beyond the limits of pushdown automata - for example, variable and object declarations are not context free, so our parser for such languages (every modern language!) cannot be a pure pushdown automaton; it is extended with some of the capability of a Turing Machine, and this is done without introducing ambiguity, as we will see in Chapter 5 of the text.

Section 4.8 in the text deals with how to patch an ambiguous grammar so we can build a parser from it, and how to modify a grammar to make it unambiguous. There is no algorithm for any of this. Once we are working with context-free languages (or any language beyond context-free), we are unable to prove any interesting properties of a particular language. This means: we can prove general properties of all context-free languages, but not properties belonging to any particular language. We are not going to prove this here, but we can give an idea of why. The deterministic context-free languages can be defined and parsed with a push-down automaton, but they are powerful enough to describe a Turing Machine. If we want to prove, for example, that grammars G and G' define the same language, we are really asking whether the Turing Machines described by G and G' do the same thing. If we could prove that, we could use it to solve the halting problem - and we can prove that the halting problem is uncomputable. In essence, we can prove stuff about parsers, but when we want to prove properties of languages more powerful than the regular languages, we cannot do it, and we can prove that we cannot do it.

2.3. An example of resolving conflict in a grammar - IF-ELSE statements. The issue with if-else is called the "dangling else" problem. It happens in the obvious, simple grammar for IF statements:

(1) S -> statement
(2) statement -> IF-statement | other-statement
(3) IF-statement -> IF expression THEN statement
(4) IF-statement -> IF expression THEN statement ELSE statement

The problem: the definition of IF-statement is recursive, and it can be expanded in two different ways, because IF-statement is the left-hand side of productions 3 and 4. We can create the string

IF expression THEN IF expression THEN other-statement ELSE other-statement

in two ways:

• Start from (3), replace statement with (4), then convert the two statements to other-statement with (2)
• Start from (4), replace the second (rightmost) statement with other-statement. Then replace the remaining statement with (3), and then use (2) again to convert the remaining statement to other-statement

In the first derivation, we want the ELSE to match the second IF, and in the second derivation we want it to match the first IF - that is the order in which we derived them. The derivations mean different things, but the final string is the same - so the grammar is ambiguous. (A small sketch making the two derivations concrete follows at the end of this section.)

Could we resolve this problem with lookahead? Yes, but to distinguish between the two kinds of IF statement we would need a lookahead of 4; this grammar would be unambiguous at LR(4), which has parse tables enormously bigger than LR(1) and requires a different and much more elaborate algorithm (we can't just extend LR(1) to get LR(2) the way we extended the LR(0) algorithm to get LR(1) - I know, I once spent a couple of months trying in my advanced compilers class!).

We can resolve the conflict by forcing a shift in one of the otherwise ambiguous states - this is done in section 4.8.2 in the text. What it does is force the ELSE to always match the most recent IF; essentially it says the meaning matches our first derivation. If we want the ELSE to match the first IF (second derivation), we need to write something like IF expression THEN { IF-statement } ELSE statement, to distinguish which IF the ELSE matches. Notice that this changes the language - if we just arbitrarily said every shift-reduce conflict is handled by shifting, then we would never reduce in state I1 in the automaton on page 244 and we would never accept. Since we are changing the language to resolve the conflict, I prefer to change it explicitly rather than add ad-hoc rules. This approach to the IF-ELSE problem is shown in example 4.16, page 212. We can modify our example as follows (a matched statement is one whose IFs all have their ELSEs; an open statement has an IF still waiting for one):

(1) S -> statement
(2) statement -> matched-statement | open-statement
(3) matched-statement -> IF expression THEN matched-statement ELSE matched-statement
(4) matched-statement -> other-statement
(5) open-statement -> IF expression THEN statement
(6) open-statement -> IF expression THEN matched-statement ELSE open-statement

This is tricky - the distinction between matched and open statements is not obvious. There are other ways of writing this that are simpler, but they get you in trouble when you run them through a parser generator. The result, however, is that you have a unique way to generate our string, and it matches the first derivation. This is exactly what we did when we converted the grammar S->E+E | E*E | id | (E) into our standard example grammar - introducing the T and F non-terminals and their productions forces us to generate strings in a specific order that matches how we do arithmetic.
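To make the two derivations above concrete, here is a small Python sketch that builds both parse trees as nested tuples (the first element of each tuple is the node label, a convention of mine) and checks that they flatten to the same string:

    def flatten(tree):
        # Recover the frontier (the derived string) of a tree, skipping labels.
        if isinstance(tree, tuple):
            return [tok for child in tree[1:] for tok in flatten(child)]
        return [tree]

    # Derivation 1: start from (3); the inner IF, expanded with (4), owns the ELSE.
    t1 = ("IF-statement", "IF", "expression", "THEN",
          ("IF-statement", "IF", "expression", "THEN",
           "other-statement", "ELSE", "other-statement"))

    # Derivation 2: start from (4); the outer IF owns the ELSE.
    t2 = ("IF-statement", "IF", "expression", "THEN",
          ("IF-statement", "IF", "expression", "THEN", "other-statement"),
          "ELSE", "other-statement")

    print(flatten(t1) == flatten(t2))   # True: same string, two different trees

Two distinct parse trees for one string is exactly the definition of ambiguity; the rewritten matched/open grammar only permits the first tree.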
The dangling-else conflict could also have been resolved with priority rules, but if you change the grammar you are sure of what will happen; with priority rules it is hard to show that no combination of terms will produce an incorrect result. The rest of the chapter goes into Lex and Yacc - useful, but you have more and clearer information on this in the O'Reilly book "Lex and Yacc" and in the tutorials on the class web page.