■■■"""" " ■ "■'* " ' ummmmmimmmmmmmmtmamttmmm^^^mtmm.a^-^-~-^-, mmmv Acta Informatica 2, 12—39 (1973) © by Springer-Verlag 1973 to the left canonical deri; formal notions embodie paper, Knuth defined a which are precisely the LR(\) be obtained determinisi Efficient Parsers decision already made) T. Anderson, J. Eve and J. J. Horning looking at most k svml decision. Similarly Lew Received April 7, 1972 the left canonical deri constraints. Summary. Knuth's LR(\) algorithm is sufficiently general to handle the Both classes of lang parsing of most programminglanguages,with the additionalbenefit of earlierdetection 0 ienefh nf th of syntax errors than in other formal methodsused in compilers. The major obstacle . impeding the use of this algorithm is the large space requirement for parsing tables. rP 'y contained in tr In addition, certain optimisations described here, which increase parsing speed, greater practical inti exacerbate the space problem. at°* any point in a parse 01 Two approachesfor overcoming this problem are explored. need to be matched wit 1. Using parsers related to the LR(\) parser which parse a large subset of the was used in the derivat; LR(\) grammars, but which have much smaller parsing tables. general, to determine wli 2. Devising compact encodings and performing space optimisations on these beyond the last generate tables The LR(k) property Time and space efficiency is evaluated using the Stanford University Algol W virtue of there existing compiler as a base for comparison since the parsing routine in this compileris among has given two procedures the more efficient presently used. a grammar js LR{k] j^ " In the event that the gr; 1. Introduction ning these tables, in ie ,an The widespread use of context free grammars to model the syntax of program- guage. Knuth has ming languages has resulted in the parsing problem for such languages receiving I sentences of the lain a great deal of attention both theoretically and practically. appear elsewhere in .1 s or It is known that, for parsing an arbitrary context free language, an amount . una*ely, exploitation of time proportional to the cube and an amountof space proportional to the square involves generating an /. of of the length of the input string is sufficient (see, for example, Earley, 1970). the orig«nal grammar i For various subclasses of context free grammars these bounds can be considerably *n the case of prograi improved. Programming languages are generally amenable to analysis by parsing loo'< ahead arises prirr methods with time and space requirements which are linearly related to the available on many curre length of the input string. Several such linear parsing algorithms are described Algol 60 is a well-known in the survey article of Feldman and Gries (1968). is separa Historically, these parsers were classified into top downand bottom up categories eal vv'dh look ahead pn s according to the primary strategy adopted. In the former, starting from the other tasks. It is also 1 ni)' principal nonterminal of a grammar, an attempt is made to produce a derivation ng language grammars of the input string. Bottom up methods start with the input string and try to resulting in grammars\vh produce a parse directly. Problems have been treai It was pointed out by Knuth (1965) that the bottom up strategy results in a Seminars largely meet thi parse which has the property, when viewed (in reverse) as a derivation, that the Ranges to a grammar to rightmost nonterminal in each sentential form is replaced in each step. The top ' mending a grammar to . down strategy results in a derivation in which the leftmost nonterminal in each , vocated in this paper i ls a sentential form is replaced at each step (Lewis and Steams, 1968); i.e. the bottom small additional burdi " up strategy leads to a right canonical parse while the top down strategy leads ' eliminating genuine am

*■

\ Efficient l.R(i) Parsers 13

case that theseare the to the left canonical derivation. It is probably the appropriate formal notions embodied in the intuitive ideas of bottom up and top down. In his paper, Knuth defined a subclass of thecontext free languages, the __./_(/.) languages, which are precisely the set of languages for which the right canonical parse can he obtained deterministically (i.e. without backing up and changing any parsing decision already made) in a single pass over the input string from left to right, looking at most k symbols ahead on the input string to determine each parsing decision. Similarly Lewis and Steams consider the LL(k) languages, in which the left canonical derivation can be obtained deterministically under the same constraints. in linearly proportional eral to handle the Both classes of language can be parsed time and space of enrher detection (f) the length of the input string, but the fact that the LL(k) languages are the major obstacle „roperly contained in the LR(k) languages implies that the latter are potentially for parsing tables. greater practical interest. (It is true, however, that in the former category at any point in a parse only thefirst k terminal symbols generated by a production need to be matched with the input string to determine whether that production derivation. In the case of LR(k) languages it is not possible, in bs-t of the was used in the general, to determine whether a production has been used until k terminal symbols nidations on these beyond the last generated by that production have been inspected.) The LR{k) property is associated with a grammar; a language is LR(k) by it. Knuth University Algol W virtue of there existing at least one such grammar which generates . compiler is among iias given two procedures which decide, when given aparticular value of k, whether a grammar is LR(k). The second of these can be modified to provide state tables. In the event that the grammaris LR(k), a relatively simple interpreter for scan- ning these tables, in conjunction with a stack, provides a parsing routine for language. Knuth has also shown that any language which is LR(k) is LR(\). s\nta\^Pprogram the sentences of the language terminate with a unique symbol which does not languages receiving elsewhere in a sentence, this result can be sharpened to LR(0).) Un- fortunately, exploitation of this result in compilers founders on the fact that it an amount nguage, generating an LR(i) or LR(0) grammarin which the phrase structure tional to the square involves of the original grammar is corrupted. iple, Farley, 1970). In the case of programming languages, the need for more than one character ran be considerably look primarily at the lexical level due to the paucity of symbols analysis by parsing of ahead arises available on many current input devices. The occurrence of ":" and ": in arlv related to the =" Algol is a well-known example. Typically,1 ypically, forlor efficiencyetticiency ifn forior no ornerotherreason, 1 d cribed 60 is well-known lexical analysis is separated from syntactic analysis, and the lexical scan can to ategories deal with look ahead problems at this level of analysis as a minor addition "_" .r,o its other tasks. It is also the case that construction of other than trivial program- starting from. Z the . . . . min lang" grammars is extremely error prone process almost invariably .ro uce d ivation S 2^ an resulting in grammars which are syntactically ambiguous. Assuming that lexical problems have been treated separately, it appears that

_ a grammars largely meet the constraints which allow the use of theLR(\ ) algorithm ; strategy resultsi. inir. a o ■> , _ , . , * . . . from the constraints. are slight.. that the changes to a grammar to accomodate deviations more algorithm top Amending a grammar to enable the use of the restrictive SLR(\) h The worst . , - j. advocated in this paper is not always necessary (see Section 6.1) and at L„ x„_, is a small additional burden which can be treated in conduction with the problem i.e. the bottom+ 968 ....eliminating . ambiguities...... down strategy leads "< genume

; " 14 T. Anderson et al. :

, too The state tables of the LR{k) algorithm, evenfor k= 1 are large for use if Q — (vs> y p directly in a compiler or compiler writing system; e.g. a program implementing (written a =>/5) if thei theLR(\) algorithm was applied to an Algol 60 grammar, therun was terminated with the insertion of the 10000th state-input entry into the state table, at which X point 1 237 states had been created (see also Korenjak, 1969). The work reported Also a derives B ( here addresses the problems raised by Knuth, of developing algorithms that that accept LR(k) grammars or special classes of them and of mechanically producing efficient parsing programs. If co0, v>x , ..., &.„<_ producti 2. Definitions and Notation The p-th The p-th producti Let 0 denote the empty set and A denote the null string. recursive. If X and V are sets of strings and xy denotes the concatenation of x and y, A nonterminal sy: then XV ={xy\ xdX and y € V] If 4eFv and a 6 V If X° ={A} then sets X{ are defined by first ( 1 tlierefrom) of producti A production (A, a) may be denoted by A -> left (LHS) andright hand side 'egitimately tion A ->a. immediate It will prove convenient to number the productions (in some arbitrary manner It is appropriate to by the integers 1,2, . . . , s and to use the notation "n the LR(k) algorithi symbol A occurs m the A ->X X ...X , np^o p px p2 pnp ''.V a nonterminal B\ il f-R(i) algorithm will to denote the member of the set of productions. go; where b, (/.___ 1 ,) Unless stated otherwise simplicity that neither . '"Isewliere followed l>\ a A, li, C, ... denote members of I'v /'" I : fm \}. arise-, (when a, b, c, denote members of I ',- . . . ""hide triples , ; (^ A', denote members of V vv,l /.. V "h appears main till " a., /., y, ... denote members of V* '"■'"V states to keep n,

► Efficient I.R(\) Parsers IS

oo use large for If G -(I A. VT P, S±) is a context free grammar, then a directly derives /. U inipj^jMentiiig (written a >/.) if there exist ro€ F* y, d, and A cVN such that a.~yAd, fi=ycod " table, at which and A -*-oi is in P. ie work reported Also a _mw (written a4>/.) /? if there exist a>0, cox y=*A to, y, toe F*. _4e If F^ and a€ F* -/I then we define the sets

first (a) ={A.c K|a4.A co where to 6 V*} (A) follow ={XeVv±\S±±(lAXy where (leV* and ATyeF*±}.

3. The Constructor Algorithms In his section algorithms are (rv.rr,p.s_L). given for the construction of recognisers in the form of state tables. The fundamental algorithm is that of Knuth and the ensuing discussion presumes familiarity with it; however, the descriptions of the modified algorithms derived from it and presented here are self contained In Knuth's algorithm, a state (Knuth uses stateset in place of state) of a parse is defined as a set of triples of the form [p, /; «]. The appearance of [fi, ;■ «] in a state correspondsto the informal nmar, notion that in this state the parse has reached a point consistent with the propositions that (a) production number p may have been used to generate the portion of the sentence scanned thus far. (b) if production p has beenused, the symbols Xpx,Xp2 ,..., Xpj (or derivations therefrom) of production p have already been X , ac !'*. A and a recognised. (c) if production p has been ely of the produc- used, the k letter terminal symbol string a may legitimately immediately succeed the current use of production p. arbitrary manner) It appropriate to * consider the reason for the large number of states arising in the LR(k) algorithm. Consider an LR(\) grammar in a symbol A which nonterminal occurs in the RHS of some production where it is immediately y B, followed if p is the (index) number of a production defining A, LR(\)a^nonterminalalgorithm will the generatea state containing {[fi, i ; ft.], [p> i , b] J,■ b n r) membCrS f V? SUch that (We Mmphcty; thatneitherP AT ° assume ff tT tV nor B generatethe empty string.) Similarlyif A appears followed by a ■ nonterminal C, a statecontaining****{[fi 1 c,i r* cT ' i £______# EaCh f thCSe StatCS sP*wn which mcude triples fe, ,; ftj [p, ° " f"1 and /; cJ respectively with />l. Thus a nonterminal .eh appears many times in the RHS of the productions of a grammargenerates '"any states to keep track of the many different right contexts of its use Only "

r

Can " 16 T. Anderson et at. : The start string when n (i.e. all characters on the RHS of production p have beenrecognised) 7= p tion in the gramm; are the third components of the triples in a state used. If in a state containing of the start string n ft,], and the next character on the input string is b{ this is taken to imply [fi, p stack throughout tl that production p did indeed generate the preceding portion of the input string. b [p,j; b ]} and {[/>,/; cj If74= np, then in thetwo states {[/>, 7; x] r [pp'.cjj the same set of input symbols will be permissible; each input symbol in this pair states. of states will cause transition to the appropriate one of a pair of successor yeVruXL denotes Each such pair of states could be replaced by a single state were it not for the remainder of the inj possibility that the terminal symbols which follow A in one context (and thus Assume that a pars signal the use of some production p in a parse) may overlap with the set of represented by (3). terminal symbols which signal that p has not been used in some other context. The steps involvj Consider, for example, the grammar with productions, Step 1. Compute set satisfying 1) S-^aAc 4) S^aßd £7" 2) S-+bßc 5) A-+e i.e. if [fi, 7] is in the 6) B-+e 3) S-+bAd tially about to be n ' as an '__ ' example, productior In the input string 'a ec\it is ' c' which signals that 'c must be parsed could be about to b using production 5. For the string 'ft « c*. it is V which signals that 'V must be parsed as a'B' using production 6 and not as an 'A ' using production 5- Step 2. Compute The preceding discussion motivated the following algorithm which utilise the left contextual information inherent to the p, j components of the [p, /';« precedence elements of the LR(k) states. like the very successful " methods, it relies upon there being no intersection between the set of termini symbols which imply the use of a certain production in one context with the sei of grammar of terminal symbols precluding its use in another context. The class Membership of Z h] are SLR(k p to which the algorithm applies (and which it implicitly defines) called generated by producl independently b) (Simple LR{k)) grammars. These grammars were defined be encountered next The Deßemer (1969, 1971); for consistency his nomenclature is used. presen: string generated by for constructing formulation offers a different viewpoint and includes algorithms follow that when ye the practica efficient SLR(i) parsers. The basic algorithm is described only for are the states entere general k, boti case k= which is somewhat simpler than the algorithm for '-C-,,, includes the conceptually and computationally. ■Vo'.i> = -V) Provided that 3.1. The SLR{\) Algorithm grammar. The p-th production 1 ..„ Let G =(VN, VT , P, S±) be a context free unambigmms dR) denotedby "n the input string: , t^p^s and Ap^Xpx,Xp2 ...Xpnp n^O. 1) If yeZ. stack The sets Tp =follow {A p )n{VTu±) I- algorithm (» are computed as a preliminary step before commencing the and rename the stac Appendix). A state £7 is a set of elements [p, j], where the occurrence of an elemet on the RHS of produ . in a state means informally that potentially 7 characters ( tion p have been recognised. _ " - Acta Vol. ►"

;

Thereafter,

\,

Infontutica, Kfficient /./.(I) Parsers 17

>oen recognised) The start string S± is treated as though it were the RHS of a zero-th produc- itate containing tion in the grammar, so the initial state o]} implies that no character start string taken to imply of the has been recognised. The algorithm involves maintaining a he input"string. stack throughout the parse which is represented (to the left of the vertical bar) as i_ [P,JAm]} abol in this pair *W*Z...Pm\y* (3) successor states, denotes ytVTu± the next input symbol and to represents the (unexamined) c it not for the remainder of the input string. Initially, the stack contains the single element ntext (and thus Assume that a parse has £7\. reached the state £T>m so that the stack and input are with the set of represented by (3). c other context. The steps involved in proceeding to the next state are as follows. Step 1. Compute ft from £T>m where ft is defined recursively as the smallest set satisfying =£7> u [q, 0] [fi, j] ft m { 1 3 e ft, j <«,, Xp (/+l,= (4) i.e. if [p, 7] is in set, the the (7 _)-th character, Xpu+X), of production pis poten- tially about to be + recognised; if Xpu+U is a nonterminal symbol defined by, for parsed as an '.4 ' example, production number q, then the first character of production number q that '('' must be could be about to be recognised, so [q, 0] is added to the set to reflect this. iuction S. Step 2. Compute the following sets of terminal symbols 11 which utilises s the [p, j; a] Z of = {ac VT\ 3 [p, j]eft, j _ _ for the practical m Hf+x STm x, S7m on the top of the stack are the states entered after recognising X , general k, both pi Xp2 XpHf respectively. Further "£_„, includes the element [fi, 0] and, by (4), an element [fi', j'} , such that ■\p ti'+D=Ap-) Provided that

ZnZ p =Zp nZq =o for all p4= q, (6) -th production is an unambiguous choice of action can be made depending upon the next character on the input string: 0. (1) 1) If yeZ, stack y so that the stack-input representation becomes (2) ft^--.fty\co Ie algorithm (see and rename the stack-input symbols producing c of an element ft-%--.ftXm+x\yco' (7a) RHS of produc i.e. yis renamed Xm+X and the first symbol of co becomes the newy. Acta Vol. 2 " o

m+x

loformatica, " 18 T. Anderson et al. : top n elements from the stack 2) If yeZp , let r=m —np. On removing the p precedence method and inserting A p , the stack-input representation becomes associates the syml ..y symbol of the state x rAp \ym .%£T Korenjak (1969 which after renaming symbols is for reducing the n LR algorithm is us S>> \y to. (7b) %*>... m Xm+l of a grammar. The 1 Step 3. Using the £Tm defined by (7a) or (7b) compute ft as in Step (or algorithm, but spe< recall it) and replace Xm+X on the stack by £f,+1 where the corresponding r work for all right c ={\P. 7+i]| i<»p. x =xp(,+»} 8 ft+i \P. ft*ft. «+i < the grammar, it see ' if, latter, i.e. the symbol Xm+X , left on the stack in Step 2, has just been recognised. If in the sub [p. j]€ft and the (7+l)-th symbol on the RHS of production p matches Xm_, next state, then [fi, /+l] is included in

«>

Q

st;iU

$

t ' ■"* ■" ■'■■— n—*s *~ - — "' ' '" -" ' " —

Efficient LR(\) Parsers 19

ts from the stack precedence methods. This correspondence is established by relation (8) which associates the symbol Xm+X with the state ft +i. (Xm+l is termed the associated " symbol of the state £T .) Korenjak (1969) has described a method, closely related to the SLR method, fnr reducing the number of states generated by the LR algorithm. In it, the I.R algorithm is used to construct sub-parsers for certain selected nonterminals (7b) id a grammar. The main parser for the grammar is also constructed by the LR as in Step 1 (or algorithm, but special provision is made for invoking the sub-parsers whenever the corresponding nonterminals are to be recognised. Since each sub-parser must work for all right contexts of its nonterminal in the RHS of the productions of (8) U) the grammar, it seems clear that the SLR algorithm is equivalent to Korenjak's een recognised. If in the latter, sub-parsers are constructed for all nonterminals of the grammar. i p matches A,„. i in the next state, 3.2. SLR(\) State Tables and the Basic Parsing Algorithm een recognised. The SLR(\) algorithm has been described in terms of parsing a given input ductive step on ... .tring. It can be modified to produce state tables which, with a simple interpreter 1. This method of and a stack, suffice to parse all strings derived according to the grammar. To lenotation includes produce the state tables, a list of all possible states is maintained and m becomes rv weak constraint the index to states in this list. (Initially the list contains only s^={[o, o]} and ye. mi =0.) Step 1 of the algorithm is unchanged; Steps 2 and 3 are replaced by the t of Knuth's LR(k) following. Step 2. Compute ) algor^n (taking Z' = {XtV\3[p, j]eft, j

Step 3. For each symbol X € V, a state table entry for the state £fm is determined (AY, n ...XpH,*)}. from: k not so great. If XsZ' a p for any p, the entry is blank, , he LR(\) algorithm if XcZp the entry is the production number p, if XeZ' , the entry is the state £f to be entered next, where

if theconditions (6) A={[p, j+ \]\\p. jleft, jv the modified algorithm replaces it. "

m+x

if,

?s

X$Z

; 20 T. Anderson et al. :

The following procedure, which is a consequence of the method of construction in the context of this of the state tables, suffices to interpret them. Assume that state . is the current production 4. The example also state and symbol 7 is being scanned (initially. =0 corresponding to S^ ={[0, o]} can be parsed using ; and 7 is the first symbol on the input string) ; then substituting Tp . Thii Step 1. Stack i on the parse stack and examine the (i,/)-th entry of the state should also be point. table. If the entry is blank, then the terminal symbol 7 is invalid and the sub- of storage under som. sequent action is determinedby an error handling routine. The one disadvan Step 2. If the entry is a state number, then this state number is assigned to », of the state tables of involved than in the next input symbol is read and assigned to 7 and a return is made to Step 1. is cc Two alternative Step 3. (The entry must be a production number.) One stack element is presented here. popped from the stack for each character on the RHS of this production, thereby LALR(i) state ta uncovering a state number k. The LHS of the production is a nonterminal /. of a state as a set 0 The (k, l)-th entry of the state table, which necessarily is a state number, is required are the LR{\ assigned to . and a return is made to Step 1 . ft=ft A convenient termination of this procedure can be achieved by detecting the next state number corresponding to the state {[0, I]} in Step 2. The rather small example of Fig. 1 conveys no impression of the size of the Zp = {*\ which would result for a practical grammar; examples will be given in tables If formulated to parsi Section 6. an LALR(\) parser a the 3.3. The LALR(\) Algorithm state table. The d sible states is made 0 Consider the grammar with productions i.e. as in an LR(0) coi " are merged if for eacl 1) S-+aAd 3) S-+bAc other set such that p 2) S -va c c 4) A-+c. (differing only in the When the list of stat On attempting to produce an SLR(\) state table for this grammar as described first removing the ma in 3.2, a state {[2, 2], [4, I]} will be generated. For thisstate,Z ={c} and Zi ={c,d\ mark on a state indie hence Zr\Zys, demonstrating that the grammar is not SLR(\). However, if sequence that the rigl in state {[2, 2], [4, I]} with input symbol y =c the stack is reduced using produc recomputed next state tion 4, then this corresponds to recognition of aA. Since aAc cannot begin any marking any which 1 be as instance sentential form in thelanguage,Z4could (andshould) taken {d}in this continued until all m. More generally, consider the parser in a state £T for which acZp with input right context of each the symbol y =a. Denote by a the string of associated symbols of the states on method implicitly use- stack after reduction by production p. If for all such a, there is no to 6 V* suck in an LR(\) state tab that S^aaw then a should be deleted from Zp . (Note that no symbols car not every LR(\) state be deleted from Z since this would imply the possibility of reading past an error A second method in the input string.) invalid members of tl This inaccuracy in the sets Zp of the SLR(i) algorithm is due to the use ol a state £f, as any stat the sets Tp to determine the symbols which indicate reduction of the stack as a possible next stal instead of that subset of Tp which is the permitted right context of production f exactright context of p in a particular state. Tp is the permitted right context of production p over al by states. example, the state {[2, 2], [4, can only be reached in parsin. In the above I]} b " a string generatedby production 1 or 2; production 3 cannot be involved so that *«

Efficient LR(i) Parsers 21

in the context of this state only the subset {d} of Tt ={c, d) can follow the use of "thod of construction 4- current production state i "i* the example also demonstrates that a larger class of grammars than SLR{\) idingtoy„={lO.O]} The ran be parsed using a state table in which Zp is computed correctly rather than substituting Tp. This larger class is termed LALR{\) following Deßemer. It state out that LALR(i) state tables may be more economical .th entry of the should also be pointed invalid 'and the sub- of storage under somerepresentations e.g. the list representation discussed in 5.1. The one disadvantage of LALR{\) compared with SLR(\) is that computation of the state tables of the former is much more complex, though clearly less effort is assigned to i mber is involved than in computing LR{\) tables. made to Step 1 an is Two alternative methods for constructing LALR(\) tables will be briefly .)ne stack element is presented here. thereby notion is production, LALR(\) state tables may be computed directly. To do this the LR(i) /. on is a nonterminal of a state as a set of triples [/>, /; a] as discussed in Section } is needed. Also i. a state number, is required are theLR{\) versions of Eqs. (4) and (sb) i.e. ft=ftu{[q, 0; ft]|3[£, 7; a]€ft, j

SS I 0 i 22 T. Anderson et al. :

This definition provides an algorithm if circularity is avoided by taking 4 j £/t; [ , Fig. 2, I R(£f, r I]) ={/!} In the a computation [r, itself. This may result in the whenever of R(£r°, I]) invokes necessary action s should be deleted. The right spurious inclusion of Am the final set; it exact eHucine the stack context n ] a state may now as that state, giving an I of [p, p in be used Zp for entrjes in these st i LALR(X) state table. number entry in in cases where ZnZ or In practice it is only necessary to refine Zp the p 4=V subsequent action Zp r\Zyo. This would achieve acceptance of all LALR{\) grammars without ming^ a computation. used necessarily performing full LALR{\) Lalonde (1971) has when g. this technique to refine the sets Zp for every state which is not LR(0). uniquereduce acti state entries in th 4. Transforming the State Tables production numbe number of transformations can be applied to state tables produced by A p algorithms of the L R type with a view to reducing their space requirements or y improving the performance of the parsing process based upon them. Conceptually, . . , . at least, these transformations may be envisaged as being applied to the matrix >"_" _" _ -m further syntactic c representation of the state tables; modification ofiLthe constructorx algorithm1 may "* be more convenient. _ „ .. express . , . / _i i arithmetic For practical purposes, reduction of the number. of states. (rows of the matrix)" , . , ,- , _ and derivationsof by general state minimisation.... techniques (Pager,... 1970). can .be discounted; thev combinatorial nature of these techniques militates against their use. Even SLR(\) «^i/? =>*■ tables typically involveafew hundred states for programming language grammars result very freque Recourse to these techniques also neutralises some of the benefits of the LR production A ■ " special methods with respect to error detection. It is necessary to develop techni- other steps would characteristics ques taking into account features of the LR algorithms and general Thg exam . of programming language grammars to reduce the size of the state tables sub- stantially. in It is appropriate to recall that, in LR type state tables, the blank entries and there are not all syntactic columns corresponding to nonterminal symbols significant ,)e recognised Con , in a associated with a errors are detected by encountering a blank entry column .. transfer is made then that merging states with nonconflicting (terminal) input symbol. We observe ()n return to r£)W t is that merging states with other than identical nonterminal entries possible but ymb0 ] js "* " a tr; for symbols can upset the error detection capability. In SLR{I) entries terminal be app]jed cauSjng tables, states with nonconflicting nonterminal entries and identical terminal js made terminal entries cannot occur (Anderson, 1972). Merging states with nonconflicting t() thg present jnst if the is modified entries is possible with delayederror detection parsing algorithm j( tl)ere .g nQ gema to include checking that the associated symbols of the states on the top of the n,cll stack match the RHS of the appropriate production every time a production number entry is encountered. Detection of errors on the first input character to depart from the rules of syntax is a useful feature of LR methods. Apart from possible benefits in the error recovery problem, compilers using delayed error detection are prone to failures in semantic routines in the interval between reading the invalid character and the error being detected by the syntax analyser Attention has been intentionally confined to transformations preserving this " immediate indication of errors.

_.

"., _.

;

S

tQ

nQt Efficient LR(\) Parsers 23

by taking 4.1. Elimination of LR(0) Reduce States and Chain Derivations " In Fig. 2, the rows 1, 2, 3, 8, 11, 13, 20, 22, 23, 24 and 26 have the LR(0) property that it is not necessary te> inspect the input symbol to determine the in the nay result necessary action since there is a unique action in each of these states, namely, right I. The exact reducing the stack using a production number which is associated with all valid state, giving an it entries in these states. Failure to inspect the input symbol on any production number entry in the table causes no significant delay in error detection; the j=s or ,-here ZnZp subsequent action of the parser will be to use the LHS of the production to deter- without grammars mine the next action of the parser. Ultimately a next state entry is encountered used le (1971) has when this input symbol must be inspected. It follows that states containing a H LR{O). uniquereduce action can be eliminated by the simple device of replacing the next state entries in the table causing transition into such states by the appropriate production number. of the states which would be eliminated by this change, notably those by Some :ables produced corresponding to rows 1, 2, 3 a°d 11 in Fig. 2 are involved in a more general or ice requirements which is suggested by considering the problem of eliminating Conceptually, transformation them. chain derivations. The need to group syntactic constructs together to form a matrix >plied to the further syntactic construct results in frequent occurrences of productions of the ictor algorithm may form _4->B, where A, BeVN . The specification of precedence of operations in expressions, for example, also introduces productions type, matrix) arithmetic of this (rows of the and derivations of the form be discounted; the _4 iruseAnSL/c(l) a/l 2/3 =>""" =>a.An(s =>ay /?, where .d^ 2, ..., langulf^rammars. result very frequently. Usually, in compilers, only the final step involving the benefits of the LR production A„^-y has semantic significance and a parser which bypasses the velop special techni- other steps would be advantageous. eneral characteristics The example grammaradmits the chain the state tables sub- £=*T=>P in . the blank entries and there are two states (rows 6 and 12 of Fig. 2) in which an instance of Eis to hficant all syntactic be recognised. Consideringrow 6, it is observed that when a P has been recognised, with a mn associated a transfer is made to row 1 1 where production 7 causes Pto be parsed as T, then nonconflicting ;s with on return to row 6, the T entry implies a transfer to row 10 where, if the input identical 1 other than symbol is ' ', a transfer is made to row 17, otherwise production number 5 must SLR{\) * .apability. In be applied causing T to be parsed as E. Row 6 then indicates that for E a transfer terminal id identical is made to row 9. At this stage it would emerge that ' ) ' is an invalid successor onflieting mn terminal to the present instance of an E; it is valid however in row 12. We observe that, algorithm is modified if there is no semantic significance attached to production 7, then row 11 may as the of the ites on top well notexist, a transfer to row 10 would suffice; in row 10, if thereis no semantic ry time a production to rst input character 1 S^A 8 r->r*p methods. Apart from 2 S-+I 9 P-*{E) s using delayed error 3 A id : = E 10 P^id ~y»-* the interval between 4 I B then AL 11 B Bor id E T analyser. 5 -* 12 B^id iy the syntax 6 En n T L->-else S this -* + 13 atinns preserving 7 T->P 14 L-+A Fig. 1 "

; 24 T. Anderson et al. :

" M ( ) *If then or else := + ± SAIETPBL semantic signifi (The condition future usage of are associated readily describ ! In the modifie state. For each entry — p in th the appropriat Only Step 3 is Step 3. The Let Z' ={_*., Compute thi

18 io ii The sets W{ all intermediate t = 1,2 r,t

then, for each t,\

Then members list of states. Any WP con " in which the on Such a set Wt" in the state tab the stack using Fig. 2 has been elimin an element must production num significance attached to production 5, then the - 5 entries simply serve to delay a new tyPe of den an error signal until row 9is entered. Theredundant references to the 5 entries °ted by *p. — must could be eliminated by replacing — 5 entries in row 10by the entries in the cor be read fl responding columns of row 9 (omitting the — 5 entry in the ' ) * column) wen lncremented by- it not for thefact that row 10is entered from row 12 also. By a similar argument Tne comput; the entries from row 18 should replace — 5 entries in row 10 (omitting those in * example gra the 'else' and 'X' columns). These assignments are incompatible. If no entries set are omitted from row 10, the assignment incompatibility disappears but so also Z' ={id, (, E does the error detection capability. Complete elimination of chain derivation; x W is possible only at the price of delayed error detection (matching stack symbol- '__ t t with a production RHS whenever the stack is reduced) or at the expense oi , introducing extra states, if e.g. in the example, state 10 is split into two copies 2 ( {[9°'.n* row 6 can use one and row 12 the other. . c {[3.1 It is clear that states containing a single [p,j] element can be eliminated ii 4 T {[ 5. 1.. f=np . A special case arises whenever np Xpi€V and production p ne Ui " N has I

;

=\, V

Efficient LR{\) Parsers 25

case, A IE T P B L semantic significance. In this all reference to production p may be deleted. n (The conditions p Xpi£VN imply that prexiuction /> is a chain production; 2 3 " future usage of this term will imply the additional requirement that no semantics are associated with this production.) The changes to the state tables are most readily described in terms of further modifications to the constructor algorithm. In the modified SLR(i) algorithm, step 2 computes a set Z' and sets Zp for each 7 state. For each member of a set Zp the state table contains a production number 9 10 11 entry — P in the corresponding column. For each member of Z', step } computes the appropriate next state entry and adds any new states to the list of states. Only Step 3 is modified to accomodate the required changes. Step 3. The treatment of Z' is altered.

Let Z' ={XV Xt, ..., Xr} and C ={p |production pis a chain production}. Compute the sets {__>. j i]\3[p.i]€3£, j , X X 1 gr. WJ = + , np]}} not already present are added to the list of states. Any WP consisting of a single [p, np] element would produce an LR(0) state in which the only valid action would be to reduce the stack using production p. Such a set is precluded from generating a new state and instead the entry :6 2 3 WP in the state table for state £7n and symbol Xt is made to specify a reduction of " the stack using production p. If this is done, one state (the LR(0) state which has been eliminated) is missing from the stack when the reduction occurs and an element must be added to preserve the stack alignment. If _¥,€ VN an ordinary production number entry, denoted by p in Fig. 2, can be employed ; if X, €VT a new type of entry is needed, a scan-production— number which will be simply serve to delay entry denoted by *p. A scan-production number entry indicates that a further symbol to the 5 entries ices — must be read from the input string. In either case the stack pointer must be the entries in the cor- incremented by one before the reduction is performed. the ' ) ' column) were The computations of Step are illustrated in the case of for 3v a similar argument 3 the example grammar of Fig. 6, 9, 10, 11, Fig. (omitting those in 1. (See also rows 12 and 13 of 2.) 10 The set ipatible. If no entries disappears but so also Z'={id, (, E, T,P}, hence 1 of chain derivations XA w atching stack symbols l "t w; Wt" of or at the expense W {[10, I]} {[10. 1]} {[10, 1]} . split into two copies, 2 ( {[9, I]} {[ I]} {[ 9, I]} E {[ 3, 3], [6. I]} {[ 3], [6, I]} {[ 3. 3], [6, I]} 4 it can be eliminated if T {[ 5,1], [B,l]} {[ 5, I], [8, I], [3, 3], [6, I]} {[ 8, I], [3, 3], [6, I]} P {[7,l]} {[ 7,1], [5.1]. [3, [6,l]} {[ 8, I]} d production p has no [B.l], 3], I], [3. 3], [6,

=\,

9, 3, "- 26 T. Anderson et al. :

y M ( ) »lf :- + th«nor.t«« j. S A I E T PB L this would be to cop 0 [ 0] 4 5 -0 —0 -0 to replace the existi 4 [ 1] 6 when this —4 entry- 5 [ 1] *12 7 be regarded as reduc 6 [ 2] *10 12 9 10 10 /_ -*/l were eliminah 7 [ 2], 14 15 I [11.1] j tion 4) thereby repla 9 [ 3, 3], 16 -3 -3 grammar,the referen t 6,1] to the new producti 10 [ 8, I], 16 17 -3 -3 [ 3], To accomodate tl [ 6. 1] ing algorithm must I 10' [ 8, I], 16 *9 17 [ 6, I], Step 3. If the en [ 2] I is read and assigned 12 [ 1] *10 12 18 10' 10' Step 4. (The entn 14 [ 3] 4 19 15 [11,2] »11 entry.) One element i 16 [ 6, 2] *10 12 21 21 of this production th 1 7 [8, 2] *10 12 -8 tion is a nontennina 18 [ 6, I], 16 *9 [ 2] a) a state number 19 [ 4] 25 -14 -4 to Step 1 21 [ 6, 3], -6 -6 17 -6 - 6 or [ B,l] b) a 25 [13, 1] 4 5 -13-13 -13 production

Fig- 3 4.2. Sim " The state table i The appearance ,:ntnes m the coluni of [5, 1] in Wt implies (since production sis the chain produc tion E^T) that W vW constitutes appearance l t t WJ\ The of [7, 1] in Wt implies "?F (since T->P, ' *nTV1 US*"SP production 7is a chain production) that Wt must be included in .1 ' °? ? WZ with W This introduces [5, W ', P s . 1] into B which implies that W. must also be .^. , °T* , included. The W/', 1 £/£5. result from the on removing [5, ;" W/ 1] and [7 1 Ttf F Fig. 3 shows the state table for the example grammarresulting from the above Old number I modifications to the constructor algorithm. Some steps in chain derivations may be eliminated without increasing the number of states by taking If the state table W," -=W{ (9 advantage can be ta identical; at the exr where y>X . . X, =>XU v =». W =>_V, is the longest chain derivation for whirr addressing permits t W„,WV, ..., Ww and Wt each have a single member. In terms of the previous discus Im ' similarly treated ; sion relating to Fig. 2 this modification eliminates states like {[7, I]} in row tl produce state tabl while avoiding the replication of {[5, I], [8, I]} in row 10, which is entailed ir '" As a final comm eliminating the chain production 5. worth noting that tl A special case which is not covered by the modifications is the case ofproduc "' column of the 7 tions with a null RHS. Usually semantic information will be associated but il w'"»i«'li can be decom] not, for such a production p, the state table entries for the members of Zp are " replaced by the entry for A p in the same row of the table. In Fig. 3 the effect oi

0, 3, 4, 3, 4,

3,

9, 9, 4,

9, 4,

C

*, *"

Efficient I.R(\) Parsers 27

i r^ b i. this would be to copy the —4 entry in the L column of row 19 into the J_ column to replace the existing —14 entry; however to keep the stack of states correct when this —4 entry is encountered the length of the RHS of production 4 must he regarded as reduced by one. Precisely this effect would result if the production /. PA were eliminated from the grammarof Fig. 1 (by substituting it in produc- tion 4) thereby replacing L-+A by /->- if B then A. On processing the modified grammar, thereference toproduction 14 (L -+A) would be replaced by a reference tn the new production which differs from production 4 in omitting the final L. To accomodate the addition of a new type of entryin the state table, the pars- ing algorithm must be amended. Step 3 is replaced by the two steps Step 3. If the entry is a scan production entry then the next input symbol is read and assigned to / and / is also stacked. Step 4. (The entry is either a production number or a scan production number entry.) One element is popped from the parse stack for each character on the RHS of this production thereby uncovering a state number k. The LHS of the produc- tion is a nonterminal /. If the (/., .)-th entry of the state table is, a) a state number, then this state number is assigned to . and a return is made to Step 1 or b) a production number, then I is stacked and a return is made to Step 4

4.2. Simple Row Merging and Partitioning of State Tables The state table in Fig. 3 is quite typical in that many of the states have no entries in the columns corresponding to nonterminal symbols; states with this ; the chain produc- property can be renumbered to occur in adjacent rows at the end of the state implies 7, l in \V% table. An obvious space saving results by first performing this renumbering and ust be"included in then partitioning the state table into two arrays; the T-table contains only it ll j must also be terminal symbol columns and the N-table only nonterminal symbol columns, r 5, 1" and [7, 1] rig. 4 results from Fig. 3 using the state renumbering scheme : ing from the above Old number 0 4 5 6 7 9 10 10' 12 14 15 16 17 18 19 21 25 New number 1 10 8 3 11 12 13 16 4 914 5 615 717 2 lout increasing the If the state table is separated into a terminal part and a nonterminal part, (9) advantage can be taken of the fact that several rows in the terminal array are identical; at the expense of an array with one element for each state, indirect rivation for which addressing permits the identical rows to be overlaid. The nonterminal array can the previous discus- lie similarly treated; in this case rows with nonconflicting entries can be overlaid i]} in row 1 1 to produce state tables of the form shown in Fig. 5- vhich is entailed in As a final comment on the matrix representation of the state table, it is worth noting that the states can be renumbered so that the next state numbers s the case of produc- in a column of the T-table are represented by a sequence of consecutive integers a„ ji ._s_ociated but ii which can be decomposed into the form c +bt where members of Z are p c =min(a,- — 1). i Fig. . the effect of i " 28 T. Anderson et al. :

" M + ( ) "If thon or "!*" = 1 S A I E T PB L existence s The of

1 state entries causi 2 table, namely, the 3 The values 6, c 4 to the column. Th 5 6 the renumbering 7 range of _>, values 8 the state table the 9 Advantage can or 10 compactly v encode : that those in a ro 12 possible to keep tl the total number 13 possibility.

14 15 The motivatio; 16 is the obvious spa other is to exploit '7 The IBM Svst the searching of a of Fig. 4 a second set, B. an indication that " Row number 1 2 For row i of tl 3 list of the valid ir 4 list 5 of correspondi 6 character in set B i 7 tion provides eithe 8 entry or an indicat 9 10 Production nui 11 several times in a i 12 detection can be dt 13 t<> omit all entries 14 such a row. 15 Instea 16 appendedat the en 17 "i the correspondi: S A I ETP B L ' H for the trans "»r' example, two 1 -0 -0 12 11 -4 c -0 13 13 "He is 2 -13 -13 -13 15 16 16 chosen tor 3 7 17 17 I < 11.-1 /_(,') as mucl 4 -8 The arrays Tt l TV-table "' stored head to " Fig. 5 ■" hieved via an ar

; Efficient /-/?(!) Parsers 29

i: r P B L The existence of such a renumbering of states follows from the fact that all next state entries causing transition to a particular state lie in one column of the state " table, namely, the column corresponding to the associated symbol of the state. The values _>, can be stored in the table and c in an array element corresponding to the column. The requirement that the a, be consecutive is not compatible with the renumbering to partition the state table; it is sufficient, however, if the range of b, values in any column is small compared withsthe number of states in the state table then fewer bits are neededto encode the 6, than the state numbers. Advantage can only be taken of this if production number entries can also be compactly encoded. It is not possible, in general, to renumber productions so that those in a row of the T-table have consecutive numbers, but it should be possible to keep the range of production numbers in a row small compared with the total number of productions. No attempt has been made to pursue this possibility.

5. List Representationof the State Tables The motivation for representing state tables by means of lists is twofold : one is the obvious space saving resulting from the elimination of blank entries; the other is to exploit the list searching instructions available on modern computers. The IBM System/360 "translate and test" instruction, for example, permits the searching of a set, A, of 8 bit characters for the first occurrence of a member of a second set, B. Its result is, in effect, either the index of this element in A, or an indication that no element in A matches an element in B. or els« : = _L 5.1. Encoding of the T-table For row . of the T-table, two arrays are constructed: one, TCHAR(i), is a 2 -14 list of the valid input characters for this row; the other, TACTION(i), is the list of corresponding entries from the row of the state table. Given an input 3 character in set B and taking TCHAR(i) as set A, the translate and test instruc- 14 tion provides either an index into TACTlON(i) for the appropriate state table -3 -3 entry or an indication that the input symbol is invalid. -3 -3 Production number entries referring to the same production often occur >everal times in a row of the T-table (see rows 8, 9, 13, of Fig. 5)- Because error detection can be deferred until after the stack has been reduced, it is convenient -6 to omit all entries for one production from TCHAR(i) and TACTlON(i) for such a row. Instead a special code, default, representing a symbol not in V is appended at the end of the list TCHAR(i) and the production number is inserted in the corresponding element of TACTlON(i). The default code is included in b L set B for the translate and test instruction. For grammars more complex than our -4 example, two or more sets of production number entries may exist in a row ; one is chosen for the default entry which reduces the number of elements in XCHAR(i) as much as possible. The arrays TCHAR(i) and TACTION(i) for all rows . can conveniently be stored head to tail in arrays TSYM and TACT. Access to these arrays is achieved via an array TSTATE containing one entry with two fields for each "

; 30 T. Anderson et al. :

row One of .. these fields contains an index to the first elements of TSYM anc The treatmentof TACT row ; for . the other field specifies the number of entries in these array; the derivation E : I for row . The compacted form . of the 7-table shown in Fig. 5 is % convenient and W„" were seen to form from which to build the lists. There is now the possibility offurther economic respectively. The m in storage e.g. the only entry in row sof Fig. sis identical to an entry which mus: from which it follov be present I in TSYM and TACT for row 1 ; no additional space is needed for it are a subset of those i Consequently, in th can 5.2. Compacting the T-Table lists W overlaid , the state numbers o Ihe general problem ** — of overlapping the entries in TSYM and TACT fron be*"used to overlay c rows of the state tables which have common subsets of entries to economise or The procedure be storage under the restriction that all elements of a row must lie in sequent* of the entries of one positions is nontnvial; however, worthwhile storage savings can be achieve, to each state' initial with relatively simple strategies. Two which have proved to be quite effective are described. Step 1. An unm Let A ( tri is i = {atl,ai 2 «mJ denote the elements of TCHAR(i) and _3,= '» es selected t {b , bit b ini} denote the corresponding of a elements TACTION(i). Sufficient 0 _r«i v " (though certainly not l Unma necessary) conditions for entries from two rows . and / 1« vL { ' overlap in the arrays TSYM and TACT are ' il> The r members of ( 1) A ,- nAf 4= 0, 2) if =a a,k ileAi nA j then btk =bjl ( 10 3) arran Bed in orde a,ni and «,-„, are not both equal to the default code. ** c . , _ Step 2. A set 0., Since A,=(A -A )u(A,nA and i i i) A j^(Aj-A^v^nA,), the elements ot 2, . . rin turn If " A,.-A , then can t A^A,, and finally A, -A, be inserted in TSYM in sequence entries of each mem (As each element a or a ik fl is inserted into TSYM, its corresponding element b, have their nonblank or bjt is inserted m corresponding the position of TACT). Condition 3 abov, the 7-table The me ensures that, after i perhaps interchanging the roles of the names . and /, am obtain a further com default code is located in the set A,—A, and this must be the last element of thi The algorithm ter set to be inserted into TSYM.->■»*«. f u ,- XT v the correspondins Wo attempt is to made overlap entries from more than two states, and the In practice this I- pairs of states are selected with the aid of a triangular matrix C for which the the other technique elements are defined by responding set Qa = { C,7 =0, for any pair of rows . and / not satisfyingthe three conditions (10), Cfj = cardinality of A^nAj, otherwise. Pairs of statesr and s whoseentries are to be overlapped are selectedby repeatedh Tlie encoding use determining i,,jnR three corresp( Cr ,=max(C, ; ) entries in the .V-tab default entry techniq then effectively " replacing all entries in rows and columns r and sby zeros, unti '' fr«quently occurri no nonzeroelement remains in the matrix C. '-limination of chain If two states both contain default entries, the conditions for overlappinf "''"'"cal. In column.- their entries include therequirement that ; Ilt:xt state entries _>,„. = bjnj this seems to be rarely satis- .'!' fied in practice unless chain derivations are eliminated when the technique n: is little doubt " described next is far more effective. "'*',ihl is extremely eft

;

./ V-

Efficient LRU) Parsers 31

its of TSYM and The treatmentof chain derivationswas illustrated in Section 4.1 by considering these the derivation E=> T in the example grammar es in " arrays of Fig. 1. The sets WP , Wt" ; i^ ,i convenient and W£" were seen to correspond to the states entered afterrecognising E, T and P The method of c further economies respectively. constructing these sets ensures that W," WJ" c Wb", which it follows entry which must from that the (nonblank) state table entries for the state W3" c is needed for it. ire a subset of those for the state WP which in turn are a subset of those for WP. Consequently, in the list representation, the entries for both of the states WP and |f{" can be overlaid on those of WP If during the construction of state tables, the state numbers of sets of states like WP , WP and Wf are recorded, they can and TACT from be used to overlay entries in the arrays TSYM and TACT. . to economise on The procedure below is similar in intent but caters for ' accidental' inclusions t lie in sequential of the entries of one state within another. The process involves appending a mark , can be achieved to each state; initially all states are unmarked. be quite effective Stepl. An unmarked state £fM with the maximum number of nonblank entries is selected then the largest AR(i) and _3, = set Qt is computed such that IOX(i). Sufficient Pi ={■*/! "*/ is unmarked, the nonblank entries of state _5? are included in those wo rows i and ; to of.*?,}

The r members of Q x 2 £?„ (10) ire arranged in order of descending numbers of nonblank entries Step 2. A set is Q 2 constructed. Initially its only member is £T]V For k = ), the elements of 2,3 rin turn, if the nonblank entries of S7)h are included in the nonblank 'SYM in sequence .ntries of each member of 2 , then Q £?„ is included in Q2 . The states in Q 2 can .onding element b ik have their nonblank entries overlaid on of £? " those M in the list representation of Condition . above the T-table. The members of Q2 are marked and a return is made to Step 1 to ame-. . and /, any obtain afurther compatible set for overlaying. of this last element The algorithm terminates when all states aremarked. Note that for many states 'f st the corresponding sets Q2 ={^.}. two states, and the In practice this latter process for compacting lists has been applied first and rix C for which the the other technique has been applied y only to those states M for which the cor- responding set Q2 = {£TM }. conditions (10), s.3.Encoding of the N-Table lected byrepeatedly The encoding used for the _V-table is basically similar to that of the T-table, using three corresponding arrays NSTATE, NSYM and NACT. As blank ""tries in the .V-table can never be encountered by the parsing algorithm, the h'fault entry technique used in encoding the 7-table may be applied toeliminate nd s by zeros, until i frequently occurring entry from either a row or a column of the AT-table. I'hmination of chain derivations ensures that in some rows several entries are r.ns for overlapping identical. In columns, repetitions of the same next state can be expected, since ill is to be rarely satis- next state entries leading to a particular row originate in the same column. vhen the technique There is little doubt that choosing a default value for each column is superior "id is extremely effective in reducing the'space requirements. The choice of "

; 32 T. Anderson et al. :

default entry is a value occurring at least as often as any other in a column An array NDEF contains the default value for each column of the _V-table, and r U s e r the arrays NSYM and NACT contain only those entries from a row which P are ° *J P^briefly"t] making it trivial to ascertain the existing parsing table space requirements and en un cn"g an enl relatively simple matter to replace the parser for timing comparisons. , a mixt ""^routine The XPL compiler (McKeeman, Horning and Wortman, 1970) uses PP c or by the p strategy precedenceparser with onecharacterof lookahead. The class of gramma.' .^ P ' ng routlne s this method is larger than and properly includes the simp accomodated by ap "ate' grammars. XPL is of interest since Lalonde (1971) has alreadyreplace V semantic n precedence ' J the mixed strategy precedenceparser in the XPL compiler by an LALR(i) pars Really, this ineeh lclentifiers in man and reported on the space and time comparisons of his implementation. ' 1 Acta Informatics, _ " Vol.

USI

;

"

S M>alM 11l _ jj-^^Wi^*W"'* >1111l mmtmmmmmmmmmmtltmtm mm „,., , , mmtmmmlimiltmlmmma^ * * * ' """ ' — " m_^^

Efficient LR(\) Parsers 33

ther in a column, 6.1. Space Efficiency the .V-table, and To put space requirements in context, the formats used for array elements oin a row which " are described briefly. 75 YM and NSYM have 8 bit elements permitting a 255 vocabulary. (Since the state tables are split into a T-table and an the list in NSYM symbols then 255 symbols in each of V and V are in fact permitted.) TSTATE given nonterminal .V-table N T NSTATE have 16bit elements which specify both a pointer and a length; ■'.!<"/' i-> selected, and using 10 bits for the pointer allows approximately nemdcfault entries from from S'DEF. 1024 of the Af-tablc and the '/'-table, provided that no more than entries is possible; unless each 64 occur in a single row of these tables. The use of 16 bit elements for the arrays y much less, since TACT, NACT and NDEFreflects convenience of access on an IBM System/360 mall compared to rather than information content. These elements include a two bit field distin- s in columns of the type of state table entry—next state, production column becoming guishing the number or scan- number—and an integer defining either the next state or production ,pparent in Fig. 4. production appreciable number. As space is available within these elements, the length of the RHS of a allows is included for production and entries; earched tend to be production scan-production these lengths could equally well be stored in a separate array, as indeed the LHS symbols of to keep the length productions Eight bit entries suffice for the array LHS. eferred to that of the are. It is perhaps appropriate to interpolate a practical comment on the use of LR analysers. Programming language grammars are not usually context free and so are not LR(k), much less SLR(i). While the lexical analyser and semantic routines of acompiler can usually be so designed that an SLR(i) grammar repre- sents the syntax of a language, it may not always prove convenient or possible. tate tables and to failure of a grammar to satisfy the SLR(i) condition implies that for ng the e^cts of the The one e\i--tin(^nguages, or more state-input symbol combinations there is not a unique action in the T-table. The list of possible alternative actions for one of these state-input combinations can contain at most one action next state or and Hoare (1966). of the scan-production type; the others tably large, contain- number are necessarily of the production number type. In the current implementation it is assumed that a semantic routine, uniquely associated nt extensions. The with a particular production, is invoked a or scan- 60 is fast, notwith- whenever production number production number entry naming that speed is due, in no production is encountered. Provided that this semantic routine can determine whether or not its associated production >ased on the simple may be applied to reduce the stack at this ntains a significant point in the parse, such lists of actions can be accomodated by means of a fourth type of entry .s chain derivations in the TACT array, which specifies both the number of alternative entries and points to the first of is largely the capa- the alternatives in an SUPTACT. e well known speed array The entries in SUPTACT have the same format as those in TACT. A next state or scan- entry, if one similar. The existing production exists, is placed last the list of alternative actions and is used if no with which to make in only pro- duction number entry is found to to L360 (Wirth, 1968) be applicable. The amendment the parsing routine to deal with this extension involves the inclusion of a flag which is set : requirements and a on encountering an entry of fourth which npansons. the type in TACT and is reset either by a semantic routine 1970) uses a mixed which determines that its corresponding production may l)e applied or by parsing entry next state type. he clasr, of grammars the routine if the last is of the The parsing routine includes the simple selects alternatives from the Ust passing control to the appropriate so long as set. has alreadyreplaced semantic routines the flag is Typically, fan LALR(\) parser this mechanism maybe used tohandle theconcept oftypesassociated .mentation. with identifiers in many programming languages. Acta Vol. 2 "

M,iai aM,M>

Infonnatica, " 34 T. Anderson et al.: Table 1. Storage space for Algol W and XPL parse tables with chain derivations partially Table 2 . Storage spa eliminated

Algol W XPL

States Productions 108 Kntries in terminal c Terminal symbols 41 Kntries in nontermir Nonterminal symbols 48 States with nonuniq Average length of production RHS. 2.0? Nonuniqueentries

....(0) elimina States 183 states Entries in terminal columns \ )78 Identical rows remo\ Entries in nonterminal columns 395 Kows with nonuniqu Kntries eliminated v States with nonunique entries 0 Kntries eliminated v Nonunique entries 0 Kntries eliminated b Effects of transforming state tables Kntries eliminated b L7.(0) states eliminated 166 84 Identical rows eliminatedfrom T-Table 37 23 Rows with nonunique entries eliminated from T-Table 4 0 Array elements in Entries eliminated using T-table defaults 2543 731 Entries eliminated using A. -table defaults 1 338 325 Entries eliminatedby removing identical T-table rows 526 146 Entries eliminatedby overlaying rows in the T-tablelistrepresentation 143 53 SLR(\) tables in list form Array elements in : TSTATE 160 Total storage requird N STATE 60 39 TACT 328 164 " NSYM, NACT 214 70 LHS 190 108 NDEF 71 48 derivations. If noel SUPTACT 18 o have contained 86 Total storage requirements in bytes for: The space reqi SLR(\) parser with partial chain elimination 2434 _ Lalonde for his im Simple Precedence parser (with complete chain elimination) 6 730 Ampler parsing roi Mixed Strategy Precedence parser (with no chain elimination) 2962 tions rather more Table 2 provi are completely el The space compactions described can usefully be applied to these extension- the space requirei to the T-table. Two rows of the T-table can be merged if for each input symbol quadrupled, mainl there are identical unique actions or identical lists of nonunique actions. In am increased sevenfole row of the T-table, a list of alternatives containing only production number „f states to be recc entries may be chosen as the default entry for the row. In the array SUPTACI .V-table is less effe lists may be overlaid. "f this extra spac Table 1 provides information on the size of the Algol W and XPL grammar- <,n iy about twia, t , together with details of the SLR(\) state tables for these grammars. The effect with many operal on the tables of the various transformations is indicated, and storagerequirement- in the middle, the are given based on the list representation data structures which have beer ting theexistence described. Finally, for comparison, the storage requirements of the existing creating parse tab parse tables in the current Algol W and XPL compilers are included. Tli> the resulting table- " table sizes given in Table 1 result from using (9) to partially eliminate chan Tables 1 and 2. Ti

TSYM, Efficient /./.(1) Parsers 35

Irriv.itions partially table 2. Storage spare (or Algol W parse tables with full elimination of chain derivations

AlgolW Algol \\"XPL SLR(\) slate tables states 469 100 108 Kntries in terminal columns 7483 ol 4 1 Kntries in nonterminal columns 1556 71 4S States with nonunique entries 7 2.;>0 2.07 Nonunique entries 810 Effects of transforming state tables 330 183 I.R(0) states eliminated 155 -t 5 1 3 1 178 Identical rows removed from T-table 37 I 552 395 Kows with nonuniqueentries eliminated from T-table 4 0 Kntries eliminated using T-table defaults 3455 810 O Kntries eliminated using N-table defaults 1022 Kntries eliminated by removing identical T-table rows 524 Kntries eliminated by overlaying rows in the T-table list representation 1605 100 S4 .ti 23 State Tables in List Form 0 4 Array elements in : TSTATE 2M3 731 N STATE 1 33S 325 TSYM, TACT 526 146 NSYM, NACT 143 53 LHS NDEF SUPTACT 160 99 Total storage requirement in bytes : 60 39

190 108 71 -IS derivations. Ifnoeliminations wereattempted, thearrays NSYMandNA CTwould IS 0 have contained 86 fewer elements in the case of Algol W and 14fewer for XPL. The space requirements for XPL are slightly better than those reported by 1 182 2434 for his implementation. The data structures used here lead a (> 7 > Lalonde to somewhat 2962 simpler parsing routine and even withoutconsidering eliminationof chain deriva- tions rather more optimisation to reduce parsing time has been employed here. Table 2 provides similar data, for Algol W only, when chain derivations are completely eliminated. Without the overlaying techniques of Section 5-2, d those extensions the space requirements compared with partial chain elimination would have "arii iiiput symbol quadrupled, mainly because the number of entries in TSYM and TACT have in- actions. In any increased sevenfold. Lesser contributions arise from the doubling of the number induction number of states to berecorded and thefact that use of default entries for columns of the SUPTACT .V-table is less effective in saving space. The overlaying techniques recover much of this extra space so that the final space requirements quoted in Table 2 are id XPL grammars only about twice those of Table 1 . It is clear that space requirements for languages mmars. The effect with many operators in an extensive hierarchy would be great. On breaking .ragerequirements in the middle, the 12 step chain derivation for an Algol W expression (by stipula- which have been ting the existence of a semantic routine for the appropriate production) and then ts of the existing creating parse tables with full chain elimination specified, it was observed that are included. The the resulting tables had a space requirement midway between those reported in ly eliminate chain Tables 1 and 2. Time-space trading is possible in this way. "

Si

array I1 1 " 36 T. Anderson et al. : In comparison with the current Algol W parser, the SLR(i) parser fare The times in favourably on space requirements even when full chain elimination is specific. the total compil theyimply that . 6.2. Time Efficiency by about 1 5 % The fuct ths- The syntax analysis Algol phase of the W compilerr was _timed using seven „ A ,„ . , _, , areso smallmen ! Algol W programs as input. and with... four. variants. of. the SLR parser substitute. , g for the existing parser. Since it was easy to suppress the effects of chain elimim ,P, __, . 4.- it i Xj. x- " -_t i P . A ... bO% of those cv tion in the latter, tinungs with and without it were comparison. ~ .-_.- „ taken for Tt (hfference " , _.. _._"„.. , , 'his i performance rankingi " oft the six" parsers was substantially the same whicheve. _, , program was used for input. The results for five of these programs are shown i | , Table 3; times are quoted insecondsand arenot moreaccurate than 0.01 second , . +-1- the.. number of < tlie SP DEirser 1 Table Time (seconds) of the syntax Algol compiler using 3- analysis phase of the W varioi ,f. - , ... parsing methods (theret namely, k;nowin, SLR(i) SP SLR(\)PC SLR(i)SC °SLR(\)C SPC of the len.igth of a search throug "rogram 1 0.43 0.41 0.35 0.31 0.28 of the stack. Tw 'rogram2 1.12 1.09 0.94 0.83 0.76 0.7: "rogram 3 1.64 159 139 1.21 1.09 1.11 entries is expen 'rogram 4 2.40 2.32 2.02 1.78 1.61 1.6i Secondly, pressi "rogram 5 4.07 3.92 3-42 3-02 2.76 2.7.; way of searchin of the IBM Syst ing short lists is In order of increasing speed the parsers are : In terms of " 1) SLR{\). An SLR(i) parser with LR(0) reduce states removed but wit (2419 bytes) gi\ no chain elimination attempted. premium (258 b 2) SP. The existing simple precedence parser of the Algol W compiler be SLR(])SC ; with elimination of chain derivations suppressed. 4000 and 5 500 1 3) SLR(i)PC. The parser for which space data was given in Table 1. Chai' are partially eliminated (to the extent that they can be without increasing tl number of states). The parsing 4) SLR(i)SC. The parser mentioned at the end of the previous section Algol which the derivation for an W expression was broken in the middle b ( ' otherwise chains were completelyJ eliminated. , ' r '"any detection" for which 5) SLR(\)C. The parser space data was given in Table 2 and: writer and the a the which chain derivations have been completely eliminated. p| lf. 6) The existing Algol W parser. phot ing the lis mv; tns The most important comparison is I>etweeii the SLR(I)C and the SPC pur* """ The slight difference in performance between the two parsers can be viewed '"'vantages' " ' can Uv resulting from minor optimisations in the SPC parser; similar optimisations! «'*' be X" " available within the SLR framework but more direct methods of improvr searching"' d> performance of SLR parsers are available and look more promising. For examp '" "(Vl' "'S'l Pc after complete elimination of chains, overlaying of identical columns of t slat e tal " ,s"' Af-table (after default entries have? been removed) offers a way of replacing t ,n st r s|, searching by direct table look up without increasing space requirements em ' r ntations""' t ' " mously.i iii ther ronm-irlomp.-Cif

,

,>fflClent

0.2;

" SPC, Efficient Lli(\) Parsers 37

are one third of i(\) p^ffr fares The times in the last two columns of Table 3 approximately of ition is specified. the total compilation time the programs. In conjunction with the first column they imply that elimination of chain derivations improves total compilation time by % about 15 " The fact that performance differences between the SLR(i) and SP parsers nod using several .ire sosmall merits comment ; particularly since Lalonde reported that his LALR{i ) >arser substituted parser (using software list searching) produced parsing times which were about of chain elimina- .jO% of those given by the mixed stategy precedence parser in the XPL compiler. The comparison. This difference is largely accounted for by the different ways in which the simple same whichever precedence and mixed strategy precedence methods determine which production ams are shown in to use when the stack must bereduced. The simple precedencemethod determines in ± 0.01 seconds. the number of symbols which must be removed from the stack—a fact used by the SP parser in setting up a hashing function which is known to be extremely inpiler using various efficient (thereby largely negating the potential advantage of the LR methods, namely, knowing a priori which production toapply). In the absence of knowledge >"../.( i)C SPC n( the length of the production, the mixed strategy precedence method involves a search through all productions with a final symbol matching that at the top 1.28 0.27 of the stack. Twofurther comments may be made. Unpacking two bit precedence 0.7 0.75 to 1.00 1.10 entries is expensive; the SP parser expends 3 800 bytes avoid this penalty. 1.61 1.60 Secondly, pressing the translate and test instruction into service is not an ideal 2.70 2.75 way of searching lists, particularly in view of the abysmal indexing capabilities of the IBM System/360 character handling instructions. The overhead for search- ing short lists is high. In terms of time space trade-offs, Table 3 suggests that the PC variant (2419 bytes) gives a substantial performance improvement for a modest space premium (258 bytes) when compared with the straightforward SLR(i) parser. requirements of approximately .1 \\ compiler but The SLR(\)SC and SLR(i)C variants with space 4000 and 5 500 bytes are less economical. in Table 1 Chains . 7. Summary lout increasing the The parsing methods based on the SLR(\) algorithm accept a wider class of grammarsthan otherformal methods used at present, with consequential weaken- irevious section in ing of constraints in choosing a grammar which conveniently reflects the semantics in the middle but of a language. While this is perhaps the major advantage of these methods, the early detection of syntactic errors has immediate benefit to both the compiler in Table 2 and in writerand thecompiler userand has potentialbenefit in the errorrecovery problem. The implementation described here was deliberately directed towards ex- ploiting the list searching instructions available on computers such as the shown that such an nd the SPC parser. I'nivac 1108 and IBM 3 60; it was in environment the above quite comparable s can be viewed as advantages can be obtained with time and space efficiencies x optimisations are io one of the better methods currently in use. The absence of hardware assisted hods of improving list searching does not necessarily imply untenably large space requirements to using. For example, achieve high performance. Suggestions for compactly encoding the matrix form ■if are :al columns of the the state tables were made in passing; also since typically there many of replacing list in the state table having only one or two entries, mixed matrix—Ust requirements enor- representations are possible. If the early error detection requirement is relaxed, further compaction techniques become available which have been excluded here.

C

■ay '

38 " ' T. Anderson et al. : a pa lCular presentation of the state ro„W v? Tr ? tables, the parsk Korenjak, A.: A thCm iS mdePen*»t of Aether the tables are SLRh 6.3-623 (t 9 6< LALm*n or LR(i).r 2 t?-Likewise the ,alonde WR : table compaction techniques are in the mair ' ,ab,e a,,ho*8h th y ha L fartM « .-JKSM 255sittZ^' °' ' " " McKeeman, W.I. ' Hall 1970. so) toac tnowledge many discussions withE.H. Sa, D: A terthwJtTS ynch ThankS ] wT.^PTrT' are als due to the former a"d (o 462 473 (»97f reading2r?nZdrafts ofJ^hthis J.M. Rushby manuscript and pointing° out deficiencies "^. N : PL 36 (1968). Wirth, N., Hoar. Appendix 413-432(196. SCtS fUSt(X) and f llow{A) imputed Wit below ° *» from the definitions giv« g£f cloi First nonterminal symbols generating the empty string are determined recur sively using I >r. TAT. Anderson, Computing Labo: A=>A iff A-+B B x 2 ...8„, B^A, i^i^n n^o The University 8 upon : For a terminal symbol *, first(x) = {x}. ggj For a nonterminal symbol A Xcfirst (A) iff_Y=_4v A-^B Yy, x ...Bn B^A, i^i^n, n^O, Xefirst(Y) If acl7* then Xc " first («) iff « = Y(3, X€first (V) v(Y±A*Xefirst(fi)) The sets last(x) are needed in the computationof follow(A) and are obtain, analogously tofirst (x), i.e. .as.(*)={*} Xelast(A) iffX=Av

A~*fiYB1 ...Bn, B^A, Igigfl, «^o, Finally Xelast(Y) Xt (V) iff 3A -+pY follow x Bx ... BnXxV , B^A t__i__»,

Yelast(Yx), Xefirst(Xx). ±6 follow(V) iff Felast (S)

References A L ,) ph n univ,,. p"LLai:£.'iy°y "<«■"«<"■ "» <■»»>" M °' ° "" PhDT C«,__SS:5,'„,7 W -»-«■* "»'» » Deßemer F. L. : Simple LR(k) grammars. CACM 14, 453-460 (1971) fc-arey, j.. An efficient context free parsing algorithm. CACM 13 94-102 (1970 an Gr D ! J Translator w"«"8 CACM 11, 77 .968) Knnth n r : n Ill', htion of lanß,,agoS '-U3 u^tSf; from left t0 riKht D £^ " «o?-6S(SS '.'«

" w,rth

w^O,

S .Efficient /. /«'(!) Parsers 39 hies, the parsing Korenjak, A. : A practical method for constructing /./.(/.) processors CACM 12 bios are SLR(\) 61.3 -62 j (1969). Ulondc, W. R. : All efficient "ir.. in th.- main LALR parser generator. Tech. Rep. CSRG-2 (1971), 111 UK main, University of Toronto. V ' described 111 the ( ]S, W P.M., RE.: Syntax directed transduction. JACM 15, 465-488 (1968). McKeeman, W.M., Horning, J. J., Wortman, D. B. : A compiler generator. Prentice Hall 1970. D: A solution to an en ins with F H Saf ''W'' °P problem by Knuth. Information and Control 14 4 [J.M.Rushbyfo. - 73r_! Wirth.f^N.: 9^ a programming language for the 360 computers. JACM 15, 37-74 (1968). Wirth, N., Hoare, CAR.: A contribution to the developmentof Algol CACM 9 413-432(1966). Wirth, N, Weber. H. : Euler: A generalisation Algol, aeiinmonsdefinitions Hvon _ of and its formal definition: given Part T CACM 9 13 23 (J966) etermined recur- f T Anderson alld Dr j Eve Dr. J. J. Horning (omputing Laboratory Department of Computer ;; The University University of Toronto Newcastle upon Tyne NE 1 7RU Toronto 181

1. 94-102 (1970). -113 (1968). Information and " o

.in, Steams,

PL36O,

Science,