Seminarslargely Meet
Total Page:16
File Type:pdf, Size:1020Kb
■■■"""" " ■ "■'* " ' ummmmmimmmmmmmmtmamttmmm^^^mtmm.a^-^-~-^-, mmmv Acta Informatica 2, 12—39 (1973) © by Springer-Verlag 1973 to the left canonical deri; formal notions embodie paper, Knuth defined a which are precisely the LR(\) be obtained determinisi Efficient Parsers decision already made) T. Anderson, J. Eve and J. J. Horning looking at most k svml decision. Similarly Lew Received April 7, 1972 the left canonical deri constraints. Summary. Knuth's LR(\) parsing algorithm is sufficiently general to handle the Both classes of lang parsing of most programminglanguages,with the additionalbenefit of earlierdetection 0 c ienefh nf th of syntax errors than in other formal methodsused in compilers. The major obstacle . impeding the use of this algorithm is the large space requirement for parsing tables. rP 'y contained in tr In addition, certain optimisations described here, which increase parsing speed, greater practical inti exacerbate the space problem. at°* any point in a parse 01 Two approachesfor overcoming this problem are explored. need to be matched wit 1. Using parsers related to the LR(\) parser which parse a large subset of the was used in the derivat; LR(\) grammars, but which have much smaller parsing tables. general, to determine wli 2. Devising compact encodings and performing space optimisations on these beyond the last generate tables The LR(k) property Time and space efficiency is evaluated using the Stanford University Algol W virtue of there existing compiler as a base for comparison since the parsing routine in this compileris among has given two procedures the more efficient presently used. a grammar js LR{k] j^ " In the event that the gr; 1. Introduction ning these tables, in ie ,an The widespread use of context free grammars to model the syntax of program- guage. Knuth has ming languages has resulted in the parsing problem for such languages receiving I sentences of the lain a great deal of attention both theoretically and practically. appear elsewhere in .1 s or It is known that, for parsing an arbitrary context free language, an amount . una*ely, exploitation of time proportional to the cube and an amountof space proportional to the square involves generating an /. of of the length of the input string is sufficient (see, for example, Earley, 1970). the orig«nal grammar i For various subclasses of context free grammars these bounds can be considerably *n the case of prograi improved. Programming languages are generally amenable to analysis by parsing loo'< ahead arises prirr methods with time and space requirements which are linearly related to the available on many curre length of the input string. Several such linear parsing algorithms are described Algol 60 is a well-known in the survey article of Feldman and Gries (1968). lexical analysis is separa Historically, these parsers were classified into top downand bottom up categories eal vv'dh look ahead pn s according to the primary strategy adopted. In the former, starting from the other tasks. It is also 1 ni)' principal nonterminal of a grammar, an attempt is made to produce a derivation ng language grammars of the input string. Bottom up methods start with the input string and try to resulting in grammars\vh produce a parse directly. Problems have been treai It was pointed out by Knuth (1965) that the bottom up strategy results in a Seminars largely meet thi parse which has the property, when viewed (in reverse) as a derivation, that the Ranges to a grammar to rightmost nonterminal in each sentential form is replaced in each step. The top ' mending a grammar to . down strategy results in a derivation in which the leftmost nonterminal in each , vocated in this paper i ls a sentential form is replaced at each step (Lewis and Steams, 1968); i.e. the bottom small additional burdi " up strategy leads to a right canonical parse while the top down strategy leads ' eliminating genuine am *■ \ Efficient l.R(i) Parsers 13 case that theseare the to the left canonical derivation. It is probably the appropriate formal notions embodied in the intuitive ideas of bottom up and top down. In his paper, Knuth defined a subclass of thecontext free languages, the __./_(/.) languages, which are precisely the set of languages for which the right canonical parse can he obtained deterministically (i.e. without backing up and changing any parsing decision already made) in a single pass over the input string from left to right, looking at most k symbols ahead on the input string to determine each parsing decision. Similarly Lewis and Steams consider the LL(k) languages, in which the left canonical derivation can be obtained deterministically under the same constraints. in linearly proportional eral to handle the Both classes of language can be parsed time and space of enrher detection (f) the length of the input string, but the fact that the LL(k) languages are the major obstacle „roperly contained in the LR(k) languages implies that the latter are potentially for parsing tables. greater practical interest. (It is true, however, that in the former category at any point in a parse only thefirst k terminal symbols generated by a production need to be matched with the input string to determine whether that production derivation. In the case of LR(k) languages it is not possible, in bs-t of the was used in the general, to determine whether a production has been used until k terminal symbols nidations on these beyond the last generated by that production have been inspected.) The LR{k) property is associated with a grammar; a language is LR(k) by it. Knuth University Algol W virtue of there existing at least one such grammar which generates . compiler is among iias given two procedures which decide, when given aparticular value of k, whether a grammar is LR(k). The second of these can be modified to provide state tables. In the event that the grammaris LR(k), a relatively simple interpreter for scan- ning these tables, in conjunction with a stack, provides a parsing routine for language. Knuth has also shown that any language which is LR(k) is LR(\). s\nta\^Pprogram the sentences of the language terminate with a unique symbol which does not languages receiving elsewhere in a sentence, this result can be sharpened to LR(0).) Un- fortunately, exploitation of this result in compilers founders on the fact that it an amount nguage, generating an LR(i) or LR(0) grammarin which the phrase structure tional to the square involves of the original grammar is corrupted. iple, Farley, 1970). In the case of programming languages, the need for more than one character ran be considerably look primarily at the lexical level due to the paucity of symbols analysis by parsing of ahead arises available on many current input devices. The occurrence of ":" and ": in arlv related to the =" Algol is a well-known example. Typically,1 ypically, forlor efficiencyetticiency ifn forior no ornerotherreason, 1 d cribed 60 is well-known lexical analysis is separated from syntactic analysis, and the lexical scan can to ategories deal with look ahead problems at this level of analysis as a minor addition "_" .r,o its other tasks. It is also the case that construction of other than trivial program- starting from. Z the . min lang" grammars is extremely error prone process almost invariably .ro uce d ivation S 2^ an resulting in grammars which are syntactically ambiguous. Assuming that lexical problems have been treated separately, it appears that programming language _ a grammars largely meet the constraints which allow the use of theLR(\ ) algorithm ; strategy resultsi. inir. a o ■> , _ , . , * . from the constraints. are slight.. that the changes to a grammar to accomodate deviations more algorithm top Amending a grammar to enable the use of the restrictive SLR(\) h The worst . , - j. advocated in this paper is not always necessary (see Section 6.1) and at L„ x„_, is a small additional burden which can be treated in conduction with the problem i.e. the bottom+ 968 ....eliminating . ambiguities... ... down strategy leads "< genume ; " 14 T. Anderson et al. : , too The state tables of the LR{k) algorithm, evenfor k= 1 are large for use if Q — (vs> y p directly in a compiler or compiler writing system; e.g. a program implementing (written a =>/5) if thei theLR(\) algorithm was applied to an Algol 60 grammar, therun was terminated with the insertion of the 10000th state-input entry into the state table, at which X point 1 237 states had been created (see also Korenjak, 1969). The work reported Also a derives B ( here addresses the problems raised by Knuth, of developing algorithms that that accept LR(k) grammars or special classes of them and of mechanically producing efficient parsing programs. If co0, v>x , ..., &.„<_ producti 2. Definitions and Notation The p-th The p-th producti Let 0 denote the empty set and A denote the null string. recursive. If X and V are sets of strings and xy denotes the concatenation of x and y, A nonterminal sy: then XV ={xy\ xdX and y € V] If 4eFv and a 6 V If X° ={A} then sets X{ are defined by first (<x)={; +1 =X{ X for X' t^O follow (A) ={: and X* is defined by X* = x\ ('—vo V In this section algc A context free grammar(with an endmarker) is a quadrupleG = (VN, T, P, S J_ ), where form of state tables. T discussion presumes fa " V is a finite set of nonterminal symbols, N algorithms derived fro VT is a finite set of terminal symbols, In Knuth's algoritl is defined as a set VN nVT = Q and V denotes VNu VT, of o a state corresponds to start or nonterminal of the grammar, Se VN is the symbol principal a point consistent with J_ft is the endmarker, V (a) production nun SJ_ is the start string, sentence scanned thus (b) if P is a finite subset of VN X V* called the set of productions.