Non-Deterministic Recursive Ascent Parsing
Total Page:16
File Type:pdf, Size:1020Kb
Non-deterministic Recursive Ascent Parsing Ren~ Leermakers Philips Research Laboratories, P.O. Box 80.000, 5600 JA Eindhoven, The Netherlands E-mail:[email protected] ABSTRACT this reason. Our presentation, however, contains a sur- prisingly simple correctness proof. In fact, this proof is A purely functional implementation of LR-parsers is this paper's major contribution to parsing theory. One given, together with a simple correctness proof. It is of its lessons is that the CF grammar class is often the presented as a generalization of the recursive descent natural one to proof parsers for, even if these parsers are parser. For non-LR grammars the time-complexity of devoted to some special class of grammars. If the gram- our parser is cubic if the functions that constitute the marlis restricted in some way, a parser for general CF parser are implemented as memo-functions, i.e. func- grammars may have properties that enable smart imple- tions that memorize the results of previous invocations. mentation tricks to enhance efficiency. As we show be- Memo-functions also facilitate a simple way to construct low, the relation between LR-parsers and LR-grammars a very compact representation of the parse forest. For is of this kind. LR(0) grammars, our algorithm is closely related to the Especially in natural language processing, standard recursive ascent parsers recently discovered by Kruse- CF grammars are often too limited in their strong gen- man Aretz [1] and Roberts [2]. Extended CF grammars erative power. The extended CF grammar formalism, (grammars with regular expressions at the right hand allowing rules to have regular expressions at the right side) can be parsed with a simple modification of the hand side, is a useful extension, for that reason. It is not LR-parser for normal CF grammars. difficult to generalize our parser to cope with extended grammars, although the application of LR-parsing to extended CF grammars is well-known to be problematic 1 Introduction [5]. We first present the recursive descent recognizer in In this paper we give a purely functional implementa- a way that allows the desired generalization. Then we tion of LR-parsers, applicable to general CF grammars. obtain the recursive ascent recognizer and its proof. If It will be obtained as a generalization of the well-known the grammar is LR(0) a few implementation tricks lead recursive descent parsing technique. For LR(0) gram- to the recursive ascent recognizer of ref. [1]. Subse- mars, our result implies a deterministic parser that is quently, the time and space complexities of the recog- closely related to the recursive ascent parsers discovered nizer are analysed, and the algorithm for constructing by Kruseman Aretz [1] and Roberts [2]. In the gen- a cubic representation for parse forests is given. The eral non-deterministic case, the parser has cubic time paper ends with a discussion of extended CF grammars. complexity if the parse functions are implemented as memo-functions [3], which are functions that memorize and re-use the results of previous invocations. Memo- 2 Recursive descent functions are easily implemented in most programming languages. The notion of memo-functions is also used Consider CF grammar G, with terminals VT and non- to define an algorithm that constructs a cubic represen- terminals V/v. Let V = VN U VT. A well-known top- tation for the parse forest, i.e. the collection of parse down parsing technique is the recursive descent parser. trees. Recursive descent parsers consist of a number of pro- It has been claimed by Tomita that non-deterministic cedures, usually one for each non-terminal. Here we LR-parsers are useful for natural language processing. present a variant that consists of functions, one for each In [4] he presented a discussion about how to do non- item (dotted rule). We use the unorthodox embracing deterministic LR-parsing, with a device called a graph- operator [.] to map each item to its function: (we use structured stack. With our parser we show that no ex- greek letters for arbitrary elements of V*) plicit stack manipulations are needed; they can be ex- pressed implicitly with the use of appropriate program- [A -~ a.~] : N --. 2 N ming language concepts. where N is the set of integers, or a subset (0...nm~x), Most textbooks on parsing do not include proper with nma= the maximum seutence length. The functions correctness proofs for LR-parsers, mainly because such are to meet the following specification: proofs tend to be rather involved. The theory of LR- parsing should still be considered underdeveloped, for [A --, a. l(0 = {Jl -*" - 63 - with x~...xn the sentence to be parsed. A recursive im- with B E V, and 1/31 the number of symbols in 3 (with plementation for these functions is given by (b • VT, B • H = 0). A recursive ascent recognizer may be obtained v,,) by relating to each state q not only the above [q], but [A --* a.](i) = {i} also a function__ that we take to be the result of applying operator [.] to the state: [a --* a.b-r](i) = {jib = zi+, ^ j E [A ~ ab.-r](i + 1)} [q] : V x N --* 2 I° xN [A ---, a.B-r](i) = {Jl~ • [B ~ .~](i)^j • [A -; ~a.-r](~)} It has the specification We keep to the custom of omitting existential quantifi- [q](B,i) = {(A --* a.3,j)lA --* a.3 e qA cation (here for k,/f) in definitions of this kind. 3 =~* B-rAT ---,* xi+a...xj} The proof is elementary and ba#ed on For i >.nn (n is the sentence length) it follows that [q](i) = [q](B,i) = $, whereas for i _< n the functions 3~(3 = x~+a-r A -r ~* zi+~...~:s)V are recursively implemented by 3B-~$k(3 = B-r ^ B --~ 8 A 8 --;* zi+a...x~^ [q](i) = [q](x,+l, i + 1)u -r --~* 2;k+l...2~j) {(1, j)IB --* .e e ini(q)A (l, j) E-~(B, i)}U If we add a grammar rule S' --* S to G, with S' ([ V {(l,i)lI • q ^ final(l)} then S --** x~...xn is equivalent to n • [S' --* .S](0). [q](B, i) = { (pop(l), J)l The recursive descent recognizer works for any CF (1,j) • [ooto(q, B)](i)^ pop(l) • q}U grammar except for grammars for which ~A~(A ---* {(I,4)1(J, k) • [goto(q, B)I~^ aAcr --** A3). For such left-recursive grammars the rec- pop(J) • ini(q) ^ (1, j) • [q](lhs(S), k)} ognizer does not terminate, as execution of [A --* .a](i) Proof: will lead to a call of itself. The recognition is not a linear First we notice that process in general: the function calls [A --- a.B3"](i) lead to calls [B --* ./i](i) for all values of ~ such that B ---, /8 "** xi+l..-xj is a grammar rule. 3~(3 ~* zi+l't ^ 7 ~" z,+2...zj)v 3B~(3 ~" B-r ^ B ~ c ^ -y --." z,+~...zj)v (~=~^i=j) 3 The ascent recognizer Hence One way to make the recognizer more deterministic is by [q](i) = combining functions corresponding to a number of com- {(A --* a.3,J)l(A --* a.3, j) • r~(z,+~, i+ 1)}u peting items into one function. Let the set of all items {(A --, ,~.3, J)l of G be given by In. Subsets of I6; are called states, and B -.-. eA(A --, a.3,j) • [q](B,i)}u we use q to be an arbitrary state, lWe associate to each {(A --~ a.,i)la --* a. • q} state q a function, re-using the above operator [.], This is equivalent to the earlier version because we may [q] : N ~ 2 I°×N replace the clause B ~ e by B ---, .e • ini(q). Indeed, if state q has item A --* a.fl and if there is a left-most- that meets the specification symbol derivation/3 =~* B-r then all items B --* .A are [q](i) ---- {(A -- a.3,j)l A --. a.3 • q ^ 3 --*" zi+~...xi} included in ini(q). As above, the function reports which parts of the sen- For establishing the correctness of [q--] notice that tence can be derived. But as the function is associated 3 ~* B3" either contains zero steps, in which case to a set q of items, it has to do so for each item in 3 = B'r, or it contains at least one step: q. If we define the initial state q0 = {S' --* .S}, now 3.y(3 =~* B3" A 3' --*" xi+a ...zs) = S --," xl...xn is equivalent to (S' ---* .S,n) • [q0](0). 3~(3 = B-r ^ -r --" xi+l...zj)V Before proceeding, we need a couple of definitions. 3ce.~k(~8 :=~* C-rAG --* B~S A~5 -.*" xi+ l...x~,A -r -'** Let ini(q) be the set of initial items for state q, that xk+l ...x j) are derived from q by the closure operation: Hence [q](B, i) may be written as the union of two sets, ini(q) = { B --* .AIB -. A ^ A --* a.3 • q A 3 =¢ B-r}. [q](B, i) = So USa: The double arrow =¢, denotes a left-most-symbol rewrit- So = {(A --~ a.B3",j)] ing Ba =e~ Cfla, using a non-e rule B ---, Cfl. The A ---. ct.B3" • qA-r ---** xs+l...xj} transition function goto is defined by (B • V) S~ = {(a --. a.3,j)lA --* a.3 • q ^ 3 =~" C-r^ goto(q, B) = {A -* aB.3]A --* a.B3 • (q U ini(q))} C ---* B~ ^ $ --** zi+l...xk ^ 3' --*" zk+l...zi}.