An Efficient Parser Generator for Natural Language

An ENcient Parser Gener;ttor fl)r Nat;rea,1 Language Masayuki ISItlI* l(azuhisa OHTA Iliro~dd SAITO Fujitsu Inc. Apple Technology, Inc. Keio University masayuki~.nak.math.keio.ac.j p k-ohta@kol)o.apple.coln hxs~nak.math.keio.ac.jp Abstract splitting the parse stack fin' each action. The parser merges the divided sta.ck br;tnches, only We. have developed a parser generator for natu- when they have the same top state. This merger ral language i)rocessing. The generator named operation is important for efficiency. As a re- "NLyace" accepts grammar rules written in the suit, the stacl( becomes a. gra.plt instead of a Yacc format. NLyacc, unlike Yacc, can handle simph,, linear state sequence. arbitrary context-free grammars using the gen- There is already a generalized LR parser eralized Lll. parsing Mgorithm. The parser pro- for natural language processing developed at duced by NLyacc elliciently parses given sen- Carnegie Mellon University [3]. Nl,yacc diflhrs tences and executes semantic actions. NLyacc, fi'om CMU's system in the following points. which is a free and sharable software, runs on UNIX workstations and personal computers. • NLyacc is written in C, while CMU's in 1 Parser Generator for NLP Lisp. Yacc[4] was designed for unambiguous progl'anl- ming languages. Thus, Yacc cat) not elegantly • CMU's cannot handh', c rules, while NI,y- handle a script language with a natural lan- ace does. c rules are handful for writing guage flavor, i.e. Yacc forces a grammar writer natural rules. to use tricks for handling ambiguities. To rem- edy this situation we have developed Nl,yacc which can handle arbitrary context-fi'ee gr;tnl- The way to execute semantic actions dif- mars t and allows a grammar writer to write fers. CMU's evaluates an Ll?(]-like se- natural rules and semantic actions. Although mantic action attached to each rule when there are several parsing algorithms for a gen- reduce action is performed on that rule. eral context-fi'ee language, such as ATN, CYI(, NLyacc executes a semantic action in two and garley, "the generalized Eli. parsing algo- levels; one is perfin'med during parsing rithm [2]" would be the best in terms of its for syntactic control and the. other is per- compatibility with Yacc and its efficiency. formed onto each successfifl final p;~rse. We An ambiguous grammar causes a conflict in will desc.ribe the details of NLyacc's ap- the parsing table, a state which has more than proach in the next section. one action in an entry. The. generalized LR parsing proceeds exactly tit(.' same way as the stm~dard one except when it encounters a con- NLyacc is ,,pper-compatible to Yacc. NLy- flict. The standard deterministic LR parser acc consists of three modules; a reader, a pars- chooses only one action in this situation. The ing table constructor, and a drive routine for generalized I,R parser, on the other hand, per- the gene.ralized LR parsing. The reader accepts forms all the actions in the multiple entry by grammar ruh;s in the Yacc format. The table constructor produces a generalized LR. parsing *This work was done while lshil stayed at l)ept, of Computer Science, Keio University, Japan. t;tble instead of the standard I,R. parsing table. 1To be exact, NLyacc ca,t not handle ;t circular rule We describe the de.tails of the parser in the next like "A --+ A". sectiou. 417 2 Execution of Semantic Ac- above during parsing. After parsing is done, tions the semantic action evMuator performs the task as it traces all the history paths one by one. NLyacc differs from Yacc mainly in the exe- This approach retains parsing efficiency and can cution process of semantic actions attached to avoid the execution of useless semantic actions. each grammar rule. Namely, Yacc evaluates a A drawback of this approach is that semantic semantic action a.q it parses the input. We ex- actions can not control the syntactic parsing, amine if this evaluation mechanism is suitable because actions are not evaluated until tile pars- for the generalized LR. parsing here. If we can ing is clone. To compensate the cons above, we assume that there is only one final parse, the have introduced a new semantic action enclosed parser can ewtluate semantic actions when only with [ ] to enable a user to discard semantically one branch exists on top of the stack. Although incorrect parses in the middle of parsing. having only one final parse is often the cruse in Namely, there are two types of semantic ac- practical applications, the constraint of being tions: unambiguous is too strong in generM. An action enclosed with [ ] is executed during parsing .just as done in Yacc. If 2.1 Handling Side Effects 'return 0;' is execute<t in the action, the Next, we examine what would happen if seman- partial parse having invoked this action tic actions are executed during parsing. When fails and is disca.rded. a reduce action is performed, the parser evaluates the action attached to the current rule. * An action enclosed with { ) is executed al- As described in the previous section, the parse ter the syntactic parsing. stack grows in a graph form. Thus, when the action contains side effects like an assignment In the example below, the bracketed action operation to a variable shared by different ac- checks if the subtraction result is negative, and, tions, that side effect must not propagate to tile if true, discar<ts its partial parse. other paths in the graph. If an environment, which is a set of v,zdue of number : number '-' number variables, is prepared to each path of the parse [ $$ = $1-33; if(35 < 0) return 0; ] branches, such side effect can be encapsulated. { $$ = 31-33; print( ..... , 31, $3, $$); } When a stack splits, a copy of the environment should be created for each branch. When the 2.3 Keeping Parse History parse branches are merged, however, each environment can not be merged. Instead, the Our generalized Lll. parsing algorithm is differ- merged state must have all the environments. ent from tile original one [2] in that ore' algo- Thus, the number of environments grows expo- rithm keeps a history of parse actions to exe- nentially as parsing proceeds. Therefore this cute semantic actions after the syntactle pars- approach decreases the parsing e[Iiciency dras- ing. The original algorithm uses a packe<l forest tically. Also this high cost operation would be representation for the stack, whereas our algo- vain when the parse fails in the middle. To rithm uses a list representation. sum it up, although this approach retains com- The algorithm of keeping the parse history is patibility with Yacc, it sacrifices efficiency too shown as follows. much. 1) If the next action is "shift s", then make < s > as the history, where < s > is a list of only one element s. 2.2 Two Kinds of Semantic Actions 2) If the next action is "reduce r : A -+ BIB2 We, therefore, take another approach to han- "".11~", then make append(lh, lI2, ..., IIn, l-r]) dling semantic actions in NLyacc. Namely, the as the history, where Hi is a history of Bi, r parser just keeps a list of actions to be exe- is the rule number of production "A -+ 1~1132 cuted, and performs all the actions after pars- • "1],/', an<l the function 'append' concatenates ing is done. This method can avoid the problem multiple lists and returns the result. 418 Now we describe how to execute semantic ac- [3] M. Tomita and J. G. Carbonell. The universal tions using the parse history. First, before start- parser architecture for knowledge-based machine ing to parse, the parser ca.lls "yyinit" function translation. In Proceedings, lOlh hdcvaational to initialize wtriables in the semantic actions. Joint Um~ference on Arlificial IMelligence (IJ- CAI), Milan, A,gust 1987. Our system requires the. user to define "yyinit" to set initial values to the variables. Next, the [J] yacc - yet another compiler-compiler: l)arsing parser starts parsing and l)erforms a shift ac- l)rogram generator, in UNLV manual. tion or a reduce action according to the parse history and evaluates the apl)ropriate semantic Appendix - Sample Runs - actions. A sa,mple grammar helow covers a sm~fll set of 2.4 Efficient Memory Management l'~nglish sentences. The. parser I)ro(h:,ees syntactic trees ofagiven sentence. Agreement check We use a list structure to implement the. parse is done by the semantic actions. stack, because the stack becomes a complex grN)h structure as described l)reviously. Be- /* grml~ar.y */ cause the parser discards fa.iled branches of the %{ stack, the system rechfims the memory allo- #include <stdio.h> cated for the discarded parses using the "mark #include <stdlib.h> and sweep garhage collection algorithm [1]" to #include "gran~ar,h" #include "proto.h" use memory efficiently. 'Phis garl)age collection %} is triggered only when the memory is exhausted in our current implementation. %token NOUN VERB DET PREP %% 3 Distribution SS : S { pr_tree($1); } Portability S : NP VP [ return checkl($1, $2); ] Currently, NLyacc runs on UNIX worksta.- { $$ = mk_tree2("S", $1, $2); } tions and DOS personal computers. S : S PP { $5 = mk_tree2("S", $1, $2); } Debugging Grammars For grammar debugging, NLyacc provides NP : NOUN [ $$ : $1; ] l)arse trace information such as a history of { $$ : mk treel("NP", $I); } shift/reduce actions, execution information of '[] actions.' NP : DET NOUN [ $$ = $2; return check2($1, $2);] When NLya.cc encounters an error state, { 55 = mk tree2("NP", $I, $2); } "yyerror" function is called just a.s in Yacc.

Load more