An ENcient Parser Gener;ttor fl)r Nat;rea,1 Language

Masayuki ISItlI* l(azuhisa OHTA Iliro~dd SAITO Fujitsu Inc. Apple Technology, Inc. Keio University masayuki~.nak.math.keio.ac.j p k-ohta@kol)o.apple.coln hxs~nak.math.keio.ac.jp

Abstract splitting the parse stack fin' each action. The parser merges the divided sta.ck br;tnches, only We. have developed a parser generator for natu- when they have the same top state. This merger ral language i)rocessing. The generator named operation is important for efficiency. As a re- "NLyace" accepts grammar rules written in the suit, the stacl( becomes a. gra.plt instead of a format. NLyacc, unlike Yacc, can handle simph,, linear state sequence. arbitrary context-free grammars using the gen- There is already a generalized LR parser eralized Lll. Mgorithm. The parser pro- for natural language processing developed at duced by NLyacc elliciently parses given sen- Carnegie Mellon University [3]. Nl,yacc diflhrs tences and executes semantic actions. NLyacc, fi'om CMU's system in the following points. which is a free and sharable software, runs on UNIX workstations and personal computers. • NLyacc is written in C, while CMU's in 1 Parser Generator for NLP Lisp.

Yacc[4] was designed for unambiguous progl'anl- ming languages. Thus, Yacc cat) not elegantly • CMU's cannot handh', c rules, while NI,y- handle a script language with a natural lan- ace does. c rules are handful for writing guage flavor, i.e. Yacc forces a grammar writer natural rules. to use tricks for handling ambiguities. To rem- edy this situation we have developed Nl,yacc which can handle arbitrary context-fi'ee gr;tnl- The way to execute semantic actions dif- mars t and allows a grammar writer to write fers. CMU's evaluates an Ll?(]-like se- natural rules and semantic actions. Although mantic action attached to each rule when there are several parsing algorithms for a gen- reduce action is performed on that rule. eral context-fi'ee language, such as ATN, CYI(, NLyacc executes a semantic action in two and garley, "the generalized Eli. parsing algo- levels; one is perfin'med during parsing rithm [2]" would be the best in terms of its for syntactic control and the. other is per- compatibility with Yacc and its efficiency. formed onto each successfifl final p;~rse. We An causes a conflict in will desc.ribe the details of NLyacc's ap- the parsing table, a state which has more than proach in the next section. one action in an entry. The. generalized LR parsing proceeds exactly tit(.' same way as the stm~dard one except when it encounters a con- NLyacc is ,,pper-compatible to Yacc. NLy- flict. The standard deterministic LR parser acc consists of three modules; a reader, a pars- chooses only one action in this situation. The ing table constructor, and a drive routine for generalized I,R parser, on the other hand, per- the gene.ralized LR parsing. The reader accepts forms all the actions in the multiple entry by grammar ruh;s in the Yacc format. The table constructor produces a generalized LR. parsing *This work was done while lshil stayed at l)ept, of Computer Science, Keio University, Japan. t;tble instead of the standard I,R. parsing table. 1To be exact, NLyacc ca,t not handle ;t circular rule We describe the de.tails of the parser in the next like "A --+ A". sectiou.

417 2 Execution of Semantic Ac- above during parsing. After parsing is done, tions the semantic action evMuator performs the task as it traces all the history paths one by one. NLyacc differs from Yacc mainly in the exe- This approach retains parsing efficiency and can cution process of semantic actions attached to avoid the execution of useless semantic actions. each grammar rule. Namely, Yacc evaluates a A drawback of this approach is that semantic semantic action a.q it parses the input. We ex- actions can not control the syntactic parsing, amine if this evaluation mechanism is suitable because actions are not evaluated until tile pars- for the generalized LR. parsing here. If we can ing is clone. To compensate the cons above, we assume that there is only one final parse, the have introduced a new semantic action enclosed parser can ewtluate semantic actions when only with [ ] to enable a user to discard semantically one branch exists on top of the stack. Although incorrect parses in the middle of parsing. having only one final parse is often the cruse in Namely, there are two types of semantic ac- practical applications, the constraint of being tions: unambiguous is too strong in generM. An action enclosed with [ ] is executed during parsing .just as done in Yacc. If 2.1 Handling Side Effects 'return 0;' is execute as the history, where < s > is a list of only one element s. 2.2 Two Kinds of Semantic Actions 2) If the next action is "reduce r : A -+ BIB2 We, therefore, take another approach to han- "".11~", then make append(lh, lI2, ..., IIn, l-r]) dling semantic actions in NLyacc. Namely, the as the history, where Hi is a history of Bi, r parser just keeps a list of actions to be exe- is the rule number of production "A -+ 1~1132 cuted, and performs all the actions after pars- • "1],/', an

418 Now we describe how to execute semantic ac- [3] M. Tomita and J. G. Carbonell. The universal tions using the parse history. First, before start- parser architecture for knowledge-based machine ing to parse, the parser ca.lls "yyinit" function translation. In Proceedings, lOlh hdcvaational to initialize wtriables in the semantic actions. Joint Um~ference on Arlificial IMelligence (IJ- CAI), Milan, A,gust 1987. Our system requires the. user to define "yyinit" to set initial values to the variables. Next, the [J] yacc - yet another compiler-compiler: l)arsing parser starts parsing and l)erforms a shift ac- l)rogram generator, in UNLV manual. tion or a reduce action according to the parse history and evaluates the apl)ropriate semantic Appendix - Sample Runs - actions. A sa,mple grammar helow covers a sm~fll set of 2.4 Efficient Memory Management l'~nglish sentences. The. parser I)ro(h:,ees syntac- tic trees ofagiven sentence. Agreement check We use a list structure to implement the. parse is done by the semantic actions. stack, because the stack becomes a complex grN)h structure as described l)reviously. Be- /* grml~ar.y */ cause the parser discards fa.iled branches of the %{ stack, the system rechfims the memory allo- #include cated for the discarded parses using the "mark #include and sweep garhage collection algorithm [1]" to #include "gran~ar,h" #include "proto.h" use memory efficiently. 'Phis garl)age collection %} is triggered only when the memory is exhausted in our current implementation. %token NOUN VERB DET PREP

%% 3 Distribution SS : S { pr_tree($1); }

Portability S : NP VP [ return checkl($1, $2); ] Currently, NLyacc runs on UNIX worksta.- { $$ = mk_tree2("S", $1, $2); } tions and DOS personal computers. S : S PP { $5 = mk_tree2("S", $1, $2); } Debugging Grammars For grammar debugging, NLyacc provides NP : NOUN [ $$ : $1; ] l)arse trace information such as a history of { $$ : mk treel("NP", $I); } shift/reduce actions, execution information of '[] actions.' NP : DET NOUN [ $$ = $2; return check2($1, $2);] When NLya.cc encounters an error state, { 55 = mk tree2("NP", $I, $2); } "yyerror" function is called just a.s in Yacc. NP : NP PP [ 55 = $1; ] Distribution { $$ = mk_tree2("NP", $1, 52); } NLyacc is distributed through e-mail (ple:tse PP : PREP contact nlyacc~nak.math.keio.ac.jp). I)is- NP { $$ = mk_tree2("PP", $1, 52); } tribution package includes all the source codes, VP : VERB NP [ $5 = $1; ] a manual, and some sample grammars. { $$ = mk_%ree2("VP", $1, $2); } %% FILE* yyin; References extern int yydebug; [1] J. McCarthy. Recursive flmctions of symbolic expressions and their computation by machine, int main(argc, argv) part 1. Communications of the A CM, 3(4), April int argc; 1960. char *argv[]; { [2] M. Tomita. EJficieut Parsing for Nalural Lan- int result; guage. Kluwer Academic P.blishers, l~oston, MA, 1985. yydebug = I;

419 yyin = stdin; } _pair; read_dictionary("dict"); } contents; yyinitialize_heap(); } NODE, *NODEPTR; result = yyparse(); printf("Result = Zd\n", result); #define leaf contents._leaf yyfree_heap(); #define pos contents._pair._pos return O; #define left contents._pair, left #define right contents, pair._right

void yyinit() typedef WORD SEM, *SEMPTR; {} #define YYSTYPE NODEPTR int yyerror(message) char* message; #define YYSEMTYPE SEMPTR { fprintf(stderr, "%s\n", message); /* dict */ exit(l); I:NOUN:OI } You:NOUN:22 you:NOUN:22 int checkl(seml, sem2) He:NOUN:04 SEMPTR seml, sem2; he:NOUN:04 { She:NOUN:04 return (seml->seigen & sem2->seigen); she:NOUN:04 } It:NOUN:04 it:NOUN:04 int check2(seml, sem2) We:NOUN:IO SEMPTR seml, sem2; we:NOUN:IO { They:NOUN:40 return (seml->seigen & sem2->seigen); they:NOUN:40 } see:VERB:73 sees:VERB:04 /* grammar.h */ a:DET:07 #define SPELLING_SIZE 32 the:DET:77 #define HINSBI_SIZE 32 with:PREP:O0 #define BUFFER_SIZE 64 telescope:NOUN:07 man:NOUN:07 typedef struct word { Sample Runs struct word *next; # sentence no.l char *spelling; int hinshi; /* parts of speech */ }[e sees a man with a telescope "D int seigen; /* constraints */ # parse 1 S:(S:(NP:(NOUN:He) } WORD; VP:(VERB:sees NP:(DET:a NOUN:man))) typedef enum tag { PP:(PREP:with NP:(DET:a NOUN:telescope))) TLEAF, TNDDE # parse 2 } TAG; S:(NP:(NOUN:He) VP:(VERB:sees typedef struct node { NP:(NP:(DET:a NOUN:man) PP:(PREP:with TAG tag; NP:(DET:a NOUN:telescope))))) union { WORD* _lea~; # sentence no.2 struct { He see a man "D char *_pos; struct node *_left; # The semantic actions prune syntactically- struct node * right; # sound but semantically-incorrect parses.

420