A Structured Language Model

Ciprian Chelba
The Johns Hopkins University
CLSP, Barton Hall 320
3400 N. Charles Street, Baltimore, MD 21218
chelba@jhu.edu

Abstract

The paper presents a language model that develops syntactic structure and uses it to extract meaningful information from the word history, thus enabling the use of long distance dependencies. The model assigns a probability to every joint sequence of words-binary-parse-structure with headword annotation. The model, its probabilistic parametrization, and a set of experiments meant to evaluate its predictive power are presented.

Figure 1: Partial parse

Figure 2: A word-parse k-prefix

1 Introduction

The main goal of the proposed project is to develop a language model (LM) that uses syntactic structure. The principles that guided this proposal were:
• the model will develop syntactic knowledge as a built-in feature; it will assign a probability to every joint sequence of words-binary-parse-structure;
• the model should operate in a left-to-right manner so that it would be possible to decode word lattices provided by an automatic speech recognizer.
The model consists of two modules: a next-word predictor which makes use of syntactic structure as developed by a parser. The operations of these two modules are intertwined.

2 The Basic Idea and Terminology

Consider predicting the word barked in the sentence:

the dog I heard yesterday barked again.

A 3-gram approach would predict barked from (heard, yesterday) whereas it is clear that the predictor should use the word dog, which is outside the reach of even 4-grams. Our assumption is that what enables us to make a good prediction of barked is the syntactic structure in the past. The correct partial parse of the word history when predicting barked is shown in Figure 1. The word dog is called the headword of the constituent (the (dog (...))) and dog is an exposed headword when predicting barked -- the topmost headword in the largest constituent that contains it. The syntactic structure in the past filters out irrelevant words and points to the important ones, thus enabling the use of long distance information when predicting the next word.

Our model will assign a probability P(W, T) to every sentence W with every possible binary branching parse T and every possible headword annotation for every constituent of T. Let W be a sentence of length l words to which we have prepended <s> and appended </s> so that w_0 = <s> and w_{l+1} = </s>. Let W_k be the word k-prefix w_0 ... w_k of the sentence and W_k T_k the word-parse k-prefix. To stress this point, a word-parse k-prefix contains only those binary trees whose span is completely included in the word k-prefix, excluding w_0 = <s>. Single words can be regarded as root-only trees. Figure 2 shows a word-parse k-prefix; h_0 .. h_{-m} are the exposed headwords. A complete parse -- Figure 3 -- is any binary parse of the w_1 ... w_l </s> sequence with the restriction that </s> is the only allowed headword.
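To make the terminology concrete, here is a small hypothetical sketch (the Tree class and exposed_headwords function are illustrative names, not part of the model definition) that represents a word-parse k-prefix as a list of headword-annotated binary trees and reads off its exposed headwords:

class Tree:
    # A constituent: a binary tree whose headword is percolated up from one child.
    def __init__(self, headword, left=None, right=None):
        self.headword = headword          # headword of this constituent
        self.left = left                  # None for a single word (root-only tree)
        self.right = right

def exposed_headwords(prefix_trees):
    # Exposed headwords h_0, h_{-1}, ... of a word-parse k-prefix:
    # the headwords of its topmost trees, rightmost (most recent) first.
    return [t.headword for t in reversed(prefix_trees)]

# A hand-built partial parse of "the dog I heard yesterday" (cf. Figure 1):
rel = Tree("heard", Tree("I"), Tree("heard", Tree("heard"), Tree("yesterday")))
np  = Tree("dog", Tree("the"), Tree("dog", Tree("dog"), rel))
print(exposed_headwords([np]))            # ['dog'] -- the context for predicting "barked"

A 3-gram model would instead condition on (heard, yesterday); the single exposed headword dog is what carries the relevant long-distance information here.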

Figure 3: Complete parse

Figure 4: Before an adjoin operation

Figure 5: Result of adjoin-left

Note that (w_1 ... w_l) needn't be a constituent, but for the parses where it is, there is no restriction on which of its words is the headword.

The model will operate by means of two modules:

• PREDICTOR predicts the next word w_{k+1} given the word-parse k-prefix and then passes control to the PARSER;
• PARSER grows the already existing binary branching structure by repeatedly generating the transitions adjoin-left or adjoin-right until it passes control to the PREDICTOR by taking a null transition.

Figure 6: Result of adjoin-right

The operations performed by the PARSER ensure that all possible binary branching parses with all possible headword assignments for the w_1 ... w_k word sequence can be generated. They are illustrated by Figures 4-6. The following algorithm describes how the model generates a word sequence with a complete parse (see Figures 3-6 for notation):

Transition t; // a PARSER transition
generate <s>;
do{
    predict next_word;                        //PREDICTOR
    do{                                       //PARSER
        if(T_{-1} != <s>)
            if(h_0 == </s>) t = adjoin-right;
            else t = {adjoin-{left,right}, null};
        else t = null;
    }while(t != null)
}while(!(h_0 == </s> && T_{-1} == <s>))
t = adjoin-right; // adjoin <s>; DONE

It is easy to see that any given word sequence with a possible parse and headword annotation is generated by a unique sequence of model actions.

3 Probabilistic Model

The probability P(W, T) can be broken into:

P(W, T) = \prod_{k=1}^{l+1} [ P(w_k / W_{k-1}T_{k-1}) \prod_{i=1}^{N_k} P(t_i^k / w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k) ]

where:
• W_{k-1}T_{k-1} is the word-parse (k-1)-prefix;
• w_k is the word predicted by PREDICTOR;
• N_k - 1 is the number of adjoin operations the PARSER executes before passing control to the PREDICTOR (the N_k-th operation at position k is the null transition); N_k is a function of T;
• t_i^k denotes the i-th PARSER operation carried out at position k in the word string; t_i^k ∈ {adjoin-left, adjoin-right}, i < N_k; t_i^k = null, i = N_k.

Our model is based on two probabilities:

P(w_k / W_{k-1}T_{k-1})                                        (1)
P(t_i^k / w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k)            (2)

As can be seen, (w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k) is one of the N_k word-parse k-prefixes of W_k T_k, i = 1 .. N_k, at position k in the sentence.

To ensure a proper probabilistic model we have to make sure that (1) and (2) are well defined conditional probabilities and that the model halts with probability one. A few provisions need to be taken:
• P(null / W_k T_k) = 1, if T_{-1} == <s>, ensures that <s> is adjoined in the last step of the process;
• P(adjoin-right / W_k T_k) = 1, if h_0 == </s>, ensures that the headword of a complete parse is </s>;
• ∃ ε > 0 s.t. P(w_k = </s> / W_{k-1}T_{k-1}) >= ε, for all W_{k-1}T_{k-1}, ensures that the model halts with probability one.

3.1 The first model

The first term (1) can be reduced to an n-gram LM, P(w_k / W_{k-1}T_{k-1}) = P(w_k / w_{k-1} ... w_{k-n+1}).

A simple alternative to this degenerate approach would be to build a model which predicts the next word based on the preceding p-1 exposed headwords and n-1 words in the history, thus making the following equivalence classification:

[W_k T_k] = {h_0 .. h_{-p+2}, w_{k-1} .. w_{k-n+1}}.

The approach is similar to the trigger LM (Lau93), the difference being that in the present work triggers are identified using the syntactic structure.
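As a small, hypothetical illustration of the equivalence classification just introduced (p = 2 and n = 3 are assumed values and context is an illustrative name, not part of the paper), the conditioning information for model (1) can be read off a word-parse k-prefix as follows:

def context(words, exposed_heads, p=2, n=3):
    # [W_k T_k] for model (1): the last p-1 exposed headwords plus the last n-1 words.
    # words: w_1 .. w_{k-1}; exposed_heads: h_{-m} .. h_0, listed left to right.
    heads = exposed_heads[::-1][:p - 1]             # h_0 .. h_{-p+2}
    return tuple(heads) + tuple(words[-(n - 1):])   # plus w_{k-n+1} .. w_{k-1}

# For the example of Section 2, the prediction of "barked" would be conditioned on:
print(context(["the", "dog", "I", "heard", "yesterday"], ["dog"]))
# ('dog', 'heard', 'yesterday') -- the exposed headword plus the last two words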

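Before turning to model (2), the following Python sketch shows how the two models are chained by the generation algorithm given earlier; it is an illustration only, with the probability models replaced by caller-supplied stubs (predict_word and choose_transition are assumed names) and trees represented as plain (headword, left, right) tuples:

import random

def adjoin_left(stack):
    # Join the two rightmost exposed trees; the LEFT child's headword percolates up.
    right = stack.pop()
    left = stack.pop()
    stack.append((left[0], left, right))      # a tree is (headword, left, right)

def adjoin_right(stack):
    # Join the two rightmost exposed trees; the RIGHT child's headword percolates up.
    right = stack.pop()
    left = stack.pop()
    stack.append((right[0], left, right))

def generate(predict_word, choose_transition):
    # Mirrors the pseudocode above; stack holds the exposed trees, leftmost first.
    stack = [("<s>",)]                        # w_0 = <s> as a root-only tree
    while True:
        stack.append((predict_word(stack),))  # PREDICTOR, model (1)
        while True:                           # PARSER, model (2)
            if len(stack) > 2:                # T_{-1} != <s>
                t = "adjoin-right" if stack[-1][0] == "</s>" else choose_transition(stack)
            else:
                t = "null"
            if t == "null":
                break
            (adjoin_left if t == "adjoin-left" else adjoin_right)(stack)
        if stack[-1][0] == "</s>" and len(stack) == 2:   # h_0 == </s> and T_{-1} == <s>
            adjoin_right(stack)               # adjoin <s>; DONE
            return stack[0]                   # the complete parse T

# Purely illustrative driver: a fixed word sequence and random parser moves.
words = iter("the dog barked </s>".split())
tree = generate(lambda s: next(words),
                lambda s: random.choice(["adjoin-left", "adjoin-right", "null"]))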
3.2 The second model

Model (2) assigns probability to different binary parses of the word k-prefix by chaining the elementary operations described above. The workings of the PARSER are very similar to those of Spatter (Jelinek94). It can be brought to the full power of Spatter by changing the action of the adjoin operation so that it takes into account the terminal/nonterminal labels of the constituent proposed by adjoin and it also predicts the nonterminal label of the newly created constituent; PREDICTOR will now predict the next word along with its POS tag. The best equivalence classification of the W_k T_k word-parse k-prefix is yet to be determined. The Collins parser (Collins96) shows that dependency-grammar-like constraints may be the most adequate, so the equivalence classification [W_k T_k] should contain at least {h_0, h_{-1}}.

4 Preliminary Experiments

Assuming that the correct partial parse is a function of the word prefix, it makes sense to compare the word level perplexity (PP) of a standard n-gram LM with that of the P(w_k / W_{k-1}T_{k-1}) model. We developed and evaluated four LMs:
• 2 bigram LMs P(w_k / W_{k-1}T_{k-1}) = P(w_k / w_{k-1}), referred to as W and w, respectively; w_{k-1} is the previous (word, POStag) pair;
• 2 P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0) models, referred to as H and h, respectively; h_0 is the previous exposed (headword, POS/non-term tag) pair; the parses used in this model were those assigned manually in the Penn Treebank (Marcus95) after undergoing headword percolation and binarization.
All four LMs predict a word w_k and they were implemented using the Maximum Entropy Modeling Toolkit^1 (Ristad97). The constraint templates in the {W,H} models were:

4 <= <*>_<*> <?>; 2 <= <?>_<*> <?>;
2 <= <*>_<?> <?>; 8 <= <?>_<?> <?>;

and in the {w,h} models they were:

4 <= <*>_<*> <?>; 2 <= <?>_<*> <?>;

<*> denotes a don't care position, <?>_<?> a (word, tag) pair; for example, 4 <= <?>_<*> <?> will trigger on all ((word, any tag), predicted-word) pairs that occur more than 3 times in the training data. The sentence boundary is not included in the PP calculation. Table 1 shows the PP results along with the number of parameters for each of the 4 models described.

Table 1: Perplexity results

LM   PP    param     LM   PP    param
H    312   206540    h    410   102437
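The maximum entropy models themselves are not reproduced here; the following counts-based sketch (hypothetical code with no smoothing; train and perplexity are illustrative names) is only meant to show that the four LMs share the same prediction task and perplexity measure and differ solely in the equivalence classification of the history:

import math
from collections import Counter

def train(events):
    # events: (context, word) pairs; the context is the previous word (or word/tag
    # pair) for the W/w models and the exposed headword h_0 for the H/h models.
    ctx, joint = Counter(c for c, _ in events), Counter(events)
    return lambda c, w: joint[(c, w)] / ctx[c] if ctx[c] else 0.0

def perplexity(model, events):
    # Word-level perplexity; assumes model(c, w) > 0 for every test event.
    return 2 ** (-sum(math.log2(model(c, w)) for c, w in events) / len(events))

# The same event, classified two ways:
w_event = ("heard", "barked")   # previous-word context (W/w), cf. the 3-gram discussion
h_event = ("dog", "barked")     # exposed-headword context (H/h)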
5 Acknowledgements

The author thanks Frederick Jelinek, Sanjeev Khudanpur, Eric Ristad and all the other members of the Dependency Modeling Group (Stolcke97), WS96 DoD Workshop at the Johns Hopkins University.

References

Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 184-191, Santa Cruz, CA.

Frederick Jelinek. 1997. Information extraction from speech and text -- course notes. The Johns Hopkins University, Baltimore, MD.

Frederick Jelinek, John Lafferty, David M. Magerman, Robert Mercer, Adwait Ratnaparkhi, Salim Roukos. 1994. Decision Tree Parsing using a Hidden Derivational Model. In Proceedings of the Human Language Technology Workshop, 272-277. ARPA.

Raymond Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Trigger-based language models: a maximum entropy approach. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, volume 2, 45-48, Minneapolis.

Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz. 1995. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Eric Sven Ristad. 1997. Maximum entropy modeling toolkit. Technical report, Department of Computer Science, Princeton University, Princeton, NJ, January 1997, v. 1.4 Beta.

Andreas Stolcke, Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Sven Ristad, Roni Rosenfeld, Dekai Wu. 1997. Structure and Performance of a Dependency Language Model. In Proceedings of Eurospeech'97, Rhodes, Greece. To appear.

^1 ftp://ftp.cs.princeton.edu/pub/packages/memt
