ADAPTIVE PROBABILISTIC GENERALIZED LR PARSING

Jerry Wright*, Ave Wrigley* and Richard Sharman+

*Centre for Communications Research
Queen's Building, University Walk, Bristol BS8 1TR, U.K.

+I.B.M. United Kingdom Scientific Centre
Athelstan House, St Clement Street, Winchester SO23 9DR, U.K.

ABSTRACT

Various issues in the implementation of generalized LR parsing with probability are discussed. A method for preventing the generation of infinite numbers of states is described and the space requirements of the parsing tables are assessed for a substantial natural-language grammar. Because of a high degree of ambiguity in the grammar, there are many multiple entries and the tables are rather large. A new method for grammar adaptation is introduced which may help to reduce this problem. A probabilistic version of the Tomita parse forest is also described.

1. INTRODUCTION

The generalized LR parsing algorithm of Tomita (1986) allows most context-free grammars to be parsed with high efficiency. For applications in speech recognition (and perhaps elsewhere) there is a need for a systematic treatment of uncertainty in language modelling, pattern recognition and parsing. Probabilistic grammars are increasing in importance as language models (Sharman, 1989, Lari and Young, 1990), pattern recognition is guided by predictions of forthcoming words (one application of the recent algorithm of Jelinek (1990)), and the extension of the Tomita algorithm to probabilistic grammars (Wright et al, 1989, 1990) is one approach to the parsing problem. The successful application of the Viterbi beam-search algorithm to connected speech recognition (Lee, 1989), together with the possibility of building grammar-level modelling into this framework (Lee and Rabiner, 1989), is further evidence of this trend. The purpose of this paper is to consider some issues in the implementation of probabilistic generalized LR parsing.

The objectives of our current work on language modelling and parsing for speech recognition can be summarised as follows:
(1) real-time parsing without excessive space requirements,
(2) minimum restrictions on the grammar (ambiguity, null rules, left-recursion all permitted, no need to use a normal form),
(3) probabilistic predictions to be made available to the pattern matcher, with word or phoneme likelihoods received in return,
(4) interpretations, ranked by overall probability, to be made available to the user,
(5) adaptation of the language model and parser, with minimum delay and interaction with the user.

The choice of parser generator is relevant to objectives (1) and (3). All versions satisfy objective (2) but are initially susceptible to generating an infinite number of states for certain probabilistic grammars, and in the case of the canonical parser generator this can happen in two ways. A solution to this problem is described in section 2.

The need for probabilistic predictions of forthcoming words or phonemes (objective (3)) is best met by the canonical parser generator, because for all other versions the prior probability distribution can only be found by following up all possible reduce actions in a state, in advance of the next input (Wright et al, 1989, 1990). This consumes both time and space, but the size of the parsing tables produced by the canonical parser generator generally precludes their use. The space requirements for the various parser generators are assessed in section 3. The grammar used for this purpose was developed by I.B.M. from an Associated Press corpus of text (Sharman, 1989).

Objective (4) is met by the parse forest representation, which is a probabilistic version of that employed by Tomita (1986), incorporating sub-node sharing and local ambiguity packing. This is described in section 4.

The final issue (objective (5) and section 5) is crucial to the applicability of the whole approach. We regard a grammar as a probabilistic structured hierarchical model of language as used, not a prescriptive basis for correctness of that use. A relatively compact parsing table presumes a relatively compact grammar, which is therefore going to be inadequate to cope with the range of usage to which it is likely to be exposed. It is essential that the software be made adaptive, and our experimental version operates through the LR parser to synthesise new grammar rules, assess their plausibility, and make incremental changes to the LR parsing tables in order to add or delete rules in the grammar.

2. PARSING TABLE SPACE REQUIREMENTS

2.1 INFINITE SERIES OF STATES: FROM THE ITEM PROBABILITIES

When applied to a probabilistic grammar, the various versions of LR parser generator first produce a series of item sets in which a probability derived from the grammar is attached to each item, and then generate the action and goto tables in which each entry again has an attached probability, representing a frequency for that action, conditional upon the state (Wright et al, 1989, 1990). Sometimes these probabilities can cause a problem in state generation. For example, consider the following probabilistic grammar:

    S ➔ A, p1 | B, p2
    A ➔ cA, q1 | a, q2
    B ➔ cB, r1 | b, r2

where p1, p2 and so on (with p1 + p2 = 1) represent the probabilities of the respective rules. After receiving the terminal symbol c, the state with the (closed) item set shown in Table 1 is entered, with the first column of probabilities for each item. After receiving the terminal symbol c again, a state with the same item set is entered but with the second column of probabilities for each item, and these are different from the first unless q1 = r1. For the probabilistic parser these states must therefore be distinguished, and in fact this process continues to generate an (in principle) infinite sequence of states. Although it may sometimes be sufficient merely to truncate this series at some point, the number of additional states generated when all the "goto" steps have been exhausted can be very large.

Table 1: Item set with probabilities.

    Item       First probability        Second probability
    A ➔ c•A    p1q1/(p1q1+p2r1)         p1q1²/(p1q1²+p2r1²)
    B ➔ c•B    p2r1/(p1q1+p2r1)         p2r1²/(p1q1²+p2r1²)
    A ➔ •cA    p1q1²/(p1q1+p2r1)        p1q1³/(p1q1²+p2r1²)
    A ➔ •a     p1q1q2/(p1q1+p2r1)       p1q1²q2/(p1q1²+p2r1²)
    B ➔ •cB    p2r1²/(p1q1+p2r1)        p2r1³/(p1q1²+p2r1²)
    B ➔ •b     p2r1r2/(p1q1+p2r1)       p2r1²r2/(p1q1²+p2r1²)
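To see concretely why this series fails to terminate, the kernel probabilities after n occurrences of c can be computed directly. The following fragment is a minimal sketch (not the authors' implementation): it renormalises p1q1^n against p2r1^n and shows that every n yields a state with new probabilities whenever q1 differs from r1.

    def kernel_probabilities(p1, q1, p2, r1, n):
        """Normalised probabilities of the kernel items A -> c.A and
        B -> c.B after n occurrences of the terminal symbol c."""
        a, b = p1 * q1**n, p2 * r1**n
        return a / (a + b), b / (a + b)

    p1, p2 = 0.6, 0.4          # S -> A, p1 | B, p2
    q1, r1 = 0.7, 0.5          # A -> cA, q1 ; B -> cB, r1

    states = set()
    for n in range(1, 6):
        probs = kernel_probabilities(p1, q1, p2, r1, n)
        states.add(tuple(round(x, 12) for x in probs))
    print(len(states))         # 5: a distinct state for every n when q1 != r1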

Table 2: Separated item sets.

    A ➔ c•A    1          B ➔ c•B    1
    A ➔ •cA    q1         B ➔ •cB    r1
    A ➔ •a     q2         B ➔ •b     r2

We can avoid this problem by introducing a multiple-shift entry for the terminal symbol c, in the state from which the one just discussed is entered. Multiple entries in the action table are normally confined to cases of shift-reduce and reduce-reduce conflicts, but the purpose here is to force the stack to divide, with a probability p1q1/(p1q1+p2r1) attached to one branch and p2r1/(p1q1+p2r1) to the other. These then lead to separate states with item sets as shown in Table 2.

The prior probabilities of a and b are obtained by combining the two branches and take the same values as before. Further occurrences of the terminal symbol c simply cause the same state to be re-entered in each branch, and eventually an a or b eliminates one branch. If c is replaced by a nonterminal symbol C, the same procedure applies except that a goto multiple entry is required, and this in turn means that a probability has to be attached to each goto entry (this was not required in the original version of the probabilistic LR parser).
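The recovery of the prior predictions from the two branches is easy to check numerically. In this sketch (illustrative values; the branch weights are the formulas given above) the branch-weighted mixture of next-terminal predictions is exactly the distribution the unsplit state would have produced, and it sums to one.

    p1, p2 = 0.6, 0.4          # S -> A, p1 | B, p2
    q1, q2 = 0.7, 0.3          # A -> cA, q1 | a, q2
    r1, r2 = 0.5, 0.5          # B -> cB, r1 | b, r2

    d = p1*q1 + p2*r1
    branch_A = p1*q1 / d       # state holding the A-items (Table 2, left)
    branch_B = p2*r1 / d       # state holding the B-items (Table 2, right)

    # Combined prior over the next terminal symbol:
    prior = {
        "c": branch_A*q1 + branch_B*r1,
        "a": branch_A*q2,
        "b": branch_B*r2,
    }
    print(prior, sum(prior.values()))   # the probabilities sum to 1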

Suppose in general that a grammar has nonterminal and terminal vocabularies N and T respectively. Conditions for the occurrence of an infinite series of states can be summarised as follows: there occurs a state in the closure of which there arise either

(a) two distinct self-recursive nonterminal symbols (A and B, say) for which a nonempty string α ∈ (N ∪ T)+ exists such that

    A ⇒* α A β and B ⇒* α B γ

where β, γ ∈ (N ∪ T)*, or

(b) two or more mutually recursive nonterminal symbols for which a nonempty string α ∈ (N ∪ T)+ exists such that

    A ⇒* α B β and B ⇒* α A γ

where β, γ ∈ (N ∪ T)*, and in addition either

(i) a left-most derivation of one symbol from the other is possible: A ⇒* B δ where δ ∈ (N ∪ T)*,

or (ii) a self-recursive nonterminal (C, say), which may coincide with A or B, also arises such that A ⇒* C δ where δ ∈ (N ∪ T)*,

for the same α.

These conditions ensure that the α-successor of this state also contains items with A and B after the dot but with probabilities different from those for the earlier state, and moreover that this continues to generate an infinite series (different item probabilities do not always imply this). One way to prevent this series is to associate with each item in the lists (which form the states) an array of states for each nonterminal symbol, recording the state(s) in which that symbol occurred as the left-hand side of an item from which the current item is descended. These arrays can be created during the course of the LR "closure" function. Pairs of nonterminals satisfying the conditions above are then easily detected within the "goto" function, so that an appropriate multiple shift or goto entry can be automatically created for the last symbol of α. Only the items leading to the looping behaviour need to be separated by this means, and the number of additional states generated is small. Cases of three-way (or higher) mutual recursion with a common α are very rare.
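For the simplest case, in which each self-recursion is exhibited by a single rule A ➔ αAβ, condition (a) can be checked by scanning pairs of rules. The sketch below does only this; the implementation described above instead detects the pairs inside the "goto" function using the ancestor arrays, which also covers multi-step derivations and condition (b).

    def condition_a_pairs(rules):
        """rules: list of (lhs, rhs) with rhs a tuple of symbols.
        Report pairs of self-recursive nonterminals whose recursive
        rules share a nonempty prefix alpha."""
        recursive = [(lhs, rhs) for lhs, rhs in rules if lhs in rhs]
        found = []
        for i, (a, ra) in enumerate(recursive):
            for b, rb in recursive[i+1:]:
                if a == b:
                    continue
                k = 0                   # longest common prefix of the RHSs
                while k < min(len(ra), len(rb)) and ra[k] == rb[k]:
                    k += 1
                if k > 0 and a in ra[k:] and b in rb[k:]:
                    found.append((a, b, ra[:k]))
        return found

    rules = [("S", ("A",)), ("S", ("B",)),
             ("A", ("c", "A")), ("A", ("a",)),
             ("B", ("c", "B")), ("B", ("b",))]
    print(condition_a_pairs(rules))     # [('A', 'B', ('c',))]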

2.2 INFINITE SERIES OF STATES: FROM THE LOOKAHEAD DISTRIBUTION

For the probabilistic LALR parser generator the lookaheads consist of a set of terminal symbols, as in the case of non-probabilistic grammars. However, for the canonical parser generator there is a full probability distribution of lookaheads, and this creates a second potential source of looping behaviour. It is possible for item sets to have the same item probabilities but different lookahead distributions. Suppose that a state contains an item with a right-recursive nonterminal symbol (C, say) after the dot, with nothing following. If the state also contains another item with C after the dot followed by a non-null string, thus

    C ➔ α•C and C ➔ α•Cβ

where α, β ∈ (N ∪ T)+, then the new lookahead probability distribution computed for C will be a mixture of the old one and a distribution derived from β in the second item. The state automaton possesses a loop because of the right-recursion, and the lookahead distribution is different each time around, so that again an (in principle) infinite series is generated. The second item can arise within the same state if C is also left-recursive (the simplest example of this is the grammar S ➔ SS, p1 | a, p2), and this problem also arises for an item of the second kind on its own, if β is nonempty but nullable.

It is possible to break the loop by introducing multiple shift (or goto) actions as before, but the procedure is complicated by the presence of null rules and/or left-recursion. These can allow the distribution to change even when there is just a single item in the kernel. In the absence of this behaviour the state from which the one just discussed is entered can be treated with a multiple entry in order to prevent the lookaheads from mixing, the conditions which give rise to this problem being checked within the "goto" function. The prior probability calculations at run-time are correct.

This procedure has not been fully implemented at the present time, but it seems that this kind of looping behaviour is more common than that discussed in the previous section. The additional states created by the multiple shifts exacerbate the already major disadvantage of the canonical parser with regard to space requirements.

2.3 MERGING OF CANONICAL STATES

Consider the following grammar:

    S ➔ e A | g C
    A ➔ f S b
    C ➔ f S b

(the rule-probabilities do not matter). The full canonical parser generator produces eighteen states, of which eight are eliminated by shift-reduce optimisation (Aho et al, 1985). Of the remaining states, a further eight consist of two sets of four, the sets distinguished only by the lookaheads. These states propagate the dot through the longer rules, but in fact the lookaheads are not used because in each case the series terminates in a shift-reduce entry. When this action occurs the parser moves to a state wherein the possible next symbols are revealed. These states can therefore be merged without compromising the predictive advantage of the canonical parser. This reduces the number of states to six, the same as for the LALR parser generator.

All this applies to the probabilistic version, where the lookaheads are propagated as a distribution. A fairly simple procedure allows each state to be endowed with a flag to indicate whether or not any of the lookahead data are important. If not, merger can be based purely on the rule and dot positions of the items, and their probabilities. The number of states saved varies very much with the grammar: the above example represents an extreme case, and equally there are grammars for which no saving occurs.

3. COMPARISON OF PARSER GENERATORS

To compare the parser generators a test grammar developed by I.B.M. from an Associated Press corpus was used (Sharman, 1989). This grammar consists of 677 rules, ranked with a rule-count which was easily converted into a probability. It was convenient to use reduced versions of the grammar based on a rule-count threshold. Simply truncating the grammar is not sufficient, however, for two reasons. First, the resulting grammar can be disconnected, in that there exist rules whose left-hand sides cannot occur in any string derived from S. Second, the grammar can be incomplete, in that nonterminal symbols can arise within strings derived from S but for which there are no corresponding rules because all have counts below the threshold. The solution to these two problems is basically the same: recursively to add to the truncated grammar a small number of additional rules, with counts below the threshold, until the resulting grammar is connected and complete.
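A sketch of this repair (illustrative names, not the authors' code; it assumes that disconnected rules are simply dropped once completion has been achieved):

    def repair(rules, nonterminals, threshold, start="S"):
        """rules: list of (count, lhs, rhs_tuple).  Keep the rules at or
        above the threshold, then promote below-threshold rules, most
        frequent first, until the grammar is connected and complete."""
        kept = [r for r in rules if r[0] >= threshold]
        backup = sorted((r for r in rules if r[0] < threshold), reverse=True)
        while True:
            defined = {lhs for _, lhs, _ in kept}
            reachable, frontier = set(), {start}
            while frontier:                  # nonterminals reachable from S
                nt = frontier.pop()
                reachable.add(nt)
                for _, lhs, rhs in kept:
                    if lhs == nt:
                        frontier |= {s for s in rhs
                                     if s in nonterminals} - reachable
            missing = reachable - defined    # incomplete: used, but no rules
            if not missing:
                # drop disconnected rules (left-hand side unreachable)
                return [r for r in kept if r[1] in reachable]
            candidates = [i for i, (_, lhs, _) in enumerate(backup)
                          if lhs in missing]
            if not candidates:
                raise ValueError("no rule available for %s" % missing)
            kept.append(backup.pop(candidates[0]))

For example, if the threshold removes the only PP rule while a kept VP rule still mentions PP, the most frequent below-threshold PP rule is promoted and the loop terminates with a connected, complete grammar.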

Applying this procedure for various rule-count thresholds creates a hierarchy of grammars and allows the relationship between the size of the grammar and the parsing tables to be explored. Table 3 contains a summary of the results.

Table 3: Parsing table space requirements.

                  Non-prob LALR            Probabilistic LALR        Canonical
    Rules  Size   States  Entries  %>1     States  Entries  %>1     States  Entries  %>1
       15    37       15       58    0         15       58    0         17       68    0
       27    63       23      128    2         23      128    2         27      151    0
       42   104       36      239    2         38      239    2        110      780    1
       77   191       71      845    6         71      845    6        137        -    -
      115   291      120     1931    -        146     2297    -        (lookahead
      194   510      214     7522   22        359    12051   19         looping
      677  2075     1011   126322   46       3600  >250000    -         behaviour)

The number of states and total number of entries in the parsing tables are compared for non-probabilistic and probabilistic LALR parser generators, the latter incorporating the multiple-shift procedure discussed in section 2.1. Shift-reduce optimisation was applied in all cases. The "size" of each grammar is the total length of right-hand sides of all rules plus the number of nonterminal symbols. The number of table entries is the total of all non-error action and goto entries, including multiple entries. Also displayed is the percentage of non-error cells in the tables (action and goto) which contain multiple entries (%>1).

Only limited results are available for the canonical parser generator because the lookahead loop suppression procedure (section 2.2) has not yet been implemented. Despite the use of the canonical merging procedure (section 2.3) the size of the parsing tables is clearly growing rapidly, and this version of parser generator is only a practical proposition for rather small grammars.

What stands out most from the LALR results is not that the total number of entries grows with the size of the grammar but that it does so exponentially. The space requirements of LR parsers for unambiguous computer languages tend to grow in a linear way with size (Purdom, 1974). It is also notable (and no coincidence) that the proportion of multiple entries also grows with the size of the grammar. Although further stages of optimisation may enable space to be saved, attention must be focussed on the grammar itself.

The parser generation algorithm of Pager (1977) is similar to the LALR algorithm except that states are merged only when doing so results in no additional multiple entries. All such entries are therefore the result of non-determinism in the grammar (with the exception of loop-breaking multiple shifts as discussed in section 2). This algorithm has been implemented for probabilistic grammars, but for the test series the results are identical to those for the LALR generator. It follows that the growing proportion of multiple entries is the product not of state merger but of rich non-determinism in the grammar.

The last two rows in the table correspond to the addition to the grammar of two large groups of rules, used twice and once respectively in the corpus. These infrequent rules appear to introduce a high degree of ambiguity, which also shows up during the state-generation procedure. Each state is first generated as a "kernel" of items, and the presence of more than one item within a kernel implies that there is a local ambiguity which is being carried forward in order that the state automaton is deterministic. For the non-probabilistic LALR parser generator with the full grammar of 677 rules, the average kernel contained 6.1 items and the largest contained no fewer than 68!

According to Gazdar and Pullum (1985), it has never been argued that English is inherently ambiguous, rather that a descriptively adequate grammar should be ambiguous in order to account for semantic intuitions. However, the I.B.M. grammar may suffer from excessive ambiguity, and the parser would benefit considerably if some way could be found to reduce it.

Finally, the physical storage requirements are easily stated: each table entry requires four bytes, two to specify the action and two for the probability in logarithmic form, converted to a short integer.

4. PROBABILISTIC PARSE FOREST

In keeping with the first and fourth objectives set out in the Introduction, a probabilistic version of the parse forest representation of Tomita (1986) has been developed. In the presence of ambiguity, and even more so with uncertainty in the data, the number of interpretations may increase exponentially with the length of the input string. The impact of this is minimised by sub-node sharing and local-ambiguity packing. Where two or more parses contain parts of their interpretation of a sentence which are identical they can share the relevant nodes. And two or more parses may differ because of ambiguity which is localised: if part of the sentence is derivable from a nonterminal symbol in more than one way then the relevant nodes may be packed together. By thus compacting the parse forest the space requirement becomes O(n³) for most grammars (Kipps, 1989).

Employing this representation for probabilistic grammars requires that a value be attached to each node which enables the eventual calculation of the parse probability for the whole sentence given the data. In addition it is necessary that the m (say) most probable interpretations be obtained without an exhaustive search of the compacted parse forest.

A value P(δ, {D}1…j | A) is attached to each node in a parse tree, where δ denotes a particular derivation of the string w1…wj from the symbol A, and {D}1…j represents the corresponding acoustical data. This probability is the product of the probabilities of all rules used in the particular derivation of w1…wj from A and the likelihoods of those words given the data, and is easily found for a particular node from the probabilities attached to each subnode and the rule probability when the reduce action occurs. This calculation is not affected by the context of w1…wj, and therefore shared nodes need have only one value. For locally ambiguous packed nodes the probabilities of each alternative are recorded, in order that the correct ordering of alternatives can be created at further packed nodes higher in the parse forest.
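The value computation at a reduce action can be sketched as follows (class names are illustrative, and only the single-value Viterbi case is shown; the m-best generalisation is discussed below):

    import math

    class Leaf:                  # terminal node: word likelihood given data
        def __init__(self, logprob):
            self.alternatives = [(logprob, None)]

    class Node:                  # nonterminal node, possibly packed
        def __init__(self):
            self.alternatives = []   # (log P(deriv, data | symbol), children)

        def pack(self, rule_logprob, children):
            """Add one derivation: the rule probability times the product
            of the subnode values, computed in the log domain."""
            lp = rule_logprob + sum(max(a for a, _ in c.alternatives)
                                    for c in children)
            self.alternatives.append((lp, children))

    # reduce by A -> w1 w2, rule probability 0.3, word likelihoods 0.9, 0.8
    w1, w2 = Leaf(math.log(0.9)), Leaf(math.log(0.8))
    node = Node()
    node.pack(math.log(0.3), [w1, w2])
    print(math.exp(node.alternatives[0][0]))    # 0.3 * 0.9 * 0.8 = 0.216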

The probability attached to the S-node at the apex of any parse is P(δ, {D}1…M | S) where M is the length of input. If all alternatives are retained in the parse forest then the parse probabilities given all the data can be obtained by normalisation. If all that is required is to identify the single most probable parse then all local ambiguity can be resolved by maximising at packed nodes (in the manner of the Viterbi algorithm) and retaining only the most probable derivation, because this ambiguity is invisible to higher-level structures in the forest.

The general problem of identifying the m most probable parses is more complex. The current m most probable are stored at each shared node together with a compact way of indicating which combination of subnodes corresponds to each derivation. Upon reduction by a rule whose right-hand side is of length k, the new m most probable derivations must be found and sorted from the (in the worst case) m^k possibilities. If two reductions are possible, to the same nonterminal symbol and spanning the same data, and if the right-hand sides are of length k1 and k2, then the worst-case number of possibilities is m^k1 + m^k2, and so on. With appropriate book-keeping there are efficient ways to find and sort the m most probable of these, and record the subnodes. In practice this requires an array of size 4m stored for each node in the parse forest.
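A naive version of this book-keeping can be sketched with heapq; the array-of-4m scheme above is the compact equivalent, whereas here the m best of the m^k combinations are simply selected from the full product.

    import heapq
    from itertools import product

    def best_m(subnode_lists, rule_logprob, m):
        """subnode_lists: for each of the k subnodes, its m best log-probs.
        Returns the m best (logprob, subnode-choice-indices) pairs for the
        derivations created by the reduction."""
        indexed = [[(lp, i) for i, lp in enumerate(lps)]
                   for lps in subnode_lists]
        combos = ((rule_logprob + sum(lp for lp, _ in choice),
                   tuple(i for _, i in choice))
                  for choice in product(*indexed))
        return heapq.nlargest(m, combos)

    left  = [-0.1, -0.7, -2.0]      # 3 best log-probs at the left subnode
    right = [-0.2, -0.5, -1.5]      # 3 best log-probs at the right subnode
    print(best_m([left, right], rule_logprob=-0.05, m=3))
    # the three best of the nine combinations: (0, 0), (0, 1), (1, 0)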

This approach has several advantages as compared with the original Bayesian algorithm for uncertain input data (Wright et al, 1989, 1990). The results are essentially equivalent, and most of the exponentially-growing number of possible interpretations are truncated away on grounds of probability, so the algorithm requires polynomial time and space. Furthermore, in this version the probabilities in the parsing tables are used only for prediction (to guide the pattern-matcher) and not for calculating the parse probabilities. These predictions do not have to be very precise, so space can be saved by storing each probability in logarithmic form as a short integer. All this applies to isolated-word recognition; for connected speech the situation could be different.

5. ADAPTABILITY

The speed and effectiveness of the probabilistic LR parser would be seriously compromised if a large grammar (of, say, thousands of rules with a high degree of ambiguity) were adopted, and yet it would seem that if a grammar-based language model is to be employed for large-vocabulary speech recognition then the need for a large grammar will be unavoidable. The full version of the I.B.M. grammar referred to in section 3 extends to many thousands of rules, but the greater part of these consist of oddball rules that are used only once or twice in the corpus. Collectively the oddballs are important because they allow the corpus to be modelled, but individually each one is rather insignificant. It may be the case that generalisations (perhaps going beyond a context-free grammar) would eliminate a lot of these rules, but there is also a case for the parser to be made adaptive.

One approach to adaptation would be to assume a probabilistic grammar in Chomsky normal form and then use the inside-outside algorithm (Lari and Young, 1990). This approach has a lot to recommend it, but here we consider an alternative approach based on a rule-adaptive enhancement of the LR parser.

The principle is that at any time the parsing tables are based on a relatively small core grammar of important rules, but with an error-recovery procedure and a backup grammar. Error-recovery allows new rules to be created as required, and rules can be transferred between backup and core grammars in response to usage.

The probabilistic LR parser has been enhanced with such a procedure. The minimum adaptation which can allow an ungrammatical sentence to be accepted is a local change to a single existing rule: in this sense it is assumed that the sentence is "close" to the language. The conditions for rule-adaptation can be summarised as follows:

    Input string:    w1 … wi … wj … wn
    Existing rule:   A ➔ α1 β α2
    Adapted rule:    A ➔ α1' δ α2'

such that

    S ⇒* γ1 A γ2
    γ1 α1' ⇒* w1 … wi
    δ ⇒* wi+1 … wj
    α2' γ2 ⇒* wj+1 … wn

and where α1', α2' consist of α1, α2 with any unused nullable symbols suppressed.

The adaptation therefore consists in the deletion, insertion or replacement of a substring within the right-hand side of a rule, and the suppression of unused nullables simply ensures that all remaining symbols actually contribute to the parse of the sentence. This procedure usually generates a number of rule-candidates. Assuming that one of these is chosen as correct (although it may not be possible to automate this entirely), it is then added to the backup grammar as a potential core rule. With sufficient evidence of usage a rule may be promoted from backup to core grammar, and likewise a rule may be demoted.

A version is being developed in which the LR parsing tables are updated incrementally as rules are transferred between backup and core grammars. This should occur on-line, with the intention that the core grammar be kept reasonably compact (and the parser correspondingly fast) while adapting to the user. This is still very far from a complete solution to the problem of context-free grammar adaptation, but a system operating along these lines would satisfy (at least to some degree) all the objectives as set out in the Introduction.

ACKNOWLEDGMENT

This work was funded by the I.B.M. United Kingdom Scientific Centre.

REFERENCES

A V Aho, R Sethi and J D Ullman (1985), Compilers: Principles, Techniques and Tools, Addison-Wesley.

Gerald Gazdar and Geoffrey K. Pullum (1985), "Computationally relevant properties of natural languages and their grammars", New Generation Computing, 3, 273-306.

Fred Jelinek (1990), "Computation of the probability of initial substring generation by stochastic context free grammars", I.B.M. Research, Yorktown Heights, New York.

J R Kipps (1989), "Analysis of Tomita's algorithm for general context-free parsing", Proceedings

of the International Workshop on Parsing Technologies, Carnegie-Mellon University, 193-202.

K Lari and S J Young (1990), "The estimation of stochastic context-free grammars using the inside-outside algorithm", Computer Speech and Language, 4, 35-56.

Chin-Hui Lee and Lawrence R. Rabiner (1989), "A frame-synchronous network search algorithm for connected word recognition", IEEE Trans. on Acoustics, Speech and Signal Processing, 37, 1649-1658.

Kai-Fu Lee (1989), Automatic Speech Recognition, Kluwer Academic Publishers.

D Pager (1977), "A practical general method for constructing LR(k) parsers", Acta Informatica, 7, 249-268.

P Purdom (1974), "The size of LALR(1) parsers", BIT, 14, 326-337.

Richard Sharman (1989), "Observational evidence for a statistical model of language", I.B.M. United Kingdom Scientific Centre Report 205.

Masaru Tomita (1986), Efficient Parsing for Natural Language, Kluwer Academic Publishers.

Jerry Wright and Ave Wrigley (1989), "Probabilistic LR parsing for speech recognition", Proceedings of the International Workshop on Parsing Technologies, Carnegie-Mellon University, 105-114.

Jerry Wright (1990), "LR parsing of probabilistic grammars with input uncertainty for speech recognition", Computer Speech and Language, 4, 297-323.
