Weakly Restricted Stochastic Grammars

Rieks op den Akker and Hugo ter Doest

University of Twente, Department of Computer Science, P.O. Box 217, 7500 AE Enschede, The Netherlands

Keywords: stochastic languages, inference

Abstract

A new type of stochastic grammars is introduced for investigation: weakly restricted stochastic grammars. In this paper we will concentrate on the consistency problem. To find conditions for stochastic grammars to be consistent, the theory of multitype Galton-Watson branching processes and generating functions is of central importance. The unrestricted stochastic grammar formalism generates the same class of languages as the weakly restricted formalism. The inside-outside algorithm is adapted for use with weakly restricted grammars.

1 Introduction

If we consider natural languages as a structure modelled by a formal grammar, we do not consider them any more as a language that is used. Formal (context-free) grammars are often advocated as a model for the "linguistic competence" of an ideal natural language user. It is also noticed that this mathematical concept is far from a sufficient model for describing all aspects of the language. What cannot be expressed by this model is the fact that some sentences or phrases are more likely to occur than others. This notion of occurrence refers to the use of language, and therefore considering this kind of statistical knowledge about a language has to do with the use of language laid down in a corpus of the language. With a particular context of use in mind, a syntactically ambiguous sentence will often have a most likely meaning and hence a most likely analysis.

Some of the shortcomings of the pure (context-free) grammar model can perhaps be solved by stochastic grammars, a model that makes it possible to incorporate certain statistical facts about language use into a model of the possible structures of sentences as we conceive them from a mathematical, formal point of view. Natural languages are now seen as stochastic; a user of a language as a stochastic source producing sentences. A stochastic language over some alphabet Σ is simply a formal language L over Σ together with a probability function φ assigning to each string x in the language a real number φ(x) in [0,1]. Since φ(x) is interpreted as the chance that the event x, or the event that a language source produces x, will occur, it will be clear that the sum of φ(x), where x ranges over all possible sentences, is equal to one. The stochastic language is called context-free if the language L is context-free.

The usual grammatical model for a stochastic context-free language is a context-free grammar together with a function f that assigns a real number in [0,1] to each of the productions of the grammar. The meaning of this function is the following. A step in a derivation of a sentential form, in which a nonterminal A is rewritten using production p, has chance f(p) to occur, independent of which A is rewritten in the sentential form and independent of the history of the process that produced the sentential form. The probability of a derivation (tree) is the product of the probabilities of the derivation steps that produce the tree. The probability of a sentence generated by the grammar is the sum of the probabilities of all the trees of that sentence. So given a stochastic grammar we can compute the probabilities of all its sentences. The distribution language generated by a stochastic grammar G, DL(G), is defined as the set of all derivation trees with their probabilities. The stochastic language generated by a stochastic grammar G, SL(G), is defined as the set of all sentences generated by the grammar with their probabilities.

A stochastic grammar G is an adequate model of a language L if on its basis we can correctly compute the probabilities of the sentences in the language L. Of course this assumes a statistical analysis of a language corpus. A stochastic grammar that generates a stochastic language is called consistent.

Definition 1.1 A stochastic grammar G is called consistent if for the probability measure p induced by G on the language generated by its underlying grammar

\sum_{x \in L(G)} p(x) = 1

Otherwise the grammar is called inconsistent.
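To make the unrestricted model concrete, the following minimal sketch (ours, not part of the original paper; the grammar S -> S S | a and its probabilities are chosen purely for illustration) computes the probability of a derivation tree as the product of the probabilities of its derivation steps, and the probability of a string as the sum over its distinct trees.

    # Sketch: probability of a derivation tree / of a string in an unrestricted
    # stochastic CFG.  A tree node is a pair (rule, subtrees), with one subtree
    # per nonterminal in the rule's right-hand side.
    R_SS = ("S", ("S", "S"))          # S -> S S
    R_a  = ("S", ("a",))              # S -> a
    f = {R_SS: 0.4, R_a: 0.6}         # rule probabilities (illustrative values)

    def tree_prob(tree):
        rule, subtrees = tree
        p = f[rule]
        for t in subtrees:
            p *= tree_prob(t)         # product over all derivation steps
        return p

    leaf = (R_a, [])
    # the two distinct derivation trees of the string "aaa"
    t1 = (R_SS, [(R_SS, [leaf, leaf]), leaf])
    t2 = (R_SS, [leaf, (R_SS, [leaf, leaf])])
    print(tree_prob(t1) + tree_prob(t2))   # P("aaa") = sum over its trees

Both trees here receive the same probability, which is exactly the limitation of the unrestricted model that the weakly restricted grammars introduced below address.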

Not all stochastic grammars generate a stochastic language. Even proper and reduced grammars¹ do not necessarily generate a stochastic language. This is illustrated in the following example.²

¹ A grammar is called proper if for all nonterminals A, the sum of the probabilities assigned to the rules for A is 1. A grammar is called reduced if all nonterminals are reachable and can produce a terminal string.

² Although we found in [7] by Jelinek and Lafferty the (false) statement that a stochastic grammar is consistent if and only if it is proper, given that the underlying grammar is reduced. The example gives a clear counterexample to their statement.

Example 1.1 Consider the stochastic grammar G with nonterminal set V_N = {S} and terminal set V_T = {a}. The productions with their probabilities are given by:

S -> S S    (q)
S -> a      (1 - q)

Following the technique presented in [2] we find that the production generating function is given by g_1(s_1) = q s_1^2 + 1 - q, and that the first-moment matrix E is given by [2q]. We can conclude that the grammar is consistent if and only if q <= 1/2. For details we refer to [5]. Notice that all the different trees of a string a^n have the same probability. Hence, they cannot be distinguished according to their probabilities. □

It has been noticed that the usual model of a stochastic grammar as presented above, which we from now on call the unrestricted stochastic grammar model, has some disadvantages for modelling "real" languages. In this paper we present a more adequate model, the weakly restricted stochastic grammar model. We give necessary and sufficient conditions to test in an efficient way whether such a grammar defines a stochastic language. Moreover, we will show that these grammars can be transformed into an equivalent model of the usual type. The nice thing about the new model is that it models "context-dependent" probabilities of production rules directly in terms of the grammar specification of the language and not in terms of some particular implementation of the grammar as a parser. The latter is done by Briscoe and Carroll [3] by assigning probabilities to the transitions of the LR parser constructed for the grammar. In section 2 weakly restricted grammars are introduced, in section 3 conditions for their consistency are investigated, in section 4 it is proven that weakly restricted grammars and unrestricted grammars generate the same class of stochastic languages, and section 5 presents the inside-outside algorithm for weakly restricted grammars.

2 Weakly Restricted Stochastic Grammars

To add context-sensitivity to the assignment of probabilities to the application of production rules, we take into account (and distinguish) the occurrences of the nonterminals. Then, for each nonterminal occurrence, distinct probabilities can be given for the production rules that can be used to rewrite the nonterminal. This way of assigning probabilities to the application of production rules seems unknown in the literature, although we found some other formalisms that were designed to add context-sensitivity to the assignment of probabilities. For instance, the definition of stochastic grammars by Salomaa in [8] is somewhat different from the definition we gave in our introduction: the probability of a production to be applied is there dependent on the production that was last applied. To escape the bootstrap problem (when a derivation is started, there is no last applied production) an initial stochastic vector is added to the grammar.

Weakly restricted stochastic grammars are introduced in [1]. In the following definition C_{A_i} denotes the set of productions for A_i and R(A_i) denotes the number of right-hand side occurrences of nonterminal A_i.

Definition 2.1 A weakly restricted stochastic grammar G_W is a pair (G_C, Δ), where G_C = (V_N, V_T, P, S) is a context-free grammar and Δ is a set of functions

Δ = {p_i | A_i ∈ V_N}

where, if j ∈ 1...R(A_i) and k ∈ 1...|C_{A_i}|, p_i(j, k) = p_{ijk} ∈ [0,1]. The set of productions P contains exactly one production for the start symbol S.

In words, p_{ijk} stands for the probability that the k-th production with left-hand side A_i is used for rewriting the j-th right-hand side occurrence of nonterminal A_i. The usefulness of this context-dependency can be seen immediately from the following unrestricted stochastic grammar, which is taken (in part) from the example grammar in [3] (p. 29):

S -> NP VP
VP -> Vt NP
NP -> Pron
NP -> Det N
NP -> NP PP

Unrestricted stochastic grammars cannot model context-dependent use of productions. For example, an NP is more likely to be expanded as a pronoun in subject position than elsewhere. Exactly this dependence on where a nonterminal was introduced can be modeled by using a weakly restricted stochastic grammar.

Since in a weakly restricted stochastic grammar the probabilities of applying a production are dependent on the particular occurrence of a nonterminal in the right-hand side of a production, it is useful to require that there is only one start production. The characteristic grammar of a weakly restricted grammar is the underlying context-free grammar.
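The following sketch (ours, not from the paper) shows one way Definition 2.1 can be represented concretely: rule probabilities are indexed by nonterminal, right-hand side occurrence and rule, and properness can be checked per occurrence. All names and numbers are illustrative only.

    # rules[A_i] lists the productions C_{A_i}; delta[(A_i, j)][k] holds p_ijk.
    rules = {
        "S":  [("NP", "VP")],
        "VP": [("Vt", "NP")],
        "NP": [("Pron",), ("Det", "N"), ("NP", "PP")],
    }
    delta = {
        ("S", 1):  [1.0],             # the single occurrence of S (under the start production)
        ("VP", 1): [1.0],             # VP in S -> NP VP
        ("NP", 1): [0.7, 0.2, 0.1],   # NP in S -> NP VP: subject position favours Pron
        ("NP", 2): [0.2, 0.6, 0.2],   # NP in VP -> Vt NP
        ("NP", 3): [0.1, 0.7, 0.2],   # NP in NP -> NP PP
    }

    def is_proper(rules, delta):
        # every occurrence distribution must cover the rules of its nonterminal and sum to 1
        return all(len(ps) == len(rules[a]) and abs(sum(ps) - 1.0) < 1e-9
                   for (a, j), ps in delta.items())

    print(is_proper(rules, delta))    # True

The occurrence numbering for NP mirrors the subject/object asymmetry mentioned above; an unrestricted grammar would be forced to use a single distribution for all three NP positions.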

The next step is to compute probabilities for strings with respect to weakly restricted stochastic grammars. For this purpose a tree is written in terms of its subtrees (trees with a nonterminal as root) as q[t_{i_1 j_1}, t_{i_2 j_2}, ..., t_{i_{n(q)} j_{n(q)}}], in which q is a production, n(q) is the number of nonterminals in the right-hand side of q, and t_{ij} denotes a (sub)tree with the j-th occurrence of nonterminal A_i at its root. A tree for which n(q) = 0 is written as q[].

Definition 2.2 The probability of a derivation tree t with respect to a weakly restricted stochastic grammar is defined recursively as

P(q[t_{i_1 j_1}, ..., t_{i_{n(q)} j_{n(q)}}]) = \prod_{m=1}^{n(q)} p_{i_m j_m k_m} P(t_{i_m j_m})

where k_m, with 1 <= k_m <= |C_{A_{i_m}}|, is the index of the production applied at the root of t_{i_m j_m} (for n(q) = 0 the empty product gives P(q[]) = 1).

The probability of a string is defined as the sum of the probabilities of all distinct derivation trees that yield this string.

Definition 2.3 The probability of a string x ∈ L(G_W) is defined as

P(x) = \sum_{t : t yields x} P(t)

The distribution language DL(G_W) and stochastic language SL(G_W) of a weakly restricted grammar (G_C, Δ) are defined analogously to the distribution language and stochastic language of an unrestricted grammar.
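A small sketch of Definition 2.2 (ours; the occurrence labels S_1, S_2, S_3 and all probabilities are illustrative): each subtree contributes the probability that its root occurrence is rewritten by the production used there, and the recursion is driven from the occurrence introduced by the start production.

    # rewrite_probs[(A_i, j)][k] = p_ijk, as in the sketch of Definition 2.1.
    # A tree node is (nonterminal, occurrence j, rule index k, subtrees).
    rewrite_probs = {
        ("S", 1): [0.3, 0.7],     # occurrence of S under the start production Z -> S
        ("S", 2): [0.5, 0.5],     # first S in S -> S S
        ("S", 3): [0.2, 0.8],     # second S in S -> S S
    }                             # rule 0 is S -> S S, rule 1 is S -> a

    def tree_prob(node):
        a, j, k, subtrees = node
        p = rewrite_probs[(a, j)][k]
        for t in subtrees:
            p *= tree_prob(t)
        return p

    # one derivation tree of "aa": S_1 uses S -> S S, both children use S -> a
    t = ("S", 1, 0, [("S", 2, 1, []), ("S", 3, 1, [])])
    print(tree_prob(t))           # 0.3 * 0.5 * 0.8 = 0.12

Because the two children of S -> S S carry different distributions, trees that differ only in where the recursion happens can now receive different probabilities, unlike in Example 1.1.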

3 Consistency

In this section consistency of weakly restricted stochastic grammars will be considered. The theory of multitype branching processes will be used to come to a theorem similar to the one given in [2] for unrestricted stochastic grammars.

Definition 3.1 For the j-th occurrence of nonterminal A_i ∈ V_N the production generating function for weakly restricted stochastic grammars is defined as:

g_{ij}(s_{1,1}, ..., s_{k,R(A_k)}) = \sum_{u=1}^{|C_{A_i}|} p_{iju} \prod_{m=1}^{k} \prod_{n=1}^{R(A_m)} s_{m,n}^{r_{mn}(u)}

where r_{mn}(u) is 1 if nonterminal occurrence A_{mn} appears in the right-hand side of the u-th production rule with nonterminal A_i as left-hand side and 0 otherwise.

Note that for each right-hand side nonterminal occurrence a dummy variable is introduced: s_{ij} corresponds to the j-th occurrence of nonterminal A_i. A special variable is s_{1,1}: it corresponds to the start symbol, which is the right-hand side of the start production, of the form Z -> S. The generating function for nonterminal occurrence A_{ij} contains a term for each production for A_i. If g_{ij} has a term of the form

α s_{i_1} s_{i_2} ... s_{i_n}

then we know that it corresponds to a production for A_i of the form

A_i -> x_{i_1} A_{i_1} x_{i_2} A_{i_2} ... x_{i_n} A_{i_n} x_{i_{n+1}}

where the x_{i_j} ∈ V_T^*. The production has, if it is used for rewriting occurrence A_{ij}, probability α of being applied. In Example 3.1 it will be illustrated how the terms of the generating functions correspond to the productions of the grammar.

Theorem 3.1 Let A_{ij} => α, thus the j-th occurrence of nonterminal A_i is rewritten using exactly one production. The probability that α contains the n-th occurrence of nonterminal A_m is given by

\frac{\partial g_{ij}(s_{1,1}, ..., s_{k,R(A_k)})}{\partial s_{m,n}} evaluated at s_{1,1} = ... = s_{k,R(A_k)} = 1

Proof In general the generating function can be written as

g_{ij}(s_{1,1}, ..., s_{k,R(A_k)}) = \hat{g}_{ij}(s_{1,1}, ..., s_{k,R(A_k)}) + c_{ij}

where \hat{g}_{ij}(s_{1,1}, ..., s_{k,R(A_k)}) only contains terms dependent on s_{1,1}, ..., s_{k,R(A_k)} and where c_{ij} is a constant term. The terms dependent on s_{1,1}, ..., s_{k,R(A_k)} come from productions for A_i that contain nonterminals in their right-hand sides, and the constant term from productions for A_i that only contain terminals in their right-hand sides. When partial derivatives are taken of g_{ij} we can just as well consider \hat{g}_{ij}, since the constant term will become zero. We know that the terms in \hat{g}_{ij} do not contain any powers higher than 1 of the variables in it. This leads us to the insight that taking the partial derivative of \hat{g}_{ij} with respect to s_{m,n} results in at most one term, of the form p_{ijh} f(s_{1,1}, ..., s_{k,R(A_k)}), where f does not depend on s_{m,n} and p_{ijh} is one of the probabilities resulting from applying p_i to j and some h in 1...|C_{A_i}|. If we substitute 1 for all remaining variables in the partial derivative we find as value for e_{ij,mn} the probability that the j-th occurrence of nonterminal A_i is rewritten by the production that contains in its right-hand side nonterminal occurrence A_{mn}. □

The first-moment matrix for weakly restricted grammars is defined just like the first-moment matrix for unrestricted grammars:

Definition 3.2 The first-moment matrix E associated with the weakly restricted grammar G is

E = [e_{ij,mn}]

We order the set of eigenvalues of the first-moment matrix from the largest one to the smallest, such that ρ_1 represents the maximum.

Theorem 3.2 A proper weakly restricted grammar is consistent if ρ_1 < 1 and is not consistent if ρ_1 > 1.

The proof of this theorem is analogous to the proof of the related theorem in [2] and we will not treat it here (see [5] for a proof).

Example 3.1 Consider the weakly restricted stochastic grammar (G_C, Δ) where G_C = (V_N, V_T, P, Z) = ({Z, S}, {a}, P, Z) and P and Δ are as follows:

Z -> S      (p, 1 - p)
S -> S S    (q, 1 - q) (r, 1 - r)
S -> a

Here the pair behind a right-hand side occurrence of S gives the probabilities with which that occurrence is rewritten by S -> S S and by S -> a, respectively.

For a reason that will become clear at the end of the example, we assume that p ≠ 0. The production generating functions are given by

g_{11}(s_{11}, s_{12}, s_{13}) = p s_{12} s_{13} + 1 - p
g_{12}(s_{11}, s_{12}, s_{13}) = q s_{12} s_{13} + 1 - q
g_{13}(s_{11}, s_{12}, s_{13}) = r s_{12} s_{13} + 1 - r

The first-moment matrix E is given by

E =
| 0  p  p |
| 0  q  q |
| 0  r  r |

The characteristic equation is given by φ(x) = x((x - q)(x - r) - qr) = x^2 (x - (q + r)) = 0. Thus, the eigenvalues of the matrix are 0 and q + r. According to Theorem 3.2 the grammar is consistent if q + r < 1 and inconsistent if q + r > 1. If q + r = 1 the theorem does not decide the consistency of the grammar. From the characteristic equation it follows that the value of p does not influence the consistency of the grammar. However, looking at the grammar we find that it is consistent if p = 0, regardless of the probabilities q and r. Therefore, before Theorem 3.2 can be used for checking the consistency of the grammar, the grammar must be stripped of productions having for each nonterminal occurrence probability zero of being applied. □

Definition 3.3 A final class C of nonterminal occurrences is a subset of the set of all nonterminal occurrences having the property that any occurrence in C has probability 1 of producing, when rewritten using one production rule, exactly one occurrence also in C.

Theorem 3.3 A weakly restricted stochastic grammar is consistent if and only if ρ_1 <= 1 and there are no final classes.

For the proof of Theorem 3.3 we refer to [5]. Applying this theorem to the example teaches us that if q + r = 1, the grammar is consistent if and only if there is no final class of occurrences. Looking at the grammar we see that there is a final class of occurrences if q = 1 or r = 1 (or both); the final classes then are {S_2}, {S_3} and {S_2, S_3}, respectively; if in addition p = 1, then the final classes are {S_1, S_2}, {S_1, S_3} and {S_1, S_2, S_3}, respectively. Hence, the grammar is consistent if and only if q + r <= 1 ∧ q ≠ 1 ∧ r ≠ 1. Notice that if q ≠ r then all trees of a^n have different probabilities.
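As a quick numerical companion to Example 3.1 and Theorem 3.2 (our sketch, not part of the paper; the probability values are arbitrary), the spectral radius ρ_1 of the first-moment matrix can be computed directly:

    import numpy as np

    def first_moment_matrix(p, q, r):
        # rows/columns correspond to the occurrences S_1 (in Z -> S) and
        # S_2, S_3 (in S -> S S) of Example 3.1
        return np.array([[0, p, p],
                         [0, q, q],
                         [0, r, r]], dtype=float)

    def rho_1(E):
        return max(abs(np.linalg.eigvals(E)))       # largest eigenvalue magnitude

    for p, q, r in [(1.0, 0.3, 0.4), (1.0, 0.6, 0.7), (1.0, 0.5, 0.5)]:
        rho = rho_1(first_moment_matrix(p, q, r))
        if rho < 1:
            verdict = "consistent (Theorem 3.2)"
        elif rho > 1:
            verdict = "inconsistent (Theorem 3.2)"
        else:
            verdict = "undecided by Theorem 3.2; check final classes (Theorem 3.3)"
        print((p, q, r), round(rho, 3), verdict)

For the borderline case q + r = 1 the final-class test of Theorem 3.3 decides the matter, as the example explains.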
4 Equivalence

In this section we will show that a weakly restricted stochastic grammar can be transformed into an equivalent unrestricted grammar. We define two grammars G and H to be equivalent if DL(G) = DL(H).

The transformation is performed as follows. With each nonterminal occurrence A_{ij} in the right-hand side of a production rule associate a new unique nonterminal A_{ij}; for each new nonterminal A_{ij} copy the set of production rules with nonterminal A_i as left-hand side, replace the left-hand sides with A_{ij} and replace in the right-hand sides each nonterminal with its new (associated) nonterminal; assign probability p_{ijk} to the k-th production rule with left-hand side A_{ij}. We formalized this in the following algorithm.

Algorithm 4.1

1. Associate with the j-th occurrence of nonterminal A_i in the right-hand sides of the production rules a (new) unique nonterminal A_{ij} (clearly j ∈ 1, ..., R(A_i)). The set of nonterminals for the rewritten grammar G' is denoted by V_N' and is the set of associated nonterminals plus the start symbol S from the weakly restricted grammar G.

2. This step is given in pseudo-pascal:

   for i := 1 to |V_N| do
     for j := 1 to R(A_i) do
       P' := P' ∪ C_{A_i}(j)
     od
   od

   where C_{A_i}(j) is the set of productions C_{A_i} with left-hand sides A_i replaced by A_{ij} and the nonterminals in the right-hand sides of the production rules replaced by their associated nonterminals.

3. The probabilities to be assigned to the production rules in C_{A_i}(j) are deduced from p_{ij} = (p_{ij1}, ..., p_{ij|C_{A_i}|}): the k-th production rule in C_{A_i}(j) is assigned probability p_{ijk}.
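The following sketch (ours, not the paper's code) runs Algorithm 4.1 on the dictionary representation used in the earlier sketches; occurrence numbering simply follows the order in which right-hand sides are scanned, and the function and variable names are our own.

    def transform(rules, delta, start):
        """Turn a weakly restricted grammar into an unrestricted one (Algorithm 4.1)."""
        # Step 1: give the j-th right-hand side occurrence of A_i the new name (A_i, j).
        occ_of = {}                              # (lhs, rule index, position) -> (A_i, j)
        next_j = {a: 0 for a in rules}
        for lhs in rules:
            for k, rhs in enumerate(rules[lhs]):
                for pos, sym in enumerate(rhs):
                    if sym in rules:             # nonterminal occurrence
                        next_j[sym] += 1
                        occ_of[(lhs, k, pos)] = (sym, next_j[sym])

        # Steps 2 and 3: for every occurrence nonterminal (A_i, j), copy the rules of
        # A_i, rewrite their right-hand sides, and attach the probabilities p_ijk.
        new_rules = {}
        for a, j in occ_of.values():
            new_rules[(a, j)] = [
                (tuple(occ_of.get((a, k, pos), sym) for pos, sym in enumerate(rhs)),
                 delta[(a, j)][k])
                for k, rhs in enumerate(rules[a])
            ]
        # the unique start production, e.g. Z -> S, keeps probability 1
        new_rules[start] = [((occ_of[(start, 0, 0)],), 1.0)]
        return new_rules

Each occurrence nonterminal inherits |C_{A_i}| rules, so the result has at most R(A_i) * |C_{A_i}| rules per original nonterminal, in line with the polynomial size bound mentioned below.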

Theorem 4.1 For every weakly restricted stochastic grammar there is an unrestricted stochastic grammar which is distributively equivalent.

Proof We can prove the theorem by proving that the algorithm finds for every weakly restricted grammar an unrestricted grammar that is distributively equivalent. From the algorithm it immediately follows that the languages (without the probabilities) generated by the weakly restricted grammar and the unrestricted grammar generated by the algorithm are equal. The production rules introduced by the algorithm in the unrestricted grammar cannot generate any other strings than the strings generated by the weakly restricted grammar. Also it can be seen from the algorithm that the unrestricted grammar associates the same probabilities with its strings as the weakly restricted grammar. Hence, the theorem holds. □

A corollary of this theorem is that for each weakly restricted grammar there exists an unrestricted grammar that is stochastically equivalent.

The time complexity of the algorithm can easily be found. We observe that, if we denote the number of nonterminals in the weakly restricted grammar by k, each step can be done in O(k) steps. Then the total time complexity is O(k). We define the size of a grammar to be the product of the number of nonterminals and the number of productions. The size of the newly created grammar can be found to be polynomial in the size of the weakly restricted grammar.

5 Inference

The inside-outside algorithm is originally a reestimation procedure for the rule probabilities of an unrestricted stochastic grammar in Chomsky Normal Form (CNF) [4]. It takes as input an initial unrestricted stochastic grammar G in CNF and a sample set E of strings, and it iteratively reestimates rule probabilities to maximize the probability that the grammar would produce the sample set.

The basic idea of the inside-outside algorithm is to use the current rule probabilities to estimate from the sample set the expected frequencies of certain derivation steps, and then compute new rule probability estimates as the appropriate frequency ratios. Therefore, each iteration of the algorithm starts by calculating the inside and outside probabilities for all strings in the sample set. These probabilities are in fact probabilities of partial derivations.

Assume that V_N = {Z = A_0, S = A_1, A_2, ..., A_n} and V_T = {a_1, ..., a_t}. By definition it is required that the grammar has one production for the start symbol Z: Z -> S. Parallel to the definition of generating functions for weakly restricted grammars, we have to distinguish all nonterminal occurrences in right-hand sides of productions; we remind the reader that the probability of each production depends on the particular nonterminal occurrence to be rewritten. The inside and outside probabilities now have to be specified for each nonterminal occurrence separately. As already stated in the introduction, the inside-outside algorithm is designed only for context-free grammars in CNF. Using this fact we can simplify the way nonterminal occurrences are indexed: A_{q(p·r)} (A_{r(pq·)}) denotes the occurrence of A_q (A_r) in the production A_p -> A_q A_r; for this production also the notation (pqr) is used, and for the production A_p -> a_q the notation (pq). Similarly, the probability that occurrence A_{q(p·r)} is rewritten using rule (qst) is denoted by P_{q(p·r)}(qst). For the start production a special provision has to be made: the nonterminal occurrence in its right-hand side is denoted by A_{1(0··)}. A stochastic grammar in CNF over these sets can then be specified by the probabilities P_{q(p·r)}(qst) and P_{q(p·r)}(qs). Since we require stochastic grammars to be proper, we know that for p, q, r = 1, ..., n,

\sum_{s,t} P_{q(p·r)}(qst) + \sum_{s} P_{q(p·r)}(qs) = 1

If we want to use the inside-outside algorithm for grammar inference, then the grammar probabilities have to meet the above condition in order for the reestimation to make sense.

If string w = w_1 w_2 ... w_{|w|}, then _i w_j, 0 <= i < j <= |w|, denotes the substring w_{i+1} ... w_j. The inside probability I^w_{p(q·r)}(i, j) estimates the likelihood that occurrence A_{p(q·r)} derives _i w_j, while the outside probability O^w_{p(q·r)}(i, j) estimates the likelihood of deriving _0 w_i A_{p(q·r)} _j w_{|w|} from the start symbol S. The inside probability for string w and nonterminal occurrence A_{p(q·r)} is defined by the recurrent relations

I^w_{p(q·r)}(i - 1, i) = P_{p(q·r)}(ps),  where a_s = w_i

I^w_{p(q·r)}(i, j) = \sum_{s,t} \sum_{i < h < j} P_{p(q·r)}(pst) I^w_{s(p·t)}(i, h) I^w_{t(ps·)}(h, j)

The outside probability is defined by

O^w_{1(0··)}(0, |w|) = 1

O^w_{p(q·r)}(i, k) = \sum_{s,t} \sum_{j=k}^{|w|} O^w_{q(s·t)}(i, j) I^w_{r(qp·)}(k, j) P_{q(s·t)}(qpr)

(and analogously for an occurrence A_{p(qr·)} in the second position of a right-hand side). The second equation above is somewhat simpler than the corresponding one for unrestricted stochastic grammars, because the occurrence A_{p(q·r)} for which the outside probability O^w_{p(q·r)}(i, k) is computed specifies the production used for creating it, and consequently the probability for A_{p(q·r)} to generate _0 w_i A_{p(q·r)} _k w_{|w|} is the sum of much fewer possibilities.
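A compact sketch of the inside recursion in this occurrence-indexed form (ours, not the paper's implementation; occurrence labels, rule encoding and probabilities are illustrative). Each occurrence label stands for some A_{p(q·r)}, and its options list the unary and binary rules it can be rewritten with, each binary option naming the two child occurrences it creates.

    # options[o] = list of (probability, rhs); rhs is ("t", terminal) for a unary
    # rule or ("n", left_child_occurrence, right_child_occurrence) for a binary rule.
    options = {
        "S(Z)":       [(0.3, ("n", "S(S,left)", "S(S,right)")), (0.7, ("t", "a"))],
        "S(S,left)":  [(0.5, ("n", "S(S,left)", "S(S,right)")), (0.5, ("t", "a"))],
        "S(S,right)": [(0.2, ("n", "S(S,left)", "S(S,right)")), (0.8, ("t", "a"))],
    }

    def inside(w, options):
        n = len(w)
        I = {}
        for span in range(1, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for o, opts in options.items():
                    total = 0.0
                    for prob, rhs in opts:
                        if rhs[0] == "t" and span == 1 and rhs[1] == w[i]:
                            total += prob                        # I(i-1, i) base case
                        elif rhs[0] == "n" and span > 1:
                            _, left, right = rhs
                            for h in range(i + 1, j):            # split point
                                total += prob * I[(left, i, h)] * I[(right, h, j)]
                    I[(o, i, j)] = total
        return I

    I = inside("aa", options)
    print(I[("S(Z)", 0, 2)])          # P_w for w = "aa": 0.3 * 0.5 * 0.8 = 0.12

The outside pass and the reestimation step below follow the same occurrence-indexed bookkeeping.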

933 reestimated probability of unary rules, ~q(p.,.)(q~), are 6 Conclusions computed using the following reestimation formulae: In this paper we have investigated consistency of /Sv(q.,.)0,.,,) = weakly restricted stochastic grammars and presented 1 [ Pv(q'")(w*)I~(q't)(i'J) ] an adapted version of the inside-outside algorithm. Other issues concerning stochastic grammars and es- wee O

wEE Acknowledgement We are grateful to Jorma where P~ is the probability assigned by the current Tarhio, presently at the University of Berkeley, Cali- model to string w fornia, for stimulating discnssions.

P~ = I7<0 )(0, I,ol) References and P~ is the probability assigned by the current model to the set of derivations involving some in- [1] R. op den Akker. Stochastic Gram.mars: fl, eory stance of Ap and applications. University of Twente, Depart- ment of Computer Science, Memoranda Informat- '~ = [~ i ' ~ i ' ica 93-19, 1993. 0_
