View metadata, citation and similar papers at core.ac.uk brought to you by CORE

provided by St Andrews Research Repository

Computation of Infix Probabilities for Probabilistic Context-Free Grammars

Mark-Jan Nederhof Giorgio Satta School of Computer Science Dept. of Information Engineering University of St Andrews University of Padua United Kingdom Italy [email protected] [email protected]

Abstract or part-of-speech, when a prefix of the input has al- ready been processed, as discussed by Jelinek and The notion of infix probability has been intro- Lafferty (1991). Such distributions are useful for duced in the literature as a generalization of speech recognition, where the result of the acous- the notion of prefix (or initial substring) prob- tic processor is represented as a lattice, and local ability, motivated by applications in speech recognition and word error correction. For the choices must be made for a next transition. In ad- case where a probabilistic context-free gram- dition, distributions for the next word are also useful mar is used as language model, methods for for applications of word error correction, when one the computation of infix probabilities have is processing ‘noisy’ text and the parser recognizes been presented in the literature, based on vari- an error that must be recovered by operations of in- ous simplifying assumptions. Here we present sertion, replacement or deletion. a solution that applies to the problem in its full generality. Motivated by the above applications, the problem of the computation of infix probabilities for PCFGs has been introduced in the literature as a generaliza- 1 Introduction tion of the prefix probability problem. We are now given a PCFG and a string w, and we are asked Probabilistic context-free grammars (PCFGs for G short) are a statistical model widely used in natural to compute the probability that a sentence generated by has w as an infix. This probability is defined language processing. Several computational prob- G lems related to PCFGs have been investigated in as the possibly infinite sum of the probabilities of the literature, motivated by applications in model- all strings of the form xwy, for any pair of strings x and y over the alphabet of . Besides applications ing of natural language syntax. One such problem is G the computation of prefix probabilities for PCFGs, in computation of the probability distribution for the where we are given as input a PCFG and a string next word token and in word error correction, in- G w, and we are asked to compute the probability that fix probabilities can also be exploited in speech un- a sentence generated by starts with w, that is, has derstanding systems to score partial hypotheses in G w as a prefix. This quantity is defined as the possi- algorithms based on beam search, as discussed by bly infinite sum of the probabilities of all strings of Corazza et al. (1991). the form wx, for any string x over the alphabet of . Corazza et al. (1991) have pointed out that the G The problem of computation of prefix probabili- computation of infix probabilities is more difficult ties for PCFGs was first formulated by Persoon and than the computation of prefix probabilities, due to Fu (1975). Efficient algorithms for its solution have the added ambiguity that several occurrences of the been proposed by Jelinek and Lafferty (1991) and given infix can be found in a single string generated Stolcke (1995). Prefix probabilities can be used to by the PCFG. The authors developed solutions for compute probability distributions for the next word the case where some distribution can be defined on 1213

Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1213–1221, Edinburgh, Scotland, UK, July 27–31, 2011. c 2011 Association for Computational Linguistics the distance of the infix from the sentence bound- from rules in R to real numbers in the interval [0, 1]. aries, which is a simplifying assumption. The prob- The concept of left-most derivation in one step is π lem is also considered by Fred (2000), which pro- represented by the notation α β, which means ⇒G vides algorithms for the case where the language that the left-most occurrence of any nonterminal in model is a probabilistic . However, α (Σ N) is rewritten by means of some rule ∈ ∪ ∗ the algorithm in (Fred, 2000) does not apply to cases π R. If the rewritten nonterminal is A, then π ∈ with multiple occurrences of the given infix within must be of the form (A γ) and β is the result → a string in the language, which is what was pointed of replacing the occurrence of A in α by γ. A left- out to be the problematic case. most derivation with any number of steps, using a d In this paper we adopt a novel approach to the sequence d of rules, is denoted as α β. We omit ⇒G problem of computation of infix probabilities, by re- the subscript when the PCFG is understood. We G moving the ambiguity that would be caused by mul- also write α ∗ β when the involved sequence of ⇒ tiple occurrences of the given infix. Although our rules is of no relevance. Henceforth, all derivations result is obtained by a combination of well-known we discuss are implicitly left-most. techniques from the literature on PCFG parsing and A complete derivation is either the empty se- pattern matching, as far as we know this is the first quence of rules, or a sequence d = π1 πm, m d ··· ≥ algorithm for the computation of infix probabilities 1, of rules such that A w for some A N and ⇒ ∈ that works for general PCFG models without any re- w Σ . In the latter case, we say the complete ∈ ∗ strictive assumption. derivation starts with A, and in the former case, with The remainder of this paper is structured as fol- d an empty sequence of rules, we assume the com- lows. In Section 2 we explain how the sum of the plete derivation starts and ends with a single termi- probabilities of all trees generated by a PCFG can nal, which is left unspecified. It is well-known that be computed as the least fixed-point solution of a there exists a bijective correspondence between left- non-linear system of equations. In Section 3 we re- most complete derivations starting with nonterminal call the construction of a new PCFG out of a given A and parse trees derived by the grammar with root PCFG and a given finite automaton, such that the A and a yield composed of terminal symbols only. language generated by the new grammar is the in- The depth of a complete derivation d is the length tersection of the languages generated by the given of the longest path from the root to a leaf in the parse PCFG and the automaton, and the probabilities of tree associated with d. The length of a path is de- the generated strings are preserved. In Section 4 fined as the number of nodes it visits. Thus if d = π we show how one can efficiently construct an un- for some rule π = (A w) with w Σ , then the → ∈ ∗ ambiguous finite automaton that accepts all strings depth of d is 2. with a given infix. The material from these three The probability p(d) of a complete derivation d = sections is combined into a new algorithm in Sec- π1 πm, m 1, is: tion 5, which allows computation of the infix prob- ··· ≥ ability for PCFGs. This is the main result of this m paper. Several extensions of the basic technique are p(d) = p(πi). discussed in Section 6. Section 7 discusses imple- Yi=1 mentation and some experiments. We also assume that p(d) = 1 when d is an empty sequence of rules. The probability p(w) of a string 2 Sum of probabilities of all derivations w is the sum of all complete derivations deriving that Assume a probabilistic context-free grammar , rep- string from the start symbol: G resented by a 5-tuple (Σ, N, S, R, p), where Σ and N are two finite disjoint sets of terminals and non- p(w) = p(d). d terminals, respectively, S N is the start symbol, d:XS w ∈ ⇒ R is a finite set of rules, each of the form A α, → where A N and α (Σ N)∗, and p is a function With this notation, consistency of a PCFG is de- ∈ ∈ ∪ 1214 fined as the condition: For each k with 1 k N , the generating ≤ ≤ | | function associated with Ak is defined as p(d) = 1. d gA (z1, z2, . . . , z ) = d,w: S w k N X⇒ | | m N k | | f(Aj ,αk,i) In other words, a PCFG is consistent if the sum p(Ak αk,i) z . (2) → · j of probabilities of all complete derivations starting Xi=1  jY=1  with S is 1. An equivalent definition of consistency Furthermore, for each i 1 we recursively define considers the sum of probabilities of all strings: (i) ≥ functions gA (z1, z2, . . . , z N ) by k | | p(w) = 1. (1) w gA (z1, z2, . . . , z N ) = gAk (z1, z2, . . . , z N ), (3) X k | | | | See (Booth and Thompson, 1973) for further discus- and, for i 2, by ≥ sion. (i) gA (z1, z2, . . . , z N ) = (4) In practice, PCFGs are often required to satisfy k | | (i 1) the additional condition: g ( g − (z , z , . . . , z ), Ak A1 1 2 N (i 1) | | g − (z , z , . . . , z ),..., A2 1 2 N p(π) = 1, (i 1) | | g − (z1, z2, . . . , z )). π=(A α) A N N X→ | | | | for each A N. This condition is called proper- Using induction it is not difficult to show that, for ∈ (i) . PCFGs that naturally arise by parameter es- each k and i as above, g (0, 0,..., 0) is the sum of ness Ak timation from corpora are generally consistent; see the probabilities of all complete derivations from Ak (Sanchez´ and Bened´ı, 1997; Chi and Geman, 1998). having depth not exceeding i. This implies that, for However, in what follows, neither properness nor i = 0, 1, 2,..., the sequence of the g(i) (0, 0,..., 0) Ak consistency is guaranteed. monotonically converges to Z(Ak). We define the partition function of as the func- For each k with 1 k N we can now write G tion Z that assigns to each A N the value ≤ ≤ | | ∈ Z(Ak) = d Z(A) = p(A w). (1) = lim g(i) (0,..., 0) ⇒ i Ak d,w →∞ X (i 1) = lim g ( g − (0, 0,..., 0),..., Ak A1 Note that Z(S) = 1 means that is consistent. i (i 1) →∞ g − (0, 0,..., 0) ) G A N More generally, in later sections we will need to | | (i 1) compute the partition function for non-consistent = g ( lim g − (0, 0,..., 0),..., Ak i A1 PCFGs. →∞ (i 1) limi g − (0, 0,..., 0) ) A N We can characterize the partition function of a →∞ | | = gAk (Z(A1),...,Z(A N )). PCFG as a solution of a specific system of equa- | | tions. Following the approach in (Harris, 1963; Chi, The above shows that the values of the partition 1999), we introduce generating functions associated function provide a solution to the system of the fol- with the nonterminals of the grammar. For A N ∈ lowing equations, for 1 k N : and α (N Σ)∗, we write f(A, α) to denote the ≤ ≤ | | ∈ ∪ number of occurrences of symbol A within string α. zk = gAk (z1, z2, . . . , z N ). (5) | | Let N = A1,A2,...,A N . For each Ak N, let { | |} ∈ mk be the number of rules in R with left-hand side In the case of a general PCFG, the above equa- Ak, and assume some fixed order for these rules. For tions are non-linear polynomials with positive (real) each i with 1 i m , let A α be the i-th coefficients. We can represent the resulting system ≤ ≤ k k → k,i rule with left-hand side Ak. in vector form and write X = g(X). These systems 1215 are called monotone systems of polynomial equa- met. However, there are some degenerate cases for tions and have been investigated by Etessami and which these standard conditions do not hold, result- Yannakakis (2009) and Kiefer et al. (2007). The ing in exponential time behaviour of the fixed-point sought solution, that is, the partition function, is the iteration method. This has been firstly observed least fixed point solution of X = g(X). in (Etessami and Yannakakis, 2005). For practical reasons, the set of nonterminals of An alternative iterative algorithm for the approx- a grammar is usually divided into maximal subsets imation of the partition function has been proposed of mutually recursive nonterminals, that is, for each by Etessami and Yannakakis (2009), based on New- A and B in such a subset, we have A ∗ uBα ton’s method for the solution of non-linear systems ⇒ and B ∗ vAβ, for some u, v, α, β. This corre- of equations. From a theoretical perspective, Kiefer ⇒ sponds to a strongly connected component if we et al. (2007) have shown that, after a certain number see the connection between the left-hand side of a of initial iterations, Newton’s method adds a fixed rule and a nonterminal member in its right-hand side number of bits to the precision of the approximated as an edge in a directed graph. For each strongly solution, even in the above mentioned cases in which connected component, there is a separate system of the fixed-point iteration method shows exponential equations of the form X = g(X). Such systems can time behaviour. However, these authors also show be solved one by one, in a bottom-up order. That that, in some degenerate cases, the number of itera- is, if one strongly connected component contains tions needed to compute the first bit of the solution nonterminal A, and another contains nonterminal B, can be at least exponential in the size of the system. where A ∗ uBα for some u, α, then the system for Experiments with Newton’s method for the ap- ⇒ the latter component must be solved first. proximation of the partition functions of PCFGs The solution for a system of equations such as have been carried out in several application-oriented those described above can be irrational and non- settings, by Wojtczak and Etessami (2007) and by expressible by radicals, even if we assume that all Nederhof and Satta (2008), showing considerable the probabilities of the rules in the input PCFG are improvements over the fixed-point iteration method. rational numbers, as observed by Etessami and Yan- nakakis (2009). Nonetheless, the partition function 3 Intersection of PCFG and FA can still be approximated to any degree of preci- sion by iterative computation of the relation in (4), It was shown by Bar-Hillel et al. (1964) that context- as done for instance by Stolcke (1995) and by Ab- free languages are closed under intersection with ney et al. (1999). This corresponds to the so-called regular languages. Their proof relied on the con- fixed-point iteration method, which is well-known struction of a new CFG out of an input CFG and in the numerical calculus literature and is frequently an input finite automaton. Here we extend that con- applied to systems of non-linear equations because struction by letting the input grammar be a proba- it can be easily implemented. bilistic CFG. We refer the reader to (Nederhof and When a number of standard conditions are met, Satta, 2003) for more details. each iteration of (4) adds a fixed number of bits To avoid a number of technical complications, we to the precision of the solution; see Kelley (1995, assume the finite automaton has no epsilon transi- Chapter 4). Since each iteration can easily be im- tions, and has only one final state. In the context plemented to run in polynomial time, this means of our use of this construction in the following sec- that we can approximate the partition function of a tions, these restrictions are without loss of general- ity. Thus, a finite automaton (FA) is represented PCFG in polynomial time in the size of the PCFG M itself and in the number of bits of the desired preci- by a 5-tuple (Σ, Q, q0, qf , ∆), where Σ and Q are sion. two finite sets of terminals and states, respectively, In practical applications where large PCFGs are q0 is the initial state, qf is the final state, and ∆ is a empirically estimated from data sets, the standard a finite set of transitions, each of the form s t, 7→ conditions mentioned above for the polynomial time where s, t Q and a Σ. ∈ ∈ approximation of the partition function are usually A complete computation of accepting string M 1216 w = a1 an is a sequence c = τ1 τn of tran- or more succinctly: ··· ai ··· sitions such that τi = (si 1 si) for each i (1 − 7→ ≤ i n), for some s , s , . . . , s with s = q and p0(w) = p(w). ≤ 0 1 n 0 0 s = q . The language of all strings accepted by w w L( ) n f M X ∈XM is denoted by L( ). A FA is unambiguous if at M Note that the above construction of is exponen- most one complete computation exists for each ac- G0 cepted string. A FA is deterministic if there is at tial in the largest value of m in any rule from . For a G most one transition s t for each s and a. this reason, is usually brought in binary form be- 7→ G For a FA as above and a PCFG = (Σ, N, S, fore the intersection, i.e. the input grammar is trans- M G R, p) with the same set of terminals, we construct formed to let each right-hand side have at most two a new PCFG = (Σ,N ,S ,R , p ), where N = members. Such a transformation can be realized in G0 0 0 0 0 0 Q (Σ N) Q, S = (q , S, q ), and R is the set linear time in the size of the grammar. We will return × ∪ × 0 0 f 0 of rules that is obtained as follows. to this issue in Section 7.

For each A X X in R and each se- 4 Obtaining unambiguous FAs • → 1 ··· m quence s , . . . , s with s Q, 0 i m, 0 m i ∈ ≤ ≤ In the previous section, we explained that unambigu- and m 0, let (s , A, s ) (s ,X , s ) ≥ 0 m → 0 1 1 ··· ous finite automata have special properties with re- (sm 1,Xm, sm) be in R0; if m = 0, the new spect to the grammar that we may construct out − G0 rule is of the form (s0, A, s0) . Function p0 of a FA and a PCFG . In this section we dis- → M G assigns the same probability to the new rule as cuss how unambiguity can be obtained for the spe- p assigned to the original rule. cial case of finite automata accepting the language a of all strings with given infix w Σ∗: For each s t in ∆, let (s, a, t) a be in R0. ∈ • 7→ → Function p0 assigns probability 1 to this rule. L (w) = xwy x, y Σ∗ . infix { | ∈ } Intuitively, a rule of is either constructed out of G0 Any deterministic automaton is also unambigu- a rule of or out of a transition of . On the basis G M ous. Furthermore, there seem to be no practical al- of this correspondence between rules and transitions gorithms that turn FAs into equivalent unambiguous of , and , it is not difficult to see that each G0 G M FAs other than the algorithms that also determinize derivation d in deriving string w corresponds to a 0 G0 them. Therefore, we will henceforth concentrate on unique derivation d in deriving the same string and G deterministic rather than unambiguous automata. a unique computation c in accepting the same M Given a string w = a1 an, a finite automaton string. Conversely, if there is a derivation d in ··· G accepting Linfix (w) can be straightforwardly con- deriving string w, and some computation c in M structed. This automaton has states s0, . . . , sn, tran- accepting the same string, then the pair of d and c a a sitions s0 s0 and sn sn for each a Σ, and corresponds to a unique derivation d0 in 0 deriving 7→ ai 7→ ∈ G transition si 1 si for each i (1 i n). The the same string w. Furthermore, the probabilities of − 7→ ≤ ≤ initial state is s0 and the final state is sn. Clearly, d and d0 are equal, by definition of p0. there is nondeterminism in state s0. w Let us now assume that each string is accepted One way to make this automaton deterministic is by at most one computation, i.e. is unambigu- M to apply the general algorithm of determinization of ous. If a string w is accepted by , then there are M finite automata; see e.g. (Aho and Ullman, 1972). as many derivations deriving w in as there are in G0 This algorithm is exponential for general FAs. An . If w is not accepted by , then there are zero G M alternative approach is to construct a deterministic derivations deriving w in . Consequently: G0 finite automaton directly from w, in line with the Knuth-Morris-Pratt algorithm (Knuth et al., 1977; p (d ) = p(d), 0 0 Gusfield, 1997). Both approaches result in the same dX0,w: Xd,w: deterministic FA, which we denote by w. However, d d0 I S0 w S w w L( ) the latter approach is easier to implement in such a ⇒G0 ⇒G ∧ ∈ M 1217 way that the time complexity of constructing the au- 6 Extensions tomaton is linear in w . | | The approach discussed above allows for a number The automaton is described as follows. There Iw of generalizations. First, we can replace the infix w are n + 1 states t0, . . . , tn, where as before n is by a sequence of infixes w1, ..., wm, which have to the length of w. The initial state is t and the final 0 occur in the given order, one strictly after the other, state is t . The intuition is that reads a string n Iw with arbitrary infixes in between: x = b b from left to right, and when it has 1 ··· m read the prefix b1 bj (0 j m), it is in state σ (w , . . . , w , ) = ··· ≤ ≤ island 1 m G ti (0 i < n) if and only if a1 ai is the longest ≤ ··· p(x0w1x1 wmxm). prefix of w that is also a suffix of b1 bj. If the ··· ··· x0,...,xm Σ∗ automaton is in state tn, then this means that w is an X∈ infix of b b . This problem was discussed before by (Corazza et 1 ··· j In more detail, for each i (1 i n) and each al., 1991), who mentioned applications in speech ≤ a ≤ a Σ, there is a transition ti 1 tj, where j is recognition. Further applications are found in com- ∈ − 7→ the length of the longest string that is both a prefix putational biology, but their discussion is beyond the of w and a suffix of a1 ai 1a. If a = ai, then scope of this paper; see for instance (Apostolico et ··· − clearly j = i, and otherwise j < i. To ensure that we al., 2005) and references therein. In order to solve remain in the final state once an occurrence of infix the problem, we only need a small addition to the a w has been found, we also add transitions t t procedures we discussed before. First we construct n 7→ n for each a Σ. This construction is illustrated in separate automata wj (1 j m) as explained in ∈ I ≤ ≤ Figure 1. Section 4. These automata are then composed into a single automaton . In this composition, I(w1,...,wm) 5 Infix probability the outgoing transitions of the final state of , for Iwj each j (1 j < m), are removed and that final state With the material developed in the previous sections, ≤ is merged with the initial state of the next automaton the problem of computing the infix probabilities can . The initial state of the composed automaton be effectively solved. Our goal is to compute for Iwj+1 is the initial state of w1 , and the final state is the given infix w Σ∗ and PCFG = (Σ, N, S, R, I ∈ G final state of w . The time costs of constructing p): I m are linear in the sum of the lengths of the I(w1,...,wm) strings w . σ (w, ) = p(z). j infix G Another way to generalize the problem is to re- z L (w) ∈ Xinfix place w by a finite set L = w , . . . , w . The ob- { 1 m} In Section 4 we have shown the construction of finite jective is to compute: automaton w accepting Linfix (w), by which we ob- I σinfix (L, ) = p(xwy) tain: G w L,x,y Σ ∈ X∈ ∗ σ (w, ) = p(z). Again, this can be solved by first constructing a de- infix G z L( w) terministic FA, which is then intersected with . ∈XI G This FA can be obtained by determinizing a straight- As is deterministic and therefore unambiguous, Iw forward nondeterministic FA accepting L, or by di- the results from Section 3 apply and if = (Σ,N , G0 0 rectly constructing a deterministic FA along the lines S ,R , p ) is the PCFG constructed out of and 0 0 0 G Iw of the Aho-Corasick algorithm (Aho and Corasick, then: 1975). Construction of the automaton with the latter approach takes linear time. σ (w, ) = p0(z). infix G Further straightforward generalizations involve z X formalisms such as probabilistic tree adjoining Finally, we can compute the above sum by applying grammars (Schabes, 1992; Resnik, 1992). The tech- the iterative method discussed in Section 2. nique from Section 3 is also applicable in this case, 1218 a, b, c b, c c a b

a b a c t0 t1 t2 t3 t4

b, c a

Figure 1: Deterministic automaton that accepts all strings over alphabet a, b, c with infix abac. { } as the construction from Bar-Hillel et al. (1964) car- the avoidance of matrix inversion, which sometimes ries over from context-free grammars to tree ad- makes Newton’s method prohibitively expensive. In joining grammars, and more generally to the linear our experiments, Broyden’s method was generally context-free rewriting systems of Vijay-Shanker et faster than Newton’s method and much faster than al. (1987). the simple iteration method by the relation in (4). For further details on Broyden’s method, we refer 7 Implementation the reader to (Kelley, 1995). We have conducted experiments with the computa- The main obstacle to computation for infixes sub- tion of infix probabilities. The objective was to iden- stantially longer than 7 symbols is the memory con- tify parts of the computation that have a high time sumption rather than the running time. This is due or space demand, and that might be improved. The to the required square matrices, the dimension of experiments were run on a desktop with a 3.0 GHz which is the number of nonterminals. The number Pentium 4 processor. The implementation language of nonterminals (of the intersection grammar) natu- is C++. rally grows as the infix becomes longer. The set-up of the experiments is similar to that in As explained in Section 2, the problem is divided (Nederhof and Satta, 2008). A probabilistic context- into smaller problems by isolating disjoint sets of free grammar was extracted from sections 2-21 of mutually recursive nonterminals, or strongly con- the Penn Treebank version II. Subtrees that gener- nected components. We found that for the applica- ated the empty string were systematically removed. tion to the automata discussed in Section 4, there The result was a CFG with 10,035 rules, 28 nonter- were exactly three strongly connected components minals and 36 parts-of-speech. The rule probabili- that contained more than one element, throughout ties were determined by maximum likelihood esti- the experiments. For an infix of length n, these com- mation. The grammar was subsequently binarized, ponents are: to avoid exponential behaviour, as explained in Sec- C , which consists of nonterminals of the form tion 3. • 1 (t , A, t ), where i < n and j < n, We have considered 10 strings of length 7, ran- i j domly generated, assuming each of the parts-of- C2, which consists of nonterminals of the form speech has the same probability. For all prefixes of • (ti, A, tj), where i = j = n, and those strings from length 2 to length 7, we then com- puted the infix probability. The duration of the full C , which consists of nonterminals of the form • 3 computation, averaged over the 10 strings of length (ti, A, tj), where i < j = n. 7, is given in the first row of Table 1. In order to solve the non-linear systems of equa- This can be easily explained by looking at the struc- tions, we used Broyden’s method. It can be seen ture of our automata. See for example Figure 1, with as an approximation of Newton’s method. It re- cycles running through states t0, . . . , tn 1, and cy- − quires more iterations, but seems to be faster over- cles through state tn. Furthermore, the grammar ex- all, and more scalable to large problem sizes, due to tracted from the Penn Treebank is heavily recursive, 1219 infix length 2 3 4 5 6 7 total running time 1.07 1.95 5.84 11.38 23.93 45.91 Broyden’s method for C1 0.46 0.90 3.42 6.63 12.91 24.38 Broyden’s method for C2 0.08 0.04 0.07 0.04 0.03 0.09 Broyden’s method for C3 0.20 0.36 0.81 1.74 5.30 9.02

Table 1: Running time for infixes from length 2 to length 7. The infixes are prefixes of 10 random strings of length 7, and reported CPU times (in seconds) are averaged over the 10 strings. so that almost every nonterminal can directly or in- infix probability of a string w and have kept inter- directly call any other. mediate results in memory. Can the computation of

The strongly connected component C2 is always the infix probability of a string of the form aw or wa, a Σ, be computed by relying on the existing re- the same, consisting of 2402 nonterminals, for each ∈ infix of any length. (Note that binarization of the sults, so that the computation is substantially faster grammar introduced artificial nonterminals.) The than if the computation were done from scratch? last three rows of Table 1 present the time costs of Our investigations so far have not found a posi- Broyden’s method, for the three strongly connected tive answer to this second question. In particular, components. the systems of equations for C1 and C3 change fun- damentally if the infix is extended by one more sym- The strongly connected component C3 happens to correspond to a linear system of equations. This is bol, which seems to at least make incremental com- because a rule in the intersection grammar with a putation very difficult, if not impossible. Note that the algorithms for the computation of prefix prob- left-hand side (ti, A, tj), where i < j = n, must abilities by Jelinek and Lafferty (1991) and Stolcke have a right-hand side of the form (ti,A0, tj), or of (1995) do allow incrementality, which contributes to the form (ti,A1, tk)(tk,A2, tj), with k n. If k < ≤ their practical usefulness for speech recognition. n, then only the second member can be in C3. If k = n, only first member can be in C3. Hence, such a rule corresponds to a linear equation within 8 Conclusions the system of equations for the entire grammar. We have shown that the problem of infix probabili- A linear system of equations can be solved an- ties for PCFGs can be solved by a construction that alytically, for example by Gaussian elimination, intersects a context-free language with a regular lan- rather than approximated through Newton’s method guage. An important constraint is that the finite or Broyden’s method. This means that the running automaton that is input to this construction be un- times in the last row of Table 1 can be reduced by ambiguous. We have shown that such an automa- treating C3 differently from the other strongly con- ton can be efficiently constructed. Once the input nected components. However, the running time for probabilistic PCFG and the FA have been combined C1 dominates the total time consumption. into a new probabilistic CFG, the infix probability The above investigations were motivated by two can be straightforwardly solved by iterative algo- questions, namely whether any part of the computa- rithms. Such algorithms include Newton’s method, tion can be precomputed, and second, whether infix and Broyden’s method, which was used in our exper- probabilities can be computed incrementally, for in- iments. Our discussion ended with an open question fixes that are extended to the left or to the right. The about the possibility of incremental computation of first question can be answered affirmatively for C2, infix probabilities. as it is always the same. However, as we can see in Table 1, the computation of C2 amounts to a small portion of the total time consumption. References The second question can be rephrased more pre- S. Abney, D. McAllester, and F. Pereira. 1999. Relating cisely as follows. Suppose we have computed the probabilistic grammars and automata. In 37th Annual 1220 Meeting of the Association for Computational Linguis- F. Jelinek and J.D. Lafferty. 1991. Computation of the tics, Proceedings of the Conference, pages 542–549, probability of initial substring generation by stochas- Maryland, USA, June. tic context-free grammars. Computational Linguistics, A.V. Aho and M.J. Corasick. 1975. Efficient string 17(3):315–323. matching: an aid to bibliographic search. Communi- C.T. Kelley. 1995. Iterative Methods for Linear and cations of the ACM, 18(6):333–340, June. Nonlinear Equations. Society for Industrial and Ap- A.V. Aho and J.D. Ullman. 1972. Parsing, volume 1 plied Mathematics, Philadelphia, PA. of The Theory of Parsing, Translation and Compiling. S. Kiefer, M. Luttenberger, and J. Esparza. 2007. On the Prentice-Hall, Englewood Cliffs, N.J. convergence of Newton’s method for monotone sys- tems of polynomial equations. In Proceedings of the A. Apostolico, M. Comin, and L. Parida. 2005. Con- 39th ACM Symposium on Theory of Computing, pages servative extraction of overrepresented extensible mo- 217–266. tifs. In Proceedings of Intelligent Systems for Molecu- D.E. Knuth, J.H. Morris, Jr., and V.R. Pratt. 1977. Fast lar Biology (ISMB05). pattern matching in strings. SIAM Journal on Comput- Y. Bar-Hillel, M. Perles, and E. Shamir. 1964. On formal ing, 6:323–350. properties of simple phrase structure grammars. In M.-J. Nederhof and G. Satta. 2003. Probabilistic pars- Y. Bar-Hillel, editor, Language and Information: Se- ing as intersection. In 8th International Workshop on lected Essays on their Theory and Application, chap- Parsing Technologies, pages 137–148, LORIA, Nancy, ter 9, pages 116–150. Addison-Wesley, Reading, Mas- France, April. sachusetts. M.-J. Nederhof and G. Satta. 2008. Computing parti- T.L. Booth and R.A. Thompson. 1973. Applying prob- tion functions of PCFGs. Research on Language and abilistic measures to abstract languages. IEEE Trans- Computation, 6(2):139–162. actions on Computers, C-22:442–450. E. Persoon and K.S. Fu. 1975. Sequential classification Z. Chi and S. Geman. 1998. Estimation of probabilis- of strings generated by SCFG’s. International Journal tic context-free grammars. Computational Linguistics, of Computer and Information Sciences, 4(3):205–217. 24(2):299–305. P. Resnik. 1992. Probabilistic tree-adjoining grammar as Z. Chi. 1999. Statistical properties of probabilistic a framework for statistical natural language process- context-free grammars. Computational Linguistics, ing. In Proc. of the fifteenth International Conference 25(1):131–160. on Computational Linguistics, Nantes, August, pages A. Corazza, R. De Mori, R. Gretter, and G. Satta. 418–424. 1991. Computation of probabilities for an island- J.-A. Sanchez´ and J.-M. Bened´ı. 1997. Consistency driven parser. IEEE Transactions on Pattern Analysis of stochastic context-free grammars from probabilis- and Machine Intelligence, 13(9):936–950. tic estimation based on growth transformations. IEEE K. Etessami and M. Yannakakis. 2005. Recursive Transactions on Pattern Analysis and Machine Intelli- Markov chains, stochastic grammars, and monotone gence, 19(9):1052–1055, September. systems of nonlinear equations. In 22nd International Y. Schabes. 1992. Stochastic lexicalized tree-adjoining Symposium on Theoretical Aspects of Computer Sci- grammars. In Proc. of the fifteenth International Con- ence, volume 3404 of Lecture Notes in Computer Sci- ference on Computational Linguistics, Nantes, Au- ence, pages 340–352, Stuttgart, Germany. Springer- gust, pages 426–432. Verlag. A. Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. K. Etessami and M. Yannakakis. 2009. Recursive Computational Linguistics, 21(2):167–201. Markov chains, stochastic grammars, and monotone K. Vijay-Shanker, D.J. Weir, and A.K. Joshi. 1987. systems of nonlinear equations. Journal of the ACM, Characterizing structural descriptions produced by 56(1):1–66. various grammatical formalisms. In 25th Annual A.L.N. Fred. 2000. Computation of substring proba- Meeting of the Association for Computational Linguis- bilities in stochastic grammars. In A. Oliveira, edi- tics, Proceedings of the Conference, pages 104–111, tor, Grammatical Inference: Algorithms and Applica- Stanford, California, USA, July. tions, volume 1891 of Lecture Notes in Artificial Intel- D. Wojtczak and K. Etessami. 2007. PReMo: an an- ligence, pages 103–114. Springer-Verlag. alyzer for Probabilistic Recursive Models. In Tools D. Gusfield. 1997. Algorithms on Strings, Trees, and and Algorithms for the Construction and Analysis of Sequences. Cambridge University Press, Cambridge. Systems, 13th International Conference, volume 4424 T.E. Harris. 1963. The Theory of Branching Processes. of Lecture Notes in Computer Science, pages 66–71, Springer-Verlag, Berlin, Germany. Braga, Portugal. Springer-Verlag. 1221