Computation of Infix Probabilities for Probabilistic Context-Free Grammars
Mark-Jan Nederhof
School of Computer Science, University of St Andrews, United Kingdom
[email protected]

Giorgio Satta
Dept. of Information Engineering, University of Padua, Italy
[email protected]

Abstract

The notion of infix probability has been introduced in the literature as a generalization of the notion of prefix (or initial substring) probability, motivated by applications in speech recognition and word error correction. For the case where a probabilistic context-free grammar is used as language model, methods for the computation of infix probabilities have been presented in the literature, based on various simplifying assumptions. Here we present a solution that applies to the problem in its full generality.

1 Introduction

Probabilistic context-free grammars (PCFGs for short) are a statistical model widely used in natural language processing. Several computational problems related to PCFGs have been investigated in the literature, motivated by applications in modeling of natural language syntax. One such problem is the computation of prefix probabilities for PCFGs, where we are given as input a PCFG G and a string w, and we are asked to compute the probability that a sentence generated by G starts with w, that is, has w as a prefix. This quantity is defined as the possibly infinite sum of the probabilities of all strings of the form wx, for any string x over the alphabet of G.

The problem of computation of prefix probabilities for PCFGs was first formulated by Persoon and Fu (1975). Efficient algorithms for its solution have been proposed by Jelinek and Lafferty (1991) and Stolcke (1995). Prefix probabilities can be used to compute probability distributions for the next word or part-of-speech, when a prefix of the input has already been processed, as discussed by Jelinek and Lafferty (1991). Such distributions are useful for speech recognition, where the result of the acoustic processor is represented as a lattice, and local choices must be made for a next transition. In addition, distributions for the next word are also useful for applications of word error correction, when one is processing 'noisy' text and the parser recognizes an error that must be recovered by operations of insertion, replacement or deletion.

Motivated by the above applications, the problem of the computation of infix probabilities for PCFGs has been introduced in the literature as a generalization of the prefix probability problem. We are now given a PCFG G and a string w, and we are asked to compute the probability that a sentence generated by G has w as an infix. This probability is defined as the possibly infinite sum of the probabilities of all strings of the form xwy, for any pair of strings x and y over the alphabet of G. Besides applications in the computation of the probability distribution for the next word token and in word error correction, infix probabilities can also be exploited in speech understanding systems to score partial hypotheses in algorithms based on beam search, as discussed by Corazza et al. (1991).
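For clarity, the two quantities just introduced can be restated in summation form. The display below is not part of the original paper; P_prefix and P_infix are convenient, non-original names, p(v) denotes the probability that G generates the sentence v, and Σ is the alphabet of G.

\[
P_{\mathrm{prefix}}(w) \;=\; \sum_{x \in \Sigma^{*}} p(wx),
\qquad
P_{\mathrm{infix}}(w) \;=\; \sum_{\substack{v \in \Sigma^{*} \\ v = xwy \ \text{for some}\ x, y \in \Sigma^{*}}} p(v).
\]

Note that the infix sum ranges over strings rather than over pairs (x, y): a string in which w occurs more than once contributes its probability only once, which is precisely the source of the difficulty discussed below.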
Corazza et al. (1991) have pointed out that the computation of infix probabilities is more difficult than the computation of prefix probabilities, due to the added ambiguity that several occurrences of the given infix can be found in a single string generated by the PCFG. The authors developed solutions for the case where some distribution can be defined on the distance of the infix from the sentence boundaries, which is a simplifying assumption. The problem is also considered by Fred (2000), which provides algorithms for the case where the language model is a probabilistic regular grammar. However, the algorithm in (Fred, 2000) does not apply to cases with multiple occurrences of the given infix within a string in the language, which is what was pointed out to be the problematic case.

In this paper we adopt a novel approach to the problem of computation of infix probabilities, by removing the ambiguity that would be caused by multiple occurrences of the given infix. Although our result is obtained by a combination of well-known techniques from the literature on PCFG parsing and pattern matching, as far as we know this is the first algorithm for the computation of infix probabilities that works for general PCFG models without any restrictive assumption.

The remainder of this paper is structured as follows. In Section 2 we explain how the sum of the probabilities of all trees generated by a PCFG can be computed as the least fixed-point solution of a non-linear system of equations. In Section 3 we recall the construction of a new PCFG out of a given PCFG and a given finite automaton, such that the language generated by the new grammar is the intersection of the languages generated by the given PCFG and the automaton, and the probabilities of the generated strings are preserved. In Section 4 we show how one can efficiently construct an unambiguous finite automaton that accepts all strings with a given infix. The material from these three sections is combined into a new algorithm in Section 5, which allows computation of the infix probability for PCFGs. This is the main result of this paper. Several extensions of the basic technique are discussed in Section 6. Section 7 discusses implementation and some experiments.
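Section 4 of the paper constructs an unambiguous finite automaton accepting all strings with a given infix. Purely as an illustration, the sketch below shows one standard way to obtain such an automaton: a deterministic automaton for the language Σ* w Σ* built from the Knuth-Morris-Pratt failure function, where determinism guarantees unambiguity because every input has exactly one run. This code is not taken from the paper, whose own construction may differ in detail; the function name infix_dfa and the table-based encoding are illustrative assumptions.

def infix_dfa(w, alphabet):
    """Build a DFA accepting all strings over `alphabet` containing `w` as an infix.

    State q (0 <= q <= len(w)) records the length of the longest prefix of `w`
    that is a suffix of the input read so far; state len(w) is accepting and
    absorbing.  Being deterministic, the automaton is unambiguous.
    """
    m = len(w)
    # Knuth-Morris-Pratt failure function: fail[i] is the length of the longest
    # proper prefix of w[:i] that is also a suffix of w[:i].
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and w[i] != w[k]:
            k = fail[k]
        if w[i] == w[k]:
            k += 1
        fail[i + 1] = k
    # Transition table: delta[q][a] is the state reached from q on symbol a.
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            if q == m:
                delta[q][a] = m          # once w has been seen, stay accepting
            else:
                k = q
                while k > 0 and a != w[k]:
                    k = fail[k]
                delta[q][a] = k + 1 if a == w[k] else 0
    return delta, 0, m                   # transitions, start state, accepting state

# Illustrative use: the automaton for the infix "ab" over {a, b} accepts "bab".
delta, start, accept = infix_dfa("ab", "ab")
state = start
for symbol in "bab":
    state = delta[state][symbol]
print(state == accept)                   # True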
2 Sum of probabilities of all derivations

Assume a probabilistic context-free grammar G, represented by a 5-tuple (Σ, N, S, R, p), where Σ and N are two finite disjoint sets of terminals and nonterminals, respectively, S ∈ N is the start symbol, R is a finite set of rules, each of the form A → α, where A ∈ N and α ∈ (Σ ∪ N)*, and p is a function from rules in R to real numbers in the interval [0, 1].

The concept of left-most derivation in one step is represented by the notation α ⇒^π_G β, which means that the left-most occurrence of any nonterminal in α ∈ (Σ ∪ N)* is rewritten by means of some rule π ∈ R. If the rewritten nonterminal is A, then π must be of the form (A → γ) and β is the result of replacing the occurrence of A in α by γ. A left-most derivation with any number of steps, using a sequence d of rules, is denoted as α ⇒^d_G β. We omit the subscript G when the PCFG is understood. We also write α ⇒* β when the involved sequence of rules is of no relevance. Henceforth, all derivations we discuss are implicitly left-most.

A complete derivation is either the empty sequence of rules, or a sequence d = π_1 ··· π_m, m ≥ 1, of rules such that A ⇒^d w for some A ∈ N and w ∈ Σ*. In the latter case, we say the complete derivation starts with A; in the former case, with an empty sequence of rules, we assume the complete derivation starts and ends with a single terminal, which is left unspecified. It is well known that there exists a bijective correspondence between left-most complete derivations starting with nonterminal A and parse trees derived by the grammar with root A and a yield composed of terminal symbols only.

The depth of a complete derivation d is the length of the longest path from the root to a leaf in the parse tree associated with d. The length of a path is defined as the number of nodes it visits. Thus if d = π for some rule π = (A → w) with w ∈ Σ*, then the depth of d is 2.

The probability p(d) of a complete derivation d = π_1 ··· π_m, m ≥ 1, is:

\[ p(d) = \prod_{i=1}^{m} p(\pi_i). \]

We also assume that p(d) = 1 when d is an empty sequence of rules. The probability p(w) of a string w is the sum of the probabilities of all complete derivations deriving that string from the start symbol:

\[ p(w) = \sum_{d \,:\, S \stackrel{d}{\Rightarrow} w} p(d). \]

With this notation, consistency of a PCFG is defined as the condition:

\[ \sum_{d, w \,:\, S \stackrel{d}{\Rightarrow} w} p(d) = 1. \]

In other words, a PCFG is consistent if the sum of probabilities of all complete derivations starting with S is 1. An equivalent definition of consistency considers the sum of probabilities of all strings:

\[ \sum_{w} p(w) = 1. \]   (1)

See (Booth and Thompson, 1973) for further discussion.

In practice, PCFGs are often required to satisfy the additional condition:

\[ \sum_{\pi = (A \to \alpha) \in R} p(\pi) = 1, \]

for every A ∈ N.

Assume the nonterminals are numbered as N = {A_1, ..., A_|N|}, that the rules with left-hand side A_k are A_k → α_{k,i} for 1 ≤ i ≤ m_k, and that f(A_j, α_{k,i}) denotes the number of occurrences of A_j in α_{k,i}. For each k with 1 ≤ k ≤ |N|, the generating function associated with A_k is defined as

\[ g_{A_k}(z_1, z_2, \ldots, z_{|N|}) = \sum_{i=1}^{m_k} p(A_k \to \alpha_{k,i}) \cdot \prod_{j=1}^{|N|} z_j^{f(A_j, \alpha_{k,i})}. \]   (2)

Furthermore, for each i ≥ 1 we recursively define functions g^{(i)}_{A_k}(z_1, z_2, ..., z_{|N|}) by

\[ g^{(1)}_{A_k}(z_1, z_2, \ldots, z_{|N|}) = g_{A_k}(z_1, z_2, \ldots, z_{|N|}), \]   (3)

and, for i ≥ 2, by

\[ g^{(i)}_{A_k}(z_1, z_2, \ldots, z_{|N|}) = g_{A_k}\big( g^{(i-1)}_{A_1}(z_1, z_2, \ldots, z_{|N|}),\; g^{(i-1)}_{A_2}(z_1, z_2, \ldots, z_{|N|}),\; \ldots,\; g^{(i-1)}_{A_{|N|}}(z_1, z_2, \ldots, z_{|N|}) \big). \]   (4)
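To make the use of the generating functions above concrete, the sketch below evaluates the iterated functions at z_1 = ... = z_|N| = 0, which amounts to iterating the map Z_k ← g_{A_k}(Z_1, ..., Z_|N|) starting from zero. The iterates increase monotonically towards the least fixed point of this non-linear system, i.e. towards the sum of probabilities of all complete derivations starting with each nonterminal, so consistency can be checked by inspecting the value obtained for S. This code is not from the paper; the grammar encoding, the name derivation_mass, and the toy grammar at the end are illustrative assumptions.

def derivation_mass(rules, nonterminals, max_iter=10000, tol=1e-12):
    """Least fixed-point iteration for the system Z_k = g_{A_k}(Z_1, ..., Z_{|N|}).

    `rules` maps each nonterminal A to a list of pairs (prob, rhs), where `rhs`
    is a tuple of symbols; symbols not in `nonterminals` are terminals.  The
    result maps each nonterminal A to (an approximation of) the sum of
    probabilities of all complete derivations starting with A.
    """
    Z = {A: 0.0 for A in nonterminals}           # iteration starts from zero
    for _ in range(max_iter):
        new_Z = {}
        for A in nonterminals:
            total = 0.0
            for prob, rhs in rules[A]:
                term = prob
                for symbol in rhs:               # one factor per nonterminal occurrence
                    if symbol in nonterminals:
                        term *= Z[symbol]
                total += term
            new_Z[A] = total
        if all(abs(new_Z[A] - Z[A]) < tol for A in nonterminals):
            return new_Z
        Z = new_Z
    return Z

# Toy (non-original) grammar: S -> S S with probability 0.6, S -> a with 0.4.
# The least fixed point is 0.4 / 0.6 = 2/3 < 1, so this PCFG is not consistent.
rules = {"S": [(0.6, ("S", "S")), (0.4, ("a",))]}
print(derivation_mass(rules, {"S"})["S"])        # approximately 0.6667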