
Bayesian Inference for PCFGs via Markov chain Monte Carlo

Mark Johnson, Cognitive and Linguistic Sciences, Brown University, Mark [email protected]
Thomas L. Griffiths, Department of Psychology, University of California, Berkeley, Tom [email protected]
Sharon Goldwater, Department of Linguistics, Stanford University, [email protected]

Abstract

This paper presents two Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference of probabilistic context-free grammars (PCFGs) from terminal strings, providing an alternative to maximum-likelihood estimation using the Inside-Outside algorithm. We illustrate these methods by estimating a sparse grammar describing the morphology of the Bantu language Sesotho, demonstrating that with suitable priors Bayesian techniques can infer linguistic structure in situations where maximum likelihood methods such as the Inside-Outside algorithm only produce a trivial grammar.

1 Introduction

The standard methods for inferring the parameters of probabilistic models in computational linguistics are based on the principle of maximum-likelihood estimation; for example, the parameters of Probabilistic Context-Free Grammars (PCFGs) are typically estimated from strings of terminals using the Inside-Outside (IO) algorithm, an instance of the Expectation Maximization (EM) procedure (Lari and Young, 1990). However, much recent work in machine learning and statistics has turned away from maximum-likelihood in favor of Bayesian methods, and there is increasing interest in Bayesian methods in computational linguistics as well (Finkel et al., 2006). This paper presents two Markov chain Monte Carlo (MCMC) algorithms for inferring PCFGs and their parses from strings alone. These can be viewed as Bayesian alternatives to the IO algorithm.

The goal of Bayesian inference is to compute a distribution over plausible parameter values. This "posterior" distribution is obtained by combining the likelihood with a "prior" distribution P(θ) over parameter values θ. In the case of PCFG inference θ is the vector of rule probabilities, and the prior might assert a preference for a sparse grammar (see below). The posterior probability of each value of θ is given by Bayes' rule:

P(θ|D) ∝ P(D|θ)P(θ).     (1)

In principle Equation 1 defines the posterior probability of any value of θ, but computing this may not be tractable analytically or numerically. For this reason a variety of methods have been developed to support approximate Bayesian inference. One of the most popular methods is Markov chain Monte Carlo (MCMC), in which a Markov chain is used to sample from the posterior distribution.

This paper presents two new MCMC algorithms for inferring the posterior distribution over parses and rule probabilities given a corpus of strings. The first algorithm is a component-wise Gibbs sampler which is very similar in spirit to the EM algorithm, drawing parse trees conditioned on the current parameter values and then sampling the parameters conditioned on the current set of parse trees. The second algorithm is a component-wise Hastings sampler that "collapses" the probabilistic model, integrating over the rule probabilities of the PCFG, with the goal of speeding convergence. Both algorithms use an efficient dynamic programming technique to sample parse trees.

Given their usefulness in other disciplines, we believe that Bayesian methods like these are likely to be of general utility in computational linguistics as well. As a simple illustrative example, we use these methods to infer morphological parses for verbs from Sesotho, a southern Bantu language with agglutinating morphology.
Our results illustrate that Bayesian inference using a prior that favors sparsity can produce linguistically reasonable analyses in situations in which EM does not.

The rest of this paper is structured as follows. The next section introduces the background for our paper, summarizing the key ideas behind PCFGs, Bayesian inference, and MCMC. Section 3 introduces our first MCMC algorithm, a Gibbs sampler for PCFGs. Section 4 describes an algorithm for sampling trees from the distribution over trees defined by a PCFG. Section 5 shows how to integrate out the rule weight parameters θ in a PCFG, allowing us to sample directly from the posterior distribution over parses for a corpus of strings. Finally, Section 6 illustrates these methods in learning Sesotho morphology.

2 Background

2.1 Probabilistic context-free grammars

Let G = (T, N, S, R) be a Context-Free Grammar in Chomsky normal form with no useless productions, where T is a finite set of terminal symbols, N is a finite set of nonterminal symbols (disjoint from T), S ∈ N is a distinguished nonterminal called the start symbol, and R is a finite set of productions of the form A → B C or A → w, where A, B, C ∈ N and w ∈ T. In what follows we use β as a variable ranging over (N × N) ∪ T.

A Probabilistic Context-Free Grammar (G, θ) is a pair consisting of a context-free grammar G and a real-valued vector θ of length |R| indexed by productions, where θ_{A→β} is the production probability associated with the production A → β ∈ R. We require that θ_{A→β} ≥ 0 and that for all nonterminals A ∈ N, Σ_{A→β ∈ R} θ_{A→β} = 1. A PCFG (G, θ) defines a probability distribution over trees t as follows:

P_G(t|θ) = ∏_{r ∈ R} θ_r^{f_r(t)}

where t is generated by G and f_r(t) is the number of times the production r = A → β ∈ R is used in the derivation of t. If G does not generate t, let P_G(t|θ) = 0. The yield y(t) of a parse tree t is the sequence of terminals labeling its leaves. The probability of a string w ∈ T⁺ of terminals is the sum of the probabilities of all trees with yield w, i.e.:

P_G(w|θ) = Σ_{t : y(t) = w} P_G(t|θ).

2.2 Bayesian inference for PCFGs

Given a corpus of strings w = (w_1, ..., w_n), where each w_i is a string of terminals generated by a known CFG G, we would like to be able to infer the production probabilities θ that best describe that corpus. Taking w to be our data, we can apply Bayes' rule (Equation 1) to obtain:

P(θ|w) ∝ P_G(w|θ)P(θ), where
P_G(w|θ) = ∏_{i=1}^{n} P_G(w_i|θ).

Using t to denote a sequence of parse trees for w, we can compute the joint posterior distribution over t and θ, and then marginalize over t, with P(θ|w) = Σ_t P(t, θ|w). The joint posterior distribution on t and θ is given by:

P(t, θ|w) ∝ P(w|t)P(t|θ)P(θ)
         = (∏_{i=1}^{n} P(w_i|t_i)P(t_i|θ)) P(θ)

with P(w_i|t_i) = 1 if y(t_i) = w_i, and 0 otherwise.
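To make the definitions above concrete, the following sketch computes P_G(t|θ) by counting the productions used in a derivation and multiplying the corresponding rule probabilities. The toy grammar, its rule probabilities, and the tuple encoding of trees are invented for illustration and are not taken from the paper.

```python
from collections import Counter

# Toy CNF PCFG (invented for illustration): rules map (lhs, rhs) -> probability.
# rhs is either a pair of nonterminals or a single terminal.
theta = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("we",)): 0.5,
    ("NP", ("them",)): 0.5,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("see",)): 1.0,
}

def productions(tree):
    """Yield the productions used in a derivation.

    A tree is (label, child1, child2) for binary rules or (label, terminal)
    for lexical rules, e.g. ("S", ("NP", "we"), ("VP", ...)).
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))                   # A -> w
    else:
        yield (label, tuple(c[0] for c in children))    # A -> B C
        for child in children:
            yield from productions(child)

def tree_prob(tree, theta):
    """P_G(t | theta) = prod over r of theta_r ** f_r(t); 0 if t uses a rule not in R."""
    counts = Counter(productions(tree))                 # f_r(t)
    prob = 1.0
    for rule, count in counts.items():
        prob *= theta.get(rule, 0.0) ** count
    return prob

t = ("S", ("NP", "we"), ("VP", ("V", "see"), ("NP", "them")))
print(tree_prob(t, theta))   # 0.25 = 1.0 * 0.5 * 1.0 * 1.0 * 0.5
```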
2.3 Dirichlet priors

The first step towards computing the posterior distribution is to define a prior on θ. We take P(θ) to be a product of Dirichlet distributions, with one distribution for each non-terminal A ∈ N. The prior is parameterized by a positive real valued vector α indexed by productions R, so each production probability θ_{A→β} has a corresponding Dirichlet parameter α_{A→β}. Let R_A be the set of productions in R with left-hand side A, and let θ_A and α_A refer to the component subvectors of θ and α respectively indexed by productions in R_A. The Dirichlet prior P_D(θ|α) is:

P_D(θ|α) = ∏_{A ∈ N} P_D(θ_A|α_A), where
P_D(θ_A|α_A) = (1 / C(α_A)) ∏_{r ∈ R_A} θ_r^{α_r − 1} and
C(α_A) = (∏_{r ∈ R_A} Γ(α_r)) / Γ(Σ_{r ∈ R_A} α_r)     (2)

where Γ is the generalized factorial function and C(α_A) is a normalization constant that does not depend on θ_A.

Dirichlet priors are useful because they are conjugate to the distribution over trees defined by a PCFG. This means that the posterior distribution on θ given a set of parse trees, P(θ|t, α), is also a Dirichlet distribution. Applying Bayes' rule,

P_G(θ|t, α) ∝ P_G(t|θ) P_D(θ|α)
           ∝ (∏_{r ∈ R} θ_r^{f_r(t)}) (∏_{r ∈ R} θ_r^{α_r − 1})
           = ∏_{r ∈ R} θ_r^{f_r(t) + α_r − 1}

which is a Dirichlet distribution with parameters f(t) + α, where f(t) is the vector of production counts in t indexed by r ∈ R. We can thus write:

P_G(θ|t, α) = P_D(θ|f(t) + α)

which makes it clear that the production counts combine directly with the parameters of the prior.

2.4 Markov chain Monte Carlo

Having defined a prior on θ, the posterior distribution over t and θ is fully determined by a corpus w. Unfortunately, computing the posterior probability of even a single choice of t and θ is intractable, as evaluating the normalizing constant for this distribution requires summing over all possible parses for the entire corpus and all sets of production probabilities. Nonetheless, it is possible to define algorithms that sample from this distribution using Markov chain Monte Carlo (MCMC).

MCMC algorithms construct a Markov chain whose states s ∈ S are the objects we wish to sample. The state space S is typically astronomically large — in our case, the state space includes all possible parses of the entire training corpus w — and the transition probabilities P(s′|s) are specified via a scheme guaranteed to converge to the desired distribution π(s) (in our case, the posterior distribution). We "run" the Markov chain (i.e., starting in initial state s_0, sample a state s_1 from P(s′|s_0), then sample state s_2 from P(s′|s_1), and so on), with the probability that the Markov chain is in a particular state, P(s_i), converging to π(s_i) as i → ∞.

After the chain has run long enough for it to approach its stationary distribution, the expectation E_π[f] of any function f(s) of the state s will be approximated by the average of that function over the set of sample states produced by the algorithm. For example, in our case, given samples (t_i, θ_i) for i = 1, ..., ℓ produced by an MCMC algorithm, we can estimate θ as

E_π[θ] ≈ (1/ℓ) Σ_{i=1}^{ℓ} θ_i.

The remainder of this paper presents two MCMC algorithms for PCFGs. Both algorithms proceed by setting the initial state of the Markov chain to a guess for (t, θ) and then sampling successive states using a particular transition matrix. The key difference between the two algorithms is the form of the transition matrix they assume.
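The conjugacy result and the Monte Carlo averaging step can be exercised directly in a few lines. The sketch below draws θ_A from P_D(θ_A | f_A(t) + α_A) for a single nonterminal using NumPy's Dirichlet sampler and then averages the draws; the counts and α values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_theta_A(counts, alpha, rng):
    """Draw theta_A ~ Dirichlet(f_A(t) + alpha_A) for one nonterminal A."""
    params = np.asarray(counts, dtype=float) + np.asarray(alpha, dtype=float)
    return rng.dirichlet(params)

# Invented example: nonterminal A has three productions with counts f_A(t)
# and a sparse prior alpha_A.
f_A = [12, 0, 3]
alpha_A = [0.1, 0.1, 0.1]

# Approximate E_pi[theta_A] by averaging ell samples, as in the text.
# (Here the posterior is available in closed form, so this only illustrates
# the averaging step that is applied to MCMC output.)
ell = 1000
samples = np.stack([sample_theta_A(f_A, alpha_A, rng) for _ in range(ell)])
print(samples.mean(axis=0))
# Close to the analytic posterior mean (f_A + alpha_A) / sum(f_A + alpha_A).
```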

3 A Gibbs sampler for P(t, θ|w, α)

The Gibbs sampler (Geman and Geman, 1984) is one of the simplest MCMC methods, in which transitions between states of the Markov chain result from sampling each component of the state conditioned on the current value of all other variables. In our case, this means alternating between sampling from two distributions:

P(t|θ, w, α) = ∏_{i=1}^{n} P(t_i|w_i, θ), and
P(θ|t, w, α) = P_D(θ|f(t) + α)
             = ∏_{A ∈ N} P_D(θ_A|f_A(t) + α_A).

Thus every two steps we generate a new sample of t and θ. This alternation between parsing and updating θ is reminiscent of the EM algorithm, with the Expectation step replaced by sampling t and the Maximization step replaced by sampling θ.

[Figure 1: A Bayes net representation of dependencies among the variables in a PCFG; the nodes are the Dirichlet parameters α_{A_1}, ..., α_{A_|N|}, the rule probabilities θ_{A_1}, ..., θ_{A_|N|}, the parse trees t_1, ..., t_n, and the strings w_1, ..., w_n.]

The dependencies among variables in a PCFG are depicted graphically in Figure 1, which makes clear that the Gibbs sampler is highly parallelizable (just like the EM algorithm). Specifically, the parses t_i are independent given θ and so can be sampled in parallel from the following distribution as described in the next section.

P_G(t_i|w_i, θ) = P_G(t_i|θ) / P_G(w_i|θ)

We make use of the fact that the posterior is a product of independent Dirichlet distributions in order to sample θ from P_D(θ|t, α). The production probabilities θ_A for each nonterminal A ∈ N are sampled from a Dirichlet distribution with parameters f_A(t) + α_A.

4 Sampling parse trees

The algorithm we use to sample a parse tree from P_G(t_i|w_i, θ) consists of two steps. The first step computes inside probabilities bottom-up, as in the Inside pass of the Inside-Outside algorithm for PCFGs (Lari and Young, 1990). The second step involves a recursion from larger to smaller strings, sampling from the productions that expand each string and constructing the corresponding tree in a top-down fashion.

In this section we take w to be a string of terminal symbols w = (w_1, ..., w_n) where each w_i ∈ T, and define w_{i,k} = (w_{i+1}, ..., w_k) (i.e., the substring from w_{i+1} up to w_k). Further, let G_A = (T, N, A, R), i.e., a CFG just like G except that the start symbol has been replaced with A, so P_{G_A}(t|θ) is the probability of a tree t whose root node is labeled A and P_{G_A}(w|θ) is the sum of the probabilities of all trees whose root nodes are labeled A with yield w.

The Inside algorithm takes as input a PCFG (G, θ) and a string w = w_{0,n} and constructs a table with entries p_{A,i,k} for each A ∈ N and 0 ≤ i < k ≤ n, where p_{A,i,k} = P_{G_A}(w_{i,k}|θ), i.e., the probability of A rewriting to w_{i,k}. The table entries are recursively defined below, and computed by enumerating all feasible i, k and A in any order such that all smaller values of k − i are enumerated before any larger values.

p_{A,k−1,k} = θ_{A→w_k}
p_{A,i,k} = Σ_{A→B C ∈ R} Σ_{i<j<k} θ_{A→B C} p_{B,i,j} p_{C,j,k}
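The recursion above translates almost line-for-line into code. The following is a minimal, unoptimized sketch of the inside pass followed by top-down sampling of a tree for the whole string; the dictionary-based grammar encoding and the toy grammar at the end are our own choices for illustration, not the paper's implementation.

```python
import random
from collections import defaultdict

def inside(words, binary, lexical):
    """Compute p[A, i, k] = P_{G_A}(w_{i,k} | theta), where p[A, i, k]
    corresponds to the substring words[i:k].

    binary:  dict A -> list of (B, C, prob) for rules A -> B C
    lexical: dict A -> dict terminal -> prob for rules A -> w
    """
    n = len(words)
    p = defaultdict(float)
    for k in range(1, n + 1):                    # length-1 spans: A -> w_k
        for A, probs in lexical.items():
            p[A, k - 1, k] = probs.get(words[k - 1], 0.0)
    for span in range(2, n + 1):                 # shorter spans before longer ones
        for i in range(0, n - span + 1):
            k = i + span
            for A, rules in binary.items():
                total = 0.0
                for B, C, prob in rules:
                    for j in range(i + 1, k):
                        total += prob * p[B, i, j] * p[C, j, k]
                p[A, i, k] = total
    return p

def sample_tree(words, binary, p, A, i=0, k=None):
    """Sample a tree rooted in A spanning words[i:k], top-down, from the inside table."""
    if k is None:
        k = len(words)
    if k - i == 1:                               # lexical expansion A -> w_k
        return (A, words[i])
    # Choose (A -> B C, j) with probability proportional to
    # theta_{A -> B C} * p[B, i, j] * p[C, j, k].
    choices, weights = [], []
    for B, C, prob in binary.get(A, []):
        for j in range(i + 1, k):
            choices.append((B, C, j))
            weights.append(prob * p[B, i, j] * p[C, j, k])
    B, C, j = random.choices(choices, weights=weights, k=1)[0]
    return (A, sample_tree(words, binary, p, B, i, j),
               sample_tree(words, binary, p, C, j, k))

# Invented toy grammar and string, just to exercise the code.
binary = {"S": [("NP", "VP", 1.0)], "VP": [("V", "NP", 1.0)]}
lexical = {"NP": {"we": 0.5, "them": 0.5}, "V": {"see": 1.0}}
words = ["we", "see", "them"]
p = inside(words, binary, lexical)
print(p["S", 0, 3])                              # P_G(w | theta) = 0.25
print(sample_tree(words, binary, p, "S"))
```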

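Putting the pieces of Section 3 together, one sweep of the Gibbs sampler resamples every parse given θ and then resamples θ given the new parses. The skeleton below assumes two helpers that are not defined here: sample_parse(w, theta), a sampler for P(t|w, θ) such as the one sketched above, and rule_counts(t), which returns the production counts f(t) of a tree; both names are hypothetical.

```python
import numpy as np
from collections import Counter

def resample_theta(counts, rules_by_lhs, alpha, rng):
    """Sample theta given t and alpha: one Dirichlet draw per nonterminal A,
    with parameters f_A(t) + alpha_A (the conjugacy result of Section 2.3)."""
    theta = {}
    for A, rules in rules_by_lhs.items():
        params = np.array([counts[r] + alpha[r] for r in rules], dtype=float)
        theta.update(dict(zip(rules, rng.dirichlet(params))))
    return theta

def gibbs_sweep(corpus, theta, rules_by_lhs, alpha, sample_parse, rule_counts, rng):
    """One sweep of the Gibbs sampler of Section 3 (a sketch, not the paper's code)."""
    # Step 1: resample every parse t_i given the current theta.
    # (The t_i are independent given theta, so this loop is parallelizable.)
    trees = [sample_parse(w, theta) for w in corpus]
    # Step 2: resample theta given the new parses, using the counts f(t) + alpha.
    counts = Counter()
    for t in trees:
        counts.update(rule_counts(t))
    theta = resample_theta(counts, rules_by_lhs, alpha, rng)
    return trees, theta
```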
5 A Hastings sampler for P(t|w, α)

Our second algorithm integrates out the rule probabilities θ, sampling directly from the posterior distribution over parses P(t|w, α). When the parse t_i of string w_i is resampled, the production probabilities θ′ used to construct the proposal distribution are estimated from the parses t_{−i} of all of the other strings in the corpus together with the Dirichlet parameters:

θ′_r = (f_r(t_{−i}) + α_r) / Σ_{r′ ∈ R_A} (f_{r′}(t_{−i}) + α_{r′})

6 Inferring sparse grammars

As stated in the introduction, the primary contribution of this paper is introducing MCMC methods for Bayesian inference to computational linguistics. Bayesian inference using MCMC is a technique of generic utility, much like Expectation-Maximization and other general inference techniques, and we believe that it belongs in every computational linguist's toolbox alongside these other techniques.

Inferring a PCFG to describe the syntactic structure of a natural language is an obvious application of grammar inference techniques, and it is well-known that PCFG inference using maximum-likelihood techniques such as the Inside-Outside (IO) algorithm, a dynamic programming Expectation-Maximization (EM) algorithm for PCFGs, performs extremely poorly on such tasks.¹ We have applied the Bayesian MCMC methods described here to such problems and obtain results very similar to those produced using IO. We believe that the primary reason why both IO and the Bayesian methods perform so poorly on this task is that simple PCFGs are not accurate models of English syntactic structure.

¹ It is easy to demonstrate that the poor quality of the PCFG models is the cause of these problems rather than search or other algorithmic issues. If one initializes either the IO or Bayesian estimation procedures with treebank parses and then runs the procedure using the yields alone, the accuracy of the parses uniformly decreases while the (posterior) likelihood uniformly increases with each iteration, demonstrating that improving the (posterior) likelihood of such models does not improve parse accuracy.

Our goal in this section is modest: we aim merely to provide an illustrative example of Bayesian inference using MCMC. As Figure 2 shows, when the Dirichlet prior parameter α_r approaches 0 the prior P_D(θ_r|α) becomes increasingly concentrated around 0. This ability to bias the sampler toward sparse grammars (i.e., grammars in which many productions have probabilities close to 0) is useful when we attempt to identify relevant productions from a much larger set of possible productions via parameter estimation.

The Bantu language Sesotho is a richly agglutinative language, in which verbs consist of a sequence of morphemes, including optional Subject Markers (SM), Tense (T), Object Markers (OM), Mood (M) and derivational affixes as well as the obligatory Verb stem (V), as shown in the following example:

re  -a  -di  -bon  -a
SM   T   OM   V     M
"We see them"
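Before turning to the experiments, here is a small sketch of how the proposal probabilities θ′ from Section 5 can be computed: the rule probabilities come from the production counts of every parse except the one being resampled, plus the Dirichlet parameters. Only this step is shown; the accept/reject decision of the Hastings sampler is omitted, and the count tables below are invented for the example.

```python
from collections import Counter

def proposal_probs(total_counts, counts_i, rules_by_lhs, alpha):
    """Compute theta'_r = (f_r(t_{-i}) + alpha_r) / sum over r' in R_A of
    (f_{r'}(t_{-i}) + alpha_{r'}), where t_{-i} is the current set of parses
    with the i-th parse removed.

    total_counts: Counter of production counts over all current parses t
    counts_i:     Counter of production counts in the parse t_i being resampled
    rules_by_lhs: dict mapping each nonterminal A to the list of rules in R_A
    alpha:        dict mapping each rule to its Dirichlet parameter alpha_r
    """
    minus_i = total_counts - counts_i            # f(t_{-i}); Counter subtraction
    theta = {}
    for A, rules in rules_by_lhs.items():
        denom = sum(minus_i[r] + alpha[r] for r in rules)
        for r in rules:
            theta[r] = (minus_i[r] + alpha[r]) / denom
    return theta

# Invented example with two Word productions.
rules_by_lhs = {"Word": [("Word", ("V",)), ("Word", ("V", "M"))]}
alpha = {r: 0.001 for r in rules_by_lhs["Word"]}
total = Counter({("Word", ("V",)): 8, ("Word", ("V", "M")): 2})
t_i = Counter({("Word", ("V",)): 1})
print(proposal_probs(total, t_i, rules_by_lhs, alpha))
# roughly 0.78 for Word -> V and 0.22 for Word -> V M
```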
We used an implementation of the Hastings sampler described in Section 5 to infer morphological parses t for a corpus w of 2,283 unsegmented Sesotho verb types extracted from the Sesotho corpus available from CHILDES (MacWhinney and Snow, 1985; Demuth, 1992). We chose this corpus because the words have been morphologically segmented manually, making it possible for us to evaluate the morphological parses. We constructed a CFG G containing the following productions:

Word → V
Word → V M
Word → SM V M
Word → SM T V M
Word → SM T OM V M

together with productions expanding the preterminals SM, T, OM, V and M to each of the 16,350 distinct substrings occurring anywhere in the corpus, producing a grammar with 81,755 productions in all. In effect, G encodes the basic morphological structure of the Sesotho verb (ignoring factors such as derivational morphology and irregular forms), but provides no information about the phonological identity of the morphemes.

Note that G actually generates a finite language. However, G parameterizes the probability distribution over the strings it generates in a manner that would be difficult to succinctly characterize except in terms of the productions given above. Moreover, with approximately 20 times more productions than training strings, each string is highly ambiguous and estimation is highly underconstrained, so it provides an excellent test-bed for sparse priors.

We estimated the morphological parses t in two ways. First, we ran the IO algorithm initialized with a uniform initial estimate θ_0 for θ to produce an estimate of the MLE θ̂, and then computed the Viterbi parses t̂ of the training corpus w with respect to the PCFG (G, θ̂). Second, we ran the Hastings sampler initialized with trees sampled from (G, θ_0) with several different values for the parameters of the prior. We experimented with a number of techniques for speeding convergence of both the IO and Hastings algorithms, and two of these were particularly effective on this problem. Annealing, i.e., using P(t|w)^{1/τ} in place of P(t|w), where τ is a "temperature" parameter starting around 5 and slowly adjusted toward 1, sped the convergence of both algorithms. We ran both algorithms for several thousand iterations over the corpus, and both seemed to converge fairly quickly once τ was set to 1. "Jittering" the initial estimate of θ used in the IO algorithm also sped its convergence.

The IO algorithm converges to a solution where θ_{Word→V} = 1, and every string w ∈ w is analysed as a single morpheme V. (In fact, in this grammar P(w_i|θ) is the empirical probability of w_i, and it is easy to prove that this θ is the MLE.)

The samples t produced by the Hastings algorithm depend on the parameters of the Dirichlet prior. We set α_r to a single value α for all productions r. We found that for α > 10^−2 the samples produced by the Hastings algorithm were the same trivial analyses as those produced by the IO algorithm, but as α was reduced below this, t began to exhibit nontrivial structure. We evaluated the quality of the segmentations in the morphological analyses t in terms of unlabeled precision, recall, f-score and exact match (the fraction of words correctly segmented into morphemes; we ignored morpheme labels because the manual morphological analyses contain many morpheme labels that we did not include in G). Figure 3 contains a plot of how these quantities vary with α; we obtain an f-score of 0.75 and an exact word match accuracy of 0.54 at α = 10^−5 (the corresponding values for the MLE θ̂ are both 0). Note that we obtained good results as α was varied over several orders of magnitude, so the actual value of α is not critical. Thus in this application the ability to prefer sparse grammars enables us to find linguistically meaningful analyses. This ability to find linguistically meaningful structure is relatively rare in our experience with unsupervised PCFG induction.

[Figure 3: Accuracy of morphological segmentations of Sesotho verbs proposed by the Hastings algorithm as a function of the Dirichlet prior parameter α, for α ranging from 1 down to 10^−10. F-score, precision and recall are unlabeled morpheme scores, while Exact is the fraction of words correctly segmented.]

We also experimented with a version of IO modified to perform Bayesian MAP estimation, where the Maximization step of the IO procedure is replaced with Bayesian inference using a Dirichlet prior, i.e., where the rule probabilities θ^{(k)} at iteration k are estimated using:

θ_r^{(k)} ∝ max(0, E[f_r|w, θ^{(k−1)}] + α − 1).
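A sketch of this modified Maximization step, given expected rule counts from the Expectation step (the counts, rule names, and α value below are invented): any rule whose expected count falls below 1 − α is clipped to probability zero, which is the failure mode discussed next.

```python
def map_m_step(expected_counts, rules_by_lhs, alpha):
    """Dirichlet-MAP variant of the IO M-step:
    theta_r^(k) is proportional to max(0, E[f_r | w, theta^(k-1)] + alpha - 1),
    normalized over the rules R_A sharing the same left-hand side A."""
    theta = {}
    for A, rules in rules_by_lhs.items():
        unnorm = {r: max(0.0, expected_counts.get(r, 0.0) + alpha - 1.0) for r in rules}
        total = sum(unnorm.values())
        for r in rules:
            theta[r] = unnorm[r] / total if total > 0 else 0.0
    return theta

# Invented expected counts for two Word productions with a sparse prior.
rules_by_lhs = {"Word": ["Word->V", "Word->V M"]}
expected = {"Word->V": 5.2, "Word->V M": 0.4}
print(map_m_step(expected, rules_by_lhs, alpha=1e-5))
# {'Word->V': 1.0, 'Word->V M': 0.0}: the second rule's expected count is below
# 1 - alpha, so its probability is clipped to zero.
```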
Clearly such an approach is very closely related to the Bayesian procedures presented in this article, and in some circumstances this may be a useful alternative. However, in our experiments with the Sesotho data above we found that for the small values of α necessary to obtain a sparse solution, the expected rule count E[f_r] for many rules r was less than 1 − α. Thus on the next iteration θ_r = 0, resulting in there being no parse whatsoever for many of the strings in the training data. Variational Bayesian techniques offer a systematic way of dealing with these problems, but we leave this for further work.

7 Conclusion

This paper has described basic algorithms for performing Bayesian inference over PCFGs given terminal strings. We presented two Markov chain Monte Carlo algorithms (a Gibbs and a Hastings sampling algorithm) for sampling from the posterior distribution over parse trees given a corpus of their yields and a Dirichlet product prior over the production probabilities. As a component of these algorithms we described an efficient dynamic programming algorithm for sampling trees from a PCFG which is useful in its own right. We used these sampling algorithms to infer morphological analyses of Sesotho verbs given their strings (a task on which the standard Maximum Likelihood estimator returns a trivial and linguistically uninteresting solution), achieving 0.75 unlabeled morpheme f-score and 0.54 exact word match accuracy. Thus this is one of the few cases we are aware of in which a PCFG estimation procedure returns linguistically meaningful structure. We attribute this to the ability of the Bayesian prior to prefer sparse grammars.

We expect that these algorithms will be of interest to the computational linguistics community both because a Bayesian approach to PCFG estimation is more flexible than the Maximum Likelihood methods that currently dominate the field (cf. the use of a prior as a bias towards sparse solutions), and because these techniques provide essential building blocks for more complex models.

References

Katherine Demuth. 1992. Acquisition of Sesotho. In Dan Slobin, editor, The Cross-Linguistic Study of Language Acquisition, volume 3, pages 557–638. Lawrence Erlbaum Associates, Hillsdale, N.J.

Ye Ding, Chi Yu Chan, and Charles E. Lawrence. 2005. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11:1157–1166.

Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 618–626, Sydney, Australia. Association for Computational Linguistics.

Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.

James E. Gentle. 2003. Random number generation and Monte Carlo methods. Springer, New York, 2nd edition.

Joshua Goodman. 1998. Parsing inside-out. Ph.D. thesis, Harvard University. Available from http://research.microsoft.com/~joshuago/.

Dan Klein and Chris Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 478–485.

K. Lari and S.J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35–56.

Brian MacWhinney and Catherine Snow. 1985. The child language data exchange system. Journal of Child Language, 12:271–296.

Noah A. Smith and Jason Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 569–576, Sydney, Australia. Association for Computational Linguistics.