
Bayesian Inference for PCFGs via Markov chain Monte Carlo

Mark Johnson, Cognitive and Linguistic Sciences, Brown University, Mark [email protected]
Thomas L. Griffiths, Department of Psychology, University of California, Berkeley, Tom [email protected]
Sharon Goldwater, Department of Linguistics, Stanford University, [email protected]

Abstract

This paper presents two Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference of probabilistic context-free grammars (PCFGs) from terminal strings, providing an alternative to maximum-likelihood estimation using the Inside-Outside algorithm. We illustrate these methods by estimating a sparse grammar describing the morphology of the Bantu language Sesotho, demonstrating that with suitable priors Bayesian techniques can infer linguistic structure in situations where maximum likelihood methods such as the Inside-Outside algorithm only produce a trivial grammar.

1 Introduction

The standard methods for inferring the parameters of probabilistic models in computational linguistics are based on the principle of maximum-likelihood estimation; for example, the parameters of Probabilistic Context-Free Grammars (PCFGs) are typically estimated from strings of terminals using the Inside-Outside (IO) algorithm, an instance of the Expectation Maximization (EM) procedure (Lari and Young, 1990). However, much recent work in machine learning and statistics has turned away from maximum-likelihood in favor of Bayesian methods, and there is increasing interest in Bayesian methods in computational linguistics as well (Finkel et al., 2006). This paper presents two Markov chain Monte Carlo (MCMC) algorithms for inferring PCFGs and their parses from strings alone. These can be viewed as Bayesian alternatives to the IO algorithm.

The goal of Bayesian inference is to compute a distribution over plausible parameter values. This "posterior" distribution is obtained by combining the likelihood with a "prior" distribution P(θ) over parameter values θ. In the case of PCFG inference θ is the vector of rule probabilities, and the prior might assert a preference for a sparse grammar (see below). The posterior probability of each value of θ is given by Bayes' rule:

P(θ|D) ∝ P(D|θ)P(θ).     (1)

In principle Equation 1 defines the posterior probability of any value of θ, but computing this may not be tractable analytically or numerically. For this reason a variety of methods have been developed to support approximate Bayesian inference. One of the most popular methods is Markov chain Monte Carlo (MCMC), in which a Markov chain is used to sample from the posterior distribution.

This paper presents two new MCMC algorithms for inferring the posterior distribution over parses and rule probabilities given a corpus of strings. The first algorithm is a component-wise Gibbs sampler which is very similar in spirit to the EM algorithm, drawing parse trees conditioned on the current parameter values and then sampling the parameters conditioned on the current set of parse trees. The second algorithm is a component-wise Hastings sampler that "collapses" the probabilistic model, integrating over the rule probabilities of the PCFG, with the goal of speeding convergence. Both algorithms use an efficient dynamic programming technique to sample parse trees.

Given their usefulness in other disciplines, we believe that Bayesian methods like these are likely to be of general utility in computational linguistics as well. As a simple illustrative example, we use these methods to infer morphological parses for verbs from Sesotho, a southern Bantu language with agglutinating morphology.
Our results illustrate that Bayesian inference using a prior that favors sparsity can produce linguistically reasonable analyses in situations in which EM does not.

The rest of this paper is structured as follows. The next section introduces the background for our paper, summarizing the key ideas behind PCFGs, Bayesian inference, and MCMC. Section 3 introduces our first MCMC algorithm, a Gibbs sampler for PCFGs. Section 4 describes an algorithm for sampling trees from the distribution over trees defined by a PCFG. Section 5 shows how to integrate out the rule weight parameters θ in a PCFG, allowing us to sample directly from the posterior distribution over parses for a corpus of strings. Finally, Section 6 illustrates these methods in learning Sesotho morphology.

2 Background

2.1 Probabilistic context-free grammars

Let G = (T, N, S, R) be a Context-Free Grammar in Chomsky normal form with no useless productions, where T is a finite set of terminal symbols, N is a finite set of nonterminal symbols (disjoint from T), S ∈ N is a distinguished nonterminal called the start symbol, and R is a finite set of productions of the form A → B C or A → w, where A, B, C ∈ N and w ∈ T. In what follows we use β as a variable ranging over (N × N) ∪ T.

A Probabilistic Context-Free Grammar (G, θ) is a pair consisting of a context-free grammar G and a real-valued vector θ of length |R| indexed by productions, where θ_{A→β} is the production probability associated with the production A → β ∈ R. We require that θ_{A→β} ≥ 0 and that for all nonterminals A ∈ N, Σ_{A→β ∈ R} θ_{A→β} = 1. A PCFG (G, θ) defines a probability distribution over trees t as follows:

P_G(t|θ) = ∏_{r ∈ R} θ_r^{f_r(t)}

where t is generated by G and f_r(t) is the number of times the production r = A → β ∈ R is used in the derivation of t. If G does not generate t, let P_G(t|θ) = 0. The yield y(t) of a parse tree t is the sequence of terminals labeling its leaves. The probability of a string w ∈ T⁺ of terminals is the sum of the probabilities of all trees with yield w, i.e.:

P_G(w|θ) = Σ_{t : y(t) = w} P_G(t|θ).

2.2 Bayesian inference for PCFGs

Given a corpus of strings w = (w_1, ..., w_n), where each w_i is a string of terminals generated by a known CFG G, we would like to be able to infer the production probabilities θ that best describe that corpus. Taking w to be our data, we can apply Bayes' rule (Equation 1) to obtain:

P(θ|w) ∝ P_G(w|θ)P(θ), where
P_G(w|θ) = ∏_{i=1}^{n} P_G(w_i|θ).

Using t to denote a sequence of parse trees for w, we can compute the joint posterior distribution over t and θ, and then marginalize over t, with P(θ|w) = Σ_t P(t, θ|w). The joint posterior distribution on t and θ is given by:

P(t, θ|w) ∝ P(w|t)P(t|θ)P(θ)
         = (∏_{i=1}^{n} P(w_i|t_i)P(t_i|θ)) P(θ)

with P(w_i|t_i) = 1 if y(t_i) = w_i, and 0 otherwise.
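To make the definitions above concrete, the following sketch computes P_G(t|θ) by counting the productions used in a derivation and multiplying the corresponding rule probabilities. The toy grammar, its rule probabilities, and the tuple encoding of trees are invented for illustration and are not taken from the paper.

```python
from collections import Counter

# Toy CNF PCFG (invented for illustration): rules map (lhs, rhs) -> probability.
# rhs is either a pair of nonterminals or a single terminal.
theta = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("we",)): 0.5,
    ("NP", ("them",)): 0.5,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("see",)): 1.0,
}

def productions(tree):
    """Yield the productions used in a derivation.

    A tree is (label, child1, child2) for binary rules or (label, terminal)
    for lexical rules, e.g. ("S", ("NP", "we"), ("VP", ...)).
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))                   # A -> w
    else:
        yield (label, tuple(c[0] for c in children))    # A -> B C
        for child in children:
            yield from productions(child)

def tree_prob(tree, theta):
    """P_G(t | theta) = prod over r of theta_r ** f_r(t); 0 if t uses a rule not in R."""
    counts = Counter(productions(tree))                 # f_r(t)
    prob = 1.0
    for rule, count in counts.items():
        prob *= theta.get(rule, 0.0) ** count
    return prob

t = ("S", ("NP", "we"), ("VP", ("V", "see"), ("NP", "them")))
print(tree_prob(t, theta))   # 0.25 = 1.0 * 0.5 * 1.0 * 1.0 * 0.5
```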
2.3 Dirichlet priors

The first step towards computing the posterior distribution is to define a prior on θ. We take P(θ) to be a product of Dirichlet distributions, with one distribution for each non-terminal A ∈ N. The prior is parameterized by a positive real valued vector α indexed by productions R, so each production probability θ_{A→β} has a corresponding Dirichlet parameter α_{A→β}. Let R_A be the set of productions in R with left-hand side A, and let θ_A and α_A refer to the component subvectors of θ and α respectively indexed by productions in R_A. The Dirichlet prior P_D(θ|α) is:

P_D(θ|α) = ∏_{A ∈ N} P_D(θ_A|α_A), where
P_D(θ_A|α_A) = (1 / C(α_A)) ∏_{r ∈ R_A} θ_r^{α_r − 1} and
C(α_A) = (∏_{r ∈ R_A} Γ(α_r)) / Γ(Σ_{r ∈ R_A} α_r)     (2)

where Γ is the generalized factorial function and C(α_A) is a normalization constant that does not depend on θ_A.

Dirichlet priors are useful because they are conjugate to the distribution over trees defined by a PCFG. This means that the posterior distribution on θ given a set of parse trees, P(θ|t, α), is also a Dirichlet distribution. Applying Bayes' rule,

P_G(θ|t, α) ∝ P_G(t|θ) P_D(θ|α)
           ∝ (∏_{r ∈ R} θ_r^{f_r(t)}) (∏_{r ∈ R} θ_r^{α_r − 1})
           = ∏_{r ∈ R} θ_r^{f_r(t) + α_r − 1}

which is a Dirichlet distribution with parameters f(t) + α, where f(t) is the vector of production counts in t indexed by r ∈ R. We can thus write:

P_G(θ|t, α) = P_D(θ|f(t) + α)

which makes it clear that the production counts combine directly with the parameters of the prior.

2.4 Markov chain Monte Carlo

Having defined a prior on θ, the posterior distribution over t and θ is fully determined by a corpus w. Unfortunately, computing the posterior probability of even a single choice of t and θ is intractable, as evaluating the normalizing constant for this distribution requires summing over all possible parses for the entire corpus and all sets of production probabilities. Nonetheless, it is possible to define algorithms that sample from this distribution using Markov chain Monte Carlo (MCMC).

MCMC algorithms construct a Markov chain whose states s ∈ S are the objects we wish to sample. The state space S is typically astronomically large — in our case, the state space includes all possible parses of the entire training corpus w — and the transition probabilities P(s′|s) are specified via a scheme guaranteed to converge to the desired distribution π(s) (in our case, the posterior distribution). We "run" the Markov chain (i.e., starting in initial state s_0, sample a state s_1 from P(s′|s_0), then sample state s_2 from P(s′|s_1), and so on), with the probability that the Markov chain is in a particular state, P(s_i), converging to π(s_i) as i → ∞.

After the chain has run long enough for it to approach its stationary distribution, the expectation E_π[f] of any function f(s) of the state s will be approximated by the average of that function over the set of sample states produced by the algorithm. For example, in our case, given samples (t_i, θ_i) for i = 1, ..., ℓ produced by an MCMC algorithm, we can estimate θ as

E_π[θ] ≈ (1/ℓ) Σ_{i=1}^{ℓ} θ_i.

The remainder of this paper presents two MCMC algorithms for PCFGs. Both algorithms proceed by setting the initial state of the Markov chain to a guess for (t, θ) and then sampling successive states using a particular transition matrix. The key difference between the two algorithms is the form of the transition matrix they assume.
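The conjugacy result and the Monte Carlo averaging step can be exercised directly in a few lines. The sketch below draws θ_A from P_D(θ_A | f_A(t) + α_A) for a single nonterminal using NumPy's Dirichlet sampler and then averages the draws; the counts and α values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_theta_A(counts, alpha, rng):
    """Draw theta_A ~ Dirichlet(f_A(t) + alpha_A) for one nonterminal A."""
    params = np.asarray(counts, dtype=float) + np.asarray(alpha, dtype=float)
    return rng.dirichlet(params)

# Invented example: nonterminal A has three productions with counts f_A(t)
# and a sparse prior alpha_A.
f_A = [12, 0, 3]
alpha_A = [0.1, 0.1, 0.1]

# Approximate E_pi[theta_A] by averaging ell samples, as in the text.
# (Here the posterior is available in closed form, so this only illustrates
# the averaging step that is applied to MCMC output.)
ell = 1000
samples = np.stack([sample_theta_A(f_A, alpha_A, rng) for _ in range(ell)])
print(samples.mean(axis=0))
# Close to the analytic posterior mean (f_A + alpha_A) / sum(f_A + alpha_A).
```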

3 A Gibbs sampler for P(t, θ|w, α)

The Gibbs sampler (Geman and Geman, 1984) is one of the simplest MCMC methods, in which transitions between states of the Markov chain result from sampling each component of the state conditioned on the current value of all other variables. In our case, this means alternating between sampling from two distributions:

P(t|θ, w, α) = ∏_{i=1}^{n} P(t_i|w_i, θ), and
P(θ|t, w, α) = P_D(θ|f(t) + α)
             = ∏_{A ∈ N} P_D(θ_A|f_A(t) + α_A).

Thus every two steps we generate a new sample of t and θ. This alternation between parsing and updating θ is reminiscent of the EM algorithm, with the Expectation step replaced by sampling t and the Maximization step replaced by sampling θ.

[Figure 1: A Bayes net representation of dependencies among the variables in a PCFG; the nodes are the Dirichlet parameters α_{A_1}, ..., α_{A_|N|}, the rule probabilities θ_{A_1}, ..., θ_{A_|N|}, the parse trees t_1, ..., t_n, and the strings w_1, ..., w_n.]

The dependencies among variables in a PCFG are depicted graphically in Figure 1, which makes clear that the Gibbs sampler is highly parallelizable (just like the EM algorithm). Specifically, the parses t_i are independent given θ and so can be sampled in parallel from the following distribution as described in the next section.

P_G(t_i|w_i, θ) = P_G(t_i|θ) / P_G(w_i|θ)

We make use of the fact that the posterior is a product of independent Dirichlet distributions in order to sample θ from P_D(θ|t, α). The production probabilities θ_A for each nonterminal A ∈ N are sampled from a Dirichlet distribution with parameters f_A(t) + α_A.

4 Sampling parse trees

The algorithm we use to sample a parse tree from P_G(t_i|w_i, θ) consists of two steps. The first step computes inside probabilities bottom-up, as in the Inside pass of the Inside-Outside algorithm for PCFGs (Lari and Young, 1990). The second step involves a recursion from larger to smaller strings, sampling from the productions that expand each string and constructing the corresponding tree in a top-down fashion.

In this section we take w to be a string of terminal symbols w = (w_1, ..., w_n) where each w_i ∈ T, and define w_{i,k} = (w_{i+1}, ..., w_k) (i.e., the substring from w_{i+1} up to w_k). Further, let G_A = (T, N, A, R), i.e., a CFG just like G except that the start symbol has been replaced with A, so P_{G_A}(t|θ) is the probability of a tree t whose root node is labeled A and P_{G_A}(w|θ) is the sum of the probabilities of all trees whose root nodes are labeled A with yield w.

The Inside algorithm takes as input a PCFG (G, θ) and a string w = w_{0,n} and constructs a table with entries p_{A,i,k} for each A ∈ N and 0 ≤ i < k ≤ n, where p_{A,i,k} = P_{G_A}(w_{i,k}|θ), i.e., the probability of A rewriting to w_{i,k}. The table entries are recursively defined below, and computed by enumerating all feasible i, k and A in any order such that all smaller values of k − i are enumerated before any larger values.

p_{A,k−1,k} = θ_{A→w_k}
p_{A,i,k} = Σ_{A→B C ∈ R} Σ_{i<j<k} θ_{A→B C} p_{B,i,j} p_{C,j,k}
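The recursion above translates almost line-for-line into code. The following is a minimal, unoptimized sketch of the inside pass followed by top-down sampling of a tree for the whole string; the dictionary-based grammar encoding and the toy grammar at the end are our own choices for illustration, not the paper's implementation.

```python
import random
from collections import defaultdict

def inside(words, binary, lexical):
    """Compute p[A, i, k] = P_{G_A}(w_{i,k} | theta), where p[A, i, k]
    corresponds to the substring words[i:k].

    binary:  dict A -> list of (B, C, prob) for rules A -> B C
    lexical: dict A -> dict terminal -> prob for rules A -> w
    """
    n = len(words)
    p = defaultdict(float)
    for k in range(1, n + 1):                    # length-1 spans: A -> w_k
        for A, probs in lexical.items():
            p[A, k - 1, k] = probs.get(words[k - 1], 0.0)
    for span in range(2, n + 1):                 # shorter spans before longer ones
        for i in range(0, n - span + 1):
            k = i + span
            for A, rules in binary.items():
                total = 0.0
                for B, C, prob in rules:
                    for j in range(i + 1, k):
                        total += prob * p[B, i, j] * p[C, j, k]
                p[A, i, k] = total
    return p

def sample_tree(words, binary, p, A, i=0, k=None):
    """Sample a tree rooted in A spanning words[i:k], top-down, from the inside table."""
    if k is None:
        k = len(words)
    if k - i == 1:                               # lexical expansion A -> w_k
        return (A, words[i])
    # Choose (A -> B C, j) with probability proportional to
    # theta_{A -> B C} * p[B, i, j] * p[C, j, k].
    choices, weights = [], []
    for B, C, prob in binary.get(A, []):
        for j in range(i + 1, k):
            choices.append((B, C, j))
            weights.append(prob * p[B, i, j] * p[C, j, k])
    B, C, j = random.choices(choices, weights=weights, k=1)[0]
    return (A, sample_tree(words, binary, p, B, i, j),
               sample_tree(words, binary, p, C, j, k))

# Invented toy grammar and string, just to exercise the code.
binary = {"S": [("NP", "VP", 1.0)], "VP": [("V", "NP", 1.0)]}
lexical = {"NP": {"we": 0.5, "them": 0.5}, "V": {"see": 1.0}}
words = ["we", "see", "them"]
p = inside(words, binary, lexical)
print(p["S", 0, 3])                              # P_G(w | theta) = 0.25
print(sample_tree(words, binary, p, "S"))
```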

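Putting the pieces of Section 3 together, one sweep of the Gibbs sampler resamples every parse given θ and then resamples θ given the new parses. The skeleton below assumes two helpers that are not defined here: sample_parse(w, theta), a sampler for P(t|w, θ) such as the one sketched above, and rule_counts(t), which returns the production counts f(t) of a tree; both names are hypothetical.

```python
import numpy as np
from collections import Counter

def resample_theta(counts, rules_by_lhs, alpha, rng):
    """Sample theta given t and alpha: one Dirichlet draw per nonterminal A,
    with parameters f_A(t) + alpha_A (the conjugacy result of Section 2.3)."""
    theta = {}
    for A, rules in rules_by_lhs.items():
        params = np.array([counts[r] + alpha[r] for r in rules], dtype=float)
        theta.update(dict(zip(rules, rng.dirichlet(params))))
    return theta

def gibbs_sweep(corpus, theta, rules_by_lhs, alpha, sample_parse, rule_counts, rng):
    """One sweep of the Gibbs sampler of Section 3 (a sketch, not the paper's code)."""
    # Step 1: resample every parse t_i given the current theta.
    # (The t_i are independent given theta, so this loop is parallelizable.)
    trees = [sample_parse(w, theta) for w in corpus]
    # Step 2: resample theta given the new parses, using the counts f(t) + alpha.
    counts = Counter()
    for t in trees:
        counts.update(rule_counts(t))
    theta = resample_theta(counts, rules_by_lhs, alpha, rng)
    return trees, theta
```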
5 A Hastings sampler for P(t|w, α)

Our second algorithm integrates out the rule probabilities θ, sampling directly from the posterior distribution over parses P(t|w, α). When the parse t_i of string w_i is resampled, the production probabilities θ′ used to construct the proposal distribution are estimated from the parses t_{−i} of all of the other strings in the corpus together with the Dirichlet parameters:

θ′_r = (f_r(t_{−i}) + α_r) / Σ_{r′ ∈ R_A} (f_{r′}(t_{−i}) + α_{r′})

6 Inferring sparse grammars

As stated in the introduction, the primary contribution of this paper is introducing MCMC methods for Bayesian inference to computational linguistics. Bayesian inference using MCMC is a technique of generic utility, much like Expectation-Maximization and other general inference techniques, and we believe that it belongs in every computational linguist's toolbox alongside these other techniques.

Inferring a PCFG to describe the syntactic structure of a natural language is an obvious application of grammar inference techniques, and it is well-known that PCFG inference using maximum-likelihood techniques such as the Inside-Outside (IO) algorithm, a dynamic programming Expectation-Maximization (EM) algorithm for PCFGs, performs extremely poorly on such tasks.¹ We have applied the Bayesian MCMC methods described here to such problems and obtain results very similar to those produced using IO. We believe that the primary reason why both IO and the Bayesian methods perform so poorly on this task is that simple PCFGs are not accurate models of English syntactic structure.

¹ It is easy to demonstrate that the poor quality of the PCFG models is the cause of these problems rather than search or other algorithmic issues. If one initializes either the IO or Bayesian estimation procedures with treebank parses and then runs the procedure using the yields alone, the accuracy of the parses uniformly decreases while the (posterior) likelihood uniformly increases with each iteration, demonstrating that improving the (posterior) likelihood of such models does not improve parse accuracy.

Our goal in this section is modest: we aim merely to provide an illustrative example of Bayesian inference using MCMC. As Figure 2 shows, when the Dirichlet prior parameter α_r approaches 0 the prior P_D(θ_r|α) becomes increasingly concentrated around 0. This ability to bias the sampler toward sparse grammars (i.e., grammars in which many productions have probabilities close to 0) is useful when we attempt to identify relevant productions from a much larger set of possible productions via parameter estimation.

The Bantu language Sesotho is a richly agglutinative language, in which verbs consist of a sequence of morphemes, including optional Subject Markers (SM), Tense (T), Object Markers (OM), Mood (M) and derivational affixes as well as the obligatory Verb stem (V), as shown in the following example:

re  -a  -di  -bon  -a
SM   T   OM   V     M
"We see them"
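Before turning to the experiments, here is a small sketch of how the proposal probabilities θ′ from Section 5 can be computed: the rule probabilities come from the production counts of every parse except the one being resampled, plus the Dirichlet parameters. Only this step is shown; the accept/reject decision of the Hastings sampler is omitted, and the count tables below are invented for the example.

```python
from collections import Counter

def proposal_probs(total_counts, counts_i, rules_by_lhs, alpha):
    """Compute theta'_r = (f_r(t_{-i}) + alpha_r) / sum over r' in R_A of
    (f_{r'}(t_{-i}) + alpha_{r'}), where t_{-i} is the current set of parses
    with the i-th parse removed.

    total_counts: Counter of production counts over all current parses t
    counts_i:     Counter of production counts in the parse t_i being resampled
    rules_by_lhs: dict mapping each nonterminal A to the list of rules in R_A
    alpha:        dict mapping each rule to its Dirichlet parameter alpha_r
    """
    minus_i = total_counts - counts_i            # f(t_{-i}); Counter subtraction
    theta = {}
    for A, rules in rules_by_lhs.items():
        denom = sum(minus_i[r] + alpha[r] for r in rules)
        for r in rules:
            theta[r] = (minus_i[r] + alpha[r]) / denom
    return theta

# Invented example with two Word productions.
rules_by_lhs = {"Word": [("Word", ("V",)), ("Word", ("V", "M"))]}
alpha = {r: 0.001 for r in rules_by_lhs["Word"]}
total = Counter({("Word", ("V",)): 8, ("Word", ("V", "M")): 2})
t_i = Counter({("Word", ("V",)): 1})
print(proposal_probs(total, t_i, rules_by_lhs, alpha))
# roughly 0.78 for Word -> V and 0.22 for Word -> V M
```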
We used an implementation of the Hastings sampler described in Section 5 to infer morphological parses t for a corpus w of 2,283 unsegmented Sesotho verb types extracted from the Sesotho corpus available from CHILDES (MacWhinney and Snow, 1985; Demuth, 1992). We chose this corpus because the words have been morphologically segmented manually, making it possible for us to evaluate the morphological parses. We constructed a CFG G containing the following productions:

Word → V
Word → V M
Word → SM V M
Word → SM T V M
Word → SM T OM V M

together with productions expanding the preterminals SM, T, OM, V and M to each of the 16,350 distinct substrings occurring anywhere in the corpus, producing a grammar with 81,755 productions in all. In effect, G encodes the basic morphological structure of the Sesotho verb (ignoring factors such as derivational morphology and irregular forms), but provides no information about the phonological identity of the morphemes.

Note that G actually generates a finite language. However, G parameterizes the probability distribution over the strings it generates in a manner that would be difficult to succinctly characterize except in terms of the productions given above. Moreover, with approximately 20 times more productions than training strings, each string is highly ambiguous and estimation is highly underconstrained, so it provides an excellent test-bed for sparse priors.

We estimated the morphological parses t in two ways. First, we ran the IO algorithm initialized with a uniform initial estimate θ_0 for θ to produce an estimate of the MLE θ̂, and then computed the Viterbi parses t̂ of the training corpus w with respect to the PCFG (G, θ̂). Second, we ran the Hastings sampler initialized with trees sampled from (G, θ_0) with several different values for the parameters of the prior. We experimented with a number of techniques for speeding convergence of both the IO and Hastings algorithms, and two of these were particularly effective on this problem. Annealing, i.e., using P(t|w)^{1/τ} in place of P(t|w), where τ is a "temperature" parameter starting around 5 and slowly adjusted toward 1, sped the convergence of both algorithms. We ran both algorithms for several thousand iterations over the corpus, and both seemed to converge fairly quickly once τ was set to 1. "Jittering" the initial estimate of θ used in the IO algorithm also sped its convergence.

The IO algorithm converges to a solution where θ_{Word→V} = 1, and every string w ∈ w is analysed as a single morpheme V. (In fact, in this grammar P(w_i|θ) is the empirical probability of w_i, and it is easy to prove that this θ is the MLE.)

The samples t produced by the Hastings algorithm depend on the parameters of the Dirichlet prior. We set α_r to a single value α for all productions r. We found that for α > 10^−2 the samples produced by the Hastings algorithm were the same trivial analyses as those produced by the IO algorithm, but as α was reduced below this, t began to exhibit nontrivial structure. We evaluated the quality of the segmentations in the morphological analyses t in terms of unlabeled precision, recall, f-score and exact match (the fraction of words correctly segmented into morphemes; we ignored morpheme labels because the manual morphological analyses contain many morpheme labels that we did not include in G). Figure 3 contains a plot of how these quantities vary with α; we obtain an f-score of 0.75 and an exact word match accuracy of 0.54 at α = 10^−5 (the corresponding values for the MLE θ̂ are both 0). Note that we obtained good results as α was varied over several orders of magnitude, so the actual value of α is not critical. Thus in this application the ability to prefer sparse grammars enables us to find linguistically meaningful analyses. This ability to find linguistically meaningful structure is relatively rare in our experience with unsupervised PCFG induction.

[Figure 3: Accuracy of morphological segmentations of Sesotho verbs proposed by the Hastings algorithm as a function of the Dirichlet prior parameter α, for α ranging from 1 down to 10^−10. F-score, precision and recall are unlabeled morpheme scores, while Exact is the fraction of words correctly segmented.]

We also experimented with a version of IO modified to perform Bayesian MAP estimation, where the Maximization step of the IO procedure is replaced with Bayesian inference using a Dirichlet prior, i.e., where the rule probabilities θ^{(k)} at iteration k are estimated using:

θ_r^{(k)} ∝ max(0, E[f_r|w, θ^{(k−1)}] + α − 1).
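A sketch of this modified Maximization step, given expected rule counts from the Expectation step (the counts, rule names, and α value below are invented): any rule whose expected count falls below 1 − α is clipped to probability zero, which is the failure mode discussed next.

```python
def map_m_step(expected_counts, rules_by_lhs, alpha):
    """Dirichlet-MAP variant of the IO M-step:
    theta_r^(k) is proportional to max(0, E[f_r | w, theta^(k-1)] + alpha - 1),
    normalized over the rules R_A sharing the same left-hand side A."""
    theta = {}
    for A, rules in rules_by_lhs.items():
        unnorm = {r: max(0.0, expected_counts.get(r, 0.0) + alpha - 1.0) for r in rules}
        total = sum(unnorm.values())
        for r in rules:
            theta[r] = unnorm[r] / total if total > 0 else 0.0
    return theta

# Invented expected counts for two Word productions with a sparse prior.
rules_by_lhs = {"Word": ["Word->V", "Word->V M"]}
expected = {"Word->V": 5.2, "Word->V M": 0.4}
print(map_m_step(expected, rules_by_lhs, alpha=1e-5))
# {'Word->V': 1.0, 'Word->V M': 0.0}: the second rule's expected count is below
# 1 - alpha, so its probability is clipped to zero.
```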
Clearly such an approach is very closely related to the Bayesian procedures presented in this article, and in some circumstances this may be a useful alternative. However, in our experiments with the Sesotho data above we found that for the small values of α necessary to obtain a sparse solution, the expected rule count E[f_r] for many rules r was less than 1 − α. Thus on the next iteration θ_r = 0, resulting in there being no parse whatsoever for many of the strings in the training data. Variational Bayesian techniques offer a systematic way of dealing with these problems, but we leave this for further work.

7 Conclusion

This paper has described basic algorithms for performing Bayesian inference over PCFGs given terminal strings. We presented two Markov chain Monte Carlo algorithms (a Gibbs and a Hastings sampling algorithm) for sampling from the posterior distribution over parse trees given a corpus of their yields and a Dirichlet product prior over the production probabilities. As a component of these algorithms we described an efficient dynamic programming algorithm for sampling trees from a PCFG which is useful in its own right. We used these sampling algorithms to infer morphological analyses of Sesotho verbs given their strings (a task on which the standard Maximum Likelihood estimator returns a trivial and linguistically uninteresting solution), achieving 0.75 unlabeled morpheme f-score and 0.54 exact word match accuracy. Thus this is one of the few cases we are aware of in which a PCFG estimation procedure returns linguistically meaningful structure. We attribute this to the ability of the Bayesian prior to prefer sparse grammars.

We expect that these algorithms will be of interest to the computational linguistics community both because a Bayesian approach to PCFG estimation is more flexible than the Maximum Likelihood methods that currently dominate the field (cf. the use of a prior as a bias towards sparse solutions), and because these techniques provide essential building blocks for more complex models.

References

Katherine Demuth. 1992. Acquisition of Sesotho. In Dan Slobin, editor, The Cross-Linguistic Study of Language Acquisition, volume 3, pages 557–638. Lawrence Erlbaum Associates, Hillsdale, N.J.

Ye Ding, Chi Yu Chan, and Charles E. Lawrence. 2005. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11:1157–1166.

Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 618–626, Sydney, Australia. Association for Computational Linguistics.

Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.

James E. Gentle. 2003. Random number generation and Monte Carlo methods. Springer, New York, 2nd edition.

Joshua Goodman. 1998. Parsing inside-out. Ph.D. thesis, Harvard University. Available from http://research.microsoft.com/~joshuago/.

Dan Klein and Chris Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 478–485.

K. Lari and S.J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35–56.

Brian MacWhinney and Catherine Snow. 1985. The child language data exchange system. Journal of Child Language, 12:271–296.

Noah A. Smith and Jason Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 569–576, Sydney, Australia. Association for Computational Linguistics.