
Online Adaptor Grammars with Hybrid Inference

Ke Zhai, Computer Science and UMIACS, University of Maryland, College Park, MD, USA ([email protected])
Jordan Boyd-Graber, Computer Science, University of Colorado, Boulder, CO, USA ([email protected])
Shay B. Cohen, School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK ([email protected])

Transactions of the Association for Computational Linguistics, 2 (2014) 465–476. Action Editor: Kristina Toutanova. Submitted 11/2013; Revised 5/2014; Revised 9/2014; Published 10/2014. © 2014 Association for Computational Linguistics.

Abstract

Adaptor grammars are a flexible, powerful formalism for defining nonparametric, unsupervised models of grammar productions. This flexibility comes at the cost of expensive inference. We address the difficulty of inference through an online algorithm which uses a hybrid of Markov chain Monte Carlo and variational inference. We show that this inference strategy improves scalability without sacrificing performance on unsupervised word segmentation and topic modeling tasks.

1 Introduction

Nonparametric Bayesian models are effective tools to discover latent structure in data (Müller and Quintana, 2004). These models have had great success in text analysis, especially syntax (Shindo et al., 2012). Nonparametric distributions provide support over the countably infinite, long-tailed distributions common in natural language (Goldwater et al., 2011).

We focus on adaptor grammars (Johnson et al., 2006), syntactic nonparametric models based on probabilistic context-free grammars. Adaptor grammars weaken the strong statistical independence assumptions PCFGs make (Section 2).

The weaker statistical independence assumptions that adaptor grammars make come at the cost of expensive inference. Adaptor grammars are not alone in this trade-off. For example, nonparametric extensions of topic models (Teh et al., 2006) have substantially more expensive inference than their parametric counterparts (Yao et al., 2009).

A common approach to address this computational bottleneck is through variational inference (Wainwright and Jordan, 2008). One of the advantages of variational inference is that it can be easily parallelized (Nallapati et al., 2007) or transformed into an online algorithm (Hoffman et al., 2010), which often converges in fewer iterations than batch variational inference.

Past variational inference techniques for adaptor grammars assume a preprocessing step that looks at all available data to establish the support of these nonparametric distributions (Cohen et al., 2010). Thus, these past approaches are not directly amenable to online inference.

Markov chain Monte Carlo (MCMC) inference, an alternative to variational inference, does not have this disadvantage. MCMC is easier to implement, and it discovers the support of nonparametric models during inference rather than assuming it a priori.

We apply stochastic hybrid inference (Mimno et al., 2012) to adaptor grammars to get the best of both worlds. We interleave MCMC inference inside variational inference. This preserves the scalability of variational inference while adding the sparse statistics and improved exploration MCMC provides.

Our inference algorithm for adaptor grammars starts with a variational algorithm similar to Cohen et al. (2010) and adds hybrid sampling within variational inference (Section 3). This obviates the need for expensive preprocessing and is a necessary step to create an online algorithm for adaptor grammars.

Our online extension (Section 4) processes examples in small batches taken from a stream of data. As data arrive, the algorithm dynamically extends the underlying approximate posterior distributions as more data are observed. This makes the algorithm flexible, scalable, and amenable to datasets that cannot be examined exhaustively because of their size—e.g., terabytes of social media data appear every second—or their nature—e.g., speech acquisition, where a language learner is limited to the bandwidth of the human perceptual system and cannot acquire data in a monolithic batch (Börschinger and Johnson, 2012).

We show our approach's scalability and effectiveness by applying our inference framework in Section 5 on two tasks: unsupervised word segmentation and infinite-vocabulary topic modeling.
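The online setting just described can be pictured with a short, purely illustrative Python sketch: a loop that consumes small mini-batches from a stream and lets a toy approximate posterior extend its support as new statistics arrive. All names here (OnlinePosterior, minibatches, collect_sufficient_statistics) are hypothetical stand-ins, not the Section 4 algorithm.

```python
# Hypothetical sketch of online inference over a data stream: consume small
# mini-batches and let the approximate posterior grow its support as new
# latent structures appear. Illustrates the setting only.

from collections import defaultdict

class OnlinePosterior:
    """Toy approximate posterior whose support grows as data arrive."""
    def __init__(self, decay=0.9):
        self.weights = defaultdict(float)  # support is extended lazily
        self.decay = decay                 # forgetting rate for old batches

    def update(self, batch_counts):
        # Downweight old statistics, then fold in the new mini-batch.
        for key in self.weights:
            self.weights[key] *= self.decay
        for key, count in batch_counts.items():
            self.weights[key] += count     # unseen keys extend the support

def minibatches(stream, size):
    """Group an (unbounded) iterable of examples into small batches."""
    batch = []
    for example in stream:
        batch.append(example)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Usage (collect_sufficient_statistics is a hypothetical, model-specific step):
# posterior = OnlinePosterior()
# for batch in minibatches(corpus_stream, size=32):
#     posterior.update(collect_sufficient_statistics(batch))
```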
2 Background

In this section, we review probabilistic context-free grammars and adaptor grammars.

2.1 Probabilistic Context-free Grammars

Probabilistic context-free grammars (PCFGs) define probability distributions over derivations of a context-free grammar. We define a PCFG G to be a tuple ⟨W, N, R, S, θ⟩: a set of terminals W, a set of nonterminals N, productions R, a start symbol S ∈ N, and a vector of rule probabilities θ. The set of rules that rewrite nonterminal c is R(c). For a more complete description of PCFGs, see Manning and Schütze (1999).

PCFGs typically use nonterminals with a syntactic interpretation. A sequence of terminals (the yield) is generated by recursively rewriting nonterminals as sequences of child symbols (either a nonterminal or a terminal symbol). This builds a hierarchical phrase-tree structure for every yield.

For example, a nonterminal VP represents a verb phrase, which probabilistically rewrites into a sequence of nonterminals V, N (corresponding to verb and noun) using the production rule VP → V N. Both nonterminals can be further rewritten. Each nonterminal has a multinomial distribution over expansions; for example, a multinomial for nonterminal N would rewrite it as "cake" with probability θ_{N→cake} = 0.03. Rewriting terminates when the derivation has reached a terminal symbol such as "cake" (which does not rewrite).

While PCFGs are used both in the supervised setting and in the unsupervised setting, in this paper we assume an unsupervised setting, in which only terminals are observed. Our goal is to predict the underlying phrase-structure tree.
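To make the rewriting process concrete, here is a minimal Python sketch, under the assumption of a toy grammar, that stores a PCFG as a map from nonterminals to weighted productions and samples a phrase-structure tree and its yield. The rules and probabilities (e.g., N rewriting to "cake" with probability 0.03) are illustrative only.

```python
import random

# A toy PCFG: each nonterminal maps to a list of (right-hand side, probability)
# pairs. Symbols that never appear as keys are treated as terminals.
pcfg = {
    "S": [(("V", "N"), 1.0)],                      # S -> V N
    "V": [(("eat",), 0.6), (("bake",), 0.4)],      # V -> eat | bake
    "N": [(("cake",), 0.03), (("bread",), 0.97)],  # N -> cake | bread
}

def sample_tree(symbol, rules):
    """Recursively rewrite `symbol` until only terminals remain."""
    if symbol not in rules:                # terminal: stop rewriting
        return symbol
    rhss, probs = zip(*rules[symbol])
    children = random.choices(rhss, weights=probs, k=1)[0]
    return (symbol, [sample_tree(child, rules) for child in children])

def tree_yield(tree):
    """Read off the terminal string (the yield) of a sampled tree."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [leaf for child in children for leaf in tree_yield(child)]

tree = sample_tree("S", pcfg)
print(tree)              # e.g. ('S', [('V', ['eat']), ('N', ['bread'])])
print(tree_yield(tree))  # e.g. ['eat', 'bread']
```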
2.2 Adaptor Grammars

PCFGs assume that the rewriting operations are independent given the nonterminal. This context-freeness assumption often is too strong for modeling natural language.

Adaptor grammars break this independence assumption by transforming a PCFG's distribution over trees G_c rooted at nonterminal c into a richer distribution H_c over the trees headed by a nonterminal c, which is often referred to as the grammaton.

A Pitman-Yor adaptor grammar (PYAG) forms the adapted tree distributions H_c using a Pitman-Yor process (Pitman and Yor, 1997, PY), a generalization of the Dirichlet process (Ferguson, 1973, DP).¹ A draw H_c ≡ (π_c, z_c) is formed by the stick-breaking process (Sudderth and Jordan, 2008, PYGEM) parametrized by scale parameter a, discount factor b, and base distribution G_c:

\[
\pi'_k \sim \mathrm{Beta}(1 - b,\, a + kb), \qquad z_k \sim G_c, \qquad
\pi_k \equiv \pi'_k \prod_{j=1}^{k-1} (1 - \pi'_j), \qquad
H \equiv \sum_k \pi_k \delta_{z_k}. \tag{1}
\]

Intuitively, the distribution H_c is a discrete reconstruction of the atoms sampled from G_c; hence, it reweights G_c. Grammaton H_c assigns non-zero stick-breaking weights π to a countably infinite number of parse trees z. We describe learning these grammatons in Section 3.

More formally, a PYAG is a quintuple A = ⟨G, M, a, b, α⟩ with: a PCFG G; a set of adapted nonterminals M ⊆ N; Pitman-Yor process parameters a_c, b_c at each adaptor c ∈ M; and Dirichlet parameters α_c for each nonterminal c ∈ N. We also assume an order on the adapted nonterminals c_1, . . . , c_{|M|} such that c_j is not reachable from c_i in a derivation if j > i.²

Algorithm 1 describes the generative process of an adaptor grammar on a set of D observed sentences x_1, . . . , x_D.

Algorithm 1 Generative Process
1: For nonterminals c ∈ N, draw rule probabilities θ_c ∼ Dir(α_c) for PCFG G.
2: for adapted nonterminal c in c_1, . . . , c_{|M|} do
3:   Draw grammaton H_c ∼ PYGEM(a_c, b_c, G_c) according to Equation 1, where G_c is defined by the PCFG rules R.
4: For i ∈ {1, . . . , D}, generate a phrase-structure tree t_{S,i} using the PCFG rules R(e) at non-adapted nonterminals e and the grammatons H_c at adapted nonterminals c.
5: The yields of trees t_1, . . . , t_D are the observations x_1, . . . , x_D.

¹ Adaptor grammars, in their general form, do not have to use the Pitman-Yor process, but they have only been used with the Pitman-Yor process.
² This is possible because we assume that recursive nonterminals are not adapted.

Given a PYAG A, the joint probability of a set of sentences X and its collection of trees T is

\[
p(X, T, \pi, \theta, z \mid A)
= \prod_{c \in M} p(\pi_c \mid a_c, b_c)\, p(z_c \mid G_c)
\cdot \prod_{c \in N} p(\theta_c \mid \alpha_c)
\prod_{x_d \in X} p(x_d, t_d \mid \theta, \pi, z),
\]

where x_d and t_d represent the d-th observed string and its corresponding parse. The multinomial PCFG parameter θ_c is drawn from a Dirichlet distribution at each nonterminal c ∈ N. At each adapted nonterminal c ∈ M, the stick-breaking weights π_c are drawn from a PYGEM (Equation 1). Each weight has an associated atom z_{c,i} from base distribution G_c, a subtree rooted at c. The probability p(x_d, t_d | θ, π, z) is the PCFG likelihood of yield x_d with parse tree t_d.

Adaptor grammars require a base PCFG that does not have recursive adapted nonterminals, i.e., there cannot be a path in a derivation from a given adapted nonterminal to a second appearance of that adapted nonterminal.

[…] of q over parse trees z. Instead of explicitly computing the variational distribution for all parameters, one can sample from it. This produces a sparse approximation of the variational distribution, which improves both scalability and performance. Sparse distributions are easier to store and transmit in implementations, which improves scalability. Mimno et al. (2012) also show that sparse representations improve performance. Moreover, because it can flexibly adjust its support, it is a necessary prerequisite to online inference (Section 4).

3.1 Variational Lower Bound

We posit a mean-field variational distribution:

\[
q(\pi, \theta, T \mid \gamma, \nu, \phi)
= \prod_{c \in M} \prod_{i=1}^{\infty} q(\pi'_{c,i} \mid \nu^{1}_{c,i}, \nu^{2}_{c,i})
\cdot \prod_{c \in N} q(\theta_c \mid \gamma_c)
\prod_{x_d \in X} q(t_d \mid \phi_d), \tag{2}
\]
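The stick-breaking construction in Equation 1, whose stick proportions π'_{c,i} are the quantities the factors q(π'_{c,i} | ν¹_{c,i}, ν²_{c,i}) in Equation 2 approximate, can be pictured with a short Python sketch. It draws a truncated PYGEM measure: stick proportions from Beta(1 - b, a + kb), atoms from a stand-in base distribution, and weights from the running product of leftover stick lengths. The truncation level K and the toy base distribution are assumptions made for illustration only; in practice the number of sticks is kept finite and, in the online setting described in the introduction, extended as more data are observed.

```python
import random

def pygem_sticks(a, b, K, base_sample):
    """Truncated Pitman-Yor stick-breaking draw in the spirit of Equation 1.

    a: scale parameter, b: discount factor, K: truncation level (an assumption),
    base_sample: draws an atom from the base distribution G_c (here a stand-in
    for drawing a subtree). Returns (weight, atom) pairs; the weights sum to
    less than one, with the remainder belonging to the truncated tail.
    """
    sticks = []
    remaining = 1.0
    for k in range(1, K + 1):
        # pi'_k ~ Beta(1 - b, a + k * b)
        pi_prime = random.betavariate(1.0 - b, a + k * b)
        # pi_k = pi'_k * prod_{j < k} (1 - pi'_j)
        sticks.append((remaining * pi_prime, base_sample()))
        remaining *= (1.0 - pi_prime)
    return sticks

# Toy discrete base distribution: pretend the atoms are subtrees drawn from G_c.
toy_trees = ["(N cake)", "(N bread)", "(N tea)"]
H = pygem_sticks(a=1.0, b=0.5, K=10,
                 base_sample=lambda: random.choice(toy_trees))
for weight, tree in H:
    print(f"{weight:.3f}  {tree}")
```

Because the toy base distribution is discrete, the same atom can receive several sticks, so the resulting measure reweights the base distribution, mirroring the intuition given above for H_c.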