
Online Adaptor Grammars with Hybrid Inference

Ke Zhai, Computer Science and UMIACS, University of Maryland, College Park, MD, USA ([email protected])
Jordan Boyd-Graber, Computer Science, University of Colorado, Boulder, CO, USA ([email protected])
Shay B. Cohen, School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK ([email protected])

Abstract

Adaptor grammars are a flexible, powerful formalism for defining nonparametric, unsupervised models of grammar productions. This flexibility comes at the cost of expensive inference. We address the difficulty of inference through an online algorithm which uses a hybrid of Markov chain Monte Carlo and variational inference. We show that this inference strategy improves scalability without sacrificing performance on unsupervised word segmentation and topic modeling tasks.

1 Introduction

Nonparametric Bayesian models are effective tools to discover latent structure in data (Müller and Quintana, 2004). These models have had great success in text analysis, especially syntax (Shindo et al., 2012). Nonparametric distributions provide support over the countably infinite, long-tailed distributions common in natural language (Goldwater et al., 2011).

We focus on adaptor grammars (Johnson et al., 2006), syntactic nonparametric models based on probabilistic context-free grammars. Adaptor grammars weaken the strong statistical independence assumptions PCFGs make (Section 2).

The weaker statistical independence assumptions that adaptor grammars make come at the cost of expensive inference. Adaptor grammars are not alone in this trade-off. For example, nonparametric extensions of topic models (Teh et al., 2006) have substantially more expensive inference than their parametric counterparts (Yao et al., 2009).

A common approach to address this computational bottleneck is through variational inference (Wainwright and Jordan, 2008). One of the advantages of variational inference is that it can be easily parallelized (Nallapati et al., 2007) or transformed into an online algorithm (Hoffman et al., 2010), which often converges in fewer iterations than batch variational inference.

Past variational inference techniques for adaptor grammars assume a preprocessing step that looks at all available data to establish the support of these nonparametric distributions (Cohen et al., 2010). Thus, these past approaches are not directly amenable to online inference.

Markov chain Monte Carlo (MCMC) inference, an alternative to variational inference, does not have this disadvantage. MCMC is easier to implement, and it discovers the support of nonparametric models during inference rather than assuming it a priori.

We apply stochastic hybrid inference (Mimno et al., 2012) to adaptor grammars to get the best of both worlds. We interleave MCMC inference inside variational inference. This preserves the scalability of variational inference while adding the sparse statistics and improved exploration MCMC provides.

Our inference algorithm for adaptor grammars starts with a variational algorithm similar to Cohen et al. (2010) and adds hybrid sampling within variational inference (Section 3). This obviates the need for expensive preprocessing and is a necessary step to create an online algorithm for adaptor grammars.

Our online extension (Section 4) processes examples in small batches taken from a stream of data. As data arrive, the algorithm dynamically extends the underlying approximate posterior distributions as more data are observed. This makes the algorithm flexible, scalable, and amenable to datasets that cannot be examined exhaustively because of their size—e.g., terabytes of social media data appear every second—or their nature—e.g., speech acquisition, where a language learner is limited to the bandwidth of the human perceptual system and cannot acquire data in a monolithic batch (Börschinger and Johnson, 2012).

We show our approach's scalability and effectiveness by applying our inference framework in Section 5 on two tasks: unsupervised word segmentation and infinite-vocabulary topic modeling.


Algorithm 1 Generative Process
1: For nonterminals $c \in N$, draw rule probabilities $\theta_c \sim \mathrm{Dir}(\alpha_c)$ for PCFG $G$.
2: for adapted nonterminal $c$ in $c_1, \ldots, c_{|M|}$ do
3:   Draw grammaton $H_c \sim \mathrm{PYGEM}(a_c, b_c, G_c)$ according to Equation 1, where $G_c$ is defined by the PCFG rules $R$.
4: For $i \in \{1, \ldots, D\}$, generate a phrase-structure tree $t_{S,i}$ using the PCFG rules $R(e)$ at non-adapted nonterminals $e$ and the grammatons $H_c$ at adapted nonterminals $c$.
5: The yields of trees $t_1, \ldots, t_D$ are observations $x_1, \ldots, x_D$.

2 Background

In this section, we review probabilistic context-free grammars and adaptor grammars.

2.1 Probabilistic Context-free Grammars

Probabilistic context-free grammars (PCFGs) define probability distributions over derivations of a context-free grammar. We define a PCFG $G$ to be a tuple $\langle W, N, R, S, \theta \rangle$: a set of terminals $W$, a set of nonterminals $N$, productions $R$, start symbol $S \in N$, and a vector of rule probabilities $\theta$. The rules that rewrite nonterminal $c$ are $R(c)$. For a more complete description of PCFGs, see Manning and Schütze (1999).

PCFGs typically use nonterminals with a syntactic interpretation. A sequence of terminals (the yield) is generated by recursively rewriting nonterminals as sequences of child symbols (either a nonterminal or a terminal). This builds a hierarchical phrase-tree structure for every yield.

For example, a nonterminal VP represents a verb phrase, which probabilistically rewrites into a sequence of nonterminals V, N (corresponding to verb and noun) using the production rule VP -> V N. Both nonterminals can be further rewritten. Each nonterminal has a multinomial distribution over expansions; for example, a multinomial for nonterminal N would rewrite as "cake" with probability $\theta_{N \rightarrow \text{cake}} = 0.03$. Rewriting terminates when the derivation has reached a terminal symbol such as "cake" (which does not rewrite).

While PCFGs are used both in the supervised setting and in the unsupervised setting, in this paper we assume an unsupervised setting, in which only terminals are observed. Our goal is to predict the underlying phrase-structure tree.

2.2 Adaptor Grammars

PCFGs assume that the rewriting operations are independent given the nonterminal. This context-freeness assumption often is too strong for modeling natural language.

Adaptor grammars break this independence assumption by transforming a PCFG's distribution $G_c$ over trees rooted at nonterminal $c$ into a richer distribution $H_c$ over the trees headed by $c$, which is often referred to as the grammaton.

A Pitman-Yor adaptor grammar (PYAG) forms the adapted tree distributions $H_c$ using a Pitman-Yor process (Pitman and Yor, 1997, PY), a generalization of the Dirichlet process (Ferguson, 1973, DP).[1] A draw $H_c \equiv (\pi_c, z_c)$ is formed by the stick-breaking process (Sudderth and Jordan, 2008, PYGEM) parametrized by scale parameter $a$, discount factor $b$, and base distribution $G_c$:

    $\pi'_k \sim \mathrm{Beta}(1 - b, a + kb)$,  $z_k \sim G_c$,
    $\pi_k \equiv \pi'_k \prod_{j=1}^{k-1} (1 - \pi'_j)$,  $H \equiv \sum_k \pi_k \delta_{z_k}$.    (1)

Intuitively, the distribution $H_c$ is a discrete reconstruction of the atoms sampled from $G_c$—hence, it reweights $G_c$. Grammaton $H_c$ assigns non-zero stick-breaking weights $\pi$ to a countably infinite number of parse trees $z$. We describe learning these grammatons in Section 3.

More formally, a PYAG is a quintuple $A = \langle G, M, a, b, \alpha \rangle$ with: a PCFG $G$; a set of adapted nonterminals $M \subseteq N$; Pitman-Yor process parameters $a_c, b_c$ at each adaptor $c \in M$; and Dirichlet parameters $\alpha_c$ for each nonterminal $c \in N$. We also assume an order on the adapted nonterminals, $c_1, \ldots, c_{|M|}$, such that $c_j$ is not reachable from $c_i$ in a derivation if $j > i$.[2]

Algorithm 1 describes the generative process of an adaptor grammar on a set of $D$ observed sentences $x_1, \ldots, x_D$.

[1] Adaptor grammars, in their general form, do not have to use the Pitman-Yor process, but they have only been used with the Pitman-Yor process.
[2] This is possible because we assume that recursive nonterminals are not adapted.
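To make the stick-breaking construction in Equation 1 concrete, here is a minimal sketch that draws a truncated grammaton: it samples stick proportions $\pi'_k \sim \mathrm{Beta}(1-b, a+kb)$, converts them to weights $\pi_k$, and pairs each weight with an atom drawn from the base distribution. The helper `sample_tree_from_base` is a hypothetical stand-in for drawing a subtree from the base PCFG $G_c$; this is an illustration under those assumptions, not the paper's released code.

```python
import numpy as np

def sample_truncated_grammaton(a, b, sample_tree_from_base, K, rng=None):
    """Draw a K-atom truncation of H_c ~ PYGEM(a, b, G_c) (Equation 1).

    a: Pitman-Yor scale (concentration) parameter
    b: Pitman-Yor discount factor
    sample_tree_from_base: callable returning one subtree drawn from G_c
    K: truncation level
    """
    rng = rng or np.random.default_rng()
    # Stick proportions pi'_k ~ Beta(1 - b, a + k * b) for k = 1..K
    pi_prime = np.array([rng.beta(1.0 - b, a + k * b) for k in range(1, K + 1)])
    pi_prime[-1] = 1.0  # truncation: the last stick takes all remaining mass
    # pi_k = pi'_k * prod_{j < k} (1 - pi'_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - pi_prime[:-1])))
    weights = pi_prime * remaining
    atoms = [sample_tree_from_base() for _ in range(K)]
    return list(zip(weights, atoms))
```

With $b = 0$ this reduces to the familiar Dirichlet process stick-breaking construction.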

Given a PYAG $A$, the joint probability for a set of sentences $X$ and its collection of trees $T$ is

    $p(X, T, \pi, \theta, z \mid A) = \prod_{c \in M} p(\pi_c \mid a_c, b_c)\, p(z_c \mid G_c) \cdot \prod_{c \in N} p(\theta_c \mid \alpha_c) \prod_{x_d \in X} p(x_d, t_d \mid \theta, \pi, z)$,

where $x_d$ and $t_d$ represent the $d$-th observed string and its corresponding parse. The multinomial PCFG parameter $\theta_c$ is drawn from a Dirichlet distribution at nonterminal $c \in N$. At each adapted nonterminal $c \in M$, the stick-breaking weights $\pi_c$ are drawn from a PYGEM (Equation 1). Each weight has an associated atom $z_{c,i}$ from base distribution $G_c$, a subtree rooted at $c$. The probability $p(x_d, t_d \mid \theta, \pi, z)$ is the PCFG likelihood of yield $x_d$ with parse tree $t_d$.

Adaptor grammars require a base PCFG that does not have recursive adapted nonterminals, i.e., there cannot be a path in a derivation from a given adapted nonterminal to a second appearance of that adapted nonterminal.

3 Hybrid Variational-MCMC Inference

Discovering the latent variables of the model—trees, adapted probabilities, and PCFG rules—is a problem of posterior inference given observed data. Previous approaches use MCMC (Johnson et al., 2006) or variational inference (Cohen et al., 2010).

MCMC discovers the support of nonparametric models during the inference, but does not scale to larger datasets (due to tight coupling of variables). Variational inference, however, is inherently parallel and easily amenable to online inference, but requires preprocessing to discover the adapted productions. We combine the best of both worlds and propose a hybrid variational-MCMC inference algorithm for adaptor grammars.

Variational inference posits a variational distribution over the latent variables in the model; this in turn induces an "evidence lower bound" (ELBO, $\mathcal{L}$) as a function of a variational distribution $q$, a lower bound on the marginal log-likelihood. Variational inference optimizes this objective function with respect to the parameters that define $q$.

In this section, we derive coordinate-ascent updates for these variational parameters. A key mathematical component is taking expectations with respect to the variational distribution $q$. We strategically use MCMC sampling to compute the expectation of $q$ over parse trees $z$. Instead of explicitly computing the variational distribution for all parameters, one can sample from it. This produces a sparse approximation of the variational distribution, which improves both scalability and performance. Sparse distributions are easier to store and transmit in implementations, which improves scalability. Mimno et al. (2012) also show that sparse representations improve performance. Moreover, because it can flexibly adjust its support, this approach is a necessary prerequisite to online inference (Section 4).

3.1 Variational Lower Bound

We posit a mean-field variational distribution:

    $q(\pi, \theta, T \mid \gamma, \nu, \phi) = \prod_{c \in M} \prod_{i=1}^{\infty} q(\pi'_{c,i} \mid \nu^1_{c,i}, \nu^2_{c,i}) \cdot \prod_{c \in N} q(\theta_c \mid \gamma_c) \prod_{x_d \in X} q(t_d \mid \phi_d)$,    (2)

where $\pi'_{c,i}$ is drawn from a variational Beta distribution parameterized by $\nu^1_{c,i}, \nu^2_{c,i}$, and $\theta_c$ is drawn from a variational Dirichlet prior $\gamma_c \in \mathbb{R}_+^{|R(c)|}$. Index $i$ ranges over a possibly infinite number of adapted rules. The parse for the $d$-th observation, $t_d$, is modeled by a multinomial $\phi_d$, where $\phi_{d,i}$ is the probability of generating the $i$-th phrase-structure tree $t_{d,i}$.

The variational distribution over latent variables induces the following ELBO on the likelihood:

    $\mathcal{L}(z, \pi, \theta, T, D; a, b, \alpha) = H[q(\theta, \pi, T)]$
    $\quad + \sum_{c \in N} E_q[\log p(\theta_c \mid \alpha_c)]$    (3)
    $\quad + \sum_{c \in M} \sum_{i=1}^{\infty} E_q[\log p(\pi'_{c,i} \mid a_c, b_c)]$
    $\quad + \sum_{c \in M} \sum_{i=1}^{\infty} E_q[\log p(z_{c,i} \mid \pi, \theta)]$
    $\quad + \sum_{x_d \in X} E_q[\log p(x_d, t_d \mid \pi, \theta, z)]$,

where $H[\cdot]$ is the entropy function.

To make this lower bound tractable, we truncate the distribution over $\pi$ to a finite set (Blei and Jordan, 2005) for each adapted nonterminal $c \in M$, i.e., $\pi'_{c,K_c} \equiv 1$ for some index $K_c$. Because the atom weights $\pi_k$ are deterministically defined by Equation 1, this implies that $\pi_{c,i}$ is zero beyond index $K_c$. Each weight $\pi_{c,i}$ is associated with an atom $z_{c,i}$, a subtree rooted at $c$. We call the ordered set of $z_{c,i}$ the truncated nonterminal grammaton (TNG). Each adapted nonterminal $c \in M$ has its own TNG$_c$. The $i$-th subtree in TNG$_c$ is denoted TNG$_c(i)$.
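In implementation terms, a TNG is just an ordered list of cached subtrees, each paired with the variational Beta parameters $\nu^1, \nu^2$ of its stick. The sketch below shows one plausible way to keep that bookkeeping; the class and field names are illustrative assumptions, not the data structures used in the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TNG:
    """Truncated nonterminal grammaton for one adapted nonterminal c."""
    max_size: int                                 # truncation level K_c
    trees: list = field(default_factory=list)     # cached subtrees z_{c,i}
    nu1: list = field(default_factory=list)       # variational Beta parameters nu^1_{c,i}
    nu2: list = field(default_factory=list)       # variational Beta parameters nu^2_{c,i}

    def index(self, tree):
        """Return the index of a cached subtree, or None if it is not adapted yet."""
        try:
            return self.trees.index(tree)
        except ValueError:
            return None

    def add(self, tree, nu1_init=1.0, nu2_init=1.0):
        """Append a newly discovered adapted production (refined later, Section 4.1)."""
        if self.index(tree) is None:
            self.trees.append(tree)
            self.nu1.append(nu1_init)
            self.nu2.append(nu2_init)
```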

In the rest of this section, we describe approximate inference to maximize $\mathcal{L}$. The most important update is $\phi_{d,i}$, which we update using stochastic MCMC inference (Section 3.2). Past variational approaches for adaptor grammars (Cohen et al., 2010) rely on a preprocessing step and heuristics to define a static TNG. In contrast, our model dynamically discovers trees. The TNG grows as the model sees more data, allowing online updates (Section 4).

The remaining variational parameters are optimized using expected counts of adaptor grammar rules. These expected counts are described in Section 3.3, and the variational updates for the variational parameters excluding $\phi_{d,i}$ are described in Section 3.4.

3.2 Stochastic MCMC Inference

Each observation $x_d$ has an associated variational multinomial distribution $\phi_d$ over trees $t_d$ that can yield observation $x_d$ with probability $\phi_{d,i}$. Holding all other variational parameters fixed, the coordinate-ascent update (Mimno et al., 2012; Bishop, 2006) for $\phi_{d,i}$ is

    $\phi_{d,i} \propto \exp\{ E_q^{\neg \phi_d}[\log p(t_{d,i} \mid x_d, \pi, \theta, z)] \}$,    (4)

where $\phi_{d,i}$ is the probability of generating the $i$-th phrase-structure tree $t_{d,i}$ and $E_q^{\neg \phi_d}[\cdot]$ is the expectation with respect to the variational distribution $q$, excluding the value of $\phi_d$.

Instead of computing this expectation explicitly, we turn to stochastic variational inference (Mimno et al., 2012; Hoffman et al., 2013) and sample from this distribution. This produces a set of sampled trees $\sigma_d \equiv \{\sigma_{d,1}, \ldots, \sigma_{d,k}\}$. From this set of trees we can approximate our variational distribution over trees $\phi$ using the empirical distribution $\sigma_d$, i.e.,

    $\phi_{d,i} \propto \mathbb{I}[\sigma_{d,j} = t_{d,i}], \quad \forall \sigma_{d,j} \in \sigma_d$.    (5)

This leads to a sparse approximation of the variational distribution $\phi$.[3]

Previous inference strategies for adaptor grammars (Johnson et al., 2006; Börschinger and Johnson, 2012) have used sampling. These adaptor grammar inference methods use an approximate PCFG to emulate the marginalized Pitman-Yor distributions at each nonterminal. Given this approximate PCFG, we can then sample a derivation $z$ for string $x$ from the possible trees (Johnson et al., 2007).

Sampling requires a derived PCFG $G'$ that approximates the distribution over tree derivations conditioned on a yield. It includes the original PCFG rules $R = \{c \rightarrow \beta\}$ that define the base distribution and the new adapted productions $R' = \{c \Rightarrow z,\ z \in \mathrm{TNG}_c\}$. Under $G'$, the probability $\theta'$ of adapted production $c \Rightarrow z$ is

    $\log \theta'_{c \Rightarrow z} = \begin{cases} E_q[\log \pi_{c,i}], & \text{if } \mathrm{TNG}_c(i) = z \\ E_q[\log \pi_{c,K_c}] + E_q[\log \theta_{c \Rightarrow z}], & \text{otherwise} \end{cases}$    (6)

where $K_c$ is the truncation level of TNG$_c$ and $\pi_{c,K_c}$ represents the left-over stick weight in the stick-breaking process for adaptor $c \in M$. $\theta_{c \Rightarrow z}$ represents the probability of generating tree $c \Rightarrow z$ under the base distribution. See also Cohen (2011).

The expectation of the Pitman-Yor multinomial $\pi_{c,i}$ under the truncated variational stick-breaking distribution is

    $E_q[\log \pi_{a,i}] = \Psi(\nu^1_{a,i}) - \Psi(\nu^1_{a,i} + \nu^2_{a,i}) + \sum_{j=1}^{i-1} \left( \Psi(\nu^2_{a,j}) - \Psi(\nu^1_{a,j} + \nu^2_{a,j}) \right)$,    (7)

and the expectation of generating the phrase-structure tree $a \Rightarrow z$ from PCFG productions under the variational Dirichlet distribution is

    $E_q[\log \theta_{a \Rightarrow z}] = \sum_{c \rightarrow \beta \in a \Rightarrow z} \left( \Psi(\gamma_{c \rightarrow \beta}) - \Psi\!\left( \sum_{c \rightarrow \beta' \in R_c} \gamma_{c \rightarrow \beta'} \right) \right)$,    (8)

where $\Psi(\cdot)$ is the digamma function, and $c \rightarrow \beta \in a \Rightarrow z$ ranges over all PCFG productions in the phrase-structure tree $a \Rightarrow z$.

This PCFG can compose arbitrary subtrees and thus discover new trees that better describe the data, even if those trees are not part of the TNG. This is equivalent to creating a "new table" in MCMC inference and provides truncation-free variational updates (Wang and Blei, 2012) by sampling an unseen subtree with adapted nonterminal $c \in M$ at the root. This frees our model from the preprocessing used to initialize truncated grammatons in Cohen et al. (2010). This stochastic approach has the advantage of creating sparse distributions (Wang and Blei, 2012): few unique trees will be represented.

[3] In our experiments, we use ten samples.
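Equations 6-8 reduce to digamma differences over the variational parameters, which is how the approximate PCFG $G'$ can be assembled in practice. A minimal sketch, assuming `nu1` and `nu2` are arrays over one adaptor's cached subtrees (with the final entry treated as the left-over stick) and `gamma` maps each nonterminal to its variational Dirichlet parameters; these names are assumptions for illustration rather than the paper's interface.

```python
import numpy as np
from scipy.special import digamma

def expected_log_pi(nu1, nu2):
    """E_q[log pi_{a,i}] under the truncated stick-breaking posterior (Equation 7)."""
    e_log_stick = digamma(nu1) - digamma(nu1 + nu2)        # E[log pi'_i]
    e_log_rest = digamma(nu2) - digamma(nu1 + nu2)         # E[log (1 - pi'_j)]
    return e_log_stick + np.concatenate(([0.0], np.cumsum(e_log_rest[:-1])))

def expected_log_theta(gamma_c):
    """E_q[log theta_{c -> beta}] under a variational Dirichlet gamma_c (used in Equation 8)."""
    return digamma(gamma_c) - digamma(gamma_c.sum())

def adapted_rule_log_prob(i, nu1, nu2, base_tree_rules, gamma):
    """log theta'_{c => z} of the approximate PCFG (Equation 6).

    i: index of the subtree in TNG_c, or None if it is not cached
    base_tree_rules: (nonterminal, rule-index) pairs of the PCFG rules in c => z
    """
    e_log_pi = expected_log_pi(nu1, nu2)
    if i is not None:
        return e_log_pi[i]          # cached subtree: reuse an existing "table"
    # uncached: left-over stick weight plus the base-distribution probability
    base = sum(expected_log_theta(gamma[c])[r] for c, r in base_tree_rules)
    return e_log_pi[-1] + base
```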

Figure 1: Given an adaptor grammar, we sample derivations given an approximate PCFG and show how these affect counts. The sampled derivations can be understood via the Chinese restaurant metaphor (Johnson et al., 2006). Existing cached rules (elements in the TNG) can be thought of as occupied tables; this happens in the case of the yield "ba", which increases counts for unadapted rules g and for entries in TNG_A, f. For the yield "ca", there is no appropriate entry in the TNG, so it must use the base distribution, which corresponds to sitting at a new table. This generates counts for g, as it uses the unadapted rule, and for h, which represents entries that could be included in the TNG in the future. The final yield, "ab", shows that even when compatible entries are in the TNG, it might still create a new table, changing the underlying base distribution. (The figure shows a toy grammar with nonterminal A adapted—S -> A B, A -> B, B -> {a, b, c}—and the seating assignments and counts produced by sampled parses of the yields "ba", "ca", and "ab".)

3.3 Calculating Expected Rule Counts

For every observation $x_d$, the hybrid approach produces a set of sampled trees, each of which contains three types of productions: adapted rules, original PCFG rules, and potentially adapted rules. The last set is most important, as these are new rules discovered by the sampler. These are explained using the Chinese restaurant metaphor in Figure 1. The multiset of all adapted productions in tree $t_{d,i}$ is $M(t_{d,i})$, and the multiset of non-adapted productions that generate tree $t_{d,i}$ is $N(t_{d,i})$. We compute three counts:

1: $f$ is the expected number of productions within the TNG. It is the sum over the probability of a tree $t_{d,k}$ times the number of times an adapted production appeared in $t_{d,k}$:

    $f_d(a \Rightarrow z_{a,i}) = \sum_k \phi_{d,k} \, |\{ a \Rightarrow z_{a,i} : a \Rightarrow z_{a,i} \in M(t_{d,k}) \}|$.

2: $g$ is the expected count of the PCFG productions $R$ that define the base distribution of the adaptor grammar:

    $g_d(a \rightarrow \beta) = \sum_k \phi_{d,k} \, |\{ a \rightarrow \beta : a \rightarrow \beta \in N(t_{d,k}) \}|$.

3: Finally, a third set of productions are newly discovered by the sampler and not in the TNG. These subtrees are rules that could be adapted, with expected counts

    $h_d(c \Rightarrow z_{c,i}) = \sum_k \phi_{d,k} \, |\{ c \Rightarrow z_{c,i} : c \Rightarrow z_{c,i} \notin M(t_{d,k}) \}|$.

These subtrees—lists of PCFG rules sampled from Equation 6—correspond to adapted productions not yet present in the TNG.

Parallelization. As noted in Cohen et al. (2010), the inside-outside algorithm dominates the runtime of every iteration, both for sampling and for variational inference. However, unlike MCMC, variational inference is highly parallelizable and requires fewer synchronizations per iteration (Zhai et al., 2012). In our approach, both the inside algorithm and the sampling process can be distributed, and the resulting counts can be aggregated afterwards. In our implementation, we use multiple threads to parallelize tree sampling.

3.4 Variational Updates

Given the sparse vectors $\phi$ sampled from the hybrid MCMC step, we update all variational parameters as

    $\gamma_{a \rightarrow \beta} = \alpha_{a \rightarrow \beta} + \sum_{x_d \in X} g_d(a \rightarrow \beta) + \sum_{b \in M} \sum_{i=1}^{K_b} n(a \rightarrow \beta, z_{b,i})$,
    $\nu^1_{a,i} = 1 - b_a + \sum_{x_d \in X} f_d(a \Rightarrow z_{a,i}) + \sum_{b \in M} \sum_{k=1}^{K_b} n(a \Rightarrow z_{a,i}, z_{b,k})$,
    $\nu^2_{a,i} = a_a + i b_a + \sum_{x_d \in X} \sum_{j=i+1}^{K_a} f_d(a \Rightarrow z_{a,j}) + \sum_{b \in M} \sum_{k=1}^{K_b} \sum_{j=i+1}^{K_a} n(a \Rightarrow z_{a,j}, z_{b,k})$,

where $n(r, t)$ is the expected number of times production $r$ appears in tree $t$, estimated during sampling.

Hyperparameter Update. We update our PCFG hyperparameter $\alpha$ and PYGEM hyperparameters $a$ and $b$ as in Cohen et al. (2010).
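As a concrete illustration of the three expected counts of Section 3.3, the sketch below accumulates $f$, $g$, and $h$ from a set of sampled trees. It assumes each sampled tree is represented as two lists of productions, `adapted_rules` (rules of the form c => z, i.e., subtrees rooted at adapted nonterminals) and `base_rules` (ordinary PCFG rules), plus a callable telling whether an adapted rule is already cached in the TNG; these representations are assumptions for illustration, not the paper's data structures.

```python
from collections import Counter

def expected_counts(sampled_trees, in_tng):
    """Accumulate f, g, h counts (Section 3.3) from equally weighted sampled trees.

    sampled_trees: list of (adapted_rules, base_rules) pairs for one observation
    in_tng: callable returning True if an adapted production is already cached
    """
    f, g, h = Counter(), Counter(), Counter()
    weight = 1.0 / len(sampled_trees)      # empirical phi_{d,k} from Equation 5
    for adapted_rules, base_rules in sampled_trees:
        for rule in base_rules:            # original PCFG productions -> g
            g[rule] += weight
        for rule in adapted_rules:
            if in_tng(rule):               # cached adapted production -> f
                f[rule] += weight
            else:                          # newly discovered subtree -> h
                h[rule] += weight
    return f, g, h
```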
4 Online Variational Inference

Online inference for probabilistic models requires us to update our posterior distribution as new observations arrive. Unlike batch inference algorithms, we do not assume we always have access to the entire dataset. Instead, we assume that observations arrive in small groups called minibatches. The advantage of online inference is threefold: a) it does not require retaining the whole dataset in memory; b) each online update is fast; and c) the model usually converges faster. All of these make adaptor grammars scalable to larger datasets.

Our approach is based on the stochastic variational inference for topic models (Hoffman et al., 2013). This inference strategy uses a form of stochastic gradient descent (Bottou, 1998): using the gradient of the ELBO, it finds the sufficient statistics necessary to update variational parameters (which are mostly expected counts calculated using the inside-outside algorithm) and interpolates the result with the current model.

We assume data arrive in minibatches $B_l$ (sets of sentences). We accumulate expected counts

    $\tilde{f}^{(l)}(a \Rightarrow z_{a,i}) = (1 - \epsilon) \cdot \tilde{f}^{(l-1)}(a \Rightarrow z_{a,i}) + \epsilon \cdot \frac{|X|}{|B_l|} \sum_{x_d \in B_l} f_d(a \Rightarrow z_{a,i})$,    (9)
    $\tilde{g}^{(l)}(a \rightarrow \beta) = (1 - \epsilon) \cdot \tilde{g}^{(l-1)}(a \rightarrow \beta) + \epsilon \cdot \frac{|X|}{|B_l|} \sum_{x_d \in B_l} g_d(a \rightarrow \beta)$,    (10)

with decay factor $\epsilon \in (0, 1)$ to guarantee convergence. We set it to $\epsilon = (\tau + l)^{-\kappa}$, where $l$ is the minibatch counter. The decay inertia $\tau$ prevents premature convergence, and the decay rate $\kappa$ controls the speed of change in the sufficient statistics (Hoffman et al., 2010). We recover the batch variational approach when $B = D$ and $\kappa = 0$.

The variables $\tilde{f}^{(l)}$ and $\tilde{g}^{(l)}$ are the accumulated sufficient statistics of adapted and unadapted productions after processing minibatch $B_l$. They update the approximate gradient. The updates for the variational parameters become

    $\gamma_{a \rightarrow \beta} = \alpha_{a \rightarrow \beta} + \tilde{g}^{(l)}(a \rightarrow \beta) + \sum_{b \in M} \sum_{i=1}^{K_b} n(a \rightarrow \beta, z_{b,i})$,    (11)
    $\nu^1_{a,i} = 1 - b_a + \tilde{f}^{(l)}(a \Rightarrow z_{a,i}) + \sum_{b \in M} \sum_{k=1}^{K_b} n(a \Rightarrow z_{a,i}, z_{b,k})$,    (12)
    $\nu^2_{a,i} = a_a + i b_a + \sum_{j=i+1}^{K_a} \tilde{f}^{(l)}(a \Rightarrow z_{a,j}) + \sum_{b \in M} \sum_{k=1}^{K_b} \sum_{j=i+1}^{K_a} n(a \Rightarrow z_{a,j}, z_{b,k})$,    (13)

where $K_a$ is the size of the TNG at adaptor $a \in M$.

4.1 Refining the Truncation

As we observe more data during inference, our TNGs need to change. New rules should be added, useless rules should be removed, and derivations for existing rules should be updated. In this section, we describe heuristics for performing each of these operations.

Adding Productions. Sampling can identify productions that are not adapted but were instead drawn from the base distribution. These are candidates for the TNG. For every nonterminal $a$, we add these potentially adapted productions to TNG$_a$ after each minibatch. The count associated with a candidate production is now associated with an adapted production, i.e., the $h$ count contributes to the relevant $f$ count. This mechanism dynamically expands TNG$_a$.

Sorting and Removing Productions. Our model does not require a preprocessing step to initialize the TNGs; rather, it constructs and expands all TNGs on the fly. To prevent the TNGs from growing unwieldy, we prune them after every $u$ minibatches. As a result, we need to impose an ordering over all the parse trees in the TNG. The underlying PYGEM distribution implicitly places a ranking over all the atoms according to their corresponding sufficient statistics (Kurihara et al., 2007), as shown in Equation 9. This measures the "usefulness" of every adapted production throughout the inference process.

In addition to the accumulated sufficient statistics, Cohen et al. (2010) add a secondary term to discourage short constituents (Mochihashi et al., 2009). We impose a reward term for longer phrases in addition to $\tilde{f}$ and sort all adapted productions in TNG$_a$ using the ranking score

    $\Lambda(a \Rightarrow z_{a,i}) = \tilde{f}^{(l)}(a \Rightarrow z_{a,i}) \cdot \epsilon \cdot \log(|s| + 1)$,

where $|s|$ is the number of yields in production $a \Rightarrow z_{a,i}$. Because $\epsilon$ decreases with each minibatch, the reward for long phrases diminishes. This is similar to an annealed version of Cohen et al. (2010)—where the reward for long phrases is fixed; see also Mochihashi et al. (2009). After sorting, we remove all but the top $K_a$ adapted productions.
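The online updates in Equations 9 and 10 interpolate the accumulated sufficient statistics with rescaled counts from the current minibatch, and Section 4.1 then prunes each TNG with the ranking score $\Lambda$. A minimal sketch of both steps, assuming dictionaries keyed by adapted productions; the names `f_tilde`, `minibatch_counts`, and `yield_length` are introduced here for illustration only.

```python
import math

def decay_factor(l, tau, kappa):
    """epsilon = (tau + l)^(-kappa), with decay inertia tau and decay rate kappa."""
    return (tau + l) ** (-kappa)

def update_sufficient_statistics(f_tilde, minibatch_counts, epsilon, corpus_size, batch_size):
    """Interpolate accumulated counts with rescaled minibatch counts (Equations 9-10).

    The same update applies to the unadapted statistics g_tilde.
    """
    scale = epsilon * corpus_size / batch_size
    for rule in set(f_tilde) | set(minibatch_counts):
        f_tilde[rule] = ((1.0 - epsilon) * f_tilde.get(rule, 0.0)
                         + scale * minibatch_counts.get(rule, 0.0))
    return f_tilde

def prune_tng(f_tilde, yield_length, epsilon, max_size):
    """Keep the top-K adapted productions by the ranking score Lambda (Section 4.1)."""
    score = {rule: count * epsilon * math.log(yield_length[rule] + 1)
             for rule, count in f_tilde.items()}
    ranked = sorted(score, key=score.get, reverse=True)
    return ranked[:max_size]
```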
Rederiving Adapted Productions. For MCMC inference, Johnson and Goldwater (2009) observe that atoms already associated with a yield may have trees that do not explain their yield well. They propose table label resampling to rederive yields. In our approach this is equivalent to "mutating" some derivations in a TNG. After pruning rules every $u$ minibatches, we perform table label resampling for adapted nonterminals from general to specific (i.e., a topological sort). This provides better expected counts $n(r, \cdot)$ for rules used in phrase-structure subtrees. Empirically, we find table label resampling only marginally improves the word-segmentation result.

Initialization. Our inference begins with random variational Dirichlets and empty TNGs, which obviates the preprocessing step in Cohen et al. (2010). Our model constructs and expands all TNGs on the fly. It mimics the incremental initialization of Johnson and Goldwater (2009). Algorithm 2 summarizes the pseudo-code of our online approach.

Algorithm 2 Online inference for adaptor grammars
1: Randomly initialize all variational parameters.
2: for minibatch $B_l$, $l = 1, 2, \ldots$ do
3:   Construct approximate PCFG $\theta'$ of $A$ (Equation 6).
4:   for input sentence $d = 1, 2, \ldots, D_l$ do
5:     Accumulate inside probabilities from approximate PCFG $\theta'$.
6:     Sample phrase-structure trees $\sigma$ and update the tree distribution $\phi$ (Equation 5).
7:     For every adapted nonterminal $c$, append adapted productions to TNG$_c$.
8:   Accumulate sufficient statistics (Equations 9 and 10).
9:   Update $\gamma$, $\nu^1$, and $\nu^2$ (Equations 11-13).
10:  Refine and prune the truncation every $u$ minibatches.

Table 1: Grammars used in our experiments.

  Word segmentation grammar (unigram and collocation variants):
    SENT -> COLLOC SENT    SENT -> COLLOC
    COLLOC -> WORDS
    WORDS -> WORD WORDS    WORDS -> WORD
    WORD -> CHARS
    CHARS -> CHAR CHARS    CHARS -> CHAR
    CHAR -> ?

  InfVoc LDA grammar:
    SENT -> DOC_j                  j = 1, 2, ..., D
    DOC_j -> -_j TOPIC_i           i = 1, 2, ..., K
    TOPIC_i -> WORD
    WORD -> CHARS
    CHARS -> CHAR CHARS    CHARS -> CHAR
    CHAR -> ?

The nonterminal CHAR is a non-adapted rule that expands to all characters used in the data, sometimes called a pre-terminal. Adapted nonterminals are marked by underlining in the original table: for the unigram grammar, only the nonterminal WORD is adapted, whereas for the collocation grammar, both WORD and COLLOC are adapted. For the InfVoc LDA grammar, D is the total number of documents and K is the number of topics; therefore, j ranges over {1, ..., D} and i ranges over {1, ..., K}.

4.2 Complexity

Inside and outside calls dominate execution time for adaptor grammar inference. Variational approaches compute the inside-outside algorithm and estimate the expected counts for every possible tree derivation (Cohen et al., 2010). For a dataset with $D$ observations, variational inference requires $O(DI)$ calls to the inside-outside algorithm, where $I$ is the number of iterations, typically in the tens.

In contrast, MCMC only needs to accumulate inside probabilities and then sample a tree derivation (Chappelier and Rajman, 2000). The sampling step is negligible in processing time compared to the inside algorithm. MCMC inference requires $O(DI)$ calls to the inside algorithm—hence every iteration is much faster than the variational approach—but $I$ is usually on the order of thousands.

Likewise, our hybrid approach also only needs the less expensive inside algorithm to sample trees. And while each iteration is less expensive, our approach can achieve reasonable results with only a single pass through the data, and thus only requires $O(D)$ calls to the inside algorithm.

Because the inside-outside algorithm is fundamental to each of these algorithms, we use it as a common basis for comparison across different implementations. This is over-generous to variational approaches, as the full inside-outside computation is more expensive than the inside algorithm required for sampling in MCMC and our hybrid approach.
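Putting Sections 3 and 4 together, Algorithm 2 is a single loop over minibatches. The schematic driver below mirrors its steps; every helper is passed in by the caller and stands in for the routines described above (building the approximate PCFG, sampling trees, accumulating counts, and so on), so this is an outline under stated assumptions rather than the released implementation.

```python
def online_adaptor_grammar(minibatches, model, step_fns,
                           tau=128.0, kappa=0.8, prune_every=50):
    """Schematic online hybrid inference loop (Algorithm 2).

    step_fns: dict of caller-supplied callables with keys
      'approximate_pcfg', 'sample_trees', 'accumulate', 'extend_tngs',
      'interpolate', 'update_variational', 'prune'
    """
    for l, batch in enumerate(minibatches, start=1):
        epsilon = (tau + l) ** (-kappa)                      # decay factor
        pcfg = step_fns['approximate_pcfg'](model)           # Equation 6
        counts = {}
        for sentence in batch:
            trees = step_fns['sample_trees'](pcfg, sentence)  # inside pass + sampling (Eq. 5)
            step_fns['accumulate'](counts, trees)             # f, g, h counts (Section 3.3)
        step_fns['extend_tngs'](model, counts)                # add candidate adapted rules
        step_fns['interpolate'](model, counts, epsilon)       # Equations 9-10
        step_fns['update_variational'](model)                 # Equations 11-13
        if l % prune_every == 0:
            step_fns['prune'](model)                          # refine the truncation (4.1)
    return model
```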
5 Experiments and Discussion

We implement our online adaptor grammar model (ONLINE) in Python[4] and compare it against both MCMC (Johnson and Goldwater, 2009, MCMC) and variational inference (Cohen et al., 2010, VARIATIONAL). We use the latest implementation of the MCMC sampler for adaptor grammars[5] and simulate the variational approach using our implementation. For the MCMC approach, we use the best settings reported in Johnson and Goldwater (2009), with incremental initialization and table label resampling.

[4] Available at http://www.umiacs.umd.edu/~zhaike/.
[5] http://web.science.mq.edu.au/~mjohnson/code/py-cfg-2013-02-25.tgz
Table 2: Word segmentation accuracy measured by word token F1 scores, with negative log-likelihood on a held-out test dataset in brackets (lower is better, on the scale of 10^6), for our ONLINE model against the MCMC approach (Johnson et al., 2006) on various datasets using the unigram and collocation grammars.[7]

                              ctb7                          pku                           cityu
  Model and Settings          unigram       collocation     unigram       collocation     unigram       collocation
  MCMC, 500 iter              72.70 (2.81)  50.53 (2.82)    72.01 (2.82)  49.06 (2.81)    74.19 (3.55)  63.14 (3.53)
  MCMC, 1000 iter             72.65 (2.83)  62.27 (2.79)    71.81 (2.81)  62.47 (2.77)    74.37 (3.54)  70.62 (3.51)
  MCMC, 1500 iter             72.17 (2.80)  69.65 (2.77)    71.46 (2.80)  70.20 (2.73)    74.22 (3.54)  72.33 (3.50)
  MCMC, 2000 iter             71.75 (2.79)  71.66 (2.76)    71.04 (2.79)  72.55 (2.70)    74.01 (3.53)  73.15 (3.48)
  ONLINE truncation sizes:    K_Word = 30k, K_Colloc = 100k (ctb7); K_Word = 40k, K_Colloc = 120k (pku); K_Word = 50k, K_Colloc = 150k (cityu)
  ONLINE, kappa=0.6, tau=32   70.17 (2.84)  68.43 (2.77)    69.93 (2.89)  68.09 (2.71)    72.59 (3.62)  69.27 (3.61)
  ONLINE, kappa=0.6, tau=128  72.98 (2.72)  65.20 (2.81)    72.26 (2.63)  65.57 (2.83)    74.73 (3.40)  64.83 (3.62)
  ONLINE, kappa=0.6, tau=512  72.76 (2.78)  56.05 (2.85)    71.99 (2.74)  58.94 (2.94)    73.68 (3.60)  60.40 (3.70)
  ONLINE, kappa=0.8, tau=32   71.10 (2.77)  70.84 (2.76)    70.31 (2.78)  70.91 (2.71)    73.12 (3.60)  71.89 (3.50)
  ONLINE, kappa=0.8, tau=128  72.79 (2.64)  70.93 (2.63)    72.08 (2.62)  72.02 (2.63)    74.62 (3.45)  72.28 (3.51)
  ONLINE, kappa=0.8, tau=512  72.82 (2.58)  68.53 (2.76)    72.14 (2.58)  70.07 (2.69)    74.71 (3.37)  72.58 (3.49)
  ONLINE, kappa=1.0, tau=32   69.98 (2.87)  70.71 (2.63)    69.42 (2.84)  71.45 (2.67)    73.18 (3.59)  72.42 (3.45)
  ONLINE, kappa=1.0, tau=128  71.84 (2.72)  71.29 (2.58)    71.29 (2.67)  72.56 (2.61)    73.23 (3.39)  72.61 (3.41)
  ONLINE, kappa=1.0, tau=512  72.68 (2.62)  70.67 (2.60)    71.86 (2.63)  71.39 (2.66)    74.45 (3.41)  72.88 (3.38)
  VARIATIONAL                 69.83 (2.85)  67.78 (2.75)    67.82 (2.80)  66.97 (2.75)    70.47 (3.72)  69.06 (3.69)

5.1 Word Segmentation

We evaluate our online adaptor grammar on the task of word segmentation, which focuses on identifying word boundaries from a sequence of characters. This is especially relevant for Chinese, since characters are written in sequence without word boundaries.

We first evaluate all three models on the standard Brent version of the Bernstein-Ratner corpus (Bernstein-Ratner, 1987; Brent and Cartwright, 1996, brent). The dataset contains 10k sentences, 1.3k distinct words, and 72 distinct characters. We compare the results on both the unigram and collocation grammars introduced in Johnson and Goldwater (2009), as listed in Table 1.

Figure 2 illustrates the word segmentation accuracy in terms of word token F1-scores on brent against the number of inside-outside function calls for all three approaches using the unigram and collocation grammars. In both cases, our ONLINE approach converges faster than the MCMC and VARIATIONAL approaches, yet yields comparable or better performance when seeing more data.

Figure 2: Word segmentation accuracy measured by word token F1 scores on the brent corpus for the three approaches, plotted against the number of inside-outside function calls, using the unigram (upper) and collocation (lower) grammars in Table 1.[6]

[6] Our ONLINE settings are batch size B = 20, decay inertia τ = 128, decay rate κ = 0.6 for the unigram grammar; and mini-batch size B = 5, decay inertia τ = 256, decay rate κ = 0.8 for the collocation grammar. TNGs are refined at interval u = 50. Truncation size is set to K_Word = 1.5k and K_Colloc = 3k. The settings are chosen from cross validation. We observe similar behavior under κ = {0.7, 0.9, 1.0}, τ = {32, 64, 512}, B = {10, 50} and u = {10, 20, 100}.
[7] For ONLINE inference, we parallelize each minibatch with four threads with settings: batch size B = 100 and TNG refinement interval u = 100. The ONLINE approach runs for two passes over the datasets. VARIATIONAL runs fifty iterations, with the same truncation level as in ONLINE. For the negative log-likelihood evaluation, we train the model on a random 70% of the data and hold out the rest for testing. We observe similar behavior for our model under κ = {0.7, 0.9} and τ = {64, 256}.
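The word token F1 used in Table 2 and Figure 2 compares predicted word spans against gold spans over the same character sequence. A small sketch of that scoring, assuming each segmentation is given as a list of words whose concatenation equals the raw character string; this is the standard metric, not code from the paper.

```python
def word_spans(words):
    """Convert a segmentation (list of words) into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def word_token_f1(predicted, gold):
    """Word token F1 between a predicted and a gold segmentation of one sentence."""
    p_spans, g_spans = word_spans(predicted), word_spans(gold)
    correct = len(p_spans & g_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    return 2 * precision * recall / (precision + recall)

# Example: gold ["the", "dog"] vs. predicted ["the", "do", "g"] share one span (0, 3):
# precision 1/3, recall 1/2, F1 = 0.4.
```

Corpus-level scores would aggregate these span counts over all sentences before computing precision and recall.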
In addition to the brent corpus, we also evaluate the three approaches on three other Chinese datasets compiled by Xue et al. (2005) and Emerson (2005):[8]

- Chinese Treebank 7.0 (ctb7): 162k sentences, 57k distinct words, 4.5k distinct characters;
- Peking University (pku): 183k sentences, 53k distinct words, 4.6k distinct characters; and
- City University of Hong Kong (cityu): 207k sentences, 64k distinct words, and 5k distinct characters.

We compare our inference method against other approaches on F1 score. While other unsupervised word segmentation systems are available (Mochihashi et al. (2009), inter alia),[9] our focus is on a direct comparison of inference techniques for adaptor grammars, which achieve competitive (if not state-of-the-art) performance.

Table 2 shows the word token F1-scores and negative likelihood on the held-out test dataset of our model against MCMC and VARIATIONAL. We randomly sample 30% of the data for testing and the rest for training. We compute the held-out likelihood of the 10 most likely sampled parse trees out of each model.[10] Our ONLINE approach consistently segments words better than VARIATIONAL and achieves comparable or better results than MCMC.

For MCMC, Johnson and Goldwater (2009) show that incremental initialization—or online updates in general—results in more accurate word segmentation, even though the trees have lower posterior probability. Similarly, our ONLINE approach initializes and learns the grammatons and parse trees on the fly, instead of initializing them for all data upfront as for VARIATIONAL. This uniformly outperforms batch initialization on the word segmentation tasks.

5.2 Infinite Vocabulary Topic Modeling

Topic models can often be replicated using a carefully crafted PCFG (Johnson, 2010). These powerful extensions can capture topical collocations and sticky topics; these embellishments could further improve NLP applications of simple unigram topic models such as word sense disambiguation (Boyd-Graber and Blei, 2007), part of speech tagging (Toutanova and Johnson, 2008), or dialogue modeling (Zhai and Williams, 2014). However, expressing topic models in adaptor grammars is much slower than traditional topic models, for which fast online inference (Hoffman et al., 2010) is available.

Zhai and Boyd-Graber (2013) argue that online inference and topic models violate a fundamental assumption in online algorithms: new words are introduced as more data are streamed to the algorithm. Zhai and Boyd-Graber (2013) then introduce an inference framework, INFVOC, to discover words from a Dirichlet process with a character n-gram base distribution.

We show that their complicated model and online inference can be captured and extended via an appropriate PCFG grammar and our online adaptor grammar inference algorithm. Our extension to INFVOC generalizes their static character n-gram model, learning the base distribution (i.e., how words are composed from characters) from data. In contrast, their base distribution was learned from a dictionary as a preprocessing step and held fixed.

This is an attractive testbed for our online inference. Within a topic, we can verify that the words we discover are relevant to the topic and that new words rise in importance in the topic over time if they are relevant. For these experiments, we treat each token (with its associated document pseudo-word -j) as a single sentence, and each minibatch contains only one sentence (token).

Figure 3: The average coherence score of topics on the de-news dataset for the INFVOC approach and the other inference techniques (MCMC, VARIATIONAL, ONLINE) under different settings of decay rate κ and decay inertia τ, using the InfVoc LDA grammar in Table 1. The horizontal axis shows the number of passes over the entire dataset.[11]

[8] We use all punctuation as natural delimiters (i.e., words cannot cross punctuation).
[9] Their results are not directly comparable: they use different subsets and assume different preprocessing.
[10] Note that this is only an approximation to the true held-out likelihood, since it is impossible to enumerate all the possible parse trees and hence compute the likelihood for a given sentence under the model.
[11] We train all models with 5 topics with settings: TNG refinement interval u = 100, truncation size K_Topic = 3k, and mini-batch size B = 50. We observe a similar behavior under κ in {0.7, 0.9} and τ in {64, 256}.
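The coherence score reported in Figure 3 is the average pairwise pointwise mutual information of each topic's top-ranked words, computed from co-occurrence counts gathered on a reference corpus (Wikipedia in Section 5.2). A minimal sketch, where `count`, `cooc`, and `total` are assumed to hold word counts, pairwise co-occurrence counts, and the number of co-occurrence windows; the smoothing constant is an illustrative choice.

```python
import math
from itertools import combinations

def topic_pmi(top_words, count, cooc, total, smoothing=1.0):
    """Average pairwise PMI over the top-ranked words of one topic (Newman et al., 2009)."""
    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = cooc.get(w1, {}).get(w2, 0.0) + smoothing
        pmi = math.log(joint * total / (count.get(w1, smoothing) * count.get(w2, smoothing)))
        scores.append(pmi)
    return sum(scores) / len(scores)
```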
[Figure 4 (see caption below) shows, for one topic, the top-ranked words at a series of minibatch checkpoints from 1k through 20k tokens; newly encountered words such as "pension", "schroeder", and "deduct" enter the ranking over time, while words like "reform", "year", "tax", and "minist" rise toward the top.]

Figure 4: The evolution of one topic—concerning tax policy—out of five topics learned using online adaptor grammar inference on the de-news dataset. Each minibatch represents a word processed by this online algorithm; time progresses from left to right. As the algorithm encounters new words (bottom) they can make their way into the topic. The numbers next to words represent their overall rank in the topic. For example, the word "pension" first appeared in mini-batch 100, was ranked at 229 after minibatch 400, and became one of the top 10 words in this topic after 2000 minibatches (tokens).[12]

[12] The plot is generated with truncation size K_Topic = 2k, mini-batch size B = 1, truncation pruning interval u = 50, decay inertia τ = 256, and decay rate κ = 0.8. All PY hyperparameters are optimized.

Quantitatively, we evaluate the three different inference schemes and the INFVOC approach[13] on a collection of English daily news snippets (de-news).[14] We used the InfVoc LDA grammar (Table 1). For all approaches, we train the model with five topics and evaluate topic coherence (Newman et al., 2009), which correlates well with human ratings of topic interpretability (Chang et al., 2009). We collect the co-occurrence counts from Wikipedia and compute the average pairwise pointwise mutual information (PMI) score between the top 10 ranked words of every topic. Figure 3 illustrates the PMI score, and our approach yields comparable or better results against all other approaches under most conditions.

Qualitatively, Figure 4 shows an example of topic evolution using the online adaptor grammar on the de-news dataset. The topic is about "tax policy". The topic improves over time; words like "year", "tax" and "minist(er)" become more prominent. More importantly, the online approach discovers new words and incorporates them into the topic. For example, "schroeder" (former German chancellor) first appeared in minibatch 300, was successfully picked up by our model, and became one of the top ranked words in the topic.

6 Conclusion

Probabilistic modeling is a useful tool in understanding unstructured data or data where the structure is latent, like language. However, developing these models is often a difficult process, requiring significant machine learning expertise.

Adaptor grammars offer a flexible and quick way to prototype and test new models. Despite expensive inference, they have been used for topic modeling (Johnson, 2010), discovering perspective (Hardisty et al., 2010), segmentation (Johnson and Goldwater, 2009), and grammar induction (Cohen et al., 2010).

We have presented a new online, hybrid inference scheme for adaptor grammars. Unlike previous approaches, it does not require extensive preprocessing. It also discovers useful structure in text faster; with further development, these algorithms could further speed the development and application of new nonparametric models to large datasets.

[13] Available at http://www.umiacs.umd.edu/~zhaike/.
[14] The de-news dataset is a randomly selected subset of 2.2k English documents from http://homepages.inf.ed.ac.uk/pkoehn/publications/de-news/. It contains 6.5k unique types and over 200k word tokens. Tokenization and stemming provided by NLTK (Bird et al., 2009).
Acknowledgments

We would like to thank the anonymous reviewers, Kristina Toutanova, Mark Johnson, and Ke for insightful discussions. This work was supported by NSF Grant CCF-1018625. Boyd-Graber is also supported by NSF Grant IIS-1320538. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

Nan Bernstein-Ratner. 1987. The phonology of parent child speech. Children's Language, 6:159–174.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA.

David M. Blei and Michael I. Jordan. 2005. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144.

Benjamin Börschinger and Mark Johnson. 2012. Using rejuvenation to improve particle filtering for Bayesian word segmentation. In Proceedings of the Association for Computational Linguistics.

Léon Bottou. 1998. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK.

Jordan Boyd-Graber and David M. Blei. 2007. PUTOP: Turning predominant senses into a topic model for WSD. In 4th International Workshop on Semantic Evaluations.

Michael R. Brent and Timothy A. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Volume 61, pages 93–125.

Jonathan Chang, Jordan Boyd-Graber, and David M. Blei. 2009. Connections between the lines: Augmenting social networks with text. In Knowledge Discovery and Data Mining.

Jean-Cédric Chappelier and Martin Rajman. 2000. Monte-Carlo sampling for NP-hard maximization problems in the framework of weighted parsing. In Natural Language Processing, pages 106–117.

Shay B. Cohen, David M. Blei, and Noah A. Smith. 2010. Variational inference for adaptor grammars. In Conference of the North American Chapter of the Association for Computational Linguistics.

Shay B. Cohen. 2011. Computational Learning of Probabilistic Grammars in the Unsupervised Setting. Ph.D. thesis, Carnegie Mellon University.

Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Fourth SIGHAN Workshop on Chinese Language, Jeju, Korea.

Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2).

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, pages 2335–2382, July.

Eric Hardisty, Jordan Boyd-Graber, and Philip Resnik. 2010. Modeling perspective using adaptor grammars. In Proceedings of Empirical Methods in Natural Language Processing.

Matthew Hoffman, David M. Blei, and Francis Bach. 2010. Online learning for latent Dirichlet allocation. In Proceedings of Advances in Neural Information Processing Systems.

Matthew Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. In Journal of Machine Learning Research.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Conference of the North American Chapter of the Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Proceedings of Advances in Neural Information Processing Systems.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Conference of the North American Chapter of the Association for Computational Linguistics.

Mark Johnson. 2010. PCFGs, topic models, adaptor grammars and learning topical collocations and the structure of proper names. In Proceedings of the Association for Computational Linguistics.

Kenichi Kurihara, Max Welling, and Yee Whye Teh. 2007. Collapsed variational Dirichlet process mixture models. In International Joint Conference on Artificial Intelligence.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA.

David Mimno, Matthew Hoffman, and David Blei. 2012. Sparse stochastic inference for latent Dirichlet allocation. In Proceedings of the International Conference of Machine Learning.
Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of the Association for Computational Linguistics.

Peter Müller and Fernando A. Quintana. 2004. Nonparametric Bayesian data analysis. Statistical Science, 19(1).

Ramesh Nallapati, William Cohen, and John Lafferty. 2007. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In ICDMW.

David Newman, Sarvnaz Karimi, and Lawrence Cavedon. 2009. External evaluation of topic models. In Proceedings of the Australasian Document Computing Symposium.

J. Pitman and M. Yor. 1997. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855–900.

Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino, and Masaaki Nagata. 2012. Bayesian symbol-refined tree substitution grammars for syntactic parsing. In Proceedings of the Association for Computational Linguistics.

Erik B. Sudderth and Michael I. Jordan. 2008. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Proceedings of Advances in Neural Information Processing Systems.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Kristina Toutanova and Mark Johnson. 2008. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In Proceedings of Advances in Neural Information Processing Systems, pages 1521–1528.

Martin J. Wainwright and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.

Chong Wang and David M. Blei. 2012. Truncation-free online variational inference for Bayesian nonparametric models. In Proceedings of Advances in Neural Information Processing Systems.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering.

Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Knowledge Discovery and Data Mining.

Ke Zhai and Jordan Boyd-Graber. 2013. Online latent Dirichlet allocation with infinite vocabulary. In Proceedings of the International Conference of Machine Learning.

Ke Zhai and Jason D. Williams. 2014. Discovering latent structure in task-oriented dialogues. In Proceedings of the Association for Computational Linguistics.

Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mohamad Alkhouja. 2012. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In Proceedings of World Wide Web Conference.
