Ke Zhai and Jordan Boyd-Graber. Online Topic Models with Infinite

Home , Ke (surname)

Ke Zhai and Jordan Boyd-Graber. Online Topic Models with Inﬁnite Vocabulary. International Confer- ence on Machine Learning, 2013, 9 pages.

@inproceedings{Zhai:Boyd-Graber-2013, Title = {Online Topic Models with Infinite Vocabulary}, Author = {Ke Zhai and Jordan Boyd-Graber}, Booktitle = {International Conference on Machine Learning}, Year = {2013}, Url = {http://umiacs.umd.edu/~jbg//docs/2013_icml_infvoc.pdf}, }

Links: Poster [http://umiacs.umd.edu/~jbg/docs/2013_infvoc_poster.pdf] • Talk [http://techtalks.tv/talks/online-latent-dirichlet-allocation-with-infinite-vocabulary/58182/] • Code [https://github.com/kzhai/PyInfVoc] •

Downloaded from http://umiacs.umd.edu/~jbg/docs/2013_icml_infvoc.pdf

Contact Jordan Boyd-Graber ([email protected]) for questions about this paper.

1 Online Latent Dirichlet Allocation with Inﬁnite Vocabulary

Ke Zhai [email protected] Department of Computer Science, University of Maryland, College Park, MD USA

Jordan Boyd-Graber [email protected] iSchool and UMIACS, University of Maryland, College Park, MD USA Abstract evince thematic coherence, are always modeled as a multi- Topic models based on latent Dirichlet allocation nomial drawn from a finite Dirichlet distribution. This as- (LDA) assume a predefined vocabulary. This is sumption precludes additional words being added over time. reasonable in batch settings but not reasonable Particularly for streaming algorithms, this is neither reason- for streaming and online settings. To address this able nor appealing. There are many reasons immutable vo- lacuna, we extend LDA by drawing topics from a cabularies do not make sense: words are invented (“crowd- Dirichlet process whose base distribution is a dis- sourcing”), words cross languages (“Gangnam”), or words tribution over all strings rather than from a finite common in one context become prominent elsewhere (“vu- Dirichlet. We develop inference using online vari- vuzelas” moving from music to sports in the 2010 World ational inference and—to only consider a finite Cup). To be flexible, topic models must be able to capture number of words for each topic—propose heuris- the addition, invention, and increased prominence of new tics to dynamically order, expand, and contract terms. the set of words we consider in our vocabulary. We show our model can successfully incorporate Allowing models to expand topics to include additional new words and that it performs better than topic words requires changing the underlying statistical formalism. models with finite vocabularies in evaluations of Instead of assuming that topics come from a finite Dirichlet topic quality and classification performance. distribution, we assume that it comes from a Dirichlet process (Ferguson, 1973) with a base distribution over all possible words, of which there are an infinite number. Bayesian 1. Introduction nonparametric tools like the Dirichlet process allow us to reason about distributions over infinite supports. We review Latent Dirichlet allocation (LDA) is a probabilistic approach both topic models and Bayesian nonparametrics in Section2. for exploring topics in document collections (Blei et al., In Section3, we present the infinite vocabulary topic model, 2003). Topic models offer a formalism for exposing a col- which uses Bayesian nonparametrics to go beyond fixed lection’s themes and have been used to aid information vocabularies. retrieval (Wei & Croft, 2006), understand academic liter- In Section4, we derive approximate inference for our model. ature (Dietz et al., 2007), and discover political perspec- Since emerging vocabulary are most important in non-batch tives (Paul & Girju, 2010). settings, in Section5, we extend inference to streaming As hackneyed as the term “big data” has become, re- settings. We compare the coherence and effectiveness of our searchers and industry alike require algorithms that are infinite vocabulary topic model against models with fixed scalable and efficient. Topic modeling is no different. A vocabulary in Section6. common scalability strategy is converting batch algorithms Figure1 shows a topic evolving during inference. The al- into streaming algorithms that only make one pass over the gorithm processes documents in sets we call minibatches; data. In topic modeling, Hoffman et al.(2010) extended after each minibatch, online variational inference updates LDA to online settings. our model’s parameters. This shows that out of vocabu- However, this and later online topic models (Wang et al., lary words can enter topics and eventually become high 2011; Mimno et al., 2012) make the same limiting assump- probability words. tion. The namesake topics, distributions over words that

Proceedings of the 30 th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s). Online Latent Dirichlet Allocation with Inﬁnite Vocabulary

minibatch-2 minibatch-3 minibatch-5 minibatch-8 minibatch-10 minibatch-16 minibatch-17 minibatch-39 minibatch-83 minibatch-120 ...... 0-captain 0-appear 100-issu 90-club 102-club 118-club 132-rock 87-seri 82-seri 1-annual ...... 1-appear 1-hulk 2-rock 146-cover 105-issu 115-issu 128-copi 194-issu 161-issu 162-issu ...... 2-crohn 2-wolverin 3-wolverin 196-admir 161-cover 127-cover 137-cover 215-seri 283-copi 288-copi 3-hulk 3-annual ...... 4-appear 138-issu ... 199-copi 214-copi 130-copi 217-copi 306-appear 294-appear ... 4-copi ...... 5-comicstrip 5-rock 180-appear 307-cover 229-appear 244-appear 197-appear 226-cover 311-cover 5-rider ...... 6-seri 6-wolverin ...... 319-rock 502-annual 6-comicstrip 324-club 288-rock 289-rock 261-appear 512-annual 7-mutant 7-bloom ...... 7-cover 493-annual 814-forc 8-cover 8-forc 360-annual 381-annual 450-annual 588-annual 830-forc ...... 8-forc 9-comicstrip 639-seri 1194-rider 12-issu 643-rock 685-seri 584-seri 949-forc 4782-wolverin ...... 9-captain 11-hiram 877-forc 8516-bloom 14-hulk 819-forc 791-forc 811-forc 1074-rider 9231-bloom ...... 10-bloom ...... 12-annal 1003-rider 8944-hulk 16-copi 1064-rider 1091-rider 1090-rider 6038-comicstrip 9659-hulk 11-issu ...... 13-mutant 7075-captain 10819-comicstrip 53-forc ... 12-seri 1185-seri 4639-hiram 6267-crohn 6520-mutant 11527-comicstrip ...... 15-seri 9420-crohn 11301-mutant 57-rider 16-mutant 7113-hiram 9569-captain 12009-mutant ...... 16-cover ... Settings ...... 10266-hiram 14335-captain 86-captain 37-crohn Number of Topics: K = 50 11914-crohn 15040-captain ...... 19-copi ... Truncation Level: T = 20K ...... 16676-crohn 3531-hiram 41-rock Minibatch Size: S = 155 12760-hiram 17381-crohn β ...... 23-issu ... DP Scale Parameter: α = 5000 ...... Reordering Delay: U = 20 17519-hiram 3690-bloom 43-hiram 18224-hiram ...... 280-rider ... Learning Inertia: τ0 = 256 ...... Learning Rate: κ =0.6 3915-crohn ...

New-Words 2 New-Words 3 New-Words 5 New-Words 8 New-Words 10 New-Words 16 New-Words-17 New-Words-39 New-Words-83

hiram crohn captain comicstrip bloom wolverin laci izzo gown ...... moskowitz corpu seqitur mutant hulk albion ...... patlafontain mazelyah ...... words added at corresponding minibatch

Figure 1. The evolution of a single “comic book” topic from the 20 newsgroups corpus. Each column is a ranked list of word probabilities after processing a minibatch (numbers preceding words are the exact rank). The box below the topics contains words introduced in a minibatch. For example, “hulk” ﬁrst appeared in minibatch 10, was ranked at 9659 after minibatch 17, and became the second most important word by the ﬁnal minibatch. Colors help show words’ trajectories.

2. Background 2.1. Bayesian Nonparametrics Latent Dirichlet allocation (Blei et al., 2003) assumes a sim- Bayesian nonparametrics is an appealing solution; it mod- ple generative process. The K topics, drawn from a sym- els arbitrary distributions with an unbounded and possibly metric Dirichlet distribution, β Dir(η), k = 1,...,K countably infinite support. While Bayesian nonparametrics k ∼ { } generate a corpus of observed words: is a broad field, we focus on the Dirichlet process (DP, Fer- 1: for each document d in a corpus D do guson 1973). 2: Choose a distribution θd over topics from a Dirichlet The Dirichlet process is a two-parameter distribution with distribution θ Dir(αθ). β d ∼ scale parameter α and base distribution G0. A draw G 3: each of the n = 1,...,N word indexes β for d do from DP(α ,G0) is modeled as 4: Choose a topic zn from the document’s distribu- β tion over topics z Mult(θ ). b1, . . . , bi,... Beta(1, α ), ρ1, . . . , ρi,... G0. n ∼ d ∼ ∼ 5: Choose a word wn from the appropriate topic’s Individual draws from a Beta distribution are the foundation distribution over words p(wn βz ). | n for the stick-breaking construction of the DP (Sethuraman, Implicit in this model is a finite number of words in the 1994). Each break point bi models how much probability vocabulary because the support of the Dirichlet distribution mass remains. These break points combine to form an Dir(η) is fixed. Moreover, it fixes a priori which words we infinite multinomial, can observe, a patently false assumption (Algeo, 1980). i 1 − βi bi (1 bj ),G βiδρ , (1) ≡ − ≡ i j=1 i Y X Online Latent Dirichlet Allocation with Infinite Vocabulary where the weights β give the probability of selecting any 4: Set stick weights βkt = bkt (1 bks). i slan- guage model. However, such a na¨ıve approach is unrealistic 2.2. N-gram Models in Latent Variable Models and is biased to shorter words; preliminary experiments yielded poor results. Instead, we define G0 as the following A strength of the probabilistic formalism is the ability to distribution over strings embed specialized models inside more general models. The 1: Choose a length l Mult(λ). problem of part-of-speech (POS) induction (Goldwater & ∼ 2: Generate character ci p(ci ci n,...,i 1). Griffiths, 2007) uses morphological regularity within part of ∼ | − − speech classes (e.g., verbs in English often end with “ed”) This is similar to the classic n-gram language model, except to learn a character n-gram model for parts of speech (Clark, that the length is first chosen from a multinomial distribution 2003). This has been combined within the latent variable over all lengths. Estimating conditional n-gram probabili- HMM via a Chinese restaurant process (Blunsom & Cohn, ties is well-studied in natural language processing (Jelinek 2011). & Mercer, 1985). We also view latent clusters of words (topics) as a non- The full expression for the probability of a word ρ consisting parametric distribution with a character n-gram base distri- of the characters c1, c2,... under G0 is bution, but to better support streaming data sets, we use on- ρ G0(ρ) pWM(l = ρ λ) | | p(ci ci n,...,i 1) line variational inference; previous approaches used Monte ≡ | | | i=1 | − − Carlo methods (Neal, 1993). Variational inference is eas- Q where ρ is the length of the word. To avoid length bias, ier to distribute (Zhai et al., 2012) and amenable to online | | we chose the multinomial λ that minimizes the average updates (Hoffman et al., 2010). discrepancy between word corpus probabilities p and the C Within the topic modeling community, there are different probability in our word model approaches to deal with changing word use. Dynamic topic 2 λ arg minλ p (ρ) pWM(ρ λ) , s.t. λl = 1. models (Blei & Lafferty, 2006) discover evolving topics by ≡ ρ | C − | | l viewing word distributions as n-dimensional points under- P P The n-gram statistics are estimated from an English dictio- going Brownian motion. These models reveal compelling nary which need not be very large, since it is a language topical evolution; e.g., physics moving from studies of the model over characters, not words. æther to relativity to quantum mechanics. However, the models assume fixed vocabularies; we show that our infinite vocabulary model discovers more coherent topics (Sec- 4. Variational Approximation tion 6.2). Inference in probabilistic inference uncovers the latent vari- An elegant solution for large vocabularies is the “hashing ables that best reconstruct observed data. The quality of trick” (Weinberger et al., 2009), which maps strings into a this reconstruction is measured by log likelihood. For a restricted set of integers via a hash function. These integers corpus of D documents where the d-th document contains become the topic model’s vocabulary. While elegant, words Nd words, the joint distribution is are no longer identifiable. However, our infinite vocabulary K β p(W , ρ, β, θ, z) = ∞ p(ρkt G0) p(βkt α ) topic model retains identifiability and better models datasets k=1 t=1 | · | D θ Nd (Section 6.3). p(θd α ) Q p(zdnQ θd)p(ωdn zdn, βz ) . d=1 | n=1 | | dn hQ Q i Directly optimizing the latent variables Z corpus-level 3. Infinite Vocabulary Topic Model ≡ { stick proportions β, document topic distributions θ and Our generative process is identical to LDA’s (Section2) word topic assignments z is intractable, so we use varia- } except that topics are not drawn from a finite Dirichlet. tional inference (Blei et al., 2003). Instead, topics are drawn from a DP with base distribution To use variational inference, we select a simpler family G0 over all possible words: of distributions over the latent variables Z. We call these 1: each topic k for do distributions q. This family of distributions allows us to 2: Draw words ρ , (t = 1, 2, ... ) from G . kt { } 0 optimize a lower bound of the likelihood called the evidence 3: Draw b Beta(1, αβ), (t = 1, 2,... ). kt ∼ { } lower bound (ELBO) , L Online Latent Dirichlet Allocation with Infinite Vocabulary

4.2. Stochastic Inference

log p(W ) Eq(Z) [log p(W , Z)] Eq(Z) [q] = . (2) Recall that the variational distribution q(zd η) is a single ≥ − L | distribution over the Nd vectors of length K. While this Maximizing is equivalent to minimizing the Kullback- removes the tight coupling between θ and z that often com- L Leibler (KL) divergence between the true distribution and plicates mean-field variational inference, it is no longer as the variational distribution. simple to determine the variational distribution q(zd η) that | optimizes Eqn. (2). However, Mimno et al.(2012) showed Unlike mean-field approaches (Blei et al., 2003), which that Gibbs sampling instantiations of z∗ from the distri- q dn assume is a fully factorized distribution, we integrate out bution conditioned on other topic assignments results in a the word-level topic distribution vector θ: q(z η) is a d | sparse, efficient empirical estimate of the variation distribu- KNd single distribution over possible topic configurations tion. In our model, the conditional distribution of a topic rather than a product of Nd multinomial distributions over assignment of a word with TOS index t = k(wdn) is K topics. Combined with a beta distribution q(b ν1 , ν2 ) T kt| kt kt for stick break points, the variational distribution q is q(zdn = k z dn, t = k(wdn)) (4) | − T q(Z) q(β, z) = q(z η) q(b ν1, ν2). Nd θ D d K k k k (3) m=1 Izdm=k + αk exp Eq(ν) [log βkt] . ≡ | | ∝ m=n 6 Q Q P However, we cannot explicitly represent a distribution over We iteratively sample from this conditional distribution to all possible strings, so we truncate our variational stick- obtain the empirical distribution φ qˆ(z ) for latent dn ≡ dn breaking distribution q(b ν) to a finite set. variable z , which is fundamentally different from mean- | dn field approach (Blei et al., 2003). 4.1. Truncation Ordered Set There are two cases to consider for computing Eqn. (4)— Variational methods typically cope with infinite dimension- whether a word wdn is in the TOS for topic k or not. First, ality of nonparametric models by truncating the distribution we look up the word’s index t = k(wdn). If this word is T to a finite subset of all possible atoms that nonparametric in the TOS, i.e., t Tk, the expectations are straightfor- ≤ distributions consider (Blei & Jordan, 2005; Kurihara et al., ward (Mimno et al., 2012) 2006; Boyd-Graber & Blei, 2009). This is done by selecting Nd θ 1 a relatively large truncation index Tk, and then stipulating q(zdn = k) m=1 φdmk + αk exp Ψ(νkt) (5) ∝ m=n · { that the variational distribution uses the rest of the available 6 Ps Tk) is then truncation is not just a fixed vocabulary size; it is a trunca- P Nd θ tion ordered set (TOS). The ordering is important because q(zdn = k) m=1 φdmk + αk (6) ∝ m=n the Dirichlet process is a size-biased distribution; words 6 P s t 2 1 2 exp ≤ Ψ(ν ) Ψ(ν + ν ) . with lower indices are likely to have a higher probability · { s=1 ks − ks ks } than words with higher indices. P This is different from finite vocabulary topic models that set Each topic has a unique TOS of limited size that maps Tk vocabulary a priori and ignore OOV words. every word type w to an integer t; thus t = (w) is the in- Tk dex of the atom ρ that corresponds to w. We defer how we kt 4.3. Refining the Truncation Ordered Set choose this mapping until Section 4.3. More pressing is how we compute the two variational distributions of interest. For In this section, we describe heuristics to update the TOS in- q(z η), we use local collapsed MCMC sampling (Mimno spired by MCMC conditional equations, a common practice | et al., 2012) and for q(b ν) we use stochastic variational for updating truncations. One component of a good TOS is | inference (Hoffman et al., 2010). We describe both in turn. that more frequent words should come first in the ordering. Online Latent Dirichlet Allocation with Infinite Vocabulary

This is reasonable because the stick-breaking prior induces after the craze passes). We describe three components of a size-biased ordering of the clusters. This has previously this process, expanding the truncation, refining the ordering been used for truncation optimization for Dirichlet process of TOS, and contracting the vocabulary. mixtures and admixtures (Kurihara et al., 2007). This process depends Another component of a good TOS is that words consis- Determining the TOS Ordering on the ranking score of a word in topic k at minibatch tent with the underlying base distribution should be ranked i, R (ρ). Ideally, we would compute R from all data. higher than those not consistent with the base distribution. i,k However, only a single minibatch is accessible. We have a This intuition is also consistent with the conditional sam- per-minibatch rank estimate pling equations for MCMC inference (Muller¨ & Quintana, ρ D Nd 2004); the probability of creating a new table with dish is ri,k(ρ) = p(ρ G0) φdnkδω =ρ β i d i n=1 dn proportional to α G0(ρ) in the Chinese restaurant process. | · |S | ∈S which we interpolate with ourP previousP ranking Thus, to update the TOS, we define the ranking score of t k word in topic as Rik(ρ) = (1 ) Ri 1,k(ρ) + rik(ρ). (10) − · − · D Nd R(ρ ) = p(ρ G ) φ δ , (7) We introduce an additional algorithm parameter, the re- kt kt| 0 dnk ωdn=ρkt d=1 n=1 ordering delay U. We found that reordering after every X X minibatch (U = 1) was not effective; we explore the role sort all words by the scores within that topic, and then use of reordering delay in Section6. After U minibatches have those positions as the new TOS. In Section 5.1, we present been observed, we reorder the TOS for each topic according online updates for the TOS. to the words’ ranking score R in Eqn. (10); (w) becomes Tk the rank position of w according to the latest Rik. 5. Online Inference Expanding the Vocabulary Each minibatch contains Online variational inference seeks to optimize the ELBO words we have not seen before. When we see them, we according to Eqn. (2) by stochastic gradient optimiza- L must determine their relative rank position in the TOS, their tion. Because gradients estimated from a single observation rank scores, and their associated variational parameters. The are noisy, stochastic inference for topic models typically latter two issues are relevant for online inference because uses “minibatches” of S documents out of D total docu- both are computed via interpolations from previous values ments (Hoffman et al., 2010). in Eqn. (10) and (9). For an unseen word ω, previous values are undefined. Thus, we set Ri 1,k for unobserved words to An approximation of the natural gradient of with respect − L be 0, ν to be 1, and (ω) is T +1 (i.e., increase truncation to ν is the product of the inverse Fisher information and its Tk k first derivative (Sato, 2001) and append to the TOS).

1 D Nd 1 ∆νkt = 1 + d n=1 φdnkδωdn=ρkt νkt (8) Contracting the Vocabulary To ensure tractability we |S| ∈S − 2 β D Nd 2 must periodically prune the words in the TOS. When we re- ∆νkt = α + P d SP n=1 φdnkδωdn>ρkt νkt, |S| ∈ − order the TOS (after every U minibatches), we only keep the top T terms, where T is a user-defined integer. A word type which leads to an updateP of Pν, ρ will be removed from if its index (ρ) > T and its Tk Tk ν1 = ν1 + ∆ν1 , ν2 = ν2 + ∆ν2 (9) previous information (e.g., rank and variational parameters) kt kt · kt kt kt · kt is discarded. In a later minibatch, if a previously discarded κ where i = (τ0 + i)− defines the step size of the algorithm word reappears, it is treated as a new word. in minibatch i. The learning rate κ controls how quickly new parameter estimates replace the old; κ (0.5, 1] is ∈ 6. Experimental Evaluation required for convergence. The learning inertia τ0 prevents premature convergence. We recover the batch setting if In this section, we evaluate the performance of our infinite 1 = and κ = 0. vocabulary topic model (infvoc) on two corpora: de-news S D and 20 newsgroups.2 Both corpora were parsed by the same

5.1. Updating the Truncation Ordered Set 1A collection of daily news items between 1996 to 2000 in A nonparametric streaming model should allow the vocabu- English. It contains 9,756 documents, 1,175,526 word tokens, and 20,000 distinct word types. Available at homepages.inf.ed. lary to dynamically expand as new words appear (e.g., intro- ac.uk/pkoehn/publications/de-news. ducing “vuvuzelas” for the 2010 World Cup), and contract 2A collection of discussions in 20 different newsgroups. It as needed to best model the data (e.g., removing “vuvuzelas” contains 18,846 documents and 100,000 distinct word types. It Online Latent Dirichlet Allocation with Inﬁnite Vocabulary

T : 20000 T : 30000 T : 40000 T : 20000 T : 30000 T : 40000 U T : 4000 T : 5000 T : 4000 T : 5000 80 U U U 150 : 100 60 :

60 10 10 60 : : 40

40 10 40 10 50 20 20 20 0 0 U 0 0 150 U

pmi 80 pmi U U :

100 60 : 20 pmi pmi

60 20 60 : : 50 40

40 20 40 20 20 20 20 0 0 0 0 0 40 80 0 40 80 0 40 80 120 120 120 0 20 40 60 0 20 40 60 0 20 40 60 0 20 40 60 0 20 40 60 0 10 20 30 40 0 10 20 30 40 minibatch minibatch minibatch minibatch infvoc: αab=3kβ infvoc: αab=5kβ infvoc: ab=3kβ infvoc: ab=5kβ settings settings α α

settings β β settings infvoc: αab=1k infvoc: αab=2k infvoc: αab=1kβ infvoc: αab=2kβ infvoc: αab=4kβ infvoc: αab=4kβ (a) de-news, S = 140. (b) de-news, S = 245. (c) 20 newsgroups, S = 155. (d) 20 newsgroups, S = 310.

Figure 2. PMI score on de-news (Figure 2(a) and 2(b), K = 10) and 20 newsgroups (Figure 2(c) and 2(d), K = 50) against different β settings of DP scale parameter α , truncation level T and reordering delay U, under learning rate κ = 0.8 and learning inertia τ0 = 64. Our model is more sensitive to αβ and less sensitive to T .

! : 0.6 ! : 0.7 ! : 0.8 ! : 0.9 ! : 1 80 " tokenizer and stemmer with a common English stopword 0

60 :

40 256 list (Bird et al., 2009). First, we examine its sensitivity to 20 0

both model parameters and online learning rates. Having 80 " pmi 60 0 :

40 64 chosen those parameters, we then compare our model with 20 other topic models with ﬁxed vocabularies. 0 0 10 20 30 40 0 10 20 30 40 0 10 20 30 40 0 10 20 30 40 0 10 20 30 40 minibatch β β

settings infvoc: αab=1k T=4k infvoc: αab=2k T=4k Evaluation Metric Typical evaluation of topic models infvoc: αab=1kβ T=5k infvoc: αab=2kβ T=5k is based on held-out likelihood or perplexity. However, (a) de-news, S = 245 and K = 10. creating a strictly fair comparison for our model against ! : 0.6 ! : 0.7 ! : 0.8 ! : 0.9 ! : 1 existing topic model algorithms is difﬁcult, as traditional "

150 0 : topic model algorithms must discard words that have not 100 256 50 previously been observed. Moreover, held-out likelihood 0 150 " pmi 0

100 : is a flawed proxy for how topic models are used in the real 64 50 world (Chang et al., 2009). Instead, we use two evaluation 0 0 0 0 0 0 metrics: topic coherence and classification accuracy. 25 50 75 100 125 25 50 75 100 125 25 50 75 100 125 25 50 75 100 125 25 50 75 100 125 minibatch Pointwise mutual information (PMI), which correlates with settings infvoc: αab=3kβ T=20k infvoc: αab=4kβ T=20k infvoc: αab=5kβ T=20k human perceptions of topic coherence, measures how words (b) 20 newsgroups, S = 155 and K = 50. fit together within a topic. Following Newman et al.(2009), we extract document co-occurence statistics from Wikipedia Figure 3. PMI score on two datasets with reordering delay U = 20 and score a topic’s coherence by averaging the pairwise against different settings of decay factor κ and τ0. A suitable β PMI score (w.r.t. Wikipedia co-occurence) of the topic’s ten choice of DP scale parameter α increases the performance signif- highest ranked words. Higher average PMI implies a more icantly. Learning parameters κ and τ0 jointly define the step decay. Larger step sizes promote better topic evolution. coherent topic. Classification accuracy is the accuracy of a classifier learned from the topic distribution of training documents applied to ison, as its inferences requires access to all of the documents test documents (the topic model sees both sets). A higher ac- in the dataset; unlike the other algorithms, it is not online. curacy means the unsupervised topic model better captures the underlying structure of the corpus. To better simulate Vocabulary For fixed vocabulary models, we must decide real-world situations, 20-newsgroup’s test/train split is by on a vocabulary a priori. We consider two different vocabu- date (test documents appeared after training documents). lary methods: use the first minibatch to define a vocabulary (null) or use a comprehensive dictionary3 (dict). We use the Comparisons We evaluate the performance of our model same dictionary to train infvoc’s base distribution. (infvoc) against three other models with fixed vocabularies: online variational Bayes LDA (fixvoc-vb, Hoffman Experiment Configuration For all models, we use the et al. 2010), online hybrid LDA (fixvoc-hybrid, Mimno et al. same symmetric document Dirichlet prior with αθ = 1/K, 2012), and dynamic topic models (dtm, Blei & Lafferty where K is the number of topics. Online models see exactly 2006). Including dynamic topic models is not a fair compar- the same minibatches. For dtm, which is not an online is sorted by date into roughly 60% training and 40% testing data. 3http://sil.org/linguistics/wordlists/ Available at qwone.com/˜jason/20Newsgroups. english/ Online Latent Dirichlet Allocation with Infinite Vocabulary

80 model settings accuracy % 60 infvoc αβ = 3k T = 40k U = 10 52.683

6 ﬁxvoc vb-dict 45.514 40 . settings pmi dtm−dict: tcv=0.01 fixvoc−vb−dict ﬁxvoc vb-null 49.390

20 = 0 fixvoc−hybrid−dict fixvoc−vb−null ﬁxvoc hybrid-dict 46.720 κ 0 fixvoc−hybrid−null infvoc: αab=2kβ T=4k U=10 ﬁxvoc hybrid-null 50.474

= 155 fixvoc vb dict-hash 52.525 0 10 20 30 40 = 64 factor(minibatch) S fixvoc vb full-hash T = 30k 51.653 0 τ fixvoc hybrid dict-hash 50.948 (a) de-news, S = 245, K = 10, κ = 0.6 and τ0 = 64 fixvoc hybrid full-hash T = 30k 50.948 tcv = 0.001 62.845 150 dtm-dict infvoc αβ = 3k T = 40k U = 20 52.317

100 6 ﬁxvoc vb-dict 44.701 .

pmi fixvoc vb-null 51.815 50 = 0 fixvoc hybrid-dict 46.368 κ fixvoc hybrid-null 50.569 0 = 310 fixvoc vb dict-hash 48.130

0 10 20 30 40 50 60 70 80 90 100 110 120 = 64 factor(minibatch) S ﬁxvoc vb full-hash T = 30k 47.276 0 τ

settings dtm−dict: tcv=0.05 fixvoc−hybrid−null fixvoc−vb−null ﬁxvoc hybrid dict-hash 51.558 fixvoc−hybrid−dict fixvoc−vb−dict infvoc: αab=5kβ T=20k U=20 ﬁxvoc hybrid full-hash T = 30k 43.008 (b) 20 newsgroups, S = 155, K = 50, κ = 0.8 and τ0 = 64 dtm-dict tcv = 0.001 64.186

Figure 4. PMI score on two datasets against different models. Our Table 1. Classification accuracy based on 50 topic features ex- model infvoc yields a better PMI score against fixvoc and dtm; tracted from 20 newsgroups data. Our model (infvoc) out-performs gains are more marked in later minibatches as more and more algorithms with a fixed or hashed vocabulary but not dtm, a batch proper names have been added to the topics. Because dtm is not an algorithm that has access to all documents. online algorithm, we do not have detailed per-minibatch coherence statistics and thus show topic coherence as a box plot per epoch. 6.2. Comparing Algorithms: Coherence algorithm but instead partitions its input into “epochs”, we Now that we have some idea of how we should set parame- combine documents in ten consecutive minibatches into an ters for infvoc, we compare it against other topic modeling techniques. We used grid search to select parameters for epoch (longer epochs tended to have worse performance; 4 this was the shortest epoch that had reasonable runtime). each of the models and plotted the topic coherence aver- aged over all topics in Figure4. For online hybrid approaches (infvoc and fixvoc-hybrid), we collect 10 samples empirically from the variational distribu- While infvoc initially holds its own against other models, tion in E-step with 5 burn-in sweeps. For fixvoc-vb, we run it does better and better in later minibatches, since it has 50 iterations for local parameter updates. managed to gain a good estimate of the vocabulary and the topic distributions have stabilized. Most of the gains in topic coherence come from highly specific proper nouns which 6.1. Sensitivity to Parameters are missing from vocabularies of the fixed-vocabulary topic Figure2 shows how the PMI score is affected by the DP models. This advantage holds even against dtm, which uses scale parameter αβ, the truncation level T , and the reorder- batch inference. ing delay U. The relatively high values of αβ may be surpris- ing to readers used to seeing a DP that instantiates dozens 6.3. Comparing Algorithms: Classification of atoms, but when vocabularies are in tens of thousands, such scale parameters are necessary to support the long tail. For the classification comparison, we consider additional Although we did not investigate such approaches, this sug- topic models. While we need the most probable topic strings gests that more advanced nonparametric distributions (Teh, for PMI calculations, classification experiments only need a 2006) or explicitly optimizing αβ may be useful. Relatively document’s topic vector. Thus, we consider hashed vocabu- large values of U suggest that accurate estimates of the rank lary schemes. The first, which we call dict-hashing, uses a order are important for maintaining coherent topics. dictionary for the known words and hashes any other words 4 While infvoc is sensitive to parameters related to the vocab- For the de-news dataset, we select (20 newsgroups parameters in parentheses) minibatch size S 140, 245 (S 155, 310 ), ulary, once suitable values of those parameters are chosen, ∈ { } ∈ { } DP scale parameter αβ 1k, 2k (αβ 3k, 4k, 5k ), trun- it is no more sensitive to learning-specific parameters than cation size T 3k, 4∈k { (T } 20k,∈30 {k, 40k ),} reorder- other online LDA algorithms (Figure3), and values used for ing delay U ∈10 { , 20 for} infvoc∈; { and topic chain} variable other online topic models also work well here. tcv 0.001, ∈0.005 { , 0.01}, 0.05 for dtm. ∈ { } Online Latent Dirichlet Allocation with Infinite Vocabulary into the same set of integers. The second, full-hash, used in In addition to explicitly modeling the change of topics over Vowpal Wabbit,5 hashes all words into a set of T integers. time, it is also possible to model additional structure within topic. Rather than a fixed, immutable base distribution, mod- We train 50 topics for all models on the entire dataset and eling each topic with a hierarchical character n-gram model collect the document level topic distribution for every article. would capture regularities in the corpus that would, for ex- We treat such statistics as features and train a SVM classifier ample, allow certain topics to favor different orthographies on all training data using Weka (Hall et al., 2009) with (e.g., a technology topic might prefer words that start with default parameters. We then use the classifier to label testing “i”). While some topic models have attempted to capture documents with one of the 20 newsgroup labels. A higher orthography for multilingual applications (Boyd-Graber & accuracy means the model is better capturing the underlying Blei, 2009), our approach is more robust and incorporating content. the our approach with models of transliteration (Knight & Our model infvoc captures better topic features than online Graehl, 1997) might allow concepts expressed in one lan- LDA fixvoc (Table1) under all settings. 6 This suggests that guage better capture concepts in another, further improving in a streaming setting, infvoc can better categorize docu- the ability of algorithms to capture the evolving themes and ments. However, the batch algorithm dtm, which has access topics in large, streaming datasets. to the entire dataset performs better because it can use later documents to retrospectively improve its understanding of Acknowledgments earlier ones. Unlike dtm, infvoc only sees early minibatches once and cannot revise its model when it is tested on later The authors thank Chong Wang, Dave Blei, and Matt Hoff- minibatches. man for answering questions and sharing code. We thank Jimmy Lin and the anonymous reviewers for helpful sug- 6.4. Qualitative Example gestions. Research supported by NSF grant #1018625. Any opinions, conclusions, or recommendations are the authors’ Figure1 shows the evolution of a topic in 20 newsgroups and not those of the sponsors. about comics as new vocabulary words enter from new minibatches. While topics improve over time (e.g., relevant words like “seri(es)”, “issu(e)”, “forc(e)” are ranked higher), References interesting words are being added throughout training and Algeo, John. Where do all the new words come from? American become prominent after later minibatches are processed Speech, 55(4):264–277, 1980. (e.g., “captain”, “comicstrip”, “mutant”). This is not the Natural Language case for standard online LDA—these words are ignored Bird, Steven, Klein, Ewan, and Loper, Edward. Processing with Python. O’Reilly Media, 2009. and the model does not capture such information. In addition, only about 60% of the word types appeared in the Blei, David M. and Jordan, Michael I. Variational inference for SIL English dictionary. Even with a comprehensive English Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1): dictionary, online LDA could not capture all the word types 121–144, 2005. in the corpus, especially named entities. Blei, David M. and Lafferty, John D. Dynamic topic models. In Proceedings of the International Conference of Machine 7. Conclusion and Future Work Learning, 2006. We proposed an online topic model that, instead of assuming Blei, David M., Ng, Andrew, and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, vocabulary is known a priori, adds and sheds words over 2003. time. While our model is better able to create coherent topics, it does not outperform dynamic topic models (Blei & Blunsom, Phil and Cohn, Trevor. A hierarchical Pitman-Yor Lafferty, 2006; Wang et al., 2008) that explicitly model how process HMM for unsupervised part of speech induction. In topics change. It would be interesting to allow such models Proceedings of the Association for Computational Linguistics, 2011. to—in addition to modeling the change of topics—also change the underlying dimensionality of the vocabulary. Boyd-Graber, Jordan and Blei, David M. Multilingual topic models 5 for unaligned text. In Proceedings of Uncertainty in Artificial hunch.net/˜vw/ Intelligence, 2009. 6Parameters were chosen via cross-validation on a 30%/70% dev-test split from the following parameter settings: DP scale Chang, Jonathan, Boyd-Graber, Jordan, and Blei, David M. Con- parameter α 2k, 3k, 4k , reordering delay U 10, 20 (for nections between the lines: Augmenting social networks with infvoc only);∈ truncation { level} T 20k, 30k, 40∈k { (for infvoc} ∈ { } text. In Knowledge Discovery and Data Mining, 2009. and fixvoc full-hash models); step decay factors τ0 64, 256 ∈ { } and κ 0.6, 0.7, 0.8, 0.9, 1.0 (for all online models); and topic Clark, Alexander. Combining distributional and morphological chain variable∈ { tcv 0.01, 0.05} , 0.1, 0.5 (for dtm only). ∈ { } information for part of speech induction. 2003. Online Latent Dirichlet Allocation with Infinite Vocabulary

Cohen, Shay B., Blei, David M., and Smith, Noah A. Variational Wang, Chong and Blei, David. Variational inference for the nested inference for adaptor grammars. In Conference of the North Chinese restaruant process. In Proceedings of Advances in American Chapter of the Association for Computational Lin- Neural Information Processing Systems, 2009. guistics, 2010. Wang, Chong and Blei, David M. Truncation-free online varia- Dietz, Laura, Bickel, Steffen, and Scheffer, Tobias. Unsupervised tional inference for bayesian nonparametric models. In Proceed- prediction of citation inﬂuences. In Proceedings of the Interna- ings of Advances in Neural Information Processing Systems, tional Conference of Machine Learning, 2007. 2012.

Ferguson, Thomas S. A Bayesian analysis of some nonparametric Wang, Chong, Blei, David M., and Heckerman, David. Continuous problems. The Annals of Statistics, 1(2):209–230, 1973. time dynamic topic models. In Proceedings of Uncertainty in Artificial Intelligence, 2008. Goldwater, Sharon and Griffiths, Thomas L. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings Wang, Chong, Paisley, John, and Blei, David. Online variational of the Association for Computational Linguistics, 2007. inference for the hierarchical Dirichlet process. In Proceedings of Artificial Intelligence and Statistics, 2011. Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer, Bernhard, Reutemann, Peter, and Witten, Ian H. The WEKA data mining Wei, Xing and Croft, Bruce. LDA-based document models for software: An update. SIGKDD Explorations, 11, 2009. ad-hoc retrieval. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2006. Hoffman, Matthew, Blei, David M., and Bach, Francis. Online learning for latent Dirichlet allocation. In NIPS, 2010. Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. Jelinek, F. and Mercer, R. Probability distribution estimation from In Proceedings of the International Conference of Machine sparse data. IBM Technical Disclosure Bulletin, 28:2591–2594, Learning, pp. 1113–1120. ACM, 2009. 1985. Zhai, Ke, Boyd-Graber, Jordan, Asadi, Nima, and Alkhouja, Mo- Knight, Kevin and Graehl, Jonathan. Machine transliteration. In hamad. Mr. LDA: A flexible large scale topic modeling package Proceedings of the Association for Computational Linguistics, using variational inference in mapreduce. In Proceedings of 1997. World Wide Web Conference, 2012.

Kurihara, Kenichi, Welling, Max, and Vlassis, Nikos. Acceler- ated variational Dirichlet process mixtures. In Proceedings of Advances in Neural Information Processing Systems, 2006.

Kurihara, Kenichi, Welling, Max, and Teh, Yee Whye. Collapsed variational Dirichlet process mixture models. In International Joint Conference on Artiﬁcial Intelligence. 2007.

Mimno, David, Hoffman, Matthew, and Blei, David. Sparse stochastic inference for latent Dirichlet allocation. In Pro- ceedings of the International Conference of Machine Learning, 2012.

Muller,¨ Peter and Quintana, Fernando A. Nonparametric Bayesian data analysis. Statistical Science, 19(1):95–110, 2004.

Neal, Radford M. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Uni- versity of Toronto, 1993.

Newman, David, Karimi, Sarvnaz, and Cavedon, Lawrence. Ex- ternal evaluation of topic models. In Proceedings of the Aurstralasian Document Computing Symposium, 2009.

Paul, Michael and Girju, Roxana. A two-dimensional topic-aspect model for discovering multi-faceted topics. In Association for the Advancement of Artiﬁcial Intelligence, 2010.

Sato, Masa-Aki. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, July 2001.

Sethuraman, Jayaram. A constructive deﬁnition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

Teh, Yee Whye. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, 2006.