Topic Modeling: Beyond Bag-of-Words

Hanna M. Wallach [email protected] Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, UK

Abstract

Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.

1. Introduction

Recently, much attention has been given to generative probabilistic models of textual corpora, designed to identify representations of the data that reduce description length and reveal inter- or intra-document statistical structure. Such models typically fall into one of two categories—those that generate each word on the basis of some number of preceding words or word classes, and those that generate words based on latent topic variables inferred from word correlations independent of the order in which the words appear.

n-gram language models make predictions using observed marginal and conditional word frequencies. While such models may use conditioning contexts of arbitrary length, this paper deals only with bigram models—i.e., models that predict each word based on the immediately preceding word.

To develop a bigram language model, marginal and conditional word counts are determined from a corpus w. The marginal count N_i is defined as the number of times that word i has occurred in the corpus, while the conditional count N_{i|j} is the number of times word i immediately follows word j. Given these counts, the aim of bigram language modeling is to develop predictions of word w_t given word w_{t−1}, in any document. Typically this is done by computing estimators of both the marginal probability of word i and the conditional probability of word i following word j, such as f_i = N_i / N and f_{i|j} = N_{i|j} / N_j, where N is the number of words in the corpus. If there were sufficient data available, the observed conditional frequency f_{i|j} could be used as an estimator for the predictive probability of i given j. In practice, this does not provide a good estimate: only a small fraction of possible (i, j) word pairs will have been observed in the corpus. Consequently, the conditional frequency estimator has too large a variance to be used by itself.

To alleviate this problem, the bigram estimator f_{i|j} is smoothed by the marginal frequency estimator f_i to give the predictive probability of word i given word j:

P(w_t = i | w_{t−1} = j) = λ f_i + (1 − λ) f_{i|j}.  (1)

The parameter λ may be fixed, or determined from the data using techniques such as cross-validation (Jelinek & Mercer, 1980). This procedure works well in practice, despite its somewhat ad hoc nature.

The hierarchical Dirichlet language model (MacKay & Peto, 1995) is a bigram model that is entirely driven by principles of Bayesian inference. This model has a similar predictive distribution to models based on equation (1), with one key difference: the bigram statistics f_{i|j} in MacKay and Peto's model are not smoothed with marginal statistics f_i, but are smoothed with a quantity related to the number of different contexts in which each word has occurred.
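As a concrete illustration of these counts and of the interpolated estimator in equation (1), here is a minimal Python sketch; the toy corpus, the fixed value of λ, and all variable names are invented for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict

# Toy corpus: a list of documents, each a list of word tokens (illustrative only).
corpus = [
    "the chair department offers couches".split(),
    "the department chair couches offers".split(),
]

# Marginal counts N_i and conditional counts N_{i|j}.
N_i = Counter()
N_ij = defaultdict(Counter)   # N_ij[j][i] = number of times word i follows word j
for doc in corpus:
    for t, word in enumerate(doc):
        N_i[word] += 1
        if t > 0:
            N_ij[doc[t - 1]][word] += 1

N = sum(N_i.values())         # total number of words in the corpus

def interpolated_prob(i, j, lam=0.5):
    """Equation (1): P(w_t = i | w_{t-1} = j) = lam * f_i + (1 - lam) * f_{i|j}."""
    f_i = N_i[i] / N
    N_j = sum(N_ij[j].values())
    f_ij = N_ij[j][i] / N_j if N_j > 0 else 0.0
    return lam * f_i + (1 - lam) * f_ij

print(interpolated_prob("chair", "department"))
```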

Latent Dirichlet allocation (Blei et al., 2003) provides an alternative approach to modeling textual corpora. Documents are modeled as finite mixtures over an underlying set of latent topics inferred from correlations between words, independent of word order.

This bag-of-words assumption makes sense from a point of view of computational efficiency, but is unrealistic. In many language modeling applications, such as text compression, speech recognition, and predictive text entry, word order is extremely important. Furthermore, it is likely that word order can assist in topic inference. The phrases "the department chair couches offers" and "the chair department offers couches" have the same unigram statistics, but are about quite different topics. When deciding which topic generated the word "chair" in the first sentence, knowing that it was immediately preceded by the word "department" makes it much more likely to have been generated by a topic that assigns high probability to words related to university administration.

In practice, the topics inferred using latent Dirichlet allocation are heavily dominated by function words, such as "in", "that", "of" and "for", unless these words are removed from corpora prior to topic inference. While removing these may be appropriate for tasks where word order does not play a significant role, such as information retrieval, it is not appropriate for many language modeling applications, where both function and content words must be accurately predicted.

In this paper, I present a hierarchical Bayesian model that integrates bigram-based and topic-based approaches to document modeling. This model moves beyond the bag-of-words assumption found in latent Dirichlet allocation by introducing properties of MacKay and Peto's hierarchical Dirichlet language model. In addition to exhibiting better predictive performance than either MacKay and Peto's language model or latent Dirichlet allocation, the topics inferred using the new model are typically less dominated by function words than are topics inferred from the same corpora using latent Dirichlet allocation.

2. Background

I begin with brief descriptions of MacKay and Peto's hierarchical Dirichlet language model and Blei et al.'s latent Dirichlet allocation.

2.1. Hierarchical Dirichlet Language Model

Bigram language models are specified by a conditional distribution P(w_t = i | w_{t−1} = j), described by W(W − 1) free parameters, where W is the number of words in the vocabulary. These parameters are denoted by the matrix Φ, with P(w_t = i | w_{t−1} = j) ≡ φ_{i|j}. Φ may be thought of as a transition probability matrix, in which the jth row, the probability vector for transitions from word j, is denoted by the vector φ_j.

Given a corpus w, the likelihood function is

P(w | Φ) = ∏_i ∏_j φ_{i|j}^{N_{i|j}},  (2)

where N_{i|j} is the number of times that word i immediately follows word j in the corpus.

MacKay and Peto (1995) extend this basic framework by placing a Dirichlet prior over Φ:

P(Φ | βm) = ∏_j Dirichlet(φ_j | βm),  (3)

where β > 0 and m is a measure satisfying ∑_i m_i = 1.

Combining equations (2) and (3), and integrating over Φ, yields the probability of the corpus given the hyperparameters βm, also known as the "evidence":

P(w | βm) = ∏_j [ Γ(β) / Γ(N_j + β) ] ∏_i [ Γ(N_{i|j} + βm_i) / Γ(βm_i) ].  (4)

It is also easy to obtain a predictive distribution for each context j given the hyperparameters βm:

P(i | j, w, βm) = (N_{i|j} + βm_i) / (N_j + β).  (5)

To make the relationship to equation (1) explicit, P(i | j, w, βm) may be rewritten as

P(i | j, w, βm) = λ_j m_i + (1 − λ_j) f_{i|j},  (6)

where f_{i|j} = N_{i|j} / N_j and

λ_j = β / (N_j + β).  (7)

The hyperparameter m_i is now taking the role of the marginal statistic f_i in equation (1).

Ideally, the measure βm should be given a proper prior and marginalized over when making predictions, yielding the true predictive distribution:

P(i | j, w) = ∫ P(βm | w) P(i | j, w, βm) d(βm).  (8)

However, if P(βm | w) is sharply peaked in βm, so that it is effectively a delta function, then the true predictive distribution may be approximated by P(i | j, w, [βm]^MP), where [βm]^MP is the maximum of P(βm | w).
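The following minimal sketch illustrates equations (5)-(7) on toy bigram counts, using an arbitrary uniform measure m and a fixed β; in the actual model these hyperparameters would instead be set by maximizing the evidence. All names and values are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

corpus = ["the department chair couches offers".split(),
          "the chair department offers couches".split()]

N_ij = defaultdict(Counter)               # N_ij[j][i]: count of word i following word j
vocab = sorted({w for doc in corpus for w in doc})
for doc in corpus:
    for prev, cur in zip(doc, doc[1:]):
        N_ij[prev][cur] += 1

# Illustrative hyperparameters: uniform measure m and a fixed concentration beta.
# In MacKay and Peto's model, beta * m would be chosen by maximizing the evidence.
beta = 2.0
m = {w: 1.0 / len(vocab) for w in vocab}

def predictive(i, j):
    """Equation (5): P(i | j, w, beta*m) = (N_{i|j} + beta * m_i) / (N_j + beta)."""
    N_j = sum(N_ij[j].values())
    return (N_ij[j][i] + beta * m[i]) / (N_j + beta)

def as_interpolation(i, j):
    """Equations (6)-(7): the same quantity as lambda_j * m_i + (1 - lambda_j) * f_{i|j}."""
    N_j = sum(N_ij[j].values())
    lam_j = beta / (N_j + beta)
    f_ij = N_ij[j][i] / N_j if N_j > 0 else 0.0
    return lam_j * m[i] + (1 - lam_j) * f_ij

assert abs(predictive("chair", "department") - as_interpolation("chair", "department")) < 1e-12
```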

Additionally, the prior over βm may be assumed to be uninformative, yielding a minimal data-driven Bayesian model in which the optimal βm may be determined from the data by maximizing the evidence. MacKay and Peto show that each element of the optimal m, when estimated using this "empirical Bayes" procedure, is related to the number of contexts in which the corresponding word has appeared.

2.2. Latent Dirichlet Allocation

Latent Dirichlet allocation (Blei et al., 2003) represents documents as random mixtures over latent topics, where each topic is characterized by a distribution over words. Each word w_t in a corpus w is assumed to have been generated by a latent topic z_t, drawn from a document-specific distribution over T topics.

Word generation is defined by a conditional distribution P(w_t = i | z_t = k), described by T(W − 1) free parameters, where T is the number of topics and W is the size of the vocabulary. These parameters are denoted by Φ, with P(w_t = i | z_t = k) ≡ φ_{i|k}. Φ may be thought of as an emission probability matrix, in which the kth row, the distribution over words for topic k, is denoted by φ_k. Similarly, topic generation is characterized by a conditional distribution P(z_t = k | d_t = d), described by D(T − 1) free parameters, where D is the number of documents in the corpus. These parameters form a matrix Θ, with P(z_t = k | d_t = d) ≡ θ_{k|d}. The dth row of this matrix is the distribution over topics for document d, denoted by θ_d.

The joint probability of a corpus w and a set of corresponding latent topics z is

P(w, z | Φ, Θ) = ∏_i ∏_k ∏_d φ_{i|k}^{N_{i|k}} θ_{k|d}^{N_{k|d}},  (9)

where N_{i|k} is the number of times that word i has been generated by topic k, and N_{k|d} is the number of times topic k has been used in document d.

Blei et al. place a Dirichlet prior over Φ,

P(Φ | βm) = ∏_k Dirichlet(φ_k | βm),  (10)

and another over Θ,

P(Θ | αn) = ∏_d Dirichlet(θ_d | αn).  (11)

Combining these priors with equation (9) and integrating over Φ and Θ gives the evidence for hyperparameters αn and βm, which is also the probability of the corpus given the hyperparameters:

P(w | αn, βm) = ∑_z { ∏_k [ Γ(β) / Γ(N_k + β) ] ∏_i [ Γ(N_{i|k} + βm_i) / Γ(βm_i) ] } { ∏_d [ Γ(α) / Γ(N_d + α) ] ∏_k [ Γ(N_{k|d} + αn_k) / Γ(αn_k) ] }.  (12)

N_k is the total number of times topic k occurs in z, while N_d is the number of words in document d.

The sum over z cannot be computed directly because it does not factorize and involves T^N terms, where N is the total number of words in the corpus. However, it may be approximated using Markov chain Monte Carlo (Griffiths & Steyvers, 2004).

Given a corpus w, a set of latent topics z, and optimal hyperparameters [αn]^MP and [βm]^MP, approximate predictive distributions for each topic k and document d are given by the following pair of equations:

P(i | k, w, z, [βm]^MP) = (N_{i|k} + [βm_i]^MP) / (N_k + β^MP)  (13)

P(k | d, w, z, [αn]^MP) = (N_{k|d} + [αn_k]^MP) / (N_d + α^MP).  (14)

These may be rewritten as

P(i | k, w, z, βm) = λ_k m_i + (1 − λ_k) f_{i|k}  (15)

P(k | d, w, z, αn) = γ_d n_k + (1 − γ_d) f_{k|d},  (16)

where f_{i|k} = N_{i|k} / N_k, f_{k|d} = N_{k|d} / N_d and

λ_k = β / (N_k + β)  (17)

γ_d = α / (N_d + α).  (18)

f_{i|k} is therefore being smoothed by the hyperparameter m_i, while f_{k|d} is smoothed by n_k. Note the similarity of equations (15) and (16) to equation (6).
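To make the smoothing interpretation of equations (13)-(18) concrete, the following sketch uses invented toy count matrices and illustrative stand-ins for the optimized hyperparameters, and checks that the Dirichlet-smoothed predictive distributions coincide with the interpolated forms.

```python
import numpy as np

T, W, D = 3, 5, 2                       # topics, vocabulary size, documents (toy sizes)
rng = np.random.default_rng(0)
N_ik = rng.integers(1, 10, size=(W, T)) # N_{i|k}: word i generated by topic k
N_kd = rng.integers(1, 10, size=(T, D)) # N_{k|d}: topic k used in document d

beta, m = 1.0, np.full(W, 1.0 / W)      # illustrative stand-ins for [beta*m]^MP
alpha, n = 1.0, np.full(T, 1.0 / T)     # illustrative stand-ins for [alpha*n]^MP

# Equations (13) and (14): smoothed predictive distributions.
p_word_given_topic = (N_ik + beta * m[:, None]) / (N_ik.sum(axis=0) + beta)
p_topic_given_doc = (N_kd + alpha * n[:, None]) / (N_kd.sum(axis=0) + alpha)

# Equations (15)-(18): the same distributions written as explicit interpolations.
N_k, N_d = N_ik.sum(axis=0), N_kd.sum(axis=0)
lam_k = beta / (N_k + beta)
gamma_d = alpha / (N_d + alpha)
interp_word = lam_k * m[:, None] + (1 - lam_k) * (N_ik / N_k)
interp_topic = gamma_d * n[:, None] + (1 - gamma_d) * (N_kd / N_d)

assert np.allclose(p_word_given_topic, interp_word)
assert np.allclose(p_topic_given_doc, interp_topic)
```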
3. Bigram Topic Model

This section introduces a model that extends latent Dirichlet allocation by incorporating a notion of word order, similar to that employed by MacKay and Peto's hierarchical Dirichlet language model. Each topic is now represented by a set of W distributions.

Word generation is defined by a conditional distribution P(w_t = i | w_{t−1} = j, z_t = k), described by WT(W − 1) free parameters. As before, these parameters form a matrix Φ, this time with WT rows.

Each row is a distribution over words for a particular context j, k, denoted by φ_{j,k}. Each topic k is now characterized by the W distributions specific to that topic. Topic generation is the same as in latent Dirichlet allocation: topics are drawn from the conditional distribution P(z_t = k | d_t = d), described by D(T − 1) free parameters, which form a matrix Θ.

The joint probability of a corpus w and a single set of latent topic assignments z is

P(w, z | Φ, Θ) = ∏_i ∏_j ∏_k ∏_d φ_{i|j,k}^{N_{i|j,k}} θ_{k|d}^{N_{k|d}},  (19)

where N_{i|j,k} is the number of times word i has been generated by topic k when preceded by word j. As in latent Dirichlet allocation, N_{k|d} is the number of times topic k has been used in document d.

The prior over Θ is chosen to be the same as that used in latent Dirichlet allocation:

P(Θ | αn) = ∏_d Dirichlet(θ_d | αn).  (20)

However, the additional conditioning context j in the distribution that defines word generation affords greater flexibility in choosing a hierarchical prior for Φ than in either latent Dirichlet allocation or the hierarchical Dirichlet language model. The priors over Φ used in both MacKay and Peto's language model and Blei et al.'s latent Dirichlet allocation are "coupled" priors: learning the probability vector for a single context, φ_j in the case of MacKay and Peto's model and φ_k in Blei et al.'s, gives information about the probability vectors in other contexts, j′ and k′ respectively. This dependence comes from the hyperparameter vector βm, shared, in the case of the hierarchical Dirichlet language model, between all possible previous word contexts j and, in the case of latent Dirichlet allocation, between all possible topics k. Since word generation is conditioned upon both j and k in the new model presented in this paper, there is more than one way in which hyperparameters for the prior over Φ might be shared in this model.

Prior 1: Most simply, a single hyperparameter vector βm may be shared between all j, k contexts:

P(Φ | βm) = ∏_j ∏_k Dirichlet(φ_{j,k} | βm).  (21)

Here, knowledge about the probability vector for one φ_{j,k} will give information about the probability vectors φ_{j′,k′} for all other j′, k′ contexts.

Prior 2: Alternatively, there may be T hyperparameter vectors—one for each topic k:

P(Φ | {β_k m_k}) = ∏_j ∏_k Dirichlet(φ_{j,k} | β_k m_k).  (22)

Information is now shared between only those probability vectors with topic context k. Intuitively, this is appealing. Learning about the distribution over words for a single context j, k yields information about the distributions over words for other contexts j′, k that share this topic, but not about distributions with other topic contexts. In other words, this prior encapsulates the notion of similarity between distributions over words for a given topic context.

Having defined the distributions that characterize word and topic generation in the new model and assigned priors over the parameters, the generative process for a corpus w is:

1. For each topic k and word j:
   (a) Draw φ_{j,k} from the prior over Φ: either Dirichlet(φ_{j,k} | βm) (prior 1) or Dirichlet(φ_{j,k} | β_k m_k) (prior 2).

2. For each document d in the corpus:
   (a) Draw the topic mixture θ_d for document d from Dirichlet(θ_d | αn).
   (b) For each position t in document d:
       i. Draw a topic z_t ∼ Discrete(θ_d).
       ii. Draw a word w_t from the distribution over words for the context defined by the topic z_t and previous word w_{t−1}, Discrete(φ_{w_{t−1},z_t}).

The evidence, or probability of a corpus w given the hyperparameters, is either (prior 1)

P(w | αn, βm) = ∑_z { ∏_j ∏_k [ Γ(β) / Γ(N_{j,k} + β) ] ∏_i [ Γ(N_{i|j,k} + βm_i) / Γ(βm_i) ] } { ∏_d [ Γ(α) / Γ(N_d + α) ] ∏_k [ Γ(N_{k|d} + αn_k) / Γ(αn_k) ] }  (23)

or (prior 2)

P(w | αn, {β_k m_k}) = ∑_z { ∏_j ∏_k [ Γ(β_k) / Γ(N_{j,k} + β_k) ] ∏_i [ Γ(N_{i|j,k} + β_k m_{i|k}) / Γ(β_k m_{i|k}) ] } { ∏_d [ Γ(α) / Γ(N_d + α) ] ∏_k [ Γ(N_{k|d} + αn_k) / Γ(αn_k) ] }.  (24)
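A minimal sketch of the generative process listed above, under prior 1, is given below. The toy dimensions, the uniform measures, and the treatment of the first word in each document (conditioning on a designated start token) are assumptions made purely for illustration; the context of a document's first word is not specified above.

```python
import numpy as np

rng = np.random.default_rng(1)

W, T, D = 6, 2, 3          # toy vocabulary size, number of topics, number of documents
doc_length = 20
START = 0                   # assumption: word index 0 acts as a document-start token

# Prior 1: a single hyperparameter vector beta*m shared across all (j, k) contexts.
beta, m = 0.5, np.full(W, 1.0 / W)
alpha, n = 1.0, np.full(T, 1.0 / T)

# Step 1: draw phi_{j,k}, a distribution over words for every (previous word, topic) context.
phi = rng.dirichlet(beta * m, size=(W, T))        # shape (W, T, W)

corpus, topics = [], []
for d in range(D):
    # Step 2(a): draw the document's topic mixture theta_d.
    theta_d = rng.dirichlet(alpha * n)
    words, zs = [START], []
    # Step 2(b): draw each position's topic, then its word given (previous word, topic).
    for t in range(doc_length):
        z_t = rng.choice(T, p=theta_d)                    # step i
        w_t = rng.choice(W, p=phi[words[-1], z_t])        # step ii
        zs.append(z_t)
        words.append(w_t)
    corpus.append(words[1:])    # drop the start token
    topics.append(zs)

print(corpus[0])
```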

As in latent Dirichlet allocation, the sum over z is intractable, but may be approximated using MCMC. For a single set of latent topics z, and optimal hyperparameters [βm]^MP or {[β_k m_k]^MP}, the approximate predictive distribution over words given previous word j and current topic k is either (prior 1)

P(i | j, k, w, z, [βm]^MP) = (N_{i|j,k} + [βm_i]^MP) / (N_{j,k} + β^MP)  (25)

or (prior 2)

P(i | j, k, w, z, {[β_k m_k]^MP}) = (N_{i|j,k} + [β_k m_{i|k}]^MP) / (N_{j,k} + β_k^MP).  (26)

In equation (25), the statistic N_{i|j,k} / N_{j,k} is always smoothed by the quantity m_i, regardless of the conditioning context j, k. Meanwhile, in equation (26), N_{i|j,k} / N_{j,k} is smoothed by m_{i|k}, which will vary depending on the conditioning topic k.

Given [αn]^MP, the approximate predictive distribution over topics for document d is

P(k | d, w, z, [αn]^MP) = (N_{k|d} + [αn_k]^MP) / (N_d + α^MP).  (27)

4. Inference of Hyperparameters

Previous sampling-based treatments of latent Dirichlet allocation have not included any method for optimizing hyperparameters. However, the method described in this section may be applied to both latent Dirichlet allocation and the model presented in this paper.

Given an uninformative prior over αn and βm or {β_k m_k}, the optimal hyperparameters, [αn]^MP and [βm]^MP or {[β_k m_k]^MP}, may be found by maximizing the evidence, given in equation (23) or (24). The evidence contains latent variables z and must therefore be maximized with respect to the hyperparameters using an expectation-maximization (EM) algorithm. Unfortunately, the expectation with respect to the distribution over the latent variables involves a sum over T^N terms, where N is the number of words in the entire corpus. However, this sum may be approximated using a Markov chain Monte Carlo algorithm, such as Gibbs sampling, resulting in a Gibbs EM algorithm (Andrieu et al., 2003). Given a corpus w, and denoting the set of hyperparameters as U = {αn, βm} or U = {αn, {β_k m_k}}, the optimal hyperparameters may be found by using the following steps:

1. Initialize z^(0) and U^(0) and set i = 1.

2. Iteration i:
   (a) E-step: Draw S samples {z^(s)}_{s=1}^S from P(z | w, U^(i−1)) using a Gibbs sampler.
   (b) M-step: Maximize

       U^(i) = argmax_U (1/S) ∑_{s=1}^S log P(w, z^(s) | U).

3. i ← i + 1 and go to 2.

4.1. E-Step

Gibbs sampling involves sequentially sampling each variable of interest, z_t here, from the distribution over that variable given the current values of all other variables and the data. Letting the subscript −t denote a quantity that excludes data from the tth position, the conditional posterior for z_t is either (prior 1)

P(z_t = k | z_{−t}, w, αn, βm) ∝ [ ({N_{w_t|w_{t−1},k}}_{−t} + βm_{w_t}) / ({N_{w_{t−1},k}}_{−t} + β) ] · [ ({N_{k|d_t}}_{−t} + αn_k) / ({N_{d_t}}_{−t} + α) ]  (28)

or (prior 2)

P(z_t = k | z_{−t}, w, αn, {β_k m_k}) ∝ [ ({N_{w_t|w_{t−1},k}}_{−t} + β_k m_{w_t|k}) / ({N_{w_{t−1},k}}_{−t} + β_k) ] · [ ({N_{k|d_t}}_{−t} + αn_k) / ({N_{d_t}}_{−t} + α) ].  (29)

Drawing a single set of topics z takes time proportional to the size of the corpus N and the number of topics T. The E-step therefore takes time proportional to N, T and the number of iterations for which the Markov chain is run in order to obtain the S samples.

Note that the samples used to approximate the E-step must come from a single Markov chain. The model is unaffected by permutations of topic indices. Consequently, there is no correspondence between topic indices across samples from different Markov chains: topics that have index k in two different Markov chains need not have similar distributions over words.
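The following sketch implements the collapsed Gibbs update of equation (28) (prior 1) on an invented toy token stream. The counts, hyperparameter values, and the simplification of ignoring document boundaries when forming w_{t−1} are illustrative assumptions rather than details from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
W, T, D, N = 5, 3, 2, 40                       # toy sizes: vocabulary, topics, documents, tokens

# Toy token stream: word identity, previous word, and document index for each position.
words = rng.integers(0, W, size=N)
prev_words = np.concatenate(([0], words[:-1]))  # w_{t-1}; document boundaries ignored for brevity
docs = np.repeat(np.arange(D), N // D)

alpha_n = np.full(T, 1.0 / T)                  # alpha * n
beta_m = np.full(W, 0.1 / W)                   # beta * m

# Initialize topic assignments and the count arrays they imply.
z = rng.integers(0, T, size=N)
N_wjk = np.zeros((W, W, T)); N_jk = np.zeros((W, T))
N_kd = np.zeros((T, D)); N_d = np.bincount(docs, minlength=D)
for t in range(N):
    N_wjk[words[t], prev_words[t], z[t]] += 1
    N_jk[prev_words[t], z[t]] += 1
    N_kd[z[t], docs[t]] += 1

def gibbs_step(t):
    """Resample z_t from the conditional in equation (28) (prior 1)."""
    i, j, d, k_old = words[t], prev_words[t], docs[t], z[t]
    # Remove position t from the counts (the "-t" quantities).
    N_wjk[i, j, k_old] -= 1; N_jk[j, k_old] -= 1; N_kd[k_old, d] -= 1
    word_term = (N_wjk[i, j, :] + beta_m[i]) / (N_jk[j, :] + beta_m.sum())
    topic_term = (N_kd[:, d] + alpha_n) / (N_d[d] - 1 + alpha_n.sum())
    p = word_term * topic_term
    k_new = rng.choice(T, p=p / p.sum())
    # Restore the counts with the new assignment.
    z[t] = k_new
    N_wjk[i, j, k_new] += 1; N_jk[j, k_new] += 1; N_kd[k_new, d] += 1

for sweep in range(5):                         # a few Gibbs sweeps over all positions
    for t in range(N):
        gibbs_step(t)
print(np.bincount(z, minlength=T))
```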

4.2. M-Step

Given {z^(s)}_{s=1}^S, the optimal αn can be computed using the fixed-point iteration

[αn_k]^new = αn_k · [ ∑_s ∑_d ( Ψ(N_{k|d}^(s) + αn_k) − Ψ(αn_k) ) ] / [ ∑_s ∑_d ( Ψ(N_d + α) − Ψ(α) ) ],  (30)

where N_{k|d}^(s) is the number of times topic k has been used in document d in the sth sample. Similar fixed-point iterations can be used to determine [βm_i]^MP and {[β_k m_k]^MP} (Minka, 2003).
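A minimal sketch of the fixed-point update in equation (30), applied to invented toy topic-document counts; it assumes SciPy's digamma function for Ψ and an arbitrary initialization of αn.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(3)
T, D, S = 4, 6, 5                                  # toy numbers of topics, documents, samples

# N_kd[s, k, d]: times topic k was used in document d in the s-th Gibbs sample.
N_kd = rng.integers(1, 20, size=(S, T, D)).astype(float)
N_d = N_kd.sum(axis=1)                             # document lengths per sample (fixed in practice)

alpha_n = np.full(T, 0.5)                          # initial alpha * n (arbitrary)

# Equation (30): fixed-point update for alpha * n_k, iterated until it stops changing.
for _ in range(100):
    alpha = alpha_n.sum()
    numer = (digamma(N_kd + alpha_n[None, :, None]) - digamma(alpha_n)[None, :, None]).sum(axis=(0, 2))
    denom = (digamma(N_d + alpha) - digamma(alpha)).sum()
    new_alpha_n = alpha_n * numer / denom
    if np.max(np.abs(new_alpha_n - alpha_n)) < 1e-10:
        break
    alpha_n = new_alpha_n

print(alpha_n)
```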

In my implementation, each fixed-point iteration takes time that is proportional to S and (at worst) N. For latent Dirichlet allocation and the new model with prior 1, the time taken to perform the M-step is therefore at worst proportional to S, N and the number of iterations taken to reach convergence. For the new model with prior 2, the time taken is also proportional to T.

5. Experiments

To evaluate the new model, both variants were compared with latent Dirichlet allocation and MacKay and Peto's hierarchical Dirichlet language model. The topic models were trained identically: the Gibbs EM algorithm described in the previous section was used for both the new model (with either prior) and latent Dirichlet allocation. The hyperparameters of the hierarchical Dirichlet language model were inferred using the same fixed-point iteration used in the M-step. The results presented in this section are therefore a direct reflection of differences between the models.

Language models are typically evaluated by computing the information rate of unseen test data, measured in bits per word: the better the predictive performance, the fewer the bits per word. Information rate is a direct measure of text compressibility. Given corpora w and w_test, the information rate is defined as

R = − log_2 P(w_test | w) / N_test,  (31)

where N_test is the number of words in the test corpus. The information rate may be computed directly for the hierarchical Dirichlet language model. For the topic models, computing P(w_test | w) requires summing over z and z_test. As mentioned before, this is intractable. Instead, the information rate may be computed using a single set of topics z for the training data, in this case obtained by running a Gibbs sampler for 20000 iterations after the hyperparameters have been inferred. Given z, multiple sets of topics for the test data {z_test^(s)}_{s=1}^S may be obtained using the predictive distributions. Given hyperparameters U, P(w_test | w) may then be approximated by taking the harmonic mean of {P(w_test | z_test^(s), w, z, U)}_{s=1}^S (Kass & Raftery, 1995).

5.1. Corpora

The models were compared using two data sets. The first was constructed by drawing 150 abstracts (documents) at random from the Psychological Review Abstracts data provided by Griffiths and Steyvers (2005). A subset of 100 documents were used to infer the hyperparameters, while the remaining 50 were used for evaluating the models. The second data set consisted of 150 newsgroup postings, drawn at random from the 20 Newsgroups data (Rennie, 2005). Again, 100 documents were used for inference, while 50 were retained for evaluating predictive accuracy.

Punctuation characters, including hyphens and apostrophes, were treated as word separators, and each number was replaced with a special "number" token to reduce the size of the vocabulary. To enable evaluation using documents containing tokens not present in the training corpus, all words that occurred only once in the training corpus were replaced with an "unseen" token u. Preprocessing the Psychological Review Abstracts data in this manner resulted in a vocabulary of 1374 words, which occurred 13414 times in the training corpus and 6521 times in the documents used for testing. The 20 Newsgroups data ended up with a vocabulary of 2281 words, which occurred 27478 times in the training data and 13579 times in the test data. Despite consisting of the same number of documents, the 20 Newsgroups corpora are roughly twice the size of the Psychological Review Abstracts corpora.

5.2. Results

The experiments involving latent Dirichlet allocation and the new model were run with 1 to 120 topics on an Opteron 254 (2.8 GHz). These models all required at most 200 iterations of the Gibbs EM algorithm described in section 4. In the E-step, a Markov chain was run for 400 iterations. The first 200 iterations were discarded and 5 samples were taken from the remaining iterations. The mean time taken for each iteration is shown for both variants of the new model as a function of the number of topics in figure 2. As expected, the time taken is proportional to both the number of topics and the size of the corpus.

The information rates of the test data are shown in figure 1. On both corpora, latent Dirichlet allocation and the hierarchical Dirichlet language model achieve similar performance. With prior 1, the new model improves upon this by between 0.5 and 1 bits per word. However, with prior 2, it achieves an information rate reduction of between 1 and 2 bits per word. For latent Dirichlet allocation, the information rate is reduced most by the first 20 topics. The new model uses a larger number of topics and exhibits a greater information rate reduction as more topics are added. In latent Dirichlet allocation, the latent topic for a given word is inferred using the identity of the word, the number of times the word has previously been assumed to be generated by each topic, and the number of times each topic has been used in the current document. In the new model, the previous word is also taken into account. This additional information means that words that were considered to be generated by the same topic in latent Dirichlet allocation may now be assumed to have been generated by different topics, depending on the contexts in which they are seen. Consequently, the new model tends to use a greater number of topics.

In addition to comparing predictive accuracy, it is instructive to look at the inferred topics. Table 1 shows the words most frequently assigned to a selection of topics extracted from the 20 Newsgroups training data by each of the models. The "unseen" token was omitted. The topics inferred using latent Dirichlet allocation contain many function words, such as "the", "in" and "to". In contrast, all but one of the topics inferred by the new model, especially with prior 2, typically contain fewer function words. Instead, these are largely collected into the single remaining topic, shown in the last column of rows 2 and 3 in table 1. This effect is similar, though less pronounced, to that achieved by Griffiths et al.'s composite model (2004), in which function words are handled by a hidden Markov model, while content words are handled by latent Dirichlet allocation.
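To make the evaluation procedure of equation (31) concrete, the following sketch computes the information rate from a harmonic-mean estimate of P(w_test | w). The per-sample log likelihoods and the test-corpus size are synthetic placeholders, not results from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_harmonic_mean(log_probs):
    """Harmonic-mean estimator (Kass & Raftery, 1995), computed stably in log space:
    log HM = log S - logsumexp(-log_probs)."""
    neg = -np.asarray(log_probs, dtype=float)
    m = neg.max()
    return np.log(len(log_probs)) - (m + np.log(np.exp(neg - m).sum()))

def information_rate(log2_prob_test, n_test_words):
    """Equation (31): R = -log2 P(w_test | w) / N_test, in bits per word."""
    return -log2_prob_test / n_test_words

# Synthetic placeholders: natural-log likelihoods log P(w_test | z_test^(s), w, z, U)
# for S = 5 sampled sets of test topics, and a toy test corpus of 1000 words.
S, n_test = 5, 1000
sample_loglik = -5000.0 + rng.normal(0.0, 20.0, size=S)

log_p_test = log_harmonic_mean(sample_loglik)            # natural log of P(w_test | w)
R = information_rate(log_p_test / np.log(2.0), n_test)   # convert nats to bits
print(round(float(R), 2), "bits per word")
```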

12 13 Hierarchical Dirichlet language model Hierarchical Dirichlet language model Latent Dirichlet allocation Latent Dirichlet allocation 11 12 Bigram topic model (prior 1) Bigram topic model (prior 1) Bigram topic model (prior 2) 11 Bigram topic model (prior 2) 10 10 9 9

bits per word 8 bits per word 8

7 7

6 6 0 20 40 60 80 100 120 0 20 40 60 80 100 120 number of topics number of topics

Figure 1. Information rates of the test data, measured in bits per word, under the different models versus number of topics. Left: Psychological Review Abstracts data. Right: 20 Newsgroups data.

15 30 Psychological Review Abstracts Psychological Review Abstracts 20 Newsgroups 20 Newsgroups 25

10 20

15

5 10

mean seconds per iteration mean seconds per iteration 5

0 0 0 20 40 60 80 100 120 0 20 40 60 80 100 120 number of topics number of topics

Figure 2. Mean time taken to perform a single iteration of the Gibbs EM algorithm described in section 4 as a function of the number of topics for both variants of the new model. Left: prior 1. Right: prior 2.

Table 1. Top: The most commonly occurring words in some of the topics inferred from the 20 Newsgroups training data by latent Dirichlet allocation. Middle: Some of the topics inferred from the same data by the new model with prior 1. Bottom: Some of the topics inferred by the new model with prior 2. Each column represents a single topic, and words appear in order of frequency of occurrence. Content words are in bold. Function words, which are not in bold, were identified by their presence on a standard list of stop words: http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words. All three sets of topics were taken from models with 90 topics.

Latent Dirichlet allocation

the         i              that       easter
"number"    is             proteins   ishtar
in          satan          the        a
to          the            of         the
espn        which          to         have
hockey      and            i          with
a           of             if         but
this        metaphorical   "number"   english
as          evil           you        and
run         there          fact       is

Bigram topic model (prior 1)

to          the        the           the
party       god        and           a
arab        is         between       to
not         belief     warrior       i
power       believe    enemy         of
any         use        battlefield   "number"
i           there      a             is
is          strong     of            in
this        make       there         and
things      i          way           it

Bigram topic model (prior 2)

party       god        "number"      the
arab        believe    the           to
power       about      tower         a
as          atheism    clock         and
arabs       gods       a             of
political   before     power         i
are         see        motherboard   is
rolling     atheist    mhz           "number"
london      most       socket        it
security    shafts     plastic       that

6. Future Work

There is another possible prior over Φ, in addition to the two priors discussed in this paper. This prior has a hyperparameter vector for each previous word context j, resulting in W hyperparameter vectors:

P(Φ | {β_j m_j}) = ∏_j ∏_k Dirichlet(φ_{j,k} | β_j m_j).  (32)

Here, information is shared between all distributions with previous word context j. This prior captures the notion of common phrases—word pairs that always occur together. However, the number of hyperparameter vectors is extremely large—much larger than the number of hyperparameters in prior 2—with comparatively little data from which to infer them. To make effective use of this prior, each normalized measure m_j should itself be assigned a Dirichlet prior. This variant of the model could be compared with those presented in this paper. To enable a direct comparison, Dirichlet hyperpriors could also be placed on the hyperparameters of the priors described in section 3.

7. Conclusions

Creating a single model that integrates bigram-based and topic-based approaches to document modeling has several benefits. Firstly, the predictive accuracy of the new model, especially when using prior 2, is significantly better than that of either latent Dirichlet allocation or the hierarchical Dirichlet language model. Secondly, the model automatically infers a separate topic for function words, meaning that the other topics are less dominated by these words.

Acknowledgments

Thanks to Phil Cowans, David MacKay and Fernando Pereira for useful discussions. Thanks to Andrew Suffield for providing sparse matrix code.

References

Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5–43.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235.

Griffiths, T. L., & Steyvers, M. (2005). Topic modeling toolbox. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm.

Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (2004). Integrating topics and syntax. Advances in Neural Information Processing Systems.

Jelinek, F., & Mercer, R. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. Gelsema and L. Kanal (Eds.), Pattern recognition in practice, 381–402. North-Holland Publishing Company.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

MacKay, D. J. C., & Peto, L. C. B. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1, 289–307.

Minka, T. P. (2003). Estimating a Dirichlet distribution. http://research.microsoft.com/~minka/papers/dirichlet/.

Rennie, J. (2005). 20 Newsgroups data set. http://people.csail.mit.edu/jrennie/20Newsgroups/.