Topic Modeling: Beyond Bag-of-Words
Hanna M. Wallach  [email protected]
Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, UK

Abstract

Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.

1. Introduction

Recently, much attention has been given to generative probabilistic models of textual corpora, designed to identify representations of the data that reduce description length and reveal inter- or intra-document statistical structure. Such models typically fall into one of two categories—those that generate each word on the basis of some number of preceding words or word classes and those that generate words based on latent topic variables inferred from word correlations independent of the order in which the words appear.

n-gram language models make predictions using observed marginal and conditional word frequencies. While such models may use conditioning contexts of arbitrary length, this paper deals only with bigram models—i.e., models that predict each word based on the immediately preceding word.

To develop a bigram language model, marginal and conditional word counts are determined from a corpus w. The marginal count N_i is defined as the number of times that word i has occurred in the corpus, while the conditional count N_{i|j} is the number of times word i immediately follows word j. Given these counts, the aim of bigram language modeling is to develop predictions of word w_t given word w_{t-1}, in any document. Typically this is done by computing estimators of both the marginal probability of word i and the conditional probability of word i following word j, such as f_i = N_i / N and f_{i|j} = N_{i|j} / N_j, where N is the number of words in the corpus. If there were sufficient data available, the observed conditional frequency f_{i|j} could be used as an estimator for the predictive probability of i given j. In practice, this does not provide a good estimate: only a small fraction of possible i, j word pairs will have been observed in the corpus. Consequently, the conditional frequency estimator has too large a variance to be used by itself.

To alleviate this problem, the bigram estimator f_{i|j} is smoothed by the marginal frequency estimator f_i to give the predictive probability of word i given word j:

    P(w_t = i \mid w_{t-1} = j) = \lambda f_i + (1 - \lambda) f_{i|j}.    (1)

The parameter λ may be fixed, or determined from the data using techniques such as cross-validation (Jelinek & Mercer, 1980). This procedure works well in practice, despite its somewhat ad hoc nature.
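To make the counts and the interpolated estimator concrete, here is a minimal sketch in Python. The toy token list, the function names, and the fixed value of λ are illustrative assumptions only; as noted above, λ would normally be chosen by cross-validation.

    from collections import Counter

    def bigram_statistics(tokens):
        """Return marginal counts N_i and conditional counts N_{i|j} for a token list."""
        marginal = Counter(tokens)                           # N_i
        conditional = Counter(zip(tokens[:-1], tokens[1:]))  # N_{i|j}, keyed by (j, i)
        return marginal, conditional

    def interpolated_bigram_prob(i, j, marginal, conditional, n_tokens, lam=0.5):
        """Equation (1): P(w_t = i | w_{t-1} = j) = lam * f_i + (1 - lam) * f_{i|j}."""
        f_i = marginal[i] / n_tokens
        f_i_given_j = conditional[(j, i)] / marginal[j] if marginal[j] > 0 else 0.0
        return lam * f_i + (1 - lam) * f_i_given_j

    tokens = "the chair department offers couches the department chair couches offers".split()
    marginal, conditional = bigram_statistics(tokens)
    print(interpolated_bigram_prob("chair", "department", marginal, conditional, len(tokens)))

Even on this toy corpus the interpolation behaves as intended: unseen word pairs fall back toward the marginal frequency rather than receiving probability zero.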
The hierarchical Dirichlet language model (MacKay & Peto, 1995) is a bigram model that is entirely driven by principles of Bayesian inference. This model has a similar predictive distribution to models based on equation (1), with one key difference: the bigram statistics f_{i|j} in MacKay and Peto's model are not smoothed with marginal statistics f_i, but are smoothed with a quantity related to the number of different contexts in which each word has occurred.

Latent Dirichlet allocation (Blei et al., 2003) provides an alternative approach to modeling textual corpora. Documents are modeled as finite mixtures over an underlying set of latent topics inferred from correlations between words, independent of word order.

This bag-of-words assumption makes sense from a point of view of computational efficiency, but is unrealistic. In many language modeling applications, such as text compression, speech recognition, and predictive text entry, word order is extremely important. Furthermore, it is likely that word order can assist in topic inference. The phrases "the department chair couches offers" and "the chair department offers couches" have the same unigram statistics, but are about quite different topics. When deciding which topic generated the word "chair" in the first sentence, knowing that it was immediately preceded by the word "department" makes it much more likely to have been generated by a topic that assigns high probability to words related to university administration.

In practice, the topics inferred using latent Dirichlet allocation are heavily dominated by function words, such as "in", "that", "of" and "for", unless these words are removed from corpora prior to topic inference. While removing these may be appropriate for tasks where word order does not play a significant role, such as information retrieval, it is not appropriate for many language modeling applications, where both function and content words must be accurately predicted.

In this paper, I present a hierarchical Bayesian model that integrates bigram-based and topic-based approaches to document modeling. This model moves beyond the bag-of-words assumption found in latent Dirichlet allocation by introducing properties of MacKay and Peto's hierarchical Dirichlet language model. In addition to exhibiting better predictive performance than either MacKay and Peto's language model or latent Dirichlet allocation, the topics inferred using the new model are typically less dominated by function words than are topics inferred from the same corpora using latent Dirichlet allocation.

2. Background

I begin with brief descriptions of MacKay and Peto's hierarchical Dirichlet language model and Blei et al.'s latent Dirichlet allocation.

2.1. Hierarchical Dirichlet Language Model

Bigram language models are specified by a conditional distribution P(w_t = i | w_{t-1} = j), described by W(W - 1) free parameters, where W is the number of words in the vocabulary. These parameters are denoted by the matrix Φ, with P(w_t = i | w_{t-1} = j) ≡ φ_{i|j}. Φ may be thought of as a transition probability matrix, in which the jth row, the probability vector for transitions from word j, is denoted by the vector φ_j.

Given a corpus w, the likelihood function is

    P(w \mid \Phi) = \prod_j \prod_i \phi_{i|j}^{N_{i|j}},    (2)

where N_{i|j} is the number of times that word i immediately follows word j in the corpus.

MacKay and Peto (1995) extend this basic framework by placing a Dirichlet prior over Φ:

    P(\Phi \mid \beta m) = \prod_j \mathrm{Dirichlet}(\phi_j \mid \beta m),    (3)

where β > 0 and m is a measure satisfying ∑_i m_i = 1.

Combining equations (2) and (3), and integrating over Φ, yields the probability of the corpus given the hyperparameters βm, also known as the "evidence":

    P(w \mid \beta m) = \prod_j \frac{\Gamma(\beta)}{\Gamma(N_j + \beta)} \frac{\prod_i \Gamma(N_{i|j} + \beta m_i)}{\prod_i \Gamma(\beta m_i)}.    (4)

It is also easy to obtain a predictive distribution for each context j given the hyperparameters βm:

    P(i \mid j, w, \beta m) = \frac{N_{i|j} + \beta m_i}{N_j + \beta}.    (5)

To make the relationship to equation (1) explicit, P(i | j, w, βm) may be rewritten as

    P(i \mid j, w, \beta m) = \lambda_j m_i + (1 - \lambda_j) f_{i|j},    (6)

where f_{i|j} = N_{i|j} / N_j and

    \lambda_j = \frac{\beta}{N_j + \beta}.    (7)

The hyperparameter m_i is now taking the role of the marginal statistic f_i in equation (1).
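A small sketch of this predictive distribution, written against the same toy counts as before, makes the correspondence between equations (5) and (6)-(7) explicit. The uniform measure m and the value of β are placeholders chosen for illustration; in the hierarchical Dirichlet language model they would instead be estimated from the data, as described next.

    from collections import Counter

    tokens = "the chair department offers couches the department chair couches offers".split()
    marginal = Counter(tokens)                           # N_j
    conditional = Counter(zip(tokens[:-1], tokens[1:]))  # N_{i|j}, keyed by (j, i)

    def dirichlet_predictive(i, j, beta, m):
        """Equation (5): P(i | j, w, beta m) = (N_{i|j} + beta * m_i) / (N_j + beta)."""
        return (conditional[(j, i)] + beta * m[i]) / (marginal[j] + beta)

    def dirichlet_predictive_interpolated(i, j, beta, m):
        """Equations (6)-(7): the same value written as lambda_j * m_i + (1 - lambda_j) * f_{i|j}."""
        n_j = marginal[j]
        lam_j = beta / (n_j + beta)
        return lam_j * m[i] + (1 - lam_j) * conditional[(j, i)] / n_j

    vocab = sorted(set(tokens))
    m = {w: 1.0 / len(vocab) for w in vocab}  # uniform measure, purely for illustration
    p5 = dirichlet_predictive("chair", "department", beta=2.0, m=m)
    p6 = dirichlet_predictive_interpolated("chair", "department", beta=2.0, m=m)
    assert abs(p5 - p6) < 1e-12  # the two forms agree

Note that the amount of smoothing, λ_j, now depends on the context: frequent contexts (large N_j) trust their observed bigram frequencies, while rare contexts fall back toward m.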
Ideally, the measure βm should be given a proper prior and marginalized over when making predictions, yielding the true predictive distribution:

    P(i \mid j, w) = \int P(\beta m \mid w) \, P(i \mid j, w, \beta m) \, d(\beta m).    (8)

However, if P(βm | w) is sharply peaked in βm, so that it is effectively a delta function, then the true predictive distribution may be approximated by P(i | j, w, [βm]^{MP}), where [βm]^{MP} is the maximum of P(βm | w). Additionally, the prior over βm may be assumed to be uninformative, yielding a minimal data-driven Bayesian model in which the optimal βm may be determined from the data by maximizing the evidence. MacKay and Peto show that each element of the optimal m, when estimated using this "empirical Bayes" procedure, is related to the number of contexts in which the corresponding word has appeared.

2.2. Latent Dirichlet Allocation

Latent Dirichlet allocation (Blei et al., 2003) represents documents as random mixtures over latent topics, where each topic is characterized by a distribution over words. Each word w_t in a corpus w is assumed to ...

... corpus given the hyperparameters:

    P(w \mid \alpha n, \beta m) = \sum_z \left[ \prod_k \frac{\Gamma(\beta)}{\Gamma(N_k + \beta)} \frac{\prod_i \Gamma(N_{i|k} + \beta m_i)}{\prod_i \Gamma(\beta m_i)} \right] \left[ \prod_d \frac{\Gamma(\alpha)}{\Gamma(N_d + \alpha)} \frac{\prod_k \Gamma(N_{k|d} + \alpha n_k)}{\prod_k \Gamma(\alpha n_k)} \right].    (12)

N_k is the total number of times topic k occurs in z, while N_d is the number of words in document d.

The sum over z cannot be computed directly because it does not factorize and involves T^N terms, where N is the total number of words in the corpus.
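In practice such sums are approximated by sampling the topic assignments rather than enumerating them. The sketch below shows one collapsed Gibbs update for a single token under the unigram topic model whose evidence has the form of equation (12). It is a generic illustration, not the Gibbs EM algorithm mentioned in the abstract, and the NumPy array layout and names (N_wk, N_k, N_kd, alpha_n, beta_m) are assumptions made for the example.

    import numpy as np

    def resample_topic(t, w, d, z, N_wk, N_k, N_kd, alpha_n, beta_m, rng):
        """Resample the topic assignment z[t] of token t, an occurrence of word w in document d.

        N_wk[w, k] counts how often word w is assigned to topic k, N_k[k] is its
        column total, and N_kd[k, d] counts how often topic k occurs in document d;
        alpha_n and beta_m are the pseudo-count vectors alpha*n and beta*m that
        appear in equation (12).
        """
        old = z[t]
        # Remove token t's current assignment from all counts.
        N_wk[w, old] -= 1
        N_k[old] -= 1
        N_kd[old, d] -= 1

        # P(z_t = k | z_{-t}, w) is proportional to
        # (N_{w|k} + beta*m_w) / (N_k + beta) * (N_{k|d} + alpha*n_k).
        # Note that beta_m.sum() equals beta because m sums to one.
        weights = (N_wk[w, :] + beta_m[w]) / (N_k + beta_m.sum()) * (N_kd[:, d] + alpha_n)
        new = rng.choice(len(weights), p=weights / weights.sum())

        # Record the new assignment and restore the counts.
        z[t] = new
        N_wk[w, new] += 1
        N_k[new] += 1
        N_kd[new, d] += 1

Because every count excludes the token being resampled, each draw conditions on all of the other assignments; sweeping this update over the corpus yields samples of z without ever touching the T^N terms of the sum directly.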