The Generalized Dirichlet Distribution in Enhanced Topic Detection
Karla Caballero∗, Joel Barajas, Ram Akella
UC Santa Cruz, Santa Cruz, CA, USA
[email protected], [email protected], [email protected]

∗ Main contact.
ABSTRACT

We present a new, robust, and computationally efficient hierarchical Bayesian model for effective topic correlation modeling. We model the prior distribution of topics with a Generalized Dirichlet (GD) distribution rather than a Dirichlet distribution as in Latent Dirichlet Allocation (LDA). We define this model as GD-LDA. This framework captures correlations between topics, as in the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM), and is faster to infer than CTM and PAM. GD-LDA is effective at avoiding over-fitting as the number of topics is increased. As a tree model, it accommodates the most important set of topics in the upper part of the tree based on their probability mass. Thus, GD-LDA provides the ability to choose significant topics effectively. To discover topic relationships, we perform hyper-parameter estimation based on Monte Carlo EM estimation. We provide results using Empirical Likelihood (EL) on four public datasets from TREC and NIPS. We then present the performance of GD-LDA in ad hoc information retrieval (IR) based on MAP, P@10, and Discounted Gain, and discuss an empirical comparison of the fitting time. We demonstrate significant improvement over CTM, LDA, and PAM for EL estimation. For all the IR measures, GD-LDA shows higher performance than LDA, the dominant topic model in IR. All these improvements come with only a small increase in fitting time over LDA, in contrast to CTM and PAM.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Document Filtering; G.3 [Probability and Statistics]: Statistical Computing

General Terms

Algorithms, Experimentation

Keywords

Statistical Topic Modeling, Document Representation

1. INTRODUCTION

Topic modeling has been widely studied in Machine Learning and Text Mining as an effective approach to extract latent topics from unstructured text documents. The key idea underlying topic modeling is to use term co-occurrences in documents to discover associations between those terms. The development of Latent Dirichlet Allocation (LDA) [3, 8] enabled, for the first time, the rigorous prediction of new documents. Consequently, variants and extensions of LDA have been an active area of research in topic modeling. This research has two main streams in document representation: 1) the exploration of super- and sub-topics, as in the Pachinko Allocation Model (PAM) [9, 10]; and 2) the correlation of topics [2, 9, 10]. However, despite the improvement of these approaches over LDA, they are rarely used in applications that handle large datasets, such as Information Retrieval. Major gaps in these models include: modeling correlated topics in a computationally effective manner, and a robust approach that ensures good performance over a wide range of document counts, vocabulary sizes, and average document lengths.

In this paper we develop a new model to meet these needs. We use the Generalized Dirichlet (GD) distribution as the prior distribution of the document topic mixtures, leading to GD-LDA. We show that the Dirichlet distribution is a special case of GD. As a result, GD-LDA can be viewed as a generalization of LDA. Our goal is to provide a more flexible model for topics while retaining the conjugacy properties, which are desirable in inference. The features of the GD-LDA model include: 1) an effective method to represent sparse topic correlations in natural language documents; 2) a model which handles global topic correlations with time complexity O(KW), adding minimal computational cost with respect to LDA.
This results in a fast and robust approach compared to CTM and PAM. 3) A hierarchical tree structure that accommodates the most significant topics, based on probability mass, at the upper levels, allowing us to reduce the number of topics efficiently.

Note that GD is a special case of Dirichlet Trees [13, 5]. This distribution has been used previously in topic modeling to add domain knowledge to the probability of words given a topic [1]. In contrast, we use the GD distribution to model topic correlations. Thus, that approach is complementary to GD-LDA.

To validate our model, we use Empirical Likelihood (EL) on four data sets with different characteristics of document length, vocabulary size, and total number of documents. GD-LDA outperforms CTM, PAM, and LDA consistently on all the datasets. In addition, we test the performance of GD-LDA in ad hoc Information Retrieval, obtaining superior performance over LDA.
2. THE GENERALIZED DIRICHLET DISTRIBUTION

Each entry of a random vector of proportions is characterized by its mean and variance, and the entries must sum to one. When we use the Dirichlet distribution as a prior for the multinomial distribution, we have only one degree of freedom, the total prior sample size, to incorporate our confidence in the prior knowledge. As a consequence, we cannot add individual variance information for each entry of the random vector. In addition, all entries are always negatively correlated: if the probability of one entry increases, each of the other probabilities must either decrease or remain the same so that the vector sums to one. Despite these limitations, the Dirichlet distribution is widely used because it is a conjugate prior of the multinomial distribution.

The GD distribution allows us to sample each entry of the random vector of proportions from independent Beta distributions. This independence is the key property that provides more flexibility than the Dirichlet distribution. Formally, this distribution is defined as:

$$p(\theta|\alpha,\beta)=\prod_{j=1}^{K-1}\frac{\Gamma(\alpha_j+\beta_j)}{\Gamma(\alpha_j)\Gamma(\beta_j)}\,\theta_j^{\alpha_j-1}\,(1-\theta_1-\cdots-\theta_j)^{\eta_j} \qquad (1)$$

where $\theta_1+\theta_2+\cdots+\theta_{K-1}+\theta_K=1$, $\eta_j=\beta_j-\alpha_{j+1}-\beta_{j+1}$ for $1\le j\le K-2$, and $\eta_{K-1}=\beta_{K-1}-1$.

To illustrate the properties of the GD distribution, we define $Z_1=\theta_1$ and $Z_k=\theta_k/V_k$ for $k=2,3,\dots,K-1$, where $V_k=1-\theta_1-\cdots-\theta_{k-1}$. Let $T_k$ be the discrete random variable with multinomial distribution with parameters $\theta_1\dots\theta_K$ for $K$ different categories. We start at node $V_1$, where we sample $T_1$ with probability $Z_1$ and $V_2$ with probability $1-Z_1$. Conditional on $V_2$, we sample $T_2$ with probability $Z_2$ and $V_3$ with probability $1-Z_2$. In the general case, conditional on $V_k$, we sample $T_k$ with probability $Z_k$ and $V_{k+1}$ with probability $1-Z_k$, for $k=1,\dots,K-1$. If we now add a prior Beta distribution with parameters $\alpha_k,\beta_k$ to each conditional Binomial distribution at the nodes $V_k$, we obtain a GD distribution in which this set of Beta distributions is conjugate to the set of Binomial distributions:

$$Z_k \sim \text{Beta}(\alpha_k,\beta_k), \qquad T_k \sim \text{Bin}(Z_k,N_k) \quad \text{for } N_k=N-\textstyle\sum_{l<k}T_l$$

The GD distribution is a conjugate prior to the Multinomial distribution, so we can integrate out the parameter of the Multinomial distribution, leading to:

$$p(T|\alpha,\beta)=\int p(T|\theta)\,p(\theta)\,d\theta=\prod_{k=1}^{K-1}\frac{\Gamma(\alpha_k+\beta_k)}{\Gamma(\alpha_k)\Gamma(\beta_k)}\prod_{k=1}^{K-1}\frac{\Gamma(\alpha'_k)\Gamma(\beta'_k)}{\Gamma(\alpha'_k+\beta'_k)} \qquad (3)$$

where $\alpha'_k=\alpha_k+T_k$ and $\beta'_k=\beta_k+T_{k+1}+\cdots+T_K$. The ability to evaluate this integral in closed form is crucial for the GD-LDA model because it enables Gibbs sampling. The result is a product of Beta-Binomial distributions for the independent random variables $Z_k$. Consequently, the expression factorizes into independent ratios of Gamma functions, which allows us to find the Maximum Likelihood Estimate (MLE) of $\alpha,\beta$ one pair at a time (independently), which is not the case for the Dirichlet distribution.
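The tree construction above translates directly into a sampling routine: draw each stick fraction $Z_k$ from its own Beta distribution and cascade the remaining mass. The following minimal Python sketch is our illustration, not the authors' code; the function name `sample_gd` and the parameter values are assumptions.

```python
import numpy as np

def sample_gd(alpha, beta, rng=np.random.default_rng()):
    """Draw one proportion vector theta (length K) from a Generalized
    Dirichlet with parameters alpha, beta (each length K - 1), using the
    tree construction: Z_k ~ Beta(alpha_k, beta_k) independently, then
    theta_k = Z_k * V_k, where V_k is the mass not yet assigned."""
    z = rng.beta(alpha, beta)          # K - 1 independent stick fractions
    theta = np.empty(len(alpha) + 1)
    remaining = 1.0                    # V_1 = 1
    for k, zk in enumerate(z):
        theta[k] = zk * remaining
        remaining *= 1.0 - zk          # V_{k+1} = V_k * (1 - Z_k)
    theta[-1] = remaining              # theta_K = prod_m (1 - Z_m)
    return theta

# Example: unequal beta_k values encode per-component variance, a degree
# of freedom a single Dirichlet prior sample size does not offer.
theta = sample_gd(np.array([2.0, 1.0, 0.5]), np.array([6.0, 2.0, 0.5]))
print(theta, theta.sum())              # theta sums to one by construction
```

Note that choosing $\beta_j=\alpha_{j+1}+\beta_{j+1}$ for $j\le K-2$ and $\beta_{K-1}=\alpha_K$ makes every exponent $\eta_j$ in Eq. (1) vanish, recovering the ordinary Dirichlet special case mentioned in the introduction.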
2.2 Covariances: Properties and Constraints

In natural language documents, most of the time only a few topics co-occur, leading to sparse topic correlations [19]. This implies that very few random topics may suffice, rather than the full joint distribution including all the topics. When considering useful distributions for topic modeling, a desirable feature is to model only a few topics per document. For instance, the topic domestic politics could frequently co-occur with middle east politics and health care reform. This implies that if we observe domestic politics, the conditional probabilities of observing middle east politics and health care reform will increase. However, if a document contains domestic politics and middle east politics, it could very rarely contain health care reform.

GD has the computational advantage of having only $2(K-1)$ parameters. This also implies that the distribution covariances are constrained. The question is whether, despite these constraints, the GD distribution can still capture the pattern where only a few topics co-occur in a document. In this respect, the covariance properties described in Appendix A.1, particularly the third one, indicate that each covariance depends on the expected values of the two topics involved; hence several covariances can be close to zero if the right-hand-side ratio is small. This is an important feature of the GD distribution, which we propose to exploit in topic modeling. A detailed discussion of the covariance constraints of the GD distribution is provided in Appendix A.

3. GD-LDA: METHODOLOGY

In this section, we describe the parameter estimation process to fit the GD as a prior distribution for topics. We show that GD-LDA can be computed at a computational cost similar to that of LDA. Fig. 2 shows the graphical model, and Algorithm 1 summarizes the generative process.

Algorithm 1 GD-LDA Generative Model
for topic k ← 1 to K do
    draw φ_k ∼ Dir(γ)
end for
for document j ← 1 to D do
    draw θ_j ∼ GenDir(α, β)
    for word w ← 1 to N_j do
        draw z_wj ∼ Mult(θ_j)
        draw w | z_wj = k ∼ Mult(φ_k)
    end for
end for

Figure 2: Graphical model for GD-LDA. α and β are vectors of size K − 1, where K is the number of topics. γ is a vector of size V (vocabulary size).
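To make Algorithm 1 concrete, the following sketch samples a synthetic corpus from the generative process. It is our own rendering under stated assumptions: `generate_corpus` and `doc_len` are hypothetical names, `doc_len` stands in for the per-document length $N_j$, and the code reuses `sample_gd` from the earlier sketch.

```python
import numpy as np

def generate_corpus(D, K, V, alpha, beta, gamma,
                    doc_len=100, rng=np.random.default_rng(0)):
    """Synthetic corpus following Algorithm 1: topic-word distributions
    phi_k ~ Dir(gamma), per-document mixtures theta_j ~ GenDir(alpha, beta),
    then topic assignments and words drawn multinomially."""
    phi = rng.dirichlet(gamma, size=K)          # phi_k ~ Dir(gamma), K x V
    docs = []
    for _ in range(D):
        theta = sample_gd(alpha, beta, rng)     # theta_j ~ GenDir(alpha, beta)
        z = rng.choice(K, size=doc_len, p=theta)            # z_wj ~ Mult(theta_j)
        w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w ~ Mult(phi_{z_wj})
        docs.append((w, z))
    return phi, docs
```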
3.2 Model and Gibbs Sampling

We define the joint probability associated with the graphical model of Fig. 2 as:

$$p(w,z|\alpha,\beta,\gamma)=p(w|z,\gamma)\,p(z|\alpha,\beta) \qquad (4)$$

This expression allows us to decompose the problem into two hierarchical models that can be treated and optimized separately based on these conditional probabilities. The probability of words given topics is:

$$p(w|z,\gamma)=\int_\phi p(w|\phi,z)\,p(\phi|\gamma,z)\,d\phi=\prod_{k=1}^{K}\frac{\Gamma\!\left(\sum_{i=1}^{V}\gamma_i\right)}{\prod_{i=1}^{V}\Gamma(\gamma_i)}\cdot\frac{\prod_{i=1}^{V}\Gamma(\gamma_i+N_{k,i})}{\Gamma\!\left(\sum_{i=1}^{V}(\gamma_i+N_{k,i})\right)} \qquad (5)$$

For the probability of topics, we place a GD prior on the topic mixture of each document, which is assumed to be Multinomial. Thus we have:

$$p(z|\alpha,\beta)=\int_\theta p(z|\theta)\,p(\theta|\alpha,\beta)\,d\theta=\prod_{j=1}^{D}\prod_{k=1}^{K-1}\frac{\Gamma(\alpha_k+\beta_k)}{\Gamma(\alpha_k)\Gamma(\beta_k)}\prod_{k=1}^{K-1}\frac{\Gamma(\alpha_k^j)\Gamma(\beta_k^j)}{\Gamma(\alpha_k^j+\beta_k^j)} \qquad (6)$$

where $\alpha_k^j=\alpha_k+N_{j,k}$ and $\beta_k^j=\beta_k+N_{j,k+1}+\cdots+N_{j,K}$.
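Because Eq. (6) factorizes into per-stick Beta-Binomial terms, the collapsed log-probability of a document's topic counts can be evaluated in O(K) with log-Gamma functions. A minimal sketch (our code, assuming 0-indexed NumPy arrays; `log_p_z_doc` is a hypothetical name):

```python
import numpy as np
from scipy.special import gammaln

def log_p_z_doc(N_j, alpha, beta):
    """Log of the per-document factor of Eq. (6), given topic counts
    N_j (length K) and GD parameters alpha, beta (length K - 1)."""
    tail = np.cumsum(N_j[::-1])[::-1]   # tail[k] = N_{j,k} + ... + N_{j,K}
    a_post = alpha + N_j[:-1]           # alpha_k^j = alpha_k + N_{j,k}
    b_post = beta + tail[1:]            # beta_k^j  = beta_k + sum_{l>k} N_{j,l}
    return np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                  + gammaln(a_post) + gammaln(b_post)
                  - gammaln(a_post + b_post))
```

Summing this quantity over documents gives $\log p(z|\alpha,\beta)$; because each pair $(\alpha_k,\beta_k)$ appears only in its own term, the MLE discussed in Section 2 decouples across $k$.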
The topic assignments z are not observed, so we define a Gibbs sampling method to infer z as follows:

$$p(z_{wj}=k\,|\,z^{\neg wj},\alpha,\beta,\gamma)=\frac{p(w|z,\gamma)\,p(z|\alpha,\beta)}{p(w|z^{\neg wj},\gamma)\,p(z^{\neg wj}|\alpha,\beta)} \qquad (7)$$

Here, $z^{\neg wj}$ represents the topic assignments for all the words except word $w$ of document $j$. This analysis leads us to the following distributions for Gibbs sampling:

$$\frac{p(w|z,\gamma)}{p(w|z^{\neg wj},\gamma)}\propto\frac{N_{i,k}^{\neg wj}+\gamma_i}{\sum_{i=1}^{V}\left(N_{i,k}^{\neg wj}+\gamma_i\right)} \qquad (8)$$

$$\frac{p(z|\alpha,\beta)}{p(z^{\neg wj}|\alpha,\beta)}\propto\begin{cases}\dfrac{\alpha_k+N_{j,k}^{\neg wj}}{\alpha_k+\beta_k+\sum_{l=1}^{K}N_{j,l}^{\neg wj}}, & k=1\\[2ex]\dfrac{\alpha_k+N_{j,k}^{\neg wj}}{\alpha_k+\beta_k+\sum_{l=k}^{K}N_{j,l}^{\neg wj}}\displaystyle\prod_{m=1}^{k-1}\dfrac{\beta_m+\sum_{l=m+1}^{K}N_{j,l}^{\neg wj}}{\alpha_m+\beta_m+\sum_{l=m}^{K}N_{j,l}^{\neg wj}}, & 1<k<K\\[2ex]\displaystyle\prod_{m=1}^{K-1}\dfrac{\beta_m+\sum_{l=m+1}^{K}N_{j,l}^{\neg wj}}{\alpha_m+\beta_m+\sum_{l=m}^{K}N_{j,l}^{\neg wj}}, & k=K\end{cases} \qquad (9)$$

We accumulate the cumulative cascade factor across topics and pass it to the next step of the evaluation. This is illustrated in Algorithm 2.

Algorithm 2 GD-LDA Gibbs Sampling
$cumFactor \leftarrow 1$
$CDF[1] \leftarrow \dfrac{\alpha_1+N^{\neg wj}_{j,1}}{\alpha_1+\beta_1+\sum_{l=1}^{K}N^{\neg wj}_{j,l}} \times \dfrac{N^{\neg wj}_{i,1}+\gamma_i}{\sum_{i=1}^{V}(N^{\neg wj}_{i,1}+\gamma_i)}$
for $k \leftarrow 2$ to $K-1$ do
    $cumFactor \leftarrow cumFactor \times \dfrac{\beta_{k-1}+\sum_{l=k}^{K}N^{\neg wj}_{j,l}}{\alpha_{k-1}+\beta_{k-1}+\sum_{l=k-1}^{K}N^{\neg wj}_{j,l}}$
    $CDF[k] \leftarrow CDF[k-1] + cumFactor \times \dfrac{\alpha_k+N^{\neg wj}_{j,k}}{\alpha_k+\beta_k+\sum_{l=k}^{K}N^{\neg wj}_{j,l}} \times \dfrac{N^{\neg wj}_{i,k}+\gamma_i}{\sum_{i=1}^{V}(N^{\neg wj}_{i,k}+\gamma_i)}$
end for
$cumFactor \leftarrow cumFactor \times \dfrac{\beta_{K-1}+N^{\neg wj}_{j,K}}{\alpha_{K-1}+\beta_{K-1}+\sum_{l=K-1}^{K}N^{\neg wj}_{j,l}}$
$CDF[K] \leftarrow CDF[K-1] + cumFactor \times \dfrac{N^{\neg wj}_{i,K}+\gamma_i}{\sum_{i=1}^{V}(N^{\neg wj}_{i,K}+\gamma_i)}$

From this pseudocode, O((2K + 1)W) operations are required for each Gibbs sampling sweep over the whole corpus, where W is the total number of words. Thus, the time complexity per iteration is O(KW), as in the case of LDA [15]. This complexity is possible due to the cascade structure of the GD distribution, which facilitates the calculation of the cumulative factor in Algorithm 2, as the sketch below illustrates.
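The per-word draw can be implemented directly from Algorithm 2. The code below is our own rendering, not the authors' implementation; `draw_topic` and the bookkeeping arrays `n_wk`, `n_k`, and `N_j` (all with the current word removed) are assumed names. It builds the unnormalized CDF over topics with the running cascade factor and inverts it with a uniform draw.

```python
import numpy as np

def draw_topic(alpha, beta, gamma_i, gamma_sum, n_wk, n_k, N_j,
               rng=np.random.default_rng()):
    """One Gibbs draw for a single word position, following Algorithm 2.

    alpha, beta : GD parameters, length K - 1
    gamma_i     : prior weight of the current word type i; gamma_sum = sum(gamma)
    n_wk[k]     : count of word type i in topic k, current word removed
    n_k[k]      : total words assigned to topic k, current word removed
    N_j[k]      : topic counts of document j, current word removed
    """
    K = len(N_j)
    tail = np.cumsum(N_j[::-1])[::-1]              # tail[k] = sum_{l>=k} N_j[l]
    word = (n_wk + gamma_i) / (n_k + gamma_sum)    # Eq. (8) factor, per topic

    cdf = np.empty(K)
    cum = 1.0                                      # running cascade factor
    cdf[0] = (alpha[0] + N_j[0]) / (alpha[0] + beta[0] + tail[0]) * word[0]
    for k in range(1, K - 1):
        cum *= (beta[k-1] + tail[k]) / (alpha[k-1] + beta[k-1] + tail[k-1])
        cdf[k] = cdf[k-1] + cum * (alpha[k] + N_j[k]) \
                            / (alpha[k] + beta[k] + tail[k]) * word[k]
    cum *= (beta[K-2] + tail[K-1]) / (alpha[K-2] + beta[K-2] + tail[K-2])
    cdf[K-1] = cdf[K-2] + cum * word[K-1]          # topic K: cascade factor only
    return int(np.searchsorted(cdf, rng.uniform(0.0, cdf[-1])))
```

Each call touches every topic exactly once, which is where the O(KW) per-sweep cost of the full sampler comes from.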
In contrast, the time complexity of a Gibbs draw for PAM depends on the number of super-topics, S, and sub-topics, K, as O(SKW) [11]. The general recommendation is to set S = K/2, leading to O(K²W).

3.3 Parameter Estimation

Previous research has assumed uniform priors for the topic mixtures θ and the vocabulary distributions for topics φ, without estimating them (constant values for α and γ) [8]. Wallach et al. [16] conclude that parameter estimation with an asymmetric Dirichlet prior over topics improves the fit. In GD-LDA, we estimate the parameters α, β of the GD distribution to discover topic correlations, and we also estimate the parameters γ of the prior distribution of words given topics. Ideally, we would maximize the likelihood p(w|α, β, γ) of the observations w with respect to the hyper-parameters α, β, γ directly. Unfortunately, this distribution is intractable for this model. To solve this issue, we augment the likelihood to p(w, z|α, β, γ) and use Monte Carlo Expectation Maximization (MCEM) [18]. Conditional on the hyper-parameters α, β, γ, we use Gibbs sampling to estimate the expected value of the topic assignments, which is then passed to the maximization step. This procedure is illustrated in Algorithm 3.

Algorithm 3 Monte Carlo EM
Start with an initial guess for α, β, γ and z
repeat
    Run Gibbs sampling using Eqs. (8) and (9)
    Find the expected value of the topic assignments $E(z|\alpha,\beta,\gamma)=\bar{z}$
    Choose γ to maximize the complete likelihood (Eqs. 56–60 in [12])
    Choose α, β to maximize the complete likelihood using Eq. (27)
until convergence of α, β, γ
Choose the topic assignments z∗ with highest probability
Set α∗ = α, β∗ = β, γ∗ = γ
return α∗, β∗, γ∗, z∗

In Eq. (27), $\bar{\alpha}_k^j=\alpha_k+\bar{N}_{j,k}$ and $\bar{\beta}_k^j=\beta_k+\bar{N}_{j,k+1}+\cdots+\bar{N}_{j,K}$. We develop a Newton-based method in Appendix B. To initialize the search, we use the method of moments based on the conditional Beta distributions from the tree representation of the GD distribution [21]. The proportions $p_{k,i}=\bar{N}_{k,i}/\sum_{i=1}^{V}\bar{N}_{k,i}$ are employed for this initialization.

One key component of this optimization is that each pair $\alpha_k,\beta_k$ is optimized separately; thus, the time complexity of this optimization is linear in K. Parameter fitting in PAM is performed using the method of moments due to the high model complexity and the computational cost of the optimization [10, 11]. In addition, the number of parameters to fit for CTM and PAM is K² [2, 19], assuming K/2 super-topics for PAM. As a result, these methods are highly prone to over-fitting, as many parameters are being fit.

3.4 Predictive Distributions

Given the optimal parameters (α∗, β∗) obtained from Algorithm 3 and the word-topic observations (w, z), the predictive distribution for document j, $\hat{\theta}_j$, is estimated as:

$$\hat{\theta}_{jk}=\begin{cases}\dfrac{\alpha_k^*+N_{j,k}}{\alpha_k^*+\beta_k^*+\sum_{l=1}^{K}N_{j,l}}, & k=1\\[2ex]\dfrac{\alpha_k^*+N_{j,k}}{\alpha_k^*+\beta_k^*+\sum_{l=k}^{K}N_{j,l}}\displaystyle\prod_{m=1}^{k-1}\dfrac{\beta_m^*+\sum_{l=m+1}^{K}N_{j,l}}{\alpha_m^*+\beta_m^*+\sum_{l=m}^{K}N_{j,l}}, & 1<k<K\\[2ex]\displaystyle\prod_{m=1}^{K-1}\dfrac{\beta_m^*+\sum_{l=m+1}^{K}N_{j,l}}{\alpha_m^*+\beta_m^*+\sum_{l=m}^{K}N_{j,l}}, & k=K\end{cases}$$
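This cascade is the posterior-mean version of the tree construction from Section 2, so it can be computed in a single O(K) pass. A small Python sketch (ours, not the authors' code; assumes 0-indexed arrays holding the fitted parameters):

```python
import numpy as np

def predictive_theta(alpha, beta, N_j):
    """theta_hat_j from fitted GD parameters (alpha*, beta*, length K - 1)
    and the document's topic counts N_j (length K), per Section 3.4."""
    K = len(N_j)
    tail = np.cumsum(N_j[::-1])[::-1]        # tail[k] = sum_{l>=k} N_j[l]
    theta = np.empty(K)
    cum = 1.0                                # prod_{m<k} E[1 - Z_m | counts]
    for k in range(K - 1):
        z_mean = (alpha[k] + N_j[k]) / (alpha[k] + beta[k] + tail[k])
        theta[k] = cum * z_mean              # posterior-mean stick share
        cum *= (beta[k] + tail[k+1]) / (alpha[k] + beta[k] + tail[k])
    theta[K-1] = cum                         # topic K takes the remaining mass
    return theta                             # sums to one by construction
```

Because each factor pairs a stick's mean with its complement, the entries telescope and the returned vector sums to one without explicit normalization.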