The Generalized Dirichlet Distribution in Enhanced Topic Detection

∗ Karla Caballero Joel Barajas Ram Akella UC, Santa Cruz UC, Santa Cruz UC, Santa Cruz Santa Cruz CA, USA Santa Cruz CA, USA Santa Cruz CA, USA [email protected] [email protected] [email protected]

ABSTRACT 1. INTRODUCTION We present a new, robust and computationally efficient Hier- Topic modeling has been widely studied in Machine Learn- archical Bayesian model for effective topic correlation mod- ing and Text Mining as an effective approach to extract eling. We model the prior distribution of topics by a Gen- latent topics from unstructured text documents. The key eralized Dirichlet distribution (GD) rather than a Dirichlet idea underlying topic modeling is to use term co-occurrences distribution as in Latent Dirichlet Allocation (LDA). We de- in documents to discover associations between those terms. fine this model as GD-LDA. This framework captures cor- The development of Latent Dirichlet Allocation (LDA) [3, 8] relations between topics, as in the Correlated Topic Model enabled the rigorous prediction of new documents first time. (CTM) and Pachinko Allocation Model (PAM), and is faster Consequently, variants and extensions of LDA have been an to infer than CTM and PAM. GD-LDA is effective to avoid active area of research in topic modeling. This research has over-fitting as the number of topics is increased. As a tree 2 main streams in document representation: 1) The explo- model, it accommodates the most important set of topics in ration of super and subtopics as in Pachinko Model Alloca- the upper part of the tree based on their probability mass. tion (PAM) [9, 10]; and 2) the correlation of topics [2, 9, 10]. Thus, GD-LDA provides the ability to choose significant However, despite the improvement of these approaches over topics effectively. To discover topic relationships, we per- LDA, they are rarely used in applications that handle large form hyper-parameter estimation based on Monte Carlo EM datasets, such as Information Retrieval. Major gaps in these Estimation. We provide results using Empirical Likelihood models include: modeling correlated topics in a computa- (EL) in 4 public datasets from TREC and NIPS. Then, we tionally effective manner; and a robust approach to ensure present the performance of GD-LDA in ad hoc information good performance for a wide range of document numbers, retrieval (IR) based on MAP, P@10, and Discounted Gain. vocabulary size and average word length per document. We discuss an empirical comparison of the fitting time. We In this paper we develop a new model to meet these needs. demonstrate significant improvement over CTM, LDA, and We use the Generalized Dirichlet (GD) distribution as a PAM for EL estimation. For all the IR measures, GD-LDA prior distribution of the document topic mixtures, leading shows higher performance than LDA, the dominant topic to GD-LDA. We show that the Dirichlet distribution is a model in IR. All these improvements with a small increase special case of GD. As a result, GD-LDA is deemed to be in fitting time than LDA, as opposed to CTM and PAM. a generalized case of LDA. Our goal is to provide a more flexible model for topics while retaining the conjugacy prop- Categories and Subject Descriptors erties, which are desirable in inference. The features of the H.3.3 [Information Storage and Retrieval]: Information GD-LDA model include: 1) An effective method to represent Search and Retrieval—Document Filtering; G.3 [Probability sparse topic correlations in natural language documents. 2) and Statistics]: Statistical Computing A model which handles global topic correlations with time complexity O(KW ), adding minimal computational cost re- General Terms spect to LDA. This results in a fast and robust approach compared to CTM and PAM. 3) A hierarchical tree struc- Algorithms, Experimentation ture that accommodates the most significant topics, based Keywords on probability mass, at the upper levels allowing us to reduce the number of topics efficiently. Note that GD is a special Statistical Topic Modeling, Document Representation case of Dirichlet Trees [13, 5]. This distribution has been ∗ Main contact. used previously in topic modeling to add domain knowledge to the probability of words given a topic [1]. In contrast, we use the GD distribution to model topic correlations. Thus, this approach is complementary to GD-LDA. Permission to make digital or hard copies of all or part of this work for To validate our model, we use Empirical Likelihood (EL) personal or classroom use is granted without fee provided that copies are in four data sets with different characteristics of document not made or distributed for profit or commercial advantage and that copies length, vocabulary size, and total number of documents. bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific GD-LDA outperforms CTM, PAM and LDA consistently permission and/or a fee. in all the datasets. In addition, we test the performance of CIKM'12, October 29–November 2, 2012, Maui, HI, USA. GD-LDA in adhoc Information Retrieval, obtaining superior Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$15.00. each conditional Binomial distribution of the nodes Vk, we have a GD distribution where this set of Beta distributions is conjugate to the set of Binomial distributions.

Zk ∼ Beta(αk ,βk) Tk ∼ Bin(Zk,Nk) for Nk = N − ivariance, and they must sum to one. When we The GD distribution is a conjugate prior distribution to use the Dirichlet distribution as a prior for the multinomial the Multinomial distribution, then we can integrate out the distribution we have only one degree of freedom, the total parameter of the Multinomial distribution leading to: prior sample size, to incorporate our confidence in the prior K−1 K−1 ′ ′ Γ(αk + βk) Γ(α )+Γ(β ) knowledge. As a consequence, we can not add individual p(T |α, β) = p(T |θ)p(θ)dθ = k k Z Γ(α )Γ(β ) Γ(α′ + β′ ) variance information for each entry of the random vector. kY=1 k k kY=1 k k (3) In addition, all entries are always negatively correlated. In ′ ′ other words, if the probability of one entry increases each of where αk = αk +Tk, βk = βk +Tk+1 +...+TK . The ability to the other probabilities must either decrease or remain the estimate this integral in closed form is crucial for GD-LDA same to sum to one. Despite these limitations, the Dirich- model because this enables us to perform Gibbs sampling. let distribution is widely used given that this is a conjugate This is the product of Beta-Binomial distributions for Zk prior of the multinomial distribution. independent random variables. As a result, this expression The GD distribution allows us to sample each entry of the is factorized into independent ratios of Gamma functions , random vector of proportions from independent Beta distri- which allows us to find the Maximum Likelihood Estimation butions. This independence is the key property that pro- (MLE) of α, β in a simple manner (independently) which is vides more flexibility than the Dirichlet distribution. For- not the case of the Dirichlet distribution. mally this distribution is defined as: 2.2 Covariances: Properties and Constraints K−1 Γ(αj + βj ) αj −1 η In natural language documents, most of the times only p(θ|α, β)= θ (1 − θ1 −···− θ ) j j j (1) a few topics co-occur leading to sparse topic correlations Y Γ(αj )Γ(βj ) j=1 [19]. This implies that a very few random topics may suffice, rather than the full joint distribution including all the topics. where θ1 + θ2 + . . . + θK−1 + θK = 1, ηj = βj − αj+1 − βj+1 When considering useful distributions for topic modeling, for 1 ≤ j ≤ K − 2 and ηK−1 = βK−1 − 1. a desirable feature is to model only a few topics in a doc- To illustrate the properties of the GD distribution we de- ument. For instance, the topic domestic politics could fre- fine Z1 = θ1 and Zk = θk/Vk for k = 2, 3,...,K − 1 where quently co-occur with middle east politics and health care Vk = 1−θ1 −···−θk−1. Let Tk be the discrete random vari- reform. This implies that if we observe domestic politics, able with multinomial distribution with parameter θ1 ...θK the conditional probabilities of observing middle east politics for K different categories. We start at node V1, and at this and health care reform will increase. However, if a document node we sample T1 with probability Z1 and V2 with proba- contains domestic politics and middle east politics, it could bility 1 − Z1. Conditional on V2, we sample T2 with prob- very rarely contain health care reform. ability Z2 and V3 with probability 1 − Z2. In the general GD has the computational advantage of only having 2K − case, conditional on Vk, we sample Tk with probability Zk 1 parameters. This also implies that the distribution covari- and Vk+1 with probability 1 − Zk for k = 1 ...K − 1. If we ances are constrained. The question is whether it is possible now add a prior Beta distribution with parameters αk, βk for that these constraints imply that only a few topics co-occur Algorithm 1 GD-LDA Generative Model This expression allows us to decompose the problem into two for topic k ← 1 to K do hierarchical models that can be treated and optimized sep- draw φk ∼ Dir(γ) end for arately based on these conditional probabilities. The prob- for document j ← 1 to D do ability of words given topics is: draw θj ∼ GenDir(α, β) for do word w ← 1 to Nj p (w|z, γ)= p (w|φ,z) p (φ|γ,z) dφ = draw zwj ∼ Mult(θj ) Zφ draw w|zwj = k ∼ Mult(φk ) K V V (5) end for Γ( i=1 γi) i=1 Γ(γi + Nk,i) end for PV Q V kY=1 i=1 Γ(γi) Γ( i=1(γi + Nk,i)) Q P For the probability of topics, we have a GD prior distribution for the topic mixtures in each document which is assumed to be Multinomial. Thus we have:

p(z|α, β)= p (z|θ) p (θ|α, β) dθ = Zθ D K−1 K−1 j j (6) Γ(αk + βk) Γ(αk) + Γ(βk) Γ(α )Γ(β ) j j jY=1 kY=1 k k kY=1 Γ(αk + βk)

j j where αk = αk + Nj,k, βk = βk + Nj,k+1 + ... + Nj,K . The topic assignments z are not observed. Then, we define Figure 2: Graphical Model for GD-LDA. α and β are a Gibbs sampling method to infer z as follows: vectors of size K − 1 where K is the number of topics. γ ¬wj p(w|z, γ)p(z|α, β) is a vector of size V (vocabulary size). p(zwj = k|z , α, β, γ) = (7) p(w|z¬wj , γ)p(z¬wj |α, β) in a document. In this respect, the covariance properties de- ¬wj scribed in Appendix A.1, particularly the third one, indicate Here, z represents the topic assignments for all the words that the covariances are dependent on the expected value of except the word w from document j. This analysis leads us two topics, then several covariances might be close to zero to the following distributions for Gibbs sampling: if the right hand side ratio is small. This is an important ¬wj p(w|z, γ) Ni,k + γi feature of the GD distribution which we propose to exploit ∝ (8) p(w|z¬wj, γ) V ¬wj in topic modeling. A detailed discussion of the covariance i=1 N + γi i,k constraints of GD distribution is provided in Appendix A. P p(z|α, β) ¬ ∝ 3. GD-LDA: METHODOLOGY p(z wj |α, β) ¬wj αk + Nj,k In this section, we depict the parameter estimation pro- ¬ , k = 1  α + β + K N wj cess to fit the GD as a prior distribution for topics. We show  k k l=1 j,l   P that GD-LDA can be computed with a computational cost  ¬wj k−1 K ¬wj  α + N βm + N  k j,k l=m+1 j,l similar to that of LDA. Fig 2 shows the graphical model  ¬wj ¬wj 1 1 iterations and pass it to the next iteration of a vector of size V . α and β are vectors of size K − 1. the evaluation. This is illustrated in Algorithm 2. From this 3.2 Model and Gibbs Sampling pseudo code, O((2K +1)W ) operations are required for each Gibbs sampling draw for the whole corpus, where W is the We define the joint probability associated with the graph- total number of words. Thus, the time complexity for each ical model defined in Fig 2 with joint distribution: iteration is O(KW ) as in the case of LDA [15]. This com- p(w,z|α, β, γ)= p(w|z, γ)p(z|α, β) (4) plexity is possible due to the cascade structure of the GD Algorithm 2 GD-LDA Gibbs Sampling Algorithm 3 Monte Carlo EM cumF actor ← 1, k ← 1 Start with an initial guess for α, β, γ and z ¬wj ¬wj N +γ αk + Nj,k i,k i repeat CDF [k] ← ¬wj × ¬wj α β K N V N γ k + k+ i + i Run Gibbs Sampling using Eqs. 8 and 9 l=1 j,l =1 i,k for k ← 2 to K − 1 Pdo P Find Expected value for topic assignments E(z|α, β, γ)= z ¬wj β + K N k−1 l=k j,l Choose γ to maximize complete Likelihood Eqs 56-60 in [12] cumF actor ← cumF actor × P ¬wj α +β + K N Choose α, β that maximize complete Likelihood using Eq 27 k−1 k−1 l=k−1 j,l P ¬wj until convergence of α, β, γ αk + Nj,k ∗ CDF [k] ← CDF [k − 1] + cumF actor × ¬wj × Choose topic assignments z with highest probability α +β + K N ∗ ∗ ∗ k k l=1 j,l Set α = α, β = β, γ = γ ¬wj P N γ ∗ ∗ ∗ ∗ i,k + i return α ,β , γ ,z ¬wj V N +γ i=1 i,k i endP for ¬wj N +γ i,K i j j CDF [K] ← CDF [K − 1] + cumF actor × ¬wj +1 V N +γ where αk = αk + N j,k, βk = βk + N j,k + ... + N j,K . i=1 i,K i P We develop a Newton-based method in Appendix B. To initialize the search, we use the method of moments based on the conditional beta distributions from the tree represen- distribution, which facilitates the calculation of the cumula- tation of the GD distribution [21]. The proportions pk,i = tive factor in Algorithm 2. In contrast, the time complexity V N k,i/ Pi=1 N k,i are employed for this initialization. of a Gibbs draw for PAM depends on the number of super- One key component of this optimization is that each pair topics, S, and sub-topics K as O(SKW ) [11]. The general 1 2 αk, βk is optimized separately. Thus, the time complexity for recommendation is to set S = K/2 leading to O(K W ). this optimization is linear in time with K. Parameter fitting in PAM is performed using the method of moments due 3.3 Parameter Estimation the high model complexity and computational cost of the Previous research has assumed uniform priors for the topic optimization [10, 11]. In addition, the number of parameters mixtures θ and the vocabulary distribution for topics φ with- to fit for CTM and PAM is K2 [2] [19], assuming K/2 super- out estimating them (constant values for α and γ) [8]. Wal- topics for PAM. As a result, these methods are highly prone lach et al. in [16] concludes that parameter estimation with to over-fitting, as many parameters are being fit. asymmetric Dirichlet prior probability of topics provides an improvement in the fitting. In GD-LDA, we estimate the 3.4 Predictive Distributions parameters α, β of the GD distribution to discover topic cor- Given the optimal parameters (α∗, β∗) obtained from Al- relations. We estimate the parameters for the prior distri- gorithm 3, and the word-topic observations (w,z) the pre- bution of words given topics γ. Ideally, we should maximize dictive distribution for document j, θˆj , is estimated as: the likelihood p(w|α,β,γ) for observations w and hyper- θˆjk = parameters α,β,γ directly. Unfortunately, this distribution ∗ α + N j,k is intractable for this model. To solve this issue, we aug- k , if k = 1  ∗ ∗ K αk + βk + l N j,l ment the likelihood to p(w,z|α,β,γ) and use Monte Carlo  =1  P Expectation Maximization (MCEM) [18]. Conditional on  − ∗  ∗ k 1 β + K N  αk + Nj,k m l=m+1 j,l hyper-parameters α,β,γ, we use Gibbs sampling to esti-  × if 1

− − D K 1 K 1 Γ(αj )+Γ(βj ) new new Γ(αk + βk) k k 4.1 Validation α ,β = arg max j j α,β Γ(αk )Γ(βk) Γ(α + β ) jY=1 kY=1 kY=1 k k The main challenge to validate statistical topic models is (11) the lack of reliable observed topic labels for each word of a 1K/2 super-topics is the recommendation by Mallet toolbox, corpus. The standard approach is to estimate the likelihood http://mallet.cs.umass.edu/ of held-out data [17]. In addition, model performance in a Table 1: Features of the datasets analyzed. Mean doc- Algorithm 4 EL Estimation for model M ument includes the 95% interval for s ← 1 to Ns do Dataset NIPS NYT APW OHSUMED Sample document ds with mixture θs given model M # documents 1840 5553 14657 196404 ˆ K ˆ # unique terms 13649 11229 18471 38900 Find p(w = i|θs, φ,M)= k=1 θskφki for i = 1,...,V end for P Mean doc. length 1322 ± 274 274 ± 132 169 ± 76 186 ± 36 Set logEL= 0 Table 2: Number of Gibbs samples per EM iteration for t ← 1 to Dt do Nt MCEM iteration 1-4 5-6 7-8 9 10 Find p(dt|ds,M)= p(wt|θs, φ,ˆ M) wt=1 Burn-in Samples 10 15 20 40 50 1 Ns Find p(dt|M)= Q p(dt|ds,M) Ns s=1 Gibbs Samples 50 100 200 500 700 Update logEL=logEL+P Log(p(dt|M)) end for P (w|θ ), as follows: return logEL TM j K PTM (w|θˆj , φˆ)= P (w|z = k, φˆk)P (z = k|θˆj) (14) supervised task has also been used [22]. Here, we validate kX=1 the GD-LDA performance addressing these two forms. We This is augmented with the maximum likelihood estimate for estimate the likelihood of completely unseen documents us- the language model based on document terms Dj , PML(w|Dj ), ing Empirical Likelihood, and we compare the performance and for the language model based on the corpus C, PML(w|C), of topic models in the ad hoc Information Retrieval as de- leading to: veloped in [19]. Nj µ PIR(w|Dj , θˆj , φˆ) = λ PLM (w|Dj )+ PML(w|C) 4.1.1 Empirical Likelihood (EL) Nj + µ Nj + µ

There has been a debate about the evaluation of topic +(1 − λ)PT M (w|θˆj , φˆ) models with methods ranging from the harmonic mean of (15) complete likelihood for topic assignments, to perplexity and where µ is a smoothing parameter, Nj is the number of empirical likelihood. As discussed in [17], perplexity falls words in document j, and λ is a parameter for the linear into the category of document completion where a portion combination. Therefore, the ranking function for query Q each document must be observed to estimate the likelihood with terms q is given by: of the remaining content. Similarly, a “left to right” evalua- ˆ ˆ ˆ ˆ tion has been proposed to estimate the probability of words P (Q|Dj , θj , φ)= PIR(w = q|Dj , θj , φ) (16) in a test document incrementally [17]. For validation, we qY∈Q use Empirical Likelihood (EL) criterion. Our intent is to We use the parameter values: µ=1000, λ=0.7. These are predict the likelihood of fully unseen documents. We do not the values recommended in [19]. use perplexity or “left to right” evaluation because they are based on the word order inside the document, in contrast 4.2 Experimental Settings to “bag of words”. In addition, EL has been shown to be a We use 4 different datasets to test our model. NIPS con- more pessimistic approximation for the probability of held- 2 ference papers dataset which contains long documents (8- out documents than “left to right” method [17]. 10 pages). Two news datasets, NYT and APW, obtained We estimate EL as described in [11]. Here, we generate a from TREC-3. These collections contain shorter documents set of pseudo documents, θ using the estimated prior topic s (1 page) with different vocabulary size. A fourth dataset, distribution of the model being tested using training data. OHSUMED from TREC-9, consists of abstracts from med- Then, a word distribution is estimated for each θ based on s ical papers. In contrast to the other datasets, the number φˆ. This is used to estimate the probability of seeing the test of documents is much larger (more than ten times). Table set. We define this process by means of Algorithm 4. Here, 1 shows the features of these datasets. We remove standard D represents the number of documents for a test set, d is t t stop words and perform stemming in all datasets. the document t = 1,...,D , w are the N words of d , and t t t t We test 4 variants of the GD-LDA algorithm. In two of N is the number of pseudo documents given model M. We s these cases we fix the value of γ and optimize the parameters use the generative approach based on the conditional Beta α and β as in Algorithm 3 based on two variants: 1) using distributions of the tree representation for GD as in Fig 1. the empirical expected topic assignmentsz ¯ [18], GD-LDA The main limitation of EL estimation is that the number of ∗ Fixed Gamma Mean; and 2) using the empirical mode z [7] pseudo documents should be sufficiently large to cover the in the M-step, GD-LDA Fixed Gamma Max. In the other parameter space of θ given the trained model. Then, N s two cases, we optimize all the three parameters (α,β and γ) determines the accuracy of the approximation. based on the expected topic assignments, GD-LDA Mean, 4.1.2 Ad hoc Retrieval and the topic assignment mode GD-LDA Max. We compare GD-LDA, LDA with parameter optimization, Information Retrieval represents a hard problem where CTM and PAM. For LDA, we use our own implementation. the gains of new models are often small or nil. This applica- We follow the collapsed Gibbs sampling approach [8], with tion represents a good test of the power of our approach. We asymmetric Dirichlet prior distributions for both, the prob- compare the performance of different topic models in ad hoc ability of words given a topic, and for the probability of Information Retrieval (IR). We use the approach proposed in topics. These parameters are optimized as discussed by au- [19] and incorporate our model by replacing the LDA model thors in [6]. For optimization, we follow the Newton-based described in the method. Based on the predictive distribu- approach proposed by Minka in Eqs 56-60 [12]. For both tion estimators for topics φˆ, and for each document θˆj , we provide a topic-based language model for each document, 2Available at: http://cs.nyu.edu/~roweis/data.html Figure 3: Qualitative example of GD-LDA for the NYT dataset with 20 topics. Probability of topics: marginal probability as defined by Eq 18 (red), conditional probability given the parent node as defined by Eq 22 (blue).

GD-LDA and LDA, we set a maximum of 1, 000 Newton iterations for parameter optimization. The schedule of the Gibbs sampling is detailed in Table 2. We initialize the values αk = 2/K for k = 1,...,K − 1, and βK−1 = 2/K in GD-LDA to obtain the special case of Dirichlet distribution; Figure 4: Correlation graph of a subset of topics inferred we start with LDA as prior model, and allow the data to by GD-LDA for the NYT dataset with 30 topics. adapt the model in the MCEM iterations to the more general GD. For CTM3, we use the default settings with parameter dimensionality reduction. If this probability is large com- estimation: a maximum of 1, 000 EM iterations with con- −5 pared with the probability of exploring further the tree, we vergence of 10 and maximum of 20 variational iterations −6 4 can discard the remaining tree. with a convergence rate of 10 . For PAM we use 1000 Fig 4 shows some of the positive and negative correlations Gibbs samples and K/2 super topics. This implementation between different topics inferred by GD-LDA. We observe supports multi-threading; to enable a fair comparison we use that Health is positively correlated with Technology and Fi- only one thread. We modify the implementation to obtain nancial. Y2K is positively correlated with Technology and the fitted parameters since they are not provided by default. Arts but negatively correlated with Foreign Politics. Note For EL estimation, we perform 10-fold cross-validation and that by default LDA assumes negative correlations, thus the we sample N = 10, 000 pseudo documents. s main value of topic correlated models is to discover positive 4.3 Results correlations. Fig 5(a) shows the decomposition of a sports document from the NYT dataset into topics using PAM with 4.3.1 Qualitative Results 10 super-topics and 20 sub-topics, CTM and GD-LDA with 20 topics. This shows that CTM and GD-LDA assign a Fig 3 shows a qualitative example of GD-LDA in the NYT high probability to a single topic (sports). On the other dataset. The figure represents the empirically estimated hand, PAM provides 3 sub-topics and most of the super- probability of topics for a subset of the topics. Each box topics with a significant probability mass. This is not de- represents a topic and the terms displayed are the ones with ˆ sirable since an important objective with topic modeling is highest posterior predictive probability φk. Given the fitted to cluster documents based on the topic mixture. Fig 5(b) GD-LDA model, we calculate the point estimate of the to- shows a comparison between CTM and GD-LDA for a docu- tal probability for each topic (in blue) and its conditional ment about the Y2K problem and airport functionality. We probability given a parent node (red). Here, conditional on observe that GD-LDA provides better segmentation of the the current node, we move deeper into the tree or observe document based on the topics proportions and content. a word from the topic at that level. As we move down into In addition, we calculate the distribution of the number of the tree, the conditional probability of picking the left hand significant topics in a document. We rank the topics based topic increases since the probability mass of the remaining on their topic probability mass in each document. Then, topics is less. This is a desirable property which facilitates we estimate the number of topics which accounts for 95% 3We use the CTM implementation provided by Blei at http: of this probability mass. Fig 6 shows the distribution of //www.cs.princeton.edu/~blei/ctm-c/. this number of topics inferred by LDA, CTM, PAM, and 4We use the implementation provided by Mallet http:// GD-LDA. For PAM, we consider the number of sub-topics. mallet.cs.umass.edu/. Here, we observe that CTM favors high number of topics for each documents which introduces noise when the goal is to characterize documents as in Information Retrieval. In PAM the number of topics per document is highly dependent of the number of super topics used. In order to handle topic correlation sparsity, PAM prunes the relationships between a super topic and subtopics [11]. This reduces the amount of correlations that can be modeled. In contrast, GD-LDA favors smaller number of topics per documents and without any constraint in the inference. The key property for this behavior is the cascade structure of the distribution. As discussed in section 2.2, this validates empirically the intuition of few significant topics for natural language documents, and sparse correlations. 4.3.2 Empirical Likelihood Results (a) Fig 7 shows EL estimations for the 4 variants of GD-LDA, CTM, PAM and LDA with asymmetric prior and parameter optimization in the four datasets. For the NIPS dataset, LDA is not shown since its predictive likelihood is extremely low. Similarly, PAM performance for the OHSUMED dataset is not shown for the same reason. The CTM model would not run for the OHSUMED dataset (mid-size dataset) based on the number of documents and vocabulary size. We observe that optimizing γ has a low impact as more data becomes available. Conceptually, a prior distribution represents the prior knowledge with a given sample size. Consequently adding more data decreases the impact of the prior information. This is consistent with the conclusions discussed in [16]. We then compare the use of the topic assignments,z ¯ (posterior mean), and the topic assignment mode, z∗ (maximum a posteriori MAP). We find that GD- LDA mean performs better, when the number of documents and the vocabulary size are relatively small (NIPS dataset). (b) In contrast, GD-LDA Max shows the highest performance Figure 5: Qualitative comparison of the topic distribu- for the other datasets, Fig 7(b)-(d). In particular, this tion for two documents from the NYT dataset. (a) From method performance is clearly superior for all the topics left to right: PAM, CTM, and GD-LDA (b) From left to when tested for the largest dataset (OHSUMED). GD-LDA right: GD-LDA and CTM Mean requires fewer topics to achieve a superior performance 30 30 compared to LDA, CTM and PAM. Moreover, EL decreases 25 25 more smoothly than these methods. It also remains fairly 20 20 constant when the number of topics attains a high value.

15 15 An important difference between the datasets analyzed is the average document length and number of unique terms. 10 10 The worst performance of CTM, as well as PAM, is found in 5 5 the case of APW dataset. This dataset has a larger vocab- 0 0 0 10 20 30 40 50 0 10 20 30 40 50 ulary size, and significantly shorter documents than NIPS (a) (b) dataset, where both CTM and PAM show a similar perfor- GD-LDA Mean 30 30 mance to that of . This suggests over-fitting by CTM where a full topic covariance matrix is estimated, 25 25 and by PAM where a matrix of K subtopics by K/2 super- 20 20 topics is estimated. In addition, PAM uses the method of 15 15 moments, instead of optimization, for parameter estimation.

10 10 This is a limitation of PAM when EL based evaluation is

5 5 performed since the ﬁtted parameters do not optimize the likelihood. As discussed in section 4.3.1, CTM favors larger 0 0 0 10 20 30 40 50 0 10 20 30 40 50 number of topics for each document than the other meth- (c) (d) ods. This behavior introduces noise in EL because correla- Figure 6: Distribution of the number of signiﬁcant tions are sparse, and only a few topics are present in natural topics per document for the APW corpus inferred by: language documents. (a)LDA, (b)CTM, (c)PAM, and (d)GDLDA using K = 50 topics. X-axis is number of topics per documents and Y - 4.3.3 Ad hoc Information Retrieval Results axis is the percentage of documents in the corpus. We show the application of GD-LDA in ad hoc IR using the OHSUMED dataset. As discussed above, we use the ap- 4 x 10 −3000

−1 −3020 −3040

−1.01 −3060

−3080 −1.02 −3100

−1.03 −3120

−3140 GD−LDA Max GD−LDA Mean Empirical Likelihood GD−LDA Mean −1.04 Empirical Likelihood CTM −3160 GD−LDA Fixed Beta Mean GD−LDA Max GD−LDA Fixed Beta Max GD−LDA Fixed Gamma Max −3180 −1.05 LDA GD−LDA Fixed Gamma Mean −3200 CTM PAM PAM

−1.06 0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200 Number of Topics Number of Topics (a) (b)

−1900 −1920 −840 −1940 −845 −1960 −1980 −850 −2000

−2020 −855 −2040 GD−LDA Fixed Gamma Max GDLDA Mean

Empirical Likelihood −2060 GD−LDA Fixed Gamma Mean −860 GD−LDA Gamma Mean GD LDA Max Empirical Likelihood −2080 GD−LDA Gamma Max GDLDA Fixed Gamma Mean GD LDA Fixed Gamma Max −2100 LDA −865 CTM LDA −2120 PAM −870 0 20 40 60 80 100 120 140 160 180 200 0 25 50 75 100 125 150 Number Topics Number of Topics (c) (d) Figure 7: Mean per-document log-Likelihood for (a) NIPS, (b)NYT, (c)APW and (d) OHSUMED datasets as a function of the number of topics.

Fig 8 shows the computational time in minutes to fit the Table 3: Results of ad hoc IR using GDLDA, LDA and model for LDA, CTM, PAM and GD-LDA in the datasets PAM models using K topics for the OHSUMED dataset we consider. A non-linear increase in time is observed after Model K P@10 MAP DG GD-LDA 50 0.502±0.06 0.496±0.06 0.790±0.07 50 topics for CTM in NYT and APW datasets, and after LDA 50 0.470±0.06 0.470±0.06 0.780±0.08 80 topics in NIPS dataset. We observe that the computa- PAM 50 0.434±0.06 0.439±0.05 0.671±0.04 tional cost of PAM grows quadratically after 20 topics. In GD-LDA 75 0.470±0.06 0.491±0.06 0.742±0.04 general, the computational cost of GD-LDA is comparable LDA 75 0.470±0.06 0.461±0.06 0.741±0.04 to that of LDA and is less than CTM and PAM. This is a PAM 75 0.431±0.06 0.438±0.05 0.651±0.03 significant advantage since GD-LDA provides a more flexible model structure when compared with LDA. Moreover, proach from [19] as a benchmark for comparison. We train variances and a number of covariances are modeled more ef- the GD-LDA Max using K = 50 and K = 75 topics, and fectively, and with minimal increase in computational cost. compare its performance with LDA and PAM. We use stan- This property makes GD-LDA more suitable than CTM or dard IR measures: Precision at 10 (P@10), Mean Average PAM for larger scale applications, such as IR. Precision (MAP) and Discounted Gain (DG), to compare these methods. The OHSUMED dataset is considered to be 4.3.5 Choosing the Number of Topics a medium size dataset in the IR literature. This contains 63 An open problem is how to select the optimal number of topical queries with 35, 000 relevant labels. topics to train a model. In general, an exhaustive experimen- We have not modified the retrieval model of [19] to ex- tation needs to be performed to select the optimal number ploit the power of the new GD-LDA model. Despite this, of topics. Due to the tree structure of the GD distribution, the performance improvement for all the measures is signif- GD-LDA tends to accommodate the most relevant topics, icant as we observe in Table 3. Here, the best performance based on probability mass, at the upper levels of the tree. is for K = 50 topics, where EL estimation peaks in Fig Fig 9 shows the cumulative probability vs the number 7(d). For this case, there is agreement between the adhoc of topics of the expected topic mixture of documents given IR and EL performance. Respect to LDA, GD-LDA shows the fitted parameters for GD-LDA and CTM. We observe improvement of: 6.3% for P@10, 5.5% for MAP, and 1.1% that GD-LDA favors a smaller number of topics. Notice for DG. Note that PAM shows lower performance than LDA that, after a certain number of topics, the expected contri- in IR for all the measures. This is consistent with the IR bution of the remaining ones is not significant. This prevents performance of PAM reported in [22]. GD-LDA from over-fitting. When comparing GD-LDA with CTM, we observe that CTM favors uniform topic mixtures 4.3.4 Computational Cost Comparison in each document. Thus, if we fit CTM for larger number One advantage of GD-LDA over CTM and PAM is that its of topics, we would observe the same linear behavior as seen computational complexity is linear in the number of topics. in graphs from Fig 9. This prevents CTM from discarding This is not the case for CTM and PAM which scale quadrat- any topic easily or suggesting an optimal topic range as op- ically with the number of topics. As discussed in section 3, posed to GD-LDA. In addition, this behavior makes CTM GD-LDA should add minimal computational cost to LDA. less effective in handling over-fitting.

3000 10000 14000 LDA LDA GD−LDA GD−LDA GD−LDA GD LDA 2000 12000 LDA CTM 2500 LDA CTM PAM PAM 8000 PAM CTM PAM 10000 1500 2000 6000 8000 1500 1000 6000 4000 1000 4000 500 2000 500 2000

0 0 0 0 0 50 100 150 200 0 20 40 60 80 100 120 140 160 0 50 100 150 200 0 25 50 75 100 125 150 Number of Topics Number of Topics Number of Topics Number of Topics (a) (b) (c) (d) Figure 8: Average running times in minutes of CTM, GD-LDA and LDA for: (a) NIPS, (b) APW, (c) NYT, (d) OHSUMED. x-axis represents the number of topics. All the experiments were performed using a Quad Intel machine with 2.5GHz with 8GB of memory.

1 1 the Dirichlet tree to have a comparable notion of subtopics

0.8 0.8 and super topics as in PAM with the same conjugacy and

0.6 0.6 computational cost as GD-LDA.

0.4 0.4 We have shown that the use of GD-LDA in adhoc IR in- Cumulative Probability Cumulative Probability 0.2 0.2 creases the performance signiﬁcantly, in contrast to earlier

0 0 incorporations of topic models. Future directions include 0 50 100 150 200 0 50 100 150 200 Number of Topics Number of Topics the use of topic mixture instead the probability of words in (a) the ranking function. In addition, we plan to explore the 1 1 impact of GD-LDA in Interactive Information Retrieval.

0.8 0.8 0.6 0.6 6. ACKNOWLEDGEMENTS 0.4 0.4 Cumulative Probability

Cumulative Probability This work is partially funded by CONACYT grant 207751 0.2 0.2 and CONACYT UC-MEXUS grant 194880. 0 0 0 50 100 150 200 0 50 100 150 200 Number of Topics Number of Topics (b) 7. REFERENCES Figure 9: Cumulative distribution of the topic mix- [1] D. Andrzejewski, X. Zhu, and M. Craven. Incorporating tures based on the fitted prior distribution parame- domain knowledge into topic modeling via dirichlet forest ters. From left to right GD-LDA and CTM for: (a) priors. In ICML, pages 25–32, 2009. NIPS NIPS and (b) APW. The distributions are shown for [2] D. Blei and Lafferty. Correlated topic models. , 18:147–154, 2006. {20, 30, 40, 60, 80, 100, 125, 150, 175, 200} topics. [3] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning, 3:993–1022, 2003. 5. DISCUSSION AND CONCLUSION [4] R. J. Connor and J. E. Mosimann. Concepts of We have introduced the use of the GD distribution in independence for proportions with a generalization of the probabilistic topic modeling. The advantages of the GD over dirichlet distribution. Journal of American Statistics, the Dirichlet distribution, and the benefits when compared 64:194–206, 1969. with the estimation of the full covariance matrix in CTM [5] S. Y. Dennis. A bayesian analysis of tree-structured statistical decision problems. Journal of Statistical have been described. The apparent constraints on covari- Planning and Inference, 53(3):323–344, 1996. ances in GD actually results in modeling better sparse topic [6] G. Doyle and C. Elkan. Accounting for burstiness in topic correlations in natural language documents, as our empirical models. Proceedings of the 26th ICML, 2009. validation indicates. This results in better performance in [7] C. Elkan. Clustering documents with an exponential family empirical likelihood and IR measures. We have developed approximation of the dirichlet compound multinomial an efficient Gibbs sampling model which uses the conjugacy distribution. In Proceedings of 23rd ICML, 2006. property of GD with the Multinomial distribution. We have [8] T. Griffiths and M. Steyvers. Finding scientific topics. demonstrated that the running time of GD-LDA is compa- Proceedings of the NAS, 101(6):5228–5235, 2004. rable to LDA and less than CTM and PAM. This provides [9] W. Li, D. Blei, and A. McCallum. Nonparametric Bayes pachinko allocation. In 23rd UAI Conference, 2007. a model computationally competitive and with better per- [10] W. Li and A. McCallum. Pachinko allocation: formance than these methods. Dag-structured mixture models of topic correlations. In We have shown that the impact of optimizing the vocab- Proceedings of ICML 06, pages 577–584, 2006. ulary parameter γ decreases when the vocabulary size and [11] W. Li and A. McCallum. Pachinko allocation: Scalable the number of documents in the corpus is large. Due to mixture models of topic correlations. Journal of Machine the tree structure of the GD distribution, GD-LDA proves Learning, 2008. Submitted. to be powerful in handling over-fitting with a large number [12] T. P. Minka. Estimating a dirichlet distribution. 2003. of topics as its performance remains fairly high even when [13] T. P. Minka. The dirichlet-tree distribution. 2004. the number of topics is increased. This is not the case for [14] B. Null. The nested dirichlet distribution: Properties and applications. Submitted, 2008. CTM, PAM, and LDA. As a consequence, we can reduce the [15] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, number of topics, by using the conditional probability of the and M. Welling. Fast collapsed gibbs sampling for latent remaining topics when we are moving down into the tree. A dirichlet allocation. Proceedings of the KDD 2008, 2008. natural extension of the model is to allow the tree struc- [16] H. Wallach, D. Mimno, and A. McCallum. Rethinking lda: ture to be fitted. A direction of improvement is to modify Why priors matter. In NIPS 22, pages 1973–1981. 2009. [17] H. M. Wallach, I. Murray, R. Salakhutdinov, and The GD distribution models the Cov(θk,θk+1) for con- D. Mimno. Evaluation methods for topic models. In secutive tree levels without constraints. The covariances for Proceedings of ICML ’09, pages 1105–1112, 2009. deeper levels in the tree Cov(θk,θj ),j > k + 1 are con- [18] G. C. G. Wei and M. A. Tanner. A Monte Carlo strained. Since θ1 is used as base category, it is always Implementation of the EM Algorithm and the Poor Man’s Data Augmentation Algorithms. Journal of the American negatively correlated with the other categories. This is a Statistical Association, 85:699–704, 1990. typical constraint of the Logistic Normal distribution used [19] X. Wei and W. B. Croft. Lda-based document models for in CTM[2]. The GD distribution constrains the sign of ad-hoc retrieval. In Proceedings SIGIR 2006, 2006. Cov(θk,θj ) to be the same as that of Cov(θk,θk+1) for j > [20] T.-T. Wong. Generalized dirichlet distribution in bayesian k+1. These constraints imply that probability of co-occurrence analysis. Applied Mathematics and Computation, of these topics is very small. 97:165–181, 1998. [21] T.-T. Wong. Parameter estimation for generalized dirichlet distributions from the sample estimates of the first and the B. MAXIMIZATION OF α AND β FOR GD second moments of random variables. Comput. Stat. Data To estimate the parameters of the GD distribution, the Anal., 54(7):1756–1765, 2010. log-likelihood of Eq. 6 is optimized using the Newton method. [22] X. Yi and J. Allan. Evaluating topic models for information retrieval. In Proceedings CIKM, 2008. D K−1 K−1 L(α, β) = log Γ(αk + βk) − (log Γ(αk ) + logΓ(βk)) jX=1 kX=1 kX=1 APPENDIX K−1 K−1 j j j j + log Γ(αk ) + logΓ(βk) − log Γ(αk + βk) A. FIRSTANDSECONDMOMENTSOFTHE kX=1 kX=1 (23) GD DISTRIBUTION By taking the derivative with respect to α, β we have:

We derive the ﬁrst and second moments of the GD distri- D D bution based on the tree representation of Eq 2. We write: ∂L(α, β) j j j = DΨ(αk + βk) − DΨ(αk )+ Ψ(αk) − Ψ(αk + βk) ∂αk k−1 jX=1 jX=1 θ1 = Z1 θk = Zk(1 − θ1 −···− θk−1) = Zk m=1(1 − Zm) D D (17) ∂L(α, β) j j j Q = DΨ(αk + βk) − DΨ(βk)+ Ψ(βk) − Ψ(αk + βk) 2 ∂βk Let Sk = E(Zk), Rk = E(Zk). Since Zk’s are independent: jX=1 jX=1 (24) k−1 2 k−1 E(θk) = Sk (1 − Sm), E(θ ) = Rk (1 − 2Sm + Rm) j j m=1 k m=1 Recall that α = αk + N j,k, β = βk + N j,k+1 + . . . + Q Q (18) k k Similarly for the crossproducts where k < j: N j,K . Notice that the derivative of L(α, β) with respect to αk just depends on αk and βk as opposed to the Dirichlet k−1 j−1 E(θkθj ) = E Zk m=1(1 − Zm) Zj m=1(1 − Zm) distribution. For the second derivative we have: k−1 n hQ 2 i hQj−1 io = E m=1(1 − Zm) Zk (1 − Zk) m=i+1(1 − Zm) Zj 2 D D ∂ L(α, β) ′ ′ ′ j ′ j j nhk− i h j− i o Q1 Q 1 = DΨ (αk + βk) − DΨ (αk )+ Ψ (αk) − Ψ (αk + βk) = m=1(1 − 2Sm + Rm) (Sk − Rk) m=i+1(1 − Sm) Sj (∂α )2 k jX=1 jX=1 hQ i hQ i (19) 2 D D ∂ L(α, β) ′ ′ ′ j ′ j j Therefore, we have for the variance and covariance: = DΨ (αk + βk) − DΨ (βk)+ Ψ (β ) − Ψ (α + β ) (∂β )2 k k k k jX=1 jX=1 − − k 1 2 k 1 2 2 D V ar(θk) = Rk m=1(1 − 2Sm + Rm) − Sk m=1(1 − Sm) ∂ L(α, β) ′ ′ j j h 2 2 2 i = DΨ (αk + βk) − Ψ (αk + βk) = RkEQ(V ) − S (E(Vk)) Q ∂α ∂β k k k k jX=1 (20) ∂2L(α, β) ∂2L(α, β) j−1 = 0, =0 for l 6= k Cov(θk,θj ) = Sj m=i+1(1 − Sm) ∂αk ∂αl ∂βk∂βl − h i − (25) (S − R ) k 1 Q(1 − 2S + R ) − (S − S2 ) k 1 (1 − S )2 k k m=1 m m k k m=1 m Therefore, the Hessian matrix can be written as: n Q 2 Q o = E(θj )/E(Vk+1) [SkV ar(Vk) − V ar(θk)] (21) H(α, β) = block-diag[H1(α1,β1),...,HK−1(αK−1,βK−1)] (26) for k = 1,...,K in Eq 20, and for k = 1,...,K − 1 and where H is the Hessian matrix for α , β . Following a simi- j = k + 1,...,K in Eq 21. We estimate S , R based on the k k k k k lar logic as in the case of a Dirichlet distribution described in independent Beta distributions of Z . From the moment k Eqs (56-60) of [12], we have the following Newton iteration: generating function for the Beta distribution we have: − − α α (α +1) new 1 new 1 k k k αk = αk − Hk gk , βk = βk − Hk gk , Sk = α +β Rk = (α +β )(α +β +1) k = 1,...,K (22) 1 2 k k k k k k D D k ′ j ′ k ′ j ′ where αK = 1, βK = 0 [4]. Note that the variances and the q11 = Ψ (αk) − DΨ (αk), q22 = Ψ (βk) − DΨ (βk), jX=1 jX=1 expected value are not constrained. k k k k D g1 /q11 + g2 /q22 ′ ′ j j ak = , bk = DΨ (αk + βk) − Ψ (α + β ) −1 k −1 k −1 k k A.1 Covariance Properties of GD Distribution bk +(q11) +(q22) jX=1 k Three key properties can be derived from Eqs 20-22: −1 gl − ak Hk gk = for l = 1, 2 l k qll 1. For j >k, Cov(θk,θj ) > 0 if and only if V ar(Vk) > (27) k k V ar(θk). Thus, Cov(θk,θj ) > 0 if and only if the sum- where g1 = dL(α, β)/dαk and g2 = dL(α, β)/dβk from Eq mation of the ﬁrst k − 1 θ’s varies more than θk 24. 2. For j > 1, Cov(θ1,θj )= −V ar(θ1)E(θj )/E(1−θ1) < 0 since θ1 is at the root level of the tree

3. For j>k+1, Cov(θk,θj )= Cov(θk,θk+1)E(θj )/E(θk+1). Thus, Cov(θk,θj ) at deeper levels of the tree will have the same sign as Cov(θk,θk+1)