Efficient Correlated Topic Modeling with Topic Embedding

Junxian He*(1,3), Zhiting Hu*(1,2), Taylor Berg-Kirkpatrick(1), Ying Huang(3), Eric P. Xing(1,2)
(1) Carnegie Mellon University  (2) Petuum Inc.  (3) Shanghai Jiao Tong University
* Equal contribution
{junxianh, zhitingh, tberg, epxing}@cs.cmu.edu, [email protected]

ABSTRACT

Correlated topic modeling has been limited to small model and problem sizes due to its high computational cost and poor scaling. In this paper, we propose a new model which learns compact topic embeddings and captures topic correlations through the closeness between the topic vectors. Our method enables efficient inference in the low-dimensional embedding space, reducing previous cubic or quadratic time complexity to linear w.r.t the topic size. We further speed up variational inference with a fast sampler that exploits the sparsity of topic occurrence. Extensive experiments show that our approach is capable of handling model and data scales which are several orders of magnitude larger than existing correlation results, without sacrificing modeling quality: it provides competitive or superior performance in document classification and retrieval.

CCS CONCEPTS

• Computing methodologies → Latent variable models;

KEYWORDS

Correlated topic models; topic embedding; scalability

1 INTRODUCTION

Large ever-growing document collections provide great opportunities, and pose compelling challenges, to infer rich semantic structures underlying the data for data management and utilization. Topic models, particularly the Latent Dirichlet Allocation (LDA) model [6], have been one of the most popular statistical frameworks to identify latent semantics from text corpora. One drawback of LDA derives from the conjugate Dirichlet prior, as it models topic occurrence (almost) independently and fails to capture rich topical correlations (e.g., a document about virus is likely to also be about disease while unlikely to also be about finance). Effective modeling of the pervasive correlation patterns is essential for structural topic navigation, improved document representation, and accurate prediction [5, 9, 37]. The Correlated Topic Model (CTM) [5] extends LDA with a logistic-normal prior which explicitly models the correlation patterns with a Gaussian covariance matrix.

Despite the enhanced expressiveness and resulting richer representations, practical applications of correlated topic modeling have unfortunately been limited due to high model complexity and poor scaling on large data. For instance, in CTM, direct modeling of pairwise correlations and the non-conjugacy of the logistic-normal prior impose an inference complexity of O(K^3), where K is the number of latent topics, significantly more demanding than LDA which scales only linearly. While there has been recent work on improved modeling and inference [2, 9, 35, 36], the model scale has still been limited to fewer than 1,000 latent topics. This stands in stark contrast to recent industrial-scale LDA models which handle millions of topics on billions of documents [8, 42] for capturing long-tail semantics and supporting industrial applications [40]; yet such rich extraction tasks are expected to be better addressed with more expressive correlation models. It is therefore highly desirable to develop efficient correlated topic models with great representational power and highly scalable inference, for practical deployment.

In this paper, we develop a new model that extracts correlation structures of latent topics, sharing comparable expressiveness with the costly CTM model while remaining as efficient as the simple LDA. We propose to learn a distributed representation for each latent topic, and characterize the correlatedness of two topics through the closeness of the respective topic vectors in the embedding space. Compared to previous pairwise correlation modeling, our topic embedding scheme is parsimonious, with fewer parameters to estimate, yet flexible enough to enable richer analysis and visualization. Figure 1 illustrates the correlation patterns of 10K topics inferred by our model from two million NYTimes news articles, in which we can see clear dependency structures among the large collection of topics and grasp the semantics of the massive corpus.

We further derive an efficient variational inference procedure combined with a fast sparsity-aware sampler for stochastic tackling of non-conjugacies. Our embedding based correlation modeling enables inference in the low-dimensional vector space, resulting in linear complexity w.r.t topic size, as with the lightweight LDA. This allows us to discover 100s of 1000s of latent topics with their correlations on nearly 10 million articles, which is several orders of magnitude larger than prior work [5, 9].

Our work differs from recent research which combines topic models with word embeddings [3, 10, 22, 29] for capturing word dependencies, as we instead focus on modeling dependencies in the latent topic space, which exhibit uncertainty and are inferentially more challenging. To the best of our knowledge, this is the first work to incorporate distributed representation learning with topic correlation modeling, offering both intuitive geometric interpretation and theoretical Bayesian modeling advantages.

We demonstrate the efficacy of our method through extensive experiments on various large text corpora. Our approach shows greatly improved efficiency over previous correlated topic models, and scales as well as the much simpler LDA. This is achieved without sacrificing modeling power: the proposed model extracts high-quality topics and correlations, obtaining competitive or better performance than CTM in document classification and retrieval tasks.


Figure 1: Visualization of 10K correlated topics on the NYTimes news corpus. The point cloud shows the 10K topic embeddings, where each point represents a latent topic. Smaller distance indicates stronger correlation. We show four sets of topics which are nearby each other in the embedding space, respectively. Each topic is characterized by its top words according to the word distribution. An edge indicates correlation between topics with strength above some threshold.

The rest of the paper is organized as follows: section 2 briefly reviews related work; section 3 presents the proposed topic embedding model; section 4 shows extensive experimental results; and section 5 concludes the paper.

2 RELATED WORK

2.1 Correlated Topic Modeling

Topic models represent a document as a mixture of latent topics. Among the most popular topic models is the LDA model [6], which assumes a conjugate Dirichlet prior over topic mixing proportions for easier inference. Due to its simplicity and scalability, LDA has attracted broad interest for industrial applications [40, 42]. The Dirichlet prior is, however, incapable of capturing dependencies between topics. The classic CTM model provides an elegant extension of LDA by replacing the Dirichlet prior with a logistic-normal prior which models pairwise topic correlations with the Gaussian covariance matrix. However, the enriched extraction comes with computational cost. The number of parameters in the covariance matrix grows as the square of the number of topics, and parameter estimation for the full-rank matrix can be inaccurate in high-dimensional space. More importantly, frequent matrix inversion operations during inference lead to O(K^3) time complexity, which has significantly restricted the model and data scales. To address this, Chen et al. [9] derive a scalable Gibbs sampling algorithm based on data augmentation. Though bringing down the inference cost to O(K^2) per document, the computation is still too expensive to be practical in real-world massive tasks. Putthividhya et al. [36] reformulate the correlation prior with independent factor models for faster inference. However, similar to many other approaches, the problem scale has still been limited to thousands of documents and hundreds of topics. In contrast, we aim to scale correlated topic modeling to industrial-level deployment by reducing the complexity to the LDA level, which is linear in the topic size, while providing as rich extraction as the costly CTM model. We note that recent scalable extensions of LDA such as alias methods [28, 42] are orthogonal to our approach and can be applied in our inference for further speedup. We consider this as future work.

Another line of topic models organizes latent topics in a hierarchy which also captures topic dependencies. However, the hierarchy structure is either pre-defined [7, 19, 30] or inferred from data using Bayesian nonparametric methods [4, 11] which are known to be computationally demanding [12, 17]. Our proposed model is flexible without sacrificing scalability.

2.2 Distributed Representation Learning

There has been a growing interest in distributed representation learning, which induces compact vectors (a.k.a. embeddings) for words [27, 33], entities [18, 31], network nodes [14, 38], and others. The induced vectors are expected to capture semantic relatedness of the target items, and are successfully used in various applications. Compared to most work that induces embeddings for observed units, we learn distributed representations of latent topics, which poses unique challenges for inference. Some previous work [25, 26] also induces compact topic manifolds for visualizing large document collections. Our work is distinct in that we leverage the learned topic vectors for efficient correlation modeling and account for the uncertainty of correlations.

An emerging line of approaches [3, 10, 22, 29] incorporates word embeddings (either pre-trained or jointly inferred) with conventional topic models for capturing word dependencies and improving topic coherence. Our work differs since we are interested in the topic level, aiming at capturing topic dependencies with learned topic embeddings.
Symbol   Description
D, K, V  number of documents, latent topics, and vocabulary words
N_d      number of words in document d
M        embedding dimension of topics and documents
u_k      embedding vector of topic k
a_d      embedding vector of document d
η_d      (unnormalized) topic weight vector of document d
w_dn     the nth word in document d
z_dn     the topic assignment of word w_dn
φ_k      word distribution of topic k
K_s      number of non-zero entries of a document's topic proportion
V_s      number of non-zero entries of a topic's word distribution
Table 1: Notations used in this paper.

Figure 2: Graphical model representation. The left part schematically shows our correlation modeling mechanism, where nearby topics tend to have similar (either large or small) weights in a document.

3 TOPIC EMBEDDING MODEL

This section proposes our topic embedding model for correlated topic modeling. We first give an overview of our approach, and then present the model structure in detail. We then derive an efficient variational algorithm for inference.
3.1 Overview

We aim to develop an expressive topic model that discovers latent topics and their underlying correlation structures. Despite this added representational power, we want to keep the model parsimonious and efficient in order to scale to large text data. As discussed above (section 2), CTM captures correlations between topic pairs with a Gaussian covariance matrix, imposing O(K^2) parameter size and O(K^3) inference cost. In contrast, we adopt a new modeling scheme drawing inspiration from recent work on distributed representations, such as word embeddings [33], which learn low-dimensional word vectors and have been shown to be effective in encoding word semantic relatedness.

We induce continuous distributed representations for latent topics, and, as in word embeddings, expect topics with related semantics to be close to each other in the embedding space. The contiguity of the embedding space enables us to capture topical co-occurrence patterns conveniently: we further embed documents into the same vector space, and characterize a document's topic proportions with its distances to the topics. Smaller distance indicates larger topic weight. By the triangle inequality of the distance metric, intuitively, a document vector will have similar (either large or small) distances to the vectors of two semantically correlated topics which are themselves nearby each other in the space, and thus tends to assign similar probability mass to the two topics. Figure 2, left part, schematically illustrates the embedding based correlation modeling.

We thus avoid expensive modeling of a pairwise topic correlation matrix, and are enabled to perform inference in the low-dimensional embedding space, leading to a significant reduction in model and inference complexity. We further exploit the intrinsic sparsity of topic occurrence, and develop stochastic variational inference with fast sparsity-aware sampling to enable high scalability. We derive the inference algorithm in section 3.3.

In contrast to word representation learning, where word tokens are observed and embeddings can be induced directly from word collocation patterns, topics are hidden from the text, posing additional inferential challenges. We resort to a generative framework as in conventional topic models by associating a word distribution with each topic. We also take into account the uncertainty of topic correlations for flexibility. Thus, in addition to the intuitive geometric interpretation of our embedding based correlation scheme, the full Bayesian treatment also endows a connection to the classic CTM model, offering theoretical insights into our approach. We present the model structure in the next section. (Table 1 lists key notations; Figure 2 shows the graphical model representation of our model.)

3.2 Model Structure

We first establish the notations. Let W = {w_d}_{d=1}^D be a collection of documents. Each document d contains N_d words w_d = {w_dn}_{n=1}^{N_d} from a vocabulary of size V.

We assume K topics underlying the corpus. As discussed above, for each topic k we want to learn a compact distributed representation u_k ∈ R^M with low dimensionality (M ≪ K). Let U ∈ R^{K×M} denote the topic vector collection with the kth row U_{k·} = u_k^T. As a common choice in embedding methods, we use the vector inner product for measuring the closeness between embedding vectors. In addition to topic embeddings, we also induce document vectors in the same vector space. Let a_d ∈ R^M denote the embedding of document d. We can now conveniently compute the affinity of a document d to a topic k through u_k^T a_d. A topic k' nearby, and thus semantically correlated to, topic k will naturally have a similar affinity to the document, since |u_k^T a_d − u_{k'}^T a_d| ≤ ‖u_k − u_{k'}‖ ‖a_d‖ and ‖u_k − u_{k'}‖ is small.

We express uncertainty of the affinity by modeling the actual topic weights η_d ∈ R^K as a Gaussian variable centered at the affinity vector, following η_d ∼ N(U a_d, τ^{-1} I). Here τ characterizes the uncertainty degree and is pre-specified for simplicity. As in logistic-normal models, we project the topic weights into the probability simplex to obtain the topic distribution θ_d = softmax(η_d), from which we sample a topic z_dn ∈ {1, ..., K} for each word w_dn in the document. As in conventional topic models, each topic k is associated with a multinomial distribution φ_k over the word vocabulary, and each observed word is drawn from the respective word distribution indicated by its topic assignment.

Putting everything together, the generative process of the proposed model is summarized in Algorithm 1.

Algorithm 1 Generative Process
1. For each topic k = 1, 2, ..., K:
   • Draw the topic word distribution φ_k ∼ Dir(β)
   • Draw the topic embedding u_k ∼ N(0, α^{-1} I)
2. For each document d = 1, 2, ..., D:
   • Draw the document embedding a_d ∼ N(0, ρ^{-1} I)
   • Draw the document topic weights η_d ∼ N(U a_d, τ^{-1} I)
   • Derive the distribution over topics θ_d = softmax(η_d)
   • For each word n = 1, 2, ..., N_d:
     (a) Draw the topic assignment z_dn ∼ Multi(θ_d)
     (b) Draw the word w_dn ∼ Multi(φ_{z_dn})

A theoretically appealing property of our method is its intrinsic connection to conventional logistic-normal models such as the CTM model. If we marginalize out the document embedding variable a_d, we obtain η_d ∼ N(0, UU^T + τ^{-1} I), recovering the pairwise topic correlation matrix with a low-rank constraint, where each element is just the closeness of the respective topic embeddings, coherent with the above geometric intuitions. Such covariance decomposition has been used in other contexts, such as sparse Gaussian processes [39] for efficient approximation and Gaussian reparameterization [23, 41] for differentiation and reduced variance. Here we relate low-dimensional embedding learning with low-rank covariance decomposition and estimation.

The low-dimensional representations of latent topics enable parsimonious correlation modeling with parameter complexity of O(MK) (i.e., the topic embedding parameters), which is efficient in terms of the topic number K. Moreover, we are allowed to perform efficient inference in the embedding space, with inference cost linear in K, a huge advance over the cubic complexity of the vanilla CTM [5] and the quadratic complexity of its recent improved version [9]. We derive our inference algorithm in the next section.
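For concreteness, the following is a minimal NumPy sketch of the generative process in Algorithm 1. The toy sizes, the random seed, and the softmax helper are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes (illustrative only).
K, M, V, D, N_d = 8, 4, 50, 3, 20
alpha, rho, tau, beta = 0.1, 0.1, 1.0, 1.0 / K

rng = np.random.default_rng(0)
U = rng.normal(0.0, alpha ** -0.5, size=(K, M))      # topic embeddings u_k ~ N(0, alpha^-1 I)
phi = rng.dirichlet(np.full(V, beta), size=K)        # topic-word distributions phi_k ~ Dir(beta)

docs = []
for d in range(D):
    a_d = rng.normal(0.0, rho ** -0.5, size=M)       # document embedding a_d ~ N(0, rho^-1 I)
    eta_d = rng.normal(U @ a_d, tau ** -0.5)         # topic weights eta_d ~ N(U a_d, tau^-1 I)
    theta_d = softmax(eta_d)                         # topic proportions theta_d
    z_d = rng.choice(K, size=N_d, p=theta_d)         # topic assignments z_dn
    w_d = np.array([rng.choice(V, p=phi[z]) for z in z_d])  # words w_dn
    docs.append(w_d)
```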
3.3 Inference

Posterior inference and parameter estimation are not analytically tractable due to the coupling between latent variables and the non-conjugate logistic-normal prior. This makes learning difficult, especially in our context of scaling to unprecedentedly large data and model sizes. We develop a stochastic variational method that (1) involves only compact topic vectors which are cheap to infer, and (2) includes a fast sampling strategy which tackles non-conjugacy and exploits the intrinsic sparsity of both the document topic occurrence and the topical words.

We first assume a mean-field family of variational distributions:

q(u, φ, a, η, z) = ∏_k q(u_k) q(φ_k) ∏_d [ q(a_d) q(η_d) ∏_n q(z_dn) ],   (1)

where the factors have the parametric forms:

q(u_k) = N(u_k | µ_k, Σ_k^(u)),  q(a_d) = N(a_d | γ_d, Σ_d^(a)),
q(φ_k) = Dir(φ_k | λ_k),  q(η_d) = N(η_d | ξ_d, Σ_d^(η)),   (2)
q(z_dn) = Multi(z_dn | κ_dn).

Variational algorithms aim to minimize the KL divergence from q to the true posterior, which is equivalent to tightening the evidence lower bound (ELBO):

L(q) = Σ_k E_q[ log ( p(u_k) p(φ_k) / ( q(u_k) q(φ_k) ) ) ]
     + Σ_{d,n} E_q[ log ( p(a_d) p(η_d | a_d, U) p(z_dn | η_d) p(w_dn | z_dn, φ) / ( q(a_d) q(η_d) q(z_dn) ) ) ].   (3)

We optimize L(q) via coordinate ascent, interleaving the updates of the variational parameters at each iteration. We employ stochastic variational inference, which optimizes the parameters with stochastic gradients estimated on data minibatches. Due to space limitations, we only describe the key computation rules of the gradients (or closed-form solutions) here. These stochastically estimated quantities are then used to update the variational parameters after being scaled by a learning rate. Please refer to the supplementary material [1] for detailed derivations.

Updating topic and document embeddings. For each topic k, we isolate only the terms that contain q(u_k | µ_k, Σ_k^(u)):

L(q(u_k)) = E_q[log p(u_k)] + Σ_d E_q[log p(η_d | a_d, U)] − E_q[log q(u_k)].   (4)

The optimal solution for q(u_k) is then obtained by setting the gradient to zero, with the variational parameters computed as:

µ_k = τ Σ^(u) · ( Σ_d ξ_dk γ_d ),
Σ^(u) = [ αI + τ Σ_d ( Σ_d^(a) + γ_d γ_d^T ) ]^{-1},   (5)

where we have omitted the subscript k of the variational covariance matrix Σ^(u) as it is independent of k. Intuitively, the optimal variational topic embeddings are the centers of the variational document embeddings, scaled by the respective document topic weights and transformed by the variational covariance matrix.

By symmetry, the variational parameters of the document embedding a_d are similarly updated as:

γ_d = τ Σ^(a) · ( Σ_k ξ_dk µ_k ),
Σ^(a) = [ ρI + τ Σ_k ( Σ^(u) + µ_k µ_k^T ) ]^{-1},   (6)

where, again, Σ^(a) is independent of d and thus the subscript d is omitted.

Learning the low-dimensional topic and document embeddings is computationally cheap. Specifically, by Eq.(5), updating the set of variational topic vector means {µ_k}_{k=1}^K imposes complexity O(KM^2), and updating the covariance Σ^(u) requires only O(M^3). Similarly, by Eq.(6), the cost of optimizing γ_d and Σ^(a) is O(KM) and O(KM^2), respectively. Note that Σ^(a) is shared across all documents and does not need per-document updates. We see that all the updates cost only linearly w.r.t the topic size K, which is critical for scaling to large practical applications.
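The closed-form updates in Eqs.(5)-(6) amount to a few small matrix products and an M x M inversion. The sketch below illustrates them under the shared-covariance form derived above; the array shapes and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def update_topic_embeddings(xi, gamma, Sigma_a, alpha, tau):
    """Eq.(5): xi is (D, K) variational topic weight means, gamma is (D, M) document
    embedding means, Sigma_a is the shared (M, M) document covariance."""
    D, M = gamma.shape
    # Sigma_u = [alpha I + tau * sum_d (Sigma_a + gamma_d gamma_d^T)]^{-1}, shared over k
    Sigma_u = np.linalg.inv(alpha * np.eye(M) + tau * (D * Sigma_a + gamma.T @ gamma))
    # mu_k = tau * Sigma_u @ sum_d xi_dk * gamma_d, computed for all k at once -> (K, M)
    mu = tau * (xi.T @ gamma) @ Sigma_u.T
    return mu, Sigma_u

def update_document_embedding(xi_d, mu, Sigma_u, rho, tau):
    """Eq.(6): xi_d is the (K,) topic weight mean of one document, mu is (K, M)."""
    K, M = mu.shape
    Sigma_a = np.linalg.inv(rho * np.eye(M) + tau * (K * Sigma_u + mu.T @ mu))
    gamma_d = tau * Sigma_a @ (mu.T @ xi_d)
    return gamma_d, Sigma_a
```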
k d n ding ad is similarly updated as: Õ  where the factors have the parametric forms: γ = τ Σ(a) · ξ µ , d k dk k (u) (a) (6) q(uk ) = N(uk |µk , Σ ), q(ad ) = N(ad |γd , Σ ), h Õ  i−1 k d Σ(a) = γI + τ Σ(u) + µ µT , ( ) k k k ( ) = ( | ), ( ) = N( | , η ), (2) q ϕk Dir ϕk λk q ηd ηd ξd Σd where, again, Σ(a) is independent with d and thus the subscript d q(z ) = Multi(z |κ ) dn dn dn is omiŠed. Variational algorithms aim to minimize KL divergence from q to Learning low-dimensional topic and document embeddings is the true posterior, which is equivalent to tightening the evidence computationally cheap. Speci€cally, by Eq.(5), updating the set {µ }K of variational topic vector means k k=1 imposes complexity where the second term O(KM2), and updating the covariance Σ(u) requires only O(M3). Õ Eq [logp(zd |ηd )] = q(zdn = k)Eq [log(so‰maxk (ηd ))] (a) k,n Similarly, by Eq.(6), the cost of optimizingγd and Σ is O(KM) and O(KM2), respectively. Note that Σ(a) is shared across all documents involves variational expectations of the logistic transformation and does not need updates per document. We see that all the updates which does not have an analytic form. We construct a fast Monto cost only linearly w.r.t to the topic size K which is critical to scale Carlo estimator for approximation. Particularly, we employ repa- to large-scale practical applications. rameterization trick by €rst assuming a diagonal covariance matrix ( ) η = ( 2) Sparsity-aware topic sampling. We next consider the opti- Σd diag σd as is commonly used in previous work [5, 23], mization of the variational topic assignment q(zdn) for each word where σd denotes the vector of standard deviations, resulting in wdn. LeŠing wdn = v, the optimal solution is: the following sampling procedure: ( ) n Õ o η t = ξ + σ ϵ(t); ϵ(t) ∼ N(0, I), (9) q(z = k) ∝ exp {ξ } exp Ψ(λ ) − Ψ λ 0 , (7) d d d dn dk kv v0 kv where is the element-wise multiplication. With T samples of ηd , where Ψ(·) is the digamma function; and ξd and λk are the varia- we can estimate the variational lower bound and the derivatives tional means of the document’s topic weights and the variational ∇L w.r.t the variational parameters {ξd , σd }. For instance, word weights (Eq.(2)), respectively. Direct computation of q(z ) dn ∇ E [logp(z |η )] with Eq.(7) has complexity of O(K), which becomes prohibitive in ξd q d d the presence of many latent topics. To address this, we exploit two Õ ÕT  (t) ≈ q(zdn = k)ek − (Nd /T ) so max η (10) aspects of intrinsic sparsity in the modeling: (1) Œough a whole k,n t=1 d Õ ÕT  ( ) corpus can cover a large diverse set of topics, a single document ≈ ( = ) − ( / ) t 1 z˜dn k ek Nd T so max ηd in the corpus is usually about only a small number of them. We k,n t=1 where e is an indicator vector with the kth element being 1 and the thus only maintain the top Ks entries in each ξd , where Ks  K, k making the complexity due to the €rst term in the right-hand side rest 0. In practice T = 1 is usually sucient for e‚ective inference. of Eq.(7) only O(Ks ) for all K topics in total; (2) A topic is typically Œe second equation applies the hard topic sample mentioned above, characterized by only a few words in the large vocabulary, we thus which reduces the time complexity O(KNd ) of the original standard computation (the €rst equation) to O(N + K) (i.e., O(N ) for the cut o‚ the variational word weight vector λk for each k by main- d d O( ) taining only its top Vs entries (Vs  V ). Such sparse treatment €rst term and K for the second). 
helps enhance the interpretability of learned topics, and allows Œe €rst term in Eq.(8) depends on the topic and document em- cheap computation with on average O(KVs /V ) cost for the second beddings to encode topic correlations in document’s topic weights. term1. With the above sparsity-aware updates, the resulting com- Œe derivative w.r.t to the variational parameter ξd is computed as: plexity for Eq.(7) with K topics is brought down to O(Ks + KVs /V ), ∇ E [ ( | )] ( ˜ − ) ξd q logp ηd U , ad = τ Uγd ξd . (11) a great speedup over the original O(K) cost. Œe top Ks entries of U˜ ξd are selected using a Min-heap data structure, whose computa- Here is the collection of variational means of topic embeddings ˜ = T tional cost is amortized across all words in the document, imposing where the kth row Uk · µk . We see that, with low-dimensional O(K/Nd log Ks ) computation per word. Œe cost for €nding the topic and document vector representations, inferring topic corre- O( ) top Vs entries of λk is similarly amortized across documents and lations is of low cost KM which grows only linearly w.r.t to words, and becomes insigni€cant. the topic size. Œe complexity of the remaining terms in Eq.(8), Updating the remaining variational parameters will frequently as well as respective derivatives w.r.t the variational parameters, O( ) involve computation of variational expectations under q(zdn). It has complexity of KM (Please see the supplements [1] for more is thus crucial to speedup this operation. To this end, we employ details). In summary, the cost of updating q(ηd ) for each document O( ) sparse approximation by sampling from q(zdn) a single indicator d is KM + K + Nd . z˜dn, and use the “hard” sparse distribution q˜(zdn = k) := 1(z˜dn = Finally, the optimal solution of the variational topic word distri- k) to estimate the expectations. Note that the sampling operation is bution q(ϕk |λk ) is given by: cheap, having the same complexity with computing q(z ) as above. Õ dn λ = β + 1(w = v)1(z˜ = k). (12) As shown shortly, such sparse computation will signi€cantly reduce kv d,n dn dn our running cost. Œough stochastic expectation approximation is Algorithm summarization. We summarize our variational commonly used for tackling intractability [24, 34], here we instead inference in Algorithm 2. As analyzed above, the time complex- apply the technique for fast estimation of tractable expectations. ity of our variational method is O(KM2 + M3) for inferring topic (η) ( | , ) embeddings q(ud ). Œe cost per document is O(KM) for comput- We next optimize the variational topic weights q ηd ξd Σd . ing q(a ), O(KM) for updating q(η ), and O((K + KV /V )N ) Extracting only the terms in L(q) involving q(ηd ), we get: d d s s d for maintaining q(zd ). Œe overall complexity for each document L(q(ηd )) = Eq [logp(ηd |ad ,U )] + Eq [logp(zd |ηd )] is thus O(KM + (K + KV /V )N ), which is linear to model size (8) s s d − Eq [logq(ηd )] , (K), comparable to the LDA model while greatly improving over previous correlation methods with cubic or quadratic complexity.
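The sketch below illustrates one way to organize the sparse scoring of Eq.(7), keeping only the top-K_s entries of ξ_d and the top-V_s entries of each λ_k in hash maps and drawing a hard sample z̃_dn. The exact bookkeeping in the paper's implementation (min-heap selection, thresholding, amortized updates, and the handling of pruned entries) is more involved, so this is only an illustrative approximation.

```python
import numpy as np
from scipy.special import digamma

def sample_topic(v, xi_top, lam_rows, lam_rowsum, rng):
    """Draw a hard sample z~_dn for word id v from a sparse approximation of Eq.(7).
    xi_top: dict {k: xi_dk} with the document's top-K_s topic weights.
    lam_rows: dict {k: {v: lambda_kv}} keeping only the top-V_s entries per topic.
    lam_rowsum: dict {k: sum_v lambda_kv} precomputed row sums."""
    topics, scores = [], []
    for k, xi_dk in xi_top.items():
        lam_kv = lam_rows[k].get(v)
        if lam_kv is None:            # word v was pruned from topic k's sparse row
            continue
        topics.append(k)
        scores.append(xi_dk + digamma(lam_kv) - digamma(lam_rowsum[k]))
    if not topics:                    # crude fallback: score by topic weight only
        topics = list(xi_top)
        scores = list(xi_top.values())
    p = np.exp(np.array(scores) - max(scores))
    p /= p.sum()
    return rng.choice(topics, p=p)
```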

4 EXPERIMENTS

We demonstrate the efficacy of our approach with extensive experiments. (1) We evaluate the extraction quality in the tasks of document classification and retrieval, in which our model achieves similar or better performance than existing correlated topic models, significantly improving over simple LDA. (2) For scalability, our approach scales comparably with LDA, and handles massive problem sizes orders of magnitude larger than previously reported correlation results. (3) Qualitatively, our model reveals very meaningful topic correlation structures.

4.1 Setup

Datasets. We use three public corpora provided in the UCI repository (http://archive.ics.uci.edu/ml) for the evaluation: 20Newsgroups is a collection of news documents partitioned (nearly) evenly across 20 different newsgroups. Each article is associated with a category label, serving as ground truth in the tasks of document classification and retrieval. NYTimes is a widely-used large corpus of New York Times news articles. PubMed is a large set of PubMed abstracts. The detailed statistics of the datasets are listed in Table 2. We removed a standard list of 174 stop words. For NYTimes and PubMed, we kept the top 10K frequent words in the vocabulary, and selected 10% of the documents uniformly at random as test sets, respectively. For 20Newsgroups, we followed the standard training/test splitting, and performed the widely-used pre-processing (http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) of removing indicative meta text such as headers and footers, so that document classification is forced to be based on the semantics of the plain text.

Dataset        #doc (D)   vocab size (V)   doc length
20Newsgroups   18K        30K              130
NYTimes        1.8M       10K              284
PubMed         8.2M       10K              77
Table 2: Statistics of the three datasets, including the number of documents (D), vocabulary size (V), and average number of words in each document.

Baselines. We compare the proposed model with a set of carefully selected competitors:
• Latent Dirichlet Allocation (LDA) [6] uses conjugate Dirichlet priors and thus scales linearly w.r.t the topic size, but fails to capture topic correlations. Inference is based on the stochastic variational algorithm [16]. When evaluating scalability, we leverage the same sparsity assumptions as in our model for speedup.
• Correlated Topic Model (CTM) [5] employs the standard logistic-normal prior which captures pairwise topic correlations. The model uses stochastic variational inference with O(K^3) time complexity.
• Scalable CTM (S-CTM) [9] develops a scalable sparse Gibbs sampler for CTM inference with O(K^2) time complexity. Using distributed inference on 40 machines, the method discovers 1K topics from millions of documents, which to our knowledge is the largest automatically learned topic correlation structure so far.

Parameter Setting. Throughout the experiments, we set the embedding dimension to M = 50, and the sparseness parameters to K_s = 50 and V_s = 100. We found our modeling quality is robust to these parameters. Following common practice, the hyper-parameters are fixed to β = 1/K, α = 0.1, ρ = 0.1, and τ = 1. The baselines use similar hyper-parameter settings.

All experiments were performed on Linux with 24 4.0GHz CPU cores and 128GB RAM. All models are implemented using C/C++, and parallelized whenever possible using the OpenMP library.

4.2 Document Classification

We first evaluate the performance of document classification based on the learned document representations. We evaluate on the 20Newsgroups dataset, where ground truth class labels are available. We compare our proposed model with LDA and CTM. For LDA and CTM, a multi-class SVM classifier is trained on the topic distributions of the training documents, while for the proposed model, the SVM classifier takes the document embedding vectors as input. Generally, more accurate modeling of topic correlations enables better document modeling and representations, resulting in improved document classification accuracy.

Figure 3: Classification accuracy on 20Newsgroups.

Figure 3 shows the classification accuracy as the number of topics varies. We see that the proposed model performs best in most cases, indicating that our method can discover high-quality latent topics and correlations. Both CTM and our model significantly outperform LDA, which treats latent topics independently, validating the importance of topic correlation for accurate text semantic modeling. Compared to CTM, our method achieves better or competitive accuracy as K varies, which indicates that our model, though orders of magnitude faster (as shown next), does not sacrifice modeling power compared to the complicated and computationally demanding CTM model.
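As a rough illustration of the classification protocol described above (a multi-class SVM fit on the learned document representations), the sketch below uses scikit-learn's LinearSVC. The regularization constant and any other classifier settings are assumptions, as the paper does not report them.

```python
from sklearn.svm import LinearSVC

def classify_with_embeddings(gamma_train, y_train, gamma_test):
    """gamma_train / gamma_test hold the inferred document representations, one row per
    document: the M-dimensional embedding means for our model, or the K-dimensional
    topic distributions for the LDA/CTM baselines."""
    clf = LinearSVC(C=1.0)        # multi-class linear SVM (one-vs-rest)
    clf.fit(gamma_train, y_train)
    return clf.predict(gamma_test)
```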


Figure 4: Precision-Recall curves on 20Newsgroups. Left: #topics K = 20. Middle: K = 60. Right: K = 100.

4.3 Document Retrieval

We further evaluate the topic modeling quality by measuring the performance of document retrieval [15]. We use the 20Newsgroups dataset. A retrieved document is relevant to the query document when they have the same class label. For LDA and CTM, document similarity is measured as the inner product of topic distributions, and for our model we use the inner product of document embedding vectors.

Figure 4 shows the retrieval results with varying numbers of topics, where we use the test set as query documents to retrieve similar documents from the training set, and the results are averaged over all possible queries. We observe similar patterns as in the document classification task. Our model obtains competitive performance with CTM, both of which capture topic correlations and greatly improve over LDA. This again validates our goal that the proposed method has lower modeling complexity while at the same time being as accurate and powerful as previous complicated correlation models. In addition to efficient model inference and learning, our approach based on compact document embedding vectors also enables faster document retrieval compared to conventional topic models which are based on topic distribution vectors (i.e., M ≪ K).

4.4 Scalability

We now investigate the efficiency and scalability of the proposed model. Having shown that our model achieves a similar or better level of extraction quality than the conventional, computationally demanding correlated topic model, here we want our approach to tackle large problem sizes which are impossible for existing correlation methods, and to scale as efficiently as the lightweight LDA, for practical deployment.

Dataset        K      LDA      CTM      S-CTM   Ours
20Newsgroups   100    11 min   60 min   22 min  20 min
NYTimes        100    2.5 hr   –        6.4 hr  3.5 hr
NYTimes        1K     5.6 hr   –        –       5.7 hr
NYTimes        10K    8.4 hr   –        –       9.2 hr
PubMed         100K   16.7 hr  –        –       19.9 hr
Table 3: Total training time on various datasets with different numbers of topics K. Entries marked with "–" indicate that model training is too slow to finish in 2 days.

Table 3 compares the total running time of model training with different sized datasets and models. As is common practice [16], we determine convergence of training when the difference between the test-set per-word log-likelihoods of two consecutive iterations is smaller than some threshold. On a small dataset like 20Newsgroups (thousands of documents) and a small model (hundreds of topics), all approaches finish training in a reasonable time. However, with an increasing number of documents and latent topics, the vanilla CTM model (with O(K^3) inference complexity) and its scalable version S-CTM (with O(K^2) inference complexity) quickly become impractical, limiting their deployment in real-world scale tasks. Our proposed topic embedding method, by contrast, scales linearly with the topic size, and is capable of handling 100K topics on over 8M documents (PubMed), a problem size several orders of magnitude larger than the previously reported largest results [9] (1K topics on millions of documents). Notably, even with added model power and increased extraction performance compared to LDA (as shown in sections 4.2-4.3), our model only imposes negligible additional training time, showing strong potential for practical deployment in real-world large-scale applications, as LDA does.

Figure 5, left panel, shows the convergence curves on NYTimes as training proceeds. Using similar time, our model converges to a better point (higher test likelihood) than LDA does, while S-CTM is much slower, failing to reach convergence within the time frame.

Figure 5: Left: Convergence on NYTimes with 1K topics. Middle: Total training time on 20Newsgroups. Right: Runtime of one inference iteration on a minibatch of 500 NYTimes articles, where the result points of CTM and S-CTM at large K are omitted as they fail to finish one iteration within 2 hours.


Figure 6: A portion of the topic correlation graph learned from 20Newsgroups. Each node denotes a latent topic whose semantic meaning is characterized by the top words according to the topic's word distribution. The font size of each word is proportional to the word's weight. Topics with correlation strength above some threshold are connected with edges. The thickness of the edges is proportional to the correlation strength.

Figure 5, middle panel, measures the total training time with varying numbers of topics. We use the small 20Newsgroups dataset since on larger data (e.g., NYTimes and PubMed) the CTM and S-CTM models are usually too slow to converge in a reasonable time. We see that the training time of CTM increases quickly as more topics are used. S-CTM works well at this small data and model scale, but, as shown above, it is incapable of tackling larger problems. In contrast, our approach scales as efficiently as the simpler LDA model. Figure 5, right panel, evaluates the runtime of one inference iteration on a minibatch of 500 documents. When the topic size grows large, CTM and S-CTM fail to finish one iteration in 2 hours. Our model, by contrast, remains as scalable as LDA and considerably speeds up over CTM and S-CTM.

4.5 Visualization and Analysis

We qualitatively evaluate our approach by visualizing and exploring the extracted latent topics and correlation patterns.

Figure 6 visualizes the topic correlation graph inferred from the 20Newsgroups dataset. We can see that many topics are strongly correlated to each other and exhibit clear correlation structure. For instance, the set of topics in the upper right region are mainly about astronomy and are closely interrelated, while their connections to the information security topics shown in the lower part are weak. Figure 7 shows 100K topic embeddings and their correlations on the PubMed dataset. Related topics are close to each other in the embedding space, revealing diverse substructures of themes in the collection. Our model discovers very meaningful structures, providing insights into the semantics underlying the large text corpora and facilitating understanding of the large collection of topics.
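The point-cloud views in Figures 1 and 7 plot the learned topic embeddings in two dimensions. One simple way to produce such a view is to project the K x M matrix of variational topic means, e.g., with PCA, as sketched below; the paper does not specify which projection produced its visualizations, so this is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_topics_2d(mu):
    """Project the (K, M) matrix of learned topic embedding means to 2-D coordinates
    suitable for plotting a point cloud of topics."""
    return PCA(n_components=2).fit_transform(np.asarray(mu))
```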

Figure 7: Visualization of 100K correlated topics on PubMed. See the caption of Figure 1 for more depictions.

5 CONCLUSIONS

We have developed a new correlated topic model which induces distributed vector representations of latent topics, and characterizes correlations with the closeness of the topic vectors in the embedding space. Such a modeling scheme, along with the sparsity-aware sampling in inference, enables highly efficient model training with linear time complexity in terms of the model size. Our approach scales to unprecedentedly large data and models, while achieving strong performance in document classification and retrieval. The proposed correlation method is generally applicable to other contexts, such as modeling word dependencies for improved topical coherence. It is interesting to further speed up model inference through variational neural Bayes techniques [13, 23] for amortized variational updates across data examples. Note that our model is particularly suitable for incorporating neural inference networks that, replacing the per-document variational embedding distributions, map documents into compact document embeddings directly. We are also interested in combining generative topic models with advanced deep text generative approaches [20, 21, 32] for improved text modeling.

ACKNOWLEDGMENTS

This research is supported by NSF IIS1447676, ONR N000141410684, and ONR N000141712463.
REFERENCES
[1] 2017. Supplementary material. (2017). www.cs.cmu.edu/~zhitingh/kddsupp
[2] Amr Ahmed and Eric Xing. 2007. On tight approximate inference of the logistic-normal topic admixture model. In AISTATS.
[3] Kayhan Batmanghelich, Ardavan Saeedi, Karthik Narasimhan, and Sam Gershman. 2016. Nonparametric Spherical Topic Modeling with Word Embeddings. In ACL.
[4] David M Blei, Thomas L Griffiths, and Michael I Jordan. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57, 2 (2010), 7.
[5] David M Blei and John D Lafferty. 2007. A correlated topic model of science. The Annals of Applied Statistics (2007), 17–35.
[6] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. JMLR 3, Jan (2003), 993–1022.
[7] Jordan L Boyd-Graber, David M Blei, and Xiaojin Zhu. 2007. A Topic Model for Word Sense Disambiguation. In EMNLP-CoNLL. 1024–1033.
[8] Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. 2016. WarpLDA: a Simple and Efficient O(1) Algorithm for Latent Dirichlet Allocation. In VLDB.
[9] Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng, and Bo Zhang. 2013. Scalable inference for logistic-normal topic models. In NIPS. 2445–2453.
[10] Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for topic models with word embeddings. In ACL.
[11] Kumar Dubey, Qirong Ho, Sinead A Williamson, and Eric P Xing. 2014. Dependent nonparametric trees for dynamic hierarchical clustering. In NIPS. 1152–1160.
[12] Yarin Gal and Zoubin Ghahramani. 2014. Pitfalls in the use of Parallel Inference for the Dirichlet Process. In ICML. 208–216.
[13] Prasoon Goyal, Zhiting Hu, Xiaodan Liang, Chenyu Wang, and Eric Xing. 2017. Nonparametric Variational Auto-encoders for Hierarchical Representation Learning. arXiv preprint arXiv:1703.07027 (2017).
[14] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD. ACM, 855–864.
[15] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In NIPS. 1607–1614.
[16] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. 2013. Stochastic variational inference. JMLR 14, 1 (2013), 1303–1347.
[17] Zhiting Hu, Qirong Ho, Avinava Dubey, and Eric P Xing. 2015. Large-scale Distributed Dependent Nonparametric Trees. In ICML. 1651–1659.
[18] Zhiting Hu, Poyao Huang, Yuntian Deng, Yingkai Gao, and Eric P Xing. 2015. Entity Hierarchy Embedding. In ACL. 1292–1300.
[19] Zhiting Hu, Gang Luo, Mrinmaya Sachan, Eric Xing, and Zaiqing Nie. 2016. Grounding topic models with knowledge bases. In IJCAI.
[20] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Controllable Text Generation. ICML (2017).
[21] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P Xing. 2017. On Unifying Deep Generative Models. arXiv preprint arXiv:1706.00550 (2017).
[22] Di Jiang, Rongzhong Lian, Lei Shi, and Hua Wu. 2016. Latent Topic Embedding. In COLING.
[23] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
[24] Miguel Lazaro-Gredilla. 2014. Doubly stochastic variational Bayes for non-conjugate inference. In ICML.
[25] Tuan Le and Hady W Lauw. 2014. Semantic visualization for spherical representation. In KDD. ACM, 1007–1016.
[26] Tuan Minh Van Le and Hady W Lauw. 2014. Manifold learning for jointly modeling topic and visualization. (2014).
[27] Tao Lei, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In ACL.
[28] Aaron Q Li, Amr Ahmed, Sujith Ravi, and Alexander J Smola. 2014. Reducing the sampling complexity of topic models. In KDD. ACM, 891–900.
[29] Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chunyan Miao. 2016. Generative topic embedding: a continuous representation of documents. In ACL.
[30] Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML. ACM, 577–584.
[31] Yuezhang Li, Ronghuo Zheng, Tian Tian, Zhiting Hu, Rahul Iyer, and Katia Sycara. 2016. Joint Embedding of Hierarchical Categories and Entities for Concept Categorization and Dataless Classification. In COLING.
[32] Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P Xing. 2017. Recurrent Topic-Transition GAN for Visual Paragraph Generation. arXiv preprint arXiv:1703.07022 (2017).
[33] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In NIPS.
[34] David Mimno, Matt Hoffman, and David Blei. 2012. Sparse stochastic inference for latent Dirichlet allocation. arXiv preprint arXiv:1206.6425 (2012).
[35] John Paisley, Chong Wang, David M Blei, et al. 2012. The discrete infinite logistic normal distribution. Bayesian Analysis 7, 4 (2012), 997–1034.
[36] Duangmanee Pew Putthividhya, Hagai T Attias, and Srikantan Nagarajan. 2009. Independent factor topic models. In ICML. ACM, 833–840.
[37] Rajesh Ranganath and David M Blei. 2016. Correlated random measures. JASA (2016).
[38] Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive text embedding through large-scale heterogeneous text networks. In KDD. ACM, 1165–1174.
[39] Michalis K Titsias. 2009. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In AISTATS, Vol. 5. 567–574.
[40] Yi Wang, Xuemin Zhao, Zhenlong Sun, Hao Yan, Lifeng Wang, Zhihui Jin, Liubin Wang, Yang Gao, Ching Law, and Jia Zeng. 2015. Peacock: Learning long-tail topic features for industrial applications. TIST 6, 4 (2015), 47.
[41] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. 2016. Stochastic Variational Deep Kernel Learning. In NIPS. 2586–2594.
[42] Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric Po Xing, Tie-Yan Liu, and Wei-Ying Ma. 2015. LightLDA: Big topic models on modest computer clusters. In WWW. ACM, 1351–1361.

A INFERENCE

A.1 Stochastic Mean-Field Variational Inference

We first assume a mean-field family of variational distributions:

q(u, φ, a, η, z) = ∏_k q(u_k) q(φ_k) ∏_d [ q(a_d) q(η_d) ∏_n q(z_dn) ],   (A.13)

where the factors have the parametric forms:

q(u_k) = N(u_k | µ_k, Σ_k^(u)),  q(a_d) = N(a_d | γ_d, Σ_d^(a)),
q(φ_k) = Dir(φ_k | λ_k),  q(η_d) = N(η_d | ξ_d, Σ_d^(η)),   (A.14)
q(z_dn) = Multi(z_dn | κ_dn).

Variational algorithms aim to minimize the KL divergence from q to the true posterior, which is equivalent to tightening the evidence lower bound (ELBO):

L(q) = E_q[log p(u, a, η, z, w, φ | α, β, ρ, τ)] − E_q[log q(u, a, η, z, φ)]
     = E_q[log p(u|α)] + E_q[log p(φ|β)] + E_q[log p(a|ρ)] + E_q[log p(η|u, a, τ)]
       + E_q[log p(z|η)] + E_q[log p(w|φ, z)] − E_q[log q(u, a, η, z, φ)].   (A.15)

A.2 Optimize q(z)

q(z_dn = k) ∝ exp{ E_{−z_dn}[log p(z_dn = k | η_d)] + E_{−z_dn}[log p(w_dn | φ_k, z_dn = k)] }
           ∝ exp{ E_{−z_dn}[log(softmax_k(η_d))] + E_{−z_dn}[ Σ_v 1(w_dn = v) log φ_kv ] }   (A.16)
           ∝ exp{ ξ_dk + Σ_v 1(w_dn = v) ( Ψ(λ_kv) − Ψ( Σ_{v'=1}^V λ_{kv'} ) ) }.

Sparsity-aware topic sampling. Direct computation of q(z_dn) with Eq.(A.16) has complexity O(K), which becomes prohibitive in the presence of many latent topics. To address this, we exploit two aspects of intrinsic sparsity in the modeling: (1) Though a whole corpus can cover a large, diverse set of topics, a single document in the corpus is usually about only a small number of them. We thus only maintain the top K_s entries in each ξ_d, where K_s ≪ K, making the complexity due to the first term on the right-hand side of Eq.(A.16) only O(K_s) for all K topics in total. (2) A topic is typically characterized by only a few words in the large vocabulary; we thus cut off the variational word weight vector λ_k for each k by maintaining only its top V_s entries (V_s ≪ V). Such sparse treatment helps enhance the interpretability of the learned topics, and allows cheap computation with on average O(KV_s/V) cost for the second term.⁴ With the above sparsity-aware updates, the resulting complexity for Eq.(A.16) with K topics is brought down to O(K_s + KV_s/V), a great speedup over the original O(K) cost. The top K_s entries of ξ_d are selected using a min-heap data structure, whose computational cost is amortized across all words in the document, imposing O(K/N_d log K_s) computation per word. The cost for finding the top V_s entries of λ_k is similarly amortized across documents and words, and becomes insignificant.

Besides, updating the remaining variational parameters will frequently involve the computation of variational expectations under q(z_dn). It is thus crucial to speed up this operation. To this end, we employ a sparse approximation by sampling a single indicator z̃_dn from q(z_dn), and use the "hard" sparse distribution q̃(z_dn = k) := 1(z̃_dn = k) to estimate the expectations. Note that the sampling operation is cheap, having the same complexity as computing q(z_dn) above. As shown shortly, such sparse computation significantly reduces our running cost.

A.3 Optimize q(φ)

For each topic k, we isolate only the terms that contain q(φ_k):

q(φ_k) ∝ exp{ E_{−φ_k}[ log ∏_v φ_kv^{β−1} ] + E_{−φ_k}[ log ∏_{d,n,v} φ_kv^{1(w_dn=v)·1(z̃_dn=k)} ] }
       ∝ ∏_v φ_kv^{ β − 1 + Σ_{d,n} 1(w_dn=v)·1(z̃_dn=k) }.   (A.17)

Therefore,

q(φ_k) ∼ Dir(λ_k),   (A.18)

λ_kv = β + Σ_{d,n} 1(w_dn = v)·1(z̃_dn = k).   (A.19)

The cost for updating q(φ) is globally amortized across documents and words, and is thus insignificant compared with the other local parameter updates.
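The sparse update of Eqs.(A.18)-(A.19) reduces to accumulating topic-word counts from the hard samples z̃_dn and adding the prior β. A minimal sketch follows, storing only non-zero counts (entries absent from the map implicitly equal β); the data layout is an illustrative assumption.

```python
from collections import defaultdict

def update_lambda(docs, hard_assignments, beta):
    """Eq.(A.19): lambda_kv = beta + sum_{d,n} 1(w_dn = v) 1(z~_dn = k).
    docs[d][n] is the word id w_dn; hard_assignments[d][n] is the sampled topic z~_dn."""
    counts = defaultdict(lambda: defaultdict(float))
    for w_d, z_d in zip(docs, hard_assignments):
        for v, k in zip(w_d, z_d):
            counts[k][v] += 1.0
    return {k: {v: beta + c for v, c in row.items()} for k, row in counts.items()}
```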

⁴ In practice we also set a threshold s such that each word v needs to have at least s non-zero entries in {λ_k}_{k=1}^K. Thus the exact complexity of the second term is O(max{KV_s/V, s}).

A.4 Optimize q(u) and q(a)

q(u_k) ∝ exp{ E_{−u_k}[log p(u_k | α)] + Σ_d E_{−u_k}[log p(η_d | a_d, u, τ)] },   (A.20)

E_{−u_k}[log p(u_k | α)] = E_{−u_k}[ log( (2π)^{−M/2} α^{M/2} exp( −(α/2) u_k^T u_k ) ) ]
                        ∝ −(α/2) u_k^T u_k,   (A.21)

Σ_d E_{−u_k}[log p(η_d | a_d, u, τ)] = Σ_d E_{−u_k}[ log( (2π)^{−M/2} τ^{M/2} exp( −(τ/2) (η_d − U a_d)^T (η_d − U a_d) ) ) ]
                                     = −(τ/2) u_k^T [ Σ_d ( Σ_d^(a) + γ_d γ_d^T ) ] u_k + τ Σ_d ξ_dk γ_d^T u_k + C.   (A.22)

Therefore,

q(u_k) ∝ exp{ −(1/2) u_k^T [ αI + Σ_d ( τ Σ_d^(a) + τ γ_d γ_d^T ) ] u_k + τ Σ_d ξ_dk γ_d^T u_k },   (A.23)

where Σ_d^(a) is the covariance matrix of a_d. From Eq.(A.23), we know q(u_k) ∼ N(µ_k, Σ_k^(u)) with

Σ_k^(u) = [ αI + Σ_d ( τ Σ_d^(a) + τ γ_d γ_d^T ) ]^{-1}.   (A.24)

Notice that Σ_k^(u) is unrelated to k, which means all topic embeddings share the same covariance matrix; we denote it as Σ^(u). Then

µ_k = τ Σ^(u) · ( Σ_d ξ_dk γ_d ).   (A.25)

Analogously,

γ_d = τ Σ^(a) · ( Σ_k ξ_dk µ_k ),   (A.26)

Σ^(a) = [ ρI + τ K Σ^(u) + τ Σ_k µ_k µ_k^T ]^{-1}.   (A.27)

Since Σ^(a) is unrelated to d, we can rewrite Eq.(A.24) as

Σ^(u) = [ αI + τ D Σ^(a) + τ Σ_d γ_d γ_d^T ]^{-1}.   (A.28)

The cost for optimizing all γ_d is O(KDM). Updating the set of variational topic vector means {µ_k}_{k=1}^K and Σ^(a) both impose complexity O(KM^2), and the update of Σ^(u) costs O(M^3). Since µ, Σ^(a), and Σ^(u) are all global parameters, we update them in a distributed manner.

A.5 Optimize q(η)

Assume q(η_d) is Gaussian with a diagonal covariance matrix, i.e., η_dk ∼ N(ξ_dk, σ_dk^2), so that Σ_d^(η) = diag(σ_d^2). We can isolate the terms in the ELBO involving η_d:

L(η_d) = E_q[log p(η_d | U, a_d)] + E_q[log p(z_d | η_d)] − E_q[log q(η_d)],   (A.29)

where

E_q[log p(η_d | U, a_d)] = −(τ/2) Σ_k ( ξ_dk^2 + σ_dk^2 ) + τ ξ_d^T Ũ γ_d + C,   (A.30)

E_q[log p(z_d | η_d)] = Σ_{k,n} 1(z̃_dn = k) E_q[log(softmax_k(η_d))],   (A.31)

E_q[log q(η_d)] = −Σ_k log σ_dk + C.   (A.32)

For Eq.(A.31), the expectation is intractable due to the normalization term in the softmax. As a result, we use the reparameterization trick and a Monte Carlo estimator to approximate the expectation:

η_d^(t) = ξ_d + σ_d ∘ ε^(t);  ε^(t) ∼ N(0, I),   (A.33)

where ∘ is the element-wise multiplication. With T samples of η_d, we can estimate the variational lower bound and the derivatives ∇L w.r.t. the variational parameters {ξ_d, σ_d}.

∇_{ξ_d} E_q[log p(η_d | U, a_d)] = τ( Ũ γ_d − ξ_d ).   (A.34)

Here Ũ is the collection of variational means of the topic embeddings, where the kth row Ũ_{k·} = µ_k^T.

∇_{ξ_d} E_q[log p(z_d | η_d)] = E_q[ ∇_{ξ_d} log p(z_d | η_d) ]
  ≈ (1/T) Σ_{t=1}^T Σ_{k,n} 1(z̃_dn = k) [ e_k − softmax( ξ_d + σ_d ∘ ε^(t) ) ]
  ≈ Σ_{k,n} 1(z̃_dn = k) e_k − (N_d/T) Σ_{t=1}^T softmax( η_d^(t) ),   (A.35)

where e_k is a one-hot vector which evaluates to 1 in its kth entry, and T is the number of samples.

∇_{σ_d} E_q[log p(η_d | U, a_d)] = −τ σ_d,   (A.36)

∇_{σ_d} E_q[log p(z_d | η_d)] = E_q[ ∇_{σ_d} log p(z_d | η_d) ] = 0,   (A.37)

∇_{σ_d} E_q[log q(η_d)] = −1/σ_d,   (A.38)

where 1/σ_d is computed element-wise. Therefore,

∇_{ξ_d} L = τ( Ũ γ_d − ξ_d ) + Σ_{k,n} 1(z̃_dn = k) e_k − (N_d/T) Σ_{t=1}^T softmax( η_d^(t) ),   (A.39)

∇_{σ_d} L = −τ σ_d + 1/σ_d.   (A.40)

Setting Eq.(A.40) to zero gives σ_dk = τ^{-1/2}, so there is no update for σ in our algorithm. In the experiments, we set T = 1 and use Adagrad to update ξ_d. From Eq.(A.39), the time complexity for updating the variational mean topic weight vector ξ_d is O(KM + N_d + K).
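A minimal sketch of one update of ξ_d using the T = 1 estimator of Eq.(A.39) with Adagrad, as described above. The learning rate, the ε constant, and the function signature are illustrative assumptions; the paper does not report the exact step size.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adagrad_step_xi(xi_d, hist_d, U_tilde, gamma_d, z_counts, N_d, tau,
                    lr=0.1, eps=1e-8, rng=None):
    """One Adagrad step on xi_d. z_counts is the length-K vector of hard-sample counts
    sum_n 1(z~_dn = k); hist_d is the running sum of squared gradients for this document."""
    rng = rng if rng is not None else np.random.default_rng()
    sigma_d = tau ** -0.5                                   # sigma is fixed, per the derivation
    eta_sample = xi_d + sigma_d * rng.standard_normal(xi_d.shape)   # Eq.(A.33), T = 1
    grad = tau * (U_tilde @ gamma_d - xi_d) + z_counts - N_d * softmax(eta_sample)  # Eq.(A.39)
    hist_d += grad ** 2
    return xi_d + lr * grad / (np.sqrt(hist_d) + eps), hist_d
```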