Dirichlet Enhanced Latent Semantic Analysis

Kai Yu (Siemens Corporate Technology, D-81730 Munich, Germany, [email protected]), Shipeng Yu (Institute for Computer Science, University of Munich, D-80538 Munich, Germany, [email protected]), Volker Tresp (Siemens Corporate Technology, D-81730 Munich, Germany, [email protected])

Abstract

This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. the topics of documents), and further uncovers the clustering structure of the records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors but also the clustering structure.

1 Introduction

We consider the problem of modelling a large corpus of high-dimensional discrete records. Our assumption is that a record can be modelled by latent factors which account for the co-occurrence of items in a record. To ground the discussion, in the following we identify records with documents, latent factors with (latent) topics, and items with words. Probabilistic latent semantic indexing (PLSI) [7] was one of the first approaches that provided a probabilistic treatment of text documents as being composed of latent topics. Latent Dirichlet allocation (LDA) [3] generalizes PLSI by treating the topic mixture parameters (i.e. a multinomial over topics) as variables drawn from a Dirichlet distribution. Its Bayesian treatment avoids overfitting, and the model generalizes to new data (the latter is problematic for PLSI). However, the parametric Dirichlet distribution can be a limitation in applications which exhibit a richer structure. As an illustration, consider Fig. 1(a), which shows the empirical distribution of three topics. We see that the probability that all three topics are present in a document (corresponding to the center of the plot) is near zero. In contrast, a Dirichlet distribution fitted to the data (Fig. 1(b)) would predict the highest probability density for exactly that case. The reason is the limited expressiveness of a simple Dirichlet distribution.

This paper employs a more general nonparametric Bayesian approach to explore not only latent topics and their probabilities, but also complex dependencies between latent topics which might, for example, be expressed as a complex clustering structure. The key innovation is to replace the parametric Dirichlet prior distribution in LDA by a flexible nonparametric distribution G(·) that is a sample generated from a Dirichlet process (DP) or its finite approximation, Dirichlet-multinomial allocation (DMA). The Dirichlet distribution of LDA becomes the base distribution for the Dirichlet process. In this Dirichlet enhanced model, the posterior distribution of the topic mixture for a new document converges to a flexible mixture model in which both mixture weights and mixture parameters can be learned from the data. Thus the a posteriori distribution is able to represent the distribution of topics more truthfully. After convergence of the learning procedure, typically only a few components with non-negligible weights remain; thus the model is able to naturally output clusters of documents.

Nonparametric Bayesian modelling has attracted considerable attention from the learning community (e.g. [1, 13, 2, 15, 17]). A potential problem with this class of models is that inference typically relies on MCMC approximations, which might be prohibitively slow for the large collections of documents in our setting. Instead, we tackle the problem by a less expensive variational mean-field inference based on the DMA model. The resulting updates turn out to be quite interpretable. Finally, we observed very good empirical performance of the proposed algorithm on both toy data and text documents; in the latter case, meaningful clusters are discovered.

This paper is organized as follows. The next section introduces Dirichlet enhanced latent semantic analysis. In Section 3 we present inference and learning algorithms based on a variational approximation. Section 4 presents experimental results using a toy data set and two document data sets. In Section 5 we present conclusions.

Figure 1: Consider a 2-dimensional simplex representing 3 topics (recall that the probabilities have to sum to one): (a) the probability distribution of topics in documents, which forms a ring-like distribution (dark color indicates low density); (b) the 3-dimensional Dirichlet distribution that maximizes the likelihood of the samples.

2 Dirichlet Enhanced Latent Semantic Analysis

Following the notation in [3], we consider a corpus D containing D documents. Each document d is a sequence of N_d words, denoted by w_d = {w_{d,1}, ..., w_{d,N_d}}, where w_{d,n} is a variable for the n-th word in w_d and denotes the index of the corresponding word in a vocabulary V. Note that the same word may occur several times in the sequence w_d.

2.1 The Proposed Model

We assume that each document is a mixture of k latent topics and that the words in each document are generated by repeatedly sampling topics and words from the distributions

\[ w_{d,n} \mid z_{d,n};\, \beta \sim \mathrm{Mult}(z_{d,n}, \beta), \tag{1} \]
\[ z_{d,n} \mid \theta_d \sim \mathrm{Mult}(\theta_d). \tag{2} \]

Here w_{d,n} is generated given its latent topic z_{d,n}, which takes values in {1, ..., k}. β is a k × |V| multinomial parameter matrix with Σ_j β_{i,j} = 1, where β_{z,w_{d,n}} specifies the probability of generating word w_{d,n} given topic z. θ_d denotes the parameters of a multinomial distribution of document d over topics for w_d, satisfying θ_{d,i} ≥ 0 and Σ_{i=1}^{k} θ_{d,i} = 1.

In the LDA model, θ_d is generated from a k-dimensional Dirichlet distribution G_0(θ) = Dir(θ|λ) with parameter λ ∈ R^{k×1}. In our Dirichlet enhanced model, we assume that θ_d is generated from a distribution G(θ), which is itself a random sample generated from a Dirichlet process (DP) [5],

\[ G \mid G_0, \alpha_0 \sim \mathrm{DP}(G_0, \alpha_0), \tag{3} \]

where the nonnegative scalar α_0 is the precision parameter and G_0(θ) is the base distribution, which is identical to the Dirichlet distribution of LDA. It turns out that a distribution G(θ) sampled from a DP can be written as

\[ G(\cdot) = \sum_{l=1}^{\infty} \pi_l \, \delta_{\theta_l^*}(\cdot), \tag{4} \]

where π_l ≥ 0, Σ_{l=1}^{∞} π_l = 1, δ_θ(·) denotes the point mass distribution concentrated at θ, and the θ_l^* are countably infinite variables sampled i.i.d. from G_0 [14]. The probability weights π_l depend solely on α_0 via a stick-breaking process, which is defined in the next subsection. The generative model, summarized by Fig. 2(a), is conditioned on (k × |V| + k + 1) parameters, i.e. β, λ and α_0. Finally, the likelihood of the collection D is given by

\[ L_{\mathrm{DP}}(D \mid \alpha_0, \lambda, \beta) = \int_{G} p(G; \alpha_0, \lambda) \prod_{d=1}^{D} \Big[ \int_{\theta_d} p(\theta_d \mid G) \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{k} p(w_{d,n} \mid z_{d,n}; \beta)\, p(z_{d,n} \mid \theta_d)\, d\theta_d \Big] dG. \tag{5} \]

In short, G is sampled once for the whole corpus D, θ_d is sampled once for each document d, and the topic z_{d,n} is sampled once for the n-th word w_{d,n} in d.
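As a concrete illustration of Eqs. (1)-(2), the following minimal sketch (in Python with NumPy; the function name, the toy sizes and the random seed are our own choices, not part of the paper) samples the words of a single document given its topic mixture θ_d and the topic-word matrix β:

```python
import numpy as np

def sample_document(theta_d, beta, n_words, rng=None):
    """Sample one document under Eqs. (1)-(2): draw a topic z_{d,n} from
    Mult(theta_d) for each word, then a word index from row z_{d,n} of beta.

    theta_d : (k,)   topic proportions of the document
    beta    : (k, V) per-topic word distributions, rows sum to one
    """
    rng = np.random.default_rng() if rng is None else rng
    k, V = beta.shape
    z = rng.choice(k, size=n_words, p=theta_d)                  # Eq. (2)
    words = np.array([rng.choice(V, p=beta[zi]) for zi in z])   # Eq. (1)
    return words, z

# toy usage: 3 topics over a 10-word vocabulary
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(10), size=3)      # random beta, for illustration only
theta_d = np.array([0.7, 0.2, 0.1])
w_d, z_d = sample_document(theta_d, beta, n_words=20, rng=rng)
```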

Figure 2: Plate models for latent semantic analysis. (a) Latent semantic analysis with a DP prior; (b) an equivalent representation, where c_d is the indicator variable saying which cluster document d takes on out of the infinite clusters induced by the DP; (c) latent semantic analysis with a finite approximation of DP (see Sec. 2.3).

2.2 Stick Breaking and Dirichlet Enhancing

The representation of a sample from the DP prior in Eq. (4) is generated by the stick-breaking process, in which an infinite number of pairs (π_l, θ_l^*) are generated. θ_l^* is sampled independently from G_0, and π_l is defined as

\[ \pi_1 = B_1, \qquad \pi_l = B_l \prod_{j=1}^{l-1} (1 - B_j), \]

where the B_l are i.i.d. samples from the Beta distribution Beta(1, α_0). Thus, with a small α_0, the first "sticks" π_l will be large, with little left for the remaining sticks. Conversely, if α_0 is large, the first sticks and all subsequent sticks will be small, and the π_l will be more evenly distributed. In conclusion, the base distribution determines the locations of the point masses, and α_0 determines the distribution of the probability weights. The distribution is nonzero at an infinite number of discrete points; if α_0 is selected to be small, the amplitudes of only a small number of discrete points will be significant. Note that both locations and weights are not fixed but take on new values each time a new sample of G is generated. Since E(G) = G_0, the prior initially corresponds to the prior used in LDA. With many documents in the training data set, locations θ_l^* which agree with the data will obtain a large weight. If a small α_0 is chosen, the parameters will form clusters, whereas a large α_0 results in many representative parameters. Thus Dirichlet enhancement serves two purposes: it increases the flexibility in representing the posterior distribution of mixing weights, and it encourages a clustered solution, which leads to insights into the document corpus.

The DP prior offers two advantages over usual document clustering methods. First, there is no need to specify the number of clusters: the resulting clustering structure is constrained by the DP prior, but also adapted to the empirical observations. Second, the number of clusters is not fixed. Although the parameter α_0 tunes the tendency to form clusters, the DP prior allows the creation of new clusters if the current model cannot explain upcoming data very well, which is particularly suitable for our setting, where the dictionary is fixed while the document collection may grow.

By applying the stick-breaking representation, our model obtains the equivalent representation in Fig. 2(b). An infinite number of θ_l^* are generated from the base distribution, and the new indicator variable c_d indicates which θ_l^* is assigned to document d. If more than one document is assigned to the same θ_l^*, clustering occurs. π = {π_1, ..., π_∞} is a vector of probability weights generated from the stick-breaking process.
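The stick-breaking construction translates directly into a few lines of code. The sketch below (our own illustration; the truncation level replaces the infinite sum of Eq. (4)) draws a truncated sample G = {(π_l, θ_l^*)} with base distribution G_0 = Dir(λ), and shows how a small α_0 concentrates the weight on a few atoms whereas a large α_0 spreads it out:

```python
import numpy as np

def stick_breaking_sample(alpha0, lam, truncation, rng=None):
    """Draw a truncated sample G = sum_l pi_l * delta_{theta*_l} from
    DP(G0, alpha0) with G0 = Dir(lam), via the stick-breaking process."""
    rng = np.random.default_rng() if rng is None else rng
    B = rng.beta(1.0, alpha0, size=truncation)                 # B_l ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - B)[:-1]))
    pi = B * remaining                                         # pi_l = B_l * prod_{j<l} (1 - B_j)
    theta_star = rng.dirichlet(lam, size=truncation)           # theta*_l ~ G0
    return pi, theta_star

# small alpha0: most mass on a few atoms (clustering); large alpha0: weights spread out
pi_small, _ = stick_breaking_sample(alpha0=1.0, lam=np.ones(5), truncation=100)
pi_large, _ = stick_breaking_sample(alpha0=100.0, lam=np.ones(5), truncation=100)
print(np.sort(pi_small)[::-1][:5], np.sort(pi_large)[::-1][:5])
```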
2.3 Dirichlet-Multinomial Allocation (DMA)

Since an infinite number of pairs (π_l, θ_l^*) are generated in the stick-breaking process, it is usually very difficult to deal with the unknown distribution G. For inference there exist Markov chain Monte Carlo (MCMC) methods such as Gibbs samplers which directly sample θ_d using the Pólya urn scheme and thereby avoid the difficulty of sampling the infinite-dimensional G [4]; in practice, however, the sampling procedure is very slow and thus impractical for high-dimensional data like text. In Bayesian statistics, the Dirichlet-multinomial allocation DP_N of [6] has often been applied as a finite approximation to DP (see [6, 9]), which takes the form

\[ G_N = \sum_{l=1}^{N} \pi_l \, \delta_{\theta_l^*}, \]

where π = {π_1, ..., π_N} is an N-vector of probability weights sampled once from a Dirichlet prior Dir(α_0/N, ..., α_0/N), and the θ_l^*, l = 1, ..., N, are sampled i.i.d. from the base distribution G_0. It has been shown that DP is the limiting case of DP_N [6, 9, 12] and, more importantly, that DP_N demonstrates similar stick-breaking properties and leads to a similar clustering effect [6]. If N is sufficiently large with respect to our sample size D, DP_N gives a good approximation to DP.

Under the DP_N model, the plate representation of our model is illustrated in Fig. 2(c). The likelihood of the whole collection D is

\[ L_{\mathrm{DP}_N}(D \mid \alpha_0, \lambda, \beta) = \int_{\pi} \int_{\theta^*} \prod_{d=1}^{D} \Big[ \sum_{c_d=1}^{N} p(w_d \mid \theta^*, c_d; \beta)\, p(c_d \mid \pi) \Big] dP(\theta^*; G_0)\, dP(\pi; \alpha_0), \tag{6} \]

where c_d is the indicator variable saying which unique value θ_l^* document d takes on. The likelihood of document d is therefore written as

\[ p(w_d \mid \theta^*, c_d; \beta) = \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{k} p(w_{d,n} \mid z_{d,n}; \beta)\, p(z_{d,n} \mid \theta^*_{c_d}). \]
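Reusing the document sampler sketched after Section 2.1, the finite DP_N (DMA) generative process of Fig. 2(c) can be written as follows; the function name and argument conventions are ours and only meant to illustrate the model, not to reproduce the authors' implementation:

```python
import numpy as np

def sample_corpus_dma(alpha0, lam, beta, D, Nd, N, rng=None):
    """Generate a corpus under the DP_N (DMA) model of Fig. 2(c):
    pi ~ Dir(alpha0/N, ..., alpha0/N), theta*_l ~ Dir(lam) for l = 1..N,
    c_d ~ Mult(pi) per document, and words follow Eqs. (1)-(2) with
    theta_d = theta*_{c_d}."""
    rng = np.random.default_rng() if rng is None else rng
    k, V = beta.shape
    pi = rng.dirichlet(np.full(N, alpha0 / N))   # mixture weights, sampled once per corpus
    theta_star = rng.dirichlet(lam, size=N)      # N candidate topic mixtures (atoms)
    c = rng.choice(N, size=D, p=pi)              # cluster indicator c_d per document
    docs = []
    for d in range(D):
        z = rng.choice(k, size=Nd, p=theta_star[c[d]])                  # Eq. (2)
        docs.append(np.array([rng.choice(V, p=beta[zi]) for zi in z]))  # Eq. (1)
    return docs, c, pi, theta_star
```

Documents that share the same indicator c_d share the same topic mixture θ_{c_d}^*, which is exactly the clustering effect discussed above.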

2.4 Connections to PLSA and LDA

From the application point of view, PLSA and LDA both aim to discover the latent dimensions of data with an emphasis on indexing. The proposed Dirichlet enhanced semantic analysis retains the strengths of PLSA and LDA, and further explores the clustering structure of the data. The model is a generalization of LDA: if we let α_0 → ∞, the model becomes identical to LDA, since the sampled G becomes identical to the finite Dirichlet base distribution G_0. This extreme case makes documents mutually independent given G_0, since the θ_d are sampled i.i.d. from G_0. If G_0 itself is not sufficiently expressive, the model is not able to capture the dependencies between documents. The Dirichlet enhancement elegantly solves this problem: with a moderate α_0, the model allows G to deviate from G_0, giving the model the flexibility to explore the richer structure of the data. Exchangeability may then not hold within the whole collection, but it does hold between groups of documents with respective atoms θ_l^* sampled from G_0. On the other hand, the increased flexibility does not lead to overfitting, because inference and learning are done in a Bayesian setting, averaging over the number of mixture components and the states of the latent variables.

3 Inference and Learning

In this section we consider model inference and learning based on the DP_N model. As seen from Fig. 2(c), inference needs to calculate the a posteriori joint distribution of the latent variables, p(π, θ^*, c, z | D, α_0, λ, β), which requires computing Eq. (6). This integral is, however, analytically infeasible. A straightforward Gibbs sampling method can be derived, but it turns out to be very slow and inapplicable to high-dimensional data like text, since a z_{d,n} has to be sampled for each word. Therefore, in this section we suggest efficient variational inference.

3.1 Variational Inference

The idea of variational mean-field inference is to propose a joint distribution Q(π, θ^*, c, z) conditioned on some free parameters, and then to force Q to approximate the a posteriori distribution of interest by minimizing the KL-divergence D_KL(Q || p(π, θ^*, c, z | D, α_0, λ, β)) with respect to those free parameters. We propose the following variational distribution Q over the latent variables:

\[ Q(\pi, \theta^*, c, z \mid \eta, \gamma, \varphi, \phi) = Q(\pi \mid \eta) \prod_{l=1}^{N} Q(\theta_l^* \mid \gamma_l) \prod_{d=1}^{D} Q(c_d \mid \varphi_d) \prod_{d=1}^{D} \prod_{n=1}^{N_d} Q(z_{d,n} \mid \phi_{d,n}), \tag{7} \]

where η, γ, ϕ, φ are variational parameters, each tailoring the variational a posteriori distribution to the corresponding latent variable. In particular, η specifies an N-dimensional Dirichlet distribution for π, γ_l specifies a k-dimensional Dirichlet distribution for the distinct θ_l^*, ϕ_d specifies an N-dimensional multinomial for the indicator c_d of document d, and φ_{d,n} specifies a k-dimensional multinomial over latent topics for word w_{d,n}. It turns out that minimizing the KL-divergence is equivalent to maximizing a lower bound on ln p(D | α_0, λ, β), derived by applying Jensen's inequality [10]; see the Appendix for details of the derivation. The lower bound is given as

\[ L_Q(D) = \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta)\, p(z_{d,n} \mid \theta^*, c_d)] + E_Q[\ln p(\pi \mid \alpha_0)] + \sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] + \sum_{l=1}^{N} E_Q[\ln p(\theta_l^* \mid G_0)] - E_Q[\ln Q(\pi, \theta^*, c, z)]. \tag{8} \]

The optimum is found by setting the partial derivatives with respect to each variational parameter to zero, which gives rise to the following updates:

\[ \phi_{d,n,i} \propto \beta_{i,w_{d,n}} \exp\Big\{ \sum_{l=1}^{N} \varphi_{d,l} \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \Big\}, \tag{9} \]

\[ \varphi_{d,l} \propto \exp\Big\{ \sum_{i=1}^{k} \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \sum_{n=1}^{N_d} \phi_{d,n,i} + \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big\}, \tag{10} \]

\[ \gamma_{l,i} = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \varphi_{d,l}\, \phi_{d,n,i} + \lambda_i, \tag{11} \]

\[ \eta_l = \sum_{d=1}^{D} \varphi_{d,l} + \frac{\alpha_0}{N}, \tag{12} \]

where Ψ(·) is the digamma function, the first derivative of the log Gamma function. Some details of the derivation of these formulas can be found in the Appendix. We find that the updates are quite interpretable. For example, in Eq. (9), φ_{d,n,i} is the a posteriori probability of latent topic i given the word w_{d,n}. It is determined both by the corresponding entry of the β matrix, which can be seen as a likelihood term, and by the probability that document d selects topic i, i.e., the prior term; here the prior is itself a weighted average of the different θ_l^* to which d is assigned. In Eq. (12), η_l is the a posteriori weight of π_l and turns out to be a tradeoff between the empirical responses at θ_l^* and the prior specified by α_0. Finally, since the parameters are coupled, the variational inference iterates Eq. (9) to Eq. (12) until convergence.
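As an illustration of this coordinate-ascent scheme, the sketch below implements Eqs. (9)-(12) with NumPy and SciPy's digamma. The initialization, the fixed iteration count and the small constant added to β before taking logarithms are our own choices and are not prescribed by the paper:

```python
import numpy as np
from scipy.special import digamma

def e_step(docs, beta, lam, alpha0, N, n_iter=50, rng=None):
    """Coordinate-ascent updates of Eqs. (9)-(12).

    docs : list of D integer arrays, docs[d][n] = word index w_{d,n}
    beta : (k, V) topic-word probabilities;  lam : (k,) base-measure parameter
    Returns phi (list of (Nd, k)), varphi (D, N), gamma (N, k), eta (N,).
    """
    rng = np.random.default_rng() if rng is None else rng
    k, V = beta.shape
    D = len(docs)
    varphi = rng.dirichlet(np.ones(N), size=D)            # q(c_d), one row per document
    gamma = np.tile(lam, (N, 1)).astype(float)            # q(theta*_l)
    eta = np.full(N, alpha0 / N + D / N)                  # q(pi)
    phi = [np.full((len(w), k), 1.0 / k) for w in docs]   # q(z_{d,n})

    for _ in range(n_iter):
        Elog_theta = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))  # (N, k)
        Elog_pi = digamma(eta) - digamma(eta.sum())                              # (N,)
        for d, w in enumerate(docs):
            # Eq. (9): topic responsibilities of each word
            log_phi = np.log(beta[:, w].T + 1e-12) + varphi[d] @ Elog_theta
            phi[d] = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
            phi[d] /= phi[d].sum(axis=1, keepdims=True)
            # Eq. (10): cluster responsibilities of the document
            log_varphi = Elog_theta @ phi[d].sum(axis=0) + Elog_pi
            varphi[d] = np.exp(log_varphi - log_varphi.max())
            varphi[d] /= varphi[d].sum()
        # Eq. (11) and Eq. (12)
        gamma = lam + sum(np.outer(varphi[d], phi[d].sum(axis=0)) for d in range(D))
        eta = alpha0 / N + varphi.sum(axis=0)
    return phi, varphi, gamma, eta
```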

3.2 Parameter Estimation

Following the empirical Bayesian framework, we can estimate the hyperparameters α_0, λ and β by iteratively maximizing the lower bound L_Q both with respect to the variational parameters (as described by Eq. (9)-Eq. (12)) and with respect to the model parameters, holding the remaining parameters fixed. This iterative procedure is also referred to as variational EM [10]. It is easy to derive the update for β:

\[ \beta_{i,j} \propto \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,i}\, \delta_j(w_{d,n}), \tag{13} \]

where δ_j(w_{d,n}) = 1 if w_{d,n} = j, and 0 otherwise. For the remaining parameters, we first write down the parts of L_Q in Eq. (8) involving α_0 and λ:

\[ L_{[\alpha_0]} = \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \Big( \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big), \]

\[ L_{[\lambda]} = \sum_{l=1}^{N} \Big\{ \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \lambda_i\big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \Big\}. \]

Estimates for α_0 and λ are found by maximizing these objective functions using standard methods like Newton-Raphson, as suggested in [3].

4 Empirical Study

4.1 Toy Data

We first apply the model to a toy problem with k = 5 latent topics and a dictionary containing 200 words. The assumed probabilities of generating words from topics, i.e. the parameters β, are illustrated in Fig. 3(d), in which each colored line corresponds to a topic and assigns non-zero probabilities to a subset of the words. For each run we generate data with the following steps: (1) a cluster number M is chosen between 5 and 12; (2) M document clusters are generated, each of which is defined by a combination of topics; (3) each document d, d = 1, ..., 100, is generated by first randomly selecting a cluster and then generating 40 words according to the corresponding topic combination. For DP_N we select N = 100, and we aim to examine the performance in discovering the latent topics and the clustering structure.

In Fig. 3(a)-(c) we illustrate the process of clustering documents over the EM iterations for a run containing 6 document clusters. In Fig. 3(a) we show the initial random assignment ϕ_{d,l} of each document d to a cluster l. After one EM step, documents begin to accumulate in a reduced number of clusters (Fig. 3(b)), and they converge to exactly 6 clusters after 5 steps (Fig. 3(c)). The learned word distribution of the topics, β, is shown in Fig. 3(e) and is very similar to the true distribution.

By varying M, the true number of document clusters, we examine whether our model can find the correct M. To determine the number of clusters, we run the variational inference and obtain for each document a weight vector ϕ_{d,l} over clusters. Each document then takes the cluster with the largest weight as its assignment, and we calculate the cluster number as the number of non-empty clusters. For each setting of M from 5 to 12, we randomize the data for 20 trials and obtain the curve in Fig. 3(f), which shows the average performance and the variance. In 37% of the runs we get perfect results, and in another 43% of the runs the learned values deviate from the truth by only one. However, we also find that the model tends to produce slightly fewer than M clusters when M is large. The reason might be that 100 documents are not sufficient for learning a large number M of clusters.
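The cluster-counting rule just described (assign each document to its highest-weight cluster and count the clusters that remain non-empty) is a one-liner on top of the variational parameters. The helper below assumes the ϕ_{d,l} matrix (called `varphi`) returned by the `e_step` sketch in Section 3.1:

```python
import numpy as np

def count_clusters(varphi):
    """Hard-assign each document to argmax_l varphi[d, l] and count the
    number of distinct clusters actually used, as done for Fig. 3(f)."""
    assignments = np.argmax(varphi, axis=1)        # (D,) hard cluster labels
    return assignments, np.unique(assignments).size

# usage with the e_step() sketch:
# _, varphi, _, _ = e_step(docs, beta, lam, alpha0, N)
# labels, n_clusters = count_clusters(varphi)
```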


Figure 3: Experimental results for the toy problem. (a)-(c) show the document-cluster assignments ϕ_{d,l} over the variational inference for a run with 6 document clusters: (a) initial random assignments; (b) assignments after one iteration; (c) assignments after five iterations (final). The multinomial parameter matrix β with true and estimated values is given in (d) and (e), respectively; each line gives the probabilities of generating the 200 words, with peaks indicating high probabilities. (f) shows the learned number of clusters with respect to the true number, with mean and error bar.

4.2 Document Modelling

We compare the proposed model with PLSI and LDA on two text data sets. The first is a subset of the Reuters-21578 data set which contains 3000 documents and 20334 words. The second is taken from the 20-newsgroups data set and has 2000 documents with 8014 words. The comparison metric is perplexity, conventionally used in language modelling. For a test document set, it is formally defined as

\[ \mathrm{Perplexity}(D_{\mathrm{test}}) = \exp\Big( - \ln p(D_{\mathrm{test}}) \Big/ \sum_d |w_d| \Big). \]

We follow the formula in [3] to calculate the perplexity for PLSI. In our algorithm, N is set to the number of training documents. Fig. 4(a) and (b) show the comparison results for different numbers k of latent topics. Our model outperforms LDA and PLSI in all runs, which indicates that the flexibility introduced by the DP enhancement does not produce overfitting and results in better generalization performance.
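The perplexity definition maps directly to code once per-document log-likelihoods are available; how ln p(w_d) is obtained (e.g. from a variational bound, following [3]) is left to the caller in this sketch, which is our own illustration:

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """Perplexity of a test set: exp(-sum_d ln p(w_d) / sum_d |w_d|),
    where log_likelihoods[d] approximates ln p(w_d) under the fitted model
    and doc_lengths[d] = |w_d|."""
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths)))
```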

4.3 Clustering

In our last experiment we demonstrate that our approach is suitable for finding relevant document clusters. We select four categories, autos, motorcycles, baseball and hockey, from the 20-newsgroups data set, with 446 documents in each topic. Fig. 4(c) illustrates one clustering result, in which we set the topic number k = 5 and found 6 document clusters. In the figure the documents are indexed according to their true category labels, so we can clearly see that the result is quite meaningful. Documents from one category show similar membership to the learned clusters, and different categories can be distinguished very easily. The first two categories are not clearly separated because they both deal with vehicles and share many terms, while the remaining categories, baseball and hockey, are detected perfectly.

Figure 4: (a) and (b): Perplexity results on Reuters-21578 and 20-newsgroups for DELSA, PLSI and LDA; (c): Clustering result on 20-newsgroups dataset.

5 Conclusions and Future Work

This paper proposes a Dirichlet enhanced latent semantic analysis model for analyzing co-occurrence data like text, which retains the strength of previous approaches in finding latent topics and further introduces additional modelling flexibility to uncover the clustering structure of the data. For inference and learning, we adopt a variational mean-field approximation based on a finite alternative to DP. Experiments are performed on a toy data set and two text data sets. The experiments show that our model can discover both the latent topics and meaningful clustering structures.

In addition to our approach, alternative methods for approximate inference in DP have been proposed using expectation propagation (EP) [11] or variational methods [16, 2]. Our approach is most similar to the work of Blei and Jordan [2], who applied a mean-field approximation for inference in DP based on a truncated DP (TDP). Their approach was formulated in the setting of general exponential-family mixture models [2]. Conceptually, DP_N appears to be simpler than TDP in the sense that the a posteriori distribution of G is a symmetric Dirichlet, while TDP ends up with a generalized Dirichlet (see [8]). On the other hand, TDP seems to be a tighter approximation to DP. Future work will include a comparison of the various DP approximations.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments. Shipeng Yu gratefully acknowledges the support through a Siemens scholarship.

References

[1] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems (NIPS) 14, 2002.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[3] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), June 1995.
[5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209–230, 1973.
[6] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Unpublished paper, 2000.
[7] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM SIGIR Conference, pages 50–57, Berkeley, California, August 1999.
[8] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, 2001.
[9] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Canadian Journal of Statistics, 30:269–283, 2002.
[10] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[11] T. Minka and Z. Ghahramani. Expectation propagation for infinite mixtures. In NIPS'03 Workshop on Nonparametric Bayesian Methods and Infinite Models, 2003.
[12] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
[13] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, 2002.
[14] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, University of California, Berkeley, 2004.
[16] V. Tresp and K. Yu. An introduction to nonparametric hierarchical Bayesian modelling with a focus on multi-agent learning. In Proceedings of the Hamilton Summer School on Switching and Learning in Feedback Systems, Lecture Notes in Computer Science, 2004.
[17] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for information filtering. In Proceedings of the 27th Annual International ACM SIGIR Conference, 2004.

Appendix

To simplify the notation, we write Ξ for all the latent variables {π, θ^*, c, z}. With the variational form in Eq. (7), we apply Jensen's inequality to the likelihood in Eq. (6) and obtain

\[ \ln p(D \mid \alpha_0, \lambda, \beta) = \ln \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} p(D, \Xi \mid \alpha_0, \lambda, \beta)\, d\theta^*\, d\pi \]
\[ = \ln \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi)\, \frac{p(D, \Xi \mid \alpha_0, \lambda, \beta)}{Q(\Xi)}\, d\theta^*\, d\pi \]
\[ \geq \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \ln p(D, \Xi \mid \alpha_0, \lambda, \beta)\, d\theta^*\, d\pi - \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \ln Q(\Xi)\, d\theta^*\, d\pi \]
\[ = E_Q[\ln p(D, \Xi \mid \alpha_0, \lambda, \beta)] - E_Q[\ln Q(\Xi)], \]

which results in Eq. (8). To write out each term in Eq. (8) explicitly, we have, for the first term,

\[ \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta)] = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \phi_{d,n,i} \ln \beta_{i,\nu}, \]

where ν is the index of word w_{d,n}. The other terms can be derived as follows:

\[ \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(z_{d,n} \mid \theta^*, c_d)] = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \sum_{l=1}^{N} \varphi_{d,l}\, \phi_{d,n,i} \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big), \]

\[ E_Q[\ln p(\pi \mid \alpha_0)] = \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \Big( \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big), \]

\[ \sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] = \sum_{d=1}^{D} \sum_{l=1}^{N} \varphi_{d,l} \Big( \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big), \]

\[ \sum_{l=1}^{N} E_Q[\ln p(\theta_l^* \mid G_0)] = \sum_{l=1}^{N} \Big\{ \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \lambda_i\big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \Big\}, \]

\[ E_Q[\ln Q(\pi, \theta^*, c, z)] = \ln \Gamma\big(\textstyle\sum_{l=1}^{N} \eta_l\big) - \sum_{l=1}^{N} \ln \Gamma(\eta_l) + \sum_{l=1}^{N} (\eta_l - 1) \Big( \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big) + \sum_{l=1}^{N} \Big\{ \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \gamma_{l,i}\big) - \sum_{i=1}^{k} \ln \Gamma(\gamma_{l,i}) + \sum_{i=1}^{k} (\gamma_{l,i} - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \Big\} + \sum_{d=1}^{D} \sum_{l=1}^{N} \varphi_{d,l} \ln \varphi_{d,l} + \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \phi_{d,n,i} \ln \phi_{d,n,i}. \]

Differentiating the lower bound with respect to the individual variational parameters gives the variational E-step in Eq. (9) to Eq. (12); the M-step is obtained by considering the lower bound with respect to β, λ and α_0.
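All digamma expressions above follow from the standard Dirichlet identity E[ln π_l] = Ψ(η_l) − Ψ(Σ_j η_j); the short Monte Carlo check below (our own illustration, not part of the paper) verifies it numerically:

```python
import numpy as np
from scipy.special import digamma

# For pi ~ Dir(eta), E[ln pi_l] = Psi(eta_l) - Psi(sum_j eta_j).
rng = np.random.default_rng(0)
eta = np.array([1.5, 2.0, 3.5, 1.0])
samples = rng.dirichlet(eta, size=200_000)
mc_estimate = np.log(samples).mean(axis=0)
analytic = digamma(eta) - digamma(eta.sum())
print(np.round(mc_estimate, 3), np.round(analytic, 3))   # the two should agree closely
```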