Dirichlet Enhanced Latent Semantic Analysis

Kai Yu (Siemens Corporate Technology, D-81730 Munich, Germany, [email protected]), Shipeng Yu (Institute for Computer Science, University of Munich, D-80538 Munich, Germany, [email protected]), Volker Tresp (Siemens Corporate Technology, D-81730 Munich, Germany, [email protected])

Abstract

This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. the topics of documents), and further uncovers the clustering structure of the records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors but also the clustering structure.

1 Introduction

We consider the problem of modelling a large corpus of high-dimensional discrete records. Our assumption is that a record can be modelled by latent factors which account for the co-occurrence of items in a record. To ground the discussion, in the following we identify records with documents, latent factors with (latent) topics, and items with words. Probabilistic latent semantic indexing (PLSI) [7] was one of the first approaches that provided a probabilistic treatment of text documents as being composed of latent topics. Latent Dirichlet allocation (LDA) [3] generalizes PLSI by treating the topic mixture parameters (i.e. a multinomial over topics) as variables drawn from a Dirichlet distribution. Its Bayesian treatment avoids overfitting, and the model generalizes to new data (the latter is problematic for PLSI). However, the parametric Dirichlet distribution can be a limitation in applications which exhibit a richer structure. As an illustration, consider Fig. 1(a), which shows the empirical distribution of three topics. We see that the probability that all three topics are present in a document (corresponding to the center of the plot) is near zero. In contrast, a Dirichlet distribution fitted to the data (Fig. 1(b)) would predict the highest probability density for exactly that case. The reason is the limited expressiveness of a simple Dirichlet distribution.

This paper employs a more general nonparametric Bayesian approach to explore not only latent topics and their probabilities, but also complex dependencies between latent topics which might, for example, be expressed as a complex clustering structure. The key innovation is to replace the parametric Dirichlet prior distribution in LDA by a flexible nonparametric distribution G(·) that is a sample generated from a Dirichlet process (DP) or its finite approximation, Dirichlet-multinomial allocation (DMA). The Dirichlet distribution of LDA becomes the base distribution for the Dirichlet process. In this Dirichlet enhanced model, the posterior distribution of the topic mixture for a new document converges to a flexible mixture model in which both mixture weights and mixture parameters can be learned from the data. Thus the a posteriori distribution is able to represent the distribution of topics more truthfully. After convergence of the learning procedure, typically only a few components with non-negligible weights remain; thus the model is able to naturally output clusters of documents.

Nonparametric Bayesian modelling has attracted considerable attention from the learning community (e.g. [1, 13, 2, 15, 17]). A potential problem with this class of models is that inference typically relies on MCMC approximations, which might be prohibitively slow for the large collections of documents in our setting. Instead, we tackle the problem by a less expensive variational mean-field inference based on the DMA model. The resulting updates turn out to be quite interpretable. Finally, we observed very good empirical performance of the proposed algorithm on both toy data and text documents; in the latter case, meaningful clusters are discovered.

This paper is organized as follows. The next section introduces Dirichlet enhanced latent semantic analysis. In Section 3 we present inference and learning algorithms based on a variational approximation. Section 4 presents experimental results using a toy data set and two document data sets. In Section 5 we present conclusions.

Figure 1: Consider a 2-dimensional simplex representing 3 topics (recall that the probabilities have to sum to one): (a) the probability distribution of topics in documents, which forms a ring-like distribution (dark color indicates low density); (b) the 3-dimensional Dirichlet distribution that maximizes the likelihood of the samples.

2 Dirichlet Enhanced Latent Semantic Analysis

Following the notation in [3], we consider a corpus D containing D documents. Each document d is a sequence of N_d words, denoted by w_d = {w_{d,1}, ..., w_{d,N_d}}, where w_{d,n} is a variable for the n-th word in w_d and denotes the index of the corresponding word in a vocabulary V. Note that the same word may occur several times in the sequence w_d.

2.1 The Proposed Model

We assume that each document is a mixture of k latent topics and that the words in each document are generated by repeatedly sampling topics and words from the distributions

\[ w_{d,n} \mid z_{d,n};\, \beta \sim \mathrm{Mult}(z_{d,n}, \beta), \tag{1} \]
\[ z_{d,n} \mid \theta_d \sim \mathrm{Mult}(\theta_d). \tag{2} \]

Here w_{d,n} is generated given its latent topic z_{d,n}, which takes values in {1, ..., k}. β is a k × |V| multinomial parameter matrix with Σ_j β_{i,j} = 1, where β_{z,w_{d,n}} specifies the probability of generating word w_{d,n} given topic z. θ_d denotes the parameters of a multinomial distribution of document d over topics for w_d, satisfying θ_{d,i} ≥ 0 and Σ_{i=1}^{k} θ_{d,i} = 1.

In the LDA model, θ_d is generated from a k-dimensional Dirichlet distribution G_0(θ) = Dir(θ|λ) with parameter λ ∈ R^{k×1}. In our Dirichlet enhanced model, we assume that θ_d is generated from a distribution G(θ), which is itself a random sample generated from a Dirichlet process (DP) [5],

\[ G \mid G_0, \alpha_0 \sim \mathrm{DP}(G_0, \alpha_0), \tag{3} \]

where the nonnegative scalar α_0 is the precision parameter and G_0(θ) is the base distribution, which is identical to the Dirichlet distribution of LDA. It turns out that a distribution G(θ) sampled from a DP can be written as

\[ G(\cdot) = \sum_{l=1}^{\infty} \pi_l \, \delta_{\theta_l^*}(\cdot), \tag{4} \]

where π_l ≥ 0, Σ_{l=1}^{∞} π_l = 1, δ_θ(·) denotes the point mass distribution concentrated at θ, and the θ_l^* are countably infinite variables sampled i.i.d. from G_0 [14]. The probability weights π_l depend solely on α_0 via a stick-breaking process, which is defined in the next subsection. The generative model, summarized by Fig. 2(a), is conditioned on (k × |V| + k + 1) parameters, i.e. β, λ and α_0. Finally, the likelihood of the collection D is given by

\[ L_{\mathrm{DP}}(D \mid \alpha_0, \lambda, \beta) = \int_{G} p(G; \alpha_0, \lambda) \prod_{d=1}^{D} \Big[ \int_{\theta_d} p(\theta_d \mid G) \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{k} p(w_{d,n} \mid z_{d,n}; \beta)\, p(z_{d,n} \mid \theta_d)\, d\theta_d \Big] dG. \tag{5} \]

In short, G is sampled once for the whole corpus D, θ_d is sampled once for each document d, and the topic z_{d,n} is sampled once for the n-th word w_{d,n} in d.
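As a concrete illustration of Eqs. (1)-(2), the following minimal sketch (in Python with NumPy; the function name, the toy sizes and the random seed are our own choices, not part of the paper) samples the words of a single document given its topic mixture θ_d and the topic-word matrix β:

```python
import numpy as np

def sample_document(theta_d, beta, n_words, rng=None):
    """Sample one document under Eqs. (1)-(2): draw a topic z_{d,n} from
    Mult(theta_d) for each word, then a word index from row z_{d,n} of beta.

    theta_d : (k,)   topic proportions of the document
    beta    : (k, V) per-topic word distributions, rows sum to one
    """
    rng = np.random.default_rng() if rng is None else rng
    k, V = beta.shape
    z = rng.choice(k, size=n_words, p=theta_d)                  # Eq. (2)
    words = np.array([rng.choice(V, p=beta[zi]) for zi in z])   # Eq. (1)
    return words, z

# toy usage: 3 topics over a 10-word vocabulary
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(10), size=3)      # random beta, for illustration only
theta_d = np.array([0.7, 0.2, 0.1])
w_d, z_d = sample_document(theta_d, beta, n_words=20, rng=rng)
```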

Figure 2: Plate models for latent semantic analysis. (a) Latent semantic analysis with a DP prior; (b) an equivalent representation, where c_d is the indicator variable saying which cluster document d takes on out of the infinite clusters induced by the DP; (c) latent semantic analysis with a finite approximation of DP (see Sec. 2.3).

2.2 Stick Breaking and Dirichlet Enhancing

The representation of a sample from the DP prior in Eq. (4) is generated by the stick-breaking process, in which an infinite number of pairs (π_l, θ_l^*) are generated. θ_l^* is sampled independently from G_0, and π_l is defined as

\[ \pi_1 = B_1, \qquad \pi_l = B_l \prod_{j=1}^{l-1} (1 - B_j), \]

where the B_l are i.i.d. samples from the Beta distribution Beta(1, α_0). Thus, with a small α_0, the first "sticks" π_l will be large, with little left for the remaining sticks. Conversely, if α_0 is large, the first sticks and all subsequent sticks will be small, and the π_l will be more evenly distributed. In conclusion, the base distribution determines the locations of the point masses, and α_0 determines the distribution of the probability weights. The distribution is nonzero at an infinite number of discrete points; if α_0 is selected to be small, the amplitudes of only a small number of discrete points will be significant. Note that both locations and weights are not fixed but take on new values each time a new sample of G is generated. Since E(G) = G_0, the prior initially corresponds to the prior used in LDA. With many documents in the training data set, locations θ_l^* which agree with the data will obtain a large weight. If a small α_0 is chosen, the parameters will form clusters, whereas a large α_0 results in many representative parameters. Thus Dirichlet enhancement serves two purposes: it increases the flexibility in representing the posterior distribution of mixing weights, and it encourages a clustered solution, which leads to insights into the document corpus.

The DP prior offers two advantages over usual document clustering methods. First, there is no need to specify the number of clusters: the resulting clustering structure is constrained by the DP prior, but also adapted to the empirical observations. Second, the number of clusters is not fixed. Although the parameter α_0 tunes the tendency to form clusters, the DP prior allows the creation of new clusters if the current model cannot explain upcoming data very well, which is particularly suitable for our setting, where the dictionary is fixed while the document collection may grow.

By applying the stick-breaking representation, our model obtains the equivalent representation in Fig. 2(b). An infinite number of θ_l^* are generated from the base distribution, and the new indicator variable c_d indicates which θ_l^* is assigned to document d. If more than one document is assigned to the same θ_l^*, clustering occurs. π = {π_1, ..., π_∞} is a vector of probability weights generated from the stick-breaking process.
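The stick-breaking construction translates directly into a few lines of code. The sketch below (our own illustration; the truncation level replaces the infinite sum of Eq. (4)) draws a truncated sample G = {(π_l, θ_l^*)} with base distribution G_0 = Dir(λ), and shows how a small α_0 concentrates the weight on a few atoms whereas a large α_0 spreads it out:

```python
import numpy as np

def stick_breaking_sample(alpha0, lam, truncation, rng=None):
    """Draw a truncated sample G = sum_l pi_l * delta_{theta*_l} from
    DP(G0, alpha0) with G0 = Dir(lam), via the stick-breaking process."""
    rng = np.random.default_rng() if rng is None else rng
    B = rng.beta(1.0, alpha0, size=truncation)                 # B_l ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - B)[:-1]))
    pi = B * remaining                                         # pi_l = B_l * prod_{j<l} (1 - B_j)
    theta_star = rng.dirichlet(lam, size=truncation)           # theta*_l ~ G0
    return pi, theta_star

# small alpha0: most mass on a few atoms (clustering); large alpha0: weights spread out
pi_small, _ = stick_breaking_sample(alpha0=1.0, lam=np.ones(5), truncation=100)
pi_large, _ = stick_breaking_sample(alpha0=100.0, lam=np.ones(5), truncation=100)
print(np.sort(pi_small)[::-1][:5], np.sort(pi_large)[::-1][:5])
```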
2.3 Dirichlet-Multinomial Allocation (DMA)

Since an infinite number of pairs (π_l, θ_l^*) are generated in the stick-breaking process, it is usually very difficult to deal with the unknown distribution G. For inference there exist Markov chain Monte Carlo (MCMC) methods such as Gibbs samplers which directly sample θ_d using the Pólya urn scheme and thereby avoid the difficulty of sampling the infinite-dimensional G [4]; in practice, however, the sampling procedure is very slow and thus impractical for high-dimensional data like text. In Bayesian statistics, the Dirichlet-multinomial allocation DP_N of [6] has often been applied as a finite approximation to DP (see [6, 9]), which takes the form

\[ G_N = \sum_{l=1}^{N} \pi_l \, \delta_{\theta_l^*}, \]

where π = {π_1, ..., π_N} is an N-vector of probability weights sampled once from a Dirichlet prior Dir(α_0/N, ..., α_0/N), and the θ_l^*, l = 1, ..., N, are sampled i.i.d. from the base distribution G_0. It has been shown that DP is the limiting case of DP_N [6, 9, 12] and, more importantly, that DP_N demonstrates similar stick-breaking properties and leads to a similar clustering effect [6]. If N is sufficiently large with respect to our sample size D, DP_N gives a good approximation to DP.

Under the DP_N model, the plate representation of our model is illustrated in Fig. 2(c). The likelihood of the whole collection D is

\[ L_{\mathrm{DP}_N}(D \mid \alpha_0, \lambda, \beta) = \int_{\pi} \int_{\theta^*} \prod_{d=1}^{D} \Big[ \sum_{c_d=1}^{N} p(w_d \mid \theta^*, c_d; \beta)\, p(c_d \mid \pi) \Big] dP(\theta^*; G_0)\, dP(\pi; \alpha_0), \tag{6} \]

where c_d is the indicator variable saying which unique value θ_l^* document d takes on. The likelihood of document d is therefore written as

\[ p(w_d \mid \theta^*, c_d; \beta) = \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{k} p(w_{d,n} \mid z_{d,n}; \beta)\, p(z_{d,n} \mid \theta^*_{c_d}). \]
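Reusing the document sampler sketched after Section 2.1, the finite DP_N (DMA) generative process of Fig. 2(c) can be written as follows; the function name and argument conventions are ours and only meant to illustrate the model, not to reproduce the authors' implementation:

```python
import numpy as np

def sample_corpus_dma(alpha0, lam, beta, D, Nd, N, rng=None):
    """Generate a corpus under the DP_N (DMA) model of Fig. 2(c):
    pi ~ Dir(alpha0/N, ..., alpha0/N), theta*_l ~ Dir(lam) for l = 1..N,
    c_d ~ Mult(pi) per document, and words follow Eqs. (1)-(2) with
    theta_d = theta*_{c_d}."""
    rng = np.random.default_rng() if rng is None else rng
    k, V = beta.shape
    pi = rng.dirichlet(np.full(N, alpha0 / N))   # mixture weights, sampled once per corpus
    theta_star = rng.dirichlet(lam, size=N)      # N candidate topic mixtures (atoms)
    c = rng.choice(N, size=D, p=pi)              # cluster indicator c_d per document
    docs = []
    for d in range(D):
        z = rng.choice(k, size=Nd, p=theta_star[c[d]])                  # Eq. (2)
        docs.append(np.array([rng.choice(V, p=beta[zi]) for zi in z]))  # Eq. (1)
    return docs, c, pi, theta_star
```

Documents that share the same indicator c_d share the same topic mixture θ_{c_d}^*, which is exactly the clustering effect discussed above.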

2.4 Connections to PLSA and LDA

From the application point of view, PLSA and LDA both aim to discover the latent dimensions of data with an emphasis on indexing. The proposed Dirichlet enhanced semantic analysis retains the strengths of PLSA and LDA, and further explores the clustering structure of the data. The model is a generalization of LDA: if we let α_0 → ∞, the model becomes identical to LDA, since the sampled G becomes identical to the finite Dirichlet base distribution G_0. This extreme case makes documents mutually independent given G_0, since the θ_d are sampled i.i.d. from G_0. If G_0 itself is not sufficiently expressive, the model is not able to capture the dependencies between documents. The Dirichlet enhancement elegantly solves this problem: with a moderate α_0, the model allows G to deviate from G_0, giving the model the flexibility to explore the richer structure of the data. Exchangeability may then not hold within the whole collection, but it does hold between groups of documents with respective atoms θ_l^* sampled from G_0. On the other hand, the increased flexibility does not lead to overfitting, because inference and learning are done in a Bayesian setting, averaging over the number of mixture components and the states of the latent variables.

3 Inference and Learning

In this section we consider model inference and learning based on the DP_N model. As seen from Fig. 2(c), inference needs to calculate the a posteriori joint distribution of the latent variables, p(π, θ^*, c, z | D, α_0, λ, β), which requires computing Eq. (6). This integral is, however, analytically infeasible. A straightforward Gibbs sampling method can be derived, but it turns out to be very slow and inapplicable to high-dimensional data like text, since a z_{d,n} has to be sampled for each word. Therefore, in this section we suggest efficient variational inference.

3.1 Variational Inference

The idea of variational mean-field inference is to propose a joint distribution Q(π, θ^*, c, z) conditioned on some free parameters, and then to force Q to approximate the a posteriori distribution of interest by minimizing the KL-divergence D_KL(Q || p(π, θ^*, c, z | D, α_0, λ, β)) with respect to those free parameters. We propose the following variational distribution Q over the latent variables:

\[ Q(\pi, \theta^*, c, z \mid \eta, \gamma, \varphi, \phi) = Q(\pi \mid \eta) \prod_{l=1}^{N} Q(\theta_l^* \mid \gamma_l) \prod_{d=1}^{D} Q(c_d \mid \varphi_d) \prod_{d=1}^{D} \prod_{n=1}^{N_d} Q(z_{d,n} \mid \phi_{d,n}), \tag{7} \]

where η, γ, ϕ, φ are variational parameters, each tailoring the variational a posteriori distribution to the corresponding latent variable. In particular, η specifies an N-dimensional Dirichlet distribution for π, γ_l specifies a k-dimensional Dirichlet distribution for the distinct θ_l^*, ϕ_d specifies an N-dimensional multinomial for the indicator c_d of document d, and φ_{d,n} specifies a k-dimensional multinomial over latent topics for word w_{d,n}. It turns out that minimizing the KL-divergence is equivalent to maximizing a lower bound on ln p(D | α_0, λ, β), derived by applying Jensen's inequality [10]; see the Appendix for details of the derivation. The lower bound is given as

\[ L_Q(D) = \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta)\, p(z_{d,n} \mid \theta^*, c_d)] + E_Q[\ln p(\pi \mid \alpha_0)] + \sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] + \sum_{l=1}^{N} E_Q[\ln p(\theta_l^* \mid G_0)] - E_Q[\ln Q(\pi, \theta^*, c, z)]. \tag{8} \]

The optimum is found by setting the partial derivatives with respect to each variational parameter to zero, which gives rise to the following updates:

\[ \phi_{d,n,i} \propto \beta_{i,w_{d,n}} \exp\Big\{ \sum_{l=1}^{N} \varphi_{d,l} \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \Big\}, \tag{9} \]

\[ \varphi_{d,l} \propto \exp\Big\{ \sum_{i=1}^{k} \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \sum_{n=1}^{N_d} \phi_{d,n,i} + \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big\}, \tag{10} \]

\[ \gamma_{l,i} = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \varphi_{d,l}\, \phi_{d,n,i} + \lambda_i, \tag{11} \]

\[ \eta_l = \sum_{d=1}^{D} \varphi_{d,l} + \frac{\alpha_0}{N}, \tag{12} \]

where Ψ(·) is the digamma function, the first derivative of the log Gamma function. Some details of the derivation of these formulas can be found in the Appendix. We find that the updates are quite interpretable. For example, in Eq. (9), φ_{d,n,i} is the a posteriori probability of latent topic i given the word w_{d,n}. It is determined both by the corresponding entry of the β matrix, which can be seen as a likelihood term, and by the probability that document d selects topic i, i.e., the prior term; here the prior is itself a weighted average of the different θ_l^* to which d is assigned. In Eq. (12), η_l is the a posteriori weight of π_l and turns out to be a tradeoff between the empirical responses at θ_l^* and the prior specified by α_0. Finally, since the parameters are coupled, the variational inference iterates Eq. (9) to Eq. (12) until convergence.
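As an illustration of this coordinate-ascent scheme, the sketch below implements Eqs. (9)-(12) with NumPy and SciPy's digamma. The initialization, the fixed iteration count and the small constant added to β before taking logarithms are our own choices and are not prescribed by the paper:

```python
import numpy as np
from scipy.special import digamma

def e_step(docs, beta, lam, alpha0, N, n_iter=50, rng=None):
    """Coordinate-ascent updates of Eqs. (9)-(12).

    docs : list of D integer arrays, docs[d][n] = word index w_{d,n}
    beta : (k, V) topic-word probabilities;  lam : (k,) base-measure parameter
    Returns phi (list of (Nd, k)), varphi (D, N), gamma (N, k), eta (N,).
    """
    rng = np.random.default_rng() if rng is None else rng
    k, V = beta.shape
    D = len(docs)
    varphi = rng.dirichlet(np.ones(N), size=D)            # q(c_d), one row per document
    gamma = np.tile(lam, (N, 1)).astype(float)            # q(theta*_l)
    eta = np.full(N, alpha0 / N + D / N)                  # q(pi)
    phi = [np.full((len(w), k), 1.0 / k) for w in docs]   # q(z_{d,n})

    for _ in range(n_iter):
        Elog_theta = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))  # (N, k)
        Elog_pi = digamma(eta) - digamma(eta.sum())                              # (N,)
        for d, w in enumerate(docs):
            # Eq. (9): topic responsibilities of each word
            log_phi = np.log(beta[:, w].T + 1e-12) + varphi[d] @ Elog_theta
            phi[d] = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
            phi[d] /= phi[d].sum(axis=1, keepdims=True)
            # Eq. (10): cluster responsibilities of the document
            log_varphi = Elog_theta @ phi[d].sum(axis=0) + Elog_pi
            varphi[d] = np.exp(log_varphi - log_varphi.max())
            varphi[d] /= varphi[d].sum()
        # Eq. (11) and Eq. (12)
        gamma = lam + sum(np.outer(varphi[d], phi[d].sum(axis=0)) for d in range(D))
        eta = alpha0 / N + varphi.sum(axis=0)
    return phi, varphi, gamma, eta
```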

3.2 Parameter Estimation

Following the empirical Bayesian framework, we can estimate the hyperparameters α_0, λ and β by iteratively maximizing the lower bound L_Q both with respect to the variational parameters (as described by Eq. (9)-Eq. (12)) and with respect to the model parameters, holding the remaining parameters fixed. This iterative procedure is also referred to as variational EM [10]. It is easy to derive the update for β:

\[ \beta_{i,j} \propto \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,i}\, \delta_j(w_{d,n}), \tag{13} \]

where δ_j(w_{d,n}) = 1 if w_{d,n} = j, and 0 otherwise. For the remaining parameters, we first write down the parts of L_Q in Eq. (8) involving α_0 and λ:

\[ L_{[\alpha_0]} = \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \Big( \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big), \]

\[ L_{[\lambda]} = \sum_{l=1}^{N} \Big\{ \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \lambda_i\big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \Big\}. \]

Estimates for α_0 and λ are found by maximizing these objective functions using standard methods like Newton-Raphson, as suggested in [3].

4 Empirical Study

4.1 Toy Data

We first apply the model to a toy problem with k = 5 latent topics and a dictionary containing 200 words. The assumed probabilities of generating words from topics, i.e. the parameters β, are illustrated in Fig. 3(d), in which each colored line corresponds to a topic and assigns non-zero probabilities to a subset of the words. For each run we generate data with the following steps: (1) a cluster number M is chosen between 5 and 12; (2) M document clusters are generated, each of which is defined by a combination of topics; (3) each document d, d = 1, ..., 100, is generated by first randomly selecting a cluster and then generating 40 words according to the corresponding topic combination. For DP_N we select N = 100, and we aim to examine the performance in discovering the latent topics and the clustering structure.

In Fig. 3(a)-(c) we illustrate the process of clustering documents over the EM iterations for a run containing 6 document clusters. In Fig. 3(a) we show the initial random assignment ϕ_{d,l} of each document d to a cluster l. After one EM step, documents begin to accumulate in a reduced number of clusters (Fig. 3(b)), and they converge to exactly 6 clusters after 5 steps (Fig. 3(c)). The learned word distribution of the topics, β, is shown in Fig. 3(e) and is very similar to the true distribution.

By varying M, the true number of document clusters, we examine whether our model can find the correct M. To determine the number of clusters, we run the variational inference and obtain for each document a weight vector ϕ_{d,l} over clusters. Each document then takes the cluster with the largest weight as its assignment, and we calculate the cluster number as the number of non-empty clusters. For each setting of M from 5 to 12, we randomize the data for 20 trials and obtain the curve in Fig. 3(f), which shows the average performance and the variance. In 37% of the runs we get perfect results, and in another 43% of the runs the learned values deviate from the truth by only one. However, we also find that the model tends to produce slightly fewer than M clusters when M is large. The reason might be that 100 documents are not sufficient for learning a large number M of clusters.
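The cluster-counting rule just described (assign each document to its highest-weight cluster and count the clusters that remain non-empty) is a one-liner on top of the variational parameters. The helper below assumes the ϕ_{d,l} matrix (called `varphi`) returned by the `e_step` sketch in Section 3.1:

```python
import numpy as np

def count_clusters(varphi):
    """Hard-assign each document to argmax_l varphi[d, l] and count the
    number of distinct clusters actually used, as done for Fig. 3(f)."""
    assignments = np.argmax(varphi, axis=1)        # (D,) hard cluster labels
    return assignments, np.unique(assignments).size

# usage with the e_step() sketch:
# _, varphi, _, _ = e_step(docs, beta, lam, alpha0, N)
# labels, n_clusters = count_clusters(varphi)
```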


Figure 3: Experimental results for the toy problem. (a)-(c) show the document-cluster assignments ϕ_{d,l} over the variational inference for a run with 6 document clusters: (a) initial random assignments; (b) assignments after one iteration; (c) assignments after five iterations (final). The multinomial parameter matrix β with true and estimated values is given in (d) and (e), respectively; each line gives the probabilities of generating the 200 words, with peaks indicating high probabilities. (f) shows the learned number of clusters with respect to the true number, with mean and error bar.

4.2 Document Modelling

We compare the proposed model with PLSI and LDA on two text data sets. The first is a subset of the Reuters-21578 data set which contains 3000 documents and 20334 words. The second is taken from the 20-newsgroups data set and has 2000 documents with 8014 words. The comparison metric is perplexity, conventionally used in language modelling. For a test document set, it is formally defined as

\[ \mathrm{Perplexity}(D_{\mathrm{test}}) = \exp\Big( - \ln p(D_{\mathrm{test}}) \Big/ \sum_d |w_d| \Big). \]

We follow the formula in [3] to calculate the perplexity for PLSI. In our algorithm, N is set to the number of training documents. Fig. 4(a) and (b) show the comparison results for different numbers k of latent topics. Our model outperforms LDA and PLSI in all runs, which indicates that the flexibility introduced by the DP enhancement does not produce overfitting and results in better generalization performance.
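The perplexity definition maps directly to code once per-document log-likelihoods are available; how ln p(w_d) is obtained (e.g. from a variational bound, following [3]) is left to the caller in this sketch, which is our own illustration:

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """Perplexity of a test set: exp(-sum_d ln p(w_d) / sum_d |w_d|),
    where log_likelihoods[d] approximates ln p(w_d) under the fitted model
    and doc_lengths[d] = |w_d|."""
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths)))
```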

4.3 Clustering

In our last experiment we demonstrate that our approach is suitable for finding relevant document clusters. We select four categories, autos, motorcycles, baseball and hockey, from the 20-newsgroups data set, with 446 documents in each topic. Fig. 4(c) illustrates one clustering result, in which we set the topic number k = 5 and found 6 document clusters. In the figure the documents are indexed according to their true category labels, so we can clearly see that the result is quite meaningful. Documents from one category show similar membership to the learned clusters, and different categories can be distinguished very easily. The first two categories are not clearly separated because they both deal with vehicles and share many terms, while the remaining categories, baseball and hockey, are detected perfectly.

Figure 4: (a) and (b): Perplexity results on Reuters-21578 and 20-newsgroups for DELSA, PLSI and LDA; (c): Clustering result on 20-newsgroups dataset.

5 Conclusions and Future Work

This paper proposes a Dirichlet enhanced latent semantic analysis model for analyzing co-occurrence data like text, which retains the strength of previous approaches in finding latent topics and further introduces additional modelling flexibility to uncover the clustering structure of the data. For inference and learning, we adopt a variational mean-field approximation based on a finite alternative to DP. Experiments are performed on a toy data set and two text data sets. The experiments show that our model can discover both the latent topics and meaningful clustering structures.

In addition to our approach, alternative methods for approximate inference in DP have been proposed using expectation propagation (EP) [11] or variational methods [16, 2]. Our approach is most similar to the work of Blei and Jordan [2], who applied a mean-field approximation for inference in DP based on a truncated DP (TDP). Their approach was formulated in the setting of general exponential-family mixture models [2]. Conceptually, DP_N appears to be simpler than TDP in the sense that the a posteriori distribution of G is a symmetric Dirichlet, while TDP ends up with a generalized Dirichlet (see [8]). On the other hand, TDP seems to be a tighter approximation to DP. Future work will include a comparison of the various DP approximations.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments. Shipeng Yu gratefully acknowledges the support through a Siemens scholarship.

References

[1] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems (NIPS) 14, 2002.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[3] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), June 1995.
[5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209–230, 1973.
[6] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Unpublished paper, 2000.
[7] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM SIGIR Conference, pages 50–57, Berkeley, California, August 1999.
[8] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, 2001.
[9] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Canadian Journal of Statistics, 30:269–283, 2002.
[10] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[11] T. Minka and Z. Ghahramani. Expectation propagation for infinite mixtures. In NIPS'03 Workshop on Nonparametric Bayesian Methods and Infinite Models, 2003.
[12] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
[13] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, 2002.
[14] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, University of California, Berkeley, 2004.
[16] V. Tresp and K. Yu. An introduction to nonparametric hierarchical Bayesian modelling with a focus on multi-agent learning. In Proceedings of the Hamilton Summer School on Switching and Learning in Feedback Systems, Lecture Notes in Computer Science, 2004.
[17] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for information filtering. In Proceedings of the 27th Annual International ACM SIGIR Conference, 2004.

Appendix

To simplify the notation, we write Ξ for all the latent variables {π, θ^*, c, z}. With the variational form in Eq. (7), we apply Jensen's inequality to the likelihood in Eq. (6) and obtain

\[ \ln p(D \mid \alpha_0, \lambda, \beta) = \ln \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} p(D, \Xi \mid \alpha_0, \lambda, \beta)\, d\theta^*\, d\pi \]
\[ = \ln \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi)\, \frac{p(D, \Xi \mid \alpha_0, \lambda, \beta)}{Q(\Xi)}\, d\theta^*\, d\pi \]
\[ \geq \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \ln p(D, \Xi \mid \alpha_0, \lambda, \beta)\, d\theta^*\, d\pi - \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \ln Q(\Xi)\, d\theta^*\, d\pi \]
\[ = E_Q[\ln p(D, \Xi \mid \alpha_0, \lambda, \beta)] - E_Q[\ln Q(\Xi)], \]

which results in Eq. (8). To write out each term in Eq. (8) explicitly, we have, for the first term,

\[ \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta)] = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \phi_{d,n,i} \ln \beta_{i,\nu}, \]

where ν is the index of word w_{d,n}. The other terms can be derived as follows:

\[ \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(z_{d,n} \mid \theta^*, c_d)] = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \sum_{l=1}^{N} \varphi_{d,l}\, \phi_{d,n,i} \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big), \]

\[ E_Q[\ln p(\pi \mid \alpha_0)] = \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \Big( \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big), \]

\[ \sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] = \sum_{d=1}^{D} \sum_{l=1}^{N} \varphi_{d,l} \Big( \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big), \]

\[ \sum_{l=1}^{N} E_Q[\ln p(\theta_l^* \mid G_0)] = \sum_{l=1}^{N} \Big\{ \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \lambda_i\big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \Big\}, \]

\[ E_Q[\ln Q(\pi, \theta^*, c, z)] = \ln \Gamma\big(\textstyle\sum_{l=1}^{N} \eta_l\big) - \sum_{l=1}^{N} \ln \Gamma(\eta_l) + \sum_{l=1}^{N} (\eta_l - 1) \Big( \Psi(\eta_l) - \Psi\big(\textstyle\sum_{j=1}^{N} \eta_j\big) \Big) + \sum_{l=1}^{N} \Big\{ \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \gamma_{l,i}\big) - \sum_{i=1}^{k} \ln \Gamma(\gamma_{l,i}) + \sum_{i=1}^{k} (\gamma_{l,i} - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_{l,j}\big) \Big) \Big\} + \sum_{d=1}^{D} \sum_{l=1}^{N} \varphi_{d,l} \ln \varphi_{d,l} + \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \phi_{d,n,i} \ln \phi_{d,n,i}. \]

Differentiating the lower bound with respect to the individual variational parameters gives the variational E-step in Eq. (9) to Eq. (12); the M-step is obtained by considering the lower bound with respect to β, λ and α_0.
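All digamma expressions above follow from the standard Dirichlet identity E[ln π_l] = Ψ(η_l) − Ψ(Σ_j η_j); the short Monte Carlo check below (our own illustration, not part of the paper) verifies it numerically:

```python
import numpy as np
from scipy.special import digamma

# For pi ~ Dir(eta), E[ln pi_l] = Psi(eta_l) - Psi(sum_j eta_j).
rng = np.random.default_rng(0)
eta = np.array([1.5, 2.0, 3.5, 1.0])
samples = rng.dirichlet(eta, size=200_000)
mc_estimate = np.log(samples).mean(axis=0)
analytic = digamma(eta) - digamma(eta.sum())
print(np.round(mc_estimate, 3), np.round(analytic, 3))   # the two should agree closely
```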