Gaussian LDA for Topic Models with Word Embeddings

Gaussian LDA for Topic Models with Word Embeddings Rajarshi Das*, Manzil Zaheer*, Chris Dyer School of Computer Science Carnegie Mellon University Pittsburgh, PA, 15213, USA frajarshd; manzilz; cdyerg @cs:cmu:edu Abstract results. However, our intuitions tell us that while documents may indeed be conceived of as a mix- Continuous space word embeddings ture of topics, we should further expect topics to learned from large, unstructured corpora be semantically coherent. Indeed, standard human have been shown to be effective at cap- evaluations of topic modeling performance are de- turing semantic regularities in language. signed to elicit assessment of semantic coherence In this paper we replace LDA’s param- (Chang et al., 2009; Newman et al., 2009). How- eterization of “topics” as categorical ever, this prior preference for semantic coherence distributions over opaque word types with is not encoded in the model, and any such obser- multivariate Gaussian distributions on vation of semantic coherence found in the inferred the embedding space. This encourages topic distributions is, in some sense, accidental. In the model to group words that are a this paper, we develop a variant of LDA that oper- priori known to be semantically related ates on continuous space embeddings of words— into topics. To perform inference, we rather than word types—to impose a prior expec- introduce a fast collapsed Gibbs sampling tation for semantic coherence. Our approach re- algorithm based on Cholesky decom- places the opaque word types usually modeled in positions of covariance matrices of the LDA with continuous space embeddings of these posterior predictive distributions. We fur- words, which are generated as draws from a mul- ther derive a scalable algorithm that draws tivariate Gaussian. samples from stale posterior predictive distributions and corrects them with a How does this capture our preference for se- Metropolis–Hastings step. Using vectors mantic coherence? Word embeddings have been learned from a domain-general corpus shown to capture lexico-semantic regularities in (English Wikipedia), we report results on language: words with similar syntactic and seman- two document collections (20-newsgroups tic properties are found to be close to each other in and NIPS). Qualitatively, Gaussian LDA the embedding space (Agirre et al., 2009; Mikolov infers different (but still very sensible) et al., 2013). Since Gaussian distributions capture topics relative to standard LDA. Quantita- a notion of centrality in space, and semantically tively, our technique outperforms existing related words are localized in space, our Gaussian models at dealing with OOV words in LDA model encodes a prior preference for seman- held-out documents. tically coherent topics. Our model further has several advantages. Traditional LDA assumes a fixed 1 Introduction vocabulary of word types. This modeling assumption drawback as it cannot handle out of vocabu- Latent Dirichlet Allocation (LDA) is a Bayesian lary (OOV) words in “held out” documents. Zhai technique that is widely used for inferring the and Boyd-Graber (2013) proposed an approach topic structure in corpora of documents. It con- to address this problem by drawing topics from ceives of a document as a mixture of a small num- a Dirichlet Process with a base distribution over ber of topics, and topics as a (relatively sparse) dis- all possible character strings (i.e., words). While tribution over word types (Blei et al., 2003). These this model can in principle handle unseen words, priors are remarkably effective at producing useful the only bias toward being included in a particular *Both student authors had equal contribution. topic comes from the topic assignments in the rest of the document. Our model can exploit the conti- 2.2 Latent Dirichlet Allocation (LDA) guity of semantically similar words in the embed- LDA (Blei et al., 2003) is a probabilistic topic ding space and can assign high topic probability to model of corpora of documents which seeks to a word which is similar to an existing topical word represent the underlying thematic structure of the even if it has never been seen before. document collection. They have emerged as a The main contributions of our paper are as fol- powerful new technique of finding useful structure lows: We propose a new technique for topic mod- in an unstructured collection as it learns distribu- eling by treating the document as a collection of tions over words. The high probability words in word embeddings and topics itself as multivari- each distribution gives us a way of understanding ate Gaussian distributions in the embedding space the contents of the corpus at a very high level. In (x3). We explore several strategies for collapsed LDA, each document of the corpus is assumed to Gibbs sampling and derive scalable algorithms, have a distribution over K topics, where the dis- achieving asymptotic speed-up over the na¨ıve im- crete topic distributions are drawn from a symmet- plementation (x4). We qualitatively show that ric dirichlet distribution. The generative process is our topics make intuitive sense and quantitatively as follows. demonstrate that our model captures a better rep- 1. for k = 1 to K resentation of a document in the topic space by (a) Choose topic βk ∼ Dir(η) outperforming other models in a classification task 2. for each document d in corpus D (x5). (a) Choose a topic distribution θd ∼ Dir(α) (b) for each word index n from 1 to Nd 2 Background i. Choose a topic zn ∼ Categorical(θ ) Before going to the details of our model we pro- d ii. Choose word w ∼ vide some background on two topics relevant to n Categorical(β ) our work: vector space word embeddings and zn As it follows from the definition above, a topic LDA. is a discrete distribution over a fixed vocabulary 2.1 Vector Space Semantics of word types. This modeling assumption pre- cludes new words to be added to topics. However According to the distributional hypothesis (Har- modeling topics as a continuous distribution over ris, 1954), words occurring in similar contexts word embeddings gives us a way to address this tend to have similar meaning. This has given problem. In the next section we describe Gaus- rise to data-driven learning of word vectors that sian LDA, a straightforward extension of LDA that capture lexical and semantic properties, which is replaces categorical distributions over word types now a technique of central importance in natu- with multivariate Gaussian distributions over the ral language processing. These word vectors can word embedding space. be used for identifying semantically related word pairs (Turney, 2006; Agirre et al., 2009) or as fea- 3 Gaussian LDA tures in downstream text processing applications (Turian et al., 2010; Guo et al., 2014). Word As with multinomial LDA, we are interested in vectors can either be constructed using low rank modeling a collection of documents. However, approximations of cooccurrence statistics (Deer- we assume that rather than consisting of sequences wester et al., 1990) or using internal represen- of word types, documents consist of sequences of M tations from neural network models of word se- word embeddings. We write v(w) 2 R as the quences (Collobert and Weston, 2008). We use a embedding of word of type w or vd;i when we are recently popular and fast tool called word2vec1, indexing a vector in a document d at position i. to generate skip-gram word embeddings from un- Since our observations are no longer dis- labeled corpus. In this model, a word is used as crete values but continuous vectors in an M- an input to a log-linear classifier with continuous dimensional space, we characterize each topic k as projection layer and words within a certain win- a multivariate Gaussian distribution with mean µk dow before and after the words are predicted. and covariance Σk. The choice of a Gaussian pa- rameterization is justified by both analytic conve- 1https://code.google.com/p/word2vec/ nience and observations that Euclidean distances κk + 1 p(zd;i = k j z−(d;i); Vd; ζ; α) / (nk;d + αk) × tνk−M+1 vd;i µk; Σk (1) κk Figure 1: Sampling equation for the collapsed Gibbs sampler; refer to text for a description of the notation. between embeddings correlate with semantic sim- rior distribution over the topic parameters, pro- ilarity (Collobert and Weston, 2008; Turney and portions, and the topic assignments of individual Pantel, 2010; Hermann and Blunsom, 2014). We words. Since there is no analytic form of the poste- place conjugate priors on these values: a Gaus- rior, approximations are required. Because of our sian centered at zero for the mean and an inverse choice of conjugate priors for topic parameters and Wishart distribution for the covariance. As be- proportions, these variables can be analytically in- fore, each document is seen as a mixture of top- tegrated out, and we can derive a collapsed Gibbs ics whose proportions are drawn from a symmetric sampler that resamples topic assignments to indi- Dirichlet prior. The generative process can thus be vidual word vectors, similar to the collapsed sam- summarized as follows: pling scheme proposed by Griffiths and Steyvers 1. for k = 1 to K (2004). (a) Draw topic covariance Σk ∼ W−1(Ψ; ν) The conditional distribution we need for sam- 1 pling is shown in Figure 1. Here, z−(d;i) repre- (b) Draw topic mean µk ∼ N (µ; κ Σk) 2. for each document d in corpus D sents the topic assignments of all word embeddings, excluding the one at ith position of docu- (a) Draw topic distribution θd ∼ Dir(α) ment d; V is the sequence of vectors for docu- (b) for each word index n from 1 to Nd d 0 0 ment d; t 0 (x j µ ; Σ ) is the multivariate t - distri- i.

Load more