
Topic Modeling in Embedding Spaces

Adji B. Dieng, Columbia University, New York, NY, USA ([email protected])
Francisco J. R. Ruiz∗, DeepMind, London, UK ([email protected])
David M. Blei, Columbia University, New York, NY, USA ([email protected])

∗ Work done while at Columbia University and the University of Cambridge.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 439–453, 2020. https://doi.org/10.1162/tacl_a_00325. Action Editor: Doug Downey. Submission batch: 2/2020; Revision batch: 5/2020; Published 7/2020. © 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word's embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.

1 Introduction

Topic models are statistical tools for discovering the hidden semantic structure in a collection of documents (Blei et al., 2003; Blei, 2012). Topic models and their extensions have been applied to many fields, such as marketing, sociology, political science, and the digital humanities. Boyd-Graber et al. (2017) provide a review.

Most topic models build on latent Dirichlet allocation (LDA) (Blei et al., 2003). LDA is a hierarchical probabilistic model that represents each topic as a distribution over terms and represents each document as a mixture of the topics. When fit to a collection of documents, the topics summarize their contents, and the topic proportions provide a low-dimensional representation of each document. LDA can be fit to large datasets of text by using variational inference and stochastic optimization (Hoffman et al., 2010, 2013).
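To make the preceding description concrete, the standard LDA generative process for a document d can be written as follows. This is the textbook formulation rather than anything derived from this excerpt, and the notation (Dirichlet hyperparameter \alpha_\theta, topics \beta_{1:K}, proportions \theta_d, per-word assignments z_{d,n}) is introduced here only for illustration:

\[
\theta_d \sim \mathrm{Dirichlet}(\alpha_\theta), \qquad
z_{d,n} \mid \theta_d \sim \mathrm{Cat}(\theta_d), \qquad
w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Cat}\big(\beta_{z_{d,n}}\big),
\]

where each topic \beta_k is a distribution over the vocabulary and \theta_d are the per-document topic proportions that give the low-dimensional representation of document d.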
LDA is a powerful model and it is widely used. However, it suffers from a pervasive technical problem: it fails in the face of large vocabularies. Practitioners must severely prune their vocabularies in order to fit good topic models, namely those that are both predictive and interpretable. This is typically done by removing the most and least frequent words. On large collections, this pruning may remove important terms and limit the scope of the models. The problem of topic modeling with large vocabularies has yet to be addressed in the research literature.

In parallel with topic modeling came the idea of word embeddings. Research in word embeddings begins with the neural language model of Bengio et al. (2003), published in the same year and journal as Blei et al. (2003). Word embeddings eschew the "one-hot" representation of words (a vocabulary-length vector of zeros with a single one) to learn a distributed representation, one where words with similar meanings are close in a lower-dimensional vector space (Rumelhart and Abrahamson, 1973; Bengio et al., 2006). As for topic models, researchers scaled up embedding methods to large datasets (Mikolov et al., 2013a,b; Pennington et al., 2014; Levy and Goldberg, 2014; Mnih and Kavukcuoglu, 2013). Word embeddings have been extended and developed in many ways. They have become crucial in many applications of natural language processing (Maas et al., 2011; Li and Yang, 2018), and they have also been extended to datasets beyond text (Rudolph et al., 2016).

In this paper, we develop the embedded topic model (ETM), a document model that marries LDA and word embeddings. The ETM enjoys the good properties of topic models and the good properties of word embeddings. As a topic model, it discovers an interpretable latent semantic structure of the documents; as a word embedding model, it provides a low-dimensional representation of the meaning of words. The ETM robustly accommodates large vocabularies and the long tail of language data.

Figure 1: Ratio of the held-out perplexity on a document completion task and the topic coherence as a function of the vocabulary size for the ETM and LDA on the 20NewsGroup corpus. The perplexity is normalized by the size of the vocabulary. While the performance of LDA deteriorates for large vocabularies, the ETM maintains good performance.

Figure 1 illustrates the advantages. This figure shows the ratio between the perplexity on held-out documents (a measure of predictive performance) and the topic coherence (a measure of the quality of the topics), as a function of the size of the vocabulary. (The perplexity has been normalized by the vocabulary size.) This is for a corpus of 11.2K articles from the 20NewsGroup and for 100 topics. The red line is LDA; its performance deteriorates as the vocabulary size increases: the predictive performance and the quality of the topics get worse. The blue line is the ETM; it maintains good performance, even as the vocabulary size becomes large.

Like LDA, the ETM is a generative probabilistic model: Each document is a mixture of topics and each observed word is assigned to a particular topic. In contrast to LDA, the per-topic conditional probability of a term has a log-linear form that involves a low-dimensional representation of the vocabulary. Each term is represented by an embedding and each topic is a point in that embedding space. The topic's distribution over terms is proportional to the exponentiated inner product of the topic's embedding and each term's embedding.

Figure 2: A topic about Christianity found by the ETM on The New York Times. The topic is a point in the word embedding space.

Figure 3: Topics about sports found by the ETM on The New York Times. Each topic is a point in the word embedding space.

Figures 2 and 3 show topics from a 300-topic ETM of The New York Times. The figures show each topic's embedding and its closest words; these topics are about Christianity and sports.

Representing topics as points in the embedding space allows the ETM to be robust to the presence of stop words, unlike most topic models. When stop words are included in the vocabulary, the ETM assigns topics to the corresponding area of the embedding space (we demonstrate this in Section 6).

As for most topic models, the posterior of the topic proportions is intractable to compute. We derive an efficient algorithm for approximating the posterior with variational inference (Jordan et al., 1999; Hoffman et al., 2013; Blei et al., 2017) and additionally use amortized inference to efficiently approximate the topic proportions (Kingma and Welling, 2014; Rezende et al., 2014).
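The two modeling choices just described, a log-linear likelihood built from inner products of topic and word embeddings, and amortized variational inference for the per-document topic proportions, can be sketched in a few lines of PyTorch. This is an illustrative sketch under assumed names and layer sizes (rho and alpha for the word and topic embeddings, a one-hidden-layer encoder over bag-of-words counts), not the authors' implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ETMSketch(nn.Module):
        def __init__(self, vocab_size, num_topics, embed_dim, hidden_size=300):
            super().__init__()
            # rho: word embeddings (V x L); alpha: topic embeddings (K x L).
            # rho can be pre-fitted (e.g., skip-gram) and frozen, or learned jointly.
            self.rho = nn.Parameter(0.01 * torch.randn(vocab_size, embed_dim))
            self.alpha = nn.Parameter(0.01 * torch.randn(num_topics, embed_dim))
            # Amortized inference network: word counts -> Gaussian over the
            # untransformed topic proportions.
            self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden_size), nn.ReLU())
            self.mu = nn.Linear(hidden_size, num_topics)
            self.logvar = nn.Linear(hidden_size, num_topics)

        def topics(self):
            # Each topic's distribution over terms: softmax of the inner products
            # between its embedding and every word embedding.
            return F.softmax(self.alpha @ self.rho.t(), dim=-1)          # K x V

        def forward(self, bows):                                          # bows: B x V float counts
            h = self.encoder(bows)
            mu, logvar = self.mu(h), self.logvar(h)
            delta = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization
            theta = F.softmax(delta, dim=-1)                              # B x K topic proportions
            word_probs = theta @ self.topics()                            # B x V mixture over terms
            nll = -(bows * (word_probs + 1e-10).log()).sum(-1)            # reconstruction term
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)   # KL to standard normal
            return (nll + kl).mean()                                      # negative ELBO to minimize

Training then amounts to minimizing the returned objective with stochastic gradients over minibatches of documents; keeping rho fixed at pre-fitted skip-gram vectors corresponds to the variant of the ETM mentioned in the next paragraph.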
The resulting algorithm fits the ETM to large corpora with large vocabularies. This algorithm can either use previously fitted word embeddings, or fit them jointly with the rest of the parameters. (In particular, Figures 1 to 3 were made using the version of the ETM that uses pre-fitted skip-gram word embeddings.)

We compared the performance of the ETM to LDA, the neural variational document model (NVDM) (Miao et al., 2016), and PRODLDA (Srivastava and Sutton, 2017). The NVDM is a form of multinomial matrix factorization and PRODLDA is a modern version of LDA that uses a product of experts to model the distribution over words. We also compare to a document model that combines PRODLDA with pre-fitted word embeddings. The ETM yields better predictive performance, as measured by held-out log-likelihood on a document completion task (Wallach et al., 2009b). It also discovers more meaningful topics, as measured by topic coherence (Mimno et al., 2011) and topic diversity. The latter is a metric we introduce in this paper that, together with topic coherence, gives a better indication of the quality of the topics. The ETM is especially robust to large vocabularies.

2 Related Work

This work develops a new topic model that extends LDA. LDA has been extended in many ways, and topic modeling has become a subfield of its own. […] document variables in the ETM are part of a larger probabilistic topic model.

One of the goals in developing the ETM is to incorporate word similarity into the topic model, and there is previous research that shares this goal. These methods either modify the topic priors (Petterson et al., 2010; Zhao et al., 2017b; Shi et al., 2017; Zhao et al., 2017a) or the topic assignment priors (Xie et al., 2015). For example, Petterson et al. (2010) use a word similarity graph (as given by a thesaurus) to bias LDA towards assigning similar words to similar topics. As another example, Xie et al. (2015) model the per-word topic assignments of LDA using a Markov random field to account for both the topic proportions and the topic assignments of similar words. These methods use word similarity as a type of "side information" about language; in contrast, the ETM directly models the similarity (via embeddings) in its generative process of words.

However, a more closely related set of works directly combines topic modeling and word embeddings. One common strategy is to convert the discrete text into continuous observations of embeddings, and then adapt LDA to generate real-valued data (Das et al., 2015; Xun et al., 2016; Batmanghelich et al., 2016; Xun et al., 2017). With this strategy, topics are Gaussian distributions with latent means and covariances, and the likelihood over the embeddings is modeled with a Gaussian (Das et al., 2015) or a Von-Mises […]