Generative Topic Embedding: a Continuous Representation of Documents
Shaohua Li (1,2), Tat-Seng Chua (1), Jun Zhu (3), Chunyan Miao (2)
1. School of Computing, National University of Singapore
2. Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY)
3. Department of Computer Science and Technology, Tsinghua University

Abstract

Word embedding maps words into a low-dimensional continuous embedding space by exploiting the local word collocation patterns in a small context window. On the other hand, topic modeling maps documents onto a low-dimensional topic space, by utilizing the global word collocation patterns in the same document. These two types of patterns are complementary. In this paper, we propose a generative topic embedding model to combine the two types of patterns. In our model, topics are represented by embedding vectors, and are shared across documents. The probability of each word is influenced by both its local context and its topic. A variational inference method yields the topic embeddings as well as the topic mixing proportions for each document. Jointly they represent the document in a low-dimensional continuous space. In two document classification tasks, our method performs better than eight existing methods, with fewer features. In addition, we illustrate with an example that our method can generate coherent topics even based on only one document.

1 Introduction

Representing documents as fixed-length feature vectors is important for many document processing algorithms. Traditionally, documents are represented as bag-of-words (BOW) vectors. However, this simple representation suffers from being high-dimensional and highly sparse, and loses semantic relatedness across the vector dimensions.

Word embedding methods have been demonstrated to be an effective way to represent words as continuous vectors in a low-dimensional embedding space (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015). The learned embedding for a word encodes its semantic/syntactic relatedness with other words, by utilizing local word collocation patterns. In each method, one core component is the embedding link function, which predicts a word's distribution given its context words, parameterized by their embeddings.

When it comes to documents, we wish to find a method to encode their overall semantics. Given the embeddings of each word in a document, we can imagine the document as a "bag-of-vectors". Related words in the document point in similar directions, forming semantic clusters. The centroid of a semantic cluster corresponds to the most representative embedding of this cluster of words, referred to as the semantic centroid. We could use these semantic centroids and the number of words around them to represent a document.
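As a rough illustration of this bag-of-vectors intuition (and not of the model proposed in this paper), the semantic centroids of a single document could be approximated by clustering its word vectors; the centroids and the fraction of words around each of them then form a crude document representation. The embedding lookup `emb`, the tokenization, and the number of clusters `K` are assumptions made only for this sketch.

```python
# A minimal sketch of the bag-of-vectors intuition: cluster the word vectors of
# one document and keep the cluster centroids ("semantic centroids") together
# with the fraction of words around each centroid.  Purely illustrative;
# `emb` (word -> vector) and K are hypothetical inputs.
import numpy as np
from sklearn.cluster import KMeans

def bag_of_vectors(doc_tokens, emb, K=5):
    vecs = np.array([emb[w] for w in doc_tokens if w in emb])
    K = min(K, len(vecs))                            # guard for very short documents
    km = KMeans(n_clusters=K, n_init=10).fit(vecs)
    centroids = km.cluster_centers_                  # approximate semantic centroids
    sizes = np.bincount(km.labels_, minlength=K)     # words around each centroid
    return centroids, sizes / sizes.sum()
```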
In addition, for a set of documents in a particular domain, some semantic clusters may appear in many documents. By learning collocation patterns across the documents, the derived semantic centroids could be more topical and less noisy.

Topic models, represented by Latent Dirichlet Allocation (LDA) (Blei et al., 2003), are able to group words into topics according to their collocation patterns across documents. When the corpus is large enough, such patterns reflect their semantic relatedness, hence topic models can discover coherent topics. The probability of a word is governed by its latent topic, which is modeled as a categorical distribution in LDA. Typically, only a small number of topics are present in each document, and only a small number of words have high probability in each topic. This intuition motivated Blei et al. (2003) to regularize the topic distributions with Dirichlet priors.

Semantic centroids have the same nature as topics in LDA, except that the former exist in the embedding space. This similarity drives us to seek the common semantic centroids with a model similar to LDA. We extend a generative word embedding model, PSDVec (Li et al., 2015), by incorporating topics into it. The new model is named TopicVec. In TopicVec, an embedding link function models the word distribution in a topic, in place of the categorical distribution in LDA. The advantage of the link function is that the semantic relatedness is already encoded as the cosine distance in the embedding space. Similar to LDA, we regularize the topic distributions with Dirichlet priors. A variational inference algorithm is derived. The learning process derives topic embeddings in the same embedding space of words. These topic embeddings aim to approximate the underlying semantic centroids.

To evaluate how well TopicVec represents documents, we performed two document classification tasks against eight existing topic modeling or document representation methods. Two setups of TopicVec outperformed all other methods on the two tasks, respectively, with fewer features. In addition, we demonstrate that TopicVec can derive coherent topics based on only one document, which is not possible for topic models.

The source code of our implementation is available at https://github.com/askerlee/topicvec.

2 Related Work

Li et al. (2015) proposed a generative word embedding method, PSDVec, which is the precursor of TopicVec. PSDVec assumes that the conditional distribution of a word given its context words can be factorized approximately into independent log-bilinear terms. In addition, the word embeddings and regression residuals are regularized by Gaussian priors, reducing their chance of overfitting. The model inference is approached by an efficient eigendecomposition and blockwise-regression method (Li et al., 2016b). TopicVec differs from PSDVec in that, in the conditional distribution of a word, the word is influenced not only by its context words but also by a topic, which is an embedding vector indexed by a latent variable drawn from a Dirichlet-Multinomial distribution.

Hinton and Salakhutdinov (2009) proposed to model topics as a certain number of binary hidden variables, which interact with all words in the document through weighted connections. Larochelle and Lauly (2012) assigned each word a unique topic vector, which is a summarization of the context of the current word.

Huang et al. (2012) proposed to incorporate global (document-level) semantic information to help the learning of word embeddings. The global embedding is simply a weighted average of the embeddings of words in the document.

Le and Mikolov (2014) proposed Paragraph Vector. It assumes each piece of text has a latent paragraph vector, which influences the distributions of all words in this text, in the same way as a latent word. It can be viewed as a special case of TopicVec, with the topic number set to 1. Typically, however, a document consists of multiple semantic centroids, and the limitation of only one topic may lead to underfitting.

Nguyen et al. (2015) proposed Latent Feature Topic Modeling (LFTM), which extends LDA to incorporate word embeddings as latent features. The topic is modeled as a mixture of the conventional categorical distribution and an embedding link function. The coupling between these two components makes the inference difficult. They designed a Gibbs sampler for model inference. Their implementation (https://github.com/datquocnguyen/LFTM/) is slow and infeasible when applied to a large corpus.

Liu et al. (2015) proposed Topical Word Embedding (TWE), which combines word embedding with LDA in a simple and effective way. They train word embeddings and a topic model separately on the same corpus, and then average the embeddings of the words in the same topic to get the embedding of this topic. The topic embedding is concatenated with the word embedding to form the topical word embedding of a word. In the end, the topical word embeddings of all words in a document are averaged to be the embedding of the document. This method performs well on our two classification tasks. Weaknesses of TWE include: 1) the way to combine the results of word embedding and LDA lacks statistical foundations; 2) the LDA module requires a large corpus to derive semantically coherent topics.
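To make the TWE composition just described concrete, the following is a rough sketch of the procedure as summarized here: topic embeddings are averages of the embeddings of words assigned to each topic, each word is represented by the concatenation of its word and topic embeddings, and a document is the average of these topical word embeddings. The function names, the `emb` lookup, and the topic-assignment inputs are illustrative assumptions, not the authors' code.

```python
# Rough sketch of the TWE composition described above (illustrative only).
import numpy as np

def topic_embeddings(tokens, topics, emb, K):
    """Average the embeddings of all words assigned to each of the K topics."""
    dim = len(next(iter(emb.values())))
    sums, counts = np.zeros((K, dim)), np.zeros(K)
    for w, k in zip(tokens, topics):
        sums[k] += emb[w]
        counts[k] += 1
    return sums / np.maximum(counts, 1)[:, None]

def twe_document_embedding(doc_tokens, doc_topics, emb, topic_emb):
    """Concatenate each word's embedding with its topic's embedding, then average."""
    topical = [np.concatenate([emb[w], topic_emb[k]])
               for w, k in zip(doc_tokens, doc_topics)]
    return np.mean(topical, axis=0)
```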
Das et al. (2015) proposed Gaussian LDA. It uses pre-trained word embeddings. It assumes that the words in a topic are random samples from a multivariate Gaussian distribution with the topic embedding as the mean. Hence the probability that a word belongs to a topic is determined by the Euclidean distance between the word embedding and the topic embedding.

Name            Description
S               Vocabulary {s_1, ..., s_W}
V               Embedding matrix (v_{s_1}, ..., v_{s_W})
D               Document set {d_1, ..., d_M}
v_{s_i}         Embedding of word s_i
a_{s_i s_j}, A  Bigram residuals
t_{ik}, T_i     Topic embeddings in doc d_i
r_{ik}, r_i     Topic residuals in doc d_i
z_{ij}          Topic assignment of the j-th word in doc d_i
φ_i             Mixing proportions of topics in doc d_i

Table 1: Table of notations

... with the direction of t_{i,z_{ij}}. Each topic t_{ik} has a document-specific prior probability to be assigned to a word, denoted as φ_{ik} = P(k | d_i). The vector φ_i = (φ_{i1}, ..., φ_{iK}) is referred to as the mixing proportions of these topics in document d_i.

4 Link Function of Topic Embedding

In this section, we formulate the distribution of a word given its context words and topic, in the form of a link function.

The core of most word embedding methods is a link function that connects the embeddings of a focus word and its context words, to define the distribution of the focus word.
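As a schematic of this kind of link function only (not the exact form TopicVec derives), the focus word can be scored against the sum of its context embeddings plus a topic embedding and the scores normalized over the vocabulary. The bigram and topic residuals listed in Table 1 are omitted, and all variable names below are hypothetical.

```python
# Schematic of an embedding link function with a topic term: the probability of
# the focus word is a softmax over the vocabulary of the dot product between each
# word vector and (sum of context vectors + topic vector).  Illustrative sketch;
# the residual terms in Table 1 are not modeled here.
import numpy as np

def link_function(context_ids, topic_id, V, T):
    """V: (W, d) word embedding matrix; T: (K, d) topic embedding matrix.
    Returns a length-W vector of P(focus word | context words, topic)."""
    ctx = V[context_ids].sum(axis=0) + T[topic_id]   # combined context + topic direction
    scores = V @ ctx                                  # one score per vocabulary word
    scores -= scores.max()                            # numerical stability
    p = np.exp(scores)
    return p / p.sum()
```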