Generative Topic Embedding: a Continuous Representation of Documents

Shaohua Li (1,2)   Tat-Seng Chua (1)   Jun Zhu (3)   Chunyan Miao (2)
[email protected]   [email protected]   [email protected]   [email protected]
1. School of Computing, National University of Singapore
2. Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY)
3. Department of Computer Science and Technology, Tsinghua University

Abstract

Word embedding maps words into a low-dimensional continuous embedding space by exploiting the local word collocation patterns in a small context window. On the other hand, topic modeling maps documents onto a low-dimensional topic space, by utilizing the global word collocation patterns in the same document. These two types of patterns are complementary. In this paper, we propose a generative topic embedding model to combine the two types of patterns. In our model, topics are represented by embedding vectors, and are shared across documents. The probability of each word is influenced by both its local context and its topic. A variational inference method yields the topic embeddings as well as the topic mixing proportions for each document. Jointly they represent the document in a low-dimensional continuous space. In two document classification tasks, our method performs better than eight existing methods, with fewer features. In addition, we illustrate with an example that our method can generate coherent topics even based on only one document.

1 Introduction

Representing documents as fixed-length feature vectors is important for many document processing algorithms. Traditionally documents are represented as bag-of-words (BOW) vectors. However, this simple representation suffers from being high-dimensional and highly sparse, and loses semantic relatedness across the vector dimensions.

Word embedding methods have been demonstrated to be an effective way to represent words as continuous vectors in a low-dimensional embedding space (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015). The learned embedding for a word encodes its semantic/syntactic relatedness with other words, by utilizing local word collocation patterns. In each method, one core component is the embedding link function, which predicts a word's distribution given its context words, parameterized by their embeddings.

When it comes to documents, we wish to find a method to encode their overall semantics. Given the embeddings of each word in a document, we can imagine the document as a "bag-of-vectors". Related words in the document point in similar directions, forming semantic clusters. The centroid of a semantic cluster corresponds to the most representative embedding of this cluster of words, referred to as the semantic centroid. We could use these semantic centroids and the number of words around them to represent a document.

In addition, for a set of documents in a particular domain, some semantic clusters may appear in many documents. By learning collocation patterns across the documents, the derived semantic centroids could be more topical and less noisy.

Topic models, represented by Latent Dirichlet Allocation (LDA) (Blei et al., 2003), are able to group words into topics according to their collocation patterns across documents. When the corpus is large enough, such patterns reflect their semantic relatedness, hence topic models can discover coherent topics. The probability of a word is governed by its latent topic, which is modeled as a categorical distribution in LDA. Typically, only a small number of topics are present in each document, and only a small number of words have high probability in each topic. This intuition motivated Blei et al. (2003) to regularize the topic distributions with Dirichlet priors.


Semantic centroids have the same nature as topics in LDA, except that the former exist in the embedding space. This similarity drives us to seek the common semantic centroids with a model similar to LDA. We extend a generative word embedding model, PSDVec (Li et al., 2015), by incorporating topics into it. The new model is named TopicVec. In TopicVec, an embedding link function models the word distribution in a topic, in place of the categorical distribution in LDA. The advantage of the link function is that the semantic relatedness is already encoded as the cosine distance in the embedding space. Similar to LDA, we regularize the topic distributions with Dirichlet priors. A variational inference algorithm is derived. The learning process derives topic embeddings in the same embedding space as words. These topic embeddings aim to approximate the underlying semantic centroids.

To evaluate how well TopicVec represents documents, we performed two document classification tasks against eight existing topic modeling or document representation methods. Two setups of TopicVec outperformed all other methods on the two tasks, respectively, with fewer features. In addition, we demonstrate that TopicVec can derive coherent topics based only on one document, which is not possible for topic models.

The source code of our implementation is available at https://github.com/askerlee/topicvec.

2 Related Work

Li et al. (2015) proposed a generative word embedding method, PSDVec, which is the precursor of TopicVec. PSDVec assumes that the conditional distribution of a word given its context words can be factorized approximately into independent log-bilinear terms. In addition, the word embeddings and regression residuals are regularized by Gaussian priors, reducing their chance of overfitting. The model inference is approached by an efficient eigendecomposition and blockwise-regression method (Li et al., 2016b). TopicVec differs from PSDVec in that the conditional distribution of a word is influenced not only by its context words, but also by a topic, which is an embedding vector indexed by a latent variable drawn from a Dirichlet-Multinomial distribution.

Hinton and Salakhutdinov (2009) proposed to model topics as a certain number of binary hidden variables, which interact with all words in the document through weighted connections. Larochelle and Lauly (2012) assigned each word a unique topic vector, which is a summarization of the context of the current word.

Huang et al. (2012) proposed to incorporate global (document-level) semantic information to help the learning of word embeddings. The global embedding is simply a weighted average of the embeddings of words in the document.

Le and Mikolov (2014) proposed Paragraph Vector. It assumes each piece of text has a latent paragraph vector, which influences the distributions of all words in this text, in the same way as a latent word. It can be viewed as a special case of TopicVec, with the topic number set to 1. Typically, however, a document consists of multiple semantic centroids, and the limitation of only one topic may lead to underfitting.

Nguyen et al. (2015) proposed Latent Feature Topic Modeling (LFTM), which extends LDA to incorporate word embeddings as latent features. The topic is modeled as a mixture of the conventional categorical distribution and an embedding link function. The coupling between these two components makes the inference difficult. They designed a Gibbs sampler for model inference. Their implementation (https://github.com/datquocnguyen/LFTM/) is slow and infeasible when applied to a large corpus.

Liu et al. (2015) proposed Topical Word Embedding (TWE), which combines word embedding with LDA in a simple and effective way. They train word embeddings and a topic model separately on the same corpus, and then average the embeddings of words in the same topic to get the embedding of this topic. The topic embedding is concatenated with the word embedding to form the topical word embedding of a word. In the end, the topical word embeddings of all words in a document are averaged to be the embedding of the document. This method performs well on our two classification tasks. Weaknesses of TWE include: 1) the way to combine the results of word embedding and LDA lacks statistical foundations; 2) the LDA module requires a large corpus to derive semantically coherent topics.

Das et al. (2015) proposed Gaussian LDA. It uses pre-trained word embeddings. It assumes that words in a topic are random samples from a multivariate Gaussian distribution with the topic embedding as the mean. Hence the probability that a word belongs to a topic is determined by the Euclidean distance between the word embedding and the topic embedding. This assumption might be improper, as the Euclidean distance is not an optimal measure of semantic relatedness between two embeddings: almost all modern word embedding methods adopt the exponentiated cosine similarity as the link function, hence the cosine similarity may be assumed to be a better estimate of the semantic relatedness between embeddings derived from these methods.
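The contrast between Euclidean distance and cosine similarity can be seen in a small numeric example. The snippet below is our own illustration, not part of the paper: two embeddings that point in nearly the same direction (high cosine similarity) can still be far apart in Euclidean distance when their norms differ.

```python
import numpy as np

# Toy check (ours): same direction, very different norms.
a = np.array([1.0, 1.0, 0.0])
b = 5.0 * a + np.array([0.0, 0.1, 0.0])     # nearly parallel, much larger norm

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
print(f"cosine similarity = {cosine:.3f}, Euclidean distance = {euclidean:.3f}")
```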

3 Notations and Definitions

Throughout this paper, we use uppercase bold letters such as S, V to denote a matrix or set, lowercase bold letters such as v_{w_i} to denote a vector, a normal uppercase letter such as N, W to denote a scalar constant, and a normal lowercase letter such as s_i, w_i to denote a scalar variable. Table 1 lists the notations in this paper.

Table 1: Table of notations

  S                  Vocabulary {s_1, ..., s_W}
  V                  Embedding matrix (v_{s_1}, ..., v_{s_W})
  D                  Document set {d_1, ..., d_M}
  v_{s_i}            Embedding of word s_i
  a_{s_i s_j}, A     Bigram residuals
  t_{ik}, T_i        Topic embeddings in doc d_i
  r_{ik}, r_i        Topic residuals in doc d_i
  z_{ij}             Topic assignment of the j-th word w_{ij} in doc d_i
  φ_i                Mixing proportions of topics in doc d_i

In a document, a sequence of words is referred to as a text window, denoted by w_i, ..., w_{i+l}, or w_i:w_{i+l}. A text window of chosen size c before a word w_i defines the context of w_i as w_{i-c}, ..., w_{i-1}. Here w_i is referred to as the focus word. Each context word w_{i-j} and the focus word w_i comprise a bigram w_{i-j}, w_i.

We assume each word in a document is semantically similar to a topic embedding. Topic embeddings reside in the same N-dimensional space as word embeddings. When it is clear from context, topic embeddings are often referred to as topics. Each document has K candidate topics, arranged in the matrix form T_i = (t_{i1} ... t_{iK}), referred to as the topic matrix. Specifically, we fix t_{i1} = 0, referring to it as the null topic.

In a document d_i, each word w_{ij} is assigned to a topic indexed by z_{ij} ∈ {1, ..., K}. Geometrically this means the embedding v_{w_{ij}} tends to align with the direction of t_{i,z_{ij}}. Each topic t_{ik} has a document-specific prior probability to be assigned to a word, denoted as φ_{ik} = P(k | d_i). The vector φ_i = (φ_{i1}, ..., φ_{iK}) is referred to as the mixing proportions of these topics in document d_i.

4 Link Function of Topic Embedding

In this section, we formulate the distribution of a word given its context words and topic, in the form of a link function.

The core of most word embedding methods is a link function that connects the embeddings of a focus word and its context words, to define the distribution of the focus word. Li et al. (2015) proposed the following link function:

P(w_c \mid w_0 : w_{c-1}) \approx P(w_c) \exp\Big\{ v_{w_c}^\top \sum_{l=0}^{c-1} v_{w_l} + \sum_{l=0}^{c-1} a_{w_l w_c} \Big\}.   (1)

Here a_{w_l w_c} is referred to as the bigram residual, indicating the non-linear part not captured by v_{w_c}^\top v_{w_l}. It is essentially the logarithm of the normalizing constant of a softmax term. Some literature, e.g. (Pennington et al., 2014), refers to such a term as a bias term.

(1) is based on the assumption that the conditional distribution P(w_c \mid w_0 : w_{c-1}) can be factorized approximately into independent log-bilinear terms, each corresponding to a context word. This approximation leads to an efficient and effective word embedding algorithm, PSDVec (Li et al., 2015). We follow this assumption, and propose to incorporate the topic of w_c in a way like a latent word. In particular, in addition to the context words, the corresponding topic embedding t_{ik} is included as a new log-bilinear term that influences the distribution of w_c. Hence we obtain the following extended link function:

P(w_c \mid w_0 : w_{c-1}, z_c, d_i) \approx P(w_c) \cdot \exp\Big\{ v_{w_c}^\top \Big( \sum_{l=0}^{c-1} v_{w_l} + t_{z_c} \Big) + \sum_{l=0}^{c-1} a_{w_l w_c} + r_{z_c} \Big\},   (2)

where d_i is the current document, and r_{z_c} is the logarithm of the normalizing constant, named the topic residual. Note that the topic embedding t_{z_c} may be specific to d_i. For simplicity of notation, we drop the document index in t_{z_c}. To restrict the impact of topics and avoid overfitting, we constrain the magnitudes of all topic embeddings, so that they are always within a hyperball of radius γ.
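The extended link function (2) can be turned into a distribution over the vocabulary with a few lines of NumPy. The sketch below is our own illustration (V, A, u, t, r_t and ctx are our names for the quantities defined above, not the released implementation), and it renormalizes explicitly since (2) is only an approximate distribution.

```python
import numpy as np

# Illustrative sketch of the extended link function in Eq. (2); names are ours.
#   V:   N x W word embedding matrix (columns are words)
#   A:   W x W matrix of bigram residuals a_{w_l w_c}
#   u:   length-W unigram probabilities P(w)
#   t:   length-N embedding of the assigned topic t_{z_c}
#   r_t: scalar topic residual r_{z_c}
#   ctx: indices of the context words w_0 ... w_{c-1}
def word_distribution(V, A, u, t, r_t, ctx):
    """Distribution of the focus word given its context words and topic."""
    ctx_sum = V[:, ctx].sum(axis=1)          # sum of context word embeddings
    bigram = A[ctx, :].sum(axis=0)           # sum over l of a_{w_l s} for every candidate s
    logits = V.T @ (ctx_sum + t) + bigram + r_t + np.log(u)
    p = np.exp(logits - logits.max())        # Eq. (2) is only approximately normalized,
    return p / p.sum()                       # so renormalize explicitly
```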

It is infeasible to compute the exact value of the topic residual r_k. We approximate it by the context size c = 0. Then (2) becomes:

P(w_c \mid k, d_i) = P(w_c) \exp\{ v_{w_c}^\top t_k + r_k \}.   (3)

It is required that \sum_{w_c \in S} P(w_c \mid k) = 1 to make (3) a distribution. It follows that

r_k = -\log\Big( \sum_{s_j \in S} P(s_j) \exp\{ v_{s_j}^\top t_k \} \Big).   (4)

(4) can be expressed in the matrix form:

r = -\log\big( u \exp\{ V^\top T \} \big),   (5)

where u is the row vector of unigram probabilities.

5 Generative Process and Likelihood

The generative process of words in documents can be regarded as a hybrid of LDA and PSDVec. Analogous to PSDVec, the word embedding v_{s_i} and residual a_{s_i s_j} are drawn from respective Gaussians. For the sake of clarity, we ignore their generation steps, and focus on the topic embeddings. The remaining generative process is as follows:

1. For the k-th topic, draw a topic embedding uniformly from a hyperball of radius γ, i.e. t_k ∼ Unif(B_γ);
2. For each document d_i:
   (a) Draw the mixing proportions φ_i from the Dirichlet prior Dir(α);
   (b) For the j-th word:
       i. Draw the topic assignment z_{ij} from the categorical distribution Cat(φ_i);
       ii. Draw the word w_{ij} from S according to P(w_{ij} | w_{i,j-c}:w_{i,j-1}, z_{ij}, d_i).

The above generative process is presented in plate notation in Figure 1.

[Figure 1: Graphical representation of TopicVec.]

5.1 Likelihood Function

Given the embeddings V, the bigram residuals A, the topics T_i and the hyperparameter α, the complete-data likelihood of a single document d_i is:

p(d_i, Z_i, \phi_i \mid \alpha, V, A, T_i)
= p(\phi_i \mid \alpha)\, p(Z_i \mid \phi_i)\, p(d_i \mid V, A, T_i, Z_i)
= \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \phi_{ik}^{\alpha_k - 1} \cdot \prod_{j=1}^{L_i} \phi_{i, z_{ij}} P(w_{ij}) \exp\Big\{ v_{w_{ij}}^\top \Big( \sum_{l=j-c}^{j-1} v_{w_{il}} + t_{z_{ij}} \Big) + \sum_{l=j-c}^{j-1} a_{w_{il} w_{ij}} + r_{i, z_{ij}} \Big\},   (6)

where Z_i = (z_{i1}, ..., z_{iL_i}), and \Gamma(\cdot) is the Gamma function.

Let Z, T, φ denote the collection of all the document-specific {Z_i}_{i=1}^M, {T_i}_{i=1}^M, {φ_i}_{i=1}^M, respectively. Then the complete-data likelihood of the whole corpus is:

p(D, A, V, Z, T, \phi \mid \alpha, \gamma, \mu)
= \prod_{i=1}^W P(v_{s_i}; \mu_i) \prod_{i,j=1}^{W,W} P(a_{s_i s_j}; f(h_{ij})) \prod_{k=1}^K \mathrm{Unif}(B_\gamma) \cdot \prod_{i=1}^M \big\{ p(\phi_i \mid \alpha)\, p(Z_i \mid \phi_i)\, p(d_i \mid V, A, T_i, Z_i) \big\}
= \frac{1}{Z(H, \mu)\, U_\gamma^K} \exp\Big\{ -\sum_{i,j=1}^{W,W} f(h_{ij})\, a_{s_i s_j}^2 - \sum_{i=1}^W \mu_i \|v_{s_i}\|^2 \Big\} \cdot \prod_{i=1}^M \Big\{ \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \phi_{ik}^{\alpha_k - 1} \cdot \prod_{j=1}^{L_i} P(w_{ij})\, \phi_{i, z_{ij}} \exp\Big\{ v_{w_{ij}}^\top \Big( \sum_{l=j-c}^{j-1} v_{w_{il}} + t_{z_{ij}} \Big) + \sum_{l=j-c}^{j-1} a_{w_{il} w_{ij}} + r_{i, z_{ij}} \Big\} \Big\},   (7)

where P(v_{s_i}; µ_i) and P(a_{s_i s_j}; f(h_{ij})) are the two Gaussian priors as defined in (Li et al., 2015). Following the convention in (Li et al., 2015), h_{ij}, H are empirical bigram probabilities, µ are the embedding magnitude penalty coefficients, and Z(H, µ) is the normalizing constant for word embeddings. U_γ is the volume of the hyperball of radius γ.
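As a concrete illustration of this generative story, the toy NumPy sketch below samples a small corpus using the context-size-zero approximation of Eqs. (3)-(5). All sizes and variable names are ours, and the word embeddings are random rather than trained; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulation of the generative process with context size c = 0.
N, W, K, M, L = 10, 50, 4, 3, 20          # embedding dim, vocab, topics, docs, doc length
V = rng.normal(size=(N, W))               # word embeddings (assumed already trained)
u = np.full(W, 1.0 / W)                   # unigram probabilities P(s)
alpha, gamma = np.full(K, 0.1), 5.0

# 1. Topic embeddings drawn uniformly from the hyperball of radius gamma.
T = rng.normal(size=(N, K))
T *= gamma * rng.uniform(size=K) ** (1.0 / N) / np.linalg.norm(T, axis=0)
T[:, 0] = 0.0                             # the null topic t_1 = 0

r = -np.log(u @ np.exp(V.T @ T))          # topic residuals, Eq. (5)

docs = []
for i in range(M):
    phi = rng.dirichlet(alpha)            # 2(a): mixing proportions
    z = rng.choice(K, size=L, p=phi)      # 2(b)i: topic assignments
    words = []
    for k in z:                           # 2(b)ii: words from P(w | k), Eq. (3)
        p = u * np.exp(V.T @ T[:, k] + r[k])
        words.append(rng.choice(W, p=p / p.sum()))
    docs.append(words)
print(docs[0][:10])
```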

Taking the logarithm of both sides, we obtain

\log p(D, A, V, Z, T, \phi \mid \alpha, \gamma, \mu) = C_0 - \log Z(H, \mu) - \|A\|^2_{f(H)} - \sum_{i=1}^W \mu_i \|v_{s_i}\|^2 + \sum_{i=1}^M \Big\{ \sum_{k=1}^K \log \phi_{ik} (m_{ik} + \alpha_k - 1) + \sum_{j=1}^{L_i} \Big[ r_{i, z_{ij}} + v_{w_{ij}}^\top \Big( \sum_{l=j-c}^{j-1} v_{w_{il}} + t_{z_{ij}} \Big) + \sum_{l=j-c}^{j-1} a_{w_{il} w_{ij}} \Big] \Big\},   (8)

where m_{ik} = \sum_{j=1}^{L_i} \delta(z_{ij} = k) counts the number of words assigned with the k-th topic in d_i, and C_0 = M \log \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} + \sum_{i,j=1}^{M, L_i} \log P(w_{ij}) - K \log U_\gamma is constant given the hyperparameters.

6 Variational Inference Algorithm

6.1 Learning Objective and Process

Given the hyperparameters α, γ, µ, the learning objective is to find the embeddings V, the topics T, and the word-topic and document-topic distributions p(Z_i, φ_i | d_i, A, V, T). Here the hyperparameters α, γ, µ are kept constant, and we make them implicit in the distribution notations.

However, the coupling between A, V and T, Z, φ makes it inefficient to optimize them simultaneously. To get around this difficulty, we learn word embeddings and topic embeddings separately. Specifically, the learning process is divided into two stages:

1. In the first stage, considering that the topics have a relatively small impact on word distributions and the impact might be "averaged out" across different documents, we simplify the model by ignoring topics temporarily. Then the model falls back to the original PSDVec. The optimal solution V*, A* is obtained accordingly;
2. In the second stage, we treat V*, A* as constant, plug them into the likelihood function, and find the corresponding optimal T*, p(Z, φ | D, A*, V*, T*) of the full model. As in LDA, this posterior is analytically intractable, and we use a simpler variational distribution q(Z, φ) to approximate it.

6.2 Mean-Field Approximation and Variational GEM Algorithm

In this stage, we fix V = V*, A = A*, and seek the optimal T*, p(Z, φ | D, A*, V*, T*). As V*, A* are constant, we also make them implicit in the following expressions.

For an arbitrary variational distribution q(Z, φ), the following equalities hold:

\mathbb{E}_q\Big[ \log \frac{p(D, Z, \phi \mid T)}{q(Z, \phi)} \Big] = \mathbb{E}_q[\log p(D, Z, \phi \mid T)] + \mathcal{H}(q) = \log p(D \mid T) - \mathrm{KL}(q \,\|\, p),   (9)

where p = p(Z, φ | D, T) and H(q) is the entropy of q. This implies

\mathrm{KL}(q \,\|\, p) = \log p(D \mid T) - \big( \mathbb{E}_q[\log p(D, Z, \phi \mid T)] + \mathcal{H}(q) \big) = \log p(D \mid T) - \mathcal{L}(q, T).   (10)

In (10), E_q[log p(D, Z, φ | T)] + H(q) is usually referred to as the variational free energy L(q, T), which is a lower bound of log p(D | T). Directly maximizing log p(D | T) w.r.t. T is intractable due to the hidden variables Z, φ, so we maximize its lower bound L(q, T) instead. We adopt a mean-field approximation of the true posterior as the variational distribution, and use a variational algorithm to find q*, T* maximizing L(q, T).
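The identity behind (9)-(10) can be checked numerically on a toy discrete model. The snippet below is our own illustration (not part of the paper) with a single binary hidden variable z and a single observed variable x; it verifies that the log-evidence equals the free energy plus the KL divergence for an arbitrary q.

```python
import numpy as np

# Numerical check of: log p(x) = L(q) + KL(q || p(z|x)).
rng = np.random.default_rng(1)
joint = rng.dirichlet(np.ones(4)).reshape(2, 2)       # p(z, x) over 2 x 2 outcomes
x = 0                                                 # the observed value
evidence = joint[:, x].sum()                          # p(x)
posterior = joint[:, x] / evidence                    # p(z | x)

q = np.array([0.3, 0.7])                              # an arbitrary variational q(z)
elbo = q @ np.log(joint[:, x]) - q @ np.log(q)        # E_q[log p(x, z)] + H(q)
kl = q @ np.log(q / posterior)                        # KL(q || p(z|x))
assert np.isclose(np.log(evidence), elbo + kl)
print(np.log(evidence), elbo + kl)
```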

The following variational distribution is used:

q(Z, \phi; \pi, \theta) = q(\phi; \theta)\, q(Z; \pi) = \prod_{i=1}^M \Big\{ \mathrm{Dir}(\phi_i; \theta_i) \prod_{j=1}^{L_i} \mathrm{Cat}(z_{ij}; \pi_{ij}) \Big\}.   (11)

We can obtain (Li et al., 2016a)

\mathcal{L}(q, T) = \sum_{i=1}^M \Big\{ \sum_{k=1}^K \Big( \sum_{j=1}^{L_i} \pi_{ij}^k + \alpha_k - 1 \Big) \big( \psi(\theta_{ik}) - \psi(\theta_{i0}) \big) + \sum_{j=1}^{L_i} \mathrm{Tr}(T_i^\top v_{w_{ij}} \pi_{ij}^\top) + \sum_{j=1}^{L_i} r_i^\top \pi_{ij} \Big\} + \mathcal{H}(q) + C_1,   (12)

where T_i is the topic matrix of the i-th document, and r_i is the vector constructed by concatenating all the topic residuals r_{ik}. C_1 = C_0 - \log Z(H, \mu) - \|A\|^2_{f(H)} - \sum_{i=1}^W \mu_i \|v_{s_i}\|^2 + \sum_{i,j=1}^{M, L_i} \big( v_{w_{ij}}^\top \sum_{k=j-c}^{j-1} v_{w_{ik}} + \sum_{k=j-c}^{j-1} a_{w_{ik} w_{ij}} \big) is constant.

We proceed to optimize (12) with a Generalized Expectation-Maximization (GEM) algorithm w.r.t. q and T as follows:

1. Initialize all the topics T_i = 0, and correspondingly their residuals r_i = 0;
2. Iterate over the following two steps until convergence. In the l-th step:
   (a) Let the topics and residuals be T = T^{(l-1)}, r = r^{(l-1)}, and find the q^{(l)}(Z, φ) that maximizes L(q, T^{(l-1)}). This is the Expectation step (E-step). In this step, log p(D | T) is constant. Then the q that maximizes L(q, T^{(l-1)}) will minimize KL(q || p), i.e. such a q is the closest variational distribution to p measured by KL-divergence;
   (b) Given the variational distribution q^{(l)}(Z, φ), find T^{(l)}, r^{(l)} that improve L(q^{(l)}, T), using a gradient descent method. This is the generalized Maximization step (M-step). In this step, π, θ, H(q) are constant.

6.2.1 Update Equations of π, θ in E-Step

In the E-step, T = T^{(l-1)}, r = r^{(l-1)} are constant. Taking the derivative of L(q, T^{(l-1)}) w.r.t. π_{ij}^k and θ_{ik}, respectively, we can obtain the optimal solutions (Li et al., 2016a) at:

\pi_{ij}^k \propto \exp\{ \psi(\theta_{ik}) + v_{w_{ij}}^\top t_{ik} + r_{ik} \}.   (13)

\theta_{ik} = \sum_{j=1}^{L_i} \pi_{ij}^k + \alpha_k.   (14)

6.2.2 Update Equation of T_i in M-Step

In the generalized M-step, π = π^{(l)}, θ = θ^{(l)} are constant. For notational simplicity, we drop their superscripts (l).

To update T_i, we first take the derivative of (12) w.r.t. T_i, and then apply gradient descent. The derivative is obtained as (Li et al., 2016a):

\frac{\partial \mathcal{L}(q^{(l)}, T)}{\partial T_i} = \sum_{j=1}^{L_i} v_{w_{ij}} \pi_{ij}^\top + \sum_{k=1}^K \bar{m}_{ik} \frac{\partial r_{ik}}{\partial T_i},   (15)

where \bar{m}_{ik} = \sum_{j=1}^{L_i} \pi_{ij}^k = \mathbb{E}[m_{ik}] is the sum of the variational probabilities of each word being assigned to the k-th topic in the i-th document, and \frac{\partial r_{ik}}{\partial T_i} is a gradient matrix whose j-th column is \frac{\partial r_{ik}}{\partial t_{ij}}.

Recall that r_{ik} = -\log\big( \mathbb{E}_{P(s)}[\exp\{ v_s^\top t_{ik} \}] \big). When j ≠ k, it is easy to verify that \frac{\partial r_{ik}}{\partial t_{ij}} = 0. When j = k, we have

\frac{\partial r_{ik}}{\partial t_{ik}} = -e^{r_{ik}}\, \mathbb{E}_{P(s)}[\exp\{ v_s^\top t_{ik} \}\, v_s] = -e^{r_{ik}} \sum_{s \in S} \exp\{ v_s^\top t_{ik} \}\, P(s)\, v_s = -e^{r_{ik}} (u \circ V) \exp\{ V^\top t_{ik} \},   (16)

where u ∘ V is to multiply each column of V with u element-by-element.

Therefore \frac{\partial r_{ik}}{\partial T_i} = (0, \cdots, \frac{\partial r_{ik}}{\partial t_{ik}}, \cdots, 0). Plugging it into (15), we obtain

\frac{\partial \mathcal{L}(q^{(l)}, T)}{\partial T_i} = \sum_{j=1}^{L_i} v_{w_{ij}} \pi_{ij}^\top + \Big( \bar{m}_{i1} \frac{\partial r_{i1}}{\partial t_{i1}}, \cdots, \bar{m}_{iK} \frac{\partial r_{iK}}{\partial t_{iK}} \Big).

We proceed to optimize T_i with a gradient descent method:

T_i^{(l)} = T_i^{(l-1)} + \lambda(l, L_i) \frac{\partial \mathcal{L}(q^{(l)}, T)}{\partial T_i},

where \lambda(l, L_i) = \frac{L_0 \lambda_0}{l \cdot \max\{L_i, L_0\}} is the learning rate function, L_0 is a pre-specified document length threshold, and λ_0 is the initial learning rate. As the magnitude of \frac{\partial \mathcal{L}(q^{(l)}, T)}{\partial T_i} is approximately proportional to the document length L_i, to avoid the step size becoming too big on a long document, if L_i > L_0, we normalize it by L_i.

To satisfy the constraint that \|t_{ik}^{(l)}\| \le \gamma, when \|t_{ik}^{(l)}\| > \gamma, we normalize it by \gamma / \|t_{ik}^{(l)}\|. After we obtain the new T_i^{(l)}, we update r_i using (5).

Sometimes, especially in the initial few iterations, due to the excessively big step size of the gradient descent, L(q, T) may decrease after the update of T. Nonetheless the general direction of L(q, T) is increasing.
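Putting (13)-(16) together, one GEM iteration on a single document can be sketched as follows. This is our own illustrative reconstruction with simplified variable names and a fixed number of inner E-step passes, not the released code.

```python
import numpy as np
from scipy.special import digamma

# Sketch of one GEM iteration on a single document (Eqs. 13-16); names are ours.
def gem_step(V, u, T, alpha, word_ids, lr, gamma):
    """V: N x W word embeddings, u: unigram probabilities, T: N x K topic matrix,
    alpha: Dirichlet prior (length K), word_ids: word indices of the document."""
    K = T.shape[1]
    r = -np.log(u @ np.exp(V.T @ T))                    # topic residuals, Eq. (5)

    # E-step: alternate the closed-form updates (13) and (14) a few times.
    theta = alpha + len(word_ids) / K
    for _ in range(5):
        logits = digamma(theta) + V[:, word_ids].T @ T + r   # L x K scores
        logits -= logits.max(axis=1, keepdims=True)          # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)             # pi_ij, Eq. (13)
        theta = pi.sum(axis=0) + alpha                  # theta_ik, Eq. (14)

    # M-step: one gradient ascent step on T_i, Eqs. (15)-(16).
    m_bar = pi.sum(axis=0)                              # expected topic counts
    dr_dt = -np.exp(r) * ((V * u) @ np.exp(V.T @ T))    # column k = dr_ik / dt_ik
    T = T + lr * (V[:, word_ids] @ pi + m_bar * dr_dt)
    T[:, 0] = 0.0                                       # keep the null topic fixed at zero

    # Keep every topic embedding inside the hyperball of radius gamma.
    norms = np.linalg.norm(T, axis=0)
    T[:, norms > gamma] *= gamma / norms[norms > gamma]
    return T, pi, theta             # caller re-derives the residuals from T via Eq. (5)
```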

671 of topics, so that M categories correspond to M LDA: the vanilla LDA (Blei et al., 2003) in • sets of topics. In the learning algorithm, only the the library3; k 4 update of πij needs to be changed to cater for this sLDA: Supervised Topic Model (McAuliffe • situation: when the k-th topic is relevant to the and Blei, 2008), which improves the predic- k document i, we update πij using (13); otherwise tive performance of LDA by modeling class k πij = 0. labels; An identifiability problem may arise when we LFTM: Latent Feature Topic Modeling5 • split topic embeddings according to document (Nguyen et al., 2015). subsets. In different topic groups, some highly The document-topic proportions of topic modeling similar redundant topics may be learned. If we methods were used as their document representa- project documents into the topic space, portions tion. of documents in the same topic in different docu- The document representation methods are: ments may be projected onto different dimensions of the topic space, and similar documents may Doc2Vec: Paragraph Vector (Le and • 6 eventually be projected into very different topic Mikolov, 2014) in the gensim library . proportion vectors. In this situation, directly us- TWE: Topical Word Embedding7 (Liu et • ing the projected topic proportion vectors could al., 2015), which represents a document cause problems in unsupervised tasks such as clus- by concatenating average topic embedding tering. A simple solution to this problem would be and average word embedding, similar to our to compute the pairwise similarities between topic TV+WV; embeddings, and consider these similarities when GaussianLDA: Gaussian LDA8 (Das et al., • computing the similarity between two projected 2015), which assumes that words in a topic topic proportion vectors. Two similar documents are random samples from a multivariate will then still receive a high similarity score. Gaussian distribution with the mean as the topic embedding. Similar to TopicVec, we 7 Experimental Results derived the posterior topic proportions as the To investigate the quality of document represen- features of each document; tation of our TopicVec model, we compared its MeanWV: The mean word embedding of the • performance against eight topic modeling or doc- document. ument representation methods in two document Datasets We used two standard document clas- classification tasks. Moreover, to show the topic sification corpora: the 20 Newsgroups9 and the coherence of TopicVec on a single document, we ApteMod version of the Reuters-21578 corpus10. present the top words in top topics learned on a The two corpora are referred to as the 20News and news article. Reuters in the following. 7.1 Document Classification Evaluation 20News contains about 20,000 newsgroup doc- uments evenly partitioned into 20 different cate- 7.1.1 Experimental Setup gories. Reuters contains 10,788 documents, where Compared Methods Two setups of TopicVec each document is assigned to one or more cate- were evaluated: gories. For the evaluation of document classifi- TopicVec: the topic proportions learned by cation, documents appearing in two or more cate- • TopicVec; gories were removed. The numbers of documents TV+WV: the topic proportions, concate- in the categories of Reuters are highly imbalanced, • nated with the mean word embedding of the and we only selected the largest 10 categories, document (same as the MeanWV below). leaving us with 8,025 documents in total. 
We compare the performance of our methods 3https://radimrehurek.com/gensim/models/ldamodel.html against eight methods, including three topic mod- 4http://www.cs.cmu.edu/˜chongw/slda/ eling methods, three continuous document repre- 5https://github.com/datquocnguyen/LFTM/ 6https://radimrehurek.com/gensim/models/doc2vec.html sentation methods, and the conventional bag-of- 7 words ( ) method. The count vector of BOW https://github.com/largelymfs/topical word embeddings/ BOW 8https://github.com/rajarshd/Gaussian LDA is unweighted. 9http://qwone.com/˜jason/20Newsgroups/ The topic modeling methods include: 10http://www.nltk.org/book/ch02.html
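A sketch of this corpus preparation, assuming scikit-learn's copy of 20 Newsgroups and NLTK's copy of the ApteMod Reuters corpus (our own reconstruction, not the authors' preprocessing script):

```python
from collections import Counter

from nltk.corpus import reuters                     # requires nltk.download('reuters')
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")       # ~20,000 docs, 20 categories

# Reuters (ApteMod): keep documents with exactly one category,
# then restrict to the 10 largest categories.
single = [f for f in reuters.fileids() if len(reuters.categories(f)) == 1]
counts = Counter(reuters.categories(f)[0] for f in single)
top10 = {cat for cat, _ in counts.most_common(10)}
kept = [f for f in single if reuters.categories(f)[0] in top10]
docs = [reuters.raw(f) for f in kept]
labels = [reuters.categories(f)[0] for f in kept]
print(len(newsgroups.data), "20News docs;", len(docs), "Reuters docs in", len(top10), "categories")
```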

The same preprocessing steps were applied to all methods: words were lowercased; stop words and words out of the word embedding vocabulary (which means that they are extremely rare) were removed.

Experimental Settings: TopicVec used the word embeddings trained using PSDVec on a March 2015 Wikipedia snapshot, which contains the most frequent 180,000 words. The dimensionality of word embeddings and topic embeddings was 500. The hyperparameters were α = (0.1, ..., 0.1), γ = 5. For 20News and Reuters, we specified 15 and 12 topics in each category on the training set, respectively. The first topic in each category was always set to null. The learned topic embeddings were combined to form the whole topic set, where redundant null topics in different categories were removed, leaving us with 281 topics for 20News and 111 topics for Reuters. The initial learning rate was set to 0.1. After 100 GEM iterations on each dataset, the topic embeddings were obtained. Then the posterior document-topic distributions of the test sets were derived by performing one E-step given the topic embeddings trained on the training set.

LFTM includes two models: LF-LDA and LF-DMM. We chose the better performing LF-LDA to evaluate. TWE includes three models, and we chose the best performing TWE-1 to compare.

LDA, sLDA, LFTM and TWE used the specified 50 topics on Reuters, as this is the optimal topic number according to (Lu et al., 2011). On the larger 20News dataset, they used the specified 100 topics. Other hyperparameters of all compared methods were left at their default values.

GaussianLDA was specified 100 topics on 20News and 70 topics on Reuters. As each sampling iteration took over 2 hours, we only had time for 100 sampling iterations.

For each method, after obtaining the document representations of the training and test sets, we trained an ℓ1-regularized linear SVM one-vs-all classifier on the training set using the scikit-learn library (http://scikit-learn.org/stable/modules/svm.html). We then evaluated its predictive performance on the test set.

Evaluation Metrics: Considering that the largest few categories dominate Reuters, we adopted macro-averaged precision, recall and F1 measures as the evaluation metrics, to avoid the average results being dominated by the performance of the top categories.

Table 2: Performance on multi-class text classification. The best score in each column is marked with *.

                 20News                     Reuters
                 Prec     Rec      F1       Prec     Rec      F1
  BOW            69.1     68.5     68.6     *92.5    90.3     91.1
  LDA            61.9     61.4     60.3     76.1     74.3     74.8
  sLDA           61.4     60.9     60.9     88.3     83.3     85.1
  LFTM           63.5     64.8     63.7     84.6     86.3     84.9
  MeanWV         70.4     70.3     70.1     92.0     89.6     90.5
  Doc2Vec        56.3     56.6     55.4     84.4     50.0     58.5
  TWE            69.5     69.3     68.8     91.0     89.1     89.9
  GaussianLDA    30.9     26.5     22.7     46.2     31.5     35.3
  TopicVec       71.4     71.3     71.2     91.8     *92.0    *91.7
  TV+WV          *72.1    *71.9    *71.8    91.4     91.9     91.5

  (TV+WV: combined features of TopicVec topic proportions and MeanWV.)

Table 3: Number of features of the five best performing methods.

  Avg. #features   BOW     MeanWV   TWE    TopicVec   TV+WV
  20News           50381   500      800    281        781
  Reuters          17989   500      800    111        611

Evaluation Results: Table 2 presents the performance of the different methods on the two classification tasks, with the best score in each column marked. It can be seen that TV+WV and TopicVec obtained the best performance on the two tasks, respectively. With only topic proportions as features, TopicVec performed slightly better than BOW, MeanWV and TWE, and significantly outperformed four other methods. The number of features it used was much lower than BOW, MeanWV and TWE (Table 3).

GaussianLDA performed considerably worse than all other methods. After checking the generated topic embeddings manually, we found that the embeddings for different topics are highly similar to each other. Hence the posterior topic proportions were almost uniform and non-discriminative. In addition, on the two datasets, even the fastest Alias sampling in (Das et al., 2015) took over 2 hours for one iteration and 10 days for the whole 100 iterations. In contrast, our method finished the 100 EM iterations in 2 hours.
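To make the classification protocol above concrete, here is a short scikit-learn sketch (our reconstruction of the described setup, not the authors' script). The random features stand in for the document representations, e.g. the TopicVec topic proportions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 50)), rng.integers(0, 10, 200)
X_test, y_test = rng.normal(size=(80, 50)), rng.integers(0, 10, 80)

# l1-regularized linear SVM, one-vs-all (the default multiclass scheme of LinearSVC).
clf = LinearSVC(penalty="l1", dual=False)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Macro-averaged precision/recall/F1, so small categories count equally.
prec, rec, f1, _ = precision_recall_fscore_support(y_test, pred, average="macro", zero_division=0)
print(f"macro precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```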

7.2 Qualitative Assessment of Topics Derived from a Single Document

Topic models need a large set of documents to extract coherent topics. Hence, methods depending on topic models, such as TWE, are subject to this limitation. In contrast, TopicVec can extract coherent topics and obtain document representations even when only one document is provided as input.

To illustrate this feature, we ran TopicVec on a New York Times news article about a pharmaceutical company acquisition (http://www.nytimes.com/2015/09/21/business/a-huge-overnight-increase-in-a-drugs-price-raises-protests.html), and obtained 20 topics.

Figure 2 presents the most relevant words in the top-6 topics as a topic cloud. We first calculated the relevance between a word and a topic as the frequency-weighted cosine similarity of their embeddings. Then the most relevant words were selected to represent each topic. The sizes of the topic slices are proportional to the topic proportions, and the font sizes of individual words are proportional to their relevance to the topics. Among these top-6 topics, the largest and smallest topic proportions are 26.7% and 9.9%, respectively.

[Figure 2: Topic cloud of the pharmaceutical company acquisition news.]

As shown in Figure 2, words in the obtained topics were generally coherent, although the topics were only derived from a single document. The reason is that TopicVec takes advantage of the rich semantic information encoded in word embeddings, which were pretrained on a large corpus.

The topic coherence suggests that the derived topic embeddings were approximately the semantic centroids of the document. This capacity may aid applications such as document retrieval, where a "compressed representation" of the query document is helpful.

8 Conclusions and Future Work

In this paper, we proposed TopicVec, a generative model combining word embedding and LDA, with the aim of exploiting the word collocation patterns both at the level of the local context and the global document. Experiments show that TopicVec can learn high-quality document representations, even given only one document.

In our classification tasks we only explored the use of topic proportions of a document as its representation. However, jointly representing a document by topic proportions and topic embeddings would be more accurate. Efficient algorithms for this task have been proposed (Kusner et al., 2015).

Our method has potential applications in various scenarios, such as document retrieval, classification, clustering and summarization.

Acknowledgement

We thank Xiuchao Sui and Linmei Hu for their help and support. We thank the anonymous mentor provided by ACL for the careful proofreading. This research is funded by the National Research Foundation, Prime Minister's Office, Singapore under its IDM Futures Funding Initiative and IRC@SG Funding Initiative administered by IDMPO.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, pages 1137–1155.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 795–804, Beijing, China, July. Association for Computational Linguistics.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pages 1607–1614.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 957–966. JMLR Workshop and Conference Proceedings.

Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in Neural Information Processing Systems, pages 2708–2716.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Shaohua Li, Jun Zhu, and Chunyan Miao. 2015. A generative word embedding model and its low rank positive semidefinite solution. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1599–1609, Lisbon, Portugal, September. Association for Computational Linguistics.

Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chunyan Miao. 2016a. Generative topic embedding: a continuous representation of documents (extended version with proofs). Technical report. https://github.com/askerlee/topicvec/blob/master/topicvec-ext.pdf.

Shaohua Li, Jun Zhu, and Chunyan Miao. 2016b. PSDVec: a toolbox for incremental and scalable word embedding. To appear in Neurocomputing.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In AAAI, pages 2418–2424.

Yue Lu, Qiaozhu Mei, and ChengXiang Zhai. 2011. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14(2):178–203.

Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS 2013, pages 3111–3119.

Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3:299–313.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12.
