Arxiv:1702.07117V1 [Cs.CL] 23 Feb 2017 Distances Between Vectors, Which Have Been Widely Used in Concatenating Word Embeddings with Topic Embeddings

LTSG: Latent Topical Skip-Gram for Mutually Learning Topic Model and Vector Representations Jarvan Law, Hankz Hankui Zhuo, Junhua He and Erhu Rong Dept. of Computer Science, Sun Yat-Sen University, GuangZhou, China. 510006 [email protected], [email protected] fhejunh,[email protected] Abstract topic embeddings with word embeddings. Despite the success of TWE, compared to previous multi-prototype models Topic models have been widely used in discovering [Reisinger and Mooney, 2010; Huang et al., 2012], it assumes latent topics which are shared across documents in that word distributions over topics are provided by off-the- text mining. Vector representations, word embed- shelf topic models such as LDA, which would limit the ap- dings and topic embeddings, map words and topics plications of TWE once topic models do not perform well in into a low-dimensional and dense real-value vec- some domains [Petterson et al., 2010; Phan et al., 2011]. As tor space, which have obtained high performance in a matter of fact, pervasive polysemous words in documents NLP tasks. However, most of the existing models would harm the performance of topic models that are based assume the result trained by one of them are perfect on co-occurrence of words in documents. Thus, a more re- correct and used as prior knowledge for improving alistic solution is to build both topic models with regard to the other model. Some other models use the infor- polysemous words and polysemous word embeddings simul- mation trained from external large corpus to help taneously, instead of using off-the-shelf topic models. improving smaller corpus. In this paper, we aim to build such an algorithm framework that makes In this work, we propose a novel learning framework, topic models and vector representations mutually called Latent Topical Skip-Gram (LTSG) model, to mutually improve each other within the same corpus. An learn polysemous-word models and topic models. To the best EM-style algorithm framework is employed to iter- of our knowledge, this is the first work that considers learning atively optimize both topic model and vector rep- polysemous-word models and topic models simultaneously. resentations. Experimental results show that our Although there have been approaches that aim to improve [ model outperforms state-of-art methods on various topic models based on word embeddings MRF-LDA Xie et ] NLP tasks. al., 2015 , they fail to improve word embeddings provided words are polysemous; although there have been approaches that aim to improve polysemous-word models TWE [Liu et 1 Introduction al., 2015] based on topic models, they fail to improve topic models considering words are polysemous. Different from Word embeddings, e.g., distributed word representations previous approaches, we introduce a new node Tw, called [Mikolov et al., 2013], represent words with low dimensional global topic, to capture all of the topics regarding polyse- and dense real-value vectors, which capture useful semantic mous word w based on topic-word distribution ', and use and syntactic features of words. Distributed word embed- the global topic to estimate the context of polysemous word dings can be used to measure word similarities by computing w. Then we characterize polysemous word embeddings by arXiv:1702.07117v1 [cs.CL] 23 Feb 2017 distances between vectors, which have been widely used in concatenating word embeddings with topic embeddings. We various IR and NLP tasks, such as entity recognition [Turian illustrate our new model in Figure 1, where Figure 1(A) is et al., 2010], disambiguation [Collobert et al., 2011] and pars- the skip-gram model [Mikolov et al., 2013], which aims to ing [Socher et al., 2011; Socher et al., 2013]. Despite the maximize the probability of context c given word w, Figure success of previous approaches on word embeddings, they all 1(B) is the TWE model, which extends the skip-gram model assume each word has a specific meaning and represent each to maximize the probability of context c given both word w word with a single vector, which restricts their applications and topic t, and Figure 1(C) is our LTSG model which aims in fields with polysemous words, e.g., “bank” can be either to maximize the probability of context c given word w and “a financial institution” or “a raised area of ground along a global topic Tw. Tw is generated based on topic-word distri- river”. bution ' (i.e., the joint distribution of topic embedding t and To overcome this limitation, [Liu et al., 2015] propose word embedding w) and topic embedding t (which is based a topic embedding approach, namely Topical Word Em- on topic assignment z). Through our LTSG model, we can si- beddings (TWE), to learn topic embeddings to characterize multaneously learn word embeddings w and global topic em- various meanings of polysemous words by concatenating beddings Tw for representing polysemous word embeddings, ment corpus D, each document wm 2 D is assumed to have c c a distribution over K topics. The generative process of LDA is shown as follows, 1. For each topic k = 1 ! K, draw a distribution over words 'k ∼ Dir(β) 2. For each document wm 2 D; m 2 f1; 2;:::;Mg w w t (a) Draw a topic distribution θm ∼ Dir(α) (A) Skip-Gram (B) TWE (b) For each word wm;n 2 wm; n = 1;:::;Nm i. Draw a topic assignment zm;n ∼ Mult(θm), zm;n 2 f1;:::;Kg: ii. Draw a word wm;n ∼ Mult('zm;n ) c where α and β are Dirichlet hyperparameters, specifying the nature of priors on θ and '. Variational inference and Gibbs sampling are the common ways to learn the parameters of w Tw LDA. φ 2.2 The Skip-Gram Model The Skip-Gram model is a well-known framework for learn- z t ing word vectors [Mikolov et al., 2013]. Skip-Gram aims to predict context words given a target word in a sliding window, (C) LTSG as shown in Figure 1(A). Given a document corpus D defined in Table 1, the objec- Figure 1: Skip-Gram, TWE and LTSG models. Blue, yel- tive of Skip-Gram is to maximize the average log-probability low, green circles denote the embeddings of word, topic and M N context, while red circles in LTSG denote the global topical 1 X Xm X L(D) = word. White circles denote the topic model part, topic-word PM N distribution ' and topic assignment z. m=1 m m=1 n=1 −c≤j≤c;j6=0 log Pr(wm;n+jjwm;n); (1) and topic word distribution ' for mining topics with regard where c is the context window size of the target word. The ba- to polysemous words. We will exhibit the effectiveness of our sic Skip-Gram formulation defines Pr(wm;n+jjwm;n) using LTSG model in text classification and topic mining tasks with the softmax function: regard to polysemous words in documents. exp(vw · vw ) In the remainder of the paper, we first introduce prelimi- Pr(w jw ) = m;n+j m;n ; (2) m;n+j m;n PW naries of our LTSG model, and then present our LTSG algo- w=1 exp(vw · vwm;n ) rithm in detail. After that, we evaluate our LTSG model by where v and v are the vector representations of comparing our LTSG algorithm to state-of-the-art models in wm;n wm;n+j various datasets. Finally we review previous work related to target word wm;n and its context word wm;n+j, and W is the our LTSG approach and conclude the paper with future work. number of words in the vocabulary V. Hierarchical softmax and negative sampling are two efficient approximation methods used to learn Skip-Gram. 2 Preliminaries 2.3 Topical Word Embeddings In this section, we briefly review preliminaries of Latent Topical word embeddings (TWE) is a more flexible and Dirichlet Allocation (LDA), Skip-Gram, and Topical Word powerful framework for multi-prototype word embeddings, Embeddings (TWE), respectively. We show some notations where topical word refers to a word taking a specific topic and their corresponding meanings in Table 1, which will be as context [Liu et al., 2015], as shown in Figure 1(B). TWE used in describing the details of LDA, Skip-Gram, and TWE. model employs LDA to obtain the topic distributions of document corpora and topic assginment for each word token. TWE 2.1 Latent Dirichlet Allocation model uses topic zm;n of target word to predict context word [ ] Latent Dirichlet Allocation (LDA) Blei et al., 2003 , a three- compared with only using the target word wm;n to predict level hierarchical Bayesian model, is a well-developed and context word in Skip-Gram. TWE is defined to maximize the widely used probabilistic topic model. Extending Probabilis- following average log probability tic Latent Semantic Indexing (PLSI) [Hofmann, 1999], LDA M N adds Dirichlet priors to document-specific topic mixtures to 1 X Xm X overcome the overfitting problem in PLSI. LDA aims at mod- L(D) = PM N (3) eling each document as a mixture over sets of topics, each as- m=1 m m=1 n=1 −c≤j≤c;j6=0 sociated with a multinomial word distribution. Given a docu- log Pr(wm;n+jjwm;n) + log Pr(wm;n+jjzm;n): Table 1: Notations of the text collection. Term Notation Definition or Description vocabulary V set of words in the text collection, jVj = W word w a basic item from vocabulary indexed as w 2 f1; 2;:::;W g document w a sequence of N words, w = (w1; w2; : : : ; wN ) corpus D a collection of M documents, D = fw1; w2;:::; wM g topic-word ' K distributions over vocabulary (K × W matrix), j'j = K; j'kj = W d word embedding v distributed representation of word, denoted by vw, v 2 R d topic embedding t distributed representation of topic, denoted by tk, t 2 R TWE regards each topic as a pseudo word that appears in all 3.1 Topic Assignment via Gibbs Sampling positions of words assigned with this topic.

Load more