Word and Document Embedding with vMF-Mixture Priors on Context Word Vectors

Shoaib Jameel
Medway School of Computing
University of Kent
[email protected]

Steven Schockaert
School of Computer Science and Informatics
Cardiff University
[email protected]

Abstract

Word embedding models typically learn two types of vectors: target word vectors and context word vectors. These vectors are normally learned such that they are predictive of some word co-occurrence statistic, but they are otherwise unconstrained. However, the words from a given language can be organized in various natural groupings, such as syntactic word classes (e.g. nouns, adjectives, verbs) and semantic themes (e.g. sports, politics, sentiment). Our hypothesis in this paper is that embedding models can be improved by explicitly imposing a cluster structure on the set of context word vectors. To this end, our model relies on the assumption that context word vectors are drawn from a mixture of von Mises-Fisher (vMF) distributions, where the parameters of this mixture distribution are jointly optimized with the word vectors. We show that this results in word vectors which are qualitatively different from those obtained with existing word embedding models. We furthermore show that our embedding model can also be used to learn high-quality document representations.

1 Introduction

Word embedding models are aimed at learning vector representations of word meaning (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017). These representations are primarily learned from co-occurrence statistics, where two words are represented by similar vectors if they tend to occur in similar linguistic contexts. Most models, such as Skip-gram (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014), learn two different vector representations w and w̃ for each word w, which we will refer to as the target word vector and the context word vector respectively. Apart from the constraint that w_i · w̃_j should reflect how often words w_i and w_j co-occur, these vectors are typically unconstrained.

As was shown in (Mu et al., 2018), after performing a particular linear transformation, the angular distribution of the word vectors that are obtained by standard models is essentially uniform. This isotropy property is convenient for studying word embeddings from a theoretical point of view (Arora et al., 2016), but it sits at odds with the fact that words can be organised in various natural groupings. For instance, we might perhaps expect that words from the same part-of-speech class should be clustered together in the word embedding. Similarly, we might expect that organising word vectors in clusters that represent semantic themes would also be beneficial. In fact, a number of approaches have already been proposed that use external knowledge for imposing such a cluster structure, capturing the intuition that words which belong to the same category should be represented by similar vectors (Xu et al., 2014; Guo et al., 2015; Hu et al., 2015; Li et al., 2016c) or be located in a low-dimensional subspace (Jameel and Schockaert, 2016). Such models tend to outperform standard word embedding models, but it is unclear whether this is only because they can take advantage of external knowledge, or whether imposing a cluster structure on the word vectors is itself also inherently useful.

In this paper, we propose a word embedding model which explicitly aims to learn context vectors that are organised in clusters. Note that unlike the aforementioned works, our method does not rely on any external knowledge. We simply impose the requirement that context word vectors should be clustered, without prescribing how these clusters should be defined. To this end, we extend the GloVe model by imposing a prior on the context word vectors. This prior takes the form of a mixture of von Mises-Fisher (vMF) distributions, which is a natural choice for modelling clusters in directional data (Banerjee et al., 2005).

We show that this results in word vectors that are qualitatively different from those obtained using existing models, significantly outperforming them in syntax-oriented evaluations. Moreover, we show that the same model can be used for learning document embeddings, simply by viewing the words that appear in a given document as context words. We show that the vMF distributions in that case correspond to semantically coherent topics, and that the resulting document vectors outperform those obtained with existing topic modelling strategies.
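To make the "clusters in directional data" intuition concrete, the following toy sketch (not part of the paper; the dimensionality, number of components, concentration value and sampled vectors are all made-up assumptions) shows that, under a vMF mixture with equal mixing weights and a shared concentration, the posterior over components reduces to a softmax of cosine similarities to the component mean directions, so unit vectors are softly grouped around those directions.

```python
# Illustration only: why a mixture of von Mises-Fisher (vMF) distributions is a
# natural model for clusters of directional (unit-norm) data. This is NOT the
# paper's training procedure; the component means `mu`, the shared concentration
# `kappa` and the toy vectors below are made up for the example.
import numpy as np

rng = np.random.default_rng(0)
d = 50           # embedding dimensionality (assumed)
K = 3            # number of mixture components (assumed)
kappa = 20.0     # shared concentration parameter (assumed)

def normalize(v):
    """Project vectors onto the unit hypersphere."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy component mean directions and toy "context vectors" scattered around them.
mu = normalize(rng.normal(size=(K, d)))
true_k = rng.integers(0, K, size=200)
x = normalize(mu[true_k] + 0.25 * rng.normal(size=(200, d)))

# vMF density: vmf(x | theta_k) is proportional to exp(kappa * mu_k . x).  With
# equal mixing weights and a shared kappa, the normalizing constants cancel, so
# the posterior over components is a softmax of kappa * cosine similarity.
logits = kappa * x @ mu.T                      # shape (n, K)
logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
posterior = np.exp(logits)
posterior /= posterior.sum(axis=1, keepdims=True)

# Vectors are assigned (softly) to the component whose mean direction they are
# most cosine-similar to: the mixture induces a cluster structure on the sphere.
print("accuracy of MAP assignment:", np.mean(posterior.argmax(axis=1) == true_k))
```

This cosine-similarity view is also why fitting such mixtures is closely related to spherical k-means style clustering (Banerjee et al., 2005).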
2 Related Work

A large number of works have proposed techniques for improving word embeddings based on external lexical knowledge. Many of these approaches are focused on external knowledge about word similarity (Yu and Dredze, 2014; Faruqui et al., 2015; Mrksic et al., 2016), although some approaches for incorporating categorical knowledge have been studied as well, as already mentioned in the introduction. What is different about our approach is that we do not rely on any external knowledge. We essentially impose the constraint that some category structure has to exist, without specifying what these categories look like.

The view that the words which occur in a given document collection have a natural cluster structure is central to topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its non-parametric counterpart called Hierarchical Dirichlet Processes (HDP) (Teh et al., 2005), which automatically discovers the number of latent topics based on the characteristics of the data. In recent years, several approaches that combine the intuitions underlying topic models with word embeddings have been proposed. For example, in (Das et al., 2015) it was proposed to replace the usual representation of topics as multinomial distributions over words by Gaussian distributions over a pre-trained word embedding, while (Batmanghelich et al., 2016) and (Li et al., 2016b) used von Mises-Fisher distributions for this purpose. Note that documents are still modelled as multinomial distributions over topics in these models. In (He et al., 2017) the opposite approach is taken: documents and topics are represented as vectors, with the aim of modelling topic correlations in an efficient way, while each topic is represented as a multinomial distribution over words. In this paper, we take a different approach for learning document vectors, by not considering any document-specific topic distribution. This allows us to represent document vectors and (context) word vectors in the same space and, as we will see, leads to improved empirical results.

Apart from using pre-trained word embeddings for improving topic representations, a number of approaches have also been proposed that use topic models for learning word vectors. For example, (Liu et al., 2015b) first uses the standard LDA model to learn a latent topic assignment for each word occurrence. These assignments are then used to learn vector representations of words and topics. Some extensions of this model have been proposed which jointly learn the topic-specific word vectors and the latent topic assignment (Li et al., 2016a; Shi et al., 2017). The main motivation for these works is to learn topic-specific word representations. They are thus similar in spirit to multi-prototype word embeddings, which aim to learn sense-specific word vectors (Neelakantan et al., 2014). Our method is clearly different from these works, as our focus is on learning standard word vectors (as well as document vectors).

Regarding word embeddings more generally, attention has recently shifted towards contextualized word embeddings based on neural language models (Peters et al., 2018). Such contextualized word embeddings serve a broadly similar purpose as the aforementioned topic-specific word vectors, but with far better empirical performance. Despite their recent popularity, however, it is worth emphasizing that state-of-the-art methods such as ELMo (Peters et al., 2018) rely on a concatenation of the output vectors of a neural language model with standard word vectors. For this reason, among others, the problem of learning standard word vectors remains an important research topic.

3 Model Description

The GloVe model (Pennington et al., 2014) learns for each word w a target word vector w and a context word vector w̃ by minimizing the following objective:

\sum_{i,j:\, x_{ij} \neq 0} f(x_{ij}) \left( w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log x_{ij} \right)^2

where x_ij is the number of times w_i and w_j co-occur in the given corpus, b_i and b̃_j are bias terms, and f(x_ij) is a weighting function aimed at reducing the impact of sparse co-occurrence counts. It is easy to see that this objective is equivalent to maximizing the following likelihood function:

P(D \mid \Omega) \propto \prod_{i,j:\, x_{ij} \neq 0} \mathcal{N}(\log x_{ij}; \mu_{ij}, \sigma^2)^{f(x_{ij})}

where σ² > 0 can be chosen arbitrarily, N denotes the Normal distribution, and

\mu_{ij} = w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j

Furthermore, D denotes the given corpus and Ω refers to the set of parameters learned by the word embedding model, i.e. the target word vectors, the context word vectors and the bias terms.
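As a quick numerical sanity check of this equivalence, the sketch below (toy data only, not the authors' implementation; the weighting function is the standard GloVe choice from Pennington et al. (2014), and all counts, vectors and biases are random) verifies that the weighted least-squares objective and the weighted negative Gaussian log-likelihood differ only by an affine transformation, so minimizing one maximizes the other.

```python
# A minimal numerical check (not the authors' code) of the equivalence stated
# above: the weighted GloVe least-squares objective and the weighted negative
# Gaussian log-likelihood differ only by an affine transformation.
import numpy as np

rng = np.random.default_rng(1)
V, d = 30, 10                                   # toy vocabulary size and dimension
X = rng.poisson(2.0, size=(V, V)).astype(float) # toy co-occurrence counts x_ij
W = rng.normal(size=(V, d))                     # target word vectors w_i
W_ctx = rng.normal(size=(V, d))                 # context word vectors w~_j
b = rng.normal(size=V)                          # bias terms b_i
b_ctx = rng.normal(size=V)                      # bias terms b~_j
sigma2 = 0.5                                    # arbitrary choice of sigma^2 > 0

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function, dampening sparse co-occurrence counts."""
    return np.minimum(x / x_max, 1.0) ** alpha

mask = X != 0
mu = W @ W_ctx.T + b[:, None] + b_ctx[None, :]  # mu_ij = w_i . w~_j + b_i + b~_j

# Weighted least-squares objective.
objective = np.sum(f(X[mask]) * (mu[mask] - np.log(X[mask])) ** 2)

# Weighted negative Gaussian log-likelihood of log x_ij under N(mu_ij, sigma^2).
nll = np.sum(f(X[mask]) * (0.5 * np.log(2 * np.pi * sigma2)
                           + (np.log(X[mask]) - mu[mask]) ** 2 / (2 * sigma2)))

# nll = const + objective / (2 * sigma^2): the two criteria are equivalent.
const = 0.5 * np.log(2 * np.pi * sigma2) * np.sum(f(X[mask]))
assert np.isclose(nll, const + objective / (2 * sigma2))
print("objective:", objective, "negative log-likelihood:", nll)
```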
The von Mises-Fisher (vMF) distribution is a distribution over unit vectors whose density is proportional to exp(θ · x); its normalizing denominator is commonly known as the confluent hypergeometric function. Note, however, that we will not need to evaluate this denominator, as it simply acts as a scaling factor. The normalized vector θ/‖θ‖, for θ ≠ 0, is the mean direction of the distribution, while ‖θ‖ is known as the concentration parameter. To estimate the parameter θ from a given set of samples, we can use maximum likelihood (Hornik and Grün, 2014).

A finite mixture of vMFs, which we denote as movMF, is a distribution on the unit hypersphere of the following form (x ∈ S^d):

h(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, \mathrm{vmf}(x \mid \theta_k)

where K is the number of mixture components and α_1, ..., α_K are the corresponding mixture weights (non-negative and summing to one).
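Although, as noted above, the normalizing denominator never needs to be evaluated during training, the following sketch (an illustration under assumptions, not the authors' code) shows how the movMF density just defined could be evaluated, e.g. to inspect learned mixture components. It writes the vMF normalizer in terms of the modified Bessel function, assumes x lives in R^p (i.e. on the sphere S^{p-1}), and uses helper names log_vmf and log_movmf chosen here purely for illustration.

```python
# A sketch (not the authors' code) of evaluating the movMF density defined above.
# The vMF normalizing constant is written with the modified Bessel function
# I_{p/2-1}; SciPy's exponentially scaled `ive` is used for numerical stability.
# All parameter values below are made up for illustration.
import numpy as np
from scipy.special import ive

def log_vmf(x, theta):
    """Log-density of vmf(x | theta) for unit vectors x in R^p, theta = kappa * mu."""
    p = x.shape[-1]
    kappa = np.linalg.norm(theta)
    # log C_p(kappa) = (p/2 - 1) log kappa - (p/2) log(2 pi) - log I_{p/2-1}(kappa)
    # ive(v, k) = I_v(k) * exp(-k), hence the extra "+ kappa" term below.
    log_norm = ((p / 2 - 1) * np.log(kappa)
                - (p / 2) * np.log(2 * np.pi)
                - (np.log(ive(p / 2 - 1, kappa)) + kappa))
    return log_norm + x @ theta

def log_movmf(x, alphas, thetas):
    """Log-density of the mixture h(x | Theta) = sum_k alpha_k vmf(x | theta_k)."""
    component_logs = np.stack([np.log(a) + log_vmf(x, t)
                               for a, t in zip(alphas, thetas)], axis=-1)
    # log-sum-exp over the K components
    m = component_logs.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(component_logs - m).sum(axis=-1, keepdims=True))).squeeze(-1)

# Toy example: two components in R^5 with different concentrations.
rng = np.random.default_rng(2)
mu1, mu2 = (v / np.linalg.norm(v) for v in rng.normal(size=(2, 5)))
thetas = [10.0 * mu1, 3.0 * mu2]        # theta_k = kappa_k * mu_k
alphas = [0.6, 0.4]
x = rng.normal(size=(4, 5))
x /= np.linalg.norm(x, axis=1, keepdims=True)
print(log_movmf(x, alphas, thetas))
```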