Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning

Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning Jianpeng Cheng Dimitri Kartsaklis University of Oxford Queen Mary University of London Department of School of Electronic Engineering Computer Science and Computer Science [email protected] [email protected] Abstract larger text constituents such as phrases or sentences, since the uniqueness of multi-word expres- Deep compositional models of meaning sions would inevitably lead to data sparsity prob- acting on distributional representations of lems, thus to unreliable vectorial representations. words in order to produce vectors of larger The problem is usually addressed by the provision text constituents are evolving to a pop- of a compositional function, the purpose of which ular area of NLP research. We detail is to prepare a vectorial representation for a phrase a compositional distributional framework or sentence by combining the vectors of the words based on a rich form of word embeddings therein. While the nature and complexity of these that aims at facilitating the interactions compositional models may vary, approaches based between words in the context of a sen- on deep-learning architectures have been shown to tence. Embeddings and composition lay- be especially successful in modelling the meaning ers are jointly learned against a generic of sentences for a variety of tasks (Socher et al., objective that enhances the vectors with 2012; Kalchbrenner et al., 2014). syntactic information from the surround- The mutual interaction of distributional word ing context. Furthermore, each word is vectors by a means of a compositional model pro- associated with a number of senses, the vides many opportunities for interesting research, most plausible of which is selected dy- the majority of which still remains to be explored. namically during the composition process. One such direction is to investigate in what way We evaluate the produced vectors qualita- lexical ambiguity affects the compositional pro- tively and quantitatively with positive re- cess. In fact, recent work has shown that shal- sults. At the sentence level, the effective- low multi-linear compositional models that explic- ness of the framework is demonstrated on itly handle extreme cases of lexical ambiguity in a the MSRPar task, for which we report re- step prior to composition present consistently bet- sults within the state-of-the-art range. ter performance than their “ambiguous” counterparts (Kartsaklis and Sadrzadeh, 2013; Kartsaklis 1 Introduction et al., 2014). A first attempt to test these obser- Representing the meaning of words by using their vations in a deep compositional setting has been distributional behaviour in a large text corpus is presented by Cheng et al. (2014) with promising a well-established technique in NLP research that results. has been proved useful in numerous tasks. In Furthermore, a second important question re- a distributional model of meaning, the semantic lates to the very nature of the word embeddings representation of a word is given as a vector in used in the context of a compositional model. In a some high dimensional vector space, obtained ei- setting of this form, word vectors are not any more ther by explicitly collecting co-occurrence statis- just a means for discriminating words based on tics of the target word with words belonging to a their underlying semantic relationships; the main representative subset of the vocabulary, or by di- goal of a word vector is to contribute to a bigger rectly optimizing the word vectors against an ob- whole—a task in which syntax, along with seman- jective function in some neural-network based ar- tics, also plays a very important role. It is a central chitecture (Collobert and Weston, 2008; Mikolov point of this paper, therefore, that in a composi- et al., 2013). tional distributional model of meaning word vec- Regardless their method of construction, distri- tors should be injected with information that re- butional models of meaning do not scale up to flects their syntactical roles in the training corpus. 1531 Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1531–1542, Lisbon, Portugal, 17-21 September 2015. c 2015 Association for Computational Linguistics. The purpose of this work is to improve the sifiers, we approach the problem by following a current practice in deep compositional models of siamese architecture (Bromley et al., 1993). meaning in relation to both the compositional process itself and the quality of the word embed- 2 Background and related work dings used therein. We propose an architecture 2.1 Distributional models of meaning for jointly training a compositional model and a set of word embeddings, in a way that imposes Distributional models of meaning follow the dis- dynamic word sense induction for each word dur- tributional hypothesis (Harris, 1954), which states ing the learning process. Note that this is in con- that two words that occur in similar contexts have trast with recent work in multi-sense neural word similar meanings. Traditional approaches for con- embeddings (Neelakantan et al., 2014), in which structing a word space rely on simple counting: a the word senses are learned without any composi- word is represented by a vector of numbers (usu- tional considerations in mind. ally smoothed by the application of some function such as point-wise mutual information) which Furthermore, we make the word embeddings show how frequently this word co-occurs with syntax-aware by introducing a variation of the other possible context words in a corpus of text. hinge loss objective function of Collobert and We- In contrast to these methods, a recent class of ston (2008), in which the goal is not only to predict distributional models treat word representations as the occurrence of a target word in a context, but to parameters directly optimized on a word predic- also predict the position of the word within that tion task (Bengio et al., 2003; Collobert and We- context. A qualitative analysis shows that our vec- ston, 2008; Mikolov et al., 2013; Pennington et tors reflect both semantic and syntactic features in al., 2014). Instead of relying on observed co- a concise way. occurrence counts, these models aim to maximize In all current deep compositional distributional the objective function of a neural net-based ar- settings, the word embeddings are internal param- chitecture; Mikolov et al. (2013), for example, eters of the model with no use for any other pur- compute the conditional probability of observ- pose than the task for which they were specifically ing words in a context around a target word (an trained. In this work, one of our main consid- approach known as the skip-gram model). Re- erations is that the joint training step should be cent studies have shown that, compared to their generic enough to not be tied in any particular co-occurrence counterparts, neural word vectors task. In this way the word embeddings and the de- reflect better the semantic relationships between rived compositional model can be learned on data words (Baroni et al., 2014) and are more effective much more diverse than any task-specific dataset, in compositional settings (Milajevs et al., 2014). reflecting a wider range of linguistic features. In- deed, experimental evaluation shows that the pro- 2.2 Syntactic awareness duced word embeddings can serve as a high qual- Since the main purpose of distributional models ity general-purpose semantic word space, present- until now was to measure the semantic relatedness ing performance on the Stanford Contextual Word of words, relatively little effort has been put into Similarity (SCWS) dataset of Huang et al. (2012) making word vectors aware of information regard- competitive to and even better of the performance ing the syntactic role under which a word occurs of well-established neural word embeddings sets. in a sentence. In some cases the vectors are POS- Finally, we propose a dynamic disambiguation tag specific, so that ‘book’ as noun and ‘book’ framework for a number of existing deep compo- as verb are represented by different vectors (Kart- sitional models of meaning, in which the multi- saklis and Sadrzadeh, 2013). Furthermore, word sense word embeddings and the compositional spaces in which the context of a target word is de- model of the original training step are further re- termined by means of grammatical dependencies fined according to the purposes of a specific task (Pado´ and Lapata, 2007) are more effective in cap- at hand. In the context of paraphrase detection, we turing syntactic relations than approaches based achieve a result very close to the current state-of- on simple word proximity. the-art on the Microsoft Research Paraphrase Cor- For word embeddings trained in neural settings, pus (Dolan and Brockett, 2005). An interesting syntactic information is not usually taken explic- aspect at the sideline of the paraphrase detection itly into account, with some notable exceptions. experiment is that, in contrast to mainstream ap- At the lexical level, Levy and Goldberg (2014) proaches that mainly rely on simple forms of clas- propose an extension of the skip-gram model 1532 based on grammatical dependencies. Following a ing a generic objective that can be trained on any different approach, Mnih and Kavukcuoglu (2013) general-purpose text corpus. While we focus on weight the vector of each context word depending recursive and recurrent neural network architec- on its distance from the target word. With regard tures, the general ideas we will discuss are in prin- to compositional settings (discussed in the next ciple model-independent. section), Hashimoto et al. (2014) use dependency- based word embeddings by employing a hinge loss 2.4 Disambiguation in composition objective, while Hermann and Blunsom (2013) Regardless of the way they address composition, condition their objectives on the CCG types of the all the models of Section 2.3 rely on ambiguous involved words.

Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning

Creating Words: Is Lexicography for You? Lexicographers Decide Which Words Should Be Included in Dictionaries. They May Decide T

Character-Word LSTM Language Models

Basic Morphology

Learning About Phonology and Orthography

Linguistics 1A Morphology 2 Complex Words

Words Their Way®: Sixth Edition © 2016 Stages of Spelling Development

The Art of Lexicography - Niladri Sekhar Dash

Philosophy of Language in the Twentieth Century Jason Stanley Rutgers University

4. the History of Linguistics : the Handbook of Linguistics : Blackwell Reference On

Word Forms Grammar Resource Sheet for ESL Writers

A Functional Approach to the Choice Between Descriptive, Prescriptive

The Neat Summary of Linguistics