Evaluating Neural Word Representations in Tensor-Based

Evaluating Neural Word Representations in Tensor-Based Compositional Settings Dmitrijs Milajevs1 Dimitri Kartsaklis2 Mehrnoosh Sadrzadeh1 Matthew Purver1 1Queen Mary University of London 2 School of Electronic Engineering University of Oxford and Computer Science Department of Computer Science Mile End Road, London, UK Parks Road, Oxford, UK [email protected] fd.milajevs,m.sadrzadeh,[email protected] Abstract ral word embeddings in compositional tasks, and meaningfully compare them with the traditional We provide a comparative study be- distributional approach based on co-occurrence tween neural word representations and counts. We are especially interested in investi- traditional vector spaces based on co- gating the performance of neural word vectors in occurrence counts, in a number of com- compositional models involving general mathe- positional tasks. We use three differ- matical composition operators, rather than in the ent semantic spaces and implement seven more task- or domain-specific deep-learning com- tensor-based compositional models, which positional settings they have generally been used we then test (together with simpler ad- with so far (for example, by Socher et al. (2012), ditive and multiplicative approaches) in Kalchbrenner and Blunsom (2013) and many oth- tasks involving verb disambiguation and ers). sentence similarity. To check their scala- bility, we additionally evaluate the spaces In particular, this is the first large-scale study using simple compositional methods on to date that applies neural word representations in larger-scale tasks with less constrained tensor-based compositional distributional models language: paraphrase detection and di- of meaning similar to those formalized by Coecke alogue act tagging. In the more con- et al. (2010). We test a range of implementations strained tasks, co-occurrence vectors are based on this framework, together with additive competitive, although choice of composi- and multiplicative approaches (Mitchell and Lap- tional method is important; on the larger- ata, 2008), in a variety of different tasks. Specif- scale tasks, they are outperformed by neu- ically, we use the verb disambiguation task of ral word embeddings, which show robust, Grefenstette and Sadrzadeh (2011a) and the tran- stable performance across the tasks. sitive sentence similarity task of Kartsaklis and Sadrzadeh (2014) as small-scale focused experi- 1 Introduction ments on pre-defined sentence structures. Addi- Neural word embeddings (Bengio et al., 2006; tionally, we evaluate our vector spaces on para- Collobert and Weston, 2008; Mikolov et al., phrase detection (using the Microsoft Research 2013a) have received much attention in the dis- Paraphrase Corpus of Dolan et al. (2005)) and di- tributional semantics community, and have shown alogue act tagging using the Switchboard Corpus state-of-the-art performance in many natural lan- (see e.g. (Stolcke et al., 2000)). guage processing tasks. While they have been In all of the above tasks, we compare the neural compared with co-occurrence based models in word embeddings of Mikolov et al. (2013a) with simple similarity tasks at the word level (Levy et two vector spaces both based on co-occurrence al., 2014; Baroni et al., 2014), we are aware of counts and produced by standard distributional only one work that attempts a comparison of the techniques, as described in detail below. The gen- two approaches in compositional settings (Blacoe eral picture we get from the results is that in almost and Lapata, 2012), and this is limited to additive all cases the neural vectors are more effective than and multiplicative composition, compared against the traditional approaches. composition via a neural autoencoder. We proceed as follows: Section 2 provides a The purpose of this paper is to provide a more concise introduction to distributional word repre- complete picture regarding the potential of neu- sentations in natural language processing. Section 3 takes a closer look to the subject of composi- they are rather similar to each other (both of them tionality in vector space models of meaning and are humans) and dissimilar to words such as dog, describes the range of compositional operators ex- pavement or idea. The same applies at the phrase amined here. In Section 4 we provide details about and sentence level: “dogs chase cats” is similar in the vector spaces used in the experiments. Our ex- meaning to “hounds pursue kittens”, but less so to perimental work is described in detail in Section 5, “cats chase dogs” (despite the lexical overlap). and the results are discussed in Section 6. Finally, Distributional methods provide a way to address Section 7 provides conclusions. this problem. By representing words and phrases as vectors or tensors in a (usually highly dimen- 2 Meaning representation sional) vector space, one can express similarity in meaning via a suitable distance metric within There are several approaches to the representation that space (usually cosine distance); furthermore, of word, phrase and sentence meaning. As nat- composition can be modelled via suitable linear- ural languages are highly creative and it is very algebraic operations. rare to see the same sentence twice, any practical approach dealing with large text segments must Co-occurrence-based word representations be compositional, constructing the meaning of One way to produce such vectorial representa- phrases and sentences from their constituent parts. tions is to directly exploit Harris (1954)’s intuition The ideal method would therefore express not that semantically similar words tend to appear in only the similarity in meaning between those con- similar contexts. We can construct a vector space stituent parts, but also between the results of their in which the dimensions correspond to contexts, composition, and do this in ways which fit with usually taken to be words as well. The word linguistic structure and generalisations thereof. vector components can then be calculated from the frequency with which a word has o-occurred Formal semantics Formal approaches to the with the corresponding contexts in a window of semantics of natural language have long built words, with a predefined length. upon the classical idea of compositionality – Table 1 shows 5 3-dimensional vectors for the that the meaning of a sentence is a function words Mary, John, girl, boy and idea. The words of the meanings of its parts (Frege, 1892). In philosophy, book and school signify vector space compositional type-logical approaches, predicate- dimensions. As the vector for boy is closer to girl argument structures representing phrases and sen- than it is to idea in the vector space—a direct con- tences are built from their constituent parts by β- sequence of the fact that boy’s contexts are simi- reduction within the lambda calculus framework lar to girl’s and dissimilar to idea’s—we can infer (Montague, 1970): for example, given a represen- that boy is semantically more similar to girl than tation of John as john0 and sleeps as λx.sleep0(x), to idea. the meaning of the sentence “John sleeps” Many variants of this approach exist: perfor- can be constructed as λx.sleep0(x)(john0) = mance on word similarity tasks has been shown sleep0(john0). Given a suitable pairing between to be improved by replacing raw counts with words and semantic representations of them, this weighted values (e.g. mutual information)—see method can produce structured sentential repre- (Turney et al., 2010) and below for discussion, and sentations with broad coverage and good gener- (Kiela and Clark, 2014) for a detailed comparison. alisability (see e.g. (Bos, 2008)). The above logical approach is extremely powerful because it can capture complex aspects of meaning such as quan- philosophy book school tifiers and their interaction (see e.g. (Copestake et al., 2005)), and enables inference using well stud- Mary 0 10 22 ied and developed logical methods (see e.g. (Bos John 4 60 59 and Gabsdil, 2000)). girl 0 19 93 boy 0 12 104 Distributional hypothesis However, such for- idea 10 47 39 mal approaches are less able to express similarity in meaning. We would like to capture the Table 1: Word co-occurrence frequencies ex- intuition that while John and Mary are distinct, tracted from the BNC (Leech et al., 1994). Neural word embeddings Deep learning tech- extending this to the sentence level and to more niques exploit the distributional hypothesis dif- complex semantic phenomena, though, depends ferently. Instead of relying on observed co- on their applicability within compositional mod- occurrence frequencies, a neural language model els, which is the subject of the next section. is trained to maximise some objective function re- lated to e.g. the probability of observing the sur- 3 Compositional models rounding words in some context (Mikolov et al., Compositional distributional models represent 2013b): meaning of a sequence of words by a vector, ob- T tained by combining meaning vectors of the words 1 X X log p(w jw ) (1) within the sequence using some vector composi- T t+j t t=1 −c≤j≤c;j6=0 tion operation. In a general classification of these models, one can distinguish between three broad Optimizing the above function, for example, pro- cases: simplistic models which combine word duces vectors which maximise the conditional vectors irrespective of their order or relation to one probability of observing words in a context around another, models which exploit linear word order, the target word wt, where c is the size of the and models which use grammatical structure. training window, and w1w2; ··· wT a sequence of The first approach combines word vectors words forming a training instance. Therefore, the by vector addition or point-wise multiplication resulting vectors will capture the distributional in- (Mitchell and Lapata, 2008)—as this is indepen- tuition and can express degrees of lexical similar- dent of word order, it cannot capture the differ- ity. ence between the two sentences “dogs chase cats” This method has an obvious advantage com- and “cats chase dogs”.

Evaluating Neural Word Representations in Tensor-Based

Arguments For: Semantic Folding and Hierarchical Temporal Memory

Embeddings in Natural Language Processing

Sociolinguistic Properties of Word Embeddings Introduction

Semantic Structure and Interpretability of Word Embeddings

A Compositional and Interpretable Semantic Space

A Vector Semantics Approach to the Geoparsing Disambiguation Task for Texts in Spanish

The Discursive Construction of Migrant Otherness on Facebook: a Distributional Semantics Approach

Enhancing Domain Word Embedding Via Latent Semantic Imputation

Natural Language Processing (CSEP 517): Distributional Semantics

Embeddings in Natural Language Processing

User Representation Learning for Social Networks: an Empirical Study

Learning Semantic Similarity in a Continuous Space