Evaluating Neural Word Representations in Tensor-Based Compositional Settings

Dmitrijs Milajevs(1)   Dimitri Kartsaklis(2)   Mehrnoosh Sadrzadeh(1)   Matthew Purver(1)

(1) School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End Road, London, UK
(2) Department of Computer Science, University of Oxford, Parks Road, Oxford, UK

{d.milajevs,m.sadrzadeh,m.purver}@qmul.ac.uk   [email protected]

Abstract

We provide a comparative study between neural word representations and traditional vector spaces based on co-occurrence counts, in a number of compositional tasks. We use three different semantic spaces and implement seven tensor-based compositional models, which we then test (together with simpler additive and multiplicative approaches) in tasks involving verb disambiguation and sentence similarity. To check their scalability, we additionally evaluate the spaces using simple compositional methods on larger-scale tasks with less constrained language: paraphrase detection and dialogue act tagging. In the more constrained tasks, co-occurrence vectors are competitive, although choice of compositional method is important; on the larger-scale tasks, they are outperformed by neural word embeddings, which show robust, stable performance across the tasks.

1 Introduction

Neural word embeddings (Bengio et al., 2006; Collobert and Weston, 2008; Mikolov et al., 2013a) have received much attention in the distributional semantics community, and have shown state-of-the-art performance in many natural language processing tasks. While they have been compared with co-occurrence-based models in simple similarity tasks at the word level (Levy et al., 2014; Baroni et al., 2014), we are aware of only one work that attempts a comparison of the two approaches in compositional settings (Blacoe and Lapata, 2012), and this is limited to additive and multiplicative composition, compared against composition via a neural autoencoder.

The purpose of this paper is to provide a more complete picture regarding the potential of neural word embeddings in compositional tasks, and to compare them meaningfully with the traditional distributional approach based on co-occurrence counts. We are especially interested in investigating the performance of neural word vectors in compositional models involving general mathematical composition operators, rather than in the more task- or domain-specific deep-learning compositional settings they have generally been used with so far (for example, by Socher et al. (2012), Kalchbrenner and Blunsom (2013) and many others).

In particular, this is the first large-scale study to date that applies neural word representations in tensor-based compositional distributional models of meaning similar to those formalized by Coecke et al. (2010). We test a range of implementations based on this framework, together with additive and multiplicative approaches (Mitchell and Lapata, 2008), in a variety of different tasks. Specifically, we use the verb disambiguation task of Grefenstette and Sadrzadeh (2011a) and the transitive sentence similarity task of Kartsaklis and Sadrzadeh (2014) as small-scale focused experiments on pre-defined sentence structures. Additionally, we evaluate our vector spaces on paraphrase detection (using the Microsoft Research Paraphrase Corpus of Dolan et al. (2005)) and on dialogue act tagging using the Switchboard Corpus (see e.g. Stolcke et al. (2000)).

In all of the above tasks, we compare the neural word embeddings of Mikolov et al. (2013a) with two vector spaces, both based on co-occurrence counts and produced by standard distributional techniques, as described in detail below. The general picture we get from the results is that in almost all cases the neural vectors are more effective than the traditional approaches.

We proceed as follows: Section 2 provides a concise introduction to distributional word representations in natural language processing. Section 3 takes a closer look at the subject of compositionality in vector space models of meaning and describes the range of compositional operators examined here. In Section 4 we provide details about the vector spaces used in the experiments. Our experimental work is described in detail in Section 5, and the results are discussed in Section 6. Finally, Section 7 provides conclusions.

2 Meaning representation

There are several approaches to the representation of word, phrase and sentence meaning. As natural languages are highly creative and it is very rare to see the same sentence twice, any practical approach dealing with large text segments must be compositional, constructing the meaning of phrases and sentences from their constituent parts. The ideal method would therefore express not only the similarity in meaning between those constituent parts, but also between the results of their composition, and do this in ways which fit with linguistic structure and generalisations thereof.

Formal semantics  Formal approaches to the semantics of natural language have long built upon the classical idea of compositionality – that the meaning of a sentence is a function of the meanings of its parts (Frege, 1892). In compositional type-logical approaches, predicate-argument structures representing phrases and sentences are built from their constituent parts by beta-reduction within the lambda calculus framework (Montague, 1970): for example, given a representation of John as john' and sleeps as λx.sleep'(x), the meaning of the sentence "John sleeps" can be constructed as λx.sleep'(x)(john') = sleep'(john'). Given a suitable pairing between words and semantic representations of them, this method can produce structured sentential representations with broad coverage and good generalisability (see e.g. Bos (2008)). The above logical approach is extremely powerful because it can capture complex aspects of meaning such as quantifiers and their interaction (see e.g. Copestake et al. (2005)), and it enables inference using well-studied and developed logical methods (see e.g. Bos and Gabsdil (2000)).

Distributional hypothesis  However, such formal approaches are less able to express similarity in meaning. We would like to capture the intuition that while John and Mary are distinct, they are rather similar to each other (both of them are humans) and dissimilar to words such as dog, pavement or idea. The same applies at the phrase and sentence level: "dogs chase cats" is similar in meaning to "hounds pursue kittens", but less so to "cats chase dogs" (despite the lexical overlap).

Distributional methods provide a way to address this problem. By representing words and phrases as vectors or tensors in a (usually highly dimensional) vector space, one can express similarity in meaning via a suitable distance metric within that space (usually cosine distance); furthermore, composition can be modelled via suitable linear-algebraic operations.

Co-occurrence-based word representations  One way to produce such vectorial representations is to directly exploit Harris (1954)'s intuition that semantically similar words tend to appear in similar contexts. We can construct a vector space in which the dimensions correspond to contexts, usually taken to be words as well. The word vector components can then be calculated from the frequency with which a word has co-occurred with the corresponding contexts in a window of words of a predefined length.

Table 1 shows five 3-dimensional vectors for the words Mary, John, girl, boy and idea. The words philosophy, book and school signify vector space dimensions. As the vector for boy is closer to girl than it is to idea in the vector space—a direct consequence of the fact that boy's contexts are similar to girl's and dissimilar to idea's—we can infer that boy is semantically more similar to girl than to idea.

          philosophy   book   school
  Mary             0     10       22
  John             4     60       59
  girl             0     19       93
  boy              0     12      104
  idea            10     47       39

Table 1: Word co-occurrence frequencies extracted from the BNC (Leech et al., 1994).

Many variants of this approach exist: performance on word similarity tasks has been shown to improve by replacing raw counts with weighted values (e.g. mutual information)—see (Turney et al., 2010) and below for discussion, and (Kiela and Clark, 2014) for a detailed comparison.
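As a concrete toy illustration, the counts in Table 1 can be treated directly as vectors and compared with cosine similarity. The sketch below uses the Table 1 figures only; it is not the experimental setup, where counts are weighted and spaces are far higher-dimensional:

```python
import numpy as np

# Raw co-occurrence counts from Table 1
# (dimensions: philosophy, book, school).
vectors = {
    "Mary": np.array([0.0, 10.0, 22.0]),
    "John": np.array([4.0, 60.0, 59.0]),
    "girl": np.array([0.0, 19.0, 93.0]),
    "boy":  np.array([0.0, 12.0, 104.0]),
    "idea": np.array([10.0, 47.0, 39.0]),
}

def cosine(v, w):
    """Cosine similarity, the usual distance metric for distributional vectors."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

# boy is closer to girl than to idea, as argued above.
print(cosine(vectors["boy"], vectors["girl"]) >
      cosine(vectors["boy"], vectors["idea"]))  # True
```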
Neural word embeddings  Deep learning techniques exploit the distributional hypothesis differently. Instead of relying on observed co-occurrence frequencies, a neural language model is trained to maximise some objective function related to e.g. the probability of observing the surrounding words in some context (Mikolov et al., 2013b):

  (1/T) Σ_{t=1..T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)    (1)

Optimizing the above function, for example, produces vectors which maximise the conditional probability of observing words in a context around the target word w_t, where c is the size of the training window and w_1 w_2 ··· w_T is a sequence of words forming a training instance. The resulting vectors will therefore capture the distributional intuition and can express degrees of lexical similarity.

This method has an obvious advantage compared to the co-occurrence method: since the context is now predicted, the model can in principle be much more robust to data sparsity problems, which are always an important issue for co-occurrence word spaces. Additionally, neural vectors have proven successful in other tasks (Mikolov et al., 2013c), since they seem to encode not only attributional similarity (the degree to which similar words are close to each other), but also relational similarity (Turney, 2006). For example, it is possible to extract the singular:plural relation (apple:apples, car:cars) using vector subtraction:

  apple - apples ≈ car - cars

Perhaps even more importantly, semantic relationships are preserved in a very intuitive way:

  king - man ≈ queen - woman

allowing the formation of analogy queries of the form king - man + woman = ?, obtaining queen as the result.[1]

Both neural and co-occurrence-based approaches have advantages over classical formal approaches in their ability to capture lexical semantics and degrees of similarity; their success at extending this to the sentence level and to more complex semantic phenomena, though, depends on their applicability within compositional models, which is the subject of the next section.

[1] Levy et al. (2014) improved Mikolov et al. (2013c)'s method of retrieving relational similarities by changing the underlying objective function.

3 Compositional models

Compositional distributional models represent the meaning of a sequence of words by a vector, obtained by combining the meaning vectors of the words within the sequence using some vector composition operation. In a general classification of these models, one can distinguish between three broad cases: simplistic models which combine word vectors irrespective of their order or relation to one another, models which exploit linear word order, and models which use grammatical structure.

The first approach combines word vectors by vector addition or point-wise multiplication (Mitchell and Lapata, 2008)—as this is independent of word order, it cannot capture the difference between the two sentences "dogs chase cats" and "cats chase dogs". The second approach has generally been implemented using some form of deep learning, and captures word order, though without necessarily caring about the grammatical structure of the sentence. Here, one works by recursively building and combining vectors for subsequences of words within the sentence using e.g. autoencoders (Socher et al., 2012) or convolutional filters (Kalchbrenner et al., 2014). We do not consider this approach in this paper. This is because, as mentioned in the introduction, such vectors and composition operators are task-specific, being trained directly to achieve specific objectives in certain pre-determined tasks. Here, we are interested in vector and composition operators that work for any compositional task, and which can be combined with results in linguistics and formal semantics to provide generalisable models that can extend to complex semantic phenomena.
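The order-insensitivity of the first approach can be seen directly in code. The sketch below uses made-up toy vectors purely for illustration:

```python
import numpy as np

# Hypothetical toy word vectors (values are arbitrary, for illustration only).
dogs  = np.array([1.0, 0.0, 2.0])
chase = np.array([0.0, 3.0, 1.0])
cats  = np.array([2.0, 1.0, 0.0])

add  = dogs + chase + cats    # additive model
mult = dogs * chase * cats    # point-wise multiplicative model

# Neither model can distinguish "dogs chase cats" from "cats chase dogs",
# since both + and * are commutative:
print(np.array_equal(add,  cats + chase + dogs))   # True
print(np.array_equal(mult, cats * chase * dogs))   # True
```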
The third (i.e. the grammatical) approach promises a way to achieve this, and has been instantiated in various ways in the work of Baroni and Zamparelli (2010), Grefenstette and Sadrzadeh (2011a) and Kartsaklis et al. (2012).

General framework  Formally, we can specify the vector representation of a word sequence w_1 w_2 ··· w_n as the vector s = w_1 ⋆ w_2 ⋆ ··· ⋆ w_n, where ⋆ is a vector operator such as addition +, point-wise multiplication ⊙, tensor product ⊗, or matrix multiplication ×.

In the simplest compositional models (the first approach described above), ⋆ is + or ⊙, e.g. see (Mitchell and Lapata, 2008). Grammar-based compositional models (the third approach) are based on a generalisation of the notion of vectors, known as tensors. Whereas a vector v is an element of an atomic vector space V, a tensor z is an element of a tensor space V ⊗ W ⊗ ··· ⊗ Z. The number of tensored spaces is referred to as the order of the space. Using a general duality theorem from multi-linear algebra (Bourbaki, 1989), it follows that tensors are in one-to-one correspondence with multi-linear maps; that is, we have:

  z ∈ V ⊗ W ⊗ ··· ⊗ Z  ≅  f_z : V → W → ··· → Z

In such a tensor-based formalism, meanings of nouns are vectors and meanings of predicates such as adjectives and verbs are tensors. The meaning of a string of words is obtained by applying the compositions of the multi-linear map duals of the tensors to the vectors. For the sake of demonstration, take the case of an intransitive sentence "Sbj Verb": the meaning of the subject is a vector Sbj ∈ V and the meaning of the intransitive verb is a tensor Verb ∈ V ⊗ W. The meaning of the sentence is obtained by applying f_Verb to Sbj, as follows:

  Sbj Verb = f_Verb(Sbj)

By tensor-map duality, the above becomes equivalent to the following, where composition has become the familiar notion of matrix multiplication, that is, ⋆ is ×:

  Verb × Sbj

In general, and for words with tensors of order higher than two, ⋆ becomes a generalisation of ×, referred to as tensor contraction; see e.g. (Kartsaklis and Sadrzadeh, 2013). Since the creation and manipulation of tensors of order higher than 2 is difficult, one can work with simplified versions of tensors, faithful to their underlying mathematical basis; these have found intuitive interpretations, e.g. see (Grefenstette and Sadrzadeh, 2011a; Kartsaklis and Sadrzadeh, 2014). In such cases, ⋆ becomes a combination of a range of operations such as ×, ⊗, or +.

Specific models  In the current paper we experiment with a variety of models. Table 2 presents these models in terms of their composition operators, with a reference to the main paper in which each model was introduced. For the simple compositional models the sentence is a string of any number of words; for the grammar-based models, we consider simple transitive sentences "Sbj Verb Obj" and introduce the following abbreviations for the concrete method used to build a tensor for the verb:

1. Verb is a verb matrix computed using the formula Σ_i Sbj_i ⊗ Obj_i, where Sbj_i and Obj_i are the subjects and objects of the verb across the corpus. These models are referred to as relational (Grefenstette and Sadrzadeh, 2011a); they are generalisations of the predicate semantics of transitive verbs, from pairs of individuals to pairs of vectors. The models reduce the order-3 tensor of a transitive verb to an order-2 tensor (i.e. a matrix).

2. Verb~ is a verb matrix computed using the formula Verb ⊗ Verb, where Verb is the distributional vector of the verb. These models are referred to as Kronecker, which is the term sometimes used to denote the outer product of tensors (Grefenstette and Sadrzadeh, 2011b). These models also reduce the order-3 tensor of a transitive verb to an order-2 tensor (i.e. a matrix).

3. The models of the last five lines of the table use the so-called Frobenius operators from categorical compositional distributional semantics (Kartsaklis et al., 2012) to expand the relational matrices of verbs from order 2 to order 3. The expansion is obtained either by copying the dimension of the subject into the space provided by the third tensor, hence referred to as Copy-Sbj, or by copying the dimension of the object into that space, hence referred to as Copy-Obj; furthermore, we can take the addition, multiplication, or outer product of these, which are referred to as Frobenius-Add, Frobenius-Mult, and Frobenius-Outer (Kartsaklis and Sadrzadeh, 2014).

  Method          Sentence       Linear algebraic formula                         Reference
  Addition        w1 w2 ··· wn   w1 + w2 + ··· + wn                               Mitchell and Lapata (2008)
  Multiplication  w1 w2 ··· wn   w1 ⊙ w2 ⊙ ··· ⊙ wn                               Mitchell and Lapata (2008)
  Relational      Sbj Verb Obj   Verb ⊙ (Sbj ⊗ Obj)                               Grefenstette and Sadrzadeh (2011a)
  Kronecker       Sbj Verb Obj   Verb~ ⊙ (Sbj ⊗ Obj)                              Grefenstette and Sadrzadeh (2011b)
  Copy object     Sbj Verb Obj   Sbj ⊙ (Verb × Obj)                               Kartsaklis et al. (2012)
  Copy subject    Sbj Verb Obj   Obj ⊙ (Verb^T × Sbj)                             Kartsaklis et al. (2012)
  Frob. add.      Sbj Verb Obj   (Sbj ⊙ (Verb × Obj)) + (Obj ⊙ (Verb^T × Sbj))    Kartsaklis and Sadrzadeh (2014)
  Frob. mult.     Sbj Verb Obj   (Sbj ⊙ (Verb × Obj)) ⊙ (Obj ⊙ (Verb^T × Sbj))    Kartsaklis and Sadrzadeh (2014)
  Frob. outer     Sbj Verb Obj   (Sbj ⊙ (Verb × Obj)) ⊗ (Obj ⊙ (Verb^T × Sbj))    Kartsaklis and Sadrzadeh (2014)

Table 2: Compositional methods.
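The formulas in Table 2 amount to a few lines of linear algebra. The sketch below is an illustrative toy implementation with random stand-in vectors and a single hypothetical corpus occurrence for the relational matrix; it shows the mechanics, not the experimental code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                       # toy dimensionality
sbj, obj, verb = rng.random(d), rng.random(d), rng.random(d)

# Relational verb matrix: a sum of subject/object outer products over the
# verb's corpus occurrences (here just one hypothetical occurrence).
V_rel = np.outer(sbj, obj)

# Kronecker verb matrix: outer product of the verb's own vector with itself.
V_kron = np.outer(verb, verb)

# Relational / Kronecker sentence representations: element-wise product of
# the verb matrix with the subject/object outer product.
rel_sentence  = V_rel  * np.outer(sbj, obj)
kron_sentence = V_kron * np.outer(sbj, obj)

# Copy-object and copy-subject reduce the verb matrix to a sentence vector.
copy_obj = sbj * (V_rel @ obj)
copy_sbj = obj * (V_rel.T @ sbj)

# Frobenius combinations of the two copies.
frob_add   = copy_obj + copy_sbj
frob_mult  = copy_obj * copy_sbj
frob_outer = np.outer(copy_obj, copy_sbj)
```

Note that the two matrix-producing models yield order-2 sentence representations, while the copy and Frobenius-Add/Mult models yield vectors, so comparison across models requires flattening or a shared space.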

4 Semantic word spaces

Co-occurrence-based vector space instantiations have received a lot of attention from the scientific community (refer to (Kiela and Clark, 2014; Polajnar and Clark, 2014) for recent studies). Therefore we instantiate two co-occurrence-based vector spaces with different underlying corpora and weighting schemes.

GS11  Our first word space is based on a typical configuration that has been used extensively in the past for compositional distributional models (see below for details), so it will serve as a useful baseline for the current work. In this vector space, the co-occurrence counts are extracted from the British National Corpus (BNC) (Leech et al., 1994). As basis words, we use the most frequent nouns, verbs, adjectives and adverbs (POS tags SUBST, VERB, ADJ and ADV in the BNC XML distribution[2]). The vector space is lemmatized, that is, it contains only "canonical" forms of words.

In order to weight the raw co-occurrence counts, we use positive point-wise mutual information (PPMI). The component value for a target word t and a context word c is given by:

  PPMI(t, c) = max(0, log (p(c|t) / p(c)))

where p(c|t) is the probability of word c given t in a symmetric window of length 5 and p(c) is the probability of c overall.

Vector spaces based on point-wise mutual information (or variants thereof) have been successfully applied in various distributional and compositional tasks; see e.g. (Grefenstette and Sadrzadeh, 2011a; Mitchell and Lapata, 2008; Levy et al., 2014) for details. PPMI has been shown to achieve state-of-the-art results (Levy et al., 2014) and is suggested by the review of Kiela and Clark (2014). Our use here of the BNC as a corpus and of a window length of 5 is based on the previous use and good performance of these parameters in a number of compositional experiments (Grefenstette and Sadrzadeh, 2011a; Grefenstette and Sadrzadeh, 2011b; Mitchell and Lapata, 2008; Kartsaklis et al., 2012).

KS14  In this variation, we train a vector space on the ukWaC corpus[3] (Ferraresi et al., 2008), originally using as a basis the 2,000 content words with the highest frequency (but excluding a list of stop words, as well as the 50 most frequent content words, since they exhibit low information content). The vector space is again lemmatized. As context we consider a 5-word window on either side of the target word, while as our weighting scheme we use local mutual information (i.e. point-wise mutual information multiplied by raw counts). In a further step, the vector space was normalized and projected onto a 300-dimensional space using singular value decomposition (SVD).

In general, dimensionality reduction produces more compact word representations that are robust against potential noise in the corpus (Landauer and Dumais, 1997; Schütze, 1997). SVD has been shown to perform well on a variety of tasks similar to ours (Baroni and Zamparelli, 2010; Kartsaklis and Sadrzadeh, 2014).

Neural word embeddings (NWE)  For our neural setting, we used the skip-gram model of Mikolov et al. (2013b) trained with negative sampling. The specific implementation tested in our experiments was a 300-dimensional vector space learned from the Google News corpus and provided by the word2vec[4] toolkit. Furthermore, the gensim library (Řehůřek and Sojka, 2010) was used for accessing the vectors. In contrast with the previously described co-occurrence vector spaces, this version is not lemmatized.

The negative sampling method improves the objective function of Equation 1 by introducing negative examples to the training algorithm. Assume that the probability of a specific (c, t) pair of words (where t is a target word and c another word in the same context as t), coming from the training data, is denoted by p(D = 1 | c, t). The objective function is then expressed as follows:

  ∏_{(c,t) ∈ D} p(D = 1 | c, t)    (2)

That is, the goal is to set the model parameters in a way that maximizes the probability of all observations coming from the training data. Assume now that D' is a set of randomly selected incorrect (c', t') pairs that do not occur in D; then Equation 2 above can be recast in the following way:

  ∏_{(c,t) ∈ D} p(D = 1 | c, t)  ∏_{(c',t') ∈ D'} p(D = 0 | c', t')    (3)

In other words, the model tries to distinguish a target word t from random draws that come from a noise distribution. In the implementation we used for our experiments, c is always selected from a 5-word window around t. More details about the negative sampling approach can be found in (Mikolov et al., 2013b); the note of Goldberg and Levy (2014) also provides an intuitive explanation of the underlying setting.

[2] http://www.natcorp.ox.ac.uk/
[3] http://wacky.sslmit.unibo.it/
[4] https://code.google.com/p/word2vec/
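As an illustrative sketch, the log of the recast objective in Equation 3 can be computed for toy vectors if one assumes, following the formulation discussed by Goldberg and Levy (2014), that p(D = 1 | c, t) is the sigmoid of the context/target dot product; this modelling choice is an assumption of the sketch, not a detail stated above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_log_likelihood(target, contexts, negatives):
    """Log of Equation 3: observed (c, t) pairs should be judged likely,
    while randomly drawn negative pairs should be judged unlikely.
    Assumes p(D = 1 | c, t) = sigmoid(c . t)."""
    pos = sum(np.log(sigmoid(c @ target)) for c in contexts)
    neg = sum(np.log(1.0 - sigmoid(c @ target)) for c in negatives)
    return float(pos + neg)

rng = np.random.default_rng(1)
t = rng.standard_normal(10)
contexts = [t + 0.1 * rng.standard_normal(10)]           # a correlated context word
negatives = [rng.standard_normal(10) for _ in range(5)]  # noise-distribution draws
score = neg_sampling_log_likelihood(t, contexts, negatives)
```

Training adjusts the word vectors to push this quantity up, which is what makes correlated target/context pairs end up close in the space.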

5 Experiments

Our experiments apply the vector spaces above, together with the compositional operators described in Section 3, to a range of tasks all of which require semantic composition: verb sense disambiguation, sentence similarity, paraphrasing, and dialogue act tagging.

5.1 Disambiguation

We use the transitive verb disambiguation dataset[5] described in (Grefenstette and Sadrzadeh, 2011a). The dataset consists of ambiguous transitive verbs together with their arguments, landmark verbs that identify one of the verb senses, and human judgements that specify how similar the disambiguated sense of the verb in the given context is to one of the landmarks. This is similar to the intransitive dataset described in (Mitchell and Lapata, 2008). Consider the sentence "system meets specification"; here, meets is the ambiguous transitive verb, and system and specification are its arguments in this context. Possible landmarks for meet are satisfy and visit; for this sentence, the human judgements show that the disambiguated meaning of the verb is more similar to the landmark satisfy and less similar to visit.

The task is to estimate the similarity of the sense of a verb in a context with a given landmark. To get our similarity measures, we compose the verb with its arguments using one of our compositional models; we do the same for the landmark, and then compute the cosine similarity of the two vectors. We evaluate the performance by averaging the human judgements for the same verb, argument and landmark entries, and calculating the Spearman correlation between the average values and the cosine scores. As a baseline, we use the correlation produced by using only the verb vector, without composing it with its arguments.

Table 3 shows the results of the experiment. NWE copy-object composition yields the best correlation with the human judgements, the top performance across all vector spaces and models, with a Spearman ρ of 0.456. For the KS14 space, the best result comes from Frobenius outer (0.350), while the best operator for the GS11 space is point-wise multiplication (0.348).

  Method           GS11    KS14    NWE
  Verb only        0.212   0.325   0.107
  Addition         0.103   0.275   0.149
  Multiplication   0.348   0.041   0.095
  Kronecker        0.304   0.176   0.117
  Relational       0.285   0.341   0.362
  Copy subject     0.089   0.317   0.131
  Copy object      0.334   0.331   0.456
  Frobenius add.   0.261   0.344   0.359
  Frobenius mult.  0.233   0.341   0.239
  Frobenius outer  0.284   0.350   0.375

Table 3: Spearman ρ correlations of models with human judgements for the word sense disambiguation task. The best result (NWE Copy object) outperforms the nearest co-occurrence-based competitor (KS14 Frobenius outer) with a statistically significant difference (p < 0.05, t-test).

[5] This and the sentence similarity dataset are available at http://www.cs.ox.ac.uk/activities/compdistmeaning/
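The compose-and-compare procedure of this task can be sketched as follows. The verb matrices and word vectors here are random stand-ins, so the resulting similarities are meaningless except as an illustration of the mechanics (using copy-object composition from Table 2):

```python
import numpy as np

def cosine(v, w):
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

def copy_object(verb_matrix, sbj, obj):
    """Copy-object composition from Table 2: Sbj * (Verb x Obj)."""
    return sbj * (verb_matrix @ obj)

rng = np.random.default_rng(2)
d = 6
system, specification = rng.random(d), rng.random(d)

# Hypothetical verb matrices: in the experiments these are built from
# corpus-wide subject/object pairs; here they are random stand-ins.
meets, satisfy, visit = (np.outer(rng.random(d), rng.random(d))
                         for _ in range(3))

sentence = copy_object(meets, system, specification)
sim_satisfy = cosine(sentence, copy_object(satisfy, system, specification))
sim_visit   = cosine(sentence, copy_object(visit,   system, specification))
# The larger similarity indicates the predicted sense of "meets" in context.
```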
Other composition Copy object 0.571 0.306 0.655 Kronecker methods, except for KS14, improve Frobenius add. 0.566 0.460 0.585 over the verb-only baselines. Finally we should Frobenius mult. 0.525 0.226 0.387 note that, despite the small training corpus, the Frobenius outer 0.560 0.439 0.622 GS11 vector space performs comparatively well; for example, Kronecker model improves the pre- Table 4: Results for sentence similarity. There viously reported score of 0.28 (Grefenstette and is no statistically significant difference between Sadrzadeh, 2011b). KS14 addition and NWE addition (the second best 5.2 Sentence similarity result). In this experiment we use the transitive sentence similarity dataset described in (Kartsaklis and Specifically, we get classification results on the Sadrzadeh, 2014). The dataset consists of transi- Microsoft Research Paraphrase Corpus paraphrase tive sentence pairs and a human similarity judge- corpus (Dolan et al., 2005) working in the fol- 6 ment . The task is to estimate similarity between lowing way: we construct vectors for the sen- two sentences. As in the disambiguation task, tences of each pair; if the cosine similarity be- we first compose word vectors to obtain sentence tween the two sentence vectors exceeds a certain vectors, then compute cosine similarity of them. threshold, the pair is classified as a paraphrase, We average the human judgements for identical otherwise as not a paraphrase. For this experi- sentence pairs to compute correlation with cosine ment and that of Section 5.4 below, we investigate scores. only the element-wise addition and multiplication Table 4 shows the results. Again, the best compositional models, since at their current stage performing vector space is KS14, but this time of development tensor-based models can only effi- with addition: the Spearman ρ correlation score ciently handle sentences of fixed structure. Never- with averaged human judgements is 0.732. 
Addi- theless, the simple point-wise compositional mod- tion was the means for the other vector spaces to els still allow for a direct comparison of the vector achieve top performance as well: GS11 and NWE spaces, which is the main goal of this paper. got 0.682 and 0.689 respectively. None of the models in tensor-based composi- For each vector space and model, a number of tion outperformed addition. KS14 performs worse different thresholds were tested on the first 2000 with tensor-based methods here than in the other pairs of the training set, which we used as a de- vector spaces. However, GS11 and NWE, except velopment set; in each case, the best-performed copy subject for both of them and Frobenius multi- threshold was selected for a single run of our plication for NWE, improved over their verb-only “classifier” on the test set (1726 pairs). Addition- baselines. ally, we evaluate the NWE model with a lemma- tized version of the corpus, so that the experimen- 5.3 Paraphrasing tal setup is maximally similar for all vector spaces. In this experiment we evaluate our vector spaces The results are shown in the first part of Table 5. on a mainstream paraphrase detection task. Additive NWE gives the highest performance, 6The textual content of this dataset is the same as that of with both lemmatized and un-lemmatized versions (Kartsaklis and Sadrzadeh, 2013), the difference is that the outperforming the GS11 and KS14 spaces. In dataset of (Kartsaklis and Sadrzadeh, 2014) has updated hu- the un-lemmatized case, the accuracy of our sim- man judgements whereas the previous dataset used the orig- inal annotations of the intransitive dataset of (Mitchell and ple “classifier” (0.73) is close to state-of-the-art Lapata, 2008). range. 
The state-of-the art result (0.77 accuracy Co-occurrence Neural word embeddings Baseline GS11 KS14 Unlemmatized Lemmatized Model Accuracy F-Score Accuracy F-Score Accuracy F-Score Accuracy F-Score Accuracy F-Score MSR addition 0.62 0.79 0.70 0.80 0.73 0.82 0.72 0.81 0.65 0.75 MSR multiplication 0.52 0.58 0.66 0.80 0.42 0.34 0.41 0.36 SWDA addition 0.35 0.35 0.40 0.35 0.63 0.60 0.44 0.40 0.60 0.58 SWDA multiplication 0.32 0.16 0.39 0.33 0.58 0.53 0.43 0.38

Table 5: Results for paraphrase detection (MSR) and dialog act tagging (SWDA) tasks. All top results significantly outperform corresponding nearest competitors (for accuracy): p < 0.05, χ2 test. and 0.84 F-score7) by the time of this writing has sification decision). The results are shown in the been obtained using 8 machine translation metrics second part of Table 5. and three constituent classifiers (Madnani et al., Un-lemmatized NWE addition gave the best ac- 2012). curacy (0.63) and F-score (0.60) (averaged over The multiplicative model gives lower results tag classes), i.e. similar results to (Milajevs and than the additive model across all vector spaces. Purver, 2014)—although note that the dimension- The KS14 vector space shows the steadiest per- ality of our NWE vectors is 10 times lower than formance, with a drop in accuracy of only 0.04 theirs. Multiplicative NWE outperformed the cor- and no drop in F-score, while for the GS11 and responding model in (Milajevs and Purver, 2014). NWE spaces both accuracy and F-score experi- In general, addition consistently outperforms mul- enced drops by more than 0.20. tiplication for all the models. Lemmatization dramatically lowers tagging accuracy: the lem- 5.4 Dialogue act tagging matized GS11, KS14 and NWE models perform As our last experiment, we evaluate the word much worse than un-lemmatized NWE, suggest- spaces on a dialogue act tagging task (Stolcke et ing that morphological features are important for al., 2000) over the Switchboard corpus (Godfrey this task. et al., 1992). Switchboard is a collection of ap- proximately 2500 dialogs over a telephone line by 6 Discussion 500 speakers from the U.S. on predefined topics.8 Previous comparisons of co-occurrence-based and The experiment pipeline follows (Milajevs and neural word vector representations vary widely Purver, 2014). The input utterances are prepro- in their conclusions. While Baroni et al. 
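A minimal sketch of the tagging pipeline just described (SVD reduction followed by a 5-nearest-neighbour vote), with two synthetic clusters standing in for the utterance vectors of two hypothetical tags:

```python
import numpy as np
from collections import Counter

def reduce_svd(matrix, k):
    """Project the row vectors onto their top-k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def knn_tag(train_vecs, train_tags, query, k=5):
    """Tag an utterance by majority vote over its k nearest neighbours
    by Euclidean distance, as in the pipeline described above."""
    dists = np.linalg.norm(train_vecs - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_tags[i] for i in nearest).most_common(1)[0][0]

rng = np.random.default_rng(4)
# Two synthetic clusters of "utterance vectors", one per hypothetical tag.
train = np.vstack([rng.normal(0.0, 0.1, (10, 20)),
                   rng.normal(1.0, 0.1, (10, 20))])
tags = ["statement"] * 10 + ["question"] * 10
query = rng.normal(1.0, 0.1, (1, 20))     # an unseen "question"-like utterance

# Reduce all vectors together, then classify the held-out query.
reduced = reduce_svd(np.vstack([train, query]), k=5)
predicted = knn_tag(reduced[:-1], tags, reduced[-1])
print(predicted)  # question
```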
(2014) cessed so that the parts of interrupted utterances conclude that “context-predicting models obtain are concatenated (Webb et al., 2005). Disfluency a thorough and resounding victory against their markers and commas are removed from the utter- count-based counterparts”, this seems to contra- ance raw texts. For GS11 and KS14 the utterance dict, at least at the first consideration, the more tokens are POS-tagged and lemmatized; for NWE, conservative conclusion of Levy et al. (2014) that we test the vectors in both a lemmatized and an “analogy recovery is not restricted to neural word un-lemmatized version of the corpus.9 We split embeddings [. . . ] a similar amount of relational the training and testing utterances as suggested by similarities can be recovered from traditional dis- Stolcke et al. (2000). Utterance vectors are then tributional word representations” and the findings obtained as in the previous experiments; they are of Blacoe and Lapata (2012) that “shallow ap- reduced to 50 dimensions using SVD and a k- proaches are as good as more computationally in- nearest-neighbour classifier is trained on these re- tensive alternatives” on phrase similarity and para- duced utterance vectors (the 5 closest neighbours phrase detection tasks. by Euclidean distance are retrieved to make a clas- It seems clear that neural word embeddings have an advantage when used in tasks for which 7F-scores use the standard definition F = 2(precision ∗ recall)/(precision + recall). they have been trained; our main questions here 8The dataset and a Python interface to it are available are whether they outperform co-occurrence based at http://compprag.christopherpotts.net/ alternatives across the board; and which ap- swda.html 9We use WordNetLemmatizer of the NLTK library proach lends itself better to composition using (Bird, 2006). general mathematical operators. 
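To make the tagging pipeline of Section 5.4 concrete, its steps (additive composition of word vectors into utterance vectors, SVD reduction, and 5-nearest-neighbour classification by Euclidean distance) can be sketched as below. This is a toy sketch only: the embeddings are random 8-dimensional stand-ins for the real pre-trained NWE vectors, and all names (EMB, utterance_vector, and so on) are our own, not taken from the paper's implementation.

```python
import numpy as np

# Toy stand-ins for the neural word embeddings; the real experiment
# loads pre-trained vectors instead of sampling random ones.
rng = np.random.default_rng(0)
DIM = 8
EMB = {w: rng.normal(size=DIM)
       for w in ["yeah", "right", "okay", "what", "why", "who"]}

def utterance_vector(tokens):
    """Additive composition: sum the vectors of the known tokens."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.sum(vecs, axis=0) if vecs else np.zeros(DIM)

def reduce_svd(matrix, k):
    """Project the row vectors onto their first k singular directions
    (the paper reduces utterance vectors to 50 dimensions)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def knn_tag(train_x, train_y, x, k=5):
    """Tag x by majority vote over the k nearest training utterances,
    with nearness measured by Euclidean distance."""
    dists = np.linalg.norm(train_x - x, axis=1)
    nearest = [train_y[i] for i in np.argsort(dists)[:k]]
    return max(set(nearest), key=nearest.count)
```

In the real setting, train_x would hold the SVD-reduced vectors of the training utterances from the split of Stolcke et al. (2000), and k is 5.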
To partially answer this question, we can compare model behaviour against the baselines in isolation. For the disambiguation and sentence similarity tasks the baseline is the similarity between the verbs only, ignoring the context (see above). For the paraphrase task, we take the global vector-based similarity reported in Mihalcea et al. (2006): 0.65 accuracy and 0.75 F-score. For the dialogue act tagging task the baseline is the accuracy of the bag-of-unigrams model in Milajevs and Purver (2014): 0.60.

Task                  GS11  KS14  NWE
Disambiguation         +     +     +
Sentence similarity    +     −     +
Paraphrase             −     +     +
Dialogue act tagging   −     −     +

Table 6: Summary of vector space performance against the baselines. General improvement (cases where more than half of the models perform better) and decrease with regard to the corresponding baseline are marked by + and − respectively. A bold value means that the model gave the best result in the task.

Sections 5.1 and 5.2 show that although the best choice of vector representation might vary, for small-scale tasks all methods give fairly competitive results. The choice of compositional operator seems to be more important and more task-specific: while a tensor-based operation (Frobenius copy-object) performs best for verb disambiguation, the best result for sentence similarity is achieved by a simple additive model, with all other compositional methods behaving worse than the verb-only baseline in the KS14 case. GS11 and NWE, on the other hand, outperform their baselines with a number of compositional methods, although both of them achieve lower performance than KS14 overall.

Based only on the small-scale experiment results, one could conclude that there is little significant difference between the two ways of obtaining vectors. GS11 and NWE show similar behaviour in comparison to their baselines, while it is possible to tune a co-occurrence-based vector space (KS14) and obtain the best result. Large-scale tasks reveal another pattern: the GS11 vector space, which behaves stably on the small scale, lags behind the KS14 and NWE spaces in the paraphrase detection task. In addition, NWE consistently yields the best results. Finally, only the NWE space was able to provide adequate results on the dialogue act tagging task. Table 6 summarizes model performance with regard to the baselines.

On small-scale tasks, where the sentence structures are predefined and relatively constrained, NWE gives better or similar results to count-based vectors. Tensor-based composition does not always outperform simple compositional operators, but in most cases gives results within the same range.

On large-scale tasks, neural vectors are more successful than the co-occurrence-based alternatives. However, this study does not reveal whether this is because of their neural nature, or just because they are trained on a larger amount of data. The question of whether neural vectors outperform co-occurrence vectors therefore requires further detailed comparison to be entirely resolved; our experiments suggest that this is indeed the case in large-scale tasks, but the difference in size and nature of the original corpora may be a confounding factor. In any case, it is clear that the neural vectors of the word2vec package perform steadily off-the-shelf across a large variety of tasks. The size of the vector space (3 million words) and the available code-base that simplifies access to the vectors make this set a good and safe choice for future experiments. Of course, even better performance can be achieved by training neural language models specifically for a given task (see e.g. Kalchbrenner et al. (2014)).
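The simple point-wise operators compared throughout this discussion can be stated in a few lines. The sketch below is our own illustration with hypothetical three-dimensional vectors; it shows additive and multiplicative composition together with the cosine measure used to compare the resulting sentence vectors.

```python
import numpy as np

def compose_add(word_vectors):
    """Additive composition: element-wise sum of the word vectors."""
    return np.sum(word_vectors, axis=0)

def compose_mult(word_vectors):
    """Multiplicative composition: element-wise product of the word vectors."""
    return np.prod(word_vectors, axis=0)

def cosine(a, b):
    """Cosine similarity between two composed vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors for the words of a short sentence.
dogs = np.array([0.5, 2.0, 0.1])
bark = np.array([1.0, 1.5, 0.2])

sentence_add = compose_add([dogs, bark])    # element-wise sum
sentence_mul = compose_mult([dogs, bark])   # element-wise product
```

In the experiments above, exactly these operators are applied to both count-based and neural vectors; only the source of the word vectors changes.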
7 Conclusion

In this work we compared the performance of two co-occurrence-based semantic spaces with vectors learned by a neural network in compositional settings. We carried out two small-scale tasks (word sense disambiguation and sentence similarity) and two large-scale tasks (paraphrase detection and dialogue act tagging).

The choice of compositional operator (tensor-based or a simple point-wise operation) depends strongly on the task and dataset: tensor-based composition performed best on the verb disambiguation task, where the verb senses depend strongly on the arguments of the verb. However, the choice seems to depend less on the nature of the vectors themselves: in the disambiguation task, tensor-based composition proved best for both co-occurrence-based and neural vectors; in the sentence similarity task, where point-wise operators proved best, this was again true across vector spaces.

Acknowledgements

We would like to thank the three anonymous reviewers for their fruitful comments. Support by EPSRC grant EP/F042728/1 is gratefully acknowledged by Milajevs, Kartsaklis and Sadrzadeh. Purver is partly supported by ConCreTe: the project ConCreTe acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET grant number 611733.

References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguistics.
Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1.
Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer.
Steven Bird. 2006. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL Interactive Presentation Sessions, pages 69–72. Association for Computational Linguistics.
William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.
Johan Bos and Malte Gabsdil. 2000. First-order inference and the interpretation of questions and answers. In Proceedings of Götalog, pages 43–50.
Johan Bos. 2008. Wide-coverage semantic analysis with Boxer. In Johan Bos and Rodolfo Delmonte, editors, Semantics in Text Processing: STEP 2008 Conference Proceedings, Research in Computational Semantics, pages 277–286. College Publications.
N. Bourbaki. 1989. Commutative Algebra: Chapters 1–7. Springer-Verlag, Berlin/New York.
Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2010. Mathematical foundations for a compositional distributional model of meaning. CoRR, abs/1003.4394.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.
Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal recursion semantics: An introduction. Research on Language and Computation, 3(2–3):281–332.
Bill Dolan, Chris Brockett, and Chris Quirk. 2005. Microsoft Research Paraphrase Corpus. Retrieved May 29, 2013.
Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google?, pages 47–54.
Gottlob Frege. 1892. On sense and reference. In Ludlow (1997), pages 563–584.
John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), volume 1, pages 517–520. IEEE.
Yoav Goldberg and Omer Levy. 2014. word2vec Explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011a. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1394–1404. Association for Computational Linguistics.
Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011b. Experimenting with transitive verbs in a DisCoCat. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 62–66, Edinburgh, UK, July. Association for Computational Linguistics.
Z. S. Harris. 1954. Distributional structure. Word.
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 119–126, Sofia, Bulgaria, August. Association for Computational Linguistics.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June.
Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2013. Prior disambiguation of word tensors for constructing sentence vectors. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1590–1601, Seattle, USA, October. Association for Computational Linguistics.
Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. A study of entanglement in a categorical framework of natural language. In Proceedings of the 11th Workshop on Quantum Physics and Logic (QPL), Kyoto, Japan, June.
Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Stephen Pulman. 2012. A unified sentence space for categorical distributional-compositional semantics: Theory and experiments. In Proceedings of COLING 2012: Posters, pages 549–558, Mumbai, India, December. The COLING 2012 Organizing Committee.
Douwe Kiela and Stephen Clark. 2014. A systematic study of semantic vector space model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), pages 21–30, Gothenburg, Sweden, April. Association for Computational Linguistics.
T. Landauer and S. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review.
Geoffrey Leech, Roger Garside, and Michael Bryant. 1994. CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th Conference on Computational Linguistics, Volume 1, pages 622–628. Association for Computational Linguistics.
Omer Levy, Yoav Goldberg, and Israel Ramat-Gan. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Baltimore, Maryland, USA, June. Association for Computational Linguistics.
Nitin Madnani, Joel Tetreault, and Martin Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 182–190. Association for Computational Linguistics.
Rada Mihalcea, Courtney Corley, and Carlo Strapparava. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI, volume 6, pages 775–780.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, pages 746–751.
Dmitrijs Milajevs and Matthew Purver. 2014. Investigating the contribution of distributional semantic information for dialogue act classification. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), pages 40–47, Gothenburg, Sweden, April. Association for Computational Linguistics.
Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244. Association for Computational Linguistics.
Richard Montague. 1970. Universal grammar. Theoria, 36(3):373–398.
Tamara Polajnar and Stephen Clark. 2014. Improving distributional semantic vectors through context selection and normalisation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 230–238, Gothenburg, Sweden, April. Association for Computational Linguistics.
Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.
Hinrich Schütze. 1997. Ambiguity resolution in natural language learning. CSLI, Stanford, CA, 4:12–36.
Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.
Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Carol Van Ess-Dykema, Rachel Martin, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.
Peter D. Turney, Patrick Pantel, et al. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.
Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.
Nick Webb, Mark Hepple, and Yorick Wilks. 2005. Dialogue act classification based on intra-utterance features. In Proceedings of the AAAI Workshop on Spoken Language Understanding. Citeseer.