
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training

Massimiliano Mancini*, Jose Camacho-Collados*, Ignacio Iacobacci and Roberto Navigli
Department of Computer Science, Sapienza University of Rome
[email protected], {collados,iacobacci,navigli}@di.uniroma1.it

Abstract

Word embeddings are widely used in Natural Language Processing, mainly due to their success in capturing semantic information from massive corpora. However, their creation process does not allow the different meanings of a word to be automatically separated, as it conflates them into a single vector. We address this issue by proposing a new model which learns word and sense embeddings jointly. Our model exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings. We evaluate the main features of our approach both qualitatively and quantitatively in a variety of tasks, highlighting the advantages of the proposed method in comparison to state-of-the-art word- and sense-based models.

1 Introduction

Recently, approaches based on neural networks which embed words into low-dimensional vector spaces from text corpora (i.e. word embeddings) have become increasingly popular (Mikolov et al., 2013; Pennington et al., 2014). Word embeddings have proved to be beneficial in many Natural Language Processing tasks, such as Machine Translation (Zou et al., 2013), syntactic parsing (Weiss et al., 2015), and Question Answering (Bordes et al., 2014), to name a few. Despite their success in capturing semantic properties of words, these representations are generally hampered by an important limitation: the inability to discriminate among different meanings of the same word.

Previous works have addressed this limitation by automatically inducing word senses from monolingual corpora (Schütze, 1998; Reisinger and Mooney, 2010; Huang et al., 2012; Di Marco and Navigli, 2013; Neelakantan et al., 2014; Tian et al., 2014; Li and Jurafsky, 2015; Vu and Parker, 2016; Qiu et al., 2016), or bilingual parallel data (Guo et al., 2014; Ettinger et al., 2016; Šuster et al., 2016). However, these approaches learn solely on the basis of statistics extracted from text corpora and do not exploit knowledge from semantic networks. Additionally, their induced senses are neither readily interpretable (Panchenko et al., 2017) nor easily mappable to lexical resources, which limits their application.

Recent approaches have utilized semantic networks to inject knowledge into existing word representations (Yu and Dredze, 2014; Faruqui et al., 2015; Goikoetxea et al., 2015; Speer and Lowry-Duda, 2017; Mrksic et al., 2017), but without solving the meaning conflation issue. In order to obtain a representation for each sense of a word, a number of approaches have leveraged lexical resources to learn sense embeddings as a result of post-processing conventional word embeddings (Chen et al., 2014; Johansson and Pina, 2015; Jauhar et al., 2015; Rothe and Schütze, 2015; Pilehvar and Collier, 2016; Camacho-Collados et al., 2016).

Instead, we propose SW2V (Senses and Words to Vectors), a neural model that exploits knowledge from both text corpora and semantic networks in order to simultaneously learn embeddings for both words and senses. Moreover, our model provides three additional key features: (1) both word and sense embeddings are represented in the same vector space, (2) it is flexible, as it can be applied to different predictive models, and (3) it is scalable for very large semantic networks and text corpora.

* Authors marked with an asterisk (*) contributed equally.

Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 100-111, Vancouver, Canada, August 3 - August 4, 2017. © 2017 Association for Computational Linguistics

2 Related work

Embedding words from large corpora into a low-dimensional vector space has been a popular task since the appearance of the probabilistic feedforward neural network language model (Bengio et al., 2003) and later developments such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). However, little research has focused on exploiting lexical resources to overcome the inherent ambiguity of word embeddings.

Iacobacci et al. (2015) overcame this limitation by applying an off-the-shelf disambiguation system (i.e. Babelfy (Moro et al., 2014)) to a corpus and then using word2vec to learn sense embeddings over the pre-disambiguated text. However, in their approach words are replaced by their intended senses, consequently producing as output sense representations only. The representation of words and senses in the same vector space proves essential for applying these knowledge-based sense embeddings in downstream applications, particularly for their integration into neural architectures (Pilehvar et al., 2017). In the literature, various different methods have attempted to overcome this limitation. Chen et al. (2014) proposed a model for obtaining both word and sense representations based on a first training step of conventional word embeddings, a second disambiguation step based on sense definitions, and a final training phase which uses the disambiguated text as input. Likewise, Rothe and Schütze (2015) aimed at building a shared space of word and sense embeddings based on two steps: a first training step of only word embeddings and a second training step to produce sense and synset embeddings. These two approaches require multiple steps of training and make use of a relatively small resource like WordNet, which limits their coverage and applicability. Camacho-Collados et al. (2016) increased the coverage of these WordNet-based approaches by exploiting the complementary knowledge of WordNet and Wikipedia along with pre-trained word embeddings. Finally, Wang et al. (2014) and Fang et al. (2016) proposed a model to align vector spaces of words and entities from knowledge bases. However, these approaches are restricted to nominal instances only (i.e. Wikipedia pages or entities).

In contrast, we propose a model which learns both word and sense embeddings from a single joint training phase, producing a common vector space of words and senses as an emerging feature.

3 Connecting words and senses in context

In order to jointly produce embeddings for words and senses, SW2V needs as input a corpus where words are connected to senses in each given context.[1] One option for obtaining such connections could be to take a sense-annotated corpus as input. However, manually annotating large amounts of data is extremely expensive and therefore impractical in normal settings. Obtaining sense-annotated data from current off-the-shelf disambiguation and entity linking systems is possible, but generally suffers from two major problems. First, supervised systems are hampered by the very same problem of needing large amounts of sense-annotated data. Second, the relatively slow speed of current disambiguation systems, such as graph-based approaches (Hoffart et al., 2012; Agirre et al., 2014; Moro et al., 2014), or word-expert supervised systems (Zhong and Ng, 2010; Iacobacci et al., 2016; Melamud et al., 2016), could become an obstacle when applied to large corpora.

This is the reason why we propose a simple yet effective unsupervised shallow word-sense connectivity algorithm, which can be applied to virtually any given semantic network and is linear in the corpus size. The main idea of the algorithm is to exploit the connections of a semantic network by associating words with the senses that are most connected within the sentence, according to the underlying network.

Shallow word-sense connectivity algorithm. Formally, a corpus and a semantic network are taken as input and a set of connected words and senses is produced as output. We define a semantic network as a graph (S, E) where the set S contains synsets (nodes) and E represents a set of semantically connected synset pairs (edges). Algorithm 1 describes how to connect words and senses in a given text (sentence or paragraph) T. First, we gather in a set S_T all candidate synsets of the words (including multiwords up to trigrams)[2] in T (lines 1 to 3). Second, for each candidate synset s we calculate the number of synsets which are connected with s in the semantic network and are included in S_T, excluding connections of synsets which only appear as candidates of the same word (lines 5 to 10). Finally, each word is associated with its top candidate synset(s) according to its/their number of connections in context, provided that the number of connections exceeds a threshold θ = (|S_T| + |T|) / (2δ) (lines 11 to 17).

[1] In this paper we focus on senses but other items connected to words may be used (e.g. supersenses or images).
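The procedure above can be sketched as follows. This is a minimal illustration with toy data structures and names of our own choosing, not the authors' released implementation; the semantic network is given as a set of undirected synset pairs, and the word-to-candidate-synset mapping is assumed to be precomputed from the lexical resource:

```python
def connect_words_and_senses(text, candidates, edges, delta=100):
    """Associate each word in `text` (a list of tokens) with its most
    connected candidate senses.  `candidates` maps a word to its set of
    candidate synsets; `edges` is a set of frozenset synset pairs."""
    # Lines 1-3: gather all candidate synsets of the words in the text.
    S_T = set()
    for w in text:
        S_T |= candidates.get(w, set())
    # Line 4: minimum-connections threshold theta = (|S_T| + |T|) / (2*delta).
    theta = (len(S_T) + len(text)) / (2 * delta)
    output = set()
    for w in text:
        best, max_n = set(), 0
        for s in candidates.get(w, set()):
            # Lines 9-10: synsets connected to s that are candidates of
            # at least one *other* word of the text.
            n = len({s2 for w2 in text if w2 != w
                     for s2 in candidates.get(w2, set())
                     if frozenset((s, s2)) in edges})
            # Lines 11-16: keep the top candidate sense(s) above the threshold.
            if n >= max_n and n >= theta:
                if n > max_n:
                    best, max_n = {(w, s)}, n
                else:
                    best.add((w, s))
        output |= best
    return output
```

For instance, in a toy network where the botanical senses of plant, tree and leaf are interconnected, only those senses survive the selection, while an unconnected sense such as the factory sense of plant is discarded.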

Algorithm 1 Shallow word-sense connectivity
Input: semantic network (S, E) and text T represented as a bag of words
Output: set of connected words and senses T* ⊆ T × S
 1:  Set of synsets S_T ← ∅
 2:  for each word w ∈ T
 3:      S_T ← S_T ∪ S_w   (S_w: set of candidate synsets of w)
 4:  Minimum connections threshold θ ← (|S_T| + |T|) / (2δ)
 5:  Output set of connections T* ← ∅
 6:  for each w ∈ T
 7:      Relative maximum connections max ← 0
 8:      Set of senses associated with w, C_w ← ∅
 9:      for each candidate synset s ∈ S_w
10:          Number of edges n = |{s' ∈ S_T : (s, s') ∈ E & ∃ w' ∈ T : w' ≠ w & s' ∈ S_{w'}}|
11:          if n ≥ max & n ≥ θ then
12:              if n > max then
13:                  C_w ← {(w, s)}
14:                  max ← n
15:              else
16:                  C_w ← C_w ∪ {(w, s)}
17:      T* ← T* ∪ C_w
18:  return output set of connected words and senses T*

This parameter aims to retain relevant connectivity across senses, as only senses above the threshold will be connected to words in the output corpus. θ is proportional to the reciprocal of a parameter δ,[3] and directly proportional to the average text length and number of candidate synsets within the text.

The complexity of the proposed algorithm is N + (N × α), where N is the number of words of the training corpus and α is the average polysemy degree of a word in the corpus according to the input semantic network. Considering that non-content words are not taken into account (i.e. polysemy degree 0) and that the average polysemy degree of words in current lexical resources (e.g. WordNet or BabelNet) does not exceed a small constant (3) in any language, we can safely assume that the algorithm is linear in the size of the training corpus. Hence, the training time is not significantly increased in comparison to training on words only, irrespective of the corpus size. This enables fast training on large amounts of text corpora, in contrast to current unsupervised disambiguation algorithms. Additionally, as we will show in Section 5.2, this algorithm not only speeds up the training phase significantly, but also leads to more accurate results.

Note that with our algorithm a word is allowed to have more than one sense associated. In fact, current lexical resources like WordNet (Miller, 1995) or BabelNet (Navigli and Ponzetto, 2012) are hampered by the high granularity of their sense inventories (Hovy et al., 2013). In Section 6.2 we show how our sense embeddings are particularly suited to deal with this issue.

[2] As mentioned above, all unigrams, bigrams and trigrams present in the semantic network are considered. In the case of overlapping instances, the selection of the final instance is performed in this order: mention whose synset is more connected (i.e. n is higher), longer mention, and from left to right.
[3] Higher values of δ lead to higher recall, while lower values of δ increase precision but lower the recall. We set the value of δ to 100, as it was shown to produce a fine balance between precision and recall. This parameter may also be tuned on downstream tasks.

4 Joint training of words and senses

The goal of our approach is to obtain a shared vector space of words and senses. To this end, our model extends conventional word embedding models by integrating explicit knowledge into its architecture. While we will focus on the Continuous Bag Of Words (CBOW) architecture of word2vec (Mikolov et al., 2013), our extension can easily be applied similarly to Skip-Gram, or to other predictive approaches based on neural networks. The CBOW architecture is based on the feedforward neural network language model (Bengio et al., 2003) and aims at predicting the current word using its surrounding context. The architecture consists of input, hidden and output layers. The input layer has the size of the word vocabulary and encodes the context as a combination of one-hot vector representations of the surrounding words of a given target word. The output layer has the same size as the input layer and contains a one-hot vector of the target word during the training phase.

Our model extends the input and output layers of the neural network with word senses[4] by exploiting the intrinsic relationship between words and senses. The leading principle is that, since a word is the surface form of an underlying sense, updating the embedding of the word should produce a consequent update to the embedding representing that particular sense, and vice-versa. As a consequence of the algorithm described in the previous section, each word in the corpus may be connected with zero, one or more senses. We refer to the set of senses connected to a given word within the specific context as its associated senses.

[4] Our model can also produce a space of words and synset embeddings as output: the only difference is that all synonym senses would be considered to be the same item, i.e. a synset.
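This shared-update principle can be illustrated with a toy snippet. It is our own simplification, not the actual implementation (which extends word2vec's training code): a gradient applied to a word vector is also applied to the vectors of its associated senses, and symmetrically a sense-side gradient would reach the word.

```python
def joint_update(word_vecs, sense_vecs, word, assoc_senses, grad, lr=0.025):
    """Apply one SGD step to a word vector and propagate the same
    gradient to the senses associated with that word in the current
    context.  Vectors are plain lists of floats (toy representation)."""
    word_vecs[word] = [v - lr * g for v, g in zip(word_vecs[word], grad)]
    for s in assoc_senses:
        sense_vecs[s] = [v - lr * g for v, g in zip(sense_vecs[s], grad)]
```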

Figure 1: The SW2V architecture on a sample training instance using four context words. Dotted lines represent the virtual link between words and associated senses in context. In this example, the input layer consists of a context of two previous words (w_{t-2}, w_{t-1}) and two subsequent words (w_{t+1}, w_{t+2}) with respect to the target word w_t. Two words (w_{t-1}, w_{t+2}) do not have senses associated in context, while w_{t-2} and w_{t+1} have three senses (s^1_{t-2}, s^2_{t-2}, s^3_{t-2}) and one sense (s^1_{t+1}) associated in context, respectively. The output layer consists of the target word w_t, which has two senses associated (s^1_t, s^2_t) in context.

Formally, we define a training instance as a sequence of words W = w_{t-n}, ..., w_t, ..., w_{t+n} (being w_t the target word) and S = S_{t-n}, ..., S_t, ..., S_{t+n}, where S_i = s^1_i, ..., s^{k_i}_i is the sequence of all associated senses in context of w_i ∈ W. Note that S_i might be empty if the word w_i does not have any associated sense.

In our model each target word takes as context both its surrounding words and all the senses associated with them. In contrast to the original CBOW architecture, where the training criterion is to correctly classify w_t, our approach aims to predict the word w_t and its set S_t of associated senses. This is equivalent to minimizing the following loss function:

  E = -log(p(w_t | W^t, S^t)) - Σ_{s ∈ S_t} log(p(s | W^t, S^t))

where W^t = w_{t-n}, ..., w_{t-1}, w_{t+1}, ..., w_{t+n} and S^t = S_{t-n}, ..., S_{t-1}, S_{t+1}, ..., S_{t+n}. Figure 1 shows the organization of the input and the output layers on a sample training instance. In what follows we present a set of variants of the model on the output and the input layers.

4.1 Output layer alternatives

Both words and senses. This is the default case explained above. If a word has one or more associated senses, these senses are also used as target on a separate output layer.

Only words. In this case we exclude senses as target. There is a single output layer with the size of the word vocabulary, as in the original CBOW model.

Only senses. In contrast, this alternative excludes words, using only senses as target. In this case, if a word does not have any associated sense, it is not used as a target instance.

4.2 Input layer alternatives

Both words and senses. Words and their associated senses are included in the input layer and contribute to the hidden state. Both words and senses are updated as a consequence of the backpropagation algorithm.

Only words. In this alternative only the surrounding words contribute to the hidden state, i.e. the target word/sense (depending on the alternative of the output layer) is predicted only from word features. The update of an input word is propagated to the embeddings of its associated senses, if any. In other words, despite not being included in the input layer, senses still receive the same gradient as the associated input word, through a virtual connection. This configuration, coupled with the only-words output layer configuration, corresponds exactly to the default CBOW architecture of word2vec with the only addition of the update step for senses.

Only senses. Words are excluded from the input layer and the target is predicted only from the senses associated with the surrounding words. The weights of the words are updated through the updates of the associated senses, in contrast to the only-words alternative.

5 Analysis of Model Components

In this section we analyze the different components of SW2V, including the nine model configurations (Section 5.1) and the algorithm which generates the connections between words and senses in context (Section 5.2). In what follows we describe the common analysis setting:

- Training model and hyperparameters. For evaluation purposes, we use the CBOW model of word2vec with standard hyperparameters: the dimensionality of the vectors is set to 300 and the window size to 8, and hierarchical softmax is used for normalization. These hyperparameter values are set across all experiments.

- Corpus and semantic network. We use a 300M-words corpus from the UMBC project (Han et al., 2013), which contains English paragraphs extracted from the web.[5] As semantic network we use BabelNet 3.0[6], a large multilingual semantic network with over 350 million semantic connections, integrating resources such as Wikipedia and WordNet. We chose BabelNet owing to its wide coverage of named entities and lexicographic knowledge.

- Benchmark. Word similarity has been one of the most popular benchmarks for in-vitro evaluation of vector space models (Pennington et al., 2014; Levy et al., 2015). For the analysis we use two word similarity datasets: the similarity portion (Agirre et al., 2009, WS-Sim) of the WordSim-353 dataset (Finkelstein et al., 2002) and RG-65 (Rubenstein and Goodenough, 1965). In order to compute the similarity of two words using our sense embeddings, we apply the standard closest senses strategy (Resnik, 1995; Budanitsky and Hirst, 2006; Camacho-Collados et al., 2015), using cosine similarity (cos) as comparison measure between senses:

      sim(w1, w2) = max_{s1 ∈ S_{w1}, s2 ∈ S_{w2}} cos(~s1, ~s2)    (1)

  where S_{wi} represents the set of all candidate senses of w_i and ~s_i refers to the sense vector representation of the sense s_i.

5.1 Model configurations

In this section we analyze the different configurations of our model with respect to the input and the output layer on a word similarity experiment. Recall from Section 4 that our model can have words, senses or both in either the input or the output layer. Table 1 shows the results of all nine configurations on the WS-Sim and RG-65 datasets.

As shown in Table 1, the best configuration according to both Spearman and Pearson correlation measures is the one which has only senses in the input layer and both words and senses in the output layer.[7] In fact, taking only senses as input seems to be consistently the best alternative for the input layer. Our hunch is that the knowledge learned from both the co-occurrence information and the semantic network is more balanced with this input setting. For instance, in the case of including both words and senses in the input layer, the co-occurrence information learned by the network would be duplicated for both words and senses.

5.2 Disambiguation / Shallow word-sense connectivity algorithm

In this section we evaluate the impact of our shallow word-sense connectivity algorithm (Section 3) by testing our model directly taking a pre-disambiguated text as input. In this case the network exploits the connections between each word and its disambiguated sense in context. For this comparison we used Babelfy[8] (Moro et al., 2014), a state-of-the-art graph-based disambiguation and entity linking system based on BabelNet. We compare to both the default Babelfy system, which uses the Most Common Sense (MCS) heuristic as a back-off strategy, and, following Iacobacci et al. (2015), a version in which only instances above the Babelfy default confidence threshold are disambiguated (i.e. the MCS back-off strategy is disabled).

[5] http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/
[6] http://babelnet.org
[7] In this analysis we used the word similarity task for optimizing the sense embeddings, without caring about the performance of word embeddings or their interconnectivity. Therefore, this configuration may not be optimal for word embeddings and may be further tuned on specific applications. More information about the different configurations can be found in the documentation of the source code.
[8] http://babelfy.org
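The closest senses strategy of Equation (1) can be sketched as follows (a minimal version with plain Python lists; the helper names are ours):

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def closest_senses_sim(w1, w2, candidate_senses, sense_vecs):
    """Similarity of two words as the maximum cosine similarity over all
    pairs of their candidate sense embeddings (Equation 1)."""
    return max(cos(sense_vecs[s1], sense_vecs[s2])
               for s1 in candidate_senses[w1]
               for s2 in candidate_senses[w2])
```

Taking the maximum over sense pairs means that two ambiguous words are judged by their most compatible pair of meanings, which is the standard behavior of sense-based similarity measures.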

Output:          Words                    Senses                   Both
           WS-Sim      RG-65        WS-Sim      RG-65        WS-Sim      RG-65
Input      r     ρ     r     ρ      r     ρ     r     ρ      r     ρ     r     ρ
Words      0.49  0.48  0.65  0.66   0.56  0.56  0.67  0.67   0.54  0.53  0.66  0.65
Senses     0.69  0.69  0.70  0.71   0.69  0.70  0.70  0.74   0.72  0.71  0.71  0.74
Both       0.60  0.65  0.67  0.70   0.62  0.65  0.66  0.67   0.65  0.71  0.68  0.70

Table 1: Pearson (r) and Spearman (ρ) correlation performance of the nine configurations of SW2V

              WS-Sim        RG-65
              r      ρ      r      ρ
Shallow       0.72   0.71   0.71   0.74
Babelfy       0.65   0.63   0.69   0.70
Babelfy*      0.63   0.61   0.65   0.64

Table 2: Pearson (r) and Spearman (ρ) correlation performance of SW2V integrating our shallow word-sense connectivity algorithm (default), Babelfy, or Babelfy*.

We will refer to this latter version as Babelfy* and report the best configuration of each strategy according to our analysis.

Table 2 shows the results of our model using the three different strategies on RG-65 and WS-Sim. Our shallow word-sense connectivity algorithm achieves the best overall results. We believe that these results are due to the semantic connectivity ensured by our algorithm and to the possibility of associating words with more than one sense, which seems beneficial for training, making it more robust to possible disambiguation errors and to the sense granularity issue (Erk et al., 2013). The results are especially significant considering that our algorithm took a tenth of the time needed by Babelfy to process the corpus.

6 Evaluation

We perform a qualitative and quantitative evaluation of important features of SW2V in three different tasks. First, in order to compare our model against standard word-based approaches, we evaluate our system on the word similarity task (Section 6.1). Second, we measure the quality of our sense embeddings in a sense-specific application: sense clustering (Section 6.2). Finally, we evaluate the coherence of our unified vector space by measuring the interconnectivity of word and sense embeddings (Section 6.3).

Experimental setting. Throughout all the experiments we use the same standard hyperparameters mentioned in Section 5 for both the original word2vec implementation and our proposed model SW2V. For SW2V we use the same optimal configuration according to the analysis of the previous section (only senses as input, and both words and senses as output) for all tasks. As training corpus we take the full 3B-words UMBC webbase corpus and Wikipedia (Wikipedia dump of November 2014), used by three of the comparison systems. We use BabelNet 3.0 (SW2V_BN) and WordNet 3.0 (SW2V_WN) as semantic networks.

Comparison systems. We compare with the publicly available pre-trained sense embeddings of four state-of-the-art models: Chen et al. (2014)[9] and AutoExtend[10] (Rothe and Schütze, 2015) based on WordNet, and SensEmbed[11] (Iacobacci et al., 2015) and NASARI[12] (Camacho-Collados et al., 2016) based on BabelNet.

6.1 Word Similarity

In this section we evaluate our sense representations on the standard SimLex-999 (Hill et al., 2015) and MEN (Bruni et al., 2014) word similarity datasets.[13] SimLex and MEN contain 999 and 3000 word pairs, respectively, which constitute, to our knowledge, the two largest similarity datasets comprising a balanced set of noun, verb and adjective instances.

[9] http://pan.baidu.com/s/1eQcPK8i
[10] We used the AutoExtend code (http://cistern.cis.lmu.de/~sascha/AutoExtend/) to obtain sense vectors using W2V embeddings trained on UMBC (the GoogleNews corpus used in their pre-trained models is not publicly available). We also tried the code to include BabelNet as lexical resource, but it was not easily scalable (BabelNet is two orders of magnitude larger than WordNet).
[11] http://lcl.uniroma1.it/sensembed/
[12] http://lcl.uniroma1.it/nasari/
[13] To enable a fair comparison we did not perform experiments on the small datasets used in Section 5 for validation.

                                            SimLex-999       MEN
        System              Corpus          r      ρ         r      ρ
Senses
        SW2V_BN             UMBC            0.49   0.47      0.75   0.75
        SW2V_WN             UMBC            0.46   0.45      0.76   0.76
        AutoExtend          UMBC            0.47   0.45      0.74   0.75
        AutoExtend          Google-News     0.46   0.46      0.68   0.70
        SW2V_BN             Wikipedia       0.47   0.43      0.71   0.73
        SW2V_WN             Wikipedia       0.47   0.43      0.71   0.72
        SensEmbed           Wikipedia       0.43   0.39      0.65   0.70
        Chen et al. (2014)  Wikipedia       0.46   0.43      0.62   0.62
Words
        Word2vec            UMBC            0.39   0.39      0.75   0.75
        Retrofitting_BN     UMBC            0.47   0.46      0.75   0.76
        Retrofitting_WN     UMBC            0.47   0.46      0.76   0.76
        Word2vec            Wikipedia       0.39   0.38      0.71   0.72
        Retrofitting_BN     Wikipedia       0.35   0.32      0.66   0.66
        Retrofitting_WN     Wikipedia       0.47   0.44      0.73   0.73

Table 3: Pearson (r) and Spearman (ρ) correlation performance on the SimLex-999 and MEN word similarity datasets.

As explained in Section 5, we use the closest senses strategy for the word similarity measurement of our model and all sense-based comparison systems. As regards the word embedding models, words are directly compared by using cosine similarity. We also include a retrofitted version of the original word2vec word vectors (Faruqui et al., 2015, Retrofitting[14]) using WordNet (Retrofitting_WN) and BabelNet (Retrofitting_BN) as lexical resources.

Table 3 shows the results of SW2V and all comparison models on SimLex and MEN. SW2V consistently outperforms all sense-based comparison systems using the same corpus, and clearly performs better than the original word2vec trained on the same corpus. Retrofitting decreases the performance of the original word2vec on the Wikipedia corpus using BabelNet as lexical resource, but significantly improves the original word vectors on the UMBC corpus, obtaining results comparable to our approach. However, while our approach provides a shared space of words and senses, Retrofitting still conflates the different meanings of a word into the same vector.

Additionally, we noticed that most of the score divergences between our system and the gold standard scores in SimLex-999 were produced on antonym pairs, which are over-represented in this dataset: 38 word pairs hold a clear antonymy relation (e.g. encourage-discourage or long-short), while 41 additional pairs hold some degree of antonymy (e.g. new-ancient or man-woman).[15] In contrast to the consistently low gold similarity scores given to antonym pairs, our system varies its similarity scores depending on the specific nature of the pair.[16] Recent works have managed to obtain significant improvements by tweaking usual word embedding approaches into providing low similarity scores for antonym pairs (Pham et al., 2015; Schwartz et al., 2015; Nguyen et al., 2016; Mrksic et al., 2017), but this is outside the scope of this paper.

6.2 Sense Clustering

Current lexical resources tend to suffer from the high granularity of their sense inventories (Palmer et al., 2007). In fact, a meaningful clustering of their senses may lead to improvements on downstream tasks (Hovy et al., 2013; Flekova and Gurevych, 2016; Pilehvar et al., 2017). In this section we evaluate our synset representations on the Wikipedia sense clustering task.

[14] https://github.com/mfaruqui/retrofitting
[15] Two annotators decided the degree of antonymy between word pairs: clear antonyms, weak antonyms or neither.
[16] For instance, the pairs sunset-sunrise and day-night are given, respectively, 1.88 and 2.47 gold scores in the 0-10 scale, while our model gives them a higher similarity score. In fact, both pairs appear as coordinate synsets in WordNet.

106 Accuracy F-Measure resentations of NASARI and SensEmbed using the SW2V 87.8 63.9 same setup and the same underlying lexical re- SensEmbed 82.7 40.3 source. This confirms the capability of our system NASARI 87.0 62.5 to accurately capture the semantics of word senses Multi-SVM 85.5 - on this sense-specific task. Mono-SVM 83.5 - Baseline 17.5 29.8 6.3 Word and sense interconnectivity In the previous experiments we evaluated the ef- Table 4: Accuracy and F-Measure percentages of fectiveness of the sense embeddings. In contrast, different systems on the SemEval Wikipedia sense this experiment aims at testing the interconnec- clustering dataset. tivity between word and sense embeddings in the vector space. As explained in Section2, there have been previous approaches building a shared space parison systems that use the Wikipedia corpus for of word and sense embeddings, but to date lit- training, in this experiment we report the results of tle research has focused on testing the semantic our model trained on the Wikipedia corpus and us- coherence of the vector space. To this end, we ing BabelNet as lexical resource only. For the eval- evaluate our model on a Word Sense Disambigua- uation we consider the two Wikipedia sense clus- tion (WSD) task, using our shared vector space of tering datasets (500-pair and SemEval) created by words and senses to obtain a Most Common Sense Dandala et al.(2013). In these datasets sense clus- (MCS) baseline. The insight behind this experi- tering is viewed as a binary classification task in ment is that a semantically coherent shared space which, given a pair of Wikipedia pages, the system of words and senses should be able to build a rel- has to decide whether to cluster them into a single atively strong baseline for the task, as the MCS instance or not. To this end, we use our synset em- of a given word should be closer to the word 17 beddings and cluster Wikipedia pages together vector than any other sense. 
The MCS baseline if their similarity exceeds a threshold γ. In order is generally integrated into the pipeline of state- to set the optimal value of γ, we follow Dandala of-the-art WSD and Entity Linking systems as a et al. (2013) and use the first 500-pairs sense clus- back-off strategy (Navigli, 2009; Jin et al., 2009; tering dataset for tuning. We set the threshold γ Zhong and Ng, 2010; Moro et al., 2014; Raganato to 0.35, which is the value leading to the highest et al., 2017) and is used in various NLP applica- F-Measure among all values from 0 to 1 with a tions (Bennett et al., 2016). Therefore, a system 0.05 step size on the 500-pair dataset. Likewise, which automatically identifies the MCS of words we set a threshold for NASARI (0.7) and SensEm- from non-annotated text may be quite valuable, bed (0.3) comparison systems. especially for resource-poor languages or large Finally, we evaluate our approach on the Se- knowledge resources for which obtaining sense- mEval sense clustering test set. This test set con- annotated corpora is extremely expensive. More- sists of 925 pairs which were obtained from a over, even in a resource like WordNet for which set of highly ambiguous words gathered from sense-annotated data is available (Miller et al., past SemEval tasks. For comparison, we also in- 1993, SemCor), 61% of its polysemous lemmas clude the supervised approach of Dandala et al. have no sense annotations (Bennett et al., 2016). (2013) based on a multi-feature Support Vector Given an input word w, we compute the cosine Machine classifier trained on an automatically- similarity between w and all its candidate senses, labeled dataset of the English Wikipedia (Mono- picking the sense leading to the highest similarity: SVM) and Wikipedia in four different languages (Multi-SVM). As naive baseline we include the MCS(w) = argmax cos(~w, ~s) (2) s Sw system which would cluster all given pairs. 
6.3 Word and sense interconnectivity

In the previous experiments we evaluated the effectiveness of the sense embeddings. In contrast, this experiment aims at testing the interconnectivity between word and sense embeddings in the vector space. As explained in Section 2, there have been previous approaches to building a shared space of word and sense embeddings, but to date little research has focused on testing the semantic coherence of such a vector space. To this end, we evaluate our model on a Word Sense Disambiguation (WSD) task, using our shared vector space of words and senses to obtain a Most Common Sense (MCS) baseline. The insight behind this experiment is that a semantically coherent shared space of words and senses should provide a relatively strong baseline for the task, as the MCS of a given word should be closer to the word vector than any other sense.

The MCS baseline is generally integrated into the pipeline of state-of-the-art WSD and Entity Linking systems as a back-off strategy (Navigli, 2009; Jin et al., 2009; Zhong and Ng, 2010; Moro et al., 2014; Raganato et al., 2017) and is used in various NLP applications (Bennett et al., 2016). Therefore, a system which automatically identifies the MCS of words from non-annotated text may be quite valuable, especially for resource-poor languages or large knowledge resources for which obtaining sense-annotated corpora is extremely expensive. Moreover, even in a resource like WordNet, for which sense-annotated data is available (Miller et al., 1993, SemCor), 61% of the polysemous lemmas have no sense annotations (Bennett et al., 2016).

Given an input word w, we compute the cosine similarity between w and all its candidate senses, picking the sense leading to the highest similarity:

    MCS(w) = argmax_{s ∈ S_w} cos(~w, ~s)    (2)

where cos(~w, ~s) refers to the cosine similarity between the embeddings of w and s, and S_w is the set of candidate senses of w. In order to assess the reliability of SW2V against previous models using WordNet as sense inventory, we test our model on the all-words SemEval-2007 (task 17) (Pradhan et al., 2007) and SemEval-2013 (task 12) (Navigli et al., 2013) WSD datasets.
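Equation (2) amounts to a nearest-neighbor query restricted to the candidate senses of w. The following minimal sketch uses our own naming conventions and toy sense identifiers (dictionary-based lookups, not the paper's code):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_common_sense(word, word_vecs, sense_vecs, candidate_senses):
    # Equation (2): argmax over s in S_w of cos(w, s), where S_w is the
    # set of candidate senses of the input word.
    w = word_vecs[word]
    return max(candidate_senses[word], key=lambda s: cosine(w, sense_vecs[s]))

# Toy example: the word vector for "plant" lies closest to its factory sense.
word_vecs = {"plant": np.array([1.0, 0.2])}
sense_vecs = {"plant_factory": np.array([0.9, 0.3]),
              "plant_flora": np.array([0.1, 1.0])}
candidate_senses = {"plant": ["plant_factory", "plant_flora"]}
print(most_common_sense("plant", word_vecs, sense_vecs, candidate_senses))
# -> plant_factory
```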

Note that our model using BabelNet as semantic network has a far larger coverage than just WordNet and may additionally be used for Wikification (Mihalcea and Csomai, 2007) and Entity Linking tasks. Since the versions of WordNet vary across datasets and comparison systems, we decided to evaluate the systems on the portion of the datasets covered by all comparison systems [18] (less than 10% of instances were removed from each dataset).

             SemEval-07   SemEval-13
SW2V         39.9         54.0
AutoExtend   17.6         31.0
Baseline     24.8         34.9

Table 5: F-Measure percentage of different MCS strategies on the SemEval-2007 and SemEval-2013 WSD datasets.

Table 5 shows the results of our system and AutoExtend on the SemEval-2007 and SemEval-2013 WSD datasets. SW2V provides the best MCS results on both datasets. In general, AutoExtend does not accurately capture the predominant sense of a word and performs worse than a baseline that selects the intended sense randomly from the set of all possible senses of the target word.

In fact, AutoExtend tends to create clusters which include a word and all its possible senses. As an example, Table 6 shows the closest word and sense [19] embeddings of our SW2V model and AutoExtend to the military and fish senses of, respectively, company and school. AutoExtend creates clusters with all the senses of company and school and their related instances, even if they belong to different domains (e.g., firm^2_n or business^1_n clearly concern the business sense of company). Instead, SW2V creates a semantic cluster of word and sense embeddings which are semantically close to the corresponding company^2_n and school^7_n senses.

company^2_n (military unit)         school^7_n (group of fish)
AutoExtend      SW2V                AutoExtend       SW2V
company^9_n     battalion^1_n       school           schools^7_n
company         battalion           school^4_n       sharks^1_n
company^8_n     regiment^1_n        school^6_n       sharks
company^6_n     detachment^4_n      school^1_v       shoals^3_n
company^7_n     platoon^1_n         school^3_n       fish^1_n
company^1_v     brigade^1_n         elementary       dolphins^1_n
firm            regiment            schools          pods^3_n
business^1_n    corps^1_n           elementary^3_a   eels
firm^2_n        brigade             school^5_n       dolphins
company^1_n     platoon             elementary^1_a   whales^2_n

Table 6: Ten closest word and sense embeddings to the senses company^2_n (military unit) and school^7_n (group of fish).

[18] We were unable to obtain the word embeddings of Chen et al. (2014) for comparison even after contacting the authors.
[19] Following Navigli (2009), word^n_p is the n-th sense of word with part of speech p (using WordNet 3.0).
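Neighbor lists like those in Table 6 can be produced with a plain cosine nearest-neighbor query over a single dictionary that holds both word and sense vectors, which is exactly what a unified space makes possible. The sketch below uses toy vectors and invented identifiers:

```python
import numpy as np

def nearest_neighbors(query_id, vectors, k=10):
    # vectors maps both words and sense identifiers to NumPy arrays;
    # returns the k entries closest to the query by cosine similarity.
    q = vectors[query_id] / np.linalg.norm(vectors[query_id])
    scored = [(key, float(np.dot(q, v / np.linalg.norm(v))))
              for key, v in vectors.items() if key != query_id]
    return sorted(scored, key=lambda kv: -kv[1])[:k]

# Toy joint space: one sense vector plus plain word vectors.
vectors = {"company_2_n": np.array([1.0, 0.0]),
           "battalion_1_n": np.array([0.95, 0.05]),
           "regiment": np.array([0.8, 0.2]),
           "fish": np.array([0.0, 1.0])}
```

In this toy space, `nearest_neighbors("company_2_n", vectors, k=2)` ranks `battalion_1_n` first and `regiment` second, with `fish` far behind, mimicking the clean military cluster that SW2V produces in Table 6.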
7 Conclusion and Future Work

In this paper we proposed SW2V (Senses and Words to Vectors), a neural model which learns vector representations for words and senses in a joint training phase by exploiting both text corpora and knowledge from semantic networks. Data (including the preprocessed corpora and pre-trained embeddings used in the evaluation) and source code to apply our extension of the word2vec architecture to learn word and sense embeddings from any preprocessed corpus are freely available at http://lcl.uniroma1.it/sw2v. Unlike previous sense-based models, which require post-processing steps and use WordNet as sense inventory, our model achieves a semantically coherent vector space of both words and senses as an emerging feature of a single training phase, and is easily scalable to larger semantic networks like BabelNet. Finally, we showed, both quantitatively and qualitatively, some of the advantages of using our approach over previous state-of-the-art word- and sense-based models in various tasks, and highlighted interesting semantic properties of the resulting unified vector space of word and sense embeddings.

As future work we plan to integrate a WSD and Entity Linking system in order to apply our model to downstream NLP applications, along the lines of Pilehvar et al. (2017). We are also planning to apply our model to languages other than English and to study its potential in multilingual and cross-lingual applications.

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487. Jose Camacho-Collados is supported by a Google Doctoral Fellowship in Natural Language Processing. We would also like to thank Jim McManus for his comments on the manuscript.

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of NAACL, pages 19–27.

Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics 40(1):57–84.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. The Journal of Machine Learning Research 3:1137–1155.

Andrew Bennett, Timothy Baldwin, Jey Han Lau, Diana McCarthy, and Francis Bond. 2016. LexSemTm: A semantic dataset based on all-words unsupervised sense distribution learning. In Proceedings of ACL, pages 1513–1524.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of EMNLP, pages 615–620.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR) 49:1–47.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of Lexical Semantic Relatedness. Computational Linguistics 32(1):13–47.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. A Unified Multilingual Semantic Representation of Concepts. In Proceedings of ACL, Beijing, China, pages 741–751.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240:36–64.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP, Doha, Qatar, pages 1025–1035.

Bharath Dandala, Chris Hokamp, Rada Mihalcea, and Razvan C. Bunescu. 2013. Sense clustering using Wikipedia. In Proceedings of RANLP, Hissar, Bulgaria, pages 164–171.

Antonio Di Marco and Roberto Navigli. 2013. Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics 39(3):709–754.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2013. Measuring word meaning in context. Computational Linguistics 39(3):511–554.

Allyson Ettinger, Philip Resnik, and Marine Carpuat. 2016. Retrofitting Sense-Specific Word Vectors Using Parallel Text. In Proceedings of NAACL-HLT, pages 1378–1383.

Wei Fang, Jianwen Zhang, Dilin Wang, Zheng Chen, and Ming Li. 2016. Entity disambiguation by knowledge and text jointly embedding. In Proceedings of CoNLL, pages 260–269.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL, pages 1606–1615.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems 20(1):116–131.

Lucie Flekova and Iryna Gurevych. 2016. Supersense embeddings: A unified model for supersense interpretation, prediction, and utilization. In Proceedings of ACL, pages 2029–2041.

Josu Goikoetxea, Aitor Soroa, and Eneko Agirre. 2015. Random walks and neural network language models on knowledge bases. In Proceedings of NAACL, pages 1434–1439.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning sense-specific word embeddings by exploiting bilingual resources. In Proceedings of COLING, pages 497–507.

Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics, volume 1, pages 44–52.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.

Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2012. KORE: Keyphrase overlap relatedness for entity disambiguation. In Proceedings of CIKM, pages 545–554.

Eduard H. Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence 194:2–27.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL, Jeju Island, Korea, pages 873–882.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning sense embeddings for word and relational similarity. In Proceedings of ACL, Beijing, China, pages 95–105.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for Word Sense Disambiguation: An Evaluation Study. In Proceedings of ACL, pages 897–907.

Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of NAACL.

Peng Jin, Diana McCarthy, Rob Koeling, and John Carroll. 2009. Estimating and exploiting the entropy of sense distributions. In Proceedings of NAACL (2), pages 233–236.

Richard Johansson and Luis Nieto Piña. 2015. Embedding a semantic network in a word space. In Proceedings of NAACL, pages 1428–1433.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL 3:211–225.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? In Proceedings of EMNLP, Lisbon, Portugal.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In Proceedings of CoNLL, pages 51–61.

Rada Mihalcea and Andras Csomai. 2007. Wikify! Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, Lisbon, Portugal, pages 233–242.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, Plainsboro, N.J., pages 303–308.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: A Unified Approach. TACL 2:231–244.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL.

Roberto Navigli. 2009. Word Sense Disambiguation: A survey. ACM Computing Surveys 41(2):1–69.

Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 Task 12: Multilingual Word Sense Disambiguation. In Proceedings of SemEval 2013, pages 222–231.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193:217–250.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP, Doha, Qatar, pages 1059–1069.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL, pages 454–459.

Martha Palmer, Hoa Dang, and Christiane Fellbaum. 2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering 13(2):137–163.

Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann. 2017. Unsupervised does not mean uninterpretable: The case for word sense induction and disambiguation. In Proceedings of EACL, pages 86–98.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Nghia The Pham, Angeliki Lazaridou, and Marco Baroni. 2015. A multitask objective to inject lexical contrast into distributional semantics. In Proceedings of ACL, pages 21–26.

Mohammad Taher Pilehvar, Jose Camacho-Collados, Roberto Navigli, and Nigel Collier. 2017. Towards a Seamless Integration of Word Senses into Downstream NLP Applications. In Proceedings of ACL, Vancouver, Canada.

Mohammad Taher Pilehvar and Nigel Collier. 2016. De-conflated semantic representations. In Proceedings of EMNLP, Austin, TX.

Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task-17: English lexical sample, SRL and all words. In Proceedings of SemEval, pages 87–92.

Lin Qiu, Kewei Tu, and Yong Yu. 2016. Context-dependent sense embedding. In Proceedings of EMNLP, Austin, Texas, pages 183–191.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. In Proceedings of EACL, pages 99–110.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Proceedings of ACL, pages 109–117.

Zhi Zhong and Hwee Tou Ng. 2010. It Makes Sense: A wide-coverage Word Sense Disambiguation system for free text. In Proceedings of ACL System Demonstrations, pages 78–83.

Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP, pages 1393–1398.
Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of IJCAI, pages 448–453.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes. In Proceedings of ACL, Beijing, China, pages 1793–1803.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM 8(10):627–633.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics 24(1):97–123.

Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL, pages 258–267.

Robert Speer and Joanna Lowry-Duda. 2017. ConceptNet at SemEval-2017 Task 2: Extending word embeddings with multilingual relational knowledge. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 76–80.

Simon Šuster, Ivan Titov, and Gertjan van Noord. 2016. Bilingual learning of multi-sense embeddings with discrete autoencoders. In Proceedings of NAACL-HLT, pages 1346–1356.

Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING, pages 151–160.

Thuy Vu and D. Stott Parker. 2016. K-embeddings: Learning conceptual embeddings for words using context. In Proceedings of NAACL-HLT, pages 1262–1267.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of EMNLP, pages 1591–1601.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of ACL, Beijing, China, pages 323–333.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL (2), pages 545–550.
