
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training

Massimiliano Mancini*, Jose Camacho-Collados*, Ignacio Iacobacci and Roberto Navigli
Department of Computer Science, Sapienza University of Rome
[email protected], {collados,iacobacci,navigli}@di.uniroma1.it

Abstract

Word embeddings are widely used in Natural Language Processing, mainly due to their success in capturing semantic information from massive corpora. However, their creation process does not allow the different meanings of a word to be automatically separated, as it conflates them into a single vector. We address this issue by proposing a new model which learns word and sense embeddings jointly. Our model exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings. We evaluate the main features of our approach both qualitatively and quantitatively in a variety of tasks, highlighting the advantages of the proposed method in comparison to state-of-the-art word- and sense-based models.

1 Introduction

Recently, approaches based on neural networks which embed words into low-dimensional vector spaces from text corpora (i.e. word embeddings) have become increasingly popular (Mikolov et al., 2013; Pennington et al., 2014). Word embeddings have proved to be beneficial in many Natural Language Processing tasks, such as Machine Translation (Zou et al., 2013), syntactic parsing (Weiss et al., 2015), and Question Answering (Bordes et al., 2014), to name a few. Despite their success in capturing semantic properties of words, these representations are generally hampered by an important limitation: the inability to discriminate among different meanings of the same word.

Previous works have addressed this limitation by automatically inducing word senses from monolingual corpora (Schütze, 1998; Reisinger and Mooney, 2010; Huang et al., 2012; Di Marco and Navigli, 2013; Neelakantan et al., 2014; Tian et al., 2014; Li and Jurafsky, 2015; Vu and Parker, 2016; Qiu et al., 2016), or bilingual parallel data (Guo et al., 2014; Ettinger et al., 2016; Šuster et al., 2016). However, these approaches learn solely on the basis of statistics extracted from text corpora and do not exploit knowledge from semantic networks. Additionally, their induced senses are neither readily interpretable (Panchenko et al., 2017) nor easily mappable to lexical resources, which limits their application.

Recent approaches have utilized semantic networks to inject knowledge into existing word representations (Yu and Dredze, 2014; Faruqui et al., 2015; Goikoetxea et al., 2015; Speer and Lowry-Duda, 2017; Mrksic et al., 2017), but without solving the meaning conflation issue. In order to obtain a representation for each sense of a word, a number of approaches have leveraged lexical resources to learn sense embeddings as a result of post-processing conventional word embeddings (Chen et al., 2014; Johansson and Pina, 2015; Jauhar et al., 2015; Rothe and Schütze, 2015; Pilehvar and Collier, 2016; Camacho-Collados et al., 2016).

Instead, we propose SW2V (Senses and Words to Vectors), a neural model that exploits knowledge from both text corpora and semantic networks in order to simultaneously learn embeddings for both words and senses. Moreover, our model provides three additional key features: (1) both word and sense embeddings are represented in the same vector space, (2) it is flexible, as it can be applied to different predictive models, and (3) it is scalable for very large semantic networks and text corpora.

* Authors marked with an asterisk (*) contributed equally.

Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 100-111, Vancouver, Canada, August 3 - August 4, 2017. © 2017 Association for Computational Linguistics

2 Related work

Embedding words from large corpora into a low-dimensional vector space has been a popular task since the appearance of the probabilistic feedforward neural network language model (Bengio et al., 2003) and later developments such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). However, little research has focused on exploiting lexical resources to overcome the inherent ambiguity of word embeddings.

Iacobacci et al. (2015) overcame this limitation by applying an off-the-shelf disambiguation system (i.e. Babelfy (Moro et al., 2014)) to a corpus and then using word2vec to learn sense embeddings over the pre-disambiguated text. However, in their approach words are replaced by their intended senses, consequently producing as output sense representations only. The representation of words and senses in the same vector space proves essential for applying these knowledge-based sense embeddings in downstream applications, particularly for their integration into neural architectures (Pilehvar et al., 2017). In the literature, various different methods have attempted to overcome this limitation. Chen et al. (2014) proposed a model for obtaining both word and sense representations based on a first training step of conventional word embeddings, a second disambiguation step based on sense definitions, and a final training phase which uses the disambiguated text as input. Likewise, Rothe and Schütze (2015) aimed at building a shared space of word and sense embeddings based on two steps: a first training step of only word embeddings and a second training step to produce sense and synset embeddings. These two approaches require multiple steps of training and make use of a relatively small resource like WordNet, which limits their coverage and applicability. Camacho-Collados et al. (2016) increased the coverage of these WordNet-based approaches by exploiting the complementary knowledge of WordNet and Wikipedia along with pre-trained word embeddings. Finally, Wang et al. (2014) and Fang et al. (2016) proposed a model to align vector spaces of words and entities from knowledge bases. However, these approaches are restricted to nominal instances only (i.e. Wikipedia pages or entities).

In contrast, we propose a model which learns both word and sense embeddings from a single joint training phase, producing a common vector space of words and senses as an emerging feature.

3 Connecting words and senses in context

In order to jointly produce embeddings for words and senses, SW2V needs as input a corpus where words are connected to senses in each given context.[1] One option for obtaining such connections could be to take a sense-annotated corpus as input. However, manually annotating large amounts of data is extremely expensive and therefore impractical in normal settings. Obtaining sense-annotated data from current off-the-shelf disambiguation and entity linking systems is possible, but generally suffers from two major problems. First, supervised systems are hampered by the very same problem of needing large amounts of sense-annotated data. Second, the relatively slow speed of current disambiguation systems, such as graph-based approaches (Hoffart et al., 2012; Agirre et al., 2014; Moro et al., 2014), or word-expert supervised systems (Zhong and Ng, 2010; Iacobacci et al., 2016; Melamud et al., 2016), could become an obstacle when applied to large corpora.

This is the reason why we propose a simple yet effective unsupervised shallow word-sense connectivity algorithm, which can be applied to virtually any given semantic network and is linear in the corpus size. The main idea of the algorithm is to exploit the connections of a semantic network by associating words with the senses that are most connected within the sentence, according to the underlying network.

Shallow word-sense connectivity algorithm. Formally, a corpus and a semantic network are taken as input and a set of connected words and senses is produced as output. We define a semantic network as a graph (S, E) where the set S contains synsets (nodes) and E represents a set of semantically connected synset pairs (edges). Algorithm 1 describes how to connect words and senses in a given text (sentence or paragraph) T. First, we gather in a set S_T all candidate synsets of the words (including multiwords up to trigrams)[2] in T (lines 1 to 3). Second, for each candidate synset s we calculate the number of synsets which are connected with s in the semantic network and are included in S_T, excluding connections of synsets which only appear as candidates of the same word (lines 5 to 10). Finally, each word is associated with its top candidate synset(s) according to its/their number of connections in context, provided that the number of connections exceeds a threshold θ = (|S_T| + |T|) / (2δ) (lines 11 to 17).

[1] In this paper we focus on senses but other items connected to words may be used (e.g. supersenses or images).
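The procedure above can be sketched as follows. This is a minimal illustration with toy data structures and names of our own choosing, not the authors' released implementation; the semantic network is given as a set of undirected synset pairs, and the word-to-candidate-synset mapping is assumed to be precomputed from the lexical resource:

```python
def connect_words_and_senses(text, candidates, edges, delta=100):
    """Associate each word in `text` (a list of tokens) with its most
    connected candidate senses.  `candidates` maps a word to its set of
    candidate synsets; `edges` is a set of frozenset synset pairs."""
    # Lines 1-3: gather all candidate synsets of the words in the text.
    S_T = set()
    for w in text:
        S_T |= candidates.get(w, set())
    # Line 4: minimum-connections threshold theta = (|S_T| + |T|) / (2*delta).
    theta = (len(S_T) + len(text)) / (2 * delta)
    output = set()
    for w in text:
        best, max_n = set(), 0
        for s in candidates.get(w, set()):
            # Lines 9-10: synsets connected to s that are candidates of
            # at least one *other* word of the text.
            n = len({s2 for w2 in text if w2 != w
                     for s2 in candidates.get(w2, set())
                     if frozenset((s, s2)) in edges})
            # Lines 11-16: keep the top candidate sense(s) above the threshold.
            if n >= max_n and n >= theta:
                if n > max_n:
                    best, max_n = {(w, s)}, n
                else:
                    best.add((w, s))
        output |= best
    return output
```

For instance, in a toy network where the botanical senses of plant, tree and leaf are interconnected, only those senses survive the selection, while an unconnected sense such as the factory sense of plant is discarded.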

Algorithm 1 Shallow word-sense connectivity
Input: semantic network (S, E) and text T represented as a bag of words
Output: set of connected words and senses T* ⊆ T × S
 1:  Set of synsets S_T ← ∅
 2:  for each word w ∈ T
 3:      S_T ← S_T ∪ S_w   (S_w: set of candidate synsets of w)
 4:  Minimum connections threshold θ ← (|S_T| + |T|) / (2δ)
 5:  Output set of connections T* ← ∅
 6:  for each w ∈ T
 7:      Relative maximum connections max ← 0
 8:      Set of senses associated with w, C_w ← ∅
 9:      for each candidate synset s ∈ S_w
10:          Number of edges n = |{s' ∈ S_T : (s, s') ∈ E & ∃ w' ∈ T : w' ≠ w & s' ∈ S_{w'}}|
11:          if n ≥ max & n ≥ θ then
12:              if n > max then
13:                  C_w ← {(w, s)}
14:                  max ← n
15:              else
16:                  C_w ← C_w ∪ {(w, s)}
17:      T* ← T* ∪ C_w
18:  return output set of connected words and senses T*

This parameter aims to retain relevant connectivity across senses, as only senses above the threshold will be connected to words in the output corpus. θ is proportional to the reciprocal of a parameter δ,[3] and directly proportional to the average text length and number of candidate synsets within the text.

The complexity of the proposed algorithm is N + (N × α), where N is the number of words of the training corpus and α is the average polysemy degree of a word in the corpus according to the input semantic network. Considering that non-content words are not taken into account (i.e. polysemy degree 0) and that the average polysemy degree of words in current lexical resources (e.g. WordNet or BabelNet) does not exceed a small constant (3) in any language, we can safely assume that the algorithm is linear in the size of the training corpus. Hence, the training time is not significantly increased in comparison to training on words only, irrespective of the corpus size. This enables fast training on large amounts of text corpora, in contrast to current unsupervised disambiguation algorithms. Additionally, as we will show in Section 5.2, this algorithm not only speeds up the training phase significantly, but also leads to more accurate results.

Note that with our algorithm a word is allowed to have more than one sense associated. In fact, current lexical resources like WordNet (Miller, 1995) or BabelNet (Navigli and Ponzetto, 2012) are hampered by the high granularity of their sense inventories (Hovy et al., 2013). In Section 6.2 we show how our sense embeddings are particularly suited to deal with this issue.

[2] As mentioned above, all unigrams, bigrams and trigrams present in the semantic network are considered. In the case of overlapping instances, the selection of the final instance is performed in this order: mention whose synset is more connected (i.e. n is higher), longer mention, and from left to right.
[3] Higher values of δ lead to higher recall, while lower values of δ increase precision but lower the recall. We set the value of δ to 100, as it was shown to produce a fine balance between precision and recall. This parameter may also be tuned on downstream tasks.

4 Joint training of words and senses

The goal of our approach is to obtain a shared vector space of words and senses. To this end, our model extends conventional word embedding models by integrating explicit knowledge into its architecture. While we will focus on the Continuous Bag Of Words (CBOW) architecture of word2vec (Mikolov et al., 2013), our extension can easily be applied similarly to Skip-Gram, or to other predictive approaches based on neural networks. The CBOW architecture is based on the feedforward neural network language model (Bengio et al., 2003) and aims at predicting the current word using its surrounding context. The architecture consists of input, hidden and output layers. The input layer has the size of the word vocabulary and encodes the context as a combination of one-hot vector representations of the surrounding words of a given target word. The output layer has the same size as the input layer and contains a one-hot vector of the target word during the training phase.

Our model extends the input and output layers of the neural network with word senses[4] by exploiting the intrinsic relationship between words and senses. The leading principle is that, since a word is the surface form of an underlying sense, updating the embedding of the word should produce a consequent update to the embedding representing that particular sense, and vice-versa. As a consequence of the algorithm described in the previous section, each word in the corpus may be connected with zero, one or more senses. We refer to the set of senses connected to a given word within the specific context as its associated senses.

[4] Our model can also produce a space of words and synset embeddings as output: the only difference is that all synonym senses would be considered to be the same item, i.e. a synset.
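This shared-update principle can be illustrated with a toy snippet. It is our own simplification, not the actual implementation (which extends word2vec's training code): a gradient applied to a word vector is also applied to the vectors of its associated senses, and symmetrically a sense-side gradient would reach the word.

```python
def joint_update(word_vecs, sense_vecs, word, assoc_senses, grad, lr=0.025):
    """Apply one SGD step to a word vector and propagate the same
    gradient to the senses associated with that word in the current
    context.  Vectors are plain lists of floats (toy representation)."""
    word_vecs[word] = [v - lr * g for v, g in zip(word_vecs[word], grad)]
    for s in assoc_senses:
        sense_vecs[s] = [v - lr * g for v, g in zip(sense_vecs[s], grad)]
```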

Figure 1: The SW2V architecture on a sample training instance using four context words. Dotted lines represent the virtual link between words and associated senses in context. In this example, the input layer consists of a context of two previous words (w_{t-2}, w_{t-1}) and two subsequent words (w_{t+1}, w_{t+2}) with respect to the target word w_t. Two words (w_{t-1}, w_{t+2}) do not have senses associated in context, while w_{t-2} and w_{t+1} have three senses (s^1_{t-2}, s^2_{t-2}, s^3_{t-2}) and one sense (s^1_{t+1}) associated in context, respectively. The output layer consists of the target word w_t, which has two senses associated (s^1_t, s^2_t) in context.

Formally, we define a training instance as a sequence of words W = w_{t-n}, ..., w_t, ..., w_{t+n} (being w_t the target word) and S = S_{t-n}, ..., S_t, ..., S_{t+n}, where S_i = s^1_i, ..., s^{k_i}_i is the sequence of all associated senses in context of w_i ∈ W. Note that S_i might be empty if the word w_i does not have any associated sense.

In our model each target word takes as context both its surrounding words and all the senses associated with them. In contrast to the original CBOW architecture, where the training criterion is to correctly classify w_t, our approach aims to predict the word w_t and its set S_t of associated senses. This is equivalent to minimizing the following loss function:

  E = -log(p(w_t | W^t, S^t)) - Σ_{s ∈ S_t} log(p(s | W^t, S^t))

where W^t = w_{t-n}, ..., w_{t-1}, w_{t+1}, ..., w_{t+n} and S^t = S_{t-n}, ..., S_{t-1}, S_{t+1}, ..., S_{t+n}. Figure 1 shows the organization of the input and the output layers on a sample training instance. In what follows we present a set of variants of the model on the output and the input layers.

4.1 Output layer alternatives

Both words and senses. This is the default case explained above. If a word has one or more associated senses, these senses are also used as target on a separate output layer.

Only words. In this case we exclude senses as target. There is a single output layer with the size of the word vocabulary, as in the original CBOW model.

Only senses. In contrast, this alternative excludes words, using only senses as target. In this case, if a word does not have any associated sense, it is not used as a target instance.

4.2 Input layer alternatives

Both words and senses. Words and their associated senses are included in the input layer and contribute to the hidden state. Both words and senses are updated as a consequence of the backpropagation algorithm.

Only words. In this alternative only the surrounding words contribute to the hidden state, i.e. the target word/sense (depending on the alternative of the output layer) is predicted only from word features. The update of an input word is propagated to the embeddings of its associated senses, if any. In other words, despite not being included in the input layer, senses still receive the same gradient as the associated input word, through a virtual connection. This configuration, coupled with the only-words output layer configuration, corresponds exactly to the default CBOW architecture of word2vec with the only addition of the update step for senses.

Only senses. Words are excluded from the input layer and the target is predicted only from the senses associated with the surrounding words. The weights of the words are updated through the updates of the associated senses, in contrast to the only-words alternative.

5 Analysis of Model Components

In this section we analyze the different components of SW2V, including the nine model configurations (Section 5.1) and the algorithm which generates the connections between words and senses in context (Section 5.2). In what follows we describe the common analysis setting:

- Training model and hyperparameters. For evaluation purposes, we use the CBOW model of word2vec with standard hyperparameters: the dimensionality of the vectors is set to 300 and the window size to 8, and hierarchical softmax is used for normalization. These hyperparameter values are set across all experiments.

- Corpus and semantic network. We use a 300M-words corpus from the UMBC project (Han et al., 2013), which contains English paragraphs extracted from the web.[5] As semantic network we use BabelNet 3.0[6], a large multilingual semantic network with over 350 million semantic connections, integrating resources such as Wikipedia and WordNet. We chose BabelNet owing to its wide coverage of named entities and lexicographic knowledge.

- Benchmark. Word similarity has been one of the most popular benchmarks for in-vitro evaluation of vector space models (Pennington et al., 2014; Levy et al., 2015). For the analysis we use two word similarity datasets: the similarity portion (Agirre et al., 2009, WS-Sim) of the WordSim-353 dataset (Finkelstein et al., 2002) and RG-65 (Rubenstein and Goodenough, 1965). In order to compute the similarity of two words using our sense embeddings, we apply the standard closest senses strategy (Resnik, 1995; Budanitsky and Hirst, 2006; Camacho-Collados et al., 2015), using cosine similarity (cos) as comparison measure between senses:

      sim(w1, w2) = max_{s1 ∈ S_{w1}, s2 ∈ S_{w2}} cos(~s1, ~s2)    (1)

  where S_{wi} represents the set of all candidate senses of w_i and ~s_i refers to the sense vector representation of the sense s_i.

5.1 Model configurations

In this section we analyze the different configurations of our model with respect to the input and the output layer on a word similarity experiment. Recall from Section 4 that our model can have words, senses or both in either the input or the output layer. Table 1 shows the results of all nine configurations on the WS-Sim and RG-65 datasets.

As shown in Table 1, the best configuration according to both Spearman and Pearson correlation measures is the one which has only senses in the input layer and both words and senses in the output layer.[7] In fact, taking only senses as input seems to be consistently the best alternative for the input layer. Our hunch is that the knowledge learned from both the co-occurrence information and the semantic network is more balanced with this input setting. For instance, in the case of including both words and senses in the input layer, the co-occurrence information learned by the network would be duplicated for both words and senses.

5.2 Disambiguation / Shallow word-sense connectivity algorithm

In this section we evaluate the impact of our shallow word-sense connectivity algorithm (Section 3) by testing our model directly taking a pre-disambiguated text as input. In this case the network exploits the connections between each word and its disambiguated sense in context. For this comparison we used Babelfy[8] (Moro et al., 2014), a state-of-the-art graph-based disambiguation and entity linking system based on BabelNet. We compare to both the default Babelfy system, which uses the Most Common Sense (MCS) heuristic as a back-off strategy, and, following Iacobacci et al. (2015), a version in which only instances above the Babelfy default confidence threshold are disambiguated (i.e. the MCS back-off strategy is disabled).

[5] http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/
[6] http://babelnet.org
[7] In this analysis we used the word similarity task for optimizing the sense embeddings, without caring about the performance of word embeddings or their interconnectivity. Therefore, this configuration may not be optimal for word embeddings and may be further tuned on specific applications. More information about the different configurations can be found in the documentation of the source code.
[8] http://babelfy.org
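The closest senses strategy of Equation (1) can be sketched as follows (a minimal version with plain Python lists; the helper names are ours):

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def closest_senses_sim(w1, w2, candidate_senses, sense_vecs):
    """Similarity of two words as the maximum cosine similarity over all
    pairs of their candidate sense embeddings (Equation 1)."""
    return max(cos(sense_vecs[s1], sense_vecs[s2])
               for s1 in candidate_senses[w1]
               for s2 in candidate_senses[w2])
```

Taking the maximum over sense pairs means that two ambiguous words are judged by their most compatible pair of meanings, which is the standard behavior of sense-based similarity measures.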

Output:          Words                    Senses                   Both
           WS-Sim      RG-65        WS-Sim      RG-65        WS-Sim      RG-65
Input      r     ρ     r     ρ      r     ρ     r     ρ      r     ρ     r     ρ
Words      0.49  0.48  0.65  0.66   0.56  0.56  0.67  0.67   0.54  0.53  0.66  0.65
Senses     0.69  0.69  0.70  0.71   0.69  0.70  0.70  0.74   0.72  0.71  0.71  0.74
Both       0.60  0.65  0.67  0.70   0.62  0.65  0.66  0.67   0.65  0.71  0.68  0.70

Table 1: Pearson (r) and Spearman (ρ) correlation performance of the nine configurations of SW2V

              WS-Sim        RG-65
              r      ρ      r      ρ
Shallow       0.72   0.71   0.71   0.74
Babelfy       0.65   0.63   0.69   0.70
Babelfy*      0.63   0.61   0.65   0.64

Table 2: Pearson (r) and Spearman (ρ) correlation performance of SW2V integrating our shallow word-sense connectivity algorithm (default), Babelfy, or Babelfy*.

We will refer to this latter version as Babelfy* and report the best configuration of each strategy according to our analysis.

Table 2 shows the results of our model using the three different strategies on RG-65 and WS-Sim. Our shallow word-sense connectivity algorithm achieves the best overall results. We believe that these results are due to the semantic connectivity ensured by our algorithm and to the possibility of associating words with more than one sense, which seems beneficial for training, making it more robust to possible disambiguation errors and to the sense granularity issue (Erk et al., 2013). The results are especially significant considering that our algorithm took a tenth of the time needed by Babelfy to process the corpus.

6 Evaluation

We perform a qualitative and quantitative evaluation of important features of SW2V in three different tasks. First, in order to compare our model against standard word-based approaches, we evaluate our system on the word similarity task (Section 6.1). Second, we measure the quality of our sense embeddings in a sense-specific application: sense clustering (Section 6.2). Finally, we evaluate the coherence of our unified vector space by measuring the interconnectivity of word and sense embeddings (Section 6.3).

Experimental setting. Throughout all the experiments we use the same standard hyperparameters mentioned in Section 5 for both the original word2vec implementation and our proposed model SW2V. For SW2V we use the same optimal configuration according to the analysis of the previous section (only senses as input, and both words and senses as output) for all tasks. As training corpus we take the full 3B-words UMBC webbase corpus and Wikipedia (Wikipedia dump of November 2014), used by three of the comparison systems. We use BabelNet 3.0 (SW2V_BN) and WordNet 3.0 (SW2V_WN) as semantic networks.

Comparison systems. We compare with the publicly available pre-trained sense embeddings of four state-of-the-art models: Chen et al. (2014)[9] and AutoExtend[10] (Rothe and Schütze, 2015) based on WordNet, and SensEmbed[11] (Iacobacci et al., 2015) and NASARI[12] (Camacho-Collados et al., 2016) based on BabelNet.

6.1 Word Similarity

In this section we evaluate our sense representations on the standard SimLex-999 (Hill et al., 2015) and MEN (Bruni et al., 2014) word similarity datasets.[13] SimLex and MEN contain 999 and 3000 word pairs, respectively, which constitute, to our knowledge, the two largest similarity datasets comprising a balanced set of noun, verb and adjective instances.

[9] http://pan.baidu.com/s/1eQcPK8i
[10] We used the AutoExtend code (http://cistern.cis.lmu.de/~sascha/AutoExtend/) to obtain sense vectors using W2V embeddings trained on UMBC (the GoogleNews corpus used in their pre-trained models is not publicly available). We also tried the code to include BabelNet as lexical resource, but it was not easily scalable (BabelNet is two orders of magnitude larger than WordNet).
[11] http://lcl.uniroma1.it/sensembed/
[12] http://lcl.uniroma1.it/nasari/
[13] To enable a fair comparison we did not perform experiments on the small datasets used in Section 5 for validation.

                                            SimLex-999       MEN
        System              Corpus          r      ρ         r      ρ
Senses
        SW2V_BN             UMBC            0.49   0.47      0.75   0.75
        SW2V_WN             UMBC            0.46   0.45      0.76   0.76
        AutoExtend          UMBC            0.47   0.45      0.74   0.75
        AutoExtend          Google-News     0.46   0.46      0.68   0.70
        SW2V_BN             Wikipedia       0.47   0.43      0.71   0.73
        SW2V_WN             Wikipedia       0.47   0.43      0.71   0.72
        SensEmbed           Wikipedia       0.43   0.39      0.65   0.70
        Chen et al. (2014)  Wikipedia       0.46   0.43      0.62   0.62
Words
        Word2vec            UMBC            0.39   0.39      0.75   0.75
        Retrofitting_BN     UMBC            0.47   0.46      0.75   0.76
        Retrofitting_WN     UMBC            0.47   0.46      0.76   0.76
        Word2vec            Wikipedia       0.39   0.38      0.71   0.72
        Retrofitting_BN     Wikipedia       0.35   0.32      0.66   0.66
        Retrofitting_WN     Wikipedia       0.47   0.44      0.73   0.73

Table 3: Pearson (r) and Spearman (ρ) correlation performance on the SimLex-999 and MEN word similarity datasets.

As explained in Section 5, we use the closest senses strategy for the word similarity measurement of our model and all sense-based comparison systems. As regards the word embedding models, words are directly compared by using cosine similarity. We also include a retrofitted version of the original word2vec word vectors (Faruqui et al., 2015, Retrofitting[14]) using WordNet (Retrofitting_WN) and BabelNet (Retrofitting_BN) as lexical resources.

Table 3 shows the results of SW2V and all comparison models on SimLex and MEN. SW2V consistently outperforms all sense-based comparison systems using the same corpus, and clearly performs better than the original word2vec trained on the same corpus. Retrofitting decreases the performance of the original word2vec on the Wikipedia corpus using BabelNet as lexical resource, but significantly improves the original word vectors on the UMBC corpus, obtaining results comparable to our approach. However, while our approach provides a shared space of words and senses, Retrofitting still conflates the different meanings of a word into the same vector.

Additionally, we noticed that most of the score divergences between our system and the gold standard scores in SimLex-999 were produced on antonym pairs, which are over-represented in this dataset: 38 word pairs hold a clear antonymy relation (e.g. encourage-discourage or long-short), while 41 additional pairs hold some degree of antonymy (e.g. new-ancient or man-woman).[15] In contrast to the consistently low gold similarity scores given to antonym pairs, our system varies its similarity scores depending on the specific nature of the pair.[16] Recent works have managed to obtain significant improvements by tweaking usual word embedding approaches into providing low similarity scores for antonym pairs (Pham et al., 2015; Schwartz et al., 2015; Nguyen et al., 2016; Mrksic et al., 2017), but this is outside the scope of this paper.

6.2 Sense Clustering

Current lexical resources tend to suffer from the high granularity of their sense inventories (Palmer et al., 2007). In fact, a meaningful clustering of their senses may lead to improvements on downstream tasks (Hovy et al., 2013; Flekova and Gurevych, 2016; Pilehvar et al., 2017). In this section we evaluate our synset representations on the Wikipedia sense clustering task.

[14] https://github.com/mfaruqui/retrofitting
[15] Two annotators decided the degree of antonymy between word pairs: clear antonyms, weak antonyms or neither.
[16] For instance, the pairs sunset-sunrise and day-night are given, respectively, 1.88 and 2.47 gold scores in the 0-10 scale, while our model gives them a higher similarity score. In fact, both pairs appear as coordinate synsets in WordNet.

106 Accuracy F-Measure resentations of NASARI and SensEmbed using the SW2V 87.8 63.9 same setup and the same underlying lexical re- SensEmbed 82.7 40.3 source. This confirms the capability of our system NASARI 87.0 62.5 to accurately capture the semantics of word senses Multi-SVM 85.5 - on this sense-specific task. Mono-SVM 83.5 - Baseline 17.5 29.8 6.3 Word and sense interconnectivity In the previous experiments we evaluated the ef- Table 4: Accuracy and F-Measure percentages of fectiveness of the sense embeddings. In contrast, different systems on the SemEval Wikipedia sense this experiment aims at testing the interconnec- clustering dataset. tivity between word and sense embeddings in the vector space. As explained in Section2, there have been previous approaches building a shared space parison systems that use the Wikipedia corpus for of word and sense embeddings, but to date lit- training, in this experiment we report the results of tle research has focused on testing the semantic our model trained on the Wikipedia corpus and us- coherence of the vector space. To this end, we ing BabelNet as lexical resource only. For the eval- evaluate our model on a Word Sense Disambigua- uation we consider the two Wikipedia sense clus- tion (WSD) task, using our shared vector space of tering datasets (500-pair and SemEval) created by words and senses to obtain a Most Common Sense Dandala et al.(2013). In these datasets sense clus- (MCS) baseline. The insight behind this experi- tering is viewed as a binary classification task in ment is that a semantically coherent shared space which, given a pair of Wikipedia pages, the system of words and senses should be able to build a rel- has to decide whether to cluster them into a single atively strong baseline for the task, as the MCS instance or not. To this end, we use our synset em- of a given word should be closer to the word 17 beddings and cluster Wikipedia pages together vector than any other sense. 
The MCS baseline if their similarity exceeds a threshold γ. In order is generally integrated into the pipeline of state- to set the optimal value of γ, we follow Dandala of-the-art WSD and Entity Linking systems as a et al. (2013) and use the first 500-pairs sense clus- back-off strategy (Navigli, 2009; Jin et al., 2009; tering dataset for tuning. We set the threshold γ Zhong and Ng, 2010; Moro et al., 2014; Raganato to 0.35, which is the value leading to the highest et al., 2017) and is used in various NLP applica- F-Measure among all values from 0 to 1 with a tions (Bennett et al., 2016). Therefore, a system 0.05 step size on the 500-pair dataset. Likewise, which automatically identifies the MCS of words we set a threshold for NASARI (0.7) and SensEm- from non-annotated text may be quite valuable, bed (0.3) comparison systems. especially for resource-poor languages or large Finally, we evaluate our approach on the Se- knowledge resources for which obtaining sense- mEval sense clustering test set. This test set con- annotated corpora is extremely expensive. More- sists of 925 pairs which were obtained from a over, even in a resource like WordNet for which set of highly ambiguous words gathered from sense-annotated data is available (Miller et al., past SemEval tasks. For comparison, we also in- 1993, SemCor), 61% of its polysemous lemmas clude the supervised approach of Dandala et al. have no sense annotations (Bennett et al., 2016). (2013) based on a multi-feature Support Vector Given an input word w, we compute the cosine Machine classifier trained on an automatically- similarity between w and all its candidate senses, labeled dataset of the English Wikipedia (Mono- picking the sense leading to the highest similarity: SVM) and Wikipedia in four different languages (Multi-SVM). As naive baseline we include the MCS(w) = argmax cos(~w, ~s) (2) s Sw system which would cluster all given pairs. 
6.3 Word and sense interconnectivity

In the previous experiments we evaluated the effectiveness of the sense embeddings. In contrast, this experiment aims at testing the interconnectivity between word and sense embeddings in the vector space. As explained in Section 2, there have been previous approaches to building a shared space of word and sense embeddings, but to date little research has focused on testing the semantic coherence of such a vector space. To this end, we evaluate our model on a Word Sense Disambiguation (WSD) task, using our shared vector space of words and senses to obtain a Most Common Sense (MCS) baseline. The insight behind this experiment is that a semantically coherent shared space of words and senses should provide a relatively strong baseline for the task, as the MCS of a given word should be closer to the word vector than any other sense.

The MCS baseline is generally integrated into the pipeline of state-of-the-art WSD and Entity Linking systems as a back-off strategy (Navigli, 2009; Jin et al., 2009; Zhong and Ng, 2010; Moro et al., 2014; Raganato et al., 2017) and is used in various NLP applications (Bennett et al., 2016). Therefore, a system which automatically identifies the MCS of words from non-annotated text may be quite valuable, especially for resource-poor languages or large knowledge resources for which obtaining sense-annotated corpora is extremely expensive. Moreover, even in a resource like WordNet, for which sense-annotated data is available (Miller et al., 1993, SemCor), 61% of the polysemous lemmas have no sense annotations (Bennett et al., 2016).

Given an input word w, we compute the cosine similarity between w and all its candidate senses, picking the sense leading to the highest similarity:

    MCS(w) = argmax_{s ∈ S_w} cos(~w, ~s)    (2)

where cos(~w, ~s) refers to the cosine similarity between the embeddings of w and s, and S_w is the set of candidate senses of w. In order to assess the reliability of SW2V against previous models using WordNet as sense inventory, we test our model on the all-words SemEval-2007 (task 17) (Pradhan et al., 2007) and SemEval-2013 (task 12) (Navigli et al., 2013) WSD datasets.
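Equation (2) amounts to a nearest-neighbor query restricted to the candidate senses of w. The following minimal sketch uses our own naming conventions and toy sense identifiers (dictionary-based lookups, not the paper's code):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_common_sense(word, word_vecs, sense_vecs, candidate_senses):
    # Equation (2): argmax over s in S_w of cos(w, s), where S_w is the
    # set of candidate senses of the input word.
    w = word_vecs[word]
    return max(candidate_senses[word], key=lambda s: cosine(w, sense_vecs[s]))

# Toy example: the word vector for "plant" lies closest to its factory sense.
word_vecs = {"plant": np.array([1.0, 0.2])}
sense_vecs = {"plant_factory": np.array([0.9, 0.3]),
              "plant_flora": np.array([0.1, 1.0])}
candidate_senses = {"plant": ["plant_factory", "plant_flora"]}
print(most_common_sense("plant", word_vecs, sense_vecs, candidate_senses))
# -> plant_factory
```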

Note that our model using BabelNet as semantic network has a far larger coverage than just WordNet and may additionally be used for Wikification (Mihalcea and Csomai, 2007) and Entity Linking tasks. Since the versions of WordNet vary across datasets and comparison systems, we decided to evaluate the systems on the portion of the datasets covered by all comparison systems [18] (less than 10% of instances were removed from each dataset).

             SemEval-07   SemEval-13
SW2V         39.9         54.0
AutoExtend   17.6         31.0
Baseline     24.8         34.9

Table 5: F-Measure percentage of different MCS strategies on the SemEval-2007 and SemEval-2013 WSD datasets.

Table 5 shows the results of our system and AutoExtend on the SemEval-2007 and SemEval-2013 WSD datasets. SW2V provides the best MCS results on both datasets. In general, AutoExtend does not accurately capture the predominant sense of a word and performs worse than a baseline that selects the intended sense randomly from the set of all possible senses of the target word.

In fact, AutoExtend tends to create clusters which include a word and all its possible senses. As an example, Table 6 shows the closest word and sense [19] embeddings of our SW2V model and AutoExtend to the military and fish senses of, respectively, company and school. AutoExtend creates clusters with all the senses of company and school and their related instances, even if they belong to different domains (e.g., firm^2_n or business^1_n clearly concern the business sense of company). Instead, SW2V creates a semantic cluster of word and sense embeddings which are semantically close to the corresponding company^2_n and school^7_n senses.

company^2_n (military unit)         school^7_n (group of fish)
AutoExtend      SW2V                AutoExtend       SW2V
company^9_n     battalion^1_n       school           schools^7_n
company         battalion           school^4_n       sharks^1_n
company^8_n     regiment^1_n        school^6_n       sharks
company^6_n     detachment^4_n      school^1_v       shoals^3_n
company^7_n     platoon^1_n         school^3_n       fish^1_n
company^1_v     brigade^1_n         elementary       dolphins^1_n
firm            regiment            schools          pods^3_n
business^1_n    corps^1_n           elementary^3_a   eels
firm^2_n        brigade             school^5_n       dolphins
company^1_n     platoon             elementary^1_a   whales^2_n

Table 6: Ten closest word and sense embeddings to the senses company^2_n (military unit) and school^7_n (group of fish).

[18] We were unable to obtain the word embeddings of Chen et al. (2014) for comparison even after contacting the authors.
[19] Following Navigli (2009), word^n_p is the n-th sense of word with part of speech p (using WordNet 3.0).
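Neighbor lists like those in Table 6 can be produced with a plain cosine nearest-neighbor query over a single dictionary that holds both word and sense vectors, which is exactly what a unified space makes possible. The sketch below uses toy vectors and invented identifiers:

```python
import numpy as np

def nearest_neighbors(query_id, vectors, k=10):
    # vectors maps both words and sense identifiers to NumPy arrays;
    # returns the k entries closest to the query by cosine similarity.
    q = vectors[query_id] / np.linalg.norm(vectors[query_id])
    scored = [(key, float(np.dot(q, v / np.linalg.norm(v))))
              for key, v in vectors.items() if key != query_id]
    return sorted(scored, key=lambda kv: -kv[1])[:k]

# Toy joint space: one sense vector plus plain word vectors.
vectors = {"company_2_n": np.array([1.0, 0.0]),
           "battalion_1_n": np.array([0.95, 0.05]),
           "regiment": np.array([0.8, 0.2]),
           "fish": np.array([0.0, 1.0])}
```

In this toy space, `nearest_neighbors("company_2_n", vectors, k=2)` ranks `battalion_1_n` first and `regiment` second, with `fish` far behind, mimicking the clean military cluster that SW2V produces in Table 6.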
7 Conclusion and Future Work

In this paper we proposed SW2V (Senses and Words to Vectors), a neural model which learns vector representations for words and senses in a joint training phase by exploiting both text corpora and knowledge from semantic networks. Data (including the preprocessed corpora and pre-trained embeddings used in the evaluation) and source code to apply our extension of the word2vec architecture to learn word and sense embeddings from any preprocessed corpus are freely available at http://lcl.uniroma1.it/sw2v. Unlike previous sense-based models, which require post-processing steps and use WordNet as sense inventory, our model achieves a semantically coherent vector space of both words and senses as an emerging feature of a single training phase, and is easily scalable to larger semantic networks like BabelNet. Finally, we showed, both quantitatively and qualitatively, some of the advantages of using our approach over previous state-of-the-art word- and sense-based models in various tasks, and highlighted interesting semantic properties of the resulting unified vector space of word and sense embeddings.

As future work we plan to integrate a WSD and Entity Linking system in order to apply our model to downstream NLP applications, along the lines of Pilehvar et al. (2017). We are also planning to apply our model to languages other than English and to study its potential in multilingual and cross-lingual applications.

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487. Jose Camacho-Collados is supported by a Google Doctoral Fellowship in Natural Language Processing. We would also like to thank Jim McManus for his comments on the manuscript.

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of NAACL, pages 19–27.

Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics 40(1):57–84.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. The Journal of Machine Learning Research 3:1137–1155.

Andrew Bennett, Timothy Baldwin, Jey Han Lau, Diana McCarthy, and Francis Bond. 2016. LexSemTm: A semantic dataset based on all-words unsupervised sense distribution learning. In Proceedings of ACL, pages 1513–1524.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of EMNLP, pages 615–620.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR) 49:1–47.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of Lexical Semantic Relatedness. Computational Linguistics 32(1):13–47.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. A Unified Multilingual Semantic Representation of Concepts. In Proceedings of ACL, Beijing, China, pages 741–751.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240:36–64.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP, Doha, Qatar, pages 1025–1035.

Bharath Dandala, Chris Hokamp, Rada Mihalcea, and Razvan C. Bunescu. 2013. Sense clustering using Wikipedia. In Proceedings of RANLP, Hissar, Bulgaria, pages 164–171.

Antonio Di Marco and Roberto Navigli. 2013. Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics 39(3):709–754.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2013. Measuring word meaning in context. Computational Linguistics 39(3):511–554.

Allyson Ettinger, Philip Resnik, and Marine Carpuat. 2016. Retrofitting Sense-Specific Word Vectors Using Parallel Text. In Proceedings of NAACL-HLT, pages 1378–1383.

Wei Fang, Jianwen Zhang, Dilin Wang, Zheng Chen, and Ming Li. 2016. Entity disambiguation by knowledge and text jointly embedding. In Proceedings of CoNLL, pages 260–269.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL, pages 1606–1615.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems 20(1):116–131.

Lucie Flekova and Iryna Gurevych. 2016. Supersense embeddings: A unified model for supersense interpretation, prediction, and utilization. In Proceedings of ACL, pages 2029–2041.

Josu Goikoetxea, Aitor Soroa, and Eneko Agirre. 2015. Random walks and neural network language models on knowledge bases. In Proceedings of NAACL, pages 1434–1439.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning sense-specific word embeddings by exploiting bilingual resources. In Proceedings of COLING, pages 497–507.

Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics, volume 1, pages 44–52.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.

Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2012. KORE: Keyphrase overlap relatedness for entity disambiguation. In Proceedings of CIKM, pages 545–554.

Eduard H. Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence 194:2–27.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL, Jeju Island, Korea, pages 873–882.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning sense embeddings for word and relational similarity. In Proceedings of ACL, Beijing, China, pages 95–105.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for Word Sense Disambiguation: An Evaluation Study. In Proceedings of ACL, pages 897–907.

Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of NAACL.

Peng Jin, Diana McCarthy, Rob Koeling, and John Carroll. 2009. Estimating and exploiting the entropy of sense distributions. In Proceedings of NAACL (2), pages 233–236.

Richard Johansson and Luis Nieto Piña. 2015. Embedding a semantic network in a word space. In Proceedings of NAACL, pages 1428–1433.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL 3:211–225.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? In Proceedings of EMNLP, Lisbon, Portugal.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In Proceedings of CoNLL, pages 51–61.

Rada Mihalcea and Andras Csomai. 2007. Wikify! Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, Lisbon, Portugal, pages 233–242.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, Plainsboro, N.J., pages 303–308.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: A Unified Approach. TACL 2:231–244.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL.

Roberto Navigli. 2009. Word Sense Disambiguation: A survey. ACM Computing Surveys 41(2):1–69.

Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 Task 12: Multilingual Word Sense Disambiguation. In Proceedings of SemEval 2013, pages 222–231.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193:217–250.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP, Doha, Qatar, pages 1059–1069.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL, pages 454–459.

Martha Palmer, Hoa Dang, and Christiane Fellbaum. 2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering 13(2):137–163.

Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann. 2017. Unsupervised does not mean uninterpretable: The case for word sense induction and disambiguation. In Proceedings of EACL, pages 86–98.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Nghia The Pham, Angeliki Lazaridou, and Marco Baroni. 2015. A multitask objective to inject lexical contrast into distributional semantics. In Proceedings of ACL, pages 21–26.

Mohammad Taher Pilehvar, Jose Camacho-Collados, Roberto Navigli, and Nigel Collier. 2017. Towards a Seamless Integration of Word Senses into Downstream NLP Applications. In Proceedings of ACL, Vancouver, Canada.

Mohammad Taher Pilehvar and Nigel Collier. 2016. De-conflated semantic representations. In Proceedings of EMNLP, Austin, TX.

Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task-17: English lexical sample, SRL and all words. In Proceedings of SemEval, pages 87–92.

Lin Qiu, Kewei Tu, and Yong Yu. 2016. Context-dependent sense embedding. In Proceedings of EMNLP, Austin, Texas, pages 183–191.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. In Proceedings of EACL, pages 99–110.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Proceedings of ACL, pages 109–117.

Zhi Zhong and Hwee Tou Ng. 2010. It Makes Sense: A wide-coverage Word Sense Disambiguation system for free text. In Proceedings of ACL System Demonstrations, pages 78–83.

Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP, pages 1393–1398.
Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of IJCAI, pages 448–453.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes. In Proceedings of ACL, Beijing, China, pages 1793–1803.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM 8(10):627–633.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics 24(1):97–123.

Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL, pages 258–267.

Robert Speer and Joanna Lowry-Duda. 2017. ConceptNet at SemEval-2017 Task 2: Extending word embeddings with multilingual relational knowledge. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 76–80.

Simon Šuster, Ivan Titov, and Gertjan van Noord. 2016. Bilingual learning of multi-sense embeddings with discrete autoencoders. In Proceedings of NAACL-HLT, pages 1346–1356.

Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING, pages 151–160.

Thuy Vu and D. Stott Parker. 2016. K-embeddings: Learning conceptual embeddings for words using context. In Proceedings of NAACL-HLT, pages 1262–1267.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of EMNLP, pages 1591–1601.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of ACL, Beijing, China, pages 323–333.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL (2), pages 545–550.
