On the Importance of Distinguishing Word Meaning Representations: A Case Study on Reverse Dictionary Mapping
Mohammad Taher Pilehvar
Tehran Institute for Advanced Studies (TeIAS), Tehran, Iran
DTAL, University of Cambridge, Cambridge, UK
[email protected]

Abstract

Meaning conflation deficiency is one of the main limiting factors of word representations which, given their widespread use at the core of many NLP systems, can lead to inaccurate semantic understanding of the input text and inevitably hamper performance. Sense representations target this problem. However, their potential impact has rarely been investigated in downstream NLP applications. Through a set of experiments on a state-of-the-art reverse dictionary system based on neural networks, we show that a simple adjustment aimed at addressing the meaning conflation deficiency can lead to substantial improvements.

1 Meaning Conflation Deficiency

Words are often the most fine-grained meaning-bearing components of NLP systems. As a standard practice, particularly for neural models, the input text is treated as a sequence of words, and each word in the sequence is represented with a dense distributional representation (word embedding). Importantly, this setting ignores the fact that a word can be polysemous, i.e., it can take multiple (possibly unrelated) meanings. Representing a word with all its possible meanings as a single point (vector) in the embedding space, the so-called meaning conflation deficiency (Camacho-Collados and Pilehvar, 2018), can hinder a system's semantic effectiveness.

To address this deficiency, many techniques have been put forward over the past few years, the most prominent of which is sense representation or multi-prototype embedding (Schütze, 1998; Reisinger and Mooney, 2010). However, as a general trend, these representations are usually evaluated either on generic benchmarks, such as word similarity, or on sense-centered tasks such as Word Sense Disambiguation, leaving their potential impact on downstream word-based systems unknown. In this paper, we provide an analysis to highlight the importance of addressing the meaning conflation deficiency. Specifically, we show how distinguishing different meanings of a word can facilitate a more accurate semantic understanding by a state-of-the-art reverse dictionary system, reflected by substantial improvements in recall and generalisation power.

2 Reverse Dictionary

Reverse dictionary, conceptual dictionary, or concept lookup is the task of returning a word given its description or definition (Brown and McNeill, 1966; Zock and Bilac, 2004). For example, given "a crystal of snow", the system has to return the word snowflake. The task is closely related to the "tip of the tongue" problem, in which an individual recalls some general features of a word but cannot retrieve it from memory. A reverse dictionary system can therefore be particularly useful to writers and translators when they cannot recall a word in time or are unsure how to express an idea they want to convey.

2.1 Evaluation framework

Our experiments are based on the reverse dictionary model of Hill et al. (2016), which leverages a standard neural architecture to map dictionary definitions to representations of the words defined by those definitions. Specifically, they proposed two neural architectures for mapping the definition of a word t to its word embedding e_t. Let D_t be the sequence of words in t's definition, i.e., D_t = {w_1, w_2, ..., w_n}, with corresponding embeddings {v_1, v_2, ..., v_n}. The two models differ in the way they process D_t. In the bag-of-words (BoW) model, D_t is taken as a bag of words, i.e., the representation of the definition is encoded by adding the word embeddings of all its content words, i.e., \sum_{i=1}^{n} v_i. The model then learns, using a fully-connected layer, a matrix for transforming the encoded representation to the target word's embedding e_t. The BoW model is not sensitive to the order of words in D_t, even though word order might be crucial for an accurate semantic understanding. The Recurrent Neural Network (RNN) model alleviates this issue by encoding the input sequence with an LSTM architecture (Hochreiter and Schmidhuber, 1997). As in the BoW model, a dense layer then maps the encoded representation to the target word's embedding.

In both cases, the goal is to map a given definition to the corresponding target word's embedding e_t, computed using Word2vec (Mikolov et al., 2013) independently from the training of the main model. Two cost functions were tested: (1) the cosine distance between the estimated point in the target space (ê_t) and e_t, and (2) the rank loss, which contrasts the choice of e_t with that of a randomly-selected word from the vocabulary other than t.
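To make the setup concrete, the following is a minimal sketch of the BoW variant and the two cost functions, assuming PyTorch and a matrix of pre-trained Word2vec vectors. The class and function names, the margin value, and the exact form of the rank loss are illustrative assumptions, not the implementation of Hill et al. (2016).

```python
# Illustrative sketch (not the authors' code) of the BoW reverse dictionary model:
# a definition is encoded as the sum of its word embeddings and mapped to the
# target word's pre-trained, frozen Word2vec embedding. Padding handling omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoWReverseDictionary(nn.Module):
    def __init__(self, pretrained_vectors: torch.Tensor, dim: int = 300):
        super().__init__()
        # Pre-trained word embeddings (e.g., Word2vec), kept fixed during training.
        self.embeddings = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        # Fully-connected layer transforming the summed definition vector
        # into the target embedding space.
        self.map = nn.Linear(dim, dim)

    def forward(self, definition_ids: torch.Tensor) -> torch.Tensor:
        v = self.embeddings(definition_ids)   # (batch, def_len, dim)
        bag = v.sum(dim=1)                    # order-insensitive BoW encoding
        return self.map(bag)                  # estimated target embedding ê_t

def cosine_loss(predicted: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Cost function (1): cosine distance between ê_t and e_t.
    return (1.0 - F.cosine_similarity(predicted, target)).mean()

def rank_loss(predicted: torch.Tensor, target: torch.Tensor,
              negative: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    # Cost function (2), in one common margin-based form: prefer e_t over the
    # embedding of a randomly-selected word other than t (margin value assumed).
    pos = F.cosine_similarity(predicted, target)
    neg = F.cosine_similarity(predicted, negative)
    return torch.clamp(margin - pos + neg, min=0.0).mean()
```

The RNN variant would replace the summation with an LSTM encoder over the definition, keeping the same output layer and cost functions.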
The reverse dictionary system takes advantage of a standard architecture which has proven effective in various NLP tasks. However, similarly to many other word-based models, it ignores the fact that the same word can have multiple (potentially unrelated) meanings. In effect, it tries to map multiple definitions, with different semantics, to the same point in the target space. For instance, the three semantically unrelated definitions of crane, "lifts and moves heavy objects", "large long-necked wading bird", and "a small constellation in the southern hemisphere", will receive similar semantic interpretations from the system. This word-level meaning conflation can hamper the system's ability to learn an accurate mapping function. In what follows, we illustrate how a simple sense-level distinction can facilitate a more accurate semantic understanding for the reverse dictionary system, leading to significant performance improvements.

2.2 Sense Integration

Let t be an ambiguous word with three meanings and, hence, three distinct definitions D_{t_1}, D_{t_2}, and D_{t_3}. The original model of Hill et al. (2016) maps all these definitions to e_t. We mitigate the meaning conflation deficiency through a sense-specific mapping function that obtains distinct interpretations for individual definitions, mapping them to different points in the target space: s_{t_1}, s_{t_2}, and s_{t_3}. Specifically, in our experiments we leveraged DeConf (Pilehvar and Collier, 2016). DeConf is a WordNet-based sense representation technique which receives a set of pre-trained word embeddings and generates embeddings for individual word senses in the same semantic space, hence producing a combined space of words and word senses.

DeConf performs a set of random walks on WordNet's semantic network and extracts, for each sense, a set of sense-biasing words B_s. A sense-biasing word for the i-th meaning of a target word t is a word semantically related to that specific sense of the word (s_{t_i}). For each word sense in WordNet we obtain the corresponding B_s. Then, the embedding for a specific word sense s is computed as:

    s = \left\lVert\, e_w + \alpha \sum_{b_i \in B_s} \exp(-\delta i)\, e_{b_i} \right\rVert    (1)

where δ is a decay parameter, i is the rank of the biasing word b_i in B_s, and e_w is the embedding of the corresponding lemma of sense s. In our experiments, we used as word embeddings the 300-dimensional Word2vec vectors trained on the Google News corpus.[1] The same set was used as input to DeConf. As a result of this process, around 207K additional word senses were introduced into the space for the 155K unique words in WordNet 3.0.

[1] https://code.google.com/archive/p/word2vec/
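Equation 1 can be read as a unit-length combination of the lemma embedding and the exponentially decayed embeddings of its sense-biasing words. The sketch below illustrates that reading in NumPy; the function name, the interpretation of ‖·‖ as unit-length normalisation, and the default values of α and δ are assumptions for illustration, not DeConf's released implementation.

```python
# Illustrative reading of Equation (1): combine the lemma embedding e_w with the
# embeddings of its sense-biasing words, each weighted by an exponential decay
# over its rank, then normalise to unit length (our reading of ||.||).
import numpy as np

def sense_embedding(e_w: np.ndarray, biasing_vectors: list,
                    alpha: float = 1.0, delta: float = 0.1) -> np.ndarray:
    """e_w: embedding of the sense's lemma; biasing_vectors: embeddings of the
    sense-biasing words B_s, ordered by relatedness (rank i = 1, 2, ...)."""
    s = e_w.astype(float).copy()
    for i, e_b in enumerate(biasing_vectors, start=1):
        s += alpha * np.exp(-delta * i) * e_b   # contribution decays with rank
    return s / np.linalg.norm(s)                # project onto the unit sphere
```

Applied to every WordNet sense, with the Google News Word2vec vectors as input, a procedure of this kind yields a combined space of words and word senses such as the one used in our experiments.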
2.2.1 Supersenses

It is widely acknowledged that sense distinctions in the WordNet inventory are too fine-grained for most NLP applications (Hovy et al., 2013). For instance, for the noun star, WordNet 3.0 lists eight senses, among which are two celestial-body senses (an "astronomical object" and something "visible, as a point of light, from the Earth") and three person senses ("skillful person", "lead actor", and "performing artist"). This level of sense distinction is often finer than that required by the target downstream application (Rüd et al., 2011; Severyn et al., 2013; Flekova and Gurevych, 2016). In our experiments, we used WordNet's lexicographer files (lexnames[2]) in order to reduce sense granularity. Created by the curators of WordNet, these files group senses into broad semantic categories, often referred to as supersenses.

[2] https://wordnet.princeton.edu/man/lexnames.5WN.html

                             WN-seen           WN-unseen         Concept
              Mapping        top-10   top-100  top-10   top-100  top-10   top-100
  Supersense  RNN  cosine    0.656    0.824    0.150    0.310    0.230    0.480
              RNN  ranking   0.694    0.836    0.162    0.352    0.335    0.630
              BoW  cosine    0.642    0.820    0.250    0.416    0.280    0.590
              BoW  ranking   0.706    0.872    0.310    0.474    0.390    0.735
  Sense       RNN  cosine    0.742    0.854    0.164    0.336    0.275    0.505
              RNN  ranking   0.668    0.840    0.180    0.372    0.325    0.615
              BoW  cosine    0.678    0.826    0.290    0.456    0.300    0.620
              BoW  ranking   0.692    0.848    0.292    0.470    0.380    0.735
  Word        RNN  cosine    0.462    0.652    0.056    0.162    0.215    0.400
              RNN  ranking   0.534    0.728    0.086    0.188    0.190    0.475
              BoW  cosine    0.446    0.652    0.136    0.264    0.175    0.465
              BoW  ranking   0.562    0.740    0.160    0.292    0.320    0.600
  Baseline    -              0.104    0.346    0.054    0.158    0.065    0.300

Table 1: Accuracy performance (top-10/top-100) of the original (word-based) reverse dictionary system and its sense- and supersense-based improvements on different datasets.
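The "Supersense" rows above rely on grouping WordNet senses by their lexicographer file. As a rough illustration of that grouping (not the paper's experimental pipeline), NLTK's WordNet interface exposes the lexname of each synset:

```python
# Illustrative sketch: collapse the WordNet noun senses of "star" into supersenses
# (lexicographer file names) using NLTK. Assumes the WordNet data has been
# downloaded, e.g. via nltk.download('wordnet').
from collections import defaultdict
from nltk.corpus import wordnet as wn

supersenses = defaultdict(list)
for synset in wn.synsets('star', pos=wn.NOUN):
    # lexname() returns the lexicographer file, e.g. 'noun.object' or 'noun.person'.
    supersenses[synset.lexname()].append(synset.name())

for supersense, senses in sorted(supersenses.items()):
    print(supersense, senses)
# The celestial-body senses fall under noun.object and the person senses under
# noun.person, so the eight noun senses collapse into a handful of groups.
```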