Arxiv:1905.05677V3 [Cs.CL] 27 Aug 2019 Various Approaches Have Been Proposed to Antonymy, Etc
Total Page:16
File Type:pdf, Size:1020Kb
Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense Disambiguation Loïc Vial Benjamin Lecouteux Didier Schwab Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France {loic.vial, benjamin.lecouteux, didier.schwab}@univ-grenoble-alpes.fr Abstract tify the different senses of words from unanno- tated or parallel corpora (e.g. Ide et al. (2002)). In this article, we tackle the issue of the Supervised methods are by far the most pre- limited quantity of manually sense anno- dominant as they generally offer the best results tated corpora for the task of word sense in evaluation campaigns (for instance (Navigli et disambiguation, by exploiting the seman- al., 2007)). State of the art classifiers used to com- tic relationships between senses such as bine specific features such as the parts of speech synonymy, hypernymy and hyponymy, in and the lemmas of surrounding words (Zhong and order to compress the sense vocabulary of Ng, 2010), but they are now replaced by neural Princeton WordNet, and thus reduce the networks which learn their own representation of number of different sense tags that must words (Raganato et al., 2017b; Le et al., 2018). be observed to disambiguate all words of One major bottleneck of supervised systems is the lexical database. We propose two dif- the restricted quantity of manually sense anno- ferent methods that greatly reduce the size tated corpora: In the annotated corpus SemCor of neural WSD models, with the benefit (Miller et al., 1993), the largest manually sense of improving their coverage without addi- annotated corpus available, words are annotated tional training data, and without impacting with 33 760 different sense keys, which corre- their precision. In addition to our meth- sponds to only approximately 16% of the sense ods, we present a WSD system which re- inventory of WordNet (Miller, 1995), the lexical lies on pre-trained BERT word vectors in database of reference widely used in WSD. Many order to achieve results that significantly works try to leverage this problem by creating outperforms the state of the art on all WSD new sense annotated corpora, either automatically evaluation tasks. (Pasini and Navigli, 2017), semi-automatically (Taghipour and Ng, 2015), or through crowdsourc- 1 Introduction ing (Yuan et al., 2016). Word Sense Disambiguation (WSD) is a task In this work, the idea is to solve this issue by which aims to clarify a text by assigning to each taking advantage of the semantic relationships be- of its words the most suitable sense labels, given a tween senses included in WordNet, such as the predefined sense inventory. hypernymy, the hyponymy, the meronymy, the arXiv:1905.05677v3 [cs.CL] 27 Aug 2019 Various approaches have been proposed to antonymy, etc. Our method is based on the ob- achieve WSD: Knowledge-based methods rely on servation that a sense and its closest related senses dictionaries, lexical databases, thesauri or knowl- (its hypernym or its hyponyms for instance) all edge graphs as primary resources, and use algo- share a common idea or concept, and so a word rithms such as lexical similarity measures (Lesk, can sometimes be disambiguated using only re- 1986) or graph-based measures (Moro et al., lated concepts. Consequently, we do not need to 2014). Supervised methods, on the other hand, ex- know every sense of WordNet to disambiguate all ploit sense annotated corpora as training instances words of WordNet. for a classifier such as SVM (Chan et al., 2007; For instance, let us consider the word “mouse” Zhong and Ng, 2010), or more recently by a neu- and two of its senses which are the computer ral network (Kågebäck and Salomonsson, 2016). mouse and the animal mouse. We only need to Finally, unsupervised methods automatically iden- know the notions of “animal” and “electronic de- vice” to distinguish them, and all notions that are and the sense closest to the predicted vector is as- more specialized such as “rodent” or “mammal” signed to each word. are therefore superfluous. By grouping them, we These systems have the advantage of bypassing can benefit from all other instances of electronic the problem of the lack of sense annotated data by devices or animals in a training corpus, even if concentrating the power of abstraction offered by they do not mention the word “mouse”. recurrent neural networks on a good quality lan- Contributions: In this paper, we hypothesize that guage model trained in an unsupervised manner. only a subset of WordNet senses could be con- However, sense annotated corpora are still indis- sidered to disambiguate all words of the lexical pensable to contruct the sense vectors. database. Therefore, we propose two different methods for building this subset and we call them 2.2 WSD Based on a Softmax Classifier sense vocabulary compression methods. By us- In these systems, the main neural network directly ing these techniques, we are able to greatly im- classifies and attributes a sense to each input word prove the coverage of supervised WSD systems, through a probability distribution computed by a nearly eliminating the need for a backoff strategy softmax function. Sense annotations are simply that is currently used in most systems when deal- seen as tags put on every word, like a POS-tagging ing with a word which has never been observed task for instance. in the training data. We evaluate our method on We can distinguish two separate branches of a state of the art WSD neural network, based on these types of neural networks: pretrained contextualized word vector representa- 1. Those in which we have several distinct and tions, and we present results that significantly out- token-specific neural networks (or classifiers) perform the state of the art on every standard WSD for every different word in the dictionary (Ia- evaluation task. Finally, we provide a documented cobacci et al., 2016; Kågebäck and Salomons- tool for training and evaluating neural WSD mod- son, 2016), each of them being able to manage els, as well as our best pretrained model in a dedi- a particular word and its particular senses. For cated GitHub repository1. instance, one of the classifiers is specialized in 2 Related Work choosing between the four possible senses of the noun “mouse”. This type of approach is In WSD, several recent advances have been made particularly fitted for the lexical sample tasks, in the creation of new neural architectures for su- where a small and finite set of very ambigu- pervised models and the integration of knowledge ous words have to be sense annotated in several into these systems. Multiple works also exploit the contexts, but it can also be used in all-words idea of grouping together related senses. In this word sense disambiguation tasks. section, we give an overview of these works. 2. Those in which we have a larger and general 2.1 WSD Based on a Language Model neural network that is able to manage all dif- ferent words and assign a sense in the set of all In this type of approach, that has been initiated by existing sense in the dictionary used (Raganato Yuan et al. (2016) and reimplemented by Le et al. et al., 2017b). (2018), the central component is a neural language The advantage of the first branch of approaches model able to predict a word with consideration is that in order to disambiguate a word, limiting for the words surrounding it, thanks to a recurrent our choice to one of its possible senses is compu- neural network trained on a massive quantity of tationally much easier than searching through all unannotated data. the senses of all words. To put things in perspec- Once the language model is trained, it is used to tive, the average number of senses of polysemous produce sense vectors that result from averaging words in WordNet is approximately 3, whereas the word vectors predicted by the language model the total number of senses considering all words at all positions of words annotated with the given is 206 941. sense. The second approach, however, has an interest- At test time, the language model is used to pre- ing property: all senses reside in the same vector dict a vector according to the surrounding context, space and hence share features in the hidden layers 1https://github.com/getalp/disambiguate of the network. This allows the model to predict an identical sense for two different words (i.e. syn- tag, by simply keeping track of which sense key of onyms), but it also offers the possibility to predict its lemma belongs to the predicted group. a sense for a word not present in the dictionary (e.g. neologism, spelling mistake...). 3 Sense Vocabulary Compression Finally, in two recent articles, Luo et al. (2018a) Current state of the art supervised WSD systems and Luo et al. (2018b) have proposed an improve- such as Yuan et al. (2016), Raganato et al. (2017b), ment of these type of architectures, by computing Luo et al. (2018a) and Le et al. (2018) are all con- an attention between the context of a target word fronted to the following issues: and the gloss of its different senses. Thus, their 1. Due to the small number of manually sense an- work is one of the first to incorporate knowledge notated corpora available, a target word may from WordNet into a WSD neural network. never be observed during the training, and 2.3 Sense Clustering Methods therefore the system is not able to annotate it. Several works exploit the idea of grouping to- 2. For the same reason, a word may have been ob- gether mutiple WordNet sense tags in order to cre- served, but not all of its senses.