Sparsity Makes Sense: Word Sense Disambiguation Using Sparse Contextualized Word Representations

Gábor Berend¹,²
¹Institute of Informatics, University of Szeged
²MTA-SZTE Research Group on Artificial Intelligence
[email protected]

Abstract

In this paper, we demonstrate that by utilizing sparse word representations, it becomes possible to surpass the results of more complex task-specific models on the task of fine-grained all-words word sense disambiguation. Our proposed algorithm relies on an overcomplete set of semantic basis vectors that allows us to obtain sparse contextualized word representations. We introduce such an information theory-inspired synset representation based on the co-occurrence of word senses and non-zero coordinates of word forms, which allows us to achieve an aggregated F-score of 78.8 over a combination of five standard word sense disambiguation benchmark datasets. We also demonstrate the general applicability of our proposed framework by evaluating it on part-of-speech tagging over four different treebanks. Our results indicate a significant improvement over the application of the dense word representations.

1 Introduction

Natural language processing applications have benefited remarkably from language modeling based contextualized word representations, including CoVe (McCann et al., 2017), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), inter alia. Contrary to standard "static" word embeddings like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), contextualized representations assign vectorial representations to mentions of word forms that are sensitive to the entire sequence in which they are present. This characteristic of contextualized word embeddings makes them highly applicable for performing word sense disambiguation (WSD), as has been investigated recently (Loureiro and Jorge, 2019; Vial et al., 2019).

Another popular line of research deals with sparse overcomplete word representations, which differ from typical word embeddings in that most coefficients are exactly zero. Such sparse word representations have been argued to convey an increased interpretability (Murphy et al., 2012; Faruqui et al., 2015; Subramanian et al., 2018), which could be advantageous for WSD. It has been shown that sparsity can not only favor interpretability, but it can also contribute to an increased performance in downstream applications (Faruqui et al., 2015; Berend, 2017).

The goal of this paper is to investigate and quantify what synergies exist between contextualized and sparse word representations. Our rigorous experiments show that it is possible to get increased performance on top of contextualized representations when they are post-processed in a way which ensures their sparsity.

In this paper we introduce an information theory-inspired algorithm for creating sparse contextualized word representations and evaluate it in a series of challenging WSD tasks. In our experiments, we managed to obtain solid results for multiple fine-grained word sense disambiguation benchmarks. All our source code for reproducing our experiments is made available at https://github.com/begab/sparsity_makes_sense.¹

Our contributions can be summarized as follows:

• we propose the application of contextualized sparse overcomplete word representations in the task of word sense disambiguation,

• we carefully evaluate our information theory-inspired approach for quantifying the strength of the connection between the individual dimensions of (sparse) word representations and human-interpretable semantic content such as fine-grained word senses,

• we demonstrate the general applicability of our algorithm by applying it to POS tagging on four different UD treebanks.

¹An additional demo application performing all-words word sense disambiguation is also made available at http://www.inf.u-szeged.hu/~berendg/nlp_demos/wsd.

2 Related work

One of the key difficulties of natural language understanding is the highly ambiguous nature of language. As a consequence, WSD has long-standing origins in the NLP community (Lesk, 1986; Resnik, 1997a,b), still receiving major recent research interest (Raganato et al., 2017a; Trask et al., 2015; Melamud et al., 2016; Loureiro and Jorge, 2019; Vial et al., 2019). A thorough survey on WSD algorithms of the pre-neural era can be found in (Navigli, 2009).

A typical evaluation for WSD systems is to quantify the extent to which they are capable of identifying the correct sense of ambiguous words in their contexts according to some sense inventory. One of the most frequently applied sense inventories in the case of English is the Princeton WordNet (Fellbaum, 1998), which also served as the basis of our evaluation.

A variety of WSD approaches has evolved, ranging from unsupervised and knowledge-based solutions to supervised ones. Unsupervised approaches could investigate the textual overlap between the context of ambiguous words and their potential sense definitions (Lesk, 1986), or they could be based on random walks over the semantic graph providing the sense inventory (Agirre and Soroa, 2009).

Supervised WSD techniques typically perform better than unsupervised approaches. IMS (Zhong and Ng, 2010) is a classical supervised WSD framework which was created with the intention of easy extensibility. It trains SVMs for predicting the correct sense of a word based on traditional features, such as the surface forms and POS tags of the ambiguous word as well as its neighboring words.

The recent advent of neural text representations has also shaped the landscape of algorithms performing WSD. Iacobacci et al. (2016) extended the classical feature-based IMS framework by incorporating word embeddings. Melamud et al. (2016) devised context2vec, which relies on a bidirectional LSTM (biLSTM) for performing supervised WSD. Kågebäck and Salomonsson (2016) also proposed the utilization of biLSTMs for WSD. Raganato et al. (2017b) tackled all-words WSD as a sequence learning task and solved it using LSTMs. Vial et al. (2019) introduced a similar framework, but replaced the LSTM decoder with an ensemble of transformers. Vial et al. (2019) additionally relied on BERT contextual word representations as input to their all-words WSD system.

Contextual word embeddings have recently superseded traditional word embeddings due to their advantageous property of also modeling the neighboring context of words upon determining their vectorial representations. As such, the same word form gets assigned a separate embedding when mentioned in different contexts. Contextualized word vectors, including BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019), typically employ some language modeling-inspired objective and are trained on massive amounts of textual data, which makes them generally applicable in a variety of settings, as illustrated by top-performing entries on the SuperGLUE leaderboard (Wang et al., 2019).

Most recently, Loureiro and Jorge (2019) have proposed the usage of contextualized word representations for tackling WSD. Their framework builds upon BERT embeddings and performs WSD relying on a k-NN approach of query words towards the sense embeddings that are derived as the centroids of contextual embeddings labeled with a certain sense. The framework also utilizes static fastText (Bojanowski et al., 2017) embeddings, as well as averaged contextual embeddings derived from the definitions attached to WordNet senses, for mitigating the problem caused by the limited amount of sense-labeled training data.

Kumar et al. (2019) proposed the EWISE approach, which constructs sense definition embeddings also relying on the network structure of WordNet for performing zero-shot WSD in order to handle words without any sense-annotated occurrence in the training data. Bevilacqua and Navigli (2020) introduce EWISER as an improvement over the EWISE approach by providing a hybrid knowledge-based and supervised approach via the integration of explicit relational information from WordNet. Our approach differs from both (Kumar et al., 2019) and (Bevilacqua and Navigli, 2020) in that we do not exploit the structural properties of WordNet.

SenseBERT (Levine et al., 2019) extends BERT (Devlin et al., 2019) by incorporating an auxiliary task into the masked language modeling objective for predicting word supersenses besides word identities.

Our approach differs from SenseBERT in that we do not propose an alternative way for training contextualized embeddings, but introduce an algorithm for extracting a useful representation from pretrained BERT embeddings that can effectively be used for WSD. Due to this conceptual difference, our approach does not need a large transformer model to be trained, but can be readily applied over pretrained models.

GlossBERT (Huang et al., 2019) framed WSD as a sentence pair classification task between the sentence containing an ambiguous target token and the contents of the glosses of the potential synsets of the ambiguous token, and fine-tuned BERT accordingly. GlossBERT hence requires a fine-tuning stage, whereas our approach builds directly on the pre-trained contextual embeddings, which makes it more resource efficient.

Our work also relates to the line of research on sparse word representations. The seminal work on obtaining sparse word representations by Murphy et al. (2012) applied matrix factorization over the co-occurrence matrix built from some corpus. Arora et al. (2018) investigated the linear algebraic structure of static word embeddings and concluded that "simple sparse coding can recover vectors that approximately capture the senses". Faruqui et al. (2015); Berend (2017); Subramanian et al. (2018) introduced different approaches for obtaining sparse word representations from traditional static and dense word vectors. Our work differs from all the previously mentioned papers in that we create sparse contextualized word representations.
3 Approach

Our algorithm is composed of two important steps, i.e. we first make a sparse representation from the dense contextualized ones, then we derive a succinct representation describing the strength of the connection between the individual bases of our representation and the sense inventory we would like to perform WSD against. We elaborate on these components next.

3.1 Sparse contextualized embeddings

Our algorithm first determines contextualized word representations for some sense-annotated corpus. We shall denote the surface form realizations in the corpus as X = {{x_j^(i)}_{j=0}^{N_i}}_{i=0}^{M}, with x_j^(i) standing for the token at position j within sentence i, supposing a total of M sequences and N_i tokens in sentence i. We refer to the contextualized word representation of some token in boldface, i.e. **x**_j^(i), and to the collection of contextual embeddings as **X** = {{**x**_j^(i)}_{j=0}^{N_i}}_{i=0}^{M}.

Likewise to the sequence of sentences and their respective tokens, we also utilize a sequence of annotations that we denote as S = {{s_j^(i)}_{j=0}^{N_i}}_{i=0}^{M}, with s_j^(i) indicating the labeling of token j within sentence i. We have s_j^(i) ∈ {0,1}^{|S|}, with S denoting the set of possible labels included in our annotated corpus. That is, we have an indicator vector conveying the annotation of every token. We allow for the s_j^(i) = 0 case, meaning that it is possible that certain tokens lack annotation. In the case of WSD, the annotation is meant in the form of sense annotation, but in general, the token-level annotations could convey other types of information as well.

The next step in our algorithm is to perform sparse coding over the contextual embeddings of the annotated corpus. Sparse coding is a matrix decomposition technique which tries to approximate some matrix X ∈ R^{v×m} as a product of a sparse matrix α ∈ R^{v×k} and a dictionary matrix D ∈ R^{k×m}, where k denotes the number of basis vectors to be employed.

We formed matrix X by stacking and unit normalizing the contextual embeddings comprising **X**. We then optimize

$$\min_{D \in \mathcal{C},\ \alpha_j^{(i)} \in \mathbb{R}^{k}_{\geq 0}} \sum_{i=1}^{M} \sum_{j=1}^{N_i} \left\| \mathbf{x}_j^{(i)} - \alpha_j^{(i)} D \right\|_2^2 + \lambda \left\| \alpha_j^{(i)} \right\|_1, \qquad (1)$$

where C denotes the convex set of matrices with row norm at most 1, λ is the regularization coefficient, and the sparse coefficients in α_j^(i) are required to be non-negative. We imposed the non-negativity constraint on α as it has been reported to provide increased interpretability (Murphy et al., 2012).
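For illustration, a minimal sketch of the optimization in Eq. (1) is given below, assuming the SPAMS Python bindings (the library we rely on in Section 4.1). The function and variable names are illustrative and the snippet is not an excerpt of the released implementation; the exact argument names of spams.trainDL and spams.lasso may differ slightly across SPAMS versions.

import numpy as np
import spams

def fit_sparse_embeddings(dense_vectors, k=3000, lam=0.05):
    """dense_vectors: array of shape (n_tokens, 1024) with contextualized embeddings."""
    # unit-normalize every embedding, then arrange tokens as columns (SPAMS convention)
    X = dense_vectors / np.linalg.norm(dense_vectors, axis=1, keepdims=True)
    X = np.asfortranarray(X.T, dtype=np.float64)

    # learn an overcomplete dictionary D (atoms of norm at most 1) together with
    # non-negative, l1-regularized sparse coefficients, mirroring Eq. (1)
    D = spams.trainDL(X, K=k, lambda1=lam, posAlpha=True, iter=1000)

    # sparse codes alpha for every token, with the dictionary D kept fixed
    alphas = spams.lasso(X, D=D, lambda1=lam, pos=True)  # sparse matrix of shape (k, n_tokens)
    return D, alphas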
3.2 Binding basis vectors to senses

Once we have obtained a sparse contextualized representation for each token in our annotated corpus, we determine the extent to which the individual bases comprising the dictionary matrix D bind to the elements of our label inventory S. In order to do so, we devise a matrix Φ ∈ R^{k×|S|} which contains a φ_bs score for each pair of basis vector b and particular label s. We summarize our algorithm for obtaining Φ in Algorithm 1.

The definition of Φ is based on a generalization of the co-occurrence of bases and the elements of the label inventory S. We first define our co-occurrence matrix between bases and labels as

$$C = \sum_{i=1}^{M} \sum_{j=1}^{N_i} \alpha_j^{(i)} s_j^{(i)\top}, \qquad (2)$$

i.e. C is the sum of the outer products of the sparse word representations (α_j^(i)) and their respective sense description vectors (s_j^(i)). The definition in (2) ensures that every c_bs ∈ C aggregates the sparse nonnegative coefficients that words labeled as s have received for their coordinate b. Recall that we allowed certain s_j^(i) to be the all-zero vector, i.e. tokens that lack any annotation are conveniently handled by Eq. (2), as the sparse coefficients of such tokens do not contribute towards C.

We next turn the elements of C into a matrix representing a joint probability distribution P by determining the ℓ1-normalized variant of C (line 5 of Algorithm 1). This way we devise a sparse matrix, the entries of which can be used for calculating the Pointwise Mutual Information (PMI) between the semantic bases and the presence of the symbolic senses of our sense inventory.

For a pair of events (i, j), PMI is measured as log(p_ij / (p_i* p_*j)), with p_ij referring to their joint probability, and p_i* and p_*j denoting the marginal probabilities of i and j, respectively. We determine these probabilities from the entries of P that we obtain from C via ℓ1 normalization.
Employing positive PMI. Negative PMI values for a pair of events convey the information that they repel each other. Multiple studies have argued that negative PMI values are hence detrimental (Bullinaria and Levy, 2007; Levy et al., 2015). To this end, we can opt for the determination of positive PMI (pPMI) values as indicated in line 7 of Algorithm 1.

Employing normalized PMI. An additional property of (positive) PMI is that it favors observations with low marginal frequency (Bouma, 2009), since for events with low marginal probability p(x), p(x|y) ≈ p(x) tends to hold, which results in high PMI values. In our setting, this would result in rarer senses receiving higher φ_bs scores towards all the bases.

In order to handle low-frequency senses better, we optionally calculate the normalized (positive) PMI (Bouma, 2009) between a pair of basis and sense as log(p_ij / (p_i* p_*j)) / (−log(p_ij)). That is, we normalize the PMI scores by the negative logarithm of the joint probability (cf. line 8 of Algorithm 1). This step additionally ensures that the normalized PMI (nPMI) ranges between −1 and 1, as opposed to the (−∞, min(−log(p_i), −log(p_j))) range of the unnormalized PMI values.

Algorithm 1: Calculating Φ(X, S)
Require: sense-annotated corpus (X, S)
Ensure: Φ ∈ R^{k×|S|} describing the strength between the k semantic bases and the elements of the sense inventory S
 1: procedure CALCULATEPHI(X, S)
 2:     X ← UNITNORMALIZE(X)
 3:     D, α ← argmin_{D∈C, α∈R≥0} ‖X − Dα‖_F + λ‖α‖_1
 4:     C ← αS
 5:     P ← C / ‖C‖_1
 6:     Φ ← [log(p_ij / (p_i* p_*j))]_ij
 7:     Φ ← [max(0, φ_ij)]_ij                ▷ cf. pPMI
 8:     Φ ← [φ_ij / (−log(p_ij))]_ij         ▷ cf. nPMI
 9:     return Φ, D
10: end procedure
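The sketch below re-implements the core of Algorithm 1 in a few lines of numpy, assuming that the sparse codes have already been obtained (e.g. with the SPAMS sketch in Section 3.1). The names are illustrative and do not come from the released code; the flags make the pPMI/nPMI choices of lines 7–8 explicit.

import numpy as np

def compute_phi(alphas, senses, n_senses, positive=True, normalize=True):
    """alphas: (k, n_tokens) sparse codes; senses[j]: sense id of token j, or None if unannotated."""
    if hasattr(alphas, "todense"):                # e.g. the scipy sparse output of spams.lasso
        alphas = np.asarray(alphas.todense())
    k, n_tokens = alphas.shape
    C = np.zeros((k, n_senses))
    for j in range(n_tokens):
        if senses[j] is not None:                 # unannotated tokens do not contribute to C
            C[:, senses[j]] += alphas[:, j]

    P = C / C.sum()                               # joint distribution over (basis, sense) pairs
    p_basis = P.sum(axis=1, keepdims=True)        # marginals of the bases
    p_sense = P.sum(axis=0, keepdims=True)        # marginals of the senses
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = np.log(P) - np.log(p_basis) - np.log(p_sense)   # PMI (line 6)
        if normalize:
            phi = phi / -np.log(P)                # nPMI, bounded by [-1, 1] (line 8)
    phi[~np.isfinite(phi)] = 0.0                  # pairs that never co-occur carry no association
    if positive:
        phi = np.maximum(phi, 0.0)                # pPMI / npPMI (line 7)
    return phi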
3.3 Inferring senses

We now describe the way we assign the most plausible sense to any given token of a sequence according to the sense inventory employed for constructing D and Φ.

For an input sequence of N tokens accompanied by their corresponding contextualized word representations [x_j]_{j=1}^{N}, we determine their corresponding sparse representations [α_j]_{j=1}^{N} based on the D that we have already determined upon obtaining Φ. That is, we solve an ℓ1-regularized convex optimization problem, with D kept fixed, for all the unit-normalized vectors x_j in order to obtain the sparse contextualized word representation α_j of every token j in the sequence.

We then take the product of α_j ∈ R^k and Φ ∈ R^{k×|S|}. Since every column of Φ corresponds to a sense from the sense inventory, every scalar in the resulting product α_j^⊤Φ ∈ R^{|S|} can be interpreted as a quantity indicating the extent to which token j, in its given context, pertains to the individual senses of the sense inventory. In other words, we assign that sense s to a particular token j which maximizes α_j^⊤Φ_{*s}, where Φ_{*s} denotes the column vector of Φ corresponding to sense s.
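A minimal sketch of this decision rule is shown below. It assumes the query token has already been sparse coded against the fixed dictionary D (e.g. via the spams.lasso call sketched in Section 3.1) and that candidate_senses holds the ids of the synsets the query lemma can belong to; the names are illustrative.

import numpy as np

def predict_sense(alpha, phi, candidate_senses):
    """alpha: (k,) non-negative sparse code of the query token; phi: (k, |S|) association matrix."""
    scores = np.asarray(alpha).ravel() @ phi      # alpha_j^T Phi, one score per sense
    # restrict the argmax to the senses the query lemma can actually take
    return max(candidate_senses, key=lambda s: scores[s])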

4 Experiments and results

We evaluate our approach on the unified WSD evaluation framework released by Raganato et al. (2017a), which includes the sense-annotated SemCor dataset for training purposes. SemCor (Miller et al., 1994) consists of 802,443 tokens, with more than 28% of them (226,036) being sense-annotated using WordNet sensekeys. For instance, bank%1:14:00:: is one of the possible sensekeys the word bank can be assigned to, corresponding to one of the 18 different synsets it is included in according to WordNet 3.0. WordNet 3.0 contains altogether 206,949 distinct senses for 147,306 unique lemmas grouped into 117,659 synsets. We constructed Φ relying on the synset-level information of WordNet.
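As an illustration of the sensekey–synset relation described above (not part of our pipeline), the mapping can be inspected with NLTK's WordNet 3.0 interface:

from nltk.corpus import wordnet as wn

lemma = wn.lemma_from_key("bank%1:14:00::")   # resolve a WordNet sensekey to a lemma
print(lemma.synset())                         # the synset this sensekey belongs to
print(len(wn.synsets("bank")))                # the 18 synsets that contain the word "bank"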

4.1 Sparse contextualized embeddings

For obtaining contextualized word representations, we rely on the pretrained BERT-large-cased model from (Wolf et al., 2019). Each input token x_j^(i) gets assigned 25 contextual vectors [x_{j,l}^(i)]_{l=0}^{24}, corresponding to the input layer and the 24 inner layers of the BERT-large model. Each vector x_{j,l}^(i) is 1024-dimensional.

BERT relies on WordPiece tokenization, which means that a single token, such as playing, could be broken up into multiple subwords (play and ##ing). We defined the token-level contextual embeddings to be the average of their subword-level contextual embeddings.

Sparse coding as formulated in (1) took the stacked 1024-dimensional contextualized BERT embeddings of the 802,443 tokens from SemCor as input, i.e. we had X ∈ R^{1024×802443}. We used the SPAMS library (Mairal et al., 2009) to solve our optimization problems. Our approach has two hyperparameters, i.e. the number of basis vectors included in the dictionary matrix (k) and the regularization coefficient (λ). We experimented with k ∈ {1500, 2000, 3000} in order to investigate the sensitivity of our proposed algorithm to the dimension of the sparse vectors, and we employed λ = 0.05 throughout all our experiments.

Figure 1 reports the average number of nonzero coefficients of the sparse word representations of the SemCor tokens when using different values of k and different layers of BERT as input. The average time for determining the sparse contextualized word representations for one layer of BERT was 40 minutes on an Intel Xeon 5218 for k = 3000.

Figure 1: Average number of nonzero coefficients per SemCor token when relying on contextualized embeddings from different layers of BERT as input.
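The subword averaging described above can be realized, for instance, with the Hugging Face transformers library as sketched below. The snippet is illustrative rather than an excerpt of the released code, and the layer index is a free parameter (our experiments sweep all layers and later average the last four).

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModel.from_pretrained("bert-large-cased", output_hidden_states=True)

def token_embeddings(words, layer=24):
    """Average the subword vectors of a chosen layer into one 1024-dim vector per token."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]     # (n_subwords, 1024)
    vectors = []
    for widx in range(len(words)):
        # subword positions belonging to word widx ([CLS]/[SEP] are mapped to None)
        positions = [p for p, w in enumerate(enc.word_ids()) if w == widx]
        vectors.append(hidden[positions].mean(dim=0))
    return torch.stack(vectors)                           # (len(words), 1024)

vecs = token_embeddings("The bank raised the interest rates .".split())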
4.2 Evaluation on all-words WSD

The evaluation framework introduced in (Raganato et al., 2017a) contains five different all-words WSD benchmarks for measuring the performance of WSD systems. The dataset includes the SensEval2 (Edmonds and Cotton, 2001), SensEval3 (Mihalcea et al., 2004), SemEval 2007 Task 17 (Pradhan et al., 2007), SemEval 2013 Task 12 (Navigli et al., 2013) and SemEval 2015 Task 13 (Moro and Navigli, 2015) datasets, containing 2282, 1850, 455, 1644 and 1022 sense-annotated tokens, respectively. The concatenation of these datasets is also included in the evaluation toolkit; it is commonly referred to as the ALL dataset and includes 7253 sense-annotated test cases. We relied on the official scoring script included in the evaluation framework of (Raganato et al., 2017a). Unless stated otherwise, we report our results on the combination of all the datasets for brevity, as the results for the individual subcorpora behaved similarly.

In order to demonstrate the benefits of our proposed approach, we developed a strong baseline similar to the one devised in (Loureiro and Jorge, 2019). This baseline employs the very same contextualized embeddings that we use otherwise in our algorithm, providing identical conditions for the different approaches. For each synset s, we determine its centroid based on the contextualized word representations pertaining to sense s according to the training data. We then use this matrix Ψ as a replacement of Φ when making predictions for some token with its dense contextualized embedding x_j.

The way we make our fine-grained sensekey predictions for the test tokens is identical when utilizing dense and sparse contextualized embeddings; the only difference is whether we base our decision on x_j^⊤Ψ (for the dense case) or α_j^⊤Φ (for the sparse case). In either case, we choose the best scoring synset a particular query lemma can belong to. That is, we perform the argmax operation described in Section 3.3 over the set of possible synsets a query lemma can belong to.
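A sketch of this dense-centroid baseline is given below (illustrative names, not the released code); prediction then proceeds exactly as in the Section 3.3 sketch, with x_j and Ψ taking the place of α_j and Φ.

import numpy as np

def build_psi(dense_vectors, senses, n_senses):
    """Psi[:, s] is the centroid of the unit-normalized dense embeddings annotated with sense s."""
    dim = dense_vectors.shape[1]
    psi = np.zeros((dim, n_senses))
    counts = np.zeros(n_senses)
    for x, s in zip(dense_vectors, senses):
        if s is not None:                  # only sense-annotated tokens contribute
            psi[:, s] += x
            counts[s] += 1
    return psi / np.maximum(counts, 1)     # avoid division by zero for unseen senses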

Figure 2 includes comparative results for the approaches using dense and sparse contextualized embeddings derived from different layers of BERT. We can see that our approach yields considerable improvements over the application of dense embeddings. In fact, applying sparse contextualized embeddings provided significantly better results (p ≪ 0.01 using McNemar's test) irrespective of the choice of k when compared against the utilization of dense embeddings.

Additionally, the different choices for the dimension of the sparse word representations do not seem to play a decisive role, as illustrated by Figure 2 and also confirmed by our significance tests conducted between the sparse approaches using different values of k. Since the choice of k did not severely impact the results, we report our experiments for the k = 3000 case hereon.

Figure 2: Comparative results of relying on the dense and sparse word representations of different dimensions for WSD using the SemCor dataset for training.
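The reported significance tests can be reproduced, for example, with the statsmodels implementation of McNemar's test applied to paired per-token correctness indicators of the two systems. The helper below is an illustrative assumption, not part of the released code.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_p(correct_a, correct_b):
    """correct_a / correct_b: boolean arrays marking which test instances each system got right."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    table = [[np.sum(a & b),  np.sum(a & ~b)],     # 2x2 contingency table of agreements
             [np.sum(~a & b), np.sum(~a & ~b)]]    # and disagreements between the two systems
    return mcnemar(table, exact=False, correction=True).pvalue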

4.2.1 Increasing the amount of training data

We also measured the effects of increasing the amount of training data. We additionally used two sources of information for training, i.e. the WordNet synsets themselves and the Princeton WordNet Gloss Corpus (WNGC). The WordNet synsets were utilized in an identical fashion to the LMMS approach (Loureiro and Jorge, 2019), i.e. we determined a vectorial representation for each synset by taking the average of the contextual representations based on the concatenation of the definition and the lemmas belonging to the synset.

WNGC is a sense-annotated version of WordNet itself containing 117,659 definitions (one for each synset in WordNet), consisting of 1,634,691 tokens, out of which 614,435 have a corresponding sensekey attached. We obtained this data from the Unification of Sense Annotated Corpora (UFSAC) (Vial et al., 2018).

For these experiments our framework was kept intact; the only difference was that instead of solely relying on the sense-annotated training data included in SemCor, we additionally relied on the sense representations derived from the WordNet glosses and the sense annotations included in WNGC upon the determination of Φ and Ψ for the sparse and dense cases, respectively. We used the same set of semantic basis vectors D that we determined earlier for the case when we relied solely on SemCor as the source of sense-annotated data. Figure 3 includes our results when increasing the amount of sense-annotated training data. We can see that the additional training data consistently improves performance for both the dense and the sparse case. Figure 3 also demonstrates that our proposed method, when trained on the SemCor data alone, is capable of achieving the same or better performance as the approach which is based on dense contextual embeddings using all the available sources of training signal.

Figure 3: The effects of employing additional sources of information besides SemCor during training.

4.2.2 Ablation experiments

We gave a detailed description of our algorithm in Section 3.2. We now report the experiments that we conducted in order to see the contribution of the individual components of our algorithm. As mentioned in Section 3.2, determining the normalized positive PMI (npPMI) between the semantic bases and the elements of the sense inventory plays a central role in our algorithm.

In order to see the effects of normalizing and keeping only the positive PMI values, we evaluated three further *PMI-based variants for the calculation of Φ, i.e. we had

• vPMI, vanilla PMI without normalization or discarding of negative entries,

• pPMI, which discards negative PMI values but does not normalize them, and

• nPMI, which performs normalization, however does not discard negative PMI values.

Additionally, we evaluated a system which uses sparse contextualized word representations for determining Φ, however, does not involve the calculation of PMI scores at all. In that case we calculated a centroid for every synset, similar to the calculation of Ψ for the case of contextualized embeddings that are kept dense. The only difference is that for the approach we refer to as no PMI, we calculated the synset centroids based on the sparse contextualized word representations.

Figure 4 includes our results for the previously mentioned variants of our algorithm when relying on the different layers of BERT as input. Figure 4 highlights that calculating PMI is indeed a crucial step in our algorithm (cf. the no PMI and the *PMI results). We also tried to adapt the *PMI approaches to the dense contextual embeddings, but the results dropped severely in that case.

We can additionally observe that normalization has the largest impact on improving the results, as the performance of nPMI is at least 4 points better than that of vPMI for all layers. Not relying on negative PMI scores also had an overall positive effect (cf. vPMI and pPMI), which seems to be additive with normalization (cf. nPMI and npPMI).

Figure 4: Ablation experiments regarding the different strategies to calculate Φ using the combined (SemCor+WordNet+WNGC) training data.

8504 approach SensEval2 SensEval3 SemEval2007 SemEval2013 SemEval2015 ALL Most Frequent Sense (MFS) 66.8 66.2 55.2 63.0 67.8 65.2 IMS (Zhong and Ng, 2010) 70.9 69.3 61.3 65.3 69.5 68.4 IMS+emb-s (Iacobacci et al., 2016) 72.2 70.4 62.6 65.9 71.5 69.6 context2Vec (Melamud et al., 2016) 71.8 69.1 61.3 65.6 71.9 69.0 LMMS1024 (Loureiro and Jorge, 2019) 75.4 74.0 66.4 72.7 75.3 73.8 LMMS2348 (Loureiro and Jorge, 2019) 76.3 75.6 68.1 75.1 77.0 75.4 GlossBERT(Sent-CLS-WS) (Huang et al., 2019) 77.7 75.2 72.5 76.1 80.4 77.0 Ours (using SemCor) 77.6 76.8 68.4 73.4 76.5 75.7 Ours (using SemCor + WordNet) 77.9 77.8 68.8 76.1 77.5 76.8 Ours (using SemCor + WordNet + WNGC) 79.6 77.3 73.0 79.4 81.3 78.8

Table 1: Comparison with previous supervised results in terms of F measure computed by the official scorer provided in (Raganato et al., 2017a).

4.3 Evaluation towards POS tagging

In order to demonstrate the general applicability of our proposed algorithm, we evaluated it towards POS tagging using version 2.5 of Universal Dependencies. We conducted experiments over four different subcorpora in English, namely the EWT (Silveira et al., 2014), GUM (Zeldes, 2017), LinES (Ahrenberg, 2007) and ParTUT (Sanguinetti and Bosco, 2015) treebanks.

For these experiments, we used the same approach as before. We also used the same dictionary matrix D for obtaining the sparse word representations that we determined based on the SemCor dataset. The only difference for our POS tagging experiments is that this time the token-level labels were replaced by the POS tags of the individual tokens as opposed to their sense labels. This means that both Ψ and Φ had 17 columns, i.e. the number of distinct POS tags used in these treebanks.

Figure 5 reveals that the approach utilizing sparse contextualized word representations outperforms the one that is based on the adaptation of the LMMS approach to POS tagging by a fair margin, again irrespective of the layer of BERT that is used as input. A notable difference compared to the results obtained for all-words WSD is that for POS tagging the intermediate layers of BERT seem to deliver the most useful representations.

Figure 5: POS tagging results evaluated over the development set of four English UD v2.5 treebanks (EWT, GUM, LinES and ParTUT).

We used the development set of the individual treebanks for choosing the most promising layer of BERT to employ the different approaches over. For the npPMI approach we selected layers 13, 13, 14 and 11 for the EWT, GUM, LinES and ParTUT treebanks, respectively. As for the dense centroid based approach, we selected layer 6 for the ParTUT treebank and layer 13 for the rest of the treebanks. After doing so, our results for the test set of the four treebanks are reported in Table 2. Our approach delivered significant improvements for POS tagging as well, as indicated by the p-values of the McNemar test.

treebank | Centroid (Ψ) | npPMI (Φ) | p-value
EWT | 86.66 | 91.81 | 7e-193
GUM | 89.58 | 92.93 | 2e-63
LinES | 91.24 | 94.64 | 1e-87
ParTUT | 90.73 | 92.99 | 4e-7

Table 2: Comparison of the adaptation of the LMMS approach and ours on POS tagging over the test sets of four English UD v2.5 treebanks. The last column contains the p-value of the McNemar test comparing the different behavior of the two approaches.
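To make the adaptation concrete, the only change with respect to the WSD setup is the label inventory that enters Eq. (2), as sketched below. The CoNLL-U parsing is simplified, the file name is a placeholder, and the resulting label list simply replaces the sense annotations fed to the earlier illustrative sketches (so Φ and Ψ end up with 17 columns).

# The 17 universal POS tags of UD replace the WordNet synsets as the label set.
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]
TAG2ID = {tag: i for i, tag in enumerate(UPOS)}

def read_conllu_labels(path):
    """Return the per-token UPOS label ids of a UD treebank file."""
    labels = []
    for line in open(path, encoding="utf-8"):
        if line.strip() and not line.startswith("#"):
            columns = line.rstrip("\n").split("\t")
            if columns[0].isdigit():                   # skip multiword-token and empty-node lines
                labels.append(TAG2ID.get(columns[3]))  # column 4 of CoNLL-U holds the UPOS tag
    return labels

labels = read_conllu_labels("en_ewt-ud-train.conllu")  # placeholder treebank file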

5 Conclusions

In this paper we investigated how the application of sparse word representations obtained from contextualized word embeddings can provide a substantially increased ability for solving problems that require the distinction of fine-grained word senses. In our experiments, we managed to obtain solid results for multiple fine-grained word sense disambiguation benchmarks with the help of our information theory-inspired algorithm. We additionally carefully investigated the effects of increasing the amount of sense-annotated training data and the different design choices we made. We also demonstrated the general applicability of our approach by evaluating it in POS tagging. Our source code is made available at https://github.com/begab/sparsity_makes_sense.

Acknowledgments

This work was in part supported by the National Research, Development and Innovation Office of Hungary through the Artificial Intelligence National Excellence Program (grant no.: 2018-1.2.1-NKP-2018-00008).

References

Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pages 33–41, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lars Ahrenberg. 2007. LinES: An English-Swedish parallel treebank. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007), pages 270–273, Tartu, Estonia. University of Tartu, Estonia.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.

Gábor Berend. 2017. Sparse coding of neural word embeddings for multilingual sequence labeling. Transactions of the Association for Computational Linguistics, 5:247–261.

Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

G. Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. In From Form to Meaning: Processing Texts Automatically, Proceedings of the Biennial GSCL Conference 2009, pages 31–40, Tübingen.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Philip Edmonds and Scott Cotton. 2001. SENSEVAL-2: Overview. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL '01, pages 1–5, Stroudsburg, PA, USA. Association for Computational Linguistics.

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015. Sparse overcomplete word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1491–1500, Beijing, China. Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3509–3514, Hong Kong, China. Association for Computational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 897–907, Berlin, Germany. Association for Computational Linguistics.

Mikael Kågebäck and Hans Salomonsson. 2016. Word sense disambiguation using a bidirectional LSTM. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pages 51–56, Osaka, Japan. The COLING 2016 Organizing Committee.

Sawan Kumar, Sharmistha Jat, Karan Saxena, and Partha Talukdar. 2019. Zero-shot word sense disambiguation using sense definition embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5670–5681, Florence, Italy. Association for Computational Linguistics.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, SIGDOC '86, pages 24–26, New York, NY, USA. ACM.

Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. SenseBERT: Driving some sense into BERT. CoRR, abs/1908.05646.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691, Florence, Italy. Association for Computational Linguistics.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. 2009. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 689–696, New York, NY, USA. ACM.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6294–6305. Curran Associates, Inc.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61, Berlin, Germany. Association for Computational Linguistics.

Rada Mihalcea, Timothy Chklovski, and Adam Kilgarriff. 2004. The Senseval-3 English lexical sample task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 25–28, Barcelona, Spain. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.

Andrea Moro and Roberto Navigli. 2015. SemEval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 288–297, Denver, Colorado. Association for Computational Linguistics.

Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pages 1933–1950, Mumbai, India. The COLING 2012 Organizing Committee.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):10:1–10:69.

Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 task 12: Multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 222–231, Atlanta, Georgia, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Sameer S. Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task 17: English lexical sample, SRL and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 87–92, Stroudsburg, PA, USA. Association for Computational Linguistics.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017a. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110, Valencia, Spain. Association for Computational Linguistics.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017b. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1156–1167, Copenhagen, Denmark. Association for Computational Linguistics.

Philip Resnik. 1997a. A perspective on word sense disambiguation methods and their evaluation. In Tagging Text with Lexical Semantics: Why, What, and How?

Philip Resnik. 1997b. Selectional preference and sense disambiguation. In Tagging Text with Lexical Semantics: Why, What, and How?

Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin University parallel treebank. In Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, pages 51–69.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard H. Hovy. 2018. SPINE: sparse interpretable neural embeddings. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th inno- vative Applications of Artificial Intelligence (IAAI- 18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 4921–4928.

Andrew Trask, Phil Michalak, and John Liu. 2015. sense2vec - A fast and accurate method for word sense disambiguation in neural word embeddings. CoRR, abs/1511.06388.

Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2018. UFSAC: Unification of sense annotated corpora and tools. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense Disambiguation. In Global Wordnet Conference, Wrocław, Poland.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.

Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581–612.

Zhi Zhong and Hwee Tou Ng. 2010. It Makes Sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83, Uppsala, Sweden. Association for Computational Linguistics.
