Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation

Daniel Loureiro, Alípio Mário Jorge
LIAAD - INESC TEC
Faculty of Sciences - University of Porto, Portugal
[email protected], [email protected]

Abstract

Contextual embeddings represent a new generation of semantic representations learned from Neural Language Modelling (NLM) that addresses the issue of meaning conflation hampering traditional word embeddings. In this work, we show that contextual embeddings can be used to achieve unprecedented gains in Word Sense Disambiguation (WSD) tasks. Our approach focuses on creating sense-level embeddings with full-coverage of WordNet, and without recourse to explicit knowledge of sense distributions or task-specific modelling. As a result, a simple Nearest Neighbors (k-NN) method using our representations is able to consistently surpass the performance of previous systems using powerful neural sequencing models. We also analyse the robustness of our approach when ignoring part-of-speech and lemma features, requiring disambiguation against the full sense inventory, and revealing shortcomings to be improved. Finally, we explore applications of our sense embeddings for concept-level analyses of contextual embeddings and their respective NLMs.

1 Introduction

Word Sense Disambiguation (WSD) is a core task of Natural Language Processing (NLP) which consists in assigning the correct sense to a word in a given context, and has many potential applications (Navigli, 2009). Despite breakthroughs in distributed semantic representations (i.e. word embeddings), resolving lexical ambiguity has remained a long-standing challenge in the field. Systems using non-distributional features, such as It Makes Sense (IMS, Zhong and Ng, 2010), remain surprisingly competitive against neural sequence models trained end-to-end. A baseline that simply chooses the most frequent sense (MFS) has also proven to be notoriously difficult to surpass.

Several factors have contributed to this limited progress over the last decade, including the lack of standardized evaluation and the restricted amounts of sense-annotated corpora. Addressing the evaluation issue, Raganato et al. (2017a) introduced a unified evaluation framework that has already been adopted by the latest works in WSD. Also, even though SemCor (Miller et al., 1994) still remains the largest manually annotated corpus, supervised methods have successfully used label propagation (Yuan et al., 2016), semantic networks (Vial et al., 2018) and glosses (Luo et al., 2018b) in combination with annotations to advance the state-of-the-art. Meanwhile, task-specific sequence modelling architectures based on BiLSTMs or Seq2Seq (Raganato et al., 2017b) haven't yet proven as advantageous for WSD.

Until recently, the best semantic representations at our disposal, such as word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), were bound to word types (i.e. distinct tokens), converging information from different senses into the same representations (e.g. 'play song' and 'play tennis' share the same representation of 'play'). These word embeddings were learned from unsupervised Neural Language Modelling (NLM) trained on fixed-length contexts. However, by recasting the same word types across different sense-inducing contexts, these representations became insensitive to the different senses of polysemous words. Camacho-Collados and Pilehvar (2018) refer to this issue as the meaning conflation deficiency and explore it more thoroughly in their work.
Recent improvements to NLM have allowed for learning representations that are context-specific and detached from word types. While word embedding methods reduced NLMs to fixed representations after pretraining, this new generation of contextual embeddings employs the pretrained NLM to infer different representations induced by arbitrarily long contexts. Contextual embeddings have already had a major impact on the field, driving progress on numerous downstream tasks. This success has also motivated a number of iterations on embedding models in a short timespan, from context2vec (Melamud et al., 2016), to GPT (Radford et al., 2018), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2019).

Being context-sensitive by design, contextual embeddings are particularly well-suited for WSD. In fact, Melamud et al. (2016) and Peters et al. (2018) produced contextual embeddings from the SemCor dataset and showed competitive results on Raganato et al. (2017a)'s WSD evaluation framework, with a surprisingly simple approach based on Nearest Neighbors (k-NN). These results were promising, but those works only produced sense embeddings for the small fraction of WordNet (Fellbaum, 1998) senses covered by SemCor, resorting to the MFS approach for a large number of instances. Lack of high-coverage annotations is one of the most pressing issues for supervised WSD approaches (Le et al., 2018).

Our experiments show that the simple k-NN w/MFS approach using BERT embeddings suffices to surpass the performance of all previous systems. Most importantly, in this work we introduce a method for generating sense embeddings with full-coverage of WordNet, which further improves results (additional 1.9% F1) while forgoing MFS fallbacks. To better evaluate the fitness of our sense embeddings, we also analyse their performance without access to the lemma or part-of-speech features typically used to restrict candidate senses.
Representing sense embeddings in the same space as any contextual embeddings generated from the same pretrained NLM eases introspections of those NLMs, and enables token-level intrinsic evaluations based on k-NN WSD performance. We summarize our contributions¹ below:

• A method for creating sense embeddings for all senses in WordNet, allowing for WSD based on k-NN without MFS fallbacks.

• Major improvement over the state-of-the-art on cross-domain WSD tasks, while exploring the strengths and weaknesses of our method.

• Applications of our sense embeddings for concept-level analyses of NLMs.

¹ Code and data: github.com/danlou/lmms

2 Language Modelling Representations

Distributional semantic representations learned from Unsupervised Neural Language Modelling (NLM) are currently used for most NLP tasks. In this section we cover aspects of word and contextual embeddings, learned from NLMs, that are particularly relevant for our work.

2.1 Static Word Embeddings

Word embeddings are distributional semantic representations usually learned from NLM under one of two possible objectives: predict context words given a target word (Skip-Gram), or the inverse (CBOW) (word2vec, Mikolov et al., 2013). In both cases, context corresponds to a fixed-length window sliding over tokenized text, with the target word at the center. These modelling objectives are enough to produce dense vector-based representations of words that are widely used as powerful initializations on neural modelling architectures for NLP. As we explained in the introduction, word embeddings are limited by meaning conflation around word types, and reduce NLM to fixed representations that are insensitive to contexts. However, with fastText (Bojanowski et al., 2017) we're not restricted to a finite set of representations and can compositionally derive representations for word types unseen during training.

2.2 Contextual Embeddings

The key differentiation of contextual embeddings is that they are context-sensitive, allowing the same word types to be represented differently according to the contexts in which they occur. In order to be able to produce new representations induced by different contexts, contextual embeddings employ the pretrained NLM for inference. Also, the NLM objective for contextual embeddings is usually directional, predicting the previous and/or next tokens in arbitrarily long contexts (usually sentences). ELMo (Peters et al., 2018) was the first implementation of contextual embeddings to gain wide adoption, but it was shortly after followed by BERT (Devlin et al., 2019), which achieved new state-of-the-art results on 11 NLP tasks. Interestingly, BERT's impressive results were obtained from task-specific fine-tuning of pretrained NLMs, instead of using them as features in more complex models, emphasizing the quality of these representations.

3 Word Sense Disambiguation (WSD)

There are several lines of research exploring different approaches for WSD (Navigli, 2009). Supervised methods have traditionally performed best, though this distinction is becoming increasingly blurred as works in supervised WSD start exploiting resources used by knowledge-based approaches (e.g. Luo et al., 2018a; Vial et al., 2018). We relate our work to the best-performing WSD methods, regardless of approach, as well as methods that may not perform as well but involve producing sense embeddings. In this section we introduce the components and related works that are most relevant for our approach.

3.1 Sense Inventory, Attributes and Relations

The most popular sense inventory is WordNet, a lexical database of general domain concepts linked by a few relations, such as synonymy and hypernymy. WordNet is organized at different abstraction levels, which we describe below. Following the notation used in related works, we represent the main structure of WordNet, called synset, with lemma^#_POS, where lemma corresponds to the canonical form of a word, POS corresponds to the sense's part-of-speech (noun, verb, adjective or adverb), and # further specifies this entry.

• Synsets: groups of synonymous words that correspond to the same sense, e.g. dog^1_n.

• Lemmas: canonical forms of words, which may belong to multiple synsets, e.g. dog is a lemma for dog^1_n and chase^1_v, among others.

• Senses: lemmas specified by sense (i.e. sensekeys), e.g. dog%1:05:00:: and domestic_dog%1:05:00:: are senses of dog^1_n.

Each synset has a number of attributes, of which the most relevant for this work are:

• Glosses: dictionary definitions, e.g. dog^1_n has the definition 'a member of the genus Ca...'.

• Hypernyms: 'type of' relations between synsets, e.g. dog^1_n is a hypernym of pug^1_n.

• Lexnames: syntactical and logical groupings, e.g. the lexname for dog^1_n is noun.animal.

In this work we're using WordNet 3.0, which contains 117,659 synsets, 206,949 unique senses, 147,306 lemmas, and 45 lexnames.
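These structures and attributes can be inspected programmatically; the snippet below is a minimal sketch using NLTK's WordNet 3.0 interface (an assumption of this illustration, not part of the original method):

```python
from nltk.corpus import wordnet as wn  # assumes NLTK with the WordNet 3.0 corpus installed

dog = wn.synset('dog.n.01')                 # the synset dog^1_n
print([l.name() for l in dog.lemmas()])     # lemmas grouped in this synset, e.g. 'dog', 'domestic_dog'
print([l.key() for l in dog.lemmas()])      # sensekeys, e.g. 'dog%1:05:00::'
print(dog.definition())                     # gloss: 'a member of the genus Canis ...'
print(dog.hypernyms())                      # 'type of' relations to more general synsets
print(dog.lexname())                        # lexname grouping: 'noun.animal'
print(len(list(wn.all_synsets())))          # 117,659 synsets in WordNet 3.0
```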
3.2 WSD State-of-the-Art

While non-distributional methods, such as Zhong and Ng (2010)'s IMS, still perform competitively, there have been several noteworthy advancements in the last decade using distributional representations from NLMs. Iacobacci et al. (2016) improved on IMS's performance by introducing word embeddings as additional features.

Yuan et al. (2016) achieved significantly improved results by leveraging massive corpora to train a NLM based on an LSTM architecture. This work is contemporaneous with Melamud et al. (2016), and also uses a very similar approach for generating sense embeddings and relying on k-NN w/MFS for predictions. Although most performance gains stemmed from their powerful NLM, they also introduced a label propagation method that further improved results in some cases. Curiously, the objective Yuan et al. (2016) used for NLM (predicting held-out words) is very evocative of the cloze-style Masked Language Model introduced by Devlin et al. (2019). Le et al. (2018) replicated this work and offers additional insights.

Raganato et al. (2017b) trained neural sequencing models for end-to-end WSD. This work reframes WSD as a translation task where sequences of words are translated into sequences of senses. The best result was obtained with a BiLSTM trained with auxiliary losses specific to parts-of-speech and lexnames. Despite the sophisticated modelling architecture, it still performed on par with Iacobacci et al. (2016).

The works of Melamud et al. (2016) and Peters et al. (2018) using contextual embeddings for WSD showed the potential of these representations, but still performed comparably to IMS.

Addressing the issue of scarce annotations, recent works have proposed methods for using resources from knowledge-based approaches. Luo et al. (2018a) and Luo et al. (2018b) combine information from glosses present in WordNet with NLMs based on BiLSTMs, through memory networks and co-attention mechanisms, respectively. Vial et al. (2018) follows Raganato et al. (2017b)'s BiLSTM method, but leverages the semantic network to strategically reduce the set of senses required for disambiguating words.

All of these works rely on MFS fallback. Additionally, to our knowledge, all also perform disambiguation only against the set of admissible senses given the word's lemma and part-of-speech.

3.3 Other Methods with Sense Embeddings

Some works may no longer be competitive with the state-of-the-art, but nevertheless remain relevant for the development of sense embeddings. We recommend the recent survey of Camacho-Collados and Pilehvar (2018) for a thorough overview of this topic, and highlight a few of the most relevant methods. Chen et al. (2014) initializes sense embeddings using glosses and adapts the Skip-Gram objective of word2vec to learn and improve sense embeddings jointly with word embeddings. Rothe and Schütze (2015)'s AutoExtend method uses pretrained word2vec embeddings to compose sense embeddings from sets of synonymous words. Camacho-Collados et al. (2016) creates the NASARI sense embeddings using structural knowledge from large multilingual semantic networks.

These methods represent sense embeddings in the same space as the pretrained word embeddings; however, being based on fixed embedding spaces, they are much more limited in their ability to generate contextual representations to match against. Furthermore, none of these methods (or those in §3.2) achieve full-coverage of the +200K senses in WordNet.

4 Method

Our WSD approach is strictly based on k-NN (see Figure 1), unlike any of the works referred previously. We avoid relying on MFS for lemmas that do not occur in annotated corpora by generating sense embeddings with full-coverage of WordNet. Our method starts by generating sense embeddings from annotations, as done by other works, and then introduces several enhancements towards full-coverage, better performance and increased robustness. In this section, we cover each of these techniques.

Figure 1: Illustration of our k-NN approach for WSD, which relies on full-coverage sense embeddings represented in the same space as contextualized embeddings. For simplification, we label senses as synsets. Grey nodes belong to different lemmas (see §5.3).

4.1 Embeddings from Annotations

Our set of full-coverage sense embeddings is bootstrapped from sense-annotated corpora. Sentences containing sense-annotated tokens (or spans) are processed by a NLM in order to obtain contextual embeddings for those tokens.
After collecting all sense-labeled contextual embeddings, each sense embedding is determined by averaging its corresponding contextual embeddings. Formally, given n contextual embeddings \vec{c} for some sense s:

\vec{v}_s = \frac{1}{n} \sum_{i=1}^{n} \vec{c}_i, \quad dim(\vec{v}_s) = 1024

In this work we use pretrained ELMo and BERT models to generate contextual embeddings. These models can be identified and replicated with the following details:

• ELMo: 1024 (2x512) embedding dimensions, 93.6M parameters. Embeddings from the top layer (2).

• BERT: 1024 embedding dimensions, 340M parameters, cased. Embeddings from the sum of the top 4 layers ([-1,-4]).²

² This was the configuration that performed best out of the ones in Table 7 of Devlin et al. (2018).

BERT uses WordPiece tokenization that doesn't always map to token-level annotations (e.g. 'multiplication' becomes 'multi', '##plication'). We use the average of subtoken embeddings as the token-level embedding. Unless specified otherwise, our LMMS method uses BERT.
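As a concrete reference for this bootstrapping step, the sketch below averages sense-labeled contextual embeddings into sense vectors. The corpus iterator and the nlm_embed helper are assumptions of this illustration (any NLM producing one 1024-d vector per token, with subtokens already averaged, would fit).

```python
import numpy as np
from collections import defaultdict

def senses_from_annotations(corpus, nlm_embed):
    """Bootstrap sense embeddings from an annotated corpus (e.g. SemCor).

    `corpus` is assumed to yield (tokens, {token_index: sensekey}) pairs, and
    `nlm_embed` is an assumed helper returning one contextual vector per token
    (for BERT: sum of the top 4 layers, WordPiece subtokens averaged)."""
    pooled = defaultdict(list)
    for tokens, annotations in corpus:
        vectors = nlm_embed(tokens)
        for idx, sensekey in annotations.items():
            pooled[sensekey].append(vectors[idx])
    # each sense embedding is the average of its labeled contextual embeddings
    return {sk: np.mean(vecs, axis=0) for sk, vecs in pooled.items()}
```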

4.2 Extending Annotation Coverage

As many have emphasized before (Navigli, 2009; Camacho-Collados and Pilehvar, 2018; Le et al., 2018), the lack of sense annotations is a major limitation of supervised approaches for WSD. We address this issue by taking advantage of the semantic relations in WordNet to extend the annotated signal to other senses. Semantic networks are often explored by knowledge-based approaches, and by some recent works in supervised approaches as well (Luo et al., 2018a; Vial et al., 2018).
All lower abstraction representations the sense, alongside the synset’s lemmas and gloss are created before next-level abstractions to ensure words. The result is a sense-level embedding, de- that higher abstractions make use of lower gener- termined without annotations, that is represented alizations. More formally, given all missing senses in the same space as the sense embeddings we de- in WordNet sˆ W , their synset-specific sense ∈ scribed in the previous section, and can be triv- embeddings Ssˆ, hypernym-specific synset embed- ially combined through concatenation or average dings Hsˆ, and lexname-specific synset embed- for improved performance (see Table2). dings Lsˆ, the procedure has the following stages: Our empirical results show improved perfor- 1 P mance by concatenation, which we attribute (1) if Ssˆ > 0, ~vsˆ = |S | ~vs, ~vs Ssˆ | | sˆ ∀ ∈ to preserving complementary information from 1 P glosses. Both averaging and concatenating repre- (2) if Hsˆ > 0, ~vsˆ = ~vsyn, ~vsyn Hsˆ |Hsˆ| | | ∀ ∈ sentations (previously L2 normalized) also serves 1 P to smooth possible biases that may have been (3) if Lsˆ > 0, ~vsˆ = ~vsyn, ~vsyn Lsˆ | | |Lsˆ| ∀ ∈ learned from the SemCor annotations. Note that In Table1 we show how much coverage extends while concatenation effectively doubles the size of while improving both recall and precision. our embeddings, this doesn’t equal doubling the expressiveness of the distributional space, since F1 / P / R (without MFS) they’re two representations from the same NLM. Source Coverage BERT ELMo This property also allows us to make predic- SemCor 16.11% 68.9 / 72.4 / 65.7 63.0 / 66.2 / 60.1 tions for contextual embeddings (from the same + synset 26.97% 70.0 / 72.6 / 70.0 63.9 / 66.3 / 61.7 NLM) by simply repeating those embeddings + hypernym 74.70% 73.0 / 73.6 / 72.4 67.2 / 67.7 / 66.6 twice, aligning contextual features against sense + lexname 100% 73.8 / 73.8 / 73.8 68.1 / 68.1 / 68.1 and dictionary features when computing cosine similarity. Thus, our sense embeddings become: Table 1: Coverage of WordNet when extending to in-   creasingly abstract representations along with perfor- ~vs 2 mance on the ALL test set of Raganato et al.(2017a). ~vs = || || , dim(~vs) = 2048 ~vd 2 || || Configurations LMMS1024 LMMS2048 LMMS2348 Embeddings Contextual (d=1024)      Dictionary (d=1024)      Static (d=300)    Operation Average  Concatenation     Perf. (F1 on ALL) Lemma & POS 73.8 58.7 75.0 75.4 73.9 58.7 75.4 Token (Uninformed) 42.7 6.1 36.5 35.1 64.4 45.0 66.0

4.3 Improving Senses using the Dictionary

There's a long tradition of using glosses for WSD, perhaps starting with the popular work of Lesk (1986), which has since been adapted to use distributional representations (Basile et al., 2014). As a sequence of words, the information contained in glosses can be easily represented in semantic spaces through approaches used for generating sentence embeddings. There are many methods for generating sentence embeddings, but it's been shown that a simple weighted average of word embeddings performs well (Arora et al., 2017).

Our contextual embeddings are produced from NLMs using attention mechanisms, assigning more importance to some tokens over others, so they already come 'pre-weighted' and we embed glosses simply as the average of all of their contextual embeddings (without preprocessing). We've also found that introducing synset lemmas alongside the words in the gloss helps induce better contextualized embeddings (especially when glosses are short). Finally, we make our dictionary embeddings (\vec{v}_d) sense-specific, rather than synset-specific, by repeating the lemma that's specific to the sense alongside the synset's lemmas and gloss words. The result is a sense-level embedding, determined without annotations, that is represented in the same space as the sense embeddings we described in the previous section, and can be trivially combined through concatenation or average for improved performance (see Table 2).

Our empirical results show improved performance by concatenation, which we attribute to preserving complementary information from glosses. Both averaging and concatenating representations (previously L2 normalized) also serve to smooth possible biases that may have been learned from the SemCor annotations. Note that while concatenation effectively doubles the size of our embeddings, this doesn't equal doubling the expressiveness of the distributional space, since they're two representations from the same NLM. This property also allows us to make predictions for contextual embeddings (from the same NLM) by simply repeating those embeddings twice, aligning contextual features against sense and dictionary features when computing cosine similarity. Thus, our sense embeddings become:

\vec{v}_s = \begin{bmatrix} ||\vec{v}_s||_2 \\ ||\vec{v}_d||_2 \end{bmatrix}, \quad dim(\vec{v}_s) = 2048

Configurations: LMMS1024 / LMMS2048 / LMMS2348
Embeddings: Contextual (d=1024) / Dictionary (d=1024) / Static (d=300)
Operation: Average / Concatenation
Perf. (F1 on ALL):
  Lemma & POS: 73.8 | 58.7 | 75.0 | 75.4 | 73.9 | 58.7 | 75.4
  Token (Uninformed): 42.7 | 6.1 | 36.5 | 35.1 | 64.4 | 45.0 | 66.0

Table 2: Overview of the performance of various setups regarding choice of embeddings and combination strategy. All results are for the 1-NN approach on the ALL test set of Raganato et al. (2017a). We also show results that ignore the lemma and part-of-speech features of the test sets, showing that the inclusion of static embeddings makes the method significantly more robust to real-world scenarios where such gold features may not be available.
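A minimal sketch of the gloss-based dictionary vectors and their combination with the annotation-based vectors, under the same assumed nlm_embed helper as before and NLTK's WordNet interface:

```python
import numpy as np
from nltk.corpus import wordnet as wn

def l2(v):
    return v / np.linalg.norm(v)

def dictionary_vector(lemma, nlm_embed):
    """Sense-specific gloss embedding: the sense's own lemma (repeated), the
    synset's lemmas and the gloss words are embedded by the NLM and averaged.
    Whitespace splitting of the gloss is a simplification of this sketch."""
    synset = lemma.synset()
    tokens = [lemma.name()] + [l.name() for l in synset.lemmas()] + synset.definition().split()
    return np.mean(nlm_embed(tokens), axis=0)

def lmms_2048(v_sense, v_dict):
    """Concatenate the L2-normalised annotation-based and gloss-based vectors."""
    return np.concatenate([l2(v_sense), l2(v_dict)])
```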

4.4 Morphological Robustness

WSD is expected to be performed only against the set of candidate senses that are specific to a target word's lemma. However, as we'll explain in §5.3, there are cases where it's undesirable to restrict the WSD process.

We leverage word embeddings specialized for morphological representations to make our sense embeddings more resilient to the absence of lemma features, achieving increased robustness. This addresses a problem arising from the susceptibility of contextual embeddings to become entirely detached from the morphology of their corresponding tokens, due to interactions with other tokens in the sentence.

We choose fastText (Bojanowski et al., 2017) embeddings (pretrained on CommonCrawl), which are biased towards morphology, and avoid Out-of-Vocabulary issues as explained in §2.1. We use fastText to generate static word embeddings for the lemmas (\vec{v}_l) corresponding to all senses, and concatenate these word embeddings to our previous embeddings. When making predictions, we also compute fastText embeddings for tokens, allowing for the same alignment explained in the previous section. This technique effectively makes sense embeddings of morphologically related lemmas more similar. Empirical results (see Table 2) show that introducing these static embeddings is crucial for achieving satisfactory performance when not filtering candidate senses. Our final, most robust, sense embeddings are thus:

\vec{v}_s = \begin{bmatrix} ||\vec{v}_s||_2 \\ ||\vec{v}_d||_2 \\ ||\vec{v}_l||_2 \end{bmatrix}, \quad dim(\vec{v}_s) = 2348
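To make the alignment concrete, the following sketch assembles the final 2348-d sense vectors and performs the 1-NN prediction, repeating the token's contextual embedding for the sense and dictionary blocks and appending its fastText embedding for the static block. The sense matrix, key list and candidate filtering are assumptions of this illustration.

```python
import numpy as np

def l2(v):
    return v / np.linalg.norm(v)

def lmms_2348(v_ctx, v_dict, v_lemma):
    """Final sense vector: annotation-based (1024) + gloss-based (1024) + fastText lemma (300)."""
    return np.concatenate([l2(v_ctx), l2(v_dict), l2(v_lemma)])

def query_2348(v_token_ctx, v_token_fasttext):
    """Test-time token vector aligned with the sense space: the contextual embedding
    is repeated for the sense and dictionary blocks; the fastText embedding of the
    surface token fills the static block."""
    return np.concatenate([l2(v_token_ctx), l2(v_token_ctx), l2(v_token_fasttext)])

def nearest_sense(query, sense_matrix, sense_keys, candidates=None):
    """1-NN by cosine similarity. `candidates` is the lemma/POS-restricted sense set
    for standard WSD, or None for Uninformed Sense Matching (see Sec. 5.3)."""
    sims = sense_matrix @ query
    sims /= np.linalg.norm(sense_matrix, axis=1) * np.linalg.norm(query)
    for idx in np.argsort(-sims):
        if candidates is None or sense_keys[idx] in candidates:
            return sense_keys[idx]
```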

5 Experiments

Our experiments centered on evaluating our solution on Raganato et al. (2017a)'s set of cross-domain WSD tasks. In this section we compare our results to the current state-of-the-art, and provide results for our solution when disambiguating against the full set of possible senses in WordNet, revealing shortcomings to be improved.

5.1 All-Words Disambiguation

In Table 3 we show our results for all tasks of Raganato et al. (2017a)'s evaluation framework. We used the framework's scoring scripts to avoid any discrepancies in the scoring methodology. Note that the k-NN referred to in Table 3 always uses the closest neighbor, and relies on MFS fallbacks.

The first noteworthy result we obtained was that by simply replicating Peters et al. (2018)'s method for WSD using BERT instead of ELMo, we were able to significantly, and consistently, surpass the performance of all previous works. When using our method (LMMS), performance still improves significantly over these already impressive results (+1.9 F1 on ALL, +3.4 F1 on SemEval 2013). Interestingly, we found that our method using ELMo embeddings didn't outperform ELMo k-NN with MFS fallback, suggesting that it's necessary to achieve a minimum competence level of embeddings from sense annotations (and glosses) before the inferred sense embeddings become more useful than MFS.

In Figure 2 we show results when considering additional neighbors as valid predictions, together with a random baseline considering that some target words may have fewer senses than the number of accepted neighbors (always correct).

| Model | Senseval2 (n=2,282) | Senseval3 (n=1,850) | SemEval2007 (n=455) | SemEval2013 (n=1,644) | SemEval2015 (n=1,022) | ALL (n=7,253) |
| MFS† (Most Frequent Sense) | 65.6 | 66.0 | 54.5 | 63.8 | 67.1 | 64.8 |
| IMS† (2010) | 70.9 | 69.3 | 61.3 | 65.3 | 69.5 | 68.4 |
| IMS + embeddings† (2016) | 72.2 | 70.4 | 62.6 | 65.9 | 71.5 | 69.6 |
| context2vec k-NN† (2016) | 71.8 | 69.1 | 61.3 | 65.6 | 71.9 | 69.0 |
| word2vec k-NN (2016) | 67.8 | 62.1 | 58.5 | 66.1 | 66.7 | - |
| LSTM-LP (Label Prop.) (2016) | 73.8 | 71.8 | 63.5 | 69.5 | 72.6 | - |
| Seq2Seq (Task Modelling) (2017b) | 70.1 | 68.5 | 63.1* | 66.5 | 69.2 | 68.6* |
| BiLSTM (Task Modelling) (2017b) | 72.0 | 69.1 | 64.8* | 66.9 | 71.5 | 69.9* |
| ELMo k-NN (2018) | 71.5 | 67.5 | 57.1 | 65.3 | 69.9 | 67.9 |
| HCAN (Hier. Co-Attention) (2018a) | 72.8 | 70.3 | -* | 68.5 | 72.8 | -* |
| BiLSTM w/Vocab. Reduction (2018) | 72.6 | 70.4 | 61.5 | 70.8 | 71.3 | 70.8 |
| BERT k-NN | 76.3 | 73.2 | 66.2 | 71.7 | 74.1 | 73.5 |
| LMMS2348 (ELMo) | 68.1 | 64.7 | 53.8 | 66.9 | 69.0 | 66.2 |
| LMMS2348 (BERT) | 76.3 | 75.6 | 68.1 | 75.1 | 77.0 | 75.4 |

Table 3: Comparison with other works on the test sets of Raganato et al. (2017a). All works used sense annotations from SemCor as supervision, although often with different pretrained embeddings. † - reproduced from Raganato et al. (2017a); * - used as a development set; bold - new state-of-the-art (SOTA); underlined - previous SOTA.

Figure 2: Performance gains with LMMS2348 when accepting additional neighbors as valid predictions (F1 on ALL against the number of accepted neighbors, 1-5, for LMMS (WSD), LMMS (USM) and RAND (WSD)).

5.2 Part-of-Speech Mismatches

The solution we introduced in §4.4 addressed missing lemmas, but we didn't propose a solution that addresses missing POS information. Indeed, the confusion matrix in Table 4 shows that a large number of target words corresponding to verbs are wrongly assigned senses that correspond to adjectives or nouns. We believe this result can help motivate the design of new NLM tasks that are more capable of distinguishing between verbs and non-verbs.

| WN-POS | NOUN | VERB | ADJ | ADV |
| NOUN | 96.95% | 1.86% | 0.86% | 0.33% |
| VERB | 9.08% | 70.82% | 19.98% | 0.12% |
| ADJ | 4.50% | 0% | 92.27% | 2.93% |
| ADV | 2.02% | 0.29% | 2.60% | 95.09% |

Table 4: POS confusion matrix for Uninformed Sense Matching on the ALL test set using LMMS2348.

5.3 Uninformed Sense Matching

WSD tasks are usually accompanied by auxiliary parts-of-speech (POSs) and lemma features for restricting the number of possible senses to those that are specific to a given lemma and POS. Even if those features aren't provided (e.g. in real-world applications), it's sensible to use lemmatizers or POS taggers to extract them for use in WSD. However, as is the case with using MFS fallbacks, this filtering step obscures the true impact of NLM representations on k-NN solutions.

Consequently, we introduce a variation on WSD, called Uninformed Sense Matching (USM), where disambiguation is always performed against the full set of sense embeddings (i.e. +200K candidates vs. a maximum of 59). This change makes the task much harder (results in Table 2), but offers some insights into NLMs, which we cover briefly in §5.4.

5.4 Use of World Knowledge

It's well known that WSD relies on various types of knowledge, including commonsense and selectional preferences (Lenat et al., 1986; Resnik, 1997), for example. Using our sense embeddings for Uninformed Sense Matching allows us to glimpse into how NLMs may be interpreting contextual information with regards to the knowledge represented in WordNet. In Table 5 we show a few examples of senses matched at the token-level, suggesting that entities were topically understood and this information was useful to disambiguate verbs. These results would be less conclusive without full-coverage of WordNet.

Marlon? Brando? played Corleone? in Godfather?
person^1_n | person^1_n | act^3_v | syndicate^1_n | movie^1_n | location^1_n
womanizer^1_n | group^1_n | make^42_v | mafia^1_n | telefilm^1_n | here^1_n
bustle^1_n | location^1_n | emote^1_v | person^1_n | final_cut^1_n | there^1_n
act^3_v: play a role or part; make^42_v: represent fictitiously, as in a play, or pretend to be or act like; emote^1_v: give expression or emotion to, in a stage or movie role.

Serena? Williams played Kerber? in Wimbledon?
person^1_n | professional_tennis^1_n | play^1_v | person^1_n | win^1_v | tournament^1_n
therefore^1_r | tennis^1_n | line_up^6_v | group^1_n | romp^3_v | world_cup^1_n
reef^1_n | singles^1_n | curl^5_v | take_orders^2_v | carry^38_v | elimination_tournament^1_n
play^1_v: participate in games or sport; line_up^6_v: take one's position before a kick-off; curl^5_v: play the Scottish game of curling.

David Bowie? played Warszawa? in Tokyo
person^1_n | person^1_n | play^14_v | poland^1_n | originate_in^1_v | tokyo^1_n
amati^2_n | folk_song^1_n | play^6_v | location^1_n | in^1_r | japan^1_n
guarnerius^3_n | fado^1_n | riff^2_v | here^1_n | take_the_field^2_v | japanese^1_a
play^14_v: perform on a certain location; play^6_v: replay (as a melody); riff^2_v: play riffs.

Table 5: Examples controlled for syntactic changes to show how the correct sense for 'played' can be induced in accordance with the mentioned entities, suggesting that disambiguation is supported by world knowledge learned during LM pretraining. Words marked with ? never occurred in SemCor. The senses shown correspond to the top 3 matches in LMMS1024 for each token's contextual embedding (uninformed). For clarification, below each set of matches are the WordNet definitions for the top disambiguated senses of 'played'.

6 Other Applications

Analyses of conventional word embeddings have revealed gender or stereotype biases (Bolukbasi et al., 2016; Caliskan et al., 2017) that may have unintended consequences in downstream applications. With contextual embeddings we don't have sets of concept-level representations for performing similar analyses. Word representations can naturally be derived from averaging their contextual embeddings occurring in corpora, but then we're back to the meaning conflation issue described earlier. We believe that our sense embeddings can be used as representations for more easily making such analyses of NLMs. In Figure 3 we provide an example that showcases meaningful differences in gender bias, including for lemmas shared by different senses (doctor: PhD vs. medic, and counselor: therapist vs. summer camp supervisor). The bias score for a given synset s was calculated as follows:

bias(s) = sim(\vec{v}_{man^1_n}, \vec{v}_s) - sim(\vec{v}_{woman^1_n}, \vec{v}_s)

Figure 3: Examples of gender bias found in the sense vectors (scores for LMMS1024 and LMMS2048, roughly between -0.05 and 0.05, for doctor^4_n, programmer^1_n, counselor^2_n, doctor^1_n, teacher^1_n, florist^1_n, counselor^1_n, receptionist^1_n and nurse^1_n). Positive values quantify bias towards man^1_n, while negative values quantify bias towards woman^1_n.
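For reference, the bias score above can be computed directly from the sense embeddings; a minimal sketch, assuming a dictionary of comparable sense vectors keyed by synset identifiers (the 'man.n.01'/'woman.n.01' keys are an assumption of this illustration):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gender_bias(sense_vecs, synset_id):
    """bias(s) = sim(v_man, v_s) - sim(v_woman, v_s); positive values indicate
    bias towards man^1_n, negative values towards woman^1_n."""
    v_s = sense_vecs[synset_id]
    return cosine(sense_vecs['man.n.01'], v_s) - cosine(sense_vecs['woman.n.01'], v_s)
```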
In Proceedings of the on WSD for matching contextual embeddings to 2014 Conference on Empirical Methods in Natural all WordNet senses, offering a better understand- Language Processing (EMNLP), pages 1025–1035, ing of the strengths and weaknesses of representa- Doha, Qatar. Association for Computational Lin- tions from NLM. Finally, we explore applications guistics. of our sense embeddings beyond WSD, such as Jacob Devlin, Ming-Wei Chang, Kenton Lee, and gender bias analyses. Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language under- standing. CoRR, abs/1810.04805v1. 9 Acknowledgements Jacob Devlin, Ming-Wei Chang, Kenton Lee, and This work is financed by National Funds through Kristina Toutanova. 2019. BERT: Pre-training of the Portuguese funding agency, FCT - Fundac¸ao˜ deep bidirectional transformers for language under- para a Cienciaˆ e a Tecnologia within project: standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association UID/EEA/50014/2019. for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Associ- References ation for Computational Linguistics. Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. Christiane Fellbaum. 1998. In WordNet : an electronic A simple but tough-to-beat baseline for sentence em- lexical database. MIT Press. beddings. In International Conference on Learning Ignacio Iacobacci, Mohammad Taher Pilehvar, and Representations (ICLR). Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceed- Pierpaolo Basile, Annalina Caputo, and Giovanni Se- ings of the 54th Annual Meeting of the Association meraro. 2014. An enhanced Lesk word sense dis- for Computational Linguistics (Volume 1: Long Pa- ambiguation algorithm through a distributional se- pers), pages 897–907, Berlin, Germany. Association mantic model. In Proceedings of COLING 2014, for Computational Linguistics. the 25th International Conference on Computational Linguistics: Technical Papers, pages 1591–1600, Minh Le, Marten Postma, Jacopo Urbani, and Piek Dublin, Ireland. Dublin City University and Asso- Vossen. 2018. A deep dive into word sense dis- ciation for Computational Linguistics. ambiguation with LSTM. In Proceedings of the 27th International Conference on Computational Piotr Bojanowski, Edouard Grave, Armand Joulin, and Linguistics, pages 354–365, Santa Fe, New Mexico, Tomas Mikolov. 2017. Enriching word vectors with USA. Association for Computational Linguistics. subword information. Transactions of the Associa- tion for Computational Linguistics, 5:135–146. Doug Lenat, Mayank Prakash, and Mary Shepherd. 1986. Cyc: Using common sense knowledge to Tolga Bolukbasi, Kai-Wei Chang, James Zou, overcome brittleness and knowledge acquistion bot- Venkatesh Saligrama, and Adam Kalai. 2016. Man tlenecks. AI Mag., 6(4):65–85. is to computer programmer as woman is to home- Michael Lesk. 1986. Automatic sense disambiguation maker? debiasing word embeddings. In Proceed- using machine readable dictionaries: How to tell a ings of the 30th International Conference on Neu- pine cone from an ice cream cone. In Proceedings of ral Information Processing Systems, NIPS’16, pages the 5th Annual International Conference on Systems 4356–4364, USA. Curran Associates Inc. Documentation, SIGDOC ’86, pages 24–26, New York, NY, USA. ACM. Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. 
Semantics derived automatically Daniel Loureiro and Al´ıpio Mario´ Jorge. 2019. Liaad from language corpora contain human-like biases. at semdeep-5 challenge: Word-in-context (wic). In Science, 356(6334):183–186. SemDeep-5@IJCAI 2019, page forthcoming. Fuli Luo, Tianyu Liu, Zexue He, Qiaolin Xia, Zhi- Alessandro Raganato, Claudio Delli Bovi, and Roberto fang Sui, and Baobao Chang. 2018a. Leveraging Navigli. 2017b. Neural sequence learning mod- gloss knowledge in neural word sense disambigua- els for word sense disambiguation. In Proceed- tion by hierarchical co-attention. In Proceedings of ings of the 2017 Conference on Empirical Methods the 2018 Conference on Empirical Methods in Nat- in Natural Language Processing, pages 1156–1167, ural Language Processing, pages 1402–1411, Brus- Copenhagen, Denmark. Association for Computa- sels, Belgium. Association for Computational Lin- tional Linguistics. guistics. Philip Resnik. 1997. Selectional preference and sense Fuli Luo, Tianyu Liu, Qiaolin Xia, Baobao Chang, and disambiguation. In Tagging Text with Lexical Se- Zhifang Sui. 2018b. Incorporating glosses into neu- mantics: Why, What, and How? ral word sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Com- Sascha Rothe and Hinrich Schutze.¨ 2015. AutoEx- putational Linguistics (Volume 1: Long Papers), tend: Extending word embeddings to embeddings pages 2473–2482, Melbourne, Australia. Associa- for synsets and lexemes. In Proceedings of the tion for Computational Linguistics. 53rd Annual Meeting of the Association for Compu- tational Linguistics and the 7th International Joint Oren Melamud, Jacob Goldberger, and Ido Dagan. Conference on Natural Language Processing (Vol- 2016. context2vec: Learning generic context em- ume 1: Long Papers), pages 1793–1803, Beijing, bedding with bidirectional LSTM. In Proceedings China. Association for Computational Linguistics. of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61, Berlin, Lo¨ıc Vial, Benjamin Lecouteux, and Didier Schwab. Germany. Association for Computational Linguis- 2018. Improving the coverage and the general- tics. ization ability of neural word sense disambiguation through hypernymy and hyponymy relationships. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor- CoRR, abs/1811.00960. rado, and Jeffrey Dean. 2013. Distributed represen- tations of words and phrases and their composition- Dayu Yuan, Julian Richardson, Ryan Doherty, Colin ality. In Proceedings of the 26th International Con- Evans, and Eric Altendorf. 2016. Semi-supervised ference on Neural Information Processing Systems - word sense disambiguation with neural models. In Volume 2, NIPS’13, pages 3111–3119, USA. Curran Proceedings of COLING 2016, the 26th Interna- Associates Inc. tional Conference on Computational Linguistics: Technical Papers, pages 1374–1385, Osaka, Japan. George A. Miller, Martin Chodorow, Shari Landes, The COLING 2016 Organizing Committee. Claudia Leacock, and Robert G. Thomas. 1994. Us- ing a semantic concordance for sense identification. Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: In HUMAN LANGUAGE TECHNOLOGY: Proceed- A wide-coverage word sense disambiguation system ings of a Workshop held at Plainsboro, New Jersey, for free text. In Proceedings of the ACL 2010 Sys- March 8-11, 1994. tem Demonstrations, pages 78–83, Uppsala, Swe- den. Association for Computational Linguistics. Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):10:1– 10:69. 
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word rep- resentations. In Proceedings of the 2018 Confer- ence of the North American Chapter of the Associ- ation for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language under- standing by generative pre-training. Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017a. Word sense disambigua- tion: A unified evaluation framework and empiri- cal comparison. In Proceedings of the 15th Con- ference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Pa- pers, pages 99–110, Valencia, Spain. Association for Computational Linguistics.