SenseBERT: Driving Some Sense into BERT

Yoav Levine, Barak Lenz, Or Dagan, Ori Ram, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, Yoav Shoham
AI21 Labs, Tel Aviv, Israel

Abstract

The ability to learn from large unlabeled corpora has allowed neural language models to advance the frontier in natural language understanding. However, existing self-supervision techniques operate at the word form level, which serves as a surrogate for the underlying semantic content. This paper proposes a method to employ weak-supervision directly at the word sense level. Our model, named SenseBERT, is pre-trained to predict not only the masked words but also their WordNet supersenses. Accordingly, we attain a lexical-semantic level language model, without the use of human annotation. SenseBERT achieves significantly improved lexical understanding, as we demonstrate by experimenting on SemEval Word Sense Disambiguation, and by attaining a state of the art result on the 'Word in Context' task.

1 Introduction

Neural language models have recently undergone a qualitative leap forward, pushing the state of the art on various NLP tasks. Together with advances in network architecture (Vaswani et al., 2017), the use of self-supervision has proven to be central to these achievements, as it allows the network to learn from massive amounts of unannotated text. The self-supervision strategy employed in BERT (Devlin et al., 2019) involves masking some of the words in an input sentence, and then training the model to predict them given their context. Other proposed approaches for self-supervised objectives, including unidirectional (Radford et al., 2019), permutational (Yang et al., 2019), or word insertion-based (Chan et al., 2019) methods, operate similarly, over words. However, since a given word form can possess multiple meanings (e.g., the word 'bass' can refer to a fish, a guitar, a type of singer, etc.), the word itself is merely a surrogate of its actual meaning in a given context, referred to as its sense. Indeed, the word-form level is viewed as a surface level which often introduces challenging ambiguity (Navigli, 2009).

In this paper, we bring forth a novel methodology for applying weak-supervision directly on the level of a word's meaning. By infusing word-sense information into BERT's pre-training signal, we explicitly expose the model to lexical semantics when learning from a large unannotated corpus. We call the resultant sense-informed model SenseBERT. Specifically, we add a masked-word sense prediction task as an auxiliary task in BERT's pre-training. Thereby, jointly with the standard word-form level language model, we train a semantic-level language model that predicts the missing word's meaning. Our method does not require sense-annotated data; self-supervised learning from unannotated text is facilitated by using WordNet (Miller, 1998), an expert-constructed inventory of word senses, as weak supervision.

We focus on a coarse-grained variant of a word's sense, referred to as its WordNet supersense, in order to mitigate an identified brittleness of fine-grained word-sense systems, caused by arbitrary sense granularity, blurriness, and general subjectiveness (Kilgarriff, 1997; Schneider, 2014). WordNet lexicographers organize all word senses into 45 supersense categories, 26 of which are for nouns, 15 for verbs, 3 for adjectives and 1 for adverbs (see the full supersense table in the supplementary materials). Disambiguating a word's supersense has been widely studied as a fundamental lexical categorization task (Ciaramita and Johnson, 2003; Basile, 2012; Schneider and Smith, 2015).

We employ the masked word's list of allowed supersenses from WordNet as the set of possible labels for the sense prediction task. The labeling of words with a single supersense (e.g., 'sword' has only the supersense noun.artifact) is straightforward: we train the network to predict this supersense given the masked word's context. As for words with multiple supersenses (e.g., 'bass' can be: noun.food, noun.animal, noun.artifact, noun.person, etc.), we train the model to predict any of these senses, leading to a simple yet effective soft-labeling scheme.
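As an illustration of this labeling, the supersense inventory is exposed in NLTK's WordNet interface as each synset's lexicographer file name (e.g., noun.artifact). The following is a minimal sketch, not the authors' code; the uniform weighting over a word's allowed supersenses is an assumption made for illustration, since the text above only specifies that the model may predict any of them.

    # Minimal sketch: deriving a word-form's allowed WordNet supersenses and a
    # soft-label vector over them. Requires: pip install nltk; nltk.download('wordnet').
    from nltk.corpus import wordnet as wn


    def allowed_supersenses(word):
        """Supersenses (WordNet lexicographer names, e.g. 'noun.artifact') of a word-form."""
        seen = dict.fromkeys(synset.lexname() for synset in wn.synsets(word))
        return list(seen)


    def soft_label(word, supersense_vocab):
        """Spread probability mass uniformly over the word's allowed supersenses (assumed scheme)."""
        allowed = set(allowed_supersenses(word)) & set(supersense_vocab)
        if not allowed:
            return [0.0] * len(supersense_vocab)
        weight = 1.0 / len(allowed)
        return [weight if s in allowed else 0.0 for s in supersense_vocab]


    if __name__ == "__main__":
        print(allowed_supersenses("sword"))  # ['noun.artifact'] -- a single, hard label
        print(allowed_supersenses("bass"))   # several supersenses -- a soft label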
We show that SenseBERT_BASE outscores both BERT_BASE and BERT_LARGE by a large margin on a supersense variant of the SemEval Word Sense Disambiguation (WSD) data set standardized in Raganato et al. (2017). Notably, SenseBERT achieves competitive results on this task without fine-tuning, i.e., when training a linear classifier over the pre-trained embeddings, which serves as a testament to its self-acquisition of lexical semantics. Furthermore, we show that SenseBERT_BASE surpasses BERT_LARGE in the Word in Context (WiC) task (Pilehvar and Camacho-Collados, 2019) from the SuperGLUE benchmark (Wang et al., 2019), which directly depends on word-supersense awareness. A single SenseBERT_LARGE model achieves state of the art performance on WiC with a score of 72.14, improving the score of BERT_LARGE by 2.5 points.

2 Related Work

Neural network based word embeddings first appeared as a static (non-contextualized) mapping, where every word is represented by a constant pre-trained embedding (Mikolov et al., 2013; Pennington et al., 2014). Such embeddings were shown to contain some amount of word-sense information (Iacobacci et al., 2016; Yuan et al., 2016; Arora et al., 2018; Le et al., 2018). Additionally, sense embeddings computed for each word sense in the word-sense inventory (e.g., WordNet) have been employed, relying on hypernymity relations (Rothe and Schütze, 2015) or the gloss for each sense (Chen et al., 2014). These approaches rely on static word embeddings and require a large amount of annotated data per word sense.

The introduction of contextualized word embeddings (Peters et al., 2018), for which a given word's embedding is context-dependent rather than precomputed, has brought forth a promising prospect for sense-aware word embeddings. Indeed, visualizations in Reif et al. (2019) show that sense-sensitive clusters form in BERT's word embedding space. Nevertheless, we identify a clear gap in this ability. We show that a vanilla BERT model trained with the current word-level self-supervision, burdened with the implicit task of disambiguating word meanings, often fails to grasp lexical semantics, exhibiting high supersense misclassification rates. Our suggested weakly-supervised word-sense signal allows SenseBERT to significantly bridge this gap.

Moreover, SenseBERT exhibits an improvement in lexical semantics ability (reflected by the Word in Context task score) even when compared to models with WordNet-infused linguistic knowledge. Specifically, we compare to Peters et al. (2019), who re-contextualize word embeddings via a word-to-entity attention mechanism (where entities are WordNet lemmas and synsets), and to Loureiro and Jorge (2019), who construct sense embeddings from BERT's word embeddings and use the WordNet graph to enhance coverage (see quantitative comparison in Table 3).

3 Incorporating Word-Supersense Information in Pre-training

In this section, we present our proposed method for integrating word-sense information within SenseBERT's pre-training. We start by describing the vanilla BERT architecture in subsection 3.1. We conceptually divide it into an internal transformer encoder and an external mapping W which translates the observed vocabulary space into and out of the transformer encoder space [see illustration in figure 1(a)]; a schematic sketch of this division follows.
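The following PyTorch sketch makes this division concrete for the vanilla case of figure 1(a): a one-hot word vector is mapped by W into the encoder space, and the same W (transposed) maps the encoder output back to vocabulary scores. Dimensions, initialization, and the tiny encoder are illustrative assumptions, not BERT's actual configuration.

    import torch
    import torch.nn as nn

    D_W, d, N = 30522, 768, 128      # vocabulary size, hidden dimension, sequence length (assumed)

    W = nn.Parameter(torch.randn(d, D_W) * 0.02)   # external word mapping W
    p = nn.Parameter(torch.randn(N, d) * 0.02)     # positional embeddings p^(j)
    encoder = nn.TransformerEncoder(               # stand-in for the internal transformer encoder
        nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True),
        num_layers=2,
    )

    x = torch.zeros(1, N, D_W)                     # one-hot word vectors x^(j) in {0,1}^{D_W}
    x[0, torch.arange(N), torch.randint(0, D_W, (N,))] = 1.0

    h_in = x @ W.T + p          # W x^(j) + p^(j): vocabulary space -> encoder space
    v = encoder(h_in)           # internal transformer encoder
    y_words = v @ W             # W^T v^(j): encoder space -> vocabulary scores
    print(y_words.shape)        # torch.Size([1, 128, 30522])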
In the subsequent subsections, we frame our contribution to the vanilla BERT architecture as the addition of a parallel external mapping to the word-supersenses space, denoted S [see illustration in figure 1(b)]. Specifically, in section 3.2 we describe the loss function used for learning S in parallel to W, effectively implementing word-form and word-sense multi-task learning in the pre-training stage. Then, in section 3.3 we describe our methodology for adding supersense information in S to the initial Transformer embedding, in parallel to the word-level information added by W. In section 3.4 we address the issue of supersense prediction for out-of-vocabulary words, and in section 3.5 we describe our modification of BERT's masking strategy, prioritizing single-supersensed words, which carry a clearer semantic signal.

[Figure 1: SenseBERT includes a masked-word supersense prediction task, pre-trained jointly with BERT's original masked-word prediction task (Devlin et al., 2019) (see section 3.2). As in the original BERT, the mapping from the Transformer dimension to the external dimension is the same both at input and at output (W for words and S for supersenses), where M denotes a fixed mapping between word-forms and their allowed WordNet supersenses (see section 3.3). The vectors p^(j) denote positional embeddings. For clarity, we omit a reference to a sentence-level Next Sentence Prediction task trained jointly with the above. Panel (a) depicts BERT, with input embeddings Wx^(j) + p^(j) and output word scores W^T y; panel (b) depicts SenseBERT, with input embeddings Wx^(j) + SMx^(j) + p^(j) and output sense scores S^T y predicted alongside the word scores.]

3.1 Background

The input to BERT is a sequence of words $\{x^{(j)} \in \{0,1\}^{D_W}\}_{j=1}^{N}$, where 15% of the words are replaced by a [MASK] token (see the treatment of sub-word tokenization in section 3.4). The word-score vector for a masked word at position j is obtained by applying the transpose of the word mapping, W^T, to the Transformer encoder's output at that position [see figure 1(a)]. A sketch of the corresponding SenseBERT computation, per figure 1(b), follows.
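Extending the previous sketch to figure 1(b), the snippet below adds the supersense mapping S and a fixed word-to-supersense matrix M to the input embedding, and reads sense scores off the encoder output through S^T. It is a hedged illustration of the structure shown in the figure, not the released implementation; M is random here as a placeholder for the actual word-form-to-allowed-supersenses mapping, and any normalization of M is left to section 3.3.

    import torch
    import torch.nn as nn

    D_W, D_S, d, N = 30522, 45, 768, 128   # vocabulary, supersense count, hidden dim, length (assumed)

    W = nn.Parameter(torch.randn(d, D_W) * 0.02)   # word mapping, as in figure 1(a)
    S = nn.Parameter(torch.randn(d, D_S) * 0.02)   # supersense mapping, added in figure 1(b)
    p = nn.Parameter(torch.randn(N, d) * 0.02)     # positional embeddings p^(j)

    # Fixed (non-learned) mapping M between word-forms and their allowed WordNet
    # supersenses; random placeholder here, built from WordNet in practice.
    M = (torch.rand(D_S, D_W) < 0.1).float()

    encoder = nn.TransformerEncoder(               # stand-in for the internal transformer encoder
        nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True),
        num_layers=2,
    )

    x = torch.zeros(1, N, D_W)                     # one-hot inputs x^(j)
    x[0, torch.arange(N), torch.randint(0, D_W, (N,))] = 1.0

    h_in = x @ W.T + (x @ M.T) @ S.T + p           # W x^(j) + S M x^(j) + p^(j)
    v = encoder(h_in)                              # internal transformer encoder
    y_words = v @ W                                # masked-word scores,       W^T v^(j)
    y_senses = v @ S                               # masked-supersense scores, S^T v^(j)
    print(y_words.shape, y_senses.shape)           # (1, 128, 30522) (1, 128, 45)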
