SenseBERT: Driving Some Sense into BERT

Yoav Levine Barak Lenz Or Dagan Ori Ram Dan Padnos Or Sharir Shai Shalev-Shwartz Amnon Shashua Yoav Shoham

AI21 Labs, Tel Aviv, Israel

{yoavl,barakl,ord,orir,...}@ai21.com

Abstract

The ability to learn from large unlabeled corpora has allowed neural language models to advance the frontier in natural language understanding. However, existing self-supervision techniques operate at the word form level, which serves as a surrogate for the underlying semantic content. This paper proposes a method to employ weak-supervision directly at the word sense level. Our model, named SenseBERT, is pre-trained to predict not only the masked words but also their WordNet supersenses. Accordingly, we attain a lexical-semantic level language model, without the use of human annotation. SenseBERT achieves significantly improved lexical understanding, as we demonstrate by experimenting on SemEval Word Sense Disambiguation, and by attaining a state of the art result on the ‘Word in Context’ task.

1 Introduction

Neural language models have recently undergone a qualitative leap forward, pushing the state of the art on various NLP tasks. Together with advances in network architecture (Vaswani et al., 2017), the use of self-supervision has proven to be central to these achievements, as it allows the network to learn from massive amounts of unannotated text.

The self-supervision strategy employed in BERT (Devlin et al., 2019) involves masking some of the words in an input sentence, and then training the model to predict them given their context. Other proposed approaches for self-supervised objectives, including unidirectional (Radford et al., 2019), permutational (Yang et al., 2019), or word insertion-based (Chan et al., 2019) methods, operate similarly, over words. However, since a given word form can possess multiple meanings (e.g., the word ‘bass’ can refer to a fish, a guitar, a type of singer, etc.), the word itself is merely a surrogate of its actual meaning in a given context, referred to as its sense. Indeed, the word-form level is viewed as a surface level which often introduces challenging ambiguity (Navigli, 2009).

In this paper, we bring forth a novel methodology for applying weak-supervision directly on the level of a word's meaning. By infusing word-sense information into BERT's pre-training signal, we explicitly expose the model to lexical semantics when learning from a large unannotated corpus. We call the resultant sense-informed model SenseBERT. Specifically, we add a masked-word sense prediction task as an auxiliary task in BERT's pre-training. Thereby, jointly with the standard word-form level language model, we train a semantic-level language model that predicts the missing word's meaning. Our method does not require sense-annotated data; self-supervised learning from unannotated text is facilitated by using WordNet (Miller, 1998), an expert constructed inventory of word senses, as weak supervision.

We focus on a coarse-grained variant of a word's sense, referred to as its WordNet supersense, in order to mitigate an identified brittleness of fine-grained word-sense systems, caused by arbitrary sense granularity, blurriness, and general subjectiveness (Kilgarriff, 1997; Schneider, 2014). WordNet lexicographers organize all word senses into 45 supersense categories, 26 of which are for nouns, 15 for verbs, 3 for adjectives and 1 for adverbs (see the full supersense table in the supplementary materials). Disambiguating a word's supersense has been widely studied as a fundamental lexical categorization task (Ciaramita and Johnson, 2003; Basile, 2012; Schneider and Smith, 2015).

We employ the masked word's allowed supersenses list from WordNet as a set of possible labels for the sense prediction task. The labeling of words with a single supersense (e.g., ‘sword’ has only the supersense noun.artifact) is straightforward: we train the network to predict this supersense given the masked word's context.

As for words with multiple supersenses (e.g., ‘bass’ can be: noun.food, noun.animal, noun.artifact, noun.person, etc.), we train the model to predict any of these senses, leading to a simple yet effective soft-labeling scheme.

We show that SenseBERT_BASE outscores both BERT_BASE and BERT_LARGE by a large margin on a supersense variant of the SemEval Word Sense Disambiguation (WSD) data set standardized in Raganato et al. (2017). Notably, SenseBERT receives competitive results on this task without fine-tuning, i.e., when training a linear classifier over the pretrained embeddings, which serves as a testament for its self-acquisition of lexical semantics.

Furthermore, we show that SenseBERT_BASE surpasses BERT_LARGE in the Word in Context (WiC) task (Pilehvar and Camacho-Collados, 2019) from the SuperGLUE benchmark (Wang et al., 2019), which directly depends on word-supersense awareness. A single SenseBERT_LARGE model achieves state of the art performance on WiC with a score of 72.14, improving the score of BERT_LARGE by 2.5 points.

2 Related Work

Neural network based word embeddings first appeared as a static mapping (non-contextualized), where every word is represented by a constant pretrained embedding (Mikolov et al., 2013; Pennington et al., 2014). Such embeddings were shown to contain some amount of word-sense information (Iacobacci et al., 2016; Yuan et al., 2016; Arora et al., 2018; Le et al., 2018). Additionally, sense embeddings computed for each word sense in the word-sense inventory (e.g. WordNet) have been employed, relying on hypernymity relations (Rothe and Schütze, 2015) or the gloss for each sense (Chen et al., 2014). These approaches rely on static word embeddings and require a large amount of annotated data per word sense.

The introduction of contextualized word embeddings (Peters et al., 2018), for which a given word's embedding is context-dependent rather than precomputed, has brought forth a promising prospect for sense-aware word embeddings. Indeed, visualizations in Reif et al. (2019) show that sense sensitive clusters form in BERT's word embedding space. Nevertheless, we identify a clear gap in this ability. We show that a vanilla BERT model trained with the current word-level self-supervision, burdened with the implicit task of disambiguating word meanings, often fails to grasp lexical semantics, exhibiting high supersense misclassification rates. Our suggested weakly-supervised word-sense signal allows SenseBERT to significantly bridge this gap.

Moreover, SenseBERT exhibits an improvement in lexical semantics ability (reflected by the Word in Context task score) even when compared to models with WordNet infused linguistic knowledge. Specifically, we compare to Peters et al. (2019), who re-contextualize word embeddings via a word-to-entity attention mechanism (where entities are WordNet lemmas and synsets), and to Loureiro and Jorge (2019), who construct sense embeddings from BERT's word embeddings and use the WordNet graph to enhance coverage (see quantitative comparison in table 3).

3 Incorporating Word-Supersense Information in Pre-training

In this section, we present our proposed method for integrating word-sense information within SenseBERT's pre-training. We start by describing the vanilla BERT architecture in subsection 3.1. We conceptually divide it into an internal transformer encoder and an external mapping W which translates the observed vocabulary space into and out of the transformer encoder space [see illustration in figure 1(a)].

In the subsequent subsections, we frame our contribution to the vanilla BERT architecture as an addition of a parallel external mapping to the words supersenses space, denoted S [see illustration in figure 1(b)]. Specifically, in section 3.2 we describe the loss terms used for learning S in parallel to W, effectively implementing word-form and word-sense multi-task learning in the pre-training stage. Then, in section 3.3 we describe our methodology for adding supersense information in S to the initial Transformer embedding, in parallel to word-level information added by W. In section 3.4 we address the issue of supersense prediction for out-of-vocabulary words, and in section 3.5 we describe our modification of BERT's masking strategy, prioritizing single-supersensed words which carry a clearer semantic signal.

3.1 Background

The input to BERT is a sequence of words {x^{(j)} ∈ {0, 1}^{D_W}}_{j=1}^{N} where 15% of the words are replaced by a [MASK] token (see treatment of sub-word tokenization in section 3.4). Here N is the input sentence length, D_W is the word vocabulary size, and x^{(j)} is a 1-hot vector corresponding to the j-th input word. For every masked word, the output of the pretraining task is a word-score vector y^{words} ∈ R^{D_W} containing the per-word score. BERT's architecture can be decomposed to (1) an internal Transformer encoder architecture (Vaswani et al., 2017) wrapped by (2) an external mapping to the word vocabulary space, denoted by W.¹


Figure 1: SenseBERT includes a masked-word supersense prediction task, pre-trained jointly with BERT's original masked-word prediction task (Devlin et al., 2019) (see section 3.2). As in the original BERT, the mapping from the Transformer dimension to the external dimension is the same both at input and at output (W for words and S for supersenses), where M denotes a fixed mapping between word-forms and their allowed WordNet supersenses (see section 3.3). The vectors p^{(j)} denote positional embeddings. For clarity, we omit a reference to a sentence-level Next Sentence Prediction task trained jointly with the above.

The Transformer encoder operates over a sequence of word embeddings v_{input}^{(j)} ∈ R^d, where d is the Transformer encoder's hidden dimension. These are passed through multiple attention-based Transformer layers, producing a new sequence of contextualized embeddings at each layer. The Transformer encoder output is the final sequence of contextualized word embeddings v_{output}^{(j)} ∈ R^d.

The external mapping W ∈ R^{d×D_W} is effectively a translation between the external word vocabulary dimension and the internal Transformer dimension. Original words in the input sentence are translated into the Transformer block by applying this mapping (and adding positional encoding vectors p^{(j)} ∈ R^d):

    v_{input}^{(j)} = W x^{(j)} + p^{(j)}.    (1)

The word-score vector for a masked word at position j is extracted from the Transformer encoder output by applying the transpose: y^{words} = W^T v_{output}^{(j)} [see illustration in figure 1(a)]. The use of the same matrix W as the mapping in and out of the transformer encoder space is referred to as weight tying (Inan et al., 2017; Press and Wolf, 2017).

Given a masked word in position j, BERT's original masked-word prediction pre-training task is to have the softmax of the word-score vector y^{words} = W^T v_{output}^{(j)} get as close as possible to a 1-hot vector corresponding to the masked word. This is done by minimizing the cross-entropy loss between the softmax of the word-score vector and a 1-hot vector corresponding to the masked word:

    L_{LM} = -log p(w|context),    (2)

where w is the masked word, the context is composed of the rest of the input sequence, and the probability is computed by:

    p(w|context) = exp(y_w^{words}) / Σ_{w′} exp(y_{w′}^{words}),    (3)

where y_w^{words} denotes the w-th entry of the word-score vector.

¹ For clarity, we omit a description of the Next Sentence Prediction task which we employ as in Devlin et al. (2019).
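To make the weight-tying setup of eqs. (1)-(3) concrete, the following is a minimal PyTorch-style sketch (our own illustration, not the authors' released code); the dimension values and the `bert_encoder` stand-in are assumptions.

```python
import torch
import torch.nn.functional as F

d, D_W = 768, 30522          # Transformer hidden size and word vocabulary size (BERT_BASE-like, assumed)
W = torch.nn.Parameter(torch.randn(d, D_W) * 0.02)    # shared input/output word mapping
pos = torch.nn.Parameter(torch.randn(512, d) * 0.02)  # positional embeddings p^(j)

def word_lm_loss(token_ids, masked_pos, target_ids, bert_encoder):
    """Masked-word prediction with weight tying: the same W maps tokens in and scores out."""
    # Input embeddings: v_input^(j) = W x^(j) + p^(j) (eq. 1); W[:, id] selects the column for a 1-hot x.
    v_input = W[:, token_ids].T + pos[: token_ids.numel()]    # (N, d)
    v_output = bert_encoder(v_input.unsqueeze(0)).squeeze(0)  # (N, d) contextualized embeddings
    # Word scores for masked positions: y^words = W^T v_output^(j)
    logits = v_output[masked_pos] @ W                         # (num_masked, D_W)
    # L_LM = -log p(w | context) (eqs. 2-3): cross-entropy against the true masked words
    return F.cross_entropy(logits, target_ids)
```

Here `bert_encoder` is a placeholder for any Transformer encoder callable operating on a batch of embedding sequences.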

3.2 Weakly-Supervised Supersense Prediction Task

Jointly with the above procedure for training the word-level language model of SenseBERT, we train the model to predict the supersense of every masked word, thereby training a semantic-level language model. This is done by adding a parallel external mapping to the words supersenses space, denoted S ∈ R^{d×D_S} [see illustration in figure 1(b)], where D_S = 45 is the size of the supersenses vocabulary. Ideally, the objective is to have the softmax of the sense-score vector y^{senses} ∈ R^{D_S} := S^T v_{output}^{(j)} get as close as possible to a 1-hot vector corresponding to the word's supersense in the given context.

For each word w in our vocabulary, we employ the WordNet word-sense inventory for constructing A(w), the set of its “allowed” supersenses. Specifically, we apply a WordNet Lemmatizer on w, extract the different synsets that are mapped to the lemmatized word in WordNet, and define A(w) as the union of supersenses coupled to each of these synsets. As exceptions, we set A(w) = ∅ for the following: (i) short words (up to 3 characters), since they are often treated as abbreviations, (ii) stop words, as WordNet does not contain their main synset (e.g. ‘he’ is either the element helium or the Hebrew language according to WordNet), and (iii) tokens that represent part-of-word (see section 3.4 for further discussion on these tokens).

Given the above construction, we employ a combination of two loss terms for the supersense-level language model. The following allowed-senses term maximizes the probability that the predicted sense is in the set of allowed supersenses of the masked word w:

    L_{SLM}^{allowed} = -log p(s ∈ A(w)|context) = -log Σ_{s∈A(w)} p(s|context),    (4)

where the probability for a supersense s is given by:

    p(s|context) = exp(y_s^{senses}) / Σ_{s′} exp(y_{s′}^{senses}).    (5)

The soft-labeling scheme given above, which treats all the allowed supersenses of the masked word equally, introduces noise to the supersense labels. We expect that encountering many contexts in a sufficiently large corpus will reinforce the correct labels whereas the signal of incorrect labels will diminish. To illustrate this, consider the following examples for the food context:

1. “This bass is delicious” (supersenses: noun.food, noun.artifact, etc.)
2. “This chocolate is delicious” (supersenses: noun.food, noun.attribute, etc.)
3. “This pickle is delicious” (supersenses: noun.food, noun.state, etc.)

Masking the marked word in each of the examples results in three identical input sequences, each with a different set of labels. The ground truth label, noun.food, appears in all cases, so that its probability in contexts indicating food is increased whereas the signals supporting other labels cancel out.

While L_{SLM}^{allowed} pushes the network in the right direction, minimizing this loss could result in the network becoming overconfident in predicting a strict subset of the allowed senses for a given word, i.e., a collapse of the prediction distribution. This is especially acute in the early stages of the training procedure, when the network could converge to the noisy signal of the soft-labeling scheme.

To mitigate this issue, the following regularization term is added to the loss, which encourages a uniform prediction distribution over the allowed supersenses:

    L_{SLM}^{reg} = -Σ_{s∈A(w)} (1/|A(w)|) log p(s|context),    (6)

i.e., a cross-entropy loss with a uniform distribution over the allowed supersenses.

Overall, jointly with the regular word-level language model trained with the loss in eq. 2, we train the semantic-level language model with a combined loss of the form:

    L_{SLM} = L_{SLM}^{allowed} + L_{SLM}^{reg}.    (7)
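As an illustration of how the allowed-supersense sets A(w) and the combined loss of eq. (7) could be computed, here is a hedged sketch using NLTK's WordNet interface (the paper does not specify its exact tooling; the stop-word list and the `sense_logits` input are assumptions).

```python
import torch
import torch.nn.functional as F
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
STOP_WORDS = {"he", "she", "it", "the", "a", "an"}   # placeholder stop-word list (assumption)

def allowed_supersenses(word):
    """A(w): union of WordNet supersenses (lexnames) over all synsets of the lemmatized word."""
    if len(word) <= 3 or word.lower() in STOP_WORDS:  # exceptions (i) and (ii) from section 3.2
        return set()
    lemma = lemmatizer.lemmatize(word.lower())
    return {syn.lexname() for syn in wn.synsets(lemma)}

def supersense_lm_loss(sense_logits, allowed_idx):
    """Combined supersense loss of eq. (7) for one masked word.
    sense_logits: (D_S,) scores y^senses = S^T v_output; allowed_idx: indices of A(w)."""
    log_probs = F.log_softmax(sense_logits, dim=-1)            # log p(s | context), eq. (5)
    # Allowed-senses term (eq. 4): -log sum_{s in A(w)} p(s | context)
    loss_allowed = -torch.logsumexp(log_probs[allowed_idx], dim=0)
    # Regularizer (eq. 6): cross-entropy against a uniform distribution over A(w)
    loss_reg = -log_probs[allowed_idx].mean()
    return loss_allowed + loss_reg
```

For example, `allowed_supersenses('bass')` would include noun.food, noun.animal, noun.artifact, and noun.person, matching the multi-supersense example above.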
3.3 Supersense Aware Input Embeddings

Though in principle two different matrices could have been used for converting in and out of the Transformer encoder, the BERT architecture employs the same mapping W. This approach, referred to as weight tying, was shown to yield theoretical and practical benefits (Inan et al., 2017; Press and Wolf, 2017). Intuitively, constructing the Transformer encoder's input embeddings from the same mapping with which the scores are computed improves their quality as it makes the input more sensitive to the training signal.
Figure 2: UMAP visualization of supersense vectors (rows of the classifier S) learned by SenseBERT at pre-training. (a) Clustering by the supersense's part of speech. (b) Within noun supersenses, semantically similar supersenses are clustered together (see more details in the supplementary materials).
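A projection like the one in Figure 2 could be produced along the following lines; this is a sketch under assumptions (the `umap-learn` and `matplotlib` packages, and a trained sense matrix `S` of shape d × D_S), not the authors' plotting code. Supersense names are the ones listed in Table 5 of the appendix.

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

def plot_supersense_map(S, supersense_names, seed=0):
    """2-D UMAP projection of the supersense vectors (the per-supersense columns of S)."""
    vectors = np.asarray(S).T                      # (D_S, d): one vector per supersense
    coords = umap.UMAP(n_neighbors=10, min_dist=0.3, random_state=seed).fit_transform(vectors)
    plt.scatter(coords[:, 0], coords[:, 1], s=10)
    for (x, y), name in zip(coords, supersense_names):
        plt.annotate(name, (x, y), fontsize=6)     # label each point with its supersense
    plt.title("UMAP projection of SenseBERT supersense embeddings")
    plt.show()
```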

We follow this approach, and insert our newly proposed semantic-level language model matrix S in the input in addition to W [as depicted in figure 1(b)], such that the input vector to the Transformer encoder (eq. 1) is modified to obey:

    v_{input}^{(j)} = (W + SM) x^{(j)} + p^{(j)},    (8)

where p^{(j)} are the regular positional embeddings as used in BERT, and M ∈ R^{D_S×D_W} is a static 0/1 matrix converting between words and their allowed WordNet supersenses A(w) (see construction details above).

The above strategy for constructing v_{input}^{(j)} allows for the semantic-level vectors in S to come into play and shape the input embeddings even for words which are rarely observed in the training corpus. For such a word, the corresponding row in W is potentially less informative, since due to the low word frequency the model did not have sufficient chance to adequately learn it. However, since the model learns a representation of its supersense, the corresponding row in S is informative of the semantic category of the word. Therefore, the input embedding in eq. 8 can potentially help the model to elicit meaningful information even when the masked word is rare, allowing for better exploitation of the training corpus.

3.4 Rare Words Supersense Prediction

At the pre-processing stage, when an out-of-vocabulary (OOV) word is encountered in the corpus, it is divided into several in-vocabulary sub-word tokens. For the self-supervised word prediction task (eq. 2), masked sub-word tokens are straightforwardly predicted as described in section 3.1. In contrast, word-sense supervision is only meaningful at the word level. We compare two alternatives for dealing with tokenized OOV words for the supersense prediction task (eq. 7).

In the first alternative, called 60K vocabulary, we augment BERT's original 30K-token vocabulary (which roughly contained the most frequent words) with an additional 30K new words, chosen according to their frequency in Wikipedia. This vocabulary increase allows us to see more of the corpus as whole words for which supersense prediction is a meaningful operation. Additionally, in accordance with the discussion in the previous subsection, our sense-aware input embedding mechanism can help the model extract more information from lower-frequency words. For the cases where a sub-word token is chosen for masking, we only propagate the regular word-level loss and do not train the supersense prediction task.

The above addition to the vocabulary results in an increase of approximately 23M parameters over the 110M parameters of BERT_BASE and an increase of approximately 30M parameters over the 340M parameters of BERT_LARGE (due to different embedding dimensions d = 768 and d = 1024, respectively). It is worth noting that similar vocabulary sizes in leading models have not resulted in increased sense awareness, as reflected for example in the WiC task results (Liu et al., 2019).

[Figure 3 here: (a) “The [MASK] fell to the floor.”: 52% noun.artifact (sword, chair, ...), 17% noun.person (man, girl, ...); “Gill [MASK] the bread.”: 33% verb.contact (cut, buttered, ...), 20% verb.consumption (ate, chewed, ...), 11% verb.change (heated, baked, ...), 6% verb.possession (took, bought, ...). (b) Supersense predictions on the raw sentences “Dan cooked a bass on the grill.” and “The bass player was exceptional.”, with labels including noun.person, verb.creation, noun.food, and noun.artifact.]

Figure 3: (a) A demonstration of supersense probabilities assigned to a masked position within context, as given by SenseBERT's word-supersense level semantic language model (capped at 5%). Example words corresponding to each supersense are presented in parentheses. (b) Examples of SenseBERT's prediction on raw text, when the unmasked input sentence is given to the model. This beyond word-form abstraction ability facilitates a more natural elicitation of semantic content at pre-training.
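Returning to the input-embedding construction of eq. (8) above: the following is a minimal sketch of how the static word-to-supersense matrix M and the sense-aware input could be assembled (our illustration; `allowed_supersenses` is the helper sketched in section 3.2, and the vocabulary list is an assumption).

```python
import torch

def build_M(vocab, supersense_to_idx, allowed_supersenses):
    """Static 0/1 matrix M of shape (D_S, D_W): M[s, w] = 1 iff supersense s is in A(w)."""
    D_S, D_W = len(supersense_to_idx), len(vocab)
    M = torch.zeros(D_S, D_W)
    for w_idx, word in enumerate(vocab):
        for s in allowed_supersenses(word):
            M[supersense_to_idx[s], w_idx] = 1.0
    return M

def sense_aware_input(token_ids, W, S, M, pos):
    """v_input^(j) = (W + S M) x^(j) + p^(j) (eq. 8): the word embedding plus the sum of its
    allowed supersense embeddings, plus positional embeddings."""
    WSM = W + S @ M                            # (d, D_W); precomputable since M is static
    return WSM[:, token_ids].T + pos[: token_ids.numel()]
```

Because M is fixed, W + SM can be computed once per training step, so the modification adds essentially no cost over eq. (1).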

As a second alternative, referred to as average embedding, we employ BERT's regular 30K-token vocabulary and employ a whole-word-masking strategy. Accordingly, all of the tokens of a tokenized OOV word are masked together. In this case, we train the supersense prediction task to predict the WordNet supersenses of this word from the average of the output embeddings at the locations of the masked sub-word tokens.

SenseBERT_BASE variant   SemEval-SS Fine-tuned
30K no OOV               81.9
30K average OOV          82.7
60K no OOV               83.0

Table 1: Testing variants for predicting supersenses of rare words during SenseBERT's pretraining, as described in section 5.1. Results are reported on the SemEval-SS task (see section 5.2). 30K/60K stand for vocabulary size, and no/average OOV stand for not predicting senses for OOV words or predicting senses from the average of the sub-word token embeddings, respectively.

3.5 Single-Supersensed Word Masking

Words that have a single supersense are good anchors for obtaining an unambiguous semantic signal. These words teach the model to accurately map contexts to supersenses, such that it is then able to make correct context-based predictions even when a masked word has several supersenses. We therefore favor such words in the masking strategy, choosing 50% of the single-supersensed words in each input sequence to be masked. We stop if 40% of the overall 15% masking budget is filled with single-supersensed words (this rarely happens), and in any case we randomize the choice of the remaining words to complete this budget. As in the original BERT, 1 out of 10 words chosen for masking is shown to the model as itself rather than replaced with [MASK].

4 Semantic Language Model Visualization

A SenseBERT pretrained as described in section 3 (with training hyperparameters as in Devlin et al. (2019)) has an immediate non-trivial by-product. The pre-trained mapping to the supersenses space, denoted S, acts as an additional head predicting a word's supersense given context [see figure 1(b)]. We thereby effectively attain a semantic-level language model that predicts the missing word's meaning jointly with the standard word-form level language model.

We illustrate the resultant mapping in figure 2, showing a UMAP dimensionality reduction (McInnes et al., 2018) of the rows of S, which correspond to the different supersenses. A clear clustering according to the supersense part-of-speech is apparent in figure 2(a). We further identify finer-grained semantic clusters, as shown for example in figure 2(b) and given in more detail in the supplementary materials.

SenseBERT's semantic language model allows predicting a distribution over supersenses rather than over words in a masked position. Figure 3(a) shows the supersense probabilities assigned by SenseBERT in several contexts, demonstrating the model's ability to assign semantically meaningful categories to the masked position.

Finally, we demonstrate that SenseBERT enjoys an ability to view raw text at a lexical semantic level. Figure 3(b) shows example sentences and their supersense prediction by the pretrained model. Where a vanilla BERT would see only the words of the sentence “Dan cooked a bass on the grill”, SenseBERT would also have access to the supersense abstraction: “[Person] [created] [food] on the [artifact]”. This sense-level perspective can help the model extract more knowledge from every training example, and to generalize semantically similar notions which do not share the same phrasing.
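As a usage illustration of this semantic-level head (the kind of query behind Figure 3), the supersense distribution at a masked position could be read out roughly as follows; this is a sketch, and `encode_with_mask` and the tensor shapes are assumptions rather than the released API.

```python
import torch
import torch.nn.functional as F

def supersense_distribution(sentence_tokens, mask_index, model, S, supersense_names, top_k=5):
    """Return the top-k supersenses predicted for the masked position (cf. Figure 3a)."""
    # encode_with_mask is a stand-in for running SenseBERT with [MASK] at mask_index
    v_output = model.encode_with_mask(sentence_tokens, mask_index)   # (d,) contextualized vector
    probs = F.softmax(S.T @ v_output, dim=-1)                        # p(s | context) over the D_S supersenses
    top = torch.topk(probs, top_k)
    return [(supersense_names[int(i)], float(p)) for p, i in zip(top.values, top.indices)]
```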

[Figure 4 here: (a) SemEval-SS examples: “The team used a battery of the newly developed ‘gene probes’” (BERT: noun.artifact, SenseBERT: noun.group); “Ten shirt-sleeved ringers stand in a circle, one foot ahead of the other in a prize-fighter's stance” (BERT: noun.quantity, SenseBERT: noun.body). (b) WiC examples: Sentence A “The kick must be synchronized with the arm movements.” vs. Sentence B “A sidecar is a smooth drink but it has a powerful kick.” (BERT: Same, SenseBERT: Different); Sentence A “Plant bugs in the dissident's apartment.” vs. Sentence B “Plant a spy in Moscow.” (BERT: Different, SenseBERT: Same).]

Figure 4: Example entries of (a) the SemEval-SS task, where a model is to predict the supersense of the marked word, and (b) the Word in Context (WiC) task, where a model must determine whether the underlined word is used in the same/different supersense within sentences A and B. In all displayed examples, taken from the corresponding development sets, SenseBERT predicted the correct label while BERT failed to do so. A quantitative comparison between models is presented in table 2.

5 Lexical Semantics Experiments

In this section, we present quantitative evaluations of SenseBERT, pre-trained as described in section 3. We test the model's performance on a supersense-based variant of the SemEval WSD test sets standardized in Raganato et al. (2017), and on the Word in Context (WiC) task (Pilehvar and Camacho-Collados, 2019) (included in the recently introduced SuperGLUE benchmark (Wang et al., 2019)), both directly relying on the network's ability to perform lexical semantic categorization.

5.1 Comparing Rare Words Supersense Prediction Methods

We first report a comparison of the two methods described in section 3.4 for predicting the supersenses of rare words which do not appear in BERT's original vocabulary. The first, the 60K vocabulary method, enriches the vocabulary, and the second, the average embedding method, predicts a supersense from the average embeddings of the sub-word tokens comprising an OOV word. During fine-tuning, when encountering an OOV word we predict the supersenses from the rightmost sub-word token in the 60K vocabulary method and from the average of the sub-word tokens in the average embedding method.

As shown in table 1, both methods perform comparably on the SemEval supersense disambiguation task (see the following subsection), yielding an improvement over the baseline of learning supersense information only for whole words in BERT's original 30K-token vocabulary. We continue with the 60K-token vocabulary for the rest of the experiments, but note the average embedding option as a viable competitor for predicting word-level semantics.

5.2 SemEval-SS: Supersense Disambiguation

We test SenseBERT on a Word Supersense Disambiguation task, a coarse-grained variant of the common WSD task. We use SemCor (Miller et al., 1993) as our training dataset (226,036 annotated examples), and the SensEval (Edmonds and Cotton, 2001; Snyder and Palmer, 2004) / SemEval (Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015) suite for evaluation (overall 7253 annotated examples), following Raganato et al. (2017). For each word in both training and test sets, we change its fine-grained sense label to its corresponding WordNet supersense, and therefore train the network to predict a given word's supersense. We name this supersense disambiguation task SemEval-SS. See figure 4(a) for an example from this modified data set.
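The conversion from fine-grained sense annotations to supersense labels that defines SemEval-SS can be done directly through WordNet; a minimal sketch with NLTK follows (the sense-key input format of the Raganato et al. (2017) framework and the example key are assumptions).

```python
from nltk.corpus import wordnet as wn

def sense_key_to_supersense(sense_key):
    """Map a fine-grained WordNet sense key to its coarse supersense label."""
    lemma = wn.lemma_from_key(sense_key)   # raises WordNetError for malformed keys
    return lemma.synset().lexname()        # e.g. 'noun.food', 'verb.contact'

# Hypothetical usage with a SemCor-style sense key:
# sense_key_to_supersense('bass%1:13:00::')  ->  'noun.food'
```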

SemEval-SS Frozen   SemEval-SS Fine-tuned   Word in Context

BERT_BASE         65.1    79.2    –
BERT_LARGE        67.3    81.1    69.6
SenseBERT_BASE    75.6    83.0    70.3
SenseBERT_LARGE   79.5    83.7    72.1

Table 2: Results on a supersense variant of the SemEval WSD test set standardized in Raganato et al. (2017), which we denote SemEval-SS, and on the Word in Context (WiC) dataset (Pilehvar and Camacho-Collados, 2019) included in the recently introduced SuperGLUE benchmark (Wang et al., 2019). These tasks require a high level of lexical semantic understanding, as can be seen in the examples in figure 4. For both tasks, SenseBERT demonstrates a clear improvement over BERT in the regular fine-tuning setup, where network weights are modified during training on the task. Notably, SenseBERT_LARGE achieves state of the art performance on the WiC task. In the SemEval-SS Frozen setting, we train a linear classifier over pretrained embeddings, without changing the network weights. The results show that SenseBERT introduces a dramatic improvement in this setting, implying that its word-sense aware pre-training (section 3) yields embeddings that carry lexical semantic information which is easily extractable for the benefit of downstream tasks. Results for BERT on the SemEval-SS task are attained by employing the published pre-trained BERT models, and the BERT_LARGE result on WiC is taken from the baseline scores published on the SuperGLUE benchmark (Wang et al., 2019) (no result has been published for BERT_BASE).
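The Frozen setting described in the caption amounts to a linear probe over fixed contextualized embeddings; a hedged sketch of that evaluation protocol is shown below (feature-extraction details, shapes, and hyperparameters are assumptions, not the authors' training setup).

```python
import torch
import torch.nn.functional as F

def train_frozen_probe(features, labels, num_classes, epochs=20, lr=1e-3):
    """Linear classifier over frozen output embeddings: only the probe's weights are trained.
    features: (num_examples, d) contextualized embeddings of the target words; labels: supersense ids."""
    probe = torch.nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(probe(features), labels)
        loss.backward()
        opt.step()
    return probe
```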

We show results on the SemEval-SS task for two different training schemes. In the first, we trained a linear classifier over the ‘frozen’ output embeddings of the examined model – we do not change the trained SenseBERT's parameters in this scheme. This Frozen setting is a test for the amount of basic lexical semantics readily present in the pre-trained model, easily extractable by further downstream tasks (reminiscent of the semantic probes employed in Hewitt and Manning (2019); Reif et al. (2019)).

In the second training scheme we fine-tuned the examined model on the task, allowing its parameters to change during training (see full training details in the supplementary materials). Results attained by employing this training method reflect the model's potential to acquire word-supersense information given its pre-training.

Table 2 shows a comparison between vanilla BERT and SenseBERT on the supersense disambiguation task. Our semantic-level pre-training signal clearly yields embeddings with enhanced word-meaning awareness, relative to embeddings trained with BERT's vanilla word-level signal. SenseBERT_BASE improves the score of BERT_BASE in the Frozen setting by over 10 points and SenseBERT_LARGE improves that of BERT_LARGE by over 12 points, demonstrating competitive results even without fine-tuning. In the setting of model fine-tuning, we see a clear demonstration of the model's ability to learn word-level semantics, as SenseBERT_BASE surpasses the score of BERT_LARGE by 2 points.

5.3 Word in Context (WiC) Task

We test our model on the recently introduced WiC binary classification task. Each instance in WiC has a target word w for which two contexts are provided, each invoking a specific meaning of w. The task is to determine whether the occurrences of w in the two contexts share the same meaning or not, clearly requiring an ability to identify the word's semantic category. The WiC task is defined over supersenses (Pilehvar and Camacho-Collados, 2019) – the negative examples include a word used in two different supersenses and the positive ones include a word used in the same supersense. See figure 4(b) for an example from this data set.

                                                    Word in Context
ELMo (Pilehvar and Camacho-Collados, 2019)          57.7
BERT sense embeddings (Loureiro and Jorge, 2019)    67.7
BERT_LARGE (Wang et al., 2019)                      69.6
RoBERTa (Liu et al., 2019)                          69.9
KnowBERT-W+W (Peters et al., 2019)                  70.9
SenseBERT_LARGE                                     72.1

Table 3: Test set results for the WiC dataset (score sources are cited per row).

Score CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE

BERT_BASE (ours)   77.5   50.1   92.6   88.7/84.3   85.7/84.6   71.0/88.9   83.6   89.4   67.9
SenseBERT_BASE     77.9   54.6   92.2   89.2/85.2   83.5/82.3   70.3/88.8   83.6   90.6   67.5

Table 4: Results on the GLUE benchmark test set.

Results on the WiC task comparing SenseBERT to vanilla BERT are shown in table 2. SenseBERT_BASE surpasses a larger vanilla model, BERT_LARGE. As shown in table 3, a single SenseBERT_LARGE model achieves the state of the art score in this task, demonstrating unprecedented lexical semantic awareness.

5.4 GLUE

The General Language Understanding Evaluation (GLUE; Wang et al. (2018)) benchmark is a popular testbed for language understanding models. It consists of 9 different NLP tasks, covering different linguistic phenomena. We evaluate our model on GLUE, in order to verify that SenseBERT gains its lexical semantic knowledge without compromising performance on other downstream tasks. Due to slight differences in the data used for pretraining BERT and SenseBERT (BookCorpus is not publicly available), we trained a BERT_BASE model with the same data used for our models. BERT_BASE and SenseBERT_BASE were both finetuned using the exact same procedures and hyperparameters. The results are presented in table 4. Indeed, SenseBERT performs on par with BERT, achieving an overall score of 77.9, compared to 77.5 achieved by BERT_BASE.

6 Conclusion

We introduce lexical semantic information into a neural language model's pre-training objective. This results in a boosted word-level semantic awareness of the resultant model, named SenseBERT, which considerably outperforms a vanilla BERT on a SemEval based Supersense Disambiguation task and achieves state of the art results on the Word in Context task. This improvement was obtained without human annotation, but rather by harnessing an external linguistic knowledge source. Our work indicates that semantic signals extending beyond the lexical level can be similarly introduced at the pre-training stage, allowing the network to elicit further insight without human supervision.

Acknowledgments

We acknowledge useful comments and assistance from our colleagues at AI21 Labs. We would also like to thank the anonymous reviewers for their valuable feedback.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.

Pierpaolo Basile. 2012. Super-sense tagging using support vector machines and distributional features. In International Workshop on Evaluation of Natural Language and Speech Tools for Italian, pages 176–185. Springer.

William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1025–1035, Doha, Qatar. Association for Computational Linguistics.

Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 168–175.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Philip Edmonds and Scott Cotton. 2001. SENSEVAL-2: Overview. In Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 1–5, Toulouse, France. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 897–907, Berlin, Germany. Association for Computational Linguistics.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR.

Adam Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities, 31(2):91–113.

Minh Le, Marten Postma, Jacopo Urbani, and Piek Vossen. 2018. A deep dive into word sense disambiguation with LSTM. In Proceedings of the 27th International Conference on Computational Linguistics, pages 354–365, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691, Florence, Italy. Association for Computational Linguistics.

Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

George A. Miller. 1998. WordNet: An electronic lexical database. MIT Press.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993.

Andrea Moro and Roberto Navigli. 2015. SemEval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 288–297, Denver, Colorado. Association for Computational Linguistics.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2).

Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 task 12: Multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 222–231, Atlanta, Georgia, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, Minneapolis, Minnesota. Association for Computational Linguistics.

Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task 17: English lexical sample, SRL and all words. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 87–92, Prague, Czech Republic. Association for Computational Linguistics.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110, Valencia, Spain. Association for Computational Linguistics.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32, pages 8594–8603. Curran Associates, Inc.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1793–1803, Beijing, China. Association for Computational Linguistics.

Nathan Schneider. 2014. Lexical semantic analysis in natural language text. Unpublished doctoral dissertation, Carnegie Mellon University.

Nathan Schneider and Noah A. Smith. 2015. A corpus and model integrating multiword expressions and supersenses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1537–1547, Denver, Colorado. Association for Computational Linguistics.

Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43, Barcelona, Spain. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems 32, pages 3266–3280. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.

Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1374–1385, Osaka, Japan. The COLING 2016 Organizing Committee.

A Supersenses and Their Representation in SenseBERT

We present in table 5 a comprehensive list of WordNet supersenses, as they appear in the WordNet documentation. In figure 5 we present a dendrogram of an agglomerative hierarchical clustering over the supersense embedding vectors learned by SenseBERT in pre-training. The clustering shows a clear separation between noun senses and verb senses. Furthermore, we can observe that semantically related supersenses are clustered together (e.g., noun.animal and noun.plant).

B Training Details

As hyperparameters for the fine-tuning, we used a maximum sequence length of 128, chose learning rates from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, batch sizes from {16, 32}, and fine-tuned up to 10 epochs for all the datasets.
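The hyperparameter search in Appendix B can be expressed as a small grid; a sketch of how it might be enumerated is given below (the `finetune_and_evaluate` callable is a placeholder, not released code).

```python
from itertools import product

MAX_SEQ_LENGTH = 128
LEARNING_RATES = [5e-6, 1e-5, 2e-5, 3e-5, 5e-5]
BATCH_SIZES = [16, 32]
MAX_EPOCHS = 10

def grid_search(finetune_and_evaluate):
    """Try every (learning rate, batch size) pair for up to MAX_EPOCHS and keep the best dev score."""
    best = None
    for lr, bs in product(LEARNING_RATES, BATCH_SIZES):
        score = finetune_and_evaluate(max_seq_length=MAX_SEQ_LENGTH, lr=lr,
                                      batch_size=bs, epochs=MAX_EPOCHS)
        if best is None or score > best[0]:
            best = (score, {"lr": lr, "batch_size": bs})
    return best
```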


Figure 5: Dendrogram visualization of an agglomerative hierarchical clustering over the supersense vectors (rows of the classifier S) learned by SenseBERT.

adj.all: All adjective clusters
adj.pert: Relational adjectives (pertainyms)
adj.ppl: Participial adjectives
adv.all: All adverbs
noun.Tops: Unique beginner for nouns
noun.act: Nouns denoting acts or actions
noun.animal: Nouns denoting animals
noun.artifact: Nouns denoting man-made objects
noun.attribute: Nouns denoting attributes of people and objects
noun.body: Nouns denoting body parts
noun.cognition: Nouns denoting cognitive processes and contents
noun.communication: Nouns denoting communicative processes and contents
noun.event: Nouns denoting natural events
noun.feeling: Nouns denoting feelings and emotions
noun.food: Nouns denoting foods and drinks
noun.group: Nouns denoting groupings of people or objects
noun.location: Nouns denoting spatial position
noun.motive: Nouns denoting goals
noun.object: Nouns denoting natural objects (not man-made)
noun.person: Nouns denoting people
noun.phenomenon: Nouns denoting natural phenomena
noun.plant: Nouns denoting plants
noun.possession: Nouns denoting possession and transfer of possession
noun.process: Nouns denoting natural processes
noun.quantity: Nouns denoting quantities and units of measure
noun.relation: Nouns denoting relations between people or things or ideas
noun.shape: Nouns denoting two and three dimensional shapes
noun.state: Nouns denoting stable states of affairs
noun.substance: Nouns denoting substances
noun.time: Nouns denoting time and temporal relations
verb.body: Verbs of grooming, dressing and bodily care
verb.change: Verbs of size, temperature change, intensifying, etc.
verb.cognition: Verbs of thinking, judging, analyzing, doubting
verb.communication: Verbs of telling, asking, ordering, singing
verb.competition: Verbs of fighting, athletic activities
verb.consumption: Verbs of eating and drinking
verb.contact: Verbs of touching, hitting, tying, digging
verb.creation: Verbs of sewing, baking, painting, performing
verb.emotion: Verbs of feeling
verb.motion: Verbs of walking, flying, swimming
verb.perception: Verbs of seeing, hearing, feeling
verb.possession: Verbs of buying, selling, owning
verb.social: Verbs of political and social activities and events
verb.stative: Verbs of being, having, spatial relations
verb.weather: Verbs of raining, snowing, thawing, thundering

Table 5: A list of supersense categories from the WordNet lexicographer files.
