BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

Timo Schick
Sulzer GmbH
Munich, Germany
[email protected]

Hinrich Schütze
Center for Information and Language Processing
LMU Munich, Germany
[email protected]

Abstract

Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models. This is achieved by enabling the surface form and contexts of a word to interact with each other in a deep architecture. Integrating BERTRAM into BERT leads to large performance increases due to improved representations of rare and medium frequency words on both a rare word probing task and three downstream tasks.[1]

[1] Our implementation of BERTRAM is publicly available at https://github.com/timoschick/bertram.

1 Introduction

As word embedding algorithms (e.g. Mikolov et al., 2013) are known to struggle with rare words, several techniques for improving their representations have been proposed. These approaches exploit either the contexts in which rare words occur (Lazaridou et al., 2017; Herbelot and Baroni, 2017; Khodak et al., 2018; Liu et al., 2019a), their surface form (Luong et al., 2013; Bojanowski et al., 2017; Pinter et al., 2017), or both (Schick and Schütze, 2019a,b; Hautte et al., 2019). However, all of this prior work is designed for and evaluated on uncontextualized word embeddings.

Contextualized representations obtained from pretrained deep language models (e.g. Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019b) already handle rare words implicitly using methods such as byte-pair encoding (Sennrich et al., 2016), WordPiece embeddings (Wu et al., 2016) and character-level CNNs (Baevski et al., 2019). Nevertheless, Schick and Schütze (2020) recently showed that BERT's (Devlin et al., 2019) performance on a rare word probing task can be significantly improved by explicitly learning representations of rare words using Attentive Mimicking (AM) (Schick and Schütze, 2019a). However, AM is limited in two important respects:

• For processing contexts, it uses a simple bag-of-words model, making poor use of the available information.

• It combines form and context in a shallow fashion, preventing both input signals from interacting in a complex manner.

These limitations apply not only to AM, but to all previous work on obtaining representations for rare words by leveraging form and context. While using bag-of-words models is a reasonable choice for static embeddings, which are often themselves bag-of-words (e.g. Mikolov et al., 2013; Bojanowski et al., 2017), it stands to reason that they are not the best choice to generate input representations for position-aware, deep language models.

To overcome these limitations, we introduce BERTRAM (BERT for Attentive Mimicking), a novel architecture for learning rare word representations that combines a pretrained BERT model with AM. As shown in Figure 1, the learned rare word representations can then be used as an improved input representation for another BERT model. By giving BERTRAM access to both surface form and contexts starting at the lowest layer, a deep integration of both input signals becomes possible.

Figure 1: Top: Standard use of BERT. Bottom: Our proposal; first BERTRAM learns an embedding for "unicycle" that replaces the WordPiece sequence. BERT is then run on this improved input representation.

Assessing the effectiveness of methods like BERTRAM in a contextualized setting is challenging: While most previous work on rare words was evaluated on datasets explicitly focusing on rare words (e.g., Luong et al., 2013; Herbelot and Baroni, 2017; Khodak et al., 2018; Liu et al., 2019a), these datasets are tailored to uncontextualized embeddings and thus not suitable for evaluating our model. Furthermore, rare words are not well represented in commonly used downstream task datasets. We therefore introduce rarification, a procedure to automatically convert evaluation datasets into ones for which rare words are guaranteed to be important. This is achieved by replacing task-relevant frequent words with rare synonyms obtained using semantic resources such as WordNet (Miller, 1995). We rarify three common text (or text pair) classification datasets: MNLI (Williams et al., 2018), AG's News (Zhang et al., 2015) and DBPedia (Lehmann et al., 2015). BERTRAM outperforms previous work on four English datasets by a large margin: on the three rarified datasets and on WNLaMPro (Schick and Schütze, 2020).

In summary, our contributions are as follows:

• We introduce BERTRAM, a model that integrates BERT into Attentive Mimicking, enabling a deep integration of surface-form and contexts and much better representations for rare words.

• We devise rarification, a method that transforms evaluation datasets into ones for which rare words are guaranteed to be important.

• We show that adding BERTRAM to BERT achieves a new state-of-the-art on WNLaMPro (Schick and Schütze, 2020) and beats all baselines on rarified AG's News, MNLI and DBPedia, resulting in an absolute improvement of up to 25% over BERT.

2 Related Work

Surface-form information (e.g., morphemes, characters or character n-grams) is commonly used to improve word representations. For static word embeddings, this information can either be injected into a given embedding space (Luong et al., 2013; Pinter et al., 2017), or a model can directly be given access to it during training (Bojanowski et al., 2017; Salle and Villavicencio, 2018; Piktus et al., 2019). In the area of contextualized representations, many architectures employ subword segmentation methods (e.g. Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019b). Others use convolutional neural networks to directly access character-level information (Kim et al., 2016; Peters et al., 2018; Baevski et al., 2019).

Complementary to surface form, another useful source of information for understanding rare words are the contexts in which they occur (Lazaridou et al., 2017; Herbelot and Baroni, 2017; Khodak et al., 2018). Schick and Schütze (2019a,b) show that combining form and context leads to significantly better results than using just one of the two. While all of these methods are bag-of-words models, Liu et al. (2019a) recently proposed an architecture based on context2vec (Melamud et al., 2016). However, in contrast to our work, they (i) do not incorporate surface-form information and (ii) do not directly access the hidden states of context2vec, but instead simply use its output distribution.

Several datasets focus on rare words, e.g., Stanford Rare Word (Luong et al., 2013), Definitional Nonce (Herbelot and Baroni, 2017), and Contextual Rare Word (Khodak et al., 2018). However, unlike our rarified datasets, they are only suitable for evaluating uncontextualized word representations. Rarification is related to adversarial example generation (e.g. Ebrahimi et al., 2018), which manipulates the input to change a model's prediction. We use a similar mechanism to determine which words in a given sentence are most important and replace them with rare synonyms.

3 Model

3.1 Form-Context Model

We first review the basis for our new model, the form-context model (FCM) (Schick and Schütze, 2019b). Given a set of d-dimensional high-quality embeddings for frequent words, FCM induces embeddings for rare words that are appropriate for the given embedding space. This is done as follows: Given a word w and a context C in which it occurs, a surface-form embedding v^form_(w,C) ∈ R^d is obtained by averaging over embeddings of all character n-grams in w; the n-gram embeddings are learned during training. Similarly, a context embedding v^context_(w,C) ∈ R^d is obtained by averaging over the embeddings of all words in C. Finally, both embeddings are combined using a gate

g(v^form_(w,C), v^context_(w,C)) = σ(x^T [v^form_(w,C); v^context_(w,C)] + y)

with parameters x ∈ R^2d, y ∈ R and σ denoting the sigmoid function, allowing the model to decide how to weight surface-form and context. The final representation of w is then a weighted combination of form and context embeddings:

v_(w,C) = α · (A v^context_(w,C) + b) + (1 − α) · v^form_(w,C)

where α = g(v^form_(w,C), v^context_(w,C)) and A ∈ R^(d×d), b ∈ R^d are parameters learned during training.

The context part of FCM is able to capture the broad topic of rare words, but since it is a bag-of-words model, it is not capable of obtaining a more concrete or detailed understanding (see Schick and Schütze, 2019b). Furthermore, the simple gating mechanism results in only a shallow combination of form and context. That is, the model is not able to combine form and context until the very last step: While it can learn to weight form and context components, the two embeddings (form and context) do not share any information and thus do not influence each other.

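To make the combination step concrete, the following is a minimal PyTorch sketch of the FCM gating mechanism described above. It is an illustrative re-implementation, not the released FCM or BERTRAM code; the module name, tensor shapes and toy dimensionality are assumptions.

```python
import torch
import torch.nn as nn

class FormContextGate(nn.Module):
    """Sketch of the FCM combination step: a scalar gate decides how much weight
    the context embedding gets relative to the surface-form embedding."""

    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, 1)  # implements x^T [v_form; v_context] + y
        self.A = nn.Linear(d, d)         # implements A v_context + b

    def forward(self, v_form: torch.Tensor, v_context: torch.Tensor) -> torch.Tensor:
        # alpha = sigma(x^T [v_form; v_context] + y)
        alpha = torch.sigmoid(self.gate(torch.cat([v_form, v_context], dim=-1)))
        # v = alpha * (A v_context + b) + (1 - alpha) * v_form
        return alpha * self.A(v_context) + (1 - alpha) * v_form

# toy usage: combine a form and a context embedding of dimension 768
d = 768
combiner = FormContextGate(d)
v = combiner(torch.randn(1, d), torch.randn(1, d))
print(v.shape)  # torch.Size([1, 768])
```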
3.2 BERTRAM

To overcome these limitations, we introduce BERTRAM, a model that combines a pretrained BERT (Devlin et al., 2019) with Attentive Mimicking (Schick and Schütze, 2019a). We denote with e_t the (uncontextualized, i.e., first-layer) embedding assigned to a (wordpiece) token t by BERT. Given a sequence of such uncontextualized embeddings e = e_1, ..., e_n, we denote by h_j(e) the contextualized representation of the j-th token at the final layer when the model is given e as input.

Given a word w and a context C in which it occurs, let t = t_1, ..., t_m be the sequence obtained from C by (i) replacing w with a [MASK] token and (ii) tokenization (matching BERT's vocabulary); furthermore, let i denote the index for which t_i = [MASK]. We experiment with three variants of BERTRAM: BERTRAM-SHALLOW, BERTRAM-REPLACE and BERTRAM-ADD.[2]

SHALLOW. Perhaps the simplest approach for obtaining a context embedding from C using BERT is to define

v^context_(w,C) = h_i(e_{t_1}, ..., e_{t_m}) .

This approach aligns well with BERT's pretraining objective of predicting likely substitutes for [MASK] tokens from their contexts. The context embedding v^context_(w,C) is then combined with its form counterpart as in FCM. While this achieves our first goal of using a more sophisticated context model that goes beyond bag-of-words, it still only combines form and context in a shallow fashion.

REPLACE. Before computing the context embedding, we replace the uncontextualized embedding of the [MASK] token with the word's surface-form embedding:

v^context_(w,C) = h_i(e_{t_1}, ..., e_{t_{i−1}}, v^form_(w,C), e_{t_{i+1}}, ..., e_{t_m}) .

Our rationale for this is as follows: During regular BERT pretraining, words chosen for prediction are replaced with [MASK] tokens only 80% of the time and kept unchanged 10% of the time. Thus, standard pretrained BERT should be able to make use of form embeddings presented this way as they provide a strong signal with regard to what the "correct" embedding of w may look like.

ADD. Before computing the context embedding, we prepad the input with the surface-form embedding of w, followed by a colon (e_:):[3]

v^context_(w,C) = h_{i+2}(v^form_(w,C), e_:, e_{t_1}, ..., e_{t_m}) .

The intuition behind this third variant is that lexical definitions and explanations of a word w are occasionally prefixed by "w :" (e.g., in some online dictionaries). We assume that BERT has seen many definitional sentences of this kind during pretraining and is thus able to leverage surface-form information about w presented this way.

[2] We refer to these three BERTRAM configurations simply as SHALLOW, REPLACE and ADD.

[3] We experimented with other prefixes, but found that this variant is best capable of recovering w at the masked position.

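As an illustration of the ADD variant, the sketch below builds the modified embedding sequence with the Hugging Face Transformers API and reads off the contextualized vector at the masked position. This is a simplified sketch, not the released implementation; in particular, the surface-form embedding is faked with a random vector and special-token handling is kept minimal.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
embed = model.get_input_embeddings()  # uncontextualized (first-layer) token embeddings

def add_context_embedding(context: str, rare_word: str, v_form: torch.Tensor) -> torch.Tensor:
    """Sketch of ADD: replace the rare word by [MASK], insert <v_form> ':' right
    after [CLS], run BERT on the embedding sequence and read out the mask position."""
    masked = context.replace(rare_word, tokenizer.mask_token)
    input_ids = tokenizer(masked, return_tensors="pt")["input_ids"]
    token_embeds = embed(input_ids)                                    # (1, m, d)
    colon_embed = embed(torch.tensor([[tokenizer.convert_tokens_to_ids(":")]]))
    inputs_embeds = torch.cat(
        [token_embeds[:, :1], v_form.view(1, 1, -1), colon_embed, token_embeds[:, 1:]], dim=1
    )
    hidden = model(inputs_embeds=inputs_embeds).last_hidden_state      # final layer
    mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero().item() + 2  # shifted by the 2 inserted positions
    return hidden[0, mask_pos]                                         # v_context for (w, C)

v_form = torch.randn(768)  # stands in for the learned n-gram based surface-form embedding
print(add_context_embedding("other washables such as trousers", "washables", v_form).shape)
```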
For both REPLACE and ADD, surface-form information is directly and deeply integrated into the computation of the context embedding; thus, we do not require any gating mechanism and directly set v_(w,C) = A · v^context_(w,C) + b. Figure 2 (left) shows how a single context is processed using ADD.

Figure 2: Schematic representation of BERTRAM-ADD processing the input word w = "washables" given a single context C_1 = "other washables such as trousers ..." (left) and given multiple contexts C = {C_1, ..., C_m} (right)

To exploit multiple contexts of a word if available, we follow the approach of Schick and Schütze (2019a) and add an AM layer on top of our model; see Figure 2 (right). Given a set of contexts C = {C_1, ..., C_m} and the corresponding embeddings v_(w,C_1), ..., v_(w,C_m), AM applies a self-attention mechanism to all embeddings, allowing the model to distinguish informative from uninformative contexts. The final embedding v_(w,C) is then a weighted combination of all embeddings:

v_(w,C) = Σ_{i=1}^m ρ_i · v_(w,C_i)

where the self-attention layer determines the weights ρ_i subject to Σ_{i=1}^m ρ_i = 1. For further details, see Schick and Schütze (2019a).

3.3 Training

Like previous work, we use mimicking (Pinter et al., 2017) as a training objective. That is, given a frequent word w with known embedding e_w and a set of corresponding contexts C, BERTRAM is trained to minimize ‖e_w − v_(w,C)‖².

Training BERTRAM end-to-end is costly: the cost of processing a single training instance (w, C) with C = {C_1, ..., C_m} is the same as processing an entire batch of m examples in standard BERT. Therefore, we resort to the following three-stage training process:

1. We train only the context part, minimizing ‖e_w − A · (Σ_{i=1}^m ρ_i · v^context_(w,C_i)) + b‖² where ρ_i is the weight assigned to each context C_i through the AM layer. Regardless of the selected BERTRAM variant, the context embedding is always obtained using SHALLOW in this stage. Furthermore, only A, b and all parameters of the AM layer are optimized.

2. We train only the form part (i.e., only the n-gram embeddings); our loss for a single example (w, C) is ‖e_w − v^form_(w,C)‖². Training in this stage is completely detached from the underlying BERT model.

3. In the third stage, we combine the pretrained form-only and context-only models and train all parameters. The first two stages are only run once and then used for all three BERTRAM variants because context and form are trained in isolation. The third stage must be run for each variant separately.

We freeze all of BERT's parameters during training as we – somewhat surprisingly – found that this slightly improves the model's performance while speeding up training. For ADD, we additionally found it helpful to freeze the form part in the third training stage. Importantly, for the first two stages of our training procedure, we do not have to backpropagate through BERT to obtain all required gradients, drastically increasing the training speed.

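The following sketch illustrates how context embeddings can be pooled with attention weights and trained with a stage-1 style mimicking loss. The attention parametrization here is deliberately simplified (a single scoring layer rather than the actual AM self-attention layer of Schick and Schütze, 2019a), so it should be read as a conceptual illustration only.

```python
import torch
import torch.nn as nn

class AttentiveCombination(nn.Module):
    """Sketch: score each context embedding, normalize the scores to weights rho_i
    (summing to 1) and return the weighted sum of all context embeddings."""

    def __init__(self, d: int):
        super().__init__()
        self.scorer = nn.Linear(d, 1)

    def forward(self, context_embeds: torch.Tensor) -> torch.Tensor:
        # context_embeds: (m, d) -- one embedding per context C_1, ..., C_m
        rho = torch.softmax(self.scorer(context_embeds).squeeze(-1), dim=0)  # (m,)
        return (rho.unsqueeze(-1) * context_embeds).sum(dim=0)               # (d,)

d = 768
am = AttentiveCombination(d)
A = nn.Linear(d, d)                 # stands in for the A, b parameters trained in stage 1

context_embeds = torch.randn(5, d)  # SHALLOW context embeddings for 5 contexts of w
e_w = torch.randn(d)                # known high-quality embedding of the frequent word w

# stage-1 style mimicking loss: squared distance between e_w and the combined embedding
v = A(am(context_embeds))
loss = ((e_w - v) ** 2).sum()
loss.backward()
```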
4 Dataset Rarification

The ideal dataset for measuring the quality of rare word representations would be one for which the accuracy of a model with no understanding of rare words is 0% whereas the accuracy of a model that perfectly understands rare words is 100%. Unfortunately, existing datasets do not satisfy this desideratum, not least because rare words – by their nature – occur rarely.

This does not mean that rare words are not important: As we shift our focus in NLP from words and sentences as the main unit of processing to larger units like paragraphs and documents, rare words will occur in a high proportion of such larger "evaluation units". Rare words are also clearly a hallmark of human language competence, which should be the ultimate goal of NLP. Our work is part of a trend that sees a need for evaluation tasks in NLP that are more ambitious than what we have now.[4]

[4] Cf. (Bowman, 2019): "If we want to be able to establish fair benchmarks that encourage future progress toward robust, human-like language understanding, we'll need to get better at creating clean, challenging, and realistic test datasets."

To create more challenging datasets, we use rarification, a procedure that automatically transforms existing text classification datasets in such a way that rare words become important. We require a pretrained language model M as a baseline, an arbitrary text classification dataset D containing labeled instances (x, y) and a substitution dictionary S, mapping each word w to a set of rare synonyms S(w). Given these ingredients, our procedure consists of three steps: (i) splitting the dataset into a train set and a set of test candidates, (ii) training the baseline model on the train set and (iii) modifying a subset of the test candidates to generate the final test set.

Dataset Splitting. We partition D into a training set D_train and a set of test candidates, D_cand. D_cand contains all instances (x, y) ∈ D such that for at least one word w in x, S(w) ≠ ∅ – subject to the constraint that the training set contains at least one third of the entire data.

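A minimal sketch of this splitting step is shown below. It is our own illustration under the assumption that D is a list of (text, label) pairs and S maps words to sets of rare synonyms; the real procedure operates on tokenized datasets.

```python
def split_dataset(D, S, min_train_fraction=1 / 3):
    """Put every instance containing at least one substitutable word into the candidate
    pool, but keep at least a third of the data for training the baseline model."""
    train, candidates = [], []
    for x, y in D:
        if any(S.get(w) for w in x.split()):
            candidates.append((x, y))
        else:
            train.append((x, y))
    # enforce the constraint that the train set covers at least min_train_fraction of D
    while len(train) < min_train_fraction * len(D):
        train.append(candidates.pop())
    return train, candidates

D = [("he rides a unicycle", 1), ("the weather is nice", 0), ("she plays the accordion", 1)]
S = {"unicycle": {"monocycle"}, "accordion": {"squeezebox"}}
train_set, test_candidates = split_dataset(D, S)
```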
Baseline Training. We finetune M on D_train. Let (x, y) ∈ D_train where x = w_1, ..., w_n is a sequence of words. We deviate from the finetuning procedure of Devlin et al. (2019) in three respects:

• We randomly replace 5% of all words in x with a [MASK] token. This allows the model to cope with missing or unknown words, a prerequisite for our final test set generation.

• As an alternative to overwriting the language model's uncontextualized embeddings for rare words, we also want to allow models to add an alternative representation during test time, in which case we simply separate both representations by a slash (cf. §5.3). To accustom the language model to this duplication of words, we replace each word w_i with "w_i / w_i" with a probability of 10%. To make sure that the model does not simply learn to always focus on the first instance during training, we randomly mask each of the two repetitions with probability 25%.

• We do not finetune the model's embedding layer. We found that this does not hurt performance, an observation in line with recent findings of Lee et al. (2019).

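For illustration, the input perturbations from the first two bullets could be implemented as in the following sketch. It is not the exact preprocessing code; the probabilities are taken from the description above and the word-level treatment is simplified.

```python
import random

def perturb_for_finetuning(words, mask_token="[MASK]",
                           p_mask=0.05, p_duplicate=0.10, p_mask_repetition=0.25):
    """Randomly mask words, and sometimes duplicate a word as 'w / w' so that the model
    gets used to seeing two representations of the same word separated by a slash."""
    out = []
    for w in words:
        if random.random() < p_mask:
            out.append(mask_token)
        elif random.random() < p_duplicate:
            first = mask_token if random.random() < p_mask_repetition else w
            second = mask_token if random.random() < p_mask_repetition else w
            out.append(f"{first} / {second}")
        else:
            out.append(w)
    return " ".join(out)

print(perturb_for_finetuning("i think i will go finish up my laundry".split()))
```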
Test Set Generation. Let p(y | x) be the probability that the finetuned model M assigns to class y given input x, and M(x) = argmax_{y∈Y} p(y | x) be the model's prediction for input x where Y denotes the set of all labels. For generating our test set, we only consider candidates that are classified correctly by the baseline model, i.e., candidates (x, y) ∈ D_cand with M(x) = y. For each such entry, let x = w_1, ..., w_n and let x_{w_i=t} be the sequence obtained from x by replacing w_i with t. We compute

w_i = argmin_{w_j : S(w_j) ≠ ∅} p(y | x_{w_j=[MASK]}) ,

i.e., we select the word w_i whose masking pushes the model's prediction the farthest away from the correct label. If removing this word already changes the model's prediction – that is, M(x_{w_i=[MASK]}) ≠ y –, we select a random rare synonym ŵ_i ∈ S(w_i) and add (x_{w_i=ŵ_i}, y) to the test set. Otherwise, we repeat the above procedure; if the label still has not changed after masking up to 5 words, we discard the candidate. Each instance (x_{w_{i_1}=ŵ_{i_1}, ..., w_{i_k}=ŵ_{i_k}}, y) of the resulting test set has the following properties:

• If each w_{i_j} is replaced by [MASK], the entry is classified incorrectly by M. In other words, understanding the words w_{i_j} is necessary for M to determine the correct label.

• If the model's internal representation of each ŵ_{i_j} is sufficiently similar to its representation of w_{i_j}, the entry is classified correctly by M. That is, if the model is able to understand the rare words ŵ_{i_j} and to identify them as synonyms of w_{i_j}, it will predict the correct label.

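The greedy selection described above can be sketched as follows. This is illustrative Python, not the released rarification script; `predict_proba` is an assumed helper that returns the finetuned model's class probabilities as a list.

```python
import random

def rarify_instance(words, label, S, predict_proba, mask_token="[MASK]", max_masked=5):
    """Repeatedly mask the substitutable word that hurts the correct class most;
    once the prediction flips, swap all masked words for random rare synonyms."""
    words = list(words)
    masked_positions = []
    for _ in range(max_masked):
        candidates = [i for i, w in enumerate(words) if S.get(w) and i not in masked_positions]
        if not candidates:
            return None
        # w_i = argmin_{w_j : S(w_j) != {}} p(y | x with w_j (and earlier picks) masked)
        def score(i):
            trial = [mask_token if j in masked_positions or j == i else w for j, w in enumerate(words)]
            return predict_proba(" ".join(trial))[label]
        i = min(candidates, key=score)
        masked_positions.append(i)
        trial = [mask_token if j in masked_positions else w for j, w in enumerate(words)]
        probs = predict_proba(" ".join(trial))
        if max(range(len(probs)), key=probs.__getitem__) != label:
            # prediction flipped: replace every masked word with a rare synonym
            for j in masked_positions:
                words[j] = random.choice(sorted(S[words[j]]))
            return " ".join(words), label
    return None  # label never changed after max_masked maskings; discard the candidate

# toy usage with a dummy "model" that dislikes masked inputs for class 1
def dummy_predict_proba(text):
    return [0.8, 0.2] if "[MASK]" in text else [0.1, 0.9]

print(rarify_instance("he rides a unicycle".split(), 1, {"unicycle": {"monocycle"}}, dummy_predict_proba))
```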
Note that the test set is closely coupled to the baseline model M because we select the words to be replaced based on M's predictions. Importantly, however, the model is never queried with any rare synonym during test set generation, so its representations of rare words are not taken into account for creating the test set. Thus, while the test set is not suitable for comparing M with an entirely different model M′, it allows us to compare various strategies for representing rare words in the embedding space of M. Definitional Nonce (Herbelot and Baroni, 2017) is subject to a similar constraint: it is tied to a specific (uncontextualized) embedding space based on word2vec (Mikolov et al., 2013).

5 Evaluation

5.1 Setup

For our evaluation of BERTRAM, we follow the experimental setup of Schick and Schütze (2020). We experiment with integrating BERTRAM both into BERT (base) and RoBERTa (large) (Liu et al., 2019b). Throughout our experiments, when BERTRAM is used to provide input representations for one of the two models, we use the same model as BERTRAM's underlying language model. Further training specifications can be found in Appendix A.

While BERT was trained on BookCorpus (Zhu et al., 2015) and a large Wikipedia dump, we follow previous work and train BERTRAM only on the much smaller Westbury Wikipedia Corpus (WWC) (Shaoul and Westbury, 2010); this of course gives BERT a clear advantage over BERTRAM. This advantage is even more pronounced when comparing BERTRAM with RoBERTa, which is trained on a corpus that is an order of magnitude larger than the original BERT corpus. We try to at least partially compensate for this as follows: In our downstream task experiments, we gather the set of contexts C for each word from WWC+BookCorpus during inference.[5]

[5] We recreate BookCorpus with the script at github.com/soskek/bookcorpus. We refer to the joined corpus of WWC and BookCorpus as WWC+BookCorpus.

Table 1: MRR on WNLaMPro test for baseline models and various BERTRAM configurations. Best results per base model are underlined, results that do not differ significantly from the best results in a paired t-test (p < 0.05) are bold.

Model                               RARE    MEDIUM
BERT (base)                         0.112   0.234
+ AM (Schick and Schütze, 2020)     0.251   0.267
+ BERTRAM-SHALLOW                   0.250   0.246
+ BERTRAM-REPLACE                   0.155   0.216
+ BERTRAM-ADD                       0.269   0.367
BERT (large)                        0.143   0.264
RoBERTa (large)                     0.270   0.275
+ BERTRAM-ADD                       0.306   0.323

Table 2: Examples from rarified datasets. Each replaced word is shown together with its rare replacement (replaced word → replacement).

Task      Entry
MNLI      i think i will go finish up my laundry → washables.
AG's      [. . . ] stake will improve → meliorate symantec's consulting contacts [. . . ]
DBPedia   yukijiro hotaru [. . . ] is a japanese → nipponese actor → histrion.
MNLI      a smart person is often → ofttimes correct in their answers → ansers.
MNLI      the southwest has a lot of farming and vineyards → vineries that make excellent → fantabulous merlot.

5.2 WNLaMPro

We evaluate BERTRAM on the WNLaMPro dataset (Schick and Schütze, 2020). This dataset consists of cloze-style phrases like "A lingonberry is a ___." and the task is to correctly fill the slot (___) with one of several acceptable target words (e.g., "fruit", "bush" or "berry"), which requires understanding of the meaning of the phrase's keyword ("lingonberry" in the example). As the goal of this dataset is to probe a language model's ability to understand rare words without any task-specific finetuning, Schick and Schütze (2020) do not provide a training set. The dataset is partitioned into three subsets based on the keyword's frequency in WWC: RARE (occurring fewer than 10 times), MEDIUM (occurring between 10 and 100 times), and FREQUENT (all remaining words).

For our evaluation, we compare the performance of a standalone BERT (or RoBERTa) model with one that uses BERTRAM as shown in Figure 1 (bottom). As our focus is to improve representations for rare words, we evaluate our model only on WNLaMPro RARE and MEDIUM. Table 1 gives results; our measure is mean reciprocal rank (MRR). We see that supplementing BERT with any of the proposed methods results in noticeable improvements for the RARE subset, with ADD clearly outperforming SHALLOW and REPLACE. Moreover, ADD performs surprisingly well for more frequent words, improving the score for WNLaMPro-MEDIUM by 58% compared to BERT (base) and 37% compared to Attentive Mimicking. This makes sense considering that the key enhancement of BERTRAM over AM lies in improving context representations and interconnection of form and context; the more contexts are given, the more this comes into play. Noticeably, despite being both based on and integrated into a BERT (base) model, our architecture even outperforms BERT (large) by a large margin. While RoBERTa performs much better than BERT on WNLaMPro, BERTRAM still significantly improves results for both rare and medium frequency words. As it performs best for both the RARE and MEDIUM subset, we always use the ADD configuration of BERTRAM in the following experiments.

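For reference, mean reciprocal rank over such cloze-style queries can be computed as in the following sketch (our own helper, not the WNLaMPro evaluation script): each query comes with the model's ranked list of predicted slot fillers and a set of acceptable target words.

```python
def mean_reciprocal_rank(ranked_predictions, acceptable_targets):
    """ranked_predictions: one ranked word list per cloze phrase;
    acceptable_targets: one set of words counted as correct per cloze phrase."""
    total = 0.0
    for ranking, targets in zip(ranked_predictions, acceptable_targets):
        rank = next((i + 1 for i, word in enumerate(ranking) if word in targets), None)
        total += 1.0 / rank if rank is not None else 0.0
    return total / len(ranked_predictions)

print(mean_reciprocal_rank(
    [["plant", "fruit", "tree"], ["berry", "car"]],
    [{"fruit", "bush", "berry"}, {"fruit", "bush", "berry"}],
))  # (1/2 + 1/1) / 2 = 0.75
```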
Model                                MNLI               AG's News          DBPedia
                                     All   Msp   WN     All   Msp   WN     All   Msp   WN
BERT (base)                          50.5  49.1  53.4   56.5  54.8  61.9   49.3  46.0  57.6
+ Mimick (Pinter et al., 2017)       37.2  38.2  38.7   45.3  43.9  50.5   36.5  35.8  41.1
+ A La Carte (Khodak et al., 2018)   44.6  45.7  46.1   52.4  53.7  56.1   51.1  48.7  59.3
+ AM (Schick and Schütze, 2020)      50.9  50.7  53.6   58.9  59.8  62.6   60.7  63.1  62.8
+ BERTRAM                            53.3  52.5  55.6   62.1  63.1  65.3   64.2  67.9  64.1
+ BERTRAM-SLASH                      56.4  55.3  58.6   62.9  63.3  65.3   65.7  67.3  67.2
+ BERTRAM-SLASH + INDOMAIN           59.8  57.3  62.7   62.5  62.1  66.6   74.2  74.8  76.7
RoBERTa (large)                      67.3  68.7  68.4   63.7  68.1  65.7   65.5  67.3  66.6
+ BERTRAM-SLASH                      70.1  71.5  70.9   64.6  68.4  64.9   71.9  73.8  73.9
+ BERTRAM-SLASH + INDOMAIN           71.7  71.9  73.2   68.1  71.9  69.0   76.0  78.8  77.3
Table 3: Accuracy of standalone BERT and RoBERTa, various baselines and BERTRAM on rarified MNLI, AG’s News and DBPedia. The five BERTRAM instances are BERTRAM-ADD. Best results per baseline model are underlined, results that do not differ significantly from the best results in a two-sided binomial test (p < 0.05) are bold. Msp/WN: subset of instances containing at least one misspelling/synonym. All: all instances.

5.3 Downstream Task Datasets

To measure the effect of adding BERTRAM to a pretrained deep language model on downstream tasks, we rarify (cf. §4) the following three datasets:

• MNLI (Williams et al., 2018), a natural language inference dataset where given two sentences a and b, the task is to decide whether a entails b, a and b contradict each other or neither;

• AG's News (Zhang et al., 2015), a news classification dataset with four different categories (world, sports, business and science/tech);

• DBPedia (Lehmann et al., 2015), an ontology dataset with 14 classes (e.g., company, artist) that have to be identified from text snippets.

For all three datasets, we create rarified instances both using BERT (base) and RoBERTa (large) as a baseline model and build the substitution dictionary S using the synonym relation of WordNet (Miller, 1995) and the pattern library (Smedt and Daelemans, 2012) to make sure that all synonyms have consistent parts of speech. Furthermore, we only consider synonyms for each word's most frequent sense; this filters out much noise and improves the quality of the created sentences. In addition to WordNet, we use the misspelling dataset of Piktus et al. (2019). To prevent misspellings from dominating the resulting datasets, we only assign misspelling-based substitutes to randomly selected 10% of the words contained in each sentence. Motivated by the results on WNLaMPro-MEDIUM, we consider every word that occurs less than 100 times in WWC+BookCorpus as being rare. Example entries from the rarified datasets obtained using BERT (base) as a baseline model can be seen in Table 2. The average number of words replaced with synonyms or misspellings is 1.38, 1.82 and 2.34 for MNLI, AG's News and DBPedia, respectively.

Our default way of injecting BERTRAM embeddings into the baseline model is to replace the sequence of uncontextualized subword token embeddings for a given rare word with its BERTRAM-based embedding (Figure 1, bottom). That is, given a sequence of uncontextualized token embeddings e = e_1, ..., e_n where e_i, ..., e_j with 1 ≤ i ≤ j ≤ n is the sequence of embeddings for a single rare word w with BERTRAM-based embedding v_(w,C), we replace e with

e′ = e_1, ..., e_{i−1}, v_(w,C), e_{j+1}, ..., e_n .

As an alternative to replacing the original sequence of subword embeddings for a given rare word, we also consider BERTRAM-SLASH, a configuration where the BERTRAM-based embedding is simply added and both representations are separated using a single slash:

e_SLASH = e_1, ..., e_j, e_/, v_(w,C), e_{j+1}, ..., e_n .

The intuition behind this variant is that in BERT's pretraining corpus, a slash is often used to separate two variants of the same word (e.g., "useable / usable") or two closely related concepts (e.g., "company / organization", "web-based / cloud") and thus, BERT should be able to understand that both e_i, ..., e_j and v_(w,C) refer to the same entity. We therefore surmise that whenever some information is encoded in one representation but not in the other, giving BERT both representations is helpful.

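The two injection strategies can be sketched as follows, operating directly on BERT's uncontextualized embedding sequence via the Transformers API. This is again an illustrative sketch rather than the released implementation; the BERTRAM embedding is faked with a random vector and [CLS]/[SEP] tokens are omitted for brevity.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
embed = model.get_input_embeddings()

def inject(sentence: str, rare_word: str, v_bertram: torch.Tensor, slash: bool = False):
    """Replace the rare word's WordPiece embeddings e_i..e_j with v_bertram,
    or (SLASH variant) keep them and append '/' followed by v_bertram."""
    tokens = tokenizer.tokenize(sentence)
    pieces = tokenizer.tokenize(rare_word)
    i = next(k for k in range(len(tokens)) if tokens[k:k + len(pieces)] == pieces)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    e = embed(torch.tensor(ids))                       # (n, d), no special tokens for brevity
    v = v_bertram.view(1, -1)
    if slash:
        slash_e = embed(torch.tensor([tokenizer.convert_tokens_to_ids("/")]))
        e_new = torch.cat([e[: i + len(pieces)], slash_e, v, e[i + len(pieces):]])
    else:
        e_new = torch.cat([e[:i], v, e[i + len(pieces):]])
    return model(inputs_embeds=e_new.unsqueeze(0)).last_hidden_state

v = torch.randn(768)  # stands in for the BERTRAM embedding of "unicycle"
print(inject("riding a unicycle is hard", "unicycle", v, slash=True).shape)
```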
By default, the set of contexts C for each word is obtained by collecting all sentences from WWC+BookCorpus in which it occurs. We also try a variant where we add in-domain contexts by giving BERTRAM access to all texts (but not labels) found in the test set; we refer to this variant as INDOMAIN.[6] Our motivation for including this variant is as follows: Moving from the training stage of a model to its production use often causes a slight domain shift. This in turn leads to an increased number of input sentences containing words that did not – or only very rarely – appear in the training data. However, such input sentences can easily be collected as additional unlabeled examples during production use. While there is no straightforward way to leverage these unlabeled examples with an already finetuned BERT model, BERTRAM can easily make use of them without requiring any labels or any further training: They can simply be included as additional contexts during inference. As this gives BERTRAM a slight advantage, we also report results for all configurations without using indomain data. Importantly, adding indomain data increases the number of contexts for more than 90% of all rare words by at most 3, meaning that they can still be considered rare despite the additional indomain contexts.

[6] For the MNLI dataset, which consists of text pairs (a, b), we treat a and b as separate contexts.

Table 3 reports, for each task, the accuracy on the entire dataset (All) as well as scores obtained considering only instances where at least one word was replaced by a misspelling (Msp) or a WordNet synonym (WN), respectively.[7] Consistent with results on WNLaMPro, combining BERT with BERTRAM consistently outperforms both a standalone BERT model and one combined with various baseline models. Using the SLASH variant brings improvements across all datasets as does adding INDOMAIN contexts (exception: BERT/AG's News). This makes sense considering that for a rare word, every single additional context can be crucial for gaining a deeper understanding. Correspondingly, it is not surprising that the benefit of adding BERTRAM to RoBERTa is less pronounced, because BERTRAM uses only a fraction of the contexts available to RoBERTa during pretraining. Nonetheless, adding BERTRAM significantly improves RoBERTa's accuracy for all three datasets both with and without adding INDOMAIN contexts.

[7] Note that results for BERT and RoBERTa are only loosely comparable because the datasets generated from both baseline models through rarification are different.

To further understand for which words using BERTRAM is helpful, Figure 3 looks at the accuracy of BERT (base) both with and without BERTRAM as a function of word frequency. That is, we compute the accuracy scores for both models when considering only entries (x_{w_{i_1}=ŵ_{i_1}, ..., w_{i_k}=ŵ_{i_k}}, y) where each substituted word ŵ_{i_j} occurs less than c_max times in WWC+BookCorpus, for different values of c_max. As one would expect, c_max is positively correlated with the accuracies of both models, showing that the rarer a word is, the harder it is to understand. Interestingly, the gap between standalone BERT and BERT with BERTRAM remains more or less constant regardless of c_max. This suggests that using BERTRAM may even be helpful for more frequent words.

Figure 3: BERT vs. BERT combined with BERTRAM-SLASH (BERT+BSL) on three downstream tasks for varying maximum numbers of contexts c_max

To investigate this hypothesis, we perform another rarification of MNLI that differs from the previous rarification in two respects. First, we increase the threshold for a word to count as rare from 100 to 1000. Second, as this means that we have more WordNet synonyms available, we do not use the misspelling dictionary (Piktus et al., 2019) for substitution. We refer to the resulting datasets for BERT (base) and RoBERTa (large) as MNLI-1000.

Figure 4 shows results on MNLI-1000 for various rare word frequency ranges. For each value [c_0, c_1) on the x-axis, the y-axis shows improvement in accuracy compared to standalone BERT or RoBERTa when only dataset entries are considered for which each rarified word occurs between c_0 (inclusively) and c_1 (exclusively) times in WWC+BookCorpus. We see that for words with frequency less than 125, the improvement in accuracy remains similar even without using misspellings as another source of substitutions. Interestingly, for every single interval of rare word counts considered, adding BERTRAM-SLASH to BERT considerably improves its accuracy. For RoBERTa, adding BERTRAM brings improvements only for words occurring less than 500 times. While using INDOMAIN data is beneficial for rare words – simply because it gives us additional contexts for these words –, when considering only words that occur at least 250 times in WWC+BookCorpus, adding INDOMAIN contexts does not help.

Figure 4: Improvements for BERT (base) and RoBERTa (large) when adding BERTRAM-SLASH (+BSL) or BERTRAM-SLASH + INDOMAIN (+BSL+ID) on MNLI-1000

6 Conclusion

We have introduced BERTRAM, a novel architecture for inducing high-quality representations for rare words in BERT's and RoBERTa's embedding spaces. This is achieved by employing a powerful pretrained language model and deeply integrating surface-form and context information. By replacing important words with rare synonyms, we created downstream task datasets that are more challenging and support the evaluation of NLP models on the task of understanding rare words, a capability that human speakers have. On all of these datasets, BERTRAM improves over standard BERT and RoBERTa, demonstrating the usefulness of our method.

Our analysis showed that BERTRAM is beneficial not only for rare words (our main target in this paper), but also for frequent words. In future work, we want to investigate BERTRAM's potential benefits for such frequent words. Furthermore, it would be interesting to explore more complex ways of incorporating surface-form information – e.g., by using a character-level CNN similar to the one of Kim et al. (2016) – to balance out the potency of BERTRAM's form and context parts.

Acknowledgments

This work was funded by the European Research Council (ERC #740516). We would like to thank the anonymous reviewers for their helpful comments.
References

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5359–5368, Hong Kong, China. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Sam Bowman. 2019. T5 explores the limits of transfer learning.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.

Jeroen Van Hautte, Guy Emerson, and Marek Rei. 2019. Bad form: Comparing context-based and form-based few-shot learning in distributional semantic models. Computing Research Repository, arXiv:1910.00275.

Aurélie Herbelot and Marco Baroni. 2017. High-risk learning: acquiring new word vectors from tiny data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 304–309. Association for Computational Linguistics.

Mikhail Khodak, Nikunj Saunshi, Yingyu Liang, Tengyu Ma, Brandon Stewart, and Sanjeev Arora. 2018. A la carte embedding: Cheap but effective induction of semantic feature vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22. Association for Computational Linguistics.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2741–2749. AAAI Press.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR).

Angeliki Lazaridou, Marco Marelli, and Marco Baroni. 2017. Multimodal word meaning induction from minimal exposure to natural text. Cognitive Science, 41(S4):677–705.

Jaejun Lee, Raphael Tang, and Jimmy Lin. 2019. What would Elsa do? Freezing layers during transformer fine-tuning. Computing Research Repository, arXiv:1911.03090.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 6(2):167–195.

Qianchu Liu, Diana McCarthy, and Anna Korhonen. 2019a. Second-order contexts from lexical substitutes for few-shot learning of word representations. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 61–67, Minneapolis, Minnesota. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria. Association for Computational Linguistics.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61, Berlin, Germany. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Computing Research Repository, arXiv:1301.3781.

George A. Miller. 1995. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Aleksandra Piktus, Necati Bora Edizel, Piotr Bojanowski, Edouard Grave, Rui Ferreira, and Fabrizio Silvestri. 2019. Misspelling oblivious word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3226–3234, Minneapolis, Minnesota. Association for Computational Linguistics.

Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. 2017. Mimicking word embeddings using subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 102–112. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alexandre Salle and Aline Villavicencio. 2018. Incorporating subword information into matrix factorization word embeddings. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 66–71, New Orleans. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2019a. Attentive mimicking: Better word embeddings by attending to informative contexts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 489–494, Minneapolis, Minnesota. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2019b. Learning semantic representations for novel words: Leveraging both form and context. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence.

Timo Schick and Hinrich Schütze. 2020. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Cyrus Shaoul and Chris Westbury. 2010. The westbury lab wikipedia corpus.

Tom De Smedt and Walter Daelemans. 2012. Pattern for python. Journal of Machine Learning Research, 13(Jun):2063–2067.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Transformers: State-of-the-art natural language processing.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. Computing Research Repository, arXiv:1609.08144.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Computing Research Repository, arXiv:1906.08237.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.

A Training Details

Our implementation of BERTRAM is based on PyTorch (Paszke et al., 2017) and the Transformers library (Wolf et al., 2019). To obtain target embeddings for frequent multi-token words (i.e., words that occur at least 100 times in WWC+BookCorpus) during training, we use one-token approximation (OTA) (Schick and Schütze, 2020). For RoBERTa (large), we found increasing the number of iterations per word from 4,000 to 8,000 to produce better OTA embeddings using the same evaluation setup as Schick and Schütze (2020). For all stages of training, we use Adam (Kingma and Ba, 2015) as optimizer.

Context-Only Training. During the first stage of our training process, we train BERTRAM with a maximum sequence length of 96 and a batch size of 48 contexts for BERT (base) and 24 contexts for RoBERTa (large). These parameters are chosen such that a batch fits on a single Nvidia GeForce GTX 1080Ti. Each context in a batch is mapped to a word w from the set of training words, and each batch contains at least 4 and at most 32 contexts per word. For BERT (base) and RoBERTa (large), we pretrain the context part for 5 and 3 epochs, respectively. We use a maximum learning rate of 5 · 10^−5 and perform linear warmup for the first 10% of training examples, after which the learning rate is linearly decayed.

Form-Only Training. In the second stage of our training process, we use the same parameters as Schick and Schütze (2020), as our form-only model is the very same as theirs. That is, we use a learning rate of 0.01, a batch size of 64 words and we apply n-gram dropout with a probability of 10%. We pretrain the form-only part for 20 epochs.

Combined Training. For the final stage, we use the same training configuration as for context-only training, but we keep n-gram dropout from the form-only stage. We perform combined training for 3 epochs. For ADD, when using RoBERTa as an underlying language model, we do not just prepad the input with the surface-form embedding followed by a colon, but additionally wrap the surface-form embedding in double quotes. That is, we prepad the input with e_", v^form_(w,C), e_", e_:. We found this to perform slightly better in preliminary experiments with some toy examples.

B Evaluation Details

WNLaMPro. In order to ensure comparability with results of Schick and Schütze (2020), we use only WWC to obtain contexts for WNLaMPro keywords.

Rarified Datasets. To obtain rarified instances of MNLI, AG's News and DBPedia, we train BERT (base) and RoBERTa (large) on each task's training set for 3 epochs. We use a batch size of 32, a maximum sequence length of 128 and a weight decay factor of 0.01. For BERT, we perform linear warmup for the first 10% of training examples and use a maximum learning rate of 5 · 10^−5. After reaching its peak value, the learning rate is linearly decayed. For RoBERTa, we found training to be unstable with these parameters, so we chose a lower learning rate of 1 · 10^−5 and performed linear warmup for the first 10,000 training steps.

To obtain results for our baselines on the rarified datasets, we use the original Mimick implementation of Pinter et al. (2017), the A La Carte implementation of Khodak et al. (2018) and the Attentive Mimicking implementation of Schick and Schütze (2019a) with their default hyperparameter settings. As A La Carte can only be used for words with at least one context, we keep the original BERT embeddings whenever no such context is available.

While using BERTRAM allows us to completely remove the original BERT embeddings for all rare words and still obtain improvements in accuracy on all three rarified downstream tasks, the same is not true for RoBERTa, where removing the original sequence of subword token embeddings for a given rare word (i.e., not using the SLASH variant) hurts performance with accuracy dropping by 5.6, 7.4 and 2.1 points for MNLI, AG's News and DBPedia, respectively. We believe this to be due to the vast amount of additional contexts for rare words in RoBERTa's training set that are not available to BERTRAM.