BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

Timo Schick
Sulzer GmbH
Munich, Germany
[email protected]

Hinrich Schütze
Center for Information and Language Processing
LMU Munich, Germany
[email protected]

Abstract

Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models. This is achieved by enabling the surface form and contexts of a word to interact with each other in a deep architecture. Integrating BERTRAM into BERT leads to large performance increases due to improved representations of rare and medium frequency words on both a rare word probing task and three downstream tasks.[1]

[1] Our implementation of BERTRAM is publicly available at https://github.com/timoschick/bertram.

1 Introduction

As word embedding algorithms (e.g. Mikolov et al., 2013) are known to struggle with rare words, several techniques for improving their representations have been proposed. These approaches exploit either the contexts in which rare words occur (Lazaridou et al., 2017; Herbelot and Baroni, 2017; Khodak et al., 2018; Liu et al., 2019a), their surface form (Luong et al., 2013; Bojanowski et al., 2017; Pinter et al., 2017), or both (Schick and Schütze, 2019a,b; Hautte et al., 2019). However, all of this prior work is designed for and evaluated on uncontextualized word embeddings.

Contextualized representations obtained from pretrained deep language models (e.g. Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019b) already handle rare words implicitly using methods such as byte-pair encoding (Sennrich et al., 2016), WordPiece embeddings (Wu et al., 2016) and character-level CNNs (Baevski et al., 2019). Nevertheless, Schick and Schütze (2020) recently showed that BERT's (Devlin et al., 2019) performance on a rare word probing task can be significantly improved by explicitly learning representations of rare words using Attentive Mimicking (AM) (Schick and Schütze, 2019a). However, AM is limited in two important respects:

• For processing contexts, it uses a simple bag-of-words model, making poor use of the available information.

• It combines form and context in a shallow fashion, preventing both input signals from interacting in a complex manner.

These limitations apply not only to AM, but to all previous work on obtaining representations for rare words by leveraging form and context. While using bag-of-words models is a reasonable choice for static embeddings, which are often themselves bag-of-words (e.g. Mikolov et al., 2013; Bojanowski et al., 2017), it stands to reason that they are not the best choice to generate input representations for position-aware, deep language models.

To overcome these limitations, we introduce BERTRAM (BERT for Attentive Mimicking), a novel architecture for learning rare word representations that combines a pretrained BERT model with AM. As shown in Figure 1, the learned rare word representations can then be used as an improved input representation for another BERT model. By giving BERTRAM access to both surface form and contexts starting at the lowest layer, a deep integration of both input signals becomes possible.

Figure 1: Top: Standard use of BERT. Bottom: Our proposal; first BERTRAM learns an embedding for "unicycle" that replaces the WordPiece sequence. BERT is then run on this improved input representation.

Assessing the effectiveness of methods like BERTRAM in a contextualized setting is challenging: While most previous work on rare words was evaluated on datasets explicitly focusing on rare words (e.g., Luong et al., 2013; Herbelot and Baroni, 2017; Khodak et al., 2018; Liu et al., 2019a), these datasets are tailored to uncontextualized embeddings and thus not suitable for evaluating our model. Furthermore, rare words are not well represented in commonly used downstream task datasets. We therefore introduce rarification, a procedure to automatically convert evaluation datasets into ones for which rare words are guaranteed to be important. This is achieved by replacing task-relevant frequent words with rare synonyms obtained using semantic resources such as WordNet (Miller, 1995). We rarify three common text (or text pair) classification datasets: MNLI (Williams et al., 2018), AG's News (Zhang et al., 2015) and DBPedia (Lehmann et al., 2015). BERTRAM outperforms previous work on four English datasets by a large margin: on the three rarified datasets and on WNLaMPro (Schick and Schütze, 2020).

In summary, our contributions are as follows:

• We introduce BERTRAM, a model that integrates BERT into Attentive Mimicking, enabling a deep integration of surface-form and contexts and much better representations for rare words.

• We devise rarification, a method that transforms evaluation datasets into ones for which rare words are guaranteed to be important.

• We show that adding BERTRAM to BERT achieves a new state-of-the-art on WNLaMPro (Schick and Schütze, 2020) and beats all baselines on rarified AG's News, MNLI and DBPedia, resulting in an absolute improvement of up to 25% over BERT.

2 Related Work

Surface-form information (e.g., morphemes, characters or character n-grams) is commonly used to improve word representations. For static word embeddings, this information can either be injected into a given embedding space (Luong et al., 2013; Pinter et al., 2017), or a model can directly be given access to it during training (Bojanowski et al., 2017; Salle and Villavicencio, 2018; Piktus et al., 2019). In the area of contextualized representations, many architectures employ subword segmentation methods (e.g. Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019b). Others use convolutional neural networks to directly access character-level information (Kim et al., 2016; Peters et al., 2018; Baevski et al., 2019).

Complementary to surface form, another useful source of information for understanding rare words are the contexts in which they occur (Lazaridou et al., 2017; Herbelot and Baroni, 2017; Khodak et al., 2018). Schick and Schütze (2019a,b) show that combining form and context leads to significantly better results than using just one of the two. While all of these methods are bag-of-words models, Liu et al. (2019a) recently proposed an architecture based on context2vec (Melamud et al., 2016). However, in contrast to our work, they (i) do not incorporate surface-form information and (ii) do not directly access the hidden states of context2vec, but instead simply use its output distribution.

Several datasets focus on rare words, e.g., Stanford Rare Word (Luong et al., 2013), Definitional Nonce (Herbelot and Baroni, 2017), and Contextual Rare Word (Khodak et al., 2018). However, unlike our rarified datasets, they are only suitable for evaluating uncontextualized word representations. Rarification is related to adversarial example generation (e.g. Ebrahimi et al., 2018), which manipulates the input to change a model's prediction. We use a similar mechanism to determine which words in a given sentence are most important and replace them with rare synonyms.

3 Model

3.1 Form-Context Model

We first review the basis for our new model, the form-context model (FCM) (Schick and Schütze, 2019b). Given a set of d-dimensional high-quality embeddings for frequent words, FCM induces embeddings for rare words that are appropriate for the given embedding space. This is done as follows: Given a word w and a context C in which it occurs, a surface-form embedding v^form_(w,C) ∈ R^d is obtained by averaging over embeddings of all character n-grams in w; the n-gram embeddings are learned during training. Similarly, a context embedding v^context_(w,C) ∈ R^d is obtained by averaging over the embeddings of all words in C. Finally, both embeddings are combined using a gate

g(v^form_(w,C), v^context_(w,C)) = σ(x^T [v^form_(w,C); v^context_(w,C)] + y)

with parameters x ∈ R^2d, y ∈ R and σ denoting the sigmoid function, allowing the model to decide how to weight surface-form and context. The final representation of w is then a weighted combination of form and context embeddings:

v_(w,C) = α · (A v^context_(w,C) + b) + (1 − α) · v^form_(w,C)

where α = g(v^form_(w,C), v^context_(w,C)) and A ∈ R^(d×d), b ∈ R^d are parameters learned during training.

The context part of FCM is able to capture the broad topic of rare words, but since it is a bag-of-words model, it is not capable of obtaining a more concrete or detailed understanding (see Schick and Schütze, 2019b). Furthermore, the simple gating mechanism results in only a shallow combination of form and context. That is, the model is not able to combine form and context until the very last step: While it can learn to weight form and context components, the two embeddings (form and context) do not share any information and thus do not influence each other.

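To make the combination step concrete, the following is a minimal PyTorch sketch of the FCM gating mechanism described above. It is an illustrative re-implementation, not the released FCM or BERTRAM code; the module name, tensor shapes and toy dimensionality are assumptions.

```python
import torch
import torch.nn as nn

class FormContextGate(nn.Module):
    """Sketch of the FCM combination step: a scalar gate decides how much weight
    the context embedding gets relative to the surface-form embedding."""

    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, 1)  # implements x^T [v_form; v_context] + y
        self.A = nn.Linear(d, d)         # implements A v_context + b

    def forward(self, v_form: torch.Tensor, v_context: torch.Tensor) -> torch.Tensor:
        # alpha = sigma(x^T [v_form; v_context] + y)
        alpha = torch.sigmoid(self.gate(torch.cat([v_form, v_context], dim=-1)))
        # v = alpha * (A v_context + b) + (1 - alpha) * v_form
        return alpha * self.A(v_context) + (1 - alpha) * v_form

# toy usage: combine a form and a context embedding of dimension 768
d = 768
combiner = FormContextGate(d)
v = combiner(torch.randn(1, d), torch.randn(1, d))
print(v.shape)  # torch.Size([1, 768])
```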
3.2 BERTRAM

To overcome these limitations, we introduce BERTRAM, a model that combines a pretrained BERT (Devlin et al., 2019) with Attentive Mimicking (Schick and Schütze, 2019a). We denote with e_t the (uncontextualized, i.e., first-layer) embedding assigned to a (wordpiece) token t by BERT. Given a sequence of such uncontextualized embeddings e = e_1, ..., e_n, we denote by h_j(e) the contextualized representation of the j-th token at the final layer when the model is given e as input.

Given a word w and a context C in which it occurs, let t = t_1, ..., t_m be the sequence obtained from C by (i) replacing w with a [MASK] token and (ii) tokenization (matching BERT's vocabulary); furthermore, let i denote the index for which t_i = [MASK]. We experiment with three variants of BERTRAM: BERTRAM-SHALLOW, BERTRAM-REPLACE and BERTRAM-ADD.[2]

SHALLOW. Perhaps the simplest approach for obtaining a context embedding from C using BERT is to define

v^context_(w,C) = h_i(e_{t_1}, ..., e_{t_m}) .

This approach aligns well with BERT's pretraining objective of predicting likely substitutes for [MASK] tokens from their contexts. The context embedding v^context_(w,C) is then combined with its form counterpart as in FCM. While this achieves our first goal of using a more sophisticated context model that goes beyond bag-of-words, it still only combines form and context in a shallow fashion.

REPLACE. Before computing the context embedding, we replace the uncontextualized embedding of the [MASK] token with the word's surface-form embedding:

v^context_(w,C) = h_i(e_{t_1}, ..., e_{t_{i−1}}, v^form_(w,C), e_{t_{i+1}}, ..., e_{t_m}) .

Our rationale for this is as follows: During regular BERT pretraining, words chosen for prediction are replaced with [MASK] tokens only 80% of the time and kept unchanged 10% of the time. Thus, standard pretrained BERT should be able to make use of form embeddings presented this way as they provide a strong signal with regard to what the "correct" embedding of w may look like.

ADD. Before computing the context embedding, we prepad the input with the surface-form embedding of w, followed by a colon (e_:):[3]

v^context_(w,C) = h_{i+2}(v^form_(w,C), e_:, e_{t_1}, ..., e_{t_m}) .

The intuition behind this third variant is that lexical definitions and explanations of a word w are occasionally prefixed by "w :" (e.g., in some online dictionaries). We assume that BERT has seen many definitional sentences of this kind during pretraining and is thus able to leverage surface-form information about w presented this way.

[2] We refer to these three BERTRAM configurations simply as SHALLOW, REPLACE and ADD.

[3] We experimented with other prefixes, but found that this variant is best capable of recovering w at the masked position.

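As an illustration of the ADD variant, the sketch below builds the modified embedding sequence with the Hugging Face Transformers API and reads off the contextualized vector at the masked position. This is a simplified sketch, not the released implementation; in particular, the surface-form embedding is faked with a random vector and special-token handling is kept minimal.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
embed = model.get_input_embeddings()  # uncontextualized (first-layer) token embeddings

def add_context_embedding(context: str, rare_word: str, v_form: torch.Tensor) -> torch.Tensor:
    """Sketch of ADD: replace the rare word by [MASK], insert <v_form> ':' right
    after [CLS], run BERT on the embedding sequence and read out the mask position."""
    masked = context.replace(rare_word, tokenizer.mask_token)
    input_ids = tokenizer(masked, return_tensors="pt")["input_ids"]
    token_embeds = embed(input_ids)                                    # (1, m, d)
    colon_embed = embed(torch.tensor([[tokenizer.convert_tokens_to_ids(":")]]))
    inputs_embeds = torch.cat(
        [token_embeds[:, :1], v_form.view(1, 1, -1), colon_embed, token_embeds[:, 1:]], dim=1
    )
    hidden = model(inputs_embeds=inputs_embeds).last_hidden_state      # final layer
    mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero().item() + 2  # shifted by the 2 inserted positions
    return hidden[0, mask_pos]                                         # v_context for (w, C)

v_form = torch.randn(768)  # stands in for the learned n-gram based surface-form embedding
print(add_context_embedding("other washables such as trousers", "washables", v_form).shape)
```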
For both REPLACE and ADD, surface-form information is directly and deeply integrated into the computation of the context embedding; thus, we do not require any gating mechanism and directly set v_(w,C) = A · v^context_(w,C) + b. Figure 2 (left) shows how a single context is processed using ADD.

Figure 2: Schematic representation of BERTRAM-ADD processing the input word w = "washables" given a single context C_1 = "other washables such as trousers ..." (left) and given multiple contexts C = {C_1, ..., C_m} (right)

To exploit multiple contexts of a word if available, we follow the approach of Schick and Schütze (2019a) and add an AM layer on top of our model; see Figure 2 (right). Given a set of contexts C = {C_1, ..., C_m} and the corresponding embeddings v_(w,C_1), ..., v_(w,C_m), AM applies a self-attention mechanism to all embeddings, allowing the model to distinguish informative from uninformative contexts. The final embedding v_(w,C) is then a weighted combination of all embeddings:

v_(w,C) = Σ_{i=1}^m ρ_i · v_(w,C_i)

where the self-attention layer determines the weights ρ_i subject to Σ_{i=1}^m ρ_i = 1. For further details, see Schick and Schütze (2019a).

3.3 Training

Like previous work, we use mimicking (Pinter et al., 2017) as a training objective. That is, given a frequent word w with known embedding e_w and a set of corresponding contexts C, BERTRAM is trained to minimize ‖e_w − v_(w,C)‖².

Training BERTRAM end-to-end is costly: the cost of processing a single training instance (w, C) with C = {C_1, ..., C_m} is the same as processing an entire batch of m examples in standard BERT. Therefore, we resort to the following three-stage training process:

1. We train only the context part, minimizing ‖e_w − A · (Σ_{i=1}^m ρ_i · v^context_(w,C_i)) + b‖² where ρ_i is the weight assigned to each context C_i through the AM layer. Regardless of the selected BERTRAM variant, the context embedding is always obtained using SHALLOW in this stage. Furthermore, only A, b and all parameters of the AM layer are optimized.

2. We train only the form part (i.e., only the n-gram embeddings); our loss for a single example (w, C) is ‖e_w − v^form_(w,C)‖². Training in this stage is completely detached from the underlying BERT model.

3. In the third stage, we combine the pretrained form-only and context-only models and train all parameters. The first two stages are only run once and then used for all three BERTRAM variants because context and form are trained in isolation. The third stage must be run for each variant separately.

We freeze all of BERT's parameters during training as we – somewhat surprisingly – found that this slightly improves the model's performance while speeding up training. For ADD, we additionally found it helpful to freeze the form part in the third training stage. Importantly, for the first two stages of our training procedure, we do not have to backpropagate through BERT to obtain all required gradients, drastically increasing the training speed.

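The following sketch illustrates how context embeddings can be pooled with attention weights and trained with a stage-1 style mimicking loss. The attention parametrization here is deliberately simplified (a single scoring layer rather than the actual AM self-attention layer of Schick and Schütze, 2019a), so it should be read as a conceptual illustration only.

```python
import torch
import torch.nn as nn

class AttentiveCombination(nn.Module):
    """Sketch: score each context embedding, normalize the scores to weights rho_i
    (summing to 1) and return the weighted sum of all context embeddings."""

    def __init__(self, d: int):
        super().__init__()
        self.scorer = nn.Linear(d, 1)

    def forward(self, context_embeds: torch.Tensor) -> torch.Tensor:
        # context_embeds: (m, d) -- one embedding per context C_1, ..., C_m
        rho = torch.softmax(self.scorer(context_embeds).squeeze(-1), dim=0)  # (m,)
        return (rho.unsqueeze(-1) * context_embeds).sum(dim=0)               # (d,)

d = 768
am = AttentiveCombination(d)
A = nn.Linear(d, d)                 # stands in for the A, b parameters trained in stage 1

context_embeds = torch.randn(5, d)  # SHALLOW context embeddings for 5 contexts of w
e_w = torch.randn(d)                # known high-quality embedding of the frequent word w

# stage-1 style mimicking loss: squared distance between e_w and the combined embedding
v = A(am(context_embeds))
loss = ((e_w - v) ** 2).sum()
loss.backward()
```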
4 Dataset Rarification

The ideal dataset for measuring the quality of rare word representations would be one for which the accuracy of a model with no understanding of rare words is 0% whereas the accuracy of a model that perfectly understands rare words is 100%. Unfortunately, existing datasets do not satisfy this desideratum, not least because rare words – by their nature – occur rarely.

This does not mean that rare words are not important: As we shift our focus in NLP from words and sentences as the main unit of processing to larger units like paragraphs and documents, rare words will occur in a high proportion of such larger "evaluation units". Rare words are also clearly a hallmark of human language competence, which should be the ultimate goal of NLP. Our work is part of a trend that sees a need for evaluation tasks in NLP that are more ambitious than what we have now.[4]

[4] Cf. (Bowman, 2019): "If we want to be able to establish fair benchmarks that encourage future progress toward robust, human-like language understanding, we'll need to get better at creating clean, challenging, and realistic test datasets."

To create more challenging datasets, we use rarification, a procedure that automatically transforms existing text classification datasets in such a way that rare words become important. We require a pretrained language model M as a baseline, an arbitrary text classification dataset D containing labeled instances (x, y) and a substitution dictionary S, mapping each word w to a set of rare synonyms S(w). Given these ingredients, our procedure consists of three steps: (i) splitting the dataset into a train set and a set of test candidates, (ii) training the baseline model on the train set and (iii) modifying a subset of the test candidates to generate the final test set.

Dataset Splitting. We partition D into a training set D_train and a set of test candidates, D_cand. D_cand contains all instances (x, y) ∈ D such that for at least one word w in x, S(w) ≠ ∅ – subject to the constraint that the training set contains at least one third of the entire data.

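A minimal sketch of this splitting step is shown below. It is our own illustration under the assumption that D is a list of (text, label) pairs and S maps words to sets of rare synonyms; the real procedure operates on tokenized datasets.

```python
def split_dataset(D, S, min_train_fraction=1 / 3):
    """Put every instance containing at least one substitutable word into the candidate
    pool, but keep at least a third of the data for training the baseline model."""
    train, candidates = [], []
    for x, y in D:
        if any(S.get(w) for w in x.split()):
            candidates.append((x, y))
        else:
            train.append((x, y))
    # enforce the constraint that the train set covers at least min_train_fraction of D
    while len(train) < min_train_fraction * len(D):
        train.append(candidates.pop())
    return train, candidates

D = [("he rides a unicycle", 1), ("the weather is nice", 0), ("she plays the accordion", 1)]
S = {"unicycle": {"monocycle"}, "accordion": {"squeezebox"}}
train_set, test_candidates = split_dataset(D, S)
```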
Baseline Training. We finetune M on D_train. Let (x, y) ∈ D_train where x = w_1, ..., w_n is a sequence of words. We deviate from the finetuning procedure of Devlin et al. (2019) in three respects:

• We randomly replace 5% of all words in x with a [MASK] token. This allows the model to cope with missing or unknown words, a prerequisite for our final test set generation.

• As an alternative to overwriting the language model's uncontextualized embeddings for rare words, we also want to allow models to add an alternative representation during test time, in which case we simply separate both representations by a slash (cf. §5.3). To accustom the language model to this duplication of words, we replace each word w_i with "w_i / w_i" with a probability of 10%. To make sure that the model does not simply learn to always focus on the first instance during training, we randomly mask each of the two repetitions with probability 25%.

• We do not finetune the model's embedding layer. We found that this does not hurt performance, an observation in line with recent findings of Lee et al. (2019).

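For illustration, the input perturbations from the first two bullets could be implemented as in the following sketch. It is not the exact preprocessing code; the probabilities are taken from the description above and the word-level treatment is simplified.

```python
import random

def perturb_for_finetuning(words, mask_token="[MASK]",
                           p_mask=0.05, p_duplicate=0.10, p_mask_repetition=0.25):
    """Randomly mask words, and sometimes duplicate a word as 'w / w' so that the model
    gets used to seeing two representations of the same word separated by a slash."""
    out = []
    for w in words:
        if random.random() < p_mask:
            out.append(mask_token)
        elif random.random() < p_duplicate:
            first = mask_token if random.random() < p_mask_repetition else w
            second = mask_token if random.random() < p_mask_repetition else w
            out.append(f"{first} / {second}")
        else:
            out.append(w)
    return " ".join(out)

print(perturb_for_finetuning("i think i will go finish up my laundry".split()))
```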
Test Set Generation. Let p(y | x) be the probability that the finetuned model M assigns to class y given input x, and M(x) = argmax_{y∈Y} p(y | x) be the model's prediction for input x where Y denotes the set of all labels. For generating our test set, we only consider candidates that are classified correctly by the baseline model, i.e., candidates (x, y) ∈ D_cand with M(x) = y. For each such entry, let x = w_1, ..., w_n and let x_{w_i=t} be the sequence obtained from x by replacing w_i with t. We compute

w_i = argmin_{w_j : S(w_j) ≠ ∅} p(y | x_{w_j=[MASK]}) ,

i.e., we select the word w_i whose masking pushes the model's prediction the farthest away from the correct label. If removing this word already changes the model's prediction – that is, M(x_{w_i=[MASK]}) ≠ y –, we select a random rare synonym ŵ_i ∈ S(w_i) and add (x_{w_i=ŵ_i}, y) to the test set. Otherwise, we repeat the above procedure; if the label still has not changed after masking up to 5 words, we discard the candidate. Each instance (x_{w_{i_1}=ŵ_{i_1}, ..., w_{i_k}=ŵ_{i_k}}, y) of the resulting test set has the following properties:

• If each w_{i_j} is replaced by [MASK], the entry is classified incorrectly by M. In other words, understanding the words w_{i_j} is necessary for M to determine the correct label.

• If the model's internal representation of each ŵ_{i_j} is sufficiently similar to its representation of w_{i_j}, the entry is classified correctly by M. That is, if the model is able to understand the rare words ŵ_{i_j} and to identify them as synonyms of w_{i_j}, it will predict the correct label.

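The greedy selection described above can be sketched as follows. This is illustrative Python, not the released rarification script; `predict_proba` is an assumed helper that returns the finetuned model's class probabilities as a list.

```python
import random

def rarify_instance(words, label, S, predict_proba, mask_token="[MASK]", max_masked=5):
    """Repeatedly mask the substitutable word that hurts the correct class most;
    once the prediction flips, swap all masked words for random rare synonyms."""
    words = list(words)
    masked_positions = []
    for _ in range(max_masked):
        candidates = [i for i, w in enumerate(words) if S.get(w) and i not in masked_positions]
        if not candidates:
            return None
        # w_i = argmin_{w_j : S(w_j) != {}} p(y | x with w_j (and earlier picks) masked)
        def score(i):
            trial = [mask_token if j in masked_positions or j == i else w for j, w in enumerate(words)]
            return predict_proba(" ".join(trial))[label]
        i = min(candidates, key=score)
        masked_positions.append(i)
        trial = [mask_token if j in masked_positions else w for j, w in enumerate(words)]
        probs = predict_proba(" ".join(trial))
        if max(range(len(probs)), key=probs.__getitem__) != label:
            # prediction flipped: replace every masked word with a rare synonym
            for j in masked_positions:
                words[j] = random.choice(sorted(S[words[j]]))
            return " ".join(words), label
    return None  # label never changed after max_masked maskings; discard the candidate

# toy usage with a dummy "model" that dislikes masked inputs for class 1
def dummy_predict_proba(text):
    return [0.8, 0.2] if "[MASK]" in text else [0.1, 0.9]

print(rarify_instance("he rides a unicycle".split(), 1, {"unicycle": {"monocycle"}}, dummy_predict_proba))
```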
Note that the test set is closely coupled to the baseline model M because we select the words to be replaced based on M's predictions. Importantly, however, the model is never queried with any rare synonym during test set generation, so its representations of rare words are not taken into account for creating the test set. Thus, while the test set is not suitable for comparing M with an entirely different model M′, it allows us to compare various strategies for representing rare words in the embedding space of M. Definitional Nonce (Herbelot and Baroni, 2017) is subject to a similar constraint: it is tied to a specific (uncontextualized) embedding space based on word2vec (Mikolov et al., 2013).

5 Evaluation

5.1 Setup

For our evaluation of BERTRAM, we follow the experimental setup of Schick and Schütze (2020). We experiment with integrating BERTRAM both into BERT (base) and RoBERTa (large) (Liu et al., 2019b). Throughout our experiments, when BERTRAM is used to provide input representations for one of the two models, we use the same model as BERTRAM's underlying language model. Further training specifications can be found in Appendix A.

While BERT was trained on BookCorpus (Zhu et al., 2015) and a large Wikipedia dump, we follow previous work and train BERTRAM only on the much smaller Westbury Wikipedia Corpus (WWC) (Shaoul and Westbury, 2010); this of course gives BERT a clear advantage over BERTRAM. This advantage is even more pronounced when comparing BERTRAM with RoBERTa, which is trained on a corpus that is an order of magnitude larger than the original BERT corpus. We try to at least partially compensate for this as follows: In our downstream task experiments, we gather the set of contexts C for each word from WWC+BookCorpus during inference.[5]

[5] We recreate BookCorpus with the script at github.com/soskek/bookcorpus. We refer to the joined corpus of WWC and BookCorpus as WWC+BookCorpus.

Table 1: MRR on WNLaMPro test for baseline models and various BERTRAM configurations. Best results per base model are underlined, results that do not differ significantly from the best results in a paired t-test (p < 0.05) are bold.

Model                               RARE    MEDIUM
BERT (base)                         0.112   0.234
+ AM (Schick and Schütze, 2020)     0.251   0.267
+ BERTRAM-SHALLOW                   0.250   0.246
+ BERTRAM-REPLACE                   0.155   0.216
+ BERTRAM-ADD                       0.269   0.367
BERT (large)                        0.143   0.264
RoBERTa (large)                     0.270   0.275
+ BERTRAM-ADD                       0.306   0.323

Table 2: Examples from rarified datasets. Each replaced word is shown together with its rare replacement (replaced word → replacement).

Task      Entry
MNLI      i think i will go finish up my laundry → washables.
AG's      [. . . ] stake will improve → meliorate symantec's consulting contacts [. . . ]
DBPedia   yukijiro hotaru [. . . ] is a japanese → nipponese actor → histrion.
MNLI      a smart person is often → ofttimes correct in their answers → ansers.
MNLI      the southwest has a lot of farming and vineyards → vineries that make excellent → fantabulous merlot.

5.2 WNLaMPro

We evaluate BERTRAM on the WNLaMPro dataset (Schick and Schütze, 2020). This dataset consists of cloze-style phrases like "A lingonberry is a ___." and the task is to correctly fill the slot (___) with one of several acceptable target words (e.g., "fruit", "bush" or "berry"), which requires understanding of the meaning of the phrase's keyword ("lingonberry" in the example). As the goal of this dataset is to probe a language model's ability to understand rare words without any task-specific finetuning, Schick and Schütze (2020) do not provide a training set. The dataset is partitioned into three subsets based on the keyword's frequency in WWC: RARE (occurring fewer than 10 times), MEDIUM (occurring between 10 and 100 times), and FREQUENT (all remaining words).

For our evaluation, we compare the performance of a standalone BERT (or RoBERTa) model with one that uses BERTRAM as shown in Figure 1 (bottom). As our focus is to improve representations for rare words, we evaluate our model only on WNLaMPro RARE and MEDIUM. Table 1 gives results; our measure is mean reciprocal rank (MRR). We see that supplementing BERT with any of the proposed methods results in noticeable improvements for the RARE subset, with ADD clearly outperforming SHALLOW and REPLACE. Moreover, ADD performs surprisingly well for more frequent words, improving the score for WNLaMPro-MEDIUM by 58% compared to BERT (base) and 37% compared to Attentive Mimicking. This makes sense considering that the key enhancement of BERTRAM over AM lies in improving context representations and interconnection of form and context; the more contexts are given, the more this comes into play. Noticeably, despite being both based on and integrated into a BERT (base) model, our architecture even outperforms BERT (large) by a large margin. While RoBERTa performs much better than BERT on WNLaMPro, BERTRAM still significantly improves results for both rare and medium frequency words. As it performs best for both the RARE and MEDIUM subset, we always use the ADD configuration of BERTRAM in the following experiments.

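For reference, mean reciprocal rank over such cloze-style queries can be computed as in the following sketch (our own helper, not the WNLaMPro evaluation script): each query comes with the model's ranked list of predicted slot fillers and a set of acceptable target words.

```python
def mean_reciprocal_rank(ranked_predictions, acceptable_targets):
    """ranked_predictions: one ranked word list per cloze phrase;
    acceptable_targets: one set of words counted as correct per cloze phrase."""
    total = 0.0
    for ranking, targets in zip(ranked_predictions, acceptable_targets):
        rank = next((i + 1 for i, word in enumerate(ranking) if word in targets), None)
        total += 1.0 / rank if rank is not None else 0.0
    return total / len(ranked_predictions)

print(mean_reciprocal_rank(
    [["plant", "fruit", "tree"], ["berry", "car"]],
    [{"fruit", "bush", "berry"}, {"fruit", "bush", "berry"}],
))  # (1/2 + 1/1) / 2 = 0.75
```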
Model                                MNLI               AG's News          DBPedia
                                     All   Msp   WN     All   Msp   WN     All   Msp   WN
BERT (base)                          50.5  49.1  53.4   56.5  54.8  61.9   49.3  46.0  57.6
+ Mimick (Pinter et al., 2017)       37.2  38.2  38.7   45.3  43.9  50.5   36.5  35.8  41.1
+ A La Carte (Khodak et al., 2018)   44.6  45.7  46.1   52.4  53.7  56.1   51.1  48.7  59.3
+ AM (Schick and Schütze, 2020)      50.9  50.7  53.6   58.9  59.8  62.6   60.7  63.1  62.8
+ BERTRAM                            53.3  52.5  55.6   62.1  63.1  65.3   64.2  67.9  64.1
+ BERTRAM-SLASH                      56.4  55.3  58.6   62.9  63.3  65.3   65.7  67.3  67.2
+ BERTRAM-SLASH + INDOMAIN           59.8  57.3  62.7   62.5  62.1  66.6   74.2  74.8  76.7
RoBERTa (large)                      67.3  68.7  68.4   63.7  68.1  65.7   65.5  67.3  66.6
+ BERTRAM-SLASH                      70.1  71.5  70.9   64.6  68.4  64.9   71.9  73.8  73.9
+ BERTRAM-SLASH + INDOMAIN           71.7  71.9  73.2   68.1  71.9  69.0   76.0  78.8  77.3
Table 3: Accuracy of standalone BERT and RoBERTa, various baselines and BERTRAM on rarified MNLI, AG’s News and DBPedia. The five BERTRAM instances are BERTRAM-ADD. Best results per baseline model are underlined, results that do not differ significantly from the best results in a two-sided binomial test (p < 0.05) are bold. Msp/WN: subset of instances containing at least one misspelling/synonym. All: all instances.

5.3 Downstream Task Datasets

To measure the effect of adding BERTRAM to a pretrained deep language model on downstream tasks, we rarify (cf. §4) the following three datasets:

• MNLI (Williams et al., 2018), a natural language inference dataset where given two sentences a and b, the task is to decide whether a entails b, a and b contradict each other or neither;

• AG's News (Zhang et al., 2015), a news classification dataset with four different categories (world, sports, business and science/tech);

• DBPedia (Lehmann et al., 2015), an ontology dataset with 14 classes (e.g., company, artist) that have to be identified from text snippets.

For all three datasets, we create rarified instances both using BERT (base) and RoBERTa (large) as a baseline model and build the substitution dictionary S using the synonym relation of WordNet (Miller, 1995) and the pattern library (Smedt and Daelemans, 2012) to make sure that all synonyms have consistent parts of speech. Furthermore, we only consider synonyms for each word's most frequent sense; this filters out much noise and improves the quality of the created sentences. In addition to WordNet, we use the misspelling dataset of Piktus et al. (2019). To prevent misspellings from dominating the resulting datasets, we only assign misspelling-based substitutes to randomly selected 10% of the words contained in each sentence. Motivated by the results on WNLaMPro-MEDIUM, we consider every word that occurs less than 100 times in WWC+BookCorpus as being rare. Example entries from the rarified datasets obtained using BERT (base) as a baseline model can be seen in Table 2. The average number of words replaced with synonyms or misspellings is 1.38, 1.82 and 2.34 for MNLI, AG's News and DBPedia, respectively.

Our default way of injecting BERTRAM embeddings into the baseline model is to replace the sequence of uncontextualized subword token embeddings for a given rare word with its BERTRAM-based embedding (Figure 1, bottom). That is, given a sequence of uncontextualized token embeddings e = e_1, ..., e_n where e_i, ..., e_j with 1 ≤ i ≤ j ≤ n is the sequence of embeddings for a single rare word w with BERTRAM-based embedding v_(w,C), we replace e with

e′ = e_1, ..., e_{i−1}, v_(w,C), e_{j+1}, ..., e_n .

As an alternative to replacing the original sequence of subword embeddings for a given rare word, we also consider BERTRAM-SLASH, a configuration where the BERTRAM-based embedding is simply added and both representations are separated using a single slash:

e_SLASH = e_1, ..., e_j, e_/, v_(w,C), e_{j+1}, ..., e_n .

The intuition behind this variant is that in BERT's pretraining corpus, a slash is often used to separate two variants of the same word (e.g., "useable / usable") or two closely related concepts (e.g., "company / organization", "web-based / cloud") and thus, BERT should be able to understand that both e_i, ..., e_j and v_(w,C) refer to the same entity. We therefore surmise that whenever some information is encoded in one representation but not in the other, giving BERT both representations is helpful.

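The two injection strategies can be sketched as follows, operating directly on BERT's uncontextualized embedding sequence via the Transformers API. This is again an illustrative sketch rather than the released implementation; the BERTRAM embedding is faked with a random vector and [CLS]/[SEP] tokens are omitted for brevity.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
embed = model.get_input_embeddings()

def inject(sentence: str, rare_word: str, v_bertram: torch.Tensor, slash: bool = False):
    """Replace the rare word's WordPiece embeddings e_i..e_j with v_bertram,
    or (SLASH variant) keep them and append '/' followed by v_bertram."""
    tokens = tokenizer.tokenize(sentence)
    pieces = tokenizer.tokenize(rare_word)
    i = next(k for k in range(len(tokens)) if tokens[k:k + len(pieces)] == pieces)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    e = embed(torch.tensor(ids))                       # (n, d), no special tokens for brevity
    v = v_bertram.view(1, -1)
    if slash:
        slash_e = embed(torch.tensor([tokenizer.convert_tokens_to_ids("/")]))
        e_new = torch.cat([e[: i + len(pieces)], slash_e, v, e[i + len(pieces):]])
    else:
        e_new = torch.cat([e[:i], v, e[i + len(pieces):]])
    return model(inputs_embeds=e_new.unsqueeze(0)).last_hidden_state

v = torch.randn(768)  # stands in for the BERTRAM embedding of "unicycle"
print(inject("riding a unicycle is hard", "unicycle", v, slash=True).shape)
```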
By default, the set of contexts C for each word is obtained by collecting all sentences from WWC+BookCorpus in which it occurs. We also try a variant where we add in-domain contexts by giving BERTRAM access to all texts (but not labels) found in the test set; we refer to this variant as INDOMAIN.[6] Our motivation for including this variant is as follows: Moving from the training stage of a model to its production use often causes a slight domain shift. This in turn leads to an increased number of input sentences containing words that did not – or only very rarely – appear in the training data. However, such input sentences can easily be collected as additional unlabeled examples during production use. While there is no straightforward way to leverage these unlabeled examples with an already finetuned BERT model, BERTRAM can easily make use of them without requiring any labels or any further training: They can simply be included as additional contexts during inference. As this gives BERTRAM a slight advantage, we also report results for all configurations without using indomain data. Importantly, adding indomain data increases the number of contexts for more than 90% of all rare words by at most 3, meaning that they can still be considered rare despite the additional indomain contexts.

[6] For the MNLI dataset, which consists of text pairs (a, b), we treat a and b as separate contexts.

Table 3 reports, for each task, the accuracy on the entire dataset (All) as well as scores obtained considering only instances where at least one word was replaced by a misspelling (Msp) or a WordNet synonym (WN), respectively.[7] Consistent with results on WNLaMPro, combining BERT with BERTRAM consistently outperforms both a standalone BERT model and one combined with various baseline models. Using the SLASH variant brings improvements across all datasets as does adding INDOMAIN contexts (exception: BERT/AG's News). This makes sense considering that for a rare word, every single additional context can be crucial for gaining a deeper understanding. Correspondingly, it is not surprising that the benefit of adding BERTRAM to RoBERTa is less pronounced, because BERTRAM uses only a fraction of the contexts available to RoBERTa during pretraining. Nonetheless, adding BERTRAM significantly improves RoBERTa's accuracy for all three datasets both with and without adding INDOMAIN contexts.

[7] Note that results for BERT and RoBERTa are only loosely comparable because the datasets generated from both baseline models through rarification are different.

To further understand for which words using BERTRAM is helpful, Figure 3 looks at the accuracy of BERT (base) both with and without BERTRAM as a function of word frequency. That is, we compute the accuracy scores for both models when considering only entries (x_{w_{i_1}=ŵ_{i_1}, ..., w_{i_k}=ŵ_{i_k}}, y) where each substituted word ŵ_{i_j} occurs less than c_max times in WWC+BookCorpus, for different values of c_max. As one would expect, c_max is positively correlated with the accuracies of both models, showing that the rarer a word is, the harder it is to understand. Interestingly, the gap between standalone BERT and BERT with BERTRAM remains more or less constant regardless of c_max. This suggests that using BERTRAM may even be helpful for more frequent words.

Figure 3: BERT vs. BERT combined with BERTRAM-SLASH (BERT+BSL) on three downstream tasks for varying maximum numbers of contexts c_max

To investigate this hypothesis, we perform another rarification of MNLI that differs from the previous rarification in two respects. First, we increase the threshold for a word to count as rare from 100 to 1000. Second, as this means that we have more WordNet synonyms available, we do not use the misspelling dictionary (Piktus et al., 2019) for substitution. We refer to the resulting datasets for BERT (base) and RoBERTa (large) as MNLI-1000.

Figure 4 shows results on MNLI-1000 for various rare word frequency ranges. For each value [c_0, c_1) on the x-axis, the y-axis shows improvement in accuracy compared to standalone BERT or RoBERTa when only dataset entries are considered for which each rarified word occurs between c_0 (inclusively) and c_1 (exclusively) times in WWC+BookCorpus. We see that for words with frequency less than 125, the improvement in accuracy remains similar even without using misspellings as another source of substitutions. Interestingly, for every single interval of rare word counts considered, adding BERTRAM-SLASH to BERT considerably improves its accuracy. For RoBERTa, adding BERTRAM brings improvements only for words occurring less than 500 times. While using INDOMAIN data is beneficial for rare words – simply because it gives us additional contexts for these words –, when considering only words that occur at least 250 times in WWC+BookCorpus, adding INDOMAIN contexts does not help.

Figure 4: Improvements for BERT (base) and RoBERTa (large) when adding BERTRAM-SLASH (+BSL) or BERTRAM-SLASH + INDOMAIN (+BSL+ID) on MNLI-1000

6 Conclusion

We have introduced BERTRAM, a novel architecture for inducing high-quality representations for rare words in BERT's and RoBERTa's embedding spaces. This is achieved by employing a powerful pretrained language model and deeply integrating surface-form and context information. By replacing important words with rare synonyms, we created downstream task datasets that are more challenging and support the evaluation of NLP models on the task of understanding rare words, a capability that human speakers have. On all of these datasets, BERTRAM improves over standard BERT and RoBERTa, demonstrating the usefulness of our method.

Our analysis showed that BERTRAM is beneficial not only for rare words (our main target in this paper), but also for frequent words. In future work, we want to investigate BERTRAM's potential benefits for such frequent words. Furthermore, it would be interesting to explore more complex ways of incorporating surface-form information – e.g., by using a character-level CNN similar to the one of Kim et al. (2016) – to balance out the potency of BERTRAM's form and context parts.

Acknowledgments

This work was funded by the European Research Council (ERC #740516). We would like to thank the anonymous reviewers for their helpful comments.
References

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5359–5368, Hong Kong, China. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Sam Bowman. 2019. T5 explores the limits of transfer learning.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.

Jeroen Van Hautte, Guy Emerson, and Marek Rei. 2019. Bad form: Comparing context-based and form-based few-shot learning in distributional semantic models. Computing Research Repository, arXiv:1910.00275.

Aurélie Herbelot and Marco Baroni. 2017. High-risk learning: acquiring new word vectors from tiny data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 304–309. Association for Computational Linguistics.

Mikhail Khodak, Nikunj Saunshi, Yingyu Liang, Tengyu Ma, Brandon Stewart, and Sanjeev Arora. 2018. A la carte embedding: Cheap but effective induction of semantic feature vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22. Association for Computational Linguistics.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2741–2749. AAAI Press.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR).

Angeliki Lazaridou, Marco Marelli, and Marco Baroni. 2017. Multimodal word meaning induction from minimal exposure to natural text. Cognitive Science, 41(S4):677–705.

Jaejun Lee, Raphael Tang, and Jimmy Lin. 2019. What would Elsa do? Freezing layers during transformer fine-tuning. Computing Research Repository, arXiv:1911.03090.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 6(2):167–195.

Qianchu Liu, Diana McCarthy, and Anna Korhonen. 2019a. Second-order contexts from lexical substitutes for few-shot learning of word representations. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 61–67, Minneapolis, Minnesota. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria. Association for Computational Linguistics.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61, Berlin, Germany. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Computing Research Repository, arXiv:1301.3781.

George A. Miller. 1995. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Aleksandra Piktus, Necati Bora Edizel, Piotr Bojanowski, Edouard Grave, Rui Ferreira, and Fabrizio Silvestri. 2019. Misspelling oblivious word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3226–3234, Minneapolis, Minnesota. Association for Computational Linguistics.

Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. 2017. Mimicking word embeddings using subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 102–112. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alexandre Salle and Aline Villavicencio. 2018. Incorporating subword information into matrix factorization word embeddings. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 66–71, New Orleans. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2019a. Attentive mimicking: Better word embeddings by attending to informative contexts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 489–494, Minneapolis, Minnesota. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2019b. Learning semantic representations for novel words: Leveraging both form and context. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence.

Timo Schick and Hinrich Schütze. 2020. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Cyrus Shaoul and Chris Westbury. 2010. The westbury lab wikipedia corpus.

Tom De Smedt and Walter Daelemans. 2012. Pattern for python. Journal of Machine Learning Research, 13(Jun):2063–2067.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Transformers: State-of-the-art natural language processing.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. Computing Research Repository, arXiv:1609.08144.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Computing Research Repository, arXiv:1906.08237.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.

A Training Details

Our implementation of BERTRAM is based on PyTorch (Paszke et al., 2017) and the Transformers library (Wolf et al., 2019). To obtain target embeddings for frequent multi-token words (i.e., words that occur at least 100 times in WWC+BookCorpus) during training, we use one-token approximation (OTA) (Schick and Schütze, 2020). For RoBERTa (large), we found increasing the number of iterations per word from 4,000 to 8,000 to produce better OTA embeddings using the same evaluation setup as Schick and Schütze (2020). For all stages of training, we use Adam (Kingma and Ba, 2015) as optimizer.

Context-Only Training. During the first stage of our training process, we train BERTRAM with a maximum sequence length of 96 and a batch size of 48 contexts for BERT (base) and 24 contexts for RoBERTa (large). These parameters are chosen such that a batch fits on a single Nvidia GeForce GTX 1080Ti. Each context in a batch is mapped to a word w from the set of training words, and each batch contains at least 4 and at most 32 contexts per word. For BERT (base) and RoBERTa (large), we pretrain the context part for 5 and 3 epochs, respectively. We use a maximum learning rate of 5 · 10^−5 and perform linear warmup for the first 10% of training examples, after which the learning rate is linearly decayed.

Form-Only Training. In the second stage of our training process, we use the same parameters as Schick and Schütze (2020), as our form-only model is the very same as theirs. That is, we use a learning rate of 0.01, a batch size of 64 words and we apply n-gram dropout with a probability of 10%. We pretrain the form-only part for 20 epochs.

Combined Training. For the final stage, we use the same training configuration as for context-only training, but we keep n-gram dropout from the form-only stage. We perform combined training for 3 epochs. For ADD, when using RoBERTa as an underlying language model, we do not just prepad the input with the surface-form embedding followed by a colon, but additionally wrap the surface-form embedding in double quotes. That is, we prepad the input with e_", v^form_(w,C), e_", e_:. We found this to perform slightly better in preliminary experiments with some toy examples.

B Evaluation Details

WNLaMPro. In order to ensure comparability with results of Schick and Schütze (2020), we use only WWC to obtain contexts for WNLaMPro keywords.

Rarified Datasets. To obtain rarified instances of MNLI, AG's News and DBPedia, we train BERT (base) and RoBERTa (large) on each task's training set for 3 epochs. We use a batch size of 32, a maximum sequence length of 128 and a weight decay factor of 0.01. For BERT, we perform linear warmup for the first 10% of training examples and use a maximum learning rate of 5 · 10^−5. After reaching its peak value, the learning rate is linearly decayed. For RoBERTa, we found training to be unstable with these parameters, so we chose a lower learning rate of 1 · 10^−5 and performed linear warmup for the first 10,000 training steps.

To obtain results for our baselines on the rarified datasets, we use the original Mimick implementation of Pinter et al. (2017), the A La Carte implementation of Khodak et al. (2018) and the Attentive Mimicking implementation of Schick and Schütze (2019a) with their default hyperparameter settings. As A La Carte can only be used for words with at least one context, we keep the original BERT embeddings whenever no such context is available.

While using BERTRAM allows us to completely remove the original BERT embeddings for all rare words and still obtain improvements in accuracy on all three rarified downstream tasks, the same is not true for RoBERTa, where removing the original sequence of subword token embeddings for a given rare word (i.e., not using the SLASH variant) hurts performance with accuracy dropping by 5.6, 7.4 and 2.1 points for MNLI, AG's News and DBPedia, respectively. We believe this to be due to the vast amount of additional contexts for rare words in RoBERTa's training set that are not available to BERTRAM.