What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models
Allyson Ettinger
Department of Linguistics, University of Chicago
[email protected]

Transactions of the Association for Computational Linguistics, vol. 8, pp. 34–48, 2020. https://doi.org/10.1162/tacl_a_00298. © 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Pre-training by language modeling has become a popular and successful approach to NLP tasks, but we have yet to understand exactly what linguistic capacities these pre-training processes confer upon models. In this paper we introduce a suite of diagnostics drawn from human language experiments, which allow us to ask targeted questions about information used by language models for generating predictions in context. As a case study, we apply these diagnostics to the popular BERT model, finding that it can generally distinguish good from bad completions involving shared category or role reversal, albeit with less sensitivity than humans, and it robustly retrieves noun hypernyms, but it struggles with challenging inference and role-based event prediction—and, in particular, it shows clear insensitivity to the contextual impacts of negation.

1 Introduction

Pre-training of NLP models with a language modeling objective has recently gained popularity as a precursor to task-specific fine-tuning. Pre-trained models like BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018a) have advanced the state of the art in a wide variety of tasks, suggesting that these models acquire valuable, generalizable linguistic competence during the pre-training process. However, though we have established the benefits of language model pre-training, we have yet to understand what exactly about language these models learn during that process.

This paper aims to improve our understanding of what language models (LMs) know about language, by introducing a set of diagnostics targeting a range of linguistic capacities drawn from human psycholinguistic experiments. Because of their origin in psycholinguistics, these diagnostics have two distinct advantages: They are carefully controlled to ask targeted questions about linguistic capabilities, and they are designed to ask these questions by examining word predictions in context, which allows us to study LMs without any need for task-specific fine-tuning.

Beyond these advantages, our diagnostics distinguish themselves from existing tests for LMs in two primary ways. First, these tests have been chosen specifically for their capacity to reveal insensitivities in predictive models, as evidenced by patterns that they elicit in human brain responses. Second, each of these tests targets a set of linguistic capacities that extend beyond the primarily syntactic focus seen in existing LM diagnostics—we have tests targeting commonsense/pragmatic inference, semantic roles and event knowledge, category membership, and negation. Each of our diagnostics is set up to support tests of both word prediction accuracy and sensitivity to distinctions between good and bad context completions. Although we focus on the BERT model here as an illustrative case study, these diagnostics are applicable for testing of any language model.

This paper makes two main contributions. First, we introduce a new set of targeted diagnostics for assessing linguistic capacities in language models.¹ Second, we apply these tests to shed light on strengths and weaknesses of the popular BERT model. We find that BERT struggles with challenging commonsense/pragmatic inferences and role-based event prediction; that it is generally robust on within-category distinctions and role reversals, but with lower sensitivity than humans; and that it is very strong at associating nouns with hypernyms. Most strikingly, however, we find that BERT fails completely to show generalizable understanding of negation, raising questions about the aptitude of LMs to learn this type of meaning.

¹ All test sets and experiment code are made available here: https://github.com/aetting/lm-diagnostics.
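To make the prediction-in-context setup concrete, the following minimal sketch (an illustration using the HuggingFace transformers library, not the released experiment code linked above) shows how one might query a pre-trained masked language model for a single test item: it checks whether an expected word appears among the model's top predictions, and whether the model assigns higher probability to a good completion than to a bad one. The example sentence and candidate words are hypothetical placeholders rather than items from the actual test sets.

# Illustrative sketch only (not the released experiment code): probing a
# pre-trained masked LM's word predictions in context with the HuggingFace
# "transformers" library. Requires: pip install torch transformers
import torch
from transformers import BertForMaskedLM, BertTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption: any masked-LM checkpoint would work similarly
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# Hypothetical test item (a placeholder, not an item from the released test sets):
# a context with one masked position, plus a "good" and a "bad" candidate completion.
context = "The waiter brought the [MASK] to the table."
good, bad = "food", "rock"

inputs = tokenizer(context, return_tensors="pt")
mask_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)
probs = torch.softmax(logits[0, mask_index], dim=-1)

# Word prediction accuracy: is an expected word among the model's top-k guesses?
top_k = torch.topk(probs, k=5)
print("Top-5 predictions:", tokenizer.convert_ids_to_tokens(top_k.indices.tolist()))

# Sensitivity: does the model assign higher probability to the good completion
# than to the bad one? (Both candidates must be single WordPiece tokens here.)
good_id = tokenizer.convert_tokens_to_ids(good)
bad_id = tokenizer.convert_tokens_to_ids(bad)
print(f"P({good}) = {probs[good_id].item():.4f}  vs.  P({bad}) = {probs[bad_id].item():.4f}")

Because this kind of probing relies only on the model's output distribution at the masked position, no task-specific fine-tuning is involved, which is exactly the setting the diagnostics are designed to exploit.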
2 Motivation for Use of Psycholinguistic Tests on Language Models

It is important to be clear that in using these diagnostics, we are not testing whether LMs are psycholinguistically plausible. We are using these tests simply to examine LMs' general linguistic knowledge, specifically by asking what information the models are able to use when assigning probabilities to words in context. These psycholinguistic tests are well-suited to asking this type of question because a) the tests are designed for drawing conclusions based on predictions in context, allowing us to test LMs in their most natural setting, and b) the tests are designed in a controlled manner, such that accurate word predictions in context depend on particular types of information. In this way, these tests provide us with a natural means of diagnosing what kinds of information LMs have picked up on during training.

Clarifying the linguistic knowledge acquired during LM-based training is increasingly relevant as state-of-the-art NLP models shift to be predominantly based on pre-training processes involving word prediction in context. In order to understand the fundamental strengths and limitations of these models—and in particular, to understand what allows them to generalize to many different tasks—we need to understand what linguistic competence and general knowledge this LM-based pre-training makes available (and what it does not). The importance of understanding LM-based pre-training is also the motivation for examining pre-trained BERT, as we do in the present paper, despite the fact that the pre-trained form is typically used only as a starting point for fine-tuning. Because it is the pre-training that seemingly underlies the generalization power of the BERT model, allowing for simple fine-tuning to perform so impressively, it is the pre-trained model that presents the most important questions about the nature of generalizable linguistic knowledge in BERT.
3 Related Work

This paper contributes to a growing effort to better understand the specific linguistic capacities achieved by neural NLP models. Some approaches use fine-grained classification tasks to probe information in sentence embeddings (Adi et al., 2016; Conneau et al., 2018; Ettinger et al., 2018), or token-level and other sub-sentence level information in contextual embeddings (Tenney et al., 2019b; Peters et al., 2018b). Some of this work has targeted specific linguistic phenomena such as function words (Kim et al., 2019). Much work has attempted to evaluate systems' overall level of "understanding", often with tasks such as semantic similarity and entailment (Wang et al., 2018; Bowman et al., 2015; Agirre et al., 2012; Dagan et al., 2005; Bentivogli et al., 2016), and additional work has been done to design curated versions of these tasks to test for specific linguistic capabilities (Dasgupta et al., 2018; Poliak et al., 2018; McCoy et al., 2019). Our diagnostics complement this previous work in allowing for direct testing of language models in their natural setting—via controlled tests of word prediction in context—without requiring probing of extracted representations or task-specific fine-tuning.

More directly related is existing work on analyzing linguistic capacities of language models specifically. This work is particularly dominated by testing of syntactic awareness in LMs, and often mirrors the present work in using targeted evaluations modeled after psycholinguistic tests (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018; Wilcox et al., 2018; Chowdhury and Zamparelli, 2018; Futrell et al., 2019). These analyses, like ours, typically draw conclusions based on LMs' output probabilities. Additional work has examined the internal dynamics underlying LMs' capturing of syntactic information, including testing of syntactic sensitivity in different components of the LM and at different timesteps within the sentence (Giulianelli et al., 2018), or in individual units (Lakretz et al., 2019).

This previous work analyzing language models focuses heavily on syntactic competence—semantic phenomena like negative polarity items are tested in some studies (Marvin and Linzen, 2018; Jumelet and Hupkes, 2018), but the tested capabilities in these cases are still firmly rooted in the notion of detecting structural dependencies. In the present work we expand beyond the syntactic focus of the previous literature, testing for capacities including commonsense/pragmatic reasoning, semantic role and event knowledge, category membership, and negation—while continuing to use controlled, targeted diagnostics. Our tests are also distinct in eliciting a very specific response profile in humans, creating unique predictive challenges for models, as described subsequently.

We further deviate from previous work analyzing LMs in that we not only compare word

been carefully designed for studying specific aspects of language processing, and each test has been shown to produce informative patterns of results when tested on humans. In this section we provide relevant background on