What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models

Allyson Ettinger

Department of Linguistics, University of Chicago
[email protected]

Abstract

Pre-training by language modeling has become a popular and successful approach to NLP tasks, but we have yet to understand exactly what linguistic capacities these pre-training processes confer upon models. In this paper we introduce a suite of diagnostics drawn from human language experiments, which allow us to ask targeted questions about information used by language models for generating predictions in context. As a case study, we apply these diagnostics to the popular BERT model, finding that it can generally distinguish good from bad completions involving shared category or role reversal, albeit with less sensitivity than humans, and it robustly retrieves noun hypernyms, but it struggles with challenging inference and role-based event prediction—and, in particular, it shows clear insensitivity to the contextual impacts of negation.

1 Introduction

Pre-training of NLP models with a language modeling objective has recently gained popularity as a precursor to task-specific fine-tuning. Pre-trained models like BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018a) have advanced the state of the art in a wide variety of tasks, suggesting that these models acquire valuable, generalizable linguistic competence during the pre-training process. However, though we have established the benefits of language model pre-training, we have yet to understand what exactly about language these models learn during that process.

This paper aims to improve our understanding of what language models (LMs) know about language, by introducing a set of diagnostics targeting a range of linguistic capacities drawn from human psycholinguistic experiments. Because of their origin in psycholinguistics, these diagnostics have two distinct advantages: They are carefully controlled to ask targeted questions about linguistic capabilities, and they are designed to ask these questions by examining word predictions in context, which allows us to study LMs without any need for task-specific fine-tuning.

Beyond these advantages, our diagnostics distinguish themselves from existing tests for LMs in two primary ways. First, these tests have been chosen specifically for their capacity to reveal insensitivities in predictive models, as evidenced by patterns that they elicit in human brain responses. Second, each of these tests targets a set of linguistic capacities that extend beyond the primarily syntactic focus seen in existing LM diagnostics—we have tests targeting commonsense/pragmatic inference, semantic roles and event knowledge, category membership, and negation. Each of our diagnostics is set up to support tests of both word prediction accuracy and sensitivity to distinctions between good and bad context completions. Although we focus on the BERT model here as an illustrative case study, these diagnostics are applicable for testing of any language model.

This paper makes two main contributions. First, we introduce a new set of targeted diagnostics for assessing linguistic capacities in language models.[1] Second, we apply these tests to shed light on strengths and weaknesses of the popular BERT model. We find that BERT struggles with challenging commonsense/pragmatic inferences and role-based event prediction; that it is generally robust on within-category distinctions and role reversals, but with lower sensitivity than humans; and that it is very strong at associating nouns with hypernyms. Most strikingly, however, we find that BERT fails completely to show generalizable understanding of negation, raising questions about the aptitude of LMs to learn this type of meaning.

[1] All test sets and experiment code are made available here: https://github.com/aetting/lm-diagnostics.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 34–48, 2020. https://doi.org/10.1162/tacl_a_00298. Action Editor: Marco Baroni. Submission batch: 9/2019; Revision batch: 11/2019; Published 2/2020. © 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

2 Motivation for Use of Psycholinguistic Tests on Language Models

It is important to be clear that in using these diagnostics, we are not testing whether LMs are psycholinguistically plausible. We are using these tests simply to examine LMs' general linguistic knowledge, specifically by asking what information the models are able to use when assigning probabilities to words in context. These psycholinguistic tests are well-suited to asking this type of question because a) the tests are designed for drawing conclusions based on predictions in context, allowing us to test LMs in their most natural setting, and b) the tests are designed in a controlled manner, such that accurate word predictions in context depend on particular types of information. In this way, these tests provide us with a natural means of diagnosing what kinds of information LMs have picked up on during training.

Clarifying the linguistic knowledge acquired during LM-based training is increasingly relevant as state-of-the-art NLP models shift to be predominantly based on pre-training processes involving word prediction in context. In order to understand the fundamental strengths and limitations of these models—and in particular, to understand what allows them to generalize to many different tasks—we need to understand what linguistic competence and general knowledge this LM-based pre-training makes available (and what it does not).

The importance of understanding LM-based pre-training is also the motivation for examining pre-trained BERT, as we do in the present paper, despite the fact that the pre-trained form is typically used only as a starting point for fine-tuning. Because it is the pre-training that seemingly underlies the generalization power of the BERT model, allowing for simple fine-tuning to perform so impressively, it is the pre-trained model that presents the most important questions about the nature of generalizable linguistic knowledge in BERT.

3 Related Work

This paper contributes to a growing effort to better understand the specific linguistic capacities achieved by neural NLP models. Some approaches use fine-grained classification tasks to probe information in sentence embeddings (Adi et al., 2016; Conneau et al., 2018; Ettinger et al., 2018), or token-level and other sub-sentence level information in contextual embeddings (Tenney et al., 2019b; Peters et al., 2018b). Some of this work has targeted specific linguistic phenomena such as function words (Kim et al., 2019). Much work has attempted to evaluate systems' overall level of "understanding", often with tasks such as semantic similarity and entailment (Wang et al., 2018; Bowman et al., 2015; Agirre et al., 2012; Dagan et al., 2005; Bentivogli et al., 2016), and additional work has been done to design curated versions of these tasks to test for specific linguistic capabilities (Dasgupta et al., 2018; Poliak et al., 2018; McCoy et al., 2019). Our diagnostics complement this previous work in allowing for direct testing of language models in their natural setting—via controlled tests of word prediction in context—without requiring probing of extracted representations or task-specific fine-tuning.

More directly related is existing work on analyzing linguistic capacities of language models specifically. This work is particularly dominated by testing of syntactic awareness in LMs, and often mirrors the present work in using targeted evaluations modeled after psycholinguistic tests (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018; Wilcox et al., 2018; Chowdhury and Zamparelli, 2018; Futrell et al., 2019). These analyses, like ours, typically draw conclusions based on LMs' output probabilities. Additional work has examined the internal dynamics underlying LMs' capturing of syntactic information, including testing of syntactic sensitivity in different components of the LM and at different timesteps within the sentence (Giulianelli et al., 2018), or in individual units (Lakretz et al., 2019).

This previous work analyzing language models focuses heavily on syntactic competence—semantic phenomena like negative polarity items are tested in some studies (Marvin and Linzen, 2018; Jumelet and Hupkes, 2018), but the tested capabilities in these cases are still firmly rooted in the notion of detecting structural dependencies. In the present work we expand beyond the syntactic focus of the previous literature, testing for capacities including commonsense/pragmatic reasoning, semantic role and event knowledge, category membership, and negation—while continuing to use controlled, targeted diagnostics. Our tests are also distinct in eliciting a very specific response profile in humans, creating unique predictive challenges for models, as described subsequently.


We further deviate from previous work analyzing LMs in that we not only compare word probabilities—we also examine word prediction accuracies directly, for a richer picture of models' specific strengths and weaknesses. Some previous work has used word prediction accuracy as a test of LMs' language understanding—the LAMBADA dataset (Paperno et al., 2016), in particular, tests models' ability to predict the final word of a passage, in cases where the final sentence alone is insufficient for prediction. However, although LAMBADA presents a challenging prediction task, it is not well-suited to ask targeted questions about types of information used by LMs for prediction—unlike our tests, LAMBADA is not controlled to isolate and test the use of specific types of information in prediction. Our tests are thus unique in taking advantage of the additional information provided by testing word prediction accuracy, while also leveraging the benefits of controlled sentences that allow for asking targeted questions.

Finally, our testing of BERT relates to a growing literature examining linguistic characteristics of the BERT model itself, to better understand what underlies the model's impressive performance. Clark et al. (2019) analyze the dynamics of BERT's self-attention mechanism, probing attention heads for syntactic sensitivity and finding that individual heads specialize strongly for syntactic and coreference relations. Lin et al. (2019) also examine syntactic awareness in BERT by syntactic probing at different layers, and by examination of syntactic sensitivity in the self-attention mechanism. Tenney et al. (2019a) test a variety of linguistic tasks at different layers of the BERT model. Most similarly to our work here, Goldberg (2019) tests BERT on several of the targeted syntactic evaluations described earlier for LMs, finding BERT to exhibit very strong performance on these measures. Our work complements these approaches in testing BERT's linguistic capacities directly via the word prediction mechanism, and in expanding beyond the syntactic tests used to examine BERT's predictions in Goldberg (2019).

4 Leveraging Human Studies

The power in our diagnostics stems from their origin in psycholinguistic studies—the items have been carefully designed for studying specific aspects of language processing, and each test has been shown to produce informative patterns of results when tested on humans. In this section we provide relevant background on human language processing, and explain how we use this information to choose the particular tests used here.

4.1 Background: Prediction in Humans

To study language processing in humans, psycholinguists often test human responses to words in context, in order to better understand the information that our brains use to generate predictions. In particular, there are two types of predictive human responses that are relevant to us here:

Cloze Probability: The first measure of human expectation is a measure of the "cloze" response. In a cloze task, humans are given an incomplete sentence and tasked with filling their expected word in the blank. "Cloze probability" of a word w in context c refers to the proportion of people who choose w to complete c. We will treat this as the best available gold standard for human prediction in context—humans completing the cloze task typically are not under any time pressure, so they have the opportunity to use all available information from the context to arrive at a prediction.
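To make the cloze measure concrete, the following is a minimal sketch of how cloze probability can be computed from a set of human completions. The responses shown are hypothetical illustrations, not data from the studies used here.

```python
from collections import Counter

def cloze_probabilities(responses):
    """Cloze probability of each word: the proportion of respondents
    who produced that word to complete the given context."""
    counts = Counter(word.lower() for word in responses)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical completions from ten people for a single context.
responses = ["touchdown"] * 8 + ["goal", "score"]
print(cloze_probabilities(responses))
# {'touchdown': 0.8, 'goal': 0.1, 'score': 0.1}
```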
N400 Amplitude: The second measure of human expectation is a brain response known as the N400, which is detected by measuring electrical activity at the scalp (by electroencephalography). Like cloze, the N400 can be used to gauge how expected a word w is in a context c—the amplitude of the N400 response appears to be sensitive to fit of a word in context, and has been shown to correlate with cloze in many cases (Kutas and Hillyard, 1984). The N400 has also been shown to be predicted by LM probabilities (Frank et al., 2013). However, the N400 differs from cloze in being a real-time response that occurs only 400 milliseconds into the processing of a word. Accordingly, the expectations reflected in the N400 sometimes deviate from the more fully formed expectations reflected in the untimed cloze response.

4.2 Our Diagnostic Tests

The test sets that we use here are all drawn from human studies that have revealed divergences between cloze and N400 profiles—that is, for each of these tests, the N400 response suggests a level of insensitivity to certain information when computing expectations, causing a deviation from the fully informed cloze predictions. We choose these as our diagnostics because they provide built-in sensitivity tests targeting the types of information that appear to have reduced effect on the N400—and because they should present particularly challenging prediction tasks, tripping up models that fail to use the full set of available information.


Context | Expected | Inappropriate
He complained that after she kissed him, he couldn't get the red color off his face. He finally just asked her to stop wearing that ___ | lipstick | mascara, bracelet
He caught the pass and scored another touchdown. There was nothing he enjoyed more than a good game of ___ | football | baseball, monopoly

Table 1: Example items from CPRAG-102 dataset.

5 Datasets

Each of our diagnostics supports three types of testing: word prediction accuracy, sensitivity testing, and qualitative prediction analysis. Because these items are designed to draw conclusions about human processing, each set is carefully constructed to constrain the information relevant for making word predictions. This allows us to examine how well LMs use this target information.

For word prediction accuracy, we use the most expected items from human cloze probabilities as the gold completions.[2] These represent predictions that models should be able to make if they access and apply all relevant context information when generating probabilities for target words.

For sensitivity testing, we compare model probabilities for good versus bad completions—specifically, comparisons on which the N400 showed reduced sensitivity in experiments. This allows us to test whether LMs will show similar insensitivities on the relevant linguistic distinctions.

Finally, because these items are constructed in such a controlled manner, qualitative analysis of models' top predictions can be highly informative about information being applied for prediction. We leverage this in our experiments detailed in Sections 6–9.

In all tests, the target word to be predicted falls in the final position of the provided context, which means that these tests should function similarly for either left-to-right or bidirectional LMs. Similarly, because these tests require only that a model can produce token probabilities in context, they are equally applicable to the masked LM setting of BERT as to a standard LM. In anticipation of testing the BERT model, and to facilitate fair future comparisons with the present results, we filter out items for which the expected word is not in BERT's single-word vocabulary, to ensure that all expected words can be predicted.

It is important to acknowledge that these are small test sets, limited in size due to their origin in psycholinguistic studies. However, because these sets have been hand-designed by cognitive scientists to test predictive processing in humans, their value is in the targeted assessment that they provide with respect to information that LMs use in prediction.

We now describe each test set in detail.

[2] With one exception, NEG-136, for which we use completion truth, as in the original study.
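As an illustration of the vocabulary filtering step described above, the sketch below checks whether a candidate completion is a single WordPiece token, so that it can occupy one [MASK] position. It assumes the HuggingFace transformers library and the bert-base-uncased tokenizer, which is not the pytorch-pretrained-BERT package used for the original experiments, so treat it as a rough equivalent rather than the released filtering code.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def in_single_word_vocab(word):
    """True if the lowercased word maps to exactly one WordPiece token,
    meaning BERT can predict it at a single [MASK] position."""
    return len(tokenizer.tokenize(word.lower())) == 1

# Keep an item only if every expected completion passes the check.
candidates = ["lipstick", "touchdown", "stone-washed"]
print({w: in_single_word_vocab(w) for w in candidates})
```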
5.1 CPRAG-102: Commonsense and Pragmatic Inference

Our first set targets commonsense and pragmatic inference, and tests sensitivity to differences within semantic category. The left column of Table 1 shows examples of these items, each of which consists of two sentences. These items come from an influential human study by Federmeier and Kutas (1999), which tested how brains would respond to different types of context completions, shown in the right columns of Table 1.

Information Needed for Prediction: Accurate prediction on this set requires use of commonsense reasoning to infer what is being described in the first sentence, and pragmatic reasoning to determine how the second sentence relates. For instance, in Table 1, commonsense knowledge informs us that red color left by kisses suggests lipstick, and pragmatic reasoning allows us to infer that the thing to stop wearing is related to the complaint. As in LAMBADA, the final sentence is generic, not supporting prediction on its own. Unlike LAMBADA, the consistent structure of these items allows us to target specific model capabilities;[3] additionally, none of these items contain the target word in context,[4] forcing models to use commonsense inference rather than coreference. Human cloze probabilities show a high level of agreement on appropriate completions for these items—average cloze probability for expected completions is .74.

[3] To highlight this advantage, as a supplement for this test set we provide specific annotations of each item, indicating the knowledge/reasoning required to make the prediction.
[4] More than 80% of LAMBADA items contain the target word in the preceding context.


Sensitivity Test: The Federmeier and Kutas (1999) study found that while the inappropriate completions (e.g., mascara, bracelet) had cloze probabilities of virtually zero (average cloze .004 and .001, respectively), the N400 showed some expectation for completions that shared a semantic category with the expected completion (e.g., mascara, by relation to lipstick). Our sensitivity test targets this distinction, testing whether LMs will favor inappropriate completions based on shared semantic category with expected completions.

Data: The authors of the original study make available 40 of their contexts—we filter out six to accommodate BERT's single-word vocabulary,[5] for a final set of 34 contexts, 102 total items.[6]

[5] For a couple of items, we also replace an inappropriate completion with another inappropriate completion of the same semantic category to accommodate BERT's vocabulary.
[6] Our "item" counts use all context/completion pairings.

Context | Compl.
the restaurant owner forgot which customer the waitress had ___ | served
the restaurant owner forgot which waitress the customer had ___ | served

Table 2: Example items from ROLE-88 dataset. Compl. = Context Completion.

5.2 ROLE-88: Event Knowledge and Semantic Role Sensitivity

Our second set targets event knowledge and semantic role interpretation, and tests sensitivity to impact of role reversals. Table 2 shows an example item pair from this set. These items come from a human experiment by Chow et al. (2016), which tested the brain's sensitivity to role reversals.

Information Needed for Prediction: Accurate prediction on this set requires a model to interpret semantic roles from sentence syntax, and apply event knowledge about typical interactions between types of entities in the given roles. The set has reversals for each noun pair (shown in Table 2) so models must distinguish roles for each order.

Sensitivity Test: The Chow et al. (2016) study found that although each completion (e.g., served) is good for only one of the noun orders and not the reverse, the N400 shows a similar level of expectation for the target completions regardless of noun order. Our sensitivity test targets this distinction, testing whether LMs will show similar difficulty distinguishing appropriate continuations based on word order and semantic role. Human cloze probabilities show strong sensitivity to the role reversal, with average cloze difference of .233 between good and bad contexts for a given completion.

Data: The authors provide 120 sentences (60 pairs)—which we filter to 88 final items, removing pairs for which the best completion of either context is not in BERT's single-word vocabulary.

5.3 NEG-136: Negation

Our third set targets understanding of the meaning of negation, along with knowledge of category membership. Table 3 shows examples of these test items, which involve absence or presence of negation in simple sentences, with two different completions that vary in truth depending on the negation. These test items come from a human study by Fischler et al. (1983), which examined how human expectations change with the addition of negation.

Information Needed for Prediction: Because the negative contexts in these items are highly unconstraining (A robin is not a ___?), prediction accuracy is not a useful measure for the negative contexts. We test prediction accuracy for affirmative contexts only, which allows us to test models' use of hypernym information (a robin is a bird). Targeting of negation happens in the sensitivity test.


Context | Match | Mismatch
A robin is a ___ | bird | tree
A robin is not a ___ | bird | tree

Table 3: Example items from NEG-136-SIMP dataset.

Sensitivity Test: The Fischler et al. (1983) study found that although the N400 shows more expectation for true completions in affirmative sentences (e.g., A robin is a bird), it fails to adjust to negation, showing more expectation for false continuations in negative sentences (e.g., A robin is not a bird). Our sensitivity test targets this distinction, testing whether LMs will show similar insensitivity to impacts of negation. Note that here we use truth judgments rather than cloze probability as an indication of the quality of a completion.

Data: Fischler et al. provide the list of 18 subject nouns and 9 category nouns that they use for their sentences, which we use to generate a comparable dataset,[7] for a total of 72 items. We refer to these 72 simple sentences as NEG-136-SIMP. All target words are in BERT's single-word vocabulary.

Supplementary Items: In a subsequent study, Nieuwland and Kuperberg (2008) followed up on the Fischler et al. (1983) experiment, creating affirmative and negative sentences chosen to be more "natural ... for somebody to say", and contrasting these with affirmative and negative sentences chosen to be less natural. "Natural" items include examples like Most smokers find that quitting is (not) very (difficult/easy), while items designed to be less natural include examples like Vitamins and proteins are (not) very (good/bad). The authors share 16 base contexts, corresponding to 64 additional items, which we add to the original 72 for additional comparison. All target words are in BERT's single-word vocabulary. We refer to these supplementary 64 items, designed to test effects of naturalness, as NEG-136-NAT.

[7] The one modification that we make to the original subject noun list is a substitution of the word salmon for bass within the category of fish—because bass created lexical ambiguity that was not interesting for our purposes here.

6 Experiments

As a case study, we use these three diagnostics to examine the predictive capacities of the pre-trained BERT model (Devlin et al., 2019), which has been the basis of impressive performance across a wide range of tasks. BERT is a deep bidirectional transformer network (Vaswani et al., 2017) pre-trained on tasks of masked language modeling (predicting masked words given bidirectional context) and next-sentence prediction (binary classification of whether two sentences are a sequence). We test two versions of the pre-trained model: BERT-BASE and BERT-LARGE (uncased). These versions have the same basic architecture, but BERT-LARGE has more parameters—in total, BERT-BASE has 110M parameters, and BERT-LARGE has 340M. We use the PyTorch BERT implementation with masked language modeling parameters for generating word predictions.[8]

For testing, we process our sentence contexts to have a [MASK] token—also used during BERT's pre-training—in the target position of interest. We then measure BERT's predictions for this [MASK] token's position. Following Goldberg (2019), we also add a [CLS] token to the start of each sentence to mimic BERT's training conditions.

BERT differs from traditional left-to-right language models, and from real-time human predictions, in being a bidirectional model able to use information from both left and right context. This difference should be neutralized by the fact that our items provide all information in the left context—however, in our experiments here, we do allow one advantage for BERT's bidirectionality: We include a period and a [SEP] token after each [MASK] token, to indicate that the target position is followed by the end of the sentence. We do this in order to give BERT the best possible chance of success, by maximizing the chance of predicting a single word rather than the start of a phrase. Items for these experiments thus appear as follows:

[CLS] The restaurant owner forgot which customer the waitress had [MASK] . [SEP]

Logits produced by the language model for the target position are softmax-transformed to obtain probabilities comparable to human cloze probability values for those target positions.[9]

[8] https://github.com/huggingface/pytorch-pretrained-BERT.
[9] Human cloze probabilities are importantly different from true probabilities over a vocabulary, making these values not directly comparable. However, cloze provides important indication—the best indication we have—of how much a context constrains human expectations toward a continuation, so we do at times loosely compare these two types of values.
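As a concrete illustration of this setup, the sketch below retrieves the softmax-transformed distribution over BERT's vocabulary at the [MASK] position for the item shown above. It uses the current HuggingFace transformers masked-LM interface rather than the pytorch-pretrained-BERT package cited in Footnote [8], so class names and defaults here are assumptions about a rough modern equivalent, not a record of the original implementation.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Item formatted exactly as described above, with explicit special tokens.
text = "[CLS] The restaurant owner forgot which customer the waitress had [MASK] . [SEP]"
tokens = tokenizer.tokenize(text)          # special tokens are kept intact
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
mask_index = tokens.index("[MASK]")

with torch.no_grad():
    logits = model(input_ids).logits       # (1, seq_len, vocab_size)

# Softmax-transform the logits at the [MASK] position into probabilities.
probs = torch.softmax(logits[0, mask_index], dim=-1)
top_probs, top_ids = probs.topk(5)
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
print(top_probs.tolist())
```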


Model | Orig | Shuf | Trunc | Shuf + Trunc
BERT-BASE, k=1 | 23.5 | 14.1 ± 3.1 | 14.7 | 8.1 ± 3.4
BERT-LARGE, k=1 | 35.3 | 17.4 ± 3.5 | 17.6 | 10.0 ± 3.0
BERT-BASE, k=5 | 52.9 | 36.1 ± 2.8 | 35.3 | 22.1 ± 3.2
BERT-LARGE, k=5 | 52.9 | 39.2 ± 3.9 | 32.4 | 21.3 ± 3.7

Table 4: CPRAG-102 word prediction accuracies (with and without sentence perturbations). Shuf = first sentence shuffled; Trunc = second sentence truncated to two words before target.

7 Results for CPRAG-102

First we report BERT's results on the CPRAG-102 test targeting common sense, pragmatic reasoning, and sensitivity within semantic category.

7.1 Word Prediction Accuracies

We define accuracy as percentage of items for which the "expected" completion is among the model's top k predictions, with k=1 and k=5. Table 4 ("Orig") shows accuracies of BERT-BASE and BERT-LARGE. For accuracy at k=1, BERT-LARGE soundly outperforms BERT-BASE, with correct predictions on just over a third of contexts. Expanding to k=5, the models converge on the same accuracy, identifying the expected completion for about half of contexts.[10]

Because commonsense and pragmatic reasoning are non-trivial concepts to pin down, it is worth asking to what extent BERT can achieve this performance based on simpler cues like word identities or n-gram context. To test importance of word order, we shuffle the words in each item's first sentence, garbling the message but leaving all individual words intact ("Shuf" in Table 4). To test adequacy of n-gram context, we truncate the second sentence, removing all but the two words preceding the target word ("Trunc")—leaving generally enough syntactic context to identify the part of speech, as well as some sense of semantic category (on top of the thematic setup of the first sentence), but removing other information from that second sentence. We also test with both perturbations together ("Shuf + Trunc"). Because different shuffled word orders give rise to different results, for the "Shuf" and "Shuf + Trunc" settings we show mean and standard deviation from 100 runs.

Table 4 shows the accuracies as a result of these perturbations. One thing that is immediately clear is that the BERT model is indeed making use of information provided by the word order of the first sentence, and by the more distant content of the second sentence, as each of these individual perturbations causes a notable drop in accuracy. It is worth noting, however, that with each perturbation there is a subset of items for which BERT's accuracy remains intact. Unsurprisingly, many of these items are those containing particularly distinctive words associated with the target, such as checkmate (chess), touchdown (football), and stone-washed (jeans). This suggests that some of BERT's success on these items may be attributable to simpler lexical or n-gram information. In Section 7.3 we take a closer look at some more difficult items that seemingly avoid such loopholes.

[10] Note that word accuracies are computed by context, so these accuracies are out of the 34 base contexts.
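The accuracy measure and the two perturbations reduce to simple string and list operations; a minimal sketch is below. The helper names and the assumption that each item is exactly two sentences separated by ". " are illustrative simplifications, not the released experiment code, and model predictions would come from a scoring function such as the masked-LM sketch in Section 6.

```python
import random

def shuffle_first_sentence(item, rng=random):
    """'Shuf': shuffle the words of the first sentence, leaving the
    second (prediction) sentence intact."""
    first, second = item.split(". ", 1)
    words = first.split()
    rng.shuffle(words)
    return " ".join(words) + ". " + second

def truncate_second_sentence(item):
    """'Trunc': keep the first sentence plus only the two words
    immediately preceding the target position."""
    first, second = item.split(". ", 1)
    return first + ". " + " ".join(second.split()[-2:])

def accuracy_at_k(top_k_predictions, expected, k):
    """Percentage of contexts whose expected completion is in the top k."""
    hits = sum(exp in preds[:k] for preds, exp in zip(top_k_predictions, expected))
    return 100.0 * hits / len(expected)

item = ("He caught the pass and scored another touchdown. "
        "There was nothing he enjoyed more than a good game of")
print(shuffle_first_sentence(item))
print(truncate_second_sentence(item))

# Toy illustration with made-up model predictions for one context:
print(accuracy_at_k([["football", "baseball", "soccer", "chess", "golf"]],
                    ["football"], k=1))
```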


7.2 Completion Sensitivity

Next we test BERT's ability to prefer expected completions over inappropriate completions of the same semantic category. We first test this by simply measuring the percentage of items for which BERT assigns a higher probability to the good completion (e.g., lipstick from Table 1) than to either of the inappropriate completions (e.g., mascara, bracelet). Table 5 shows the results. We see that BERT-BASE assigns the highest probability to the expected completion in 73.5% of items, whereas BERT-LARGE does so for 79.4%—a solid majority, but with a clear portion of items for which an inappropriate, semantically related target does receive a higher probability than the appropriate word.

Model | Prefer good | w/ .01 thresh
BERT-BASE | 73.5 | 44.1
BERT-LARGE | 79.4 | 58.8

Table 5: Percent of CPRAG-102 items with good completion assigned higher probability than bad.

We can make our criterion slightly more stringent if we introduce a threshold on the probability difference. The average cloze difference between good and bad completions is about .74 for the data from which these items originate, reflecting a very strong human sensitivity to the difference in completion quality. To test the proportion of items in which BERT assigns more substantially different probabilities, we filter to items for which the good completion probability is higher by greater than .01—a threshold chosen to be very generous given the significant average cloze difference. With this threshold, the sensitivity drops noticeably—BERT-BASE shows sensitivity in only 44.1% of items, and BERT-LARGE shows sensitivity in only 58.8%. These results tell us that although the models are able to prefer good completions to same-category bad completions in a majority of these items, the difference is in many cases very small, suggesting that this sensitivity falls short of what we see in human cloze responses.
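One way to implement this comparison, including the .01 margin, is sketched below; the probabilities in the example call are invented for illustration, and would in practice come from the masked-LM distribution described in Section 6.

```python
def prefers_good(p_good, p_bad_list, threshold=0.0):
    """True if the good completion's probability exceeds every
    inappropriate completion's probability by more than `threshold`."""
    return all(p_good - p_bad > threshold for p_bad in p_bad_list)

def sensitivity(items, threshold=0.0):
    """Percentage of items on which the model prefers the good completion."""
    wins = sum(prefers_good(p_good, p_bads, threshold) for p_good, p_bads in items)
    return 100.0 * wins / len(items)

# Invented (P(good), [P(bad), ...]) pairs: the second item passes the simple
# comparison but not the .01 threshold, illustrating how sensitivity can drop.
items = [(0.31, [0.02, 0.01]), (0.12, [0.115, 0.03])]
print(sensitivity(items))                  # 100.0
print(sensitivity(items, threshold=0.01))  # 50.0
```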

7.3 Qualitative Examination of Predictions

We thus see that the BERT models are able to identify the correct word completions in approximately half of CPRAG-102 items, and that the models are able to prefer good completions to semantically related inappropriate completions in a majority of items, though with notably weaker sensitivity than humans. To better understand the models' weaknesses, in this section we examine predictions made when the models fail.

Context | BERT-LARGE predictions
Pablo wanted to cut the lumber he had bought to make some shelves. He asked his neighbor if he could borrow her ___ | car, house, room, truck, apartment
The snow had piled up on the drive so high that they couldn't get the car out. When Albert woke up, his father handed him a ___ | note, letter, gun, blanket, newspaper
At the zoo, my sister asked if they painted the black and white stripes on the animal. I explained to her that they were natural features of a ___ | cat, person, human, bird, species

Table 6: BERT-LARGE top word predictions for selected CPRAG-102 items.

Table 6 shows three example items along with the top five predictions of BERT-LARGE. In each case, BERT provides completions that are sensible in the context of the second sentence, but that fail to take into account the context provided by the first sentence—in particular, the predictions show no evidence of having been able to infer the relevant information about the situation or object described in the first sentence. For instance, we see in the first example that BERT has correctly zeroed in on things that one might borrow, but it fails to infer that the thing to be borrowed is something to be used for cutting lumber. Similarly, BERT's failure to detect the snow-shoveling theme of the second item makes for an amusing set of non sequitur completions. Finally, the third example shows that BERT has identified an animal theme (unsurprising, given the words zoo and animal), but it is not applying the phrase black and white stripes to identify the appropriate completion of zebra. Altogether, these examples illustrate that with respect to the target capacities of commonsense inference and pragmatic reasoning, BERT fails in these more challenging cases.


8 Results for ROLE-88

Next we turn to the ROLE-88 test of semantic role sensitivity and event knowledge.

8.1 Word Prediction Accuracies

We again define accuracy by presence of a top cloze item within the model's top k predictions. Table 7 ("Orig") shows the accuracies for BERT-LARGE and BERT-BASE. For k=1, accuracies are very low, with BERT-BASE slightly outperforming BERT-LARGE. When we expand to k=5, accuracies predictably increase, and BERT-LARGE now outperforms BERT-BASE by a healthy margin.

Model | Orig | -Obj | -Sub | -Both
BERT-BASE, k=1 | 14.8 | 12.5 | 12.5 | 9.1
BERT-LARGE, k=1 | 13.6 | 5.7 | 6.8 | 4.5
BERT-BASE, k=5 | 27.3 | 26.1 | 22.7 | 18.2
BERT-LARGE, k=5 | 37.5 | 18.2 | 21.6 | 14.8

Table 7: ROLE-88 word prediction accuracies (with and without sentence perturbations). -Obj = generic object; -Sub = generic subject; -Both = generic object and subject.

To test the extent to which BERT relies on the individual nouns in the context, we try two different perturbations of the contexts: removing the information from the object noun, and removing the information from the subject noun, in each case by replacing the noun with a generic substitute. We choose one and other as substitutions for the object and subject, respectively.

Table 7 shows the results with each of these perturbations individually and together. We observe several notable patterns. First, removing either the object ("-Obj") or the subject ("-Sub") has relatively little effect on the accuracy of BERT-BASE for either k=1 or k=5. This is quite different from what we see with BERT-LARGE, the accuracy of which drops substantially when the object or subject information is removed. These patterns suggest that BERT-BASE is less dependent upon the full detail of the subject-object structure, instead relying primarily upon one or the other of the participating nouns for its verb predictions. BERT-LARGE, on the other hand, appears to make heavier use of both nouns, such that loss of either one causes non-trivial disruption in the predictive accuracy.

It should be noted that the items in this set are overall less constraining than those in Section 7—humans converge less clearly on the same predictions, resulting in lower average cloze values for the best completions. To investigate the effect of constraint level, we divide items into four bins by top cloze value per sentence. Table 8 shows the results. With the exception of BERT-BASE at k=1, for which accuracy in all bins is fairly low, it is clear that the highest cloze bin yields much higher model accuracies than the other three bins, suggesting some alignment between how constraining contexts are for humans and how constraining they are for BERT. However, even in the highest cloze bin, when at least a third of humans converge on the same completion, even BERT-LARGE at k=5 is only correct in half of cases, suggesting substantial room for improvement.[11]

Model | ≤.17 | ≤.23 | ≤.33 | ≤.77
BERT-BASE, k=1 | 12.0 | 17.4 | 17.4 | 11.8
BERT-LARGE, k=1 | 8.0 | 4.3 | 17.4 | 29.4
BERT-BASE, k=5 | 24.0 | 26.1 | 21.7 | 41.1
BERT-LARGE, k=5 | 28.0 | 34.8 | 39.1 | 52.9

Table 8: Accuracy of predictions in unperturbed ROLE-88 sentences, binned by max cloze of context.

[11] This analysis is made possible by the Chow et al. (2016) authors' generous provision of the cloze data for these items, not originally made public with the items themselves.
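A sketch of the generic-substitution perturbations for the which-N1-the-N2 frame is below. It assumes items are stored with their subject and object nouns already identified (the dictionary fields shown are hypothetical), since recovering them by parsing is beside the point here.

```python
def remove_object(context, object_noun):
    """'-Obj': replace the object noun with the generic substitute 'one'."""
    return context.replace(object_noun, "one", 1)

def remove_subject(context, subject_noun):
    """'-Sub': replace the subject noun with the generic substitute 'other'."""
    return context.replace(subject_noun, "other", 1)

item = {"context": "the restaurant owner forgot which customer the waitress had",
        "object": "customer", "subject": "waitress"}

print(remove_object(item["context"], item["object"]))
print(remove_subject(item["context"], item["subject"]))
print(remove_subject(remove_object(item["context"], item["object"]),
                     item["subject"]))  # '-Both'
```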
8.2 Completion Sensitivity

Next we test BERT's sensitivity to role reversals by comparing model probabilities for a given completion (e.g., served) in the appropriate versus role-reversed contexts. We again start by testing the percentage of items for which BERT assigns a higher probability to the appropriate than to the inappropriate completion. As we see in Table 9, BERT-BASE prefers the good continuation in 75% of items, whereas BERT-LARGE does so for 86.4%—comparable to the proportions for CPRAG-102. However, when we apply our threshold of .01 (still generous given the average cloze difference of .233), sensitivity drops more dramatically than on CPRAG-102, to 31.8% and 43.2%, respectively.

Model | Prefer good | w/ .01 thresh
BERT-BASE | 75.0 | 31.8
BERT-LARGE | 86.4 | 43.2

Table 9: Percent of ROLE-88 items with good completion assigned higher probability than role reversal.

Overall, these results suggest that BERT is, in a majority of cases of this kind, able to use noun position to prefer good verb completions to bad—however, it is again less sensitive than humans to these distinctions, and it fails to match human word predictions on a solid majority of cases. The model's ability to choose good completions over role reversals (albeit with weak sensitivity) suggests that the failures on word prediction accuracy are not due to inability to distinguish word orders, but rather to a weakness in event knowledge or understanding of semantic role implications.


Context | BERT-BASE predictions | BERT-LARGE predictions
the camper reported which girl the bear had ___ | taken, killed, attacked, bitten, picked | attacked, killed, eaten, taken, targeted
the camper reported which bear the girl had ___ | taken, killed, fallen, bitten, jumped | taken, left, entered, found, chosen
the restaurant owner forgot which customer the waitress had ___ | served, hired, brought, been, taken | served, been, delivered, mentioned, brought
the restaurant owner forgot which waitress the customer had ___ | served, been, chosen, ordered, hired | served, chosen, called, ordered, been

Table 10: BERT-BASE and BERT-LARGE top word predictions for selected ROLE-88 sentences.

8.3 Qualitative Examination of Predictions

Table 10 shows predictions of BERT-BASE and BERT-LARGE for some illustrative examples. For the girl/bear items, we see that BERT-BASE favors continuations like killed and bitten with bear as subject, but also includes these continuations with girl as subject. BERT-LARGE, by contrast, excludes these continuations when girl is the subject.

In the second pair of sentences we see that the models choose served as the top continuation under both word orders, even though for the second word order this produces an unlikely scenario. In both cases, the model's assigned probability for served is much higher for the appropriate word order than the inappropriate one—a difference of .6 for BERT-LARGE and .37 for BERT-BASE—but it is noteworthy that no more semantically appropriate top continuation is identified by either model for which waitress the customer had ___.

As a final note, although the continuations are generally impressively grammatical, we see exceptions in the second bear/girl sentence—both models produce completions of questionable grammaticality (or at least questionable use of selection restrictions), with sentences like which bear the girl had fallen from BERT-BASE, and which bear the girl had entered from BERT-LARGE.

9 Results for NEG-136

Finally, we turn to the NEG-136 test of negation and category membership.

9.1 Word Prediction Accuracies

We start by testing the ability of BERT to predict correct category continuations for the affirmative contexts in NEG-136-SIMP. Table 11 shows the accuracy results for these affirmative sentences.

Model | Accuracy
BERT-BASE, k=1 | 38.9
BERT-LARGE, k=1 | 44.4
BERT-BASE, k=5 | 100
BERT-LARGE, k=5 | 100

Table 11: Accuracy of word predictions in NEG-136-SIMP affirmative sentences.

We see that for k=5, the correct category is predicted for 100% of affirmative items, suggesting an impressive ability of both BERT models to associate nouns with their correct immediate hypernyms. We also see that the accuracy drops substantially when assessed on k=1. Examination of predictions reveals that these errors consist exclusively of cases in which BERT completes the sentence with a repetition of the subject noun, e.g., A daisy is a daisy—which is certainly true, but which is not a likely or informative sentence.


9.2 Completion Sensitivity

We next assess BERT's sensitivity to the meaning of negation, by measuring the proportion of items in which the model assigns higher probabilities to true completions than to false ones.

Model | Affirmative | Negative
BERT-BASE | 100 | 0.0
BERT-LARGE | 100 | 0.0

Table 12: Percent of NEG-136-SIMP items with true completion assigned higher probability than false.

Table 12 shows the results, and the pattern is stark. When the statement is affirmative (A robin is a ___), the models assign higher probability to the true completion in 100% of items. Even with the threshold of .01—which eliminated many comparisons on CPRAG-102 and ROLE-88—all items pass but one (for BERT-BASE), suggesting a robust preference for the true completions. However, in the negative statements (A robin is not a ___), BERT prefers the true completion in 0% of items, assigning the higher probability to the false completion in every case. This shows a strong insensitivity to the meaning of negation, with BERT preferring the category match completion every time, despite its falsity.

9.3 Qualitative Examination of Predictions

Context | BERT-LARGE predictions
A robin is a ___ | bird, robin, person, hunter, pigeon
A daisy is a ___ | daisy, rose, flower, berry, tree
A hammer is a ___ | hammer, tool, weapon, nail, device
A hammer is an ___ | object, instrument, axe, implement, explosive
A robin is not a ___ | robin, bird, penguin, man, fly
A daisy is not a ___ | daisy, rose, flower, lily, cherry
A hammer is not a ___ | hammer, weapon, tool, gun, rock
A hammer is not an ___ | object, instrument, axe, animal, artifact

Table 13: BERT-LARGE top word predictions for selected NEG-136-SIMP sentences.

Table 13 shows examples of the predictions made by BERT-LARGE in positive and negative contexts. We see a clear illustration of the phenomenon suggested by the earlier results: For affirmative sentences, BERT produces generally true completions (at least in the top two)—but these completions remain largely unchanged after negation is added, resulting in many blatantly untrue completions.

Another interesting phenomenon that we can observe in Table 13 is BERT's sensitivity to the nature of the determiner (a or an) preceding the masked word. This determiner varies depending on whether the upcoming target begins with a vowel or a consonant (for instance, our mismatched category paired with hammer is insect), and so the model can potentially use this cue to filter the predictions to those starting with either vowels or consonants. How effectively does BERT use this cue? The predictions indicate that BERT is for the most part extremely good at using this cue to limit to words that begin with the right type of letter. There are certain exceptions (e.g., An ant is not a ant), but these are in the minority.

9.4 Increasing Naturalness

The supplementary NEG-136-NAT items allow us to examine further the model's handling of negation, with items designed to test the effect of "naturalness". When we present BERT with this new set of sentences, the model does show an apparent change in sensitivity to the negation. BERT-BASE assigns true statements higher probability than false for 75% of natural sentences ("NT"), and BERT-LARGE does so for 87.5% of natural sentences. By contrast, the models each show preference for true statements in only 37.5% of items designed to be less natural ("LN").

Model | Aff. NT | Neg. NT | Aff. LN | Neg. LN
BERT-BASE | 62.5 | 87.5 | 75.0 | 0.0
BERT-LARGE | 75.0 | 100 | 75.0 | 0.0

Table 14: Percent of NEG-136-NAT with true continuation given higher probability than false. Aff = affirmative; Neg = negative; NT = natural; LN = less natural.

Table 14 shows these sensitivities broken down by affirmative and negative conditions. Here we see that in the natural sentences, BERT prefers true statements for both affirmative and negative contexts—by contrast, the less natural sentences show the pattern exhibited on NEG-136-SIMP, in which BERT prefers true statements in a high proportion of affirmative sentences, and in 0% of negative sentences, suggesting that once again BERT is defaulting to category matches with the subject.


Context | BERT-LARGE predictions
Most smokers find that quitting is very ___ | difficult, easy, effective, dangerous, hard
Most smokers find that quitting isn't very ___ | effective, easy, attractive, difficult, successful
A fast food dinner on a first date is very ___ | good, nice, common, romantic, attractive
A fast food dinner on a first date isn't very ___ | nice, good, romantic, appealing, exciting

Table 15: BERT-LARGE top word predictions for selected NEG-136-NAT sentences.

Table 15 contains BERT-LARGE predictions on two pairs of sentences from the "Natural" sentence set. It is worth noting that even when BERT's first prediction is appropriate in the context, the top candidates often contradict each other (e.g., difficult and easy). We also see that even with these natural items, sometimes the negation is not enough to reverse the top completions, as with the second pair of sentences, in which the fast food dinner both is and isn't a romantic first date.

10 Discussion

Our three diagnostics allow for a clarified picture of the types of information used for predictions by pre-trained BERT models. On CPRAG-102, we see that both models can predict the best completion approximately half the time (at k=5), and that both models rely non-trivially on word order and full sentence context. However, successful predictions in the face of perturbations also suggest that some of BERT's success on these items may exploit loopholes, and when we examine predictions on challenging items, we see clear weaknesses in the commonsense and pragmatic inferences targeted by this set. Sensitivity tests show that BERT can also prefer good completions to bad semantically related completions in a majority of items, but many of these probability differences are very small, suggesting that the model's sensitivity is much less than that of humans.

On ROLE-88, BERT's accuracy in matching top human predictions is much lower, with BERT-LARGE at only 37.5% accuracy. Perturbations reveal interesting model differences, suggesting that BERT-LARGE has more sensitivity than BERT-BASE to the interaction between subject and object nouns. Sensitivity tests show that both models are typically able to use noun position to prefer good completions to role reversals, but the differences are on average even smaller than on CPRAG-102, indicating again that model sensitivity to the distinctions is less than that of humans. The models' general ability to distinguish role reversals suggests that the low word prediction accuracies are not due to insensitivity to word order per se, but rather to weaknesses in event knowledge or understanding of semantic role implications.

Finally, NEG-136 allows us to zero in with particular clarity on a divergence between BERT's predictive behavior and what we might expect from a model using all available information about word meaning and truth/falsity. When presented with simple sentences describing category membership, BERT shows a complete inability to prefer true over false completions for negative sentences. The model shows an impressive ability to associate subject nouns with their hypernyms, but when negation reverses the truth of those hypernyms, BERT continues to predict them nonetheless. By contrast, when presented with sentences that are more "natural", BERT does reliably prefer true completions to false, with or without negation. Although these latter sentences are designed to differ in naturalness, in all likelihood it is not naturalness per se that drives the model's relative success on them—but rather a higher frequency of these types of statements in the training data.


The latter result in particular serves to highlight a stark, but ultimately unsurprising, observation about what these pre-trained language models bring to the table. Whereas the function of language processing for humans is to compute meaning and make judgments of truth, language models are trained as predictive models—they will simply leverage the most reliable cues in order to optimize their predictive capacity. For a phenomenon like negation, which is often not conducive to clear predictions, such models may not be equipped to learn the implications of this word's meaning.

11 Conclusion

In this paper we have introduced a suite of diagnostic tests for language models to better our understanding of the linguistic competencies acquired by pre-training via language modeling. We draw our tests from psycholinguistic studies, allowing us to target a range of linguistic capacities by testing word prediction accuracies and sensitivity of model probabilities to linguistic distinctions. As a case study, we apply these tests to analyze strengths and weaknesses of the popular BERT model, finding that it shows sensitivity to role reversal and same-category distinctions, albeit less than humans, and it succeeds with noun hypernyms, but it struggles with challenging inferences and role-based event prediction—and it shows clear failures with the meaning of negation. We make all test sets and experiment code available (see Footnote 1), for further experiments.

The capacities targeted by these test sets are by no means comprehensive, and future work can build on the foundation of these datasets to expand to other aspects of language processing. Because these sets are small, we must also be conservative in the strength of our conclusions—different formulations may yield different performance, and future work can expand to verify the generality of these results. In parallel, we hope that the weaknesses highlighted by these diagnostics can help to identify areas of need for establishing robust and generalizable models for language understanding.

Acknowledgments

We would like to thank Tal Linzen, Kevin Gimpel, Yoav Goldberg, Marco Baroni, and several anonymous reviewers for valuable feedback on earlier versions of this paper. We also thank members of the Toyota Technological Institute at Chicago for useful discussion of these and related issues.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. International Conference on Learning Representations.

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393.

Luisa Bentivogli, Raffaella Bernardi, Marco Marelli, Stefano Menini, Marco Baroni, and Roberto Zamparelli. 2016. SICK through the SemEval glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Language Resources and Evaluation, 50(1):95–124.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Wing-Yee Chow, Cybelle Smith, Ellen Lau, and Colin Phillips. 2016. A 'bag-of-arguments' mechanism for initial verb predictions. Language, Cognition and Neuroscience, 31(5):577–596.

Shammur Absar Chowdhury and Roberto Zamparelli. 2018. RNN simulations of grammaticality judgments on long-distance dependencies. In Proceedings of the 27th International Conference on Computational Linguistics, pages 133–144.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.


Alexis Conneau, German Kruszewski, Guillaume Lample, Loic Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, pages 2126–2136.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J. Gershman, and Noah D. Goodman. 2018. Evaluating compositionality in sentence embeddings. Proceedings of the 40th Annual Meeting of the Cognitive Science Society.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik. 2018. Assessing composition in sentence vector representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1790–1801.

Kara D. Federmeier and Marta Kutas. 1999. A rose by any other name: Long-term memory structure and sentence processing. Journal of Memory and Language, 41(4):469–495.

Ira Fischler, Paul A. Bloom, Donald G. Childers, Salim E. Roucos, and Nathan W. Perry Jr. 1983. Brain potentials related to stages of sentence verification. Psychophysiology, 20(4):400–409.

Stefan L. Frank, Leun J. Otten, Giulia Galli, and Gabriella Vigliocco. 2013. Word surprisal predicts N400 amplitude during reading. In ACL 2013 - 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, volume 2, pages 878–883.

Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1195–1205.

Jaap Jumelet and Dieuwke Hupkes. 2018. Do language models understand anything? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 222–231.

Najoung Kim, Roma Patel, Adam Poliak, Patrick Xia, Alex Wang, Tom McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, Samuel R. Bowman, and Ellie Pavlick. 2019. Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 235–249.

Marta Kutas and Steven A. Hillyard. 1984. Brain potentials during reading reflect word expectancy and semantic association. Nature, 307(5947):161.

Yair Lakretz, Germán Kruszewski, Théo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 11–20.


Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting inside BERT's linguistic knowledge. arXiv preprint arXiv:1906.01698.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448.

Mante S. Nieuwland and Gina R. Kuperberg. 2008. When the truth is not too hard to handle: An event-related potential study on the pragmatics of negation. Psychological Science, 19(12):1213–1218.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1525–1534.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–81.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, page 353.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221.

