Arxiv:1912.13337V2 [Cs.CL] 1 Sep 2020

What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge Kyle Richardson and Ashish Sabharwal Allen Institute for AI, Seattle, WA, USA {kyler,ashishs}@allenai.org Abstract Benchmark Tasks 1.OpenBook QA (OBQA) (Mihaylov et al., 2018) Open-domain question answering (QA) in- Question: Which of the following is a [specific type of] learned behavior? A. thinking B. cooking C. hearing D. breathing volves many knowledge and reasoning chal- 2. ARC Challenge (Clark et al., 2018) lenges, but are successful QA models actu- Question: What is a worldwide increase in temperature called [definition]? ally learning such knowledge when trained A. greenhouse effect B. global warming C. ozone depletion D. solar heating on benchmark QA tasks? We investigate Train this via several new diagnostic tasks probing Question-Answering Model whether multiple-choice QA models know Evaluate Continue Training definitions and taxonomic reasoning—two Dataset Probes skills widespread in existing benchmarks Question : In ‘the toddler could Question : In ‘the toddler could and fundamental to more complex reason- count’, count is best defined as count to 100’, count is a type of ing. We introduce a methodology for auto- A. name or recite the numbers... A. recite event B.... C.... matically building probe datasets from ex- Generate: GEN(τ) pert knowledge sources, allowing for sys- Expert Knowledge (Triples T ) (type-of,count.v.03,recite.v.02) tematic control and a comprehensive eval- (ex,count.v.02, count, ’the toddler could count’) uation. We include ways to carefully (defined-as,count.v.02, ’name or recite the numbers...’ ) (defined-as, recite.v.02, ’read aloud from memory.’ ) control for artifacts that may arise during this process. Our evaluation confirms Figure 1: An illustration of our experimental setup and that transformer-based multiple-choice QA probing methodology. The gray box at the top shows models are already predisposed to recognize questions from existing open-domain QA benchmarks, certain types of structural linguistic knowl- requiring background knowledge. The yellow box edge. However, it also reveals a more nu- shows simple examples of multiple-choice questions in anced picture: their performance notably our proposed Definition and ISA probes. degrades even with a slight increase in the number of “hops” in the underlying taxo- Recent success in QA has been driven largely nomic hierarchy, and with more challeng- by new benchmarks (Zellers et al., 2018; Tal- ing distractor candidates. Further, existing models are far from perfect when assessed mor et al., 2019b; Bhagavatula et al., 2020; Khot at the level of clusters of semantically con- et al., 2020, etc.) and advances in model pre- nected probes, such as all hypernym ques- training (Radford et al., 2018; Devlin et al., 2019). tions about a single concept. This raises a natural question: Do state-of-the-art multiple-choice QA (MCQA) models that excel at arXiv:1912.13337v2 [cs.CL] 1 Sep 2020 standard benchmarks truly possess basic knowl- 1 Introduction edge and reasoning skills expected in these tasks? Automatically answering questions, especially in Answering this question is challenging due to the open-domain setting where minimal or no con- limited understanding of heavily pre-trained com- textual knowledge is explicitly provided, requires plex models and the way existing MCQA datasets considerable background knowledge and reason- are constructed. We focus on the second aspect, ing abilities. For example, answering the two which has two limitations: Large-scale crowd- questions in the top gray box in Figure1 requires sourcing leaves little systematic control over ques- identifying a specific ISA relation (that ‘cooking’ tion semantics or requisite background knowl- is a type of ‘learned behavior’) as well as recall- edge (Welbl et al., 2017), while questions from ing a concept definition (that ‘global warming’ is real exams tend to mix multiple challenges in defined as a ‘worldwide increase in temperature’). a single dataset, often even in a single question (Clark et al., 2018; Boratko et al., 2018). probes measure competence in various settings To address this challenge, we propose system- including hypernymy, hyponymy, and synonymy atically constructing model competence probes by detection, as well as word sense disambiguation. exploiting structured information contained in ex- Our exploration is closely related to the recent pert knowledge sources such as knowledge graphs work of Talmor et al.(2019a). However, a key dif- and lexical taxonomies. Importantly, these probes ference is that they study language models (LMs), are diagnostic tasks, designed not to impart new for which there is no clear a priori expectation of knowledge but to assess what models trained on specific knowledge or reasoning skills. In contrast, standard QA benchmarks already know; as such, we focus on models heavily trained for benchmark they serve as proxies for the types of questions QA tasks, where such tasks are known to require that a model might encounter in its original task, certain types of knowledge and reasoning skills. but involve a single category of knowledge under We probe whether such skills are actually learned various controlled conditions and perturbations. by QA models, either during LM pre-training or Figure1 illustrates our methodology. We start when training for the QA tasks. with a set of standard MCQA benchmark tasks D Recognizing the need for suitable controls in and a set of models M trained on D. Our goal is to any synthetic probing methodology (Hewitt and assess how competent these models are relative to Liang, 2019; Talmor et al., 2019a), we introduce a particular knowledge or reasoning skill S (e.g., two controls: (a) the probe must be challenging definitions) that is generally deemed important for for any model that lacks contextual embeddings, performing well on D. To this end, we systemati- and (b) strong models must have a low inoculation cally and automatically generate a set of dataset cost, i.e., when fine-tuned on a few probing ex- probes PS from information available in expert amples, the model should mostly retain its perfor- knowledge sources. Each probe is an MCQA ren- mance on its original task.2 This ensures that the dering of the target information (see examples in probe performance of a model, even when lightly Figure1, yellow box). We then use these probes inoculated on probing data, reflects its knowledge PS to ask two empirical questions: (1) How well as originally trained for the benchmark task, which do models in M already trained on D perform is precisely what we aim to uncover. on probing tasks PS? (2) With additional nudg- Constructing a wide range of systematic tests ing, can models be re-trained, using only a modest is critical for having definitive empirical evidence amount of additional data, to perform well on each of model competence on any given phenomenon. probing task PS with minimal performance loss Such tests should cover a broad set of concepts and on their original tasks D (thus giving evidence of question variations (i.e., systematic adjustments to prior model competence on S)? how the questions are constructed). When assess- While our methodology is general, our exper- ing ISA reasoning, not only is it important to rec- iments focus on probing state-of-the-art MCQA ognize in the question in Figure1 that cooking is a models in the domain of grade-school level sci- learned behavior, but also that cooking is a general ence, which is considered particularly challenging type of behavior or, through a few more inferential with respect to background knowledge and infer- steps, a type of human activity. Our automatic use ence (Clark, 2015; Clark et al., 2019; Khot et al., of expert knowledge sources allows constructing 2020). In addition, existing science benchmarks such high-coverage probes, circumventing pitfalls are known to involve widespread use of defini- of solicitation bias and reporting bias. tion and taxonomic knowledge (see detailed anal- Our results confirm that transformer-based QA ysis by Clark et al.(2018); Boratko et al.(2018)), models3 have a remarkable ability to recognize the which is also fundamental to deeper reasoning. types of knowledge captured in our probes—even Accordingly, we employ the most widely used lex- without additional fine-tuning (i.e., in a zero-shot ical ontology WordNet (Miller, 1995) and publicly setting). Such models can even outperform strong available dictionaries as sources of expert knowl- 2Standard inoculation (Liu et al., 2019a) is known to drop edge to construct our probes, WordNetQA (Sec- performance on the original task. We use a modified objec- 1 tion 3.1) and DictionaryQA (Section 3.2) . These tive (Richardson et al., 2020) to alleviate this issue. 3Different from Talmor et al.(2019a), we find BERT and 1All data and code are available at https://github. RoBERTa based QA models to be qualitatively similar, per- com/allenai/semantic_fragments forming within 5% of each other on nearly all probes. task-specific non-transformer models trained di- (2019), and other variants thereof, which are all rectly on our probing tasks (e.g., +26% compared based on the transformer architecture of Vaswani to a task-specific LSTM). We also show that the et al.(2017). There have been recent stud- same models can be effectively re-fine-tuned on ies into the types of relational knowledge con- small samples (even 100 examples) of probe data, tained in large-scale knowledge models (Schick and that high performance on the probes tends to and Schütze, 2020; Petroni et al., 2019; Jiang correlate with a smaller drop in the model’s per- et al., 2019), which also probe models using struc- formance on the original QA task. tured knowledge sources. These studies, how- Our comprehensive assessment also reveals im- ever, primarily focus on unearthing the knowl- portant nuances to the positive trend.

Load more