Is Multihop QA in DIRE Condition? Measuring and Reducing Disconnected Reasoning

Harsh Trivedi†∗ Niranjan Balasubramanian† Tushar Khot‡ Ashish Sabharwal‡

† Stony Brook University, Stony Brook, U.S.A. {hjtrivedi,niranjan}@cs.stonybrook.edu ‡ Allen Institute for AI, Seattle, U.S.A. {tushark,ashishs}@allenai.org

Abstract

Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the purpose of building multi-hop QA datasets. We make three contributions towards addressing this. First, we formalize such undesirable behavior as disconnected reasoning across subsets of supporting facts. This allows developing a model-agnostic probe for measuring how much any model can cheat via disconnected reasoning. Second, using a notion of contrastive support sufficiency, we introduce an automatic transformation of existing datasets that reduces the amount of disconnected reasoning. Third, our experiments1 suggest that there hasn't been much progress in multifact QA in the reading comprehension setting. For a recent large-scale model (XLNet), we show that only 18 points out of its answer F1 score of 72 on HotpotQA are obtained through multifact reasoning, roughly the same as that of a simpler RNN baseline. Our transformation substantially reduces disconnected reasoning (19 points in answer F1). It is complementary to adversarial approaches, yielding further reductions in conjunction.

Figure 1: Example of disconnected reasoning, a form of bad multifact reasoning. Input question: "Which country got independence when cold war started?" with a set of facts: "The war started in 1950.", "The cold war started in 1947.", "France finally got its independence.", "India got independence from … in 1947.", "30 countries were involved in World War 2." The model arrives at the answer by simply combining its outputs from two subsets of the input, neither of which contains all supporting facts. From one subset, it identifies the blue supporting fact, the only one mentioning the cold war. Independently, from the other subset, it finds the red fact as the only one mentioning a country getting independence with an associated time, and returns the correct answer (India). Further, it returns a simple union of the supporting facts it found over the input subsets, with no interaction between them.

1 Introduction

Multi-hop question answering requires connecting and synthesizing information from multiple facts in the input text, a process we refer to as multifact reasoning. Prior work has, however, shown that bad reasoning models, ones that by design do not connect information from multiple facts, can achieve high scores because they can exploit specific types of biases and artifacts (e.g., answer type shortcuts) in existing datasets (Min et al., 2019; Chen and Durrett, 2019). While this demonstrates the existence of models that can cheat, what we do not know is the extent to which current models do cheat, and whether there has been real progress in building models for multifact reasoning.

We address this issue in the context of multi-hop reading comprehension. We introduce a general-purpose characterization of a form of bad multihop reasoning, namely disconnected reasoning. For datasets annotated with supporting facts, this allows devising a model-agnostic probe to estimate the extent of disconnected reasoning done by any model, and an automatic transformation of existing datasets that reduces such disconnected reasoning.

∗ Early portion of this work was done during the first author's internship at Allen Institute for AI.
1 https://github.com/stonybrooknlp/dire

Measuring Disconnected Reasoning. Good multifact reasoning,2 at a minimum, requires models to connect information from one or more facts when they select and use information from other facts to arrive at an answer. However, models can cheat, as illustrated in Figure 1, by independently assessing information in subsets of the input facts, none of which contains all supporting facts, and taking a simple combination of outputs from these subsets (e.g., by taking a union) to produce the overall output. This entirely avoids meaningfully combining information across all supporting facts, a fundamental requirement of multifact reasoning. We refer to this type of reasoning as disconnected reasoning (DiRe in short) and provide a formal criterion, the DIRE condition, to catch cheating models. Informally, it checks whether, for a given test of multifact reasoning (e.g., answer prediction or supporting fact identification), a model is able to trivially combine its outputs on subsets of the input context (none of which has all supporting facts) without any interaction between them.

Using the DIRE condition, we develop a systematic probe, involving an automatically generated probing dataset, that measures how much a model can score using disconnected reasoning.

Reducing Disconnected Reasoning. A key aspect of a disconnected reasoning model is that it does not change its behavior towards the selection and use of supporting facts that are in the input, whether or not the input contains all of the supporting facts the question requires. This suggests that the notion of sufficiency—whether all supporting facts are present in the input, which clearly matters to a good multifact model—does not matter to a bad model. We formalize this into a contrastive support sufficiency test (CSST) as an additional test of multifact reasoning that is harder to cheat. We introduce an automatic transformation that adds to each question in an original multi-hop dataset a group of insufficient context instances corresponding to different subsets of supporting facts. A model must recognize these as having insufficient context in order to receive any credit for the question.

Our empirical evaluation on the HotpotQA dataset (Yang et al., 2018) reveals three interesting findings: (i) A substantial amount of progress on multi-hop reading comprehension can be attributed to improvements in disconnected reasoning. E.g., XLNet (Yang et al., 2019), a recent large-scale language model, achieves only 17.5 F1 pts (of its total 71.9 answer F1) via multifact reasoning, roughly the same as a much simpler RNN model. (ii) Training on the transformed dataset with CSST results in a substantial reduction in disconnected reasoning (e.g., a 19 point drop in answer F1), demonstrating that it is less cheatable, is a harder test of multifact reasoning, and gives a better picture of the current state of multifact reasoning. (iii) The transformed dataset is more effective at reducing disconnected reasoning than a previous adversarial augmentation method (Jiang and Bansal, 2019), and is also complementary, improving further in combination.

In summary, the DiRe probe serves as a simple yet effective tool for model designers to assess how much of their model's score can actually be attributed to multifact reasoning. Similarly, dataset designers can assess how cheatable their dataset D is (in terms of allowing disconnected reasoning) by training a strong model on the DiRe probe for D, and use our transform to reduce D's cheatability.

2 We refer to desirable types of multifact reasoning as good and undesirable types as bad.

2 Related Work

Multi-hop Reasoning: Many multifact reasoning approaches have been proposed for HotpotQA and similar datasets (Mihaylov et al., 2018; Khot et al., 2020). These use iterative fact selection (Nishida et al., 2019; Tu et al., 2020; Asai et al., 2020; Das et al., 2019), graph neural networks (Xiao et al., 2019; Fang et al., 2020; Tu et al., 2020), or simply cross-document self-attention (Yang et al., 2019; Beltagy et al., 2020) to capture inter-paragraph interaction. While these approaches have pushed the state of the art, the extent of actual progress on multifact reasoning remains unclear.

Identifying Dataset Artifacts: Several works have identified dataset artifacts for tasks such as NLI (Gururangan et al., 2018), Reading Comprehension (Feng et al., 2018; Sugawara et al., 2020), and even multi-hop reasoning (Min et al., 2019; Chen and Durrett, 2019). These artifacts allow models to solve the dataset without actually solving the underlying task. On HotpotQA, prior work has shown the existence of models that identify the support (Groeneveld et al., 2020) and the answer (Min et al., 2019; Chen and Durrett, 2019) by operating on each paragraph or sentence independently. We, on the other hand, estimate the amount of disconnected reasoning in any model and quantify the cheatability of answer and support identification.

Mitigation of Dataset Artifacts: To deal with these artifacts, several adversarial methods have been proposed for reading comprehension (Jia and Liang, 2017; Rajpurkar et al., 2018) and multi-hop QA (Jiang and Bansal, 2019). These methods minimally perturb the input text to limit the effectiveness of the dataset artifacts. Our insufficient context instances that partition the context are complementary to these approaches (as we show in our experiments). Rajpurkar et al. (2018) used a mix of answerable and unanswerable questions to make models avoid superficial reasoning. In a way, while these hand-authored unanswerable questions also provide insufficient context, we specifically focus on (automatically) creating unanswerable multi-hop questions by providing insufficient context.

Minimal Pairs: Recent works (Kaushik et al., 2019; Lin et al., 2019; Gardner et al., 2020) have proposed evaluating NLP systems by generating minimal pairs (or contrastive examples) that are similar but have different labels. Insufficient context instances in our sufficiency test can be thought of as automatically generated contrastive examples specifically for avoiding disconnected reasoning.

3 Measuring Disconnected Reasoning

This section formalizes the DIRE condition, which captures what it means for a model to employ disconnected reasoning, and describes how to use this condition to probe the amount of disconnected reasoning performed by a given model, and the extent of such reasoning possible on a dataset.

A good multifact reasoning is one where information from all the supporting facts is meaningfully synthesized to arrive at an answer. The precise definition of what constitutes meaningful synthesis is somewhat subjective; it depends on the semantics of the facts and the specific question at hand, making it challenging to devise a measurable test for the amount of multifact (or non-multifact) reasoning done by a model or needed by a dataset.

Previous works have used the Answer Prediction task (i.e., identifying the correct answer) and the Supporting Fact Identification task (identifying all facts supporting the answer) as approximate tests of multifact reasoning. We argue that, at a minimum, good multifact reasoning requires connected reasoning—one where information from at least one supporting fact is connected to the selection and use of information from other supporting facts.

Consider the example question in Figure 1. A good multifact reasoning will look for a supporting fact that mentions when the cold war started and use information from this fact (the year 1947) to select the other supporting fact mentioning the country that got independence (or vice versa).

A bad multifact reasoning model, however, can cheat on answer prediction by only looking for a fact that mentions a country getting independence at some time, without connecting this to when the cold war started. Similarly, the model can also cheat on supporting fact identification by treating it as two independent sub-tasks—one returning a fact mentioning the time when a country got independence, and another returning a fact mentioning the time when the cold war started. The result of at least one of the two sub-tasks should influence the result of the other sub-task, but here it does not. This results in disconnected reasoning, where both supporting facts are identified without reference to each other.3

Even though the precise definition of a meaningful synthesis of information is unclear, it is clear that models performing this type of disconnected reasoning cannot be considered as doing valid multifact reasoning. Neither answer prediction nor support identification directly checks for such disconnected reasoning.

3.1 Disconnected Reasoning

We can formalize the notion of disconnected reasoning from the perspective of any multihop reasoning test. For the rest of this work, we assume a multifact reading comprehension setting, where we have a dataset D with instances of the form q = (Q, C; A). Given a question Q along with a context C consisting of a set of facts, the task is to predict the answer A. C includes a subset F_s of at least two facts that together provide support for A.

Let τ denote a test of multifact reasoning and τ(q) the output a model should produce when tested on input q. Consider the Support Identification test, where τ(q) = F_s. Let there be two proper subsets of the supporting facts, F_s1 and F_s2, such that F_s = F_s1 ∪ F_s2. We argued above that a model performs disconnected reasoning if it does not use information in F_s1 to select and use information in F_s2, and vice versa. One way we can catch this behavior is by checking if the model is able to identify F_s1 given C \ F_s2 and identify F_s2 given C \ F_s1.

3 Identifying one of the facts in isolation is fine, as long as information from this fact is used to identify the other fact.

To pass the test successfully, the model only needs to trivially combine its outputs from the two subsets C \ F_s2 and C \ F_s1.

Concretely, we say M performs Disconnected Reasoning on q from the perspective of a test τ if the following condition holds:

DIRE condition: There exists a proper bi-partition {F_s1, F_s2} of F_s,4 such that the two outputs of M, with input q modified to have C \ F_s2 and C \ F_s1 as contexts, respectively, can be trivially combined to produce τ(q).

4 {X, Y} is a proper bi-partition of a set Z if X ∪ Y = Z, X ∩ Y = ∅, X ≠ ∅, and Y ≠ ∅.

The need for considering all proper bi-partitions is further explained in Appendix A.2, using an example of 3-hop reasoning (Figure 8). The DIRE condition does not explicitly state what constitutes a trivial combination; this is defined below individually for each test. We note that it only captures disconnected reasoning, which is one manifestation of the lack of a meaningful synthesis of facts.

For Answer Prediction, trivial combination corresponds to producing answers (which we assume have associated confidence scores) independently for the two contexts, and choosing the answer with the highest score. Suppose M answers a1 with score s(a1) on the first context in the DIRE condition, and similarly a2 for the second context. We say the condition is met if A = argmax_{a ∈ {a1, a2}} s(a).

For the Support Identification test, as in the example discussed earlier, set union constitutes an effective trivial combination. Suppose M identifies G1 and G2 as the sets of supporting facts for the two inputs in the DIRE condition, respectively. We say the condition is met if G1 ∪ G2 = F_s.

In the above discussion, we assumed the so-called 'exact match' or EM metric for assessing whether the answer or supporting facts produced by the combination operator were correct. In general, let m_τ(q, µ(q)) be any metric for scoring the output µ(q) of a model against the true label τ(q) for a test τ on question q (e.g., answer EM, support F1, etc.). We can apply the same metric to the output of the combination operator (instead of µ(q)) to assess the extent to which the DIRE condition is met for q under the metric m_τ.

3.2 Probing Disconnected Reasoning

The DIRE condition allows devising a probe for measuring how much a model can cheat on a test τ, i.e., how much it can score using disconnected reasoning. The probe for a dataset D comprises an automatically generated dataset P_τ(D), on which the model is evaluated, with or without training.

For simplicity, consider the case where F_s = {f1, f2}. Here {{f1}, {f2}} is the unique proper bi-partition of F_s. The DIRE condition checks whether a model M can arrive at the correct test output τ(q) for input q = (Q, C; A) by trivially combining its outputs on contexts C \ {f1} and C \ {f2}. Accordingly, for each q ∈ D, the probing dataset P_{ans+supp}(D) for Answer Prediction and Support Identification contains a group of instances:

  (Q, C \ {f1}; L?_ans = A, L_supp = {f2})    (1)
  (Q, C \ {f2}; L?_ans = A, L_supp = {f1})    (2)

where L_supp denotes the support identification label and L?_ans = A represents an optional answer label that is included only if A is present in the supporting facts retained in the context. These labels are only used if the model is trained on P_τ(D).

Models operate independently over instances in P_τ(D), whether or not they belong to a group. Probe performance, however, is measured via a grouped probe metric, denoted m^P_τ, that captures how well the trivial combination of the two corresponding outputs matches τ(q) according to metric m_τ, as per the DIRE condition for τ. Specifically, for Answer Prediction, we use the highest scoring answer (following the argmax operator in Section 3.1) across the two instances in the group, and evaluate it against A using a standard metric m_ans (EM, F1, etc.). For Support Identification, we take the union of the two sets of supporting facts so identified (for the two instances), and evaluate it against {f1, f2} using a standard metric m_supp.

General case of |F_s| ≥ 2: We translate each q ∈ D into a collection P_τ(q) of 2^(|F_s|−1) − 1 groups of instances, with P_τ(q; s1) denoting the group for the bi-partition {F_s1, F_s2} of F_s. This group, for the answer and support prediction tests, contains:

  (Q, C \ F_s1; L?_ans = A, L_supp = F_s2)    (3)
  (Q, C \ F_s2; L?_ans = A, L_supp = F_s1)    (4)
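To make the construction concrete, the following is a minimal sketch (not our released code) of how the probing groups of Eqns. (1)–(4) could be generated from a support-annotated instance; the field names (question, context, answer, supporting_facts) are illustrative assumptions rather than the repository's actual schema.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def proper_bipartitions(facts: Set[str]) -> List[Tuple[Set[str], Set[str]]]:
    """Enumerate the 2^(|Fs|-1) - 1 proper bi-partitions {Fs1, Fs2} of Fs."""
    ordered = sorted(facts)
    partitions = []
    # Fix the first fact in Fs1 so that {Fs1, Fs2} and {Fs2, Fs1} are not counted twice.
    rest = ordered[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            fs1 = {ordered[0], *extra}
            fs2 = set(ordered) - fs1
            if fs2:  # both sides must be non-empty (proper bi-partition)
                partitions.append((fs1, fs2))
    return partitions


def dire_probe_groups(question: str, context: Set[str], answer: str,
                      supporting_facts: Set[str]) -> List[List[Dict]]:
    """Build P_ans+supp(q): one group of two instances per proper bi-partition."""
    groups = []
    for fs1, fs2 in proper_bipartitions(supporting_facts):
        group = []
        for removed, kept_support in ((fs1, fs2), (fs2, fs1)):
            instance = {
                "question": question,
                "context": context - removed,
                "label_supp": kept_support,  # L_supp
            }
            # L?_ans: only attach the answer if it still appears in the kept support.
            if any(answer in fact for fact in kept_support):
                instance["label_ans"] = answer
            group.append(instance)
        groups.append(group)
    return groups
```

Each group is later scored with the trivial-combination operator of the DIRE condition rather than instance by instance.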

As per the DIRE condition, as long as the model cheats on any one bi-partition, it is considered to cheat on the test. Accordingly, the probe metric m^P_τ uses a disjunction over the groups:

  m^P_τ(q, µ(P_τ(q))) = max_{F_s1} m^P_τ(q, µ(P_τ(q; s1)))    (5)

where µ(P_τ(q)) denotes the model's predictions on the probe collection P_τ(q), the max is over all proper bi-partitions of F_s, and m^P_τ(q, µ(P_τ(q; s1))) denotes the probe metric for the group P_τ(q; s1), which incorporates the trivial combination operator, denoted ⊕, associated with τ as follows:

  m^P_τ(q, µ(P_τ(q; s1))) = m_τ(q, ⊕_{q′ ∈ P_τ(q; s1)} µ(q′))    (6)

For example, when τ is answer prediction, we view µ(q′) as both the predicted answer and its score for q′, ⊕ chooses the answer with the highest score, and m_ans evaluates it against A. When τ is support identification, µ(q′) is the set of facts the model outputs for q′, ⊕ is union, and m_supp is a standard evaluation of the result against the true label F_s.

3.3 Use Cases of the DiRe Probe

The probing dataset P_τ(D) can be used by model designers to assess what portion of their model M's performance on D can be achieved on a test τ via disconnected reasoning, by computing:

  DiRe^τ(M, D) = S^τ_cond(M, P_τ(D) | D)    (7)

This is a zero-shot evaluation5 where M is not trained on P_τ(D). S^τ_cond represents M's score on P_τ(D) conditioned on its score on D, computed as the question-wise minimum of M's score on D and P_τ(D), in terms of metrics m_τ and m^P_τ, respectively.6

Similarly, P_τ(D) can be used by a dataset designer to assess how cheatable D is by computing:

  DiRe^τ(D) = S^τ(M*, P_τ(D))    (8)

where M* is the strongest available model architecture for τ that is trained on P_τ(D), and S^τ is its score under metric m^P_τ.
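A minimal sketch of how the grouped probe metric of Eqns. (5)–(6) and the conditional DiRe score of Eqn. (7) could be computed is shown below; the prediction format (an answer string with a confidence score plus a predicted support set per instance) is an assumption for illustration, not our exact implementation.

```python
from typing import Callable, Dict, List


def combine_group(predictions: List[Dict]) -> Dict:
    """Trivial combination (⊕) over one probe group: argmax answer, union of support."""
    best = max(predictions, key=lambda p: p.get("answer_score", float("-inf")))
    return {
        "answer": best.get("answer") or "",
        "support": set().union(*(p["support"] for p in predictions)),
    }


def grouped_probe_metric(groups: List[List[Dict]], gold_answer: str,
                         gold_support: set,
                         m_ans: Callable[[str, str], float],
                         m_supp: Callable[[set, set], float]) -> Dict[str, float]:
    """Eqn. (5): disjunction (max) over groups of the combined per-group scores (Eqn. (6))."""
    ans_score, supp_score = 0.0, 0.0
    for group_predictions in groups:
        combined = combine_group(group_predictions)
        ans_score = max(ans_score, m_ans(combined["answer"], gold_answer))
        supp_score = max(supp_score, m_supp(combined["support"], gold_support))
    return {"ans": ans_score, "supp": supp_score}


def conditional_dire_score(score_on_probe: float, score_on_original: float) -> float:
    """Eqn. (7): question-wise minimum of the score on D and on P(D)."""
    return min(score_on_probe, score_on_original)
```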
4 Reducing Disconnected Reasoning

This section introduces an automatic transformation of a dataset to make it less cheatable by disconnected reasoning. It also defines a probe dataset for assessing how cheatable the transformed dataset is.

A disconnected reasoning model does not connect information across supporting facts. This has an important consequence: when a supporting fact is dropped from the context, the model's behavior on the other supporting facts remains unchanged. Figure 2 illustrates this for the example in Figure 1.

Figure 2: Transformation of a question for Contrastive Support Sufficiency evaluation. Top-Left: Original instance q labeled with its two supporting facts F_s. Bottom-Left: Its transformation into a group of 3 instances, one with sufficient and two with insufficient context, with labels denoting context sufficiency. Right: Behavior of good vs. bad models on q and T(q). A good multifact model would realize that the potentially relevant facts are not sufficient (do not connect), whereas a bad model would find potentially relevant facts and assume they are sufficient.

Suppose we create an insufficient context C′ by removing the cold-war fact (the blue fact in Figure 2) from C, shown in the last row of Figure 2 with the removed fact crossed out. With the full context C, the cheating model discussed earlier did not use the information in the blue fact to produce the answer or to identify the other supporting fact (the red fact). Therefore, the absence of the blue fact in C′ will induce no change in this model's answer or its ability to select the red fact. Further, to return a second supporting fact, the model would choose the next best matching fact, one that also indicates the start of a war and thus appears to be a reasonable choice (see bottom-right of Figure 2). Without considering interaction between the two identified facts, this model would not realize that this substitute fact does not fit well with the red fact because of the year mismatch (1950 vs. 1947), and that the two together are thus insufficient to answer Q.

A good multifact model, on the other hand, connects information across different supporting facts. Thus, when evaluated on the context C′ with the blue fact missing, its answer as well as its behavior for selecting the other supporting facts will be affected.

5 Our experiments also include inoculation (i.e., fine-tuning on a small fraction of the dataset) before evaluation.
6 For answer prediction with exact-match, this corresponds to M getting 1 point for correctly answering a question group in P(D) only if it correctly answers the corresponding original question in D as well.

4.1 Contrastive Support Sufficiency

The above example illustrates that the sufficiency of supporting facts in the input context matters to a good multifact model (i.e., it behaves differently under C and C′) but not to a disconnected reasoning model. This suggests that if we force models to pay attention to sufficiency, we can reduce disconnected reasoning. We formalize this idea and introduce the notion of contrastive support sufficiency. Informally, for each question, we consider several variants of the context that are contrastive: some contain sufficient information (i.e., F_s ⊆ C) while others don't. Evaluating models with these contrastive inputs allows discerning the difference in behavior between good and bad models. Figure 2 illustrates an example of contrastive contexts and the expected behavior of such models.

4.2 Transforming Existing Datasets

To operationalize this idea, we introduce an automated dataset transformation, the Contrastive Support Sufficiency Transform (T), applicable to any multifact reading comprehension dataset where each question is associated with a set of facts as context, of which a subset is annotated as supporting facts. Intuitively, given a context C, we want the model to identify whether C is sufficient to answer the question. If sufficient, we also want it to provide correct outputs for the other tests (e.g., answer prediction).

Formally, T(D) transforms each instance q = (Q, C; A) in a dataset D into a group T(q) of two types of instances: those with sufficient support and those without. For simplicity, consider the case of F_s = {f1, f2} as in Section 3.2. The transformed instance group T(q) is illustrated in the bottom half of Figure 2. It includes two insufficient context instances corresponding to the two non-empty proper subsets of F_s, with the output label set to L_suff = 0 (illustrated as × in Figure 2):

  (Q, C \ {f1}; L_suff = 0),  (Q, C \ {f2}; L_suff = 0)

Since these contexts lack sufficient information, we omit labels for the answer or supporting facts.

T(q) also includes a single sufficient context instance, but not with the entire C as the context. To avoid introducing a context length bias relative to the above two instances, we remove from C a fixed, uniformly sampled non-supporting fact f_r chosen from C \ F_s (we assume |C| ≥ 3). The output label is set to L_suff = 1. Since the context is sufficient, the correct answer and supporting facts are included as additional labels, to use for the Answer and Support tests if desired, resulting in the instance:

  (Q, C \ {f_r}; L_ans = A, L_supp = F_s, L_suff = 1)    (9)

For any performance metric m_τ(q, ·) of interest in D (e.g., answer EM, support F1, etc.), the corresponding transformed metric m^T_{τ+suff}(q, ·) operates in a conditional fashion: it equals 0 if any L_suff label in the group is predicted incorrectly, and equals m_τ(q_suff, ·) otherwise, where q_suff denotes the unique sufficient context instance in T(q). A model that predicts all instances to be insufficient (or all sufficient) will get 0 pts under m^T_{τ+suff}.

The case of |F_s| ≥ 2 is left to Appendix A.3. Intuitively, m^T_{τ+suff}(q, ·) ≠ 0 suggests that when reasoning with any proper subset of F_s, the model relies on at least one supporting fact outside of that subset. High performance on T(D) thus suggests combining information from all facts.7,8

7 We say suggests rather than guarantees because the behavior of the model with partial context C′ ⊂ C may not be qualitatively identical to its behavior with full context C.
8 The transformation encourages a model to combine information from all facts in F_s. Whether the information that is combined is semantically meaningful, or how it is combined, is beyond its scope.

4.3 Probing Disconnected Reasoning in T(D)

The sufficiency test (CSST) used in the transform discourages disconnected reasoning by encouraging models to track sufficiency. Much like other tests of multifact reasoning, we can apply the DIRE condition to probe models for how much they can cheat on CSST. As explained in Appendix A.4, the probe checks whether a model M can independently predict whether F_s1 and F_s2 are present in the input context, without relying on each other. If so, M can use disconnected reasoning to correctly predict the sufficiency labels in T(D).

For F_s = {f1, f2}, if the transformed group T(q) uses fact f_r for context length normalization, the probing group P(T(q)) contains 3 instances:

  (Q, C \ {f1, f_r}; L?_ans = A, L_supp = {f2}, L*_suff = 0)
  (Q, C \ {f2, f_r}; L?_ans = A, L_supp = {f1}, L*_suff = 0)
  (Q, C \ {f1, f2}; L*_suff = −1)

The metric m^{PT}_{τ+suff}(q, ·) on this probing group equals 0 if the model predicts any of the L*_suff labels incorrectly. Otherwise, we use the grouped probe metric m^P_τ(q, ·) from Section 3.2 for the first two instances, ignoring their L*_suff label. Details are deferred to Appendices A.4 and A.5.

We can use P(T(D)) to assess how cheatable T(D) is via disconnected reasoning by computing:

  DiRe^{τ+suff}(T(D)) = S^{τ+suff}(M′*, P(T(D)))    (10)

where M′* is the strongest available model architecture for τ + suff that is trained on P(T(D)), and S^{τ+suff} is its score under the metric m^{PT}_{τ+suff}.
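To make Sections 4.2 and 4.3 concrete, below is a minimal sketch of the two-fact transform T(q) and its probe P(T(q)); it is an illustrative rendering under assumed field names, not the released implementation, and the random-seed handling is an assumption.

```python
import random
from typing import Dict, List, Set


def csst_transform(question: str, context: Set[str], answer: str,
                   supporting_facts: Set[str], seed: int = 0) -> List[Dict]:
    """T(q) for |Fs| = 2 (Section 4.2): one sufficient-context instance with a
    non-supporting fact f_r removed for length normalization, plus two
    insufficient-context instances."""
    assert len(supporting_facts) == 2 and len(context) >= 3
    f1, f2 = sorted(supporting_facts)
    f_r = random.Random(seed).choice(sorted(context - supporting_facts))
    return [
        {"question": question, "context": context - {f_r},
         "label_ans": answer, "label_supp": supporting_facts, "label_suff": 1},
        {"question": question, "context": context - {f1}, "label_suff": 0},
        {"question": question, "context": context - {f2}, "label_suff": 0},
    ]


def csst_probe(question: str, context: Set[str], answer: str,
               supporting_facts: Set[str], f_r: str) -> List[Dict]:
    """P(T(q)) for |Fs| = 2 (Section 4.3), reusing the normalization fact f_r."""
    f1, f2 = sorted(supporting_facts)
    group = []
    for kept, removed in (({f2}, {f1, f_r}), ({f1}, {f2, f_r})):
        instance = {"question": question, "context": context - removed,
                    "label_supp": kept, "label_suff_star": 0}
        if any(answer in fact for fact in kept):  # optional L?_ans label
            instance["label_ans"] = answer
        group.append(instance)
    group.append({"question": question, "context": context - {f1, f2},
                  "label_suff_star": -1})
    return group
```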

Dataset | Definition | Purpose | How to measure | Metric
D | HotpotQA | Measure the state of multihop reasoning | Evaluate a trained M on D | S^τ(M, D)
P(D) | Probing dataset of D | Measure how much disconnected reasoning M does on the τ test of D | Evaluate M on P_τ(D) in a zero-shot setting or with inoculation | DiRe^τ(M, D) [Equation 7]
P(D) | Probing dataset of D | Measure how cheatable the τ test of D is via disconnected reasoning | Train and evaluate a strong NLP model on P_τ(D) | DiRe^τ(D) [Equation 8]
T(D) | Transformed dataset of D | Measure a truer state of multi-hop reasoning, by reducing the amount of cheatability compared to D | Evaluate a trained M′ on T(D) [Section 4.2] | S^{τ+suff}(M′, T(D))
P(T(D)) | Probing dataset of T(D) | Measure how cheatable the τ+suff test of T(D) is via disconnected reasoning | Train and evaluate a strong NLP model on P_{τ+suff}(T(D)) [Equation 10] | DiRe^{τ+suff}(T(D))

Table 1: Summary of dataset variations we create, their purposes, and how we use them. τ can be any test, but our experiments are with Ans + Supp. M and M′ are models that can take the τ and τ + suff tests, respectively. In our experiments, they are trained on D and T(D) respectively, with supervision for τ and τ + suff respectively.
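As a concrete reference for the conditional scoring used in Table 1 and described in Section 4.2, here is a small sketch (an assumption-level illustration, not the official evaluator) of m^T_{τ+suff} over one transformed group.

```python
from typing import Callable, Dict, List


def transformed_metric(group_predictions: List[Dict], group_instances: List[Dict],
                       base_metric: Callable[[Dict, Dict], float]) -> float:
    """m^T_{τ+suff}: 0 unless every sufficiency label in the group is predicted
    correctly; otherwise the base metric (answer/support) on the sufficient instance."""
    for pred, inst in zip(group_predictions, group_instances):
        if pred["suff"] != inst["label_suff"]:
            return 0.0
    sufficient_pairs = [
        (pred, inst) for pred, inst in zip(group_predictions, group_instances)
        if inst["label_suff"] == 1
    ]
    # Exactly one sufficient-context instance exists per group by construction.
    pred, inst = sufficient_pairs[0]
    return base_metric(pred, inst)
```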

5 Experiments

To obtain a more realistic picture of the progress in multifact reasoning, we compare the performance of the original Glove-based baseline model (Yang et al., 2018) and a state-of-the-art transformer-based LM, XLNet (Yang et al., 2019), on the multi-hop QA dataset HotpotQA (Yang et al., 2018). While it may appear that the newer models are more capable of multifact reasoning (based on the answer and support prediction tasks), we show that most of these gains come from better exploitation of disconnected reasoning. Our proposed transformation reduces the disconnected reasoning exploitable by these models and gives a more accurate picture of the state of multifact reasoning. To support these claims, we use our proposed dataset probes, transformations, and metrics summarized in Table 1.

Datasets D and T(D): HotpotQA is a popular multi-hop QA dataset with about 113K questions which has spurred many models (Nishida et al., 2019; Xiao et al., 2019; Tu et al., 2020; Fang et al., 2020). We use the distractor setting, where each question has a set of 10 input paragraphs, of which two were used to create the multifact question. Apart from the answer span, each question is annotated with these two supporting paragraphs and the supporting sentences within them. As described in Sec. 4.2, we use these supporting paragraph annotations as F_s to create a transformed dataset T(D).9

9 We do not use the sentence-level annotations as we found them to be too noisy for the purposes of transformation.

Models: We evaluate two models: (1) XLNet-Base: Since HotpotQA contexts are 10 paragraphs long, we use XLNet, a model that can handle contexts longer than 1024 tokens. We train XLNet-Base to predict the answer, supporting sentences, supporting paragraphs, and the sufficiency label (the last only on transformed datasets). As shown in Table 2 of Appendix B.4, our model is comparable to other models of similar sizes on the HotpotQA dev set. (2) Baseline: We re-implement the baseline model from HotpotQA. It has similar answer scores and much better support scores than the original implementation (details in Appendix B).

Metrics: We report metrics for the standard tests for HotpotQA: answer span prediction (Ans), support identification (paragraph-level: Supp_p, sentence-level: Supp_s), as well as the joint tests Ans + Supp_p and Ans + Supp_s. For each of these, we show F1 scores, but trends are similar for EM scores.10 These metrics correspond to m_τ(q, ·) in Section 3 and to S^τ(M, D) in Table 1. When evaluating on the probing or transformed datasets, we use the corresponding metrics shown in Table 1.

10 See Appendix F for these metrics for all our results.
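For reference, the answer-span metric (Ans F1) referred to above is the standard token-overlap F1 used for HotpotQA; a minimal sketch is below. Whitespace tokenization is a simplifying assumption here—the official evaluation script additionally lowercases and strips punctuation and articles.

```python
from collections import Counter


def answer_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```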

Measuring Disconnected Reasoning

We first use our DiRe probe to estimate the amount of disconnected reasoning in HotpotQA models (Eqn. 7). For this, we train our models on D and evaluate them against P(D), the probe dataset, under three settings: (1) zero-shot evaluation (no training on P(D)), (2) after fine-tuning on 1% of P(D), and (3) after fine-tuning on 5% of P(D). Since the model has never seen examples with the modified contexts used in the probe, the goal of fine-tuning or inoculation (Liu et al., 2019) is to allow the model to adapt to the new inputs, while not straying far from its original behavior on D.

Figure 3 summarizes the results. The total heights of the bars depict the overall scores of the baseline and XLNet models on D. The upper, darker regions depict the portion of the overall score achieved via disconnected reasoning, as estimated by the DiRe probe.11 Their height is based on the average across the three fine-tuning settings, with white error margins depicting min/max. Importantly, results vary only marginally across the 3 settings. The lower, lighter regions show the remaining score, attributable to multifact reasoning.

Figure 3: F1 scores for the two models under various metrics. Progress on HotpotQA from the Baseline model to XLNet (entire bars) is largely due to progress in disconnected reasoning (upper, darker regions), with little change in multifact reasoning (lower, lighter regions).

First, the amount of multifact reasoning in XLNet is low—ranging from 10.5 to 21.6 F1 across the metrics. Second, even though the scores have improved going from the baseline model to XLNet, the amount of multifact reasoning (lighter regions at the bottom) has barely improved. Notably, while the XLNet model improves on the Ans + Supp_p metric by 14 pts, the amount of multifact reasoning has only increased by 3 pts! While existing metrics would suggest substantial progress in multifact reasoning for HotpotQA, the DiRe probe shows that this is likely not the case—empirical gains are mostly due to higher disconnected reasoning.

As a sanity check, we also train a Single-Fact XLNet model (Appendix B.2) that only reasons over one paragraph at a time—a model incapable of multifact reasoning. This model achieves nearly identical scores on D and P(D), demonstrating that our DiRe probe captures the extent of disconnected reasoning performed by a model (see Appendix E).

Next, we use the DiRe probe to estimate how cheatable the HotpotQA dataset is via disconnected reasoning (Eqn. 8). For this, we train and evaluate the powerful XLNet model on P(D).12 While the answer prediction test is known to be cheatable, we find that even the supporting fact (paragraph/sentence) identification test is highly cheatable (up to 91.2 and 75.7 F1, respectively).

Reducing Disconnected Reasoning

Our automatic transformation reduces the disconnected reasoning bias in the dataset and gives a more realistic picture of the state of multifact reasoning. We show this by comparing how much a strong model (XLNet) can score using disconnected reasoning on the original dataset, by training it on P(D) and computing DiRe^τ(D) (Eqn. 8), and on the transformed dataset, by training it on P(T(D)) and computing DiRe^{τ+suff}(T(D)) (Equation 10). Training the model allows it to learn the kind of disconnected reasoning needed to do well on these probes, thus providing an upper estimate of the cheatability of D and T(D) via disconnected reasoning.

Figure 4: F1-based DiRe scores of D and T(D) using XLNet-Base. The dataset transformation reduces the disconnected reasoning bias, demonstrated by DiRe scores being substantially lower on T(D) than on D.

Figure 4 shows that the XLNet model's DiRe score on the Ans + Suff metric for T(D) is only 40.7, much lower than its DiRe score of 59.8 on Ans for D. Across all metrics, T(D) is significantly less exploitable via disconnected reasoning than D, with drops ranging from 12 to 26 pts.

T(D) is a Harder Test of Multifact Reasoning

By reducing the amount of exploitable disconnected reasoning in T(D), we show that our transformed dataset is harder for models that have relied on disconnected reasoning. Figure 5 shows that the transformed dataset is harder for both models across all metrics. Since a true multihop model would naturally detect insufficient data, the drops in performance on T(D) show that the current model architectures, when trained on D, are reliant on disconnected reasoning.

11 This is a conditional score as explained in Eqn. (7).
12 The use of even stronger models is left to future work.

Figure 5: F1 scores of the two models on D and T(D) under two common metrics. The transformed dataset is harder for both models since they rely on disconnected reasoning. The weaker Baseline model drops more as it relies more heavily on disconnected reasoning.

Figure 6: F1-based DiRe scores on various metrics using XLNet-Base for D, the adversarial Tadv(D), the transformed T(D), and the transformed adversarial T(Tadv(D)). Our transformation is more effective than, and complementary to, adversarial augmentation.

The weaker baseline model has substantially lower scores on T(D), suggesting that simple models cannot get high scores.

Single-Fact XLNet (the model incapable of multifact reasoning described earlier) also sees a big drop (−23 F1 pts on Ans) going from D to T(D)—almost all of which was caught as disconnected reasoning by our DiRe probe (see Appendix E).

T(D) is Hard for the Right Reasons

Our transformation makes two key changes to the original dataset D: (C1) it adds a new sufficiency test, and (C2) it uses a grouped metric over a set of contrastive examples. We argue that these changes by themselves do not result in a score drop independent of the model's ability to perform multifact reasoning (details in Appendices D and G).

Transformation vs. Adversarial Augmentation

An alternate approach to reducing disconnected reasoning is via adversarial examples for single-fact models. Jiang and Bansal (2019) proposed such an approach for HotpotQA. As shown in Figure 6, our transformation results in a larger reduction in disconnected reasoning across all metrics; e.g., the XLNet model only achieves a DiRe score (metric: Ans + Supp_s) of 36 F1 on T(D) as compared to 47 F1 on Tadv(D), computed using Eqns. (10) and (8), respectively. Moreover, since our approach can be applied to any dataset with supporting fact annotations, we can even transform the adversarial dataset, further reducing the DiRe score to 33 F1.

6 Conclusions

Progress in multi-hop QA under the reading comprehension setting relies on understanding and quantifying the types of undesirable reasoning current models may perform. This work introduced a formalization of disconnected reasoning, a form of bad reasoning prevalent in multi-hop models. It showed that a large portion of current progress in multifact reasoning can be attributed to disconnected reasoning. Using a notion of contrastive sufficiency, it showed how to automatically transform existing support-annotated multi-hop datasets to create a more difficult and less cheatable dataset that results in reduced disconnected reasoning.

Our probing and transformed dataset construction assumed that the context is an unordered set of facts. Extending it to a sequence of facts (e.g., as in MultiRC (Khashabi et al., 2018)) requires accounting for the potential of new artifacts by, for instance, carefully replacing rather than dropping facts. Additionally, for factual reading comprehension datasets where the correct answer can be arrived at without consulting all annotated facts in the input context, our probe will unfairly penalize a model that uses implicitly known facts, even if it correctly connects information across these facts. However, our transformation alleviates this issue: a model that connects information will have an edge in determining the sufficiency of the given context. We leave further exploration to future work.

It is difficult to create large-scale multihop QA datasets that do not have unintended artifacts, and it is also difficult to design models that do not exploit such shortcuts. Our results suggest that carefully devising tests that probe for desirable aspects of multifact reasoning is an effective way forward.

Acknowledgments

This work was supported in part by the National Science Foundation under grants IIS-1815358 and CCF-1918225. Computations on beaker.org were supported in part by credits from Google Cloud. We thank the anonymous reviewers and Greg Durrett for feedback on earlier versions of this paper.

References

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In ICLR.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In NAACL-HLT.

Rajarshi Das, Ameya Godbole, Dilip Kavarthapu, Zhiyu Gong, Abhishek Singhal, Mo Yu, Xiaoxiao Guo, Tian Gao, Hamed Zamani, Manzil Zaheer, et al. 2019. Multi-step entity-centric information retrieval for multi-hop question answering. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 113–118.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2020. Hierarchical graph network for multi-hop question answering. In EMNLP.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In EMNLP.

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Quan Zhang, and Ben Zhou. 2020. Evaluating NLP models via contrast sets. ArXiv, abs/2004.02709.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.

Dirk Groeneveld, Tushar Khot, Mausam, and Ashish Sabharwal. 2020. A simple yet strong pipeline for HotpotQA. In EMNLP.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL-HLT.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.

Yichen Jiang and Mohit Bansal. 2019. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA. In ACL.

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. In ICLR.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.

Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. In MRQA@EMNLP.

Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. In NAACL-HLT.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.

Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In ACL.

Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In ACL.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In AAAI.

Ming Tu, Kevin Huang, Guangtao Wang, Jui-Ting Huang, Xiaodong He, and Bowen Zhou. 2020. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In ACL.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.

A Probe and Transformation Details

A.1 Probes and Transformation for |F_s| = 2

Figure 7 summarizes in a single place all probes and the transformation discussed for the case of two supporting facts (|F_s| = 2).

A.2 Need for Considering All Bi-partitions

Figure 8 illustrates support bi-partitions and two examples of disconnected reasoning for a 3-hop reasoning question. It highlights the need for considering every bi-partition in the DIRE condition. For instance, if we only consider partitions that separate the purple and yellow facts of Figure 8, then the model performing the lower example of disconnected reasoning would not be able to output the correct labels in any such partition, and would thus appear to not satisfy the DIRE condition. We would therefore not be able to detect that it is doing disconnected reasoning.

A.3 Transformation for |F_s| ≥ 2

The Contrastive Support Sufficiency Transform described in Section 4.2 for the case of two supporting facts can be generalized as follows for |F_s| ≥ 2. There are two differences. First, there are 2^|F_s| − 2 choices of proper subsets of F_s that can be removed to create insufficient context instances. Second, these subsets are of different sizes, potentially leading to unintended artifacts models can exploit. Hence, we use context length normalization to ensure every context has precisely |C| − |F_s| + 1 facts. To this end, let F_r be a fixed, uniformly sampled subset of C \ F_s of size |F_s| − 1 that we will remove for the sufficient context instance.13 Further, for each non-empty insufficient context F_s1 ⊂ F_s, F_s1 ≠ ∅, let F_r1 denote a fixed, uniformly sampled subset of F_r of size |F_s| − |F_s1| − 1. The transformed group T(q) contains the following 2^|F_s| − 1 instances:

  (Q, C \ F_r; L_ans = A, L_supp = F_s, L_suff = 1)    (11)
  (Q, C \ (F_s1 ∪ F_r1); L_suff = 0)  for all F_s1    (12)

Note that |F_r| = |F_s1| + |F_r1| = |F_s| − 1 by design, and therefore all instances have exactly |C| − |F_s| + 1 facts in their context.

Similar to the case of |F_s| = 2, for any performance metric m_τ(q, ·) of interest in D (e.g., answer EM, support F1, etc.), the corresponding transformed metric m^T_{τ+suff}(q, ·) operates in a conditional fashion: it equals 0 if any L_suff label in the group is predicted incorrectly, and equals m_τ(q_suff, ·) otherwise, where q_suff denotes the unique sufficient context instance in T(q).

A.4 Probing T(D) for |F_s| = 2

A model M meets the DIRE condition for CSST when, given an input context C′, it can correctly predict whether: (i) C′ contains F_s1, even when F_s2 is not in C′; (ii) C′ contains F_s2, even when F_s1 is not in C′; and (iii) C′ contains neither F_s1 nor F_s2. Intuitively, if M can do this correctly, then it has the information needed to correctly identify support sufficiency for all instances in the transformed group T(q), without relying on interaction between F_s1 and F_s2.

This leads to the following probe for T(D), denoted P_{ans+supp+suff}(T(D)) (sometimes simply P(T(D)) for brevity) and described here for the case of F_s = {f1, f2}.14 Let f_r be the fact used for context length normalization in the transformed group T(q). Similar to Eqns. (1) and (2) in Section 3.2, the probing dataset contains a group P(T(q)) of instances corresponding to the unique bi-partition {{f1}, {f2}} of F_s:

  (Q, C \ {f1, f_r}; L?_ans = A, L_supp = {f2}, L*_suff = 0)
  (Q, C \ {f2, f_r}; L?_ans = A, L_supp = {f1}, L*_suff = 0)
  (Q, C \ {f1, f2}; L*_suff = −1)

L?_ans, as before, is an optional label that is included in the instance only if A is present in the supporting facts retained in the context of that instance.

We use the notation L*_suff here to highlight that this label is semantically different from L_suff in T(D), in the sense that when L*_suff = 0, the model during this probe is expected to produce the partial support and the answer (if present in the context). When not even partial support is there, the output label is L*_suff = −1 and we don't care what the model outputs as the answer or supporting facts. Note that the label semantics being different is not an issue, as the probing method involves training models on the probe dataset.

The joint grouped metric here considers the sufficiency label, along with any standard test(s) of interest (answer prediction, support identification, or both). Denoted m^{PT}_{τ+suff}(q, ·), it is defined as follows: similar to the conditional nature of the transformed metric m^T_{τ+suff}(q, ·), a model receives a score of 0 on the above group if it predicts the L*_suff label incorrectly for any instance in the group. Otherwise, we consider only the partial support instances (those with L*_suff = 0) in the group, which are identical to the un-transformed probe group P_{ans+supp}(q; {f1}) when ignoring the sufficiency label, and apply the grouped probe metric m^P_τ from Section 3.2 to this subset of instances.
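A sketch of the general transform of Appendix A.3, with context-length normalization included, might look as follows; as before, the data layout and seeding are illustrative assumptions.

```python
import random
from itertools import combinations
from typing import Dict, List, Set


def csst_transform_general(question: str, context: Set[str], answer: str,
                           supporting_facts: Set[str], seed: int = 0) -> List[Dict]:
    """T(q) for |Fs| >= 2: every instance keeps exactly |C| - |Fs| + 1 facts."""
    rng = random.Random(seed)
    fs = sorted(supporting_facts)
    non_supporting = sorted(context - supporting_facts)
    assert len(non_supporting) >= len(fs) - 1  # requires |C| >= 2|Fs| - 1
    # F_r: fixed subset of non-supporting facts removed from the sufficient instance.
    f_r = set(rng.sample(non_supporting, len(fs) - 1))

    group = [{
        "question": question, "context": context - f_r,
        "label_ans": answer, "label_supp": supporting_facts, "label_suff": 1,
    }]
    # One insufficient instance per non-empty proper subset F_s1 of F_s.
    for size in range(1, len(fs)):
        for subset in combinations(fs, size):
            fs1 = set(subset)
            # F_r1: padding from F_r so that |F_s1 ∪ F_r1| = |F_s| - 1 facts are removed.
            f_r1 = set(rng.sample(sorted(f_r), len(fs) - len(fs1) - 1))
            group.append({
                "question": question,
                "context": context - (fs1 | f_r1),
                "label_suff": 0,
            })
    return group
```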

13 14 We assume ∣C∣ ≥ 2∣Fs∣ − 1. Appendix A.5 describes the probe for ∣Fs∣ ≥ 2.

8857 Original Dataset D ⇒ Question q = (Q, C; A) in D is assumed to be annotated with supporting facts {f1, f2}.

Probing Dataset Pans+supp(D) for Answer Prediction and Support Identification tests: ⇒ Probing question collection Pans+supp(q) has only one group, corresponding to the unique bi-partition {{f1}, {f2}}, containing:

? 1. (Q, C \{f1}; Lans=A, Lsupp={f2}) ? 2. (Q, C \{f2}; Lans=A, Lsupp={f1})

Transformed Dataset T(D) for evaluating Constrastive Support Sufficiency: ⇒ Transformed question group T(q) in T(D) is defined using a single replacement fact fr ∈ C \{f1, f2}:

1. (Q, C \{fr}; Lans=A, Lsupp=Fs,Lsuff=1) 2. (Q, C \{f1}; Lsuff=0) 3. (Q, C \{f2}; Lsuff=0)

Probing Dataset Pans+supp+suff(T(D)) for all three tests: ⇒ Probing question collection Pans+supp+suff(T(q)) for the transformed question T(q) has only one group, corresponding to the unique bi-partition {{f1}, {f2}}, and is defined as:

? ∗ 1. (Q, C \{f1, fr}; Lans=A, Lsupp={f2},Lsuff=0) ? ∗ 2. (Q, C \{f2, fr}; Lans=A, Lsupp={f1},Lsuff=0) ∗ 3. (Q, C \{f1, f2}; Lsuff= − 1)

Figure 7: Proposed dataset transformation and probes for the case of ∣Fs∣ = 2 supporting facts.

a score of 0 on the above group if it predicts the proper bi-partition of Fs. For the bi-partition ∗ Lsuff label incorrectly for any instance in the group. {Fs1,Fs2}, the group, denoted P(T(q); Fs1), con- Otherwise, we consider only the partial support in- tains the following instances, each of which has ∗ stances (those with Lsuff = 0) in the group, which exactly ∣C∣ − ∣Fs∣ facts in its context: we observe are identical to the un-transformed ′ 1. (Q, C \(Fs1 ∪ Fr1); probe group Pans+supp(q; {f1}) when ignoring the ? = = ∗ = sufficiency label, and apply the grouped probe met- Lans A, Lsupp Fs2,Lsuff 0) P ′ ric mτ from Section 3.2 to this subset of instances. 2. (Q, C \(Fs2 ∪ Fr2); ? ∗ Lans=A, Lsupp=Fs1,Lsuff=0) A.5 Probing D for F ≥ ∗ T( ) ∣ s∣ 2 3. (Q, C \ Fs; Lsuff= − 1) The probe for disconnected reasoning in the trans- The semantics of L? and L∗ remain the same as formed dataset T(D) described in Appendix A.4 ans suff for the case of F = . for the case of two supporting facts can be gener- ∣ s∣ 2 The grouped metric for this bi-partition, denoted alized as follows for ∣Fs∣ ≥ 2. For each proper mPT (q, µ( ( (q); F ))), captures whether bi-partition {F ,F } of F , we consider two par- τ+suff P T s1 s1 s2 s the model exhibits correct behavior on the entire tial contexts, C \ Fs1 and C \ Fs2, and one where group (as discussed for the case of ∣Fs∣ = 2). The not even partial support is present, C \(Fs1 ∪ Fs2). overall probe metric, mPT (q, ⋅), continues to Recall that when constructing (D), we had τ+suff T follow Eqn. (5) and captures the disjunction of un- associated non-supporting facts F and F (both r1 r2 desirable behavior across all bi-partitions. chosen from Fr) with supporting facts Fs1 and Fs2, respectively, and had additionally removed B XLNet QA Model Details them from the respective input contexts for length normalization. For the partial context instances in B.1 XLNet-Base (Full) the probe, we choose another non-supporting fact We concatenate all 10 paragraphs together into one fr1 ∈ Fr \∪Fr1, and combine it with Fr1 to obtain long context with special paragraph marker token ′ ′ Fr1 = Fr1 ∪ {fr1}; similarly define Fr2. [PP] at the beginning of each paragraph and For each q ∈ D, the probing dataset contains special sentence marker token at the beginning of ∣F ∣−1 a collection P(T(q)) of 2 s − 1 groups of each sentence in the paragraph. Lastly, the question instances, where each group corresponds to one is concatenated at the end of this long context.

8858 Example Question with Connected Reasoning

What's the capital city of the Which year did the Which country got What is the country that got independence cold war start? independence in #1? capital city of #2? in the year the cold war started? Identify: , #1 Identify: , #2 Identify: ,#3

Examples of Disconnected Reasoning Partition that detects cheating

Input Output Which year did the Which country got What is the cold war start? independence? capital city of #2?

Identify: Identify: , #2 Identify: , #3

Eg. artifact: Context mentions only one country that got independence.

Which year did the Which country got What is the Input Output cold war start? independence in #1? capital city?

Identify: , #1 Identify: , #2 Identify: , #3

Eg. artifact: Context mentions only one capital city

Figure 8: Generalization of disconnected reasoning to a 3-fact reasoning question. As shown in the bottom half, a model could perform multifact reasoning on two disjoint partitions to answer this question. We consider such a model to be performing disconnected reasoning as it does not use the entire chain of reasoning and relies on artifacts (specifically, it uses 1-fact and 2-fact reasoning, but not 3-fact reasoning). For each of the two examples, there exists a fact bi-partition (shown on the right) that we can use to detect such reasoning as the model would continue to produce all the expected labels even under this partition.

Apart of questions that have answer as a span in the For predicting sufficiency classification, we use context, HotpotQA also has comparison questions feedforward on [CLS] token and train it with cross for which the answer is ”yes” or ”no” and it’s entropy loss. In our transformed dataset, because not contained in the context. So we also prepend HotpotQA has K=2, there are twice the number text " " to the context to deal of instances with insufficient supporting informa- with both types of questions directly by answer tion than the instances with insufficient supporting span extraction. Concretely, we have, [CLS] information. So during training we balance the [PP] [SS] sent1,1 [SS] number of insufficient instances by dropping half sent1,2 [PP] [SS] sent2,1 [QQ] q. of them. We generate logits for each paragraph and sen- tence by passing marker tokens through feedfor- B.2 XLNet-Base (Single Fact) ward network. Supporting paragraphs and sen- To verify the validity of our tests, we also evaluate tences are supervised with binary cross entropy a variant of XLNet incapable of Multifact reason- loss. Answer span extraction is using standard way ing. Specifically, we train our XLNet model that (Devlin et al., 2019) where span start and span end makes predictions one paragraph at a time (similar logits are generated with feedforward on each to- to Min et al.(2019)). Although these previous ken and it’s supervised with cross entropy loss. We works showed that answer prediction is hackable, use first answer occurrence among of the answer we adapt it to predict supporting facts and suffi- text among the supporting paragraphs as the correct ciency as well. span. This setting is very similar to recent work Specifically, we process the following through (Beltagy et al., 2020), and our results in Table2, the XLNet transformer [CLS] show that this model achieves comparable accuracy [PP] [SS] sent1,1 [SS] sent1,2 to other models with similar model complexity. We [QQ] q for each paragraph. We then supervise haven’t done any hyperparameter (learning rate, [PP] tokens for two tasks: identify if paragraph is num epoch) tuning on the development set because a supporting paragraph and identify if paragraph of the expensive runs, which could explain the mi- has the answer span (for yes/no question both nor difference. supporting paragraphs are supervised to be having

We then select the top-ranked paragraph for containing the answer and generate the best answer span from it. Similarly, we select the top two ranked paragraphs as supporting and predict the corresponding supporting sentences. The answer span logits and supporting sentence logits are ignored when the paragraph does not contain the answer or is not supporting, respectively. We train four losses jointly: (i) ranking the answer-containing paragraph, (ii) ranking the supporting paragraphs, (iii) predicting the answer from the answer-containing paragraph, and (iv) predicting the supporting sentences from the supporting paragraphs. We use binary cross-entropy for the paragraph ranking, so there is no interaction across paragraphs in this model. To get the sufficiency label, we derive the sufficiency classification label from the number of supporting paragraphs predicted.15 For the original dataset, if |predicted(Supp_p)| > 1, then C = 1, otherwise C = 0. For the probing dataset, if |predicted(Supp_p)| > 0, then C = 0, otherwise C = −1.

15: This heuristic exploits the fixed number of hops (= 2) and does not need any training on the sufficiency label. We use this heuristic because we want to predict the sufficiency label without interaction across any of the facts.

B.3 Glove-based Baseline

We have re-implemented the baseline described in Yang et al. (2018) in the AllenNLP (Gardner et al., 2017) library. Unlike the original implementation, which uses only answer and sentence support identification supervision, we also use paragraph support identification supervision. Additionally, we use explicit paragraph and sentence marker tokens as in our XLNet-based implementation, and supervise the model to predict paragraph and sentence support logits via a feedforward layer on these marker token representations. We train answer span identification with cross-entropy loss and both paragraph and sentence support identification with binary cross-entropy loss.

B.4 QA Model Results

Table 2 shows results for the QA models. Our XLNet model is comparable to other models of similar size on the HotpotQA dev set. Our implementation of the RNN baseline model has answer scores similar to the reported ones, and has much better support identification scores than the original implementation.

Model                 Ans F1   Supps F1   Joint F1
Baseline (reported)   58.3     66.7       40.9
QFE (BERT-Base)       68.7     84.7       60.6
DFGN (BERT-Base)      69.3     82.2       59.9
RoBERTa-Base          73.5     83.4       63.5
LongFormer-Base       74.3     84.4       64.4
Baseline (ours)       60.2     76.2       48.0
XLNet-Base            71.9     83.9       61.8

Table 2: Performance of XLNet-Base compared to other transformer models (of similar size) on HotpotQA. Our model scores higher than the BERT-Base models QFE (Nishida et al., 2019) and DFGN (Xiao et al., 2019), and performs comparably to recent models using RoBERTa and Longformer (Beltagy et al., 2020).

C Implementation and Model Training

All our models are implemented using the AllenNLP (Gardner et al., 2017) library. For XLNet-base, we also used Huggingface Transformers (Wolf et al., 2019). For all XLNet-base experiments, we train for two epochs, checkpointing every 15K instances and early stopping after 3 checkpoints with no validation metric improvement. For the Glove-based baseline model, we do the same but for 3 epochs. For both models, the effective batch size was 32. For the XLNet-based model, we used a learning rate of 0.00005 with linear decay and no warmup. The hyper-parameters were chosen as the default parameters used by Huggingface Transformers to reproduce BERT results on the SQuAD dataset. Our experiments were run on V100 GPUs from Google Cloud. On average, XLNet training runs took 2 days on 1 GPU and the baseline model took less than 1 day.

D Human Evaluation of Sufficiency Prediction

The sufficiency test can cause a spurious drop if sufficiency labels are incorrect, i.e., the context is sufficient even after f1 or f2 is removed. To rule this out, we randomly evaluated (using MTurk) 115 paragraphs from C \ Fs, and found that only 2 (1.7%) could be used in place of f1 or f2 to answer the question. As we show below, this would result in only a marginal score drop compared to the roughly 20% observed drops.

To estimate this, we set up an annotation task on MTurk for turkers to annotate whether a pair of facts has sufficient information to arrive at an answer. For each question, we create three pairs (f1, f2), (f1, fr), and (fr, f2), where f1 and f2 are the annotated supporting paragraphs, and fr is a paragraph randomly sampled from the non-supporting paragraphs.
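A minimal sketch of the pair construction just described (the helper name and interface are assumptions, not the actual annotation pipeline):

```python
import random

def sufficiency_annotation_pairs(f1, f2, non_supporting, seed=0):
    """Build the three paragraph pairs shown to annotators for one question:
    (f1, f2), (f1, fr), and (fr, f2), where fr is one randomly sampled
    non-supporting paragraph."""
    fr = random.Random(seed).choice(non_supporting)
    return [(f1, f2), (f1, fr), (fr, f2)]
```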

[Figure 9 contents (text heavily garbled in extraction): two HotpotQA dev examples, each showing a question, its answer, and three paragraphs. Example 1 — Question: "Did Greg Costikyan have the same profession as John Dolmayan?", Answer: no; paragraphs about John Dolmayan, Greg Costikyan, and System of a Down. Example 2 — Question: "Both Dusty Drake and Joe Diffie sing which genre of music?", Answer: country; paragraphs about Dusty Drake, Joe Diffie, and a song written by Joe Diffie.]

Figure 9: Two examples where we found that a non-supporting fact provides alternative support for answering the question. f1 and f2 are the annotated supporting facts, but fr in C together with f1 forms alternative support.

The questions were taken from the HotpotQA development set. If, for a question, annotators agree that both (f1, f2) and (f1, fr) are sufficient, we assume fr provides proxy (duplicate) information for f2; likewise for (f1, f2) and (fr, f2). Out of 115 example questions (with annotator agreement), we found only 2 (1.7%) to have a proxy fact in fr. Figure 9 shows these 2 examples. This shows that such proxy information is very rare in HotpotQA. We next estimate the impact of these duplicates on the human score.

D.1 Human Score Estimate on T(D)

Given the number of observed duplicates, we can now estimate the expected drop in human performance. For simplicity, let us consider only the Exact Match score, where a human gets one point only if they predict all the facts exactly. There are two scenarios where the sufficiency test in our transformed dataset would introduce more noise, resulting in a drop in human score.

1. The original context was not actually sufficient: In this case the sufficiency label Lsuff = 1 would be incorrect and the human score on the sufficiency test for this example would be zero. However, in such a case, the human score on paragraph and sentence identification would also be zero. As a result, there would be no drop in human score relative to the original task.

2. The contrastive examples are actually sufficient: Due to potential proxies of f1 in C \ f1, it is possible that our contrastive examples would be considered sufficient. While this would also affect the original dataset, its impact would be more extreme on our test. We focus on this scenario in more detail next.

Let us assume that there are k such proxy paragraphs in any given context. In that case, there is a 1/(k+1) chance that a human would select the annotated supporting paragraphs instead of these proxy paragraphs. So there is a 1/(k+1) chance that they get one point on the original task, but they would always get 0 points on our transformation. Given that we observed a proxy paragraph in 1.7% of our annotated paragraphs, we can model the likelihood of observing k proxy paragraphs with a binomial distribution. Specifically, since there are 8 distractor paragraphs in HotpotQA, the probability of observing k proxy paragraphs is:

P(k) = C(8, k) × (0.017)^k × (1 − 0.017)^(8−k)

So the expected drop in score is given by:

Σ_{k=1}^{8} P(k) × 1/(k+1) = 0.0628

So the expected drop in human score is only 6.28%, whereas we observed about an 18% drop in EM scores, as shown in Appendix F.
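The expected-drop calculation above can be reproduced with a few lines of Python (a sketch of the arithmetic only; the 1.7% proxy rate and 8 distractor paragraphs come from the text):

```python
from math import comb

p = 0.017  # probability that a distractor paragraph is a proxy
n = 8      # number of distractor paragraphs per HotpotQA question

# With k >= 1 proxy paragraphs, a human keeps the original point with
# probability 1 / (k + 1) but always loses it on the transformed dataset.
expected_drop = sum(
    comb(n, k) * p**k * (1 - p)**(n - k) * (1 / (k + 1))
    for k in range(1, n + 1)
)
print(f"{expected_drop:.4f}")  # ~0.0628, i.e., about a 6.3% expected drop
```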

Figure 10: F1 and F1-based DiRe scores of D and T(D) using the Single-Fact XLNet-base model.

E XLNet (Single-Fact) Results

Figure 10 shows the results of our Single-Fact model on the original dataset D and the transformation T(D). On both metrics, our DiRe probe gets almost the same score as the Single-Fact model, i.e., the probe can detect the disconnected reasoning bias in the Single-Fact model. Additionally, the score of this Single-Fact model drops from 67.4 to 44.1, a drop of 23 F1 points, going from D to T(D) (on the Ans metric). This shows that our transformed dataset is less exploitable by a disconnected reasoning model.

Figure 11: EM scores of two models on D and T(D) under two common metrics. The transformed dataset is harder for both models since they rely on disconnected reasoning. The weaker Baseline model drops more as it relies more heavily on disconnected reasoning.

F Exact Match Numbers

Figure 11 shows the EM scores of our models on the original dataset D and the transformed dataset T(D). Consistent with our F1 metric, we see large drops in model score going from D to T(D), showing that the transformation is harder for these models.

Figure 12 shows the disconnected reasoning bias in the XLNet-Base model trained on D and T(D) using the EM scores. Again, we see the same trend: the transformed dataset has a reduced DiRe score, indicating lower disconnected reasoning bias.

Finally, Figure 13 shows the impact of adversarial examples and the transformation on the EM scores. While the drops are smaller due to the strictness of the EM score, the trends are the same: adversarial examples have a minor impact on the DiRe scores, but transformation of the original dataset, as well as transformation of the adversarial examples, results in a big drop in the disconnected reasoning bias.

G Grouped Metric on Trivial Transformation

The grouped metric combines decisions over a set of instances and, one can argue, is therefore inherently harder. However, one can show that unless the instances within a group test for qualitatively different information, the grouped metric will not necessarily be lower than the single-instance metric. To support this claim, we compute the grouped metric over a trivial transform that is similar to T(D) but does not involve the contrastive sufficiency prediction test. This trivial transform, denoted Ttrv, creates 3 copies of each instance but drops one non-supporting fact at random from each copy.
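To make the grouped metric concrete, here is a minimal sketch (a hypothetical helper, not the evaluation code released with the paper) of a grouped exact-match score in which a group counts only if every instance in it is predicted correctly:

```python
from collections import defaultdict

def grouped_em(predictions, golds, group_ids):
    """Grouped exact match: a group scores 1 only if every one of its
    instances is predicted exactly right; the score is the mean over groups.

    `predictions` and `golds` are parallel lists of labels (e.g., answer
    strings, or tuples of answer/support/sufficiency labels), and
    `group_ids[i]` names the group (e.g., the source question) of instance i."""
    all_correct = defaultdict(lambda: True)
    for pred, gold, gid in zip(predictions, golds, group_ids):
        all_correct[gid] = all_correct[gid] and (pred == gold)
    groups = set(group_ids)
    return sum(all_correct[g] for g in groups) / len(groups)


# Example: one question with 3 transformed copies; one wrong copy zeroes the group.
print(grouped_em(["a", "a", "b"], ["a", "a", "a"], ["q1", "q1", "q1"]))  # 0.0
```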


Figure 12: EM-based DiRe scores of D and T(D) using XLNet-Base. The dataset transformation reduces disconnected reasoning bias, demonstrated by DiRe scores being substantially lower on T(D) than on D.

Figure 14: EM scores of XLNet-base on three metrics for original HotpotQA (D), trivially transformed HotpotQA (Ttrv(D)), and our transformed HotpotQA (T(D)). Model scores barely drop from D to Ttrv(D) but drop significantly from D to T(D), showing that the drop in scores on T(D) is not simply a result of using a grouped metric.


Figure 13: EM-based DiRe score on various metrics using XLNet-base for four datasets: original D, adversarial Tadv(D), transformed T(D), and transformed adversarial T(Tadv(D)). Transformation is more effective than, and complementary to, Adversarial Augmentation for reducing DiRe scores.

Similar to T(D), in which we require the model to produce correct sufficiency labels for all 3 instances, here we require the model to produce the correct answer and support on all 3 copies:16

1. (Q, C \ {f_r1}; L_ans = A, L_supp = F_s)
2. (Q, C \ {f_r2}; L_ans = A, L_supp = F_s)
3. (Q, C \ {f_r3}; L_ans = A, L_supp = F_s)

In Figure 14, we show the EM results corresponding to the respective grouped metrics for Ttrv(D) and T(D). We see barely any drop in results from D to Ttrv(D), but do see a significant drop going from D to T(D). This shows that adding a grouped metric over an arbitrary set of decisions would not make T(D) harder. Hence, the drop in scores from D to T(D) is not simply a result of using a grouped metric.

16: Note that our transformed dataset does not even require the answer and support labels on all the examples, making T(D), in some ways, easier than this dataset.
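As a sketch of the trivial transform described above (the dictionary-style instance schema is an assumption for illustration, not the released data format), Ttrv can be produced as follows:

```python
import random

def trivial_transform(instance, num_copies=3, seed=0):
    """Create `num_copies` copies of an instance, each with one non-supporting
    fact removed; answer and support labels stay unchanged and there is no
    contrastive sufficiency test (unlike T(D)).

    `instance` is assumed to be a dict with keys "question", "context"
    (a list of facts), "supporting_facts" (subset of the context), and
    "answer"."""
    rng = random.Random(seed)
    non_supporting = [f for f in instance["context"]
                      if f not in instance["supporting_facts"]]
    # One non-supporting fact is dropped per copy (distinct facts here, for illustration).
    dropped_facts = rng.sample(non_supporting, num_copies)
    return [
        {
            "question": instance["question"],
            "context": [f for f in instance["context"] if f != dropped],
            "supporting_facts": instance["supporting_facts"],
            "answer": instance["answer"],
        }
        for dropped in dropped_facts
    ]
```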