Is Multihop QA in DIRE Condition? Measuring and Reducing Disconnected Reasoning Harsh Trivedi†∗ Niranjan Balasubramanian† Tushar Khot‡ Ashish Sabharwal‡ † Stony Brook University, Stony Brook, U.S.A. fhjtrivedi,
[email protected] ‡ Allen Institute for AI, Seattle, U.S.A. ftushark,
[email protected] Abstract Input Which country got independence when the cold war started? Has there been real progress in multi-hop The war started in 1950. question-answering? Models often exploit The cold war started in 1947. dataset artifacts to produce correct answers, France finally got its independence. without connecting information across multi- India got independence from UK in 1947. Set of Facts ple supporting facts. This limits our ability 30 countries were involved in World War 2. to measure true progress and defeats the pur- pose of building multi-hop QA datasets. We Disconnected Reasoning Output make three contributions towards addressing Answer SF Simple Combination this. First, we formalize such undesirable be- Answer havior as disconnected reasoning across sub- India India sets of supporting facts. This allows develop- No Interaction Supporting Facts (SF) ing a model-agnostic probe for measuring how much any model can cheat via disconnected reasoning. Second, using a notion of con- trastive support sufficiency, we introduce an Figure 1: Example of disconnected reasoning, a form automatic transformation of existing datasets of bad multifact reasoning: Model arrives at the answer that reduces the amount of disconnected rea- by simply combining its outputs from two subsets of soning. Third, our experiments1 suggest that the input, neither of which contains all supporting facts. there hasn’t been much progress in multifact From one subset, it identifies the blue supporting fact QA in the reading comprehension setting.