MuSiQue: Multi-hop Questions via Single-hop Question Composition

Harsh Trivedi† Niranjan Balasubramanian† Tushar Khot‡ Ashish Sabharwal‡

† Stony Brook University, Stony Brook, U.S.A. {hjtrivedi,niranjan}@cs.stonybrook.edu ‡ Allen Institute for AI, Seattle, U.S.A. {tushark,ashishs}@allenai.org

Abstract

To build challenging multi-hop question answering datasets, we propose a bottom-up semi-automatic process of constructing multi-hop questions via composition of single-hop questions. Constructing multi-hop questions as compositions of single-hop questions allows us to exercise greater control over the quality of the resulting multi-hop questions. This process allows building a dataset with (i) connected reasoning where each step needs the answer from a previous step; (ii) minimal train-test leakage by eliminating even partial overlap of reasoning steps; (iii) a variable number of hops and composition structures; and (iv) contrasting unanswerable questions created by modifying the context. We use this process to construct a new multihop QA dataset: MuSiQue-Ans with 25K 2-4 hop questions using seed questions from 5 existing single-hop datasets. Our experiments demonstrate that MuSiQue-Ans is challenging for state-of-the-art QA models (e.g., human-machine gap of 30 F1 pts), significantly harder than existing datasets (2x human-machine gap), and substantially less cheatable (e.g., a single-hop model is worse by 30 F1 pts). We also build an even more challenging dataset, MuSiQue-Full, consisting of answerable and unanswerable contrast question pairs, where model performance drops further by 13+ F1 pts.¹

¹For data and code, see https://github.com/stonybrooknlp/musique.

[Figure 1: Our proposed approach to generate multi-hop questions by composing single-hop questions. Left: A question from MuSiQue ("What is the former name of the country where Atika Suri studied?", decomposed into Q1 "Where did Atika Suri study?" → Trisakti University, Q2 "Which country is A1 in?" → Indonesia, Q3 "What is the former name of A2?" → Dutch East Indies) that forces models to reason through all intended hops. Right: A potential question ("What is the former name of the country before it gained independence on 27 Dec 1949 where Atika Suri studied?") filtered out by our approach for not requiring connected reasoning. Notice that the question on the right can be answered using just Q3' without knowing the answers to the previous questions (there is only one mentioned country that gained independence on 27 Dec 1949).]

1 Introduction

Multi-hop question answering (QA) datasets are designed with the intent that models must connect information from multiple facts in order to answer each multi-hop question. However, when using the common method of crowdsourcing to create a multi-hop question given just a pair of documents as the starting point, the resulting datasets (e.g., HotpotQA (Yang et al., 2018)) often end up with unintended artifacts that allow models to achieve high scores (Min et al., 2019a; Chen and Durrett, 2019) without performing any reasoning that at least connects the intended facts together (Trivedi et al., 2020). This defeats the purpose of building multi-hop datasets.

To address this issue, we propose a new bottom-up approach (and a corresponding dataset called MuSiQue) for building multi-hop reading comprehension QA datasets via composition of single-hop questions from existing datasets. By carefully controlling the hops used in the questions, our approach ensures that each question requires the connected reasoning desired in multi-hop QA datasets.

Figure 1 illustrates two potential multi-hop questions that can be constructed using such a bottom-up approach. We create compositions of single-hop questions where each question must use the answer (often referred to as a bridging entity) from a previous question in the reasoning chain. While intuitive, this by itself is not sufficient to ensure multi-hop reasoning. For example, the question on the right can be answered without ever finding answers to Q1' or Q2'.
Even if a model does not know that A2' refers to the country Indonesia, there is only one country that is mentioned in the context as gaining independence on 27 Dec, 1949. Models can, therefore, answer the supposedly multi-hop question on the right using just the last question (i.e., be successful with single-hop reasoning). This is not the case with the question on the left (an actual example from our dataset), where every single-hop question would be almost impossible to answer confidently without knowing the bridging entity from the previous step.

Our proposed approach to build multi-hop datasets identifies the presence of such artifacts in the composition chain and filters them out, thereby reducing the cheatability of our dataset.

In addition to reducing such artifacts, our proposed approach also minimizes the potential of memorization by reducing train-test leakage at the level of each single-hop question. The approach allows creating questions with a varying number of hops by simply composing over additional single-hop questions.

We use this approach to build a new challenge multi-hop QA dataset, MuSiQue, consisting of 2-4 hop questions. We empirically demonstrate that our dataset has fewer artifacts and is more challenging than two commonly used prior multi-hop reasoning datasets, HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). We also show the benefits of the various features of our pipeline in terms of increasing the difficulty of the dataset. We find that MuSiQue is a promising testbed for approaches that rely on principled multi-hop reasoning, such as the Step Execution model we discuss. Lastly, by incorporating the notion of unanswerability or insufficient context (Rajpurkar et al., 2018; Trivedi et al., 2020), we also release a variant of our dataset, MuSiQue-Full, that is even more challenging and harder to cheat on.

In summary, we make two main contributions: (1) A new dataset construction approach for building challenging multi-hop reasoning QA datasets that operates via composition of single-hop questions, reduces artifacts, reduces train-test leakage, and is easier to annotate. (2) A new challenge dataset, MuSiQue, that is less cheatable, has a higher human-machine gap, and comes with reasoning graph decompositions that can be used to build stronger models (via data augmentation or as auxiliary supervision).

2 Multihop Reasoning Desiderata

Multihop question answering requires connecting and synthesizing information from multiple facts. Prior works, however, have shown that multihop reasoning datasets often have artifacts that allow models to achieve high scores while bypassing the need for connecting the information from multiple facts (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). This defeats the purpose of building multihop QA datasets, as we cannot reliably use them to measure progress in multihop reasoning.

What, then, are the desirable properties of questions that indeed test a model's multi-hop capabilities, and how can we create such questions? Ideally, a multihop question should necessitate meaningful synthesis of information from all (multiple) supporting facts. However, what meaningful synthesis really means is subjective and hard to objectively quantify. Trivedi et al. (2020) have argued that, at a minimum, multihop reasoning should necessitate connecting information from multiple (all) supporting facts, and proposed a formal condition (the DiRe condition) to check if disconnected reasoning is employed by a model on a dataset. While the DiRe condition allows one to probe existing models and datasets for undesirable reasoning, it does not provide an efficient way to construct new multihop QA datasets that necessitate connected reasoning.

In this work, we propose a method to create multi-hop reasoning datasets that require connected reasoning by first laying out desirable properties of a multi-hop question in terms of its decomposition, and then devising a dataset construction procedure to optimize for these properties.

Connected Question Hops. Consider the question Which city across the Charles River in Massachusetts was Facebook launched in?. This can be answered by two supporting facts in the Knowledge Source (KS):² (f1) Facebook is launched in Harvard University. (f2) Harvard University is in Cambridge city, which is across the Charles River in Massachusetts. That is, the question can be decomposed as (Q1) Which university was Facebook launched in? with answer A1, and (Q2) Which city across the Charles River in Massachusetts is A1 in?.

²We view KS as the fixed context text in the reading comprehension setting and a large corpus in the open-QA setting.

A proper way to find the answer to this question is to answer Q1, plug this answer into Q2 in place of A1, and answer Q2. However, in this example, Q2 is so specifically informative about the city of interest that it is possible to uniquely identify the answer to it even without considering A1 or using f1. In other words, the question is such that a model doesn't need to connect information from the two supporting facts.

To prevent such shortcut-based reasoning from being an effective strategy, we need questions and knowledge sources such that all hops of the multi-hop question are necessary to arrive at the answer. That is, the dataset shouldn't contain information that allows a model to arrive at the correct answer confidently while bypassing single-hop questions. This can be computationally checked by training a strong model M on the dataset distribution. To check whether the hops of the above 2-hop question are connected, we need to check whether the input question Q2 with entity A1 masked, along with the knowledge source KS, is answerable by M. If it is, then it is possible for M to answer the supposedly multi-hop question via disconnected reasoning. In general, for a strong model M, if a multi-hop question has any constituent question in its decomposition such that M can answer it correctly even after masking at least one of its bridge entities, we can conclude that M can cheat on the question via disconnected reasoning.

Formally, we associate each question Q with $G_Q$, a directed acyclic graph representing Q's decomposition into simpler subquestions $q_1, q_2, \ldots, q_n$, which form the nodes of $G_Q$. A directed edge $(q_j, q_i)$ for $j < i$ indicates that $q_i$ refers to the answer to some previous subquestion $q_j$ in the above subquestion sequence. $A_i$ is the (often singleton) set of valid answers to $q_i$. $A_n$ is the set of valid answers to Q. For edge $(q_j, q_i)$, we use $q_i^{m_j}$ to denote the subquestion formed from $q_i$ by masking out the mention of the answer from $q_j$. Similarly, $q_i^{m}$ denotes the subquestion with answers from all incoming edges masked out.

We say that all hops are necessary for a model M to answer Q if:

$$\forall (q_j, q_i) \in \mathrm{edges}(G_Q):\ M(q_i^{m_j}, KS) \notin A_i \tag{1}$$

Connected Question and Context. Although the above requirement ensures that each multi-hop question is connected via its constituent single-hop questions, it doesn't ensure that the single-hop questions are in fact meaningfully answered. This poses additional challenges, especially in the reading comprehension setting, where the input text is limited and often has limited or no distractors.

We want models to identify and meaningfully synthesize information from the constituent question and context to arrive at its answer. However, here again, it's difficult to quantify what meaningful synthesis really means. At the very least though, we know some connection between the constituent question and context should be required to arrive at the answer for the skill of reading comprehension. This means that, at the very least, we want multi-hop questions with constituent single-hop questions such that dropping the question or the context shouldn't leave enough information to identify the answer confidently. This can be tested with a strong model M that is trained on this dataset distribution.

We say that some question-to-context connection is necessary for a model M to answer Q if:

$$\forall i:\ M(q_i, \phi) \notin A_i \ \wedge\ M(\phi, KS) \notin A_i \tag{2}$$

Although this requirement might seem rather naïve, previous works have shown that RC datasets often have artifacts that allow models to predict the answer without the question or without the context (Kaushik and Lipton, 2018). Moreover, recent work has also shown that question and answer memorization coupled with train-test leakage can lead to high answer scores without any context (Lewis et al., 2021). As we will show later, previous multi-hop RC datasets can be cheated via such context-only and question-only shortcuts as well. Therefore, it is important for future multi-hop reading comprehension QA datasets to ensure that question and context are connected.

In summary, we want multi-hop reading comprehension questions that have desirable properties (1) and (2). If these two conditions hold for some strong model M, we say that the question satisfies the MuSiQue condition. Our dataset construction pipeline (Section 4) optimizes for these properties over a large space of multi-hop questions that can be composed from single-hop questions (Section 3).
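To make conditions (1) and (2) concrete, here is a minimal sketch, in Python, of how one might check them for a given decomposition and a black-box answer predictor. It is not the paper's released code; the data layout and the `model_answer` callable are illustrative assumptions.

```python
# Hypothetical sketch of the MuSiQue condition checks (Eqs. 1 and 2).
# `model_answer` stands in for a trained strong model M; it is an assumption,
# not an API from the paper's codebase.

def mask_mention(question: str, mention: str) -> str:
    """Replace a bridging-entity mention with a placeholder token."""
    return question.replace(mention, "[MASK]")

def satisfies_musique_condition(decomposition, context, model_answer) -> bool:
    """
    decomposition: list of subquestions, each a dict like
        {"question": str, "answers": set[str],
         "references": [(parent_idx, parent_answer_mention), ...]}
    context: the knowledge source / paragraphs (a single string here, for simplicity).
    model_answer: callable (question, context) -> predicted answer string.
    """
    for node in decomposition:
        # Condition (1): with any one incoming bridge entity masked, the
        # subquestion should NOT be answerable from the context alone.
        for _, mention in node["references"]:
            masked_q = mask_mention(node["question"], mention)
            if model_answer(masked_q, context) in node["answers"]:
                return False  # disconnected reasoning is possible

        # Condition (2): neither the question alone nor the context alone
        # should suffice to produce the answer.
        if model_answer(node["question"], "") in node["answers"]:
            return False
        if model_answer("", context) in node["answers"]:
            return False
    return True
```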

3 Multihop via Singlehop Composition

The information need of a multihop question can be decomposed into a directed acyclic graph (DAG) of constituent questions or operations (Wolfson et al., 2020). For example, the question "Which city was Facebook launched in?" can be decomposed into a 2-node DAG with nodes corresponding to "Which university was Facebook launched in?" (Harvard) and "Which city was #1 launched in?". The same process can also be reversed to compose a candidate multihop question from constituent single-hop questions. In the above example, the answer to the 1st question is an entity, Harvard, which also occurs as part of the second question, allowing the two questions to be composed together. More concretely, we have the following criterion:

Composability Criterion: Two single-hop question-answer tuples (q1, a1) and (q2, a2) are composable into a multi-hop question Q with a2 as a valid answer if a1 is an answer entity and it is mentioned in q2.

The process of composing candidate multi-hop questions can be chained together to form candidate reasoning graphs of various shapes and sizes. Conveniently, since the NLP community has constructed abundant human-annotated single-hop questions, we can leverage them directly to create multi-hop questions.

Furthermore, since single-hop reading comprehension questions come with an associated supporting paragraph or context, we can prepare the supporting context for a composed multihop question as the set of supporting paragraphs from its constituent questions. Additional distracting paragraphs can be retrieved from a large corpus of paragraphs.

Such ground-up and automatic construction of candidate multihop questions from existing single-hop questions gives us a programmatic approach to explore a large space of candidate multi-hop questions, which provides a unique advantage towards the goal of preventing shortcut-based reasoning. Previous works on making multihop QA datasets less cheatable have explored finding or creating better distractors to include in the context, while treating the questions as static (Min et al., 2019a; Jia and Liang, 2017). However, this may not be a good enough strategy, because if the subquestions are specific enough, even in an open domain there may not be any good distractor (Groeneveld et al., 2020a). Further, adding distractors found by specialized interventions may introduce new artifacts, allowing models to learn shortcuts again. Instead, creating multi-hop questions by composing single-hop questions provides us with greater control. Specifically, exploring a very large space of potential single-hop questions allows us to filter out those for which we can't find strong distractors.

4 Data Construction Pipeline

We design a dataset construction pipeline with the goal of creating multi-hop questions that satisfy the MuSiQue condition (i.e., equations (1) and (2)). The high-level schematic of the pipeline is shown in Figure 2.

We begin with a large set of reading comprehension single-hop questions S, with individual instances denoted (q_i, p_i, a_i), referring to the question, associated paragraph, and a valid answer respectively. These single-hop questions are run through the following steps.

S1. Find Good Single-Hop Questions

First, we filter out single-hop questions that:

Are close paraphrases: If two questions have the same normalized³ answer and their question words have an overlap of more than 70%, we assume them to be paraphrases and filter out one of them (a small sketch of this filter appears at the end of this subsection).

Likely have annotation errors: Annotation errors are often unavoidable in single-hop RC datasets. While a small percentage of such errors is not a huge problem, these errors can be amplified when multi-hop questions are created by composing single-hop questions – a multi-hop question would have an error if any constituent single-hop question has an error. For example, a dataset of 3-hop questions, created by composing single-hop questions with errors in 20% of the questions, would have errors in ∼50% of the multi-hop questions. To filter such errors without human intervention, we use a model-based approach. We generate 5-fold train, validation and test splits of the set. For each split, we train 5 strong models (2 random seeds of RoBERTa-large (Liu et al., 2019), 2 random seeds of Longformer-Large (Beltagy et al., 2020), and 1 UnifiedQA (Khashabi et al., 2020)) for the answer prediction task in the reading comprehension setting. We remove instances from the test folds where none of the models' predicted answers had any overlap with the labeled answer.

We also remove single-hop questions where the dataset comes with multiple ground-truth answers, or where the ground-truth answer isn't a substring of the associated context.

³remove special characters, articles, and lowercase
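The following sketch illustrates the close-paraphrase filter described above, under the assumption that the normalization of footnote 3 and a simple word-overlap measure suffice; the exact overlap computation used in the pipeline is not specified in the text, so this is only an approximation.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop articles and special characters (per footnote 3)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def word_overlap(q1: str, q2: str) -> float:
    """Fraction of shared question words (an assumed, simple overlap proxy)."""
    w1, w2 = set(normalize(q1).split()), set(normalize(q2).split())
    return len(w1 & w2) / max(1, min(len(w1), len(w2)))

def drop_close_paraphrases(questions):
    """questions: list of dicts {"question": str, "answer": str}.
    Keeps the first of any pair judged to be paraphrases."""
    kept = []
    for q in questions:
        is_duplicate = any(
            normalize(q["answer"]) == normalize(k["answer"])
            and word_overlap(q["question"], k["question"]) > 0.70
            for k in kept
        )
        if not is_duplicate:
            kept.append(q)
    return kept
```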

[Figure 2 diagram: the MuSiQue pipeline. All Single-Hop Questions (2017K) → (1) Find Good Single-Hop Questions → Good Single-Hop Questions (760K) → (2) Find Composable 2-Hop Questions → Composable 2-Hop Questions (12M) → (3) Filter to Connected 2-Hop Questions → Connected 2-Hop Questions (3.2M) → (4) Build Multihop Questions → Composed Multi (2-4) Hop Questions (78K) → Split Questions to Sets / Minimizing Train-Test Overlap (27K) → Crowdsource Question Compositions (25K) → Build Contexts for Questions (gold and insufficient contexts, 25K) → Add Unanswerable Questions (50K) → MuSiQue-Ans and MuSiQue-Full, split into Train / Dev / Test.]

Figure 2: MuSiQue construction pipeline. The MuSiQue pipeline takes single-hop questions from existing datasets, explores the space of multi-hop questions that can be composed from them, and generates a dataset of challenging multi-hop questions that are difficult to cheat on. The MuSiQue pipeline also makes unanswerable multi-hop questions that make the final dataset significantly more challenging.

Are not amenable to creating multi-hop questions: Since composing two questions (described next) requires the answer to the first question to be an entity, we remove questions whose answer is not a single entity⁴. We also remove outlier questions for which the context is too short (< 20 words) or too long (> 300 words).

We start with 2017K single-hop questions from 5 datasets (SQuAD (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019), MLQA⁵ (Lewis et al., 2019), T-REx (ElSahar et al., 2018), and Zero Shot RE (Levy et al., 2017)) and filter them down to 760K good single-hop questions using this pipeline.

S2. Find Composable 2-Hop Questions

We next find composable pairs of single-hop questions within this set. A pair of different single-hop questions (q1, p1, a1) and (q2, p2, a2) can be composed to form a 2-hop question (Q, {p1, p2}, a2) if a1 is an entity and is mentioned once in q2; here Q represents a composed question whose DAG G_Q has q1 and q2 as the nodes and (q1, q2) as the only edge.

To ensure that the two entity mentions (a1 and its occurrence in q2, denoted e2) refer to the same entity, we check the following:

1. Both a1 and e2 are marked as entities of the same type by an entity extractor⁶.
2. Normalized a1 and e2 are identical.
3. Querying the Wikipedia search API with a1 and e2 returns identical first results.
4. A SOTA wikification model (Wu et al., 2020) returns the same result for a1 and e2, given the contexts p1 + q1 and p2 + q2 respectively.

We found this process to be about 92% precise⁷ in identifying pairs of single-hop questions with a common entity.

To consider a pair of single-hop questions as composable, we additionally also check that: 1. a2 is not part of q1; and 2. p1 and p2 are not the same. Given our seed set of 760K questions, we are able to find 12M composable 2-hop pairs.

S3. Filter to Connected 2-Hop Questions

Next, we filter the composable 2-hop questions down to only those that are likely to be connected. That is, to answer the 2-hop question, it should be necessary to use and answer all constituent single-hop questions. We call this process disconnection filtering.

Going back to the MuSiQue condition (1, 2), for the 2-hop question to be connected, we need M(q1, φ) to not be a1 and M(q2^{m1}, C) to not be a2, where C is the context. This condition naturally gives us an opportunity to decompose the problem into 2 parts: (i) check if the first single-hop question (head node) is answerable without the context, and (ii) check if the second single-hop question (tail node) is answerable with the context and question, but with the mention of a1 in the question masked.

⁴extracted by spacy.io
⁵en-en subset of it
⁶we used spacy.io
⁷based on crowdsourced human evaluation
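Before moving to the filtering details, here is a simplified sketch of the S2 composability test described above. The `same_entity` callable abstracts over the four coreference heuristics (entity typing, normalization, Wikipedia search, and wikification), which need external tools and are not implemented here; all field names are illustrative.

```python
def mentions_once(text: str, entity: str) -> bool:
    """True if `entity` appears exactly once in `text` (case-insensitive)."""
    return text.lower().count(entity.lower()) == 1

def is_composable(q1, q2, same_entity) -> bool:
    """
    q1, q2: dicts {"question": str, "paragraph": str, "answer": str}.
    same_entity: callable implementing the four S2 coreference checks
                 (entity extractor, normalization, Wikipedia search,
                 wikification); treated here as a black box.
    Returns True if (q1 -> q2) can form a 2-hop question answered by q2.
    """
    a1, a2 = q1["answer"], q2["answer"]
    if not mentions_once(q2["question"], a1):       # a1 must bridge into q2
        return False
    if not same_entity(a1, q1, q2):                 # a1 and its mention co-refer
        return False
    if a2.lower() in q1["question"].lower():        # a2 must not leak into q1
        return False
    if q1["paragraph"] == q2["paragraph"]:          # need distinct paragraphs
        return False
    return True
```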

Filtering Head Nodes: We take all the questions that appear at least once as the head (q1) of a composable 2-hop question to create the set of head nodes. We then create 5-fold splits of the set, and train and generate predictions using multiple strong models (different random seeds). This way we have 5 answer predictions for each unique head question. We consider the head node acceptable only if the average AnsF1 is less than a threshold.

Filtering Tail Nodes: We create a unique set of masked single-hop questions that occur as a tail node (q2) in any composable 2-hop question. If the same single-hop question occurs in two 2-hop questions with different masked entities, both versions are added to the set. We then prepare a context for each question by taking the associated gold paragraph corresponding to that question and retrieving 9 distractor paragraphs using the question with the masked entity as the query. We then create 5-fold splits of the set, and train and predict the answer and supporting paragraph using multiple (different random seeds) strong Longformer⁸ models (Beltagy et al., 2020). This way we have 5 answer predictions for each unique tail question. We consider the tail node acceptable only if both AnsF1 and SuppF1 are less than a fixed threshold.

Finally, only those composable 2-hop questions are kept for which both the head and the tail node are acceptable. Starting from 12M 2-hop questions, this results in a set of 3.2M connected 2-hop questions.

S4. Build Multihop Questions

The set of connected 2-hop questions forms the directed edges of a graph. Any subset directed acyclic graph (DAG) of this graph can be used to create a connected multi-hop question. We enumerate 6 types of reasoning graphs (1 for 2-hop, 2 for 3-hop, and 3 for 4-hop), as shown in Table 2, and employ the following heuristics for curation.

To ensure diversity of the resulting questions and to also make the graph exploration computationally practical, we used two heuristics to control graph traversal: (i) the same bridging entity should not be used more than 100 times; (ii) the same single-hop question should not appear in more than 25 multi-hop questions. Furthermore, since we eventually want to create a comprehensible single multi-hop question from the multiple single-hop questions, we try to limit the total length of these questions. Each single-hop question shouldn't be more than 10 tokens long. The total length of the questions should not be more than 15 tokens for 2-hop and 3-hop questions, and not more than 20 tokens for 4-hop questions. Finally, we remove all 2-hop questions that occur as a subset of any of the 3-hop questions and remove all 3-hop questions that occur as a subset of any of the 4-hop questions.

S5. Split Questions to Sets

Given the recent findings of Lewis et al. (2021), we split the final set of questions into train, validation and test splits such that it is not possible to score high by memorization. We do this by ensuring there is no overlap (defined below) between the train and validation sets, or between the train and test sets. Additionally, we also ensure the overlap between validation and test is minimal.

We consider two multi-hop questions Qi and Qj to overlap if (i) any single-hop question is common between Qi and Qj, (ii) the answer to any single-hop question of Qi is also an answer to some single-hop question in Qj, or (iii) any associated paragraph of any of the single-hop questions is common between Qi and Qj. We start with 2 sets of multi-hop questions: initially, set-1 contains all questions and set-2 is empty. Then we greedily move to set-2 the question from set-1 that least overlaps with the rest of the set-1 questions. We do this until a fixed size of set-2 is reached. Then, we remove all remaining questions from set-1 which overlap with set-2. Finally, set-1 becomes the training set, and set-2 becomes the validation+test set, which is further split into validation and test sets with a similar procedure. We ensure the distribution of source datasets of the single-hop questions in the train, validation and test sets is similar, and also control the sizes of the 2-, 3- and 4-hop subsets.

⁸Because it can fit the long context of 10 paragraphs and has been shown to be competitive on HotpotQA in a similar setup.
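As an illustration of the S5 splitting procedure, the sketch below greedily builds the held-out set and then removes overlapping training questions. It ignores efficiency and the additional balancing constraints (source-dataset distribution, hop counts), and the field names are assumptions.

```python
def overlaps(qa, qb) -> bool:
    """Two multi-hop questions overlap if they share a single-hop question,
    a single-hop answer, or an associated paragraph (S5)."""
    return bool(
        qa["subquestions"] & qb["subquestions"]
        or qa["subanswers"] & qb["subanswers"]
        or qa["paragraphs"] & qb["paragraphs"]
    )

def greedy_split(questions, eval_size):
    """questions: list of dicts with `subquestions`, `subanswers`, and
    `paragraphs` fields, each a set.  Returns (train, eval) such that
    no training question overlaps with a held-out question."""
    pool = list(questions)
    eval_set = []
    while len(eval_set) < eval_size and pool:
        # Pick the question that overlaps with the fewest remaining ones.
        best = min(pool, key=lambda q: sum(overlaps(q, o) for o in pool if o is not q))
        pool.remove(best)
        eval_set.append(best)
    # Drop any remaining question that overlaps with the held-out set.
    train_set = [q for q in pool if not any(overlaps(q, e) for e in eval_set)]
    return train_set, eval_set
```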

S6. Build Contexts for Questions

For an n-hop question, the context is a set of paragraphs consisting of the paragraphs associated with the individual constituent subquestions (p1, p2, ..., pn), and additional distractor paragraphs retrieved from a corpus of paragraphs. The query used for retrieval is a plain concatenation of the subquestions with the answer mentions from all of their incoming edges masked, q1^m + q2^m + ... + qn^m.

To ensure that our distractors are not obvious, we retrieve them only from the set of gold context paragraphs associated with the initially filtered single-hop questions. As a result, it would be impossible to identify the relevant paragraphs without using the question. We will compare this strategy with the standard strategy of using full Wikipedia as a source of distractors in the experiments section.

Furthermore, we also ensure that memorizing the paragraphs from the training data can't help the model to select or eliminate paragraphs in the development or test set. For this, we first retrieve the top 100 paragraphs for each multi-hop question (using the concatenated query described above), then we enumerate over all non-supporting paragraphs of each question and randomly eliminate all occurrences of each such paragraph either from (i) the training set or (ii) the development and test sets. We combine the remaining retrieved paragraphs (in the order of their score) for each question with the set of supporting paragraphs to form a set of 20 paragraphs. These are then shuffled to form the context.

This strategy ensures that a given paragraph may occur as non-supporting in only one of the two: (i) the train set or (ii) the development and test sets. This limits memorization potential (as we show in our experiments).

S7. Crowdsource Question Compositions

We ask crowdworkers to compose short and coherent questions from our final question compositions (represented as DAGs), such that information from all single-hop questions is used and the answer to the composed question is the same as that of the last single-hop question. We also filter out incorrectly composed questions by asking the crowdworkers to verify whether the bridged entities refer to the same underlying entity. The workers can see the associated paragraphs corresponding to each single-hop question for this task. Our annotation interface can be viewed in Figure 3. Workers are encouraged to write short questions, but if a question is too long, they are allowed to split it into 2 sentences (see Table 2 for some examples).

We ran initial qualification rounds on Amazon MTurk for the task, in which 100 workers participated. Authors of the paper graded the coherency and correctness and selected the top 17 workers to generate the final dataset. The task was split into 9 batches and we gave regular feedback to the workers by email. We paid 25, 40, and 60 cents for 2-, 3- and 4-hop questions respectively, which amounted to about 15 USD per hour. The total cost of question writing was about 11K USD.

At this stage we have 24,814 reading-comprehension instances (19,938 train, 2,417 validation, 2,459 test), which we call MuSiQue-Ans.

S8. Add Unanswerable Questions

For each answerable multi-hop RC instance we create a corresponding unanswerable multi-hop RC instance using a procedure closely similar to the one proposed by Trivedi et al. (2020). For a multi-hop question, we randomly sample any one of its single-hop questions and make it unanswerable by ensuring the answer to that single-hop question doesn't appear in any of the paragraphs in the context. Since one of the single-hop questions is unanswerable given the context, the whole multi-hop question becomes unanswerable.

The process to build contexts for unanswerable questions is identical to that for the answerable ones, except it is adjusted to ensure the forbidden answer (from the single-hop question that is being made unanswerable) is never part of the context. First, we remove the supporting paragraphs of the multi-hop question which contain the forbidden answer. Second, we retrieve the top 100 paragraphs with the concatenated query, the same as for the answerable question, but additionally add a hard constraint to disallow the forbidden answer. From what remains, we apply the same filtering of paragraphs as explained for answerable questions, to ensure non-supporting paragraphs don't overlap. Finally, the remaining supporting paragraphs are combined with the top retrieved paragraphs to form a context of 20 unique paragraphs.

For the new task, the model needs to predict whether the question is answerable or not, and predict the answer and support if it is answerable. Given that the questions for the answerable and unanswerable sets are identical and the context changes only marginally, models that rely on shortcuts find this task extremely difficult.

Since we create one unanswerable question for each answerable question, we now have 49,628 reading-comprehension instances (39,876 train, 4,834 validation, 4,918 test), which we call MuSiQue-Full.
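A minimal sketch of the S8 contrast construction follows, assuming access to a retrieval function and per-instance fields for the supporting paragraphs and the concatenated retrieval query; these interfaces are illustrative, not the actual pipeline code.

```python
import random

def make_unanswerable(instance, retrieve, num_paragraphs=20):
    """
    instance: dict with "question", "subquestions" (each having an "answer"),
              "supporting_paragraphs", and "retrieval_query".
    retrieve: callable (query, k) -> list of paragraph strings; it stands in
              for the pipeline's retriever and is assumed to exist.
    """
    # Pick one single-hop step and forbid its answer from the context.
    forbidden = random.choice(instance["subquestions"])["answer"]

    # Keep only supporting paragraphs that do not reveal the forbidden answer.
    supporting = [p for p in instance["supporting_paragraphs"]
                  if forbidden.lower() not in p.lower()]

    # Retrieve distractors with the same concatenated (masked) query,
    # with a hard constraint against the forbidden answer.
    distractors = [p for p in retrieve(instance["retrieval_query"], 100)
                   if forbidden.lower() not in p.lower()]

    context = supporting + distractors[: num_paragraphs - len(supporting)]
    random.shuffle(context)
    return {"question": instance["question"],
            "context": context,
            "answerable": False}
```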

Final Dataset

The final dataset statistics for MuSiQue-Ans (MuSiQue-Full has twice the number of questions in each cell) are shown in Table 1. Multi-hop questions in MuSiQue are built from 21,020 unique single-hop questions, and contain 4,132 unique answers to multi-hop questions, 19,841 unique answers to single-hop questions, and 7,676 unique supporting paragraphs. MuSiQue has multi-hop questions of 6 types of reasoning graphs distributed across 2-4 hops. These types and examples are shown in Table 2.

            Train    Dev    Test
  2-hop     14376    1252   1271
  3-hop      4387     760    763
  4-hop      1175     405    425
  Total     19938    2417   2459

Table 1: Dataset statistics of MuSiQue-Ans (24,814 questions in total). MuSiQue-Full contains twice the number of questions in each category above – one answerable and one unanswerable.

5 Experimental Setup

This section describes the datasets, models, and human assessment used in our experiments, whose results are reported in Section 6.

5.1 Datasets

We create two versions of our dataset: MuSiQue-Ans is a set of 25K answerable questions, where the task is to predict the answer and supporting paragraphs. MuSiQue-Full is a set of 50K questions (25K answerable and 25K unanswerable), where the task is to predict whether the question is answerable or not, and if it is answerable, then predict the answer and supporting paragraphs.

We compare our dataset with two similar multi-hop RC datasets: HotpotQA (Yang et al., 2018) and 2WikiMultihopQA⁹ (Ho et al., 2020). We use the distractor setting of HotpotQA to compare with our reading-comprehension setting. Questions in HotpotQA are crowdsourced, and questions in 2Wiki are automatically generated based on rules and templates. Both datasets have 10 paragraphs as context. HotpotQA contains 2-hop questions with 2 supporting paragraphs each, while 2Wiki has 2-hop and 4-hop questions with 2 and 4 supporting paragraphs, respectively. Additionally, HotpotQA has sentence-level support information and 2Wiki has supporting chain information with entity-relation tuples, but we don't use this additional annotation in our evaluation for a fair comparison.

HotpotQA, 2Wiki, and MuSiQue-Ans have 90K, 167K, and 20K training instances, respectively. For a fair comparison, we use equal-sized training sets in all our experiments, obtained by randomly sampling 20K instances each from HotpotQA and 2Wiki, referred to as HotpotQA-20k and 2Wiki-20k, respectively.

We will use the following notation throughout this section. Instances in MuSiQue-Ans, HotpotQA, and 2Wiki are of the form (Q, C; A, Ps). Given a question Q along with a context C consisting of a set of paragraphs, the task is to predict the answer A and identify the supporting paragraphs Ps ∈ C. MuSiQue-Ans additionally has the DAG representation of the ground-truth decomposition G_Q (cf. Section 2), which models may leverage during training. Instances in MuSiQue-Full are of the form (Q, C; A, Ps, S), where there is an additional binary classification task to predict S, the answerability of Q based on C, also referred to as context sufficiency (Trivedi et al., 2020).

Metrics. For MuSiQue-Ans, HotpotQA, and 2Wiki, we report the standard F1-based metrics for answer (AnsF1) and support identification (SuppF1); see Yang et al. (2018) for details. All 3 datasets have paragraph-level support annotation, but they do not all have the same further fine-grained support annotation (the reasoning graph, supporting sentences, or evidence tuples for the three datasets, respectively). To make a fair comparison, we use only paragraph-level support F1 across all datasets.

For MuSiQue-Full, we follow Trivedi et al. (2020) to combine the context sufficiency prediction S with AnsF1 and SuppF1, denoted as AnsF1+Suff and SuppF1+Suff. Instances in MuSiQue-Full occur in pairs and are also evaluated in pairs. Specifically, for each Q with a sufficient context C, there is a paired instance with Q and an insufficient context C′. For AnsF1+Suff, if a model incorrectly predicts context sufficiency (yes or no) for either of the instances in a pair, it gets 0 pts on that pair. Otherwise, it gets the same AnsF1 score on the pair as it gets on the answerable instance in the pair. Scores are averaged across all pairs of instances in the dataset. Likewise for SuppF1+Suff.

⁹For brevity, we use 2Wiki to refer to 2WikiMultihopQA.
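To make the paired evaluation concrete, here is a small sketch of AnsF1+Suff computed over (answerable, unanswerable) instance pairs; it assumes per-instance AnsF1 values are already computed and uses illustrative field names (SuppF1+Suff is analogous).

```python
def ans_f1_plus_suff(pairs):
    """
    pairs: list of dicts, one per (answerable, unanswerable) instance pair:
        {
          "pred_suff_answerable": bool,   # model says context is sufficient
          "pred_suff_unanswerable": bool,
          "ans_f1_answerable": float,     # AnsF1 on the answerable instance
        }
    A pair scores 0 unless sufficiency is predicted correctly on BOTH
    instances (True for the answerable one, False for the unanswerable one);
    otherwise it scores the answerable instance's AnsF1.
    """
    scores = []
    for p in pairs:
        correct_suff = p["pred_suff_answerable"] and not p["pred_suff_unanswerable"]
        scores.append(p["ans_f1_answerable"] if correct_suff else 0.0)
    return sum(scores) / max(1, len(scores))
```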

[2-hop graph]
Question: Who was the grandfather of David Goodhart? Answer: Arthur Lehman Goodhart
Decomposition: 1. Who was the male parent of David Goodhart? → Philip Goodhart. 2. Who's Philip Goodhart's father? → Arthur Lehman Goodhart.
Supporting snippets: 1. "... one of seven children ... to Philip Goodhart." 2. "Philip Carter Goodhart ... son of Arthur Lehman Goodhart."

[3-hop graph]
Question: What currency is used where Billy Giles died? Answer: pound sterling
Decomposition: 1. At what location did Billy Giles die? → Belfast. 2. What part of the United Kingdom is Belfast located in? → Northern Ireland. 3. What is the unit of currency in Northern Ireland? → pound sterling.
Supporting snippets: 1. "Billy Giles (..., Belfast – 25 September 1998, Belfast)". 2. "... thirty-six public houses ... Belfast, Northern Ireland." 3. "bank for pound sterling, issuing ... in Northern Ireland."

[3-hop graph]
Question: When was the first establishment that McDonaldization is named after, open in the country Horndean is located? Answer: 1974
Decomposition: 1. What is McDonaldization named after? → McDonald's. 2. Which state is Horndean located in? → England. 3. When did the first McDonald's open in England? → 1974.
Supporting snippets: 1. "... spreading of McDonald's restaurants ... 'McDonaldization'". 2. "... Horndean is a village ... in Hampshire, England." 3. "... first McDonald's in the United Kingdom ... in 1974."

[4-hop graph]
Question: When did Napoleon occupy the city where the mother of the woman who brought Louis XVI style to the court died? Answer: 1805
Decomposition: 1. Who brought Louis XVI style to the court? → Marie Antoinette. 2. Who's the mother of Marie Antoinette? → Maria Theresa. 3. In what city did Maria Theresa die? → Vienna. 4. When did Napoleon occupy Vienna? → 1805.
Supporting snippets: 1. "Marie Antoinette, ... brought the 'Louis XVI' style to court". 2. "Maria Antonia of Austria, youngest daughter of ... Maria Theresa". 3. "Maria Theresa ... in Vienna ... after the death". 4. "occupation of Vienna by Napoleon's troops in 1805".

[4-hop graph]
Question: How many Germans live in the colonial holding in Aruba's continent that was governed by Prazeres's country? Answer: 5 million
Decomposition: 1. What continent is Aruba in? → South America. 2. What country is Prazeres in? → Portugal. 3. The colonial holding in South America governed by Portugal? → Brazil. 4. How many Germans live in Brazil? → 5 million.
Supporting snippets: 1. "... Aruba, lived including indigenous peoples of South America". 2. "Prazeres is ... in municipality of Lisbon, Portugal." 3. "Portugal, ... desire for independence amongst Brazilians." 4. "Brazil ... 5 million people claiming German ancestry."

[4-hop graph]
Question: When did the people who captured Malakoff come to the region where Philipsburg is located? Answer: 1625
Decomposition: 1. What is Philipsburg capital of? → Saint Martin. 2. Saint Martin (French part) is located on what terrain feature? → Caribbean. 3. Who captured Malakoff? → French. 4. When did the French come to the Caribbean? → 1625.
Supporting snippets: 1. "Philipsburg ... capital of ... Saint Martin". 2. "... airport on the Caribbean island of Saint Martin/Sint Maarten." 3. "... the capture of the Malakoff by the French". 4. "French trader ... sailed to ... Caribbean in 1625 ... French settlement on ...".
Table 2: Examples of the 6 different reasoning graph shapes in MuSiQue.

5.2 Models

All our models are Transformer-based (Vaswani et al., 2017) pretrained language models (Devlin et al., 2019), implemented using PyTorch (Paszke et al., 2019), HuggingFace Transformers (Wolf et al., 2019) and AllenNLP (Gardner et al., 2017). We experiment with 2 kinds of models: (1) Standard Multi-hop Models, which receive both Q and C as input, are in principle capable of employing the desired or expected reasoning, and have demonstrated competitive performance on previous multi-hop QA datasets. These models help probe the extent to which a dataset can be solved by current models. (2) Artifact-based Models, which are restricted in some way that prohibits them from doing the desired or expected reasoning, as we will discuss shortly. These models help probe the extent to which a dataset can be cheated.

5.2.1 Multi-hop Models

We describe how these models work for our datasets, MuSiQue-Ans and MuSiQue-Full. For HotpotQA and 2Wiki, they operate similarly to MuSiQue-Ans.

End2End Model. This model takes (Q, C) as input and predicts (A, Ps) as the output for MuSiQue-Ans and (A, Ps, S) for MuSiQue-Full. Answer prediction is implemented as span prediction using a transformer, similar to Devlin et al. (2019). Support prediction is implemented by adding special [PP] tokens at the beginning of each paragraph and supervising them with a binary cross-entropy loss, similar to Beltagy et al. (2020). The binary classification for answerability or context sufficiency is done via the CLS token of the transformer architecture, trained with a cross-entropy loss. We use Longformer (Beltagy et al., 2020), which is one of the few transformer architectures able to fit the full context.

Select+Answer Model. This model, inspired by Quark (Groeneveld et al., 2020b) and SAE (Tu et al., 2020), breaks the process into two parts. First, a context selector ranks and selects the K most relevant paragraphs C_K from C.¹⁰ Second, an answerer generates the answer and supporting paragraphs based only on C_K. Both components are trained individually, as follows. The selector is designed to rank the supporting paragraphs Ps ∈ C the highest based on Q and C. Given (Q, C) as input, it scores every P ∈ C and is trained with the cross-entropy loss. We form C_K using the K paragraphs it scores the highest. The answerer is trained to take (Q, C_K) as input, and predict (A, Ps) as the output for MuSiQue-Ans and (A, Ps, S) for MuSiQue-Full. We implement the selector using RoBERTa-large (Liu et al., 2019), and the answerer using Longformer-Large.

Step Execution Model. Similar to prior decompositional approaches (Talmor and Berant, 2018; Min et al., 2019b; Qi et al., 2020; Khot et al., 2021), this model performs explicit, step-by-step multi-hop reasoning, by first predicting a decomposition of the input question Q into a DAG G_Q containing single-hop questions, and then using repeated calls to a single-hop model to execute this decomposition as discussed below.

The question decomposer is trained on the ground-truth decomposition annotations available in MuSiQue-Ans, and is implemented with BART-large.

The answerer takes C and the predicted DAG G_Q as input, and outputs (A, Ps) for MuSiQue-Ans and (A, Ps, S) for MuSiQue-Full. It does this with repeated calls to a single-hop model Ms while traversing G_Q in a topological sort order (q1, q2, ..., qn), as follows. By definition of topological sort, q1 has no incoming edges and hence does not refer to the answer to any other single-hop question. The step execution model applies Ms to q1 in order to predict an answer a1. Then, for every edge (q1, qi), it substitutes the reference in qi to the answer of q1 with the predicted answer a1, thereby removing this cross-question reference in qi. This process is repeated for q2, q3, ..., qn in this order, and the predicted answer to qn is reported as the final answer.

The single-hop model Ms (which we implement as an End2End model) is trained on only single-hop instances—taking (qi, C) as input, and producing (Ai, Pi) or (Ai, Pi, Si) as the output. Here Pi refers to the singleton supporting paragraph for qi and Si refers to whether C is sufficient to answer qi. For MuSiQue-Full, the answerer predicts the multi-hop question as having sufficient context if Ms predicts all subquestions in the above process to have sufficient context.

We experiment with this model only on MuSiQue, since HotpotQA and 2Wiki don't have decompositions that are executable using Ms.

¹⁰K is a hyperparameter, chosen from {3, 5, 7}.
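The step-by-step execution loop described above can be summarized by the following sketch. It assumes the predicted decomposition is already in topological order and uses "#j" placeholders for references to earlier answers (as in the Section 3 example); `single_hop_answer` stands in for the trained single-hop model Ms and is an illustrative interface, not the released code.

```python
def execute_decomposition(subquestions, context, single_hop_answer):
    """
    subquestions: list of question strings in topological order, where a
        reference to the answer of step j is written as the placeholder "#j"
        (1-indexed), e.g. "Which country is #1 in?".
    single_hop_answer: callable (question, context) -> answer string,
        standing in for the trained single-hop model Ms.
    Returns the answer predicted for the last subquestion.
    """
    answers = []
    for q in subquestions:
        # Substitute all earlier predicted answers into this subquestion,
        # removing its cross-question references.
        for j, a in enumerate(answers, start=1):
            q = q.replace(f"#{j}", a)
        answers.append(single_hop_answer(q, context))
    return answers[-1]
```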
5.2.2 Artifact-based Models

To probe weaknesses of the datasets, we consider three models whose input, by design, is insufficient to allow them to perform the desired reasoning.

The Question-Only Model takes only Q as input (no C) and generates A as the output. We implement this with BART-large (Lewis et al., 2020).

The Context-Only Model takes only C as input (no Q) and predicts (A, Ps) as the output. We implement this with an End2End Longformer-Large model where the empty string is used as Q.

Finally, our Single-Paragraph Model, similar to those proposed by Min et al. (2019a) and Chen and Durrett (2019), is almost the same as the Select+Answer model with K=1. Instead of training the selector to rank all of Ps the highest, we train it to rank any paragraph containing the answer string A the highest. The answerer then takes as input one selected paragraph p ∈ Ps and predicts an answer to Q based solely on p. Note that this model doesn't have access to the full supporting information, as all considered datasets have at least two supporting paragraphs per question.
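For concreteness, the three artifact-based probes differ only in which part of an instance they are allowed to see. The sketch below derives their restricted inputs from a full RC instance; the field names are illustrative, and the paragraph choice for the Single-Paragraph probe mirrors how its selector is trained (any paragraph containing the answer string).

```python
def artifact_model_inputs(instance):
    """
    instance: dict with "question" (str), "paragraphs" (list[str]),
              and "answer" (str).
    Returns the restricted (question, paragraphs) inputs given to each probe.
    """
    question, paragraphs = instance["question"], instance["paragraphs"]

    # Question-Only: no context at all.
    question_only = (question, [])

    # Context-Only: full context, empty question string.
    context_only = ("", paragraphs)

    # Single-Paragraph: one paragraph that contains the answer string,
    # paired with the question.
    answer = instance["answer"].lower()
    candidates = [p for p in paragraphs if answer in p.lower()]
    single_paragraph = (question, candidates[:1])

    return {"question_only": question_only,
            "context_only": context_only,
            "single_paragraph": single_paragraph}
```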

5.3 Human Performance

We perform a randomized human experiment to establish comparable and fair human performance on HotpotQA, 2Wiki, and MuSiQue-Ans. We sample 125 questions from each dataset, combine them into a single set, shuffle this set, and obtain 5 annotations of answer span and supporting paragraphs for each instance. Our interface lists all context paragraphs and makes them easily searchable and sortable with interactive text-overlap-based search queries. The interface includes a tutorial with 3 examples of question, supporting paragraphs, and answer from each dataset.

We crowdsourced this task on Amazon Mechanical Turk.¹¹ For the qualification round (25 × 3 × 5 annotations), we allowed only workers with the master qualification and selected workers who had more than 75 AnsF1 and more than 75 SuppF1 on all datasets. Only these 5 workers were allowed in the rest of the annotation.

We should note that 2Wiki did not report human scores, and HotpotQA reported human scores that aren't a fair comparison to models. This is because humans in HotpotQA were shown the question and only the two correct supporting paragraphs to answer the question, whereas models are expected to reason over the full context (including 8 additional distracting paragraphs). In our setup, we put the same requirements on humans as on models, making a human vs. model comparison more accurate. Moreover, since we shuffled our instances and used the same pool of workers, we can also compare human scores across the 3 datasets.

We compute two human performance scores: Human Majority – the most common annotated answer and support (in case of a tie, one is chosen at random), and Human Upper Bound – the answer and support prediction that maximizes the score. The human scores reported in HotpotQA are based on the Human Upper Bound.

6 Empirical Findings

We now discuss our empirical findings, demonstrating that MuSiQue is a challenging multi-hop dataset that is harder to cheat on than existing datasets (Section 6.1), that the steps in the construction pipeline of MuSiQue are individually valuable (Section 6.2), and that the dataset serves as a useful testbed for pursuing interesting multi-hop reasoning research (Section 6.3). All our reported results are based on the development sets of the respective datasets.

6.1 MuSiQue is a Challenging Dataset

We first show that, compared to the other two datasets considered (HotpotQA and 2Wiki), both variants of MuSiQue are less cheatable via shortcuts and have a larger human-to-machine gap.

Higher Human-Machine Gap. As shown in the top two sections of Table 3, MuSiQue-Ans has a significantly higher human-model gap than the other datasets, for both answer and supporting paragraph identification.¹² In fact, for both the other datasets, supporting paragraph identification has even surpassed the human majority score, whereas for MuSiQue-Ans there is a 17 point gap. Additionally, MuSiQue-Ans has a ∼29 pt gap in answer F1, whereas HotpotQA and 2Wiki have gaps of only 10 and 5, respectively.

Note that the human-model gap decreases further for HotpotQA and 2Wiki if we use their full training datasets. Likewise for MuSiQue-Ans, as we'll show in Section 6.3, adding auto-generated augmentation data also reduces the human-model gap. However, since these alterations result in datasets of different sizes and quality, they do not provide a useful comparison point.

Lower Cheatability. As seen in the bottom section of Table 3, the performance of artifact-based models (from Section 5.2.2) is significantly higher on HotpotQA and 2Wiki than on MuSiQue-Ans. This shows that MuSiQue is significantly less cheatable via shortcut-based reasoning.

Answer identification can be done very well on HotpotQA and 2Wiki with the Single-Paragraph model. In particular, support identification in both datasets can be done to a very high degree (67.6 and 92.0 F1) even without the question. One might argue that the comparison of paragraph identification is unfair since HotpotQA and 2Wiki have 10 paragraphs as context, while MuSiQue-Ans has 20. However, as we will discuss in the ablations later (Table 6), even with 10 paragraphs, MuSiQue-Ans is significantly less cheatable. Overall, we find that both the answer and supporting paragraph identification tasks in MuSiQue-Ans are less cheatable via disconnected reasoning.

MuSiQue-Full: Even More Challenging. Table 4 compares the performance of models on MuSiQue-Ans vs. MuSiQue-Full, where the latter is obtained by adding unanswerable questions to the former.

¹¹https://www.mturk.com
¹²Table 9 in the Appendix shows the performance of multi-hop models split by the number of hops required by the question in MuSiQue (2, 3, or 4 hops).

                                 HotpotQA-20K       2Wiki-20K          MuSiQue-Ans
              Model              AnsF1 | SuppF1     AnsF1 | SuppF1     AnsF1 | SuppF1
Humans        Majority            83.6 | 92.0        84.1 | 99.0        76.6 | 93.0
              UpperBound          91.8 | 96.0        89.0 | 100         88.6 | 96.0
Multi-hop     End2End             72.9 | 94.3        72.9 | 97.6        42.3 | 67.6
Models        Select+Answer       74.9 | 94.6        79.5 | 99.0        47.3 | 72.3
              Step Execution         — | —              — | —           47.5 | 76.8
Artifact-     Single-Paragraph    64.8 | —           60.1 | —           32.0 | —
Based         Context-Only        18.4 | 67.6        50.1 | 92.0         3.4 | 0.0
Models        Question-Only       19.6 | —           27.0 | —            4.6 | —

Table 3: MuSiQue has a substantially higher human-model gap than the other datasets, as shown by the results in the top two sections of the table. Further, MuSiQue is less cheatable compared to the other datasets, as evidenced by lower performance of artifact-based models (bottom section of the table).

                                 MuSiQue-Ans         MuSiQue-Full
              Model              AnsF1 | SuppF1      AnsF1+Suff | SuppF1+Suff
Multi-hop     End2End             42.3 | 67.6              22.0 | 25.2
Models        Select+Answer       47.3 | 72.3              34.4 | 42.0
              Step Execution      47.5 | 76.8              27.8 | 28.3
Artifact-     Single-Paragraph    32.0 | —                  2.4 | —
based         Context-Only         3.4 | 0.0                1.0 | 0.8
Models        Question-Only        4.6 | —                  0.7 | —

Table 4: MuSiQue-Full is harder and less cheatable than MuSiQue-Ans, as evidenced by the multi-hop models and artifact-based models sections of the table, respectively. Note that MuSiQue-Full uses a stricter metric that also checks for correct context sufficiency prediction (“Suff”) and operates over pairs of highly related instances.

The results demonstrate that MuSiQue-Full is significantly more difficult and less cheatable than MuSiQue-Ans. Intuitively, because the answerable and unanswerable instances are very similar but have different labels, it is difficult for models to do well on both instances if they learn to rely on shortcuts (Kaushik et al., 2019; Gardner et al., 2020). As we see, all artifact-based models barely get any AnsF1+Suff or SuppF1+Suff score. For all multi-hop models too, AnsF1 drops by 13-20 pts and SuppF1 by 30-48 pts.

6.2 Dataset Construction Steps are Valuable

Next, we show that three key steps of our dataset construction pipeline (Section 4) are valuable.

Disconnection Filter (step 3). To understand the effect of the Disconnection Filter in our dataset construction pipeline, we do an ablation study by skipping the step of filtering composable 2-hop questions down to connected 2-hop questions; see Figure 2. Since we don't have human-generated composed questions for these additional questions, we resort to a seq2seq BART-large model that is trained (using MuSiQue) to take as input two composable questions and generate as output a composed question.

As shown in Table 5, the Disconnection Filter is crucial for increasing the difficulty and decreasing the cheatability of the final dataset. Specifically, without this filter, we see that both multihop and artifact-based models perform significantly better on the resulting datasets.

Reduced Train-Test Leakage (step 5). Similar to the above ablation, we assess the value of using our careful train-test splits based on a clear separation of constituent single-hop subquestions, their answers, and their supporting paragraphs across splits (Step 5 in Figure 2). Note that our bottom-up construction pipeline is what enables such a split. To perform this assessment, we create a dataset the traditional way, with a purely random partition into train, validation, and test splits. For uniformity, we ensure the distribution of 2-4 hop questions in the development set of the resulting dataset from both ablated pipelines remains the same as in the original development set.

Table 5 shows that without a careful train/test split, the dataset is highly solvable by current models (AnsF1=90.1). Importantly, we see that most of this high score can also be achieved by the single-paragraph (AnsF1=85.1) and context-only (AnsF1=70.1) models, revealing the high cheatability of such a split.

                              Single-Paragraph Model   Context-Only Model   End2End
                              AnsF1 | SuppF1           AnsF1 | SuppF1       AnsF1 | SuppF1
Full Pipeline (F)              32.0 | —                  3.4 | 0.0           42.3 | 67.6
F \ unmemorizable splits       85.1 | —                 70.1 | 49.8          90.1 | 85.5
F \ disconnection filter       52.7 | —                  6.3 | 36.8          60.8 | 72.5

Table 5: Key components of the MuSiQue construction pipeline are crucial for its difficulty and lower cheatability.

Context Type     Retrieval Corpus        Single-Paragraph Model   Context-Only Model   End2End
                                         AnsF1 | SuppF1           AnsF1 | SuppF1       AnsF1 | SuppF1
No Distractors   None                     49.7 | —                 17.0 | 100           70.1 | 100
10 Para          Full Wikipedia           42.5 | —                 12.5 | 77.7          57.2 | 87.6
10 Para          Positive Distractors     28.0 | —                  5.5 | 34.6          54.1 | 80.2
20 Para          Full Wikipedia           41.7 | —                 12.4 | 66.4          50.3 | 80.8
20 Para          Positive Distractors     32.0 | —                  3.4 | 0.0           42.3 | 67.6

Table 6: Positive distractors are more effective than using full Wikipedia for choosing distractors, as evidenced by the lower scores of models. The effect of using positive distractors is more pronounced when combined with the use of 20 (rather than 10) distractor paragraphs.

Harder Distractors (step 7). To understand the effect of distractors on the difficulty and cheatability of the dataset, we construct 5 variations. Three of them capture the effect of the number of distractors: (i) no distractors, (ii) 10 paragraphs, and (iii) 20 paragraphs; and two of them capture the effect of the source of distractors: (i) full Wikipedia,¹³ and (ii) gold context paragraphs from the good single-hop questions (Stage 1 of Figure 2). We refer to the latter setting as positive distractors, as these paragraphs are likely to appear as supporting (positive) paragraphs in our final dataset.

The results are shown in Table 6. First, we observe that using positive distractors instead of full Wikipedia significantly worsens the performance of all models. In particular, it makes identifying supporting paragraphs without the question extremely difficult. This difficulty also percolates to the Single-Paragraph and End2End models. We have shown in Table 3 that Context-Only models are able to identify supporting paragraphs in HotpotQA and 2Wiki to a very high degree (67.6 and 92.0 SuppF1) even without the question. This would have also been true for MuSiQue-Ans (66.4 SuppF1) had we used Wikipedia as the corpus for identifying distractors, like HotpotQA and 2Wiki. This result suggests that it is necessary to be careful about the corpus to search over when selecting distractors, and to ensure there is no distributional shift that a powerful pretrained model can exploit to bypass reasoning.

Second, we observe that using 20 paragraphs instead of 10 makes the dataset more difficult and less cheatable. Interestingly, this effect is significantly more pronounced if we use positive distractors, indicating the synergy between these two approaches to create more challenging distractors.

6.3 A Testbed for Interesting Approaches

Right Inductive Bias Helps MuSiQue. We've shown that artifact-based models do not perform well on MuSiQue. Next we ask, can a model that has an explicit inductive bias towards connected reasoning outperform black-box models on MuSiQue? For this, we compare the End2End model, which doesn't have any explicit architectural bias towards doing multi-step reasoning, to the Step Execution model, which can only take explicit reasoning steps to arrive at the answer and support.

As shown in Table 7, the Step Execution model outperforms the End2End model by ∼5 F1 points on both MuSiQue-Ans and MuSiQue-Full. Moreover, using the oracle decompositions further improves the score of the Step Execution model by 5-6 pts on both datasets.

¹³We used the Wikipedia corpus from Petroni et al. (2020).
The End2End model, however, actually performs a few points worse when it is provided with the oracle decomposition instead of the composed question. This shows that models that can exploit the decomposition structure (and our associated annotations) can potentially outperform these naive black-box models on our dataset, leading to the development of more interesting and possibly more interpretable models.

                                           MuSiQue-Ans         MuSiQue-Full
                                           AnsF1 | SuppF1      AnsF1+Suff | SuppF1+Suff
End2End                                     42.3 | 67.6              22.0 | 25.2
End2End w/ Oracle Decomposition             37.3 | 64.3              19.5 | 23.0
Step Execution                              47.5 | 76.8              27.8 | 28.3
Step Execution w/ Oracle Decomposition      53.9 | 82.7              32.9 | 32.8

Table 7: The right inductive bias helps on MuSiQue.
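For readers unfamiliar with the +Suff columns in Table 7, the snippet below is a minimal sketch of the idea behind scoring answerable/unanswerable contrast pairs as a group. The field names and the exact scoring rule are illustrative assumptions; the precise definition of the +Suff metrics may differ from this sketch.

```python
def grouped_suff_score(pairs, answer_f1):
    """Sketch of group-wise scoring over answerable/unanswerable pairs.

    Each element of `pairs` holds the model's prediction on an answerable
    instance and on its unanswerable counterpart. In this sketch, the model
    earns its answer F1 on the answerable instance only if it also judges
    context sufficiency correctly on both members of the pair; otherwise
    the pair contributes 0.
    """
    total = 0.0
    for ans, unans in pairs:
        suff_correct = (ans["predicted_sufficient"] is True
                        and unans["predicted_sufficient"] is False)
        if suff_correct:
            total += answer_f1(ans["predicted_answer"], ans["gold_answer"])
    return 100.0 * total / max(1, len(pairs))
```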

Training Data                             MuSiQue-Ans       MuSiQue-Full
                                          AnsF1 | SuppF1    AnsF1+Suff | SuppF1+Suff
Filtered MultiHop                         42.3  | 67.6      22.0 | 25.2
Filtered MultiHop + Filtered SingleHop    45.0  | 70.0      23.1 | 23.6
Filtered MultiHop + Unfiltered MultiHop   52.1  | 67.7      33.9 | 38.2

Table 8: Effect of augmenting the training data of MuSiQue. The three rows (top to bottom) have 20K, 34.5K, and 70.5K training instances for the answerable setting (left columns), respectively; the unanswerable setting (right columns) has twice as many instances.

Additional Data Augmentation Helps MuSiQue. Our dataset construction pipeline allows us to generate additional training augmentation data for MuSiQue. We explore two strategies.

Adding Filtered Single-hop Questions. We take the set of unique constituent single-hop questions from the training sets of MuSiQue-Ans and MuSiQue-Full, apply the same context-building procedure (Step 7, but for a single question), and add the resulting RC instances to the training instances of MuSiQue-Ans and MuSiQue-Full, respectively. The validation and test sets of MuSiQue-Ans and MuSiQue-Full remain unchanged.

Adding Unfiltered Multi-hop Questions. We create a new set of multi-hop questions through our dataset construction pipeline for additional data augmentation. Since we have already exhausted the questions that pass the disconnection filter, we choose to skip this filter in favor of larger training data. Additionally, since this is a very large set, we use a model to generate the question compositions instead of relying on humans; specifically, we use a BART-large seq2seq model trained on MuSiQue for this purpose.

We find that both augmentation strategies help improve scores on MuSiQue by ∼10 AnsF1 points (Table 8). However, this is still far from human-achievable scores, with a gap of ∼25 AnsF1 and SuppF1 points.
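As a concrete illustration of the second strategy, the snippet below sketches how a trained seq2seq model could compose a chain of single-hop questions into a multi-hop question. The checkpoint path and the input convention (questions joined by a separator, with "#k" answer placeholders) are assumptions for this sketch; the paper fine-tunes BART-large on MuSiQue's human-written compositions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical checkpoint path; substitute a BART-large model fine-tuned on
# MuSiQue's human-written question compositions.
MODEL_PATH = "path/to/bart-large-musique-composer"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)

def compose_question(single_hop_questions):
    """Generate one multi-hop question from a chain of single-hop questions."""
    # Assumed input convention: constituent questions joined by a separator.
    source = " ;; ".join(single_hop_questions)
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=4, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```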

7 Related Work

Multihop QA. MuSiQue-Ans is closest to HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). HotpotQA was prepared by having crowdworkers write a question given two paragraphs, with distractor paragraphs added post-hoc. We believe that, because the questions were written by crowdworkers without considering the difficulty of compositions and distractors, this dataset has been shown to be solvable to a large extent without multi-hop reasoning (Min et al., 2019a; Trivedi et al., 2020). 2WikiMultihopQA (Ho et al., 2020) was generated automatically using structured information from Wikidata and Wikipedia and a fixed set of human-authored rule templates. While the compositions in this dataset could be challenging, they were limited to a few rule-based templates and were also selected in a model-independent fashion. As shown in our experiments, both these datasets (in comparable settings) are significantly more cheatable and have a smaller human-model gap.

Qi et al. (2020) also build a multi-hop QA dataset with a variable number of hops but only use it for evaluation. Other multi-hop RC datasets focus on other challenges such as narrative understanding (Khashabi et al., 2018), discrete reasoning (Dua et al., 2019), multiple modalities (Chen et al., 2020; Talmor et al., 2021), open-domain QA (Geva et al., 2021; Khot et al., 2020; Yang et al., 2018; Talmor and Berant, 2018; Mihaylov et al., 2018), and relation extraction (Welbl et al., 2018). While we do not focus on these challenges, we believe it should be possible to extend our idea to these settings.

Unanswerability in QA. The idea of using unanswerable questions to ensure robust reasoning has been considered before in both the single-hop (Rajpurkar et al., 2018) and multi-hop (Ferguson et al., 2020; Trivedi et al., 2020) settings. Within the multi-hop setting, the IIRC dataset (Ferguson et al., 2020) focuses on open-domain QA, where unanswerable questions are identified based on human annotations of questions for which relevant knowledge could not be retrieved from Wikipedia. Our work is more similar to Trivedi et al. (2020) in that we modify the context of an answerable question to make it unanswerable. This allows us to create counterfactual pairs (Kaushik et al., 2019; Gardner et al., 2020) of answerable and unanswerable questions that can be evaluated as a group (Gardner et al., 2020; Trivedi et al., 2020) to measure the true reasoning capabilities of a model.

The transformation described in Trivedi et al. (2020) also removes supporting paragraphs to create unanswerable questions, but relies on the dataset annotations being complete; it is possible that the context contains other supporting paragraphs that were not annotated. By creating questions in our bottom-up fashion, where we know even the bridging entities, we can eliminate any potential supporting paragraph by removing all paragraphs containing the bridging entity.

Question Decomposition and Composition. Existing multi-hop QA datasets have been decomposed into simpler questions (Min et al., 2019b; Talmor and Berant, 2018) and special meaning representations such as QDMR (Wolfson et al., 2020). Due to the nature of our construction, our dataset naturally provides the question decomposition for each question. This can enable the development of more interpretable models with the right inductive biases, such as DecompRC (Min et al., 2019b) and ModularQA (Khot et al., 2021).

Similar to our approach, recent work (Pan et al., 2021; Yoran et al., 2021) has also used bottom-up approaches to build multi-hop QA datasets. However, these approaches used rule-based methods to create the composed question, with the primary goal of data augmentation. The generated datasets themselves are not challenging and have only been shown to improve performance on the downstream datasets targeted by their rule-based compositions. Evaluating the impact of MuSiQue on other multi-hop QA datasets is an interesting avenue for future work.

8 Conclusion

We present a new pipeline to construct challenging multi-hop QA datasets via composition of single-hop questions. Due to the bottom-up nature of our construction, we can identify and eliminate potential reasoning shortcuts such as disconnected reasoning and train-test leakage. Furthermore, the resulting dataset automatically comes with annotated decompositions, supporting paragraphs, and bridging entities for all questions. This allows for easy dataset augmentation and the development of models with the right inductive biases.

We build a new challenge dataset for multi-hop reasoning: MuSiQue, consisting of 2-4 hop questions with 6 reasoning graphs. We show our dataset is less cheatable and more challenging than prior multi-hop QA datasets. Due to the additional annotations, we are also able to create an even more challenging dataset, MuSiQue-Full, consisting of contrasting pairs of answerable and unanswerable questions with minor perturbations of the context. We use this dataset to show that each feature of our pipeline increases the hardness of the resulting dataset.

Extending our approach to more compositional operations, such as comparisons and discrete computations, is an interesting direction for future work. Developing stronger models in the future, by improving the accuracy of the question decomposition and of the single-hop question answering model, can reduce the human-machine gap on MuSiQue.

References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In NAACL-HLT.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. 2020. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of EMNLP.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.

Hady ElSahar, P. Vougiouklis, Arslen Remaci, C. Gravier, Jonathon S. Hare, F. Laforest, and E. Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In LREC.

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A dataset of incomplete information reading comprehension questions. In EMNLP.

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020. Evaluating NLP models via contrast sets. arXiv preprint arXiv:2004.02709.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. TACL.

Dirk Groeneveld, Tushar Khot, Mausam, and Ashish Sabharwal. 2020a. A simple yet strong pipeline for HotpotQA. In EMNLP.

Dirk Groeneveld, Tushar Khot, Ashish Sabharwal, et al. 2020b. A simple yet strong pipeline for HotpotQA. In EMNLP.

Xanh Ho, A. Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. In ICLR.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In EMNLP.

D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of EMNLP.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.

Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2021. Text modular networks: Learning to decompose tasks in the language of existing models. In NAACL.

T. Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, D. Epstein, Illia Polosukhin, J. Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc V. Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. TACL, 7:453–466.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In CoNLL.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.

Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2021. Question and answer test-train overlap in open-domain question answering datasets. In EACL.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.

Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. Compositional questions do not necessitate multi-hop reasoning. In ACL.

Sewon Min, Victor Zhong, Luke S. Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Multi-hop reading comprehension through question decomposition and rescoring. In ACL.

Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021. Unsupervised multi-hop question answering by question generation. In NAACL.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2020. KILT: A benchmark for knowledge intensive language tasks. arXiv:2009.02252.

Peng Qi, Haejun Lee, Oghenetegiri Sido, Christopher D. Manning, et al. 2020. Retrieve, read, rerank, then iterate: Answering open-domain questions of varying reasoning steps from text. arXiv preprint arXiv:2010.12527.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In NAACL.

Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, and Jonathan Berant. 2021. MultiModalQA: Complex question answering over text, tables and images. arXiv preprint arXiv:2104.06039.

H. Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2020. Is multihop QA in DiRe condition? Measuring and reducing disconnected reasoning. In EMNLP.

Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. 2020. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, pages 5998–6008.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL, 6:287–302.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv, abs/1910.03771.

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. TACL.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Zero-shot entity linking with dense entity retrieval. In EMNLP.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.

Ori Yoran, Alon Talmor, and Jonathan Berant. 2021. Turning tables: Generating examples from semi-structured tables for endowing language models with reasoning skills. arXiv preprint arXiv:2107.07261.

A Appendix

Figure 3 shows our annotation interface for question composition. Figure 4 shows our annotation interface for establishing human scores on MuSiQue-Ans, 2WikiMultihopQA, and HotpotQA. Table 9 shows the performance of various multi-hop models on MuSiQue, split by the number of hops required for the question.

Figure 3: The annotation interface used for MuSiQue question composition. Workers could see the decomposition graph and the passages associated with the subquestions.

Figure 4: The annotation interface used for comparing human scores on MuSiQue, HotpotQA, and 2WikiMultihopQA.

                                      2-hop            3-hop            4-hop
Dataset              Model            AnsF1 | SuppF1   AnsF1 | SuppF1   AnsF1 | SuppF1
MuSiQue-Ans          End2End          43.1  | 67.1     40.1  | 69.4     43.8  | 65.6
(metric m)           Select+Answer    52.0  | 72.2     42.7  | 75.2     41.2  | 67.2
                     Step Execution   56.1  | 80.0     43.7  | 77.1     28.2  | 66.0
MuSiQue-Full         End2End          23.2  | 26.4     19.7  | 26.5     18.1  | 19.1
(metric m+Suff)      Select+Answer    42.3  | 50.4     26.4  | 36.7     24.9  | 26.0
                     Step Execution   38.5  | 38.9     18.3  | 20.2      8.2  | 10.5

Table 9: Performance of various multi-hop models on questions with different numbers of hops.
