MuSiQue: Multi-hop Questions via Single-hop Question Composition

Harsh Trivedi† Niranjan Balasubramanian† Tushar Khot‡ Ashish Sabharwal‡

† Stony Brook University, Stony Brook, U.S.A. {hjtrivedi,niranjan}@cs.stonybrook.edu ‡ Allen Institute for AI, Seattle, U.S.A. {tushark,ashishs}@allenai.org

Abstract

To build challenging multi-hop question answering datasets, we propose a bottom-up semi-automatic process of constructing multi-hop questions via composition of single-hop questions. Constructing multi-hop questions as compositions of single-hop questions allows us to exercise greater control over the quality of the resulting multi-hop questions. This process allows building a dataset with (i) connected reasoning where each step needs the answer from a previous step; (ii) minimal train-test leakage by eliminating even partial overlap of reasoning steps; (iii) a variable number of hops and composition structures; and (iv) contrasting unanswerable questions created by modifying the context. We use this process to construct a new multihop QA dataset: MuSiQue-Ans with 25K 2-4 hop questions using seed questions from 5 existing single-hop datasets. Our experiments demonstrate that MuSiQue-Ans is challenging for state-of-the-art QA models (e.g., human-machine gap of 30 F1 pts), significantly harder than existing datasets (2x human-machine gap), and substantially less cheatable (e.g., a single-hop model is worse by 30 F1 pts). We also build an even more challenging dataset, MuSiQue-Full, consisting of answerable and unanswerable contrast question pairs, where model performance drops further by 13+ F1 pts.¹

¹For data and code, see https://github.com/stonybrooknlp/musique.

[Figure 1: Our proposed approach to generate multi-hop questions by composing single-hop questions. Left: A question from MuSiQue ("What is the former name of the country where Atika Suri studied?", decomposed into Q1 "Where did Atika Suri study?" → Trisakti University, Q2 "Which country is A1 in?" → Indonesia, Q3 "What is the former name of A2?" → Dutch East Indies) that forces models to reason through all intended hops. Right: A potential question ("What is the former name of the country before it gained independence on 27 Dec 1949 where Atika Suri studied?") filtered out by our approach for not requiring connected reasoning. Notice that the question on the right can be answered using just Q3' without knowing the answers to the previous questions (there is only one mentioned country that gained independence on 27 Dec 1949).]

1 Introduction

Multi-hop question answering (QA) datasets are designed with the intent that models must connect information from multiple facts in order to answer each multi-hop question. However, when using the common method of crowdsourcing to create a multi-hop question given just a pair of documents as the starting point, the resulting datasets (e.g., HotpotQA (Yang et al., 2018)) often end up with unintended artifacts that allow models to achieve high scores (Min et al., 2019a; Chen and Durrett, 2019) without performing any reasoning that at least connects the intended facts together (Trivedi et al., 2020). This defeats the purpose of building multi-hop datasets.

To address this issue, we propose a new bottom-up approach (and a corresponding dataset called MuSiQue) for building multi-hop reading comprehension QA datasets via composition of single-hop questions from existing datasets. By carefully controlling the hops used in the questions, our approach ensures that each question requires the connected reasoning desired in multi-hop QA datasets.

Figure 1 illustrates two potential multi-hop questions that can be constructed using such a bottom-up approach. We create compositions of single-hop questions where each question must use the answer (often referred to as a bridging entity) from a previous question in the reasoning chain. While intuitive, this by itself is not sufficient to ensure multi-hop reasoning. For example, the question on the right can be answered without ever finding answers to Q1' or Q2'.
Even if a model does not know that A2' refers to the country Indonesia, there is only one country that is mentioned in the context as gaining independence on 27 Dec, 1949. Models can, therefore, answer the supposedly multi-hop question on the right using just the last question (i.e., be successful with single-hop reasoning). This is not the case with the question on the left (an actual example from our dataset), where every single-hop question would be almost impossible to answer confidently without knowing the bridging entity from the previous step.

Our proposed approach to build multi-hop datasets identifies the presence of such artifacts in the composition chain and filters them out, thereby reducing the cheatability of our dataset.

In addition to reducing such artifacts, our proposed approach also minimizes the potential of memorization by reducing train-test leakage at the level of each single-hop question. The approach allows creating questions with a varying number of hops by simply composing over additional single-hop questions.

We use this approach to build a new challenge multi-hop QA dataset, MuSiQue, consisting of 2-4 hop questions. We empirically demonstrate that our dataset has fewer artifacts and is more challenging than two commonly used prior multi-hop reasoning datasets, HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). We also show the benefits of the various features of our pipeline in terms of increasing the difficulty of the dataset. We find that MuSiQue is a promising testbed for approaches that rely on principled multi-hop reasoning, such as the Step Execution model we discuss. Lastly, by incorporating the notion of unanswerability or insufficient context (Rajpurkar et al., 2018; Trivedi et al., 2020), we also release a variant of our dataset, MuSiQue-Full, that is even more challenging and harder to cheat on.

In summary, we make two main contributions: (1) A new dataset construction approach for building challenging multi-hop reasoning QA datasets that operates via composition of single-hop questions, reduces artifacts, reduces train-test leakage, and is easier to annotate. (2) A new challenge dataset, MuSiQue, that is less cheatable, has a higher human-machine gap, and comes with reasoning graph decompositions that can be used to build stronger models (via data augmentation or as auxiliary supervision).

2 Multihop Reasoning Desiderata

Multihop question answering requires connecting and synthesizing information from multiple facts. Prior works, however, have shown that multihop reasoning datasets often have artifacts that allow models to achieve high scores while bypassing the need for connecting the information from multiple facts (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). This defeats the purpose of building multihop QA datasets, as we cannot reliably use them to measure progress in multihop reasoning.

What, then, are the desirable properties of questions that indeed test a model's multi-hop capabilities, and how can we create such questions? Ideally, a multihop question should necessitate meaningful synthesis of information from all (multiple) supporting facts. However, what meaningful synthesis really means is subjective and hard to objectively quantify. Trivedi et al. (2020) have argued that, at a minimum, multihop reasoning should necessitate connecting information from multiple (all) supporting facts, and proposed a formal condition (the DiRe condition) to check if disconnected reasoning is employed by a model on a dataset. While the DiRe condition allows one to probe existing models and datasets for undesirable reasoning, it does not provide an efficient way to construct new multihop QA datasets that necessitate connected reasoning.

In this work, we propose a method to create multi-hop reasoning datasets that require connected reasoning by first laying out desirable properties of a multi-hop question in terms of its decomposition, and then devising a dataset construction procedure to optimize for these properties.

Connected Question Hops. Consider the question Which city across the Charles River in Massachusetts was Facebook launched in?. This can be answered by two supporting facts in the Knowledge Source (KS):² (f1) Facebook is launched in Harvard University. (f2) Harvard University is in Cambridge city, which is across the Charles River in Massachusetts. That is, the question can be decomposed as (Q1) Which university was Facebook launched in? with answer A1, and (Q2) Which city across the Charles River in Massachusetts is A1 in?.

²We view KS as the fixed context text in the reading comprehension setting and a large corpus in the open-QA setting.

A proper way to find the answer to this question is to answer Q1, plug this answer into Q2 in place of A1, and answer Q2. However, in this example, Q2 is so specifically informative about the city of interest that it is possible to uniquely identify the answer to it even without considering A1 or using f1. In other words, the question is such that a model doesn't need to connect information from the two supporting facts.

To prevent such shortcut-based reasoning from being an effective strategy, we need questions and knowledge sources such that all hops of the multi-hop question are necessary to arrive at the answer. That is, the dataset shouldn't contain information that allows a model to arrive at the correct answer confidently while bypassing single-hop questions. This can be computationally checked by training a strong model M on the dataset distribution. To check whether the hops of the above 2-hop question are connected, we need to check whether the input question Q2 with entity A1 masked, along with the knowledge source KS, is answerable by M. If it is, then it is possible for M to answer the supposedly multi-hop question via disconnected reasoning. In general, for a strong model M, if a multi-hop question has any constituent question in its decomposition such that M can answer it correctly even after masking at least one of its bridge entities, we can conclude that M can cheat on the question via disconnected reasoning.

Formally, we associate each question Q with $G_Q$, a directed acyclic graph representing Q's decomposition into simpler subquestions $q_1, q_2, \ldots, q_n$, which form the nodes of $G_Q$. A directed edge $(q_j, q_i)$ for $j < i$ indicates that $q_i$ refers to the answer to some previous subquestion $q_j$ in the above subquestion sequence. $A_i$ is the (often singleton) set of valid answers to $q_i$. $A_n$ is the set of valid answers to Q. For edge $(q_j, q_i)$, we use $q_i^{m_j}$ to denote the subquestion formed from $q_i$ by masking out the mention of the answer from $q_j$. Similarly, $q_i^{m}$ denotes the subquestion with answers from all incoming edges masked out.

We say that all hops are necessary for a model M to answer Q if:

$$\forall (q_j, q_i) \in \mathrm{edges}(G_Q):\ M(q_i^{m_j}, KS) \notin A_i \tag{1}$$

Connected Question and Context. Although the above requirement ensures that each multi-hop question is connected via its constituent single-hop questions, it doesn't ensure that the single-hop questions are in fact meaningfully answered. This poses additional challenges, especially in the reading comprehension setting, where the input text is limited and often has limited or no distractors.

We want models to identify and meaningfully synthesize information from the constituent question and context to arrive at its answer. However, here again, it's difficult to quantify what meaningful synthesis really means. At the very least though, we know some connection between the constituent question and context should be required to arrive at the answer for the skill of reading comprehension. This means that, at the very least, we want multi-hop questions with constituent single-hop questions such that dropping the question or the context shouldn't leave enough information to identify the answer confidently. This can be tested with a strong model M that is trained on this dataset distribution.

We say that some question-to-context connection is necessary for a model M to answer Q if:

$$\forall i:\ M(q_i, \phi) \notin A_i \ \wedge\ M(\phi, KS) \notin A_i \tag{2}$$

Although this requirement might seem rather naïve, previous works have shown that RC datasets often have artifacts that allow models to predict the answer without the question or without the context (Kaushik and Lipton, 2018). Moreover, recent work has also shown that question and answer memorization coupled with train-test leakage can lead to high answer scores without any context (Lewis et al., 2021). As we will show later, previous multi-hop RC datasets can be cheated via such context-only and question-only shortcuts as well. Therefore, it is important for future multi-hop reading comprehension QA datasets to ensure that question and context are connected.

In summary, we want multi-hop reading comprehension questions that have desirable properties (1) and (2). If these two conditions hold for some strong model M, we say that the question satisfies the MuSiQue condition. Our dataset construction pipeline (Section 4) optimizes for these properties over a large space of multi-hop questions that can be composed from single-hop questions (Section 3).
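To make conditions (1) and (2) concrete, here is a minimal sketch, in Python, of how one might check them for a given decomposition and a black-box answer predictor. It is not the paper's released code; the data layout and the `model_answer` callable are illustrative assumptions.

```python
# Hypothetical sketch of the MuSiQue condition checks (Eqs. 1 and 2).
# `model_answer` stands in for a trained strong model M; it is an assumption,
# not an API from the paper's codebase.

def mask_mention(question: str, mention: str) -> str:
    """Replace a bridging-entity mention with a placeholder token."""
    return question.replace(mention, "[MASK]")

def satisfies_musique_condition(decomposition, context, model_answer) -> bool:
    """
    decomposition: list of subquestions, each a dict like
        {"question": str, "answers": set[str],
         "references": [(parent_idx, parent_answer_mention), ...]}
    context: the knowledge source / paragraphs (a single string here, for simplicity).
    model_answer: callable (question, context) -> predicted answer string.
    """
    for node in decomposition:
        # Condition (1): with any one incoming bridge entity masked, the
        # subquestion should NOT be answerable from the context alone.
        for _, mention in node["references"]:
            masked_q = mask_mention(node["question"], mention)
            if model_answer(masked_q, context) in node["answers"]:
                return False  # disconnected reasoning is possible

        # Condition (2): neither the question alone nor the context alone
        # should suffice to produce the answer.
        if model_answer(node["question"], "") in node["answers"]:
            return False
        if model_answer("", context) in node["answers"]:
            return False
    return True
```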

3 Multihop via Singlehop Composition

The information need of a multihop question can be decomposed into a directed acyclic graph (DAG) of constituent questions or operations (Wolfson et al., 2020). For example, the question "Which city was Facebook launched in?" can be decomposed into a 2-node DAG with nodes corresponding to "Which university was Facebook launched in?" (Harvard) and "Which city was #1 launched in?". The same process can also be reversed to compose a candidate multihop question from constituent single-hop questions. In the above example, the answer to the 1st question is an entity, Harvard, which also occurs as part of the second question, allowing the two questions to be composed together. More concretely, we have the following criterion:

Composability Criterion: Two single-hop question-answer tuples (q1, a1) and (q2, a2) are composable into a multi-hop question Q with a2 as a valid answer if a1 is an answer entity and it is mentioned in q2.

The process of composing candidate multi-hop questions can be chained together to form candidate reasoning graphs of various shapes and sizes. Conveniently, since the NLP community has constructed abundant human-annotated single-hop questions, we can leverage them directly to create multi-hop questions.

Furthermore, since single-hop reading comprehension questions come with an associated supporting paragraph or context, we can prepare the supporting context for a composed multihop question as the set of supporting paragraphs from its constituent questions. Additional distracting paragraphs can be retrieved from a large corpus of paragraphs.

Such ground-up and automatic construction of candidate multihop questions from existing single-hop questions gives us a programmatic approach to explore a large space of candidate multi-hop questions, which provides a unique advantage towards the goal of preventing shortcut-based reasoning. Previous works on making multihop QA datasets less cheatable have explored finding or creating better distractors to include in the context, while treating the questions as static (Min et al., 2019a; Jia and Liang, 2017). However, this may not be a good enough strategy, because if the subquestions are specific enough, even in an open domain there may not be any good distractor (Groeneveld et al., 2020a). Further, adding distractors found by specialized interventions may introduce new artifacts, allowing models to learn shortcuts again. Instead, creating multi-hop questions by composing single-hop questions provides us with greater control. Specifically, exploring a very large space of potential single-hop questions allows us to filter out those for which we can't find strong distractors.

4 Data Construction Pipeline

We design a dataset construction pipeline with the goal of creating multi-hop questions that satisfy the MuSiQue condition (i.e., equations (1) and (2)). The high-level schematic of the pipeline is shown in Figure 2.

We begin with a large set of reading comprehension single-hop questions S, with individual instances denoted (q_i, p_i, a_i), referring to the question, associated paragraph, and a valid answer respectively. These single-hop questions are run through the following steps.

S1. Find Good Single-Hop Questions

First, we filter out single-hop questions that:

Are close paraphrases: If two questions have the same normalized³ answer and their question words have an overlap of more than 70%, we assume them to be paraphrases and filter out one of them (a small sketch of this filter appears at the end of this subsection).

Likely have annotation errors: Annotation errors are often unavoidable in single-hop RC datasets. While a small percentage of such errors is not a huge problem, these errors can be amplified when multi-hop questions are created by composing single-hop questions – a multi-hop question would have an error if any constituent single-hop question has an error. For example, a dataset of 3-hop questions, created by composing single-hop questions with errors in 20% of the questions, would have errors in ∼50% of the multi-hop questions. To filter such errors without human intervention, we use a model-based approach. We generate 5-fold train, validation and test splits of the set. For each split, we train 5 strong models (2 random seeds of RoBERTa-large (Liu et al., 2019), 2 random seeds of Longformer-Large (Beltagy et al., 2020), and 1 UnifiedQA (Khashabi et al., 2020)) for the answer prediction task in the reading comprehension setting. We remove instances from the test folds where none of the models' predicted answers had any overlap with the labeled answer.

We also remove single-hop questions where the dataset comes with multiple ground-truth answers, or where the ground-truth answer isn't a substring of the associated context.

³remove special characters, articles, and lowercase
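The following sketch illustrates the close-paraphrase filter described above, under the assumption that the normalization of footnote 3 and a simple word-overlap measure suffice; the exact overlap computation used in the pipeline is not specified in the text, so this is only an approximation.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop articles and special characters (per footnote 3)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def word_overlap(q1: str, q2: str) -> float:
    """Fraction of shared question words (an assumed, simple overlap proxy)."""
    w1, w2 = set(normalize(q1).split()), set(normalize(q2).split())
    return len(w1 & w2) / max(1, min(len(w1), len(w2)))

def drop_close_paraphrases(questions):
    """questions: list of dicts {"question": str, "answer": str}.
    Keeps the first of any pair judged to be paraphrases."""
    kept = []
    for q in questions:
        is_duplicate = any(
            normalize(q["answer"]) == normalize(k["answer"])
            and word_overlap(q["question"], k["question"]) > 0.70
            for k in kept
        )
        if not is_duplicate:
            kept.append(q)
    return kept
```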

[Figure 2 diagram: the MuSiQue pipeline. All Single-Hop Questions (2017K) → (1) Find Good Single-Hop Questions → Good Single-Hop Questions (760K) → (2) Find Composable 2-Hop Questions → Composable 2-Hop Questions (12M) → (3) Filter to Connected 2-Hop Questions → Connected 2-Hop Questions (3.2M) → (4) Build Multihop Questions → Composed Multi (2-4) Hop Questions (78K) → Split Questions to Sets / Minimizing Train-Test Overlap (27K) → Crowdsource Question Compositions (25K) → Build Contexts for Questions (gold and insufficient contexts, 25K) → Add Unanswerable Questions (50K) → MuSiQue-Ans and MuSiQue-Full, split into Train / Dev / Test.]

Figure 2: MuSiQue construction pipeline. The MuSiQue pipeline takes single-hop questions from existing datasets, explores the space of multi-hop questions that can be composed from them, and generates a dataset of challenging multi-hop questions that are difficult to cheat on. The MuSiQue pipeline also makes unanswerable multi-hop questions that make the final dataset significantly more challenging.

Are not amenable to creating multi-hop questions: Since composing two questions (described next) requires the answer to the first question to be an entity, we remove questions whose answer is not a single entity⁴. We also remove outlier questions for which the context is too short (< 20 words) or too long (> 300 words).

We start with 2017K single-hop questions from 5 datasets (SQuAD (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019), MLQA⁵ (Lewis et al., 2019), T-REx (ElSahar et al., 2018), and Zero Shot RE (Levy et al., 2017)) and filter them down to 760K good single-hop questions using this pipeline.

S2. Find Composable 2-Hop Questions

We next find composable pairs of single-hop questions within this set. A pair of different single-hop questions (q1, p1, a1) and (q2, p2, a2) can be composed to form a 2-hop question (Q, {p1, p2}, a2) if a1 is an entity and is mentioned once in q2; here Q represents a composed question whose DAG G_Q has q1 and q2 as the nodes and (q1, q2) as the only edge.

To ensure that the two entity mentions (a1 and its occurrence in q2, denoted e2) refer to the same entity, we check the following:

1. Both a1 and e2 are marked as entities of the same type by an entity extractor⁶.
2. Normalized a1 and e2 are identical.
3. Querying the Wikipedia search API with a1 and e2 returns identical first results.
4. A SOTA wikification model (Wu et al., 2020) returns the same result for a1 and e2, given the contexts p1 + q1 and p2 + q2 respectively.

We found this process to be about 92% precise⁷ in identifying pairs of single-hop questions with a common entity.

To consider a pair of single-hop questions as composable, we additionally also check that: 1. a2 is not part of q1; and 2. p1 and p2 are not the same. Given our seed set of 760K questions, we are able to find 12M composable 2-hop pairs.

S3. Filter to Connected 2-Hop Questions

Next, we filter the composable 2-hop questions down to only those that are likely to be connected. That is, to answer the 2-hop question, it should be necessary to use and answer all constituent single-hop questions. We call this process disconnection filtering.

Going back to the MuSiQue condition (1, 2), for the 2-hop question to be connected, we need M(q1, φ) to not be a1 and M(q2^{m1}, C) to not be a2, where C is the context. This condition naturally gives us an opportunity to decompose the problem into 2 parts: (i) check if the first single-hop question (head node) is answerable without the context, and (ii) check if the second single-hop question (tail node) is answerable with the context and question, but with the mention of a1 in the question masked.

⁴extracted by spacy.io
⁵en-en subset of it
⁶we used spacy.io
⁷based on crowdsourced human evaluation
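Before moving to the filtering details, here is a simplified sketch of the S2 composability test described above. The `same_entity` callable abstracts over the four coreference heuristics (entity typing, normalization, Wikipedia search, and wikification), which need external tools and are not implemented here; all field names are illustrative.

```python
def mentions_once(text: str, entity: str) -> bool:
    """True if `entity` appears exactly once in `text` (case-insensitive)."""
    return text.lower().count(entity.lower()) == 1

def is_composable(q1, q2, same_entity) -> bool:
    """
    q1, q2: dicts {"question": str, "paragraph": str, "answer": str}.
    same_entity: callable implementing the four S2 coreference checks
                 (entity extractor, normalization, Wikipedia search,
                 wikification); treated here as a black box.
    Returns True if (q1 -> q2) can form a 2-hop question answered by q2.
    """
    a1, a2 = q1["answer"], q2["answer"]
    if not mentions_once(q2["question"], a1):       # a1 must bridge into q2
        return False
    if not same_entity(a1, q1, q2):                 # a1 and its mention co-refer
        return False
    if a2.lower() in q1["question"].lower():        # a2 must not leak into q1
        return False
    if q1["paragraph"] == q2["paragraph"]:          # need distinct paragraphs
        return False
    return True
```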

Filtering Head Nodes: We take all the questions that appear at least once as the head (q1) of a composable 2-hop question to create the set of head nodes. We then create 5-fold splits of the set, and train and generate predictions using multiple strong models (different random seeds). This way we have 5 answer predictions for each unique head question. We consider the head node acceptable only if the average AnsF1 is less than a threshold.

Filtering Tail Nodes: We create a unique set of masked single-hop questions that occur as a tail node (q2) in any composable 2-hop question. If the same single-hop question occurs in two 2-hop questions with different masked entities, both versions are added to the set. We then prepare a context for each question by taking the associated gold paragraph corresponding to that question and retrieving 9 distractor paragraphs using the question with the masked entity as the query. We then create 5-fold splits of the set, and train and predict the answer and supporting paragraph using multiple (different random seeds) strong Longformer⁸ models (Beltagy et al., 2020). This way we have 5 answer predictions for each unique tail question. We consider the tail node acceptable only if both AnsF1 and SuppF1 are less than a fixed threshold.

Finally, only those composable 2-hop questions are kept for which both the head and the tail node are acceptable. Starting from 12M 2-hop questions, this results in a set of 3.2M connected 2-hop questions.

S4. Build Multihop Questions

The set of connected 2-hop questions forms the directed edges of a graph. Any subset directed acyclic graph (DAG) of this graph can be used to create a connected multi-hop question. We enumerate 6 types of reasoning graphs (1 for 2-hop, 2 for 3-hop, and 3 for 4-hop), as shown in Table 2, and employ the following heuristics for curation.

To ensure diversity of the resulting questions and to also make the graph exploration computationally practical, we used two heuristics to control graph traversal: (i) the same bridging entity should not be used more than 100 times; (ii) the same single-hop question should not appear in more than 25 multi-hop questions. Furthermore, since we eventually want to create a comprehensible single multi-hop question from the multiple single-hop questions, we try to limit the total length of these questions. Each single-hop question shouldn't be more than 10 tokens long. The total length of the questions should not be more than 15 tokens for 2-hop and 3-hop questions, and not more than 20 tokens for 4-hop questions. Finally, we remove all 2-hop questions that occur as a subset of any of the 3-hop questions and remove all 3-hop questions that occur as a subset of any of the 4-hop questions.

S5. Split Questions to Sets

Given the recent findings of Lewis et al. (2021), we split the final set of questions into train, validation and test splits such that it is not possible to score high by memorization. We do this by ensuring there is no overlap (defined below) between the train and validation sets, or between the train and test sets. Additionally, we also ensure the overlap between validation and test is minimal.

We consider two multi-hop questions Qi and Qj to overlap if (i) any single-hop question is common between Qi and Qj, (ii) the answer to any single-hop question of Qi is also an answer to some single-hop question in Qj, or (iii) any associated paragraph of any of the single-hop questions is common between Qi and Qj. We start with 2 sets of multi-hop questions: initially, set-1 contains all questions and set-2 is empty. Then we greedily move to set-2 the question from set-1 that least overlaps with the rest of the set-1 questions. We do this until a fixed size of set-2 is reached. Then, we remove all remaining questions from set-1 which overlap with set-2. Finally, set-1 becomes the training set, and set-2 becomes the validation+test set, which is further split into validation and test sets with a similar procedure. We ensure the distribution of source datasets of the single-hop questions in the train, validation and test sets is similar, and also control the sizes of the 2-, 3- and 4-hop subsets.

⁸Because it can fit the long context of 10 paragraphs and has been shown to be competitive on HotpotQA in a similar setup.
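As an illustration of the S5 splitting procedure, the sketch below greedily builds the held-out set and then removes overlapping training questions. It ignores efficiency and the additional balancing constraints (source-dataset distribution, hop counts), and the field names are assumptions.

```python
def overlaps(qa, qb) -> bool:
    """Two multi-hop questions overlap if they share a single-hop question,
    a single-hop answer, or an associated paragraph (S5)."""
    return bool(
        qa["subquestions"] & qb["subquestions"]
        or qa["subanswers"] & qb["subanswers"]
        or qa["paragraphs"] & qb["paragraphs"]
    )

def greedy_split(questions, eval_size):
    """questions: list of dicts with `subquestions`, `subanswers`, and
    `paragraphs` fields, each a set.  Returns (train, eval) such that
    no training question overlaps with a held-out question."""
    pool = list(questions)
    eval_set = []
    while len(eval_set) < eval_size and pool:
        # Pick the question that overlaps with the fewest remaining ones.
        best = min(pool, key=lambda q: sum(overlaps(q, o) for o in pool if o is not q))
        pool.remove(best)
        eval_set.append(best)
    # Drop any remaining question that overlaps with the held-out set.
    train_set = [q for q in pool if not any(overlaps(q, e) for e in eval_set)]
    return train_set, eval_set
```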

S6. Build Contexts for Questions

For an n-hop question, the context is a set of paragraphs consisting of the paragraphs associated with the individual constituent subquestions (p1, p2, ..., pn), and additional distractor paragraphs retrieved from a corpus of paragraphs. The query used for retrieval is a plain concatenation of the subquestions with the answer mentions from all of their incoming edges masked, q1^m + q2^m + ... + qn^m.

To ensure that our distractors are not obvious, we retrieve them only from the set of gold context paragraphs associated with the initially filtered single-hop questions. As a result, it would be impossible to identify the relevant paragraphs without using the question. We will compare this strategy with the standard strategy of using full Wikipedia as a source of distractors in the experiments section.

Furthermore, we also ensure that memorizing the paragraphs from the training data can't help the model to select or eliminate paragraphs in the development or test set. For this, we first retrieve the top 100 paragraphs for each multi-hop question (using the concatenated query described above), then we enumerate over all non-supporting paragraphs of each question and randomly eliminate all occurrences of each such paragraph either from (i) the training set or (ii) the development and test sets. We combine the remaining retrieved paragraphs (in the order of their score) for each question with the set of supporting paragraphs to form a set of 20 paragraphs. These are then shuffled to form the context.

This strategy ensures that a given paragraph may occur as non-supporting in only one of the two: (i) the train set or (ii) the development and test sets. This limits memorization potential (as we show in our experiments).

S7. Crowdsource Question Compositions

We ask crowdworkers to compose short and coherent questions from our final question compositions (represented as DAGs), such that information from all single-hop questions is used and the answer to the composed question is the same as that of the last single-hop question. We also filter out incorrectly composed questions by asking the crowdworkers to verify whether the bridged entities refer to the same underlying entity. The workers can see the associated paragraphs corresponding to each single-hop question for this task. Our annotation interface can be viewed in Figure 3. Workers are encouraged to write short questions, but if a question is too long, they are allowed to split it into 2 sentences (see Table 2 for some examples).

We ran initial qualification rounds on Amazon MTurk for the task, in which 100 workers participated. Authors of the paper graded the coherency and correctness and selected the top 17 workers to generate the final dataset. The task was split into 9 batches and we gave regular feedback to the workers by email. We paid 25, 40, and 60 cents for 2-, 3- and 4-hop questions respectively, which amounted to about 15 USD per hour. The total cost of question writing was about 11K USD.

At this stage we have 24,814 reading-comprehension instances (19,938 train, 2,417 validation, 2,459 test), which we call MuSiQue-Ans.

S8. Add Unanswerable Questions

For each answerable multi-hop RC instance we create a corresponding unanswerable multi-hop RC instance using a procedure closely similar to the one proposed by Trivedi et al. (2020). For a multi-hop question, we randomly sample any one of its single-hop questions and make it unanswerable by ensuring the answer to that single-hop question doesn't appear in any of the paragraphs in the context. Since one of the single-hop questions is unanswerable given the context, the whole multi-hop question becomes unanswerable.

The process to build contexts for unanswerable questions is identical to that for the answerable ones, except it is adjusted to ensure the forbidden answer (from the single-hop question that is being made unanswerable) is never part of the context. First, we remove the supporting paragraphs of the multi-hop question which contain the forbidden answer. Second, we retrieve the top 100 paragraphs with the concatenated query, the same as for the answerable question, but additionally add a hard constraint to disallow the forbidden answer. From what remains, we apply the same filtering of paragraphs as explained for answerable questions, to ensure non-supporting paragraphs don't overlap. Finally, the remaining supporting paragraphs are combined with the top retrieved paragraphs to form a context of 20 unique paragraphs.

For the new task, the model needs to predict whether the question is answerable or not, and predict the answer and support if it is answerable. Given that the questions for the answerable and unanswerable sets are identical and the context changes only marginally, models that rely on shortcuts find this task extremely difficult.

Since we create one unanswerable question for each answerable question, we now have 49,628 reading-comprehension instances (39,876 train, 4,834 validation, 4,918 test), which we call MuSiQue-Full.
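A minimal sketch of the S8 contrast construction follows, assuming access to a retrieval function and per-instance fields for the supporting paragraphs and the concatenated retrieval query; these interfaces are illustrative, not the actual pipeline code.

```python
import random

def make_unanswerable(instance, retrieve, num_paragraphs=20):
    """
    instance: dict with "question", "subquestions" (each having an "answer"),
              "supporting_paragraphs", and "retrieval_query".
    retrieve: callable (query, k) -> list of paragraph strings; it stands in
              for the pipeline's retriever and is assumed to exist.
    """
    # Pick one single-hop step and forbid its answer from the context.
    forbidden = random.choice(instance["subquestions"])["answer"]

    # Keep only supporting paragraphs that do not reveal the forbidden answer.
    supporting = [p for p in instance["supporting_paragraphs"]
                  if forbidden.lower() not in p.lower()]

    # Retrieve distractors with the same concatenated (masked) query,
    # with a hard constraint against the forbidden answer.
    distractors = [p for p in retrieve(instance["retrieval_query"], 100)
                   if forbidden.lower() not in p.lower()]

    context = supporting + distractors[: num_paragraphs - len(supporting)]
    random.shuffle(context)
    return {"question": instance["question"],
            "context": context,
            "answerable": False}
```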

Final Dataset

The final dataset statistics for MuSiQue-Ans (MuSiQue-Full has twice the number of questions in each cell) are shown in Table 1. Multi-hop questions in MuSiQue are built from 21,020 unique single-hop questions, and contain 4,132 unique answers to multi-hop questions, 19,841 unique answers to single-hop questions, and 7,676 unique supporting paragraphs. MuSiQue has multi-hop questions of 6 types of reasoning graphs distributed across 2-4 hops. These types and examples are shown in Table 2.

            Train    Dev    Test
  2-hop     14376    1252   1271
  3-hop      4387     760    763
  4-hop      1175     405    425
  Total     19938    2417   2459

Table 1: Dataset statistics of MuSiQue-Ans (24,814 questions in total). MuSiQue-Full contains twice the number of questions in each category above – one answerable and one unanswerable.

5 Experimental Setup

This section describes the datasets, models, and human assessment used in our experiments, whose results are reported in Section 6.

5.1 Datasets

We create two versions of our dataset: MuSiQue-Ans is a set of 25K answerable questions, where the task is to predict the answer and supporting paragraphs. MuSiQue-Full is a set of 50K questions (25K answerable and 25K unanswerable), where the task is to predict whether the question is answerable or not, and if it is answerable, then predict the answer and supporting paragraphs.

We compare our dataset with two similar multi-hop RC datasets: HotpotQA (Yang et al., 2018) and 2WikiMultihopQA⁹ (Ho et al., 2020). We use the distractor setting of HotpotQA to compare with our reading-comprehension setting. Questions in HotpotQA are crowdsourced, and questions in 2Wiki are automatically generated based on rules and templates. Both datasets have 10 paragraphs as context. HotpotQA contains 2-hop questions with 2 supporting paragraphs each, while 2Wiki has 2-hop and 4-hop questions with 2 and 4 supporting paragraphs, respectively. Additionally, HotpotQA has sentence-level support information and 2Wiki has supporting chain information with entity-relation tuples, but we don't use this additional annotation in our evaluation for a fair comparison.

HotpotQA, 2Wiki, and MuSiQue-Ans have 90K, 167K, and 20K training instances, respectively. For a fair comparison, we use equal-sized training sets in all our experiments, obtained by randomly sampling 20K instances each from HotpotQA and 2Wiki, referred to as HotpotQA-20k and 2Wiki-20k, respectively.

We will use the following notation throughout this section. Instances in MuSiQue-Ans, HotpotQA, and 2Wiki are of the form (Q, C; A, Ps). Given a question Q along with a context C consisting of a set of paragraphs, the task is to predict the answer A and identify the supporting paragraphs Ps ∈ C. MuSiQue-Ans additionally has the DAG representation of the ground-truth decomposition G_Q (cf. Section 2), which models may leverage during training. Instances in MuSiQue-Full are of the form (Q, C; A, Ps, S), where there is an additional binary classification task to predict S, the answerability of Q based on C, also referred to as context sufficiency (Trivedi et al., 2020).

Metrics. For MuSiQue-Ans, HotpotQA, and 2Wiki, we report the standard F1-based metrics for answer (AnsF1) and support identification (SuppF1); see Yang et al. (2018) for details. All 3 datasets have paragraph-level support annotation, but they do not all have the same further fine-grained support annotation (the reasoning graph, supporting sentences, or evidence tuples for the three datasets, respectively). To make a fair comparison, we use only paragraph-level support F1 across all datasets.

For MuSiQue-Full, we follow Trivedi et al. (2020) to combine the context sufficiency prediction S with AnsF1 and SuppF1, denoted as AnsF1+Suff and SuppF1+Suff. Instances in MuSiQue-Full occur in pairs and are also evaluated in pairs. Specifically, for each Q with a sufficient context C, there is a paired instance with Q and an insufficient context C′. For AnsF1+Suff, if a model incorrectly predicts context sufficiency (yes or no) for either of the instances in a pair, it gets 0 pts on that pair. Otherwise, it gets the same AnsF1 score on the pair as it gets on the answerable instance in the pair. Scores are averaged across all pairs of instances in the dataset. Likewise for SuppF1+Suff.

⁹For brevity, we use 2Wiki to refer to 2WikiMultihopQA.
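To make the paired evaluation concrete, here is a small sketch of AnsF1+Suff computed over (answerable, unanswerable) instance pairs; it assumes per-instance AnsF1 values are already computed and uses illustrative field names (SuppF1+Suff is analogous).

```python
def ans_f1_plus_suff(pairs):
    """
    pairs: list of dicts, one per (answerable, unanswerable) instance pair:
        {
          "pred_suff_answerable": bool,   # model says context is sufficient
          "pred_suff_unanswerable": bool,
          "ans_f1_answerable": float,     # AnsF1 on the answerable instance
        }
    A pair scores 0 unless sufficiency is predicted correctly on BOTH
    instances (True for the answerable one, False for the unanswerable one);
    otherwise it scores the answerable instance's AnsF1.
    """
    scores = []
    for p in pairs:
        correct_suff = p["pred_suff_answerable"] and not p["pred_suff_unanswerable"]
        scores.append(p["ans_f1_answerable"] if correct_suff else 0.0)
    return sum(scores) / max(1, len(scores))
```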

[2-hop graph]
Question: Who was the grandfather of David Goodhart? Answer: Arthur Lehman Goodhart
Decomposition: 1. Who was the male parent of David Goodhart? → Philip Goodhart. 2. Who's Philip Goodhart's father? → Arthur Lehman Goodhart.
Supporting snippets: 1. "... one of seven children ... to Philip Goodhart." 2. "Philip Carter Goodhart ... son of Arthur Lehman Goodhart."

[3-hop graph]
Question: What currency is used where Billy Giles died? Answer: pound sterling
Decomposition: 1. At what location did Billy Giles die? → Belfast. 2. What part of the United Kingdom is Belfast located in? → Northern Ireland. 3. What is the unit of currency in Northern Ireland? → pound sterling.
Supporting snippets: 1. "Billy Giles (..., Belfast – 25 September 1998, Belfast)". 2. "... thirty-six public houses ... Belfast, Northern Ireland." 3. "bank for pound sterling, issuing ... in Northern Ireland."

[3-hop graph]
Question: When was the first establishment that McDonaldization is named after, open in the country Horndean is located? Answer: 1974
Decomposition: 1. What is McDonaldization named after? → McDonald's. 2. Which state is Horndean located in? → England. 3. When did the first McDonald's open in England? → 1974.
Supporting snippets: 1. "... spreading of McDonald's restaurants ... 'McDonaldization'". 2. "... Horndean is a village ... in Hampshire, England." 3. "... first McDonald's in the United Kingdom ... in 1974."

[4-hop graph]
Question: When did Napoleon occupy the city where the mother of the woman who brought Louis XVI style to the court died? Answer: 1805
Decomposition: 1. Who brought Louis XVI style to the court? → Marie Antoinette. 2. Who's the mother of Marie Antoinette? → Maria Theresa. 3. In what city did Maria Theresa die? → Vienna. 4. When did Napoleon occupy Vienna? → 1805.
Supporting snippets: 1. "Marie Antoinette, ... brought the 'Louis XVI' style to court". 2. "Maria Antonia of Austria, youngest daughter of ... Maria Theresa". 3. "Maria Theresa ... in Vienna ... after the death". 4. "occupation of Vienna by Napoleon's troops in 1805".

[4-hop graph]
Question: How many Germans live in the colonial holding in Aruba's continent that was governed by Prazeres's country? Answer: 5 million
Decomposition: 1. What continent is Aruba in? → South America. 2. What country is Prazeres in? → Portugal. 3. The colonial holding in South America governed by Portugal? → Brazil. 4. How many Germans live in Brazil? → 5 million.
Supporting snippets: 1. "... Aruba, lived including indigenous peoples of South America". 2. "Prazeres is ... in municipality of Lisbon, Portugal." 3. "Portugal, ... desire for independence amongst Brazilians." 4. "Brazil ... 5 million people claiming German ancestry."

[4-hop graph]
Question: When did the people who captured Malakoff come to the region where Philipsburg is located? Answer: 1625
Decomposition: 1. What is Philipsburg capital of? → Saint Martin. 2. Saint Martin (French part) is located on what terrain feature? → Caribbean. 3. Who captured Malakoff? → French. 4. When did the French come to the Caribbean? → 1625.
Supporting snippets: 1. "Philipsburg ... capital of ... Saint Martin". 2. "... airport on the Caribbean island of Saint Martin/Sint Maarten." 3. "... the capture of the Malakoff by the French". 4. "French trader ... sailed to ... Caribbean in 1625 ... French settlement on ...".
Table 2: Examples of the 6 different reasoning graph shapes in MuSiQue.

5.2 Models

All our models are Transformer-based (Vaswani et al., 2017) pretrained language models (Devlin et al., 2019), implemented using PyTorch (Paszke et al., 2019), HuggingFace Transformers (Wolf et al., 2019) and AllenNLP (Gardner et al., 2017). We experiment with 2 kinds of models: (1) Standard Multi-hop Models, which receive both Q and C as input, are in principle capable of employing the desired or expected reasoning, and have demonstrated competitive performance on previous multi-hop QA datasets. These models help probe the extent to which a dataset can be solved by current models. (2) Artifact-based Models, which are restricted in some way that prohibits them from doing the desired or expected reasoning, as we will discuss shortly. These models help probe the extent to which a dataset can be cheated.

5.2.1 Multi-hop Models

We describe how these models work for our datasets, MuSiQue-Ans and MuSiQue-Full. For HotpotQA and 2Wiki, they operate similarly to MuSiQue-Ans.

End2End Model. This model takes (Q, C) as input and predicts (A, Ps) as the output for MuSiQue-Ans and (A, Ps, S) for MuSiQue-Full. Answer prediction is implemented as span prediction using a transformer, similar to Devlin et al. (2019). Support prediction is implemented by adding special [PP] tokens at the beginning of each paragraph and supervising them with a binary cross-entropy loss, similar to Beltagy et al. (2020). The binary classification for answerability or context sufficiency is done via the CLS token of the transformer architecture, trained with a cross-entropy loss. We use Longformer (Beltagy et al., 2020), which is one of the few transformer architectures able to fit the full context.

Select+Answer Model. This model, inspired by Quark (Groeneveld et al., 2020b) and SAE (Tu et al., 2020), breaks the process into two parts. First, a context selector ranks and selects the K most relevant paragraphs C_K from C.¹⁰ Second, an answerer generates the answer and supporting paragraphs based only on C_K. Both components are trained individually, as follows. The selector is designed to rank the supporting paragraphs Ps ∈ C the highest based on Q and C. Given (Q, C) as input, it scores every P ∈ C and is trained with the cross-entropy loss. We form C_K using the K paragraphs it scores the highest. The answerer is trained to take (Q, C_K) as input, and predict (A, Ps) as the output for MuSiQue-Ans and (A, Ps, S) for MuSiQue-Full. We implement the selector using RoBERTa-large (Liu et al., 2019), and the answerer using Longformer-Large.

Step Execution Model. Similar to prior decompositional approaches (Talmor and Berant, 2018; Min et al., 2019b; Qi et al., 2020; Khot et al., 2021), this model performs explicit, step-by-step multi-hop reasoning, by first predicting a decomposition of the input question Q into a DAG G_Q containing single-hop questions, and then using repeated calls to a single-hop model to execute this decomposition as discussed below.

The question decomposer is trained on the ground-truth decomposition annotations available in MuSiQue-Ans, and is implemented with BART-large.

The answerer takes C and the predicted DAG G_Q as input, and outputs (A, Ps) for MuSiQue-Ans and (A, Ps, S) for MuSiQue-Full. It does this with repeated calls to a single-hop model Ms while traversing G_Q in a topological sort order (q1, q2, ..., qn), as follows. By definition of topological sort, q1 has no incoming edges and hence does not refer to the answer to any other single-hop question. The step execution model applies Ms to q1 in order to predict an answer a1. Then, for every edge (q1, qi), it substitutes the reference in qi to the answer of q1 with the predicted answer a1, thereby removing this cross-question reference in qi. This process is repeated for q2, q3, ..., qn in this order, and the predicted answer to qn is reported as the final answer.

The single-hop model Ms (which we implement as an End2End model) is trained on only single-hop instances—taking (qi, C) as input, and producing (Ai, Pi) or (Ai, Pi, Si) as the output. Here Pi refers to the singleton supporting paragraph for qi and Si refers to whether C is sufficient to answer qi. For MuSiQue-Full, the answerer predicts the multi-hop question as having sufficient context if Ms predicts all subquestions in the above process to have sufficient context.

We experiment with this model only on MuSiQue, since HotpotQA and 2Wiki don't have decompositions that are executable using Ms.

¹⁰K is a hyperparameter, chosen from {3, 5, 7}.
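The step-by-step execution loop described above can be summarized by the following sketch. It assumes the predicted decomposition is already in topological order and uses "#j" placeholders for references to earlier answers (as in the Section 3 example); `single_hop_answer` stands in for the trained single-hop model Ms and is an illustrative interface, not the released code.

```python
def execute_decomposition(subquestions, context, single_hop_answer):
    """
    subquestions: list of question strings in topological order, where a
        reference to the answer of step j is written as the placeholder "#j"
        (1-indexed), e.g. "Which country is #1 in?".
    single_hop_answer: callable (question, context) -> answer string,
        standing in for the trained single-hop model Ms.
    Returns the answer predicted for the last subquestion.
    """
    answers = []
    for q in subquestions:
        # Substitute all earlier predicted answers into this subquestion,
        # removing its cross-question references.
        for j, a in enumerate(answers, start=1):
            q = q.replace(f"#{j}", a)
        answers.append(single_hop_answer(q, context))
    return answers[-1]
```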
5.2.2 Artifact-based Models

To probe weaknesses of the datasets, we consider three models whose input, by design, is insufficient to allow them to perform the desired reasoning.

The Question-Only Model takes only Q as input (no C) and generates A as the output. We implement this with BART-large (Lewis et al., 2020).

The Context-Only Model takes only C as input (no Q) and predicts (A, Ps) as the output. We implement this with an End2End Longformer-Large model where the empty string is used as Q.

Finally, our Single-Paragraph Model, similar to those proposed by Min et al. (2019a) and Chen and Durrett (2019), is almost the same as the Select+Answer model with K=1. Instead of training the selector to rank all of Ps the highest, we train it to rank any paragraph containing the answer string A the highest. The answerer then takes as input one selected paragraph p ∈ Ps and predicts an answer to Q based solely on p. Note that this model doesn't have access to the full supporting information, as all considered datasets have at least two supporting paragraphs per question.
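For concreteness, the three artifact-based probes differ only in which part of an instance they are allowed to see. The sketch below derives their restricted inputs from a full RC instance; the field names are illustrative, and the paragraph choice for the Single-Paragraph probe mirrors how its selector is trained (any paragraph containing the answer string).

```python
def artifact_model_inputs(instance):
    """
    instance: dict with "question" (str), "paragraphs" (list[str]),
              and "answer" (str).
    Returns the restricted (question, paragraphs) inputs given to each probe.
    """
    question, paragraphs = instance["question"], instance["paragraphs"]

    # Question-Only: no context at all.
    question_only = (question, [])

    # Context-Only: full context, empty question string.
    context_only = ("", paragraphs)

    # Single-Paragraph: one paragraph that contains the answer string,
    # paired with the question.
    answer = instance["answer"].lower()
    candidates = [p for p in paragraphs if answer in p.lower()]
    single_paragraph = (question, candidates[:1])

    return {"question_only": question_only,
            "context_only": context_only,
            "single_paragraph": single_paragraph}
```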

5.3 Human Performance

We perform a randomized human experiment to establish comparable and fair human performance on HotpotQA, 2Wiki, and MuSiQue-Ans. We sample 125 questions from each dataset, combine them into a single set, shuffle this set, and obtain 5 annotations of answer span and supporting paragraphs for each instance. Our interface lists all context paragraphs and makes them easily searchable and sortable with interactive text-overlap-based search queries. The interface includes a tutorial with 3 examples of question, supporting paragraphs, and answer from each dataset.

We crowdsourced this task on Amazon Mechanical Turk.¹¹ For the qualification round (25 × 3 × 5 annotations), we allowed only workers with the master qualification and selected workers who had more than 75 AnsF1 and more than 75 SuppF1 on all datasets. Only these 5 workers were allowed in the rest of the annotation.

We should note that 2Wiki did not report human scores, and HotpotQA reported human scores that aren't a fair comparison to models. This is because humans in HotpotQA were shown the question and only the two correct supporting paragraphs to answer the question, whereas models are expected to reason over the full context (including 8 additional distracting paragraphs). In our setup, we put the same requirements on humans as on models, making a human vs. model comparison more accurate. Moreover, since we shuffled our instances and used the same pool of workers, we can also compare human scores across the 3 datasets.

We compute two human performance scores: Human Majority – the most common annotated answer and support (in case of a tie, one is chosen at random), and Human Upper Bound – the answer and support prediction that maximizes the score. The human scores reported in HotpotQA are based on the Human Upper Bound.

6 Empirical Findings

We now discuss our empirical findings, demonstrating that MuSiQue is a challenging multi-hop dataset that is harder to cheat on than existing datasets (Section 6.1), that the steps in the construction pipeline of MuSiQue are individually valuable (Section 6.2), and that the dataset serves as a useful testbed for pursuing interesting multi-hop reasoning research (Section 6.3). All our reported results are based on the development sets of the respective datasets.

6.1 MuSiQue is a Challenging Dataset

We first show that, compared to the other two datasets considered (HotpotQA and 2Wiki), both variants of MuSiQue are less cheatable via shortcuts and have a larger human-to-machine gap.

Higher Human-Machine Gap. As shown in the top two sections of Table 3, MuSiQue-Ans has a significantly higher human-model gap than the other datasets, for both answer and supporting paragraph identification.¹² In fact, for both the other datasets, supporting paragraph identification has even surpassed the human majority score, whereas for MuSiQue-Ans there is a 17 point gap. Additionally, MuSiQue-Ans has a ∼29 pt gap in answer F1, whereas HotpotQA and 2Wiki have gaps of only 10 and 5, respectively.

Note that the human-model gap decreases further for HotpotQA and 2Wiki if we use their full training datasets. Likewise for MuSiQue-Ans, as we'll show in Section 6.3, adding auto-generated augmentation data also reduces the human-model gap. However, since these alterations result in datasets of different sizes and quality, they do not provide a useful comparison point.

Lower Cheatability. As seen in the bottom section of Table 3, the performance of artifact-based models (from Section 5.2.2) is significantly higher on HotpotQA and 2Wiki than on MuSiQue-Ans. This shows that MuSiQue is significantly less cheatable via shortcut-based reasoning.

Answer identification can be done very well on HotpotQA and 2Wiki with the Single-Paragraph model. In particular, support identification in both datasets can be done to a very high degree (67.6 and 92.0 F1) even without the question. One might argue that the comparison of paragraph identification is unfair since HotpotQA and 2Wiki have 10 paragraphs as context, while MuSiQue-Ans has 20. However, as we will discuss in the ablations later (Table 6), even with 10 paragraphs, MuSiQue-Ans is significantly less cheatable. Overall, we find that both the answer and supporting paragraph identification tasks in MuSiQue-Ans are less cheatable via disconnected reasoning.

MuSiQue-Full: Even More Challenging. Table 4 compares the performance of models on MuSiQue-Ans vs. MuSiQue-Full, where the latter is obtained by adding unanswerable questions to the former.

¹¹https://www.mturk.com
¹²Table 9 in the Appendix shows the performance of multi-hop models split by the number of hops required by the question in MuSiQue (2, 3, or 4 hops).

                                 HotpotQA-20K       2Wiki-20K          MuSiQue-Ans
              Model              AnsF1 | SuppF1     AnsF1 | SuppF1     AnsF1 | SuppF1
Humans        Majority            83.6 | 92.0        84.1 | 99.0        76.6 | 93.0
              UpperBound          91.8 | 96.0        89.0 | 100         88.6 | 96.0
Multi-hop     End2End             72.9 | 94.3        72.9 | 97.6        42.3 | 67.6
Models        Select+Answer       74.9 | 94.6        79.5 | 99.0        47.3 | 72.3
              Step Execution         — | —              — | —           47.5 | 76.8
Artifact-     Single-Paragraph    64.8 | —           60.1 | —           32.0 | —
Based         Context-Only        18.4 | 67.6        50.1 | 92.0         3.4 | 0.0
Models        Question-Only       19.6 | —           27.0 | —            4.6 | —

Table 3: MuSiQue has a substantially higher human-model gap than the other datasets, as shown by the results in the top two sections of the table. Further, MuSiQue is less cheatable compared to the other datasets, as evidenced by lower performance of artifact-based models (bottom section of the table).

                                 MuSiQue-Ans         MuSiQue-Full
              Model              AnsF1 | SuppF1      AnsF1+Suff | SuppF1+Suff
Multi-hop     End2End             42.3 | 67.6              22.0 | 25.2
Models        Select+Answer       47.3 | 72.3              34.4 | 42.0
              Step Execution      47.5 | 76.8              27.8 | 28.3
Artifact-     Single-Paragraph    32.0 | —                  2.4 | —
based         Context-Only         3.4 | 0.0                1.0 | 0.8
Models        Question-Only        4.6 | —                  0.7 | —

Table 4: MuSiQue-Full is harder and less cheatable than MuSiQue-Ans, as evidenced by the multi-hop models and artifact-based models sections of the table, respectively. Note that MuSiQue-Full uses a stricter metric that also checks for correct context sufficiency prediction (“Suff”) and operates over pairs of highly related instances.

The results demonstrate that MuSiQue-Full is significantly more difficult and less cheatable than MuSiQue-Ans. Intuitively, because the answerable and unanswerable instances are very similar but have different labels, it is difficult for models to do well on both instances if they learn to rely on shortcuts (Kaushik et al., 2019; Gardner et al., 2020). As we see, all artifact-based models barely get any AnsF1+Suff or SuppF1+Suff score. For all multi-hop models too, AnsF1 drops by 13-20 pts and SuppF1 by 30-48 pts.

6.2 Dataset Construction Steps are Valuable

Next, we show that three key steps of our dataset construction pipeline (Section 4) are valuable.

Disconnection Filter (step 3). To understand the effect of the Disconnection Filter in our dataset construction pipeline, we do an ablation study by skipping the step of filtering composable 2-hop questions down to connected 2-hop questions; see Figure 2. Since we don't have human-generated composed questions for these additional questions, we resort to a seq2seq BART-large model that is trained (using MuSiQue) to take as input two composable questions and generate as output a composed question.

As shown in Table 5, the Disconnection Filter is crucial for increasing the difficulty and decreasing the cheatability of the final dataset. Specifically, without this filter, we see that both multihop and artifact-based models perform significantly better on the resulting datasets.

Reduced Train-Test Leakage (step 5). Similar to the above ablation, we assess the value of using our careful train-test splits based on a clear separation of constituent single-hop subquestions, their answers, and their supporting paragraphs across splits (Step 5 in Figure 2). Note that our bottom-up construction pipeline is what enables such a split. To perform this assessment, we create a dataset the traditional way, with a purely random partition into train, validation, and test splits. For uniformity, we ensure the distribution of 2-4 hop questions in the development set of the resulting dataset from both ablated pipelines remains the same as in the original development set.

Table 5 shows that without a careful train/test split, the dataset is highly solvable by current models (AnsF1=90.1). Importantly, we see that most of this high score can also be achieved by the single-paragraph (AnsF1=85.1) and context-only (AnsF1=70.1) models, revealing the high cheatability of such a split.

                              Single-Paragraph Model   Context-Only Model   End2End
                              AnsF1 | SuppF1           AnsF1 | SuppF1       AnsF1 | SuppF1
Full Pipeline (F)              32.0 | —                  3.4 | 0.0           42.3 | 67.6
F \ unmemorizable splits       85.1 | —                 70.1 | 49.8          90.1 | 85.5
F \ disconnection filter       52.7 | —                  6.3 | 36.8          60.8 | 72.5

Table 5: Key components of the MuSiQue construction pipeline are crucial for its difficulty and lower cheatability.

Context Type     Retrieval Corpus        Single-Paragraph Model   Context-Only Model   End2End
                                         AnsF1 | SuppF1           AnsF1 | SuppF1       AnsF1 | SuppF1
No Distractors   None                     49.7 | —                 17.0 | 100           70.1 | 100
10 Para          Full Wikipedia           42.5 | —                 12.5 | 77.7          57.2 | 87.6
10 Para          Positive Distractors     28.0 | —                  5.5 | 34.6          54.1 | 80.2
20 Para          Full Wikipedia           41.7 | —                 12.4 | 66.4          50.3 | 80.8
20 Para          Positive Distractors     32.0 | —                  3.4 | 0.0           42.3 | 67.6

Table 6: Positive distractors are more effective than using full Wikipedia for choosing distractors, as evidenced by the lower scores of models. The effect of using positive distractors is more pronounced when combined with the use of 20 (rather than 10) distractor paragraphs.

Harder Distractors (step 7). To understand the effect of distractors on the difficulty and cheatability of the dataset, we construct 5 variations. Three of them capture the effect of the number of distractors: (i) no distractors, (ii) 10 paragraphs, and (iii) 20 paragraphs; and two of them capture the effect of the source of distractors: (i) full Wikipedia,¹³ and (ii) gold context paragraphs from the good single-hop questions (Stage 1 of Figure 2). We refer to the latter setting as positive distractors, as these paragraphs are likely to appear as supporting (positive) paragraphs in our final dataset.

The results are shown in Table 6. First, we observe that using positive distractors instead of full Wikipedia significantly worsens the performance of all models. In particular, it makes identifying supporting paragraphs without the question extremely difficult. This difficulty also percolates to the Single-Paragraph and End2End models. We have shown in Table 3 that Context-Only models are able to identify supporting paragraphs in HotpotQA and 2Wiki to a very high degree (67.6 and 92.0 SuppF1) even without the question. This would have also been true for MuSiQue-Ans (66.4 SuppF1) had we used Wikipedia as the corpus for identifying distractors, like HotpotQA and 2Wiki. This result suggests that it is necessary to be careful about the corpus to search over when selecting distractors, and to ensure there is no distributional shift that a powerful pretrained model can exploit to bypass reasoning.

Second, we observe that using 20 paragraphs instead of 10 makes the dataset more difficult and less cheatable. Interestingly, this effect is significantly more pronounced if we use positive distractors, indicating the synergy between these two approaches to create more challenging distractors.

6.3 A Testbed for Interesting Approaches

Right Inductive Bias Helps MuSiQue. We've shown that artifact-based models do not perform well on MuSiQue. Next we ask, can a model that has an explicit inductive bias towards connected reasoning outperform black-box models on MuSiQue? For this, we compare the End2End model, which doesn't have any explicit architectural bias towards doing multi-step reasoning, to the Step Execution model, which can only take explicit reasoning steps to arrive at the answer and support.

As shown in Table 7, the Step Execution model outperforms the End2End model by ∼5 F1 points on both MuSiQue-Ans and MuSiQue-Full. Moreover, using the oracle decompositions further improves the score of the Step Execution model by 5-6 pts on both datasets.

¹³We used the Wikipedia corpus from Petroni et al. (2020).
The End2End model, however, actually performs a few points worse when it is provided with the oracle decomposition instead of the composed question. This shows that models that can exploit the decomposition structure (and our associated annotations) can potentially outperform these naive black-box models on our dataset, leading to the development of more interesting and possibly more interpretable models.

                                           MuSiQue-Ans         MuSiQue-Full
                                           AnsF1 | SuppF1      AnsF1+Suff | SuppF1+Suff
End2End                                     42.3 | 67.6              22.0 | 25.2
End2End w/ Oracle Decomposition             37.3 | 64.3              19.5 | 23.0
Step Execution                              47.5 | 76.8              27.8 | 28.3
Step Execution w/ Oracle Decomposition      53.9 | 82.7              32.9 | 32.8

Table 7: The right inductive bias helps on MuSiQue.
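For readers unfamiliar with the +Suff columns in Table 7, the snippet below is a minimal sketch of the idea behind scoring answerable/unanswerable contrast pairs as a group. The field names and the exact scoring rule are illustrative assumptions; the precise definition of the +Suff metrics may differ from this sketch.

```python
def grouped_suff_score(pairs, answer_f1):
    """Sketch of group-wise scoring over answerable/unanswerable pairs.

    Each element of `pairs` holds the model's prediction on an answerable
    instance and on its unanswerable counterpart. In this sketch, the model
    earns its answer F1 on the answerable instance only if it also judges
    context sufficiency correctly on both members of the pair; otherwise
    the pair contributes 0.
    """
    total = 0.0
    for ans, unans in pairs:
        suff_correct = (ans["predicted_sufficient"] is True
                        and unans["predicted_sufficient"] is False)
        if suff_correct:
            total += answer_f1(ans["predicted_answer"], ans["gold_answer"])
    return 100.0 * total / max(1, len(pairs))
```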

Training Data                             MuSiQue-Ans       MuSiQue-Full
                                          AnsF1 | SuppF1    AnsF1+Suff | SuppF1+Suff
Filtered MultiHop                         42.3  | 67.6      22.0 | 25.2
Filtered MultiHop + Filtered SingleHop    45.0  | 70.0      23.1 | 23.6
Filtered MultiHop + Unfiltered MultiHop   52.1  | 67.7      33.9 | 38.2

Table 8: Effect of augmenting the training data of MuSiQue. The three rows (top to bottom) have 20K, 34.5K, and 70.5K training instances for the answerable setting (left columns), respectively; the unanswerable setting (right columns) has twice as many instances.

Additional Data Augmentation Helps MuSiQue. Our dataset construction pipeline allows us to generate additional training augmentation data for MuSiQue. We explore two strategies.

Adding Filtered Single-hop Questions. We take the set of unique constituent single-hop questions from the training sets of MuSiQue-Ans and MuSiQue-Full, apply the same context-building procedure (Step 7, but for a single question), and add the resulting RC instances to the training instances of MuSiQue-Ans and MuSiQue-Full, respectively. The validation and test sets of MuSiQue-Ans and MuSiQue-Full remain unchanged.

Adding Unfiltered Multi-hop Questions. We create a new set of multi-hop questions through our dataset construction pipeline for additional data augmentation. Since we have already exhausted the questions that pass the disconnection filter, we choose to skip this filter in favor of larger training data. Additionally, since this is a very large set, we use a model to generate the question compositions instead of relying on humans; specifically, we use a BART-large seq2seq model trained on MuSiQue for this purpose.

We find that both augmentation strategies help improve scores on MuSiQue by ∼10 AnsF1 points (Table 8). However, this is still far from human-achievable scores, with a gap of ∼25 AnsF1 and SuppF1 points.
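As a concrete illustration of the second strategy, the snippet below sketches how a trained seq2seq model could compose a chain of single-hop questions into a multi-hop question. The checkpoint path and the input convention (questions joined by a separator, with "#k" answer placeholders) are assumptions for this sketch; the paper fine-tunes BART-large on MuSiQue's human-written compositions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical checkpoint path; substitute a BART-large model fine-tuned on
# MuSiQue's human-written question compositions.
MODEL_PATH = "path/to/bart-large-musique-composer"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)

def compose_question(single_hop_questions):
    """Generate one multi-hop question from a chain of single-hop questions."""
    # Assumed input convention: constituent questions joined by a separator.
    source = " ;; ".join(single_hop_questions)
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=4, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```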

7 Related Work

Multihop QA. MuSiQue-Ans is closest to HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). HotpotQA was prepared by having crowdworkers write a question given two paragraphs, with distractor paragraphs added post-hoc. We believe that, because the questions were written by crowdworkers without considering the difficulty of compositions and distractors, this dataset has been shown to be solvable to a large extent without multi-hop reasoning (Min et al., 2019a; Trivedi et al., 2020). 2WikiMultihopQA (Ho et al., 2020) was generated automatically using structured information from Wikidata and Wikipedia and a fixed set of human-authored rule templates. While the compositions in this dataset could be challenging, they were limited to a few rule-based templates and were also selected in a model-independent fashion. As shown in our experiments, both these datasets (in comparable settings) are significantly more cheatable and have a smaller human-model gap.

Qi et al. (2020) also build a multi-hop QA dataset with a variable number of hops but only use it for evaluation. Other multi-hop RC datasets focus on other challenges such as narrative understanding (Khashabi et al., 2018), discrete reasoning (Dua et al., 2019), multiple modalities (Chen et al., 2020; Talmor et al., 2021), open-domain QA (Geva et al., 2021; Khot et al., 2020; Yang et al., 2018; Talmor and Berant, 2018; Mihaylov et al., 2018), and relation extraction (Welbl et al., 2018). While we do not focus on these challenges, we believe it should be possible to extend our idea to these settings.

Unanswerability in QA. The idea of using unanswerable questions to ensure robust reasoning has been considered before in both the single-hop (Rajpurkar et al., 2018) and multi-hop (Ferguson et al., 2020; Trivedi et al., 2020) settings. Within the multi-hop setting, the IIRC dataset (Ferguson et al., 2020) focuses on open-domain QA, where unanswerable questions are identified based on human annotations of questions for which relevant knowledge could not be retrieved from Wikipedia. Our work is more similar to Trivedi et al. (2020) in that we modify the context of an answerable question to make it unanswerable. This allows us to create counterfactual pairs (Kaushik et al., 2019; Gardner et al., 2020) of answerable and unanswerable questions that can be evaluated as a group (Gardner et al., 2020; Trivedi et al., 2020) to measure the true reasoning capabilities of a model.

The transformation described in Trivedi et al. (2020) also removes supporting paragraphs to create unanswerable questions, but relies on the dataset annotations being complete; it is possible that the context contains other supporting paragraphs that were not annotated. By creating questions in our bottom-up fashion, where we know even the bridging entities, we can eliminate any potential supporting paragraph by removing all paragraphs containing the bridging entity.

Question Decomposition and Composition. Existing multi-hop QA datasets have been decomposed into simpler questions (Min et al., 2019b; Talmor and Berant, 2018) and special meaning representations such as QDMR (Wolfson et al., 2020). Due to the nature of our construction, our dataset naturally provides the question decomposition for each question. This can enable the development of more interpretable models with the right inductive biases, such as DecompRC (Min et al., 2019b) and ModularQA (Khot et al., 2021).

Similar to our approach, recent work (Pan et al., 2021; Yoran et al., 2021) has also used bottom-up approaches to build multi-hop QA datasets. However, these approaches used rule-based methods to create the composed question, with the primary goal of data augmentation. The generated datasets themselves are not challenging and have only been shown to improve performance on the downstream datasets targeted by their rule-based compositions. Evaluating the impact of MuSiQue on other multi-hop QA datasets is an interesting avenue for future work.

8 Conclusion

We present a new pipeline to construct challenging multi-hop QA datasets via composition of single-hop questions. Due to the bottom-up nature of our construction, we can identify and eliminate potential reasoning shortcuts such as disconnected reasoning and train-test leakage. Furthermore, the resulting dataset automatically comes with annotated decompositions, supporting paragraphs, and bridging entities for all questions. This allows for easy dataset augmentation and the development of models with the right inductive biases.

We build a new challenge dataset for multi-hop reasoning: MuSiQue, consisting of 2-4 hop questions with 6 reasoning graphs. We show our dataset is less cheatable and more challenging than prior multi-hop QA datasets. Due to the additional annotations, we are also able to create an even more challenging dataset, MuSiQue-Full, consisting of contrasting pairs of answerable and unanswerable questions with minor perturbations of the context. We use this dataset to show that each feature of our pipeline increases the hardness of the resulting dataset.

Extending our approach to more compositional operations, such as comparisons and discrete computations, is an interesting direction for future work. Developing stronger models in the future, by improving the accuracy of the question decomposition and of the single-hop question answering model, can reduce the human-machine gap on MuSiQue.

References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In NAACL-HLT.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. 2020. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of EMNLP.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.

Hady ElSahar, P. Vougiouklis, Arslen Remaci, C. Gravier, Jonathon S. Hare, F. Laforest, and E. Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In LREC.

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A dataset of incomplete information reading comprehension questions. In EMNLP.

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020. Evaluating NLP models via contrast sets. arXiv preprint arXiv:2004.02709.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. TACL.

Dirk Groeneveld, Tushar Khot, Mausam, and Ashish Sabharwal. 2020a. A simple yet strong pipeline for HotpotQA. In EMNLP.

Dirk Groeneveld, Tushar Khot, Ashish Sabharwal, et al. 2020b. A simple yet strong pipeline for HotpotQA. In EMNLP.

Xanh Ho, A. Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. In ICLR.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In EMNLP.

D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of EMNLP.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.

Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2021. Text modular networks: Learning to decompose tasks in the language of existing models. In NAACL.

T. Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, D. Epstein, Illia Polosukhin, J. Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc V. Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. TACL, 7:453–466.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In CoNLL.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.

Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2021. Question and answer test-train overlap in open-domain question answering datasets. In EACL.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.

Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. Compositional questions do not necessitate multi-hop reasoning. In ACL.

Sewon Min, Victor Zhong, Luke S. Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Multi-hop reading comprehension through question decomposition and rescoring. In ACL.

Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021. Unsupervised multi-hop question answering by question generation. In NAACL.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2020. KILT: A benchmark for knowledge intensive language tasks. arXiv:2009.02252.

Peng Qi, Haejun Lee, Oghenetegiri Sido, Christopher D. Manning, et al. 2020. Retrieve, read, rerank, then iterate: Answering open-domain questions of varying reasoning steps from text. arXiv preprint arXiv:2010.12527.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In NAACL.

Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, and Jonathan Berant. 2021. MultiModalQA: Complex question answering over text, tables and images. arXiv preprint arXiv:2104.06039.

H. Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2020. Is multihop QA in DiRe condition? Measuring and reducing disconnected reasoning. In EMNLP.

Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. 2020. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, pages 5998–6008.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL, 6:287–302.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv, abs/1910.03771.

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. TACL.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Zero-shot entity linking with dense entity retrieval. In EMNLP.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.

Ori Yoran, Alon Talmor, and Jonathan Berant. 2021. Turning tables: Generating examples from semi-structured tables for endowing language models with reasoning skills. arXiv preprint arXiv:2107.07261.

A Appendix

Figure 3 shows our annotation interface for question composition. Figure 4 shows our annotation interface for establishing human scores on MuSiQue-Ans, 2WikiMultihopQA, and HotpotQA. Table 9 shows the performance of various multi-hop models on MuSiQue, split by the number of hops required for the question.

Figure 3: The annotation interface used for MuSiQue question composition. Workers could see the decomposition graph and the passages associated with the subquestions.

Figure 4: The annotation interface used for comparing human scores on MuSiQue, HotpotQA, and 2WikiMultihopQA.

                                      2-hop            3-hop            4-hop
Dataset              Model            AnsF1 | SuppF1   AnsF1 | SuppF1   AnsF1 | SuppF1
MuSiQue-Ans          End2End          43.1  | 67.1     40.1  | 69.4     43.8  | 65.6
(metric m)           Select+Answer    52.0  | 72.2     42.7  | 75.2     41.2  | 67.2
                     Step Execution   56.1  | 80.0     43.7  | 77.1     28.2  | 66.0
MuSiQue-Full         End2End          23.2  | 26.4     19.7  | 26.5     18.1  | 19.1
(metric m+Suff)      Select+Answer    42.3  | 50.4     26.4  | 36.7     24.9  | 26.0
                     Step Execution   38.5  | 38.9     18.3  | 20.2      8.2  | 10.5

Table 9: Performance of various multi-hop models on questions with different numbers of hops.
