IIRC: A Dataset of Incomplete Information Reading Comprehension Questions

James Ferguson∗ (University of Washington), Matt Gardner (Allen Institute for AI), Hannaneh Hajishirzi (University of Washington, Allen Institute for AI), Tushar Khot (Allen Institute for AI), Pradeep Dasigi (Allen Institute for AI)

∗Work done as an intern at the Allen Institute for AI.

arXiv:2011.07127v1 [cs.CL] 13 Nov 2020

Abstract

Humans often have to read multiple documents to address their information needs. However, most existing reading comprehension (RC) tasks only focus on questions for which the contexts provide all the information required to answer them, thus not evaluating a system's performance at identifying a potential lack of sufficient information and locating sources for that information. To fill this gap, we present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents. The questions were written by crowd workers who did not have access to any of the linked documents, leading to questions that have little lexical overlap with the contexts where the answers appear. This process also gave many questions without answers, and many that require discrete reasoning, increasing the difficulty of the task. We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset, finding that it achieves 31.1% F1 on this task, while estimated human performance is 88.4%. The dataset, code for the baseline system, and a leaderboard can be found at https://allennlp.org/iirc.

[Figure 1: An example from IIRC. At the top is a context paragraph which provides only partial information required to answer the question. The bold spans in the context indicate links to other Wikipedia pages. The colored boxes below the question show snippets from four of these pages that provide the missing information for answering the question. The answer is the underlined span.]

1 Introduction

Humans often read text with the goal of obtaining information. Given that a single document is unlikely to contain all the information a reader might need, the reading process frequently involves identifying the information present in the given document, and what is missing, followed by locating a different source that could potentially contain the missing information. Most recent reading comprehension tasks, such as SQuAD 2.0 (Rajpurkar et al., 2018), DROP (Dua et al., 2019b), or Quoref (Dasigi et al., 2019), evaluate models using a relatively simpler setup where all the information required to answer the questions (including judging them as being unanswerable) is provided in the associated contexts. While this setup has led to significant advances in reading comprehension (Ran et al., 2019; Zhang et al., 2020), the tasks are still limited since they do not evaluate the capability of models at identifying precisely what information, if any, is missing to answer a question, and where that information might be found.

On the other hand, open-domain question answering tasks (Chen et al., 2017; Joshi et al., 2017; Dhingra et al., 2017) present a model with a question by itself, requiring the model to retrieve relevant information from some corpus. However, this approach loses grounding in a particular passage of text, and it has so far been challenging to collect diverse, complex questions in this setting.

Alternatively, complex questions grounded in context can be converted to open-domain or incomplete-information QA datasets such as HotpotQA (Yang et al., 2018). However, they do not capture the information-seeking questions that arise from reading a single document with partial information (Min et al., 2019b; Chen and Durrett, 2019).

We present a new dataset of incomplete information reading comprehension questions, IIRC, to address both of these limitations. IIRC is a crowdsourced dataset of 13441 questions over 5698 paragraphs from English Wikipedia, with most of the questions requiring information from one or more documents hyperlinked to the associated paragraphs, in addition to the original paragraphs themselves. Our crowdsourcing process (Section 2) ensures the questions are naturally information-seeking by decoupling question and answer collection pipelines. Crowd workers are instructed to ask follow-up questions after reading a paragraph, giving links to pages where they would expect to find the answer. This process results in questions like the one shown in Figure 1. As illustrated by the example, this setup results in questions requiring complex reasoning, with an estimated 39% of the questions in IIRC requiring discrete reasoning. Moreover, 30% of the questions in IIRC require more than one linked document in addition to the original paragraph, and 30% of them are unanswerable even given the additional information. When present, the answers are either extracted spans, boolean, or values resulting from numerical operations.

To evaluate the quality of the data, we run experiments with a modified version of NumNet+ (Ran et al., 2019), a state-of-the-art model from DROP (Dua et al., 2019b), chosen because a significant portion of questions in IIRC require numerical reasoning similar to that found in DROP. Because DROP uses only a single paragraph of context, we add a two-stage pipeline to retrieve necessary context for the model from the linked articles. The pipeline first identifies which links are pertinent, and then selects the most relevant passage from each of those links, concatenating them to serve as input for the model (Section 3).
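To make the shape of this two-stage pipeline concrete, the sketch below follows the same data flow (select pertinent links, select the most relevant passage from each, concatenate with the original paragraph) but with assumed, untrained components: plain token overlap stands in for both the link predictor and the passage scorer, and the function names, window size, and top-k cutoff are illustrative rather than taken from the authors' system, which uses trained models (Section 3).

from collections import Counter
from typing import Dict, List

def _overlap(a: str, b: str) -> int:
    """Count word tokens shared by two strings; a crude stand-in for a trained scorer."""
    return sum((Counter(a.lower().split()) & Counter(b.lower().split())).values())

def select_links(question: str, linked_articles: Dict[str, str], top_k: int = 2) -> List[str]:
    """Stage 1: choose the linked articles most likely to hold the missing information."""
    ranked = sorted(
        linked_articles,
        key=lambda title: _overlap(question, title + " " + linked_articles[title]),
        reverse=True,
    )
    return ranked[:top_k]

def select_passage(question: str, article_text: str, window: int = 50) -> str:
    """Stage 2: return the most question-relevant fixed-size window of a linked article."""
    tokens = article_text.split()
    windows = [" ".join(tokens[i:i + window]) for i in range(0, max(len(tokens), 1), window)]
    return max(windows, key=lambda w: _overlap(question, w))

def build_reader_input(question: str, paragraph: str, linked_articles: Dict[str, str]) -> str:
    """Concatenate the original paragraph with the retrieved snippets for the reader model."""
    snippets = [
        select_passage(question, linked_articles[title])
        for title in select_links(question, linked_articles)
    ]
    return " ".join([paragraph] + snippets)

In the actual system each stage is a learned model and the reader is the modified NumNet+, but the staging illustrated here (link selection, then passage selection, then concatenation) matches the description above.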
This baseline achieves an F1 score of 31.1% on IIRC, while the estimated human performance is 88.4% F1. Even giving the model oracle pipeline components results in a performance of only 70.3%. Taken together, these results show that substantial modeling is needed both to identify and retrieve missing information, and to combine the retrieved information to answer the question (Section 4). We additionally perform qualitative analysis of the data, and find that the errors of the baseline model are evenly split between retrieving incorrect information, identifying unanswerable questions, and successfully reasoning over the retrieved information.

By construction, all examples in IIRC require identifying missing information. Even though current model performance is quite low, a model trained on this data could theoretically leverage that fact to achieve artificially high performance on test data, because it does not have to first determine whether more information is needed. To account for this issue, we additionally sample questions from SQuAD 2.0 (Rajpurkar et al., 2018) and DROP (Dua et al., 2019b), which have similar question language to what is in IIRC, putting forward this kind of combined evaluation as a challenging benchmark for the community. Predictably, our baseline model performs substantially worse in this setting, reaching only 28% F1 on the IIRC portion of this combined evaluation (Section 5).

2 Building IIRC

We used Wikipedia to build IIRC and relied on the fact that entities in Wikipedia articles are linked to other articles about those entities, providing more information about them. Our goals were to build a dataset with naturally information-seeking questions anchored in paragraphs with incomplete information, such that identifying the location of missing information is non-trivial, and answering the questions would require complex cross-document reasoning.

We ensured that the questions are information-seeking by separating question and answer collection processes, and by not providing the question writers access to the contexts where the answers occur. This process also ensured that we get questions that have minimal lexical overlap with the answer contexts. We used Wikipedia paragraphs with many outgoing links to increase the difficulty of identifying the articles that provide the missing information. To ensure complex cross-document …

[Figure: two worked examples from IIRC. (1) "How many different directors did Prada work with in 1976 and 1977?", asked of a paragraph from the Jaya Prada article, with supporting snippets from the linked Seeta Kalyanam and Adavi Ramudu articles; answer type: Numeric, answer: 5 directors; link prediction: Hard, retrieval: Bridge, reasoning: Discrete-numeric. (2) "In what country did Bain attend the doctoral seminars of Wlad Godzich?", asked of a paragraph from the Thomas Bain article, with a supporting snippet from the linked University of Geneva article; answer type: Span, answer: Switzerland.]
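The paragraph selection described above, which favors Wikipedia paragraphs with many outgoing links, can be pictured as a small filter over wikitext. The sketch below is illustrative rather than the authors' collection code: the [[...]] link regex, the deduplication of targets, and the minimum-link threshold are all assumptions.

import re
from typing import List

# Internal Wikipedia links in wikitext look like [[Target]] or [[Target|display text]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def outgoing_links(wikitext: str) -> List[str]:
    """Return the targets of all internal links found in a paragraph of wikitext."""
    return [target.strip() for target in WIKILINK.findall(wikitext)]

def is_link_rich(wikitext: str, min_links: int = 5) -> bool:
    """Keep only paragraphs with many distinct outgoing links (threshold is an assumption)."""
    return len(set(outgoing_links(wikitext))) >= min_links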