arXiv:2010.12527v3 [cs.CL] 16 Apr 2021

Retrieve, Read, Rerank, then Iterate: Answering Open-Domain Questions of Varying Reasoning Steps from Text

Peng Qi*♠♥, Haejun Lee*|, Oghenetegiri “TG” Sido*♠, Christopher D. Manning♠
♠ Computer Science Department, Stanford University; ♥ JD AI Research; | Samsung Research
{pengqi, osido, manning}@cs.stanford.edu, [email protected]
* These authors contributed equally.

Abstract

We develop a unified system to answer open-domain questions directly from text, where each question may require a varying number of retrieval steps. We employ a single multi-task transformer model to perform all the necessary subtasks (retrieving supporting facts, reranking them, and predicting the answer from all retrieved documents) in an iterative fashion. Unlike previous work, we avoid crucial assumptions that do not transfer well to real-world settings, including exploiting knowledge of the fixed number of retrieval steps required to answer each question, or using structured metadata like knowledge bases or web links that have limited availability. Instead, we design a system that would answer open-domain questions on any text collection without prior knowledge of reasoning complexity. To emulate this setting, we construct a new benchmark by combining existing one- and two-step datasets with a new collection of 203 questions that require three Wikipedia pages to answer, unifying Wikipedia corpora versions in the process. We show that our model demonstrates competitive performance on both existing benchmarks and this new benchmark.

1 Introduction

Using knowledge to solve problems is a hallmark of intelligence, and open-domain question answering (QA) is an important means for intelligent systems to make use of the knowledge in large text collections. With the help of large-scale datasets based on Wikipedia (Rajpurkar et al., 2016, 2018) and other large corpora (Trischler et al., 2016; Dunn et al., 2017; Talmor and Berant, 2018), the research community has made substantial progress on tackling this problem in recent years, notably in the direction of complex reasoning over multiple pieces of evidence, or multi-hop reasoning (Yang et al., 2018; Welbl et al., 2018; Chen et al., 2020).

Despite this success, most previous systems are developed with, and evaluated on, datasets that contain exclusively single-hop questions (ones that require a single document or paragraph to answer) or multi-hop ones. As a result, their design is often tailored exclusively to single-hop (e.g., Chen et al., 2017; Wang et al., 2018b) or multi-hop questions (e.g., Nie et al., 2019; Min et al., 2019; Feldman and El-Yaniv, 2019; Zhao et al., 2020a; Xiong et al., 2021); even when the model is designed to work with both, it is often trained and evaluated on exclusively single-hop or multi-hop settings (e.g., Asai et al., 2020). In practice, not only can we not expect open-domain QA systems to receive exclusively single- or multi-hop questions from users, but it is also non-trivial to judge reliably a priori whether a question requires one or multiple pieces of evidence to answer. For instance, “In which U.S. state was Facebook founded?” appears to be single-hop, but its answer cannot be found in the main text of a single English Wikipedia page.

Besides the impractical assumption about reasoning hops, previous work often also assumes access to non-textual metadata such as knowledge bases, entity linking, and Wikipedia hyperlinks when retrieving supporting facts, especially in answering complex questions (Nie et al., 2019; Feldman and El-Yaniv, 2019; Zhao et al., 2019; Asai et al., 2020; Dhingra et al., 2020; Zhao et al., 2020a). While this information is helpful, it is not always available in text collections we might be interested in getting answers from, such as news or academic research articles, besides being labor-intensive and time-consuming to collect and maintain. It is therefore desirable to design a system that is capable of extracting knowledge from text without using such metadata, to make maximal use of the knowledge available to us in the form of text.

To address these limitations, we propose Iterative Retriever, Reader, and Reranker (IRRR), which features a single neural network model that performs all of the subtasks required to answer questions from a large collection of text (see Figure 1). IRRR is designed to leverage off-the-shelf information retrieval systems by generating natural language search queries, which allows it to easily adapt to arbitrary collections of text without requiring well-tuned neural retrieval systems or extra metadata. This further allows users to understand and control IRRR, if necessary, to facilitate trust. Moreover, IRRR iteratively retrieves more context to answer the question, which allows it to easily accommodate questions that require different numbers of reasoning steps.

Figure 1: The IRRR question answering pipeline answers a complex question in the HotpotQA dataset by iteratively retrieving, reading, and reranking paragraphs from Wikipedia. In this example, the question (“The Ingerophrynus gollum is named after a character in a book that sold how many copies?”) is answered in five steps: 1. the retriever model selects the words “Ingerophrynus gollum” from the question as an initial search query; 2. the question answering model attempts to answer the question by combining the question with each of the retrieved paragraphs and fails to find an answer; 3. the reranker picks the paragraph about the Ingerophrynus gollum toad to extend the reasoning path; 4. the retriever generates an updated query “Lord of the Rings” to retrieve new paragraphs; 5. the reader correctly predicts the answer “150 million copies” by combining the reasoning path (question + “Ingerophrynus gollum”) with the newly retrieved paragraph about “The Lord of the Rings”. (In the diagram, the loop is labeled “Repeat N times until the answer found is confident enough”.)

To evaluate the performance of open-domain QA systems in a more realistic setting, we construct a new benchmark by combining the questions from the single-hop SQuAD Open (Rajpurkar et al., 2016; Chen et al., 2017) and the two-hop HotpotQA (Yang et al., 2018) with a new collection of 203 human-annotated questions that require information from three Wikipedia pages to answer. We map all questions to a unified version of the English Wikipedia to reduce stylistic differences that might provide statistical shortcuts to models. We show that IRRR not only achieves competitive performance with state-of-the-art models on the original SQuAD Open and HotpotQA datasets, but also establishes a strong baseline for this new dataset.

To recap, our contributions in this paper are: (1) a single unified neural network model that performs all essential subtasks in open-domain QA purely from text (retrieval, reranking, and reading comprehension) and achieves strong results on SQuAD and HotpotQA; (2) a new open-domain QA benchmark that features questions of different levels of complexity on a unified, up-to-date version of Wikipedia, with newly annotated questions that require at least three hops of reasoning, on which our proposed model serves as a strong baseline.¹

¹ We will release our code and models upon acceptance.

2 Open-Domain Question Answering

The task of open-domain question answering is concerned with finding the answer a to a question q from a large text collection D. Successful solutions to this task usually involve two crucial components: an information retrieval system that finds a small set of relevant documents D_r from D, and a reading comprehension system that extracts the answer from it. Chen et al. (2017) presented one of the first neural-network-based approaches to this problem, which was later extended by Wang et al. (2018a) with a reranking system that further reduces the amount of context the reading comprehension component has to consider, in order to improve answer accuracy.

More recently, Yang et al. (2018) showed that this single-step retrieve-and-read approach to open-domain question answering is inadequate for more complex questions that require multiple pieces of evidence to answer (e.g., “What is the population of Mark Twain’s hometown?”). Later work demonstrated that this can be resolved by retrieving supporting facts beyond a single step, but many approaches are tailored to this task by leveraging Wikipedia hyperlinks (Nie et al., 2019; Asai et al., 2020) or explicitly modeling fixed reasoning steps (Qi et al., 2019; Min et al., 2019).

Figure 2: The overall architecture of our IRRR model, which uses a shared Transformer encoder to perform all subtasks of open-domain question answering. (Diagram labels: the input sequence “[CLS] q [SEP] title0 [CONT] ctx0 [SEP] …” is fed to a Transformer encoder; the Retriever (Query Generator) makes token-wise binary predictions, the Reader performs 4-way classification (NOANSWER / Span / Yes / No) and span prediction (start, end), and the Reranker uses a rerank classifier.)

However, most previous work assumes that all questions are either exclusively single-hop or multi-hop during training and evaluation, even when the model itself is not heavily tailored towards one or the other. This limits their applicability in real-world applications where the retrieval difficulty of questions cannot be determined ahead of time. We propose IRRR, a system that performs variable-hop retrieval for open-domain QA, and a new benchmark to evaluate systems in a more realistic setting.
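To make the iterative procedure of Figure 1 concrete, the following is a minimal Python sketch of a retrieve-read-rerank loop. It is an illustration under stated assumptions, not the authors' implementation: the function `iterate`, the `ReasoningPath` class, the two-paragraph `TOY_CORPUS`, and all `toy_*` components are hypothetical stand-ins (simple string matching instead of the shared Transformer model and a real search engine), and the loop stops at the first non-NOANSWER prediction rather than using a learned confidence criterion.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Paragraph = Tuple[str, str]  # (title, text); a hypothetical representation
NOANSWER = "NOANSWER"


@dataclass
class ReasoningPath:
    """The question plus the paragraphs selected so far."""
    question: str
    paragraphs: List[Paragraph] = field(default_factory=list)


def iterate(question: str,
            generate_query: Callable[[ReasoningPath], str],
            search: Callable[[str], List[Paragraph]],
            read: Callable[[ReasoningPath, Paragraph], str],
            rerank: Callable[[ReasoningPath, List[Paragraph]], Paragraph],
            max_steps: int = 3) -> Tuple[str, List[str], int]:
    """Retrieve, read, rerank, and repeat until an answer is found."""
    path = ReasoningPath(question)
    step = 0
    for step in range(1, max_steps + 1):
        # Retrieve: generate a text query from the current reasoning path.
        candidates = search(generate_query(path))
        # Read: try to answer from the path combined with each candidate.
        for para in candidates:
            answer = read(path, para)
            if answer != NOANSWER:
                return answer, [t for t, _ in path.paragraphs + [para]], step
        # Rerank: extend the path with the best not-yet-used paragraph.
        fresh = [p for p in candidates if p not in path.paragraphs]
        if not fresh:
            break
        path.paragraphs.append(rerank(path, fresh))
    return NOANSWER, [t for t, _ in path.paragraphs], step


# Toy stand-ins below; the corpus text is made up for illustration only.
TOY_CORPUS: List[Paragraph] = [
    ("Ingerophrynus gollum",
     "Ingerophrynus gollum is a toad named after Gollum from The Lord of the Rings."),
    ("The Lord of the Rings",
     "The Lord of the Rings is a novel that has sold 150 million copies."),
]


def toy_generate_query(path: ReasoningPath) -> str:
    # Pick the first unseen corpus title mentioned in the question or path.
    seen = {t for t, _ in path.paragraphs}
    context = " ".join([path.question] + [x for _, x in path.paragraphs]).lower()
    for title, _ in TOY_CORPUS:
        if title not in seen and title.lower() in context:
            return title
    return path.question


def toy_search(query: str) -> List[Paragraph]:
    # Substring match in place of an off-the-shelf retrieval system.
    q = query.lower()
    return [p for p in TOY_CORPUS if q in f"{p[0]} {p[1]}".lower()]


def toy_read(path: ReasoningPath, para: Paragraph) -> str:
    # A pattern match in place of a reading comprehension model.
    match = re.search(r"\d+ million copies", para[1])
    return match.group(0) if match else NOANSWER


def toy_rerank(path: ReasoningPath, candidates: List[Paragraph]) -> Paragraph:
    # Prefer the paragraph sharing the most words with the question.
    q_words = set(path.question.lower().split())
    return max(candidates, key=lambda p: len(q_words & set(p[1].lower().split())))


QUESTION = ("The Ingerophrynus gollum is named after a character "
            "in a book that sold how many copies?")
answer, titles, steps = iterate(QUESTION, toy_generate_query, toy_search,
                                toy_read, toy_rerank)
print(answer, titles, steps)
```

Because the number of iterations is decided at run time, the same loop handles a question answerable from the first retrieved paragraph and one that needs several hops, which is the property the text above argues for.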
