Arxiv:2011.13137V2 [Cs.CL] 30 May 2021 the Question That Leads to Each Answer Should Also 1 Introduction Be Provided to Clarify Each Interpretation

Answering Ambiguous Questions through Generative Evidence Fusion and Round-Trip Prediction Yifan Gaoy∗, Henghui Zhuz, Patrick Ngz, Cicero Nogueira dos Santosz, Zhiguo Wangz, Feng Nanz, Dejiao Zhangz, Ramesh Nallapatiz, Andrew O. Arnoldz, Bing Xiangz zAWS AI y The Chinese University of Hong Kong [email protected] zfhenghui,patricng,cicnog,[email protected] zfnanfen,dejiaoz,rnallapa,anarnld,[email protected] Abstract Prompt Question (Google search query): What’s the most points scored in an NBA game? In open-domain question answering, questions Disambiguated QA Pairs: Q : What’s the most points scored in an NBA game by are highly likely to be ambiguous because 1 combined team? / A1: 370 users may not know the scope of relevant top- Q2: What’s the most points scored in an NBA game by a ics when formulating them. Therefore, a sys- single team? / A2: 186 tem needs to find possible interpretations of Q3: What’s the most points scored in an NBA game by an the question, and predict one or multiple plau- individual? / A3: 100 Relevant Wikipedia Page 1: The highest-scoring regular sible answers. When multiple plausible an- season game is the triple-overtime game between ... the swers are found, the system should rewrite the two teams combined to score 370 points, with the pistons question for each answer to resolve the ambi- defeating the nuggets 186–184 ... guity. In this paper, we present a model that Relevant Wikipedia Page 2: Wilt Chamberlain scored an aggregates and combines evidence from multi- nba-record 100 points ... ple passages to adaptively predict a single answer or a set of question-answer pairs for am- Figure 1: An example from the AMBIGQA (Min et al., biguous questions. In addition, we propose a 2020) dataset. The Prompt Question is gathered from novel round-trip prediction approach to itera- Google search queries and has three interpretations tively generate additional interpretations that upon reading Wikipedia. Disambiguated QA Pairs our model fails to find in the first pass, and are the full set of acceptable answers, paired with the then verify and filter out the incorrect question- disambiguated rewriting of the prompt question. answer pairs to arrive at the final disambiguated output. Our model, named REFUEL, achieves a new state-of-the-art performance points scored in an NBA game?” is ambiguous on the AMBIGQA dataset, and shows com- because the score in this question could be inter- petitive performance on NQ-OPEN and Trivi- preted as the combined score in a game (Q1A1), aQA. The proposed round-trip prediction is a score from a single team (Q2A2), or score from model-agnostic general approach for answer- an individual player (Q3A3). Therefore, a system ing ambiguous open-domain questions, which needs to adaptively predict a single answer, or a set improves our REFUEL as well as several base- of equally plausible answers when the question has line models. We release source code for our models and experiments at https://github. multiple interpretations. When a set of multiple com/amzn/refuel-open-domain-qa. answers is predicted, an unambiguous rewriting of arXiv:2011.13137v2 [cs.CL] 30 May 2021 the question that leads to each answer should also 1 Introduction be provided to clarify each interpretation. Min et al.(2020) decompose this problem into Open-domain Question Answering (QA) is the task two subtasks. Given the prompt question and of answering questions using a collection of pas- Wikipedia passages, the first subtask, Answer Pre- sages with diverse topics (Chen et al., 2017; Guu diction, consists in predicting one or several plau- et al., 2020; Karpukhin et al., 2020). Open-domain sible answers, depending on whether this question questions are highly likely to be ambiguous be- is ambiguous or not. If multiple answers are pre- cause people may not have the knowledge of rele- dicted, the second subtask, Question Disambigua- vant topics when formulating them. For example, tion, requires generating a disambiguated question “What’s the most in Figure1, the prompt question for each of the plausible answers. They propose ∗Work done during an internship at AWS AI. SPANSEQGEN, which first retrieves and reranks passages using the prompt question, and then uously feed the generated questions into REFUEL adopts a BART pre-trained sequence-to-sequence until there are no new answers predicted from our model (Lewis et al., 2020a) to generate all plausi- model. While this round-trip prediction can im- ble answers, conditioned on the concatenation of prove the recall of answers, we refine the quality the prompt question and top 8 passages. For the of predicted QA pairs by filtering them with the question disambiguation subtask, based on BART, conditional probability of the answers estimated by they first pre-train a question generation model on an answer-generation model. NQ-OPEN (Kwiatkowski et al., 2019), a large-scale Our REFUEL achieves a new state-of-the-art on open-domain QA dataset, to generate the question the AMBIGQA dataset, outperforming the previ- given the answer and top 8 passages. Then they ous best model SPANSEQGEN by 9.1% in answer fine-tune it as a question disambiguation model to prediction F1 and 4.4% in Edit-F1 score for ques- generate the disambiguated question conditioned tion disambiguation. When directly doing infer- on the prompt question, answer, and passages. ence on NQ-OPEN and TriviaQA, REFUEL not only predicts the single answer precisely but also There are three main drawbacks to SPANSE- finds multiple interpretations if the question is am- QGEN. Firstly, a complete coverage of all rele- biguous. Moreover, human evaluation shows that vant passages is essential for predicting all plau- REFUEL can correctly generate more QA pairs on sible answers of the ambiguous question. How- all three datasets. Finally, the proposed round-trip ever, SPANSEQGEN only takes 8 passages for an- prediction is a model-agnostic general approach for swer prediction so some of the most informative answering ambiguous questions, which improves passages might be excluded. Secondly, for the our REFUEL as well as several baseline models up question disambiguation subtask, there is a mis- to 3.7% for the overall performance. match between question generation pre-training The main contributions of this work, which are on NQ-OPEN and question disambiguation fine- fundamental to significantly push the state-of-the- tuning on AMBIGQA – there is no question to art in answering ambiguous questions, can be sum- disambiguate in question generation pre-training, marized as follows: which makes the pre-training task somewhat mis- aligned with fine-tuning. Thirdly, SPANSEQGEN 1. We present an evidence aggregation approach predicts a much smaller average number of answers that can effectively use a large number of pas- compared to the ground truth data (1.17 vs. 2.19). sages to uncover more candidate interpretations of the ambiguous question. To address these issues, we propose REFUEL, 2. We propose a token-deletion pre-training task Round-trip Evidence FUsion via gEneration with to reduce the mismatch between pre-training retrievaL, a new framework for answering ambigu- and fine-tuning for question disambiguation. ous open-domain questions. To ensure a broad The insertion-based weighted loss further coverage of relevant knowledge of the question, helps to capture answer-relevant constraints. REFUEL reads 12 times more passages (100 in our 3. We propose a round-trip prediction approach experiments) than SPANSEQGEN by using Fusion- to find more interpretations missed in the first in-Decoder (Izacard and Grave, 2020) that pro- prediction pass, which we further refine us- cesses each passage individually in the encoder, ing a conditional-probability-based filtering and then fused their encodings together in the de- approach. coder. For the question disambiguation subtask, we propose a token-deletion pre-training task to trans- 2 REFUEL form NQ-OPEN into an “ambiguous” QA setting REFUEL answers questions through a three-step by randomly deleting an informative span for each process illustrated in Figure2: question. Thus, pre-training and fine-tuning tasks are well aligned. Additionally, we add an insertion- 1. The Passage Retrieval & Reranking module based weighted loss to emphasize the newly in- retrieves question-relevant passages from the serted tokens in the disambiguated question, which whole Wikipedia corpus. Then the retrieved helps the model on learning to resolve the ambi- passages are further reranked (Sec. 2.1). guity. Finally, we propose a round-trip prediction 2. Taking the reranked passages and the prompt approach to find additional interpretations that RE- question as input, our single pass QA pair FUEL fails to predict in the first pass. We contin- generation model makes the first prediction 1. Retrieval & Reranking 2. Single Pass QA Pair Generation ! ! Q , A", Question # Prompt Q Q" If #Predicted Passages Disambiguation Prompt Answers > 1 Passage Retrieval Top K Answer A , … , A … … … Question Q! & Reranking Passages Prediction " $ Q!, A , Question $ Q# Passages Disambiguation $ 3. Round-Trip Prediction Example oF Round-Trip Prediction: & Q% A" QD # New A!? A" Q" A" terminate! AP … QD Q# �! AP & … … Top K ' Q" AP A ! A QD Q# New! # Passages $ � ( ( A) QD Q) Single Pass QA Pair Generation 1st Round-Trip Generation 2nd 3rd Figure 2: Overall Pipeline of REFUEL.REFUEL firstly retrieves question-relevant passages (Section 2.1). Then it generates first-pass QA pairs through the Answer Prediction (AP) module and Question Disambiguation (QP) module (Section3). Finally, generated disambiguated questions Qd are further taken as the input of our pipeline to find more interpretations (Round-Trip Prediction). If the generated question Qd still has multiple interpretations, the newly predicted answers will receive their own questions (Section 2.3). pass to predict a single answer or a set of plausible answers A1; :::; Am.

Arxiv:2011.13137V2 [Cs.CL] 30 May 2021 the Question That Leads to Each Answer Should Also 1 Introduction Be Provided to Clarify Each Interpretation

Keith Richards Wrote Satisfaction in His Sleep

Injustice Runs Deep Christopher Gordon Nicole Hayes the Goals of a Higher Education Usually Entail the Creation, Testing, and Implementation of New Ideas

Linn Lounge Presents... the Rolling Stones

Performing Songwriter 70 July/August 2008

What's a Zephyr?

Mick Jagger © Felix Aeppli 04-2020 / 09-2021

Professor Stewart's Casebook of Mathematical Mysteries

The Rolling Stones

Grrr! Pressetext

The Lost Rolling Stone: How Guitar Great Rory Gallagher Was Airbrushed from Rock History

The Rolling Stones

The Preface 2002