arXiv:2009.08553v4 [cs.CL] 6 Aug 2021
Generation-Augmented Retrieval for Open-Domain Question Answering

Yuning Mao¹∗, Pengcheng He², Xiaodong Liu³, Yelong Shen², Jianfeng Gao³, Jiawei Han¹, Weizhu Chen²
¹University of Illinois, Urbana-Champaign ²Microsoft Azure AI ³Microsoft Research
1fyuningm2, [email protected] 2;3fpenhe, xiaodl, yeshe, jfgao,[email protected]

Abstract

We propose Generation-Augmented Retrieval (GAR) for answering open-domain questions, which augments a query through text generation of heuristically discovered relevant contexts without external resources as supervision. We demonstrate that the generated contexts substantially enrich the semantics of the queries and GAR with sparse representations (BM25) achieves comparable or better performance than state-of-the-art dense retrieval methods such as DPR (Karpukhin et al., 2020). We show that generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy. Moreover, as sparse and dense representations are often complementary, GAR can be easily combined with DPR to achieve even better performance. GAR achieves state-of-the-art performance on Natural Questions and TriviaQA datasets under the extractive QA setup when equipped with an extractive reader, and consistently outperforms other retrieval methods when the same generative reader is used.¹

∗ Work was done during internship at Microsoft Azure AI.
¹ Our code and retrieval results are available at https://github.com/morningmoni/GAR.

1 Introduction

Open-domain question answering (OpenQA) aims to answer factoid questions without a pre-specified domain and has numerous real-world applications. In OpenQA, a large collection of documents (e.g., Wikipedia) is often used to seek information pertaining to the questions. One of the most common approaches uses a retriever-reader architecture (Chen et al., 2017), which first retrieves a small subset of documents using the question as the query and then reads the retrieved documents to extract (or generate) an answer. The retriever is crucial as it is infeasible to examine every piece of information in the entire document collection (e.g., millions of Wikipedia passages) and the retrieval accuracy bounds the performance of the (extractive) reader.

Early OpenQA systems (Chen et al., 2017) use classic retrieval methods such as TF-IDF and BM25 with sparse representations. Sparse methods are lightweight and efficient, but unable to perform semantic matching and fail to retrieve relevant passages without lexical overlap. More recently, methods based on dense representations (Guu et al., 2020; Karpukhin et al., 2020) learn to embed queries and passages into a latent vector space, in which text similarity beyond lexical overlap can be measured. Dense retrieval methods can retrieve semantically relevant but lexically different passages and often achieve better performance than sparse methods. However, the dense models are more computationally expensive and suffer from information loss as they condense the entire text sequence into a fixed-size vector that does not guarantee exact matching (Luan et al., 2020).

There have been some recent studies on query reformulation with text generation for other retrieval tasks, which, for example, rewrite the queries to context-independent (Yu et al., 2020; Lin et al., 2020; Vakulenko et al., 2020) or well-formed (Liu et al., 2019) ones. However, these methods require either task-specific data (e.g., conversational contexts, ill-formed queries) or external resources such as paraphrase data (Zaiem and Sadat, 2019; Wang et al., 2020) that cannot or do not transfer well to OpenQA. Also, some rely on a time-consuming training process such as reinforcement learning (RL) (Nogueira and Cho, 2017; Liu et al., 2019; Wang et al., 2020) that is not efficient enough for OpenQA (more discussion in Sec. 2).

In this paper, we propose Generation-Augmented Retrieval (GAR), which augments a query through text generation of a pre-trained language model (PLM). Different from prior studies that reformulate queries, GAR does not require external resources or downstream feedback via RL as supervision, because it does not rewrite the query but expands it with heuristically discovered relevant contexts, which are fetched from PLMs and provide richer background information (Table 2). For example, by prompting a PLM to generate the title of a relevant passage given a query and appending the generated title to the query, it becomes easier to retrieve that relevant passage. Intuitively, the generated contexts explicitly express the search intent not presented in the original query. As a result, GAR with sparse representations achieves comparable or even better performance than state-of-the-art approaches (Karpukhin et al., 2020; Guu et al., 2020) with dense representations of the original queries, while being more lightweight and efficient in terms of both training and inference (including the cost of the generation model) (Sec. 6.4).

Specifically, we expand the query (question) by adding relevant contexts as follows. We conduct seq2seq learning with the question as the input and various freely accessible in-domain contexts as the output, such as the answer, the sentence to which the answer belongs, and the title of a passage that contains the answer. We then append the generated contexts to the question as the generation-augmented query for retrieval. We demonstrate that using multiple contexts from diverse generation targets is beneficial as fusing the retrieval results of different generation-augmented queries consistently yields better retrieval accuracy.

We conduct extensive experiments on the Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (Trivia) (Joshi et al., 2017) datasets. The results reveal four major advantages of GAR: (1) GAR, combined with BM25, achieves significant gains over the same BM25 model that uses the original queries or existing unsupervised query expansion (QE) methods. (2) GAR with sparse representations (BM25) achieves comparable or even better performance than the current state-of-the-art retrieval methods, such as DPR (Karpukhin et al., 2020), that use dense representations. (3) Since GAR uses sparse representations to measure lexical overlap², it is complementary to dense representations: by fusing the retrieval results of GAR and DPR (denoted as GAR+), we obtain consistently better performance than either method used individually. (4) GAR outperforms DPR in the end-to-end QA performance (EM) when the same extractive reader is used: EM=41.8 (43.8 for GAR+) on NQ and 62.7 on Trivia, creating new state-of-the-art results for extractive OpenQA. GAR also outperforms other retrieval methods under the generative setup when the same generative reader is used: EM=38.1 (45.3 for GAR+) on NQ and 62.2 on Trivia.

² Strictly speaking, GAR with sparse representations handles semantics before retrieval by enriching the queries, while maintaining the advantage of exact matching.

Contributions. (1) We propose Generation-Augmented Retrieval (GAR), which augments queries with heuristically discovered relevant contexts through text generation without external supervision or time-consuming downstream feedback. (2) We show that using generation-augmented queries achieves significantly better retrieval and QA results than using the original queries or existing unsupervised QE methods. (3) We show that GAR, combined with a simple BM25 model, achieves new state-of-the-art performance on two benchmark datasets in extractive OpenQA and competitive results in the generative setting.

2 Related Work

Conventional Query Expansion. GAR shares some merits with query expansion (QE) methods based on pseudo relevance feedback (Rocchio, 1971; Abdul-Jaleel et al., 2004; Lv and Zhai, 2010) in that they both expand the queries with relevant contexts (terms) without the use of external supervision. GAR is superior as it expands the queries with knowledge stored in the PLMs rather than the retrieved passages and its expanded terms are learned through text generation.

Recent Query Reformulation. There are recent or concurrent studies (Nogueira and Cho, 2017; Zaiem and Sadat, 2019; Yu et al., 2020; Vakulenko et al., 2020; Lin et al., 2020) that reformulate queries with generation models for other retrieval tasks. However, these studies are not easily applicable or efficient enough for OpenQA because: (1) They require external resources such as paraphrase data (Zaiem and Sadat, 2019), search sessions (Yu et al., 2020), or conversational contexts (Lin et al., 2020; Vakulenko et al., 2020) to form the reformulated queries, which are not available or showed inferior domain-transfer performance in OpenQA (Zaiem and Sadat, 2019); (2) They involve a time-consuming training process such as RL. For example, Nogueira and Cho (2017) reported a training time of 8 to 10 days as it uses retrieval performance in the reward function and conducts retrieval at each iteration. In contrast, GAR uses freely accessible in-domain contexts like passage titles as the generation targets and standard seq2seq learning, which, despite its simplicity, is not only more efficient but also effective for OpenQA.

Retrieval for OpenQA. Existing sparse retrieval methods for OpenQA (Chen et al., 2017) solely rely on the information of the questions. GAR extends to contexts relevant to the questions by extracting information inside PLMs and helps sparse methods achieve comparable or better performance than dense methods (Guu et al., 2020; Karpukhin et al., 2020), while enjoying the simplicity and efficiency of sparse representations.

[...] freely accessible contexts as the generation targets. We show in Sec. 6.2 that having multiple generation targets is helpful in that fusing their results consistently brings better retrieval accuracy.

Context 1: The default target (answer). The default target is the label in the task of interest, which is the answer in OpenQA. The answer to the question is apparently useful for the retrieval of relevant passages that contain the answer itself. As shown in previous work (Roberts et al., 2020; Brown et al.,