arXiv:2102.07033v1 [cs.CL] 13 Feb 2021

PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

Patrick Lewis†‡ Yuxiang Wu‡ Linqing Liu‡ Pasquale Minervini‡ Heinrich Küttler† Aleksandra Piktus† Pontus Stenetorp‡ Sebastian Riedel†‡
†Facebook AI Research ‡University College London

Abstract

Open-domain Question Answering models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models lack the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically-generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) whilst retaining high accuracy. Lastly, we demonstrate RePAQ's strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to "back-off" to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.

1 Introduction

Open-domain QA (ODQA) systems usually have access to a background corpus that can be used to answer questions. Models which explicitly exploit this corpus are commonly referred to as Open-book models (Roberts et al., 2020). They typically index the whole corpus, and then retrieve-and-read documents in order to answer questions on-the-fly (Chen et al., 2017; Lee et al., 2019a, inter alia).

A second class of models, closed-book question answering (CBQA) models, has recently been proposed. They learn to directly map questions to answers from training question-answer (QA) pairs without access to a background corpus (Roberts et al., 2020; Ye et al., 2021). These models usually take the form of pretrained seq2seq models such as T5 (Raffel et al., 2020) or BART (Lewis et al., 2019a), fine-tuned on QA-pairs. It has recently been shown that current closed-book models mostly memorise training QA-pairs, and can struggle to answer questions that do not overlap with training data (Lewis et al., 2020b).

Models which explicitly retrieve (training) QA-pairs, rather than memorizing them in parameters, have been shown to perform competitively with CBQA models (Lewis et al., 2020b; Xiao et al., 2020). These models have a number of useful properties, such as fast inference, interpretable outputs (by inspecting retrieved QA-pairs), and the ability to update the model's knowledge at test time by adding or removing QA-pairs.

However, CBQA and QA-pair retriever models are currently not competitive with retrieve-and-read systems in terms of accuracy, largely because the training QA-pairs they operate on cover substantially less knowledge than background corpora like Wikipedia. In this paper, we explore whether massively expanding the coverage of QA-pairs enables CBQA and QA-pair retriever models which are competitive with retrieve-and-read models.

We present Probably Asked Questions (PAQ), a semi-structured Knowledge Base (KB) of 65M natural language QA-pairs, which models can memorise and/or learn to retrieve from. PAQ differs from traditional KBs in that questions and answers are stored in natural language, and that questions are generated such that they are likely to appear in ODQA datasets. PAQ is automatically constructed using a question generation model and Wikipedia. To ensure generated questions are not only answerable given the passage they are generated from, we employ a global filtering post-processing step employing a state-of-the-art ODQA system. This greatly reduces the amount of wrong and ambiguous questions compared to other approaches (Fang et al., 2020; Alberti et al., 2019), and is critical for high-accuracy, downstream QA models.

To complement PAQ we develop RePAQ, a question answering model based on question retrieval/matching models, using dense Maximum Inner Product Search-based retrieval, and optionally, re-ranking. We show that PAQ and RePAQ provide accurate ODQA predictions, at the level of relatively recent large-scale retrieve-and-read systems such as RAG (Lewis et al., 2020a) on NaturalQuestions (Kwiatkowski et al., 2019a) and TriviaQA (Joshi et al., 2017). PAQ instances are annotated with scores that reflect how likely we expect questions to appear, which can be used to control the memory footprint of RePAQ by filtering the KB accordingly. As a result, RePAQ is extremely flexible, allowing us to configure QA systems with near state-of-the-art results, very small memory size, or inference speeds of over 1,000 questions per second. Memory-optimised configurations of RePAQ won two of the four tracks of the 2020 EfficientQA NeurIPS competition (Min et al., 2020a), with system sizes of 336MB and 29MB, respectively.

We also show that PAQ is a useful source of training data for CBQA models. BART models trained on PAQ outperform baselines trained on standard data by 5%. However, these models struggle to effectively memorise all the knowledge in PAQ, lagging behind RePAQ by 15%. This demonstrates the effectiveness of RePAQ at leveraging PAQ.

Finally, we show that since RePAQ's question matching score correlates well with QA accuracy, it effectively "knows when it doesn't know", allowing for selective question answering (Rodriguez et al., 2019) where QA systems may abstain from answering if confidence is too low. Whilst answer abstaining is important in its own right, it also enables an elegant "back-off" approach where we can defer to a more accurate but expensive QA system when answer confidence is low. This enables us to make use of the best of both speed and accuracy.

In summary, we make the following contributions: i) introduce PAQ, 65M QA-pairs automatically generated from Wikipedia, and demonstrate the importance of global filtering for high quality; ii) introduce RePAQ, a QA system designed to utilize PAQ, and demonstrate how it can be optimised for memory, speed or accuracy; iii) investigate the utility of PAQ for CBQA models, improving by 5% but noting significant headroom to RePAQ; iv) demonstrate RePAQ's strength on selective QA, enabling us to combine RePAQ with a state-of-the-art QA model, making it both more accurate and 2x faster. [1]

[1] The PAQ data, models and code will be made available at https://github.com/facebookresearch/PAQ

2 Open-Domain Question Answering

ODQA is the task of answering natural language factoid questions from an open set of domains. A typical question might be "when was the last year astronauts landed on the moon?", with a target answer "1972". The goal of ODQA is to develop an answer function $m : Q \mapsto A$, where $Q$ and $A$ respectively are the sets of all possible questions and answers. We assume there is a distribution $P(q, a)$ of QA-pairs, defined over $Q \times A$. A good answer function will minimise the expected error over $P(q, a)$ with respect to some loss function, such as answer string match. In practice, we do not have access to $P(q, a)$, and instead rely on an empirical sample of QA-pairs $K$ drawn from $P$, and measure the empirical loss of answer functions on $K$. Our goal in this work is to implicitly model $P(q, a)$ so that we can draw a large sample of QA-pairs, PAQ, which we can train on and/or retrieve from. Drawing a sufficiently large sample will overlap with $K$, essentially pre-empting and caching questions that humans may ask at test-time. This allows us to shift computation from test-time to train-time compared to retrieve-and-read methods.

3 Generating Question-Answer Pairs

In this section, we describe the process for generating PAQ. Given a large background corpus $C$, our QA-pair generation process consists of the following components:

1. A passage selection model $p_s(c)$, to identify passages which humans are likely to ask questions about.
2. An answer extraction model $p_a(a \mid c)$, for identifying spans in a passage that are more likely to be answers to a question.
3. A question generator model $p_q(q \mid a, c)$ that, given a passage and an answer, generates a question.
4. A filtering QA model $p_f(a \mid q, C)$ that generates an answer for a given question.

[...] ther randomly or using heuristics. We then train a model to minimise negative log-likelihood of positive passages relative to negatives. We implement $p_s$ with RoBERTa (Liu et al., [...]

Figure 1: Top Left: Generation pipeline for QA-pairs in PAQ. Top Right: PAQ used as training data for CBQA models. Bottom Left: RePAQ retrieves similar QA-pairs to input questions from PAQ. Bottom Right: RePAQ's confidence is predictive of accuracy. If confidence is low, we can defer to slower, more accurate systems, like FiD.

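The introduction describes RePAQ as retrieving stored QA-pairs by Maximum Inner Product Search and deferring to a slower system when its matching score is low. A minimal sketch of that idea follows. The bag-of-words embedding, the brute-force search, the threshold value and all function names are assumptions chosen to keep the example self-contained; they are not the paper's encoder, index or settings.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy vocabulary; a real system would use a learned dense encoder.
VOCAB = ["moon", "astronauts", "landed", "capital", "france", "year"]


def toy_embed(text: str) -> List[float]:
    """Unit-normalised word-count vector over VOCAB (all zeros if no word matches)."""
    words = text.lower().replace("?", "").split()
    counts = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in counts)) or 1.0
    return [x / norm for x in counts]


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def retrieve(
    query_vec: Sequence[float],
    kb_vecs: List[List[float]],
    kb_pairs: List[Tuple[str, str]],
) -> Tuple[str, float]:
    """Exhaustive maximum inner product search over stored question vectors."""
    scores = [inner_product(query_vec, vec) for vec in kb_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return kb_pairs[best][1], scores[best]


def answer_with_backoff(
    question: str,
    kb_pairs: List[Tuple[str, str]],
    kb_vecs: List[List[float]],
    fallback: Callable[[str], str],
    threshold: float = 0.5,  # assumed value, for illustration only
) -> Tuple[str, str]:
    """Answer from the QA-pair store if the match is confident, else defer."""
    answer, score = retrieve(toy_embed(question), kb_vecs, kb_pairs)
    if score >= threshold:
        return answer, "retriever"
    return fallback(question), "fallback"


kb_pairs = [
    ("what year did astronauts last land on the moon", "1972"),
    ("what is the capital of france", "Paris"),
]
kb_vecs = [toy_embed(question) for question, _ in kb_pairs]


def slow_model(question: str) -> str:
    # Placeholder for a more accurate but more expensive QA system.
    return "answer from slower model"


confident = answer_with_backoff(
    "when was the last year astronauts landed on the moon?", kb_pairs, kb_vecs, slow_model
)
deferred = answer_with_backoff("who wrote hamlet?", kb_pairs, kb_vecs, slow_model)
```

The first question closely matches a stored question and is answered directly from the store; the second shares no vocabulary with any stored question, scores zero, and is routed to the fallback. In a real deployment the threshold would be tuned on held-out data to trade speed against accuracy.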