PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

Patrick Lewis†‡ Yuxiang Wu‡ Linqing Liu‡ Pasquale Minervini‡ Heinrich Kuttler¨ † Aleksandra Piktus† Pontus Stenetorp‡ Sebastian Riedel†‡ †Facebook AI Research ‡University College London [email protected]

Abstract whole corpus, and then retrieve-and-read docu- ments in order to answer questions on-the-fly (Chen Open-domain Question Answering models et al., 2017; Lee et al., 2019a, inter alia). which directly leverage question-answer (QA) second class of models, closed-book question pairs, such as closed-book QA (CBQA) mod- els and QA-pair retrievers, show promise in answering (CBQA) models, have recently been terms of speed and memory compared to con- proposed. They learn to directly map questions to ventional models which retrieve and read from answers from training question-answer (QA) pairs text corpora. QA-pair retrievers also offer in- without access to a background corpus (Roberts terpretable answers, a high degree of control, et al., 2020; et al., 2021). These models usu- and are trivial to update at test time with new ally take the form of pretrained seq2seq models knowledge. However, these models lack the such as T5 (Raffel et al., 2020) or BART (Lewis accuracy of retrieve-and-read systems, as sub- stantially less knowledge is covered by the et al., 2019a), fine-tuned on QA-pairs. It has re- available QA-pairs relative to text corpora like cently been shown that current closed-book models Wikipedia. To facilitate improved QA-pair mostly memorise training QA-pairs, and can strug- models, introduce Probably Asked Ques- gle to answer questions that do not overlap with tions (PAQ), a very large resource of 65M training data (Lewis et al., 2020b). automatically-generated QA-pairs. We intro- Models which explicitly retrieve (training) QA- duce a new QA-pair retriever, RePAQ, to com- pairs, rather than memorizing them in parameters, plement PAQ. We find that PAQ preempts have been shown to perform competitively with and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read CBQA models (Lewis et al., 2020b; Xiao et al., models, whilst being significantly faster. Us- 2020). These models have a number of useful prop- ing PAQ, we train CBQA models which out- erties, such as fast inference, interpretable outputs perform comparable baselines by 5%, but trail (by inspecting retrieved QA-pairs), and the ability RePAQ by over 15%, indicating the effective- to update the model’s knowledge at test time by ness of explicit retrieval. RePAQ can con- adding or removing QA-pairs. figured for size (under 500MB) or speed (over However, CBQA and QA-pair retriever models 1K questions per second) whilst retaining high accuracy. Lastly, we demonstrate RePAQ’s are currently not competitive with retrieve-and-read strength at selective QA, abstaining from an- systems in terms of accuracy, largely because the arXiv:2102.07033v1 [cs.CL] 13 Feb 2021 swering when it is likely to be incorrect. This training QA-pairs they operate on cover substan- enables RePAQ to “back-off” to a more expen- tially less knowledge than background corpora like sive state-of-the-art model, leading to a com- Wikipedia. In this paper, we explore whether mas- bined system which is both more accurate and sively expanding the coverage of QA-pairs enables 2x faster than the state-of-the-art model alone. CBQA and QA-pair retriever models which are 1 Introduction competitive with retrieve-and-read models. We present Probably Asked Questions (PAQ), a Open-domain QA (ODQA) systems usually have semi-structured Knowledge Base (KB) of 65M nat- access to a background corpus that can be used to ural language QA-pairs, which models can mem- answer questions. Models which explicitly exploit orise and/or learn to retrieve from. PAQ differs this corpus are commonly referred to as Open-book from traditional KBs in that questions and answers models (Roberts et al., 2020). They typically index are stored in natural language, and that questions are generated such that they are likely to appear in tions: ) introduce PAQ, 65M QA-pairs automati- ODQA datasets. PAQ is automatically constructed cally generated from Wikipedia, and demonstrate using a question generation model and Wikipedia. the importance of global filtering for high quality To ensure generated questions are not only answer- ii) introduce RePAQ, a QA system designed to uti- able given the passage they are generated from, lize PAQ and demonstrate how it can be optimised we employ a global filtering post-processing step for memory, speed or accuracy iii) investigate the employing a state-of-the-art ODQA system. This utility of PAQ for CBQA models, improving by 5% greatly reduces the amount of wrong and ambigu- but note significant headroom to RePAQ iv) demon- ous questions compared other approaches (Fang strate RePAQ’s strength on selective QA, enabling et al., 2020; Alberti et al., 2019), and is critical for us to combine RePAQ with a state-of-the-art QA high-accuracy, downstream QA models. model, making it both more accurate and 2x faster1 To complement PAQ we develop RePAQ, a 2 Open-Domain Question Answering question answering model based on question re- trieval/matching models, using dense Maximum ODQA is the task of answering natural language Inner Product Search-based retrieval, and option- factoid question from an open set of domains. A ally, re-ranking. We show that PAQ and RePAQ typical question might be “when was the last year provide accurate ODQA predictions, at the level astronauts landed on the moon?”, with a target an- of relatively recent large-scale retrieve-and-read swer “1972”. The goal of ODQA is to develop systems such as RAG (Lewis et al., 2020a) on Nat- an answer function m : 7→ A, where Q and A uralQuestions (Kwiatkowski et al., 2019a) and Triv- respectively are the sets of all possible questions iaQA (Joshi et al., 2017). PAQ instances are anno- and answers. We assume there is a distribution tated with scores that reflect how likely we expect P (q, a) of QA-pairs, defined over Q × A. A good questions to appear, which can be used to control answer function will minimise the expected error the memory footprint of RePAQ by filtering the KB over P (q, a) with respect to some loss function, accordingly. As a result, RePAQ is extremely flexi- such as answer string match. In practice, we do ble, allowing us to configure QA systems with near not have access to P (q, a), and instead rely on an state-of-the-art results, very small memory size, or empirical sample of QA-pairs K drawn from P , inference speeds of over 1,000 questions per sec- and measure the empirical loss of answer functions ond. Memory-optimised configurations of RePAQ on K. Our goal in this work is to implicitly model won two of the four tracks of the 2020 Efficien- P (q, a) so that we can draw a large sample of QA- tQA NeurIPS competition (Min et al., 2020a), with pairs, PAQ, which we can train on and/or retrieve system sizes of 336MB and 29MB, respectively. from. Drawing a sufficiently large sample will over- We also show that PAQ is a useful source of train- lap with K, essentially pre-empting and caching ing data for CBQA models. BART models trained questions that humans may ask at test-time. This on PAQ outperform baselines trained on standard allows us to shift computation from test-time to data by 5%. However, these models struggle to train-time compared to retrieve-and-read methods. effectively memorise all the knowledge in PAQ, lagging behind RePAQ by 15%. This demonstrates 3 Generating Question-Answer Pairs the effectiveness of RePAQ at leveraging PAQ. In this section, we describe the process for generat- Finally, we show that since RePAQ’s question ing PAQ. Given a large background corpus C, our matching score correlates well with QA accuracy, QA-pair generation process consists of the follow- it effectively “knows when it doesn’t know”, allow- ing components: ing for selective question answering (Rodriguez et al., 2019) where QA systems may abstain from 1. A passage selection model ps(c), to identify answering if confidence is too low. Whilst answer passages which humans are likely to ask ques- abstaining is important in its own right, it also - tions about. ables an elegant “back-off” approach where we can 2. An answer extraction model pa(a | c), for defer to a more accurate but expensive QA system identifying spans in a passage that are more when answer confidence is low. This enables us to likely to be answers to a question. make use of the best of both speed and accuracy. 1The PAQ data, models and code will be made available at In summary, we make the following contribu- https://github.com/facebookresearch/PAQ

Figure 1: Top Left: Generation pipeline for QA-pairs in PAQ. Top Right: PAQ used as training data for CBQA models. Bottom Left: RePAQ retrieves similar QA-pairs to input questions from PAQ. Bottom right: RePAQ’s confidence is predictive of accuracy. If confidence is low, we can defer to slower, more accurate systems, like FiD.

3. A question generator model pq(q | a, c) that, ther randomly or using heuristics. We then train a given a passage and an answer, generates a model to minimise negative log-likelihood of posi- question. tive passages relative to negatives. We implement p with RoBERTa (Liu et al., 4. A filtering QA model pf (a | q, C) that gen- s erates an answer for a given question. If an 2019a) and obtain positive passages from Natu- ral Questions (NQ, Kwiatkowski et al., 2019b). We answer generated by pf does not match the answer a question was generated from, the sample easy negatives at random from Wikipedia, question is discarded. This ensures generated and hard negatives from the same Wikipedia arti- questions are consistent (Alberti et al., 2019). cle as the positive passage. Easy negatives help the model to learn topics of interest, and hard nega- As shown in Fig.1, these models are applied se- tives help to differentiate between interesting and quentially to generate QA-pairs for PAQ, in a simi- non-interesting passages from the same article. We lar manner to related work in contextual QA gen- evaluate by measuring how highly positive valida- eration (Alberti et al., 2019; Lewis et al., 2019b). tion passages are ranked amongst negatives. First a passage c is selected with a high probabil- ity according to ps. Next, candidate answers a 3.2 Answer Extraction, pa are extracted from c using p , and questions q are a Given a passage, this component identifies spans generated for each answer in the passage using p . q that are likely to be answers to questions. We con- Lastly, p generates a new answer a0 for the ques- f sider two alternatives: an off-the-shelf Named En- tion. If the source answer a matches a0, then (q, a) tity Recogniser (NER) or training a BERT (Devlin is deemed consistent and is added to PAQ. In the et al., 2019) answer extraction model on NQ. following, we describe each component in detail. The NER answer extractor simply extracts all 2 3.1 Passage Selection, ps named entities from a passage. The majority of questions in ODQA datasets consist of entity men- The passage selection model p is used to find pas- s tions (Kwiatkowski et al., 2019a; Joshi et al., 2017), sages which are likely to contain information that so this approach can achieve high answer coverage. humans may ask about, and would thus be good However, as we extract all entity mentions in a candidates to generate questions from. We learn p s passage, we may extract unsuitable mentions, or using a similar method to Karpukhin et al.(2020b). miss answers that do not conform to the NER sys- Concretely, we assume access to a set of positive tem’s annotation schema. The trained answer span passages C+ ⊂ C, which we obtain from answer- extractor aims to address these issues. containing passages from an ODQA training set. Since we do not have a set of labelled negative 2We use the sPaCy (Honnibal and Montani, 2017) NER passages, we sample negatives from the corpus, ei- system, trained on OntoNotes (Hovy et al., 2006). BERT answer span extraction is typically per- QA-pairs for CBQA models. The second treats formed by modelling answer start and end in- PAQ as a KB, which models learn to directly re- dependently, obtaining answer probabilities via trieve from. These use-cases are related, as CBQA pa(a | c) = p(astart | c) × p(aend | c) (Devlin et al., models have been shown to memorise the train data 2019). We found this approach be sub-optimal for in their parameters, latently retrieving from them modelling multiple span occurrences in a passage. at test time (Lewis et al., 2020b; Domingos, 2020). We instead use an approach that breaks the condi- tional independence of answer spans by directly 4.1 PAQ for Closed-Book QA predicting pa(a | c) = p([astart, aend] | c). This We fine-tune a BART-large (Lewis et al., 2019a) model first feeds a passage through BERT, before with QA-pairs from the concatenation of the train- concatenating the start and end token representa- ing data and PAQ, using a similar training proce- tions of all possible spans of up to length 30, before dure to Roberts et al.(2020). We use early stopping feeding them into a MLP to compute pa(a | c). At on the validation set and a batch size of 512, and generation time, the answer extraction component note learning is slow, requiring 70 epochs on PAQ. extracts a constant number of spans from each pas- Following recent best practices(Alberti et al., 2019; sage, ranked by their extraction probabilities. Yang et al., 2019), we then fine-tune on the training QA-pairs only, using validation Exact Match score 3.3 Question Generation, pq for early stopping (Rajpurkar et al., 2018). Given a passage and an answer, this component We note that an effective CBQA model must generates likely questions with that answer. To be able to understand the semantics of questions indicate the answer and its occurrence in the pas- and how to generate answers, in addition to be- sage, we prepend the answer to the passage and ing able to store a large number of facts in its pa- label the answer span with surrounding special to- rameters. This model thus represents a kind of kens. We construct the dataset from NQ, TriviaQA, combined parametric knowledgebase and retrieval and SQuAD, and perform standard fine-tuning of system (Petroni et al., 2020a). The model proposed BART-base (Lewis et al., 2019a) to obtain pq. in the next section, RePAQ, represents an explicit non-parametric instantiation of this idea. 3.4 Filtering, pf 4.2 RePAQ The filtering model pf improves the quality of gen- erated questions, by ensuring that they are con- RePAQ is a retrieval model which operates on KBs sistent: that the answer they were generated is of QA-pairs, such as PAQ. RePAQ extends recently likely to be a valid answer to the question. Previous proposed nearest neighbour QA-pair retriever mod- work (Alberti et al., 2019; Fang et al., 2020) has - els (Lewis et al., 2020b; Xiao et al., 2020). These ployed a machine reading comprehension (MRC) models assume access to a KB of N QA-pairs K = {(q , a )...(q , a )}. These models provide QA model for this purpose, pf (a | q, c), which 1 1 N N produces an answer when supplied with a question an answer to a test question q by finding the most 0 0 and the passage it was generated from. We refer to relevant QA-pair (q , a ) in K, using a scalable rel- 0 this as local filtering. However, local filtering will evance function, then returning a as the answer not remove questions which are ambiguous (Min to q. Such a function could be implemented using et al., 2020b), and can only be answered correctly standard information retrieval techniques, (.g. TF- with access to the source passage. Thus, we use IDF) or learnt from training data. RePAQ is learnt from ODQA data and consists of a neural dense an ODQA model for filtering, pf (a | q, C), sup- plied with only the generated question, and not the retriever, optionally followed by a neural reranker. source passage. We refer to this as global filtering, 4.2.1 RePAQ Retriever and later show that it is vital for strong downstream Our retriever adopts the dense Maximum Inner results. We use FiD-base with 50 retrieved pas- Product Search (MIPS) retriever paradigm, that sages, trained on NQ (Izacard and Grave, 2020). has recently been shown to obtain state-of-the-art 4 Question Answering using PAQ results in a number of settings (Karpukhin et al., 2020b; Lee et al., 2021, inter alia). Our goal is to We consider two uses of PAQ for building QA mod- embed queries q and indexed items d into a repre- els. The first is to use PAQ as a source of training sentation space via embedding functions gq and gd, > so that the inner product gq(q) gd(d) is maximised many retrieve-and-read generators which consume for items relevant to q. In our case, queries are thousands of tokens to generate an answer. Effi- questions and indexed items are QA-pairs (q0, a0). cient MIPS libraries such as FAISS (Johnson et al., We make our retriever symmetric by embedding q0 2017) enable RePAQ’s retriever to answer 100s to rather than (q0, a0), meaning that only one embed- 1000s of questions per second (see Section 5.2.3). ding function gq is required, which maps questions We use a KB for RePAQ consisting of training set to embeddings. This applies a useful inductive bias, QA-pairs concatenated with QA-pairs from PAQ. and we find that it aids stability during training. 4.2.2 RePAQ Reranker Learning the embedding function gq is compli- cated by the lack of labelled question pair para- The accuracy of RePAQ can be improved using a phrases in ODQA datasets. We propose a latent reranker on the top-K QA-pairs from the retriever. variable approach similar to retrieval-augmented The reranker uses cross-encoding (Humeau et al., generation (RAG, Lewis et al., 2020b),3 where 2020), and includes the retrieved answer in the we we index training QA-pairs rather than doc- scoring function for richer featurisation. We con- uments. For an input question q, the top K QA- catenate the input question q, the retrieved question 0 0 q0 and its answer a0 with separator tokens, and feed pairs (q , a ) are retrieved by a retriever pret where 0 > 0 it through ALBERT. We obtain training data in pret(q|q ) ∝ exp(gq(q) gq(q )). These are then the following manner: For a training QA-pair, we fed into a seq2seq model pgen which generates an answer for each retrieved QA-pair, before a final first retrieve the top 2K QA-pairs from PAQ using answer is produced by marginalising, RePAQ’s retriever. If one of the retrieved QA-pairs has the correct answer, we treat that QA-pair as a X 0 0 0 positive, and randomly sample K-1 of the incorrect p(a|q) = pgen(a|q, q , a )pret(q |q), 0 0 retrieved questions as negatives. We train by min- (a ,q )∈top-k pret(·|q) imising negative log likelihood of positives relative As pgen generates answers token-by-token, credit to 10 negatives, and rerank 50 retrieved pairs at test can be given for retrieving helpful QA-pairs which time. The reranker improves accuracy at the ex- do not exactly match the target answer. For ex- pense of some speed. However, as QA-pairs consist ample, for the question “when was the last time of fewer tokens than passages, the reranker is still anyone was on the moon” and target answer “- faster than retrieve-and-read models, even when cember 1972”, retrieving “when was the last year using architectures such as ALBERT-xxlarge. astronauts landed on the moon” with answer “1972” will help to generate the target answer, despite the 5 Results answers having different granularity. After training, We first examine the PAQ resource in general, be- 4 we discard pret , retaining only the question em- fore exploring how both CBQA models and RePAQ bedder g. We implement pret with ALBERT (Lan perform using PAQ, comparing to recently pub- et al., 2020) with an output dimension of 768, and lished systems. We use the Natural Questions (NQ, pgen with BART-large (Lewis et al., 2019a). We Kwiatkowski et al., 2019b) and TriviaQA (Joshi train using the top 100 retrieved QA-pairs, and et al., 2017) datasets to assess performance, evalu- refresh the embedding index every 5 training steps. ating with the standard Exact Match (EM) score. Once the embedder gq is trained, we build a test-time QA system by embedding and indexing 5.1 Examining PAQ a QA KB such as PAQ. Answering is achieved by We generate PAQ by applying the pipeline de- retrieving the most similar stored question, and re- scribed in Section3 to the Wikipedia dump from turning its answer. The matched QA-pair can be Karpukhin et al.(2020a), which splits Wikipedia displayed to the user, providing a mechanism for into 100-word passages. We use passage selection more interpretable answers than CBQA models and model ps to rank all 21M passages, and generate from the top 10M, before applying global filtering.5 3Other methods, such as heuristically constructing para- phrase pairs assuming that questions with the same answer are We are interested in understanding the effective- paraphrases, and training with sampled negatives would also ness of different answer extractors, and whether be valid, but were not competitive in our early experiments 4 5 We could use pgen as a reranker/aggregator for QA, but in Generation was stopped when downstream performance practice find it both slower and less accurate than the reranker with RePAQ did not significantly improve with more ques- described in Section 4.2.2 tions. Extracted Unique Filtered Coverage that 82% of questions accurately capture informa- Dataset Ratio Answers Qs QAs NQ TQA tion from the passage and contain sufficient details PAQL,1 76.4M 58.0M 14.1M 24.4% 88.3 90.2 to locate the answer. 16% of questions confuse PAQL,4 76.4M 225.2M 53.8M 23.9% 89.8 90.9 the semantics of certain answer types, either by PAQNE,1 122.2M 65.4M 12.0M 18.6% 83.5 88.3 conflating similar entities in the passage or by mis- PAQ 165.7M 279.2M 64.9M 23.2% 90.2 91.1 interpreting rare phrases (see examples 7 and 8 in Table 1: PAQ dataset statistics and ODQA dataset an- Table2). Finally, we find small numbers of gram- swer coverage. “Ratio” refers to the number of gener- mar errors (such as example 5) and mismatched ated questions which pass the global consistency filter. wh-words (5% and 2% respectively).8 Other observations PAQ often contains several generating more questions per answer span results paraphrases of the same QA-pair. This redun- leads to better results. To address these questions, dancy reflects how information is distributed in we create three versions of PAQ, described below. Wikipedia, with facts often mentioned on several PAQL uses the learnt answer extractor, and a ques- different pages. Generating several questions per tion generator trained on NQ and TriviaQA. We answer span also increases redundancy. Whilst this extract 8 answers per passage and use a beam size means that PAQ could be more information-dense of 4 for question generation. In PAQL,1 we only if a de-duplication step was applied, we later show use the top scoring question in the beam, whereas that RePAQ always improves with more questions in PAQL,4 we use all four questions from the beam, in its KB (section 5.2.1). This suggests that it is allowing for several questions to be generated from worth increasing redundancy for greater coverage. one answer in a passage. PAQ uses the NER NE,1 5.2 Question Answering Results answer extractor, and a generator trained only on NQ. PAQNE,1 allow us assess whether diversity in In this section, we shall compare how the PAQ- the form of answer extractors and question genera- leveraging models proposed in section4 compare tors leads to better results. The final KB, referred to to existing approaches. We primarily compare to as just “PAQ”, is the union of PAQL and PAQNE. a state-of-the-art retrieve-and-read model, Fusion- As shown in Table1, PAQ consists of 65M fil- in-Decoder (FiD, Izacard and Grave, 2020). FiD tered QA pairs.6 This was obtained by extracting uses DPR (Karpukhin et al., 2020b) to retrieve rele- 165M answer spans and generating 279M unique vant passages from Wikipedia, and feeds them into questions before applying global filtering. Table1 T5 (Raffel et al., 2020) to generate a final answer. shows that the PAQL pipeline is more efficient than Table3 shows the highest-accuracy configura- PAQNE, with 24.4% of QA-pairs surviving filter- tions of our models alongside recent state-of-the-art ing, compared to 18%. approaches. We make the following observations: Comparing rows 2 and 7 shows that a CBQA BART PAQ Answer Coverage To evaluate answer ex- model trained with PAQ outperforms a compara- tractors, we calculate how many answers in the ble NQ-only model by 5%, and only 3% behind validation sets of TriviaQA and NQ also occur in T5-11B (row 1) which has 27x more parameters. 7 PAQ’s filtered QA-pairs. Table1 shows that the Second, we note strong results for RePAQ on NQ answer coverage of PAQ is very high – over 90% (47.7%, row 9), outperforming recent retrieve-and- for both TriviaQA and NQ. Comparing PAQL with read systems such as RAG by over 3% (row 4). PAQNE shows that the learnt extractor achieves Multi-task training RePAQ on NQ and TriviaQA higher coverage, but the union of the two leads to improves TriviaQA results by 1-2% (comparing the highest coverage overall. Comparing PAQL,1 rows 8-9 with 10-11). RePAQ does not perform and PAQL,4 indicates that using more questions quite as strongly on TriviaQA (see section 5.2.6), from the beam also results in higher coverage. but is within 5% of RAG, and outperforms concur- PAQ Question Generation Quality Illustrative rent work on real-time QA, DensePhrases (row 6, examples from PAQ can be seen in Table2. Man- Lee et al., 2021). Lastly, row 12 shows that combin- ual inspection of 50 questions from PAQ reveals ing RePAQ and FiD-large into a combined system is 0.9% more accurate than FiD-large (see Section 6Each question only has one answer due to global filtering 5.2.4 for more details). 7performed using standard answer normalisation (Ra- jpurkar et al., 2016) 8Further details and automatic metrics in Appendix A.3 # Question Answer Comment 1 who created the dutch comic strip panda Martin Toonder X 2 what was the jazz group formed by john hammond in 1935 Goodman Trio X 3 astrakhan is russia’s main market for what commodity fish X 4 what material were aramaic documents rendered on leather X 5 when did the giant panda chi chi died 22 July 1972 X, Grammar error 6 pinewood is a village in which country England ∼, Also a Pinewood village in USA 7 who was the mughal emperor at the battle of lahore Ahmad Shah Bahadur  Confuses with Ahmad Shah Abdali 8 how many jersey does mitch richmond have in the nba 2  His Jersey No. was 2

Table 2: Representative Examples from PAQ. Xindicates correct, ∼ ambiguous and  incorrect facts respectively

# Model Type Model NaturalQuestions TriviaQA 1 Closed-book T5-11B-SSM (Roberts et al., 2020) 35.2 51.8 2 Closed-book BART-large (Lewis et al., 2020b) 26.5 26.7 3 QA-pair retriever Dense retriever (Lewis et al., 2020b) 26.7 28.9 4 Open-book, retrieve-and-read RAG-Sequence (Lewis et al., 2020a) 44.5 56.8 5 Open-book, retrieve-and-read FiD-large, 100 docs (Izacard and Grave, 2020) 51.4 67.6 6 Open-book, phrase index DensePhrases (Lee et al., 2021) 40.9 50.7 7 Closed-book BART-large, pre-finetuned on PAQ 32.7 33.2 8 QA-pair retriever RePAQ (retriever only) 41.2 38.8 9 QA-pair retriever RePAQ (with reranker) 47.7 50.7 10 QA-pair retriever RePAQ-multitask (retriever only) 41.7 41.3 11 QA-pair retriever RePAQ-multitask (with reranker) 47.6 52.1 12 QA-pair retriever RePAQ-multitask w/ FiD-Large Backoff 52.3 67.3

Table 3: Exact Match score for highest accuracy RePAQ configurations in comparison to recent state-of-the-art systems. Highest score indicated in bold, highest non-retrieve-and-read model underlined.

5.2.1 Ablating PAQ using RePAQ Exact Match # KB Filtering Size Table4 shows RePAQ’s accuracy when using differ- Retrieve Rerank ent PAQ variants. To establish the effect of filtering, 1 NQ-Train - 87.9K 27.9 31.8 we evaluate RePAQ with unfiltered, locally-filtered 2 PAQL,1 None 58.0M 21.6 30.6 and globally-filtered QA-pairs on PAQL,1. Com- 3 PAQL,1 Local 31.7M 28.3 34.9 paring rows 2, 3 and 4 shows that global filtering is 4 PAQL,1 Global 14.1M 38.6 44.3 crucial, leading to a 9% and 14% improvement over 5 PAQL,4 Global 53.8M 40.3 45.2 locally-filtered and unfiltered datasets respectively. 6 PAQNE,1 Global 12.0M 37.3 42.6 We also note a general trend in Table4 that 7 PAQ Global 64.9M 41.6 46.4 adding more globally-filtered questions improves accuracy. Rows 4 and 5 show that using four ques- Table 4: The effect of different PAQ subsets on the NQ tions per answer span is better than generating one validation accuracy of RePAQ (+0.9%), and Rows 5,6 and 7 show that combin- ing PAQ and PAQ results in a further 1.2% NE L ter the background corpus for a retrieve-and-read improvement. Empirically we did not observe any model (Izacard et al., 2020). We compare the sys- cases where increasing the number of globally fil- tem size of a FiD-large system and RePAQ as the tered QA-pairs reduced accuracy, even when there number of items (passages and QA-pairs respec- were millions of QA-pairs already. tively) in their indexes are reduced. We select 5.2.2 System Size vs Accuracy which passages and QA-pairs are included using the passage selection model p .9 Further experi- PAQ’s QA-pairs are accompanied by scores of how s mental details can be found in appendix A.4. Fig.2 likely they are to be asked. These scores can be used to filter the KB and reduce the RePAQ sys- 9 Here, we use PAQL1, which is 5x smaller than the full tem size. A similar procedure can be used to fil- PAQ, but retains most of the accuracy (see Table4) Model Retriever Reranker Exact Match Q/sec 45 FiD-large - - 51.4 0.5 FiD-base - - 48.2 2 40 RePAQ base - 40.9 1400 RePAQ large - 41.2 1100 35 RePAQ-retriever RePAQ xlarge - 41.5 800

NQ Dev Exact Match RePAQ-reranker 30 RePAQ base base 45.7 55 FID-Large RePAQ large xlarge 46.2 10 0 1 2 3 4 5 6 7 RePAQ xlarge xxlarge 47.6 6 System Size (GB)

Figure 2: System size vs. accuracy for RePAQ and FiD- Table 5: Inference speeds of various configurations of large as a function of the number of items in the index. RePAQ compared to FiD models on NQ

90 RePAQ-retriever RePAQ-reranker shows the that both system sizes can be reduced 80 several-fold with only a small drop in accuracy, FID-Large 70 demonstrating the effectiveness of ps. FiD can achieve a higher accuracy, but requires larger sys- 60 tem sizes. RePAQ can be reduced to a smaller size NQ Exact Match before a significant accuracy drop, driven primarily 50 by the higher information density of QA-pairs rela- 0.0 0.2 0.4 0.6 0.8 1.0 tive to passages, and fewer model parameters used Fraction of Questions Answered by RePAQ compared to FiD. Highly-optimized RePAQ models won the “500MB” and “Smallest Figure 3: Risk-coverage plot for FiD and RePAQ. System” tracks of the EfficientQA NeurIPS com- petition with disk images of 336MB and 29MB 5.2.4 Selective Question Answering respectively. For further details on EfficientQA, QA systems should not just be able to answer ac- the reader is referred to Min et al.(2020a). curately, but also “know when they don’t know”, 5.2.3 Inference Speed vs Accuracy and abstain from answering when they are unlikely to produce good answers. This task is challenging We train a variety of differently-sized RePAQ for current systems (Asai and Choi, 2020; Jiang models to explore the relationship between accu- et al., 2020b), and has been approached in MRC racy and inference speed. We use a fast Hierar- by training on unanswerable questions (Rajpurkar chical Navigable Small World (HNSW) index in et al., 2018) and for trivia systems by leveraging FAISS (Malkov and Yashunin, 2018; Johnson et al., incremental QA formats (Rodriguez et al., 2019). 10 2017) and measure the time required to evalu- We find that RePAQ’s retrieval and reranking ate the NQ test set on a system with access to one scores are well-correlated with answering correctly. 11 GPU. Table5 shows these results. Some retriever- This allows RePAQ to be used for selective question only RePAQ models can answer over 1000 ques- answering by abstaining when the score is below a tions per second, and are relatively insensitive to certain threshold. Figure3 shows a risk-coverage model size, with ALBERT-base only scoring 0.5% plot (Wang et al., 2018) for RePAQ and FiD, where lower than ALBERT-xlarge. They also outperform we use FiD’s answer log probability for its answer retrieve-and-read models like REALM (40.4%, confidence.12 The plot shows the accuracy on the Guu et al., 2020) and recent real-time QA models top N% highest confidence answers for NQ. If we like DensePhrases (40.9%, Lee et al., 2021). We require models to answer 75% of user questions, find that larger, slower RePAQ rerankers achieve RePAQ’s accuracy on the questions it does answer higher accuracy. However, even the slowest RePAQ is 59%, whereas FiD, which has poorer calibration, is 3x faster than FiD-base, whilst only being 0.8% scores only 55%. This difference is even more less accurate, and 12x faster than FiD-large. pronounced with stricter thresholds – with coverage

10The HNSW index has negligible (∼0.1%) drop in re- 12We also investigate improving FiD’s calibration using an triever accuracy compared to a flat index auxiliary model, see Appendix A.6. We find that the most 11System details can be found in Appendix A.5 effective way to calibrate FiD is to use RePAQ’s confidences of 50%, RePAQ outperforms FiD by over 10%. FiD Input: who was the film chariots of fire about A: Eric Liddell only outperforms RePAQ when we require systems who was the main character in chariots of fire A: Eric Liddell X who starred in the movie chariots of fire A: Ian Charleson  to answer more than 85% of questions. which part did straan rodger play in chariots of fire A: Sandy McGrath  Whilst RePAQ’s selective QA is useful in its own who played harold in the 1981 film chariots of fire A: Ben Cross  right, it also allows us to combine the slow but ac- who is the main character in chariots of fire A: Eric Liddell X curate FiD with the fast and precise RePAQ, which Input: what is the meaning of the name didymus A: twin what language does the name didymus come from A: Greek  we refer to as backoff. We first try to answer with where does the name didymus come from in english A: Greek  RePAQ, and if the confidence is below a threshold what does the word domus mean in english A: home  determined on validation data, we pass the ques- how long has the term domus been used A: 1000s of years  what does the greek word didyma mean A: twin tion onto FiD. For NQ, the combined system is X Input: what is the name of a group of llamas A: herd 2.1x faster than FiD-large, with RePAQ answering what are llamas and alpacas considered to be A: domesticated  57% of the questions, and the overall accuracy is what are the figures of llamas in azapa valley A: Atoca  1% higher than FiD-large (see table3). what are the names of the llamas in azapa valley A: Atoca  what is the scientific name for camels and llamas A:Camelidae  If inference speed is a priority, the threshold can are llamas bigger or smaller than current forms A:larger  be decreased so that RePAQ answers 80% of the questions, which retains the same overall accuracy Table 6: Examples of top 5 retrieved QA-pairs for NQ. as FiD, with a 4.6x speedup. For TriviaQA, the Italics indicate QA-pairs chosen by reranker. combined system backs off to FiD earlier, due to Q A-only No the stronger relative performance of FiD. Addi- Model Total tional details can be found in appendix A.6 Overlap Overlap Overlap CBQA BART w/ NQ 26.5 67.6 10.2 0.8 5.2.5 Analysing RePAQ’s Predictions CBQA BART w/ NQ+PAQ 28.2 52.8 24.4 9.4 Some examples of top retrieved questions are + final NQ finetune 32.7 69.8 22.2 7.51 shown Table6. When RePAQ answers correctly, RePAQ (retriever only) 41.7 65.4 31.7 21.4 RePAQ (with reranker) 47.3 73.5 39.7 26.0 the retrieved question is a paraphrase of the test question from PAQ in 89% of cases. As such, Table 7: NQ Behavioural splits (Lewis et al., 2020b). there is high (80.8 ROUGE-L) similarity between “Q overlap” are test questions with paraphrases in train- correctly-answered test questions and the top re- ing data. “A-only” are test questions where answers trieved questions. 9% of test questions even exist appear in training data, but questions do not. “No over- verbatim in PAQ, and are thus trivial to answer. The lap” where neither question or answer overlap. reranker primarily improves over the retriever for ambiguous cases, and cases where the top retrieved answer it correctly, that QA-pair will not be added answer does not have the right granularity. In 32% to PAQ, and RePAQ cannot use it to answer the of cases, RePAQ does not retrieve the correct an- test question. The NQ FiD-base-50-passage model swer in the top 50 QA-pairs, suggesting a lack of used for filtering scores 46.1% and 53.1% for NQ coverage may be a significant source of error. In and TriviaQA respectively. RePAQ actually outper- these cases, retrieved questions are much less simi- forms the filter model on NQ by 1.6%. This is pos- lar to the test question than for correctly answered sible because generated questions may be phrased questions, dropping by 20 ROUGE-L. We also ob- in such a way that they are easier to answer, e.g. be- serve cases where retrieved questions match the test ing less ambiguous (Min et al., 2020b). RePAQ can question, but the retrieved answer does not match then retrieve the paraphrased QA-pair and answer the desired answer. This is usually due to different correctly, even if the filter could not answer the answer granularity, but in a small number of cases test question directly. The filtering model’s weaker was due to factually incorrect answers. scores on TriviaQA helps explain why RePAQ is 5.2.6 Does the Filtering Model Limit not as strong on this dataset. We speculate that us- RePAQ’s Accuracy? ing a stronger filtering model for TriviaQA would As RePAQ relies on retrieving paraphrases of test in turn improve RePAQ’s results. questions, we may expect that the ODQA filtering 5.3 Closed-book QA vs RePAQ model places an upper bound on it’s performance. For example, if a valid QA-pair is generated which Table7 shows results on test set splits which mea- overlaps with a test QA-pair, but the filter cannot sure how effectively models memorise QA-pairs from the NQ train set (“Q overlap”), and generalise that are likely to be asked. QA-pairs have also to novel questions (“A overlap only” and “No over- been used in semantic role labelling, such as lap”).13 Comparing CBQA models trained on NQ QA-SRL (FitzGerald et al., 2018). vs those trained on NQ and PAQ show that models trained with PAQ answer more questions correctly Real-time ODQA Systems that prioritise fast from the “A-only overlap” and “No overlap” cate- runtimes over accuracy are sometimes referred to gories, indicating they have learnt facts not present as real-time QA systems (Seo et al., 2018). Den- in the NQ train set. Applying additional NQ fine- SPI (Seo et al., 2019) and a contemporary work, tuning on the PAQ CBQA model improves scores DensePhrases (Lee et al., 2021), approach this by on “Q overlap” (indicating greater memorisation indexing all possible phrases in a background cor- of NQ), but scores on the other categories drop pus, and learn mappings from questions to passage- (indicating reduced memorisation of PAQ). How- phrase pairs. We also build an index for fast answer- ever, RePAQ, which explicitly retrieves from PAQ ing, but generate and index globally-answerable rather than memorising it in parameters, strongly questions. Indexing QA-pairs can be considered as outperforms the CBQA model in all categories, indexing summaries of important facts from the cor- demonstrating that the CBQA model struggles to pus, rather than indexing the corpus itself. We also memorise enough facts from PAQ. CBQA mod- generate and store multiple questions per passage- els with more parameters may be better able to answer pair, relieving information bottlenecks from memorise PAQ, but have downsides in terms of encoding a passage-answer pair into a single vector. system resources. Future work should address how Question Generation for QA Question gener- to better store PAQ in CBQA model parameters. ation has been used for various purposes, such 6 Related Work as data augmentation (Alberti et al., 2019; Lewis et al., 2019b; Lee et al., 2021), improved re- ODQA has received much attention in both for its trieval (Nogueira et al., 2019), generative mod- practical applications, and as a benchmark for how elling for contextual QA (Lewis and Fan, 2018), NLP models store and access knowledge (Chen as well as being studied in its own right (Du et al., and Yih, 2020; Petroni et al., 2020b). 2017; Hosking and Riedel, 2019). Serban et al. (2016) generate large numbers of questions from KBQA A number of early approaches in ODQA Freebase, but do not address how to use them for focused on using structured KBs (Berant et al., QA. Closest to our work is the recently-proposed 2013) such as Freebase (Bollacker et al., 2007), OceanQA (Fang et al., 2020). OceanQA first gen- with recent examples fromF evry´ et al.(2020) and erates contextual QA-pairs from Wikipedia. At Verga et al.(2020). This approach often has high test-time, a document retrieval system is used to re- precision but suffers when KB doesn’t match user trieve the most relevant passage for a question and requirements, or where the schema limits what the closest pre-generated QA-pair from that pas- knowledge can be stored. We populate our knowl- sage is then selected. In contrast, we focusing on edgebase with semi-structured QA pairs which are generating a large KB of non-contextual, globally- specifically likely to be relevant at test time miti- consistent ODQA questions and exploring what gating both of these drawbacks, and sharing many QA systems are facilitated by such a resource. of the benefits, such as precision and extensibility. Open Information Extraction Our work 7 Conclusion touches on automatic KB construction and open We have introduced PAQ, a dataset of 65M QA- information extraction (OpenIE) (Angeli et al., pairs, and explored how it could be used to improve 2015). Here, the goal is to mine facts from free ODQA models. We demonstrated the effectiveness text into structured or semi-structured forms, of RePAQ, a system which retrieves from PAQ, typically (subject, relation, object) triples for use in terms of accuracy, speed, space efficiency and in tasks such as slot-filling (Surdeanu, 2013). We selective QA. Generating PAQ is computationally generate natural language QA-pairs rather than intensive due to its large scale, but should be a use- OpenIE triples, and we do not attempt to extract ful, re-usable resource for more accurate, smaller all possible facts in a corpus, focusing only those and faster QA models. Nevertheless, future work 13See Lewis et al.(2020b) for further details. should be carried out to improve the efficiency of generation in order to expand PAQ’s coverage. Danqi Chen and Wen-tau Yih. 2020. Open-domain We also demonstrated PAQ’s utility for improved question answering. In ACL (tutorial), pages 34–37. Association for Computational Linguistics. CBQA models, but note a large accuracy gap be- tween our CBQA models and RePAQ. Explor- Nicola De Cao, Gautier Izacard, Sebastian Riedel, ing the trade-offs between storing and retrieving and Fabio Petroni. 2020. Autoregressive Entity knowledge parametrically or non-parametrically is Retrieval. arXiv:2010.00904 [cs, stat]. ArXiv: 2010.00904. a topic of great current interest (Lewis et al., 2020a; De Cao et al., 2020), and PAQ should be a useful Jacob Devlin, Ming-Wei Chang, Kenton Lee, and testbed for probing this relationship further. We Kristina Toutanova. 2019. BERT: Pre-training of also note that PAQ could be used as general data- Deep Bidirectional Transformers for Language Un- derstanding. In Proceedings of the 2019 Conference augmentation when training any open-domain QA of the North American Chapter of the Association model or retriever. Whilst we consider such work for Computational Linguistics: Human Language out-of-scope for this paper, leveraging PAQ to im- Technologies, Volume 1 (Long and Short Papers), prove retrieve-and-read and other systems systems pages 4171–4186, Minneapolis, Minnesota. Associ- ation for Computational Linguistics. should be explored in future work. Pedro Domingos. 2020. Every Model Learned by Gra- dient Descent Is Approximately a Kernel Machine. References arXiv:2012.00152 [cs, stat]. ArXiv: 2012.00152. Chris Alberti, Daniel Andor, Emily Pitler, Jacob De- Xinya Du, Junru Shao, and Claire Cardie. 2017. Learn- vlin, and Michael Collins. 2019. Synthetic QA Cor- ing to Ask: Neural Question Generation for Reading pora Generation with Roundtrip Consistency. In Comprehension. In Proceedings of the 55th Annual Proceedings of the 57th Annual Meeting of the Meeting of the Association for Computational Lin- Association for Computational Linguistics, pages guistics (Volume 1: Long Papers), pages 1342–1352, 6168–6173, Florence, Italy. Association for Compu- Vancouver, Canada. Association for Computational tational Linguistics. Linguistics. Yuwei Fang, Shuohang Wang, Gan, Siqi Sun, Gabor Angeli, Melvin Jose Johnson Premkumar, and and Jingjing Liu. 2020. Accelerating Real- Christopher D. Manning. 2015. Leveraging Lin- Time Question Answering via Question Generation. guistic Structure For Open Domain Information Ex- arXiv:2009.05167 [cs]. ArXiv: 2009.05167. traction. In Proceedings of the 53rd Annual Meet- ing of the Association for Computational Linguis- Nicholas FitzGerald, Julian Michael, Luheng He, and tics and the 7th International Joint Conference on Luke Zettlemoyer. 2018. Large-Scale QA-SRL Pars- Natural Language Processing (Volume 1: Long Pa- ing. arXiv:1805.05377 [cs]. ArXiv: 1805.05377. pers), pages 344–354, Beijing, China. Association for Computational Linguistics. Jerome H. Friedman. 2001. Greedy function approx- imation: A gradient boosting machine. Annals of Akari Asai and Eunsol Choi. 2020. Challenges in Statistics, 29(5):1189–1232. Publisher: Institute of Information Seeking QA:Unanswerable Questions Mathematical Statistics. and Paragraph Retrieval. arXiv:2010.11915 [cs]. ArXiv: 2010.11915. Thibault Fevry,´ Livio Baldini Soares, Nicholas FitzGer- ald, Eunsol Choi, and Tom Kwiatkowski. 2020. En- Jonathan Berant, Andrew Chou, Roy Frostig, and Percy tities as Experts: Sparse Memory Access with En- Liang. 2013. Semantic parsing on freebase from tity Supervision. arXiv:2004.07202 [cs]. ArXiv: question-answer pairs. In EMNLP, pages 1533– 2004.07202. 1544. ACL. Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. Retrieval-Augmented Language Model Pre- 2007. Freebase: A shared database of structured Training. arXiv:2002.08909 [cs]. ArXiv: general human knowledge. In AAAI, pages 1962– 2002.08909. 1963. AAAI Press. Matthew Honnibal and Ines Montani. 2017. spaCy 2: Danqi Chen, Adam Fisch, Jason Weston, and Antoine Natural language understanding with Bloom embed- Bordes. 2017. Reading Wikipedia to Answer Open- dings, convolutional neural networks and incremen- Domain Questions. In Proceedings of the 55th An- tal parsing. To appear. nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870– Tom Hosking and Sebastian Riedel. 2019. Eval- 1879, Vancouver, Canada. Association for Computa- uating Rewards for Question Generation Models. tional Linguistics. arXiv:1902.11049 [cs]. ArXiv: 1902.11049. Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Uszkoreit, Quoc Le, and Slav Petrov. 2019a. Natu- Ramshaw, and Ralph Weischedel. 2006. OntoNotes: ral Questions: a Benchmark for Question Answering the 90% solution. In Proceedings of the Human Lan- Research. Transactions of the Association of Com- guage Technology Conference of the NAACL, Com- putational Linguistics. panion Volume: Short Papers, NAACL-Short ’06, pages 57–60, USA. Association for Computational Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- Linguistics. field, Michael Collins, Ankur P. Parikh, Chris Al- berti, Danielle Epstein, Illia Polosukhin, Jacob De- Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, vlin, Kenton Lee, Kristina Toutanova, Llion Jones, and Jason Weston. 2020. Poly-encoders: Trans- Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, former Architectures and Pre-training Strategies Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019b. for Fast and Accurate Multi-sentence Scoring. Natural questions: a benchmark for question answer- arXiv:1905.01969 [cs]. ArXiv: 1905.01969. ing research. Trans. Assoc. Comput. Linguistics, 7:452–466. Gautier Izacard and Edouard Grave. 2020. Leveraging Passage Retrieval with Generative Models for Open Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Domain Question Answering. arXiv:2007.01282 Kevin Gimpel, Piyush Sharma, and Radu Soricut. [cs]. ArXiv: 2007.01282. 2020. Albert: A lite bert for self-supervised learning of language representations. In International Con- Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola ference on Learning Representations. De Cao, Sebastian Riedel, and Edouard Grave. 2020. A Memory Efficient Baseline for Open Do- Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi main Question Answering. arXiv:2012.15156 [cs]. Chen. 2021. Learning Dense Representations of ArXiv: 2012.15156. Phrases at Scale. arXiv:2012.12624 [cs]. ArXiv: 2012.12624. Zhengbao Jiang, Jun Araki, Haibo Ding, and Gra- ham Neubig. 2020a. How Can We Know When Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Language Models Know? arXiv:2012.00955 [cs]. 2019a. Latent Retrieval for Weakly Supervised ArXiv: 2012.00955. Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Zhengbao Jiang, Wei Xu, Jun Araki, and Gra- Computational Linguistics ham Neubig. 2020b. Generalizing Natural Lan- , pages 6086–6096, Flo- guage Analysis through Span-relation Representa- rence, Italy. Association for Computational Linguis- tions. arXiv:1911.03822 [cs]. ArXiv: 1911.03822. tics. Jeff Johnson, Matthijs Douze, and Herve´ Jegou.´ Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2017. Billion-scale similarity search with GPUs. 2019b. Latent retrieval for weakly supervised open arXiv:1702.08734 [cs]. ArXiv: 1702.08734. domain question answering. In ACL (1), pages 6086–6096. Association for Computational Linguis- Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke tics. Zettlemoyer. 2017. TriviaQA: A Large Scale Dis- tantly Supervised Challenge Dataset for Reading Mike Lewis and Angela Fan. 2018. Generative Ques- Comprehension. In Proceedings of the 55th Annual tion Answering: Learning to Answer the Whole Meeting of the Association for Computational Lin- Question. In International Conference on Learning guistics (Volume 1: Long Papers), pages 1601–1611, Representations. Vancouver, Canada. Association for Computational Linguistics. Mike Lewis, Yinhan Liu, Naman Goyal, Mar- jan Ghazvininejad, Abdelrahman Mohamed, Omer Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Levy, Ves Stoyanov, and Luke Zettlemoyer. S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi 2019a. BART: Denoising Sequence-to-Sequence Chen, and Wen-tau Yih. 2020a. Dense passage Pre-training for Natural Language Generation, retrieval for open-domain question answering. In Translation, and Comprehension. arXiv:1910.13461 EMNLP (1), pages 6769–6781. Association for [cs, stat]. ArXiv: 1910.13461. Computational Linguistics. Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel. Vladimir Karpukhin, Barlas Oguz,˘ Sewon Min, 2019b. Unsupervised Question Answering by Cloze Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Translation. In Proceedings of the 57th Annual Chen, and Wen-tau Yih. 2020b. Dense Passage Meeting of the Association for Computational Lin- Retrieval for Open-Domain Question Answering. guistics, pages 4896–4910, Florence, Italy. Associa- arXiv:2004.04906 [cs]. ArXiv: 2004.04906. tion for Computational Linguistics. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- Patrick Lewis, Ethan Perez, Aleksandara Piktus, field, Michael Collins, Ankur Parikh, Chris Alberti, Fabio Petroni, Vladimir Karpukhin, Naman Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Goyal, Heinrich Kuttler,¨ Mike Lewis, Wen-tau Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Yih, Tim Rocktaschel,¨ Sebastian Riedel, and Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Douwe Kiela. 2020a. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Myle Ott, Sergey Edunov, Alexei Baevski, Angela arXiv:2005.11401 [cs]. ArXiv: 2005.11401. Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. Toolkit for Sequence Modeling. arXiv:1904.01038 2020b. Question and Answer Test-Train Over- [cs]. ArXiv: 1904.01038. lap in Open-Domain Question Answering Datasets. arXiv:2008.02637 [cs]. ArXiv: 2008.02637. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Antiga, Alban Desmaison, Andreas Kopf,¨ Edward Joshi, Danqi Chen, Omer Levy, Mike Lewis, Yang, Zach DeVito, Martin Raison, Alykhan - Luke Zettlemoyer, and Veselin Stoyanov. 2019a. jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Roberta: A robustly optimized bert pretraining ap- Junjie Bai, and Soumith Chintala. 2019. PyTorch: proach. arXiv preprint arXiv:1907.11692. An Imperative Style, High-Performance Deep Learn- ing Library. arXiv:1912.01703 [cs, stat]. ArXiv: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- 1912.01703. dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim RoBERTa: A Robustly Optimized BERT Pretrain- Rocktaschel,¨ Yuxiang Wu, Alexander H. Miller, and ing Approach. arXiv:1907.11692 [cs]. ArXiv: Sebastian Riedel. 2020a. How Context Affects Lan- 1907.11692. guage Models’ Factual Predictions. In Automated Knowledge Base Construction. A. Malkov and D. A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick using Hierarchical Navigable Small World graphs. Lewis, Majid Yazdani, Nicola De Cao, James arXiv:1603.09320 [cs]. ArXiv: 1603.09320. Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktaschel,¨ and Sebastian Riedel. 2020b. KILT: Sewon Min, Jordan Boyd-Graber, Chris Alberti, a Benchmark for Knowledge Intensive Language Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Tasks. arXiv:2009.02252 [cs]. ArXiv: 2009.02252. Guu, Hannaneh Hajishirzi, Kenton Lee, Jenni- Colin Raffel, Noam Shazeer, Adam Roberts, Kather- maria Palomaki, Colin Raffel, Adam Roberts, Tom ine Lee, Sharan Narang, Michael Matena, Yanqi Kwiatkowski, Patrick Lewis, Yuxiang Wu, Hein- Zhou, Wei Li, and Peter J. Liu. 2020. Exploring rich Kuttler,¨ Linqing Liu, Pasquale Minervini, Pon- the Limits of Transfer Learning with a Unified Text- tus Stenetorp, Sebastian Riedel, Sohee Yang, Min- to-Text Transformer. Journal of Machine Learning joon Seo, Gautier Izacard, Fabio Petroni, Lu- Research, 21(140):1–67. cas Hosseini, Nicola De Cao, Edouard Grave, Ikuya Yamada, Sonse Shimaoka, Masatoshi Suzuki, Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Shumpei Miyawaki, Shun Sato, Ryo Takahashi, Jun Know What You Don’t Know: Unanswerable Ques- Suzuki, Martin Fajcik, Martin Docekal, Karel On- tions for SQuAD. In Proceedings of the 56th An- drej, Pavel Smrz, Hao Cheng, Yelong Shen, Xi- nual Meeting of the Association for Computational aodong Liu, Pengcheng He, Weizhu Chen, Jian- Linguistics (Volume 2: Short Papers), pages 784– feng Gao, Barlas Oguz, Xilun Chen, Vladimir 789, Melbourne, Australia. Association for Compu- Karpukhin, Stan Peshterliev, Dmytro Okhonko, tational Linguistics. Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Wen-tau Yih. 2020a. NeurIPS 2020 - Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and ficientQA Competition: Systems, Analyses and Percy Liang. 2016. SQuAD: 100,000+ Questions Lessons Learned. arXiv:2101.00133 [cs]. ArXiv: for Machine Comprehension of Text. In Proceed- 2101.00133. ings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Austin, Texas. Association for Computational Lin- Luke Zettlemoyer. 2019. A discrete hard EM ap- guistics. proach for weakly supervised question answering. In EMNLP/IJCNLP (1), pages 2851–2864. Associ- Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. ation for Computational Linguistics. How Much Knowledge Can You Into the Pa- rameters of a Language Model? arXiv:2002.08910 Sewon Min, Julian Michael, Hannaneh Hajishirzi, [cs, stat]. ArXiv: 2002.08910. and Luke Zettlemoyer. 2020b. AmbigQA: Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, Answering Ambiguous Open-domain Questions. and Jordan Boyd-Graber. 2019. Quizbowl: arXiv:2004.10645 [cs]. ArXiv: 2004.10645. The Case for Incremental Question Answering. arXiv:1904.04792 [cs]. ArXiv: 1904.04792. Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Minjoon Seo, Tom Kwiatkowski, Ankur Parikh, Ali Query Prediction. arXiv:1904.08375 [cs]. ArXiv: Farhadi, and Hannaneh Hajishirzi. 2018. Phrase- 1904.08375. Indexed Question Answering: A New Challenge for Scalable Document Comprehension. In Proceed- and Madian Khabsa. 2021. Studying Strategi- ings of the 2018 Conference on Empirical Methods cally: Learning to Mask for Closed-book QA. in Natural Language Processing, pages 559–564, arXiv:2012.15856 [cs]. ArXiv: 2012.15856. Brussels, Belgium. Association for Computational Linguistics.

Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. In Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics, pages 4430–4441, Florence, Italy. Association for Computational Linguistics.

Iulian Vlad Serban, Alberto Garc´ıa-Duran,´ Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus. In Pro- ceedings of the 54th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), pages 588–598, Berlin, Germany. Associa- tion for Computational Linguistics.

M. Surdeanu. 2013. Overview of the TAC2013 Knowl- edge Base Population Evaluation: English Slot Fill- ing and Temporal Slot Filling. TAC.

Pat Verga, Haitian Sun, Livio Baldini Soares, and William W. Cohen. 2020. Facts as Experts: Adapt- able and Interpretable Neural Memory over Sym- bolic Knowledge. arXiv:2007.00849 [cs]. ArXiv: 2007.00849.

William Wang, Angelina Wang, Aviv Tamar, Xi Chen, and Pieter Abbeel. 2018. Safer Classification by Synthesis. arXiv:1711.08534 [cs, stat]. ArXiv: 1711.08534.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Remi´ Louf, Morgan Funtow- icz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace’s Transformers: State-of-the-art Nat- ural Language Processing. arXiv:1910.03771 [cs]. ArXiv: 1910.03771.

Jinfeng Xiao, Lidan Wang, Franck Dernoncourt, Trung Bui, Tong Sun, and Jiawei Han. 2020. Open- Domain Question Answering with Pre-Constructed Question Spaces. arXiv:2006.08337 [cs]. ArXiv: 2006.08337.

Wei Yang, Yuqing Xie, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. Data Augmenta- tion for BERT Fine-Tuning in Open-Domain Ques- tion Answering. arXiv:1904.06652 [cs]. ArXiv: 1904.06652.

Qinyuan Ye, Belinda Z. Li, Sinong Wang, Ben- jamin Bolte, Hao Ma, Wen-tau Yih, Xiang Ren, A Appendices A.1 Dataset splits 45

For Natural Questions (NQ, Kwiatkowski et al., 40 2019b), we use the standard Open-Domain splits in 35 our experiments, consisting of 79,168 train, 8,757 PAQ-retriever development, and 3,610 test question-answer pairs. NQ Dev Exact Match PAQ-reranker 30 For TriviaQA, We use the open-domain train-test FID-Large splits, which correspond to the unfiltered-train and 0 2 4 6 8 10 unfiltered-dev reading comprehension splits (Lee Memory (GB) et al., 2019b; Min et al., 2019; Karpukhin et al., Figure 4: System size vs. accuracy for RePAQ and FID- 2020a). large as a function of the number of items in the index, with different experimental setup than the main paper A.2 Futher details on Passage selection Our passage selection model is based on model fails to capture that. For example for “un- RoBERTaBASE (Liu et al., 2019b). We feed each passage we select into the model as input and use der a 109–124 loss to the Milwaukee Bucks”, the an MLP on top of RoBERTa’s [CLS] token to pre- question is generated as “when did ... play for dict if it is positive or negative. We then use this the toronto raptors”. 3) Very few (2%) questions model to perform inference and obtain a score for mismatch question type words (Wh-types) with an- every single passage in the whole wikipedia cor- swers when the answers are words rarely asked pus. The top N passages ranked by its score are (e.g. first). served as the candidate pool we generate answers A.4 Further details on System Size vs from. The model is optimized for a higher recall, Accuracy such that the positive passages should be identified with a high probability. Our model achieved 84.7% The system experiment described in Section 5.2.2 recall on the Natural Questions dev set. measures the required to store the models, the text of the documents/QA-pairs, and a dense A.3 Further Details on Question Quality index. For figure2, We assume models are stored For NQ, we find that the retrieved questions are at fp16 precision, the text has been compressed 14 paraphrases of the test questions in the majority of using LZMA , and the indexes both use 768d cases. The test questions in TriviaQA are mostly vectors, and Product Quantization. These are re- very specific, and whilst retrieved questions in PAQ alistic defaults, with usage in the question an- contain less details, they still usually have correct swering literature (Izacard et al., 2020; Min et al., answers. To better evaluate the quality of the ques- 2020a). The RePAQ model used in this figure con- tions, we conduct human evaluation on 50 random sists of an ALBERT-base retriever and ALBERT- sampled questions generated from the wikipedia xxlarge reranker, and the FID system consists passage pool. We have the following observations: of DPR (Karpukhin et al., 2020b) (itself consist- 1) the majority (82%) questions accurately capture ing of two BERT-base retrievers) and a T5-large the context of the answer in the passage, and con- reader (Raffel et al., 2020). Using a different sys- tain relevant and informative details to locate the tem setup (for example using full precision to store answer. 2) 16% questions fail to understand the the models, no text compression, and fp16 index semantics of certain types of answers. The failure quantization, shown in figure4) shifts the relative can be traced to: Mistaking extremely similar enti- position of the curves in Figure2, but not the quali- ties. For example, given the sentence “The Kerch tative observation that RePAQ models can be com- Peninsula is ... located at the eastern end of the pressed to smaller sizes before significant drop in Crimean Peninsula” and the answer “the Crimean accuracy. Peninsula”, the generated question is “what is the A.5 Further details on Inference speed eastern end of the kerch peninsula”; Generalisation The system used for inference speed benchmarking to rare phrases. The digit combinations appearing is a machine learning workstation with 80 CPU in passages mostly stand for date or timestamp. However, it is not applied to all the cases and the 14https://tukaani.org/xz/ 90 RePAQ-retr. RePAQ 90 RePAQ-rerank FiD-L, logprobs 80 FID-Large FiD-L, loss 80 FiD-L, GBM 70 FiD-L, RePAQ 70

NQ Exact Match 60

TriviaQA Exact Match 60

50 50 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 Fraction of Questions Answered Fraction of Questions Answered

Figure 5: Risk Coverage Plot for FiD and RePAQ on Figure 6: Risk Coverage Plot for different Calibration TriviaQA. FID has higher overall accuracy, but RePAQ scores for FiD-Large (RePAQ included for compari- with a reranker still outperforms it for coverages <50% son). Using RePAQ’s answer confidence scores to cali- brate FiD leads to the best results for FiD cores, 512GB of CPU RAM and access to one 32GB NVIDIA V100 GPU. Inference is carried out (2020a). We train a Gradient Boosting Machine at mixed precision for all systems, and questions (GBM, Friedman, 2001) on the dev set of NQ, are allowed to be answered in parallel. Models are which attempts to predict whether FID-large has implemented in Pytorch (Paszke et al., 2019) using answered correctly or not. The GBM is given FID’s Transformers (Wolf et al., 2020). Measurements answer loss, answer log probability and the re- are repeated 3 times and the mean time is reported, trieval score of the top 100 retrieved documents rounded to an appropriate significant figure. The from DPR. Figure6 shows these results. We first HNSW index used in this experiment indexes all note that FiD-Large’s answer loss and answer log 65M PAQ QA-pairs with vectors of dimension 768 probabilities perform similarly, and both struggle dimensions, uses an ef construction of 80, to calibrate FiD, as mentioned in the main paper. ef search of 32, and store n of 256, and per- The GBM does improve calibration, especially at forms up to 2048 index searches in parallel. This lower coverages, but still lags behind RePAQ by index occupies 220GB, but could be considerably 7% EM at 50% coverage. We also note a surprising compressed with techniques including scalar or finding that we can acutally use RePAQ’s scores to product quantization, or training retrievers with calibrate FiD. Here, we use FiD’s predicted answer, smaller projection dimensions. The reader is re- but RePAQ’s confidence score to decide whether to ferred to Izacard et al.(2020) and Lewis et al. answer or not. This result is also plotted in Figure (2020a, Appendix C) for experiments demonstrat- 6, and results in the best risk-coverage curve for ing index compression with little-to-no drop in ac- FiD. However, this is still not as well-calibrated as curacy. simply using RePAQ, highlighting the strength of RePAQ for selective QA further. A.6 Further details on Selective Question A.7 Additional Model training details Answering RePAQ models were trained for up to 3 days on a Figure5 shows the Risk-Coverage plot for Trivi- machine with 8 NVIDIA 32GB V100 GPUs. Val- aQA. The results are qualitatively similar as those idation Exact match score was used to determine for NQ (see Figure3 in the main paper), although when to stop training in all cases. RePAQ retrievers FiD’s stronger overall performance on TriviaQA were trained using Fairseq (Ott et al., 2019), and shifts its risk-coverage curve up the accuracy axis rerankers were trained in Transformers (Wolf et al., relative to RePAQ. FiD also seems a little better cal- 2020) in Pytorch. The PAQ BART CBQA models ibrated on TriviaQA than NQ, indicated by greater were trained in Fairseq for 6 days on 8 V100s, after gradient. However, RePAQ remains better cali- which validation accuracy had plateaued. Stan- brated than FiD, and outperforms it for answer dard BART hyperparameters were used, apart from coverages below 50%. batch size and learning rate, which were tuned to We also investigate ways to improve the cali- try to promote faster learning, but learning became bration of FiD-large on NQ, using post-hoc cali- unstable with learning rates greater than 0.0001. bration, using an approach similar to Jiang et al.