Combining Lexical and Dense Retrieval for Computationally Efficient Multi-hop Question Answering

Georgios Sidiropoulos1, Nikos Voskarides2*, Svitlana Vakulenko1, Evangelos Kanoulas1
1 University of Amsterdam, Amsterdam, The Netherlands
2 Amazon, Barcelona, Spain
[email protected], [email protected], [email protected], [email protected]

* Research conducted when the author was at the University of Amsterdam.

Abstract

In simple open-domain question answering (QA), dense retrieval has become one of the standard approaches for retrieving the relevant passages to infer an answer. Recently, dense retrieval also achieved state-of-the-art results in multi-hop QA, where aggregating information from multiple documents and reasoning over them is required. Despite their success, dense retrieval methods are computationally intensive, requiring multiple GPUs to train. In this work, we introduce a hybrid (lexical and dense) retrieval approach that is highly competitive with the state-of-the-art dense retrieval models, while requiring substantially fewer computational resources. Additionally, we provide an in-depth evaluation of dense retrieval methods in limited computational resource settings, something that is missing from the current literature.

1 Introduction

Multi-hop QA requires retrieval and reasoning over multiple pieces of information (Yang et al., 2018). For instance, consider the multi-hop question: "Where is the multinational company founded by Robert Smith headquartered?". To answer this question we first need to retrieve the passage about Robert Smith in order to find the name of the company he founded (General Mills), and subsequently retrieve the passage about General Mills, which contains the answer to the question (Golden Valley, Minnesota). Even though multi-hop QA requires multiple retrieval hops, it is fundamentally different from session search (Yang et al., 2015; Levine et al., 2017) and conversational search (Dalton et al., 2019; Voskarides et al., 2020; Vakulenko et al., 2021), since in multi-hop QA the information need of the user is expressed in a single question, thus not requiring multiple turns of interaction.

QA systems typically consist of (i) a retriever that identifies the passage/document in the underlying collection that contains the answer to the user's question, and (ii) a reader that extracts or generates the answer from the identified passage (Chen et al., 2017). Given that the answer often cannot be found in the top-ranked passage, inference follows a standard beam-search procedure, where the top-k passages are retrieved and reader scores are computed for all k passages (Lee et al., 2019). However, readers are very sensitive to noise in the top-k passages, which makes the performance of the retriever critical for the performance of QA systems (Yang et al., 2019). This is further amplified in multi-hop QA, where multiple retrieval hops are performed; potential retrieval errors propagate across hops and thus severely harm QA performance.

The majority of current approaches to multi-hop QA use either traditional IR methods (TF-IDF, BM25) (Qi et al., 2019) or graph-based methods (Nie et al., 2019; Asai et al., 2020) for the retriever. However, those approaches have serious limitations: the former require high lexical overlap between questions and relevant passages, while the latter rely on an interlinked underlying corpus, which is not always available. Recently, Xiong et al. (2021) introduced a dense multi-hop passage retrieval model (MDR) that constructs a new query representation based on the question and previously retrieved passages, and subsequently uses the new representation to retrieve the next set of relevant passages. This model achieved state-of-the-art results while not relying on an interlinked underlying corpus.

Even though dense retrieval models achieve state-of-the-art results on multi-hop QA, they are computationally intensive, requiring multiple GPUs to train. Existing work only reports results for the cases where such resources are available, and therefore provides no answer as to how feasible it is to train such models in a low-resource setting. In this paper, we focus on developing an efficient retriever for multi-hop QA that can be trained effectively in a low computational resource setting. We aim to answer the following research questions:

RQ1 How does the performance of dense retrieval compare to lexical and hybrid approaches?

RQ2 How does the performance degrade in low computational resource settings?

Our main contributions are the following: (i) we propose a hybrid (lexical and dense) retrieval model which is competitive against its fully dense competitors while requiring eight times less computational power, and (ii) we perform a thorough analysis of the performance of dense passage retrieval models on the task of multi-hop QA.1

1 Our trained models and data are available at https://github.com/GSidiropoulos/hybrid_retrieval_for_efficient_qa.

2 Task Description

Let $p \in C$ denote a passage within a passage collection $C$, and $q$ a multi-hop question. Given $q$ and $C$, the task is to retrieve a set of relevant passages $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i \in C$. In the multi-hop scenario we consider here, not all relevant passages can be retrieved using the input question $q$ alone, because there is low lexical overlap or semantic relationship between the question $q$ and one or more of the relevant passages in $P$. In this case, information from one of the relevant passages $p_i$ is needed to retrieve another relevant passage $p_j$, where $p_j$ may be lexically/semantically different from the question $q$.

3 Experimental Setup

In this section, we describe the dataset used in our experiments, the metrics we use to answer our research questions, and the models we compare against.

3.1 Dataset

For our experiments, we focus on the HotpotQA dataset, and particularly its full-wiki setting (Yang et al., 2018). HotpotQA is a large-scale 2-hop QA dataset where the answers to questions must be found in the context of the entire Wikipedia. Questions in HotpotQA fall into one of two categories: bridge or comparison. In bridge questions, the bridge entity that connects the two relevant passages is missing; e.g., "When did the show that Skeet Ulrich is currently starring in premiere?", where the bridge entity "Riverdale (2017 TV series)" is missing. In comparison questions, the two main entities (of the two relevant passages) are both mentioned and compared; e.g., "Which has smaller flowers, Campsis or Kalmiopsis?".

3.2 Metrics

Following previous work (Yang et al., 2018; Xiong et al., 2021), we report passage Exact Match (EM) to measure overall retrieval performance. Exact Match evaluates whether both of the ground-truth passages for a question are included in the retrieved passages (EM=1 if so, EM=0 otherwise). Note that metrics such as EM and F1 w.r.t. the question's answer (Ans) and the sentence-level supporting facts (Sup) do not fit our experimental setup, since we focus on the retrieval part of the pipeline and not on reading.
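To make the metric concrete, below is a minimal sketch of passage-level EM@k. It is not the authors' evaluation code; the function name, and the assumption that passages are identified by their Wikipedia titles, are ours.

```python
from typing import List, Set


def passage_em_at_k(retrieved: List[str], gold: Set[str], k: int) -> int:
    """Return 1 if *both* ground-truth passages appear in the top-k results, else 0."""
    return int(gold.issubset(set(retrieved[:k])))


# Example: EM@2 is 1 only when the two gold passages are the top two results.
gold = {"Skeet Ulrich", "Riverdale (2017 TV series)"}
ranking = ["Riverdale (2017 TV series)", "Skeet Ulrich", "Scream (1996 film)"]
print(passage_em_at_k(ranking, gold, k=2))  # -> 1
print(passage_em_at_k(ranking, gold, k=1))  # -> 0
```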
3.3 Models

In this section, we describe the models we experiment with.

3.3.1 Single-hop models

Given a question, single-hop models retrieve a ranked list of passages. Thus, they are not aware of the multi-hop nature of the task.

BM25 is a standard lexical retrieval model. We use the default Anserini parameters (Yang et al., 2017).

Rerank is a standard two-stage retrieval model that first retrieves passages with BM25 and then uses BERT to rerank the top-k passages (Nogueira and Cho, 2019). The BERT (base) classifier was trained with a point-wise loss (Nogueira and Cho, 2019). It was fine-tuned on the train split of HotpotQA for 2 epochs. Training took 5 hours with a batch size of 8 on a single 12GB GPU. We experimented with k=100 and k=1000, and found that k=100 results in better reranking performance at the top positions.

DPR is a dense passage retrieval model for simple questions (Karpukhin et al., 2020). Given a question $q$, a relevant passage $p^+$ and a set of irrelevant passages $\{p^-_1, p^-_2, \ldots, p^-_m\}$, the model learns to rank $p^+$ higher via the optimization of the negative log-likelihood of the relevant passage. To train DPR on HotpotQA, a multi-hop QA dataset, we follow the procedure described in Xiong et al. (2021). This model was trained for 25 epochs (~2 days) on a single 12GB GPU, using a RoBERTa-base encoder.
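The following is a minimal sketch of that objective: the negative log-likelihood of the relevant passage under a softmax over dot-product scores. Real training encodes questions and passages with the learned encoders and typically adds in-batch negatives, which we omit here; the function name and the random stand-in embeddings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dpr_nll_loss(q_emb: torch.Tensor,
                 pos_emb: torch.Tensor,
                 neg_embs: torch.Tensor) -> torch.Tensor:
    """NLL of the relevant passage for one question.

    q_emb: [d] question embedding; pos_emb: [d] relevant passage embedding;
    neg_embs: [m, d] irrelevant passage embeddings.
    """
    # Dot-product similarity between the question and each candidate passage;
    # the relevant passage is placed at index 0.
    candidates = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)  # [1 + m, d]
    scores = candidates @ q_emb                                      # [1 + m]
    target = torch.zeros(1, dtype=torch.long)  # index of the relevant passage
    return F.cross_entropy(scores.unsqueeze(0), target)


# Toy usage with random 768-dim vectors standing in for encoder outputs.
loss = dpr_nll_loss(torch.randn(768), torch.randn(768), torch.randn(8, 768))
print(loss.item())
```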
3.3.2 Multi-hop models

These models are aware of the multi-hop nature of the task. They recursively retrieve new information at each hop by conditioning the question on information retrieved in previous hops (Xiong et al., 2021). In practice, at each hop $t$, the question $q$ and the passage retrieved in the previous hop, $p_{t-1}$, are encoded as the new query $q_t = h(q, p_{t-1})$, where $h(\cdot)$ is the question encoder; $q_t$ is then used to retrieve the next relevant passage. When $t = 1$, the query is the question alone. Differently from single-hop models, at inference time, given a question, beam search is used to obtain the top-k passage-pair candidates (sketched in code below, after Table 1).

Model        EM@2   EM@10  EM@20
BM25         0.127  0.320  0.395
Rerank       0.314  0.476  0.517
DPR          0.116  0.275  0.336

MDR          0.440  0.581  0.619
Rerank+DPR   0.599  0.732  0.762
MDR (full)   0.677  0.772  0.793

Table 1: Overall retrieval performance. In the first group we show single-hop models, while in the second group we show multi-hop models.

4.1 Overall performance

Here we aim to answer RQ1 by comparing the
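Returning to the recursive procedure of Section 3.3.2, here is a minimal sketch for the 2-hop case. `encode` stands in for the question encoder $h(\cdot)$ and `search` for nearest-neighbor search over the passage index; both names, and the additive pair scoring, are our assumptions rather than the authors' implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]  # stand-in type for an embedding


def multihop_retrieve(question: str,
                      encode: Callable[[str, str], Vector],
                      search: Callable[[Vector, int], List[Tuple[str, float]]],
                      k: int = 4) -> List[Tuple[Tuple[str, str], float]]:
    """Two-hop retrieval with a beam over passage pairs."""
    # Hop 1 (t = 1): the query is just the question, with no previous passage.
    first_hop = search(encode(question, ""), k)
    # Hop 2: re-encode the question conditioned on each first-hop passage,
    # q_t = h(q, p_{t-1}), and retrieve candidate second passages.
    pairs = []
    for p1, s1 in first_hop:
        for p2, s2 in search(encode(question, p1), k):
            # One simple choice: score a pair by the sum of its per-hop scores.
            pairs.append(((p1, p2), s1 + s2))
    # Keep the top-k passage-pair candidates (the beam).
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]
```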