Revealing the Importance of Semantic Retrieval for Machine Reading at Scale

Yixin Nie Songhe Wang Mohit Bansal UNC Chapel Hill {yixin1, songhe17, mbansal}@cs.unc.edu

Abstract

Machine Reading at Scale (MRS) is a challenging task in which a system is given an input query and is asked to produce a precise output by "reading" information from a large knowledge base. The task has gained popularity with its natural combination of information retrieval (IR) and machine comprehension (MC). Advancements in representation learning have led to separated progress in both IR and MC; however, very few studies have examined the relationship and combined design of retrieval and comprehension at different levels of granularity, for development of MRS systems. In this work, we give general guidelines on system design for MRS by proposing a simple yet effective pipeline system with special consideration on hierarchical semantic retrieval at both paragraph and sentence level, and their potential effects on the downstream task. The system is evaluated on both fact verification and open-domain multi-hop QA, achieving state-of-the-art results on the leaderboard test sets of both FEVER and HOTPOTQA. To further demonstrate the importance of semantic retrieval, we present ablation and analysis studies to quantify the contribution of neural retrieval modules at both paragraph-level and sentence-level, and illustrate that intermediate semantic retrieval modules are vital for not only effectively filtering upstream information and thus saving downstream computation, but also for shaping upstream data distribution and providing better data for downstream modeling.¹

¹ Code/data made publicly available at: https://github.com/easonnie/semanticRetrievalMRS

1 Introduction

Extracting external textual knowledge for machine comprehension systems has long been an important yet challenging problem. Success requires not only precise retrieval of the relevant information sparsely stored in a large knowledge source but also a deep understanding of both the selected knowledge and the input query to give the corresponding output. Initiated by Chen et al. (2017), the task was termed Machine Reading at Scale (MRS), seeking to provide a challenging situation where machines are required to do both semantic retrieval and comprehension at different levels of granularity for the final downstream task.

Progress on MRS has been made by improving individual IR or comprehension sub-modules with recent advancements in representation learning (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). However, partially due to the lack of annotated data for intermediate retrieval in an MRS setting, evaluations were done mainly on the final downstream task, with much less consideration of intermediate retrieval performance. This led to the convention that upstream retrieval modules mostly focus on getting better coverage of the downstream information such that the upper bound of the downstream score can be improved, rather than on finding more exact information. This convention is misaligned with the nature of MRS, where equal effort should be put into emphasizing the models' joint performance and optimizing the relationship between the semantic retrieval and the downstream comprehension sub-tasks.

Hence, to shed light on the importance of semantic retrieval for downstream comprehension tasks, we start by establishing a simple yet effective hierarchical pipeline system for MRS using Wikipedia as the external knowledge source. The system is composed of a term-based retrieval module, two neural modules for paragraph-level retrieval and sentence-level retrieval, and a neural downstream task module. We evaluated the system on two recent large-scale open-domain benchmarks for fact verification and multi-hop QA, namely FEVER (Thorne et al., 2018) and HOTPOTQA (Yang et al., 2018), in which retrieval performance can also be evaluated accurately since intermediate annotations on evidences are provided. Our system achieves state-of-the-art results with 45.32% answer EM and 25.14% joint EM on HOTPOTQA (8% absolute improvement on answer EM and doubling the joint EM over the previous best results) and 67.26% FEVER score (3% absolute improvement over previously published systems).

We then provide empirical studies to validate design decisions. Specifically, we prove the necessity of both paragraph-level retrieval and sentence-level retrieval for maintaining good performance, and further illustrate that a better semantic retrieval module is not only beneficial to achieving high recall and keeping a high upper bound for the downstream task, but also plays an important role in shaping the downstream data distribution and providing more relevant and high-quality data for downstream sub-module training and inference. These mechanisms are vital for a good MRS system on both QA and fact verification.

2 Related Work

Machine Reading at Scale First proposed and formalized in Chen et al. (2017), MRS has gained popularity with an increasing amount of work on both dataset collection (Joshi et al., 2017; Welbl et al., 2018) and MRS model development (Wang et al., 2018; Clark and Gardner, 2017; Htut et al., 2018). In some previous work (Lee et al., 2018), paragraph-level retrieval modules were mainly for improving the recall of required information, while in some other works (Yang et al., 2018), sentence-level retrieval modules were merely for solving the auxiliary sentence selection task. In our work, we focus on revealing the relationship between semantic retrieval at different granularity levels and the downstream comprehension task. To the best of our knowledge, we are the first to apply and optimize neural semantic retrieval at both paragraph and sentence levels for MRS.

Automatic Fact Checking: Recent work (Thorne and Vlachos, 2018) formalized the task of automatic fact checking from the viewpoint of machine learning and NLP. The release of FEVER (Thorne et al., 2018) stimulated many recent developments (Nie et al., 2019; Yoneda et al., 2018; Hanselowski et al., 2018) on data-driven neural networks for automatic fact checking. We consider the task also as MRS because they share almost the same setup, except that the downstream task is verification or natural language inference (NLI) rather than QA.

Information Retrieval Success in deep neural networks inspires their application to information retrieval (IR) tasks (Huang et al., 2013; Guo et al., 2016; Mitra et al., 2017; Dehghani et al., 2017). In typical IR settings, systems are required to retrieve and rank (Nguyen et al., 2016) elements from a collection of documents based on their relevance to the query. This setting might be very different from the retrieval in MRS, where systems are asked to select facts needed to answer a question or verify a statement. We refer to the retrieval in MRS as Semantic Retrieval since it emphasizes semantic understanding.

3 Method

In previous works, an MRS system can be complicated, with different sub-components processing different retrieval and comprehension sub-tasks at different levels of granularity, and with some sub-components intertwined. For interpretability considerations, we used a unified pipeline setup. The overview of the system is in Fig. 1.

To be specific, we formulate the MRS system as a function that maps an input tuple (q, K) to an output tuple (ŷ, S), where q indicates the input query, K is the textual KB, ŷ is the output prediction, and S is the selected supporting sentences from Wikipedia. Let E denote a set of necessary evidences or facts selected from K for the prediction. For a QA task, q is the input question and ŷ is the predicted answer. For a verification task, q is the input claim and ŷ is the predicted truthfulness of the input claim. For all tasks, K is Wikipedia. The system procedure is listed below:

(1) Term-Based Retrieval: To begin with, we used a combination of the TF-IDF method and a rule-based keyword matching method² to narrow the scope from the whole of Wikipedia down to a set of related paragraphs; this is a standard procedure in MRS (Chen et al., 2017; Lee et al., 2018; Nie et al., 2019). The focus of this step is to efficiently select a candidate set P_I that can cover the information as much as possible (P_I ⊂ K) while keeping the size of the set acceptable for downstream processing.

² Details of term-based retrieval are in the Appendix.

[Figure 1 appears here: the example question "When did Robben retire from Bayern?" and the example claim "Robben retired from Bayern in 2009." are each passed through term-based retrieval, paragraph-level (P-Level) retrieval, and sentence-level (S-Level) retrieval before the downstream QA and NLI modules produce the final outputs.]

Figure 1: System Overview: blue dotted arrows indicate the inference flow and the red solid arrows indicate the training flow. Grey rounded rectangles are neural modules with different functionality. The two retrieval modules were trained with all positive examples from the annotated ground-truth set and negative examples sampled from the direct upstream modules. Thus, the distribution of negative examples is subject to the quality of the upstream module.

(2) Paragraph-Level Neural Retrieval: After obtaining the initial set, we compare each paragraph in P_I with the input query q using a neural model (which will be explained later in Sec 3.1). The output of the neural model is treated as the relatedness score between the input query and the paragraph. The scores are used to sort all the upstream paragraphs. Then, P_I is narrowed to a new set P_N (P_N ⊂ P_I) by selecting the top k_p paragraphs having a relatedness score higher than some threshold value h_p (going out from the P-Level grey box in Fig. 1). k_p and h_p are chosen by keeping a good balance between the recall and precision of the paragraph retrieval.

(3) Sentence-Level Neural Retrieval: Next, we select the evidence at the sentence level by decomposing all the paragraphs in P_N into sentences. Similarly, each sentence is compared with the query using a neural model (see details in Sec 3.1) to obtain a set of sentences S ⊂ P_N for the downstream task by choosing the top k_s sentences with output scores higher than some threshold h_s (S-Level grey box in Fig. 1). During evaluation, S is often evaluated against some ground-truth sentence set denoted as E.

(4) Downstream Modeling: At the final step, we simply applied task-specific neural models (e.g., QA and NLI) on the concatenation of all the sentences in S and the query, obtaining the final output ŷ.

In some experiments, we modified the setup for certain analysis or ablation purposes, which will be explained individually in Sec 6.

3.1 Modeling and Training

Throughout all our experiments, we used BERT-Base (Devlin et al., 2018) to provide the state-of-the-art contextualized modeling of the input text.³

³ We used the pytorch BERT implementation in https://github.com/huggingface/pytorch-pretrained-BERT.

Semantic Retrieval: We treated the neural semantic retrieval at both the paragraph and sentence level as binary classification problems, with the models' parameters updated by minimizing a binary cross-entropy loss. To be specific, we fed the query and context into BERT as:

[CLS] Query [SEP] Context [SEP]

We applied an affine layer and sigmoid activation on the last-layer output of the [CLS] token, which gives a scalar value. The parameters were updated with the objective function:

J_retri = − ∑_{i ∈ T^{p/s}_pos} log(p̂_i) − ∑_{i ∈ T^{p/s}_neg} log(1 − p̂_i)

where p̂_i is the output of the model, T^{p/s}_pos is the positive set and T^{p/s}_neg is the negative set. As shown in Fig. 1, at the sentence level, ground-truth sentences served as positive examples while other sentences from the upstream retrieved set served as negative examples. Similarly, at the paragraph level, paragraphs having any ground-truth sentence were used as positive examples and other paragraphs from the upstream term-based retrieval process were used as negative examples.
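To make the retrieval step concrete, the sketch below shows one way to implement the paragraph-/sentence-level relevance scorer and the top-k-plus-threshold filtering described above. It is a minimal illustration, not the released code: the class and function names and the use of the transformers library (rather than the older pytorch-pretrained-BERT package named in footnote 3) are our own assumptions.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class RelevanceScorer(nn.Module):
        """Binary relevance scorer, conceptually shared by the paragraph and sentence level."""
        def __init__(self, name="bert-base-uncased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(name)
            self.affine = nn.Linear(self.bert.config.hidden_size, 1)  # scalar score from [CLS]

        def forward(self, input_ids, attention_mask, token_type_ids):
            out = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
            cls = out.last_hidden_state[:, 0]      # [CLS] representation
            return self.affine(cls).squeeze(-1)    # raw logit; sigmoid applied in loss/inference

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    scorer = RelevanceScorer()
    loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy, matching J_retri up to averaging
    # Training step (conceptually): loss = loss_fn(scorer(**batch), labels)
    # where labels are 1 for ground-truth evidence and 0 for sampled upstream negatives.

    def score(query, contexts):
        """Sigmoid relevance scores for pairs encoded as [CLS] Query [SEP] Context [SEP]."""
        enc = tokenizer([query] * len(contexts), contexts,
                        padding=True, truncation=True, max_length=256, return_tensors="pt")
        with torch.no_grad():
            return torch.sigmoid(scorer(**enc)).tolist()

    def filter_top_k(items, scores, k, h):
        """Keep at most k items whose score exceeds threshold h (used for k_p/h_p and k_s/h_s)."""
        ranked = sorted(zip(items, scores), key=lambda x: x[1], reverse=True)
        return [item for item, s in ranked[:k] if s > h]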
QA: We followed Devlin et al. (2018) for QA span prediction modeling. To correctly handle yes-or-no questions in HOTPOTQA, we fed two additional "yes" and "no" tokens between [CLS] and the Query as:

[CLS] yes no Query [SEP] Context [SEP]

where the supervision was given to the second or the third token when the answer is "yes" or "no", such that they can compete with all other predicted spans. The parameters of the neural QA model were trained to maximize the log probabilities of the true start and end indexes as:

J_qa = − ∑_i [ log(ŷ^s_i) + log(ŷ^e_i) ]

where ŷ^s_i and ŷ^e_i are the predicted probabilities of the ground-truth start and end positions for the i-th example, respectively. It is worth noting that we used ground-truth supporting sentences plus some other sentences sampled from the upstream retrieved set as the context for training the QA module, such that it will adapt to the upstream data distribution during inference.

Fact Verification: Following Thorne et al. (2018), we formulate downstream fact verification as a 3-way natural language inference (NLI) classification problem (MacCartney and Manning, 2009; Bowman et al., 2015) and train the model with a 3-way cross-entropy loss. The input format is the same as that of semantic retrieval and the objective is J_ver = − ∑_i y_i · log(ŷ_i), where ŷ_i ∈ R³ denotes the model's output for the three verification labels, and y_i is a one-hot embedding of the ground-truth label. For verifiable queries, we used ground-truth evidential sentences plus some other sentences sampled from the upstream retrieved set as the new evidential context for NLI. For non-verifiable queries, we only used sentences sampled from the upstream retrieved set as context because those queries are not associated with ground-truth evidential sentences. This detail is important for the model to identify non-verifiable queries and will be explained more in Sec 6. Additional training details and hyper-parameter selections are in the Appendix (Sec. A; Table 6).

It is worth noting that each sub-module in the system relies on its preceding sub-module to provide data both for training and inference. This means that there will be upstream data distribution misalignment if we train a sub-module in isolation without considering the properties of its preceding upstream module. The problem is similar to the concept of internal covariate shift (Ioffe and Szegedy, 2015), where the distribution of each layer's inputs changes inside a neural network. Therefore, it makes sense to study this issue in a joint MRS setting rather than a typical supervised learning setting where training and test data tend to be fixed and modules are isolated. We release our code and the organized data both for reproducibility and to provide an off-the-shelf testbed to facilitate future research on MRS.
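The input construction for the yes/no-aware QA head can be illustrated with a short sketch. This is a simplified illustration under our own naming: the prepended-token trick and the start/end heads follow the description above, but the code itself is not the authors' released implementation.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def build_qa_input(question, context):
        # [CLS] yes no Question [SEP] Context [SEP]; positions 1 and 2 act as "yes"/"no" spans.
        return tokenizer("yes no " + question, context,
                         truncation=True, max_length=384, return_tensors="pt")

    class SpanQAHead(nn.Module):
        """Start/end span prediction over all tokens, including the prepended yes/no tokens."""
        def __init__(self, name="bert-base-uncased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(name)
            self.span = nn.Linear(self.bert.config.hidden_size, 2)  # start and end logits

        def forward(self, **enc):
            hidden = self.bert(**enc).last_hidden_state                 # (batch, seq, hidden)
            start_logits, end_logits = self.span(hidden).split(1, dim=-1)
            return start_logits.squeeze(-1), end_logits.squeeze(-1)

    # Training minimizes cross-entropy of the gold start/end indices (J_qa above); for a gold
    # answer of "yes" or "no", the gold start and end both point at that prepended token.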
4 Experimental Setup

MRS requires a system not only to retrieve relevant content from textual KBs but also to possess enough understanding ability to solve the downstream task. To understand the impact and importance of semantic retrieval on the downstream comprehension, we established a unified experimental setup that involves two different downstream tasks, i.e., multi-hop QA and fact verification.

4.1 Tasks and Datasets

HOTPOTQA: This dataset is a recent large-scale QA dataset that brings in new features: (1) the questions require finding and reasoning over multiple documents; (2) the questions are diverse and not limited to pre-existing KBs; (3) it offers a new comparison question type (Yang et al., 2018). We experimented with our system on HOTPOTQA in the fullwiki setting, where a system must find the answer to a question in the scope of the entire Wikipedia, an ideal MRS setup. The sizes of the train, dev and test splits are 90,564, 7,405, and 7,405. More importantly, HOTPOTQA also provides human-annotated sentence-level supporting facts that are needed to answer each question. Those intermediate annotations enable evaluation of models' joint ability on both fact retrieval and answer span prediction, facilitating our direct analysis of the explainable predictions and their relation to the upstream retrieval.

FEVER: The Fact Extraction and VERification dataset (Thorne et al., 2018) is a recent dataset collected to facilitate automatic fact checking. The work also proposes a benchmark task in which, given an arbitrary input claim, candidate systems are asked to select evidential sentences from Wikipedia and label the claim as either SUPPORT, REFUTE, or NOT ENOUGH INFO, if the claim can be verified to be true, false, or non-verifiable, respectively, based on the evidence. The sizes of the train, dev and test splits are 145,449, 19,998, and 9,998. Similar to HOTPOTQA, the dataset provides annotated sentence-level facts needed for the verification. These intermediate annotations provide an accurate evaluation of the results of semantic retrieval and thus suit well the analysis of the effects of the retrieval modules on downstream verification.

As in Chen et al. (2017), we use Wikipedia as our unique knowledge base because it is a comprehensive and self-evolving information source often used to facilitate intelligent systems. Moreover, as Wikipedia is the source for both HOTPOTQA and FEVER, it helps standardize any further analysis of the effects of semantic retrieval on the two different downstream tasks.

Method               |  Ans EM  Ans F1  |  Sup EM  Sup F1  |  Joint EM  Joint F1
Dev set
Yang (2018)          |  24.7    34.4    |  5.3     41.0    |  2.5       17.7
Ding (2019)          |  37.6    49.4    |  23.1    58.5    |  12.2      35.3
whole pip.           |  46.5    58.8    |  39.9    71.5    |  26.6      49.2
Test set
Yang (2018)          |  24.0    32.9    |  3.9     37.7    |  1.9       16.2
MUPPET               |  30.6    40.3    |  16.7    47.3    |  10.9      27.0
Ding (2019)          |  37.1    48.9    |  22.8    57.7    |  12.4      34.9
whole pip.           |  45.3    57.3    |  38.7    70.8    |  25.1      47.6

Table 1: Results of systems on HOTPOTQA.

Model                 |  F1     |  LA     |  FS
Dev set
Hanselowski (2018)    |  -      |  68.49  |  64.74
Yoneda (2018)         |  35.84  |  69.66  |  65.41
Nie (2019)            |  51.37  |  69.64  |  66.15
Full system (single)  |  76.87  |  75.12  |  70.18
Test set
Hanselowski (2018)    |  37.33  |  65.22  |  61.32
Yoneda (2018)         |  35.21  |  67.44  |  62.34
Nie (2019)            |  52.81  |  68.16  |  64.23
Full system (single)  |  74.62  |  72.56  |  67.26

Table 2: Performance of systems on FEVER. "F1" indicates the sentence-level evidence F1 score. "LA" indicates Label Accuracy without considering the evidence prediction. "FS" = FEVER Score (Thorne et al., 2018) for joint performance.

4.2 Metrics

Following Thorne et al. (2018) and Yang et al. (2018), we used the annotated sentence-level facts to calculate the F1, Precision and Recall scores for evaluating sentence-level retrieval. Similarly, we labeled all the paragraphs that contain any ground-truth fact as ground-truth paragraphs and used the same three metrics for paragraph-level retrieval evaluation. For HOTPOTQA, following Yang et al. (2018), we used exact match (EM) and F1 metrics for QA span prediction evaluation, and used the joint EM and F1 to evaluate models' joint performance on both retrieval and QA. The joint EM and F1 are calculated as:

P_j = P_a · P_s;  R_j = R_a · R_s;  F_j = 2 · P_j · R_j / (P_j + R_j);  EM_j = EM_a · EM_s

where P, R, and EM denote precision, recall and EM; the subscripts a and s indicate that the scores are for the answer span and the supporting facts, respectively.

For the FEVER task, following Thorne et al. (2018), we used Label Accuracy for evaluating downstream verification and the FEVER Score for joint performance. The FEVER Score awards one point for each example with the correct predicted label only if all ground-truth facts are contained in the predicted fact set with at most 5 elements. We also used the Oracle Score for the two retrieval modules. These scores were proposed in Nie et al. (2019) and indicate the upper bound of the final FEVER Score at one intermediate layer, assuming all downstream modules are perfect. All scores are averaged over examples in the whole evaluation set.
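As a worked illustration of these metrics, the sketch below computes the joint EM/F1 from per-example answer and supporting-fact scores, and a FEVER-style score from label and evidence predictions. The function names and input formats are ours; the formulas follow the definitions above.

    def joint_scores(p_a, r_a, em_a, p_s, r_s, em_s):
        """Joint precision/recall/F1/EM for one example (a = answer span, s = supporting facts)."""
        p_j, r_j = p_a * p_s, r_a * r_s
        f_j = 2 * p_j * r_j / (p_j + r_j) if (p_j + r_j) > 0 else 0.0
        return p_j, r_j, f_j, em_a * em_s

    def fever_score(pred_label, gold_label, pred_facts, gold_facts, max_facts=5):
        """One point if the label is correct and all gold facts appear among at most 5 predicted facts."""
        evidence_ok = set(gold_facts).issubset(set(pred_facts[:max_facts]))
        return 1.0 if (pred_label == gold_label and evidence_ok) else 0.0

    # Example: answer metrics (P=1, R=1, EM=1) with supporting-fact metrics (P=0.8, R=1.0, EM=0)
    # give joint P=0.8, R=1.0, F1~0.889, and joint EM=0.
    print(joint_scores(1.0, 1.0, 1, 0.8, 1.0, 0))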
5 Results on Benchmarks

We chose the best system based on the dev set and used it to submit private test predictions on both FEVER and HOTPOTQA.⁴

⁴ Results can also be found at the leaderboard websites for the two datasets: https://hotpotqa.github.io and https://competitions.codalab.org/competitions/18814

As can be seen in Table 1, with the proposed hierarchical system design, the whole pipeline system achieves a new state-of-the-art on HOTPOTQA with large-margin improvements on all the metrics. More specifically, the biggest improvement comes from the EM for the supporting facts, which in turn leads to a doubling of the joint EM over previous best results. The scores for answer prediction are also higher than all previous best results, with a ∼8 absolute point increase on EM and ∼9 absolute points on F1. All the improvements are consistent between test and dev set evaluation.

Similarly for FEVER, we show the evidence F1, the Label Accuracy, and the FEVER Score (same as the benchmark evaluation) for models in Table 2. Our system obtained substantially higher scores than all previously published results, with ∼4 and ∼3 points of absolute improvement on Label Accuracy and FEVER Score. In particular, the system gains 74.62 on the evidence F1, 22 points greater than that of the second-best system, demonstrating its ability on semantic retrieval.

Previous systems (Ding et al., 2019; Yang et al., 2018) on HOTPOTQA treat supporting fact retrieval (sentence-level retrieval) just as an auxiliary task for providing extra model explainability. In Nie et al. (2019), although a similar three-stage system is used for FEVER, only one neural retrieval module is applied at the sentence level, which potentially weakens its retrieval ability. Both of these previous best systems are different from our fully hierarchical pipeline approach. These observations lead to the assumption that the performance gain comes mainly from the hierarchical retrieval and its positive effects on downstream modules. Therefore, to validate the system design decisions in Sec 3 and reveal the importance of semantic retrieval for the downstream tasks, we conducted a series of ablation and analysis experiments on all the modules. We start by examining the necessity of both paragraph and sentence retrieval and give insights on why both of them matter.

6 Analysis and Ablations

Intuitively, both the paragraph-level and the sentence-level retrieval sub-modules help speed up the downstream processing. More importantly, since the downstream modules were trained on data sampled from the upstream modules, both neural retrieval sub-modules also play an implicit but important role in controlling the immediate retrieval distribution, i.e., the distribution of the sets P_N and S (as shown in Fig. 1), and thus in providing better inference data and training data for the downstream modules.

6.1 Ablation Studies

Setups: To reveal the importance of the neural retrieval modules at both the paragraph and sentence level for maintaining the performance of the overall system, we removed either of them and examined the consequences. Because the removal of a module in the pipeline might change the distribution of the input of the downstream modules, we re-trained all the downstream modules accordingly. To be specific, in the system without the paragraph-level neural retrieval module, we re-trained the sentence-level retrieval module with negative sentences directly sampled from the term-based retrieval set and then also re-trained the downstream QA or verification module. In the system without the sentence-level neural retrieval module, we re-trained the downstream QA or verification module by sampling data from both the ground-truth set and the retrieved set coming directly from the paragraph-level module. We tested the simplified systems on both FEVER and HOTPOTQA.

Results: Tables 3 and 4 show the ablation results for the two neural retrieval modules at both the paragraph and sentence level on HOTPOTQA and FEVER. To begin with, we can see that removing the paragraph-level retrieval module significantly reduces the precision of sentence-level retrieval and the corresponding F1 on both tasks. More importantly, this loss of retrieval precision also led to substantial decreases in all the downstream scores on both the QA and the verification task, in spite of their higher upper-bound and recall scores. This indicates that the negative effects on downstream modules induced by the omission of paragraph-level retrieval cannot be amended by the sentence-level retrieval module, and that focusing semantic retrieval merely on improving the recall or the upper bound of the final score risks jeopardizing the performance of the overall system.

Next, the removal of the sentence-level retrieval module induces a ∼2 point drop on EM and F1 in the QA task, and a ∼15 point drop in FEVER Score in the verification task. This suggests that rather than just enhancing explainability for QA, the sentence-level retrieval module can also help pinpoint relevant information and reduce the noise in the evidence that might otherwise distract the downstream comprehension module.

Method            |  P-Level Retrieval    |  S-Level Retrieval           |  Answer        |  Joint
                  |  Prec.  Rec.   F1     |  EM     Prec.  Rec.   F1     |  EM     F1     |  EM     F1
Whole Pip.        |  35.17  87.93  50.25  |  39.86  75.60  71.15  71.54  |  46.50  58.81  |  26.60  49.16
Pip. w/o p-level  |  6.02   89.53  11.19  |  0.58   29.57  60.71  38.84  |  31.23  41.30  |  0.34   19.71
Pip. w/o s-level  |  35.17  87.92  50.25  |  -      -      -      -      |  44.77  56.71  |  -      -

Table 3: Ablation over the paragraph-level and sentence-level neural retrieval sub-modules on HOTPOTQA.

Method            |  P-Level Retrieval           |  S-Level Retrieval           |  Verification
                  |  Orcl.  Prec.  Rec.   F1     |  Orcl.  Prec.  Rec.   F1     |  LA     FS     L-F1 (S/R/N)
Whole Pip.        |  94.15  48.84  91.23  63.62  |  88.92  71.29  83.38  76.87  |  70.18  75.01  81.7/75.7/67.1
Pip. w/o p-level  |  94.69  18.11  92.03  30.27  |  91.07  44.47  86.60  58.77  |  61.55  67.01  76.5/72.7/40.8
Pip. w/o s-level  |  94.15  48.84  91.23  63.62  |  -      -      -      -      |  55.92  61.04  72.1/67.6/27.7

Table 4: Ablation over the paragraph-level and sentence-level neural retrieval sub-modules on FEVER. "LA" = Label Accuracy; "FS" = FEVER Score; "Orcl." is the oracle upper bound of the FEVER Score, assuming all downstream modules are perfect. "L-F1 (S/R/N)" means the classification F1 scores on the three verification labels: SUPPORT, REFUTE, and NOT ENOUGH INFO.

Another interesting finding is that without the sentence-level retrieval module, the QA module suffers much less than the verification module; conversely, the removal of the paragraph-level neural retrieval module induces an ∼11 point drop on answer EM compared to a ∼9 point drop on Label Accuracy in the verification task. This seems to indicate that the downstream QA module relies more on the upstream paragraph-level retrieval whereas the verification module relies more on the upstream sentence-level retrieval. Finally, we also evaluate the F1 score on FEVER for each classification label and observe a significant drop of F1 on the NOT ENOUGH INFO category without the retrieval module, meaning that semantic retrieval is vital for the downstream verification module's discriminative ability on the NOT ENOUGH INFO label.

6.2 Sub-Module Change Analysis

To further study the effects of upstream semantic retrieval on the downstream tasks, we change the training or inference data between intermediate layers and then examine how this modification affects the downstream performance.

6.2.1 Effects of Paragraph-level Retrieval

We fixed h_p = 0 (the value achieving the best performance) and re-trained all the downstream parameters, tracking their performance as k_p (the number of selected paragraphs) is changed from 1 to 12. Increasing k_p means a potentially higher coverage of the answer but more noise in the retrieved facts. Fig. 2 shows the results.

Figure 2: The results of EM for supporting fact, answer prediction and joint score, and the results of supporting fact precision and recall with different values of k_p at paragraph-level retrieval on HOTPOTQA.

As can be seen, the EM scores for supporting fact retrieval, answer prediction, and joint performance increase sharply when k_p is changed from 1 to 2. This is consistent with the fact that at least two paragraphs are required to answer each question in HOTPOTQA. Then, after the peak, every score decreases as k_p becomes larger, except the recall of supporting facts, which peaks when k_p = 4. This indicates that even though the neural sentence-level retrieval module possesses a certain level of ability to select correct facts from noisier upstream information, the final QA module is more sensitive to upstream data and fails to maintain the overall system performance. Moreover, the reduction in answer EM and joint EM suggests that it might be risky to give too much information to downstream modules with a unit of a paragraph.

6.2.2 Effects of Sentence-level Retrieval

Similarly, to study the effects of the neural sentence-level retrieval module on the downstream QA and verification modules, we fixed k_s to be 5 and set h_s ranging from 0.1 to 0.9 with a 0.1 interval.
Figure 3: The results of EM for supporting fact, answer prediction and joint score, and the results of supporting fact precision and recall with different values of h_s at sentence-level retrieval on HOTPOTQA.

Answer Type   | Total | Correct | Acc. (%)
Person        | 50    | 28      | 56.0
Location      | 31    | 14      | 45.2
Date          | 26    | 13      | 50.0
Number        | 14    | 4       | 28.6
Artwork       | 19    | 7       | 36.8
Yes/No        | 17    | 12      | 70.6
Event         | 5     | 2       | 40.0
Common noun   | 11    | 3       | 27.3
Group/Org     | 17    | 6       | 35.3
Other PN      | 20    | 9       | 45.0
Total         | 200   | 98      | 49.0

Table 5: System performance on different answer types. "PN" = Proper Noun.


Figure 4: The results of Label Accuracy, FEVER Score, and Evidence F1 with different values of h_s at sentence-level retrieval on FEVER.

Then, we re-trained the downstream QA and verification modules with different h_s values and experimented on both HOTPOTQA and FEVER.

Question Answering: Fig. 3 shows the trend of performance. Intuitively, the precision increases while the recall decreases as the system becomes more strict about the retrieved sentences. The EM score for supporting fact retrieval and the joint performance reach their highest values when h_s = 0.5, a natural balancing point between precision and recall. More interestingly, the EM score for answer prediction peaks when h_s = 0.2, where the recall is higher than the precision. This misalignment between answer prediction performance and retrieval performance indicates that, unlike the observation at the paragraph level, the downstream QA module is able to stand a certain amount of noise at the sentence level and benefit from a higher recall.

Fact Verification: Fig. 4 shows the trends for Label Accuracy, FEVER Score, and Evidence F1 when modifying the upstream sentence-level threshold h_s. We observed that the general trend is similar to that of the QA task, where both the Label Accuracy and FEVER Score peak at h_s = 0.2 whereas the retrieval F1 peaks at h_s = 0.5. Note that, although the downstream verification could take advantage of a higher recall, the module is more sensitive to sentence-level retrieval compared to the QA module in HOTPOTQA. More detailed results are in the Appendix.

Figure 5: Proportion of answer types.

6.3 Answer Breakdown

We further sample 200 examples from HOTPOTQA and manually tag them according to several common answer types (Yang et al., 2018). The proportion of different answer types is shown in Figure 5. The performance of the system on each answer type is shown in Table 5. The most frequent answer type is 'Person' (24%) and the least frequent answer type is 'Event' (2%). It is also interesting to note that the model performs best on Yes/No questions, as shown in Table 5, reaching an accuracy of 70.6%.

Question: Wojtek Wolski played for what team based in the Miami metropolitan area?
GT Answer: Florida Panthers
GT Facts:
[Florida Panthers, 0]: The Florida Panthers are a professional ice hockey team based in the Miami metropolitan area. (P-Score: 0.99; S-Score: 0.98)
[Wojtek Wolski, 1]: In the NHL, he has played for the Colorado Avalanche, Phoenix Coyotes, New York Rangers, Florida Panthers, and the Washington Capitals. (P-Score: 0.98; S-Score: 0.95)
Distracting Fact:
[History of the Miami Dolphins, 0]: The Miami Dolphins are a professional American football franchise based in the Miami metropolitan area. (P-Score: 0.56; S-Score: 0.97)
Wrong Answer: The Miami Dolphins

Figure 6: An example with a distracting fact. P-Score and S-Score are the retrieval scores at the paragraph and sentence level respectively. The full pipeline was able to filter the distracting fact and give the correct answer. The wrong answer in the figure was produced by the system without the paragraph-level retrieval module.

6.4 Examples

Fig. 6 shows an example that is correctly handled by the full pipeline system but not by the system without the paragraph-level retrieval module. We can see that it is very difficult to filter the distracting sentence after sentence-level retrieval, either by the sentence retrieval module or by the QA module.

The above findings on both FEVER and HOTPOTQA give us some important guidelines for MRS: (1) a paragraph-level retrieval module is imperative; (2) the downstream task module is able to undertake a certain amount of noise from sentence-level retrieval; (3) cascade effects on the downstream task might be caused by modification at paragraph-level retrieval.

7 Conclusion

We proposed a simple yet effective hierarchical pipeline system that achieves state-of-the-art results on two MRS tasks.
Ablation studies demonstrate the importance of semantic retrieval at both paragraph and sentence levels in the MRS system. The work can give general guidelines on MRS modeling and inspire future research on the relationship between semantic retrieval and downstream comprehension in a joint setting.

Acknowledgments

We thank the reviewers for their helpful comments and Yicheng Wang for useful comments. This work was supported by awards from Verisk, Google, Facebook, Salesforce, and Adobe (plus Amazon and Google GPU cloud credits). The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the funding agency.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. In Association for Computational Linguistics (ACL).

Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W Bruce Croft. 2017. Neural ranking models with weak supervision. In SIGIR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Association for Computational Linguistics (ACL).

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In CIKM.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. In The 1st Workshop on Fact Extraction and Verification.

Phu Mon Htut, Samuel R Bowman, and Kyunghyun Cho. 2018. Training a ranking function for open-domain question answering. In NAACL-HLT.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML).

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. Association for Computational Linguistics.

Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, and Jaewoo Kang. 2018. Ranking paragraphs for improving answer recall in open-domain question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Bill MacCartney and Christopher D Manning. 2009. Natural language inference. Citeseer.

Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In WWW.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Association for the Advancement of Artificial Intelligence (AAAI).

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. In International Conference on Computational Linguistics (COLING).

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL-HLT.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R³: Reinforced ranker-reader for open-domain question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association of Computational Linguistics, 6:287–302.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. UCL Machine Reading Group: Four factor framework for fact finding (HexaF). In The 1st Workshop on Fact Extraction and Verification.

Appendix

A Training Details

Module         | BS. | # E. | k      | h
P-Level Retri. | 64  | 3    | {2, 5} | {5e-3, 0.01, 0.1, 0.5}
S-Level Retri. | 128 | 3    | {2, 5} | [0.1-0.5]
QA             | 32  | 5    | -      | -
Verification   | 32  | 5    | -      | -

Table 6: Hyper-parameter selection for the full pipeline system. h and k are the retrieval filtering hyper-parameters mentioned in the main paper. P-Level and S-Level indicate paragraph-level and sentence-level respectively. "{}" means values enumerated from a set. "[]" means values enumerated from a range with interval = 0.1. "BS." = Batch Size. "# E." = Number of Epochs.

The hyper-parameters were chosen based on the performance of the system on the dev set. The hyper-parameter search space is shown in Table 6 and the learning rate was set to 10⁻⁵ in all experiments.
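For convenience, the search space in Table 6 can be written down as a plain configuration; the dict below is our own transcription of the table (not a file shipped with the released code).

    # Transcription of Table 6 plus the fixed learning rate; purely illustrative.
    HYPERPARAM_SEARCH_SPACE = {
        "p_level_retrieval": {"batch_size": 64,  "epochs": 3, "k": [2, 5],
                              "h": [5e-3, 0.01, 0.1, 0.5]},
        "s_level_retrieval": {"batch_size": 128, "epochs": 3, "k": [2, 5],
                              "h": [round(0.1 * i, 1) for i in range(1, 6)]},  # 0.1-0.5, step 0.1
        "qa":                {"batch_size": 32,  "epochs": 5},
        "verification":      {"batch_size": 32,  "epochs": 5},
    }
    LEARNING_RATE = 1e-5  # used in all experiments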

B Term-Based Retrieval Details

FEVER: We used the same keyword matching method as in Nie et al. (2019) to get a candidate set for each query. We also used the TF-IDF method (Chen et al., 2017) to get the top-5 related documents for each query. Then, the two sets were combined to get the final term-based retrieval set for FEVER. The mean and standard deviation of the number of retrieved paragraphs in the merged set were 8.06 and 4.88.

HOTPOTQA: We first used the same procedure as on FEVER to get an initial candidate set for each query in HOTPOTQA. Because HOTPOTQA requires at least 2-hop reasoning for each query, we then extracted all the hyperlinked documents from the retrieved documents in the initial candidate set, ranked them with the TF-IDF (Chen et al., 2017) score, selected the top-5 most related documents and added them to the candidate set. This gives the final term-based retrieval set for HOTPOTQA. The mean and standard deviation of the number of retrieved paragraphs for each query in HOTPOTQA were 39.43 and 16.05.
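A minimal sketch of this merge step is shown below. It uses scikit-learn's TfidfVectorizer as a stand-in for the hashed-bigram TF-IDF retriever of Chen et al. (2017), and keyword_match_candidates is a placeholder for the rule-based keyword matching of Nie et al. (2019); both substitutions are our assumptions, not the released implementation.

    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_top_k(query, doc_titles, doc_texts, k=5):
        """Rank documents by TF-IDF similarity to the query and return the top-k titles."""
        vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
        doc_matrix = vectorizer.fit_transform(doc_texts)    # (num_docs, vocab), l2-normalized
        query_vec = vectorizer.transform([query])           # (1, vocab)
        scores = (doc_matrix @ query_vec.T).toarray().ravel()  # cosine similarity per document
        ranked = sorted(zip(doc_titles, scores), key=lambda x: x[1], reverse=True)
        return [title for title, _ in ranked[:k]]

    def term_based_candidates(query, doc_titles, doc_texts, keyword_match_candidates):
        """Union of rule-based keyword matches and TF-IDF top-5, as described for FEVER above."""
        keyword_set = set(keyword_match_candidates(query))  # placeholder rule-based matcher
        tfidf_set = set(tfidf_top_k(query, doc_titles, doc_texts, k=5))
        return keyword_set | tfidf_set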

h_s  |  S-Level Retrieval           |  Answer        |  Joint
     |  EM     Prec.  Rec.   F1     |  EM     F1     |  EM     F1     Prec.  Rec.
0    |  3.17   42.97  81.20  55.03  |  46.95  59.73  |  1.81   37.42  29.99  57.66
0.1  |  30.06  65.26  78.50  69.72  |  47.48  59.78  |  19.62  47.86  46.09  55.63
0.2  |  34.83  69.59  76.47  71.28  |  47.62  59.93  |  22.89  49.15  49.24  54.53
0.3  |  37.52  72.21  74.66  71.81  |  47.14  59.51  |  24.63  49.44  50.67  53.55
0.4  |  39.16  74.17  72.89  71.87  |  46.68  58.96  |  25.81  49.18  51.62  51.90
0.5  |  39.86  75.60  71.15  71.54  |  46.50  58.81  |  26.60  49.16  52.59  50.83
0.6  |  39.80  76.59  69.05  70.72  |  46.22  58.31  |  26.53  48.48  52.86  49.48
0.7  |  38.95  77.47  66.80  69.71  |  45.29  57.47  |  25.96  47.59  53.06  47.86
0.8  |  37.22  77.71  63.70  67.78  |  44.30  55.99  |  24.67  45.92  52.41  45.32
0.9  |  32.60  75.60  57.07  62.69  |  42.08  52.85  |  21.44  42.26  50.48  40.61

Table 7: Detailed results of downstream sentence-level retrieval and question answering with different values of h_s on HOTPOTQA.

C Detailed Results

• The results of sentence-level retrieval and downstream QA with different values of h_s on HOTPOTQA are in Table 7.

• The results of sentence-level retrieval and downstream verification with different values of h_s on FEVER are in Table 8.

• The results of sentence-level retrieval and downstream QA with different values of k_p on HOTPOTQA are in Table 9.

D Examples and Case Study

We further provide examples, a case study and error analysis for the full pipeline system. The examples are shown in Tables 10 to 15. The examples show high diversity at the semantic level, and the errors often occur due to the system's failure to extract precise (either wrong, surplus or insufficient) information from the KB.

h_s  |  S-Level Retrieval           |  Verification
     |  Precision  Recall  F1       |  Label Accuracy  FEVER Score
0    |  26.38      89.28   40.74    |  71.93           67.26
0.1  |  61.41      86.27   71.51    |  74.73           70.08
0.2  |  71.29      83.38   76.87    |  75.03           70.13
0.3  |  76.45      81.08   78.70    |  74.73           69.55
0.4  |  80.52      78.95   79.73    |  73.57           68.70
0.5  |  83.76      76.31   79.85    |  73.30           68.39
0.6  |  86.73      72.70   79.10    |  71.85           66.96
0.7  |  89.91      67.71   77.25    |  69.53           65.15
0.8  |  93.22      60.16   73.13    |  65.73           62.00
0.9  |  96.52      47.63   63.78    |  58.98           56.51

Table 8: Results with different hs on FEVER.

k_p  |  S-Level Retrieval           |  Answer        |  Joint
     |  EM     Prec.  Rec.   F1     |  EM     F1     |  EM     F1     Prec.  Rec.
1    |  0      81.28  40.64  53.16  |  26.77  35.76  |  0      21.54  33.41  17.32
2    |  39.86  75.60  71.15  71.54  |  46.56  58.74  |  26.6   49.09  52.54  50.79
3    |  10.53  59.54  74.13  63.92  |  43.63  55.51  |  7.09   40.61  38.52  49.20
4    |  6.55   50.60  74.46  57.98  |  40.42  51.99  |  4.22   34.57  30.84  45.98
5    |  4.89   45.27  73.60  53.76  |  39.02  50.36  |  2.97   32.14  26.92  43.89
6    |  3.70   42.22  71.84  51.04  |  37.35  48.41  |  2.36   28.66  24.37  41.63
7    |  3.13   40.22  70.35  49.15  |  36.91  47.70  |  2.05   27.49  23.10  40.60
8    |  2.88   38.77  69.28  47.83  |  36.28  46.99  |  1.88   26.58  22.13  39.85
9    |  2.57   37.67  68.46  46.81  |  35.71  46.30  |  1.68   25.77  21.32  38.87
10   |  2.31   36.74  67.68  45.94  |  35.07  45.74  |  1.50   25.05  20.56  38.21
11   |  2.09   35.97  67.04  45.21  |  34.96  45.56  |  1.39   24.65  20.18  37.60
12   |  1.89   35.37  66.60  44.67  |  34.09  44.74  |  1.22   23.99  19.57  36.98

Table 9: Detailed Results of downstream sentence-level retrieval and question answering with different values of kp on HOTPOTQA.

Question: D1NZ is a series based on what oversteering technique?
Ground Truth Facts:
(D1NZ, 0) D1NZ is a production car drifting series in New Zealand.
(Drifting (motorsport), 0) Drifting is a driving technique where the driver intentionally oversteers...
Ground Truth Answer: Drifting
Predicted Facts:
(D1NZ, 0) D1NZ is a production car drifting series in New Zealand.
(Drifting (motorsport), 0) Drifting is a driving technique where the driver intentionally oversteers...
Predicted Answer: Drifting

Table 10: HotpotQA correct prediction with sufficient evidence.

Question: The football manager who recruited managed Manchester United during what timeframe?
Ground Truth Facts:
(1995-96 Manchester United F.C. season, 2) had sold experienced players, and Andrei Kanchelskis...
(1995-96 Manchester United F.C. season, 3) Instead, he had drafted in young players like Nicky Butt, David Beckham, Paul Scholes...
(Alex Ferguson, 0) Sir Alexander Chapman Ferguson... who managed Manchester United from 1986 to 2013.
Ground Truth Answer: from 1986 to 2013
Predicted Facts:
(Matt Busby, 0) Sir Alexander Matthew Busby, CBE... who managed Manchester United between 1945 and 1969...
Predicted Answer: 1945 and 1969

Table 11: HotpotQA incorrect prediction with insufficient/wrong evidence.

Question: Where did Ian Harland study prior to studying at the oldest college at the University of Cambridge?
Ground Truth Facts:
(Ian Harland, 0) From a clerical family... Harland was educated at The Dragon School in Oxford and Haileybury.
(Ian Harland, 1) He then went to university at Peterhouse, Cambridge, taking a law degree.
(Peterhouse, Cambridge, 1) It is the oldest college of the university...
Ground Truth Answer: The Dragon School in Oxford
Predicted Facts:
(Ian Harland, 0) From a clerical family... Harland was educated at The Dragon School in Oxford and Haileybury.
(Ian Harland, 1) He then went to university at Peterhouse, Cambridge, taking a law degree.
(Peterhouse, Cambridge, 1) It is the oldest college of the university...
(Ian Harland, 2) After two years as a schoolmaster at Sunningdale School he studied for the priesthood at Wycliffe Hall, Oxford and be...
Predicted Answer: Wycliffe Hall, Oxford

Table 12: HotpotQA incorrect prediction caused by extra incorrect information.

Claim: Heroes' first season had 12 episodes.
Ground Truth Facts:
(Heroes (U.S. TV series), 8) The critically acclaimed first season had a run of 23 episodes and garnered an average of 14.3 million viewers in the United States, receiving the highest rating for an NBC drama premiere in five years.
Ground Truth Label: REFUTES
Predicted Facts:
(Heroes (U.S. TV series), 8) The critically acclaimed first season had a run of 23 episodes and garnered an average of 14.3 million viewers in the United States, receiving the highest rating for an NBC drama premiere in five years.
Predicted Label: REFUTES

Table 13: FEVER correct prediction with sufficient evidence

Claim: Azithromycin is available as a generic curtain.
Ground Truth Facts:
(Azithromycin, 17) Azithromycin is an antibiotic useful for the treatment of a number of bacterial infections.
(Azithromycin, 11) Azithromycin is an azalide, a type of macrolide antibiotic.
(Azithromycin, 0) It is available as a generic medication and is sold under many trade names worldwide.
Ground Truth Label: REFUTES
Predicted Facts:
(Azithromycin, 17) It is available as a generic medication and is sold under many trade names worldwide.
Predicted Label: NOT ENOUGH INFO

Table 14: FEVER incorrect prediction due to insufficient evidence.

Claim: University of Chicago Law School is ranked fourth by Brain Leiter on the "Top 15 Schools From Which the Most 'Prestigious' Law Firms Hire New Lawyers."
Ground Truth Facts:
(University of Chicago Law School, 6) Chicago is ranked second by Brian Leiter of the University of Chicago Law School on the Top 15 Schools From Which the Most Prestigious Law Firms Hire New Lawyers, and first for Faculty quality based on American Academy of Arts and Sciences Membership.
Ground Truth Label: REFUTES
Predicted Facts:
(University of Chicago Law School, 6) Chicago is ranked second by Brian Leiter of the University of Chicago Law School on the Top 15 Schools From Which the Most Prestigious Law Firms Hire New Lawyers, and first for Faculty quality based on American Academy of Arts and Sciences Membership.
(University of Chicago Law School, 4) U.S. News and World Report ranks Chicago fourth among U.S. law schools, and it is noted for its influence on the economic analysis of law.
Predicted Label: NOT ENOUGH INFO

Table 15: FEVER incorrect prediction due to extra wrong evidence