Repurposing Entailment for Multi-Hop Question Answering Tasks
Harsh Trivedi|, Heeyoung Kwon|, Tushar Khot♠, Ashish Sabharwal♠, Niranjan Balasubramanian|
| Stony Brook University, Stony Brook, U.S.A.
♠ Allen Institute for Artificial Intelligence, Seattle, U.S.A.

Abstract

Question Answering (QA) naturally reduces to an entailment problem, namely, verifying whether some text entails the answer to a question. However, for multi-hop QA tasks, which require reasoning with multiple sentences, it remains unclear how best to utilize entailment models pre-trained on large scale datasets such as SNLI, which are based on sentence pairs. We introduce Multee, a general architecture that can effectively use entailment models for multi-hop QA tasks. Multee uses (i) a local module that helps locate important sentences, thereby avoiding distracting information, and (ii) a global module that aggregates information by effectively incorporating importance weights. Importantly, we show that both modules can use entailment functions pre-trained on large scale NLI datasets. We evaluate performance on MultiRC and OpenBookQA, two multi-hop QA datasets. When using an entailment function pre-trained on NLI datasets, Multee outperforms QA models trained only on the target QA datasets and the OpenAI transformer models.

Figure 1: An example illustrating the challenges in using a sentence-level entailment model for the multi-sentence reasoning needed for QA, and the high-level approach used in Multee.

1 Introduction

How can we effectively use textual entailment models for question answering? Previous attempts at this have resulted in limited success (Harabagiu and Hickl, 2006; Sacaleanu et al., 2008; Clark et al., 2012). With recent large scale entailment datasets (Bowman et al., 2015; Williams et al., 2018; Khot et al., 2018) pushing entailment models to high accuracies (Chen et al., 2017; Parikh et al., 2016; Wang et al., 2017), we re-visit this challenge and propose a novel method for re-purposing neural entailment models for QA.

A key difficulty in using entailment models for QA turns out to be the mismatch between the inputs to the two tasks: large-scale entailment datasets are typically framed at a sentence level, whereas question answering requires verifying whether multiple sentences, taken together as a premise, entail a hypothesis.

There are two straightforward ways to address this mismatch: (1) aggregate independent entailment decisions over each premise sentence, or (2) make a single entailment decision after concatenating all premise sentences. Neither approach is fully satisfactory. To understand why, consider the set of premises in Figure 1, which entail the hypothesis Hc. Specifically, the combined information in P1 and P3 entails Hc, which corresponds to the correct answer Cambridge. On one hand, aggregating independent decisions will fail because no individual premise entails Hc. On the other hand, simply concatenating premises to form a single paragraph will fail because distracting information in P2 and P4 can muddle useful information in P1 and P3. An effective approach, therefore, must recognize relevant sentences (i.e., avoid distracting ones) and compose their sentence-level information.
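As a concrete reference point, here is a minimal sketch of the two straightforward strategies discussed above; they correspond to the Max and Concat baselines formalized in Section 2. The type alias EntailmentFn and the function names are assumptions made for illustration, not code from the paper.

```python
from typing import Callable, List

# Stand-in for a pre-trained sentence-level entailment model such as ESIM:
# given (premise, hypothesis), it returns the probability that the premise
# entails the hypothesis. This is an assumption of the sketch.
EntailmentFn = Callable[[str, str], float]


def max_aggregation(premises: List[str], hypothesis: str,
                    entail: EntailmentFn) -> float:
    """Strategy (1): score each premise sentence independently and aggregate
    the local entailment decisions, here with a simple max."""
    return max(entail(p, hypothesis) for p in premises)


def concat_premises(premises: List[str], hypothesis: str,
                    entail: EntailmentFn) -> float:
    """Strategy (2): concatenate all premise sentences into one passage and
    make a single entailment decision over it."""
    return entail(" ".join(premises), hypothesis)
```

As the Figure 1 example suggests, the first sketch cannot combine evidence that is split across P1 and P3, while the second is exposed to the distracting content of P2 and P4.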
Our solution to this challenge is based on the observation that a sentence-level entailment function can be re-purposed both for recognizing relevant sentences and for computing sentence-level representations. Both tasks require comparing information in a pair of texts, but the objectives of the comparison are different. This means we can take an entailment function that is trained for basic entailment (i.e., comparing information in texts) and adapt it to work for both recognizing relevance and computing representations. Thus, this architecture allows us to incorporate advances in entailment architectures and to leverage pre-trained models obtained using large scale entailment datasets.

To this end, we propose a general architecture that uses a (pre-trained) entailment function fe for multi-sentence QA. Given a hypothesis statement Hqa representing a candidate answer, and the set of premise sentences {Pi}, our proposed architecture uses the same function fe for two components: (a) a sentence relevance module that scores each Pi based on its potential relevance to Hqa, with the goal of weeding out distractors; and (b) a relevance-weighted aggregator that combines entailment information from multiple Pi.

Thus, we build effective entailment-aware representations of larger contexts (i.e., multiple sentences) from those of small contexts (i.e., individual sentences). The main strength of our approach is that, unlike standard attention mechanisms, the aggregator module uses the attention scores from the relevance module at multiple levels of abstraction (e.g., multiple layers of a neural network) within fe, using join operations that compose representations at each level. We refer to this multi-level aggregation of textual entailment representations as Multee (pronounced multi).
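A rough, one-level schematic of this idea is sketched below. It is a minimal sketch under assumed tensor and function names, not the actual Multee aggregator, which reuses the relevance scores at multiple layers of fe through join operations.

```python
import torch


def relevance_weighted_aggregation(premise_reprs: torch.Tensor,
                                   relevance_scores: torch.Tensor) -> torch.Tensor:
    """Combine per-sentence, hypothesis-aware premise representations
    (shape: num_sentences x hidden_dim) into a single passage-level
    representation, using per-sentence relevance scores
    (shape: num_sentences) to down-weight likely distractors."""
    weights = torch.softmax(relevance_scores, dim=0)            # importance weights
    return (weights.unsqueeze(-1) * premise_reprs).sum(dim=0)   # hidden_dim vector
```

A single weighted sum like this is only the simplest possible join; the point of the sketch is that the same pre-trained entailment function can supply both the representations and the relevance scores being combined.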
Our implementation of Multee uses ESIM (Chen et al., 2017), a recent sentence-level entailment model, pre-trained on the SNLI and MultiNLI datasets. We demonstrate its effectiveness on two challenging multi-sentence reasoning datasets: MultiRC (Khashabi et al., 2018) and OpenBookQA (Mihaylov et al., 2018). Multee using ELMo contextual embeddings (Peters et al., 2018) matches state-of-the-art results achieved with large transformer-based models (Radford et al., 2018) that were trained on a sequence of large scale tasks (Sun et al., 2019). Ablation studies demonstrate that both relevance scoring and multi-level aggregation are valuable, and that pre-training on large entailment corpora is particularly helpful for OpenBookQA.

This work makes three main contributions: (i) A novel approach to use pre-trained entailment models for question answering. (ii) A model that incorporates local (sentence-level) entailment decisions with global (document-level) entailment decisions to effectively aggregate information for the multi-hop QA task. (iii) An empirical evaluation that shows entailment-based QA can achieve state-of-the-art performance on two challenging multi-hop QA datasets, OpenBookQA and MultiRC.

2 Question Answering using Entailment

Non-extractive question answering can be seen as a textual entailment problem, where we verify whether a hypothesis constructed out of a question and a candidate answer is entailed by the knowledge—a collection of sentences¹ in the source text. The probability of an answer A, given a question Q, can be modeled as the probability of a set of premises {Pi} entailing a hypothesis statement Hqa constructed from Q and A:

    Pr[A | Q, {Pi}] ≜ Pr[{Pi} ⊨ Hqa]    (1)

Here we use ⊨ to denote textual entailment. Given QA training data, we can then learn a model that approximates the entailment probability Pr[{Pi} ⊨ Hqa].

¹ This collection can be a sequence in the case of passage comprehension, or a list of sentences, potentially from varied sources, in the case of QA over multiple documents.

Can one build an effective QA model ge using an existing entailment model fe that has been pre-trained on a large-scale entailment dataset? Figure 2 illustrates two straightforward ways of doing so, using fe as a black-box function:

Figure 2: Black Box Applications of Textual Entailment Model for QA: Max and Concat models.

(i) Aggregate Local Decisions (Max): Use fe to check how much each sentence Pi entails Hqa on its own, and aggregate these local entailment decisions, for instance, using a max operation:

    ge({Pi}, Hqa) = max_i fe(Pi, Hqa)    (2)

(ii) Concatenate Premises (Concat): Combine the premise sentences in a sequence to form a single large passage P, and use fe to check whether this passage as a whole entails the hypothesis Hqa, making a single entailment decision:

    ge({Pi}, Hqa) = fe(P, Hqa)    (3)

Our experiments reveal, however, that neither approach is an effective means of using pre-trained entailment models for QA (see Table 1). For the example in Figure 1, the Max model would not be able to consider information from P1 and P3 together. Instead, it will pick up Silicon Valley as the answer, since P2 is close to Hs, "Facebook was launched in Silicon Valley". Similarly, Concat would also be muddled by distracting information in P2, which will weaken its confidence in the answer Cambridge. Therefore, without careful guidance, simple aggregation can easily add distracting in- […]

[…] the question. In other words, the importance of a sentence can be modeled as its entailment probability, i.e., how well the sentence by itself supports the answer hypothesis. We can use a pre-trained entailment model to obtain this. The importance αi of a sentence Pi can be modeled as:

    αi = fe(Pi, Hqa)    (4)

This can be further improved by modeling the sentence with its surrounding context. This is especially useful for passage-level QA, where the neighboring sentences provide useful context. Given a premise sentence Pi, the entailment function fe computes a single hypothesis-aware representation xi containing information in the premise that is relevant to entailing the answer hypothesis Hqa. This is essentially the output of the last layer of the neural function fe, before projecting it to logits. We denote this part of
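The relevance scoring of equation (4), together with the hypothesis-aware representation xi described above, can be pictured with the following hedged sketch of a generic entailment model split into a representation part and a scoring part. The class and method names (EntailmentModelSketch, encode_pair, classifier) are illustrative assumptions, not the actual Multee implementation.

```python
import torch
import torch.nn as nn


class EntailmentModelSketch(nn.Module):
    """Generic stand-in for a pre-trained sentence-level entailment model fe
    (e.g., ESIM). `encode_pair` stands for everything up to the last layer and
    yields the hypothesis-aware representation xi; `classifier` projects xi to
    entailment logits. Both are placeholders for illustration only."""

    def __init__(self, hidden_dim: int = 300):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Assumed 3-way NLI output; index 0 is taken to be the "entails" class.
        self.classifier = nn.Linear(hidden_dim, 3)

    def encode_pair(self, premise: str, hypothesis: str) -> torch.Tensor:
        # Placeholder: a real model would embed, encode, and cross-attend the
        # premise/hypothesis pair here.
        return torch.zeros(self.hidden_dim)

    def forward(self, premise: str, hypothesis: str):
        x_i = self.encode_pair(premise, hypothesis)   # hypothesis-aware repr. xi
        logits = self.classifier(x_i)
        p_entail = torch.softmax(logits, dim=-1)[0]   # fe(Pi, Hqa) as a probability
        return p_entail, x_i


def sentence_relevance(model: EntailmentModelSketch, premises, hypothesis):
    """Eq. (4): alpha_i = fe(Pi, Hqa), one importance weight per premise."""
    return [model(p, hypothesis)[0] for p in premises]
```

In this split, sentence_relevance corresponds to the relevance scoring of equation (4), and xi is the pre-logit representation that a relevance-weighted aggregator can combine across sentences.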