Repurposing Entailment for Multi-Hop Tasks

Harsh Trivedi♣, Heeyoung Kwon♣, Tushar Khot♠, Ashish Sabharwal♠, Niranjan Balasubramanian♣

♣ Stony Brook University, Stony Brook, U.S.A. {hjtrivedi,heekwon,niranjan}@cs.stonybrook.edu ♠ Allen Institute for Artificial Intelligence, Seattle, U.S.A. {tushark,ashishs}@allenai.org

Abstract

Question Answering (QA) naturally reduces to an entailment problem, namely, verifying whether some text entails the answer to a question. However, for multi-hop QA tasks, which require reasoning with multiple sentences, it remains unclear how best to utilize entailment models pre-trained on large-scale datasets such as SNLI, which are based on sentence pairs. We introduce Multee, a general architecture that can effectively use entailment models for multi-hop QA tasks. Multee uses (i) a local module that helps locate important sentences, thereby avoiding distracting information, and (ii) a global module that aggregates information by effectively incorporating importance weights. Importantly, we show that both modules can use entailment functions pre-trained on large-scale NLI datasets. We evaluate performance on MultiRC and OpenBookQA, two multi-hop QA datasets. When using an entailment function pre-trained on NLI datasets, Multee outperforms QA models trained only on the target QA datasets and the OpenAI transformer models.

1 Introduction

How can we effectively use textual entailment models for question answering? Previous attempts at this have resulted in limited success (Harabagiu and Hickl, 2006; Sacaleanu et al., 2008; Clark et al., 2012). With recent large-scale entailment datasets (Bowman et al., 2015; Williams et al., 2018; Khot et al., 2018) pushing entailment models to high accuracies (Chen et al., 2017; Parikh et al., 2016; Wang et al., 2017), we revisit this challenge and propose a novel method for re-purposing neural entailment models for QA.

Figure 1: An example illustrating the challenges in using a sentence-level entailment model for the multi-sentence reasoning needed for QA, and the high-level approach used in Multee.

A key difficulty in using entailment models for QA turns out to be the mismatch between the inputs to the two tasks: large-scale entailment datasets are typically framed at a sentence level, whereas question answering requires verifying whether multiple sentences, taken together as a premise, entail a hypothesis.

There are two straightforward ways to address this mismatch: (1) aggregate independent entailment decisions over each premise sentence, or (2) make a single entailment decision after concatenating all premise sentences. Neither approach is fully satisfactory. To understand why, consider the set of premises in Figure 1, which entail the hypothesis Hc. Specifically, the combined information in P1 and P3 entails Hc, which corresponds to the correct answer Cambridge. On one hand, aggregating independent decisions will fail because no individual premise entails Hc. On the other hand, simply concatenating premises to form a single paragraph will fail because distracting information in P2 and P4 can muddle useful information in P1 and P3. An effective approach, therefore, must recognize relevant sentences (i.e., avoid distracting ones) and compose their sentence-level information.

Our solution to this challenge is based on the observation that a sentence-level entailment function can be re-purposed both for recognizing relevant sentences and for computing sentence-level representations. Both tasks require comparing information in a pair of texts, but the objectives of the comparison are different. This means we can take an entailment function that is trained for basic entailment (i.e., comparing information in texts) and adapt it to work for both recognizing relevance and computing representations. Thus, this architecture allows us to incorporate advances in entailment architectures and to leverage pre-trained models obtained using large-scale entailment datasets.

To this end, we propose a general architecture that uses a (pre-trained) entailment function fe for multi-sentence QA. Given a hypothesis statement Hqa representing a candidate answer, and the set of premise sentences {Pi}, our proposed architecture uses the same function fe for two components: (a) a sentence relevance module that scores each Pi based on its potential relevance to Hqa, with the goal of weeding out distractors; and (b) a relevance-weighted aggregator that combines entailment information from multiple Pi.

Thus, we build effective entailment-aware representations of larger contexts (i.e., multiple sentences) from those of small contexts (i.e., individual sentences). The main strength of our approach is that, unlike standard attention mechanisms, the aggregator module uses the attention scores from the relevance module at multiple levels of abstraction (e.g., multiple layers of a neural network) within fe, using join operations that compose representations at each level. We refer to this multi-level aggregation of textual entailment representations as Multee (pronounced multi).

Our implementation of Multee uses ESIM (Chen et al., 2017), a recent sentence-level entailment model, pre-trained on the SNLI and MultiNLI datasets. We demonstrate its effectiveness on two challenging multi-sentence reasoning datasets: MultiRC (Khashabi et al., 2018) and OpenBookQA (Mihaylov et al., 2018). Multee using ELMo contextual embeddings (Peters et al., 2018) matches state-of-the-art results achieved with large transformer-based models (Radford et al., 2018) that were trained on a sequence of large-scale tasks (Sun et al., 2019). Ablation studies demonstrate that both relevance scoring and multi-level aggregation are valuable, and that pre-training on large entailment corpora is particularly helpful for OpenBookQA.

This work makes three main contributions: (i) A novel approach to use pre-trained entailment models for question answering. (ii) A model that combines local (sentence-level) entailment decisions with global (document-level) entailment decisions to effectively aggregate information for the multi-hop QA task. (iii) An empirical evaluation that shows entailment-based QA can achieve state-of-the-art performance on two challenging multi-hop QA datasets, OpenBookQA and MultiRC.

2 Question Answering using Entailment

Non-extractive question answering can be seen as a textual entailment problem, where we verify whether a hypothesis constructed out of a question and a candidate answer is entailed by the knowledge—a collection of sentences[1] in the source text. The probability of an answer A, given a question Q, can be modeled as the probability of a set of premises {Pi} entailing a hypothesis statement Hqa constructed from Q and A:

    Pr[A | Q, {Pi}] ≜ Pr[{Pi} ⊨ Hqa]    (1)

Here we use ⊨ to denote textual entailment. Given QA training data, we can then learn a model that approximates the entailment probability Pr[{Pi} ⊨ Hqa].

[1] This collection can be a sequence in the case of passage comprehension, or a list of sentences, potentially from varied sources, in the case of QA over multiple documents.

Can one build an effective QA model ge using an existing entailment model fe that has been pre-trained on a large-scale entailment dataset? Figure 2 illustrates two straightforward ways of doing so, using fe as a black-box function:

Figure 2: Black-box applications of a textual entailment model for QA: the Max and Concat models.

(i) Aggregate Local Decisions (Max): Use fe to check how much each sentence Pi entails Hqa on its own, and aggregate these local entailment decisions, for instance, using a max operation:

    ge({Pi}, Hqa) = max_i fe(Pi, Hqa)    (2)

(ii) Concatenate Premises (Concat): Combine the premise sentences in a sequence to form a single large passage P, and use fe to check whether this passage as a whole entails the hypothesis Hqa, making a single entailment decision:

    ge({Pi}, Hqa) = fe(P, Hqa)    (3)

Our experiments reveal, however, that neither approach is an effective means of using pre-trained entailment models for QA (see Table 1). For the example in Figure 1, the Max model would not be able to consider information from P1 and P3 together. Instead, it will pick up Silicon Valley as the answer, since P2 is close to Hs, "Facebook was launched in Silicon Valley". Similarly, Concat would also be muddled by distracting information in P2, which will weaken its confidence in the answer Cambridge. Therefore, without careful guidance, simple aggregation can easily add distracting information into the premise representation, causing entailment to fail. This motivates the need for new, effective mechanisms for global reasoning over a collection of premises.
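For concreteness, the sketch below shows how these two black-box strategies (Eqs. 2 and 3) can be wrapped around any sentence-level entailment scorer. The function entailment_prob is a stand-in for a pre-trained model such as ESIM; the toy word-overlap scorer inside it is an assumption made only so the sketch runs, not part of the actual system.

```python
import torch

def entailment_prob(premise: str, hypothesis: str) -> torch.Tensor:
    """Stand-in for a pre-trained sentence-level entailment model f_e.
    Here: a toy word-overlap score in [0, 1], used only to make the sketch runnable."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return torch.tensor(len(p & h) / max(len(h), 1), dtype=torch.float)

def max_model(premises, hypothesis):
    """Eq. 2: score each premise independently against the hypothesis, take the max."""
    return torch.stack([entailment_prob(p, hypothesis) for p in premises]).max()

def concat_model(premises, hypothesis):
    """Eq. 3: treat the concatenated premises as a single long premise."""
    return entailment_prob(" ".join(premises), hypothesis)
```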
3 Our Approach: Multee

We propose a new entailment-based QA model, Multee, with two components: (i) a sentence relevance model, which learns to focus on the relevant sentences, and (ii) a multi-layer aggregator, which uses an entailment model to obtain multiple layers of question-relevant representations for the premises and then composes them using the sentence-level scores from the relevance model. Finding relevant sentences is a form of local entailment between each premise and the answer hypothesis, whereas aggregating question-relevant representations is a form of global entailment between all premises and the answer hypothesis. This means we can effectively re-purpose the same pre-trained entailment function fe for both components. Figure 3 shows an architecture that uses multiple copies of fe to achieve this.

3.1 Sentence Relevance Model

The goal of this module is to identify sentences in the paragraph that are important for the given hypothesis. As shown in Figure 1, this helps the global module aggregate relevant content while reducing the chances of picking up distracting information. A sentence is considered important if it contains information that is relevant to answering the question. In other words, the importance of a sentence can be modeled as its entailment probability, i.e., how well the sentence by itself supports the answer hypothesis. We can use a pre-trained entailment model to obtain this. The importance αi of a sentence Pi can be modeled as:

    αi = fe(Pi, Hqa)    (4)

This can be further improved by modeling the sentence with its surrounding context, which is especially useful for passage-level QA, where the neighboring sentences provide useful context. Given a premise sentence Pi, the entailment function fe computes a single hypothesis-aware representation xi containing the information in the premise that is relevant to entailing the answer hypothesis Hqa. This is essentially the output of the last layer of the neural function fe before projecting it to logits. We denote the part of fe that outputs this last vector representation as fev, and the full fe that gives the entailment probability as fep.

We use these hypothesis-aware xi vectors for each sentence as inputs to a BiLSTM, producing a contextual representation ci for each premise sentence Pi, which is then fed to a feedforward layer that predicts the sentence-level importance as:

    αi = softmax(W^T ci + b)    (5)

The components for generating xi are part of the original entailment function fe and can be pre-trained on the entailment dataset. The BiLSTM that computes ci and the parameters W and b for computing αi are not part of the original entailment function and thus can only be trained on the target QA task. We perform this additional contextualization only when the sentences form contiguous text. Additionally, for datasets such as MultiRC, where the relevant sentences have been marked, we introduce a loss term based on the true relevance labels and the predicted weights αi.
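A minimal sketch of this relevance module with contextualization (Eq. 5) is shown below. It assumes a hypothetical wrapper f_ev that returns the hypothesis-aware vector xi from the pre-trained entailment stack; dimensions, names, and the unbatched interface are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class SentenceRelevance(nn.Module):
    """Scores each premise sentence for relevance to the answer hypothesis (Eq. 5)."""
    def __init__(self, f_ev: nn.Module, dim: int = 300, hidden: int = 150):
        super().__init__()
        self.f_ev = f_ev                        # pre-trained entailment encoder producing x_i
        self.bilstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(2 * hidden, 1)  # W, b in Eq. 5

    def forward(self, premises, hypothesis):
        # x_i: hypothesis-aware vector for each premise sentence, shape (n, dim)
        x = torch.stack([self.f_ev(p, hypothesis) for p in premises]).unsqueeze(0)
        c, _ = self.bilstm(x)                   # contextualize across the paragraph
        logits = self.scorer(c).squeeze(-1)     # one score per sentence, shape (1, n)
        return torch.softmax(logits, dim=-1).squeeze(0)  # relevance weights alpha_i
```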

Figure 3: Multee overview. Multee includes two main components, a relevance module and a multi-layer aggregator module. Both modules use pre-trained entailment functions (fep and fev). fep is the full entailment model that gives the entailment probability, and fev is the part of it excluding the last projection to logits and softmax. The multi-level aggregator uses multiple copies of the entailment function fev, one for each sub-aggregator performing a join at a different layer. The right part of the figure zooms in on one such sub-aggregator joining at layer ℓ.

3.2 Multi-level Aggregation

The goal of this module is to aggregate representations from important sentences in order to make a global entailment decision. There are two key questions to answer: (1) how to combine the sentence-level information into a paragraph-level representation, and (2) how to use the sentence relevance weights {αi}.

Most entailment models include many layers that transform the input premise and the hypothesis. A typical neural entailment stack includes encoding layers that independently generate contextual representations of the premise and the hypothesis, followed by some cross-attention layer that yields relationships between the premise and hypothesis words, and additional layers that use this cross-attention to generate premise-attended representations of the hypothesis and vice versa. The final layers are classification layers, which determine entailment based on the representations from the previous layer. Each layer thus generates an intermediate representation that captures a different type of entailment-related information. This presents us with a choice of multiple points for aggregation.

Figure 3 illustrates our approach for aggregating sentence-level representations into a single paragraph-level representation. For each premise Pi in the passage, we first process the pair (Pi, Hqa) through the entailment stack (fev), resulting in a set of intermediate representations {X̃iℓ} for each layer ℓ. We can choose a particular layer ℓ to be the aggregation layer. We then compute a weighted combination of the sentence-level outputs at this layer {X̃iℓ} to produce a passage-level representation Ỹℓ. The weights for the sentences are obtained from the Sentence Relevance model. We refer to this as a join operation, as shown in Figure 3. Layers of the entailment function fev that are below the join operate at a sentence level, while layers above the join operate over paragraph-wise representations. The final layer (i.e., the topmost layer) of fev thus gives us a vector representation of the entire passage. This type of join can be applied at multiple layers, resulting in paragraph vectors that correspond to multiple levels of aggregation. We concatenate these paragraph vectors and pass them through a feedforward network projecting them down to logits, which can be used to compute the final passage-wide entailment probabilities.

3.2.1 Join Operations

Given a set of sentence-wise outputs from the lower layer {X̃i} and the corresponding sentence-relevance weights {αi}, the join operation combines them into a single passage-level representation Ỹ, which can be directly consumed by the layer above it in the stack. The specifics of the join operation depend on the shape of the outputs from the lower layer and the shape of the inputs expected by the layer after the join. Here we show four possible join operations, one for each layer. The ones defined for the Score Layer and the Embedding Layer reduce to the black-box baselines, while we use the other two in Multee.

Score Layer: The score layer outputs the entailment probabilities {si} for each premise-hypothesis pair independently, and these need to be joined into one entailment score. One way to do this is to simply take a weighted maximum of the individual entailment probabilities. So we have X̃i = si ∀i and Ỹ = max_i (αi si). This reduces to the black-box Max model (Equation 2) when using {αi} = 1.

Embedding Layer: The embedding layer outputs a sequence of embedded vectors [P̄i], one sequence for each premise Pi, and another sequence of embedded vectors [H̄qa] for the answer hypothesis Hqa.[2] A join operation in this case scales each embedded vector in a premise by its relevance weight and concatenates the scaled sequences together to form [P̄], while [H̄qa] is passed through unchanged:

    X̃i = ([P̄i], [H̄qa])  ∀i
    [P̄] = [α1[P̄1]; α2[P̄2]; ... ; αn[P̄n]]
    Ỹ = ([P̄], [H̄qa])

For non-contextual word embeddings, this reduces to Concat Premises (Eq. 3) when {αi} = 1.

[2] We use [·] to denote a sequence and an overbar (e.g., P̄) to denote a vector.
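These two simpler joins can be sketched directly as tensor operations. The snippet below is illustrative only; the tensor names and shapes (n premises, embedding dimension d, per-premise lengths p_i) are assumptions rather than the authors' code.

```python
import torch

def score_layer_join(s: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Score-layer join: weighted max over per-premise entailment probabilities s_i."""
    return (alpha * s).max()

def embedding_layer_join(premise_embs, hyp_emb, alpha):
    """Embedding-layer join: scale each premise's token embeddings by its
    relevance weight alpha_i and concatenate the sequences into one passage."""
    scaled = [a * P for a, P in zip(alpha, premise_embs)]  # each P: (p_i, d)
    passage = torch.cat(scaled, dim=0)                     # (sum_i p_i, d)
    return passage, hyp_emb                                # hypothesis unchanged
```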

Final Layer (FL): The final layer in the entailment stack usually outputs a single vector h̄, which is then used in a linear layer and softmax to produce label probabilities. The join operation here is a weighted sum of the premise-level vectors:

    X̃i = h̄i  ∀i
    Ỹ = Σ_i αi h̄i

This is similar to a standard attention mechanism, where the attended representation is computed by summing the scaled representations. However, such scaled addition is not possible when the outputs from lower layers are not of the same shapes, as in the following case.

Cross Attention Layer (CA): Cross-attention is a standard component of many entailment and reading comprehension models. This layer produces three outputs: (i) for each premise Pi, we get a hypothesis-to-premise cross-attention matrix M^{hpi} with shape (h × pi), where h is the number of hypothesis tokens and pi is the number of tokens in premise Pi; (ii) for each premise Pi, we get a sequence of vectors [P̄i] that corresponds to the token sequence of the premise Pi; and (iii) for the hypothesis, we get a single sequence of vectors [H̄qa] that corresponds to its token sequence. The attention matrix M^{hpi} is generated by cross-attention from [H̄qa] to [P̄i].

The join operation in this layer produces a cross-attention matrix that spans the entire passage, i.e., has shape (h × p), where p is the total number of tokens across all premises. The operation first scales the cross-attention matrices by the sentence-relevance weights {αi} in order to "tone down" the influence of distracting/irrelevant sentences, and then re-normalizes the final matrix:

    X̃i = (M^{hpi}, [P̄i], [H̄qa])  ∀i
    M^{hp} = [α1 M^{hp1}; ... ; αn M^{hpn}]
    M^{hp}_{ij} ← M^{hp}_{ij} / Σ_k M^{hp}_{ik}
    [P̄] = [[P̄1]; [P̄2]; ... ; [P̄n]]
    Ỹ = (M^{hp}, [P̄], [H̄qa])

where M^{hp}_{ij} is the entry in the i-th row and j-th column of M^{hp}.

Multee's multi-layer aggregator module uses join operations at two levels: the Cross Attention Layer (CA) and the Final Layer (FL). The two corresponding aggregators share parameters up to the lower of the two join layers (CA in this case), where they both operate at the sentence level. Above this layer, one aggregator switches to operating at the paragraph level, where it has its own, unshared parameters. In general, if Multee were to aggregate at layers ℓ_1, ℓ_2, ..., ℓ_k, then the aggregators with joins at layers ℓ and ℓ′ respectively could share parameters at layers 1, ..., min{ℓ, ℓ′}.
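As a rough sketch of the two joins Multee actually uses, the code below shows the final-layer weighted sum and the cross-attention scaling plus re-normalization described above. Variable names, shapes, and the padding-free list-of-tensors interface are assumptions made for clarity, not the released implementation.

```python
import torch

def final_layer_join(h: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """FL join: weighted sum of per-premise final vectors h_i, shape (n, d) -> (d,)."""
    return (alpha.unsqueeze(-1) * h).sum(dim=0)

def cross_attention_join(attn_mats, premise_vecs, alpha):
    """CA join: scale each premise's (h x p_i) attention matrix by alpha_i,
    concatenate along the premise-token axis, and re-normalize each row."""
    scaled = [a * M for a, M in zip(alpha, attn_mats)]
    M_hp = torch.cat(scaled, dim=1)              # (h, sum_i p_i)
    M_hp = M_hp / M_hp.sum(dim=1, keepdim=True)  # row-wise re-normalization
    passage = torch.cat(premise_vecs, dim=0)     # (sum_i p_i, d)
    return M_hp, passage
```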
3.3 Implementation Details

Multee uses the ESIM stack as the entailment function, pre-trained on SNLI and MultiNLI, for both the relevance module and the multi-layer aggregator module. It uses aggregation at two levels, one at the cross-attention level (CA) and one at the final layer (FL). All uses of the entailment function in Multee are initialized with the same pre-trained entailment model weights. The embedding layer and the BiLSTM layer process paragraph-level contexts, but processing at higher layers is done either at the premise level or at the paragraph level, depending on where the join operation is performed.

4 Experiments

Datasets: We evaluate Multee on two datasets, OpenBookQA (Mihaylov et al., 2018) and MultiRC (Khashabi et al., 2018), both of which are specifically designed to test reasoning over multiple sentences. MultiRC is a paragraph-based multiple-choice QA dataset derived from varying topics, where the questions are answerable based on information from the paragraph. In MultiRC, each question can have more than one correct answer choice, and so it can be viewed as a binary classification task (one prediction per answer choice), with 4,848 / 4,583 examples in the Dev / Test sets. OpenBookQA, on the other hand, has multiple-choice science questions with exactly one correct answer choice and no associated paragraph. As a result, this dataset requires the relevant facts to be retrieved from auxiliary resources, including the open book of facts released with the paper and other sources such as WordNet (Miller, 1995) and ConceptNet (Speer and Havasi, 2012). It contains 500 questions in the Dev and Test sets.

                                          OpenBookQA (Accuracy)    MultiRC (F1a | F1m | EM)
                                          Dev      Test            Dev                   Test

Entailment-based models with ELMo
  Multee                                  56.2     54.8            69.6 | 73.0 | 22.8    70.4 | 73.8 | 24.5
  Concatenate Premises                    47.2     47.6            68.3 | 71.3 | 17.9    68.5 | 72.5 | 18.0
  Max of Local Decisions                  47.8     45.2            66.5 | 69.3 | 16.3    66.7 | 70.4 | 19.4

Entailment-based models with GloVe
  Multee                                  54.6     55.8            68.3 | 71.7 | 16.4    69.9 | 73.6 | 19.0
  Concatenate Premises                    47.4     42.6            66.9 | 70.7 | 14.8    68.7 | 72.4 | 16.3
  Max of Local Decisions                  44.4     47.6            66.8 | 70.3 | 16.9    67.7 | 71.1 | 18.2

Previously published results
  LR (Khashabi et al., 2018)              —        —               63.7 | 66.5 | 11.8    63.5 | 66.9 | 12.8
  IR (Khashabi et al., 2018)              —        —               60.0 | 64.3 | —       54.8 | 53.9 | —
  QM + ELMo (Mihaylov et al., 2018)       54.6     50.2            —                     —
  ESIM + ELMo (Mihaylov et al., 2018)     53.9     48.9            —                     —
  KER (Mihaylov et al., 2018)             55.6     51.4            —                     —

Large transformer models
  OFT (Sun et al., 2019)                  —        52.0            67.2 | 69.3 | 15.2    —
  OFT (ensemble) (Sun et al., 2019)       —        52.8*           67.7 | 70.3 | 16.5*   —
  RS (Sun et al., 2019)                   —        55.2*           69.2 | 71.5 | 22.6*   —
  RS (ensemble) (Sun et al., 2019)        —        55.8*           70.5 | 73.1 | 21.8*   —

Table 1: Comparison of Multee with other systems. Starred (*) results are based on a contemporaneous system. Results marked (—) are not available. RS is Reading Strategies, KER is Knowledge Enhanced Reader, OFT is the OpenAI Fine-tuned Transformer.

Preprocessing: For each question and answer choice, we create an answer hypothesis statement using a modified version of the script used in SciTail (Khot et al., 2018) construction. We wrote a handful of rules to better convert the question and answer into a hypothesis. We also mark the span of the answer in the hypothesis with special begin and end tokens, @@@answer and answer@@@ respectively.[3] For MultiRC, we also apply an off-the-shelf coreference resolution model[4] and replace the mentions when they resolve to pronouns occurring in a different sentence.[5] For OpenBookQA, we use the exact same retrieval as released by the authors of OpenBookQA[6] and use the OpenBook and WordNet as the knowledge source, with the top 5 sentences retrieved per query.

[3] Answer span marking gave substantial gains for all entailment-based models, including the baselines.
[4] https://github.com/huggingface/neuralcoref
[5] It is hard to learn coreference, as these target datasets are too small to learn this in an end-to-end fashion.
[6] https://github.com/allenai/OpenBookQA

Training Multee: For OpenBookQA, we use a cross entropy loss over the labels corresponding to the 4 answer choices. For MultiRC, we use a binary cross entropy loss for each answer choice separately, since in MultiRC each question can have more than one correct answer choice. The entailment components are pre-trained on sentence-level entailment tasks and then fine-tuned as part of end-to-end QA training. The MultiRC dataset includes sentence-level relevance labels; we supervise the Sentence Relevance module with a binary cross entropy loss for predicting these relevance labels when available. We used PyTorch (Paszke et al., 2017) and AllenNLP to implement our models and ran them on Beaker.[7] For pre-training, we use the same hyper-parameters of ESIM (Chen et al., 2017) as available in the AllenNLP (Gardner et al., 2017) implementation and fine-tune the model parameters. We do not perform any hyper-parameter tuning for any of our models. We fine-tune all layers in ESIM except for the embedding layer.

[7] https://beaker.org/

Models Compared: We experiment with GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) embeddings for Multee and compare with the following three types of systems:

(A) Baselines using entailment as a black-box: We use the pre-trained entailment model as a black-box in two ways: concatenate premises (Concat) and aggregate sentence-level decisions with a max operation (Max). Both models were also pre-trained on the SNLI and MultiNLI datasets and fine-tuned on the target QA datasets with the same pre-processing.
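As a concrete illustration of the training objectives described above, the snippet below sketches a multi-class cross entropy for OpenBookQA's four choices and a per-choice binary cross entropy for MultiRC, with an optional relevance-supervision term. The shapes, the use of logits for the relevance term, and the lambda weighting are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def openbookqa_loss(choice_logits: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    # choice_logits: (batch, 4) entailment scores; gold: (batch,) index of correct choice
    return F.cross_entropy(choice_logits, gold)

def multirc_loss(choice_logits, labels, rel_logits=None, rel_labels=None, lam=1.0):
    # choice_logits/labels: (batch,) one score/label per answer choice (several may be correct)
    loss = F.binary_cross_entropy_with_logits(choice_logits, labels.float())
    if rel_logits is not None:  # optional supervision of the relevance module
        loss = loss + lam * F.binary_cross_entropy_with_logits(rel_logits, rel_labels.float())
    return loss
```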

(B) Previously published results: For MultiRC, there are two published baselines: IR (Information Retrieval) and LR (Logistic Regression). These simple models turn out to be strong baselines on this relatively small dataset. For OpenBookQA, we report published baselines from Mihaylov et al. (2018): Question Match with ELMo (QM + ELMo), Question-to-Answer ESIM with ELMo (ESIM + ELMo), and their best result with the Knowledge Enhanced Reader (KER).

(C) Large Transformer based models: We compare with the OpenAI Transformer (OFT), pre-trained on a large-scale language modeling task and fine-tuned on the respective datasets. A contemporaneous work,[8] which published these transformer results, also fine-tuned this transformer further on a large-scale reading comprehension dataset, RACE (Lai et al., 2017), before fine-tuning on the target QA datasets with their method, Reading Strategies.

[8] Published on arXiv on Oct 31, 2018 (Sun et al., 2019).

4.1 Results

Table 1 summarizes the performance of all models. Multee outperforms the black-box entailment baselines (Concat and Max) that were pre-trained on the same data, the previously published baselines, and the OpenAI transformer models. We note that the 95% confidence intervals around baseline accuracy for OpenBookQA and MultiRC are 4.3% and 1.3%, respectively.

On the OpenBookQA test set, Multee with GloVe outperforms the ensemble version of the OpenAI transformer by 3.0 points in accuracy. It also outperforms the single-model version of the Reading Strategies system and is comparable to their ensemble version. On the MultiRC dev set, Multee with ELMo outperforms the ensemble version of the OpenAI transformer by 1.9 points in F1a, 2.7 in F1m, and 6.3 in EM. It also outperforms the single-model version of the Reading Strategies system and is comparable to their ensemble version. Recall that the Reading Strategies results are reported with additional fine-tuning on another, larger QA dataset, RACE (Lai et al., 2017), aside from the target QA datasets we use here.

While ELMo contextual embeddings helped in MultiRC, they did not help OpenBookQA. We believe this is in part due to a mismatch with our ELMo training setup, where all sentences are treated as a single sequence, which, while true in MultiRC, is not the case in OpenBookQA.

In general, gains from Multee are more prominent in OpenBookQA than in MultiRC. We hypothesize that a key contributor to this difference is distraction being a lesser challenge in MultiRC, where premise sentences come from a single paragraph whose other sentences are often irrelevant and rarely distract towards incorrect answers. OpenBookQA has a noisier set of sentences, since an equal number of sentences is retrieved for the correct and each incorrect answer choice.

4.2 Ablations

Relevance Model Ablation. Table 2 shows the utility of the relevance module. We use the same setting as the full model (aggregation at the Cross Attention (CA) and Final Layer (FL)). As shown in the table, using the relevance module weights (αi) leads to improved accuracy on both datasets (substantially so in OpenBookQA) as compared to ignoring the module, i.e., setting all weights to 1. In MultiRC, we show that additional supervision for the relevance module leads to even further improvements in score.

                                              OpenBookQA    MultiRC
                                              Accuracy      F1a | F1m
  all αi = 1 (no relevance weights)           50.6          67.3 | 70.3
  αi (relevance weights)                      55.8          67.4 | 71.0
  αi + supervise (supervised relevance)       —             68.3 | 71.7

Table 2: Relevance model ablation of Multee. Test results on OpenBookQA and Dev results on MultiRC.

Multi-Level Aggregator Ablation. Multee performs aggregation at two levels: the Cross Attention Layer (CA) and the Final Layer (FL). We denote this by CA+FL. To show that multi-level aggregation is better than individual aggregations, we train models with aggregation at only FL and at only CA. Table 3 shows that multi-layer aggregation is better than CA or FL alone on both datasets.

  Aggregator              OpenBookQA    MultiRC
                          Accuracy      F1a | F1m
  Cross Attention (CA)    45.8          67.2 | 71.1
  Final Layer (FL)        51.0          68.3 | 71.5
  CA + FL                 55.8          68.3 | 71.7

Table 3: Aggregator-level ablation of Multee. On MultiRC, Multee uses relevance supervision, but not on OpenBookQA, where such supervision is unavailable. Test results on OpenBookQA and Dev results on MultiRC.

4.3 Effect of Pre-training

One of the benefits of using entailment-based components in a QA model is that we can pre-train them on large-scale entailment datasets and fine-tune them as part of the QA model. Table 4 shows that such pre-training is valuable. The model trained from scratch is substantially worse in the case of OpenBookQA, highlighting the benefits of our entailment-based QA model.

                     OpenBookQA    MultiRC
                     Accuracy      F1a | F1m
  SNLI + MultiNLI    55.8          69.9 | 73.6
  SNLI               50.4          69.3 | 73.3
  Scratch            42.2          68.3 | 72.6

Table 4: Effect (on test data) of pre-training the entailment model used in Multee.

Multee's benefits come from two sources: (i) re-purposing of the entailment function for multi-sentence question answering, and (ii) transferring from a large-scale entailment task. In the case of OpenBookQA, both are helpful. For MultiRC, only the first is a significant contributor. Table 5 shows that re-purposing was a bigger factor for MultiRC, since the Max and Concat models do not work well when trained from scratch.

                              OpenBookQA    MultiRC
                              Accuracy      F1a | F1m
  Max     SNLI + MultiNLI     47.6          66.8 | 70.3
          Scratch             32.4          42.8 | 44.0
  Concat  SNLI + MultiNLI     42.6          66.9 | 70.7
          Scratch             35.8          51.3 | 50.4

Table 5: Pre-training ablations of the black-box entailment baselines for OpenBookQA (test) and MultiRC (dev).

5 Analysis

Relevance Loss. The sentence-level relevance model provides a way to dig deeper into the overall QA model's behavior. When sentence-level supervision is available, as in the case of MultiRC, we can analyze the impact of different auxiliary losses for the relevance module. Table 6 shows the QA performance with different relevance losses, and Figure 5 shows a visualization of attention scores for a question in MultiRC. Overall, we find that two types of behaviors emerge from different loss functions. For instance, trying to minimize the sum of attention probability mass on irrelevant sentences, i.e., Σ_i αi(1 − yi), called the IR Sum Loss, causes the attention scores to become "peaky", i.e., high for one or two sentences and close to zero for others. This leads to higher precision but significantly lower recall for the QA system, as it now uses information from fewer but highly relevant sentences. Binary cross entropy loss (BCE) allows the model to attend to more relevant sentences, thereby increasing recall without too much drop in precision.

                 F1a precision    F1a recall
  IR Sum Loss    59.5             68.5
  BCE Loss       58.0             83.2

Table 6: F1a precision and recall on MultiRC Dev with two kinds of relevance losses. IR Sum is the sum of attention probability mass on irrelevant sentences. BCE is binary cross entropy loss.

Figure 5: Sentence-level attentions for various sentence relevance losses. R: annotated relevant sentences.
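The two auxiliary losses compared in Table 6 can be written compactly as below. This is a sketch under the assumption that alpha holds the softmax-normalized relevance weights for one paragraph and y holds the binary gold relevance labels.

```python
import torch

def ir_sum_loss(alpha: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """IR Sum loss: attention probability mass placed on irrelevant sentences,
    i.e., sum_i alpha_i * (1 - y_i)."""
    return (alpha * (1.0 - y.float())).sum()

def bce_relevance_loss(alpha: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Binary cross entropy between predicted relevance weights and gold labels."""
    alpha = alpha.clamp(eps, 1.0 - eps)
    return -(y.float() * alpha.log() + (1 - y.float()) * (1 - alpha).log()).mean()
```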
Failure Cases. As Figure 5 shows, our model with BCE loss tends to distribute the attention, especially to sentences close to the relevant ones. We hypothesize that the model is learning to use the contextualized BiLSTM representations to incorporate information from neighboring sentences, which is useful for this task and for passage understanding in general. For example, more than 60% of Dev questions in MultiRC have at least one adjacent relevant sentence pair. Figure 4a illustrates this behavior.

On the other hand, if the relevant sentences are far apart, the model finds it difficult to handle such long-range cross-sentence dependencies in its contextualized representations. As a result, it ends up focusing attention on the most relevant sentence, missing out on other relevant sentences (Figure 4b). When these unattended but relevant sentences contain the answer, the model fails.

Figure 4: Success and failure examples of Multee from MultiRC: (a) a positive example and (b) a negative example. R: annotated relevant sentences. Green/yellow: high/low predicted relevance.

6 Related Work

Entailment systems have been applied to question answering before, but have had only limited success (Harabagiu and Hickl, 2006; Sacaleanu et al., 2008; Clark et al., 2012), in part because of the small size of the early entailment datasets (Dagan et al., 2006, 2013). Recent large-scale entailment datasets such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) have led to many new powerful neural entailment models that are not only more effective, but also produce better representations of sentences (Conneau et al., 2017). Models such as Decomposable Attention (Parikh et al., 2016) and ESIM (Chen et al., 2017), on the other hand, find alignments between the hypothesis and premise words through cross-attention. However, these improvements in entailment models have not yet translated into improvements in end tasks such as question answering.

SciTail (Khot et al., 2018) was created from a science QA task to push for models with a direct impact on QA. Entailment models trained on this dataset show minor improvements on the Aristo Reasoning Challenge (Clark et al., 2018; Musa et al., 2018). However, these QA systems make independent predictions and cannot combine information from multiple supporting sentences.

Combining information from multiple sentences is a key problem in language understanding. Recent reading comprehension datasets (Welbl et al., 2018; Khashabi et al., 2018; Yang et al., 2018; Mihaylov et al., 2018) explicitly evaluate a system's ability to perform such reasoning through questions that need information from multiple sentences in a passage. Most approaches on these tasks perform simple attention-based aggregation (Mihaylov et al., 2018; Song et al., 2018; Cao et al., 2018) and do not exploit the entailment models trained on large-scale datasets.

7 Conclusions

Using entailment for question answering has seen limited success. Neural entailment models are designed and trained on tasks defined over sentence pairs, whereas QA often requires reasoning over longer texts spanning multiple sentences. We propose Multee, a novel QA model that addresses this mismatch. It uses an existing entailment model both to focus on relevant sentences and to aggregate information from these sentences. Results on two challenging QA datasets, as well as our ablation study, indicate that entailment-based QA can achieve state-of-the-art performance and is a promising direction for further research.

Acknowledgements

This work is supported in part by the National Science Foundation under Grant IIS-1815358. The computations on beaker.org were supported in part by credits from Google Cloud.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2018. Question answering by reasoning across documents with graph convolutional networks. CoRR abs/1808.09920.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of ACL, pages 1657–1668.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Peter Clark, Philip Harrison, and Xuchen Yao. 2012. An entailment-based approach to the QA4MRE challenge. In CLEF.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of EMNLP, pages 670–680.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, Springer, pages 177–190.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies 6(4):1–220.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.

Sanda Harabagiu and Andrew Hickl. 2006. Methods for using textual entailment in open-domain question answering. In Proceedings of COLING-ACL, pages 905–912.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of NAACL.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Proceedings of AAAI.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of EMNLP, pages 785–794.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of EMNLP, pages 2381–2391.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.

Ryan Musa, Xiaoyan Wang, Achille Fokoue, Nicholas Mattei, Maria Chang, Pavan Kapanipathi, Bassem Makni, Kartik Talamadupula, and Michael Witbrock. 2018. Answering science exam questions using query rewriting with background knowledge. CoRR abs/1809.05726.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of EMNLP, pages 2249–2255.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Bogdan Sacaleanu, Constantin Orasan, Christian Spurk, Shiyan Ou, Oscar Ferrandez, Milen Kouylekov, and Matteo Negri. 2008. Entailment-based question answering for structured data. In COLING 2008: Demonstration Papers, pages 173–176.

Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. CoRR abs/1809.02040.

Robert Speer and Catherine Havasi. 2012. Representing general relational knowledge in ConceptNet 5. In Proceedings of LREC.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of NAACL.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of IJCAI.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6:287–302.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL, pages 1112–1122.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP.
