arXiv:1905.08511v2 [cs.CL] 29 May 2019

Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction

Kosuke Nishida¹, Kyosuke Nishida¹, Masaaki Nagata², Atsushi Otsuka¹, Itsumi Saito¹, Hisako Asano¹, Junji Tomita¹
¹ NTT Media Intelligence Laboratories, NTT Corporation
² NTT Communication Science Laboratories, NTT Corporation
[email protected]

Abstract

Question answering (QA) using textual sources for purposes such as reading comprehension (RC) has attracted much attention. This study focuses on the task of explainable multi-hop QA, which requires the system to return the answer with evidence sentences by reasoning and gathering disjoint pieces of the reference texts. It proposes the Query Focused Extractor (QFE) model for evidence extraction and uses multi-task learning with the QA model. QFE is inspired by extractive summarization models; compared with the existing method, which extracts each evidence sentence independently, it sequentially extracts evidence sentences by using an RNN with an attention mechanism on the question sentence. It enables QFE to consider the dependency among the evidence sentences and cover important information in the question sentence. Experimental results show that QFE with a simple RC baseline model achieves a state-of-the-art evidence extraction score on HotpotQA. Although designed for RC, it also achieves a state-of-the-art evidence extraction score on FEVER, which is a recognizing textual entailment task on a large textual database.

Figure 1: Concept of explainable multi-hop QA. Given a question and multiple textual sources, the system extracts evidence sentences from the sources and returns the answer and the evidence.

1 Introduction

Reading comprehension (RC) is a task that uses textual sources to answer any question. It has seen significant progress since the publication of numerous datasets such as SQuAD (Rajpurkar et al., 2016). To achieve the goal of RC, systems must be able to reason over disjoint pieces of information in the reference texts. Recently, multi-hop question answering (QA) datasets focusing on this capability, such as QAngaroo (Welbl et al., 2018) and HotpotQA (Yang et al., 2018), have been released.

Multi-hop QA faces two challenges. The first is the difficulty of reasoning. It is difficult for the system to find the disjoint pieces of information as evidence and reason using the multiple pieces of such evidence. The second challenge is interpretability. The evidence used to reason is not necessarily located close to the answer, so it is difficult for users to verify the answer.

Yang et al. (2018) released HotpotQA, an explainable multi-hop QA dataset, as shown in Figure 1. HotpotQA provides the evidence sentences of the answer for supervised learning. The evidence extraction in multi-hop QA is more difficult than that in other QA problems because the question itself may not provide a clue for finding evidence sentences. As shown in Figure 1, the system finds an evidence sentence (Evidence 2) by relying on another evidence sentence (Evidence 1). The capability of being able to explicitly extract evidence is an advance towards meeting the above two challenges.

Here, we propose a Query Focused Extractor (QFE) that is based on a summarization model. We regard the evidence extraction of the explainable multi-hop QA as a query-focused summarization task. Query-focused summarization is the task of summarizing the source document with regard to the given query. QFE sequentially extracts the evidence sentences by using an RNN with an attention mechanism on the question sentence, while the existing method extracts each evidence sentence independently. This query-aware recurrent structure enables QFE to consider the dependency among the evidence sentences and cover the important information in the question sentence.
Our overall model uses multi-task learning with a QA model for answer selection and QFE for evidence extraction. The multi-task learning with QFE is general in the sense that it can be combined with any QA model. Moreover, we find that the recognizing textual entailment (RTE) task on a large textual database, FEVER (Thorne et al., 2018), can be regarded as an explainable multi-hop QA task. We confirm that QFE effectively extracts the evidence both on HotpotQA for RC and on FEVER for RTE.

Our main contributions are as follows.

• We propose QFE for explainable multi-hop QA. We use the multi-task learning of the QA model for answer selection and QFE for evidence extraction.
• QFE adaptively determines the number of evidence sentences by considering the dependency among the evidence sentences and the coverage of the question.
• QFE achieves state-of-the-art performance on both HotpotQA and FEVER in terms of the evidence extraction score and comparable performance to competitive models in terms of the answer selection score. QFE is the first model that outperformed the baseline on HotpotQA.

2 Task Definition

Here, we re-define explainable multi-hop QA so that it includes the RC and the RTE tasks.

Def. 1. Explainable Multi-hop QA
Input: Context C (multiple texts), Query Q (text)
Output: Answer Type $A_T$ (label), Answer String $A_S$ (text), Evidence E (multiple texts)

The Context C is regarded as one connected text in the model. If the connected C is too long (e.g. over 2000 words), it is truncated. The Query Q is the query. The model answers Q with an answer type $A_T$ or an answer string $A_S$. The Answer Type $A_T$ is selected from the answer candidates, such as ‘Yes’. The answer candidates depend on the task setting. The Answer String $A_S$ exists only if there are not enough answer candidates to answer Q. The answer string $A_S$ is a short span in C. Evidence E consists of the sentences in C and is required to answer Q.

For RC, we tackle HotpotQA. In HotpotQA, the answer candidates are ‘Yes’, ‘No’, and ‘Span’. The answer string $A_S$ exists if and only if the answer type $A_T$ is ‘Span’. C consists of ten Wikipedia paragraphs. The evidence E consists of two or more sentences in C.

For RTE, we tackle FEVER. In FEVER, the answer candidates are ‘Supports’, ‘Refutes’, and ‘Not Enough Info’. The answer string $A_S$ does not exist. C is the Wikipedia database. The evidence E consists of the sentences in C.

3 Proposed Method

This section first explains the overall model architecture, which contains our model as a module, and then the details of our QFE.

3.1 Model Architecture

Except for the evidence layer, our model is the same as the baseline (Clark and Gardner, 2018) used in HotpotQA (Yang et al., 2018). Figure 2 shows the model architecture. The input of the model is the context C and the query Q. The model has the following layers.

Figure 2: Overall model architecture. The answer layer is the version for the RC task.

The Word Embedding Layer encodes C and Q as sequences of word vectors. A word vector is the concatenation of a pre-trained word embedding and a character-based embedding obtained using a CNN (Kim, 2014). The outputs are $C_1 \in \mathbb{R}^{l_w \times d_w}$, $Q_1 \in \mathbb{R}^{m_w \times d_w}$, where $l_w$ is the length (in words) of C, $m_w$ is the length of Q and $d_w$ is the size of the word vector.

The Context Layer encodes $C_1$, $Q_1$ as contextual vectors $C_2 \in \mathbb{R}^{l_w \times 2d_c}$, $Q_2 \in \mathbb{R}^{m_w \times 2d_c}$ by using a bi-directional RNN (Bi-RNN), where $d_c$ is the output size of a uni-directional RNN.

Figure 3: Overview of Query Focused Extractor at step $t$. $z^t$ is the current summarization vector. $g^t$ is the query vector considering the current summarization. $e^t$ is the extracted sentence. $x_{e^t}$ updates the RNN state.
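The shape bookkeeping of the Word Embedding Layer and the Context Layer can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`random_matrix`, `rnn_pass`, `context_layer`), the toy dimensions, the random weights, and the plain tanh recurrence are all assumptions made for illustration, since this section does not specify the RNN cell. The sketch only shows that concatenating two embeddings gives a width of $d_w$ per word, and that a Bi-RNN with uni-directional output size $d_c$ gives a width of $2d_c$.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Small random weights; hypothetical stand-ins for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rnn_pass(inputs, w_in, w_rec, reverse=False):
    """Run a plain tanh RNN (an assumed cell) and return one state per word."""
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    states = []
    for x in (reversed(inputs) if reverse else inputs):
        h = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[k], x))
                + sum(w * v for w, v in zip(w_rec[k], h))
            )
            for k in range(hidden_size)
        ]
        states.append(h)
    if reverse:
        states.reverse()  # realign so that states[i] belongs to word i
    return states


def context_layer(word_vectors, d_c, seed=0):
    """Bi-RNN: concatenate forward and backward states, 2 * d_c values per word."""
    rng = random.Random(seed)
    d_w = len(word_vectors[0])
    fwd = rnn_pass(word_vectors, random_matrix(d_c, d_w, rng),
                   random_matrix(d_c, d_c, rng))
    bwd = rnn_pass(word_vectors, random_matrix(d_c, d_w, rng),
                   random_matrix(d_c, d_c, rng), reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]


# Toy context of l_w = 5 words: a 3-dim "pre-trained" part and a
# 2-dim "character-based" part are concatenated, so d_w = 5.
rng = random.Random(1)
pretrained = random_matrix(5, 3, rng)
char_based = random_matrix(5, 2, rng)
C1 = [p + c for p, c in zip(pretrained, char_based)]  # l_w x d_w
C2 = context_layer(C1, d_c=4)                         # l_w x 2*d_c

print(len(C1), len(C1[0]))  # 5 5
print(len(C2), len(C2[0]))  # 5 8
```

The query Q would pass through the same two layers to give $Q_1$ and $Q_2$, with $m_w$ rows instead of $l_w$.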
The Matching Layer encodes $C_2$, $Q_2$ as matching vectors $C_3 \in \mathbb{R}^{l_w \times d_c}$ by using bi-directional attention (Seo et al., 2017), a Bi-RNN, and self-attention (Wang et al., 2017).

The Evidence Layer first encodes $C_3$ as $[\overrightarrow{C_4}; \overleftarrow{C_4}] \in \mathbb{R}^{l_w \times 2d_c}$ by a Bi-RNN. Let $j_1(i)$ be the index of the first word of the $i$-th sentence in C and $j_2(i)$ be the index of the last word. We define the vector of the $i$-th sentence as:

$$x_i = [\overrightarrow{c_{4,j_2(i)}}; \overleftarrow{c_{4,j_1(i)}}] \in \mathbb{R}^{2d_c}.$$

Here, $X \in \mathbb{R}^{l_s \times 2d_c}$ is the sentence-level context vectors, where $l_s$ is the number of sentences of C.

QFE, described later, receives sentence-level context vectors $X \in \mathbb{R}^{l_s \times 2d_c}$ and the contextual query vectors $Q_2 \in \mathbb{R}^{m_w \times 2d_c}$ as $Y$. QFE outputs the probability that the $i$-th sentence is the evidence.

Loss Function: Our model uses multi-task learning with a loss function $L = L_A + L_E$, where $L_A$ is the loss of the answer and $L_E$ is the loss of the evidence. The answer loss $L_A$ is the sum of the cross-entropy losses for all probability distributions obtained by the answer layer. The evidence loss $L_E$ is defined in subsection 3.3.

3.2 Query Focused Extractor

Query Focused Extractor (QFE) is shown as the red box in Figure 2. QFE is an extension of the extractive summarization model of Chen and Bansal (2018), which is not for query-focused settings. Chen and Bansal used an attention mechanism to extract sentences from the source document such that the summary would cover the important information in the source document. To focus on the query, QFE extracts sentences from C with attention on Q such that the evidence covers the important information with respect to Q.
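The sentence-vector definition used by the Evidence Layer can be made concrete with a short sketch. It assumes the forward and backward Bi-RNN states are already available as per-word lists. The function name `sentence_vectors`, the span representation, and the toy numbers are hypothetical and chosen only for illustration.

```python
def sentence_vectors(forward_states, backward_states, sentence_spans):
    """Return X, one vector x_i per sentence.

    forward_states[j] and backward_states[j] are the d_c-dim states at word j.
    sentence_spans[i] = (j1, j2): first and last word index of sentence i.
    """
    X = []
    for j1, j2 in sentence_spans:
        # The forward state at the last word and the backward state at the
        # first word have both seen every word of the sentence.
        X.append(forward_states[j2] + backward_states[j1])
    return X


# Toy input: l_w = 5 words, d_c = 2, two sentences (words 0-2 and words 3-4).
forward_states = [[float(j), j + 0.5] for j in range(5)]
backward_states = [[-(j + 1.0), -(j + 1.5)] for j in range(5)]
spans = [(0, 2), (3, 4)]

X = sentence_vectors(forward_states, backward_states, spans)
print(X)  # [[2.0, 2.5, -1.0, -1.5], [4.0, 4.5, -4.0, -4.5]]
```

Each row of `X` has $2d_c$ entries and there are $l_s$ rows, matching $X \in \mathbb{R}^{l_s \times 2d_c}$.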


