Explore, Propose, and Assemble: An Interpretable Model for Multi-Hop Reading Comprehension

Yichen Jiang∗ Nitish Joshi∗ Yen-Chun Chen Mohit Bansal UNC Chapel Hill {yichenj, nitish, yenchun, mbansal}@cs.unc.edu

Abstract datasets require single-hop reasoning only, which means that the evidence necessary to answer the Multi-hop reading comprehension requires the question is concentrated in a single sentence or lo- model to explore and connect relevant infor- cated closely in a single paragraph. Such datasets mation from multiple sentences/documents in order to answer the question about the con- emphasize the role of locating, matching, and text. To achieve this, we propose an in- aligning information between the question and the terpretable 3-module system called Explore- context. However, some recent multi-document, Propose-Assemble reader (EPAr). First, the multi-hop reading comprehension datasets, such Document Explorer iteratively selects relevant as WikiHop and MedHop (Welbl et al., 2017), documents and represents divergent reasoning have been proposed to further assess MRC sys- chains in a tree structure so as to allow assim- tems’ ability to perform multi-hop reasoning, ilating information from all chains. The An- swer Proposer then proposes an answer from where the required evidence is scattered in a set every root-to-leaf path in the reasoning tree. of supporting documents. Finally, the Evidence Assembler extracts a These multi-hop tasks are much more chal- key sentence containing the proposed answer lenging than previous single-hop MRC tasks (Ra- from every path and combines them to pre- jpurkar et al., 2016, 2018; Hermann et al., 2015; dict the final answer. Intuitively, EPAr ap- Nguyen et al., 2016; Yang et al., 2015) for three proximates the coarse-to-fine-grained compre- primary reasons. First, the given context contains hension behavior of human readers when fac- ing multiple long documents. We jointly op- a large number of documents (e.g., 14 on aver- timize our 3 modules by minimizing the sum age, 64 maximum for WikiHop). Most existing of losses from each stage conditioned on the QA models cannot scale to the context of such previous stage’s output. On two multi-hop length, and it is challenging to retrieve a reason- reading comprehension datasets WikiHop and ing chain of documents with complete informa- MedHop, our EPAr model achieves significant tion required to connect the question to the an- improvements over the baseline and compet- swer in a logical way. Second, given a reason- itive results compared to the state-of-the-art ing chain of documents, it is still necessary for the model. We also present multiple reasoning- chain-recovery tests and ablation studies to model to consider evidence loosely distributed in demonstrate our system’s ability to perform in- all these documents in order to predict the final an- arXiv:1906.05210v1 [cs.CL] 12 Jun 2019 terpretable and accurate reasoning.1 swer. Third, there could be more than one logical way to connect the scattered evidence (i.e., more 1 Introduction than one possible reasoning chain) and hence this requires models to assemble and weigh informa- The task of machine reading comprehension tion collected from every reasoning chain before and question answering (MRC-QA) requires the making a unified prediction. model to answer a natural language question by finding relevant information and knowledge in To overcome the three difficulties elaborated a given natural language context. Most MRC above, we develop our interpretable 3-module sys- tem based on examining how a human reader ∗ equal contribution; part of this work was done during would approach a question, as shown in Fig. 1a the second author’s internship at UNC (from IIT Bombay). 1Our code is publicly available at: and Fig. 1b. For the 1st example, instead of read- https://github.com/jiangycTarheel/EPAr ing the entire set of supporting documents sequen- The THhaeu nHtaedu nCteadst lCea (s Dtleu t(c Dh u: tScpho :o Sksplooot k) silso at )h aisu nat heda uantttreadc taiottnra icnt itohne in the The PoTlshte rPboerlsgt ePrubmerpgh oPuusme p( hGoeursme a(n G : ePromlsatne r:b Peroglsetre Hrbuebrhgaeur sH )u ibs haa us ) is a amuasemmuesnetm parkent park Efteling Efteling in th ei nN tehteh eNrleatnhdesr l.a Intd ws a. sItd wesaigsndeeds ibgyn ed by pumpinpgu mstaptiinogn satbaotivoen theabo Dykeve the Ditch Dyke i nDitch the Upper in the Upper in Harzcentra iln central Ton Tvoan dvea nV dene aVnedn .a..nd ... GermaGnye r..m. any ... EftelingEfteling is a fiasn ata fsayn-ttahseym-tehde mameuds aemuenset mpaernkt ipna Kaatsheuvelrk in Kaatsheuvel in the in the The DykeThe Ditch Dyke iDitchs the l oins gtheset laorntigfeicsita al rdtitfcichi ainl dthitec hUpper in the Harz Upper in Harz in NethNereltahnedrsla. nTdhse. aTthtrea cattitorancs taioren sb asred b oanse edl eomn eenltesm freonmts afrnocmie natn mciyetnhts m yths centralc Genetrrmala nGye. rmany. and alengde nledgse, nfadisr,y f taailreys ,t aflaebsl,e fsa, balneds ,f oanlkdl ofroel.klore. The UpperThe UpperHarz r eHarzfers t ore .f.e. rths et ot e.r..m t hUep tpeermr H Uaprzp ecro vHearsrz t hceo vaereras tohfe t haere a of the KaatsheuvelKaatsheuvel is a viisl laa gveil ilnag teh ei nD tuhtec hD purtocvhi npcroe voifn cNeo ortfh N Borratbha Bntr,a bant, seven hsiesvtoerni chails tmoirnicinagl mtoiwninnsg ( t\"oBwenrsg s(t\\"uB0e0reg4sdt\tue\0"0) e-4 Cdltaeu\"s)t h-a Cl,l austhal, situastietdu a..t.e idt .i.s. tiht eis l athrgee lsatr vgielslat gvei lilna gaen din t haen dc atphiet acla opfi ttahle o mf uthneic mipuanliitcyi pality ZellerfZeledl,l eArnfedlrde,a Asbnedrgre, aAsblternga,u A, Lltaeuntaeun,t hLaalu, tWeniltdheaml, aWnnil daenmd aGnrnu nadn d- Grund - of Loonof Loon op Zand op Zand, whi,c wh haliscoh caolsnos icsotsn .s..ists ... in the pinre tsheen tp-dreasye Gnte-drmaya nG feerdmeraanl fsetadter oalf sLowertate of SaxonyLower . . QuerQyu seurbyj escutb: jTehcte: HThaeu nHteadu nCtaesdt lCe astle Query Qsuubejreyc ts: uPbojlesctte:r Pbeorlgs tPerubmeprgh oPuusme phouse QuerQyu beorydy b:o ldocya: tleodc_aitne_dt_hien__atdhme_inadismtraintiivsetr_atteivrreit_otreirarli_toernitaitly_entity Query Qboudeyry: lboocadtye:d l_oicna_ttehde_iand_mthinei_satrdamtivinei_stterarrtitvoer_iatel_rerintotirtyial_entity AnswAenrs:w Looner: Loon op Zand op Zand AnsweAr: nLowerswer: LowerSaxony Saxony (a) (b) Figure 1: Two examples from the QAngaroo WikiHop dataset where it is necessary to combine information spread across multiple documents to infer the correct answer. (a): The hidden reasoning chain of 3 out of a total of 37 documents for a single query. (b): Two possible reasoning chains that lead to different answers: “” and “”, while the latter (green solid arrow) fits better with query body “administrative territorial entity”.

tially, she would start from the document that is all possible reasoning chains/paths. Hence, to be directly related to the query subject (e.g., “The able to weigh and combine information from mul- Haunted Castle”). She could then read the second tiple reasoning branches, the Document Explorer and third document by following the connecting is rolled out multiple times to represent all the entities “park Efteling” and “Kaatsheuvel”, and divergent reasoning chains in a ‘reasoning tree’ uncover the answer “Loon op Zand” by comparing structure, so as to allow our third component, the phrases in the final document to the query. In this Evidence Assembler, to assimilate important ev- way, the reader accumulates knowledge about the idence identified in every reasoning chain of the query subject by exploring inter-connected docu- tree to make one final, unified prediction. To do ments, and eventually uncovers the entire reason- so, the Assembler selects key sentences from each ing chain that leads to the answer. Drawing inspi- root-to-leaf document path in the ‘reasoning tree’ ration from this coarse (document-level) plus fine- and forms a new condensed, salient context which grained (word-level) comprehension behavior, we is then bidirectionally-matched with the query rep- first construct a T -hop Document Explorer model, resentation to output the final prediction. Via this a hierarchical memory network, which at each re- procedure, evidence that was originally scattered current hop, selects one document to read, updates widely across several documents is now collected the memory cell, and iteratively selects the next concentratedly, hence transforming the task to a related document, overall constructing a reason- scenario where previous standard phrase-matching ing chain of the most relevant documents. We style QA models (Seo et al., 2017; Xiong et al., next introduce an Answer Proposer that performs 2017; Dhingra et al., 2017) can be effective. query-context reasoning at the word-level on the Overall, our 3-module, multi-hop, reasoning- retrieved chain and predicts an answer. Specifi- tree based EPAr (Explore-Propose-Assemble cally, it encodes the leaf document of the reason- reader) closely mimics the coarse-to-fine-grained ing chain while attending to its ancestral docu- reading and reasoning behavior of human readers. ments, and outputs ancestor-aware word represen- We jointly optimize this 3-module system by tations for this leaf document, which are compared having the following component working on to the query to propose a candidate answer. the outputs from the previous component and minimizing the sum of the losses from all 3 However, these two components above cannot modules. The Answer Proposer and Evidence handle questions that allow multiple possible rea- Assembler are trained with maximum likelihood soning chains that lead to different answers, as using ground-truth answers as labels, while the shown in Fig. 1b. After the Document Explorer Document Explorer is weakly supervised by selects the 1st document, it finds that both the 2nd heuristic reasoning chains constructed via TF-IDF and 3rd documents are connected to the 1st doc- and documents with the ground-truth answer. ument via entities “the Dyke Ditch” and “Upper On WikiHop, our system achieves the highest- Harz” respectively. This is a situation where a sin- reported dev set result of 67.2%, outperforming gle reasoning chain diverges into multiple paths, all published models2 on this task, and 69.1% and it is impossible to tell which path will lead to the correct answer before finishing exploring 2At the time of submission: March 3rd, 2019. Query Hiearchical, Key-value Query Memory Network: Subject Body BiDAF I Final prediction DE EA

synthesized context document-reasoning tree A sentence in containing query subject Values: DE ... A sentence in containing candidate 0 Attention A sentence in containing candidate 1

{ , , ... , }

sampling ... softmax ... A sentence in containing candidate 4

AP AP AP AP Keys: ... ( aware) proposed candidate 0 ( aware) proposed candidate 1 ...... ( aware) proposed candidate 4

Figure 2: The full architecture of our 3-module system EPAr, with the Document Explorer (DE, left), Answer Proposer (AP, middle), and Evidence Assembler (EA, right). accuracy on the hidden test set, which is com- 2.1 Retrieval and Encoding petitive with the current leaderboard state-of-the- In this section, we describe the pre-processing art. On MedHop, our system outperforms all pre- document retrieval and encoding steps before in- vious models, achieving the new state-of-the-art troducing our three modules of EPAr. We adopt a test leaderboard accuracy. It also obtains statis- 2-hop document retrieval procedure to reduce the tically significant (p < 0.01) improvement over number of supporting documents that are fed to our strong baseline on the two datasets. Further, our system. We first select one document with the we show that our Document Explorer combined shortest TF-IDF distance to the query. We then with 2-hop TF-IDF retrieval is substantially better rank the remaining documents according to their than two TF-IDF-based retrieval baselines in mul- TF-IDF distances to the first selected document tiple reasoning-chain recovery tests including on and add the top N 0 −1 documents to form the con- human-annotated golden reasoning chains. Next, text with a total of N 0 documents for this query. we conduct ablations to prove the effectiveness Adding this preprocessing step is not only helpful of the Answer Proposer and Evidence Assembler in reducing GPU memory consumption but also in comparison with several baseline counterparts, helps bootstrap the training by reducing the search and illustrate output examples of our 3-module space of the Document Explorer (Sec. 2.2). system’s reasoning tree. We then use a Highway Network (Srivastava 2 Model et al., 2015) of dimension d, which merges the character embedding and GloVe word embed- In this section, we describe our 3-module sys- ding (Pennington et al., 2014), to get the word tem that constructs the ‘reasoning tree’ of docu- representations for the supporting documents and N 0×K×d ments and predicts the answer for the query. For- query4. This gives three matrices: X ∈ R , Js×d J ×d mally, given a query q and a corresponding set Qsub ∈ R and Qbod ∈ R b , K, Js, Jb N of supporting documents D = {di}i=1, our sys- are the lengths of supporting documents, query tem tries to find a reasoning chain of documents body, and query subject respectively. We then ap- 0 0 0 3 d1, . . . , dT , di ∈ D. The information from these ply a bi-directional LSTM-RNN (Hochreiter and selected documents is then combined to predict the Schmidhuber, 1997) of v hidden units to get the answer among the given answer candidates. In the contextual word representations for the documents K×2v WikiHop and MedHop datasets, a query consists H = {h1, ··· , hN 0 } s.t. hi ∈ R and the Js×2v Jb×2v of a subject qsub (e.g., “The Haunted Castle” in query Usub ∈ R , Ubod ∈ R . Other Fig. 1a) and a body qbod (e.g., “located in the ad- than the word-level encoding, we also collect com- ministrative territorial entity”). There is one single pact representations of all the supporting docu- correct answer a (e.g., “Loon op Zand”) in the set 4 of candidate answers A = {c }L such that the Unlike previous works (Welbl et al., 2017; Dhingra et al., l l=1 2018; De Cao et al., 2018; Song et al., 2018a) that concate- relation qbod holds true between qsub and a. nate supporting documents together to form a large context, we instead maintain the document-level hierarchy and encode 3In WikiHop dataset, T ≤ 3. each document separately. ˆ t+1 t ˆ ments, denoted as P = {p1, ··· , pN 0 }, by apply- (ht+1, m ) = fDE(m ) such that ht+1 ∈ H ing the self-attention mechanism in Zhong et al. and hˆt 6= hˆt+1. Therefore, unrolling the Docu- (2019) (see details in appendix). We obtain em- ment Explorer for T hops results in a sequence ˆ ˆ beddings for each candidate ci ∈ {c1, c2, .., cL} of non-repeating documents Hˆ = {h1, ··· , hT } using the average-over-word embeddings of the such that each document hˆi is selected iteratively first mention5 of the candidate in H. based on the current memory state building up one reasoning chain of documents. In practice, we roll 2.2 Document Explorer out DE multiple times to obtain a document-search Our Document Explorer (DE, shown in the left ‘reasoning tree’, where each root-to-leaf path cor- part of Fig.2) is a hierarchical memory net- responds to a query-to-answer reasoning chain. work (Chandar et al., 2016). It utilizes the reduced 2.3 Answer Proposer document representations P = {p1, p2, ··· , pN 0 } and their corresponding word-level representa- The Answer Proposer (AP, shown in the middle tions H = {h1, h2, ··· , hN 0 } as the key-value part of Fig.2) takes as input a single chain of doc- ˆ ˆ knowledge base and maintains a memory m using uments {h1, ··· , hT } from one of the chains in a Gated Recurrent Unit (GRU) (Cho et al., 2014). the ‘reasoning tree‘ created by the DE, and tries At every step, the DE selects a document which to predict a candidate answer from the last doc- ˆ is related to the current memory state and updates ument hT in that reasoning chain. Specifically, the internal memory. This iterative procedure thus we adopt an LSTM-RNN with an attention mech- ˆ constructs a reasoning chain of documents. anism (Bahdanau et al., 2015) to encode the hT to ancestor-aware representations y by attending to Read Unit At each hop t, the model computes a [hˆ ]. The model then computes a distribu- document-selection distribution P over every doc- 1,...,T −1 hˆi ∈ hˆ ument based on the bilinear-similarity between the tion over words T T based on the similarity y memory state m and document representations P between and the query representation. This dis- 6 tribution is then used to compute the weighted av- using the following equations : 1 2 K T t erage of word representations {hT , hT , ··· , hT }. xn = pn Wrm χ = softmax(x) P (di) = χi Finally, AP proposes an answer among all candi- The read unit looks at all document (representa- dates {c1, ··· , cL} that has the largest similarity score with this weighted average h˜ . tion) P and selects (samples) a document di ∼ T P . The write operation updates the internal state ek = vT tanh(W hˆi + W sk + b) (memory) using this sampled document. i h cct s k k k X k i a = softmax(e ); c = ai hcct Write Unit After the model selects di ∈ D, i k ˆk−1 k−1 k−1 the model then computes a distribution over ev- y = LSTM(hT , s , c ) ery word in document di based on the similarity k k k w = α(y , us) + α(y , ub);  = softmax(w) between the memory state m and its word repre- XK ˆk sentations h ∈ H. This distribution is then used a = h k; Scorel = β(cl, a) (2) i k=1 T to compute the weighted average of all word rep- ˆ ˆ resentations in document di. We then feed this where hcct = [h1,...,T −1] is the concatenation of weighted average h˜ as the input to the GRU cell documents in the word dimension; us and ub are and update its memory state m (subscript i is the final states of Usub and Ubod respectively, omitted for simplicity): and sk is the LSTM’s hidden states at the kth step. The Answer Proposer proposes the candidate with T ˜ XK wk = h Wwm h = hkωk the highest score among {c , ··· , c }. All com- k k=1 (1) 1 L ω = softmax(w) mt+1 = GRU(h,˜ mt) putations in Eqn.2 that involve trainable parame- ters are marked in bold.7 This procedure produces Combining the ‘read’ and ‘write’ operations de- ancestor-aware word representations that encode scribed above, we define a recurrent function: the interactions between the leaf document and 5We tried different approaches to make use of all mentions ancestral document, and hence models the multi- of every candidate, but observe no gain in final performance. hop, cross-document reasoning behavior. 6We initialize the memory state with the last state of the 7 query subject Usub to make first selected document directly See appendix for the definition of the similarity functions conditioned on the query subject. α and β. 2.4 Evidence Assembler 3 Experiments and Results 3.1 Datasets and Metrics As shown in Fig. 1b, it is possible that a reasoning path could diverge into multiple branches, where We evaluate our 3-module system on the WikiHop each branch represents a unique, logical way of re- and the smaller MedHop multi-hop datasets from trieving inter-connected documents. Intuitively, it QAngaroo (Welbl et al., 2017). For the WikiHop is very difficult for the model to predict which path dev set, each instance is also annotated as “fol- to take without looking ahead. To solve this, our lows” or “not follows”, i.e., whether the answer system first explores multiple reasoning chains by can be inferred from the given set of supporting rolling out the Document Explorer multiple times documents, and “single” or “multiple”, indicating to construct a ‘reasoning tree’ of documents, and whether the complete reasoning chain comprises then aggregates information from multiple rea- of single or multiple documents. This allows us to soning chains using a Evidence Assembler (EA, evaluate our system on less noisy data and to in- shown in the right part of Fig.2), to predict the vestigate its strength in queries requiring different final answer. For each reasoning chain, the As- levels of multi-hop reasoning. Please see appendix sembler first selects one sentence that contains the for dataset and metric details. candidate answer proposed by the Answer Pro- 3.2 Implementation Details poser and concatenates all these sentences into a new document h0. This constructs a highly infor- For WikiHop experiments, we use 300-d GloVe mative and condensed context, at which point pre- word embeddings (Pennington et al., 2014) for vious phrase-matching style QA models can work our main full-size ‘EPAr’ model and 100-d GloVE effectively. Our EA uses a bidirectional attention word embeddings for our smaller ‘EPAr’ model flow model (Seo et al., 2017) to get a distribution which we use throughout the Analysis section for over every word in h0 and compute the weighted time and memory feasibility. We also use the average of word representations {h01, ··· , h0K } as last hidden state of the encoding LSTM-RNN to h˜0. Finally, the EA selects the candidate answer of get the compact representation for all supporting the highest similarity score w.r.t. h˜0. documents in case of smaller model, in contrast to self-attention (Sec.B in Appendix) as in the full-size ‘EPAr’ model. The encoding LSTM- RNN (Hochreiter and Schmidhuber, 1997) has 2.5 Joint Optimization 100-d hidden size for our ‘EPAr’ model whereas the smaller version has 20-d hidden size. The em- Finally, we jointly optimize the entire model us- bedded GRU (Cho et al., 2014) and the LSTM in ing the cross-entropy losses from our Document our Evidence Assembler have the hidden dimen- Explorer, Answer Proposer, and Evidence Assem- sion of 80. In practice, we only apply TF-IDF bler. Since the Document Explorer samples doc- based retrieval procedure to our Document Ex- uments from a distribution, we use weak supervi- plorer and Answer Proposer during inference, and sion at the first and the final hops to account for during training time we use the full set of support- the otherwise non-differentiabilty in the case of ing documents as the input. This is because we ob- end-to-end training. Specifically, we use the doc- served that the Document Explorer overfits faster ument having the shortest TF-IDF distance w.r.t. in the reduced document-search space. For the Ev- the query subject to supervise the first hop and the idence Assembler, we employ both the TF-IDF re- documents which contain at least one mention of trieval and Document Explorer to get the ‘reason- the answer to supervise the last hop. This allows ing tree’ of documents, at both training and testing the Document Explorer to learn the chain of docu- time. We refer to the Sec.E in the appendix for the ments leading to the document containing the an- implementation details of our MedHop models. swer from the document most relevant to the query subject. Since there can be multiple documents 3.3 Results containing the answer, we randomly sample a doc- We first evaluate our system on the Wiki- ument as the label at the last hop. For the Answer Hop dataset. For a fair comparison to recent Proposer and Evidence Assembler, we use cross- works (De Cao et al., 2018; Song et al., 2018a; entropy loss from the answer selection process. Raison et al., 2018), we report our “EPAr” with Dev Test Query subject: Sulphur Spring, Query body: located in the administrative territorial entity BiDAF (Welbl et al., 2017)? - 42.9 Sulphur Spring (also known as Crater Hills Geyser), is a Coref-GRU (Dhingra et al., 2018) 56.0 59.3 0 geyser in the Hayden Valley region of WEAVER (Raison et al., 2018) 64.1 65.3 Yellowstone National Park in the United States . ... MHQA-GRN (Song et al., 2018a) 62.8 65.4 Hayden Valley is a large, sub-alpine valley in Yellowstone Entity-GCN (De Cao et al., 2018) 64.8 67.6 1 National Park straddling the Yellowstone River ... BAG (Cao et al., 2019) 66.5 69.0 2 The Yellowstone River is a tributary of the Missouri River ... CFC (Zhong et al., 2019) 66.4 70.6 Yellowstone Falls consist of two major waterfalls on the 3 EPAr (Ours) 67.2 69.1 Yellowstone River, within Wyoming, United States. ... Yellowstone National Park is a national park located in the 4 Table 1: Dev set and Test set accuracy on WIKIHOP U.S. states of Wyoming, Montana and Idaho. ... dataset. The model marked with ? does not use can- 0 didates and directly predict the answer span. EPAr is our system with TF-IDF retrieval, Document Explorer, 1 2 3 4 Answer Proposer and Evidence Assembler. Yellowstone Missouri Wyoming Wyoming? Montana? Idaho? follow follow full Figure 3: A ‘reasoning tree’ with 4 leaves that lead to + multiple + single different answers (marked in bold). The ground-truth BiDAF Baseline 62.8 63.1 58.4 answer is marked in red additionally. DE+AP+EA? 65.2 66.9 61.1 AP+EA 68.7 67.0 62.8 DE+AP+EA 69.4 70.6 64.7 Moreover, we see that EPAr is able to achieve DE+AP+EA† 71.8 73.8 66.9 high accuracy in both the examples that require † DE+AP+EA +SelfAttn 73.5 72.9 67.2 multi-hop reasoning (“follows + multiple”), and Table 2: Ablation accuracy on WIKIHOP dev set. The other cases where a single document suffices for model marked with ? does not use the TFIDF-based correctly answering the question (“follows + sin- document retrieval procedure. The models marked gle”), suggesting that our system is able to ad- with † are our full EPAr systems with 300-d word em- just to examples of different reasoning require- beddings and 100-d LSTM-RNN hidden size (same as ments. The evaluation results further demonstrate the last row of Table1), while the 4th row represents that our Document Explorer combined with TF- the smaller EPAr system. IDF-based retrieval (row ‘DE+AP+EA’) consis- tently outperforms TF-IDF alone (row ‘AP+EA’) 300-d embeddings and 100-d hidden size of the or the Document Explorer without TF-IDF (row encoding LSTM-RNN. As shown in Table1, EPAr ‘DE+AP+EA?’ in Table2), showing that our 2- achieves 67.2% accuracy on the dev set, outper- hop TF-IDF document retrieval procedure is able forming all published models, and achieves 69.1% to broadly identify relevant documents and further accuracy on the hidden test set, which is competi- aid our Document Explorer by reducing its search tive with the current state-of-the-art result.8 space. Finally, comparing the last two rows in Ta- Next, in Table2, we further evaluate our EPAr ble2 shows that using self-attention (Zhong et al., system (and its smaller-sized and ablated versions) 2019) to compute the document representation can on the “follows + multiple”, “follows + single”, further improve the full-sized system. We show an and the full development set. First, note that example of the ‘reasoning tree’ constructed by the on the full development set, our smaller system Document Explorer and the correct answer pre- (“DE+AP+EA”) achieves statistically significant dicted by the Evidence Assembler in Fig.3. (p-value < 0.01)9 improvements over the BiDAF We report our system’s accuracy on the Med- baseline and is also comparable to De Cao et al. Hop dataset in Table3. Our best system achieves (2018) on the development set (64.7 vs. 64.8).10 60.3 on the hidden test set11, outperforming all 8Note that there also exists a recent anonymous unpub- current models on the leaderboard. However, as lished entry on the leaderboard with 70.9% accuracy, which reported by Welbl et al.(2017), the original Med- is concurrent to our work. Also note that our system achieves these strong accuracies even without using pretrained lan- Hop dataset suffers from a candidate frequency guage model representations like ELMo (Peters et al., 2018) imbalance issue that can be exploited by certain or BERT (Devlin et al., 2018), which have been known to give significant improvements in machine comprehension strong model with 100-d word embeddings and 20-d LSTM- and QA tasks. We leave these gains for future work. RNN hidden size (similar to baselines in Welbl et al.(2017)) 9All stat. signif. is based on bootstrapped randomization in all our analysis/ablation results (including Sec.4). test with 100K samples (Efron and Tibshirani, 1994). 11The masked MedHop test set results use the smaller size 10For time and memory feasibility, we use this smaller model, because this performed better on the masked dev set. Test R@1 R@2 R@3 R@4 R@5 Test (Masked) Random 39.9 51.4 60.2 67.8 73.5 FastQA? (Weissenborn et al., 2017) 23.1 31.3 1-hop TFIDF 38.4 48.5 58.6 67.4 73.7 BiDAF? (Seo et al., 2017) 33.7 47.8 2-hop TFIDF 38.4 58.7 70.2 77.2 81.6 CoAttention - 58.1 DE 52.5 70.2 80.3 85.8 89.0 Most Frequent Candidate? 10.4 58.4 TFIDF+DE 52.2 69.0 77.8 82.2 85.2 EPAr (Ours) 41.6 60.3 Table 5: Recall-k score is the percentage of examples where the ground-truth answer is present in the top-k Table 3: Test set accuracy on MEDHOP dataset. The re- root-to-leaf path in the ‘reasoning tree’. ‘TFIDF+DE’ sults marked with ? are reported in (Welbl et al., 2017). is the combination of the 2-hop TFIDF retrieval proce- R@1 R@2 R@3 R@4 R@5 dure and our Document Explorer. Random 11.2 17.3 27.6 40.8 50.0 1-hop TFIDF 32.7 48.0 56.1 63.3 70.4 of 2-hop TF-IDF and Document Explorer. 2-hop TFIDF 42.9 56.1 70.4 78.6 82.7 DE 38.8 50.0 65.3 73.5 83.7 Human Evaluation: We collect human- TFIDF+DE 44.9 64.3 77.6 82.7 90.8 annotated reasoning chains for 100 documents Table 4: Recall-k score is the % of examples where one from the “follows + multiple” dev set, and com- of the human-annotated reasoning chains is recovered pare these to the ‘reasoning tree’ constructed by in the top-k root-to-leaf paths in the ‘reasoning tree’. our Document Explorer to assess its ability to dis- ‘TFIDF+DE’ is the combination of the 2-hop TF-IDF cover the hidden reasoning chain from the entire retrieval procedure and our Document Explorer. pool of supporting documents. For each example, human annotators (external, English-speaking) heuristics like the ‘Most Frequent Candidate’ in select two of the smallest set of documents, from Table3. To eliminate this bias and to test our sys- which they can reason to find the correct answer tem’s ability to conduct multi-hop reasoning us- from the question. As shown in Table4, our ing the context, we additionally evaluate our sys- Document Explorer combined with 2-hop TF-IDF tem on the masked version of MedHop, where ev- (row ‘TFIDF+DE’) obtains higher golden-chain ery candidate expression is replaced randomly us- recall scores compared to the two TFIDF-based ing 100 unique placeholder tokens so that models document retrieval heuristics (row ‘1-hop TFIDF’ can only rely on the context to comprehend every and ‘2-hop TFIDF’) alone or the Document candidate. Our model achieves 41.6% accuracy Explorer without TF-IDF (row ‘DE’). in this “masked” setting, outperforming all previ- ously published works by a large margin. Answer Span Test: We also test our Document Explorer’s ability to find the document with men- 4 Analysis tions of the ground-truth answer. Logically, the fact that the answer appears in one of the docu- In this section, we present a series of new analyses ments in the ‘reasoning tree’ signals higher prob- and comparisons in order to understand the contri- ability that our modules at the following stages bution from each of our three modules and demon- could predict the correct answer. As shown in strate their advantages over other corresponding Table5, our Document Explorer receives signifi- baselines and heuristics. cantly higher answer-span recall scores compared 12 4.1 Reasoning Chain Recovery Tests to the two TF-IDF-based document selectors. We compare our Document Explorer with two TF- 4.2 Answer Proposer Comparisons IDF-based document selectors for their ability to We compare our Answer Proposer with two rule- recover the reasoning chain of documents. The based sentence extraction heuristics for the ability 1-hop TF-IDF selector selects the top k + 1 doc- to extract salient information from every reason- uments with the highest TF-IDF score w.r.t. the ing chain. For most documents in the WikiHop query subject. The 2-hop TF-IDF selector, as in dataset, the first sentence is comprised of the most Sec. 2.1, first selects the top-1 TF-IDF document salient information from that document. Hence, w.r.t. the query subject and then selects the top k remaining documents based on the TF-IDF score 12In this test, the Document Explorer alone outperforms its combination with the 2-hop TF-IDF retrieval. In practice, our with respect to the first selected document. Fi- system employs both procedures due to the advantage shown nally, we also compare to our final combination in both empirical results (Table2) and analysis (Table4). follows follows tiple reasoning chains. The results further show full + multiple + single that our Assembler is better than the reranking al- Full-doc 63.1 68.4 69.0 ternative. Lead-1 63.6 68.7 70.2 AP w.o. attn 63.3 68.3 69.6 AP 64.7 69.4 70.6 4.4 Multi-hop Reasoning Example Table 6: Answer Proposer comparison study. “Follows We visualize the 3-stage reasoning procedure of + multiple” and “follows + single” are the subsets of our EPAr system in Fig.4. As shown in the left dev set as described in Sec. 3.1. of Fig.4, the Document Explorer first locates the root document (“The Polsterberg Pumphouse ...”) follows follows based on the query subject. It then finds three more full + multiple + single documents that are related to the root document, Single-chain 59.9 64.3 63.8 constructing three document chains. The An- Avg-vote 54.6 56.3 55.6 swer Proposer proposes a candidate answer from Max-vote 51.5 53.9 53.3 w. Reranker 60.6 65.1 65.5 each of the three chains selected by the Docu- w. Assembler 64.7 69.4 70.6 ment Explorer. Finally, the Evidence Assembler selects key sentences from all documents in the Table 7: Evidence Assembler comparison study: constructed document chains and makes the final Reranker (described in the appendix) rescores the doc- uments selected by the Document Explorer. prediction (“Lower Saxony”). 5 Related Works we construct one baseline that concatenates the first sentence from each selected document as the The last few years have witnessed significant input to the Evidence Assembler. We also show progress on text-based machine reading compre- results of combining all the full documents as the hension and question answering (MRC-QA) in- synthesized context instead of selecting one sen- cluding cloze-style blank-filling tasks (Hermann tence from every document. We further present et al., 2015), open-domain QA (Yang et al., 2015), a lighter neural-model baseline that directly pro- answer span prediction (Rajpurkar et al., 2016, poses the answer from the leaf document without 2018), and generative QA (Nguyen et al., 2016). first creating its ancestor-aware representation. As However, all of the above datasets are confined shown in Table6, the system using sentences se- to a single-document context per question setup. lected by our Answer Proposer outperforms both Joshi et al.(2017) extended the task to the multi- rule-based heuristics (row 1 and 2) and the simple document regime, with some examples requir- neural baseline (row 3). ing cross-sentence inference. Earlier attempts in multi-hop MRC focused on reasoning about the 4.3 Assembler Ablations relations in a knowledge base (Jain, 2016; Zhou In order to justify our choice of building an As- et al., 2018; Lin et al., 2018) or tables (Yin et al., sembler, we build a 2-module system without the 2015). QAngaroo WikiHop and MedHop (Welbl Evidence-Assembler stage by applying the An- et al., 2017), on the other hand, are created as nat- swer Proposer to only the top-1 reasoning chain ural language MRC tasks. They are designed in in the tree. We also present two voting heuristics a way such that the evidence required to answer a that selects the final answer by taking the aver- query could be spread across multiple documents. age/maximum prediction probability from the An- Thus, finding some evidence requires building a swer Proposer on all document chains. Further- reasoning chain from the query with intermediate more, we compare our Evidence Assembler with inference steps, which poses extra difficulty for an alternative model that, instead of assembling MRC-QA systems. HotpotQA (Yang et al., 2018) information from all reasoning chains, reranks all is another recent multi-hop dataset which focuses chains and their proposed answers to select the on four different reasoning paradigms. top-1 answer prediction. As shown in Table7, The emergence of large-scale MRC datasets the full system with the Assembler achieves sig- has led to innovative neural models such as co- nificant improvements over the 2-module system. attention (Xiong et al., 2017), bi-directional at- This demonstrates the importance of the Assem- tention flow (Seo et al., 2017), and gated atten- bler in enabling information aggregation over mul- tion (Dhingra et al., 2017), all of which are metic- Query subject: Polsterberg Pumphouse Query body: located_in_the_administrative_territorial_entity Query subject: Polsterberg Pumphouse Query body: located_in_the_administrative_territorial_entity The Sperberhai Dyke is in fact an aqueduct which forms part of the Upper Harz Water Regale network of reservoirs, ditches, dams and tunnels ... 1 b The Harz is the highest mountain range in Northern a. The Polsterberg Pumphouse (German : Polsterberger Hubhaus) is a The Harz is the highest mountain range in Northern and its rugged terrain extends across parts of Lower Saxony, pumping station above the Dyke Ditch in the Upper Harz in central Germany and its rugged terrain extends across parts Saxony-Anhalt, and Thuringia. The name "Harz" derives from the Germany which is used today as a forest restaurant. 2 of Lower Saxony, Saxony-Anhalt, and Thuringia. ... Middle High German word "Hardt" or "Hart" (mountain forest), b. The Harz is the highest mountain range in Northern Germany and Latinized as "Hercynia". its rugged terrain extends across parts of Lower Saxony, The Polsterberg Pumphouse (German : Polsterberger Saxony-Anhalt, and Thuringia. The Upper Harz refers to the Hubhaus) is a pumping station above the Dyke Ditch northwestern and higher part of the Harz mountain range in Germany. in the Upper Harz in central Germany which is used c. In its traditional sense, the term Upper Harz covers the area of the today as a forest restaurant. ... a seven historical mining towns - Clausthal, Zellerfeld, Andreasberg, Altenau, Lautenthal, and Grund - in the present-day 2 Germany, officially the Federal Republic of Germany, The Upper Harz refers to the northwestern and higher part of the German federal state of Lower Saxony. is a federal parliamentary republic in central-western d. The Dyke Ditch is the longest artificial ditch in the Upper Harz in Europe. ... Harz mountain range in Germany. ... c the term Upper Harz covers the area of the seven historical mining towns - Clausthal, Zellerfeld, central Germany. Its purpose was to collect surface runoff for the The Upper Harz refers to the northwestern and higher Andreasberg, Altenau, Lautenthal, Wildemann and Grund - in the operation of the Upper Harz mining industry from precipitation-heavy part of the Harz mountain range in Germany. ... present-day German federal state of Lower Saxony. ... regions a long way away (particularly from the and parts 2 of the massif). Sewage is a water-carried waste, in solution or suspension, that is intended to be removed from a community. d The Dyke Ditch is the longest artificial ditch in the Upper Harz Wildemann is a town and a former municipality in the in central Germany. Its purpose was to collect surface runoff for the Final answer: Lower Saxony district of , in Lower Saxony, Germany. operation of the Upper Harz mining industry from precipitation-heavy regions a long way away (particularly from the The Dyke Ditch is the longest artificial ditch in the Bruchberg and parts of the Brocken massif). ... Upper Harzin central Germany. ...

Document Explorer Answer Proposer Evidence Assembler

Figure 4: An example of our 3-stage EPAr system exploring relevant documents, proposing candidate answers, and then assembling extracted evidence to make the final prediction. ulously designed to solve single-document MRC ilarly, Wang et al.(2018) first selected relevant tasks. Clark and Gardner(2018) and Chen et al. content by ranking documents and then extracted (2017) used a simple TF-IDF based document- the answer span. Min et al.(2018) selected rele- selection procedure to find the context that is vant sentences from long documents in a single- most relevant to the query for multi-document document setup and achieved faster speed and ro- QA. However, this 1-hop, similarity-based se- bustness against adversarial corruption. However, lection process would fail on multi-hop reading- none of these models are built for multi-hop MRC comprehension datasets like WikiHop because the where our EPAr system shows great effectiveness. query subject and the answer could appear in dif- ferent documents. On the other hand, our Doc- 6 Conclusion ument Explorer can discover the document with We presented an interpretable 3-module, multi- the answer “Loon op Zand” (in Fig. 1a) by itera- hop, reading-comprehension system ‘EPAr’ which tively selecting relevant documents and encoding constructs a ‘reasoning tree’, proposes an answer the hinge words “Efteling” and “Kaatsheuvel” in candidate for every root-to-leaf chain, and merges its memory. key information from all reasoning chains to make Recently, Dhingra et al.(2018) leveraged coref- the final prediction. On WikiHop, our system out- erence annotations from an external system to con- performs all published models on the dev set, and nect the entities. Song et al.(2018a) and De Cao achieves results competitive with the current state- et al.(2018) utilized Graph Convolutional Net- of-the-art on the test set. On MedHop, our system works (Kipf and Welling, 2017) and Graph Recur- outperforms all previously published models on rent Networks (Song et al., 2018b; Zhang et al., the leaderboard test set. We also presented multi- 2018) to model the relations between entities. Re- ple reasoning-chain recovery tests for the explain- cently, Cao et al.(2019) extended the Graph Con- ability of our system’s reasoning capabilities. volutional Network in De Cao et al.(2018) by in- 7 Acknowledgement troducing bi-directional attention between the en- tity graph and query. By connecting the entities, We would like to thank Johannes Welbl for help- these models learn the inference paths for multi- ing test our system on WikiHop and MedHop. We hop reasoning. Our work differs in that our sys- thank the reviewers for their helpful comments. tem learns the relation implicitly without the need This work was supported by DARPA (YFA17- of any human-annotated relation. Recently, Zhong D17AP00022), Google Faculty Research Award, et al.(2019) used hierarchies of co-attention and Bloomberg Data Science Research Grant, Sales- self-attention to combine evidence from multiple force Deep Learning Research Grant, Nvidia GPU scattered documents. Our novel 3-module archi- awards, Amazon AWS, and Google Cloud Credits. tecture is inspired by previous 2-module selection The views contained in this article are those of the architectures for MRC (Choi et al., 2017). Sim- authors and not of the funding agency. References Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Su- D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural leyman, and Phil Blunsom. 2015. Teaching ma- Machine Translation by Jointly Learning to Align chines to read and comprehend. In Advances in Neu- and Translate. In Third International Conference on ral Information Processing Systems, pages 1693– Learning Representations. 1701.

Yu Cao, Meng Fang, and Dacheng Tao. 2019. BAG: bi- Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. directional attention entity graph convolutional net- Long short-term memory. Neural computation, work for multi-hop reasoning question answering. 9(8):1735–1780. In NAACL-HLT. Sarthak Jain. 2016. Question answering over knowl- Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal edge base using factual memory networks. In Pro- Vincent, Gerald Tesauro, and Yoshua Bengio. 2016. ceedings of the NAACL Student Research Workshop. Hierarchical memory networks. arXiv preprint Association for Computational Linguistics. arXiv:1605.07427. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Danqi Chen, Adam Fisch, Jason Weston, and Antoine Zettlemoyer. 2017. Triviaqa: A large scale distantly Bordes. 2017. Reading Wikipedia to answer open- supervised challenge dataset for reading comprehen- domain questions. In ACL. sion. arXiv preprint arXiv:1705.03551.

Kyunghyun Cho, Bart van Merrienboer, C¸aglar Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Gulc¸ehre,¨ Fethi Bougares, Holger Schwenk, and method for stochastic optimization. CoRR. Yoshua Bengio. 2014. Learning phrase representa- tions using RNN encoder-decoder for statistical ma- Thomas N. Kipf and Max Welling. 2017. Semi- chine translation. In EMNLP. supervised classification with graph convolutional networks. In ICLR. Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Be- Xi Victoria Lin, Richard Socher, and Caiming Xiong. rant. 2017. Coarse-to-fine question answering for 2018. Multi-hop knowledge graph reasoning with long documents. In ACL. reward shaping. In EMNLP.

Christopher Clark and Matt Gardner. 2018. Simple Sewon Min, Victor Zhong, Richard Socher, and Caim- and effective multi-paragraph reading comprehen- ing Xiong. 2018. Efficient and robust question an- sion. In Proceedings of the 56th Annual Meeting swering from minimal context over documents. In of the Association for Computational Linguistics. Proceedings of the 56th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Nicola De Cao, Wilker Aziz, and Ivan Titov. 2018. Long Papers), pages 1725–1735. Association for Question answering by reasoning across documents Computational Linguistics. with graph convolutional networks. arXiv preprint arXiv:1808.09920. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 2016. Ms marco: A human generated machine Kristina Toutanova. 2018. Bert: Pre-training of deep reading comprehension dataset. arXiv preprint bidirectional transformers for language understand- arXiv:1611.09268. ing. Jeffrey Pennington, Richard Socher, and Christopher D Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William W Manning. 2014. Glove: Global vectors for word Cohen, and Ruslan Salakhutdinov. 2018. Neural representation. In Conference on Empirical Meth- models for reasoning over multiple mentions using ods in Natural Language Processing (EMNLP). coreference. In Proceedings of the 16th Annual Conference of the North American Chapter of the Matthew E. Peters, Mark Neumann, Mohit Iyyer, Association for Computational Linguistics: Human Matt Gardner, Christopher Clark, Kenton Lee, and Language Technologies. Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated- Martin Raison, Pierre-Emmanuel Mazare,´ Rajarshi attention readers for text comprehension. In Pro- Das, and Antoine Bordes. 2018. Weaver: Deep co- ceedings of the 55th Annual Meeting of the Associa- encoding of questions and documents for machine tion for Computational Linguistics (Volume 1: Long reading. arXiv preprint arXiv:1804.10490. Papers), pages 1832–1846, Vancouver, Canada. As- sociation for Computational Linguistics. P. Rajpurkar, R. Jia, and P. Liang. 2018. Know what you don’t know: Unanswerable questions for Bradley Efron and Robert J Tibshirani. 1994. An intro- SQuAD. In Association for Computational Linguis- duction to the bootstrap. CRC press. tics (ACL). Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Victor Zhong, Caiming Xiong, Nitish Shirish Keskar, Percy Liang. 2016. Squad: 100,000+ questions for and Richard Socher. 2019. Coarse-grain fine-grain machine comprehension of text. In Conference on coattention network for multi-evidence question an- Empirical Methods in Natural Language Processing swering. In ICLR. (EMNLP). Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Mantong Zhou, Minlie Huang, and Xiaoyan Zhu. Hannaneh Hajishirzi. 2017. Bidirectional attention 2018. An interpretable reasoning network for multi- flow for machine comprehension. In International relation question answering. In Proceedings of Conference on Learning Representations (ICLR). the 27th International Conference on Computational Linguistics. Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018a. Exploring graph-structured passage representation for multi- Appendix hop reading comprehension with graph neural net- works. arXiv preprint arXiv:1809.02040. A Reranker Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018b. A graph-to-sequence model for amr- We explore an alternative to Evidence Assem- to-text generation. In Proceedings of the 56th An- bler (EA), where instead of selecting key sen- nual Meeting of the Association for Computational tences from every root-to-leaf path in the rea- Linguistics (Volume 1: Long Papers), pages 1616– soning tree, we use a reranker to rescore the se- 1626. Association for Computational Linguistics. lected documents. Specifically, given a document Rupesh Kumar Srivastava, Klaus Greff, and Jurgen¨ reasoning-tree of tw reasoning chains, we use bidi- Schmidhuber. 2015. Highway networks. In Inter- rectional attention (Seo et al., 2017) between the national Conference on Machine Learning (ICML). last documents in each chain and all the docu- Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, ments from the previous hops in that chain to ob- Tim Klinger, Wei Zhang, Shiyu Chang, Gerald ˆ ˆ tain {h1, ··· , htw } which are the refined repre- Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: sentations of the leaf documents. We then ob- Reinforced ranker-reader for open-domain question answering. In AAAI. tain a fixed length document representation as the weighted average of word representations for each Dirk Weissenborn, Georg Wiese, and Laura Seiffe. of the t documents using similarity with query 2017. Making neural qa as simple as possible but w not simpler. In CoNLL. subject and query body as the weights using func- tion α. We obtain the scores for each of the doc- Johannes Welbl, Pontus Stenetorp, and Sebastian uments by computing similarity with the answer Riedel. 2017. Constructing datasets for multi-hop reading comprehension across documents. In TACL. which that reasoning chain proposes using β. (See Sec.C below for details of the similarity functions Caiming Xiong, Victor Zhong, and Richard Socher. α and β.) 2017. Dynamic coattention networks for question answering. In ICLR. B Self-Attention Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. Wikiqa: A challenge dataset for open-domain ques- We use self-attention from Zhong et al.(2019) tion answering. In Proceedings of the 2015 Con- ference on Empirical Methods in Natural Language to get the compact representation for all support- Processing, pages 2013–2018. ing documents. Given contextual word repre- sentations for the supporting documents H = Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben- K×2v gio, William W Cohen, Ruslan Salakhutdinov, and {h1, h2, ··· , hN 0 } such that hi ∈ R , we de- 2v Christopher D Manning. 2018. Hotpotqa: A dataset fine Selfattn(hi) → pi ∈ R as: for diverse, explainable multi-hop question answer- ing. In Conference on Empirical Methods in Natural k Language Processing (EMNLP). aik = tanh(W2tanh(W1hi + b1) + b2) aˆi = softmax(ai) Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. (3) 2015. Neural enquirer: Learning to query tables. K X k arXiv preprint. pi = aˆikhi k=1 Yue Zhang, Qi Liu, and Linfeng Song. 2018. Sentence- state lstm for text representation. In Proceedings of the 56th Annual Meeting of the Association for Com- such that pi provides the summary of the ith doc- putational Linguistics (ACL). ument with a vector representation. C Similarity Functions Welbl et al.(2017) show the poor performance of TF-IDF model we drop the TF-IDF document re- When constructing our 3-module system, we use trieval procedure and supervision at the first hop of similarity functions α and β. The function β is the Document Explorer (with the document hav- defined as: ing highest TF-IDF score to query subject). We train all modules of our system jointly using Adam β(h, c) = W relu(W [h; u; h◦u]+b )+b β1 β2 β2 β1 Optimizer (Kingma and Ba, 2014) with an initial (4) learning rate of 0.001 and a batch size of 10. We where relu(x) = max(0, x), and ◦ represents also use a dropout rate of 0.2 in all our linear pro- element-wise multiplication. And the function α jection layers, encoding LSTM-RNN and charac- is defined as: ter CNNs. T α(h, u) = Wα2 ((Wα1 h + bα1 ) ◦ u) (5) where all trainable weights are marked in bold.

D Datasets and Metrics We evaluate our 3-module system on QAngaroo (Welbl et al., 2017), which is a set of two multi- hop reading comprehension datasets: WikiHop and MedHop. WikiHop contains 51K instances, including 44K for training, 5K for development and 2.5K for held out testing. MedHop is a smaller dataset based on the domain of molecular biology. It consists of 1.6K instances for training, 342 for development, and 546 for held out testing. Each instance consists of a query (which can be sep- arated as a query subject and a query body), a set of supporting documents and a list of candi- date answers. For the WikiHop development set, each instance is also annotated as “follows” or “not follows”, which signifies whether the answer can be inferred from the given set of supporting documents, and “multiple” or “single”, which tells whether the complete reasoning chain comprises of multiple documents or just a single one. We measure our system’s performance on these sub- sets of the development set that are annotated as “follows and multiple” and “follows and single”. This allows us to evaluate our systems on a less noisy version of development set and to investigate their strength in queries requiring different levels of multi-hop reasoning behavior.

E Implementation Details For Medhop, considering the small size of the dataset, we use 20-d hidden size of the encod- ing LSTM-RNN and the last hidden state of the encoding LSTM-RNN to get compact representa- tion of the documents. We also use a hidden size of 20 for the embedded GRU cell and LSTM in our Evidence Assembler. In addition to that, since