Exploiting Explicit Paths for Multi-hop Reading Comprehension

Souvik Kundu†∗, Tushar Khot‡, Ashish Sabharwal‡, and Peter Clark‡
†Department of Computer Science, National University of Singapore
‡Allen Institute for Artificial Intelligence, Seattle, WA, U.S.A.
[email protected], {tushark,ashishs,peterc}@allenai.org

Abstract

We propose a novel, path-based reasoning approach for the multi-hop reading comprehension task, where a system needs to combine facts from multiple passages to answer a question. Although inspired by multi-hop reasoning over knowledge graphs, our proposed approach operates directly over unstructured text. It generates potential paths through passages and scores them without any direct path supervision. The proposed model, named PathNet, attempts to extract implicit relations from text through entity pair representations, and compose them to encode each path. To capture additional context, PathNet also composes the passage representations along each path to compute a passage-based representation. Unlike previous approaches, our model is then able to explain its reasoning via these explicit paths through the passages. We show that our approach outperforms prior models on the multi-hop WikiHop dataset, and also generalizes to the OpenBookQA dataset, matching state-of-the-art performance.

Query: (always breaking my heart, record label, ?)
Supporting Passages:
(p1) "Always Breaking My Heart" is the second single from Belinda Carlisle's album A Woman and a Man, released in 1996 (see 1996 in music). It ...
(p2) A Woman and a Man is the sixth studio album by American singer Belinda Carlisle, released in the United Kingdom on September 23, 1996 by Chrysalis Records (then part of the EMI Group), ...
Candidates: chrysalis records, emi group, virgin records, ...
Answer: chrysalis records
Paths: ("Always Breaking My Heart" ... single from ... A Woman and a Man), (A Woman and a Man ... released ... by ... Chrysalis Records)

Figure 1: Example illustrating our proposed path extraction and reasoning approach.

1 Introduction

Many reading comprehension (RC) datasets (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017) have been proposed recently to evaluate a system's ability to answer a question from a given text passage. However, most of the questions in these datasets can be answered using only a single sentence or passage. As a result, systems designed for these tasks may not be able to compose knowledge from multiple sentences or passages, a key aspect of natural language understanding. To remedy this, new datasets (Weston et al., 2015; Welbl et al., 2018; Khashabi et al., 2018a; Mihaylov et al., 2018) have been proposed that require a system to combine information from multiple sentences in order to arrive at the answer, referred to as multi-hop reasoning.

Multi-hop reasoning has been studied for question answering (QA) over structured knowledge graphs (Lao et al., 2011; Guu et al., 2015; Das et al., 2017). Many of the successful models explicitly identify paths in the knowledge graph that led to the answer. A strength of these models is high interpretability, arising from explicit path-based reasoning over the underlying graph structure. However, they cannot be directly applied to QA in the absence of such structure.

Consequently, most multi-hop RC models over unstructured text (Dhingra et al., 2017; Hu et al., 2018) extend standard attention-based RC models by iteratively updating the attention to indirectly "hop" over different parts of the text. Recently, graph-based models (Song et al., 2018; Cao et al., 2018) have been proposed for the WikiHop dataset (Welbl et al., 2018). Nevertheless, these models still only implicitly combine knowledge from all passages, and are therefore unable to provide explicit reasoning paths.

[∗] Work performed while doing an internship at the Allen Institute for Artificial Intelligence.

We propose an approach[1] for multiple-choice RC that explicitly extracts potential paths from text (without direct path supervision) and encodes the knowledge captured by each path. Figure 1 shows how this approach applies to an example from the WikiHop dataset. It shows two sample paths connecting an entity in the question (Always Breaking My Heart) to a candidate answer (Chrysalis Records) through a singer (Belinda Carlisle) and an album (A Woman and a Man).

To encode a path, our model, named PathNet, first aims to extract implicit (latent) relations between entity pairs in a passage based on their contextual representations. For example, it aims to extract the implicit single from relation between the song and the name of the album in the first passage. Similarly, it extracts the released by relation between the album and the record label in the second passage. It learns to compose the extracted implicit relations such that they map to the main relation in the query, in this case record label. In essence, the motivation is to learn to extract implicit relations from text and to identify their valid compositions, such as: (x, single from, y), (y, released by, z) → (x, record label, z). Due to the absence of direct supervision on these relations, PathNet does not explicitly extract them. However, our qualitative analysis on a sampled set of instances from the WikiHop development set shows that the top-scoring paths in 78% of the correctly answered questions have implied relations in the text that could be composed to derive the query relations.

In addition, PathNet also learns to compose aggregated passage representations along a path to capture more global information: encoding(p1), encoding(p2) → (x, record label, z). This passage-based representation is especially useful in domains such as science question answering, where the lack of easily identifiable entities limits the effectiveness of the entity-based path representation. While this passage-based representation is less interpretable than the entity-based path representation, it still identifies the two passages used to select the answer, compared to the spread-out attention over all documents produced by previous graph-based approaches.

We make three main contributions: (1) a novel path-based reasoning approach for multi-hop QA over text that produces explanations in the form of explicit paths; (2) a model, PathNet, which aims to extract implicit relations from text and compose them; and (3) outperforming prior models on the target WikiHop dataset[2] and generalizing to the open-domain science QA dataset, OpenBookQA, with performance comparable to prior models.

[1] The source code is available at https://github.com/allenai/PathNet
[2] Other systems, such as by Zhong et al. (2019), have recently appeared on the WikiHop leaderboard (http://qangaroo.cs.ucl.ac.uk/leaderboard.html).

2 Related Work

We summarize related work in QA over text, semi-structured knowledge, and knowledge graphs.

Multi-hop RC. Recent datasets such as bAbI (Weston et al., 2015), Multi-RC (Khashabi et al., 2018a), WikiHop (Welbl et al., 2018), and OpenBookQA (Mihaylov et al., 2018) have encouraged research in multi-hop QA over text. The resulting multi-hop models can be categorized into state-based and graph-based reasoning models. State-based reasoning models (Dhingra et al., 2017; Shen et al., 2017; Hu et al., 2018) are closer to a standard attention-based RC model with an additional "state" representation that is iteratively updated. The changing state representation results in the model focusing on different parts of the passage during each iteration, allowing it to combine information from different parts of the passage. Graph-based reasoning models (Dhingra et al., 2018; Cao et al., 2018; Song et al., 2018), on the other hand, create graphs over entities within the passages and update entity representations via recurrent or convolutional networks. In contrast, our approach explicitly identifies paths connecting entities in the question to the answer choices.

Semi-structured QA. Our model is closer to Integer Linear Programming (ILP) based methods (Khashabi et al., 2016; Khot et al., 2017; Khashabi et al., 2018b), which define an ILP program to find optimal support graphs connecting the question to the choices through a semi-structured knowledge representation. However, these models require a manually authored and tuned ILP program, and need to convert text into a semi-structured representation, a process that is often noisy (such as using Open IE tuples (Khot et al., 2017) or SRL frames (Khashabi et al., 2018b)). Our model, on the other hand, is trained end-to-end, and discovers relevant relational structure from text. Instead of an ILP program, Jansen et al. (2017) train a latent ranking perceptron using features from aggregated syntactic structures from multiple sentences. However, their system operates at the detailed (and often noisy) level of dependency graphs, whereas we identify entities and let the model learn implicit relations and their compositions.

Knowledge Graph QA. QA datasets on knowledge graphs such as Freebase (Bollacker et al., 2008) require systems to map queries to a single relation (Bordes et al., 2015), a path (Guu et al., 2015), or complex structured queries (Berant et al., 2013) over these graphs. While early models (Lao et al., 2011; Gardner and Mitchell, 2015) focused on creating path-based features, recent neural models (Guu et al., 2015; Das et al., 2017; Toutanova et al., 2016) encode the entities and relations along a path and compose them using recurrent networks. Importantly, the input knowledge graphs have entities and relations that are shared across all training and test examples, which the model can exploit during learning (e.g., via learned entity and relation embeddings). When reasoning with text, our model must learn these representations purely based on their local context.

3 Approach Overview

We focus on the multiple-choice RC setting: given a question and a set of passages, the task is to find the correct answer among a predefined set of candidates. The proposed approach can be applied to m-hop reasoning, as discussed briefly in the corresponding sections for path extraction, encoding, and scoring. Since our target datasets primarily need 2-hop reasoning[3], and since the potential for semantic drift grows with the number of hops (Fried et al., 2015; Khashabi et al., 2019), we focus on and assess the case of 2-hop paths (m = 2). As discussed later (see Footnote 4), our path-extraction step scales exponentially with m. Using m = 2 keeps this step tractable, while still covering almost all examples in our target datasets.

In WikiHop, a question Q is given in the form of a tuple (he, r, ?), where he represents the head entity and r represents the relation between he and the unknown tail entity. The task is to select the unknown tail entity from a given set of candidates {c_1, c_2, ..., c_N}, by reasoning over supporting passages P = p_1, ..., p_M. To perform multi-hop reasoning, we extract multiple paths (cf. Section 4) connecting he to each c_k from the supporting passages P. The j-th 2-hop path for candidate c_k is denoted p_{kj}, where p_{kj} = he → e1 → c_k, and e1 is referred to as the intermediate entity.

In OpenBookQA, unlike WikiHop, the questions and candidate answer choices are plain text sentences. To construct paths, we extract all head entities from the question and all tail entities from the candidate answer choices, considering all noun phrases and named entities as entities. This often results in many 2-hop paths connecting a question to a candidate answer choice via the same intermediate entity. With {he_1, he_2, ...} representing the list of head entities from a question, and {c_{k1}, c_{k2}, ...} the list of tail entities from candidate c_k, the j-th path connecting he_α to c_{kβ} can be represented as p^{α,β}_{kj} = he_α → e1 → c_{kβ}. For simplicity, we omit the notations α and β from the path representation.

Next, the extracted paths are encoded and scored (cf. Section 5). The normalized path scores are then summed for each candidate to give a probability distribution over the candidate answer choices.

[3] We found that most WikiHop questions can be answered with 2 hops, and OpenBookQA also targets 2-hop questions.
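
To make the notation above concrete, the following sketch (in Python) shows one way to represent a single extracted 2-hop path; the field names are our own and purely illustrative, not taken from the released code. It carries exactly the quantities the later sections operate on: the candidate index k, the two supporting passages, and the token spans of he, e1, e1', and c_k.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (start, end) token offsets of an entity mention inside a passage.
Span = Tuple[int, int]

@dataclass
class TwoHopPath:
    """One extracted path he -> e1 -> ck (illustrative field names)."""
    candidate_id: int               # index k of the candidate answer c_k
    passage1_id: int                # passage p1 containing he and e1
    passage2_id: int                # passage p2 containing e1' and ck
    he_span: Span                   # location of the head entity in p1
    e1_span: Optional[Span]         # location of e1 in p1 (None for 1-hop paths)
    e1_prime_span: Optional[Span]   # location of e1' in p2
    ck_span: Span                   # location of the candidate ck in p2
```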

4 Path Extraction

The first step in our approach is extracting paths from text passages. Consider the example in Figure 1. Path extraction proceeds as follows:

(a) We find a passage p1 that contains a head entity he from the question Q. In our example, we would identify the first supporting passage, which contains always breaking my heart.

(b) We then find all named entities and noun phrases that appear in the same sentence as he or in the subsequent sentence. Here, we would collect Belinda Carlisle, A Woman and a Man, and album as potential intermediate entities e1.

(c) Next, we find a passage p2 that contains the potential intermediate entity identified above. For clarity, we refer to the occurrence of e1 in p2 as e1'. By design, (he, e1) and (e1', c_k) are located in different passages. For instance, we find the second passage, which contains both Belinda Carlisle and A Woman and a Man.

(d) Finally, we check whether p2 contains any of the candidate answer choices. For instance, p2 contains chrysalis records and emi group.

The resulting extracted paths can be summarized as a set of entity sequences. In this case, for the candidate answer chrysalis records, we obtain a set of two paths: (always breaking my heart → Belinda Carlisle → chrysalis records) and (always breaking my heart → A Woman and a Man → chrysalis records). Similarly, we can collect paths for the other candidate, emi group.

Notably, our path extraction method can easily be extended to more hops. Specifically, for m-hop reasoning, steps (b) and (c) are repeated (m − 1) times, where the intermediate entity from step (c) becomes the head entity for the subsequent step (b). For larger values of m, maintaining tractability of this approach would require optimizing the complexity of identifying the passages containing an entity (steps (a) and (c)) and limiting the number of neighboring entities considered (step (b)).[4]

For one-hop reasoning, i.e., when a single passage is sufficient to answer a question, we construct the path with e1 as null. In this case, both he and c_k are found in a single passage. In this way, for a task requiring more hops, one only needs to guess the maximum number of hops. If some questions in that task require fewer hops, our proposed approach can easily handle them by assigning the intermediate entity to null. For instance, in this work, our approach can handle 1-hop reasoning although it is developed for 2-hop reasoning.

[4] If the search step takes no more than s steps and identifies a fixed number k of passages, and we select up to e neighboring entities, our approach has a time complexity of O((ke)^{m−1} s^m) for enumerating m-hop paths.
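
The extraction procedure in steps (a)-(d) can be sketched in plain Python. This is a minimal illustration, not the released implementation: it assumes passages are given as lists of sentence strings, and the entity detector and string-matching test are passed in as arguments (`find_entity_mentions` and `contains` are hypothetical helpers).

```python
def extract_2hop_paths(question_entities, candidates, passages, find_entity_mentions, contains):
    """Enumerate (he, e1, ck) chains across pairs of passages, following steps (a)-(d) above."""
    paths = []
    for he in question_entities:
        for i, p1 in enumerate(passages):                          # (a) a passage containing he
            he_sents = [s for s, sent in enumerate(p1) if contains(sent, he)]
            if not he_sents:
                continue
            e1_candidates = set()                                  # (b) entities in the same or next sentence
            for s in he_sents:
                for sent in p1[s:s + 2]:
                    e1_candidates.update(find_entity_mentions(sent))
            for e1 in e1_candidates:
                for j, p2 in enumerate(passages):                  # (c) a different passage containing e1
                    if j == i or not any(contains(sent, e1) for sent in p2):
                        continue
                    for ck in candidates:                          # (d) that passage also mentions a candidate
                        if any(contains(sent, ck) for sent in p2):
                            paths.append((he, e1, ck, i, j))
    return paths
```

For m-hop extraction, the (b)-(c) block would be repeated m − 1 times, which is where the O((ke)^{m−1} s^m) enumeration cost noted in Footnote 4 comes from.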

Figure 2: Architecture of the proposed model.

5 PathNet: Path-based Multi-hop QA Model

Once we have all potential paths, we score them using the proposed model, named PathNet, whose overview is depicted in Figure 2. The key component is the path-scorer module, which computes the score for each path p_{kj}. We normalize these scores across all paths, and compute the probability of a candidate c_k being the correct answer by summing the normalized scores of the paths associated with c_k:

prob(c_k) = Σ_j score(p_{kj})    (1)

Next, we describe the three main model components, operating on the following inputs: question Q, passages p1 and p2, candidate c_k, and the locations of he, e1, e1', and c_k in these passages: (1) Embedding and Encoding (§ 5.1), (2) Path Encoding (§ 5.2), and (3) Path Scoring (§ 5.3). In Figure 3, we present the model architecture for these three components used for scoring the paths.
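
As a concrete reading of Eq. (1), the snippet below normalizes the raw scores of all extracted paths with a softmax and sums them per candidate. PyTorch is our assumption about the framework; the excerpt above does not prescribe one.

```python
import torch

def candidate_probabilities(path_scores, path_to_candidate, num_candidates):
    """path_scores: (num_paths,) raw scores z; path_to_candidate: LongTensor mapping each path to its candidate k."""
    normalized = torch.softmax(path_scores, dim=0)        # score(p_kj), normalized over all paths
    probs = torch.zeros(num_candidates)
    probs.index_add_(0, path_to_candidate, normalized)    # prob(c_k) = sum_j score(p_kj)   (Eq. 1)
    return probs
```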

5.1 Embedding and Encoding

We start by describing how we embed and contextually encode all pieces of text: question, supporting passages, and candidate answer choices. For word embedding, we use pretrained 300-dimensional vectors from GloVe (Pennington et al., 2014), randomly initializing vectors for out-of-vocabulary (OOV) words. For contextual encoding, we use a bi-directional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997).

Let T, U, and V represent the number of tokens in the p-th supporting passage, the question, and the k-th answer candidate, respectively. The final encoded representation for the p-th supporting passage is obtained by stacking these vectors into S_p ∈ R^{T×H}, where H is the number of hidden units for the BiLSTMs. The sequence-level encodings for the question, Q ∈ R^{U×H}, and for the k-th candidate answer, C_k ∈ R^{V×H}, are obtained similarly. We use row vector representations (e.g., R^{1×H}) for all vectors in this paper.
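
A minimal sketch of this embedding-and-encoding layer, under the same PyTorch assumption: frozen GloVe lookups followed by a single-layer BiLSTM whose two 50-dimensional directions concatenate into H = 100, matching the hyperparameters reported later in Section 6.1.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Embeds a token-id sequence with frozen GloVe vectors and encodes it with a BiLSTM."""
    def __init__(self, glove_weights, hidden_size=50):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=True)  # 300-d GloVe
        self.bilstm = nn.LSTM(input_size=glove_weights.size(1), hidden_size=hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):               # token_ids: (batch, T)
        embedded = self.embedding(token_ids)    # (batch, T, 300)
        encoded, _ = self.bilstm(embedded)      # (batch, T, 2*hidden_size); rows of S_p, Q, or C_k
        return encoded
```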

5.2 Path Encoding

After extracting the paths as discussed in Section 4, they are encoded using an end-to-end neural network. This path encoder consists of two components: context-based and passage-based.

Figure 3: Architecture of the path scoring module, shown here for 2-hop paths.

5.2.1 Context-based Path Encoding

This component aims to implicitly encode the relation between he and e1, and between e1' and c_k. These implicit relation representations are then composed together to encode a path representation for he → e1 ... e1' → c_k.

First, we extract the contextual representations for each of he, e1, e1', and c_k. Based on the locations of these entities in the corresponding passages, we extract the boundary vectors from the passage encoding representation. For instance, if he appears in the p-th supporting passage from token i1 to i2 (i1 ≤ i2), then the contextual encoding of he, g_{he} ∈ R^{2H}, is taken to be g_{he} = s_{p1,i1} || s_{p1,i2}, where || denotes the concatenation operation. If he appears in multiple locations within the passage, we use the mean vector representation across all of these locations. The location encoding vectors g_{e1}, g_{e1'}, and g_{ck} are obtained similarly.

Next, we extract the implicit relation between he and e1 as r_{he,e1} ∈ R^H, using a feed-forward layer:

r_{he,e1} = FFL(g_{he}, g_{e1})    (2)

where FFL is defined as:

FFL(a, b) = tanh(a W_a + b W_b)    (3)

Here a ∈ R^{H'} and b ∈ R^{H''} are input vectors, and W_a ∈ R^{H'×H} and W_b ∈ R^{H''×H} are trainable weight matrices. The bias vectors are not shown here for simplicity. Similarly, we compute the implicit relation between e1' and c_k as r_{e1',ck} ∈ R^H, using their location encoding vectors g_{e1'} and g_{ck}. Finally, we compose all implicit relation vectors along the path to obtain a context-based path representation x_ctx ∈ R^H, given by:

x_ctx = comp(r_{he,e1}, r_{e1',ck})    (4)

For fixed-length paths, we can use a feed-forward network as the composition function; e.g., for 2-hop paths, we use FFL(r_{he,e1}, r_{e1',ck}). For variable-length paths, we can use recurrent composition networks such as LSTMs or GRUs. We compare these composition functions in Section 6.3.
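
The sketch below mirrors Eqs. (2)-(4) under the same PyTorch assumption: a location encoding concatenates the boundary vectors of an entity span, an FFL layer applies tanh(a W_a + b W_b), and two unshared relation FFLs feed a final FFL that composes x_ctx; the "FFL Shared" variant discussed in Section 6.3 would reuse a single relation layer instead.

```python
import torch
import torch.nn as nn

class FFL(nn.Module):
    """FFL(a, b) = tanh(a W_a + b W_b)  (Eq. 3); biases come with nn.Linear but are omitted in the text."""
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        self.wa = nn.Linear(dim_a, dim_out)
        self.wb = nn.Linear(dim_b, dim_out)

    def forward(self, a, b):
        return torch.tanh(self.wa(a) + self.wb(b))

def location_encoding(passage_enc, span):
    """g_e = s_{p,i1} || s_{p,i2}: concatenation of the span's boundary vectors (dimension 2H)."""
    i1, i2 = span
    return torch.cat([passage_enc[i1], passage_enc[i2]], dim=-1)

class ContextPathEncoder(nn.Module):
    def __init__(self, H=100):
        super().__init__()
        self.rel_ffl_1 = FFL(2 * H, 2 * H, H)   # Eq. (2): r_{he,e1} = FFL(g_he, g_e1)
        self.rel_ffl_2 = FFL(2 * H, 2 * H, H)   # unshared weights for r_{e1',ck}
        self.compose_ffl = FFL(H, H, H)         # Eq. (4): FFL-based composition for 2-hop paths

    def forward(self, p1_enc, he_span, e1_span, p2_enc, e1p_span, ck_span):
        r1 = self.rel_ffl_1(location_encoding(p1_enc, he_span),
                            location_encoding(p1_enc, e1_span))
        r2 = self.rel_ffl_2(location_encoding(p2_enc, e1p_span),
                            location_encoding(p2_enc, ck_span))
        return self.compose_ffl(r1, r2)         # x_ctx in R^H
```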

5.2.2 Passage-based Path Encoding

In this encoder, we use entire passages to compute the path representation. As before, suppose (he, e1) and (e1', c_k) appear in supporting passages p1 and p2, respectively. We encode each of p1 and p2 into a single vector based on passage-question interaction. As discussed below, we first compute a question-weighted representation for the passage tokens and then aggregate it across the passage.

Question-Weighted Passage Representation: For the p-th passage, we first compute the attention matrix A ∈ R^{T×U}, capturing the similarity between the passage and question words. Then, we calculate a question-aware passage representation S_p^{q1} ∈ R^{T×H}, where S_p^{q1} = A Q. Similarly, a passage-aware question representation Q_p ∈ R^{U×H} is computed, where Q_p = A^⊤ S_p. Further, we compute another passage representation S_p^{q2} = A Q_p ∈ R^{T×H}. Intuitively, S_p^{q1} captures important passage words based on the question, whereas S_p^{q2} is another passage representation that focuses on the interaction with passage-relevant question words. The idea of encoding a passage after interacting with the question multiple times is inspired by the Gated Attention Reader model (Dhingra et al., 2017).
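
A sketch of these question-weighted representations, following S_p^{q1} = A Q, Q_p = A^⊤ S_p, and S_p^{q2} = A Q_p. How A itself is normalized is not spelled out in the excerpt above, so the row-wise softmax over dot-product similarities below is our assumption.

```python
import torch

def question_weighted_passage(S_p, Q):
    """S_p: (T, H) passage encoding; Q: (U, H) question encoding."""
    sim = S_p @ Q.t()                  # (T, U) token-level similarities
    A = torch.softmax(sim, dim=1)      # attention matrix A (normalization over question words is assumed)
    S_q1 = A @ Q                       # S_p^{q1} = A Q       (T, H)
    Q_p = A.t() @ S_p                  # Q_p = A^T S_p        (U, H)
    S_q2 = A @ Q_p                     # S_p^{q2} = A Q_p     (T, H)
    return S_q1, S_q2
```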

Aggregate Passage Representation: To derive a single passage vector, we first concatenate the two passage representations for each token, obtaining S_p^q = S_p^{q1} || S_p^{q2} ∈ R^{T×2H}. We then use an attentive pooling mechanism to aggregate the token representations. The aggregated vector s̃_p ∈ R^{2H} for the p-th passage is obtained as:

a_t^p ∝ exp(s_{p,t}^q w^⊤),    s̃_p = a^p S_p^q    (5)

where w ∈ R^{2H} is a learned vector. In this way, we obtain the aggregated vector representations for both supporting passages p1 and p2 as s̃_{p1} ∈ R^{2H} and s̃_{p2} ∈ R^{2H}, respectively.

Composition: We compose the aggregated passage vectors to obtain the passage-based path representation x_psg ∈ R^H, similar to Equation 4:

x_psg = comp(s̃_{p1}, s̃_{p2})    (6)

As with the composition function in context-based path encoding, this composition function can be a feed-forward network for fixed-length paths or a recurrent network for variable-length paths.
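
The pooling and composition steps of Eqs. (5)-(6) can be sketched as follows, PyTorch again assumed; the FFL-style composition corresponds to the fixed-length, 2-hop case.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Eq. (5): a_t ∝ exp(s_{p,t}^q w^T), then a weighted sum of the token vectors."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))

    def forward(self, tokens):                     # tokens: (T, dim), e.g. S_p^q with dim = 2H
        a = torch.softmax(tokens @ self.w, dim=0)  # (T,) attention weights
        return a @ tokens                          # pooled vector of size dim

class PassagePathComposer(nn.Module):
    """Eq. (6): x_psg = comp(s~_p1, s~_p2), here an FFL-style feed-forward composition."""
    def __init__(self, dim=200, H=100):
        super().__init__()
        self.w1 = nn.Linear(dim, H)
        self.w2 = nn.Linear(dim, H)

    def forward(self, s_p1, s_p2):
        return torch.tanh(self.w1(s_p1) + self.w2(s_p2))   # x_psg in R^H
```

The same AttentivePooling module, instantiated with dim = H, can also pool the candidate encoding C_k into c̃_k, which Section 5.3 uses for passage-based path scoring.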

5.3 Path Scoring

Encoded paths are scored from two perspectives.

Context-based Path Scoring: We score context-based paths based on their interaction with the question encoding. First, we aggregate the question into a single vector. We take the first and last hidden state representations from the question encoding Q to obtain an aggregated question vector representation. The aggregated question vector q̃ ∈ R^H is

q̃ = (q_0 || q_U) W_q    (7)

where W_q ∈ R^{2H×H} is a learnable weight matrix. The combined representation y_{xctx,q} ∈ R^H of the question and a context-based path is computed as y_{xctx,q} = FFL(x_ctx, q̃). Finally, we derive scores for context-based paths:

z_ctx = y_{xctx,q} w_ctx^⊤    (8)

where w_ctx ∈ R^H is a trainable vector.

Passage-based Path Scoring: We also score paths based on the interaction between the passage-based path encoding vector and the candidate encoding. In this case, only the candidate encoding is used, since the passage-based path encoding already uses the question representation. We aggregate the representation C_k for candidate c_k into a single vector c̃_k ∈ R^H by applying an attentive pooling operation similar to Equation 5. The score for a passage-based path is then computed as follows:

z_psg = c̃_k x_psg^⊤    (9)

Finally, the unnormalized score for path p_{kj} is:

z = z_ctx + z_psg    (10)

and its normalized version, score(p_{kj}), is calculated by applying the softmax operation over all the paths and candidate answers.
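
Putting Eqs. (7)-(10) together for a single path, a sketch under the same PyTorch assumption: the question is pooled from its first and last encoder states, matched against x_ctx for z_ctx, the pooled candidate vector c̃_k is matched against x_psg for z_psg, and the sum z is what later gets softmax-normalized across all paths (Eq. 1).

```python
import torch
import torch.nn as nn

class PathScorer(nn.Module):
    def __init__(self, H=100):
        super().__init__()
        self.wq = nn.Linear(2 * H, H, bias=False)   # W_q in Eq. (7)
        self.ffl_a = nn.Linear(H, H)                # FFL(x_ctx, q~) for y_{xctx,q}
        self.ffl_b = nn.Linear(H, H)
        self.w_ctx = nn.Parameter(torch.randn(H))   # w_ctx in Eq. (8)

    def forward(self, Q_enc, x_ctx, x_psg, c_tilde):
        # Eq. (7): aggregate question vector from the first and last hidden states.
        q_tilde = self.wq(torch.cat([Q_enc[0], Q_enc[-1]], dim=-1))
        # Eq. (8): context-based score.
        y = torch.tanh(self.ffl_a(x_ctx) + self.ffl_b(q_tilde))
        z_ctx = y @ self.w_ctx
        # Eq. (9): passage-based score against the pooled candidate vector c~_k.
        z_psg = c_tilde @ x_psg
        # Eq. (10): unnormalized path score.
        return z_ctx + z_psg
```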

6 Experiments

We start by describing the experimental setup, and then present results and an analysis of our model.

6.1 Setup

We consider the standard (unmasked) version of the recently proposed WikiHop dataset (Welbl et al., 2018). WikiHop is a large-scale multi-hop QA dataset consisting of about 51K questions (5129 Dev, 2451 Test). Each question is associated with an average of 13.7 supporting Wikipedia passages, each with 36.4 tokens on average.

We also evaluate our model on OpenBookQA (Mihaylov et al., 2018), a very recent and challenging multi-hop QA dataset with about 6K questions (500 Dev, 500 Test), each with 4 candidate answer choices. Since OpenBookQA does not have associated passages for the questions, we retrieve sentences from a text corpus to create single-sentence passages.

We start with a corpus of 1.5M sentences used by previous systems (Khot et al., 2017) for science QA. It is then filtered down to 590K sentences by identifying sentences about generalities and removing noise. We assume sentences that start with a plural noun are likely to capture general concepts, e.g. "Mammals have fur", and only consider such sentences. We also eliminate noisy and irrelevant sentences by using a few rules, such as: the root of the parse tree must be a sentence, and it must not contain proper nouns. This corpus is also provided along with our code.

Next, we need to retrieve sentences that can lead to paths between the question q and an answer choice c. Doing so naively will only retrieve sentences that directly connect entities in q to c, i.e., 1-hop paths. To facilitate 2-hop reasoning, we first retrieve sentences based on words in q, and for each retrieved sentence s1, we find sentences that overlap with both s1 and c. Each chain is scored using idf(q, s1) · idf(s1, s2) · idf(s2, c), where s2 is the second retrieved sentence and idf(w) is the idf score of token w based on the input corpus:

idf(x, y) = ( Σ_{w ∈ x∩y} idf(w) ) / min( Σ_{w ∈ x} idf(w), Σ_{w ∈ y} idf(w) )

For efficiency, we perform beam search and ignore any chain whose score drops below a threshold (0.08). Finally, we take the top 100 chains and use these sentences as passages in our model.

We use Spacy[5] for tokenization. For word embedding, we use the 840B 300-dimensional pretrained word vectors from GloVe, and we do not update them during training. For simplicity, we do not use any character embedding. The number of hidden units in all LSTMs is 50 (H = 100). We use dropout (Srivastava et al., 2014) with probability 0.25 for every learnable layer. During training, the minibatch size is fixed at 8. We use the Adam optimizer (Kingma and Ba, 2015) with learning rate 0.001 and clipnorm 5. We use cross entropy loss for training. This being a multiple-choice QA task, we use accuracy as the evaluation metric.

[5] https://spacy.io/api/tokenizer
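
A sketch of the chain scoring used in this retrieval step. It assumes a precomputed idf dictionary and whitespace tokenization (both simplifications of ours), and implements the overlap measure idf(x, y) and the product idf(q, s1) · idf(s1, s2) · idf(s2, c) described above.

```python
def idf_overlap(x_tokens, y_tokens, idf):
    """idf(x, y): summed idf of shared tokens, normalized by the smaller total idf mass of x or y."""
    x_set, y_set = set(x_tokens), set(y_tokens)
    num = sum(idf.get(w, 0.0) for w in x_set & y_set)
    den = min(sum(idf.get(w, 0.0) for w in x_set),
              sum(idf.get(w, 0.0) for w in y_set))
    return num / den if den > 0 else 0.0

def score_chain(question, s1, s2, choice, idf, threshold=0.08):
    """Score a question -> s1 -> s2 -> choice sentence chain; low-scoring chains are pruned during beam search."""
    toks = lambda text: text.lower().split()
    score = (idf_overlap(toks(question), toks(s1), idf)
             * idf_overlap(toks(s1), toks(s2), idf)
             * idf_overlap(toks(s2), toks(choice), idf))
    return score if score >= threshold else 0.0
```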

6.2 Main Results

Table 1 compares our results on the WikiHop dataset with several recently proposed multi-hop QA models. We show the best results from each of the competing entries. Welbl et al. (2018) presented the results of BiDAF (Seo et al., 2017) on the WikiHop dataset. Dhingra et al. (2018) incorporated coreference connections inside a GRU network to capture coreference links while obtaining the contextual representation. Recently, Cao et al. (2018) and Song et al. (2018) proposed graph neural network approaches for multi-hop reading comprehension. While the high-level idea is similar across these works, Cao et al. (2018) used ELMo (Peters et al., 2018) for contextual embedding, which has proven to be very useful in many recent NLP tasks.

Model                   Dev     Test
Welbl et al. (2018)     -       42.9
Dhingra et al. (2018)   56.0    59.3
Song et al. (2018)      62.8    65.4
Cao et al. (2018)       64.8    67.6
PathNet                 67.4†   69.6†

Table 1: Accuracy (%) on the WikiHop dataset. †Statistically significant (Wilson, 1927).

As seen in Table 1, our proposed model PathNet significantly outperforms prior approaches on WikiHop. Additionally, we benefit from interpretability: unlike these prior methods, our model allows identifying the specific entity chains that led to the predicted answer.

Table 2 presents results on the OpenBookQA dataset. We compare with the Knowledge Enhanced Reader (KER) model (Mihaylov et al., 2018). The variants reflect the source from which the model retrieves relevant knowledge: the open book (OB), the WordNet subset of ConceptNet, the Open Mind Common Sense (OMCS) subset of ConceptNet, and the corpus of 590K sentences (Text). Since KER does not scale to a corpus of this size, we provided it with the combined set of sentences retrieved by our model for all the OpenBookQA questions. The model computes various cross-attentions between the question, knowledge, and answer choices, and combines these attentions to select the answer. Overall, our proposed approach marginally improved over the previous models on the OpenBookQA dataset[6]. Note that our model was designed for the closed-domain setting, where all the required knowledge is provided. Yet, our model is able to generalize to the open-domain setting, where the retrieved knowledge may be noisy or insufficient to answer the question.

Model                  Dev     Test
KER (OMCS)             54.4    52.2
KER (WordNet)          55.6    51.4
KER (OB + OMCS)        54.6    50.8
KER (OB + WordNet)     54.2    51.2
KER (OB + Text)        55.4    52.0
PathNet (OB + Text)    55.0    53.4

Table 2: Accuracy (%) on the OpenBookQA dataset.

[6] Sun et al. (2018) used the large OpenAI fine-tuned language model (Radford et al., 2018), pre-trained on an additional dataset, RACE (Lai et al., 2017), to achieve a score of 55% on this task.

6.3 Effectiveness of Model Components

Table 3 shows the impact of context-based and passage-based path encodings. Performance of the model degrades when we ablate either of the two path encoding modules. Intuitively, in context-based path encoding, limited and more fine-grained context is considered due to the use of specific entity locations. On the contrary, the passage-based path encoder computes the path representations considering the entire passage representations (both the passage containing the head entity and the passage containing the tail entity). As a result, even if the intermediate entity cannot be used meaningfully, the model retains the ability to form an implicit path representation. The passage-based path encoder is more helpful on OpenBookQA, as it is often difficult to find meaningful explicit context-based paths through entity linking across passages. Let us consider the following example taken from the OpenBookQA development set, where our model successfully predicted the correct answer.

Model                   WikiHop      OBQA
PathNet                 67.4†        55.0†
- context-based path    64.7 (2.7)   54.8∗ (0.2)
- passage-based path    63.2 (4.2)   46.2 (8.8)

Table 3: Ablation results (% accuracy, ∆ in parentheses) on development sets. ∗Improvement over this is not statistically significant.

Question: What happens when someone on top of a bicycle starts pushing it 's peddles in a circular motion ?
Answer: the bike accelerates
Best Path: (bicycle, pedal, bike)
p1: bicycles require continuous circular motion on pedals
p2: pushing on the pedals of a bike cause that bike to move.

In this case, the extracted path through entity linking is not meaningful, as the path composition would connect bicycles to bike[7]. However, when the entire passages are considered, they contain sufficient information to help infer the answer.

[7] Entities in science questions can be phrases and events (e.g., "the bike accelerates"). Identifying and matching such entities is very challenging in the OpenBookQA dataset. We show that our entity-linking approach, designed for noun phrases and named entities, is still able to perform comparably to state-of-the-art methods on science question answering, despite this noisy entity matching.

Table 4 presents the results on the WikiHop development set when different composition functions are used for Equation (4). Recurrent networks, such as LSTMs and GRUs, enable the path encoder to model an arbitrary number of hops. For 2-hop paths, we found that a simple feed-forward network (FFL) performs slightly better than the rest. We also considered sharing the weights (FFL Shared) when obtaining the relation vectors r_{he,e1} and r_{e1',ck}. Technically, the FFL model is performing the same task in both cases, extracting implicit relations, and the parameters could be shared. However, practically, the unshared weights perform better, possibly because this gives the model the freedom to handle answer candidates differently, especially allowing it to consider the likelihood of a candidate being a valid answer to any question, akin to a prior.

Model           Accuracy (%)   ∆
FFL (PathNet)   67.4           -
FFL Shared      66.7           0.7
LSTM            67.1           0.3
GRU             67.3           0.1

Table 4: Various composition functions used to generate the path representation (x_ctx), on the WikiHop development set.

6.4 Qualitative Analysis

One key aspect of our model is its ability to indicate the paths that contribute most towards predicting an answer choice. Table 5 illustrates the two highest-scoring paths for two sample WikiHop questions which lead to correct answer prediction. In the first question, the top-2 paths are formed by connecting Zoo Lake to Gauteng through the intermediate entities Johannesburg and South Africa, respectively. In the second example, the science fiction novel This Day All Gods Die is connected to the publisher Bantam Books through the author Stephen R. Donaldson and the collection Gap Cycle for the first and second paths, respectively.

We also analyzed 50 randomly chosen questions that are annotated as requiring multi-hop reasoning in the WikiHop development set and that our model answered correctly. In 78% of the questions, we found at least one meaningful path[8] in the top-3 extracted paths, which dropped to 62% for the top-1 path. On average, 66% of the top-3 paths returned by our model were meaningful. In contrast, only 46% of three randomly selected paths per question made sense, even when limited to the paths for the correct answers. That is, a random baseline, even with oracle knowledge of the correct answer, would only find a good path in 46% of the cases.

[8] A path is considered meaningful if it has valid relations that can be composed to conclude the predicted answer.

Table 5: Two top-scoring paths for sample WikiHop Dev questions. In the Rank-1 path for the first question, the model composes the implicit located in relations between (Zoo lake, Johannesburg) and (Johannesburg, Gauteng). of the cases. We also analyzed 50 questions that retrieving the sentences from this corpus using the our model gets wrong. The top-scoring paths here 2-hop retrieval. Computations on beaker.org were were of lower quality (only 16.7% were meaning- supported in part by credits from Google Cloud. ful). This provides qualitative evidence that our model’s performance is correlated with the qual- ity of the paths it identifies, and it does not simply References guess using auxiliary information such as entity 9 Jonathan Berant, Andrew Chou, Roy Frostig, and Percy types, number of paths, etc. Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of EMNLP. 7 Conclusion Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim We present a novel, path-based, multi-hop reading Sturge, and Jamie Taylor. 2008. Freebase: a collab- comprehension model that outperforms previous oratively created graph database for structuring hu- models on WikiHop and OpenBookQA. Impor- man knowledge. In Proceedings of ACM SIGMOD international conference on Management of data. tantly, we illustrate how our model can explain its reasoning via explicit paths extracted across mul- Antoine Bordes, Nicolas Usunier, Sumit Chopra, and tiple passages. While we focused on 2-hop rea- Jason Weston. 2015. Large-scale simple question soning required by our evaluation datasets, the ap- answering with memory networks. In NIPS. proach can be generalized to longer chains and to longer natural language questions. Nicola De Cao, Wilker Aziz, and Ivan Titov. 2018. Question answering by reasoning across docu- ments with graph convolutional networks. CoRR, Acknowledgment abs/1808.09920. We thank Johannes Welbl for helping us evaluat- Rajarshi Das, Arvind Neelakantan, David Belanger, ing our model on the WikiHop test set. We thank and Andrew McCallum. 2017. Chains of reasoning Rodney Kinney, Brandon Stilson and Tal Fried- over entities, relations, and text using recurrent neu- man for helping to produce the clean corpus used ral networks. In Proceedings of EACL. for OpenBookQA. We thank Dirk Groeneveld for Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Co- 9A model that returns the answer with the highest number hen, and Ruslan Salakhutdinov. 2018. Neural mod- of paths would score only 18.5% on the WikiHop develop- els for reasoning over multiple mentions using coref- ment set. erence. In Proceedings of NAACL. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, Cohen, and Ruslan Salakhutdinov. 2017. Gated- and Eduard H. Hovy. 2017. Race: Large-scale read- attention readers for text comprehension. In Pro- ing comprehension dataset from examinations. In ceedings of ACL. EMNLP.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mihai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid answer reranking. TACL, 3:197–210.

Matt Gardner and Tom M. Mitchell. 2015. Efficient and expressive knowledge base completion using subgraph feature extraction. In Proceedings of EMNLP.

Kelvin Guu, John Miller, and Percy Liang. 2015. Traversing knowledge graphs in vector space. In Proceedings of EMNLP.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In Proceedings of IJCAI.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Peter Clark. 2017. Framing QA as building and ranking intersentence answer justifications. Computational Linguistics, 43:407–449.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of ACL.

Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashutosh Sabharwal, and Dan Roth. 2019. On the capabilities and limitations of reasoning for natural language understanding. CoRR, abs/1901.02522.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018a. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of NAACL.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In Proceedings of IJCAI.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018b. Question answering as global reasoning over semantic abstractions. In Proceedings of AAAI.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of ACL.

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In EMNLP.

Ni Lao, Tom M. Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of EMNLP.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of EMNLP.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of KDD.

Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. CoRR, abs/1809.02040.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2018. Improving machine reading comprehension with general reading strategies. CoRR, abs/1810.13441.

Kristina Toutanova, Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. 2016. Compositional learning of embeddings for relation paths in knowledge base and text. In Proceedings of ACL.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL, 6:287–302.

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698.

Edwin B. Wilson. 1927. Probable inference, the law of succession, and statistical inference. JASA, 22(158):209–212.

Victor Zhong, Caiming Xiong, Nitish Shirish Keskar, and Richard Socher. 2019. Coarse-grain fine-grain coattention network for multi-evidence question answering. In Proceedings of ICLR.