arXiv:1911.03868v2 [cs.CL] 13 Apr 2020

Knowledge Guided Text Retrieval and Reading for Open Domain Question Answering

Sewon Min¹, Danqi Chen², Luke Zettlemoyer¹,³, Hannaneh Hajishirzi¹,⁴
¹University of Washington, Seattle, WA
²Princeton University, Princeton, NJ
³Facebook AI Research, Seattle, WA
⁴Allen Institute for AI, Seattle, WA
{sewon,lsz,hannaneh}@cs.washington.edu [email protected]

Abstract

We introduce an approach for open-domain question answering (QA) that retrieves and reads a passage graph, where vertices are passages of text and edges represent relationships that are derived from an external knowledge base or co-occurrence in the same article. Our goals are to boost coverage by using knowledge-guided retrieval to find more relevant passages than text-matching methods, and to improve accuracy by allowing for better knowledge-guided fusion of information across related passages. Our graph retrieval method expands a set of seed keyword-retrieved passages by traversing the graph structure of the knowledge base. Our reader extends a BERT-based architecture and updates passage representations by propagating information from related passages and their relations, instead of reading each passage in isolation. Experiments on three open-domain QA datasets, WEBQUESTIONS, NATURAL QUESTIONS and TRIVIAQA, show improved performance over non-graph baselines by 2-11% absolute. Our approach also matches or exceeds the state-of-the-art in every case, without using an expensive end-to-end training regime.

Figure 1: An example from NATURAL QUESTIONS. A graph of passages is constructed based on Wikipedia and Wikidata, where the edges represent relationships between passages. The baseline model which uses TF-IDF and reads each passage in isolation outputs the wrong answer (red) by selecting a person name from the passage about the song. Our model which leverages relations (e.g. part of) synthesizes the context over related passages and predicts the correct answer (blue) by choosing a singer of the album.

1 Introduction

Open-domain question answering systems aim to answer any question a user can pose, with evidence provided by either factual text such as Wikipedia (Chen et al., 2017; Yang et al., 2019) or knowledge bases (KBs) such as Freebase (Berant et al., 2013; Kwiatkowski et al., 2013; Yih et al., 2015). Textual evidence, in general, has better coverage but KBs more directly support making complex inferences. It remains an open question how to best make use of KBs without sacrificing coverage in text-based open-domain QA. Previous work has converted KB facts to sentences to provide extra evidence (Weissenborn et al., 2017; Mihaylov and Frank, 2018), but does not explicitly use the KB graph structure. In this paper, we show that such structure can be highly beneficial for both retrieving text passages and fusing information across them in open-domain text-based QA, for example as shown in Figure 1.

We introduce a general approach for text-based open-domain QA that is knowledge guided: it retrieves and reads a passage graph, where vertices are passages of text and edges represent relationships that are derived either from an external knowledge base or co-occurrence in the same article. Our goal is to combine the high coverage of textual corpora with the structural information in knowledge bases, to improve both the retrieval coverage and accuracy of the resulting model. Unlike standard approaches that retrieve and read a set of passages (Chen et al., 2017), our approach integrates graph structure at every stage to construct, retrieve and read a graph of passages.

Our approach first retrieves a passage graph by expanding a set of seed passages based on the graph structure of the knowledge base and the co-occurrence in the text corpus (Figure 1). We then introduce a reader model that extends BERT (Devlin et al., 2019) and propagates information from related passages and their relations, enabling knowledge-rich cross-passage representations. Together, this approach allows for better coverage (e.g. the graph contains many passages that text-match retrieval would miss) and accuracy (e.g. by better combining information across related passages to find the best answer).

Experiments demonstrate significant improvements on three popular open-domain QA datasets: WEBQUESTIONS (Berant et al., 2013), NATURAL QUESTIONS (Kwiatkowski et al., 2019) and TRIVIAQA (Joshi et al., 2017). Our graph-based retrieval and reader models, together, improve accuracy consistently and significantly, outperforming the non-graph baselines by 2–11% and matching or exceeding the state-of-the-art in every case without an expensive end-to-end training regime. Through extensive ablations, we show that both graph-based retrieval and reader models substantially contribute to the performance improvements, even when we fix the other component.

2 Related Work

Text-based Question Answering. Text-based open-domain QA is a long-standing problem (Voorhees et al., 1999; Ferrucci et al., 2010). Recent work has focused on two-stage approaches that combine information retrieval with neural reading comprehension (Chen et al., 2017; Wang et al., 2018; Das et al., 2019; Yang et al., 2019). We follow this tradition but introduce a new framework which retrieves and reads a graph of passages. Other graph retrieval methods have been developed, either using entity name matching (Ding et al., 2019; Xiong et al., 2019b; Godbole et al., 2019) or hyperlinks (Asai et al., 2020). However, we are not aware of work integrating external knowledge bases or tightly coupling the approach with a graph reader, as we do in this paper. Moreover, most previous graph-based approaches evaluate on questions that are explicitly written to encourage reasoning based on a chain of entities, such as WIKIHOP (Welbl et al., 2017) or HOTPOTQA (Yang et al., 2018). In this work, we instead focus on naturally gathered questions which require much more diverse types of cross-paragraph reasoning.

After retrieving evidence passages, most pipeline systems use a reading comprehension model to extract the answer. Previous work either concatenates retrieved passages into a single sequence (Swayamdipta et al., 2018; Yang et al., 2018; Song et al., 2018) with no explicit model of how they are related, or reads each passage in parallel (Clark and Gardner, 2018; Alberti et al., 2019; Min et al., 2019b; Wang et al., 2019) with no ability to fuse the information they contain. To the best of our knowledge, reading passages by incorporating structural information across passages has not been studied previously. The most related models are Song et al. (2018) and Cao et al. (2019), which fuse information through entities detected by entity linking and coreference resolution on WIKIHOP. In contrast, our model fuses information across passages, to better model the overall relationships between the different blocks of text.

Other lines of research in open-domain QA include joint learning of retrieval and reader components (Lee et al., 2019) or direct phrase retrieval in a large collection of documents (Seo et al., 2019). Although end-to-end training can further improve the performance of our approach, this paper only focuses on pipeline approaches since end-to-end training is computationally and memory expensive.

Knowledge Base Question Answering. Question answering over knowledge bases has also been well studied (Berant et al., 2013; Kwiatkowski et al., 2013; Yih et al., 2015), typically without using any external text collections. However, recent work has augmented knowledge bases with text from Wikipedia (Das et al., 2017; Sun et al., 2018, 2019; Xiong et al., 2019a), to increase factual coverage when a given knowledge base is incomplete. In this paper, we study what can be loosely seen as an inverse problem. The model answers questions based on a large set of documents, and the knowledge base is used to better model relationships between different passages of text.

3 Approach

We present a new general approach for text-based open-domain question answering, which consists of a retrieval model GRAPHRETRIEVER and a neural reader model GRAPHREADER. The overall approach is illustrated in Figure 2. GRAPHRETRIEVER retrieves a graph of passages in which vertices are passages and edges denote relationships between passages (Section 3.1). GRAPHREADER reads the input passage graph and returns the answer (Section 3.2).

Figure 2: A diagram of our approach, consisting of GRAPHRETRIEVER (left) and GRAPHREADER (right). First, GRAPHRETRIEVER constructs a graph of passages by obtaining seed passages through either entity linking or TF-IDF, and expanding the graph based on Wikidata and Wikipedia (Section 3.1). GRAPHREADER then takes this graph as an input, obtains initial passage representations, and updates them with respect to the graph, using M fusion layers (Section 3.2).

Setup. The goal is to answer the question based on a text corpus $C$, which consists of a large collection of articles, each of which can be divided into multiple passages. We also assume an external knowledge base $K = \{(e_1, r, e_2)\}$ exists where $e_1, e_2$ are entities and $r$ is a relation, and there is a 1-1 mapping between the KB entities and articles in the text corpus. Specifically, we use Wikipedia as the text corpus $C$ and Wikidata (Vrandečić and Krötzsch, 2014) as the knowledge base $K$, as there exists an alignment between the two resources and Wikipedia has been widely used before in open-domain question answering research (Chen et al., 2017; Seo et al., 2019).

3.1 GRAPHRETRIEVER

Our retrieval approach takes a question as input and uses the knowledge base $K$ to construct a graph of passages in $C$. It obtains seed passages and expands the passage graph through $M_{\mathrm{ret}}$ iterations, until it reaches the maximum number of passages $n$. We denote $P^{(m)}$ as the passages obtained in the $m$-th iteration, and describe how to obtain $P^{(0)}$ (seed passages) and update the graph in the $m$-th iteration ($1 \le m \le M_{\mathrm{ret}}$).

Seed passages. GRAPHRETRIEVER starts with a set of Wikipedia articles by taking the union of (1) articles corresponding to the entities which are identified by an entity linking system (Ferrucci et al., 2010) on the input question; (2) the top $K_{\mathrm{TFIDF}}$ articles returned by a TF-IDF based retrieval system. We choose the first passage of these articles as seed passages $P^{(0)}$.

Graph expansion. Starting from seed passages $P^{(0)}$, GRAPHRETRIEVER expands the passage graph from $P^{(m-1)}$ to $P^{(m)}$ by iterating over the following two methods, until it includes $n$ passages.

First, the passage graph is updated by adding passages that are related to $P^{(m-1)}$ according to a relation present in Wikidata. Specifically, if $p_i \in P^{(m-1)}$ and $p_j$ are the first passages of Wikipedia articles that correspond to KB entities $e_{p_i}$ and $e_{p_j}$ such that $(e_{p_i}, r_{i,j}, e_{p_j}) \in K$, $p_j$ is added to the passage graph, being connected to $p_i$ through $r_{i,j}$.¹

¹ Although this may include some entities that are not closely related to the question, it still increases the coverage of entities related to the answer, as shown in Section 4.4.

Second, supporting passages for $P^{(m-1)}$ are added to the passage graph. Specifically, non-first passages of Wikipedia articles that are associated with $P^{(m-1)}$ are ranked by BM25 (Robertson et al., 2009) and the top $K_{\mathrm{BM25}}$ passages are chosen. We construct relations between passages if they belong to the same Wikipedia article: $r_{i,j}$ is child and $r_{j,i}$ is parent if $p_i$ is the first passage of the article and $p_j$ is another passage from the same article.

Final graph. Finally, we retrieve a passage graph consisting of $n$ passages: $\{p_1, \ldots, p_n\}$. The relations between the passages are denoted by $\{r_{i,j} \mid 1 \le i, j \le n\}$, where $r_{i,j}$ is either a KB relation, child, parent or no relation, indicating the relationship between a passage pair $(p_i, p_j)$.

3.2 GRAPHREADER

Our GRAPHREADER takes a question $q$ and $n$ retrieved passages $p_1, p_2, \ldots, p_n$ (and their relations $r_{i,j}$) and aims to output an answer to the question as a text span in one of the retrieved passages. Instead of processing each passage independently, our approach obtains knowledge-rich representations of passages by fusing information from linked passages across the graph structure.

3.2.1 Initial Passage Representation

Formally, given the question $q$ and a passage $p_i$, GRAPHREADER first obtains a question-aware passage representation:

$$P_i = \mathrm{TextEncode}(q, p_i) \in \mathbb{R}^{L \times h},$$

where $L$ is the maximum length of each passage, and $h$ is the hidden dimension. We use BERT (Devlin et al., 2019), although the approach could be applied with many other encoders.

Additionally, GRAPHREADER encodes a relation $r_{i,j}$ through a relation encoder:

$$\mathbf{r}_{i,j} = \mathrm{RelationEncode}(r_{i,j}) \in \mathbb{R}^{h}.$$

We consider the 98 most frequent relations and group the other relations as unk relation, for a total of 100 including no relation. We directly learn an embedding matrix to get a vector representation for each relation, which works well in practice since we have relatively few relations and many examples of each.

3.2.2 Fusing Passage Representations

GRAPHREADER builds $M$ graph-aware fusion layers to update passage representations by propagating information through the edges of the graph, as depicted in the right side of Figure 2. Specifically, GRAPHREADER initializes the passage representation with $z_i^{(0)} = \mathrm{MaxPool}(P_i) \in \mathbb{R}^{h}$. It then obtains a new passage representation $z_i^{(m)}$ for each fusion layer $1 \le m \le M$, based on its previous representation, all the adjacent passages and their relations. Largely inspired by Graph Convolution Networks (GCN) (Kipf and Welling, 2017; Marcheggiani and Titov, 2017), we investigate two methods to obtain $z_i^{(m)}$ from $\{z_j^{(m-1)}\}_{j=1}^{n}$ and $\{r_{i,j}\}_{j=1}^{n}$ as follows.

Binary. We first consider binary relations, which encode whether a passage pair is related or not, without incorporating relations. Specifically,

$$G_i = \{ j \mid r_{i,j} \neq \text{no relation} \},$$

$$z_i^{(m+1)} = \frac{1}{|G_i|} \sum_{j \in G_i} W_f \, [z_i^{(m)} \oplus z_j^{(m)}] + b_f,$$

where $W_f$ and $b_f$ are learnable parameters, and $\oplus$ is a concatenation.

Relation-aware. We then consider a relation-aware composition function:

$$z_i^{(m+1)} = \frac{1}{n} \sum_{j=1}^{n} W_f \, [z_i^{(m)} \oplus f(\mathbf{r}_{i,j}, z_j^{(m)})] + b_f,$$

where $f$ is a composition function, $\oplus$ is a concatenation, and $W_f$ and $b_f$ are learnable parameters. We use concatenation for the composition function, $f(\mathbf{r}_{i,j}, z_j^{(m)}) = \mathrm{CONCAT}(\mathbf{r}_{i,j}, z_j^{(m)})$, for simplicity because it worked as well as more complex functions such as element-wise multiplication and bilinear mappings in our early experiments.

3.2.3 Answering Questions

GRAPHREADER uses the updated passage representations $z_1^{(M)}, \ldots, z_n^{(M)}$ to compute the probability of $p_i$ being an evidence passage. Denoting $Z = [z_1^{(M)}, \ldots, z_n^{(M)}] \in \mathbb{R}^{h \times n}$, we define

$$P_{\mathrm{sel}}(i) = \mathrm{softmax}(Z^{\top} w_{\mathrm{sel}})_i,$$

where $w_{\mathrm{sel}} \in \mathbb{R}^{h}$ is a learnable vector. Once the evidence passage is chosen by $i^* = \mathrm{argmax}_{1 \le i \le n} P_{\mathrm{sel}}(i)$, the probability of a span $1 \le j \le k \le L$ in the passage $p_{i^*}$ being an answer is computed as $P_{\mathrm{start},i^*}(j) \times P_{\mathrm{end},i^*}(k)$, where

$$P_{\mathrm{start},i^*}(j) = \mathrm{softmax}(P_{i^*} w_{\mathrm{start}})_j, \qquad P_{\mathrm{end},i^*}(k) = \mathrm{softmax}(P_{i^*} w_{\mathrm{end}})_k,$$

where $w_{\mathrm{start}}, w_{\mathrm{end}} \in \mathbb{R}^{h}$ are learnable vectors.

For training, we use the maximum marginal likelihood objective by maximizing:

$$\sum_{i=1}^{n} \mathbb{I}[|S_i| > 0] \log P_{\mathrm{sel}}(i) + \sum_{i=1}^{n} \log \Big( \sum_{(s_i, e_i) \in S_i} P_{\mathrm{start},i}(s_i) \times P_{\mathrm{end},i}(e_i) \Big),$$

where $S_i$ is a set of spans which correspond to the answer text in $p_i$. We tried using $Z$ for span predictions, but did not see meaningful improvements. We hypothesize it is because span prediction given the correct passage is an easier task compared to choosing the right evidence passage.²

² We observed that over 80% of the error cases the baseline model made are due to incorrect passage selection on all datasets.

4 Experiments

4.1 Datasets

We evaluate our model on three open-domain question answering datasets, where the evaluation metric is Exact Match. (1) WEBQUESTIONS (Berant et al., 2013) is originally a QA dataset designed to answer questions based on Freebase; the questions were collected through the Google Suggest API. We follow Chen et al. (2017) and frame the problem as a span selection task over Wikipedia. (2) NATURAL QUESTIONS (Kwiatkowski et al., 2019) consists of questions collected using the Google search engine; questions with short answers up to 5 tokens are taken following Lee et al. (2019). (3) TRIVIAQA (Joshi et al., 2017) consists of questions from trivia and quiz-league websites. For all datasets, we only use question and answer pairs for training and testing, and discard the provided evidence documents which are part of reading comprehension tasks. We follow the data splits from Chen et al. (2017) for WEBQUESTIONS and Min et al. (2019a) on NATURAL QUESTIONS and TRIVIAQA.³ Table 1 shows the statistics of the datasets and the density of the graph (number of relations per passage) retrieved by GRAPHRETRIEVER.

³ https://bit.ly/2q8mshc and https://bit.ly/2HK1Fqn.

             Dataset Statistics        Graph density
Dataset      Train    Dev     Test     Cross   Inner   Total
WEBQ          3417     361     2032    1.53    0.88    2.41
NATURALQ     79168    8757     3610    1.26    0.90    2.16
TRIVIAQA     78785    8837    11313    1.28    0.88    2.16

Table 1: Dataset statistics and density of the graph (number of relations per passage). Cross, Inner and Total denote cross-document relations (KB relations), inner-document relations (child, parent) and their sum.

4.2 Baselines

For retrieval, we compare our GRAPHRETRIEVER to a pure text-match based retrieval method which retrieves Wikipedia articles based on TF-IDF scores (Chen et al., 2017) and ranks their passages through BM25 (Robertson et al., 2009). This is to investigate if leveraging the knowledge base actually improves the retrieval component.

For the reader, we compare our GRAPHREADER with two competitive baselines which read each passage in parallel, PARREADER and PARREADER++. Both baselines obtain question-aware passage representations $P_1, \ldots, P_n$ as described in Section 3.2, with a different way of calculating $P_{\mathrm{sel}}(i)$. PARREADER computes $P_{\mathrm{sel}}(i)$ using a binary classifier:

$$P_{\mathrm{sel}}(i) = \mathrm{softmax}(W_{\mathrm{sel}} \, \mathrm{maxpool}(P_i))_1,$$

where $W_{\mathrm{sel}} \in \mathbb{R}^{2 \times h}$ is a learnable matrix (Alberti et al., 2019; Min et al., 2019b). PARREADER++ is similar to PARREADER but takes a softmax across passages, inspired by Clark and Gardner (2018)⁴:

$$P_{\mathrm{sel}}(i) = \mathrm{softmax}(\hat{P}^{\top} w_{\mathrm{sel}})_i,$$

where $\hat{P} = [\mathrm{MaxPool}(P_1), \ldots, \mathrm{MaxPool}(P_n)] \in \mathbb{R}^{h \times n}$ and $w_{\mathrm{sel}} \in \mathbb{R}^{h}$ is learned. Note that PARREADER++ is exactly the same as GRAPHREADER with no fusion layer. Finally, for both baselines, the probability of the span is computed in the same way as described in Section 3.2.

⁴ Our preliminary result indicates this variant slightly outperforms the original model, S-Norm.

4.3 Implementation Details

We use the Wikipedia dump from 2018-12-20 and the Wikidata dump from 2019-06-01.⁵ We use TAGME (Ferragina and Scaiella, 2011)⁶ as an entity linking system. We split the article into passages with natural breaks and merge consecutive ones with up to a maximum length of 300 tokens.

⁵ archive.org/download/enwiki-20181220 and dumps.wikimedia.org/Wikidatawiki/20190601
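The relation-aware fusion update (Section 3.2.2) and the passage selection softmax (Section 3.2.3) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: random toy tensors stand in for the BERT outputs $P_i$, the helper names (`max_pool`, `fusion_layer`, `select_passage`), the toy sizes and the relation id 0 for "no relation" are hypothetical, and reusing one `W_f` across layers is a simplification the paper does not specify.

```python
import math
import random

def max_pool(token_matrix):
    """Column-wise max over an L x h token matrix, giving z_i^(0) in R^h."""
    return [max(col) for col in zip(*token_matrix)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, b, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def fusion_layer(z, rel_ids, rel_emb, W_f, b_f):
    """One relation-aware layer: z_i' = (1/n) sum_j W_f [z_i ; r_ij ; z_j] + b_f.

    Adding b_f inside the average is equivalent to adding it once outside.
    """
    n, h = len(z), len(z[0])
    new_z = []
    for i in range(n):
        acc = [0.0] * h
        for j in range(n):
            # f(r_ij, z_j) = CONCAT(r_ij, z_j), then concatenated with z_i.
            x = z[i] + rel_emb[rel_ids[i][j]] + z[j]
            acc = [a + o for a, o in zip(acc, affine(W_f, b_f, x))]
        new_z.append([a / n for a in acc])
    return new_z

def select_passage(z, w_sel):
    """P_sel = softmax(Z^T w_sel): one score per passage, normalized across passages."""
    return softmax([sum(a * b for a, b in zip(zi, w_sel)) for zi in z])

# Toy run with hypothetical sizes (the paper uses 100 relation types and BERT features).
random.seed(0)
n, L, h, num_rel = 3, 4, 2, 3
P = [[[random.uniform(-1, 1) for _ in range(h)] for _ in range(L)] for _ in range(n)]
rel_ids = [[0, 1, 0], [2, 0, 0], [0, 0, 0]]  # assumed: id 0 means "no relation"
rel_emb = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(num_rel)]
W_f = [[random.uniform(-1, 1) for _ in range(3 * h)] for _ in range(h)]
b_f = [0.0] * h
w_sel = [random.uniform(-1, 1) for _ in range(h)]

z = [max_pool(P_i) for P_i in P]
for _ in range(2):  # M = 2 fusion layers
    z = fusion_layer(z, rel_ids, rel_emb, W_f, b_f)
p_sel = select_passage(z, w_sel)
best = max(range(n), key=lambda i: p_sel[i])
```

Skipping the fusion loop reduces `select_passage` to the PARREADER++ baseline of Section 4.2, which is how the paper describes the relationship between the two readers.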
⁶ github.com/gammaliu/tagme

                                            WEBQUESTIONS     NATURAL QUESTIONS   TRIVIAQA
Retriever         Reader                    Dev     Test     Dev     Test        Dev     Test
Text-match        PARREADER                 23.6    25.2     26.1    25.8        52.1    52.1
Text-match        PARREADER++               19.9    20.8     28.9    28.7        54.5    54.0
GRAPHRETRIEVER    PARREADER                 33.2    33.0     30.2    29.3        54.8    54.7
GRAPHRETRIEVER    PARREADER++               33.7    31.8     33.1    33.5        55.5    55.0
GRAPHRETRIEVER    GRAPHREADER (binary)      34.0    36.4     34.2    34.1        55.2    54.2
GRAPHRETRIEVER    GRAPHREADER (relation)    34.0    36.0     34.7    34.5        55.8    56.0
Previous best (pipeline)                    -       18.5a    31.7b   32.6b       50.7c   50.9c
Previous best (end-to-end)                  38.5d   36.4d    31.3d   33.3d       45.1d   45.0d

Table 2: Overall results on the development and the test set of three datasets. We also report the previous best results, both with pipeline and end-to-end: aLin et al.(2018), bAsai et al.(2020), cMin et al.(2019a), dLee et al. (2019). Note that the development sets used in Lin et al.(2018) and Lee et al.(2019) are slightly different but the test sets are the same; Asai et al.(2020) uses a better pretrained model (a whole word masking BERT LARGE).
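As a quick arithmetic check on the gains quoted in the text, the test-set numbers of Table 2 can be compared directly. The sketch below takes one possible reading of the "2–11%" claim (best GRAPHREADER variant against the best text-match baseline, per dataset) and of the "1–11%" retrieval gain with PARREADER++; the dictionary layout and helper name are illustrative assumptions.

```python
# Test-set Exact Match from Table 2, ordered (WebQuestions, Natural Questions, TriviaQA).
test_em = {
    ("Text-match", "PARREADER"): (25.2, 25.8, 52.1),
    ("Text-match", "PARREADER++"): (20.8, 28.7, 54.0),
    ("GRAPHRETRIEVER", "PARREADER"): (33.0, 29.3, 54.7),
    ("GRAPHRETRIEVER", "PARREADER++"): (31.8, 33.5, 55.0),
    ("GRAPHRETRIEVER", "GRAPHREADER (binary)"): (36.4, 34.1, 54.2),
    ("GRAPHRETRIEVER", "GRAPHREADER (relation)"): (36.0, 34.5, 56.0),
}
datasets = ("WebQuestions", "Natural Questions", "TriviaQA")

def best_per_dataset(rows):
    """Best score in each dataset column across the given rows."""
    return tuple(max(col) for col in zip(*rows))

# Best non-graph system vs. best full graph system (retriever + reader).
non_graph = best_per_dataset(
    v for (ret, _), v in test_em.items() if ret == "Text-match")
graph = best_per_dataset(
    v for (_, rd), v in test_em.items() if rd.startswith("GRAPHREADER"))
overall_gain = {d: round(g - b, 1) for d, g, b in zip(datasets, graph, non_graph)}

# Retrieval-only gain with the reader fixed to PARREADER++.
retriever_gain = {
    d: round(g - b, 1)
    for d, g, b in zip(datasets,
                       test_em[("GRAPHRETRIEVER", "PARREADER++")],
                       test_em[("Text-match", "PARREADER++")])
}

print(overall_gain)    # {'WebQuestions': 11.2, 'Natural Questions': 5.8, 'TriviaQA': 2.0}
print(retriever_gain)  # {'WebQuestions': 11.0, 'Natural Questions': 4.8, 'TriviaQA': 1.0}
```

Under this reading the overall gains span roughly 2 to 11 points and the retrieval-only gains roughly 1 to 11 points, consistent with the ranges stated in the paper.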

We use $K_{\mathrm{TFIDF}} = 5$, $K_{\mathrm{BM25}} = 40$ for WEBQUESTIONS and $K_{\mathrm{TFIDF}} = 10$, $K_{\mathrm{BM25}} = 80$ for the other two, which empirically sets $M_{\mathrm{ret}}$ to 2 and 1, respectively.

We use the uncased version of BERT_BASE (Devlin et al., 2019) for question-aware passage representations TextEncode. When training GRAPHREADER, we cannot feed the full passage graph in the same batch due to memory constraints. Therefore, for every parameter update, we sample at most 20 passages where one of them contains the answer text, either uniformly at random or by taking a subgraph. For each model, we experiment with fusion layers of M = {1, 2, 3} and two sampling methods, and choose the number that gives the best result on the development set. More details can be found in Appendix A.

4.4 Main Results

Model Comparisons. The main results are given in Table 2. We observed three overall trends: (1) GRAPHRETRIEVER offers significant performance gains over text-match retrieval when we compare within the same reader across all datasets, e.g., 1–11% absolute gains with PARREADER++. This indicates that graph-based retrieval provides passages with significantly better evidence to answer the question. (2) GRAPHREADER outperforms the two PARREADER baselines consistently across all datasets, achieving 1–5% absolute gains. This result demonstrates that fusing information across passages is more effective than reading each passage in isolation. (3) GRAPHREADER using relations offers some improvement over GRAPHREADER with binary relations. The gains are smaller than expected, likely because the relations are inferred from the text.⁷

⁷ In order to verify this hypothesis, we modify our reader to have an output layer for relation classification, and observe that the accuracy is over 80% for all datasets.

State-of-the-art results. We also compare our results to the previous best models, both pipeline and end-to-end approaches for open-domain QA, in Table 2. Our best-performing model outperforms previously published pipeline models by 6–18%, showcasing the benefit of our graph retrieval and reader models. In particular, our models with GRAPHRETRIEVER (both baseline and GRAPHREADER) outperform the previous best graph-based retrieval model (Asai et al., 2020)⁸ by a large margin, despite the fact that they used a stronger BERT model than ours as the base model. Our model also outperforms or matches the end-to-end model (Lee et al., 2019), which is expensive to train as it uses an extra pretraining strategy. Although not explored in this paper, our framework can be trained end-to-end as well, which has great potential to further advance the state-of-the-art.

⁸ To the best of our knowledge, Asai et al. (2020) (1) is the only graph-based approach evaluated on naturally found questions and (2) also outperforms other graph-based approaches on HOTPOTQA.

5 Analyses

To better understand model performance, we report a number of ablation studies (Section 5.1) and a qualitative analysis (Section 5.2).

5.1 Ablation Studies

Effect of different retrieval methods. Table 3 compares text-match retrieval and GRAPHRETRIEVER with 'Text-match + Wikidata', a variant of GRAPHRETRIEVER where we take the union of text-match retrieval and Wikidata-based retrieval, each computed in isolation. Specifically, the text-match retrieval is the baseline described in Section 4.2, and Wikidata-based retrieval is done by obtaining seed passages through entity linking and updating the passage graph only through Wikidata. This variant can be seen as late combination between the text-match and Wikidata-based retrieval, whereas GRAPHRETRIEVER provides early combination. Although 'Text-match + Wikidata' outperforms text-match retrieval by a large margin, our GRAPHRETRIEVER significantly outperforms this method, showing the importance of jointly leveraging text-match and Wikidata for graph construction.

Retriever                  WebQ    Natural Q
Reader: PARREADER++
Text-match                 19.9    28.9
Text-match + Wikidata      29.4    30.5
GRAPHRETRIEVER             33.7    33.1
Reader: GRAPHREADER (binary)
Text-match + Wikidata      30.8    32.6
GRAPHRETRIEVER             34.0    34.2

Table 3: Effect of different retrieval methods. We compare text-match retrieval and GRAPHRETRIEVER with 'Text-match + Wikidata' (described in Section 5.1).

Effect of different relation types in GRAPHREADER. Table 4 compares the effect of using different relation types in the constructed graph of passages for GRAPHREADER, showing results for the following settings: (a) fully connected, which connects all pairs of passages, (b) empty, which does not include any edges between passages, (c) cross-doc, which only includes edges between passages according to the Wikidata relations, (d) inner-doc, which only includes child and parent, and (e) cross+inner, which includes both cross-doc and inner-doc, corresponding to the graph constructed by our approach. Results indicate that cross-doc and inner-doc relations achieve good performance across two datasets. In particular, using relation information is better than ignoring relation information (fully connected, empty), demonstrating the importance of selecting a good set of graph edges.

                   WEBQUESTIONS   Natural Q
Fully connected    33.7           33.6
Empty              33.7           33.6
Cross-doc          34.2           33.7
Inner-doc          33.4           33.7
Cross+Inner        34.0           34.2

Table 4: Effects of different relation types in GRAPHREADER. We compare input graphs containing different sets of edges, where we use GRAPHRETRIEVER and GRAPHREADER (binary).

Comparison to passage concatenation. Table 5 compares the performance of our graph-based method with two baseline readers where a concatenation of passage pairs is included as input, and PARREADER++ reads each of them in isolation. First, PARREADER++ (pairs from graph) concatenates passage pairs that are related in the input graph, along with the relation text. Second, PARREADER++ (all pairs) concatenates all passage pairs. For these baselines, the concatenated passages are up to 300 tokens.⁹ For PARREADER++ (pairs from graph), we use n = 20 instead of n = {40, 80}.¹⁰ It is worth noting that concatenating more passages into a single input is non-trivial due to the fixed input length of BERT. Details are provided in Appendix A. Results show that concatenating passages is not competitive, potentially because truncating each passage causes significant information loss.

⁹ We split each passage up to 145 tokens and set the relation text to be up to 10 tokens.
¹⁰ This restriction is needed because there are too many passage pairs: even n = 20 gives 20 + 190 = 210 passages.

Reader                            WebQ    Natural Q
PARREADER++                       33.7    33.1
PARREADER++ (pairs from graph)    31.3    29.8
PARREADER++ (all pairs)           25.7    20.5
GRAPHREADER (binary)              34.0    34.2
GRAPHREADER (relation)            34.0    34.7

Table 5: Comparison to passage concatenation. For all rows, GRAPHRETRIEVER is used.

5.2 Qualitative Results

Figure 3 shows a few examples from NATURAL QUESTIONS and WEBQUESTIONS. Appendix B lists additional examples. They include cases where our method incorporates knowledge-rich relationships between passages to find the correct evidence and answer the question.

Figure 3: Examples from NATURAL QUESTIONS and WEBQUESTIONS where predictions from PARREADER++ and GRAPHREADER (both with GRAPHRETRIEVER) are denoted by red and blue text, respectively. Subsets of the retrieved graphs are reported. Detailed analyses in Section 5.2.
Example 1. Q: Who is the current director of the US Mint? (A: David J. Ryder)
Example 2. Q: What county is St. Louis Park in? (A: Hennepin County)
Example 3. Q: When did Toyota first come to the United States? (A: 1957)
Example 4. Q: Who plays the judge in drop dead diva? (A: Lex Medlin)

Knowledge helps retrieving evidence passages. In Example 1, text-match retrieval does not retrieve the article 'Director of the United States Mint' and fails to retrieve any passage about the director of the US Mint. However, GRAPHRETRIEVER retrieves the correct evidence passage by using the relationship between 'United States Mint' and 'Director of the United States Mint', enabling GRAPHREADER to successfully predict the answer. Similarly, in Example 3, Wikidata enables GRAPHRETRIEVER to retrieve 'Toyota Motor Sales, USA', which contains the evidence for the question, whereas text-match retrieval fails to do so.¹¹

¹¹ For both Example 1 and 3, initial retrieved articles include some passages containing the evidence, but BM25 passage ranking misses them.

Relation information explicitly supports the answer. Although Example 2 appears to be easy for humans since the passage from 'Saint Louis Park' alone provides enough evidence, PARREADER++ with no relation information makes a wrong prediction, 'St. Louis County', potentially because of the similarity in names. However, the Wikidata relation in Example 2, is located in, explicitly supports the evidence to answer the question; therefore, GRAPHREADER, which leverages graph information, easily predicts the right answer.

Relation information enables the model to synthesize across related passages. In Example 3, PARREADER++ makes a wrong prediction from the passage 'Toyota', potentially because this passage seems more related to the company. GRAPHREADER, however, leverages the relationship between 'Toyota' and 'Toyota Motor Sales, USA', and predicts the correct answer. Similarly in Example 4, PARREADER++ predicts "Ben Feldman" as an answer, potentially due to the phrase "judged by". However, leveraging the relations has part and part of the series in the graph, GRAPHREADER infers that the two passages belong to the same series and that 'Drop Dead Diva (Season 3)' mentions the judge more explicitly.

6 Conclusion

We proposed a general approach for text-based open-domain question answering that integrates graph structure at every stage to construct, retrieve and read a graph of passages. Our retrieval method leverages both a text corpus and a knowledge base to find a relevant set of passages and their relations. Our reader then propagates information according to the input graph, enabling knowledge-rich cross-passage representations. Our approach consistently outperforms competitive baselines on three open-domain QA datasets, WEBQUESTIONS, NATURAL QUESTIONS and TRIVIAQA. We also included a detailed qualitative analysis to illustrate which components contribute the most to the overall system performance.

References

Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT baseline for the Natural Questions. arXiv preprint arXiv:1901.08634.
Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In ICLR.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.
Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In NAACL.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In ACL.
Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In ACL.
Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. In ICLR.
Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017. Question answering on knowledge bases and text using universal schema and memory networks. In ACL.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In ACL.
Paolo Ferragina and Ugo Scaiella. 2011. Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software.
Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. TACL.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In ACL.
Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In ACL.
Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In EMNLP.
Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL.
Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. A discrete hard EM approach for weakly supervised question answering. In EMNLP.
Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019b. Compositional questions do not necessitate multi-hop reasoning. In ACL.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory
Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam David Ferrucci, Eric Brown, Jennifer Chu-Carroll, Lerer. 2017. Automatic differentiation in PyTorch. James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Stephen Robertson, Hugo Zaragoza, et al. 2009. The Prager, et al. 2010. Building Watson: An overview probabilistic relevance framework: BM25 and be- of the DeepQA project. AI magazine, 31(3):59–79. yond. Foundations and Trends R in Information Re- Ameya Godbole, Dilip Kavarthapu, Rajarshi Das, trieval. Zhiyu Gong, Abhishek Singhal, Hamed Zamani, Mo Yu, Tian Gao, Xiaoxiao Guo, Manzil Zaheer, Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, et al. 2019. Multi-step entity-centric information re- Ankur P Parikh, Ali Farhadi, and Hannaneh Ha- trieval for multi-hop question answering. In Work- jishirzi. 2019. Real-time open-domain question an- shop on Machine Reading for Question Answering swering with dense-sparse phrase index. In ACL. EMNLP. Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Radu Florian, and Daniel Gildea. 2018. Exploring Zettlemoyer. 2017. TriviaQA: A large scale dis- graph-structured passage representation for multi- tantly supervised challenge dataset for reading com- hop reading comprehension with graph neural net- prehension. In ACL. works. arXiv preprint arXiv:1809.02040. Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. Dropout: a simple way to prevent neural networks End-to-end open-domain question answering with from overfitting. Journal of machine learning re- BERTserini. In NAACL (Demonstrations). search. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben- Haitian Sun, Tania Bedrax-Weiss, and William W Co- gio, William W Cohen, Ruslan Salakhutdinov, and hen. 2019. 
Pullnet: Open domain question answer- Christopher D Manning. 2018. HotpotQA: A ing with iterative retrieval on knowledge bases and dataset for diverse, explainable multi-hop question text. In EMNLP. answering. In EMNLP. Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Scott Wen-tau Yih, Ming-Wei Chang, Xiaodong He, Mazaitis, Ruslan Salakhutdinov, and William W Co- and Jianfeng Gao. 2015. Semantic parsing via hen. 2018. Open domain question answering us- staged query graph generation: Question answering ing early fusion of knowledge bases and text. In with knowledge base. In ACL. EMNLP. Swabha Swayamdipta, Ankur P Parikh, and Tom Kwiatkowski. 2018. Multi-mention learning for reading comprehension with neural cascades. In ICLR. Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In Trec. Denny Vrandeciˇ c´ and Markus Krotzsch.¨ 2014. Wiki- data: a free collaborative knowledge base. Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced ranker-reader for open-domain question answering. In AAAI. Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallap- ati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In ACL. Dirk Weissenborn, Toma´sˇ Kociskˇ y,` and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596. Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2017. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. In TACL. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow- icz, and Jamie Brew. 2019. HuggingFace’s Trans- formers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019a. 
Improving question answering over incomplete kbs with knowledge- aware reader. In ACL. Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Hong Wang, Shiyu Chang, Murray Campbell, and William Yang Wang. 2019b. Simple yet effective bridge reasoning for open-domain multi-hop question answering. In 2nd Workshop on Machine Reading for Question An- swering. A Training details Dataset GRAPHREADER M sample binary 1 U A.1 Details & Hyperparameters relation, concat 1 U WEBQUESTIONS relation, elm-wise 1 G All experiments are done in Python 3.5 and Py- relation, bilinear 1 U Torch 1.1.0 (Paszke et al., 2017). For BERT, binary 1 G we use the uncased version of BERTBASE and relation, concat 3 G PYTORCH-TRANSFORMERS (Wolf et al., 2019)12. NATURAL Q. relation, elm-wise 1 U relation, bilinear 2 G Specifically, given a question Q and a passage Pi where the title of the originated article is Ti, we binary 1 G S = Q :[ ]: < > : T : relation, concat 1 G form a sequence i SEP t i TRIVIAQA relation, elm-wise 1 U < /t > : Pi, where : indicates a concatenation relation, bilinear 2 U and [SEP] is a special token. This sequence is then fed into BERT and the hidden representation of Table 6: Hyperparameters used for experiments. M de- the sequence from the last layer is chosen as a notes the number of fusion layers, and ‘sample’ denotes a sampling method for training, where ‘U’ and ‘G’ in- question-aware passage representation. For the em- dicate an uniform sampling and a subgraph sampling, bedding matrix for the relation encoder, we keep respectively. 100 relations (no relation, UNK and top 98 relations), which cover over 95% of all relations on pi and pj are connected through ri,j, all of pi, pj, all datasets. pi :[SEP]: ri,j :[SEP]: pj are included as input For PARREADER, we use a batch size of 10 passages of PARREADER++. We limit the length on WEBQUESTIONS and 60 on NATURAL QUES- of the passages and the relation text to be 145 and TIONS and TRIVIAQA. 
For PARREADER++ and 10, respectively, so that the total length to be up GRAPHREADER, we use a batch size of 8 on WE- to 300. The number of the final input passages to BQUESTIONS and 16 on the rest two. For each the model will be n + |{ri,j|ri,j 6= none}|, where fusion layer, we apply dropout (Srivastava et al., n(n−1) 0 ≤ |{ri,j|ri,j 6= none}| ≤ 2 (but typically 2014) with a probability of 0.3. For training, we n(n−1) evaluate the model on the development set period- much smaller than 2 as the input graph is very ically, and stop training when Exact Match score sparse). does not improve 10 times. For all other hyper- Similarly, PARREADER++ (all pairs) concate- parameters not mentioned, we follow the default nate all passage pairs. For all 1 ≤ i < j ≤ n, pi, p , p :[ ]: p are included as input passages settings from PYTORCH-TRANSFORMERS. j i SEP j As mentioned in Section 4.3, for each model, of PARREADER++. As pi and pj may not have a we experiment with M = {1, 2, 3} and two sam- relation, the relation text ri,j is omitted as an input. pling methods, and choose the number that gives Again, the length limit for pi and pj is 145. The number of the final input passages to the model the best result on the development set. The cho- n(n−1) sen hyperparameters for each model is reported in will be n + 2 . As there are too many input Table6. passages for this baseline, we use n = 20 instead For inference, we experiment with the number of of n = {40, 80}. input passages n = {40, 80} and choose the best For both baselines, as the limit for a single pas- one on the development set for testing. We restrict sage is different from other baselines (145 vs. 300), the predicted span to be a Freebase entity string on we run the retrieval again by following the same WEBQUESTIONS, following Chen et al.(2017). method but just split the article into passages with a different length limit. A.2 Details for baselines with passage concatenation. 
B More qualitative analyses We design two baselines which concatenate a pas- Figure4 depict more examples where our model sage pair. predicts the correct answer. We describe how our First, PARREADER++ (pairs from graph) con- model outperforms baselines for each of retrieval catenates passage pairs that are related in the input and reading component. graph, along with the relation text. Specifically, if Retrieval through entity linking. In Example 12github.com/huggingface/transformers 1, text-match retrieval fails to retrieve the evi- Example 1 Q: Who sings the theme song to All That? (A: TLC) Text-match Passage graph

That ‘70s Show All That All That is an American sketch comedy television Rey Valera series created by Brian Robbins and Mike Tollin. The series originally aired on Nickelodeon ... TLC Frankie Laine composer child TLC recorded the theme song Part of Your World to Nickelodeon’s popular TLC sketch comedy All That. Protest Song TLC is an American girl group whose original parent line-up consisted of Tionne “T-Boz” Watkins, Lisa NBC Sunday Night Football “Left Eye” Lopes, and Rozonda “Chilli” Thomas.

Example 2 Example 3 Q: Which country did Nike originate from? Q: When did the movie karate kid first come out? (A: 1984) (Answer: United States of America) Passage graph Passage graph The Karate Kid Nike, Inc. The Karate Kid is a 1984 Nike, Inc. is an American child American martial arts drama film The Karate Kid (2010 film) multinational corporation that ... Nike, Inc. written by Robert Kamen and ... The Karate Kid is a 2010 In April 2014, one of the martial arts drama film … and country followed by follows follows biggest strikes in part of The Karate Kid series. United States of America parent mainland China took The Karate Kid Part II place at … producing United States of America is a The Karate Kid Part II is a 1986 followed by amongst others for Nike. country comprising 50 states, a American martial arts drama film federal district, ... written by Robert Kamen and ...

Figure 4: Examples from NATURAL QUESTIONS and WEBQUESTIONS where predictions from PARREADER++ and GRAPHREADER (both with GRAPHRETRIEVER) are denoted by red and blue text, respectively. Subsets of graph information are reported. Detailed analyses in SectionB. dence passages because it fails to capture ‘All the only country name mentioned in retrieve That’ as a key entity. GRAPHRETRIEVER, on passages from the article “Nike, Inc.” However, the other hand, retrieves the article “All That” GRAPHREADER leverages the relation country and by entity linking. It is worth to note that this predicts the correct answer. was particularly common, when the key entities Example 3 requires to reason across multiple pas- are composed of common words that TF-IDF sages, as it asks about the ﬁrst advent of the movie does not capture its importance, e.g., “Who plays series that have similar titles. PARREADER++, letty in bring it on all or nothing?” or “Who sings which reads each passage in isolation, predicts does he love you with Reba?”. “2010” from the wrong passage. GRAPHREADER, however, incorporates relations followed by and fol- Retrieval through knowledge. In Example 1, lows and successfully distinguishes the ﬁrst movie. entity linking was not enough for evidence to answer the question, because the article “All That” does not contain the singer of the theme song. Meanwhile, WikiData contains a triple <“All That”, composer, “TLC”>, allowing the retrieval of the passage “TLC”.

Reading by synthesizing across passages. In Example 1, although GRAPHRETRIEVER retrieves the evidence passage, PARREADER++ which does not leverage the graph information predicts the wrong span by choosing the ﬁrst person name in the passage from “All That”. GRAPHREADER, however, leverages the relation composer and predicts the correct answer. In Example 2, although the question appears to by easy for humans, PARREADER++ retrieves “China” as an answer, potentially because it is
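The "retrieval through knowledge" case in Example 1 can be illustrated with a minimal sketch of one-hop expansion over knowledge-base triples. This is an illustration under stated assumptions, not the GRAPHRETRIEVER implementation: the function name `expand_with_kb`, the in-memory `TRIPLES` and `PASSAGES` structures, and the second (distractor) triple are all hypothetical; only the triple <"All That", composer, "TLC"> comes from the text.

```python
from collections import defaultdict

# Toy knowledge base of (subject, relation, object) triples.
# Only the first triple appears in the text; the second is a made-up distractor.
TRIPLES = [
    ("All That", "composer", "TLC"),
    ("All That", "original network", "Nickelodeon"),
]

# Toy passage store keyed by article title (hypothetical, abbreviated contents).
PASSAGES = {
    "All That": "All That is an American sketch comedy television series ...",
    "TLC": "TLC is an American girl group ...",
}


def expand_with_kb(seed_titles, triples, passages):
    """Add articles that are one KB hop away from any seed article.

    Returns the retrieved titles (in discovery order) and the relation
    edges that connect retrieved articles.
    """
    neighbors = defaultdict(list)
    for subj, rel, obj in triples:
        neighbors[subj].append((rel, obj))

    retrieved = [t for t in seed_titles if t in passages]
    edges = []
    for title in list(retrieved):  # iterate over seeds only (one hop)
        for rel, obj in neighbors[title]:
            if obj not in passages:
                continue  # no text is available for this entity
            if obj not in retrieved:
                retrieved.append(obj)
            edges.append((title, rel, obj))
    return retrieved, edges


# Seed: the article found by entity linking on the question.
titles, edges = expand_with_kb(["All That"], TRIPLES, PASSAGES)
print(titles)
print(edges)
```

In this toy setting the seed article "All That" pulls in the passage "TLC" through the 'composer' edge, while the distractor entity is skipped because no passage exists for it. The returned edges are the kind of relation information a graph-aware reader could then consume.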