
ASSEMBLING INFORMATION FROM BIG CORPORA BY FOCUSING MACHINE READING

by

Enrique Noriega Atala

Copyright Enrique Noriega Atala 2020

A Dissertation Submitted to the Faculty of the

SCHOOL OF INFORMATION

In Partial Fulfillment of the Requirements

For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

2020

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Enrique Noriega Atala, titled Assembling Information from Big Corpora by Focusing Machine Reading,

and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Clayton Morrison        Date: Jan 4, 2021

Mihai Surdeanu          Date: Jan 4, 2021

Peter A Jansen          Date: Jan 4, 2021

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Clayton Morrison        Date: Jan 4, 2021
Dissertation Committee Chair

School of Information


ACKNOWLEDGEMENTS

I would like to express my gratitude to all the people who directly or indirectly participated in the research projects that culminated in this work. First and foremost, to my adviser, Prof. Clayton T. Morrison, for his mentorship and support throughout the years. His advice and guidance were priceless and I will be forever grateful. To my committee members, Prof. Mihai Surdeanu and Prof. Peter Jansen, who provided invaluable insight during the development of this dissertation and throughout the course of my graduate studies. To my friends Marco, Becky, Gus, Dane, Paul, and all the other members of ML4AI and the CLU Lab who mentored me and collaborated with me. Thank you for all the good times we had during our time at grad school. And last, but not least, to my wife Cecilia, without whose support and encouragement this would not have been possible.

DEDICATION

To my wife, Cecilia.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ABSTRACT

CHAPTER 1 Introduction

CHAPTER 2 Learning to Search in the Biomedical Domain
    2.1 Introduction
    2.2 Related Work
    2.3 Focused Reading
    2.4 Baseline Algorithm and Evaluation
        2.4.1 Dataset
        2.4.2 Baseline Results
        2.4.3 Baseline Error Analysis
    2.5 Reinforcement Learning for Focused Reading
        2.5.1 Ablation Test for RL State Features
        2.5.2 Error Analysis of the RL Policy
    2.6 Discussion

CHAPTER 3 Learning to Search in the Open Domain
    3.1 Introduction
    3.2 Related Work
    3.3 Learning to Search
        3.3.1 Constructing Query Actions
        3.3.2 State Representation Features
        3.3.3 Topic Modeling Features
        3.3.4 Reward Function Structure
    3.4 Evaluation and Discussion
    3.5 Analysis of Errors and Semantic Coherence
        3.5.1 Analysis of Successful Search Paths
        3.5.2 Analysis of the Negative Outcomes
    3.6 Discussion

CHAPTER 4 Contextualizing Information Extraction
    4.1 Introduction
    4.2 Related Work
    4.3 Dataset
        4.3.1 Negative Examples and Extended Positive Examples
        4.3.2 Inter-Annotator Agreement
    4.4 Features
    4.5 Experiment
        4.5.1 Baseline and Model Results
    4.6 Discussion

CHAPTER 5 Conclusions and Future Work

APPENDIX A Inference Paths of Positive Outcomes

APPENDIX B Inference Path Segments of Negative Outcomes

REFERENCES

LIST OF FIGURES

2.1 Evolution of search for the directed path that connects two proteins during focused reading.
2.2 Graph edge encoding the relation extracted from: mTOR triggers cellular apoptosis.

3.1 Distribution of the number of hops in testing problems.
3.2 Number of hops in successful paths.
3.3 Number of iterations for cases with negative outcomes.
3.4 Number of hops in inference paths.

4.1 Aggregated κ scores and association counts for all contexts, binned by κ score ranges.
4.2 κ scores for the top 15 contexts by association count.
4.3 Precision, recall and F1 score per classifier. The dashed red line indicates the F1 score of the baseline.
4.4 Feature usage per paper for each classifier.
4.5 F1 scores per paper for all models.

LIST OF TABLES

2.1 Results of the baseline and RL Query Policy for the focused reading of biomedical literature.
2.2 Baseline error causes.
2.3 Features that describe the state of search in the RL-based focused reading algorithm.
2.4 3-fold cross validation results comparing two RL focused reading models against the baseline.
2.5 Ablation test over the features that encode the state of the RL policies.
2.6 Error analysis for the best RL-based focused reading policy.

3.1 Query templates.
3.2 State representation features.
3.3 Multi-hop search dataset details.
3.4 Hyper-parameter values.
3.5 Feature set ablation results. * denotes the difference w.r.t. the cascade baseline is statistically significant.
3.6 Results of the best model on the validation dataset. * denotes the difference w.r.t. the cascade baseline is statistically significant.
3.7 Examples of inference paths.
3.8 Off-topic endpoint examples.
3.9 Off-topic endpoint distributions.
3.10 Inference path with semantic drift.
3.11 Topic switching in inference paths.
3.12 Inference paths with problematic entities.
3.13 Prevalence of problematic entities in inference paths.
3.14 Taxonomy level switching example.
3.15 Prevalence of taxonomy level switching.
3.16 Example of partial and missing segments.
3.17 Size of the knowledge graph in number of processed documents.
3.18 Size of the knowledge graph in number of entities.
3.19 Size of the knowledge graph in number of relations.
3.20 Semantic drift in negative outcome cases.

4.1 Classification features.

ABSTRACT

We propose a methodology to teach an automated agent to learn how to search for multi-hop connections in big corpora by selectively allocating and deploying machine reading resources. The elements of multi-hop connections are often located in different documents that are not known ahead of time, making it hard for a naive algorithm to exhaustively process a corpus of any reasonable size, e.g., the English Wikipedia or PubMed Central. We formulate the elements of a novel search framework, focused reading (FR), as a Markov decision process whose state representation is comprised of domain-agnostic features related to the current state of the search and the dynamics of the search process. We employ reinforcement learning (RL) to find a policy to search for multi-hop connections in the biomedical domain. Our evaluation of the framework finds that the learned policy is more efficient at retrieving multi-hop paths than a strong, deterministic baseline algorithm.

We introduce extensions to the FR framework to evaluate it in an open domain. Besides the domain-agnostic state representation features, we introduce a set of features that capture information about the topic distribution of the underlying corpus, as well as features that capture the distributional similarity of the entities extracted with machine reading tools. We use RL to find a policy that recovers more multi-hop paths while processing fewer documents than multiple heuristic baselines in an open-domain corpus. We perform an extensive analysis to understand the performance and limits of the method. Semantic drift is found to be a prevalent issue that affects the search outcomes and the coherence of the paths found by FR.

We present first steps towards reducing semantic drift in the biomedical domain by proposing a supervised learning method to assign biological container context to biochemical interactions detected with information extraction. Examples of biological container context include species, organ, or tissue type. We propose a set of features based on frequency, syntactic properties, and other linguistic properties relative to an expression of a container context and to expressions of biochemical interactions. We experiment with a battery of classification algorithms and compare favorably to a deterministic, location-based baseline. We leave to future work the integration of this methodology in an FR implementation.

CHAPTER 1

Introduction

The sheer size of public corpora such as Wikipedia¹ or large paper repositories like arXiv² and PubMed Central³ poses an enormous challenge to automating effective search for relevant information. This problem is compounded when the underlying information requires multi-hop connections: chains of relations that reveal how two concepts may be indirectly related to each other. For example, the Yankees are a team based in New York City; New York City is home to two MLB teams; the Mets are a baseball team based in New York City. In this example, the Yankees and the Mets are related to each other by virtue of both being baseball teams based in New York City. The elements of multi-hop connections can exist in more than one document. This becomes the norm in problem domains of reasonable complexity, e.g., searching for biological mechanisms that connect two proteins [Cohen, 2015a] or searching for explanations that require complex reasoning over text supported by different documents in QA systems [Welbl et al., 2018; Yang et al., 2018]. As a result of the size of public corpora, some containing millions of documents, the task of retrieving multi-hop connections is complex. Clearly, we have surpassed our ability to keep up with and integrate these findings through manual reading alone. Machine reading technologies help to at least partially alleviate this issue. Information retrieval systems [Manning et al., 2008a] assist the user by restricting attention to a relevant subset of documents and ranking the resulting documents with respect to their relevance to the query. Information extraction tools [Valenzuela-Escárcega et al., 2015], such as named entity recognizers [Manning et al., 2014a] and relation extraction systems [Valenzuela-Escárcega et al., 2017a], provide structure representing the information present in natural language text.

¹ https://www.wikipedia.org/
² http://arxiv.org/
³ https://www.ncbi.nlm.nih.gov/pmc/

In a naive approach, an automated information extraction agent could process all the documents in a corpus, searching for the indirect connections that satisfy a multi-hop information need. However, this quickly becomes prohibitively expensive as the corpus size increases. Further, the documents may also be behind a paywall, adding an economic cost to accessing information. Thus, the naive exhaustive reading approach is simply not feasible for most large corpora. Instead, we need to incorporate the kind of iterative focused reading that humans are capable of. When people search for information, they use background knowledge, based in part on what they have just read, to narrow down the search space while selectively committing time and other resources to carefully reading documents that appear relevant. This process may be repeated multiple times until the information need is satisfied. To effectively read at this scale, we need to incorporate methods for focused reading: develop the ability to pose queries about concepts of interest and perform targeted, incremental search through the literature for connections between concepts while minimizing reading documents that are likely irrelevant.

In this dissertation we introduce a methodology to teach an automated agent, such as an information gathering system, to learn how to selectively direct and allocate machine reading resources. A significant reduction of computational resources can be achieved by steering the deployment of machine reading towards relevant sections of the corpus. We will show how, using reinforcement learning, the agent can find policies that deploy machine reading resources more effectively than strong deterministic baselines, in domain-specific scenarios as well as in the open domain. This dissertation is organized as follows:

• In Chapter 2 we develop the core of the focused reading algorithm framework in a domain-specific scenario. We leverage domain-specific information extraction tools that output biochemical interactions, so that the multi-hop connections are interpreted as potential protein interaction pathways. We formally define the focused reading search process as a Markov decision process. We use a corpus of biomedical and biochemical scientific articles related to a model of pancreatic cancer to formulate search problems. Using reinforcement learning, we explore the space of policies over the Markov decision process to derive one that finds the multi-hop chains of relations that associate different biochemical entities, and we compare these policies with deterministic baseline policies.

• In Chapter 3 we refine the formulation of focused reading to fit an open-domain corpus. We enhance the focused reading state representation with semantic features that reflect the topic composition of the underlying corpus and show it is possible to retrieve multi-hop connections with focused reading and reinforcement learning more efficiently than several strong baselines. We present a detailed analysis that sheds light on the quality of the resulting inferences made using a focused reading algorithm and investigate some of the most important factors that make multi-hop search problems difficult to solve using focused reading. From this analysis we find that semantic drift is a major challenge to focused reading.

• In Chapter 4 we take the first steps towards addressing semantic drift in the biomedical domain. We propose a supervised learning methodology to contextualize information extraction in the biomedical domain. Specifically, we study how to assign biological container context to biochemical interactions extracted from scientific literature. Some examples of container context classes are species, organ, or tissue type. The context information can be used to discriminate when to include specific relations in multi-hop connections. We leave for future work how to leverage this in focused reading implementations.

• Finally, in Chapter 5 we summarize the work presented in this dissertation and explore future avenues of research for focused reading algorithms and implementations.

CHAPTER 2

Learning to Search in the Biomedical Domain

[The contents of this chapter are based on work previously published as Noriega-Atala et al. [2017a].]

2.1 Introduction

The millions of academic papers in the biomedical domain contain a vast amount of information that may lead to new hypotheses for disease treatment. However, scientists are faced with a problem of “undiscovered public knowledge,” as they struggle to read and assimilate all of this information [Swanson, 1986]. Furthermore, the literature is growing at an exponential rate [Pautasso, 2012]; PubMed¹ has been adding more than a million papers per year since 2011. We have surpassed our ability to keep up with and integrate these findings through manual reading alone. Fortunately, machine reading technologies are available to automate the discovery and extraction of information present in large repositories of information. Large ongoing efforts, such as the BioNLP task community [Nédellec et al., 2013; Kim et al., 2012, 2009] and the DARPA Big Mechanism Program [Cohen, 2015b], are making progress in advancing methods for machine reading and the assembly of extracted biochemical interactions into large-scale models. However, to date, these methods rely on either manual retrieval of relevant documents by humans, or processing large batches of documents that may or may not be relevant to the model being constructed. Batch machine reading of literature at this scale poses a new, growing set of problems. First, access to some documents is costly. The PubMed Central (PMC)

¹ http://www.ncbi.nlm.nih.gov/pubmed

Open Access Subset² (OA) is estimated³ to comprise 20%⁴ of the total literature; the remaining full-text documents are only available through paid access. Second, while there have been great advances in quality, machine reading is still not solved. Updates to our readers require reprocessing the documents. For large document corpora, this quickly becomes the chief bottleneck in information extraction for model construction and analysis. Finally, even if we could cache all reading results, the search for connections between concepts within the extracted results should not be done blindly. At least in the biology domain, the many connections between biological entities and processes lead to a very high branching factor, making blind search for paths intractable. To effectively read at this scale, we need to incorporate methods for focused reading: develop the ability to pose queries about concepts of interest and perform targeted, incremental search through the literature for connections between concepts while minimizing reading documents that are likely irrelevant. For example, suppose a biologist is interested in what the literature has to say about how protein Pi3k affects protein KTF. Figure 2.1 gives a schematic representation of what the focused reading search might look like: given the initial seed query entities, Pi3k and KTF, focused reading retrieves a set of documents that mention both, extracts interactions that connect additional entities to the seed entities, and adds them to an expanding graph of connected entities. The first pass adds new entities around Pi3k and KTF (within the dashed circles in Fig. 2.1), and a second pass expands the graph further (to include all entities within the solid circle boundaries). Eventually two entities on the periphery of the expanding subgraphs are linked (the orange direct edge in the figure) and focused reading returns the complete path from Pi3k to KTF. The key research challenge is then: how can focused reading search in a way that finds these paths while minimizing the number of documents that need to be read?

² https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
³ https://tinyurl.com/bachman-oa
⁴ This includes 5% from PMC author manuscripts.

Figure 2.1: Evolution of search for the directed path that connects two proteins during focused reading.

2.2 Related Work

The past few years have seen a large body of work on information extraction (IE), particularly in the biomedical domain. This work is too vast to be comprehensively discussed here; we refer the interested reader to the BioNLP community [Nédellec et al., 2013; Kim et al., 2012, 2009] for a starting point. However, most of this work focuses on entity/relation/event extraction given a document, not on what documents to read given a goal. Reinforcement learning has been used to achieve state-of-the-art performance in several natural language processing (NLP) and information retrieval (IR) tasks. For example, RL has been used to guide IR and filter irrelevant web content [Seo and Zhang, 2000; Zhang and Seo, 2001]. More recently, RL has been combined with deep learning with great success, e.g., for improving coreference resolution [Clark and Manning, 2016a]. Finally, RL has been used to improve the efficiency of IE by learning how to incrementally reconcile new information and help choose what to look for next [Narasimhan et al., 2016], a task close to ours. This serves as an inspiration for the work we present here, but with a critical difference: Narasimhan et al. [2016] focus on slot filling using a pre-existing template, which makes both the information integration and stopping criteria well-defined. In our focused reading domain, on the other hand, there is no well-defined template: we do not know ahead of time which new pieces of information are relevant, and must consider this in the context of our search.

Figure 2.2: Graph edge encoding the relation extracted from: mTOR triggers cellular apoptosis.

2.3 Focused Reading

In this chapter we consider learning how to search in the biomedical domain, and we focus on binary promotion/inhibition interactions between biological entities. Thus, the IE component constructs a directed graph, where vertices represent entities participating in an interaction (protein, gene, gene product, or other biological process), and edges represent directed activation interactions. Edge labels indicate whether the controller entity has a positive (promoting) or negative (inhibitory) influence on the controlled entity. Figure 2.2 shows an example edge in this graph. Importantly, this graph is constructed on the fly, as the IE system is incrementally exposed to more documents. We use REACH⁵, an open source IE system [Valenzuela-Escárcega et al., 2015a], to extract interactions from unstructured biomedical text. We couple this IE system with a Lucene⁶ index of biomedical publications to retrieve documents based on queries about entity mentions in the text (as discussed below). Importantly, we essentially use IE as a black box⁷, and focus on strategies that guide what the IE system reads for a complex information need. In particular, we consider the scenario where a biologist (or other model-building process) queries the literature about how one entity (source) affects another (destination), where the connection is typically indirect (as in the example in Figure 2.1). Algorithm 1 outlines the general focused reading algorithm for this task. In the algorithm, S, D, A, and B represent individual entities, where S and D are the source and destination entities in the user query. G is the graph of interactions that is iteratively constructed during the focused reading procedure, with V being

⁵ https://github.com/clulab/reach
⁶ https://lucene.apache.org
⁷ Thus, our method could potentially work with any IE system.

Algorithm 1 Focused Reading
 1: procedure FocusedReading(S, D)
 2:     G ← {{S, D}, ∅}
 3:     repeat
 4:         Σ ← EndpointStrategy(G)
 5:         (A, B) ← ChooseEndPoints(Σ, G)
 6:         Q ← ChooseQuery(A, B, G)
 7:         (V, E) ← Lucene+Reach(Q)
 8:         Expand(V, E, G)
 9:     until IsConnected(S, D) OR StopConditionMet(G)
10: end procedure

the set of vertices (entities), and E the set of edges (connecting entities participating in an interaction). Σ is a strategy for selecting entities, while Q is a Lucene query constructed in each iteration to retrieve new documents to read. The algorithm initializes the search graph to contain the two unconnected entities as vertices: {S, D} (line 2). The algorithm then enters its central loop (lines 3 through 9). The loop terminates when one or more directed paths connecting S to D are found, or when a stopping condition is met: either G has not changed since the previous run through the loop, or after exceeding some number of iterations through the loop (in this work, ten). At each pass through the loop the algorithm grows the search graph as follows:

1. The graph G is initialized with two nodes, the source (S) and destination (D) in the user’s query, and no edges (because we have not read any documents yet to understand what interactions exist).

2. Given the current graph, choose a strategy, Σ, for selecting which entities to query next: exploration or exploitation (line 4). In general, exploration aims to widen the search space by adding many more nodes to the graph, whereas exploitation aims to narrow the search by focusing on entities in a specific region of the graph.

3. Using strategy Σ, choose the next entities to attempt to link: (A, B) (line 5).

4. Choose a query, Q: again, exploration or exploitation, following the same intuition as with the entity choice strategy (line 6). Here exploration queries retrieve a wider range of documents, while exploitation queries are more restrictive.

5. Run the Lucene query to retrieve documents and process the documents using the IE system. The result of this call is a set of interactions, similar to the interaction in Figure 2.2 (line 7).

6. Add the new interaction participant entities (vertices V) and directed influences (edges E) to the search graph (line 8).

7. If the source and destination entities are connected in G, stop: the user’s query has been satisfied. Stop with a failure if G has not changed since the last run through the loop, or we’ve gone through the loop 10 times. Otherwise, continue from step 2.

The central loop performs a bidirectional search in which each iteration expands the search horizon outward from S and D (as depicted in Figure 2.1). Algorithm 1 represents a family of possible focused reading algorithms, differentiated by how each of the functions in the main loop is implemented. In this work, IsConnected stops after a single path is found, but a variant could consider finding multiple paths, paths of some length, or incorporate other criteria about the properties of the path. We next consider particular choices for the inner loop functions.
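To make the control flow concrete, the following is a minimal Python sketch of this family of algorithms. The `choose_endpoints`, `choose_query`, and `retrieve_and_read` callables are hypothetical stand-ins for EndpointStrategy/ChooseEndPoints, ChooseQuery, and the Lucene+REACH pipeline, and `networkx` stands in for the actual graph implementation used in this work.

```python
import networkx as nx

MAX_ITERATIONS = 10  # the iteration cap used in this chapter

def focused_reading(source, dest, choose_endpoints, choose_query,
                    retrieve_and_read):
    """Skeleton of Algorithm 1; the three callables are the points of
    variation that distinguish members of the algorithm family."""
    graph = nx.DiGraph()
    graph.add_nodes_from([source, dest])
    for _ in range(MAX_ITERATIONS):
        a, b = choose_endpoints(graph)           # EndpointStrategy + ChooseEndPoints
        query = choose_query(a, b, graph)        # explore vs. exploit
        vertices, edges = retrieve_and_read(query)   # IR + IE black box
        edges_before = graph.number_of_edges()
        graph.add_nodes_from(vertices)
        graph.add_edges_from(edges)
        if nx.has_path(graph, source, dest):     # IsConnected
            return nx.shortest_path(graph, source, dest)
        if graph.number_of_edges() == edges_before:  # G unchanged: stop condition
            break
    return None  # no path recovered within the budget
```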

2.4 Baseline Algorithm and Evaluation

The main functions that affect the search behavior of Algorithm 1 are EndpointStrategy and ChooseQuery. Here we describe a baseline focused reading implementation in which EndpointStrategy and ChooseQuery are designed to attempt to find any directed path from S to D as quickly as possible, that is, in as few passes through the inner loop, which in turn should tend to minimize the number of documents that must be read.

                                 Baseline    Best RL Query Policy   Relative Change
Number of IR queries             573         433                    25% decrease
Unique documents read            26,197      19,883                 24% decrease
# Paths recovered (out of 289)   189 (65%)   198 (68%)              3% increase

Table 2.1: Results of the baseline and RL Query Policy for the focused reading of biomedical literature.

There are a variety of approaches to consider for how EndpointStrategy can implement exploration versus exploitation strategies. For the baseline, we consider the following interpretations: exploration will involve selecting entities that are more connected to other entities, and therefore more likely to select documents that introduce more entities, while exploitation will focus on entities that have been introduced more recently to the graph, under the intuition that searching for what they are connected to may be more likely to reveal a path between the fringes of the S and D subgraphs. Either strategy has potential advantages for finding a connecting path: exploration will retrieve more potential paths, but at the cost of reading more documents, while exploiting may more quickly connect the fringes, but only if the current fringes are sufficiently close. For the baseline, we fix the EndpointStrategy to always explore, under the intuition that search will introduce more entities earlier, and therefore be more likely to connect quickly; although this might come at the cost of potentially reading more than needed, it is still better than exhaustively reading the entire corpus. Under this strategy, ChooseEndPoints chooses entities (A, B) that currently have the most incoming and outgoing edges (i.e., highest vertex degree) in the current state of G (disallowing choosing an entity pair used in a previous query). Now that we have our candidate entities (A, B), our next step is to formulate how we will use these entities to retrieve new documents. Here we consider two classes of queries: (1) we restrict our query to only retrieve documents that simultaneously mention both A and B, which is therefore more likely to retrieve a paper with a direct link between A and B (exploit), or (2) we retrieve documents that mention either A or B, therefore generally retrieving more documents that will introduce more new entities (explore). For our baseline, where we are trying to find a path between S and D as quickly as possible, we implement a greedy ChooseQuery: first try the conjunctive exploitation query; if no documents are retrieved, then “relax” the search to the disjunctive exploration query.
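As an illustration, the baseline's two choices could be sketched as follows. Here `search_index` is a hypothetical handle to the Lucene index, and the query strings are schematic rather than the system's exact query syntax.

```python
def choose_endpoints_explore(graph, used_pairs):
    """Baseline EndpointStrategy: always explore. Pick the highest-degree
    entity pair whose pairing has not been queried before."""
    ranked = sorted(graph.nodes, key=graph.degree, reverse=True)
    for i, a in enumerate(ranked):
        for b in ranked[i + 1:]:
            if (a, b) not in used_pairs:
                used_pairs.add((a, b))
                return a, b
    raise RuntimeError("no unused entity pair left")

def cascade_query(a, b, search_index):
    """Baseline ChooseQuery: try the restrictive conjunction first (exploit);
    if it retrieves nothing, relax to the disjunction (explore)."""
    docs = search_index(f'"{a}" AND "{b}"')
    if not docs:
        docs = search_index(f'"{a}" OR "{b}"')
    return docs
```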

2.4.1 Dataset

To evaluate the baseline, we constructed a data set based on a collection of documents seeded by a set of 132 entities that come from the University of Pittsburgh's Dynamic Cell Environment (DyCE) model, a biomolecular model of pancreatic cancer [Telmer et al., 2017]. These entities are known to participate in protein signaling pathways that drive pancreatic cancer. Using these entities, we retrieved 70,719 documents that mention them. We processed all documents using REACH, extracted all of the interactions mentioned, and converted them into a single graph. The resulting graph consisted of approximately 80,000 vertices and 115,000 edges, and had an average (undirected) vertex degree of 24. We will refer to this graph as the REACH graph, as it represents what can be retrieved by REACH from the set of 70K documents. It is important to note that our focused reader has access to this graph only during training, and not at evaluation time. During testing, the focused reader must dynamically reconstruct the subset of the graph necessary to answer the given information need. We then conducted an exhaustive shortest path search for directed paths between all the entity pairs that are known to be connected in a signaling pathway (according to the cancer researchers) and found a total of 789 entity pairs connected by a directed path in the REACH graph. We selected 289 of these entity pairs to perform an initial evaluation of our baseline algorithm, described in the next section. (In the more extensive evaluation in Section 2.5 we partition all 789 entity pairs into three parts to perform cross validation.)
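As a sketch of how such a dataset could be assembled from IE output (the exact REACH output format is not shown here, so the interaction tuples below are an assumption for illustration):

```python
import networkx as nx

def build_reach_graph(interactions):
    """Assemble extracted interactions into one directed graph.
    `interactions` is assumed to be tuples of the form
    (controller, controlled, sign), e.g., ("mTOR", "apoptosis", "+")."""
    graph = nx.DiGraph()
    for controller, controlled, sign in interactions:
        graph.add_edge(controller, controlled, label=sign)
    return graph

def connected_pairs(graph, candidate_pairs):
    """Keep only the entity pairs joined by a directed path, as was done
    to select the 789 evaluation pairs from the REACH graph."""
    return [(s, d) for s, d in candidate_pairs
            if graph.has_node(s) and graph.has_node(d)
            and nx.has_path(graph, s, d)]
```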

2.4.2 Baseline Results

We ran this baseline focused reading algorithm on each of the 289 pairs of entities, in each case attempting to recover a directed path from one to the other. The results are summarized in the middle column of Table 2.1: by issuing a total of 573 queries, the baseline read 26,197 documents out of the total 70,719 documents (37% of the corpus), in order to recover 189 of the paths (65% of the paths that could be found using this corpus).

2.4.3 Baseline Error Analysis

Although the baseline recovers paths between nearly two thirds of the test pairs, we would like to understand the conditions under which it failed to find the remaining paths that we know exist within the REACH graph (Section 2.4.1). We performed an error analysis on a random subsample of 82 of the test pairs where the baseline failed to find a path – in all of these cases, the baseline algorithm reached the limit of 10 iterations through the inner search loop without finding a path. Table 2.2 presents a summary of the types of errors that caused the baseline implementation to fail.

Error cause                         Frequency
NER error                           12
Ungrounded ID                       19
Empty query result                  20
Premature finish                    4
QueryChoice exploiting too early    5
Poor choice of endpoint entities    21

Table 2.2: Baseline error causes

We detail these error types next:

NER error and Ungrounded ID represent cases where either the reader made a mistake in labeling an entity, or the entity grounding component, which attempts to link each extracted named entity to a knowledge base of known entities, e.g., UniProt for protein names,⁸ does not currently cover the entity name.⁹ These kinds of errors can be addressed as part of improvements to the IE component but are out of the scope of focused reading. When the algorithm finished prematurely, nothing inherently incorrect happened during the search process, but the search reached the maximum number of iterations allowed. We suspect that letting the search continue for a few more iterations would have found an answer. QueryChoice exploiting too early occurs when, in the cascading strategy, the most restrictive (i.e., exploitation) IR query succeeds but returns very few documents in the first couple of iterations, and the information contained in the retrieved documents already existed in the graph. In this case, the exploitative search was too restrictive and resulted in no new information being added. This could have been mitigated if the algorithm had chosen to issue a less restrictive query (explore) early in the process to retrieve a larger set of documents and thereby add more information. Poor choice of endpoint entities is the problem in which both endpoint entities are correctly grounded, but the retrieved set of documents does not add new information to G. In this case, the EndpointStrategy and ChooseEndpoints procedures should have selected different entities to try to connect. In summary, the baseline error analysis suggests that, while focused reading is indeed possible, choosing when to explore versus when to exploit in QueryChoice and selecting endpoint entities (a function of EndpointStrategy and ChooseEndPoints) is not trivial and leaves room for improvement.

⁸ http://www.uniprot.org
⁹ This happens in the current reader because it uses a conditional random field model for named entity recognition, which sometimes introduces false positives that do not exist in the relevant knowledge bases.

Feature Name                        Description
Iteration                           Current iteration # of the search
PA query log count                  How many times PA has been used in previous queries
PB query log count                  How many times PB has been used in previous queries
Same component                      Are PA and PB in the same connected component of G?
Iteration of introduction for PA    Which iteration first introduced PA
Iteration of introduction for PB    Which iteration first introduced PB
PA Rank                             PA's rank by total degree in G
PB Rank                             PB's rank by total degree in G

Table 2.3: Features that describe the state of search in the RL-based focused reading algorithm. “PA” stands for “participant A”, i.e., the latest entity chosen from the source subgraph. “PB” stands for “participant B”, i.e., the latest entity chosen from the destination subgraph. “Iteration” indicates the number of times focused reading has gone through the central loop in Algorithm 1.

                 Fold 1                               Fold 2                               Fold 3
                 Paths  Queries  Documents  Score     Paths  Queries  Documents  Score     Paths  Queries  Documents  Score
RL Query Only    189    398      20,151     0.93      182    392      19,682     0.92      207    358      18,107     1.14
RL All Actions   189    412      20,791     0.90      183    415      19,884     0.92      208    388      19,355     1.07
Baseline         180    459      25,550     0.70      176    500      26,270     0.66      192    462      25,250     0.76

Table 2.4: 3-fold cross validation results comparing two RL focused reading models against the baseline.

2.5 Reinforcement Learning for Focused Reading

From the above analysis, we found that a significant number of the failures might have been avoided had the algorithm used a different strategy for EndpointStrategy and/or ChooseQuery. Informally, the baseline model chose to exploit when it should have explored, or vice versa. The conditions for making different choices depend on the current state of G, and earlier query behavior can affect later query opportunities, making this an iterative decision-making problem and a natural fit for an RL formulation. Inspired by this observation, we consider RL for finding a better policy for EndpointStrategy and ChooseQuery. We consider a space of four possible actions; each “action” must include both an endpoint entity selection and a query choice, but there are two (explore/exploit) options for each:

• In the context of choosing the endpoints, either exploit, which restricts the choices only to the most recently added vertices, or explore, which considers the highest ranked vertices across the whole graph. For example, suppose focused reading is in iteration 8; there is an entity, α, that was introduced in iteration 1 and is connected to 5 other entities in the current search graph, and another entity, β, that was introduced in the previous iteration (iteration 7) and is connected to 2 other entities. In this case exploit would select β, while explore would select α.

• In the context of querying, exploit produces a conjunctive query, whereas explore uses a wider, disjunctive query. For example, if building a query to retrieve documents about entities α and β, exploit will require that the documents mention both α AND β in the same document, while explore will retrieve any documents that mention either α OR β.

Altogether, the possible actions are: (1) Endpoint Explore + Query Explore, (2) Endpoint Exploit + Query Explore, (3) Endpoint Explore + Query Exploit, and (4) Endpoint Exploit + Query Exploit. Table 2.3 summarizes the features that are used to describe the state of the search. Note that the features include a representation of the history of the search (e.g., how many times the algorithm used a particular entity to search), as well as the state of the graph (e.g., how well connected two entities are). With the goal of recovering paths as quickly as possible, we provide a reward of +1 if the algorithm successfully finds a path, a reward of −1 if the search fails to find a path, and assess a “living reward” of −0.05 for each step during the search, to encourage finishing the search as quickly as possible. Based on the above framing of focused reading as an RL problem, we evaluated different configurations of RL-based focused reading: (1) RL All Actions learns a policy for choosing among all four action combinations; (2) RL Endpoint Only fixes the query strategy to be identical to the baseline but learns a policy for selecting between endpoint explore and exploit; and (3) RL Query Only fixes the endpoint selection strategy to be identical to the baseline but learns a policy for selecting between query choice exploration and exploitation. Since our interest is ultimately in improving the efficiency of focused reading search, that is, recovering the most paths while minimizing the number of documents read, we define the performance score as the ratio (# paths) / (# documents); a higher score is better, as it means we are finding more relevant paths in the documents we retrieved. We trained the three different configurations of RL focused reading using SARSA and Q-Learning [Sutton and Barto, 1998]. As the number of unique states is large, we used a linear approximation of the q-function. Once the policy converged during training, we fixed the linear q-function estimate and used this as the fixed policy for selecting queries at evaluation time. We evaluated the variations of RL policies on the same data set of entity pairs used to evaluate the baseline, and also performed a three-fold cross validation where two folds were used for training and one for testing; each fold contained exactly 263 entity pairs for which a path can be found in the paper corpus. Table 2.4 reports the results of the most interesting RL policies and the baseline for the three-fold cross validation. These results show that the RL Query Only policy, which fixes the endpoint selection strategy to exploration while learning the query strategy, generally performs the best. (We did consider two variants of the RL Endpoint Only policy: one that chooses query exploit only, which was found to do worse than the baseline, and query explore only, which does better than the baseline but worse than the other two policies.) Table 2.1 compares the results for the baseline under the initial evaluation set (described in Sec. 2.4.1) against RL Query Only, and here we see that, compared to the baseline, the RL Query Policy resulted in a 25% reduction in the number of queries that were run, leading to a 24% reduction in the number of documents that were read, while at the same time increasing the number of paths recovered by 3%. We tested the statistical significance of the difference in results between the baseline and the best RL policy on the testing dataset by performing a bootstrap resampling test.
Our hypotheses were that the policy reads fewer documents, makes fewer queries, and finds more paths. The resulting estimated p-values for fewer documents and fewer queries were found to be near 0, and p < 0.003 for finding more paths.
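To make the training setup concrete, here is a minimal sketch of SARSA with a linear q-function approximation over the Table 2.3 state features. The environment wrapper and the hyper-parameter values are assumptions for illustration, not the dissertation's exact implementation.

```python
import numpy as np

N_FEATURES = 8          # the state features of Table 2.3
N_ACTIONS = 4           # endpoint {explore, exploit} x query {explore, exploit}
ALPHA, GAMMA, EPSILON = 0.01, 0.95, 0.1   # assumed hyper-parameters

# one weight vector per action: q(s, a) = w_a . phi(s)
weights = np.zeros((N_ACTIONS, N_FEATURES))

def q_values(phi):
    return weights @ phi

def epsilon_greedy(phi, rng):
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(phi)))

def sarsa_episode(env, rng):
    """`env` is a hypothetical wrapper around the focused reading loop:
    reset() -> feature vector; step(a) -> (phi_next, reward, done), where
    the reward is +1 / -1 at the end and the -0.05 living reward per step."""
    phi = env.reset()
    action = epsilon_greedy(phi, rng)
    done = False
    while not done:
        phi_next, reward, done = env.step(action)
        action_next = epsilon_greedy(phi_next, rng) if not done else None
        target = reward if done else reward + GAMMA * q_values(phi_next)[action_next]
        td_error = target - q_values(phi)[action]
        weights[action] += ALPHA * td_error * phi   # gradient of w_a . phi
        phi, action = phi_next, action_next
```

At evaluation time the learned weights are frozen and the greedy action argmax_a q(s, a) is taken, matching the fixed-policy evaluation described above.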

                All features  −Same component  −Ranks  −Iteration number  −Query counts  −Entity introduction
Paths found         198            201           202         199               200               196
Documents read    19,883         20,531        20,463      19,893            20,918            17,936
Queries made        433            467           469         487               484               403
Score              99.58          97.90         98.71      100.04             95.61            109.28

Table 2.5: Ablation test over the features that encode the state of the RL policies.

2.5.1 Ablation Test for RL State Features

We performed a feature ablation study to assess what contribution the features make to learning efficacy. The results are summarized in Table 2.5. The features are clustered into five different groups, and we measured the impact of removing one feature group at a time. The table highlights that removing the Entity introduction feature (keeping track of the iteration in which a biological entity is first introduced to the graph during search) incurs a small penalty in the number of paths recalled, but achieves the highest ratio of paths found to documents read. Individually removing most of the other features had a negative impact, indicating that each is indeed important to model both the state of the graph and the history of the search. All in all, using all the features achieves a good balance across the three metrics reported in the table. This suggests that the design of relevant features for RL-based focused reading is not trivial and deserves further attention. We leave this as future work.
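The ablation protocol amounts to retraining once per held-out feature group. A schematic sketch, with hypothetical `train_policy` and `evaluate` helpers:

```python
def ablation_study(feature_groups, train_policy, evaluate):
    """Retrain and score the policy once with all feature groups, then once
    per held-out group. `feature_groups` maps a group name to its features;
    `evaluate` is assumed to return (paths_found, documents_read)."""
    results = {}
    for held_out in [None] + list(feature_groups):
        active = {g: feats for g, feats in feature_groups.items()
                  if g != held_out}
        policy = train_policy(active)
        paths, documents = evaluate(policy)
        results[held_out or "all features"] = paths / documents  # the score
    return results
```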

Error cause             Frequency
Empty query result      12
Ungrounded participant  4
Low yield from IE       2

Table 2.6: Error analysis for the best RL-based focused reading policy.

2.5.2 Error Analysis of the RL Policy

Finally, we analyzed the execution traces of eighteen (20% of the errors) of the searches that failed to find a path under RL. The results are summarized in Table 2.6. The table shows that the main source of failures is receiving no results from the information retrieval query, i.e., when the IR system returns zero documents for the chosen query. This is typically caused by over-constrained queries. The second most common source of failures was ungrounded participants, i.e., when at least one of the selected participants that form the query could not be linked to our protein knowledge base. This is generally caused by mistakes in our NER sequence model, and also tends to yield no results from the IR component. Finally, the low yield from IE situation appears when the information produced through machine reading in one iteration is scarce and adds no new components to the interaction graph, again resulting in a stop condition.

2.6 Discussion

In this chapter, we introduced a framework for the focused reading of biomedical literature, which is necessary to handle the data overload that plagues even machine reading approaches. We presented a general focused reading algorithm family, an intuitive baseline algorithm instance from that family, and formulated a reinforcement learning approach to search for better policies for choosing which entities to use and how to construct a query given those entities. Informally, the RL approach learns when focused reading should explore (widen its search) or exploit (narrow the search).

We demonstrated that RL-based focused reading is more efficient than the baseline (e.g., reading 24% fewer documents), while successfully finding 3% more target paths. The work presented in this chapter was evaluated on a domain-specific corpus. Limiting the scope of the search domain in this chapter is useful: it allows for the use of heavily specialized machine reading tools, such as REACH. Such specialized tools have been battle-tested, evaluated, and improved by domain experts throughout the years, and we benefit from this. For example, REACH provides comprehensive named entity normalization to biochemical ontologies. It is also designed to be deployed in the exact same domain as this chapter's document corpus. The class of relations extracted by REACH is constrained to biochemical interactions. Despite the localized scope of the extractions, there is a risk of finding spurious information due to semantic drift. For example, if the user is interested in identifying a pathway that describes how Protein A promotes Protein B in endothelial tissue, the lack of context information may result in drift where one reaction of the pathway occurs in hepatic tissue, leading to an overall implausible pathway that mixes together different reaction contexts. In Chapter 4 we introduce the first steps to address this problem. Another key challenge is that specialized machine reading components may not always be available for the domain of interest. In Chapter 3, we extend the focused reading framework to work in an open-domain application.

CHAPTER 3

Learning to Search in the Open Domain

[The contents of this chapter are based on a manuscript under review for publication in a peer-reviewed venue at the time of this writing]

3.1 Introduction

In this chapter, we propose an extension to the focused reading method to teach an automated agent to learn how to search for multi-hop paths of relations between entities in an open domain. The focused reading agent learns a policy for directing existing machine reading resources to focus on regions of a corpus estimated to be relevant to the search problem at hand. Similar to Chapter 2, we formulate the learning problem as a Markov decision process with a state representation that encodes the dynamics of the search process and a reward structure that minimizes the number of documents that must be processed while still finding multi-hop paths. Because the work described in this chapter considers an open-domain corpus, the nature of the documents varies significantly. We therefore introduce new state representation features that capture the underlying topic composition of the open-domain corpus, as well as the distributional similarity of named entities extracted by machine reading. We implement the method in an actor-critic reinforcement learning algorithm and evaluate the resulting policies on a dataset of search problems derived from a subset of English Wikipedia. The results of this chapter show that, with the appropriate refinements of the focused reading implementation, it is possible to learn how to search in a noisy environment with general-domain machine reading tools.
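For readers unfamiliar with the learner named above, here is a minimal one-step actor-critic sketch with a softmax policy over linear action scores and a linear critic. The class structure, learning rates, and discount factor are illustrative assumptions, not the exact architecture evaluated in this chapter.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class ActorCritic:
    """One-step actor-critic: the critic's TD error acts as the advantage
    signal for the policy-gradient update of the actor."""
    def __init__(self, n_features, n_actions,
                 lr_actor=1e-3, lr_critic=1e-2, gamma=0.95):
        self.theta = np.zeros((n_actions, n_features))  # actor weights
        self.w = np.zeros(n_features)                   # critic weights
        self.lr_a, self.lr_c, self.gamma = lr_actor, lr_critic, gamma

    def act(self, phi, rng):
        return int(rng.choice(len(self.theta), p=softmax(self.theta @ phi)))

    def update(self, phi, action, reward, phi_next, done):
        v = self.w @ phi
        v_next = 0.0 if done else self.w @ phi_next
        td_error = reward + self.gamma * v_next - v     # advantage estimate
        self.w += self.lr_c * td_error * phi            # critic step
        probs = softmax(self.theta @ phi)
        grad = -np.outer(probs, phi)                    # d log pi(a|s) / d theta
        grad[action] += phi
        self.theta += self.lr_a * td_error * grad       # actor step
```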

3.2 Related Work

Modern machine reading technology enables the extraction of structured information from natural language data. Named entity recognition systems [Tjong Kim Sang and De Meulder, 2003] detect and label specific classes of concepts in text, both in the general domain [Manning et al., 2014a] and for specific domains [Neumann et al., 2019]. Relation extraction systems extract interactions between different concepts in open-domain [DBL, 2018, 2017, 2008] and domain-specific scenarios [Jin-Dong et al., 2019; Demner-Fushman et al., 2019; Cohen et al., 2011]. Reinforcement learning has been successfully deployed for a variety of natural language processing (NLP) tasks. Clark and Manning [2016b] proposed a policy-gradient method to resolve the correct coreference chains for the task of coreference resolution. Li et al. [2017] used reinforcement learning to train an end-to-end task-completion dialogue system. For the task of machine translation, He et al. [2016] formulated the task as a dual-learning game in which two agents teach each other, without the need for human labelers, using policy-gradient algorithms. Reinforcement learning has also been applied specifically to improving search and machine reading. In learning how to search, Kanani and McCallum [2012] proposed a methodology for the task of slot filling based on temporal-difference Q-learning that uses domain-specific state representation features to select actions in a resource-constrained scenario. Noriega-Atala et al. [2017b] successfully applied reinforcement learning to finding relevant biochemical interactions in a large corpus by focusing the allocation of machine reading resources towards the most promising documents. Similarly, Wang et al. [2019] explore the use of deep neural networks and deep reinforcement learning to simulate the search behavior of a researcher, also in the biomedical domain.

3.3 Learning to Search

We propose a methodology to teach an automated agent how to selectively retrieve and read documents in an iterative fashion in order to efficiently find multi-hop connections between a pair of concepts (or entities). Each search step focuses on a restricted set of documents that are hypothesized to be more relevant for finding a connection between the two target concepts. The focus set is retrieved and processed, and if a path connecting the concepts is found, the search terminates. Otherwise, a new set of focus documents is identified based on what has been learned so far during the search. The process is repeated iteratively until the desired information is found or a maximum number of iterations is exceeded. Our method is general, as it does not directly rely on any supervised domain-specific semantics. During the search, the agent iteratively constructs a knowledge graph (KG) that represents the relations between concepts found so far through machine reading. In each iteration, the algorithm formulates a document retrieval query based on the current state of the knowledge graph, which is then executed by an information retrieval (IR) component. The IR component contains data structures to query the corpus, for example using an inverted index. The construction of these data structures usually requires only shallow processing, such as tokenization and stemming, and not a full-fledged NLP pipeline. Any documents returned from executing the query are processed by an information extraction (IE) component that performs named entity recognition and relation extraction. The KG is expanded by adding newly identified entities as new nodes and previously unseen relations as new edges. The overall goal of the method is to focus on the documents that appear most likely to contain a path between the target concepts, all while processing as few documents as possible. Noriega-Atala et al. [2017b] formalized this iterative search process as a family of focused reading algorithms, shown in Algorithm 2. The algorithm starts with the KG representing only the endpoints of the search: the named entities E1 and E2. The algorithm then initiates the search loop. The first step in the loop analyzes the knowledge graph and generates an information retrieval query, Q. As we will describe shortly, the current state of the KG is used to parameterize and constrain the scope of Q, focusing it on returning a limited subset of documents that are

Algorithm 2 Focused reading algorithm
1: procedure FocusedReading(E1, E2)
2:     KG ← {{E1, E2}, ∅}
3:     repeat
4:         Q ← BuildQuery(KG)
5:         (V, E) ← Retrieval+Extraction(Q)
6:         Expand(V, E, KG)
7:     until IsConnected(E1, E2) OR HasTimedOut
8: end procedure

hypothesized to be most relevant. After retrieval, the documents are processed by the IE component. Any entities not previously found in the KG are placed in the new entity set V, and similarly any new relations linking entities are placed in the new relation set, E. V and E are incorporated into the KG, and the algorithm then searches the updated KG for any new possible paths connecting E1 and E2. If a path exists, it is returned as a candidate explanation of how E1 and E2 are related. Otherwise, if no such path exists, the query formulation process (using the updated KG), followed by IR and IE, is repeated until a path is found or the process times out. This framework can answer multi-hop search queries for which the relationships along a connecting path come from different documents. For example, this process may discover that the Valley of Mexico is connected to the Aztecs because the Aztecs were a pre-Columbian civilization (found in one document), which, in turn, was located in the Valley of Mexico (found in another document). In the following subsections, we formulate focused reading as a Markov decision process (MDP). We first describe how information retrieval actions are constructed, and follow with an explanation of how the search state is represented. The reward is engineered so that the policy minimizes the number of documents processed while iteratively constructing the knowledge graph during the search for a connection between the target concepts. We then use reinforcement learning to learn a policy

3.3.1 Constructing Query Actions

Template      # Params      Constraints
Conjunction   Two: (A, B)   Contains A and B
Singleton     One: (E)      Contains E
Disjunction   Two: (A, B)   Contains A or B

Table 3.1: Query templates

In the focused reading MDP, actions are comprised of information retrieval queries. Actions are constructed from a set of three query templates, listed in Table 3.1. Each template is parameterized by one or two arguments representing the entities that are the subject of the query. The template type then incorporates these entities into the set of constraints that must be satisfied by a document in order for it to be retrieved. The different query templates are designed to give the agent the choice of either exploring the corpus by performing a broader search through the more permissive disjunctive query (documents are retrieved if either of the entities is present), or instead exploiting particular regions of the corpus through the more restrictive conjunctive query (the documents must contain both entities). Because conjunctive queries return documents with the text of both entities, they are more likely to identify relations connecting the entities. However, there is also an increased risk that such queries will end up not finding any satisfying documents, especially when the entities are not closely related, resulting in wasting one iteration of the search process. On the other hand, disjunctive queries are designed to return a larger set of documents, which reduces the likelihood of returning an empty set; but they introduce the risk of processing more potentially irrelevant documents, and potentially introducing more irrelevant entities. Singleton queries represent a compromise between conjunction and disjunction. They are designed to expand the set of existing entities in the knowledge graph, which may in turn be along paths that connect the target entities, but by retrieving documents related to just one entity,
For singleton entity queries, we use the average tf-idf score of the entity's natural language description for ranking. The tf-idf score of an entity is derived by averaging the tf-idf scores of the individual terms in the entity's natural language description. Each term's frequency value is based on the complete corpus. Tf-idf scores are often used as a proxy measure of term importance (the term occurs selectively with greater frequency within some documents), so here the intuition is that

1We used the pretrained GloVe model provided by spaCy at https://spacy.io/models/en#en_core_web_lg

entities with higher tf-idf scores may be associated with higher recall in the corpus.

Finally, there is an additional non-query action that is available at every step of the search: early stop. If the agent chooses to stop early, the search process transitions to a final, unsuccessful state. This deprives the agent of successfully finding a path, but avoids incurring the further cost of processing more documents in a possibly unfruitful search.
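The ranking scheme described above can be made concrete with a short sketch. The following is a minimal illustration (not the dissertation's implementation), assuming a word-embedding lookup table and a corpus-wide tf-idf table as inputs; the helper names are hypothetical.

```python
# Minimal sketch (not the original implementation) of the action-ranking step:
# entity pairs are ranked by the cosine similarity of averaged word embeddings,
# single entities by the average tf-idf of the terms in their descriptions.
# `embeddings` (word -> vector) and `tfidf` (word -> score) are assumed inputs.
from itertools import combinations
import numpy as np

def entity_vector(description, embeddings, dim=300):
    """Average the word vectors of an entity's natural language description."""
    vecs = [embeddings[w] for w in description.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_pairs(entities, embeddings, n=15):
    """Top-n entity pairs by cosine similarity, for two-entity query templates."""
    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0
    vectors = {e: entity_vector(e, embeddings) for e in entities}
    scored = [((a, b), cosine(vectors[a], vectors[b]))
              for a, b in combinations(sorted(entities), 2)]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n]

def rank_singletons(entities, tfidf, n=15):
    """Top-n single entities by average tf-idf score, for singleton queries."""
    def avg_tfidf(description):
        scores = [tfidf.get(w, 0.0) for w in description.lower().split()]
        return sum(scores) / len(scores) if scores else 0.0
    scored = [(e, avg_tfidf(e)) for e in entities]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n]
```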

3.3.2 State Representation Features

At each step during search, the focused reading agent will select just one action to execute (a query action or early stop) based on the current search state. The agent makes this decision using a model that estimates, for each action, the expected long-term reward that can be achieved by taking that action in the current state. Here we describe the collection of features used to represent the current state, provided as input to the model.

Category          Feature
Search state      Iteration number
                  Doc set size
                  # of vertices in KG
                  # of edges
Endpoints         Embedding of E1
                  Embedding of E2
Query             Cosine sim. or avg tf-idf score
                  # of new documents to add
Topic Modeling    ∆ Entropy of queries
                  KL Divergence of query

Table 3.2: State representation features

Table 3.2 provides a summary of the features included in the state representation. We group them into four categories; a sketch of how the groups are assembled into a single state vector follows the list.

• Search state features: Information about the current state of the search process, including: the number of documents that have been processed so far; how long the search has been running (expressed in iterations); and the size of the knowledge graph.

• Endpoints of the search: E1 and E2 represent the original target concepts that we are trying to find a path between. The identity of the endpoints determines the starting point of the search and conditions the theme of the content sought during the search. This information is provided to the model using the vector embedding representations of E1 and E2.

• Query features: We include in the state representation the score with which each of the 3n queries in the action space is ranked. This includes the cosine similarity for conjunction and disjunction queries and the tf-idf score for singleton queries. The intuition is that the score may be correlated with the expected long-term reward. For each query action, we also see the identities of the documents that would be retrieved. This allows us to count how many of the retrieved documents have not already contributed to the KG, and this count is included in the state representation for each action.

• Topic modeling features: Finally, we would like to incorporate some indication of what information is contained in the documents that might be retrieved, and how it relates to the current entities within the KG. As a proxy for this information, we model the topics in the potentially retrieved documents by peeking into the IR component to see the identity of documents that would be returned by the queries in the action space, and compare them to the topics represented in the KG using two numerical scores: (a) an approximation of how broad or specific the topics are in the set of documents that would be returned by the query, and (b) an estimate of how the knowledge graph's topic distribution would change if the query is selected as the next action.
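As referenced above, the following is a minimal sketch of how these four feature groups might be concatenated into a single state vector. The exact layout of the original feature vector is not specified here, so the ordering and names are illustrative assumptions.

```python
# Illustrative assembly of the Table 3.2 feature groups into one state vector.
# The concrete ordering and dimensionality are assumptions for illustration.
import numpy as np

def state_vector(iteration, num_docs, num_vertices, num_edges,
                 emb_e1, emb_e2,
                 query_scores, new_doc_counts, delta_entropies, kl_divergences):
    """Concatenate search-state, endpoint, query and topic-modeling features.
    The last four arguments hold one entry per candidate query action."""
    search_state = np.array([iteration, num_docs, num_vertices, num_edges],
                            dtype=np.float32)
    endpoints = np.concatenate([emb_e1, emb_e2]).astype(np.float32)
    per_query = np.asarray(list(query_scores) + list(new_doc_counts)
                           + list(delta_entropies) + list(kl_divergences),
                           dtype=np.float32)
    return np.concatenate([search_state, endpoints, per_query])
```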

3.3.3 Topic Modeling Features

Topic modeling features can be useful for staying on topic throughout the search, avoiding drift into potentially irrelevant content. Consider the following example: a user wants to know how the Red Sox and the Golden State Warriors are related by searching Wikipedia. While the two entities cover different sports in different regions of the United States, it is more likely that the connection will occur in a document about sports, e.g., they are both covered by the ESPN TV station.

We use Latent Dirichlet Allocation [Blei et al., 2003] (LDA) to provide the agent the ability to exploit topic information available in the corpus. LDA is unsupervised and requires only shallow processing of the corpus, namely tokenizing and optionally stemming. This is essentially the same information required for constructing an inverted index for IR, so it can be computed along with the IR component used in the focused reading system. LDA produces a topic distribution for each document. We then aggregate over a set of documents by summing the topic frequencies across documents and renormalizing. The topic distribution of the KG is the aggregation of the topic distributions of the documents processed so far in the search process. The topic distribution of a query is the aggregation of the topic distributions of the unseen documents returned by the query.

We consider two statistics for relating topic distributions: topic entropy and Kullback-Leibler (KL) divergence. Intuitively, the entropy of a topic distribution is an estimate of how specialized a document is, that is, how much it focuses on a particular set of topics. For example, a document that only talks about a specific sport will generally have a topic distribution where the mass is concentrated only on the particular topics of that sport, and therefore have a lower entropy than another document that discusses sports and business. Document sets with overall higher entropy are more likely to introduce information about more topics to the knowledge graph, and therefore produce more opportunities for new links between a broader set of entities. Lower entropy queries focus on a narrower set of topics, and thus may introduce links between a restricted set of entities. The difference in entropy expresses this intuition in relative terms. We introduce a feature, ∆ Entropy, as the difference in entropy between the documents retrieved by a candidate action and the documents retrieved by the action taken in the previous step. Positive values indicate that the candidate query will generally expand the topics compared to those fetched in the last step, while negative values indicate a more restricted topic focus.

∆ Entropy measures how concentrated the mass is, but it does not tell us how the distributions are different. Two document sets may have completely different topic distributions, yet have the same or similar entropy. Kullback-Leibler (KL) divergence [Kullback and Leibler, 1951], also known as relative entropy, helps measure how different two distributions are with respect to each other, even if they have the same absolute entropy. To capture this information, we compute the KL divergence between the topic distribution in the new documents (retrieved by the new query) and the topic distribution of the knowledge graph. This estimates how different the information in the new query is relative to what has already been retrieved.
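The two topic statistics can be sketched as follows, under the definitions above (aggregate by summing and renormalizing, then compare entropies or compute the KL divergence). This is an illustrative rendering, assuming each document is represented by an LDA topic distribution (e.g., from gensim); the function names are ours, not the original code.

```python
# Sketch of the two topic statistics, assuming each document is represented
# by an LDA topic distribution (e.g., from gensim). Helper names are ours.
import numpy as np

def aggregate_topics(doc_topic_dists):
    """Sum topic distributions across a document set and renormalize."""
    total = np.sum(np.asarray(doc_topic_dists, dtype=float), axis=0)
    return total / total.sum()

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -float(np.sum(p * np.log(p)))

def delta_entropy(candidate_docs, previous_docs):
    """Positive: the candidate query broadens topics relative to the last step."""
    return (entropy(aggregate_topics(candidate_docs))
            - entropy(aggregate_topics(previous_docs)))

def kl_divergence(query_docs, kg_docs, eps=1e-12):
    """How different the new documents' topics are from the KG's topics."""
    p = np.clip(aggregate_topics(query_docs), eps, 1.0)
    q = np.clip(aggregate_topics(kg_docs), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))
```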

3.3.4 Reward Function Structure

The overall goal of the focused reading learning process is to identify a policy that efficiently finds paths of relations between the target entities while minimizing the number of documents that must be processed by the IE component. To achieve this, we want the reward structure of the MDP to incorporate the tradeoff between the number of documents that have to be read (the reading cost) and whether the agent can successfully find a path between the entities. Equation 3.1 describes the reward function, where $s_t$ represents the current state and $a_t$ represents the action executed in that state.

\[
r(s_t, a_t) =
\begin{cases}
S & \text{if } s_{t+1} \text{ is a successful state} \\
-c \times m & \text{if } m > 0 \\
-e & \text{if } m = 0
\end{cases}
\tag{3.1}
\]

A positive reward S (for "success") is given when executing at results in a transition to a state whose knowledge graph contains a path connecting the target entities. Otherwise, the search is not yet complete and a cost is incurred for processing m documents with machine reading on step t. The cost is adjusted by a hyper-parameter c that controls the relative expense of processing a single document. Note that there may be actions that return an empty document set, incurring no cost from reading, but still not making progress in the search. To discourage the agent from choosing such actions, the hyper-parameter e controls the cost of executing an unfruitful action that returns no new information. (Specific parameter values used in this work are presented in Table 3.4 of Section 3.4.)
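Equation 3.1 translates directly into code. The sketch below uses the hyper-parameter values of Table 3.4 as defaults:

```python
# Equation 3.1 as code; default values of S, c and e follow Table 3.4.
def reward(successful_next_state: bool, m: int,
           S: float = 1000.0, c: float = 10.0, e: float = 100.0) -> float:
    """Reward for an action whose execution read `m` documents."""
    if successful_next_state:   # a path now connects the target entities
        return S
    if m > 0:                   # per-document reading cost
        return -c * m
    return -e                   # empty result set: flat penalty
```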

3.4 Evaluation and Discussion

To evaluate the focused reading learning method, we introduce a novel dataset derived from the English version of Wikipedia. Our dataset consists of a set of 369 multi-hop search problems, where a search problem consists of a pair of entities to be connected by a path of relations, potentially connecting to other entities along the path. The foundation of the dataset is a subset of 6,880 Wikipedia articles from the WikiHop [Welbl et al., 2018] corpus. We used Wikification [Ratinov et al., 2011; Cheng and Roth, 2013] to extract named entities from these documents and normalize them to the title of a corresponding Wikipedia article. Wikification does not perform relation extraction, so we lack gold-standard relations. To overcome this limitation, in this paper we induce a relation between entities that co-occur within a window of three sentences (a sketch of this heuristic follows Table 3.3). Every relation extracted this way can be traced back to at least one document in the corpus.

We create a gold-standard knowledge graph using the induced entities and relations, and we sample pairs of entities connected by paths to create search problems for the dataset. Table 3.3 contains a break-down of the number of elements in each subset of the dataset. Figure 3.1 shows the distribution of path lengths (number of hops) between target entities in the test problems. To avoid overlap between training and test data, none of the target endpoints are shared between the train and test data.

Element            Size
Corpus             6880 articles
Search Problems
  Training         230 problems
  Development      500 problems
  Testing          670 problems
  Total            1400 problems

Table 3.3: Multi-hop search dataset details.
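As referenced above, the co-occurrence heuristic used to induce relations can be sketched as follows, assuming each document is represented as a list of per-sentence sets of normalized entity IDs; the input format is an assumption for illustration.

```python
# Sketch of the relation-induction heuristic: link two wikified entities if
# they co-occur within a window of three sentences in some document. The
# per-sentence entity-set input format is an assumption for illustration.
from itertools import combinations

def induce_relations(doc_sentences, window=3):
    """doc_sentences: one set of normalized entity IDs per sentence.
    Returns the set of induced (entity, entity) relations for one document."""
    relations = set()
    for start in range(len(doc_sentences)):
        in_window = set().union(*doc_sentences[start:start + window])
        for a, b in combinations(sorted(in_window), 2):
            relations.add((a, b))
    return relations
```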

We trained an LDA model2 and constructed an information retrieval inverted index over the collection of documents.3

2We used the LDA implementation provided by gensim at https://radimrehurek.com/gensim_3.8.3/.
3The dataset files, with the documents, extractions, indices and LDA model, will be made public upon acceptance of the article.

Figure 3.1: Distribution of # of hops in testing problems (histogram; x-axis: number of hops, 2-7; y-axis: frequency)

We used the Advantage Actor Critic algorithm [Mnih et al., 2016] (A2C) to implement our reinforcement learning focused reading method.4 A2C is an actor-critic method; we use a single neural network architecture to model both the action policy (the actor) and the state value function (the critic). The architecture consists of a fully-connected feed-forward neural network with four layers and two output heads. The first output head represents the approximation of the action policy (the actor) as a soft-max activation layer whose size is the cardinality of the action space; it approximates the probability distribution over the actions given the current state. The second head approximates the state value as a single neuron with a linear activation. The state value estimates the expected long-term reward of following the estimated action distribution of the first head from the current state. Altogether the model consists of approximately 3.79 million parameters.
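A schematic rendering of this two-headed network is shown below. It is a sketch in PyTorch (rlpyt, the library used for training, is PyTorch-based); the hidden-layer width is an illustrative assumption chosen only to convey the structure, not to reproduce the 3.79-million-parameter count.

```python
# Schematic PyTorch version of the two-headed A2C network. The hidden width
# is an illustrative assumption and does not reproduce the 3.79M parameters.
import torch
import torch.nn as nn

class A2CNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 512):
        super().__init__()
        # Four fully-connected layers shared by both output heads.
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # actor
        self.value_head = nn.Linear(hidden, 1)             # critic

    def forward(self, state):
        h = self.body(state)
        # Soft-max over the 3n query actions plus early stop.
        action_probs = torch.softmax(self.policy_head(h), dim=-1)
        state_value = self.value_head(h)  # single linear-activation neuron
        return action_probs, state_value
```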

Table 3.4 lists the hyper-parameter values used in our experiments.5

Hyper-parameter                   Value
Environment
  # entities per query template   15
  Maximum # of steps              10
Reward Function
  Successful outcome S            1000
  Document processing cost c      10
  Empty query cost e              100
Training
  Mini-batch size                 100
  # Iterations                    2000

Table 3.4: Hyper-parameter values

We performed an ablation analysis on the development dataset to find the best configuration of features. The development dataset contains five hundred search problems. The set of endpoints of its search problems does not overlap with those of the training and validation datasets. This is enforced to avoid any accidental leak of training information.

4Implemented using the rlpyt library hosted at https://github.com/astooke/rlpyt.
5Hyper-parameter values were determined through manual tuning.

                      Success Rate     Processed Documents      Documents per Success    Average Steps
                                                                                         Overall       Successes     Failures
Baselines
  Random              25.04 (0.014)    56,187.8 (3,197.6)       449.83 (34.34)           8.41 (0.06)   3.66 (0.25)   10 (0)
  Conditional         23.92 (0.008)    49,609.8 (4,215.11)      415.03 (36.01)           8.51 (0.06)   3.78 (0.07)   10 (0)
  Cascade             32.84 (0.01)     62,058.2 (3,686.57)      378.15 (23.87)           7.42 (0.05)   2.93 (0.19)   9.61 (0.06)
All Features
  Dropout 0.2         36 (0.007)*      58,552.2 (719.67)*       325.41 (8.43)*           6.56 (0.05)   2.22 (0.06)   9.01 (0.04)
  Dropout 0.5         36.64 (0.004)*   100,869 (4,121)          550.76 (26.2)            7 (0.04)      2.37 (0.08)   9.67 (0.06)
  No Embs             26.3 (0.008)     39,433 (1,678.2)*        428.82 (19.51)           6.52 (0.03)   2.37 (0.08)   7.74 (0.07)
No Query Features
  Dropout 0.2         33.68 (0.003)    42,022.2 (2,071.89)*     249.57 (13.02)*          4.56 (0.05)   2.02 (0.03)   5.84 (0.08)
  Dropout 0.5         36.48 (0.003)*   62,126.6 (1,900.75)      340.62 (10.79)*          5.95 (0.06)   2.28 (0.03)   8.06 (0.09)
  No Embs             35.6 (0.005)*    58,025.8 (1,085.72)*     325.99 (4.26)*           6.37 (0.07)   2.2 (0.07)    8.68 (0.12)
No Search Features
  Dropout 0.2         35.92 (0.002)*   55,723 (1,437.01)*       310.27 (8.13)*           6.42 (0.04)   2.15 (0.03)   8.82 (0.07)
  Dropout 0.5         35.32 (0.003)*   53,227.4 (1,429.88)*     301.42 (8.77)*           5.41 (0.07)   2.09 (0.02)   7.22 (0.11)
  No Embs             37.16 (0.004)*   97,612.2 (4,550.54)      525.48 (26.72)           6.92 (0)      2.44 (0.08)   9.56 (0.01)
No Topic Features
  Dropout 0.2         35.56 (0.004)*   51,757.4 (1,510.95)*     291.11 (8.36)*           5.92 (0.02)   2.15 (0.07)   8 (0.05)
  Dropout 0.5         35.72 (0.007)*   55,637 (1,456.53)*       311.58 (8.74)*           5.56 (0.05)   2.13 (0.06)   7.46 (0.06)
  No Embs             28.52 (0.004)    50,634.6 (2,060.88)*     355.05 (12.1)            6.74 (0.06)   3.5 (0.04)    8.03 (0.1)

Table 3.5: Feature sets ablation results. * denotes that the difference w.r.t. the cascade baseline is statistically significant.

Table 3.5 contains the results of the ablation experiments. All the search problems were repeated five times with different random seeds. The key columns of the table are defined as follows. Success Rate represents the percentage of problems in the test set for which the agent connected the endpoints. Processed Documents provides the number of documents processed across all the search problems of the test set. Documents per Success is a summary of the other two columns: it contains the number of documents processed divided by the number of successes. This ratio is an aggregate statistic useful for comparing the performance of different policies. We report the sample averages and their standard deviations in parentheses. For example, the Success Rate column displays the average and standard deviation of five success rate calculations over five hundred search problems. The Processed Documents column displays the average and standard deviation of the cumulative count of documents processed in the search problems, and so forth.

We implement three baseline policies that were not derived using RL:

• Random: Uniformly randomly selects a query from all possible queries constructed from eligible combinations of entities assigned to the query templates.

• Conditional Random: Uniformly randomly selects a query template (conjunction, disjunction, or singleton) and then uniformly randomly selects the entities to parameterize the template.

• Cascade: Uniformly randomly samples a pair of entities and executes a conjunction query. If the result set does not contain any documents, then the agent issues a disjunction query with the same entities (a sketch of this baseline follows the list).
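As referenced in the last item, below is a minimal sketch of one step of the cascade baseline, assuming a hypothetical run_query(template, a, b) interface that returns the retrieved document set:

```python
# Sketch of the cascade baseline, assuming a hypothetical
# run_query(template, a, b) interface returning the retrieved document set.
import random

def cascade_step(entities, run_query):
    """Sample a pair; try a conjunction first, fall back to a disjunction."""
    a, b = random.sample(sorted(entities), 2)
    docs = run_query("conjunction", a, b)
    if not docs:  # empty conjunction: retry with the permissive template
        docs = run_query("disjunction", a, b)
    return docs
```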

For consistency, each baseline was also evaluated with five different random seeds over the testing set. The top part of Table 3.5 shows the results of the baseline policies. To test for statistical significance, we performed a non-parametric bootstrap resampling test with ten thousand samples for the following metrics: success rate, processed documents, and documents per success. For each metric, we calculated the difference between the result of the cascade baseline and the result of each of the reinforcement learning (RL) policies. If p ≤ 0.05 for the difference being in favor of the reinforcement learning policy, the quantity is starred in the table.

In terms of success rate, most of the reinforcement learning models perform better than the cascade baseline. The notable exceptions are two feature configurations that do not use endpoint embeddings: the one that considers all feature classes and the one that does not consider topic features. Excluding query features from training produces models that process fewer documents per success, with or without endpoint embeddings. Excluding search features produced models with higher average success rates, with or without embeddings, but did so while processing more documents compared to other configurations.
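The non-parametric bootstrap comparison described above can be sketched as follows; the exact resampling scheme used in the dissertation is not restated here, so this is one reasonable rendering rather than the original procedure.

```python
# One plausible rendering of the bootstrap comparison (10,000 resamples of
# the per-problem metric for each policy); not the original test code.
import numpy as np

def bootstrap_p_value(rl_scores, baseline_scores, n_samples=10_000,
                      higher_is_better=True, seed=0):
    rng = np.random.default_rng(seed)
    rl = np.asarray(rl_scores, dtype=float)
    base = np.asarray(baseline_scores, dtype=float)
    not_better = 0
    for _ in range(n_samples):
        rl_mean = rng.choice(rl, size=rl.size, replace=True).mean()
        base_mean = rng.choice(base, size=base.size, replace=True).mean()
        diff = rl_mean - base_mean if higher_is_better else base_mean - rl_mean
        if diff <= 0:        # baseline at least as good in this resample
            not_better += 1
    return not_better / n_samples  # starred in Table 3.5 when p <= 0.05
```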

The configurations that exclude query features produced models with the best numbers of documents per success. When the topic features are excluded, a similar result is achieved, but the number of documents per success of the model that does not use endpoint embeddings is not statistically significantly lower than that of the cascade baseline. Nonetheless, the model that has the best balance between success rate and documents per success is the one that excludes topic features and trains with a dropout coefficient of 0.2 on the endpoint embeddings. We use this model for evaluation on the validation dataset.

                      Success Rate     Processed Documents     Documents per Success    Average Steps
                                                                                        Overall       Successes     Failures
Baseline
  Cascade             36.52 (0.008)    83,252.4 (2,538.11)     339.39 (13.01)           7.18 (0.06)   2.81 (0.05)   9.69 (0.04)
No Topic Features
  Dropout 0.2         39.02 (0.007)*   79,737.2 (1,664.65)     304.22 (10.06)*          6.28 (0.04)   2.16 (0.08)   8.91 (0.03)

Table 3.6: Results of the best model on the validation dataset. * denotes that the difference w.r.t. the cascade baseline is statistically significant.

Table 3.6 displays the results on the validation dataset of the cascade baseline, which shows the strongest performance among the baseline policies, and of the chosen reinforcement learning model. The validation dataset contains 650 search problems, and the set of endpoints of its search problems is disjoint from the other datasets' for the same reason: to avoid leaking any training or development signal into the validation dataset. We performed the same non-parametric bootstrap test for statistical significance. The reinforcement learning policy achieves an approximately 2.5% higher average success rate than the cascade baseline policy. While it also processes fewer documents on average, that difference is not statistically significant; but when considering the number of documents per success, the result is indeed significant, requiring approximately thirty-five fewer documents per success on average than cascade.

We also investigated the frequency of the queries executed by the best performing policies and the cascade baseline. The distribution of the query templates turns out to be quite similar across all variants: about 90-91% of the queries issued by all policies are disjunctions, 8-9% conjunctions, and 0-2% singletons. This not only suggests that the policies converge to a similar query template distribution, but also that choosing the correct entities to parameterize those queries accounts for significant differences in performance. The standard deviations in Table 3.5 also support the following observation: they are, in general, smaller for the reinforcement learning policies than for the baselines. Action policies are probability distributions over the action space conditioned on the state representation. By design, the presence of a clear signal in the features results in a conditional distribution where the best available actions are consistently more probable. In contrast, the baseline policies give each entity or pair the same probability of being selected, resulting in more variable outcomes.

3.5 Analysis of Errors and Semantic Coherence

The results shown in Tables 3.5 and 3.6 of Section 3.4 affirm that it is possible to find policies for focused reading in the open domain using reinforcement learning that perform better than baseline algorithms. We now analyze the results to understand how a focused reading implementation can be improved.

A focused reading search process can terminate in one of two ways: with a successful outcome, or with a negative outcome, when the search has run for too long or the agent executing the search decided to stop it prematurely. When a focused reading search process finishes successfully, the resulting data structure is the inference path. This path represents a chain of relations connecting both endpoints of the search problem through elements of the knowledge graph. The knowledge graph is built from the results of machine reading over a subset of the underlying corpus.

Although the resulting inference path represents some sequence of links between the two endpoints, a further question is whether the relationships between each link in the path are appropriate. The most important way in which we evaluate this appropriateness is by looking for semantic drift. Semantic drift is the change of meaning of a word or concept relative to the general topic of the discourse. For example, depending on the context of a phrase, bank could refer to a financial institution or to a piece of land adjacent to a body of water. Semantic drift can happen if the incorrect sense of the word is assumed. In the presence of semantic drift, the topics of the inference paths change too drastically, by following relations relevant to an incorrect sense of an entity, straying away from the themes relevant to the search problem. While the relations linking the entities in the path may be accurately extracted from the literature, the themes of the relations may shift arbitrarily, rather than being tied to the overall relationship between the endpoints.

In the remainder of this section, we investigate the properties of paths found in both focused reading search successes and failures. In both cases, we estimate the degree to which the links along the inference paths (whether complete or partial) maintain meaningful semantic relationships. We then compare and contrast observed patterns in the statistics of semantic coherency among successful paths and failed partial paths. In cases where search failed, we explore the properties of the search problem that appear to make the search more difficult and consider what would have been needed to successfully find a semantically consistent path.

3.5.1 Analysis of Successful Search Paths

In this section we analyze paths found by successful focused reading searches. We collected a random sample of one hundred inference paths retrieved by running the focused reading agent on the test problem set. We describe how the paths were manually scored according to their semantic coherence. We then explore how properties of the original search problem, as well as of the search process, relate to observed patterns in path semantic coherence quality. Appendix A contains the data used to conduct the analysis of this subsection.

Semantic Coherence

The first step in the analysis is to assess the quality of the inference paths. For this, we introduce the concept of path topical coherence. In this context, coherence means that as we move from one concept to the next along the path, there is a clear thematic association between both concepts. If there is a thematic association between all pairs in the path, we consider the path to be coherent. Otherwise, the path is deemed incoherent. Table 3.7 contains two examples of each class.

The first two cases are coherent paths. For each path, every hop transitions to a new entity related to the previous one, making a coherent explanation from start to end. The last two paths are incoherent. The third path transitions from Oak to Species to Bee. The sense of "species" changed arbitrarily rather than thematically. While a bee is a species of insect, Species more likely refers to a species of tree. This arbitrary change of sense of Species enables a transition to any entity that is a species. While this relation is (at least implicitly) expressed in the text, in the context of the sequence of related entities it breaks the coherence of the theme of topics across the path. The fourth path exhibits incoherence even more severely. Division of Bonner is a geographical entity, unrelated to either of the endpoints. The path transitions to a more abstract concept, The Division, arbitrarily. The only association between those entities is the common term division. The last edge transitions from The Division to NCAA Division I Football Championship. This last edge is technically correct, but implies yet another sense of the term division. In this path, the endpoints share a sports theme, but the inference path is incoherent.

All one hundred path samples were manually annotated for semantic coherence based on the judgment of the investigator. The sample of successful paths contains 68 coherent paths and 32 incoherent paths. From now on, most of the analysis will be done in parallel on both partitions. A high share of the inference paths in the sample are incoherent, warranting further analysis to understand the underlying characteristics that are prevalent in them. The following sections look into some of those important characteristics.

Coherent paths
  - National Basketball Association - Professional sports league organization
  Battle of Stiklestad - The Battle - Battle of Magersfontein
Incoherent paths
  Leaf vegetable - Oak - Species - Bee
  Esteghlal F.C. - Lewis - Division of Bonner - The Division - NCAA Division I Football Championship

Table 3.7: Examples of inference paths

Off-topic Endpoints

Focused reading only has access to the endpoints of the search, without any prior context before starting the search. We say the endpoints are off-topic if they don't immediately seem related to each other without any prior context. Off-topic endpoints don't necessarily mean that the search problem they represent is ill-posed: there may still be a reasonable thematic relation between the endpoints, but reaching it may require more topic switching in the inference in order to achieve coherence. Table 3.8 contains a few examples of search problems with off-topic endpoints. It is easy to appreciate how, without any further explanation, those pairs are entities unrelated to each other. Table 3.9 contains the break-down of off-topic endpoints stratified by coherence. We can observe that coherent paths tend to co-occur more frequently with on-topic endpoint search problems, whereas incoherent paths show approximately the reverse pattern, hinting that search problems with off-topic endpoints are more likely to be harder for focused reading to solve. This difficulty, however, doesn't mean that search problems with off-topic endpoints are necessarily incorrect or impossible to solve coherently. It is worth mentioning that the symmetry of the distributions shown in Table 3.9 for coherent and incoherent paths appears to be a coincidence.

Starting Endpoint         Destination Endpoint
Doctor of Musical Arts    Wayne
2004 NBA Draft            Australian House of Representatives
Indiana                   Color commentator

Table 3.8: Off-topic endpoints examples

                    Off-topic              On-topic
                    Percentage   Count     Percentage   Count
Coherent paths      0.15         10        0.85         58
Incoherent paths    0.84         27        0.16          5

Table 3.9: Off-topic endpoints distributions

Topic Switching

To quantify the effect of semantic drift in the inference paths, we count the number of topic switches, where a topic switch is a transition where the tail and the head of the relation don't share a common topic. Topic switches were determined by the annotator. Topic switches are not bad per se; in fact, they can be necessary. For example, when the search endpoints are off-topic, the inference path eventually needs to change from the topic of the source to the topic of the destination. A topic switch is problematic only if it causes semantic drift in the inference path.

We can observe an example of a path with a topic switch in Table 3.10. The relation between Single and Pop music is clearly of a different nature from that between Association of Tennis Professionals and Single. This is due to two different senses of the word single, meaning a song in the former case and part of a match in the latter. Grounding both to the same concept is an artifact of the information extraction component, and this is a good example of the prevalent issue of semantic drift.

Table 3.11 contains the breakdown of topic switches in the sample. The first observation is that search problems with on-topic endpoints tend to have very few topic switches, regardless of coherence. When those cases have topic switches, they skew towards the lower end of the distribution, having at most two switches in the sample. Problems with off-topic endpoints have a higher share of topic switches. This makes sense because any inference path in this category must at some point switch to a topic close to the goal of the search.

Boycott - Association of Tennis Professionals - Single - Pop music - Radio - Norwegian Broadcasting Corporation

Table 3.10: Inference path with semantic drift

                 Coherent                            Incoherent
                 On-topic        Off-topic           On-topic        Off-topic
# of switches    Pct    Count    Pct    Count        Pct    Count    Pct    Count
Zero             0.90    52      0.10    1           0.60    3       0.03    1
One              0.05     3      0.70    7           0.40    2       0.55   15
Two              0.05     3      0.20    2           0.00    0       0.30    8
Three            0.00     0      0.00    0           0.00    0       0.07    2
Four             0.00     0      0.00    0           0.00    0       0.03    1

Table 3.11: Topic switching in inference paths

Path Lengths

We look at how the length of a path is related to its topical coherence. When an inference path is longer, it has more opportunities for mistakes, as there are more places where semantic drift can occur. Figures 3.2a and 3.2b display the distribution of path lengths stratified by coherence. We can observe that incoherent inferences tend to skew slightly towards longer paths compared to paths with coherent relations.

Problematic Entities

A common observation during the analysis was that a few entities appear in several inference paths despite being unrelated to the theme of the search.

Figure 3.2: Number of hops in successful paths. (a) Coherent paths; (b) Incoherent paths. (Histograms; x-axis: path length, 1-10; y-axis: count.)

We label these problematic entities, because they introduce noise into both the inference and the search process. Problematic entities frequently appear in the knowledge graphs built during focused reading search processes. Each works as a hub, contributing to semantic drift due to its high degree of connectivity, providing the search process with many more options to potentially select a path step whose semantic relation moves away from the other relations in the path.

The existence of problematic entities is an artifact of the information extraction engine: since it is treated as a black box during focused reading, an entity could be grounded to a word sense that does not match the theme of the path, an abstract concept could be labeled as a named entity, etc. These artifacts have a downstream effect on the quality of the results of the search process. Some of the problematic entities in the analyzed sample are: Quotation Mark, a punctuation sign which is picked up as a named entity, prevalent enough to be connected to a diverse set of named entities; E, an accidental case of aggressive stop word filtering in the information extraction engine, which stripped the other words, leaving only the letter as the natural language description of the entity; and the title Turn on the Bright Lights, which is surrounded by several other concepts detected as entities, enough to participate in many relations produced by the information extraction engine, acting as a hub for many spurious relations.

Table 3.12 contains a few examples of paths with a problematic entity (Quotation mark in the first path, Turn on the Bright Lights in the second). It is easy to appreciate how the transitions in and out of the problematic entity result in semantic drift.

Mod (gaming) - Quotation mark - Genre - Hip hop music - Supergroup (music) - Rap rock
2000 NBA Draft - NBA Draft - National Basketball Association - Professional sports league organization - Major professional sports leagues of the United States and Canada - World Hockey Association - Hartford Whalers - Turn on the Bright Lights - New Guinea campaign

Table 3.12: Inference paths with problematic entities

Table 3.13 contains the distributions of inference paths that contain problematic entities, by coherence. Almost all coherent paths are free of problematic entities. The two cases labeled as coherent that contain a problematic entity are coherent in the absence of the problematic entity, meaning that if those problematic entities were removed, the paths would still be coherent and the reasoning encoded in the inference path would still be good. In the case of incoherent paths, problematic entities were prevalent in about half the cases. This means that while the presence of problematic entities in an inference path almost certainly leads to an incoherent hypothesis, it is not the only reason why a path may turn out to be incoherent.

                    With problematic entities    W/o problematic entities
                    Percentage   Count           Percentage   Count
Coherent paths      0.03          2              0.97         66
Incoherent paths    0.44         14              0.56         18

Table 3.13: Prevalence of problematic entities in inference paths 55

Taxonomy Level Switching

A frequently observed phenomenon during the analysis is taxonomy level switching. Some of the entities are fairly concrete, referring to specific, unambiguous concepts, e.g., people's names or locations' names, but other entities refer to classes of things that could have different specific manifestations, such as Film, Album, Battle, etc. When a hop in an inference path changes level, going either from less abstract to more abstract or vice versa, we say there was a taxonomy level switch. We treat this as a categorical variable, as different hierarchies have their own taxon relationships and there is no direct correspondence between them. Table 3.14 shows an example of an inference path that exhibits this phenomenon. The path starts at a concrete entity, climbs up to an abstract concept that represents a parent class, and then moves back down to another concrete instance. In this case, it is easy to observe that a taxonomy level switch doesn't imply a topic switch; they are different phenomena. However, a taxonomy level switch can co-occur with a topic switch, such as a case where an abstract entity has multiple senses, as in Oak - Species - Bee.

River Weaver - The River - River Cherwell

Table 3.14: Taxonomy level switching example

In this analysis the annotator recorded whether a path contains at least one taxonomy level switch. Table 3.15 contains the breakdown of the prevalence of taxonomy level switching by coherence of the inference path. It is clear that this kind of switch is prevalent in both classes of paths, coherent and incoherent. The analysis suggests that it is correlated with coherence in inference paths. Further analysis of this phenomenon could shed more light on its impact.

Conclusions of Positive Outcomes

Semantic drift hinders the quality of inference paths. Incoherent paths are prevalent, making up a significant share of the results of search processes. This is supported by the observation of longer inference paths, which are more likely to be incoherent because they present more chances for semantic drift.

                    With taxonomy level switch    W/o taxonomy level switch
                    Percentage   Count            Percentage   Count
Coherent paths      0.69         47               0.31         21
Incoherent paths    0.44         15               0.56         18

Table 3.15: Prevalence of taxonomy level switching

Similarly, unrelated search endpoints make it harder to retrieve a coherent path using focused reading, as topic switching is necessary to navigate through the knowledge graph to retrieve an inference path. Additionally, the quality of the information extraction component has an influence on the quality of the inference paths. This is observable in the effect of problematic entities. If there were a way to detect and remove them, it could boost the coherence of the inference paths.

3.5.2 Analysis of the Negative Outcomes

When a focused reading process does not successfully finish, it does not produce a complete inference path. There are two ways in which a negative outcome like this can result from a search process: the search either times out by reaching the limit of iterations, or it stops early when the agent decides so. This section is concerned with these cases. The sample of negative outcomes was similarly drawn by randomly choosing one hundred cases from the pool of unsuccessful searches.

Although a failed search does not result in a complete inference path, we can still analyze partial paths by considering how they might be completed. For each element of the sample, we located the entity in the knowledge graph that is closest to the goal of the search process. This is possible because that information is available in the precomputed ground-truth knowledge graph used to derive the search problems. The path from the source of the search to the entity closest to the goal is the partial segment. The remaining shortest path towards the goal is the missing segment, which was not retrieved. If those two segments were combined, they would constitute an inference path connecting the endpoints of the search, given the actions taken by the agent before stopping. Table 3.16 shows an example of a partial segment and a missing segment (a sketch of this segmentation procedure follows the table).

Endpoints         Berkeley Software Distribution, Renting
Partial segment   Berkeley Software Distribution - IBM PC compatible - DOS
Missing segment   Video game - DVD - Renting

Table 3.16: Example of partial and missing segments
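As referenced above, the segmentation into partial and missing segments can be sketched with a graph library. The sketch below uses networkx (an assumption; the original tooling is unspecified) and assumes undirected graphs in which the goal is reachable from every entity of the ground-truth graph.

```python
# Sketch using networkx (an assumed tool) of splitting a failed search into
# partial and missing segments. Graphs are assumed undirected, with the goal
# reachable from every entity in the ground-truth graph.
import networkx as nx

def split_segments(search_kg, gold_kg, source, goal):
    """Return (partial_segment, missing_segment) for a failed search.
    search_kg: KG built during the search; gold_kg: precomputed ground truth."""
    # Among entities reachable from the source in the search KG, find the one
    # closest to the goal according to the ground-truth graph.
    reachable = nx.node_connected_component(search_kg, source)
    closest = min(reachable,
                  key=lambda e: nx.shortest_path_length(gold_kg, e, goal))
    partial = nx.shortest_path(search_kg, source, closest)
    missing = nx.shortest_path(gold_kg, closest, goal)
    return partial, missing
```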

Using these two pieces, we can take a look at what would have been necessary to turn those negative outcomes into positive outcomes and quantitatively measure their characteristics. Beyond those properties, in this analysis we also look at the properties of the search process to understand what may make these specific search problems hard to solve. Appendix B contains the data used to conduct the analysis of this subsection.

Number of Steps and Early Stopping

There are two ways in which a focused reading search can stop without retrieving an inference path:

• Timing out: The agent stops after executing ten steps without retrieving a path. If the step limit were higher, the agent might eventually locate a path after executing more queries.

• Stopping early: One of the actions available to the agent at each step is early stop. If this action is chosen, the search process finishes with a negative outcome.

Figure 3.3 shows the distribution of iterations for the sample of negative outcomes. There was early stopping in 43% of the cases; however, most of the early stops occurred at later stages of the search process (iteration ≥ 7). This suggests the focused reading agent could benefit from a better signal as to when to stop early.

Figure 3.3: Number of iterations for cases with negative outcomes (histogram; x-axis: iterations, 2-10; y-axis: count)

Off-topic Endpoints and Problematic Entities

Similar to the analysis in Section 3.5.1, we investigate how prevalent off-topic endpoints are in the search problems that comprise the negative-outcome error analysis. The sample contains 89 search problems with off-topic endpoints and 11 with on-topic endpoints. This distribution is the mirror image of the one for the successful paths, where approximately 85% of the search problems had on-topic endpoints. Problematic entities are also present in this sample: 47 search problems had a problematic entity in either their partial or missing segment and 53 did not. The proportion is roughly balanced, which is similar to the prevalence of problematic entities in incoherent inference paths in Section 3.5.1.

Missing Path Segment Lengths

The number of hops missing to reach the goal of the search is a proxy for how difficult each problem is. Inspecting the missing segments' lengths can thus give some insight into the overall difficulty of the problems in the negative-outcome sample. Figure 3.4a displays a histogram of the number of hops of the missing segments. Figure 3.4b shows the distribution of path lengths if both partial and missing segments were to be combined.

We can appreciate that those candidate solutions are long, suggesting the problems are hard. Comparing those lengths to those in Figures 3.2a and 3.2b, it can be seen that if the negative outcomes could be turned around, the average lengths would be higher. And the longer the path, the higher the chance of semantic drift.

Figure 3.4: Number of hops in inference paths. (a) Missing segments; (b) Partial + missing segments. (Histograms; x-axis: number of hops, 0-12; y-axis: count.)

Knowledge Graph Size

We measure the size of the knowledge graph with three different quantities: the number of documents processed during the search process, the number of entities, and the number of relations in the knowledge graph. The number of documents is correlated with the amount of information that is present in the knowledge graph. The number of entities and relations is proportional to the size of the action space during a focused reading search and upper-bounds the complexity of the inference, which is carried out as a shortest path search.

Tables 3.17, 3.18 and 3.19 display the binned frequencies of sizes, in documents, entities and relations respectively. In each table, columns two and three summarize the sizes for the negative outcomes. To put those numbers in perspective, columns four through seven summarize the same distribution over the positive outcomes. In all cases, the negative outcome cases tend to have larger knowledge graphs compared to positive outcomes, regardless of coherence. A possible reason for this is that larger knowledge graphs have larger pools of candidate actions for the agent to choose from, and it is more likely that the beam search didn't sweep through the arguments to the query templates necessary to retrieve the correct documents. There is not a lot of difference between the distributions of coherent and incoherent paths because, as far as the focused reading agent is concerned, they are equivalent: none of its decisions factor in the properties examined in this analysis.

                      Negative Outcomes     Coherent Paths        Incoherent Paths
Number of documents   Count   Percentage    Count   Percentage    Count   Percentage
(0, 100]              45      0.45          52      0.76          25      0.78
(100, 200]            46      0.46          12      0.17          4       0.12
(200, 300]            2       0.02          2       0.03          1       0.03
(300, 400]            1       0.01          1       0.02          2       0.06
(400, 500]            0       0.00          0       0.00          0       0.00
(500, 600]            0       0.00          0       0.00          0       0.00
(600, 700]            0       0.00          0       0.00          0       0.00
(700, 800]            5       0.05          1       0.02          0       0.00

Table 3.17: Size of the knowledge graph in number of processed documents

                      Negative Outcomes     Coherent Paths        Incoherent Paths
Number of entities    Count   Percentage    Count   Percentage    Count   Percentage
(0, 100]              75      0.75          61      0.90          27      0.85
(100, 200]            17      0.17          5       0.08          3       0.09
(200, 300]            2       0.02          1       0.01          2       0.06
(300, 400]            0       0.00          0       0.00          0       0.00
(400, 500]            0       0.00          0       0.00          0       0.00
(500, 600]            5       0.05          1       0.01          0       0.00
(600, 700]            1       0.01          0       0.00          0       0.00

Table 3.18: Size of the knowledge graph in number of entities

                      Negative Outcomes     Coherent Paths        Incoherent Paths
Number of relations   Count   Percentage    Count   Percentage    Count   Percentage
(0, 100]              76      0.76          62      0.92          27      0.85
(100, 200]            17      0.17          4       0.06          3       0.09
(200, 300]            1       0.01          1       0.01          2       0.06
(300, 400]            0       0.00          0       0.00          0       0.00
(400, 500]            0       0.00          1       0.01          0       0.00
(500, 600]            5       0.05          0       0.00          0       0.00
(600, 700]            1       0.01          0       0.00          0       0.00

Table 3.19: Size of the knowledge graph in number of relations

Semantic Drift

It is not straightforward to assess the presence of semantic drift for the negative outcomes. In this analysis we measured the prevalence of topic switches in the partial segments and in the missing segments. Additionally, we analyzed whether the partial segment strays away from the missing segment; that is, whether finding the edge that connects both segments would incur a topic switch. This analysis loosely resembles a divide-and-conquer approach, where the path is split in two, partial and missing segments, and each is analyzed individually, and the stitching of the two parts is analyzed as well.

Table 3.20 displays the prevalence of semantic drift for the three cases. Semantic drift is prevalent in all of them, especially in the missing segments, 76% of which contain at least one topic switch. If we consider the cases in which at least one of the segments suffers from drifting, 93 of the cases suffer from semantic drift. This is similar to the proportion of cases that have off-topic endpoints, which also suggests that these cases are hard for a focused reading search process.

                   With semantic drift    W/o semantic drift
Partial segment    48                     52
Missing segment    76                     24
Intersection       49                     51
Either part        93                      7

Table 3.20: Semantic drift in negative outcome cases

Conclusions of Negative Outcomes

The analysis reinforces the previous observation that semantic drift hinders the effectiveness of focused reading. Inspecting the missing segments reveals that they suffer from topic switching, making it harder to discover a complete path during the search process. The lengths of the missing segments suggest that the search problems studied here are considerably harder to solve. Since they are longer, there is a higher chance of topic switching within them, as shown in Table 3.20. Had the missing segments been shorter, the searches might well have finished, but by retrieving incoherent inference paths. The endpoints of the search problems in this analysis are mostly off-topic, which adds additional complexity to the problem overall.

Similarly, focused reading struggles with larger knowledge graphs. The number of entities is directly proportional to the number of actions available to the agent, making it harder for the agent to make a good decision. We also observed that early stopping happens frequently, but most of the time at later stages of the search process, suggesting that a stronger state representation may lead to better use of the early stop action.

3.6 Discussion

In this chapter we described several improvements to focused reading to learn how to search in an open domain. We introduced state representation features that capture the underlying semantics present in the documents of the corpus based on topic models. We also expanded the granularity of the action space to consider a higher number of actions at any given step. We derived a dataset of search problems using Wikipedia articles to test focused reading in the general domain. The results show that reinforcement learning produces policies that compare favorably to baseline policies. We performed an error analysis to understand when and why focused reading fails to find an inference path between the endpoints of the search problem, or finds one that is not topically coherent.

We found that a key challenge facing open-domain focused reading is semantic drift. In future work, learning how to search can be improved by explicitly modeling contextual information and using this context to discriminate which individual entities and relations should be included in the knowledge graph, and to better estimate which entities are relevant to consider for the following action. The results of the analysis suggest that if context information can be explicitly factored into focused reading, the search agent may be able to successfully return an inference path more frequently, and the overall topic coherence of the paths may increase. Following this research direction, in the next chapter we introduce a methodology to identify and associate contextual relations extracted in the biomedical domain. We leave open as future work the development of methods for contextual constraints in focused reading.

CHAPTER 4

Contextualizing Information Extraction

[The contents of this chapter are based on work previously published as Noriega-Atala et al. [2019]]

4.1 Introduction

Progress has been made in automating biological event extraction from biomedical texts [Zhou et al., 2014], but little attention has been given to identifying and associating the biological context in which such events occur. Biological context, however, often plays a critical role in interpreting these events. For example, the following is a summary of a key finding in a paper by Young and Jacks [2010]:

Mutations in oncogenes are much more likely to lead to cancer in some tissue types than others, because some tissues express other proteins that counteract the oncogene. For example, in mice, the G12D activating mutation in K-ras causes lung tumors but not muscle-derived sarcomas, because muscle cells express two proteins (Arf and Ink4a) that cause cell division to halt when Ras is overactive.

An automated event extraction system might extract the biochemical event "G12D activates mutation in K-ras", but without understanding the biological context – of whether this event occurs in lung or muscle tissue – the reader will not understand why the event does or does not lead to cancer.

Biological context is not only important, it also comes in many varieties. Here we focus on biological container context, where a biological "container" may be specified at various levels of granularity, but each level serves to further specify the type of biological system in which an event might occur. From the highest level of granularity, we consider species (human, mouse), then tissue (lung, lymphoid), and

finally cell type (t-cell, endothelial). Container contexts across levels often stand in mereological ("part-whole") relationships, but knowing a finer level of granularity does not always fully determine higher levels. For example, a species may contain several tissue types, but these may also be present in other species. To fully understand the biological context in which an event occurs, we need to know the container types at each level of specification.

In fact, a special case of biological context specification comes in the form of naming the cell line used in experiments. Cell lines comprise a specific cell culture cloned from a single cell and therefore consist of cells with a uniform genetic makeup. Cell lines available for purchase typically specify the cell type, tissue, and species from which they were derived. For example, the PCS-100-020 cell line1 is derived from endothelial cells of the artery tissue of a human (species).

In this chapter we treat the problem of extracting biological container context as one of identifying container context mentions, a problem of named entity recognition (NER), and of associating them with events, a kind of relation extraction. A key challenge for context association is that context mentions are often not found in the same sentence as the event, making this an inter-sentential relation extraction problem. For example, consider the following excerpt [Bustelo et al., 2012]:

This route promotes the translocation of Rac1/RhoGDI to F-actin-rich membrane areas, the Pak-dependent release of RAC1 from the complex and Rac1 activation. This pathway is important for optimal Rac1 activation during the signaling of the EGF receptor, integrins and the antigenic T-cell receptor.

Here, the three underlined events in the first sentence are associated with the T-cell context in the second sentence.

The main contributions of this chapter are: (1) we provide an analysis of the context-event inter-sentential relation extraction problem, (2) we introduce a corpus of context-event relations for evaluation, and (3) we present first results of an inter-sentential context extraction and association model that provides a baseline for future work.

1http://www.atcc.org/en/Products/Cells_and_Microorganisms/Human_Primary_Cells/Cell_Type/Endothelial_Cells/PCS-100-020.aspx

4.2 Related Work

The context association problem relates to two general problems that have been studied in the natural language processing and linguistics communities.

The first problem, relation extraction, has received extensive attention [Banko et al., 2007; Bach and Badaskar, 2007], including within the biomedical domain [Quan et al., 2014; Fundel et al., 2007], with recent promising results incorporating distant supervision [Poon et al., 2015]. All of this work, however, focuses on identifying relations among entities within the same sentence. The context association problem, on the other hand, deals with inter-sentential relations, and as Bach and Badaskar (2007) note, "it is not straightforward to modify [sentence-level] algorithms ... to capture long range relations." Very little prior work has studied inter-sentential relation extraction. A notable exception, Swampillai and Stevenson [2011], combined within-sentence syntactic features with an introduced dependency link between the root nodes of parse trees from different sentences that contain a given pair of entities. Swampillai & Stevenson used these features to train an SVM to extract inter-sentential relations from the MUC6 corpus.2 In contrast, our work is within the biomedical domain, requiring the development of a different set of features, and we also develop a novel feature aggregation technique that facilitates improved context association, as described in the following sections.

Context-event association also bears similarity to a second problem, bridging anaphora resolution, which has been primarily investigated theoretically in the linguistics literature. Bridging anaphora aims at identifying associations between entities at the discourse (rather than single-sentence) level. As Irmer [2009] notes, the relation between the two entities "is not explicitly stated by linguistics means", but

2https://catalog.ldc.upenn.edu/LDC20003T13

knowledge of the relation "is necessary for successfully interpreting a discourse." As in the case of container contexts, for example, the relation may be mereological: e.g., I looked into the room. The ceiling was very high. [Irmer, 2009].

Freitas [2005] presents a computational model of bridging anaphora that makes use of Discourse Representation Theory to create a rule-based system for determining what kind of bridging anaphoric relationship two entities might have. By contrast, Poesio et al. [2004] developed a multi-layer perceptron classifier that uses a measure of lexical distance derived from the WordNet database [Fellbaum, 1998], among other features, to achieve a maximum accuracy score of 79.3% on a small corpus. Both models provide interesting approaches to subclasses of bridging anaphora resolution, but neither generalizes to the biomedical context-event association problem, where a complete reworking of relevant features has been required to successfully associate biological container context with events.

Some prior art exists specifically to contextualize biochemical events. Gerner et al. [2010a] associate anatomical contextual containers with event mentions that appear in the same sentence, via a set of rules that considers lexical patterns in the case of ambiguity and falls back to token distance if no pattern is matched. Sarafraz [2012] elaborates on the same idea by incorporating dependency trees into the rules instead of lexical patterns, as well as introducing a method to detect negations and speculative statements. The method we present in this chapter is related to this prior art in the sense that we attribute contextual relations between entities and biochemical events, but focuses on inter-sentential relations instead of intra-sentential ones.

4.3 Dataset

With the help of three biology domain experts, we compiled an annotated corpus of biological container context mentions associated with biochemical events. The corpus consists of 22 biomedical research papers about the Ras cancer pathway. All of the papers are available from the PubMed Open Access3 repository. The complete set of annotations is also open source and available online.4

The first step in constructing the annotation corpus involved identifying mentions of biochemical events within the text. Here, a biochemical event is a relation between one or more entities participating in a biochemical reaction or its regulation. A mention of a biochemical event can be identified by a trigger word, where trigger words are usually the name of the chemical reaction, e.g., phosphorylation, ubiquitination, expression, etc. A sentence may contain more than one event. For example, the phrase “phosphorylation of plexin-As by nonreceptor tyrosine kinases Fes and Fps and Fyn” contains a total of four events: one phosphorylation event and three different regulation events (each kinase regulates the phosphorylation). Each trigger word is considered once, and forms the basis for one event in the corpus.

We identified biochemical events using two independent methods. First, we asked our biology domain experts to go through the papers and identify spans of text that they believe express one or more biochemical events. For this task, they were provided with an interface by which they could simply highlight spans of text (or move span boundaries). Contiguous spans of text could contain a mention of more than one biochemical reaction, but we treat each contiguous span of text as just a single mention.

We also used REACH [Valenzuela-Escárcega et al., 2018; Valenzuela-Escárcega et al., 2017b], an open source biomedical event extraction system built on top of the ODIN information extraction library [Valenzuela-Escárcega et al., 2015b], to identify biochemical events. REACH associates with each extracted event the source span of text that served as evidence for the event. We combined any event mentions whose spans overlap to count them as one single event. In the example above, REACH does identify four separate event mentions, but they would be combined into one event mention text span.
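The span-merging step just described can be illustrated with a short sketch. This is a minimal rendering of standard interval merging, not REACH's actual implementation, and the character offsets below are illustrative:

def merge_overlapping_spans(spans):
    """Merge (start, end) spans that overlap, keeping one span per group."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:  # overlaps the previous span
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# e.g., four event mentions over the same phrase collapse into one text span
print(merge_overlapping_spans([(10, 55), (10, 30), (25, 40), (70, 90)]))
# -> [(10, 55), (70, 90)]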

3 http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
4 https://ml4ai.github.io/BioContext

The next step in the corpus construction involved identifying any mentions of biological container context. REACH includes a named entity recognition (NER) facility that detects candidate context mentions in text and grounds them to a consistent ID. The NER facility works by matching word tokens against multiple knowledge bases. These knowledge bases are dictionaries that map a sequence of words to a unique grounding identifier. Several different words or phrases may share the same identifier when multiple lexical expressions refer to the same kind of entity; e.g., “human” can be indicated by woman, man, patient, child, etc. This design is inspired by the Linnaeus system, a taxonomy-based NER system for labeling species mentions [Gerner et al., 2010b]; in fact, the species NER knowledge base is a subset of the Linnaeus dictionary. Every individual knowledge base in REACH has a category; the categories are species, organs, tissue types, cell lines, and cellular components. Each of these categories represents a different notion of biological container, and they are not mutually exclusive. They were put together by scraping specialized websites that contain curated enumerations of entities belonging to those categories. When a sequence of words matches an entry of a knowledge base, its category together with the grounding identifier constitutes a context type. For this work, we have restricted our use to the knowledge bases of species, tissue types, and cell lines. While context mentions also take up spans of text (usually one to just a few words), they generally do not overlap, unlike event mentions.

With the spans of text identified as containing biological event and context mentions, we then asked the domain expert annotators to identify the context mentions associated with each span of text associated with one or more events. For this task, the annotators were provided an annotation tool that displayed the original text, with spans of text associated with an event highlighted in green and spans of text associated with a context mention highlighted in yellow; the annotators could then select an event and context span and indicate whether they are associated. Each container context mention associated with an event (as a text span) is then taken to constitute one positive instance of a context-event relation. Multiple context mentions (whether of the same type, e.g., two instances of human, or different, e.g., one instance of human and one of rat) may be associated with the same event, each comprising a separate context-event relation.
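Returning to the NER grounding step described above, the dictionary lookup that yields a context type can be sketched as follows; the knowledge base entries and grounding IDs here are hypothetical placeholders, not REACH's actual data:

SPECIES_KB = {
    "human": "taxonomy:9606", "woman": "taxonomy:9606",
    "man": "taxonomy:9606", "patient": "taxonomy:9606",
    "mouse": "taxonomy:10090",
}

def ground(phrase, kb, category):
    """Return a (category, grounding ID) context type for a matched phrase."""
    grounding_id = kb.get(phrase.lower())
    return (category, grounding_id) if grounding_id else None

print(ground("Patient", SPECIES_KB, "species"))  # ('species', 'taxonomy:9606')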

Figure 4.1: Aggregated κ scores and association counts for all contexts, binned by κ score ranges.

4.3.1 Negative Examples and Extended Positive Examples

The annotation process produced a gold-standard set of event-context associations, but two problems remain. First, the annotations provided by the domain experts consisted of only positive examples. The annotators reported that it was very unnatural to identify explicit negative examples, that is, contexts that were categorically not related to a given event. As our classifier learning framework described in Section 4.5.1 requires both positive and negative examples, we therefore developed a method for estimating negative instances.

The second, related, problem is that each annotation relates an event to a specific context mention. However, other instance mentions of that context type might occur in sentences that the annotators did not label (we will return to this distinction between mention and type again in Section 4.5.1). Again, annotators found it most natural to identify the context mention instances that were directly relevant, but not to exhaustively include all instances that might also be relevant or irrelevant. We make the simplifying assumption that if an annotator associated one context mention with an event, then for the purposes of constructing a training data set, all other instances of that context type mentioned in the paper are also relevant to associating that context with the event.

We used the REACH-extracted context types that were not annotated by our domain experts as associated with an event to build a set of negative example context-event pairs (addressing the first problem) and to extend the number of positive example context mention and event pairs (addressing the second problem). These were constructed as follows. First, each paper (represented in an XML format) was processed by an NLP pipeline [Manning et al., 2014b] to transform it into a plain text representation separated by sentence. This representation allowed us to associate every annotation with its location relative to the sentences in the corpus. Next, we considered all the REACH-extracted context mentions paired with each event mention; if the context mention did not have the same grounding ID as one of the expert-annotated context mentions for that event, it was labeled as a negative example for that event; on the other hand, if it did have the same ID as one of the expert-annotated contexts, then it was labeled a positive example.

The above procedure resulted in two context mention sets, each containing zero or more context mentions that come from sentences throughout the paper: one set representing positive context mention associations, the other representing context mentions whose context type is assumed to not be associated with the event (negative associations).

In Section 4.5, we evaluate the performance of a set of classifiers designed to label context types associated with events. Each classifier takes as input a paper with already identified events and, based on context mentions extracted by REACH, determines what context types are associated with each event. Rather than test whether each context mention individually indicates that a context type is associated with the event, we instead aggregate the evidence of all context mentions of the same context type (as indicated by the REACH grounding ID). This evidence aggregation is achieved by extracting a feature vector associated with each context mention and event instance and then combining the feature vectors that share the same context type (as described in Section 4.5.1). After aggregation, our data set derived from the annotated context-event associations consists of 2,523 positive instances of events associated with particular context types, and 20,000 instances of events that are negatively associated with context types.
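A minimal sketch of the positive/negative labeling procedure above, assuming mentions are represented as dictionaries carrying a REACH grounding ID (the field names are illustrative placeholders):

def label_pairs(event, context_mentions, annotated_ids):
    """Split a paper's context mentions into positive and negative examples
    for one event by comparing grounding IDs against expert annotations."""
    positives, negatives = [], []
    for ctx in context_mentions:
        if ctx["grounding_id"] in annotated_ids:
            positives.append((event, ctx))  # same type as an annotated context
        else:
            negatives.append((event, ctx))  # unannotated type, assumed negative
    return positives, negatives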

Figure 4.2: κ scores for top 15 contexts by association count (contexts shown: pancreatic cancer (Tissue), HEK-293 (Cell Line), membranes (Cellular Component), neurons (Cell Type), glioblastoma (Tissue), glioma (Tissue), DBTRG-05MG (Cell Line), melanoma (Tissue), MEFs (Cell Line), mouse (Species), human (Species), growth cones (Cellular Component), fibroblasts (Tissue), embryonic fibroblasts (Cell Type), Drosophila (Species)).

4.3.2 Inter-Annotator Agreement

A subset of 11 papers was used to analyze inter-annotator agreement. Each of the papers' events was considered together with the set of context mentions found in that paper's text: each potential context type paired with an event was treated as a binary classification task, where each of the three annotators judged whether the context type was associated with the event. Because the set of potential context types varies across papers, we calculated Fleiss' kappa [Fleiss and Cohen, 1973] scores, κ, measuring the agreement between the three annotators for each context type separately. Figure 4.1 shows the general distribution of the κ scores and the frequency with which these contexts were associated with events, and Figure 4.2 shows the κ scores for the top 15 context types by association count.

In addition to inter-annotator agreement, we also measured the amount of agreement between REACH and our annotators by looking at the degree of overlap between the text spans about events picked out by our annotators and the event spans picked out by REACH. Event spans were taken to be overlapping if they shared at least one word between them, and the REACH spans were considered against the set of manual events that were common to all three annotators. Out of a total of 1,629 events, 130 were picked out by only REACH, 626 were picked out by only the annotators, and 873 were identified by both, resulting in a Jaccard similarity index of 0.536.
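The span-agreement figure amounts to a Jaccard computation over the overlap counts; a minimal sketch using the counts reported above:

reach_only, annotators_only, both = 130, 626, 873
jaccard = both / (reach_only + annotators_only + both)
print(f"Jaccard similarity: {jaccard:.3f}")  # 0.536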

Qualitatively, the domain experts suggested a number of reasons why the agreement between annotators (for context associations) and with REACH (for event spans) might have varied relatively widely. First, the contexts mentioned in a text are sometimes themselves modified in the course of setting up experimental conditions. Consider the following example from Hazeki et al. [2011]:

To further investigate the role of p110γ in CpG localization, Cos7 cells were transfected with p110γ and its mutant forms (unlike macrophages, Cos7 cells do not express p110γ).

Some desired property (e.g., the expression of p110γ) might not usually be found in some context (e.g., the Cos7 cell line), so our annotators sometimes disagreed about whether that context should indeed be associated with subsequent events in the paper. The annotators also observed that event spans picked out by REACH sometimes properly contained more than one actual event, and the annotators might then disagree about whether that span as a whole should be associated with some context. Finally, the annotators noted that container contexts from the more granular level (e.g., species) might not be salient in papers dealing with very low-level events (e.g., interactions at the molecular level, or crystal structure studies), and therefore disagreed about how to assess granular context associations. These are very important observations and point to the need for the further technology developments required to fully capture all of the semantics of context.

General features
    Sentence distance: No. of sentences separating the event and context mentions
    Dependency distance: No. of edges separating the mentions within the dependency graph
    Context type frequency: No. of context mentions of the same type
    Is context closest: Indicates whether the context mention is the closest one to the event

Φ features (an instance for each of the event and context mentions)
    Is sentence first person
    Is sentence past tense
    Is sentence present tense

Syntactic features
    Event spanning dependency bigrams: Sequence of dependency bigrams spanning from the event mention
    Negated event mention: Indicates whether a neg dependency is within 2 degrees in the dependency graph
    Context spanning dependency bigrams: Sequence of dependency bigrams spanning from the context mention
    Negated context mention: Indicates whether a neg dependency is within 2 degrees in the dependency graph

Table 4.1: Classification features

In this work, we have preserved the original annotations, but more sophisticated parsing (e.g., of more of the component structure of biochemical events and of each paper’s experimental setup) will be needed to properly tackle these concerns. We leave these as open problems for future work.

4.4 Features

Inter-sentence relation extraction is more challenging than intra-sentence relation extraction primarily because a number of traditional linguistic features, such as information about syntactic dependencies, are unavailable across sentences. We model inter-sentence context relation extraction as a supervised learning problem. As discussed in Section 4.3.1, we consider the task of identifying whether a context type is associated with an event mention, given evidence from the context mentions in the text, although our corpus consists of annotations of relations between event and context mentions. To model this task, we aggregate all of the instances of the features associated with each context-mention/event-mention relation that share the same context type. This is done by first constructing a feature vector for each individual context-mention/event-mention relation. We can then consider different feature vector aggregation schemes to construct a single feature representation for the evidence in the paper for the relationship between a context type and an event mention.

We begin by describing the features that make up the context-mention/event-mention feature vectors and then describe how they are aggregated. Similar to the representation scheme used by Swampillai and Stevenson [2011], we incorporate local syntactic features associated with the context mentions and events. However, we also incorporate several measures of the distance between context mentions and events.

Table 4.1 summarizes the features used for this work, grouped into three functionally similar categories. Features from the general category concern kinds of distances. Sentence distance counts the number of sentences between the context and event mentions: if they are in the same sentence it takes a value of zero, adjacent sentences are distance one, and so on. Dependency distance is similar in spirit, but counts how many edges apart the two mentions are in the dependency parse graph. If the two mentions are not in the same sentence, an artificial edge connecting the roots of the dependency graphs of the two sentences is introduced and then the edge count is performed. Context type frequency is the count of context mentions of the same type within the current document, and is context closest is a Boolean value that is 1 when the context mention is the closest one to the event mention, and 0 otherwise.

Phi features represent other linguistic characteristics of the sentences containing the mentions. For each of the listed phi features, an instance is created for the sentence in which the context mention occurs and for the sentence in which the event mention occurs. These features are also Boolean valued, set to 1 whenever the assertion holds true, and 0 otherwise. Part-of-speech tags were used to implement these features; e.g., the past tense feature uses the verb's tag to check whether it is contained in the set of possible past tenses (VBD or VBN),5 and similarly for the other two features in this category.

Syntactic features rely on dependency parses of the containing sentences and are dynamically generated. Spanning dependency bigrams are derived from the spanning tree of depth two rooted at the head token of a mention. The bigram features derived from all of the dependency paths are combined in a bag of bigrams, where the labels of the edges on a path become the elements of the corresponding bigram. Negated mention features look for the presence of a neg dependency within the spanning tree just described; if present they are set to 1, otherwise 0.

5 https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Figure 4.3: Precision, recall and F1 score per classifier. The dashed red line indicates the F1 score of the baseline.

We have explored several feature aggregation methods. The method we found works best, and which is the basis of the results reported here, is to form a single vector constructed out of statistical summaries of the individual context-mention/event-mention feature vectors. In particular, we compute the average, minimum value, and maximum value for each feature vector element across the feature vectors in the set of context-mention/event-mention feature vectors, resulting in an average, minimum, and maximum value vector (respectively); these are then concatenated to form a final vector three times the length of the original vectors.
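A minimal sketch of this aggregation scheme, assuming each row of mention_vectors holds one context-mention/event-mention feature vector for the same context type (the numbers are illustrative):

import numpy as np

def aggregate(mention_vectors: np.ndarray) -> np.ndarray:
    """Concatenate the element-wise mean, minimum, and maximum across the
    mention-level vectors, yielding one vector three times the original length."""
    return np.concatenate([
        mention_vectors.mean(axis=0),
        mention_vectors.min(axis=0),
        mention_vectors.max(axis=0),
    ])

vectors = np.array([[1.0, 0.0, 3.0], [2.0, 1.0, 1.0]])
print(aggregate(vectors))  # [1.5 0.5 2.  1.  0.  1.  2.  1.  3.]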

4.5 Experiment

Using the feature representation described in Section 4.4, we trained and evaluated a number of different supervised learning classifiers for the context-event association task, within a cross-validation evaluation framework.

The intended use case for our proposed method is to take as input a new paper, pre-process it with a machine reading system, such as REACH, in order to extract all context and event mentions, and then run the context-event association classifier to label events by their associated context type. For this reason, the natural unit of input is a single paper. Each paper is therefore a fold in our cross-validation evaluation, and we have a total of 22 papers in our corpus, for 22 folds. We performed leave-one-out cross-validation, iteratively holding out one paper as the test set. In this evaluation, we used micro-averaged F1 scores [Manning et al., 2008b]. A micro-averaged score weights the contribution of each fold proportionally to the amount of data it contributes to the overall data set, making the final score more robust to folds that contain a proportionally small number of annotated events and would therefore not be as representative.

At each iteration, we need to train the model but also search for the combination of features that performs best with that model. In order to have a basis for evaluating how well a feature set performs, we partitioned the remaining 21 papers, uniformly at random, into a validation set of 4 papers and a set of 17 papers to use for training. We then considered different combinations of features by taking the power set6 of the feature groups described in Table 4.1. In total, there were 16,383 possible combinations of features for each classifier. Because the cardinality of the power set is considerable, the experiments were performed on an HPC cluster7 to find the optimal combination of features for each of the machine learning algorithms without relying on feature selection approximation heuristics. For each of these feature sets, we trained a model and evaluated its performance on the validation set. The model with the feature set that achieved the highest validation F1 was then evaluated (with no further changes) on the held-out test set. We then repeated this procedure for each iteration of leave-one-out cross-validation.

6 ... minus the empty feature set.
7 An allocation of computer time from the UA Research Computing High Performance Computing (HPC) at the University of Arizona is gratefully acknowledged.
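A schematic sketch of this evaluation loop follows; papers, make_model, and f1_on are hypothetical placeholders for the corpus, model training, and micro-averaged F1 computation, and the feature-set search is shown exhaustively, as in the experiments:

from itertools import combinations
import random

def loo_evaluation(papers, feature_groups, make_model, f1_on):
    test_scores = []
    for held_out in papers:
        rest = [p for p in papers if p is not held_out]
        random.shuffle(rest)
        validation, training = rest[:4], rest[4:]
        # all non-empty subsets of the feature groups (2^n - 1 candidates)
        candidates = [set(c) for k in range(1, len(feature_groups) + 1)
                      for c in combinations(feature_groups, k)]
        best = max(candidates,
                   key=lambda fs: f1_on(make_model(fs, training), validation))
        test_scores.append(f1_on(make_model(best, training), [held_out]))
    return test_scores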

4.5.1 Baseline and Model Results

A simple but reasonable deterministic classifier was developed to serve as a baseline, described in Algorithm 3. Intuitively, the baseline classifier does the following: given the index of the sentence in which an event occurs, build a two-sided interval of width k sentences around the event sentence and conclude that any context mentioned in the sentences within the window is associated with the event.

This baseline classifier was run within the same cross-validation loop and was “trained” by performing a parameter search for k ∈ [0, 6], with the best k selected according to performance on the validation set. In this way, the predictive capability of the algorithm can be compared in the same terms as the machine learning models.

Algorithm 3 Deterministic context baseline
function IsContext(evt, ctx, k)
    evtSen = getSentenceIx(evt)
    ctxSen = getSentenceIx(ctx)
    interval = [evtSen − k, evtSen + k]
    if ctxSen ∈ interval then
        return True
    else
        return False
    end if
end function
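A direct Python rendering of Algorithm 3 might look as follows; the sentence-index helper mirrors the pseudocode and is a hypothetical placeholder:

def is_context(evt, ctx, k, get_sentence_ix):
    """Deterministic baseline: a context is associated with an event iff its
    sentence falls within a window of k sentences around the event sentence."""
    evt_sen = get_sentence_ix(evt)
    ctx_sen = get_sentence_ix(ctx)
    return evt_sen - k <= ctx_sen <= evt_sen + k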

Beyond the baseline model, we evaluated the following classifiers:

• Logistic Regression with an ℓ2 regularization penalty

• Support Vector Machines with the following kernels: linear, polynomial, and Gaussian

• Random Forest
• Feed-Forward Neural Network

The hyper-parameters for each of the algorithms, such as the regularization coefficient of the logistic regression, the degree of the polynomial kernel, the maximum depth of the trees in the random forest, etc., were tuned through manual exploration. The feed-forward neural network had a single hidden layer.

Figure 4.3 compares the micro-averaged precision, recall, and F1 scores for each model evaluated in the cross-validation. Again, these averages are computed across the 22 folds of cross-validation, where for each model within each fold, a search was performed to find the combination of features that allowed the model to perform best on the validation set for that fold. The dashed line in the figure indicates the micro-averaged F1 score for the baseline classifier. In general, the trained classifiers all achieved average F1 scores higher than the baseline, with the best performing models being the random forest and neural network classifiers.

To test whether each non-baseline classifier performed significantly better than the baseline, we performed a bootstrap resampling test [Cohen, 1995]: for each model we uniformly randomly sampled, with replacement, the same number of context-to-event associations as in the original 22 papers, computed the F1 scores of the model and the baseline on that sample, took the difference of the baseline F1 from the model F1, and repeated this 1,000 times (per model). For each non-baseline classifier we found that its F1 score exceeded the baseline in at least 95% of the cases.

Figure 4.4 shows, for each model, the frequency with which each feature ended up being selected as part of the set of features that allowed the model to perform best on the validation papers (as there were a total of 22 cross-validation folds, the maximum possible frequency is 22). This provides some insight into which features, in general, tended to provide more useful information for each model.

Figure 4.4: Feature usage per paper for each classifier

The spanning dependency bigrams with respect to the event mentions are seldom used by the classifiers, but the dependency bigrams with respect to the context mentions are frequently used, suggesting that syntax is correlated with the presence of a context relation. The is context closest Boolean feature is one of the most frequently used features. This is consistent with the intuition that context information gets established close to the statements of interest, in this case, biochemical reactions and biological processes. Another interesting pattern is that the context type frequency is also almost always used, suggesting that the number of times a context type is mentioned is also highly correlated with whether the context will be associated with an event.

Finally, Figure 4.5 shows the differences in model performance across the papers. In the figure, each column represents the F1 score of each model on the respective paper, where papers are sorted by the F1 score of the baseline (red x's) from least to greatest. The x-axis labels reference the last three digits of the paper PubMed ID, and below, in parentheses, are the total number of context-mention/event-mention candidate relations involved in providing evidence for the context type label. The overall best-performing feed-forward network (whose F1 is denoted by the black x's) generally performed significantly better than the baseline, except for papers #906 and #001.
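The bootstrap comparison above can be sketched as follows, assuming paired per-instance predictions for a model and the baseline plus an f1 helper (all names are placeholders):

import random

def bootstrap_wins(model_preds, baseline_preds, gold, f1, n_iter=1000):
    """Fraction of bootstrap samples in which the model's F1 exceeds the
    baseline's; values of at least 0.95 correspond to the reported test."""
    n, wins = len(gold), 0
    for _ in range(n_iter):
        idx = [random.randrange(n) for _ in range(n)]
        model_f1 = f1([model_preds[i] for i in idx], [gold[i] for i in idx])
        base_f1 = f1([baseline_preds[i] for i in idx], [gold[i] for i in idx])
        if model_f1 > base_f1:
            wins += 1
    return wins / n_iter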

4.6 Discussion

In this chapter we described the problem of extracting and associating biological container context with biochemical events in biomedical texts. We cast this as an inter-sentential relation extraction problem, where the entities being related (in this case, biochemical interaction event mentions and biological container context mentions) can be, and often are, a number of sentences apart from each other. To date, very little work has been done on contextual relation extraction, and more work is needed to develop domain-general techniques. However, we believe our contribution here takes some steps in this direction, and provides a strong baseline for work in the application domain of associating biological container context with biochemical events.

Figure 4.5: F1 scores per paper for all models

We developed a set of features for this domain and demonstrated their variable use for this task with a variety of state-of-the-art classification methods. The categories of features include syntactic features, distance-based features, phi features, and frequency-based features.

There is ample room for improvement. We believe improvements to discourse modeling and parsing will be a key source of future advances in inter-sentential relation extraction. In particular, biomedical research articles have a conventional structure with an expected set of sections: Introduction, Materials and Methods, etc. These sections in turn have different contents, and we are interested in better exploiting the particular discourse properties of each to improve how we extract the associations of information embedded in the paper. For example, a context type mentioned in the Abstract section may be relevant to events across the whole paper, whereas a particular cell line mentioned multiple times in a long Methods section could have high importance locally, but be much less relevant in other sections. To leverage these structural properties, discourse-based features could be used in tandem with sequence-aware machine learning algorithms, such as recurrent and LSTM deep neural networks.

Future work on learning how to search can leverage the contextualization of information extraction in order to provide more accurate inference paths. Contextualized information could also be used to better inform the automated agent at the time of formulating and executing information retrieval queries.

The annotated corpus, code, and instructions used to implement the experiments described in this chapter can be found at https://ml4ai.github.io/BioContext.

CHAPTER 5

Conclusions and Future Work

In this dissertation we explored the elements of learning how to search and proposed a methodology to selectively direct and deploy machine reading resources.

In Chapter 2, we formulated the problem of finding multi-hop connections as a family of algorithms, called focused reading. Focused reading algorithms can be modularly configured to make use of different machine reading components adapted to the particular corpus, but from the perspective of the focused reading policy these components serve as black-box resources for the information gathering process. A focused reading policy searches the documents to find multi-hop inference paths: chains of relations between entities that represent a candidate relation between the concept endpoints. An implementation of focused reading decides how to direct machine reading tools and where to deploy them. We used reinforcement learning (RL) to search for policies that perform better than a deterministic baseline implementation in the biomedical domain. Specifically, we explored two temporal-difference learning algorithms: SARSA and Q-learning. The resulting policies were able to find more inference paths and process fewer documents than the deterministic baseline algorithm; because those policies condition on the context of the search process and the state of acquired knowledge, they make better decisions than the fixed rules of the deterministic baseline policy.

RL relies on a Markov decision process (MDP) formulation. In an MDP, an agent uses information from a state representation to choose the action with the best estimated long-term value. While our corpus and dataset of search problems are in the biomedical domain, our MDP formulation makes use of domain-independent features. Domain-specific implementations of focused reading may rely on highly specialized machine reading components tailored to the specific domain, and the study in Chapter 2 is an example of this.

In Chapter 3 we proposed a refinement of the MDP formulation to work in the open domain. Open domain implementations are challenging because of the variety of topics described in the underlying corpus. To account for this, in addition to the domain-independent state representation features introduced in Chapter 2, we introduced new state representation features based on topic modeling and on word embeddings. The open domain setting may describe a much wider set of entities and concepts. This dramatically increases the number of potential relations between entities, which in turn leads to a much larger action space. In response to this observation, we also developed a strategy for handling action spaces with much larger cardinality: instead of considering specific individual actions, action types are treated as templates that are parameterized by one or more named entities. Based on the improvements to the MDP representation, we then used a policy-gradient RL algorithm to search for policies that produce better results than strong baselines.

We performed an exhaustive analysis of the results of open-domain focused reading. In the analysis, we looked at several factors that make it harder to find a path connecting the endpoints using focused reading. We also looked at the factors that make the inference incoherent when the policy does find a path connecting the endpoints. These factors include the amount of topic switching in the inference path and the presence of relations based on the incorrect word sense of an entity.
The results of the analysis suggest that incorporating context information explicitly into the search process may improve the quality of the outcomes and of the resulting inference paths.

In Chapter 4 we addressed the problem of contextualizing information extraction, specifically for biochemical interactions extracted in the biomedical domain. We described a corpus of annotations put together by three domain experts. This corpus of scientific articles with manual annotations includes pairs of biochemical interactions and biological containers, where the container entity provides an indication of the biological context within which the biochemical event might be expected to occur. The available annotations are positive instances of the is context of relationship between the pair; we then used data augmentation to approximate negative instances. We proposed a set of syntactic, linguistic, and positional features to characterize every individual pair of event mention and biological container mention as a datum, and used the annotations as labels for supervised learning. Frequently, a detected biochemical reaction is paired with different mentions of the same container context class. We introduced an aggregation technique to transform the domain of the problem from biochemical event mention and container context mention pairs to biochemical event mention and container context class pairs. We used a battery of supervised learning algorithms to train a binary classifier to determine whether a pair participates in an is context of relation, and compared the resulting models with a simple positional baseline. The algorithms that produced the best models were the multi-layer perceptron and gradient-boosted trees.

We leave the door open to future work that integrates contextualizing information extraction methods, such as that described in Chapter 4, into focused reading implementations, particularly when deployed in the open domain, as in the study described in Chapter 3. We consider this to be the natural next step, whether the integration occurs at the time of reconciliation into the knowledge graph, at inference, or during action selection.

It is of interest to the author to implement a focused reading algorithm with an already existing, heavily curated dataset. Evaluating focused reading in this setting will enable a finer-grained study of the reliability of the methods introduced in this dissertation, beyond small-scale experiments or proof-of-concept implementations. The potential of this opportunity is two-fold. First, focused reading may be used as a stand-alone method or as a complement to existing methods for natural language processing (NLP) tasks, such as Question Answering or Natural Language Understanding. Second, some NLP datasets that have only recently become available are orders of magnitude larger and more extensively curated than those used in this work. For example, WorldTree V2 [Xie et al., 2020] is a heavily curated dataset of complex, elementary-school-level science questions. WorldTree is designed to support research into the NLP subtask of Question Answering. It provides explanation graphs, already curated by domain experts, that represent a coherent inference as an answer to each question. We can use these graphs as a benchmark to evaluate the quality of the inference paths returned by focused reading and to directly evaluate focused reading on the task of Question Answering.
In general, we can leverage existing resources to more accurately highlight the strengths and areas of opportunity of our methods.

APPENDIX A

Inference Paths of Positive Outcomes

In this appendix we list the inference paths used to conduct the error analysis of Section 3.5.1. The following list contains one hundred multi-hop paths returned by a focused reading policy. The sample was collected randomly from the positive outcomes in the testing dataset after running the best performing reinforcement learning policy.

• Announcer - World Wrestling Entertainment - Toy - Dart (missile) - Sea Dart missile

• New Zealand Liberal Party - Turn on the Bright Lights - New Jersey General Assembly - The General (1927 film) - General Conference of Seventh-day Adventists

• Skye - The Isle - Lewis - E - Breitenbach, Burgenlandkreis
• Singer- - Pop music - Arena - Certified Fraud Examiner
• BBC Radio 5 Live - BBC Radio - BBC Radio 2
• Siege of Kinsale - The Battle - Battle of Tabouk
• Radio - Radio programming - NBC - Situation comedy - CBS - Soap - NBC Daytime

• HAL 9000 - Quotation mark - - Pop music - Album - H. D. Kumaraswamy

• Basketball - National Basketball Association - Professional sports league organization - Major League Soccer

• National Football League - Scout (sport) - United States - United States Patent and Trademark Office

• United States Secretary of Homeland Security - The United - United States Department of Commerce - The Department - New York State Department of Environmental Conservation

• Bureau of Transportation Statistics - The Bureau - United States Fish and Wildlife Service - The United - United Nations System

• Music history - Pop music - Star - Metro-Goldwyn-Mayer - Feature film

• Esteghlal F.C. - Lewis - E - Division of Bonner - The Division - NCAA Division I Football Championship

• New Zealand Railways Department - Turn on the Bright Lights - New York blues

• IDog - New Zealand - Turn on the Bright Lights - Croatian Peasant Party - New Croatian Initiative

• Educational television - BBC - Television program - A TV - Television licence

• Mod (computer gaming) - Quotation mark - Genre - Hip hop music - Supergroup (music) - Rap rock

• Battle of Trautenau - The Battle - Battle of Stalingrad
• Boycott - Association of Tennis Professionals - Single (music) - Pop music - Radio - Norwegian Broadcasting Corporation

• AAG TV - Television channel - Pay television - Television station - Ultra high frequency - WNYE-TV

• Battle of Pea Ridge - The Battle - Battle of the Alamo
• D-Day - Invasion of Normandy - The Battle - Battle of Karbala
• NBA Sixth Man of the Year Award - National Basketball Association - - Turn on the Bright Lights - New Zealand National Party

• Departments of France - Tax - District - Fox Broadcasting Company - Science - USA Science and Engineering Festival

• Battle of Dettingen - The Battle - Battle of Cissa
• Communist Party of the Russian Federation - Communist party - Party of Labour of Albania

• Aesthetics - Do it yourself - Television program - BBC - Journalist

• Battle of Siffin - The Battle - Battle of Maldon
• Talk/Chat show - American Broadcasting Company - Comedy - Sex comedy

• Secretary of State for Commonwealth Relations - The Secretary - United States Secretary of the Interior - The United - United States Department of the Army - The Department - Department of Scientific and Industrial Research

• Order of the Red Eagle - The Order (group) - Order of St Michael and St George - Brand - Bar association - Sisinio González Martínez

• Battle of Arginusae - The Battle - Battle of Corinth (146 BC)
• Battle of the Eurymedon - The Battle - Battle of Bannockburn

• 2004 NBA Draft - NBA Draft - National Basketball Association - Coach (sport) - - National Football League playoffs - Australian Football League - The Australian - Australian House of Representatives

• Batting average - - Professional sports league organization - National Hockey League - Stint

• Computer role-playing game - Video game - Quotation mark - Meliodas

• Royal Academy of Engineering - Order of the British Empire - - Division I - Stadium - National Football League - National Football League playoffs - Australian Football League - The Australian - Australian Wood Frog

• Bay leaf - Leaf vegetable - Oak - Species - Bee
• Landrace - Quotation mark - Novel - Spy fiction - I spy
• United States Secretary of Health and Human Services - The United - United States Capitol - Pennsylvania State Capitol Complex

• Battle of Plataea - The Battle - Battle of Plattsburgh
• Treaty of Greenville - The Treaty - Treaty of Tartu (Russian/Estonian)

• IBM - Operating system - DOS - Multiplayer game
• 2000 NBA Draft - NBA Draft - National Basketball Association - Professional sports league organization - National Hockey League - Major professional sports leagues of the United States and Canada - World Hockey Association - Hartford Whalers - Turn on the Bright Lights - New Guinea campaign

• NBA Sixth Man of the Year Award - National Basketball Association - Major professional sports leagues of the United States and Canada - World Hockey Association - Hartford Whalers - Turn on the Bright Lights - New Zealand National Party

• Kup - Basketball - National Basketball Association - Professional sports league organization - Major League Soccer

• Communist Party of Czechoslovakia - Communist party - Communist Party of Tajikistan

• Indiana - U.S. state - United States - Scout (sport) - National Football League - Color commentator

• Battle of Cedar Creek - The Battle - Battle of Changban
• RTÉ 2fm - Radio broadcasting - National Public Radio - Color commentator - National Football League - 2006 NFL Draft

• Digital Audio Tape - Album - Pop music - Musician
• Doctor of Musical Arts - Doctor (Doctor Who) - Doctor of Philosophy - Wayne, New Jersey

• Ska punk - - Pop music - Pop rap
• Centrifuge - Gas - Port - Oil - Beetle
• Television set - Television - MTV - Guitarist
• New York City Ballet - New York City - Turn on the Bright Lights - The New Adventures of Superman (TV series)

• Battle of Boroughbridge - The Battle - Battle of Lincoln (1217)
• Battle of Nicopolis - The Battle - Battle of the Ancre Heights

• Security guard - BBC - Drama - ITV (TV network) - Television network - BBC Television - Television - MTV

• Focus Features - Film - NBC - Situation comedy - CBS - Television - The PTL Club

• New York State Treasurer - Turn on the Bright Lights - Greek language - Common Era - 510s BC

• New Zealand Cup - Turn on the Bright Lights - New Zealand disruptive pattern material

• Battle of Shipka Pass - The Battle - Battle of the Thames
• Crime fiction - NBC - Police procedural - American Broadcasting Company - Animated cartoon

• United States women's national soccer team - The United - United States Secretary of the Interior - The Secretary - Secretary of State of New Jersey

• Esteghlal F.C. - Lewis - E - Division of Bonner - The Division - Division of Bowman

• Order of the Holy Cross - The Order (group) - Order of St Michael and St George - Order of the British Empire - Royal Academy of Engineering

• Red Army - Army General - United States Army - United States Army Corps of Engineers

• Battle of Shizugatake - The Battle - Battle of Coronea (447 BC)
• Singer-songwriter - Pop music - Great Depression - Great Depression in the United States

• Adventure film - Metro-Goldwyn-Mayer - Film - RKO Pictures
• Adventure - American Broadcasting Company - Science fiction - Ace Books

• New Zealand Liberal Party - Turn on the Bright Lights - New York State Department of Environmental Conservation - The Department - Department of Wildlife Conservation (Sri Lanka)

• pantomime - The River (album) - River Cherwell
• State Treasurer of Oklahoma - The Office (U.S. TV series) - Office of Public Works - The Office (UK TV series) - Attorney General of Ireland

• Chancellor of Austria - The Federal Kuala Lumpur - Federal government of the United States - The United - United States Department of the Army - The Department - California Department of Transportation

• San Francisco Giants - Major League Baseball - Professional sports league organization - National Football League - Pittsburgh Steelers

• Prime Minister of Greece - The Prime Minister - Prime Minister of Iran

• Battle of Taiyuan - The Battle - Battle of Nancy
• General manager - Television Broadcasts Limited - Comedy - BBC Television - Television network - American Broadcasting Company - Soap opera

• Television pilot - Quotation mark - Television program - Australian Broadcasting Corporation

• Music World Corporation - Pop music - Music festival
• Battle of Tertry - The Battle - Battle of Iwo Jima
• New Zealand Warriors - Turn on the Bright Lights - New York State Department of Environmental Conservation - The Department - United States Department of Labor

• Battle of Zenta - The Battle - Battle of Cowpens
• Music video - Pop music - Billboard -
• Battle of Flodden Field - The Battle - Battle of Ghazni
• Rap - Record chart - Pop music - Icon
• New Haven Ninjas - Turn on the Bright Lights - New Zealand national team

• Horizons-2 - New Horizons - Turn on the Bright Lights - New Republican Force - Republican Force Tucumán

• Battle of the Ancre Heights - The Battle - Battle of Artemisium
• Battle of Mount Dingjun - The Battle - Battle of Legnano
• Fox Footy Channel - The Fox (1921 film) - Fox River (Wisconsin)
• Battle of Forts Clinton and Montgomery - The Battle - Battle of Komaki and Nagakute

• Star - Pop music - Record chart - Latin pop
• Battle of Stiklestad - The Battle - Battle of Magersfontein
• Star - Pop music - Album - Latin pop
• Apollo of Veii - The Apollo (Glasgow) - Apollo - God - Christian denomination - The Christian - Christian theology

• River Beane - The River (album) - River Nene

APPENDIX B

Inference Path Segments of Negative Outcomes

In this appendix we list the inference paths used to conduct the error analysis of Section 3.5.2. The following list contains one hundred pairs of multi-hop path segments reconstructed from searches with negative outcomes, that is, search problems in which the agent did not find a path connecting the endpoints of the search. The partial segment is the sequence of hops from the source endpoint of the search to the entity that is closest to the destination endpoint of the search, according to the ground truth knowledge graph. The entities and relations in the partial segment were retrieved by focused reading prior to the end of the search process. The missing segment contains the entities and relations not found during focused reading that would connect the partial segment to the destination of the search. If the partial and missing segments were connected, they would form a shortest path between the endpoints of the search, given the existence of the partial segment in the knowledge graph.

The sample was collected randomly by inspecting the state of searches with negative outcomes in the testing dataset after running the best performing reinforcement learning policy. For each sampled item, both segments were computed systematically for later manual analysis.

• Partial segment: MCH Arena - Arena - Pop music
  Missing segment: Film - Sex - Marriage - Law - John Hunt Morgan

• Partial segment: Crying (song) - ’M
  Missing segment: Cooking - BBC - Television program - MTV - Stunt

• Partial segment: Cloud - IBM
  Missing segment: Association of Tennis Professionals - Single (music) - Pop music - Guitarist - MTV - Brand - Order of St Michael and St George - Order of the British Empire

• Partial segment: Ford (crossing) - Quotation mark - Music video
  Missing segment: Pop music - Star - Mixed martial arts - Legend - Oyo Empire - Clan - Aso, Kumamoto

• Partial segment: Danmarks Radio - Quotation mark - DVD
  Missing segment: Home video - United States - United States Department of Commerce - The Department - Department of Social Security

• Partial segment: CBS Television Distribution - CBS - Situation comedy - BBC
  Missing segment: Cooking - ’M - Hunting - Dog - Garnish (food)

• Partial segment: Universal Serial Bus - Product (chemistry)
  Missing segment: Oil - Driller (oil)

• Partial segment: ITV - Television program
  Missing segment: Quotation mark - 1984 in film

• Partial segment: Public Image Ltd. - Guitarist - MTV
  Missing segment: Television - CBS - Crime - Law - William E. Dodge

• Partial segment: Department of Trade and Industry (Philippines) - The Department
  Missing segment: New York State Department of Environmental Conservation - Turn on the Bright Lights - New York Knicks - National Basketball Association - Star - Pop music - Musician - Commodore 64

• Partial segment: Launch game - DVD - Writer
  Missing segment: Gay - Politician - African National Congress - Political campaign

• Partial segment: France - BMW
  Missing segment: Model (person) - Aston Martin DB5 - Toy - Manufacturing

• Partial segment: FIS Nordic World Ski Championships 1962 - Nordic skiing
  Missing segment: Arena - Pop music - Film - Metro-Goldwyn-Mayer - China - Han Chinese

• Partial segment: New Haven Ninjas - Turn on the Bright Lights
  Missing segment: New York Knicks - National Basketball Association - Star - Pop music - Artist

• Partial segment: Arizona State University - Association football - Major League Soccer - Prime time
  Missing segment: American Broadcasting Company - Television program - Quotation mark - Video game - Console role-playing game - Final Fantasy Crystal Chronicles

• Partial segment: Order of St. George - The Order (group) - Order of St Michael and St George
  Missing segment: Order of the British Empire - Association football - Division I - The Division - Division of Bonner

• Partial segment: California Golden Bears football - Association football - Major League Soccer - Prime time
  Missing segment: American Broadcasting Company - Drama - Espionage - Adventure film

• Partial segment: Light heavyweight (MMA) - Fighting Championship
  Missing segment: Color commentator - National Football League - General manager - Television Broadcasts Limited - Television - Quotation mark - Vice President of the United States - The Vice (TV series) - Vice Chief of Naval Operations

• Partial segment: Division of Military Aeronautics - The Department
  Missing segment: New York State Department of Environmental Conservation - Turn on the Bright Lights - New Buffalo (singer)

• Partial segment: Emmy Award - Public Broadcasting Service - Television program
  Missing segment: Quotation mark - DVD - Home video - United States - United States Department of Commerce - The Department - Department of the East

• Partial segment: Power tool - Gas - Deals
  Missing segment: Law - Crime - CBS - Television network - Pay television - Television channel - Zee TV

• Partial segment: Nui (atoll) - Population - United States
  Missing segment: Home video - DVD - Quotation mark - Television program - American Broadcasting Company - Album - Iggy Pop

• Partial segment: Cable television - Home Box Office - Dramatic programming - Fox Broadcasting Company
  Missing segment: Brand - Bar association - Rabbi Ammi

• Partial segment: New York Raiders - Turn on the Bright Lights - New York Knicks
  Missing segment: National Basketball Association - Star - Metro-Goldwyn-Mayer - Drama film - Television movie - Remake

• Partial segment: Platform game - DOS
  Missing segment: Video game - Quotation mark - Television program - American Broadcasting Company - Soap opera

• Partial segment: Union for Democracy and Social Progress (Togo) - The Union (political coalition) - Union Flag - United States
  Missing segment: Home video - DVD - Sequel - 3-D film - Film - IFC Films

• Partial segment: Universal Serial Bus - Product (chemistry)
  Missing segment: Petroleum - - Electric Light Orchestra - Album

• Partial segment: World Trade Organization - Deals - Law
  Missing segment: Marriage - Sex - Comedy film - DVD - Trilogy

• Partial segment: ABC Television - Television program - NBC
  Missing segment: Film - Tau - Leng

• Partial segment: Open source software - GNU - Computer software - DOS - Video game - Quotation mark
  Missing segment: Novel - Common Era - Greek language - Turn on the Bright Lights - Virgin New Adventures

• Partial segment: ARIA Music Awards - The Australian - Australian Football League - National Football League playoffs - National Football League - General manager
  Missing segment: Television Broadcasts Limited - Comedy - 20th Century Fox - The Fox (1921 film) - Fox Glacier

• Partial segment: Parliament of New Zealand - Turn on the Bright Lights - New England
  Missing segment: PEN American Center - United States - Population - Pet - Chimpanzee

• Partial segment: UHF anime - Television station - Fox Broadcasting Company - Drama - BBC
  Missing segment: Record chart - Pop music - Songwriter - Dove Award for Songwriter of the Year

• Partial segment: National Hockey League - Professional sports league organization
  Missing segment: Major League Soccer - Prime time - American Broadcasting Company - Animated cartoon - Computer-generated imagery - Computer animation

• Partial segment: The Fox (brand) - Fox Sports (USA) - Sports journalism - Fox Broadcasting Company - Television program
  Missing segment: Quotation mark - DVD - Home video - United States - Sony BMG - Inspiration (Yngwie Malmsteen album)

• Partial segment: Yes (band) - Singing
  Missing segment: Pop music - Single (music) - Association of Tennis Professionals - IBM - Operating system - Gas - Molecule

• Partial segment: - The Australian
  Missing segment: - Public Broadcasting Service - Television program - Quotation mark - Genre - Hip hop music - Business magnate - Petroleum - Reservoir

• Partial segment: Radio - Radio programming
  Missing segment: BBC - Soap opera - NBC Daytime

• Partial segment: Rohan - Quotation mark
  Missing segment: Television program - BBC television drama - Dramatic programming

• Partial segment: New Zealand First - Turn on the Bright Lights
  Missing segment: Greek language - Common Era - Novel - Quotation mark - Video game - DOS - Operating system -

• Partial segment: Illinois State Redbirds - The Illinois - Illinois General Assembly - The General (1927 film) - New Jersey General Assembly
  Missing segment: Turn on the Bright Lights - Greek language - Common Era - Novel - Spy fiction - I spy

• Partial segment: Boycott - Association of Tennis Professionals - Single (music)
  Missing segment: Pop music - Radio - Norwegian Broadcasting Corporation

• Partial segment: National Party of Scotland - New Zealand National Party - Turn on the Bright Lights - Greek language
  Missing segment: Common Era - Novel - Spy fiction - Fiction - ABC Fiction Award

• Partial segment: Ionic Greek - Quotation mark
  Missing segment: Novel - Common Era - Greek language - Turn on the Bright Lights - New York and Long Island Traction Company

• Partial segment: Microprocessor development board - Oya
  Missing segment: Reservoir - Oil - Business magnate - Hip hop music - Genre - Quotation mark - Novel - Common Era - Mesopotamia

• Partial segment: Berkeley Software Distribution - IBM PC compatible - DOS
  Missing segment: Video game - Quotation mark - DVD - Renting

• Partial segment: Film distributor - DVD
  Missing segment: Quotation mark - Television program - CBS - CBS Television Distribution

• Partial segment: Audio engineering - Dub music
  Missing segment: Musician - Pop music - Star - National Basketball Association - Major professional sports leagues of the United States and Canada - World Hockey Association

• Partial segment: Emperor - Oyo Empire
  Missing segment: Legend - Mixed martial arts - Icon - Pop music - Radio

• Partial segment: Fellow - Association for Computing Machinery
  Missing segment: The Association - Association of Tennis Professionals - Single (music) - Pop music - Film - Sex - Marriage

• Partial segment: Royal Society of New South Wales - Royal Society - Royal Society of Edinburgh - Order of the British Empire - Order of St Michael and St George - Brand
  Missing segment: MTV - Television program - VH1 - Fashion

• Partial segment: List of Sleeper Cell characters - Quotation mark
  Missing segment: Genre - Pop music - Arena - Nordic skiing - FIS Nordic World Ski Championships 1966

• Partial segment: Political system - Han Dynasty
  Missing segment: Emperor of China - Han Chinese - China - Metro-Goldwyn-Mayer - Film - Pop music - Icon - Gay icon

• Partial segment: Mailing list - Electric Light Orchestra - Album - American Broadcasting Company - Situation comedy
  Missing segment: CBS - Variety show - YTV (TV channel)

• Partial segment: Record chart - Pop music
  Missing segment: Star - National Basketball Association - New York Knicks - Turn on the Bright Lights - New York State Department of Environmental Conservation - The Department - Department for International Development

• Partial segment: Latin pop - Album - American Broadcasting Company
  Missing segment: Animated cartoon - Computer-generated imagery - Computer animation - 3D computer graphics - Space simulator

• Partial segment: Cartoon - Tom and Jerry
  Missing segment: Metro-Goldwyn-Mayer - Film - Gay - Beach - Khe

• Partial segment: Republic of China Army - The Republic (Plato)
  Missing segment: Bosnia and Herzegovina - The Kingdom (film) - Kingdom of Castile - New Kingdom - Turn on the Bright Lights - New York University School of Medicine

• Partial segment: Uprising (film) - Mau - Town - Oil
  Missing segment: Deals - Gas - Operating system - DOS - Video game - BMX - Cycling

• Partial segment: Mass media - BBC - Record chart
  Missing segment: Pop music - Arena - Nordic skiing - FIS Nordic World Ski Championships 2007 - EuroBasket 2007

• Partial segment: Vice President of the United States - Quotation mark
  Missing segment: Television program - CBS - Crime - Law - Magazine - RES (magazine)

• Partial segment: List of vocal groups - Pop music - Musician - Commodore 64
  Missing segment: The Commodore - Commodore 1581

• Partial segment: Dramatic programming - Fox Broadcasting Company - District
  Missing segment: Tax - Write-off

• Partial segment: English Renaissance - English people
  Missing segment: English language - RAI - Television - MTV - Brand - Order of St Michael and St George - The Order (group) - Order of St. George

• Partial segment: RES (magazine) - Magazine - Law - Crime - CBS - Television - Quotation mark
  Missing segment: Television program - VH1 - Fashion - IMG Models

• Partial segment: Australian Magpie - The Australian - Australia - Public Broadcasting Service - Television program - Television Broadcasts Limited
  Missing segment: Television - MTV - Brand - Bar association - Adda River

• Partial segment: Spy film - Film - NBC - Broadcast network - Fox Broadcasting Company
  Missing segment: Animated cartoon - Computer-generated imagery - Animation - DIC Entertainment

• Partial segment: The Kingdom (film) - Kingdom of Castile
  Missing segment: New Kingdom - Turn on the Bright Lights - Greek language - Common Era - Novel - Quotation mark - Television program - ITV - Franchising

• Partial segment: Stunt - MTV
  Missing segment: Television - Quotation mark - Vice President of the United States - The Vice (TV series) - Vice president - The Deputy (TV series) - Deputy Prime Minister of Australia

• Partial segment: Same-sex marriage - Sex
  Missing segment: Film - Pop music - Music World Corporation

• Partial segment: Union Jack (magazine) - The Union (political coalition) - Union Flag - United States - States of Germany
  Missing segment: East Germany - Tennis - Association of Tennis Professionals - The Association - Association of Chief Police Officers

• Partial segment: Pop art - Artist
  Missing segment: Gay - Politician

• Partial segment: OSI (band) - Album - American Broadcasting Company
  Missing segment: Television program - Television Broadcasts Limited - General manager - National Football League - 1984 NFL season

• Partial segment: Satellite - National Public Radio
  Missing segment: Color commentator - National Football League - Scout (sport) - United States - Olympic Games - 2000 Summer Olympics

• Partial segment: Secretary of State for Commonwealth Relations - The Secretary - United States Secretary of the Interior
  Missing segment: The United - United States Department of Justice - The Department - Department of Scientific and Industrial Research

• Partial segment: HBO Films - Film
  Missing segment: Gay - Writer - DVD - Home video - United States - Union Flag - The Union (political coalition) - Union of the Crowns - The Union of Crowns

• Partial segment: A TV - Television program - Public Broadcasting Service
  Missing segment: Australia - The Australian - Australian Amateur Football Council

• Partial segment: Department of Wildlife Conservation (Sri Lanka) - The Department - New York State Department of Environmental Conservation
  Missing segment: Turn on the Bright Lights - Greek language - Common Era - Ezekiel - Bar association - Sisinio González Martínez

• Partial segment: London - Australian Football League - National Football League playoffs
  Missing segment: National Football League - Coach (sport) - National Basketball Association - New York Knicks - Turn on the Bright Lights -

• Partial segment: Ace Books - Science fiction - Fox Broadcasting Company - Television program
  Missing segment: Quotation mark - DVD - Home video - United States - United States Department of Commerce - The Department - Department of National Defence (Canada)

• Partial segment: New Cross - Turn on the Bright Lights - New York Knicks - National Basketball Association - Professional sports league organization
  Missing segment: National Basketball Association - NBA Draft - 2005 NBA Draft

• Partial segment: Kingdom of Mysore - The Kingdom (film) - Kingdom of Castile
  Missing segment: New Kingdom - Turn on the Bright Lights - New Left Review

• Partial segment: BBC Radio 2 - BBC Radio
  Missing segment: Radio programming - BBC - Drama - ABC Family - Family (TV channel)

• Partial segment: Amazed - ’M - Cooking
  Missing segment: BBC - Record chart - Pop music - Single (music) - Association of Tennis Professionals - The Association - Association of Reformed Institutions of Higher Education

• Partial segment: Squid - Mud - Architecture - X86 architecture
  Missing segment: Personal computer - IBM - Operating system - Gas - Electron configuration - Orbital period

• Partial segment: 2004 NBA Draft - NBA Draft
  Missing segment: National Basketball Association - Star - Pop music - Singing - Popular music - Musical instrument - Toy - Dart (missile) - Sea Dart missile

• Partial segment: Tax - District - Fox Broadcasting Company - Comedy
  Missing segment: 20th Century Fox - The Fox (1921 film) - Fox Entertainment Group

• Partial segment: RNA virus - Virus - DNA
  Missing segment: Gene - RNA - Product (chemistry) - Read-only memory - Header (computing)

• Partial segment: United Kingdom - Her Majesty’s Government
  Missing segment: Government of the United Kingdom - British Raj - Film - Pop music - Rock music - Caprock

• Partial segment: Sun Television and Appliances - Television - Quotation mark
  Missing segment: Genre - Pop music - Poppins

• Partial segment: Emperor of Ethiopia - The Emperors - Empire of Trebizond - The Empire (Warhammer)
  Missing segment: Empire style - New Kingdom - Kingdom of Castile - The Kingdom (film) - Pagan Kingdom

• Partial segment: Metro Station (band) - The Metro (song)
  Missing segment: Metro-Goldwyn-Mayer - Film - Gay - Politician - Liberal Democratic Party (Japan)

• Partial segment: Platform game - DOS - Video game - Quotation mark
  Missing segment: Genre - Pop music - - Liberal Appeal

• Partial segment: Imprint - Ace Books
  Missing segment: Science fiction - Fox Broadcasting Company - Television program - Quotation mark - Video game - BMX

• Partial segment: Sau Mau Ping - Mau - Town - Oil
  Missing segment: Product (chemistry) - RNA - Gene - HLA-B27

• Partial segment: Australian Magpie - The Australian - Australian Football League - National Football League playoffs - National Football League - General manager
  Missing segment: Television Broadcasts Limited - Television program - Fox Broadcasting Company - District - Non-resident Indian and Person of Indian Origin - Punjabi people

• Partial segment: Union of Utrecht (Old Catholic) - The Union (political coalition) - Union Flag - United States
  Missing segment: PEN American Center - New England - Turn on the Bright Lights - New Jersey Department of Corrections

• Partial segment: New Jersey Department of Environmental Protection - Turn on the Bright Lights - New England
  Missing segment: PEN American Center - United States - Union Flag - The Union (political coalition) - Union of Fascist Little Ones

• Partial segment: United States Department of the Interior - The Department - New York State Department of Environmental Conservation
  Missing segment: Turn on the Bright Lights - New York Knicks - National Basketball Association - Star - Metro-Goldwyn-Mayer - Springfield, Massachusetts

• Partial segment: New York State Senate - Turn on the Bright Lights
  Missing segment: New England - PEN American Center - United States - Union Flag - The Union (political coalition) - Union of Free Democrats

REFERENCES

(2008). Proceedings of the First Text Analysis Conference, TAC 2008, Gaithersburg, Maryland, USA, November 17-19, 2008. NIST.

(2017). Proceedings of the 2017 Text Analysis Conference, TAC 2017, Gaithersburg, Maryland, USA, November 13-14, 2017. NIST.

(2018). Proceedings of the 2018 Text Analysis Conference, TAC 2018, Gaithersburg, Maryland, USA, November 13-14, 2018. NIST.

Bach, N. and S. Badaskar (2007). A review of relation extraction. Literature review for Language and Statistics II.

Banko, M., M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni (2007). Open information extraction from the web. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pp. 2670–2676.

Blei, D., A. Ng, and M. I. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, pp. 993–1022.

Bustelo, X. R., V. Ojeda, M. Barreira, V. Sauzeau, and A. Castro-Castro (2012). Rac-ing to the plasma membrane: the long and complex work commute of Rac1 during cell signaling. Small GTPases, 3(1), pp. 60–66.

Cheng, X. and D. Roth (2013). Relational Inference for Wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1787–1796. Association for Computational Linguistics, Seattle, Washington, USA.

Clark, K. and C. D. Manning (2016a). Deep reinforcement learning for mention-ranking coreference models. arXiv preprint arXiv:1609.08667.

Clark, K. and C. D. Manning (2016b). Improving Coreference Resolution by Learning Entity-Level Distributed Representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 643–653. Association for Computational Linguistics, Berlin, Germany. doi:10.18653/v1/P16-1061.

Cohen, K. B., D. Demner-Fushman, S. Ananiadou, J. Pestian, J. Tsujii, and B. Webber (eds.) (2011). Proceedings of BioNLP 2011 Workshop. Association for Computational Linguistics, Portland, Oregon, USA.

Cohen, P. R. (1995). Empirical methods for artificial intelligence, volume 139. MIT press Cambridge, MA.

Cohen, P. R. (2015a). DARPA’s Big Mechanism program. Physical biology, 12(4), p. 045008.

Cohen, P. R. (2015b). DARPA’s Big Mechanism program. Physical Biology, 12(4), p. 045008.

Demner-Fushman, D., K. B. Cohen, S. Ananiadou, and J. Tsujii (eds.) (2019). Proceedings of the 18th BioNLP Workshop and Shared Task. Association for Computational Linguistics, Florence, Italy.

Fellbaum, C. (ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

Fleiss, J. L. and J. Cohen (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and psychological measurement, 33(3), pp. 613–619.

Freitas, S. A. A. d. (2005). Interpretação automatizada de textos: Processamento de Anáforas. Ph.D. thesis, Universidade Federal do Espírito Santo, Brasil.

Fundel, K., R. Küffner, and R. Zimmer (2007). RelEx – Relation extraction using dependency parse trees. Bioinformatics, 23(3), pp. 365–371.

Gerner, M., G. Nenadic, and C. M. Bergman (2010a). An exploration of mining gene expression mentions and their anatomical locations from biomedical text. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pp. 72–80. Association for Computational Linguistics.

Gerner, M., G. Nenadic, and C. M. Bergman (2010b). LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics, 11, p. 85.

Hazeki, K., Y. Kametani, H. Murakami, M. Uehara, Y. Ishikawa, K. Nigorikawa, S. Takasuga, T. Sasaki, T. Seya, M. Matsumoto, and O. Hazeki (2011). Phosphoinositide 3-Kinaseγ Controls the Intracellular Localization of CpG to Limit DNA-PKcs-Dependent IL-10 Production in Macrophages. PLoS One, 6(10).

He, D., Y. Xia, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma (2016). Dual learning for machine translation. In Advances in neural information processing systems, pp. 820–828.

Irmer, M. (2009). Bridging Inferences in Discourse Interpretation. Ph.D. thesis, University of Leipzig.

Kim, J.-D., C. Nédellec, R. Bossy, and L. Deléger (eds.) (2019). Proceedings of The 5th Workshop on BioNLP Open Shared Tasks. Association for Computational Linguistics, Hong Kong, China.

Kanani, P. H. and A. K. McCallum (2012). Selecting actions for resource-bounded information extraction using reinforcement learning. In Proceedings of the fifth ACM international conference on Web search and data mining, pp. 253–262.

Kim, J.-D., N. Nguyen, Y. Wang, J. Tsujii, T. Takagi, and A. Yonezawa (2012). The genia event and protein coreference tasks of the BioNLP shared task 2011. BMC bioinformatics, 13(11), p. 1.

Kim, J.-D., T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii (2009). Overview of BioNLP’09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pp. 1–9. Association for Computational Linguistics.

Kullback, S. and R. A. Leibler (1951). On information and sufficiency. The annals of mathematical statistics, 22(1), pp. 79–86.

Li, X., Y.-N. Chen, L. Li, J. Gao, and A. Celikyilmaz (2017). End-to-End Task-Completion Neural Dialogue Systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 733–743. Asian Federation of Natural Language Processing, Taipei, Taiwan.

Manning, C. D., P. Raghavan, and H. Schütze (2008a). Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA. ISBN 0521865719, 9780521865715.

Manning, C. D., P. Raghavan, and H. Schütze (2008b). Introduction to information retrieval. Cambridge University Press.

Manning, C. D., M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014a). The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60.

Manning, C. D., M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014b). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL).

Mnih, V., A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937.

Narasimhan, K., A. Yala, and R. Barzilay (2016). Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016).

Nédellec, C., R. Bossy, J.-D. Kim, J.-J. Kim, T. Ohta, S. Pyysalo, and P. Zweigenbaum (2013). Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pp. 1–7.

Neumann, M., D. King, I. Beltagy, and W. Ammar (2019). ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 319–327. Association for Computational Linguistics, Florence, Italy. doi:10.18653/v1/W19-5034.

Noriega-Atala, E., M. A. Valenzuela-Escárcega, C. Morrison, and M. Surdeanu (2017a). Learning what to read: Focused machine reading. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017).

Noriega-Atala, E., P. D. Hein, S. S. S. Thumsi, Z. Wong, X. Wang, S. M. Hendryx, and C. T. Morrison (2019). Extracting Inter-sentence Relations for Associating Biological Context with Events in Biomedical Text. IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Noriega-Atala, E., M. A. Valenzuela-Escárcega, C. Morrison, and M. Surdeanu (2017b). Learning what to read: Focused machine reading. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2905–2910. Association for Computational Linguistics, Copenhagen, Denmark. doi:10.18653/v1/D17-1313.

Pautasso, M. (2012). Publication growth in biological sub-fields: patterns, predictability and sustainability. Sustainability, 4(12), pp. 3234–3247.

Poesio, M., R. Mehta, A. Maroudas, and J. Hitzeman (2004). Learning to Resolve Bridging References. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pp. 143–150. Barcelona, Spain. doi:10.3115/1218955.1218974.

Poon, H., K. Toutanova, and C. Quirk (2015). Distant supervision for cancer pathway extraction from text. In Pacific Symposium for Biocomputing.

Quan, C., M. Wang, and F. Ren (2014). An unsupervised text mining method for relation extraction from biomedical literature. PLOS One.

Ratinov, L., D. Roth, D. Downey, and M. Anderson (2011). Local and Global Algorithms for Disambiguation to Wikipedia. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Sarafraz, F. (2012). Finding conflicting statements in the biomedical literature. Ph.D. thesis, University of Manchester.

Seo, Y.-W. and B.-T. Zhang (2000). A reinforcement learning agent for personalized information filtering. In Proceedings of the 5th international conference on Intelligent user interfaces, pp. 248–251. ACM.

Sutton, R. S. and A. G. Barto (1998). Reinforcement learning: An introduction. MIT press Cambridge.

Swampillai, K. and M. Stevenson (2011). Extracting Relations Within and Across Sentences. In Proceedings of Recent Advances in Natural Language Processing.

Swanson, D. R. (1986). Undiscovered public knowledge. The Library Quarterly, 56(2), pp. 103–118.

Telmer, C. A., K. Sayed, A. A. Butchy, Kaltenmeir, M. Lotze, and N. Miskov-Zivanov (2017). Manuscript in preparation.

Tjong Kim Sang, E. F. and F. De Meulder (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147.

Valenzuela-Escárcega, M. A., Ö. Babur, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, and C. T. Morrison (2017a). Large-scale automated reading with Reach discovers new cancer driving mechanisms.

Valenzuela-Escárcega, M. A., Ö. Babur, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, and C. T. Morrison (2017b). Large-scale Automated Reading with Reach Discovers New Cancer Driving Mechanisms. In Proceedings of the BioCreative VI Workshop (BioCreative6).

Valenzuela-Escárcega, M. A., G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega, M. Surdeanu, and C. T. Morrison (2018). REACH. https://github.com/clulab/reach.

Valenzuela-Escárcega, M. A., G. Hahn-Powell, T. Hicks, and M. Surdeanu (2015a). A Domain-independent Rule-based Framework for Event Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: Software Demonstrations (ACL-IJCNLP).

Valenzuela-Escárcega, M. A., G. Hahn-Powell, T. Hicks, and M. Surdeanu (2015b). A Domain-independent Rule-based Framework for Event Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: Software Demonstrations (ACL-IJCNLP), pp. 127–132. ACL-IJCNLP 2015. Paper available at http://www.aclweb.org/anthology/P/P15/P15-4022.pdf.

Valenzuela-Escárcega, M. A., G. Hahn-Powell, M. Surdeanu, and T. Hicks (2015). A domain-independent rule-based framework for event extraction. Proceedings of ACL-IJCNLP 2015 System Demonstrations, pp. 127–132.

Wang, H., X. Liu, Y. Tao, W. Ye, Q. Jin, W. W. Cohen, and E. P. Xing (2019). Automatic Human-like Mining and Constructing Reliable Genetic Association Database with Deep Reinforcement Learning. In PSB, pp. 112–123. World Scientific.

Welbl, J., P. Stenetorp, and S. Riedel (2018). Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics, 6, pp. 287–302.

Xie, Z., S. Thiem, J. Martin, E. Wainwright, S. Marmorstein, and P. Jansen (2020). WorldTree V2: A Corpus of Science-Domain Structured Explanations and Inference Patterns supporting Multi-Hop Inference. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France.

Yang, Z., P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380. Association for Computational Linguistics, Brussels, Belgium. doi:10.18653/v1/D18-1259.

Young, N. P. and T. Jacks (2010). Tissue-specific p19Arf regulation dictates the response to oncogenic K-ras. Proceedings of the National Academy of Sciences of the United States of America, 107(22), pp. 10184–10189.

Zhang, B.-T. and Y.-W. Seo (2001). Personalized web-document filtering using reinforcement learning. Applied Artificial Intelligence, 15(7), pp. 665–685.

Zhou, D., D. Zhong, and Y. He (2014). Biomedical relation extraction: From binary to complex. Computational and Mathematical Methods in Medicine.