Assembling Information from Big Corpora by Focusing Machine Reading
Item Type: text; Electronic Dissertation
Authors: Noriega Atala, Enrique
Citation: Noriega Atala, Enrique. (2020). Assembling Information from Big Corpora by Focusing Machine Reading (Doctoral dissertation, University of Arizona, Tucson, USA).
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Download date: 30/09/2021 19:48:34
Link to Item: http://hdl.handle.net/10150/656801

ASSEMBLING INFORMATION FROM BIG CORPORA BY FOCUSING MACHINE READING

by
Enrique Noriega Atala

Copyright © Enrique Noriega Atala 2020

A Dissertation Submitted to the Faculty of the
SCHOOL OF INFORMATION
In Partial Fulfillment of the Requirements For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
2020

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by: Enrique Noriega Atala titled: and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Clayton Morrison, Date: Jan 4, 2021
Mihai Surdeanu, Date: Jan 4, 2021
Peter A Jansen, Date: Jan 4, 2021

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Clayton Morrison, Date: Jan 4, 2021
Dissertation Committee Chair
School of Information

ACKNOWLEDGEMENTS

I would like to express my gratitude to all the people who directly or indirectly participated in the research projects that culminated in this work. First and foremost, to my adviser, Prof. Clayton T. Morrison, for his mentorship and support throughout the years. His advice and guidance were priceless and I will be forever grateful. To my committee members, Prof. Mihai Surdeanu and Prof. Peter Jansen, who provided invaluable insight during the development of this dissertation and throughout the course of my graduate studies. To my friends Marco, Becky, Gus, Dane, Paul, and all the other members of ML4AI and CLU Lab who mentored me and collaborated with me. Thank you for all the good times we had during our time at grad school. And last, but not least, to my wife Cecilia, without whose support and encouragement this wouldn't have been possible.

DEDICATION

To my wife, Cecilia.

TABLE OF CONTENTS

LIST OF FIGURES ..... 7
LIST OF TABLES ..... 8
ABSTRACT ..... 10
CHAPTER 1 Introduction ..... 12
CHAPTER 2 Learning to Search in the Biomedical Domain ..... 15
2.1 Introduction ..... 15
2.2 Related Work ..... 17
2.3 Focused Reading ..... 18
2.4 Baseline Algorithm and Evaluation ..... 20
2.4.1 Dataset ..... 22
2.4.2 Baseline Results ..... 23
2.4.3 Baseline Error Analysis ..... 23
2.5 Reinforcement Learning for Focussed Reading ..... 25
2.5.1 Ablation Test for RL State Features ..... 28
2.5.2 Error Analysis of the RL Policy ..... 29
2.6 Discussion ..... 29
CHAPTER 3 Learning to Search in the Open Domain ..... 31
3.1 Introduction ..... 31
3.2 Related Work ..... 32
3.3 Learning to Search ..... 32
3.3.1 Constructing Query Actions ..... 35
3.3.2 State Representation Features ..... 37
3.3.3 Topic Modeling Features ..... 39
3.3.4 Reward Function Structure ..... 40
3.4 Evaluation and Discussion ..... 41
3.5 Analysis of Errors and Semantic Coherence ..... 47
3.5.1 Analysis of Successful Search Paths ..... 48
3.5.2 Analysis of the Negative Outcomes ..... 56
3.6 Discussion ..... 62
CHAPTER 4 Contextualizing Information Extraction ..... 64
4.1 Introduction ..... 64
4.2 Related Work ..... 66
4.3 Dataset ..... 67
4.3.1 Negative Examples and Extended Positive Examples ..... 70
4.3.2 Inter-Annotator Agreement ..... 72
4.4 Features ..... 74
4.5 Experiment ..... 77
4.5.1 Baseline and Model Results ..... 78
4.6 Discussion ..... 81
CHAPTER 5 Conclusions and Future Work ..... 84
APPENDIX A Inference Paths of Positive Outcomes ..... 88
APPENDIX B Inference Path Segments of Negative Outcomes ..... 97
REFERENCES ..... 112

LIST OF FIGURES

2.1 Evolution of search for the directed path that connects two proteins during focused reading ..... 17
2.2 Graph edge encoding the relation extracted from: mTOR triggers cellular apoptosis ..... 18
3.1 Distribution of # of hops in testing problems ..... 42
3.2 Number of hops in successful paths ..... 53
3.3 Number of iterations for cases with negative outcomes ..... 58
3.4 Number of hops in inference paths ..... 59
4.1 Aggregated κ scores and association counts for all contexts, binned by κ score ranges ..... 70
4.2 κ scores for top 15 contexts by association count ..... 72
4.3 Precision, recall and F1 score per classifier. The dashed red line indicates the F1 score of the baseline ..... 76
4.4 Feature usage per paper for each classifier ..... 80
4.5 F1 scores per paper for all models ..... 82

LIST OF TABLES

2.1 Results of the baseline and RL Query Policy for the focused reading of biomedical literature ..... 21
2.2 Baseline error causes ..... 23
2.3 Features that describe the state of search in the RL-based focused reading algorithm. “PA” stands for “participant A”, i.e., the latest entity chosen from the source subgraph. “PB” stands for “participant B”, i.e., the latest entity chosen from the destination subgraph. “Iteration” indicates the number of times focused reading has gone through the central loop in Algorithm 1 ..... 25
2.4 3-fold cross validation results comparing two RL focused reading models against the baseline ..... 25
2.5 Ablation test over the features that encode the state of the RL policies ..... 28
2.6 Error analysis for the best RL-based focused reading policy ..... 29
3.1 Query templates ..... 35
3.2 State representation features ..... 37
3.3 Multi-hop search dataset details ..... 42
3.4 Hyper-parameter values ..... 43
3.5 Feature sets ablation results. * denotes the difference w.r.t. the cascade baseline is statistically significant ..... 44
3.6 Results of the best model in the validation dataset. * denotes the difference w.r.t. the cascade baseline is statistically significant ..... 46
3.7 Examples of inference paths ..... 50
3.8 Off-topic endpoints examples ..... 51
3.9 Off-topic endpoints distributions ..... 51
3.10 Inference path with semantic drift ..... 52
3.11 Topic switching in inference paths ..... 52
3.12 Inference paths with problematic entities ..... 54
3.13 Prevalence of problematic entities in inference paths ..... 54
3.14 Taxonomy level switching example ..... 55
3.15 Prevalence of taxonomy level switching ..... 56
3.16 Example of partial and missing segments ..... 57
3.17 Size of the knowledge graph in number of processed documents ..... 60
3.18 Size of the knowledge graph in number of entities ..... 60
3.19 Size of the knowledge graph in number of relations ..... 61
3.20 Semantic drift in negative outcome cases ..... 61
4.1 Classification features ..... 74

ABSTRACT

We propose a methodology to teach an automated agent how to search for multi-hop connections in large corpora by selectively allocating and deploying machine reading resources. The elements of multi-hop connections are often located in different documents that are not known ahead of time. This makes it harder for a naive algorithm to exhaustively process a corpus of reasonable size, e.g., the English Wikipedia or PubMed Central. We formulate the elements of a novel search framework, focused reading (FR), as a Markov Decision Process, whose state representation is comprised of domain-agnostic features related to the