Exploiting Latent Features of Text and Graphs
Total Page:16
File Type:pdf, Size:1020Kb
Clemson University TigerPrints All Dissertations Dissertations May 2020 Exploiting Latent Features of Text and Graphs Justin George Sybrandt Clemson University, [email protected] Follow this and additional works at: https://tigerprints.clemson.edu/all_dissertations Recommended Citation Sybrandt, Justin George, "Exploiting Latent Features of Text and Graphs" (2020). All Dissertations. 2592. https://tigerprints.clemson.edu/all_dissertations/2592 This Dissertation is brought to you for free and open access by the Dissertations at TigerPrints. It has been accepted for inclusion in All Dissertations by an authorized administrator of TigerPrints. For more information, please contact [email protected]. Exploiting Latent Features of Text and Graphs A Dissertation Presented to the Graduate School of Clemson University In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Computer Science by Justin Sybrandt May 2020 Accepted by: Dr. Ilya Safro, Committee Chair Dr. Amy Apon Dr. Sez Atamturktur Dr. Brian Dean Dr. Alexander Herzog Abstract As the size and scope of online data continues to grow, new machine learning techniques become necessary to best capitalize on the wealth of available information. However, the models that help convert data into knowledge require nontrivial pro- cesses to make sense of large collections of text and massive online graphs. In both scenarios, modern machine learning pipelines produce embeddings | semantically rich vectors of latent features | to convert human constructs for machine under- standing. In this dissertation we focus on information available within biomedical science, including human-written abstracts of scientific papers, as well as machine- generated graphs of biomedical entity relationships. We present the Moliere system, and our method for identifying new discoveries through the use of natural language processing and graph mining algorithms. We propose heuristically-based ranking cri- teria to augment Moliere, and leverage this ranking to identify a new gene-treatment target for HIV-associated Neurodegenerative Disorders. We additionally focus on the latent features of graphs, and propose a new bipartite graph embedding technique. Using our graph embedding, we advance the state-of-the-art in hypergraph partition- ing quality. Having newfound intuition of graph embeddings, we present Agatha, a deep-learning approach to hypothesis generation. This system learns a data-driven ranking criteria derived from the embeddings of our large proposed biomedical se- mantic graph. To produce human-readable results, we additionally propose CBAG, ii a technique for conditional biomedical abstract generation. iii Dedication To Emma Cinatl, You're the most loving, caring, affectionate, and supportive person I've ever encoun- tered. I couldn't have asked for a better partner to persevere with through the last four years. Thank you for reminding me to see the beauty in the ordinary, for teaching me to enjoy a good hike, for encouraging me when times were tough, and for celebrating with me now that we've made it through. With love, Justin Sybrandt iv Acknowledgments I would like to thank Ilya Safro, as well as the members of my committee, for guidance over these last four years. I am grateful to have had mentors so amenable and accomplished. I would also like to acknowledge my collaborators: Michael Shtut- man on Chapters5,6, and8, Angelo Carrabba on Chapter7, Ruslan Shaydulin on Chapter4, and Ilya Tyagin on Chapter8. Furthermore, I would like to thank Lisa and Larry Sybrandt, my parents, for raising me to give my best to everything I do. Thank you to my siblings, Jen- nifer and Joseph Sybrandt, for keeping me grounded. To Marilyn and Darrel Apps, my grandparents, for providing ever-present encouragement and enthusiasm. To my grandmother, Kayleen Sybrandt, for mailing countless comic strips full of three-toed sloths and road-crossing chickens, for long conversations full of Seinfeld references, and for never-wavering love full of optimism. Lastly, I would like to thank all of my friends and colleagues, of which there are too many to list here. Your contributions, both social and scholarly, were crucial. I would specifically like to thank everyone who I had the pleasure of working alongside in our lab group: Angelo Carrabba, Varsh Chauhan, Joey Liu, Korey Palmer, Zirou Qiu, Ehsan Sadrfaridpour, Ruslan Shaydulin, Ilya Tyagin, and Hayato Ushijima- Mwesigwa. What started as a beige windowless office grew into a home away from home. v Funding Justin Sybrandt has received funding through both the US Department of Ed- ucation through the GAANN DAISE fellowship program, as well through the National Science Foundation, Award #1633608, through the NRT RIES fellowship program. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the US Department of Education. vi Table of Contents Title Page ................................... i Abstract .................................... ii Dedication................................... iv Acknowledgments............................... v Funding..................................... vi List of Tables ................................. x List of Figures................................. xii 1 Introduction................................ 1 1.1 Research Objectives............................ 2 1.2 Contributions in Summary........................ 3 2 Background ................................ 8 2.1 Latent Features.............................. 8 2.2 Text Embedding ............................. 13 2.3 Graph Embedding ............................ 21 2.4 Automatic Hypothesis Generation.................... 23 3 FOBE and HOBE: First- and High-Order Bipartite Embeddings 27 3.1 Background and Motivation....................... 28 3.2 Methods and Technical Solutions .................... 32 3.3 Algorithmic Analysis........................... 42 3.4 Empirical Evaluation........................... 43 3.5 Significance and Impact ......................... 47 3.6 Sensitivity Study ............................. 51 3.7 Conclusions................................ 53 4 Partition Hypergraphs with Embeddings............... 54 4.1 Introduction................................ 55 vii 4.2 Notation and Preliminary Concepts................... 59 4.3 Background and Related Work...................... 62 4.4 Embedding-Based Coarsening...................... 72 4.5 Experimental Design........................... 78 4.6 Results................................... 81 4.7 Conclusion................................. 89 5 Moliere: Automatic Biomedical Hypothesis Generation System . 91 5.1 Introduction................................ 92 5.2 Knowledge Network Construction.................... 100 5.3 Query Process............................... 108 5.4 Experiments................................ 111 5.5 Deployment Challenges.......................... 116 5.6 Lessons Learned and Open Problems.................. 118 5.7 Conclusions................................ 122 5.8 Acknowledgments............................. 122 6 Large-Scale Validation of Hypothesis Generation Systems via Can- didate Ranking ..............................123 6.1 Introduction................................ 124 6.2 Technical Background .......................... 127 6.3 Validation Methodology ......................... 133 6.4 New Ranking Methods for Topic Model Driven Hypotheses . 135 6.5 Results and Lessons Learned....................... 142 6.6 Case-Study: HAND and DDX3 Candidate Selection . 146 6.7 Related Work and Proposed Validation . 149 6.8 Deployment Challenges and Open Problems . 152 7 Are Abstracts Enough for Hypothesis Generation? . 156 7.1 Introduction................................ 157 7.2 Background: Literature-Based HG ................... 160 7.3 Methodology ............................... 163 7.4 Results................................... 172 7.5 Tradeoffs.................................. 177 7.6 Lessons Learned and Open Problems.................. 178 7.7 Deployment Challenges.......................... 180 7.8 Conclusion................................. 182 8 AGATHA: Automatic Graph-mining And Transformer based Hy- pothesis generation Approach......................184 8.1 Introduction................................ 185 8.2 Background................................ 188 8.3 Data Preparation............................. 194 viii 8.4 Ranking Plausible Connections ..................... 200 8.5 Validation................................. 204 8.6 Results................................... 207 8.7 Lessons Learned and Open Problems.................. 212 8.8 Related Work............................... 214 8.9 Conclusions................................ 216 9 CBAG: Conditional Biomedical Abstract Generation . 218 9.1 Introduction................................ 219 9.2 Background................................ 223 9.3 Multi-Conditional Language Model................... 227 9.4 Data Preparation............................. 229 9.5 Results................................... 232 9.6 Related Work............................... 239 9.7 Future Challenges and Ethical Considerations . 242 9.8 Conclusions................................ 243 Appendices...................................245 A Hypergraph Partitioning Details....................246 Bibliography..................................258 ix List of Tables 3.1 Graph Summary. We report the median (md) and max degree for each node set, as well as the Spectral