C 2018 Sheng Wang LEVERAGING KNOWLEDGE NETWORKS for PRECISION MEDICINE
Total Page:16
File Type:pdf, Size:1020Kb
c 2018 Sheng Wang LEVERAGING KNOWLEDGE NETWORKS FOR PRECISION MEDICINE BY SHENG WANG DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2018 Urbana, Illinois Doctoral Committee: Professor ChengXiang Zhai, Chair Assistant Professor Jian Peng, Director of Research Professor Saurabh Sinha Professor Jiawei Han Professor Xinghua Lu ABSTRACT Akin to the exponential growth of genomic sequencing data, high-throughput techniques in proteomics and biotechnology have been creating ever-expanding repositories of proteomic, pharmacological, and interactomic data. Other molecular data, including expression profiles, genomic mutations and cell conditions, have also been massively generated and they are further refining our understanding of disease mechanisms. In addition, patient data, gathered by electronic medical record systems and social medias, complement biological data and pave the way for personalized treatment strategies. Therefore, efficiently and effectively integrating and mining these invaluable data hold the great promising of making precision medicine a reality. However, integrating and mining these large-scale, heterogeneous, and noisy dataset pose several fundamental computational challenges and have therefore become a bottleneck to clinical decision making and medical knowledge discovery. This thesis is a systematic study of mining these biological and healthcare data for precision medicine. I take a network perspective and integrate these datasets into a large knowledge network where nodes are biological concepts and links are biological relationships. I then propose a novel computa- tional framework to mine these knowledge networks. To demonstrate the effectiveness of mining knowledge networks, I will introduce how this framework can be used to understand molecular functions, accelerate drug discovery, and support clinical decision making. To understand molecular functions, I will show how a knowledge network can substantially im- prove gene function prediction performance and further annotate novel gene sets by mining scientific literature-based knowledge network. To accelerate drug discovery, I will use the knowledge network to predict drug targets and identify drug associated pathways. To sup- port clinical decision making, I will discuss our efforts in integrating genomics data with clinical data to cluster patients, predict patient survival and visualize patient records. Fi- nally, I will conclude this thesis by summarizing how mining knowledge networks advance precision medicine and discussing the promising future work of this thesis. ii To my wife and my parents for all their love. iii ACKNOWLEDGMENTS I would like to thank all the people who give me the tremendous help in the past five years. First of all I would like to express my sincere gratitude to my advisor Prof. ChengXiang Zhai for his patience, earnest encouragements, and strongest supports. He always encouraged me to think critically and independently, defended and improved my ideas through discussion with him. From him I have learned the way of being an independent researcher. Also, I am heartily thankful to the way Prof. Zhai treated his students. I also want to express my deepest appreciation to my advisor Prof. Jian Peng. Prof. Peng introduced me to the wonderful world of bioinformatics and network biology. He later guided me to focus on an important topic of cancer genomics. In the pursuit of my Ph.D thesis, I have been always inspired by his passion, vision and knowledge in the field. Prof. Zhai and Prof. Peng worked with me like peers and colleagues. I could not have imagined having a better advisor and mentor for my Ph.D study. I would also like to thank my thesis committee members, Prof. Saurabh Sinha, Prof. Jiawei Han and Prof. Xinghua Lu, for their great feedback and suggestions on my Ph.D. research and thesis work. Prof. Sinha has been giving me invaluable support in many aspects of my research, especially in my postdoc application. I have also benefited a lot from the weekly meetings with him. Prof. Han has set up a great example of an authentic researcher for me: his keen insights and vision, his passion on research and life, his patience in mentoring students, and his diligence. Prof. Lu has given me many insightful comments about cancer genomics research and invaluable support and advice in my career development. I also sincerely thank Dr. Hoifung Poon, Dr. Ryen White and Dr. Eric Horvitz in Microsoft Research, Prof. Trey Ideker at University of California San Diego and Dr. Liewei Wang at Mayo Clinic. Working with them has given me the unique opportunities to be exposed to practical clinical problems, and the outcome of these collaborations contributed to very important pieces of this thesis. I also want to thank all DAIS group members and visitors, who have built such a great environment for research. Finally and above all, I owe my deepest gratitude to my wife Xiuwen and my parents. I want to thank my wife, for her love, understanding, patience, and support at every moment. And I want to thank my parents for their endless and unreserved love, who have been encouraging and supporting me all the time. This thesis is dedicated to them. iv TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . 1 CHAPTER 2 A FRAMEWORK FOR MINING KNOWLEDGE NETWORKS . 3 2.1 Computational approaches for mining knowledge networks . .3 2.2 Related work . 11 CHAPTER 3 KNOWLEDGE NETWORK ASSISTED GENE FUNCTION AN- NOTATION . 13 3.1 Homogeneous Network Embedding for Gene Function Prediction . 13 3.2 Integrating Heterogeneous Information for Gene Function Prediction . 26 3.3 Annotating gene sets by integrating literature data and molecular networks . 36 CHAPTER 4 KNOWLEDGE NETWORK ASSISTED DRUG DISCOVERY . 48 4.1 Knowledge network assisted drug pathway identification . 48 4.2 Network-assisted drug target identification . 64 CHAPTER 5 KNOWLEDGE NETWORK ASSISTED CLINICAL DECISION MAKING . 85 5.1 Constructing A Symptom, Disease, and Drug Network from Clinical Data . 85 5.2 Mining Adverse Drug Reactions Network from Online Health Forums . 97 5.3 Incorporating knowledge networks in electronic medical records analysis . 106 CHAPTER 6 SUMMARY AND FUTURE DIRECTIONS . 114 6.1 Knowledge Network-Supported Biological Finding Interpretation . 114 6.2 Refining Knowledge Networks with Experimental Data . 115 6.3 Uncovering disease mechanisms through network biology . 115 REFERENCES . 116 v CHAPTER 1: INTRODUCTION Akin to the exponential growth of genomic sequencing data, high-throughput techniques in proteomics and biotechnology have been creating ever-expanding repositories of genomic, proteomic, and interactomic data, leading to significant breakthroughs in precision medicine [1, 2, 3, 4, 5, 6, 7, 8]. Other molecular data, including expression profiles, genomic muta- tions and cell conditions, have also been massively generated and they are further refining our understanding of disease progression and drug mechanisms of action [9, 10, 11, 12, 13, 14, 15]. Patients data, including clinical data collected in electronic medical records and user-generated online social media data, is also of considerable interest due to its unique value in clinical data mining tasks such as adverse drug reactions detection, patient survival prediction, patient subtyping, and treatment recommendation [16, 17, 18, 19, 20, 21]. These datasets contain valuable knowledge about biological systems, human diseases, and drug mechanisms, creating an unprecedented opportunity for accelerating scientific discov- ery via discovery of knowledge from these data as well as optimizing clinical decisions and improving healthcare by developing intelligent machine learning-based decision support sys- tems. Moreover, it becomes even more pronounced when we jointly mine these heterogeneous dataset. For example, we can use large-scale genome-wide association studies to interpret unexpected adverse drug reactions identified in electronic medical records [22, 23]. Patient's response to small molecular inhibitors can also be used to refine our understanding of genetic interaction in cancer [24, 25, 26]. In this thesis, I systematically studied how to leverage knowledge networks to discover new biomedical knowledge and support clinical decision making. Despite the success of pre- vious data-driven approaches, they suffer from the noise and missing values in biomedical datasets. Intuitively, integrating different biomedical datasets reduces noise in biomedical dataset, leading to less false-positive results. However, jointly mining different biomedical datasets is more challenging due to the heterogeneity of biomedical datasets. To address these challenges, I propose to integrate different biomedical datasets into a large-scale, het- erogeneous knowledge network. In this knowledge network, nodes are biomedical entities such as drugs, diseases, and genes. Links are relations such as drug a inhibits gene b, gene c 'is mutated' in patient d, and gene e interacts with gene f. This knowledge network's poten- tial for advancing precision medicine is based on the following two factors. First, integrating knowledge from different datasets offers a multi-perspective view of disease progression and drug mechanisms. Applying `guilt by association' rule [27] to this knowledge network boosts the inference of new knowledge by combining information from different sources. Secondly, 1 despite the success of existing biomedical data