Recognition and Normalization of Terminology from Large
RECOGNITION AND NORMALIZATION OF TERMINOLOGY FROM LARGE BIOMEDICAL ONTOLOGIES AND THEIR APPLICATION FOR PHARMACOGENE AND PROTEIN FUNCTION PREDICTION

by
CHRISTOPHER STANLEY FUNK
B.S., Baylor University, 2009

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Computational Bioscience Program
2015

This thesis for the Doctor of Philosophy degree by Christopher Funk has been approved for the Computational Bioscience Program by
Kevin B. Cohen, Chair
Lawrence E. Hunter, Advisor
Karin M. Verspoor
Asa Ben-Hur
Joan Hooper
Date 4/29/2015

Funk, Christopher Stanley (Ph.D., Computational Bioscience)
Recognition and Normalization of Terminology From Large Biomedical Ontologies and their Application for Pharmacogene and Protein Function Prediction
Thesis directed by Professor Lawrence E. Hunter

ABSTRACT

In recent years many high-throughput techniques have enabled the study and production of large amounts of biomedical data; as a result, the biomedical literature is growing at an exponential rate and shows no sign of slowing. With this comes a significant increase in knowledge that remains unreachable for many applications because they rely on manually curated databases or on software that is unable to scale. This dissertation focuses on benchmarking and improving the performance of scalable biomedical concept recognition systems, along with exploring the utility of text-mined features for biomedical discovery.

The initial focus of my dissertation is on the task of concept recognition from free text: identifying semantic concepts from well-utilized, community-supported open biomedical ontologies. Identifying these concepts and grounding text to known ontological identifiers allows any application to also leverage the vast knowledge associated with and curated for the concept.
I establish current baselines for recognition through a rigorous evaluation and full parameter exploration of three concept recognition systems on eight biomedical ontologies, using the CRAFT corpus as the gold standard. Additionally, I create synonym expansion rules that improve performance, specifically recall, on Gene Ontology concepts.

The later chapters focus on the application of text-mining features obtained from large literature collections for biomedical discovery. The two specific problems presented are the prediction of pharmacogenes and automated protein function prediction. Information contained in the literature has only begun to be harnessed and incorporated into machine learning systems to make biomedical predictions, so I explore two largely open questions:

1. What information should be mined from the literature?
2. How should it be combined with other types of data, both literature-based and sequence-based?

I demonstrate that improving the ability to recognize concepts from the Gene Ontology produces more informative functional predictions, and illustrate that literature features are not only helpful for making predictions but can also aid in their validation.

Other contributions of this dissertation include publicly available software: a fast, user-friendly concept recognition and evaluation pipeline, along with the hand-crafted compositional rules for increasing recognition of Gene Ontology concepts.

The form and content of this abstract are approved. I recommend its publication.
Approved: Lawrence E. Hunter

To my loving wife, Charis, and my ever supportive parents, Ralph and Linda.

ACKNOWLEDGMENT

My family, friends, and colleagues have been an incredible support as I have been working these past years. I would first like to thank my loving wife, Charis. She has been very supportive during these past years. She has been understanding on the evenings that I had to work and about the time away for conferences.
I would also like to thank my parents, Ralph and Linda Funk, both educators, who stressed the importance of education as I was growing up. They have provided support daily with their prayers and encouraging words; from the beginning they have encouraged me to pursue what I love.

I would like to thank all the professors in the Computational Bioscience Program (CPBS) for taking a chance on a student with only a bachelor's degree and only marginal research experience. I also want to thank my entire thesis committee: Larry Hunter, Kevin Cohen, Karin Verspoor, Asa Ben-Hur, and Joan Hooper. It has been wonderful working with them all over the past five years; I have learned a lot from watching them run their labs and apply for grants, and I am encouraged by their enthusiasm for research. I would like to specifically thank my thesis advisor, Larry Hunter. With his continual stressing of the importance of networking, he has pushed an introvert out of his comfort zone. Next I would like to thank my committee chair, Kevin Cohen. He has kept me on track during my time here and always has his door open for any conversation, from work to sharing embarrassing memories. I would like to thank Karin Verspoor for agreeing to take on a student who was rough around the edges and for continuing to mentor me despite the 15-hour time difference. My entire committee has shaped my research and this dissertation; their tough feedback and questions have made me a better scientist and prepared me for my future career.

I would next like to thank the current and past students within CPBS: Bill Baumgartner, David Knox, Alex Poole, Michael Hinterberg, Meg Pirrung, Mike Bada, Kevin Livingston, Ben Garcia, Charlotte Siska, and Gargi Datta. I have enjoyed collaborating with them. They have always been willing to listen to ideas and provide honest feedback. Not only have they been work colleagues, but they have been friends; I look forward to keeping up and seeing the amazing places we all end up.
I would also like to thank students Betsy Siewert and Daniel Dvorkin for saving me countless weeks of work by providing a LaTeX template for this dissertation.

I would like to thank the program administration: Kathy Thomas, Elizabeth Wethington, and Liz Pruett. Their knowledge and experience of the inner workings of the graduate school has made that part. I would also like to thank Dave Farrell for his help with any technical difficulties that came my way.

In the following sections, I acknowledge my collaborators and co-authors on the specific projects presented in this dissertation.

Chapter II
I would like to thank William Baumgartner, Benjamin Garcia, Christophe Roeder, Michael Bada, Kevin Cohen, Lawrence E Hunter, and Karin Verspoor, my co-authors for this original work. I give special thanks to both Bill Baumgartner and Christophe Roeder for the hours they spent teaching me and helping me to learn UIMA. I would like to acknowledge Willie Rodgers from the National Library of Medicine for his help with MetaMap. I appreciate the time spent by the three anonymous reviewers who read and helped to improve this work.

Chapter III
I would like to thank Kevin Cohen, Lawrence E Hunter, and Karin Verspoor, my co-authors for this original work. The comments from these co-authors have pushed me beyond the original scope of the work and have greatly improved it. I also would like to thank Mike Bada, Judy Blake, and Chris Mungall for their input and discussions about the synonym generation rules.

Chapter IV
I would like to thank Kevin Cohen and Lawrence E Hunter, my co-authors for this original work. I appreciate the committee organizers for the “text-mining for biomedical discovery” session at PSB (not to mention the wonderfully chosen conference location) and the time spent by the three anonymous reviewers who read and helped to improve this work.

Chapter V
I would like to thank Karin Verspoor, Asa Ben-Hur, Artem Sokolov, Kiley Graim, and Indika Kahanda, who were all instrumental in accomplishing the work in this chapter. I thank the organizers of the two CAFA community challenges for providing the motivation, the data, and the man hours spent evaluating many predictions. I appreciate the time spent by the three anonymous reviewers who read and helped to improve all the papers combined for this work.

Chapter VI
I would like to thank Karin Verspoor, Asa Ben-Hur, and Indika Kahanda, my co-authors for this work. I appreciate the time spent by the two anonymous reviewers who read and helped to improve this work.

TABLE OF CONTENTS

CHAPTER
I. INTRODUCTION
  1.1 Biomedical ontologies
  1.2 Biomedical text mining
    1.2.1 Concept recognition/normalization
      1.2.1.1 BioCreative
      1.2.1.2 Corpora
  1.3 A brief intro to automated protein function prediction
  1.4 Dissertation goals
    1.4.1 Chapter II
    1.4.2 Chapter III
    1.4.3 Chapter IV
    1.4.4 Chapter V
    1.4.5 Chapter VI
II. LARGE SCALE EVALUATION OF CONCEPT RECOGNITION
  2.1 Introduction
  2.2 Background
    2.2.1 Related work
    2.2.2 A note on “concepts”
  2.3 Methods
    2.3.1 Corpus
    2.3.2 Ontologies
    2.3.3 Structure of ontology entries
    2.3.4 A note on obsolete terms
    2.3.5 Concept recognition systems
    2.3.6 Parameter exploration
    2.3.7 Evaluation pipeline
    2.3.8 Statistical analysis
    2.3.9 Analysis of results files
  2.4 Results and discussion
    2.4.1 Best parameters
    2.4.2 Cell Type Ontology
      2.4.2.1 NCBO Annotator parameters
      2.4.2.2 MetaMap parameters
      2.4.2.3 ConceptMapper parameters
    2.4.3 Gene Ontology - Cellular Component
      2.4.3.1 NCBO Annotator parameters
      2.4.3.2 MetaMap parameters
      2.4.3.3 ConceptMapper parameters
    2.4.4 Gene Ontology - Biological Process
      2.4.4.1 NCBO Annotator parameters