Multiple Kernel Learning for Gene Prioritization, Clustering
Total Page:16
File Type:pdf, Size:1020Kb
MULTIPLE KERNEL LEARNING FOR GENE PRIORITIZA.TION.CLUSTERING. AND FI.INCTIONAL ENRICHMENT ANAI-YSIS bv DavidH. Millis A Dissertation Submittedto the GraduateFaculty of GeorgeMason University in PartialFulfillment of The Requirementsfor the Degree of Doctorof Philosophy Bioinformaticsand Computational Biology Dr. JeffreyL. Solka,Dissertation Director Dr. JamesD. Willett, DissertationChair Dr. LakshmiK. Matukumalli. CommitteeMember Dr. JamesD. Willett, DepartmentChair Dr. DonnaFox, Associate Dean, Student Affairs & SpecialPrograms, College of Science Dr. PeggyAgouris, Interim Dean, College of Science SpringSemester 2014 GeorgeMason University Fairfax,VA Multiple Kernel Learning for Gene Prioritization, Clustering, and Functional Enrichment Analysis A Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at George Mason University by David H. Millis Doctor of Medicine Howard University College of Medicine, 1983 Director: Jeffrey L. Solka, Professor Department of Bioinformatics and Computational Biology Spring Semester 2014 George Mason University Fairfax, VA This work is licensed under a creative commons attribution-noderivs 3.0 unported license. ii DEDICATION In memory of my parents, Ena Grace Millis Clovis Bolton Millis You believed that I could do anything. iii ACKNOWLEDGMENTS I would like first to express gratitude to my dissertation committee. Dr. Jeffrey Solka served as dissertation director, helped provide order and focus to this research effort, and patiently reviewed multiple document drafts. Dr. Lakshmi Matukumalli allowed access to data generated by the USDA Bovine Functional Genomics Laboratory, and gave wise, practical advice on the dissertation completion process. Dr. James Willett served as dissertation chair, offered biological insights that helped move the research forward, and provided validation and support for the usefulness of the ideas developed over the course of this project. Many thanks to Dr. Ronald Kostoff, who arranged funding for a summer internship with The Mitre Corporation and provided helpful insights on text mining and literature-based discovery. I worked at the Thomas B. Finan Center, a psychiatric hospital in Cumberland, Maryland, while completing this dissertation. To the leadership, staff, and patients of the Finan Center: You have taught me much about suffering, survival, resilience, community, and the challenges of being human. For all you have given me, I will always be grateful. To the many music teachers and church musicians with whom I worked over the years, in New York, California, and Maryland: I learned a lot from you about persistence, about being unafraid to make mistakes, and about honing one’s craft by always going back to the fundamentals. Everything that you taught me about artistic growth and craftsmanship is reflected in these pages. To my sisters Dianne Millis and Kathy Brown, brother Michael Millis, and brothers-in- law Eric Brown and Peter Brown: Each of you has had a role in making this accomplishment a reality. Thank you for you love and support throughout this long process. Spending time with you during the holidays always reminds me of the things in life that are truly important. iv To my relatives across the country and around the world, in New York, New Jersey, Connecticut, Georgia, Florida, California, England, and Jamaica: Although we have been far apart geographically, we remain close in spirit. I am honored to make this accomplishment a part of our family history. Finally, to my nephews Marcus Brown and Jordan Brown: Find something that you like doing and that the world needs to get done, and have fun working hard at it. The world needs you. Everything you need to change the world, you already possess. You are perfect as you are. David H. Millis, M.D., Ph.D. Gainesville, Virginia April 29, 2014 v TABLE OF CONTENTS Page List of Tables ...................................................................................................................... x List of Figures .................................................................................................................... xi List of Equations ............................................................................................................... xii List of Abbreviations ....................................................................................................... xiii Abstract ............................................................................................................................ xiv Chapter One: Introduction .................................................................................................. 1 Research Problem: Gene Prioritization ........................................................................... 1 Research Hypothesis ....................................................................................................... 5 Research Contributions ................................................................................................... 5 Chapter Two: Ensemble Methods for Classification .......................................................... 7 Classification Strategies .................................................................................................. 7 Ensemble Classifiers ....................................................................................................... 8 Structure of Base Classifiers........................................................................................ 9 Design of Base Classifier Training Sets .................................................................... 10 Combining Classifier Outputs ................................................................................... 12 Chapter Three: Kernel Methods and Multiple Kernel Learning ....................................... 14 Kernel Methods ............................................................................................................. 14 Support Vector Machines .............................................................................................. 15 Combining Kernels ....................................................................................................... 20 Multiple Kernel Learning .............................................................................................. 20 Chapter Four: Text Processing Methods for Measuring Gene Similarity ....................... 25 Gene Similarity Based on Shared Abstracts ................................................................. 26 Gene Similarity Based on Co-Occurrence of Gene Names in Abstracts ...................... 26 Gene Similarity Based on Cosine Similarity of Free Text in Abstracts ........................ 27 Chapter Five: Classifier Design ........................................................................................ 29 Classifier Design: Overview ......................................................................................... 29 vi Data Sources .................................................................................................................. 30 The Gene Ontology (GO) Database .......................................................................... 30 The KEGG Pathway Database .................................................................................. 31 The REACTOME Pathway Database ........................................................................ 31 The STRING Protein-Protein Interaction (PPI) Database ......................................... 31 PubMed: Shared Identifiers ....................................................................................... 32 PubMed: Named Entity Recognition ......................................................................... 32 PubMed: Cosine Similarity of Abstracts ................................................................... 33 MicroRNA Target Prediction Algorithms ................................................................. 34 Risperidone Differential Expression Data ................................................................. 34 Data Preprocessing ........................................................................................................ 35 SVM Step ...................................................................................................................... 36 MKL Step ...................................................................................................................... 36 Classifier Diversity ........................................................................................................ 37 Measuring Classifier Performance ................................................................................ 37 Selecting Best MKL Classifier ...................................................................................... 38 Clustering Methods ....................................................................................................... 38 Functional Enrichment Analysis ................................................................................... 39 Chapter Six: Project #1: microRNA-Gene Interactions ................................................... 41 Background: Biology of microRNAs and microRNA Target Prediction ..................... 41 MicroRNAs and Gene Expression ................................................................................ 42 MicroRNA Target Prediction ........................................................................................ 44 Computational microRNA Target Prediction