Ontology-Based Similarity Measures and Their Application in Bioinformatics

Ontology-based Similarity Measures and their Application in Bioinformatics DISSERTATION ZUR ERLANGUNG DES GRADES DES DOKTORS DER NATURWISSENSCHAFFTEN DER NATURWISSENSCHAFTLICH-TECHNISCHEN FAKULTÄTEN DER UNIVERSITÄT DES SAARLANDES eingereicht von ANDREAS SCHLICKER Saarbruecken, Mai 2010 Tag des Kolloquiums: 02.11.2010 Dekan der Fakultät: Prof. Dr. Holger Hermanns Vorsitzender des Prüfungsausschusses: Prof. Dr. Gerhard Weikum Gutachter: Prof. Dr. Dr. Thomas Lengauer Dr. Mario Albrecht Dr. Marie-Dominique Devignes Beisitzer: Dr. Francisco S. Domingues Abstract Genome-wide sequencing projects of many different organisms produce large numbers of sequences that are functionally characterized using experimental and bioinformatics methods. Following the development of the first bio-ontologies, knowledge of the functions of genes and proteins is increasingly made available in a standardized format. This allows for devising approaches that directly exploit functional information using semantic and functional similarity measures. This thesis addresses different aspects of the development and application of such similarity measures. First, we analyze semantic and functional similarity measures and apply them for investigating the functional space in different taxa. Second, a new software program and a new database are described, which overcome limitations of existing tools and simplify the utilization of similarity measures for different applications. Third, we delineate two applications of our functional similarity measures. We utilize them for analyzing domain and protein interaction datasets and derive thresholds for grouping predicted domain interactions into low- and high-confidence subsets. We also present the new MedSim method for prioritization of candidate disease genes, which is based on the observation that genes and proteins contributing to similar diseases are functionally related. We demonstrate that the MedSim method performs at least as well as more complex state-of-the-art methods and significantly outperforms current methods that also utilize functional annotation. iii iv Kurzfassung Die Sequenzierung der kompletten Genome vieler verschiedener Organismen liefert eine große Anzahl an Sequenzen, die mit Hilfe experimenteller und bioinformatischer Metho- den funktionell charakterisiert werden. Nach der Entwicklung der ersten Bio-Ontologien wird das Wissen über die Funktionen von Genen und Proteinen zunehmend in einem standardisierten Format zur Verfügung gestellt. Dadurch wird die Entwicklung von Ver- fahren ermöglicht, die funktionelle Informationen direkt mit Hilfe semantischer und funk- tioneller Ähnlichkeit verwenden. Diese Doktorarbeit befasst sich mit verschiedenen As- pekten der Entwicklung und Anwendung solcher Ähnlichkeitsmaße. Zuerst analysieren wir semantische und funktionelle Ähnlichkeitsmaße und verwenden sie für eine Analyse des funktionellen Raumes verschiedener Organismengruppen. Danach beschreiben wir eine neue Software und eine neue Datenbank, die Limitationen existierender Programme überwinden und den Einsatz von Ähnlichkeitsmaßen in verschiedenen Anwendungen vereinfachen. Drittens schildern wir zwei Anwendungen unserer funktionellen Ähnlichkeitsmaße. Wir verwenden sie zur Analyse von Domän- und Proteininteraktionsdatensätzen und leiten Grenzwerte ab, um die Domäninteraktionen in Teilmengen mit niedriger und hoher Kon- fidenz einzuteilen. Außerdem präsentieren wir die MedSim-Methode zur Priorisierung von potentiellen Krankheitsgenen. Sie beruht auf der Beobachtung, dass Gene und Pro- teine, die zu ähnlichen Krankheiten beitragen, funktionell verwandt sind. Wir zeigen, dass die MedSim-Methode mindestens so gut funktioniert wie komplexere moderne Methoden und die Leistung anderer aktueller Methoden signifikant übertrifft, die auch funktionelle Annotationen verwenden. v vi Acknowledgements First, I would like to thank my supervisor Thomas Lengauer for his invaluable support and advice during my PhD studies. I am grateful that he provided me with the possibility to pursue my own ideas in such a great environment. I also want to thank Mario Albrecht for guiding me during this time, for his help and advice on all problems. I am also grateful to Marie-Dominique Devignes for her willingness to act as an additional reviewer of this thesis. I especially want to thank my office mates Dorothea Emig and Fidel Ramírez and all other members of our team, Hagen Blankenburg, Sven-Eric Schelhorn, Gabriele Mayr, Sarah Diehl, and Christoph Welsch, for the discussions, their input, and the good time. Thanks go also to Alejandro Pironti with whom I worked on issues regarding the prioritization of candidate disease genes. I am also grateful to Franisco Domingues and Ingolf Sommer for critically reading manuscripts and providing suggestions. I would also like to thank all other people on floor five of the MPII I didn’t mention yet for the joy of working with them and all the time we spent together. Thanks also to Susanne Eyrisch for her suggestions on this thesis. I would especially like to acknowledge Achim Büch for his help with all technical problems and his support in setting up databases and web services. Special thanks go also to Ruth Schneppen-Christmann for her support with all administrative issues and for organizing numerous trips to conferences. Finally, I would like to thank my family and especially my mother for always believing in me and encouraging me. I am grateful to Jasmin for lighting up my life. Last but not least, I would like to thank all my friends, without whom I wouldn’t be here today, for knowingly and unknowingly supporting me whenever needed, and the countless hours of fun. vii viii Table of Contents List of Figures xiv List of Tables xv 1 Introduction1 1.1 Motivation..................................1 1.2 Overview..................................3 1.3 Outline...................................3 2 Ontologies and Similarity Computation5 2.1 Ontology..................................5 2.1.1 The Study of the Nature of Existence...............5 2.1.2 Ontologies in Computer Science and Biomedical Research....6 2.1.3 Biomedical Ontologies.......................9 2.2 Semantic Similarity............................. 13 2.2.1 Edge-Based Measures....................... 14 2.2.2 Node-Based Measures....................... 16 2.2.3 Hybrid Approaches......................... 19 2.3 Functional Similarity............................ 20 2.3.1 Pairwise Measures......................... 20 2.3.2 Groupwise Measures........................ 21 2.3.3 Measures Combining Different Ontologies............ 23 2.4 Summary.................................. 23 3 Analysis of Semantic and Functional Similarity 25 3.1 Introduction................................. 25 3.2 Materials and Methods........................... 27 ix 3.2.1 Database.............................. 27 3.2.2 Datasets with Protein Pairs..................... 27 3.2.3 Comparison with Lord et al. .................... 28 3.2.4 Functional Comparisons...................... 29 3.2.5 Multidimensional Scaling..................... 29 3.2.6 Hierarchical Clustering....................... 30 3.3 Comparing Biological Processes and Molecular Functions........ 30 3.3.1 Comparison of Processes from Fungi and Mammals....... 30 3.3.2 Comparison of Functions from Mycobacteria and Mammals... 32 3.4 Comparison of the funSim Score with Sequence Similarity........ 34 3.5 Comparison of Average and Best-Match Average Approaches...... 38 3.6 Finding Functionally Related Proteins................... 40 3.7 Analysis of the Yeast Functional Space................... 42 3.8 Applying funSim to Pfam Families..................... 47 3.9 Conclusions................................. 50 4 Software Applications for Functional Similarity Analysis 53 4.1 Introduction................................. 53 4.2 Extended Functional Similarity Scores................... 55 4.2.1 Introducing the rfunSim Score................... 55 4.2.2 Adding Cellular Component to the funSim Score......... 59 4.3 Functional Similarity Search Tool (FSST)................. 59 4.3.1 Functional Comparison of Proteins................ 60 4.4 Functional Similarity Matrix (FunSimMat)................. 60 4.4.1 Materials and Methods....................... 61 4.4.2 Query Options........................... 64 4.4.3 Web Front-End........................... 65 4.4.4 XML-RPC Interface........................ 65 4.4.5 RESTlike Interface......................... 67 4.4.6 Tools for Visualizing Result Sets.................. 67 4.5 Conclusions................................. 68 5 Functional Evaluation of Interaction Networks 69 5.1 Introduction................................. 69 5.2 Materials and Methods........................... 70 x 5.2.1 Domain Interaction Networks................... 70 5.2.2 Protein Interaction Networks.................... 72 5.2.3 Functional Similarity Measures.................. 72 5.3 Results and Discussion........................... 73 5.3.1 Comparing Confidence Scores for Domain Interactions...... 73 5.3.2 Background Distributions and Randomized Domain Networks.. 74 BMA 5.3.3 Computing and Analyzing GOscoremax Distributions....... 78 5.3.4 Deriving Confidence Score Thresholds.............. 79 5.3.5 Comparing Human Protein Interaction Networks......... 81 5.4 Conclusions................................. 86 6 Disease Gene Prioritization using Semantic Similarity 89 6.1 Introduction................................. 89 6.2 Materials and Methods........................... 93 6.2.1 Data Sources...........................

Load more