Methods and Models for the Analysis of Biological Signifïcance Based on High-Throughput Data
Total Page:16
File Type:pdf, Size:1020Kb
Methods and Models for the Analysis of Biological Signifïcance Based on High-Throughput Data Jose Luis Mosquera Mayo Aquesta tesi doctoral està subjecta a la llicència Reconeixement- NoComercial – CompartirIgual 3.0. Espanya de Creative Commons. Esta tesis doctoral está sujeta a la licencia Reconocimiento - NoComercial – CompartirIgual 3.0. España de Creative Commons. This doctoral thesis is licensed under the Creative Commons Attribution-NonCommercial- ShareAlike 3.0. Spain License. Methods and Models for the Analysis of Biological Significance Based on High Throughput Data Jose Luis Mosquera Mayo 2 Methods and Models for the Analysis of Biological Significance Based on High Throughput Data M`etodes i Models per a l’An`aliside la Significaci´oBiol`ogicaBasada en Dades d'Alt Rendiment Mem`oriapresentada per en Jose Luis Mosquera Mayo per optar al grau de doctor per la Universitat de Barcelona SIGNATURES El Doctorand: Jose Luis Mosquera Mayo El Director de la Tesi: Dr. Alexandre Sanchez´ Pla El Tutor de la Tesi: Dr. Josep Mar´ıa Oller Sala Programa de doctorat en estad´ıstica Departament d'Estad´ıstica,Facultat de Biologia Universitat de Barcelona 4 Acknowledgements \A teacher affects eternity, he can never tell where his influence stops." Henry Brooks Adams \Mathematicians do not study objects, but relations between objects." Jules Henri Poincar´e It is truly an honor for me to address these words of gratitude to everyone who has contributed, to a greater or lesser extent by supporting me during this exciting program of studies, to help make my lifelong dream come true: to complete my doctoral dissertation. First and foremost, I would like (and feel compelled) to mention a special debt of gratitude to Prof. Alex S´anchez Pla. I am enormously grateful to him, as the director of this dissertation, for having taught, guided and supported me, correcting my scientific work with an interest and dedication that have greatly exceeded all the expectations that I, as a UB student, could have had. Likewise, I would also like to thank him once again for his work as the head of the Statistics and Bioinformatics Unit at the VHIR, for directing, supervising and correcting all the projects in which I have participated over the last six years as a senior technician in the department. It is also my deep personal desire to thank all the members of the Statistical Research and Bioinformatics Group, as well as the professors in the UB Statistics Department, for everything they have taught me, both profes- sionally and as a human being. I learned something from them each and every day. For these reasons, I would like to specifically express my thanks to Prof. M. Carme Ru´ızde Villa, Prof. Francesc Carmona, Prof. Esteban Vegas, Prof. Ferran Reverter and Prof. Antoni Mi´narro. I would also like to recognize my office mate and sparring partner, both at the UB and the VHIR, Mr. Israel Ortega. I would not wish to neglect a big thank you to the department secretaries, Ms. Roser Maldonado 5 6 and Ms. Elisabet Ballester, for their hard work and dedication; without them, the bureaucracy would have been a disaster. I would like to extend these thanks to all the members of the Statistics and Bioinformatics Unit and the High Technology Unit at the VHIR. In particular, I would like to thank Mr. Alejandro Artacho, Dr. Ricardo Gonzalo and Dr. Francisca Gallego for having shared their knowledge and friendship with me. I must not forget a few words of appreciation for the researchers at the VHIR, with whom I have col- laborated closely, and who without knowing it, have at times indirectly contributed to this dissertation. I would therefore like to especially mention Dr. Mar´ıa Vicario, Dr. Cristina Mart´ınez, Dr. Javier Santos, Dr. Rosanna Paciucci, Dr. Andreas Doll and Dr. Jaume Revent´os. Finally, I would like to express my eternal gratitude to Antonio, Carmen, Chus, Mar´ıa, Magdalena, my parents, my brother and my grandparents, who are no longer with us; without them, I would never have been able to complete this project. They have always been with me, through thick and thin. Furthermore, I would like to sincerely thank Inma, Ricardo, Jordi, Ferran, Mireia and Paula for having put up with me, without asking for anything in return. Thank you very much to all those whom I have not mentioned, who have contributed in some way to this personal achievement; please forgive me for this omission. Thank you. Contents Acknowledgement5 Contents 13 List of Tables 15 List of Figures 17 Glossary 21 I Introduction and Objectives 23 1 Introduction: Scope of the Thesis 25 1.1 Meaning of Biological Significance Concept........... 27 1.2 The Gene Ontology........................ 28 1.2.1 The Ontology Domains of the GO............ 29 1.2.1.1 Cellular Component.............. 29 1.2.1.2 Biological Process............... 30 1.2.1.3 Molecular Function............... 31 1.2.2 Scope of the GO..................... 32 1.2.3 The Structure of the GO................. 33 1.3 From Biological to Statistical Significance........... 35 1.3.1 Philosophy of Semantic Similarity Measures...... 35 1.3.1.1 Semantic Similarity Methods......... 37 1.3.1.2 The Edge-Based Approach........... 38 1.3.1.3 The Node-Based Approach.......... 39 1.3.1.4 The Hybrid-Based Approach......... 39 1.3.2 Philosophy of Enrichment Analysis........... 39 1.3.2.1 Enrichment Analysis Tools........... 40 1.3.2.2 Singular Enrichment Analysis......... 41 1.3.2.3 Modular Enrichment Analysis......... 41 1.3.2.4 Gene Set Enrichment Analysis........ 42 7 CONTENTS 8 2 Hypothesis, Objectives and Outline of the Thesis 45 2.1 Hypotheses............................ 45 2.2 Objectives............................. 46 2.2.1 Main Objectives...................... 46 2.2.2 Specific Objectives.................... 46 2.2.2.1 Specific Objectives Associated with the Study of Semantic Similarity Measures.... 47 2.2.2.2 Specific Objectives Associated with the Clas- sification and Study of GO Tools for Enrich- ment Analysis.................. 47 2.3 Outline............................... 47 II Study of Two Semantic Similarity Measures for Exploring GO Categories 49 3 Material and Methods 53 3.1 Main Concepts of Graph Theory................. 53 3.1.1 Basic Graph Concepts.................. 54 3.1.2 Subgraphs......................... 55 3.1.3 Directed Graphs..................... 55 3.1.4 Paths and Connection.................. 56 3.1.5 DAG and Rooted DAG.................. 57 3.1.6 Matrices and Graphs................... 57 3.1.7 Order and Degrees.................... 59 3.2 The GO graph.......................... 61 3.2.1 Basic Concepts of the GO graph............. 61 3.2.2 The Study of the GO based on the Graph Theory... 61 3.3 Carey's Framework........................ 63 3.3.1 Refinement of Relationships............... 64 3.3.1.1 Refinement Matrix............... 65 3.3.1.2 Accessibility Matrix.............. 66 3.3.2 Mapping Genes to GO.................. 67 3.3.2.1 Mapping Matrix................ 67 3.3.2.2 Object-Ontology Complexes.......... 68 3.3.2.3 Coverage Matrix................ 68 3.4 Main concepts of POSET theory................ 69 3.4.1 Partially Ordered Sets Definition............ 69 3.4.2 Chains and Anti-chains.................. 71 3.4.3 Ideal, Filter and Hourglass................ 71 3.4.4 Upper and Lower Bounds................ 72 9 CONTENTS 3.4.5 Interval Orders and Length............... 73 3.4.6 POSET Ontology..................... 74 3.5 Similarity Measures........................ 75 3.5.1 Formal Definitions of Similarity Measure and Metric Distance.......................... 75 3.5.2 Semantic Similarity Measure............... 76 3.5.3 Organization and Classification of Semantic Similarities 77 3.6 GO Terms and Semantic Similarity Measures.......... 78 3.6.1 Lord's Measure...................... 79 3.6.1.1 The Information Content Concept...... 79 3.6.1.2 Resnik's Similarity Measure.......... 80 3.6.1.3 Lord's similarity measure........... 81 3.6.2 Joslyn's Measure..................... 82 3.6.2.1 Pseudo-distances................ 83 3.7 sims: An R Package for Computing Semantic Similarities of an Ontology............................ 84 4 Results 85 4.1 Minor Contributions....................... 85 4.1.1 Graph Theory....................... 85 4.1.2 Semantic Similarity Measures.............. 87 4.1.2.1 The Information Content Concept...... 88 4.2 Main Contributions........................ 89 4.2.1 GO Terms and Semantic Similarity Measures..... 89 4.2.2 Lord's Measure...................... 89 4.2.2.1 The Information Content Concept...... 90 4.2.2.2 Resnik's Measure in Terms of Metric Distance 92 4.2.3 Joslyn's Measure..................... 94 4.2.4 sims: An R Package for Computing Semantic Similar- ities of an Ontology.................... 96 4.2.4.1 Availability of sims .............. 96 4.2.4.2 Requirements.................. 97 4.2.4.3 Main Possibilities of the Package and the List of Functions................... 97 4.2.4.4 Semantic Similarity Measures Implemented in sims ..................... 97 4.2.5 Semantic Similarity Profiles between GO Terms.... 100 5 Discussion 103 CONTENTS 10 III Classification of GO Tools for Enrichment Anal- ysis 109 6 Material and Methods 113 6.1 Selection of GO Tools....................... 113 6.2 Definition of Standard Functionalities Set and Classification of GO Tools............................ 115 6.3 SerbGO: Searching for the Best GO Tool............ 115 6.3.1 Implementation of SerbGO ................ 116 6.4 Evolution and Clustering of GO Tools............. 116 6.4.1 Data............................ 117 6.4.1.1 Raw Data.................... 117 6.4.1.2 Homogenization of Raw Data......... 117 6.4.1.3 Data Matrices.................. 118 6.4.2 Statistical Methods.................... 118 6.4.2.1 Descriptive Statistics Methods........ 118 6.4.2.2 Inferential Analysis Methods......... 119 6.4.2.3 Multivariate Analysis Methods........ 120 6.5 DeGOT: An Ontology for Developing GO Tools......... 124 6.5.1 Basic Concepts of an Ontology............