Augmenting Bioinformatics Research with Biomedical Ontologies
Waclaw Kusnierczyk
PhD thesis, 2008
Department of Computer and Information Science
Faculty of Information Technology, Mathematics and Electrical Engineering
Norwegian University of Science and Technology

Abstract

The main objective of the reported study was to investigate how biomedical ontologies, logically structured representations of various aspects of the biomedical reality, can help researchers in analyzing experimental data. High-throughput technologies, such as platforms for microarray gene expression experiments (MAGE), yield increasingly large amounts of ‘massive’ data. Successful analysis of the data requires expertise in the application domain, and to support researchers with automated methods, explicit representations of domain knowledge are often essential. Two attempts to construct such tools are reported here. One was successful: eGOn, a web-based tool for mapping the results of microarray gene expression experiments onto the structure of the Gene Ontology. The other was unsuccessful: a framework for knowledge- and case-based enhancement of biological association network-building tools.

Ontologies (structured, computer-understandable accounts of expert knowledge) are a relatively new invention. Until only recently, biomedical ontologies were developed with little care for formal semantics, for syntactic and semantic compatibility with each other, and for the ontological (in a philosophical sense) commitments made. However, for successful integration of resources that use different ontologies to describe their content and services, integration on the ontological level is needed. While one way to achieve this is to apply some of the much-researched techniques of ontology alignment and merging, another approach is to organize the ontology creation movement around a common basis: a unique top-level ontology and a set of design principles.
Some such integrative efforts made by the community of Open Biomedical Ontologies (OBO), in which I have participated, are reported. Furthermore, the thesis presents a framework for consistently connecting the Gene Ontology (the most prominent of the ontologies covered by OBO) with the Taxonomy of Species, and discusses the benefits of its prospective adoption by the OBO community.

Acknowledgments

This dissertation is submitted to the Norwegian University of Science and Technology (NTNU) in partial fulfillment of the requirements for the degree doktor ingeniør (PhD). It reflects and partially reports my work done at the Department of Computer and Information Science (IDI) in the course of a graduate study under the guidance of professors Jan Komorowski, Astrid Lægreid, Agnar Aamodt, and Barry Smith.

I would not have achieved much without support from my supervisors, collaborators, and colleagues. My work at NTNU in Trondheim was initially part of a research initiative in functional genomics, conducted under the guidance of prof. Jan Komorowski [1], prof. Astrid Lægreid [2], and prof. Arne Sandvik [3]. It is to these three people, and especially to Jan and Astrid, that I owe thanks for introducing me to the field and for inviting me to be a member of their research team.

Special thanks go to prof. Agnar Aamodt [4], who agreed to become my supervisor after prof. Komorowski had left for Sweden. I am deeply grateful to prof. Aamodt for having introduced me to Artificial Intelligence, Machine Learning, and Case-Based Reasoning; for his faith in my ability to succeed; and for his continuous support in my efforts to find the right environment and an appropriate research target. Last, but not least among my tutors, prof. Barry Smith [5] deserves particular appreciation for introducing me to the Open Biomedical Ontologies community, for engaging discussions, and for inviting me to visit him and his research team in Saarbrücken.

I would like to thank my colleagues at the Institute for their help, discussions, and company. I thank Jörg for helping me recover from the innumerable annoyances of LaTeX and TeX, and for discussions on these and many other issues. I would also like to thank those at IDI with whom I had the chance to collaborate in teaching; this was a great experience. Kudos to all members of the administration staff for their support in solving mundane problems. I am also grateful to colleagues from the Department of Cancer Research and Molecular Medicine (IKM) and other institutes at the Faculty of Medicine (DMF) at NTNU who helped me or allowed me to participate in their projects.

I owe thanks to those members of the Ontolog Forum [6] who, with amazing patience and understanding, responded to my questions, complaints, and claims.

Last but not least, the warmest thank you to my closest family for all the support and encouragement they have given me throughout the whole period of my study and writing.

[1] Prof. Komorowski is currently the Scientific Director of the Linnaeus Center for Bioinformatics at Uppsala University, Sweden (http://www.lcb.uu.se/).
[2] Prof. Lægreid is currently a Prorector at NTNU.
[3] Prof. Sandvik is currently a Project Leader at the NTNU Microarray Core Facility.
[4] Prof. Aamodt is currently the Leader of the Division of Intelligent Systems at IDI, NTNU.
[5] Prof. Smith is an internationally renowned philosopher-ontologist; currently, he is, among other roles, the Director of the Institute for Formal Ontology and Medical Information Science (IFOMIS) in Saarbrücken, Germany, and a member of the Scientific Advisory Board of the Gene Ontology Consortium.
[6] http://ontolog.cim3.net/.

Contents

1 Outline
  1.1 Research Overview
    1.1.1 Phase I: Analysis and Mining of Microarray Data
    1.1.2 Phase II: Knowledge-Guided Microarray Data Analysis
    1.1.3 Phase III: Biomedical Ontology Engineering
    1.1.4 Research Summary
  1.2 Thesis Overview
2 Background
  2.1 Introduction
  2.2 Bioinformatics and Computational Biology
  2.3 Data, Information, Knowledge
  2.4 Biomedical Data
  2.5 Data Integration
  2.6 Standardization
  2.7 Biomedical Ontologies
3 Standardization in Biomedical Ontology
  3.1 Introduction
  3.2 [N]ontological Engineering
    3.2.1 Knowledge and Models
    3.2.2 Concepts and Classes
    3.2.3 Classes and Individuals
    3.2.4 Further Notes
  3.3 A Philosophical Framework for Bio-Ontologies
    3.3.1 Reality and Representation
    3.3.2 Three Levels of Reality
    3.3.3 (Against) The Concept Orientation
    3.3.4 Basic Formal Ontology
    3.3.5 Discussion
4 Subsetting the GO with Slims
  4.1 Introduction
  4.2 The Generality and Specificity of GO Terms
  4.3 The GO Slims
    4.3.1 GO Slims Have Imprecisely Defined Scope
    4.3.2 ‘Species-Specificity’ Has Imprecise Meaning
    4.3.3 Relations Between Taxa Are Neglected
    4.3.4 Slims Are Built Manually
    4.3.5 Slims Are Updated Manually
  4.4 Discussion
5 Connecting the GO and the TS
  5.1 Introduction
  5.2 Relations Between the GO and the TS
    5.2.1 Validity
    5.2.2 Specificity
    5.2.3 Relevance
    5.2.4 Additional Notes
  5.3 Inference Patterns: Rules of Propagation
    5.3.1 Logical Properties of the Rules of Propagation
    5.3.2 Consequences of Propagation
  5.4 Dynamic Partitioning of the GO
  5.5 Discussion
    5.5.1 Implementation of the Framework
    5.5.2 Manually Created and Inferred Assertions
    5.5.3 Epistemological issues
    5.5.4 Logical Implications
    5.5.5 Propagation along ‘Part of’ Relations
    5.5.6 A Note on the Terminology
6 Concluding Remarks
  6.1 Goals and Questions Revisited
A Formalization of the Framework
  A.1 Introduction
  A.2 LOF: A Formalism for the GO-TS Framework
    A.2.1 The Vocabulary of LOF
    A.2.2 The Syntax of LOF
    A.2.3 The Semantics of LOF
  A.3 Monotonic Inference in LOF
    A.3.1 Inference from ΦAS-Sentences
    A.3.2 Inference from ΦSS-Sentences
    A.3.3 Inference from ΦRSS-Sentences
    A.3.4 Inference from ΦOS-Sentences
    A.3.5 ΦSS-Sentences versus ΦRSS, ΦAS, and ΦOS-Sentences
  A.4 Non-Monotonic Inference in LOF
  A.5 Discussion
    A.5.1 Existential Claims
    A.5.2 The Choice of species
    A.5.3 Translation to Other Formalisms
B Critical Assessment of Taxonomic Databases
  B.1 Introduction
    B.1.1 The Need for Taxonomic Annotation
  B.2 The Problem of Species
    B.2.1 OntoClean: Species are Metaclasses
  B.3 Taxonomic Classifications and Nomenclatures
  B.4 Problems with the Taxonomy
    B.4.1 Taxonomic Databases
    B.4.2 Taxonomic Ranks
    B.4.3 Rank Orders
    B.4.4 Rank Names
    B.4.5 Taxa and Their Ranks
    B.4.6 Taxa and Their Names
  B.5 Discussion
Bibliography

List of Figures

1.1 A biological association network from