Ontology: Tool for Broad Spectrum Knowledge Integration Barry Smith
Total Page:16
File Type:pdf, Size:1020Kb
Foreword to Chinese Translation Ontology: Tool for Broad Spectrum Knowledge Integration Barry Smith BFO: The Beginnings This book was first published in 2015. Its primary target audience was bio- and biomedical informaticians, reflecting the ways in which ontologies had become an established part of the toolset of bio- and biomedical informatics since (roughly) the completion of the Human Genome Project (HGP). As is well known, the success of HGP led to the transformation of biological and clinical sciences into information-driven disciplines and spawned a whole series of new disciplines with names like ‘proteomics’, ‘connectomics’ and ‘toxiocopharmacogenomics’. It was of course not only the human genome that was made available for research but also the genomes of other ‘model organisms’, such as mouse or fly. The remarkable similarities between these genomes and the human genome made it possible to carry out experiments on model organisms and use the results to draw conclusions relevant to our understanding of human health and disease. To bring this about, however, it was necessary to create a controlled vocabulary that could be used for describing salient features of model organisms in a species-neutral way, and to use the terms of this vocabulary to tag the sequence data for all salient organisms. It was with the purpose of creating such a vocabulary that the Gene Ontology (GO) was born in a Montreal hotel bar in 1998.1 Since then the GO has served as mediator between the new genomic data on the one hand, which is accessible only with the aid of computers, and what we might think of as the ‘old biology data’ captured using natural language by means of terms such as ‘cell division’ or ‘mitochondrion’ or ‘protein binding’. Very soon the success, and the utility, of the GO led to the creation of other ontologies covering domains not covered by the GO, such as cell types, proteins, anatomy and diseases, grouped together under the heading ‘OBO’ for ‘Open Biomedical Ontologies’. Beginning in around 2003, however, the problem arose of determining what should serve as the common formal architecture that would bind these ontologies together, and it was in this context that Basic Formal Ontology was born. The result, which is documented at length in the pages that follow, is the Open Biological and Biomedical Ontologies (OBO) Foundry, a suite of life science ontologies all of which descend from BFO and rest on the same set of principles of good practice for ontology development, principles designed to ensure interoperability. The OBO 1 https://www.ncbi.nlm.nih.gov/pubmed/10802651 Foundry ontologies2, like BFO, are created and maintained manually. It is human beings with corresponding scientific and ontological expertise who are responsible for populating the ontologies, formulating definitions and responding to requests from users of the ontology who identify errors or gaps. The underlying idea is that ontologies should be developed in a process of coordinated evolution, each ontology being used in a variety of other co-developed ontologies, for instance through use of its terms in writing definitions. The OBO Foundry principles for coordinated evolution have brought considerable success. Where many ontology initiatives, because they are too closely tied to specific local projects and research groups, enjoy short lives, the aggressive reuse of OBO Foundry ontologies to highly diverse projects and use cases has led to an increasing influence of the OBO Foundry and with this came also the increasing influence of BFO, illustrated not least in the publication of this book. BFO in China As this volume demonstrates, BFO and the OBO Foundry have been discovered, too, in China, where BFO is already being used as top-level ontology by ontologies such as the Traditional Chinese Drug Ontology (TCDO)3 and Chinese Plant Species Diversity Domain Ontology.4 The newly initiated OntoChina program5 aims to support collaborative ontology development, application and distribution across the life sciences and beyond. It is members of the OntoChina group who are responsible for this translation, for the translation into Chinese not only of the BFO ontology itself, but also of other BFO-based ontologies, such as the Information Artifact Ontology (IAO)6 and the Ontology for Biomedical Investigations7. OntoChina also plans to train future ontologists in China using BFO and the BFO-based ontology ecosystem as foundation. BFO in the OBO Foundry It was not until 2016 that BFO was adopted as the official top-level ontology of the OBO Foundry and it is worth examining in this light one core principle of the Foundry suite, namely the principle of orthogonality. Two ontologies are ‘orthogonal’ in Foundry terminology, if they do not overlap – which means: if they share no terms in common. This principle was adopted in support of the idea that there should be exactly one ontology for each domain – so that each term should belong to exactly one ontology. The rationale for this principle is that if all those working in a given domain use the same 2 Barry Smith, et al., “The OBO Foundry: Coordinated Evolution of Ontologies to Support Biomedical Data Integration”, Nature Biotechnology, 25 (11), 2007. 3 Yan Zhu, Lihong Liu, Bo Gao, Yongqun He. TCDO: a community-based Traditional Chinese Drug Ontology. The 11th International Biocuration Conference (Biocuration-2018), Crowne Plaza Hotel Shanghai Fudan, Shanghai, China, April 8-11, 2018. 4 http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/abstract/abstract4189.shtml 5 http://ontochina.org 6 Werner Ceusters and Barry Smith, “Aboutness: Towards Foundations for the Information Artifact Ontology”, Proceedings of the Sixth International Conference on Biomedical Ontology (ICBO), (CEUR 1515), 2015, 1-5. 7 Anita Bandrowski, Ryan Brinkman, Mathias Brochhausen,et al. “The Ontology for Biomedical Investigations”, PLoS ONE, 11(4): e0154556, April 29, 2016. ontology, then the annotations that they create when they tag their data with ontology terms, will be consistent with each other, and thus cumulate in an orderly fashion. Such cumulativity would allow also the identification of conflicts between different sets of annotations. (Compare the way in which the SI System of Units allows not only the cumulation but also the comparison of quantitative data.) If, however, another core principle of the OBO Foundry requires the use of BFO as common upper level for all Foundry ontologies. This means that BFO terms belong to each Foundry ontology, which conflicts with the idea that each term should belong to exactly one ontology. Terms not only from the BFO top-level ontology but also from mid- level ontologies such as IAO and OBI are re-used in multiple domain ontologies at lower levels. In like fashion, terms from more general domain ontologies such as the Cell Ontology are reused in ontologies such as the Cell Line Ontology and the Cell Cycle Ontology at lower levels. How, then, are do we save the principle that each term should belong to exactly one ontology? The answer is that orthogonality should be interpreted in such a way that each term should have its source – the original home where the term and its definition were manually crafted – in some one specific ontology, marked through the fact that the identifier for this ontology is included in the relevaxnt term IRI. This original term IRI should then travel with the term when it is reused in other ontologies, thereby preserving the link to the original home and at the same time networking the ontologies together, through reuse of terms not only in multiple ontology is-a hierarchies, but also in multiple term definitions.8 The hierarchy of ontologies reflects the hierarchy of sciences The way in which OBO Foundry ontologies are organized on successive levels represituation is analogous to what obtains in the organization of science. At any given stage of its development prior knowledge obtained through scientific research is organized hierarchically. Very general knowledge pertaining to general laws of physics or chemistry provides high-level starting points for scientific subdisciplines at successively lower levels of, for example, electrodynamics, quantum electrodynamics, organic chemistry, petroleum chemistry, neurochemistry, and so forth. The nodes in this (somewhat idealized) hierarchy then serve as starting points for huge numbers of further disciplines and subdisciplines, and also determine which bodies of prior knowledge are encapsulated in textbooks and which new results are reported in which journals. It is the titles of such textbooks and journals which tell us where we can find both prior knowledge and new results, just as it is the names of ontologies such as the Vaccine Ontology or the Infectious Disease Ontology which tell us where corresponding terms and definitions are to be found. 8 For example along the lines illustrated in https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14- 513 Implications for deep learning All of this, perhaps surprisingly, has implications for contemporary artificial intelligence research. This is because one major shortfall of today’s dominant AI paradigm of autonomous deep learning is its inability to deal with prior knowledge. This is a problem, in areas such as biology and medicine, where huge amounts of prior knowledge already exist. It is a problem, too, for the public acceptance of the results of deep learning, since the use of neural networks creates what is in effect a black box, which means that we cannot understand how given results were achieved. This means, also, that we cannot assign responsibility – on the side of either the machine or its human programmers – when things go wrong. How, then, to supplement neural networks with prior knowledge in areas such as biology and medicine. To create representations inside the computer of entire disciplines, or of entire suites of disciplines and sub-disciplines, is of course an impractical goal.