Principles of Ontology Construction
Overview of tutorial

Biological data must be readily accessible, comparable, and correlated if they are to provide relevant answers to scientific inquiries efficiently and thus enable discoveries. Well-principled ontological frameworks can provide a means to accomplish this. The caveat is that the ontologies must be both well formed and biologically intuitive. An increasing number of groups are developing ontologies in assorted biological domains. However, these efforts will be beneficial and aid biological data integration only if certain criteria are met: the ontologies must be non-overlapping, they must be accepted and used by the community, and they must be well-principled. The methods and approach required for the creation of usable ontologies are the focus of this tutorial.

Organization

This tutorial handout contains relevant reading material (described below) and the presentation itself. The presentation is organized into four sections:

1. The sociology of ontology building (Michael Ashburner)
2. The fundamental principles of ontology construction (Barry Smith)
3. Case studies of errors and corrections based on these principles (David Hill and Rama Balakrishnan)
4. A debate on the counter-tensions between pragmatics and purity.

Reading Material

Two pieces that provide a historical perspective:

1. Ashburner M, Lewis SE. 2002. On ontologies for biologists: the Gene Ontology - untangling the web. Novartis Found Symp 247: 66-80.
2. Lewis SE. 2005. Gene Ontology: looking backwards and forwards. Genome Biology 6: 103.

A philosopher’s critique of some representative biomedical ontologies:

3. Smith B. 2005. Ontologies in Biomedicine: The Good, the Bad, and the Ugly. Personal communication.

A small assortment of active ontology projects for illustration:

4. Gkoutos GV, Green ECJ, Mallon A-M, Hancock JM and Davidson D. 2004. Using ontologies to describe mouse phenotypes. Genome Biology, 6:R8.
5. Bard J, Rhee SY, Ashburner M. 2005. An ontology for cell type. Genome Biology, 6:R21.
6. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R and Ashburner M. 2005. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology, 6:R44.
7. Rosse C and Mejino JeLV. 2003. A reference ontology for biomedical informatics: the Foundational Model of Anatomy. Journal of Biomedical Informatics 36:478–500.

The group used for the case study evaluation:

8. The Gene Ontology Consortium. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: D258-D261.

Methodology for defining ontological relationships:

9. Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL and Rosse C. 2005. Relations in biomedical ontologies. Genome Biology, 6:R46.

Novartis Symposium – November 2001.

On ontologies for biologists: The Gene Ontology – untangling the web.

Michael Ashburner, Department of Genetics, University of Cambridge and EMBL – European Bioinformatics Institute, Hinxton, Cambridge, UK.
and
Suzanna Lewis, Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, USA.

Department of Genetics
University of Cambridge
Downing Street
Cambridge CB2 3EH

The European Bioinformatics Institute
The Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD

Berkeley Drosophila Genome Project
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA.

[email protected]; [email protected]

Abstract.

The mantra of the “post-genomic” era is “gene function”. Yet surprisingly little attention has been given to how functional and other information concerning genes is to be captured, made accessible to biologists or structured in a computable form. The aim of the Gene Ontology Consortium is to provide a framework for both the description and the organisation of such information.
The GO Consortium is presently concerned with building three structured controlled vocabularies, which can be used to describe three discrete biological domains: the molecular function, biological roles and cellular locations of gene products.

Keywords: Gene function; ontologies; controlled vocabularies; databases

Introduction and status.

The GO Consortium’s work is motivated by the need of both biologists and bioinformaticists for a method for rigorously describing the biological attributes of gene products (GO Consortium 2000, 2001). A comprehensive lexicon (with mutually understood meanings) describing those attributes of molecular biology that are common to more than one life form is essential to enable communication in both computer and natural languages. In this era, when newly sequenced genomes are rapidly being completed, all needing to be discussed, described, and compared, the development of a common language is crucial.

The most familiar of these attributes is that of “function”. Indeed, as early as 1993 Monica Riley (Riley 1993) attempted a hierarchical functional classification of all the then known proteins of Escherichia coli. Since then, there have been other attempts to provide vocabularies and ontologies [1] for the description of gene function, either explicitly or implicitly (e.g. Dure 1991, Commission of Plant Gene Nomenclature 1994, Fleischmann et al 1995, Overbeek et al 1997, Takai-Igarashi, Nadaoka, Kaminuma 1998, Baker et al 1999, Mewes et al 1999, Overbeek et al 2000, Stevens et al 2000; see Riley 1988, Rison et al 2000, Sklyar 2001 for reviews, Karp et al. 2002). Riley has recently updated her classification for the proteins of E. coli (Serres et al 2001).

One problem with many (though not all: e.g. Schulze-Kremer 1997, 1998, Karp et al 2002a, 2002b) efforts prior to that of the GO Consortium is that they lacked semantic clarity, due to a large degree to the absence of definitions for the terms used. Moreover, these previous classifications were usually not explicit concerning the relationships between different (e.g. “parent” and “child”) terms or concepts. A further problem with these efforts was that, by and large, they were developed as one-off exercises, with little consideration given to revision and implementation beyond the domain for which they were first conceived. They generally also lacked the apparatus required for both persistence and consistent use by others, i.e. versioning, archiving and unique identifiers attached to their concepts.

[1] Philosophically speaking an ontology is “the study of that which exists” and is defined in opposition to “epistemology”, which means “the study of that which is known or knowable”. Within the field of artificial intelligence the term ontology has taken on another meaning: “A specification of a conceptualization that is designed for reuse across multiple applications and implementations” (Karp 2000) and it is in this sense that we are using it.

The GO vocabularies distinguish three orthogonal domains (vocabularies); the concepts within one vocabulary do not overlap those within another. These domains are molecular_function, biological_process and cellular_component, defined as follows:

molecular_function: An action characteristic of a gene product.
biological_process: A phenomenon marked by changes that lead to a particular result, mediated by one or more gene products.
cellular_component: The part, or parts, of a cell of which a gene product is a component; for this purpose includes the extracellular environment of cells.
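The requirements listed above (a definition for every term, explicit parent and child relationships, unique identifiers, and three non-overlapping domains) can be illustrated with a small data structure. The following Python sketch is not GO's own implementation or file format. The class names, the "EX:" identifier prefix, the example term names and the placeholder definitions are all hypothetical, and the rule that a parent must belong to the same domain as its child is an assumption made here to mirror the statement that the three vocabularies do not overlap.

```python
from dataclasses import dataclass

# The three domain names come from the text; everything else is illustrative.
DOMAINS = ("molecular_function", "biological_process", "cellular_component")


@dataclass(frozen=True)
class Term:
    term_id: str         # unique identifier (hypothetical "EX:" prefix)
    name: str
    domain: str          # one of DOMAINS
    definition: str      # every term must carry a definition
    parents: tuple = ()  # identifiers of parent terms, stated explicitly


class Vocabulary:
    """A minimal term store that enforces the requirements named above."""

    def __init__(self):
        self._terms = {}

    def add(self, term):
        if term.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {term.domain}")
        if not term.definition.strip():
            raise ValueError(f"{term.term_id} has no definition")
        if term.term_id in self._terms:
            raise ValueError(f"duplicate identifier: {term.term_id}")
        for parent_id in term.parents:
            parent = self._terms.get(parent_id)
            if parent is None:
                raise ValueError(f"unknown parent: {parent_id}")
            # Assumption: a parent and its child stay within one domain.
            if parent.domain != term.domain:
                raise ValueError(f"{term.term_id} crosses domains via {parent_id}")
        self._terms[term.term_id] = term

    def ancestors(self, term_id):
        """Return the identifiers of all terms reachable through parent links."""
        seen = set()
        stack = list(self._terms[term_id].parents)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self._terms[current].parents)
        return seen


vocab = Vocabulary()
vocab.add(Term("EX:0000001", "molecular_function", "molecular_function",
               "An action characteristic of a gene product."))
vocab.add(Term("EX:0000002", "example parent activity", "molecular_function",
               "Placeholder definition for illustration.",
               parents=("EX:0000001",)))
vocab.add(Term("EX:0000003", "example child activity", "molecular_function",
               "Placeholder definition for illustration.",
               parents=("EX:0000002",)))

print(sorted(vocab.ancestors("EX:0000003")))
```

Rejecting a term at insertion time, rather than checking the whole vocabulary afterwards, is one simple way to keep undefined or unlinked terms from ever being used; other designs are equally plausible.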
The initial objective of the GO Consortium is to provide a rich structured vocabulary of terms (concepts) for use by those annotating gene products within an informatics context, be it a database of the genetics and genomics of a model organism, a database of protein sequences or a database of information about gene products, such as might be obtained from a DNA microarray experiment. In GO the annotation of gene products with GO terms follows two guidelines: (i) that all annotations include the evidence upon which that assertion is based and, (ii) that the evidence provided for each annotation includes attribution to an available external source, such as a literature reference.

Databases using GO for annotation are widely distributed. Therefore an additional task of the Consortium is to provide a centralized holding site for their annotations. GO provides a simple format for contributing databases to submit their annotations to a central annotation database maintained by GO. The annotation data submitted includes the association of gene products with GO terms as well as ancillary information, such as evidence and attribution. These annotations can then form the basis for queries, either by an individual or a computer program.

At present gene product associations are available for several different organisms, including two yeasts (S. pombe and S. cerevisiae), two invertebrates (Caenorhabditis elegans and Drosophila melanogaster), two mammals (mouse and rat) and a plant, Arabidopsis thaliana. In addition, the first bacterium (Vibrio cholerae) has now been annotated with GO and efforts are now underway to annotate all 60 or so publicly available bacterial genomes. Over 80% of the proteins in the SWISS-PROT protein database have been annotated with GO terms (the majority by automatic annotation, see below); these include the SWISS-PROT to GO annotations of over 16,000 human proteins (available at www.geneontology.org/gene-associations/gene_association.goa).
Some 7,000 human proteins were also annotated with GO by Proteome