Principles of Ontology Construction

Overview of tutorial

Biological data must be readily accessible, comparable, and correlated to efficiently provide relevant answers to scientific inquiries and thus enable discoveries. Well-principled ontological frameworks can provide a means to accomplish this. The caveat is that the ontologies must be simultaneously well formed and biologically intuitive. An increasing number of groups are developing ontologies in assorted biological domains. However, these efforts will only be beneficial and aid biological data integration if certain criteria are met. These prerequisites are that the ontologies are non-overlapping, that they are accepted and used by the community, and that they are well-principled. The methods and approach required for the creation of usable ontologies are the focus of this tutorial.

Organization

This tutorial handout contains relevant reading material (described below) and the presentation itself. The presentation is organized into four sections:
1. The sociology of ontology building ()
2. The fundamental principles of ontology construction (Barry Smith)
3. Case studies of errors and corrections based on these principles (David Hill and Rama Balakrishnan)
4. A debate on the counter-tensions between pragmatics and purity.

Reading Material

Two pieces that provide a historical perspective:
1. Ashburner M, Lewis SE. 2002. On ontologies for biologists: the Gene Ontology – untangling the web. Novartis Found Symp 247: 66-80.
2. Lewis SE. 2005. Gene Ontology: looking backwards and forwards. Genome Biology 6: 103.

A philosopher’s critique of some representative biomedical ontologies:
3. Smith B. 2005. Ontologies in Biomedicine: The Good, the Bad, and the Ugly. Personal communication.

A small assortment of active ontology projects for illustration:
4. Gkoutos GV, Green ECJ, Mallon A-M, Hancock JM and Davidson D. 2004. Using ontologies to describe mouse phenotypes. Genome Biology, 6:R8.
5. Bard J, Rhee SY, Ashburner M. 2005. An ontology for cell type. Genome Biology, 6:R21.
6. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R and Ashburner M. 2005. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology, 6:R44.
7. Rosse C and Mejino JeLV. 2003. A reference ontology for biomedical informatics: the Foundational Model of Anatomy. Journal of Biomedical Informatics 36:478–500.

The group used for the case study evaluation:
8. The Gene Ontology Consortium. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: D258-D261.

Methodology for defining ontological relationships:
9. Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL and Rosse C. 2005. Relations in biomedical ontologies. Genome Biology, 6:R46.

Novartis Symposium – November 2001.

On ontologies for biologists: The Gene Ontology – untangling the web.

Michael Ashburner, Department of Genetics, University of Cambridge and EMBL – European Bioinformatics Institute, Hinxton, Cambridge, UK. and

Suzanna Lewis, Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, USA.

Department of Genetics University of Cambridge Downing Street Cambridge CB2 3EH

The European Bioinformatics Institute The Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD

Berkeley Drosophila Genome Project Lawrence Berkeley National Laboratory Berkeley, CA 94720, USA. [email protected]; [email protected]


Abstract.

The mantra of the “post-genomic” era is “gene function”. Yet surprisingly little attention has been given to how functional and other information concerning genes is to be captured, made accessible to biologists or structured in a computable form. The aim of the Gene Ontology Consortium is to provide a framework for both the description and the organisation of such information. The GO Consortium is presently building three structured controlled vocabularies, one for each of three discrete biological domains: the molecular function, biological roles and cellular locations of gene products.

Keywords:

Gene function; ontologies; controlled vocabularies; databases


Introduction and status.

The GO Consortium’s work is motivated by the need of both biologists and bioinformaticists for a method for rigorously describing the biological attributes of gene products (GO Consortium 2000, 2001). A comprehensive lexicon (with mutually understood meanings) describing those attributes of molecular biology that are common to more than one life form is essential to enable communication: in both computer and natural languages. In this era, when new sequenced genomes are rapidly being completed, all needing to be discussed, described, and compared, the development of a common language is crucial.

The most familiar of these attributes is that of “function”. Indeed, as early as 1993 Monica Riley (Riley 1993) attempted a hierarchical functional classification of all the then known proteins of Escherichia coli. Since then, there have been other attempts to provide vocabularies and ontologies [1] for the description of gene function, either explicitly or implicitly (e.g. Dure 1991, Commission of Plant Gene Nomenclature 1994, Fleischmann et al 1995, Overbeek et al 1997, Takai-Igarashi, Nadaoka, Kaminuma 1998, Baker et al 1999, Mewes et al 1999, Overbeek et al 2000, Stevens et al 2000; see Riley 1988, Rison et al 2000, Sklyar 2001 for reviews, Karp et al. 2002). Riley has recently updated her classification for the proteins of E. coli (Serres et al 2001).

One problem with many (though not all: e.g. Schulze-Kremer 1997, 1998, Karp et al 2002a, 2002b) efforts prior to that of the GO Consortium is that they lacked semantic clarity due, to a large degree, to the absence of definitions for the terms used. Moreover, these previous classifications were usually not explicit concerning the relationships between different (e.g. “parent” and “child”) terms or concepts. A further problem with these efforts was that, by and large, they were developed as one-off exercises, with little consideration given to revision and implementation beyond the domain for which they were first conceived. They generally also lacked the apparatus required for both persistence and consistent use by others, i.e. versioning, archiving and unique identifiers attached to their concepts.

[1] Philosophically speaking an ontology is “the study of that which exists” and is defined in opposition to “epistemology”, which means “the study of that which is known or knowable”. Within the field of artificial intelligence the term ontology has taken on another meaning: “A specification of a conceptualization that is designed for reuse across multiple applications and implementations” (Karp 2000) and it is in this sense that we are using it.

The GO vocabularies distinguish three orthogonal domains (vocabularies); the concepts within one vocabulary do not overlap those within another. These domains are molecular_function, biological_process and cellular_component, defined as follows:

molecular_function: An action characteristic of a gene product.
biological_process: A phenomenon marked by changes that lead to a particular result, mediated by one or more gene products.
cellular_component: The part, or parts, of a cell of which a gene product is a component; for this purpose includes the extracellular environment of cells.

The initial objective of the GO Consortium is to provide a rich structured vocabulary of terms (concepts) for use by those annotating gene products within an informatics context, be it a database of the genetics and genomics of a model organism, a database of protein sequences or a database of information about gene products, such as might be obtained from a DNA microarray experiment. In GO the annotation of gene products with GO terms follows two guidelines: (i) that all annotations include the evidence upon which that assertion is based and, (ii) that the evidence provided for each annotation includes attribution to an available external source, such as a literature reference.

Databases using GO for annotation are widely distributed. Therefore an additional task of the Consortium is to provide a centralized holding site for their annotations. GO provides a simple format for contributing databases to submit their annotations to a central annotation database maintained by GO. The annotation data submitted includes the association of gene products with GO terms as well as ancillary information, such as evidence and attribution. These annotations can then form the basis for queries – either by an individual or a computer program.
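The two annotation guidelines (every annotation states its evidence, and every piece of evidence is attributed to an external source) can be enforced at the point where a record is created. The following is a minimal sketch only: the class name `Annotation`, its four fields, the helper `to_row` and the identifiers used are illustrative assumptions, not the actual submission format defined by the GO Consortium.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record; the field layout is an assumption."""
    gene_product: str  # identifier in the contributing database
    go_id: str         # GO identifier of the associated term
    evidence: str      # phrase from the controlled evidence list
    reference: str     # attribution to an available external source

    def __post_init__(self):
        # Guideline (i): every annotation includes its evidence.
        if not self.evidence.strip():
            raise ValueError("annotation must state its evidence")
        # Guideline (ii): the evidence is attributed to an external source.
        if not self.reference.strip():
            raise ValueError("annotation must cite an external source")


def to_row(annotation):
    """Serialise one annotation as a tab-delimited line."""
    return "\t".join([annotation.gene_product, annotation.go_id,
                      annotation.evidence, annotation.reference])


# Placeholder identifiers, for illustration only.
example = Annotation("DB:gene1", "GO:0036562",
                     "inferred from direct assay", "REF:placeholder")
print(to_row(example))
```

Rejecting incomplete records at construction time means that a central holding site never has to decide what to do with an assertion that has no stated basis.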

At present gene product associations are available for several different organisms, including two yeasts (S. pombe and S. cerevisiae), two invertebrates (Caenorhabditis elegans and Drosophila melanogaster), two mammals (mouse and rat) and a plant, Arabidopsis thaliana. In addition, the first bacterium (Vibrio cholerae) has now been annotated with GO and efforts are now underway to annotate all 60 or so publicly available bacterial genomes. Over 80% of the proteins in the SWISS-PROT protein database have been annotated with GO terms (the majority by automatic annotation, see below); these include the SWISS-PROT to GO annotations of over 16,000 human proteins (available at www.geneontology.org/gene-associations/gene_association.goa). Some 7,000 human proteins were also annotated with GO by Proteome Inc. and are available from LocusLink (Pruitt, Maglott 2001).

A number of other organismal databases are in the process of using GO for annotation, including those for Plasmodium falciparum (and other parasitic protozoa) (M. Berriman, personal communication), Dictyostelium discoideum (R. Chisholm, personal communication) and the grasses (rice, maize, wheat, etc) (L. Vincent, personal communication). The availability of these sets of data has led to the construction of GO browsers which enable users to query them all simultaneously for genes whose products serve a particular function, play a role in a particular biological process or are located in a particular sub-cellular part (AmiGO 2001). These associations are also available as tab-delimited tables (www.geneontology.org/gene-associations/) or with protein sequences. GO thus achieves de facto a degree of database integration (see Leser 1998), one holy grail of applied bioinformatics.

Availability.

The products of the GO Consortium’s work can be obtained from their w3 home page: www.geneontology.org.

All of the efforts of the GO Consortium are placed in the public domain and can be used by academia or industry alike without any restraint, other than that they cannot be modified and then passed off as the products of the Consortium. This is true for all major classes of the GO Consortium’s products: the controlled vocabularies, the gene-association tables, and software for browsing and editing the GO vocabularies and gene association tables (AmiGO 2001, DAG Edit 2001). Thus the GO Consortium’s work is very much in the spirit of the Open Source tradition in software development (DiBona, Ockman, Stone 1999; OpenSource 2001). The GO ontologies and their associated files are available as text files, in XML or as tables for a MySQL database.

The structure of the GO ontologies.

All biologists are familiar with hierarchical graphs – the system of classification introduced by Linnaeus has been a bedrock for biological research for some 250 years. In a Linnaean taxonomy the nodes of the graphs are the names of taxa, be they phyla or species; the edges between these nodes represent the relationship “is a member of” between child and parent nodes. Thus the node “species:Drosophila melanogaster” “is a member of” its parent node “genus:Drosophila”. Useful as hierarchies are, they suffer from a serious limitation: each node has one and only one parental node – no species is a member of two (or more) genera, no genus a member of two (or more) families. Yet in the broader world of biology an object may well have two or more parents. Consider, as a simple example, a protein that both binds DNA and hydrolyses ATP. It is equally correct to describe this as a “DNA binding protein” as it is to describe it as a “catalyst” (or enzyme); therefore it should be a child of both within the graph. Not all DNA binding proteins are enzymes, not all enzymes are DNA binding proteins, yet some are and we need to be able to represent these facts conceptually. For this reason GO uses a structure known as a directed acyclic graph (DAG), a graph in which nodes can have many parents but in which cycles – that is, paths which start and end at the same node – are not allowed. All nodes must have at least one parent node, with the exception of the root of each graph.
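The two structural rules just stated (a node may have several parents, and no path may return to its starting node) can be checked mechanically. The sketch below is illustrative only: the dictionary `PARENTS`, the root name "protein" and the child term "DNA-dependent ATPase" are assumptions made for the example and are not taken from the GO vocabularies.

```python
# Each term maps to the set of its parents; only the root has none.
PARENTS = {
    "protein": set(),                      # hypothetical root
    "DNA binding protein": {"protein"},
    "catalyst": {"protein"},
    # One child, two parents: impossible in a strict hierarchy.
    "DNA-dependent ATPase": {"DNA binding protein", "catalyst"},
}


def roots(parents):
    """Return every term that has no parent."""
    return {term for term, ps in parents.items() if not ps}


def is_acyclic(parents):
    """Depth-first search; meeting a term that is still being visited means a cycle."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {term: UNSEEN for term in parents}

    def visit(term):
        state[term] = ACTIVE
        for parent in parents[term]:
            if state[parent] == ACTIVE:
                return False  # walked back onto the current path
            if state[parent] == UNSEEN and not visit(parent):
                return False
        state[term] = DONE
        return True

    return all(visit(term) for term in parents if state[term] == UNSEEN)


print(roots(PARENTS), is_acyclic(PARENTS))
```

A strict tree would need the extra constraint that every parent set has at most one member; dropping that constraint while keeping the cycle check is what distinguishes a DAG from a hierarchy.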

Alice replies to Humpty Dumpty’s inquiry as to the meaning of her name with “Must a name mean something?” “Of course it must”, replies Humpty Dumpty (Heath 1974:188). This is as true in the real world as in that through the looking glass. The nodes in the GO controlled vocabularies are concepts, concepts that describe the molecular function, biological role or cellular location of gene products. The terms used by GO are simply a shorthand way of referring to these concepts, concepts that are restricted by their natural language definitions. (At present only 20% of the 10,000 or so GO terms are defined but a major effort to correct this situation will be launched early in 2002). Each and every GO term has a unique identifier consisting of the prefix GO: and an integer, for example, GO:0036562. But what happens if a GO term changes? A change may be as trivial as correcting a spelling error or as drastic as a wholly new lexical string. If the change does not change the meaning of the term then there is no change to the GO identifier. If the meaning is changed, however, then the old term, its identifier and definition are retired (they are marked as “obsolete”; they never disappear from the database) and the new term gets a new identifier and a new definition. Indeed this is true even if the lexical string is identical between old and new terms; thus if we use the same words to describe a different concept then the old term is retired and the new is created with its own definition and identifier. This is the only case where, within any one of the three GO ontologies, two or more concepts may be lexically identical; all except one of them must be flagged as being obsolete. Because the nodes represent semantic concepts (as described by their definitions) it is not strictly necessary that the terms are unique, but this restriction is imposed in order to facilitate searching. This mechanism helps with maintaining and synchronizing other databases that must track changes within GO, which is always rapidly changing by design. Keeping everything and everyone consistent is a difficult problem that we had to solve in order to permit this dynamic adaptability of GO.
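The identifier policy described above can be summarised in a few lines of code. This is a sketch under stated assumptions, not the Consortium's software: the class names `Term` and `Vocabulary`, the method names `rename` and `redefine`, and the seven-digit zero padding (inferred from the example GO:0036562) are all hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Term:
    go_id: str
    name: str
    definition: str
    obsolete: bool = False


class Vocabulary:
    """Hypothetical store illustrating the identifier policy."""

    def __init__(self):
        self.terms = {}
        self._next = count(1)

    def add(self, name, definition):
        # Lexical strings are unique among terms that are not obsolete.
        if any(t.name == name and not t.obsolete for t in self.terms.values()):
            raise ValueError(f"live term already exists: {name}")
        go_id = f"GO:{next(self._next):07d}"
        self.terms[go_id] = Term(go_id, name, definition)
        return go_id

    def rename(self, go_id, new_name):
        """Meaning unchanged (e.g. a spelling fix): the identifier is kept."""
        self.terms[go_id].name = new_name

    def redefine(self, go_id, name, definition):
        """Meaning changed: retire the old term, never delete it, mint a new id."""
        self.terms[go_id].obsolete = True
        return self.add(name, definition)


vocab = Vocabulary()
old_id = vocab.add("example term", "first meaning")
new_id = vocab.redefine(old_id, "example term", "second meaning")
print(old_id, vocab.terms[old_id].obsolete, new_id)
```

Because retired terms stay in the store, a database that still holds the old identifier can detect that it points at an obsolete concept instead of silently acquiring a new meaning.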

The edges between the nodes represent the relationships between them. GO uses two very different classes of semantic relationship between nodes: isa and partof. Both the isa and partof relationships within GO should be fully transitive. That is to say, an instance of a concept is also an instance of all of the parents of that concept (to the root); a concept that is partof a whole is also partof all of the parents of that whole (to the root). Both relationships are reflexive (see below).

The isa relationship is one of subsumption, a relationship that permits refinement in concepts and definitions and thus enables annotators to draw coarser or finer distinctions, depending on the present degree of knowledge. This class of relationship is known as hyponymy (and its reflexive relation hypernymy) to the authors of the lexical database WordNet (Fellbaum 1998). Thus the term DNA binding is a hyponym of the term nucleic acid binding; conversely nucleic acid binding is a hypernym of DNA binding.

DNA binding is the more specific term, and hence the child of nucleic acid binding. It has been argued that the isa relationship, both generally (see below) and as used by GO (P. Karp, personal communication; S. Schulze-Kremer, personal communication) is complex and that further information describing the nature of the relationship should be captured. Indeed this is true, because the precise connotation of the isa relationship is dependent upon each unique pairing of terms and the meanings of these terms. Thus the isa relationship is not a relationship between terms, but rather is a relationship between particular concepts. Therefore the isa relationship is not a single type of relationship; its precise meaning is dependent on the parent and child terms it connects. The relationship simply describes the parent as the more general concept and the child as the more precise concept and says nothing about how the child specifically refines the concept.

The partof relationship (meronymy and its reflexive relationship holonymy) (Cruse 1986, cited in Miller 1998) is also semantically complex as used by GO (see: Wierzbicka 1984 (cited in Miller 1998), Miller 1998, Priss 1998, Rogers and Rector 2000). It may mean that a child node concept “is a component of” its parent concept. (The reflexive relationship (holonymy) would be “has a component”). The mitochondrion “is a component of” the cell; the small ribosomal subunit “is a component of” the ribosome. This is the most common meaning of the partof relationship in the GO cellular_component ontology. In the biological_process ontology, however, the semantic meaning of partof can be quite different: it can mean “is a subprocess of”; thus the concept amino acid activation “is a subprocess of” the concept protein biosynthesis. It remains for the GO Consortium to clarify these semantic relationships while, at the same time, not making the vocabularies too cumbersome and difficult to maintain and use.

Meronymy and hyponymy cause terms to “become intertwined in complex ways” (Miller 1998:38). This is because one term can be a hyponym with respect to one parent, but a meronym with respect to another. Thus the concept cytosolic small ribosomal subunit is both a meronym of the concept cytosolic ribosome and a hyponym of the concept small ribosomal subunit, since there also exists the concept mitochondrial small ribosomal subunit.
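The intertwining of the two relationships, together with the transitivity described earlier, can be made concrete with edges that carry a relationship type. In this sketch the first three entries follow the examples in the text; the edges "cytosolic ribosome isa ribosome" and "ribosome partof cell", and the names `EDGES` and `ancestors`, are assumptions added so that the graph reaches a common top node.

```python
# Each term maps to a list of (relationship, parent) pairs.
EDGES = {
    "cytosolic small ribosomal subunit": [
        ("partof", "cytosolic ribosome"),     # meronym of one parent
        ("isa", "small ribosomal subunit"),   # hyponym of another
    ],
    "small ribosomal subunit": [("partof", "ribosome")],
    "cytosolic ribosome": [("isa", "ribosome")],   # assumed edge
    "ribosome": [("partof", "cell")],              # assumed edge
    "cell": [],
}


def ancestors(term, edges):
    """Collect every term reachable by following isa or partof edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for _relationship, parent in edges.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("cytosolic small ribosomal subunit", EDGES)))
```

Under full transitivity, a gene product annotated to the most specific term is implicitly associated with every term this function returns, which is why a query for a general term can retrieve specifically annotated products.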

The third semantic relationship represented in GO is the familiar relationship of synonymy. Each concept defined in GO (i.e. each node) has one primary term (used for identification) and may have zero or many synonyms. In the sense of the WordNet noun lexicon a term and its synonyms at each node represent a synset (Miller 1998); in GO, however, the relationship between synonyms is strong, and not as context dependent as in WordNet’s synsets. This means that in GO all members of a synset are completely interchangeable in whatever context the terms are found. That is to say, for example, that "lymphocyte receptor of death" and "death receptor 3" are equivalent labels for the same concept and are conceptually identical. One consequence of this strict usage is that synonyms are not inherited from parent to child concepts in GO.

The final semantic relationship in GO is a cross-reference to some other database resource, representing the relationship “is equivalent to”. Thus the cross-reference between the GO concept alcohol dehydrogenase and the Enzyme Commission’s number EC:1.1.1.1 is an equivalence (but not necessarily an identity; these cross-references within GO are for a practical rather than theoretical purpose). As with synonyms, database cross-references are not inherited from parent to child concept in GO.

As we have expressed, we are not fully satisfied that the two major classes of relationship within GO, isa and partof, are yet defined as clearly as we would like. There is, moreover, some need for a wider agreement in this field on the classes of relationship that are required to express complex relationships between biological concepts. Others are using relationships that, at first sight, appear to be similar to these: for example within the aMAZE database (van Helden et al 2001) the relationships ContainedCompartment and SubType appear to be similar to GO’s partof and isa, respectively. Yet ContainedCompartment and partof have, on closer inspection, different meanings (GO’s partof seems to be a much broader concept than aMAZE’s ContainedCompartment).

The three domains now considered by the GO Consortium, molecular_function, biological_process and cellular_component, are orthogonal. They can be applied independently of each other to describe separable characteristics. A curator can describe where some protein is found without knowing what process it is involved in. Likewise, it may be known that a protein is involved in a particular process without knowing its function. There are no edges between the domains, although we realize that there are relationships between them. This constraint was made because of problems in defining the semantic meanings of edges between nodes in different ontologies (see Rogers and Rector (2000) for a discussion of the problems of transitivity met within an ontology that includes different domains of knowledge). This structure is, however, to a degree, artificial. Thus all (or, certainly most) gene products annotated with the GO function term transcription factor will be involved in the process transcription, DNA-dependent and the majority will have the cellular location nucleus. This really becomes important not so much within GO itself, but at the level of the use of GO for annotation. For example, if a curator were annotating genes in FlyBase, the genetic and genomic database for Drosophila, then it would be an obvious convenience for a gene product annotated with the function term transcription factor to inherit both the process transcription, DNA-dependent and the location nucleus. There are plans to build a tool to do this, but one that allows a curator to say to the system “in this case do not inherit” where to do so would be misleading or wrong.
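The kind of tool envisaged here might behave as follows. This is purely a sketch of the idea, since the text describes the tool only as planned: the rule table `IMPLIED`, the function `expand` and its `inherit` flag are hypothetical names, and the single rule shown is the transcription factor example from the paragraph above.

```python
# Hypothetical rule table: function term -> (process term, component term).
IMPLIED = {
    "transcription factor": ("transcription, DNA-dependent", "nucleus"),
}


def expand(function_term, inherit=True):
    """Return (function, process, component) for one curated function term.

    The process and component are filled in from the rule table unless the
    curator passes inherit=False ("in this case do not inherit").
    """
    if inherit and function_term in IMPLIED:
        process, component = IMPLIED[function_term]
        return function_term, process, component
    return function_term, None, None


print(expand("transcription factor"))
print(expand("transcription factor", inherit=False))
```

Keeping the rules in a table outside the three vocabularies preserves the absence of edges between the domains while still giving curators the convenience of inherited annotations.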

Annotation using GO.

There are two general methods for using GO to annotate gene products within a database. These may be characterised as the ‘curatorial’ and ‘automatic’ methods. By ‘curatorial’ we mean that a domain expert annotates gene products with GO terms as the result of either reading the relevant literature or evaluating a computational result. Automated methods rely solely on computational sequence comparisons such as the result of a BLAST (Altschul et al 1990) or InterProScan (Zdobnov, Apweiler 2001) analysis of a gene product’s known or predicted protein sequence. Whatever method is used, the basis for the annotation is then summarised, using a small controlled list of phrases (www.geneontology.org/GO.evidence); perhaps “inferred from direct assay” if annotating on the evidence of experimental data in a publication or “inferred from sequence comparison with database:object” (where database:object could be, for example, SWISS-PROT:P12345, where P12345 is a sequence accession in the SWISS-PROT database of protein sequences), if the inference is made from a BLAST or InterProScan computation which has been evaluated by a curator.

The incorrect inference of a protein’s or predicted protein’s function from sequence comparison is well known to be a major problem and one that has often contaminated both databases and the literature (Kyrpides and Ouzounis 1998, for one example among many). The syntax of GO annotation in databases allows curators to annotate a protein as NOT having a particular function despite impressive BLAST data. For example, in the genome of Drosophila melanogaster there are at least 480 proteins or predicted proteins that any casual curation of BLASTP output would assign the function peptidase (or one of its child concepts) yet, on closer inspection, at least 14 of these lack residues required for the catalytic function of peptidases (D. Coates, personal communication). In FlyBase these are curated with the “function” NOT peptidase. What is needed is a comprehensive set of computational rules to allow curators, who cannot be experts in every protein family, to automatically detect the signatures of these cases, cases where the transitive inference would be incorrect (Kretschmann, Fleischmann, Apweiler 2001). It is also conceivable that triggers to correct dependent annotations could be constructed because GO annotations track the identifiers of the sequence on which annotation is based.

Curatorial annotation will be at a quality proportional both to the extent of the available evidence for annotation and the human resources available for annotation. Potentially, its quality is high but at the expense of human effort. For this reason several ‘automatic’ methods for the annotation of gene products are being developed. These are especially valuable for a first-pass annotation of a large number of gene products, those, for example, from a complete genome sequencing project. One of the first to be used was M. Yandell’s program LoveAtFirstSight developed for the annotation of the gene products predicted from the complete genome of Drosophila melanogaster (Adams et al 2000). Here, the sequences were matched (by BLAST) to a set of sequences from other organisms that had already been curated using GO.

Three other methods, DIAN (Pouliot et al 2001), PANTHER (Kerlavage et al 2002) and GO Editor (Xie et al 2002), also rely on a comprehensive database of sequences or sequence clusters that have been annotated with GO terms by curation, albeit with a large element of automation in the early stages of the process. PANTHER is a method in which proteins are clustered into “phylogenetic” families and sub-families, which are then annotated with GO terms by expert curators. New proteins can then be matched to a cluster (in fact to a Hidden Markov Model describing the conserved sequence patterns of that cluster) and transitively annotated with appropriate GO terms. In a recent experiment PANTHER performed well in comparison with the curated set of GO annotations of Drosophila genes in FlyBase (Mi et al in preparation). DIAN matches proteins to a curated set using two algorithms: one is vocabulary based and is only suitable for sequences that already have some attached annotation; the other is domain based, using Pfam Hidden Markov Models of protein domains.

Even simpler methods have also been used. For example, much of the first-pass GO annotation of mouse proteins was done by parsing the KEYWORDs attached to SWISS-PROT records of mouse proteins, using a file that semantically mapped these KEYWORDs to GO concepts (see www.geneontology.org/external2go/spkw2go) (Hill et al 2001).

Automatic annotations have the advantages of speed, essential if large protein data sets are to be analysed within a short time. Their disadvantage is that the accuracy of annotation may not be high and the risk of errors by incorrect transitive inference is great. For this reason, all annotations made by such methods are tagged in GO gene-association files as being “inferred by electronic annotation”. Ideally, all such annotations are reviewed by curators and subsequently replaced by annotations of higher confidence.
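The keyword-mapping approach, together with the tagging of its output, can be sketched in a few lines. The mapping entries below are invented for illustration and are not copied from the spkw2go file; the function name `annotate_from_keywords` is likewise an assumption. The accession SWISS-PROT:P12345 is the placeholder used earlier in the text.

```python
# Tag applied to every annotation produced without curator review.
ELECTRONIC = "inferred by electronic annotation"

# Hypothetical keyword-to-concept mapping (not the real spkw2go content).
KEYWORD_TO_GO = {
    "DNA-binding": "DNA binding",
    "Nuclear protein": "nucleus",
}


def annotate_from_keywords(protein, keywords, mapping):
    """Return (protein, GO concept, evidence) for each keyword that has a mapping.

    Keywords without a mapping are skipped rather than guessed at.
    """
    return [(protein, mapping[k], ELECTRONIC) for k in keywords if k in mapping]


rows = annotate_from_keywords(
    "SWISS-PROT:P12345", ["DNA-binding", "Unmapped keyword"], KEYWORD_TO_GO)
print(rows)
```

Because every row carries the electronic evidence tag, a curator can later select exactly these annotations for review and replacement by annotations of higher confidence.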

The problems of complexity and redundancy.

There are in the biological_process ontology many words or strings of words that have no business being there. The major examples of offending concepts are chemical names and anatomical parts. There are two reasons why this is problematic, one practical and the other of more theoretical importance. The practical problem is one of maintainability. The number of chemical compounds that are metabolised by living organisms is vast. Each one deserves its own unique set of GO terms: carbohydrate metabolism (and its children carbohydrate biosynthesis, carbohydrate catabolism), carbohydrate transport and so on. In the ideal world there would exist a public domain ontology for natural (and xenobiotic) compounds:

carbohydrate
    simple carbohydrate
        pentose
        hexose
            glucose
            galactose
    polysaccharide

and so on. Then we could make the cross-product between this little DAG (a DAG because a carbohydrate could also be an acid or an alcohol, for example) and this small biological_process DAG:

metabolism
    biosynthesis
    catabolism

to produce automatically:

carbohydrate metabolism
carbohydrate biosynthesis
carbohydrate catabolism
simple carbohydrate metabolism
simple carbohydrate biosynthesis
simple carbohydrate catabolism
pentose metabolism
pentose biosynthesis
pentose catabolism
hexose metabolism
hexose biosynthesis
hexose catabolism
glucose metabolism
glucose biosynthesis
glucose catabolism
galactose metabolism
galactose biosynthesis
galactose catabolism
polysaccharide metabolism
polysaccharide biosynthesis
polysaccharide catabolism

Such cross-product DAGs may often have compound terms that are not appropriate. For example, the GO concepts 1,1,1-trichloro-2,2-bis-(4'-chlorophenyl)ethane metabolism and 1,1,1-trichloro-2,2-bis-(4'-chlorophenyl)ethane catabolism are appropriate, yet 1,1,1-trichloro-2,2-bis-(4'-chlorophenyl)ethane biosynthesis is not; organisms break down DDT but do not synthesise it. For this reason any cross-product tree would need pruning by a domain expert subsequent to its computation (or rules for selecting sub-graphs that are not to be cross-multiplied).
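Generating the compound term names and then pruning them is straightforward to express in code. The sketch below produces names only and does not build the parent and child edges of the resulting DAG; the function `cross_product` and its `excluded` parameter are hypothetical, standing in for the pruning decisions that the text assigns to a domain expert.

```python
from itertools import product

COMPOUNDS = ["carbohydrate", "simple carbohydrate", "pentose", "hexose",
             "glucose", "galactose", "polysaccharide"]
PROCESSES = ["metabolism", "biosynthesis", "catabolism"]
DDT = "1,1,1-trichloro-2,2-bis-(4'-chlorophenyl)ethane"


def cross_product(compounds, processes, excluded=frozenset()):
    """Combine every compound with every process, skipping excluded pairs."""
    return [f"{compound} {process}"
            for compound, process in product(compounds, processes)
            if (compound, process) not in excluded]


terms = cross_product(COMPOUNDS, PROCESSES)
print(len(terms), terms[:3])

# Organisms break down DDT but do not synthesise it, so one pair is pruned.
ddt_terms = cross_product([DDT], PROCESSES, excluded={(DDT, "biosynthesis")})
print(ddt_terms)
```

Seven compounds and three processes yield the 21 terms listed earlier; the exclusion set is the only part that needs human judgement, which is the point of maintaining the two small vocabularies separately.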

Unfortunately, as no suitable ontology of compounds yet exists in the public domain, there is no alternative to the present method of maintaining this part of the biological_process ontology by hand.

A very similar situation exists for anatomical terms, in effect used as anatomical qualifiers to terms in the biological_process ontology. An example is eye morphogenesis, a term that can be broken up into an anatomical component, eye, and a process component, morphogenesis. This example illustrates a further problem: we clearly need to be able to distinguish the morphogenesis of a fly eye from that of a murine eye, or a Xenopus eye, or an acanthocephalan eye (were they to have eyes). Maintaining a separate term by hand for each such case is not the way to maintain an ontology. Far better would be to have species- (or clade-) specific anatomical ontologies and then to generate the required terms for biological_process as cross-products. This is indeed the way in which GO will proceed (D. Hill, in preparation) and anatomical ontologies for Drosophila and Arabidopsis are already available (www.genontology.org/anatomy/), with those for mouse and C. elegans in preparation (Bard and Winter 2001, for a discussion). The other advantage of this approach is that these anatomical ontologies can then be used in other contexts, for example for the description of expression patterns or mutant phenotypes (Hamsey 1997).

gobo: global open biological ontologies.

Although the three controlled vocabularies built by the GO Consortium are far from complete, they are already showing their value (e.g. Venter et al 2001, Jenssen et al 2001, Laegreid et al 2002, Pouliot et al 2001). Yet, as discussed in the preceding paragraphs, the present method of building and maintaining some of these vocabularies cannot be sustained. Both for its own use and in the belief that it will be useful for the community at large, the GO Consortium is sponsoring gobo (global open biological ontologies) as an umbrella for structured controlled vocabularies for the biological domain. A small ontology of such ontologies might look like this:

gobo gene gene_attribute gene_structure gene_variation gene_product gene_product_attribute molecular_function biological_process cellular_component protein_family chemical_substance biochemical_substance class biochemical_substance_attribute pathway pathway_attribute developmental_timeline anatomy gross_anatomy tissue cell_type phenotype mutant_phenotype pathology disease experimental_condition taxonomy

Some of these already exist (e.g. Taxman for taxonomy (Wheeler et al 2000)) or are under active development (e.g. the MGED ontologies for microarray data description (MGED 2001), a trait ontology for grasses (GRAMENE 2002)); others do not yet exist. There is everything to be gained if these ontologies could (at least) all be instantiated in the same syntax (e.g. that used now by the GO Consortium or in DAML+OIL (Fensel et al 2001)), for then they could share software, both tools and browsers, and be more readily exchanged. There is also everything to be gained if these are all open source and agree on a shared namespace for unique identifiers.

GO is very much a work in progress. Moreover, it is a community rather than individual effort. As such, it tries to be responsive to feedback from its users so that it can improve its utility to both biologists and bioinformaticists, a distinction, we observe, that is growing harder to make every day.

References.

Adams M et al 2000 The genome sequence of Drosophila melanogaster. Science 287:2185-2195


Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ 1990 Basic local alignment search tool J Mol Biol 215:403-410

AmiGO 2001 url: www.godatabase.org/cgi-bin/go.cgi

Baker PG, Goble CA, Bechhofer S, Paton NW, Stevens R, Brass A 1999 An ontology for bioinformatics applications. Bioinformatics 15:510-520

Bard J, Winter R 2001 Ontologies of developmental anatomy: Their current and future roles. Briefings Bioinformatics 2:289-299

Commission of Plant Gene Nomenclature 1994 Nomenclature of sequenced plant genes. Plant Molec Biol Reporter 12:S1-S109

Cruse DA 1986 Lexical semantics. New York, Cambridge University Press

DAG Edit 2001 url: sourceforge.net/projects/geneontology/

DiBona C, Ockman S, Stone M (Editors) 1999 OpenSources. O’Reilly, Sebastopol CA

Dure L. III 1991 On naming plant genes. Plant Molec Biol Reporter 9:220-228

Fellbaum C (editor) 1998 WordNet. An Electronic Lexical Database. MIT Press, Cambridge MA

Fensel D, van Harmelen F, Horrocks I, McGuinness D, Patel-Schneider PF 2001 OIL: An ontology infrastructure for the semantic web. IEEE Intelligent Systems 16:38-45; url: www.daml.org

Fleischmann RD, Adams MD et al 1995 Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512

GO Consortium 2000 Gene Ontology: Tool for the unification of biology. Nature Genetics 25:25-29

GO Consortium 2001 Creating the gene ontology resource: design and implementation. Genome Res 11:1425-1433

GRAMENE 2002 url: www.gramene.org/plant_ontology

Hamsey M 1997 A review of phenotypes of Saccharomyces cerevisiae. Yeast 1:1099-1133.

Heath P 1974 The Philosopher’s Alice. Alice’s Adventures in Wonderland & Through a Looking Glass, by Lewis Carroll. Introduction and notes by Peter Heath. Academy Editions, London

Hill DP, Davis AP, Richardson JE, Corradi JP, Ringwald M, Eppig JT, Blake JA 2001 Strategies for biological annotation of mammalian systems: implementing gene ontologies in mouse genome informatics. Genomics 74:121-128

Jenssen TK, Laegreid A, Komorowski J, Hovig E 2001 A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 28:21-28

Karp P 2000 An ontology for biological function based on molecular interactions. Bioinformatics 16:269-285

Karp P, Paley S 1994 Representations of metabolic knowledge. Proc 2nd Internat Conf Intelligent Systems Bioinformatics, pp 203-211

Karp P, Riley M, Saier M, Paulsen IJ, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S 2002a The EcoCyc database. Nucleic Acids Res 30:56-58

Karp P, Riley M, Paley SM, Pellegrini-Toole A 2002b The MetaCyc database. Nucleic Acids Res 30:59-61

Kerlavage A, Bonazzi V, di Tommaso M, Lawrence C, Li P, Mayberry F, Mural R, Nodell M, Yandell M, Zhang J, Thomas PD 2002 The Celera Discovery system. Nucleic Acids Res 30:129-136

Kretschmann E, Fleischmann W, Apweiler R 2001 Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics 17:920-926

Kyrpides NC, Ouzounis CA 1998 Whole-genome sequence annotation ‘going wrong with confidence’. Molec Microbiol 32:886-887

Laegreid A, Hvidsten TR, Midelfart H, Komorowski J, Sandvik AK 2002 Supervised learning used to predict biological functions of 196 human genes. [In press]

Leser U 1998 Semantic mapping for database integration – making use of ontologies. url: cis.cs.tu-berlin.de/~leser/pub_n_pres/ws_ontology_final98.ps.gz

MGED 2001 url: www.mged.org

Mewes HW, Heumann K, Kaps A, Mayer K, Pfeiffer F, Stocker S, Frishman D 1999 MIPS: a database for genomes and protein sequences. Nucleic Acids Res 27:44-48

Miller GA 1998 Nouns in WordNet. Chapter 1 in Fellbaum 1998

OpenSource 2001 www.opensource.org/

Overbeek R, Larsen N, Punsch GD, D’Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E 2000 WIT: Integrated system for high-level throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 28:123-125

Overbeek R, Larsen N, Smith W, Maltsev N, Selkov E 1997 Representation of function: the next step. Gene 191:GC1-GC9

Pouliot Y, Gao J, Su QJ, Liu GG, Ling YB 2001 DIAN: A novel algorithm for genome ontological classification. Genome Res 11:1766-1779

Priss UE 1998 The formalization of WordNet by methods of relational concept analysis. Chapter 7 in Fellbaum 1998

Pruitt KD, Maglott DR 2001 RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 29:137-140

Riley M 1993 Functions of the gene products of Escherichia coli. Microbiol Revs 57:862-952

Riley M 1998 Systems for categorizing functions of gene products. Curr Opin Struct Biol 8:388-392

Rison SCG, Hodgman TC, Thornton JM 2000 Comparison of functional annotation schemes for genomes. Funct Integr Genomics 1:56-69

Rogers J, Rector A 2000 GALEN’s model of parts and wholes: Experience and comparisons. Proc Amer Medical Informatics Assn Symp 2000:714-718 (editor JM Overhage). Hanley & Belfus Inc, Philadelphia PA

Schulze-Kremer S 1997 Integrating and exploiting large-scale, heterogeneous and autonomous databases with an ontology for molecular biology. pp. 43-46 in: Hofestaedt R, Lim H (editors) Molecular bioinformatics – The human genome project. Shaker Verlag, Aachen.

Schulze-Kremer S 1998 Ontologies for molecular biology. Proc Pacific Symp Biocomput 3:695-706

Serres MH, Gopal S, Nahum LA, Liang P, Gaasterland T, Riley M 2001 A functional update of the Escherichia coli K-12 genome. GenomeBiology 2001:2/9/research/0035.1

Sklyar N 2001 Survey of existing Bio-ontologies. url: http://dol.uni-leipzig.de/pub/2001-30/en

Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A 2000 Transparent Access to Multiple Bioinformatics Information Sources. Bioinformatics 16:184-186

Takai-Igarashi T, Nadaoka Y, Kaminuma T 2000 A database for cell signaling networks. J Comp Biol 5:747

Venter JC et al 2001 The sequence of the human genome. Science 291:1304-1351


Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA 2000 Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 28:10-14

Wierzbicka A 1984 Apples are not a “kind of fruit”. Amer Ethnologist 11:313-328

Xie H, Wasserman A., Levine L, Novik A, Grebinsky V, Shoshan A, Mintz L 2002 Automatic large scale protein annotation through Gene Ontology. [In press]

Zdobnov EM, Apweiler R 2001 InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17:847-848

Acknowledgements.

The Gene Ontology Consortium is supported by a grant to the GO Consortium from the National Institutes of Health (HG02273), a grant to FlyBase from the Medical Research Council, London (G9827766) and by donations from AstraZeneca Inc and Incyte Genomics.

The work described in this review is that of the Gene Ontology Consortium and not the authors – they are just the raconteurs; they thank all of their colleagues for their great support. They also thank Robert Stevens, a user-friendly artificial intelligencer, for his comments and for providing references that would otherwise have evaded them; MA thanks Donald Michie for introducing him to WordNet, albeit over a rather grotty Chinese meal in York.

The Gene Ontology: looking backwards and forwards

Suzanna E. Lewis

The Gene Ontology Consortium (GO) was initiated six years ago when a group of scientists, including myself, decided that the most direct way to connect our data was to share the same language for describing it. Looking back over what has happened since then, all of us feel that the most significant achievement of the GO is in uniting the many independent biological database efforts into a cooperative force.

Long ago, in the pre-genome era, biological databases were coming to terms with a formidable amount of work. After Crick and Watson elucidated the structure of DNA, the field of molecular biology exploded and an ever-increasing amount of information needed to be carefully managed and organized. This was particularly so after the invention of methods to sequence DNA in the late 1970s [1, 2] and, consequently, the initiation of the genome sequencing programs in the late 1980s, all of which led to an even faster acceleration of work in this field.

Keeping pace with molecular developments were biological data management efforts. These first began emerging in the 1960s when Margaret Dayhoff [3] published the Atlas of Protein Sequence and Structure [4], which later went on-line as the Protein Identification Resource (PIR [5]). More than 30 years ago, in the 1970s, the first structure database, Protein Data Bank (PDB [6]), was founded [7] and Jackson Laboratory developed the first mammalian genetics database [8]. A few years later the first depositories for nucleotide sequences were established, with the EMBL Data Library [9] beginning in 1981 [10] at Heidelberg, Germany and GenBank [11] in 1982 [12] at Los Alamos, New Mexico, followed shortly by the formal establishment of the PIR in 1984 [13] for proteins. By the late 1980s and 1990s biological databases were popping up everywhere: SwissProt [14] in 1986; C. elegans ACeDB [15] in 1989; Arabidopsis AAtDB [16] in 1991; The Institute for Genomic Research [18] (TIGR) in 1992 [17]; FlyBase [19] in 1993; and the Saccharomyces Genome Database [21] (SGD) in 1994 [20]. These groups all took advantage of concurrent technological advances and pioneered the use of the Internet, the web, relational database management systems (RDBMS) and SQL when these technologies first became available during the 1980s and 1990s [22]. Thus, many biological databases bloomed and flourished and, until the late 1990s, all of them primarily operated autonomously.

Having many independent genome databases made a large number of researchers very happy but there were shortcomings. The most important research limitation was that the full potential of these isolated data sets would not be realized until they were as integrated as possible. However, there is a practical constraint, which is that biological databases are inherently distributed because the specialized biological expertise that is required for data capture is spread around the globe at the sites where the data originates. Whatever the solution to biological integration was, it would have to acknowledge that the primary sources of data are these distributed investigators.

The community initially was very small and these pioneer database developers largely knew one another. They made many attempts to work together towards an integrated solution either by facilitating the transfer of knowledge between databases or by merging them. The annual ACeDB workshops are one example of these efforts. In the early 1990s these two-week sessions brought together participants from many organisms, such as pine trees, tomatoes, bovines, flies, weeds, worms, and others. Unfortunately, ACeDB was dependent upon what became outmoded technology and did not adapt to the web or RDBMS quickly enough to survive as a general solution.
There were also a number of meetings organized in vain attempts to design the ultimate biological database schema, such as the Meeting on the Interconnection of Molecular Biology Databases held at Clare College, Cambridge in 1995. Creating a federated system failed for reasons too numerous to list, but the biggest impediment was getting the many people involved to agree on virtually everything. It would have created a technological behemoth that would be unable to respond to new requirements when they inevitably occurred. Even small-scale collaborations between two databases failed (SGD and Berkeley Fly Database, in my personal experience). While we decided to share technology, the RDBMS and programming language, this commonality was moot because we did not also share a common focus. SGD had a finished genome while Berkeley was managing EST and physical mapping data. The central point is that the solution to biological database integration does not lie in particular technologies.

At the same time, an approximate solution to this problem was being demanded by the research communities whom the model organism databases (MODs) served. These communities increasingly included not just organism-specific researchers, but also pharmaceutical companies, human geneticists, and biologists interested in many organisms, not just one. Another contributing factor was the recent maturation of DNA microarray technology [23, 24]. The implication of this development was that functional analysis would be done on a large scale and the community risked losing the capability of fully leveraging the power of these new data if they were poorly integrated. For those orchestrating a genome database this was not merely an intellectual exercise; we had to find a solution or risk losing funding. In a word, we were highly motivated.

The most fundamental questions for the biologists the MODs served revolve around the genes. What genes are there, what are their mRNA and peptide sequences, where are they on the genome, when are they expressed and how is their activity controlled, in what tissue, organ, and part of the cell are they expressed, what function do they carry out and what role does this play in the organism’s biology? Both pragmatically and biologically then, it made sense for the solution to likewise revolve around the genes. One essential aspect of this, which everyone agreed was necessary, was systematically recording the molecular functions and biological roles of every gene. One of the first functional classification systems was created in 1993 by Monica Riley for E. coli [25]. Building primarily upon this system, Michael Ashburner began assembling what became proto GO, originally to serve the requirements of FlyBase.
Similarly, TIGR created its functional classification system around this time. These early efforts were systematic, in that they were using a well-defined set of concepts for the descriptions, but they were limited because they were not shared between organisms. SGD, FlyBase, TIGR, Mouse Genome Informatics [26] (MGI), and others all independently realized that we could essentially solve a significant portion of the data integration issue if a cross-species functional classification system were created. In our ideal world, sequence (nucleic acid, protein), organism, and other specialty biological databases would all agree on how this should be done.

In 1998, it simply was imperative for those responsible for community model organism databases to act, as the number of completely sequenced genomes and large-scale functional analysis experiments was growing. Our correspondence that spring contained many messages such as these: “I'm interested in being involved in defining a vocabulary that is used between the model organism databases. These databases must work together to produce a controlled vocabulary.” (Personal communication); and “It would be desirable if the whole genome community was using one role/process scheme. It seems to me that your list and the TIGR list are similar enough that generation of a common list is conceivable.” (Personal communication).

In July of that year, Michael Ashburner presented a proposal at the Montreal ISMB bio-ontologies workshop to use a simple hierarchical controlled vocabulary; the proposal was dismissed by other participants as naïve. Later, however, in the hotel bar, representatives of FlyBase (Suzanna Lewis), SGD (Steve Chervitz), and MGI (Judith Blake) embraced this proposal and agreed to jointly apply the same vocabulary to describe the molecular functions and biological roles for every gene in their respective databases, and thereby founded the Gene Ontology Consortium.

It is now six years later and the GO has grown enormously.
There are many measures demonstrating its success:

Publications: at present there are close to 300 articles in PubMed referencing the GO.
Support from large institutional databanks: SwissProt now uses GO for annotating the peptide sequences they maintain.
Participation: the number of organism groups has grown every quarter from the initial three to roughly two dozen.
Acceptance: every conference has talks and posters either referencing or utilizing the GO, and within the genome community it is the accepted standard for functional annotation.

While it is impossible in hindsight to pinpoint exactly why it has succeeded, there are certain definite factors involved that are listed below.

We already had ‘market-share’.
Our careers were such that we could take risks.
We were and are practical and experienced engineers.
We have always worked at the leading edge of technology.
It was in our own self-interest.
We had domain knowledge.
We are open.

A significant advantage that we (those managing biological databases) had, though it is not often considered, is our stewardship of key data sets. The commencement of GO also coincided with the completion of many key genomes that, once sequencing is finished, these database groups annotate, manage and maintain. These facts put us in the right position to succeed because of the influence these data have. The decisions we make in our management of these data have a great deal of downstream effect. Every researcher, both bench and informaticist, who utilizes the genomic data of mouse, Drosophila, yeast, and other organisms is influenced by our choices in how the data are described and organized. In contrast to broad-spectrum archival repositories, these data are annotated by specialists in the biology of a given organism who have a detailed understanding of its idiosyncratic biological phenomena. This expertise anchors the captured knowledge in experimental data. As other organism specialists joined, with the Arabidopsis Information Resource [27] (TAIR) joining soon after the start, as well as microbial and pathogen databases [28], the impact of GO increased. Given the large established constituency of biologists that FlyBase, SGD, MGI, and TAIR are accepted by, it is unsurprising that our decision to jointly develop the GO was influential.

In addition to holding majority share of these critical research resources, the careers of the people involved are built on successful collaborative efforts. The professionals who are responsible for the biological databases fall roughly into two classes.
They are either tenured principal investigators who wish to contribute to their community or PhD level researchers (both biologists and computer scientists) who have deliberately chosen a non-academic career track. They do not have much to gain by, for example, publishing papers as individuals. Papers are published, of course, about the content of the database or techniques for managing these data, but an individual’s personal publication record is not a primary criterion upon which their career is evaluated. Rather, careers are measured by the success of the project and the strength of an individual’s contribution to the project’s goals. This attitude allowed us to remove our egos and concern for individual recognition from the search for a solution to the data interconnection problem.

Apart from the preceding organizational and social factors, each GO consortium scientist had a successful background in producing large information resources. Everyone possessed institutional knowledge of the requirements for biology and proven experience in engineering management and development. They knew how to decompose a large and complex project into smaller readily measurable milestones, an extremely difficult thing to do. Understanding the theoretical requirements of a problem is necessary, but insufficient. The experience and practical skill to effectively direct the development and implement a solution were also essential.

Complementing our existing skills was our willingness to use new technologies. A key characteristic of the scientists who initiated the GO is that they are “early adopters”. There is a definite behavior pattern in this group of exploring technological innovations. We had always sought new strategies to solve our problems, from the Internet, the web, RDBMSs, and new languages (such as Perl and Java) to ontologies, all of which we began to work with before the methodologies were mature and well-established.
In short, we have a tradition in experimentation. It is not very surprising that scientists are willing to experiment, but this mind set extends to computer science as well and enables us to exploit advances in that field to address the needs of biology. Anything that will help us get the job done we will take advantage of.

The GO consortium is inherently collaborative and collaborations are hard, very hard, because of geography, misunderstandings, and the length of time it takes to get anything resolved and completed. Within the consortium, it is made even more difficult because we must discuss and agree upon mental concepts and definitions in addition to concrete issues such as data syntax and exchange. Still, we actively sought collaboration, because it was in our own self-interest. Our users, whose support we depended upon, were demanding the ability to ask the same query of different genomic databases and receive comparable answers. Every biological database would gain through cooperation.

One of the most significant contributing factors is our deep knowledge of the domain of biology. No problem can be solved successfully if you do not understand its nuances. The consortium succeeded by utilizing knowledge from many disparate fields: selectively exploiting what has been learned in the field of artificial intelligence (AI) and the study of ontologies; constrained by practical engineering considerations and incremental development; all whilst bearing in mind the niceties of the biology being represented. Domain knowledge is essential to GO’s success, without which we could not maintain biological fidelity.

Last, and perhaps most important, is that we have always been open. All of the vocabularies, the annotations, and software tools are available for others to use. Our success is best illustrated by how much they are used [29]. This openness is essential in the scientific environment we work in. To provide a technology without a willingness to reveal all source code and data is tantamount to throwing away the lab notebook. Providing outside researchers with the ability to completely understand the methods that are used is mandatory for scientific progress. The GO is not perfect, but its success is primarily due to revealing everything. The feedback we receive from others is what is enabling the consortium to improve with age.

Our plan for the future is to build on this base.
We are actively seeking ways and building tools to help new biological databases utilize the GO and thus extend our data coverage to include more organisms. We will remain pragmatic in our choice of technologies and remain flexible enough to exploit new advances. We will incrementally advance the sophistication of the underlying software architecture; one example of this is our collaboration with Reactome [30], a project generating formal representations of biological pathways. We will seek out domain experts as the biological coverage of the GO extends into new areas so that biological fidelity is kept high. Likewise, we will work with experts to extend the scope of the ontologies to cover other critical areas of biological description, such as anatomies, cell types, and phenotypes, as illustrated by the Open Biological Ontologies [31] project. Finally, we will continue to work cooperatively and remain open, as this has been shown to be the most scientifically productive approach.

The GO succeeded because it was not a technical solution per se. Technology is more than just an implementation detail of course, but will never be a silver bullet either. It comes down to the fact that we want to continue integrating our knowledge forever and technologies are short-lived. Therefore, the solution must be able to adopt new technologies as they arise while the primary focus remains on cooperative development of semantic standards. It’s about the content, not the container.

Perhaps ironically, the impact of shifting the focus away from a technical solution to our biological data integration problem is that we have begun sharing technology. Once the mechanism for a dialog was in place, we discovered many other areas where our interests coincided. There are now organized meetings for professional biological curators to meet and discuss standard methodologies [32].
The Generic Model Organism Database [33] (GMOD) effort makes these common tools available to the community and serves as a forum for a wide spectrum of interests. It is this unforeseen outcome, consolidating the disparate databases into a cooperative community engaged in productive dialogs, which is the single largest impact and achievement of the Gene Ontology consortium.

[1] Sanger F., Coulson A.R., A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975 May 25;94(3):441-8.
[2] Maxam A.M., Gilbert W., A new method for sequencing DNA. Proc Natl Acad Sci U S A. 1977 Feb;74(2):560-4.
[3] http://www.dayhoff.cc/index.html
[4] Dayhoff MO, Eck RV, Chang MA, Sochard MR. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, 1965
[5] http://pir.georgetown.edu/home.shtml
[6] http://www.rcsb.org/pdb/
[7] http://www.rcsb.org/pdb/holdings.html
[8] www.jax.org/about/milestones.html
[9] http://www.ebi.ac.uk/embl/index.html
[10] http://www.embl.org/aboutus/generalinfo/history.html
[11] http://www.ncbi.nlm.nih.gov/Genbank/index.html
[12] www.ncbi.nlm.gov/Education/BLASTinfo/milestones.html
[13] http://pir.georgetown.edu/pirwww/aboutpir/history.html
[14] www.ebi.ac.uk/swissprot
[15] http://www.acedb.org/
[16] http://weedsworld.arabidopsis.org.uk/Vol3ii/Cherry-Flanders-Petel.WW.html
[17] www.tigr.org/about/history.shtml
[18] http://www.tigr.org/
[19] www.flybase.org
[20] www.yeastgenome.org/aboutsgd.shtml
[21] www.yeastgenome.org/
[22] 1970: Relational database model specified (Ted Codd: www.nap.edu/readingroom/books/far/ch6.html); 1983: the internet is defined as networks using TCP/IP (www.historyoftheinternet.com/chap4.html); 1985: SQL defined; 1990: first ever web page at CERN (www.w3.org/History.html)
[23] Fodor SP, Rava RP, Huang XC, Pease AC, Holmes CP, Adams CL., Multiplexed biochemical assays with biological chips. Nature. 1993 Aug 5;364(6437):555-6.
[24] Schena M., Shalon D., Davis R.W., Brown P.O., Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995 Oct 20;270(5235):467-70.
[25] Riley M. Functions of the gene products of Escherichia coli. Microbiol Rev. 1993 Dec;57(4):862-952.
[26] www.informatics.jax.org/
[27] www.arabidopsis.org/
[28] www.genedb.org/
[29] http://www.geneontology.org/GO.biblio.html
[30] www.reactome.org
[31] http://obo.sf.net
[32] http://tesuque.stanford.edu/biocurator.org/
[33] http://gmod.sourceforge.net/


Ontologies in Biomedicine: The Good, the Bad, and the Ugly http://ontology.buffalo.edu/bio/OntologiesGBU.html

Ontologies in Biomedicine: The Good, the Bad, and the Ugly

Compiled for internal use

Very First Draft. Comments, Corrections and Extensions Welcome (to: [email protected])

1. Caveats:

1. Not everything on this list is described by its authors as an ontology.

2. The list has been prepared for illustrative purposes, as a preliminary guide to the sorts of pitfalls that we face in the building of ontologies. Its goal is to draw attention primarily to what is wrong with ontologies. Thus it should be used in conjunction with lists of ontologies in biomedicine such as those prepared at:

http://www.cs.man.ac.uk/~stevensr/ontology.html
http://anil.cchmc.org/Bio-Ontologies.html
http://lsdis.cs.uga.edu/~cthomas/bio_ontologies.html

and of course with the OBO (Open Biomedical Ontologies Consortium) ontology library: http://obo.sourceforge.net

2. First Draft List of Criteria to be satisfied by Good Ontologies (NB think of these as rules of thumb, or goals to keep constantly in mind – the world is too messy to support them all simultaneously)

a. Each ontology should have as its backbone a taxonomy based on the is_a relation (for ‘is a subtype of’). This should be as far as possible a true hierarchy (single inheritance).

b. The taxonomy should have one root, with a suite of high-level children of the root of a sort which yield a top-down view of the structure of the whole ontology. (One does not have this e.g. in SNOMED, or in the Cell Ontology.)

c. The expressions corresponding to the constituent nodes of the taxonomy and to its relations (is_a, part_of, etc.) should be explicitly defined in both human-readable and computable formats. The latter should be formalized versions of the former. Such definitions should then provide the rationale for establishing the class subsumption inheritance hierarchy.

d. There should be clear rules governing how definitions are formulated.

e. The ontology should distinguish between the types (classes, universals) represented by this taxonomy, and the tokens (individuals, particulars, instances) instantiated by these types on the side of reality.

f. The relationships should be used consistently to ensure valid inferences both within and between ontologies. One should be able to reliably query on instance data, computationally.

g. Classification systems that have existed for centuries have been human interpretable, but never computable. So, being able to compute on an ontology is important.

h. An ontology should accommodate change in knowledge. It should have clear procedures for adding new terms, and clear procedures for correcting erroneous entries. All prior versions should be easily accessible.

i. There should be clear rules governing how to select terms and how to resolve problems in case of difficult terms.

j. The different types of problem cases in the treatment of terms, relations and definitions should be carefully documented, and best practices for the resolution of these problems tested and promulgated.

k. The scope of an ontology should be clearly specified, both in terms of the domain of instances over which it applies and in terms of the types of relations in that domain (and thus the pertinent type of scientific inquiry). The family of terms in a given ontology should then have a natural unity, which should also be reflected in the name of the ontology. This criterion is not satisfied e.g. by the various so-called 'tissue' ontologies discussed in the list below. Indeed the family of tissue terms is one important example of a problem area, reflecting the fact that the term 'tissue' is ambiguous as between KIND of tissue and PORTION of tissue. (A similar ambiguity applies e.g. to ‘substance’.)
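Criteria a, b and f lend themselves to simple mechanical checks. The following is a minimal sketch, assuming the taxonomy is available as a mapping from each term to its is_a parents; the toy terms and all function names are hypothetical and are not drawn from any of the ontologies ranked here.

```python
# Hypothetical toy taxonomy (an assumption for illustration): term -> is_a parents.
IS_A = {
    "anatomical structure": [],
    "organ": ["anatomical structure"],
    "cell": ["anatomical structure"],
    "eye": ["organ"],
    "photoreceptor cell": ["cell"],
}


def roots(is_a):
    """Terms with no is_a parent; criterion b asks for exactly one."""
    return sorted(term for term, parents in is_a.items() if not parents)


def multiple_inheritance(is_a):
    """Terms with more than one is_a parent; criterion a asks for none."""
    return sorted(term for term, parents in is_a.items() if len(parents) > 1)


def ancestors(term, is_a):
    """All supertypes of a term, treating is_a as transitive (criterion f)."""
    seen = set()
    stack = list(is_a[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a[parent])
    return seen


if __name__ == "__main__":
    print("roots:", roots(IS_A))
    print("multiple inheritance:", multiple_inheritance(IS_A))
    print("ancestors of eye:", sorted(ancestors("eye", IS_A)))
```

Checks of this kind only test structure; whether a given is_a link is biologically correct still depends on the definitions called for in criteria c and d.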

3. The Rankings

3.1 Very Good

The Foundational Model of Anatomy (FMA): http://sig.biostr.washington.edu/projects/fm/ Very clear statement of scope (structural human anatomy, at all levels of granularity, from the whole organism to the biological macromolecule); very powerful treatment of definitions (from which the entire FMA hierarchy is generated); very quick turn-around time for correction of errors; very few unfortunate artifacts in the ontology deriving from its specific computer representation (Protégé).

3.2 The Good

GALEN Motivation: to find ways of storing detailed clinical information in a computer system so that both (1) clinicians are able to store and review information at a level of detail relevant to them and (2) computers can manipulate what is stored, for retrieval, abstraction, display, comparison. Very powerful (Description Logic-based) formal structure, thus tight organization and careful treatment of terms; unfortunately remains only partially developed after some years of lying fallow. Now in some respects outdated.

3.3 The Intermediate (= still need many modifications)

Gene Ontology Open source; very useful; poor treatment of the relations between the entities covered by its three separate ontologies

Reactome http://www.reactome.org/


A rich knowledgebase of biological processes, but with incoherent treatment of top-level categories. Thus ReferentEntity (embracing e.g. small molecules) is treated as a sibling of PhysicalEntity (embracing complexes, molecules, ions and particles). Similarly CatalystActivity is treated as a sibling of Event.

SNOMED http://www.snomed.org/

Swissprot http://us.expasy.org/sprot/ Protein knowledgebase

Sequence Ontology http://song.sourceforge.net/

Cell Ontology http://www.xspan.org/obo/

Zebrafish Anatomy and Development Ontology http://obo.sourceforge.net/cgi-bin/detail.cgi?zfishanat

NANDA International Taxonomy http://www.nanda.org/html/taxonomy.html A conceptual system that guides the classification of nursing diagnoses in a taxonomy.

ICNP International Classification for Nursing Practice http://www.icn.ch/icnp.htm A combinatorial terminology for nursing practice that facilitates crossmapping of local terms and existing vocabularies and classifications

National Cancer Institute Thesaurus http://www.mindswap.org/2003/CancerOntology/ Top-level structure recognizes the existence of three (disjoint) classes of cells: cells, normal cells, abnormal cells. Recognizes three (disjoint) classes of plants: vascular plants, non-vascular plants, other plants. Inherits many of the problematic features from other terminologies in the UMLS.
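The problem with declaring cells, normal cells and abnormal cells to be mutually disjoint can be made concrete with a small check. The sketch below is an illustration only: the function name `disjointness_conflicts` is hypothetical, and the is_a links (normal cell is_a cell, abnormal cell is_a cell) are an assumption about the intended reading, not something taken from the NCI Thesaurus itself. It flags any class that is declared disjoint from one of its own ancestors, which would force that class to have no instances.

```python
def disjointness_conflicts(is_a_links, disjoint_groups):
    """Return (class, ancestor) pairs that are also declared mutually disjoint."""
    parent = dict(is_a_links)  # assumes single inheritance

    def ancestors(term):
        found = []
        # Walk up the is_a chain; stop at the root or if a cycle is detected.
        while term in parent and parent[term] not in found:
            term = parent[term]
            found.append(term)
        return found

    conflicts = []
    for group in disjoint_groups:
        for term in group:
            for ancestor in ancestors(term):
                if ancestor in group:
                    conflicts.append((term, ancestor))
    return sorted(conflicts)


# Assumed is_a links; the disjoint group mirrors the top-level structure described above.
links = [("normal cell", "cell"), ("abnormal cell", "cell")]
groups = [{"cell", "normal cell", "abnormal cell"}]

print(disjointness_conflicts(links, groups))
```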

UMLS http://www.nlm.nih.gov/research/umls/

ICD-10 http://www.icd10.ch/index.asp?lang=EN

3.4 The Bad

UMLS Semantic Network http://semanticnetwork.nlm.nih.gov/ Recognizes only one subtype of plant: algae (which are not plants). Treats the digestive system as a conceptual part of the organism.

Clinical Terms Version 3 (The Read Codes): http://www.nhsia.nhs.uk/terms/pages/publications/v3refman/chap2.pdf


(Early?) versions classify chemicals into: chemicals whose name begins with ‘A’, chemicals whose name begins with ‘B’, chemicals whose name begins with ‘C’, ... Incorporated into SNOMED-CT

LOINC Logical Observation Identifiers Names and Codes: http://www.regenstrief.org/loinc Goal: to facilitate the exchange and pooling of results, such as blood hemoglobin, serum potassium, or vital signs, for clinical care, outcomes management, and research. Problem: reveals its origins in the punchcard era; typical string: 12189-7 | CREATINE KINASE.MB/CREATINE KINASE.TOTAL | CFR | PT | SET/PLAS | QN | CALCULATION
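To see what such a string carries, it helps to split it into named fields. The sketch below is illustrative only: the function `parse_loinc_string` is hypothetical, and the field names are an assumption based on LOINC's six-part naming convention (component, property, time aspect, system, scale, method) preceded by the code. The example string is copied unchanged from the text above.

```python
# Assumed field order: the LOINC code followed by the six name parts.
LOINC_FIELDS = ("code", "component", "property", "time", "system", "scale", "method")


def parse_loinc_string(text):
    """Split a pipe-delimited LOINC-style string into a dict of named fields."""
    parts = [part.strip() for part in text.split("|")]
    if len(parts) != len(LOINC_FIELDS):
        raise ValueError(f"expected {len(LOINC_FIELDS)} fields, got {len(parts)}")
    return dict(zip(LOINC_FIELDS, parts))


example = ("12189-7 | CREATINE KINASE.MB/CREATINE KINASE.TOTAL | CFR | PT "
           "| SET/PLAS | QN | CALCULATION")
record = parse_loinc_string(example)
for field in LOINC_FIELDS:
    print(f"{field:10s} {record[field]}")
```

The meaning of each field is fixed only by its position in the string, which is the punchcard-style design the critique points to.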

Health Level 7 Reference Information Model (HL7 RIM): http://www.hl7.org/Library/data-model/RIM/modelpage_mem.htm HL7 is a standard for exchange of information between clinical information systems (has proved very crumbly as a standard; every hospital has its own version of HL7); the RIM is designed to overcome this problem by defining the world of healthcare data (a consensus view of the entire healthcare universe); one problem with the RIM is that very many entities in the healthcare universe (e.g. disorders, genes, ribosomes) are identified as documents; because of the counterintuitive nature of this identification, RIM documentation is itself highly counterintuitive, and the RIM community itself is subject to constant fights

Medical Entities Dictionary (MED): http://med.dmi.columbia.edu/ Semantic network style.

MedDRA v. 3: http://www.meddramsso.com/NewWeb2003/medra_overview/index.htm Has hierarchies, but you can’t tell by browsing through the hierarchies whether different terms represent the same thing or not. MedDRA v. 3 does not assign unique codes to its terms, but rather works with unique terms collected from various sources which are left unchanged for reasons of ‘compatibility’. Some source terminologies, such as WHO-ART (World Health Organization Adverse Reaction Terminology), had all terms in uppercase, some not. So a unique term might be “COLD”, but also “cold”, and “Cold”, and “cOLd”, ... Each unique term in MedDRA v. 3 must be assigned a single meaning, but MedDRA does this in a haphazard way. Thus the 4-character string “COLD” might be assigned the meaning common cold or cold temperature or (as is in fact the case) chronic obstructive lung disease. Suppose, now, that a medical doctor in a pharmaceutical company has the task of coding into MedDRA handwritten reports received from practising physicians engaged in clinical studies. She must then, according to the coding rules set up by her department, either code a sentence such as “patient coughing and sneezing, ... diagnosis: COLD” as referring to chronic obstructive lung disease (which is obviously wrong), or make a phone call to the physician to ensure that he in fact meant “cold” and not “COLD”.
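The hazard of identifying terms by their character strings rather than by unique codes can be sketched in a few lines. This is an illustration under stated assumptions, not a description of MedDRA's data: the function name `ambiguous_terms` is hypothetical, and of the three term/meaning pairs only the assignment of “COLD” to chronic obstructive lung disease comes from the text above; the other two are invented for the example.

```python
from collections import defaultdict


def ambiguous_terms(entries):
    """Group (term, meaning) pairs by case-folded term; keep terms with several meanings."""
    meanings = defaultdict(set)
    for term, meaning in entries:
        meanings[term.casefold()].add(meaning)
    return {term: sorted(found) for term, found in meanings.items() if len(found) > 1}


entries = [
    ("COLD", "chronic obstructive lung disease"),  # assignment reported in the text
    ("cold", "common cold"),                       # hypothetical
    ("Cold", "cold temperature"),                  # hypothetical
]

print(ambiguous_terms(entries))
```

A handwritten report does not reliably preserve letter case, so any coding rule that depends on case alone will collapse these meanings in practice.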

MEDCIN: http://www.medicomp.com/index_html.htm Mixes up everything that can be mixed up


International Classification of Primary Care (ICPC): http://www.ulb.ac.be/esp/wicc/icpc2.html tries to explain general medicine (family medicine) by means of about 800 classes.

MeSH

MGED

eVoc

PATO

Mouse Pathology

3.5 The Ugly

ICD-10-PCS: http://www.cms.hhs.gov/paymentsystems/icd9/icd10.asp based on good principles but worked out using an ugly representation

UMLS Semantic Network

Special Mention: Ugly Tissue Ontologies

1. TissueDB (http://tissuedb.ontology.ims.u-tokyo.ac.jp:8082/tissuedb/) has nothing to do with tissues. What they call tissue is basically all the structures one can identify histologically.

2. Brenda Tissue Ontology http://www.brenda.uni-koeln.de/ontology/tissue/tree/update/update_files/BrendaTissue has nothing to do with tissue. Or rather, here, basically everything is a tissue. Thus it contains statements like: arm is-a limb

3. Auckland Anatomy Ontology Tissue Class View http://n2.bioeng5.bioeng.auckland.ac.nz/ontology/anatomy/ontology_class_view?class_uri=http%3A//physiome.bioeng.auckland.ac.nz/anatomy/all%23Tissue Classifies tissue into: Connective tissue, Epithelial tissue, Glandular tissue, Muscle tissue, Nervous tissue; but proceeding further down the hierarchy we find not tissues but organs and organ parts such as SimpleTubularGland, SimpleAcinarGland, etc. Moreover EndocrineGland is asserted to have two ‘instances’ (we presume they mean subclasses): EndocrineGland (!), and FollicularEndocrineGland. Among the ‘instances’ of ConnectiveTissue are listed: Left Humerus, Right Tibia, and so on. So nonsense, here, too.
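The confusion of instances with subclasses noted in the third example is the type/token distinction of criterion e, and a simple consistency check can surface it. The sketch below is hypothetical throughout: the function name `suspicious_instance_assertions` and the data layout are assumptions made for illustration, and only the class names are taken from the text above.

```python
def suspicious_instance_assertions(instance_of, known_types):
    """Flag instance_of assertions whose subject is itself a type."""
    flagged = []
    for subject, cls in instance_of:
        if subject == cls:
            flagged.append((subject, cls, "class listed as its own instance"))
        elif subject in known_types:
            flagged.append((subject, cls, "type asserted as instance; is_a may be intended"))
    return flagged


# Class names from the text; the set of known types is an assumption.
known_types = {"EndocrineGland", "FollicularEndocrineGland"}
assertions = [
    ("EndocrineGland", "EndocrineGland"),
    ("FollicularEndocrineGland", "EndocrineGland"),
]

for row in suspicious_instance_assertions(assertions, known_types):
    print(row)
```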

Journal of Biomedical Informatics 36 (2003) 478–500 www.elsevier.com/locate/yjbin

A reference ontology for biomedical informatics: the Foundational Model of Anatomy

Cornelius Rosse* and Jose L. V. Mejino Jr.

Departments of Biological Structure, and Medical Education & Biomedical Informatics, Structural Informatics Group, University of Washington, Seattle, WA 98195, USA

Received 7 November 2003

Abstract

The Foundational Model of Anatomy (FMA), initially developed as an enhancement of the anatomical content of UMLS, is a domain ontology of the concepts and relationships that pertain to the structural organization of the human body. It encompasses the material objects from the molecular to the macroscopic levels that constitute the body and associates with them non-material entities (spaces, surfaces, lines, and points) required for describing structural relationships. The disciplined modeling approach employed for the development of the FMA relies on a set of declared principles, high level schemes, Aristotelian definitions and a frame-based authoring environment. We propose the FMA as a reference ontology in biomedical informatics for correlating different views of anatomy, aligning existing and emerging ontologies in bioinformatics and providing a structure-based template for representing biological functions. © 2003 Elsevier Inc. All rights reserved.

Keywords: Ontology; Knowledge representation; Bioinformatics; Biomedical informatics; Anatomy; Mereotopology; Embryology; Developmental biology; UMLS

* Corresponding author. Fax: 1-206-543-1524. E-mail address: [email protected] (C. Rosse).

1532-0464/$ - see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2003.11.007

1. Introduction

Ontology design is becoming increasingly recognized as central to medical informatics [1] and even more so to bioinformatics. New ontologies continue to appear in diverse areas of the biomedical sciences with a particular emphasis on biological macromolecules and the processes in which these molecules participate. The importance of relating such new information resources to medical terminologies (or vocabularies) is illustrated by the recent incorporation of the Gene Ontology [2] in the Unified Medical Language System (UMLS) [3]. UMLS, designed, maintained and distributed by the National Library of Medicine, provides a unified knowledge representation system for correlating a large number of biomedical terminologies. Like most UMLS terminologies, the Gene Ontology and other application ontologies in biomedical informatics are compiled in diverse contexts with distinct user groups in mind; consequently their correlation and mapping to one another pose a considerable challenge. The challenge is enhanced by the need for aligning these ontologies with evolving, computable information resources in the classical, basic, biomedical sciences (e.g., anatomy, physiology, and pathology), as well as with those in clinical medicine. Such correlations will be critical for the development of knowledge-based applications that will need to rely on inference in order to support clinical research and decision making based on the knowledge of molecular biology.

A raison d'être of UMLS is to facilitate the establishment of correspondences in the meaning of terms among its constituent vocabularies. This correlation is largely achieved through assigning the same concept occurring in different terminologies to high level semantic types encompassed within the UMLS Semantic Network [4]. It is more problematic, however, to reconcile divergences in the semantic structure of these sources and other ontologies at levels higher than leaf concepts and discrete terms. For example, while there is considerable correspondence in the meaning of anatomical terms in UMLS sources that include substantial amounts of anatomy, there is very little similarity in the schemes these sources use for arranging their anatomical terms into a coherent representation of anatomical knowledge. While such correspondences may support the correlation of the meaning of terms, the underlying semantic structure of these abstractions must also be aligned if problem solving calls for inference across the boundaries of related ontologies.

It is particularly important to assure coherence of knowledge domains that generalize to a number of other fields where they will be reused. Such is the case with the classical, basic, biomedical sciences and also with more modern disciplines, such as neuroscience and developmental biology. All these fields are embraced by bio- and biomedical informatics, which deal not only with human biology but also with observations and experimental data derived from non-human species. In order to support the generation of knowledge-based applications that will be increasingly needed in basic science and clinical research, as well as in the delivery of health care, computable knowledge sources must be established not only in the modern but also in the classical disciplines of basic science. Such a widening focus in bioinformatics is inevitable in the post-genomic era, and the process has in fact already begun. Distinct from the large clinical terminologies (e.g., SNOMED RT [5], GALEN [6], Medical Entities Dictionary [7]), a number of ontologies are emerging that represent knowledge in discrete fields of the basic biomedical sciences. One of these ontologies is the Digital Anatomist Foundational Model of Anatomy (Foundational Model or FMA, for short) [8,9]. The FMA symbolically represents the structural organization of the human body from the macromolecular to macroscopic levels.

The initial development of the FMA was supported by UMLS with the intent of enhancing the anatomical content of UMLS source vocabularies and ultimately facilitating the correlation of anatomical concepts represented in these vocabularies. We present a status report on the FMA, major components of which are included in UMLS as the Digital Anatomist vocabulary (known in previous editions of UMLS as UWDA). With this report we wish to promote the evaluation of the FMA with respect to realizing its intended role in UMLS and, in a broader sense, bring the FMA to the attention of the biomedical, and particularly the bioinformatics communities.

The purpose of this paper is to describe the FMA and propose it as a reference ontology for biomedical informatics. Our rationale for this proposal is based on the fact that the FMA's concept domain embraces all material objects, substances and spaces that result from the coordinated expression of structural genes. In their aggregate these anatomical entities constitute the fully formed body and assume the role of "actors" in all physiological and disease processes. Therefore, we contend that a coherent domain ontology of anatomical entities is the best candidate for serving as a foundation and reference for the correlation of other ontologies in biomedical informatics. Our second objective is to illustrate the process of disciplined modeling we pursued in establishing the FMA. We believe that this approach could also serve well the authors of emerging knowledge sources in bioinformatics, in that it synergizes with and enhances broader guidelines and desiderata that have been proposed for the construction of terminologies and knowledge bases [10,11].

1.1. Organization of this paper

We first define the FMA and then illustrate the disciplined modeling approach by focusing on the establishment of the Anatomy Taxonomy (AT) and the other two components of the FMA, which relate to structural and developmental attributes of the entities to which concepts in the AT refer. The next sections are devoted to accessing, scaling, and evaluating the FMA, before we discuss the FMA's relevance to UMLS and comment on its potential as a reference ontology for biomedical informatics, which leads to our conclusions. Different typographies used in the text have the following associations: Names of concepts represented in the FMA are in Courier New font, which distinguishes, for example, Organ, a class in the AT, from the term 'organ' used in a general context; relationships between concepts are in italics enclosed by hyphens, e.g., -part of-; italics are also used for emphasis and for Latin terms; abbreviations of the components of the FMA are in bold capitals, e.g., AT.

2. The Foundational Model of Anatomy

The Foundational Model of Anatomy is an evolving ontology for biomedical informatics; it is concerned with the representation of entities and relationships necessary for the symbolic modeling of the structure of the human body in a computable form that is also understandable by humans [8,9]. Specifically, the FMA is an abstraction that explicitly represents a coherent body of declarative knowledge about human anatomy as a domain ontology (defined below). The ontology is implemented in a frame-based system and is stored in a relational database. The FMA is intended as a reusable and generalizable resource of deep anatomical knowledge, which can be filtered to meet the needs of any knowledge-based application that requires structural information. It is distinct from application ontologies in that it is not intended as an end-user application and does not target the needs of any particular user group.

We regard this model as foundational for two reasons: (1) anatomy is fundamental to all biomedical domains; and (2) the anatomical concepts and relationships encompassed by the FMA generalize to all these domains. By 'anatomical concept' we mean a unit of thought that refers to an anatomical entity (defined in section 3.2.1). The Foundational Model currently contains 70,000 distinct anatomical concepts, representing structures ranging in size from some macromolecular complexes and cell components to major body parts. These concepts are associated with more than 110,000 terms, and are related to one another by more than 1.5 million instantiations of over 170 kinds of relationships. We developed and instantiated this large and complex model through an approach we call disciplined modeling.

3. Disciplined modeling

We first describe the elements of disciplined modeling that have guided the establishment of the three major components of the FMA and then deal with each of these components: the Anatomy Taxonomy, Anatomical Structural Abstraction, and the Anatomical Transformation Abstraction.

3.1. Elements of disciplined modeling of anatomy

We borrow the term 'disciplined modeling' from Perl et al. [12,13], who proposed a methodology for restructuring existing vocabularies in order to introduce clarity into their representation scheme. We on the other hand have employed a disciplined approach for the de novo creation of a new knowledge base. The elements of our approach consist of a set of declared foundational principles, a high level scheme for representing the referents of concepts and relationships in the anatomy domain, Aristotelian definitions and a knowledge modeling environment that assures implementation of the principles and the inheritance of definitional and non-definitional attributes.

3.1.1. Foundational principles

Principles are assertions that provide the basis for reasoning and action. The nature of the principles we declare is dictated by the definition of the domain we intend to model. This domain is anatomy. We have previously distinguished and defined two concepts for which the term 'anatomy' is a homonym: anatomy (science) and anatomy (structure) [8]. As its definition in a preceding section specifies, the Foundational Model of Anatomy is an abstraction of anatomy (structure), which is the ordered aggregate of material objects and physical spaces filled with substances that together constitute a biological organism. The instantiated symbolic model itself is a concrete manifestation of anatomy (science), which is a biological science concerned with the discovery, analysis and representation of anatomy (structure). We declared the following principles for guiding the formulation and instantiation of the FMA abstraction [8,9]:

1. Unified context principle. The abstraction should conform to a strictly structural context. Although anatomical discourse in education and various biomedical fields embraces diverse contexts (e.g., functional, surgical, radiological, and biomechanical), it is the analysis and description of an organism's structure that distinguishes the science of anatomy from other biological sciences. We have found that only in a structural context is it possible to establish a single inheritance hierarchy that subsumes all anatomical concepts. As stated earlier, it is our contention that such a structure-based representation can serve as a reference ontology for correlating other (e.g., functional, clinical) contexts and views of anatomy.

2. Abstraction level principle. The abstraction should model canonical anatomy and provide a framework for anatomical variants, but should exclude instantiated anatomy.

We have previously distinguished canonical and instantiated anatomy [8]. Canonical anatomy is a field of anatomy (science) that comprises the synthesis of generalizations based on anatomical observations that describe idealized anatomy (structure). These generalizations have been implicitly sanctioned by their usage in anatomical discourse. Instantiated anatomy is the field of anatomy (science) which comprises anatomical data pertaining to instances (i.e., individuals) of organisms and their parts. Although we exclude instantiated anatomy from the FMA, our intent is for the FMA to serve as a foundation for the representation of the anatomy of individuals and to provide an organizational framework for anatomical data, including images. Thus, the FMA should represent classes, which are multiply located anatomical entities (i.e., universals) that exist in the instances (or particulars) that they subsume.

3. Species specificity principle. The initial iteration of the abstraction should model the anatomy of Homo sapiens, but at the same time it should serve as a framework for the anatomy of other mammalian and eventually, other vertebrate species. Although clinical medicine is concerned with the human, animal models of human disease, as well as veterinary medicine in its own right, call for a symbolic representation of anatomy. The highly conserved groups of structural genes that dictate the vertebrate body phenotype provide a rationale for eventually modeling species-specific anatomy as specializations of a generalizable vertebrate body plan [14]. Therefore, the high level abstract classes of the FMA should accommodate the generalized "Bauplan" of vertebrates.

4. Definition principle. Defining attributes of a class in the model should be specified in terms of the physical and other structural (i.e., anatomical) attributes of the anatomical entities that the class subsumes (see Section 3.1.3).

5. Dominant concept principle. An ontology's dominant class is the class in reference to which other classes in the ontology are defined. Anatomical structure (defined in Section 3.2) shall be the dominant class in the FMA (see Section 3.2.2.2).

6. Organizational unit principle. The abstraction shall have two units in terms of which subclasses of Anatomical structure are defined: Cell and Organ. Other subclasses of Anatomical structure shall constitute cells or organs, or be constituted by cells or organs.

7. Content constraint principle. The largest anatomical structure represented shall be the whole organism (in the current iteration, the human body) and the smallest Biological macromolecule. Should the need arise, molecules not synthesized through the expression of the organism's own genes shall be represented in separate ontologies. Within these constraints, the abstraction shall model both concepts and relationships at the most refined level of granularity.

8. Relationship constraint principle. The abstraction shall model three types of relationships that occur between anatomical entities: (1) class subsumption relationships; (2) static physical relationships; and (3) relationships that describe the transformation of anatomical entities during the ontogeny of an organism. Dynamic physical relationships between anatomical entities (e.g., those relating to physiological function and the pathogenesis of abnormalities and disease) shall be modeled in separate ontologies.

9. Coherence principle. The abstraction shall have one root, Anatomical entity, which subsumes all entities relating to the structural organization of the body; concepts referring to these entities shall be arranged in a single and comprehensive inheritance class subsumption hierarchy.

10. Representation principle. The abstraction shall be modeled as an ontology of anatomical concepts and should accommodate all naming conventions associated with these concepts.

Because of the diverse and implied meanings associated with the term 'ontology' (some of which are reviewed by Burgun and Bodenreider [11]), we prefer to refer to the abstraction of the FMA as a symbolic model, rather than an ontology. We define a symbolic model as a conceptualization of a domain of discourse represented with non-graphical symbols in a computable form that supports inference. We designate such a symbolic model as a foundational model, when it declares the principles for including concepts and relationships that are implicitly assumed when knowledge of the domain is applied in diverse contexts, and explicitly defines the concepts and relationships necessary for consistently modeling the structure of the coherent knowledge domain. In order to justify its designation as foundational, such a model should serve as a reference in terms of which other views (contexts) of the domain can be correlated. Moreover, the concepts represented in a foundational model should be indispensable for the symbolic modeling of, and discourse in, a number of other domains. The Foundational Model of Anatomy is a foundational model of the physical organization of the human body, i.e., anatomy (structure), and its coherent knowledge domain is anatomy (science). Other domains for which anatomy is indispensable include physiology, pathology, clinical medicine, and molecular and developmental biology.

These principles provide the rationale for proposing a high level scheme for the FMA.

3.1.2. High level scheme

A high level scheme encapsulates the concept domain and scope of a symbolic model and defines its main components; in effect it serves as a hypothesis that is tested by the instantiation of the model and may be modified during this process. We have previously proposed such a high level scheme for the Foundational Model of Anatomy [9]:

$$\mathrm{FMA} = (\mathrm{AT}, \mathrm{ASA}, \mathrm{ATA}, \mathrm{Mk}), \qquad (1)$$

where AT is the Anatomy Taxonomy, which specifies the taxonomic relationships of anatomical entities and assigns them to classes (defined in next section) according to defining attributes which they share with one another and by which they can be distinguished from one another [footnote 1]; the ASA, or Anatomical Structural Abstraction, describes the partitive (meronymic) and spatial relationships of the concepts represented in the taxonomy; the ATA, or Anatomical Transformation Abstraction, describes the time-dependent morphological transformations of the concepts represented in the taxonomy during the human life cycle, which includes prenatal development, post-natal growth and aging; and Mk refers to Metaknowledge, which comprises the principles and sets of rules, according to which the relationships are represented in the model's other three component abstractions.

Footnote 1: In previous publications this was called the Ao (Anatomy ontology); we renamed it as AT in order to distinguish it from the entire FMA, which is more appropriately regarded as an ontology.

This abstraction captures the information that is necessary for describing the anatomy of not only the whole body, but also that of any structure (physical object) or space that constitutes the body. Indeed, in practical terms, the foundational model of the whole body must be generated stepwise through aggregating the symbolic models of discrete classes of physical anatomical entities. The foundational model for the anatomy of the entire body

($\mathrm{FMA}_{\mathrm{BODY}}$) may, therefore, be conceived of as the aggregate of the foundational models of physical anatomical entities ($\{\mathrm{FMA}_{\mathrm{PHYSICAL\ ANATOMICAL\ ENTITY}}\}$) that constitute the body. Thus,

$$\mathrm{FMA}_{\mathrm{BODY}} = \{\mathrm{FMA}_{\mathrm{PHYSICAL\ ANATOMICAL\ ENTITY}}\}. \qquad (2)$$

The FMA's high level scheme identifies the anatomy taxonomy as one of the component abstractions of the symbolic model or ontology, a distinction that is rarely made clear in discussions of ontologies. The AT forms the backbone of the FMA, and Aristotelian definitions, a third element of principled modeling, play a key role in its establishment.

3.1.3. Aristotelian definitions

In dictionaries the unit of information is a term, and the purpose of the definitions is to define all meanings associated with a given term. For example the term 'organ' may refer, among other things, to a musical instrument, or a part of the human body. In an ontology or foundational model, as we define it above, the unit of information is a concept and the purpose of definitions is to align all concepts in the ontology's domain in a coherent inheritance type hierarchy or taxonomy. This objective imposes a set of requirements that are not satisfied by the majority of dictionary definitions. We have found that, unlike a number of controlled medical terminologies, we could not adopt dictionary definitions for establishing the Anatomy Taxonomy. Therefore, guided by the foundational principles we declared, and relying on precedent set by Aristotle [15], we formulated ten desiderata that definitions must satisfy in order to support the creation of an inheritance type hierarchy, such as the AT [16].

In brief, these desiderata specify that definitions should be consistent with the declared context and principles of an ontology. Rather than stating the meaning of terms, definitions should state the essence of anatomical entities in terms of their characteristics, consistent with the ontology's context. Paraphrasing Aristotle, the essence of an entity is constituted by two sets of defining attributes; one set, the genus, necessary to assign an entity to a class and the other set, the differentiae, necessary to distinguish the entity from other entities also assigned to the class. A collection of entities that share the same set of essential characteristics constitutes a class of the ontology. The defining attribute/s shared by all entities within the selected domain should specify the root of the ontology. To assure transitive inheritance of essential characteristics, classes that may not have been explicitly identified in existing sources of domain knowledge should be defined.

Provided these desiderata are satisfied, the hierarchical sequence of classes in the taxonomy will be dictated by the properties shared by collections of entities. The soundness of this hierarchy will then depend on the explicit specification of the properties (attributes) that define the essence of entities, providing the basis on which they may be grouped together or distinguished from one another. Unlike dictionary definitions, which bear no relationship to their neighbors in the alphabetized list of terms, the definition of a concept in a taxonomy is enriched by the definition of all of its parents within the hierarchy. Thus, a definition of a concept within an ontology is incomplete without that of all of its parents.

Therefore, in creating the Anatomy Taxonomy, two challenges need to be met: a conceptual one, which is to identify the structural attributes in terms of which entities that constitute the human body may be grouped together and distinguished from one another, and a practical one, which is to identify an authoring program that not only supports but also enforces the implementation of foundational principles and definitional desiderata that are to guide the creation of the FMA. We first describe the knowledge modeling environment we selected, which is the fourth element of disciplined modeling.

3.1.4. Knowledge modeling environment

We have analyzed the challenges posed by the seemingly simple task of formally representing declarative anatomical knowledge and found them to be surprisingly complex [17]. We selected the Protégé-2000 ontology editing and knowledge acquisition environment [18] for encoding the FMA, because its frame-based architecture, which is compatible with the Open Knowledge Base Connectivity (OKBC) protocol [19], provides for an expressive, scalable and tractable representation of anatomical entities and the complex relationships that exist between them. We briefly describe and illustrate with examples how (1) frames are used in Protégé-2000 to represent anatomical concepts; (2) frames allow for distinguishing between classes and instances; (3) Protégé-2000 provides for selective inheritance of attributes; and (4) Protégé enhances the specificity and expressivity of attributes through assigning to them their own attributes.

3.1.4.1. Frames, slots, slot values, and facets. Anatomical concepts are represented as frames in Protégé-2000. A frame is a data structure that contains all the information in the ontology about a given concept. This information includes the properties of the entity to which that concept refers and also the relationships of that entity to other entities. In the context of the FMA, a frame is a named anatomical entity, such as vertebra. With each frame is associated a defined set of attributes; each of these attributes has a value. Thus each frame consists of a concept and a set of attribute/value pairings. Fig. 1 shows the frame Vertebra; the concept highlighted in the left hand pane (the AT) and some of

Fig. 1. The frame of the concept Vertebra.
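As a rough illustration of the frame, slot, and facet vocabulary of Section 3.1.4.1, the following Python sketch models a frame as a named concept with attribute/value pairings and a facet as a constraint on the values a slot may take. It is only a sketch under stated assumptions: `Frame`, `Facet`, `fill`, and `PART_OF_FACET` are hypothetical names, not Protégé-2000 or OKBC API calls, and filing each frame under a single class name (`kind`) is a simplification of the AT hierarchy. The allowed classes for -part of- are those the text lists for the frame of Organ; treating Vertebra as filed under Organ is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Facet:
    """Constraint on a slot: permitted value classes and cardinality."""
    allowed_classes: frozenset
    multiple: bool = True  # True if the slot may hold several values


@dataclass
class Frame:
    """A named concept plus its attribute/value pairings (slots)."""
    name: str
    kind: str  # AT class the concept is filed under (simplified to a string)
    slots: dict = field(default_factory=dict)

    def fill(self, slot, value, facet):
        """Add a value to a slot, rejecting values the facet does not allow."""
        if value.kind not in facet.allowed_classes:
            raise ValueError(f"{value.name!r} is not a permitted value for -{slot}-")
        values = self.slots.setdefault(slot, [])
        if values and not facet.multiple:
            raise ValueError(f"-{slot}- accepts a single value")
        values.append(value)


# Facet of the -part of- slot as described for the frame of Organ.
PART_OF_FACET = Facet(frozenset({
    "Organ system", "Organ system subdivision",
    "Body part", "Body part subdivision",
}))

vertebra = Frame("Vertebra", kind="Organ")  # assumption: filed under Organ
vertebral_column = Frame("Vertebral column", kind="Organ system subdivision")
vertebra.fill("part of", vertebral_column, PART_OF_FACET)
```

A slot that has never been filled is simply absent from `slots`, which mirrors the statement that slots remain empty unless filled with one or more values.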

Fig. 2. A variety of terms associated with the concept Uterine tube.
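The selective inheritance described in Section 3.1.4.3 (template slots are propagated to subclasses, own slot values are not) can be sketched in a few lines of Python. This is an illustrative model only: `ATClass`, `template_slots`, and `assign` are hypothetical names, and the sketch folds each metaclass into its class for brevity, whereas the FMA keeps a separate metaclass hierarchy in Protégé-2000.

```python
class ATClass:
    """An AT frame in its dual role.

    As a class it propagates template slots to its subclasses; as an
    instance it holds own slot values, which are never propagated.
    """

    def __init__(self, name, parent=None, template_slots=()):
        self.name = name
        self.parent = parent
        # Slots introduced at this level (in the FMA: on the metaclass).
        self._introduced = set(template_slots)
        self.own_values = {}

    def template_slots(self):
        """Slots this class must carry: introduced here or by any ancestor."""
        inherited = self.parent.template_slots() if self.parent else set()
        return inherited | self._introduced

    def assign(self, slot, value):
        """Set an own slot value; the slot must exist as a template slot."""
        if slot not in self.template_slots():
            raise KeyError(f"-{slot}- is not defined for {self.name}")
        self.own_values[slot] = value


vertebra = ATClass("Vertebra", template_slots={"part of"})
vertebra.assign("part of", "Vertebral column")

cervical = ATClass("Cervical vertebra", parent=vertebra)
# The slot is inherited, the value is not: the subclass assigns its own.
cervical.assign("part of", "Cervical vertebral column")
```

Because `own_values` is per frame, Cervical vertebra never sees the value Vertebral column; chaining Cervical vertebral column -part of- Vertebral column is left to a query layer, as the text assigns that role to query interfaces.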

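The reification of slot values described in Section 3.1.4.4 and pictured in Fig. 3 can be approximated as follows: instead of storing a bare structure name in the -adjacency- or -continuous with- slot, each value is a small record that carries the related part together with its qualifying attributes. The sketch is hypothetical throughout: `QualifiedRelation`, `ESOPHAGUS`, and `related_parts` are illustrative names, a single record type stands in for both adjacency and continuity values, and the coordinate strings are simplified spellings of the qualifiers given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedRelation:
    """Reified slot value: a related structure plus its qualifiers."""
    related_part: str       # the -related part- slot
    coordinate: str         # the -coordinate- slot, e.g. "anterior"
    laterality: tuple = ()  # the -laterality- slot, e.g. ("right", "left")


# Relationships of the esophagus as stated in the text.
ESOPHAGUS = {
    "adjacency": [
        QualifiedRelation("Vertebral column", "posterior"),
        QualifiedRelation("Fibrous pericardium", "anterior", ("right", "left")),
    ],
    "continuous with": [
        QualifiedRelation("Pharynx", "superior"),
        QualifiedRelation("Stomach", "inferior"),
    ],
}


def related_parts(frame, slot, coordinate):
    """Names of structures related through `slot` at the given coordinate."""
    return [v.related_part for v in frame.get(slot, []) if v.coordinate == coordinate]
```

For instance, `related_parts(ESOPHAGUS, "adjacency", "anterior")` selects only the fibrous pericardium, a distinction that a plain list of adjacent structures could not express.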
Fig. 3. Attributed adjacency and continuity relationships of the Esophagus.

the attribute/value pairings in the right hand pane of the Protégé graphical user interface (GUI).

Attributes (properties) and relationships of the entity associated with the concept are expressed as slots of the frame. Slots correspond to such non-structural attributes as preferred name, synonyms, and numerical identifiers (UWDA-ID), as well as such structural attributes or relationships as -has part-, -part of-, -has dimension-, -bounded by-, etc. Slots remain empty unless filled with one or more values. In Fig. 1 the synonyms slot is empty because Vertebra has no synonyms, whereas the same slot in the frame of Uterine tube in Fig. 2 is filled with two values.

Protégé-2000 allows different binary relationships for slots. Some slots, like -has dimension- and -has inherent 3D shape-, have a binary relationship with atomic values like Boolean "true" or "false"; for slots that describe binary relationships between frames, the values are derived from established classes of the AT or the FMA's other associated taxonomies. For example, the Dimensional Ontology provides the values for the slot -has shape- (e.g., cylinder, polyhedron, which are subclasses of 3-D volume), whereas the values for the part and adjacency slots in the frame are derived from the AT.

In Protégé-2000, facets impose constraints on the values that a slot can have. For example, the facets of the -part of- slot in the frame of Organ specify that there can be multiple values for the slot and that the values can be derived only from AT classes Organ system, Organ system subdivision, Body part and Body part subdivision. Thus the value Vertebral column in the -part of- slot of Vertebra is allowed, because Vertebral column is a subdivision of the skeletal system. Another example is the restriction for the -nerve supply- slot; values for this slot may only be derived from AT classes Cranial nerve, Spinal nerve, and Peripheral nerve.

3.1.4.2. Classes and instances. In Protégé-2000 a frame may represent a class or an instance. As far as most users of the Foundational Model will be concerned, however (and as explained below), all the nodes of the Anatomy Taxonomy hierarchy may be regarded as classes.

A class in the AT is a collection of anatomical entities or collections of collections. For example, the class Vertebra represents such a collection of collections. It subsumes different collections of vertebrae like cervical, thoracic, and lumbar vertebrae (Fig. 1). Moreover, the members of each of these collections, which in Protégé are represented as subclasses of Vertebra, are likewise further grouped into more specialized collections. This is true even of the leaves of the Vertebra tree, which have no subclasses in the AT. The Fifth lumbar vertebra, for example, is a class to which the fifth lumbar vertebrae of individuals like a John or a Jane Doe belong. Therefore, unlike the higher classes, Fifth lumbar vertebra, as currently implemented, does not subsume collections of collections; rather it subsumes concrete anatomical entities, which, however, are not represented in the AT. Should a need arise, this representation allows us to elaborate the AT by introducing subclasses of Fifth lumbar vertebra specified by gender or race, for example, without having to redefine this class and its ancestors.

Since concrete, real-world objects, such as the vertebrae of a John or a Jane Doe, represent anatomical data, in concurrence with the 'abstraction level principle,' they are excluded from the FMA; they belong in the field of instantiated anatomy. By contrast, concepts in the class hierarchy of the Anatomy Taxonomy refer to collections and collections of collections; they belong in the field of canonical anatomy.

Although the above explanation suggests that all concepts of anatomical structures in the AT are classes, in fact, we had to assign the role of instance as well to the frames of these concepts. In the frame-based system of Protégé, this was the technical solution for enabling the selective inheritance of attributes, discussed in the next section. This solution required the establishment of a metaclass hierarchy and assigning the frames of AT classes as instances of the corresponding metaclasses (see below). Thus, except for its root, every concept in the AT is a subclass of a superclass and also an instance of a metaclass. These dual assignments integrate the AT and the metaclass hierarchy. Class-to-class relationships in the integrated AT and metaclass hierarchies are encoded in Protégé as -direct superclass- and -direct subclass- links, whereas the inverse relationship between a class and its instances in the metaclass hierarchy is -direct type- and -direct instance-. We distinguish the integrated Anatomy Taxonomy and metaclass hierarchy from other hierarchies (e.g., part-of, branch-of) by calling it the -is a- hierarchy. This technical contrivance is of interest to the authors of the FMA and to other knowledge modelers; it can, however, remain opaque to other users of the ontology.

3.1.4.3. Selective inheritance of attributes. The purpose of the Anatomy Taxonomy is to assure the propagation or inheritance of attributes. It is necessary, however, to distinguish between the attributes that should and should not be propagated. As intimated above, the desired selective inheritance is achieved operationally, in a seemingly contradictory way, by assigning a dual role to each frame: in Protégé each AT frame is modeled both as a class and as an instance. Its role as a class allows it to propagate its set of attributes to its subclasses, but in its role as an instance it is prevented from doing so.

The insertion of new slots at appropriate levels of the ontology provides for introducing definitional and other attributes that should be inherited by descendants of a class. Such a class has been designated as a property introduction class [20], whereas in Protégé-2000 new attributes (slots) are introduced in metaclasses. Metaclasses function as templates, and serve to define new classes. Newly created classes in the AT are assigned as instances of corresponding metaclasses. Thus an AT class is a subclass of its ancestor classes in the AT and its frame is an instance of its metaclass. For example, the AT class Vertebra is a subclass of Irregular bone and an instance of Vertebra metaclass.

This arrangement allows for discriminating between slots that should and should not be propagated. The definitional attributes are propagated to descendants of the class as template slots; they specify which slots each member of the class shall have and what the restrictions (facets) on the values of these slots shall be. Instances of the class, on the other hand, inherit such template slots as own slots and assign specific values to them (own slot values). Own slots are not propagated. For example, Vertebra metaclass has a template slot -part of-, which its instance Vertebra inherits as its own slot, and assigns the slot value Vertebral column. Cervical vertebra is a subclass of Vertebra and inherits the template slot -part of- but not the slot value Vertebral column. Instead it converts the template slot into its own slot, and assigns its own slot value Cervical vertebral column. Template slots dictate what attributes or slots a class must impose on its descendants. The example illustrates the principle of modeling at the most refined level of granularity. Although Cervical vertebra is part of Vertebral column, the most specific relationship holds for Cervical vertebral column, which is also a subdivision of the skeletal system and is in turn a part of the Vertebral column. It is the role of intelligent query interfaces, described in Section 4, to concatenate such relationships and allow the result Cervical vertebra -part of- Vertebral column.

3.1.4.4. Attributed relationships. The FMA is particularly rich in relationships, which, in addition to defining attributes, describe the part-whole, location, and other spatial associations of anatomical entities. However, for the precise and comprehensive description of the structure of the body, it is not sufficient to state, for example, that the esophagus is continuous with the pharynx and stomach, or that it is adjacent to the vertebral column. It is necessary to specify that the esophagus is continuous with the pharynx superiorly and with the stomach inferiorly; and its adjacency relationship with the vertebral column is posterior, whereas with the fibrous pericardium, it is anterior, on both the right and the left. Thus the continuity and adjacency attributes need to be associated with additional attributes in order to express additional elements of knowledge involved in the relationships. Such attributed relationships are the rule rather than the exception in anatomy. Their representation in any knowledge-modeling environment is a challenge. The solution we developed in the frame-based environment of Protégé-2000 may seem complex, but it captures the necessary knowledge [17].

The solution is to attach to a slot (e.g., -continuous with-, -adjacency-) a value that includes not only the simple adjacency relationship between referenced structures but also the additional attributes of that relationship (e.g., superiorly, inferiorly, or anterior, posterior, left and right). Attribution of the slot value is called reification. This can be achieved by assigning the slot value as an instance frame of a class which specifies or describes the additional attributes for the relationship.

For example, in the case of the slot -adjacency-, the slot value is an instance of a class Anatomical adjacency coordinate. This class carries the template slots that describe the adjacent structure (-related part-) and its relative position or coordinate (-coordinate- and -laterality-) that qualify its adjacency to the reference anatomical structure. As shown in the frame of Esophagus (Fig. 3), one value of its -adjacency- slot is an instance that shows the related part Fibrous pericardium as being anterior and to the right and left (coordinate and laterality, respectively) of the esophagus, which is the reference anatomical structure.

This rather complex reification process allows us not only to represent structural relations comprehensively but also to qualify relations with additional attributes in order to describe the structure of the body with accuracy at the highest level of granularity. The process also illustrates that the challenges of modeling anatomical knowledge push the envelope of available methods [17] and require the collaboration of anatomists and knowledge engineers.

3.2. Anatomy taxonomy

Anatomical discourse in educational, research and clinical contexts proceeds at the level of discrete anatomical structures and spaces, which correspond to leaf concepts of a taxonomy. Although attempts to standardize anatomical terminology are more than a century old, time-honored sources of the domain contain only implied and contradictory schemes for classifying anatomical entities, which are not supported by explicit definitions. The officially sanctioned term list, Terminologia Anatomica [21] (and its predecessor Nomina Anatomica), compiled by an international group of anatomists, has a number of shortcomings for supporting the establishment of an inheritance hierarchy [22]. Chief among these shortcomings is the lack of abstract classes that could subsume more and more specific collections of anatomical entities on the basis of their shared essential properties. As a consequence, controlled medical terminologies and emerging ontologies in bioinformatics have no choice but to establish their own scheme for aligning anatomical concepts in a computable representation. Since these sources target the needs of diverse user groups, they represent anatomy in heterogeneous contexts; therefore their anatomy content is hard to generalize to domains beyond their own.

In this section we present the rationale for the class structure of the AT in the context of foundational principles, starting with the selection of its root. Next we illustrate the inheritance of definitional and other attributes through the class subsumption hierarchy and comment on the derivation of terms.

3.2.1. Root of the AT

Since our intent is to represent knowledge about anatomical structure, the Anatomy Taxonomy must accommodate not only the physical entities (substances, objects, spaces, surfaces, lines, and points) that constitute the body, but also the descriptors of these entities that we want to model. Terms, coordinates, relationships, developmental stages and other non-physical concepts that form an indispensable part of anatomical discourse must also be included in the AT. A more restricted concept than 'entity' will not subsume these concepts. Therefore, we declared Anatomical entity as the root of the AT and, in order to satisfy requirements for its Aristotelian definition, we considered the essential properties of this concept. Anatomical entities can be conceptualized only in relation to biological organisms, and they are unique among biological concepts in that they pertain to the structural organization of these organisms. Therefore, the genus of 'anatomical entity' is the primitive 'biological entity,' because it manifests the essence of all biological entities (namely that they pertain only to biological organisms), and the differentia is the restriction to structure. The definition may therefore be written as:

Anatomical entity
  is a biological entity,
  which constitutes the structural organization of a biological organism,
  or is an attribute of that organization.

We use this first definition of the FMA to illustrate the process of formulating such definitions. The conceptualization and insertion of such a new class in the AT is paralleled by establishing the template slots in its metaclass that will be inherited by all of its descendants. Every concept to be entered in the FMA will have a preferred name and a specific, randomly assigned numerical identifier. Therefore slots for these attributes are inserted in the Anatomical entity metaclass. This template will also have other slots. For example, all anatomical entities, including anatomical terms, have parts. Therefore the -has part- slot, and its inverse, -part of-, are introduced at the root of the AT.

3.2.2. The inheritance class subsumption hierarchy

3.2.2.1. High level classes. The rationale for selecting the root of the AT makes reference to two major types of anatomical entities in terms of whether or not they are physical in nature. Therefore we designated the immediate descendants of Anatomical entity as the classes Physical anatomical entity and Non-physical anatomical entity (Fig. 4). The genus for both is Anatomical entity, and in structural terms the differentia that distinguishes these two classes is the structural attribute of spatial dimension: all physical entities have spatial dimension, because they are volumes, surfaces, lines or points, whereas non-physical entities have no spatial dimension. Therefore the attribute and its corresponding slot 'spatial dimension' are introduced at this level; the value of the slot in the frame of Physical anatomical entity will be 'true.' Not only the slot, but also its value will be inherited by all descendants of this class.

Physical anatomical entities may be further specified on the basis of whether or not they have mass, which serves as the differentia of the classes Material physical anatomical entity and Non-material physical anatomical entity. Subclasses of the latter are Anatomical space, Anatomical surface, Anatomical line, and Anatomical point, none of which have mass [23]. These classes are distinguished from one another by the number of spatial dimensions they have.

Even without presenting the definitions of these classes and listing their defining differential attributes, the logic and rationale for establishing these high level abstract classes should become apparent. Although anatomical texts and medical terminologies with an anatomical content deal only superficially, if at all, with anatomical surfaces, lines, and points, it is nevertheless necessary to represent these entities explicitly and comprehensively in the FMA in order to describe boundary and adjacency relationships of material physical anatomical entities and spaces.

Fig. 4. Schematic representation of the principal classes of the Anatomy Taxonomy.

The class of Material physical anatomical entity may be subdivided into two major types on the basis of the differentia of inherent 3D shape. We designate the collection that lacks this attribute as Body substance; its descendants include Secretion, Excretion, Blood, etc., all of which have mass and accommodate to the shape of their container. The members of the collection that have their own inherent 3D shape constitute the class Anatomical structure.

3.2.2.2. Dominant concept. The dominant class principle declares Anatomical structure as the dominant class in the FMA; therefore its definition is of particular importance.

Anatomical structure
  is a material physical anatomical entity
  which has inherent 3D shape;
  is generated by coordinated expression of the organism's own structural genes;
  consists of parts that are anatomical structures;
  spatially related to one another in patterns determined by coordinated gene expression.

The definition illustrates that inherent 3D shape is a necessary, but not a sufficient, differentia for defining the class Anatomical structure. We have to exclude from this class, for example, manufactured objects used as prostheses and biological organisms such as parasites and bacteria that are introduced into an individual, as well as space-occupying lesions such as neoplasms and granulomas. The differentiae in the class definition that exclude such foreign and abnormal structures are specified by constraining the class to biological objects generated by the coordinated expression of groups of the organism's own structural genes and thereby distinguishing these structures from those that result from perturbed or abnormal biological processes. Moreover, by introducing the differentia of the genetically determined arrangement of the parts of an anatomical structure, the definition also excludes from the class such cell aggregates as a rouleau or a sediment of blood cells.

The dominant role of Anatomical structure is reflected by the fact that non-material physical anatomical entities (e.g., spaces, surfaces) and body substances (e.g., blood, cytosol) are conceptualized in the FMA, and also in anatomical discourse in general, in terms of their relationship to anatomical structures. For example, Thoracic cavity (an Anatomical space) can only be conceptualized in terms of the Anatomical structure (the Thorax) of which it is a part; Surface of heart cannot exist without Heart, the Anatomical structure, which the surface bounds; Cytoplasm, a Cell substance, can be conceptualized only in reference to Cell, an Anatomical structure.

The definition of Anatomical structure implements the 'content constraint principle' of the FMA, in that it implies that the largest anatomical structure is the organism itself, and the smallest are biological macromolecules assembled from smaller non-biological molecules through the mediation of the organism's genes. In this sense, the definition also distinguishes, in a broader context, animate and inanimate objects.

3.2.2.3. Units of structural organization. The organizational unit principle designates Cell and Organ as organizational units of the FMA; these are two of the subclasses of Anatomical structure. All but two of the other subclasses of Anatomical structure are conceptually derived from cell or organ, in that they are either parts of cells and organs or are constituted by cells and organs. We discuss these derivative classes in the next section. The exceptions are Acellular anatomical structure (e.g., elastic and collagen fiber and otolith) and Biological macromolecule. Such molecules exist in association with cell parts and also independent of cells in body substances. It may be argued that Biological macromolecule qualifies as an organizational unit within the FMA. Although we include a substantial number of macromolecules in the FMA, our intent is to link to other ontologies when the need arises for representing the molecular composition and associations of cell parts and body substances.

Cell. With respect to Cell, the organizational unit principle is consistent with the cell theory of Schleiden [24] and Schwann [25]. However, notwithstanding some unique exceptions, a cell is a microscopic structure; in practical terms, it is meaningful to consider it as a unit of organization only at the microscopic level. No organizational unit existed at the macroscopic level until we proposed 'organ' to fill this role [8]. It is hard to find satisfactory definitions of cell and organ in dictionaries. Our definitions of these two concepts conform to the definition principle. We first define Cell and discuss its subclasses.

Cell
  is an anatomical structure
  which consists of cytoplasm surrounded by a plasma membrane
  with or without the cell nucleus.

This class subsumes all cell types of the human body and can accommodate those of other metazoan organisms. One may find up to 10 different implied classifications of cells in the literature. However, these classifications are unsupported by explicit definitions. The most consistent scheme was proposed by Lovtrup [26], and is based on such structural properties as the connectivity of cells to one another and the type of appendages they possess. We have adopted these properties as the differentia for the largest collections of cells [27], and found it necessary to further subdivide these classes based on embryonic derivation (Fig. 5). We recognize that this classification introduces transformational rather than structural attributes as differentiae. However, until the necessary gene expression data become available, the representation of cell lineages cannot be accomplished on the basis of structural attributes alone. Cell classification is a topic that merits further discussion in a separate publication.

Organ. Dictionary and textbook definitions of organ are satisfied by such anatomical structures as the hand or knee, as well as by the liver or the thymus. There are also a large number of macroscopic anatomical structures, which are known by their specific name, but have not been designated as any particular higher level type. For example, by what criteria is the skin generally regarded as an organ, but the underlying layer of superficial fascia is never referred to as such, or as any other type of entity? What are nerves and blood vessels? It has, in fact, been suggested that it is not possible to define organ, because the meaning of the term varies so widely. The definition we have proposed for Organ resolves these problems.

Organ
  is an anatomical structure,
  which consists of the maximal set of organ parts
  so connected to one another that together they constitute a self-contained unit of macroscopic anatomy
  morphologically distinct from other such units.

The definition is contingent on the definition of Organ part.

Organ part
  is an anatomical structure,
  which consists of two or more types of tissues,
  spatially related to one another in patterns determined by coordinated gene expression;
  together with other contiguous organ parts it constitutes an organ.

Tissue is another concept with a variety of meanings in general discourse. Its dictionary and textbook definitions are violated by regarding such concepts as blood and gingiva as tissues. Before discussing Organ, we also define tissue.

Tissue
  is an anatomical structure,
  which consists of similarly specialized cells and intercellular matrix,
  aggregated according to genetically determined spatial relationships.

The differentia of genetically determined spatial relationships among the constituent cells excludes from this class blood, lymph, semen, and cerebrospinal fluid, all of which meet the definition of Body substance. Likewise, gingiva and many other entities conventionally referred to as tissue consist of more than one tissue in terms of the FMA definition. The definition implies, furthermore, that in the fully formed organism tissues do not exist independent of organs. In the embryo, however, tissues are definable before bona fide organs are formed.

The definition of Organ part links the microscopic and macroscopic units of structural organization to one another and eliminates any circular element from the definition of Organ. In terms of the definition, the liver qualifies as an organ, because it is constituted by a maximal set of anatomical structures that are composed of tissues, and these structures are connected to one another to form a discrete morphological entity. Although the right lung is composed of the same set of connected organ parts as the left lung, the two sets are not continuous with one another; hence the two lungs are separate organs. The entire skin qualifies as an organ in terms of the definition, and so does the superficial fascia that underlies it. On the other hand, the brain and spinal cord cannot be regarded as two separate organs, since both are made of the same types of organ parts, which are continuous with one another and together constitute a morphological whole. In fact a real boundary between the two cannot be determined. Therefore, the definition mandates that brain and spinal cord be regarded as organ parts and that together they be classified as one organ. We have named and defined it as the Neuraxis [28].

It follows from the definition of Organ that differentiae for distinguishing organ subclasses must be based on the kinds of continuous organ parts of which organs are constituted. Even without presenting definitions, Fig. 6 illustrates the employment of elementary structural attributes, on the basis of which types of organs are grouped together and distinguished from one another. These essential properties (e.g., organ cavity, wall, parenchyma, cortex, medulla, lobe, etc.) are introduced in the corresponding metaclasses and are inherited by the subclasses of the respective organ types. Only at this level of the AT do we reach specific organ types, such as lung, esophagus, heart, etc., which are the concepts commonly encountered in anatomical and clinical discourse. Such are also the concepts that are subsumed by derivative subclasses of Anatomical structure.

3.2.2.4. Derivative classes. We regard Organ part and Cell part, referred to in the previous section, as derivative subclasses of Anatomical structure because they are conceived of in relation to Organ and Cell, the organizational units of the FMA. Although each of the remaining derivative subclasses is explicitly defined, we will not present these definitions here; rather we comment on them and illustrate the kinds of structures each subsumes.

Body part and organ system. Perhaps most important are the classes Body part² and Organ system. Both are constituted by organs. In a body part, such as the Trunk or Upper limb, organs of different classes are related to one another through genetically predetermined patterns. The same holds true for Body part subdivisions (e.g., Thorax, Hand). Organ systems (and their subdivisions) are constituted of organs predominantly of the same type, which are interconnected by zones of continuity. For example, Musculoskeletal system is comprised of the classes Muscle (organ), Bone (organ), Joint, and Ligament (organ), which together form an interconnected anatomical structure. Subdivisions of this system, the Skeletal system and Articular system, for example, consist of sets of bones and joints, respectively; the joints interconnecting the bones and vice versa. So-called systems of the body are, as a rule, conceived of in functional rather than structural terms; therefore many of them do not qualify as anatomical structures (e.g., immune system, endocrine system) and are excluded from the Organ system class. However, because these concepts are so widely used in anatomical and clinical discourse, we represent them in the FMA as the class Functional system, which is a child of Non-physical anatomical entity.

² 'Body part' and 'Body region' are regarded as synonyms by most sources, including Terminologia Anatomica; the FMA adopts this convention.

Anatomical cluster, set, and junction. There are a number of other anatomical concepts in current use that are a composite of organs, organ parts, tissues or cells that are hard to classify, yet we wanted to accommodate them in the FMA. For this purpose we created and defined the classes for Anatomical cluster, Anatomical set, and Anatomical junction.

For example, the root of the lung and the renal pedicle meet the definition of Anatomical structure, but do not fit any of its subclasses we described so far. Both consist of a heterogeneous set of organ parts grouped together in a predetermined manner, but do not constitute the whole or a subdivision of either a body part or an organ system. We classify such structures as Anatomical cluster. Such clusters can be composed of cells (e.g., splenic cord, consisting of erythrocytes, reticular cells, lymphocytes, monocytes, and plasma cells), organ parts (e.g., tendinous or rotator cuff, consisting of the fused tendons of several muscles), as well as of organs (e.g., lacrimal apparatus consists of the lacrimal gland, lacrimal sac, and nasolacrimal duct, each of which qualifies as an organ).

Also problematic are such widely used concepts as viscera, or cranial nerves, which represent a collection of anatomical structures that are members of one class. We assign such collections to the class Anatomical set. The FMA does not allow plural concepts and therefore the singular concept Set of cranial nerves is entered as a subclass of Anatomical set. At the cellular level such a set is Myone, for example, which is a set of skeletal muscle cells (muscle fibers) innervated by a single alpha motor neuron. Anatomical sets have members, rather than parts (e.g., Oculomotor nerve is a member of Set of cranial nerves).

Members of an anatomical set, as defined in the FMA, are distinct from elements of a mathematical set in at least three respects: (1) indirect connections exist between the members, since all anatomical structures of an organism are interconnected directly or indirectly (except for those that are surrounded by body substances; e.g., blood cells afloat in plasma); (2) as a rule, the members are ordered in accord with genetically determined patterns (e.g., the set of cranial nerves associated with the brain and the set of ribs associated with the vertebral column are ordered and their members are not interchangeable; whereas as far as we know, no such ordered pattern exists for the disposition of members of a myone within a muscle fasciculus); and (3) the members do not define an anatomical set (which is a class), whereas a mathematical set is defined by its members.

Finally, we introduced the class Anatomical junction to subsume such anatomical structures as a suture, the commissure of the mitral valve, gastroesophageal junction, anastomosis, and nerve plexus, as well as synapse or desmosome. These heterogeneous structures are arranged in appropriate subclasses of Anatomical junction. We define this class as an anatomical structure in which two or more anatomical structures establish physical continuity with one another or intermingle their component parts.

Anticipating future enhancements of the FMA, we have also introduced three additional classes: Vestigial anatomical structure (e.g., epoophoron, gubernaculum testis) and Gestational structure, which includes subclasses for gestational membranes as well as embryonic and fetal structures. The third class, Variant anatomical structure, is as yet sparsely populated. Once we focus on anatomical variants, members of this class will be reassigned as variant subclasses of the canonical anatomical structures.

3.2.3. Derivation of terms

Our intent with the FMA is to make anatomical information available in computable form that generalizes to all application domains of anatomy. Therefore, rather than attempting to standardize terminology, we are committed to include in the FMA all terms that currently designate anatomical entities in order to facilitate navigation of the FMA by any user. We relied on time-honored English language scholarly textbooks of
/ Journal of Biomedical Informatics 36 (2003) 478–500 anatomy [29–31] as our primary sources for anatomical anatomical terms of NeuroNames [33], a structured terms, enhanced by copious reference to original journal vocabulary of the brain. articles from the anatomy and clinical literature. We In the FMA each concept has a randomly assigned have developed a tool for semi-automatically integrating unique numerical identifier (UWDAID; University of existing anatomical term lists into the FMA [32]. Such Washington Digital Anatomist Identifier) and is asso- integration has been accomplished for approximately ciated with one or more terms. One of these terms is 10,000 terms of Terminologia Anatomica [21], the offi- designated as the preferred name of the concept; other cially sanctioned anatomical term list, and 6500 neuro- terms are synonyms or non-English equivalents (Fig. 2).
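The pairing of a concept identifier with a preferred name, synonyms, and non-English equivalents can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class names `Term`, `Concept`, and `ConceptRegistry`, the size of the identifier space, and the use of "Fallopian tube" as a synonym are hypothetical choices made here, not the FMA's actual frame structure or content.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A name attached to a concept (hypothetical structure, not the FMA schema)."""
    text: str
    language: str = "English"


@dataclass
class Concept:
    """One concept: an identifier, a preferred name, and any other terms."""
    uwdaid: int                      # randomly assigned, unique numerical identifier
    preferred_name: Term
    other_terms: list = field(default_factory=list)

    def synonyms(self):
        # Other terms in the same language as the preferred name.
        lang = self.preferred_name.language
        return [t.text for t in self.other_terms if t.language == lang]

    def non_english_equivalents(self):
        # Other terms in a different language from the preferred name.
        lang = self.preferred_name.language
        return [t.text for t in self.other_terms if t.language != lang]


class ConceptRegistry:
    """Hands out random identifiers and guarantees that none is reused."""

    def __init__(self, seed=None, id_space=10**6):
        # id_space is an assumption; the paper does not state the identifier range.
        self._rng = random.Random(seed)
        self._id_space = id_space
        self.concepts = {}

    def add(self, preferred, *others):
        uid = self._rng.randrange(1, self._id_space)
        while uid in self.concepts:          # redraw on the rare collision
            uid = self._rng.randrange(1, self._id_space)
        concept = Concept(uid, Term(preferred), list(others))
        self.concepts[uid] = concept
        return concept


registry = ConceptRegistry(seed=0)
tube = registry.add(
    "Uterine tube",
    Term("Fallopian tube"),            # illustrative synonym
    Term("Tuba uterina", "Latin"),     # non-English equivalent
)
```

Keeping the identifier separate from every term means a preferred name can be changed later without disturbing links that refer to the concept.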

Fig. 5. Major classes of Cell. Fig. 6. Subclasses of Organ.

Fig. 7. Documentation associated with Tuba uterina, a non-English equivalent of the preferred name Uterine tube.

Each term is created as an instance of the class Concept name. Instances of Concept name have associated with them various meta-data that describe the attributes of the term, illustrated in Fig. 7.

A consistent naming convention is used throughout. Unlike in many other terminologies (including Terminologia Anatomica), all terms are in the singular form, and conjunctions and homonyms are not allowed. Anatomical entities commonly referred to as groups or collections (e.g., intercostal arteries, spinal nerves) are represented as anatomical sets and designated, for example, as Set of intercostal arteries and Set of spinal nerves, since such concepts conform to the definition of the class Anatomical set. Because each term must be unique, commonly used homonyms such as 'muscle' and 'bone' are rendered specific by extensions to discriminate between their different meanings; e.g., Muscle (tissue), a class that subsumes Smooth muscle and Striated muscle, and Muscle (organ), which subsumes such organs as Biceps brachii and Gluteus maximus.

Although the compendium of available anatomical terms is large, for the comprehensive and logical modeling of anatomical structure we had to include in the FMA concepts that have not been named previously. These concepts include not only the high level classes of the AT, but also macroscopic parts of the body that have not previously been named [34]. For example, to satisfy the FMA's requirement that all parts of a whole be explicitly named, we assigned the term Upper uterine segment to a previously unnamed part of Body of the uterus to complement the other part, which is generally known as the Lower uterine segment.

Formulas govern the ordering of descriptors in the complex name of an anatomical entity. For example, the order of adjectives in the term 'Left fifth intercostal space' is based on the rationale that the noun in the term is 'space'; its primary descriptor is 'intercostal,' further specified by a sequence of numbers, a specificity enhanced by the laterality descriptor. In the term this order is reversed. Based on a similar rationale, the term 'right upper lobe' is not the preferred name of the concept, although the FMA includes it as a synonym of 'Upper lobe of right lung,' because of its common usage in radiology reports.

3.3. Anatomical Structural Abstraction

Defined in Section 3.1.2, the ASA is an aggregate of the structural relationships that exist between the entities represented in the AT. A full account of the ASA will be the subject of a separate report. Our purpose here is to summarily illustrate the richness and specificity of structural relationships in the FMA. Fig. 8 shows a part of the taxonomy of these relationships as subclasses of Non-anatomical anatomical entity. Fig. 3 illustrates the implementation of some of these relationships in the frame of the esophagus. Reference is made in earlier sections to the fact that the majority of these relationships are attributed, which further enhances the expressivity and specificity of the FMA for describing the structure, not only the constituents, of the human body. Particular attention is paid to attributed partonomic relationships in one of our recent publications [35].

Fig. 8. Part of the taxonomy of structural relationships.

We have conceived of the ASA as sets of interacting networks [36], which are schematically represented in Fig. 9. The high level scheme for the ASA derives from the FMA's overall conceptual scheme. The example we describe below illustrates the nature and interactions between just two of the ASA's interacting networks. These networks make reference to some of the rules that constrain the concepts that can be linked to one another by these relationships to certain classes of the Dimensional taxonomy (DT). The Do is a small ontology in the FMA, which represents dimensional entities of zero to three dimensions and shape classes of 3D entities. It also distinguishes between real and virtual surfaces and lines.

The example for illustrating ASA networks concerns the heart. The surface of the heart forms the boundary of the heart in the boundary network (Bn), rather than being a part of the heart, because nodes of the partonomy network (Pn) must be of the same dimension in the DT, whereas a boundary must have one lower dimension than the entity it bounds. Because they share the same dimension, the diaphragmatic surface of the heart is a part of the surface of the heart (Pn) and forms part of the boundary not only of the heart, but also of the right ventricle (Bn), which is a part of the heart. The Bn

Fig. 9. A scheme for Anatomical Structural Abstraction (ASA).

of the heart comes about by representing not only the surfaces that bound the heart's subvolumes, but also the lines that bound these surfaces (which are the cardiac margins), and the points, which in turn bound the margins. The Pn of the heart comes about by representing transitively the subvolumes of the heart in one network, the subsurfaces of the surface of the heart (e.g., Surface of heart -has part- Diaphragmatic surface of heart, Sternocostal surface of heart, Base of heart) and the subdivisions of each subsurface (e.g., Diaphragmatic surface of the heart -has part- Diaphragmatic surface of right ventricle, Diaphragmatic surface of left ventricle) in another network, and those of the margins (lines) of the heart in yet another network. Similar interactions of the Bn and Pn with the other networks, shown in Fig. 9, comprehensively describe the structure and spatial relationships of any anatomical structure or space. A number of authors refer to such a scheme as a mereotopological model or representation, though none have defined it or implemented it to the same level as the FMA. The conception of such a mereotopological model or ASA as a set of interacting networks is a particular feature of the FMA.

The ASA has been instantiated quite extensively in the FMA for boundary and partonomic relationships, as well as for -branch of- and -tributary of- relationships, including their inverses. Other relationships are more sparsely implemented. More comprehensive implementation will be achieved through semi-automated authoring tools that are under current development, which can reuse the knowledge already embedded in the FMA. Also, we anticipate that investigators who have a need for comprehensive representation of the anatomy of particular parts of the body (e.g., the eye or the knee joint) will collaborate with us in populating the knowledge base for the areas of their interest.

3.4. Anatomical Transformation Abstraction

We envisage the initial implementation of the ATA, defined in Section 3.1.2, as a symbolic model of the entities and relationships that link the fertilized egg or zygote to the fully differentiated anatomical structures and spaces that are currently represented in the AT. As we initially did for the ASA, we propose a high level scheme for the prenatal component of the ATA as a hypothesis, which, as in the case of the ASA, will be tested and modified as the ATA becomes implemented and instantiated. Currently, we are not proposing such schemes for the morphological transformations associated with the processes of growth and aging. Our present purpose in giving a preliminary account of the ATA scheme is to illustrate the challenges that the symbolic modeling of developmental biology and prenatal development presents, and to emphasize that knowledge of embryonic development is as important a component of

Fig. 10. A scheme for Anatomical Transformation Abstraction (ATA). Shading is used to facilitate the visualization of relationships between cognates of a higher level component of the ATA.

anatomical and medical reasoning as spatial knowledge of the human body. The FMA will not attain its full potential until it is able to support inference based on both structural and developmental relationships.

The significance of the ATA scheme as we propose it is that, together with the ASA, it formalizes and constrains all the kinds of information that need to be associated with an anatomical entity in order to comprehensively conceptualize and symbolically represent its development starting from the fertilized egg. We propose a scheme for the ATA as an extension of the FMA's overall conceptual scheme and illustrate its components in Fig. 10.

We envisage the Developmental Taxonomy (DevT) as the sum of several developmental subtaxonomies linked together through the AT. This virtual umbrella taxonomy will consist of taxonomies of developmental structures (DStrO), developmental spaces (DSpO), and developmental processes (DPO).

Developmental lineage (DL) and phenotypic transformation (PTr) relate to the essence of embryonic development. Both are complex concepts. Both can be modeled through the inverse relationships -gives rise to- and -derived from-, or their synonyms, between a 'precursor' and one or more 'successors.' Phenotypic transformation (PTr) is a developmental relationship, which is established between developmental states of one individual, or a class of individuals, on the basis of a change in phenotype (gene expression) between precursor and successor. For example (using the symbol > to mean -gives rise to-): Mesodermal primordium of humerus > Cartilaginous primordium of humerus > Ossifying humerus with primary ossification center > Ossifying humerus with secondary ossification center > Fully formed humerus. Each developmental stage of the same structure is distinguished from the preceding one by a set of newly acquired phenotypes, which, as a rule, results from differential gene expression. PTr pertains to all classes of Developmental structure and Developmental space, even if the phenotypic change is limited to the addition or deletion of one of their components, the structural rearrangement of their parts, or a change in their shape. Therefore, the formalism for phenotypic transformation should specify the immediate precursor ($\mathrm{Pc}_1$), its immediate successor ($\mathrm{S}_1$) and the change in phenotype ($\Delta\mathrm{Pt}$):

\[ \mathrm{PTr} = (\mathrm{Pc}_1,\ \mathrm{S}_1,\ \Delta\mathrm{Pt}) \tag{3} \]

Developmental lineage (DL) specifies a line of descent or ancestry in which an ancestor replicates itself and gives rise to two or more descendants, each of which is phenotypically distinct from its immediate ancestor. The formalism for lineage parallels that for PTr by specifying the immediate ancestor ($\mathrm{A}_1$), the immediate descendant ($\mathrm{D}_1$) and the change in phenotype ($\Delta\mathrm{Pt}$):

\[ \mathrm{DL} = (\mathrm{A}_1,\ \mathrm{D}_1,\ \Delta\mathrm{Pt}) \tag{4} \]

Note that each $\Delta\mathrm{Pt}$ has to be expressed as an ASA attribute of $\mathrm{Pc}_1$, $\mathrm{S}_1$, $\mathrm{A}_1$, and $\mathrm{D}_1$. This is only one of the ways in which the ASA and ATA will be closely interrelated, an observation that leads to the conclusion that an ontology of embryonic development should be developed as a logical extension and integral component of the FMA.

Timing of PTr and DL in the context of a developmental clock must be represented through the developmental time parameters of post-ovulatory time (POT) and/or developmental stage (DSt).

A transforming agent (TAg), which is a gene product, is always required for effecting the expression of a new phenotype. This agent may play a facilitatory or inhibitory role in the expression of the new phenotype by its target (Tg). The expression of this new phenotype ($\Delta\mathrm{Pt}$) depends on the activity of one or more specific genes (G), which may increase (i.e., is facilitated; $\mathrm{G}_f$) or decrease (i.e., is repressed; $\mathrm{G}_r$).

TAg has not only a target but also a source (Sc). It is, in fact, itself a new phenotype resulting from facilitated or suppressed gene activation within its source. In both target and source, the macromolecule that corresponds to the new phenotype is produced through a change in the activity of a gene or genes, even when this change results from the repression of another gene or genes. Finally, the TAg must be propagated (Prop) from the source to the target, which may occur within cells, through cell junctions or through the intercellular environment.

Thus change in phenotype along a cell lineage or in the phenotypic transformation of multicellular, developing structures is the outcome of a number of interacting networks, which are controlled by the facilitation or repression of selected groups of genes. Therefore, we propose the first iteration of regulatory networks (Rn) that control the expression of new phenotypes as:

\[ \mathrm{Rn} = (\mathrm{TAg},\ \mathrm{Sc},\ \mathrm{Tg},\ \mathrm{G}_f,\ \mathrm{G}_r,\ \mathrm{Prop},\ \Delta\mathrm{Pt}) \tag{5} \]

The purpose of the Rn scheme is to establish a framework for the information that emerges from experiments and integrate this new information with existing knowledge. The components of this formalism decompose the complex developmental events into elements that can be entered in the framework of the FMA, even with currently available methods.

We concede that while the establishment of the FMA for static, fully formed anatomy is a Herculean task, this task pales in comparison with the challenges posed by the enhancement of the FMA with the dynamic processes that constitute embryonic development and cell differentiation. These challenges provide the motivation for collaboration, a coordinated, distributed effort, and for the development of knowledge-based authoring tools that facilitate the population of a large knowledge base, such as the FMA, and others that are currently emerging in bioinformatics.

Fig. 11. The distributed, Internet-based architecture of the Anatomy Information System (AIS). Various structural information resources (bottom row) are made available to outside processes by means of specialized servers (center row). Various client applications (top row) are graphical and query user interfaces developed for different users. Other remote agents and interfaces at diverse locations access servers of the AIS via well-defined Internet protocols.

4. Accessing the FMA on several levels. At the most fundamental level, the model has to be evaluated for its internal consistency The FMA is one of the components of the Anatomy and comprehensiveness. There are no precedents we are Information System (AIS), shown in Fig. 11, which is a aware of for evaluating the overall semantic structure of three-tiered software architecture constituted by a set of a computable knowledge source, which is perhaps one of structural information resources (the chief one of which the most critical features of the FMA. At the highest is the FMA), sets of authoring and end-user programs, level, a knowledge base that claims to be reusable and and structural information servers, which communicate ‘‘foundational’’ must be evaluated for its generalizability with the information resources via the web through the and usefulness to other projects in knowledge repre- mediation of the servers [37]. sentation and application development. Given the fact Currently, the FMA is accessed through six different that the FMA is still evolving and has not yet been re- user interfaces in the AIS, which are shown at the top of leased, its evaluations to date have been largely at the Fig. 11: (1) the Protege-2000 graphical user interface, first level. which supports authoring and also allows browsing Internal consistency checks were performed by through the Protege class structure; (2) the Founda- UMLS staff on segments of the FMA instantiated for tional Model Explorer (FME), a web-based GUI that different body parts as these segments were delivered for provides intuitive browsing capabilities without the inclusion in the UMLS. Independent projects also as- complexity of the full Protege system [38]; (3) the GO- sessed the internal consistency of different versions of QAFMA Graphical User Interface to the OQAFMA the FMA as a prerequisite for meeting their own re- Query Agent for the Foundational Model of Anatomy, search objectives [44,45, Gu H. 
personal communica- which provides a web interface for users to issue low- tion]. Feedback from these investigators revealed an level database queries to the OQAFMA server [39]; (4) aggregate of a few hundred errors, many of which re- the intelligent EMILY GUI, which constrains the con- lated to spelling and only a few to cycles in the class struction of queries to concepts and relationships to subsumption and partonomy hierarchies. Given the size those in the FMA and relies on inference to retrieve and complexity of the FMA, we found these results very results not explicitly represented in the knowledge base gratifying. [40]; (5) GAPP, a natural language interface that allows It is problematic to evaluate the FMA for compre- simple queries about the concepts and relationships hensiveness of its content, since there is no available represented in the FMA [41]; and (6) the GUI of the gold standard for comparison. There is no other source Dynamic Scene Generator that provides access to im- that includes over 100,000 anatomical terms, less than ages and 3D models linked to the FMA in order to 10% of which correspond to the complete list of officially support knowledge-based generation of interactive sanctioned anatomical terms [21]. Nevertheless, a cor- scenes [42]. relation of the incidence of anatomical concepts in a In addition, the part of the FMAÕs content incorpo- large compendium of clinical reports with the FMA rated in the UMLS as the Digital Anatomist vocabulary would be informative. is accessible through the UMLS knowledge server. The Comprehensiveness seems a relatively trivial problem Digital Anatomist vocabulary contains the Anatomy compared to evaluating the FMAÕs overall semantic Taxonomy, except for the concepts and relationships structure and the extensive modeling of relationships. 
pertaining to the brain and spinal cord, and relation- However, the difficulties entailed in such an apparently ships of partonomy and branch and tributary relation- simple task are illustrated by the mapping of large ships. symbolic models to one another, taking into account The evolution of the diverse interfaces for accessing their structure as well as their terms [45]. The FMA and the FMA indicates that the FMA has reached a stage at GALENÕs common reference model (CRM) [46] were which there is sufficient content to support experiments selected for developing automated methods for such for interrogating the knowledge base, which is a key model matching. Although, after some necessary lexical requirement for developing knowledge-based applica- adjustments, over 3000 matching terms can be demon- tions such as the Dynamic Scene Generator [42], and strated, there are surprisingly few homologies between also for evaluating the FMA. The recent release of the the FMA and GALEN-CRM when -is a- and parton- FMA on the Internet [43] should facilitate both these omy relationships are also taken into account. The activities. reasons for the differences have not yet been explored, but at least some of them may be the different contexts of modeling. GALEN represents anatomy in the context 5. Evaluation and current usage of surgical procedures, whereas the FMA has a strictly structural orientation. Evaluation of a large knowledge base, such as the The ultimate evaluation of the Foundational Model FMA, poses considerable problems and must take place of Anatomy needs to take place through testing the 496 C. Rosse, J.L.V. Mejino Jr. / Journal of Biomedical Informatics 36 (2003) 478–500 hypothesis that motivates the establishment of the that encompasses a comparable spectrum of anatomical model: the FMA will provide the anatomical informa- entities at a level above that of elementary textbooks of tion called for by any knowledge-based application that an introductory nature. 
requires computable anatomical knowledge. We include The next scaling up entailed the development of the among such applications those developed for education, neuroanatomical component of the FMA [28]. The biomedical research, and clinical medicine. The prereq- FMA is unique among neuroscience resources in that it uisites for such evaluations are currently being gener- comprehensively represents anatomical concepts of both ated. The development of query interfaces to the FMA, the central and peripheral nervous systems; moreover it described in the preceding section, is a requirement does so in the same information space as other systems for making the FMA accessible for application of the body. The instantiation of neuroanatomical development. relationships is in progress. We have made evolving versions of the FMA avail- In Section 3.4 we propose to extend the FMA to able to selected investigators, but its use has been largely knowledge elements that integrate the traditional field of limited to associating the terms of the FMA with images classical embryology with contemporary developmental and image volumes [47–50], and for integrating these biology. The FMAÕs semantic structure accommodates terms in other terminologies [51]. Definitions of the the implemented and projected scale ups quite naturally. FMA have been used as a basis for characterizing defi- We regard this outcome as a validation of the FMAÕs nitions of anatomical concepts in WordNet [52] and in conceptual framework and disciplined approach to other biomedical ontologies [11], as well as for the au- knowledge modeling. tomatic semantic interpretation of anatomical spatial Recently we began to experiment with using the relationships [53], enriching the UMLS semantic net- FMA as a template for the representation of the anat- work [54] and designing its metaschema [55]. 
As far as omy of non-human species, particularly those that serve we are aware, only one application relies on knowledge as experimental models of human disease [14]. The embedded in the FMA for interacting with 3D scenes classes of the AT readily accommodate the anatomy of [42]. We hope that the development of knowledge-based mammals and even other vertebrates. The challenge is to applications calling for anatomical knowledge will be formally represent interspecies similarities and differ- stimulated by access to the comprehensive FMA, pro- ences at the various levels of structural organization. viding opportunities for its higher level evaluation. Solution of this problem will likely generalize to the representation of intraspecies anatomical variation, i.e., differences between individuals. This possibility has im- 6. Scaling of FMA portant applications not only in clinical medicine but also in anthropology. Plans have been made already The objective of the FMA to represent declarative for using the FMA to annotate anthropological osteol- knowledge about the structure of the body calls for ogy databases [Drs. Razdan and Clark, personal scaling the model to the concept domains of those fields communication]. of anatomical science that are not yet included in the We are committed to constrain the FMAÕs content to FMA. These fields include neuroanatomy, develop- biological structure or anatomy. However, we have be- mental biology and embryology, and also comparative gun to develop a representation of physiological func- anatomy. Moreover, we contend that since manifesta- tion using the FMA as a template or reference ontology tions of health and disease may be conceptualized as [56]. Such a Foundational Model of Physiology (FMP) attributes of anatomical structures, a logical and com- will be distinct from the FMA but it will be intimately prehensive representation of anatomy should serve as a linked to it. 
foundation or template for the computable representation of physiological function, as well as pathology and the clinical manifestations of diseases. Unless the semantic structure of the FMA lends itself for such scaling, the model cannot be regarded as foundational. Moreover, if the FMA is to fulfill its potential as a reference ontology, then it should be feasible to readily align other existing and evolving biomedical ontologies with it.

The first phase of the FMA's development was focused on macroscopic anatomy. Then the scope was extended to include histology and the representation of cells, subcellular entities, and biological macromolecules. There is no other hard copy or computable source

7. Discussion

The Digital Anatomist Foundational Model of Anatomy expresses a theory of anatomy that provides a view of the domain consonant with the requirements of formal knowledge representation and also accommodates traditional views of the domain. Coherent theories of anatomy have not been declared as such, although theoretical treatises on mereotopology (e.g., [57]), or on some aspect of it (e.g., [58]), cite, or are even based on, anatomical examples. These proposals, however, as a rule, do not proceed from the examples to implementing the theory for the entire corpus of the domain, which, of course, is not their purpose. The FMA's theory of anatomy is articulated by its high level scheme, the semantic structure of the AT, and the schemes of the model's ASA and ATA components. Initially proposed as hypotheses, these components of the FMA have now been largely validated by instantiating the symbolic model with tens of thousands of concepts and more than a million relationships.

In this article we focus primarily on the AT and defer detailed descriptions of the ASA and ATA to separate communications. We first summarize the salient features of the AT, before commenting on the relevance of the FMA to UMLS in general and to bioinformatics in particular.

7.1. Salient features of the AT

Our intent with the Anatomy Taxonomy is to incorporate in it all concepts that relate to the structure of the body, including those first identified in the contemporary literature and those that are newly discovered. The AT introduces a number of classes that are unlikely to be found in the literature or in anatomical discourse. The rationale and justification for creating these classes is to assure that general as well as more and more specific attributes that are shared by increasingly specialized anatomical structures are propagated from the root of the taxonomy to its leaves. The semantic structure of the AT also assures that all anatomical entities, ranging in size and complexity from macromolecules to major body parts and the whole organism, are encompassed by one attributed graph. This graph also accommodates classes of substances and non-material entities that are associated with and defined in terms of anatomical structures, which constitute the dominant class of the AT. In addition to these non-material physical anatomical entities of zero to three dimensions, the root of the AT also subsumes non-physical anatomical entities that have no spatial dimension at all.

To safeguard against ambiguity, explicit Aristotelian definitions specify the classes of the AT in terms of predominantly structural attributes, which are formally represented in the frames of the AT's concepts. At the current state of the FMA, however, these definitions are less consistently implemented the further one moves away from the taxonomy's root.

The semantic structure of the AT, together with the Protégé-2000 authoring environment, allows the representation of multiple inheritance. However, Aristotelian definitions that specify the essence of the entities to which the concepts refer obviate the need for multiple inheritance, since non-definitional attributes of the concepts can be readily accommodated as slots of their frames. This representation affords searching the knowledge base along the path of any explicitly represented, transitive relationship, or along a virtual path concatenated from heterogeneous relationships [39].

The structure of the AT is a dynamic abstraction that is modified as a result of new insights we gain into the structure of anatomical knowledge. New terms are also added to the FMA as they come to our attention.

7.2. Relevance to UMLS

As noted in the introduction, in the initial phase of the FMA's development, we conceived of the classes of the AT as extensions and specifications of UMLS Semantic Types (ST). However, the disciplined approach to modeling we describe in this communication, coupled with the insights we gained into the structure of anatomical knowledge through the instantiation of the model, resulted in the redefinition of many of these classes. The specificity of these definitions has led to a divergence between the definitions of UMLS ST and FMA classes, several of which are designated by the same or similar terms. For example, there are substantial differences in the definitions of the semantic type 'Anatomical Structure' and the FMA class of the same name. Therefore, in submitting to UMLS evolving versions of the Digital Anatomist component of the FMA, we assigned Anatomical structure to the UMLS ST 'Body Part, Organ or Organ component' rather than 'Anatomical Structure.' More problematic is the assignment of Anatomical space (which subsumes such entities as Peritoneal cavity, Vertebral canal, and Ischio-anal fossa) to ST 'Body Space or Junction,' a descendant of 'Conceptual Entity.' The latter is defined as a broad grouping of abstract entities, whereas the FMA class is a descendant of Physical anatomical entity, since the entities to which the class refers have physical dimension.

Similar considerations led other investigators to suggest adding several new semantic types to better describe the anatomy portion of the Enriched Semantic Network they developed for UMLS, allowing multiple parents in the -is a- subsumption hierarchy [54]. An abstraction metaschema for this enriched network is given in [55]. Some of these enrichments make use of the FMA's definitions, which suggests perhaps that bidirectional interactions between the UMLS SN and its source vocabularies could benefit not only the vocabularies but also the SN. Thus, in addition to the potential of the FMA for reconciling inconsistencies in anatomical concepts represented in UMLS vocabularies [59] and in traditional, hard-copy sources [34], class definitions of the FMA may prove useful in a review of UMLS semantic types. Such a review is likely to become desirable as a consequence of the expanding scope of the UMLS Metathesaurus, which reflects the growing relevance of bioinformatics to clinical medicine by the inclusion of emerging ontologies in this field of biomedical informatics.
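The propagation of attributes from the root of the taxonomy to its leaves, and the storage of attributes as slots of frames (Section 7.1), can be illustrated with a toy model. The Python sketch below is not the FMA's Protégé-2000 implementation: the `Frame` class, the slot name `has_physical_dimension` and the three-class chain are assumptions made for illustration, using only class names that appear in the text and omitting any intermediate FMA classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Frame:
    """A class in a single-inheritance taxonomy with locally stored slots."""
    name: str
    parent: Optional["Frame"] = None
    slot_values: dict = field(default_factory=dict)

    def lineage(self):
        """Return the is-a path from this frame up to the root."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

    def get(self, slot):
        """Look a slot up locally, then along the is-a path toward the root."""
        node = self
        while node is not None:
            if slot in node.slot_values:
                return node.slot_values[slot]
            node = node.parent
        raise KeyError(slot)


# Hypothetical three-class chain; intermediate classes are deliberately omitted.
physical = Frame("Physical anatomical entity",
                 slot_values={"has_physical_dimension": True})
space = Frame("Anatomical space", parent=physical)
cavity = Frame("Peritoneal cavity", parent=space)

print(cavity.lineage())
print(cavity.get("has_physical_dimension"))
```

Because each frame has a single parent, an attribute declared once near the root is visible at every descendant, which is the behavior the text attributes to definitions that make multiple inheritance unnecessary.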

7.3. Relevance to bio- and biomedical informatics

The relevance of the FMA to domains of bioinformatics beyond that of traditional anatomy is illustrated by recent, emerging projects that reuse information from the FMA. Though initially conceived for classical, macroscopic anatomy, the FMA has been successfully scaled to microscopic and neuroanatomy as well as to biological macromolecules. The scheme for modeling embryology and developmental biology, described in this communication, is an integral part of the FMA's conceptual framework. The FMA has also provided a motivation for research related to the modeling of physiological functions [56], comparative anatomy [14], and anthropological osteology, and to querying and matching large ontologies and databases [39–41,45].

We contend that the Foundational Model of Anatomy is the most promising, currently available candidate for serving as a reference ontology in biomedical informatics. The reasons for this contention are inherent in the semantic structure and other distinguishing features of the FMA. By way of summary, we highlight the following features.

1. The FMA is a domain ontology that represents deep knowledge of the structure of the human body by placing an emphasis on the highest level of granularity of its concepts and the large number and specificity of the structural relationships that exist between the referents of these concepts. Modeling at the highest level of detail assures consistency in the representation across different levels of structural organization. A consequence of this approach is that, as far as we are aware, the FMA has developed into the most complex biomedical domain ontology. This conclusion is reached by applying the metric proposed by Gu et al. [13], in terms of which the FMA scores over 10 in comparison with a score of 2–3 for vocabularies included in and similar to those in UMLS. This level of complexity presents its own challenges, which include developing methods to filter the FMA's contents when information is required at coarser levels of granularity. The semantic structure of the FMA will facilitate the development of knowledge-based tools for such a purpose.

2. The concept domain of the FMA integrates in one continuous conceptual and implementation framework subdomains of anatomy that are conventionally handled by independent and largely incompatible sources. The objective is to comprehensively represent in the FMA anatomical entities down to the level of cell parts and provide a framework for linking to the FMA ontologies and other data repositories for biological macromolecules. Comprehensive instantiation of the FMA's ASA and ATA components can be accomplished through funding that targets the needs of research groups for computable, in-depth anatomical information related to selected parts of the body.

3. By modeling canonical anatomical knowledge and, in particular, by introducing high level, abstract classes of anatomical entities, the FMA also provides a framework for inter- and intraspecies anatomical variation and for the organization of anatomical data that pertain to instances of the human and other species. These data include the clinical record and biological experiments performed on non-human species.

4. The FMA is unusual among traditional and computable knowledge sources in that it strictly adheres in its modeling to one context. Because the majority of the other sources target particular user groups, of necessity, they intermingle different contexts or views of their primary domain of interest. By design, the FMA is intended to meet the needs of diverse user groups and applications that require anatomical information; therefore it is designed as a reusable reference ontology rather than an application ontology. Only the structural context generalizes to and complements all other views of biology and medicine. The structural context proved to be critical for the disciplined modeling of the FMA; we found it to be the only view that allowed the comprehensive and consistent representation of biological structure across all levels of its organization.

Such context-specific modeling results in a number of benefits: (1) it obviates duplication and redundancy in ontology development, since the FMA's contents can be reused; (2) it provides for consistency among independent ontologies that rely on the FMA's contents; and (3) it serves as a template for the development of other ontologies in which the concepts of the FMA assume the role of actors.

8. Conclusions

We attempted to illustrate that the FMA not only encompasses in the Anatomy Taxonomy the diverse entities that make up the human body, but is also capable of modeling through the interacting networks of its ASA and ATA components a great deal of knowledge about these entities. Anatomical knowledge represented in the FMA parallels in its complexity and depth the knowledge printed in textbooks and journal articles pertaining to the structure of the body. However, unlike the information in these hard copy sources, the FMA's contents are processable by computers and therefore provide for machine-based inference, which is a prerequisite for the development of knowledge-based applications. Most of the current and emerging ontologies in bioinformatics are primarily concerned with representing the entities of their domain and point to publications for the knowledge associated with the referents of the concepts they model. We hope that our report will encourage a trend in the development of bioinformatics ontologies toward incrementally linking the published
information in a computable form to the concepts these ontologies compile in order to make also this information machine-processable. Serving as a reference ontology for bioinformatics, the FMA may facilitate such a process.

Acknowledgments

The number of publications in References by members of the Structural Informatics Group attests to the numerous individuals who actively contributed to the development of the Foundational Model of Anatomy over a nearly 10 year period. We record our recognition of the collaboration we continue to enjoy with Dr. Mark Musen and other members of Stanford Medical Informatics. In addition to his many other contributions, we are grateful to Dr. James F. Brinkley for reviewing the manuscript. Our special thanks to Dr. Barry Smith for making a number of valuable suggestions for improving the clarity of the manuscript. The most substantial support for the work we report was received from the National Library of Medicine through contract LM 03528 and Grant LM 06822.

References

[1] Musen MA. Medical informatics: searching for underlying components. Methods Inf Med 2002;41:12–9.
[2] GeneOntology (GO). Available from: http://www.geneontology.org/.
[3] US Department of Health and Human Services, National Institutes of Health, National Library of Medicine. Unified Medical Language System (UMLS), 2002.
[4] McCray AT. Representing biomedical knowledge in the UMLS Semantic Network. In: Broering NC, editor. High performance medical libraries: advances in information management for the virtual era. Westport, CT: Mekler; 1993. p. 45–55.
[5] Spackman KE, Campbell KE, Cote RA. SNOMED RT: a reference terminology for health care. Proc AMIA Symp 1997:640–4.
[6] GALEN. Available from: http://www.opengalen.org/.
[7] Cimino JJ, Hricsak G, Johnson SB, Clayton PD. Designing an introspective multipurpose controlled medical vocabulary. Proc 13th Annu Symp Comput Appl Med Care 1989:513–7.
[8] Rosse C, Mejino JL, Modayur BR, Jakobovits R, Hinshaw KP, Brinkley JF. Motivation and organizational principles for anatomical knowledge representation: the Digital Anatomist Symbolic Knowledge Base. J Am Med Inform Assoc 1998;5:17–40.
[9] Rosse C, Shapiro LG, Brinkley JF. The Digital Anatomist Foundational Model: principles for defining and structuring its concept domain. Proc AMIA Symp 1998:820–4.
[10] Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf Med 1998;37(4–5):394–403.
[11] Burgun A, Bodenreider O. Ontologies in the biomedical domain. J Am Med Inform Assoc 2003 [in press].
[12] Perl Y, Geller J, Gu H. Identify a forest hierarchy in an OODB specialization hierarchy satisfying disciplined modeling. Proc First IFCIS Internat Conf on Cooperative Inform Syst CoopIS'96 1996:182–95.
[13] Gu H, Perl Y, Geller J, Halper M, Singh M. A methodology for partitioning a vocabulary hierarchy into trees. Artif Intell Med 1999;15(1):77–98.
[14] Travillian RS, Rosse C, Shapiro LG. An approach to the anatomical correlation of species through the Foundational Model of Anatomy. Proc AMIA Symp 2003:669–73.
[15] Aristotle. The categories. Cambridge, MA: Harvard University Press; 1973.
[16] Michael J, Mejino JLV, Rosse C. The role of definitions in biomedical concept representation. Proc AMIA Symp 2001:463–7.
[17] Noy NF, Mejino JLV, Musen MA, Rosse C. Pushing the envelope: challenges in frame-based representation of human anatomy. Data & Knowledge Eng [in press].
[18] Noy NF, Fergerson RW, Musen MA. The knowledge model of Protege 2000: combining interoperability and flexibility. In: Proc 12 Internat Conf on Knowledge Eng Knowledge Manage (EKAW-2000). Juan-les-Pins France: Springer; 2000.
[19] Chaudhri VK, Farquhar A, Fikes R, Karp PD, Rice JP. OKBC: a programmatic foundation for knowledge base interoperability. In: Fifteenth National Conf on Artificial Intelligence (AAAI-98). Madison, Wisconsin: AAAI Press/The MIT Press; 1998.
[20] Gu H, Halper M, Geller J, Perl Y. Benefits of an object-oriented database representation for controlled medical terminologies. J Am Med Inform Assoc 1999;6:283–303.
[21] Federative Committee on Anatomical Terminology (FCAT). Terminologia Anatomica. Stuttgart: Thieme, 1998.
[22] Rosse C. Terminologia Anatomica; considered from the perspective of next-generation knowledge sources. Clin Anat 2001;14(2):120–33.
[23] Mejino JLV, Rosse C. Conceptualizations of anatomical spatial entities in the Digital Anatomist Foundational Model. Proc AMIA Symp 1999:112–6.
[24] Schleiden MJ. Beiträge zur Phytogenese. Müller's Archive. 1838. Translation in Sydenham Soc, vol. 12, London, 1847.
[25] Schwann T. Mikroskopische Untersuchungen über die Übereinstimmung in der Structur und dem Wachstum der Tiere und Pflanzen. Berlin, 1839. Translation in Sydenham Soc, vol. 12, London, 1847.
[26] Lovtrup S. Epigenetics; a treatise on theoretical biology. London: Wiley; 1974.
[27] Agoncillo AV, Mejino Jr JLV, Rickard KL, Detwiler LT, Rosse C. Proposed classification of cells in the Foundational Model of Anatomy. Proc AMIA Symp 2003:775.
[28] Martin RF, Mejino JLV, Bowden DM, Brinkley JF, Rosse C. Foundational model of neuroanatomy: its implications for the Human Brain Project. Proc AMIA Symp 2001:438–42.
[29] Hollinshead WH. Anatomy for surgeons, vols. 1–3. 3rd ed. Philadelphia: Harper and Row; 1982.
[30] Rosse C, Gaddum-Rosse P. In: Hollinshead's textbook of anatomy. 5th ed. Philadelphia: Lippincott-Raven; 1997. p. 902.
[31] Williams PL, Bannister LH, Berry MM, Collins P, Dyson M, Dussec JE, Ferguson MWJ. In: Gray's anatomy. 38th ed. New York: Churchill Livingstone; 1995. p. 2092.
[32] Rickard KL, Mejino Jr JLV, Martin RF, Agoncillo AV, Rosse C. Problems and solutions with integrating legacy terminologies into evolving knowledge bases [submitted].
[33] Martin RF, Bowden D. Primate brain maps. Oxford: Elsevier; 2000.
[34] Agoncillo A, Mejino JLV, Rosse C. Influence of the Digital Anatomist Foundational model on traditional representations of anatomical concepts. Proc AMIA Symp 1999:2–6.
[35] Mejino Jr JLV, Agoncillo AV, Rickard KL, Rosse C. Representing complexity in part-whole relationships within the Foundational Model of Anatomy. Proc AMIA Symp 2003:450–4.
[36] Neal PJ, Shapiro LG, Rosse C. The Digital Anatomist spatial abstraction: a scheme for the spatial description of anatomical entities. Proc AMIA Symp 1998:423–7.

[37] Brinkley JF, Wong BA, Hinshaw KP, Rosse C. Design of an anatomy information system. IEEE Comp Graphics Appl 1999;3:38–48.
[38] Detwiler LT, Mejino Jr JLV, Rosse C, Brinkley JF. Efficient web-based navigation of the Foundational Model of Anatomy. Proc AMIA Symp 2003:829.
[39] Mork P, Brinkley JF, Rosse C. OQAFMA querying agent for the Foundational Model of Anatomy: providing flexible and efficient access to a large semantic network. JBI 2003;36:501–17.
[40] Shapiro LG, Chung E, Detwiler LT, Mejino Jr JLV, Agoncillo AV, Brinkley JF, Rosse C. A generalizable intelligent query interface for the Digital Anatomist Foundational Model [submitted].
[41] Distelhorst G, Srivastava V, Rosse C, Brinkley JF. A prototype natural language interface to a large complex knowledge base, the Foundational Model of Anatomy. Proc AMIA Symp 2003:200–4.
[42] Wong BA, Rosse C, Brinkley JF. Semi-automatic scene generation using the Digital Anatomist Foundational Model. Proc AMIA Symp 1999:637–41.
[43] http://fma.biostr.washington.edu.
[44] Beck R. Logic-based remodeling of the Digital Anatomist Foundational Model. Proc AMIA Symp 2003:748–52.
[45] Zhang S, Bodenreider O. Aligning representation of anatomy using lexical and structural methods. Proc AMIA Symp 2003 [in press].
[46] Rector AL, Gangenni E, Galeazzi A, Rossi-Mori A. The GALEN core model schema for anatomy: towards a reusable application-independent model of medical concepts. In: Twelfth International Congress of European Federation for Medical Informatics. Lisbon, Portugal; 1994. p. 229–233.
[47] Lober W, Brinkley JF. A portable image annotation tool for web-based anatomy atlases. Proc AMIA Symp 1999:1108.
[48] Rindflesch TC, Bean CA, Sneiderman CA. Argument identification for arterial branching predications asserted in cardiac catheterization reports. Proc AMIA Symp 2000:704–8.
[49] Sneiderman CA, Rindflesch TC, Bean CA. Identification of anatomical terminology in medical text. Proc AMIA Symp 1998:428–32.
[50] Teng CC, Austin-Seymour MM, Barker J, Kalet IJ, Shapiro LG, Whipple M. Head and neck lymph node region delineation with 3-D CT image registration. Proc AMIA Symp 2002:767–71.
[51] Tringali M, Hole WT, Srinivasan S. Integration of a standard gastrointestinal endoscopy terminology in the UMLS Metathesaurus. Proc AMIA Symp 2002:801–5.
[52] Bodenreider O, Burgun A. Characterizing the definitions of anatomical concepts in WordNet and specialized sources. Proc First Global WordNet Conf 2002:223–30.
[53] Bean CA, Rindflesch TC, Sneiderman CA. Automatic semantic interpretation of anatomic spatial relationships in clinical text. Proc AMIA Symp 1998:897–901.
[54] Zhang L, Perl Y, Geller J, Halper M, Cimino JJ. Enriching the structure of the UMLS Semantic Network. Proc AMIA Ann Symp 2002:939–43.
[55] Zhang L, Perl Y, Halper M, Geller J. Designing Metaschemas for the UMLS Enriched Semantic Network. JBI 2003;36:433–49.
[56] Cook DL, Mejino Jr JLV, Rosse C. Evolution of a foundational model of physiology: symbolic representation for functional bioinformatics [submitted].
[57] Smith B. Mereotopology: a theory of parts and boundaries. Data & Knowledge Eng 1996;20:287–303.
[58] Schulz S, Hahn U. Mereotopological reasoning about parts (w)holes in bio-ontologies. In: Proceedings of FOIS'01. New York: ACM Press; 2001. p. 198–209.
[59] Mejino JL, Rosse C. The potential of the Digital Anatomist Foundational Model for assuring consistency in UMLS sources. Proc AMIA Symp 1998:825–9.

D258–D261 Nucleic Acids Research, 2004, Vol. 32, Database issue
DOI: 10.1093/nar/gkh036

The Gene Ontology (GO) database and informatics resource

Gene Ontology Consortium*

GO-EBI, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received August 21, 2003; Revised and Accepted September 12, 2003

ABSTRACT

The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences. Many model organism databases and genome annotation groups use the GO and contribute their annotation sets to the GO resource. The GO database integrates the vocabularies and contributed annotations and provides full access to this information in several formats. Members of the GO Consortium continually work collectively, involving outside experts as needed, to expand and update the GO vocabularies. The GO Web resource also provides access to extensive documentation about the GO project and links to applications that use GO data for functional analyses.

INTRODUCTION

The era of genome-scale biology has seen the accumulation of vast amounts of biological data, accompanied by the widespread proliferation of biology-oriented databases. To make the best use of biological databases and the knowledge they contain, different kinds of information from different sources must be integrated in ways that make sense to biologists.

A major component of the integration effort is the development and use of annotation standards such as ontologies (1–4). Ontologies provide conceptualizations of domains of knowledge and facilitate both communication between researchers and the use of domain knowledge by computers for multiple purposes.

The Gene Ontology (GO) project is a collaborative effort to address two aspects of information integration: providing consistent descriptors for gene products, in different databases; and standardizing classifications for sequences and sequence features. The project began in 1998 as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Informatics (MGI) project. Since then, the GO Consortium has grown to include many databases, including several of the world's major repositories for plant, animal and microbial genomes (a current list of member organizations is included as Supplementary Material).

THE GO PROJECT

The GO project has three major goals: (i) to develop a set of controlled, structured vocabularies, known as ontologies, to describe key domains of molecular biology, including gene product attributes and biological sequences; (ii) to apply GO terms in the annotation of sequences, genes or gene products in biological databases; and (iii) to provide a centralized public resource allowing universal access to the ontologies, annotation data sets and software tools developed for use with GO data.

Ontologies

The GO project provides ontologies to describe attributes of gene products in three non-overlapping domains of molecular biology. Within each ontology, terms have free text definitions and stable unique identifiers. The vocabularies are structured in a classification that supports 'is-a' and 'part-of' relationships. The scope and structure of the GO vocabularies are described in more detail in references (5–7). In the current research environment, where new genome sequences are being rapidly generated, and where comparative genome

*Correspondence should be addressed to GO-EBI, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. Tel. +44 1223 494667; Fax: +44 1223 494468; Email: [email protected] *Current members of the GO Consortium are: M. A. Harris, J. Clark, A. Ireland, J. Lomax (GO-EBI, Hinxton, UK); M. Ashburner, R. Foulger (FlyBase, Department of Genetics, University of Cambridge, Cambridge, UK); K. Eilbeck, S. Lewis, B. Marshall, C. Mungall, J. Richter, G. M. Rubin (BDGP, UC-Berkeley, Berkeley, CA, USA), J. A. Blake, C. Bult, M. Dolan, H. Drabkin, J. T. Eppig, D. P. Hill, L. Ni, M. Ringwald (MGI, Jackson Laboratory, Bar Harbor, ME, USA); R. Balakrishnan, J. M. Cherry, K. R. Christie, M. C. Costanzo, S. S. Dwight, S. Engel, D. G. Fisk, J. E. Hirschman, E. L. Hong, R. S. Nash, A. Sethuraman, C. L. Theesfeld (SGD, Department of Genetics, Stanford University, Stanford, CA, USA); D. Botstein, K. Dolinski, B. Feierbach (Genomics Institute, Princeton University, Princeton, NJ, USA); T. Berardini, S. Mundodi, S. Y. Rhee (TAIR, Carnegie Institution, Department of Plant Biology, Stanford, CA, USA); R. Apweiler, D. Barrell, E. Camon, E. Dimmer, V. Lee (GOA database, UniProt, EBI, Hinxton, UK); R. Chisholm, P. Gaudet, W. Kibbe (DictyBase, Northwestern University, Chicago, IL, USA); R. Kishore, E. M. Schwarz, P. Sternberg (WormBase, California Institute of Technology, Pasadena, CA, USA); M. Gwinn, L. Hannick, J. Wortman (Institute for Genome Research, Rockville, MD, USA); M. Berriman, V. Wood (Wellcome Trust Sanger Institute, Hinxton, UK); N. de la Cruz, P. Tonellato (RGD, Medical College of Wisconsin, Milwaukee, WI, USA); P. Jaiswal (Gramene, Department of Plant Breeding, Cornell University, Ithaca, NY, USA); T. Seigfried (Maize DB, Iowa State University, Ames, IA, USA); R. White (Incyte Genomics, Palo Alto, CA, USA).

Nucleic Acids Research, Vol. 32, Database issue © Oxford University Press 2004; all rights reserved

analysis requires the integration of data from multiple sources, it is especially germane to provide rigorous ontologies that can be shared by the community.

Molecular Function (MF) describes activities, such as catalytic or binding activities, at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when or in what context the action takes place. Examples of individual molecular function terms are the broad concept 'kinase activity' and the more specific '6-phosphofructokinase activity', which represents a subtype of kinase activity.

Biological Process (BP) describes biological goals accomplished by one or more ordered assemblies of molecular functions. High-level processes such as 'cell death' can have both subtypes, such as 'apoptosis', and subprocesses, such as 'apoptotic chromosome condensation'.

Cellular Component (CC) describes locations, at the levels of subcellular structures and macromolecular complexes. Examples of cellular components include 'nuclear inner membrane', with the synonym 'inner envelope', and the 'ubiquitin ligase complex', with several subtypes of these complexes represented.

The recent development of the Sequence Ontology (SO) permits the classification and standard representation of sequence features. Defined sequence features include terms such as 'exon', whose meaning is widely accepted, and the more problematic term 'pseudogene', for which several different usages have yet to be resolved. Although the SO is a relatively new vocabulary, and is still undergoing refinement, it is already being used for genome annotation projects in Drosophila and Caenorhabditis elegans.

Annotations

Collaborating databases provide data sets comprising links between database objects and GO terms, with supporting documentation. Every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis; furthermore, the annotation must indicate the type of evidence the cited source provides to support the association between the gene product and the GO term. A standard set of evidence codes qualifies annotations with respect to different types of experimental determinations. For example, a direct assay to determine the function of the exact gene product being annotated is more reliable than a sequence architecture comparison.

High-quality GO annotations, normally based on curatorial review of published literature and supported by experimental evidence, are now available for gene products in many model organisms. In addition, large sets of annotations made using automated methods cover both model organisms and less experimentally tractable organisms, including human. A number of different automatic methods have been applied (e.g. 8–12), all of which are represented by the evidence code IEA ('inferred from electronic annotation'). Table 1 provides a snapshot of current annotations in the GO database; a more detailed table is maintained on the web at http://www.geneontology.org/doc/GO.current.annotations.shtml. Additional information on GO annotations can be found in references (5–8) and (13).

Table 1. Status of the GO vocabularies

Totals                            July 1, 2000   July 1, 2003
All valid terms (a)                       4493          13412
Terms with definitions                     250          11105
Terms with synonyms                        301           2813
Terms with db cross-references            1042          12317
Associations (b)                         30654        7781954
Gene products                            13016        1549236
Sequences                                    0          21916
Paths (c)                                30941         314886

(a) Excludes obsolete terms.
(b) Individual associations between any gene product and any GO term.
(c) Parent–child relationships traced from any GO term to the root (molecular function, biological process or cellular component).

The SO is being used by the collaborating databases for genomic feature annotation. Like GO annotations, SO annotations are curated using both manual work by experts and purely computational methodologies.

GO slims

For many purposes, in particular reporting the results of GO annotation of a genome or cDNA collection, it is very useful to have a high-level view of each of the three ontologies. These subsets of the GO have become known as 'GO slims', the first of which was constructed for the annotation of the Drosophila genome (13). An example of a GO slim analysis is shown in Figure 1.

The shared use of GO slims makes comparisons of summary GO term distributions very easy. Different applications, however, may require different GO slim sets tailored to the specific needs of an analysis. To address this, the GO Consortium makes both generic and specific GO slim files available. The generic GO slim file is kept up to date with respect to the full ontologies, and specific GO slim files that have been used in particular publications or analyses are archived.

THE GO DATABASE

The GO database consists of a MySQL database that captures GO content and a Perl object model and Application Programmer Interface (API) to simplify database access and help programmers write tools that use the GO data. The GO relational database is released monthly in several versions: termdb includes the ontologies, definitions and cross-references to other databases; assocdb includes all data in termdb plus associations to gene products; and seqdb adds protein sequences for annotated gene products (where available). A fourth version, seqdblite, is equivalent to seqdb without the IEA-based associations; this version is used by the AmiGO browser (see below).

The GO database schema models generic graphs, including the GO structure (a directed acyclic graph, or DAG) relationally. At the core of the schema are two relational tables for capturing all terms (also called nodes) and term–term relationships (arcs). The two relationship types, 'is-a' and 'part-of,' are represented as a 'relationship type' attribute in the relationship table.
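The two-table design described above (a node table plus an arc table carrying a relationship-type attribute) can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the actual GO MySQL schema: the table and column names (`term`, `term2term`, `parent_id`, `child_id`, `relationship_type`) are assumptions, and the three arcs are a toy graph based on the 'cell death' example in the text, with intermediate terms omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE term (                 -- nodes
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE term2term (            -- arcs
        parent_id         INTEGER NOT NULL REFERENCES term(id),
        child_id          INTEGER NOT NULL REFERENCES term(id),
        relationship_type TEXT NOT NULL
            CHECK (relationship_type IN ('is_a', 'part_of'))
    );
""")

# Toy arcs as (child, relationship, parent); illustrative only.
ARCS = [
    ("cell death", "is_a", "biological process"),
    ("apoptosis", "is_a", "cell death"),
    ("apoptotic chromosome condensation", "part_of", "cell death"),
]

names = sorted({n for child, _, parent in ARCS for n in (child, parent)})
conn.executemany("INSERT INTO term (name) VALUES (?)", [(n,) for n in names])
ids = {name: id_ for id_, name in conn.execute("SELECT id, name FROM term")}
conn.executemany(
    "INSERT INTO term2term (parent_id, child_id, relationship_type) "
    "VALUES (?, ?, ?)",
    [(ids[parent], ids[child], rel) for child, rel, parent in ARCS],
)


def ancestors(conn, name):
    """Return the names of all terms reachable by following arcs upward."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(id) AS (
            SELECT parent_id FROM term2term
            WHERE child_id = (SELECT id FROM term WHERE name = ?)
            UNION
            SELECT t.parent_id FROM term2term AS t JOIN up ON t.child_id = up.id
        )
        SELECT name FROM term WHERE id IN (SELECT id FROM up) ORDER BY name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(ancestors(conn, "apoptosis"))
```

Storing the relationship type as a column rather than in separate tables means that adding a new relationship type would require no schema change, only a wider set of allowed values.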

Figure 1. Application of a GO slim set in genome annotation. The number of gene products annotated to each term in each of four model organism genomes is shown for a GO slim set taken from the cellular component ontology (data as of August 1, 2003).
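A GO slim summary of the kind shown in Figure 1 amounts to mapping each annotated term to the slim terms among its ancestors and counting distinct gene products per slim term. The Python sketch below shows one way this could be done; it is not the GO Consortium's software. The parent map, the slim set and the gene product names (`gpA`, `gpB`, `gpC`) are invented for illustration, and the optional evidence filter mimics the exclusion of IEA-based associations described for seqdblite. IDA ('inferred from direct assay') is a GO evidence code that the text itself does not name.

```python
from collections import defaultdict

# Toy child -> parents map (is-a and part-of arcs merged); not real GO content.
PARENTS = {
    "nuclear inner membrane": ["nucleus"],
    "ubiquitin ligase complex": ["protein complex"],
    "nucleus": ["cellular component"],
    "protein complex": ["cellular component"],
}

# Hypothetical slim set for the cellular component ontology.
SLIM = {"nucleus", "protein complex"}

# (gene product, GO term, evidence code); gene product names are made up.
ANNOTATIONS = [
    ("gpA", "nuclear inner membrane", "IDA"),
    ("gpB", "nuclear inner membrane", "IEA"),
    ("gpB", "nucleus", "IEA"),
    ("gpC", "ubiquitin ligase complex", "IEA"),
]


def self_and_ancestors(term, parents):
    """Return the term together with every term reachable upward from it."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def slim_counts(annotations, parents, slim, exclude_evidence=()):
    """Count distinct gene products per slim term, skipping excluded evidence."""
    products = defaultdict(set)
    for gene_product, term, evidence in annotations:
        if evidence in exclude_evidence:
            continue
        for slim_term in self_and_ancestors(term, parents) & slim:
            products[slim_term].add(gene_product)
    return {term: len(group) for term, group in products.items()}


print(slim_counts(ANNOTATIONS, PARENTS, SLIM))
print(slim_counts(ANNOTATIONS, PARENTS, SLIM, exclude_evidence={"IEA"}))
```

Counting sets of gene products rather than annotation rows keeps a product that is annotated to both a slim term and one of its descendants from being counted twice.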

GO RESOURCES

Access to ontologies and annotations in all formats

The output of the GO project (vocabularies, annotations, database and accompanying tools) is in the public domain and is readily accessible via the GO web pages at http://www.geneontology.org/. The GO Consortium gives permission for any of its products to be used without license, in accordance with its redistribution and citation policy. Highlights of that policy are:

(i) that the Gene Ontology Consortium is clearly acknowledged as the source of the product;
(ii) that any GO Consortium file(s) displayed publicly include the revision number(s) and/or date(s) of the relevant GO file(s);
(iii) that neither the content of a GO file(s) nor the logical relationships embedded within the GO file(s) be altered in any way.

The full GO Redistribution and Citation Policy document is available online at http://www.geneontology.org/doc/GO.cite.html. A list of useful URLs and addresses is included in the Supplementary Material.

The MySQL database described above can be downloaded locally, and Perl APIs are provided. The GO Consortium's ontologies and annotations are also available as flat files (the most frequently updated format at the time of writing) and as RDF XML; the latter is available with or without annotation data included. The MySQL and XML formats are released monthly. The flat files are updated continually, and monthly snapshots are archived. Current and archival releases of all three formats can be downloaded from the GO web site.

Documentation

The GO web resource includes an extensive set of documentation pages (see http://www.geneontology.org/doc/GO.contents.doc.html). Topics include an overview of the GO project and the ontologies, guides to editorial style, file formats and annotation practices, and frequently asked questions (FAQ).

Software/tools

A variety of browsers that provide visualization and query capabilities for the GO are available. For example, the AmiGO browser (developed by the GO software group at Berkeley; see http://www.godatabase.org/cgi-bin/go.cgi) provides a web interface for searching and displaying the ontologies, term definitions and associated annotated gene products for the entire spectrum of contributing organism databases represented in the GO database. AmiGO easily allows users to browse a tree-like view of the GO structure and to search for terms using a variety of different keys such as a name, synonym, definition, numerical identifier or cross-referenced entry in an external database. The summary view presents the list of gene products associated with each term. The results may be constrained by the evidence code used in the association or by the organization that submitted the association. Representative amino acid sequences are available for most genes, and these can be selected and downloaded as

FASTA ®les. Using GOst, the GO BLAST server, users may contacts.html. Any questions about contributing to the GO submit a query sequence and retrieve the sequences and GO project should be directed to the main GO mailing list at annotations of all similar gene products in the GO database. [email protected]. The GO software group has also developed DAG-Edit, a tool that provides a graphical interface to browse, query and SUMMARY edit GO or any other vocabulary that has a DAG data structure. GO curators use DAG-Edit to manage the GO vocabularies. The GO project provides an ongoing example of community The tool has also been used by other groups to build ontologies development of bioinformatics standards. Combining the for a wide range of biological subjects, such as anatomies and expertise of biologists from multiple sub-disciplines, the developmental timelines for several model organisms, human computational expertise of arti®cial intelligence researchers, diseases and plant growth environment. DAG-Edit is an open and input from multiple users of the system, the GO source Java application that is installed locally. A user guide is Consortium continues to develop and expand these classi®ca- available within the application and on the web (http:// tion systems for molecular biology. www.geneontology.org/doc/dagedit_userguide/dagedit.html). DAG-Edit is updated regularly to add features and improve SUPPLEMENTARY MATERIAL performance; the current version can be downloaded from http://sourceforge.net/project/show®les.php?group_id=36855. Supplementary Material is available at NAR Online. The GO Software web page (http://www.geneontology.org/ doc/GO.tools.html) provides a catalogue of GO-related tools ACKNOWLEDGEMENTS developed by members of the GO Consortium or by GO users. 
The Gene Ontology Consortium is supported by NIH/NHGRI In addition to AmiGO, there are several more applications for grant HG02273, and by grants from the European Union RTD browsing and searching the GO vocabularies and annotations. Programme `Quality of Life and Management of Living Other available software includes applications for correlating Resources' (QLRI-CT-2001-00981 and QLRI-CT-2001- data from the GO project and other sources (including, but not 00015). limited to, microarray data), as well as tools that are not speci®c to, but can be used in conjunction with, GO data. REFERENCES Other resources 1. Gruber,T.R. (1993) A translational approach to portable ontologies. Literature collection. The GO project maintains a biblio- Knowl. Acq., 5, 199±220. graphy of peer-reviewed publications (124 as of August 2003) 2. Jones,D.M. and Paton,R.C. (1999) Toward principles for the representation of hierarchical knowledge in formal ontologies. Data relevant to the development and use of the GO vocabularies Knowl. Eng., 31, 102±105. and annotation sets at http://www.geneontology.org/doc/ 3. Schulze-Kremer,S. (1998) Ontologies for molecular biology. Pac. Symp. GO.biblio.html. Many of the publications document the Biocomput., 3, 695±706. curation and display of GO annotations within a wide variety 4. Stevens,R., Goble,C.A., and Bechhofer.S. (2000) Ontology-based of databases, whereas others make use of GO terms and gene knowledge representation for bioinformatics. Brief. Bioinform., 1, 398± 414. product annotations in the interpretation of large-scale 5. Blake,J.A. and Harris,M. (2003) The Gene Ontology Project: Structured experimental results. Still other papers describe novel uses vocabularies for molecular biology and their application to genome and of GO terms (e.g. in text mining), software that uses GO data expression analysis. In Baxevanis,A.D., Davison,D.B., Page,R., and integration of the GO with other ontological resources. Stormo,G. and Stein,L. 
(eds), Current Protocols in Bioinformatics. Wiley and Sons, Inc., New York. 6. The Gene Ontology Consortium (2001) Creating the gene ontology Community input. The GO effort is greatly enriched by input resource: design and implementation. Genome Res., 11, 1425±1433. from its user community. Several routes are available for users 7. The Gene Ontology Consortium (2000) Gene Ontology: tool for the to comment on various aspects of the GO. Comments and uni®cation of biology. Nature Genet., 25, 25±29. suggestions for changes and updates to the ontologies can 8. Camon,E., Magrane,M., Barrell,D., Binns,D., Fleischmann,W., Kersey,P., Mulder,N., Oinn,T., Maslen,J., Cox,A. et al. (2003) The Gene be submitted via a GO project page at the SourceForge Ontology Annotation (GOA) Project: Implementation of GO in SWISS- site (http://sourceforge.net/projects/geneontology), whereupon PROT, TrEMBL, and InterPro. Genome Res., 13, 662±672. each suggestion is evaluated by GO Consortium members. 9. Mi,H., Vandergriff,J., Campbell,M., Narechania,A., Majoros,W., Different `trackers' available from the SourceForge site allow Lewis,S., Thomas,P.D. and Ashburner,M. (2003) Assessment of genome- GO users to report problems or request features for the AmiGO wide protein function classi®cation for Drosophila melanogaster. Genome Res., 13, 2118±2128. browser, and to submit suggestions for additions and changes to 10. Pouliot,Y., Gao,J., Su,Q.J., Liu,G.G. and Ling,X.B. (2001) DIAN: a the ontologies; items can be assigned to individuals or groups novel algorithm for genome ontological classi®cation. Genome Res., 11, within the GO Consortium who have relevant expertise. This 1766±1779. system allows the submitter to track the status of a suggestion, 11. Okazaki,Y., Furuno,M., Kasukawa,T., Adachi,J., Bono,H., Kondo,S., both online and by email, allows other users to see what Nikaido,I., Osato,N., Saito,R. and Suzuki,H. et al. 
(2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length changes are currently under consideration, and archives all cDNAs. Nature, 420, 563±573. entries and associated communications. 12. Xie,H., Wasserman,A., Levine,Z., Novik,A., Grebinskiy,V., Shoshan,A. and Mintz,L. (2002) Large scale protein annotation through Gene Mailing lists. GO also has several mailing lists, covering Ontology. Genome Res., 12, 785±794. general questions and comments, the GO database and 13. Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D., Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F. et al. software, and summaries of changes to the ontologies. The (2000) The genome sequence of Drosophila melanogaster. Science, 287, lists are described at http://www.geneontology.org/GO_ 2185±2195.