<<

Journal of Biomedical Informatics 39 (2006) 314–320 www.elsevier.com/locate/yjbin

Beyond the data deluge: Data integration and bio-ontologies

Judith A. Blake *, Carol J. Bult

The Jackson Laboratory, Bar Harbor, ME, USA

Received 31 August 2005 Available online 21 February 2006

Abstract

Biomedical research is increasingly a data-driven science. New technologies support the generation of genome-scale data sets of sequences, sequence variants, transcripts, and ; genetic elements underpinning understanding of biomedicine and disease. Infor- mation systems designed to manage these data, and the functional insights (biological knowledge) that come from the analysis of these data, are critical to mining large, heterogeneous data sets for new biologically relevant patterns, to generating hypotheses for experimen- tal validation, and ultimately, to building models of how biological systems work. Bio-ontologies have an essential role in supporting two key approaches to effective interpretation of genome-scale data sets: data integration and comparative genomics. To date, bio-ontologies such as the Ontology have been used primarily in community genome databases as structured controlled terminologies and as data aggregators. In this paper we use the Gene Ontology (GO) and the Mouse Genome Informatics (MGI) database as use cases to illustrate the impact of bio-ontologies on data integration and for comparative genomics. Despite the profound impact ontologies are having on the digital categorization of biological knowledge, new biomedical research and the expanding and changing nature of biological infor- mation have limited the development of bio-ontologies to support dynamic reasoning for knowledge discovery. Ó 2006 Elsevier Inc. All rights reserved.

Keywords: Gene Ontology; Mouse Genome Informatics; Data integration; Biomedical ontologies; Bio-ontologies

1. Introduction power of experimental studies in model organisms to illu- minate aspects of human biology and disease [9]. The era of genome analysis was launched with such In addition to the analysis of genome features, new tech- landmark publications as the first large-scale survey of nologies, such as microarrays and massively parallel signa- transcriptional activity in the using ture sequencing, allow researchers to investigate global expressed sequence tags (ESTs) [1] and the first complete transcriptional activity in different cell and tissue types genome of a cellular organism [2]. In relatively quick suc- [10,11]. Technologies that permit comprehensive surveys cession, the complete genome sequences of several major of the content of cells and tissues are rapidly emerg- biomedical model organisms were published including ing and will undoubtedly reveal much about the nature of yeast (Saccharomyces cerevisiae) [3], nematode (Caenorhab- the proteome and the relationship of the genome, tran- ditis elegans) [4], fruit fly (Drosophila melanogaster) [5], and scriptome, and proteome to one another [12,13]. the laboratory mouse (Mus musculus) [6]. With the avail- The challenge of assembling and maintaining a compre- ability of the genome sequences of these model systems hensive catalog of an organism’s genome features is and genome sequence of the human [7,8], biomedical matched by the challenge of managing the range of attri- researchers were able to compare and contrast catalogs of bute information associated with these features. Our genome features. The explosion of sequence-based compar- knowledge about each genome or cellular component is ative genomics made even more apparent the potential and both complex and incomplete. Tens of thousands of scien- tific papers are published each year that correct old infor- * Corresponding author. Fax: +1 207 288 6132. mation and add new details about millions of distinct E-mail address: [email protected] (J.A. Blake). biological entities. One of the major challenges faced by

1532-0464/$ - see front matter Ó 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2006.01.003 J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320 315 the biomedical research community is how to access, ana- rates community input to help prioritize needed changes lyze, and visualize heterogeneous data in ways that lead and improvements. Workshops that include domain and to novel insights into biological processes or that lead to ontology development experts are held on a regular basis the formulation of a hypothesis that can be tested experi- to continue with expansion and improvement of the mentally. Beyond the annotation of each individual gene, resource. transcript, and protein in an organism’s genome, is the The GO Consortium includes the major model organism much larger challenge of detecting and representing the database groups (mouse, fly, yeast, Arabidopsis, worm, networks and interactions of these parts in cells and mul- nematode, and rat) and a major sequence database ti-cellular complexes. Eventually, the characterization of resource (UniProt). In addition, many other genome anno- these networks will allow biomedical researchers to directly tation groups and bioinformatics centers participate both connect gene function to organismal phenotype. in the development of the ontologies and the use of the Coincident with the emergence of genome-scale data GO annotation process for functional annotation systems. generation technologies has been the development of ontol- The GO web site provides access to the ontologies, annota- ogies that are specific for biomedical research domains. tions, documentation, bibliography, and tools to support Ontology-based knowledge representations have been and data analysis using the power of ontological information are being developed in many domains to facilitate informa- structures. tion extraction and retrieval and to support data interoper- GO has become a community standard for genome ability in biology [14,15]. More than dictionaries or annotations and, together with other ontologies for biology thesauri, bio-ontologies formally represent relationships such as anatomies [17,18] and cell types [19] provides between defined biological terms such that the terminolo- semantic and ontological representations for biology. gies can be used both by humans and by computers to These resources are now essential for the information man- exchange and explore information. agement of genetics and genomics data relevant to biomed- Data integration and comparative genomics are two icine. The Open Biological Ontologies resource (http:// common approaches to interpreting genomic data. In this obo.sourceforge.net/) provides a repository of community paper, we explore the impact of bio-ontologies for both bio-ontologies. data integration and comparative genomics. Our viewpoint is a decidedly practical one. The issues we address are not 3. The Mouse Genome Informatics database: representative about the challenges of ontology development per se; rath- database er we focus on how a specific ontology development effort, the Gene Ontology, is being used to facilitate data integra- Mouse Genome Informatics (MGI; http://www. tion in the organism-specific Mouse Genome Informatics informatics.jax.org/) is the recognized community database (MGI) database and, in the context of information sys- for the laboratory mouse [20].1 MGI integrates genetic and tems, how it is designed to facilitate sharing biological genomic data for the mouse in order to fulfill its mission as knowledge across organisms. We begin by briefly describ- an informatics resource that facilitates the use of the mouse ing the Gene Ontology (GO) project. We then describe as a model system for understanding human biology and the Mouse Genome Informatics (MGI) resource and the disease processes. MGI provides extensive information role that GO plays in providing semantic consistency to about mouse , sequences, alleles, mutant phenotypes, functional annotations for mouse genes. Finally, we discuss strain polymorphisms, gene and protein function, develop- some of the challenges before us as we move beyond mental gene expression patterns, and curated mammalian descriptive models of biological systems to predictive homology data. MGI is also a platform for computational models. assessment of integrated biological data with the goal of identifying candidate genes associated with complex 2. The Gene Ontology: representative Bio-Ontology phenotypes. Data integration is a primary focus of MGI [21]. Data The Gene Ontology (GO) project was founded to integration means identifying disparate data that describe advance the development and utilization of bio-ontologies the same biological entity (e.g., gene, transcript, protein, and semantic standards for molecular biology [16]. The GO etc.). Data integration is critical to knowledge discovery Consortium, the coordinating group for this project, is because it allows different information about the same entity developing structured controlled terminologies of terms to be related in new ways [22]. Many hurdles to achieving for the biological domains of molecular function, biologi- data integration exist. For example, the same gene can be cal process, and cellular location of gene products. The referred to by multiple different names in the literature. Or, GO Consortium members develop the structured set of the same name can be applied to more than one gene. Differ- terms, define the relationships and paths between terms, and then annotate genes and gene products using the terms. 1 Similar model organism genome databases are maintained for yeast The ontologies and annotations are provided publicly as (http://www.yeastgenome.org/), Drosophila (http://flybase.bio.indiana. part of the GO database resource [http://www.geneontolo- edu/), C. elegans (http://www.wormbase.org/), zebra fish (http://zfin. gy.org/]. Gene Ontology is a work in progress and incorpo- org/), and rat (http://rgd.mcw.edu/) among others. 316 J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320 ent data about the same gene can be obtained from various data sets assembled from data loads and the biomedical lit- sources. Sequence data obtained from a variety of resources erature. In this context, the ontologies are used as con- may include overlapping sequence sets identified by different trolled terminologies for the assignment of consistent primary identifiers. In addition to data integration challeng- functional annotations. The assignment of appropriate es, MGI must deal with incomplete data and inconsistent GO terms to genes in MGI enables researchers to retrieve data and conflicts in data that appear over time. from the database all genes that are associated with a spe- Key strategies employed by the MGI staff for data inte- cific biological process, molecular function or cellular gration include use of unique, permanent accessioned iden- location. tifiers for major biological entities to preserve cross- An example of how the Gene Ontology is used to facil- reference integrity and the use of controlled terminologies itate data integration is illustrated by the functional anno- and ontologies for functional annotations to achieve tation of the mouse hexosaminidase A gene (Hexa, semantic normalization both with mouse-specific data as MGI:96073) in MGI. The gene product of the Hexa gene well as across multiple model organism databases. These is involved in hydrolase activity and one of the GO molec- integration processes enable comprehensive and accurate ular function terms assigned to this gene by MGI curators recall of diverse kinds of data. The combination of refer- is b-N-acetylhexosaminidase activity (GO:0004563) ence integrity and semantic normalization enable complex (Fig. 1). Data from seven publications support this func- queries that facilitate knowledge discovery. tional annotation for Hexa [23–29]. Although each of these papers supports the GO term assignment, the nature of the 4. GO at MGI evidence in each paper differs and there is substantial var- iation in the words that the authors of these papers use The incorporation of the Gene Ontology within MGI to describe the function of Hexa. In this case, GO is used provides semantic integration and access to heterogeneous primarily as a controlled terminology to enforce standards

Fig. 1. GO annotations for the Hexa gene in MGI. The molecular function annotation of b-N-acetylhexosaminidase activity is supported by multiple references and several different lines of evidence. The reference numbers are MGI journal numbers and provide a link to PubMed. The rows of this table provide a subset of the GO annotation record [not included on the web page are GO term Accession ID, Taxon ID, etc.]. One row is emphasized for example. The columns represented are (A) GO domain—here ‘Molecular Function’; (B) GO Term—here ‘b-N-acetylhexosaminidase activity’; (C) GO Evidence Code—here ‘IGI’’ for ‘Inferred from Genetic Interaction.’ This provides a high-level characterization of the type of experimental or computation method used to provide the data for this annotation; (D) Inferred From—here ‘MGI:96074’ is the MGI public Accession ID for the gene with which Hexa interacted in the experiment reported by this annotation. (E) Ref(s)—here ‘J:36305’ is the MGI journal identifier number. This is the citation for the experiment that informs this annotation row. The MGI journal records link to PubMed. The annotation row, then, provides the GO annotation for the gene as well as links to supporting evidence and citations. J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320 317 in how the functions of genes are described. Such naming from the biomedical literature, the curation of the literature standards are one of the keys to data integration [21].In also provides a primary mechanism for continually updat- the Hexa example above, the advantage of the ontological ing the GO. All bio-ontologies must evolve as biological structure of GO is not readily apparent. It could be argued knowledge about genes and gene products emerges. The that a simple flat list of terms would suffice to achieve inte- curation of literature reveals new and changing knowledge gration in this and similar cases. However, the advantage about biological systems and can result in changes to an of the ontological structure of GO becomes evident when ontology. For example, a recent emphasis by the MGI the retrieval of integrated data becomes necessary or desir- curation staff to fully annotate genes involved with the bio- able. For example, a researcher may wish to retrieve all logical process of ‘regulation of blood pressure’ in the mouse genes whose products are involved in ‘lipid metabo- mouse resulted in 43 new terms being added to the GO. lism.’ If a simple flat list of controlled terminology terms is One immediate consequence of the data aggregation used for functional annotation, each term must be explicit- made available through structured controlled terminolo- ly associated with the gene product in order for the Hexa gies is the ability to develop statistical measures that eval- gene to be returned in response to this query. Because of uate the significance of experimental data sets relative to the structure of the GO, however, it is possible to traverse the complete set of information about a given system. the relationships of the terms in the GO hierarchy to For example, the analysis of data generated by such gen- retrieve all of the genes that are annotated specifically to ome-scale technologies as gene microarray data illustrates the term, ‘lipid metabolism,’ but also to any of the child the interplay between increasingly global experimental par- terms, including ‘sequestering of lipid,’ the actual term to adigms and the critical role that biomedical ontologies play which Hexa is annotated (Fig. 2). in interpretation of large-scale data sets [11]. In many A further use of GO annotation sets is to provide refer- microarray experiments, the outcome is a list of tens to ence information that can be used for initial functional hundreds of genes that are differentially expressed in two annotations. The comprehensive annotation of well-known or more samples under different experimental conditions. genes provides the reference information that can be used While gleaning biological context from a list of genes is a to provide initial functional annotations for relatively daunting task, it is made tractable by knowledgebases uncharacterized genes via comparative sequence and and ontologies. Indeed, a number of software applications domain analysis (Fig. 3). These computational assignments have emerged in recent years that leverage the availability are identified by the evidence and citation information pro- of Gene Ontology annotations to help researchers identify vided in the annotations, and can thus be included or not biological themes in their gene lists [30]. as appropriate for the use of the annotation sets. These ini- GO is not the only bio-ontology or structured controlled tial annotation assignments can be extremely useful for terminology included in the MGI resource. The Mouse experimental biologists making crucial decisions as to allo- Anatomy [18] and the Mammalian Phenotype Ontology cation of research resources for further characterization of [31] are also used for gene and gene product annotation specific genes. While the use of GO for functional annota- by MGI curators. These, along with multiple sets of con- tions is a powerful tool for data integration of information trolled terminologies, including most prominently the offi-

Fig. 2. A portion of the GO hierarchy showing the position of ‘lipid metabolism’ in the Biological Process ontology. Note that there are, in MGI, 444 mouse genes with 812 annotations assigned to this term or one of its subterms. At the level of the ‘sequestering of lipid’ term, there are only eight genes with nine annotations. One of the genes annotated to ‘sequestering of lipid’ term is Hexa. 318 J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320

Fig. 3. The use of structured controlled terminologies enables robust data aggregation and recovery of information. As illustrated here, reviewed computational analysis (RCA) and inferred from electronic annotation (IEA) assignments of GO terms provide initial annotations for 12 relatively uncharacterized RIKEN gene models to ‘lipid metabolism’ and its subterms. These annotations provide experimental biologists with initial assessments of gene characteristics and help in the design of future experiments. cial nomenclature for mouse genes, strains, and alleles, pro- and mouse] that have been annotated to this term or one vide the semantic structure to support the data integration of its subterms. This set of annotations is independent of and retrieval needs of a major bioinformatics resource. the evolutionary relationships between the gene products (although they might actually be homologs), reflecting, 5. Ontologies and comparative genomics rather, the shared biology. Thus, we can use the combined expertise in biological annotations for diverse genomes to Organism-centric databases such as MGI will continue provide a robust set of gene products that share a defined to play an important role in biomedical research. Recently, aspect of their biology as another measure of comparative the availability of sequenced genomes has driven the use of biology. sequence-based comparative methods. These approaches have provided tremendous insights into genome organiza- 6. Biological knowledge in computable form tion and have revealed novel biological features such as conserved non-coding regions that are often completely The examples of data integration and recovery as repre- conserved at the nucleotide level from human to pufferfish sented by the MGI and GO systems demonstrate both the [32]. promise and challenge of data representations. On the one Comparative biology also includes comparing and con- hand, the use of ontologies and the careful mapping of data trasting biological knowledge about the functions and roles across experimental systems provide a comprehensive mech- that genes and gene products play in different organisms. anism to recover robust sets of relevant data in response to GO has provided a common language for describing uni- complex queries. On the other hand, these representations versal biological processes which, in turn, has facilitated lack the precision in representing the full context surround- both in-depth understanding of biology within a single ing knowledge that is described in a typical journal article organism and for comparing biological processes across comprised of unstructured text, tables, and figures. To fully multiple organisms. Consistent use of biological terms represent the complexities of information presented in a bio- across different organisms means that researchers can medical study require a granularity of data representation retrieve data for a single organism or for multiple organ- for heterogeneous data that is difficult to acquire in comput- isms accurately and consistently. Being able to compare able form. For example, the presentation of phenotype data data across species makes it possible to use and play to from a study of a targeted mutation of a single gene in the the strengths of all the different model organisms to study laboratory mouse requires the complete and controlled the function of genes. For example, GO annotations for description of the genetic construct used, the genotypes of many genomes are integrated in the GO database the mice studied, the details of the experimental assays, the [www.godatabase.org]. Searching the GO database for explanation of the phenotypic results, and the narrative of annotations to the term ‘sequestering of lipid,’ the term dis- the context for the study in relationship to current knowl- cussed earlier in this paper, returns 35 gene products from edge. Indeed, changing any one of these contexts can change six organisms [budding yeast, fission yeast, weed, fly, rat, the outcome of an experiment and complicate the represen- J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320 319 tation of knowledge about a gene or gene product. These the bio-ontologies will be of critical importance to the suc- accounts of the experiment as reported in the literature also cess of the data integration and retrieval. include the perspective of the study in relation to human biol- Ultimately, the construction and use of bio-ontologies ogy and diseases, often the genesis of the investigation in the as formal, logical systems will enable computational rea- first place. Evidence ontologies under development are a crit- soning over complex semantic systems. The extension of ical next step in the maturation of the bio-ontology field. bio-ontologies to model or represent biological systems fac- There are active communities devoted to the representation es formidable challenges, including the representation of of experimental information for microarray [33], toxicoge- environment, scale, and temporal–spatial context (i.e., nomic [34], and other experimental data, and more efforts complexity), as well as the incompleteness and ever chang- are under discussion. ing nature of biological knowledge. The rapid advances in The current challenge for data management is to facili- development of biomedical ontologies brings both encour- tate the intersection of the expressiveness of the biomedical agements and challenges to our efforts to integrate and literature with the data retrieval and correlation capabili- access data for knowledge discovery. ties of the structure information tracked with ontologies and stored in databases. The biomedical literature brings Acknowledgments the expertise and experimental results of scientists into the context of the current knowledge. In the age of geno- J.A.B. and C.J.B. are funded through their work with mic-sized data sets, the biomedical literature is increasingly the GO and MGI Consortiums. The GO project is funded archaic as a form of transmission of scientific knowledge by NIH/NHGRI HG02273. MGI database resources are for computers, yet the written word is essential for humans. funded by NIH/NHGRI (HG00330, HG02273), NIH/ Can we improve the computability over biomedical infor- NICHD (HD33745), and NCI (CA89713). The genesis of mation by using bioinformatics tools at the level of bio- this paper was presented at the Workshop on Ontology medical publications? Much work in this area has been and Biomedical Informatics (Rome, 2005) supported by undertaken by the natural language processing (NLP) com- the International Medical Informatics Association. munity with critical assessments undertaken to evaluation information retrieval using GO annotations [35,36], but References the core indexing of the literature to controlled terminolo- gies is still primarily undertaken by curators at bioinfor- [1] Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, matics resources such as the NLM Mesh indexers [37]. Lee NH, et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA Ultimately, the integration of structured terminologies sequence. Nature 1995;77(Suppl.):3–174. and bio-ontologies into the scientific reports as part of [2] Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, the publication process may provide the means to capture Kerlavage AR, et al. Whole-genome random sequencing and assem- both the human readable and the computationally tracta- bly of Haemophilus influenzae Rd. Science 1995;269:496–512. ble forms of semantic understanding. [3] Wood V, Gwilliam R, Rajandream MA, Lyne M, Lyne R, Stewart A, et al. The genome sequence of Schizosaccharomyces pombe. Nature 2002;415:871–80. 7. The role and future of biomedical ontologies [4] The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science Biomedical ontologies provide the semantic structure 1998;282:2012–18. that supports integration and comparison of large, com- [5] Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, et al. The genome sequence of Drosophila melano- plex data sets in biology. To date, the practical focus for gaster. Science 2000;287:2185–95. ontology development and application has been on the [6] Mouse Genome Sequence Consortium and Mouse Genome Analysis capture of domain knowledge such as function and loca- Group. Initial sequencing and comparative analysis of the mouse tion. Ontologies support uniform data encoding thus genome. Nature 2002;420:520–62. providing semantic integration of information derived [7] Lander ES et al. Initial sequencing and analysis of the human genome. Nature 2001;409:860–921. from multiple resources such as free-text publications or [8] Venter JC et al. The sequence of the human genome. Science functional annotations of sequences from multiple provid- 2001;291:1304–51. ers. Such semantic integration facilitates the exchange and [9] Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson integration of data between resources and is essential for CR, Hariharan IK, et al. The genome sequence of Drosophila future development of semantic web services. Attention melanogaster. Science 2000;287:2185–95. [10] Oudes AJ, Roach JC, Walashek LS, Eichner LJ, True LD, Vessella to standards of ontology development [38] ensure these RL, et al. Application of Affymetrix array and Massively Parallel domain-specific ontologies are useful as terminologies for Signature Sequencing for identification of genes involved in prostate annotation systems and that they can be used in formal cancer progression. BMC Cancer 2005;5:86. ontological representations in biomedicine as these [11] Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis resources develop. Bio-ontologies are having a significant N, et al. The functional landscape of mouse gene expression. J Biol 2004;3:21. Epub 2004 Dec 6. impact on data integration, access, and analysis through [12] Elias JE, Haas W, Faherty BK, Gygi SP. Comparative evaluation of their use in capturing and structuring biological data. As mass spectrometry platforms used in large-scale proteomics investi- the genomic and biomedical informatics systems coalesce, gations. Nat Methods 2005;2:667–75. 320 J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320

[13] Guan JQ, Chance MR. Structural proteomics of macromolecular compartments during B cell receptor-mediated cell activation. J Exp assemblies using oxidative footprinting and mass spectrometry. Med 2002;4:461–72. Trends Biochem Sci 2005;Aug 2. [26] Sango K, McDonald MP, Crawley JN, Mack ML, Tifft CJ, Skop E, [14] Bard J, Rhee S. Ontologies in biology: design, applications and future et al. Mice lacking both subunits of lysosomal beta-hexosaminidase challenges. Nat Rev Genet 2004;5:213–22. display gangliosidosis and mucopolysaccharidosis. Nat Genet [15] Stevens R, Goble CA, Bechhofer S. Ontology-based knowledge 1996;14:348–52. representation for bioinformatics. Brief Bioinform 2000;1:398–414. [27] Yamanaka S, Johnson MD, Grinberg A, Westphal H, Crawley JN, [16] The Gene Ontology Consortium. Gene Ontology: tool for the Taniike M, et al. Targeted disruption of the Hexa gene results in unification of biology. Nat Genet 2000;25:25–9. mice with biochemical and pathologic features of Tay–Sachs disease. [17] Rosse C, Mejino JL. A reference ontology for biomedical informat- Proc Natl Acad Sci USA 1994;21:9975–9. ics: the Foundational Model of Anatomy. J Biomed Inform [28] Phaneuf D, Wakamatsu N, Huang JQ, Borowski A, Peterson AC, 2003;36:478–500. Fortunato SR, et al. Dramatically different phenotypes in mouse [18] Hayamizu TF, Mangan M, Corradi JP, Kadin JA, Ringwald M. The models of human Tay–Sachs and Sandhoff diseases. Hum Mol Genet Adult Mouse Anatomical Dictionary: a tool for annotating and 1996;5:1–14. integrating data. Genome Biol 2005;6:R9. Epub 2005 Feb 15. [29] Guidotti JE, Mignon A, Haase G, Caillaud C, McDonell N, Kahn A, [19] . et al. Adenoviral gene therapy of the Tay–Sachs disease in hexos- [20] Blake JA, Eppig JT, Richardson JE, Bult CJ, Kadin JA, and the aminidase A-deficient knock-out mice. Hum Mol Genet Mouse Genome Database Group. The Mouse Genome Database 1999;5:831–8. (MGD): Integration Nexus for the Laboratory Mouse. Nucleic Acids [30] . Res 2001;29:91–4. [31] Smith CL, Goldsmith CA, Eppig JT. The Mammalian Phenotype [21] Bult CJ. Data integration standards in model organisms: from Ontology as a tool for annotating, analyzing and comparing genotype to phenotype in the laboratory mouse. Drug Discov Today phenotypic information. Genome Biol 2005;6:R7. Epub 2004 Dec 15. 2002;1:163–8. [32] Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman [22] Bult CJ, Richardson JE, Blake JA, Kadin JA, Ringwald M, Eppig WW, et al. Arrays of ultraconserved non-coding regions span the loci JT, and the Mouse Genome Informatics Staff. Mouse genome of key developmental genes in vertebrate genomes. BMC Genomics informatics in a new age of biological inquiry. In: Proceedings of the 2004;5:99. IEEE international symposium on bio-informatics and biomedical [33] . engineering; 2000. p. 29–32. [34] . [23] Cohen-Tannoudji M, Marchand P, Akli S, Sheardown SA, Puech JP, [35] Hirschman L, Yeh A, Blaschke C, Valencia A. Overview of BioCre- Kress C, et al. Disruption of murine Hexa gene leads to enzymatic AtIvE: critical assessment of information extraction for biology. BMC deficiency and to neuronal lysosomal storage, similar to that observed Bioinformatics 2005;6(Suppl. 1):S1. Epub 2005 May 24. in Tay–Sachs disease. Mamm Genome 1995;6:844–9. [36] Camon EB, Barrell DG, Dimmer EC, Lee V, Magrane M, Maslen J, [24] Yuziuk JA, Bertoni C, Beccari T, Orlacchio A, Wu YY, Li SC, et al. et al. An evaluation of GO annotation retrieval for BioCreAtIvE and Specificity of mouse GM2 activator protein and beta-N-acetylhexos- GOA. BMC Bioinformatics 2005;6(Suppl. 1):S17. aminidases A and B. Similarities and differences with their human [37] . counterparts in the catabolism of GM2. J Biol Chem 1998;273:66–72. [38] Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, et al. [25] Lankar D, Vincent-Schneider H, Briken V, Yokozeki T, Raposo G, Relations in biomedical ontologies. Genome Biol 2005;6:R46. Epub Bonnerot C. Dynamics of major histocompatibility complex class II 2005 Apr 28.