Comparative and Functional Genomics Comp Funct Genom 2004; 5: 618–622. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.448 Conference Editorial Standards and Ontologies for Functional Genomics 2 University of Pennsylvania, Philadelphia, PA, USA, 23–26 October 2004

Midori A. Harris* and Helen Parkinson European Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

*Correspondence to: Midori A. Harris, EMBL-EBI- The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. E-mail: [email protected]

Received: 26 November 2004 Accepted: 26 November 2004

Like the first conference, held in 2002, on Stan- and exchanging biological data, experimental meta- dards and Ontologies for Functional Genomics data (data about data) and ontologies themselves. (www.sofg.org), the second SOFG meeting brou- SOFG2 examined the progress of the world ght together computer scientists, ontologists and of ontology and standards development in the biomedical scientists in Philadelphia to examine the two years since SOFG1; a number of significant state of the art in ontology technology, develop- changes soon became evident. ment and emerging standards for biomedicine. Notably, the ontology community has adopted Briefly, an ontology is a means of formaliz- ing knowledge about a subject; at a minimum, the I3C standard Web Ontology Language (OWL; + an ontology must include terms (concepts) rele- Horrocks et al., 2003), the successor to DAML vant to a domain, definitions for the terms, and OIL (Stevens et al., 2002), as its lingua franca. In defined relationships between the terms. Ontolo- parallel with the move to OWL, the past two years gies may be represented in any of a number of have brought increasingly sophisticated tools for formats and systems, with varying degrees of com- building and applying ontologies to the community. plexity, and may range in size from hundreds to Presentations and discussions at SOFG2, summa- hundreds of thousands of concepts. In biomedicine, rized briefly below, identified several more com- ontologies support the organization and manage- mon issues and themes. ment of large amounts of data, permitting sophisti- In her keynote presentation, Carole Goble used cated searches and allowing database annotations Shakespeare’s Romeo & Juliet as an analogy to be standardized. The standards aspect of the to illustrate the sociological (we currently know conference focused on content standards, i.e. what information should be captured about a biologi- of no romantic) interactions between ontology cal concept or an experiment. This issue of ‘what developers coming from differing perspectives. As should be said’ leads naturally to the considera- described in detail in the accompanying review tion of how things can be said, thus forging con- (Goble and Wroe, 2004), computer scientists and nections between standards and ontologies. Format philosophers (the ‘Montagues’) emphasize for- was touched upon in the context of representing mal structures, whereas domain experts such as

Copyright  2005 John Wiley & Sons, Ltd. Conference editorial: SOFG2 619 biologists (the ‘Capulets’) have preferred less for- with GO, OBO has grown to encompass over 40 mal, more pragmatic approaches, despite the lim- ontologies, covering diverse biological domains. itations inherent in informal systems. The Mon- The availability of certain ontologies, such as those tague/Capulet analogy struck a chord with partici- for anatomical terms or chemical substances, facili- pants, as several speakers declared their allegiance; tates the creation and maintenance of combinatorial notably, many proudly claimed connections with concepts. Many types of biological knowledge can both formalist and pragmatist camps. Although only be adequately represented by such compound the relationship between the formal and pragmatic concepts; an example particularly relevant to SOFG camps has historically been marked by tension and is that of modelling phenotypes, which requires conflict, an important theme emerging from SOFG2 concepts from anatomy, experimental procedures is that of growing mutual interest and respect, as (assays) and results, among others. GO and OBO computer scientists become attracted to the com- have also given rise to the development of OBOL, plex use cases that biology provides, and biologists a language for formalizing combinatorial terms in come to appreciate the practical benefits of for- OBO ontologies. mal ontological approaches. The final outcome may The session on Ontologies for Biological Sys- thus be happier for bio-ontologists than for Shake- tems began with two perspectives on biological speare’s lovers, thanks to the efforts of our ‘Princes pathways. The Reactome database (www.reacto- of Genomics’ who straddle both communities, and me.org), presented by Peter D’Eustachio, mod- perhaps also to the auspicious location of SOFG2 els the entities and events that make up path- in the city of brotherly love. ways. Minoru Kanehisa described the graph-based Chris Wroe began the first session, on Onto- design of KEGG (Kanehisa et al., 2004). In light logical Systems: Theory and Development, with a of comments made in several talks on the need presentation that followed logically from Goble’s, for an ontology of chemicals, Marcus Ennis gave giving an overview of both theoretical and practi- a very timely presentation on the nascent database cal aspects of ontology development. For example, of Chemical Entities of Biological Interest (ChEBI) an ontology may be represented by a simple struc- (www.ebi.ac.uk/chebi). ChEBI covers chemical ture such as a hierarchy or directed acyclic graph nomenclature, formulae, and structures, and is built (DAG), or in a sophisticated formal structure such upon a chemical ontology that organizes chemical as a description logic (DL) system. The most suit- compounds by both chemical characteristics and able representation depends on the requirements, function, essentially providing a biologist’s view of i.e. on the intended use of the ontology. An ontol- the chemical world. The last talk tied the session to ogy may be converted from a simpler to a more the preceding one: Chris Catton used a description formal representation as needs evolve, as illustrated of the BioImage database (www.bioimage.org) by the example of the GONG project (Wroe et al., and its ontology-based architecture to provide con- 2003), which represented a portion of the Gene text for a discussion of challenges that ontology Ontology (GO) in OWL. developers face, especially in a field such as biol- Two presentations then described uses of OWL ogy, where knowledge changes rapidly and some- to adapt ontologies for particular purposes: Chin- times substantially. tan O. Patel discussed motivations for, and chal- The third session, Ontology Systems Develop- lenges presented by, representing existing biomed- ment: Electronic Demonstrations, underscored the ical domain ontologies in OWL; Karim Nashar considerable progress that has been made since presented a proposal to integrate several ontologies, SOFG1 in developing tools to construct ontolo- including the MGED core ontology (Stoeckert and gies and applying them in biological contexts. Parkinson, 2003) and existing and proposed exten- Mark Musen and Phil Lord presented different sions, to represent experiment metadata. To com- aspects of Proteg´ e´ (Noy et al., 2003): Musen gave plete the session and link with the following ses- an overview of the tool, noting recent develop- sion, summarized the origins ments such as support for multiple simultaneous and aims of the Open Biology Ontologies (OBO; users, and a growing array of plugins that lend formerly GOBO) initiative (obo.sourceforge.net). support for OWL reasoning and ontology manage- Intended to extend the model of community-based ment. Lord then focused on the OWL plugin, which development of publicly available ontologies begun allows Proteg´ e´ to read and write OWL ontologies,

Copyright  2005 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 618–622. 620 M. A. Harris and H. Parkinson and provides an interface for constructing descrip- FMA includes the subcellular anatomical structures tion logic statements. Users who have worked also covered by the GO cellular component ontol- with DL statements in less user-friendly editors ogy; some inconsistencies between the GO cellular will welcome these advances. John Day-Richter component ontology and the FMA have come to reported on recent enhancements in DAG-Edit light, and should be resolved in the near future. (godatabase.org/dev/index.html), which was orig- The Anatomical Dictionary for the Adult Mou- inally created to edit GO and other DAG-structured se (www.informatics.jax.org/searches/anatdict ontologies, and now supports many features of form.shtml), presented by Terry Hayamizu, has more sophisticated representations. DAG-Edit now a somewhat simpler structure than the FMA, and offers a powerful mechanism for searching, filter- is used by the mouse community to annotate their ing and displaying ontology terms, and a wide array data. The anatomy talks continued a theme that of plugins to support managing categories, parents was among the highlights of SOFG1, where dis- and namespaces. Another plugin allows DAG-Edit cussions began that led to the creation of the SOFG to use OBOL; still others track change history Anatomy Entry List (Parkinson et al., 2004), a sim- and provide a graphical display. Barry Zeeberg ple high-level anatomy mapping ontology. Work demonstrated GoMiner (Feng et al., 2003; Zeeberg continues to map between anatomy ontologies. et al., 2003), one of several tools that combines GO Duncan Davidson returned to the recurring data with gene expression data to aid interpretation problem of modelling phenotypes. Two impor- of large-scale data in an ontology-based context. tant issues facing databases are: (a) sharing and These tools and others were then included in an searching phenotype data across species; and ‘electronic poster’ session, in which individual tool (b) mapping between formally decomposed rep- demonstrations occurred. resentations and shorthand phrases familiar to The session on Thesauri, Nomenclatures, and biologists. A simple system using characters, the Biomedical Literature highlighted several attributes and values (CAV) has been proposed aspects of working with free text as well as ontolo- and shows promise. These issues were revisited in gies. As noted by Stuart Nelson, the Unified Medi- a breakout session devoted to the Phenotype and cal Language System (UMLS; McCray and Nelson, Trait Ontology (PATO) (obo.sourceforge.net/cgi- 1995) is not an ontology, but addresses some of the bin/detail.cgi?poav), an ontology of attributes same issues as biological ontologies. UMLS inte- and values designed for phenotype representa- grates information from disparate sources based on tion. Annotations using PATO would refer to other conceptual connections that aim to resolve ambi- ontologies, such as GO, anatomy ontologies, devel- guities in usage. Inderjeet Mani then described opmental stage ontologies and so on, for the char- the PRONTO protein ontology, produced by a tool acters in a CAV system. that automates much of the information gather- Chris Stoeckert reviewed the MGED ontology ing process. Another approach to mining literature (Stoeckert and Parkinson, 2003) for microarray and other data sources came from Winston Hide’s experiment description, noting its relationship to talk on applying the eVOC anatomy ontology to the MAGE object model and to external ontologies. interpret expression data (Hide et al., 2003). Yves In the near future the ontology must be extended Lussier discussed the need to map between dif- to accommodate new technologies and biological ferent ontologies, as well as incorporate data from areas of interest. Both points were considered fur- many different genomic datasets, to deal with phe- ther in a breakout session focused on the extension notype data or the ‘phenome’ on a large scale of the MGED ontology into the areas of functional (Lussier and Li, 2004). genomics, proteomics, environmental biology and The Standards and Protocols session began with toxicogenomics, as well as the need to support the two talks on anatomy ontologies and how they are emerging MAGE2 model, which will move beyond used to integrate anatomy with biological informa- microarrays into functional genomics. tion at different scales and in different experimen- The success of the MGED ontology and the tal contexts. John Gennari described the Foun- ubiquitous MIAME standard for microarray data dational Model of Anatomy (Rosse and Mejino, (Brazma et al., 2001) has inspired analogous efforts 2003), with its focus on human anatomy, for- for other experiments and data types, as illustrated mal structure, and precise definitions. Notably, the by Eric Deutsch’s talk on the MISFISHIE standard

Copyright  2005 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 618–622. Conference editorial: SOFG2 621

(scgap.systemsbiology.net/data/misfishie/; named applications, most of which involve combining after a bathtub toy) for gene expression localiza- ontologies with other types of data. Two presen- tion experiments, and Michael Cary’s presenta- tations focused on the use of GO in combina- tion on BIOPAX (www.biopax.org), an exchange tion with similarity measures: Antonio Sanfilippo format for pathway data. Developed for use by described weighted links between the three ontolo- some 138 pathway databases, BioPax uses the gies of GO that can uncover connections between high-level concepts Physical Entity, Interaction and annotated gene products. In Olivier Bodenrei- Pathway to build a representation of the small der’s presentation, semantic similarity between molecules, complexes and physical interactions that GO terms was used to facilitate gene expression compose biological pathways and process. Exist- analysis. ing ontologies, such as the GO, NCBI taxonomy The remaining talks all highlighted the applica- (www.ncbi.nlm.nih.gov/Taxonomy/taxonomyho- tion of multiple ontologies to an area of interest. me.html) and the Cell Type Ontology (obo.source- Matt Mailman returned to phenotype annotation, forge.net/cgi-bin/detail.cgi?celltype), are being from the perspective of using existing ontologies used wherever possible by the BIOPAX project. to capture desired information in an object model Kai Runte described several standards and an for storage of phenotypic data being developed ontology relevant to proteomics, which are being at NCBI. Gilberto Fragoso reported on the NCI developed as part of the HUPO Proteomics Stan- Thesaurus, which addresses the needs of the can- dards Initiative (psidev.sourceforge.net). The cer research community for disease description; HUPO effort casts a wide net, aiming to stan- also see his review in this issue (Fragoso et al., dardize not only the representation of exper- 2004). Gloria Despacio-Reyes discussed the infor- imental approaches and minimal data require- mation relevant to crop research and the ontolo- ments — much as the MGED ontology and MIA- gies used by the International Rice Research Insti- ME do for the microarray community — but also tute. In a summary of terminology relevant to data formats generated and interpreted by instru- pathology, Roger Brown of GlaxoSmithKline pro- mentation. vided a perspective from a community that has In the Functional Genomics Applications ses- adopted ontologies recently and needs to adopt sion, talks covered various aspects of using ontolo- and integrate several vocabularies to model exper- gies in genomic databases and other genomic con- iments and results in sufficient detail to accom- texts. Biological databases use ontologies covering modate the needs of regulatory organizations and several different subdisciplines of biology; Judith research scientists. Mike Waters described the Blake presented the representative example of the combination of traditional toxicology and phar- Mouse Genome Informatics databases (Bult et al., macology with ‘omics’ scale technologies, and 2004), which provides its users a convenient inter- standards and ontologies that are emerging to face to GO, mouse-specific anatomy and pheno- model toxicogenomics data. The need for stan- type vocabularies, and other controlled vocabu- dardization in toxicology and pathology, where a laries. Anastasia Nikolskaya described the clas- plethora of use cases and complexity brought by sification of proteins by PIR (Wu et al., 2004) long established domains such as toxicology meets superfamilies, and how that approach complements high-throughput technologies, are explored further functional annotation with GO terms. Karen Eil- by Sansone et al. in this issue (Sansone et al., beck presented the Sequence Ontology (SO), which 2004). aims to unify the description of sequence features It is clear that there is a great deal of activity across many sources and many formats. SO covers and cooperation between the various disciplines locatable sequence features (such as exon, promoter that make up the SOFG community. In partic- or binding site) and their properties (sequence ular, the improvements in ontology editors and attributes such as maternally imprinted gene,or the applications that use ontologies will continue sequence and chromosomal variations). SO-based to bear fruit for the bench biologist, who can annotation is described in detail in the accompany- expect to use GO annotations when analysing ing article (Eilbeck and Lewis, 2004). microarray data, retrieve microarray data effi- In the final session, From Resource to Appli- ciently from LIMS systems and repositories, and cation, speakers reported on a wide range of retrieve information efficiently from journal articles

Copyright  2005 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 618–622. 622 M. A. Harris and H. Parkinson by using ontologies. That the biologist is often Hide W, Smedley D, McCarthy M, Kelso J. 2003. Application of unaware of the gory details of an ontology and eVOC: controlled vocabularies for unifying gene expression data. C R Biol 326: 1089–1096. its relative complexity indicates that the com- Horrocks I, Patel-Schneider PF, van Harmelen F. 2003. From munity is making progress with the technol- SHIQ and RDF to OWL: the making of a web ontology ogy. Presentations from SOFG2 are available at language. J Web Semant 1: 7–26. www.sofg.org/meetings/sofg2004/index.html Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res 32: D277–D280. Acknowledgements Lussier YA, Li J. 2004. Terminological mapping for high throughput comparative biology of phenotypes. Pac Symp The authors thank Nancy Place (The Jackson Laboratory) Biocomput 2004: 202–213. and Jim Wolff (University of Pennsylvania) for conference McCray AT, Nelson SJ. 1995. The representation of meaning in organization, and Chris Stoeckert, who was volunteered the 19 UMLS. Methods Inf Med 34: 193–201. as the SOFG2 host. SOFG2 was funded by the National MISFISHIE; scgap.systemsbiology.net/data/misfishie/. Human Genome Research Institute, the National Institute NCBI Taxonomy; www.ncbi.nlm.nih.gov/Taxonomy/taxonomy- for Environmental Health Sciences, the Natural Environ- home.html/. ment Research Council, GlaxoSmithKline and the MGED Noy NF, Crubezy M, Fergerson RW, et al. 2003. Proteg´ e-2000:´ an Society. open-source ontology-development and knowledge-acquisition environment. AMIA Annu Symp Proc 2003: 953. OBO; obo.sourceforge.net. Parkinson HE, Aitken S, Baldock RA, et al. 2004. The SOFG References Anatomy Entry List (SAEL): an annotation tool for functional genomics data. Comp Funct Genom (in press). Anatomical Dictionary for the Adult Mouse; http://www.infor- PATO; obo.sourceforge.net/cgi-bin/detail.cgi?poav. matics.jax.org/searches/anatdict form.shtml. PSI; psidev.sourceforge.net/. BioImage; www.bioimage.org. Reactome; www.reactome.org. BIOPAX; www.biopax.org. Rosse C, Mejino JL. 2003. A reference ontology for biomedical Brazma A, Hingamp P, Quackenbush J, et al. 2001. Mini- informatics: the Foundational Model of Anatomy. J Biomed mum information about a microarray experiment (MIAME) — Inform 36: 478–500. toward standards for microarray data. Nature Genet 95: Sansone SA, Morrison N, Rocca-Serra P, Fostel J. 2004. Stan- 365–371. dardization initiatives in the (eco)toxicogenomics domain: a Bult CJ, Blake JA, Richardson JE, et al. The Mouse Genome review. Comp Funct Genom (in press). Database (MGD): integrating biology with the genome Nucleic SOFG; www.sofg.org. Acids Res 32: D476–D481. Stevens R, Goble CA, Horrocks I, Bechhofer S. 2002. Building Cell Type Ontology; obo.sourceforge.net/cgi-bin/detail.cgi?cell- a bioinformatics ontology using OIL. IEEE Trans Inf Technol type. Biomed 6: 135–141. ChEBI; www.ebi.ac.uk/chebi. Stoeckert C, Parkinson H. 2003. The MGED ontology: a DAG-Edit; godatabase.org/dev/index.html. framework for describing functional genomics experiments. Eilbeck K, Lewis SE. 2004. Sequence ontology annotation guide. Comp Funct Genom 4: 127–132. Comp Funct Genom (in press). Wroe CJ, Stevens R, Goble CA, Ashburner M. 2003. A method- Feng W, Wang G, Zeeberg BR, et al. 2003. Development ology to migrate the gene ontology to a description logic of gene ontology tool for biological interpretation of environment using DAML + OIL. Pac Symp Biocomput 2003: genomic and proteomic data. AMIA Annu Symp Proc 2003: 624–635. 839. Wu CH, Nikolskaya A, Huang H, et al. 2004. PIRSF: family Fragoso G, de Coronado S, Haber M, Hartel F, Wright L. 2004. classification system at the Protein Information Resource. Overview and utilization of the NCI Thesaurus. Comp Funct Nucleic Acids Res 32: D112–D114. Genom (in press). Zeeberg BR, Feng W, Wang G, et al. 2003. GoMiner: a resource Goble CA, Wroe C. 2004. The Montagues and the Capulets. Comp for biological interpretation of genomic and proteomic data. Funct Genom (in press). Genome Biol 4: R28.

Copyright  2005 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 618–622.