Published online 23 October 2008
Nucleic Acids Research, 2009, Vol. 37, Database issue D555–D559
doi:10.1093/nar/gkn788

FlyBase: enhancing Drosophila Gene Ontology annotations

Susan Tweedie1,*, Michael Ashburner1, Kathleen Falls2, Paul Leyland1, Peter McQuilton1, Steven Marygold1, Gillian Millburn1, David Osumi-Sutherland1, Andrew Schroeder2, Ruth Seal1, Haiyan Zhang2 and The FlyBase Consortium†

1Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK and 2The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA

Received September 19, 2008; Revised October 8, 2008; Accepted October 9, 2008

*To whom correspondence should be addressed. Tel: +44 1223 333 963; Fax: +44 1223 333 992; Email: [email protected]

Present address: Ruth Seal, EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK

†The FlyBase Consortium comprises: FlyBase-Harvard: W. Gelbart, L. Bitsoi, M. Crosby, A. Dirkmaat, D. Emmert, L. S. Gramates, K. Falls, R. Kulathinal, B. Matthews, M. Roark, S. Russo, A. Schroeder, S. St Pierre, H. Zhang, P. Zhou and M. Zytkovicz; FlyBase-Cambridge: M. Ashburner, N. Brown, P. Leyland, P. McQuilton, S. Marygold, G. Millburn, D. Osumi-Sutherland, R. Stefancsik, S. Tweedie and M. Williams; and FlyBase-Indiana: T. Kaufman, K. Matthews, J. Goodman, G. Grumbling, V. Strelets and R. Wilson.

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

FlyBase (http://flybase.org) is a database of Drosophila genetic and genomic information. Gene Ontology (GO) terms are used to describe three attributes of wild-type gene products: their molecular function, the biological processes in which they play a role, and their subcellular location. This article describes recent changes to the FlyBase GO annotation strategy that are improving the quality of the GO annotation data. Many of these changes stem from our participation in the GO Reference Genome Annotation Project, a multi-database collaboration producing comprehensive GO annotation sets for 12 diverse species.

INTRODUCTION TO GENE ONTOLOGY ANNOTATION

What gene products do and where they do it are fundamental questions for biologists no matter what organism is being studied. The Gene Ontology (GO; www.geneontology.org) was established 10 years ago as a means of summarizing this information consistently across different databases by using a common set of defined controlled vocabulary terms. FlyBase was one of three founding members of the Gene Ontology Consortium [GOC; (1)] but the GO has since been adopted by many model organism databases, making comparison of gene function between diverse species feasible. The GO also encodes relationships between terms, which allows for efficient searching and computational reasoning. For example, a search for all gene products with ‘kinase activity’ will automatically gather those products labelled with the more specific types of kinase activity.

GO annotation comprises at least three components: a GO term that describes molecular function, biological role or subcellular location; an ‘evidence code’ that describes the type of analysis used to support the GO term (Table 1); and an attribution to a specific reference. There may also be supporting evidence for the choice of GO term in the form of database cross-references; for instance, a gene function may be ‘inferred from genetic interaction’, in which case an identifier for the interacting gene will be included. Qualifiers may also be included to modify the annotation; currently these are: ‘colocalizes_with’, ‘contributes_to’ and ‘NOT’. The ‘NOT’ qualifier negates the annotation. FlyBase uses ‘NOT’ sparingly for cases where there is a prior expectation that a GO term should apply to a gene product but evidence exists to the contrary. Another annotation component, one that FlyBase has neglected until recently, is date. FlyBase now records an accurate date for each GO annotation reflecting when the annotation was originally made or last reviewed by a curator. GO annotations made prior to implementing the date component are dated 20060803. Examples of GO annotations are shown in Table 2.

GO annotation is useful for both small-scale and large-scale analyses. It can provide a first indication of the nature of a gene product and, in conjunction with evidence codes, point directly to papers with pertinent experimental data. However, it is particularly useful in the analysis of large data sets such as the output of a microarray experiment. For instance, the AmiGO GO Term Enrichment Tool (amigo.geneontology.org/cgi-bin/amigo/term_enrichment) will show significant shared GO terms or parents of those GO terms associated with a list of genes. Whatever the scale of use, for the best results it is important that GO annotation be as complete and accurate as possible. This article describes recent changes to GO annotation procedures in FlyBase that aim to improve the quality of our functional annotations.

GO ANNOTATION IN FLYBASE

GO annotations appear on the Gene Report page in FlyBase (Figure 1) and are also available as a downloadable file (gene_association.fb.gz in the genes section of the Precomputed Files page accessed from the Files menu on the FlyBase home page). This file is in the standard GO gene_association file format (www.geneontology.org/GO.format.annotation.shtml) and corresponds to the data that are submitted to the GOC for a given FlyBase release. GO data are searchable in FlyBase using both TermLink (retrieves data for all Drosophila species) and QueryBuilder (species and other criteria can be specified) (2).

Table 3 shows a summary of the current FlyBase GO annotations; 10 131 (72%) of the 14 029 Drosophila melanogaster protein-coding genes have at least one GO annotation and 9403 (67%) have annotations with specific (non-root) GO terms.

The GO is dynamic and its content can change on a daily basis. Most of these changes are the addition of new terms but other alterations, such as making a term obsolete, a term name change or a change to ontology structure, require us to revisit our existing annotations to choose alternative GO terms or check that the annotations are still valid. Rather than attempt to keep completely up-to-date with this moving target, FlyBase loads a new version of the GO every one or two releases of FlyBase. Both new and existing annotations are made consistent with this ‘frozen’ version for a given release. The frozen versions of the GO and of all other ontologies used in FlyBase are available in the Ontology Terms section of the Precomputed Files page.

The GO annotation set is submitted to the GOC at the same time as a new version of FlyBase is released. Users should be aware that there may be a few differences between the data downloadable from FlyBase and the file downloadable from the GO website. Differences arise because the submitted files are screened for errors, and lines that do not meet a series of quality controls are removed. For instance, any annotations with terms that have become obsolete since the ontology was frozen at FlyBase will be stripped out of the files available from the GO site.

Table 1. Evidence codes used in GO annotation

Manually assigned evidence codes
    Experimental evidence codes
        Inferred from Direct Assay (IDA)
        Inferred from Physical Interaction (IPI)
        Inferred from Mutant Phenotype (IMP)
        Inferred from Genetic Interaction (IGI)
        Inferred from Expression Pattern (IEP)
    Computational analysis evidence codes
        Inferred from Sequence or Structural Similarity (ISS)
        Inferred from Sequence Orthology (ISO)a
        Inferred from Sequence Alignment (ISA)a
        Inferred from Sequence Model (ISM)a
        Inferred from Genomic Context (IGC)
        Inferred from Reviewed Computational Analysis (RCA)
    Author statement evidence codes
        Traceable Author Statement (TAS)
        Non-traceable Author Statement (NAS)
    Curatorial statement codes
        Inferred by Curator (IC)
        No biological Data available (ND)
Automatically assigned evidence codes
    Inferred from Electronic Annotation (IEA)

a Three new subcategories of the ISS evidence code used to assign GO terms based on sequence similarity. ISO is used when the similar sequences are considered to be orthologous. ISA is used where there is extensive sequence alignment but the sequences are not known to be orthologous. ISM is used when a sequence model has been generated from a set of related sequences, e.g. hidden Markov models for transmembrane regions. Full documentation of evidence codes together with how they are used in annotation can be found on the GO website (http://www.geneontology.org/GO.contents.doc.shtml).

GO CURATION PIPELINE

Most GO data in FlyBase are entered via our paper-by-paper literature curation process; curators read papers and choose the appropriate terms based on the data described. For gene products that have not yet been experimentally characterized, GO terms are assigned manually if there is significant sequence similarity to gene products of known function, using the ‘inferred from sequence similarity’ evidence code or one of the new more specific versions of this code (Table 1). When no specific term can be assigned from either published data or sequence homology, a gene is annotated