Published online 1 November 2009
Nucleic Acids Research, 2010, Vol. 38, Database issue D563–D569
doi:10.1093/nar/gkp871

Ensembl Genomes: Extending Ensembl across the taxonomic space

P. J. Kersey*, D. Lawson, E. Birney, P. S. Derwent, M. Haimel, J. Herrero, S. Keenan, A. Kerhornou, G. Koscielny, A. Kähäri, R. J. Kinsella, E. Kulesha, U. Maheswari, K. Megy, M. Nuhn, G. Proctor, D. Staines, F. Valentin, A. J. Vilella and A. Yates

EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK

Received August 14, 2009; Revised September 28, 2009; Accepted September 29, 2009

*To whom correspondence should be addressed. Tel: +44 1223 494 601; Fax: +44 1223 494 468; Email: [email protected]

© The Author(s) 2009. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Ensembl Genomes (http://www.ensemblgenomes.org) is a new portal offering integrated access to genome-scale data from non-vertebrate species of scientific interest, developed using the Ensembl genome annotation and visualisation platform. Ensembl Genomes consists of five sub-portals (for bacteria, protists, fungi, plants and invertebrate metazoa) designed to complement the availability of vertebrate genomes in Ensembl. Many of the databases supporting the portal have been built in close collaboration with the scientific community, which we consider essential for maintaining the accuracy and usefulness of the resource. A common set of user interfaces (which include a graphical genome browser, FTP, BLAST search, a query-optimised data warehouse, programmatic access, and a Perl API) is provided for all domains. Data types incorporated include annotation of (protein and non-protein coding) genes, cross-references to external resources, and high-throughput experimental data (e.g. data from large-scale studies of gene expression and polymorphism visualised in their genomic context). Additionally, extensive comparative analysis has been performed, both within defined clades and across the wider taxonomy, and sequence alignments and gene trees resulting from this can be accessed through the site.

INTRODUCTION

Since 1995, when the genome of a cellular organism was completely sequenced for the first time, genome sequencing has transformed the biological sciences. Today, more than 1000 genomes have been sequenced, assembled, annotated and deposited in the public nucleotide archives; numerous other genomes exist in states of partial assembly and annotation; thousands of viral genome sequences have also been generated. Moreover, the increasing use of high-throughput sequencing technologies is rapidly reducing the cost of genome sequencing, leading to an accelerating rate of data production. This makes it likely that, in the near future, not only the genomes of all species of scientific interest will be sequenced, but also the genomes of many individuals, with the possibility of providing accurate and sophisticated annotation through the similarly low-cost application of functional assays. Many of these trends first became visible in human genomics, but are spreading to other species as costs continue to fall. For example, the 1000 Genomes Project (to sequence 1000 human genomes) has quickly been followed by the launch of similar initiatives in Arabidopsis (1), Plasmodium (http://www.genome.gov/26523588), Drosophila (http://www.dpgp.org), and other species.

These developments place particular demands on biological databases, but also represent an opportunity. The European Bioinformatics Institute (EBI) is involved (together with our collaborators) in the operation of a large number of archival databases, e.g. the European Nucleotide Archive (ENA) (2), ArrayExpress (3) (for gene expression data), PRIDE (4) (for proteomics data) etc. Such databases remain essential as a permanent record of experimental effort, and are under continuous development to accommodate data produced by new technologies. The UniProt Knowledgebase (UniProtKB) (5), in contrast, uses information curated from the scientific literature to describe proteins, which can subsequently serve as reference data for systems designed to automatically transfer annotation to similar, less well-characterised sequences. However, such computed annotation is best recalculated periodically as algorithms are refined and the quantity and quality of reference data increase. Moreover, the interpretation of potentially huge raw data sets as integrated features (e.g. a gene 'built' from transcriptional evidence) is a necessity if users are not to drown in an excess of information. This calls for a different type of resource, i.e. one that is derived and interpreted, complementing the archival and literature-based resources.

The structure of a genome provides a natural index through which data from other molecular biology experiments can be accessed. The Ensembl (6) platform for genome annotation and visualisation has been under development by the EBI and the Wellcome Trust Sanger Institute since 2000, with an initial focus on the human genome that has since widened to include other vertebrates. This paper describes 'Ensembl Genomes', a new resource recently launched by the EBI to complement Ensembl by providing access to genome-scale data from non-vertebrate species through the same user interfaces as Ensembl. This approach has two major advantages. First, the Ensembl software system has evolved in response to new data types that have frequently first appeared in human genomic studies, but which are now increasingly appearing in the context of other species; its re-use provides a cost-effective way of providing sophisticated data analysis and visualisation tools for a wider range of genomes. Secondly, the use of a common system for all genomes greatly decreases the cost of broad-range comparative genomics and other inter-species data analysis. For example, since the launch of Ensembl Genomes, data from malarial parasites (genus Plasmodium, available through Ensembl Protists), the malarial vector (Anopheles gambiae, available through Ensembl Metazoa), and the human host (in Ensembl) can now be accessed through a common set of interfaces.

With Ensembl Genomes, we aim to provide an integrated portal to reference sequence and annotation from all species of scientific interest; and thereby to complement, and provide convenient access to, data from existing and novel community initiatives in particular areas. Many of the databases that support Ensembl Genomes have been built by, or in close collaboration with, groups who maintain specialist data resources for individual species, and we are actively seeking to extend the range of these collaborations. The development of accurate and useful genome databases is wholly dependent on the involvement of scientists with knowledge of the most important data and its biological context. Ensembl Genomes can support such expertise through the provision of a data analysis and service infrastructure, close integration with the archival databases, [...] of Ensembl Genomes, breadth of taxonomic spread. Moreover, Ensembl is not just a data visualisation tool, but a suite of programs for data production (e.g. gene calling, comparative genomics) that can be deployed individually according to the needs of an individual community. We are looking forward to working with increasing numbers of collaborators to improve not only access to genome annotation, but the quality and depth of the annotation itself.

OVERVIEW OF ENSEMBL GENOMES

Ensembl Genomes (http://www.ensemblgenomes.org) consists of separate portals for each of five distinct domains of life: bacteria, protists, fungi, plants, and invertebrate metazoa. Each site contains data for selected species from within its domain, chosen for their scientific interest. Information is available about each species, including the assembly version and annotation methods used, the overall composition of the genome, etc. These have been customised for the species in Ensembl Genomes; for example, in Ensembl Bacteria, a circular view of the genome is provided, and the top-level pages have been designed to provide structured access to the large numbers of similar strains that have been sequenced in many clades.

The main functionality offered by the portals is access to a graphical view of each genome, using the Ensembl genome browser software. The browser offers a number of views of locations in the genome, genes, and specific transcripts; tabs at the top of the page allow users to switch between these different levels, while a context-dependent left-hand menu offers access to a selection of data views specific to each. The main display then appears in the central panel. Users can integrate their own analyses into appropriate displays by using the Distributed Annotation System (10) (DAS). Specific views exist for expression data, SNPs and other polymorphisms, and comparative genomics. With each new data release, the browser software is updated to a recent version, ensuring that a consistent user experience is provided between Ensembl and Ensembl Genomes. Release frequency is four to six times a year, tied to Ensembl releases where practical. Use of the browser is illustrated in Figure 1.
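The division of organisms among portals described above can be pictured as a simple lookup. The following Python sketch is purely illustrative and is not part of the Ensembl software: the table names `DIVISIONS` and `EXAMPLE_ORGANISMS` and the function `portal_for` are hypothetical, the portal names for fungi and plants are assumed to follow the same "Ensembl <Domain>" pattern as the portals named in the text, and the organisms are those of the malaria example.

```python
# Hypothetical lookup tables; names are illustrative, not an Ensembl API.
# Portal names assume the "Ensembl <Domain>" pattern used in the text.
DIVISIONS = {
    "bacteria": "Ensembl Bacteria",
    "protists": "Ensembl Protists",
    "fungi": "Ensembl Fungi",
    "plants": "Ensembl Plants",
    "invertebrate metazoa": "Ensembl Metazoa",
}

# Organisms from the malaria example; vertebrates remain in Ensembl itself.
EXAMPLE_ORGANISMS = {
    "Plasmodium": "protists",
    "Anopheles gambiae": "invertebrate metazoa",
    "Homo sapiens": "vertebrates",
}


def portal_for(organism: str) -> str:
    """Return the portal that hosts the given example organism."""
    domain = EXAMPLE_ORGANISMS[organism]
    # Any domain outside the five non-vertebrate divisions falls back to Ensembl.
    return DIVISIONS.get(domain, "Ensembl")


if __name__ == "__main__":
    for name in EXAMPLE_ORGANISMS:
        print(f"{name}: {portal_for(name)}")
```

Because all portals share one set of interfaces, a lookup of this kind is the only organism-specific step a user would need before querying parasite, vector and host data in the same way.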