D98–D108 Nucleic Acids Research, 2012, Vol. 40, Database issue Published online 23 November 2011 doi:10.1093/nar/gkr1032 GeneDB—an annotation database for pathogens Flora J. Logan-Klumpler1,10,*, Nishadi De Silva1, Ulrike Boehme1, Matthew B. Rogers1,2, Giles Velarde1,2,10, Jacqueline A. McQuillan1, Tim Carver1, Martin Aslett1, Christian Olsen4, Sandhya Subramanian4, Isabelle Phan4, Carol Farris4,5, Siddhartha Mitra4, Gowthaman Ramasamy4, Haiming Wang6, Adrian Tivey1, Andrew Jackson1, Robin Houston1, Julian Parkhill1, Matthew Holden1, Omar S. Harb7, Brian P. Brunk7, Peter J. Myler4,5,8, David Roos9, Mark Carrington10, Deborah F. Smith2, Christiane Hertz-Fowler3 and Matthew Berriman1,* 1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, 2Centre for Immunology and Infection, Department of Biology, University of York, Heslington, York YO10 5YW, 3Centre for Genomic Research, Department of Functional and Comparative Genomics, University of Liverpool, Liverpool L69 3BX, UK, 4Seattle Biomedical Research Institute, Seattle, WA 98109, 5Department of Medical Education and Biomedical Informatics, University of Washington, Seattle, WA 98195, 6Center for Tropical and Emerging Global Diseases, University of Georgia, Athens, GA 30602, 7Penn Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, 8Department of Global Health, University of Washington, Seattle, WA 98195, 9Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA and 10Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK Received September 26, 2011; Revised and Accepted October 21, 2011 ABSTRACT INTRODUCTION GeneDB (http://www.genedb.org) is a genome GeneDB was initially established with funding from the database for prokaryotic and eukaryotic pathogens Wellcome Trust as a curated resource for the kinetoplastid and closely related organisms. The resource parasites and Schizosaccharomyces pombe that were provides a portal to genome sequence and annota- sequenced by the Pathogen Genomics group at the Wellcome Trust Sanger Institute (WTSI) (http://www tion data, which is primarily generated by the .sanger.ac.uk). It has now become the home for all Pathogen Genomics group at the Wellcome Trust pathogen genomes sequenced at the WTSI, as well as Sanger Institute. It combines data from completed several related pathogen genomes from other institutes, and ongoing genome projects with curated annota- and also provides a repository of genome sequences of tion, which is readily accessible from a web based pathogenic organisms to the wider scientific community. resource. The development of the database in GeneDB provides comprehensive coverage and a system- recent years has focused on providing database- atic approach to pathogen genome annotation, by driven annotation tools and pipelines, as well as incorporating and standardizing, data from numerous catering for increasingly frequent assembly sources. Currently GeneDB hosts the sequences and updates. The website has been significantly re- associated annotation of 41 organisms, including bacter- designed to take advantage of current web ial, protozoan, helminth and arthropod genomes, making it the largest multi-organism database of curated parasite technologies, and improve usability. The current genomes. The common factor between the genomes is the release stores 41 data sets, of which 17 are huge health and economic impact that these organisms manually curated and maintained by biologists, have on humans and animals around the world. Malaria who review and incorporate data from the scientific is a major cause of human mortality, and is represented on literature, as well as other sources. GeneDB is GeneDB by the Plasmodium falciparum genome, as well as primarily a production and annotation database genomes of several non-human Plasmodium species. for the genomes of predominantly pathogenic Chagas disease (American trypanosomiasis), leishmania- organisms. sis, schistosomiasis and human African trypanosomiasis *To whom correspondence should be addressed. Tel: +44 (0)1223 494944; Fax: +44 (0)1223 494919; Email: [email protected] Correspondence may also be addressed to Flora Logan-Klumpler. Tel: +44 (0)1223 495340; Fax: +44 (0)1223 494919; Email: fl[email protected] ß The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2012, Vol. 40, Database issue D99 (also known as sleeping sickness) are four of the Neglected provided by GeneDB with functional genomics data sets Tropical Diseases identified by the World Health from the wider community. Organization (http://www.who.int/neglected_diseases/en/), GeneDB has a unique role in producing and improving and are represented by genomes in this database. Of several key reference genomes, and hence it is a critical the bacterial genomes on GeneDB, 13 are pathogenic part of TriTrypDB, PlasmoDB and more recently, to humans. Within the database, there are 34 finished SchistoDB. It is also the platform on which changes, and non-contiguous finished genomes and 7 ongoing directed from those sites, are implemented. Currently sequencing projects (Table 1). Automatic pipelines GeneDB is the only database to host multiple flatworm generate the majority of these genomes, including annota- genomes. tion transfer from closely related, manually annotated species. Several of the eukaryotes have in-depth manual curation, with annotations extracted from the literature DATABASE CONTENT AND CURATION and/or provided by users of the website. GeneDB now holds the genome sequences of 9 Recently, the resource has undergone a complete apicomplexans, 3 of them human pathogens; 12 re-implementation, with many new features added. First, kinetoplastid protozoans, 7 of them human pathogens; 3 to make curation, and particularly cross-organism parasitic helminths, all human pathogens; and 16 bacterial curation, easier, the data are stored in a large species, 13 of them either human pathogens or opportun- multi-organism Chado database (2). The use of Chado istic human pathogens. Table 1 shows the current status of enables data to be structured in a way that allows rapid, the genomes in GeneDB. flexible and consistent annotation of any features of a Data are currently stored in a PostgreSQL relational sequence. Second, the website has been modernised to database that uses the Chado schema (2), produced by provide a more intuitive and consistent navigation, the GMOD consortium. This allows concurrent annota- browsing and searching across all organisms. Currently, tion of multiple organisms in a central data store, and the website has an average of 300 unique users per day. enables curators to make cross-organism annotations. Programmatic access to the data has also been extended Several types of annotation are currently performed on by means of web services, which are used to drive some of genomes once they have been loaded into the database. the more dynamic page elements and to provide data These can be loosely broken down into automated and upgrades to other linked resources. manual annotations. Automated annotations resulting GeneDB is a database of annotated genomes or from bioinformatics studies are fed into the database via genomes undergoing active annotation. We focus on the the means of specialised loader programs, wrapped in rapid release and simple, rapid access, to reference PERL or BASH scripts. Manual annotations are per- genomes, with a depth of annotation and curation that formed using Artemis, which can connect directly to is not found on other databases [such as ENA-EMBL a Chado database (5). Off-site curators are able to (3) and GenBank (4)]. Other -omics data is hosted else- connect our internal database using Artemis, over VPN. where, allowing GeneDB to focus on the annotation of Automated annotations include predictions of protein reference genomes from the published literature and user domains [PANTHER (8), PFAM (9), etc.], signal comments (see Curation section). This also enables users peptides [Signalp 3.0 (10)], GO terms [IprScan (11)], GPI to explore a genome in detail, particularly in conjunction anchors (dGPI; http://129.194.185.165/dgpi), transmem- with the genome-browsing tool, Artemis (see below) (5), brane helices [tmHMM (12)], non-coding RNA features with detailed and up to date curation available for every [RFAMscan (13)], and relationships to orthologous genes individual gene in a simple to navigate and consistent in closely related genomes [OrthoMCL (14)]. Results of layout. The simple layout and page design means that predictive algorithms are reloaded on GeneDB at data are presented to users extremely rapidly. As part of regular intervals to keep pace with growing databases. a collaborative effort with the EuPathDB group of data- Much of the information in the database is being com- bases (http://www.eupathdb.org) the annotated and prehensively curated and improved, and is now regularly curated genomes of the trypanosomatid parasites are updated with new assemblies. Manual curation of the regularly sent from GeneDB to TritrypDB (6; http:// database has focused on the trypanosomatids and www.tritrypdb.org) to be integrated with a wide variety Plasmodium species,
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages11 Page
-
File Size-