VectorBase: A Data Resource for Invertebrate Vector Genomics

Citation: Lawson, Daniel, Peter Arensburger, Peter Atkinson, Nora J. Besansky, Robert V. Bruggner, Ryan Butler, Kathryn S. Campbell, et al. 2009. VectorBase: A data resource for invertebrate vector genomics. Nucleic Acids Research 37(Suppl 1): D583-D587.
Published Version: doi:10.1093/nar/gkn857
Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:4513030
Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Open Access Policy Articles, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#OAP

Published online 21 November 2008. Nucleic Acids Research, 2009, Vol. 37, Database issue D583–D587. doi:10.1093/nar/gkn857

Daniel Lawson¹,*, Peter Arensburger², Peter Atkinson², Nora J. Besansky³, Robert V. Bruggner³, Ryan Butler³, Kathryn S. Campbell⁴, George K. Christophides⁵, Scott Christley³, Emmanuel Dialynas⁶, Martin Hammond¹, Catherine A. Hill⁷, Nathan Konopinski³, Neil F. Lobo³, Robert M. MacCallum⁵, Greg Madey³, Karine Megy¹, Jason Meyer⁷, Seth Redmond⁵, David W. Severson³, Eric O. Stinson³, Pantelis Topalis⁶, Ewan Birney¹, William M. Gelbart⁴, Fotis C. Kafatos⁵, Christos Louis⁶,⁸ and Frank H. Collins³

¹European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK, ²Department of Entomology, University of California, Riverside, 900 University Avenue, Riverside, CA 92521, ³Center for Global Health and Infectious Diseases, Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46656-0369, ⁴The Biological Laboratories, 16 Divinity Avenue, Harvard University, Cambridge, MA 02138, USA, ⁵Cell and Molecular Biology Department, Imperial College London, South Kensington Campus, London SW7 2AZ, UK, ⁶Institute of Molecular Biology and Biotechnology, FORTH, Vassilika Vouton, PO BOX 1385, Heraklion, Crete, Greece, ⁷Department of Entomology, Purdue University, West Lafayette, IN 47907, USA and ⁸Department of Biology, University of Crete, Heraklion, Crete, Greece

Received September 15, 2008; Revised and Accepted October 16, 2008

*To whom correspondence should be addressed. Tel: +44 1223 494 444; Fax: +44 1223 494 468; Email: [email protected]

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

VectorBase (http://www.vectorbase.org) is an NIAID-funded Bioinformatic Resource Center focused on invertebrate vectors of human pathogens. VectorBase annotates and curates vector genomes, providing a web-accessible integrated resource for the research community. Currently, VectorBase contains genome information for three mosquito species (Aedes aegypti, Anopheles gambiae and Culex quinquefasciatus), a body louse (Pediculus humanus) and a tick species (Ixodes scapularis). Since our last report, VectorBase has initiated a community annotation system, a microarray and gene expression repository and controlled vocabularies for anatomy and insecticide resistance. We have continued to develop both the software infrastructure and tools for interrogating the stored data.

INTRODUCTION

VectorBase is a genome information system which provides a genome browser for visualizing genome annotations, including DNA and protein alignments, variations, protein feature data and functional data sets, such as microarray expression analysis. We are active in producing genome annotation ourselves but also collaborate with a range of partners, including our sister Bioinformatic Resource Centers (1), to incorporate and improve the annotations.

The reduction in the cost of sequencing has seen genomes become available for an increasing number of vector species. VectorBase is directly responsible for three mosquito species (Aedes aegypti, Anopheles gambiae and Culex quinquefasciatus) and the tick Ixodes scapularis. We work closely with the genome sequencing centres on the initial annotation and publication of these genomes and then assume responsibility for ongoing re-annotation tasks. A number of other genomes are within scope for VectorBase, including the body louse (Pediculus humanus), triatomine bug (Rhodnius prolixus), tsetse fly (Glossina morsitans morsitans) and sand flies (Lutzomyia longipalpis and Phlebotomus papatasi). A full list of VectorBase species and data sets can be accessed on the website (http://www.vectorbase.org/Help/Current_release).

This report highlights the new genomes integrated into VectorBase and some of the new features and improvements that we have added since our last report (2). Users interested in the VectorBase project should visit the main web page or help pages (http://www.vectorbase.org/Help/Main_Page) for more information about the project.

ACCESSING VECTORBASE

VectorBase as a web resource is linked with a number of other databases, most notably the public nucleotide and protein databases. Direct cross-references to the genes, transcripts and proteins exist in the GenBank/EMBL/DDBJ genome assembly records as well as the UniProt protein records, where both An. gambiae and Ae. aegypti are deemed to be complete proteomes. Other resources which use VectorBase data range from large general resources, such as Ensembl (http://www.ensembl.org) and RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq), to the more biologically focused proteinase database Merops (http://merops.sanger.ac.uk) and miRNA target predictions in miRBase (http://microrna.sanger.ac.uk). The VectorBase site and wiki resource are indexed by the major search engines, allowing users to readily find content of interest.

EXPANDED ROLE OF VECTORBASE

VectorBase is active in all stages of genome analysis, including initial annotation of new genome sequences in collaboration with the sequencing centres, such as JCVI and The Broad Institute, and subsequent re-annotation using both computational and manual approaches in liaison with the community. Automated annotation using the Ensembl system (3) was undertaken for the new genomes (C. quinquefasciatus, P. humanus and I. scapularis). The process of resolving differences between VectorBase and the partner sequencing centre annotations has been a fruitful task leading to high-quality automated annotation, but problems will remain which can only be addressed using further resources (expressed sequence tags or new genome sequences) or through manual appraisal of the automated gene predictions. VectorBase has invested some resource toward the latter and implemented strategies for involving the community in the annotation effort. We have also implemented data mining tools, such as the HMMER package (http://hmmer.janelia.org/), to build profile hidden Markov models from multiple sequence alignments, which can then be used for sensitive database searching using statistical descriptions of a sequence family's consensus.

... into the main gene build during the next round of re-annotation.

Small-scale manual appraisal of gene predictions has been undertaken for Ae. aegypti and C. quinquefasciatus as part of the quality control for the gene builds. In the case of C. quinquefasciatus, this revealed at least 1500 predictions which were removed from the CpipJ1.2 dataset. Amongst the deprecated gene predictions was a large set of single exon predictions which had no supporting transcript evidence and no similarity to other mosquito proteomes or any other sequences in the public databases. Expert opinion was that these were erroneous over-predictions by the computational algorithms rather than a large Culex-specific gene family. Efforts such as these highlight our determination to improve gene prediction accuracy through the integration of new data sets and the re-appraisal of the existing prediction set.

COMMUNITY ANNOTATIONS

VectorBase employs community representatives focused around the NIAID-funded species (the three mosquito genomes and I. scapularis). The representatives were hired from within the relevant community and have both biological knowledge of the species and informatics skills. Their role is to liaise with the community, providing helpdesk and training capacity, acting as mediators and quality assurance for data submission of gene predictions and as advocates for the user community in the development of the VectorBase resource.

We have developed a Community Annotation Pipeline (CAP) to facilitate community involvement in the curation of these genomes. This system consists of a CHADO database which stores annotations, both from the manual effort within VectorBase and those submitted directly from the community, and a web interface submission tool to upload data. Submitters use a spreadsheet format and can include gene predictions, gene symbols and gene descriptions, and attach GO terms or citations to a gene model. One aspect of the submission system is its ability to align a cDNA sequence to the genome using exonerate (7). The simplicity of the submission process in conjunction with community representative involvement in data quality consistency checks (e.g.