Genbank Dennis A
Total Page:16
File Type:pdf, Size:1020Kb
D48–D53 Nucleic Acids Research, 2012, Vol. 40, Database issue Published online 5 December 2011 doi:10.1093/nar/gkr1202 GenBank Dennis A. Benson, Ilene Karsch-Mizrachi, Karen Clark, David J. Lipman, James Ostell and Eric W. Sayers* National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA Received September 30, 2011; Revised November 14, 2011; Accepted November 17, 2011 ABSTRACT sequence (GSS), whole-genome shotgun (WGS) and Õ other high-throughput data from sequencing centers. GenBank is a comprehensive database that The US Office of Patents and Trademarks also contributes contains publicly available nucleotide sequences sequences from issued patents. GenBank participates with for more than 250 000 formally described species. the EMBL Nucleotide Sequence Database (EMBL-Bank), Downloaded from These sequences are obtained primarily through part of the European Nucleotide Archive (ENA) (2), and submissions from individual laboratories and batch the DNA Data Bank of Japan (DDBJ) (3) as a partner submissions from large-scale sequencing projects, in the International Nucleotide Sequence Database including whole-genome shotgun (WGS) and envir- Collaboration (INSDC). The INSDC partners exchange onmental sampling projects. Most submissions data daily to ensure that a uniform and comprehensive http://nar.oxfordjournals.org/ are made using the web-based BankIt or standalone collection of sequence information is available worldwide. Sequin programs, and accession numbers are NCBI makes the GenBank data available at no cost over the Internet, through FTP and a wide range of web-based assigned by GenBank staff upon receipt. Daily data retrieval and analysis services (4). exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is access- RECENT DEVELOPMENTS ible through the NCBI Entrez retrieval system, PopSet redesign which integrates data from the major DNA and at DTU Library on January 16, 2012 protein sequence databases along with taxonomy, In the past year, NCBI redesigned the web interface for the PopSet database (www.ncbi.nlm.nih.gov/popset) of genome, mapping, protein structure and domain related sequences and alignments derived from phylogen- information, and the biomedical journal literature etic, population, mutation and ecosystem studies that have via PubMed. BLAST provides sequence similarity been submitted to GenBank. The new PopSet record views searches of GenBank and other sequence data- contain three sections: an introduction showing the title bases. Complete bimonthly releases and daily and citation for the citation reporting the data set; a list of updates of the GenBank database are available by sequences contained in the data set; and, when available, FTP. To access GenBank and its related retrieval an alignment of the sequences shown in the same and analysis services, begin at the NCBI home Graphical Sequence Viewer that is a display option on page: www.ncbi.nlm.nih.gov. nucleotide and protein records. In addition, PopSet record pages now display links to other PopSet records reported in the same published study, making it much INTRODUCTION easier to locate these related records. For PopSet records GenBank (1) is a comprehensive public database of with fewer than 100 sequences, links are provided to nucleotide sequences and supporting bibliographic and generate a BLAST alignment of the sequences or, if an biological annotation. GenBank is built and distributed alignment was submitted as part of the record, a by the National Center for Biotechnology Information distance tree view of the alignment. (NCBI), a division of the National Library of Medicine New tools for exploring features in GenBank records (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. Within any NCBI nucleotide or protein sequence record, NCBI builds GenBank primarily from the submission the hyperlinks in the feature table section now open a new of sequence data from authors and from the bulk submis- tool that highlights that feature in the sequence and sion of expressed sequence tag (EST), genome survey displays details of that annotation. A bar will appear at *To whom correspondence should be addressed. Tel: +301 496 2475; Fax: +301 480 9241; Email: [email protected] Published by Oxford University Press 2011. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2012, Vol. 40, Database issue D49 the bottom of the browser window that allows users to Table 1. Growth of GenBank divisions (nucleotide base pairs) navigate to other features within the record or to view Division Description Release 185 Annual the sequence corresponding to the feature in FASTA or (8/2011) increase GenBank format. For example, the various segments of a (%)a CDS on a genomic sequence can be highlighted together and then displayed and downloaded as a single FASTA TSA Transcriptome shotgun data 1 874 047 448 370.1 record. Sequence records also now have a ‘Find in this ENV Environmental samples 2 553 693 157 48.2 PHG Phages 62 579 756 44.0 Sequence’ link in the right column that opens a search PAT Patented sequences 11 154 487 762 30.9 bar at the bottom of the browser window. Users can BCT Bacteria 6 975 597 755 30.8 then input subsequences, including standard ambiguity INV Invertebrates 2 535 336 197 24.5 codes and nucleotides and Prosite patterns for proteins, WGS Whole-genome shotgun data 208 315 831 132 23.1 VRL Viruses 1 180 083 600 21.6 and then locate these patterns within the current sequence. MAM Other mammals 807 098 397 18.8 PLN Plants 4 741 991 057 17.4 Unverified sequences GSS Genome survey sequences 20 770 772 329 12.6 SYN Synthetic 156 218 063 9.6 As part of the standard submission process, GenBank VRT Other vertebrates 2 705 250 711 6.8 staff review submissions for biological accuracy and EST Expressed sequence tags 39 018 185 344 6.0 assist authors in providing accurate annotations. If UNA Unannotated 125 912 4.7 Downloaded from GenBank staff is unable to verify the accuracy of the PRI Primates 6 116 546 725 2.9 ROD Rodents 4 396 957 541 2.3 submitted sequences and/or annotations, they may now HTC High-throughput cDNA 662 320 919 0.4 add a comment to the record stating that the sequence is STS Sequence tagged sites 635 872 683 0.3 unverified. Until the submitter is able to resolve the issues, HTG High-throughput genomic 24 324 068 445 0.2 such sequences will have the word ‘UNVERIFIED:’ at TOTAL All GenBank sequences 338 987 064 933 18.2 the beginning of their definition lines and will not be http://nar.oxfordjournals.org/ aMeasured relative to Release 179 (8/2010). included in BLAST databases. WGS browser Table 2. Top Organisms in GenBank (Release 185) Within GenBank, WGS master records (see below) contain no sequence data, but rather show the descriptive Organism Non-WGS base pairs information and range of accession numbers of the contigs Homo sapiens 15 881 839 899 submitted as part of that WGS project. In the future, Mus musculus 9 118 049 806 NCBI will no longer assign GI numbers to these individ- Rattus norvegicus 6 503 434 302 at DTU Library on January 16, 2012 ual contigs, and at that point users will be unable to view Bos taurus 5 381 235 474 them directly in the Nucleotide database. Instead, users Zea mays 5 055 840 446 Sus scrofa 4 793 300 236 may view these records in a WGS browser that is linked Danio rerio 3 127 958 433 from the WGS feature of any WGS master record. Strongylocentrotus purpuratus 1 352 948 327 The WGS browser provides the complete descriptive Oryza sativa Japonica Group 1 251 053 810 information from the master record of the project, inter- Nicotiana tabacum 1 194 842 997 Xenopus (Silurana) tropicalis 1 147 237 486 active views of the FASTA of every contig record and Arabidopsis thaliana 1 138 511 865 also provides links to the FTP files for all the contigs Drosophila melanogaster 1 058 563 193 of the entire project. Pan troglodytes 1 003 309 475 Canis lupus familiaris 947 332 578 Vitis vinifera 915 431 680 ORGANIZATION OF THE DATABASE Gallus gallus 896 784 038 Glycine max 895 052 594 GenBank divisions Macaca mulatta 828 906 407 Solanum lycopersicum 778 132 243 GenBank groups sequence records into various divisions based either on the source taxonomy or the sequencing strategy used to obtain the data. There are 12 taxonomic divisions (BCT, ENV, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT) and six high-throughput collaboration with EMBL-Bank and DDBJ and with the divisions (EST, GSS, HTC, HTG, STS, TSA). Finally, valuable assistance of external advisers and curators. the PAT division contains records supplied by patent Almost 250 000 formally described species are represented offices and the WGS division contains sequences from in GenBank, and the top species in the non-WGS WGS projects. The size and growth of these divisions, GenBank divisions are listed in Table 2. and of GenBank as a whole, are shown in Table 1. Sequence identifiers and accession numbers Sequence-based taxonomy Each GenBank record, consisting of both a sequence and Database sequences are classified and can be queried its annotations, is assigned a unique identifier called an using a comprehensive sequence-based taxonomy (www accession number that is shared across the three collabo- .ncbi.nlm.nih.gov/taxonomy/) developed by NCBI in rating databases (GenBank, DDBJ, EMBL Bank).