Spring 2003 Final3
Total Page:16
File Type:pdf, Size:1020Kb
NCBI News National Center for Biotechnology Information National Library of Medicine National Institutes of Health Department of Health and Human Services Spring 2003 [ A Field Guide to GenBank® and NCBI Resources: First Version of Human NCBI’s Scientific Outreach and Training Program Genome Reference Sequence Debuts on Biological sequence and structure use of NCBI databases and tools. DNA’s 50th information are now used in nearly The course, called “A Field Guide every field of biological research. A to GenBank and NCBI Resources”, working knowledge of these resources is designed especially for biologists April 14, 2003 marked the and standard computational biology who work at the bench or in the field 50th anniversary of the tools are an essential part of every but use sequence and structure data description of the structure biologist’s toolkit. However, keeping in their research. All researchers, of DNA and also saw the up with these databases and tools can educators and students who work release of the first version of be challenging in this period of rapidly with biological sequence and structure the 3 billion base pair refer- changing bioinformatics resources. data should find this to be a useful ence sequence of the human introduction and survey of the genome. Annotations to the In order to help researchers keep available NCBI tools and databases. raw sequence made public on April 14 abreast of enhancements and the Because of the rapid expansion of were released on April 29 when the increasing diversity of NCBI molecu- the resources, even experienced NCBI reference genome, NCBI build 33, lar biology resources, the NCBI users will likely learn something new appeared in the NCBI Map Viewer. service desk provides a free training program for biologists on the effective continued on page 4 continued on page 2 In this issue 1 Field Guide to GenBank 1 Human Reference Sequence 2 UniGene Expands 3 Rat Genome Assembly 5 Taxonomy Browser 5 Search the NCBI Web 5 Recent Publications 6 New Genomes in GenBank 6 Entrez Quiz 7 Submissions Corner 8 GenBank Cumulative Updates 8 GenBank Release 135 Figure 1: This Field Guide Map shows the number of Field Guides held in each state through April 2003. the Human Genome Project sequenc- Human Reference Sequence ing centers, NCBI, the University of continued from page 1 California at Santa Cruz, and Ensembl, NCBI News a joint project between EMBL-EBI and the Sanger Institute. The human genome reference NCBI News is distributed four times sequence is a critical contribution to There are about 32,000 genes anno- a year. We welcome communication RefSeq, NCBI’s database of reference tated by NCBI on build 33; of those, from users of NCBI databases and sequences for genomes, mRNAs, almost 18,000 are supported by software and invite suggestions for and proteins. mRNA alignments and may be con- articles in future issues. Send corre- sidered to be confirmed. The typical spondence to NCBI News at the The reference sequence covers about confirmed human gene has 12 exons address below. 99 percent of the human genome’s of an average length of 236 base NCBI News gene-containing , euchromatic, pairs each, separated by introns of National Library of Medicine regions, at an accuracy of 99.99 per- an average length of 5,478 base pairs. Bldg. 38A, Room 3S308 cent. Only the sequence near cen- As a consequence, the average intron 8600 Rockville Pike tromeres, telomeres, and other hete- length is about twice the average Bethesda, MD 20894 Phone: (301) 496-2475 rochromatic blocks, along with a small transcript length. Some statistics on Fax: (301) 480-9241 number of unclonable gaps, has yet NCBI’s build 33 are given in Table 1. E-mail: [email protected] to be determined. Small updates to the assembly will continue to be made To view the genome or download the Editors Dennis Benson as complex regions are refined and sequence, start at the “Human Genome David Wheeler the remaining gaps, about 400 of less Resources” link under “Hot Spots” on than 100 kilobases each, are closed. the NCBI Home Page. See the upcom- Contributors ing summer edition of the NCBI News Medha Bhagwat The assembly is the end product of for more on the human genome and Peter Cooper Eric Sayers several years of collaborative work by how to explore it at NCBI. Writers Vyvy Pham Gene Identification Method EST Alignment Predicted mRNA Alignment David Wheeler Number of Genes 3,295 11,972 17,708 Editing and Production Robert Yates Exons/Gene (Avg) 4 5 12 Graphic Design Exon Length (Avg) 440 158 226 Gary Mosteller Intron Length (Avg) 5,605 10,512 5,478 Joe Vuthipong mRNA Length (Avg) 2,127 820 2,741 In 1988, Congress established the National Center for Biotechnology Gene Length (Avg) 13,940 43,918 55,147 Information as part of the National Library of Medicine; its charge is Table 1: Statistics: First version of the human genome reference sequence, NCBI human genome build 33. to create information systems for molecular biology and genetics data and perform research in UniGene Expands; C. elegans added to LocusLink computational molecular biology. The contents of this newsletter may With the recent additions of Ciona similarities, the LocusLink report for be reprinted without permission. intestinalis (sea squirt), Oncorhynchus the gene, and its map location. Try The mention of trade names, com- mykiss (rainbow trout), and Vitis UniGene by following the “UniGene” mercial products, or organizations vinifera (the wine grape) the UniGene link under “Hot Spots” on the NCBI does not imply endorsement by database of non-redundant gene-orient- Home Page. NCBI, NIH, or the U.S. Government. ed clusters now covers 15 animals, 12 NIH Publication No. 03-3272 plants and Dictyostelium discoideum, The nematode, Caenorhabditis ISSN 1060-8788 a slime mold. Each UniGene cluster elegans, represented in UniGene with ISSN 1098-8408 (Online Version) is linked to related information, such over 19,000 clusters, is now also repre- as the tissue types in which the gene sented in the LocusLink database of is expressed, model organism protein genetic loci with over 22,000 records. Spring 2003 NCBI News 2 Rat Genome Assembly Deposited into GenBank With the recent deposit of the draft rat genome assembly Try the rat MapViewer at: into GenBank, researchers can view the rat genome in Entrez www.ncbi.nlm.nih.gov/mapview and from the Rat Genome Map Viewer. Other resources such as Rat Genomic BLAST and downloadable files of predicted Researchers can also BLAST a query sequence against WGS gene and protein sequences complement the navigational contigs, WGS traces from the Trace Archive, BAC ends, HTG, and display capabilities of the Viewer. RNA, and EST sequences, using the Rat Genome BLAST service. BLAST hits are shown marked in red on the genome The Whole Genome Shotgun (WGS) effort that produced assembly. Figure 2 shows the BLAST hit of a Bos taurus p53 the rat genome was managed by the Rat Genome Sequencing tumor suppressor phosphoprotein (TP53), mRNA against Consortium (RGSC), which is led by the Baylor College of the Rat genome assembly. The query sequence shows an 82% Medicine Human Genome Sequencing Center. The current homology to the rat contig NW_042660.1. The parallel gene assembly is comprised of over 157,000 WGS contigs fused to sequence map shows the annotated rat gene TP53 that is produce almost 1,800 supercontigs, or scaffolds. annotated within the same region, suggesting a homologous relationship between the Bos taurus and rat genes. The rat WGS contigs are assigned accession numbers of the format “AABR########”, with the first two digits describing The Rat BLAST service can be accessed from the BLAST the project version number. Thus, the first version has the Home Page. accession AABR01000000, and consists of sequences AABR01000001 through AABR01157561. One can either search for AABR01000000 in Entrez, and from its record, click to view the records AABR01000001 through AABR01157561; or, one can search for any of the individual records separately. The WGS reads are found in the Trace Archive and can also be queried using the BLAST service described below. There are over 38,000,000 rat traces available via the “Hot Spots” “Trace archive” link on the NCBI Home Page. More information on the assembly and other rat genomic resources is available from the NCBI “Genomic Biology” page. Follow the “Rat” link under “Organism-Specific Resources”. To download sequence and annotations via FTP, use Figure 1: Map Viewer display of rat genome assembly for chromosome 17 showing several parallel maps, including those for predicted and confirmed ftp.ncbi.nih.gov/genomes genes, and the contigs and their components. Click on “Maps and Options” to view all the available maps. The genome assembly can be displayed using the Rat Genome Map Viewer, as in Figure 1. In the figure, several different maps are shown, including gene sequence, compo- nent, contig, GenomeScan, UniGene, and STS. The compo- nent map shows the tiling path of the WGS contig accessions used to build the “NW_xxxxxx” scaffolds, whereas the contig map displays the placement of the scaffolds on the chromosomes. Genes that have also been annotated on the genome assembly using the alignment of mRNAs to the con- tigs are shown on the Genes_seq map, while GenomeScan predictions are included on a separate map. In addition, the Transcript map shows the combinations of exons, or splice variants, that are implied by mRNA alignments. The Rn_Uni- Gene map shows the alignment of rat ESTs to the genome assembly and shows EST-only gene models based on shared introns and alignment to a common position on the genome.