Biological Sequence Database: NCBI

Subject: Bioinformatics
Lesson: Biological sequence database: National Center for Biotechnology Information (NCBI)
Lesson Developer: Sandip Das
College/Department: Department of Botany, University of Delhi

Table of Contents
Chapter: Biological sequence database: National Center for Biotechnology Information (NCBI)
  Introduction
  Databases at NCBI
    Literature
      Bookshelf
      Pubmed
    Nucleic Acid
      dbEST
      dbGSS
      Popset
      dbGaP
      dbVar
    Genome
    Taxonomy
    PubChem
    Expression analysis
    Protein
  Summary
  Exercise/Practice
  Glossary
  References/Bibliography/Further Reading

National Center for Biotechnology Information (NCBI)

NCBI has emerged as the primary free-to-access source of data and analysis tools in the field of computational biology. Free access is possible because funding and publication policies in most countries require researchers to deposit data generated with public funds in a free-to-access central repository. In return, the repository (such as NCBI or EMBL) assigns the data a unique identification number, often termed an accession number, which can also be used to identify the depositor and several other features of the record.

The following section introduces a variety of databases dealing with a wide range of disciplines. Although the data are organized separately for the sake of simplicity and clarity, in reality all the databases are interlinked, and a user can navigate from one to another. The databases are also associated with their appropriate analysis tools. The following section lists some of the databases that have been created at NCBI.
For the sake of simplicity, the databases in this lesson have been divided into three sections: section I deals with publication, literature and small-scale DNA/RNA sequencing projects; section II deals with whole genomes, epigenomes, maps of genomes, taxonomy and chemical structures; and section III deals with resources for RNA and protein that are required for “functional genomics”. These sections, marked I, II and III, are dealt with in their respective chapters.

Databases-I:
  Literature (PubMed, PubMed Central, NCBI Bookshelf)
  DNA and RNA (RefSeq, Nucleotide, EST, GSS, WGS, PopSet, Trace Archive, SRA)
Databases-II:
  Genomes (Map Viewer, Genome Workbench, Plant Genome Central, Genome Reference Consortium, Epigenomics, Genomic Structural Variation)
  Maps
  Taxonomy
  PubChem Substance
Databases-III:
  Expression analysis (GEO)
  Proteins (Reference sequences, GenPept, UniProt/SwissProt, PRF, PDB, Protein Clusters, Structure, UniGene, CDD)

Entrez is the single-point database search and retrieval system that allows a user to search and retrieve records from “all” databases or from a “specific” database in an interlinked manner.

Figure: Various databases at NCBI can be accessed through the Entrez portal
Source: http://www.ncbi.nlm.nih.gov/sites/gquery

The National Center for Biotechnology Information (NCBI) site is organized into four major, interlinked domains:
1. Databases
2. Tools
3. Data submission
4. Education

The following figure depicts the interlinked nature of these domains. The page it shows can be reached as follows:
1. Open the NCBI page by typing www.ncbi.nlm.nih.gov in the web browser.
2. Click the “search” button on the home page without entering any keyword.
3. In the top left-hand corner of the webpage, click “site map” to reach the page.
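Entrez, described above as the single search and retrieval point, can also be queried from a program through NCBI's E-utilities web service. The following is a minimal sketch, not part of the original lesson: it only builds the address of an ESearch request and does not contact NCBI. The function name build_esearch_url and its default values are hypothetical choices made for this illustration; the endpoint address and the db, term, retmax and retmode parameters are those of the public ESearch service as generally documented, and should be checked against the current E-utilities documentation before use.

```python
from urllib.parse import urlencode

# Address of the E-utilities ESearch endpoint (assumed to be current).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an ESearch URL that searches one Entrez database for `term`.

    `db` is an Entrez database name such as "pubmed" or "protein";
    `retmax` caps the number of record identifiers returned.
    """
    if not term.strip():
        raise ValueError("search term must not be empty")
    params = {
        "db": db,
        "term": term,
        "retmax": retmax,
        "retmode": "json",  # ask for a JSON reply instead of the default XML
    }
    # urlencode escapes spaces and special characters in the search term.
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the request address for a keyword search of the protein database.
    print(build_esearch_url("cholesterol", db="protein"))
```

Opening the printed address in a browser, or fetching it with any HTTP client, should return the identifiers of matching records. Searching a "specific" database corresponds to changing the db value.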
Figure: Various databases are organized into four major domains and are interlinked
Source: http://www.ncbi.nlm.nih.gov/guide/sitemap/

Databases of NCBI

The following section introduces some of the databases at NCBI.

Databases-I:
  Literature (PubMed, PubMed Central, NCBI Bookshelf)
  DNA and RNA (RefSeq, Nucleotide, EST, GSS, WGS, PopSet, Trace Archive, SRA)

Literature: Bookshelf provides free access to a wealth of information in the life sciences and healthcare, which users can browse and retrieve. The information may be in the form of books, documents and policy information from various government agencies and publishers. The Bookshelf titles are organized by subject, by type or by publisher, in a searchable and browsable format.

Figure: Bookshelf database at NCBI
Source: http://www.ncbi.nlm.nih.gov/books

Pubmed: The second source of literature is PubMed, which comprises millions of peer-reviewed research and review articles, and online books, in the life sciences and allied disciplines. The articles and book chapters also link to related literature and information through web links. A related database is PubMed Central (PMC), which provides free full-text access to research articles in biomedical, life science and other related subjects.

Figure: PubMed and PubMed Central (PMC) are the key databases at NCBI that provide access to research articles, reviews and books
Source: http://www.ncbi.nlm.nih.gov/pubmed

Nucleotides: The database for nucleotide resources has been divided into several sub-classes based on the genomic source or type.
dbEST: The database of Expressed Sequence Tags (dbEST) catalogues single-pass sequence reads of transcripts from a range of organisms. These reads are used to evaluate the spatio-temporal status of transcripts and for gene and genome annotation. Most EST sequences are short, ranging between 300 and 500 nucleotides, and are generated in large numbers by the many EST projects in progress; ESTs are also derived from projects that use differential display or RACE (Rapid Amplification of cDNA Ends). The expressed sequences in the database can be used to study the global expression profile of an organism at various stages of development and adaptation.

dbGSS: A parallel database, the database of Genome Survey Sequences (dbGSS), hosts random, short, single-pass sequences from the genomes of various organisms. Like dbEST for transcripts, dbGSS offers a snapshot of the genomic landscape and composition of an organism, and may thus provide valuable information before a full-scale genome sequencing project is undertaken. Both dbEST and dbGSS accept sequences generated through Sanger dideoxy chain-termination chemistry, and these are part of the Trace Archive at NCBI.

SRA: EST and GSS data generated through next-generation sequencing (NGS) platforms such as Applied Biosystems SOLiD, Roche 454, Illumina 1G and Helicos BioSciences HeliScope are deposited in the Sequence Read Archive (SRA), formerly called the Short Read Archive. Indeed, SRA is emerging as the primary repository for all forms of high-throughput data from EST, GSS and whole genome sequencing projects and from other high-throughput genomics studies.

Figure: dbEST contains short single-pass sequence information from cDNA
Source: http://www.ncbi.nlm.nih.gov/nucest

Figure: Sequence Read Archive (SRA)
Source: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?
Figure: SRA showing NGS data
Source: http://www.ncbi.nlm.nih.gov/sra/?term=Cholesterol, http://www.ncbi.nlm.nih.gov/sra/SRX188623

Information in the Sequence Read Archive (SRA) can be accessed by taking the following steps:
1. Go to NCBI by typing www.ncbi.nlm.nih.gov
2. Type any “keyword”, such as “cholesterol”, in the search box
3. Select the “SRA” database using the drop-down menu in the search box
4. Click “search”
5. A list of SRA results of NGS data containing cholesterol will appear
6. Select any data for further analysis

Popset: Studies that compare sequences from the same or different species or taxa, for ecosystem-based analysis, phylogenetic analysis, or analysis of genetic variation and mutation, deposit their data in the PopSet database. Each set consists of comparable DNA sequences derived from a single locus or gene across a group of organisms or taxa.

Figure: Popset with a set of sequences deposited as part of molecular phylogenetic studies employing ribosomal DNA gene sequences
Source: http://www.ncbi.nlm.nih.gov/sites/gquery

dbGaP: Studies that use genome-wide association studies (GWAS), medical resequencing and molecular diagnostics to establish and analyse relationships between genotype and phenotype are archived in the database of Genotypes and Phenotypes (dbGaP).

Figure: dbGaP database contains information on the relation between genotype and phenotype
Source: http://www.ncbi.nlm.nih.gov/gap

dbVar: Data on large-scale genomic variation, such as insertions and deletions, and on the relationship between such variation and phenotype, are held in the database of Genomic Structural Variation (dbVar).
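The keyword search steps listed earlier for SRA end at a results page whose address has the form shown in the SRA figure source (.../sra/?term=Cholesterol). The sketch below, which is an illustration and not part of the original lesson, builds addresses of that form for the databases covered in this section. The function name search_page_url and the DATABASE_PATHS table are hypothetical; the path segments for SRA, dbGaP and dbVar are taken from the Source lines in this lesson, "popset" is an assumption because the lesson gives no PopSet-specific address, and https is assumed in place of the http shown in the lesson. The code does not check whether each address still resolves.

```python
from urllib.parse import quote

# Path segments taken from the Source URLs quoted in this lesson.
# "popset" is an assumption: the lesson gives no PopSet-specific URL.
DATABASE_PATHS = {
    "SRA": "sra",
    "dbGaP": "gap",
    "dbVar": "dbvar",
    "PopSet": "popset",
}


def search_page_url(database, keyword):
    """Build the results-page address for a keyword search in one database.

    Mirrors steps 1 to 4 of the manual search: choose a database, type a
    keyword, and submit. Raises KeyError for a database not listed above.
    """
    path = DATABASE_PATHS[database]
    # quote() percent-encodes spaces and other unsafe characters.
    return f"https://www.ncbi.nlm.nih.gov/{path}/?term={quote(keyword)}"


if __name__ == "__main__":
    # Print one search address per database for the same keyword.
    for name in DATABASE_PATHS:
        print(name, search_page_url(name, "Cholesterol"))
```

Pasting any printed address into a browser should be equivalent to steps 1 to 4 above, leaving steps 5 and 6 (reading the result list and picking a record) to the user.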
Figure: dbVar at NCBI is the database for genomic and structural variation for various genomes
Source: http://www.ncbi.nlm.nih.gov/dbvar

This section deals with databases that are specific to genome sequencing and analysis tools, maps, taxonomy and chemical substances, and these have been grouped under Databases-II.

Databases-II: Genomes (Map Viewer, Genome workbench, Plant Genome Central,
