Biological sequence database: NCBI

Lesson: Biological sequence database: National Center for Biotechnology Information (NCBI)
Lesson Developer: Sandip Das
College/Department: Department of Botany, University of Delhi


Table of Contents

Chapter: Biological sequence database: National Center for Biotechnology Information (NCBI)

• Introduction
• Databases at NCBI
   o Literature
      - Bookshelf
      - Pubmed
   o Nucleic Acid
      - dbEST
      - dbGSS
      - Popset
      - dbGaP
      - dbVar
   o Taxonomy
   o PubChem
   o Expression analysis
   o Protein
• Summary
• Exercise/ Practice
• Glossary
• References/ Bibliography/ Further Reading

National Center for Biotechnology Information (NCBI)

NCBI has emerged as the primary free-to-access source of biological data and analysis tools. Free access is possible because the funding and publication policies of most countries require researchers to deposit information generated with public funds into a free-to-access central repository. In return, the repository (such as NCBI or EMBL) assigns the data a unique identification number, often termed an accession number, which can also be used to identify the depositor and several other features.

The following sections introduce a variety of databases covering a wide range of disciplines. Although the data are organized separately for simplicity and clarity, in reality all the databases are interlinked and can be navigated from one to another. The databases are also associated with their appropriate analysis tools. For simplicity, the databases in this lesson are divided into three sections: section I deals with publications, literature and small-scale DNA/RNA sequencing projects; section II deals with whole genomes, epigenomes, maps, taxonomy and chemical structures; and section III deals with the RNA and protein resources required for “functional genomics”. The sections, marked I, II and III, are dealt with in turn.

Databases-I:
Literature (PubMed, PubMed Central, NCBI Bookshelf)
DNA and RNA (RefSeq, Nucleotide, EST, GSS, WGS, PopSet, Trace Archive, SRA)

Databases-II:
Genomes (Map Viewer, Genome Workbench, Plant Genome Central, Genome Reference Consortium, Epigenomics, Genomic Structural Variation)
Maps
Taxonomy
PubChem Substance

Databases-III:
Expression analysis (GEO)
Proteins (Reference sequences, GenPept, UniProt/SwissProt, PRF, PDB, Protein Clusters, Structure, UniGene, CDD)

Entrez is the single point database search and retrieval system that allows a user to perform the search and retrieve action against “all” or a “specific” database in an interlinked manner.
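The same "one database or all databases" search that Entrez offers in the browser is also exposed as a web service, the NCBI E-utilities. E-utilities is not covered in this lesson, so the following is only a minimal sketch, assuming the public ESearch endpoint and its `db`, `term`, `retmax` and `retmode` parameters. The function name `build_esearch_url` is made up for illustration. The code only builds the request address and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (assumed here; not part of the lesson).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database, term, max_results=20):
    """Return an ESearch URL that searches one Entrez database for a term."""
    params = {
        "db": database,         # Entrez database name, e.g. "pubmed" or "protein"
        "term": term,           # free-text query, as typed into the search box
        "retmax": max_results,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON reply instead of XML
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example: the address for a PubMed search on "auxin".
print(build_esearch_url("pubmed", "auxin"))
```

Changing only the `database` argument points the same query at a different Entrez database, which mirrors the drop-down menu next to the search box on the NCBI homepage.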


Figure : Various databases at NCBI can be accessed through the Entrez portal Source: http://www.ncbi.nlm.nih.gov/sites/gquery

The National Center for Biotechnology Information (NCBI) site is conveniently organized into four major, interlinked domains: 1. Databases, 2. Tools, 3. Data submission and 4. Education.

The following figure depicts the interlinked nature of these domains. The page shown can be reached as follows:
1. Open the NCBI page by typing www.ncbi.nlm.nih.gov in the web browser.
2. Click the “search” button on the home page without entering any keyword.
3. In the top left-hand corner of the webpage, click on “site map” to reach the page.


Figure: Various databases are organized into four major domains and are interlinked Source: http://www.ncbi.nlm.nih.gov/guide/sitemap/

Databases of NCBI

The following section introduces some of the databases at NCBI.

Databases-I:
Literature (PubMed, PubMed Central, NCBI Bookshelf)
DNA and RNA (RefSeq, Nucleotide, EST, GSS, WGS, PopSet, Trace Archive, SRA)

Literature:

• Bookshelf provides free access to a wealth of information in the life sciences and healthcare, which users can browse and retrieve. The information may take the form of books, documents and policy information from various government agencies and publishers. The Bookshelf titles are organized by subject, by type or by publisher in a searchable and browsable format.


Figure: Bookshelf database at NCBI

Source: http://www.ncbi.nlm.nih.gov/books

• Pubmed: The second source of literature is Pubmed, which comprises millions of peer-reviewed research and review articles and online books in the life sciences and allied disciplines. The articles and book chapters also link to related literature and information. A sub-database of Pubmed is PubMed Central (PMC), which provides free full-text access to research articles from the biomedical, life science and related fields.



Figure: Pubmed and PubMed Central (PMC) are the key databases at NCBI that provide access to research articles, reviews and books

Source: http://www.ncbi.nlm.nih.gov/pubmed

Nucleotides: The nucleotide resources are divided into several subclasses based on the genomic source or type of sequence.

dbEST: The database of Expressed Sequence Tags (dbEST) catalogues single-pass sequence reads of transcripts from a range of organisms. These reads are used to evaluate the spatio-temporal status of transcripts and for gene and genome annotation. Most EST sequences are short, between 300 and 500 nucleotides, and are generated in large numbers by the many EST projects in progress; ESTs are also derived from projects that use differential display or RACE (Rapid Amplification of cDNA Ends). The expressed sequences in the database can be used to study the global expression profile of an organism at various stages of development and adaptation.

dbGSS: A parallel database, the database of Genome Survey Sequences (dbGSS), hosts random, short, single-pass sequences from the genomes of various organisms. As with dbEST, analysis of dbGSS gives a snapshot of the genomic landscape and composition of an organism and may thus provide valuable information before a full-scale genome sequencing project is begun. Both dbEST and dbGSS accept sequences generated through Sanger’s dideoxy chain-termination chemistry, and these are part of the Trace Archive at NCBI.


SRA: EST and GSS data generated through Next Generation Sequencing (NGS) platforms such as Applied Biosystems SOLiD, Roche 454, Illumina 1G and Helicos Biosciences HeliScope are deposited in the Sequence Read Archive (SRA), formerly known as the Short Read Archive. Indeed, SRA is emerging as the primary repository for all forms of high-throughput data from EST, GSS and other high-throughput genomics studies.

Figure: dbEST contains short single-pass sequence information from cDNA (http://www.ncbi.nlm.nih.gov/nucest)

Figure : Sequence Read Archive (SRA)


Source: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?

Figure : SRA showing NGS data Source: http://www.ncbi.nlm.nih.gov/sra/?term=Cholesterol, http://www.ncbi.nlm.nih.gov/sra/SRX188623

Information in the Sequence Read Archive (SRA) can be accessed by taking the following steps:
1. Go to NCBI by typing www.ncbi.nlm.nih.gov
2. Type any keyword, such as “cholesterol”, in the search box
3. Select the “SRA” database using the drop-down menu in the search box
4. Click “search”
5. A list of SRA results of NGS data containing “cholesterol” will appear
6. Select any data for further analysis
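When the same search is run programmatically (for example, an E-utilities ESearch request with `db=sra` and `term=cholesterol`, an interface this lesson does not cover), the list of results in step 5 comes back as a list of record IDs. The sketch below shows how such a reply could be read, assuming the JSON layout that ESearch uses (`esearchresult` holding `count` and `idlist`). The reply text is a made-up illustration, not real SRA data, and `extract_ids` is a hypothetical helper name.

```python
import json

# Illustrative reply in the shape ESearch returns with retmode=json.
# The count and IDs are invented for this example; they are not real SRA records.
SAMPLE_REPLY = """
{
  "esearchresult": {
    "count": "3",
    "retmax": "3",
    "retstart": "0",
    "idlist": ["1001", "1002", "1003"]
  }
}
"""


def extract_ids(reply_text):
    """Return (total_hits, list_of_record_ids) from an ESearch JSON reply."""
    result = json.loads(reply_text)["esearchresult"]
    # ESearch reports the hit count as a string, so convert it to an integer.
    return int(result["count"]), list(result["idlist"])


total, ids = extract_ids(SAMPLE_REPLY)
print(f"{total} hits; record IDs: {ids}")
```

Each returned ID corresponds to one entry in the result list of step 5 and could then be opened individually, as in step 6.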

Popset: The Popset database holds sets of sequences from studies that compare sequences from the same or different species or taxa for ecosystem-based analysis, phylogenetic analysis, or analysis of genetic variation and mutation. Each set consists of comparable DNA sequences derived from a single locus or gene across a group of organisms or taxa.


Figure: Popset with set of sequences deposited as a part of molecular phylogenetic studies employing ribosomal DNA gene sequence Source: http://www.ncbi.nlm.nih.gov/sites/gquery,

dbGaP: Studies that use genome-wide association studies (GWAS), medical resequencing or molecular diagnostics to establish and analyse relationships between genotype and phenotype are archived in the database of Genotypes and Phenotypes (dbGaP).


Figure : dbGaP database contains information of relation between genotype and phenotype Source: http://www.ncbi.nlm.nih.gov/gap


dbVar: Data on large-scale genomic variation, such as insertions and deletions, and on the relationship between such variation and phenotype are held in the database of Genomic Structural Variation (dbVar).


Figure: dbVar at NCBI is the database for genomic and structural variation for various genomes Source: http://www.ncbi.nlm.nih.gov/dbvar

This section deals with databases specific to genome sequencing and analysis tools, maps, taxonomy and chemical substances, grouped here under Databases-II.

Databases-II:
Genomes (Map Viewer, Genome Workbench, Plant Genome Central, Genome Reference Consortium, Epigenomics, Genomic Structural Variation)
Maps
Taxonomy
PubChem Substance

Genome databases: The deluge of data that prompted the rapid growth, development and deployment of computational tools for the study of biological processes, or bioinformatics, has largely been driven by high-throughput sequencing technologies, including Whole Genome Sequencing (WGS) and the more recently developed RNA sequencing (RNA-seq). NCBI has developed several dedicated portals, such as Genome and the Genome Reference Consortium (GRC), through which researchers and other users can access genome sequence data and analytical tools. The Genome database not only stores information from the sequencing efforts under way for several organisms, but also integrates information from genetic and physical maps, genetic markers, linkage groups and chromosomes.


Figure : Genome database can be accessed through the Entrez portal Source: http://www.ncbi.nlm.nih.gov/sites/gquery


Figure: Genome database at NCBI Source: http://www.ncbi.nlm.nih.gov/genome;http://www.ncbi.nlm.nih.gov/genome/browse/


Figure : Genome database provides information at various levels of genome organization, ranging from gross i.e. chromosome to the highest level of resolution i.e. nucleotides Source: http://www.ncbi.nlm.nih.gov/genome/browse/ ; http://www.ncbi.nlm.nih.gov/genome/47 ;


Apart from the genome database, sequences from several taxonomic groups such as those for Human genome, Microbes, Organelles, Plants and Viruses are also available as customized databases along with analysis tools. The following figure depicts the snapshot of “Plant Genome Central”, a dedicated portal for completed and ongoing genome sequencing efforts for plants.

Figure : Plant Genome Central is a customized database hosting the completed and ongoing plant genome sequencing projects Source: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html


The database at Plant Genome Central can be accessed by taking the following steps:
1. Open the NCBI webpage by typing www.ncbi.nlm.nih.gov
2. Click on the “Genome” link that appears on the right-hand side of the webpage to go to http://www.ncbi.nlm.nih.gov/genome
3. Under the “Custom” resources, click on “Plants” to access the Plant Genome Central webpage.
4. Browse and select any plant of choice to go to the plant-specific database

The database reconstitutes the chromosomes and other genomic entities along with the precise location of molecular and genetic markers, genetic maps, genes and genetic elements, physical maps and a variety of other resources. For example, the gene/symbols provide links to the details about the gene sequence, chromosomal location, function, GEO profile, domain structure, their epigenetic status, literature etc.

Maps can be viewed as graphical representations of the location of the various genetic elements on the genome. A genetic map is based on estimates of recombination frequency and linkage between sets of molecular markers. Once such a map has been constructed, information about phenotypes and traits can be added to represent the linkage between molecular markers and phenotypes. Because genetic maps are based on estimated recombination frequency, the distances between markers are expressed in centimorgans (cM). Genomic regions vary in their propensity to undergo recombination, so mapped markers are not uniformly distributed along a genetic map. In addition, no universal relationship between recombination frequency and physical distance can be derived, so genetic maps are not an accurate representation of the length of genomes. Physical maps, which are based on sequence information, in contrast give the actual location of the various genetic elements on the genome and therefore provide an accurate estimate of genome size, but they lack information on recombination frequency. The Plant Genome database allows users to navigate between genetic and physical maps and can therefore be used to understand the relationship between genetic and physical distance.
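As a worked illustration of how a genetic distance is obtained from recombination data, the sketch below estimates the recombination frequency between two markers and converts it to centimorgans. The lesson does not name a mapping function; Haldane's function, a standard textbook model that assumes crossovers occur independently, is used here purely as an example. The progeny counts are hypothetical.

```python
import math


def recombination_frequency(recombinants, total_progeny):
    """Fraction of progeny that are recombinant between two markers."""
    if total_progeny <= 0:
        raise ValueError("total_progeny must be positive")
    return recombinants / total_progeny


def haldane_cm(r):
    """Genetic distance in centimorgans for recombination fraction r.

    Uses Haldane's mapping function, d = -50 * ln(1 - 2r), which assumes
    no crossover interference. Valid for 0 <= r < 0.5.
    """
    if not 0 <= r < 0.5:
        raise ValueError("r must lie in the interval [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical cross: 20 recombinant plants out of 200 scored.
r = recombination_frequency(20, 200)
print(f"r = {r:.2f}, distance = {haldane_cm(r):.1f} cM")
```

Doubling the recombination frequency more than doubles the estimated distance, so even the link between recombination frequency and genetic distance is not linear. The conversion to physical distance in base pairs varies further from one genomic region to another, which is why both kinds of map are needed.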

Apart from the genome sequence data, the Genome database provides several analysis tools, for example annotation tools for prokaryotic genomes (Prokaryotic Genomes Automatic Annotation Pipeline, PGAAP) and eukaryotic genomes, Pairwise Sequence Comparison (PASC), and a three-way genome comparison tool termed TaxPlot.

Epigenetics, the re-programming of gene expression states mediated by methylation of cytosine residues and modification of histones, which leads to chromatin remodelling, is a major mechanism of gene regulation. Information about epigenetic changes occurring genome-wide and at specific genetic loci can be accessed in the Epigenomics database.


Figure : Details about genes and other genetic elements can be accessed from Plant Genome Central database Source: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html


Figure : Genome analysis tools at Genome database can be reached by clicking on “PASC” (Source: http://www.ncbi.nlm.nih.gov/sutils/pasc/viridty.cgi?textpage=overview) and “TaxPlot” (Source: http://www.ncbi.nlm.nih.gov/sutils/taxik2.cgi) under “Custom” resources on the Genome webpage (http://www.ncbi.nlm.nih.gov/genome/)


Figure : Genome database also allows users to search and retrieve information about DNA methylation (epigenetic) status of various genes and genetic elements for specific organisms through the Epigenomics database Source: http://www.ncbi.nlm.nih.gov/epigenomics/?term=Arabidopsis ; http://www.ncbi.nlm.nih.gov/epigenomics/view/genome/?uid=13399&assembly=31&term=MEDEA


Taxonomy: This database, which was conceptualized and became functional in 1991, contains curated hierarchical taxonomic information about organisms for which sequence information is available in the public databases. In addition to being a portal to taxonomy, the site provides a bird’s-eye view of the resources associated with a specific organism. The taxonomic classification is based on a combination of molecular and morphological data and can be accessed at http://www.ncbi.nlm.nih.gov/taxonomy.

Figure: Taxonomy database with complete taxonomic details of an organism (taxon specific page) Source: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606&lvl=3&lin=f&keep=1&srchmode=1&unlock


Figure: Taxonomy database hierarchical display page Source: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

Table : A summary list of taxa represented in taxonomy database as on December 2012

Ranks:           higher taxa     genus    species    lower taxa      total
Archaea                  115       131        466             0        712
Bacteria                1209      2286      11268           770      15533
Eukaryota              18913     59886     251517         19045     349361
Fungi                   1365      4111      25496           981      31953
Metazoa                13767     39148     117739          9513     180167
Viridiplantae           2292     14058     100061          8320     124731
Viruses                  575       378       1974             0       2927
All taxa               20838     62689     265258         19815     368600

Source: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&uncultured=hide&unspecified=hide

PubChem: NCBI provides a database service, PubChem (set up in 2004), that stores curated information about the description, structure and bioassays of a range of small molecules that have a role in biological processes. PubChem Compound, for example, stores information about validated structures of small chemical compounds, and PubChem BioAssay contains data on the bioactivity of, or bioassays performed with, a chemical compound. PubChem also reveals the biosynthesis and/or role of a chemical compound in various metabolic pathways by linking to databases of metabolic pathways such as KEGG/BioSystems.
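PubChem Compound records can also be requested by script through PubChem's PUG REST interface, which this lesson does not cover. The sketch below is hedged accordingly: it only assembles request addresses, assuming the documented PUG REST pattern of `compound/cid/<CID>/property/<list>/JSON` and `compound/name/<name>/cids/JSON`. The function names are hypothetical. CID 5978 is the compound identifier that appears in the Vincristine figure source of this lesson.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service (assumed; not part of the lesson).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(cid, properties=("MolecularFormula", "MolecularWeight")):
    """URL asking for selected properties of one compound, addressed by its CID."""
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/property/{property_list}/JSON"


def compound_cids_by_name_url(name):
    """URL asking for the compound IDs (CIDs) that match a compound name."""
    # quote() escapes spaces and other characters that are unsafe in a URL path.
    return f"{PUG_REST_BASE}/compound/name/{quote(name)}/cids/JSON"


print(compound_property_url(5978))
print(compound_cids_by_name_url("caffeine"))
```

A name lookup followed by a property lookup mirrors the browser workflow of searching PubChem Compound by keyword and then opening the compound summary page.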

Figure : Access to PubChem databases through Entrez gateway Source: http://www.ncbi.nlm.nih.gov/sites/gquery


Figure : PubChem Compound database allows users to browse and download information about the structure, function and cellular biosynthesis / role of small molecules Source:http://www.ncbi.nlm.nih.gov/pccompound/?term=Vincristine ;http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5978&loc=ec_rcs http://www.ncbi.nlm.nih.gov/biosystems/148659?Sel=cid:5978#show=smallmolecules


Figure: PubChem Bioassay webpage snapshot Source: http://www.ncbi.nlm.nih.gov/pcassay/?term=Vincristine

This section deals with databases that are specific to gene expression analysis and protein analysis, essentially biomolecules that are significant for “functionality”.

Databases-III:
Expression analysis (GEO)
Proteins (Reference sequences, GenPept, UniProt/SwissProt, PRF, PDB, Protein Clusters, Structure, UniGene, CDD)


Expression analysis

Gene expression profiles based on high-throughput experiments are stored in the Gene Expression Omnibus (GEO) database. The data adhere to the Minimum Information About a Microarray Experiment (MIAME) standard, which all researchers must follow and which covers, for example, experimental design, laboratory procedures, and related information about samples and data processing. Gene expression data can be accessed through the Entrez portal of NCBI or directly through the GEO portal. The data are catalogued under i. “platforms” (i.e. oligo microarray, cDNA microarray, SAGE etc.), ii. “samples”, iii. experimental “series” and iv. “datasets”.

GEO accepts data generated through i. microarray, ii. Next Generation Sequencing (NGS), iii. Chromatin Immunoprecipitation (ChIP), iv. genome methylation, v. genome variation through array (arrayCGH), vi. SNP array, vii. SAGE and viii. protein arrays.
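The four record types under which GEO catalogues data can be told apart from the accession number itself: GEO accessions begin with GPL (platform), GSM (sample), GSE (series) or GDS (dataset). The small sketch below illustrates this; the helper name `geo_record_type` is hypothetical, and GDS3505 is the dataset accession that appears in a figure source later in this lesson.

```python
# Accession prefixes used by GEO and the record type each one denotes.
GEO_RECORD_TYPES = {
    "GPL": "platform",
    "GSM": "sample",
    "GSE": "series",
    "GDS": "dataset",
}


def geo_record_type(accession):
    """Return the GEO record type for an accession such as 'GDS3505'."""
    cleaned = accession.strip().upper()
    prefix, number = cleaned[:3], cleaned[3:]
    # A valid accession is a known three-letter prefix followed by digits only.
    if prefix not in GEO_RECORD_TYPES or not number.isdigit():
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    return GEO_RECORD_TYPES[prefix]


print(geo_record_type("GDS3505"))
```

Checking the prefix first is a quick way to decide which GEO view (platform, sample, series or dataset) a given accession will open.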

a) Gene expression profile analysis can be used to compare and group transcripts that are up- or down-regulated under different growth conditions.


Figure: GEO portal allows users to search, browse and analyse a range of quantitative data generated from microarray, ChIP-chip, NGS or methylation arrays Source: http://www.ncbi.nlm.nih.gov/geo/

Figure: GEO database. The figure shows clustering of transcripts of Arabidopsis thaliana wild type and aux mutant roots after treatment with ethylene.

Source: http://www.ncbi.nlm.nih.gov/gds/; http://www.ncbi.nlm.nih.gov/geo/gds/analyze/analyze.cgi?ID=GDS3505

Specific datasets can be reached as follows:
1. Search against any of the three fields provided. In this case, for example, searching with the term “auxin” against the “dataset” field retrieves the complete list
2. Clicking on any of the results takes you to the specific dataset
3. The expression profile can be reached by clicking on the “cluster” link towards the right-hand side of the webpage


Figure: Analysis tools at GEO allow users to compare gene expression profiles across various tissues or developmental stages. For example, the figure shows a comparative analysis of transcripts from “ventricular” and “atrium” tissue of humans.

Source: http://www.ncbi.nlm.nih.gov/geoprofiles/

Specific profiles can be reached as follows:
1. Search against the “profile” field. In this case, for example, searching with the term “Cholesterol” against the “profile” field retrieves the complete list
2. Clicking on any of the results takes you to the specific dataset
3. The expression profile can be reached by clicking on the “graphical output” link towards the right-hand side of the webpage

Table 1.2: Breakup of various forms of dataset at GEO database (as on December 2012) Source: http://www.ncbi.nlm.nih.gov/geo/summary/?type=platforms

Technology                                Count
in situ oligonucleotide                   4,129
spotted oligonucleotide                   2,518
spotted DNA/cDNA                          2,732
antibody                                     18
MS                                           16
SAGE - NlaIII                                72
SARST                                         2
MPSS                                         17
RT-PCR                                      121
other                                       124
oligonucleotide beads                       175
mixed spotted oligonucleotide/cDNA           14
spotted peptide or protein                   45
high-throughput sequencing                  845

Table: Summary of organisms represented at GEO (As on December 2012; Source: https://www.ncbi.nlm.nih.gov/geo/summary/?type=tax )

Organism                    Series    Platforms    Samples
Homo sapiens                12,824        3,722    473,819
Mus musculus                 8,682        1,612    134,566
Rattus norvegicus            1,590          368     37,384
Saccharomyces cerevisiae     1,286          489     24,337
Arabidopsis thaliana         1,698          278     20,628
Drosophila melanogaster      1,576          265     15,430
Sus scrofa                     246           69      6,096
Caenorhabditis elegans         694          157      5,339
Bos taurus                     267          108      4,890
Glycine max                    122           32      4,697
Zea mays                       176           74      4,387
Escherichia coli               392          109      4,013
Oryza sativa                   336          159      3,735
Gallus gallus                  242           77      3,639
Macaca mulatta                 155           27      2,529
Xenopus laevis                  86           21        806


Protein databases: Protein sequence and structure data at NCBI are available in the Protein Clusters and Protein databases. In addition, information about the various domains in proteins is stored in the Conserved Domain Database (CDD). The Protein database brings together protein sequences from several collections, such as in-silico translations from genomes, SwissProt, the Protein Information Resource (PIR) and the Protein Data Bank (PDB). Protein Clusters hosts groups of related sequences encoded by completed genomes, whereas domain information, i.e. the presence of functional units in a protein, can be accessed through domain databases such as CDD. The interlinked nature of the databases allows users to navigate seamlessly from one database to another, as can be demonstrated with the Protein database.


Figure : Protein database hosts sequence that can be viewed in various formats such as FASTA and Graphics Source: http://www.ncbi.nlm.nih.gov/protein
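The FASTA view mentioned in the caption above is a simple text format: each record starts with a header line beginning with “>”, followed by one or more lines of sequence. The sketch below shows one way such text could be read into (header, sequence) pairs. The two records are hypothetical sequences written for illustration, not entries from the Protein database, and `parse_fasta` is an illustrative helper rather than an NCBI tool.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records for illustration only.
SAMPLE_FASTA = """>example_protein_1 hypothetical record
MKTAYIAKQR
QISFVKSHFS
>example_protein_2 hypothetical record
MADEEKLPPG
"""

for header, sequence in parse_fasta(SAMPLE_FASTA):
    print(header, len(sequence))
```

Once parsed, each sequence is a plain string that can be passed to downstream analyses such as domain searches or alignment.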


Figure : Domain structure can be retrieved from protein database using CDD tool


Source: http://www.ncbi.nlm.nih.gov/cdd

Figure : Protein cluster interface Source: http://www.ncbi.nlm.nih.gov/proteinclusters

Searching with a keyword such as “elongase” retrieves a number of results, each of which provides information about the protein, related literature, sequence alignment and phylogenetic or evolutionary relationships.


Figure : Various windows of the Protein cluster database provide details about publications, sequence alignment and phylogeny of related proteins Source: http://www.ncbi.nlm.nih.gov/proteinclusters

Specific resources at Protein cluster can be searched as follows:
1. Search using a specific keyword, “elongase” in this example
2. A page opens listing all data that contain “elongase”
3. Click on a result to open the specific dataset
4. Click on “Alignment” and “build tree” under the “cluster” tool in the left-hand column of the webpage to display the desired results

From protein sequences, information about functional modules or domains is catalogued in the CDD database, and structure information, if solved, can be accessed through the Structure and/or Protein Data Bank (PDB) databases. As domains are the functional units of any protein, CDD employs multiple sequence alignment (MSA) to represent and identify conserved domains, and combines sequence conservation with 3-D structure prediction tools to reveal structure-function relationships. NCBI provides a free tool, Cn3D, for viewing structures; it can be accessed at http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml.
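To make the idea of reading conservation from a multiple sequence alignment concrete, the sketch below scores each column of a toy alignment by the fraction of sequences that share its most common residue. This is only a simplified illustration of the principle. It is not the method CDD itself uses, and the three aligned sequences are invented for the example.

```python
from collections import Counter


def column_conservation(aligned_sequences):
    """Per-column fraction of sequences sharing the most common residue.

    Gap characters ('-') never count as a conserved residue.
    """
    if len({len(s) for s in aligned_sequences}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    scores = []
    for column in zip(*aligned_sequences):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_columns(aligned_sequences, threshold=1.0):
    """Indices (0-based) of columns whose conservation reaches the threshold."""
    scores = column_conservation(aligned_sequences)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Toy alignment, invented for illustration.
TOY_ALIGNMENT = ["MKHLDW", "MRHLEW", "MKH-DW"]
print(conserved_columns(TOY_ALIGNMENT))
```

Runs of highly conserved columns across many related proteins are the kind of signal that marks a candidate conserved domain, which curated resources such as CDD then model in a far more rigorous way.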


Figure : Accessing Conserved domain database for protein structure from Entrez portal Source: http://www.ncbi.nlm.nih.gov/sites/gquery


Figure : CDD stores information about structural and functional domains of proteins


Source: http://www.ncbi.nlm.nih.gov/cdd/?term=Elongases , http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=116972, http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=cl09934


Figure : The Structure / Protein Data Bank (PDB) database stores 3-D structure information that has been generated using NMR spectroscopy, X-ray crystallography and computational strategies. Source: http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?uid=18763


For a given protein sequence, the domain structure can be identified using the Conserved Domain Architecture Retrieval Tool (CDART), which is based on sequence and structure comparison.


Figure: CDART portal to identify domains in proteins Source: http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps

Summary

The databases at NCBI can be defined as an organized collection of datasets that allows users to query, search, sift, retrieve and analyse data. The homepage of NCBI (www.ncbi.nlm.nih.gov) allows users either to search the entire portal or to select an appropriate database for search and analysis. Literature on a wide range of subjects, including books and peer-reviewed research articles, is catalogued under Bookshelf and Pubmed; some of it is free to access, while some can be accessed only upon payment to the publisher. Nucleotide sequences arising from various kinds of study are available under dbEST (Expressed Sequence Tags), dbGSS (Genome Survey Sequences), SRA (Sequence Read Archive, formerly the Short Read Archive), Popset (nucleotide sets arising from ecosystem, phylogeny or genetic variation analysis), dbGaP (relationships between genotype and phenotype) and dbVar (genomic structural variation). Complete or whole-genome sequence information and analysis tools can be accessed through specialized sub-portals such as Plant Genome Central and the human genome portal. Genome annotation and analysis tools include PGAAP, PASC, TaxPlot and Epigenomics, among several others. Information on classification and grouping based on taxonomic affiliation is available in the Taxonomy database. PubChem and KEGG are databases cataloguing small-molecule structures, bioassays and metabolic pathways. The Gene Expression Omnibus (GEO) database is a collection of datasets generated through high-throughput experiments using microarrays, Next Generation Sequencing, ChIP, genome methylation, SAGE and protein arrays. NCBI also hosts a variety of datasets and tools for protein resources, such as Protein Clusters, the Conserved Domain Database and Structure, along with Cn3D, a free tool for viewing structures.

Exercises

1 The single point database search and retrieval system of NCBI is termed as ------
2 What are the major domains under which NCBI databases and tools are organized?
3 Are the domains of NCBI standalone or interlinked?
4 Are there databases at NCBI that can be used to retrieve literature for plant and biomedical research? How would you use such a database?
5 What are the key features of Bookshelf and Pubmed?
6 What are the various databases for DNA sequence data?
7 Compare the features of dbEST and dbGSS.
8 Expand the following:
   o dbEST
   o dbGSS
   o TSA
   o SRA
   o PMC
   o RACE
9 What is the purpose of Popset? List the characteristic features of Popset. Retrieve a dataset of rDNA ITS and comment on the dataset obtained.
10 It is possible for you to use any database at NCBI to understand genotype and phenotype relationship. How?
11 The database for large scale genomic variation is termed as ------.
12 With the help of a flowchart, list the steps for accessing and retrieving data on genomic variation.
13 Expand the following:
   o WGS
   o GRC
   o PGAAP
   o PASC
14 Comment on the Genome databases at NCBI.
15 What are the features and advantages associated with genetic and physical maps?
16 Genome annotation and comparison can be performed using ------ and ------ tools.
17 Define epigenome. Information about the epigenome can be accessed using which database?
18 You have to find the structure and function of “Caffeine” in a biological system. Suggest a database and a strategy to fulfill the objective.
19 Determine the taxonomic status of an organism of your choice. Also find out the molecular resources associated with the organism.
20 What kind of datasets are present in GEO? How would you retrieve and analyse datasets from GEO?
21 Prepare a list of databases and datasets related to protein sequence and structure at NCBI.
22 How would you retrieve a protein sequence and gather information about the domains?
23 Select a protein sequence of your choice and determine the following
24 Domain structure
25 3-D structure
26 Which database can be used for information about protein sequence alignment and phylogeny? How?


27 What purpose do the following databases/datasets serve?

o CDART

o PDB / Structure

o CDD

Glossary

a. ChIP-chip: Chromatin immunoprecipitation on chip, often called ChIP-on-chip, is used to identify and study DNA-protein interactions. It relies on the principles of whole-genome microarrays and immunoprecipitation: chromatin immunoprecipitation is followed by microarray analysis (http://en.wikipedia.org/wiki/ChIP-on-chip).
b. Differential display: A comparative qualitative and quantitative analysis of transcript profiles to detect similarities and differences in mRNA between different tissues, developmental stages or adaptive conditions. mRNA is converted into first-strand cDNA using an oligo-dT primer and reverse transcriptase, and PCR is then performed in combination with another short random or arbitrary primer.
c. Epigenetics: The heritable and reversible alteration in gene expression state that is attributed to modifications of histone proteins and cytosine residues.
d. EST: Expressed Sequence Tags are generated through single-pass sequencing of the 5’ and 3’ ends of cDNA clones from cDNA libraries and are a rapid and inexpensive way to get a snapshot of the transcript profile and generate sequence data.
e. GEO: Gene Expression Omnibus is a database cum analysis portal for quantitative data on gene expression states generated through high-throughput technologies such as microarray, sequencing and SAGE.
f. GSS: Genome Survey Sequences; randomly selected genomic fragments are sequenced to survey the genome of an organism.
g. MIAME: An internationally accepted norm for performing microarray experiments. The MIAME guideline requires researchers to record and submit experimental design, sample annotation, raw data and processed data. Further details can be retrieved at http://www.mged.org/Workgroups/MIAME/miame_2.0.html or at http://www.ncbi.nlm.nih.gov/geo/info/MIAME.html
h. Microarray: Developed as a high-throughput tool for analysis of the global transcriptome profile. It involves microspotting cDNA clones or oligonucleotides corresponding to the entire transcript complement of an organism onto glass slides, in an array format, and then hybridizing the slides with fluorescently labelled transcripts to detect qualitative and quantitative differences.
i. Next Generation Sequencing (NGS): All new sequencing technologies that are not dependent on Sanger’s chain-termination method are broadly clubbed under NGS. These include reversible chain-termination reactions, single-molecule sequencing and ligation-based sequencing.
j. Pubmed: A central repository for literature at NCBI.
k. RACE: Rapid Amplification of cDNA Ends is a technique commonly used in molecular biology to isolate the 5’ and 3’ ends of mRNA by converting them into cDNA (http://www.ncbi.nlm.nih.gov/books/NBK21136/).

References

Works Cited NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Research 2009 Volume 37; D885-D890 NCBI GEO: archive for functional genomics data sets—10 years on. Nucleic Acids Research 2011 Volume 39; D1005-D1010

Suggested Readings Bioinformatics and Functional Genomics: 2nd Edition, Jonathon Pevsner (2009), Wiley Blackwell

Web Links
www.ncbi.nlm.nih.gov
http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&uncultured=hide&unspecified=hide
https://www.ncbi.nlm.nih.gov/geo/summary/?type=tax
https://www.ncbi.nlm.nih.gov/geo/summary/?type=platforms
http://www.mged.org/Workgroups/MIAME/miame_2.0.html
http://www.ncbi.nlm.nih.gov/geo/info/MIAME.html
