
Accessing molecular biology information

* Genome browsers - UCSC

* NCBI

* Galaxy

Flow of genetic information and databases

[Figure: flow of genetic information, with the databases for each level on the right]

DNA: Exon 1, intron, Exon 2, intron, Exon 3, intron, Exon 4
   transcription
RNA (5' to 3')                                          GenBank / EMBL
   splicing
mRNA: 5' UTR, mature coding sequence, 3' UTR, polyA
   translation
Protein                                                 GenPept / SwissProt / UniProt
   folding
Folded protein                                          Protein Data Bank (PDB)

Databases, cont.

Redundancy at GenBank => RefSeq

Many sequences are represented more than once in GenBank

2003: the RefSeq collection, a curated, non-redundant secondary database covering selected organisms

• Genome DNA (assemblies)
• Transcripts (RNA)
• Protein

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook

RefSeq vs GenBank

GenBank                                   RefSeq
----------------------------------------  ---------------------------------------------------
Not curated                               Curated
Author submits                            NCBI creates from existing data
Only author can revise                    NCBI revises as new data emerge
Multiple records from same loci common    Single records for each molecule of major organisms
Records can contradict each other
No limit to species included              Limited to model organisms
Akin to primary literature                Akin to review articles

Archives with nucleotide sequences

• GenBank/EMBL
• Sequence Read Archive (NCBI)
• GEO (Gene Expression Omnibus, NCBI)
• 1000 Genomes Project
• The Cancer Genome Atlas
• The Cancer Genome Project
• Species-specific databases like
  – FlyBase
  – WormBase
  – Saccharomyces Genome Database

Genome sequencing using a shotgun approach

Sequenced eukaryotic genomes

Sequencing going wild ...

BGI: "capacity to sequence the equivalent of 1,600 complete genomes each day"

"BGI and BGI Americas aim to build a library of digital life, which includes 1,000 plant and animal reference genomes, 10,000 microorganism genomes".

“Million genome project”: sequencing of one million Chinese individuals

NCBI Sequence Read Archive

2013: a total of ~10^15 nt

[Figure: Sequence Read Archive size by year, 2008 to 2013]

http://www.genome.gov/sequencingcosts/

Date             Cost per Mb of DNA sequence   Cost per genome
---------------  ----------------------------  ---------------
September 2001   $5,292.39                     $95,263,072
March 2002       $3,898.64                     $70,175,437
September 2002   $3,413.80                     $61,448,422
March 2003       $2,986.20                     $53,751,684
October 2003     $2,230.98                     $40,157,554
January 2004     $1,598.91                     $28,780,376
April 2004       $1,135.70                     $20,442,576
July 2004        $1,107.46                     $19,934,346
October 2004     $1,028.85                     $18,519,312
January 2005     $974.16                       $17,534,970
April 2005       $897.76                       $16,159,699
July 2005        $898.90                       $16,180,224
October 2005     $766.73                       $13,801,124
January 2006     $699.20                       $12,585,659
April 2006       $651.81                       $11,732,535
July 2006        $636.41                       $11,455,315
October 2006     $581.92                       $10,474,556
January 2007     $522.71                       $9,408,739
April 2007       $502.61                       $9,047,003
July 2007        $495.96                       $8,927,342
October 2007     $397.09                       $7,147,571
January 2008     $102.13                       $3,063,820
April 2008       $15.03                        $1,352,982
July 2008        $8.36                         $752,080
October 2008     $3.81                         $342,502
January 2009     $2.59                         $232,735
April 2009       $1.72                         $154,714
July 2009        $1.20                         $108,065
October 2009     $0.78                         $70,333
January 2010     $0.52                         $46,774
April 2010       $0.35                         $31,512
July 2010        $0.35                         $31,125
October 2010     $0.32                         $29,092
January 2011     $0.23                         $20,963
April 2011       $0.19                         $16,712
July 2011        $0.12                         $10,497
October 2011     $0.09                         $7,743

Many other sites have sequences available for downloading

www.1000genomes.org

The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.

As with other major reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied.

http://cancergenome.nih.gov/

The Cancer Genome Atlas (TCGA) is a landmark research program supported by the National Cancer Institute and National Human Genome Research Institute at the National Institutes of Health. TCGA researchers will identify the genomic changes in more than 20 different types of human cancer. By comparing the DNA in samples of normal tissue and cancer tissue taken from the same patient, researchers can identify changes specific to that particular cancer.

TCGA is analyzing hundreds of samples for each type of cancer. By looking at many samples from many different patients, researchers will gain a better understanding of what makes one cancer different from another cancer. This is important because even two patients with the same type of cancer may experience very different outcomes or respond very differently to treatments. By connecting specific genomic changes with specific outcomes, researchers will be able to develop more effective, individualized ways of helping each cancer patient.

http://www.sanger.ac.uk/genetics/CGP/

The identification of genes that are mutated and hence drive oncogenesis has been a central aim of cancer research since the advent of recombinant DNA technology. The Cancer Genome Project is using the human genome sequence and high throughput mutation detection techniques to identify somatically acquired sequence variants/mutations and hence identify genes critical in the development of human cancers. This initiative will ultimately provide the paradigm for the detection of germline mutations in non-neoplastic human genetic diseases through genome-wide mutation detection approaches.

Archives with nucleotide sequences

• GenBank/EMBL
• Sequence Read Archive (NCBI)
• GEO (Gene Expression Omnibus, NCBI)
• 1000 Genomes Project
• The Cancer Genome Atlas
• The Cancer Genome Project
• Species-specific databases like
  – FlyBase
  – WormBase
  – Saccharomyces Genome Database

NCBI

* Nucleotide
* Protein
* Structure
* PubMed
* OMIM (genetic diseases)
* dbSNP
* Taxonomy browser

NCBI databases

[Figure: flow of genetic information, with the matching NCBI database on the right]

DNA
   transcription                                            "Nucleotide"
RNA
   translation
protein                                                     "Protein"
   folding
protein 3D structure with specific biological function      "Structure"

Database formats - EMBL and GenBank

EMBL format

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300

Examples of feature table elements

* to represent a coding sequence that is constructed from a range of exons:
  CDS join(1886..1922,2272..2319,3563..3675,4750..4878)

* to represent a coding sequence on the complementary strand of DNA:
  CDS complement(1159..2577)

EMBL and GenBank formats
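Both formats use the feature-table location syntax seen in the join(...) and complement(...) examples above. As a minimal sketch, assuming only plain start..end ranges optionally wrapped in join(...) and/or an outer complement(...), the helpers `parse_location` and `extract` below (hypothetical names, not from the slides) turn such a string into coordinates and pull the matching bases out of a sequence. Partial-range markers (< and >), single-base positions, and complement(...) nested inside join(...) are deliberately not handled.

```python
import re

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def parse_location(location):
    """Return (ranges, on_complement) for a simple feature location.

    ranges is a list of (start, end) tuples, 1-based and inclusive,
    exactly as written in the feature table.
    """
    text = location.replace(" ", "")
    on_complement = False
    if text.startswith("complement(") and text.endswith(")"):
        on_complement = True
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    ranges = []
    for part in text.split(","):
        match = re.fullmatch(r"(\d+)\.\.(\d+)", part)
        if match is None:
            raise ValueError(f"unsupported location: {part!r}")
        ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges, on_complement


def extract(sequence, location):
    """Return the bases covered by location, reverse-complemented if needed."""
    ranges, on_complement = parse_location(location)
    # Convert 1-based inclusive coordinates to Python slices.
    joined = "".join(sequence[start - 1:end] for start, end in ranges)
    if on_complement:
        joined = joined.translate(COMPLEMENT)[::-1]
    return joined


exons, reverse = parse_location(
    "join(1886..1922,2272..2319,3563..3675,4750..4878)"
)
cds_length = sum(end - start + 1 for start, end in exons)
```

For the join(...) example, the four exon ranges add up to 37 + 48 + 113 + 129 = 327 nt, which is 109 codons, so `cds_length` is a quick sanity check that a spliced coding sequence is a multiple of three.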

EMBL format

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX

Common sequence formats

1. GenBank
2. EMBL
3. FASTA

>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT
...

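FASTA is simple enough to read without a library: a line starting with ">" opens a record and carries its description, and every following line up to the next ">" is sequence. The sketch below is one minimal way to do that, assuming plain text input and no validation of the sequence alphabet; `read_fasta` is a hypothetical helper name, and the example record is the one shown above with its trailing "..." left out.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated without the line breaks.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT
"""

records = list(read_fasta(example.splitlines()))
```

Because `read_fasta` is a generator over lines, the same function can be fed an open file handle, which keeps memory use low for large files.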
4. FASTQ

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT
+
!''*((((***+))%%%++)(%%%%).1***-+*

A selection of search fields using NCBI Entrez.

Search Field Name [Qualifier]
    Definition

Accession [ACCN]
    Contains the unique accession number of the sequence or record, assigned to the nucleotide, protein, structure, genome record, or PopSet by a sequence database builder. The Structure database accession index contains the PDB IDs but not the MMDB IDs.

All Fields [ALL]
    Contains all terms from all searchable database fields in the database.

Author Name [AUTH]
    Contains all authors from all references in the database records. The format is last name space first initial(s), without punctuation (e.g., marley jf).

Feature Key [FKEY]
    Contains the biological features assigned or annotated to the nucleotide sequences and defined in the DDBJ/EMBL/GenBank Feature Table (http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html). Not available for the Protein or Structure databases.

Journal Name [JOUR]
    Contains the name of the journal in which the data were published. Journal names are indexed in the database in abbreviated form (e.g., J Biol Chem). Journals are also indexed by their ISSNs. Browse the index if you do not know the ISSN or are not sure how a particular journal name is abbreviated.

Modification Date [MDAT]
    Contains the date that the most recent modification to that record is indexed in Entrez, in the format YYYY/MM/DD (e.g., 1999/08/05). A year alone (e.g., 1999) will retrieve all records modified for that year; a year and month (e.g., 1999/03) retrieves all records modified for that month that are indexed in Entrez.

Organism [ORGN]
    Contains the scientific and common names for the organisms associated with protein and nucleotide sequences.

Properties [PROP]
    Contains properties of the nucleotide or protein sequence. For example, the Nucleotide database's Properties index includes molecule types, publication status, molecule locations, and GenBank divisions. A Properties index is not available in the Structure database.

Publication Date [PDAT]
    Contains the date that records are released into Entrez, in the format YYYY/MM/DD (e.g., 1999/08/05). It is the date the