
A Field Guide to GenBank and NCBI Molecular Biology Resources

slightly modified from

Peter Cooper ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/

Eric Sayers ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/

NCBI Resources

• About NCBI
• NCBI Sequence Databases
  – Primary Database: GenBank
  – Derivative Databases: RefSeq
• Entrez Databases and Text Searching
• BLAST Services
• Genomic Resources

The National Center for Biotechnology Information (NCBI)

[Photos: Lister Hill Center; William H. Natcher Building]

• Created as a part of NLM in 1988
  – Establish public databases
  – Perform research in
  – Develop software tools for
  – Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Human genome (2001)

NCBI Home Page

http://www.ncbi.nlm.nih.gov

To learn more, visit the "About NCBI" and "Site Map" web pages.

Some NCBI Statistics

Growth of GenBank

[Figure: growth of GenBank from 1982 to 2002, plotting sequences (millions) and base pairs of DNA (millions).]

Users per day

[Figure: users per day from 1997 to 2001, on a scale of 0 to 250,000; Christmas Day is marked.]

Molecular Databases

• Primary Databases
  – Original submissions by experimentalists
  – Database staff organize but don't add additional information
    • Example: GenBank
• Derivative Databases
  – Human curated
    • compilation and correction of data
    • Example: SWISS-PROT, NCBI RefSeq mRNA
  – Computationally Derived
    • Example: UniGene
  – Combinations
    • Example: NCBI Assembly

What is GenBank? NCBI's Primary Sequence Database

• Nucleotide only sequence database
• GenBank Data
  – Direct submissions, individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – ftp accounts established for centers
• Data shared amongst three collaborating databases:
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory Database (EMBL)

The International Nucleotide Sequence Database Collaboration

[Figure: the three collaborating databases exchanging data. Labels: NIH, NCBI, GenBank, Entrez, Sequin, BankIt, ftp; EBI, EMBL, SRS; NIG, CIB, DDBJ, getentry. Each database receives its own submissions and updates.]

GenBank: NCBI's Primary Sequence Database

Release 133, December 2002

22,318,883 Records
28,507,990,166 Nucleotides
110,000+ Species

• full release every two months
• incremental and cumulative updates daily
• available only through internet

ftp://ftp.ncbi.nih.gov/genbank/

>90 Gigabytes of data

Entrez Nucleotide

[Pie chart: sources of the 23,464,770 records in Entrez Nucleotide]
GenBank 71%
DDBJ 19%
EMBL 9%
RefSeq 1%

Primary vs. Derivative Databases

[Figure: sequencing centers and labs submit sequences to GenBank, the primary database; curators and algorithms build the derivative databases RefSeq, UniGene, and Genome Assembly from it.]

Traditional GenBank Divisions

• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized

BCT  Bacterial and Archaeal
INV  Invertebrate
MAM  Mammalian (excluding ROD and PRI)
PHG  Phage
PLN  Plant and Fungal
PRI  Primate
ROD  Rodent
SYN  Synthetic (cloning vectors)
VRL  Viral
VRT  Other Vertebrate

A Traditional GenBank Record

[Figure: a GenBank record with callouts for the Locus field, molecule type, modification date, GenBank division, definition line, accession number, version, GI (GenInfo) number, keywords, and taxonomy.]

Bulk Sequence Divisions of GenBank

• Batch Submissions (email and ftp)
• Inaccurate
• Poorly Characterized

EST  Expressed Sequence Tag
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
HTC  High Throughput cDNA

Organization of GenBank

[Pie chart: share of GenBank records by group of divisions]
5 Bulk Divisions
  EST 67%
  GSS 19%
  STS, HTG, HTC 2%
11 Traditional Divisions
  Traditional 8%
1 Patent Division
  PAT 4%
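The division codes above lend themselves to a simple lookup. What follows is a minimal sketch, not NCBI code: the dictionary contents are copied from the two division lists in this guide (only the ten traditional codes that are spelled out, although the chart counts eleven), and the names `TRADITIONAL_DIVISIONS`, `BULK_DIVISIONS`, `PATENT_DIVISIONS`, and `division_group` are hypothetical.

```python
# Division codes copied from the lists in this guide.
# All variable and function names are illustrative, not part of any NCBI tool.
TRADITIONAL_DIVISIONS = {
    "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate",
    "MAM": "Mammalian (excluding ROD and PRI)",
    "PHG": "Phage",
    "PLN": "Plant and Fungal",
    "PRI": "Primate",
    "ROD": "Rodent",
    "SYN": "Synthetic (cloning vectors)",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
}

BULK_DIVISIONS = {
    "EST": "Expressed Sequence Tag",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "HTC": "High Throughput cDNA",
}

PATENT_DIVISIONS = {"PAT": "Patent"}


def division_group(code: str) -> str:
    """Return 'traditional', 'bulk', or 'patent' for a three-letter division code."""
    key = code.strip().upper()
    if key in TRADITIONAL_DIVISIONS:
        return "traditional"
    if key in BULK_DIVISIONS:
        return "bulk"
    if key in PATENT_DIVISIONS:
        return "patent"
    raise KeyError(f"unknown GenBank division code: {code!r}")


if __name__ == "__main__":
    for code in ("PRI", "EST", "PAT"):
        print(code, "->", division_group(code))
```

Knowing the group is useful when filtering search results: the bulk divisions hold most of the records (EST alone is 67%) but are described above as less accurate and poorly characterized.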

23,087,196 records in total

EST Division: Expressed Sequence Tags

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTT
TCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAG
AATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAG
TTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAG
CAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGT
ATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNA
TCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGA
TGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
ATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTC
ATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATT
CTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTT
GAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATT
CTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

[Figure: how ESTs are made. Labels: nucleus, 30,000 genes, RNA gene products, 5' and 3' ends; make cDNA library; 80-100,000 unique cDNA clones in library; isolate unique clones; sequence once from each end.]

What is UniGene?

A gene-oriented view of sequence entries

• MegaBlast-based automated sequence clustering
• Nonredundant set of gene-oriented clusters
• Each cluster represents a unique gene
• Provides information on tissue-specific expression and map locations
• Includes well-characterized genes and novel ESTs
• Useful for gene discovery and selection of mapping reagents

Organisms Represented in UniGene

Just in: C. elegans, Ciona intestinalis, Gallus gallus

EST hits to Homo sapiens muscle creatine kinase mRNA

[Figure: EST hits aligned against the query sequence.]

Genome Sequencing

[Figure: genome sequencing workflow. Labels: whole BAC insert (or genome); shredding; cloning; isolating; sequencing; GSS division or trace archive; assembly; draft sequence (HTG division).]

Working Draft Sequence

[Figure: a working draft sequence, with gaps marked.]

HTG Division: High Throughput Genome

phase 1  HTG  Acc = AC109609.1
phase 2  HTG  Acc = AC109609.6
phase 3  ROD  Acc = AC109609.10
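The three identifiers above share one accession (AC109609) and differ only in the number after the dot, the version, which increments each time the sequence data changes. A minimal sketch of splitting such an identifier follows; `AccessionVersion` and `parse_accession_version` are hypothetical names, and the only format assumed is `accession.version` with an integer version.

```python
from typing import NamedTuple


class AccessionVersion(NamedTuple):
    accession: str
    version: int


def parse_accession_version(identifier: str) -> AccessionVersion:
    """Split an 'accession.version' string such as 'AC109609.10'."""
    accession, dot, version = identifier.strip().rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"expected accession.version, got {identifier!r}")
    return AccessionVersion(accession, int(version))


# Identifiers taken from the HTG example above.
phase_ids = ["AC109609.1", "AC109609.6", "AC109609.10"]
parsed = [parse_accession_version(i) for i in phase_ids]

# One accession throughout; only the version changes.
accessions = {p.accession for p in parsed}
versions = [p.version for p in parsed]

# Comparing versions as integers keeps .10 after .6;
# a plain string sort would place 'AC109609.10' before 'AC109609.6'.
latest = max(parsed, key=lambda p: p.version)

if __name__ == "__main__":
    print(accessions, versions, latest)
```

Note that in this example the division changes from HTG to ROD at phase 3 while the accession stays the same, so the accession is the stable handle for following a record through the phases.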

Sequence length: 40,000 to > 350,000 bp

NCBI's Third Party Annotation (TPA) Database (NEW)

• NCBI now accepts the submission of new annotations of existing GenBank sequences
• Facilitates the annotation of genomes by experts

A Sample TPA record

RefSeq: NCBI's Derivative Sequence Database

• Curated transcripts and proteins
  – reviewed
  – human, mouse, rat, fruit fly, zebrafish, Arabidopsis
• Human model transcripts and proteins
• Assembled Genomic Regions (contigs)
  – draft human genome
  – mouse genome
• Chromosome records
  – microbial
  – viral
  – organelle

The RefSeq Accession Numbers

mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated Protein
  NR_123456  Curated non-coding RNA
  XM_123456  Predicted Transcript (human, mouse)
  XP_123456  Predicted Protein (human, mouse)
  XR_123456  Predicted non-coding RNA
  Curated records cover human, mouse, rat, fruit fly, zebrafish, and Arabidopsis.

Gene Records
  NG_123456  Reference Genomic Sequence (human)

Assemblies
  NT_123456  Contig (Mouse and Human)
  NW_123456  Supercontig (Mouse)
  NC_123456  Chromosome (Microbial, Viral, Arabidopsis)
  NR_123456  Interim Identifier for Microbial Chromosomes

Curated RefSeq Records: NM_, NP_

Entrez: Linking and Neighboring

The Entrez Databases

Entrez: Database Integration

[Figure: Entrez database integration. Labels: PubMed abstracts (word weight); nucleotide sequences (BLAST); protein sequences (BLAST); 3-D Structure (VAST); Taxonomy (phylogeny).]

The (ever) Expanding Entrez System

[Figure: Entrez at the center, surrounded by its databases: PubMed, PubMed Central, Journals, Books, Nucleotide, Protein, Structure, Genome, Taxonomy, OMIM, CDD, 3D Domains, PopSet, ProbeSet, SNP, UniSTS, UniGene.]

Entrez Nucleotides

Search term: glucose 6 phosphate dehydrogenase

Document Summaries

glucose 6 phosphate dehydrogenase[All Fields] = 748 hits

Entrez Nucleotides: Limits

Search term: glucose 6 phosphate dehydrogenase

Field menu: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word

Entrez Nucleotides: Preview/Index

Adding Terms: Preview/Index

Term entered: green plants

[Screenshot: the same field menu, shown on the Preview/Index page.]

Plant G6PD mRNAs

Display: Formats, Links, and Neighbors

Display options: Summary, Brief, ASN.1, FASTA, XML, GenBank, GI list, LinkOut, Nucleotide Neighbors, Genome Links, ProbeSet Links, OMIM Links, PopSet Links, Protein Links, PubMed Links, SNP Links, Structure Links, Taxonomy Links, UniSTS Links

>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA
GATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGC
TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT
GTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAAC
AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTT
TACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGAT
TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCT
CCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT
GGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGAT
TGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC
ATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGC
AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCG
AGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG
CCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTG
TTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC
AACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCC
CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAA
AGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAA
GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGG
ATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC
GCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTT
GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA
TATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACA
AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTC
TCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA

FASTA definition line

>gi|603218|gb|U18238.1|MSU18238

gi number            603218
Database identifier  gb
Accession number     U18238.1
Locus name           MSU18238

Database identifiers

gb   GenBank
emb  EMBL
dbj  DDBJ
sp   SWISS-PROT
pdb  Protein Databank
pir  PIR
prf  PRF
ref  RefSeq

Entrez Genome Organism Pages

The Map Viewer: a common platform for integrated display

The Map Viewer

Entrez PubMed

Online Books

Entrez Specialized Databases

Taxonomy Searchable taxonomic tree having nodes for all species with records in an Entrez database

OMIM Online Mendelian Inheritance in Man: A database of genetically linked human diseases

ProbeSet Expression data (GEO) and microarray datasets

Entrez Taxonomy

Entrez OMIM

Entrez ProbeSet

Trace Archive

Entrez Structure

[Screenshot: Entrez Structure record 1CET.]

Structure Summary

[Screenshot callouts: Cn3D viewer, Related Structures, Conserved Domains.]

Cn3D: Displaying Structures

[Figure: structure displayed in Cn3D, with chloroquine labeled.]

Structure Neighbors

Structural Alignment

[Figure: structural alignment, with chloroquine and NADH labeled.]

MMDB: Molecular Modeling DataBase

• Derived from experimentally determined PDB records
• Value added to PDB records including:
  – Addition of explicit chemical graph information
  – Validation
  – Inclusion of Taxonomy, Citation, and other information
  – Conversion to ASN.1 data description language
• Structure neighbors determined by Vector Alignment Search Tool (VAST)