EMBL-EBI Powerpoint Presentation
EBI Patent Sequence Services
PIUG 2012 Biotechnology Workshop • February 6th, 2012 • Boston
Jennifer McDowall
EBI is an Outstation of the European Molecular Biology Laboratory.

Overview
1) Know the data
• European Nucleotide Archive
• UniProt
• Non-redundant patent sequence DB
2) The Toolbox
• EBI search
• SRS advanced text search
• Sequence searching

Sequence Searching Tools

1) Know the Data

Know the data
• Many databases, each getting bigger
• Efficient searching requires knowledge of what data is stored in a database
  Don't assume annotation can be transferred because of a good match
• Databases can contain errors
• Data can change
  Deletions, sequence modifications
  Daily updates, identifier changes…

Major sequence databases
European Nucleotide Archive
• >170 million sequences
• (~42 million non-redundant)
• release every 3 months, daily updates
UniProt
• >30.1 million non-redundant sequences
• monthly release, daily updates

Additional sequence data
Specialized databases
• Immunoglobulins: IMGT/HLA, IMGT/LIGM
• Immunopolymorphisms: IPD-KIR, IPD-MHC
• Variation: HGVBase, dbSNP
• Alternative splicing: ASTD
• Completed genomes: Ensembl, Integr8
• Structure: PDB, Structural Genomics targets
• Interaction: IntAct

Patent Sequences
Patent sequences can be found in the following databases:
• ENA: patent nucleotides
• UniProt Archive: patent proteins
• NR patent sequences: patent nucleotides and proteins

Which database do you use? Let's take a look…
• European nucleotide archive
• UniProt
• Non-redundant patent sequence databases

Primary sequence databases
Primary data submitted to databases
INSDC: GenBank (U.S.A.), DDBJ (Japan), ENA (Europe) + SRA
INSDC agreement:
• Free unrestricted access
• All data exchanged daily
How do they differ?
  organization of data
  tools and database links

ENA has a 3-tiered structure
1) EMBL-Bank
2) Sequence Read Archive (Next Gen sequencing)
3) Trace Archive (Capillary sequencing)
[Diagram labels: feature annotation; assembly information; sequencing & sampling information]
http://www.ebi.ac.uk/ena/

How is the data organised?
Data in EMBL-Bank is divided in 2 ways:
1) Data classes
• Type of data or methodology used to obtain data
• Each entry belongs to one data class
2) Taxonomic Divisions
• Each entry belongs to one taxonomic division

EMBL-Bank data classes
CON  Constructed from sequence assemblies
EST  Expressed Sequence Tag (cDNA)
GSS  Genome Survey Sequence (high-throughput short sequence)
HTC  High-Throughput cDNA (unfinished)
HTG  High-Throughput Genome sequencing (unfinished)
MGA  Mass Genome Annotation
PAT  Patent sequences
SRA  Sequence Read Archive (both databank and data class)
STS  Sequence Tagged Site (short unique genomic sequences)
STD  Standard (high quality annotated sequence)
TSA  Transcriptome Shotgun Assembly (computational assembly)
WGS  Whole Genome Shotgun

Data is always changing
• Assembly of sequences into larger fragments
• Suppression of obsolete entries (i.e. once assembled)
• Sequence modifications
• Daily updates
• Identifier changes
• Corrections (databases can contain errors)
• etc…

Data assembly can affect entries
Example:
WGS (Shotgun)
• Fragments in separate entries
CON (Constructed)
• Join to make new CON entries
  Old WGS entries archived
STD (Standard)
• Join into large STD entry (e.g. completed genome)
• Add annotation
  Old CON entries archived

ENA taxonomy
All INSDC databases use NCBI taxonomy
Divisions (only sequenced organisms represented):
HUM  Human
MUS  Mouse
ROD  Rodent
MAM  Mammal
VRT  Vertebrate
FUN  Fungi
INV  Invertebrate
PLN  Plant
PRO  Prokaryote
PHG  Phage
VIR  Viral
Other:
ENV  Environmental
SYN  Synthetic
TGN  Transgenic
UNC  Unclassified

Some species EXCLUDED from certain taxonomic ranges
ROD  Rodent excludes mouse
MAM  Mammal excludes human, mouse, rodent
VRT  Vertebrate excludes human, mouse, rodent, mammal
Applies to ftp files and sequence search tools but not to ENA browser

Sometimes there is no taxonomic data
Environmental • Genus species = ‘uncultivated bacterium’ or ‘unspecified’
Synthetic • Genus species = ‘synthetic construct’
Transgenic • Taxonomy for recipient and donor organisms
Patent • Exempt from requiring Genus species

Database structure
EMBL-Bank: ENA Database
EMBL-Bank: Data classes
1st: Data split into classes
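The division exclusions listed under ENA taxonomy (ROD excludes mouse; MAM excludes human, mouse and rodent; VRT excludes human, mouse, rodent and mammal) can be read as "each entry goes to the most specific division that fits". The Python sketch below illustrates that reading only. The function name, the group labels and the input format are hypothetical, not part of any ENA tool, and only the HUM/MUS/ROD/MAM/VRT chain is covered.

```python
# Hypothetical sketch: pick the single taxonomic division for an entry.
# Divisions are checked from most to least specific, which reproduces the
# exclusions on the slides (ROD excludes mouse; MAM excludes human, mouse,
# rodent; VRT excludes human, mouse, rodent, mammal).
DIVISION_ORDER = [
    ("HUM", "human"),
    ("MUS", "mouse"),
    ("ROD", "rodent"),
    ("MAM", "mammal"),
    ("VRT", "vertebrate"),
]


def assign_division(groups):
    """Return the most specific division code for a set of group labels.

    `groups` is an assumed input format: every group the organism belongs
    to, e.g. {"mouse", "rodent", "mammal", "vertebrate"} for a mouse.
    Returns None when none of the five divisions above applies.
    """
    for code, group in DIVISION_ORDER:
        if group in groups:
            return code
    return None


mouse = {"mouse", "rodent", "mammal", "vertebrate"}
rat = {"rodent", "mammal", "vertebrate"}
print(assign_division(mouse))  # MUS
print(assign_division(rat))    # ROD
```

In practice this means a search restricted to ROD or MAM in the ftp files or sequence search tools will not return mouse or human sequences; those have to be searched through MUS and HUM. According to the slides, the ENA browser does not apply these exclusions.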
Taxonomic Divisions: HUM, MUS, ROD, MAM, VRT, FUN, INV, ...
2nd: Data split into intersecting slices by taxonomy
Reduces search set
‘Mouse’ + ‘EST’ intersection

European Nucleotide Archive
ENA is accessible from the EBI homepage
http://www.ebi.ac.uk/

ENA homepage
• Text search
• Sequence search
• Programmatic access
http://www.ebi.ac.uk/ena

Patent sequence record in EMBL-Bank
[Annotated screenshot; labels: sequence version; download data; dates (first public and last updated); navigate to related data, e.g. version archive; graphical viewer; DNA source; navigate to external data sources, e.g. UniProt; patent reference; sequence]

Non-patent entry in EMBL-Bank
[Annotated screenshot; labels: general information; more detailed graphical view; additional information; genome annotation; assembly information]

ENA graphical viewer