EBI Patent Sequence Services
• PIUG 2012 Biotechnology • Workshop • February 6th, 2012 • Boston
Jennifer McDowall
EBI is an Outstation of the European Molecular Biology Laboratory.

Overview
1) Know the data
European Nucleotide Archive
UniProt
Non-redundant patent sequence DB
2) The Toolbox
EBI search
SRS advanced text search
Sequence searching
Sequence Searching Tools

1) Know the Data

Know the data
• Many databases, each getting bigger
• Efficient searching requires knowledge of what data is stored in a database
  Don't assume annotation can be transferred because of a good match
• Databases can contain errors
• Data can change
  Deletions, sequence modifications
  Daily updates, identifier changes…

Major sequence databases
European Nucleotide Archive
• >170 million sequences (~42 million non-redundant)
• Release every 3 months, daily updates
UniProt
• >30.1 million non-redundant sequences
• Monthly release, daily updates

Additional sequence data
Specialized databases
• Immunoglobulins: IMGT/HLA, IMGT/LIGM
• Immunopolymorphisms: IPD-KIR, IPD-MHC
• Variation: HGVBase, dbSNP
• Alternative splicing: ASTD
• Completed genomes: Ensembl, Integr8
• Structure: PDB, Structural Genomics targets
• Interaction: IntAct

Patent Sequences
Patent sequences can be found in the following databases:
ENA • Patent nucleotides
UniProt Archive • Patent proteins
NR patent sequences • Patent nucleotides and proteins

Which database do you use?
let's take a look…

European nucleotide archive
UniProt
Non-redundant patent sequence databases

Primary sequence databases
Primary data submitted to databases
INSDC: GenBank (U.S.A.), DDBJ (Japan), ENA (Europe), + SRA
INSDC agreement:
• Free unrestricted access
• All data exchanged daily
How do they differ?
• Organization of data
• Tools and database links

ENA has a 3-tiered structure
1) EMBL-Bank
2) Sequence Read Archive (Next Gen sequencing)
3) Trace Archive (Capillary sequencing)
Information held: feature annotation, assembly information, sequencing & sampling information
http://www.ebi.ac.uk/ena/

How is the data organised?
Data in EMBL-Bank is divided in 2 ways:
1) Data classes
• Type of data or methodology used to obtain data
• Each entry belongs to one data class
2) Taxonomic Divisions
• Each entry belongs to one taxonomic division

EMBL-Bank data classes
CON  Constructed from sequence assemblies
EST  Expressed Sequence Tag (cDNA)
GSS  Genome Survey Sequence (high-throughput short sequence)
HTC  High-Throughput cDNA (unfinished)
HTG  High-Throughput Genome sequencing (unfinished)
MGA  Mass Genome Annotation
PAT  Patent sequences
SRA  Sequence Read Archive (both databank and data class)
STS  Sequence Tagged Site (short unique genomic sequences)
STD  Standard (high quality annotated sequence)
TSA  Transcriptome Shotgun Assembly (computational assembly)
WGS  Whole Genome Shotgun

EMBL-Bank data classes
Data is always changing
• Assembly of sequences into larger fragments
• Suppression of obsolete entries (i.e. once assembled)
• Sequence modifications
• Daily updates
• Identifier changes
• Corrections (databases can contain errors)
• etc…

EMBL-Bank data classes
Data assembly can affect entries
Example:
WGS Shotgun
• Fragments in separate entries
• Join to make new CON entries
CON Constructed (old WGS entries archived)
• Join into large STD entry (e.g. completed genome)
• Add annotation
STD Standard (old CON entries archived)

ENA taxonomy
All INSDC databases use NCBI taxonomy
Divisions (only sequenced organisms are represented):
HUM Human, MUS Mouse, ROD Rodent, MAM Mammal, VRT Vertebrate, FUN Fungi
INV Invertebrate, PLN Plant, PRO Prokaryote, PHG Phage, VIR Viral
Other: ENV Environmental, SYN Synthetic, TGN Transgenic, UNC Unclassified

ENA taxonomy
Some species EXCLUDED from certain taxonomic ranges
ROD Rodent excludes mouse
MAM Mammal excludes human, mouse, rodent
VRT Vertebrate excludes human, mouse, rodent, mammal
Applies to ftp files and sequence search tools, but not to ENA browser

ENA taxonomy
Sometimes there is no taxonomic data
Environmental • Genus species = 'uncultivated bacterium' or 'unspecified'
Synthetic • Genus species = 'synthetic construct'
Transgenic • Taxonomy for recipient and donor organisms
Patent • Exempt from requiring Genus species
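
The division exclusions listed for ROD, MAM and VRT amount to placing each entry in the most specific division that applies. A minimal Python sketch of that rule, assuming a hypothetical `assign_division` helper and hand-written lineage labels (real assignment is driven by NCBI taxonomy, not by these strings):

```python
# Hypothetical sketch: pick one taxonomic division per entry, most specific first.
# The lineage labels are illustrative only; they are not NCBI taxonomy identifiers.
DIVISION_ORDER = [
    ("HUM", "human"),
    ("MUS", "mouse"),
    ("ROD", "rodent"),
    ("MAM", "mammal"),
    ("VRT", "vertebrate"),
]


def assign_division(lineage):
    """Return the first matching division code, or None if no listed group applies."""
    for code, group in DIVISION_ORDER:
        if group in lineage:
            return code
    return None


mouse = {"vertebrate", "mammal", "rodent", "mouse"}
rat = {"vertebrate", "mammal", "rodent"}
print(assign_division(mouse))  # MUS, so mouse entries are absent from ROD, MAM and VRT
print(assign_division(rat))    # ROD
```

Because the check runs from most to least specific, a search restricted to MAM in this sketch would miss human, mouse and rodent entries, which is the behaviour the slide warns about for ftp files and sequence search tools.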

Database structure
EMBL-Bank (ENA Database)
1st: Data split into classes
2nd: Data split into intersecting slices by taxonomy
Taxonomic Divisions: HUM, MUS, ROD, MAM, VRT, FUN, INV, ...
Example: 'Mouse' + 'EST' intersection
Reduces search set
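
The two-way split can be pictured as an index keyed by (data class, taxonomic division). The following Python sketch is only an illustration under that assumption; the entries, the placeholder accessions and the `build_slices` helper are invented here and do not describe how EMBL-Bank is stored:

```python
from collections import defaultdict

# Made-up entries for illustration; accessions are placeholders, not real records.
ENTRIES = [
    {"accession": "ENTRY1", "data_class": "EST", "division": "MUS"},
    {"accession": "ENTRY2", "data_class": "EST", "division": "HUM"},
    {"accession": "ENTRY3", "data_class": "STD", "division": "MUS"},
    {"accession": "ENTRY4", "data_class": "EST", "division": "MUS"},
]


def build_slices(entries):
    """Index entries by (data class, taxonomic division)."""
    slices = defaultdict(list)
    for entry in entries:
        key = (entry["data_class"], entry["division"])
        slices[key].append(entry["accession"])
    return slices


slices = build_slices(ENTRIES)
mouse_est = slices[("EST", "MUS")]  # the 'Mouse' + 'EST' intersection
print(mouse_est)  # ['ENTRY1', 'ENTRY4']
```

Searching only the slice of interest is what reduces the search set: two of the four entries are examined instead of all of them.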

European Nucleotide Archive
ENA is accessible from the EBI homepage
http://www.ebi.ac.uk/

ENA homepage
• Text search
• Sequence search
• Programmatic access
http://www.ebi.ac.uk/ena

Patent sequence record in EMBL-Bank
• Sequence version
• Download data
• Dates (first public and last updated)
• Navigate to related data, e.g. Version archive
• Graphical viewer
• DNA source
• Navigate to external data sources, e.g. UniProt
• Patent reference
• Sequence

Non-patent entry in EMBL-Bank
• General information
• More detailed graphical view
• Additional information
• Genome annotation
• Assembly information

ENA graphical viewer

Patent sequences in ENA
• Comprehensive archive with >160 million entries
• Release every 3 months, daily updates
• EMBL-Bank contains >22 million patent entries
• Patent sequences from EPO, USPTO, JPO and KIPO
• Sequence redundancy arises from different patents claiming same sequence
• Redundancy may be useful for studies in variation
…but a problem for patents

ENA
ENA is a comprehensive resource for nucleotide sequence data,
but it is better for non-patent data

European nucleotide archive
UniProt
Non-redundant patent sequence databases

Where does the data come from?
Sequence sources (exchange data daily): ENA, PDB, RefSeq, Ensembl, VEGA, Patents, Model organisms, more…
All sequences feed into UniParc

UniProt has a 3-tiered structure
Sequence sources: ENA, PDB, RefSeq, Ensembl, VEGA, Patents, Model organisms, more…
UniParc: history of sequences
UniMES: metagenomic & environmental sequences, from metagenomic projects
UniProtKB/TrEMBL: taxonomy known, automatic annotation
UniProtKB/SwissProt: redundancy removed, manual annotation, high quality annotation
UniRef clusters, UniMES clusters

UniProt has a 3-tiered structure
UniParc
• Complete history of sequences (no annotation)
• Cross-links to external sequence sources
UniProtKB
• Swiss-Prot: non-redundant, manual annotation
• TrEMBL: redundant, automatic annotation
UniMES
• Sequences from metagenomic projects
UniRef
• Combines sequences (speed searching)
• UniRef100, UniRef90, UniRef50

Patent sequence record in UniParc
• Accession
• Download data
• Patents not found in UniProtKB
• List of databases containing sequence
• Deleted entries identified (greyed out)
• Navigate to individual entries
• Sequence

Browsing a UniProtKB/SwissProt entry
• General information
• Download data
• Names (synonyms) and taxonomy
• Protein attributes
• Annotation
• Ontologies
• Protein interactions
• Splice variants
• Sequence features
• Sequence
• References
• Navigate to external data sources, e.g. Ensembl

Browsing a UniProtKB/TrEMBL entry
• Name (could be clone name)
• Taxonomy
• Automatic annotation (derived from InterPro)
• Ontologies (both automatic and manual curation)

Browsing a UniRef90 entry
Faster and more sensitive sequence search with no loss of information
Columns shown for a cluster:
• Status (SwissProt and/or TrEMBL)
• Cluster name
• List of entries in cluster
• Taxonomy of each entry
• % identity of sequences in cluster

Taxonomic distribution of species
All kingdoms: Bacteria (61%), Eukaryota (32%), Archaea (4%), Viruses (3%)
Within Eukaryota: Other mammals (27%), Fungi (18%), Viridiplantae (18%), Homo (12%), Other Vertebrata (10%), Other (8%), Insecta (5%), Nematoda (2%)

SwissProt – most represented species
Mainly model organisms

Protein existence tag
!! Not sequence validation !!
Protein existence level:         Total    Human
Evidence at protein level        13%      59%
Evidence at transcript level     12%      37.5%
Inferred from homology           70%      1%
Predicted                        5%       0.5%
Uncertain (mainly TrEMBL)        -        2%

Annotation sources for UniProtKB
Some sources for annotation data
Data sources:
• GO: Functional info
• PRIDE: Protein identification data
• InterPro: Protein families and domains
• IntAct: Molecular interactions
• IntEnz: Enzymes
• HAMAP: Microbial protein families
• RESID: Post-translational modifications
Protein classification:
• InterPro classification
Predictions:
• Signal prediction
• Transmembrane prediction
• Other predictions
UniProtKB annotation:
* Manual curation
* Literature-based annotation
* Sequence analysis
* Automated annotation

Features of UniProtKB
• Splice variants
• Sequence
• Sequence features
• Ontologies
• Annotations
• Nomenclature
• References

A wealth of external links
125 links!
Organism-specific DBs: DictyBase, AGD, EchoBASE, CGD, EcoGene, CTD, euHCVdb, CYGD, FlyBase, HGNC, GeneCards, HPA, GeneFarm, MGI, Gramene, MIM, H-InvDB, RGD, LegioList, SGD, Leproma, TAIR, ListiList, ZFIN, MaizeGDB, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, SagaList, SubtiList, TubercuList, WormBase, WormPep, Xenbase, GeneDB_Spombe, ArachnoServer, BuruList
Enzyme & pathway DBs: BioCyc, BRENDA, Reactome, Pathway_Interaction_DB
Proteomic DBs: PeptideAtlas, PRIDE, ProMEX
Genome annotation DBs: Ensembl, KEGG, GeneID, NMPDR, VectorBase, UCSC, GenomeReviews, TIGR
Family and domain DBs: Gene3D, PIRSF, HAMAP, PRINTS, InterPro, ProDom, PANTHER, PROSITE, Pfam, TIGRFAMs, SMART
Phylogenomic DBs: HOGENOM, OMA, HOVERGEN, PhylomeDB, InParanoid, OrthoDB
Polymorphism DBs: dbSNP
Ontologies: GO
2D gel DBs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR
Gene expression DBs: ArrayExpress, Bgee, GermOnline, CleanEx, Genevestigator
Protein-protein interaction DBs: DIP, IntAct, STRING
PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite
Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene
Protein family/group DBs: CAZy, MEROPS, PeroxiBase, REBASE, PptaseDB, TCDB
Others: BindingDB, PMAP-CutDB, DrugBank, NextBio

SwissProt manual annotation
1. Protein sequence
• Merge available CDS (coding sequence)
• Annotate sequence discrepancies
• Report sequencing errors...
2. Biological information
• Extract literature information
• Orthologue data propagation
• Protein sequence analysis...

Merge available CDS
1 SwissProt entry = 1 gene (1 species)
Merge TrEMBL entries representing the same protein
Manually analyze and annotate the differences

Annotate sequence discrepancies
Identification of amino acid variants
....and of PTMs

Report sequencing errors
Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines

Sources of annotated information
UniProtKB/SwissProt gathers information from multiple sources:
• Publications (literature/PubMed)
• Prediction programs (Prosite, Anabelle)
• Contact with experts
• Other databases
• Nomenclature committees

Nomenclature
Synonyms useful for literature searching

Nomenclature
Provides synonyms and cleavage products of bifunctional proteins

Annotation comments
>30 comment fields
Controlled vocabularies used whenever possible…

Sequence annotation (Features)
…enable researchers to obtain a summary of what is known about a protein…
…including domain annotation, identifying binding sites…

Sequence annotation (Features)
Feature (e.g. domain) highlighted on sequence

Gene Ontology
1. Biological Process: a commonly recognized series of events
• Cell division
• Mitosis
• Organelle fission
2. Molecular Function: an elemental activity or task or job
• Protein kinase activity
• Insulin binding
• Insulin receptor activity
3. Cellular Component: where a gene product is located
• Mitochondrion
• Mitochondrial matrix
• Mitochondrial membrane

Gene Ontology
Annotation for human Rhodopsin:

Imported annotation
Binary interactions are taken from the database
Interactors of human p53

Evidence for annotation
Evidence qualifiers shown:
• Proven
• Potential
• By similarity

Sources references included

UniProt homepage
• Text search
• BLAST sequence search
• Sequence alignment
• Retrieve sequences
• ID mapping between databases
http://www.uniprot.org/

Patent sequences in UniProt
• Comprehensive archive consisting of specialised databases
• Release every month, daily updates
• UniProtKB/SwissProt annotation-rich, but has no patent data
• Patent sequences only found in UniParc as an archived list
• UniParc is non-redundant but contains no annotation
…therefore patent information limited

UniProt
UniProt is an excellent source of quality protein sequence and annotation data, but it is better for non-patent data

Sequence archives
Old entries accessible in both ENA and UniProt
• ENA nucleotide sequence version archive: www.ebi.ac.uk/embl/sva
  Search by accession: get all records
  Search by date: get specific record
• UniSave – UniProt sequence/annotation version archive: www.ebi.ac.uk/uniprot/unisave
www.ebi.ac.uk

Provides complete version list
Select and compare versions
View specific old entry
Tracks all changes to an entry

View old entries

Compare different versions
Changes highlighted
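
To illustrate what comparing two versions of a record involves, here is a small local sketch using Python's standard `difflib` module. It is not how the version archives implement their comparison view; the two record texts and the `changed_lines` helper are invented for the example:

```python
import difflib


def changed_lines(old_text, new_text):
    """Return only the removed ('- ') and added ('+ ') lines between two versions."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Invented two-line records standing in for two versions of one entry.
version_1 = "description: kinase fragment\nsequence: acgtacgt"
version_2 = "description: kinase, complete cds\nsequence: acgtacgt"

for line in changed_lines(version_1, version_2):
    print(line)
```

Only the description line is reported, once as removed and once as added; the unchanged sequence line is filtered out, which mirrors the idea of highlighting just the changes.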

Why do we need a new sequence database for patents?
To address the following problems:
redundancy
lack of patent-specific annotation

European nucleotide archive
UniProt
Non-redundant patent sequence databases

Distributing patent sequences
INSDC: GenBank, DDBJ, ENA

Distributing patent sequences
Redundancy: a consequence of the international cooperation
Patent offices: JPO, USPTO, KIPO, EPO, other patent offices
INSDC: GenBank, DDBJ, ENA
As other National Offices participate in data exchange, redundancy will increase

Distributing patent sequences
NR patent databases remove redundancy
Patent offices (JPO, USPTO, KIPO, EPO, other patent offices) → INSDC (GenBank, DDBJ, ENA) → NR patent sequence databases

EBI-EPO collaboration
Collaboration between EBI and EPO
EBI:
• Database development and maintenance
• Link to EBI search engines
• Link to EBI analysis tools
• Link to EBI databases
EPO:
• Acquire patent sequences
• Link to patent literature
• Extract patent annotation
• Collate patent family information

Creating a non-redundant database

Patent data correction
Correction of Publication Numbers and Kind Codes

Patent resources at EBI
http://www.ebi.ac.uk/patentdata/

More information provided
Patent proteins: EPO, USPTO, JPO, KIPO
Patent nucleotides: ENA (EPO, USPTO, JPO, KIPO)
• Complete sequences (EPO, USPTO, JPO, KIPO)
• Non-redundant sequence data
• Patent family classification
• Enriched with patent information

let's look at an example…

Searching redundant databases
Example: Search patent protein sequence
• Protein
• Patent proteins
http://www.ebi.ac.uk/Tools/sss/

Results from redundant databases
>260 identical results… too much to analyze

NR patent sequence databases
LEVEL1 NR patent sequence database removes redundancy
fewer results to analyze, less chance of missing important results
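
Level-1 redundancy removal can be thought of as collapsing 100% identical sequences into one entry that still lists every patent containing the sequence. The Python sketch below illustrates that idea only; the records, the `level1_entries` helper and the case-insensitive comparison are assumptions made for the example, not the EBI pipeline:

```python
from collections import defaultdict

# Invented records: (patent publication, publication date, sequence).
RECORDS = [
    ("PATENT-A", "2005-03-01", "MKTAYIAK"),
    ("PATENT-B", "2003-07-15", "mktayiak"),
    ("PATENT-C", "2008-11-30", "MKTAYIAR"),
]


def level1_entries(records):
    """Collapse 100% identical sequences into one entry listing every patent."""
    groups = defaultdict(list)
    for patent, date, sequence in records:
        # Assumption: identity is checked case-insensitively.
        groups[sequence.upper()].append((patent, date))
    entries = []
    for sequence, members in groups.items():
        entries.append({
            "sequence": sequence,
            "patents": sorted(patent for patent, _ in members),
            # ISO dates sort correctly as plain strings.
            "earliest_publication": min(date for _, date in members),
        })
    return entries


entries = level1_entries(RECORDS)
print(len(entries))  # 2 entries from 3 records
```

Each hit in a search against such a collection is unique, while the list of patents and the earliest publication date stay attached to the entry.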

Searching patent sequences
NR patent Level-1
Example: Search patent protein sequence
• NR patent level-1
http://www.ebi.ac.uk/Tools/sss/

Results from level-1 database
Each hit unique

Results from level-1 database
• Earliest publication date
• List of all patents containing the sequence
• Link to sequence entry
• Link to patent documentation

Patent sequence record in NRNL1
• Sequence
• Patents containing 100% identical sequence

Patent families
A Simple Patent Family is a group of patents that relate to the same invention and are based on the same originating application
They arise when an invention is patented in multiple countries
Grouping patents into families reduces multi-national results down to a representative member

Patent families
Invention A (patent family): EP, WO, US
Invention B (second patent family): US, JP
100% identical sequences: GM671154, ADA42650, CS017585, ACQ13114, DI603183, HB492658, AAR79155, DD649656
Same sequence can appear multiple times in a database due to:
• Same invention filed multiple times in different offices (same patent family)
• Different inventors use the same sequence in different contexts (different patent families)

NR patent sequence databases
LEVEL2 NR patent sequence database
• groups identical sequences by patent family
• provides earliest priority date for family
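
Level-2 grouping can be sketched as keying entries on the pair (sequence, patent family) rather than on the sequence alone. The Python example below is an illustration under that assumption; the records, family identifiers and the `level2_entries` helper are invented and do not describe the actual database build:

```python
from collections import defaultdict

# Invented records: (patent publication, family id, priority date, sequence).
RECORDS = [
    ("EP-1", "FAMILY-A", "2001-02-10", "ACGTACGT"),
    ("WO-1", "FAMILY-A", "2001-02-10", "ACGTACGT"),
    ("US-1", "FAMILY-A", "2001-02-10", "ACGTACGT"),
    ("US-2", "FAMILY-B", "2004-09-05", "ACGTACGT"),
    ("JP-2", "FAMILY-B", "2004-09-05", "ACGTACGT"),
]


def level2_entries(records):
    """Group identical sequences by patent family: one entry per (sequence, family)."""
    groups = defaultdict(list)
    for patent, family, priority_date, sequence in records:
        groups[(sequence, family)].append((patent, priority_date))
    return [
        {
            "sequence": sequence,
            "family": family,
            "patents": sorted(patent for patent, _ in members),
            # ISO dates sort correctly as plain strings.
            "earliest_priority": min(date for _, date in members),
        }
        for (sequence, family), members in groups.items()
    ]


entries = level2_entries(RECORDS)
print(len(entries))  # 2: one entry per patent family for the same sequence
```

In this sketch a single sequence yields one entry per family, so each hit corresponds to one invention instead of one filing.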

Searching patent sequences
NR patent Level-2
Example: Search patent protein sequence
• NR patent level-2
http://www.ebi.ac.uk/Tools/sss/

Results from level-2 database
Each hit = one family

Results from level-2 database
• Earliest publication (priority) date in family
• Patents in same family
• Link to sequence entry
• Link to patent documentation

Patent sequence record in NRNL2
• Priority number and date
• Patent equivalents
• Patent literature
• Sequence record in ENA
• Translation
• Sequence

Non-redundant patent databases
Patent Patent ____Clicknucleotides __ to ____ edit Master_____proteins ____text styles______Second______level Groups together Level-1 NRNL1 NRPL1 ____Third _____ level 100% identical (Non-redundant (Non-redundant patent sequences nucleotide level- 1)Fourth_____ protein _____ levellevel-1) ____Fifth _____level
Level-2 NRNL2 NRPL2 Groups together (Non-redundant (Non-redundant identical sequences nucleotide level-2) protein level-2) by patent family
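
The two grouping rules in the table above can be sketched in a few lines of Python. This is only an illustration of the concept, not EBI's pipeline: the accessions, family labels and sequences are hypothetical, and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical records: (accession, patent_family, sequence). Illustrative values only.
records = [
    ("ACC1", "family_A", "ATGGCC"),
    ("ACC2", "family_A", "ATGGCC"),
    ("ACC3", "family_B", "ATGGCC"),
    ("ACC4", "family_B", "TTGACA"),
]


def group_level1(records):
    """Level-1 style grouping: one entry per distinct sequence."""
    groups = defaultdict(list)
    for accession, _family, sequence in records:
        groups[sequence].append(accession)
    return dict(groups)


def group_level2(records):
    """Level-2 style grouping: one entry per (sequence, patent family) pair."""
    groups = defaultdict(list)
    for accession, family, sequence in records:
        groups[(sequence, family)].append(accession)
    return dict(groups)


level1 = group_level1(records)
level2 = group_level2(records)
print(len(level1), "level-1 entries")
print(len(level2), "level-2 entries")
```

In this toy data the same sequence is used by two families, so it is one level-1 entry but two level-2 entries. Under this reading, a level-2 database can hold more entries than the corresponding level-1 database.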

Non-redundant patent databases
http://www.ebi.ac.uk/patentdata/
1. ENA (redundant)
2. Remove sequence redundancy: Level-1 NR
3. Add annotation, including priority dates for patent families, and group by patent families: Level-2 NR

Patent sequence records at EBI
Nucleotide
  ENA               ~23.1 M PAT sequences
  NRNL1             ~11.9 M sequences
  NRNL2             ~15.0 M sequences
Protein
  Patent Proteins   ~6.3 M PRT sequences
  NRPL1             ~2.5 M sequences
  NRPL2             ~3.8 M sequences

NR Patent Sequence Databases
• Sequence searches against a non-redundant database are faster and avoid overlooking data
• These databases are the first non-redundant collection to take account of both sequence and family concepts
• Publication corrections significantly increase data quality
• Collation of biological features in a single record enables understanding of the biological context in which the sequence is being used

2) The Toolbox
• EBI search
• SRS advanced search
• Sequence search

EBI search
• EBI Search is accessible from any EBI page (http://www.ebi.ac.uk/)
• Search all databases and literature in one go

EBI search by gene name
Example: search for the gene src
• Lists species with information for the query
• Lists relevant entries in all EBI resources
• Easy to change between species
• Tabs organise data by: gene, expression, protein, structure, literature

Information from Ensembl:
• Gene sequence
• Location
• Sequence variations
• Orthologues...
• Gene structure (forward and reverse strand)

Expression studies from Gene Expression Atlas, view by:
• Disease state
• Cell type
• Compound treatment...
Expression studies are shown by organism part

Information from UniProt:
• Function
• Gene Ontology
• Isoforms
• Sequence...
• InterPro domain architecture
• IntAct protein interaction data

Information from PDBe:
• Chain information
• Structural domains
• Citations...
• View structure, or view additional structures

Literature:
• Reviews
• Keyword in title
• Free full text
• Curator-selected
• Patent
A full summary of any page can be printed

EBI search for patent information
Example: search for patent WO0146262 (http://www.ebi.ac.uk/)
Results for the query WO0146262:
• Literature for WO0146262 (includes link to full paper)
• Sequence data for WO0146262 (includes list of additional annotation)

EBI search is a quick way to find literature and sequences (in ENA and UniProt) associated with a patent

SRS: advanced text search
http://www.ebi.ac.uk/srs/
1st: Select resources to search
2nd: Create query

Steps:
• Select library tab (search >100 databases)
• Select resources to search:
  NR patent DNA (NRNL1 & NRNL2)
  NR patent proteins (NRPL1 & NRPL2)
  Example: selected to search the NR level-1 patent DNA database
• Build the query: 1) select field (here, patent number), 2) type in text
• Create query

Results
• Lists non-redundant nucleotide sequences from WO0146262
• WO0146262 nucleotide sequence record in NRNL1: details which other patents also claim this sequence (with NRNL2, you would see the family grouping)
• The NRNL1 sequence record links to WO0146262 literature and WO0146262 sequences

SRS: advanced text search
http://www.ebi.ac.uk/srs/
ENA    Find all sequences associated with a patent
NRNL1  Find all sequences associated with a patent + identify all patents associated with each sequence
NRNL2  Find all sequences associated with a patent + identify all patents in the same family associated with each sequence

Sequence search

What’s available at EBI
• Tools are accessible from the EBI homepage (http://www.ebi.ac.uk/): under Tools, select Tools Index
• The most popular tools are listed, with a link to the list of all tools (http://www.ebi.ac.uk/Tools)
• Full list of sequence search tools: http://www.ebi.ac.uk/Tools/sss/

STEP 1: Choose a search algorithm

Choosing the right search engine
BLAST        Fast search (better for proteins than nucleotides)
FASTA        Fast search
PSI-SEARCH   Finding remote homologues
SSEARCH      Sensitive but slow; good for short sequences
GGSEARCH     Force full-length matches
GLSEARCH     Match domains/patterns to protein; oligo-to-gene
FASTM        Multi-peptide search

Example: Comparing search engines using a short peptide query sequence

Comparing search engines
query: RPPSWIPK
Look at the difference in e-values

NCBI-BLAST (UniProtKB/SwissProt)
  No hits found

WU-BLAST (UniProtKB/SwissProt)
  hit             length  e()
  1: TY01_PHYAZ   61      5.4

SSEARCH (UniProtKB/SwissProt)
  hit             length  e()
  1: TY01_PHYAZ   61      0.42
  2: BRK5_PHYNO   8       4.8
  3: BRK_AMICA    9       9.2
  5: BRK_LEPOS    9       9.2

SSEARCH is a sensitive search engine suitable for short sequences (may be too slow for longer sequences)

Comparing search engines - specialised
query: RPPSWIPK

SSEARCH (UniProtKB/SwissProt)
  hit             length  e()
  1: TY01_PHYAZ   61      0.42
  2: BRK5_PHYNO   8       4.8
  3: BRK_AMICA    9       9.2
  5: BRK_LEPOS    9       9.2

GLSEARCH (UniProtKB/SwissProt): has a preference for long hits
  hit             length  e()
  1: TY01_PHYAZ   61      2.5e-16
  2: BRK5_PHYNO   8       1.9e-11
  3: TY01_PHYBU   8       5.2e-8
  4: BRK_ONCMY    10      5.6e-8
  5: BRK_LEPOS    9       1e-7
  8: B4GT2_HUMAN  372     5.8e-5
  ......
  39: DNAA_PROM   465     0.024
  40: DNAA_PROM   199     0.036

GGSEARCH (UniProtKB/SwissProt): limited to similar sized hits
  hit             length  e()
  1: BRK5_PHYNO   8       8e-7
  2: TY01_PHYBU   8       2e-5
  3: BRK_LEPOS    9       5.4e-4
  4: BRK_AMICA    9       5.4e-4
  5: BRK4_PHAJA   8       0.0087
  6: BRK_PHYHY    8       0.0087
  ......
  34: TY51_LITRU  7       8.3

GGSEARCH finds sequences of similar length; GLSEARCH matches the entire query sequence to sequences of any length

Restricting length of matches
query: RPPSWIPK
Limiting the database range limits the size of hits, but is stricter than GGSEARCH

SSEARCH, database range 6-10 (UniProtKB/SwissProt)
  hit             length  e()
  1: BRK5_PHYNO   8       0.34
  2: BRK_LEPOS    9       0.57
  3: BRK_AMICA    9       0.57
  4: TY01_PHYBU   8       0.61
  5: BRK_ONCMY    10      0.63
  6: BRK4_PHAJA   8       5.0
  ......
  18: BRK3_PELRI  9       9.8

GGSEARCH (UniProtKB/SwissProt)
  hit             length  e()
  1: BRK5_PHYNO   8       8e-7
  2: TY01_PHYBU   8       2e-5
  3: BRK_LEPOS    9       5.4e-4
  4: BRK_AMICA    9       5.4e-4
  5: BRK4_PHAJA   8       0.0087
  6: BRK_PHYHY    8       0.0087
  ......
  34: TY01_LITRU  7       8.3

SSEARCH (UniProtKB/SwissProt)
  hit             length  e()
  1: TY01_PHYAZ   61      0.42
  2: BRK5_PHYNO   8       4.8
  3: BRK_AMICA    9       9.2
  4: BRK_LEPOS    9       9.2

STEP 2: Choose a database to search

Several databases available
Protein patent databases (note: these databases cover the same sequences):
• NR level-1
• NR level-2
• EPO + JPO + KIPO + USPTO

Nucleotide patent data (note: these databases cover the same sequences):
• NR level-1
• NR level-2
• EMBL Patents

Database size is important
The larger the database searched, the higher (less significant) the resulting e-values
Most sequence databases are large... and growing every day:
• ENA-Annotation >160 million entries
• UniParc (non-redundant) >30 million entries
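
Why a larger database gives higher e-values can be shown with the Karlin-Altschul expectation used by BLAST-style statistics, E = K * m * n * exp(-lambda * S). The sketch below is illustrative only: the K and lambda values, the score and the database sizes are assumptions, and FASTA/SSEARCH estimate their statistics differently.

```python
import math


def expected_hits(score, query_len, db_residues, k=0.041, lam=0.267):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S).

    k and lam are assumed, illustrative parameters, not values taken
    from any particular EBI search.
    """
    return k * query_len * db_residues * math.exp(-lam * score)


# Same query and same alignment score, two hypothetical database sizes (in residues).
small = expected_hits(score=40, query_len=8, db_residues=2e8)
large = expected_hits(score=40, query_len=8, db_residues=1e10)

print(f"small database: E = {small:.3g}")
print(f"large database: E = {large:.3g}")
print(f"ratio: {large / small:.0f}x")
```

Under this model the e-value grows in direct proportion to the database size, so the same alignment is 50 times less significant in a database 50 times larger.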

Example: Comparing database size when searching with a short peptide query sequence

Comparing database size
query: RPPSWIPK
Database size decreasing from UniParc to UniProtKB to UniProtKB/SwissProt; look at the difference in e-values

SSEARCH (UniParc)
  No hits found

SSEARCH (UniProtKB)
  hit             e()
  1: TY01_PHYAZ   3.6

SSEARCH (UniProtKB/SwissProt)
  hit             e()
  1: TY01_PHYAZ   0.068
  2: BRK5_PHYNO   1.9
  3: TY01_PHYBU   7.7
  4: BRK_AMICA    8.7
  5: BRK_LEPOS    8.7
  6: BRK_ONCMY    9.7

• The larger the database searched, the higher (less significant) the resulting e-values
• Search the smallest database likely to contain your sequence
• You can also run a second search of the entire database, or run multiple small searches

Is it best to search a protein or a nucleotide database?
2 issues are worth considering…

1) Codon degeneracy
Because amino acids are encoded by different codons, there can be more variability between CDSs than between proteins
  Ser / Ser   Amino acids match
  UCU / AGC   Nucleotides mismatch

Human CKS1B kinase v Zebra finch CDC28 kinase 1B
[Figure: alignment of the two proteins and alignment of the two nucleotide sequences]

Sequence conservation is more stringent at the protein level than at the level of the nucleotide coding sequence
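
A small Python sketch of codon degeneracy, using a partial copy of the standard genetic code. The two coding sequences are made up for illustration: they encode the same three serines but share few nucleotides.

```python
# Partial standard genetic code (RNA codons); only the codons needed here.
CODON_TABLE = {
    "UCU": "Ser", "UCC": "Ser", "UCA": "Ser", "UCG": "Ser",
    "AGU": "Ser", "AGC": "Ser",
}


def translate(codons):
    """Translate a list of codons into a list of amino acids."""
    return [CODON_TABLE[codon] for codon in codons]


def identity(a, b):
    """Fraction of aligned positions that are identical (ungapped, equal length)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Hypothetical coding sequences, both encoding Ser-Ser-Ser.
cds_1 = ["UCU", "UCC", "AGU"]
cds_2 = ["AGC", "UCG", "UCA"]

protein_identity = identity(translate(cds_1), translate(cds_2))
nucleotide_identity = identity("".join(cds_1), "".join(cds_2))

print(f"protein identity:    {protein_identity:.0%}")
print(f"nucleotide identity: {nucleotide_identity:.0%}")
```

The proteins are identical, yet only 2 of the 9 nucleotide positions match, which is why a nucleotide search can miss a relationship that a protein search finds.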

2) Amino acid similarity
• Protein sequence searches can distinguish between exact, similar and dissimilar matches
• Amino acids are grouped by physical & chemical properties
• Protein searches take account of amino acid similarities

               highly conserved   weakly conserved   not conserved
Amino acids    Ser / Ser          Ser / Asn          Ser / Leu
               identical          similar            mismatch
Nucleotides    UCU / AGC          UCU / AAC          UCU / CUC
               mismatch           mismatch           mismatch (no distinction)

Protein alignments can score a conservative amino acid substitution differently from a non-conservative one through the use of scoring matrices
By contrast, nucleotide alignments use over-simple (less sensitive) match/mismatch scoring
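
The contrast can be made concrete with the three Ser examples. The sketch below uses three entries from the standard BLOSUM62 matrix for the protein side, and for the nucleotide side a +5/-4 match/mismatch score (the FASTA values quoted elsewhere in this deck, used here only as an assumption).

```python
# Three entries from the standard BLOSUM62 substitution matrix.
BLOSUM62_SUBSET = {("S", "S"): 4, ("S", "N"): 1, ("S", "L"): -2}

# Nucleotide reward/penalty; assumed for illustration.
MATCH, MISMATCH = 5, -4


def protein_score(a, b):
    """Score one amino acid pair with the substitution matrix."""
    return BLOSUM62_SUBSET[(a, b)]


def nucleotide_score(codon_a, codon_b):
    """Score two codons position by position with match/mismatch only."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(codon_a, codon_b))


# (amino acid 1, amino acid 2, codon 1, codon 2)
pairs = [
    ("S", "S", "UCU", "AGC"),  # Ser / Ser: identical
    ("S", "N", "UCU", "AAC"),  # Ser / Asn: similar
    ("S", "L", "UCU", "CUC"),  # Ser / Leu: mismatch
]

results = [
    (protein_score(aa1, aa2), nucleotide_score(c1, c2))
    for aa1, aa2, c1, c2 in pairs
]
for (aa1, aa2, _c1, _c2), (p, n) in zip(pairs, results):
    print(f"{aa1}/{aa2}: protein score {p:+d}, nucleotide score {n:+d}")
```

The protein scores separate the three cases, while the nucleotide score is the same for all three codon pairs.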

Protein v nucleotide search
Protein comparisons identify homologues 5-10x further back in evolution
[Figure: timeline in billions of years ago, from formation of Earth through chemical evolution, self-replicating cells, photosynthesis, complex cells, multicellular life, the Cambrian explosion and the extinction of dinosaurs to today, showing the appearance of major groups (prokaryotes, archaea, cyanobacteria, eukaryotes, plants, fish, insects, land plants, amphibians, reptiles, mammals, birds, flowers, genus Homo) and how far back homologue searching can identify homologues]

Protein v nucleotide: example
Species      Gene    Protein                          DNA e-value   Protein e-value
Human        GSTP1   Glutathione S-transferase P      4.0e-199      1.8e-92
Bovine       GSTP1   Glutathione S-transferase P      5.9e-154      8.0e-82
Mouse        GSTP2   Glutathione S-transferase P2     4.4e-133      4.7e-79
Toad         GSTP1   Glutathione S-transferase P      1.5e-55       9.8e-59
Frog         GSTP1   Glutathione S-transferase P      8.9e-33       2.6e-45
Nematode     GSTP1   Glutathione S-transferase P      8.5e-10       4.2e-32
Rabbit       GSTMU   Glutathione S-transferase Mu     1.3e-07       1.3e-20
Bovine       GSTM2   Glutathione S-transferase M2     4.3           4.4e-17
Liver fluke  GSTMU   Glutathione S-transferase MU     -             3.2e-15
Hornworm     GST2    Glutathione S-transferase 2      -             3.2e-12
Fruit fly    GSTS1   Glutathione S-transferase S1     -             2.3e-08
Slime mold   GSTA2   Glutathione S-transferase a2     -             2.6e-05
Maize        GST1    Glutathione S-transferase 1      -             3.0e-01
Wheat        GSTA1   Glutathione S-transferase 1      -             1.9

• 100% identity is more significant for the DNA match because it is a longer sequence
• Codon degeneracy and simple scoring give rise to less significant e-values for DNA matches

Protein v nucleotide search
…therefore, if a patent claims both a nucleotide CDS and a protein sequence, the protein sequence could pull out many more homologues than the nucleotide CDS

STEP 3: Choosing search parameters to fit the task

Choosing parameters
• Parameters are set for searching a full-length protein or gene
• Changing parameters can improve search results for short sequences

How to optimise parameters?
• User manual provides help

Which parameters to choose?
Matrix
• Nucleotide search is 'simpler': only match/mismatch
• Protein search uses substitution matrix tables (based on amino acid similarities and rate of change)

Protein matrices
Choice of matrix depends on:
1. strictness of search
2. length of query sequence

QUERY LENGTH   MATRIX     open   ext
>300           BLOSUM50   -10    -2
85-300         BLOSUM62   -7     -1
50-85          BLOSUM80   -16    -4
>300           PAM250     -10    -2
85-300         PAM120     -16    -4
35-85          MDM40      -12    -2
<=35           MDM20      -22    -4
<=10           MDM10      -23    -4

Example: Comparing matrices when searching with a short peptide query sequence

Comparing matrices
query sequence: MDM2_HUMAN
SSEARCH against UniProtKB/SwissProt; Blosum80 is stricter than Blosum50

hit            e() Blosum50 (default)   e() Blosum80
MDM2_XENLA     4e-70                    6e-109
MDM4_BOVIN     9e-48                    9e-51
MDM4_DANRE     6e-17                    5e-24
XB34_ORYSJ     0.01                     0.05
RN157_MOUSE    0.19                     4.1
MGRN1_MOUSE    1.2                      6.0
MGRN1_HUMAN    2.4                      6.4

• Blosum80 gives more significant close matches
• Blosum50 gives more significant distant matches

Matrices:
• Use a high Blosum to find close matches, a low Blosum to find distant matches
• Use MDM to find longer matches
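
The query-length rows of the protein matrices table (PAM/MDM series) can be turned into a simple lookup. This is a sketch, not a tool default: the function name is hypothetical, and where the table's ranges meet (85 and 300) the shorter-range matrix is chosen, which is an assumption.

```python
# (upper bound on query length, matrix, gap open, gap extension), shortest first.
# Values copied from the PAM/MDM rows of the protein matrices table.
PAM_MDM_SERIES = [
    (10, "MDM10", -23, -4),
    (35, "MDM20", -22, -4),
    (85, "MDM40", -12, -2),
    (300, "PAM120", -16, -4),
    (float("inf"), "PAM250", -10, -2),
]


def suggest_matrix(query_length):
    """Return (matrix, gap_open, gap_ext) for a query of the given length."""
    if query_length < 1:
        raise ValueError("query length must be positive")
    for upper_bound, matrix, gap_open, gap_ext in PAM_MDM_SERIES:
        if query_length <= upper_bound:
            return matrix, gap_open, gap_ext


print(suggest_matrix(len("RPPSWIPK")))  # 8-residue peptide query
print(suggest_matrix(120))
print(suggest_matrix(500))
```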

Nucleotide match/mismatch
Protein searches use a matrix; nucleotide searches in FASTA and BLAST instead have match/mismatch scores
• "Reward" for match, "penalty" for mismatch
• Reward/penalty ratio: increase the ratio to find more divergent sequences:
  Ratio of 0.33 (1/-3) for 99% conserved
  Ratio of 0.5 (1/-2) for 95% conserved
  Ratio of 1 (1/-1) for 75% conserved
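
A minimal sketch of how the reward/penalty choice changes an alignment score. The three schemes are the ones listed above; the helper names, the nearest-target selection rule and the two short sequences are assumptions made for this example.

```python
# (reward, penalty, approximate % conservation targeted), as listed above.
SCHEMES = [(1, -3, 99), (1, -2, 95), (1, -1, 75)]


def reward_penalty_ratio(reward, penalty):
    """Ratio of match reward to the size of the mismatch penalty."""
    return reward / abs(penalty)


def pick_scheme(expected_conservation):
    """Return the (reward, penalty) whose target is closest to the expected % conservation."""
    reward, penalty, _target = min(
        SCHEMES, key=lambda scheme: abs(scheme[2] - expected_conservation)
    )
    return reward, penalty


def ungapped_score(a, b, reward, penalty):
    """Score two equal-length sequences position by position."""
    return sum(reward if x == y else penalty for x, y in zip(a, b))


query = "ACGTACGTAC"
subject = "ACGTTCGTAC"  # one mismatch in ten positions

strict = ungapped_score(query, subject, 1, -3)
relaxed = ungapped_score(query, subject, 1, -1)
print("1/-3 score:", strict)
print("1/-1 score:", relaxed)
print("scheme for ~80% conserved targets:", pick_scheme(80))
```

With a ratio of 1 the single mismatch costs less, so divergent matches keep a higher score.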

Gap penalties
Protein search
• Gap open = 0 to -23
• Gap extension = 0 to -8
Nucleotide search
• Gap open = -2 to -16
• Gap extension = 0 to -4

Choice of gap penalties depends on:
1. strictness of search (larger penalty, fewer gaps)
2. the need to match the scoring matrix

QUERY LENGTH   MATRIX     open   ext
>300           BLOSUM50   -10    -2
85-300         BLOSUM62   -7     -1
50-85          BLOSUM80   -16    -4
>300           PAM250     -10    -2
85-300         PAM120     -16    -4
35-85          MDM40      -12    -2
<=35           MDM20      -22    -4
<=10           MDM10      -23    -4

Word length - ktup
KTUP (word length)
• KTUP = 'word-length' of search
• Large word-length: less sensitive, faster
• Nucleotide search: fewer bases than amino acids, so higher KTUP

Example: Comparing ktup when searching with a short RNA query sequence

Comparing ktup
query: 23bp RNA
Lower ktup is more sensitive; E=50 extends the e-value cut-off

FASTA, ktup6 (default), EMBL release
  No hits found

FASTA, ktup2, EMBL release
  hit            e()
  1: AB334817    0.12
  2: AY238603    0.14
  3: AC101743    0.16
  4: AC115920    0.18

FASTA, ktup3, E=50, EMBL release
  hit            e()
  1: AB334817    0.074
  2: AY238603    0.092
  3: AC101743    0.11
  4: AC115920    0.12
  5: BC098485    22
  ......
  11: AL591512   32

• Lowering ktup makes the search more sensitive
• Increase e-value cut-off for short sequences
• Increase match/mismatch score (+5/-4 for FASTA)
• Increase gap penalties
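
Why a lower ktup is more sensitive can be seen from the word-matching step alone. The sketch below only counts shared words between two made-up RNA sequences; it is not the FASTA algorithm, which goes on to score diagonals and build alignments.

```python
def words(sequence, ktup):
    """All distinct overlapping words of length ktup in the sequence."""
    return {sequence[i:i + ktup] for i in range(len(sequence) - ktup + 1)}


def shared_words(query, subject, ktup):
    """Words of length ktup found in both sequences."""
    return words(query, ktup) & words(subject, ktup)


# Hypothetical short sequences that differ at two positions.
query = "ACGUACGGAUCC"
subject = "ACGAACGGUUCC"

counts = {ktup: len(shared_words(query, subject, ktup)) for ktup in (6, 3, 2)}
for ktup, count in counts.items():
    print(f"ktup={ktup}: {count} shared words")
```

With ktup=6 the two mismatches leave no shared word, so a word-based search would have nothing to seed from; with ktup=3 or ktup=2 the related sequence is found, at the cost of more words to examine.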

Masking
Do I mask my sequence?
Low complexity regions should be masked to avoid spurious results:
• CA repeats
• poly-A tails
• proline-rich regions
**Be careful you don’t mask what you are looking for
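
A toy illustration of masking with regular expressions. The patterns and thresholds are arbitrary assumptions for this sketch; real searches rely on dedicated low-complexity filters such as SEG and DUST rather than anything this simple.

```python
import re

# Assumed thresholds: 8 or more identical bases, or 4 or more consecutive CA units.
HOMOPOLYMER = re.compile(r"([ACGT])\1{7,}")
CA_REPEAT = re.compile(r"(?:CA){4,}")


def mask(sequence):
    """Replace homopolymer runs and CA repeats with N, keeping the length unchanged."""
    for pattern in (HOMOPOLYMER, CA_REPEAT):
        sequence = pattern.sub(lambda m: "N" * len(m.group(0)), sequence)
    return sequence


# Hypothetical sequence: a CA repeat in the middle and a poly-A tail at the end.
masked = mask("GGCT" + "CA" * 5 + "GGTC" + "A" * 10)
print(masked)

# A query that is itself a CA repeat is masked completely.
print(mask("CA" * 6))
```

The second call shows the risk noted above: if the repeat is what you are searching for, masking leaves nothing to search with.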

Parameters for short sequences
What do I use for short sequences?
• use strict matrices
• use high gap penalties
• avoid masking
• allow high e-values

Adding value to your search results

Sequence search results page
• Actions on all results
• Actions on selected results

Visual output
• Graphical display of results
• Select sequence, then view alignment
• Visual output provides an at-a-glance view of the length and position of all matches

e-values or % identity?
Do you use e-values or % identity?
e-value is a better estimate of similarity than % identity, but patents use % identity

e-value
• Estimates statistical significance of matches
• Default = 10: expect 10 matches found by chance
• E() = 1-10: frequently related
• E() < 0.01: usually homologous

% identity
• % of positions identical between query and match sequence

Example: similar % identity scores, different e-values
• In one alignment the pattern of conservation indicates homology
• In the other there is no evidence of homology

Check alignments
• Check alignments, especially if using a local-local algorithm
• Example: 100% identity, but only over 124 / 662 (about 19%) of the sequence (124 aa overlap)
• Always check alignments to see where and to what extent the query & target sequences match
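
The 124-residue overlap on a 662-residue sequence can be checked with two small helpers that separate % identity from coverage. The function names are made up for this sketch.

```python
def percent_identity(identical_positions, overlap_length):
    """% of aligned positions that are identical."""
    return 100.0 * identical_positions / overlap_length


def percent_coverage(overlap_length, sequence_length):
    """% of the full sequence that the alignment covers."""
    return 100.0 * overlap_length / sequence_length


# Figures from the example: every aligned position matches,
# but the alignment spans only 124 of 662 residues.
identity = percent_identity(124, 124)
coverage = percent_coverage(124, 662)

print(f"identity: {identity:.0f}%")
print(f"coverage: {coverage:.1f}%")
```

Reporting both numbers side by side makes it obvious when a perfect identity score applies to only a small part of the sequence.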

Additional annotation
Protein and nucleotide search results have additional annotation:
• Nucleotide sequence
• Protein sequence
• Genomic information
• Gene ontology mapping
• InterPro protein classification
• Literature
e.g. related ENA nucleotide entries

Functional predictions
Protein search results have 'Function Prediction'

Public site: function prediction
• Functional predictions: InterPro family/domain classifications
• Visual comparison: find mis/partial matches

InterPro annotation
• Domain annotation: SWIB/MDM2 domain, RanBP2-type zinc finger, RING-type zinc finger
• Family classification: Mdm2/Mdm4 family

Public site: function prediction
% identity   Signatures
100% ID      family signature, 4 domain signatures
34% ID       family signature, 3 domain signatures
28% ID       1 domain signature
24% ID       No signatures
Sequence Searching Tools 200 InterPro: access directly
InterPro
InterPro homepage
http://www.ebi.ac.uk/
Text or sequence search
>60,000 protein signatures from 11 member databases
InterPro homepage
http://www.ebi.ac.uk/interpro/
New interface
Text or sequence search
InterProScan sequence search
http://www.ebi.ac.uk/interpro/
The download version takes both protein and nucleotide sequences
Uses all search engines of the member databases
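Protein signatures can only be matched against protein sequence, so nucleotide input has to be translated first. One common way to do this is six-frame translation, sketched below with the standard genetic code. This is an illustration only: how the download version of InterProScan handles nucleotide input is not described here, and the function names are made up for the example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


for name, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(name, protein)
```

Each of the six translations could then be submitted as a protein query; a real pipeline would normally split them at stop codons and discard very short fragments.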
InterProScan results
Domains
Family
Recap
DDBJ, GenBank, JPO, USPTO, KIPO, EPO
ENA, UniProt, NR patent databases
SRS text search, Sequence search
Select resources, Select search tool, Select library tab, Create query, Navigate to sss tools, Create search
Patent literature, Sequence list, Functional predictions, Sequence record, Sequence matches
Summary
EBI provides free access to >100 databases
EBI provides specialised patent sequence databases
EBI provides multiple sequence search options
EBI provides advanced text search options
Help
http://www.ebi.ac.uk/
Contacts: http://www.ebi.ac.uk/support/