EBI Patent Sequence Services

PIUG 2012 Biotechnology Workshop • February 6th, 2012 • Boston

Jennifer McDowall

EBI is an Outstation of the European Molecular Biology Laboratory.

Overview

1) Know the data
• European Nucleotide Archive
• UniProt
• Non-redundant patent sequence DB

2) The Toolbox
• EBI search
• SRS advanced text search
• Sequence searching

Sequence Searching Tools

1) Know the Data

Know the data

• Many databases, each getting bigger
• Efficient searching requires knowledge of what data is stored in a database
  - Don't assume annotation can be transferred because of a good match
• Databases can contain errors
• Data can change
  - Deletions, sequence modifications
  - Daily updates, identifier changes…

Major sequence databases

European Nucleotide Archive
• >170 million sequences (~42 million non-redundant)
• release every 3 months, daily updates

UniProt
• >30.1 million non-redundant sequences
• monthly release, daily updates

Additional sequence data

Specialized databases
• Immunoglobulins: IMGT/HLA, IMGT/LIGM
• Immunopolymorphisms: IPD-KIR, IPD-MHC
• Variation: HGVBase, dbSNP
• Alternative splicing: ASTD
• Completed genomes: Ensembl, Integr8
• Structure: PDB, targets
• Interaction: IntAct

Patent Sequences

Patent sequences can be found in the following databases:

ENA
• Patent nucleotides

UniProt
• Patent Archive

NR patent sequences
• Patent nucleotides and proteins

Which database do you use?

Let's take a look…

European nucleotide archive
UniProt
Non-redundant patent sequence databases

Primary sequence databases

Primary data submitted to databases

INSDC + SRA
• GenBank (U.S.A.)
• DDBJ (Japan)
• ENA (Europe)

INSDC agreement:
• Free unrestricted access
• All data exchanged daily

How do they differ?
• organization of data
• tools and database links

ENA has a 3-tiered structure

Feature annotation, assembly information:
1) EMBL-Bank

Sequencing & sampling information:
2) Sequence Read Archive (Next Gen)
3) Trace Archive (Capillary sequencing)

http://www.ebi.ac.uk/ena/

How is the data organised?

Data in EMBL-Bank is divided in 2 ways:

1) Data classes
• Type of data or methodology used to obtain data
• Each entry belongs to one data class

2) Taxonomic Divisions
• Each entry belongs to one taxonomic division

EMBL-Bank data classes

CON  Constructed from sequence assemblies
EST  Expressed Sequence Tag (cDNA)
GSS  Genome Survey Sequence (high-throughput short sequence)
HTC  High-Throughput cDNA (unfinished)
HTG  High-Throughput Genome sequencing (unfinished)
MGA  Mass Genome Annotation
PAT  Patent sequences
SRA  Sequence Read Archive (both databank and data class)
STS  Sequence Tagged Site (short unique genomic sequences)
STD  Standard (high quality annotated sequence)
TSA  Transcriptome Shotgun Assembly (computational assembly)
WGS  Whole Genome Shotgun

EMBL-Bank data classes

Data is always changing
• Assembly of sequences into larger fragments
• Suppression of obsolete entries (i.e. once assembled)
• Sequence modifications
• Daily updates
• Identifier changes
• Corrections (databases can contain errors)
• etc…

EMBL-Bank data classes

Data assembly can affect entries

Example:

WGS Shotgun
• Fragments in separate entries
• Join to make new CON entries

CON Constructed
• Old WGS entries archived
• Join into STD entry (e.g. completed genome)
• Add annotation

STD Standard
• Old CON entries archived

ENA taxonomy

All INSDC databases use NCBI taxonomy

Divisions (only sequenced organisms are represented)

HUM  Human
MUS  Mouse        INV  Invertebrate    Other:
ROD  Rodent       PLN  Plant                Environmental
MAM  Mammal       PRO  Prokaryote      SYN  Synthetic
VRT  Vertebrate   PHG  Phage           TGN  Transgenic
FUN  Fungi        VIR  Viral           UNC  Unclassified

ENA taxonomy

Some taxa are EXCLUDED from certain taxonomic ranges

ROD  Rodent      → excludes mouse
MAM  Mammal      → excludes human, mouse, rodent
VRT  Vertebrate  → excludes human, mouse, rodent, mammal

Applies to ftp files and sequence search tools, but not to ENA browser

ENA taxonomy

Sometimes there is no taxonomic data

Environmental
• Genus species = 'uncultivated bacterium' or 'unspecified'

Synthetic
• Genus species = 'synthetic construct'

Transgenic
• Taxonomy for recipient and donor organisms

Patent
• Exempt from requiring Genus species
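The division exclusions listed earlier (ROD excludes mouse; MAM excludes human, mouse and rodent; VRT excludes all of those) can be read as a "most specific division wins" rule. The following is a minimal Python sketch of that reading, not ENA's actual implementation. The lineage tags and the `assign_division` helper are hypothetical stand-ins for a real NCBI taxonomy lookup, and only the vertebrate divisions are covered.

```python
# Divisions ordered from most specific to most general. The first match wins,
# so a broader division never receives organisms claimed by a narrower one.
# The tag names are hypothetical stand-ins for real taxonomy lookups.
DIVISION_ORDER = [
    ("HUM", "human"),
    ("MUS", "mouse"),
    ("ROD", "rodent"),
    ("MAM", "mammal"),
    ("VRT", "vertebrate"),
]


def assign_division(lineage_tags):
    """Return the division code for a set of lineage tags, or None if the
    organism falls outside the vertebrate divisions covered by this sketch."""
    for code, tag in DIVISION_ORDER:
        if tag in lineage_tags:
            return code
    return None


# Hypothetical lineages, from most to least specific organism.
mouse = {"vertebrate", "mammal", "rodent", "mouse"}
rat = {"vertebrate", "mammal", "rodent"}
dog = {"vertebrate", "mammal"}
chicken = {"vertebrate"}

for name, tags in [("mouse", mouse), ("rat", rat), ("dog", dog), ("chicken", chicken)]:
    print(name, assign_division(tags))
```

Under this reading, searching the MAM division alone would miss human, mouse and other rodent sequences, which is why the exclusions matter when choosing a search set.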

Database structure

EMBL-Bank (ENA Database)

1st: Data split into classes (data classes)
2nd: Data split into intersecting slices by taxonomy (taxonomic divisions: HUM, MUS, ROD, MAM, VRT, FUN, INV, ...)

• Reduces search set
• Example: 'Mouse' + 'EST' intersection
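The two-way split (data class first, taxonomic division second) can be pictured as an index keyed on both values, so that a search only touches one intersection such as 'Mouse' + 'EST'. The sketch below is illustrative only: the accessions are invented and `build_slices` / `search_set` are hypothetical helpers, not part of any EBI tool.

```python
from collections import defaultdict

# Hypothetical entries: (accession, data class, taxonomic division).
# The accessions are made up for illustration.
ENTRIES = [
    ("X0001", "EST", "MUS"),
    ("X0002", "EST", "HUM"),
    ("X0003", "STD", "MUS"),
    ("X0004", "STD", "HUM"),
    ("X0005", "EST", "MUS"),
]


def build_slices(entries):
    """Index accessions by (data class, taxonomic division)."""
    slices = defaultdict(list)
    for accession, data_class, division in entries:
        slices[(data_class, division)].append(accession)
    return slices


def search_set(slices, data_class, division):
    """Return only the accessions in one class/division intersection."""
    return slices.get((data_class, division), [])


slices = build_slices(ENTRIES)
mouse_est = search_set(slices, "EST", "MUS")
print(mouse_est)
```

Restricting a search to one slice is what "reduces search set" refers to: here two of the five entries are examined instead of all of them.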

European Nucleotide Archive

ENA is accessible from the EBI homepage

http://www.ebi.ac.uk/

ENA homepage

• Text search
• Sequence search
• Programmatic access

http://www.ebi.ac.uk/ena

Patent sequence record in EMBL-Bank

• Sequence version
• Download data
• Dates (first public and last updated)
• Navigate to related data, e.g. Version archive
• Graphical viewer
• DNA source
• Navigate to external data sources, e.g. UniProt
• Patent reference
• Sequence

Non-patent entry in EMBL-Bank

• General information
• More detailed graphical view
• Additional information
• Genome annotation
• Assembly information

ENA graphical viewer

Patent sequences in ENA

• Comprehensive archive with >160 million entries
• Release every 3 months, daily updates
• EMBL-Bank contains >22 million patent entries
• Patent sequences from EPO, USPTO, JPO and KIPO
• Sequence redundancy arises from different patents claiming same sequence
• Redundancy may be useful for studies in variation

…but a problem for patents

ENA

ENA is a comprehensive resource for nucleotide sequence data, but it is better for non-patent data

European nucleotide archive
UniProt
Non-redundant patent sequence databases

Where does the data come from?

Sequence sources:
• ENA
• PDB
• RefSeq
• Ensembl
• VEGA
• Patents
• Model organisms
• more…

Sequence sources exchange data daily with UniParc

UniProt has a 3-tiered structure

Sequence sources (ENA, PDB, RefSeq, Ensembl, VEGA, Patents, Model organisms, more…) feed into:

UniParc
• Complete history of sequences (no annotation)
• Cross-links to external sequence sources

UniProtKB (taxonomy known)
• TrEMBL: redundant, automatic annotation
• Swiss-Prot: non-redundant, manual annotation (redundancy removed, high quality annotation)

UniMES
• Sequences from metagenomic projects (metagenomic & environmental)
• UniMES Clusters

UniRef
• UniRef Clusters: combines sequences (speed searching)
• UniRef100, UniRef90, UniRef50

Patent sequence record in UniParc

• Accession
• Download data
• Patents not found in UniProtKB
• List of databases containing sequence
• Deleted entries identified (greyed out)
• Navigate to individual entries
• Sequence

Browsing a UniProtKB/SwissProt entry

• Download data
• General information
• Names (synonyms) and taxonomy attributes
• Annotation
• Ontologies
• Protein interactions
• Splice variants
• Sequence features
• Sequence
• References
• Navigate to external data sources, e.g. Ensembl

Browsing a UniProtKB/TrEMBL entry

• Name (could be clone name)
• Taxonomy
• Automatic annotation (derived from InterPro)
• Ontologies (both automatic and manual curation)

Browsing a UniRef90 entry

Faster and more sensitive sequence search with no loss of information

• Status (SwissProt and/or TrEMBL)
• Cluster name
• List of entries in cluster
• Taxonomy of each entry
• % identity of sequences in cluster

Taxonomic distribution of species

All kingdoms:
• Bacteria (61%)
• Eukaryota (32%)
• Archaea (4%)
• Viruses (3%)

Within Eukaryota:
• Other mammals (27%)
• Viridiplantae (18%)
• Fungi (18%)
• Homo (12%)
• Other Vertebrata (10%)
• Other (8%)
• Insecta (5%)
• Nematoda (2%)

SwissProt – most represented species

Mainly model organisms

Protein existence tag

!! Not sequence validation !!

Protein existence level          Total    Human
Evidence at protein level        13%      59%
Evidence at transcript level     12%      37.5%
Inferred from homology           70%      1%
Predicted                        5%       0.5%
Uncertain (mainly TrEMBL)        -        2%

Annotation sources for UniProtKB

Some sources for annotation data

Data sources feeding UniProtKB:
• GO: functional info
• InterPro: protein classification; protein families and domains
• PRIDE: protein identification data
• IntAct: molecular interactions
• IntEnz
• HAMAP: microbial protein families
• RESID: post-translational modifications

Manual curation
• Literature-based annotation
• Sequence analysis

Automated annotation
• InterPro classification
• Signal prediction
• Transmembrane prediction
• Other predictions

Features of UniProtKB

• Splice variants
• Sequence
• Sequence features
• Ontologies
• Annotations
• Nomenclature
• References

A wealth of external links

125 links!

Organism-specific DBs: AGD, ArachnoServer, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, ZFIN

Enzyme & pathway DBs: BioCyc, BRENDA, Reactome, Pathway_Interaction_DB

Proteomic DBs: PeptideAtlas, PRIDE, ProMEX

Genome annotation DBs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase

Family and domain DBs: Gene3D, HAMAP, InterPro, PANTHER, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs

Phylogenomic DBs: HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB

Polymorphism DBs: dbSNP

Ontologies: GO

2D gel DBs: 2DBase-Ecoli, Aarhus/Ghent-2DPAGE (no server), ANU-2DPAGE, COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE

3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR

Gene expression DBs: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline

Protein-protein interaction DBs: DIP, IntAct, STRING

PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite

Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene

Protein family/group DBs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB

Others: BindingDB, DrugBank, NextBio, PMAP-CutDB

SwissProt manual annotation

1. Protein sequence
• Merge available CDS (coding sequence)
• Annotate sequence discrepancies
• Report sequencing errors...

2. Biological information
• Extract literature information
• Orthologue data propagation
• Protein sequence analysis...

Merge available CDS

1 SwissProt entry = 1 gene (1 species)

Merge TrEMBL entries representing the same protein
• Manually analyze and annotate the differences

Annotate sequence discrepancies

Identification of amino acid variants
....and of PTMs

Report sequencing errors


Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines

Sources of annotated information

UniProtKB/SwissProt gathers information from multiple sources:
• Publications (literature/PubMed)
• Predictions (Prosite, Anabelle)
• Contact with experts
• Other databases
• Nomenclature committees

Nomenclature

Synonyms useful for literature searching

Provides synonyms and cleavage products of bifunctional proteins

Annotation comments

>30 comment fields

Controlled vocabularies used whenever possible…

Sequence annotation (Features)

…enable researchers to obtain a summary of what is known about a protein…

…including domain annotation, identifying binding sites…

Feature (e.g. domain) highlighted on sequence

1. Biological Process: a commonly recognized series of events
• Cell division
• Mitosis
• Organelle fission

2. Molecular Function: an elemental activity or task or job
• Protein kinase activity
• binding
• activity

3. Cellular Component: where a gene product is located
• Mitochondrial matrix
• Mitochondrial membrane

Gene Ontology

Annotation for human :

Imported annotation

Binary interactions are taken from the IntAct database

Interactors of human

Evidence for annotation

• Proven
• Potential
• By similarity

Source references included

UniProt homepage

• Text search
• BLAST sequence search
• Retrieve sequences
• ID mapping between databases

http://www.uniprot.org/

Patent sequences in UniProt

• Comprehensive archive consisting of specialised databases
• Release every month, daily updates
• UniProtKB/SwissProt is annotation-rich, but has no patent data
• Patent sequences only found in UniParc as an archived list

UniParc is non-redundant but contains no annotation

…therefore patent information limited

UniProt

UniProt is an excellent source of quality protein sequence and annotation data, but it is better for non-patent data

Sequence archives

Old entries accessible in both ENA and UniProt

• ENA nucleotide sequence version archive
  www.ebi.ac.uk/embl/sva
  - Search by accession → get all records
  - Search by date → get specific record
• UniSave – UniProt sequence/annotation version archive
  www.ebi.ac.uk//unisave
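The two lookup modes of a version archive (by accession for the full history, by date for the one record in force at that time) can be illustrated with a small Python sketch. It assumes a hypothetical in-memory history of (release date, version) pairs for a single accession; the dates are invented, and this is not the interface of the ENA version archive or UniSave.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version history for one accession, sorted by release date.
HISTORY = [
    (date(2005, 3, 1), 1),
    (date(2007, 6, 15), 2),
    (date(2010, 1, 20), 3),
]


def all_versions(history):
    """Search by accession: return every archived version number."""
    return [version for _, version in history]


def version_on(history, when):
    """Search by date: return the version current on that day, or None if
    the entry did not exist yet."""
    dates = [released for released, _ in history]
    index = bisect_right(dates, when)
    return history[index - 1][1] if index else None


print(all_versions(HISTORY))
print(version_on(HISTORY, date(2008, 1, 1)))
```

A date search is useful when checking what a record looked like at a particular point in time, since the current entry may have been modified or suppressed since then.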

Provides complete version list

• Select and compare versions
• View specific old entry
• Tracks all changes to an entry

View old entries

Compare different versions

Changes highlighted

Why do we need a new database for patents?

To address the following problems:
• redundancy
• lack of patent-specific annotation

European nucleotide archive
UniProt
Non-redundant patent sequence databases

Distributing patent sequences

INSDC: GenBank, DDBJ, ENA

Redundancy: a consequence of the international cooperation
• Patent offices supplying sequences: JPO, USPTO, KIPO, EPO, other patent offices
• As other National Offices participate in data exchange → redundancy will increase

NR patent databases remove redundancy
• INSDC patent data (GenBank, DDBJ, ENA) → NR patent sequence databases

EBI-EPO collaboration

Collaboration between EBI and EPO

• Database development and maintenance
• Link to EBI search engines
• Link to EBI analysis tools
• Link to EBI databases

• Acquire patent sequences
• Link to patent literature
• Extract patent annotation
• Collate patent family information

Creating a non-redundant database

Patent data correction

Correction of Publication Numbers and Kind Codes

Patent resources at EBI

http://www.ebi.ac.uk/patentdata/

More information provided

Patent proteins: EPO, USPTO, JPO, KIPO
Patent nucleotides: ENA (EPO, USPTO, JPO, KIPO)

• Complete sequences (EPO, USPTO, JPO, KIPO)
• Non-redundant sequence data
• Patent family classification
• Enriched with patent information

Let's look at an example…

Searching redundant databases

Example: Search patent protein sequence
• Database: Protein, Patent proteins

http://www.ebi.ac.uk/Tools/sss/

Results from redundant databases

>260 identical results → too much to analyze

NR patent sequence databases

LEVEL1 NR patent sequence database removes redundancy: fewer results to analyze, less chance of missing important results
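The level-1 idea, one entry per distinct sequence with all source patents attached, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the publication numbers and sequences are invented, "100% identical" is taken to mean an exact string match after upper-casing, and `build_level1` is a hypothetical helper, not the pipeline used to build the NR patent databases.

```python
from collections import defaultdict

# Hypothetical redundant input: (patent publication number, sequence).
# Publication numbers and sequences are invented for illustration.
REDUNDANT = [
    ("EP0000001", "MKTAYIAKQR"),
    ("US0000002", "MKTAYIAKQR"),
    ("JP0000003", "mktayiakqr"),
    ("US0000004", "MSLLTEVETP"),
]


def build_level1(records):
    """Collapse identical sequences into one entry listing every patent."""
    groups = defaultdict(list)
    for patent, sequence in records:
        # Assumption: identity is an exact match after case normalization.
        groups[sequence.upper()].append(patent)
    return dict(groups)


level1 = build_level1(REDUNDANT)
for sequence, patents in level1.items():
    print(sequence, patents)
```

Four redundant records collapse to two level-1 entries here, so a search returns each distinct sequence once while still listing every patent that contains it.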

Searching patent sequences

Example: Search patent protein sequence
• Database: NR patent Level-1

http://www.ebi.ac.uk/Tools/sss/

Results from level-1 database

Each hit unique

• Earliest publication date
• List of all patents containing the sequence
• Link to sequence entry
• Link to patent documentation

Patent sequence record in NRNL1

• Patents containing 100% identical sequence
• Sequence

Patent families

A Simple Patent Family is a group of patents that relate to the same invention and are based on the same originating application

• They arise when an invention is patented in multiple countries
• Grouping patents into families reduces multi-national results down to a representative member

Patent families

Figure: two patent families (Invention A: patent family; Invention B: second patent family) with filings labelled EP, WO, US, US, JP and entries GM671154, ADA42650, CS017585, ACQ13114, DI603183, HB492658, AAR79155, DD649656, all 100% identical sequences

Same sequence can appear multiple times in a database due to:
• Same invention filed multiple times in different offices (same patent family)
• Different inventors use the same sequence in different contexts (different patent families)

NR patent sequence databases

LEVEL2 NR patent sequence database groups identical sequences by patent family
• provides earliest priority date for family
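Level-2 grouping can be sketched the same way: identical sequences are grouped per patent family, and each group reports the earliest priority date. The records, family identifiers and dates below are invented, and `build_level2` is a hypothetical helper shown only to make the grouping concrete.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (publication number, family id, priority date, sequence).
RECORDS = [
    ("EP0000001", "FAM-A", date(2001, 5, 4), "MKTAYIAKQR"),
    ("US0000002", "FAM-A", date(2000, 11, 20), "MKTAYIAKQR"),
    ("JP0000003", "FAM-B", date(2003, 2, 1), "MKTAYIAKQR"),
]


def build_level2(records):
    """Group identical sequences by patent family and keep the earliest
    priority date seen in each family."""
    groups = defaultdict(list)
    for patent, family, priority, sequence in records:
        groups[(sequence, family)].append((priority, patent))
    return {
        key: {
            "patents": sorted(patent for _, patent in members),
            "earliest_priority": min(priority for priority, _ in members),
        }
        for key, members in groups.items()
    }


level2 = build_level2(RECORDS)
for (sequence, family), info in level2.items():
    print(sequence, family, info["earliest_priority"], info["patents"])
```

The same sequence yields one hit per family (two here), which matches the idea that the same sequence used by different inventors in different contexts stays distinguishable.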

Searching patent sequences

Example: Search patent protein sequence
• Database: NR patent Level-2

http://www.ebi.ac.uk/Tools/sss/

Results from level-2 database

Each hit = one family

• Earliest publication (priority) date in family
• Patents in same family
• Link to sequence entry
• Link to patent documentation

Patent sequence record in NRNL2

• Priority number and date
• Patent equivalents
• Patent literature
• Sequence record in ENA
• Translation
• Sequence

Non-redundant patent databases

Level-1: groups together 100% identical patent sequences
• Patent nucleotides: NRNL1 (non-redundant nucleotide level-1)
• Patent proteins: NRPL1 (non-redundant protein level-1)

Level-2: groups together identical sequences by patent family
• Patent nucleotides: NRNL2 (non-redundant nucleotide level-2)
• Patent proteins: NRPL2 (non-redundant protein level-2)
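The difference between the two levels can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not how EBI builds the databases: the accessions, sequences and family labels are invented, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical patent sequence entries: (accession, sequence, patent family).
# All values are invented for illustration.
ENTRIES = [
    ("ACC1", "ATGGCC", "family_A"),
    ("ACC2", "ATGGCC", "family_A"),
    ("ACC3", "ATGGCC", "family_B"),
    ("ACC4", "TTGACA", "family_B"),
]

def group_level1(entries):
    """Level-1 idea: one cluster per distinct (100% identical) sequence."""
    clusters = defaultdict(list)
    for accession, sequence, _family in entries:
        clusters[sequence].append(accession)
    return dict(clusters)

def group_level2(entries):
    """Level-2 idea: identical sequences, split further by patent family."""
    clusters = defaultdict(list)
    for accession, sequence, family in entries:
        clusters[(sequence, family)].append(accession)
    return dict(clusters)

level1 = group_level1(ENTRIES)
level2 = group_level2(ENTRIES)
print(len(ENTRIES), "entries,", len(level1), "level-1 clusters,", len(level2), "level-2 clusters")
```

In this toy data the same sequence is claimed by two families, so level-2 keeps more clusters than level-1 while still collapsing duplicates within a family.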

http://www.ebi.ac.uk/patentdata/

How the non-redundant databases are built:
1) ENA (redundant)
2) Remove sequence redundancy: Level-1 NR
3) Add annotation, including priority dates for patent families, and group by patent families: Level-2 NR

Patent sequence records at EBI

Nucleotide
• ENA: ~23.1 M PAT sequences
• NRNL1: ~11.9 M sequences
• NRNL2: ~15.0 M sequences

Protein
• Patent Proteins: ~6.3 M PRT sequences
• NRPL1: ~2.5 M sequences
• NRPL2: ~3.8 M sequences

NR Patent Sequence Databases

• Searching a non-redundant database is faster and avoids overlooking data
• These databases are the first non-redundant collections to take account of both sequence and family concepts
• Publication corrections significantly increase data quality
• Collation of biological features in a single record enables understanding of the biological context in which the sequence is being used

2) The Toolbox

• EBI search
• SRS advanced search
• Sequence search

EBI search

EBI Search is accessible from any EBI page: search all databases and literature in one go

http://www.ebi.ac.uk/

EBI search by gene name

Example: search for the src gene

The query (src):
• Lists species with information
• Lists relevant entries in all EBI resources

EBI search by gene name

• Easy to change between species
• Tabs organise data by: gene, expression, protein, structure, literature

Information from Ensembl:
• Gene sequence
• Location
• Sequence variations
• Orthologues...
• Gene structure (forward and reverse strand)

Expression studies from Gene Expression Atlas (expression studies shown by part), view by:
• Disease state
• Cell type
• Compound treatment...

Information from UniProt:
• Function
• Gene Ontology
• Isoforms
• Sequence...
• InterPro domain architecture
• IntAct protein interaction data

Information from PDBe:
• Chain information
• Structural domains
• Citations...
• View structure, view additional structures

Literature categories shown: Reviews; Keyword in title; Free full text; Curator-selected; Patent

Can print full summary of any page

EBI search

EBI-Search for patent information

Example: search for patent WO0146262

http://www.ebi.ac.uk/

The query (WO0146262) returns:
• Literature for WO0146262, including a link to the full paper
• Sequence data for WO0146262, including a list of additional annotation

EBI search is a quick way to find literature and sequences (in ENA and UniProt) associated with a patent

SRS: advanced text search

1st: Select resources to search
2nd: Create query

http://www.ebi.ac.uk/srs/

Worked example (SRS searches >100 databases):

1) Select the library tab
   • NR patent DNA (NRNL1 & NRNL2)
   • NR patent proteins (NRPL1 & NRPL2)
   • Example: the NR level-1 patent DNA database is selected
2) Select resources to search
3) Select a field (here, patent number) and type in the text
4) Create the query
   • The result lists the non-redundant nucleotide sequences from WO0146262
   • Each WO0146262 sequence links to its nucleotide sequence record in NRNL1
   • The NRNL1 sequence record details which other patents also claim this sequence (with NRNL2, you would see the family grouping)
   • The record also links to the WO0146262 literature

What each database returns for a patent query:
• ENA: find all sequences associated with a patent
• NRNL1: find all sequences associated with a patent, and identify all patents associated with each sequence
• NRNL2: find all sequences associated with a patent, and identify all patents in the same family associated with each sequence

What’s available at EBI

Tools are accessible from the EBI homepage (http://www.ebi.ac.uk/): under Tools, select Tools Index

On the tools page (http://www.ebi.ac.uk/Tools):
• Most popular tools are listed
• Link to list of all tools

Full list of sequence search tools: http://www.ebi.ac.uk/Tools/sss/

STEP 1: Choose a search algorithm

Choosing the right search engine

• BLAST: fast search (better for proteins than nucleotides)
• FASTA: fast search
• PSI-SEARCH: finding remote homologues
• SSEARCH: sensitive but slow; good for short sequences
• GGSEARCH: force full-length matches (query and subject aligned over their full lengths)
• GLSEARCH: match domains/patterns to protein; oligo-to-gene (whole query aligned within the subject)
• FASTM: multi-peptide search (several short peptides matched against a subject)

Example: Comparing search engines using a short peptide query sequence

query sequence: RPPSWIPK

Comparing search engines

All searches against UniProtKB/SwissProt.

NCBI-BLAST: no hits found

WU-BLAST:
hit              length  e()
1: TY01_PHYAZ    61      5.4

SSEARCH:
hit              length  e()
1: TY01_PHYAZ    61      0.42
2: BRK5_PHYNO    8       4.8
3: BRK_AMICA     9       9.2
5: BRK_LEPOS     9       9.2

Look at the difference in e-values

SSEARCH is a sensitive search engine suitable for short sequences (may be too slow for longer sequences)

Comparing search engines - specialised

query: RPPSWIPK (all searches against UniProtKB/SwissProt)

SSEARCH:
hit              length  e()
1: TY01_PHYAZ    61      0.42
2: BRK5_PHYNO    8       4.8
3: BRK_AMICA     9       9.2
5: BRK_LEPOS     9       9.2

GLSEARCH (has a preference for long hits):
hit              length  e()
1: TY01_PHYAZ    61      2.5e-16
2: BRK5_PHYNO    8       1.9e-11
3: TY01_PHYBU    8       5.2e-8
4: BRK_ONCMY     10      5.6e-8
5: BRK_LEPOS     9       1e-7
8: B4GT2_HUMAN   372     5.8e-5
......
39: DNAA_PROM    465     0.024
40: DNAA_PROM    199     0.036

GGSEARCH (limited to similar sized hits):
hit              length  e()
1: BRK5_PHYNO    8       8e-7
2: TY01_PHYBU    8       2e-5
3: BRK_LEPOS     9       5.4e-4
4: BRK_AMICA     9       5.4e-4
5: BRK4_PHAJA    8       0.0087
6: BRK_PHYHY     8       0.0087
......
34: TY51_LITRU   7       8.3

• GGSEARCH finds sequences of similar length to the query
• GLSEARCH matches the entire query sequence to sequences of any length

Restricting length of matches

query: RPPSWIPK

SSEARCH against UniProtKB/SwissProt, with the database range limited to 6-10

Limiting the database range limits the size of hits, but is stricter than GGSEARCH

SSEARCH, database range 6-10:
hit              length  e()
1: BRK5_PHYNO    8       0.34
2: BRK_LEPOS     9       0.57
3: BRK_AMICA     9       0.57
4: TY01_PHYBU    8       0.61
5: BRK_ONCMY     10      0.63
6: BRK4_PHAJA    8       5.0
......
18: BRK3_PELRI   9       9.8

GGSEARCH:
hit              length  e()
1: BRK5_PHYNO    8       8e-7
2: TY01_PHYBU    8       2e-5
3: BRK_LEPOS     9       5.4e-4
4: BRK_AMICA     9       5.4e-4
5: BRK4_PHAJA    8       0.0087
6: BRK_PHYHY     8       0.0087
......
34: TY01_LITRU   7       8.3

SSEARCH:
hit              length  e()
1: TY01_PHYAZ    61      0.42
2: BRK5_PHYNO    8       4.8
3: BRK_AMICA     9       9.2
4: BRK_LEPOS     9       9.2

STEP 2: Choose a database to search

Several databases available

Protein patent databases (note: these databases cover the same sequences):
• NR level-1
• NR level-2
• EPO + JPO + KIPO + USPTO

Nucleotide patent data (note: these databases cover the same sequences):
• NR level-1
• NR level-2
• EMBL Patents

Database size is important

The larger the database searched, the higher (less significant) the resulting e-values

Most sequence databases are large, and growing every day:
• ENA-Annotation: >160 million entries
• UniParc (non-redundant): >30 million entries
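The effect of database size can be made concrete with a small Python sketch. It assumes the common approximation that the expected number of chance matches grows linearly with the number of entries searched; the per-entry probability and the 500,000-entry database are invented for illustration, while the 30 million and 160 million figures echo the sizes quoted above.

```python
def expected_chance_hits(p_single, n_entries):
    """Expected number of chance matches at least this good in one search.

    p_single  -- assumed probability that a single unrelated entry scores
                 at least as well as the hit (illustrative value only)
    n_entries -- number of database entries searched
    """
    return p_single * n_entries

P_SINGLE = 1e-7  # hypothetical value, not taken from the slides

DATABASES = [
    ("small curated database (hypothetical)", 500_000),
    ("database with 30 million entries", 30_000_000),
    ("database with 160 million entries", 160_000_000),
]

for name, n_entries in DATABASES:
    e_value = expected_chance_hits(P_SINGLE, n_entries)
    print(f"{name:40s} E = {e_value:.2f}")
```

The same alignment score therefore looks far less significant in the largest database, which is why searching the smallest database likely to contain the sequence helps with short queries.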

Example: Comparing database size when searching with a short peptide query sequence

query sequence: RPPSWIPK

Comparing database size

SSEARCH against three databases, in order of decreasing database size:

UniParc: no hits found

UniProtKB:
hit              e()
1: TY01_PHYAZ    3.6

UniProtKB/SwissProt:
hit              e()
1: TY01_PHYAZ    0.068
2: BRK5_PHYNO    1.9
3: TY01_PHYBU    7.7
4: BRK_AMICA     8.7
5: BRK_LEPOS     8.7
6: BRK_ONCMY     9.7

Look at the difference in e-values

The larger the database searched, the higher (less significant) the resulting e-values
• Search the smallest database likely to contain your sequence
• You can also run a second search of the entire database, or run multiple small searches

Is it best to search a protein or a nucleotide database?

2 issues are worth considering…

1) Codon degeneracy

Because most amino acids are encoded by more than one codon, there can be more variability between CDSs than between proteins

Amino acids: Ser / Ser (match)
Nucleotides: UCU / AGC (mismatch)
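The Ser example can be checked with a short Python sketch. The codon table is a deliberately partial excerpt of the standard genetic code (serine codons only), and the two coding sequences are invented for illustration.

```python
# Partial codon table (standard genetic code), limited to the six serine codons.
CODON_TABLE = {
    "UCU": "Ser", "UCC": "Ser", "UCA": "Ser", "UCG": "Ser",
    "AGU": "Ser", "AGC": "Ser",
}

def translate(rna):
    """Translate an RNA coding sequence codon by codon."""
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3)]

def identity(a, b):
    """Fraction of aligned positions that are identical (no gaps)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

# Two hypothetical coding sequences that both encode Ser-Ser.
cds_1 = "UCUUCU"
cds_2 = "AGCAGC"

nucleotide_identity = identity(cds_1, cds_2)
protein_identity = identity(translate(cds_1), translate(cds_2))
print(f"nucleotide identity: {nucleotide_identity:.0%}")
print(f"protein identity:    {protein_identity:.0%}")
```

Here the two sequences share no nucleotide positions yet encode identical peptides, which is the extreme case of the variability described above.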

Example: Human CKS1B kinase v Zebra finch CDC28 kinase 1B

[Figure: alignment of the two proteins and of the corresponding nucleotide sequences]

Sequence conservation is more stringent at the protein level than at the level of the nucleotide coding sequence

2) Amino acid similarity

Protein sequence searches can distinguish between exact, similar and dissimilar matches
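A minimal Python sketch of this distinction follows. The three protein scores are an excerpt of the BLOSUM62 matrix; the +5/-4 nucleotide match/mismatch values are used here only as an example scoring scheme, and the codons chosen for Ser, Asn and Leu are illustrative.

```python
# Excerpt of BLOSUM62: serine against serine, asparagine and leucine.
BLOSUM62_EXCERPT = {
    ("S", "S"): 4,   # identical
    ("S", "N"): 1,   # similar (conservative substitution)
    ("S", "L"): -2,  # dissimilar
}

def protein_score(a, b):
    """Look up the substitution score for one pair of residues."""
    return BLOSUM62_EXCERPT[(a, b)]

def nucleotide_score(codon_a, codon_b, match=5, mismatch=-4):
    """Simple match/mismatch scoring over one pair of codons."""
    return sum(match if x == y else mismatch for x, y in zip(codon_a, codon_b))

# (residue pair, example codon pair); the codons are illustrative choices.
CASES = [
    (("S", "S"), ("UCU", "AGC")),
    (("S", "N"), ("UCU", "AAC")),
    (("S", "L"), ("UCU", "CUC")),
]

for (res_a, res_b), (codon_a, codon_b) in CASES:
    print(res_a, res_b, protein_score(res_a, res_b),
          codon_a, codon_b, nucleotide_score(codon_a, codon_b))
```

The protein scores separate the three cases, whereas the nucleotide score is the same for all three codon pairs.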

[Figure: amino acids grouped by physical & chemical properties]

Protein searches take account of amino acid similarities

                highly conserved   weakly conserved   not conserved
Amino acids     Ser / Ser          Ser / Asn          Ser / Leu
                identical          similar            mismatch
Nucleotides     UCU / AGC          UCU / AAC          UCU / CUC
                mismatch           mismatch           mismatch

Nucleotide scoring makes no distinction between the three cases

• Protein alignments can score a conservative amino acid substitution differently from a non-conservative one through the use of scoring matrices
• By contrast, nucleotide alignments use over-simple (less sensitive) match/mismatch scoring

Protein v nucleotide search

Protein comparisons identify homologues 5-10x further back in evolution

[Figure: timeline in billions of years ago, running from the formation of Earth through chemical evolution, self-replicating cells, photosynthesis, complex cells, multicellular life, the Cambrian explosion and the extinction of the dinosaurs to today. Taxa marked along it: prokaryotes, archaea, cyanobacteria, eukaryotes, plants, land plants, flowers, arthropods, insects, fish, amphibians, reptiles, birds, mammals and the genus Homo.]

Protein v nucleotide: example

• 100% identity: more significant e-value for the DNA match because it is a longer sequence
• Codon degeneracy and simple scoring give rise to less significant e-values for DNA matches

Organism       Gene    Protein                        DNA e-value   Protein e-value
Human          GSTP1   Glutathione S-transferase P    4.0e-199      1.8e-92
Bovine         GSTP1   Glutathione S-transferase P    5.9e-154      8.0e-82
Mouse          GSTP2   Glutathione S-transferase P2   4.4e-133      4.7e-79
Toad           GSTP1   Glutathione S-transferase P    1.5e-55       9.8e-59
Frog           GSTP1   Glutathione S-transferase P    8.9e-33       2.6e-45
(not given)    GSTP1   Glutathione S-transferase P    8.5e-10       4.2e-32
Rabbit         GSTMU   Glutathione S-transferase Mu   1.3e-07       1.3e-20
Bovine         GSTM2   Glutathione S-transferase M2   4.3           4.4e-17
Liver fluke    GSTMU   Glutathione S-transferase MU   -             3.2e-15
Hornworm       GST2    Glutathione S-transferase 2    -             3.2e-12
Fruit fly      GSTS1   Glutathione S-transferase S1   -             2.3e-08
Slime mold     GSTA2   Glutathione S-transferase a2   -             2.6e-05
Maize          GST1    Glutathione S-transferase 1    -             3.0e-01
Wheat          GSTA1   Glutathione S-transferase 1    -             1.9

Protein v nucleotide search

…therefore, if a patent claims both a nucleotide CDS and a protein sequence, the protein sequence could pull out many more homologues than the nucleotide CDS

STEP 3: Choosing search parameters to fit the task

Choosing parameters

• Default parameters are set for searching with a full-length protein or gene
• Changing parameters can improve search results for short sequences

How to optimise parameters?

The user manual provides help

Which parameters to choose?

Matrix
• Nucleotide search is 'simpler': only match/mismatch
• Protein search uses substitution matrix tables (based on amino acid similarities and rate of change)

Protein matrices

Choice of matrix depends on:
1. strictness of search
2. length of query sequence

QUERY LENGTH   MATRIX     open   ext
>300           BLOSUM50   -10    -2
85-300         BLOSUM62   -7     -1
50-85          BLOSUM80   -16    -4
>300           PAM250     -10    -2
85-300         PAM120     -16    -4
35-85          MDM40      -12    -2
<=35           MDM20      -22    -4
<=10           MDM10      -23    -4

Example: Comparing matrices when searching with a short peptide query sequence

query sequence: MDM2_HUMAN

Comparing matrices

SSEARCH against UniProtKB/SwissProt with two matrices (Blosum80 is stricter than Blosum50):

hit            e() Blosum50 (default)   e() Blosum80
MDM2_XENLA     4e-70                    6e-109
MDM4_BOVIN     9e-48                    9e-51
MDM4_DANRE     6e-17                    5e-24
XB34_ORYSJ     0.01                     0.05
RN157_MOUSE    0.19                     4.1
MGRN1_MOUSE    1.2                      6.0
MGRN1_HUMAN    2.4                      6.4

• Blosum80 gives more significant e-values for close matches
• Blosum50 gives more significant e-values for distant matches

Matrices:
• Use a high Blosum to find close matches, a low Blosum to find distant matches
• Use MDM to find longer matches
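The query-length table from the "Protein matrices" slide can be expressed as a simple lookup. This Python sketch is only one way to encode it: the function name is hypothetical, and how the boundary lengths (for example exactly 85 or 300) are assigned is an assumption, since the table's ranges overlap at their edges.

```python
# Rows from the "Protein matrices" table, as
# (query length must exceed, matrix, gap open, gap extend), checked top to bottom.
BLOSUM_ROWS = [
    (300, "BLOSUM50", -10, -2),
    (85, "BLOSUM62", -7, -1),
    (50, "BLOSUM80", -16, -4),
]
PAM_MDM_ROWS = [
    (300, "PAM250", -10, -2),
    (85, "PAM120", -16, -4),
    (35, "MDM40", -12, -2),
    (10, "MDM20", -22, -4),
    (0, "MDM10", -23, -4),
]

def suggest_matrix(query_length, rows):
    """Return (matrix, gap_open, gap_extend) for a query length, or None
    if the chosen matrix family has no row for such a short query."""
    for lower_bound, matrix, gap_open, gap_ext in rows:
        if query_length > lower_bound:
            return matrix, gap_open, gap_ext
    return None

peptide = "RPPSWIPK"  # the 8-residue example query used earlier in the deck
print(suggest_matrix(len(peptide), PAM_MDM_ROWS))
print(suggest_matrix(len(peptide), BLOSUM_ROWS))
print(suggest_matrix(400, BLOSUM_ROWS))
```

For the 8-residue peptide only the MDM rows apply, which matches the table's guidance that the short MDM matrices are meant for short queries.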

Nucleotide match/mismatch

Protein searches use a matrix; nucleotide searches (FASTA, BLAST) instead have match/mismatch scores

• “Reward” for match, “penalty” for mismatch
• Reward/penalty ratio: increase the ratio to find more divergent sequences
  • Ratio of 0.33 (1/-3) for 99% conserved
  • Ratio of 0.5 (1/-2) for 95% conserved
  • Ratio of 1 (1/-1) for 75% conserved
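To see why a higher reward/penalty ratio tolerates more divergence, the three settings can be applied to the same ungapped alignment. This is a Python sketch under simple assumptions: the two 10-base sequences are invented, and real search tools add gap penalties and statistics on top of this raw score.

```python
# Reward/penalty pairs and the conservation levels quoted on the slide.
SLIDE_SETTINGS = [
    (1, -3, 99),  # (reward, penalty, % conserved)
    (1, -2, 95),
    (1, -1, 75),
]

def reward_penalty_ratio(reward, penalty):
    """Ratio as quoted on the slide, e.g. 1/-3 gives 0.33."""
    return reward / abs(penalty)

def ungapped_score(seq_a, seq_b, reward, penalty):
    """Raw score of an ungapped alignment of two equal-length sequences."""
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))

# Hypothetical pair of sequences with two mismatches (80% identical).
SEQ_A = "ACGTACGTAC"
SEQ_B = "ACGAACGTTC"

for reward, penalty, conserved in SLIDE_SETTINGS:
    ratio = reward_penalty_ratio(reward, penalty)
    score = ungapped_score(SEQ_A, SEQ_B, reward, penalty)
    print(f"{reward}/{penalty}  ratio {ratio:.2f}  (for {conserved}% conserved)  score {score}")
```

The divergent pair keeps more of its score as the ratio rises, so it is more likely to be reported.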

Gap penalties

Protein search gap penalties
• Gap open = 0 to -23
• Gap extension = 0 to -8

Nucleotide search gap penalties
• Gap open = -2 to -16
• Gap extension = 0 to -4

Choice of gap penalties depends on:
1. strictness of search (larger penalty: fewer gaps)
2. the scoring matrix: use the open and ext values listed with each matrix in the Protein matrices table

Word length - ktup

KTUP (word length)
• KTUP = 'word-length' of search
• Larger word length: less sensitive, but faster
• Nucleotide search: fewer bases than amino acids, so use a higher KTUP
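The trade-off can be illustrated by counting the words two sequences share at different word lengths. This Python sketch covers only the word-matching idea behind ktup, not the FASTA algorithm itself; the sequences are invented and the function names are hypothetical.

```python
def words(seq, k):
    """All distinct overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_words(query, subject, k):
    """Number of distinct k-letter words present in both sequences."""
    return len(words(query, k) & words(subject, k))

# Hypothetical 12-base query and a subject differing at two positions.
QUERY = "ACGTGCATCGAT"
SUBJECT = "ACGAGCATGGAT"

for k in (6, 3):
    print(f"ktup {k}: {shared_words(QUERY, SUBJECT, k)} shared words")
```

With the long word length every window spans a mismatch, so no seed is found at all; the short word length still finds several, at the cost of comparing many more words.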

Example: Comparing ktup when searching with a short RNA query sequence

query sequence: 23bp RNA

Comparing ktup

FASTA against EMBL release with three settings. Lower ktup is more sensitive; E=50 extends the e-value cut-off.

ktup6 (default): no hits found

ktup2:
hit            e()
1: AB334817    0.12
2: AY238603    0.14
3: AC101743    0.16
4: AC115920    0.18

ktup3, E=50:
hit            e()
1: AB334817    0.074
2: AY238603    0.092
3: AC101743    0.11
4: AC115920    0.12
5: BC098485    22
......
11: AL591512   32

For short sequences:
• Lowering ktup makes the search more sensitive
• Increase the e-value cut-off
• Increase the match/mismatch score (+5/-4 for FASTA)
• Increase gap penalties

Masking

Do I mask my sequence?

Low complexity regions should be masked to avoid spurious results:
• CA repeats
• poly-A tails
• proline-rich regions

**Be careful you don’t mask what you are looking for
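As a rough illustration of what masking does to a nucleotide query, the Python sketch below replaces CA repeats and a trailing poly-A run with N. It is not the filter used by any EBI tool: the regular expressions, the length thresholds and the example sequence are all assumptions chosen for the demonstration.

```python
import re

def mask_low_complexity(seq, min_ca_repeats=4, min_poly_a=8):
    """Replace CA repeats and a trailing poly-A tail with N.

    The thresholds are arbitrary illustrative defaults.
    """
    def to_n(match):
        return "N" * len(match.group(0))

    # Mask runs of at least `min_ca_repeats` consecutive CA dinucleotides.
    seq = re.sub(r"(?:CA){%d,}" % min_ca_repeats, to_n, seq)
    # Mask a poly-A tail of at least `min_poly_a` bases at the 3' end.
    seq = re.sub(r"A{%d,}$" % min_poly_a, to_n, seq)
    return seq

# Hypothetical query: unique sequence, a CA repeat, unique sequence, poly-A tail.
QUERY = "GATTC" + "CA" * 5 + "GGTCT" + "A" * 10
MASKED = mask_low_complexity(QUERY)
print(QUERY)
print(MASKED)
```

If the repeat itself is what you are searching for, raise the thresholds or skip masking, as the warning above says.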

Parameters for short sequences

What do I use for short sequences?
• use strict matrices
• use high gap penalties
• avoid masking
• allow high e-values

Adding value to your search results

Sequence search results page

• Actions on all results
• Actions on selected results

Visual output

• Graphical display of results
• Select a sequence to view its alignment

Visual output provides an at-a-glance view of the length and position of all matches

e-values or % identity?

Do you use e-values or % identity?

e-value is a better estimate of similarity than % identity, but patents use % identity

e-value: estimates the statistical significance of matches
• Default cut-off = 10: expect 10 matches found by chance
• E() = 1-10: frequently related
• E() < 0.01: usually homologous

% identity: % of positions identical between query and match sequence

Example: two hits with similar % identity scores but different e-values
• In one, the pattern of conservation indicates homology
• In the other, there is no evidence of homology

Check alignments

Check alignments, especially if using a local-local algorithm

Example: 100% identity, but only over 124 / 662 (about 19%) of the sequence (124 aa overlap in a 662 aa sequence)
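The point can be checked numerically: percent identity is computed over the aligned region only, so it needs to be read alongside coverage. In this Python sketch the function and key names are hypothetical; the figures are the 124 aa overlap and 662 aa sequence length from the example.

```python
def alignment_summary(identical, overlap, sequence_length):
    """Percent identity over the overlap, and percent of the sequence covered."""
    return {
        "percent_identity": 100 * identical / overlap,
        "coverage": 100 * overlap / sequence_length,
    }

# Figures from the example: 124 identical residues in a 124 aa overlap
# against a 662 aa sequence.
summary = alignment_summary(identical=124, overlap=124, sequence_length=662)
print(f"identity over overlap: {summary['percent_identity']:.1f}%")
print(f"sequence covered:      {summary['coverage']:.1f}%")
```

A hit reported as 100% identical here covers less than a fifth of the sequence, which is why the alignment itself should always be inspected.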

Always check alignments to see where and to what extent the query & target sequences match

Additional annotation

Protein and nucleotide search results have additional annotation:
• Nucleotide sequence
• Protein sequence
• Genomic information
• Gene ontology mapping
• InterPro protein classification
• Literature

e.g. related ENA nucleotide entries

Functional predictions

Protein search results have „Function Prediction‟ ____Click __ to ____ edit Master______text styles______Second______level ____Third _____ level Fourth______level ____Fifth _____level

Sequence Searching Tools 196 Public site: function prediction

____Click __ to ____ edit Master______text styles______Second______level Functional predictions: InterPro family/domain ____Third _____ level classifications Fourth______level ____Fifth _____level Visual comparison  find mis/partial matches

Sequence Searching Tools 197 InterPro annotation

Domain annotation ____Click __ to ____ edit Master______text styles______Second______level ____Third _____ level Fourth______level ____Fifth _____level

SWIB/ RanBP2-type RING-type domain zinc finger Sequence Searching Tools 198 InterPro annotation

Family classification ____Click __ to ____ edit Master______text styles______Second______level ____Third _____ level Fourth______level ____Fifth _____level

Mdm2/Mdm4 family Sequence Searching Tools 199 Public site: function prediction

100% ID • family signature • 4 domain signatures

34% ID • family signature • 3 domain signatures

28% ID • 1 domain signature

24% ID • No signatures
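The comparison above can also be done programmatically once the signature matches for the query and each hit are known. The following is a minimal sketch under that assumption. The signature identifiers are placeholders invented for illustration, not real InterPro accessions, and `classify_hit` is a hypothetical helper.

```python
def classify_hit(query_signatures, hit_signatures):
    """Compare the signatures found on a hit with those on the query.

    Returns "full" if the hit carries every query signature,
    "partial" if it carries some, and "none" if it carries none.
    """
    query = set(query_signatures)
    hit = set(hit_signatures)
    if not query:
        raise ValueError("query has no signatures to compare against")
    shared = query & hit
    if shared == query:
        return "full"
    if shared:
        return "partial"
    return "none"


# Placeholder signature names mirroring the pattern on the slide:
# one family signature plus four domain signatures on the query.
query = ["FAMILY", "DOMAIN_1", "DOMAIN_2", "DOMAIN_3", "DOMAIN_4"]
hits = {
    "100% ID": ["FAMILY", "DOMAIN_1", "DOMAIN_2", "DOMAIN_3", "DOMAIN_4"],
    "34% ID": ["FAMILY", "DOMAIN_1", "DOMAIN_2", "DOMAIN_3"],
    "28% ID": ["DOMAIN_1"],
    "24% ID": [],
}
for label, signatures in hits.items():
    print(label, classify_hit(query, signatures))
```

A "partial" or "none" result flags hits whose annotation should not be transferred to the query without further checking.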

InterPro: access directly

InterPro

http://www.ebi.ac.uk/

InterPro homepage

Text or sequence search

>60,000 protein signatures from 11 member databases

http://www.ebi.ac.uk/interpro/

InterPro homepage

New interface

Text or sequence search

http://www.ebi.ac.uk/interpro/

InterProScan sequence search

Download version takes both protein and nucleotide sequence

All search engines of member databases

InterProScan results

Domains

Family
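When results are saved as a tab-separated file, the matches can be grouped per sequence for review. This is a rough sketch only: the column positions are an assumption and must be adjusted to the file actually produced, and the sample rows are invented placeholders rather than real output.

```python
import csv
import io
from collections import defaultdict

# Assumed 0-based column positions in the tab-separated file.
# These are assumptions for illustration; check them against your file.
COL_SEQUENCE, COL_SIGNATURE, COL_START, COL_STOP = 0, 4, 6, 7


def group_matches(tsv_text):
    """Group signature matches by sequence ID, ordered by start position."""
    matches = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) <= COL_STOP:
            continue  # skip blank or truncated lines
        matches[row[COL_SEQUENCE]].append(
            (row[COL_SIGNATURE], int(row[COL_START]), int(row[COL_STOP]))
        )
    for hits in matches.values():
        hits.sort(key=lambda hit: hit[1])
    return dict(matches)


# Invented sample rows (placeholder values throughout).
sample = (
    "seq1\tchecksum\t491\tDB_A\tSIG_DOMAIN_2\tdescription\t300\t330\n"
    "seq1\tchecksum\t491\tDB_B\tSIG_DOMAIN_1\tdescription\t20\t100\n"
)
for sequence_id, hits in group_matches(sample).items():
    for signature, start, stop in hits:
        print(sequence_id, signature, start, stop)
```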

Recap

Sources: DDBJ, GenBank, JPO, USPTO, KIPO, EPO
Databases: NR patent databases, ENA, UniProt
Search routes: SRS text search, Sequence search
Steps: Select resources, Select search tool, Select library tab, Create query, Navigate to sss tools, Create search
Results: patent literature, Sequence list, Functional predictions, Sequence record, Sequence matches

Summary

EBI provides free access to >100 databases

EBI provides specialised patent sequence databases

EBI provides multiple sequence search options

EBI provides advanced text search options

http://www.ebi.ac.uk/

Help

Contacts: http://www.ebi.ac.uk/support/
