Aim and Outline of the courseDb data mining

Db tools can be used to retrieve informaon for a gene or protein

The most important concept is the similarity

Aim and Outline of the courseDb data mining

BLAST

Interpro

Blast2GO

Sequence Alignments

BLAST (Basic Local Alignment Search Tool

• One of the tools of the NCBI - The U.S. National Center for Biotechnology Information.

• Uses word matching like FASTA

• Similarity matching of words (3 AA’s, 11 bases/nucleotides) – does not require identical words.

• If no words are similar, then there is no alignment – won’t find matches for very short sequences 13 Sequence Alignments BLAST word matching

MEAAVKEEISVEDEAVDKNI

MEA EAA AAV Break query AVK VKE into words: KEE EEI EIS ISV ... Break database sequences into words:

14 Sequence Alignments Compare word lists

Query Word List: Database Sequence Words Lists

MEA ? RTT AAQ EAA SDG KSS AAV SRW LLN AVK QEL RWY VKL VKI GKG KEE DKI NIS EEI LFC WDV EIS AAV KVR ISV PFR DEI

Compare word lists by Hashing & allow near matches!

15 Sequence Alignments

Find locations of matching words in all sequences

ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT MEA EAA TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY AAV AVK IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH KLV KEE EEI EIS ISV

16 Sequence Alignments Extend hits one base at a time

• Then BLAST extends the matches in both directions, starting at the seed. The un-gapped alignment process extends the initial seed match of length W in each direction in an order to boost the alignment score. Indels are not considered during this stage.

• In the last stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user. 17 Sequence Alignments Extend hits one base at a time

• Then BLAST extends the matches in both directions, starting at the seed. The un-gapped alignment process extends the initial seed match of length W in each direction in an order to boost the alignment score. Indels are not considered during this stage.

Why statistics??

• In the last stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user. 17 Sequence Alignments BLAST : example of result

• Job Title: P14867|GBRA1_HUMAN Gamma-aminobutyric acid... Show Conserved Domains Putative conserved domains have been detected, click on the image below for detailed results. * BLASTP 2.2.18 (Mar-02-2008) protein-protein BLAST

Database: Non-redundant SwissProt sequences 309,621 sequences; 115,465,120 total letters

Query= P14867|GBRA1_HUMAN Gamma-aminobutyric acid receptor subunit alpha-1 - Homo sapiens (Human). Length=456

Sequences producing significant alignments: (Bits) Value

sp|P14867.3|GBRA1_HUMAN Gamma-aminobutyric acid receptor subu... 948 0.0 Gene info sp|Q5R6B2.1|GBRA1_PONPY Gamma-aminobutyric acid receptor subu... 944 0.0 sp|Q4R534.1|GBRA1_MACFA Gamma-aminobutyric acid receptor subu... 944 0.0 sp|P08219.1|GBRA1_BOVIN Gamma-aminobutyric acid receptor subu... 939 0.0 Gene info sp|P62813.1|GBRA1_RAT Gamma-aminobutyric acid receptor subuni... 908 0.0 Gene info sp|P19150.1|GBRA1_CHICK Gamma-aminobutyric acid receptor subu... 882 0.0 Gene info sp|P47869.2|GBRA2_HUMAN Gamma-aminobutyric acid receptor subu... 670 0.0 Gene info sp|P26048.1|GBRA2_MOUSE Gamma-aminobutyric acid receptor subu... 669 0.0 Gene info sp|P23576.1|GBRA2_RAT Gamma-aminobutyric acid receptor subuni... 669 0.0 Gene info sp|P10063.1|GBRA2_BOVIN Gamma-aminobutyric acid receptor subu... 667 0.0 Gene info • sp|Q08E50.1|GBRA5_BOVIN Gamma-aminobutyric acid receptor subu... 641 0.0 Gene info sp|Q8BHJ7.1|GBRA5_MOUSE Gamma-aminobutyric acid receptor subu... 640 0.0 Gene info sp|P31644.1|GBRA5_HUMAN Gamma-aminobutyric acid receptor subu... 638 0.0 Gene info sp|P19969.1|GBRA5_RAT Gamma-aminobutyric acid receptor subuni... 636 0.0 Gene info sp|P34903.1|GBRA3_HUMAN Gamma-aminobutyric acid receptor subu... 632 0.0 Gene info sp|P26049.1|GBRA3_MOUSE Gamma-aminobutyric acid receptor subu... 630 6e-180 Gene info sp|P10064.1|GBRA3_BOVIN Gamma-aminobutyric acid receptor subu... 628 2e-179 Gene info sp|P20236.1|GBRA3_RAT Gamma-aminobutyric acid receptor subuni... 627 3e-179 Gene info sp|P30191.1|GBRA6_RAT Gamma-aminobutyric acid receptor subuni... 520 6e-147 Gene info sp|P16305.2|GBRA6_MOUSE Gamma-aminobutyric acid receptor subu... 518 2e-146 Gene info sp|Q90845.1|GBRA6_CHICK Gamma-aminobutyric acid receptor subu... 518 3e-146 Gene info

18 Sequence Alignments BLAST is approximate but fast

• BLAST makes similarity searches very quickly, but also makes errors – misses some important similarities – makes many incorrect matches

• The NCBI BLAST web server lets you compare your query sequence to various sequences stored in the GenBank;

• This is a VERY fast and powerful computer.

• The speed and relatively good accuracy of BLAST are the key why the tool is the most popular search tool.

19

InterPro Database Protein Functional Analysis

Jennifer McDowall, Ph.D. Senior InterPro Curator EBI Sequence Databases

nucleotide protein sequence sequence

translate EMBL UniProtKB >7M TrEMBL

(GenBank, manual DDBJ) annotation

UniProtKB CGCGCCTGTACGCT >400,000 GAACGCTCGTGAC GTGTAGTGCGCG Swiss-Prot CGCGCCTGTACGCT GAACGCTCGTGAC GTGTAGTGCGCG EBI Sequence Databases nucleotide protein protein sequence sequence annotation

translate EMBL UniProtKB TrEMBL InterPro (GenBank, manual DDBJ) annotation Protein signatures UniProtKB groups of CGCGCCTGTACGCT related proteins GAACGCTCGTGAC GTGTAGTGCGCG Swiss-Prot CGCGCCTGTACGCT (same family or GAACGCTCGTGAC GTGTAGTGCGCG share domains) InterPro ~80% Protein Coverage

UniProt/ UniProt/ UniMESS SwissProt TrEMBL Metagenomic proteins proteins proteins

UniProtKB ~400,000 >7M >6M

Signature matches

Available InterPro ~370,000 >5.3M 2009 What are protein signatures?

Multiple

• A signature describes the paern of a set of conserved residues in a group of proteins Ø Define a Ø Define a protein feature (domain or conserved site) What value are signatures?

• More sensive homology searches Ø Find more distant homologues than BLAST What value are signatures?

• More sensive homology searches

• Classificaon of proteins Ø Associate proteins that share: Funcon Domains Sequence Structure What value are signatures?

• More sensive homology searches

• Classificaon of proteins

• Annotaon of protein sequences Ø Define conserved regions of a protein - e.g. locaon and type of domains key structural or funconal sites What value are signatures?

• More sensive homology searches

• Classificaon of proteins

• Annotaon of protein sequences

• Transfer addional (automac) annotaon Ø Associate TrEMBL proteins with well-annotated SwissProt proteins

Transfer annotaon Signature methods

• Pattern

• Fingerprint

• Sequence clustering

• HMM

• SAM Patterns

Pattern/motif in sequence à regular expression Can define important sites

EXAMPLE: Insulin Enzyme catalytic site B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxxProsthetic group attachment

A chain xxxxxCCxxxCxxxxxxxxCxMetal ion binding site | | Cysteines for disulphide bonds Protein or molecule binding Patterns

Pattern/motif in sequence à regular expression Can define important sites

EXAMPLE: PS00262 Insulin family signature B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx

A chain xxxxxCCxxxCxxxxxxxxCx | |

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVE ALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGA GSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN Patterns

Pattern/motif in sequence à regular expression Can define important sites

EXAMPLE: PS00262 Insulin family signature B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx

A chain xxxxxCCxxxCxxxxxxxxCx | |

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVE ALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGA GSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N Patterns

Pattern/motif in sequence à regular expression Can define important sites

EXAMPLE: PS00262 Insulin family signature B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx

A chain xxxxxCCxxxCxxxxxxxxCx | |

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVE ALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGA GSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N Regular expression C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C Patterns – understanding a regular expression

Strictly conserved site; only x denotes any amino acid one amino acid is accepted can occur at a single at this posion posion

C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C

There are dashes between Curly brackets denote each posion amino acids that cannot occur at a single posion Patterns – understanding a regular expression

X(2) – therefore any amino acid can occur at the next two posion

C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C

Square brackets denote range of amino acids that occur at a single posion Patterns

Sequence alignment

Insulin family Define pattern motif

xxxxxx xxxxxx Extract pattern sequences xxxxxx xxxxxx

Build regular C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C expression

Pattern signature

PS00000 Fingerprints

Several motifs à characterise family

Identify small conserved regions in divergent proteins

Different combinations of motifs describe subfamilies

EXAMPLE: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKS IMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE Fingerprints

Several motifs à characterise family

Identify small conserved regions in divergent proteins

Different combinations of motifs describe subfamilies

EXAMPLE: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASKF NSDINLEYKGKSVNLK SIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE

His phosphorylation site Fingerprints

Several motifs à characterise family

Identify small conserved regions in divergent proteins

Different combinations of motifs describe subfamilies

EXAMPLE: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASKF NSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE

His phosphorylation site

Ser phosphorylation site Fingerprints

Several motifs à characterise family

Identify small conserved regions in divergent proteins

Different combinations of motifs describe subfamilies

EXAMPLE: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASK FNSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADE AEGMAAIVETLQKEGLAE

His phosphorylation site

Conserved site Ser phosphorylation site Fingerprints

Several motifs à characterise family

Identify small conserved regions in divergent proteins

Different combinations of motifs describe subfamilies

EXAMPLE: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASK FNSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADE AEGMAAIVETLQKEGLAE

1) GIHARPATLLVQTASKF 3-motif fingerprint 2) KGKSVNLKSIMGVMSL 3) LGVGQGSDVTITVDGADE Fingerprints

Sequence alignment

His Ser Conserved Define motifs phosphorylation phosphorylation site site site

xxxxxx xxxxxx xxxxxx Extract motif xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx sequences xxxxxx xxxxxx xxxxxx

Correct order Fingerprint signature 1 2 3 Correct spacing

PR00000 Sequence clustering

Automatic clustering of homologous domains

**Rarely covers entire domain (conserved core)

**Signature size can change with release

Known domain families PSI-BLAST Recruit homologous domains MKDOM2 Automatic clustering ProDomAlign Align domain families Hidden Markov Models (HMM)

Can characterise protein over entire length

Models conserved and divergent regions (position-specific scoring)

Models insertions and deletions

Ø Outperform in sensitivity and specificity

Ø More flexible (can use partial alignments) Hidden Markov Models (HMM)

Sequence 1: Sequence Sequence 2: Sequence 3: alignment Sequence 4: Sequence 5: Sequence 6: Sequence 7:

Scoring matrix

(residue at each position in Bayesian statistics alignment) probability scoring

Profile Hidden Markov Models (HMM)

Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7:

M1

M = match state Hidden Markov Models (HMM)

Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7:

M1 M2

M = match state Hidden Markov Models (HMM)

Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7:

M1 M2 M3

M = match state Hidden Markov Models (HMM)

Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7:

M1 M2 M3 M4 M5 M6 M7 M8 M9 M10

M = match state Hidden Markov Models (HMM)

I1 I2 I3 I4 I5 I6 I7 I8 I9

M1 M2 M3 M4 M5 M6 M7 M8 M9 M10

D2 D3 D4 D5 D6 D7 D8 D9

I = insert state D = delete state Hidden Markov Models (HMM)

HMM databases:

• PIR SUPERFAMILY • PANTHER Families conserved in sequence • TIGRFAM

Domains conserved • SMART in sequence • SUPERFAMILY Domains conserved in • GENE3D structure SAM Profile HMMs

Homologous structural superfamilies

Start with single seed sequence

Create 1 model for every protein in Few proteins in family superfamily à combine results have PDB structures

Proteins in superfamily may have low sequence identity SAM Profile models

T99 script: Single seed sequence GIHARPATLLVQTASKF WU-BLASTP

Close homologues Low identity

GIHARPATLLVQTASKF GIHARPATLLVQTASKF matches GIHARPATLLVQTASKF search Initial model New larger alignment GIHARPATLLVQTASKF GIHARPATLLVQTASKF GIHARPATLLVQTASKF GIHARPATLLVQTASKF GIHARPATLLVQTASKF Final model Signatures Methods Describe protein features: acve sites, binding sites… • Pattern Describe families and • Fingerprint sibling subfamilies Predicts conserved • Sequence clustering domains

• HMM

• SAM Signature Methods

• Pattern Funconal classificaon of families • Fingerprint Funconal domain • Sequence clustering annotaon

• HMM

• SAM Structural domain annotaon Comprehensive annotation InterPro removes redundancy

SWIB/MDM2 domain

RanBP2-type zinc finger RING-type Domain annotaon zinc finger Comprehensive annotation

Conserved site within zinc finger

Annotate features Comprehensive annotation

Mdm2/Mdm4 family Parent

Child Mdm4 subfamily Family classificaon Domain Boundaries

Gene3D (and SSF) determines domain structural boundaries Pfam trims domains to regions of good sequence conservation

ProDom displays shortest conserved sequence Fragmented Signatures

1) Signature method 2) Duplicated domains 3) Repeated elements 4) Non-contiguous domains Fragmented Signatures

1) Signature method • e.g. PRINTS – discrete motifs

2) Duplicated domains 3) Repeated elements 4) Non-contiguous domains Fragmented Signatures

1) Signature method 2) Duplicated domains • e.g. SSF - duplication consisting of 2 domains with same fold

3) Repeated elements 4) Non-contiguous domains Fragmented Signatures

1) Signature method 2) Duplicated domains 3) Repeated elements • e.g. Kringle, WD40

4) Non-contiguous domains Fragmented Signatures

1) Signature method 2) Duplicated domains 3) Repeats 4) Non-contiguous domains • Structural domains can consist of non- contiguous sequence Fragmented Signatures

1) Signature method 2) Duplicated domains 3) Repeats 4) Non-contiguous domains Complementary Annotation

7-blade beta-propeller Beta-propeller repeat

Ø Structural-based signature (SSF) shows boundaries of structural domain

Ø Sequence-based signature (Pfam) shows that the domain is made up of repeating sequence elements Complementary Annotation

PFAM shows domain is composed of two types of repeated sequence mofs

SUPERFAMILY shows the potenal domain boundaries Complementary Annotation

PFAM/SMART show 2 domains from disnct sequence families

GENE3D shows that these domains share homologous structure Funconal annotaon with Blast2GO Index • Why Blast2GO? Concepts on funconal annotaons – Funcon assignment – Vocabularies – GO and GOA – The GO Direct Acyclic Graph – The problem of funcon transfer • Blast2GO Java applicaon + Praccals • Blast2GO @ Babelomics + Praccals

Why Blast2GO? Experiment Workflow analysis Data-Analysis Gene-List

MNAT1 CTNNBL1 ENOX2 GTPBP1 RALY TAGLN2 RAB3A PPP2R5A MAPRE1 Functional ..... Annotation Functional ... interpretation +

Functional Profiling What does Blast2GO do?

Generates annotations Visualization of funcional annotations Concepts of funconal annotaon • Gene/Protein funcon • Referes to the molecular funcon of a gene or a protein: • Funconal annotaon • More general, referes to the characterizaon of funconal aspect of the protein. Stress-related, cytoplasm, ABC transporter • Also referes to the process of assingment of a funcon label • Habitually, standard vocabularies are used to assign funcon

Funconal Vocabularies

Molecular Function Biological Process Cellular Component

Metabolic pathways

KEGG orthologues

Functional motifs The

• Project developed by the Gene Ontology Consorum • Provides a controlled vocabulary to describe gene and gene product aributes in any organism • Includes both the development of the Ontology and the maintenance of a Database of annotaons The Ontology ü Annotations are given to te most specific (low) level. ü True path rule: annotation at a term implies annotation to all its parent terms More general ü Annotation is given with an Evidence Code: o IDA: inferred by direct assay o TAS: traceable author statement o ISS: infered by sequence similarity o IEA: electronic annotation o …. More specific The GO has a DAG structure The Gene Ontology Database (GOA) http://www.geneontology.org/GO.current.annotations.shtml

• There is a collaborang instuon per organism to provide annotaons • Most of the GOA annotaons come from UniProt • Most of the annotaons are electronic annotaons InterPro

http://www.ebi.ac.uk/interpro/databases.html

• Collecon of databases with funconal annotaon of protein mofs • Funconal vocabulary at UniProt • There is an equivalence table between GO and InterPro Funconal assignment Annotaon

Empirical Transference Literature reference

Filogeny

Molecular Biochemical Sequence interacons Structure assay Comparison analysis

Gene/protein Sequence expression Idenficaon homology of folds Mof idenficaon Automac annotaon • GO annotaons can be created by comparision to annotated sequences • To achieve enough coverage, high- throughput, automac annotaon is required • The most effecve (also error prone) automac annotaon method is transfer from sequence similarity Concerns in funconal transfer by

GO1, GO2, GO3, GO4 similarity HIT QUERY GO1, GO2, GO3, GO4

• Level of homology (~ from 40-60% is possible) • The overlap query and hit sequences (not much a problem) • The domain or structure funcon associaon • The paralog problem: genes with similar sequences might have different funconal specificaons • The evidence for the original annotaon • Balance between quality and quanty: depends on the use Blast2GO • Suite for funconal annotaon and data mining on funconal data – Consideraons for annotaon • Simlarity • Length of the overlap • Percentage of hit sequence spanned by the overlap • Evidence original annotaon • Blast hits and mof hits • Refinement by addional methods – Visualizaon: • Annotaon charts • Knowledge discovery on the DAG • Desktop Java applicaon • web interface @ Babelomics: Babelomics for non-model Blast2GO Annotaon strategy

Hit1 Hit1 go1,go2, go3 go1,go2, go3 Sq1 Sq1 Hit2 Sq1 Hit2 go1,go3, go4 Sq1 go1,go3, go4 Hit3 Hit3 go3,go5, go6,go8 go3,go5, go6,go8 Hit4 Hit4 go1,go4 go1,go4 Hit1 Hit1 go6,go9, go8 go6,go9, go8 Sq2 Sq2 Hit2 Sq2 Hit2 go1,go8 Sq2 go1,go8 Blast Hit3 Mapping Hit3 go4,go1, go8,go9 Annotation go4,go1, go8,go9 Hit4 Hit4 Hit1 Hit1 go2 go2 Sq3 Sq3 Hit2 Sq3 Hit2 go2,go4, go4 Sq3 go2,go4, go4 Hit3 Hit3 go2,go5, go6 go2,go5, go6 Hit4 Hit4 go2,go4 go2,go4

Hit1 Hit1 Sq4 Sq4 Sq4 Sq4 Hit2 Hit2 Blast2GO Annotaon Strategy

go1,go2, go3 Sq1 go1,go3, go4 Sq1 go1,go2, go3, go3,go5, go6,go8 GO11 go1,go4

go6,go9, go8 Refinement Sq2 go1,go8 Sq2 go8, go4,go1, go8,go9 GO12, GO13

go2 InterPro Annex Sq3 go2,go4, go4 Sq3 go2,go4 go2,go5, go6 GOSlim go2,go4 Manual

Sq4 Sq4 GO15 B2G Highlighng on the DAG: The B2G Score • Coloring strategy to highlight regions in the DAG where the most interesng informaon is concentrated • The confluence score (B2G score) keeps a balance between the number of annotated sequences at one node and the distance to the origin of annotaon