InterPro: Protein signatures, classification, and functional analysis

Lorenzo Cerutti

Swiss Institute of Geneva/Lausanne, Switzerland

March 2011

Credits

- Nicolas Hulo and Marco Pagni from SIB for sharing some slides with me.
- Terri Attwood from Faculty of Life Sciences & School of Computer Science for allowing me to use some of her ideas.
- Jennifer McDowall and Duncan Legge for providing the InterPro tutorial.
- The InterPro team for providing some of their InterPro presentation slides.

Biology: a complex matter!

- Proteins
  - exhibit rich evolutionary relationships;
  - exhibit complex molecular interactions;
  - have complex regulation and modification mechanisms;
  - exist in dynamic systems.
- Computational sequence analysis tools are naïve about real biology and the complex relationships between molecular elements.
- Therefore, we should be critical about what we can achieve with such tools.
- So, again: be critical, and understand the biology!

Today we speak about similarities and differences

A trip to the farm

- What is similar? What is different? Give me the criteria!
- Accumulated observations help to detect subtle similarities/differences.
- Hey! Donkeys have longer ears than horses!

A trip in the air

- Why can they fly? Wings! What else!
- Accumulated observations help to detect subtle similarities/differences, and function.
- Hey! Some of them have an engine!

Functional annotation of sequences

- Use similarities and differences to infer which residues, motifs, or domains are responsible for a particular function.
- Accumulated observations help in identifying such functional regions.
- Ultimately, experimental evidence should be used to label functional residues.

One more thing before the serious stuff: let’s play Lego!

Proteins and Lego

- Most proteins are modular and/or contain specific motifs. As with Lego, you can use the same brick in different constructions.
- We can use these modules and motifs to build specific signatures that will be used to classify proteins and infer their function.
- So, what are such sequence signatures?

General definitions of conserved sequence signatures

- Conserved regions in biological sequences can be classified into 5 different groups:
  - Domains: specific combinations of secondary structures organized into a characteristic three-dimensional structure or fold.
  - Families: groups of proteins that have the same domain arrangement or that are conserved along the whole sequence.
  - Repeats: structural units always found in two or more copies that assemble in a specific fold. Assemblies of repeats might also be thought of as domains.
  - Motifs: regions of domains containing conserved active or binding residues, or short conserved regions present outside domains that may adopt a folded conformation only in association with their binding ligands.
  - Sites: functional residues (active sites, disulfide bridges, post-translationally modified residues).

Example of conserved regions

PPID family: 1 CSA PPIASE (cyclophilin-type peptidyl-prolyl cis-trans isomerase) + 3 TPR repeats (tetratricopeptide repeat).

PPID_BOVIN (P26882) (369 aa)

[Figure labels: Cys181: active site; binding cleft (motif)]

How do I detect conserved signatures?

By conservation! Obvious, isn’t it?

Measures of Conservation

Identity: proportion of pairs of identical residues between two aligned sequences, generally expressed as a percentage. This value depends on how the two sequences are aligned.

Similarity: proportion of pairs of similar residues between two aligned sequences. Whether two residues are similar is determined by a substitution matrix (e.g. BLOSUM62), so this value depends strongly on the scoring system used.

Homology: THIS IS NOT A MEASURE OF CONSERVATION AND THERE IS NO PERCENTAGE OF HOMOLOGY! (It’s either yes or no.) Two sequences are homologous if and only if they have a common ancestor. Homologous sequences do not necessarily serve the same function, nor are they always highly similar: structure may be conserved while sequence is not.
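The identity and similarity measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SIMILAR_GROUPS` table is a hand-picked stand-in for a real substitution matrix such as BLOSUM62, and the denominator (aligned columns without gaps) is one convention among several. Changing either changes the percentages, which is the point made above.

```python
# Hypothetical similarity groups, for illustration only. A real analysis would
# call two residues "similar" when their substitution-matrix score is positive.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def similar(a, b):
    """True if the residues are identical or share one of the groups above."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(row1, row2):
    """Return (% identity, % similarity) over columns where neither row has a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0.0
    identical = sum(a == b for a, b in pairs)
    alike = sum(similar(a, b) for a, b in pairs)
    return 100.0 * identical / len(pairs), 100.0 * alike / len(pairs)


# Toy aligned pair: 6 gap-free columns, 4 identical, 5 similar.
ident, sim = identity_and_similarity("WFHG-SW", "WYHGEIW")
print(f"identity {ident:.1f}%  similarity {sim:.1f}%")  # identity 66.7%  similarity 83.3%
```

Note that no function here returns a "percentage of homology": homology is a yes/no statement about ancestry and cannot be read off an alignment this way.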

Find conservation using pairwise alignments

Detect conservation using pairwise alignments

- A popular way to identify similarities between proteins is to perform a pairwise alignment (Smith-Waterman, Needleman-Wunsch, BLAST, ...).

    seq1/1-78   1 WFHGSWTRQGAEHLL-LLKGEAGTFVLRECLSSPGQYVLSV- 40
    seq2/1-82   1 WYHGEIERSIAEGLLGQRRNNTGSFIVREALENIGAFSVTVY 42

    seq1/1-78  41 -RYIGNHK--HCIISQHDRNGQFLIEDDRACDTFGMLLQHY 78
    seq2/1-82  43 DKDISHPRVLHFRVNSNMNNG-FYIATKTCFRTIPYIIWFF 82

- Normally, when the identity is higher than 40%, this method gives good results.
- However, the weakness of pairwise alignment is that no distinction is made between an amino acid at a crucial position (like an active site) and an amino acid with no critical role.
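To make the weakness concrete, here is a minimal sketch of a Smith-Waterman local alignment. The scoring values (`match=2`, `mismatch=-1`, linear `gap=-2`) are arbitrary assumptions chosen for readability; real tools use a substitution matrix and affine gap penalties.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of s, aligned part of t)."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best local score ending at s[i-1], t[j-1]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("AAWFHGKK", "CCWFHGTT"))  # (8, 'WFHG', 'WFHG')
```

Every column is scored with the same `match` and `mismatch` values, so a substitution at an active-site residue costs exactly as much as one in a loop. Nothing in a pairwise alignment tells the algorithm which positions matter.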

Alignment of two thioredoxin domains

- Smith-Waterman alignment of two thioredoxin domains:

[Figure: Smith-Waterman alignment of two thioredoxin domains]

- Which are the important functional sites?

The super-power of MSA!

Detect conservation using MSA

- A Multiple Sequence Alignment (MSA) detects the subtle similarities and differences that distinguish a group of sequences from another similar group.

                        10        20        30        40        50        60
    A1_XENLA/1-61   QPN-TARTNFTTKQLTELEKEFHFNKYLTRARRVEIAAALQLNETQVKIWFQNRRMKQKKRE
    A1_MOUSE/1-61   QPN-AVRTNFTTKQLTELEKEFHFNKYLTRARRVEIAASLQLNETQVKIWFQNRRMKQKKRE
    A1_MONDO/1-61   QPN-TVRTNFTTKQLTELEKEFHFNKYLTRARRVEIAASLQLNETQVKIWFQNRRMKQKKRE
    A1_HETFR/1-61   QPN-TVRTNFTTKQLTELEKEFHFNKYLTRARRVEIAAALQLNETQVKIWFQNRRMKQKKRE
    A1a_BRARE/1-61  QPN-TVRTNFSTKQLTELEKEFHFNKYLTRARRVEIAASLQLNETQVKIWFQNRRMKQKKRE
    B1_HUMAN/1-61   SPS-GLRTNFTTRQLTELEKEFHFNKYLSRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1_MOUSE/1-61   APG-GLRTNFTTRQLTELEKEFHFNKYLSRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1_RAT/1-61     TPG-GLRTNFTTRQLTELEKEFHFNKYLSRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1_MONDO/1-61   PAG-GIRTNFTTRQLTELEKEFHFNKYLSRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1a_ORYLA/1-62  AHNSAMRTNFTTRQLTELEKEFHFSKYLTRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1a_TETNG/1-62  AHNSAIRTNFSTRQLTELEKEFHFSKYLTRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1a_BRARE/1-61  PQN-TIRTNFTTKQLTELEKEFHFSKYLTRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1_LATME/1-61   QQN-TIRTNFTTKQLTELEKEFHFNKYLTRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1_AMBME/1-61   QQN-SIRTNFTTKQLSELEKEFHFNKYLTRARRVEIAATLELNETQVKIWFQNRRMKQKKRE
    B1b_FUGRU/1-61  QHN-VIRTNFTTKQLTELEKEFHFNKYLTRARRVEVAASLELNETQVKIWFQNRRMKQKKRE
    B1b_ORYLA/1-61  QIN-VIRTNFTTKQLTELEKEFHFNKYLTRARRIEVAASLDLNETQVKIWFQNRRMKQKKRE
    B1b_ORENI/1-61  QHN-VIRTNFTTKQLTELEKEFHFNKYLSRARRVEVAASLELNETQVKIWFQNRRMKQKKRE
    B1b_TETNG/1-61  QHN-AIRTNFTTKQLTELEKEFHFNKYLTRARRVEVAASLELNETQVKIWFQNRRMKQKKRE
    B1b_BRARE/1-61  QQN-IIRTNFTTKQLTELEKEFHFNKYLTRARRVEVAATLELNETQVKIWFQNRRMKQKKRE
    B1_CHICK/1-61   QPN-TIRTNFTTKQLTELEKEFHFNKYLTRARRVEIAATLELNETQVKIWFQNRRMKQKKRE

- We can use these differences to build classifiers to distinguish between the sub-groups.
- We can use such similarities/differences to search for more distant homologous sequences.
- We can use such similarities/differences to infer functional sites.
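Column-wise conservation of an MSA can be sketched as follows. For each column the function reports the most frequent symbol, its frequency, and the Shannon entropy in bits (0 means fully conserved). Treating the gap character as an ordinary symbol is a simplifying assumption, and the four rows are the first ten columns of four sequences from the alignment above, truncated for brevity.

```python
import math
from collections import Counter


def column_conservation(msa):
    """Return one (consensus symbol, frequency, entropy in bits) tuple per column."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all rows of an MSA must have the same length")
    result = []
    for column in zip(*msa):
        counts = Counter(column)
        n = len(column)
        symbol, top = counts.most_common(1)[0]
        # Shannon entropy: sum of p * log2(1/p) over the observed symbols.
        entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
        result.append((symbol, top / n, entropy))
    return result


# First 10 columns of A1_XENLA, A1_MOUSE, B1_HUMAN and B1a_ORYLA.
msa = ["QPN-TARTNF", "QPN-AVRTNF", "SPS-GLRTNF", "AHNSAMRTNF"]
for position, (symbol, freq, entropy) in enumerate(column_conservation(msa), start=1):
    print(position, symbol, f"{freq:.2f}", f"{entropy:.2f}")
```

Low-entropy columns are candidates for functionally important positions, while columns that are conserved within a sub-group but differ between sub-groups are the ones a classifier would exploit.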

Can you guess which residues are important?

    CEF1_YEAST/62-106     FTEFSKEEDAQLLDLARELPNQ-WRTIADMM------ARPAQVCVERYNRLL
    CEF1_SCHPO/58-102     KTEWSREEDEKLLHLAKLLPTQ-WRTIAPIV------GRTATQCLERYQKLL
    MU152_SCHPO/55-105    RVKWTEKETNDLLRGCQIHGVGNWKKILLDERFH-----FTNRSPNDLKDRFRTIL
    O24251_PETCR/16-68    KQKWTAEEEEALKAGVKKHGMGKWKTILVDPDFATA---LTHRSNIDLKDKWRNLG
    TERF1_CRIGR/380-428   RQAWLWEEDKNLRSGVRKYGEGNWSKILLHYKFN------NRTSVMLKDRWRTMK
    P70413_MOUSE/224-268  VGKYTPEEIEKLKELRIKHGND-WATIGAALG------RSASSVKDRCRLMK
    TTF1_MOUSE/590-634    KGRYNEEDTKKLKAYHSLHGND-WKKIGAMV------ARSSLSVALKFSQIG
    O64877_ARATH/47-101   TQAWGTWEELLLACAVKRHGFGDWDSVATEVRSRS-SLSHLLASANDCRHKYRDLK
    O48523_ARATH/13-66    KQTWSTWEELLLACAVHRHGTESWNSVSAEIQKLSPNLCSLTASA--CRHKYFDLK
    consensus             ...... * . .... *......

MSA Information Content

- Example: MSA reflects secondary structure.

Secondary structure elements along the alignment, left to right: alpha-helix, beta-strand, beta-strand, beta-strand, alpha-helix.

    SH2-7/1-77   WFHGEIER------GESERLLMM--EVQEGTFLIRKSDAMYPGC----YTLSHSENSV------RFKEIIISKMQRM----SVCAE-SK---HILLNEIVWVY
    SH2-19/1-78  WYYGFIKR------NEAEGLLMN--DKEDGAYLVRSSRS-DVGE----ISLSVRFDD------EIHHFRICTLTKG-----VIMKAILGDNFSDLPQLVYHY
    SH2-14/1-75  WIDGKILR------KEAEKYLSE---GKDGTFLVRDSD--KPGE----ISLALHEEK------MITPFIIHRNDDD---NYYRGEGET---FPAISELIMYY
    SH2-12/1-92  --FGRMSR------QQAEDIFRAGIGNKPGTFLVRESES-TPADGMSEYALAVRHNEPEQNSRYGKVIHHKIRRVPDYYDDGYFLKEEAK---LQHLGQLIEYY
    SH2-18/1-83  WFHGYITGV------GNEAEYQLVP---GKKGDFLVRDSSR-QEDD----FTLSVVFNDT---PNGEQIKHYHIMFLAAF---GYYVILNIE---FDTLADLISYH
    SH2-8/1-80   WILAFISR------TEVPLLLLEI-SPARGTYLVRKSS--TLGD----YTVTVRDDG------RVKHFQIQFKEDLKTPGGYIIEGPT---FCTINDLIDHQ
    SH2-6/1-81   WYFGQITG------EAEELIQKP-EGRNGKFLVRTSR--TDGE----FALSVHNDGV---LTHPDRKHFRIIEANDG---GYFIAEESS---HCSFKQLIGLY
    SH2-5/1-78   WYHPAISR------STREQQLLK--GNEEGSFLVRKSDP-RKGN----FVLTRKVGSPE--MANSCHKHYKVYRNGTK-----YYSDGK------SLAEMIRLQ
    SH2-15/1-76  WIHSAITR------DAVRMLRD----PVGKFVVRFSDT-SPGE----YTLSVVFNA------VQILNPVMINRLEEK---IYYVFTRET---FESLDDIKTHH
    SH2-20/1-74  YYHRFLYR------EEAYESLLG-----PGDFLLRESIS-KPLE----ISLSVMDDG------KVIWYRKQEVDNR---TYTRFGRKK---FRTLQYLIQHF
    SH2-1/1-83   WGHGNISG------DDAEEILQDP-RVPSGKFLVREAK--KPSF----FILPVKYDDR----ELSTVKHFKVKTDANG---GYYLTLGPQV-GLDEITELVQYY
    SH2-10/1-74  LFHGFANR------TAIAEARLQN---TMYGGYLVRESE--SPGE----IALSLWHCS------SVKW-RIYTNENG---NLVIYS-LF---FSTLSQFVYHY
    SH2-2/1-86   WFWGKVSGRT----KGNSKAETQLND--GGRDGSFIVRDSAT-RPGD----IAFSLRTDGD----RGEEVNHCKVTPMDNG---KYYVEMNDR---FNTIQELIEVY
    SH2-17/1-77  WFMKFITW------KEAEECLMDR-EQRDGLFVIRESSQ-HPNA----FSISVREFG------SVGHIVVRYDNRG----IIITDNTV---NCHLGELIHFY
    SH2-11/1-80  FFAGDLGK------LASYRLAT--ARPPGTFLVRLSDN-STGD----ITVSVVDWGQ---KRNPKVKQYLILEECNG----VFGIGREY---FDEPQALVHGY
    SH2-4/1-74   LYSGKVST------AYVEMLLKT-----TGTFLVRESDS-SEGS----FTLSVRYQS------EVQHYIIDKQDGG---KYMLDRSRR---HGSLLEIVNHY
    SH2-3/1-74   WFAGDITR------ELVENSLML---EKTGDFLLRQSE--APGS----YVLYWLDIS------VVKHYLIKNEQNC----YYMTTGIR---FSSLPLLVMDY
    SH2-16/1-80  WFHGEISRGGPCIEDKPPEAEDRLLP---NKQGIYLVRKRET-EEGQ----YTLTLVTKN------NHSHVIIGFSETG----YFCTGKI------LQDLVSHY
    SH2-9/1-77   WYIPALDR------KQAEELLLYS-GQHQGQLIVRPSEH-EQGH----FALSVRSGSP------RVKHIVIQSDEHR-----IRNGGET---FSSLEELVEVD
    SH2-13/1-81  WFSGQVTR------QDAATLLQS--GGEEGSFLVRESDS-HQGV----FSLSVLEQRD---SKKSKVHHILVQSAED----QVLISERKK---FDGLFDLITHY

Remember the pairwise alignment of two thioredoxin domains?

The alignment of a thioredoxin domain using an MSA model

- MSA models make it possible to identify important residues. These are the residues we should align correctly!

Today’s course is about models of multiple sequence alignment

- Why do we need models of MSA?
  - to summarize in a single "sequence" the differences and similarities observed in each column of the MSA;
  - to use the model to search for very distant homologs (YES! we have an idea of the variability);
  - to build models to classify very similar sequences (YES! we know what is different and what is similar);
  - to align important residues correctly (YES! we know which residues are the most conserved);
  - to detect variations in active sites and other important regions of a protein (e.g. detect whether some SNPs affect an enzymatic function);
  - to realign sequences to an MSA very, very, very fast! (e.g. phylogenetic trees in epidemiology);
  - to build databases of MSA models describing protein families, domains, motifs, etc., that can be used to annotate new proteomes;
  - ...
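One simple kind of MSA model is a position-specific scoring matrix (profile). The sketch below is not how any particular signature database builds its models. It assumes a uniform background frequency, a flat pseudocount, and gap-free scoring, all chosen to keep the example short. The function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(msa, pseudocount=1.0):
    """One dict of log-odds scores (bits) per alignment column; gaps are ignored."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    profile = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, segment):
    """Sum of per-column scores; unknown symbols contribute 0 (an assumption)."""
    if len(segment) != len(profile):
        raise ValueError("segment and profile must have the same length")
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, segment))


def best_window(profile, sequence):
    """Return (score, start) of the best-scoring ungapped window in the sequence."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max((score(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


profile = build_profile(["RTNFT", "RTNFS", "RTNFT"])
print(best_window(profile, "GGRTNFTGG"))
```

Because each column carries its own scores, a mismatch at a strictly conserved position is penalized more than one at a variable position, which is what a single pairwise alignment cannot express.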

Methods to Build Models of MSA

Consensus: Consensus, Patterns
Profile: Position Specific Scoring Matrices (PSSMs), PSI-BLAST, Generalized Profiles, Hidden Markov Models (HMMs).

Patterns ... or consensus with some variability

PROSITE patterns

I PROSITE patterns use a special syntax to describe, in a single expression, the consensus of all the sequences present in the multiple alignment.
I Used to describe small functional regions:

I Enzyme catalytic sites;
I Prosthetic group attachment sites (heme, PLP, biotin, etc.);
I Amino acids involved in binding a metal ion;
I Cysteines involved in disulfide bonds;
I Regions involved in binding a molecule (ATP, calcium, DNA, etc.) or a protein.

I Excellent tool to annotate active sites in combination with profiles.

PROSITE patterns syntax example

I Pattern:

I Regexp: ^M.?[ST]{2}.[^V]
I Text:

I The sequence must start with a methionine,
I followed by any aa or nothing,
I followed by a serine or threonine twice,
I followed by any aa,
I followed by any aa except a valine.
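The correspondence between PROSITE syntax and a regular expression can be sketched in Python. The PROSITE-style string `<M-x(0,1)-[ST](2)-x-{V}` used below is an assumption, written to match the regular expression and the description above. `prosite_to_regex` is a hypothetical helper that only handles the elements needed here (`<`, `x`, `[..]`, `{..}`, `(n)` and `(n,m)`), not the full PROSITE syntax.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a small subset of PROSITE pattern syntax into a Python regex."""
    regex = ""
    if pattern.startswith("<"):          # '<' anchors the pattern at the N-terminus
        regex, pattern = "^", pattern[1:]
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue part from an optional repetition such as (2) or (0,1).
        m = re.fullmatch(
            r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?", element
        )
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":               # any amino acid
            regex += "."
        elif residue.startswith("{"):    # any amino acid except those listed
            regex += "[^" + residue[1:-1] + "]"
        else:                            # one residue, or a set of allowed residues
            regex += residue
        if low is not None:              # repetition count or range
            regex += "{" + low + ("," + high if high else "") + "}"
    return regex

regex = prosite_to_regex("<M-x(0,1)-[ST](2)-x-{V}")
print(regex)
print(bool(re.match(regex, "MASTPL")))   # M, any aa, S, T, any aa, not V
print(bool(re.match(regex, "MSSAV")))    # ends with V at the excluded position
```

The generated expression uses `.{0,1}` where the slide writes `.?`; the two forms are equivalent.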

How to build a PROSITE pattern

I OK to detect and annotate very conserved regions, but

I ... poor gap models

I ... all residues allowed at one position are treated as equally frequent

I ... if a residue is not listed at one position, "not yet observed" variants carrying this residue are excluded

Improve models by ... counting

Position Specific Scoring Matrix (PSSM) / Profiles

I Patterns do not account for subtle conservation/variability in the MSA columns.

I We can use a PSSM, or profile, to build more sophisticated models.

I The PSSM/profile is based on the observed frequencies of each residue at a specific position in the MSA.

I The frequency matrix is then converted into a scoring matrix (log-odds scores). The final matrix contains positive scores for expected amino acids and negative scores for unexpected ones at a given position.

I Log-odds scores are preferred to frequencies for computational reasons (multiplications become sums, which limits numerical underflow).
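As a reminder of why log-odds scores turn products into sums, the usual definition can be written as follows. Here f_i(a) stands for the (pseudo-count corrected) frequency of residue a in column i and p(a) for its background frequency; both symbols are notation introduced for this note, not taken from the slides.

```latex
s_i(a) = \log \frac{f_i(a)}{p(a)}, \qquad
S(x_1 \dots x_L) = \sum_{i=1}^{L} s_i(x_i)
                 = \log \prod_{i=1}^{L} \frac{f_i(x_i)}{p(x_i)}
```

A residue seen more often than expected by chance gets a positive score, a residue seen less often gets a negative one, and the score of a window is simply the sum of its per-position scores.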

Build a PSSM ... to make a long story short

I Sequence weighting: correct sampling bias.
I Residue counts: get the frequency of each residue at each position of the MSA.
I Pseudo-counts: avoid frequencies of 0 ⇒ avoid exclusion of residues.
I Build scoring matrix: used to score alignments.

        1    2    3    4    5    6    7
A      16   -4    6   10   -4   -4    6
C      -4   -4   -4   -4   -4   -4   -4
D      -4   -4   -4   -4   -4   -4   -4
E      -4   -4   -4   -4   -4   -4   -4
F      -4   -4   -4   -4   -4   -4   -4
G      -4   -4   -4   -4   -4   -4   -4
H      -4   -4   -4   -4   -4   -4   -4
I      -4   -4   -4   -4   -4   -4   -4
K      -4   -4   -4   -4   -4   -4   -4
L      -4   -4   -4   13    6    6   -4
M      -4   -4   -4   -4   16   -4   -4
N      -4   -4   -4   -4   -4   -4   -4
P      -4   -4    6   -4   -4    6   -4
Q      -4   -4   -4   -4   -4   -4   -4
R      -4   -4   -4   -4   -4   -4   -4
S       6   10   10   -4   -4   10    6
T      -4   13    6   -4   -4   -4   10
V      -4   -4   -4   -4   -4    6    6
W      -4   -4   -4   -4   -4   -4   -4
Y      -4   -4   -4   -4   -4   -4   -4
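The counting, pseudo-count and scoring steps listed above can be sketched in Python. This is a minimal illustration under stated assumptions: sequence weighting is skipped (every sequence counts once), the background distribution is uniform, the pseudo-count is 1 per residue, and `toy_msa` is a made-up ungapped alignment. It does not reproduce the exact numbers of the matrix shown on the slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, pseudocount=1.0, background=None):
    """Return one {residue: log-odds score} dict per MSA column."""
    # Assumption: uniform background frequencies unless others are supplied.
    background = background or {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for column in zip(*msa):                       # iterate over alignment columns
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total   # never 0 thanks to pseudo-counts
            scores[aa] = math.log2(freq / background[aa])
        pssm.append(scores)
    return pssm

toy_msa = ["ATSAM", "ATSLM", "STTAM", "ASPLM"]    # hypothetical alignment
pssm = build_pssm(toy_msa)
print(round(pssm[0]["A"], 2), round(pssm[0]["W"], 2))
```

In the first column, the frequently observed A gets a positive score, while the never observed W gets a mildly negative one instead of being excluded outright.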

PSSM Scoring a Match: the sliding window method

I Put the matrix on top of the 1st position of the sequence (the PSSM is the one shown under "Build a PSSM"):

Sequence: M A T P C M L V ...

Position    1    2    3    4    5    6    7
Residue     M    A    T    P    C    M    L
Score      -4   -4    6   -4   -4   -4   -4

I Score = -4 + (-4) + 6 + (-4) + (-4) + (-4) + (-4) = -18

I Shift to the 2nd position of the sequence (and so on ...):

Sequence: M A T P C M L V ...

Position    1    2    3    4    5    6    7
Residue     A    T    P    C    M    L    V
Score      16   13    6   -4   16    6    6

I Score = 16 + 13 + 6 + (-4) + 16 + 6 + 6 = 59
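The sliding window computation can be replayed in a few lines of Python. The matrix values are those of the PSSM shown under "Build a PSSM"; the names `NON_DEFAULT`, `PSSM` and `window_scores` are illustrative choices, not part of any tool.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WIDTH = 7

# Every cell of the matrix is -4 except the ones listed here
# (residue -> {position: score}, positions numbered from 1).
NON_DEFAULT = {
    "A": {1: 16, 3: 6, 4: 10, 7: 6},
    "L": {4: 13, 5: 6, 6: 6},
    "M": {5: 16},
    "P": {3: 6, 6: 6},
    "S": {1: 6, 2: 10, 3: 10, 6: 10, 7: 6},
    "T": {2: 13, 3: 6, 7: 10},
    "V": {6: 6, 7: 6},
}
PSSM = {
    aa: [NON_DEFAULT.get(aa, {}).get(pos, -4) for pos in range(1, WIDTH + 1)]
    for aa in AMINO_ACIDS
}

def window_scores(sequence):
    """Score every window of WIDTH residues; one score per start position."""
    return [
        sum(PSSM[aa][i] for i, aa in enumerate(sequence[start:start + WIDTH]))
        for start in range(len(sequence) - WIDTH + 1)
    ]

scores = window_scores("MATPCMLV")
print(scores)   # [-18, 59]
```

The two values are the scores obtained above for the windows starting at the 1st and the 2nd residue.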

PSSM: Interpretation of the Score

I How do I interpret the score produced by a PSSM? What is the lowest score I should accept as a true match?

I Only biological arguments tell you if a match is true or not.

I However, a statistical analysis can help us decide whether a match is statistically significant (likely a true positive) or not (likely a false positive).

I We can estimate the score distribution of a PSSM on unrelated sequences (green bars in the figure). This makes it possible to calculate the E-value: the number of matches that we expect to occur by chance with a score ≥ a given cutoff.
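One simple way to approximate such an E-value is to score shuffled copies of the database and count how often the cutoff is reached by chance. The sketch below is an empirical illustration only: `empirical_evalue` and `toy_score` are hypothetical names, the scorer is a stand-in for a real PSSM, and production tools typically fit a statistical distribution to the scores rather than counting hits directly.

```python
import random

def empirical_evalue(score_fn, database, cutoff, n_shuffles=200, seed=0):
    """Estimate the number of chance matches with score >= cutoff per database search.

    Unrelated sequences are simulated by shuffling each database sequence,
    which keeps its length and composition but destroys any real signal.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_shuffles):
        for seq in database:
            residues = list(seq)
            rng.shuffle(residues)
            if score_fn("".join(residues)) >= cutoff:
                hits += 1
    return hits / n_shuffles

def toy_score(seq):
    # Hypothetical scorer: highest number of G/H residues in any window of 3.
    return max(sum(aa in "GH" for aa in seq[i:i + 3]) for i in range(len(seq) - 2))

database = ["TSGHELVGVVG", "MATPCMLVAAA", "GHEKKGRFERG"]
low = empirical_evalue(toy_score, database, cutoff=1)
high = empirical_evalue(toy_score, database, cutoff=3)
print(low, high)
```

Raising the cutoff can only lower the estimate, which mirrors the trade-off between the low and the high cutoff: fewer chance matches, at the price of missing weaker true matches.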

[Figure: observed score distribution (frequency vs. score) for homologous and non-homologous sequences, with a low cutoff and a high cutoff marked on the score axis.]

PSSM: Fingerprints

I To overcome the gap limitation of PSSMs (missing gap model), two or more PSSMs can be used to describe long regions. Such a combination of PSSMs is called a fingerprint.

Fingerprint:

PSSM 1        PSSM 2      PSSM 3
RKLLVGAPVLL   SCLLATCVG   VRTTLQAA
RRILVAAPALL   TCILGGCRG   VRTSIIAA
RKLAAGAPVIL   SCNLGGCKA   KKSTLLLA
KKIIAGGPAII   SCQQRGCKG   VKGSSNAG
KRLLVGAPVLL   SCNNGGCVG   VKSSILAV

What is missing in PSSMs?

Gap models!

Generalized Profiles/HMMs: General Concepts

I Generalized profiles and linear Hidden Markov Models are equivalent, except that generalized profiles use scores, HMMs use probabilities.

Definitions:

I Match states contain position-specific score/probability distributions for residues.

I Deletion states contain position-specific scores/probabilities to observe a deletion.

I Insertion states contain position-specific scores/probabilities to observe an insertion.

I Transitions are associated with scores/probabilities to reach a state s_i from a state s_j.

[Figure: match states M_i and M_i+1, deletion states D_i and D_i+1, and insertion state I.]

How to Build a Generalized Profile

Multiple sequence alignment (12 columns):

seq1/1-12  GHEGVGKVVKIG
seq2/1-11  GHEKKGRFE-RG
seq3/1-7   GHEGYG-----G
seq4/1-6   GHE-EG-----A
seq5/1-7   GHELRG-----A

Counts + pseudo-counts (one column per MSA column; the State row gives the annotation of each column):

         1    2    3    4    5    6    7    8    9   10   11   12
State    M    M    M    D    M    M    I    I    I    I    I    M
A      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  2+1
C      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
D      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
E      0+1  0+1  5+1  0+1  1+1  0+1  0+1  0+1  1+1  0+1  0+1  0+1
F      0+1  0+1  0+1  0+1  0+1  0+1  0+1  1+1  0+1  0+1  0+1  0+1
G      5+1  0+1  0+1  2+1  0+1  5+1  0+1  0+1  0+1  0+1  0+1  3+1
H      0+1  5+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
I      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  1+1  0+1
K      0+1  0+1  0+1  1+1  1+1  0+1  1+1  0+1  0+1  1+1  0+1  0+1
L      0+1  0+1  0+1  1+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
M      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
N      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
P      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
Q      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
R      0+1  0+1  0+1  0+1  1+1  0+1  1+1  0+1  0+1  0+1  1+1  0+1
S      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
T      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
V      0+1  0+1  0+1  0+1  1+1  0+1  0+1  1+1  1+1  0+1  0+1  0+1
W      0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1
Y      0+1  0+1  0+1  0+1  1+1  0+1  0+1  0+1  0+1  0+1  0+1  0+1

Normalization → log-odds scores → assign scores for insertion and deletion.

Final generalized profile (7 positions: MSA columns 1-6 and 12):

          1     2     3     4     5     6     7
A      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0   3.8
C      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
D      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
E      -1.0  -1.0   6.8  -1.0   2.0  -1.0  -1.0
F      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
G       6.8  -1.0  -1.0   3.8  -1.0   6.8   5.0
H      -1.0   6.8  -1.0  -1.0  -1.0  -1.0  -1.0
I      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
K      -1.0  -1.0  -1.0   2.0   2.0  -1.0  -1.0
L      -1.0  -1.0  -1.0   2.0  -1.0  -1.0  -1.0
M      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
N      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
P      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
Q      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
R      -1.0  -1.0  -1.0  -1.0   2.0  -1.0  -1.0
S      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
T      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
V      -1.0  -1.0  -1.0  -1.0   2.0  -1.0  -1.0
W      -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0
Y      -1.0  -1.0  -1.0  -1.0   2.0  -1.0  -1.0
DEL     -10   -10   -10    -1   -10   -10   -10

INS (between positions i and i+1, for i = 1 to 6): -10  -10  -10  -10  -10  -1
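The counting step can be sketched in Python for the alignment above. The mapping from counts to scores, 3·log2(count + pseudo-count) − 1, is an inference: it reproduces, for example, the values given for G (6.8, 3.8, 5.0), but the slide does not state its normalization, so this formula and the helper names are assumptions.

```python
import math

MSA = ["GHEGVGKVVKIG",
       "GHEKKGRFE-RG",
       "GHEGYG-----G",
       "GHE-EG-----A",
       "GHELRG-----A"]
STATES = "MMMDMMIIIIIM"      # column annotation from the slide; I marks insert columns

def column_counts(column, pseudocount=1):
    """Observed count + pseudo-count for each of the 20 amino acids in one column."""
    return {aa: column.count(aa) + pseudocount for aa in "ACDEFGHIKLMNPQRSTVWY"}

def score(count):
    # Assumed scaling, inferred from the numbers on the slide (not stated there).
    return round(3 * math.log2(count) - 1, 1)

# Keep only the columns that become profile positions (everything except inserts).
match_columns = [col for col, state in zip(zip(*MSA), STATES) if state != "I"]
profile = [{aa: score(c) for aa, c in column_counts(col).items()}
           for col in match_columns]

print([p["G"] for p in profile])
```

Insertion and deletion scores are assigned separately and are not derived by this sketch.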

Generalized Profiles: Search and Align with Dynamic Programming

The sequence T S G H E L V G V V G is aligned to the generalized profile built above. Dynamic programming matrix (rows: sequence residues; columns: an initial column followed by profile positions 1 to 7):

      init     1     2     3     4     5     6     7
       0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
T      0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
S      0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
G      0.0   6.8   0.0   0.0   3.8   0.0   6.8   5.0
H      0.0   0.0  13.6   3.6   2.6   0.0   0.0   0.0
E      0.0   0.0   3.6  20.4  19.4   9.4   0.0   0.0
L      0.0   0.0   0.0  10.4  22.4  18.4   8.4   0.0
V      0.0   0.0   0.0   0.4   9.4  24.2  17.4   7.4
G      0.0   6.8   0.0   0.0   4.2  12.2  31.0  22.4
V      0.0   0.0   0.0   0.0   0.0   6.2  21.0  30.0
V      0.0   0.0   0.0   0.0   0.0   2.0  11.0  29.0
G      0.0   6.8   0.0   0.0   3.8   0.0   9.8  28.0

ALIGNMENT: score 28

Profile:         1  2  3  4  5  6  -  -  7
                 |  |  |  |  |  |        |
Sequence:  T  S  G  H  E  L  V  G  V  V  G

Align a Sequence to an HMM

Input Sequence: VGGAERCSA

Align the sequence to the model with the Viterbi algorithm (find the state path through the model that maximises the joint probability of the sequence and the path).

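Emission probabilities (excerpt; "..." marks values not shown):

State 1: P(A|1) = 0.7, P(C|1) = 0.05, P(D|1) = 0.02, ...
State 2: P(A|2) = 0.01, P(C|2) = 0.01, P(D|2) = 0.35, P(E|2) = 0.4, ...
State 3: ..., P(R|3) = 0.3, P(K|3) = 0.3, P(L|3) = 0.03, ...
State 4: ..., P(I|4) = 0.2, P(L|4) = 0.3, ...
State 5: P(A|5) = 0.01, P(C|5) = 0.9, P(D|5) = 0.02, ...

Model states:

Match states:  M1 M2 M3 M4 M5
Delete states: D1 D2 D3 D4 D5 (between BEGIN and END)
Insert states: I0 I1 I2 I3 I4 I5

Alignment:

Sequence: V G G A E R − C S A
Path:     I0 (3x) M1 M2 M3 D4 M5 I5 (2x)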
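The path reported on this slide can be read back against the input sequence with a few lines of Python. This is not the Viterbi algorithm itself, since the slide gives no transition probabilities. It is a minimal sketch that pairs each state of the reported path with the residue it emits, then multiplies the match-state emission probabilities shown above. The function name `pair_path_with_sequence` is hypothetical, and reading "state k" in P(x|k) as match state Mk is an assumption.

```python
def pair_path_with_sequence(sequence, path):
    """Pair each state of an HMM path with the residue it emits.

    Match (M) and insert (I) states each consume one residue;
    delete (D) states consume none and are shown as a gap.
    """
    pairs = []
    pos = 0
    for state in path:
        if state.startswith("D"):
            pairs.append((state, "-"))
        else:
            pairs.append((state, sequence[pos]))
            pos += 1
    if pos != len(sequence):
        raise ValueError("path does not consume the whole sequence")
    return pairs


# Path from the slide, with the insert-state loops (3x, 2x) written out.
PATH = ["I0", "I0", "I0", "M1", "M2", "M3", "D4", "M5", "I5", "I5"]
PAIRS = pair_path_with_sequence("VGGAERCSA", PATH)
ALIGNED = " ".join(residue for _, residue in PAIRS)

# Emission probabilities listed on the slide, assuming "state k" means Mk.
EMISSION = {("M1", "A"): 0.7, ("M2", "E"): 0.4, ("M3", "R"): 0.3, ("M5", "C"): 0.9}
match_emission_product = 1.0
for state, residue in PAIRS:
    if state.startswith("M"):
        match_emission_product *= EMISSION[(state, residue)]
```

`ALIGNED` reproduces the gapped sequence of the slide, and `match_emission_product` is 0.7 x 0.4 x 0.3 x 0.9. A full path probability would also need the transition and insert-state emission probabilities, which are not given.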

So many model databases ...

[Figure axis label: specificity]

PROSITE: good for functional annotation and description of domain boundaries (patterns and generalized profiles)
PRINTS: good for classification in sub-families (fingerprints)
ProDom: very exhaustive conserved cores of domains (PSI-BLAST PSSMs)
SMART: sub-cellular localization of domains (HMMs)
Pfam: large number of protein families and domain descriptors (HMMs)
PIRSF: protein families sharing the same domain composition (HMMs)
TIGRFAM: functional classification of protein families (HMMs)
PANTHER: functional classification of protein families via evolutionary evidence (HMMs)
Gene3D: structural-based classification (HMMs)
Superfamily: structural-based classification (HMMs)

So many model databases ... in InterPro :-)

InterPro

I InterPro is a consortium grouping a number of protein motif databases: PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAM, PANTHER, SCOP Superfamily, Gene3D, etc.

I InterPro aims to provide and maintain high-quality annotation.

I The database and a stand-alone package are available for running a complete InterPro analysis locally.

I http://www.ebi.ac.uk/interpro/

I ftp://ftp.ebi.ac.uk/pub/databases/interpro/

InterPro Foundations

I InterPro is not a simple collection of descriptors obtained from the consortium databases; it integrates and organizes such information.

InterPro Signatures Grouping

Same position, same protein hits: Prosite (100) and Pfam (100) are grouped in IPR00001.

Same position, different protein hits: Prosite (100) goes to IPR00001, Pfam (50) goes to IPR00002.

Different overlapping position, same protein hits: Prosite (100) goes to IPR00001, Pfam (100) goes to IPR00002.

Different position: Prosite (100) goes to IPR00001, Pfam goes to IPR00002.

InterPro Linking Signatures: Parent-Child

Parent-Child relationship: applies to protein families and domains.

Parent: Pfam (100), Protein kinase

Children: Prosite (75), Serine kinase; Smart (25), Tyrosine kinase

The children (Prosite Serine kinase and Smart Tyrosine kinase) have no protein hits in common.
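One way to read the kinase example is in terms of sets of protein hits. The following is a minimal sketch, assuming that the numbers in parentheses are protein-hit counts and that a child signature hits a proper subset of the proteins hit by its parent. The protein identifiers and the function name `is_parent_of` are hypothetical, and the criteria InterPro actually applies may include more than this.

```python
def is_parent_of(parent_hits, child_hits):
    """True if every protein hit by the child is also hit by the parent,
    and the parent hits at least one protein more (proper subset)."""
    return child_hits < parent_hits


# Hypothetical protein identifiers P1..P100 standing in for real accessions.
protein_kinase = {f"P{i}" for i in range(1, 101)}      # Pfam, 100 hits
serine_kinase = {f"P{i}" for i in range(1, 76)}        # Prosite, 75 hits
tyrosine_kinase = {f"P{i}" for i in range(76, 101)}    # Smart, 25 hits

serine_is_child = is_parent_of(protein_kinase, serine_kinase)
tyrosine_is_child = is_parent_of(protein_kinase, tyrosine_kinase)
children_disjoint = serine_kinase.isdisjoint(tyrosine_kinase)
```

With these toy sets both children are subsets of the parent, and the two children share no protein hits, as on the slide.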

InterPro Linking Signatures: Contains-Found in

Contains-Found in relationship: describes domain composition and applies to protein families and domains.

Contains: the Pfam receptor family contains the Smart N-term domain and the Prosite C-term domain.

Found in: the Smart N-term domain and the Prosite C-term domain are found in the Pfam receptor family.

Contains-Found in relationship: the container signature should cover at least 90% of the contained signature.

>= 90% covered: Pfam Contains Smart; Smart is Found in Pfam.

< 90% covered: Pfam and Smart are Overlapping.
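The 90% rule can be checked with simple interval arithmetic. The following is a minimal sketch, assuming that each signature match is an inclusive (start, end) residue range on the same protein. The function names `coverage` and `relationship` and the label "separate" for matches that do not overlap are hypothetical.

```python
def coverage(container, contained):
    """Fraction of the contained match covered by the container match.

    Both arguments are inclusive (start, end) residue positions.
    """
    c_start, c_end = container
    s_start, s_end = contained
    overlap = min(c_end, s_end) - max(c_start, s_start) + 1
    length = s_end - s_start + 1
    return max(0, overlap) / length


def relationship(container, contained, threshold=0.9):
    """Classify two matches using the 90% coverage rule."""
    fraction = coverage(container, contained)
    if fraction >= threshold:
        return "contains"
    if fraction > 0:
        return "overlapping"
    return "separate"  # hypothetical label: the matches do not overlap
```

For example, `relationship((1, 300), (50, 150))` gives "contains" because the second match lies entirely inside the first, while `relationship((1, 100), (60, 140))` gives "overlapping" because only 41 of its 81 residues are covered.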

InterPro Linking Signatures: Evolutionary Context

I The classification criteria used by the different databases in the consortium can be used to extrapolate evolutionary information.

Signature criteria     Signatures                 InterPro relationship
Structural family      Gene3D                     Grandparent
Sequence family        Pfam (2 signatures)        Parent
Functional family      TIGRFAM (4 signatures)     Children

Anatomy of an InterPro Entry: Overview

Anatomy of an InterPro Entry: Proteins Matched

Anatomy of an InterPro Entry: Domain Organization

Anatomy of an InterPro Entry: Pathways and Interactions

Anatomy of an InterPro Entry: Species

Anatomy of an InterPro Entry: Structures

Anatomy of an InterPro Entry: Related Resources

Anatomy of an InterPro Entry: References

InterPro Scan

I InterPro Scan lets you search for motifs in your sequence.

I You can search against all the consortium motif databases in one click, or just choose some of them.

I The InterPro Scan service also searches for trans-membrane regions and signal peptides (short peptide chains that direct the transport of a protein).

I You can download InterPro Scan and run it on your machine ... better have a cluster ;-)

I http://www.ebi.ac.uk/Tools/InterProScan

I IMPORTANT: InterPro Scan uses trusted cutoffs, so you will miss weak matches. InterPro Scan is good for annotation, but not for discovery. To find weak matches, it is better to run scans directly against the individual consortium member databases.

InterPro Scan

I To scan a sequence against InterPro:

I Text search: lets you search by text, protein ID, InterPro ID, or GO term.

InterPro Scan (the old version)

I To scan a sequence against InterPro:

InterPro Scan Result

I Nota bene: InterPro uses only trusted cutoffs!

InterPro Scan Result: tab view

InterPro Scan Result: summary view

InterPro Annotation of UniProtKB/TrEMBL

[Figure: InterPro automatic annotation. TrEMBL (uncharacterized sequences) and Swiss-Prot (annotated sequences) are placed in groups of related proteins (same family or domain structure).]

I InterPro contains >16,000 entries (>50,000 signature methods).

I InterPro signatures cover 96% of UniProtKB/Swiss-Prot proteins and 79% of UniProtKB/TrEMBL proteins (>7 million proteins matched).

... don't worry, almost finished :-) ...

Summary

I A multiple sequence alignment (MSA) carries more information than pairwise alignments.

I An MSA can be modeled using various methods (pattern, PSSM, generalized profile/HMM).

I MSA models are very effective for finding distantly homologous sequences, classifying sub-families, and producing high-quality alignments.

I MSA models are stored in databases (PROSITE, PRINTS, Pfam, ...) and used to annotate sequences.

I InterPro integrates MSA models from various databases and organizes them and their annotation so that relationships emerge (e.g. parent-child relationships).
