What about comparing a group of sequences (instead of a group of structures):

Multiple Sequence Alignment What is a multiple alignment?

 a representation of a set of sequences, where equivalent residues (e.g. functional, structural) are aligned in rows or more usually columns

Example: part of an alignment of SH2 domains from 14 sequences

lnk_rat crk1_mouse nck_human ht16_hydat pip5_human fer_human 1ab2 1mil 1blj 1shd 1lkkA 1csy 1bfi 1gri

* conserved identical residues : conserved similar residues What is a multiple alignment?

conserved residues

conservation profile secondary structure Multiple sequence alignments

How to score an alignment of many sequences?

Given M sequences Ai , we can define a score for the multiple sequence alignment as the sum of the scores of all the induced pair alignments

S=Si

1>ASPTLPLSLA S 2>SS-TLPA--A = 3>SSPTLPA--A

1>ASPTLPLSLA 1>ASPTLPLSLA 2>SS-TLPA--A +S =S 2>SS-TLPA--A 3>SSPTLPA--A +S 3>SSPTLPA--A Multiple sequence alignments

Can you think to other measures instead of the sum of pairwise scores? Multiple sequence alignments

Can you think to other measures instead of the sum of pairwise scores?

Try to MAXIMIZE the information content of all the columns (see sequence logos)

I  log2 20  S( p) 20 S( p)   pi ln pi i1

 MINIMIZE the overall entropy Ncolums 20 S    p ji ln p ji j1 i1 Multiple sequence alignments

Multiple sequence alignment (MSA) The algorithmic problem is to find the alignment with the maximum score

Exact algorithms Algorithms based of multi-dimensional dynamic programming have been implemented. However they are too slow when many sequences have to be compared.

Progressive alignments

Iterative algorithms

Consistency-based algorithms Multiple Alignment Construction

 Optimal multiple alignment example : MSA (Lipman et al. 1989, Gupta et al. 1995) Optimal multiple alignment

Extension of dynamic programming for 2 sequences => N dimensions

Example : alignment of 3 sequences

Problem : calculation time and memory requirements Time proportional to Nk for k sequences of length N => limited to less than 10 sequences Multiple Alignment Construction

 Optimal multiple alignment MSA, OMA  Progressive multiple alignment ClustalW (Thompson et al. NAR. 1994) ClustalX (Thompson et al. NAR. 1997) Progressive multiple alignment

Idea : Progressively align pairs of sequences (or groups of sequences)

Problem : Start with which sequences ? How to decide order of alignment ?  first align the most closely related sequences

How to measure the similarity of the sequences ?  align all the sequences pairwise  calculate the similarity between each pair from the alignment Progressive multiple alignments

A B C D Step1: Pairwise sequence alignment: exact, all-against-all A A B B D D S=70, id=50% S=60, id=40% S=95, id=80% A B C C C D S=100, id=90% S=70, id=50% S=80, id=70% Step1: Build a similarity tree A C B D Progressive multiple alignments

Step 3: Exact alignment of the most similar sequences, following the tree A B C D

Step 4: Build the profile from the sub alignments)

Step 5: Perfom profile-to profile alignment following the similarity tree, until comprising all the sequences A C B D Profile-to-profile score

1 The position i along the first sequence profile, it is represented by a 20-valued vector P i = 1 1 1 P i(A) P i (C) …… P i (Y) The position j along the second sequence profile, it is represented by a 20-valued vector 2 1 1 1 P j = P j(A) P j (C) …… P j (Y)

The score for aligning the two positions is

20 20 1 2 Score(i, j)   P i rm P j rk  M rm ,rk  m1 k1 where M is a matrix score (BLOSUM or PAM) The score can be used in dynamic programming procedures (Needleman- Wunsch, Smith-Waterman) “once a gap, always a gap”

• Where gaps are added is a critical question

• Gaps are often added to the first two (closest) sequences

• To change the initial gap choices later on would be to give more weight to distantly related sequences

• To maintain the initial gap choices is to trust that those gaps are most believable Problems with Progressive algorithms

• Dependence of the ultimate MSA on the initial pairwise sequence alignment with the highest score

• Errors in initial alignments are propagated

• Gaps can proliferate, if not careful

• Gaps can be amino-acid specific, so that you penalize introduction of gaps into segments that are less likely to have gaps (e.g. hydrophobic core) CLUSTALW

Multiple sequence alignment of cytochromes

• Download from UniProtKB the sequences of the following proteins (in FASTA format)

• P99999 (human) • P00004 (horse) • P0C0X8 (Rhodobacter) • P00091 (Rhodopseudomonas) • Q93VA3 (Arabidopsis)

And align with ClustalW @ http://clustalw.ddbj.nig.ac.jp/ Result summary

CLUSTAL 2.1 Multiple Sequence Alignments

Sequence type explicitly set to Protein Sequence format is Pearson Sequence 1: sp|P99999|CYC_HUMAN_Cytochrome_c_OS=Homo_sapiens_GN=CYCS_PE=1_SV=2 105 aa Sequence 2: sp|P00004|CYC_HORSE_Cytochrome_c_OS=Equus_caballus_GN=CYCS_PE=1_SV=2 105 aa Sequence 3: sp|P0C0X8|CYC2_RHOSH_Cytochrome_c2_OS=Rhodobacter_sphaeroides_GN=cycA_PE=1_SV=1 124 aa Sequence 4: sp|P00091|CYC22_RHOPA_Cytochrome_c2_OS=Rhodopseudomonas_palustris 139 aa Sequence 5: sp|Q93VA3|CYC6_ARATH_Cytochrome_c6__chloroplastic_OS=Arabidopsis_thaliana_ 175 aa Start of Pairwise alignments Aligning...

Sequences (1:2) Aligned. Score: 88 Sequences (1:3) Aligned. Score: 30 Sequences (1:4) Aligned. Score: 27 Sequences (1:5) Aligned. Score: 5 Sequences (2:3) Aligned. Score: 32 Sequences (2:4) Aligned. Score: 28 Sequences (2:5) Aligned. Score: 5 Sequences (3:4) Aligned. Score: 36 Sequences (3:5) Aligned. Score: 3 Sequences (4:5) Aligned. Score: 3 Guide tree file created: [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.dnd] Result summary

There are 4 groups Start of Multiple Alignment

Aligning... Group 1: Sequences: 2 Score:2188 Group 2: Sequences: 2 Score:1660 Group 3: Sequences: 4 Score:1071 Group 4: Delayed Alignment Score 1148

CLUSTAL-Alignment file created [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.aln]

CLUSTAL 2.1 Multiple Sequence Alignments

Sequence format is CLUSTAL Sequence 1: sp|P99999|CYC_HUMAN_Cytochrome 177 aa Sequence 2: sp|P00004|CYC_HORSE_Cytochrome 177 aa Sequence 3: sp|P0C0X8|CYC2_RHOSH_Cytochrom 177 aa Sequence 4: sp|P00091|CYC22_RHOPA_Cytochro 177 aa Sequence 5: sp|Q93VA3|CYC6_ARATH_Cytochrom 177 aa

Bootstrap output file created: [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.phb] Guide tree ( ( sp|P99999|CYC_HUMAN:0.06190, sp|P00004|CYC_HORSE:0.05238) :0.28253, ( sp|P0C0X8|CYC2_RHOSH:0.31114, sp|P00091|CYC22_RHOPA:0.32595) :0.04416, sp|Q93VA3|CYC6_ARATH_:0.60318);

The tree is visualised with http://etetoolkit.org/treeview/ Visualizing trees with etetoolkit CLUSTAL 2.1 multiple sequence alignment sp|P99999|CYC_HUMAN ------MGDVEKGKKIFIMKCS------QCHT 20 sp|P00004|CYC_HORSE ------MGDVEKGKKIFVQKCA------QCHT 20 sp|P0C0X8|CYC2_RHOSH ------QEGDPEAGAKAFNQCQTCHVIVDDSGT 27 sp|P00091|CYC22_RHOPA ------MVKKLLTILSIAATAGSLSIGTASAQDAKAGEAVFKQCMT 40 sp|Q93VA3|CYC6_ARATH_ MRLVLSGASSFTSNLFCSSQQVNGRGKELKNPISLNHNKDLDFLLKKLAP 50 *. . . . sp|P99999|CYC_HUMAN VEKGGKHKTGPNLHG--LFGRKTGQAPGYS-YTAANKN---KGIIWGEDT 64 sp|P00004|CYC_HORSE VEKGGKHKTGPNLHG--LFGRKTGQAPGFT-YTDANKN---KGITWKEET 64 sp|P0C0X8|CYC2_RHOSH TIAGRNAKTGPNLYG--VVGRTAGTQADFKGYGEGMKEAGAKGLAWDEEH 75 sp|P00091|CYC22_RHOPA CHRADKNMVGPALGG--VVGRKAGTAAGFT-YSPLNHNSGEAGLVWTADN 87 sp|Q93VA3|CYC6_ARATH_ PLTAVLLAVSPICFPPESLGQTLDIQRGATLFNRACIGCHDTGGNIIQPG 100 . ..* .*:. . . . : * sp|P99999|CYC_HUMAN LMEYLENP------KKYIPG------TKMIFVGI 86 sp|P00004|CYC_HORSE LMEYLENP------KKYIPG------TKMIFAGI 86 sp|P0C0X8|CYC2_RHOSH FVQYVQDPTK------FLKEYTGD------AKAKGKMTFK-L 104 sp|P00091|CYC22_RHOPA IINYLNDPNA------FLKKFLTDKGKADQAVGVTKMTFK-L 122 sp|Q93VA3|CYC6_ARATH_ ATLFTKDLERNGVDTEEEIYRVTYFGKGRMPGFG--EKCTPRGQCTFGPR 148 : :: * . : * sp|P99999|CYC_HUMAN KKKEERADLIAYLKKATNE------105 sp|P00004|CYC_HORSE KKKTEREDLIAYLKKATNE------105 sp|P0C0X8|CYC2_RHOSH KKEADAHNIWAYLQQVAVRP------124 sp|P00091|CYC22_RHOPA ANEQQRKDVVAYLATLK------139 sp|Q93VA3|CYC6_ARATH_ LQDEEIKLLAEFVKFQADQGWPTVSTD 175 Very bad alignment. :. : : :: Visualizing and manipulating alignments with JalView ‘’Bootstrap’’ tree (based on distances computed from MSA)

( ( ( sp|P99999|CYC_HUMAN:0.06522, sp|P00004|CYC_HORSE:0.04907) 1000:0.24564, sp|P0C0X8|CYC2_RHOSH:0.33664) 539:0.01376, sp|P00091|CYC22_RHOPA:0.32266, NB. It is not usually a good tree for sp|Q93VA3|CYC6_ARATH_:0.53865)TRICHOTOMY; inferring phylogenetic relationships CLUSTAL 2.1 multiple sequence alignment sp|P99999|CYC_HUMAN ------MGDVEKGKKIFIMKCS------QCHT 20 sp|P00004|CYC_HORSE ------MGDVEKGKKIFVQKCA------QCHT 20 sp|P0C0X8|CYC2_RHOSH ------QEGDPEAGAKAFNQCQTCHVIVDDSGT 27 sp|P00091|CYC22_RHOPA ------MVKKLLTILSIAATAGSLSIGTASAQDAKAGEAVFKQCMT 40 sp|Q93VA3|CYC6_ARATH_ MRLVLSGASSFTSNLFCSSQQVNGRGKELKNPISLNHNKDLDFLLKKLAP 50 *. . . . sp|P99999|CYC_HUMAN VEKGGKHKTGPNLHG--LFGRKTGQAPGYS-YTAANKN---KGIIWGEDT 64 sp|P00004|CYC_HORSE VEKGGKHKTGPNLHG--LFGRKTGQAPGFT-YTDANKN---KGITWKEET 64 sp|P0C0X8|CYC2_RHOSH TIAGRNAKTGPNLYG--VVGRTAGTQADFKGYGEGMKEAGAKGLAWDEEH 75 sp|P00091|CYC22_RHOPA CHRADKNMVGPALGG--VVGRKAGTAAGFT-YSPLNHNSGEAGLVWTADN 87 sp|Q93VA3|CYC6_ARATH_ PLTAVLLAVSPICFPPESLGQTLDIQRGATLFNRACIGCHDTGGNIIQPG 100 . ..* .*:. . . . : * sp|P99999|CYC_HUMAN LMEYLENP------KKYIPG------TKMIFVGI 86 sp|P00004|CYC_HORSE LMEYLENP------KKYIPG------TKMIFAGI 86 sp|P0C0X8|CYC2_RHOSH FVQYVQDPTK------FLKEYTGD------AKAKGKMTFK-L 104 sp|P00091|CYC22_RHOPA IINYLNDPNA------FLKKFLTDKGKADQAVGVTKMTFK-L 122 sp|Q93VA3|CYC6_ARATH_ ATLFTKDLERNGVDTEEEIYRVTYFGKGRMPGFG--EKCTPRGQCTFGPR 148 : :: * . : * sp|P99999|CYC_HUMAN KKKEERADLIAYLKKATNE------105 sp|P00004|CYC_HORSE KKKTEREDLIAYLKKATNE------105 sp|P0C0X8|CYC2_RHOSH KKEADAHNIWAYLQQVAVRP------124 sp|P00091|CYC22_RHOPA ANEQQRKDVVAYLATLK------139 sp|Q93VA3|CYC6_ARATH_ LQDEEIKLLAEFVKFQADQGWPTVSTD 175 Very bad alignment. :. : : :: HOWEVER: there is an error in the input. WHICH ONE? (hint: check the UniProt entries)

CLUSTAL 2.1 multiple sequence alignment After deleting the signal/transit peptides sp|P99999|CYC_HUMAN -MGDVEKGKKIFIMKCSQCHTVE------KGGKHKTGPNLHGLFGRKTG 42 sp|P00004|CYC_HORSE -MGDVEKGKKIFVQKCAQCHTVE------KGGKHKTGPNLHGLFGRKTG 42 sp|P0C0X8|CYC2_RHOSH QEGDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAG 49 sp|P00091|CYC22_RHOPA --QDAKAGEAVFK-QCMTCHR------ADKN-MVGPALGGVVGRKAG 37 sp|Q93VA3|CYC6_ARATH_ QTLDIQRGATLFNRACIGCHDTG------GNIIQPGATLFTKDLERNG 42 * : * * * ** . *. * . * sp|P99999|CYC_HUMAN QAPGYS-YTAANKN---KGIIWGEDTLMEYLENPKKYIP------77 sp|P00004|CYC_HORSE QAPGFT-YTDANKN---KGITWKEETLMEYLENPKKYIP------77 sp|P0C0X8|CYC2_RHOSH TQADFKGYGEGMKEAGAKGLAWDEEHFVQYVQDPTKFLKEYTGD----AK 95 sp|P00091|CYC22_RHOPA TAAGFT-YSPLNHNSGEAGLVWTADNIINYLNDPNAFLKKFLTDKGKADQ 86 sp|Q93VA3|CYC6_ARATH_ VDTEEEIYRVTYFGKG-RMPGFGEKCTPRGQCTFGPRLQ------80 . * : . . : sp|P99999|CYC_HUMAN --GTKMIFVGIKKKEERADLIAYLKKATNE- 105 sp|P00004|CYC_HORSE --GTKMIFAGIKKKTEREDLIAYLKKATNE- 105 sp|P0C0X8|CYC2_RHOSH AKG--KMTFKLKKEADAHNIWAYLQQVAVRP 124 sp|P00091|CYC22_RHOPA AVGVTKMTFKLANEQQRKDVVAYLATLK--- 114 sp|Q93VA3|CYC6_ARATH_ -----DEEIKLLAEFVKFQADQGWPTVSTD- 105 : : : Cytochrome C (Homo vs. Rhodopseudomonas palustris) GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEY .|...|..:|. .|..||.. .|...||.|.|..|||.|.|.|:.|...|.|.| ::|..|.:..| xDAKAGEAVFK-QCMTCHRA---DKNMVGPALAGVVGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIVPY Structural alignment: RMSD= 0,13 nm LENPKKYIP------GTKMIFVGIKKKEERADLIAYLKKAT 29% sequence identity |..|..::. .|||.| .:...::|.|.:|||.... LADPNAFLKKFLTEKGKADQAVGVTKMTF-KLANEQQRKDVVAYLATLK

Global without end-gap score: 152; 28.7% identity (63.0% similar) in 108 aa overlap 10 20 30 sp|P99 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGL :.. :. .: .:: : ... :. .:: : :. Sequence alignment: sp|P00 MVKKLLTILSIAATAGSLSIGTASAQDAKAGEAVF----KQCMTCHRADKNMVGPALGGV 29% sequence identity 10 20 30 40 50

IS IT SIMILAR TO 40 50 60 70 80 90 sp|P99 FGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERA STRUCTURAL ALIGMENT? :::.: : :..:. :.:.: ..: :....::..:. .. : ... : .. . sp|P00 VGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIINYLNDPNAFL---KKFLTDKGKADQAV COULD IT BE SAFELY 60 70 80 90 100 110 ADOPTED FOR MODELLING 100 PROCEDURES? sp|P99 DLIAYLKKATNE . . : .:: sp|P00 GVTKMTFKLANEQQRKDVVAYLATLK 120 130 portion where structural and sequence alignments coincide Cytochrome C (Homo vs. Rhodopseudomonas palustris) GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEY .|...|..:|. .|..||.. .|...||.|.|..|||.|.|.|:.|...|.|.| ::|..|.:..| xDAKAGEAVFK-QCMTCHRA---DKNMVGPALAGVVGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIVPY Structural alignment: RMSD= 0,13 nm LENPKKYIP------GTKMIFVGIKKKEERADLIAYLKKAT 29% sequence identity |..|..::. .|||.| .:...::|.|.:|||.... LADPNAFLKKFLTEKGKADQAVGVTKMTF-KLANEQQRKDVVAYLATLK

Sequence alignment extracted from the multiple sequence alignment

sp|P99999|CYC_HUMAN -MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTG 42 sp|P00091|CYC22_RHOPA --QDAKAGEAVFKQ-CMTCHRADKN---MVGPALGGVVGRKAG 37

sp|P99999|CYC_HUMAN QAPGYSYTAANKN---KGIIWGEDTLMEYLENPKKYIP------77 sp|P00091|CYC22_RHOPA TAAGFTYSPLNHNSGEAGLVWTADNIINYLNDPNAFLKKFLTDKGKADQ 86

sp|P99999|CYC_HUMAN --GTKMIFVGIKKKEERADLIAYLKKATNE- 105 sp|P00091|CYC22_RHOPA AVGVTKMTFKLANEQQRKDVVAYLATLK--- 114

IT IS MORE SIMILAR TO THE STRUCTURAL ALIGMENT.

portion where structural and sequence alignments coincide Multiple Alignment Construction

 Optimal multiple alignment MSA, OMA

 Progressive multiple alignment ClustalW, ClustalX

 Iterative multiple alignment MUSCLE, PROBCONS, PROBALIGN, DIALIGN How is it possible to improve the ClustalW results?

• The alignment is based on a guide tree computed on the basis of the pairwise distances (guide tree).

• The sequence distances computed starting from the MSA can be different (‘’phylogenetic’’ tree) How is it possible to improve the ClustalW results?

• If the trees are very different, the final MSA is somehow incoherent with respect of the procedure used to derive it.

• It is then possible to iterate the progessive alignment procedure, using the ‘’phylogenetic’’ tree as guide.

Muscle • The first guide tree is not based on pairwise aligment,but on the comparison between the vectors containing the k-mer compositions of each sequence (faster)

• Which is the dimension on the k-mer composition vector? Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput.Nucleic Acids Res. 32(5):1792-1797. http://www.ebi.ac.uk/Tools/msa/muscle/ Multiple sequence alignment of cytochromes

• Download from UniProtKB the sequences of the following proteins (in FASTA format)

• P99999 (human) • P00004 (horse) • P0C0X8 (Rhodobacter) • P00091 (Rhodopseudomonas) • Q93VA3 (Arabidopsis)

And align with MUSCLE @ http://www.ebi.ac.uk/Tools/msa/muscle/ sp|Q93VA3|CYC6_ARATH MRLVLSGASSFTSNLFCSSQQVNGRGKELKNPISLNHNKDLDFLLKKLAPPLTAVLLAVS sp|P99999|CYC_HUMAN ------MG------sp|P00004|CYC_HORSE ------MG------sp|P0C0X8|CYC2_RHOSH ------QEG------sp|P00091|CYC22_RHOPA ------MVKKLLTILSIAATAGSLSIGTASAQ------* sp|Q93VA3|CYC6_ARATH PICFPPESLGQTLDIQRGATLFNRACIGCH------DTGGNI------sp|P99999|CYC_HUMAN ------DVEKGKKIFIMKCSQCH------TVEKGGKHKTGPNLHGLFGRKTG sp|P00004|CYC_HORSE ------DVEKGKKIFVQKCAQCH------TVEKGGKHKTGPNLHGLFGRKTG sp|P0C0X8|CYC2_RHOSH ------DPEAGAKAFNQ-CQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAG sp|P00091|CYC22_RHOPA ------DAKAGEAVFKQ-CMTCH------RADKNMVGPALGGVVGRKAG * : * * * ** .* : sp|Q93VA3|CYC6_ARATH IQPGATLFTKDLER---NGVDTEEEIYRVTYFGKGRM------PGFGEKCTPRGQCTF sp|P99999|CYC_HUMAN QAPGYS-YTAANKN---KGIIWGEDTL-MEYLENPKKYI------PG------TKMIF sp|P00004|CYC_HORSE QAPGFT-YTDANKN---KGITWKEETL-MEYLENPKKYI------PG------TKMIF sp|P0C0X8|CYC2_RHOSH TQADFKGYGEGMKEAGAKGLAWDEEHF-VQYVQDPTKFL------KEYTGDAKAKGKMTF sp|P00091|CYC22_RHOPA TAAGFT-YSPLNHNSGEAGLVWTADNI-INYLNDPNAFLKKFLTDKGKADQAVGVTKMTF .. . : *: : : *. . : * sp|Q93VA3|CYC6_ARATH -GPRLQDEEIKLLAEFVKFQADQGWPTVSTD sp|P99999|CYC_HUMAN VGIKKKEERADLIAYLKKATNE------sp|P00004|CYC_HORSE AGIKKKTEREDLIAYLKKATNE------sp|P0C0X8|CYC2_RHOSH -KLKKEADAHNIWAYLQQVAVRP------sp|P00091|CYC22_RHOPA -KLANEQQRKDVVAYLATLK------: : .: * : Multiple Alignment Construction

 Optimal multiple alignment MSA, OMA

 Progressive multiple alignment ClustalW, ClustalX

 Iterative multiple alignment MUSCLE, PROBCONS, PROBALIGN, DIALIGN

 Consistency-based multiple alignment

 T-COFFEE, DiAlign, Probalign, Probcons Consistency For any multiple alignment, the induced pairwise alignments are necessarily consistent; given a multiple alignment containing three sequences x, y, and z, if position xi aligns with position zk in the projected x-z alignment and position zk aligns with yj in the projected z-y alignments, then xi must align with yj in the projected x-y alignment.

Consistency-based techniques apply this principle in reverse, using evidence from intermediate sequences to guide the pairwise alignment of x and y, such as needed during the steps of a progressive alignment. Transitive relation

In mathematics, a binary relation R over a set X is transitive if whenever an element a is related to an element b, and b is in turn related to an element c, then a is also related to c. -WikiPedia

"a,b,c Î X : (aRbÙbRc) Þ aRc

Cedric Notredame Transitive relation in alignment scene

"a,b,c Î X : (aRbÙbRc) Þ aRc

"x,y,zÎalned:(xAlnzÙ zAln y) Þ xAln y

multiple sequence alignment pairwise alignment

x x a consistency a y y

Cedric Notredame

Where are samples? Pairwise Pairwise alignments

x x y

consistency MSA y

Consistency between MSA & pairwise alignment : 0/1 How can we increase the resolution of confidence? Cedric Notredame Pairwise Pairwise alignments x a x b x d x

MSA a y y

c y e y

x x x a b d a c e y y y consistency inconsistency inconsistency

Cedric Notredame • MSA from progressive alignments can be largely inconsistent with respect to the set of pairwise alignments used to build the guide tree

• Consistency-based methods try to build the tree in a more consistent way T-Coffee (Tree-based Consistency Objective Function for alignment Evaluation)

This would be the ClustalW alignment of the four sequences.

CAT is evidently misaligned

Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17 T-Coffee

The T-Coffee strategy starts from pairwise alignments as well.

Each pair of aligned residues is associated with a weight equal to the average identity among matched residues (gapped positions are neglected).

Identity values are used instead of alignment scores.

Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17 T-Coffee In order to align sequence A and B, the three possible alignments are considered (A and B, A and B through C, A and B through D).

The weight associated to each alignment is the minimum of the weight associated to the pairwise alignments

Min (A-C=77, B-C=100)

Min (A-D=100, B-D=100)

Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17 T-Coffee The three A to B alignment are combined in an extended library or residue correspondences.

The weight of each correspondence is the sum of weights of the alignments where the correspondence is found. The A to B alignment with highest weight is then obtained with dynamic programming.

The A to B alignment then take into accounts all the possible intermediate sequences.  improved consistency

W=88 W=177

Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple The procedure is repeated for each pair. sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17 T-Coffee

The extended pairwise alignments are used to build a guide tree and progressive alignment procedure is then applied.

T-Coffee considers both global and local pairwise alignments.

It can also add supplementary information (domain, motifs….)

Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17 http://tcoffee.crg.cat/apps/tcoffee/index.html Multiple sequence alignment of cytochromes

• Download from UniProtKB the sequences of the following proteins (in FASTA format)

• P99999 (human) • P00004 (horse) • P0C0X8 (Rhodobacter) • P00091 (Rhodopseudomonas) • Q93VA3 (Arabidopsis)

And align with T-Coffee @ http://tcoffee.crg.cat/apps/tcoffee/do:regular/

How to benchmark alignment methods? BAliBASE was the first large scale benchmark specifically designed for MSA, providing high quality manually refined reference alignments based on 3D structural superpositions.

BAliBASE is divided into several reference datasets: 1- cases with small numbers of equidistant sequences, and was further subdivided by percent identity; 2- families with one or more “orphan” sequences; 3- a pair of divergent subfamilies, with less than 25% identity between the two groups; 4- sequences with large terminal extensions (N/C-terminal); 5- sequences with large internal insertions and deletions. SP score determines the extent to which the programs succeed in aligning input sequences in an MSA. It is calculated as the ratio of the sum of scores p for all pairs of residues in every column of the alignment by the sum of scores in the reference alignment; p = 1 if the pair of compared residues is aligned identically in the reference alignment, otherwise p = 0.

TC score is a binary score function which tests the ability of the programs to correctly align all sequences. The TC score is calculated considering the ratio of the sum of scores c by the number of columns in the alignment, being c = 1 if all residues in the column are aligned identically in the reference alignment, otherwise c = 0

Time of execution

Peak memory Z-scores Z-scores Z-scores Z-scores Z-scores Z-scores Results: Our results indicate that mostly the consistency-based programs Probcons, T-Coffee, Probalign and MAFFT outperformed the other programs in accuracy. Whenever sequences with large N/C terminal extensions were present in the BAliBASE suite, Probalign, MAFFT and also CLUSTAL OMEGA outperformed Probcons and T-Coffee. The drawback of these programs is that they are more memory-greedy and slower than POA, CLUSTALW, DIALIGN-TX, and MUSCLE. CLUSTALW and MUSCLE were the fastest programs, being CLUSTALW the least RAM memory demanding program.

Conclusions: Based on the results presented herein, all four programs Probcons, T-Coffee, Probalign and MAFFT are well recommended for better accuracy of multiple sequence alignments. T-Coffee and recent versions of MAFFT can deliver faster and reliable alignments, which are specially suited for larger datasets than those encountered in the BAliBASE suite, if multi-core computers are available. In fact, parallelization of alignments for multi-core computers should probably be addressed by more programs in a near future, which will certainly improve performance significantly. Exercise

Using the BLAST tool at Uniprot, retrieve all the SwissProt sequences that are similar with an E-value <0,001 to the Rhodopseudomonas cytochrome C (P00091) .

Download the sequences in Fasta format and align with ClustalW, Muscle or T-Coffee

Analyse the conserved positions in the alignments

Repeat with the Arabidopsis (Q93VA3) and the human (P99999) sequences

Compare the results, an in particular the pattern of conserved residues.