Sequence Alignment What Is a Multiple Alignment?
Total Page:16
File Type:pdf, Size:1020Kb
What about comparing a group of sequences (instead of a group of structures): Multiple Sequence Alignment What is a multiple alignment? a representation of a set of sequences, where equivalent residues (e.g. functional, structural) are aligned in rows or more usually columns Example: part of an alignment of SH2 domains from 14 sequences lnk_rat crk1_mouse nck_human ht16_hydat pip5_human fer_human 1ab2 1mil 1blj 1shd 1lkkA 1csy 1bfi 1gri * conserved identical residues : conserved similar residues What is a multiple alignment? conserved residues conservation profile secondary structure Multiple sequence alignments How to score an alignment of many sequences? Given M sequences Ai , we can define a score for the multiple sequence alignment as the sum of the scores of all the induced pair alignments S=Si<j S(Ai,Aj) 1>ASPTLPLSLA S 2>SS-TLPA--A = 3>SSPTLPA--A 1>ASPTLPLSLA 1>ASPTLPLSLA 2>SS-TLPA--A +S =S 2>SS-TLPA--A 3>SSPTLPA--A +S 3>SSPTLPA--A Multiple sequence alignments Can you think to other measures instead of the sum of pairwise scores? Multiple sequence alignments Can you think to other measures instead of the sum of pairwise scores? Try to MAXIMIZE the information content of all the columns (see sequence logos) I log2 20 S( p) 20 S( p) pi ln pi i1 MINIMIZE the overall entropy Ncolums 20 S p ji ln p ji j1 i1 Multiple sequence alignments Multiple sequence alignment (MSA) The algorithmic problem is to find the alignment with the maximum score Exact algorithms Algorithms based of multi-dimensional dynamic programming have been implemented. However they are too slow when many sequences have to be compared. Progressive alignments Iterative algorithms Consistency-based algorithms Multiple Alignment Construction Optimal multiple alignment example : MSA (Lipman et al. 1989, Gupta et al. 1995) Optimal multiple alignment Extension of dynamic programming for 2 sequences => N dimensions Example : alignment of 3 sequences Problem : calculation time and memory requirements Time proportional to Nk for k sequences of length N => limited to less than 10 sequences Multiple Alignment Construction Optimal multiple alignment MSA, OMA Progressive multiple alignment ClustalW (Thompson et al. NAR. 1994) ClustalX (Thompson et al. NAR. 1997) Progressive multiple alignment Idea : Progressively align pairs of sequences (or groups of sequences) Problem : Start with which sequences ? How to decide order of alignment ? first align the most closely related sequences How to measure the similarity of the sequences ? align all the sequences pairwise calculate the similarity between each pair from the alignment Progressive multiple alignments A B C D Step1: Pairwise sequence alignment: exact, all-against-all A A B B D D S=70, id=50% S=60, id=40% S=95, id=80% A B C C C D S=100, id=90% S=70, id=50% S=80, id=70% Step1: Build a similarity tree A C B D Progressive multiple alignments Step 3: Exact alignment of the most similar sequences, following the tree A B C D Step 4: Build the profile from the sub alignments) Step 5: Perfom profile-to profile alignment following the similarity tree, until comprising all the sequences A C B D Profile-to-profile score 1 The position i along the first sequence profile, it is represented by a 20-valued vector P i = 1 1 1 P i(A) P i (C) …… P i (Y) The position j along the second sequence profile, it is represented by a 20-valued vector 2 1 1 1 P j = P j(A) P j (C) …… P j (Y) The score for aligning the two positions is 20 20 1 2 Score(i, j) P i rm P j rk M rm ,rk m1 k1 where M is a matrix score (BLOSUM or PAM) The score can be used in dynamic programming procedures (Needleman- Wunsch, Smith-Waterman) “once a gap, always a gap” • Where gaps are added is a critical question • Gaps are often added to the first two (closest) sequences • To change the initial gap choices later on would be to give more weight to distantly related sequences • To maintain the initial gap choices is to trust that those gaps are most believable Problems with Progressive algorithms • Dependence of the ultimate MSA on the initial pairwise sequence alignment with the highest score • Errors in initial alignments are propagated • Gaps can proliferate, if not careful • Gaps can be amino-acid specific, so that you penalize introduction of gaps into segments that are less likely to have gaps (e.g. hydrophobic core) CLUSTALW Multiple sequence alignment of cytochromes • Download from UniProtKB the sequences of the following proteins (in FASTA format) • P99999 (human) • P00004 (horse) • P0C0X8 (Rhodobacter) • P00091 (Rhodopseudomonas) • Q93VA3 (Arabidopsis) And align with ClustalW @ http://clustalw.ddbj.nig.ac.jp/ Result summary CLUSTAL 2.1 Multiple Sequence Alignments Sequence type explicitly set to Protein Sequence format is Pearson Sequence 1: sp|P99999|CYC_HUMAN_Cytochrome_c_OS=Homo_sapiens_GN=CYCS_PE=1_SV=2 105 aa Sequence 2: sp|P00004|CYC_HORSE_Cytochrome_c_OS=Equus_caballus_GN=CYCS_PE=1_SV=2 105 aa Sequence 3: sp|P0C0X8|CYC2_RHOSH_Cytochrome_c2_OS=Rhodobacter_sphaeroides_GN=cycA_PE=1_SV=1 124 aa Sequence 4: sp|P00091|CYC22_RHOPA_Cytochrome_c2_OS=Rhodopseudomonas_palustris 139 aa Sequence 5: sp|Q93VA3|CYC6_ARATH_Cytochrome_c6__chloroplastic_OS=Arabidopsis_thaliana_ 175 aa Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 88 Sequences (1:3) Aligned. Score: 30 Sequences (1:4) Aligned. Score: 27 Sequences (1:5) Aligned. Score: 5 Sequences (2:3) Aligned. Score: 32 Sequences (2:4) Aligned. Score: 28 Sequences (2:5) Aligned. Score: 5 Sequences (3:4) Aligned. Score: 36 Sequences (3:5) Aligned. Score: 3 Sequences (4:5) Aligned. Score: 3 Guide tree file created: [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.dnd] Result summary There are 4 groups Start of Multiple Alignment Aligning... Group 1: Sequences: 2 Score:2188 Group 2: Sequences: 2 Score:1660 Group 3: Sequences: 4 Score:1071 Group 4: Delayed Alignment Score 1148 CLUSTAL-Alignment file created [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.aln] CLUSTAL 2.1 Multiple Sequence Alignments Sequence format is CLUSTAL Sequence 1: sp|P99999|CYC_HUMAN_Cytochrome 177 aa Sequence 2: sp|P00004|CYC_HORSE_Cytochrome 177 aa Sequence 3: sp|P0C0X8|CYC2_RHOSH_Cytochrom 177 aa Sequence 4: sp|P00091|CYC22_RHOPA_Cytochro 177 aa Sequence 5: sp|Q93VA3|CYC6_ARATH_Cytochrom 177 aa Bootstrap output file created: [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.phb] Guide tree ( ( sp|P99999|CYC_HUMAN:0.06190, sp|P00004|CYC_HORSE:0.05238) :0.28253, ( sp|P0C0X8|CYC2_RHOSH:0.31114, sp|P00091|CYC22_RHOPA:0.32595) :0.04416, sp|Q93VA3|CYC6_ARATH_:0.60318); The tree is visualised with http://etetoolkit.org/treeview/ Visualizing trees with etetoolkit CLUSTAL 2.1 multiple sequence alignment sp|P99999|CYC_HUMAN ------------------------MGDVEKGKKIFIMKCS------QCHT 20 sp|P00004|CYC_HORSE ------------------------MGDVEKGKKIFVQKCA------QCHT 20 sp|P0C0X8|CYC2_RHOSH -----------------------QEGDPEAGAKAFNQCQTCHVIVDDSGT 27 sp|P00091|CYC22_RHOPA ----------MVKKLLTILSIAATAGSLSIGTASAQDAKAGEAVFKQCMT 40 sp|Q93VA3|CYC6_ARATH_ MRLVLSGASSFTSNLFCSSQQVNGRGKELKNPISLNHNKDLDFLLKKLAP 50 *. sp|P99999|CYC_HUMAN VEKGGKHKTGPNLHG--LFGRKTGQAPGYS-YTAANKN---KGIIWGEDT 64 sp|P00004|CYC_HORSE VEKGGKHKTGPNLHG--LFGRKTGQAPGFT-YTDANKN---KGITWKEET 64 sp|P0C0X8|CYC2_RHOSH TIAGRNAKTGPNLYG--VVGRTAGTQADFKGYGEGMKEAGAKGLAWDEEH 75 sp|P00091|CYC22_RHOPA CHRADKNMVGPALGG--VVGRKAGTAAGFT-YSPLNHNSGEAGLVWTADN 87 sp|Q93VA3|CYC6_ARATH_ PLTAVLLAVSPICFPPESLGQTLDIQRGATLFNRACIGCHDTGGNIIQPG 100 . ..* .*:. : * sp|P99999|CYC_HUMAN LMEYLENP------------------KKYIPG----------TKMIFVGI 86 sp|P00004|CYC_HORSE LMEYLENP------------------KKYIPG----------TKMIFAGI 86 sp|P0C0X8|CYC2_RHOSH FVQYVQDPTK--------------FLKEYTGD------AKAKGKMTFK-L 104 sp|P00091|CYC22_RHOPA IINYLNDPNA--------------FLKKFLTDKGKADQAVGVTKMTFK-L 122 sp|Q93VA3|CYC6_ARATH_ ATLFTKDLERNGVDTEEEIYRVTYFGKGRMPGFG--EKCTPRGQCTFGPR 148 : :: * . : * sp|P99999|CYC_HUMAN KKKEERADLIAYLKKATNE-------- 105 sp|P00004|CYC_HORSE KKKTEREDLIAYLKKATNE-------- 105 sp|P0C0X8|CYC2_RHOSH KKEADAHNIWAYLQQVAVRP------- 124 sp|P00091|CYC22_RHOPA ANEQQRKDVVAYLATLK---------- 139 sp|Q93VA3|CYC6_ARATH_ LQDEEIKLLAEFVKFQADQGWPTVSTD 175 Very bad alignment. :. : : :: Visualizing and manipulating alignments with JalView ‘’Bootstrap’’ tree (based on distances computed from MSA) ( ( ( sp|P99999|CYC_HUMAN:0.06522, sp|P00004|CYC_HORSE:0.04907) 1000:0.24564, sp|P0C0X8|CYC2_RHOSH:0.33664) 539:0.01376, sp|P00091|CYC22_RHOPA:0.32266, NB. It is not usually a good tree for sp|Q93VA3|CYC6_ARATH_:0.53865)TRICHOTOMY; inferring phylogenetic relationships CLUSTAL 2.1 multiple sequence alignment sp|P99999|CYC_HUMAN ------------------------MGDVEKGKKIFIMKCS------QCHT 20 sp|P00004|CYC_HORSE ------------------------MGDVEKGKKIFVQKCA------QCHT 20 sp|P0C0X8|CYC2_RHOSH -----------------------QEGDPEAGAKAFNQCQTCHVIVDDSGT 27 sp|P00091|CYC22_RHOPA ----------MVKKLLTILSIAATAGSLSIGTASAQDAKAGEAVFKQCMT 40 sp|Q93VA3|CYC6_ARATH_ MRLVLSGASSFTSNLFCSSQQVNGRGKELKNPISLNHNKDLDFLLKKLAP 50 *. sp|P99999|CYC_HUMAN VEKGGKHKTGPNLHG--LFGRKTGQAPGYS-YTAANKN---KGIIWGEDT 64 sp|P00004|CYC_HORSE VEKGGKHKTGPNLHG--LFGRKTGQAPGFT-YTDANKN---KGITWKEET 64 sp|P0C0X8|CYC2_RHOSH TIAGRNAKTGPNLYG--VVGRTAGTQADFKGYGEGMKEAGAKGLAWDEEH 75 sp|P00091|CYC22_RHOPA CHRADKNMVGPALGG--VVGRKAGTAAGFT-YSPLNHNSGEAGLVWTADN 87 sp|Q93VA3|CYC6_ARATH_ PLTAVLLAVSPICFPPESLGQTLDIQRGATLFNRACIGCHDTGGNIIQPG 100 . ..* .*:. : * sp|P99999|CYC_HUMAN LMEYLENP------------------KKYIPG----------TKMIFVGI 86 sp|P00004|CYC_HORSE LMEYLENP------------------KKYIPG----------TKMIFAGI 86 sp|P0C0X8|CYC2_RHOSH FVQYVQDPTK--------------FLKEYTGD------AKAKGKMTFK-L