Multiple Sequence Alignment

Multiple Sequence Alignment Biol4230 Tues, February 20, 2018 Bill Pearson [email protected] 4-2818 Pinn 6-057 Goals of today’s lecture: • Why multiple sequence alignment (MSA)? – identify conserved (functional?) positions among related sequences – input to evolutionary tree methods • MSA computational complexity – Models for MSA: tree-based, Sum-of-pairs, star – "optimal" O(Nk) (k sequences of length N) – progressive: O(k2N2) – progressive/iterative (O(k2N2) • Evaluating MSA accuracy – BALIBASE • Phylogenetic alignments – BaliPhy fasta.bioch.virginia.edu/biol4230 1 To learn more: • Pevsner Bioinformatics Chapter 6 pp 179–212 • Altschul, S. F. and Lipman, D. J. (1989) Trees, stars, and multiple biological sequence alignment. SIAM J. Appl. Math. 49:197-209. • Thompson, J. D., Plewniak, F., and Poch, O. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 27:2682-2690 • Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position- specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680. • R. C. Edgar (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113 • Suchard, M. A. and Redelings, B. D. (2006) "BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny" Bioinformatics 22:2047-2048 fasta.bioch.virginia.edu/biol4230 2 1 Overview • No multiple alignments without HOMOLOGY • Multiple sequence alignments can resolve ambiguous gaps – largely used to specify gap positions • Many popular programs build successive pair-wise alignments (progressive alignment) – Clustal-W (Clustal-Omega), T-coffee, MUSCLE • Simple progressive alignment methods fix gaps early, after which they cannot be moved • Iterative approaches required to adjust gaps • Tree-based alignments bring a more phylogenetic perspective • What are appropriate tests – alignments for trees vs alignments for structures? fasta.bioch.virginia.edu/biol4230 3 Algorithms for Pairwise Sequence Comparison Algorithm Value Scoring Gap Time calculated Matrix Penalty Required Needleman- Global arbitrary penalty/g O(n2) Needleman and Wunsch similarity ap Wunsch (1970) Sellers (global) Unity penalty/r O(n2) Sellers (1974) distance esidue 2 Smith- local Sij < 0.0 affine O(n ) Smith and Waterman similarity q+rk Waterman, 1981; Gotoh 1982 SRCHN approx. Sij < 0.0 penalty/g O(n)- Wilbur and local ap O(n2) Lipman (1983) 2 FASTA approx. Sij < 0.0 affine O(n )/K Lipman and Pearson local q+rk (1985, 1988) 2 BLASTP Sij < 0.0 multiple O(n )/K Altschul et al HSP (1990) 2 BLAST2.0 approx. Sij < 0.0 affine O(n )/K Altschul et al local q+rk (1997) 2 Local, global, and "glocal" alignments 50 100 150 200 Glutathione_S-Trfase_N Glutathione-S-Trfase_C-like Local – 26.3% id GSTT1 Glutathione_S-Trfase_NGlutathione-S-Trfase_C-like SSPA E() < 0.00024 Glutathione_S-Trfase_N Glutathione-S-Trfase_C-like 50 100 150 200 Globally similar: 50 100 150 200 Glutathione_S-Trfase_N Glutathione-S-Trfase_C-like GSTT1 Global – 20.6% id Glutathione_S-Trfase_N Glutathione-S-Trfase_C-like SSPA -7 E() < 10 Glutathione_S-Trfase_N Glutathione-S-Trfase_C-like 50 100 150 200 50 100 150 200 Glutathione_S-Trfase_N Glutathione-S-Trfase_C-like GSTT1 Local – 29.2% id Glutathione-S-Trfase_C-like EF1B E() < 9 Glutathione-S-Trfase_C-likeEF-1_beta_acid_region_eukTransl_elong_fac_EF1B_bsu/dsu 50 100 150 200 Locally similar: 50 100 150 200 Glutathione_S-Trfase_N Glutathione-S-Trfase_C-like GSTT1 Global – 15.3% id Glutathione-S-Trfase_C-like EF-1_beta_acid_region_eukTransl_elong_fac_EF1B_bsu/dsu EF1B E() < 7000 Glutathione-S-Trfase_C-like EF-1_beta_acid_region_eukTransl_elong_fac_EF1B_bsu/dsu 50 100 150 200 150 Glutathione-S-Trfase_C-like GSTT1 Glocal – 26.8% id Glutathione-S-Trfase_C-like EF2A E() < 0.02 Glutathione-S-Trfase_C-like EF-1_beta_acid_region_eukTransl_elong_fac_EF1B_bsu/dsu 50 100 150 200 fasta.bioch.virginia.edu/biol4230 5 No multiple alignments without HOMOLOGY Homologs GSTM1_HUMAN -------------------MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLPYLIDGAHKITQ GSTM3_HUMAN ---------------MSCESSMVLGYWDIRGLAHAIRLLLEFTDTSYEEKRYTCGEAPDYDRSQWLDVKFKLDLDFPNLPYLLDGKNKITQ GSTP1_HUMAN ------------------MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTV--------ETWQEGSLKASCLYGQLPKFQDGDLTLYQ GSTA1_HUMAN -----------------MAEKPKLHYFNARGRMESTRWLLAAAGVEFEEKFIKS-------AEDLDKLRNDGYLMFQQVPMVEIDGMKLVQ HPGDS_HUMAN ------------------MPNYKLTYFNMRGRAEIIRYIFAYLDIQYEDHRIEQ--------ADWP--EIKSTLPFGKIPILEVDGLTLHQ GSTT1_HUMAN -------------------MGLELYLDLLSQPCRAVYIFAKKNDIPFELRIVDLIK------GQHLSDAFAQVNPLKKVPALKDGDFTLTE GSTO1_HUMAN MSGESARSLGKGSAPPGPVPEGSIRIYSMRFCPFAERTRLVLKAKGIRHEVININ-------LKNKPEWFFKKNPFGLVPVLENSQGQLIY : . :* . : GSTM1_HUMAN SNAILCYIARKHNL----CGETEEEKIRVDILENQ---------TMDNHMQLGMICYNPEFEKLKP--KYLEELPEKLKLYSEFLGK----- GSTM3_HUMAN SNAILRYIARKHNM----CGETEEEKIRVDIIENQ---------VMDFRTQLIRLCYSSDHEKLKP--QYLEELPGQLKQFSMFLGK----- GSTP1_HUMAN SNTILRHLGRTLGL----YGKDQQEAALVDMVNDG---------VEDLRCKYISLIYT-NYEAGKD--DYVKALPGQLKPFETLLSQNQGG- GSTA1_HUMAN TRAILNYIASKYNL----YGKDIKERALIDMYIEG---------IADLGEMILLLPVCPPEEKDAK--LALIKEKIKNRYFPAFEKVLKSHG HPGDS_HUMAN SLAIARYLTKNTDL----AGNTEMEQCHVDAIVDT---------LDDFMSCFPWAEKKQDVKEQMFNELLTYNAPHLMQDLDTYLGG----- GSTT1_HUMAN SVAILLYLTRKYKVPDYWYPQDLQARARVDEYLAWQHTTLRRSCLRALWHKVMFPVFLGEPVSPQTLAATLAELDVTLQLLEDKFLQN---- GSTO1_HUMAN ESAITCEYLDEAYPGKKLLPDDPYEKACQKMILEL---------FSKVPSLVGSFIRSQNKEDYAG---LKEEFRKEFTKLEEVLTN---KK :* . Non-homologs GSTM1_HUMAN ----MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAP--DYDRSQWLNEKFKLGLDFPNLPYLIDGAHKITQSNAILCY GSTP1_HUMAN ---MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTV----------ETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRH GSTT1_HUMAN ----MGLELYLDLLSQPCRAVYIFAKKNDIPFELRIVDLIKG--------QHLSDAFAQVNPLKKVPALKDGDFTLTESVAILLY NARJ_ECO57 ---MIELVIVSRLLEYPDAALWQHQQEMFEAIAASKNLP-------------KEDAHALGIFLRDLTTMDPLDAQAQYSELFDRG DYR_BPT4 ---MIKLVFRYSPTKTVDGFNELAFG--------------------------LGDGLPWGRVKKDLQNFKARTEGTIMIMGAKTF TPIS_RABIT APSRKFFVGGNWKMNGRKKNLGELITTLNAAKVPADTEVVCAPPTAYIDFARQKLDPKIAVAAQNCYKVTNGAFTGEISPGMIKD . GSTM1_HUMAN IARKHNL----CGETEEEKIRVD--ILENQTMDNHMQLGMICYNP----EFEKLK-----PKYLEELPEKLKLYSEFLGK----- GSTP1_HUMAN LGRTLGL----YGKDQQEAALVD--MVNDGVEDLRCKYISLIYT-----NYEAGK-----DDYVKALPGQLKPFETLLSQNQGG- GSTT1_HUMAN LTRKYKVPDYWYPQDLQARARVDEYLAWQHTTLRRSCLRALWHKV----MFPVFLGEPVSPQTLAATLAELDVTLQLLEDKFLQN NARJ_ECO57 RATSLLLFEHVHGESRDRGQAMVDLLAQYEQHGLQLNSRELPDHLPLYLEYLAQLPQSEAVEGLKDIAPILALLSARLQQRESR- DYR_BPT4 QSLPTLLP--------GRSHIVVCDLARDYPVTKDGDLAHFYITWEQYITYISGG--------EIQVSSPNAPFETMLDQNSK-- TPIS_RABIT CGATWVVLG--HSERRHVFGESDELIGQKVAHALSEGLGVIACIGEKLDEREAGITEKVVFEQTKVIADNVKDWSKVVLAYEP-- : : : : fasta.bioch.virginia.edu/biol4230 6 3 Homology is confusing I: Homology defined Three(?) Ways • Proteins/genes/DNA that share a common ancestor • Specific positions/columns in a multiple sequence alignment that have a 1:1 relationship over evolutionary history – sequences are 50% homologous ??? • Specific (morphological/functional) characters that share a recent divergence (clade) – bird/bat/butterfly wings are/are not homologous fasta.bioch.virginia.edu/biol4230 7 Multiple alignments (can) place gaps A C G T +1 : match 0 -2 -4 -6 -8 -1 : mismatch -2 -2 -2 -2 -2 : gap A -2 1 -1 -1 -1 1 -4 -3 -6 -5 -8 -7 -10 A:A -4 -1 -3 -5 -2 1 -1 -3 -5 ACGT:ACGGT C -2 -1 1 -1 -1 -3 -1 2 -3 -2 -5 -4 -7 -6 -3 0 -2 1: ACG-T AC-GT -4 -1 2 0 -2 2: ACGGT ACGGT G -2 -1 -1 1 -1 +2 +2 -5 -3 -2 0 3 -2 -1 -4 -8 -5 -2 1 -6 -3 0 3 1 1: ACG-T AC-GT G -2 -1 -1 1 -1 2: ACGGT ACGGT -7 -5 -4 -2 1 1 2 -1 -10 -7 -4 -1 3: ACCGT ACCGT -8 -5 -2 1 2 +5 +7 T -2 -1 -1 -1 1 -9 -7 -6 -4 -3 -1 2 0 Sum of pairs score: -12 -9 -6 -3 -10 -7 -4 -1 2 1v2+1v3+2v3 fasta.bioch.virginia.edu/biol4230 8 4 Multiple alignments (can) place gaps Sum-of-pairs scoring Sum of pairs score: 1v2+1v3+2v3 +1 : match 1: ACG-T AC-GT -1 : mismatch 2: ACGGT ACGGT -2 : gap +2 +2 1: ACG-T AC-GT 1: ACG-T AC-GT 3: ACCGT ACCGT 2: ACGGT ACGGT +0 +2 3: ACCGT ACCGT 2: ACGGT ACGGT +5 +7 3: ACCGT ACCGT +3 +3 SP: +5 +7 fasta.bioch.virginia.edu/biol4230 9 Affine gap penalties: gap(x) = open+x*extend • Affine gap penalties consolidate gaps: – gap(x) = 0 + 7*x 60 70 80 90 100 GSTM1 FPNLPYLIDGAHKITQSNAILCYIARKHN--LCGETE-EEKIRVDI-LE---NQ-TMD-N : ..: : :: . .:.:: : :::.. : : . :: ::. .: :: : : GSTF1 FGQVPALQDGDLYLFESRAICKYAARKNKPELLREGNLEEAAMVDVWIEVEANQYTAALN 60 70 80 90 100 110 – gap(x) = 11 + 1*x 60 70 80 90 100 110 GSTM1 FPNLPYLIDGAHKITQSNAILCYIARKHN---LCGETEEEKIRVDI-LENQTMDNHMQLG : ..: : :: . .:.:: : :::.. : . :: ::. .: .. :. GSTF1 FGQVPALQDGDLYLFESRAICKYAARKNKPELLREGNLEEAAMVDVWIEVEANQYTAALN 60 70 80 90 100 110 fasta.bioch.virginia.edu/biol4230 10 5 The 3-Sequence Alignment Problem Pairwise alignment: O(n2) time 400x400 = 105 k-wise alignment: O(nk) time 40010 = 1026 fasta.bioch.virginia.edu/biol4230 11 The Dynamic Programming Hyperlattice http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/node2.html fasta.bioch.virginia.edu/biol4230 12 6 Efficiencies for global, close, alignments Sequence alignment (particularly multiple sequence alignment) is about placing gaps. It is trivial to align K identical length sequences without gaps. fasta.bioch.virginia.edu/biol4230 13 Trees, stars, and multiple alignment Fig. 1 SP-, tree-, and star-alignments for five, one-letter, input sequences. Pairwise alignments with cost one are indicated by solid lines, and pairwise alignments with cost zero are indicated by dotted lines. Altschul, S. F. and Lipman, D. J. (1989) SIAM J. Appl. Math. 49:197-209

Multiple Sequence Alignment

T-Coffee Documentation Release Version 13.45.47.Aba98c5

Comparative Analysis of Multiple Sequence Alignment Tools

How to Generate a Publication-Quality Multiple Sequence Alignment (Thomas Weimbs, University of California Santa Barbara, 11/2012)

"Phylogenetic Analysis of Protein Sequence Data Using The

Ple Sequence Alignment Methods: Evidence from Data

Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods

HMMER User's Guide

Determination of Optimal Parameters of MAFFT Program Based on Balibase3.0 Database

The Art of Multiple Sequence Alignment in R

Introduction to Linux for Bioinformatics – Part II Paul Stothard, 2006-09-20

HMMER User's Guide

Syntax Highlighting for Computational Biology Artem Babaian1†, Anicet Ebou2, Alyssa Fegen3, Ho Yin (Jeffrey) Kam4, German E