Multiple Sequence Alignment
Week 4, Class 2: Multiple Sequence Alignment
CS 426, Fall 2003

Me
- Paul Chew
- Email: [email protected]
- Office: 494 Rhodes
- Office Hours: MWF 11:15 to noon
- Research Area: Computational Geometry

Recall: Multiple Sequence Alignment (MSA)
- Goal is to find a common alignment of several sequences

    TFAA--LSK
    ALSA--LSD
    ALSN--LSD
    MSSMKDLSG
    ELKP--LAQ

- Provides more information than pairwise alignment
  - Can match a single protein against an entire family
- Useful to…
  - Distinguish evolutionary relationships
  - Discover important parts of a protein family

MSA is NP-Complete
- To find the best alignment we typically measure
  - Sum of pairwise distances within each column
  - Distances are measured using a scoring matrix (e.g., BLOSUM or PAM)
- Best alignment can be found using Dynamic Programming
  - But requires time $\Theta(n^k)$ for k sequences of length n
- MSA using Sum-of-Pairs is known to be NP-complete
- For an NP-complete problem
  - If a fast (polynomial time) algorithm is ever found, then all NP-complete problems have fast algorithms
- Most researchers believe that no polynomial time algorithm exists
  - Goal becomes: find a reasonable approximation to the exact solution

Recall: Center Star Method
- Choose one sequence to be the center
- Align each of the other sequences with this center sequence
- Try each sequence as the center to find the one with the least cost: $\sum_{i \ne c} D(S_c, S_i)$
- Produces an approximation whose sum-of-pairs cost is < twice optimal
  - d(.,.), the scoring matrix, must satisfy the Triangle Inequality

Phylogenetic Tree Alignment
- Phylogenetic tree = evolutionary tree
- If you know the phylogenetic tree for your sequences
  - The cost for an alignment is $\sum_{(i,j) \in E} D(S_i, S_j)$, where E is the set of edges in the tree
- Unfortunately
  - The multiple alignment is often needed to derive the phylogenetic tree
  - Usually, we don't know the sequences for the internal tree nodes
- Note that the Center Star Method is just using a particularly bad phylogenetic tree

[Figures: tree diagrams with node labels S1 to S7 and leaf labels rat, mouse, lion, cat, dog]
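The center-selection step of the Center Star Method recalled above can be sketched in Python as follows. This is only an illustration under a stated assumption: D is taken to be the unit-cost edit distance, whereas the slides measure distance with a scoring matrix such as BLOSUM or PAM. The function names `edit_distance` and `choose_center` are hypothetical, and the sketch stops at choosing the center; it does not build the multiple alignment itself.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (an assumed stand-in for D)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def choose_center(seqs: List[str]) -> Tuple[int, int]:
    """Try each sequence as the center; return (index, cost) of the cheapest.

    The cost of center c is the sum over i != c of D(S_c, S_i).
    D(S_c, S_c) = 0, so summing over all i gives the same value.
    """
    costs = [sum(edit_distance(c, s) for s in seqs) for c in seqs]
    best = min(range(len(seqs)), key=costs.__getitem__)
    return best, costs[best]


if __name__ == "__main__":
    # Toy input chosen for illustration; not taken from the slides.
    print(choose_center(["ACGT", "ACGA", "TCGT", "ACCT"]))
```

Trying every center needs all k(k-1)/2 pairwise distances, each computed by a quadratic-time dynamic program, which is far cheaper than the $\Theta(n^k)$ exact method.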
Consensus Representations
- Goal: Build a single string that somehow represents an entire set, $\mathcal{S}$, of strings
- There are 2 related ideas that are candidates for such a string
  - the Steiner string
  - the consensus string of the optimal consensus multiple alignment
- The Steiner string, S*, is the string that minimizes the consensus error
  - The consensus error for string T is $\sum_{S \in \mathcal{S}} D(S, T)$
- Note that the Steiner string is not necessarily a member of $\mathcal{S}$
- Note also that the definition of Steiner string does not depend on a multiple alignment (although a Steiner string induces a multiple alignment)

The Consensus String of a Multiple Alignment
- The consensus string $S_M$ derived from multiple alignment M is the concatenation of the consensus characters for each column of M
  - The consensus character for column i is the character that minimizes the summed distance to it from all the characters in column i

    Alignment:  A B A
                A B -
                - B A
                C A -
    Consensus:  A B A

The Optimal Consensus Multiple Alignment
- The optimal consensus multiple alignment is the alignment that minimizes the sum of the column errors
  - The column error is the sum of distances from the consensus character to each character in that column
- One can show that
  - The multiple alignment induced by the Steiner string is the same as the optimal consensus multiple alignment
  - The consensus string of the optimal consensus multiple alignment is (once spaces are removed) the same as the Steiner string
- Unfortunately, we have no efficient way to determine the Steiner string (although we can approximate within a factor of 2 using the center-star string)

How is MSA Actually Done?
- The Center Star Method
  - Produces a result with a provable bound
  - But it's not often used in practice because it doesn't work as well as other methods
- Technique that is commonly used
  - Iterative pairwise alignment (see below)
- Additional methods
  - Repeated-motif methods
  - Hidden Markov models (more on this later in the course)

Iterative Pairwise Alignment
- In simplest form
  - Add strings one at a time to a growing multiple alignment
  - The string chosen is the one closest to some string already in the multiple alignment
- This is basically
  - a Minimum Spanning Tree (when using edit distance) or
  - a Maximum Spanning Tree (when using similarity scores)
- There are lots of variations on this idea
- We are using the Minimum Spanning Tree as a way to cluster the strings
- There are many clustering methods; each one leads to a somewhat different method for multiple alignment
- For some methods we must compute the distance between a sequence and a set of sequences

Summarizing a Group of Sequences: the Profile
- For a multiple alignment of length n, a profile is a table of size $|\Sigma \cup \{-\}| \times n$
  - Each entry shows the frequency of a symbol within a column
  - $\Sigma$ is the alphabet (in our case, the 20 amino acids)

    Alignment:  A B A
                A B -
                - B A
                C A -

         Col 1   Col 2   Col 3
    A    0.50    0.25    0.50
    B    0.00    0.75    0.00
    C    0.25    0.00    0.00
    -    0.25    0.00    0.50

Aligning a String to a Profile
- Dynamic Programming can be used just as it is for pairwise comparisons
- We use a weighted sum of s-values when comparing a letter to a profile-column
  - s(.,.) is the scoring matrix used for pairwise comparisons
- Profile to profile comparisons can be done similarly
- Example (using the profile shown above)
  - Suppose s(A,A) = 2, s(A,B) = s(A,-) = -1, s(A,C) = -2
  - Then A matched to column 1 scores 0.5(2) + 0.25(-1) + 0.25(-2) = 0.25

A Multiple Alignment Package: ClustalW
- Basic outline of algorithm
  - Calculate the C(k,2) [i.e., k choose 2] pairwise alignment scores
  - Use a neighbor-joining algorithm to build a tree based on the distances
  - Distances are updated using string/string, string/profile, and profile/profile comparisons
- Actual algorithm includes many ad-hoc rules (e.g., weighting, different scoring matrices, and special gap scores)

[Figure: http://www.uib.no/aasland/chromo/chromo-tree.gif]
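The profile table and the letter-to-column score from the profile slides can be reproduced with a short Python sketch. It uses the four-row example alignment and the s-values given in the example; the function names `build_profile` and `score_letter_vs_column` are hypothetical, and the dictionary `s` holds only the four scores the slides supply, so it is not a complete scoring matrix.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_profile(alignment: List[str]) -> List[Dict[str, float]]:
    """Return one {symbol: frequency} dict per alignment column.

    Symbols that never occur in a column are simply absent,
    which is equivalent to a frequency of 0.
    """
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")
    profile = []
    for col in range(n_cols):
        counts = Counter(row[col] for row in alignment)
        profile.append({sym: cnt / n_rows for sym, cnt in counts.items()})
    return profile


def score_letter_vs_column(
    letter: str,
    column: Dict[str, float],
    s: Dict[Tuple[str, str], float],
) -> float:
    """Weighted sum of s-values: sum over symbols of freq * s(letter, symbol)."""
    return sum(freq * s[(letter, sym)] for sym, freq in column.items())


# Example alignment and s-values from the slides.
alignment = ["ABA", "AB-", "-BA", "CA-"]
s = {("A", "A"): 2, ("A", "B"): -1, ("A", "-"): -1, ("A", "C"): -2}

profile = build_profile(alignment)
score = score_letter_vs_column("A", profile[0], s)
print(profile[0])  # frequencies for column 1
print(score)       # 0.25, matching the worked example
```

Inside a full string-to-profile dynamic program, `score_letter_vs_column` would take the place of the pairwise lookup s(x, y); the rest of the recurrence stays the same as in pairwise alignment.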