5.1 Approximation Algorithms for Multiple Sequence Alignment P

Algorithms for Molecular Biology Fall Semester, 1998 Lecture 5: January 3, 1999 1 Lecturer: Ron Shamir Scribe: Itay Lotan and Ziv Modai 5.1 Approximation Algorithms for Multiple Sequence Alignment 5.1.1 Problem De nition Given a family S =(S ;:::;S )ofk sequences, such that the sequences are \similar" to 1 k each other, wewould like to nd out the common characteristics of this family. Aligning each pair of sequences from S separately, often do es not reveal this common information. 0 0 0 A multiple alignment of S is a new set of sequences S =(S ;:::;S ) such that: 1 k 0 All the strings in S are of equal length. We denote this length by l . 0 Each S was generated from S by inserting spaces. i i When p erforming multiple alignment, as in the case of pairwise alignment, one wishes to evaluate the quality of the alignmentby giving it a numeric score (see also lecture 3). De nition 5.1 -Thesumofpairs (SP) score of a multiple alignment M is the sum of the scores of pairwise global alignments inducedby M. Let (x; y ) b e our scoring function, i.e., the price of aligning the character x with the character y ,for x; y 2 [fg.We assume that (; )= 0, (x; y )= (y; x), and that the triangle inequality (x; y ) (x; z )+ (z; y) holds. 5.1.2 The Center Star Metho d for (SP) Alignment In this section, we present an approximation algorithm for calculating the optimal multiple alignment under the SP metric (see e.g. [3] [pp 348{350]). The algorithm achieves an approximation ratio of two. Given a multiple alignment M, let d(S ;S ) b e the score of the pairwise alignmentit i j P induces on S ;S . Our target function is the SP value of M whichisd(M) d(S ;S ) i j i j i<j (sum over all pairs of the score of the induced alignment). 1 Based in part on a scrib e by Sariel Har-Peled on Decemb er 18, 1995 and on the b o ok [3, pp 347{363] 1 c 2 Shamir: Algorithms for Molecular Biology Tel Aviv Univ., Fall '98 Problem 5.1 The SP alignment problem. INPUT: A set of sequences S . QUESTION: Compute a global multiple alignment M with minimum sum-of-pairs score. We denote by D (S; Y ) the score of the optimal alignmentbetween sequences S and Y . Figure 5.1: A generic center star for six strings, where the center string (S )is S . c 3 De nition 5.2 We say that a tree T having the elements of S as its nodes, induces a multiple alignment M(T ) over S , if for each edge (S; Y ) 2 E (T ) the value of the alignment inducedbyM on S and Y is D (S; Y ) (optimal). In such a case we write M = M(T ). P Given S =(S ;:::;S ), let S 2S denote the elementof S , for which D (S ;S )is 1 k c i c i6=c minimal. We refer to S as the center of M . c c Approximation Algorithms for Multiple Sequence Alignment 3 De nition 5.3 We de ne the center star, T to be a star treeof k nodes, with the center c node labeled S and with each of the k 1 remaining nodes labeled by a distinct sequencein c SnfS g (see gure 5.1). The multiple alignment M of S is the multiple alignment induced c c by the center star, i.e., for each v 6= c, the alignment M induces an optimal pairwise c alignment between S and S . c v The Center Star Algorithm: P D (S ;S ) and let M = fS g. 1. Find S 2S minimizing i t t t i6=t 2. Add the sequences in SnfS g to M one by one so that the alignmentofevery newly t added sequence with S is optimal. Add spaces, when needed, to all pre-aligned se- t quences. Running time analysis: k 2 1. O (n ) for step 1. 2 P k 1 2 2 0 2. O ((i n) n)=O (k n ) for step 2. (Since the worst-case length of S after the i=1 c addition of i strings is (i +1) n) Lemma 5.2 For 1 i; j k; i 6= j it holds that d(S ;S ) D (S ;S )+D (S ;S ). i j i c c j Proof: Since the triangle inequality holds for every single column of the alignmentby the de nition of the scoring scheme, it also holds for entire strings by the de nition of d. There- fore d(S ;S ) d(S ;S )+d(S ;S ). But the edges (S ;S ); (S ;S ) are edges of E (T ), thus i j i c c j i c c j c d(S ;S )= D (S ;S ). It follows that d(S ;S ) D (S ;S )+D (S ;S ). i c i c i j i c c j Let M denote the optimal alignmentofS . Let d (S ;S ) denote the value of alignment i j between S and S induced by M . i j Theorem 5.3 d(M ) 2(k 1) c < 2 d(M ) k . Proof: By Lemma 5.2 it follows that X X X 2d(M )= d(S ;S ) (D (S ;S )+D (S ;S )) = 2(k 1) D (S ;S ) (5.1) c i j i c c j c i i6=j i6=j i6=c P We de ne W to b e D (S ;S ) and we get: c i i6=c 2d(M ) 2(k 1)W (5.2) c c 4 Shamir: Algorithms for Molecular Biology Tel Aviv Univ., Fall '98 On the other hand, bythechoice of c it follows that: X X X X X 2d(M )= d (S ;S ) D (S ;S )= D (S ;S ) W = kW: (5.3) i j i j i j i i i6=j i6=j j 6=i And nally, 2(k 1)W 2(k 1) d(M ) c = : (5.4) d(M ) kW k Theorem 5.3 implies that calculating the multiple alignment of the center star pro duces a 2(k 1) times the value of the optimal multiple alignment with a value whichisatmostR = k k 4 3 alignment. For example R = , R = . 3 4 3 2 5.1.3 Multiple Alignment with Consensus In this section we lo ok at an approximation algorithm for a multiple alignment that optimizes a di erent score metrics - the consensus error. As b efore, we assume the existence of a pairwise scoring scheme satisfying the triangle inequality. De nition 5.4 Given a set of strings S , the consensus error of a string S with respect to S P is E (S )= D (S; S ). Note that S need not bein S . i S 2S i Problem 5.4 Optimal Steiner string. INPUT: A set of strings S QUESTION: Find a string S which minimizes the consensus error E (S ) over al l possible strings. The Steiner string S attempts to capture the common characteristics of the set of strings S and re ect them in a single string. We will present an approximation algorithm for the optimal Steiner string problem with worst-case approximation ratio of 2. Lemma 5.5 Let S have k strings, and assume that the scoring scheme satis es the triangle E (S ) 2 inequality. Then there exists a string S 2S such that 2 < 2 (see e.g. [3] [pp E (S ) k 349{351]). Proof: For any S 2S: X X D (S; S ) [D (S; S )+D (S ;S )] = (5.5) E (S )= i i S 2S S 6=S i i X (k 2) D (S; S )+D (S; S )+ D (S ;S )= (k 2) D (S; S )+E (S ) i S 6=S i 5.2. MULTIPLE ALIGNMENT TO A PHYLOGENETIC TREE 5 If wepick S 2S suchthatS is closest to S then: X E (S )= D (S ;S ) k D (S; S ) (5.6) i S 2S i P The center string S 2S minimizes D (S ;S ) and therefore its consensus error is c c i S 2S i smaller then the consensus error of the S (the string closest to S ). We get: (k 2) D (S ;S )+E (S ) (k 2) D (S ;S ) 2 E (S ) c c c +1=2 (5.7) E (S ) E (S ) k D (S ;S ) k c The pro of ab ove uses the lemma 5.5 and the fact that E (S ) E (S ) c It is worthwhile noting that Steiner string was de ned without alignment, and the only requirement is the distance function, that satis es the triangle inequality.We will next start discussing consensus strings that are alignment motivated. 5.1.4 Consensus Strings from Multiple Alignment De nition 5.5 Given a multiple alignment M of a set of strings S ,theconsensus character in column i of M is the character that minimizes the summed distance toitfrom al l the characters in column i.Let d(i) denote that minimum sum in column i. De nition 5.6 The consensus string S derivedfrom the alignment M is the concatenation M of the consensus characters for each column of M. P l De nition 5.7 The alignment error of S equals d(i) where l is the number of char- M i=1 acters in S M De nition 5.8 The optimal consensus multiple alignment is a multiple alignment M of an input set S whose consensus string S minimizes the alignment error.

Load more