BMC Bioinformatics BioMed Central
Research article Open Access Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost Shinsuke Yamada*1,2, Osamu Gotoh2,3 and Hayato Yamana1
Address: 1Department of Computer Science, Graduate School of Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan, 2Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2- 43 Aomi, Koto-ku, Tokyo 135-0064, Japan and 3Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, Japan Email: Shinsuke Yamada* - [email protected]; Osamu Gotoh - [email protected]; Hayato Yamana - [email protected] * Corresponding author
Published: 01 December 2006 Received: 20 June 2006 Accepted: 01 December 2006 BMC Bioinformatics 2006, 7:524 doi:10.1186/1471-2105-7-524 This article is available from: http://www.biomedcentral.com/1471-2105/7/524 © 2006 Yamada et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract Background: Multiple sequence alignment (MSA) is a useful tool in bioinformatics. Although many MSA algorithms have been developed, there is still room for improvement in accuracy and speed. In the alignment of a family of protein sequences, global MSA algorithms perform better than local ones in many cases, while local ones perform better than global ones when some sequences have long insertions or deletions (indels) relative to others. Many recent leading MSA algorithms have incorporated pairwise alignment information obtained from a mixture of sources into their scoring system to improve accuracy of alignment containing long indels. Results: We propose a novel group-to-group sequence alignment algorithm that uses a piecewise linear gap cost. We developed a program called PRIME, which employs our proposed algorithm to optimize the well-defined sum-of-pairs score. PRIME stands for Profile-based Randomized Iteration MEthod. We evaluated PRIME and some recent MSA programs using BAliBASE version 3.0 and PREFAB version 4.0 benchmarks. The results of benchmark tests showed that PRIME can construct accurate alignments comparable to the most accurate programs currently available, including L- INS-i of MAFFT, ProbCons, and T-Coffee. Conclusion: PRIME enables users to construct accurate alignments without having to employ pairwise alignment information. PRIME is available at http://prime.cbrc.jp/.
Background method is applicable to only a small number of Multiple sequence alignment (MSA) is a useful tool for sequences. In fact, even when a sum-of-pairs (SP) score elucidating the relationships among function, evolution, with the simplest gap cost is used as an objective function, sequence, and structure of biological macromolecules computation of optimal MSA is an NP-hard problem [4]. such as genes and proteins [1-3]. Although we can calcu- Hence, many heuristic methods have been developed. late the optimal alignment of a set of sequences by n- Almost all practical methods presently available adopt dimensional dynamic programming (DP), the DP
Page 1 of 17 (page number not for citation purposes) BMC Bioinformatics 2006, 7:524 http://www.biomedcentral.com/1471-2105/7/524
either a progressive [5-7] or an iterative [8-10] heuristic gap extension penalty with only the data structures used strategy. in the previous algorithm that were designed to detect the opening of new gaps [12,19]. Accordingly, we newly The group-to-group sequence alignment algorithm is a introduce two additional data structures: 'insertion length straightforward extension of the pairwise sequence align- profile' and 'dynamic gap information'. An insertion ment algorithm, and is the core of progressive and itera- length profile vector is associated with each column of a tive methods. The essential difference of group-to-group group of sequences, while dynamic gap information keeps sequence alignment from pairwise sequence alignment is track of information about gaps inserted into a group dur- the existence of gaps within each group of prealigned ing the DP process. Together with those used in the previ- sequences. The gap opening penalty used in an affine gap ous algorithm, gap extension penalty can be calculated cost disrupts the independence between adjacent col- efficiently. Using the proposed algorithm, we developed a umns, and hence calculating the optimal alignment program called PRIME. between two groups with respect to the SP score was shown to be NP-complete [11]. Gotoh was the first to PRIME stands for Profile-based Randomized Iteration devise a group-to-group sequence alignment algorithm MEthod. As a result of benchmark tests, the accuracy of that optimizes the SP score by using a candidate list para- our method is shown to be comparable to the most accu- digm [12]. An algorithm with a candidate list paradigm, rate methods available today, all of which incorporate similar to the branch-and-bound method, prunes the can- pairwise alignment information obtained from all-by-all didates that are dispensable for arrival at an optimal solu- pairwise alignment. This implies that the piecewise linear tion. Kececioglu and Starrett proposed another candidate- gap cost is as effective as pairwise alignment information pruning method [11]. Although these algorithms can cal- in improving the alignment accuracy of sequences, some culate the optimal alignment between two groups, they of which have long indels. require relatively extensive computational resources. Sev- eral papers have reported faster algorithms that use the Algorithms heuristic estimation of gap opening penalties [8,10]. In this section, we first review the previous group-to- group sequence alignment algorithm with an affine gap Several studies have discussed the tendency that global cost [12,19], and then describe a novel one with a piece- alignment methods perform better than local ones wise linear gap cost. The final subsection outlines a dou- [13,14]. However, the opposite is also true when some bly nested randomized iterative strategy with which our sequences to be aligned have long insertions or deletions proposed algorithm is integrated. (indels). One reason for this tendency is that almost all group-to-group sequence alignment algorithms use an aff- The definitions of symbols are as follows. Let Σ be the res- ine-like gap cost that over-penalizes long indels. To allevi- idue set and |Σ|, the number of elements in Σ. Σ* denotes ate this problem, several methods have combined pairwise global and local alignments, or incorporated the set containing a null and each element in Σ. A null consistency information among pairwise alignments means that a residue of one sequence does not aligned [6,15,16]. Another strategy to prevent over-penalizing with that of another sequence when aligning sequences, long indels is to use a concave function as the gap cost. It and is denoted by the symbol '-'. A and B denote prea- is relatively easy to choose a concave gap cost that does ligned groups of sequences. A includes m rows, and B, n not over-penalize long indels, and several pairwise rows. The respective lengths of A and B are I and J. A , a , sequence alignment algorithms using this gap cost have p i and a denote the p-th row of A, the i-th column of A and been developed [17,18]. However, there have been few p,i attempts to incorporate this gap cost into a group-to- the i-th residue of Ap, respectively. Bq, bj, and bq,j are group sequence alignment algorithm to develop an MSA defined similarly. Both ap,i and bq,j belong to Σ*. Note that program. any column of a group must not be a null column, which In this paper, we propose a novel group-to-group consists of nulls only. If all nulls are removed from a sequence alignment algorithm with a piecewise linear gap group, each row is an usual sequence. A run of consecutive cost [18], which is the key to a progressive or an iterative nulls in a row is called a gap. A gap length is the number refinement method. The piecewise linear gap cost [18] is of nulls constituting the gap. A segment of A that consists one of the concave functions and consists of L linear func- of consecutive columns as to at is denoted by A(s, t); A is tions. Depending on the gap length, this gap cost varies its also expressed as A(1, I). s(a, b) is a substitution score inclination, which corresponds to the gap extension pen- between residues a and b. By g(x), we mean a gap cost alty. However, in the case of group-to-group sequence alignment algorithm, it is difficult to calculate the proper function of gap length x. The pair weight between the p-th
Page 2 of 17 (page number not for citation purposes) BMC Bioinformatics 2006, 7:524 http://www.biomedcentral.com/1471-2105/7/524
sequence in A and the q-th sequence in B is wp,q. If the 2 partial alignment where ai and bj are aligned. Hij, and three-way method [20] is used to calculate the pair 3 weights, w can be factorized as w ·w , where w Hij, mean partial alignment scores where ai and bj are p,q Ap Bq Ap aligned with null columns, respectively. The recurrent and wB are the weights for the p-th sequence in A and the q equations are: q-th one in B, respectively. 0 = k () Review of previous group-to-group sequence alignment HHij,,max {}ij 1 13≤≤k algorithm with affine gap cost The previous group-to-group sequence alignment algo- 1 =+0 0 + () rithm that optimizes SP or weighted SP score with an aff- HHij,, ij−−11 G(,ab ijij ; P −−11 , ) S (, ab ij ) 2 ine gap cost is based on a two-dimensional DP method [12]. The key point of this algorithm is to exactly evaluate 2 =+−k k +− HHGPSij, max{}ij−−11,, (aaiij , ; ) (i , ) ()3 the gap opening penalties during the DP process. To k=02, explicitly consider gaps already present in each group, this algorithm introduces a gap state that denotes the number 3 =+−k k +− HHGPSij, max{}ij,,−−11 ( ,bbjij ; ) ( ,j ) ()4 of consecutive nulls up to the current position. k=03, 0 Another important feature of this algorithm is the candi- where '-' denotes a null column, and Pij−−11, is a partial date list paradigm, which is a variant of branch-and- DP path. A partial DP path is one from (0, 0) to the node bound methods. Because the calculation of gap opening of the DP matrix in question, representing a partial align- penalties depends on a previous partial DP path, simple 0 extension of pairwise sequence alignment algorithm may ment; for example, Pij, represents a partial alignment not yield globally optimal alignment between two groups 0 [12]. For rigorous calculation, not only locally optimal with the best score Hij, between A(1, i) and B(1, j)·S(ai, partial paths but also those that possibly contribute to glo- bj) is responsible for the calculation of substitution scores bally optimal alignment have to be stored at each node of and gap extension penalties: a DP matrix [11,12]. In the worst case, the number of can- didates to be stored grows exponentially with the total =⋅ () number of sequences in the two groups [11]. As discussed Swsab(,abij )∑ ∑ pq,,, (pi , q j ). 5 in some papers [8,10,12], the group-to-group sequence 1≤≤pm1≤≤qn alignment algorithm without the candidate list paradigm may suffice for good alignment. Moreover, because the If either a or b is a null, s(a, b) = -u, and if both a and b are novel group-to-group sequence alignment algorithm 0 null, s(a, b) = 0. G(ai, bj; Pij−−11, ) is the gap opening pen- described below requires roughly twice as much computa- tion time as the previous one at each DP process, we alty when ai is aligned with bj: adopted a simpler algorithm without the candidate list 0 =⋅−⋅γ 0 0 paradigm. GP(,abijij ;−−11,,,,, )∑ ∑ wvabx pq ( ) (pi , q j , pi−1,).yqj, −1 ()6 1≤≤pm1≤≤qn
Basic algorithm 0 0 xpi, −1 and yqj, −1 are the gap states for the p-th and q-th Let an affine gap cost function be g(x) = -(ux + v), where 0 u(> 0) and v(> 0) are constants called gap extension pen- rows in A(1, i - 1) and B(1, j - 1) on Pij−−11, , respectively. alty and gap opening penalty, respectively. The group-to- As mentioned above, the gap state is the length of the gap group sequence alignment algorithm with the affine gap up to the current position. γ(a, b, x, y) represents whether cost employs essentially the same recurrent relations as a gap opens with respect to a pair of rows. Specifically, if the pairwise sequence alignment algorithm [21], with a is a residue, x ≥ y, and b is a null; or if a is a null, x ≤ y, exact evaluation of gap opening and extension penalties. Like the pairwise sequence alignment algorithm, we calcu- and b is a residue; then γ(a, b, x, y) = 1. Otherwise, γ(a, b, late four variables at each node, (i, j), of the DP matrix: x, y) = 0. In order to evaluate exact gap openings for calcu- k 0 1 2 3 0 lation of each Hij, , the gap states must be updated. If ap,i Hij, , Hij, , Hij, , and Hij, . Hij, holds the best score 0 =+d k 1 2 3 1 is a null, xxpi,, pi−1 1 where d = arg max1≤k≤3 {}.Hij, among Hij, , Hij, , and Hij, at (i, j). Hij, is a score of a
Page 3 of 17 (page number not for citation purposes) BMC Bioinformatics 2006, 7:524 http://www.biomedcentral.com/1471-2105/7/524
0 However, if the gap opening penalty were calculated using Otherwise, xpi, = 0. The other gap states are calculated in the static gap states only, a gap opening would be detected a similar way. wrongly, because a1,8 is a null, b2,12 is a residue, and x1,7 =
Use of generalized profile 0 ≤ y2,11 = 0 where x1,7 and y2,11 are the static gap states of
Although equations 5 and 6 require O(mn) computa- a1,7 and b2,11, respectively. For correct detection of gap tional steps, these steps can be reduced by using a gener- opening, we need 'running gap states', each of which rep- alized profile [19] and the three-way weighting method resents the sum of the numbers of consecutive static and [20]. The idea of using the generalized profile is that the dynamic nulls up to the current position. For the example same residue types or gap states on a column are treated shown in Figure 1, the running gap state for A2 at column together. The generalized profile consists of four vectors position a7 is 11, which is composed of 7 static nulls and calculated from each column of a group: frequency, resi- 4 dynamic nulls. due profile, and two kinds of static gap profile vectors. These vectors can be obtained in advance of the DP proc- Static gap states are compactly represented as a static gap ess. The frequency and residue profile vectors are used to profile. A static gap profile is obtained by gathering static calculate S(ai, bj), while the static gap profile vectors are gap states with the same values at a column. More specif-
0 ically, the static gap profile at a column ck consists of two necessary to calculate G(ai, bj; Pij−−11, ). vectors E and S . Each element of both vectors has the ck ck With the frequency and residue profile vectors, S(ai, bj) is same form: {(g, f)} where g is the gap state of previous col- calculated by umn ck-1 and f is the weighted frequency of the occurrence of rows whose element on c is either a residue or null =⋅=⋅ k Sfppf(,abij ) ∑∑ab, r ,,rr ab,r ()7 ij ij depending on E or S , respectively. Although E r∈∈ΣΣ* r * ck ck ck itself may be used to calculate gap opening penalties, its where residue frequency f is the weighted frequency of + ai ,r accumulated form, Ec , is more convenient to reduce the occurrence of residue type r (including null) on column k + a , and residue profile pfsrt=⋅∑ (,) . Both computation time. If an element in E is (g, f+), f+ repre- i bbjj,,rrt∈Σ* ck p and f are defined in the same way. The right sents the weighted frequency of the occurrence of rows ai ,r bj ,r whose gap states are not less than g. For the example hand side of equation 7 requires O(|Σ*|), because each + shown in Figure 1, Ea , E , and Sa are {(0, wA ), (2, frequency and residue profile vector consists of |Σ*| val- 8 a8 8 4 ues. When m × n is sufficiently large, the computation w ), (7, w )}, {(0, w + w + w ), (2, w + A3 A2 A4 A3 A2 A3 time can be considerably reduced. Although S(a , b ) is i j w ), (7, w )}, and {(0, w )}, respectively. Since all A2 A2 A1 easy to calculate, the profile-based calculation of G(ai, bj; 0 gap states are different in the worst case, the total number Pij−−11, ) is somewhat complicated, because, in addition + of elements of E and Sc is at most the number of to static gaps, dynamic gaps must be considered explicitly. ck k A static gap consists of consecutive static nulls that already sequences in the group. exist in each group, while a dynamic gap denotes a run of dynamic nulls that are inserted into each group during the Like a static gap profile, a running gap profile can repre- DP process. Note that we distinguish static and dynamic sent running gap states compactly. In order to obtain a gaps for convenience of the description of our algorithm, running gap profile, another data structure called a gap while they contribute to the total alignment score in the mediation profile is required. A gap mediation profile is same way. Let us consider an example of the gap opening defined for each path in the DP process, and records the total number of dynamic nulls inserted into static gaps penalty calculation when a8 is aligned with b12 (Figure 1). that are currently open. Each element of the gap media- To keep the discussion simple, we consider A and B only. 1 2 tion profile is expressed as (s, d), where s is the length of a From simple observation, we find that a gap between A 1 static gap and d is the summed length of dynamic gaps and B2 has already opened before a8 is aligned with b12. 0 inserted within or after the static gap. Let Mij, (A) be the
Page 4 of 17 (page number not for citation purposes) BMC Bioinformatics 2006, 7:524 http://www.biomedcentral.com/1471-2105/7/524
i 12 3456 7 8 0 there exists element (s, d) in Mij−−11, (A) such that g = s, A1 −··−−∗∗·∗ ∗ − · A2 −··−−···· · − ∗ 0 then (s + 1, d) ∈ Mij, (A). In the case where bj is aligned A3 −∗∗−−∗∗∗· · − ∗ 0 A4 −∗∗−−∗∗∗∗ ∗ − ∗ with a null column, then (s, d + 1) ∈ Mij, (A) where (s, d) 0 3 is an element in M − (A) or M − (A) depending on the B1 ∗∗∗∗ · ···· · · · ij, 1 ij, 1 B ∗∗∗∗∗∗∗∗∗∗ ∗ ∗ 2 maximum operation of equation 4. If ai is aligned with a j 123456789101112 0 0 2 null column, Mij, (A) equals Mij−1, (A) or Mij−1, (A). In ExampleFigure 1 of gap extension penalty calculation 0 Example of gap extension penalty calculation. This fig- the case of Figure 1, M711, (A) = {(0, 1), (2, 1), (7, 4)} is ure shows an example of columns a8 and b12 being aligned. '*', 0 derived from M710, (A) = {(0, 0), (2, 0), (7, 3)}. The other '·', and '-' denote a residue, a static null, and a dynamic null, respectively. We assume that piecewise linear gap cost g(x) is gap mediation profile vectors are constructed in a similar way. maxk = 1,2{-(ukx + vk)} and critical gap length xc (= Q(v2 - v1)/(u1 - u )N) is 4. Gap extension penalty is u if x ≤ 4, otherwise u . 2 1 2 + + Running gap profile vectors Eˆ and Sˆ are obtained by ˆ ˆ a ai Running gap profile vectors Sa and E are {(1, wA )} and i 8 a8 1 0 {(1, w + w + w ), (3, w + w ), (11, w )}, combining gap mediation profile Mij−−11, (A), and static A2 A3 A4 A2 A3 A2 + 0 + respectively. Dynamic gap information D (A) is {(0, 1), (2, gap profiles E and Sa , respectively. For each (g, f ) in 711, ai i 2), (7, 1)}. Segment profile Fa is {(1, wA ), (3, wA + + 0 8 2 2 E , (s, d) is chosen from Mij−−11, (A) such that g = s, and ai w ), (5, w + w + w )}. Similarly, the profile vec- A3 A2 A3 A4 + + ˆ ˆ + then (s + d, f ) ∈ E . Similarly, Sa is obtained from tors of B are defined: Sˆ = {(7, w )}, Eˆ = {(0, w )}, ai i b12 B1 b B2 12 0 ˆ + ˆ Mij−−11, (A) and Sa . For example, E and Sa at the D0 (B) is empty, and F = {(1, 0), (9, w )}. In what fol- i a8 8 711, b12 B2 node (8, 12) of Figure 1 are {(1, w + w + w ), (3, lows, we consider the non-trivial calculation of the gap A2 A3 A4 extension penalty with respect to the gap of B1, the target w + w ), (11, w )} and {(1, w )}, respectively. A2 A3 A2 A1 gap. By using Sˆ and D0 (A), we find that the two b11 711, + + 0 ˆ ˆ ˆ ˆ Using running gap profile vectors E , Sa , E , and Sb , dynamic gaps specified by (2, 2) and (7, 1) in D711, (A) are ai i bj j partially and completely aligned with the target gap, respec- Equation 6 can be rewritten as: tively. Consequently, the total number of nulls aligned with null columns of dynamic gaps to be removed is 2. Therefore, 0 =− ⋅++ + GP(,abijij ;−−11, )( vSESE )(b iia ). ()8 j ai i bj the number of columns of B1 is 5(= 7 - 2). By subtracting 5 from 8 (the end position of the segment), the starting posi- S • E+ is expressed as tion of A, 3, is obtained. Then, the gap extension penalty with respect to the gap of B is w (F ·u + F ·u ) where F = ++=⋅ 1 B1 1 1 2 2 1 SEi ∑ f() spq f ( e ) w + w and F = w . Note that A is not involved in 1≤≤pS|| A2 A3 2 A4 1 + the gap extension penalty because a1,8 is a null. + where sp and eq are the p-th and q-th elements in S and E , respectively, and f(e) is the weighted frequency of element gap mediation profile of group A at the DP node (i, + e. q is chosen such that the gap state of eq is the smallest j)·M0 (A) follows a recurrent relation, the initial condi- ij, among the elements in E+ whose gap states are not less 0 + + tion of which is Mij, (A) = {(0, 0)}. We first consider the than that of sk. Calculation of S • E requires O(|S|) + |E |), 0 case where a and b are aligned. For each (g, f) in S , if and hence G(ai, bj; P −−) at most O(m + n). Using equa- i j ai ij11, tion 8 instead of equation 6 can reduce computation from
Page 5 of 17 (page number not for citation purposes) BMC Bioinformatics 2006, 7:524 http://www.biomedcentral.com/1471-2105/7/524
O(mn) to O(m + n) even in the worst case where gap states a term for gap extension penalty is added to the right hand on a column are mutually different. side of equation 8. Let us consider an example of gap
extension penalty calculation for the null on b12 aligned Novel group-to-group sequence alignment algorithm with with the residues on a (the last column of Figure 1). The piecewise linear gap cost 8 gap state at b is 8. With omission of nulls that are In this section, we describe a novel group-to-group 1,12 sequence alignment algorithm with a piecewise linear gap aligned with other nulls, the lengths of the gap on B1 rela- cost. Although this algorithm uses recurrent equations 1 tive to A2, A3, and A4 are 1, 4, and 6, respectively. Assum- to 4, the algorithms calculating S(ai, bj) and G(ai, bj; ing that the critical gap length xc = 4, we obtain the
0 respective gap extension penalties for A2, A3, and A4 as u1, Pij−−11, ) must be changed. Roughly speaking, the term for u , and u . Therefore, the gap extension penalty in ques- calculating gap extension penalties is transferred from 1 2 tion is w (F ·u + F ·u ), where F = w + w and F 0 A1 1 1 2 2 1 A2 A3 2 S(ai, bj) to G(ai, bj; Pij−−11, ). After explanation of the = w . piecewise linear gap cost, we describe these algorithms in A4 detail. This example indicates two important points for the exact Piecewise linear gap cost calculation of gap extension penalty. First, each gap length The piecewise linear gap cost consists of several linear is obtained by counting the residues on each row of a spe- functions [18]: cific range in A called 'relevant segment', A(3, 8) in this example. Second, the two nulls on B1 at columns b5 and b are aligned with two separate dynamic gaps inserted =−+ () 11 gx() max{( uxll v )} 9 into A. The first observation suggests that F and F for any 1≤≤lL 1 2 segment may be calculated before the DP process. How- where u > u (≥ 0) and v > v (> 0). When L = 1, this cost l l+1 l+1 l ever, the second observation indicates that the number of is the same as the affine gap cost. This cost could alleviate null columns of dynamic gaps aligned with static nulls on over-penalizing long indels, because the inclination of a target gap must be subtracted from the gap state of the g(x), which corresponds to a gap extension penalty, u , l target gap for correct assignment of the relevant segment; becomes small as gap length increases. In other words, without this subtraction, the relevant segment would be this cost calculates gap extension penalties based on gap assigned as A(1, 8) in the above example. length. For the sake of simplicity, we restricted our atten- tion to the case of L = 2. Then, g(x) = -(u1x + v1) if x ≤ xc or We initially consider the first point, neglecting the pres- g(x) = -(u2x + v2), otherwise, xc = Q(v2 - v1)/(u1 - u2)N is called the critical gap length. ence of dynamic gaps for a moment. Without loss of gen- erality, we assume that the residues on ai are aligned with Calculation of substitution score a null of a target gap on bj whose gap state is g. We define As mentioned above, the calculation of gap extension F (a ) as the weighted fraction of nulls on a : penalties must be separated from the calculation of the 0 i i m substitution score S(ai, bj) in order to use the piecewise Fwa()a ≡−∑ δ ( ,), where δ(a, b) = 1 if a = b, oth- 0 iAp=1 p pi, linear gap cost. Therefore, S(ai, bj) is expressed as: erwise, δ(a, b) = 0, and '-' denotes a null. F0(ai) is the same Sfppf(,ab )=⋅=⋅∑∑ . ()10 as the null component of the frequency profile at a , i.e. ijabij,, r r ab ij ,,rr i r∈∈ΣΣr F (a ) = f − . We also define F (A, i, g) as the sum of 0 i ai , 1 where pfsrt=⋅∑ (,) . Note that this equation weights of sequences {Ap} such that ap,i ≠ '-' and the bbjj,,rrt∈Σ number of residues within segment A(i - g + 1, i) aligned and the definition of residue profile vector sum over not with the target gap is less than or equal to x : Σ* but Σ. c m FAig(,,)=−−∑ w (1 δ ( a ,))(Δ aig ,,), where 1 p=1 Ap pi, p Calculation of gap extension penalty Δ(ap, i, g) = 1 if the number of residues on the p-th row of In the previous algorithm with an affine gap cost, G(ai, bj; 0 A(i - g + 1, i) is less than or equal to xc, otherwise, Δ(ap, i, Pij−−11, ) was responsible for the gap opening penalty ig−+1 g) = 0. Specifically, if ∑ ((,))1−−≤δ ax , then only; however, it must take care of the gap extension pen- ki= pk, c alty in the case of the piecewise linear gap cost. Therefore,
Page 6 of 17 (page number not for citation purposes) BMC Bioinformatics 2006, 7:524 http://www.biomedcentral.com/1471-2105/7/524
Δ(a , i, g) = 1, otherwise, Δ(a , i, g) = 0. Likewise, F (A, i, g) 0 p p 2 to Dij, (A). The other dynamic gap information lists are is defined as the sum of weights of sequences {A } such p obtained in a similar way. that ap,i ≠ '-' and the number of residues within A(i - g + 1, i) aligned with the target gap is greater than xc. Obviously, It is worth mentioning that dynamic gap information 0 0 Dij, (A) and gap mediation profile Mij, (A) contain infor- F0(ai)+ F1(A, i, g) + F2(A, i, g) = 1. (11) mation on dynamic gaps from different viewpoints. The Because there may exist gap states g and g' (0 ≤ g Page 7 of 17 (page number not for citation purposes) BMC Bioinformatics 2006, 7:524 http://www.biomedcentral.com/1471-2105/7/524 The gap extension penalty for the target gap is then 4. Iteratively refine the alignment using the phylogenetic obtained by tree and the pair weights (a) Divide the alignment into two groups based on a ran- wB{u1·F1(A, i, gˆ ) + u2·F2(A, i, gˆ )}, (12) domly chosen branch of the tree where w is the weight given to the sequence containing B (b) Align these two groups using a group-to-group the target gap. Note that step 2c in this algorithm exam- sequence alignment algorithm ines whether or not the dynamic gap in question is par- tially aligned with the target gap. For the example (c) Repeat steps 4a to 4b until no better weighted SP score considered in Figure 1, t = 0 + 8 - 7 + 1 = 2, and s = l = 1 3 3 is obtained are calculated after the first iteration of steps 2 in this algo- rithm. In the second iteration, we obtain t = 1 + 8 - 2 + 2 2 5. Repeat steps 1 to 4 until the weighted SP score of the = 9. Because t = 9 > g = 8 and t - l = 7