An Efficient Method for Multiple Sequence Alignment

From: ISMB-94 Proceedings. Copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Jin Kim and Sakti Pramanik
Department of Computer Science
Michigan State University
East Lansing, MI 48824-1027
{kimj,pramanik}@cps.msu.edu

Abstract

Multiple sequence alignment has been a useful method in the study of molecular evolution and sequence-structure relationships. This paper presents a new method for multiple sequence alignment based on the simulated annealing technique. Dynamic programming has been widely used to find an optimal alignment. However, dynamic programming has several limitations in obtaining an optimal alignment. It requires long computation time and cannot apply certain types of cost functions. We describe detailed mechanisms of simulated annealing for the multiple sequence alignment problem. It is shown that simulated annealing can be an effective approach to overcome the limitations of dynamic programming in the multiple sequence alignment problem.

Key words: computational complexity, dynamic programming, multiple sequence alignment, protein sequence analysis, sequence similarity.

Introduction

Multiple sequence alignment is a useful tool for the search of homology in three or more sequences. It has been helpful in the study of molecular structure, function, and evolution. Pairwise sequence comparisons have been used for sequence similarity. But motifs and other functionally important sites on a sequence may only be identified when a set of sequences are multiply aligned.

Multiple sequence alignment methods can be divided into two different types of algorithms: heuristic algorithms and exhaustive algorithms. Heuristic algorithms (Hogeweg & Hesper 1984; Johnson & Doolittle 1986; Taylor 1987; Barton & Sternberg 1987; Higgins & Sharp 1988; Corpet 1988) try to find good but not necessarily optimal alignments within a reasonable time. Most of these heuristic algorithms construct a phylogenetic tree for the alignment of the sequences or assign the sequences to a particular order. The sequences are aligned one by one according to the order. The exhaustive approach (Fredman 1984; Murata, Richardson, & Sussman 1985; Gotoh 1986) based on dynamic programming tries to compare sequences simultaneously. This approach guarantees optimal alignment. Although variations of dynamic programming have been widely used to derive optimal alignments, there are certain limitations.

One important problem in expanding dynamic programming to multiple sequence alignment is its high computational complexity. In pairwise alignment, the computational complexity of dynamic programming is $O(m \cdot n)$ where $m$, $n$ are the lengths of the sequences. But when dynamic programming is used for multiple sequence alignment, its computational complexity becomes proportional to the product of the lengths of the sequences to be aligned. Therefore, the exponential growth in computational complexity makes dynamic programming impractical for aligning more than three sequences (Fredman 1984; Murata, Richardson, & Sussman 1985; Gotoh 1986). Lipman et al. (1989) implemented the Multiple Sequence Alignment (MSA) program to align more than three sequences using dynamic programming. By confining the solution space using heuristic bounds (Carrillo & Lipman 1988), the MSA program can align four to six sequences of length 200-300 residues using rigorous bounds.

Another problem of dynamic programming is its limited ability to apply certain cost functions in multiple sequence alignment. Altschul (1989) analyzed several types of gap cost and substitution cost for multiple alignments. He pointed out that previously defined gap costs in a multiple alignment were not clearly tied to their substitution costs. He suggested a natural gap cost which was clearly related to its substitution cost. In MSA, quasi-natural gap costs were used instead of natural gap costs because natural gap costs for dynamic programming require impractically long computation time (Altschul 1989). Due to the type of gap costs used, MSA cannot guarantee producing an optimal multiple alignment in some special cases.

Several authors (Lukashin, Engelbrecht, & Brunak 1992; Ishikawa et al. 1993) have suggested simulated annealing (SA) as an alternative approach to overcome the limitations of dynamic programming. SA is a good heuristic method to solve combinatorial optimization problems (Kirkpatrick et al. 1983). Ishikawa et al. (1993) applied SA to align protein sequences with the same cost function as that used in Gotoh (1986). To reduce the long computation time, they utilized a parallel computer for faster convergence to an optimal solution, and discussed a temperature parallel algorithm which does not require any temperature scheduling. Lukashin et al. (1992) applied SA to human intron sequences with entropy as a cost function.

In this paper, we present details of SA for multiple sequence alignment. To reduce the long computation time of traditional SA, several speedup methods are suggested. The SA method for protein sequences is implemented. It is shown that SA can overcome the problem of high computational complexity and the inability to use certain types of cost functions in dynamic programming.

SA for multiple sequence alignment

Simulated annealing

Simulated annealing (SA) was introduced by Kirkpatrick et al. (1983). It is a probabilistic approach that can be used to find a global minimum of a function in combinatorial optimization problems. To apply this algorithm to an optimization problem, a state space $S = \{s_1, \ldots, s_n\}$ and a cost function $C : S \to R$, where $R$ is the set of real numbers, should be defined. A real value $C(s)$ should be assigned to each state $s$. The goal of the optimization problem is to find the optimal state $s_{opt}$ whose score is $\min\{C(s_i) \mid 1 \le i \le n\}$. Simulated annealing continuously generates a new candidate state $s_{new}$ from a current state $s_{current}$ by applying move sets and acceptance rules. The criteria of the acceptance rules are:

• If $\Delta C \le 0$, accept a new state $s_{new}$.
• If $\Delta C > 0$, accept a new state $s_{new}$ with probability $P(\Delta C) = e^{-\Delta C / T}$, where $T$ is a temperature and $\Delta C = C(s_{new}) - C(s_{current})$ is a cost difference.

Probability $P(\Delta C)$ prevents the system from fixation at a local minimum. A state $s_{current}$ is called a local minimum if there is no new state $s_{new}$ in $S$ that is generated from the state $s_{current}$ by applying a single move set and that has a lower cost than that of $s_{current}$.

Temperature $T$ controls the probability of accepting a new candidate state $s_{new}$. Initially, $T$ starts from a high temperature, and after every iteration $T$ decreases toward zero by applying an annealing schedule. The probability of accepting a new state with a higher cost than that of the current state also decreases as temperature $T$ decreases. If a careful annealing schedule and number of iterations are given, SA converges to a global minimum state $s_{opt}$. The main disadvantage of SA is its requirement of a large amount of computation time because SA is based on Monte-Carlo methods, which allow for a new candidate state with a higher cost than that of a current state.

Cost function

Each multiple sequence alignment algorithm has its own cost function for the alignment of sequences. To be used in sequence alignment, a cost function $C$ should be explicitly defined as a measure of overall alignment quality.

Altschul (1989) discussed several global cost functions for multiple sequence alignment and suggested SP (sum of pairs) with natural gap costs. SP is the sum of the costs of aligning $n(n-1)/2$ pairs of sequences in an $n$ sequence alignment.

Entropy also can be used as a cost function for multiple sequence alignment. Entropy plays a central role in information theory as a measure of information, choice and uncertainty (Shannon 1948). It is considered that an alignment with lower entropy is statistically preferable to an alternative alignment with a higher cost.

Let $S = \{s_1, s_2, \ldots, s_n\}$ be the set of all possible alignments with the same set of sequences. Then the multiple sequence alignment problem is to find the alignment $s_i$ whose cost $C(s_i)$ is smaller than the cost of any other alignment $s_j$.

One important advantage of SA over dynamic programming is its ability to be performed with any cost function. After applying the move sets to a current alignment, a completely new alignment can be obtained and all the information to be applied to any cost function can be identified. For example, internal gaps, external gaps, and the total number of gaps in a new alignment can be identified after applying move sets to a current alignment. In contrast, no complete alignment can be obtained in dynamic programming until all of the computations are finished.

Move Sets

Several move sets can be applied to a current alignment to generate a new candidate alignment. Basically, all the move sets change the positions of the nulls ('-') in the sequences. The basic move sets are as follows.

• Insertion ($i$, $j$, $k$, direction): This operation inserts $k$ consecutive nulls from the left/right (direction) of column $j$ in the sequence $i$.
• Deletion ($i$, $j$, $k$, direction): This operation deletes $k$ consecutive nulls from the left/right (direction) of column $j$, where columns $j - \alpha$ through $j + \beta$ ($\alpha, \beta > k$) make a gap where there are consecutive nulls in a sequence.
• Shuffle ($i$, $j$, $k$, direction): This operation shuffles the left/right (direction) nulls from the null column $j$ (including null $j$) in the sequence $i$ and its left/right (direction) $k$ consecutive characters.

Figure 1 shows examples of the move sets.

By modifying these move sets, effective move sets for different types of multiple sequence alignment problems can be obtained. Also, move sets can be applied to the [...]

A1                A2
MKQIGGAMGSLA--    MKQIGG--AMGSLA
MKKIGGATGALG--    MKKIGGATGALG--
MK---IGGAMGSLA    MK---IGGAMGSLA

A3                A4
MKQIGG--AMGSLA    MKQIGGA--MGSLA
[...]

[...] Let $k$ be the number of annealing steps, $T_E$ be the final temperature, and $T_1$ be the initial temperature. Then the above equation becomes

$$T_E = T_1 \cdot \gamma^{k} \qquad (2)$$

From the equation, $\gamma$ can easily be calculated from $T_1$, $T_E$, [...]
