
From: ISMB-94 Proceedings. Copyright © 1994, AAAI (www.aaai.org). All rights reserved.

An Efficient Method for Multiple

Jin Kim and Sakti Pramanik

Department of Computer Science, Michigan State University, East Lansing, MI 48824-1027
{kimj, pramanik}@cps.msu.edu

Abstract

Multiple sequence alignment has been a useful method in the study of molecular evolution and sequence-structure relationships. This paper presents a new method for multiple sequence alignment based on the simulated annealing technique. Dynamic programming has been widely used to find an optimal alignment. However, dynamic programming has several limitations in obtaining an optimal alignment: it requires long computation time and cannot apply certain types of cost functions. We describe the detailed mechanisms of simulated annealing for the multiple sequence alignment problem. It is shown that simulated annealing can be an effective approach to overcome the limitations of dynamic programming in the multiple sequence alignment problem.

Key words: computational complexity, dynamic programming, multiple sequence alignment, protein sequence analysis, sequence similarity.

Introduction

Multiple sequence alignment is a useful tool in the search for homology among three or more sequences. It has been helpful in the study of molecular structure, function, and evolution. Pairwise sequence comparisons have been used for sequence similarity. But motifs and other functionally important sites on a sequence may only be identified when a set of sequences is multiply aligned.

Multiple sequence alignment methods can be divided into two types of algorithms: heuristic algorithms and exhaustive algorithms. Heuristic algorithms (Hogeweg & Hesper 1984; Johnson & Doolittle 1986; Taylor 1987; Barton & Sternberg 1987; Higgins & Sharp 1988; Corpet 1988) try to find good but not necessarily optimal alignments within a reasonable time. Most of these heuristic algorithms construct a phylogenetic tree for the alignment of the sequences or assign the sequences to a particular order. The sequences are then aligned one by one according to that order.

The exhaustive approach (Fredman 1984; Murata, Richardson, & Sussman 1985; Gotoh 1986), based on dynamic programming, tries to compare sequences simultaneously. This approach guarantees optimal alignment. Although variations of dynamic programming have been widely used to derive optimal alignments, there are certain limitations.

One important problem in expanding dynamic programming to multiple sequence alignment is its high computational complexity. In pairwise alignment, the computational complexity of dynamic programming is $O(m \cdot n)$, where $m$ and $n$ are the lengths of the sequences. But when dynamic programming is used for multiple sequence alignment, its computational complexity becomes proportional to the product of the lengths of the sequences to be aligned. Therefore, the exponential growth in computational complexity makes dynamic programming impractical for aligning more than three sequences (Fredman 1984; Murata, Richardson, & Sussman 1985; Gotoh 1986). Lipman et al. (1989) implemented the Multiple Sequence Alignment (MSA) program to align more than three sequences using dynamic programming. By confining the solution space using heuristic bounds (Carrillo & Lipman 1988), the MSA program can align four to six sequences of length 200-300 residues using rigorous bounds.

Another problem of dynamic programming is its limited ability to apply certain cost functions in multiple sequence alignment. Altschul (1989) analyzed several types of gap cost and substitution cost for multiple alignments. He pointed out that previously defined gap costs in a multiple alignment were not clearly tied to their substitution costs. He suggested a natural gap cost which was clearly related to its substitution cost. In MSA, quasi-natural gap costs were used instead of natural gap costs because natural gap costs for dynamic programming require impractically long computation time (Altschul 1989). Due to the type of gap costs used, MSA cannot guarantee producing an optimal multiple alignment in some special cases.

Several authors (Lukashin, Engelbrecht, & Brunak 1992; Ishikawa et al. 1993) have suggested simulated annealing (SA) as an alternative approach to overcome the limitations of dynamic programming. SA is a good heuristic method to solve combinatorial optimization problems (Kirkpatrick et al. 1983). Ishikawa et al. (1993) applied SA to align protein sequences with the same cost function as that used in Gotoh (1986). To reduce the long computation time, they utilized a parallel computer for faster convergence to an optimal solution, and discussed a temperature parallel algorithm which does not require any temperature scheduling. Lukashin et al. (1992) applied SA to human intron sequences with entropy as a cost function.

In this paper, we present details of SA for multiple sequence alignment. To reduce the long computation time of traditional SA, several speedup methods are suggested. The SA method for protein sequences is implemented. It is shown that SA can overcome the problem of high computational complexity and the inability to use certain types of cost functions in dynamic programming.

SA for multiple sequence alignment

Simulated annealing

Simulated annealing (SA) was introduced by Kirkpatrick et al. (1983). It is a probabilistic approach that can be used to find a global minimum of a function in combinatorial optimization problems. To apply this algorithm to an optimization problem, a state space $S = \{s_1, \ldots, s_n\}$ and a cost function $C : S \to R$, where $R$ is the set of real numbers, should be defined. A real value $C(s)$ should be assigned to each state $s$. The goal of the optimization problem is to find the optimal state $s_{opt}$ whose cost is $\min\{C(s_i) \mid 1 \le i \le n\}$. Simulated annealing continuously generates a new candidate state $s_{new}$ from a current state $s_{current}$ by applying move sets and acceptance rules. The criteria of the acceptance rules are:

• If $\Delta C \le 0$, accept the new state $s_{new}$.
• If $\Delta C > 0$, accept the new state $s_{new}$ with probability $P(\Delta C) = e^{-\Delta C / T}$, where $T$ is a temperature and $\Delta C = C(s_{new}) - C(s_{current})$ is the cost difference.

The probability $P(\Delta C)$ prevents the system from fixation at a local minimum. A state $s_{current}$ is called a local minimum if there is no new state $s_{new}$ in $S$ that is generated from $s_{current}$ by applying a single move set and that has a lower cost than that of $s_{current}$.

Temperature $T$ controls the probability of accepting a new candidate state $s_{new}$. Initially, $T$ starts from a high temperature, and after every iteration $T$ decreases toward zero according to an annealing schedule. The probability of accepting a new state with a higher cost than that of the current state also decreases as temperature $T$ decreases. If a careful annealing schedule and number of iterations are given, SA converges to a global minimum state $s_{opt}$. The main disadvantage of SA is its requirement of a large amount of computation time, because SA is based on Monte Carlo methods, which allow for a new candidate state with a higher cost than that of a current state.

Cost function

Each multiple sequence alignment algorithm has its own cost function for the alignment of sequences. To be used in sequence alignment, a cost function $C$ should be explicitly defined as a measure of overall alignment quality.

Altschul (1989) discussed several global cost functions for multiple sequence alignment and suggested SP (sum of pairs) with natural gap costs. SP is the sum of the costs of aligning the $n(n-1)/2$ pairs of sequences in an $n$-sequence alignment.

Entropy can also be used as a cost function for multiple sequence alignment. Entropy plays a central role in information theory as a measure of information, choice, and uncertainty (Shannon 1948). It is considered that an alignment with lower entropy is statistically preferable to an alternative alignment with a higher cost.

Let $S = \{s_1, s_2, \ldots, s_n\}$ be the set of all possible alignments of the same set of sequences. Then the multiple sequence alignment problem is to find the alignment $s_i$ whose cost $C(s_i)$ is smaller than the cost of any other alignment $s_j$.

One important advantage of SA over dynamic programming is its ability to be performed with any cost function. After applying the move sets to a current alignment, a completely new alignment is obtained, and all the information needed by any cost function can be identified. For example, the internal gaps, external gaps, and total number of gaps in a new alignment can be identified after applying move sets to a current alignment. In contrast, no complete alignment can be obtained in dynamic programming until all of the computations are finished.

Move Sets

Several move sets can be applied to a current alignment to generate a new candidate alignment. Basically, all the move sets change the positions of the nulls ('-') in the sequences. The basic move sets are as follows.

• Insertion(i, j, k, direction): This operation inserts k consecutive nulls to the left/right (direction) of column j in sequence i.
• Deletion(i, j, k, direction): This operation deletes k consecutive nulls to the left/right (direction) of column j, where columns $j-\alpha$ through $j+\beta$ ($\alpha, \beta > k$) make a gap, that is, a run of consecutive nulls in a sequence.
• Shuffle(i, j, k, direction): This operation shuffles the left/right (direction) nulls from the null column j (including null j) in sequence i and its left/right (direction) k consecutive characters.

Figure 1 shows examples of the move sets.

By modifying these move sets, effective move sets for different types of multiple sequence alignment problems can be obtained. Also, move sets can be applied to different sequences simultaneously.
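As a minimal sketch of the shuffle operation on a single sequence row, the Python function below swaps the run of nulls that starts at column j with the k residues next to it. The name `shuffle_nulls`, the 1-based column convention, and the decision to reject a move when the k neighbouring characters include a null are assumptions made for illustration; the text above does not fix these details.

```python
NULL = "-"


def shuffle_nulls(row: str, j: int, k: int, direction: str) -> str:
    """Swap the nulls running from column j (1-based) toward `direction`
    with the k residues that immediately follow that run.

    Hypothetical helper: raises ValueError when the move is not applicable.
    """
    idx = j - 1
    if k < 1 or not (0 <= idx < len(row)) or row[idx] != NULL:
        raise ValueError("column j must hold a null and k must be positive")

    if direction == "right":
        # Extend the null run from column j to the right.
        end = idx
        while end + 1 < len(row) and row[end + 1] == NULL:
            end += 1
        residues = row[end + 1:end + 1 + k]
        if len(residues) < k or NULL in residues:
            raise ValueError("need k residues to the right of the null run")
        nulls = row[idx:end + 1]
        return row[:idx] + residues + nulls + row[end + 1 + k:]

    if direction == "left":
        # Extend the null run from column j to the left.
        start = idx
        while start - 1 >= 0 and row[start - 1] == NULL:
            start -= 1
        if start - k < 0:
            raise ValueError("need k residues to the left of the null run")
        residues = row[start - k:start]
        if NULL in residues:
            raise ValueError("need k residues to the left of the null run")
        nulls = row[start:idx + 1]
        return row[:start - k] + nulls + residues + row[idx + 1:]

    raise ValueError("direction must be 'left' or 'right'")
```

Because the move only exchanges nulls with neighbouring residues, the row length and the residue order are unchanged. Applied to the first row of alignment A3 in Figure 1 with j = 7, k = 1, and direction right, the sketch gives the first row of A4.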

A1                  A2
MKQIGGAMGSLA--      MKQIGG--AMGSLA
MKKIGGATGALG--      MKKIGGATGALG--
MK---IGGAMGSLA      MK---IGGAMGSLA

A3                  A4
MKQIGG--AMGSLA      MKQIGGA--MGSLA
MKKIGGATGALG--      MKKIGGATGALG--
MK-IGGAMGSLA--      MK-IGGAMGSLA--

Figure 1: Examples of the move sets. (a) original alignment A1. (b) new alignment A2 after Insertion(1, 6, 2, right) to A1. (c) new alignment A3 after Deletion(3, 5, 2, left) to A2. (d) new alignment A4 after Shuffle(1, 7, 1, right) to A3.

The parameters i, j, and direction in the move set rules may be randomly determined. But k may be determined by a certain distribution function, for example a uniform distribution or an inverse function related to the size of k. Only experiment can tell which is the best distribution function for k.

Figure 2 shows a sketch of an energy landscape. States are represented along the x axis, with adjacent states being neighbors. The y axis shows the energy; l is a local minimum, g is a global minimum, and d is the barrier distance between l and g. A higher barrier distance d prevents the system from crossing from l to g. The time required to cross the barrier d is exponential in d/T, where T is a temperature. Carefully designed move sets can lower the barrier distance d and reduce the SA time.

Figure 2: Sketch of energy landscape. l is a local minimum, g is a global minimum, and d is the barrier distance between l and g.

Temperature scheduling

The standard cooling schedule proposed by Kirkpatrick et al. (1983) can be written as follows.

$$T_{n+1} = \gamma\, T_n \qquad (1)$$

where $n$ indexes the annealing step and $\gamma$ is the constant for reducing the temperature. Let $k$ be the total number of annealing steps, $T_F$ be the final temperature, and $T_I$ be the initial temperature. Then the above equation becomes

$$T_F = T_I\, \gamma^{k} \qquad (2)$$

From this equation, $\gamma$ can easily be calculated from $T_I$, $T_F$, and $k$ as follows.

$$\gamma = \sqrt[k]{T_F / T_I} \qquad (3)$$

The starting temperature $T_I$ should be high enough to accept almost any new candidate alignment. At the final temperature $T_F$, the probability of accepting a new candidate alignment with a higher cost $\Delta C$ is

$$e^{-\Delta C / T_F} = \epsilon \qquad (4)$$

where $0 < \epsilon < 1$. This equation is simplified to

$$T_F = -\Delta C / \ln(\epsilon) \qquad (5)$$

The minimum cost change, $\Delta C$, resulting from a move set is 1 when protein sequences are aligned with the PAM-250 matrix (Dayhoff 1978) as a substitution cost. Empirically, $\epsilon^{-1}$ is set to the total number of iterations $k$ (White 1984). Therefore the final temperature becomes

$$T_F = 1 / \ln(k) \qquad (6)$$

Speedup methods in SA

Heuristic algorithm as the high temperature phase. Simulated annealing is composed of roughly two phases: a high-temperature phase and a low-temperature phase. In the high-temperature phase, SA gives a high probability to all the states with higher costs than that of a current state. This allows any state in the solution set to be a current state. In the low-temperature phase, SA gives a high probability to states with a lower or not much higher cost than that of a current state. This allows only the states near a current state to be the next state. The high-temperature phase is similar to a random search, and the low-temperature phase is similar to a greedy local search. Rose et al. (1986) suggested a good heuristic algorithm as a first phase and a simulated annealing approach as a second phase for fine optimization in the standard-cell-placement algorithm.

In SA for multiple sequence alignment, the same approach can be used. The output alignment generated from the fast heuristic algorithm can be used in place of the SA high-temperature phase. Figure 3 shows the annealing curve and different starting points. SA time P can be saved when the system starts from point B, which is obtained from the fast heuristic approach, instead of point A. It is clear that SA time can be reduced if point A is closer to the optimal point C. When the alignment is obtained from the heuristic approach, the initial temperature should be lower than the initial temperature when traditional SA is applied.
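The relations in equations (1) to (6) can be checked numerically with a short sketch. The function names and the initial temperature used in the usage lines are hypothetical; the paper states only that the initial temperature was chosen from experience.

```python
import math


def final_temperature(total_steps: int, min_cost_change: float = 1.0) -> float:
    """Equation (5) with epsilon = 1 / total_steps.

    With min_cost_change = 1 this reduces to equation (6): 1 / ln(k).
    """
    epsilon = 1.0 / total_steps
    return -min_cost_change / math.log(epsilon)


def cooling_factor(t_initial: float, t_final: float, total_steps: int) -> float:
    """Equation (3): the constant gamma that takes T_I to T_F in k steps."""
    return (t_final / t_initial) ** (1.0 / total_steps)


def acceptance_probability(delta_c: float, temperature: float) -> float:
    """Acceptance rule: always accept if the cost does not increase,
    otherwise accept with probability exp(-delta_c / T)."""
    if delta_c <= 0:
        return 1.0
    return math.exp(-delta_c / temperature)


# Usage with an assumed initial temperature of 10 (illustrative only).
k = 1_000_000
t_i = 10.0
t_f = final_temperature(k)
gamma = cooling_factor(t_i, t_f, k)
p_uphill_at_end = acceptance_probability(1.0, t_f)
```

With k = 1,000,000 the final temperature is 1/ln(10^6), about 0.072, and the probability of accepting a cost increase of 1 at that temperature is 1/k, which is the condition that equations (4) to (6) encode.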

Figure 3: Annealing curve (Energy vs. Iteration). A is the starting point in the traditional SA approach. B is the starting point obtained from the fast heuristic approach. C is the minimal point.

Limitation of lengths for candidate alignments. To apply dynamic programming for aligning n sequences, a fixed amount of computation is required for each cell of the n-dimensional lattice. The total computational time in dynamic programming is proportional to the product of the number of computations for each cell and the size of the n-dimensional lattice. These two factors limit the usage of dynamic programming in multiple sequence alignment. For example, natural gap costs make the number of computations for each cell too large for aligning more than five sequences. The size of the lattice becomes too large for aligning more than three sequences with an average protein length of 200-300 residues.

Fickett (1984) suggested a way to reduce the solution space in pairwise alignment. He searched for the optimal path within a diagonal band of the two-dimensional matrix, as defined by an upper bound of the cost of the optimal path. Carrillo and Lipman (1988) expanded the idea to reduce the solution space for aligning n sequences. They calculated the upper bounds of the alignment cost of each pair of the sequences and confined the bounds. Therefore, they could reduce the computation time by applying dynamic programming to a limited solution space in an n-dimensional lattice.

Execution time in SA can be reduced by confining the length of the candidate alignment. First, an initial alignment is obtained by fast heuristic algorithms with the same cost function. Second, only the nulls already present in each sequence of the initial alignment may be used to generate a new candidate alignment. Therefore, only the move sets that change the positions of the nulls are allowed. A column that is composed of all nulls may occur in the candidate alignment. These null columns do not affect the cost of an alignment. Thus, SA can examine the set of alignments, $S_i$, whose lengths are less than or equal to the length of the initial alignment. If the length of the optimal alignment is longer than the length of the initial alignment, SA generates only a near-optimal alignment within the set $S_i$. By increasing the length of an initial alignment, a larger set of candidate alignments can be examined by SA. To create a longer initial alignment, a gap cost lower than the one used in the annealing phase is applied. However, too low a gap cost may result in an initial alignment whose length is much longer than that of the optimal alignment. This longer initial alignment requires longer SA time. Therefore a good initial alignment, whose length and cost are close to those of the optimal alignment, is crucial for reducing SA time.

Application of SA

Implementation of SA and results for protein sequences

An algorithm called Multiple Sequence Alignment using Simulated Annealing (MSASA) for protein sequences was implemented and compared to MSA, which is based on dynamic programming, on a Sun SPARC2. All the parameters in MSASA and MSA were exactly the same except the gap costs. Natural gap costs were used in MSASA, whereas quasi-natural gap costs were used in MSA. The cost of one gap was 8 in both algorithms. The PAM-250 matrix (Dayhoff 1978) was used for substitution costs. The modified substitution costs (17 minus the values in the PAM-250 matrix) were used in both algorithms. The SP substitution costs in MSA have two options, weighted SP substitution costs and unweighted substitution costs. In the weighted SP substitution costs, weights are applied to the pairwise alignments in order to reduce the effect of the dominance of a set of similar sequences in the multiple sequence alignment. In MSASA, both options could be applied. For easy comparison of the two algorithms, only unweighted SP substitution costs were considered. The experiments were performed on three serine protease families: chymotrypsin, trypsin, and elastase.

A heuristic procedure similar to progressive pairwise alignment (Feng & Doolittle 1987) is used to calculate the heuristic bounds in MSA. This heuristic procedure of MSA was also used to generate an initial alignment for MSASA. Only the nulls of each sequence in the initial alignment are allowed to be used to generate a new alignment. Therefore, only the shuffle operation was applied to generate a new candidate alignment in MSASA. In the shuffle operation, the maximum value of parameter k was changed cyclically from 1 to 10 as the number of iterations increased. The initial temperature $T_I$ was decided by previous experience. The final temperature $T_F$ was determined by equation (6).

Cost and time comparisons

Alignments A1 and A2 in Figure 4 were generated from MSA and MSASA, respectively.
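The scores compared in this section are SP costs. The sketch below shows one way such a cost could be computed under a simplified reading of natural gap costs: each pair of rows is projected by dropping columns where both rows hold a null, every maximal run of nulls in the projection counts as one gap of cost 8, and residue pairs are charged a substitution cost. The function `toy_substitution_cost` is a hypothetical stand-in for the modified PAM-250 costs (17 minus the matrix value), which are not reproduced here, and charging nothing beyond the per-gap cost for null-residue columns is an assumption.

```python
from itertools import combinations

NULL = "-"
GAP_COST = 8  # cost of one gap, as reported for both MSA and MSASA


def toy_substitution_cost(a: str, b: str) -> int:
    """Hypothetical placeholder for (17 - PAM-250 value)."""
    return 0 if a == b else 1


def pair_cost(row_a, row_b, sub_cost=toy_substitution_cost, gap_cost=GAP_COST):
    """Cost of the pairwise alignment induced by two rows of a multiple alignment."""
    cost = 0
    open_side = None  # which row currently holds the open gap, if any
    for a, b in zip(row_a, row_b):
        if a == NULL and b == NULL:
            continue  # column disappears in the pairwise projection
        if a == NULL or b == NULL:
            side = "a" if a == NULL else "b"
            if side != open_side:
                cost += gap_cost  # a new gap opens in the projection
            open_side = side
        else:
            cost += sub_cost(a, b)
            open_side = None
    return cost


def sp_cost(alignment, sub_cost=toy_substitution_cost, gap_cost=GAP_COST):
    """Sum of pairs: add the induced pairwise cost over all n(n-1)/2 row pairs."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(pair_cost(x, y, sub_cost, gap_cost)
               for x, y in combinations(alignment, 2))
```

Because the cost is evaluated on a complete alignment, any other scoring rule could be substituted for `pair_cost` without changing the annealing procedure, which is the flexibility the paper attributes to SA.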

The score (2162) of alignment A2 from MSASA is lower than that (2170) of alignment A1 from MSA. The difference is due to the different gap costs in MSASA and MSA.

A1 (2170)
IVGGTNSSWGEWPWQVSLQVKLT-AQRHLCGGSLIGHQWVLTAA
IVNGEEAVPGSWPWQVSLQDKTG---FHFCGGSLINENWVVTAA
IVGGYTCGANTVPYQVSL--NSG---YHFCGGSLINSQWVVSAA
VVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTLIRQNWVMTAA
VVGGTRAAQGEFPFMVRL--SMG------CGGALYAQDIVLTAA

A2 (2162)
IVGGTNSSWGEWPWQVSLQVKLTAQR-HLCGGSLIGHQWVLTAA
IVNGEEAVPGSWPWQVSLQDKTGF---HFCGGSLINENWVVTAA
IVGGYTCGANTVPYQVSL--NSGY---HFCGGSLINSQWVVSAA
VVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTLIRQNWVMTAA
VVGGTRAAQGEFPFMVRL--SMG------CGGALYAQDIVLTAA

Figure 4: Alignment of segments from human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSA and MSASA. A1 is the alignment generated from MSA and A2 is generated from SA. The score of A1 is 2170 and the score of A2 is 2162.

Figures 5 and 6 show an example of the alignment of the five sequences generated from MSA and MSASA. These sequences are human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin. The cost of the alignment from MSA is 35853 and the cost of the alignment from MSASA is 35845, which is lower than MSA's. It took approximately 50 minutes to generate the alignment in Figure 5 with MSA. The alignment in Figure 6 was the best alignment selected from several different runs. Each run took approximately 20 minutes. Quasi-natural gap costs prevent MSA from generating an optimal alignment. When a series of nulls with left and right letters is completely imposed on the series of nulls in other sequences, quasi-natural costs count one more gap than natural gap costs do (Altschul 1989). These additional counts prevent MSA from generating an optimal alignment. There are no additional counts in SA because natural gap costs were used in MSASA.

MSA and MSASA generate the same alignment in some cases. When quasi-natural gap costs are used in both algorithms, both generate the same alignments. And even when natural gap costs are used in MSASA, if there are no completely imposed nulls in the optimal alignment, both algorithms generate the same alignments with rigorous bounds.

When the lengths of the sequences are short and the number of sequences is small, MSA generated an optimal alignment faster than MSASA. In MSASA, usually 1 to 5 million iterations, taking 5 to 30 minutes, were enough to get a near-optimal alignment for four to six sequences. MSA took an impractically long time to align more than six sequences on a personal workstation. Therefore, it could not be directly compared to MSASA for more than six sequences. In MSASA, 1 to 10 million iterations were enough to get a near-optimal solution when aligning up to 10 sequences. Therefore, the running time to align more than six protein sequences in MSASA is more practical than that in MSA.

Discussion

It has been shown that SA is a useful method for multiple sequence alignment. These are the main characteristics of SA.

• Flexibility: As already discussed in the section Application of SA, SA can be applied to any cost function. This is because after applying a move set, a complete new alignment can be obtained and any cost function can be applied to this new alignment. Therefore, any constraints or any human knowledge can be incorporated into SA. In contrast, its computational complexity prevents dynamic programming from applying certain types of cost functions and constraints.

• Optimality: SA does not guarantee an optimal solution, but dynamic programming guarantees an optimal solution with rigorous bounds. The reason SA generated lower costs in the above application is its use of natural gap costs. In SA, a longer SA time generally produces a solution closer to an optimal solution. By applying several speedup methods, this long computation time can be significantly reduced.

Hirosawa et al. (1993) suggested SA as a refinement tool when long annealing time is required for multiple sequence alignment. They first produced an initial alignment by using three-way dynamic programming and applied SA for refinement of the initial alignment. With this method they could produce multiple sequence alignments of better quality, with small increases in computational time, than those generated from the heuristic algorithms (Johnson & Doolittle 1986; Taylor 1987; Barton & Sternberg 1987; Higgins & Sharp 1988).

References

Altschul, S. F. 1989. Gap costs for multiple sequence alignment. J. Theor. Biol. 138:297-309.
Barton, G. J., and Sternberg, M. J. E. 1987. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Molec. Biol. 198:327-337.
Carrillo, H., and Lipman, D. 1988. The multiple sequence alignment problem in biology. SIAM J. Appl. Math 48:1073-1082.
Corpet, F. 1988. Multiple sequence alignment with hierarchical clustering. Nucl. Acids Res. 16:10881-10890.

MSA

I~GGTNSSW~EWPW~VSLQVKLT-A~RHL~GGSLIGH~WVLTAAH~FDGLPL~DVWRIYSGI~NLS~ITKDTPFS~IKEIIIHQNYK~SEG--~DIALI
IVNGEEAVPGSWPWQVSLQDKTG---FHFCGGSLINENWVVTAAHCGVT.... TSDWVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTI--NNDITLL
IVGOYTCGANTVPYQVSL--NSG---YHFCGGSLINSQWVVSAAHCYKS..... GIQVRLGEDNINVVEGNEQFISASKSIVHPSYNSNTL--NNDIMLI
VVGGTEAQRNSWPSQISLQYRSGSS~AHTCGGTLIRQN~¢MTAAHCVDR---ELTFRV~VGEHNLNQNNGTEQYVG~QKIwHPY~TDDVAAGYDIALL
VVGGTRAAQGEFPFMVRL--SMG...... CGGALYAqDIVLTAAHCVSG.... SGNNTSITATGGVVDLqSAVKVRSTKVLQAPGYNGT .... GKDWALI

KLQ~PLNYTEFQKPICLPSKGDTSTIYTNC~VT~W~FSK-EK~EIQNILQKVNIPLVTNEE~QKR-YQDYKITQRMvCAGYK-E~GKDA~K~DS~~PLV~
KISTAASFSqTVSAVCLPS~SDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKK--YWGTKIKDA~ICAG---AS~VSSC~GDSG~PL~C
KLKSAASLNSRVASISLPTSCAS~G--TQCLI~W~NTKSS~TSYPDVLKCLKAPILSDSSCKSA-YPG-QITSNMFCAGYL-EG~KDSCQGDS~GPVVC
RL~QSVTLNSYVQL~VLPRAGTILANNSPCYIT~w~LTR-TM~QLAQTLQqAYLPTVDYAICSSSSYW~STVKNSMVCA~--N~VRS~C~DSG~PLHC
KLAQPINQPTLKIATTTAYNQGTFT...... VAGWGANR-EGGSQQRYLLKANVPFVSDAACRSA-YGNELVANEEICAGYPDTGGVDTCqGDSGGPNFR

KHN-GMWRLVGITSWGE--GCARREQPGVYTKVAEYMDWILEKTOSS
KKN-GAWTLVGIVSWGS--STCSTSTPGV~ARVTALVNWVQQTLAAN
SGK ..... LQGIVSWGS--GCAQKNKPGVYTKVCNYVSWIKQTIASN
LVN-GQYAVHGVTSFVSRLGCNVTRKPTVFTRVSAYISWINNVIASN
KDNADEWIQVGIVSWGY--GCARPGYPGVYTEVSTFASAIASAARTL

Figure 5: Alignment of human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSA.

SA

IVG~TNSSWGEWPWQ~SLQVKLTAQR-HL~SLI~HQW~LTAAH~FDGLPLQDVwRIYS~ILNLSDITKDTPFSQIKEIIIHQMYK~SEGNH--DI
IVNGEEAVPGSWPWQVSLQDKTGF---HFCGGSLINENWVVTAAHCGVT....TSDWVAGEFDOGSSSEKIQKLKIAKVFKNSKYNSLTINN--DI
IVGGYTCGANTVPYQVSI~--SGY---HFCGGSLINSQWVVSAAHCYKS.....GIQVRLGEDNINVVEGNEQFISASKSIVHPSYNSk~LNN--DI
VVGGTE~QR~SWPSQISLQYRSGSSWAHTCGGTLIRqNWV~TAAHCVDR---ELTFRVV~GEHNLNQNNGTEQYVGVQKIwHPY%fNTDDVAAGYDI
VVGGTRAAQGEFPF~RLS--NG...... CGGALYAQDIVLTAAHCVSG .... SGNNTSITATGGVVDLQSAVKVRSTKVLQAPGYNGTG--K--DW

ALIKLQAPLNYTEFQKPI~LPSKGDTSTIYTN~W~TGWGFSK-EKGEIQNILqKVNIPL~TMEE~Q-KRYQDYKITQR/~V~AGY-KEGGKDACKGDS
TLLKIST~ASFSQTVSAVCLPSASDDFAAGTT~VTTGWGLTRYTNANTPDRLqQ~SLPLLSNT/iCK--KYMGTKIKDA~ICAG---ASG~SSCMGDS
~LIKLKS~ASLNSRVASISLPTSCASAG--TQCLISGwGNTKSSGTSYPDVLKCLKAPILSDSSCK-SAYPG-~ITSN~FCAGY-LEGGKDSC~GDS
ALLRLAQSVTLNSYVqLGVLP~AGTIL~NNSPCYITG~GLTR-TNGQLAqTLQq~YLPTVDY~ICSSSSY~GSTvKMSMVCAG--GNGVRSGCQGDS
ALIKLAQPINQPTLKIATTTAYNQGTFT...... VAG~GANH-EGGSQQRYLLKANVPFVSDAACR-SAYGNELVkNEEICAGYPDTGGVDTCQGDS

GGPLVCKHNG-M~RLVGITSWGE--GCARREQPGVYTKVAEYNDWILEKTQSS
GGPLVCKKNG-A~TLVGIVSWGS--STCSTSTPGVYARVTALVNWVQQTLAAN
GGPVVCSGK..... LQGIVSWGS--GCAQKNKPGVYTKVCNYVSWIKqTIASN
GGPLHCLVMG-QYAVHGVTSFVSRLGCNVTRKPTVFTRVSAYVSWIKQTIASN
GGPMFRKDNADEWI~VGIVSWGY--GCARPGYPGVYTEVSTFASAIASAARTL

Figure 6: Alignment of human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSASA.

Dayhoff, M. O. 1978. A model of evolutionary change in proteins, matrices for detecting distance relationships. In Atlas of Protein Sequence and Structure, volume 5 suppl. 3. Dayhoff, M. O. (ed). Washington, DC: National Biomedical Research Foundation. 345-352.
Feng, D. F., and Doolittle, R. F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Molec. Evol. 25:351-360.
Fickett, J. W. 1984. Fast optimal alignment. Nucl. Acids Res. 12:175-180.
Fredman, M. L. 1984. Algorithms for computing evolutionary similarity measures with length independent gap penalties. Bull. Math. Biol. 46:553-566.
Gotoh, O. 1986. Alignment of three biological sequences with an efficient traceback procedure. J. Theor. Biol. 121:327-333.
Higgins, D. G., and Sharp, P. M. 1988. : a package for performing multiple sequence alignment on a microcomputer. 7:237-244.
Hirosawa, M.; Hoshida, M.; Ishikawa, M.; and Toya, T. 1993. Mascot: Multiple alignment system for protein sequences based on three-way dynamic programming. CABIOS 9:161-167.
Hogeweg, P., and Hesper, B. 1984. The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J. Molec. Evol. 20:175-186.
Ishikawa, M.; Toya, T.; Hoshida, M.; Nitta, K.; Ogiwara, A.; and Kanehisa, M. 1993. Multiple sequence alignment by parallel simulated annealing. CABIOS 9:267-273.
Johnson, M., and Doolittle, R. F. 1986. A method for the simultaneous alignment of three or more amino acid sequences. J. Molec. Evol. 23:267-287.
Lipman, D. J.; Altschul, S. F.; and Kececioglu, J. D. 1989. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA. 86:4412-4415.
Lukashin, A. V.; Engelbrecht, J.; and Brunak, S. 1992. Multiple alignment using simulated annealing: branch point definition in human mRNA splicing. Nucl. Acids Res. 20:2511-2516.
Murata, M.; Richardson, J. S.; and Sussman, J. L. 1985. Simultaneous comparison of three protein sequences. In Proc. Natl. Acad. Sci. USA., volume 82, 3073-3077.
Shannon, C. E. 1948. A mathematical theory of communication. Bell System Tech. J. 27:379-432, 623-656.
Taylor, W. R. 1987. Multiple sequence alignment by a pairwise algorithm. CABIOS 3:81-87.
White, S. R. 1984. Concepts of scale in simulated annealing. In Proc. ICCD, 646-651.
