22 Genome Informatics 16(1): 22-33 (2005)

Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT

Kazutaka Katohl Kei-ichi Kuma1 [email protected] [email protected] Takashi Miyata2" Hiroyuki Toh5 [email protected] [email protected] 1 Center , Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan 2 Biohistory Research Hall , Takatsuki, Osaka 569-1125, Japan 3 Department of Electrical Engineering and Bioscience , Science and Engineering, Waseda University, Tokyo 169-8555, Japan 4 Department of Biophysics , Graduate School of Science, Kyoto University, Kyoto 606- 8502, Japan 5 Division of Bioinformatics , Research Center for Prevention of Infectious Diseases, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582,Japan

Abstract

In 2002, we developed and released a rapid multiple sequence alignment program MAFFT that

was designed to handle a huge (up to•`5,000 sequences) and long data (,-2,000 as or •`5,000 nt) in a reasonable time on a standard desktop PC. As for the accuracy, however, the previous versions

(v.4 and lower) of MAFFT were outperformed by ProbCons and TCoffee v.2, both of which were released in 2004, in several benchmark tests. Here we report a recent extension of MAFFT that aims to improve the accuracy with as little cost of calculation time as possible. The extended version of MAFFT (v.5) has new iterative refinement options, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-iin this report). These options use a new objective function combining the weighted sum-of-pairs (WSP) score and a score similar to COFFEE derived from all pairwise alignments. We discuss the improvement in accuracy brought by this extension, mainly using two benchmark tests released very recently, BAliBASE v.3 (for protein alignments) and BR,AliBASE (for RNA alignments). According to BAliBASE v.3, the overall average accuracy of L-INS-i was higher than those of other methods successively released in 2004, although the difference among the most accurate methods (ProbCons, TCoffee v.2 and new options of MAFFT) was small. The advantage in accuracy of [GL]-INS-ibecame greater for the alignments consisting of — 50 - 100 sequences. By utilizing this feature of MAFFT, we also examined another possible approach to improve the accuracy by incorporating homolog information collected from database. The [GL]-INS-i options are applicable to aligning up to r\-,200 sequences, although not applicable to thousands of sequences because of time and space complexities.

Keywords: multiple sequence alignment, dynamic programming, objective function

1 Introduction

Multiple alignment of protein or DNA sequences is a classical problem in the area of bioinformatics and an important tool required in a wide range of molecular biological analyses such as detecting conserved residues and inferring evolutionary history of genes and organisms. As a large amount of sequence information is being determined and accumulated by genome and cDNA projects of various organisms, the importance of a rapid and accurate algorithm for multiple sequence alignment has been recognized. Under such a situation, several groups are now actively investigating the algorithms MAFFT: A Multiple Sequence Alignment Program 23

Figure 1: Outlined process of MAFFT without pairwise alignment (A) and with pairwise alignments (B).

of multiple sequence alignment and a number of new programs have been released since last year [5-7, 13, 15, 28, 32]. ClustalW [30] released in 1994 is a standard multiple sequence alignment software. It employs the progressive method [8], which is simple but has room for improvement in accuracy and speed. Many of recent methods have tried to improve the accuracy. Gotoh [12]found that the alignment accuracy is significantly improved by an iterative refinement method using the weighted sum-of-pairs (WSP) score as an objective function. He developed a program, PRRN, that adopts the method. DIALIGN [17]and TCoffee [19] achieved high accuracies by combining local and global alignment features. Although the ability to combine different types of alignments derived from heterogeneous sources, such as structural alignment, is important [21],TCoffee requires a large CPU time proportional to N3. Thus it is hard to apply TCoffee to a large alignment consisting of dozens of sequences. The accuracy of TCoffee has recently been further improved in ver. 2 in comparison with ver. 1. Do et al. [5] proposed a new method ProbCons, whose accuracy is comparable to or slightly higher than TCoffee v.2 as shown in Table 1, and rather faster than TCoffee. Katoh et al. [16]developed MAFFT, in 2002, that aimed to reduce the CPU time of the progressive method. MAFFT also has an iterative refinement option that is a simplified version of PRRN. Edgar [6, 7] also implemented progressiveand iterative alignment strategies in MUSCLE in 2004. The algorithms of major options of MUSCLE seem similar to the NW-NS-[i2]options of MAFFT explained below. To reduce the CPU time, UPGMA [26]is more efficientlycoded in MUSCLE.Lassmann and Sonnhammer (manuscript submitted) released Kalign, in Mar. 2005, which adopts a progressive approach to achieve the reduction of CPU time. Kalign has a merit that the length dependency of the CPU time is near to linear. The previous version of MAFFT is used in part by several projects such as PFAM [2],ASTRAL [4] and MEROPS [23]. The outline of procedure of the previous version of MAFFT is explained below (see Figure 1A also). An initial alignment is constructed by the progressive method [8,30] and then improved by the iterative refinement method [3, 11]. A user can select an appropriate strategy from the fastest one (FFT-NS-1) to the most accurate one (FFT-NS-i) in v.4.

1. Progressive alignment (1). A rough distance between every pair of input sequences is rapidly calculated based on the number of 6-tuples shared by the two sequences [7, 14,16]. A guide tree is constructed from the distance matrix using UPGMA [26] with modified linkage. Input sequences are progressively aligned [8, 30] followingthe branching order of the guide tree. This procedure is referred to as FFT-NS-1. 2. Progressive alignment (2). The initial distance matrix is less reliable than that based on all pairwise alignments. We can obtain a more reliable distance matrix by using the FFT-NS-1 alignment [7, 16,29]. Progressive alignment is re-performed based on the new guide tree calcu- lated from the new distance matrix. This method is referred to as FFT-NS-2. 24 Katoh et al.

3. Iterative refinement. The FFT-NS-2 alignment is further improved by the iterative refinement method [3,11] that maximizes the weighted sum of pairs (WSP) score proposed by Gotoh [10], using an approximate group-to-group alignment algorithm [16]. This procedure is referred to as FFT-NS-i.

For the progressive alignment processes, a fast Fourier transform (FFT) approximation [16] is used in the FFT-NS-2, FFT-NS-1 and FFT-NS-i options (collectively denoted as FFT-NS-[12i]hereafter). When the input sequences are highly conserved, these options require CPU times effectively propor- tional to average sequence length L for alignments of a single gene. The slopes of the FFT-NS-[2i] lines are indeed close to 1 in Figure 4C. MAFFT also offers three more options NW-NS-[12i]with full dynamic programming (DP) [18]for the same part. In contrast to the case of FFT-NS-[12i], the time complexity of full DP options is O(L2), regardless of the similarity among input sequences. In addition to these options, two additional strategies, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-ihereafter), have been introduced in v.5. The [GL]-INS-istrategies use a new objective function combining the WSP [12]score and a score similar to COFFEE [20] derived from all pairwise alignments in order to improve the alignment accuracy. Iterative refinement with the objective function is performed in reasonable time, as [GL] INS-i were designed to have low computational complexity. See Methods for details of the algorithms and the objective function. Recently, reference databases of protein alignments based on 3D structural superposition, such as HOMSTRAD [27], SABmark [32] and PREFAB [7], have been independently developed. They are valuable resources to find good parameters for a sequence alignment program as well as to evaluate the performance of a program. After the publication of our latest paper describing MAFFT v.5 [15] in Jan. 2005, two benchmark databases have already been released. (i) The most widely used benchmark test, BAliBASE [31],has been substantially updated to v.3 (Thompson et al., manuscript submitted) and now contains alignments that are difficult to make accurate residue-to-residue correspondences with only sequence information, due to high sequence divergence, long indels, etc. That is, the align- ments are considered to represent difficult cases we may encounter when we deal with real problems. (ii) BRAliBASE, the first reference database for RNA alignment, has been recently released [9]. Thus, we compared the performances of the currently available sequence alignment programs including the new version of MAFFT, using the new reference alignment databases. The parameters of MAFFT and other methods are not closely optimized for these new benchmark datasets. It was suggested that the accuracy of a multiple alignment of distantly related sequences can be improved if they are aligned together with a number of their homologs, because the information from many sequences is expected to reduce the 'noise' [12]. This point has recently been investigated by Katoh et al. [15]and Simossis et al. [24]. We discuss the effect of number of homologs involved in an alignment in this report also.

2 Method

2.1 Incorporating Pairwise Alignment Information into the Objective Function In the two new iterative refinement options, [GL]-INS-i,the objective function is defined as the sum- mation of the WSP score and a score derived from all pairwise alignments. Both global and local alignment algorithms were tested as all pairwise alignment; G-INS-i uses global alignment [18] with an FFT approximation [16], whereas L-INS-i uses local alignment algorithm [25]. The procedures of the algorithms of [GL]-INS-iare as follows (see Figure 1B also).

1. An initial distance matrix is constructed from the pairwise scores, instead of shared 6-tuples, using the equation shown in ref. [16] and a guide tree is built using UPGMA with modified linkage [15]. Unlike FFT-NS-[2i] explained above, the guide tree is not reconstructed in the MAFFT: A Multiple Sequence Alignment Program 25

[GL]-INS-istrategies, because such a reconstruction did not show significant improvement of the alignment accuracy in our preliminary study (data not shown).

2. Each pairwise alignment is divided into gap-free segments and number n is assigned to each segment. The information of these segments is stored in a set of arrays, score [S(s,t,n)], which represents the alignment score of the nth gap-free segment between sequences s and t, length [L(s, t, n)] of the aligned segment (s, t, n), position [P(s, t, n)] for every pair of sequences s and t, and the importance value [E(s, t, n)] that is calculated, as described below, from the score of the segment and how frequently the residues are involved in gap-free segments. We denote (s, t, p, q) E P(s, t, n) if the pth site of sequence s corresponds to the qth site of segment t in aligned segment (s, t, n).

3. The frequency value f (s, p), which represents the frequency of the pth site of sequence s involved in gap-free segments, is calculated as

where wt is weighting factor (see [30] for definition) for sequence t. The importance value E(s, t, n) for aligned segment is calculated as

The importance matrix I/(s, t, p, q) between the pth site of sequence s and the qth site of sequence t is defined as

(1)

4. An alignment of a subset of given sequences, which is generated during the procedures of pro- gressive and iterative refinement methods, is referred to as 'group.' To align groups i and j, matrix H(groupi, groupj, p, q) is constructed as

H(groupi, groupj, p, q)

where A(s,p) is the pth amino acid residue on sequence s. Mab is a score between a pair of amino acids a and b. The score matrices examined in this study are described in the 'Parameter optimization' section. wst is a weighting factor between sequences s and t. A weighting scheme proposed by Thompson et al. [30] is used in progressive alignment stage whereas the iterative refinement stage adopts the weighting scheme by Gotoh [10]. WI , a weighting factor for the importance value, was set at 2.7 in this study. The alignment between two groups is computed by applying the dynamic programming algorithm [18] to matrix H(,,,) at each step of progres- sive alignment. The alignment produced by this procedure is referred to as [GL]-INS-1. The performances of [GL]-INS-1are not shown in this report.

5. An alignment by [GL]-INS-1is improved by the iterative refinement method ([GL]-INS-i), which optimizes the alignment with a objective function defined as the summation of the WSP score and importance value given by equation 1. Katoh et al. 26

2.2 Performance Evaluation

To evaluate the accuracy of the above methods, we used three datasets, BAliBASE v.3 (Thompson et al., manuscript submitted), HOMSTRAD [27] and SABmark [32], for protein alignment. BAliBASE is diveded into six different reference sets. Each set consists of: Ref11, small (-6) number of equi- distant sequences with low similarity (%id<20); Refl2, small (•`6) number of equi-distant sequences with medium similarity (%id of 20-40); Ref3, subgroups with <25% identity between the subgroups: Ref4, sequences that have long terminal extensions; Ref5, sequences with internal insertions. From HOMSTRAD, highly diverged alignments was extracted and used in our tests. From SABmark, a part of the TWI subset was used in this report (see refs [15,21] for the details). The accuracy of alignment was evaluated using either or both of the TC and SP values [31]. TC is the fraction of columns aligned identically to the reference alignment, whereas SP is the total number of correctly aligned sites over all sequence pairs divided by the number of all residue pairs in the reference alignment (see [31] for definition). Obviously, the TC and SP values are highly correlated to each other. For RNA alignment, BRAliBASE [9] was used. The accuracy of alignments were evaluated with SP and the structure conservation index (SCI) using the RNAz software [33]. SCI is a measure of conserved secondary-structural information contained within the alignment, which is proposed to he used for evaluating accuracy of RNA alignments by Gardner et al. The SCI is close to 1 if perfectly conserved structure is identified in an alignment, whereas TC 0 if no common structure found. Unlike SP and TC, SCI is calculated without using reference alignment. We also compared the accuracies and CPU times of several methods developed by other groups, ClustalW v1.83, PRRN v.3.11, TCoffee v2.03, MUSCLE v3.52, ProbCons v.1.09 and Kalign v1.0. For PRRN, a set of gap penalties re-tuned by Yamada et al. [34] was used instead of the default values. For other methods, default parameters were used. Several different options were tested for some programs according to need. These methods can be classified into three types of multiple sequence alignment strategies. (a) Progressive methods: ClustalW, the FFT-NS-[12] options of MAFFT, a progressive option of MUSCLE and Kalign. (b) Iterative refinement methods: PRRN, the FFT-NS-i option of MAFFT and the iterative refinement option of MUSCLE. (c) Methods incorporating pairwise alignment information: TCoffee, ProbCons and [GL]-INS-ioptions of MAFFT.

2.3 Parameter Optimization for Protein Alignments Gap-opening penalty S°P and the offset value Sa (see ref. [16] for definitions) were determined to provide the highest accuracy of the FFT-NS-2 strategy for a part of SABmark, by using golden section search (e.g. [22]). We examined five scoring matrices (BLOSUM45, 62, 80, JTT100 and ,TTT200)and selected the matrix providing the highest accuracy.

2.4 Availability

MAFFT was written in C, and runs on , Mac OS X and the Cygwin environment on Win- dows. The MAFFT package is available at http: //www.biophys .kyoto-u.ac. jp/~katoh/programs/ align/mafft/. The performances were measured on a 2.8 GHz Xeon processor with 1GB of RAM running SuSE Linux 9.0. The gcc v.3.3.1 compiler was used with the '-O3' optimization option.

3 Results and Discussion

3.1 Improvement in Accuracy by New Parameter Set for Protein Alignments Out of the five scoring matrices examined, BLOSUM62 showed the highest accuracy when FFT-NS- 2 was applied to a part of SABmark. The golden section search provided the optimal parameters, 1.53 for Sop and 0.123 for Sa (see ref. [15] for details). The improvement in accuracy from the MAFFT: A Multiple Sequence Alignment Program 27

Figure 2: The SP values as a function of average pairwise sequence identity for several methods. The curves were fitted to the all SP values for BAliBASE v.3 using a cubic spline. Percent identity was calculated from reference alignments. Progressive methods (a) are shown as dash-dotted lines, iterative refinement methods (b) are shown as dotted lines and methods incorporating pairwise alignment information (c) are shown in solid lines. See the footnote of Table 1 for the command line option of each method. previous parameter set (JTT200, Sop=2.40, Sa=0.06) was approximately 5 percentage points. The new parameter also set gave generally higher accuracy than the previous parameter set for different datasets (the rest part of SABmark, HOMSTRAD and BAliBASE), although of course not optimal. Thus wejudged this improvement is relevant for various protein alignments and adopted this parameter as default.

3.2 Accuracy Evaluation for Protein Alignments Table 1 shows the the averaged TC and SP accuracy values for each reference set of BAliBASE v.3. In Figure 2, the SP values of major methods are plotted as functions of the sequence similarities. The improvement in the SP value from a standard method (ClustalW) and the currently most accurate method (L-INS-i) was approximately 20%, for protein alignments with percent identity of ti 20%. A comparison among the accuracies of three types algorithms [(a) progressive methods, (b) itera- tive methods without pairwise alignment information, (c) methods incorporating pairwise alignment information] revealed a tendency of (c) > (b) > (a). There was no statistically significant differencein accuracy among the three most accurate methods (L-INS-i, ProbCons and TCoffee), when BAliBASE v.3 was used for the evaluation. Also in the other two benchmark tests (HOMATRAD+0 and SABmark+0 in Table 2), the differencesin accuracy values were small. The SABmark test is possibly biased to MAFFT, since the default parameters of MAFFT were optimized by applying FFT-NS-2 to a part of SABmark as noted above. On the other hand, the default parameters of ProbCons were trained via an unsupervised learning method on unaligned sequences of the previous version of BAliBASE [5]. The accuracies are possibly improved by closely adjusting gap penalties and other parameters as in the case of RNA alignment (see section 3.4). To examine this possibility, we applied the downhill simplex method (e.g. [221)to search the optimal set of gap penalties of ProbCons and L-INS-i for 28 Katoh et al. MAFFT: A Multiple Sequence Alignment Program 29

each of refl 1, refl2 and ref5 of BAliBASE v.3. With the optimized penalties, the SP accuracy values of L-INS-i were 67.4%, 93.95% and 90.85% for refl 1, ref12 and ref5, respectively, whereas those of ProbCons were 65.69 %, 94.27% and 90.40% for refl 1, ref12 and ref5, respectively. For both methods, the difference between accuracy values with default penalties and that with the optimized penalties were quite small (<1%). This results suggest that the default parameters of these methods were already well tuned unlike the case of RNA alignment (section 3.4).

Table 2: Accuracy values for the HOMSTRAD and SABmark tests .

The TC scores are shown for HOMSTRAD and the fp scores (similar to SP) are shown for the SABmark test. Main parts of this were taken from our latest paper [15]. The +n columns indicate that n homologs collected from the SwissProt database were added to each of HOMSTRAD and SABmark tests, aligned together and then removed before calculating the accuracy scores. See the footnote of Table 1 for command line option of each method.

3.3 Effect of the Number of Homologs Involved in an Alignment To examine the effect of the number of homologsinvolved in an alignment, we carried out the following test using HOMSTRAD and SABmark. (i) Homologsof the each entry of HOMSTRAD and SABmark were collected with BLAST [1]. Three cases of the number of collected homologues, 20, 50 and 100 (only for HOMSTRAD), were considered in this study. (ii) The input sequences were aligned, together with the collected homologs. by each of the alignment methods. (iii) The homologs collected by BLAST were excluded from the alignment. (iv) Then, the resulting alignment was compared with the corresponding reference alignment (see ref. [15] for details). As expected, the accuracy of a multiple alignment increased as the number of collected homologs involved in the alignment increased. This tendency was remarkable for MAFFT, although similar behavior was observed for most of the methods to some extent. In both the HOMSTRAD and SABmark tests (Table 2), the improvement by adding 50 or 100 homologs were about 10 percentage points at most. The improvement was comparable to that by introducing the structural information of one or two proteins [21],and rather larger than that by modification of algorithm (from FFT-NS-i to [GL]-INS-i)presented in this paper. These results suggest the importance of including a number of homologs for obtaining an accurate sequence alignment. An ability to handle a large number of sequences is therefore required for a multiple sequence alignment program.

3.4 RNA Alignments Figure 3 shows two types of accuracy values (SCI and SP) for the BRAliBASE test. From the BRAliBASE tests, a serious problem of the previous versions of MAFFT (v.5.5 and lower) has been turned out; there are too many gaps in comparison with the reference alignment and the accuracy values were remarkably low. Thus the gap penalties was tripled in the latest version (v.5.6 and 30 Katoh et al.

Figure 3: The SCI (A) and SP (B) values for BRAliBASE test. The curves were fitted to the all accuracy values for BRAliBASE using a cubic spline. Percent identity was calculated from reference alignments. In each panel, two thin lines show the accuracy values with default gap penalties of MAFFT v.4.22 and v.5.63. Bold lines show the accuracy values with the optimized gap penalties for BRAliBASE. Accuracy values of ClustalW are also shown with default (thin) and optimized (bold) gap penalties for reference. higher) compared to the old penalties, which were the same to the penalties for protein alignments. Although the new parameter set showed higher accuracy values than the previous one, they are not yet optimum for BRAliBASE. The accuracy values of other methods, such as ClustalW and PRRN, were also improved by changing gap penalties from default. As BRAliBASE is the first large benchmark database for RNA alignments and the parameters of sequence alignment programs are not yet tuned well, the accuracy values are possibly drastically improved by using this information. The improvement in accuracy by introducing the new objective function was smaller than the effect of changing gap penalties. The difference between the average accuracy value of FFT-NS-i and that of

L-INS-i was only •`0.5% for both SCI and SP. The possible reasons are: (i) Such a objective function is effective for protein alignments but not for RNA alignments. (ii) The new objective function possibly improves the accuracy of RNA alignments but the difference was not detected in this analysis, because

BRAliBASE contains only more simple problems than BAliBASE, every entry of BRAliBASE consists of sequences of similar lengths and the number of sequences is restricted to five.

3.5 CPU Time

The L-INS-i option is at least several times faster than TCoffee and ProbCons as shown in Figure 4. However, the process of L-INS-i contains all pairwise alignment, which requires CPU time proportional to N2L2, where N is the number of sequences and L is the sequence length. Because of the dependence on the number and length of the sequences, the L-INS-i options are considerably slower than the other options (FFT-NS-[i2])of MAFFT, in which a distance matrix is calculated by a very rapid way. When the highest accuracy is not required, the FFT-NS-i or FFT-NS-2 option may be still useful, considering the tradeoff between computational time and accuracy.

3.6 Conclusion

The L-INS-i option of MAFFT is one of the most accurate methods and, in particular, suitable to alignments having medium number of sequences (20-200). The CPU time required for L-INS-i is relatively small. This method has a drawback that it requires all-pairwise alignments, which consumes a large amount of CPU time. A possible way to further improve the performance is the re-examination MAFFT: A Multiple Sequence Alignment Program 31

Figure 4: The CPU times required for various sizes of alignments by various methods. Dash-dotted lines represent progressive methods (a), dotted lines represent iterative refinement methods (b) and solid lines represent methods incorporating pairwise alignment information (c). A and B, The number of input sequences (N) versus CPU time. Average sequence length was fixed at 300 aa. C and D, Average length (L) of input sequences versus CPU time. The number of sequence was fixed at 40. A and C simulates conserved alignments (average distance among input sequenceswas set at 100 PAM), whereas B and D simulates diverged alignments (average distance among input sequences was set at 250 PAM). of the objective function. More simple and relevant objective function may be useful for exploring the relationship between sequence and protein/RNA structure, as well as for reducing the CPU time.

Acknowledgments

We thank J. D. Thompson, P. P. Gardner and T. Lassmann for discussion and permitting us to use unpublished data.

References

[1] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res., 25:3389-3402, 1997.

[2] Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L., Studholme, D. J., Yeats, C., and Eddy, S. R., The pfam protein families database, Nucleic Acids Res., 32:D138-D141, 2004. [3] Berger, M. P. and Munson, P. J., A novel randomized iterative strategy for aligning multiple protein sequences, Comput. Appi. Biosci., 7:479-484, 1991. 32 Katoh et al.

[4] Chandonia, J. M., Hon, G., Walker, N. S., Lo Conte, L., Kochi, P., Levitt, M., and Brenner, S. E., The ASTRAL compendium in 2004, Nucleic Acids Res., 32:D189 -D192, 2004.

[5] Do, C. B., Mahabhashyam, M. S., Brudno, M., and Batzoglou, S., ProbCons: Probabilistic consistency-based multiple sequence alignment, Genorne Res., 15:330-340, 2005.

[6] Edgar, R. C., MUSCLE: A multiple sequence alignment method with reduced time and space complexity, BMC Bioinformatics, 5: 113, 2004.

[7] Edgar, R. C., MUSCLE: Multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res., 32:1792-1797, 2004.

[8] Feng, D. F. and Doolittle, R. F., Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol., 25:351-360, 1987. [9] Gardner, P. P., Wilm, A., and Washietl, S., A benchmark of multiple sequence alignment pro- grams upon structural RNAs, Nucleic Acids Res., 33(8):2433-2439, 2005. [10] Gotoh, O., A weighting system and algorithm for aligning many phylogenetically related se- quences, Comput. Appl. Biosci., 11:543-551, 1995. [11] Gotoh, O., Optimal alignment between groups of sequences and its application to multiple sequence alignment, Comput. Appl. Biosci., 9:361-370, 1993.

[12] Gotoh, O., Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments, J. Mol. Biol., 264:823-838, 1996.

[13] Grasso, C. and Lee, C., Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems, Bioinfor- matics, 20:1546-1556, 2004.

[14] Jones, D. T., Taylor, W. R., and Thornton, J. M. The rapid generation of mutation data matrices from protein sequences, Comput. Appl. Biosci., 8:275-282, 1992.

[15] Katoh, K., Kuma, K., Toh, H., and Miyata, T., MAFFT version 5: Improvement in accuracy of multiple sequence alignment, Nucleic Acids Res., 33:511-518, 2005.

[16] Katoh, K., Misawa, K., Kuma, K., and Miyata, T., MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res., 30:3059-3066, 2002.

[17] Morgenstern, B., Dress, A., and Werner, T., Multiple DNA and protein sequence alignment based on segment-to-segment comparison, Proc. Natl. Acad. Sci. USA,93: 12098-12103, 1996.

[18] Needleman, S. B. and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol., 48:443-453, 1970.

[19] Notredame, C., Higgins, D. G., and Heringa, J., T-Coffee: A novel method for fast and accurate multiple sequence alignment, J. Mol. Biol., 302:205-217, 2000.

[20] Notredame, C., Holm, L., and Higgins, D. G., COFFEE: An objective function for multiple sequence alignments, Bioinformatics, 14(5):407-422, 1998.

[21] O'Sullivan, O., Suhre, K., Abergel, C., Higgins,D. G., and Notredame, C., 3DCoffee:Combining protein sequences and structures within multiple sequence alignments, J. Mol. Biol., 340:385-395, 2004. MAFFT: A Multiple Sequence Alignment Program 33

[22] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P., Numerical Recipes in C: the Art of Scientific Computing, Cambridge University Press, Cambridge, 2nd edition, 1995.

[23] Rawlings, N. D., Tolle, D. P., and Barrett, A. J., MEROPS: The peptidase database, Nucleic Acids Res., 32:D160 D164, 2004.

[24] Simossis,V. A., Kleinjung, J., and Heringa, J., Homology-extendedsequence alignment, Nucleic Acids Res., 33:816 824, 2005.

[25] Smith, T. F. and Waterman, M. S., Identification of common molecular subsequences, J. Mol. Biol., 147:195-197, 1981.

[26] Sokal, R. R. and Michener, C. D., A statistical method for evaluating systematic relationships, University of Kansas Scientific Bulletin, 28:1409-1438, 1958.

[27] Stebbings, L. A. and Mizuguchi, K., HOMSTRAD: Recent developments of the homologous protein structure alignment database, Nucleic Acids Res., 32:D203-D207, 2004. [28] Subramanian, A. R., Weyer-Menkhoff,J., Kaufmann, M., and Morgenstern, B., DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment, BMC Bioinformatics, 6:66, 2005.

[29] Tateno, Y., Ikeo, K., Imanishi, T., Watanabe, H., Endo, T., Yamaguchi, Y., Suzuki, Y., Taka- hashi, K., Tsunoyama, K., Kawai, M., Kawanishi, Y., Naitou, K., and Gojobori, T., Evolutionary motif and its biological and structural significance, J. Mol. Evol., 44:S38-S43, 1997.

[30] Thompson, J. D., Higgins, D. G., and Gibson, T. J., W: Improving the sensitiv- ity of progressive multiple sequence alignment through sequence weighting, position-specificgap penalties and weight matrix choice, Nucleic Acids Res., 22:4673-4680, 1994. [31] Thompson, J. D., Plewniak, F., and Poch, O., BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs, Bioinformatics, 15:87-88, 1999.

[32] Van Walle, I., Lasters, I., and Wyns, L., Align-m - a new algorithm for multiple alignment of highly divergent sequences, Bioinformatics, 20:1428-1435, 2004.

[33] Washietl, S., Hofacker, I. L., and Stadler, P. F. Fast and reliable prediction of noncoding RNAs, Proc. Natl. Acad. Sci. USA, 102:2454-2459, 2005.

[34] Yamada, S., Gotoli, O., and Yamana, H., Extension of PRRN: Implementation of a doubly nested randomized iterative refinement strategy under a piecewiselinear gap cost, Genome Informatics, 15:P082, 2004.