© 2002 Oxford University Press. Nucleic Acids Research, 2002, Vol. 30 No. 14 3059–3066
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform
Kazutaka Katoh, Kazuharu Misawa¹, Kei-ichi Kuma and Takashi Miyata*
Department of Biophysics, Graduate School of Science, Kyoto University, Kyoto 606-8502, Japan and ¹Institute of Molecular Evolutionary Genetics, Pennsylvania State University, University Park, PA 16802, USA
Received April 8, 2002; Revised and Accepted May 24, 2002
*To whom correspondence should be addressed. Tel: +81 75 753 4220; Fax: +81 75 753 4223; Email: [email protected]

ABSTRACT

A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.

INTRODUCTION

Multiple sequence alignment is a basic tool in various aspects of molecular biological analyses ranging from detecting key functional residues to inferring the evolutionary history of a protein family. It is, however, difficult to align distantly related sequences correctly without manual inspections by expert knowledge. Many efforts have been made on the problems concerning the optimization of sequence alignment. Needleman and Wunsch (1) presented an algorithm for sequence comparison based on dynamic programming (DP), by which the optimal alignment between two sequences is obtained. The generalization of this algorithm to multiple sequence alignment (2) is not applicable to a practical alignment that consists of dozens or hundreds of sequences, since it requires huge CPU time proportional to $N^K$, where $K$ is the number of sequences each with length $N$. To overcome this difficulty, various heuristic methods, including progressive methods (3) and iterative refinement methods (4–6), have been proposed to date. They are mostly based on various combinations of successive two-dimensional DP, which takes CPU time proportional to $N^2$.

Even if these heuristic methods successfully provide the optimal alignments, there remains the problem of whether the optimal alignment really corresponds to the biologically correct one. The accuracy of resulting alignments is greatly affected by the scoring system. Thompson et al. (7) developed a complicated scoring system in their program CLUSTALW, in which gap penalties and other parameters are carefully adjusted according to the features of input sequences, such as sequence divergence, length, local hydropathy and so on. Nevertheless, no existing scoring system is able to process correctly global alignments for various types of problems including large terminal extension or internal insertion (8). Considerable improvements in the accuracy have recently been made in CLUSTALW (7) version 1.8, the most popular alignment program with excellent portability and operativity, and T-COFFEE (9), which provides alignments of the highest accuracy among known methods to date.

On the other hand, few improvements have been made successfully to reduce the CPU time, since the proposal of the progressive method by Feng and Doolittle (3). A high-speed computer program applicable to large-scale problems is becoming more important with the rapid increase in the number of protein and DNA sequences. In order to improve the speed of DP, it is effective to use highly homologous segments in the procedure of multiple sequence alignment (10). There are well-known homology search programs, such as FASTA (11) and BLAST (12), based on string matching algorithms.

In this report, we developed a novel method for multiple sequence alignment based on the fast Fourier transform (FFT), which allows rapid detection of homologous segments. In spite of its great efficiency, FFT has rarely been used practically for detecting sequence similarities (13,14). We also propose an improved scoring system, which performs well even for sequences having large insertions or extensions as well as distantly related sequences of similar length. The efficiency (CPU time and accuracy) of the method was tested by computer simulations and the BAliBASE (15) benchmark tests in comparison with several existing methods. These tests showed that the CPU time has been drastically reduced, whereas the accuracy of the resulting alignments is comparable with that of the most accurate methods among existing ones.
METHODS

Group-to-group alignments by FFT

The frequency of amino acid substitutions strongly depends on the difference of physico-chemical properties, particularly volume and polarity, between the amino acid pair involved in the substitution (16). Substitutions between physico-chemically similar amino acids tend to preserve the structure of proteins, and such neutral substitutions have been accumulated in molecules during evolution (17). It is therefore reasonable to consider that an amino acid $a$ is assigned to a vector whose components are the volume value $v(a)$ and the polarity value $p(a)$ introduced by Grantham (18). We use the normalized forms of these values: $\hat{v}(a) = [v(a) - \bar{v}]/\sigma_v$ and $\hat{p}(a) = [p(a) - \bar{p}]/\sigma_p$, where an overbar denotes the average over 20 amino acids, and $\sigma_v$ and $\sigma_p$ denote the standard deviations of volume and polarity, respectively. An amino acid sequence is converted to a sequence of such vectors.

Calculation of the correlation between two sequences. We define the correlation $c(k)$ between two sequences of such vectors as

$$c(k) = c_v(k) + c_p(k), \tag{1}$$

where $c_v(k)$ and $c_p(k)$ are, as defined below, the correlations of volume component and polarity component, respectively, between two amino acid sequences to be aligned. The correlation $c(k)$ represents the degree of similarity of two sequences with the positional lag of $k$ sites. The high value of $c(k)$ indicates that the sequences may have homologous regions.

The correlation $c_v(k)$ of volume component between sequence 1 and sequence 2 with the positional lag of $k$ sites is defined as

$$c_v(k) = \sum_{1 \le n \le N,\; 1 \le n+k \le M} \hat{v}_1(n)\,\hat{v}_2(n+k), \tag{2}$$

where $\hat{v}_1(n)$ and $\hat{v}_2(n)$ are the volume component of the $n$th site of sequence 1 with the length of $N$ and that of sequence 2 with the length of $M$, respectively. Considering $N \approx M$ in many cases, equation 2 takes $O(N^2)$ operations. The FFT reduces the CPU time of this calculation to $O(N \log N)$ (19). If $V_1(m)$ and $V_2(m)$ are the Fourier transform of $\hat{v}_1(n)$ and $\hat{v}_2(n)$, i.e.

$$\hat{v}_1(n) \Leftrightarrow V_1(m) \tag{3}$$

$$\hat{v}_2(n) \Leftrightarrow V_2(m), \tag{4}$$

it is known that the correlation $c_v(k)$ is expressed as

$$c_v(k) \Leftrightarrow V_1^{*}(m) \cdot V_2(m), \tag{5}$$

where $\Leftrightarrow$ represents transform pairs, and the asterisk denotes complex conjugation.

The correlation $c_p(k)$ of polarity component between two amino acid sequences

$$c_p(k) = \sum_{1 \le n \le N,\; 1 \le n+k \le M} \hat{p}_1(n)\,\hat{p}_2(n+k) \tag{6}$$

is calculated in the same manner.

Finding homologous segments. If two sequences compared have homologous regions, the correlation $c(k)$ has some peaks corresponding to these regions (Fig. 1A). By the FFT analysis, however, we can know only the positional lag $k$ of a homologous region in two sequences but not the position of the region. As shown in Figure 1B, to determine the positions of the homologous region in each sequence, a sliding window analysis with the window size of 30 sites is carried out, in which the degree of local homologies is calculated for each of the highest 20 peaks in the correlation $c(k)$. We identify a segment of 30 sites with score value exceeding a given threshold (0.7 per site in our program, see below for details of the scoring system) as a homologous segment. If two or more successive segments are identified as homologous segments, they are combined into one segment of larger length. If the length of the combined segment is longer than 150 sites, the segment is divided into several segments with 150 sites each.

Figure 1. (A) A result of the FFT analysis. There are two peaks corresponding to two homologous blocks. (B) Sliding window analysis is carried out and the positions of homologous blocks are determined. Note that window size is 30 (see text) but the window size is set to 4 in (B) for simplicity.

Dividing a homology matrix. To obtain an alignment between two sequences, the homologous segments must be arranged consistently in both sequences. A matrix $S_{ij}$ ($1 \le i, j \le n$, $n$ is the number of homologous segments) is constructed in the following manner. If the $i$th homologous segment on sequence 1 corresponds to the $j$th homologous segment on sequence 2, $S_{ij}$ has the score value of the homologous segment calculated above; otherwise $S_{ij}$ is set to 0. By applying the standard DP procedure to matrix $S_{ij}$, we obtain the optimal path, which corresponds to the optimal arrangement of homologous segments. Figure 2A shows an example in which five homologous segments exist. The order of segments in sequence 1 differs from that in sequence 2.
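As an illustration only, the two steps described so far can be sketched in Python: the lag correlation of equations 2 and 5, obtained through a Fourier transform, and a DP over the segment matrix $S_{ij}$. This is not the MAFFT source code. The function names (`fft`, `lag_correlation`, `arrange_segments`) and the toy input values are hypothetical, and the DP recurrence (a maximum-score monotone path) is only one plausible reading of "the standard DP procedure".

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out  # the inverse is left unscaled; the caller divides by n


def lag_correlation(x1, x2):
    """Return {k: sum_n x1[n] * x2[n + k]} for every lag k (equation 2),
    computed as the inverse transform of conj(X1) * X2 (equation 5)."""
    n, m = len(x1), len(x2)
    size = 1
    while size < n + m - 1:  # zero-pad so that lags do not wrap around
        size *= 2
    f1 = fft([complex(v) for v in x1] + [0j] * (size - n))
    f2 = fft([complex(v) for v in x2] + [0j] * (size - m))
    product = [a.conjugate() * b for a, b in zip(f1, f2)]
    circular = [z.real / size for z in fft(product, inverse=True)]
    # Negative lags are stored at the end of the circular result.
    return {k: circular[k % size] for k in range(-(n - 1), m)}


def arrange_segments(S):
    """Assumed segment-level DP: pick cells (i, j) of S with increasing i and j
    that maximize the summed score. Returns the chosen (i, j) pairs."""
    n1, n2 = len(S), len(S[0])
    D = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            D[i][j] = max(D[i - 1][j], D[i][j - 1],
                          D[i - 1][j - 1] + S[i - 1][j - 1])
    path, i, j = [], n1, n2
    while i > 0 and j > 0:  # trace the optimal path back from the corner
        if S[i - 1][j - 1] > 0 and D[i][j] == D[i - 1][j - 1] + S[i - 1][j - 1]:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]


# Toy check with made-up values: sequence 2 carries sequence 1 shifted by 3 sites.
v1 = [1.0, -2.0, 3.0, -1.0, 2.0]
v2 = [0.0, 0.0, 0.0] + v1 + [0.0]
c_v = lag_correlation(v1, v2)
best_lag = max(c_v, key=c_v.get)  # lag with the highest correlation
```

In this sketch the inputs to `lag_correlation` stand for already normalized property values; the full $c(k)$ of equation 1 would be the sum of the results for the volume and polarity tracks. Padding to a power of two no smaller than $N + M - 1$ keeps the circular correlation equal to the linear one for every lag. With a matrix in which only $S_{23}$ and $S_{32}$ conflict, `arrange_segments` keeps whichever of the two has the larger score.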
The optimal path depends on $S_{23}$ and $S_{32}$; if $S_{23} > S_{32}$, the path with bold arrows is optimal.

Overall homology matrix is divided into some sub-matrices at the boundary corresponding to the center of homologous segments as illustrated in Figure 2B. As a result, the shaded area in Figure 2B is excluded from the calculation. As many homologous segments are detected by FFT, the CPU time is reduced.

Figure 2. (A) An example of the segment-level DP; (B) Reducing the area for DP on a homology matrix.

Extension to group-to-group alignments. The procedure described above can be easily extended to group-to-group alignment by considering equations 2 and 6 as a special case with one sequence in each group. These equations are extended to group-to-group alignment by replacing $\hat{v}_1(n)$ with $\hat{v}_{\mathrm{group1}}(n)$, which is the linear combination of the volume components of the members belonging to group 1:

$$\hat{v}_{\mathrm{group1}}(n) = \sum_{i \in \mathrm{group1}} w_i \cdot \hat{v}_i(n),$$

where $w_i$ is the weighting factor for sequence $i$, which is calculated in the same manner as CLUSTALW (7) for the progressive method, or in the same manner as Gotoh's (20) weighting system for the iterative refinement method. Similarly, polarity component is calculated as:

$$\hat{p}_{\mathrm{group1}}(n) = \sum_{i \in \mathrm{group1}} w_i \cdot \hat{p}_i(n).$$

This method is applicable to nucleotide sequences by converting a sequence to a sequence of four-dimensional ...

... the Needleman–Wunsch (NW) algorithm performs well with all-positive matrices, in which all elements have positive values. CLUSTALW (7) and other methods use such all-positive matrices by default. Since Vogt et al. (21) examined only the cases in which members of each protein family are similar in length, it is not clear whether such all-positive matrices are suitable to various alignment problems, particularly to those of different length. Accordingly, contrary to existing methods, we adopted a normalized similarity matrix $\hat{M}_{ab}$ ($a$ and $b$ are amino acids) that has both positive and negative values:

$$\hat{M}_{ab} = \frac{M_{ab} - \mathrm{average2}}{\mathrm{average1} - \mathrm{average2}} + S^{a}, \tag{7}$$

where $\mathrm{average1} = \sum_a f_a M_{aa}$, $\mathrm{average2} = \sum_{a,b} f_a f_b M_{ab}$, $M_{ab}$ is raw similarity matrix, $f_a$ is the frequency of occurrence of amino acid $a$, and $S^{a}$ is a parameter that functions as a gap extension penalty. Under this similarity matrix $\hat{M}_{ab}$, the score per site between two random sequences is $S^{a}$, and the score per site between two identical sequences is $1.0 + S^{a}$. If $S^{a}$ is much smaller than unity, gaps are scored virtually equivalent to random amino acid sequences.

The default parameters of our program are: $M_{ab}$ is the 200 PAM log-odds matrix by Jones et al. (22), $f_a$ is the frequency of occurrence for amino acid $a$ calculated by Jones et al. (22), $S^{op}$ (gap opening penalty, defined below) is 2.4 and $S^{a}$ is 0.06, for amino acid sequences. For nucleic acid sequences, $M_{ab}$ is the 200 PAM log-odds matrix calculated from Kimura's two parameter model (23) with transition/transversion ratio of 2.0, $f_a$ is 0.25, $S^{op}$ is 2.4 and $S^{a}$ is 0.06.

Gap penalty. Homology matrix $H(i, j)$ between two amino acid sequences $A(i)$ and $B(j)$ is constructed from the similarity matrix as $H(i, j) = \hat{M}_{A(i)B(j)}$, where $i$ and $j$ are positions in sequences. When two groups of sequences are aligned, homology matrix between group 1 and group 2 is calculated as:

$$H(i, j) = \sum_{n \in \mathrm{group1},\; m \in \mathrm{group2}} w_n w_m \hat{M}_{A(n,i)B(m,j)},$$

where $A(n, i)$ indicates the $i$th site of the $n$th sequence in group 1, $B(m, j)$ is the $j$th site of the $m$th sequence in group 2, and $w_n$ is the weighting factor, defined previously, for the $n$th sequence. In the NW algorithm (1), the optimal alignment between two groups of sequences is calculated as: