MAFFT: a Novel Method for Rapid Multiple Sequence Alignment Based on Fast Fourier Transform
Total Page:16
File Type:pdf, Size:1020Kb
ã 2002 Oxford University Press Nucleic Acids Research, 2002, Vol. 30 No. 14 3059±3066 MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform Kazutaka Katoh, Kazuharu Misawa1, Kei-ichi Kuma and Takashi Miyata* Department of Biophysics, Graduate School of Science, Kyoto University, Kyoto 606-8502, Japan and 1Institute of Molecular Evolutionary Genetics, Pennsylvania State University, University Park, PA 16802, USA Received April 8, 2002; Revised and Accepted May 24, 2002 ABSTRACT dif®culty, various heuristic methods, including progressive methods (3) and iterative re®nement methods (4±6), have been A multiple sequence alignment program, MAFFT, proposed to date. They are mostly based on various combin- has been developed. The CPU time is drastically ations of successive two-dimensional DP, which takes CPU reduced as compared with existing methods. time proportional to N2. MAFFT includes two novel techniques. (i) Homo- Even if these heuristic methods successfully provide the logous regions are rapidly identi®ed by the fast optimal alignments, there remains the problem of whether the Fourier transform (FFT), in which an amino acid optimal alignment really corresponds to the biologically sequence is converted to a sequence composed of correct one. The accuracy of resulting alignments is greatly volume and polarity values of each amino acid resi- affected by the scoring system. Thompson et al. (7) developed due. (ii) We propose a simpli®ed scoring system a complicated scoring system in their program CLUSTALW, that performs well for reducing CPU time and in which gap penalties and other parameters are carefully increasing the accuracy of alignments even for adjusted according to the features of input sequences, such as sequence divergence, length, local hydropathy and so on. sequences having large insertions or extensions as Nevertheless, no existing scoring system is able to process well as distantly related sequences of similar length. correctly global alignments for various types of problems Two different heuristics, the progressive method including large terminal extension of internal insertion (8). (FFT-NS-2) and the iterative re®nement method Considerable improvements in the accuracy have recently (FFT-NS-i), are implemented in MAFFT. The perform- been made in CLUSTALW (7) version 1.8, the most popular ances of FFT-NS-2 and FFT-NS-i were compared alignment program with excellent portability and operativity, with other methods by computer simulations and and T-COFFEE (9), which provides alignments of the highest benchmark tests; the CPU time of FFT-NS-2 is dras- accuracy among known methods to date. tically reduced as compared with CLUSTALW with On the other hand, few improvements have been made comparable accuracy. FFT-NS-i is over 100 times successfully to reduce the CPU time, since the proposal of the faster than T-COFFEE, when the number of input progressive method by Feng and Doolittle (3). A high-speed computer program applicable to large-scale problems is sequences exceeds 60, without sacri®cing the becoming more important with the rapid increase in the accuracy. number of protein and DNA sequences. In order to improve the speed of DP, it is effective to use highly homologous segments in the procedure of multiple sequence alignment INTRODUCTION (10). There are well-known homology search programs, such Multiple sequence alignment is a basic tool in various aspects as FASTA (11) and BLAST (12), based on string matching of molecular biological analyses ranging from detecting key algorithms. functional residues to inferring the evolutionary history of a In this report, we developed a novel method for multiple protein family. It is, however, dif®cult to align distantly sequence alignment based on the fast Fourier transform (FFT), related sequences correctly without manual inspections by which allows rapid detection of homologous segments. In expert knowledge. Many efforts have been made on the spite of its great ef®ciency, FFT has rarely been used problems concerning the optimization of sequence alignment. practically for detecting sequence similarities (13,14). We Needleman and Wunsch (1) presented an algorithm for also propose an improved scoring system, which performs sequence comparison based on dynamic programming (DP), well even for sequences having large insertions or extensions by which the optimal alignment between two sequences is as well as distantly related sequences of similar length. The obtained. The generalization of this algorithm to multiple ef®ciency (CPU time and accuracy) of the method was sequence alignment (2) is not applicable to a practical tested by computer simulations and the BAliBASE (15) alignment that consists of dozens or hundreds of sequences, benchmark tests in comparison with several existing methods. since it requires huge CPU time proportional to NK, where K is These tests showed that the CPU time has been drastically the number of sequences each with length N. To overcome this reduced, whereas the accuracy of the resulting alignments is *To whom correspondence should be addressed. Tel: +81 75 753 4220; Fax: +81 75 753 4223; Email: [email protected] 3060 Nucleic Acids Research, 2002, Vol. 30 No. 14 comparable with that of the most accurate methods among existing ones. METHODS Group-to-group alignments by FFT The frequency of amino acid substitutions strongly depends on the difference of physico-chemical properties, particularly volume and polarity, between the amino acid pair involved in the substitution (16). Substitutions between physico-chemic- ally similar amino acids tend to preserve the structure of proteins, and such neutral substitutions have been accumu- lated in molecules during evolution (17). It is therefore reasonable to consider that an amino acid a is assigned to a vector whose components are the volume value v(a) and the polarity value p(a) introduced by Grantham (18). We use the Figure 1. (A) A result of the FFT analysis. There are two peaks correspond- normalized forms of these values: vÃ(a) = [v(a) ± vÅ]/sv and ing to two homologous blocks. (B) Sliding window analysis is carried out pÃ(a) = [ p(a) ± pÅ]/sp, where an overbar denotes the average and the positions of homologous blocks are determined. Note that window size is 30 (see text) but the window size is set to 4 in (B) for simplicity. over 20 amino acids, and sv and sp denote the standard deviations of volume and polarity, respectively. An amino acid sequence is converted to a sequence of such vectors. The correlation cp(k) of polarity component between two Calculation of the correlation between two amino acid sequences sequences. We de®ne the correlation c(k) between two X sequences of such vectors as cp k pÃ1 npÃ2 n k 6 1nN;1nkM c(k) = cv(k) + cp(k), 1 where cv(k) and cp(k) are, as de®ned below, the correlations of is calculated in the same manner. volume component and polarity component, respectively, between two amino acid sequences to be aligned. The Finding homologous segments. If two sequences compared correlation c(k) represents the degree of similarity of two have homologous regions, the correlation c(k) has some peaks sequences with the positional lag of k sites. The high value of corresponding to these regions (Fig. 1A). By the FFT analysis, c(k) indicates that the sequences may have homologous however, we can know only the positional lag k of a regions. homologous region in two sequences but not the position of The correlation cv(k) of volume component between the region. As shown in Figure 1B, to determine the positions sequence 1 and sequence 2 with the positional lag of k sites of the homologous region in each sequence, a sliding window is de®ned as analysis with the window size of 30 sites is carried out, in X which the degree of local homologies is calculated for each of the highest 20 peaks in the correlation c(k). We identify a cv k vÃ1 nvÃ2 n k; 2 1nN;1nkM segment of 30 sites with score value exceeding a given threshold (0.7 per site in our program, see below for details of the scoring system) as a homologous segment. If two or more where vÃ1(n) and vÃ2(n) are the volume component of the nth site successive segments are identi®ed as homologous segments, of sequence 1 with the length of N and that of sequence 2 with they are combined into one segment of larger length. If the the length of M, respectively. Considering N ' M in many length of the combined segment is longer than 150 sites, the cases, equation 2 takes O(N2) operations. The FFT reduces the segment is divided into several segments with 150 sites each. CPU time of this calculation to O(N log N) (19). If V1(m) and V2(m) are the Fourier transform of vÃ1(n) and vÃ2(n), i.e. Dividing a homology matrix. To obtain an alignment between two sequences, the homologous segments must be arranged vÃ1(n) Û V1(m) 3 consistently in both sequences. A matrix Sij(1 < i, j < n, n is the number of homologous segments) is constructed in the vÃ2(n) Û V2(m), 4 following manner. If the ith homologous segment on sequence 1 corresponds to the jth homologous segment on sequence 2, it is known that the correlation cv(k) is expressed as Sij has the score value of the homologous segment calculated above; otherwise Sij is set to 0. By applying the standard DP * cv(k) Û V1 (m) ´ V2(m), 5 procedure to matrix Sij, we obtain the optimal path, which corresponds to the optimal arrangement of homologous where Û represents transform pairs, and the asterisk denotes segments. Figure 2A shows an example in which ®ve complex conjugation.