Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT
Total Page:16
File Type:pdf, Size:1020Kb
22 Genome Informatics 16(1): 22-33 (2005) Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT Kazutaka Katohl Kei-ichi Kuma1 [email protected] [email protected] Takashi Miyata2" Hiroyuki Toh5 [email protected] [email protected] 1 Bioinformatics Center , Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan 2 Biohistory Research Hall , Takatsuki, Osaka 569-1125, Japan 3 Department of Electrical Engineering and Bioscience , Science and Engineering, Waseda University, Tokyo 169-8555, Japan 4 Department of Biophysics , Graduate School of Science, Kyoto University, Kyoto 606- 8502, Japan 5 Division of Bioinformatics , Research Center for Prevention of Infectious Diseases, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582,Japan Abstract In 2002, we developed and released a rapid multiple sequence alignment program MAFFT that was designed to handle a huge (up to•`5,000 sequences) and long data (,-2,000 as or •`5,000 nt) in a reasonable time on a standard desktop PC. As for the accuracy, however, the previous versions (v.4 and lower) of MAFFT were outperformed by ProbCons and TCoffee v.2, both of which were released in 2004, in several benchmark tests. Here we report a recent extension of MAFFT that aims to improve the accuracy with as little cost of calculation time as possible. The extended version of MAFFT (v.5) has new iterative refinement options, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-iin this report). These options use a new objective function combining the weighted sum-of-pairs (WSP) score and a score similar to COFFEE derived from all pairwise alignments. We discuss the improvement in accuracy brought by this extension, mainly using two benchmark tests released very recently, BAliBASE v.3 (for protein alignments) and BR,AliBASE (for RNA alignments). According to BAliBASE v.3, the overall average accuracy of L-INS-i was higher than those of other methods successively released in 2004, although the difference among the most accurate methods (ProbCons, TCoffee v.2 and new options of MAFFT) was small. The advantage in accuracy of [GL]-INS-ibecame greater for the alignments consisting of — 50 - 100 sequences. By utilizing this feature of MAFFT, we also examined another possible approach to improve the accuracy by incorporating homolog information collected from database. The [GL]-INS-i options are applicable to aligning up to r\-,200 sequences, although not applicable to thousands of sequences because of time and space complexities. Keywords: multiple sequence alignment, dynamic programming, objective function 1 Introduction Multiple alignment of protein or DNA sequences is a classical problem in the area of bioinformatics and an important tool required in a wide range of molecular biological analyses such as detecting conserved residues and inferring evolutionary history of genes and organisms. As a large amount of sequence information is being determined and accumulated by genome and cDNA projects of various organisms, the importance of a rapid and accurate algorithm for multiple sequence alignment has been recognized. Under such a situation, several groups are now actively investigating the algorithms MAFFT: A Multiple Sequence Alignment Program 23 Figure 1: Outlined process of MAFFT without pairwise alignment (A) and with pairwise alignments (B). of multiple sequence alignment and a number of new programs have been released since last year [5-7, 13, 15, 28, 32]. ClustalW [30] released in 1994 is a standard multiple sequence alignment software. It employs the progressive method [8], which is simple but has room for improvement in accuracy and speed. Many of recent methods have tried to improve the accuracy. Gotoh [12]found that the alignment accuracy is significantly improved by an iterative refinement method using the weighted sum-of-pairs (WSP) score as an objective function. He developed a program, PRRN, that adopts the method. DIALIGN [17]and TCoffee [19] achieved high accuracies by combining local and global alignment features. Although the ability to combine different types of alignments derived from heterogeneous sources, such as structural alignment, is important [21],TCoffee requires a large CPU time proportional to N3. Thus it is hard to apply TCoffee to a large alignment consisting of dozens of sequences. The accuracy of TCoffee has recently been further improved in ver. 2 in comparison with ver. 1. Do et al. [5] proposed a new method ProbCons, whose accuracy is comparable to or slightly higher than TCoffee v.2 as shown in Table 1, and rather faster than TCoffee. Katoh et al. [16]developed MAFFT, in 2002, that aimed to reduce the CPU time of the progressive method. MAFFT also has an iterative refinement option that is a simplified version of PRRN. Edgar [6, 7] also implemented progressiveand iterative alignment strategies in MUSCLE in 2004. The algorithms of major options of MUSCLE seem similar to the NW-NS-[i2]options of MAFFT explained below. To reduce the CPU time, UPGMA [26]is more efficientlycoded in MUSCLE.Lassmann and Sonnhammer (manuscript submitted) released Kalign, in Mar. 2005, which adopts a progressive approach to achieve the reduction of CPU time. Kalign has a merit that the length dependency of the CPU time is near to linear. The previous version of MAFFT is used in part by several projects such as PFAM [2],ASTRAL [4] and MEROPS [23]. The outline of procedure of the previous version of MAFFT is explained below (see Figure 1A also). An initial alignment is constructed by the progressive method [8,30] and then improved by the iterative refinement method [3, 11]. A user can select an appropriate strategy from the fastest one (FFT-NS-1) to the most accurate one (FFT-NS-i) in v.4. 1. Progressive alignment (1). A rough distance between every pair of input sequences is rapidly calculated based on the number of 6-tuples shared by the two sequences [7, 14,16]. A guide tree is constructed from the distance matrix using UPGMA [26] with modified linkage. Input sequences are progressively aligned [8, 30] followingthe branching order of the guide tree. This procedure is referred to as FFT-NS-1. 2. Progressive alignment (2). The initial distance matrix is less reliable than that based on all pairwise alignments. We can obtain a more reliable distance matrix by using the FFT-NS-1 alignment [7, 16,29]. Progressive alignment is re-performed based on the new guide tree calcu- lated from the new distance matrix. This method is referred to as FFT-NS-2. 24 Katoh et al. 3. Iterative refinement. The FFT-NS-2 alignment is further improved by the iterative refinement method [3,11] that maximizes the weighted sum of pairs (WSP) score proposed by Gotoh [10], using an approximate group-to-group alignment algorithm [16]. This procedure is referred to as FFT-NS-i. For the progressive alignment processes, a fast Fourier transform (FFT) approximation [16] is used in the FFT-NS-2, FFT-NS-1 and FFT-NS-i options (collectively denoted as FFT-NS-[12i]hereafter). When the input sequences are highly conserved, these options require CPU times effectively propor- tional to average sequence length L for alignments of a single gene. The slopes of the FFT-NS-[2i] lines are indeed close to 1 in Figure 4C. MAFFT also offers three more options NW-NS-[12i]with full dynamic programming (DP) [18]for the same part. In contrast to the case of FFT-NS-[12i], the time complexity of full DP options is O(L2), regardless of the similarity among input sequences. In addition to these options, two additional strategies, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-ihereafter), have been introduced in v.5. The [GL]-INS-istrategies use a new objective function combining the WSP [12]score and a score similar to COFFEE [20] derived from all pairwise alignments in order to improve the alignment accuracy. Iterative refinement with the objective function is performed in reasonable time, as [GL] INS-i were designed to have low computational complexity. See Methods for details of the algorithms and the objective function. Recently, reference databases of protein alignments based on 3D structural superposition, such as HOMSTRAD [27], SABmark [32] and PREFAB [7], have been independently developed. They are valuable resources to find good parameters for a sequence alignment program as well as to evaluate the performance of a program. After the publication of our latest paper describing MAFFT v.5 [15] in Jan. 2005, two benchmark databases have already been released. (i) The most widely used benchmark test, BAliBASE [31],has been substantially updated to v.3 (Thompson et al., manuscript submitted) and now contains alignments that are difficult to make accurate residue-to-residue correspondences with only sequence information, due to high sequence divergence, long indels, etc. That is, the align- ments are considered to represent difficult cases we may encounter when we deal with real problems. (ii) BRAliBASE, the first reference database for RNA alignment, has been recently released [9]. Thus, we compared the performances of the currently available sequence alignment programs including the new version of MAFFT, using the new reference alignment databases. The parameters of MAFFT and other methods are not closely optimized for these new benchmark datasets. It was suggested that the accuracy of a multiple alignment of distantly related sequences can be improved if they are aligned together with a number of their homologs, because the information from many sequences is expected to reduce the 'noise' [12]. This point has recently been investigated by Katoh et al. [15]and Simossis et al. [24]. We discuss the effect of number of homologs involved in an alignment in this report also.