A Simple Method to Control Over-Alignment in the MAFFT Multiple Sequence Alignment Program Kazutaka Katoh1,* and Daron M

Bioinformatics, 32(13), 2016, 1933–1942 doi: 10.1093/bioinformatics/btw108 Advance Access Publication Date: 26 February 2016 Original Paper Sequence analysis A simple method to control over-alignment in the MAFFT multiple sequence alignment program Kazutaka Katoh1,* and Daron M. Standley1,2 1Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan and 2Institute for Virus Research, Kyoto University, Kyoto 606-8507, Japan *To whom correspondence should be addressed. Associate Editor: Janet Kelso Received on October 5, 2015; revised on February 15, 2016; accepted on February 19, 2016 Abstract Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment is recently becoming greater, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difﬁculty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method signiﬁcantly increases the correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction introduced due to difficulty in gene prediction (Brent, 2005; Gotoh Many comparative analyses of biological sequences utilize multiple et al.,2014; Nagy and Patthy, 2013; Yandell and Ence, 2012). With sequence alignment (MSA), and the quality of an MSA can affect incorrect reading frames, unrelated amino acid segments can appear the results of downstream analyses. Therefore, even incremental im- in a set of homologous sequences. Even if such errors could be com- provements in MSA quality can have wide-ranging effects. One area pletely excluded, splice variants of a gene can be included in an MSA in MSA technology that can potentially be improved is robustness of a Eukaryotic gene family. Another source of difficulty is when against noise in sequence data. small structural domains are surrounded by intrinsically disordered or As a result of large-scale sequencing projects, we now have access low-complexity regions (Thompson et al.,2011). Thus, the existence to many amino acid and nucleotide sequences from widely divergent of unrelated segments, or noise, in an MSA is common, especially organisms. Unfortunately, the quality of the sequences is not always in large-scale analyses of amino acid sequences, for which human in- high, partly due to limitations in sequencing technologies. Moreover, spection is difficult. Therefore robustness in such situations is an im- at the amino acid sequence level, a number of errors can be portant feature for any MSA method used for large-scale analysis. VC The Author 2016. Published by Oxford University Press. 1933 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] 1934 K.Katoh and D.M.Standley X XX Most MSA methods assume that the input sequences are all hom- is normalized such that fAMorig ðA; AÞ¼1and fAfBMorigðA; ologous. When the input sequences have unrelated segments, these seg- BÞ¼0, in the case of MAFFT, where fA and fB are the frequencies of ments often end up being aligned. This results in ‘over-alignment’; i.e. amino acids A and B, respectively. sites that are aligned but are, in fact, non-homologous (Blackburne and When a is close to one, some of the diagonal elements have nega- Whelan, 2012a; Schwartz and Pachter, 2007). Some MSA methods, tive values. In such a case, there can be situations where the optimal including PRANK and BAli-Phy, are aware of this problem (Bradley solution is to insert gaps even for identical sequences, et al., 2009; Loytynoja€ and Goldman, 2005, 2008; Redelings, 2014; WC - - - - - - - - - SATSATSATWG Suchard and Redelings, 2006). However, these methods are not ranked highly in standard benchmarks based on protein structural alignments WCSATSATSAT - - - - - - - - - WG (Sievers et al., 2011). This is not surprising because the standard criter- This occurs for the following reason; the score for the match of ion is essentially based on sensitivity, a consequence of which is that S-S, A-A and T-T are 0.815, 0.815 and 0.924, respectively, in the the over-alignment problem is not taken into account. BLOSUM62 matrix normalized as above. In an unrealistic case of Methods such as PRANK and BAli-Phy usually utilize simulated a ¼ 1, the score for matching the segment SATSATSAT, –1.338, is input to test alignment accuracy. Simulation-based benchmarks gen- negative. As a result, depending on gap cost, the above alignment erally have a limitation in that the simulation setting is inevitably can have a better score than the gap-less alignment, which has a oversimplified and the applicability to real-world situations is un- ‘cost’ of –1.338 for matching the segment when a ¼ 1. Although clear. On the other hand, the use of real data is problematic, when this example is an extreme one, it indicates a possibility that the seg- considering over-alignment, due to the arbitrariness of reference ments with weak or ambiguous similarity are not aligned if a posi- alignments (Edgar, 2010). More specifically, structural alignments tive value, a, is subtracted from the scoring matrix. Inversely, if can only provide information about sites that are structurally con- a positive value is added to the scoring matrix, even non-similar resi- served, not sites that are unrelated. In simulation-based benchmarks dues are matched. These effects were discussed in Vingron and that take into consideration over-alignment, PRANK and BAli-Phy Waterman (1994). consistently outperform other methods (Loytynoja€ and Goldman, An alignment with a high a value (close to one), where low- 2008; Redelings, 2014). Thus there is a discrepancy between bench- similarity segments tend not to be aligned, is preferable when the ex- marks based on real data and simulated data. To assess the over- pected evolutionary distance between the sequences is small. alignment problem, both types of benchmarks are necessary. Imagine a case where some dissimilar segments are found in a set of Here, we describe a simple method to control over-alignment dir- closely-related sequences. It is reasonable to infer that these seg- ectly and flexibly in a protein MSA. In brief, the proposed method ments were inserted due to some unusual factors, such as sequencing utilizes agreement between the local segments and the entire sequence errors or alternative splicing, and thus should be gapped. to determine which residues should be aligned; if there is a dissimilar Meanwhile, a low a value (close to zero) is useful when the evolu- segment in a globally highly similar pair of sequences, the dissimilar tionary distance between the sequences is expected to be large, segment is gapped. This is accomplished by using a variable scoring where low-similarity segments have to be aligned. matrix (VSM) that adapts to the global similarity between a pair of Thus we should use a large a value for closely-related sequences sequences (or groups) within an MSA given a single additional par- and a small or zero a value for distantly-related sequences. The value ameter that controls the risk of over-alignment. This is not a novel of a can be interpreted as the ‘lowest similarity level to align’, rang- idea since it is based on a combination of two techniques known for ing from zero to one, where zero corresponds to unrelated sequences more than 20 years: (i) adjusting the overall average of a scoring ma- and one corresponds to complete match. In the case of global align- trix (Vingron and Waterman, 1994) by rescaling the matrix according ment, we can estimate the expected similarity level, or evolutionary to the similarity of the pair of sequences or groups to be aligned and distance d, of the entire input sequences. It is natural to use large a (ii) use of different scoring matrices in an MSA, as originally proposed for a pair with small d, and to use small a for a pair with large d.In in ClustalW (Thompson et al.,1994), but probably not inherited by the case of MAFFT, the distance dij between sequences i and j is esti- Clustal Omega (Sievers et al.,2011). We have implemented the VSM mated as a value between zero and one, dij ¼ 1 ðtij=min ðtii; tjjÞÞ; technique in the MAFFT program (Katoh et al.,2002; Katoh and where tij is the alignment score between sequences i and j. So, we use Standley, 2013), which is one of the most sensitive methods in terms a variable a(d), depending on d and a new parameter, amax, to dy- of standard criteria (Sievers et al.,2011; Thompson et al.,2011). namically generate a variable scoring matrix (VSM) according to the distance d between the pair to be aligned, as 2 Methods MðA; B; dÞ¼MorigðA; BÞaðdÞ (2) Protein sequence alignment by dynamic programming (DP) aðdÞ¼ amax À d if amax > d (Needleman and Wunsch, 1970) uses a scoring matrix with 20 Â 20 elements (e.g.

A Simple Method to Control Over-Alignment in the MAFFT Multiple Sequence Alignment Program Kazutaka Katoh1,* and Daron M

T-Coffee Documentation Release Version 13.45.47.Aba98c5

Comparative Analysis of Multiple Sequence Alignment Tools

Ple Sequence Alignment Methods: Evidence from Data

Determination of Optimal Parameters of MAFFT Program Based on Balibase3.0 Database

The Art of Multiple Sequence Alignment in R

HMMER User's Guide

Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT

Multiple Sequence Alignment

Human Genome Graph Construction from Multiple Long-Read Assemblies

MAFFT: a Novel Method for Rapid Multiple Sequence Alignment Based on Fast Fourier Transform

A User-Friendly Sequence Editor for Mac OS X

Download, Update Or Check a User-Defined Database Before Searching It