Multiple Sequence Alignment: a Major Challenge to Large-Scale Phylogenetics
Total Page:16
File Type:pdf, Size:1020Kb
Multiple sequence alignment: a major challenge to large-scale phylogenetics November 18, 2010 · Tree of Life Kevin Liu1, C. Randal Linder2, Tandy Warnow3 1 Postdoctoral Research Associate, Austin, TX, 2 Associate Professor, Austin, TX, 3 Professor of Computer Science, Department of Computer Science, University of Texas at Austin Liu K, Linder CR, Warnow T. Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLOS Currents Tree of Life. 2010 Nov 18 [last modified: 2012 Apr 23]. Edition 1. doi: 10.1371/currents.RRN1198. Abstract Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates. Therefore, the estimation of highly accurate multiple sequence alignments is a major challenge for Tree of Life projects, and more generally for large-scale systematics studies. Funding Statement The research was supported by the US National Science Foundation DEB 0733029, and by Microsoft Research through support to TW. Introduction Species trees are typically estimated from a collection of genes (or other genomic regions), in one of two ways. The first way produces alignments on each gene, and then runs a phylogeny estimation method on the concatenation of these alignments; the second way estimates trees for each gene based upon an alignment of the sequences, and then combines these estimated gene trees into a species tree. In both cases, therefore, the accuracy of the resultant species tree depends directly or indirectly upon the alignments that are produced for each gene [1] [2] [3] [4] [5] [6] [7]. Because of the centrality of sequence alignment to phylogenetics (and other problems in biology), many alignment methods have been developed. ClustalW [8] is perhaps the most well known, and probably the most frequently used alignment method in systematics, but there are many others, including MAFFT [9], T-Coffee [10], Probcons [11], POY [12], and Muscle [13], that are used in the systematics PLOS Currents Tree of Life 1 community. There are also newer methods, i.e., Opal [14], Prank [15], Prank+GT [16], FSA [17], POY* [18], SATCHMO [19], ProbAlign [20], ProbTree [21], BALi-Phy [22], and SATé [16], which have also been developed but are less frequently used. Evaluations of these methods on both simulated and biological datasets show that alignment accuracy impacts tree accuracy, and that some methods (notably MAFFT and SATé) can produce highly accurate alignments on large datasets, and hence make it possible to construct highly accurate trees when trees are computed using maximum likelihood (ML) [7] [16]. However, these evaluations have been largely limited to datasets containing at most 1000 sequences, and so are not necessarily relevant to large-scale systematics studies. In this paper, we explore the performance of alignment methods on a collection of nucleotide datasets containing large numbers of sequences. We report computational requirements, including running time and memory usage, and also the accuracy of the alignments and of maximum likelihood (ML) trees estimated on these alignments. These comparisons show striking differences between methods. Alignment methods differ in their computational requirements, with some methods incapable of analyzing datasets beyond a few hundred sequences, and others able to analyze datasets with tens of thousands of sequences. Running time and memory usage for some alignment methods can be enormous, sometimes exceeding the computational requirements of RAxML [23] [24], currently one of the most frequently used methods for large-scale phylogenetic ML estimation. Methods also differ substantially with respect to alignment and subsequent tree accuracy, so that the selection of alignment method is very important to the systematics study. The main observation in this study is that only a small number of multiple sequence alignment methods can be run on datasets with several thousand sequences. Furthermore, of those methods that can be run on large datasets, none produces alignments of sufficient accuracy to be useful in estimating highly accurate phylogenies. Therefore, when datasets are large and have evolved with many indels, the input to species tree estimations (i.e. either super-alignments or gene trees) are likely to be of poor accuracy; consequently, this suggests that species tree estimates will also have reduced accuracy for large datasets, directly because of alignment difficulties. Methodology Overview We used several rRNA biological datasets (of up to 27,643 sequences) and one simulated dataset (of 78,132 sequences) to evaluate the alignment methods. The biological datasets have curated alignments based upon secondary structure and reference trees produced by computing maximum likelihood trees with bootstrapping on each curated alignment, retaining only the high support edges. For the simulated dataset, we have the true alignment and true tree. See Table 1 and the supplementary online materials [25] for details about these data. For each dataset, we attempted to compute alignments using a collection of alignment methods, and for each alignment that we generated, we computed a maximum likelihood tree using either RAxML or FastTree [26] [27] . We compared the estimated alignments to the true or curated alignment to determine the alignment error rate, and we compared the estimated ML tree to the true or reference tree to determine the tree error rate. We also recorded running time for each analysis. Datasets The biological datasets were drawn from Robin Gutell’s Comparative RNA Website (CRW) [28] and have curated alignments based upon secondary structure. While secondary structure alignments are highly reliable, there is no guarantee that these are perfectly correct. However, these are the most reliable benchmarks available to PLOS Currents Tree of Life 2 date for testing nucleotide alignment methods, other than using simulated data. These biological datasets contain 16S and 23S sequences, markers that are frequently used for estimating phylogenies. They range in size from about 100 sequences to almost 28,000 sequences. We used the preprocessing steps in [16] on these datasets to remove taxa that were more than 50% unsequenced, and to remove illegal characters. The empirical statistics for the resulting curated alignments after preprocessing are listed in Table 1. Note that these biological datasets are relatively “gappy”, with indels occupying 60-90% of the curated alignment matrix. They also have different gap length distributions, and different average and maximum p-distances. Thus, these datasets have properties that are challenging for alignment and phylogeny estimation, and are realistic examples of datasets used in large-scale phylogenetic studies. See the Supplementary Materials online webpage [25] for these data. The simulated dataset was obtained from [27], with 78,132 DNA sequences and 1,287 sites in the true alignment. This dataset is the last replicate from the simulation by [27], obtained at http://www.microbesonline.org/fasttree/; we call this the “Price 78K” dataset. The empirical statistics for the true alignment on this dataset are in Table 1. Many of the empirical statistics for the Price 78K dataset are similar to those of the empirical datasets; however, the Price 78K dataset has hardly any indels (in fact the gappiness is two orders of magnitude less than those exhibited in the biological datasets) and the average gap length is extremely short. As a result, the Price 78K dataset is potentially easier to align than the biological datasets. Because of its size, however, the Price 78K dataset provides a test of scalability. Reference alignments and trees were computed as follows. For the Price 78K dataset, we used the true alignment as the reference alignment, and the true tree as the reference tree; these are known because the Price 78K dataset is simulated. For the biological datasets, the curated alignment, modified by our minor cleaning operations, is the reference alignment. We computed unaligned sequences from the reference alignments by simply deleting indels (“-”). To obtain reference trees for the biological datasets, we performed rapid bootstrapping analyses with RAxML [29] on the curated alignment to produce a tree with support values, and then contracted all edges with