Multiple sequence alignment methods: evidence from data

Tandy Warnow

Alignment Error/Accuracy

• SPFN: percentage of homologies in the true alignment that are not recovered (false negative homologies)
• SPFP: percentage of homologies in the estimated alignment that are false (false positive homologies)
• TC: total number of columns correctly recovered
• SP-score: percentage of homologies in the true alignment that are recovered
• Pairs score: 1 - (average of SPFN and SPFP)

Benchmarks

• Simulations: can control everything, and true alignment is not disputed
  – Different simulators
• Biological: can't control anything, and reference alignment might not be true alignment
  – BAliBASE, HomFam, Prefab
  – CRW (Comparative RNA Website)

Alignment Methods (Sample)

• Clustal-Omega
• MAFFT
• Muscle
• Opal
• Prank/Pagan
• Probcons

Co-estimation of trees and alignments

• BAli-Phy and Alifritz (statistical co-estimation)
• SATe-1, SATe-2, and PASTA (divide-and-conquer co-estimation)
• POY and BeeTLe (treelength optimization)

Other Criteria

• Tree topology error
• Tree branch length error
• Gap length distribution
• Insertion/deletion ratio
• Alignment length
• Number of indels

How does the guide tree impact accuracy?

• Does improving the accuracy of the guide tree help?
• Do all alignment methods respond identically? (Is the same guide tree good for all methods?)
• Do the default settings for the guide tree work well?

Alignment criteria

• Does the relative performance of methods depend on the alignment criterion?
• Which alignment criteria are predictive of tree accuracy?
• How should we design MSA methods to produce best accuracy?

Choice of best MSA method

• Does it depend on type of data (DNA or amino acids)?
• Does it depend on rate of evolution?
• Does it depend on gap length distribution?
• Does it depend on existence of fragments?

Katoh and Standley. doi:10.1093/molbev/mst010 MBE

Table 2. Comparison of Different Options Using the 16S.B.ALL Data Set (Mirarab et al. 2012).

Data    Method                                              Accuracy  CPU Time      Actual Time(a)
Case 1  mafft --multipair --addfragments frags existingmsa  0.9969    6.67 days     18.3 h
        mafft --6merpair --addfragments frags existingmsa   0.9949    3.76 h        36.2 min
        mafft --localpair --add frags existingmsa           0.9707    23.4 days(b)  2.43 days(b)
        mafft --6merpair --add frags existingmsa            0.9604    1.32 h        1.44 h
        profile alignment                                   0.2779    15.5 h        1.60 h
Case 2  mafft --6merpair --addfragments frags existingmsa   0.9969    4.54 h        33.8 min
Case 3  mafft --6merpair --addfragments frags existingmsa   0.9949    1.79 days     5.91 h

NOTE. The estimated alignments were compared with the CRW alignment to measure the accuracy (the number of correctly aligned letters/the number of aligned letters in the CRW alignment). Calculations were performed on a PC with 2.67 GHz Intel Xeon E7-8837/256 GB RAM (for the case marked with superscript alphabet "b"), or on a Linux PC with 3.47 GHz Intel Xeon X5690/48 GB RAM (for the other cases). Case 1: 13,822 sequences in the existing alignment × 13,821 fragments; Case 2: 1,000 sequences in the existing alignment × 138,210 fragments; Case 3: 13,822 sequences in the existing alignment × 138,210 fragments.
(a) Wall-clock time with 10 cores. Command-line argument for parallel processing is --thread 10.
(b) Full command-line options are as follows: mafft --localpair --weighti 0 --add frags existingmsa.
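Because the rows of Table 2 differ only in a few flags, it can help to generate and log the exact command line for each run. The following is a minimal sketch, not part of MAFFT: the helper name `mafft_add_command` and its parameters are hypothetical, and only flags that appear in Table 2 and its note (`--multipair`, `--6merpair`, `--localpair`, `--add`, `--addfragments`, `--thread`) are used. It only builds the argument list; actually running it would require MAFFT to be installed, with standard output redirected to a file.

```python
import shlex
from typing import List

# Pairwise-comparison modes that appear in Table 2.
PAIR_MODES = {"multipair", "6merpair", "localpair"}


def mafft_add_command(new_seqs: str, backbone_msa: str, *,
                      pair_mode: str = "6merpair",
                      fragments: bool = True,
                      threads: int = 1) -> List[str]:
    """Hypothetical helper: build the argv list for one Table 2 style run."""
    if pair_mode not in PAIR_MODES:
        raise ValueError(f"unknown pair mode: {pair_mode}")
    cmd = ["mafft"]
    if threads > 1:
        # Parallel processing flag quoted in the table note.
        cmd += ["--thread", str(threads)]
    cmd += [f"--{pair_mode}",
            "--addfragments" if fragments else "--add",
            new_seqs, backbone_msa]
    return cmd


cmd = mafft_add_command("frags", "existingmsa", threads=10)
# Record the exact command so the run can be reproduced and compared later.
print(shlex.join(cmd))
```

Logging the joined string alongside the MAFFT version number gives the information needed to compare results across studies.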

From Katoh and Standley, 2013 (dealing with fragmentary sequences). Mol. Biol. Evol. 30(4):772–780 doi:10.1093/molbev/mst010

to each other. Even if a more computationally expensive (and usually more accurate) method, L-INS-i, is applied (CPU time = 98 h), the alignment is still obviously incorrect (fig. 2B).

Two-step strategies can solve this type of problem. That is, a set of full-length sequences taken from databases are first aligned to build a backbone MSA, and then the new ITS1 and ITS2 sequences are added into this backbone MSA, using the --addfragments option.

    Step 1: mafft --auto full_length_sequences > backbone_msa
    Step 2: mafft --addfragments new_sequences backbone_msa > output

The second command is equivalent to

    mafft --multipair --addfragments new_sequences backbone_msa > output

in which Dynamic Programming (DP) is used to compare the distances between every new sequence and every sequence in the backbone MSA (--multipair is selected by default).

    mafft --6merpair --addfragments new_sequences backbone_msa > output

where distances are rapidly estimated using the number of shared 6mers, instead of DP.

The result of the latter option (--6merpair --addfragments) is shown in fig. 2D and E. The difference between D and E is just in the order of sequences; the sequences were reordered according to similarity using the --reorder option in E. In this alignment, ITS1 and ITS2 are clearly separated and aligned to appropriate positions in the full-length alignment. Moreover, this strategy is computationally much less expensive (CPU time = 15 min [first step] + 1.5 min [second step]) than the full application of L-INS-i (CPU time = 98 h). The former option (--multipair --addfragments) also returns a similar result to the latter (--6merpair) but is slower (CPU time = 48.6 min [second step]).

This case suggests that it is crucial to select a strategy appropriate to the problem of interest. The most time-consuming method, L-INS-i, is not always the most accurate one. The difficulty of this problem for standard approaches comes from the fact that ITS1 sequences and ITS2 sequences are not homologous to each other and most pairwise alignments are impossible. Because of these nonhomologous pairs, the distance matrix used for the guide tree calculation is not additive; the distances between ITS1 and full-length sequences and those between ITS2 and full-length sequences are close to zero, whereas the distances between ITS1 and ITS2 are quite large. In this situation, it is difficult for normal distance-based tree-building methods to give a reasonable tree. Moreover, in the alignment step, the objective function of the L-INS-i is affected by inappropriate pairwise alignment scores between ITS1 and ITS2. Such problems can be avoided by just ignoring the relationship between ITS1 and ITS2, as done in the --addfragments option.

In addition, a result of the second type of misuse of mafft-profile (discussed earlier) is shown in figure 2C. Some new sequences are correctly aligned but others are obviously incorrectly aligned (note that the order of sequences in fig. 2C is identical to that in fig. 2D). These misalignments are due to an incorrect assumption on phylogenetic placement of new sequences shown in figure 1C.

Test Case 2: Bacterial SSU rRNA

Another case is the 16S.B.ALL data set by Mirarab et al. (2012). It consists of an MSA of 13,822 bacterial SSU rRNA sequences, taken from the Gutell Comparative RNA Website (CRW) (Cannone et al. 2002) and 138,210 fragmentary sequences, which are originally included in the CRW alignment but ungapped and artificially truncated. In Katoh and Standley (2013), we used a subset (13,821 fragmentary sequences) prepared by Mirarab et al. (2012). In addition to this subset, here we use the full data set (138,210 fragmentary sequences), to examine the scalability. Suppose a situation where we already have a manually curated (or backbone) MSA and a newly determined set of many fragmentary sequences in a metagenomics project, and we need an entire MSA of them.

The first four lines in table 2 (case 1) show the performances of various options for such an analysis, with a relatively small data set (13,822 sequences in the existing

Important!

• Each method can be run in different ways, so you need to know the exact command used to be able to evaluate performance. (You also need to know the version number!)

Clustal-Omega study

• Clustal-Omega (Sievers et al., Molecular Systems Biology 2011) is the latest in the Clustal family of MSA methods
• Clustal-Omega is designed primarily for amino acid alignment, but can be used on nucleotide datasets
• Alignment criterion: TC (column score)
• Datasets: biological with structural alignments

High-quality protein MSAs using Clustal Omega F Sievers et al

Prefab results

TC Score shown (larger is better) on Prefab structural benchmark of AA alignments. Note that best performing method depends on the "%ID" (measure of similarity).

From Sievers et al., Molecular Systems Biology 2011

attention to gaps. These gap positions are not included in these tests as they tend not to be structurally conserved. Dialign (Morgenstern et al, 1998) does not use consistency or progressive alignment but is based on finding best local multiple alignments. FSA (Bradley et al, 2009) uses sampling of pairwise alignments and 'sequence annealing' and has been shown to deliver good nucleotide sequence alignments in the past.

The Prefab benchmark test results are shown in Table II. Here, the results are divided into five groups according to the percent identity of the sequences. The overall scores range from 53 to 73% of columns correct. The consistency-based programs MSAprobs, MAFFT L-INS-i, Probalign, Probcons and T-Coffee, are again the most accurate but with long run times. Clustal Omega is close to the consistency programs in accuracy but is much faster. There is then a gap to the faster progressive based programs of MUSCLE, MAFFT, Kalign (Lassmann and Sonnhammer, 2005) and Clustal W.

Results from testing large alignments with up to 50 000 sequences are given in Table III using HomFam. Here, each alignment is made up of a core of a Homstrad (Mizuguchi et al, 1998) structure-based alignment of at least five sequences. These sequences are then inserted into a test set of sequences from the corresponding, homologous, Pfam domain. This gives very large sets of sequences to be aligned but the testing is only carried out on the sequences with known structures. Only some programs are able to deliver alignments at all, with data sets of this size. We restricted the comparisons to Clustal Omega, MAFFT, MUSCLE and Kalign. MAFFT with default settings has a limit of 20 000 sequences and we only use MAFFT with --parttree for the last section of Table III. MUSCLE becomes increasingly slow when you get over 3000 sequences. Therefore, for >3000 sequences we used MUSCLE with the faster but less accurate setting of -maxiters 2, which restricts the number of iterations to two.

Overall, Clustal Omega is easily the most accurate program in Table III. The run times show MAFFT default and Kalign to be exceptionally fast on the smaller test cases and MAFFT --parttree to be very fast on the biggest families. Clustal Omega does scale well, however, with increasing numbers of sequences. This scaling is described in more detail in the Supplementary Information. We do have two further test cases with >50 000 sequences, but it was not possible to get results for these from MUSCLE or Kalign. These are described in the Supplementary Information as well.

Table III gives overall run times for the four programs evaluated with HomFam. Figure 1 resolves these run times case by case. Kalign is very fast for small families but does not scale as well. Overall, MAFFT is faster than the other programs over all test case sizes but Clustal Omega scales similarly. Points in Figure 1 represent different families with different average sequence lengths and pairwise identities. Therefore, the scalability trend is fuzzy, with larger dots occurring generally above smaller dots. Supplementary Figure S3 shows scalability data, where subsets of increasing size are sampled from one large family only. This reduces variability in pairwise identity and sequence length.

Table II Prefab results

[Rotated table. Columns: Aligner; average score over all 1682 families; scores for %ID ranges 0 to 20 (912 families), 20 to 40 (563 families), 40 to 70 (117 families) and 70 to 100 (90 families); Total time (s); Consistency. Aligners listed: MSAprobs, MAFFT (auto), Probalign, Probcons, T-Coffee, Clustal O, MUSCLE, MAFFT, Kalign, ClustalW2, Dialign, PRANK, FSA. Most cell values are scrambled in the extracted text; two rows survive intact without their aligner labels: 0.721 0.569 0.876 0.961 0.979 4544.45 Yes, and 0.700 0.535 0.866 0.967 0.980 1698.06 No.]

Total column scores (TC) are shown for different percent identity ranges; the second column is the average score over all test cases. The total run time in seconds is shown in the second last column. The last column indicates if the method is consistency based.

External profile alignment

Clustal Omega can read extra information from a profile HMM derived from preexisting alignments. For example, if a user

© 2011 EMBO and Macmillan Publishers Limited. Molecular Systems Biology 2011

about 10 000 sequences is MAFFT/PartTree (Katoh and Toh, 2007). It is very fast but leads to a loss in accuracy, which has to be compensated for by iteration and other heuristics. With Clustal Omega, we use a modified version of mBed (Blackshields et al, 2010), which has complexity of O(N log N), and which produces guide trees that are just as accurate as those from conventional methods. mBed works by 'emBedding' each sequence in a space of n dimensions where n is proportional to log N. Each sequence is then replaced by an n element vector, where each element is simply the distance to one of n 'reference sequences.' These vectors can then be clustered extremely quickly by standard methods such as K-means or UPGMA. In Clustal Omega, the alignments are then computed using the very accurate HHalign package (Söding, 2005), which aligns two profile hidden Markov models (Eddy, 1998).

Clustal Omega has a number of features for adding sequences to existing alignments or for using existing alignments to help align new sequences. One innovation is to allow users to specify a profile HMM that is derived from an alignment of sequences that are homologous to the input set. The sequences are then aligned to these 'external profiles' to help align them to the rest of the input set. There are already widely available collections of HMMs from many sources such as Pfam (Finn et al, 2009) and these can now be used to help users to align their sequences.

Results

Alignment accuracy

The standard method for measuring the accuracy of multiple alignment algorithms is to use benchmark test sets of reference alignments, generated with reference to three-dimensional structures. Here, we present results from a range of packages tested on three benchmarks: BAliBASE (Thompson et al, 2005), Prefab (Edgar, 2004) and an extended version of HomFam (Blackshields et al, 2010). For these tests, we just report results using the default settings for all programs but with two exceptions, which were needed to allow MUSCLE (Edgar, 2004) and MAFFT to align the biggest test cases in HomFam. For test cases with >3000 sequences, we run MUSCLE with the -maxiters parameter set to 2, in order to finish the alignments in reasonable times. Second, we have run several different programs from the MAFFT package. MAFFT (Katoh et al, 2002) consists of a series of programs that can be run separately or called automatically from a script with the --auto flag set. This flag chooses to run a slow, consistency-based program (L-INS-i) when the number and lengths of sequences is small. When the numbers exceed inbuilt thresholds, a conventional progressive aligner is used (FFT-NS-2). The latter is also the program that is run by default if MAFFT is called with no flags set. For very large data sets, the --parttree flag must be set on the command line and a very fast guide tree calculation is then used.

The results for the BAliBASE benchmark tests are shown in Table I. BAliBASE is divided into six 'references.' Average scores are given for each reference, along with total run times and average total column (TC) scores, which give the proportion of the total alignment columns that is recovered. A score of 1.0 indicates perfect agreement with the benchmark. There are two rows for the MAFFT package: MAFFT (auto) and MAFFT default. In most (203 out of 218) BAliBASE test cases, the number of sequences is small and the script runs L-INS-i, which is the slow accurate program that uses the consistency heuristic (Notredame et al, 2000) that is also used by MSAprobs (Liu et al, 2010), Probalign, Probcons (Do et al, 2005) and T-Coffee. These programs are all restricted to small numbers of sequences but tend to give accurate alignments. This is clearly reflected in the times and average scores in Table I. The times range from 25 min up to 22 h for these packages and the accuracies range from 55 to 61% of columns correct. Clustal Omega only takes 9 min for the same runs but has an accuracy level that is similar to that of Probcons and T-Coffee.

The rest of the table is mainly taken by the programs that use progressive alignment. Some of these are very fast but this speed is matched by a considerable drop in accuracy compared with the consistency-based programs and Clustal Omega. The weakest program here is Clustal W (Larkin et al, 2007) followed by PRANK (Löytynoja and Goldman, 2008). PRANK is not designed for aligning distantly related sequences but at giving good alignments for phylogenetic work with special

Table I BAliBASE results

Aligner          Av score  BB11   BB12   BB2    BB3    BB4    BB5    Tot time (s)  Consistency
Families         218       38     44     41     30     49     16

MSAprobs         0.607     0.441  0.865  0.464  0.607  0.622  0.608  12 382.00     Yes
Probalign        0.589     0.453  0.862  0.439  0.566  0.603  0.549  10 095.20     Yes
MAFFT (auto)     0.588     0.439  0.831  0.450  0.581  0.605  0.591  1475.40       Mostly (203/218)
Probcons         0.558     0.417  0.855  0.406  0.544  0.532  0.573  13 086.30     Yes
Clustal O        0.554     0.358  0.789  0.450  0.575  0.579  0.533  539.91        No
T-Coffee         0.551     0.410  0.848  0.402  0.491  0.545  0.587  81 041.50     Yes
Kalign           0.501     0.365  0.790  0.360  0.476  0.504  0.435  21.88         No
MUSCLE           0.475     0.318  0.804  0.350  0.409  0.450  0.460  789.57        No
MAFFT (default)  0.458     0.258  0.749  0.316  0.425  0.480  0.496  68.24         No
FSA              0.419     0.270  0.818  0.187  0.259  0.474  0.398  53 648.10     No
Dialign          0.415     0.265  0.696  0.292  0.312  0.441  0.425  3977.44       No
PRANK            0.376     0.223  0.680  0.257  0.321  0.360  0.356  128 355.00    No
ClustalW         0.374     0.227  0.712  0.220  0.272  0.396  0.308  766.47        No

The figures are total column scores produced using bali_score on core columns only. The average score over all families is given in the second column. The results for BAliBASE subgroupings are in columns 3–8. The total run time for all 218 families is given in the second last column. The last column indicates whether the method is consistency based.
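To make the total column (TC) score concrete, here is a minimal sketch of one way to compute it for two alignments of the same sequences. It is not the bali_score program: the function names are hypothetical, alignments are assumed to be lists of equal-length strings in the same row order with "-" for gaps, reference columns holding fewer than two residues are skipped (an assumption of this sketch), and the restriction to core columns is not modelled.

```python
from typing import FrozenSet, List, Tuple

Residue = Tuple[int, int]  # (row index, position within the ungapped sequence)


def columns(aln: List[str]) -> List[FrozenSet[Residue]]:
    """Return each alignment column as the set of residues it contains."""
    counters = [0] * len(aln)
    cols = []
    for j in range(len(aln[0])):
        col = []
        for i, row in enumerate(aln):
            if row[j] != "-":
                col.append((i, counters[i]))
                counters[i] += 1
        cols.append(frozenset(col))
    return cols


def tc_score(reference: List[str], estimated: List[str]) -> float:
    """Fraction of reference columns reproduced exactly in the estimate."""
    ref_cols = [c for c in columns(reference) if len(c) >= 2]
    est_cols = set(columns(estimated))
    if not ref_cols:
        return 0.0
    return sum(c in est_cols for c in ref_cols) / len(ref_cols)


reference = ["AC-GT", "ACTGT", "A--GT"]
estimated = ["ACG-T", "ACTGT", "A-G-T"]
print(tc_score(reference, estimated))  # 3 of 4 scored columns recovered -> 0.75
```

A single misplaced residue makes the whole column count as wrong, which is why TC scores are much lower than pairwise scores on hard datasets.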


BAliBASE is a collection of structurally based alignments of amino acid sequences

From Sievers et al., Molecular Systems Biology 2011

wishes to align a set of globin sequences and has an existing globin alignment, this alignment can be converted to a profile HMM and used as well as the sequence input file. This HMM is here referred to as an 'external profile' and its use in this way as 'external profile alignment' (EPA). During EPA, each sequence in the input set is aligned to the external profile. Pseudocount information from the external profile is then transferred, position by position, to the input sequence. Ideally, this would be used with large curated alignments of particular proteins or domains of interest such as are used in metagenomics projects. Rather than taking the input sequences and aligning them from scratch, every time new sequences are found, the alignment should be carefully maintained and used as an external profile for EPA. Clustal Omega also can align sequences to existing alignments using conventional alignment methods. Users can add sequences to an alignment, one by one or align a set of aligned sequences to the alignment.

In this paper, we demonstrate the EPA approach with two examples. First, we take the 94 HomFam test cases from the previous section and use the corresponding Pfam HMM for EPA. Before EPA, the average accuracy for the test cases was 0.627 of correctly aligned Homstrad positions but after EPA it rises to 0.653. This is plotted, test case for test case in Figure 2A. Each dot is one test case with the TC score for Clustal Omega plotted against the score using EPA. The second example is illustrated in Figure 2B. Here, we take all the BAliBASE reference sets and align them as normal using Clustal Omega and obtain the benchmark result of 0.554 of columns correctly aligned, as already reported in Table I. For EPA, we use the benchmark reference alignments themselves as external profiles. The results now jump to 0.857 of columns correct. This is a jump of over 30% and while it is not a valid measure of Clustal Omega accuracy for comparison with other programs, it does illustrate the potential power of EPA to use information in external alignments.

Iteration

EPA can also be used in a simple iteration scheme. Once a MSA has been made from a set of input sequences, it can be converted into a HMM and used for EPA to help realign the input sequences. This can also be combined with a full recalculation of the guide tree. In Figure 3, we show the results of one and two iterations on every test case from HomFam. The graph is plotted as a running average TC score for all test cases with N or fewer test cases where N is plotted on the horizontal axis using a log scale. With some smaller test cases, iteration actually has a detrimental effect. Once you get near 1000 or more sequences, however, a clear trend emerges. The more sequences you have, the more beneficial the effect of iteration is. With bigger test cases, it becomes more and more beneficial to apply two iterations. This result confirms the usefulness of EPA as a general strategy. It also confirms the difficulty in aligning extremely large numbers of sequences but gives one partial solution. It also gives a very simple but effective iteration scheme, not just for guide tree iteration, as used in many packages, but for iteration of the alignment itself.

Discussion

The main breakthroughs since the mid 1980s in MSA methods have been progressive alignment and the use of consistency. Otherwise, most recent work has concerned refinements for speed or accuracy on benchmark test sets. The speed increases have been dramatic but, with just two major exceptions, the methods are still basically O(N^2) and incapable of being extended to data sets of >10 000 sequences. The two exceptions are mBed, used here, and MAFFT PartTree. PartTree is faster but at the expense of accuracy, at least as judged by the benchmarking here. The second group of recent developments

Figure 1 Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE (green) and Kalign (purple) against the number of sequences of HomFam test sets. Average sequence length is rendered by point size. Both axes have logarithmic scales (x axis: #Sequences; y axis: Time (s)). Clustal Omega and Kalign were run with default flags over the entire range. MUSCLE was run with -maxiters 2 for N>3000 sequences. MAFFT was run with --parttree for N>10 000 sequences.

Table III HomFam benchmarking results

                     93≤N≤2957        3127≤N≤9105      10 099≤N≤50 157
                     (41 families)    (33 families)    (18 families)
Aligner              TC/t (s)         TC/t (s)         TC/t (s)
Clustal O            0.708/2114.0     0.639/11 719.5   0.464/27 328.9
Kalign               0.569/324.9      0.563/6752.0     0.420/286 711.0
MAFFT default        0.550/238.9      0.462/3115.4     -/-
MAFFT --parttree     -/-              -/-              0.253/6119.4
MUSCLE default       0.533/104 587.0  -/-              -/-
MUSCLE -maxiters 2   -/-              0.416/8239.2     0.216/110 292.0

The columns show total column score (TC) and total run time in seconds for groupings of small (<3000 sequences), medium (3000–10 000 sequences) and large (>10 000 sequences) HomFam test cases.


HomFam is a set of structurally based alignments of sets of amino acid sequences

From Sievers et al., Molecular Systems Biology 2011

Observations

• Relative and absolute accuracy (with respect to TC score) impacted by degree of heterogeneity and dataset size
• Some methods cannot run on large datasets
• On small datasets, Clustal-Omega not as accurate as best methods (Probalign, MAFFT, and MSAprobs)
• On large datasets, Clustal-Omega more accurate than other methods

Questions

• How do the different co-estimation methods compare with respect to tree error and alignment error?
  – POY and BeeTLe (tree-length optimization methods)
  – BAli-Phy and Alifritz (statistical co-estimation methods)
  – SATe-1, SATe-2, and PASTA (iterative)

Results about treelength

• Yes: solving treelength using affine gap penalties is better than using simple gap penalties.
• However, alignment accuracy is very low.
• Tree accuracy is good, if compared to maximum parsimony (MP) analyses of good alignments
• Tree accuracy is bad, if compared to maximum likelihood (ML) analyses of good alignments
• Not examined: better gap penalties

SATe "Family"

• Iterative divide-and-conquer methods
  – Each iteration uses the current tree with divide-and-conquer, to produce an alignment (running preferred MSA methods on subsets, and aligning alignments together)
  – Each iteration computes an ML tree on the current alignment, under Markov models of evolution that do not consider indels

SATe-I and SATe-II

• SATe (Simultaneous Alignment and Tree Estimation) was introduced in Liu et al., Science 2009; SATe-II (Liu et al., Systematic Biology 2012) was an improvement in accuracy and speed.
• Basic approach: iterate between alignment and tree estimation (using standard ML analysis on alignments)
• Stop after 24 hours, and return alignment/tree pair with best ML score
• Designed and tested only on nucleotide sequences

SATé Algorithm

Obtain initial alignment and estimated ML tree, then alternate:

• Use tree to compute new alignment
• Estimate ML tree on new alignment

SATé iteration (actual decomposition produces 32 subproblems)

• Decompose based on input tree (in the diagram, taxa A, B, C, D are split across edge e into subproblems {A, B} and {C, D})
• Align subproblems
• Merge subproblems
• Estimate ML tree on merged alignment

1000 taxon models, ordered by difficulty

24 hour SATé-I analysis, on desktop machines (Similar improvements for biological datasets)

MIRARAB ET AL.

FIG. 2. Tree error rates on nucleotide datasets. We show missing branch (also known as false negative or FN) rates for maximum likelihood trees estimated on the reference alignment as well as alignments computed using PASTA and other methods; results not shown indicate failure to complete within 24 hr using 12 cores on the datasets. Error bars show standard error over 10 replicates for all model conditions of the Indelible and the 10,000-sequence RNASim datasets.
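The missing branch (FN) rate can be illustrated with a short sketch. This is not the evaluation code used in the paper: it assumes trees are given as nested Python tuples of leaf names over the same leaf set, and the function names are hypothetical. Each internal branch is represented by the bipartition of leaves it induces, and the FN rate is the fraction of true-tree bipartitions that are absent from the estimated tree.

```python
from typing import FrozenSet, Set


def leaves(tree) -> FrozenSet[str]:
    """All leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions, each stored as the side without the anchor leaf."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick a canonical side
    splits: Set[FrozenSet[str]] = set()

    def visit(node) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if anchor not in below else all_leaves - below
        # Keep only branches with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def fn_rate(true_tree, estimated_tree) -> float:
    """Fraction of true-tree branches missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


true_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
est_tree = ((("A", "C"), "B"), ("D", ("E", "F")))
print(fn_rate(true_tree, est_tree))  # one of three internal branches is missing
```

In this toy case the estimate misplaces B and C, so one of the three internal branches of the true tree is not recovered.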

Alignment accuracy on AA datasets. Table 2 shows alignment accuracy on the AA datasets. Due to dataset sizes, Muscle and SATé-II failed to complete on two of the HomFam datasets, so we separate out the results for these two datasets from the remaining 17 HomFam datasets. PASTA had the best pairs score or was tied for the best pairs score for both HomFam and AA-10 datasets. Mafft had the best TC score for HomFam(17), but PASTA was very close. For HomFam(2), PASTA had the best TC score and Mafft was a close second. On AA-10 datasets, SATé-II had the best TC score and was closely trailed by Mafft and PASTA.

Comparison to SATé-II on 50,000-taxon dataset. SATé-II could not finish even one iteration on the RNASim with 50,000 sequences running for 24 hr and given 12 CPUs on TACC. However, we were able to

Table 1. Alignment Accuracy on Nucleotide Datasets

            Indelible - 10,000    RNASim                     CRW (16S)
Method      M4     M3     M2      10k    50k    100k   200k   16S.3  16S.T  16S.B.ALL

Column (TC) score
Clustal-O   160    10     X       13     X      X      X      12     0      1
Muscle      803    7      0       0      X      X      X      34     21     81
Mafft       337    13     0       28     30     26     X      75     85     15
Initial     422    106    18      11     15     5      4      33     X      24
SATé-II     977    758    792     35     X      X      X      89     60     87
PASTA       987    920    1151    152    311    492    823    71     121    102

Pairs score (mean of SP score and modeler score)
Clustal-O   0.97   0.34   X       0.65   X      X      X      0.57   0.53   0.60
Muscle      1.00   0.12   0.01    0.35   X      X      X      0.74   0.67   0.66
Mafft       1.00   0.76   0.02    0.72   0.73   0.72   X      0.75   0.70   0.71
Initial     0.99   0.98   0.91    0.87   0.88   0.87   0.88   0.86   X      0.95
SATé-II     1.00   0.93   0.72    0.56   X      X      X      0.76   0.65   0.66
PASTA       1.00   1.00   0.99    0.85   0.85   0.87   0.86   0.87   0.83   0.94

We show the number of correctly aligned sites (top) and the average of the SP-score and modeler score (bottom). X indicates that a method failed to run on a particular dataset given the computational constraints. ‘‘Initial’’ corresponds to the alignment approach used to obtain the starting tree of PASTA (HMMER failed to align one sequence in the 16S.T dataset) and Clustal-O stands for Clustal-Omega. Boldface indicates the best values for each model condition.
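The pairs score in the lower half of the table can be sketched as follows. This is an illustrative implementation rather than the scoring tool used in the paper: alignments are assumed to be lists of equal-length strings in the same row order with "-" for gaps, and the function names are hypothetical. The SP-score is the fraction of reference homologous pairs found in the estimate, the modeler score is the fraction of estimated pairs found in the reference, and the pairs score is their mean.

```python
from itertools import combinations
from typing import List, Set, Tuple

Residue = Tuple[int, int]  # (row index, position within the ungapped sequence)


def homologous_pairs(aln: List[str]) -> Set[Tuple[Residue, Residue]]:
    """All pairs of residues that share a column in the alignment."""
    counters = [0] * len(aln)
    pairs: Set[Tuple[Residue, Residue]] = set()
    for j in range(len(aln[0])):
        col = []
        for i, row in enumerate(aln):
            if row[j] != "-":
                col.append((i, counters[i]))
                counters[i] += 1
        # Rows are visited in order, so each pair has a canonical orientation.
        pairs.update(combinations(col, 2))
    return pairs


def pairs_score(reference: List[str], estimated: List[str]) -> Tuple[float, float, float]:
    """Return (SP-score, modeler score, pairs score)."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    shared = len(ref & est)
    sp = shared / len(ref) if ref else 0.0
    modeler = shared / len(est) if est else 0.0
    return sp, modeler, (sp + modeler) / 2


reference = ["AC-GT", "ACTGT", "A--GT"]
estimated = ["ACG-T", "ACTGT", "A-G-T"]
sp, modeler, pairs = pairs_score(reference, estimated)
print(sp, modeler, pairs)  # 8 of 10 pairs shared in each direction
```

Under these definitions SPFN is 1 minus the SP-score and SPFP is 1 minus the modeler score, so the pairs score equals 1 minus the average of SPFN and SPFP.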

Comparison of PASTA to SATe-II and other alignments on nucleotide datasets. From Mirarab et al., J. Computational Biology 2014

PASTA: ULTRA-LARGE MULTIPLE SEQUENCE ALIGNMENT

Table 2. Alignment Accuracy on AA Datasets

            Column (TC) score                  Pairs score
Method      AA-10  HomFam(17)  HomFam(2)       AA-10  HomFam(17)  HomFam(2)

Clustal-O   78     88          29              0.76   0.72        0.71
Muscle      48     51          X               0.70   0.52        X
Mafft       81     103         32              0.76   0.75        0.79
Initial     54     95          16              0.75   0.71        0.81
SATé-II     83     73          X               0.75   0.64        X
PASTA       80     102         36              0.76   0.78        0.83

We show TC (the number of correctly aligned sites, left) and the pairs score (the average of the SP-score and modeler score, right). X indicates that a method failed to run on a particular dataset given the computational constraints. "Initial" corresponds to the alignment approach used to obtain the starting tree of PASTA (HMMER failed to align one sequence in the 16S.T dataset). All values shown are averages over all datasets in each category. Boldface indicates the best values for each model condition.

Comparison of PASTA to SATe-II and other alignments on AA datasets. From Mirarab et al., J. Computational Biology 2014

run two iterations of SATé-II on a separate machine with no running time limits (12 Quad-Core AMD Opteron processors, 256GB of RAM memory). Given 12 CPUs, two iterations of SATé-II took 137 hr, compared to 10 hr for PASTA. However, the resulting SATé-II alignment recovered only 30 columns entirely correctly while PASTA recovered 311 columns. The pairs score of SATé-II was extremely poor (38.2%), while PASTA was quite accurate (81.0%). The tree produced by SATé-II had higher error than PASTA (12.6% versus 8.2% FN rate).

Impact of varying algorithmic parameters. We compared results obtained using four different starting trees: a random tree, the ML tree on the Mafft-PartTree alignment, PASTA's default starting tree, and the true (model) tree (see Table 3). After one iteration, PASTA alignments and trees based on our starting tree or true tree had roughly the same accuracy, and the starting tree based on Mafft-PartTree resulted in only a slightly worse tree (1% higher FN rate). However, using a random tree resulted in much higher tree error rates (52.3% error), and alignments that were also less accurate. Interestingly, after three iterations of PASTA, no noticeable difference could be detected between results from various starting trees. Thus, PASTA is robust to the choice of the starting tree. We also evaluated the impact of changing the alignment subset size (Table 4); these analyses showed that using alignment subsets of only 50 sequences improved the TC score and running time substantially, and only slightly changed the pairs score or tree error score. Although these analyses were performed only for two datasets, they suggest the possibility that improved results might be obtained through smaller alignment subsets.

Running Time. Figure 3 compares the running time (in hours) of different alignment methods. Note that PASTA was faster than SATé-II in all cases and could analyze datasets that SATé-II could not (i.e., the

Table 3. Effect of the Starting Tree on Final PASTA Alignment and Tree

                   Initial tree   Alignment accuracy    Tree error
Initial tree       Error (FN)     Pairs score     TC    FN
method

One iteration
  Random           100.0%         79.9%            2    52.3%
  Mafft-parttree    28.7%         87.0%          126    11.7%
  Starting tree     12.4%         86.8%          138    10.5%
  True tree          0%           86.1%          133    10.5%
Three iterations
  Random           100.0%         90.4%          138    11.0%
  Mafft-parttree    28.7%         83.9%          144    10.7%
  Starting tree     12.4%         88.8%          145    10.7%
  True tree          0%           90.8%          150    10.5%

Alignment accuracy and tree error are shown for PASTA with various starting trees, after one iteration (top) and three iterations (bottom) on one replicate of the 10k RNASim dataset.

Table S26: Comparison of alignment errors for two-phase and coestimation methods. SATéBML is the best likelihood method for SATé run until no improvements can be found with CT-5 proposals and with either all two-phase starting/tree alignment pairs or just RAxML(ClustalW). The four model conditions for which ALIFRITZ had not yet reported any ML trees are marked “N/A”. We report results for both the posterior decoding alignment and the MAP alignment from BAli-Phy’s MCMC walk. n = 1 for all values in the table.

                                                         BAli-Phy            BAli-Phy
Model     SATéBML   MAFFT   Prank+GT   Muscle   ClustalW   posterior-decoding  MAP     ALIFRITZ

100L1      21.3     20.1     41.7      30.6      54.1       12.4               15.3     N/A
100L2       1.7      1.9      1.7       2.4      12.9        1.1                1.8    11.0
100M1      31.8     29.2     63.2      39.3      56.9       29.0               29.0     N/A
100M2      12.1     17.5     13.1      15.9      39.1        6.1                7.9     N/A
100M3       3.3      4.0      3.1       3.3       8.5        2.3                3.0    16.4
100S1      27.8     29.4     35.5      39.4      40.9       12.0               13.6     N/A
100S2      13.4     19.5     13.4      18.0      27.3        6.9                9.2    47.5
Average    15.9     17.4     24.5      21.3      34.2       10.0               11.4     N/A

Alignment error is the average of SPFN and SPFP. However, BAli-Phy could not run on datasets with 500 or 1000 sequences. Results from Liu et al., Science 2009.
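To make the alignment-error measure concrete, here is a minimal Python sketch that computes SPFN, SPFP, and their average from a true and an estimated alignment. The dictionary representation of an alignment, the function names, and the toy sequences are assumptions made for illustration; they are not taken from any of the cited studies.

```python
from itertools import combinations


def homology_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` maps a sequence name to its gapped row ('-' marks a gap).
    A residue is identified as (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    counters = {name: 0 for name in names}
    pairs = set()
    for col in range(len(alignment[names[0]])):
        residues = []
        for name in names:
            if alignment[name][col] != "-":
                residues.append((name, counters[name]))
                counters[name] += 1
        # Names are visited in sorted order, so each pair has one canonical form.
        pairs.update(combinations(residues, 2))
    return pairs


def alignment_error(true_aln, est_aln):
    """Return (SPFN, SPFP, average of the two) as fractions between 0 and 1."""
    true_pairs = homology_pairs(true_aln)
    est_pairs = homology_pairs(est_aln)
    spfn = len(true_pairs - est_pairs) / len(true_pairs)  # true homologies missed
    spfp = len(est_pairs - true_pairs) / len(est_pairs)   # estimated homologies that are false
    return spfn, spfp, (spfn + spfp) / 2


# Toy data (hypothetical): the estimate misplaces one residue of sequence "a".
true_aln = {"a": "AC-G", "b": "A-TG"}
est_aln = {"a": "ACG-", "b": "A-TG"}
spfn, spfp, error = alignment_error(true_aln, est_aln)
print(spfn, spfp, error)
```

Multiplying the returned fractions by 100 gives percentages comparable in form to the table entries.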

Table S27: Comparison of tree errors for two-phase and coestimation methods. SATéBML is the best likelihood method for SATé run until no improvements can be found with CT-5 proposals and with either all two-phase starting/tree alignment pairs or just RAxML(ClustalW). The four model conditions for which ALIFRITZ had not yet reported any ML trees are marked “N/A”. We report results for both the majority consensus tree and MAP tree computed from BAli-Phy’s partially completed MCMC walk. Since the consensus tree is not usually binary, we also give the “FP” rate for BAli-Phy. n = 1 for all values in the table.

Missing branch rate (%)

          RAxML                RAxML     RAxML        RAxML      RAxML        BAli-Phy            BAli-Phy
Model     (TrueAln)  SATéBML   (MAFFT)   (Prank+GT)   (Muscle)   (ClustalW)   majority consensus  MAP     ALIFRITZ

100L1      12.4       15.5      12.4      29.9         12.4       26.8         16.5               15.5     N/A
100L2       2.1        2.1       2.1       2.1          2.1        4.3          2.1                2.1     4.3
100M1       3.1       13.4      17.5      33.0         14.4       25.8         42.3               41.2     N/A
100M2       6.2        5.2       6.2       6.2          5.2        6.2          5.2                5.2     N/A
100M3       5.2        5.2       6.2       4.1          4.1        6.2          5.2                4.1     6.2
100S1      11.5       11.5      13.5      24.0         13.5       18.8         17.7               17.7     N/A
100S2       3.2        2.2       3.2       2.2          4.3        7.5          2.2                2.2     7.5
Average     6.2        7.8       8.7      14.5          8.0       13.6         13.0               12.6     N/A

BAli-Phy majority consensus tree “FP” rate (%)

Model     FP rate
100L1      11.0
100L2       4.2
100M1      38.5
100M2       4.2
100M3       4.2
100S1      14.1
100S2       4.2
Average    11.5

Problem: BAli-Phy failure to converge, despite multi-week analyses.
Results from Liu et al., Science 2009.

Results for co-estimation methods

• Optimizing treelength (POY and BeeTLe) doesn’t produce good alignments, and trees are not as good as those obtained using ML on standard MSA methods.
• Statistical co-estimation of alignments and trees under models of evolution that include indels can produce highly accurate alignments and trees – but running time is a big issue.
• SATé and PASTA are iterative techniques for co-estimating alignments and trees, and produce good results… but have no statistical guarantees.

Impact of guide tree

• Most MSA methods use “progressive alignment” techniques that
  – First compute a guide tree T
  – Align the sequences from the bottom up using the guide tree
• Hence, there is a potential for the guide tree to impact the final alignment.
• Many authors have studied this issue… here’s our take on it (Nelesen et al., PSB 2008)

Nelesen et al., PSB 2008
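The progressive-alignment recipe (compute a guide tree, then align from the bottom up) means the guide tree fixes the order in which groups of sequences are merged. The Python sketch below lists that merge order for a guide tree written as nested tuples. The tree encoding, the function name, and the sequence labels are hypothetical, and the sketch only records which groups would be aligned to each other; it does not perform any alignment.

```python
def merge_order(guide_tree):
    """List the merges performed by a bottom-up (post-order) pass over the guide tree.

    A leaf is a sequence name (str); an internal node is a tuple of children.
    Each step is a pair (group already merged, group merged into it).
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        groups = [visit(child) for child in node]  # children are finished first
        merged = groups[0]
        for group in groups[1:]:
            steps.append((merged, group))
            merged = merged + group
        return merged

    visit(guide_tree)
    return steps


# Hypothetical five-sequence guide tree.
guide = (("s1", "s2"), ("s3", ("s4", "s5")))
for left, right in merge_order(guide):
    print("align", left, "with", right)
```

Changing the guide tree changes this list, which is the mechanism by which the guide tree can influence the final alignment.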

• Pacific Symposium on Biocomputing, 2008
• MSA methods:
  – ClustalW, Muscle, Probcons, MAFFT, and FTA (Fixed Tree Alignment, using POY on the guide tree)
• Guide trees:
  – Default for each method
  – Two different UPGMA trees
  – Probtree (ML on Probcons+GT alignment)
• Examined results on simulated datasets with respect to alignment error and tree error

October 2, 2007 19:11 Proceedings Trim Size: 9in x 6in nelesen

alignment, and let $\hat{A}$ be the estimated alignment. Then the SP-error rate is $\frac{|\mathrm{Pairs}(A^*) - \mathrm{Pairs}(\hat{A})|}{|\mathrm{Pairs}(A^*)|}$, expressed as a percentage; thus the SP-error is the percentage of the pairs of truly homologous nucleotides that are unpaired in the estimated alignment. However, it is possible for the SP-error rate to be 0, and yet have different alignments.
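A small worked instance shows how the SP-error can be 0 for two different alignments. This is a sketch with made-up residues, taking $A^*$ to be the true alignment. Suppose sequences $a$ and $b$ each have three nucleotides, the true alignment pairs only the first and last nucleotides of each, and the estimated alignment stacks all three positions without gaps:

$$
\begin{aligned}
\mathrm{Pairs}(A^*) &= \{(a_1,b_1),\ (a_3,b_3)\} \\
\mathrm{Pairs}(\hat{A}) &= \{(a_1,b_1),\ (a_2,b_2),\ (a_3,b_3)\} \\
\frac{|\mathrm{Pairs}(A^*) - \mathrm{Pairs}(\hat{A})|}{|\mathrm{Pairs}(A^*)|} &= \frac{|\emptyset|}{2} = 0
\end{aligned}
$$

The SP-error is 0% even though $\hat{A}$ contains the extra pair $(a_2,b_2)$, so the two alignments differ. Every true homology is recovered, and the over-alignment goes unpenalized by this measure.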

3.2. Results. We first examine the guide trees with respect to their topological accuracy. As shown in Figure 2, the accuracy of guide trees differs significantly, with the ProbCons default tree generally the least accurate, and our “probtree” guide tree the most accurate; the two UPGMA guide trees have very similar accuracy levels.

[Figure 2 plot: Missing Edge Rate (%) by Guide Tree (1-6); panels (a) 25 taxa and (b) 100 taxa.]

Figure 2. Guide tree topological error rates, averaged over all model conditions and replicates. (1) ClustalW default, (2) ProbCons default, (3) Muscle default, (4) upgma1, (5) upgma2, and (6) probtree.

In Figure 3 we examine the accuracy of the alignments obtained using different MSA methods on these guide trees. Surprisingly, despite the large differences in topological accuracy of the guide trees, alignment accuracy (measured using SP-error) for a particular alignment method varies relatively little between alignments estimated from different guide trees. For example, two ClustalW alignments or two Muscle alignments will have essentially the same accuracy scores, independent of the guide tree. The biggest factor impacting the SP-error of the alignment is the MSA method. Generally, ProbCons is the most accurate and ClustalW is the least.

Figure from Nelesen et al., Pacific Symposium on Biocomputing, 2008

We then examined the impact of changes in guide tree on the accuracy of the resultant RAxML-based phylogeny (see Figure 4). In all cases, for a given MSA method, phylogenetic estimations obtained when the guide

[Figure 3 plot: SP Error Rate (%) for clustal, muscle, probcons, mafft, and fta under guide trees M(default), M(upgma1), M(upgma2), M(probtree), and M(true-tree); panels (a) 25 taxa and (b) 100 taxa.]

Figure 3. SP-error rates of alignments. M(guide tree) indicates multiple sequence alignment generated using the indicated guide tree.

[Figure 4 plot: Missing Edge Rate (%) for clustal, muscle, probcons, mafft, and fta under R(M(default)), R(M(upgma1)), R(M(upgma2)), R(M(probtree)), R(M(true-tree)), and R(true-aln); panels (a) 25 taxa and (b) 100 taxa.]

Figure 4. Missing edge rate of estimated trees. R(M(guide tree)) indicates RAxML run on the alignment generated by the multiple sequence alignment method using the guide tree indicated. R(true-aln) indicates the tree generated by RAxML when given the true alignment.
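The missing edge rate reported in these figures compares the internal edges (bipartitions) of the true tree with those of the estimated tree. The following Python sketch shows one way to compute it for small trees written as nested tuples; the tree encoding, function names, and five-taxon example are assumptions made for illustration and are not the evaluation code used in the study.

```python
def leaf_set(tree):
    """Return the frozenset of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions of the tree, one per internal edge."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # fixed reference taxon, so each edge has one canonical side
    splits = set()

    def below(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(below(child) for child in node))
        # Keep only edges with at least two taxa on each side.
        if 1 < len(leaves) < len(taxa) - 1:
            splits.add(leaves if ref not in leaves else taxa - leaves)
        return leaves

    below(tree)
    return splits


def missing_edge_rate(true_tree, est_tree):
    """Fraction of true-tree bipartitions that are absent from the estimated tree."""
    true_splits = bipartitions(true_tree)
    return len(true_splits - bipartitions(est_tree)) / len(true_splits)


# Hypothetical five-taxon example: the estimate swaps taxa B and C.
true_tree = ((("A", "B"), "C"), ("D", "E"))
est_tree = ((("A", "C"), "B"), ("D", "E"))
print(missing_edge_rate(true_tree, est_tree))
```

Storing the side of each edge that excludes a fixed reference taxon makes the comparison independent of where the nested tuples happen to be rooted.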

tree is the true tree are more accurate than for all other guide trees. However, MSA methods otherwise respond quite differently to improvements in guide trees. For example, Muscle responded very little (if at all) to improvements in the guide tree, possibly because it computes a new guide tree after the initial alignment on the input guide tree. ClustalW also responds only weakly to improvement in guide tree accuracy, often showing, for example, worse performance on the probtree guide tree compared to the other guide trees. On the other hand, ProbCons and FTA both respond positively and significantly to improvements in guide trees. This is quite interesting, since the alignments did not improve in terms of their SP-error rates! Furthermore, ProbCons improves quite dramatically as compared to its performance in its default setting. The performance of FTA is intriguing. It is generally worse than ProbCons on the UPGMA guide trees, but comparable to ProbCons on the probtree guide tree, and better than ProbCons on the true tree.

Figure from Nelesen et al., Pacific Symposium on Biocomputing, 2008

Observations

• Guide tree choice did not seem to affect alignment SP error
• Guide tree choice affected tree error – but impact depended on dataset size (25 vs. 100) and MSA method.
• Probcons is strongly affected by guide tree (and that may be because its own default guide tree is poorly chosen).
• FTA is strongly affected by guide tree. Note that FTA on the true tree is MORE accurate than ML on the true alignment.
• For analyses of 100-taxon datasets, Probtree is a good guide tree.

Another study…

• Prank (Loytynoja and Goldman, Science 2008) is a “phylogeny aware” progressive alignment strategy.
• Their study focused on evaluating MSAs with respect to TC score, but also atypical criteria, such as:
  – Gene tree branch length estimation
  – Alignment length estimation (compression issue)
  – Insertion/deletion ratio
  – Number of insertions/deletions
• They explored very small simulated datasets, evolving sequences down trees.

REPORTS

have a flaw that has gone unnoticed as long as different methods have been consistent in the error they create.

That such a significant error has passed undetected may be explained by the alignment field's historical focus on proteins, where these biases tend to be manifested in less-constrained regions such as loops (compare Fig. 1). Alignments with insertions and deletions squeezed compactly between conserved blocks may suffice for, and even be preferred by, some molecular biologists working with proteins. We have shown, however, that these patterns are, in fact, imposed by systematic biases in alignment algorithms, even in cases where they are incorrect and, indeed, phylogenetically unreasonable. We contend that algorithms that impose gap patterns like those found in structural alignments of proteins are inappropriate for the increasingly widespread analysis of genomic DNA and are likely to cause error when the resulting alignments are used for evolutionary inferences.

We believe that alignment methods specifically designed for evolutionary analyses will give a very different picture of the mechanisms of sequence evolution and show sequence turnover through short insertions and deletions as a more frequent and important phenomenon. This raises interesting questions of the true evolution of variable sequences such as promoter regions, noncoding DNA, and exposed coil regions in proteins: Do they predominantly evolve through point substitutions, or are those dissimilar regions just incorrectly aligned nonhomologous sequences? To resolve that, we need more sequence data and alignment methods that can really benefit from the additional information. The resulting alignments may be fragmented by many gaps and may not be as visually beautiful as the traditional alignments, but if they represent correct homology, we have to get used to them.

Fig. 3. Alignment accuracy errors are reduced by a phylogeny-aware algorithm. (A to F) Errors from traditional alignment methods grow with increasing evolutionary distances and more difficult alignments (close-intermediate-distant, white to blue gradient) and also with a denser sequence sampling and increasingly similar sequences (close-2X-4X, white to red gradient). The phylogeny-aware method PRANK+F is less biased by greater distances and, in contrast to other methods, improves in accuracy with additional sequence data. Alignment statistics for the five multiple sequence alignment methods are number of (A) insertions and (B) deletions, (C) insertion/deletion ratio, (D) gap overlap, (E) total length of the alignment, and (F) proportion of columns correctly recovered. (G) Inferred branch lengths at different depths in the tree (t1 – t4) for the intermediate sets indicate that alignment errors lead to overestimated branch lengths, with PRANK+F giving the most accurate estimates across the whole range of depths. All measures in (A) to (G) are shown relative to those inferred from the true alignment, and values closer to 1 are more correct. Vertical bars show means and 95% confidence intervals.

From Loytynoja and Goldman, Science 2008

Supporting Online Material: www.sciencemag.org/cgi/content/full/320/5883/1632/DC1
28 March 2008; accepted 15 May 2008
10.1126/science.1158395

www.sciencemag.org SCIENCE VOL 320 20 JUNE 2008 1635

Observations

• Most alignment methods “over-align” (produce compressed alignments)
• Prank avoids this through its “phylogeny-aware” strategy
• Compression results in
  – Over-estimations of branch lengths
  – Under-estimation of insertions
• Clustal is least accurate, other methods in between

Results so far

• Relative accuracy depends on the alignment criterion – TC and sum-of-pairs scores do not necessarily correlate well.
• Tree accuracy is also not that well correlated with alignment accuracy.
• Different alignment criteria are optimized using different techniques
• Accuracy on AA (amino acid) datasets not the same as accuracy on NT (nucleotide) datasets.
• Dataset properties that impact accuracy:
  – Dataset size
  – Heterogeneity (rate of evolution)
  – Perhaps other things (gap length distribution?)
  – and note, we have not yet examined fragmentary datasets
• Exact command matters (always check details)

General trends

• Treelength-based optimization currently not as accurate as some standard techniques (e.g., ML on MAFFT alignments)
• Many methods give excellent results on small datasets – Probcons, Probalign, Bali-Phy, etc… but most are not in use because of dataset size limitations
• Large datasets best using PASTA or UPP? (maybe)
• Co-estimation under statistical models might be the way to go, IF…

Research Projects

• Design your own MSA method, or just modify an existing one in some simple way (e.g., different guide tree)
• Test existing MSA methods with respect to different criteria (e.g., extend Prank study to more methods and datasets)
• Develop different MSA criteria that are more appropriate than TC, SPFN, SPFP
• Compare different MSA methods on some biological dataset
• Parallelize some MSA method
• Consider how to combine MSAs on the same input

Treelength optimization

• POY is the most well-known method for co-estimating alignments and trees using treelength criteria (however – note that the developers of POY say to ignore the alignment and only use the tree).
• The accuracy of the final tree depends on the edit distance formulation – as noted by several studies. Affine gap penalties are more biologically realistic than simple gap penalties.
• We developed BeeTLe (Better Tree Length), a heuristic that is guaranteed to always be at least as accurate as POY for the treelength criterion.

Treelength questions
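The difference between simple and affine gap penalties in the edit distance can be illustrated with a short Python sketch. The parameter values and the exact affine convention (an opening cost plus a per-site extension cost) are assumptions chosen for illustration; POY and BeeTLe may use other conventions and values.

```python
def simple_gap_cost(length, per_site=1.0):
    """Simple penalty: every gapped site costs the same, however the sites are grouped."""
    return per_site * length


def affine_gap_cost(length, gap_open=4.0, gap_extend=1.0):
    """Affine penalty (one common convention): pay once to open a gap, then per site."""
    if length == 0:
        return 0.0
    return gap_open + gap_extend * length


# One indel event of length 6 versus six separate single-site indels.
print(simple_gap_cost(6), 6 * simple_gap_cost(1))
print(affine_gap_cost(6), 6 * affine_gap_cost(1))
```

Under the simple penalty the two scenarios cost the same, whereas the affine penalty makes a single long indel cheaper than many short ones.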

• Is it better to use affine than simple gap penalties?
• Does POY solve its treelength problem? Is BeeTLe actually better (as promised)?
• How accurate are the alignments?
• How accurate are the trees, compared to
  – MP analyses of good alignments
  – ML analyses of good alignments

Treelength Optimization for Phylogeny Estimation

Table 1. Average missing branch rate (%) on each 100-taxon model condition.

                 100-taxon model condition                                                                     Total      Max
Method               L5    M5    S5    M4    S4    L4    M3    S3    L3    S1    M1    S2    M2    L2    L1   Average   Std Err

ML(TrueAln)         4.9   6.2   4.0   6.7   8.1   8.4   9.5  11.1  10.2  12.7   9.9  12.9  10.1   9.3  12.5      9.1       0.9
SATé-II             5.2   6.6   6.2   7.6  10.5   9.3  14.9  13.2  13.1  16.0  15.8  17.2  18.0  18.9  29.9     13.5       2.1
ML(MAFFT)           5.2   6.5   6.3   7.6  10.5  10.1  14.3  14.0  13.8  16.4  17.3  17.3  19.2  20.5  33.6     14.2       2.2
SATé                5.0   6.3   5.2   7.1  11.8  10.3  14.9  14.2  13.4  17.5  17.8  17.0  22.4  24.3  33.2     14.7       2.1
ML(Opal)            5.4   6.3   9.1   8.3  12.5  12.9  15.6  14.2  17.3  17.7  17.4  18.1  23.1  27.7  38.0     16.2       2.0
MP(TrueAln)         6.9   9.9   7.5  13.1  17.9  17.4  18.2  25.1  22.2  25.6  23.9  30.6  22.9  20.8  26.3     19.2       1.4
BeeTLe-Affine       7.2   9.0   7.2  10.5  14.7  20.5  20.5  23.0  26.0  29.8  26.6  31.5  25.4  27.3  36.4     21.0       2.0
ML(Probtree)        5.3   6.2   4.2   7.0   9.6  13.4  14.9  24.2  22.5  31.8  27.8  37.8  27.3  36.0  52.8     21.4       2.6
BeeTLe-Simple-2     6.2   9.5   6.3  12.0  17.5  19.3  22.4  25.0  27.9  31.6  28.3  32.8  28.5  26.9  37.6     22.1       2.5
ML(Prank+GT)        5.0   6.0   5.2   8.7  13.4  21.7  21.2  28.1  29.6  30.8  31.2  35.4  30.4  33.4  46.2     23.1       2.4
MP(Prank+GT)        6.7   9.3   7.1  12.7  19.3  22.4  22.3  29.4  28.6  30.2  28.2  34.3  29.4  30.2  38.2     23.2       1.8
MP(MAFFT)           7.3  10.2   8.3  13.7  20.0  19.9  22.5  27.8  26.5  29.4  30.5  32.7  30.2  32.3  42.8     23.6       2.0
BeeTLe-Simple-1     6.1   9.5   6.4  12.5  20.6  24.3  24.0  26.8  23.9  28.1  30.2  41.2  29.5  29.9  42.2     23.7       3.2
MP(Opal)            7.8  10.7  10.3  14.5  20.4  21.3  23.9  27.1  27.7  28.0  30.4  32.7  32.4  35.0  44.0     24.4       1.8
ML(ClustalW)        6.1   7.0   7.3  10.6  15.5  20.0  24.4  30.1  31.3  35.2  36.7  38.9  35.0  37.2  48.9     25.6       2.5
MP(Probtree)        7.9  10.5   7.8  13.5  18.0  22.9  21.7  32.5  32.0  36.8  34.6  42.6  34.0  43.7  58.3     27.8       2.2
MP(ClustalW)        7.0   9.9   8.6  14.7  20.9  26.1  27.3  36.3  35.2  38.4  39.5  44.3  37.5  38.8  49.2     28.9       2.0

For conciseness, model condition identifiers are truncated to unique suffixes. “Total Average” is the average across all model conditions. “Max Std Err” is the maximum standard error of any model condition. n = 20 for each reported value; n = 300 for “Total Average” and “Max Std Err”. doi:10.1371/journal.pone.0033104.t001

whether POY (or any method based upon treelength optimization) is reliable for estimating highly accurate trees or alignments, the more important question is which approaches are likely to produce the most accurate trees and alignments?

The study we presented suggests strongly that treelength optimization is unlikely to produce trees or alignments that are as accurate as maximum likelihood on the leading alignment methods; it also showed that SATé trees and alignments were even more accurate than maximum likelihood trees on leading alignments. Thus, parsimony-style co-estimation (as in POY and BeeTLe) produced trees and alignments that are inferior to the co-estimation approach in SATé. It makes sense, therefore, to discuss SATé’s co-estimation technique.

The technique used by SATé to co-estimate trees and alignments uses iteration combined with divide-and-conquer; each iteration involves the estimation of a new alignment (produced using divide-and-conquer) and then uses RAxML to produce an ML tree on that new alignment. However, the ML model used in estimating the tree is GTR+Gamma, and so indels are treated in the standard way, which is as missing data – rather than treating

Figure 5. Alignment SP-FN error of different methods on 100-taxon model conditions. Averages and standard error bars are shown; n = 20 for each reported value. doi:10.1371/journal.pone.0033104.g005

PLoS ONE | www.plosone.org 6 March 2012 | Volume 7 | Issue 3 | e33104

Simulated 100-sequence DNA datasets with varying rates of evolution
Results from Liu and Warnow, PLoS ONE 2012


Maximum Parsimony (MP) on different alignments

Figure 4. Missing branch rates of different methods on 100-taxon model conditions. We report missing branch rates for BeeTLe-Affine in comparison to ML methods, SATé, and SATé-II (top chart) and in comparison to MP methods (middle chart). On model conditions marked with ‘*’, ML(MAFFT)’s missing branch rate significantly improved upon BeeTLe-Affine’s (using one-tailed pairwise t-tests with Benjamini-Hochberg [57] correction for multiple tests, n = 40 for each test, and α = 0.05). On model conditions marked with ‘$’, BeeTLe-Affine’s missing branch rate significantly improved upon MP(MAFFT)’s (using similar statistical tests). Averages and standard error bars are shown; n = 20 for each reported value. doi:10.1371/journal.pone.0033104.g004

Simulated 100-sequence DNA datasets with varying rates of evolution
Results from Liu and Warnow, PLoS ONE 2012

results in Ogden and Rosenberg’s study [24] and Affine gave even better results in Liu et al.’s subsequent study [31]), they are by no means representative of the full range of treelength criteria. Therefore it remains possible that a better treelength criterion can be developed. However, as noted above, it may be that the use of affine treelengths may be inherently too simplistic (fitting single parameter models rather than mixture models) to produce good results. Also, our method, BeeTLe, is not designed to thoroughly search treespace for short trees. Instead, it is a very simple technique that scores a set of trees (including POY, RAxML(MAFFT), RAxML(ClustalW) and some of the neighbors of these trees) for treelength, and returns the shortest tree. Therefore, it is likely that even shorter trees would be obtained by a more careful search through treespace. As a result it is possible that the shorter and topologically more accurate trees would be obtained by a more careful analysis.

It is worth noting that we only explored two phylogeny estimation methods (i.e., RAxML for ML analysis and PAUP* for MP) and a handful of alignment methods (i.e., MAFFT, SATé, Probtree, Prank+GT, Opal, and ClustalW). It is possible that better alignments could be obtained using other alignment methods and that better trees might be obtained on these alignments using other phylogeny estimation methods. In particular, likelihood-based methods such as MrBayes [39], Phyml [40], GARLI [41], FastTree [42,43], and Metapiga2 [44] might produce more accurate trees. We also did not explore the performance of BAli-Phy or other co-estimation methods that treat indels informatively, and these also might produce more accurate trees. Thus, it is possible that there are currently available methods that might yield even more accurate trees than those tested in this study.

We close with some comments about the general problem of estimating trees and alignments from unaligned sequences, and whether co-estimation of trees and alignments is beneficial or detrimental. In other words, although it is important to understand

Maximum Likelihood (ML) on different alignments

Simulated 100-sequence DNA datasets with varying rates of evolution
Results from Liu and Warnow, PLoS ONE 2012


PASTA study

• PASTA (RECOMB 2014 and J. Computational Biology 2014) is the replacement for SATe-1 (Liu et al., Science 2009) and SATe-2 (Liu et al., Systematic Biology 2012)
• Alignment criteria: “Pairs” score and Total Column (TC) score
• Evaluated on simulated and biological datasets (both nucleotide and amino acid)
• Alignment methods compared: “Initial” (an HMM-based technique), Clustal-Omega, MAFFT, and SATe

SATe Family

• SATe-I (2009):
  – Up to about 10,000 sequences
  – Good accuracy and reasonable speed
  – “Center-tree” decomposition
• SATe-II (2012):
  – Up to about 50,000 sequences
  – Improved accuracy and speed
  – Centroid-edge recursive decomposition
• PASTA (2014):
  – Up to 1,000,000 sequences
  – Improved accuracy and speed
  – Combines centroid-edge decomposition with transitivity merge
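The centroid-edge recursive decomposition named in the list above can be sketched as follows. This is a minimal illustration and not the SATe-II or PASTA code: the adjacency-dict tree representation, the function name `centroid_edge_decompose`, and the toy 8-taxon tree are all assumptions made for the example, and `max_size` is assumed to be at least 1.

```python
def centroid_edge_decompose(adj, taxa, max_size):
    """Recursively split a tree until every leaf subset has at most max_size taxa.

    adj  : dict mapping each node to the set of its neighbours (unrooted tree)
    taxa : set of leaf labels contained in adj
    """
    if len(taxa) <= max_size:
        return [sorted(taxa)]

    # Root the component at an arbitrary taxon and record a parent-first order.
    root = next(iter(taxa))
    parent = {root: None}
    order = [root]
    for node in order:  # the list grows while we walk it (breadth-first)
        for neighbour in adj[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                order.append(neighbour)

    # below[n] = number of taxa hanging off the edge (parent[n], n).
    below = {node: int(node in taxa) for node in order}
    for node in reversed(order[1:]):
        below[parent[node]] += below[node]

    # Centroid edge: the edge whose removal balances the two sides best.
    child = min(order[1:], key=lambda n: abs(2 * below[n] - len(taxa)))

    # Collect the nodes on the child's side of that edge.
    side, stack = set(), [child]
    while stack:
        node = stack.pop()
        side.add(node)
        stack.extend(n for n in adj[node] if parent.get(n) == node)

    # Restricting neighbour sets to each side removes the centroid edge.
    adj_a = {n: adj[n] & side for n in side}
    adj_b = {n: adj[n] - side for n in order if n not in side}
    return (centroid_edge_decompose(adj_a, taxa & side, max_size)
            + centroid_edge_decompose(adj_b, taxa - side, max_size))


# Toy tree ((A,B),(C,D)) joined to ((E,F),(G,H)); lowercase names are internal nodes.
edges = [("A", "u"), ("B", "u"), ("C", "v"), ("D", "v"), ("u", "w"), ("v", "w"),
         ("E", "x"), ("F", "x"), ("G", "y"), ("H", "y"), ("x", "z"), ("y", "z"),
         ("w", "z")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

subsets = centroid_edge_decompose(adj, set("ABCDEFGH"), max_size=3)
```

On this toy tree the first cut is the central edge (4 taxa on each side), and each half is then cut once more, so every subset ends up within the size limit.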

FIG. 1. Algorithmic design of PASTA. The first six boxes show the steps involved in one iteration of PASTA. The last two boxes show the meaning of transitivity for homologies defined by a column of an MSA, and how the concept of transitivity can be used to merge two compatible and overlapping alignments. MSA, multiple sequence alignment.
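The idea of merging two compatible, overlapping alignments by transitivity can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not PASTA's implementation: alignments are assumed to be dicts of equal-length gapped strings, the function names are hypothetical, residues that share a column in either input are grouped with a union-find structure, and the merged columns are ordered with the standard-library `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def to_columns(aln):
    """Turn {name: gapped row} into a list of columns, each a set of (name, residue index)."""
    counters = {name: 0 for name in aln}
    width = len(next(iter(aln.values())))
    cols = []
    for c in range(width):
        col = set()
        for name, row in aln.items():
            if row[c] != "-":
                col.add((name, counters[name]))
                counters[name] += 1
        if col:
            cols.append(col)
    return cols


def merge_by_transitivity(aln_a, aln_b):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Residues sharing a column in either input end up in one class.
    for col in to_columns(aln_a) + to_columns(aln_b):
        first, *rest = sorted(col)
        find(first)
        for other in rest:
            parent[find(other)] = find(first)

    # Each class becomes one merged column: {sequence name: residue index}.
    classes = {}
    for name, i in list(parent):
        classes.setdefault(find((name, i)), {})[name] = i

    # Column order: residue i of a sequence must precede residue i + 1.
    graph = {root: set() for root in classes}
    for name, i in list(parent):
        if (name, i + 1) in parent:
            graph[find((name, i + 1))].add(find((name, i)))
    # static_order raises graphlib.CycleError when the constraints are cyclic.
    order = TopologicalSorter(graph).static_order()

    seqs = {n: r.replace("-", "") for aln in (aln_a, aln_b) for n, r in aln.items()}
    rows = {name: [] for name in seqs}
    for root in order:
        col = classes[root]
        for name, seq in seqs.items():
            rows[name].append(seq[col[name]] if name in col else "-")
    return {name: "".join(chars) for name, chars in rows.items()}


# s2 is shared: s1 ~ s2 and s2 ~ s3 together imply s1 ~ s3 column by column.
merged = merge_by_transitivity({"s1": "AC-G", "s2": "ACTG"},
                               {"s2": "ACTG", "s3": "A-TG"})
```

Here `merged` aligns s1 and s3 to each other purely through their shared homologies with s2, which is the sense of transitivity described in the caption.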

Figure from Mirarab et al., J. Computational Biology 2014

SATé-I vs. SATé-II

SATé-II
• Faster and more accurate than SATé-I
• Longer analyses or use of ML to select the tree/alignment pair give slightly better results

PASTA variants – impact of alignment subset size

8 MIRARAB ET AL.

Table 4. Impact of Alignment Subset Size

                            Tree error   Alignment accuracy
Dataset       Subset size   FN           Pairs score   TC    Running times
RNASim 10K    200           10.7%        88.8%         145   13,478
RNASim 10K    100           10.4%        87.4%         185   8,235
RNASim 10K    50            10.7%        88.6%         210   6,015
16S.T         200           8.2%         82.7%         121   9,120
16S.T         100           8.1%         82.0%         125   7,086
16S.T         50            7.9%         79.0%         129   5,780

We report tree error and alignment accuracy on one replicate of the 10K RNASim dataset and also on the 16S.T dataset, using three iterations of PASTA in which we explore the impact of changing the subset size from 200 (the default) to 100 and 50; all other algorithmic parameters use default values. Boldface indicates the best performance on the data.

RNASim datasets with 50K or more sequences). PASTA was not always faster than other methods, but was able to complete its analyses of all datasets within the 24-hr time limit, whereas other methods (except the starting tree) were unable to complete analyses on the largest datasets. Figure 4a presents a detailed running time comparison of PASTA and SATé-II on two specific model conditions of the RNASim dataset. Note that merging subset alignments (and the last pairwise merge, shown in the dotted area) was the majority of the time used by SATé-II to analyze the 50K RNASim dataset, but a very small fraction of the time used by PASTA. PASTA uses transitivity for all but the initial pairwise mergers, and therefore scales well with increased dataset size, as shown in Figure 4b (the sub-linear scaling is due to a better use of parallelism with increased number of sequences). Finally, Figure 4c shows that PASTA is highly parallelizable and has a much better speed-up with increasing number of threads than SATé does.

From Mirarab et al., J. Computational Biology 2014

5. SUMMARY

The key algorithmic contribution in PASTA is the use of transitivity to align sequences on a guide tree, which addresses computational limitations in SATé and also improves the alignment of very distantly related sequences and remote homology detection. PASTA is fast and scales well with the number of processors, so that datasets with even 200,000 sequences can be analyzed in less than a day with a small

FIG. 3. Alignment running time (hours). Note that PASTA was run for three iterations everywhere, except on the 100,000-sequence RNASim dataset where it was run for two iterations, and on the 200,000-sequence RNASim dataset where it was run for one iteration. Mafft was run in default mode, except on the 100,000-sequence dataset, where PartTree was used.

FIG. 2. Tree error rates on nucleotide datasets. We show missing branch (also known as false negative or FN) rates for maximum likelihood trees estimated on the reference alignment as well as alignments computed using PASTA and other methods; results not shown indicate failure to complete within 24 hr using 12 cores on the datasets. Error bars show standard error over 10 replicates for all model conditions of the Indelible and the 10,000-sequence RNASim datasets.
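The missing branch (FN) rate reported in this figure is the fraction of non-trivial bipartitions of the true tree that are absent from the estimated tree. A minimal sketch of that computation follows; the nested-tuple tree encoding and the function names are assumptions made for illustration, and both trees are assumed to be on the same taxon set.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of taxon names."""
    clades = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        clades.add(below)
        return below

    taxa = leaves(tree)
    anchor = min(taxa)  # represent each split by the side containing this taxon
    splits = set()
    for clade in clades:
        rest = taxa - clade
        if len(clade) >= 2 and len(rest) >= 2:
            splits.add(clade if anchor in clade else rest)
    return splits


def missing_branch_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions not found in the estimated tree."""
    true_splits = bipartitions(true_tree)
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


true_tree = (("A", "B"), ("C", "D"), "E")
estimated_tree = (("A", "B"), ("C", "E"), "D")
fn_rate = missing_branch_rate(true_tree, estimated_tree)
```

In this toy case the true tree has two internal branches (AB|CDE and CD|ABE) and the estimated tree recovers only the first, so half of the true branches are missing.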

Alignment accuracy on AA datasets. Table 2 shows alignment accuracy on the AA datasets. Due to dataset sizes, Muscle and SATé-II failed to complete on two of the HomFam datasets, so we separate out the results for these two datasets from the remaining 17 HomFam datasets. PASTA had the best pairs score or was tied for the best pairs score for both HomFam and AA-10 datasets. Mafft had the best TC score for HomFam(17), but PASTA was very close. For HomFam(2), PASTA had the best TC score and Mafft was a close second. On AA-10 datasets, SATé-II had the best TC score and was closely trailed by Mafft and PASTA.

Comparison of PASTA to SATe-II and other methods on nucleotide datasets, with respect to tree error. Figure from Mirarab et al., J. Computational Biology 2014

Comparison to SATé-II on 50,000-taxon dataset. SATé-II could not finish even one iteration on the RNASim dataset with 50,000 sequences when run for 24 hr with 12 CPUs on TACC. However, we were able to

Table 1. Alignment Accuracy on Nucleotide Datasets

            Indelible - 10,000   RNASim                      CRW (16S)
Method      M4     M3     M2     10k    50k    100k   200k   16S.3  16S.T  16S.B.ALL

Column (TC) score
Clustal-O   160    10     X      13     X      X      X      12     0      1
Muscle      803    7      0      0      X      X      X      34     21     81
Mafft       337    13     0      28     30     26     X      75     85     15
Initial     422    106    18     11     15     5      4      33     X      24
SATé-II     977    758    792    35     X      X      X      89     60     87
PASTA       987    920    1151   152    311    492    823    71     121    102

Pairs score (mean of SP score and modeler score)
Clustal-O   0.97   0.34   X      0.65   X      X      X      0.57   0.53   0.60
Muscle      1.00   0.12   0.01   0.35   X      X      X      0.74   0.67   0.66
Mafft       1.00   0.76   0.02   0.72   0.73   0.72   X      0.75   0.70   0.71
Initial     0.99   0.98   0.91   0.87   0.88   0.87   0.88   0.86   X      0.95
SATé-II     1.00   0.93   0.72   0.56   X      X      X      0.76   0.65   0.66
PASTA       1.00   1.00   0.99   0.85   0.85   0.87   0.86   0.87   0.83   0.94

We show the number of correctly aligned sites (top) and the average of the SP-score and modeler score (bottom). X indicates that a method failed to run on a particular dataset given the computational constraints. “Initial” corresponds to the alignment approach used to obtain the starting tree of PASTA (HMMER failed to align one sequence in the 16S.T dataset) and Clustal-O stands for Clustal-Omega. Boldface indicates the best values for each model condition.
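The two criteria in this table can be computed directly from a reference and an estimated alignment. The sketch below is illustrative only: it assumes both alignments are lists of gapped strings with the sequences in the same row order, the function names are hypothetical, and the TC count is assumed to cover only columns containing at least two residues (a convention chosen for this example, not one stated in the source).

```python
from itertools import combinations


def columns(alignment):
    """Each column as a tuple of residue indices, one per row (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for r, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        cols.append(tuple(col))
    return cols


def homologies(alignment):
    """All pairs of residues placed in the same column."""
    pairs = set()
    for col in columns(alignment):
        present = [(r, i) for r, i in enumerate(col) if i is not None]
        pairs.update(combinations(present, 2))
    return pairs


def alignment_scores(reference, estimated):
    ref, est = homologies(reference), homologies(estimated)
    shared = len(ref & est)
    sp_score = shared / len(ref)   # 1 - SPFN
    modeler = shared / len(est)    # 1 - SPFP
    ref_cols = set(columns(reference))
    # Assumption: only columns with two or more residues count toward TC.
    tc = sum(1 for col in columns(estimated)
             if sum(i is not None for i in col) >= 2 and col in ref_cols)
    return {"SP": sp_score, "modeler": modeler,
            "pairs": (sp_score + modeler) / 2, "TC": tc}


reference = ["AC-GT", "ACTGT", "A--GT"]
estimated = ["ACG-T", "ACTGT", "A-G-T"]
scores = alignment_scores(reference, estimated)
```

In this toy example 8 of the 10 reference homologies are recovered and 8 of the 10 estimated homologies are correct, so the SP-score, modeler score, and Pairs score are all 0.8, and 3 estimated columns match reference columns exactly.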