Clustal-Omega

Tandy Warnow Today

• Discussion of “Fast scalable generaon of high- quality mulple sequence alignments using -Omega” by Sievers et al., published in Molecular Systems Biology 7:539 (2011) – Design of Clustal-Omega – Performance on data – Main innovaon: guide tree construcon – Reviews in “transparent process” – Comparison to other studies (me perming) Clustal-Omega study

• Clustal-Omega is the latest in the Clustal family of MSA methods • Clustal-Omega is designed primarily for amino acid alignment, but can be used on nucleode datasets

Clustal-Omega

Given unaligned sequences: • Build a guide tree • Perform progressive alignment on the guide tree, using “consistency” to inform the alignment mergers If desired, you can: • Iterate (construct profile HMM based on alignment, and then re-align) • Use external profile alignment (EPA) Evaluaon

• Default Clustal-Omega (no iteraon, no EPA) compared to standard methods on protein benchmarks • Variants of Clustal-Omega compared to default Clustal-Omega Criteria: • TC (column score) on “core” columns • Running me Datasets

• BAliBASE (“small”) • PreFab (50 sequences in each set, but evaluated based on pairwise alignments) • HomFam (much larger!) High-quality protein MSAs using Clustal Omega F Sievers et al

about 10 000 sequences is MAFFT/PartTree (Katoh and Toh, HomFam. For test cases with 43000 sequences, we run 2007). It is very fast but leads to a loss in accuracy, which has MUSCLE with the –maxiter parameter set to 2, in order to finish to be compensated for by iteration and other heuristics. With the alignments in reasonable times. Second, we have run several Clustal Omega, we use a modified version of mBed (Black- different programs from the MAFFT package. MAFFT (Katoh shields et al, 2010), which has complexity of O(N log N), and et al,2002)consistsofaseriesofprogramsthatcanberun which produces guide trees that are just as accurate as those separately or called automatically from a script with the –auto from conventional methods. mBed works by ‘emBedding’ each flag set. This flag chooses to run a slow, consistency-based sequence in a space of n dimensions where n is proportional to program (L-INS-i) when the number and lengths of sequences is log N. Each sequence is then replaced by an n element vector, small. When the numbers exceed inbuilt thresholds, a conven- where each element is simply the distance to one of n ‘reference tional progressive aligner is used (FFT-NS-2). The latter is also the sequences.’ These vectors can then be clustered extremely program that is run by default if MAFFT is called with no flags quickly by standard methods such as K-means or UPGMA. set. For very large data sets, the –parttree flag must be set on the In Clustal Omega, the alignments are then computed using the command line and a very fast guide tree calculation is then used. very accurate HHalign package (So¨ding, 2005), which aligns The results for the BAliBASE benchmark tests are shown in two profile hidden Markov models (Eddy, 1998). TableI. BAliBASE is divided into six ‘references.’Average scores Clustal Omega has a number of features for adding are given for each reference, along with total run times and sequences to existing alignments or for using existing average total column (TC) scores, which give the proportion of alignments to help align new sequences. One innovation is the total alignment columns that is recovered. A score of 1.0 to allow users to specify a profile HMM that is derived from an indicates perfect agreement with the benchmark. There are two alignment of sequences that are homologous to the input set. rows for the MAFFT package: MAFFT (auto) and MAFFT The sequences are then aligned to these ‘external profiles’ to default. In most (203 out of 218) BAliBASE test cases, the help align them to the rest of the input set. There are already number of sequences is small and the script runs L-INS-i, which widely available collections of HMMs from many sources such is the slow accurate program that uses the consistency heuristic as Pfam (Finn et al, 2009) and these can now be used to help (Notredame et al, 2000) that is also used by MSAprobs users to align their sequences. (Liu et al, 2010), Probalign, Probcons (Do et al, 2005) and T-Coffee. These programs are all restricted to small numbers of sequences but tend to give accurate alignments. This is clearly Results reflected in the times and average scores in Table I. The times range from 25 min up to 22 h for these packages and the Alignment accuracy accuracies range from 55 to 61% of columns correct. Clustal The standard method for measuring the accuracy of multiple Omega only takes 9 min for the same runs but has an accuracy alignment algorithms is to use benchmark test sets of reference level that is similar to that of Probcons and T-Coffee. alignments, generated with reference to three-dimensional The rest of the table is mainly taken by the programs that use structures. Here, we present results from a range of packages progressive alignment. Some of these are very fast but this tested on three benchmarks: BAliBASE (Thompson et al, speed is matched by a considerable drop in accuracy compared 2005), Prefab (Edgar, 2004) and an extended version of with the consistency-based programs and Clustal Omega. The HomFam (Blackshields et al, 2010). For these tests, we just weakest program here, is Clustal W (Larkin et al, 2007) report results using the default settings for all programs but followed by PRANK (Lo¨ytynoja and Goldman, 2008). PRANK with two exceptions, which were needed to allow MUSCLE is not designed for aligning distantly related sequences but at (Edgar, 2004) and MAFFTTC scores on BAliBASE sequences to align the biggest test cases in giving good alignments for phylogenetic work with special

Table I BAliBASE results

Aligner Av score BB11 BB12 BB2 BB3 BB4 BB5 Tot time (s) Consistency (218 families) (38 families) (44 families) (41 families) (30 families) (49 families) (16 families)

MSAprobs 0.607 0.441 0.865 0.464 0.607 0.622 0.608 12 382.00 Yes Probalign 0.589 0.453 0.862 0.439 0.566 0.603 0.549 10 095.20 Yes MAFFT (auto) 0.588 0.439 0.831 0.450 0.581 0.605 0.591 1475.40 Mostly (203/218) Probcons 0.558 0.417 0.855 0.406 0.544 0.532 0.573 13 086.30 Yes Clustal O 0.554 0.358 0.789 0.450 0.575 0.579 0.533 539.91 No T-Coffee 0.551 0.410 0.848 0.402 0.491 0.545 0.587 81041.50 Yes Kalign 0.501 0.365 0.790 0.360 0.476 0.504 0.435 21.88 No MUSCLE 0.475 0.318 0.804 0.350 0.409 0.450 0.460 789.57 No MAFFT (default) 0.458 0.258 0.749 0.316 0.425 0.480 0.496 68.24 No FSA 0.419 0.270 0.818 0.187 0.259 0.474 0.398 53 648.10 No Dialign 0.415 0.265 0.696 0.292 0.312 0.441 0.425 3977.44 No PRANK 0.376 0.223 0.680 0.257 0.321 0.360 0.356 128 355.00 No ClustalW 0.374 0.227 0.712 0.220 0.272 0.396 0.308 766.47 No

The figures are total column scores produced using bali score on core columns only. The average score over all families is given in the second column. The results for BAliBASE subgroupings are in columns 3–8. The total run time for all 218 families is given in the second last column. The last column indicates whether the method is consistency based.

2 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited

From Seivers et al., Molecular Systems Biology 2011 BAliBASE 0.65

0.6 MSAprobs Mafft-auto Probalign

0.55 ClustalOmega Probcons TCoffee Opal SATe

0.5 Kalign Muscle 0.45 Mafft

Dialign Fsa Total Column Score 0.4 ClustalW2 Prank 0.35 10 100 1000 10000 100000 1e+06 Time [s] High-quality protein MSAs using Clustal Omega F Sievers et al

attention to gaps. These gap positions are not included in these tests as they tend not to be structurally conserved. Dialign (Morgenstern et al, 1998) does not use consistency or progressive alignment but is based on finding best local multiple Consistency alignments. FSA (Bradley et al,2009)usessamplingofpairwise alignments and ‘sequence annealing’ and has been shown to deliver good nucleotide sequence alignments in the past. The Prefab benchmark test results are shown in Table II. Here, the results are divided into five groups according to the percent identity of the sequences. The overall scores range families) from 53 to 73% of columns correct. The consistency-based Total time (s) (1682 programs MSAprobs, MAFFT L-INS-i, Probalign, Probcons and T-Coffee, are again the most accurate but with long run times. Clustal Omega is close to the consistency programs in accuracy but is much faster. There is then a gap to the faster progressive based programs of MUSCLE, MAFFT, Kalign (Lassmann and 100 (90

Sonnhammer, 2005) and Clustal W. p Results from testing large alignments with up to 50000 families) %ID

sequences are given in Table III using HomFam. Here, each p in seconds is shown in the second last column. The last column indicates alignment is made up of a core of a Homstrad (Mizuguchi et al, 70 1998) structure-based alignment of at least five sequences. These sequences are then inserted into a test set of sequences from the corresponding, homologous, Pfam domain. This gives very large sets of sequences to be aligned but the testing is only carried out sequences on the sequences with known structures. Only some programs 70 (117 are able to deliver alignments at all, with data sets of this size. We p families) restricted the comparisons toClustalOmega,MAFFT,MUSCLE %ID and Kalign. MAFFT with default settings, has a limit of 20 000 p sequences and we only use MAFFT with –parttree for the last 40 section of Table III. MUSCLE becomes increasingly slow when you get over 3000 sequences. Therefore, for 43000 sequences we used MUSCLE with the faster but less accurate setting of – maxiters 2, which restricts the number of iterations to two. PreFab 40 (563

Overall, Clustal Omega is easily the most accurate program p in Table III. The run times show MAFFT default and Kalign families) to be exceptionally fast on the smaller test cases and MAFFT %ID p

–parttree to be very fast on the biggest families. Clustal 20 Omega does scale well, however, with increasing numbers of sequences. This scaling is described in more detail in the Supplementary Information. We do have two further test cases et al., Molecular Systems Biology 2011 with 450 000 sequences, but it was not possible to get results

for these from MUSCLE or Kalign. These are described in the 20 (912

Supplementary Information as well. p families)

Table III gives overall run times for the four programs Seivers %ID evaluated with HomFam. Figure 1 resolves these run times p 0 case by case. Kalign is very fast for small families but does not scale as well. Overall, MAFFT is faster than the other programs From over all test case sizes but Clustal Omega scales similarly.TC scores on Points in Figure 1 represent different families with different average sequence lengths and pairwise identities. Therefore,

the scalability trend is fuzzy, with larger dots occurring 100 (1682 p 0.700 0.535 0.866 0.967 0.980 1698.06 No generally above smaller dots. Supplementary Figure S3 shows 0.721 0.569 0.876 0.961 0.979 4544.45 Yes scalability data, where subsets of increasing size are sampled families) %ID from one large family only. This reduces variability in pairwise o identity and sequence length. Prefab results External profile alignment O

Clustal Omega can read extra information from a profile HMM Note that best performing method depends on the “%ID” (measure of similarity) T-CoffeeClustal MUSCLEMAFFTKalignClustalW2Dialign 0.710PRANKFSA 0.677 0.677 0.617 0.649 0.595 0.586 0.558 0.534 0.507 0.513 0.430 0.474 0.398 0.390 0.865 0.277 0.850 0.836 0.797 0.817 0.783 0.767 0.950 0.791 0.946 0.961 0.933 0.957 0.940 0.951 0.972 0.965 0.976 0.979 0.975 0.979 0.974 0.978 175 789.00 0.976 2068.56 225.56 3433.53 Yes 80.81 18 909.70 351 498.00 No 229 391.00 No No No No No No MSAprobsMAFFT (auto) ProbalignProbcons 0.737 0.719 0.717 0.591 0.563 0.562 0.889 0.881 0.876 0.965 0.961 0.955 0.971 0.977 0.972 51 286.00 35117.30 46 908.30 Yes Yes Yes Total column scores (TC) are shown forif different percent the identity method ranges; the is second consistency column is based. the average score over all test cases. The total run time derived from preexisting alignments. For example, if a user Table II Aligner 0

& 2011 EMBO and Macmillan Publishers Limited Molecular Systems Biology 2011 3 High-quality protein MSAs using Clustal Omega F Sievers et al

wishes to align a set of globin sequences and has an existing horizontal axis using a log scale. With some smaller test cases, globin alignment, this alignment can be converted to a profile iteration actually has a detrimental effect. Once you get near HMM and used as well as the sequence input file. This HMM is 1000 or more sequences, however, a clear trend emerges. here referred to as an ‘external profile’ and its use in this way The more sequences you have, the more beneficial the effect of as ‘external profile alignment’ (EPA). During EPA, each iteration is. With bigger test cases, it becomes more and more sequence in the input set is aligned to the external profile. beneficial to apply two iterations. This result confirms Pseudocount information from the external profile is then the usefulness of EPA as a general strategy. It also confirms transferred, position by position, to the input sequence. the difficulty in aligning extremely large numbers of sequences Ideally, this would be used with large curated alignments of but gives one partial solution. It also gives a very simple but particular or domains of interest such as are used in effective iteration scheme, not just for guide tree iteration, as metagenomics projects. Rather than taking the input se- used in many packages, but for iteration of the alignment itself. quences and aligning them from scratch, every time new sequences are found, the alignment should be carefully maintained and used as an external profile for EPA. Clustal Discussion Omega also can align sequences to existing alignments using The main breakthroughs since the mid 1980s in MSA methods conventional alignment methods. Users can add sequences to have been progressive alignment and the use of consistency. an alignment, one by one or align a set of aligned sequences to Otherwise, most recent work has concerned refinements for the alignment. speed or accuracy on benchmark test sets. The speed increases In this paper, we demonstrate the EPA approach with two have been dramatic but, with just two major exceptions, the examples. First, we take the 94 HomFam test cases from the methods are still basically O(N2) and incapable of being previous section and use the corresponding Pfam HMM for extended to data sets of 410 000 sequences. The two EPA. Before EPA, the average accuracy for the test cases was exceptions are mBed, used here, and MAFFT PartTree. PartTree 0.627 of correctly aligned Homstrad positions but after EPA it is faster but at the expense of accuracy, at least as judged by the rises to 0.653. This is plotted, test case for test case in benchmarking here. The second group of recent developments Figure 2A. Each dot is one test case with the TC score for Clustal Omega plotted against the score using EPA. The second example is illustrated in Figure 2B. Here, we take all the HomFam BAliBASE reference sets and align them as normal using ClustalΩ 100 000 Mafft Clustal Omega and obtain the benchmark result of 0.554 of Muscle Kalign columns correctly aligned, as already reported in Table I. For 10 000 Avr length: 1−50 50 −100 EPA, we use the benchmark reference alignments themselves 100 −150 1000 150 −200 as external profiles. The results now jump to 0.857 of columns 200 −250 250 −300 100 300 −350 correct. This is a jump of over 30% and while it is not a valid 350 −400 measure of Clustal Omega accuracy for comparison with other 400 −450 10 programs, it does illustrate the potential power of EPA to use Time (s) information in external alignments. 1 0.1

Iteration 0.01

EPA can also be used in a simple iteration scheme. Once a MSA 0.001 has been made from a set of input sequences, it can be 100 3000 10 000 100 000 converted into a HMM and used for EPA to help realign the #Sequences input sequences. This can also be combined with a full Figure 1 Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE recalculation of the guide tree. In Figure 3, we show the results (green) and Kalign (purple) against the number of sequences of HomFam test sets. Average sequence length is rendered by point size. Both axes have of one and two iterationsTC scores on HomFam sequences on every test case from HomFam. logarithmic scales. Clustal Omega and Kalign were run with default flags over the The graph is plotted as a running average TC score for all test entire range. MUSCLE was run with –maxiters 2 for N43000 sequences. cases with N or fewer test cases where N is plotted on the MAFFT was run with –parttree for N410 000 sequences.

Table III HomFam benchmarking results

93pNp2957 (41 families) 3127pNp9105 (33 families) 10 099pNp50157 (18 families) Aligner TC/t (s) TC/t (s) TC/t (s) Clustal O 0.708/2114.0 0.639/11719.5 0.464/27 328.9 Kalign 0.569/324.9 0.563/6752.0 0.420/286 711.0 MAFFT default 0.550/238.9 0.462/3115.4 / MAFFT –parttree / / 0.253/6119.4À À MUSCLE default 0.533/104À ÀÀ 587.0 /À / MUSCLE –maxiters 2 / 0.416/8239.2À ÀÀ 0.216/110À 292.0 À À

The columns show total column score (TC) and total run time in seconds for groupings of small (o3000 sequences), medium (3000–10 000 sequences) and large (410 000 sequences) HomFam test cases.

4 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited

Note the reduced set of methods due to dataset size.

From Seivers et al., Molecular Systems Biology 2011 Observaons (so far)

• Relave and absolute accuracy (wrt TC score) impacted by degree of heterogeneity and dataset size • Some methods cannot run on large datasets • On small datasets, Clustal-Omega not as accurate as Probalign, MAFFT v6.857, and MSAprobs. • On large datasets, Clustal-Omega more accurate than other methods that it was compared to. Clustal-Omega variants

• Iterang using profile HMMs • Using EPA (profiles based on external structural alignments) EPA for HomFam and BAliBASE.

Fabian Sievers et al. Mol Syst Biol 2011;7:539

© as stated in the article, figure or figure legend Iteration of HomFam alignments.

Fabian Sievers et al. Mol Syst Biol 2011;7:539

© as stated in the article, figure or figure legend Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE (green) and Kalign (purple) against the number of sequences of HomFam test sets.

Fabian Sievers et al. Mol Syst Biol 2011;7:539

© as stated in the article, figure or figure legend (a) (b) Run time (single threaded) Peak memory usage 100,000.00 100,000

10,000.00 10,000

1,000.00

1,000

100.00 s e s t d y n b o a c g e

10.00 e s

m 100

1.00

10 0.10 ClustalΩ ClustalΩ MAFFT-PartTree MAFFT-PartTree 0.01 1 20 200 2,000 20,000 200,000 20 200 2,000 20,000 200,000 sequences sequences

(c) Average memory usage (d) Alignment accuracy 100,000 0.5 ClustalΩ MAFFT-PartTree

10,000 0.4

1,000 0.3 s e e t r y o b c s a

g C e

100 T 0.2 m

10 0.1

ClustalΩ MAFFT-PartTree 1 0 20 200 2,000 20,000 200,000 20 200 2,000 20,000 200,000 sequences sequences Addional Observaons

• Iterang using profile HMMs improves TC scores for Clustal-Omega. • Using EPA (profiles based on external structural alignments) improves TC scores for Clustal-Omega. • MAFFT-parree is faster than Clustal-Omega. • The key to Clustal-Omega’s speed is its guide tree. Some concerns

• Other criteria (1-SPFN, 1-SPFP, impact on tree error, etc.) • Other datasets • Newer methods (or updated methods) • Restricon to methods that complete quickly High-quality protein MSAs using Clustal Omega F Sievers et al

wishes to align a set of globin sequences and has an existing horizontal axis using a log scale. With some smaller test cases, globin alignment, this alignment can be converted to a profile iteration actually has a detrimental effect. Once you get near HMM and used as well as the sequence input file. This HMM is 1000 or more sequences, however, a clear trend emerges. here referred to as an ‘external profile’ and its use in this way The more sequences you have, the more beneficial the effect of as ‘external profile alignment’ (EPA). During EPA, each iteration is. With bigger test cases, it becomes more and more sequence in the input set is aligned to the external profile. beneficial to apply two iterations. This result confirms Pseudocount information from the external profile is then the usefulness of EPA as a general strategy. It also confirms transferred, position by position, to the input sequence. the difficulty in aligning extremely large numbers of sequences Ideally, this would be used with large curated alignments of but gives one partial solution. It also gives a very simple but particular proteins or domains of interest such as are used in effective iteration scheme, not just for guide tree iteration, as metagenomics projects. Rather than taking the input se- used in many packages, but for iteration of the alignment itself. quences and aligning them from scratch, every time new sequences are found, the alignment should be carefully maintained and used as an external profile for EPA. Clustal Discussion Omega also can align sequences to existing alignments using The main breakthroughs since the mid 1980s in MSA methods conventional alignment methods. Users can add sequences to have been progressive alignment and the use of consistency. an alignment, one by one or align a set of aligned sequences to Otherwise, most recent work has concerned refinements for the alignment. speed or accuracy on benchmark test sets. The speed increases In this paper, we demonstrate the EPA approach with two have been dramatic but, with just two major exceptions, the examples. First, we take the 94 HomFam test cases from the methods are still basically O(N2) and incapable of being previous section and use the corresponding Pfam HMM for extended to data sets of 410 000 sequences. The two EPA. Before EPA, the average accuracy for the test cases was exceptions are mBed, used here, and MAFFT PartTree. PartTree 0.627 of correctly aligned Homstrad positions but after EPA it is faster but at the expense of accuracy, at least as judged by the rises to 0.653. This is plotted, test case for test case in benchmarking here. The second group of recent developments Figure 2A. Each dot is one test case with the TC score for Clustal Omega plotted against the score using EPA. The second example is illustrated in Figure 2B. Here, we take all the HomFam BAliBASE reference sets and align them as normal using ClustalΩ 100 000 Mafft Clustal Omega and obtain the benchmark result of 0.554 of Muscle Kalign columns correctly aligned, as already reported in Table I. For 10 000 Avr length: 1−50 50 −100 EPA, we use the benchmark reference alignments themselves 100 −150 1000 150 −200 as external profiles. The results now jump to 0.857 of columns 200 −250 250 −300 100 300 −350 correct. This is a jump of over 30% and while it is not a valid 350 −400 measure of Clustal Omega accuracy for comparison with other 400 −450 10 programs, it does illustrate the potential power of EPA to use Time (s) information in external alignments. 1 0.1

Iteration 0.01

EPA can also be used in a simple iteration scheme. Once a MSA 0.001 has been made from a set of input sequences, it can be 100 3000 10 000 100 000 converted into a HMM and used for EPA to help realign the #Sequences input sequences. This can also be combined with a full Figure 1 Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE recalculation of the guide tree. In Figure 3, we show the results (green) and Kalign (purple) against the number of sequences of HomFam test sets. Average sequence length is rendered by point size. Both axes have of one and two iterationsTC scores on HomFam sequences on every test case from HomFam. logarithmic scales. Clustal Omega and Kalign were run with default flags over the The graph is plotted as a running average TC score for all test entire range. MUSCLE was run with –maxiters 2 for N43000 sequences. cases with N or fewer test cases where N is plotted on the MAFFT was run with –parttree for N410 000 sequences.

Table III HomFam benchmarking results

93pNp2957 (41 families) 3127pNp9105 (33 families) 10 099pNp50157 (18 families) Aligner TC/t (s) TC/t (s) TC/t (s) Clustal O 0.708/2114.0 0.639/11719.5 0.464/27 328.9 Kalign 0.569/324.9 0.563/6752.0 0.420/286 711.0 MAFFT default 0.550/238.9 0.462/3115.4 / MAFFT –parttree / / 0.253/6119.4À À MUSCLE default 0.533/104À ÀÀ 587.0 /À / MUSCLE –maxiters 2 / 0.416/8239.2À ÀÀ 0.216/110À 292.0 À À

The columns show total column score (TC) and total run time in seconds for groupings of small (o3000 sequences), medium (3000–10 000 sequences) and large (410 000 sequences) HomFam test cases.

4 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited

Note the reduced set of methods due to dataset size.

From Seivers et al., Molecular Systems Biology 2011 Methods they didn’t explore

• PASTA and UPP: designed explicitly for large datasets. (They tested SATe-1, the precursor to SATe-II and PASTA, but not on the large datasets because it could not be run efficiently.) • Improved versions of MAFFT since their study • And who knows what else…

SATe Family

• SATe-I (2009): – Up to about 10,000 sequences – Good accuracy and reasonable speed – “Center-tree” decomposion • SATe-II (2012) – Up to about 50,000 sequences – Improved accuracy and speed – Centroid-edge recursive decomposion • PASTA (2014) – Up to 1,000,000 sequences – Improved accuracy and speed – Combines centroid-edge decomposion with transivity merge SATé/PASTA Algorithm Design

Obtain initial alignment and estimated ML tree

Tree Use tree to compute Estimate ML tree on new new alignment alignment

Alignment SATé iteraon (actual decomposion produces 32 subproblems)

A C e Decompose based on A B input tree C D B D Align subproblems

A B

C D

Estimate ML tree on merged ABCD Merge subproblems alignment PASTA: ULTRA-LARGE MULTIPLE 7

Table 2. Alignment Accuracy on AA Datasets

Column (TC) score Pairs score

Method AA-10 HomFam(17) HomFam(2) AA-10 HomFam(17) HomFam(2)

Clustal-O 78 88 29 0.76 0.72 0.71 Muscle 48 51 X 0.70 0.52 X Mafft 81 103 32 0.76 0.75 0.79 Initial 54 95 16 0.75 0.71 0.81 SATe´-II 83 73 X 0.75 0.64 X PASTA 80 102 36 0.76 0.78 0.83

We show TC (the number of correctly aligned sites, left) and the pairs score (the average of the SP-score and modeler score, right). X indicates that a method failed to run on a particular dataset given the computational constraints. ‘‘Initial’’ corresponds to the alignment approach used to obtain the starting tree of PASTA (HMMER failed to align one sequence in the 16S.T dataset). All values shown are averages over all datasets in each category. Boldface indicates the best values for each model condition. run two iterations of SATe´-II on a separate machine with no running time limits (12 Quad-Core AMD Opteron processors, Ma version 7.143b run in default mode (even on HomFam (17) datasets). 256GB of RAM memory). Given 12 CPUs, two iterations of SATe´-II took 137 hr, compared to 10 hr for PASTA. However, the resulting SATe´-II alignment recovered only 30 columns entirely correctly while PASTA recovered 311 columns. The pairs score of SATe´-II was extremely poor (38.2%), while PASTA was quite accurate (81.0%). The tree produced by SATe´-II had higher error than PASTA (12.6%Comparison of PASTA to versus 8.2% FN rate). SATe-II and other alignments on AA datasets. Impact of varyingFrom Mirarab et al., J. Computaonal Biology 2014 algorithmic parameters. We compared results obtained using four different starting trees: a random tree, the ML tree on the Mafft-PartTree alignment, PASTA’s default starting tree, and the true (model) tree (see Table 3). After one iteration, PASTA alignments and trees based on our starting tree or true tree had roughly the same accuracy, and the starting tree based on Mafft-PartTree resulted in only a slightly worse tree (1% higher FN rate). However, using a random tree resulted in much higher tree error rates (52.3% error), and alignments that were also less accurate. Interestingly, after three iterations of PASTA, no noticeable difference could be detected between results from various starting trees. Thus, PASTA is robust to the choice of the starting tree. We also evaluated the impact of changing the alignment subset size (Table 4); these analyses showed that using alignment subsets of only 50 sequences improved the TC score and running time substantially, and only slightly changed the pairs score or tree error score. Although these analyses were performed only for two datasets, they suggest the possibility that improved results might be obtained through smaller alignment subsets. Running Time. Figure 3 compares the running time (in hours) of different alignment methods. Note that PASTA was faster than SATe´-II in all cases and could analyze datasets that SATe´-II could not (i.e., the

Table 3. Effect of the Starting Tree on Final PASTA Alignment and Tree

Initial tree Alignment accuracy Tree error method Error (FN) Pairs score TC FN

One iteration Random 100.0% 79.9% 2 52.3% Mafft-parttree 28.7% 87.0% 126 11.7% Starting tree 12.4% 86.8% 138 10.5% True tree 0% 86.1% 133 10.5% Three iterations Random 100.0% 90.4% 138 11.0% Mafft-parttree 28.7% 83.9% 144 10.7% Starting tree 12.4% 88.8% 145 10.7% True tree 0% 90.8% 150 10.5%

Alignment accuracy and tree error is shown for PASTA with various starting trees, after one iteration (top) and three iterations (bottom) on one replicate of the 10k RNASim dataset. MAFFT vs. Clustal-Omega

• In the Clustal-Omega paper, Clustal-Omega had beer TC scores than MAFFT. • In the PASTA paper, MAFFT had beer TC scores than Clustal-Omega.

Why the difference?

• Two versions of MAFFT (Clustal-Omega paper used version 6.857, PASTA paper used version 7.143b) • Two different commands possibly (MAFFT- parree vs. MAFFT-default)

Lesson: Version number and command impacts absolute and relave performance! Katoh and Standley . doi:10.1093/molbev/mst010 MBE

Table 2. Comparison of Different Options Using the 16S.B.ALL Data Set (Mirarab et al. 2012). Data Method Accuracy CPU Time Actual Timea Case 1 ––multipair ––addfragments frags existingmsa 0.9969 6.67 days 18.3 h mafft ––6merpair ––addfragments frags existingmsa 0.9949 3.76 h 36.2 min mafft ––localpair ––add frags existingmsa 0.9707 23.4 daysb 2.43 daysb mafft ––6merpair ––add frags existingmsa 0.9604 1.32 h 1.44 h profile alignment 0.2779 15.5 h 1.60 h Case 2 mafft ––6merpair ––addfragments frags existingmsa 0.9969 4.54 h 33.8 min Case 3 mafft ––6merpair ––addfragments frags existingmsa 0.9949 1.79 days 5.91 h

NOTE.—The estimated alignments were compared with the CRW alignment to measure the accuracy (the number of correctly aligned letters/the number of aligned letters in the CRW alignment). Calculations were performed on a PC with 2.67 GHz Intel Xeon E7-8837/256 GB RAM (for the case marked with superscript alphabet “b”), or on a Linux PC with 3.47 GHz Intel Xeon X5690/48 GB RAM (for the other cases). Case 1: 13,822 sequences in the existing alignment 13,821 fragments; Case 2: 1,000 sequences in the existing alignment Â138,210 fragments; Case 3: 13,822 sequences in the existing alignment 138,210 fragments.  aWall-clock time with 10 cores. Command-line argument for parallel processing is ––thread 10. bFull command-line options are as follows: mafft ––localpair ––weighti 0 ––add frags existingmsa.

to each other. Even if a more computationally expensive (and one. The difficulty of this problem for standard approaches Downloaded from usually more accurate) method, L-INS-i, is applied (CPU time- comes from the fact that ITS1 sequences and ITS2 sequences =98h),thealignmentisstillobviouslyincorrect(fig. 2B). are not homologous to each other and most pairwise align- Two-step strategiesFrom canKatoh solve this and typeStandley of problem., 2013 (dealing with fragmentary sequences) That is, ments are impossible. Because of these nonhomologous pairs, asetoffull-lengthsequencestakenfromdatabasesarefirstMol. Biol. Evol. 30(4):772–780 doi:10.1093/the distancemolbev matrix/mst010 used for the guide tree calculation is

aligned to build a backbone MSA, and then the new ITS1 and not additive; the distances between ITS1 and full-length http://mbe.oxfordjournals.org/ ITS2 sequences are added into this backbone MSA, using the sequences and those between ITS2 and full-length sequences ––addfragments option. are close to zero, whereas the distances between ITS1 and ITS2 are quite large. In this situation, it is difficult for normal Step 1: mafft - -auto full_length_sequences >\ distance-based tree-buildingmethodstogiveareasonable backbone_msa tree. Moreover, in the alignment step, the objective function Step 2: mafft - -addfragments \ new_sequences of the L-INS-i is affected by inappropriate pairwise alignment backbone_msa > output scores between ITS1 and ITS2. Such problems can be avoided by just ignoring the relationship between ITS1 and ITS2, The second command is equivalent to by guest on February 22, 2015 as done in the ––addfragments option. mafft - -multipair - -addfragments \ In addition, a result of the second type of misuse of new_sequences backbone_msa > output mafft-profile (discussed earlier) is shown in figure 2C. in which Dynamic Programming (DP) is used to compare the Some new sequences are correctly aligned but others are distances between every new sequence and every sequence in obviously incorrectly aligned (note that the order of se- the backbone MSA (––multipair is selected by default). quences in fig. 2C is identical that in fig. 2D). These misalign- ments are due to an incorrect assumption on phylogenetic mafft - -6merpair - -addfragments \ new_sequences backbone_msa > output placement of new sequences shown in figure 1C. where distances are rapidly estimated using the number of Test Case 2: Bacterial SSU rRNA shared 6mers, instead of DP. Another case is the 16S.B.ALL data set by Mirarab et al. (2012). The result of the latter option (––6merpair It consists of an MSA of 13,822 bacterial SSU rRNA sequences, ––addfragments)isshowninfig. 2D and E.Thedifference taken from the Gutell Comparative RNA Website (CRW) between D and E is just in the order of sequences; the se- (Cannone et al. 2002)and138,210fragmentarysequences, quences were reordered according to similarity using the which are originally included in the CRW alignment ––reorder option in E.Inthisalignment,ITS1andITS2 but ungapped and artificially truncated. In Katoh and are clearly separated and aligned to appropriate positions in Standley (2013),weusedasubset(13,821fragmentary the full-length alignment. Moreover, this strategy is compu- sequences) prepared by Mirarab et al. (2012).Inadditionto tationally much less expensive (CPU time = 15 min [first this subset, here we use the full data set (138,210 fragmentary step] + 1.5 min [second step]) than the full application sequences), to examine the scalability. Suppose a situation of L-INS-i (CPU time = 98 h). The former option where we already have a manually curated (or backbone) (––multipair––addfragments)alsoreturnsasimilar MSA and a newly determined set of many fragmentary result to the latter (––6merpair) but is slower (CPU sequences in a metagenomics project, and we need an time = 48.6 min [second step]). entire MSA of them. This case suggests that it is crucial to select a strategy The first four lines in table 2 (case 1) show the perfor- appropriate to the problem of interest. The most time- mances of various options for such an analysis, with a rela- consuming method, L-INS-i, is not always the most accurate tively small data set (13,822 sequences in the existing

776 Reviewer 1

• “The overall message appears to be that ClustalOmega appears to be the fastest and most accurate algorithm for very large alignments, while it is sll substanally outperformed by other methods for smaller alignments.” • “Looking at the results presented here (again, from the standpoint of a "consumer"), I would not personally expect to make much use of this algorithm and will probably connue to use MAFFT or MUSCLE, mainly since they're sll performing similarly well or beer for smaller alignments, which are the main use for myself. I expect this to be true for the majority of users. However, this may change in the future and especially for those who work on major / data repository centers (such as the EBI), as sequencing of more species ramps up.” Reviewer 2

• “The main novelty of this program compared to previous versions of Clustal is its ability to deal with huge data sets. This is mainly achieved by using an efficient clustering algorithm to efficiently calculate a guide tree for progressive alignment… construcng the guide tree for progressive alignment is the boleneck of most MSA programs: calculang a tree from the input sequences with tradional methods takes (at least) O(n2) me while progressively aligning the sequences according to this tree can be done in linear me.” Comments about EPA

• Reviewer 1: “It also offers the Profile Alignment use- case, which is likely to be more important in the future” • Reviewer 2: “A second interesng novelty of Clustal Omega is the opon to use previously calculated "External Profile Alignments" (EPA) to improve the alignment procedure. This is done using an HMM approach developed by one of the co-authors. Since for many protein families, there are now profile HMMs available, it is a very sensible idea to use this informaon for improved MSA, rather than relying on sequence similarity alone, as do tradional alignment methods.” Overall

• Main features of Clustal-Omega are the fast guide tree calculaon, the use of iteraon, and EPA. (Both EPA and iteraon use profile HMMs.) • But improvements in accuracy compared to other methods only seems to be apparent on large datasets when restricng the set of methods to those that complete quickly. Newer methods (and even new versions of old methods like MAFFT) seem to outperform Clustal-Omega for accuracy, even if they are not as fast. • Accuracy was only assessed using TC score, which is not necessarily the whole story. Choice of best MSA method

• Does it depend on type of data (DNA or amino acids?) • Does it depend on rate of evoluon? • Does it depend on gap length distribuon? • Does it depend on existence of fragments? How does the guide tree impact accuracy?

• Does improving the accuracy of the guide tree help? • Do all alignment methods respond idencally? (Is the same guide tree good for all methods?) • Do the default sengs for the guide tree work well? Another study…

• Prank (Loytynoja and Goldman, Science 2008) is a “phylogeny aware” progressive alignment strategy. • Their study focused on evaluang MSAs with respect to TC score, but also atypical criteria, such as: – tree branch length esmaon – Alignment length esmaon (compression issue) – Inseron/deleon rao – Number of inserons/deleons • They explored very small simulated datasets, evolving sequences down trees. REPORTS have a flaw that has gone unnoticed as long as that these patterns are, in fact, imposed by will give a very different picture of the mech- different methods have been consistent in the systematic biases in alignment algorithms, anisms of sequence evolution and show error they create. even in cases where they are incorrect and, sequence turnover through short insertions That such a significant error has passed indeed, phylogenetically unreasonable. We and deletions as a more frequent and im- undetected may be explained by the align- contend that algorithms that impose gap pat- portant phenomenon. This raises interesting ment field's historical focus on proteins, where terns like those found in structural align- questions of the true evolution of variable these biases tend to be manifested in less- ments of proteins are inappropriate for the sequences such as promoter regions, non- constrained regions such as loops (compare increasingly widespread analysis of genomic coding DNA, and exposed coil regions in Fig. 1). Alignments with insertions and de- DNA and are likely to cause error when the proteins: Do they predominantly evolve letions squeezed compactly between con- resulting alignments are used for evolutionary through point substitutions, or are those dis- served blocks may suffice for, and even be inferences. similar regions just incorrectly aligned non- preferred by, some molecular biologists work- We believe that alignment methods spe- homologous sequences? To resolve that, we ing with proteins. We have shown, however, cifically designed for evolutionary analyses need more sequence data and alignment methods that can really benefit from the additional information. The resulting align- ments may be fragmented by many gaps and may not be as visually beautiful as the traditional alignments, but if they represent correct homology, we have to get used to them.

References and Notes 1. R. A. Gibbs et al., Nature 428, 493 (2004). 2. Rhesus Macaque Genome Sequencing and Analysis Consortium, Science 316, 222 (2007). 3. The ENCODE Project Consortium, Nature 447, 799 (2007). 4. A. Stark et al., Nature 450, 219 (2007). 5. Materials and methods are available as supporting material on Science Online. 6. A. Rambaut, D. Posada, K. Crandall, E. Holmes, Nat. Rev. Genet. 5, 52 (2004). 7. N. Sullivan, M. Thali, C. Furman, D. Ho, J. Sodroski, J. Virol. 67, 3674 (1993). 8. R. Wyatt et al., J. Virol. 69, 5723 (1995). 9. M. Jansson et al., AIDS Res. Hum. Retroviruses 17, 1405 (2001). 10. S. D. Frost et al., Proc. Natl. Acad. Sci. U.S.A. 102, 18514 (2005). 11. M. Sagar, X. Wu, S. Lee, J. Overbaugh, J. Virol. 80, 9586 (2006). 12. J. D. Thompson, D. G. Higgins, T. J. Gibson, Nucleic Acids Res. 22, 4673 (1994). 13. C. Notredame, D. G. Higgins, J. Heringa, J. Mol. Biol. 302, 205 (2000). 14. R. C. Edgar, BMC Bioinformat. 5, 113 (2004). 15. K. Katoh, K. Kuma, H. Toh, T. Miyata, Nucleic Acids Res. 33, 511 (2005). 16. A. Löytynoja, N. Goldman, Proc. Natl. Acad. Sci. U.S.A. 102, 10557 (2005). 17. D. D. Pollock, D. J. Zwickl, J. A. McGuire, D. M. Hillis, Syst. Biol. 51, 664 (2002). 18. M. S. Rosenberg, S. Kumar, Syst. Biol. 52, 119 (2003). 19. M. S. Rosenberg, BMC Bioinformat. 6, 278 (2005). 20. K. M. Wong, M. A. Suchard, J. P. Huelsenbeck, Science 319, 473 (2008). Fig. 3. Alignment accuracy errors are reduced by a phylogeny-aware algorithm. (A to F)Errors 21. This work was funded in part by a Wellcome from traditional alignment methods grow with increasing evolutionary distances and more Trust Programme Grant (GR078968). We thank difficult alignments (close-intermediate-distant, white to blue gradient) and also with a denser N. Luscombe for many suggestions that improved sequence sampling and increasingly similar sequences (close-2X-4X, white to red gradient). the manuscript. The phylogeny-aware method PRANK+F is less biased by greater distances and, in contrast to other methods, improvesFrom inLoytyjoja accuracy and Goldman, Science 2008: with additional sequence data. Alignment statistics for Supporting Online Material the five multiple sequence alignment methods are number of (A) insertions and (B) deletions, www.sciencemag.org/cgi/content/full/320/5883/1632/DC1 (C) insertion/deletion ratio, (D) gap overlap, (E) total length of the alignment, and (F) Materials and Methods G SOM Text proportion of columns correctly recovered. ( ) Inferred branch lengths at different depths in Fig. S1 the tree (t1 – t4) for the intermediate sets indicate that alignment errors lead to overestimated Table S1 branch lengths, with PRANK+F giving the most accurate estimates across the whole range of References depths. All measures in (A) to (G) are shown relative to those inferred from the true align- ment, and values closer to 1 are more correct. Vertical bars show means and 95% confidence 28 March 2008; accepted 15 May 2008 intervals. 10.1126/science.1158395

www.sciencemag.org SCIENCE VOL 320 20 JUNE 2008 1635 REPORTS have a flaw that has gone unnoticed as long as that these patterns are, in fact, imposed by will give a very different picture of the mech- different methods have been consistent in the systematic biases in alignment algorithms, anisms of sequence evolution and show error they create. even in cases where they are incorrect and, sequence turnover through short insertions That such a significant error has passed indeed, phylogenetically unreasonable. We and deletions as a more frequent and im- undetected may be explained by the align- contend that algorithms that impose gap pat- portant phenomenon. This raises interesting ment field's historical focus on proteins, where terns like those found in structural align- questions of the true evolution of variable these biases tend to be manifested in less- ments of proteins are inappropriate for the sequences such as promoter regions, non- constrained regions such as loops (compare increasingly widespread analysis of genomic coding DNA, and exposed coil regions in Fig. 1). Alignments with insertions and de- DNA and are likely to cause error when the proteins: Do they predominantly evolve letions squeezed compactly between con- resulting alignments are used for evolutionary through point substitutions, or are those dis- served blocks may suffice for, and even be inferences. similar regions just incorrectly aligned non- preferred by, some molecular biologists work- We believe that alignment methods spe- homologous sequences? To resolve that, we ing with proteins. We have shown, however, cifically designed for evolutionary analyses need more sequence data and alignment methods that can really benefit from the additional information. The resulting align- ments may be fragmented by many gaps and may not be as visually beautiful as the traditional alignments, but if they represent correct homology, we have to get used to them.

References and Notes 1. R. A. Gibbs et al., Nature 428, 493 (2004). 2. Rhesus Macaque Genome Sequencing and Analysis Consortium, Science 316, 222 (2007). 3. The ENCODE Project Consortium, Nature 447, 799 (2007). 4. A. Stark et al., Nature 450, 219 (2007). 5. Materials and methods are available as supporting material on Science Online. 6. A. Rambaut, D. Posada, K. Crandall, E. Holmes, Nat. Rev. Genet. 5, 52 (2004). 7. N. Sullivan, M. Thali, C. Furman, D. Ho, J. Sodroski, J. Virol. 67, 3674 (1993). 8. R. Wyatt et al., J. Virol. 69, 5723 (1995). 9. M. Jansson et al., AIDS Res. Hum. Retroviruses 17, 1405 (2001). 10. S. D. Frost et al., Proc. Natl. Acad. Sci. U.S.A. 102, 18514 (2005). 11. M. Sagar, X. Wu, S. Lee, J. Overbaugh, J. Virol. 80, 9586 (2006). 12. J. D. Thompson, D. G. Higgins, T. J. Gibson, Nucleic Acids Res. 22, 4673 (1994). 13. C. Notredame, D. G. Higgins, J. Heringa, J. Mol. Biol. 302, 205 (2000). 14. R. C. Edgar, BMC Bioinformat. 5, 113 (2004). 15. K. Katoh, K. Kuma, H. Toh, T. Miyata, Nucleic Acids Res. 33, 511 (2005). 16. A. Löytynoja, N. Goldman, Proc. Natl. Acad. Sci. U.S.A. 102, 10557 (2005). 17. D. D. Pollock, D. J. Zwickl, J. A. McGuire, D. M. Hillis, From Loytynoja and Goldman, Science 2008 Syst. Biol. 51, 664 (2002). 18. M. S. Rosenberg, S. Kumar, Syst. Biol. 52, 119 (2003). 19. M. S. Rosenberg, BMC Bioinformat. 6, 278 (2005). 20. K. M. Wong, M. A. Suchard, J. P. Huelsenbeck, Science 319, 473 (2008). Fig. 3. Alignment accuracy errors are reduced by a phylogeny-aware algorithm. (A to F)Errors 21. This work was funded in part by a Wellcome from traditional alignment methods grow with increasing evolutionary distances and more Trust Programme Grant (GR078968). We thank difficult alignments (close-intermediate-distant, white to blue gradient) and also with a denser N. Luscombe for many suggestions that improved sequence sampling and increasingly similar sequences (close-2X-4X, white to red gradient). the manuscript. The phylogeny-aware method PRANK+F is less biased by greater distances and, in contrast to other methods, improves in accuracy with additional sequence data. Alignment statistics for Supporting Online Material the five multiple sequence alignment methods are number of (A) insertions and (B) deletions, www.sciencemag.org/cgi/content/full/320/5883/1632/DC1 (C) insertion/deletion ratio, (D) gap overlap, (E) total length of the alignment, and (F) Materials and Methods G SOM Text proportion of columns correctly recovered. ( ) Inferred branch lengths at different depths in Fig. S1 the tree (t1 – t4) for the intermediate sets indicate that alignment errors lead to overestimated Table S1 branch lengths, with PRANK+F giving the most accurate estimates across the whole range of References depths. All measures in (A) to (G) are shown relative to those inferred from the true align- ment, and values closer to 1 are more correct. Vertical bars show means and 95% confidence 28 March 2008; accepted 15 May 2008 intervals. 10.1126/science.1158395

www.sciencemag.org SCIENCE VOL 320 20 JUNE 2008 1635 Observaons

• Most alignment methods “over-align” (produce compressed alignments) • Prank avoids this through its “phylogeny-aware” strategy • Compression results in – Over-esmaons of branch lengths – Under-esmaon of inserons • Clustal is least accurate, other methods in between Research quesons

• Clustal-Omega’s main advantage seems to be its fast guide tree. But the guide tree has a large impact on accuracy – can we improve the accuracy of Clustal-Omega by giving it a beer guide tree? • Iteraon using profile HMMs helps accuracy; can we use this technique in other methods? • EPAs are useful. What are the best ways to use external structurally-defined alignments? Research Projects

• Design your own MSA method, or just modify an exisng one in some simple way (e.g., different guide tree) • Test exisng MSA methods with respect to different criteria (e.g., extend Prank study to more methods and datasets) • Develop different MSA criteria that are more appropriate than TC, SPFN, SPFP • Compare different MSA methods on some biological dataset • Parallelize some MSA method • Consider how to combine MSAs on the same input