Multiple Sequence Alignment Methods: Evidence from Data
Multiple sequence alignment methods: evidence from data
Tandy Warnow

Alignment Error/Accuracy
• SPFN: percentage of homologies in the true alignment that are not recovered (false negative homologies)
• SPFP: percentage of homologies in the estimated alignment that are false (false positive homologies)
• TC: total number of columns correctly recovered
• SP-score: percentage of homologies in the true alignment that are recovered
• Pairs score: 1 - (average of SPFN and SPFP)

Benchmarks
• Simulations: can control everything, and true alignment is not disputed
  – Different simulators
• Biological: can't control anything, and reference alignment might not be true alignment
  – BAliBASE, HomFam, Prefab
  – CRW (Comparative RNA Website)

Alignment Methods (Sample)
• Clustal-Omega
• MAFFT
• Muscle
• Opal
• Prank/Pagan
• Probcons

Co-estimation of trees and alignments
• Bali-Phy and Alifritz (statistical co-estimation)
• SATe-1, SATe-2, and PASTA (divide-and-conquer co-estimation)
• POY and Beetle (treelength optimization)

Other Criteria
• Tree topology error
• Tree branch length error
• Gap length distribution
• Insertion/deletion ratio
• Alignment length
• Number of indels

How does the guide tree impact accuracy?
• Does improving the accuracy of the guide tree help?
• Do all alignment methods respond identically? (Is the same guide tree good for all methods?)
• Do the default settings for the guide tree work well?

Alignment criteria
• Does the relative performance of methods depend on the alignment criterion?
• Which alignment criteria are predictive of tree accuracy?
• How should we design MSA methods to produce best accuracy?

Choice of best MSA method
• Does it depend on type of data (DNA or amino acids)?
• Does it depend on rate of evolution?
• Does it depend on gap length distribution?
• Does it depend on existence of fragments?

Katoh and Standley. doi:10.1093/molbev/mst010 (MBE)

Table 2. Comparison of Different Options Using the 16S.B.ALL Data Set (Mirarab et al. 2012).
Data    Method                                              Accuracy  CPU Time       Actual Time (a)
Case 1  mafft --multipair --addfragments frags existingmsa  0.9969    6.67 days      18.3 h
        mafft --6merpair --addfragments frags existingmsa   0.9949    3.76 h         36.2 min
        mafft --localpair --add frags existingmsa           0.9707    23.4 days (b)  2.43 days (b)
        mafft --6merpair --add frags existingmsa            0.9604    1.32 h         1.44 h
        profile alignment                                   0.2779    15.5 h         1.60 h
Case 2  mafft --6merpair --addfragments frags existingmsa   0.9969    4.54 h         33.8 min
Case 3  mafft --6merpair --addfragments frags existingmsa   0.9949    1.79 days      5.91 h

NOTE. The estimated alignments were compared with the CRW alignment to measure the accuracy (the number of correctly aligned letters / the number of aligned letters in the CRW alignment). Calculations were performed on a Linux PC with 2.67 GHz Intel Xeon E7-8837/256 GB RAM (for the case marked with superscript “b”), or on a Linux PC with 3.47 GHz Intel Xeon X5690/48 GB RAM (for the other cases).
Case 1: 13,822 sequences in the existing alignment × 13,821 fragments; Case 2: 1,000 sequences in the existing alignment × 138,210 fragments; Case 3: 13,822 sequences in the existing alignment × 138,210 fragments.
(a) Wall-clock time with 10 cores. Command-line argument for parallel processing is --thread 10.
(b) Full command-line options are as follows: mafft --localpair --weighti 0 --add frags existingmsa.

From Katoh and Standley, 2013 (dealing with fragmentary sequences), Mol. Biol. Evol. 30(4):772–780, doi:10.1093/molbev/mst010

... to each other. Even if a more computationally expensive (and usually more accurate) method, L-INS-i, is applied (CPU time = 98 h), the alignment is still obviously incorrect (fig. 2B).

Two-step strategies can solve this type of problem. That is, a set of full-length sequences taken from databases are first aligned to build a backbone MSA, and then the new ITS1 and ITS2 sequences are added into this backbone MSA, using the --addfragments option.

    Step 1: mafft --auto full_length_sequences > backbone_msa
    Step 2: mafft --addfragments new_sequences backbone_msa > output

The second command is equivalent to

    mafft --multipair --addfragments new_sequences backbone_msa > output

in which Dynamic Programming (DP) is used to compare the distances between every new sequence and every sequence in the backbone MSA (--multipair is selected by default).

    mafft --6merpair --addfragments new_sequences backbone_msa > output

where distances are rapidly estimated using the number of shared 6mers, instead of DP.

The result of the latter option (--6merpair --addfragments) is shown in fig. 2D and E. The difference between D and E is just in the order of sequences; the sequences were reordered according to similarity using the --reorder option in E. In this alignment, ITS1 and ITS2 are clearly separated and aligned to appropriate positions in the full-length alignment. Moreover, this strategy is computationally much less expensive (CPU time = 15 min [first step] + 1.5 min [second step]) than the full application of L-INS-i (CPU time = 98 h). The former option (--multipair --addfragments) also returns a similar result to the latter (--6merpair) but is slower (CPU time = 48.6 min [second step]).

This case suggests that it is crucial to select a strategy appropriate to the problem of interest. The most time-consuming method, L-INS-i, is not always the most accurate one. The difficulty of this problem for standard approaches comes from the fact that ITS1 sequences and ITS2 sequences are not homologous to each other and most pairwise alignments are impossible. Because of these nonhomologous pairs, the distance matrix used for the guide tree calculation is not additive; the distances between ITS1 and full-length sequences and those between ITS2 and full-length sequences are close to zero, whereas the distances between ITS1 and ITS2 are quite large. In this situation, it is difficult for normal distance-based tree-building methods to give a reasonable tree. Moreover, in the alignment step, the objective function of the L-INS-i is affected by inappropriate pairwise alignment scores between ITS1 and ITS2. Such problems can be avoided by just ignoring the relationship between ITS1 and ITS2, as done in the --addfragments option.

In addition, a result of the second type of misuse of mafft-profile (discussed earlier) is shown in figure 2C. Some new sequences are correctly aligned but others are obviously incorrectly aligned (note that the order of sequences in fig. 2C is identical to that in fig. 2D). These misalignments are due to an incorrect assumption on phylogenetic placement of new sequences shown in figure 1C.

Test Case 2: Bacterial SSU rRNA

Another case is the 16S.B.ALL data set by Mirarab et al. (2012). It consists of an MSA of 13,822 bacterial SSU rRNA sequences, taken from the Gutell Comparative RNA Website (CRW) (Cannone et al. 2002) and 138,210 fragmentary sequences, which are originally included in the CRW alignment but ungapped and artificially truncated. In Katoh and Standley (2013), we used a subset (13,821 fragmentary sequences) prepared by Mirarab et al. (2012). In addition to this subset, here we use the full data set (138,210 fragmentary sequences), to examine the scalability. Suppose a situation where we already have a manually curated (or backbone) MSA and a newly determined set of many fragmentary sequences in a metagenomics project, and we need an entire MSA of them.

The first four lines in table 2 (case 1) show the performances of various options for such an analysis, with a relatively small data set (13,822 sequences in the existing ...

Important!
• Each method can be run in different ways, so you need to know the exact command used to be able to evaluate performance. (You also need to know the version number!)
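Because the exact command matters, one way to keep it on record is to build the two-step MAFFT command lines programmatically and log them verbatim. The following is a minimal sketch, not part of the original slides or paper: the helper names `two_step_commands` and `as_shell` and the default file names are hypothetical, and only the options quoted above (`--auto`, `--addfragments`, `--6merpair`, `--multipair`, `--thread`) are used.

```python
import shlex


def two_step_commands(full_length, fragments, backbone="backbone_msa",
                      output="output", pair_mode="6merpair", threads=None):
    """Build the two mafft invocations of the two-step strategy.

    Returns a list of (argument_list, output_path) pairs. Nothing is executed;
    the function and its defaults are illustrative assumptions.
    """
    if pair_mode not in ("6merpair", "multipair"):
        raise ValueError("pair_mode must be '6merpair' or 'multipair'")
    # Step 1: align the full-length sequences to obtain the backbone MSA.
    step1 = ["mafft", "--auto", full_length]
    # Step 2: add the fragmentary sequences into the backbone MSA.
    step2 = ["mafft", "--" + pair_mode, "--addfragments", fragments, backbone]
    if threads is not None:
        # Place the parallel-processing option right after the program name.
        step1[1:1] = ["--thread", str(threads)]
        step2[1:1] = ["--thread", str(threads)]
    return [(step1, backbone), (step2, output)]


def as_shell(cmd, out_path):
    """Render one command as the shell line that would be logged or run."""
    return " ".join(shlex.quote(arg) for arg in cmd) + " > " + shlex.quote(out_path)


if __name__ == "__main__":
    for cmd, out_path in two_step_commands("full_length_sequences", "new_sequences"):
        print(as_shell(cmd, out_path))
```

Storing these printed lines next to the resulting alignment, together with the MAFFT version reported by the installed program, makes a later accuracy comparison reproducible.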
Clustal-Omega study
• Clustal-Omega (Sievers et al., Molecular Systems Biology 2011) is the latest in the Clustal family of MSA methods
• Clustal-Omega is designed primarily for amino acid alignment, but can be used on nucleotide datasets
• Alignment criterion: TC (column score)
• Datasets: biological with structural alignments

High-quality protein MSAs using Clustal Omega, F Sievers et al

... attention to gaps. These gap positions are not included in these tests as they tend not to be structurally conserved. Dialign (Morgenstern et al, 1998) does not use consistency or progressive alignment but is based on finding best local multiple alignments. FSA (Bradley et al, 2009) uses sampling of pairwise alignments and 'sequence annealing' and has been shown to deliver good nucleotide sequence alignments in the past.

The Prefab benchmark test results are shown in Table II. Here, the results are divided into five groups according to the percent identity of the sequences. The overall scores range from 53 to 73% of columns correct. The consistency-based programs MSAprobs, MAFFT L-INS-i, Probalign, Probcons and T-Coffee, are again the most accurate but with long run times. Clustal Omega is close to the consistency programs in accuracy but is much faster. There is then a gap to the faster progressive based programs of MUSCLE, MAFFT, Kalign (Lassmann and Sonnhammer, 2005) and Clustal W.

[Table II (Prefab results): column labels include %ID groups with family counts, Total time (s) over 1682 families, and Consistency.]

Results from testing large alignments with up to 50000 sequences are given in Table III using HomFam. Here, each alignment is made up of a core of a Homstrad (Mizuguchi et al, 1998) structure-based alignment of at least five sequences. These sequences are then inserted into a test set of sequences from the corresponding, homologous, Pfam domain. This gives very large sets of sequences to be aligned but the testing is only carried out on the sequences with known structures. Only some programs are able to deliver alignments at all, with data sets of this size. We restricted the comparisons to Clustal Omega, MAFFT, MUSCLE and Kalign.
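The alignment criteria used in these comparisons (SPFN, SPFP, SP-score, Pairs score, and the TC column score) can be illustrated with a small sketch. It assumes that both alignments are given as lists of equal-length gapped strings over the same sequences in the same order, that `-` is the gap character, and that TC counts only columns containing at least two residues; the function names are hypothetical, and the scores are returned as fractions rather than percentages.

```python
from itertools import combinations


def columns(alignment):
    """Yield each column as a tuple of (row, residue_index) for non-gap cells."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    counters = [0] * len(alignment)  # next residue index for each row
    for col in range(length):
        cells = []
        for r, row in enumerate(alignment):
            if row[col] != "-":
                cells.append((r, counters[r]))
                counters[r] += 1
        yield tuple(cells)


def homologies(alignment):
    """Set of residue pairs that the alignment places in the same column."""
    pairs = set()
    for cells in columns(alignment):
        pairs.update(combinations(cells, 2))
    return pairs


def alignment_scores(true_aln, est_aln):
    """Compare an estimated alignment with the true (or reference) alignment."""
    ungapped_true = [row.replace("-", "") for row in true_aln]
    ungapped_est = [row.replace("-", "") for row in est_aln]
    if ungapped_true != ungapped_est:
        raise ValueError("alignments must contain the same sequences in the same order")
    true_pairs, est_pairs = homologies(true_aln), homologies(est_aln)
    if not true_pairs or not est_pairs:
        raise ValueError("each alignment must contain at least one homology")
    shared = len(true_pairs & est_pairs)
    spfn = 1 - shared / len(true_pairs)  # true homologies that were missed
    spfp = 1 - shared / len(est_pairs)   # estimated homologies that are false
    # Assumption: a column counts for TC only if it holds two or more residues.
    true_cols = {c for c in columns(true_aln) if len(c) >= 2}
    est_cols = {c for c in columns(est_aln) if len(c) >= 2}
    return {
        "SPFN": spfn,
        "SPFP": spfp,
        "SP-score": 1 - spfn,
        "Pairs score": 1 - (spfn + spfp) / 2,
        "TC": len(true_cols & est_cols),
    }


if __name__ == "__main__":
    print(alignment_scores(["ACGT", "AC-T"], ["ACGT", "A-CT"]))
```

Representing each residue by its row and its position in the ungapped sequence makes the comparison independent of where gaps fall, which is why the two alignments must be built from identical input sequences.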