Comparison of Heuristic Approaches to the Generalized Tree Alignment Problem
Total Page:16
File Type:pdf, Size:1020Kb
Cladistics Cladistics (2015) 1–9 10.1111/cla.12142 Comparison of heuristic approaches to the generalized tree alignment problem Eric Forda,b,* and Ward C. Wheelerb aDepartment of Mathematics & Computer Science, Lehman College, CUNY, Bronx, NY 10468, USA; bDivision of Invertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA Accepted 10 September 2015 Abstract Two commonly used heuristic approaches to the generalized tree alignment problem are compared in the context of phyloge- netic analysis of DNA sequence data. These approaches, multiple sequence alignment + phylogenetic tree reconstruction (MSA+TR) and direct optimization (DO), are alternative heuristic procedures used to approach the nested NP-Hard optimiza- tions presented by the phylogenetic analysis of unaligned sequences under maximum parsimony. Multiple MSA+TR implementa- tions and DO were compared in terms of optimality score (phylogenetic tree cost) over multiple empirical and simulated datasets with differing levels of heuristic intensity. In all cases examined, DO outperformed MSA+TR with average improvement in parsimony score of 14.78% (5.64–52.59%). © The Willi Hennig Society 2015. A central goal of biological systematics is mapping data (e.g. Lindgren and Daly, 2007; Lehtonen, 2008; the relationships among organisms and groups of Liu et al., 2009; Giribet and Edgecombe, 2013). In addi- organisms—both extant and extinct—based on the tion, the high degree of complexity in the settings of the reconstruction of phylogenetic trees using comparative software tools used for alignment and search only con- character data. The generalized tree alignment problem fuses the matter, as default settings are often used, and (GTAP; Sankoff, 1975) is defined as the search for these defaults do not necessarily correspond between phylogenetic tree(s)—and associated vertex (hypotheti- aligner and search engine. Here, we compare DO with cal ancestor) sequences—with lowest cost for those two-step solutions directly. We also test whether the data under maximum parsimony. results of searches where alignment and search setting There has been an ongoing debate in the literature correspond are better (i.e. more optimal) than those in regarding multiple sequence alignment (Katoh et al., which they do not. We find that DO results in the dis- 2002; Edgar, 2004; Wheeler, 2007), with several aligners covery of shorter trees, by an average factor of 15%. In available. In addition, much effort has been expended to addition, using the two-step approach we found signifi- improving search on aligned sequences (Goloboff et al., cantly (approximately 4%) shorter trees when using set- 2003). At the same time, other paradigmata for tings on alignment that match the settings of subsequent approaching the GTAP are also available, chief among tree search (as opposed to the default settings of multi- those being direct optimization (DO) (Wheeler, 1996, ple sequence alignment (MSA) implementations). 2003; Varon and Wheeler, 2012, 2013). It has been the experience of many investigators that DO gives signifi- cantly better results than the two-step process of align- Software tools ment followed by search for both real and simulated We ran comparisons using several pieces of align- *Corresponding author: ment software. What follows is a brief description of E-mail address: [email protected] each package. © The Willi Hennig Society 2015 2 E. Ford & W. C. Wheeler / Cladistics 0 (2015) 1–9 CLUSTAL OMEGA (Sievers et al., 2011) uses a Materials guide tree to align sequences. It is similar to CLUS- TAL W, described below, but speeds the process of Biological data creating the guide tree by re-encoding each sequence as an n-dimensional vector, which can be We ran alignment and search on six biological data viewed as its similarity to n reference sequences. sets and six synthetic data sets. The biological data This allows for more rapid clustering using hidden sets were chosen for their sizes. Each of the software Markov models. packages allows for the input of either genetic or pro- CLUSTAL W 2.0.12 (Larkin et al., 2007) pre-dates tein data. Here, we restricted our comparisons to CLUSTAL OMEGA and uses neighbour-joining (Saitou and nucleotide sequence data. Nei, 1987). In addition, CLUSTAL W weights the • 62 mantodean small subunit sequences (Svenson sequences, giving more-divergent sequences more and Whiting, 2004); weight, in an attempt to get improved results in pairwise • 208 metazoan small subunit sequences (Giribet comparisons. and Wheeler, 1999, 2001); MAFFT v7.029b (Katoh et al., 2002) performs pro- • 585 archaean small subunit sequence data from gressive alignments, also using a guide tree, and the European Ribosomal RNA Database uses fast Fourier transform to speed up identifica- (Wheeler, 2007); tion of homologous regions in sequences. Depend- • 1040 metazoan mitochondrial small subunit data ing on the number of taxa, MAFFT uses either from the European Ribosomal RNA Database progressive or iterative alignment algorithms. Both (Wheeler, 2007); MAFFT and MUSCLE, described below, use UPGMA • 1553 fungal small ribosomal subunit sequences (unweighted pair group method with arithmetic extracted from GenBank (Benson et al., 2013); mean, a hierarchical clustering method using a simi- • 1766 metazoan small subunit sequences extracted larity matrix), rather than neighbour-joining, as in from GenBank (Benson et al., 2013). CLUSTAL W. MUSCLE v3.8.31 (Edgar, 2004) also uses a guide Synthetic data tree to direct alignment. To build the initial tree, MUSCLE uses the approximate kmer pairwise dis- The synthetic data were created, using DAWG, by tances. A kmer is a subsequence sample of length k, varying the number of taxa and the distribution of and the kmer distance used by MUSCLE is the frac- edge (branch) lengths. Each of the synthetic data sets tion of kmers in common in a compressed alphabet. was rooted, and included an outgroup. All had initial This initial tree build is followed by progressive sequence lengths of 1800 bp (comparable to metazoan alignments using the Kimura distance (Kimura and small subunit data). We chose taxon counts to match Ohta, 1972). (approximately) the sizes of the smaller biological data In contrast to the above packages, POY (5.0 and sets, thus creating sets of 51, 201 and 601 taxa, includ- antecedents; Wheeler et al., 2015, 2013) uses DO. This ing a root taxon. For each of these taxon sets, we used approach optimizes median sequences on trees created a Python script to create a random tree, then assigned via heuristic search. The tree search itself is performed edge lengths to the tree using two distributions, an using heuristics similar to those implemented in TNT exponential distribution with mean of 0.1, and a uni- (Goloboff et al., 2003), but because the median opti- form distribution with values between 0.0 and 0.5. We mization and scoring are done as each tree is con- therefore generated a total of six synthetic data sets structed, a POY search is computationally more using DAWG. complex than a TNT search. In short, TNT is dealing with one NP-Hard optimization (tree search), whereas POY is dealing with two (tree search and tree align- Methods ment). After the alignment stage, we used TNT to search For each of the 12 data sets, a two-step process was for the shortest trees. In an attempt to compare followed. First, the sequences were aligned by each of DO with two-step searches more thoroughly, we the aligners (including POY implied alignment). The also ran TNT on the implied alignment results of resulting alignments were then fed into TNT for tree POY searches. search (specifics below). Thus, for each data set eight Finally, to create synthetic data with length varia- total alignments were generated (two each for MAFFT, tion in a more controlled manner (i.e. gaps, and thus POY/TNT and CLUSTAL OMEGA, one each for MUSCLE simulate unaligned sequences), we used DAWG 1.2-re- and CLUSTAL W), and for each of the eight alignments, lease (Cartwright, 2005). two TNT searches were run. Runs using POY alone (DO E. Ford & W. C. Wheeler / Cladistics 0 (2015) 1–9 3 without MSA) were performed on each unaligned data For the alignment stage, we first ran MAFFT with set (labelled “POY” in the tables and figures). This default settings, then reran it with gap opening cost of yielded a total of 192 runs through TNT, but 216 0 and extension cost 1 (–op 0 –ep 1 –lop 0 – searches, including POY solo searches. lep 2). MAFFT uses different alignment settings On several of the runs, POY found multiple trees. depending upon the size of the input data. To recreate Rather than performing multiple TNT searches on each those settings while changing the defaults for gap of the implied alignments, in these cases we used the costs, three different profiles were created: L-INS-i, first tree and implied alignment output. As POY outputs with commands –localpair –maxiterate these trees in the order that they were encountered 1000, FFT-NS-i (–retree 2 –maxiterate (with a randomized component), this choice was arbi- 2) and FFT-NS-2 (–retree 2 –maxiterate trary. DAWG takes as input a rooted tree with edge 0). For each data set, MAFFT was first run with default lengths and a sequence length, then evolves sequences settings. Its terminal output was then read to deter- following the tree. We generated random trees using a mine which of the three profiles was used, and that Python script, with edge lengths as noted above. Each profile was used when running MAFFT again with the of the six trees used to create the synthetic data was custom gap costs. The specifics of MAFFT options are fed into DAWG. We used a GTR model with defaults detailed in Katoh et al. (2002). The individual data settings for the frequency and parameters settings.