Analysis of comparative sequence data 99

Сборник трудов Зоологического музея МГУ им. М.В. Ломоносова Archives of Zoological Museum of Lomonosov Moscow State University Том / Vol. 54 Cтр. / Pр. 99–115

COMPUTATIONAL ASPECTS OF THE PHYLOGENETIC ANALYSIS OF COMPARATIVE SEQUENCE DATA Ward C. Wheeler Division of Invertebrate Zoology, American Museum of Natural History; [email protected]

The construction of phylogenetic trees based on sequence data presents interesting computational challenges usually not encountered in the ana- lysis of anatomical or other qualitative data. In addition to the more fa- mi liar problems inherent in tree searching, comparative sequence data require the additional step of alignment. Each of these procedures is com- putationally “hard”. When these two operations are coupled, the joint ac tivity is referred to as the General Tree Alignment Problem (GTAP), perhaps the most appropriate form of analysis for comparative sequence data. Since this optimization is also NP-hard, heuristic techniques have been brought to bear to allow researchers to identify useful solutions to empirical problems. Various exact and heuristic approaches are discussed here with reference to their computational burden. Advances in sequencing technology are greatly increasing the quantity of sequence data requiring analysis. Increasing time complexity in this light of new data streams is discussed.

ВЫЧИСЛИТЕЛЬНЫЕ АСПЕКТЫ ФИЛОГЕНЕТИЧЕСКОГО АНАЛИЗА СРАВНИТЕЛЬНЫХ ДАННЫХ ПО СИКВЕНСАМ Уорд K. Уилер

Построения филогенетических деревьев, основанные на сиквен- сах, представляют интересные вычислительные проблемы, которые, как правило, не встречаются при анализе анатомических или других качественных данных. Сравнительные данные по сиквенсам, в до- полнение к стандартным проблемам поиска деревьев, требуют специ- фического этапа выравнивания. Каждая из этих процедур является вычислительно «трудной». Объединение этих двух операций извест- но как General Tree Alignment Problem (GTAP): возможно, это наибо- лее подходящая форма анализа сравнительных данных по сиквенсам. Так как эта оптимизация также является NP-трудной, в неё задейство- ваны эвристические методы, позволяющие выявлять практические

© W.C. Wheeler, 2016. 100 W.C. Wheeler

решения эмпирических проблем. В статье обсуждаются различные точные и эвристические подходы с указанием их вычислительных на- грузок. Достижения в области технологии секвенирования резко уве- личивают количество сиквенсов, включаемых в анализ. В этом кон- тексте обсуждается увеличение времени вычислений при включении новых потоков данных.

1. Introduction procedures as we encounter mounting avail- As evolutionary biologists, we seek to ac- ability of sequence data. commodate and explain the broadest pos- 2. Types of alignment sible sample of comparative information including both phenotypic and genomic The core operation of all multiple align- information. The construction of phyloge- ment procedures (MSA) is pairwise align- netic trees based on sequence data presents ment. Whether a pairwise alignment is to interesting computational challenges usually minimize dissimilarity, maximize similar- not encountered in the analysis of anatomi- ity, and whatever method used to score the cal or other qualitative data. Although both quality of the alignment (e.g. edit costs, types of information rely on tree-searching probabilistic models), the time-complexity procedures, sequence data do not present of the operation is proportional to the prod- themselves with pre-ordained correspond- uct of the lengths of the two input sequences ences. This adds to the process an additional (O(n1n2)) or more simply the square of the computational challenge usually referred to length of the longest sequence in an analy- as alignment. sis (O(n2)). This string-match procedure is Unfortunately, as is well-known, both of often referred to as the Needleman–Wusch these operations (tree-searching and align- (Needleman, Wunsch, 1970) algorithm and is ment) are NP-hard optimizations (Foulds, based on . The quad- Graham, 1982; Day, 1987; Wang, Jiang, ratic complexity means that if the sequences 1994; Roch, 2006), hence exact solutions for are doubled in length, the operation will take non-trivial data sets will not only be unfi nd- four times as long and so forth (reviewed in able (at least with guarantee), but potentially Wheeler, 2012). exponential in number. This creates chal- This level of time complexity is actually lenges in the defi nition of time-effi cient and more than just a worst-case scenario. Given effective heuristic procedures. An additional that biological sequences are highly related factor lies within the specifi cation of types and non-random (at least in the phylogenetic of analysis based on different types of align- case), the algorithm of Ukkonen (1985) (or ment, tree-searching and their interaction. variants included in implementations such Here, I will discuss alternate methods as Wheeler et al., 2013, 2015) can be used of multiple (MSA) and and time-complexity reduced to an average their interaction with tree searching proce- case of O(nlogn) (Fig. 1). Storage (memory dures. Several existing tools and oncoming requirements) can also follow the time com- empirical challenges will also be presented. plexity. The overall objective being to identify ef- The Needleman–Wunsch and Ukkonen fective (in terms of optimality score) and procedures (as originally published) only effi cient (in terms of time effort) analytical align pairs of sequences. For phylogeneti- Analysis of comparative sequence data 101

Fig. 1. The increase in execution time (by problem size) for several complexities for logarithmic O(logn), O(nlogn), linear O(n), quadratic O(n2), cubic O(n3), and exponential O(2n). cally meaningful data sets, alignments of ponential (in the number of sequences) to a large numbers of sequences are required. series of pairwise alignments performed in an Pairwise algorithms can be naively extend- order determined in a variety of ways, most ed beyond two, but rapidly become unman- usually by a “guide-tree” (Fig. 2). A guide ageable. For three sequences a cube with n3 tree is not a , but simply a elements would be required, and for each way to order pairwise alignments and their element, 7 evaluations would have to take amalgamation into a single MSA. Progres- place as opposed to the 3 for pairwise. In sive alignment reduce the exponential time the general case for m sequences of length complexity down to one linear in the number n, nm elements would need to be stored and of sequences and quadratic in their length, (2m – 1) operations required at each element, O((m – 1)n2). for a time complexity of O((2m – 1)nm). Even This reduction in complexity is welcome the smallest data sets would be beyond our and makes many large data sets tractable. It analytical capabilities. This issue has lead does not, however, solve the entire problem. to the development of lower time and space A key question remains of what determines complexity “progressive” alignment proce- a “good” alignment versus a “bad” one, or dures (Feng, Doolittle, 1987). Progressive more precisely, how do we attach an optimal- alignment, in essence, breaks down the ex- ity value (or score) to an MSA. 102 W.C. Wheeler

Fig. 2. Progressive alignment of Feng, Doolittle (1987). Pairwise alignments are performed in post-order tree traversal to create partial MSAs following once a gap, always a gap (left of vertices) or profi le sequences (right of vertices) of IUPAC symbols for each column with a bar if a gap is also present in that aligned column. Redrawn after Wheeler (2012).

For pairwise alignment, scoring is stra- edge costs) is minimized. Each of these meth- ight forward, the distance between to two ods can result in different MSAs even though sequences. We might do this in a variety basic cost parameters (such as substitution of ways (see optimality criteria below), but and indel costs) are the same. An important once we have a distance between base pairs, question is which method is most appropri- or amino acids, we can apply this without ate, or best in a phylogenetic context. fuss to a pair of sequences. However, once we move to three sequences, the process 3. Alignment in the context of phylogeny becomes more complicated. Two common Once we have dispensed with the his- approaches would be the sum of the (three) torical atrocity of “by-eye” alignment (e. g. pairwise distances among the three aligned Kjer, 2004), we are left with three common sequences. A second would be to sum the approaches to automated MSA generation distance between each aligned sequence when we analyze comparative sequence and a common or “centroidal” sequence that data — SP, Consensus, and Tree. There are represents the center, or average of the three. two general approaches to discussing this The fi rst approach is referred to as “Sum-of- choice — those based on fi rst principles of Pairs”, or SP alignment (Carrillo, Lipman, historical homology and those based on ef- 1988) and the second “Consensus” alignment fi cacy and time complexity, often of imple- (Gusfi eld, 1997). Furthermore, with four se- mentations. quence a third approach is possible, that of On the theoretical side, the argument has “Tree” alignment (Sankoff, 1975), where been made recently (Padial et al., 2014) and alignments are created such that an overall not so recently (e. g. Giribet et al., 2002; evolutionary tree cost (in terms of summed Faivovich et al., 2004; Prendini et al., 2005) Analysis of comparative sequence data 103

In addition to the theoretical and episte- mological rationales for favoring tree align- ment, heuristic effi cacy is also an impor- tant factor. Since the original description of Direct Optimization (DO) (Wheeler, 1996; Varón, Wheeler, 2012) as an approach to the analysis of sequence data, many empirical analyses have shown substantial improve- ment in tree optimality (discussed more be- low) over other MSA methods. (e. g. Whiting et al., 2006; Lindgren, Daly, 2007; Giribet, Edgecombe, 2013). Ford and Wheeler (2016) in analyzing data sets from 62–1766 rDNA sequences (Giribet, Wheeler, 1999, 2001; Fig. 3. “Tree” alignment showing median se- Wheeler, 2007; Benson et al., 2013) as well quences (at internal vertices) and MSAs as simulated data showed optimality im- (multiple) on left and right. Note that there provements of up to 50% over SP and Con- are two equally costly vertex assignments sensus alignment implementations (Katoh and derived alignments for this simple case. et al., 2002a; Edgar, 2004; Larkin et al., 2007; Redrawn after Wheeler (2012). Sievers et al., 2011). These analyses com- pared MSAs (implied alignments [Wheeler, that since it is phylogenetic trees we are in- 2003a] in the case of POY5) when subse- terested in, an alignment method based on quently analyzed by TNT (Goloboff et al., trees is the only appropriate approach. Pa- 2003) the preeminent parsimony tree search dial et al. (2014) in their forceful analysis program. These results confi rmed those of also adduce empirical data as to the effi cacy (Wheeler, Giribet, 2009) where in the case (in terms of tree optimality) of the tree align- of over 5000 simulated data sets of Ogden ment approach at least as embodied in the and Rosenberg (2007) (also reanalyzed by program POY (Wheeler et al., 2013, 2015). Lehtonen, 2008), similarity MSAs always They explore and criticize what they have underperformed with respect to DO-based named “similarity alignment” (i. e. SP) on analysis. the grounds that similarity as a means of re- To my knowledge, no analysis has ever constructing phylogeny (as with UPGMA) been published where tree-alignment (at least has been long rejected in favor of homol- as far as implemented by POY) has resulted ogy (and synapomophy) based schemes. in phylogenetic trees of inferior optimality The homology implications of a potentially value (i. e. parsimony score) to those based unique scheme for each possible phyloge- on any other MSA method. netic scenario, termed “dynamic” homology (Wheeler, 2001), as opposed to the universal 4. Complexity of alignment “static” homology statements produced by Commonly used MSA implementations SP and consensus similarity alignments are — e. g. CLUSTALW (Higgins and Sharp, argued to be vastly superior in explanatory 1988), MAFFT (Katoh et al., 2002b, MUS- power and more fi rmly rooted in the histori- CLE (Edgar, 2004) — are based around cal notions of derived homology (Fig. 3). progressive alignment after pairwise align- 104 W.C. Wheeler ments of input sequences, hence have a time searches i. e. m2 to m3 for overall complex- complexity that is quadratic in both sequence ity of O(m2–3n2), this still rather loose bound length (n) and number (m), hence O(m2n2). has been shown to be effective and more This may seem rather effi cient, but the gen- than competitive with other heuristics (Ford, eral MSA optimization problem itself, no Wheeler, 2016). matter whether SP, Consensus, or Tree is NP-hard (Wang, Jiang, 1994). As mentioned 5. Complexity of tree searching above, this implies both that optimal solu- For most empirical systematists, the tions (nor non-trivial data sets) will not be search for optimal (or at least optimal found, and that there are a potentially expo- enough) trees occupies the greatest portion nential numbers of such solutions. Basically, of their computational effort. For this rea- there is likely no identifi able, single “opti- son, algorithmic effi ciency and quality of mal” MSA (Wheeler, 1994). implementation have been extremely im- Polynomial time approximations schemes portant to practicing systematists since the (PTAS) have been developed for SP align- fi rst phylogenetic (i. e. non cluster-based) ment with guaranteed bounds (Wang et al., tree reconstruction software was produced in 2000) employing a mix of exhaustive and the 1980’s (Farris, 1978, 1988; Felsenstein, lifted alignment (Wang, Gusfield, 1997; 1980; Mickevich, Farris, 1980; Swofford, Wheeler, 1999). Mainly of analytical inter- 1990). Tree searching, as an operation, is not est, a 1.47 bound can be ensured with a time unique to sequence data, but its interaction complexity of O(m2n9). This is clearly well with the dynamic homology concept yields beyond empirical utility (and for not such a a tight connection between the search for op- great bound), but underscores the diffi culty timal homology schemes and optimal trees in determining high quality solutions. (General Tree Alignment Problem, GTAP, The conditions of guarantee are rather below). loose, including all possible scenarios, and The search for the optimal or “best” phy- the reality of historical biological sequences logenetic tree is well-known to be NP-hard. is that they are much more “well-behaved” As for MSA, this implies that exact solu- than the most general case would allow. tions for most datasets will be unavailable There are no known guaranteed bounds for (time complexity O(2n)) and potentially ex- the DO algorithm (Wheeler, 1996), but per- ponential in number. This is the case for all formance in comparison with exact solutions optimality criteria that have been examined in the three-sequence case has been exam- including distances (Day, 1987), parsimony ined (Varón, Wheeler, 2012). Depending on (Foulds, Graham, 1982), and likelihood the rates of sequence (evolutionary) change, (Roch, 2006). For smaller data sets (< 25 or the DO algorithm has been shown to yield so taxa), exact solutions can be found via solutions within 3% to 10% of the optimal Branch-and-Bound approaches (Land, Doig, solution in biological realistic conditions. 1960; Hendy, Penny, 1982). These methods An O(n3) version of DO (Wheeler, 2003b) rely on the examination of intermediate (par- can perform better (exactly, unsurprisingly, tial) solutions and pruning large segments of in the three sequence case), but with an ad- the solution space. For this to perform well, ditional cost factor of the length of the se- the data need to be relatively clean and still quences, n. When employed in empirical may not yield much of an improved ex- cases (with comparable time complexity tree ecution time — and may even be slower in Analysis of comparative sequence data 105

Fig. 4. Initial tree construction trajectory via the Wagner algorithm (Farris, 1970). pathological cases. Modern empirical data addition of taxa to a growing tree. Initially, sets are almost exclusively of the size that three taxa are chosen (either randomly or by such an approach is impractical. distance), and each taxon added to the tree in Again as with MSA, heuristic approaches each possible place. The optimality value is are the main tool used to identify optimal calculated for each of these candidate solu- trees. These can be organized into two sorts tions and the tree with the best value chosen. of approaches that are most frequently used This is continued until all taxa have been in combination: trajectory and perturba- added to the tree (Fig. 4). tion. Trajectory-based searches identify lo- The time complexity of this operation is cal neighborhoods of solutions, choose the quadratic (O(m2)), however, common prac- best of them and advance to another neigh- tice is to perform a number of these, rand- borhood until no better solutions are found. omizing the addition order (this number can Perturbation techniques, on the other hand, be thought to grow with the problem size take a given solution and attempt to improve adding an additional factor of m, see below). it by modifying the tree in ways which may These randomized Wagner build solu- not be immediately better, but may result in tions are not, on their own, usually felt to be superior solutions at a later stage. satisfactory and much improvement can be The nearly universally employed initial garnered by what is commonly referred to trajectory algorithm is the “Wagner” algo- as branch-swapping, a form of tree refi ne- rithm, named after a botanist (Wagner, 1961) ment. There are two fundamental opera- but conceived by a computational systematist tions in branch-swapping: tree division and (Farris, 1970) constructs trees via sequential reattachment. In the fi rst, the tree is divided 106 W.C. Wheeler

Fig. 5. Tree-Bisection and Regrafting (TBR) rearrangement neighborhood. by removing an edge and in the second, re- linear time complexity (assuming tree opti- attached in a new location via the creation mality can be determined in constant time — of a new edge. The named forms of swap- a fallacy we will hold to for now). ping differ in the set of new edges that can Larger neighborhoods can be generated be created. In each case, a neighborhood of through the commonly referred to Subtree- new tree solutions is identifi ed and evaluated. Pruning-and-Regrafting (SPR) and Tree- The size of this neighborhood drives the time Bisection-and-Regrafting (TBR) algorithms complexity of the operation. (initially undocumented and known under Nearest-Neighbor-Interchange (NNI; Ca- various names; Mickevich, Farris, 1980; min, Sokal, 1965; Robinson, 1971) generates Swofford, 1990; see Wheeler, 2012). SPR the smallest neighborhood with the lowest refi nement involves breaking an edge and time complexity (O(m)). In NNI, as with all reattaching the unconnected vertex via new swapping procedures, each internal edge is edges to each remaining edge (as opposed examined in turn. All edges are defi ned by to only the proximate two of NNI). There two vertices, each of which is connected to are then (2m − 7) reattachments for a total two further edges. NNI deletes one of these of 2(m − 3)(2m − 7) (Allen, Steel, 2001). connecting edges and creates two new trees Hence, SPR has quadratic time complexity by creating edges between the now uncon- in the number of taxa. TBR yields an addi- nected vertex and the two remaining edges. tional factor of m by allowing edge connec- Given that there are (m − 3) edges in a tree tion not only to the root vertex of the dis- with m taxa, the total number of tree rear- connected tree, but to its internal edges as rangements examined is 2(m − 3), hence the well. The exact neighborhood size depends Analysis of comparative sequence data 107 on the tree shape, but is cubic in the number result generally inferior performance (Golo- of taxa (Fig. 5). boff, Pol, 2007). A typical trajectory search would in- In apposition to trajectory heuristics, volve a series of randomized Wagner builds perturbation approaches accept immediately followed by TBR branch-swapping refi ne- poorer solutions in hope of fi nding better ment. This is often referred to as RAS+TBR ones further on down the search path. The (Goloboff, 1999) and given that the number motivation behind accepting sub-optimal of random additions usually scales with the solutions is that the optimality landscape has number of terminals the overall strategy can multiple “peaks” and that solutions between be thought of as quartic, O(m4). them are suboptimal. Perturbation methods This rather daunting polynomial factor are specifi cally designed to go from local led Goloboff (1999) to search for meth- to hopefully global solutions by traversing ods with reduced time complexity based solution areas with reduced quality. The on the notion of “composite optima”. The fi rst use of this idea as an improvement on RAS+TBR strategy just scales too poorly trajectory search came from Nixon (1999) for large data sets over the entire tree. Yet, in the form of the “parsimony ratchet.” The it may well arrange smaller components of ratchet is a technique that begins with a local thee tree properly, at least for some subset (i. e. trajectory) solution and strives to break of RAS+TBR runs. Goloboff reasoned that through intermediate optimality barriers by segments of the data set (perhaps 50 taxa or reweighting subsets of characters and search- so; later named “sectors”) might well be in ing (via TBR) on the new (but related) data optimal or near optimal confi guration, but set. The reweighting scheme shifts the opti- the odds of getting a large number of sec- mality landscape such that new optima may tors simultaneously well confi gured would be reached via standard swapping, A second be very small, hence requiring prohibitive search, again typically via TBR, is then per- RAS+TBR iterations. By combining this formed beginning with the reweighted result, notion of sectors with breadth-fi rst search- but with weights returned to their original ing (Cormen et al., 2001), Goloboff (1999) values. This is performed multiple times, proposed Sectorial-Searching (SS). In this each instance offering an opportunity to fi nd branch-swapping refi nement operation, the a path from a local to a (more) global solu- edge set available for initial deletion and re- tion. The ratchet had an immediately salutary attachment is limited to those not in sectors affect on analyses, fi rst on the “Zilla” RBCL (commonly, but not invariantly subtrees). dataset (Chase et al., 1993; Nixon, 1999) The time complexity of this operation is and later on others (Giribet, Wheeler, 1999). still cubic, but with a reduction in constant The ratchet approach has had an enormous factor of the cube of the number of sectors, effect on tree searching not only by directly k. If these sectors are roughly equal in size, improving search results, but also by open- time complexity can be reduced dramati- ing up thinking that lead to a number of other cally to O((n/k)3). Disc-Covering methods innovative approaches. (Huson et al., 1999; Roshan et al., 2004) Following on the heals of Nixon’s ratchet, have many similarities with sectorial meth- other perturbation techniques were devel- ods, but have several important differences oped, more directly adapted from simulated in the way they defi ne and resolve interme- annealing. First developed for atomic bomb diate subproblems. This has been shown to calculation in the 1950’s (Metropolis et al., 108 W.C. Wheeler

1953; Kirkpatrick et al., 1983), simulated population. Those that survive selection then annealing mimics the process of the anneal- undergo recombination-corresponding com- ing of metals via stepped reductions in tem- ponents of trees (subtrees with the same leaf perature. The key concepts are the analogues set) are exchanged. The order of these op- to temperature and energy state. When tem- erations may vary, but these core operations peratures are relatively high, the probability are always components of GA optimization. of accepting a lower quality solution (higher Goloboff (1999) emphasized the recombina- energy, in this case inferior optimality score) tion step in his “Tree-Fusing” method, which increases, allowing the search to fi nd global also adds in trajectory searches to improve solutions through optimality valleys. Better solutions. At least in phylogenetic applica- (lower energy, superior optimality) solutions tions (including MSA; Notredame, Higgins, are always accepted. If the temperature is 1996), GA has not been shown to be very high enough, the search becomes a random effective in generating solutions de novo, walk with no infl uence of optimality score but has been extremely useful in improving at all. As the temperature is gradually re- existing solutions. duced, the probability of transitions to lower Each of the methods discussed above optimality is increasingly diminished, until has found utility in empirical tree searching, they are forbidden entirely and the search and are nearly always used in combination. becomes a standard trajectory. The trick to Implementations such as TNT (Goloboff using the technique effectively is identifying et al., 2003; Goloboff, Pol, 2007) and POY the proper connections between the physical (Wheeler et al., 2013, 2015) are explicit in model of annealing and the optimization at their efforts to make these techniques avail- hand, and the appropriate heating schedule. able. Large analyses of up to thousands of Goloboff’s “Tree-Drifting” (Goloboff, 1999) unaligned (Ford, Wheeler, 2016) and tens of uses elements of such a simulated anneal- thousands of aligned (Goloboff et al., 2009) ing approach. In his method, the difference sequences demonstrate their combined ef- in tree scores between candidate trees is the fectiveness. analogue of energy difference with random factor moderating acceptance of sub-optimal 6. The GTAP and heuristic effi ciency tree solutions. The previous sections have described A method employing elements of both alignment and tree search as two separate trajectory and perturbation actions, and em- operations. That is, in fact, how most phy- ploying populations of trees is referred to as logenetic analyses are performed, and how , or GA. The idea behind most systematists understand the problem of this optimization technique is to simulate the deriving phylogenetic trees from sequence evolutionary process with mutation, recom- data. This is, however, a limited and largely bination, and selection (Holland, 1975). As incorrect notion of the basic challenge. This applied to tree searching (Moilanen, 1999, does not mean that separate (i. e. 2-step) 2001), GA starts with a set of initial solu- analyses do not have heuristic utility, merely tions (such as those form RAS+TBR) and that as a conceptualization and problem defi - mutates these via some type of perturbation. nition (with concomitant solution space and The pool of trees then undergoes a selection complexity) it is an erroneous path. step, where the optimality value for the trees As discussed above, tree-alignment is determines their continued presence in the the proper form of MSA in a phylogenetic Analysis of comparative sequence data 109 context, hence, the alignment be constructed in successive versions of POY (Gladstein, to minimize the optimality score (originally Wheeler, 1997; Wheeler et al., 2005; Varón parsimony) for a given tree. Known as the et al., 2008; Wheeler et al., 2013). DO cre- TAP (for Tree Alignment Problem), it is the ates a series of vertex sequences via tree “small” parsimony problem for a tree of traversal in acceptable time with worst case unaligned sequence data and is NP-hard. If complexity O(mn2), but average case com- this tree is unknown, as is usually the case, plexity on the order of O(mnlogn). This has then a tree search must also be performed allowed analyses of sequences of length 2000 with each tree evaluated in turn based on its or more for more than 1000 sequences with tree alignment cost. This is the “large” parsi- unmatched optimality scores (Varón, Wheel- mony problem for unaligned sequences and er, 2013; Ford, Wheeler, 2016). Higher time is referred to as the General Tree Alignment complexity fl avors of DO have been defi ned: Problem or GTAP. Wrapping one NP-hard e. g. “Iterative-Pass” optimization (Wheeler, optimization inside of another makes this 2003b) with the much greater time complex- an exceptionally diffi cult, but empirically ity of O(n3). important, challenge. While DO constructs internal vertex se- Distinct alignment and tree searching quences, other GTAP approaches have been operations can be thought of as one sort of based on using observed sequences as can- GTAP heuristic. From the alignment perspec- didate vertex labels. These include “lifted” tive alone, there were only two efforts (of alignments (Wang et al., 1996; Wang, Gus- which I am aware) to construct alignments fi eld, 1997), and the stronger “fi xed-states” specifi cally with the objective of yielding (Wheeler, 1999) and “Search-Based” optimi- optimal trees MALIGN (Wheeler, Gladstein, zation (Wheeler, 2003c). Each of these has 1994, 1998) and TreeAlign (Hein, 1989a,b, complexities that are cubic in the number 1990). These implementations would pro- of taxa O(m3) after a quadratic setup phase duce MSA results (one or more MSAs) that O(m2n2). In general, these methods do not would then be fed into tree search programs yield results competitive with DO, but do to complete phylogenetic analysis. Although have guaranteed bounds, hence are extremely working towards tree-alignment, as with oth- useful for analytical purposes. er MSA efforts the production of a single (or Direct GTAP solution attempts will likely small number) of alignments used as a basis always have a higher time complexity than for the evaluations of a large number of trees tree searching on pre-aligned sequences, cur- (easily > 109) allows for a certain effi ciency, rently by a factor of n (all over factors being but at a signifi cant penalty in result quality. equal, which they are not). But two factors The fundamental idea of the GTAP is that have to be noted on this front, fi rst is the time each tree, in essence, needs to be evaluated expended in alignment part of the two-step on the basis of its own alignment. That is, the approach, and second the quality of the re- alignment optimal (at least heuristically) for sultant solution. that tree. As tree space is searched, a poten- tially unique MSA would need to be gener- 7. Optimality criteria and complexity ated for each candidate tree. Currently the This discussion has so far been rooted most effi cient heuristic procedure for this op- in minimizing overall evolutionary score — eration is Direct Optimization (DO; Wheeler, parsimony. Each of the techniques discussed 1996; Varón, Wheeler, 2012) implemented above (alignment, tree search, direct GTAP 110 W.C. Wheeler optimization) can and often is accomplished The same factors, yet increased again, af- within the framework of other optimality fect Bayesian GTAP methods as well. In ad- criteria, namely maximum likelihood and dition to even more enhanced parameters and Bayesian posterior probability. prior distributions involved in Bayesian cal- Sequence alignment in a maximum like- culations, the reliance on MC3 methods adds lihood context was fi rst proposed by Bishop enormous and problematic (in terms of sta- and Thompson (1986) and later in successive tionarity requirements and their recognition) efforts by Thorne and coworkers (Thorne optimization factors (Mossel, Vigoda, 2005, et al., 1991, 1992; Thorne, Kishino, 1992). 2006). In short, MSA, tree search, and GTAP A tree alignment approach was proposed can be approached in a variety of ways, and by Wheeler (2006) (reviewed in: Denton, with a variety of tools, but all are burdened Wheeler, 2012) implemented in POY3 and with the weight of NP-hard optimizations. POY5. Tree alignment in the general context of 8. The coming deluge Bayesian analysis has produced a number In the past 10 years sequencing technol- of implementations including Profi le-HMM ogy has rapidly advanced. This is largely (Krogh et al., 1994), HANDEL (Holmes, due to the Human Genome Project and these Bruno, 2001), SATCHMO (Edgar, Sjölander, technological fruits are now being reaped by 2003), ProAlign (Löytynoja, Milinkovitch, systematic biology. Two important sources 2003), ALIFRITZ (Fleissner et al., 2005), of comparative data are transcriptomic and BEAST (Lunter et al., 2005), and BaliPhy whole genome sequencing. The fi rst meta- (Redelings, Suchard, 2005; Suchard, Re- zoan whole-genome phylogeny was only delings, 2006). Wheeler (2014) proposed a published in 2007 and then only for twelve version of MAP based (as opposed to MC3) species of very closely related Drosophila Bayesian tree alignment implemented in species (Clark et al., 2007). Even though POY5. whole genome costs are rapidly decreas- The basic time complexity factors of sta- ing, faster even than Moore’s Law, the large tistical alignment, tree search and GTAP are data sets produced now are based on the still present as they are for parsimony, but expressed portion of the genome, the tran- usually have large additional constant factors scriptome. Although the transcriptome rep- resulting in vastly increased execution time. resents less than 1% of the genome of most This is due to their need to evaluate statistical organisms, analyses based on these data have models of sequence evolution including a va- increase our available genomic information riety of parameters, most importantly branch/ by three orders of magnitude in a short time. edge lengths. The most commonly used As an example, the metazoan analysis of ML tree search tool in use today, RAxML Dunn et al. (2008) relied on 150 genes, fol- (Stamatakis et al., 2005) as well as that for lowed the next year by Hejnöl et al. (2009) Bayesian, MrBayes (Huelsenbeck, Ronquist, with nearly 1500. Riesgo et al. (2012) and 2003), consume many thousands of times Sharma et al. (2014) have produced data sets more CPU cycles than that for parsimony, with greater than 20,000 genes. Concomitant at least in the guise of TNT (Goloboff et al., with this, is an increase in the number of 2003). Again, this is not sue to any short- taxa from which these samples are derived. coming in implementation, but the reality of At present, transcriptomic analysis requires parameter-rich model-based tree searching. relatively freshly obtained specimens, and Analysis of comparative sequence data 111 this has limited the taxonomic expansion in Bishop M.J., Thompson E. A. 1986. Maximum some cases. likelihood alignment of DNA sequences. — DNA-based techniques are now in de- Journal of Molecular Bioliology, 190 (2): 159–165. velopment that will likely allow the deter- Camin J.H., Sokal, R.R. 1965. A method for de- mination of whole genomes from preserved ducing branching sequences in phylogeny. museum specimens. This development will — Evolution, 19 (3): 311–326. no doubt result in an explosion of taxonom- Carrillo H., Lipman D. 1988. The multiple se- ic data that will threaten to overwhelm our quence alignment problem in biology. — current computational abilities. As we have SIAM Journal on Applied Mathematics, 48 (5): 1073–1082. seen, the general complexity of analysis of Chase M.W., Soltis D.E., Olmstead R.G., Mor- a genetic sequence is at worse quadratic in gan et al. 1993. of seed plants: length, and linear in number of genes. This is An analysis of nucleotide sequences from a complexity we can handle at present. The the plastid gene RBCL. — Annals of the Mis- taxon number factor is more daunting with souri Botanical Garden, 80 (3): 528–580. a potentially quartic time complexity. Whole Clark A.G., Eisen M.B., Smith D.R. et al. 2007. Evolution of genes and genomes on genomes will also provide a great deal more the Drosophila phylogeny. — Nature, 450: information than transcriptomes do presently 203–218. and this will take effort to extract. Cormen T.H., Leiserson C.E., Rivest R.L., The challenges coming our way in the Stein C. 2001. Introduction to algorithms. next few years are clear. We will have to Cambridge (MA): The MIT Press, 2nd edi- identify computational methods that will tion. 1180 p. produce, at least heuristically, optimal re- Day W.H.E. 1987. Computational complexity of inferring phylogenies from dissimilarity sults. Furthermore, these methods will have matrices. — Bulletin of Mathematical Biol- to scale into the hundreds of thousands of ter- ogy, 49 (4): 461–467. minals and potentially billions of base pairs. Denton J., Wheeler W.C. 2012. Trivial alignm- This will no doubt be assisted by the com- ents in maximum likelihood analysis of nu- modity parallelism now broadly available. cleotide data. — Cladistics, 28 (5): 514–528. Yet that will only generate a linear factor of Dunn C.W., Hejnöl A., Matus D.Q. et al. 2008. Broad phylogenomic sampling improves reso- improvement. To get to where we will need lution of the animal tree of life. — Nature, to be, only improved algorithms and poten- 452: 745–749. tially novel analytical approches offer a path. Edgar R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high thro- Acknowledgements ughput. — Nucleic Acids Research, 32 (5): Thanks to Igor Pavlinov for the opportu- 1792–1797. Edgar R.C., Sjölander K. 2003. SATCHMO: seq- nity to review this topic. uence alignment and tree construction us- References ing hidden Markov models. — Bioinforma- tics, 19 (11): 1404–1411. Allen B., Steel M. 2001. Subtree transfer oper- Faivovich J., Garca P.C., Ananias F. et al. 2004. ations and their induced metrics on evolu- A molecular perspective on the phylogeny tionary trees. — Annals of Combinatorics, of the Hyla pulchella species group (Anu- 5 (1): 1–13. ra, Hylidae). — Molecular phylogenetics and Benson D.A., Cavanaugh M., Clark K. et al. evo lution, 32 (3): 938–950. 2013. Genbank. — Nucleic Acids Research, Farris J.S. 1970. A method for computing Wag- 41 (D1): D36–D42. ner trees. — Systematic Zoology, 19:83–92. 112 W.C. Wheeler

Farris J.S. 1978. Wag78. Software and docume- ry. http://research.amnh.org/scicomp/projects/ ntation. Publ. by author. poy.php. Farris J.S. 1988. Hennig86. Version 1.5. Publ. Goloboff P. 1999. Analyzing large data sets in by author. reasonable times: solutions for composite op- Felsenstein J. 1980. PHYLIP Version 1.0: Phy- tima. — Cladistics, 15 (4): 415–428. logeny Inference Package. http://www0.nih. Goloboff P., Farris J.S., Nixon K. 2003. TNT go.jp/~jun/research/phylip/main.html?ref= (Tree analysis using New Technology) ver- herseybedava.info sion 1.0. Program and documentation. Tucu- Feng D.-F., Doolittle R.F. 1987. Progressive se- mán (Argentina). Published by the authors. quence alignment as a prerequisite to cor- http://www.lillo.org.arg/phylogeny/tnt. rect phylogenetic trees. — Journal of Mo- Goloboff P.A., Catalano S.A., Mirande J.M. et lecular Evolution, 25 (4): 351–360. al. 2009. Phylogenetic analysis of 73 060 Fleissner R., Metzler D., von Haeseler R. 2005. taxa corroborates major eukaryotic groups. Simultaneous statistical multiple alignment — Cladisitcs, 25 (3): 211–230. and phylogeny reconstruction. — Systemat- Goloboff P.A., Pol D. 2007. On divide-and-con- ic Biology, 54 (4): 548–561. quer strategies for parsimony analysis of Ford E., Wheeler W. 2016. Comparison of heu- large data sets: Rec-I-DCM3 versus TNT. ristic approaches to the general-tree-align- — Systematic Biology, 56 (3): 485–495. ment problem. — Cladistics, 32. in press. Gusfield D. 1997. Algorithms on strings, trees, http://onlinelibrary.wiley.com/doi/10.1111/ and sequences: Computer science and com- cla.12142/abstract?userIsAuthenticated=false putational biology. Cambridge: Cambridge &denied. Univ. Press. 534 p. Foulds L.R., Graham R.L. 1982. The Steiner Hein J. 1989a. A new method that simultane- problem in phylogeny is NP-complete. — ously aligns and reconstruct ancestral se- Advances in Applied Mathematics, 3 (1): quences for any number of homologous 43–49. sequences, when the phylogeny is given. Giribet G., Edgecombe G.D. 2013. Stable phy- — Molecular Biology and Evolution, 6 (6): logenetic patterns in scutigeromorph centi- 649–668. pedes (Myriapoda : Chilopoda : Scutigero- Hein J. 1989b. A tree reconstruction method morpha): dating the diversification of an anci ent lineage of terrestrial arthropods. — In- that is economical in the number of pairwise vertebrate Systematics, 27 (5): 485–501. comparisons used. — Molecular Biology Giribet G., Wheeler W.C. 1999. The position of and Evolution, 6 (6): 669–684. arthropods in the animal kingdom: Ecdyso- Hein J. 1990. Unified approach to alignment and zoa, islands, trees and the “parsimony rat- phylogenies. — Methods in Enzymology, chet”. — Molecular Phylogenetics and Evo- 183: 626–645. lution, 13 (3): 619–623. Hejnöl A., Obst M., Stamatakis A. et al. 2009. Giribet G., Wheeler W.C. 2001. Some unusual Assessing the root of bilaterian animals with small-subunit ribosomal DNA sequences of scalable phylogenomic methods. — Proce- metazoans. — American Museum Novitates, edings of the Royal Society, ser. B, Biologi- 3337: 1–14. cal Sci., 276: 4261–4270. Giribet G., Wheeler W.C., Muona J. 2002. DNA Hendy M.D., Penny D. 1982. Branch and bound multiple sequence alignments. — DeSalle R., algorithms to determine minimal evolution- Giribet G., Wheeler W.C. (eds). Molecular sys- ary trees. — Mathematical Biosciences, 60: tematics and evolution: Theory and practice. 133–142. Basel: Birkhauser Verlag. P. 107–114. Higgins D.G., Sharp P.M. 1988. Clustal: A pac- Gladstein D.S., Wheeler W.C. 1997. POY ver- kage for performing multiple sequence alig- sion 2.0. Program and documentation. New nment on a microcomputer. — Gene, 73 (1): York: American Museum of Natural Histo- 237–244. Analysis of comparative sequence data 113

Holland J.H. (ed.). 1975. Adaptation in natu- Löytynoja A., Milinkovitch M. C. 2003. ProA- ral and artificial systems. Ann Arbor (MI): lign, a probabilistic multiple alignment pro- University of Michigan Press. 211 p. gram. — , 19 (11): 1505– Holmes I., Bruno W.J. 2001. Evolutionary 1513. HMMs: a Bayesian approach to multiple Lunter G., Drummond A.J., Miklós I., Hein J. alignment. — Bioinfomratics, 17 (9): 803– 2005. Statistical alignment: Recent prog- 820. ress, new applications, and challenges.— Nie- Huelsenbeck J.P., Ronquist F. 2003. MrBayes: lsen R. (ed.). Statistical methods in molecu- Bayesian inference of phylogeny, 3.1.2 edi- lar evolution. Springer. P. 375–406. tion. Program and documentation. at http:// Metropolis N.A., Rosenbluth A., Rosenbluth M. morphbank.uuse/mrbayes/. et al. 1953. Equation of state calculations by Huson D., Nettles S., Warnow T. 1999. Disk- fast computing machine. — J. Chem. Phys, covering, a fast converging method for phy- 21 (6): 1087–1092. logenetic tree reconstruction. — Journal of Mickevich M.F., Farris J.S. 1980. PHYSYS: , 6 (3): 368–386. Phylogenetic analysis system. Publ. by au- Katoh K., Misawa K., Kuma K., Miyata T. thors. 2002a. MAFFT: a novel method for rapid Moilanen A. 1999. Searching for most parsimo- mutliple sequence alignment based on fast nious trees with simulated evolutionary op- fourier transform. — Nucleic Acids Rese- timization. — Cladistics, 15 (1): 39–50. arch, 30 (14): 3059–3066. Moilanen A. 2001. Simulated evolutionary op- Katoh K., Misawa K., Kuma K., Miyata T. timization and local search: Introduction and 2002b. MAFFT version 5.25: multiple se- application to tree search. — Cladistics, 17 quence alignment program. — Nucleic Ac- (1): S12–S25. ids Research, 30 (14): 3059–3066. Mossel E., Vigoda E. 2005. Phylogenetic mcmc Kirkpatrick S., Gelatt C.D., Vecchi M.P. 1983. algorithms are misleading on mixtures of Optimization by simulated annealing. — trees. — Science, 309: 2207–2209. Science, 220: 671–680. Mossel E., Vigoda E. 2006. Limitations of Kjer K. M. 2004. Aligned 18S and insect phy- markov chain monte carlo algorithms for logeny. — Systematic Biology, 53 (3): 506– bayesian inference of phylogeny. — The 514. Annals of Applied Probability, 16 (4): 2215– Krogh A., Brown M., Mian I.S. et al. 1994. Hid- 2234. den Markov models in computational biol- Needleman S.B., Wunsch C.D. 1970. A general ogy: applications to modeling. — method applicable to the search for similari- Journal of Molecular Bioliology, 235 (5): ties in the amino acid sequences of two pro- 1501–1531. teins. — Journal of Molecular Bioliology, Land A.H. and Doig, A.G. 1960. An automat- 48 (3): 443–453. ic method of solving discrete programming Nixon K.C. 1999. The parsimony ratchet, a new problems. Econometrica, 28 (3): 497–520. method for rapid parsimony analysis. — Larkin M.A., Blackshields G. et al. 2007. Clustal Cladistics, 15 (4): 407–414. W and Clustal X version 2.0. — Bioinfor- Notredame C., Higgins D.G. 1996. SAGA: se- matics, 23 (21): 2947–2948. quence alignment by genetic algorithm. — Lehtonen S. 2008. Phylogeny estimation and Nucleic Acids Research, 24 (8): 1515–1524. alignment via POY versus Clustal–PAUP: Ogden T.H., Rosenberg M.S. 2007. Alignment A response to Ogden and Rosenberg (2007). and topological accuracy of the direct opti- — Systematic Biology, 57 (4): 653–657. mization approach via POY and traditional Lindgren A.R., Daly M. 2007. The impact of phylogenetics via ClustalW + PAUP*. — length-variable data and alignment criterion Sys. Biol., 56:182–193. on the phylogeny of Decapodiformes (Mol- Padial J., Grant T., Frost D.R. 2014. Molecular lusca: Cephalopoda). — Cladistics, 23 (4): systematics of terraranas (Anura: Brachyce- 464–476. phaloidea) with an assessment of the effects 114 W.C. Wheeler

of alignment and optimality criteria. — Suchard M.A., Redelings B.D. 2006. Bali-Phy: Zootaxa, 3825: 1–132. Simultaneous Bayesian inference of align- Prendini L., Weygoldt P., Wheeler W.C. 2005. ment and phylogeny. — Bioinformatics, 22 Systematics of the Damon variegatus group (16): 2047–2048. of African whip spiders (Chelicerata: Am- Swofford D.L. 1990. PAUP: Phylogenetic Anal- blypygi): Evidence from behaviour, mor- ysis Using Parsimony. Version 2.4. Distrib- phology and DNA. — Organisms Diversity uted by the Illinois Natural History Survey: and Evolution, 5 (3): 203–236. Champaign, Illinois. Redelings B.D., Suchard M.A. 2005. Joint Bay- Thorne J.L., Kishino H. 1992. Divergence time esian estimation of alignment and phylo geny. and evolutionary rate estimation with multi- — Systematic Biology, 54 (3): 401–418. locus datafreeing phylogenies from the ar- Riesgo A., Andrade S.C., Sharma P.P. et al. 2012. tifacts of alignment. — Molecular Biology Comparative description of ten transcripto- and Evolution, 9 (6): 1148–1162. mes of newly sequenced invertebrates and Thorne J.L., Kishino H., Felsenstein J. 1991. An efficiency estimation of genomic sampling evolutionary model for maximum likelihood in non-model taxa. — Frontiers in Zoology, alignment of DNA sequences. — Journal of 9 (1): 1–24. Molecular Evolution, 33 (2): 114–124. Robinson D.F. 1971. Comparison of labelled tre- Thorne J.L., Kishino H., Felsenstein J. 1992. es with valency three. — Journal of Combi- Inching toward reality: an improved likeli- natorial Theory, 11 (2): 105–119. hood model of sequence evolution. — Jour- Roch S. 2006. A short proof that phylogenet- nal of Molecular Evolution, 34 (1): 3–16. ic tree reconstruction by maximum likeli- Ukkonen E. 1985. Finding approximate pat- hood is hard. — IEEE/ACM Transactions on terns in strings. — Journal of Algorithms, 6 Computational Biology and Bioinformatics, (1): 132–137. 3 (1): 92–94. Varón A., Vinh, L.S., Bomash, I., and Wheel- Roshan U., Moret B., Williams T., Warnow T. er, W.C. 2008. Poy 4.0. American Museum 2004. Rec-I-DCM3: A fast algorithmic tech- of Natural History. http://research.amnh.org/ nique for reconstructing large phylogenetic scicomp/projects/poy.php. tree. — Proc. IEEE Computer Society Bio- Varón A., Wheeler W. C. 2012. The tree-align- informatics Conference CSB 2004, Stan- ment problem. — BMC Bioinformatics, 13: ford (CA). P. 98–109. 293. Sankoff D.M. 1975. Minimal mutation trees of Varón A., Wheeler W.C. 2013. Local search for sequences. — SIAM Journal on Applied Ma- the generalized tree alignment problem. — thematics, 28 (1): 35–42. BMC Bioinformatics, 14: 66. Sharma P.P., Kaluziak S.T., Pérez-Porro A.R. Wagner W.H. 1961. Problems in the classifica- et al. 2014. Phylogenomic interrogation of tion of ferns. — Recent Advances in Bota- arachnida reveals systemic conflicts in phy- ny. Toronto: University of Toronto Press. P. logenetic signal. — Molecular Biology and 841–844. Evolution, 31 (11): 2963–2984. Wang L., Gusfield D. 1997. Improved approxi- Sievers F., Dineen D., Gibson T.J. et al. 2011. mation algorithms for tree alignment. — Fast, scalable generation of high-quality pro- Journal of Algorithms, 25 (2): 255–273. tein multiple sequence alignments using Clu- Wang L., Jiang T. 1994. On the complexity of stal Omega. — Molecular Systems Biology, multiple sequence alignment. — Journal of 7 (1): 539. Computational Biology, 1 (4): 337–348. Stamatakis A., Ludwig T., Meier H. 2005. Wang L., Jiang T., Gusfield D. 2000. A more ef- Raxml-iii: A fast program for maximum ficient approximation scheme for tree align- likelihood-based inference of large phy- ment. — SIAM J. Comput., 30 (1): 283–299. logenetic trees. — Bioinformatics, 21 (4): Wang L., Jiang T., Lawler E.L. 1996. Approxi- 456–463. mation algorithms for tree alignment with Analysis of comparative sequence data 115

a given phylogeny. — Algorithmica, 16 (3): weighting and dynamic programming. — 302–315. Cladistics, 30 (3): 282–290. Wheeler W.C. 1994. Sources of ambiguity in Wheeler W.C., Giribet G. 2009. Phylogenetic nucleic acid sequence alignment. — Schier- hypotheses and the utility of multiple seque- water B., Streit B., DeSalle R. (eds). Mo- nce alignment. — Rosenberg M.S. (ed.). lecular ecology and evolution: Approaches Perspectives on biological sequence align- and applications. Basel: Birkhäuser Verlag. ment. Berkeley (CA): University of Califor- P. 323 –352. nia Press. P. 95–104. Wheeler W.C. 1996. Optimization alignment: Wheeler W.C., Gladstein D.S. 1991–1998. MA- The end of multiple sequence alignment in LIGN. Program and documentation. Docu- phylogenetics? — Cladistics, 12 (1): 1–9. mentation by Janies D., Wheeler W.C. New Wheeler W.C. 1999. Fixed character states and York. http://research.amnh.org/scicomp/pro- the optimization of molecular sequence data. jects /malign.php. — Cladistics, 15 (4): 379–385. Wheeler W.C., Gladstein D.S. 1994. MALIGN: Wheeler W.C. 2001. Homology and the optimi- A multiple sequence alignment program. — zation of DNA sequence data. — Cladistics, Journal of Heredity, 85 (5): 417–418. 17 (1): S3–S11. Wheeler W.C., Gladstein D.S., De Laet J. 1996– 2005. POY version 3.0. Program and doc- Wheeler W.C. 2003a. Implied alignment. — umentation. Documentation by Janies D., Cladistics, 19 (3): 261–268. Wheeler W.C. Commandline documentati- Wheeler W.C. 2003b. Iterative pass optimiza- on by De Laet J., Wheeler W.C. New York: tion. — Cladistics, 19 (3): 254–260. American Museum of Natural History. http:// Wheeler W.C. 2003c. Search-based character research.amnh.org/scicomp/projects/poy.php optimization. — Cladistics, 19 (4): 348–355. (current version 3.0.11). Wheeler W.C. 2006. Dynamic homology and Wheeler W.C., Lucaroni N., Hong L. et al. 2013. the likelihood criterion. — Cladistics, 22 POY version 5.0. New York: American Mu- (2): 157–170. seum of Natural History. http://research.amnh. Wheeler W.C. 2007. The analysis of molecular org/scicomp/projects/poy.php. sequences in large data sets: where should Wheeler W.C., Lucaroni N., Hong, L. et al. we put our effort? — Hodkinson T.R., Par- 2015. POY version 5: Phylogenetic analy- nell J.A.N. (eds). Reconstructing the Tree sis using dynamic homologies under multi- of Life: Taxonomy and systematics of spe- ple optimality criteria. — Cladistics, 31 (2): cies rich taxa. Oxford: Oxford Univ. Press. 189–196. P. 113 –128. Whiting A.S., Pellegrino K.C., Rodrigues M.T. Wheeler W.C. 2012. Systematics: A course of 2006. Comparing alignment methods for lectures. Wiley-Blackwell. 460 p. inferring the history of the new world liz- Wheeler W.C. 2014. Maximum a posteriori ard genus Mabuya (Squamata: Scincidae). probability assignment (MAPA): An opti- — Molecular Phylogenetics and Evolution, mality criterion for phylogenetic trees via 38 (3): 719–730.