Quartet Puzzling: a Quartet Maximum-Likelihood Method for Reconstructing Tree Topologies

Quartet Puzzling: A Quartet Maximum-Likelihood Method for Reconstructing Tree Topologies Korbinian Strimmer and Arndt von Haeseler Zoologisches Institut, Universitlt Mtinchen A versatile method, quartet puzzling, is introduced to reconstruct the topology (branching pattern) of a phylogenetic tree based on DNA or amino acid sequence data. This method applies maximum-likelihood tree reconstruction to all possible quartets that can be formed from n sequences. The quartet trees serve as starting points to reconstruct a set of optimal n-taxon trees. The majority rule consensus of these trees defines the quartet puzzling tree and shows groupings that are well supported. Computer simulations show that the performance of quartet puzzling to reconstruct the true tree is always equal to or better than that of neighbor joining. For some cases with high transi- tion/transversion bias quartet puzzling outperforms neighbor joining by a factor of 10. The application of quartet puzzling to mitochondrial RNA and tRNAVd’ sequences from amniotes demonstrates the power of the approach. A PHYLIP-compatible ANSI C program, PUZZLE, for analyzing nucleotide or amino acid sequence data is available. Introduction In recent years the maximum-likelihood method for try to reconstruct a tree topology considering only the reconstructing phylogenetic relationships (Felsenstein branching pattern of the (2) different quartet trees that 1981) has become more popular due to the arrival of can be constructed from n sequences (Sattath and Tver- powerful computers. The main advantage of a maxi- sky 1977; Fitch 1981; Bandelt and Dress 1986; Dress, mum-likelihood approach is the application of a well- von Haeseler, and Krtiger 1986). It has been shown defined model of sequence evolution to a given data set (Schiiniger and von Haeseler 1993) that these distance- (Felsenstein 1988). Although the application of the max- based methods exhibit performance similar to neighbor imum-likelihood method to biological data is now wide- joining (Saitou and Nei 1987) while generally being spread, its computational complexity prevents compu- much slower. tation for a large number of sequences. Generally, only In this paper we describe a new method, quartet slow programs for analyzing nucleotide or amino acid puzzling, for reconstructing phylogenetic relationships. sequences are available (Felsenstein 1993; Yang 1995), This method reconstructs the maximum-likelihood tree although it is possible to speed up calculations by par- for each of the (2) possible quartets. In a so-called puz- allelizing the algorithm or using approximative tech- zling step the resulting quartet trees are then combined niques (Adachi and Hasegawa 1994; Olsen et al. 1994). to an overall tree. During the puzzling step sequences Still, large trees can only be analyzed on massively par- are added sequentially in random order to an already- allel systems or by constraining the tree topology. existing subtree. The position of a new sequence is de- The principal goal of a maximum-likelihood anal- termined by a voting procedure, considering all quartets. ysis is the determination of a tree and corresponding Finally, an intermediate tree relating n sequences is ob- branch lengths that have the greatest likelihood of gen- tained. In general, there is no n-taxon tree that fits all erating the data. This task can be split into two parts: the (r;) different quartet trees. Therefore, the puzzling determining a tree topology and subsequently assigning step is repeated several times, thereby elucidating the branch lengths to the topology to obtain a maximum- landscape of possible optimal trees. The quartet puzzling likelihood estimate. Because the number of possible tree tree is obtained as a majority-rule consensus (Margush topologies grows exponentially with the number of se- and McMorris 1981) of all trees that result from multiple quences, all tree reconstruction methods that optimize runs of the puzzling step. Depending on the phyloge- an objective function have to rely on heuristic searches netic information contained in the data, this tree may be to find the best topology. Moreover, the optimization of binary or multifurcating. In addition to the tree topology branch lengths for a given topology is a tedious proce- the quartet puzzling tree also shows reliability values dure for maximum-likelihood-based tree reconstruction for each internal branch. In the next section of this paper methods and consumes a lot of computing time (Olsen the accuracy of the method is analyzed. As an illustra- et al. 1994). While maximum-likelihood procedures are tion, quartet puzzling is applied to evaluate the phylo- generally slow for the general case of n sequences, the genetic relationship among the amniotes (Hedges 1994). determination of the maximum-likelihood tree based on DNA or amino acid sequences poses no problem for four sequences. On the other hand, methods abound that The Quartet Puzzling Algorithm Key words: bootstrapping, consensus tree, maximum likelihood, Quartet puzzling essentially is a three-step proce- tree reconstruction, reliability of internal branches, quartet puzzling, quartet trees. dure, first reconstructing all possible quartet maximum- likelihood trees (maximum-likelihood step), then re- Address for correspondence and reprints: Amdt von Haeseler, Zoologisches Institut, Universitlt Mtinchen, LuisenstraRe 14, D-80333 peatedly combining the quartet trees to an overall tree Mtinchen, Germany. E-mail: [email protected]. (puzzling step), and finally computing the majority rule consensus of all intermediate trees giving the quartet Mol. Bid. Evol. 13(7):964-969. 1996 0 1996 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038 puzzling tree (consensus step). 964 Quartet Puzzling 965 Ql Q2 Q3 u AE llml B= A C A C 0 1 1 1 0 0 X D 0 X D a) b) B D c D D C xxxFIG. I .-The three possible topologies for a four-taxon tree A C The first step in the quartet puzzling analysis is the 1 2 reconstruction of the branching pattern of all possible 4 (2) quartets with maximum likelihood. For each quartet 3 2 (A, B, C, D) three topologies Q,, Q2, and Q3 (fig. 1) 0 D exist with corresponding maximum-likelihood values X ml, m2, and m3. All topologies Q, with mj = max{m,, c) d) m2, m3} are optimal topologies in the maximum-likelihood sense and are stored for the puzzling step. If there is more than one best topology, the branching pattern of the quartet (A, B, C, D) is not uniquely defined. In this E case we choose randomly between the available optimal topologies every time we look up the branching pattern C of (A, B, C, 0). Thus, maximum-likelihood tree recon- A struction induces a neighbor relation llrn, between any four taxa A, B, C, and D (Bandelt and Dress 1986). The % 0 D neighbor relation ABII,, CD implies that taxa A and B and taxa C and D are neighbors with respect to each e) other. Note that in the corresponding tree Q, (fig. 1) the FIG. 2.-Addition of sequence E to the already-existing four-taxon paths connecting the taxa A and B and the taxa C and tree (a). The neighbor relations are given by AE~l,,BC, AE~I,,BD, D are disjoint. AC$,,DE, and BDII,,CE. The relation AE~l,,,,BC implies that the branch- Next, in the puzzling step, we aim to combine the es connecting B and C each get a score of one (b). (c) The score of the branches if A&LID is evaluated. If all four quartets are analyzed, quartet trees to an overall n-taxon tree. Generally the the branch leading to taxon A shows the lowest score (4, Hence, E is neighbor relation lIrn, on the set of all n taxa is not tree- inserted at this branch (e). like (Bandelt and Dress 1986), therefore it is necessary to apply approximation methods to obtain an overall tree topology (Sattath and Tversky 1977; Fitch 198 1; Ban- quences may not always lead to the same tree topology delt and Dress 1986; Dress, von Haeseler, and Kruger for different runs of the puzzling step. Therefore, step 1986). We suggest the following simple algorithm. First, two is repeated as often as possible, thereby elucidating the input order of the y1 taxa is randomized; let us as- the landscape of all possible optimal trees. Generally, sume that the order is A, B, C, D, E, . The maximum- the more taxa involved the more runs of the puzzling likelihood tree of the quartet (A, B, C, D) is now used step are advised. as a seed for the overall n-taxon tree. Then taxon E is In the third step of the quartet puzzling algorithm added to the subtree according to the following voting a majority rule consensus (Margush and McMorris procedure: The neighbor relation lIrnl induces for every 1981) is computed from the intermediate trees resulting quartet (i, j, k, E) a clustering i, j versus k, E, say. It is from the puzzling steps. We call this consensus tree the obvious that taxon E should not be placed on a branch quartet puzzling tree. Depending on the phylogenetic in- that lies on the path connecting i and j in the subtree. formation contained in the data the quartet puzzling tree The edges where E should not be placed in the subtree is either completely resolved or shows multifurcations. are marked for every quartet (i, j, k, E). Thus, every In addition to the tree topology the quartet puzzling tree branch in the subtree is assigned a score. If all different also provides information about the number of times a quartets containing taxon E and three taxa of the subtree particular grouping occurred in the intermediate trees. If are evaluated, species E is inserted at that branch in the the resolution of phylogenetic relationships between a tree that shows the lowest score.

Quartet Puzzling: a Quartet Maximum-Likelihood Method for Reconstructing Tree Topologies

Treefinder Manual

Dissfernando.Pdf

A Powerful Graphical Analysis Environment for Molecular Phylogenetics Gangolf Jobb1*, Arndt Von Haeseler2,3 and Korbinian Strimmer1

A Phylogenomic Approach to Resolve the Arthropod Tree of Life Research

Disentangling Biological and Analytical Factors That Give Rise to Outlier Genes in Phylogenomic Matrices

Mammalian Evolution May Not Be Strictly Bifurcating Open Access Research Article

Phylogenetic Trees from Large Datasets