<<

Absolute Convergence: True Trees From Short

Tandy Warnow∗ Bernard M.E. Moret† Katherine St. John‡

Abstract evolution if, for every model tree (i.e., rooted tree and Fast-converging methods for reconstructing phylogenetic the associated random variables) and every ε > 0, there trees require that the sequences characterizing the taxa be is a length k (which depends on the method, of only polynomial length, a major asset in practice, since the model tree, and ε) such that the method recovers the real-life sequences are of bounded length. However, of the (the edges) of the model tree with probability half-dozen such methods proposed over the last few years, at least 1−ε, if it is given sequences of length at least k. only two fulfill this condition without requiring knowledge For many models (such as the Jukes-Cantor model [13], of typically unknown parameters, such as the evolutionary the simplest four-state model, as well as more complex rate(s) used in the model; this additional requirement models, such as the General Markov (GM) [19] model), severely limits the applicability of the methods. We say that even simple distance methods are easily established to methods that need such knowledge demonstrate relative fast be statistically consistent. convergence, since they rely upon an oracle. We focus on The sequence length required by a method is a sig- the class of methods that do not require such knowledge and nificant aspect of its performance since real data sets are thus demonstrate absolute fast convergence. We give a very of limited length. (Computational requirements are also general construction scheme that not only turns any relative important, but it is possible to wait longer or get more fast-converging method into an absolute fast-converging one, powerful machines, while it is not possible to get longer but also turns any statistically consistent method that sequences than exist in nature.) Consequently, experi- converges from sequences of length O(eO(diam(T ))) into an mental and analytical studies have attempted to bound absolute fast-converging method. the sequence lengths required by different phylogenetic methods. Methods that perform well (with respect to 1 Introduction topology estimation) from sequences of realistic lengths (bounded by at most a few thousand nucleotides) are Phylogenetic reconstruction methods build an evolu- very desirable, especially if the topological accuracy re- tionary tree from a collection of taxa given, for example, mains good when the rate of evolution and number of by molecular sequences. These methods are designed to taxa increase. recover the “true” evolutionary tree as often as possible. In an earlier paper [12], we defined the notion of fast Not all are guaranteed to do so with high probability convergence under the Jukes-Cantor model of evolution. under reasonable conditions; even those that offer this In this paper we introduce the concept of absolute guarantee vary considerably in their requirements. Un- fast convergence and extend this definition to more der some models of evolution, no method can be guaran- general models, such as the General Time Reversible teed to recover the true tree with high probability, due Markov model. A method is absolute fast-converging if to unidentifiability. Under other models, many methods it is fast-converging and does not need to know any will be able to recover the tree if given long enough se- of the parameters of the model in order to achieve quences. The latter methods are said to be statistically fast convergence. In the Jukes-Cantor model, such consistent under the model of evolution. Formally, a parameters might be the minimum (f) or maximum (g) method is statistically consistent for a specific model of expected number of times a site changes on any edge in the tree. A method can be fast converging only if it ∗Dept. of Computer Science, U. of Texas at Austin, [email protected]; supported by NSF grant 94-57800 and by is given knowledge of one or both of these values: such the David and Lucile Packard Foundation. a method is not absolute fast-converging and we say †Dept. of Computer Science, U. of New Mexico, instead that it is relative fast-converging. [email protected]; supported in part by NSF grant 00-81404. Only a few methods have been proved to be abso- ‡ Graduate Center and Dept. of Math. and Computer Science, lute fast-converging even under the simple Jukes-Cantor Lehman College, City U. of New York, [email protected]; supported in part by NSF grant 99-73874. model: two “DCM-boosted” quartet methods [12] and the Short-Quartet methods [8, 9]. Methods that have lution of a single site is modeled through the use of been proved relative fast-converging under the Jukes- “stochastic substitution matrices,” 4 × 4 matrices (one Cantor model include DCM-boosted neighbor-joining for each tree edge) in which every row sums to 1. A (DCM-NJ) [12], the Harmonic Greedy Triplets (HGT) stochastic model of how a single site evolves can thus method of Cs˝ur¨osand Kao [4], and a method of Cryan, have up to 12 free parameters. The simplest such model Goldberg, and Goldberg (CGG) [3]. These methods is the Jukes-Cantor model, with one free parameter, and are only relative fast-converging under the Jukes-Cantor the most complex is the General Markov model, with all model, rather than absolute fast-converging, because 12 parameters [19]. they require knowledge of f or g; without such knowl- edge, they have to guess the parameter (or, more pre- Definition 1. The GM model of single-site evolution cisely, they have to guess which of the trees they have is defined as follows. constructed is the correct tree). Their guessing strate- 1. The nucleotide in a random site at the root is drawn gies are not provably correct. Consequently, in absence from a known distribution, in which each nucleotide of knowledge about f and g, these methods are not even has positive probability. statistically consistent. In this paper, we describe a very simple algorithm, 2. The probability of each site substitution on an which we call Short Quartet Support (SQS). This al- edge e of the tree is given by a 4 × 4 stochastic gorithm selects the true tree (under the GM model) substitution M(e) in which det(M(e)) is not with high probability from a collection of trees, given 0, 1, or −1. sequences of only polynomial length. Since SQS does This model is generally used in a context where not require knowledge of the model parameters, it can all sites evolve identically and independently (the iid be used to turn a relative fast-converging method into assumption), although sometimes a distribution of rates an absolute fast-converging method. Consequently, across sites is also given. In this paper, we use the GM a straightforward use of SQS produces absolute fast- model with iid site evolution. converging versions of the relative fast-converging meth- We denote a model tree in the GM model as a pair, ods DCM-Neighbor-Joining [12], HGT [4], and CGG [3]. (T, {Me : e ∈ E(T )}), or more simply as (T,M). We SQS can also be combined with the first phase of assume that the number of changes obeys a Poisson the Disk-Covering Method (DCM) [12] to produce a distribution. For each edge e ∈ E(T ), we define the technique we call DCM∗ for reducing the dependency ∗ weight of the edge λ(e) to be − log |det(Me)|. This of methods on sequence lengths. In particular, DCM allows us to define the matrix of leaf-to-leaf distances, turns methods that converge from sequence lengths that P {λij}, with λij = λ(e) and where Pij is the grow exponentially in the diameter (the longest path e∈Pij path in T between leaves i and j. Note that {λ } is a length in the tree) into methods that are absolute fast- ij symmetric matrix. It is a well-known fact that, given converging. Since the diameter of an n-leaf tree can √ the distance matrix {λ }, it is easy to recover the be as large as n − 1 and is typically Ω( n), this tech- ij underlying leaf-labelled tree T in polynomial time. nique turns methods that are statistically consistent, This general model of site evolution subsumes the but not even relative fast-converging, into absolute fast- great majority of other models examined in the phyloge- converging methods. Finally, SQS provides a very gen- netic literature, including the Hasegawa-Kishino-Yano eral framework within which absolute fast-converging (HKY) model, the Kimura 2-parameter model (K2P), methods can be developed. the Kimura 3-ST model (K3ST), the Jukes-Cantor We state our results in terms of the General Time model (JC), etc. These models are all special cases Reversible Markov model of evolution, which contains, of the General Markov model, because they place as a special case, the Jukes-Cantor model. restrictions on the form of the stochastic substitution matrices (see [14] for more information about stochastic 2 Basics models of evolution). Most distance-based methods 2.1 Stochastic models of DNA sequence evolu- are statistically consistent under the General Markov tion. A model of DNA sequence evolution must de- model, because statistically consistent methods exist scribe the probability distribution of the four states, for estimating the matrix {λij} above. (A method for A, C, T, G, at the root, the evolution of a random site estimating the matrix {λij} is statistically consistent (i.e., position within the DNA sequence) and how the if each of the distance estimates dij converges to the evolution differs across the sites. Typically the prob- true value, λij, as the sequence length increases, with ability distribution at the root is uniform (so that all probability 1.) The “log-det” distances provide such a sequences of a fixed length are equally likely). The evo- statistically consistent estimation of the matrix {λij} [19]. In our theorems, we will call these distances 2.2 Fast-converging methods. Since letting f be simply the GM corrected distances. arbitrarily small or g be arbitrarily large affects the se- In [9], the sequence length needed to obtain quence length requirement, we are interested in devel- an arbitrarily good estimate of these distances was oping methods for which polynomially long sequences estimated, as follows: ensure accuracy under the General Markov model, when both f and g are fixed, but arbitrary. In order to define Theorem 2.1. Let (T,M) be a model tree in the this concept precisely, we first parameterize the General GM model. λ(e) = − log |det(Me)| and λij = Markov model. P λ(e). Assume that f, g are fixed with 0 < f ≤ e∈Pij λ(e) ≤ g for all edges e ∈ T . Let ε > 0, δ > 0 be given. Definition 2. GMf,g contains those (T,M) ∈ GM for Then, there is a constant C such that, if the sequence which f ≤ λ(e) ≤ g holds for all edges e ∈ E(T ). length exceeds We now define two types of “fast convergence.” O(g·diam(T )) C log ne Definition 3. A phylogenetic reconstruction method Φ is absolute fast-converging (afc) for the GM model if, then with probability at least 1 − δ, we have L∞(d, λ) = for all positive f, g, ε, there is a polynomial p such maxij |dij − λij| < ε, where d is the matrix of GM that, for all (T,M) in the GM model, on set S of n corrected distances, λ is the matrix of true distances, n sequences of length at least p(n) generated on T , we have is the number of leaves, and diam(T ) is the topological P r[Φ(S) = T ] > 1 − ε. diameter of T . Note that method M operates without any knowledge The significance of Theorem 2.1 is seen in the light of of parameters f or g—or indeed any of f and the following theorem: g. Thus, although the polynomial p depends upon both f and g, the method itself does not. Theorem 2.2. Let {dij} be an n × n dissimilarity matrix, {λij} an additive matrix defined by a tree T Definition 4. A phylogenetic reconstruction method Φ with positive edge-weighting λ(e), and f the smallest is relative fast-converging (rfc) for the model GMf,g edge weight in T , f = mine w(e). if, for all positive f, g, ε, there is a polynomial p such

∗ that, for all (T,M) ∈ GMf,g on set S of n se- • (From [1] and [8]) If L∞(d, λ) < f/2, then the Q quences of length at least p(n) generated on T , we have and Neighbor-Joining methods both return tree T P r[Φ(S, A(f, g)) = T ] > 1−ε, where A(f, g) denotes an on input d. oracle that provides information about f or g. • (From [8]) If L∞(d, λ) < f/8, then the Agarwala et al. method returns tree T on input d. Not requiring information about f or g is clearly desir- able, since in practice there is no a priori way to bound These two theorems imply the following result. f or g. Hence, absolute fast convergence is a more de- sirable property than relative fast convergence. Theorem 2.3. Let T , M, f, g, n, δ, λ, d, and However, almost all known fast-converging methods diam(T ) be as in Theorem 2.1. Then there is a constant are only rfc, not afc. The only known afc methods are: C > 0 such that, if the sequence length exceeds • The Short-Quartet methods (the dyadic closure C log neO(g·diam(T )) method [8] and the witness-antiwitness method [9]). These methods operate by attempting to produce, then, with probability at least 1−δ, the Neighbor-Joining for each setting of the parameter g, a binary tree and Q∗ methods recover the true tree. There is also a that meets certain constraints. They also can constant C0 > 0 such that, if the sequence length exceeds explore efficiently (e.g., through binary search) the parameter space of g. C0 log neO(g·diam(T )) • Certain variations of the Disk-Covering Method then, with probability at least 1−δ, the Agarwala method (DCM) [12]. DCM is a two-phase technique used recovers the true tree. in conjunction with existing phylogenetic methods. The first phase produces a collection of trees, where Since diam(T ) can be as large as n − 1, the sequence each tree is obtained by producing a division of length requirement of each of these methods is bounded the dataset into overlapping subproblems (“disks”) from above by a function that grows exponentially in of low diameter. Trees on these subproblems are n, even when g is fixed. computed using some existing phylogenetic method and are then merged into a supertree on the entire 3 Short Quartet Support set of taxa. The second phase selects a tree on the We now present the Short Quartet Support (SQS) entire set of taxa, based upon a criterion designed method and prove that SQS is absolute fast-converging specifically with respect to the base phylogenetic for True Tree Selection under the GM model. We method. Using DCM with the Buneman Method begin with some notation and terminology. Let T be a ∗ (also known as the Q method) or with the Na¨ıve phylogeny on S. We denote by Q(T ) the set of quartets Quartet Method (NQM) produces afc methods, on the leaves of T defined by T —i.e., quartet t is in because the second phase has provable performance Q(T ) if and only if the subtree of T induced by the four guarantees. taxa of t equals t. T can be reconstructed from Q(T ) in polynomial time [22]. All afc methods operate by producing a set of trees, then selecting the best tree from the set. By designing Definition 6. For a given quartet q on taxa from S, a suitable selection criterion, one can prove performance define diamD(q) as guarantees and thus establish absolute fast convergence. diam (q) = max{D | {i, j} ⊂ q} By comparison, rfc methods need to know (the value D i,j of or a tight bound on) one or both of the parameters In other words, diamD(q) is the maximum GM distance (f and g). Absent such knowledge, rfc methods cannot between the taxa of q. For Q, a fixed set of quartets, reconstruct the true tree with high probability from se- given D we can define the set Qw = {q ∈ Q : quences of polynomial length. Indeed, rfc methods also diamD(q) ≤ w}. operate by producing a set of trees, one for each set- ting of the unknown but necessary parameter, and then Definition 7. Let T be a fixed tree leaf-labelled by the select a tree from the set. Because their selection proce- set S, Q a fixed set of quartets on S, and D the GM dure has no performance guarantees, these methods are distance matrix on S. The support of T with respect to not statistically consistent, much less afc. For example, Q, denoted s(T,Q), is the HGT method [4] or its heuristic modification [5], the CGG method [3], and DCM with neighbor-joining max{l | (q ∈ Q and diamD(q) ≤ l) =⇒ q ∈ Q(T )} or with the Agarwala method (see [12]), all use heuris- Although we define the support of T with respect to tics (with no performance guarantees) to select the best arbitrary sets Q and arbitrary ways of defining the tree. The existence of a selection criterion with perfor- dissimilarity matrices, we will focus our attention on mance guarantees is thus a crucial ingredient in produc- matrices D and sets Q defined in particular ways: ing afc methods, prompting us to define the following problem. Definition 8. Let D denote the GM corrected dis- tances. We define Q to be the set of neighbor-joining quartets on the fourtuples of leaves in S based upon True Tree Selection Problem: D. (Since neighbor-joining on quartets is identical to • Input: A set S of sequences over the nu- the Four-Point Method (see [8]) on quartets, this is the cleotide alphabet {A, C, T, G} generated by an same as the set of quartet trees computed in [8, 9] and unknown GM model tree (T,M) and a collec- other papers.) tion T = {T ,T ,...,T } of phylogenies on S. 1 2 p We are now ready to present a high-level version of • Output: The true tree T if T is in T the SQS method. The main theorem of this paper is that SQS is absolute fast-converging under the General Markov (GM) model. In order for an algorithm for True Tree Selection to be useful in producing afc methods, it must itself Procedure SQS(T ,S) demonstrate a form of absolute fast convergence. • For each set of four taxa from S, compute the Definition 5. An algorithm Φ for True Tree Se- neighbor-joining quartet q; let Q be the set of lection is absolute fast-converging under the GM all such quartets. model if, for all f, g, ε > 0, there is a polynomial p such • Return Ti ∈ T such that s(Ti, Q) is maximum; that, for all model trees (T,M) ∈ GMf,g and for all sets if more than one such tree exists, return the S a set of sequences generated on (T,M) of length at one with the smallest index i. least p(|S|), T ∈ T implies P r[Φ(S, T ) = T ] > 1 − ε. Theorem 3.1. SQS is absolute fast-converging under TG(D, w) has vertex set S and edges between those taxa the General Markov model. That is, for all f, g, ε > 0, i, j such that Dij ≤ w. Formally: TG(D, w) = (S, Ew), there is a polynomial p such that, for all (T,M) ∈ GMf,g where Ew = {(i, j) | Dij ≤ w}. on set S of n sequences generated at random by T with length at least p(n), we have In [12], it was proved that if D is additive, then the threshold graph TG(D, w) is triangulated. P r[SQS(T ,S) = T ] > 1 − ε Definition 11. A graph G is triangulated if every for all T such that T ∈ T . cycle in G of length at least four contains a chord (i.e. a pair of non-sequential adjacent vertices). We postpone the proof of this theorem to Section 6. Although our matrices will not in general be additive, 4 Applications of SQS they will often be sufficiently close to additive (since 4.1 Turning rfc Methods into afc Methods: An they are based upon statistically consistent estimators alternative way of viewing rfc methods is that they of GM distances). The significance of this is that finding construct a set T of trees, one for each potential setting maximal cliques (i.e. maximal of vertices, every of the parameter, then face the problem of selecting pair of which are adjacent) in triangulated graph is the true tree from T . This is a relatively easy task if solvable in polynomial time, although hard for the the value for the parameter is known, but no general general case. solution has been found to the problem when the We describe the MaxClique procedure below. parameter is unknown. SQS is designed to solve this problem. In particular, SQS can be used with any rfc Procedure MaxClique method as a second phase, turning that method into an • Input: The GM distance matrix D, a set S of afc method. sequences generated on an unknown GM tree Definition 9. Let Φ be an rfc method. Define Φ∗ to (T,M), and a phylogenetic base method Φ. be the method obtained by using Φ to produce a set T of • Output: A set T of trees on S obeying trees and using SQS to choose a tree from T . P r[T ∈ T ] > 1 − ε if the sequence length Theorem 4.1. If Φ is rfc, then Φ∗ is afc. The addi- exceeds p(|S|) (where p is a polynomial). tional running time needed by Φ∗ is O(n4|T |). • Algorithm: For each w ∈ Dij: 4.2 DCM∗: Dramatically Reducing the Re- quired Length: We describe a general-purpose tech- – Let Ew = {(i, j) | Dij ≤ w}. Con- nique for reducing the sequence length requirement struct the threshold graph, TG(D, w) = of statistically consistent phylogenetic methods. This (S, Ew). technique has two phases. The first phase is obtained – If TG(D, w) is not connected, then let Tw from the Disk-Covering Method of Huson et al. [12]. be the star tree (i.e., the tree with one This phase takes as input a phylogenetic method Φ and internal node attached to each of the n a set S of sequences generated under a model of evolu- leaves) and exit. tion, and produces a collection T of trees. The second – Triangulate TG(D, w), minimizing phase of the technique uses the SQS selection procedure max{Dij | (i, j) added to (S, Ew)}, to select the best tree from the set T . thus producing the triangulated graph This two-phase procedure, DCM∗, has very strong TG∗(D, w). theoretical performance guarantees. In particular, it – Compute the maximal cliques can be used to create afc methods. For example, if C ,C ,...,C of TG∗(D, w) (note O(max{λ }) 1 2 l the convergence rate of Φ is O(e ij ), then the l ≤ n). For each i, 1 ≤ i ≤ l, let convergence rate of DCM∗-Φ is O(nO(g)), where g = ti = Φ(Ci). max λ(e), so that DCM∗-Φ is afc. e – Merge the subtrees t using the Strict We call the first phase of the DCM method the i Consensus Merger [12]. Let T be the MaxClique method, since the dataset decomposition w resulting tree and exit. technique is based upon computing maximal cliques in a graph that we now define: Return T = {Tw | w ∈ {Dij}}. Definition 10. Let D be a dissimilarity matrix on the set S, and let w ∈ R+ be given. The threshold graph Theorem 4.2. Assume (T,M) ∈ GMf,g, S a set of limitations, we will only suggest how the proof goes; sequences generated on T , D the GM distance matrix see [9] for the proof that WAM is afc. for S, and ε > 0. If Φ is a phylogenetic method with WAM has two phases. In the first phase, it uses a convergence rate on tree T of O(eO(max{λij })), then an algorithm called WATC (Witness-Antiwitness Tree MaxClique-Φ is rfc. Construction) to compute a set T of trees, one for each setting of the parameter w (quartet width). In the A stronger version of this theorem, which shows an im- second phase, it selects a tree from T as the true tree. provement in the convergence rate for all phylogenetic The technique used to construct T is provably rfc and methods whose sequence length requirements depend that used to select the tree from T , while different from upon the maximum evolutionary distance max λij in SQS, is provably afc, so that WAM is afc. the tree, is provided in Section 6. Lemma 6.2 on the WIGWAM differs from WAM in the implementa- conditions under which the Strict Consensus Merger is tion of the first phase and then uses SQS to select the guaranteed to yield the true tree appears in Section 6 best tree from T . We will focus therefore on describing as well, since it depends on terminology and theory de- WATC and show how our version (WIGWATC) oper- veloped in that section. ates. For any phylogenetic reconstruction method Φ, we Recall that S is the set of sequences generated on an denote by DCM∗-Φ the method obtained by applying unknown GM model tree, D is the GM distance matrix, this two-phase process: and Q is the set of NJ quartet trees computed using the GM matrix. Finally, recall that for q ∈ Q, diam (q) is • Phase 1: Given phylogenetic method Φ, set S of D the maximum GM distance between the leaves of q. We sequences generated on an unknown GM tree, and take the following definition from [9]: the GM distance matrix D, compute the set T of trees on S using MaxClique-Φ. Definition 12. Let e be an internal edge of T and • Phase 2: Compute Q, the set of neighbor- assume that the removal of e (and its endpoints) from joining quartets on S, based upon D, then return T decomposes T into four subtrees, two of which are T1 SQS(T , Q). and T2. A quartet {ab|cd} ∈ Q is a witness for the siblinghood of T1 and T2 if we have a ∈ T1, b ∈ T2 Theorem 4.3. Let (T,M) ∈ GMf,g, S a set of se- and c, d ∈ T − (T1 ∪ T2). A quartet {ef|gh} ∈ Q is quences generated on T , D the GM distance matrix for an antiwitness for the siblinghood of T1 and T2 if we S, and let ε > 0 be given. If Φ is a phylogenetic method have e ∈ T1, g ∈ T2 and f, h ∈ T − (T1 ∪ T2).A with convergence rate on tree T bounded from above by w-witness ( w-antiwitness) is a witness (anti-witness) q O(max{λij }) ∗ O(e ), then DCM -Φ is afc. obeying diamD(q) ≤ w.

Proof. Follows from Theorem 4.2 and Theorem 3.1. WATC takes as input the GM dissimilarity matrix D and a parameter w and proceeds like neighbor-joining, 4.3 WIGWAM: New rfc and afc methods: This in that it constructs the tree from the “outside-in.” general two-phase structure for afc methods can be Initially every leaf is its own rooted subtree. WATC used to design new afc methods. We present such then repeatedly pairs two rooted subtrees until there a method, which we call the WeIGhted Witness- are only three subtrees left, at which point the three Antiwitness Method (WIGWAM). WIGWAM runs in rooted subtrees are joined into a star. The resulting polynomial time, is afc, and by design produces a bi- unrooted tree is the output of WATC. WATC chooses nary tree on every input; hence, it improves upon some two subtrees to pair from a list of candidates—any afc methods that tend to return unresolved trees. WIG- candidate can be chosen; two rooted subtrees t and t0 WAM considers all quartets, but weighs them accord- are candidates for pairing if and only if there is a w- ing to their statistical support (as indicated by diamD); witness and no w-antiwitness to their siblinghood. If consequently, it should have better performance than no candidate pair can be found at some stage, WATC quartet methods that do not distinguish between quar- returns the star-tree and exits; otherwise WATC will tets [20]. return an unrooted binary tree. This simple technique WIGWAM is a modification of an earlier afc method is provably rfc [9]. called the Witness-Antiwitness Method (WAM) [9]. We Our modified technique, WIGWATC, uses two sep- therefore present the WAM method first, and then arate diameter bounds, one (W ) for witnesses and one show how WIGWAM differs from WAM. The proof that (A) for antiwitnesses. Two subtrees are then candidates WIGWAM is afc is slightly more complicated than the for pairing only when there is at least a W -witness and proof for WAM but is quite similar. Due to space no A-antiwitness. Unlike WATC, which picks an arbi- trary candidate, WIGWATC chooses the “best” candi- tree generated. Once a certain level of support l has date pair—that which has the least evidence against its been reached, no antiwitness of diameter less than or pairing. Formally, WIGWAM ranks antiwitnesses ac- equal to l is tolerated—the new tree under construction cording to their diameter, with the antiwitness of least is forced to agree with all quartets of diameter less than diameter (the “shortest” antiwitness) having the most or equal to l. This constraint may prevent the build- importance; WIGWAM then selects the candidate pair ing of a tree, but note that the next iteration, with an (i.e. pair of subtrees which are candidates) to maximize unchanged value for support, but a larger threshold for the diameter of its shortest antiwitness. witnesses, may find new candidates that were unavail- We describe WIGWATC and WIGWAM below. able at the current iteration and thus may succeed in a building a tree that obeys the constraints. Since a tree Procedure WIGWATC: that obeys all short quartets is the true tree, WIGWAM eventually produces a true tree if given all of its short • Input: Two diameter bounds W and A, and quartets. WIGWAM is typical of rfc methods in that the input given to WIGWAM, i.e. a set S of it returns a collection of trees, up to one per value of taxa, a set Q of quartet trees (that includes a the parameter (here the threshold on the diameter of quartet on every four taxa in the set S), and quartets), but our support-based mechanism incorpo- a dissimilarity matrix D on S. rates the SQS principle directly into the procedure, so • Start with each taxon defining a subtree. that WIGWAM is afc.

• While at least four subtrees remain do: Procedure WIGWAM – Form the graph G on the set of subtrees • Input: A set S of taxa, a set Q of quartet (one vertex per subtree), where two ver- trees (that includes a quartet on every four tices are joined by an edge if the corre- taxa in the set S), and a dissimilarity matrix sponding pair of subtrees has at least one D on S. W -witness and no A-antiwitness; define the cost of each edge to be the reciprocal • Compute the smallest diameter d such that of the diameter of the shortest antiwit- the initial threshold graph TG (with one node ness for that edge, 0 otherwise. for each taxon) for (D, d) is connected. Set W = d and A = 0. – If no edge is created, return failure. – Choose the edge of least cost, merge its • Repeat until W equals the largest value in D: two endpoints and the two corresponding – Call WIGWATC with parameters W and subtrees, and update edge costs as neces- A. sary (the merging may eliminate antiwit- – If WIGWATC returns failure, increase W nesses for other edges). to the next larger value in Dij. • Merge the remaining trees in the obvious way – Otherwise compute the support and return the resulting binary tree. s(T,QW ) of the returned tree T with respect to the set QW of quartets and set A = s(T,QW ). If the support is larger than W , set W = s(T,Q ); otherwise The difficult part of WIGWAM is to identify suit- W increase W to the next larger value in able values for W and A such that the tree returned for D . these values is the true the true tree. In its simplest ij form, WIGWAM searches (nearly) linearly through all O(n2) possible values of quartet diameters for each of the parameters. Roughly speaking, it starts with a Correctness follows from the fact that if s(T,QW ) = value w for W that ensures that the threshold graph s0 ≥ W , then a tree with support at least s0 can be built TG(D, w) is connected, then steps through values ei- by WIGWATC for each threshold value from W to s0. ther one at a time (when the support of the tree re- turned is not as good as the current diameter) or by 5 Conclusions skipping over an entire range of values (when the sup- We have introduced the concept of relative vs. absolute port is larger than the current diameter). The key to fast convergence and its associated problem True Tree the procedure is the updating of the threshold for an- Selection. We have given a simple, polynomial-time tiwitnesses in terms of the level of support for the last technique, SQS, for solving this problem, a technique that turns rfc methods into afc methods and that can max{λij | {i, j} ⊂ q}. be used, particularly in conjunction with the DCM We define various quantities with respect to the method, for designing new afc methods. As an example, two matrices, {λij} and {Dij}. For each internal edge we have introduced a new afc method, WIGWAM, e ∈ E(T ), the deletion of e from T breaks T into four which uses SQS to select its tree. subtrees, T1,T2,T3,T4. Set There is significant experimental evidence that the technique used to select the best tree from the set T wλ(e) = min{diamλ(a1, a2, a3, a4)}, has a dramatic impact on the topological accuracy of where the minimum is taken over all fourtuples the tree returned in phase II. For example, the stud- (a1, a2, a3, a4) with ai a leaf in Ti, i = 1, 2, 3, 4. ies by Cs˝ur¨osand Kao on their rfc method HGT [5] The short quartets around e are formed from taxa and by Huson et al. on their rfc method DCM-NJ [12] a1, a2, a3, a4 with ai ∈ Ti, so as to minimize both showed that, under some conditions, the set T diamλ(a1, a2, a3, a4). Set wλ(T ) = maxe wλ(e). could contain the true tree or a very close approximation thereof, but that T also contained trees with topological Definition 13. The set of short quartet trees of the ∗ error rates higher than 50%. Thus, designing methods tree T , denoted Qshort(T ), is the set of true trees (i.e., with provable performance guarantees is extremely im- subtrees of the true tree T ) on those fourtuples of S portant, both from a theoretical and a practical point obeying diamλ(q) ≤ wλ(T ). of view. Our interest in short quartets is based upon the We note that the set T does not need to contain a following theorem that was established in [6]: tree for each setting of its unknown parameter. In [9], Erd˝os et al. presented a technique called sparse-high for Lemma 6.1. Let (T, w) be a fixed edge-weighted tree use with their fast-converging WAM, a technique which leaf-labelled by S and let T 0 be another tree leaf-labelled ∗ 0 0 only requires computing a small number of trees. Hence by S. If we have Qshort(T ) ⊆ Q(T ), then T equals T . rfc methods need not be computationally intensive. Our results prompt new questions: are there other In the definition of the DCM∗ method, we stated that afc methods for True Tree Selection? In particular, subtrees on maximal cliques are computed and then is maximum likelihood (MLE) such a method (see, e.g., merged using a technique called the Strict Consensus [18] for overview)? Are there generic techniques for Merger. This technique is roughly as follows (see [12] for speeding up SQS? (Existing methods such the sparse details): merge the trees computed in a pairwise fashion, high search mentioned above are very much tailored to until all the subtrees are merged into a single tree on the base phylogenetic method.) the full set of taxa. During each merger, contract, if necessary, a minimum number of edges in order to 6 Proofs make the pair of trees agree on the subtrees induced We prove Theorem 3.1, and provide a stronger version by their common leaf sets, before merging the trees. of Theorem 4.2. (The order in which subtrees are merged matters, and Theorem 3.1. The SQS Procedure is absolute fast- follows the perfect elimination ordering given for the converging under the GM model. triangulation of the threshold graph; see [12] for details). Our proof proceeds by first establishing a condition We now characterize the conditions under which the under which we have s(T, Q) > s(T 0, Q) for all T 0 6= T . strict consensus merger is guaranteed to be correct. We then show that this condition holds with probability 1−ε from sequences of length bounded by a polynomial Lemma 6.2. Let (T, w) be an edge-weighted tree on leaf in n, for fixed f, g, ε > 0. Since the SQS procedure does set S and let G be a triangulated graph on vertex set S not consider the values of f or g, this will prove that it such that each short quartet of T induces a four-clique is afc. in G. Assume that, for each maximal clique C of G, we have the true tree tC induced in T by C. Then the Let (T,M) ∈ GMf,g be a given GM tree. Recall Strict Consensus Merger of the trees tC produces T . that {λij} denotes the matrix of true evolutionary distances leaves the taxa in S. Let S be a set of Proof. The proof of Theorem 4 in [12] proves this as- taxa generated by T at the leaves of T , and let Dij sertion when the trees on maximal cliques are obtained be the GM corrected distance between i and j. As by using the Buneman Tree method; however, the proof mentioned earlier, for all i, j, Dij converges to λij as of that theorem does not depend upon the use of the the sequence length k increases. Recall that Q denotes Buneman Tree method, but only upon the trees being the set of neighbor-joining quartets computed on all correct. Consequently, this lemma can be proved using fourtuples of leaves from S and that diamλ(q) denotes the same proof. Recall that for a fixed set Q of quartets, Qw is the set Proof. Proof omitted due to space constraints, but of quartets q ∈ Q obeying diamD(q) ≤ w. follows from results in [12]. Definition 14. The width of T with respect to D is ∗ This concludes the proof that SQS is fast-converging wD(T ) = max{diamD(q) | q ∈ Qshort(T )}. under the GM model (proof of Theorem 3.1).

(In words, wD(T ) is the maximum GM corrected dis- We now provide the stronger version of Theorem tance between any two leaves i, j that appear in a short 4.2. from which Theorem 4.2 follows as a special case. quartet of T .) Note that membership in a short quar- Let (T,M) ∈ GM , S a set of se- tet depends upon the true distance λ, but that w (q) is Theorem 6.2. f,g D quences generated on T , and D the GM distance matrix based upon the GM corrected distance matrix D. Let for S. Let ε > 0 be given and assume that Φ is a phylo- (T,M) be a GM tree, with D the GM corrected distance genetic method with convergence rate on tree T bounded matrix, and let λ denote the true distance matrix. We from above by F (max{λ }) for some function F . Then define ij DCM∗-Φ converges from sequences of length bounded by (q) F (O(w (T )). Consequently, if F (x) is O(eO(x)), then L∞ (D, λ) = max{|Dij − λij|: min{Dij, λij} ≤ q} λ DCM∗-M is afc. We continue with a lemma established in [8].

(w) Proof. (Sketch:) The tree TwD computed by the Max- Lemma 6.3. If we have L∞ (D, λ) < f/2, then Clique method is based upon the base method applied to neighbor-joining will return the correct tree for every subproblems of maximum GM distance wD. When the quartet q with diamD(q) ≤ w. sequences are of polynomial length, these subproblems Now the following lemma follows easily. have maximum true distance bounded by O(wλ). Thus, the convergence rate bound for the method M guaran- (w) Lemma 6.4. If L∞ (D, λ) < f/2, then s(T, Q) ≥ w. tees that M is accurate on each subproblem with high

(w) probability if the sequence length is at least F (O(wλ)), Proof. If L∞ (D, λ) < f/2, then Qw ⊆ Q(T ), and the maximum evolutionary distance in each subprob- s(T, Q) ≥ w. lem. We can now prove: The threshold graph contains TG(D, wD(T )), so that every short quartet induces a four-clique in the (wD (T )) Corollary 6.1. If L∞ (D, λ) < f/2, then we triangulation of this graph. Thus, since each subtree 0 0 have s(T , Q) ≥ wD(T ) if and only if T equals T . is correctly reconstructed, Lemma 6.2 implies that the strict consensus merger produces the true tree, proving Proof. Follows from Lemma 6.4 and Lemma 6.1. the first assertion. Finally, if F (x) is O(eO(x)), then DCM∗-M con- We are now ready to state the condition under which verges from sequences of length at most O(eO(wλ(T ))). the SQS procedure is guaranteed to be correct. In [9], wλ(T ) was established to be bounded by Theorem 6.1. Set (T,M) ∈ GMf,g, S the set of se- O(g log n), from which the result follows. quences generated on T , and D the GM distance matrix defined on S. Let Q denote the set of neighbor-joining 7 Acknowledgements (wD (T )) quartets on S based on D. If we have L∞ (D, λ) < The authors wish to thank the University of New f/2, then also we have s(T, Q) > s(T 0, Q) for all Mexico’s Department of Computer Science for hosting T 6= T 0. TW and KS in Summer 2000, during which time these results were obtained. Proof. Follows from Corollary 6.1.

We now analyze the sequence length that suffices for the References above condition to hold with high probability. Let (T,M) ∈ GM and let ε > 0 be an Lemma 6.5. f,g [1] Atteson, K., “The performance of the neighbor-joining arbitrary constant. Let D be the GM corrected distance methods of phylogenetic reconstruction,” Algorithmica matrix defined on a set of sequences generated on T . 25, 2/3 (1999), 251–278. Then there is a constant c > 0 such that, if the sequence [2] Berry, V., Bryant, D., Jiang, T., Kearney, P., Li, M., length k exceeds c · nO(g), then we have Wareham, T., and Zhang, H., “A practical algorithm for recovering the best supported edges of an evolution- (wD (T )) P r[L∞ (D, λ) < f/2] > 1 − ε ary tree,” preprint, 2000. [3] Cryan, M., Goldberg, L., and Goldberg, P., “Evolu- [20] St. John, K., Warnow, T., Moret, B.M.E., and Vawter, tionary trees can be learned in polynomial time in the L., “Performance study of phylogenetic methods: (un- two-state general Markov model,” Proceedings FOCS weighted) quartet methods and neighbor-joining,” to (1998), 436–445. appear in SODA 2001. [4] Cs˝ur¨os, M., and Kao, M.Y., “Recovering evolutionary [21] Strimmer, K., and von Haeseler, A., “Quartet puzzling: trees through harmonic greedy triplets,” In Proc. 10th A quartet maximum likelihood method for reconstruct- Annual ACM-SIAM Symp. on Discrete Algorithms ing tree ,” Mol. Biol. & Evol. 13 (1996), 964– (1999), 261–270. 969. [5] Cs˝ur¨os, M., and Kao, M.Y., “O(n2logL)-time accurate [22] Warnow, T., “Some combinatorial problems in phy- recovery of evolutionary trees with more than one thou- logenetics,” Invited paper, Proc. Int’l Colloq. Com- sand leaves: an experimental combination of harmonic binatorics and Graph Theory, Balatonlelle, Hungary greedy triplets and the minimum evolution principle,” (1996), in Vol. 7 of Bolyai Society Mathematical Stud- preprint, Yale U., 1999. ies, A. Gy´arf´as,L. Lov´asz,and L.A. Sz´ek´ely, eds., Bu- [6] Erd˝os,P.L., Steel, M., Sz´ek´ely, L., and Warnow, T., dapest, 363–413 (1999). “Local quartet splits of a binary tree infer all quartet splits via one dyadic inference rule,” Computers and Artificial Intelligence 16, 2 (1997), 217–227. [7] Erd˝os,P.L., Steel, M., Sz´ek´ely, L., and Warnow, T., “Inferring big trees from short sequences,” Proc. Int’l Congress on Automata, Languages, and Programming ICALP97, Bologna, Italy (1997). [8] Erd˝os,P.L., Steel, M., Sz´ek´ely, L., and Warnow, T., “A few logs suffice to build almost all trees—I,” Random Structures and Algorithms 14 (1997), 153–184. [9] Erd˝os,P.L., Steel, M., Sz´ek´ely, L., and Warnow, T., “A few logs suffice to build almost all trees—II,” Theor. Comput. Sci. 221 (1999), 77–118. [10] Huson, D., Nettles, S., Parida, L., Warnow, T., and Yooseph, S., “A divide-and-conquer approach to tree reconstruction,” Workshop on Algorithms and Experi- ments (ALEX98), Trento, Italy, 1998. [11] Huson, D., Nettles, S., Rice, K., Warnow, T., and Yooseph, S., “The hybrid tree reconstruction method,” ACM J. Experimental Algorithmics 4, 5 (1999), www.jea.acm.org/1999/HusonHybrid/. [12] Huson, D., Nettles, S., and Warnow, T., “Disk- covering, a fast-converging method for phylogenetic tree reconstruction,” J. Comput. Biol. 6, 3 (1999), 369– 386. [13] Jukes, T.H., and Cantor, C. Mammalian Protein Metabolism. Academic Press, New York, 1969; In chap- ter “Evolution of protein molecules,” 21–132. [14] Li, W.-H. Molecular Evolution. Sinauer, Mas- sachusetts, 1997. [15] Penny, D., and Steel, M.A, “Distributions of tree comparison metrics—some new results,” Syst. Biol. 42, 2 (1993), 126–141. [16] Rice, K., and Warnow, T., “Problems with big trees,” SSE/SSB Joint meetings, Vancouver, British Columbia, 1998. [17] Saitou, N., and Nei, M., “The neighbor-joining method: A new method for reconstructing phyloge- netic trees.” Mol. Biol. & Evol. 4 (1987), 406–425. [18] Setubal, J., and Meidanis, J. Introduction to Compu- tational Molecular Biology. PWS Publishing Co, 1997. [19] Steel, M.A., “Recovering a tree from the leaf coloura- tions it generates under a Markov model,” Appl. Math. Lett. 7 (1994), 19–24.