Absolute Convergence: True Trees from Short Sequences

Absolute Convergence: True Trees From Short Sequences Tandy Warnow∗ Bernard M.E. Moret† Katherine St. John‡ Abstract evolution if, for every model tree (i.e., rooted tree and Fast-converging methods for reconstructing phylogenetic the associated random variables) and every ε > 0, there trees require that the sequences characterizing the taxa be is a sequence length k (which depends on the method, of only polynomial length, a major asset in practice, since the model tree, and ε) such that the method recovers the real-life sequences are of bounded length. However, of the topology (the edges) of the model tree with probability half-dozen such methods proposed over the last few years, at least 1−ε, if it is given sequences of length at least k. only two fulfill this condition without requiring knowledge For many models (such as the Jukes-Cantor model [13], of typically unknown parameters, such as the evolutionary the simplest four-state model, as well as more complex rate(s) used in the model; this additional requirement models, such as the General Markov (GM) [19] model), severely limits the applicability of the methods. We say that even simple distance methods are easily established to methods that need such knowledge demonstrate relative fast be statistically consistent. convergence, since they rely upon an oracle. We focus on The sequence length required by a method is a sig- the class of methods that do not require such knowledge and nificant aspect of its performance since real data sets are thus demonstrate absolute fast convergence. We give a very of limited length. (Computational requirements are also general construction scheme that not only turns any relative important, but it is possible to wait longer or get more fast-converging method into an absolute fast-converging one, powerful machines, while it is not possible to get longer but also turns any statistically consistent method that sequences than exist in nature.) Consequently, experi- converges from sequences of length O(eO(diam(T ))) into an mental and analytical studies have attempted to bound absolute fast-converging method. the sequence lengths required by different phylogenetic methods. Methods that perform well (with respect to 1 Introduction topology estimation) from sequences of realistic lengths (bounded by at most a few thousand nucleotides) are Phylogenetic reconstruction methods build an evolu- very desirable, especially if the topological accuracy re- tionary tree from a collection of taxa given, for example, mains good when the rate of evolution and number of by molecular sequences. These methods are designed to taxa increase. recover the “true” evolutionary tree as often as possible. In an earlier paper [12], we defined the notion of fast Not all are guaranteed to do so with high probability convergence under the Jukes-Cantor model of evolution. under reasonable conditions; even those that offer this In this paper we introduce the concept of absolute guarantee vary considerably in their requirements. Un- fast convergence and extend this definition to more der some models of evolution, no method can be guaran- general models, such as the General Time Reversible teed to recover the true tree with high probability, due Markov model. A method is absolute fast-converging if to unidentifiability. Under other models, many methods it is fast-converging and does not need to know any will be able to recover the tree if given long enough se- of the parameters of the model in order to achieve quences. The latter methods are said to be statistically fast convergence. In the Jukes-Cantor model, such consistent under the model of evolution. Formally, a parameters might be the minimum (f) or maximum (g) method is statistically consistent for a specific model of expected number of times a site changes on any edge in the tree. A method can be fast converging only if it ∗Dept. of Computer Science, U. of Texas at Austin, [email protected]; supported by NSF grant 94-57800 and by is given knowledge of one or both of these values: such the David and Lucile Packard Foundation. a method is not absolute fast-converging and we say †Dept. of Computer Science, U. of New Mexico, instead that it is relative fast-converging. [email protected]; supported in part by NSF grant 00-81404. Only a few methods have been proved to be abso- ‡ Graduate Center and Dept. of Math. and Computer Science, lute fast-converging even under the simple Jukes-Cantor Lehman College, City U. of New York, [email protected]; supported in part by NSF grant 99-73874. model: two “DCM-boosted” quartet methods [12] and the Short-Quartet methods [8, 9]. Methods that have lution of a single site is modeled through the use of been proved relative fast-converging under the Jukes- “stochastic substitution matrices,” 4 × 4 matrices (one Cantor model include DCM-boosted neighbor-joining for each tree edge) in which every row sums to 1. A (DCM-NJ) [12], the Harmonic Greedy Triplets (HGT) stochastic model of how a single site evolves can thus method of Cs˝urösand Kao [4], and a method of Cryan, have up to 12 free parameters. The simplest such model Goldberg, and Goldberg (CGG) [3]. These methods is the Jukes-Cantor model, with one free parameter, and are only relative fast-converging under the Jukes-Cantor the most complex is the General Markov model, with all model, rather than absolute fast-converging, because 12 parameters [19]. they require knowledge of f or g; without such knowledge, they have to guess the parameter (or, more pre- Definition 1. The GM model of single-site evolution cisely, they have to guess which of the trees they have is defined as follows. constructed is the correct tree). Their guessing strate- 1. The nucleotide in a random site at the root is drawn gies are not provably correct. Consequently, in absence from a known distribution, in which each nucleotide of knowledge about f and g, these methods are not even has positive probability. statistically consistent. In this paper, we describe a very simple algorithm, 2. The probability of each site substitution on an which we call Short Quartet Support (SQS). This al- edge e of the tree is given by a 4 × 4 stochastic gorithm selects the true tree (under the GM model) substitution matrix M(e) in which det(M(e)) is not with high probability from a collection of trees, given 0, 1, or −1. sequences of only polynomial length. Since SQS does This model is generally used in a context where not require knowledge of the model parameters, it can all sites evolve identically and independently (the iid be used to turn a relative fast-converging method into assumption), although sometimes a distribution of rates an absolute fast-converging method. Consequently, across sites is also given. In this paper, we use the GM a straightforward use of SQS produces absolute fast- model with iid site evolution. converging versions of the relative fast-converging meth- We denote a model tree in the GM model as a pair, ods DCM-Neighbor-Joining [12], HGT [4], and CGG [3]. (T, {Me : e ∈ E(T )}), or more simply as (T,M). We SQS can also be combined with the first phase of assume that the number of changes obeys a Poisson the Disk-Covering Method (DCM) [12] to produce a distribution. For each edge e ∈ E(T ), we define the technique we call DCM∗ for reducing the dependency ∗ weight of the edge λ(e) to be − log |det(Me)|. This of methods on sequence lengths. In particular, DCM allows us to define the matrix of leaf-to-leaf distances, turns methods that converge from sequence lengths that P {λij}, with λij = λ(e) and where Pij is the grow exponentially in the diameter (the longest path e∈Pij path in T between leaves i and j. Note that {λ } is a length in the tree) into methods that are absolute fast- ij symmetric matrix. It is a well-known fact that, given converging. Since the diameter of an n-leaf tree can √ the distance matrix {λ }, it is easy to recover the be as large as n − 1 and is typically Ω( n), this tech- ij underlying leaf-labelled tree T in polynomial time. nique turns methods that are statistically consistent, This general model of site evolution subsumes the but not even relative fast-converging, into absolute fast- great majority of other models examined in the phyloge- converging methods. Finally, SQS provides a very gen- netic literature, including the Hasegawa-Kishino-Yano eral framework within which absolute fast-converging (HKY) model, the Kimura 2-parameter model (K2P), methods can be developed. the Kimura 3-ST model (K3ST), the Jukes-Cantor We state our results in terms of the General Time model (JC), etc. These models are all special cases Reversible Markov model of evolution, which contains, of the General Markov model, because they place as a special case, the Jukes-Cantor model. restrictions on the form of the stochastic substitution matrices (see [14] for more information about stochastic 2 Basics models of evolution). Most distance-based methods 2.1 Stochastic models of DNA sequence evolu- are statistically consistent under the General Markov tion. A model of DNA sequence evolution must de- model, because statistically consistent methods exist scribe the probability distribution of the four states, for estimating the matrix {λij} above. (A method for A, C, T, G, at the root, the evolution of a random site estimating the matrix {λij} is statistically consistent (i.e., position within the DNA sequence) and how the if each of the distance estimates dij converges to the evolution differs across the sites.

Absolute Convergence: True Trees from Short Sequences

Ch. 15 Power Series, Taylor Series

Topic 7 Notes 7 Taylor and Laurent Series

11.3-11.4 Integral and Comparison Tests

The Ratio Test, Integral Test, and Absolute Convergence

1 Convergence Tests

Math 113 Lecture #30 §11.6: Absolute Convergence and the Ratio and Root Tests

Power Series, Taylor Series and Analytic Functions (Section 5.1)

Half-Plane of Absolute Convergence

Generating Functions

Improper Integrals

Absolute Convergence and the Ratio and Root Tests

Changing the Order of Integration