Incomplete Lineage Sorting: Consistent Phylogeny Estimation from Multiple Loci

University of Pennsylvania ScholarlyCommons, Statistics Papers, Wharton Faculty Research, 1-2010

Elchanan Mossel, University of Pennsylvania
Sébastien Roch

Recommended Citation: Mossel, E., & Roch, S. (2010). Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7 (1), 166-171. http://dx.doi.org/10.1109/TCBB.2008.66

This paper is posted at ScholarlyCommons: https://repository.upenn.edu/statistics_papers/389
Keywords: bioinformatics, estimation theory, genetics, stochastic processes, estimation errors, incomplete lineage sorting, multiple gene trees, multiple loci, phylogeny estimation, stochastic assumptions, biology and genetics, probability and statistics, coalescent process, phylogenetics, population genetics, statistical consistency, topological concordance, algorithms, animals, chromosome mapping, computer simulation, evolution, humans, linkage, disequilibrium, models

Disciplines: Computer Sciences | Genetics and Genomics | Physical Sciences and Mathematics | Statistics and Probability

Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci∗

Elchanan Mossel, Department of Statistics, University of California, Berkeley
Sebastien Roch, Theory Group, Microsoft Research

February 14, 2013

Abstract

We introduce a simple algorithm for reconstructing phylogenies from multiple gene trees in the presence of incomplete lineage sorting, that is, when the topology of the gene trees may differ from that of the species tree. We show that our technique is statistically consistent under standard stochastic assumptions, that is, it returns the correct tree given sufficiently many unlinked loci. We also show that it can tolerate moderate estimation errors.

1 Introduction

Phylogenies, the evolutionary relationships of a group of species, are typically inferred from estimated genealogical histories of one or several genes (or gene trees) [Fel04, SS03]. Yet it is well known that such gene trees may provide misleading information about the phylogeny (or species tree) containing them. Indeed, it was observed early on that a gene tree may be topologically inconsistent with its species tree, a phenomenon known as incomplete lineage sorting. See e.g. [Mad97, Nic01, Fel04] and references therein.
Such discordance plays little role in the reconstruction of deep phylogenetic branchings, but it is critical in the study of recently diverged populations [LP02, HM03, Kno04].

∗Keywords: incomplete lineage sorting, gene tree, species tree, coalescent, topological concordance, statistical consistency. E.M. is supported by an Alfred Sloan fellowship in Mathematics, by NSF grants DMS-0528488 and DMS-0548249 (CAREER), and by ONR grant N0014-07-1-05-06.

Two common approaches to deal with this issue are concatenation and majority voting. In the former, one concatenates the sequences originating from several genes and hopes that a tree inferred from the combined data will produce a better estimate. This approach appears to give poor results [KD07]. Alternatively, one can infer multiple gene trees and output the most common reconstruction (that is, take a majority vote). This is also often doomed to failure. Indeed, a recent, striking result of Degnan and Rosenberg [DR06] shows that, under appropriate conditions, the most likely gene tree may be inconsistent with the species tree, and this situation may arise on any topology with at least 5 species. See also [PN88, Tak89] for related results.

Other techniques are being explored that attempt to address incomplete lineage sorting, notably Bayesian [ELP07] and likelihood [SR07] methods. However, the problem is still far from being solved, as discussed in [MK06].

Here we propose a simple technique, which we call Global LAteSt Split or GLASS, for estimating species trees from multiple genes (or loci). Our technique develops some of the ideas of Takahata [Tak89] and Rosenberg [Ros02], who studied the properties of gene trees in terms of the corresponding species tree. In our main result, we show that GLASS is statistically consistent, that is, the probability that it returns the correct topology tends to 1 as the number of (unlinked) genes grows, thereby avoiding the pitfalls highlighted in [DR06].
We also obtain explicit convergence rates under a standard model based on Kingman's coalescent [Kin82]. Moreover, we allow the use of several alleles from each population and we show how our technique leads to an extension of Rosenberg's topological concordance [Ros02] to multiple loci.

We note the recent results of Steel and Rodrigo [SR07], who showed that Maximum Likelihood (ML) is statistically consistent under slightly different assumptions. An advantage of GLASS over likelihood (and Bayesian) methods is its computational efficiency, as no efficient algorithm for finding ML trees is known. Furthermore, GLASS gives explicit convergence rates, which are useful in assessing the quality of the reconstruction.

For more background on phylogenetic inference and coalescent theory, see e.g. [Fel04, SS03, HSW05, Nor01, Tav04].

Organization. The rest of the paper is organized as follows. We begin in Section 2 with a description of the basic setup. The GLASS method is introduced in Section 3. A proof of its consistency can be found in Sections 4 and 5. We show in Section 6 that GLASS remains consistent under moderate estimation errors. Finally, in Section 7 we do away with the molecular clock assumption and we show how our technique can be used in conjunction with any distance matrix method.

2 Basic Setup

We introduce our basic modelling assumptions. See e.g. [DR06].

Species tree. Consider $n$ isolated populations with a common evolutionary history given by the species tree $S = (V, E)$ with leaf set $L$. Note that $|L| = n$. For each branch $e$ of $S$, we denote:

• $N_e$, the (haploid) population size on $e$ (we assume that the population size remains constant along the branch);
• $t_e$, the number of generations encountered on $e$;
• $\tau_e = t_e / (2 N_e)$, the length of $e$ in standard coalescent time units;
• $\mu = \min_e \tau_e$, the shortest branch length in $S$.

The model does not allow migration between contemporaneous populations.
Often in the literature, the population sizes $\{N_e\}_{e \in E}$ are taken to be equal to a constant $N$. Our results are valid in a more general setting.

Gene trees. We consider $k$ loci $I$. For each population $l$ and each locus $i$, we sample a set of alleles $M_l^{(i)}$. Each locus $i \in I$ has a genealogical history represented by a gene tree $G^{(i)} = (V^{(i)}, E^{(i)})$ with leaf set $L^{(i)} = \cup_l M_l^{(i)}$. For two leaves $a, b$ in $G^{(i)}$, we let $D_{ab}^{(i)}$ be the time, in number of generations, to the most recent common ancestor of $a$ and $b$ in $G^{(i)}$. Following [Tak89, Ros02], we are actually interested in interspecific coalescence times. Hence, we define, for all $r, s \in L$,

$$D_{rs}^{(i)} = \min\left\{ D_{ab}^{(i)} : a \in M_r^{(i)},\ b \in M_s^{(i)} \right\}.$$

Inference problem. We seek to solve the following inference problem. We are given $k$ gene trees as above, including accurate estimates of the coalescence times

$$\left\{ \left( D_{ab}^{(i)} \right)_{a, b \in L^{(i)}} \right\}_{i \in I}.$$

Our goal is to infer the species tree $S$.

Stochastic Model. In Section 4, we will first state the correctness of our inference algorithm in terms of a combinatorial property of the gene trees. In Section 5, we will then show that under the following standard stochastic assumptions, this property holds for a moderate number of genes. Namely, we will assume that each gene tree $G^{(i)}$ is distributed according to a standard coalescent process: looking backwards in time, in each branch any two alleles coalesce at exponential rate 1 independently of all other pairs; whenever two populations merge in the species tree, we also merge the allele sets of the corresponding populations (that is, the coalescence proceeds on the union of both allele sets). We further assume that the $k$ loci $I$ are unlinked, or in other words that the gene trees $\{G^{(i)}\}_{i \in I}$ are mutually independent. Under these assumptions, an inference algorithm is said to be statistically consistent if the probability of returning an incorrect reconstruction goes to 0 as $k$ tends to $+\infty$.
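As an illustration of the definition of $D_{rs}^{(i)}$ above, the following is a minimal Python sketch for a single locus. The function name `interspecific_times`, the dictionary-based input layout, and the toy allele labels and times are all hypothetical choices made for this example and do not come from the paper.

```python
from itertools import combinations


def interspecific_times(allele_times, alleles_by_population):
    """Return D_rs = min{D_ab : a in M_r, b in M_s} for each pair of populations.

    allele_times: dict mapping frozenset({a, b}) -> coalescence time D_ab
    alleles_by_population: dict mapping population label -> list of allele labels
    (Both layouts are assumptions made for this sketch.)
    """
    result = {}
    for r, s in combinations(sorted(alleles_by_population), 2):
        # Minimum over all allele pairs with one allele from r and one from s.
        result[frozenset((r, s))] = min(
            allele_times[frozenset((a, b))]
            for a in alleles_by_population[r]
            for b in alleles_by_population[s]
        )
    return result


# Toy locus (made-up numbers): two alleles in r and in s, one allele in u.
alleles = {"r": ["r1", "r2"], "s": ["s1", "s2"], "u": ["u1"]}
times = {
    frozenset(("r2", "s1")): 120,
    frozenset(("r1", "s1")): 300,
    frozenset(("r1", "s2")): 450,
    frozenset(("r2", "s2")): 450,
    frozenset(("r1", "u1")): 900,
    frozenset(("r2", "u1")): 900,
    frozenset(("s1", "u1")): 900,
    frozenset(("s2", "u1")): 900,
}
D = interspecific_times(times, alleles)
```

Only coalescence times between alleles of different populations are needed, so within-population pairs are omitted from the toy input.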
3 Species Tree Estimation

We introduce a technique which we call the Global LAteSt Split (GLASS) method.

Inference method. Consider first the case of a single gene ($k = 1$). Looking backwards in time, the first speciation occurs at some time $T_1$, say between populations $r_1$ and $s_1$. It is well known that, for any sample $a$ from $M_{r_1}^{(1)}$ and $b$ from $M_{s_1}^{(1)}$, the coalescence time $D_{ab}^{(1)}$ between alleles $a$ and $b$ overestimates the divergence time of the populations. As noted in [Tak89], a better estimate of $T_1$ can be obtained by taking the smallest interspecific coalescence time between alleles in $M_{r_1}^{(1)}$ and in $M_{s_1}^{(1)}$, that is, by considering instead $D_{r_1 s_1}^{(1)}$.

The inference then proceeds as follows. First, cluster the two populations, say $r_1$ and $s_1$, with smallest interspecific coalescence time $D_{r_1 s_1}^{(1)}$. Define the coalescence time of two clusters $A, B \subseteq L$ as the minimum interspecific coalescence time between populations in $A$ and in $B$, that is,

$$D_{AB}^{(1)} = \min\left\{ D_{rs}^{(1)} : r \in A,\ s \in B \right\}.$$
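The clustering step and the cluster coalescence time $D_{AB}^{(1)}$ defined above can be sketched in Python for the single-gene case. This is only an illustrative sketch: the text above describes the first merge and the cluster-level minimum, and the assumption made here is that the same merge is repeated until a single cluster remains. The function name `cluster_by_min_coalescence`, the input layout, and the toy times are hypothetical.

```python
def cluster_by_min_coalescence(pop_times):
    """Greedily merge the two clusters A, B with the smallest D_AB.

    pop_times: dict mapping frozenset({r, s}) -> interspecific time D_rs
    Returns the merges in order, as (cluster_A, cluster_B, D_AB) tuples.
    Repeating the merge until one cluster remains is an assumption of this sketch.
    """
    populations = sorted({p for pair in pop_times for p in pair})
    clusters = [frozenset([p]) for p in populations]
    merges = []

    def cluster_time(A, B):
        # D_AB = min{D_rs : r in A, s in B}
        return min(pop_times[frozenset((r, s))] for r in A for s in B)

    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest coalescence time.
        best, i, j = min(
            (cluster_time(A, B), i, j)
            for i, A in enumerate(clusters)
            for j, B in enumerate(clusters)
            if i < j
        )
        A, B = clusters[i], clusters[j]
        merges.append((A, B, best))
        clusters = [C for idx, C in enumerate(clusters) if idx not in (i, j)]
        clusters.append(A | B)
    return merges


# Toy interspecific times for four populations (made-up numbers).
pop_times = {
    frozenset(("r", "s")): 120,
    frozenset(("u", "v")): 400,
    frozenset(("r", "u")): 900,
    frozenset(("r", "v")): 900,
    frozenset(("s", "u")): 900,
    frozenset(("s", "v")): 900,
}
merges = cluster_by_min_coalescence(pop_times)
```

On this toy input the sketch first joins r and s, then u and v, and finally the two resulting clusters; the sequence of merges can be read as a rooted tree on the populations.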
