The Probability of Monophyly of a Sample of Gene Lineages on a Species Tree
Rohan S. Mehta^{a,1}, David Bryant^{b}, and Noah A. Rosenberg^{a}

^{a}Department of Biology, Stanford University, Stanford, CA 94305; and ^{b}Department of Mathematics and Statistics, University of Otago, Dunedin 9054, New Zealand

Edited by John C. Avise, University of California, Irvine, CA, and approved April 18, 2016 (received for review February 5, 2016)

Monophyletic groups (groups that consist of all of the descendants of a most recent common ancestor) arise naturally as a consequence of descent processes that result in meaningful distinctions between organisms. Aspects of monophyly are therefore central to fields that examine and use genealogical descent. In particular, studies in conservation genetics, phylogeography, population genetics, species delimitation, and systematics can all make use of mathematical predictions under evolutionary models about features of monophyly. One important calculation, the probability that a set of gene lineages is monophyletic under a two-species neutral coalescent model, has been used in many studies. Here, we extend this calculation for a species tree model that contains arbitrarily many species. We study the effects of species tree topology and branch lengths on the monophyly probability. These analyses reveal new behavior, including the maintenance of nontrivial monophyly probabilities for gene lineage samples that span multiple species and even for lineages that do not derive from a monophyletic species group. We illustrate the mathematical results using an example application to data from maize and teosinte.

coalescent theory | monophyly | phylogeography

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "In the Light of Evolution X: Comparative Phylogeography," held January 8–9, 2016, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/ILE_X_Comparative_Phylogeography.
Author contributions: R.S.M., D.B., and N.A.R. designed research; R.S.M. and N.A.R. performed research; R.S.M. analyzed data; and R.S.M., D.B., and N.A.R. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
^{1}To whom correspondence should be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1601074113/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1601074113

Mathematical computations under coalescent models have been central in developing a modern view of the descent of gene lineages along the branches of species phylogenies. Since early in the development of coalescent theory and phylogeography, coalescent formulas and related simulations have contributed to a probabilistic understanding of the shapes of multispecies gene trees (1–3), enabling novel predictions about gene tree shapes under evolutionary hypotheses (4, 5), new ways of testing hypotheses about gene tree discordances (6, 7), and new algorithms for problems of species tree inference (8, 9) and species delimitation (10, 11). A "multispecies coalescent" model, in which coalescent processes on separate species tree branches merge back in time as species reach a common ancestor (12), has become a key tool for theoretical predictions, simulation design, and evaluation of inference methods, and as a null model for data analysis.

A fundamental concept in genealogical studies is that of monophyly. In a genealogy, a group that is monophyletic consists of all of the descendants of its most recent common ancestor (MRCA): every lineage in the group, and no lineage outside it, descends from this ancestor. Backward in time, a monophyletic group has all of its lineages coalesce with each other before any coalesces with a lineage from outside the group.

The phylogenetic and phylogeographic importance of monophyly traces to the fact that monophyly enables a natural definition of a genealogical unit. Such a unit can describe a distinctive set of organisms that differs from other groups of organisms in ways that are evolutionarily meaningful. Species can be delimited by characters present in every member of a species and absent outside the species, and that therefore can reflect monophyly (13, 14). In conservation biology, monophyly can be used as a prioritization criterion because groups with many monophyletic loci are likely to possess unique evolutionary features (15). Reciprocal monophyly, in which a set of lineages is divided into two groups that are simultaneously monophyletic, is often used in a genealogical approach to species divergence (16, 17). The proportion of loci that are reciprocally monophyletic is informative about the time since species divergence and can assist in representing the level of differentiation between groups (4, 18).

Many empirical investigations of genealogical phenomena have made use of conceptual and statistical properties of monophyly (19). Comparisons of observed monophyly levels to model predictions have been used to provide information about species divergence times (20, 21). Model-based monophyly computations have been used alongside DNA sequence differences between and within proposed clades to argue for the existence of the clades (22), and tests involving reciprocal monophyly have been used to explain differing phylogeographic patterns across species (23). Comparisons of observed levels of monophyly with the level expected by chance alone (24) have assisted in establishing the distinctiveness of taxonomic groups (25, 26). Loci that conflict with expected monophyly levels have provided signatures of genic roles in species divergences (27–29).

For lineages from two species under a model of population divergence, Rosenberg (4) computed probabilities of four different genealogical shapes: reciprocal monophyly of both species, monophyly of only one of the species, monophyly of only the other species, and monophyly of neither species. The computation permitted arbitrary species divergence times and sample sizes, generalizing earlier small-sample computations (1–3, 30, 31), and illustrated the transition from the species divergence, when monophyly is unlikely for both species, to long after divergence, when reciprocal monophyly becomes extremely likely. Between these extremes, the species can pass through a period during which monophyly of one species but not the other is the most probable state.

Although this two-species computation has contributed to various insights about empirical monophyly patterns (21–23, 32–34), many scenarios deal with more than two species. Because multispecies monophyly probability computations have been unavailable, except in limited cases with up to four species (4, 35–38), multispecies studies have been forced to rely on two-species models, restricting attention to species pairs (25, 34, 39) or pooling disparate lineages and disregarding their taxonomic distinctiveness (23, 26).

Here, we derive an extension to the two-species monophyly probability computation, examining arbitrarily many species related by an evolutionary tree. Furthermore, we eliminate the past restriction (4) that the lineages whose monophyly is examined all derive from the same population. This generalization is analogous to the assumption that in computing the probability of a binary evolutionary character (40–42), one or both character states can appear in multiple species. Our approach uses a pruning algorithm, generalizing the two-species formula in a conceptually similar manner to other recursive coalescent computations on arbitrary trees (9, 40–44).

Like the work of Degnan and Salter (5), which considered probability distributions for gene tree topologies under the multispecies coalescent model, our work generalizes a coalescent computation known only for small trees (4, 35) to arbitrary species trees. We study the dependence of the monophyly probability on the model parameters, providing an understanding of factors that contribute to monophyly in species trees of arbitrary size. Finally, we explore the utility of monophyly probabilities in an application to genomewide data from maize and teosinte.

Results

Model and Notation.

Overview. Consider a rooted binary species tree $\mathcal{T}$ with $\ell$ leaves and specified topology and branch lengths. For each of the $\ell$ species represented by leaves of $\mathcal{T}$, a number of sampled lineages is specified. Given a specified partition of the lineages into two subsets, we consider a condition describing whether one, the other, both, or neither of the two subsets of lineages is monophyletic. Our goal is to provide a recursive computation of the probability that the condition is obtained under the multispecies coalescent model. Notation appears in Table S1.

Table 2. Possible monophyly events for two disjoint lineage classes, S and C

Monophyletic groups    Description             Notation
S                      Monophyly of S          $E_S$
C                      Monophyly of C          $E_C$
Only S                 Paraphyly of C          $E_{SC'}$
Only C                 Paraphyly of S          $E_{S'C}$
Both S and C           Reciprocal monophyly    $E_{SC}$
Neither S nor C        Polyphyly               $E_{S'C'}$

... arbitrarily labeled diagram of species tree $\mathcal{T}$. These labels are used only for bookkeeping; the labeling does not affect subsequent calculations. Lineages entering from the left and right branches are called "left inputs" and "right inputs," respectively. Each node $x$ of $\mathcal{T}$ is associated with exactly one branch, leading from node $x$ to its immediate predecessor on $\mathcal{T}$. We refer to this branch with the shared label $x$.

For an internal branch $x$ in $\mathcal{T}$, the number of class-$S$ left inputs is $s_x^L$ ($c_x^L$ for class $C$, $m_x^L$ for class $M$); the number of class-$S$ right inputs is $s_x^R$ ($c_x^R$ for class $C$, $m_x^R$ for class $M$). The total number of class-$S$ inputs of $x$ is $s_x^I = s_x^L + s_x^R$ ($c_x^I = c_x^L + c_x^R$ for class $C$, $m_x^I = m_x^L + m_x^R$ for class $M$). The number of lineages that exit branch $x$, entering a branch farther up the species tree, is the set ...

Lineage classes.