The probability of monophyly of a sample of gene lineages on a species tree
Rohan S. Mehta^{a,1}, David Bryant^{b}, and Noah A. Rosenberg^{a}
^{a}Department of Biology, Stanford University, Stanford, CA 94305; and ^{b}Department of Mathematics and Statistics, University of Otago, Dunedin 9054, New Zealand
Edited by John C. Avise, University of California, Irvine, CA, and approved April 18, 2016 (received for review February 5, 2016)
Monophyletic groups (groups that consist of all of the descendants of a most recent common ancestor) arise naturally as a consequence of descent processes that result in meaningful distinctions between organisms. Aspects of monophyly are therefore central to fields that examine and use genealogical descent. In particular, studies in conservation genetics, phylogeography, population genetics, species delimitation, and systematics can all make use of mathematical predictions under evolutionary models about features of monophyly. One important calculation, the probability that a set of gene lineages is monophyletic under a two-species neutral coalescent model, has been used in many studies. Here, we extend this calculation for a species tree model that contains arbitrarily many species. We study the effects of species tree topology and branch lengths on the monophyly probability. These analyses reveal new behavior, including the maintenance of nontrivial monophyly probabilities for gene lineage samples that span multiple species and even for lineages that do not derive from a monophyletic species group. We illustrate the mathematical results using an example application to data from maize and teosinte.

coalescent theory | monophyly | phylogeography

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “In the Light of Evolution X: Comparative Phylogeography,” held January 8–9, 2016, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/ILE_X_Comparative_Phylogeography.
Author contributions: R.S.M., D.B., and N.A.R. designed research; R.S.M. and N.A.R. performed research; R.S.M. analyzed data; and R.S.M., D.B., and N.A.R. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
^{1}To whom correspondence should be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1601074113/-/DCSupplemental.

Mathematical computations under coalescent models have been central in developing a modern view of the descent of gene lineages along the branches of species phylogenies. Since early in the development of coalescent theory and phylogeography, coalescent formulas and related simulations have contributed to a probabilistic understanding of the shapes of multispecies gene trees (1–3), enabling novel predictions about gene tree shapes under evolutionary hypotheses (4, 5), new ways of testing hypotheses about gene tree discordances (6, 7), and new algorithms for problems of species tree inference (8, 9) and species delimitation (10, 11). A multispecies coalescent model, in which coalescent processes on separate species tree branches merge back in time as species reach a common ancestor (12), has become a key tool for theoretical predictions, simulation design, and evaluation of inference methods, and as a null model for data analysis.

A fundamental concept in genealogical studies is that of monophyly. In a genealogy, a group that is monophyletic consists of all of the descendants of its most recent common ancestor (MRCA): every lineage in the group, and no lineage outside it, descends from this ancestor. Backward in time, a monophyletic group has all of its lineages coalesce with each other before any coalesces with a lineage from outside the group.

The phylogenetic and phylogeographic importance of monophyly traces to the fact that monophyly enables a natural definition of a genealogical unit. Such a unit can describe a distinctive set of organisms that differs from other groups of organisms in ways that are evolutionarily meaningful. Species can be delimited by characters present in every member of a species and absent outside the species, and that therefore can reflect monophyly (13, 14). In conservation biology, monophyly can be used as a prioritization criterion because groups with many monophyletic loci are likely to possess unique evolutionary features (15). Reciprocal monophyly, in which a set of lineages is divided into two groups that are simultaneously monophyletic, is often used in a genealogical approach to species divergence (16, 17). The proportion of loci that are reciprocally monophyletic is informative about the time since species divergence and can assist in representing the level of differentiation between groups (4, 18).

Many empirical investigations of genealogical phenomena have made use of conceptual and statistical properties of monophyly (19). Comparisons of observed monophyly levels to model predictions have been used to provide information about species divergence times (20, 21). Model-based monophyly computations have been used alongside DNA sequence differences between and within proposed clades to argue for the existence of the clades (22), and tests involving reciprocal monophyly have been used to explain differing phylogeographic patterns across species (23). Comparisons of observed levels of monophyly with the level expected by chance alone (24) have assisted in establishing the distinctiveness of taxonomic groups (25, 26). Loci that conflict with expected monophyly levels have provided signatures of genic roles in species divergences (27–29).

For lineages from two species under a model of population divergence, Rosenberg (4) computed probabilities of four different genealogical shapes: reciprocal monophyly of both species, monophyly of only one of the species, monophyly of only the other species, and monophyly of neither species. The computation permitted arbitrary species divergence times and sample sizes, generalizing earlier small-sample computations (1–3, 30, 31), and illustrated the transition from the species divergence, when monophyly is unlikely for both species, to long after divergence, when reciprocal monophyly becomes extremely likely. Between these extremes, the species can pass through a period during which monophyly of one species but not the other is the most probable state.

Although this two-species computation has contributed to various insights about empirical monophyly patterns (21–23, 32–34), many scenarios deal with more than two species. Because multispecies monophyly probability computations have been unavailable, except in limited cases with up to four species (4, 35–38), multispecies studies have been forced to rely on two-species models, restricting attention to species pairs (25, 34, 39) or pooling disparate lineages and disregarding their taxonomic distinctiveness (23, 26).

Here, we derive an extension to the two-species monophyly probability computation, examining arbitrarily many species related by an evolutionary tree. Furthermore, we eliminate the past restriction (4) that the lineages whose monophyly is examined all derive from the same population. This generalization is analogous to the assumption that in computing the probability of a binary evolutionary character (40–42), one or both character states can appear in multiple species. Our approach uses a
pruning algorithm, generalizing the two-species formula in a conceptually similar manner to other recursive coalescent computations on arbitrary trees (9, 40–44).

Like the work of Degnan and Salter (5), which considered probability distributions for gene tree topologies under the multispecies coalescent model, our work generalizes a coalescent computation known only for small trees (4, 35) to arbitrary species trees. We study the dependence of the monophyly probability on the model parameters, providing an understanding of factors that contribute to monophyly in species trees of arbitrary size. Finally, we explore the utility of monophyly probabilities in an application to genomewide data from maize and teosinte.

Results

Model and Notation.

Overview. Consider a rooted binary species tree $\mathcal{T}$ with $\ell$ leaves and specified topology and branch lengths. For each of the $\ell$ species represented by leaves of $\mathcal{T}$, a number of sampled lineages is specified. Given a specified partition of the lineages into two subsets, we consider a condition describing whether one, the other, both, or neither of the two subsets of lineages is monophyletic. Our goal is to provide a recursive computation of the probability that the condition is obtained under the multispecies coalescent model. Notation appears in Table S1.

Lineage classes. The initial sampled lineages are partitioned into class S (subset) for lineages within a chosen subset, and class C (complement) for all lineages not included in S. Coalescence between an S lineage and a C lineage produces an M (mixed) lineage. Any coalescence involving an M lineage also produces an M lineage. Coalescences between two S or two C lineages produce S and C lineages, respectively (Table 1).

Table 1. Lineage classes produced by coalescence events
Intraclass coalescences between pairs of lineages preserve the class; interclass coalescences result in M lineages.

Letting the number of S and C lineages present initially in the $i$th leaf be $S_i$ and $C_i$, respectively, the model parameters are $S_i$ and $C_i$ for $1 \le i \le \ell$, and the species tree $\mathcal{T}$. For convenience, we aggregate the $S_i$ and $C_i$ with $\mathcal{T}$ into a parameter collection $\mathcal{T}_{SC}$ that we call the initialized species tree.

Monophyly events. A monophyly event $E_i$ is an assignment of labels to lineage classes S and C. We can choose to label a class “monophyletic” or “not monophyletic,” or assign no label at all, so that nine monophyly events are possible, six of which are relevant for our purposes (Table 2). All lineages in a monophyletic class must coalesce within the class to a single lineage before any coalesces outside the class. If multiple classes are labeled monophyletic, then each class must be separately monophyletic.

Table 2. Possible monophyly events for two disjoint lineage classes, S and C

Monophyletic groups    Description             Notation
S                      Monophyly of S          $E_S$
C                      Monophyly of C          $E_C$
Only S                 Paraphyly of C          $E_{SC'}$
Only C                 Paraphyly of S          $E_{S'C}$
Both S and C           Reciprocal monophyly    $E_{SC}$
Neither S nor C        Polyphyly               $E_{S'C'}$

Species-merging events. We orient the species tree vertically, “up” toward the root and “down” toward the leaves. From a coalescent backward-in-time perspective, at every internal node of the species tree, representing a species-merging event, lineages enter from two branches directly below the node. We label one of these branches “left” and the other “right,” based on an arbitrarily labeled diagram of species tree $\mathcal{T}$. These labels are used only for bookkeeping; the labeling does not affect subsequent calculations. Lineages entering from the left and right branches are called “left inputs” and “right inputs,” respectively. Each node $x$ of $\mathcal{T}$ is associated with exactly one branch, leading from node $x$ to its immediate predecessor on $\mathcal{T}$. We refer to this branch with the shared label $x$.

For an internal branch $x$ in $\mathcal{T}$, the number of class-S left inputs is $s_x^L$ ($c_x^L$ for class C, $m_x^L$ for class M); the number of class-S right inputs is $s_x^R$ ($c_x^R$ for class C, $m_x^R$ for class M). The total number of class-S inputs of $x$ is $s_x^I = s_x^L + s_x^R$ ($c_x^I = c_x^L + c_x^R$ for class C, $m_x^I = m_x^L + m_x^R$ for class M). The number of lineages that exit branch $x$, entering a branch farther up the species tree, is the set of outputs of branch $x$: $s_x^O$, $c_x^O$, or $m_x^O$.

We combine the input and output values into two three-entry vectors: the “input states” $\mathbf{n}_x^I = (s_x^I, c_x^I, m_x^I)$ and the “output states” $\mathbf{n}_x^O = (s_x^O, c_x^O, m_x^O)$. Note that $\mathbf{n}_x^I = \mathbf{n}_x^L + \mathbf{n}_x^R$. We refer to the nodes directly below node $x$ corresponding to its left and right incoming branches by $x_L$ and $x_R$, respectively, and to nodes farther down the tree by sequences of $L$s and $R$s, which, read from left to right, give the steps needed to reach them from $x$. For example, $x_{RL}$ follows down from $x$ to the right ($x_R$), then from $x_R$ to the left ($x_{RL}$).

The time interval associated with node $x$ is $T_x$, the length of branch $x$. Branch lengths are measured in coalescent time units of $N$ generations, where $N$ represents the haploid population size along the branch and is assumed to be constant. Thus, larger population sizes correspond to shorter lengths of time in coalescent units. Coalescences between inputs during time $T_x$ yield the outputs of $x$. The root branch of $\mathcal{T}$ has infinite length.

The outputs of any nonroot branch are exactly the left or the right inputs of another branch farther up the tree; the outputs of the root are the outputs of the species tree. The root has only one output lineage: $\mathbf{n}_{\mathrm{root}}^O = (0, 0, 1)$. Inputs of a node $x$ are the outputs of $x_L$ and $x_R$, so that $\mathbf{n}_x^L = \mathbf{n}_{x_L}^O$ and $\mathbf{n}_x^R = \mathbf{n}_{x_R}^O$. For convenience, when node $x$ corresponds to leaf $i$, we let $s_x^I = s_x^L = S_i$ and $c_x^I = c_x^L = C_i$ (Fig. 1).

We define $\mathcal{T}_{SC}^x$ to be the initialized species subtree with root $x$ and $E_i^x$ to be the monophyly event $E_i$ for the subtree with root $x$, ignoring the rest of the species tree.

Coalescence sequences. A coalescence sequence is a sequence of coalescences that reduces a set of lineages to another set of lineages. As an example, consider four lineages, labeled A, B, C, and D, that coalesce to a single lineage. One sequence has A and C coalesce first, followed by B and D, then the lineages resulting from the AC and BD coalescences. This sequence could be described as (A, C), (B, D), (AC, BD). If the first two coalescences happened in opposite order, the sequence would be (B, D), (A, C), (AC, BD).

Combinatorial functions. The probability $g_{n,j}(T)$ that $n$ lineages coalesce to $j$ lineages in time $T$ is given by equation 6.1 of ref. 45. It is nonzero only when $n \ge j \ge 1$ and $T \ge 0$, except that we set $g_{0,0}(T) = 1$. Following equation 4 of ref. 4, the number of coalescence sequences that reduce $n$ lineages to $k$ lineages is $I_{n,k} = [n!(n-1)!] / [2^{n-k} k!(k-1)!]$. This function is nonzero only when $n \ge k \ge 1$, with the convention $I_{0,0} = 1$.
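The two combinatorial functions can be sketched numerically as follows. The text cites the formula for the coalescence probability only by reference (equation 6.1 of ref. 45), so the closed form used here, the standard alternating-sum expression with rising and falling factorials, is an assumption about which formula is meant. The function names (`g`, `num_coalescence_sequences`, `rising`, `falling`) are illustrative and do not come from the paper.

```python
from math import exp, factorial


def rising(a, k):
    """Rising factorial a(a+1)...(a+k-1); equals 1 when k == 0."""
    out = 1
    for i in range(k):
        out *= a + i
    return out


def falling(a, k):
    """Falling factorial a(a-1)...(a-k+1); equals 1 when k == 0."""
    out = 1
    for i in range(k):
        out *= a - i
    return out


def g(n, j, T):
    """Probability that n lineages coalesce to j lineages in time T
    (coalescent units). Assumed closed form; see the lead-in."""
    if n == 0 and j == 0:
        return 1.0  # convention g_{0,0}(T) = 1
    if not (n >= j >= 1) or T < 0:
        return 0.0
    total = 0.0
    for k in range(j, n + 1):
        total += (exp(-k * (k - 1) * T / 2)
                  * (2 * k - 1) * (-1) ** (k - j)
                  * rising(j, k - 1) * falling(n, k)
                  / (factorial(j) * factorial(k - j) * rising(n, k)))
    return total


def num_coalescence_sequences(n, k):
    """I_{n,k} = n!(n-1)! / (2^(n-k) k!(k-1)!), with I_{0,0} = 1."""
    if n == 0 and k == 0:
        return 1
    if not (n >= k >= 1):
        return 0
    return (factorial(n) * factorial(n - 1)
            // (2 ** (n - k) * factorial(k) * factorial(k - 1)))
```

For two lineages the sum reduces to $g_{2,1}(T) = 1 - e^{-T}$ and $g_{2,2}(T) = e^{-T}$, which is a quick sanity check on the assumed formula. For the four-lineage example above, `num_coalescence_sequences(4, 1)` gives $4!\,3!/(2^3 \cdot 1!\,0!) = 18$.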
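The lineage-class rule, the definition of a coalescence sequence, and the monophyly events of Table 2 can be illustrated by brute force on the four-lineage example. This is a minimal sketch, not the paper's recursive algorithm: it looks only at a single population in which all four lineages coalesce to one, and it assigns A and B to class S and C and D to class C purely for illustration. All function and variable names are hypothetical.

```python
from itertools import combinations


def coalesce_class(a, b):
    """Class of the lineage formed when lineages of classes a and b coalesce:
    S+S -> S, C+C -> C, anything else (including any M) -> M."""
    return a if a == b and a in ("S", "C") else "M"


def sequences(lineages, k=1):
    """Yield every coalescence sequence reducing `lineages` to k lineages.
    Each lineage is a frozenset of sample labels; a sequence is a tuple of
    (lineage, lineage) pairs in the order the coalescences occur."""
    if len(lineages) == k:
        yield ()
        return
    for a, b in combinations(lineages, 2):
        rest = [x for x in lineages if x != a and x != b] + [a | b]
        for tail in sequences(rest, k):
            yield ((a, b),) + tail


def is_monophyletic(group, seq):
    """True if some coalescence in `seq` creates a lineage whose
    descendants are exactly `group` (or the group is a single sample)."""
    group = frozenset(group)
    return len(group) == 1 or any(a | b == group for a, b in seq)


leaves = [frozenset(label) for label in "ABCD"]
S_labels, C_labels = {"A", "B"}, {"C", "D"}  # illustrative class assignment

all_seqs = list(sequences(leaves))
n_S = sum(is_monophyletic(S_labels, s) for s in all_seqs)
n_both = sum(is_monophyletic(S_labels, s) and is_monophyletic(C_labels, s)
             for s in all_seqs)
print(len(all_seqs), n_S, n_both)
```

The enumeration finds 18 sequences in total, matching $I_{4,1}$. Class S is monophyletic in 4 of them (event $E_S$), and both classes are monophyletic in 2 (event $E_{SC}$).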