<<

Evolution, 57(7), 2003, pp. 1465±1477

THE SHAPES OF NEUTRAL GENE GENEALOGIES IN TWO : PROBABILITIES OF , , AND IN A COALESCENT MODEL

NOAH A. ROSENBERG Molecular and Computational , University of Southern California, 1042 West 36th Place, DRB 289, Los Angeles, California 90089 E-mail: [email protected]

Abstract. The genealogies of samples of orthologous regions from multiple species can be classi®ed by their shapes. Using a neutral coalescent model of two species, I give exact probabilities of each of four possible genealogical shapes: reciprocal monophyly, two types of paraphyly, and polyphyly. After the divergence that forms two species, each of which has population size N, polyphyly is the most likely genealogical shape for the lineages of the two species. At ϳ1.300N generations after divergence, paraphyly becomes most likely, and reciprocal monophyly becomes most likely at ϳ1.665N generations. For a given species, the time at which 99% of its loci acquire monophyletic genealogies is ϳ5.298N generations, assuming all loci in its sister species are monophyletic. The probability that all lineages of two species are reciprocally monophyletic given that a sample from the two species has a reciprocally monophyletic genealogy increases rapidly with sample size, as does the probability that the most recent common ancestor (MRCA) for a sample is also the MRCA for all lineages from the two species. The results have potential applications for the testing of evolutionary hypotheses.

Key words. Coalescence, gene tree, labeled histories, population divergence, , Yule model.

Received September 7, 2002. Accepted February 4, 2003.

The genealogy for all copies of a particular genomic region eages ancestral to extant lineages to have been present in in a particular species can be classi®ed into one of two cat- ancestral species. These ancestral lineages must have coa- egories. Either the lineages are monophyletic, that is, they lesced in an order that included more than one interspeci®c comprise all the extant descendants of their most recent com- coalescence (Fig. 1ii±iv). Because recently diverged pairs of mon ancestor (MRCA), or they are not monophyletic. The species might be represented by multiple ancestral lineages latter scenario, equivalently, requires that lineages of one or at their times of divergence, their genealogies can have many more additional species also descend from this MRCA. interspeci®c coalescences and, thus, might not show recip- Consider all copies of orthologous regions in an ordered rocal monophyly. pair of species, (A, B). The genealogy of these lineages can Probabilities of genealogical shapes for two species have be placed into one of four categories: C1, the lineages of been studied in various population structure models for sam- each species are separately monophyletic; C2, the lineages ples of size 2 (Tajima 1983; Takahata and Nei 1985; Takahata of species A are monophyletic, and the lineages of species B and Slatkin 1990; Wakeley 2000). Some more general cases are not monophyletic; C3, the lineages of species B are mono- are treated in simulations (Neigel and Avise 1986; Hudson phyletic, and the lineages of species A are not monophyletic; and Coyne 2002) and in a few theoretical results (Brown and C4, Neither the lineages of species A nor the lineages of 1994; Rosenberg 2002). Here I allow samples of arbitrary species B are monophyletic. size from unstructured species and obtain exact probabilities These scenarios (Fig. 1) can also be labeled monophyly of of each of the four genealogical shapes for the lineages of A and B or reciprocal monophyly, paraphyly of B with respect the two species. Because only the shapes of genealogies are to A, paraphyly of A with respect to B, and polyphyly of A studied, underlying relationships masked by stochastic dif- and B, respectively. The lineages of species A, for example, ferences in the numbers of mutations that accumulate along are monophyletic in C1 and C2, paraphyletic with respect to different lineages are not considered. This issue has been B in C3, and polyphyletic with respect to B in C4. In this treated in related circumstances (Hey 1991; Clark 1997; article these terms are used to describe both a set of lineages Wakeley and Hey 1997). and the unique genealogy for the set of lineages; for example, I ®rst review models used in the calculations. The main a species is described as monophyletic if the genealogy of results, namely the probabilities of the four genealogical all of its lineages is monophyletic. When referring to a locus, shapes, are presented in equations (14)±(17). The properties the terms implicitly describe the genealogy of all copies of of these probabilities are then explored in detail. the locus in a species. Two genealogies are identical if and only if they have both the same branching order and the same THE COALESCENT MODEL:ONE SPECIES times of branching. Typically, random pairs of species have reciprocally mono- I assume a neutral coalescent model to describe the ge- phyletic genealogies at nearly all of their orthologous loci. nealogy of the lineages of each of two species (Nordborg However, the lineages of two closely related species are often 2001). For each species, n lineages, each of which is identi®ed not reciprocally monophyletic (Takahata and Nei 1985; Nei- by a distinctive label, are sampled in the present, and the gel and Avise 1986; Palumbi et al. 2001; Hudson and Coyne model provides a probability distribution on the set of pos- 2002). For true biological species that have not exchanged sible genealogies for these lineages. Going backward in time, migrants, lack of reciprocal monophyly requires many lin- the coalescent model has two components: a model that spec- 1465 ᭧ 2003 The Society for the Study of . All rights reserved. 1466 NOAH A. ROSENBERG

for large n and j.AsT increases, however, the number of ancestral lineages is likely quite small, so that gnj(T) is neg- ligible for all but very small values of j. In this article, gnj(T) is only evaluated for T Ն 0.1, where evaluation is straight- forward (and for T ϭ 0, where it is trivial). The following assumption, which has no impact on the ®rst several decimal places of results shown, is also made: gnj(T) ϭ 0ifT Ն 0.1, n Ն 90, and j Ն 50, or if T Ն 1, n Ն 20, and j Ն 10.

THE YULE MODEL In the coalescent model, at any time in the past, all pairs of lineages have equal probabilities of being the next pair to coalesce. The choice of which lineages coalesce is indepen- dent of when the coalescence occurs. This rule speci®es a uniform probability distribution on the set of possible se- quences of coalescences, and has been termed the Markov model or Yule model (Yule 1924; Aldous 2001); the sequence FIG. 1. The four types of genealogies possible for the lineages of of coalescences, from n lineages to the MRCA of the lineages, species A and B. (i) Monophyly of A and B. (ii) Paraphyly of B is a ``random-joining sequence'' (Maddison and Slatkin with respect to A. (iii). Paraphyly of A with respect to B. (iv) Po- 1991). The Yule model has been used to describe the branch- lyphyly of A and B. ing tree not only for lineages within species, but also for species themselves (Harding 1971; Slowinski and Guyer i®es the distributions of waiting times until coalescence and 1989; Maddison and Slatkin 1991; Brown 1994; Aldous a model that speci®es coalescence probabilities for pairs of 2001; Steel and McKenzie 2001). lineages. The latter is the Yule model (see next section). The waiting time until coalescence of n to n Ϫ 1 lineages is Sequences of Coalescences exponentially distributed with mean n(n 1)/2 coalescences Ϫ The set of possible sequences of coalescences for n line- per unit of time. A useful result under the coalescent model ages is equivalent to the set of labeled histories for n taxa. is given by Tavare (1984, eq. 6.1): if g (T) is the probability nj Stated precisely, two genealogies have the same labeled his- that the n lineages derive from j lineages that existed T co- tory if and only if they have the same coalescences in the alescent time units in the past, then same temporal order. Under the Yule model, each of the nk(2k Ϫ 1)(Ϫ1) Ϫjjn possible sequences of coalescences has the same probability Ϫk(kϪ1)T/2 (kϪ1) [k] gnj(T) ϭ ͸ e , (1) of being the labeled history for a random genealogy. The kϭj j!(k Ϫ j)!n(k) number of possible labeled histories for n lineages is obtained where a ϭ a(a ϩ 1)´´´(a ϩ k Ϫ 1) and a ϭ a(a Ϫ 1)´´´(a (k) [k] n Ϫ k ϩ 1) for k Ն 1, with a(0) ϭ a[0] ϭ 1. Except when 1 Յ from the fact that there are choices for the two most 2 j Յ n, gnj(T) ϭ 0. Note that gnj(0) ϭ␦nj, where ␦nj is Kro- ΂΃ necker's delta. The limiting case of n ϭϱyields (Tavare n Ϫ 1 recent lineages to coalesce, choices for the next co- 1984, eq. 6.3): ΂΃2 ϱ (2k Ϫ 1)(Ϫ1)kϪjj alescence, and so forth. The total number of labeled histories, g (T) ϭ eϪk(kϪ1)T/2 (kϪ1) . (2) ϱj ͸ H , equals (Edwards 1970): kϭj j!(k Ϫ j)! n For a population of constant size N lineages, in which the nnϪ 132n!(n Ϫ 1)! H ϭ ´´´ ϭ . (3) variance of the number of offspring produced by an individual n ΂΃΂22 ΃ ΂΃΂΃ 22 2nϪ1 equals one, T coalescent units equals TN generations. For models that have different values of this parameter or that More generally, the number of sequences of coalescences that incorporate any of various types of more complex reproduc- reduce n lineages to k lineages is tive behavior, equations (1) and (2) still hold, but a different nnϪ 1 k ϩ 2 k ϩ 1 I ϭ ´´´ scaling of coalescent units into units of absolute time is used n,k ΂΃΂΃΂΃΂΃22 2 2 (Nordborg 2001). Thus, results that follow can be applied for diverse evolutionary models within species, only with chang- n!(n Ϫ 1)! ϭ . (4) es of scale to the time parameters. I allow two species to 2nϪkk!(k Ϫ 1)! experience different amounts of coalescent time during the As a special case, I ϭ H . same period of absolute time; in the constant-size model, this n,1 n assumption corresponds to different population sizes for the The Interweaving of Two Sequences of Coalescences two species. Although approximations can assist in the computation Suppose that a particular sequence S1 of s1 coalescences (Grif®ths 1984), equations (1) and (2) can be dif®cult to occurs for a set of s1 ϩ 1 lineages, and that a particular evaluate numerically for small positive values of T, especially sequence S2 of s2 coalescences occurs for a different set of GENE GENEALOGIES IN TWO SPECIES 1467 s2 ϩ 1 lineages. The number of sequences of s1 ϩ s2 coa- lescences that include both S1 and S2 as subsequences is of s ϩ s interest. There are12 ways to choose which positions ΂΃s1 in a sequence of length s1 ϩ s2 are occupied by the subse- quence S1. Once these positions are chosen, the subsequence S1 ®lls them in a unique way, with its ®rst coalescence in the ®rst position chosen, its second coalescence in the second position, and so forth. The subsequence S2 ®lls the remaining positions in a unique way. Thus, the number of ways that s1 and s2 coalescences can be ``interwoven,'' preserving the order of coalescence in each subset is:

s12ϩ s W21(s , s 2) ϭ . (5) ΂΃s1 FIG. 2. Notation for the divergence model. Lineages ancestral to a sample are drawn more darkly than non-ancestral lineages. In this example, NA ϭ 5, rA ϭ 2, mA ϭ 2, qA ϭ 1; NB ϭ 5, rB ϭ 3, mB ϭ Sequences of Coalescences for Subsamples 3, qB ϭ 2. Times TA and TB are measured in coalescent units and refer to the same period of absolute time. Consider n lineages that have a randomly chosen labeled history, and a random subsample consisting of j of these lineages. If Z1(T) and Z2(T) respectively denote the numbers Probabilities of Monophyly, Paraphyly, and Polyphyly of lineages ancestral to the sample and the subsample at time T coalescent units before the present, then (eq. 2.3 of Saun- Let X denote the classi®cation of the rA ϩ rB sampled ders et al. 1984) lineages (either C1, C2, C3, or C4) and let ZA(T) and ZB(T) denote the numbers of lineages ancestral to species A and B, S(l21, l , n, j) ϭ Pr[Z 2211(T) ϭ l ͦ Z (T) ϭ l ; n, j] respectively, at time T coalescent units in the past. Suppose the r and r lineages of species A and B have q and q ln112Ϫ lnϩ l A B A B ancestral lineages at the time of divergence. This event has ΂΃΂΃΂΃lj22Ϫ ln nl21(j ϩ l ) probabilitygg (T )(T ). ϭ . (6) r,qAAA r,q BB B (n ϩ l )lj To have X ϭ C1, the MRCA for the qA ϩ qB ancestral njϩ l1 21 lineages must separate the genealogy of the lineages into two ΂΃΂jj ΃ species-speci®c subgenealogies. Using equation (3),HqA and → The limit as n ϱ, is (Grif®ths and TavareÂ, 2003): HqB sequences of coalescences can monophyletically join the lineages of species A and B, respectively. Using equation (5), lj 1 each pair of sequencesÐone for species A and one for species ΂΃΂΃ll22 l21(j ϩ l ) BÐcan be interwoven in W2(qA Ϫ 1, qB Ϫ 1) ways. There S(l21, l , ϱ, j) ϭ . (7) areH possible sequences of coalescences. We obtain j ϩ l lj1 qABϩq 1 (Brown 1994): ΂΃j Pr[X ϭ C1 ͦ ZAA(T ) ϭ q A, Z BB(T ) ϭ q B] The probability that the MRCA of the subsample is identical to the MRCA of the sample is (eq. 3.1 of Saunders et al. HHWqqAB2(qABϪ 1, q Ϫ 1) 1984; see also Sanderson 1996) ϭ HqABϩq Pr[min{T : Z (T) ϭ 1} ϭ min{T : Z (T) ϭ 1}] 21 2 ϭ . (9) j Ϫ 1 n ϩ 1 ϭ . (8) qABϩ q j ϩ 1 n Ϫ 1 (qABϩ q Ϫ 1) ΂΃qA

This equation solves the recursion for the quantity C(qA, qB) THEORY:TWO SPECIES A,B of Hudson and Coyne (2002) and for the quantity 1 Ϫ F2 (qA, Here I consider probabilities of genealogical shapes in a qB, 0) of Rosenberg (2002). divergence model. At time t years in the past, a species evolv- To compute conditional probabilities that the qA ϩ qB lin- ing according to the coalescent model diverges into two spe- eages fall into the other categories, we ®rst need the con- cies, A and B, each of which also evolves by the coalescent ditional probability that the lineages of species A are mono- model (Fig. 2). For species A and B, TA and TB coalescent phyletic, that is, the probability that X ϭ C1orX ϭ C2. Let time units have elapsed since the divergence, respectively. Ek be the event that the qA lineages of species A are mono- Total population sizes in the present are NA and NB; the an- phyletic and that the most recent interspeci®c coalescence cestral population size need not be speci®ed. Samples from occurs when k lineages ancestral to the lineages of species species A and B have sizes rA and rB. Both species are assumed B are present (1 Յ k Յ qB). The number of sequences that to have the same generation time. coalesce qA lineages to one isHqA . The number of 1468 NOAH A. ROSENBERG

rr sequences that coalesce qB lineages to k lineages isIq,k . The AB B Pr(X ϭ C1) ϭ g (T )g (T ) number of ways to interweave a sequence of coalescences ͸͸ rAA,qAr BB,qB qABϭ1 q ϭ1 from species A and one from species B is W2(qA Ϫ 1, qB Ϫ k). There are k choices for the lineage from species B that 2 participates in the coalescence with the MRCA of species A, ϫ (14) qABϩ q and there are HkϪ1 ways to coalesce the last k Ϫ 1 lineages. (qABϩ q Ϫ 1) Thus, we have ΂΃qA

rrAB HIqq,k W2(qABϪ 1, q Ϫ k) kH kϪ1 AB Pr(X ϭ C2) ϭ gr ,qAr(T )g ,qB(T ) Pr(Ek) ϭ ͸͸ AA BB qABϭ1 q ϭ1 HqABϩq

2(2qABBϩ q )(q Ϫ 1) qB ϫ (15) 2k qAA(q ϩ 1)(q Aϩ q BϪ 1) ΂΃k qABϩ q ϭ . (10) ΂΃qA qABABϩ qqϩ q Ϫ 1 qB rrAB ΂΃΂΃qkA Pr(X ϭ C3) ϭ g (T )g (T ) ͸͸ rAA,qAr BB,qB q ϭ1 q ϭ1 A combinatorial identity (Appendix 1) facilitates computa- AB tion of the probability of E, the event that the qA lineages of 2(qABAϩ 2q )(q Ϫ 1) species A are monophyletic: ϫ (16) qBB(q ϩ 1)(q Aϩ q BϪ 1) qABϩ q q qA k B ΂΃ qqBBk 2 ΂΃ rrAB Pr(E) ϭ Pr(Ek) ϭ Pr(X C4) g (T )g (T ) ͸͸ ϭ ϭ rAA,qAr BB,qB kϭ1 kϭ1 ͸͸ qABϩ qq ABϩ q Ϫ 1 qABϭ1 q ϭ1 qB ΂΃΂qkA ΃   2 qABϩ qq ABϩ q 2 q ϩ q ϫ 1 Ϫϩ AB  [qAA(q ϩ 1) q BB(q ϩ 1) ϭ . (11)  qABϩ q q ϩ q qAA(q ϩ 1) AB  ΂΃qA ΂΃qA 1  This result simpli®es a calculation of Brown (1994, eq. 12) Ϫ . (17) qABϩ q Ϫ 1] and gives the closed-form solution for the quantity C*(qA,  qB) of Hudson and Coyne (2002). The probability of F, the  event that the qB lineages of species B are monophyletic, is computed analogously. We now obtain Appendix 2 discusses an alternate derivation of (14). Note that if TA ϭ TB ϭ 0, then qA ϭ rA, qB ϭ rB, and the uncon- Pr[X ϭ C2 ͦ ZAA(T ) ϭ q A, Z BB(T ) ϭ q B] ditional probabilities reduce to equations (9), (12), and (13). The probabilities of monophyly, paraphyly, and polyphyly ϭ Pr(E) Ϫ Pr[X ϭ C1 ͦ Z (T ) ϭ q , Z (T ) ϭ q ] AA A BB B given by (14)±(17) exhibit complex dependencies on the four 2(2q ϩ q )(q Ϫ 1) parameters. It is convenient to ®rst consider the effects of ϭ ABB . (12) q (q ϩ 1)(q ϩ q Ϫ 1) the times TA and TB, and then the effects of the sample sizes qABϩ q AA A B rA and rB. ΂΃qA Effects of Time A similar expression gives the conditional probability of X ϭ C3, and the conditional probability of X ϭ C4 is one minus Suppose that TA and TB increase so that TB/TA is always the sum of the other three probabilities. This probability can equal to a constant K. This constant allows coalescent time also be obtained as follows: to elapse at different rates in the two species. In the constant population size model, TB/TA ϭ K corresponds to NB/NA ϭ Pr[X ϭ C4 ͦ ZAA(T ) ϭ q A, Z BB(T ) ϭ q B] 1/K. ϭ 1 Ϫ Pr(E) Ϫ Pr(F) ϩ Pr(E പ F) As observed by Neigel and Avise (1986), at their time of divergence, a sample from a pair of species is likely to show 2 q ϩ qqϩ q 1 ϭ 1 ϪϩϪAB AB . polyphyly. As time progresses, paraphyly becomes more like- q (q ϩ 1) q (q ϩ 1) q ϩ q Ϫ 1 ly, and eventually, reciprocal monophyly is most likely. In qABϩ q []AA BB A B the limit as time approaches ϱ, reciprocal monophyly has q ΂΃A probability one. This intuitive result holds for all values of (13) the sample sizes, as long as they are more than one, and regardless of the value of K. (If one of the sample sizes equals Finally, summing over possible values of qA and qB, the un- conditional probabilities are one, polyphyly and one type of paraphyly are impossible, so that the transition proceeds from a paraphyly stage to a mono- GENE GENEALOGIES IN TWO SPECIES 1469

FIG. 3. Probabilities of the four types of genealogies as functions of time, obtained from equations (14)±(17). (i) rA ϭ rB ϭ 3, TB ϭ TA. (ii) rA ϭ rB ϭ 3, TB ϭ TA/2. (iii) rA ϭ 30, rB ϭ 3, TB ϭ TA. (iv) rA ϭ 30, rB ϭ 3, TB ϭ TA/2. (v) rA ϭ rB ϭ 30, TB ϭ TA. (vi) rA ϭ rB ϭ 30, TB ϭ TA/2. phyly stage; if both samples have size one, I use the con- species remains, and monophyly is achieved for the lineages vention that reciprocal monophyly is guaranteed from time of each species. In passing from the stage in which neither 0.) species is monophyletic to the stage in which monophyly Thus, in Figure 3, which shows (14)±(17) as functions of occurs in both species, an intermediate stage is encountered time, various parameter values produce the same qualitative in which it is likely that the lineages of only one of the species limiting behavior. This result is explained by considering the are monophyletic. numbers of lineages ancestral to the sample at the time of Sample sizes affect this phenomenon in that for larger sam- divergence. For small times TA and TB, many lineages an- ple sizes, more time must elapse before the transition between cestral to the sample are present at divergence. In the an- stages occurs. A comparison of Figures 3i and 3v shows that cestral species these likely coalesce in such a way as to pro- increases in sample size do not have a great impact on these duce polyphyly. As time since divergence becomes large, transition times. suf®cient time exists for all lineages to coalesce within spe- Asymmetry in sample sizes and rates of coalescence, al- cies. At the time of divergence, likely only one lineage per though it does not affect the general behavior, causes the two 1470 NOAH A. ROSENBERG types of paraphyly to have different probabilities. If time elapses more rapidly in one species, the lineages of that spe- cies reach monophyly more rapidly; the species that has ex- perienced fewer coalescent units in the same amount of ab- solute time is the one that is more likely to exhibit paraphyly. Similarly, if coalescence occurs at the same rate in both spe- cies, the species that has a smaller sample is likely to achieve monophyly faster, and the lineages of the other species are more likely to be paraphyletic. This difference between the two probabilities of paraphyly can be substantial (Figs. 3ii, iii, iv, vi); it depends more on relative speed of coalescence than on relative sample sizes (compare the difference between Figs. 3iii and 3iv to that between Figs. 3ii and vi).

Effects of Sample Size As in other genealogical phenomena, the effect on (14)± (17) of increasing sample size is less profound than the effect of increasing time. Increases in sample size shift the time at which paraphyly has highest probability (cf. Figs. 3i, v), but only slightly. The probabilities of the four types of geneal- ogies approach their large-sample limits rapidly (Fig. 4), re- gardless of which type of genealogy is ultimately the most likely. Rapid convergence in sample size results from the fact that as lineages are added to a sample, they add ancestral lineages at a much slower rate. If we consider ancestors to a sample at T coalescent units in the past, adding lineages to the sample is unlikely to increase the number of ancestors if T is large. We can determine sample sizes r such that the distribution of the number of ancestral lineages at time T given a sample of size r is close to the large-sample limiting distribution. As the distribution of the number of ancestral lineages con- verges with increases in sample size, so too do the proba- bilities of monophyly, paraphyly, and polyphyly. A previous argument (Rosenberg 2002, table 3) can be used to identify sample sizes at which probabilities of mono- phyly, paraphyly, and polyphyly approach their limiting val- ues. The sample size r is chosen large enough in each species so that the mean number of ancestral lineages given a sample of size r differs from the corresponding mean for an in®nite sample size by less than a speci®ed tolerance. As can be seen by comparing Figures 4i and 4iii, if times are larger, sample sizes required for nearing the limit are smaller.

Order of the Probabilities The parameter space can be partitioned based on the rel- ative order of the probabilities of the four types of geneal- ogies. Suppose rA ϭ rB ϭ r and TA ϭ TB ϭ T. The probabilities of the two types of paraphyly are equal, so they are grouped into one category. The parameter space can contain as many as six regions, corresponding to the six orderings of the prob- abilities of monophyly, paraphyly, and polyphyly. The fact that evolution proceeds from polyphyly to paraphyly to monophyly suggests that the orderings Pr(monophyly) Ͼ Pr(polyphyly) Pr(paraphyly) and Pr(polyphyly) Pr(mono- Ͼ Ͼ FIG. 4. Probabilities of the four types of genealogies as functions phyly) Ͼ Pr(paraphyly), should never be observed. Indeed of sample size, obtained from equations (14)±(17). For all graphs, they are not seen. both species had the same sample size, that is, rA ϭ rB. (i) TA ϭ TB ϭ 2. (ii) TA ϭ 2, TB ϭ 0.2. (iii) TA ϭ TB ϭ 0.2. GENE GENEALOGIES IN TWO SPECIES 1471

Transition times between different orderings of the prob- abilities are values of T for which two of Pr(X ϭ C1), Pr(X ϭ C2) ϩ Pr(X ϭ C3), and Pr(X ϭ C4), as given by (14)± (17), are equal (Fig. 5). The three equations that must be solved are polynomials in eϪT. For samples of size r these polynomials have degree r(r Ϫ 1). With r ϭ 2, exact solutions to the quadratic equations give the transition times: Pr(X ϭ C2) ϩ Pr(X ϭ C3) ϭ Pr(X ϭ C4) is solved by ln(4/3) ഠ 0.2877; ln[(2 ϩ ͙6)/3] ഠ 0.3942 solves Pr(X ϭ C1) ϭ Pr(X ϭ C4), and ln[(4 ϩ ͙2)/3] ഠ 0.5904 solves Pr(X ϭ C1) ϭ Pr(X ϭ C2) ϩ Pr(X ϭ C3). For larger r, transition times are larger, and the equations can be solved numerically. Even for small sample sizes, the transition times between orderings are close to their (approximate) limiting values of 1.29989, 1.44026, and 1.66536 (computed using ϱ for r). Figure 5ii shows transition times for the case in which the sample from species B has size 1. This case arises because it is useful to consider whether lineages of a species are monophyletic or paraphyletic with respect to a monophyletic sister species. Because polyphyly and one type of paraphyly are not possible if rB ϭ 1, only the transition that occurs when Pr(X ϭ C1) ϭ Pr(X ϭ C3) ϭ 1/2 must be considered. For rA ϭ 2, the transition time is ln(4/3) ഠ 0.2877. For larger samples the transition time can be obtained numerically, and the large-sample limit is about 1.32663.

Genealogy of All Lineages in the Two Species The genealogy for the sample of lineages provides infor- mation about the genealogy for all lineages in the two species. A polyphyletic sample indicates that the genealogy for the species is polyphyletic; a monophyletic sample suggests but does not guarantee monophyly for the species. Using X for the classi®cation of the sample genealogy as before, and Y for that of the genealogy of all lineages from the two species,

Pr(Y ͦ X) is of interest. For six of 16 combinations (X, Y), the FIG. 5. Partition of the parameter space based on the ordering of sample con®guration excludes the species con®guration. the probabilities of monophyly, paraphyly, and polyphyly computed from equations (14)±(17). Time was assumed to have elapsed at The rA and rB sampled lineages have qA and qB ancestral the same rate for both species. (i) rA ϭ rB. The sum of the proba- lineages at the time of divergence, the species population bilities of the two types of paraphyly was used for the probability sizes are NA and NB, and the numbers of lineages ancestral of paraphyly. (ii) rB ϭ 1. In this case only one type of paraphyly to species at the time of divergence are mA and mB (Fig. 2). is possible and polyphyly is not possible. Using equation (6), this con®guration of ancestral lineages, denoted G, has probability (T ) S(q ,m,N,r) gN,mAA A A A A A NNmmABAB gN,m(TB) S(qB,mB,NB,rB). Here I show the probability that Pr(Y ϭ C1 ͦ X ϭ C1) ϭ g (T ) BB ͸͸͸͸ NAA,mA the lineages of both species have monophyletic genealogies mAϭ1 m BABϭ1 q ϭ1 q ϭ1 given that their samples are monophyletic. Bayes's theorem ϫ S(qAAAAN, m , N , r ) g ,mB(T ) yields BB ϫ S(qBBBB, m , N , r ) Pr(Y ϭ C1 ͦ X ϭ C1, G) q ϩ q AB(q ϩ q Ϫ 1) Pr(Y ϭ C1 ͦ G) Pr(X ϭ C1 ͦ Y ϭ C1, G) q AB ϭ . (18) ΂΃A Pr(X ϭ C1 ͦ G) ϫ . (19) m ϩ m AB(m ϩ m Ϫ 1) m AB Because the sampled lineages form a subset of the whole set ΂΃A of lineages, Pr(X ϭ C1 ͦ Y ϭ C1, G) ϭ 1. Pr(X ϭ C1 ͦ G) If coalescent time elapses at the same rate in both species, and Pr(Y ϭ C1 ͦ G) are computed using equation (9). Summing for large time since divergence, monophyly of the sample over lineage con®gurations, the desired probability is indicates that monophyly of all lineages from the two species 1472 NOAH A. ROSENBERG

and species MRCAs are identical (eq. 8) times the probability of the lineage con®guration, we obtain

Pr[min{T : Zsample(T) ϭ 1} ϭ min{T : Z species(T) ϭ 1}]

NNmmABAB ϭ g (T ) S(q , m , N , r ) ͸͸͸͸ NAA,mA AAAA mAϭ1 m BABϭ1 q ϭ1 q ϭ1

ϫ gNBB,mB(T )S(q BBBB, m , N , r ) (q ϩ q Ϫ 1) (m ϩ m ϩ 1) ϫ AB A B . (20) (qABϩ q ϩ 1) (m Aϩ m BϪ 1) For equal sample sizes and equal rates of coalescence, the probability that the sample MRCA is the species MRCA ap- proaches one as time increases (Fig. 7i). At time 0, with rA ϭ rB ϭ r and in®nite population size, the result reduces to equation (8), or (2r Ϫ 1)/(2r ϩ 1). For large values of time, the sample and the species each likely have one lineage per species at the time of divergence, so the genealogy of the sample necessarily includes this lineage. The probability in- creases monotonically, so that for larger samples, it is more likely that the sample MRCA is the same as the species MRCA. The case in which one of the species has only one sampled lineage is somewhat different (Fig. 7ii). Again, at time 0, equation (8) provides the desired probability. Also, for suf- ®ciently large times, the sample and species each have one ancestral lineage, which necessarily coincides. However, a range of intermediate times exists during which the proba- bility is smaller than the initial value of rA/(rA ϩ 2). In this range, the ancestry of the sample is usually reduced to a small number of lineages so that the (qA ϩ qB Ϫ 1)/(qA ϩ qB ϩ 1) term, or qA/(qA ϩ 2), is relatively small. The number of lin- eages ancestral to the species remains large enough that the (mA ϩ mB ϩ 1)/(mA ϩ mB Ϫ 1) term, or (mA ϩ 2)/mA, is fairly close to one. Thus, in this range, the product of the two terms is usually smaller than the initial value.

Whole-Genome Monophyly FIG. 6. Probability of reciprocal monophyly of two species given reciprocal monophyly of their samples, computed using equation For two closely related species, equations (14)±(17) can (19). Results are similar as long as population sizes are large com- pared to sample sizes; thus, in®nite population sizes were used for predict the fractions of their genomes that are monophyletic, both species. (i) rB ϭ rA. (ii) rB ϭ 1. paraphyletic, and polyphyletic. For example, as the proba- bility of monophyly for one region, (14) gives the expected fraction of a genome that is monophyletic. is likely (Fig. 6). For shorter times, monophyly of a larger Only about 1.665 coalescent units (1.665N generations in sample gives considerably more evidence that two species the constant population size model) are needed before recip- are reciprocally monophyletic than does monophyly of a rocal monophyly is the most likely type of genealogy for two smaller sample (Fig. 6i). If only one lineage is sampled from species. The time until most of their genomes are monophy- species B, however, the probability of monophyly for species letic is considerably longer (Table 1). The time until a certain A given monophyly of the sample increases very slowly with fraction of the genome of species A is expected to be mono- sample size (Fig. 6ii; see also Nordborg 1998). phyletic depends only weakly on its sister species. Assuming that the sister species to A is monophyletic only reduces the Most Recent Common Ancestor of All Lineages time to monophyly of A by 0.6±0.8 coalescent units. in the Two Species Using simulations with ®nite sample sizes, Hudson and It is of interest to know whether the MRCA of a sample Coyne (2002) obtained results for the time until reciprocal is identical to the MRCA of all lineages from the two species. monophyly that are similar to the values in Table 1 (second Let Zsample(T) and Zspecies(T) denote the numbers of lineages row of their table 1). Corresponding values of the time until ancestral to the sample and to all lineages of the two species monophyly of species A in Table 1 and in Hudson and Coyne at time T coalescent units in the past. By summing over all (2002, table 1) differ slightly, however: I assume species A lineage con®gurations the conditional probability that sample has a sister species that contains one lineage that eventually GENE GENEALOGIES IN TWO SPECIES 1473

Ϫ1 TABLE 1. Waiting times until monophyly.M1 (1Ϫ␣) is the waiting time after divergence until 100 (1 Ϫ␣)% of the genome of species A is expected to be monophyletic, assuming species B is mono- Ϫ1 phyletic.M2 (1 Ϫ␣) is the corresponding time, assuming species A and B have the same parameters. Times, measured in coalescent units, are obtained by solving equation (14) for TA, assuming Pr (X Ϫ1 ϭ C1) ϭ 1 Ϫ␣. ForM1 , rB ϭ 1 and the value of TB does not Ϫ1 matter. ForM2 , rB ϭ rA and TB ϭ TA. For both columns, rA ϭϱ.

Ϫ1 Ϫ1 ␣ M 1 (1 Ϫ␣) M 2 (1 Ϫ␣) 0.50 1.327 1.903 0.10 2.994 3.662 0.05 3.794 4.369 0.01 5.298 5.989 0.001 7.601 8.294 10Ϫ4 9.903 10.597 10Ϫ5 12.206 12.899 10Ϫ6 14.509 15.202 10Ϫ7 16.811 17.504

the probability of reciprocal monophyly of a region, using rA ϭ rB ϭϱ, TA ϭ TB ϭ T. Two sites in a genome that are separated by suf®cient distance l have independent geneal- ogies. Thus, a genome of total length C base pairs can be loosely approximated by u disjoint subunits, each of length l (C ϭ ul), such that each pair of subunits is independent and such that no recombination occurs within any subunit. The probability of monophyly for any subunit is M1(T). To de-

termine T␣ so that the whole genome is monophyletic with

probability 1 Ϫ␣after time T␣, we must solve u 1 Ϫ␣ϭ[M1(T␣)] . (21)

Ϫ1 IfM1 is the inverse function of M1 (Table 1), the solution is

Ϫ11/u Ϫ1 Ϫ␣/u T␣ ϭ M11[(1 Ϫ␣)]ഠ M (e ). (22) As an example, consider two genomes of length 3000 me- gabases (Mb) in which sites separated by 0.2 Mb have in- dependent genealogies. These genomes have u ϭ 15, 000, FIG. 7. Probability that the most recent common ancestor (MRCA) and the time until both genomes are fully reciprocally mono- Ϫ1 for a sample from two species is the same as the MRCA for all phyletic with probability 0.95 isM2 (0.9999966), or 13.972 lineages in the two species, computed using equation (20). Results coalescent units. The bottom row of table 1 of Hudson and are similar as long as population sizes are large compared to sample Coyne (2002) contains a similar computation. sizes; thus, in®nite population sizes were used for both species. (i) rA ϭ rB. (ii) rB ϭ 1. DISCUSSION Equations (14)±(17) give the closed-form probabilities of coalesces with the lineages of species A, whereas Hudson the four types of genealogies under the coalescent two-spe- and Coyne (2002) study the time until monophyly of species cies divergence model, superseding recursively de®ned so- A without allowing interspeci®c coalescences (see also Ar- lutions of Hudson and Coyne (2002) and Rosenberg (2002). bogast et al. 2002). Note that Hudson and Coyne (2002) Note, however, that (14) disagrees with the probability of considered 2N lineages per species but scaled time in units monophyly suggested by Palumbi et al. (2001, eq. 1) for NB of N generations; thus, times in their table must be divided ϭϱand qB ϭ 1. Unlike (14) and the approaches of Hudson by two to be directly comparable with those here. and Coyne (2002) and Rosenberg (2002), Palumbi et al. Given a number ␣, we can also determine the time until a (2001) assumed qA ϭ 1, or that monophyly of species A whole genome is monophyletic with probability 1 Ϫ␣. This requires all lineages of species A to coalesce more recently calculation assumes neutrality of the whole genome; thus, than divergence. Because monophyly of A is also possible nonmonophyletic regions that persist after this length of time for qA Ͼ 1 if the ancestral lineages of species A coalesce are candidates for targets of (see Discus- intraspeci®cally, the formula of Palumbi et al. (2001) can be sion). Let M1(T) be calculated from (14), using rA ϭϱ, rB regarded as an underestimate of the probability of monophyly ϭ 1, TA ϭ T, and TB arbitrary. This gives the probability of of A. The discrepancy between (14) and equation (1) of Pal- monophyly of a region in species A, assuming that species umbi et al. (2001) is largest at small divergence times, for B is monophyletic. The calculations are analogous for M2(T), which qA Ͼ 1 is a likely possibility. 1474 NOAH A. ROSENBERG

Figures 3 and 4 con®rm the stronger dependence of (14)± in which the MRCA of an asymmetric sample has consid- (17) on divergence times compared to sample sizes. For two erable probability of differing from the species MRCA (Fig. recently diverged species, however, Figures 6 and 7 show that 7ii). sample size noticeably affects the relationship between the ge- These problems of asymmetric samples parallel results on nealogy of the sample and that of the species. For very small the concordance of gene trees and species trees (Rosenberg samples, one should be cautious not to equate reciprocal mono- 2002). If three species are studied by sampling many lineages phyly of a sample with reciprocal monophyly of the two species. for the two species that are predicted to be sister species, but The concepts of monophyly, paraphyly, and polyphyly of only one lineage from the predicted species, the lineages have frequently been employed in molecular analysis will tend to place the predicted outgroup as the out- and (Neigel and Avise 1986; Palumbi et al. group, even if this is not correct. By contrast, symmetric 2001). Probabilities that sampled lineages exhibit monophy- sampling less frequently leads to this erroneous inference. ly, paraphyly, or polyphyly can be used in studies both of the history of a species as a whole (Wakeley and Hey 1997; Mitochondrial and Nuclear DNA Hudson and Coyne 2002) and of roles played in evolution Recent studies investigate relationships between shapes of by individual genes (Wang et al. 1999; Ting et al. 2000). In genealogies of mitochondrial DNA and those of nuclear DNA various contexts, the formulas here can assist in designing (Moore 1995; Palumbi et al. 2001). Under neutrality, with studies, making predictions, and interpreting data. equal offspring distributions for diploid males and females, coalescence time elapses four times as fast for mitochondrial Paraphyly and Divergence Times DNA as for nuclear DNA. As can be observed from Figures Because paraphyletic genealogies are most frequent for 3, 5, and 6, the behavior of gene genealogies can be quite only a short period of time (Fig. 5i), observed dominance of different at time 4T (mitochondrial DNA) compared to time paraphyly in multilocus datasets suggests that species lie in T (nuclear DNA). this intermediate period since divergence. If both types of For example, if three coalescent units have elapsed for paraphyly have similar frequencies and if paraphyletic ge- mitochondria, in Figure 3iii, mitochondrial DNA has mono- nealogies occur more often than monophyletic and polyphy- phyly probability over 0.8. Nuclear loci in the same , letic genealogies, the species are likely in the narrow band however, having only experienced 0.75 coalescent units, have of time from ϳ1.3 to 1.7 coalescent units since divergence monophyly probability under 0.1. Using Figure 5i and large (ϳ1.3N to 1.7N generations, in the constant population size samples, mitochondrial and nuclear loci reside in the lower model). partition of the parameter space until T ഠ 0.325, when par- Moreover, an observation that the frequencies of both types aphyly becomes more likely than polyphyly for mitochon- of paraphyly are about the same indicates that coalescence drial DNA, but not for nuclear loci. The two types remain occurs at similar rates in the two species, whereas differences in separate partitions until T ഠ 1.665, when both types are between the two frequencies provide evidence for a difference likely to be monophyletic. in coalescence rates. The species that experiences coales- cence more slowly is more likely to be paraphyletic (Figs. Unusual Genealogical Shapes and Selection 3ii, 3vi, 4ii). If constant population size is assumed, the spe- Deviations from predictions of the neutral model here can cies that has more paraphyly also has a larger population help identify loci undergoing various forms of selection. Se- size. lected loci may be more or less likely than neutral loci to have reciprocally monophyletic genealogies. Balancing se- Symmetric and Asymmetric Samples lection or selection for diversity increases the probability that Reasonably small samples are suf®cient to characterize many ancestral lineages per species persist into the distant shapes of most two-species genealogies. Sample sizes that past (Ioerger et al. 1990; Takahata and Nei 1990), decreasing can be used to achieve desired precision in the number of the probability of monophyly. Loci that were under selection ancestral lineages, and corresponding precision in genealog- during species divergence or that have been involved in main- ical shape probabilities, can be obtained from Rosenberg tenance of differences between species might be more likely (2002, table 3). to be monophyletic in at least one of the species, because A large sample from one species together with only a single the MRCA of extant lineages in that species might be a novel lineage from the other provides much less information than mutant that lived around the time of divergence (Hey 1994; moderate but similar sample sizes from both species. In Fig- Wang et al. 1999; Ting et al. 2000). ure 6, if one species has sample size 1, increasing the sample Tests of genealogical shape have been devised in various size from the other species helps only minimally toward cor- contexts. Species trees are compared with the Yule model or rect inference that the lineages of that species are monophy- other phylogenetic models, and agreement of inferred shapes letic. By contrast, if both species have similar sample sizes, of species trees with model predictions is taken as evidence few lineages are needed to almost guarantee that sample for mechanisms of evolution that underlie the models (Slow- monophyly implies species monophyly. Inferences about inski and Guyer 1989; Mooers and Heard 1997; Aldous 2001; MRCAs behave similarly: caution is warranted in inference Harcourt-Brown et al. 2001). If the tree for a set of species about the time to the MRCA of two closely related groups, has an improbable shape, special properties of those species when one group has a small sample size and the other does are used to explain the anomaly. Within a species, statistics not. This comment especially applies to recent divergences, based on shapes of gene genealogies predicted by the neutral GENE GENEALOGIES IN TWO SPECIES 1475 coalescent are used to test if evolution of particular genes Brown, J. K. M. 1994. Probabilities of evolutionary trees. Syst. has occurred consistently with neutrality (Kreitman 2000). Biol. 43:78±91. Clark, A. 1997. Neutral behavior of shared polymorphism. Proc. For genes that do not ®t the model, the of the deviation Natl. Acad. Sci. USA 94:7730±7734. assists in characterizing the alternative mode of evolution Edwards, A. W. F. 1970. Estimation of the branch points of a acting on the gene. branching diffusion process. J. R. Stat. Soc. Ser. B 32:155±174. Corresponding tests might be derived for multispecies gene Gould, H. W. 1972. Combinatorial identities. Rev. ed. Gould Pub- genealogies (Hey 1994; Palumbi et al. 2001). If divergence lications, Morgantown, WV. Grif®ths, R. C. 1984. Asymptotic line-of-descent distributions. J. was ancient enough that the monophyly probability is near Math. Biol. 21:67±75. one, genomic regions that are observed to not be monophy- Grif®ths, R. C., and S. TavareÂ. 2003. The genealogy of a neutral letic might be under balancing selection. If divergence was mutation. In P. Green, N. Hjort, and S. Richardson, eds. Highly recent enough that the probability of monophyly is near zero, structured stochastic systems. Oxford Univ. Press, Oxford, U.K. In press. then regions observed to be monophyletic might have been Harcourt-Brown, K. G., P. N. Pearson, and M. Wilkinson. 2001. important in species divergences. As polymorphism data ac- The imbalance of paleontological trees. Paleobiology 27: cumulate, a genomic approach may help to determine how 188±204. many genes experience these types of selection. Harding, E. F. 1971. The probabilities of rooted tree-shapes gen- erated by random bifurcation. Adv. Appl. Prob. 3:44±77. Hey, J. 1991. The structure of genealogies and the distribution of Conclusions ®xed differences between DNA sequence samples from natural populations. 128:831±840. In this article the probabilities of monophyly, paraphyly, ÐÐÐ. 1994. Bridging and with and polyphyly have been computed under a coalescent model. gene tree models. Pp. 435±449 in B. Schierwater, B. Streit, G. The approach also enabled calculation of the probability that P. Wagner, and R. DeSalle, eds. Molecular ecology and evo- the lineages of two species are reciprocally monophyletic lution: approaches and applications. BirkhaÈuser Verlag, Basel. given that a sample has this property. Times at which the Hudson, R. R., and J. A. Coyne. 2002. Mathematical consequences of the genealogical species concept. Evolution 56:1557±1565. dominant genealogical shape transits from polyphyly to par- Ioerger, T. R., A. G. Clark, and T.-H. Kao. 1990. Polymorphism aphyly or from paraphyly to monophyly have been obtained. at the self-incompatibility locus in Solanaceae predates speci- Note that genealogies have been discussed without concern ation. Proc. Natl. Acad. Sci. USA 87:9732±9735. for the fact that in practice they are estimated from DNA Kreitman, M. 2000. Methods to detect selection in populations with polymorphism data. The importance to the analysis of the applications to the . Annu. Rev. Genomics Hum. Genet. 1:539±559. shapes of genealogies makes it desirable to determine these Maddison, W. P., and M. Slatkin. 1991. Null models for the number shapes probabilistically. Recombination also has not been of evolutionary steps in a character on a . Evo- considered; the numerical results apply strictly only to re- lution 45:1184±1197. gions in which no recombination has occurred. Genomes of Mooers, A. O., and S. B. Heard. 1997. Evolutionary process from with low recombination rates may contain many phylogenetic tree shape. Q. Rev. Biol. 72:31±54. Moore, W. S. 1995. Inferring phylogenies from mtDNA variation: such regions. mitochondrial-gene trees versus nuclear-gene trees. Evolution Similar formulas to those here might be obtained for more 49:718±726. than two species. Many more types of genealogical shapes Neigel, J. E., and J. C. Avise. 1986. Phylogenetic relationships of are then possible, including shapes in which the gene ge- mitochondrial DNA under various demographic models of spe- nealogy and species tree disagree. The larger number of pos- ciation. Pp. 515±534 in S. Karlin and E. Nevo, eds. Evolutionary processes and Theory. Academic Press, New York. sible shapes of gene genealogies can make analysis of gene Nordborg, M. 1998. On the probability of Neanderthal ancestry. tree shape cumbersome. It might be best to divide trees of Am. J. Hum. Genet. 63:1237±1240. many species into overlapping subsets of a few species each ÐÐÐ. 2001. . Pp. 179±212 in D. J. Balding, C. and then analyze genealogical shape in each subset. The re- Cannings, and M. Bishop, eds. Handbook of statistical genetics. sults here are also based on simple models of within-species Wiley, Chichester, U.K. Palumbi, S. R., F. Cipriano., and M. P. Hare. 2001. Predicting evolution; advances might treat more complex models of the nuclear gene coalescence from mitochondrial data: the three- divergence process (Teshima and Tajima 2002) and of pop- times rule. Evolution 55:859±868. ulation structure after divergence. Rosenberg, N. A. 2002. The probability of topological concordance of gene trees and species trees. Theor. Popul. Biol. 61:225±247. Sanderson, M. J. 1996. How many taxa must be sampled to identify ACKNOWLEDGMENTS the root node of a large ? Syst. Biol. 45:168±173. I thank H. Innan and S. Tavare for discussions and M. Saunders, I. W., S. TavareÂ, and G. A. Watterson. 1984. On the Nordborg for comments on the manuscript. This research was genealogy of nested subsamples from a haploid population. Adv. Appl. Prob. 16:471±491. supported by an National Science Foundation Postdoctoral Slowinski, J. B., and C. Guyer. 1989. Testing the stochasticity of Fellowship in Biological Informatics. patterns of organismal diversity: an improved null model. Am. Nat. 134:907±921. Steel, M., and A. McKenzie. 2001. Properties of phylogenetic trees LITERATURE CITED generated by Yule-type speciation models. Math. Biosci. 170: Aldous, D. J. 2001. Stochastic models and descriptive statistics for 91±112. phylogenetic trees, from Yule to today. Stat. Sci. 16:23±34. Tajima, F. 1983. Evolutionary relationship of DNA sequences in Arbogast, B. S., S. V. Edwards, J. Wakeley, P. Beerli, and J. B. ®nite populations. Genetics 105:437±460. Slowinski. 2002. Estimating divergence times from molecular Takahata, N. 1989. Gene genealogy in three related populations: data on phylogenetic and population genetic timescales. Annu. consistency probability between gene and population trees. Ge- Rev. Ecol. Syst. 33:707±740. netics 122:957±966. 1476 NOAH A. ROSENBERG

Takahata, N., and M. Nei. 1985. Gene genealogy and variance of n n!(m ϩ n Ϫ 1 Ϫ k)! ϭ k interpopulational nucleotide differences. Genetics 110:325±344. ͸ (n Ϫ k)!(m ϩ n Ϫ 1)! ÐÐÐ. 1990. Allelic genealogy under overdominant and frequency kϭ0 dependent selection and polymorphism of major histocompati- n (m Ϫ 1) !n!(m ϩ n Ϫ 1 Ϫ k)! ϭ k bility complex loci. Genetics 124:967±978. ͸ (m ϩ n Ϫ 1)! (m Ϫ 1)! (n Ϫ k)! Takahata, N., and M. Slatkin. 1990. Genealogy of neutral genes in kϭ0 two partially isolated populations. Theor. Popul. Biol. 38: 1 n (m Ϫ 1) ϩ (n Ϫ k) 331±350. ϭ k ͸ n Ϫ k TavareÂ, S. 1984. Line-of-descent and genealogical processes, and m ϩ n Ϫ 1 kϭ0 ΂΃ their applications in population genetics models. Theor. Popul. ΂΃n Biol. 26:119±164. Teshima, K. M., and F. Tajima. 2002. The effect of migration during 1 nnm Ϫ 1 ϩ lmϪ 1 ϩ l the divergence. Theor. Popul. Biol. 62:81±95. ϭ n Ϫ l . (A4) ͸͸ll Ting, C.-T., S.-C. Tsaur, and C.-I. Wu. 2000. The phylogeny of m ϩ n Ϫ 1 []lϭ0 ΂΃΂΃lϭ0 closely related species as revealed by the genealogy of a spe- n ciation gene, Odysseus. Proc. Natl. Acad. Sci. USA 97: ΂΃ 5313±5316. The proof is completed by applying identities 2 and 3. Identity 1 Wakeley, J. 2000. The effects of subdivision on the genetic diver- generalizes identity 4.10 of Gould (1972). gence of populations and species. Evolution 54:1092±1101. Wakeley, J., and J. Hey. 1997. Estimating ancestral population pa- rameters. Genetics 145:847±855. Wang, R.-L., A. Stec., J. Hey., L. Lukens., and J. Doebley. 1999. APPENDIX 2 The limits of selection during maize domestication. Nature 398: A previous derivation of the probability of reciprocal monophyly 236±239. (eq. 14) was given in terms of a recursive function (Rosenberg Yule, G. U. 1924. A mathematical theory of evolution based on the 2002, eq. 10). Here I give a closed-form expression for that function, conclusions of Dr. J. C. Willis, F. R. S. Philos. Trans. R. Soc. which enables closed-form solution of other results in Rosenberg Lond. B 213:21±87. (2002) and Takahata (1989). A,B Fk (a, b, c) (Rosenberg 2002, eq. 21) is the probability that in Corresponding Editor: R. Harrison coalescing from a con®guration that includes a lineages of species A, b lineages of species B, and c lineages of species C down to k total lineages, an interspeci®c coalescence occurs and the most re- APPENDIX 1 cent interspeci®c coalescence links species A and B. The closed- A,B To simplify equation (11), identity 1 is needed. To prove identity form solution forFk is obtained using a counting approach. Sup- 1, we need identities 2 and 3. Identity 2 is proven by induction on pose that a, b, and c are all positive, and suppose that there are x, n y, and z ancestors of the a, b, and c lineages, respectively, when n, using ϭ n!/[k!(n Ϫ k)!]. Identity 3 follows from identity 2 one of the x lineages and one of the y lineages engage in the most ΂΃k recent interspeci®c coalescence. The remaining x ϩ y ϩ z Ϫ 1 by induction on n: lineages can coalesce to k lineages in any order. Identity 1: For positive integers m and n, Using equation (4), there are Ia,x sequences that can coalesce a lineages to x lineages. Similarly there are I ways to coalesce b n b,y k lineages to y lineages and Ic,z ways to coalesce c lineages to z lineages. The total number of sequences that coalesce a to x, b to n ΂΃k n(m ϩ n) ͸ ϭ . (A1) y, and c to z is Ia,xIb,yIc,zW3(a Ϫ x, b Ϫ y, c Ϫ z), where W3 is the kϭ0 m ϩ n Ϫ 1 m(m ϩ 1) number of ways of interweaving three sequences of coalescences. Analogously to equation (5), W is the trinomial coef®cient ΂΃k 3 Identity 2: For positive integers m and n, a Ϫ x ϩ b Ϫ y ϩ c Ϫ z W3(a Ϫ x, b Ϫ y, c Ϫ z) ϭ n m Ϫ 1 ϩ kmϩ n ΂΃a Ϫ x, b Ϫ y, c Ϫ z ͸ ϭ . (A2) kϭ0 ΂΃΂΃kn (a Ϫ x ϩ b Ϫ y ϩ c Ϫ z)! ϭ . (A5) Identity 3: For positive integers m and n, (a Ϫ x)! (b Ϫ y)! (c Ϫ z)! n m Ϫ 1 ϩ kmn(m ϩ n) ϩ n Ϫ 1 There are xy ways to choose the two lineages that are the most k ϭ . (A3) ͸ recent to coalesce interspeci®cally. Finally there are I ways kϭ0 ΂΃knm ϩ 1 ΂΃ xϩyϩzϪ1,k to coalesce the x ϩ y ϩ z Ϫ 1 remaining lineages to k lineages. Proof of Identity 1: We substitute l ϭ n Ϫ k: Thus, there are Ia,xIb,yIc,zW3(a Ϫ x, b Ϫ y, c Ϫ z)xyIxϩyϩzϪ1,k sequences n of coalescences that have the desired properties. k The most recent interspeci®c coalescence occurs when there are n ΂΃k at least k ϩ 1 lineages, so x ϩ y ϩ z Ն k ϩ 1. Also, 1 Յ x Յ a, ͸ 1 Յ y Յ b, and 1 Յ z Յ c. Using the allowable values of (x, y, z), kϭ0 m ϩ n Ϫ 1 the number of sequences of coalescences that satisfy the require- ΂΃k ments is

abc III W(a Ϫ x, b Ϫ y, c Ϫ z)xyI . (A6) ͸͸͸a,x b,y c,z 3 xϩyϩzϪ1,k xϭmax(1,kϩ1ϪbϪc) yϭmax(1,kϩ1ϪxϪc) zϭmax(1,kϩ1ϪxϪy) GENE GENEALOGIES IN TWO SPECIES 1477

The number of possible sequences of coalescences is Iaϩbϩc,k. The result is simpli®ed using equations (4) and (A5):

abc IIIa,x b,y c,z FA,B(a, b, c) ϭ W (a Ϫ x, b Ϫ y, c Ϫ z)xyI k ͸͸͸ 3 xϩyϩzϪI,k xϭmax(1,kϩ1ϪbϪc) yϭmax(1,kϩ1ϪxϪc) zϭmax(1,kϩ1ϪxϪy) Iaϩbϩc,k abc xϩ y ϩ z abc΂΃΂΃΂΃x y z ΂ x, y, z ΃a ϩ b ϩ c 2(xy)2z ϭ . (A7) ͸͸͸ 2 xϭmax(1,kϩ1ϪbϪc,1) yϭmax(1,kϩ1ϪxϪc) zϭmax(1,kϩ1ϪxϪy) a ϩ b ϩ caϩ b ϩ c abc (x ϩ y ϩ z)(x ϩ y ϩ z Ϫ 1) ΂΃΂΃x ϩ y ϩ z a,b,c If c ϭ 0, the calculation is analogous and simpler. ab IIa,x b,y FA,B(a, b, 0) ϭ W (a Ϫ x, b Ϫ y)xyI k ͸͸ 2 xϩyϪ1,k xϭmax(1,kϩ1Ϫb) yϭmax(1,kϩ1Ϫx) Iaϩb,k ab xϩ y ab΂΃΂΃΂xy x ΃a ϩ b 2(xy)2 ϭ . (A8) ͸͸ 2 xϭmax(1,kϩ1Ϫb) yϭmax(1,kϩ1Ϫx) a ϩ baϩ b ab (x ϩ y)(x ϩ y Ϫ 1) ΂΃΂΃x ϩ ya

It is straightforward to show that (A7) gives the same numerical values as equation 21 of Rosenberg (2002).