
Phylogenetic Inference under the Pure Drift Model

Shizhong Xu,* William R. Atchley,* and Walter M. Fitch†
*Center for Quantitative Genetics, Department of Genetics, North Carolina State University; and †Department of Ecology and Evolutionary Biology, University of California, Irvine

When pairwise genetic distances are used for phylogenetic reconstruction, it is usually assumed that the distance between two taxa contains information about the time after the two taxa diverged. As a result, upon an appropriate transformation if necessary, the distance usually can be fitted to a linear model such that it is expressed as the sum of lengths of all branches that connect the two taxa in a given phylogeny. This kind of distance is referred to as "additive distance." For a population exclusively driven by random genetic drift, genetic distances related to coancestry coefficients ($\theta_{XY}$) between any two taxa are more suitable. However, these distances are fundamentally different from the additive distance in that coancestry does not contain any information about the time after two taxa split from a common ancestral population; instead, it reflects the time before the two taxa diverged. In other words, the magnitude of $\theta_{XY}$ provides information about how long the two taxa share the same evolutionary pathways. The fundamental difference between the two kinds of distances has led to a different algorithm of evaluating phylogenetic trees when $\theta_{XY}$ and related distance measures are used. Here we present the new algorithm using the ordinary-least-squares approach but fitting to a different linear model. This treatment allows variation within a population to be included in the model. Monte Carlo simulation for a rooted phylogeny of four taxa has verified the efficacy and consistency of the new method. Application of the method to human populations was demonstrated.

Key words: phylogeny, genetic drift, coancestry coefficient, genetic distance, reduction of heterozygosity.

Address for correspondence and reprints: Shizhong Xu, Center for Quantitative Genetics, Department of Genetics, North Carolina State University, Raleigh, North Carolina 27695-7614.

Mol. Biol. Evol. 11(6):949-960. 1994. © 1994 by The University of Chicago. All rights reserved. 0737-4038/94/1106-0013$02.00

Introduction

Random genetic drift is an important evolutionary force. It has been argued that, in natural populations, population size is sufficiently large that drift could be ignored compared with other evolutionary forces such as selection and mutation (Fisher 1958, pp. 22-51). In inbred strains of mice, rats, guinea pigs, and some plants, for example, the population size is so small and the evolutionary history so short that variation in allelic frequencies among inbred strains must have been predominantly driven by random drift or allelic fixation (Atchley and Fitch 1991, 1993). Thus, patterns of genetic divergence observed among inbred strains result from random segregation of the original heterozygosity of the founding stocks.

For captive populations, genetic drift is the overriding factor controlling the loss of heterozygosity. Mutation has no noticeable effect on populations of size typically managed in zoos and nature preserves. Unless selection is stronger than commonly observed in natural populations, it is inefficient in countering drift when population sizes are on the order of 100 or fewer (Lacy 1987). Random genetic drift is also considered to be important in determining the variation in gene frequencies in man (Cavalli-Sforza et al. 1964; Edwards and Cavalli-Sforza 1964; Cavalli-Sforza 1966; Cavalli-Sforza and Edwards 1967).

A commonly used measurement of divergence in gene frequencies caused by random genetic drift is Wright's $F_{ST}$ (Wright 1943, 1951, 1965). F-statistics were originally derived from a population-genetics perspective and it was assumed that an infinite number of populations diverged at the same time from a common ancestral population. From a phylogenetic perspective, the coancestry coefficient ($\theta_{XY}$), which is another $F_{ST}$-related measurement of population divergence, seems more appropriate. Within a population, it is defined as the probability that a random gene from one individual is identical by descent to a random gene from another individual (Kempthorne 1969, pp. 72-80; Falconer 1980, pp. 80-83). Between two populations $\theta_{XY}$ is defined as the probability that a random pair of genes, one from each population, are identical by descent. Appropriate transformations of the coancestry coefficients can be treated as genetic distances for use in phylogenetic inference. However, this measure of genetic distance may not be additive, which is assumed by the Fitch-Margoliash method (Fitch and Margoliash 1967) and other phylogeny-inferring algorithms. Further, there are no phylogeny-inferring algorithms available that incorporate inbreeding coefficients like $\theta_{XY}$. To circumvent this problem, a phylogeny-inferring algorithm using character data such as the parsimony method may be used. Recently, Atchley and Fitch (1993) introduced the concept of loss parsimony to describe the segregation and random fixation of alleles under systematic brother-sister mating. These authors used an inverted Camin-Sokal algorithm to find trees that minimize allele loss. However, irreversibility of allele loss is only a qualitative prediction of random genetic drift and the loss parsimony model fails to incorporate appropriate quantitative predictions from population-genetics theory. A maximum-likelihood (ML) method under the pure drift model could, in principle, incorporate all the pertinent quantitative predictions inherent to genetic drift (Felsenstein 1973, 1981). The drawback of an ML method is that it involves extensive computing if explicit solutions are not possible. In addition, lack of knowledge on the exact joint probability distribution of the data will decrease the credibility of ML.

In this research, we first introduce a class of $\theta_{XY}$-related measurements of genetic distances. These genetic distances, after appropriate transformation, are then used to infer phylogenetic relationships among taxa.

The Pure Drift Model

Consider a finite population with effective population size $N_e$ isolated from an infinite, random-mating population in Hardy-Weinberg equilibrium (denoted by A as in fig. 1). Assume effective population size did not change through time and at generation $t_{AB}$ the population (denoted by B) was split into two lineages, X and Y, each of which had the same effective population size $N_e$. Populations X and Y have independent histories of genetic drift for $t_{BX}$ and $t_{BY}$ generations, respectively. Suppose that the heterozygosity of a neutral locus in the infinite ancestral population was $H_A$. When the finite population (population B) was split, the expected heterozygosity was expressed by $H_B$. If both $H_A$ and $H_B$ are known, one can infer the time $t_{AB}$, in generations, by the following formula:

$$H_B = H_A(1 - F_{t_{AB}}) = H_A[1 - 1/(2N_e)]^{t_{AB}}, \tag{1}$$

where

$$F_{t_{AB}} = 1 - [1 - 1/(2N_e)]^{t_{AB}}$$

is the mean inbreeding coefficient of population B. We use H throughout to represent the expected heterozygosity. Estimated heterozygosity will be discussed later. Equation (1) allows the time before divergence to be inferred from the existing heterozygosity of node B as

$$t_{AB} = [\log(H_B) - \log(H_A)]/\log[1 - 1/(2N_e)]. \tag{2}$$

FIG. 1.-The rooted tree for two taxa used as an example in the text.

Let $H_X$ and $H_Y$ be the expected heterozygosities of the terminal populations X and Y, respectively, at the time when data are sampled. Because of the relationships

$$H_X = H_B[1 - 1/(2N_e)]^{t_{BX}}$$

and

$$H_Y = H_B[1 - 1/(2N_e)]^{t_{BY}},$$

$t_{BX}$ and $t_{BY}$ can be inferred by

$$t_{BX} = [\log(H_X) - \log(H_B)]/\log[1 - 1/(2N_e)] \tag{3}$$

and

$$t_{BY} = [\log(H_Y) - \log(H_B)]/\log[1 - 1/(2N_e)], \tag{4}$$

respectively. If generation intervals for the two lineages were the same and data were sampled at the same time, $t_{BX}$ should equal $t_{BY}$. However, this is not a requirement of the phylogenetic methods being described here.

The number of heterozygotes in populations X and Y can be obtained by counting individual genotypes. Observed frequencies of heterozygotes are then used to estimate $H_X$ and $H_Y$, denoted by $\hat{H}_X$ and $\hat{H}_Y$, respectively. Unfortunately, $H_B$ is an unobservable historical event that cannot be simply counted. What we want is to obtain an estimate of $H_B$ from the observed data sampled from X and Y.
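As a numerical illustration of equations (1)-(4), the following Python sketch computes the expected decay of heterozygosity under drift and then recovers the elapsed generations from it. The function names and the numerical values ($H_A = 0.5$, $N_e = 50$, 20 and 30 generations) are assumptions chosen for illustration and are not taken from the analyses in this paper.

```python
import math


def drift_factor(n_e):
    """Per-generation retention of heterozygosity, 1 - 1/(2 N_e)."""
    return 1.0 - 1.0 / (2.0 * n_e)


def expected_heterozygosity(h_start, n_e, generations):
    """Equation (1): h_start * [1 - 1/(2 N_e)] ** generations."""
    return h_start * drift_factor(n_e) ** generations


def elapsed_generations(h_end, h_start, n_e):
    """Equations (2)-(4): [log(h_end) - log(h_start)] / log[1 - 1/(2 N_e)]."""
    return (math.log(h_end) - math.log(h_start)) / math.log(drift_factor(n_e))


# Hypothetical values, for illustration only.
h_a, n_e, t_ab, t_bx = 0.5, 50, 20, 30
h_b = expected_heterozygosity(h_a, n_e, t_ab)   # heterozygosity at node B
h_x = expected_heterozygosity(h_b, n_e, t_bx)   # heterozygosity of terminal taxon X

print(elapsed_generations(h_b, h_a, n_e))  # recovers t_AB up to rounding error
print(elapsed_generations(h_x, h_b, n_e))  # recovers t_BX up to rounding error
```

The sketch works with expected heterozygosities only; with real data $H_B$ is not observable and has to be estimated from the terminal taxa.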
Estimation of Heterozygosity of an Internal Node

The number of heterozygotes in populations X and Y (the terminal nodes) can be obtained from the actual counts by examining individual genotypes. However, individuals were assumed to mate randomly within each population so that the heterozygosity of each terminal node can be estimated by the so-called gene diversity (Nei 1976, pp. 723-765). Let $x_i$ and $y_i$ denote the ith allele frequencies (observed) for taxa X and Y, respectively, where i = 1, 2, ..., n for a locus with n allelic states. The heterozygosities of X and Y are then estimated by

$$\hat{H}_X = 1 - \sum_{i=1}^{n} x_i^2$$

and

$$\hat{H}_Y = 1 - \sum_{i=1}^{n} y_i^2,$$

respectively.

The allelic frequencies of the two terminal taxa provide all information about the heterozygosity of their ancestral population at the time when they diverged. Consider node B as a parental population with X and Y being two sets of offspring of B. Note that no matter how many generations had passed since X and Y diverged, X carries genes from one set of offspring and Y carries genes from another set of offspring of B. Recall that the coancestry coefficient between X and Y is defined as the probability that a gene from X is identical by descent with a gene from Y. If X and Y form a pair of mates, their offspring would have an inbreeding coefficient equal to the coancestry coefficient between X and Y. Because X and Y have been treated as offspring of B, the progenies of X and Y would be the grandchildren of population B. As population B is designated as generation $t_{AB}$, their potential grandchildren should be designated as generation $t_{AB} + 2$. Therefore, $\theta_{XY} = F_{t_{AB}+2}$.

Coancestry coefficient, $\theta_{XY}$, is probably the most appropriate measurement of genetic distance between two taxa when dealing with random drift because it is independent of the initial gene frequency (allele frequency of the common ancestral population, A) and reflects the time elapsed before X and Y diverged. However, estimation of $\theta_{XY}$ still needs an estimate of the initial frequency. Fortunately, there is a simple linear relationship between the expected heterozygosity and inbreeding coefficient, as $H = H_0(1 - F)$, where $H_0$ is the heterozygosity of the panmictic base population. Therefore, instead of estimating F or $\theta_{XY}$, we may estimate the heterozygosity.

We now propose an unbiased estimator of $H_B$, the heterozygosity of the internal node (see fig. 1), using observed allele frequencies for a locus of interest from taxa X and Y. Under the drift model, their expected values are the same as that in the initial base population. An unbiased estimate of the heterozygosity two generations after node B is proposed as

$$D_{XY} = 1 - \sum_{i=1}^{n} x_i y_i. \tag{5}$$

This heterozygosity estimator, denoted by $D_{XY}$, is a genetic distance. The unbiased property of equation (5) is proved easily by showing

$$E[D_{XY}] = H_A(1 - \theta_{XY}). \tag{6a}$$

As indicated before, $\theta_{XY} = F_{t_{AB}+2}$; hence,

$$E[D_{XY}] = H_A(1 - F_{t_{AB}+2}). \tag{6b}$$

If the constant 2 in the subscript of equation (6b) is dropped, the quantity on the right-hand side is $H_B$. Therefore, the expected genetic distance between two lineages equals the heterozygosity of the potential grandchildren (generation $t_{AB}+2$) of their common ancestral population at the time when the two lineages diverged. If the effective population size is not too small ($N_e > 50$), heterozygosity reduction for two generations is insignificant so that the expected $D_{XY}$ is a good approximation for $H_B$, that is,

$$E[D_{XY}] \approx H_B. \tag{7}$$

Unless specified otherwise, we assume that $N_e$ is large enough so that the expected genetic distance between two lineages equals the heterozygosity of their immediate common ancestral population at the time when the two lineages diverged.

Although equation (5) is unbiased, it is not practical to use just a single locus to infer the heterozygosity of an internal node because a large variance will be anticipated. Especially if there is no heterozygosity in the terminal taxa, there are only two possible outcomes: that the two lineages fixed the same allele or that they fixed different alleles. The latter may be called alternative fixation. In either case, we will not be able to infer $H_B$ just on the basis of the two possible observed outcomes of fixation. It can be shown, however, that if m independent neutral loci are used, we can compare the two lineages locus by locus, find the number of alternatively fixed loci, and use the proportion over the total number of loci to measure the genetic distance (Atchley and Fitch 1991, 1993). A general strategy is to average $D_{XY}$ over loci, which has the operational form of

$$\bar{D}_{XY} = \frac{1}{m}\sum_{j=1}^{m}\left(1 - \sum_{i=1}^{n_j} x_{ji}\, y_{ji}\right), \tag{8}$$

where $x_{ji}$ and $y_{ji}$ are the ith allele frequencies of the jth locus for populations X and Y, respectively, and $n_j$ is the number of allelic states of the jth locus. Equation (8) is still an unbiased estimator of the average heterozygosity of the internal node over loci because drift has an equal effect on all loci. In this case, $E(\bar{D}_{XY}) = \bar{H}_A(1 - \theta_{XY})$, where $\bar{H}_A$ is the average heterozygosity over loci of the common ancestral population (node A).

The estimated length (number of generations) of the root branch (fig. 1) is

$$\hat{t}_{AB} = [\log(\bar{D}_{XY}) - \log(\bar{H}_A)]/\log[1 - 1/(2N_e)]. \tag{9}$$

Let $\bar{H}_X$ and $\bar{H}_Y$, respectively, be the observed average heterozygosities (over loci) of populations X and Y at the time when data are sampled; then

$$\hat{t}_{BX} = [\log(\bar{H}_X) - \log(\bar{D}_{XY})]/\log[1 - 1/(2N_e)] \tag{10}$$

and

$$\hat{t}_{BY} = [\log(\bar{H}_Y) - \log(\bar{D}_{XY})]/\log[1 - 1/(2N_e)]. \tag{11}$$

Phylogenetic Inference under the Pure Drift Model

We will describe a new algorithm of phylogeny reconstruction under the pure drift model using both the genetic distances between populations ($\bar{D}_{XY}$) and variation within populations ($\bar{H}_X$). Figure 2 gives a hypothetical drift-produced phylogeny with four taxa. The branch lengths represent time in generations. The Fitch-Margoliash method assumes that the true genetic distance between taxa i and j, or a transformation of the distance, equals the sum of all the branch lengths that connect i and j. For instance, genetic distance between A and B in figure 2 is the sum of $t_A$ and $t_B$. Likewise, distance between A and D is the sum of $t_A$, $t_3$, $t_2$, and $t_D$. This is clearly not the case under the pure drift model because $\bar{D}_{AB}$ estimates the heterozygosity at node d, which is determined by the sum of $t_1$, $t_2$, and $t_3$ and does not depend on $t_A$ and $t_B$. Similarly, $\bar{D}_{AD}$ reflects heterozygosity at node b, a function of $t_1$, and provides no information about the lengths of branches that connect A and D. Therefore, branch lengths of a phylogenetic tree under the pure drift model cannot be estimated with the conventional Fitch-Margoliash method.

Suppose that figure 2 represents the true phylogenetic relationship of the four taxa and we have collected 4(4-1)/2 = 6 pairwise distances and four observed heterozygosities within taxa. The total number of observed data points is 10. Let us define variable y as the appropriate transformation of either the heterozygosity within taxon or the genetic distance between taxa. Thus,

$$
\begin{aligned}
y_A &= [\log(\bar{H}_A) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_{AB} &= [\log(\bar{D}_{AB}) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_{AC} &= [\log(\bar{D}_{AC}) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_{AD} &= [\log(\bar{D}_{AD}) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_B &= [\log(\bar{H}_B) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_{BC} &= [\log(\bar{D}_{BC}) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_{BD} &= [\log(\bar{D}_{BD}) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_C &= [\log(\bar{H}_C) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_{CD} &= [\log(\bar{D}_{CD}) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)] \\
y_D &= [\log(\bar{H}_D) - \log(\bar{H}_a)]/\log[1 - 1/(2N_e)].
\end{aligned}
\tag{12}
$$

If $\bar{H}_a$ and $N_e$ are known, these y's are the observed data, which can be fitted to the following additive model:

$$
\begin{aligned}
y_A &= t_1 + t_2 + t_3 + t_A + e_A \\
y_{AB} &= t_1 + t_2 + t_3 + e_{AB} \\
y_{AC} &= t_1 + t_2 + e_{AC} \\
y_{AD} &= t_1 + e_{AD} \\
y_B &= t_1 + t_2 + t_3 + t_B + e_B \\
y_{BC} &= t_1 + t_2 + e_{BC} \\
y_{BD} &= t_1 + e_{BD} \\
y_C &= t_1 + t_2 + t_C + e_C \\
y_{CD} &= t_1 + e_{CD} \\
y_D &= t_1 + t_D + e_D
\end{aligned}
\tag{13}
$$

where the e's are error terms representing departures of observed from expected values. Let y and e be 10 × 1 vectors containing the y's and the e's, respectively, and $t = [t_1\ t_2\ t_3\ t_A\ t_B\ t_C\ t_D]^T$ be a 7 × 1 vector, where the superscript T represents matrix transposition.
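The estimators in equations (5) and (8) and the transformation in equation (12) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the allele frequencies for the two taxa, and the values $\bar{H}_a = 0.5$ and $N_e = 50$ are hypothetical and serve only to show the arithmetic.

```python
import math


def gene_diversity(freqs):
    """Single-locus heterozygosity estimate: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def d_xy(x, y):
    """Equation (5): 1 - sum_i x_i * y_i for one locus."""
    return 1.0 - sum(xi * yi for xi, yi in zip(x, y))


def mean_over_loci(stat, *taxa):
    """Average a per-locus statistic over loci (equation (8) when stat is d_xy)."""
    loci = list(zip(*taxa))
    return sum(stat(*locus) for locus in loci) / len(loci)


def y_transform(h, h_root, n_e):
    """Equation (12): [log(h) - log(h_root)] / log[1 - 1/(2 N_e)]."""
    return (math.log(h) - math.log(h_root)) / math.log(1.0 - 1.0 / (2.0 * n_e))


# Hypothetical allele frequencies: two taxa, three loci, two alleles per locus.
taxon_x = [[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]
taxon_y = [[0.5, 0.5], [0.0, 1.0], [0.6, 0.4]]  # locus 2 is alternatively fixed

h_x = mean_over_loci(gene_diversity, taxon_x)   # average heterozygosity within X
d = mean_over_loci(d_xy, taxon_x, taxon_y)      # average distance between X and Y

print(h_x, d)
print(y_transform(h_x, 0.5, 50))  # transformed within-taxon heterozygosity
```

At the alternatively fixed locus the single-locus distance is exactly 1, which is why averaging over many loci is needed before the result can be read as a heterozygosity.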

The condensed matrix notation for the above additive model is

$$y = Zt + e \tag{14}$$

where Z is a design matrix representing the tree topology. In this particular example,

$$
Z = \begin{bmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}.
$$

FIG. 2.-The model tree for four taxa used in the simulation studies.

The ordinary least squares solution for t is

$$\hat{t} = (Z^T Z)^{-1} Z^T y, \tag{15}$$

with a mean squared error (MSE) of

$$\mathrm{MSE} = (y - Z\hat{t})^T (y - Z\hat{t})/df \tag{16}$$

where df is the degrees of freedom and equals the number of data points minus the number of branches. For T taxa, df = T(T-3)/2 + 1. The ordinary-least-squares solutions given in equation (15) assume that the error terms are independent with a constant variance. However, these e's are certainly correlated and the variance may also vary. Let V denote the variance-covariance matrix of vector e; the generalized-least-squares solution for t would be

$$\hat{t} = (Z^T V^{-1} Z)^{-1} Z^T V^{-1} y. \tag{17}$$

Unfortunately, V is usually unknown and its estimation is difficult. Therefore, the ordinary-least-squares solutions are used hereafter. The V matrix will be further discussed in a later section. There are 15 possible rooted trees for four taxa. In principle, one needs to evaluate all 15 possible trees and choose the one with the smallest MSE.

Several key points concerning y's need to be made. First, for large $N_e$, $\log[1 - 1/(2N_e)]$ can be replaced by $-1/(2N_e)$. Second, $N_e$ may be unknown in natural populations, or it may have been estimated. Fortunately, $2N_e$ occurs in all the y's, thus permitting $2N_e$ to be dropped. In that case, the branch length is not the number of generations but $t/(2N_e)$. Third, $H_a = 2p_0(1 - p_0)$, requiring the initial allelic frequency in the common ancestral population (node a in fig. 2). One may replace $p_0$ by the average allelic frequency of all lineages, but that value is anticipated to have a large variance because lineages are not independent random samples of node a. However, $-\log(\bar{H}_a)$ is a constant added to all the y's. Linear regression theory indicates that adding a constant to y's does not affect the estimates of regression coefficients and the MSE, but it does change the estimate of the intercept. Examining the above additive model and the Z matrix, we see that the tree trunk (branch $t_1$ in fig. 2) is actually the intercept and all other t's are regression coefficients. Therefore, if one is not interested in the length of the root, the constant $-\log(\bar{H}_a)$ can be ignored, leaving MSE and the other branch-length estimates unaffected. Ignoring $-\log(\bar{H}_a)$ we have $y_A = -\log(\bar{H}_A)$ and $y_{AB} = -\log(\bar{D}_{AB})$, a similar transformation to Nei's (1987, pp. 208-253) standard genetic distance, but they have quite different meanings. Nei's standard genetic distance takes the negative of the natural log of the genetic similarity, whereas our y variable takes the negative of the natural log of the genetic distance. Therefore, the y variable may not be called genetic distance but rather a kind of genetic "similarity."

Branch lengths of a phylogenetic tree cannot be negative. However, the least squares solution does not guarantee the nonnegativity. There are two situations where negative estimates may occur. First, a wrong topology may be chosen, which would lead to systematic bias for the estimate of a branch length. Such bias cannot be removed by increasing the number of loci. Second, if a small number of loci are used, sampling error may cause a negative estimate of a branch length, even when a correct tree topology is used. In the latter case, negativity can be overcome by increasing the number of loci. Algebraically, an ad hoc way of solving the problem of negativity is to set any negative branches to zero and then recalculate the MSE for a given tree topology (Swofford and Olsen 1991). The optimal approach, however, is to utilize quadratic programming to disallow negative estimates of regression coefficients (Hildreth 1957). The tree trunk, $t_1$, is the intercept in the linear model; thus, it should not be constrained.

Monte Carlo Simulation

An example was generated via Monte Carlo simulation to demonstrate the usage of the new method. We simulated the model tree given in figure 2 under a special breeding procedure, namely, brother × sister (b × s) mating. Most inbred populations of laboratory mice and rats were developed by b × s mating, which strictly fit this pure drift model (Atchley and Fitch 1991, 1993). With brother × sister mating, equation (1) can still be used to approximate $H_B$ such that the heterozygosity is expressed as a function of generations. Kempthorne (1969) has shown that, with systematic full-sib mating, $1 - 1/(2N_e) = (1 + \sqrt{5})/4 = 0.809$, leading to $N_e = 2.6178$. Hence, $\log[1 - 1/(2N_e)]$ can be replaced by $\log(0.809)$.

We first simulate an initial hypothetical random population with 200 independent neutral loci, all with two allelic states (0 and 1). The frequency of allelic state 1 is 0.5 across all loci. This population is designated as generation 0 and denoted as node a; hence, $H_a = 2(.5)(1 - .5) = .5$. We then randomly sampled two individuals (one male and one female) from this hypothetical population, who were then b × s mated for three generations (designated as generation 3 and denoted by node b). From node b, a series of inbreeding lines were produced as described in figure 2. Four full-sib progenies were produced from node b; one pair of full-sibs was used to initiate a lineage that produced line D after nine generations of b × s mating, as shown in figure 2. The progenies from the other pair of full-sibs from node b were b × s mated for three generations to produce node c. At node c, one pair of progenies were b × s mated for six generations to produce the lineage leading to line C and another pair leading to node d after three generations of b × s mating, which subsequently led to lines A and B. Both $t_A$ and $t_B$ are three generations. The genotypes of the simulated organisms were examined and allelic frequencies were calculated. In the absence of selection and mutation, systematic b × s mating reduces heterozygosity. Thus, this is a pure random drift model of evolutionary change of allelic frequencies.

The estimated heterozygosity of each terminal taxon was obtained by evaluating the genotype of each locus of each individual. The genetic distance was estimated using equation (8). These values are listed in table 1. We now try to estimate the branch lengths and calculate MSE of the data. Data in table 1 were appropriately transformed into y variables using a modified version of equation (12). The reason to modify equation (12) for the special mating system is that the effective population size is too small ($N_e = 2.6178$) and selfing is not allowed. The modified version of equation (12) follows: all y's with one subscript are added by 1; otherwise, they are subtracted by 1. The MSE for this data set was 0.0604 (generation²) and the estimated branch lengths are listed in table 2. To show that a choice of $\bar{H}_a$ does not affect MSE and estimates of $t_2 \ldots t_D$, we also chose $\bar{H}_a = 0.1$ and $\bar{H}_a = 0.9$ to compare with $\bar{H}_a = 0.5$. Remember that the true value of $t_1$ is 3. When $\bar{H}_a = 0.5$, the estimated value was 3.17 (close to 3), but it became -4.4 and 5.95, respectively, when $\bar{H}_a = 0.1$ and 0.9 were used.

Table 1
Estimated Heterozygosities ($\bar{H}$) of Terminal Taxa (diagonals) and Heterozygosities ($\bar{D}_{XY}$) of Internal Nodes (off diagonals) for Four Taxa from a Simulated Data Set under the Model Tree Given in Figure 2

          A       B       C       D
A       .055
B       .059    .045
C       .114    .111    .055
D       .222    .198    .200    .04:

Table 2
Comparisons of the Estimated with the True Branch Lengths for the Simulated Data Set

Branch            True Length (no. of generations)    Estimated Length
t1 (intercept)                  3                          3.17
t2                              3                          2.86
t3                              3                          3.22
tA                              3                          2.16
tB                              3                          3.11
tC                              6                          5.38
tD                              9                          9.46

NOTE.-The expected heterozygosity in the common ancestral population ($H_a$) is 0.5. The MSE is 0.0604 (generation²).

To evaluate the sensitivity of MSE and generalize the results, more simulations were conducted. The numbers of independent loci examined were 5, 10, 15, 20, 25, 50, 75, and 100. For any given number of loci, 100 replicated samples were simulated. For each replicate, all 15 possible trees (fig. 3) were evaluated. The tree with the smallest MSE was chosen as the inferred phylogeny.
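The least-squares step applied to each candidate tree can be sketched in Python with the standard library alone. The sketch below is an illustration under stated assumptions rather than the program used for the simulations: the helper names `solve` and `ols` are hypothetical, the design matrix is the one for the model tree in figure 2, and the y's are noise-free values built from the true branch lengths (3, 3, 3, 3, 3, 6, 9) with the unmodified equation (12), so the ±1 adjustment for full-sib mating is not applied.

```python
import math

# Rows: yA yAB yAC yAD yB yBC yBD yC yCD yD; columns: t1 t2 t3 tA tB tC tD.
Z = [
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(z, y):
    """Return (t_hat, mse) for y = Z t + e, following equations (15) and (16)."""
    k = len(z[0])
    ztz = [[sum(r[i] * r[j] for r in z) for j in range(k)] for i in range(k)]
    zty = [sum(r[i] * yi for r, yi in zip(z, y)) for i in range(k)]
    t_hat = solve(ztz, zty)
    resid = [yi - sum(c * t for c, t in zip(r, t_hat)) for r, yi in zip(z, y)]
    df = len(y) - k  # data points minus branches; 3 for four taxa
    return t_hat, sum(e * e for e in resid) / df


LAM = (1 + math.sqrt(5)) / 4          # 1 - 1/(2 N_e) under full-sib mating
H_ROOT = 0.5                          # heterozygosity at node a
true_t = [3, 3, 3, 3, 3, 6, 9]        # t1 t2 t3 tA tB tC tD

# Expected heterozygosity behind each data point, then the y transformation.
expected_h = [H_ROOT * LAM ** sum(c * t for c, t in zip(row, true_t)) for row in Z]
y = [(math.log(h) - math.log(H_ROOT)) / math.log(LAM) for h in expected_h]

t_hat, mse = ols(Z, y)
print([round(t, 6) for t in t_hat], mse)
```

With noise-free input the fit returns the true branch lengths and an MSE of essentially zero; with sampled data the same call would be repeated with the design matrix of each of the 15 topologies, keeping the tree with the smallest MSE.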

FIG. 3.-The 15 possible rooted trees for four taxa

The 15 different rooted trees can be classified into two categories, asymmetric and symmetric. The asymmetric trees include tree numbers 1, 2, 4, 5, 6, 7, 8, 10, 11, 12, 14, and 15. Tree numbers 3, 9, and 13 belong to the symmetrical class.

First, we chose tree number 7 as the true tree to simulate the data. The average MSE of the 100 replicates for each tree is reported in table 3. The mean MSE of the true tree (tree number 7) was the smallest among the 15 possible trees. Therefore, MSE is sensitive to the choice of tree topology. The frequency of being chosen as the inferred phylogeny is given in table 4. When the number of loci was five, the percentage of choosing the right phylogeny (tree number 7) was 26%, which ranked the second highest (the highest value was 28% for tree number 2). When the number of loci increased to 10, the frequency of choosing the right tree increased to 44%, which dominated all other trees. As expected, this frequency increased as the number of loci increased. The patterns of increase in frequency of choosing the true tree and decrease in MSE for the true tree (tree number 7) with increasing number of loci are shown in figure 4a. Tree numbers 2, 8, 9, and 14 generally had smaller MSEs (table 3) and higher frequencies (table 4) than other trees (except for the true one). Looking into the 15 trees given in figure 3 again, we found that each of these four trees (tree numbers 2, 8, 9, and 14) retains a true clade, but no other trees retain any true clade.

Second, we chose tree number 9 as the true tree to simulate the data. The average MSE of the 100 replicates for each tree is given in table 5. The mean MSE of the true tree (tree number 9) was again the smallest among the 15 possible trees. The frequency of being chosen as the inferred phylogeny is given in table 6. When the number of loci was five, the percentage of choosing the right phylogeny (tree number 9) was 26%. When the number of loci increased to 50, this frequency increased to 94%. For 100 loci, it has reached 100%. These frequencies are generally greater than those found when tree number 7 was assigned as the true tree. Similar plots are provided in fig. 4b.

An Application to Human

The data consist of gene frequencies for five blood-group systems, A1A2BO, RH, MNSs, Fy, and Di, sampled from four human populations: Eskimo, Bantu, English, and Korean (see table 1 of Cavalli-Sforza and Edwards 1967). The total number of alleles of the five loci is 19. Cavalli-Sforza and Edwards have analyzed the data under a similar drift model and provided an exhaustive treatment (15 possible rooted trees). Therefore, their results are directly comparable with the results presented here. These gene-frequency data were used to calculate heterozygosities and pairwise genetic distances (table 7).

The best (MSE = .00698) and second best (MSE = .00813) trees among 15 rooted trees are given in figure 5. These two trees are also the best trees of Cavalli-Sforza and Edwards (1967). However, our best tree turns out to be their second best, in which the initial split places Bantu on one branch and Eskimo, English, and Korean on the other. The second split occurs between English and Eskimo-Korean. Internal branches of negative lengths are generated with the remaining 13 trees, which, on the average, have an MSE several times larger than those of the two trees. The internal node that separates Eskimo and Korean has an estimated heterozygosity of 0.3954, but Korean has an estimate of 0.5348, which has generated a negative external branch length for the Korean lineage after divergence from Eskimo. The negative external branch length has been set to zero (fig. 5). In general, our results are comparable with those of Cavalli-Sforza and Edwards (1967). Disregarding the position of the root, the two trees have the same topology. However, under the drift model, the two trees will produce quite different predictions. With the best tree, we would predict that English and Bantu are equally alike, as are Eskimo (or Korean) and Bantu, while with the second best tree, English and Bantu are more alike than Eskimo (or Korean) and Bantu.

Nei and Roychoudhury (1993) inferred the phylogenetic tree for 26 human populations using the neighbor-joining method (Saitou and Nei 1987) from 29 polymorphic loci. Their tree divides the 26 populations into four major groups.
Nei and Roychoudhury 956 Xu et al.

Table 3
Averages of the MSE of 100 Replicated Simulations

                                 NUMBER OF LOCI
TREE        5        10        15        20       25       50       75      100
  1      407.33    184.83    133.67     95.84    45.35    12.92    11.96    11.48
  2      185.24     93.53     70.99     61.90    26.72     3.15     3.07     2.74
  3      403.39    177.55    133.62     96.42    45.34    12.92    11.96    11.48
  4      387.25    180.43    147.46     95.26    43.19    13.68    12.55    12.13
  5      417.48    206.35    156.99    103.21    47.08    14.00    12.77    12.20
  6      430.12    207.61    156.28     99.49    46.99    14.00    12.77    12.20
  7      104.91     54.33     16.84     12.80     1.26     0.36     0.25     0.17
  8      218.32    119.01     60.85     29.06     8.28     6.21     5.01     4.63
  9      234.65    129.68     60.86     24.38     8.20     6.21     5.01     4.62
 10      420.23    208.73    157.00     99.60    47.04    14.00    12.77    12.20
 11      430.13    211.13    156.29    103.46    47.03    14.00    12.77    12.20
 12      388.47    180.51    147.46     95.27    43.18    13.68    12.54    12.13
 13      392.10    192.09    143.68     97.01    45.84    12.85    12.04    11.65
 14      175.77    104.94     77.37     62.01    26.94     3.16     3.08     2.75
 15      385.99    198.88    143.70     96.06    45.84    12.85    12.04    11.65

NOTE.-There are 15 possible rooted trees for four taxa, and tree number 7 is the true tree.
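The counts quoted for four taxa (15 rooted trees, 10 data points, 3 degrees of freedom) can be checked with a short Python sketch. It assumes the standard result that the number of rooted bifurcating trees for n labelled taxa is (2n-3)!!, which is not stated in this paper, together with the expression df = T(T-3)/2 + 1 given with equation (16); the function names are hypothetical.

```python
def rooted_tree_count(n_taxa):
    """Number of rooted bifurcating labelled trees, (2n - 3)!!, for n >= 2."""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


def n_data_points(n_taxa):
    """Pairwise distances plus within-taxon heterozygosities: T(T-1)/2 + T."""
    return n_taxa * (n_taxa - 1) // 2 + n_taxa


def degrees_of_freedom(n_taxa):
    """Data points minus branches (2T - 1, counting the trunk): T(T-3)/2 + 1."""
    return n_taxa * (n_taxa - 3) // 2 + 1


for taxa in (4, 5, 6):
    print(taxa, rooted_tree_count(taxa), n_data_points(taxa), degrees_of_freedom(taxa))
```

The rapid growth of the tree count with the number of taxa is why exhaustive evaluation of every topology, as done here for four taxa, becomes costly for larger problems.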

Nei and Roychoudhury (1993) claimed that their tree was consistent with data on morphological differences, archaeological records, and geographic distributions of the populations. It turns out that the four human populations analyzed in this study represent the four major groups. The inferred phylogeny of the four major groups (Nei and Roychoudhury 1993, fig. 2) has the same topology as our best tree (fig. 5a), assuming that their tree was rooted on the longest branch. In addition, the branch lengths of our best tree are roughly proportional to those of Nei and Roychoudhury's tree.

There is no doubt that genetic drift is an important evolutionary force, but it is not the only reason for population differentiation in humans. The initial split of human population might have occurred 200,000 years ago (see Nei and Roychoudhury 1993), which is equivalent to 6,000-7,000 generations. For such a large time scale, selection and mutation may have played an important role in population divergence.

Table 4
Frequency (of 100 replicates) of Being Chosen as the Inferred Phylogeny for Each Tree

                     NUMBER OF LOCI
TREE      5     10     15     20     25     50     75    100
  1      15      2      0      0      0      0      0      0
  2      28     20     20     12     13      8      2      0
  3       0      1      2      0      1      0      0      0
  4       3      3      0      1      0      0      0      0
  5       1      2      0      0      0      0      0      0
  6       1      2      0      0      0      0      0      0
  7      26     44     54     66     65     82     93    100
  8      11      5      6      3      5      4      2      0
  9       5      8      3     10      7      0
 10       2      0      0      0      0
 11       0      0      0      0      0      0
 12       0      2      0      0      0
 13       2      2      0      0      0      0
 14       6      7     14      7      8      0
 15       0      2      0      0      0

NOTE.-Tree number 7 is the true tree.

role in population divergence. Gene admixture may also have occurred after divergence of these populations. Estimates of heterozygosities of both internal and external nodes reported in table 7 are indeed much higher than we normally anticipate under drift. From Cavalli-Sforza and Bodmer’s (1971, p. 733) table 11.9, we found that the sampled average heterozygosity over the five loci in the English population is 0.4788, which is similar to that described here (0.4693). Cavalli-Sforza and Bodmer (1971, pp. 732-735) explain the high level of heterozygosity as possibly due to selection for heterozygotes and mutations. Disregarding all possible nondrift forces of evolution, the algorithm introduced here, which is valid under the pure drift model, seems relatively robust, because it used five loci with 19 alleles but produced a tree identical to Nei and Roychoudhury’s (1993) tree using 29 loci. In particular, the neighbor-joining method used by Nei and Roychoudhury (1993) does not depend on the pure drift model.

FIG. 4.-Changes of MSE and frequency that the inferred phylogeny is the true tree as the number of loci increases. All MSEs were expressed as percentage of the MSE when the number of loci was five; (a) tree number 7 is the true tree; (b) tree number 9 is the true tree. Horizontal axis in both panels: number of loci.

Discussion

Our purpose is to introduce a new phylogeny-inferring method under the pure drift model and using θxy-related genetic distance data. There are several fundamental differences between this method and other methods based on pairwise distance data. First, our method not only uses the pairwise distance (Dxy) but also the variation within population (Hx), whereas only the former is used in conventional distance-based methods. The Hx is the heterozygosity within taxon and can be treated as the distance of a taxon with itself if it is denoted by Dxx, which is usually greater than zero. Second, the genetic distances presented here measure the heterozygosities of internal nodes. Therefore, hetero-

Table 5
Averages of MSE of 100 Replicated Simulations

                                 NUMBER OF LOCI
TREE         5      10      15      20      25      50      75     100
1       256.90   79.29   52.53   22.96   11.18    5.16    4.66    4.95
2       258.99   77.03   50.29   22.38   12.20    5.23    4.65    4.99
3       290.02   88.73   55.58   23.96   12.38    5.33    4.75    5.04
4       264.16   88.92   51.50   21.92   12.21    5.16    4.69    4.97
5       282.89   82.03   54.04   23.00   11.35    5.16    4.65    4.99
6       205.42   31.51   38.47   16.34    3.78    3.13    2.78    2.99
7       146.32   66.21   20.68   10.06    9.97    3.24    2.96    3.02
8       154.72   75.11   20.86   10.05    9.98    3.24    2.96    3.02
9        97.38   14.63    1.49    1.40    0.54    0.28    0.20    0.10
10      233.25   31.24   38.61   16.35    3.79    3.13    2.78    2.99
11      260.09   81.49   52.49   22.96   11.20    5.17    4.66    4.95
12      262.14   88.91   51.48   22.14   12.21    5.16    4.69    4.97
13      290.73   84.05   55.52   24.33   12.39    5.34    4.74    5.04
14      261.75   73.46   50.27   22.43   12.20    5.23    4.65    4.99
15      287.34   75.65   54.02   23.25   11.34    5.17    4.65    4.99

NOTE.-There are 15 possible rooted trees for four taxa, and tree number 9 is the true tree.
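The count of 15 rooted trees for four taxa quoted in the note agrees with the standard formula (2n - 3)!! for rooted bifurcating trees on n labeled taxa. The following Python sketch is an illustration only: the function names and the nested-tuple representation are assumptions introduced here, not part of the original method, and the order in which the trees are produced need not match the tree numbers used in the tables.

```python
from itertools import combinations


def count_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees on n labeled taxa: (2n - 3)!!."""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):  # odd factors 3, 5, ..., 2n - 3
        count *= k
    return count


def rooted_trees(taxa):
    """Yield every rooted bifurcating tree on `taxa` as nested 2-tuples."""
    taxa = tuple(taxa)
    if len(taxa) == 1:
        yield taxa[0]
        return
    first, rest = taxa[0], taxa[1:]
    # Split the taxa into two non-empty clades at the root. The first taxon
    # is always kept in the left clade so mirror images are not counted twice.
    for k in range(len(rest)):
        for left_extra in combinations(rest, k):
            left = (first,) + left_extra
            right = tuple(t for t in rest if t not in left_extra)
            for left_tree in rooted_trees(left):
                for right_tree in rooted_trees(right):
                    yield (left_tree, right_tree)


print(count_rooted_trees(4))            # 15
print(len(list(rooted_trees("ABCD"))))  # 15
```

Enumerating the topologies explicitly, rather than only counting them, is what makes an exhaustive comparison of all candidate trees feasible for four taxa.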

Table 6
Frequency (of 100 replicates) of Being Chosen as the Inferred Phylogeny for Each of the 15 Trees

                     NUMBER OF LOCI
TREE       5    10    15    20    25    50    75   100
1         20     2     2     1     0     0     0     0
2          3     3     1     3     2     0     0     0
3          2     0     1     0     1     0     0     0
4          5     5     0     3     0     0     0     0
5          4     3     1     1     0     0     0     0
6          5     7     8     5     2     1     1     0
7         12     8     6     6     3     1     0     0
8         11     2     5     4     3     2     0     0
9         26    55    67    71    84    94    99   100
10         4     8     6     5     4     2     0     0
11         0     1     1     1     0     0     0     0
12         3     2     0     0     0     0     0     0
13         2     0     0     0     0     0     0     0
14         1     3     0     0     0     0     0     0
15         2     1     2     0     1     0     0     0

NOTE.-Tree number 9 is the true tree.
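Because the distances used in this method measure heterozygosities of nodes, a branch length can be read, to a first approximation, from the drop in heterozygosity along that branch. The sketch below assumes the standard pure drift expectation H_t = H_0 (1 - 1/(2N))^t for a diploid population of constant size N, so that -ln(H_child / H_parent) is approximately t/(2N). The function names are hypothetical, and this two-point calculation is only an illustration, not the least-squares fit described in the paper. The second example pairs the Eskimo diagonal (.3675) with the Eskimo-Korean off-diagonal (.3954) of table 7 purely for illustration.

```python
import math


def expected_heterozygosity(h0, n, t):
    """Expected heterozygosity after t generations of pure drift
    in a diploid population of constant size n (assumed model)."""
    return h0 * (1.0 - 1.0 / (2.0 * n)) ** t


def branch_length_2n(h_parent, h_child):
    """Approximate branch length in units of 2N generations,
    computed as -ln(h_child / h_parent)."""
    if h_parent <= 0 or h_child <= 0:
        # A heterozygosity of zero carries no branch-length information.
        raise ValueError("heterozygosities must be positive")
    return -math.log(h_child / h_parent)


# Round trip: 20 generations at N = 100 should give roughly 20 / (2 * 100).
h_after = expected_heterozygosity(0.5, 100, 20)
print(round(branch_length_2n(0.5, h_after), 4))

# Illustrative values taken from table 7.
print(round(branch_length_2n(0.3954, 0.3675), 4))
```

The guard against non-positive heterozygosity reflects the fact that a fixed taxon provides no information about the length of its own branch under this model.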

zygosity reduction from one node to its successive node reflects the branch length. Third, upon the appropriate transformation of Dxy, the distance becomes “similarity,” which is not the sum of branches that connect X and Y. Instead, it is the sum of lengths of all segments from the root to the fork where X and Y split. In other words, the magnitude of Dxy is determined by how long taxa X and Y shared the same evolutionary pathways, and it contains no information about the time since they split. Finally, this method directly fits rooted trees. If two rooted trees are identical in topology with regard to the taxa of interest but are rooted differently, the two trees will have different MSEs, in contrast to the Fitch-Margoliash-related methods.

Under the pure drift model, if there is no heterozygosity for taxon X, then there is no information about the length of the terminal branch. Taxon X may have been fixed just before the data were sampled or many generations ago; Hx will be zero in either case. In this case, we need to delete Hx from the data and delete its corresponding branch length from the unknown vector, t, when we fit the linear model. If all the taxa are fixed, like inbred laboratory strains of mice or rats, then we cannot estimate the length of any terminal branch. An internal branch is estimable only if two lineages that diverged from this internal branch show a nonzero distance.

The genetic distance between taxa, Dxy, and heterozygosity within a taxon, Hx, are linearly related to θxy and Fx, respectively. The purpose in using Dxy is to estimate θxy because only θxy relates to the time

elapsed between taxa X and Y. If one already has good estimates of all pairwise coancestry coefficients and fixation indices by using other methods (e.g., Cockerham 1973; Reynolds et al. 1983; Weir 1990, pp. 135-172), it is not necessary to invoke the Dxy and Hx statistics. However, Dxy and Hx are easy to calculate and thus may turn out to be very useful.

Equation (8) for the average Dxy over loci assumes that all loci are equally informative. If they are not, a weighted average is more appropriate. However, the relative information provided by each locus depends on the variance of Dxy for that locus, which is rarely known. In addition, the m loci may be selectively chosen by researchers so that only polymorphic loci are included for a set of taxa. This will cause Dxy to be a biased estimate of the heterozygosity of their recent common ancestor. However, this bias will eventually go to the estimation of the intercept (the tree trunk), which is usually not of interest.

As mentioned earlier, the error terms in the linear model (eq. [15]) are correlated with a variance-covariance matrix V. First, to derive V we have to derive the variance of Dxy, the covariance between Dxy and Dxz, and so on. However, those variances and covariances also involve three- or four-gene identity by descent, which is certainly more complicated than two-gene identity by descent. Weir and Basten (1990) have developed explicit formulas for the variances and covariances of similar statistics from DNA-sequence data. Extension of the Weir-Basten formulas to allele-frequency data has not been obvious to us, so further investigation is needed. Second, the y variables in the linear model are log transformations of Dxy statistics; even if we had explicit formulas for the variances and covariances of the Dxy statistics, variances of log transformations of Dxy statistics would still have to be approximated unless one knows the distributional properties of the Dxy. Alternatively, one can invoke the bootstrapping or jackknifing resampling technique (Efron 1979) and substitute V by its bootstrap estimate. This may increase the chance of picking up the right tree and improve the estimates of branch lengths. On the other hand, the errors associated with estimation of V are also introduced into estimation of t, which may cause, in return, more errors in the estimation of branch lengths. Further investigation on this issue is necessary.

Inferring phylogeny under the pure drift model was originally suggested by Edwards and Cavalli-Sforza (1964) and Cavalli-Sforza and Edwards (1967), in which the allelic frequencies were turned into coordinates in a Euclidean space by using a generalization of the arcsine transformation so that the process of random genetic drift could be approximated by a process of Brownian motion in these Euclidean coordinates. A maximum-likelihood approach was then suggested using the transformed frequency data. However, the authors ran into singularities in the “likelihood surface,” which forced them to utilize an ad hoc approach, the method of minimum evolution. Formal maximum-likelihood solutions were provided, via a restricted maximum-likelihood (REML) approach, by Felsenstein (1973, 1981, 1985), who subsequently made the computer program available (i.e., the CONTML program in the PHYLIP package of Felsenstein 1989). Rohlf and Wooten (1988) evaluated the efficacy of Felsenstein’s REML method relative to Wagner’s parsimony and the UPGMA and obtained some results opposed to those of Kim and Burgman’s (1988) simulations.

Table 7
Estimated Heterozygosities (Hx) of Terminal Taxa (Diagonals) and Heterozygosities (Dxy) of Internal Nodes (off Diagonals) for Four Human Populations Calculated from Five Blood-Group Loci (A1A2BO, RH, MNSs, Fy, and Di)

            Bantu    English   Eskimo   Korean
Bantu       .3558
English     .4907    .4693
Eskimo      .5478    .4888     .3675
Korean      .5997    .5082     .3954    .5348

FIG. 5.-The best (a) and the second best (b) trees for four human populations found using the new phylogenetic reconstruction method. The estimated branch lengths are numbers of generations expressed as 2Ne. Each panel is labeled with the mean squared error of its tree.

Acknowledgments

We thank B. S. Weir, Z.-B. Zeng, and R. R. Hudson for many helpful suggestions on the earlier version of the manuscript. We also thank two anonymous reviewers for useful comments and suggestions on the earlier version of the manuscript. This work was supported by National Institutes of Health grant GM-45344 and National Science Foundation grants BSR-9 107 18 to W.R.A. and BSR-9096052 to W.M.F.

LITERATURE CITED

ATCHLEY, W. R., and W. M. FITCH. 1991. Gene trees and the origins of inbred strains of mice. Science 254:554-558.
-. 1993. Genetic affinities among inbred strains of laboratory mice. Mol. Biol. Evol. 10:1150-1169.
CAVALLI-SFORZA, L. L. 1966. Population structure and human evolution. Proc. Roy. Soc. Lond. [B] 164:362-379.
CAVALLI-SFORZA, L. L., I. BARRAI, and A. W. F. EDWARDS. 1964. Analysis of human evolution under random genetic drift. Cold Spring Harb. Symp. Quant. Biol. 29:9-20.
CAVALLI-SFORZA, L. L., and W. F. BODMER. 1971. The genetics of human populations. W. H. Freeman, San Francisco.
CAVALLI-SFORZA, L. L., and A. W. F. EDWARDS. 1967. Phylogenetic analysis: models and estimation procedures. Evolution 21:550-570.
COCKERHAM, C. C. 1973. Analysis of gene frequencies. Genetics 74:679-700.
EDWARDS, A. W. F., and L. L. CAVALLI-SFORZA. 1964. Reconstruction of evolutionary trees. Pp. 67-76 in V. H. HEYWOOD and J. MCNEILL, eds. Phenetic and phylogenetic classification. Systematics Association, London.
EFRON, B. 1979. Bootstrap methods: another look at the jackknife. Ann. Stat. 7:1-26.
FALCONER, D. S. 1980. Introduction to quantitative genetics. 2d ed. Longman, London.
FELSENSTEIN, J. 1973. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am. J. Hum. Genet. 25:471-492.

-. 1981. Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution 35:1229-1252.
-. 1985. Phylogenies from gene frequencies: a statistical problem. Syst. Zool. 34:300-311.
-. 1989. PHYLIP-phylogeny inference package (version 3.2). Cladistics 5:164-166.
FISHER, R. A. 1958. The genetical theory of natural selection. 2d ed. Dover, New York.
FITCH, W. M., and E. MARGOLIASH. 1967. Construction of phylogenetic trees. Science 155:279-284.
HILDRETH, C. 1957. A quadratic programming procedure. Naval Res. Logistics Q. 4:79-85.
KEMPTHORNE, O. 1969. An introduction to genetic statistics. Iowa State University Press, Ames.
KIM, J., and M. A. BURGMAN. 1988. Accuracy of phylogenetic-estimation methods under unequal evolutionary rates. Evolution 42:596-602.
LACY, R. C. 1987. Loss of genetic diversity from managed populations: interacting effects of drift, mutation, immigration, selection, and population subdivision. Conserv. Biol. 1:143-158.
NEI, M. 1976. Mathematical models of speciation and genetic distance. Pp. 723-765 in S. KARLIN and E. NEVO, eds. Population genetics and ecology. Academic Press, New York.
-. 1987. Molecular evolutionary genetics. Columbia University Press, New York.
NEI, M., and A. K. ROYCHOUDHURY. 1993. Evolutionary relationships of human populations on a global scale. Mol. Biol. Evol. 10:927-943.
REYNOLDS, J., B. S. WEIR, and C. C. COCKERHAM. 1983. Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105:767-779.
ROHLF, F. J., and M. C. WOOTEN. 1988. Evaluation of the restricted maximum-likelihood method for estimating phylogenetic trees using simulated allele-frequency data. Evolution 42:581-595.
SAITOU, N., and M. NEI. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
SWOFFORD, D. L., and G. J. OLSEN. 1991. Phylogeny reconstruction. Pp. 411-501 in D. M. HILLIS and C. MORITZ, eds. Molecular systematics. Sinauer, Sunderland, Mass.
WEIR, B. S. 1990. Genetic data analysis. Sinauer, Sunderland, Mass.
WEIR, B. S., and C. J. BASTEN. 1990. Sampling strategies for distances between DNA sequences. Biometrics 46:551-582.
WRIGHT, S. 1943. Isolation by distance. Genetics 28:114-138.
-. 1951. The genetical structure of populations. Ann. Eugenics 15:323-354.
-. 1965. The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19:395-420.

NAOYUKI TAKAHATA, reviewing editor

Received February 8, 1994

Accepted June 10, 1994