
Copyright © 1994 by the Society of America

Estimating Effective Population Size or Mutation Rate Using the Frequencies of Mutations of Various Classes in a Sample of DNA Sequences

Yun-xin FU Center for Demographic and , University of Texas, Houston, Texas 77225 Manuscript received February 18, 1994 Accepted for publication August 27, 1994

ABSTRACT Mutations resulting in segregating sites of a sample of DNA sequences can be classified by size and type, and the frequencies of mutations of different sizes and types can be inferred from the sample. A framework for estimating the essential parameter $\theta = 4N\mu$ utilizing the frequencies of mutations of various sizes and types is developed in this paper, where $N$ is the effective size of a population and $\mu$ is the mutation rate per sequence per generation. The framework is a combination of coalescent theory, general linear model and Monte-Carlo integration, which leads to two new estimators $\hat\theta_\xi$ and $\hat\theta_\eta$ as well as a general Watterson's estimator $\hat\theta_K$ and a general Tajima's estimator $\hat\theta_\pi$. The greatest strength of the framework is that it can be used under a variety of population models. The properties of the framework and the four estimators $\hat\theta_K$, $\hat\theta_\pi$, $\hat\theta_\xi$ and $\hat\theta_\eta$ are investigated under three important population models: the neutral Wright-Fisher model, the neutral model with recombination and the neutral Wright's finite-islands model. Under all these models, it is shown that $\hat\theta_\xi$ is the best estimator among the four even when recombination rate or migration rate has to be estimated. Under the neutral Wright-Fisher model, it is shown that the new estimator $\hat\theta_\xi$ has a variance close to a lower bound of variances of all unbiased estimators of $\theta$, which suggests that $\hat\theta_\xi$ is a very efficient estimator.

An important parameter in studying the evolution of a DNA region (locus) of a population is $\theta = 4N\mu$, where $N$ is the effective size of the population and $\mu$ is the mutation rate per sequence per generation. From the value of $\theta$, the effective population size $N$ can be obtained if the mutation rate $\mu$ is known, or vice versa.

Until recently inferences about $\theta$ have been based largely on two quantities. One is the sequence diversity $\pi$, which is the average number of nucleotide differences per pair of sequences; the other is the number $K$ of segregating sites (polymorphic sites). These two quantities lead to, respectively, TAJIMA'S estimate $\hat\pi$ of $\theta$ and WATTERSON'S estimate $\hat K$ of $\theta$ as follows

$$\hat\pi = \pi, \qquad \hat K = \frac{K}{a_n}$$

where

$$a_n = 1 + \frac{1}{2} + \cdots + \frac{1}{n-1} \tag{1}$$

and $n$ is the sample size. Although the computations of these two estimators are easy, they are not very efficient estimators; in particular, the variance of $\hat\pi$ does not diminish with increasing sample size. The inefficiency of these estimators and the surge of DNA polymorphism data have been stimulating the development of new methods of estimating $\theta$ that make better use of the information in a sample. Phylogenetic information is very useful in developing more accurate estimators (FELSENSTEIN 1992; FU 1994a). For example, FU (1994a) showed that a phylogenetic estimator has a variance close to the minimum variance of all possible unbiased estimators of $\theta$ under the neutral Wright-Fisher model. That is, the population under study evolves according to the Wright-Fisher model, all mutations are selectively neutral and there is no recombination and no population subdivision.

Natural populations are usually more complex than described by the neutral Wright-Fisher model. For example, they are often subdivided into small local populations with migrations among them; an autosomal DNA region studied may be very large or consist of several separate regions so that recombinations cannot be neglected; some mutations may not be neutral, or mutation rates for different regions of a locus may differ. Estimating $\theta$ under models other than the neutral Wright-Fisher model is often necessary for detailed analysis of polymorphism data. Although WATTERSON'S estimator $\hat K$ and TAJIMA'S estimator $\hat\pi$ can be modified to provide estimates of $\theta$ under various models, they are likely to be inefficient, as they are under the neutral Wright-Fisher model. On the other hand, phylogenetic methods such as FU (1994a) and FELSENSTEIN (1992) are difficult to extend because they require detailed properties of the branches of a genealogy, which are poorly understood under other models. In this paper, I present a framework for estimating $\theta$ which is not only very efficient but can be used under a variety of models provided that additional parameters (if any) can be estimated roughly. The efficiency of the framework is demonstrated using three models: the neutral Wright-Fisher model, the neutral model with recombination and the Wright's finite-islands model.
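The two classical estimators defined above can be illustrated with a short sketch. It assumes the sample is given as a list of aligned, equal-length strings; the function names and the toy sample are hypothetical and serve only to show how $K$, $\pi$ and $a_n$ of Equation 1 are obtained.

```python
from itertools import combinations


def harmonic_a(n):
    """a_n = 1 + 1/2 + ... + 1/(n-1), as in Equation 1."""
    return sum(1.0 / i for i in range(1, n))


def watterson_and_tajima(seqs):
    """Return (K_hat, pi_hat) for a list of aligned sequences (assumed input format)."""
    n = len(seqs)
    length = len(seqs[0])
    # K: number of segregating (polymorphic) sites.
    k = sum(1 for s in range(length) if len({seq[s] for seq in seqs}) > 1)
    # pi: average number of nucleotide differences per pair of sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    return k / harmonic_a(n), pi


# Hypothetical toy sample of four sequences with two segregating sites.
sample = ["AATGC", "AATGA", "ACTGA", "ACTGC"]
k_hat, pi_hat = watterson_and_tajima(sample)
print(k_hat, pi_hat)
```

For this toy sample $K = 2$ and $a_4 = 11/6$, so Watterson's estimate is $12/11$, while the six pairwise comparisons contain eight differences in total, giving Tajima's estimate $4/3$.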

Genetics 138: 1375-1386 (December, 1994)

FIGURE 1. The variants of mutations in a genealogy where each circle represents a mutation.

FREQUENCIES OF MUTATIONS OF VARIOUS CLASSES

Consider a random sample of $n$ sequences from a population. Corresponding to each homologous site of the sequences, there is a genealogy connecting the $n$ sequences (nucleotides) to their most recent common ancestor, and the genealogy consists of $2(n-1)$ branches. A branch is said to be of size $i$ if exactly $i$ sequences in the sample are descendants of the branch. A mutation that occurred in a branch of size $i$ is said to be of size $i$, or an $i$-mutation. Therefore, mutations leading to segregating sites in a sample can be classified into $n-1$ sizes. An illustration of the definition is given in Figure 1.

Let $\xi_i$ be the summation of $i$-mutations over all the sites and $\xi$ be a vector given by

$$\xi = (\xi_1, \ldots, \xi_{n-1})^T \tag{2}$$

where $T$ stands for transpose. The vector $\xi$ is a primary source of information that will be utilized in this study to estimate $\theta$. When the infinite-sites model is assumed and an outgroup sequence is available, the value of $\xi$ can be inferred directly from the sequences of a sample. This is because a mutation results in a segregating site of two segregating nucleotides, and the frequency of the nucleotide that is not the same as that of the outgroup sequence must be the size of the mutation; otherwise it would contradict the infinite-sites model. However, when the infinite-sites model does not hold and there is no outgroup sequence available, the value of $\xi$ has to be inferred by reconstructing a genealogy of the sample; when the sequences are not sufficiently long or there are recombinations, the inferred value of $\xi$ will often contain errors. Although we wish to address all important issues related to the estimation of $\theta$, we shall assume as an initial investigation that the value of $\xi$ is known for a given sample.

Another vector of information that will be utilized to estimate $\theta$ is

$$\eta = (\eta_1, \ldots, \eta_{[n/2]})^T \tag{3}$$

where $[n/2]$ denotes the largest integer contained in $n/2$ and the $i$-th element $\eta_i$ of $\eta$ is given by

$$\eta_i = \frac{\xi_i + \xi_{n-i}}{1 + \delta_{i,n-i}} \tag{4}$$

where $\delta_{i,n-i}$ is the Kronecker delta, i.e., it is equal to 1 if $i = n - i$ and 0 otherwise. Under the infinite-sites model, $\eta_i$ is simply the number of segregating sites at which the frequencies of the two segregating nucleotides are $i$ and $n - i$ ($i \le n - i$) respectively. Such a segregating site is said to be of type $i$, or an $i$-segregating site. We thus call $\eta_i$ the frequency of $i$-segregating sites. It is easy to see that when the infinite-sites model holds, the value of $\eta$ can be obtained directly from a sample without the help of an outgroup sequence. For this reason, it is easier to use an estimator based on $\eta$ than an estimator based on $\xi$, although the former is inferior to the latter, as shall be demonstrated later.

BEST LINEAR ESTIMATORS OF $\theta$ FROM $\xi$ AND $\eta$

Because the number of segregating sites $K$ is equal to $\xi_1 + \cdots + \xi_{n-1}$ and to $\eta_1 + \cdots + \eta_{[n/2]}$, WATTERSON'S estimator $\hat K$ can be written as

$$\hat K = \frac{1}{a_n}\sum_{i=1}^{n-1}\xi_i = \frac{1}{a_n}\sum_{i=1}^{[n/2]}\eta_i \tag{5}$$

where $a_n$ is given by (1). Therefore, $\hat K$ is a linear function of $\xi_1, \ldots, \xi_{n-1}$, or a linear function of $\eta_1, \ldots, \eta_{[n/2]}$. Since a mutation of size $i$ is counted in $i(n-i)$ pairwise comparisons, TAJIMA'S estimator $\hat\pi$ can be written as

$$\hat\pi = \binom{n}{2}^{-1}\sum_{i=1}^{n-1} i(n-i)\,\xi_i = \binom{n}{2}^{-1}\sum_{i=1}^{[n/2]} i(n-i)\,\eta_i \tag{6}$$

which is also a linear function of $\xi_1, \ldots, \xi_{n-1}$, or of $\eta_1, \ldots, \eta_{[n/2]}$. One common feature of these linear estimators is that their coefficients, that is $1/a_n, \ldots, 1/a_n$ for $\hat K$ and $2i(n-i)/n(n-1)$ $(i = 1, \ldots, n-1)$ for $\hat\pi$, are predetermined constants. A linear estimator with predetermined coefficients is not in general a best linear estimator. To demonstrate this, we consider a sample of only three sequences. There is only one topology for three sequences (Figure 2) and $\xi = (\xi_1, \xi_2)$, where $\xi_1$ is the sum of the numbers of mutations in all the three external branches and $\xi_2$ is the number of mutations in the internal branch. From KINGMAN'S (1982a,b) coalescent theory, it is easy to show [for example, FU and LI (1993b)] that

$$E(\xi_1) = \theta, \qquad E(\xi_2) = \tfrac{1}{2}\theta$$

$$\mathrm{Var}(\xi_1) = \theta + \tfrac{1}{2}\theta^2, \qquad \mathrm{Var}(\xi_2) = \tfrac{1}{2}\theta + \tfrac{1}{4}\theta^2$$

$$\mathrm{Cov}(\xi_1, \xi_2) = \tfrac{1}{4}\theta^2$$
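The relationship between the two frequency vectors, and the fact that both classical estimators are linear in them, can be sketched as follows. The function names and the example vector are hypothetical; the folding rule assumes that $\eta_i$ pools mutations of size $i$ and $n-i$, with the middle class counted once when $i = n - i$.

```python
from math import comb


def fold(xi):
    """Fold xi = (xi_1..xi_{n-1}) into eta = (eta_1..eta_[n/2])."""
    n = len(xi) + 1
    eta = []
    for i in range(1, n // 2 + 1):
        delta = 1 if i == n - i else 0  # Kronecker delta
        eta.append((xi[i - 1] + xi[n - i - 1]) / (1 + delta))
    return eta


def k_hat_from_xi(xi):
    """Watterson's estimator as a linear function of xi (equal weights 1/a_n)."""
    n = len(xi) + 1
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(xi) / a_n


def pi_hat_from_xi(xi):
    """Tajima's estimator as a linear function of xi (weights 2i(n-i)/n(n-1))."""
    n = len(xi) + 1
    return sum(i * (n - i) * x for i, x in enumerate(xi, start=1)) / comb(n, 2)


def pi_hat_from_eta(eta, n):
    """The same estimator written in terms of the folded frequencies."""
    return sum(i * (n - i) * e for i, e in enumerate(eta, start=1)) / comb(n, 2)


xi = [3, 1, 1, 0, 1]  # hypothetical counts of 1- to 5-mutations, n = 6
eta = fold(xi)
print(eta, k_hat_from_xi(xi), pi_hat_from_xi(xi), pi_hat_from_eta(eta, 6))
```

Both forms of Tajima's estimator give $37/15$ for this vector, and the folded frequencies sum to the same number of segregating sites as the unfolded ones.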

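FIGURE 2. A genealogy of three sequences. $\xi_1$ is the sum of the numbers of mutations in branches 1, 2 and 3, while $\xi_2$ is the number of mutations in branch 4. See text for the explanation of $t_1$ and $t_2$.

Consider the following linear function of $\xi_1$ and $\xi_2$

$$f = a\xi_1 + b\xi_2 \tag{7}$$

In order for $f$ to be an unbiased estimator of $\theta$, it must satisfy

$$E(f) = a\theta + (b/2)\theta = \theta$$

Therefore, $b = 2(1 - a)$. Substituting $2(1 - a)$ for $b$ in (7), we obtain the variance of $f$ as

$$\mathrm{Var}(f) = a^2\mathrm{Var}(\xi_1) + 4(1-a)^2\mathrm{Var}(\xi_2) + 4a(1-a)\mathrm{Cov}(\xi_1,\xi_2) = [a^2 + 2(1-a)^2]\theta + [\tfrac{1}{2}a^2 + (1-a)^2 + a(1-a)]\theta^2$$

It is easy to show that the value of $a$ for the most efficient estimator, that is, the one with smallest variance, must be $(4 + \theta)/(6 + \theta)$. Therefore,

$$f = \frac{4+\theta}{6+\theta}\,\xi_1 + \frac{4}{6+\theta}\,\xi_2 \tag{8}$$

is the best linear unbiased estimator of $\theta$, and the coefficients of $f$ are not predetermined. Equation 8 suggests that to have an optimal estimating scheme, we should give more weight to $\xi_1$ than to $\xi_2$ when there are many segregating sites in a sample (indicating that $\theta$ is large) and we should give about equal weights to both $\xi_1$ and $\xi_2$ when there are few segregating sites (indicating that $\theta$ is small). Nevertheless, we should not fix the coefficients in advance.

Similar analyses can be made for samples of more than three sequences, and a general framework by FU (1994a) can be applied. The framework was developed for a vector of variables whose means are linear functions of $\theta$ and whose variances and covariances are quadratic functions of $\theta$. In the next section, we shall show that

$$E(\xi) = \theta\alpha \tag{9}$$

$$\mathrm{Var}(\xi) = \theta D_\alpha + \theta^2\Sigma \tag{10}$$

where $\alpha = (\alpha_1, \ldots, \alpha_{n-1})$; $D_\alpha = \mathrm{diag}(\alpha_1, \ldots, \alpha_{n-1})$, i.e., a matrix whose diagonal elements are $\alpha_1, \ldots, \alpha_{n-1}$ and all non-diagonal elements are zero; and $\Sigma = \{\sigma_{ij}\}$, $i, j = 1, \ldots, n-1$; $\alpha_i$ and $\sigma_{ij}$ are all constants independent of $\theta$ once the model is given. With some modifications of the notation in FU (1994a), we have that the best linear unbiased estimator of $\theta$ from $\xi$ is given by

$$\hat\theta = \left[\alpha^T (D_\alpha + \theta\Sigma)^{-1}\alpha\right]^{-1}\alpha^T (D_\alpha + \theta\Sigma)^{-1}\xi \tag{11}$$

However, Equation 11 does not provide a direct estimate of $\theta$ because its computation requires the value of $\theta$, which is unknown. The problem can be circumvented by the following iteration procedure. Define a series

$$\theta_{k+1} = \left[\alpha^T (D_\alpha + \theta_k\Sigma)^{-1}\alpha\right]^{-1}\alpha^T (D_\alpha + \theta_k\Sigma)^{-1}\xi \tag{12}$$

and assign an arbitrary non-negative value to $\theta_0$. Then the limit $\hat\theta_\xi$ of the series is taken as our estimate of $\theta$. $\hat\theta_\xi$ will be referred to as the BLUE (best linear unbiased estimator) of $\theta$ from $\xi$ and the estimation procedure as the BLUE procedure, though strictly speaking the estimator is not a linear function in $\xi$ because of the iteration.

From the definition of $\eta$ and Equations 9 and 10, it is simple to see that

$$E(\eta) = \theta\beta \tag{13}$$

$$\mathrm{Var}(\eta) = \theta D_\beta + \theta^2\Gamma \tag{14}$$

where $D_\beta = \mathrm{diag}(\beta_1, \ldots, \beta_{[n/2]})$, $\Gamma = \{\gamma_{ij}\}$ and

$$\beta_i = \frac{\alpha_i + \alpha_{n-i}}{1 + \delta_{i,n-i}}, \qquad \gamma_{ij} = \frac{\sigma_{ij} + \sigma_{i,n-j} + \sigma_{n-i,j} + \sigma_{n-i,n-j}}{(1 + \delta_{i,n-i})(1 + \delta_{j,n-j})}$$

The BLUE $\hat\theta_\eta$ of $\theta$ from $\eta$ is the limit of the series

$$\theta_{k+1} = \left[\beta^T (D_\beta + \theta_k\Gamma)^{-1}\beta\right]^{-1}\beta^T (D_\beta + \theta_k\Gamma)^{-1}\eta$$

and $\theta_0$ can be any non-negative value.

From (9), we have that

$$E(K) = \theta\sum_{i=1}^{n-1}\alpha_i, \qquad E(\pi) = \theta\binom{n}{2}^{-1}\sum_{i=1}^{n-1} i(n-i)\,\alpha_i$$

Therefore we define a general WATTERSON estimator $\hat\theta_K$ and a general TAJIMA estimator $\hat\theta_\pi$ as

$$\hat\theta_K = \frac{K}{\sum_i \alpha_i}, \qquad \hat\theta_\pi = \frac{\binom{n}{2}\,\pi}{\sum_i i(n-i)\,\alpha_i}$$

Under the neutral Wright-Fisher model we have that $\hat\theta_K = \hat K$ and $\hat\theta_\pi = \hat\pi$, which are also true under the neutral model with recombination, as will be shown later.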
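The iteration behind the BLUE procedure can be sketched in a few lines. This is a minimal illustration, not the author's program: it assumes the generalized-least-squares weights are proportional to $(D_\alpha + \theta\Sigma)^{-1}\alpha$ (the common factor $\theta$ in the variance cancels), uses a small hand-written linear solver to stay self-contained, and takes the three-sequence constants $\alpha = (1, 1/2)$ and $\Sigma$ from the moments given for $\xi_1$ and $\xi_2$. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def blue_coefficients(theta, alpha, sigma):
    """Weights u with u.alpha = 1 that minimise the variance at a given theta."""
    n = len(alpha)
    # V is Var(xi) divided by theta, so it stays non-singular at theta = 0.
    v = [[(alpha[i] if i == j else 0.0) + theta * sigma[i][j] for j in range(n)]
         for i in range(n)]
    z = solve(v, alpha)  # z = V^{-1} alpha
    denom = sum(zi * ai for zi, ai in zip(z, alpha))
    return [zi / denom for zi in z]


def blue_estimate(xi, alpha, sigma, theta0=1.0, tol=1e-10, max_iter=1000):
    """Iterate theta_{k+1} = u(theta_k) . xi until successive values agree."""
    theta = theta0
    for _ in range(max_iter):
        u = blue_coefficients(theta, alpha, sigma)
        new = sum(ui * x for ui, x in zip(u, xi))
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta


# Constants for a sample of three sequences under the neutral Wright-Fisher model.
alpha3 = [1.0, 0.5]
sigma3 = [[0.5, 0.25], [0.25, 0.25]]
print(blue_coefficients(2.0, alpha3, sigma3))      # compare with (4+theta)/(6+theta), 4/(6+theta)
print(blue_estimate([3, 1], alpha3, sigma3))       # hypothetical xi = (3, 1)
```

For three sequences the weights reduce to $(4+\theta)/(6+\theta)$ and $4/(6+\theta)$, so the iteration for $\xi = (3, 1)$ converges to the positive root of $\theta^2 + 3\theta - 16 = 0$.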

When $\theta$ is close to zero, it is easy to see from (11) or (12) that

$$\hat\theta_\xi \approx \frac{\sum_i \xi_i}{\sum_i \alpha_i} = \hat\theta_K$$

It thus follows that when $\theta$ is small, the general WATTERSON estimator is approximately the best linear estimator of $\theta$.

Treating the vector of the coefficients of $\hat\theta_\xi$,

$$u^T = \left[\alpha^T (D_\alpha + \hat\theta_\xi\Sigma)^{-1}\alpha\right]^{-1}\alpha^T (D_\alpha + \hat\theta_\xi\Sigma)^{-1}$$

as a vector of constants as was done in FU (1994a), we have that

$$\mathrm{Var}(\hat\theta_\xi) = a_\xi\theta + b_\xi\theta^2$$

where $a_\xi = u^T D_\alpha u$ and $b_\xi = u^T\Sigma u$. It is easy to see that a nearly unbiased estimate of the variance of $\hat\theta_\xi$ is

$$V_\xi = a_\xi\hat\theta_\xi + b_\xi\,\frac{\hat\theta_\xi^2 - a_\xi\hat\theta_\xi}{1 + b_\xi} \tag{19}$$

Similarly, a nearly unbiased estimate of the variance of $\hat\theta_\eta$ is given by

$$V_\eta = a_\eta\hat\theta_\eta + b_\eta\,\frac{\hat\theta_\eta^2 - a_\eta\hat\theta_\eta}{1 + b_\eta} \tag{20}$$

where $a_\eta = v^T D_\beta v$, $b_\eta = v^T\Gamma v$ and $v$ is the vector of the coefficients of $\hat\theta_\eta$ given by

$$v^T = \left[\beta^T (D_\beta + \hat\theta_\eta\Gamma)^{-1}\beta\right]^{-1}\beta^T (D_\beta + \hat\theta_\eta\Gamma)^{-1}$$

ESTIMATION OF THE MEAN AND VARIANCE OF $\xi$ AND $\eta$

Consider a sample of $n$ sequences. The time interval between the moment at which the sample is taken and the time representing the most recent common ancestor of the $n$ sequences can be divided into a number of periods by the events that occurred. In this paper, we mean, by an event, a coalescence, a recombination or a migration, but not a mutation. For convenience of notation, we treat the moment at which the sample is taken as an event. Let the events be numbered according to their order of occurrence and $t_k$ be the time length (in terms of the number of generations) between the $k$th event and the $(k+1)$th event. Under the neutral Wright-Fisher model, there are only coalescent events, so $t_k$ represents the time length for the period during which the sample has $k + 1$ ancestral sequences; therefore $t_k$ is a $(k+1)$-coalescent time.

Suppose there are in total $L$ sites in each sequence. Each site can be regarded as a locus and there is a genealogy for each site which connects the $n$ nucleotides at the site to their most recent common ancestor. Consider the genealogy of the $l$th site. We can see that the time length of a branch of the genealogy must be of the form

$$t_i + \cdots + t_{j-1}$$

where $i$ is the number of the event after which the branch starts and $j$ is the number of the event at which the branch ends. For the genealogy of the $l$th site, the length $l_k^{(l)}$ of the branches of size $k$ is given by

$$l_k^{(l)} = s_{k1}^{(l)} t_1 + \cdots + s_{kp}^{(l)} t_p$$

where $s_{ki}^{(l)}$ is an index variable representing the number of times $t_i$ appears in the time lengths of branches of size $k$, and $p$ is the total number of events. The average time length of branches of size $k$ over all sites is therefore

$$l_k = \frac{1}{L}\sum_{l=1}^{L} l_k^{(l)} = \sum_{i=1}^{p} s_{ki} t_i \qquad \text{where} \qquad s_{ki} = \frac{1}{L}\sum_{l=1}^{L} s_{ki}^{(l)}$$

When there is no recombination, all the $L$ genealogies are the same and consequently $s_{ki} = s_{ki}^{(1)} = \cdots = s_{ki}^{(L)}$. For example, we have for the genealogy in Figure 2 that $s_{12} = 3$, $s_{11} = 1$, $s_{22} = 0$ and $s_{21} = 1$.

Because $\mu$ is defined as the mutation rate per sequence per generation, the mutation rate per site per generation is therefore $\mu/L$. Assume that the number of mutations in a branch of the genealogy of a site is a Poisson variable with parameter $l\mu/L$, where $l$ is the length of the branch. Then the number of $k$-mutations in all the $L$ genealogies is a Poisson variable with parameter $(\mu/L)\sum_l l_k^{(l)} = l_k\mu$. In other words, we assume that, given $l_k$, $\xi_k$ is a Poisson variable with mean $l_k\mu$.
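As a rough numerical companion to this setup, the sketch below simulates neutral Wright-Fisher genealogies and records the total length of branches of each size. It is a simplified variant, not necessarily the author's procedure: it assumes no recombination, measures time in units of $4N$ generations (so that a period with $m$ ancestral sequences has mean length $1/(m(m-1))$), and relies on the Poisson assumption, under which the mean and covariance of the scaled branch lengths play the roles of $\alpha$ and $\Sigma$ in $E(\xi) = \theta\alpha$ and $\mathrm{Var}(\xi) = \theta D_\alpha + \theta^2\Sigma$. Function names and the seed are hypothetical.

```python
import random


def branch_lengths(n, rng):
    """Simulate one genealogy of n sequences; return total length of size-k branches.

    Element k-1 of the result is the summed length of all branches of size k,
    in units of 4N generations.
    """
    sizes = [1] * n                 # sampled descendants of each ancestral lineage
    total = [0.0] * (n - 1)
    while len(sizes) > 1:
        m = len(sizes)
        t = rng.expovariate(m * (m - 1))   # waiting time to the next coalescence
        for s in sizes:
            total[s - 1] += t              # every current lineage grows by t
        i, j = rng.sample(range(m), 2)     # the two lineages that coalesce
        merged = sizes[i] + sizes[j]
        sizes = [s for idx, s in enumerate(sizes) if idx not in (i, j)]
        sizes.append(merged)
    return total


def estimate_alpha_sigma(n, g, seed=1):
    """Monte-Carlo mean vector and covariance matrix of the scaled branch lengths."""
    rng = random.Random(seed)
    draws = [branch_lengths(n, rng) for _ in range(g)]
    alpha = [sum(d[i] for d in draws) / g for i in range(n - 1)]
    sigma = [[sum(d[i] * d[j] for d in draws) / g - alpha[i] * alpha[j]
              for j in range(n - 1)] for i in range(n - 1)]
    return alpha, sigma


alpha_hat, sigma_hat = estimate_alpha_sigma(3, 20000)
print(alpha_hat, sigma_hat)
```

For three sequences the estimates should be close to $\alpha = (1, 1/2)$ and $\sigma_{11} = 1/2$, $\sigma_{12} = \sigma_{22} = 1/4$, the constants implied by the moments of $\xi_1$ and $\xi_2$ for the three-sequence genealogy.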

TABLE 1

Variances of four estimators under the neutral Wright-Fisher model

Columns: $\theta$, $n$, $V_{\min}$, sv($\hat\theta_\pi$), sv($\hat\theta_K$), sv($\hat\theta_\xi$), sv($\hat\theta_\eta$), $v_\xi$, $v_\eta$

2 2.48 5 2.11 2.28 2.21 2.28 1.87 1.76 10 1.32 1.94 1.50 1.48 1.42 1.39 1.46 20 0.93 0.981.72 1.12 1.07 1.02 1.04 50 0.66 1.63 0.740.78 0.70 0.74 0.80 300 0.42 1.54 0.450.48 0.43 0.47 0.45 5 5 5 11.75 9.16 10.6510.59 9.73 9.60 9.54 10 5.16 9.10 6.416.59 5.60 5.66 6.53 20 3.35 7.98 4.56 3.63 4.33 4.33 3.67 50 2.18 7.65 3.17 2.38 2.862.84 2.38 300 1.25 7.30 1.84 1.33 1.49 1.51 1.32 10 5 10 31.00 41.83 37.71 32.50 37.44 32.43 36.04 10 16.16 32.21 23.02 17.70 22.33 17.67 22.17 20 9.68 28.67 15.48 10.79 14.27 10.91 14.26 50 5.77 26.96 10.31 6.48 5.85 6.53 8.79 300 2.96 300 25.39 4.1225.73 3.25 4.04 3.23 5 112.2220 5 157.28 141.46 117.00 140.44 116.88 139.15 10 54.96 121.83 84.94 59.79 81.26 59.64 80.65 20 30.50 107.08 56.58 34.51 51.35 34.68 51.01 50 16.38 50 100.52 37.24 19.23 29.58 19.23 29.41 300 7.28 95.83 7.28 300 19.79 8.44 11.85 8.43 11.87 5 655.9550 5 943.55 841.26 672.60 832.82 672.01 834.17 10 304.88 10 719.55 500.64 329.28 479.03 328.65 477.61 20 156.57 638.37 330.25 179.90 294.36 179.45 294.23 50 73.73 50 603.1 1 214.87 92.84 166.87 92.74 164.16 300 25.85 571.70 25.85 300 056.29 33.99 112.5157.44 34.25

Note: Results are based on 50,000 simulated samples for each combination of $\theta$ and $n$. sv, sampling variance. $v_\xi$ and $v_\eta$ are, respectively, the mean of $V_\xi$ given by Equation 19 and the mean of $V_\eta$ given by Equation 20.
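The $V_{\min}$ entries of Table 1 can be checked numerically. The sketch below assumes the lower bound has the form $\theta / \sum_{k=1}^{n-1} 1/(k+\theta)$, attributed to FU and LI (1993a); the formula is not legible in the text as extracted here, so it should be read as an assumption, although it reproduces the tabulated values that can still be read off unambiguously. The function name is hypothetical.

```python
def v_min(theta, n):
    """Assumed lower bound on the variance of unbiased estimators of theta
    under the neutral Wright-Fisher model: theta / sum_{k=1}^{n-1} 1/(k + theta)."""
    return theta / sum(1.0 / (k + theta) for k in range(1, n))


# Compare with legible V_min entries of Table 1.
for theta, n, tabulated in [(2, 10, 1.32), (10, 10, 16.16), (50, 50, 73.73), (50, 300, 25.85)]:
    print(theta, n, round(v_min(theta, n), 2), tabulated)
```

Because the sum in the denominator grows only logarithmically with $n$, the bound decreases slowly with sample size, which is consistent with the pattern of the $V_{\min}$ column.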

Suppose there are in total $T$ genealogies for a sample of $n$ sequences and let $w_i^{(k)}$, $\phi_{ij}^{(k)}$ be the $w_i$, $\phi_{ij}$ defined above for genealogy $k$ and $p_k$ be the probability of observing genealogy $k$. Define

$$\alpha_i = \sum_{k=1}^{T} w_i^{(k)} p_k \tag{24}$$

$$\sigma_{ij} = \sum_{k=1}^{T}\left(\phi_{ij}^{(k)} + w_i^{(k)} w_j^{(k)}\right) p_k - \alpha_i\alpha_j \tag{25}$$

Then, it is easy to see that the mean and variance of $\xi$ are given respectively by (9) and (10). Since $T$ is a very large number for even a modest sample size and $p_k$ is not always easy to compute, it is impractical to obtain $\alpha$ and $\Sigma$ by examining all possible genealogies. To date, analytical solutions for $\alpha$ and $\Sigma$ are available only for the neutral Wright-Fisher model (FU 1994b). Analytical solutions for $\alpha$ and $\Sigma$ simplify computations tremendously, but when they are not available, the values of $\alpha$ and $\Sigma$ have to be estimated, which can be done as follows.

Suppose an algorithm is available to generate genealogies of samples under a given population model and let $G$ be the total number of genealogies randomly generated. Then according to the standard theory of Monte-Carlo integration [for example, HAMMERSLEY and HANDSCOMB (1965)], one can estimate $\alpha_i$ and $\sigma_{ij}$ by

$$\hat\alpha_i = \frac{1}{G}\sum_{k=1}^{G} w_i^{(k)}, \qquad \hat\sigma_{ij} = \frac{1}{G}\sum_{k=1}^{G}\left(\phi_{ij}^{(k)} + w_i^{(k)} w_j^{(k)}\right) - \hat\alpha_i\hat\alpha_j$$

respectively. Once the estimate $\hat\alpha$ of $\alpha$ and the estimate $\hat\Sigma$ of $\Sigma$ are obtained, we can obtain the estimates of $\beta$ and $\Gamma$. The accuracies of these estimations obviously should increase with the number of genealogies examined. My experience suggests that a number $G$ of genealogies in the proximity of 10,000 is usually sufficient for the purpose of estimating $\theta$. Because algorithms based on coalescent theory are usually very efficient, the need to estimate $\alpha$ and $\Sigma$ does not pose a serious burden of computation.

THE NEUTRAL WRIGHT-FISHER MODEL

The neutral Wright-Fisher model is the simplest model in coalescent theory and is often selected to be the null model in studying DNA polymorphisms. Because the lower bound of the variance of all unbiased estimators of $\theta$ under this model is known to be

$$V_{\min} = \theta\Big/\sum_{k=1}^{n-1}\frac{1}{k+\theta}$$

(FU and LI 1993a) and because $\alpha$ and $\Sigma$ are known analytically (FU 1994b), we can obtain the efficiencies of

FIGURE 3. Coefficients of $\hat\theta_\xi$ (panels a and c) and $\hat\theta_\eta$ (panels b and d) as functions of $\theta$. The sample size $n$ is 10 for panels a and b, and 50 for panels c and d. In panels a and c the curves from top down are $u_1, u_2, \ldots$, respectively, and in panels b and d the curves from top down are $v_1, v_2, \ldots$, respectively.

estimators $\hat\theta_K$, $\hat\theta_\pi$, $\hat\theta_\xi$ and $\hat\theta_\eta$ by comparing their variances to the lower bound $V_{\min}$.

Simulated samples were used to measure the performances of the four estimators. For given values of $\theta$ and sample size $n$, we generated a large number of samples using a straightforward coalescent algorithm (KINGMAN 1982a,b; HUDSON 1983; TAJIMA 1983); the values of the vectors $\xi$ and $\eta$ from each sample were then used to obtain $\hat\theta_K$, $\hat\theta_\pi$, $\hat\theta_\xi$ and $\hat\theta_\eta$ with both analytical values and estimated values of $\alpha$ and $\Sigma$. The results using analytical $\alpha$ and $\Sigma$ are summarized in Table 1.

Table 1 shows that the variance of $\hat\theta_\xi$ is only marginally larger than $V_{\min}$, suggesting that $\hat\theta_\xi$ is a very efficient estimator of $\theta$. Comparing the variance of $\hat\theta_\xi$ to that of the estimator UPBLUE [Table 3 of FU (1994a)], we find that $\hat\theta_\xi$ is slightly less efficient than UPBLUE of $\theta$. This is expected because UPBLUE uses more information than $\hat\theta_\xi$. Nevertheless, $\hat\theta_\xi$ is substantially better than WATTERSON'S estimator $\hat K$ and TAJIMA'S estimate $\hat\pi$. The latter always has the largest variance among the four estimators considered. The percentage of variance reduction by $\hat\theta_\xi$ over $\hat K$ increases with the value of $\theta$ and sample size $n$. In comparison, $\hat\theta_\eta$ is only marginally better than $\hat\theta_K$ when sample size is small but becomes considerably better than $\hat\theta_K$ when sample size and $\theta$ are both large. Table 1 also shows that $V_\xi$ given by (19) and $V_\eta$ given by (20) are nearly unbiased estimators of $\mathrm{Var}(\hat\theta_\xi)$ and $\mathrm{Var}(\hat\theta_\eta)$ respectively.

The performances of $\hat\theta_\xi$ and $\hat\theta_\eta$ using estimated values of $\alpha$ and $\Sigma$ (data not shown) are found to be almost indistinguishable from those using analytical results of $\alpha$ and $\Sigma$, as long as a reasonably large number of genealogies (for example 10,000) is used to estimate $\alpha$ and $\Sigma$. This indicates that the BLUE procedure proposed in this paper is a powerful Monte-Carlo method for estimating $\theta$.

We also examined the coefficient vectors $u$ and $v$ to see the relative contributions of $\xi_i$ $(i = 1, \ldots, n-1)$ and $\eta_i$ $(i = 1, \ldots, [n/2])$ to $\hat\theta_\xi$ and $\hat\theta_\eta$. We have shown earlier that when $\theta$ is small, $\hat\theta_\xi$ is close to WATTERSON'S estimate, which gives the same weight to all the frequencies $\xi_1, \ldots, \xi_{n-1}$. On the other hand, when $\theta$ is

TABLE 2

Variances of four estimators under the neutral model with recombinations

n P sv(8,) sv(&) sv(Q sv(Q vc v? e=10 10 0.0 17.5631.97 21.6922.82 17.41 22.11 1.o 15.8925.72' 19.1118.84 16.35 18.41 26.74' 18.6919.27 16.05 18.61 16.22 10.0 12.4719.15 14.2414.18 12.61 13.99 13.78 10.6010.67 10.45 11.15 10.96 50 0.0 26.81 10.50 6.16 6.87 9.06 8.24 1.0 21.58 6.90 8.71 5.60 6.50 7.83 8.96 6.5722.58 8.96 7.96 5.85 7.70 10.0 5.48 15.85 6.69 6.15 5.01 5.89 11.77 5.17 5.36 4.84 5.01 5.12 0 = 50 10 0.0 716.10328.34 495.55479.45 327.81 475.53 1.0 555.77292.66 395.04402.32 305.07 381.75 593.73299.67 413.02393.28 306.55 394.47 10.0 398.12216.62 283.30 276.72284.75 219.41 269.01184.07 199.08 194.71 194.53 206.04 50 0.0 598.3599.41 214.02 167.92 82.86 150.20 1.0 462.3997.45 172.48118.84 72.73 142.86 492.7998.08 179.08 74.66 146.37 139.10 10.0 327.22 10.0 80.11 95.60123.96 60.23 105.95 222.1373.09 76.5290.20 60.50 80.30

a Two-loci model. The infinite-sites model. Covariance matrix for each combination of $n$ and $R$ is estimated from 10,000 simulated samples. See the footnote to Table 1 for sv, $v_\xi$ and $v_\eta$.

extremely large, $\hat\theta_\xi$ is approximately

$$\left(\alpha^T\Sigma^{-1}\alpha\right)^{-1}\alpha^T\Sigma^{-1}\xi$$

while the coefficients for intermediate values of $\theta$ should lie between the two extremes. This is confirmed by Figure 3, where the coefficients of estimates for several sample sizes are plotted. The coefficients are very sensitive to $\theta$ when $\theta$ is small but become stable when $\theta$ is large.

Suppose $u_i$ and $v_i$ are the $i$th elements of the coefficient vectors $u$ and $v$, respectively. Then Figure 3 shows that for $i < j$, we have

$$u_i \ge u_j \quad \text{and} \quad v_i \ge v_j \tag{27}$$

This suggests that the information from $\xi_i$ (or $\eta_i$) is more reliable than the information from $\xi_j$ (or $\eta_j$) for the purpose of estimating $\theta$. This is indeed the case because although $\xi_i/\alpha_i$ $(i = 1, \ldots, n-1)$ and $\eta_i/\beta_i$ $(i = 1, \ldots, [n/2])$ all have the same expectation $\theta$, their variances have the relationships

$$\mathrm{Var}\left(\frac{\xi_i}{\alpha_i}\right) < \mathrm{Var}\left(\frac{\xi_j}{\alpha_j}\right), \qquad \mathrm{Var}\left(\frac{\eta_i}{\beta_i}\right) < \mathrm{Var}\left(\frac{\eta_j}{\beta_j}\right)$$

THE NEUTRAL MODEL WITH RECOMBINATIONS

When an autosomal locus studied is either very large or consists of several separate regions, recombinations cannot be neglected. The coalescent theory of the neutral model with recombination and algorithms to generate samples under this model have been developed by HUDSON (1983), HUDSON and KAPLAN (1985) and KAPLAN and HUDSON (1987). An event under this model is either a coalescence or a recombination. Suppose between two consecutive events there are $m$ ancestral sequences. HUDSON (1983) showed that the expected time length between the two events is

$$\frac{4N}{Rmc + m(m-1)}$$

where the recombination parameter $R = 4Nr$ and $r$ is the recombination rate per sequence per generation; $c$ $(0 \le c \le 1)$ is the average proportion of recombinable sites at which recombinations are relevant to the sample [see HUDSON (1983) for detail].

Since the coalescence process of a single site is iden-

TABLE 3

Variances of four estimators under the neutral Wright's finite-islands model

Columns: $\theta$, $M$, $n$, sampling scheme, sv($\hat\theta_\pi$), sv($\hat\theta_K$), sv($\hat\theta_\xi$), sv($\hat\theta_\eta$), $v_\xi$, $v_\eta$

Two islands
5    0.1   20   A     49.05     29.03     6.50     28.58     7.21     29.88
5    1.0   20   A     10.53      6.25     4.54      6.00     5.60      8.17
5    5.0   20   A      7.28      4.28     3.37      4.14     4.43      6.30
5    0.1   20   B     15.35     10.52     2.66      8.39     2.68      8.22
5    1.0   20   B      7.32      4.09     2.62      3.89     3.24      5.04
5    5.0   20   B      7.09      3.88     2.52      3.67     3.71      5.32
20   0.1   50   A    851.70    439.10    36.69    432.95    37.52    442.44
20   1.0   50   A    144.27     62.67    31.14     59.14    38.18     80.55
20   5.0   50   A    101.82     41.91    22.45     37.35    29.81     56.50
20   0.1   50   B    236.94    133.96    15.80     69.53    16.11     81.32
20   1.0   50   B    100.03     39.68    15.73     30.92    18.37     42.98
20   5.0   50   B     95.71     34.80    15.63     26.87    20.53     42.25

Five islands
5    0.1   20   A    114.19     68.84    32.99     68.86    36.95     73.34
5    1.0   20   A     16.49     11.03     9.46     11.00    12.23     17.25
5    5.0   20   A      8.09      4.98     3.66      4.92     5.14      8.16
5    0.1   20   B      8.50      6.54     3.29      5.80     4.72     10.36
5    1.0   20   B      6.79      3.75     1.95      3.39     2.87      6.06
5    5.0   20   B      6.18      3.27     1.94      3.00     2.84      5.20
20   0.1   50   A   1709.09   1052.06   360.56   1085.31   373.79   1227.03
20   1.0   50   A    253.37    126.98    86.01    124.31   111.24    205.75
20   5.0   50   A    117.48     55.45    31.35     53.59    44.27     90.85
20   0.1   50   B    133.40     95.90    30.76     86.94    40.33    127.38
20   1.0   50   B     99.82     41.75    14.91     31.65    18.91     55.11
20   5.0   50   B     92.79     33.54    13.79     25.45    17.87     42.23

Note: $\alpha$ and $\Sigma$ are estimated from 10,000 samples and the results in each row of the table are from 10,000 samples. A, extreme sampling scheme; B, balanced sampling scheme. Also see the footnote to Table 1 for sv, $v_\xi$ and $v_\eta$.

that all the islands have the same effective population size $N$ and that the overall migration rate $m$ is sufficiently small so that the probability that two or more individuals in a sample migrate in the same generation can be neglected. Also we assume that there is no recombination.

An event under the WRIGHT'S finite-islands model is either a coalescence or a migration. A coalescence event reduces by one the number of ancestral genes in the island where the coalescence event occurs, while a migration event reduces by one the number of ancestral genes in one island but increases by one the number of ancestral genes in another island, so the total number of ancestral genes remains the same. Let $M = 4Nm$ and $b_k$ $(k = 1, \ldots, d)$ be the number of ancestral genes in island $k$ between two consecutive events. STROBECK (1987) showed that the expected time length between the two events is

$$\frac{4N}{M\sum_k b_k + \sum_k b_k(b_k - 1)}$$

Algorithms for generating samples under the neutral Wright's finite-islands model were developed by STROBECK (1987) and SLATKIN and MADDISON (1989).

Since there is no recombination, all the sites in the sequences have the same genealogy. However, unlike the model with recombination, the expectation of the number of $i$-mutations is no longer the same as that under the neutral Wright-Fisher model. Therefore, $\hat\theta_K$ is no longer the same as $\hat K$, neither is $\hat\theta_\pi$ the same as $\hat\pi$. Because migrations tend to increase the time between two coalescent events, $\hat K$ and $\hat\pi$ both overestimate $\theta$.

We focus on the cases of two islands and five islands and consider two sampling schemes. The first is that all the sequences are taken from one island and the second is that a sample is taken from each island. These two sampling schemes will be referred to as the extreme-sampling scheme and the balanced-sampling scheme respectively, and consequently a sample from the extreme-sampling scheme and a sample from the balanced-sampling scheme will be referred to as an extreme sample and a balanced sample, respectively.

We again use simulated samples to evaluate the performances of the four estimators. Table 3 summarizes the results of simulations. It is obvious that $\hat\theta_\xi$ is again the best estimator among the four estimators, but all of them have smaller variances when migration rate is large than when migration rate is small. Similar to the neutral WRIGHT-FISHER model and the neutral model with recombination, the superiority of $\hat\theta_\xi$ is most evident when $\theta$ and sample size $n$ are both large, and in such cases $\hat\theta_\eta$ also shows considerable improvement over $\hat\theta_K$. $\hat\theta_\pi$ is again the worst estimator among the four.

Comparing the results for the cases of two islands and five islands, we find that under the extreme-sampling

TABLE 4

Variances of four estimators under the infinite-loci neutral model with recombination when $R$ has to be estimated

Columns: $\theta$, $R$, sv($\hat\theta_\pi$), sv($\hat\theta_K$), sv($\hat\theta_\xi$), sv($\hat\theta_\eta$)

$n = 20$ and $\hat R = 5$
10    0.0     27.84    15.33    11.50    14.65
10    3.0     19.15    10.93     9.35    10.30
10    8.0     13.25     8.11     7.52     7.81
10   10.0     12.60     7.73     7.35     7.47
10   15.0     10.40     6.72     6.73     6.64
50    0.0    625.78   329.67   194.21   313.01
50    3.0    398.84   214.24   154.07   196.82
50    8.0    266.53   150.94   124.43   141.21
50   10.0    237.02   136.15   119.21   128.96
50   15.0    184.32   109.32   102.25   105.49

$n = 50$ and $\hat R = 10$
10    0.0     26.68    10.45     7.81     9.61
10    3.0     17.77     7.47     6.31     6.93
10    8.0     12.45     5.64     5.12     5.32
10   10.0     11.43     5.23     4.92     5.01
10   15.0      9.75     4.74     4.65     4.63
50    0.0    609.04   220.29   144.36   200.84
50    3.0    367.31   139.85    98.50   127.21
50    8.0    248.52    97.72    78.56    86.43
50   10.0    215.08    90.74    75.06    82.60
50   15.0    177.47    75.84    67.16    69.44

Estimations of $\alpha$ and $\Sigma$ are based on the estimated $\hat R$ and 10,000 samples. Each row is obtained from 10,000 simulated samples with recombination rate $R$. Also see the footnote to Table 1 for sv, $v_\xi$ and $v_\eta$.

FIGURE 5. Coefficients of $\hat\theta_\xi$ (panel a) and $\hat\theta_\eta$ (panel b) as functions of migration parameter $M$ for balanced samples of size 10 from two islands. The curves from top down represent $u_1, \ldots, u_9$ in panel a and $v_1, \ldots, v_5$ in panel b. The coefficients for each value of $M$ are averaged over 10,000 samples and $\theta = 20$.

scheme, all the estimators perform better in the case of two islands than in the case of five islands, suggesting that in general the variance of an estimator of $\theta$ increases with the number of islands under the extreme-sampling scheme. For the balanced-sampling scheme, the patterns of the four estimators are not all the same. When $M$ is small, $\hat\theta_\xi$ and $\hat\theta_\eta$ perform better with a smaller number of islands than with a larger number of islands, but the reverse is true when $M$ is relatively large. In comparison, $\hat\theta_K$ and $\hat\theta_\pi$ seem to perform better with a larger number of islands for all the migration rates we have considered.

Table 3 also shows that the balanced-sampling scheme is always better than the extreme-sampling scheme for the purpose of estimating $\theta$, because all the estimators perform better under the former scheme than under the latter one. The extreme-sampling scheme should be avoided particularly when migration rate is small. It should be pointed out that we have assumed that all islands have the same effective population size in our simulations. When this is not true, the best sampling scheme is likely the generalized balanced-sampling scheme, in which the relative size of a sample from an island with respect to the overall sample size is the same as the relative value of the effective population size of the island with respect to the sum of all effective population sizes.

We also present in Table 3 the means of estimated variances of $\hat\theta_\xi$ and $\hat\theta_\eta$. It is found that the variances of $\hat\theta_\xi$ and $\hat\theta_\eta$ computed from Equations 19 and 20, respectively, are on average overestimates of the true variances under the neutral Wright's finite-islands model.

The coefficients of $\hat\theta_\xi$ and $\hat\theta_\eta$ for a balanced sample of size 10 from two islands are given in Figure 5. It is interesting to see that the coefficients quickly become stable with the value of $M$. This example and many others we examined indicate that as far as $\theta$ is concerned, samples from islands may be treated as from a panmictic population even when migration parameter $M$ is as small as 1, particularly when the estimator $\hat\theta_\xi$ is used. This observation is in accordance with a large body of literature showing that migration is extremely powerful in eliminating differences among local populations. Under the neutral Wright's finite-islands model, the coefficients of $\hat\theta_\xi$ and $\hat\theta_\eta$ also satisfy the relationship given by (27), which seems to be true in general.

DISCUSSION

We have demonstrated under the neutral WRIGHT-FISHER model, the neutral model with recombination

TABLE 5 Means and MSE of the four estimators under the neutral Wright’s finiteislands model for balanced samples - - - 0 M M 0, MSE 6, MSE 9s MSE e; MSE A 5 0.5 26.50.1 11.87.7 131.45.3 9.26.1 52.7 0.3 6.2 16.6 5.7 2.8 8.1 5.2 6.2 5.5 0.5 5.1 8.8 5.0 2.7 4.8 5.0 4.1 5.0 0.7 4.5 0.7 3.9 6.5 4.7 2.5 4.9 4.8 3.5 1.o 4.1 3.6 5.9 4.4 2.6 4.8 4.6 3.4 5.0 3.5 5.0 3.6 5.8 3.9 2.6 4.5 3.2 4.2 69.5 5382.4 47.7 1708.0 24.7 69.3 35.7 656.4 35.7 69.3 24.7 20 1708.05.0 47.7 0.1 5382.4 69.5 1.0 23.9 162.8 22.6 79.7 21.2 79.7 22.6 162.8 23.9 1.0 64.635.3 21.7 2.0 21.3 114.9 21.0 114.9 21.3 2.0 20.8 32.459.6 20.6 53.5 5.0 20.120.1 106.050.2 20.1 30.555.6 20.1 7.0 19.6 102.2 19.6 7.0 19.7 52.9 19.8 30.4 19.8 46.8 10.097.749.1 19.719.9 29.9 19.819.9 51.9 B 5 0.5 0.1 0.5 5 16.1 207.1 13.8 124.6 9.1 26.8 11.1 65.7 0.3 6.8 17.1 6.5 7.5 10.3 6.15.7 3.1 0.5 4.9 6.9 5.0 1.9 4.0 5.0 3.4 5.0 0.7 4.1 5.4 4.3 3.3 4.7 3.1 1.6 4.5 1.o 4.43.5 3.45.4 3.8 2.91.7 4.1 5.0 2.4 2.7 5.0 3.7 5.6 8.1 2.8 3.3 4.0 0.1 133.9 18949.9 97.7 2018949.9 5.0133.9 0.1 8458.5 50.3 1228.9 67.4 3291.1 26.8 138.7 23.8 47.6 24.8 93.2 124.8.o 284.0 47.629.0 23.8 138.7 26.8 2.0 23.5 146.3 23.5 2.0 57.6 22.822.2 29.771.6 21.8 88.9 19.8 41.9 20.0 22.0 19.9 36.7 19.9 22.0 20.0 5.041.9 19.619.8 88.9 7.042.8 19.419.4 92.1 19.5 21.3 19.5 37.1 10.0 18.5 84.521.9 18.819.1 42.2 19.0 38.6

$\bar{\theta}_K$, $\bar{\theta}_\pi$, $\bar{\theta}_\xi$ and $\bar{\theta}_\eta$ are, respectively, the means of $\hat{\theta}_K$, $\hat{\theta}_\pi$, $\hat{\theta}_\xi$ and $\hat{\theta}_\eta$. Estimations of $\alpha$ and $\Sigma$ are based on the estimated value $\hat{M}$ of $M$ and 10,000 samples. Each row is obtained from 10,000 samples. A, two islands and 10 sequences are taken from each island; B, five islands and five sequences are taken from each island.

and the neutral Wright's finite-islands model that $\hat{\theta}_\xi$ is an efficient estimator of $\theta$. However, comparisons of the performances of various estimators were conducted assuming that the values of parameters other than $\theta$ are known. Since in practice the value of the migration parameter $M$ and the value of the recombination parameter $R$ are likely unknown as well, one has to estimate their values in order to obtain estimates of $\theta$ by $\hat{\theta}_\xi$ or $\hat{\theta}_\eta$. This raises a question: will $\hat{\theta}_\xi$ still be superior when $M$ and $R$ are to be estimated? The recombination parameter $R$ can be estimated by, for example, HUDSON and KAPLAN's (1985) method or HUDSON's (1993) method, and the migration parameter $M$ by, for example, SLATKIN and MADDISON's (1989) method, but these inference methods are still in their infancy and better methods for estimating $M$ and $R$ are likely to appear in the near future. Therefore I decided to examine the performances of the four estimators by directly using erroneous values of $M$ or $R$ to estimate $\alpha$ and $\Sigma$.

Consider the neutral model with recombination first. Since recombinations do not change the expectations of $\xi$ and $\eta$, it is easy to see that $\hat{\theta}_\xi$ and $\hat{\theta}_\eta$ are still unbiased estimators of $\theta$ when the estimated value of $R$ is inaccurate. Table 4 gives a few examples of how the variances of $\hat{\theta}_\xi$ and $\hat{\theta}_\eta$ are affected when the estimate of $R$ is inaccurate. It is found from Table 4 that $\hat{\theta}_\xi$ remains the best estimator even when the estimate $\hat{R}$ of $R$ is considerably different from its true value. However, when $\hat{R}$ is different from $R$, the variances of $\hat{\theta}_\xi$ and $\hat{\theta}_\eta$ increase. For example, consider the case of $\theta = 50$ and sample size $n = 50$: when $R$ is equal to 0 while it is estimated to be 10, the variances of $\hat{\theta}_\xi$ and $\hat{\theta}_\eta$ are, respectively, 144.4 and 200.8 (Table 4), but if the estimation of $R$ is accurate, i.e., $\hat{R} = 0$, the variances become 92.3 and 166.9, respectively (Table 1). These simulations show that the effort to compute $\hat{\theta}_\xi$ is undoubtedly worthwhile and that one must make an effort to obtain a good estimate of $R$ to make the most of $\hat{\theta}_\xi$.

Next we consider the neutral Wright's finite-islands model. When the estimate $\hat{M}$ of $M$ is not accurate, all the four estimators become biased. To measure the performance of a biased estimator we must consider its bias as well as its variance. A proper measure of the accuracy of a biased estimator $\hat{\theta}$ of $\theta$ is the mean square error (MSE), defined by

$$\mathrm{MSE} = E[(\hat{\theta} - \theta)^2] = \mathrm{Var}(\hat{\theta}) + (E(\hat{\theta}) - \theta)^2.$$

Note that when $\hat{\theta}$ is unbiased, MSE is simply the variance of $\hat{\theta}$. Since the balanced-sampling scheme is much better than the extreme-sampling scheme and its use is strongly recommended, I shall examine the performances of the four estimators for balanced samples only. Table 5 gives a few examples of the effect on estimations of $\theta$ when $\hat{M}$ is not accurate.

It is clear from Table 5 that all the four estimators, $\hat{\theta}_\pi$, $\hat{\theta}_K$, $\hat{\theta}_\xi$ and $\hat{\theta}_\eta$, are biased when $\hat{M}$ is inaccurate but $\hat{\theta}_\xi$ remains the best estimator among the four. $\hat{\theta}_\xi$ is a superior estimator not only because its MSE is the smallest in all the cases we examined but also because it is the least biased estimator. Table 5 also shows that the MSE of each estimator becomes relatively large when $\hat{M}$ is much larger than $M$, while on the other hand the MSE of an estimator does not change substantially when $\hat{M}$ is considerably smaller than $M$. This suggests that for the purpose of estimating $\theta$ it is better to underestimate $M$ than to overestimate $M$. We pointed out earlier, by examining the coefficients of $\hat{\theta}_\xi$ and $\hat{\theta}_\eta$, that samples from islands may well be treated as from a panmictic population when $M$ is not extremely small, and Table 5 reinforces this conclusion. For example, when $\hat{M}$ is 5, which may be regarded as representing a nearly panmictic population, the bias is not substantial when $M$ is as small as 1, though the bias seems to increase with the number of islands.

We assumed in this paper that the frequencies of mutations of various sizes, i.e., $\xi$, can be inferred accurately, which implies that either the infinite-sites model is appropriate and an outgroup sequence is available, or the sequences are sufficiently long so that the phylogeny reconstruction is accurate. Samples of sequences of modest length without an outgroup are common and thus procedures for estimating $\theta$ for such samples will be of practical importance. Because $\xi$ and $\eta$ are also likely to be important in constructing more powerful tests of the hypothesis of neutral mutations than those by TAJIMA (1989) and FU and LI (1993b), and may even lead to better estimators of $R$ and $M$, we are currently searching for the best methods to infer the values of $\xi$ and $\eta$ under various population models. But without going further, we expect that when the values of $\xi$ and $\eta$ cannot be inferred accurately, some simple equations for bias correction may be sufficient in practice, as found for a BLUE of $\theta$ by FU (1994a). For the neutral Wright's finite-islands model, it may turn out that the bias due to estimating $M$ is larger than that due to estimating $\xi$ and $\eta$; therefore finding a good estimate of $M$ may be more critical.

The programs written in C language for estimating $\theta$ under models considered in this paper are available from the author whose E-mail address is [email protected].

I thank an anonymous referee for suggestions. This work is supported in part by a FIRST AWARD from the National Institutes of Health.

LITERATURE CITED

FELSENSTEIN, J., 1992 Estimating effective population size from samples of sequences: a bootstrap Monte Carlo integration method. Genet. Res. 60: 209-220.
FU, Y. X., 1994a A phylogenetic estimator of effective population size or mutation rate. Genetics 136: 685-692.
FU, Y. X., 1994b Statistical properties of segregating sites. Theor. Popul. Biol. (in press).
FU, Y. X., and W. H. LI, 1993a Maximum likelihood estimation of population parameters. Genetics 134: 1261-1270.
FU, Y. X., and W. H. LI, 1993b Statistical tests of neutrality of mutations. Genetics 133: 693-709.
HAMMERSLEY, J. M., and D. C. HANDSCOMB, 1965 Monte Carlo Methods. John Wiley & Sons, New York.
HUDSON, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23: 183-201.
HUDSON, R. R., 1993 The how and why of generating gene genealogies, pp. 23-38 in Mechanisms of Molecular Evolution, edited by N. TAKAHATA and A. G. CLARK. Sinauer Associates, Sunderland, Mass.
HUDSON, R. R., and N. L. KAPLAN, 1985 Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147-164.
KINGMAN, J. F. C., 1982a The coalescent. Stochast. Proc. Their Appl. 13: 235-248.
KINGMAN, J. F. C., 1982b On the genealogy of large populations. J. Appl. Prob. 19A: 27-43.
SLATKIN, M., and W. P. MADDISON, 1989 A cladistic measure of gene flow inferred from the phylogenies of alleles. Genetics 123: 603-613.
STROBECK, C., 1987 Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics 117: 149-153.
TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460.
TAJIMA, F., 1989 DNA polymorphism in a subdivided population: the expected number of segregating sites in the two-subpopulation model. Genetics 123: 229-240.
WATTERSON, G. A., 1975 On the number of segregating sites. Theor. Popul. Biol. 7: 256-276.

Communicating editor: G. B. GOLDING
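As a numerical illustration of the mean square error used in the Discussion, the following minimal Python sketch checks that the MSE of a set of simulated estimates equals their variance plus the squared bias. It is not the author's C program and does not implement $\hat{\theta}_\xi$ or $\hat{\theta}_\eta$. For simplicity it uses Watterson's estimate $K/a_n$ under the neutral Wright-Fisher model, with $K$ drawn from the standard coalescent. The function names, the sample size, the value of $\theta$ and the artificial 20% inflation used to create a biased estimator are all assumptions made for illustration.

```python
import random
import statistics


def mse_decomposition(estimates, theta):
    """Return (mse, variance, bias) for a list of estimates of theta."""
    mean = statistics.fmean(estimates)
    bias = mean - theta
    variance = statistics.pvariance(estimates, mu=mean)
    mse = statistics.fmean([(x - theta) ** 2 for x in estimates])
    return mse, variance, bias


def simulate_segregating_sites(n, theta, rng):
    """Draw the number of segregating sites K for n sequences
    (neutral Wright-Fisher coalescent, no recombination)."""
    # Total tree length in units of 2N generations: while k lineages
    # remain, the waiting time is exponential with rate k(k-1)/2.
    length = sum(k * rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))
    # Mutations fall on the tree as a Poisson process with rate theta/2;
    # count unit-rate arrivals before theta * length / 2.
    expected, count, t = theta * length / 2, 0, rng.expovariate(1.0)
    while t < expected:
        count += 1
        t += rng.expovariate(1.0)
    return count


rng = random.Random(1)
n, theta = 10, 5.0                      # illustrative values only
a_n = sum(1 / i for i in range(1, n))   # a_n = 1 + 1/2 + ... + 1/(n-1)
unbiased = [simulate_segregating_sites(n, theta, rng) / a_n for _ in range(20000)]
biased = [1.2 * x for x in unbiased]    # hypothetical 20% upward bias

for name, est in (("Watterson", unbiased), ("inflated", biased)):
    mse, var, bias = mse_decomposition(est, theta)
    print(f"{name}: var={var:.2f} bias={bias:.2f} mse={mse:.2f}")
```

For the unbiased estimator the MSE is essentially its variance, whereas for the inflated one the squared bias adds to it. This is why the estimators in Table 5 are compared by MSE rather than by variance alone.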