<<

Copyright 0 1991 by the Societyof America

The Genetic Structure of Admixed Populations

Jeffrey C. Long

Department of Anthropology, University of New Mexico, Albuquerque, New Mexico87131 Manuscript received June 5, 1990 Accepted for publication October 15, 1990

ABSTRACT A method for simultaneously estimating the admixture proportions of a hybrid population and Wright’s , FST, for that hybrid is presented. It is shown that the variance of admixture estimates can be partitioned into two components: (1) dueto sample size,and (2) due to evolutionary variance @e., ). A chi-square test used to detect heterogeneity of admixture estimates from different , or loci, can now be corrected for both sources of random errors. Hence, its value for the detection of from heterogeneous admixture estimates is improved. The estimation and testing procedures described above are independent of the dynamics of the admixture process. However, when the admixture dynamics can be specified, FST can be predicted from genetic principles. Two admixture models are considered here, flow and intermixture. These models are of value because they leadto very different predictions regarding the accumulation of from the parental populations and the accumulation of variance due to genetic drift. When there is not evidence for natural selection, and it is appropriate to apply these models to data, the variance effective size (N,)of the hybrid population can be estimated. Applications are made to three human populations: two of these are Afro-American populations and one is a Yanomamo Indian village. Natural selection could not be detected using the chi-square test in any of these populations. However, estimates of effective population-~ sizes do lead to a richer description of the genetic structure of these populations.

STIMATION of the proportional contributions tion over several generations, and application of this E of ancestral populations to theirhybrids has been approach was thought to be morepowerful than single important in genetic analyses of many admixed human generational studies (REED1969). Unfortunately, the populations (cf. CHAKRABORTY1986). Such admixture results of heterogeneity searches have been equivocal estimates help to clarify the historical background of because of the variation in admixture estimates from admixture and they are becoming useful in genetic other causes (ADAMSand WARD1973). From an evo- epidemiological investigations (CHAKRABORTYand lutionary perspective, an important source of varia- WEISS1986, 1988). Admixed populations have tion in admixture estimates, other than thatcaused by frequencies that are linear combinations of the allele natural selection, is caused by genetic drift. frequencies in their parental populations, and since Genetic drift, like admixture, affects all loci simul- admixture affects allloci equally, the same set of taneously. Its impact is typically evaluated from the admixture proportions are expected to apply to all parameter FST (WRIGHT1965, 1969). It is shown in alleles at all loci. However, estimated ancestral contri- this paper that FST can be used to measure the effect butions vary from allele to allele and from locus to of genetic drifton admixture estimates. It is also locus for a number of reasons; sampling error in the demonstrated that FsT is informative about other as- estimation of parental and hybrid population allele pects of the hybrid population’s evolutionary struc- frequencies and genetic drift are prominent sources ture. For example, FST is closely related to the vari- of random error, andsystematic biases are potentially ance effective size of the hybrid population. introduced by natural selection for or against some A method to estimate simultaneously the ancestral alleles. contributions to a hybrid population and FST is pre- It has been argued in the past that natural selection sented. This method allows a partition of the variance can be detected if admixture estimates from different of the estimated admixture proportions into two com- alleles are highly heterogeneous, or if some small ponents; the first component measures the error in- group of alleles yield admixture estimates that deviate herent in statistical sampling of existing populations, in theextreme from admixture estimates at most and the second component measures the error that other alleles (WORKMAN,BLUMBERC and COOPER accumulates because of evolutionary change. In the 1963; WORKMAN1968; REED1969; CAVALLI-SFORZA absence of natural selection, this component is a meas- and BODMER1971). A biased admixture estimate ure of genetic drift. Hereinafter, the first source of should reflect the cumulative effect of natural selec- error is referred to as sampling error and the second

(;elletics 127: 417-428(February, 1991) 418 J. C. Long source of error is referred to as evolutionaryerror. values of its frequencies in the hybrid population, the The estimator for Fsv presented here is distinct from first parentalpopulation, andthe second parental other estimators of FST (cj COCKERHAM1969, 1973; population, respectively. If admixtureand genetic WEIRand COCKERHAM1984; LONG 1986). It is ob- drift are the only evolutionary process that have af- tained from the variance of allele frequencies with fected the hybrid gene pool, then respect to their expectations based on the admixture Pht = ' Pli (1 - ' P2j Eer model. In order toobtain this estimate of FsT,multiple p + p) + (1) locus samples are required from the hybrid population = PPI + I.L * (f'lt - P2i) + €et and from all contributing parental populations. FST cannot be computed for the hybrid populationin the where p is the proportionate contribution of the 1st absence of information on the parental populations. parental population, (1 - p) is the proportionate con- A specific model of admixture dynamics is not re- tribution of the 2nd parental population, and is the quired by the basic procedures presented here, but if error due to genetic drift. The expectation of Phi is the dynamics of admixture can be specified, then FST p a Pli + (1 - p) - P2,;for convenience, this expecta- is predictable from population genetic principles and tion will be denoted E[Phi] = ai. The expectation of t,i it is possible to estimate the variance effective size (Ne) is zero, but the expectation of its square, and hence of the hybrid population. Two such models of admix- its variance, is not. The variance of thehybrid ture dynamics and methodsfor estimating N, are caused by genetic drift is cp(ph;) = explored in later sections of this paper. One of the E[Phj - ail2. models evaluated, , although ubiquitous in If PI,and P2, are known parameters, the method of the admixture literature (cf. GLASSand LI 1953), is weighted least squares (WLS) can be used to estimate found to be unusual with respect to geneticdrift @ from a sample of hybrid population genotypes (cf. because there will not be a single value of FST common ELSTON 197 1; LONGand SMOUSE1983). For these to all alleles at allloci. When admixture takes the purposes, let Fh, be the maximum likelihood estimate form of gene flow, the procedures developed here of the allele frequency in the hybrid population based estimate an average Fsl- pertaining to all loci sampled. on a sample of N,- individuals. The expectation of Phi The variance formula for admixture estimates de- is ai, however Phi contains a component of random rived here is used to improve achi-square statistic that error, tsj,due to statistical sampling. The expectation has been designed to detect natural selection from of t,, is zero, assuming that the population is large heterogeneityamong admixture estimates obtained andrandom mating, and its variance, d(phi), is from different alleles (CAVALLI-SFORZAand BODMER E[Fh,- Phi]* = Phi (1 - P~,)/~N,.The two error terms, 1971). There are two major problems with the chi- e,, and eel, are assumed to be independent, and accord- square test (in its original form) that have rendered ingly, the variance of ph,, with respect to both genetic its results controversial. First, the test confounds het- drift andstatistical sampling, is a:( phi) + a:( phi). These erogeneity due to natural selection and genetic drift quantities are illustrated in Figure 1. Table 1 provides (CAVALLI-SFORZAand BODMER1971; ADAMSand a listof definitions and symbols that will be used WARD1973). Second, the method treats different throughout this paper. alleles that segregate within the same locus as if they To obtain the WLS estimate of p, Equation 1 is are statistically independent. Thus,notions of random rewritten as sampling are violated. The methodsderived here rectify these problems and the interpretability of a (Fhl - ~2,)= I.L * - P2i) + (2) significant chi-square is clarified. The new methods and approaches for admixture where theterm ei is the sum of the two random analysis described above are applied to threeadmixed components, ie., E, = + E+ The admixtureparam- human populations in the later sections of this paper. eter, p, is then estimated by defining a vector X =

The populations analyzed are of historical interest [(PI]- Pz1),(P12 - P22), - e, (PI, - P21)IT, a diagonal and they are useful for demonstrating thedistinctions matrix V with elements (v,,] E[Ph1](1 - E[Ph,]),and between the tw70 theoretical models of admixture ex- a vector y = [(Phi - P~I), (Ph2 - ~22),e-., (~h,- amined here. P2,)]'. The estimate of I.L is given by

M = (XTV"X)"XTV"y. (3) A BASICADMIXTURE/DRIFT MODEL AND ESTIMATIONPROCEDURES The solution to Equation 3 is obtained iteratively Suppose that one codominant allele from each of (LONGand SMOUSE1983). At each iteration, the ex- several unlinked loci is chosen for analysis (multiple pected hybrid allele frequencies, E[Phi] = x,, are ap- alleles and dominance are dealt with in a later section proximated by i;, = PP1+ Ma (PIz - P2i). Following the of this paper). Let A, designate the chosen allele from usual steps (cj NETER and WASSERMAN1974), an the ith locus (i = 1, . . , I), and Phi, Pli, and Py1be the estimate of the variance of the admixture estimate is Admixed Populations 419 TABLE 1 Summary of symbols and notation used p = 0.50 Ne = 100 ph, = frequency of A, in the hybrid population N, = 10 PI, = frequency of A, in the 1st parental population P,, = frequency of A, in the 2nd parental population p = contribution of the 1st parental population to the hybrid ff, = E[Pht] = &I' PI, + (1 - P) ' P28. e<, = drift deviation of the ith allele, i.e., ph, - T,

MSE = (y - X - M)T * V" + (y - X M)/(f- 1). (5) = E[phi - phi]* + E[phi - Ti]' The mean squared error (MSE) estimates E[($,,, - rj)'r, (1 - rJ],which is the standardized variance of estimatedhybrid population allele fre- quencies. This quantity is very informative about the breedingstructure of the hybridpopulation be- cause it is closely related to WRIGHT'S(1965, 1969) fixation index, Fsr. WRIGHT(1965, p. 402;1969, p. 295,and elsewhere) defines the quantity FST = u&-)/[qT(l- q~)],as the "ratio of the actual allele + u;(Phz). frequency variance of the subdivisions to its maximum possible value qT(1 - qT) that is expected if the subdi- After simplification, and using the notation E[Phi] = visions are completely isolated and each completely Ti, fixed" (q denotes the allele frequency). Wright used the variance of subdivisions, a&-), to pertain to: (I) theaggregate of an essentially infinite number of existing subdivisions (e.g., WRIGHT1969, p. 294), or Upon standardization by ri (1 - rJ, equivalently, (2) an infinitely large number of isolated lines which might, hypothetically, have been drawn from a founding stock (WRIGHT1965, p. 407; 1951, p. 328). The latterinterpretation is necessary for analysis of an isolated population, or line, such as the hybrid population of the model presented here. The parameters of u:(phi) and ri defined in this paper 420 J. C. I>ong and MSE remains [ 1/2N + FST - (1 - l/2Ns)], because an estimate of F~Tcan be obtained fromall pairs of alleles F,*, = 2Ns/(2Ns - l)[MSE - l/(2Ns)](10) at a locus (NEI 1965), i.e., can be used to estimate FST. Since the terms in MSE are standardized by the estimate, ;i (1 - g), rather than the parameter ?r a (1 - T), Equations 9 and 10 are approximations. Nevertheless they are nearly un- Dominance among alleles at a locus will not affect biased estimators when FST is reasonably low(say s:(M), but thesampling portion of the variance, sf(M), Fsl- < 0.10) or, many loci have been assayed (J. SOBUS is altered. Eq. 4 still estimates s'(M), but its partition and J. C. LONG,unpublished computer simulations). into s?(M) and s:(M) requires special treatment. To With these points in mind, the estimated variance obtain s?(M), V is replaced with a new matrix, V*. of the admixture estimate, s2(M),can be partitioned The elements of V*" are the sampling variances of into two additive components hybrid population allele frequencies obtained using s'(M) = sf(M) + s:(M) (1 la) maximum likelihood estimation procedures. Follow- ing Lr (1 976), we can obtain these elements from the where, equation

and where PH, is the probability of the cth phenotypic

s:(M) = (X'V"X)" * F&[1 - l/(2Ns)]. (1 IC) class (c = 1, - , C) in the hybridpopulation, and dPH,/aBi is the partial derivative of the phenotypic It is instructive to review several points before pro- class with the parameter of interest (ie., Phi). Making ceeding. First, the variance of the admixture estimate this substitution, does not tend to zero by increasing the number of individuals sampled; it tends to zero as the number of 1 sf(M) = - * (XTV*"X)" loci assayed is increased. Second, ELSTON(197 1) in- 2Ns troduced weighted least squares and maximum likeli- and s:(M) should be estimated, hood estimators for p under the assumption that the hybrid population has not evolved by genetic drift. It s?(M) = s'(M) - sf(M). (1 5) can be shown that sf(M)is the estimated variance of If thealleles are codominant, V* reduces to V and all both of these estimators. Third, although Wright's estimation procedures remain the same. standardized variance definition of FST hasbeen used The assumption that parental population allele fre- so far, the correlational definitions of FsT (WRIGHT quencies are known without error is the final point of 1965; COCKERHAM1969, 1973; WEIR and COCKER- concern. While this assumption can neverbe fully HAM 1984) apply equally well. As points of reference, met, it is generally assumed (REED 1969; ADAMSand the parameter a; . (1 - 7rJ is equivalent to the param- WARD 1973) that allele frequencies obtained as un- eter a1 defined by REYNOLDS, WEIRand COCKERHAM biased estimates, with small standard errors, from the (1983) and implicit in WEIRand COCKERHAM(1984), modern descendants of the correct ancestral popula- and af(pht)is equivalent to their aid. tions will serve the purpose. This should be true for EXTENSION TO MULTIPLEALLELES, the method described above. As long as there is no DOMINANCEAND UNCERTAIN PARENTAL systematic error in the estimation of ancestral fre- POPULATIONS quencies, the main impact of uncertain ancestral fre- quencies should be the amplification of MSE. Hence, Neither multiple alleles nor dominance presents a serious barrier to the methods described above.Mul- our estimate of ?(M) should be appropriate,as should tiple alleles can be accommodated as follows. Expand our estimate of s:(M); however, the estimate of s:(M) will include error in estimation of ancestral allele X and y so that Phi is the frequency of the ith allele frequencies as well as genetic drift. If parental popu- the kth population. The alleles may be at the same lation allele frequency estimates are obtained from locus, or at separate loci, but one allele from each very small samples, then following CAVALLI-SFORZA locus must be discarded from analysis to eliminate and BODMER(197 1) and LONGand SMOUSE(1983), redundancy of information. It does not matter which the matrix allele from a locus is discarded from analysis (APPEN- DIX). For the matrix V, retain the diagonal elements U = [V/(2Ns) + M'[VI/(~NI)] (16) as they have been defined, but in each off-diagonal + (1 - M)'[V*/(2N*)11 position use V, = -E[Ph,] - EIPhl], where the alleles a and j segregate at the same locus. The expectation of should replace the matrix V in Equation 3-5. The Admixed Populations 421 matrices VI and V2 are the variance-covariance mat- INTERMIXTURE rices of allele frequencies in the parental populations, A and N,and N2 are their sample sizes. Parent 1 Parent 2

HETEROGENEITY OF ADMIXTUREESTIMATES \., Hybrid'Y CAVALLI-SFORZAand BODMER(1 97 1) presented the I heterogeneity chi-square statistic, .1 X?,-') = Zi(Mi - M)'/V(Mi) (1 7) where M is the weighted average of admixture pro- portions computed from all alleles, M, is the estimate 1 from the ith allele, and V(Mi)is the variance of Mi. I If a total of "I" independent alleles are assayed, x'(Z - 1) follows a chi-square distribution with I - 1 B GENE FLOW degrees of freedom. Our WLS estimate differs very Recipient Donor slightly from the weighted average that they use, and our estimates of V(M,)will be identical to theirs if the weight matrix U, and the variance formula (Equation 1 lb) are employed. Since Equation 1 lb assumes no evolutionary error, this test will detect heterogeneity among admixture estimates regardless of whether it is due to natural selection or genetic drift. Thus, the two most (evolutionarily) interesting sources of error E31 in admixture estimates are confounded. (lh7 The test can easily be improved by letting M, and FIGURE2.-Admixture models. The dynamics of population ad- V(M,)pertain to the ith locus instead of the ith allele, mixture fot- the intermixture and geneflow models are diagrammed thereby eliminating redundancy of information, and above. using the variance formula and all subsequent is by genetic drift. There- V(M,) = (XTV-'X)-' * MSE (18) fore, a single will apply to all generations while Fsl-, where the vector X is reduced to only those alleles at MSE and s2(M)progressively increase with time. It is the ith locus, but MSE pooled from the analysis of all well established ($. WRIGHT1969; p. 345) that, FST, loci. Thereby, thevariance of the admixtureestimates after t generations of drift will be accountsfor the error introduced by sampling the hybrid population and for the error resulting from genetic drift (including error in parental population allele frequencies) in the hybrid population. = 1 - exp(-t/2Ne).

MODELS OF ADMIXTUREAND ESTIMATION Accordingly, we can find the expected MSE, at any OF Ne time, t, for any sample size 1 The expectations of FST, MSE and s'(M) can be E[MSE] = -+ (')FsT obtained from first principles of 2Ns when the dynamics of admixture and drift can be specified. Two such models will be evaluated here. In the first model, the hybrid population is formed by a single event of intermixture between two parental With this relationship established, it is possible to populations. Hereinafter this model is referred to as estimate the variance effective size of the hybrid pop- intermixture. In the second model, there is gene dif- ulation, using fusion from a donor population (parent) into arecip- ient(hybrid) population, but not vice versa. This Nf = -t/[2 In(1 - F&)] (2 1) process is classically referred to as Gene Flow and it is where F& is the estimate of FsT- obtained from Equa- thought to apply to major racial groups such as Afro- tion 10. Application of this formula requires a good Americans. estimate of the length of time since the formation of Intermixture: The admixture effects of this process the hybrid population. Such estimates are readily are diagrammed in Figure 2A. The admixture pro- available for many of the populations that areof most portions are established at theinception of the process interest. 422 J. C. Long Gene flow: As shown in Figure 2B, the recipient can be used to estimate the variance effective size of (hybrid) populationreceives a constant proportion, a, the hybrid population. However, these are very re- of itsgenes from a donor populationevery generation. strictive conditions and it will almost always be pref- The cumulative portion of genesfrom thedonor erable to manipulate Equation 24 numerically to es- population, p, is not constant; after t generations of timate the effective population size. admixture, it is given by p = 1 - (1 - a)‘. The equation Gene flow can be profitably considered a special to estimate a follows from (GLASSand LI 1953) case of the island model, but some new considerations regarding the model dynamics must be recognized. = -In(l - M)/t (22) ii While WRIGHTwas concerned with the equilibrium where M is the estimate of p obtained from Equa- of the model, the dynamics have been explored as a tion 3. specialcase of the migration matrix (BODMERand The admixture dynamics of this model are well CAVALLI-SFORZA 1968;SMITH 1969; MORTON 1969; established (GLASSand LI 1953), but the effects of IMAIZUMI,MORTON andHARRIS 1970). Here it is genetic drift in the hybrid have not been investigated assumed that the migration/drift process begins with heretofore. Recall that (‘)FsT(defined for an ithallele) a set of demes drawn independently from apanmictic is population. Differentiation amongthe demes then proceeds up to the equilibrium point. An important feature of this assumption is that the expected allele frequencies at a locus do not change from generation where (‘)u:(phi) is the evolutionary variance of the ith to generation, nor do the expected allele frequencies allele frequency (in the hybrid population) in the t-th vary across demes. Under these conditions there is a generation, and (‘)x,is its expected allele frequency. single value of FSTthat pertains to all alleles at all loci. (‘)T~is obtained simply from (1 - (‘)/A) . PI,+ @)p P2,. With gene flow, each expected allele frequency changes It will change every generation until it equals the until convergence with the donor population is real- allele frequency in the donorpopulation. By extension ized. Until convergence, a single FST will not apply to of the principles of variance of gene frequencies in all alleles at all loci. Equation 10 provides an average finite populations (WRIGHT1969, p. 346) to account F& over all alleles analyzed. Although FST does not for migrants, the evolutionary variance of the allele vary greatly among alleles, estimates of the effective frequency under gene flow is obtained from size of the population should be computedseparately for each allele andthen averaged. Since it is the reciprocal of N, that entersinto Eq. 24, the harmonic average of theestimated effective population size from each allele should be employed. Distribution of N,*:The usefulness of effective population size estimates depends heavily ontheir precision. There are two major barriers to the adop- (‘)cr;(phi) is combined with %, to yield an estimate of tion of usual statistical procedures for evaluating this. (‘)FS,.(Equation 23). The expected MSE assuming gene First, the estimation methods forNe render a standard flow remains (Equation 23 is used for (‘)FsT). error formula for NP algebraically intractable. Sec- - - 1 ond, even if it were possible to derive a standarderror E[MSE] = - + (‘)FST. for N,*, its distribution is heavily right skewed and 2Ns asymmetric confidence intervals must be used. With The conditions of the gene flow model are unlike these considerations in mind, the bootstrap procedure those of the intermixture model because (‘)FsT will (EFRON1979; EFRONand GONG1983; DIACONISand approach an equilibrium value less than unity. The EFFRON1983) is an appropriate tool for approximat- equilibrium point is ing the distribution of N?. Briefly, the steps of the bootstrap procedure are as A (1 - a)2 Fsl- = (25) follows. (1) An estimate of FSTis obtained for each 2N, - (2N, - 1)(1 - a)2 allele analyzed, following the expectations established which is identical tothe equilibrium Fsl- under in Equation 9. Occasionally these estimates will be WRIGHT’S(1969, p. 291) island model of population negative, in which case zero should be substituted for structure. If FsT is close to its equilibrium value, and the negative value. (2) An estimate of N, should be CY is small, say CY 5 0.02, the formula obtained from F& for each allele using Equation 2 1, or Equation 24, depending on the appropriate admix- ture model. If F& is negative, then the only realistic estimate of Ne is infinity. (3) The observed frequency Admixed Populations 423 distribution of N, estimates is resampled with replaw- WARD 1973) but theoverall impression is that genetic merit alarge number of times (say ~0,000).The drift is likely to be a major contributor. The signifi- number of variates in each bootstrap sample is the cance of the heterogeneity chi-squares will be reex- Same as the number ofalleles analyzed in the original amined here using the method developed in this pa- data set and the sampling is done with replacement. per. Estimates of ancestral allele frequencies for the (4) For each bootstrap sample, an average effective present analyses are those of ADAMSand WARD population size is computed as the harmonic average (1 973). of the bootstrapped variates. The approximate confi- Sapel0Island (Afro-American): Later in the dence limits on NF are obtained by using the appro- 1960s, the Afro-American population living on Sapelo priate upper and lower percentiles of the bootstrap Island, Georgia,was sampled for the same genetic loci distribution. (BLUMBERGand HESSER1971) as the Claxton popu- lation. There were only 21 1 Afro-American residents on Sapel0 Island and sample sizes ranged from 14 1 to HUMANPOPULATION EXAMPLES 191 individuals per locus. Althoughgene flow oc- The utility of the above theory for the analysis of curred over the same 12-generation period, the Sa- actual populations will now be demonstrated on three pelo Island Afro-Americans were thought to be less human population examples. In each example, ances- admixed, and to have been somewhat isolated from tral contributions to the hybrid will be estimated, a other Afro-American populations as well. A dichoto- chi-square test for heterogeneity of admixture esti- mous grouping of admixture estimates was not ob- mates among alleles will be performed, andadditional served for Sapelo Afro-Americans, but there was a insights into the genetic structure of the hybrid pop- significant correlation of admixture estimatesbetween ulation will be gleaned from F&. The first two ex- Sapelo and Claxton Afro-Americans. This correlation amples are Afro-American populations and the gene of admixture estimates was interpreted as good evi- flow model will be more appropriate. The third ex- dence for natural selection (BLUMBERGand HESSER ample is a Yanomamo Indian village that became 197 1). Later, ADAMSand WARD (1973) founda sig- admixed with neighboring Ye’cuana Indians. The In- nificant heterogeneity chi-square for Sapelo Island; termixture model will be more appropriate for this however, they cautioned that genetic drift as well as population. natural selection could have produced this result. The ClaxtonGeorgia (Afro-American): The Claxton parentalpopulation allele frequencies compiled by Afro-American population was sampled in the early Adams and Ward (1 973) will also be used for reana- 1960s (COOPERet al. 1963); its census size was 1287 lyzing Sapelo Island. individuals at this time. Genetic data consisting of 15 Borabuk (Yanomamo Indian): Most of the Yano- alleles at 12 informative loci JABO, Duffy, GGPD, Hb, mamo Indians of Northwestern Brazil and Southern Hp, Js, Kidd, M, P, Rh, S, T) are suitable for admix- Venezuela have been isolated from both Caucasian ture analysis. The sample sizes varied between 133 andother Amerindianpopulations until quitere- and 304 individuals per locus. Like other Afro-Amer- cently. An exception to this isolation is provided by ican populations, it is assumed that the Claxton pop- the Yanomamo village Borabuk. Historical accounts ulation has received Caucasian genesfor about 12 (CHAGNONet al. 1970) place this village in close prox- generations. imitywith a Ye’cuana Indian village, Huduaduiia, Originally, WORKMAN andco-workers (I 963) con- during the later part of the nineteenth century. Ge- cluded that admixture estimates obtained from their netic exchanges took place during this time and a few markers fell into two categories. In the first category, Ye’cuana men and women are known to have had an the marker alleles attributed about 10% of the Clax- extraordinary impact onthe Borabuk gene pool ton gene pool to Caucasian origin and about 90% to (CHAGNONet al. 1970). During the twentieth century African origin. Most geneticmarkers fell into this the Borabuk population has remained relatively iso- category and the admixture proportions are similar lated from the Ye’cuana and other Yanomamo vil- toexpectations based on ethnohistoricalconsidera- lages, thus the Intermixture model is more appropri- tions. Appreciably higher estimates of the Caucasian ate to this population. Approximately 6 generations contribution to the Claxton Afro-American popula- passed between the influx of Ye’cuana genes into tion were obtained fromalleles in the second category. Borabuk andthe time of sampling. Polygyny and The alleles falling into this category were G6PD-, other aspects of Yanomamo social organization Hp’, Tfd and Hb‘. This groupwas identified as selec- (CHAGNON 1968), and the fact that this village was tively biased because of the presence of GGPD and nearly decimated by a disease epidemic early in this Hb’. Significant heterogeneity chi-squares for these century, indicate that Borabuk’s effective population data have been observed in subsequent analyses size is quite small. Borabuk, like other Yanomamo (CAVALLI-SFORZAand BODMER197 1; ADAMSand villages fluctuates in actual size, the seventy-five indi- 424 J. C. Long

TABLE 2 Admixture estimates and variances

Hybrid StandardFirst Second population ancestor population ancestor error s '(M) s:(M) s?(M)

Claxton Caucasian African k0.0510.0019 0.0026 0.0007 0.864 0.136 (A = 0.012) Sapelo Island African Caucasian 0.0300 k0.055 0.0298 0.0002 0.932 0.068 0.932 (A = 0.006) Borabuk Yanomamo 0.0059 Ye'cuana 0.0318 0.0377 +O. 194 0.640 0.360

TABLE 3 Heterogeneity analysis

Claxton Sapelo Borabuk

Locus M, V(MJ Xf Locus M, V(MJ X? Locus M, V(Md X:

AB0 0.15 0.0 1 0.01 AB0 -0.0 1 0.01 0.47 P 1.00 24.68 0.01 Rh 0.1 1 0.14 0.00 Rh 0.44 0.25 0.56 Fy -6.00 21.99 2.01 M -0.15 10.69 0.01 M -0.85 18.83 0.04 Jk 1.23 0.15 2.31 S -0.30 0.31 0.61 S -0.05 0.53 0.03 Le 0.920.27 0.29 FY 0.1 1 0.01 0.16 Fy 0.07 0.00 0.00 Di -1 .oo 1.40 1.92 P -0.23 0.10 1.31 P 0.80 0.18 3.02 MNS 1.18 0.39 0.74 Jk -0.59 0.23 2.30 Jk -0.12 0.40 0.08 Rh 0.46 0.09 0.38 JS -0.05 0.1 2 0.30 Js -0.70 0.22 2.69 Hp 1.02 0.15 0.98 T -0.38 0.37 0.70 T -0.4 1 0.65 0.35 Gc 0.00 2.95 0.14 HP 0.63 0.05 5.06 Hp 0.23 0.08 0.33 PGM 0.50 6.06 0.00 G6PD 0.34 0.07 0.58 G6PD 0.29 0.13 0.36 ACP -1.00 1.12 2.41 Hb 0.53 0.1 5 1.03 Hb 0.44 0.28 0.49 Total x' 12.07 8.43 9.00 d.f. (1 1) (1 1) (10) Locus abbreviations: blood groups: (ABO) ABO, (Rh) rhesus, (M, S and MNS) MNSs, (Fy) Duffy, (P) P, Uk) Kidd, us) Sutter, (Di) Diego. Erythrocyte and serum protein loci: (Hp) haptoglobin, (Gc) group specific component, (PGM) phosphoglucomutase, (ACP) red cell acid phosphatase, (G6PD) glucose 6-phosphate dehydrogenase, (Hb) hemoglobin &chain, (T)phenylthiocarbamide tasting. viduals sampled represented almost all people present was obtained; in the present study a single estimate during thevisit, the defacto size was probably less than was iterated over the entire setof loci. 100 individuals. Allele frequencies at eleven loci The standarderrors of admixture estimates are (ACP, Diego, Duffy, Gc, Hp, Kidd,Le, MNSs, P, quite high. More interesting are the variances of the PGM, Rh) forBorabuk, Huduaduiia, and the Sha- admixture estimates, and theirpartition into sampling matari (Yanomamo ancestor) are given in an earlier and evolutionarycomponents. In all three popula- paper (LONGand SMOUSE1983). tions, the vast majority of the variance of the admix- ture estimates is due to evolutionary, rather than RESULTS sampling error. Therefore, thevariance of estimated Estimates of the ancestral contributions to each of admixture proportions can be seriously affected by these three populations are presented in Table 2. genetic drift, despite earlierclaims for theAfro-Amer- Admixture estimates for the Afro-American popula- ican populations (WORKMAN,BLUMBERG and COOPER tions are similar to the previously published estimates 1963; BLUMBERGand HESSER197 1). for these populations (ADAMSand WARD 1973).The Admixture estimates computed by locus, Mi, their estimate of Yanomamo ancestry for Borabukpre- variances, V(Mi),and contributions, x:, to the heter- sented here is lower than, but not significantly differ- ogeneity chi-square statistic are presented forthe entfrom, the value published earlier (LONG and three populations in Table 3. Although the admixture SMOUSE1983). The difference between the two esti- estimates appear heterogeneous across loci, it is clear mates is due to the fact that in the 1983 study an that each estimate has an enormous variance associ- admixture estimate was calculated for each locus in- ated with it, and the chi-squares computed over all dividually, and then a weighted average across loci loci within these three populations do not approach Admixed Populations 425 A TABLE 4 BORABLK Population structure interpretations

0.50 Ratio Effective Census Hybrid populat,on MSE d.f. FZT. size (N.) size (N,) N,/N.

Claxton 0.017 11 0.013 1287 349 0.27 0.30 1.02 Sapel0 0.030 I1 0.027 211 216 Borabuk 0.158 0.15210 5100 218 ?0.l8 0.10 % I::: -0.60 -0.40 -0.20 statistical significance. Despite thebroad scatter of admixture estimates fromthe individual loci, the

-0.20 goodness of fit of thelinear admixture model is revealed in Figure3 by plots of - Pzi) versus /* -0.40 (Pit - P2J,as suggested by Eq. 2. While the graphical displays are compelling, their interpretations must be tempered with the knowledge thatthe points are -0.60 heteroscedastic andthat interdependence of error terms exists amongthe points representing alleles segregating at the same locus. B ('h SAPELO Further analyses of these populations are presented in Table 4. Since a significant x* is not obtained in 0.50 any of these populations, evidence for natural selec- tion is lacking, and FST and the effective population

0.30 size was estimated for each population. The relative ranking of F& for these populations, Claxton (0.01 3) < Sapelo (0.030) < Borabuk (0.152), conformsto what 0.10 would be expectedfrom the social demography of 1:::::: these groups. The estimates of Ne are biologically -0.60 -0.40 / reasonable for Claxton (349) and Borabuk 18). This (2

-0.20 is evidenced by the fact that theratios of their effective sizes to census sizes are, respectively, 0.27 and 20.18. The estimated Ne for Sapelo Island (2 16)is unusually high since this would suggest that the effective size is greater than thecensus size. -0.60 The percentiles of the bootstrapped distributions for these estimates are presented in Table 5. These distributions are widely dispersed; however, the 95% C confidence interval for Sapelo Island [I 19, 4831 still CLAXTON suggests that Sapelo Island had an uncharacteristically large effective population size. A likely explanation for this result is that the Sapelo Island population has received immigrants from other Afro-American com- munities. In other words, geneticdrift on Sapelo Island has been affected by immigrants from both the Caucasian population and other Afro-American pop- ulations. Since there were only four recordedcases of 1:::::: I (PI - P2) -0.60 -0.40 -0.20 0.10 0.30 0.50 miscegenation in the history of Sapelo Island, and . .. very few permanent Caucasian inhabitants of the is- land (BLUMBERGand HESSER1971), it is not unlikely that a portion of the Caucasian genes in the Sapelo Island population were actually introduced by ad- mixed Afro-Americans.

1 slopes of the plottedlines are the estimated proportionsof African FIGURE3.-The relationships between (ph2 - P2,) and (PI,- P2,). ancestry in the Afro-American populations and the proportion of This relationship should be linearwith slope @ (by Equation 2). The Yanonwno ancestry in the Borabuk population. 426 J. C. Long TABLE 5 COOPER1963; BLUMBERG and HESSER197 1). The Percentiles of bootstrapped distributionsof Nt chi-square statistics for heterogeneity of admixture estimates did not attain statistical significance in any Population of these populations. Earlier, significant chi-squares Percentile Claxton Sapelo Borabuk had been obtainedfor Claxton and Sapelo Island Blacks (ADAMSand WARD 1973), although these in- Min 117 82 7 vestigators were duly cautious in their interpretation. 1 162 109 13 2.5 178 119 14 In the absence of evidence for natural selection, it 5 194 128 16 is possible to estimate effective population sizes. Ne is 10 216 140 17 a fundamentally important parameter in population 25 265 166 21 genetics that links simplified theoretical models to 50 34 1 205 26 natural and artificial populations; yet Ne is notoriously 75 457 265 34 90 618 347 46 difficult to estimate, as noted by LAURIE-AHLBERG 95 745 413 58 and WEIR (1979),HILL (1981), WOOD (1987), and 97.5 88 1 483 72 WAPLES(1989) in other contexts. The estimation 99 1069 596 98 methods presented here should prove to valuable to Max 2305 2150 614 the analysis of hybrid populations. Each distribution is based on n = 10,000 bootstrapped trials. Fundamental to the interpretation of estimates of DISCUSSION Ne is knowledge of the distributions the estimates will follow. The bootstrap method suggested here is only The objective of this study has been to provide a a rough approximation.Ideally, if the original number framework for estimating the effects of genetic drift of alleles sampled is large, then the empirical distri- and admixture simultaneously. Thereby, the value of bution will closely approximate the true distribution. admixture estimates for analysis of hybrid population For smaller samples, such as those analyzed for the structure is enriched, and an appropriatenull hypoth- three example human populations,the interpretations esis for the detection of natural selection can be for- should be cautious. It should also be noted that the mulated. It was shown that Wright’s fixation index bootstrap method is not without underlying assump- FSTis closely related to the error variance of allele tions. It clearly assumes that the original data were frequencies predicted by admixture, and thatFST can drawnindependently from the same distribution. be estimated from a multiple locus sample of hybrid These assumptions are unlikely to hold considering population allele frequencies, providing that there is that alleles may co-segregate within loci and that the information on the allele frequencies in the contrib- distribution of N$ may vary from allele to allele. uting parental populations. Refinements on this approximation will be a fruitful There are two existing methods for estimating ad- area for further research. mixture proportions that allow for both evolutionary As a final point, the utility of the procedures devel- and sampling error (THOMPSON1973; WIJSMAN oped here should be applicable for purposes beyond 1984); each of these methods requires independent analysis of population genetic structure. For example, knowledge of the variance effective population size of epidemiological interest in hybridpopulations has thehybrid population over its entireevolutionary been increasing. Hybridpopulations have recently history, and that the admixture process was that of been shown to present unique opportunities for the the Intermixture Model. However, variance effective estimation of modes inheritance for diseases of com- sizes of populations are seldom known, and the Inter- plex etiology (CHAKRABORTYand WEISS 1986) and mixture Model does not apply to all admixed popu- for the estimation of recombination fractions (CHAK- lations. Superimposition of the intermixture model RABORTY and WEISS 1988). Precise estimates of an- onto populations for which gene flow is more appro- cestral contributions are required in each of these priate would clearly be misleading. Perhaps the great- circumstances and the methods provided here will be est advantage of the estimation procedures developed broadly applicable. The results of this study indicate here is that FST can be estimated regardless of the that drift error must always be considered in admix- underlyingpopulation structure, and the drift cor- ture estimates and that simultaneous assay of many rected chi-square test for heterogeneous admixture marker loci will be preferred. proportions can always be applied. 1 wish to thankLYNN JORDE, KEN MORGAN,ALAN ROGERS, Evolutionary variance in admixture estimates over- I’ETER SMOUSE,TROY TUCKER, BRUCE WEIR and an anonymous whelmed sampling variance in analyses of three ad- reviewer for their comments on earlier drafts of this paper. This mixed human populations. This finding is especially work has been greatly improved by the computersimulation studies of JAY SORUSwhich led me to the correct derivations of many Of important since it was originally arguedthat drift the formulaspresented here. This research was supported by a should not have been a significant force in two of National Institutes of Health BSRG grant from the Vice President these populations (see WORKMAN,BLUMBERG and of Research, University of New Mexico, tOJC1,. Admixed Populations 427

Note Added in Prooj An anonymous reviewer has MORTON,N. E., 1969 Human population structure. Annu. Rev. shown that thestatistics M and MSE, as defined inthis Genet. 3: 53-73. NEI, M., 1965 Variation and covariation of gene frequencies in paper, can be obtained from a set of equations that subdivided populations, Evolution 119: 256-258. do not require eliminating oneallele from each locus NETER, J., andW. WASSERMAN,1974 AppliedLinear Statistical for computation. Models. R. D. Irwin, Inc., Homewood, Ill. REED,T. E., 1969 Caucasian genes in American Negroes. Science 165: 762-768. LITERATURECITED REYNOLDS,J.. B. S. WEIRand C. C. COCKERHAM, 1983Estimation ADAMS,J., and R. H. WARD, 1973 Admixture studies and detec- of the coancestrycoefficient: basis for a short-term genetic tion of selection. Science 180 1137-1 143. distance. Genetics 105: 767-799. BLUMBERG,B. S., and J. E. HESSER, 1971 Loci differentially SMITH,C. A. B., 1969 Local fluctuations in genefrequencies. affected by selection in two American Black Populations. Proc. Ann. Hum. Genet. 32: 251-260. Natl. Acad. Sci. USA 68: 2554-2558. THOMPSON,E. A., 1973 The Icelandic admixture problem. Ann. BODMER,W. E., and L. L. CAVALLI-SFORZA,1968 A migration Hum. Genet. 37: 69-80. matrix model for the study of random genetic drift. Genetics WAPLES,R. S., 1989 A generalized approach for estimating effec- 59: 565-592. tive population size from temporal changesin allele frequency. CAVALLI-SFORZA,L. L., and W. F. BODMER,1971 TheGenetics of Genetics 121: 379-391. Human Populations. Freeman Xe Co, San Francisco. WEIR, B. S., and C.C. COCKERHAM, 1984Estimating F-statistics CHAGNON,N., 1968 Yanomamo:The Fierce People. Holt, Rinehart for the analysis of population structure. Evolution 38: 1358- Xe Winston, New York. 1370. CHAGNON, N.,J. V. NEEI., L. WEITKAMP, H.GERSHOWITZ and M. WIJSMAN,E. M., 1984 Techniques for estimating genetic admix- AYRES, 1970 The influence of cultural factors on the demog- ture and applications to the problems of theorigin of the raphy and pattern of gene flow from the Makiritare to the Icelanders and theAshkenazi Jews. Hum. Genet.67: 441-448. Yanomamo Indians. Am. J. Phys. Anthropol. 32: 339-350. WOOD,J. W., 1987The geneticdemography of theGainj of CHAKRAR~RTY,R., 1986 Gene admixture in human populations: Papua New Guinea. 111. Determinants of effective population nlodels and predictions. Yearb. Phys. Anthropol. 29 1-43. size. Am. Nat. 129 165-187. CHAKRABORTY,R., and K. M.WEISS, 1986 Frequencies ofcom- WORKMAN,P. L., 1968Gene flow andthe search fornatural plex diseases in hybrid populations. Am. J. Phys. Anthropol. selection in man. Hum. Biol. 40: 260-279. 70489-503. WORKMAN,P. L., B. S. BLUMBERGand A. J. COOPER, CHAKRABORTY,R., and K. M. WEISS, 1988 Admixture as a tool 1963 Selection, gene migration and polymorphic stability in for finding linked genes and detecting that difference from a US White and Negro population. Am. J. Hum. Genet. 15: allelic association between loci. Proc. Natl. Acad. Sci. USA 85: 429-435. 91 19-9123. WRIGHT,S., 1951 The genetical structure of populations.Ann. COCKERHAM,C. C., 1969 The variance of gene frequencies. Ev- 15 323-354. olution 23: 72-84. WRIGHT,S., 1965 The interpretation of population structure by COCKERHAM,C. C., 1973 Analyses of gene frequencies. Genetics F-statistics with special regard to systems of mating. Evolution 74: 679-700. 19 395-420. COOPER,A. J., B. S. BLUMBERG,P. L. WORKMAN andJ. R. Mc- WRIGHT,S., 1969 Evolution and The Genetics of Populations. VOL DONOUGH,1963 Biochemical polymorphictraits in a U.S. 2, The Theory of Gene Frequencies. University of Chicago Press, White and Negro population. Amer. J. Hum. Genet. 15: 420- Chicago. 428. Communicating editor: B. S. WEIR DIACONIS,P., and B. EFRON,1983 Computer intensive methods in statistics. Sci. Am. 249 116-130. EFRON, B.,1979 Bootstrap methods: another look at the jackknife. APPENDIX Ann. Statist. 7: 1-26. EFRON, B.,and G. GONG, 1983 A leisurely look at the bootstrap, Proof That Any Allele Can Be Dropped With the the jackknife, andcross-validation. Am. Statist. 37: 36-48. Weighted Least Squares Procedure ELSTON,R. C., 1971 The estimation ofadmixture in racial hy- brids. Ann. Hum. Genet. 359-17. Consider alocus with an arbitrary numberof alleles GLASS,B., and C. C. LI, 1953 The dynamics of racial intermix- AI, A,,. . . Az, with frequencies (PI, p,, . . . , pz) in the ture-an analysis based on the American Negro. Am. J. Hum. hybrid population and ($11, . in the first Genet. 5: 1-20, pzl,. . , pzl) HILL,W. G., 1981 Estimationof effective population size from parentalpopulation and (PI,, p22, . . . , pz2) in the data on . Genet. Res. 38: 209-216. second parental population. Define XT = [(pll - p,,), IMAIZUMI, Y., N. E. MORTON and D. E. HARRIS,1970 Isolation (PZI- Pzz), . . . , ($21 - ~zz)]and yT = [(PI - p12), by distance in artificial populations. Genetics 66: 569-582. (Pz - P22), . . . , (Pz - Pzz)]. The equation for the LAURIE-AHLBERG,C. C., and B. S. WEIR, 1979 Allozyme variation Weighted Least Square estimate of the admixture pro- and linkage disequilibrium in some laboratory populations of Drosophilia melanogaster. Genetics 74: 175- 195. portions is given by 1.1, C. C., 1976 First Course in Population Genetics. Boxwood Press, p dclfic: Grove, Calif. M = (xs~v-’xs)-’xs~v-’ys (A) LONG,J. C., 1986 The allelic correlation structure of Gainj- and where the “shortened”vectors Xs and ys are obtained Kalam-speaking people. I. The estimation and interpretation from the “full” vectors and y by eliminating the Zth of Wright’s F-statistics. Genetics 112 629-647. X LONG,J. C., and P. E. SMOUSE,1983 Intertribal geneflow between element. V has dimension (Z - 1 by Z - 1) and it is the Ye’cuana and Yanomamo: genetic analysis of an admixed the variance-covariance matrix of the multinomial village. Am. J. Phys. Anthropol. 61: 41 1-422. distribution (N = 1). It is easily verified that V” can 428 J. C. Long be written out explicitly as

-1 ...9 r 7 Pz ... -1 Pz ...... , r -I 1 1 - ...) -+q j#d Pz ' pz-I p, Equation AI can now be rewritten as follows

i= 1 M = z--l z--l 1 1 Xi[.. - + - * i= 1 pi PZ ,=I Let the allele to be dropped from analysis be desig- nated Ad. So far, we have let AZ be Ad, now we will let Ad be any of the 2 alleles. We will show that this leaves To complete the proof, the identity of the denomi- Equations 1 and 2 unaffected. Some special notation nators (02 and 03) is now proved is required, Z C xj j=1, j#d will mean that a sum will be taken over all alleles, except Ad. For example, if d = 3 and Z = 5, then Z xj = (x1 + x:! + x4 + xg). ,= 1 j#d Now, the estimator for admixture proportions can be rewritten as r 1 NOWusing theidentity and xz = - xi,

r

To prove that any allele can be dropped, Equation A2 must equal Equation A3. We begin by proving the identity of the numerators r 1 r 1

r#d r 1 r 1

i#d j#d Since it has been proved that the denominators of r 1 Equations A2 and A3 are equal, and that the numer- ators of Equations A2 and A3 are equal, it is proved that the same admixture estimate is obtained no mat- ter which allele is eliminated from the system.