INBREEDING ESTIMATION FROM POPULATION DATA: MODELS, PROCEDURES AND IMPLICATIONS1

RICHARD S. SPIELMAN,2,3 JAMES V. NEEL2 and FRANCIS H. F. LI2

2 Department of Human Genetics, University of Michigan Medical School, Ann Arbor, Michigan 48109

3 Department of Human Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19174

Manuscript received November 6, 1975

ABSTRACT

Four different estimation procedures for models of population structure are compared. The parameters of the models are shown to be equivalent and, in most cases, easily expressed in terms of the parameters WRIGHT calls “F-statistics.” We have estimated the parameters of each of these models with data on nine codominant allele pairs in 47 Yanomama villages, and we find that the different estimators for a given parameter all yield more or less equivalent results. F-statistics are often equated to inbreeding coefficients that are defined as the probability of identity by descent from alleles taken to be unique in some founding population. However, we are led to infer from computer simulation and general historical considerations that all estimates from genotype frequencies greatly underestimate the inbreeding coefficient for alleles in the founding population of American Indians in the western hemisphere. We surmise that in the highly subdivided tribal populations which prevailed until the recent advent of civilization, the probability of identity by descent for homologous alleles was roughly 0.5. We consider some consequences of working with the customary, much lower, estimates (0.005 to 0.01) if, on the time scale of human evolution, these represent only a very recent departure from the inbreeding intensity that prevailed before civilization.

WRIGHT (1943, 1951, 1965) described the genetic properties of a subdivided or “structured” population by a model that, under certain assumptions, leads to estimates of average inbreeding of individuals. Recent years have witnessed the development of a number of alternative models and a proliferation of estimation procedures, described below. In the present communication, we first show how the models and estimators are related, by rewriting them in a uniform notation which reveals the underlying similarities. Then we compare the results obtained by these various approaches with a well-defined body of data. Finally, we attempt to infer how much these procedures err in estimating inbreeding (i.e., frequency of identity by descent of a pair of alleles) and illustrate some consequences.

1 Supported by the Energy Research Development Agency E(11-1)-1152 and the National Science Foundation BMS-74-11823. Present address.

Genetics 85: 355-371 February, 1977

For our test data we use some of the extensive information now available on the South American Indian. The unusual advantages of such populations for genetic inference were pointed out some time ago (NEEL and SALZANO 1964; NEEL 1970). Although the conclusions are derived from Indian populations, we believe that with suitable caution they apply to tribal man generally. It will be shown that all estimation procedures based on gene and genotype frequencies yield more or less comparable results, which, however, substantially underestimate the true coefficient of inbreeding in such populations.

METHODS

We have estimated the structure parameters for each of nine codominant allele pairs (listed in Table 2) using data from 47 Yanomama villages (GERSHOWITZ et al. 1972; WEITKAMP and NEEL 1972; WEITKAMP et al. 1972; WARD et al. 1975). The four estimation procedures we have used are described eponymously below, and may be taken to represent four slightly different models. All models assume no selection and envisage a population subdivided into $N$ smaller units. The estimated allele frequency in the $i$th subdivision (size $n_i$) is $p_i$ and the mean of $p_i$ across subdivisions is $\bar{p}$. Following WRIGHT’S definitions (1943, especially 1965), all models introduce one or more parameters anomalously called “F-statistics,” to measure the nonrandomness which results from this structure. In the first three models, $F_{ST}$ is the correlation between two gametes randomly selected from individuals in the same subdivision, relative to gametes of the total population. $F_{IT}$ is the correlation between two gametes uniting to form a zygote, ignoring subdivision (i.e., “relative to gametes of the total population”). If generations are discrete and mating is at random within each subdivision (as “random” is specified in the definition of $F_{ST}$), $F_{IT}$ is exactly equal (ignoring sampling error) to $F_{ST}$ of the previous generation (WRIGHT 1965, p. 409-410). If mating is not random, a third correlation, $F_{IS}$, defined as the correlation between uniting gametes “relative to those of their own subdivision,” reflects departures from random mating within subdivisions. The three F-statistics are related by:

$$1 - F_{IT} = (1 - F_{IS})(1 - F_{ST}). \tag{1}$$

These correlations have been estimated as follows:

A. W/N (WORKMAN and NISWANDER 1970; NEEL and WARD 1972): Method of moments

To estimate the F values, let $H_T$ be the frequency of heterozygotes in the total population and $s_p^2$ be the weighted mean squared deviation of $p_i$ from $\bar{p}$, i.e.,

$$s_p^2 = \sum_{i=1}^{N} w_i (p_i - \bar{p})^2 .$$

Weighting is proportional to subdivision size:

$$w_i = n_i \Big/ \sum_{i=1}^{N} n_i .$$

Then

$$F_{ST} \mathrel{\hat{=}} s_p^2 / \bar{p}(1 - \bar{p}) \tag{2a}$$

$$F_{IT} \mathrel{\hat{=}} 1 - H_T / 2\bar{p}(1 - \bar{p}) \tag{2b}$$

where $\hat{=}$ represents “is estimated by.” Rewriting equation (1) we have $F_{IS} = (F_{IT} - F_{ST})/(1 - F_{ST})$. In the W/N procedure, $F_{IS}$ has been estimated either by substituting estimates for $F_{IT}$ and $F_{ST}$ in this relationship, or as the weighted mean of the within-subdivision analogue of $F_{IT}$:

$$F_{IS} \mathrel{\hat{=}} \sum_{i=1}^{N} w_i \left[ 1 - H_i / 2p_i(1 - p_i) \right] . \tag{2c}$$
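The method-of-moments estimators can be illustrated with a short sketch. It assumes genotype counts (AA, Aa, aa) are available for each subdivision, takes the mean allele frequency to be the size-weighted mean, and takes the total heterozygote frequency to be the pooled proportion; the function name and the input layout are illustrative choices, not part of the W/N procedure as published.

```python
from typing import Dict, Sequence, Tuple

Genotypes = Tuple[int, int, int]  # (n_AA, n_Aa, n_aa) for one subdivision


def wn_f_statistics(counts: Sequence[Genotypes]) -> Dict[str, float]:
    """Moment estimates of F_ST, F_IT and two versions of F_IS.

    Assumes every subdivision is polymorphic (0 < p_i < 1), so that the
    within-subdivision term 1 - H_i / 2 p_i (1 - p_i) is defined.
    """
    sizes = [sum(c) for c in counts]
    total = sum(sizes)
    w = [n / total for n in sizes]                       # weights w_i
    p = [(2 * c[0] + c[1]) / (2 * n) for c, n in zip(counts, sizes)]
    h = [c[1] / n for c, n in zip(counts, sizes)]        # heterozygote freq H_i

    p_bar = sum(wi * pi for wi, pi in zip(w, p))         # assumed: weighted mean
    h_t = sum(wi * hi for wi, hi in zip(w, h))           # pooled heterozygote freq
    s2 = sum(wi * (pi - p_bar) ** 2 for wi, pi in zip(w, p))

    f_st = s2 / (p_bar * (1 - p_bar))
    f_it = 1 - h_t / (2 * p_bar * (1 - p_bar))
    # F_IS by substitution in F_IS = (F_IT - F_ST) / (1 - F_ST)
    f_is_ratio = (f_it - f_st) / (1 - f_st)
    # F_IS as the weighted mean of within-subdivision deficits
    f_is_mean = sum(
        wi * (1 - hi / (2 * pi * (1 - pi))) for wi, hi, pi in zip(w, h, p)
    )
    return {
        "F_ST": f_st,
        "F_IT": f_it,
        "F_IS_ratio": f_is_ratio,
        "F_IS_mean": f_is_mean,
    }
```

With subdivisions that are each in Hardy-Weinberg proportions, both versions of $F_{IS}$ are zero and $F_{IT}$ equals $F_{ST}$; with a heterozygote deficit inside subdivisions the two versions of $F_{IS}$ generally differ.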

These alternate estimators for $F_{IS}$ are in general not equivalent.

B. COCKERHAM (1969, 1973): Least squares

The model interprets the F values as intraclass correlations, thus implying a nested analysis of variance. The two alleles at a locus are assigned values (say 0, 1), and each individual provides two such allelic measurements for a total score of 0, 1, or 2. The total variance of allelic measurements is the sum of three contributions: $\sigma_a^2$, variance among group means; $\sigma_b^2$, variance among individuals (within groups); and $\sigma_w^2$, variance between the two alleles (within individuals, within groups). The equivalence of such variances among units to covariances within units leads naturally to the definition of their ratios as intraclass correlations.
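A minimal sketch of this nested analysis of variance for one biallelic locus follows. It assumes genotype counts per subdivision as input, scores allele A as 1 and a as 0, uses the conventional unbalanced-design coefficient $n_0$, and forms the intraclass-correlation ratios that the text goes on to define. The function name, the returned keys and the sums-of-squares shortcuts are illustrative, not a transcription of any published program.

```python
from typing import Dict, Sequence, Tuple

Genotypes = Tuple[int, int, int]  # (n_AA, n_Aa, n_aa) for one subdivision


def cockerham_f_statistics(counts: Sequence[Genotypes]) -> Dict[str, float]:
    """Least squares (nested ANOVA) estimates for 0/1 allelic scores."""
    n_groups = len(counts)
    sizes = [sum(c) for c in counts]
    total = sum(sizes)
    q = [(2 * c[0] + c[1]) / (2 * n) for c, n in zip(counts, sizes)]
    q_bar = sum(n * qi for n, qi in zip(sizes, q)) / total  # pooled frequency

    # Sums of squares of the 2 * total allelic scores
    ss_among = sum(2 * n * (qi - q_bar) ** 2 for n, qi in zip(sizes, q))
    ss_within_ind = sum(c[1] for c in counts) / 2   # each heterozygote adds 1/2
    ss_among_ind = sum(2 * n * qi * (1 - qi) for n, qi in zip(sizes, q)) - ss_within_ind

    ms_a = ss_among / (n_groups - 1)
    ms_b = ss_among_ind / (total - n_groups)
    ms_w = ss_within_ind / total

    n_bar = total / n_groups
    n0 = n_bar - sum((n - n_bar) ** 2 for n in sizes) / (n_groups * (n_groups - 1) * n_bar)

    var_a = (ms_a - ms_b) / (2 * n0)   # among subdivisions
    var_b = (ms_b - ms_w) / 2          # among individuals within subdivisions
    var_w = ms_w                       # between alleles within individuals
    var_total = var_a + var_b + var_w
    return {
        "F_ST": var_a / var_total,
        "F_IT": (var_a + var_b) / var_total,
        "F_IS": var_b / (var_b + var_w),
    }
```

Because all three ratios share the same variance components, the returned values satisfy $F_{IS} = (F_{IT} - F_{ST})/(1 - F_{ST})$ exactly.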

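Thus COCKERHAM defines $F_{ST}$ (which he calls $\theta$) as

$$F_{ST} = \theta = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_b^2 + \sigma_w^2} \tag{3a}$$

and $F_{IT}$, which he does not usually subscript:

$$F_{IT} = F = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_a^2 + \sigma_b^2 + \sigma_w^2} . \tag{3b}$$

These definitions give meaning to WRIGHT’S (1965) phrase “relative to gametes of the total population” quoted above. For $F_{ST}$ the numerator is the covariance within groups (variance among groups) and for $F_{IT}$, the covariance within individuals (total variance among individuals). Continuing the analogy,

$$F_{IS} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2} , \tag{3c}$$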

the intraclass correlation interpretation of WRIGHT’S definition, which is algebraically exactly $(F_{IT} - F_{ST})/(1 - F_{ST})$. COCKERHAM thus achieves a system of correlations which give statistical meaning to WRIGHT’S definitions and form a coherent set in the idiom of analysis of variance. One attractive consequence is a familiar and natural means of estimation (least squares). For an unbalanced design (unequal $n_i$), we may define a conventional weighted analysis. Let $\bar{n} = \sum_{i=1}^{N} n_i / N$, and define $n_0 = \bar{n} - \sum_{i=1}^{N} (n_i - \bar{n})^2 / N(N - 1)\bar{n}$ (SNEDECOR and COCHRAN 1967; SING, CHAMBERLAIN and EGGLESTON 1973). Table 1 gives the expectation of the mean squares among subdivisions, among individuals within subdivisions, and within individuals. Since observations are restricted to 0 and 1, the sums of squares may be expressed simply in terms of numbers of genotypes or alleles. Setting the expectations equal to observed mean squares, we obtain the usual least squares estimators:

$\sigma_a^2 \mathrel{\hat{=}} (MS_a - MS_b)/2n_0$, $\sigma_b^2 \mathrel{\hat{=}} (MS_b - MS_w)/2$, and $\sigma_w^2 \mathrel{\hat{=}} MS_w$.

TABLE 1

Nested analysis of variance, unbalanced design*

| Source of variation | Mean squares | Expectation of mean squares |
|---|---|---|
| Among subdivisions | $MS_a$ | $\sigma_w^2 + 2\sigma_b^2 + 2n_0\sigma_a^2$ |
| Among individuals (within subdivisions) | $MS_b = \left[ \sum_{i=1}^{N} 2n_i q_i(1 - q_i) - \sum_{i=1}^{N} n_{i2}/2 \right] \Big/ \left( \sum_{i=1}^{N} n_i - N \right)$ | $\sigma_w^2 + 2\sigma_b^2$ |
| Within individuals (between alleles) | $MS_w = \sum_{i=1}^{N} n_{i2} \Big/ 2\sum_{i=1}^{N} n_i$ | $\sigma_w^2$ |

* One method of estimation for COCKERHAM’S (least squares) model. From SING, CHAMBERLAIN and EGGLESTON 1973. Estimate frequency $q_i$ of $A$ by $q_i = (2n_{i1} + n_{i2})/2n_i$. For additional notation, see text.

C. RST (ROTHMAN, SING and TEMPLETON 1974): Maximum likelihood

The first two characterizations of population structure require no assumptions about probability distributions underlying the observations. In contrast, the RST model makes explicit assumptions about the probability of collections of genotypes in an array of subdivisions. Imagine a population in which genotypes $AA$, $Aa$, and $aa$ have frequencies $P_1$, $P_2$, and $P_3$ respectively ($P_1 + P_2 + P_3 = 1$). Let the numbers of these three genotypes in the $i$th subdivision be $n_{i1}$, $n_{i2}$, $n_{i3}$, where $n_{i1} + n_{i2} + n_{i3} = n_i$. While the basic parameter of structure in WRIGHT’S model is correlation between allele values, the RST model is more conveniently expressed in terms of the correlation ($\rho$) between the total allelic measurements (0, 1, 2) of individuals. The fundamental equivalence to WRIGHT’S form of the model is shown by the relationship $\rho = 2F_{ST}/(1 + F_{IT})$; i.e., $\rho$ is WRIGHT’S (1922) coefficient of relationship. If a sample of $n_i$ individuals is drawn without replacement from the total to form a subpopulation, the distribution of the three genotypes as a function of $P_1$, $P_2$, $P_3$, and $\rho$ is:
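A form of this distribution consistent with the log-likelihood (5) given by ROTHMAN, SING and TEMPLETON is the compound multinomial written below. It is offered as a sketch inferred by exponentiating (5) and restoring the multinomial coefficient that (5) omits, not as a transcription of the printed equation (4).

```latex
\Pr(n_{i1}, n_{i2}, n_{i3})
  = \frac{n_i!}{n_{i1}!\, n_{i2}!\, n_{i3}!}\;
    \frac{\prod_{j=1}^{3} \prod_{k=0}^{n_{ij}-1} \left[ (1-\rho) P_j + \rho k \right]}
         {\prod_{k=0}^{n_i - 1} \left( 1 - \rho + \rho k \right)}
```

For $\rho = 0$ this reduces to the ordinary multinomial with probabilities $P_1$, $P_2$, $P_3$, and an empty product (any $n_{ij} = 0$) is taken as 1.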

Estimation fortunately does not require evaluation of the combinations in (4). ROTHMAN, SING and TEMPLETON (1974) give the log-likelihood for $N$ subdivisions under this model:

$$\ln L(P_1, P_2, \rho) = \sum_{i=1}^{N} \sum_{j=1}^{3} \sum_{k=0}^{n_{ij}-1} \ln\left[ (1 - \rho)P_j + \rho k \right] - \sum_{i=1}^{N} \sum_{k=0}^{n_i - 1} \ln(1 - \rho + \rho k). \tag{5}$$

A computer program is available to search for the values of $P_1$, $P_2$, and $\rho$ which maximize this likelihood, given the genotype numbers ($n_{ij}$) in each subdivision. Since a single biological model underlies all the procedures, $P_1$, $P_2$, and $\rho$ in RST may be considered a reparameterization of WRIGHT’S $F_{IT}$ and $F_{ST}$. The alternate parameter sets are related as follows ($q = P_1 + P_2/2$):

$$F_{ST} = \rho(1 + F_{IT})/2 \tag{6a}$$

$$F_{IT} = 1 - P_2 / 2q(1 - q) \tag{6b}$$
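The log-likelihood and the translation back to F-statistics can be sketched in a few lines. The sketch assumes genotype counts per subdivision, $0 \le \rho < 1$ and strictly positive $P_j$; the conversion uses the relation $\rho = 2F_{ST}/(1 + F_{IT})$ stated in the text together with the usual heterozygote-deficit form of $F_{IT}$. Function names are illustrative, and the search over the likelihood surface performed by the authors' program is not reproduced.

```python
import math
from typing import Sequence, Tuple


def rst_log_likelihood(
    counts: Sequence[Tuple[int, int, int]],
    P: Tuple[float, float, float],
    rho: float,
) -> float:
    """Log-likelihood of genotype counts under the RST model.

    counts[i] = (n_i1, n_i2, n_i3); P = (P_1, P_2, P_3) with P_j > 0.
    The multinomial coefficient is omitted, since it does not depend
    on the parameters.
    """
    total = 0.0
    for row in counts:
        for n_ij, p_j in zip(row, P):
            for k in range(n_ij):
                total += math.log((1 - rho) * p_j + rho * k)
        for k in range(sum(row)):
            total -= math.log(1 - rho + rho * k)
    return total


def rst_to_f(P: Tuple[float, float, float], rho: float) -> Tuple[float, float]:
    """Translate (P_1, P_2, P_3, rho) into (F_IT, F_ST)."""
    p1, p2, _ = P
    q = p1 + p2 / 2                      # frequency of allele A
    f_it = 1 - p2 / (2 * q * (1 - q))
    f_st = rho * (1 + f_it) / 2
    return f_it, f_st
```

With $\rho = 0$ the log-likelihood reduces to the multinomial form $\sum_i \sum_j n_{ij} \ln P_j$, which is a convenient check on an implementation.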

For comparison of the procedures under different models, we may apply these equivalences and extract $F_{IT}$ and $F_{ST}$ from $P_1$, $P_2$, and $\rho$.

D. M/M: MALECOT/MORTON (MALECOT 1969; MORTON et al. 1971)

The fourth model considered here invokes a different approach to the estimation of association between alleles. Geographic separation does not explicitly enter the concept of subdivision in the formulations of allelic association described above, and the subdivisions need not be spatially separated. This is because the degree of isolation of subdivisions is not a parameter of the model as stated (although WRIGHT, especially 1965, has discussed modifications where degree of isolation, represented by migration, is included). In contrast, MALECOT’S model of population structure makes association between two random alleles from individuals $i$ and $j$ a function of the (variable) distance between the individuals. There are more fundamental differences, however, in the kind of allelic association studied; instead of using the correlations of WRIGHT, MALECOT measures association between the two random alleles by $\phi$, called kinship, and defined as the (mean) probability of identity by descent (from a single ancestral allele):

$$\phi(d) = a e^{-bd} d^{-c} \tag{7a}$$

where $d$ is the distance between $i$ and $j$, $a$ is local kinship, $b$ is a function of the so-called “coefficient of recall to equilibrium,” incorporating the effect of history and systematic pressures (migration, mutation, selection), and $c$ is a constant, the “dimensionality of migration” (equal to 1/2 for two-dimensional migration). In application to human populations by MORTON and colleagues (IMAIZUMI, MORTON and HARRIS 1970; MORTON et al. 1971), (7a) has been simplified to

$$\phi_{ij} \text{ or } \phi(d) = a e^{-bd} \tag{7b}$$

The constant of proportionality, $a$, is defined by $\phi(0)$ or $\phi_{ii}$, mean kinship at distance zero or “local kinship” for generation $i$. Conceptually, $a$ is the same parameter WRIGHT calls $F_{ST}$ (CAVALLI-SFORZA and BODMER 1971; MORTON 1973). Experience has shown (MORTON 1969, p. 64; MORTON et al. 1973, p. 349) that estimates of $\phi$ at large values of $d$ are often negative, so MORTON has altered the relationship, making kinship at distance $d$ relative to that at the largest value of $d$:

$$\phi(d) = (1 - L) a e^{-bd} + L \tag{7c}$$

$L$ is kinship obtained for the largest distance class by fitting (7b). Estimation for this model has gone through several generations of computer programs in the hands of MORTON and his colleagues (see MORTON 1973; MORTON et al. 1973). Although the model describes $\phi$ as a probability, the estimation procedures all use functions of contemporary genotype (or phenotype) frequencies. Thus, like the estimates obtained for the previous models, which are explicitly based on correlations of allelic state, these estimates cannot in general refer to probability of identity by descent (CAVALLI-SFORZA 1972). This ultimate equivalence of estimates of kinship to conventional correlation estimates, which must be true in principle, is demonstrated by the estimator chosen for $\phi_{ii}$, kinship within the local population. Programs designed and distributed by MORTON compute an estimate of $\phi_{ii}$ for biallelic loci from the following expression (or its equivalent):

$$\phi_{ii} \mathrel{\hat{=}} \sum_{i=1}^{N} w_i \left( \frac{p_i^2}{\bar{p}} + \frac{q_i^2}{\bar{q}} - 1 \right) \tag{8}$$

where notation is as before, and $q_i = 1 - p_i$ (likewise $\bar{q} = 1 - \bar{p}$). The term in parentheses is equivalent to $(p_i - \bar{p})^2 / \bar{p}(1 - \bar{p})$ so that in this procedure $\phi_{ii}$ is being estimated by $\sum_{i=1}^{N} w_i (p_i - \bar{p})^2 / \bar{p}(1 - \bar{p})$, exactly the estimator used for $F_{ST}$ in (2a) above. It follows that for two alleles in $N$ subdivisions, the estimate for $\phi_{ii}$ is not only theoretically, but also computationally, equivalent to (2a); local kinship is estimated by $F_{ST}$. Thus it is tempting to write:

$$\phi(d) = F_{ST} e^{-bd} \tag{9}$$

incorporating the assertion that for $d = 0$, kinship is $F_{ST}$.

TABLE 2

Estimates of parameters of population structure in 47 Yanomama Indian villages, by three different procedures (see text for abbreviations)

| Allele | Frequency | Sample size | W/N $F_{IT}$ | W/N $F_{ST}$ | COCKERHAM $F_{IT}$ | COCKERHAM $\theta$ | RST $\rho$ | RST $F_{IT}$ | RST $\rho(1 + F_{IT})/2 = F_{ST}$ |
|---|---|---|---|---|---|---|---|---|---|
| $M$ | .628 | 3105 | .079 | .085 | .081 | .080 | .081 | .073 | .043 |
| $S$ | .174 | 3105 | .060 | .058 | .061 | .052 | .049 | .086 | .027 |
| $C$ | .912 | 3105 | .093 | .089 | .096 | .084 | .105 | .155 | .061 |
| $E$ | .174 | 3105 | .017 | .061 | .079 | .056 | .079 | .075 | .042 |
| $Fy^a$ | .556 | 2287 | .056 | .070 | .044 | .050 | .066 | .037 | .034 |
| $Hp^1$ | .882 | 2831 | .008 | .071 | .070 | .066 | .077 | .037 | .040 |
| $Gc$ | .870 | 3102 | .063 | .066 | .064 | .060 | .074 | .088 | .040 |
| $PGM_1$ | .951 | 2940 | .004 | .040 | .005 | .033 | .073 | .078 | .040 |
| $Alb^{normal}$ | .910 | 3139 | .046 | .087 | .048 | .082 | .123 | .109 | .068 |
| Weighted mean | | | .048 | .070 | .048 | .063 | .081 | .084 | .044 |

Rank correlations: $F_{IT}$ of W/N with $F_{IT}$ of COCKERHAM: 0.58. $F_{ST}$ of W/N with $F_{ST}$ of COCKERHAM: 0.90.
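The two rank correlations quoted under Table 2 can be reproduced from the tabulated estimates. The sketch below assumes the coefficient is Spearman's (the rank statistic is not named) and that no values are tied, which holds for these columns; the lists are transcribed from Table 2 in the order M, S, C, E, Fy, Hp, Gc, PGM, Alb, and the variable names are illustrative.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Ranks 1..n in ascending order; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation from squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Table 2 estimates as read from the source (order: M, S, C, E, Fy, Hp, Gc, PGM, Alb)
wn_fit = [0.079, 0.060, 0.093, 0.017, 0.056, 0.008, 0.063, 0.004, 0.046]
wn_fst = [0.085, 0.058, 0.089, 0.061, 0.070, 0.071, 0.066, 0.040, 0.087]
ck_fit = [0.081, 0.061, 0.096, 0.079, 0.044, 0.070, 0.064, 0.005, 0.048]
ck_fst = [0.080, 0.052, 0.084, 0.056, 0.050, 0.066, 0.060, 0.033, 0.082]

r_fit = spearman(wn_fit, ck_fit)
r_fst = spearman(wn_fst, ck_fst)
```

Rounded to two places, `r_fit` is 0.58 and `r_fst` is 0.90, matching the values quoted.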

The models considered above correspond to the situation at $d = 0$ for the M/M model. Therefore, for consistency with the foregoing, we restrict attention to local kinship in the M/M model. For loci with multiple alleles, some of MORTON’S estimation procedures are not equivalent to (2a) or (8). Because we include the COCKERHAM and RST models, we use only codominant biallelic loci in this treatment, so $\phi_{ii}$ does not require separate computation or tabulation. Values of $F_{ST}$ in the W/N model are also values of $\phi_{ii}$ in M/M for such loci, as the equivalence of (8) to (2a) demonstrates.

RESULTS

1. Intratribal: Table 2 presents estimates based on the Yanomama data, for parameters of the various models of population structure described above. Since the nine allele pairs are codominant, the allele frequencies in the table are estimated by “gene count”; i.e., they are exactly the observed proportions in the sample (ignoring subdivision). Consequently these allele frequencies are appropriate only for the W/N and COCKERHAM models, not for the RST model (see below). Estimates obtained on the W/N model for $F_{IT}$ range from 0.004 ($PGM_1$) to 0.093 ($C$). Those for $F_{ST}$ range from 0.040 ($PGM_1$) to 0.089 ($C$), and are also exactly the estimates for $\phi_{ii}$ on the M/M model, as explained under METHODS. We note that these estimates for 47 villages agree well with those obtained by NEEL and WARD (1972) for a subset of the villages treated here. The large proportion in common (37 villages, 79% of the total) nearly guarantees good agreement. $F_{IT}$ estimated by nested analysis of variance corresponding to COCKERHAM’S model ranges from 0.005 to 0.096, and $F_{ST}$ ($\theta$) from 0.033 to 0.084, with the smallest value for $PGM_1$ and the largest for $C$ as before. The W/N and COCKERHAM procedures give very similar results over the entire range sampled by the nine loci, as is shown by the appreciable rank correlation of estimates, as well as near identity of their means.
In addition to correlation between models, we note that the least squares estimates of $F_{IT}$ and $F_{ST}$ (COCKERHAM) are highly correlated; the rank correlation for this set of loci is 0.63. A principal virtue of analysis of variance is that the estimates of variance within and between groups are independent and should be approximately uncorrelated, even for an unbalanced design. The values for $F_{ST}$ and $F_{IT}$ seem at first glance to contradict this. Recall that $F_{ST}$ is estimated by (3a) and that $F_{IT}$ is estimated by (3b). Given two uncorrelated random variables $X$ and $Y$ (representing the estimates of $\sigma_a^2$ and $\sigma_b^2$), with arbitrary means and variances $V_x$ and $V_y$, the correlation between $X$ and $X + Y$ is

$$\sqrt{V_x / (V_x + V_y)} . \tag{10}$$
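For uncorrelated $X$ and $Y$, the covariance of $X$ with $X + Y$ is $V_x$ and the variance of $X + Y$ is $V_x + V_y$, so the correlation is $\sqrt{V_x/(V_x + V_y)}$. The sketch below evaluates this and compares it with a seeded simulation; the normal distributions, sample size and seed are arbitrary illustrative choices.

```python
import math
import random


def corr_x_with_sum(v_x: float, v_y: float) -> float:
    """Correlation between X and X + Y for uncorrelated X and Y."""
    return math.sqrt(v_x / (v_x + v_y))


def simulated_corr(v_x: float, v_y: float, n: int = 20000, seed: int = 1) -> float:
    """Sample correlation of X with X + Y from independent normal draws."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, math.sqrt(v_x)) for _ in range(n)]
    ys = [rng.gauss(0.0, math.sqrt(v_y)) for _ in range(n)]
    sums = [x + y for x, y in zip(xs, ys)]
    mean_x = sum(xs) / n
    mean_s = sum(sums) / n
    cov = sum((x - mean_x) * (s - mean_s) for x, s in zip(xs, sums))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_s = sum((s - mean_s) ** 2 for s in sums)
    return cov / math.sqrt(var_x * var_s)
```

With equal variances, `corr_x_with_sum(1.0, 1.0)` is $1/\sqrt{2}$, and the simulated value should lie close to it.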

For $V_x \approx V_y$, as is generally the case, the correlation given by (10) is approximately $1/\sqrt{2}$ or 0.707. Thus we expect $F_{ST}$ and $F_{IT}$ to be correlated, even if the variance components underlying the estimation are not.

The last three columns of Table 2 give estimates for the parameters of the RST model, obtained by a computer search of the likelihood surface described by equation (5). Since $\rho$ is equivalent to the coefficient of relationship, we must translate to $F_{ST}$ terms (equation 6a) for comparison with estimates given above. The last column gives the equivalent $F_{ST}$ estimates for the $\rho$ values obtained. Comparing these with $F_{ST}$ for W/N and COCKERHAM, we find a systematic difference; for every allele pair except $PGM_1$, the RST estimates are smaller than those of COCKERHAM. These differences, which represent a bias in the maximum likelihood estimates, are discussed in detail by ROTHMAN, SING, and TEMPLETON (1974). The mean value (0.044) of the RST estimates for the nine loci is only 70% of the least squares (COCKERHAM) estimate. However, the rank correlation of the RST estimates with least squares estimates is still large: 0.80.

For $F_{IT}$, the RST estimates are also systematically higher than those obtained by COCKERHAM’S least squares estimation. These differences are due, at least in part, to the way the genotype proportions, $P_1$, $P_2$, and $P_3$ over the total population are estimated. The $P_j$ are not estimated by least squares in the RST model (see METHODS). They vary as the search of the likelihood surface proceeds, until for the final estimate they are fixed in a combination (and with a $\rho$ value) that maximizes the likelihood. Consequently, these maximum likelihood $P_j$ can depart appreciably from the observed sample $P_j$, leading to values of $F_{IT}$ (2b or 6b) which differ markedly from conventional estimates of $F_{IT}$ represented by both W/N and COCKERHAM.
In representing the estimates for M/M by only one column in Table 2 ($F_{ST} = \phi_{ii}$), we ignore the distinction between association of alleles in an individual and association of alleles in a village. In terms of the parameters of the model (COCKERHAM), a realistic interpretation of random union of gametes when generations are distinct requires

$$F_{IT} \text{ (generation } i + 1) = F_{ST} \text{ (generation } i); \tag{11}$$

i.e., the correlation ($F_{IT}$) of “united” gametes in the offspring generation is exactly that of random gametes ($F_{ST}$) in the parental generation. For a single mating, this is no more than the familiar observation that the inbreeding coefficient of the offspring is the kinship coefficient of the parents. This consideration moved us to separate our data on the Yanomama roughly into two generations. We divided our sample into individuals of estimated age 20 and under (“offspring”) and those 25 and older (“parents”), ignoring those in between. Table 3 gives the result of estimating $F_{IT}$ and $F_{ST}$ (COCKERHAM) separately for these two subsamples or successive “generations.” Note first that $F_{ST}$ for the older subsample is smaller (0.054) than $F_{ST}$ for the younger subsample (0.063), as required by the model of increasing genetic differentiation among subdivisions with time. Recall that for a single generation, $i$, in the unpartitioned total (Table 2), $(F_{ST})_i$ exceeds $(F_{IT})_i$ so that if we ignore the distinction between generations, the resulting $F_{IS} \mathrel{\hat{=}} (F_{IT} - F_{ST})/(1 - F_{ST})$ is negative, amounting to -0.016. The similarity of this value to the mean $F_{IS}$ (-0.012) found by NEEL and WARD (1972) suggests that in addition to the causes they list (pp. 651-654), failure to distinguish generations is a possible explanation for negative estimates of $F_{IS}$ in small subdivided populations. From Table 3, however, we obtain $F_{IT}$ (offspring) = $F_{ST}$ (parents) = 0.054, $F_{IS}$ = 0. (The standard error of this estimate is large, 0.015, reflecting the substantial locus heterogeneity apparent in Table 3.) Consideration of two generations thus suggests that on the average, uniting gametes are drawn at random from the parental generation within a village to form the next generation.

We summarize the results for nine allele pairs in 47 Yanomama villages in Table 2 as follows. The method of W/N and COCKERHAM’S least squares give estimates which do not differ appreciably. Both of these sets of estimates agree well with those obtained by maximum likelihood estimation for the RST model, apart from the expected bias (downward) of the latter estimates. Because of the average difference resulting from this bias, we concentrate on the similarity in relative magnitude, indicated by rank correlation.
Moreover, since all these methods are based on the comparison of genotype distribution with allele frequencies, none estimates directly the probability of identity by descent of alleles present in founders, which could be obtained from total pedigree information extending to the founder generation. Since in most populations, genotype-based estimates must presumably underestimate this probability considerably, there is no absolute standard by which to judge estimates. We therefore require agreement in relative magnitude as a minimal indicator of comparability.

TABLE 3

Parameters of population structure estimated separately by generation (COCKERHAM’S model)

| Allele | Age ≤ 20: $F_{IT}$ | Age ≤ 20: $\theta$ | Age ≥ 25: $F_{IT}$ | Age ≥ 25: $\theta$ |
|---|---|---|---|---|
| $M$ | .095 | .069 | .090 | .069 |
| $S$ | .049 | .049 | .128 | .053 |
| $C$ | .098 | .083 | .105 | .087 |
| $E$ | .033 | .065 | .038 | .050 |
| $Fy^a$ | -.015 | .030 | .100 | .050 |
| $Hp^1$ | -.017 | .048 | -.001 | .056 |
| $Gc$ | .114 | .072 | .015 | .040 |
| $PGM_1$ | .040 | .033 | -.037 | .019 |
| $Alb$ | .088 | .118 | .043 | .067 |
| Mean | .054 | .063 | .053 | .054 |

2. Intertribal. The preceding treatment makes use of the statistical fact that variance among groups equals covariance within (FALCONER 1960). The variance of allele frequencies ($\sigma_p^2$ or in standardized form, $F_{ST}$) over totally isolated subdivisions of size $N_e$ increases in theory at the same rate, $1 - 1/2N_e$, as the reduction of heterozygosity within subdivisions. This is the theoretical basis for using variance among villages to measure accumulation in contemporary individuals of alleles identical by descent from $2N$ alleles, taken to be unique, in an ancestral population of $N$ founders. The same principle implies accumulation of identity or kinship corresponding to the subdivision of the present-day descendants of the founding population of Amerindians, which can be measured (in theory) by $F_{ST}$ obtained from tribal allele frequencies. Recently, LALOUEL and MORTON (1973) have used this principle to obtain an estimate of $\phi_T$, “kinship of two random Makiritare [Indians] from the same village, relative to South American Indians as a whole.” Conditional kinship ($\phi_0$) “relative to seven village samples” and kinship ($\phi_V$) “of two Makiritare taken at random from the pool of seven villages, relative to South American Indians” were estimated by the new method of bioassay described in MORTON et al. (1971). Combining these two estimates as

$$\phi_T = \phi_0 + (1 - \phi_0)\phi_V \tag{12a}$$

they obtain 0.097 for $\phi_T$.

Earlier, our comparison of models suggested that competing methods in the literature differ only as much as alternate parameterizations of a single biological model, and in some cases only in providing different estimators for identical parameters. As shown in METHODS, this is true for local kinship as estimated by LALOUEL and MORTON (1973). Their result can therefore be obtained directly from F-statistics. Their $\phi_0$, if construed as correlation instead of kinship, is exactly the parameter $F_{ST}$ for seven Makiritare villages; this was estimated by NEEL and WARD (1972) as 0.0358. $\phi_V$ is $F_{ST}$ for tribes; for the 13 tribes LALOUEL and MORTON consider (12 from FITCH and NEEL 1969, plus the Makiritare) we find $s_p^2/\bar{p}\bar{q} = 0.073$. When compared to LALOUEL and MORTON’S estimates obtained by the new bioassay of kinship, these $F_{ST}$ values appear negligibly different from the $\phi$ values. Since, as LALOUEL and MORTON say, (12a) is WRIGHT’S hierarchic model (equation 1), we may apply it directly to these F-statistics. We simply shift the hierarchic or subdivision unit in which allelic correlation is considered upward one level of inclusiveness: “gametes within villages” replaces “gametes within individuals,” and “gametes within tribes” replaces “gametes within villages” in the definitions given in METHODS (cf. Table 4). Combining the village and the tribal $F_{ST}$ values, we obtain 0.0358 + (1 - 0.0358) 0.073 = 0.106 for $\phi_T$, a result in agreement with the value (0.097) obtained by LALOUEL and MORTON. Thus, since LALOUEL and MORTON start with the same values as given by F-statistics and combine them like F-statistics, the outcome reiterates that kinship coefficients yield the same answer as F-statistics, with the same shortcomings as F-statistics when construed as estimates of inbreeding (see below).

TABLE 4

The correspondence of $F$ and $\phi$ for correlations of allelic values within individuals, within villages, and within tribes

| When next most inclusive unit is: | Correlation of alleles or kinship within Village [cf. Tribe] | Correlation of alleles or kinship within Individual [cf. Village] |
|---|---|---|
| Village [cf. Tribe] | (none) | $F_{IS}$ [$\phi_0$; = $F_{ST}$ for villages] |
| Tribe [cf. all Amerindians] | $F_{ST}$ [$\phi_V$; = $F_{ST}$ for tribes] | $F_{IT}$ [$\phi_T$; = $\phi_0 + (1 - \phi_0)\phi_V$] |

The same computations may be extended to the larger body of data on 47 Yanomama villages to yield a corresponding value for correlation (“kinship”) within Yanomama villages, relative to all tribes sampled. Mean $F_{ST}$ for the 47 villages is 0.070 (Table 2). Mean $F_{ST}$ for the 20 tribes treated by WARD et al. (1975) is 0.101. Thus we find that the correlation between alleles within one of 47 Yanomama villages, relative to the 20 tribes as a whole, is 0.070 + (1 - 0.07) 0.101 = 0.164. Referred to the 13 tribes considered by LALOUEL and MORTON, for which $F_{ST}$ is 0.073, the correlation within any Yanomama village relative to tribes is 0.070 + (1 - 0.07) 0.073 = 0.138. As is expected, the larger $F_{ST}$ (due to village differentiation) in the Yanomama (0.070) than in the Makiritare (0.036) is reflected in the greater estimate of the correlation of alleles within a village.
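The hierarchic combination used in this section is simple arithmetic, and the three figures quoted can be checked directly. In the sketch below the function name `combine_levels` is illustrative; the inputs are the village-level and tribe-level $F_{ST}$ values given in the text.

```python
def combine_levels(f_lower: float, f_upper: float) -> float:
    """Correlation within the lower unit relative to the most inclusive unit.

    Implements phi_T = phi_0 + (1 - phi_0) * phi_V, which is the same as
    1 - (1 - phi_0) * (1 - phi_V).
    """
    return f_lower + (1 - f_lower) * f_upper


# Values quoted in the text
makiritare_13_tribes = combine_levels(0.0358, 0.073)  # Makiritare villages, 13 tribes
yanomama_20_tribes = combine_levels(0.070, 0.101)     # Yanomama villages, 20 tribes
yanomama_13_tribes = combine_levels(0.070, 0.073)     # Yanomama villages, 13 tribes
```

Rounded to three places these give 0.106, 0.164 and 0.138, the values reported above.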

TO WHAT EXTENT DO ALL THESE VARIOUS APPROACHES UNDERESTIMATE IDENTITY OF ALLELES BY DESCENT?

The various methods of estimating kinship or coefficient of inbreeding from gene frequency dispersion all have the obvious limitation of measuring only that inbreeding accumulating subsequent to the most recent subdivision (or spread) of the reference population. Thus, one can readily visualize the situation where two small, long-separated villages derived from the same ancestral village fuse, and then several generations later split again, at which time they come under study. None of the inbreeding accumulated during either of the periods the villages spent together will be reflected in $F_{ST}$ based on the genetic differences between the two villages at the time of the study. ALLEN (1965) has addressed this point as follows: “Clearly, inbreeding coefficients, however they are estimated, have little relation to identity by descent for the majority of genes. Instead, F is useful as a measure of average reduction in heterozygosity at all loci over a defined interval of time. Specification of the reference generation is therefore both a theoretical and a practical necessity.”

Although this fact is explicitly recognized by those who derived the various formulations we have employed, it is sometimes lost sight of in application. In this section we explore, with particular reference to the Indian tribes of South America, the extent to which all these treatments based on subdivision may be underestimating the parameter of interest to us, namely, the expected frequency of identity by descent of a pair of alleles from an individual. We employ two approaches to the question; one is based on what is known concerning the historical antecedents of the Yanomama, the other on a Monte Carlo simulation of Yanomama social structure.

1. An estimate of identity by descent from the history of the Amerindian: One attempt to estimate the probability that alleles in a contemporary Indian village are identical by descent from alleles in the founding population for the Western Hemisphere can be based entirely on historical and demographic considerations. For this simplified argument, we assume that in spite of fluctuating census counts, the panmictic isolated unit throughout the 800 generations (20,000 years, an underestimate) since man reached the New World can be adequately represented by a single effective population size, $N_e$ (the harmonic mean of sizes at successive times). The accumulation of identity by descent in a population of fixed size $N_e$ is given by $F = 1 - (1 - 1/2N_e)^t$, where $t$ is time in generations. Figure 1 shows the graph of this function for various values of $N_e$ and $t$. We shall assume that $N_e$ = 1,000 is a reasonable upper limit for the effective size throughout this period. After 800 generations, the expectation of $F$ reaches about 0.33. Our experience with isolated Indian populations, in which the adult breeding population never approaches 1,000, suggests that 0.33 sets a lower limit to the proportion of the founders made homozygous since the initial peopling of the Americas. Figure 1 shows that even if the appropriate $N_e$ is 2,000, $F$ reaches approximately 0.2.
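The two values read from Figure 1 follow directly from $F = 1 - (1 - 1/2N_e)^t$. A minimal sketch, in which the function name `accumulated_identity` is illustrative:

```python
def accumulated_identity(n_e: float, generations: int) -> float:
    """Expected identity by descent after t generations at constant N_e."""
    return 1 - (1 - 1 / (2 * n_e)) ** generations


f_1000 = accumulated_identity(1000, 800)   # N_e = 1,000, 800 generations
f_2000 = accumulated_identity(2000, 800)   # N_e = 2,000, 800 generations
```

Here `f_1000` rounds to 0.33 and `f_2000` is about 0.18, in line with the "about 0.33" and "approximately 0.2" quoted in the text.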

FIGURE 1. Accumulation with time of allelic identity by descent (or loss of heterozygosity) in a population of constant size $N_e$. (Vertical axis: $F$, 0 to 1.0; horizontal axis: time, 10 to 35 thousand years, equivalently 400 to 1,400 generations.)

Mutation of the original alleles necessitates a slight correction in principle; in practice, this decrease in identity by descent is negligible, as shown below. We could also adjust our estimate upwards, to take into account the possibility of nonzero identity by descent accumulated before the founding population reached the Western Hemisphere. This correction is computationally trivial, but the implicit redefinition of the reference generation opens the possibility of infinite regress to Adam and Eve, which we prefer to avoid.

2. An estimate of identity by descent from simulation: A Monte Carlo population simulation program patterned after the Yanomama was developed some 6 years ago (MACCLUER, NEEL and CHAGNON 1971). The program has since been extensively modified; a detailed description is in preparation (LI, ROTHMAN and NEEL). In the present program, each member of the input population of 451, which is subdivided into four villages, is assigned a unique pair of alleles at each of four unlinked loci. Emigration from the population occurs at the rate of one female randomly selected annually from among the unmarried women aged 15-40. Immigration is at the same rate, the immigrant being assigned the age, village, and lineage of the emigrant. Because of the excess of males resulting from preferential female infanticide, the immigrant marries quickly. Each immigrant also has a unique pair of alleles at each of four loci.
Given the capacity of the computer for which the program was developed, the simulation can extend over a maximum period of approximately 400 years, at which time the population, expanding at the rate of approximately 0.7 percent per year (as are the Yanomama currently), has in several runs grown to approximately 6,000 persons. As noted, the original population is distributed among four villages; there is provision for village fission when village size increases to 250-300 persons, up to a total of nine villages. The phenotypes resulting from the various gene combinations are all selectively equivalent to one another, i.e., we assume no selection. Since at the beginning of the simulation each allele has a unique designation, all alleles with the same designation present in the population at the termination of the simulation must be identical by descent. Likewise, in any homozygous individual, the two alleles must be identical by descent. In the preliminary work with the model it became clear that, because of the rate of growth of the population and the limitation of the model to a total of nine villages, all of the village fissions which computer capacity permitted had in several trial runs occurred by 150 to 200 years, after which the nine villages increased in size well beyond that usually achieved by slash-and-burn agriculturalists. Since this abnormal village size had important implications for any simulation of inbreeding, we decided for the purposes of this study to limit consideration of the simulation results to a period of 200 years. Table 5 summarizes the results of three such simulations. The results suggest that in such a population, the mean frequency of identity by descent of a pair of alleles in an individual increases by approximately 0.01 each 100 years. Note that F substantially exceeds expectation under random combination of alleles in each of the three simulations.
This is seen as a reflection of the Wahlund effect and of the social structure of the Yanomama. To the extent that it is due to the latter, the accumulation of identity by descent would not be expected to be the same in another tribe with a different social structure. However, preferential marriage with cross-cousins, a key feature of the Yanomama mating system, is common among Indian tribes.

TABLE 5

The identity by descent (F) observed after 200 years in a simulation of a Yanomama population*

Run number    F         Σp_i^2    Proportion of introduced genes at 200 years
1             0.0160    0.0095    0.22
2             0.0208    0.0075    0.28
3             0.0203    0.0033    0.27

* The frequency of the ith allele is p_i.

With respect to the possible generality of these results, it is important to note that the rather high rate of immigration into the population (always introducing new alleles) results in roughly 26 percent of the genes being of outside origin at the end of 200 years. In actual fact, such immigrants would often be kinfolk, reintroducing genes already present in the population. Furthermore, we note that all alleles were unique at the beginning of the simulation, whereas in the real world this would not be the case. These two factors lead us to believe that this estimate of the accumulation of identity by descent in Amerindian villages is a lower limit.

What are the implications of the simulation for the identity by descent of a randomly selected pair of alleles in an Indian village? We assume as before that the Amerindian reached the New World via the Bering Straits 20,000 years ago, and that the predominant pattern of expansion has been the successive fissioning of villages (and tribes), with occasional fusions as dictated by the exigencies of war or disease. The forces opposing the accumulation of identity by descent are heterotic selection and mutation. It is possible to make rough allowance for the latter. At a mutation rate of $10^{-5}$/locus/generation, the probability of no mutation in a line of descent over 800 generations is $0.99999^{800}$, or 0.9920, and the probability of no mutation in either line of descent for a randomly chosen pair of alleles is $0.9920^2$, or 0.9841. The corresponding figure for an assumed rate of $10^{-4}$ is 0.8521, and for $10^{-6}$, 0.9984.
Then in 20,000 years, if F accumulated at the rate of 0.01 each 100 years or 0.002509 per generation, mean F for an Indian village in the absence of mutation should be $1 - (1 - 0.00251)^{800} = 0.866$; a mutation rate of $10^{-5}$/locus/generation would in the corresponding 800 generations reduce this by only 0.00001. Approximate though these approaches are, it is difficult to escape the conclusion that except at loci with very high heterozygosity maintained by selection, identity by descent from ancestral alleles in Amerindian founders for a pair of randomly chosen alleles in a contemporary Amerindian tribe should be no less than 0.3 and may well be greater than 0.5.
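The two calculations in this passage, the compounding of F over 800 generations and the probability that a pair of alleles escapes mutation, can be reproduced with a short sketch. The function names are hypothetical, and the 25-year generation is an assumption taken from the correspondence of 800 generations with 20,000 years; the sketch only repeats the quoted arithmetic and does not model the simulation itself.

```python
YEARS_PER_GENERATION = 25  # assumed: 20,000 years / 800 generations
GENERATIONS = 20_000 // YEARS_PER_GENERATION


def per_generation_rate(gain_per_century: float) -> float:
    """Compound per-generation rate equivalent to a given gain in F per 100 years."""
    generations_per_century = 100 / YEARS_PER_GENERATION
    return 1.0 - (1.0 - gain_per_century) ** (1.0 / generations_per_century)


def accumulated_f(rate: float, generations: int) -> float:
    """F after compounding a constant per-generation rate, ignoring mutation."""
    return 1.0 - (1.0 - rate) ** generations


def prob_no_mutation_pair(u: float, generations: int) -> float:
    """Probability that neither of two lines of descent carries a mutation."""
    return (1.0 - u) ** (2 * generations)


if __name__ == "__main__":
    rate = per_generation_rate(0.01)
    print(round(rate, 6), round(accumulated_f(rate, GENERATIONS), 3))
    for u in (1e-4, 1e-5, 1e-6):
        print(u, round(prob_no_mutation_pair(u, GENERATIONS), 4))
```

A gain of 0.01 per century corresponds to about 0.002509 per generation, which compounds to roughly 0.866 over 800 generations. The no-mutation probabilities for a pair of alleles come out near 0.8521, 0.9841 and 0.9984 for rates of $10^{-4}$, $10^{-5}$ and $10^{-6}$ respectively.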

DISCUSSION

No matter which of the functions of gene and genotype frequencies is used as an estimate of identity by descent of a randomly chosen pair of alleles in an Indian village, there is a large discrepancy between the result and the value suggested by general considerations of population history. The reason seems clear: as frequently recognized, the estimates from population data are valid only for the time-depth since the current population became subdivided. The contribution of this paper is to begin to define the magnitude of this discrepancy for a specific population. The foregoing makes it clear that all current statistical methods underestimate the true value of identity by descent for neutral alleles by a factor of three or four.

Homozygosity and identity by descent: It might appear from the requirement of homozygosity in an individual for identity by descent that the conclusions reached above are most appropriate for a locus with high homozygosity. This is not true. Consider a polymorphic locus with two alleles. Among the homozygotes, it is generally impossible to distinguish loci at which the alleles are identical by descent from those which are alike only in state, although this distinction persists in the literature and is vital to the theoretical treatment of inbreeding. If the less frequent allele arose only once, the frequency of homozygotes for it sets a lower limit for the probability of identity by descent; in addition, some, if not all, homozygotes for the majority allele will be identical by descent (from some ancestral generation). This argument applies whether the polymorphism is balanced or transient. Indeed, if we cannot distinguish alleles alike in state from those identical by descent, it is only in the heterozygotes that we are certain that the alleles are not identical by descent.

Consequences of inappropriate comparisons (the paradox of Brazil and Japan): It is not only inaccuracies in specifying the time elapsed since the reference generation which complicate estimates of inbreeding. Any estimate from gene or genotype frequencies must be based on population units taken to be the breeding or local population. AZEVÊDO et al. (1969) have applied the MALECOT formulation to genetic data from northeastern Brazil, and IMAIZUMI and MORTON (1969) to genetic data from Japan. The coefficient of kinship (φ) is presented as approximately 10 times higher in Brazil than in Japan (cf. FRIEDLAENDER 1971 for a convenient summary). The present-day population of Brazil is drawn from three quite divergent ethnic groups, Caucasians, Negroes, and Indians (ethnically mongoloid). In the area on which the estimate is based, the contributions of these three groups to the gene pool have been estimated as 0.57 Caucasian, 0.30 Negro, and 0.11 Indian (KRIEGER et al. 1965). Further, the predominant religion, Catholicism, discourages consanguineous marriage. By contrast, Japan is a relatively homogeneous country ethnically in which, prior to the Meiji Restoration, most of the population was concentrated in small villages with no religious impediment to inbreeding and where, to judge by practices persisting to the present, inbreeding may have been encouraged through the custom of arranged marriages. Given these facts it is unlikely that the coefficient of kinship is greater in Brazil than in Japan.

This inconsistency helps demonstrate the hazards in comparing populations by models of population structure. For the estimates of inbreeding in Brazil and Japan, all possible pairs were formed among members of the parental generation in the study material, and the genotypic similarity between members of a pair was related to the geographic distance between them at the time of the study.
This procedure, like the other methods for estimating $F_{ST}$, must be quite sensitive to ethnic stratification, village heterogeneity, and assortative mating by ethnic background, all of which presumably occur in Brazil. Moreover, the political unit defining a population for local kinship was considerably smaller in Brazil (mean radius of distrito 10 km) than in Japan (mean radius of prefecture 45 km), so that genetic similarity for pairs assigned to “same place” (φ) had to be greater in Brazil than in Japan. Thus the units designated “local” differ in area by a factor of more than 20. We conclude that the populations which have been compared by means of the M/M formulation (summary in FRIEDLAENDER 1974) violate the assumptions of the model and/or differ from each other in many ways. Consequently, apparent differences in parameters of population structure are likely to be confounded with differences in definition of the population of interest and in sample composition.

Consequences of this argument for heterosis in man: While we can speak with some assurance only with respect to the Amerindian, we surmise that aspects of the inferences we have drawn are in the main applicable to man during his tribal days everywhere. If this is correct, then the process of detribalization which has occurred since the advent of civilization some 2,000-4,000 years ago must have resulted in a marked relaxation of inbreeding and of its consequences for selection against recessive alleles. A clear implication is that the offspring of parents from different long-separated small population isolates should be substantially more heterozygous than their parents or the products of non-outbred matings. To some extent, the population isolates of Europe must represent the transformation in situ of tribal into peasant agricultural society.
The genetic effects are undoubtedly confounded with changing patterns of nutrition and disease, but our suggestion of the magnitude of the recent reduction in inbreeding should invite a favorable reconsideration of previous efforts (e.g., HULSE 1958; WOLANSKI, JAROSZ and PYZUK 1970) to detect heterosis in the results of outbreeding.

LITERATURE CITED

ALLEN, G., 1965 Random and nonrandom inbreeding. Eugenics Quarterly 12: 181-198.
AZEVÊDO, E., N. E. MORTON, C. MIKI and S. YEE, 1969 Distance and kinship in northeastern Brazil. Am. J. Hum. Genet. 21: 1-22.
CAVALLI-SFORZA, L. L., 1972 Some current problems of human population genetics. Am. J. Hum. Genet. 25: 82-104.
CAVALLI-SFORZA, L. L. and W. F. BODMER, 1971 The Genetics of Human Populations. W. H. Freeman and Company, San Francisco.
COCKERHAM, C. C., 1969 Variance of gene frequencies. Evolution 23: 72-84.
--, 1973 Analyses of gene frequencies. Genetics 74: 679-700.
FALCONER, D. S., 1960 Introduction to Quantitative Genetics. Oliver and Boyd, Edinburgh.
FITCH, W. M. and J. V. NEEL, 1969 The phylogenic relationships of some Indian tribes of Central and South America. Am. J. Hum. Genet. 21: 384-397.
FRIEDLAENDER, J. S., 1971 The population structure of South-Central Bougainville. Amer. J. Phys. Anthrop. 35: 13-26.
--, 1974 In: J. F. Crow and C. Denniston, Genetic Distance, pp. 167-187. Plenum, New York.
GERSHOWITZ, H., M. LAYRISSE, Z. LAYRISSE, J. V. NEEL, N. A. CHAGNON and M. AYRES, 1972 The genetic structure of a tribal population, the Yanomama Indians. II. Eleven blood-group systems and the ABH-Le secretor traits. Ann. Hum. Genet., Lond. 35: 261-269.
HULSE, F. S., 1958 Exogamie et hétérosis. Arch. Suisses Anthropol. Gén. 22: 103-125.
IMAIZUMI, Y. and N. E. MORTON, 1969 Isolation by distance in Japan and Sweden compared with other countries. Human Hered. 19: 433-443.
IMAIZUMI, Y., N. E. MORTON and D. E. HARRIS, 1970 Isolation by distance in artificial populations. Genetics 66: 569-582.
KRIEGER, H., N. E. MORTON, M. P. MI, E. AZEVÊDO, A. FREIRE-MAIA and N. YASUDA, 1965 Racial admixture in northeastern Brazil. Ann. Hum. Genet., Lond. 29: 113-125.
LALOUEL, J. and N. E. MORTON, 1973 Bioassay of kinship in a South American Indian population. Am. J. Hum. Genet. 25: 62-73.
LI, F. H. F., E. D. ROTHMAN and J. V. NEEL, 1977 A second study of the survival of a neutral mutation in a simulated Amerindian population. Am. Naturalist (In press).
MACCLUER, J. W., J. V. NEEL and N. A. CHAGNON, 1971 Demographic structure of a primitive population: a simulation. Amer. J. Phys. Anthrop. 35: 193-208.
MALECOT, G., 1969 The Mathematics of Heredity. W. H. Freeman, San Francisco.
MORTON, N. E., 1969 Population structure. pp. 61-68. In: Computer Applications in Genetics. Edited by N. E. MORTON. U. Press of Hawaii, Honolulu.
--, 1973 Kinship bioassay. pp. 158-163. In: Genetic Structure of Populations. Edited by N. E. MORTON. U. Press of Hawaii, Honolulu.
MORTON, N. E., D. KLEIN, I. E. HUSSELS, P. DODINVAL, A. TODOROV, R. LEW and S. YEE, 1973 Genetic structure of Switzerland. Am. J. Hum. Genet. 25: 347-361.
MORTON, N. E., S. YEE, D. E. HARRIS and R. LEW, 1971 Bioassay of kinship. Theoret. Pop. Biol. 2: 507-524.
NEEL, J. V., 1970 Lessons from a “primitive” people. Science 170: 815-822.
NEEL, J. V. and F. M. SALZANO, 1964 A prospectus for genetic studies of the American Indian. Cold Spring Harbor Symp. Quant. Biol. 29: 85-98.
NEEL, J. V. and R. H. WARD, 1972 The genetic structure of a tribal population, the Yanomama Indians. VI. Analysis by F-statistics (including a comparison with the Makiritare and Xavante). Genetics 72: 639-666.
ROTHMAN, E. D., C. F. SING and A. R. TEMPLETON, 1974 A model for analysis of population structure. Genetics 76: 943-960.
SING, C. F., M. A. CHAMBERLAIN and B. K. EGGLESTON, 1973 An analysis of variance of gene frequencies in a human population. In: Human Population Structure. pp. 217-226. Edited by N. E. MORTON. U. of Hawaii Press, Honolulu.
SNEDECOR, G. W. and W. G. COCHRAN, 1967 Statistical Methods, 6th Ed. Iowa State University Press, Ames, Iowa.
WARD, R. H., H. GERSHOWITZ, M. LAYRISSE and J. V. NEEL, 1975 The genetic structure of a tribal population, the Yanomama Indians. XI. Gene frequencies for 10 blood groups and the ABH-Le secretor traits in the Yanomama and their neighbors; the uniqueness of the tribe. Am. J. Hum. Genet. 27: 1-30.
WEITKAMP, L. R. and J. V. NEEL, 1972 The genetic structure of a tribal population, the Yanomama Indians. IV. Eleven erythrocyte enzymes and summary of protein variants. Ann. Hum. Genet., Lond. 35: 433-444.
WEITKAMP, L. R., T. ARENDS, M. GALLANGO, J. V. NEEL, J. SCHULTZ and D. C. SHREFFLER, 1972 The genetic structure of a tribal population, the Yanomama Indians. III. Seven serum protein systems. Ann. Hum. Genet., Lond. 35: 271-279.
WOLANSKI, N., E. JAROSZ and M. PYZUK, 1970 Heterosis in man: growth in offspring and distance between parents’ birthplaces. Soc. Biol. 17: 1-16.
WORKMAN, P. L. and J. NISWANDER, 1970 Population studies on Southwestern Indian tribes. II. Local genetic differentiation in the Papago. Am. J. Hum. Genet. 22: 244.
WRIGHT, S., 1922 Coefficients of inbreeding and relationship. Am. Naturalist 56: 330-338.
--, 1943 Isolation by distance. Genetics 28: 114-138.
--, 1951 The genetical structure of populations. Ann. Eugenics 15: 323-354.
--, 1965 The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19: 395-420.

Corresponding editor: J. F. CROW