Copyright © 1987 by the Genetics Society of America

Definition and Estimation of Higher-Order Gene Fixation Indices

Kermit Ritland

Department of Botany, University of Toronto, Toronto, M5S 1A1 Canada

Manuscript received March 26, 1987
Revised copy accepted August 15, 1987

ABSTRACT

Fixation indices summarize the associations between genes that arise from the joint effects of inbreeding and selection. In this paper, fixation indices are derived for pairs, triplets and quadruplets of genes at a single multiallelic locus. The fixation indices are obtained by dividing cumulants by constants; the cumulants describe the statistical distribution of alleles and the constants are functions of gene frequency. The use of cumulants instead of moments is necessary only for four-gene indices, when the fourth cumulant is used. A second type of four-gene index is also required, and this index is based upon the covariation of second-order cumulants. At multiallelic loci, a large number of indices is possible. If alleles are selectively neutral, the number of indices is reduced and the relationship between gene identity and gene cumulants is shown. Two-gene indices can always be estimated from genotypic frequency data at a single polymorphic locus. Three-gene indices are also estimable except when gene frequency equals one-half. Four-gene indices are not estimable unless selection is assumed to have an equal effect upon each allele (such as under selective neutrality) and the locus contains at least three alleles of unequal frequency. For diallelic or selected loci, an alternative four-gene fixation index is proposed. This index incorporates both types of four-gene associations but cannot be related to gene identity.

THE association of two genes at a locus can be measured either by the probability of allelic identity-by-descent (HARRIS 1964; COCKERHAM 1971) or by fixation indices based upon the covariance of allelic values (WRIGHT 1922, 1969; COCKERHAM 1969; WEIR 1970). These two genes are usually considered as residing in a single diploid individual, or as having been chosen randomly from each of two diploid individuals. If one needs to consider more than two genes, computations of gene identity are relatively easily extended to an arbitrary number of genes (CANNINGS and THOMPSON 1981). A three-gene fixation index, based upon the third moment of three genes at a diallelic locus, was derived for characterizing mating systems under nonrandom outcrossing and selection (RITLAND 1985). The extension of fixation indices to four genes, and the general multiallelic case for either three or four genes, has remained undescribed.

Measures of higher-order gene associations are useful in several ways. They can be used to find covariances of inbred relatives (GILLOIS 1965, 1966). Models based upon higher-order indices of association are useful in the analyses of gene frequencies (COCKERHAM 1971). In natural populations, spatial associations of genotypes often occur because of genetic drift, selection and restricted gene flow. If selfing or biparental inbreeding also occurs, associations may develop among pairs of inbred individuals, and the proper characterization of these associations at one locus requires measures based upon four genes. Finally, higher-order fixation indices may be useful in selection models that incorporate such population structure.

The concept of gene identity, as conceived by MALECOT (1948), has become extremely useful for any problem that requires a measure of genetic relatedness. GILLOIS (1965, 1966), COCKERHAM (1971), JACQUARD (1974), CANNINGS and THOMPSON (1981), and others have used gene identity in many applications in populations. Genes are identical by descent if they are all copies of an ancestral allele. However, when selection is present, genotypic associations cannot be strictly interpreted with gene identity coefficients. In addition, estimation of gene identity requires inferences about the ancestral population to which gene identity is relative.

Alternatively, without making prior assumptions about selection and without the need to determine the relativity of the measure, we can measure the contemporary associations of alleles in terms of covariances, moments or cumulants of a distribution of allelic values. These statistical measures incorporate both the effects of inbreeding and selection, and are relative to the easily measured, contemporary gene frequencies.

This paper derives, in terms of gene cumulants, all two-, three-, and four-gene fixation indices for multiallelic loci. The attainable space of fixation indices

and the estimation variance of higher-order gene fixation indices are also briefly examined. If alleles are selectively neutral, most fixation indices derived here can be related to gene identity coefficients, thus giving alternative gene cumulant definitions to higher-order gene identity coefficients.

Genetics 117: 783-793 (December, 1987)

DEFINITION OF GENE FIXATION INDICES

The statistical approach for describing associations of genotypes specifies the distribution of genotypes in terms of the cumulants of a multivariate, multinomial distribution. A sufficient number of parameters are introduced such that expected frequencies of all possible genotypes are specified. This method "saturates" or even "oversaturates" (in the case of four genes) the degrees of freedom available in the data.

In the following, we will sequentially consider the cases of two, three and four genes at a single locus. Each gene has $n$ alleles with respective frequencies $c_1, c_2, \ldots, c_n$ ($c$ as in cumulant). Since we are considering one locus, these alleles are shared among genes and are of equal frequency among genes. At this single locus, the four genes are denoted as $a$, $b$, $c$ and $d$. For notational convenience, we always observe allele $i$ at gene $a$, allele $j$ at gene $b$, allele $k$ at gene $c$, and allele $l$ at gene $d$, even though the same allele may be observed at different genes (in which case $i = j$, for example).

Thus the assignment of alleles to genes at one locus is as follows:

Gene a - allele i (i = 1, ..., n)
Gene b - allele j (j = 1, ..., n)
Gene c - allele k (k = 1, ..., n)
Gene d - allele l (l = 1, ..., n)

There are several specific situations to which our treatment applies. For example, if we consider only two genes, genes $a$ and $b$ can be the homologous genes of one diploid individual. For three genes, gene $c$ could additionally be the gamete allele contributed by the mate of the first individual. For four genes, genes $c$ and $d$ could additionally be the homologous genes of a second diploid individual. Other situations exist as well.

To describe the presence of alleles in genes mathematically, first consider gene $a$. For gene $a$, define a vector of Bernoulli random variables $A_1, A_2, \ldots, A_n$ such that if allele $i$ is present, $A_i = 1$ and all other $A$ are zero. In other words, if allele $i$ is present, the random vector equals $(0, 0, \ldots, 0, 1, 0, \ldots, 0)$, wherein the $i$th term equals one.

Likewise, for gene $b$, introduce a second vector of random variables, $B_1, B_2, \ldots, B_j, \ldots, B_n$, defined such that if allele $j$ is present for gene $b$, $B_j = 1$ and all other $B$ are zero. For genes $c$ and $d$, the corresponding vectors $C_1, \ldots, C_k, \ldots, C_n$ and $D_1, \ldots, D_l, \ldots, D_n$ are defined in the same way.

The matrix of random variables consisting of the four vectors is a single observation that follows a multivariate multinomial distribution. It is emphasized that the distribution function may differ between observations. This distinction is necessary only for four genes, but for consistency, is kept for lower-order cases as well.

Two genes: First, consider genes $a$ and $b$. An observation consists of the two alleles of both genes $a$ and $b$. Observations are distributed as bivariate-multinomial with first-order cumulants denoted as $\mathbf{K}_i$ and $\mathbf{K}_j$ and second-order cumulants denoted as $\mathbf{K}_{ij}$ ($i, j = 1, \ldots, n$). Cumulants are a set of descriptive constants of a distribution which are useful for measuring its properties, and in our circumstance, for specifying it. These cumulants are written in boldface to emphasize that their values vary among observations, and as such are random variables. For one observation, the probability of observing allele $i$ for gene $a$ is $E[\mathbf{K}_i]$, the probability of observing allele $j$ for gene $b$ is $E[\mathbf{K}_j]$, and the probability of jointly observing alleles $i$ and $j$ is $E[\mathbf{K}_i\mathbf{K}_j] + E[\mathbf{K}_{ij}]$, where $E[\;]$ is the "expectation" operator. Cumulants about the mean of order one, two and three equal the corresponding moments about the mean. However, cumulants of order four (used in the four-gene case below) do not equal the corresponding moments of order four.

Since these cumulants may vary among observations, the population frequency $f_{ij}$ of allele $i$ for gene $a$ and allele $j$ for gene $b$ is the double expectation, taken first of single observations then taken among observations,

$$f_{ij} = E[E[A_iB_j]] = E[\mathbf{K}_i\mathbf{K}_j + \mathbf{K}_{ij}] = c_ic_j + c_{ij},$$

where $c_i$ is the mean of $\mathbf{K}_i$ (and is the frequency of allele $i$), $c_j$ the mean of $\mathbf{K}_j$, and $c_{ij}$ the mean of $\mathbf{K}_{ij}$ (and is the expected covariance between alleles $i$ and $j$). The covariance between $\mathbf{K}_i$ and $\mathbf{K}_j$ is assumed to be zero.

To characterize the deviation of genotypic associations from Hardy-Weinberg proportions, define the following gene fixation indices (WRIGHT 1922, 1969; WEIR 1970)

$$F_{ij} = \frac{c_{ij}}{d_{ij}} \qquad (i, j = 1, \ldots, n) \tag{1}$$

where the denominator term $d_{ij}$ is

$$d_{ij} = \delta_{ij}c_i - c_ic_j \tag{2}$$

with $\delta_{ij}$ as the Kronecker operator ($\delta_{ij} = 1$ if $i = j$, $\delta_{ij} = 0$ otherwise). The denominator term $d_{ij}$ is the maximum value that the covariance $c_{ij}$ can take. For example, if $i = j$ then $d_{ij} = c_i(1 - c_i)$, or if $i \neq j$ then $d_{ij} = -c_ic_j$.

Defining the fixation indices in this way enables us to specify the ordered genotypic frequencies for genes $a$ and $b$ as

$$\begin{aligned} f_{ij} &= c_ic_j + d_{ij}F_{ij} \\ &= c_ic_j + (\delta_{ij}c_i - c_ic_j)F_{ij} \\ &= c_ic_j(1 - F_{ij}) + \delta_{ij}c_iF_{ij}. \end{aligned} \tag{3}$$

The last line shows how the ordered genotypic frequencies can be expressed as a fraction of genotypes in Hardy-Weinberg equilibrium and a remaining fraction of "fixed" homozygous genotypes when $i = j$. Thus the term "fixation index" for $F$ is appropriate.

If gene order is not distinguished ($i \le j$), then all terms in (3) with $i \neq j$ are multiplied by two (i.e., if the genotypes are heterozygous, the frequencies are doubled). More simply, if genotypes are unordered, the right side of (3) is multiplied by $(2 - \delta_{ij})$.

By taking marginal sums of $f_{ij}$ we obtain the constraints upon the expected covariances as

$$\sum_{i=1}^{n} c_{ij} = 0 \quad\text{and}\quad \sum_{j=1}^{n} c_{ij} = 0.$$

These are standard properties of multinomial cumulants. From these constraints, we can obtain the $2n - 1$ constraints on the $F_{ij}$ as

$$\sum_{i=1}^{n} (c_iF_{ij} - \delta_{ij}F_{ij}) = 0 \quad\text{and}\quad \sum_{j=1}^{n} (c_jF_{ij} - \delta_{ij}F_{ij}) = 0. \tag{4}$$

With many alleles, a large number of fixation indices can be defined. There are $n^2$ possible fixation indices but $(n - 1)^2$ independent fixation indices. The $n(n - 1)$ degrees of freedom in the data (assuming equality of allele frequencies between genes $a$ and $b$) are saturated by $n - 1$ allele frequencies plus these $(n - 1)^2$ independent fixation indices.

Three genes: Next, consider the triplet of alleles in genes $a$, $b$ and $c$, this triplet being the unit of observation. The joint frequency, $f_{ijk}$, of allele $i$ for gene $a$, allele $j$ for gene $b$ and allele $k$ for gene $c$ is the expectation first taken within observations then taken among observations,

$$\begin{aligned} f_{ijk} &= E[E[A_iB_jC_k]] \\ &= E[\mathbf{K}_i\mathbf{K}_j\mathbf{K}_k + \mathbf{K}_i\mathbf{K}_{jk} + \mathbf{K}_j\mathbf{K}_{ik} + \mathbf{K}_k\mathbf{K}_{ij} + \mathbf{K}_{ijk}] \\ &= c_ic_jc_k + c_ic_{jk} + c_jc_{ik} + c_kc_{ij} + c_{ijk}, \end{aligned}$$

where the expected third-order cumulant between $A_i$, $B_j$ and $C_k$ is $c_{ijk}$, and the other $c$'s are as defined in (1); for example, the covariance $c_{jk} = E[\mathbf{K}_{jk}]$. Covariances between first-order cumulants and second-order cumulants are assumed zero.

To characterize the deviation of genotypic associations from that expected under random association of alleles, define the following three-gene fixation indices

$$F_{ijk} = \frac{c_{ijk}}{d_{ijk}} \qquad (i, j, k = 1, \ldots, n). \tag{5}$$

For this three-gene index, the denominator term $d_{ijk}$ is

$$d_{ijk} = \delta_{ijk}c_i - c_id_{jk} - c_jd_{ik} - c_kd_{ij} - c_ic_jc_k \tag{6}$$

where $\delta_{ijk} = 1$ if $i = j = k$, or $\delta_{ijk} = 0$ otherwise, and where the second-order $d$'s are as defined in (2); for example, $d_{jk} = \delta_{jk}c_j - c_jc_k$. The denominator term $d_{ijk}$ is the maximum value $c_{ijk}$ can take. For example, if (a) $i = j = k$, then $d_{ijk} = c_i(1 - c_i)(1 - 2c_i)$, or if (b) $i = j \neq k$, then $d_{ijk} = -c_ic_k(1 - 2c_i)$, or finally if (c) all subscripts are unique, then $d_{ijk} = 2c_ic_jc_k$.

These three-gene fixation indices, together with the two-gene indices previously defined, enable us to write the three-gene frequencies as

$$f_{ijk} = c_ic_jc_k + c_id_{jk}F_{jk} + c_jd_{ik}F_{ik} + c_kd_{ij}F_{ij} + d_{ijk}F_{ijk},$$

which equals, after substituting in the $\delta$'s for the $d$'s,

$$\begin{aligned} f_{ijk} ={}& c_ic_jc_k(1 - F_{ij} - F_{jk} - F_{ik} + 2F_{ijk}) \\ &+ \delta_{ij}c_ic_k(F_{ij} - F_{ijk}) + \delta_{ik}c_ic_j(F_{ik} - F_{ijk}) + \delta_{jk}c_ic_j(F_{jk} - F_{ijk}) \\ &+ \delta_{ijk}c_iF_{ijk}. \end{aligned} \tag{7}$$

For this case of three genes, $F_{ijk}$ is the frequency that genotypes are fixed at all three genes, the $F_{ij} - F_{ijk}$ terms are frequencies that genotypes are fixed at two genes, and $(1 - F_{ij} - F_{jk} - F_{ik} + 2F_{ijk})$ is the frequency that genotypes are in random proportions.

If genotypes are unordered ($i \le j \le k$), terms involving one identical pair of subscripts are multiplied by two and terms involving no identical subscripts are multiplied by six. To do this, one can multiply (7) by $6 - 4\delta_{ij} - 4\delta_{ik} - 4\delta_{jk} + 7\delta_{ijk}$.

Inspection of the marginal sums of $f_{ijk}$ reveals the expected third-order cumulants have constraints

$$\sum_{m=1}^{n} c_{ijk} = 0 \quad\text{for } m = i, j, k.$$

From these constraints on the expected three-gene cumulants, and since $c_{ijk} = d_{ijk}F_{ijk}$ by (5), the constraints on the three-gene fixation indices are

$$\sum_{k=1}^{n} d_{ijk}F_{ijk} = 0 \tag{8}$$

and likewise for expressions involving summations over $j$ and $i$. The number of three-gene fixation indices is $n^3$, while the number of independent three-gene indices is $(n - 1)^3$ because of these constraints. Note that only one three-gene index is needed for a diallelic locus. The degrees of freedom in the data, $n^3 - 2(n - 1) - 1$, is the sum of all independent one-, two- and three-gene parameters, $(n - 1) + \binom{3}{2}(n - 1)^2 + \binom{3}{3}(n - 1)^3$. The description of genotypic frequencies is again saturated by the gene frequencies and fixation indices.

Four genes: Finally, consider the quadruplet of alleles in all four genes $a$, $b$, $c$ and $d$, which is the unit of observation. The joint frequency $f_{ijkl}$ of allele $i$ at gene $a$, allele $j$ at gene $b$, allele $k$ at gene $c$, and allele $l$ at gene $d$, is the double expectation taken within and among observations,

$$\begin{aligned} f_{ijkl} ={}& E[E[A_iB_jC_kD_l]] \\ ={}& E[\mathbf{K}_i\mathbf{K}_j\mathbf{K}_k\mathbf{K}_l + \mathbf{K}_i\mathbf{K}_j\mathbf{K}_{kl} + \mathbf{K}_i\mathbf{K}_k\mathbf{K}_{jl} + \mathbf{K}_i\mathbf{K}_l\mathbf{K}_{jk} + \mathbf{K}_j\mathbf{K}_k\mathbf{K}_{il} \\ &\quad + \mathbf{K}_j\mathbf{K}_l\mathbf{K}_{ik} + \mathbf{K}_k\mathbf{K}_l\mathbf{K}_{ij} + \mathbf{K}_i\mathbf{K}_{jkl} + \mathbf{K}_j\mathbf{K}_{ikl} + \mathbf{K}_k\mathbf{K}_{ijl} + \mathbf{K}_l\mathbf{K}_{ijk} \\ &\quad + \mathbf{K}_{ij}\mathbf{K}_{kl} + \mathbf{K}_{ik}\mathbf{K}_{jl} + \mathbf{K}_{il}\mathbf{K}_{jk} + \mathbf{K}_{ijkl}] \\ ={}& c_ic_jc_kc_l + c_ic_jc_{kl} + c_ic_kc_{jl} + c_ic_lc_{jk} + c_jc_kc_{il} + c_jc_lc_{ik} + c_kc_lc_{ij} \\ &+ c_ic_{jkl} + c_jc_{ikl} + c_kc_{ijl} + c_lc_{ijk} + c_{ij}c_{kl} + c_{ik}c_{jl} + c_{il}c_{jk} \\ &+ \mathrm{Cov}[\mathbf{K}_{ij}, \mathbf{K}_{kl}] + \mathrm{Cov}[\mathbf{K}_{ik}, \mathbf{K}_{jl}] + \mathrm{Cov}[\mathbf{K}_{il}, \mathbf{K}_{jk}] + c_{ijkl}, \end{aligned} \tag{9}$$

where $c_{ijkl}$ is the expected fourth cumulant between $A_i$, $B_j$, $C_k$ and $D_l$ and $\mathrm{Cov}[\ldots]$ are covariances of second-order cumulants. The above is based upon the general formulas relating moments and cumulants given by KENDALL and STUART (1977, p. 340), with the added consideration that cumulants are variable and that pairs of second-order cumulants may covary.

To characterize the deviation of genotypic associations from that expected under random association of four alleles, the four-gene fixation indices are defined as follows:

$$F_{ijkl} = \frac{c_{ijkl}}{d_{ijkl}} \qquad (i, j, k, l = 1, \ldots, n) \tag{10}$$

where

$$\begin{aligned} d_{ijkl} ={}& \delta_{ijkl}c_i - c_ic_jc_kc_l - c_ic_jd_{kl} - c_ic_kd_{jl} - c_ic_ld_{jk} - c_jc_kd_{il} - c_jc_ld_{ik} - c_kc_ld_{ij} \\ &- c_id_{jkl} - c_jd_{ikl} - c_kd_{ijl} - c_ld_{ijk} - d_{ij}d_{kl} - d_{ik}d_{jl} - d_{il}d_{jk} \end{aligned} \tag{11}$$

for $\delta_{ijkl} = 1$ if $i = j = k = l$ or $\delta_{ijkl} = 0$ otherwise. The lower-order $d$'s are as defined in (2) and (6); for example, $d_{ikl} = \delta_{ikl}c_i - c_id_{kl} - c_kd_{il} - c_ld_{ik} - c_ic_kc_l$. The denominator term $d_{ijkl}$ is the maximum possible value that the expected fourth cumulant $c_{ijkl}$ can take. It is a function of allele frequencies and identities of subscripts $i$, $j$, $k$ and $l$.

A second type of four-gene measure also needs to be defined. The covariances of two-gene fixation indices are defined as:

$$F_{ij.kl} = \frac{\mathrm{Cov}[\mathbf{K}_{ij}, \mathbf{K}_{kl}]}{d_{ij}d_{kl}}, \qquad F_{ik.jl} = \frac{\mathrm{Cov}[\mathbf{K}_{ik}, \mathbf{K}_{jl}]}{d_{ik}d_{jl}}, \qquad F_{il.jk} = \frac{\mathrm{Cov}[\mathbf{K}_{il}, \mathbf{K}_{jk}]}{d_{il}d_{jk}} \qquad (i, j, k, l = 1, \ldots, n) \tag{12}$$

where the $d$'s are as defined in (2). The four-gene frequencies $f_{ijkl}$ thus become, in terms of fixation indices,

$$\begin{aligned} f_{ijkl} ={}& c_ic_jc_kc_l + c_ic_jd_{kl}F_{kl} + c_ic_kd_{jl}F_{jl} + c_ic_ld_{jk}F_{jk} + c_jc_kd_{il}F_{il} + c_jc_ld_{ik}F_{ik} + c_kc_ld_{ij}F_{ij} \\ &+ c_id_{jkl}F_{jkl} + c_jd_{ikl}F_{ikl} + c_kd_{ijl}F_{ijl} + c_ld_{ijk}F_{ijk} \\ &+ d_{ij}d_{kl}(F_{ij}F_{kl} + F_{ij.kl}) + d_{ik}d_{jl}(F_{ik}F_{jl} + F_{ik.jl}) + d_{il}d_{jk}(F_{il}F_{jk} + F_{il.jk}) + d_{ijkl}F_{ijkl}, \end{aligned}$$

which equals, after substituting in the $\delta$'s for the $d$'s,

$$\begin{aligned} f_{ijkl} ={}& c_ic_jc_kc_l[1 - F_{ij} - F_{ik} - F_{il} - F_{jk} - F_{jl} - F_{kl} + F_{ij}F_{kl} + F_{ik}F_{jl} + F_{il}F_{jk} \\ &\qquad + F_{ij.kl} + F_{ik.jl} + F_{il.jk} + 2F_{ijk} + 2F_{ijl} + 2F_{ikl} + 2F_{jkl} - 6F_{ijkl}] \\ &+ \delta_{ij}c_ic_kc_l[F_{ij}(1 - F_{kl}) - F_{ij.kl} - F_{ijk} - F_{ijl} + 2F_{ijkl}] \\ &+ \delta_{ik}c_ic_jc_l[F_{ik}(1 - F_{jl}) - F_{ik.jl} - F_{ijk} - F_{ikl} + 2F_{ijkl}] \\ &+ \delta_{il}c_ic_jc_k[F_{il}(1 - F_{jk}) - F_{il.jk} - F_{ijl} - F_{ikl} + 2F_{ijkl}] \\ &+ \delta_{jk}c_ic_jc_l[F_{jk}(1 - F_{il}) - F_{il.jk} - F_{ijk} - F_{jkl} + 2F_{ijkl}] \\ &+ \delta_{jl}c_ic_jc_k[F_{jl}(1 - F_{ik}) - F_{ik.jl} - F_{ijl} - F_{jkl} + 2F_{ijkl}] \\ &+ \delta_{kl}c_ic_jc_k[F_{kl}(1 - F_{ij}) - F_{ij.kl} - F_{ikl} - F_{jkl} + 2F_{ijkl}] \\ &+ \delta_{ij}\delta_{kl}c_ic_k[F_{ij}F_{kl} + F_{ij.kl} - F_{ijkl}] + \delta_{ik}\delta_{jl}c_ic_j[F_{ik}F_{jl} + F_{ik.jl} - F_{ijkl}] \\ &+ \delta_{il}\delta_{jk}c_ic_j[F_{il}F_{jk} + F_{il.jk} - F_{ijkl}] \\ &+ \delta_{ijk}c_ic_l[F_{ijk} - F_{ijkl}] + \delta_{ijl}c_ic_k[F_{ijl} - F_{ijkl}] + \delta_{ikl}c_ic_j[F_{ikl} - F_{ijkl}] + \delta_{jkl}c_ic_j[F_{jkl} - F_{ijkl}] \\ &+ \delta_{ijkl}c_iF_{ijkl}. \end{aligned} \tag{13}$$

This expression shows that $F_{ijkl}$ is the proportion of genotypes fixed for all four genes, the $F_{ijk} - F_{ijkl}$ terms are proportions of genotypes fixed for a triplet of genes, the $F_{ij}F_{kl} + F_{ij.kl} - F_{ijkl}$ terms are proportions fixed for two pairs of genes, the $F_{ij}(1 - F_{kl}) - F_{ij.kl} - F_{ijk} - F_{ijl} + 2F_{ijkl}$ terms are proportions fixed for one pair of genes, and the remaining term is the proportion of genotypes in random proportions for all four alleles.

If genotypes are unordered ($i \le j \le k \le l$), the $f_{ijkl}$ with three identical subscripts are multiplied by 2, the $f_{ijkl}$ with two pairs of identical subscripts are multiplied by 4, the $f_{ijkl}$ with two identical subscripts are multiplied by 6, and the $f_{ijkl}$ with no identical subscripts are multiplied by 12. This can be accomplished by multiplying (13) by $12 - 6(\delta_{ij} + \delta_{ik} + \delta_{il} + \delta_{jk} + \delta_{jl} + \delta_{kl}) + 2(\delta_{ij}\delta_{kl} + \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk}) + 8(\delta_{ijk} + \delta_{ijl} + \delta_{ikl} + \delta_{jkl}) - 13\delta_{ijkl}$. If genes $a$ and $b$ are unordered, and genes $c$ and $d$ unordered, but the two pairs of genes are ordered, (13) is multiplied by $(2 - \delta_{ij})(2 - \delta_{kl})$.

Marginal sums of $f_{ijkl}$ show that all expected fourth cumulants have constraints

$$\sum_{m=1}^{n} c_{ijkl} = 0 \quad\text{for } m = i, j, k, l.$$
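The two- and three-gene definitions above can be checked numerically. The following is a minimal Python sketch, not part of the original derivation: it builds the denominators of Equations (2) and (6) and the ordered frequencies from the first lines of (3) and (7), under the simplifying assumption that a single value `F2` is shared by every two-gene index and a single value `F3` by every three-gene index. The function names and the example allele frequencies are hypothetical.

```python
from itertools import product


def delta(*idx):
    """Generalised Kronecker delta: 1 if all subscripts are equal, else 0."""
    return 1.0 if len(set(idx)) == 1 else 0.0


def d2(c, i, j):
    """Two-gene denominator d_ij, Equation (2)."""
    return delta(i, j) * c[i] - c[i] * c[j]


def d3(c, i, j, k):
    """Three-gene denominator d_ijk, Equation (6)."""
    return (delta(i, j, k) * c[i]
            - c[i] * d2(c, j, k) - c[j] * d2(c, i, k) - c[k] * d2(c, i, j)
            - c[i] * c[j] * c[k])


def two_gene_freqs(c, F2):
    """Ordered f_ij = c_i c_j + d_ij F_ij (assumption: every F_ij equals F2)."""
    n = len(c)
    return {(i, j): c[i] * c[j] + d2(c, i, j) * F2
            for i, j in product(range(n), repeat=2)}


def three_gene_freqs(c, F2, F3):
    """Ordered f_ijk from the line preceding Equation (7).

    Assumption: all two-gene indices equal F2 and all three-gene indices
    equal F3.
    """
    n = len(c)
    return {(i, j, k): (c[i] * c[j] * c[k]
                        + c[i] * d2(c, j, k) * F2
                        + c[j] * d2(c, i, k) * F2
                        + c[k] * d2(c, i, j) * F2
                        + d3(c, i, j, k) * F3)
            for i, j, k in product(range(n), repeat=3)}


# Hypothetical three-allele example.
c = [0.5, 0.3, 0.2]
pairs = two_gene_freqs(c, F2=0.2)
triplets = three_gene_freqs(c, F2=0.2, F3=0.1)
print(sum(pairs.values()), sum(triplets.values()))  # both should be close to 1
```

Summing the three-gene frequencies over the third gene should recover the two-gene frequencies, which is one way to see that the parameterization is internally consistent.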

From the constraints on the expected four-gene cumulants, and since $c_{ijkl} = d_{ijkl}F_{ijkl}$ by (10), the constraints on the four-gene fixation indices are

$$\sum_{l=1}^{n} d_{ijkl}F_{ijkl} = 0, \tag{14}$$

with $d_{ijkl}$ expanded in the $\delta$'s by (11), (6) and (2), and likewise for summations over $k$, $j$ and $i$. The number of four-gene fixation indices is $\binom{4}{4}n^4 = n^4$, while the number of independent four-gene indices is $\binom{4}{4}(n - 1)^4$ because of these constraints. Note that only one four-gene index is needed for a diallelic locus. However, three covariances of indices were introduced for each independent four-gene index in (10), so the number of parameters is now $(n - 1) + 6(n - 1)^2 + 4(n - 1)^3 + 4(n - 1)^4$, which is greater than the degrees of freedom in the data, $n^4 - 3(n - 1) - 1$. The description of genotypic frequencies is thus oversaturated by parameters.

RELATIONSHIP OF GENE CUMULANTS TO GENE IDENTITY

This derivation allows both inbreeding and selection to structure genotypic frequencies in an arbitrary manner. Interestingly, there is a correspondence between the proportions given in the last equations of (3), (7) and (13) with those expected under identity-by-descent arguments: the fixation indices which multiply the gene frequency terms $c_i$, $c_j$, $c_k$ and $c_l$ can be replaced with gene-identity coefficients to arrive at the same formulae that give joint genotypic frequencies in terms of identity coefficients, such as those given in COCKERHAM (1971).

If inbreeding is the only agent of gene fixation, then all $F$ with the same subscripts equal each other in expectation. For example, all $F_{ij}$ ($i, j = 1, \ldots, n$) equal each other, all $F_{ijk}$ ($i, j, k = 1, \ldots, n$) equal each other, all $F_{ijkl}$ equal each other, and all $F_{ij.kl}$ equal each other. This drastically reduces the number of parameters needed to describe multiallelic loci. The remaining fixation indices can be related to coefficients of gene identity in the following way.

For two genes, under pure inbreeding then $F_{ij} = F_{ab}$ ($i, j = 1, \ldots, n$) and from inspection of (3), the correspondence between the two-gene fixation index and gene identity is

$$\begin{array}{ll} \text{Fixation index} & \text{Genes identical} \\ 1 - F_{ab} & \text{None} \\ F_{ab} & a \equiv b \end{array} \tag{15}$$

where "$\equiv$" denotes "is identical by descent to."

For three genes, again if inbreeding is the only cause of deviation from random proportions, then $F_{ijk} = F_{abc}$ ($i, j, k = 1, \ldots, n$) and from inspection of (7), the correspondence between two- and three-gene fixation indices and two- and three-gene identities is

$$\begin{array}{ll} \text{Fixation indices} & \text{Genes identical} \\ 1 - F_{ab} - F_{ac} - F_{bc} + 2F_{abc} & \text{None} \\ F_{ab} - F_{abc} & a \equiv b \\ F_{ac} - F_{abc} & a \equiv c \\ F_{bc} - F_{abc} & b \equiv c \\ F_{abc} & a \equiv b \equiv c. \end{array} \tag{16}$$

For four genes, again if inbreeding is the only agent of gene fixation, then $F_{ijkl} = F_{abcd}$ and $F_{ij.kl} = F_{ab.cd}$ ($i, j, k, l = 1, \ldots, n$) and from inspection of (13), the correspondence between higher-order gene fixation indices and higher-order gene identities is

$$\begin{array}{ll} \text{Fixation indices} & \text{Genes identical} \\ 1 - F_{ab} - F_{ac} - F_{ad} - F_{bc} - F_{bd} - F_{cd} + F_{ab}F_{cd} + F_{ac}F_{bd} + F_{ad}F_{bc} & \text{None} \\ \quad + F_{ab.cd} + F_{ac.bd} + F_{ad.bc} + 2F_{abc} + 2F_{abd} + 2F_{acd} + 2F_{bcd} - 6F_{abcd} & \\ F_{ab}(1 - F_{cd}) - F_{ab.cd} - F_{abc} - F_{abd} + 2F_{abcd} & a \equiv b \\ F_{ab}F_{cd} + F_{ab.cd} - F_{abcd} & a \equiv b \text{ and } c \equiv d \\ F_{abc} - F_{abcd} & a \equiv b \equiv c \\ F_{abcd} & a \equiv b \equiv c \equiv d. \end{array} \tag{17}$$

(There are six identities involving one pair, three identities involving two pairs, four identities involving triplets, plus the quadruplet and null identity, for a total of 15 terms. The reader can obtain those not given above by substitution of the lower case letters.) COCKERHAM (1971) has interrelated all two-, three-, and four-gene identity coefficients between pairs, triplets and quadruplets of genes.

Condensed fixation indices: When inbreeding is the only agent of gene fixation and the four genes are those possessed by two diploid individuals, we can equate additional parameters to obtain fixation indices which are equivalent to the 9 configurations of identity (JACQUARD 1974). Let the two alleles from individual "X" be genes $a$ and $b$, and the two alleles from individual "Y" be genes $c$ and $d$. Since genes are exchangeable within individuals, then $F_{ac} = F_{ad} = F_{bc} = F_{bd} \equiv F_{xy}$, $F_{abc} = F_{abd} \equiv F_{aby}$, $F_{acd} = F_{bcd} \equiv F_{xcd}$, and $F_{ac.bd} = F_{ad.bc} \equiv F_{x.y}$. By addition of the appropriate terms of (17), we obtain the following correspondence between fixation indices and gene identity:

$$\begin{array}{ll} \text{Fixation indices} & \text{Genes identical} \\ 1 - F_{ab} - F_{cd} - 4F_{xy} + F_{ab}F_{cd} + 2F_{xy}^2 + F_{ab.cd} + 2F_{x.y} + 4F_{xcd} + 4F_{aby} - 6F_{abcd} & \text{None} \\ F_{ab}(1 - F_{cd}) - F_{ab.cd} - 2F_{aby} + 2F_{abcd} & a \equiv b \\ F_{cd}(1 - F_{ab}) - F_{ab.cd} - 2F_{xcd} + 2F_{abcd} & c \equiv d \\ 4F_{xy}(1 - F_{xy}) - 4F_{x.y} - 4F_{xcd} - 4F_{aby} + 8F_{abcd} & a \equiv c \text{ or } a \equiv d \text{ or } b \equiv c \text{ or } b \equiv d \\ F_{ab}F_{cd} + F_{ab.cd} - F_{abcd} & a \equiv b \text{ and } c \equiv d \\ 2F_{xy}^2 + 2F_{x.y} - 2F_{abcd} & a \equiv c \text{ and } b \equiv d, \text{ or } a \equiv d \text{ and } b \equiv c \\ 2F_{aby} - 2F_{abcd} & a \equiv b \equiv c \text{ or } a \equiv b \equiv d \\ 2F_{xcd} - 2F_{abcd} & a \equiv c \equiv d \text{ or } b \equiv c \equiv d \\ F_{abcd} & a \equiv b \equiv c \equiv d. \end{array} \tag{18}$$

COCKERHAM (1971) gives a table of joint genotypic frequencies in terms of probabilities of these identity patterns.

Fully condensed fixation indices: Again for inbreeding alone, if the four alleles are now sampled from four different individuals in a population, or if the four alleles are those of an autotetraploid (all genes are exchangeable), the correspondence between fixation indices and gene identity becomes:

$$\begin{array}{ll} \text{Fixation indices} & \text{Genes identical} \\ 1 - 6F_2 + 3F_2^2 + 3F_{2.2} + 8F_3 - 6F_4 & \text{None} \\ 6F_2(1 - F_2) - 6F_{2.2} - 12F_3 + 12F_4 & \text{One pair} \\ 3F_2^2 + 3F_{2.2} - 3F_4 & \text{Two pairs} \\ 4F_3 - 4F_4 & \text{Three} \\ F_4 & \text{Four,} \end{array} \tag{19}$$

where $F_a$ is the fixation index of order $a$ and $F_{2.2}$ is the covariance of second-order fixation indices.

ESTIMATION OF FIXATION INDICES

Bounds of individual indices: The two-gene fixation index ranges from $-1$ to $+1$ at a diallelic locus (WRIGHT 1969). However, when $i \neq j$ at a multiallelic locus, $F_{ij}$ may range down to $-(1 - c_i)/c_i$ (when $c_i = c_j$ and $E[A_iB_j] = c_i$). If inbreeding is the only cause of fixation, $F$ always ranges from 0 to 1.

The three-gene fixation index $F_{ijk}$ also ranges from $-1$ to $+1$ for a diallelic locus, but at multiallelic loci when all three subscripts are unique, $F_{ijk}$ can be as low as $-(1 - c_i)/c_i^2$ (when $c_i = c_j = c_k$ and $i \neq j \neq k$). Likewise, the four-gene fixation index $F_{ijkl}$ ranges between $-1$ and $+1$ at a diallelic locus, may be less than $-1$ at multiallelic loci, but always ranges from 0 to 1 if inbreeding is the only agent of gene fixation. The covariance of fixation indices $F_{ij.kl}$ has a maximum determined by $F_{ij}$ and $F_{kl}$.

The space of fixation indices: THOMPSON (1976, 1980) has considered the constraints upon the "space" of genealogical relationships imposed by the mechanism of Mendelian segregation within a pedigree. When natural selection is allowed to influence genotypic frequencies in any arbitrary manner, a second class of constraints are those that merely ensure non-negative genotypic frequencies. We can thus define the space of fixation indices (as opposed to the space of genealogical relationships) as those sets of indices which specify non-negative genotypic frequencies.

This space was examined numerically by computing genotypic frequencies throughout the space of fixation indices. The allowable space is shown in Figure 1 for some specific examples. For two genes, Figure 1a shows that the allowable values of the two-gene index $F_2$ at a diallelic locus depend upon gene frequency. At intermediate gene frequency, this two-gene index ranges from $-1$ to $+1$, but at more extreme frequencies, the lower limit approaches zero.

Figure 1b shows the space of allowable two- and three-gene fully condensed indices ($F_2$ and $F_3$ in Equation 19) for a diallelic locus. The space of fixation indices is again limited by gene frequency. In addition, their allowable values are constrained by each other. The space appears to be continuous.

Figure 1c gives slices through the space occupied by the fully condensed modes of gene identity ($F_2$, $F_3$ and $F_4$ in Equation 19) for a diallelic locus, assuming $F_{2.2} = 0$. A small proportion of this space is occupied, and interestingly, this space appears to be discontinuous.

FIGURE 1. Some examples of the space of fixation indices at a diallelic locus, as indicated by the shaded areas. (a) The space occupied by the two-gene index as a function of gene frequency $p$. (b) Space jointly occupied by the two- and three-gene indices, assuming equality of all two-gene indices, for various $p$. (c) Space of all higher-order indices, assuming equality of all indices of the same order, for a single value of $p$ and $F_{2.2} = 0$.
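For the simplest case, the space of fixation indices can be sketched directly. The Python fragment below is an illustration only, not the numerical procedure used in the paper: it assumes a diallelic locus with allele frequency `p` and a single two-gene index `F`, evaluates the four ordered frequencies of Equation (3), and tests whether all are non-negative. The closed-form lower limit in `lower_limit` is derived here from the requirement that both homozygote frequencies be non-negative; the function names are hypothetical.

```python
def diallelic_two_gene_freqs(p, F):
    """Ordered two-gene frequencies from Equation (3) for alleles 1 and 2."""
    q = 1.0 - p
    return {
        "11": p * p + p * q * F,
        "12": p * q * (1.0 - F),
        "21": p * q * (1.0 - F),
        "22": q * q + p * q * F,
    }


def in_space(p, F, tol=1e-12):
    """True if (p, F) gives non-negative genotypic frequencies."""
    return all(f >= -tol for f in diallelic_two_gene_freqs(p, F).values())


def lower_limit(p):
    """Smallest F keeping both homozygote frequencies non-negative."""
    q = 1.0 - p
    return -min(p / q, q / p)


for p in (0.5, 0.3, 0.1, 0.01):
    print(p, lower_limit(p), in_space(p, lower_limit(p)))
```

At `p = 0.5` the lower limit is -1, and it moves toward zero as `p` becomes extreme, which agrees with the behaviour described for Figure 1a. Extending the same non-negativity test to the three- and four-gene frequencies would require Equations (7) and (13).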

Generally, the space occupied by higher-order fix- maximum possible values, this four-gene index can ation indices is quite limited. This restriction of space yange from -1 to +l. This index is estimated by I&/ places limits on the confidence intervals when jointly dvkl from (20).However, Ft& does not have a genetic estimating higher-order fixation indices. Second, it interpretation, as it cannot be related to any pattern suggests that we should look for a set of higher-order of gene identity. fixation indices which occupy more of this space. Such Equivalency to a gene-identity model: A second as-yet undefined indices would presumably be func- solution to the lack of estimability of four-gene indices tions of all lower-order cumulants. is to assume selection has an equal effect upon each Estimability of fixation indices: For a sample of n allele (or equivalently, assume selection is absent). observations, the following are method of moments This introduces the constraint that all fixation indices estimators for cumulants, which assume that cumu- of the same order equal each other (i.e., relations (15- lants are constant: 19) hold), and introduces enough information into the data such that four-gene parameters can be esti- 2z = s,/n mated. In such cases, the above method-of-moments 2, = (ns, sts,)/(n(n - 1)) - estimators are not appropriate, and the method of 3 maximum likelihood can be used for estimation. Cyk = n2syk - ESfijh + stlk (n(n - l)(n - 2)) When one assumes this equality of fixation indices, -( v 4 (20) l,kf = (n'(n + 1)sIJn - n(n + 1) CS,S,U the equations that relate fixation indices to genotypic 3 6 frequency assume the same form as equations that - n(n - 1) Cs,shl 2n Cscsjshl + relate gene identity to genotypic frequency, thus sug- - 6s,~,s,si)/(n(n- l)(n 2)(n 3)) - - gesting one is estimating gene identity. 
However, es- where the summations are over all groupings of sub- timation of gene identity also involves estimation of scripts and where parameters to which gene identity is relative. The

n relativity of gene identity depends upon the process st = E A, of in the population. This process has a high n variance and a significant covariance with allelic iden- s%j= c. A$, tity, and is not taken into account by the above rela- n tions. Svk = z A,B,Ck n Thus, when equality of gene fixation indices is assumed, the gene fixation model becomes equivalent Svki = c AfijCkDi to the gene identity model, but gene identity is not (KENDALLand STUART1977, p. 329). We can obtain estimated. Rather, we estimate fixation indices which estimates (with some bias) of two- and three-gene are interpretable in terms of normalized cumulants. fixation indices by substitution of the above estimates Variance of estimates: Five points are made here into (1) and (5). WEIR and COCKERHAM(1984) discuss concerning the estimation variance of higher-order unbiased estimators of two-gene fixation indices. fixation indices. First, both the third- and fourth- However, if pairs of second-order cumulants co- order indices consist of divisors which can be zero at vary, tvklactually estimates c,jkl+ COV[K~K~~]+ COV[K~K,~] some gene frequency. The divisor of the three-gene + COV[K,IK,~].Thus, the covariance of second-order index, dvk (Equation 6), can be zero at a gene fre- cumulants is not separable from the fourth cumulant quency of one-half if the subscripts i, j, and k are not when cumulants are estimated with this procedure, all different. The divisor of the four-gene index, dvkl and as a result the four-gene fixation indices, as given (Equation 11) can be zero at a gene frequency of by (10) and (12), are not estimable with (20). (3 - &)/6 if at least three subscripts are identical, or An alternative measure of four-gene associations: can be zero at a gene frequency of one-third if three One solution to this lack of joint estimability of four- subscripts are unique. 
gene indices is to define a single fixation index that Figure 2a shows the divisors of the second, third combines both types of four-gene associations. This and fourth cumulant plotted as a function of gene index is defined as frequency at a diallelic locus. Zeros occur at a gene frequency of (3 - &)/6 and one-half. Figure Cykl + COV[K,I, Kkl] 2b shows the logarithm of the standard deviation of the + COV[Ktk, Kji] + cOVIKti, KIk] F;kl = (21) estimate of the corresponding cumulants. Figure 2c dvki gives the resulting standard deviations of estimates of where dvkl is defined in (11). This fixation index gene frequency. The three- and four-gene fixation incorporates both types of four-gene associations: (1) indices are not estimable when divisors are zero in the tendency of all four alleles to vary together, c,+ Figure 2a. More importantly, these indices have a and (2) the covariance of fixed allele pairs, Cov[. ..I. high variance in the regions near these zeros. As these fourth-order cumulants are normalized by These values are for the condensed gene fixation 790 K. Ritiand

a. 2nd cumulant 31 c. I \+4-gene index I E 0.2 0

3y 0.1 c u UJ Y- a- O 0 fa8 .-> a -0.1

I 0 0.2 0.4 0 0.2 0.4 Gene frequency Gene frequency

-1 61 d. 1-4-gene index I n 71 b* I e, - c inbred 3-gene index CI - E -4 population 2-gene index I- 1st cumuiant + 1“ u) gene frequency e, r Q) 0 -7 c3 v) n Y cn = -IO-// 0- 1 I I I 1 0 0.2 0.4 0 0.2 0.4 Gene frequency Gene frequency FIGURE2.--How singularities of higher-order fixation indices arise. A diallelic locus is considered and values are for single observations (n = 1). (a) Divisor of cumulants of a given order, as a function of gene frequency. (b) The logarithm of the standard deviation of estimates of each cumulant (assuming true values are zero). (c) Resulting standard deviations. (d) Standard deviations with inbreeding Fa = 0.5, Fs = 0.375, F4 = 0.25, F2.1~0).

indices (Equation 19) assuming $F_{2.2} = 0$ or equivalently for $F^{*}_{ijkl}$ (Equation 21), and were computed for an "outbred" population with no actual gene fixation. They were found by inversion of the information matrix, so that these are asymptotic values and functions of actual fixation indices and not sample size. All values are per observation (i.e., sample size is one). For comparison, Figure 2d gives standard deviations for a more "inbred" population ($F_2 = 0.5$, $F_3 = 0.375$ and $F_4 = 0.25$). The same trends are apparent.

The second point is that estimates of higher-order fixation indices often have higher variance relative to the two-gene index. In Figure 2, both c and d show that the asymptotic variances for the three- and four-gene index are often much higher than for the two-gene index. However, at most gene frequencies, these higher-order fixation indices have standard deviations only about two to three times as great as the two-gene index. In fact, at an intermediate gene frequency, the variance of the four-gene index is about equal to that of the two-gene index, although the exact relation depends on the degree of inbreeding.

The third point concerns the effect of actual inbreeding upon variance of estimates. Figure 3a gives asymptotic standard deviations of estimates as a function of increasing four-gene fixation $F_4$ (for $F_3 = \sqrt{F_4}(1 + \sqrt{F_4})/2$, $F_2 = \sqrt{F_4}$ and gene frequency of one-third). Generally, variance of fixation indices decreases with increasing actual gene fixation.

The fourth point concerns effect of the number of alleles at a locus. Figure 3b shows that estimation variance decreases with increasing numbers of alleles at a locus. A significant decrease of variance occurs between 3 and 4 alleles, and asymptotic variances are approached with 10 alleles at a locus. In this figure, gene fixation was assumed in the manner of Equation 18, a triangular distribution of allele frequencies ($c, 2c, 3c, \ldots$) was assumed, variances were found by inverting the information matrix assuming gene frequencies were known, and the level of actual inbreeding was assumed zero. The same trend of increasing precision of estimates was found for moderate values of actual inbreeding, and for a uniform distribution of allele frequencies (results not shown).

[Figure 3, panels a to d; horizontal axis: level of inbreeding in panel a, number of alleles (3 to 9) in the other panels.]

FIGURE 3.-General variance properties for estimates of fixation indices. (a) Standard deviation per observation as a function of level of inbreeding ($F_2 = \sqrt{F_4}$, $F_3 = 0.5\sqrt{F_4}(1 + \sqrt{F_4})$, $F_{2.2} = 0$). (b) Standard deviations as a function of number of alleles at the locus for the fully reduced indices (Equation 19), assuming selective neutrality and a triangular distribution of allele frequency. (c and d) Standard deviations of estimates of condensed fixation indices of Equation 18 (with same conditions as b). Note: $S_1$ = f(abcd), $S_2$ = f(ab and cd), $S_3$ = f(abc or abd), $S_4$ = f(ab), $S_5$ = f(acd or bcd), $S_6$ = f(cd), $S_7$ = f(ac and bd or ad and bc), $S_8$ = f(ac or ad or bc or bd), $S_9$ = f(none fixed), where genes fixed for same alleles are grouped, and f denotes frequency.
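The singular gene frequencies named under the first point can be checked numerically. The Python sketch below evaluates the diallelic divisors $pq(q - p)$ and $pq(1 - 6pq)$ given in the Discussion, together with the familiar two-gene divisor $pq$, and compares them with the second, third and fourth cumulants of a 0/1 allelic indicator computed from its raw moments. The function names are illustrative assumptions and do not come from the paper, and reading the divisors as cumulants of a single indicator is an interpretation that applies to the diallelic case only.

```python
import math


def diallelic_divisors(p):
    """Divisors of the 2nd, 3rd and 4th cumulants at a diallelic locus.

    p is the gene frequency and q = 1 - p. The third- and fourth-order
    expressions are the ones quoted in the Discussion.
    """
    q = 1.0 - p
    d2 = p * q
    d3 = p * q * (q - p)
    d4 = p * q * (1.0 - 6.0 * p * q)
    return d2, d3, d4


def indicator_cumulants(p):
    """Cumulants 2 to 4 of a 0/1 indicator with mean p, from raw moments.

    Every raw moment of a 0/1 variable equals p, so the standard
    moment-to-cumulant formulas can be applied directly.
    """
    m1 = m2 = m3 = m4 = p
    k2 = m2 - m1 ** 2
    k3 = m3 - 3.0 * m2 * m1 + 2.0 * m1 ** 3
    k4 = (m4 - 4.0 * m3 * m1 - 3.0 * m2 ** 2
          + 12.0 * m2 * m1 ** 2 - 6.0 * m1 ** 4)
    return k2, k3, k4


if __name__ == "__main__":
    # Gene frequencies at which the text reports zero divisors.
    p_third_order = 0.5
    p_fourth_order = (3.0 - math.sqrt(3.0)) / 6.0

    print("d3 at p = 1/2:", diallelic_divisors(p_third_order)[1])
    print("d4 at p = (3 - sqrt(3))/6:", diallelic_divisors(p_fourth_order)[2])

    # Away from the zeros the divisors match the indicator cumulants.
    for p in (0.1, 0.3, 0.7):
        print(p, diallelic_divisors(p), indicator_cumulants(p))
```

Because an estimated index is an estimated cumulant divided by one of these divisors, a divisor close to zero inflates the sampling error of the index, which is the behavior the text describes near gene frequencies of one-half and $(3 - \sqrt{3})/6$.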

For a diallelic locus, the information matrix was singular, showing that four-gene parameters are not estimable for a diallelic locus. This is reflected in the observation that if subscripts can take only two values in (13), the coefficients that multiply $F_{ijkl}$ and $F_{ij.kl}$ in (13) are always the same. Singularity of the information matrix was also found for a triallelic locus when alleles were of equal frequency. Thus, loci must also have at least three alleles of unequal frequency to estimate four-gene indices.

The last point concerns the estimation of those gene fixations which correspond to the nine condensed identity modes. Note we cannot estimate gene identity, but rather estimate a set of fixation indices that are equivalent in the manner discussed above. Figure 3, c and d, shows the asymptotic standard deviations for estimates of fixation indices equivalent to the nine condensed identity modes (with same assumptions as Figure 3b). Variances of the three "outbred" indices $S_7$, $S_8$ and $S_9$ were so much greater than the "inbred" indices that they were plotted separately. This suggests that our power to infer relatedness is greater for moderately inbreeding populations than for outbred populations.

DISCUSSION

The fixation index was proposed by WRIGHT (1922) as a measure of the correlation between two genes that develops as the result of the joint effects of inbreeding and selection. This paper has derived the analogous fixation indices for three and four genes. In doing so, one encounters an exponential increase of complexity from the two-gene case to the four-gene case. This complexity is somewhat misleading as the general multiallelic case was considered in this paper. For a diallelic locus with gene frequency $p = 1 - q$, the three-gene index is $c_3/[pq(q - p)]$, where $c_3$ is the third cumulant of allelic values, and the two four-gene indices are $c_4/[pq(1 - 6pq)]$ and $c_{2.2}/(p^2q^2)$, where $c_4$ is the expected fourth cumulant of allelic values and $c_{2.2}$ the covariances of second cumulants between pairs of allele values.

One might suggest that indices based upon moments and not cumulants might be used to describe genotypic associations. However, higher-order fixation indices based upon cumulants are more appropriate because they can be directly related to gene identity-by-descent.

This equivalence between gene cumulants and gene identity has not been previously recognized. The distinction we have made between moments and cumulants is not necessary for two or three genes, but is necessary for four genes. References to gene cumulants, as opposed to moments, are lacking in related studies. WEIR and COCKERHAM (1969) used fourth-order moments to study group inbreeding. YASUDA (1973) advocated the use of higher-order moments to describe mating type frequency.

In the above notation, the fourth moment $m_4$ equals the cumulants $c_4 + 3c_{2.2}$. A four-gene index based upon a moment (which would equal $m_4/[pq(1 - 3pq)]$ for a diallelic locus) would confound in a rather complex manner the two types of four-gene indices, given in (10) and (12).

One can conjecture that associations of more than four genes can be likewise described by appropriately normalized cumulants. However, associations between the cumulants themselves must also be considered. Equation 9 shows how the covariances between second-order cumulants enter when associations of four genes are considered. These covariances can be nuisance parameters, as they can oversaturate the data with parameters.

The estimation of four-gene parameters has been a neglected topic. Much attention has been paid to the two-gene fixation index, in large part because it is amenable to treatment by classical statistical theory such as the analysis of variance (cf. WEIR and COCKERHAM 1984). Studies concentrate upon two-gene measures of relatedness (MORTON et al. 1971; PAMILO and CROZIER 1982) largely because these measures are sufficient for outbred populations. However, if we consider the four alleles as possessed by two individuals X and Y, four-gene associations may still be present in outbred populations, as the four-gene index $F_{X.Y}$ (Equation 18) can be nonzero in outbred populations.

Perhaps one reason for the lack of attention to higher-order associations is that, previously, measures have been defined only in terms of gene identity. Higher-order identity coefficients such as those given in COCKERHAM (1971) have always been estimable. However, estimation of gene identity requires ascertainment of the point in time to which gene identity is relative. Use of fixation indices avoids this problem because the gene cumulants are defined relative to contemporary gene frequencies.

If one assumes that selection has an equal effect upon each allele, or more likely, that selection is absent and inbreeding is the only agent of gene fixation, then a locus with at least three alleles of unequal frequency provides enough information to estimate four-gene fixation indices. Generally, Figures 2 and 3 show the statistical properties of four-gene indices are reasonable. The variances of three- and four-gene fixation indices are usually larger than for two-gene indices, but barring certain allele frequencies near the singularities of Figure 2, one requires a sample only about 2-3 times larger to obtain higher-order estimates of the same precision as two-gene estimates. Besides increasing sample size, variance can also be reduced by assaying loci with more alleles. For example, in molecular genetics, it might be desirable to screen random genomic fragments for highly polymorphic loci. However, an enormous number of alleles is by no means required, as Figure 3 shows a large reduction in variance with four or five alleles, and asymptotic variances appear to be reached by about ten alleles.

Higher-order fixation indices are difficult to relate to the classical concepts of variance and covariance, as these terms strictly apply to two variables. Two-gene indices can be clearly regarded as the correlation between alleles. For four genes, the covariance of second-order indices provides some interpretation. The first type of four-gene association, $F_{ijkl}$ (Equation 10), might be considered as the variance of gene fixation between fixed allele pairs. The second type of four-gene association, $F_{ij.kl}$ (Equation 12), is the between-pair covariance of within-pair gene fixation. The reader might gain some insight by contrasting a diallelic locus, wherein the two types of fixation are confounded, with an infinite allele locus, where the two types of fixation clearly differ.

The three-gene index seems to defy any interpretation in terms of covariances. Regardless, one point is that these higher-order indices are more closely tied to covariances than to correlations, so that their range of allowable values is quite constrained compared to two-gene fixation indices (Figure 1).

At diallelic loci, the two components of four-gene association cannot be statistically separated. One solution was to define a composite four-gene fixation index $F^{*}_{ijkl}$ which includes both types of fixation (Equation 21). This four-gene index may perhaps be interpreted as simply the covariance between pairs of fixed alleles. Regression interpretations may sometimes be more appropriate. For three genes, if genes a and b are contained by individual X, and the allele contributed by the mate of X is gene c, the effective selfing rate of this individual is $F_{ac} + F_{bc} - F_{abc}$, which equals the coefficient of regression of the added values of a and b upon c (RITLAND 1985).

There are many phenomena in [...] that warrant consideration of higher-order associations of genes. These measures are particularly appropriate for plant populations, as higher-order population structure is predisposed to develop because of the rooted nature of plants and the tendency for near-neighbor pollination, selfing, or local seed dispersal in plant populations (BROWN 1979). Functions of three-gene fixation indices provide a characterization of the mating system of Mimulus guttatus (RITLAND and GANDERS 1987). For the study of correlated matings, functions of four-gene fixation indices are useful. For example, we can consider two genes as possessed by the maternal parent, and two as derived from each of two paternal parents.

Another phenomenon is variation of actual inbreeding. Among-individual variation due to variation of pedigree has been studied theoretically (WEIR, AVERY and HILL 1980; COCKERHAM and WEIR 1983), and its magnitude concluded to be probably small. Among-neighborhood variation of inbreeding caused by variation of effective neighborhood size is a different aspect of inbreeding variation that needs empirical determination. One approach to estimate such variation of inbreeding is to estimate four-gene fixation indices, wherein the four genes are those shared by adjacent pairs of individuals throughout a population.

Finally, higher-order fixation indices are relevant to the study of the evolutionary implications of genetic correlations between interacting individuals, or kin selection (MICHOD 1982). Although MICHOD and HAMILTON (1980) defined measures of coefficients of relatedness which are general, in the sense of being functions of higher-order inbreeding coefficients, the estimation of relatedness within social groups has apparently relied exclusively upon two-gene measures [e.g., PAMILO and CROZIER (1982) and references therein]. Furthermore, the assumption of weak selection in models of kin selection may be relaxed if models use fixation indices instead of gene identities. However, higher-order indices suffer from an inherent complexity that may limit their practical use.

I thank a reviewer who clarified the difference between fixation indices and gene identity coefficients, and BRUCE WEIR who provided many comments and suggested the Kronecker delta approach. This research was supported by a Natural Sciences and Engineering Research Council of Canada grant to the author.

LITERATURE CITED

BROWN, A. H. D., 1979 Enzyme polymorphism in plant populations. Theor. Popul. Biol. 15: 1-42.
CANNINGS, C., and E. A. THOMPSON, 1981 Genealogical and Genetic Structure. Cambridge University Press, Cambridge.
COCKERHAM, C. C., 1969 Variance of gene frequencies. Evolution 23: 72-84.
COCKERHAM, C. C., 1971 Higher order probability functions of identity of alleles by descent. Genetics 69: 235-246.
COCKERHAM, C. C., and B. S. WEIR, 1983 Variance of actual inbreeding. Theor. Popul. Biol. 23: 85-109.
GILLOIS, M., 1965 Relation d'identité en génétique. I. Postulats et axiomes Mendéliens. II. Corrélation génétique dans le cas de dominance. Ann. Inst. Henri Poincaré Sec. B 11: 1-94.
GILLOIS, M., 1966 Note sur la variance et la covariance génotypiques entre apparentés. Ann. Inst. Henri Poincaré Sec. B 11: 349-352.
HARRIS, D. L., 1964 Genotypic covariances between inbred relatives. Genetics 50: 1319-1348.
JACQUARD, A., 1974 The Genetic Structure of Populations. Springer-Verlag, New York.
KENDALL, M., and A. STUART, 1977 The Advanced Theory of Statistics, Vol. 1. Distribution Theory. Macmillan, New York.
MALECOT, G., 1948 Les Mathématiques de l'hérédité. Masson et Cie, Paris.
MICHOD, R. E., 1982 The theory of kin selection. Ann. Rev. Ecol. Syst. 13: 23-55.
MICHOD, R. E., and W. D. HAMILTON, 1980 Coefficients of relatedness in sociobiology. Nature 288: 694-697.
MORTON, N. E., S. YEE, D. E. HARRIS and R. LEW, 1971 Bioassay of kinship. Theor. Popul. Biol. 2: 507-524.
PAMILO, P., and R. H. CROZIER, 1982 Measuring genetic relatedness in natural populations: methodology. Theor. Popul. Biol. 21: 171-193.
RITLAND, K., 1985 The genetic mating structure of subdivided populations. I. Open-mating model. Theor. Popul. Biol. 27: 51-74.
RITLAND, K., and F. R. GANDERS, 1987 Covariance of selfing rates with parental gene fixation indices within populations of Mimulus guttatus. Evolution 41: 760-771.
THOMPSON, E. A., 1976 A restriction on the space of genetic relationships. Ann. Hum. Genet. 40: 201-204.
THOMPSON, E. A., 1980 The gene identity states of a descendent. Theor. Popul. Biol. 18: 76-93.
WEIR, B. S., 1970 Equilibria under inbreeding and selection. Genetics 65: 371-378.
WEIR, B. S., and C. C. COCKERHAM, 1969 Group inbreeding with two linked loci. Genetics 63: 711-742.
WEIR, B. S., and C. C. COCKERHAM, 1984 Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.
WEIR, B. S., P. J. AVERY and W. G. HILL, 1980 Effect of mating structure on variation in inbreeding. Theor. Popul. Biol. 18: 396-429.
WRIGHT, S., 1922 Coefficients of inbreeding and relationship. Am. Nat. 56: 330-338.
WRIGHT, S., 1969 Evolution and the Genetics of Populations, Vol. 2. The Theory of Gene Frequencies. University of Chicago Press, Chicago.
YASUDA, N., 1973 Mating type frequency in terms of the gene frequency moments with random [...]. pp. 60-65. In: Genetic Structure of Populations. Edited by N. E. MORTON.

Communicating editor: B. S. WEIR