
Copyright © 2006 by the Genetics Society of America
DOI: 10.1534/genetics.105.049940

Estimating Relatedness Between Individuals in General Populations With a Focus on Their Use in Conservation Programs

Pieter A. Oliehoek,*,1 Jack J. Windig,†,‡ Johan A. M. van Arendonk* and Piter Bijma*

*Animal Breeding and Genetics, Wageningen University, Wageningen, 6709 PG, The Netherlands and †Centre for Genetic Resources (CGN) and ‡Animal Sciences Group (ASG), Wageningen University and Research Centre (WUR), Lelystad, 8200 AB, The Netherlands

Manuscript received August 23, 2005
Accepted for publication March 1, 2006

ABSTRACT

Relatedness estimators are widely used in genetic studies, but effects of population structure on performance of estimators, criteria to evaluate estimators, and benefits of using such estimators in conservation programs have to date received little attention. In this article we present new estimators, based on the relationship between coancestry and molecular similarity between individuals, and compare them with existing estimators using Monte Carlo simulation of populations, either panmictic or structured. Estimators were evaluated using statistical criteria and a diversity criterion that minimized relatedness. Results show that ranking of estimators depends on the population structure. An existing estimator based on two-gene and four-gene coefficients of identity performs best in panmictic populations, whereas a new estimator based on coancestry performs best in structured populations. The number of marker alleles and loci did not affect ranking of estimators. Statistical criteria were insufficient to evaluate estimators for their use in conservation programs. The regression coefficient of pedigree relatedness on estimated relatedness ($b_2$) was substantially lower than unity for all estimators, causing overestimation of the diversity conserved. A simple correction to achieve $b_2 = 1$ improves both existing and new estimators. Using relatedness estimates with correction considerably increased diversity in structured populations, but did not do so or even decreased diversity in panmictic populations.

¹Corresponding author: Animal Breeding and Genetics, Department of Animal Sciences, Wageningen University, P.O. Box 338, Marijkeweg 40, Wageningen, 6709 PG, The Netherlands.

Genetics 173: 483–496 (May 2006)

Additive genetic relatedness between individuals plays an important role in many fields of genetics. In genetic analyses, knowledge of relatedness is used to estimate genetic parameters such as heritabilities and genetic correlations (Falconer and Mackay 1996). In artificial selection, estimation of breeding values relies on knowledge of relatedness of individuals (Henderson 1984; Lynch and Walsh 1998), and relatedness between individuals affects optimum designs of selection programs (e.g., Nicholas and Smith 1983). In evolutionary biology, knowledge of relatedness between interacting individuals is required to predict evolutionary consequences of social interaction (Hamilton 1964). In conservation genetics, knowledge of relatedness is required to optimize conservation strategies. In this article we focus on estimating relatedness for use in conservation strategies, but results are equally relevant for other fields in genetics. Throughout, we consider the traditional population-genetic definition of relatedness for diploid individuals, which equals twice the coefficient of coancestry (Malécot 1948; Lynch and Walsh 1998).

When pedigrees of populations are known, additive genetic relatedness between individuals can be calculated from the pedigree (Emik and Terrill 1949) and can be used to estimate additive genetic [...]. Pedigree data are, however, often lacking or incomplete, especially between subpopulations of a species. In those cases, estimates of relatedness rely on molecular markers. Methods to estimate relatedness from molecular marker data described in the literature can be divided into two groups (Blouin 2003): (1) methods that estimate relatedness on a continuous scale (e.g., Lynch and Ritland 1999; Wang 2002), and (2) methods that categorize individuals into a limited number of discrete classes of relatives, such as full-sib, half-sib, or parent–offspring relationships.

Toro et al. (2002) compared estimators expressing relatedness on a continuous scale in a pedigreed population of pigs divided into two related strains, using actual and simulated markers. Molecular coancestry (Malécot 1948), the estimator of Lynch and Ritland (1999), and a maximum-likelihood estimator showed the highest correlation between pedigree and estimated relatedness. When both strains were analyzed together, molecular coancestry performed substantially better than more sophisticated estimators, indicating that quality of estimators depends on the population structure and that current estimators are not optimal in general. More recently, novel estimators have been proposed and compared by Wang (2002) and Milligan (2003) for their statistical performance in an "outbred" population structure, having only four degrees of relatedness: parent–offspring, full-sibs, first cousins, and unrelated individuals.

A number of issues remain unsolved, relating in particular to the population structure (Milligan 2003), the utility of estimated relatedness in conservation programs, and the criterion to judge the quality of an estimator. Estimators of Lynch and Ritland (1999) and Wang (2002) assume no inbreeding. Those estimators have been evaluated using simulated populations without pedigree, no inbreeding, and simple classes of relatives of full-sibs, half-sibs, parent–offspring, or unrelated individuals. Complex pedigree structures and high levels of relatedness and inbreeding, however, are typical for populations in need of conservation. There is a need for relatedness estimators that can be applied to fragmented populations, where interest is in both within and between subpopulation relatedness. Development of such estimators is not merely a statistical issue, but needs a connection with population genetic concepts such as drift. Furthermore, the utility of using estimators in conservation programs, with the aim to maximize the amount of additive genetic variance conserved, has not been investigated to our knowledge. Hence, more knowledge is needed of the usefulness of relatedness estimators to support conservation strategies, such as determining which individuals are genetically important.

In this article we introduce estimators that are based on the relationship between coancestry and relatedness, which holds irrespective of inbreeding. In total, we compare eight estimators expressing relatedness on a continuous scale, with a focus on supporting conservation strategies. Monte Carlo simulations produced populations with both pedigree and marker data. Behavior of the estimators is studied for alternative populations, differing in (a) the number of alleles per locus in the base generation, (b) the number of loci used, (c) the relatedness compared to the base population, (d) the population structure (either panmictic or structured), and (e) the size of a subset of individuals selected to maximize the amount of genetic variation conserved. Relatedness was estimated using simulated marker data and analyzed against pedigree relatedness, using both statistical and diversity criteria.

METHODS

Here we describe (1) the relatedness estimators considered, (2) the simulated population structures in which estimators will be tested, and (3) the criteria used to assess quality of the estimators.

Relatedness estimators: Eight relatedness estimators are compared, which we divide into three categories (Table 1). The first category is based on the relationship between additive genetic relatedness ($r$), population genetic coancestry ($f$, also known as "kinship"; Falconer and Mackay 1996), and molecular coancestry ($f_M$) (Jacquard 1983; Lynch 1988; Toro et al. 2002) and consists of both existing and new estimators. The second category is based on the relationship between additive genetic relatedness and two-gene and four-gene coefficients of identity in "noninbred" populations and consists of the estimators of Lynch and Ritland (1999) and Wang (2002). The third category consists of the estimator of Queller and Goodnight (1989). All estimators express relatedness on a continuous scale.

TABLE 1
Estimators used

Abbreviation  Full name/reference              Equation  Category
fM            Molecular coancestry             4         1
UCS           Unweighted corrected similarity  5         1
WCS           Weighted corrected similarity    6, 7      1
WEDS          Weighted equal drift similarity  6, 7, 8   1
L&R           Lynch and Ritland (1999)         13, 14    2
Wang          Wang (2002)^a                    -         2
Q&G           Queller and Goodnight (1989)     15, 16    3

^a See http://www.zoo.cam.ac.uk/ioz/software.htm for software.

Milligan (2003) presented a maximum-likelihood estimator for noninbred populations. In "inbred" populations, however, finding the maximum-likelihood value is computationally demanding because many modes of identity by descent (IBD) occur (see Table 1 in Milligan 2003). We therefore did not investigate maximum-likelihood estimators.

Category 1: Estimators based on coancestry: By definition, additive genetic relatedness ($r$) between diploid individuals equals twice the coefficient of coancestry ($f$, also known as kinship; $r = 2f$) (Malécot 1948; Falconer and Mackay 1996). Thus, conservation strategies based on coancestry are equivalent to strategies based on relatedness. Coancestry of two individuals is the probability that two alleles drawn randomly, one from each individual, are IBD, indicating that they descend from a common ancestor (Falconer and Mackay 1996). Coancestry and relatedness are expressed relative to a so-called base population, in which all alleles are defined as being not IBD, so that coancestry in the base population is zero by definition (Falconer and Mackay 1996; Lynch and Walsh 1998). Alleles that are molecularly identical in the base population are referred to as alike in state (AIS). Thus, in any generation, the proportion of alleles AIS is equal to expected homozygosity in the base population. When pedigrees are known, the founder generation is commonly used as the base population, so that relatedness among founders is zero by definition. In principle, base populations serve merely as a reference point, and the choice of the base population is arbitrary. However, not all choices are genetically meaningful and theoretically correct, particularly in structured populations (see discussion).

The new estimators presented in this article are based on the approach of Eding and Meuwissen (2003). Eding and Meuwissen (2003) developed estimators of between-population coancestry, using observations on molecular similarity between and within populations, in which case the definition of the base population is more obvious. We modify estimators of Eding and Meuwissen (2001, 2003) to estimate coancestries between individuals instead of between populations.

First we describe the theoretical background of estimators based on coancestry. Estimators based on coancestry make use of the molecular similarity index ($S_{xy,l}$), which refers to a single locus $l$ in a pair of individuals $xy$ and is defined as the probability that two marker alleles drawn from two individuals are molecularly identical (Jacquard 1983; Caballero and Toro 2000; Toro et al. 2002). In the following, $S_{xy,l}$ is referred to as "similarity." For locus $l$, similarity between individual $x$ having alleles $a$ and $b$ and individual $y$ having alleles $c$ and $d$ is defined as

$$S_{xy,l} = \tfrac{1}{4}\left[I_{ac} + I_{ad} + I_{bc} + I_{bd}\right] \tag{1}$$

(Li and Horvitz 1953), where indicator $I_{ac}$ is one when allele $a$ of individual $x$ is identical to allele $c$ of individual $y$, and zero otherwise, etc. Similarity takes values of 0, 1/4, 1/2, or 1. Values of 3/4 do not occur, because the fourth indicator must be equal to 1 when the previous three indicators are equal to 1. Similarities of 1/4 require at least three distinct alleles and therefore do not occur at biallelic loci.

Similarity will vary between pairs of individuals and will be partly due to alleles that are IBD but also due to alleles AIS. When $s_l$ denotes the probability that two alleles at locus $l$ are AIS, then expected similarity between individuals $x$ and $y$ at locus $l$ is

$$E[S_{xy,l}] = f_{xy} + (1 - f_{xy})\,s_l \tag{2a}$$

(Lynch 1988), where $s_l$ is the average similarity at locus $l$ in the base population, and $f_{xy}$ is the coancestry between individuals $x$ and $y$ expressed relative to this base population. Equation 2a may be interpreted as the probability that alleles are IBD ($f_{xy}$) plus the probability that they are not IBD but AIS [$(1 - f_{xy})s_l$]. Equation 2a holds irrespective of inbreeding or random mating. Rearrangement of Equation 2a gives a convenient form resembling Wright's F-statistics (Wright 1978):

$$1 - E[S_{xy,l}] = (1 - f_{xy})(1 - s_l). \tag{2b}$$

A so-called "method of moments estimator" of coancestry is obtained by rearranging Equation 2a, substituting expected similarity by observed similarity, and averaging over $L$ loci, which gives

$$\hat{f}_{xy} = \frac{1}{L}\sum_{l=1}^{L}\frac{S_{xy,l} - s_l}{1 - s_l}. \tag{3}$$

Multiplying Equation 3 by a factor of 2 yields a relatedness estimator (see Ritland 1996).

Equation 3 shows that a value for $s_l$ is needed for each locus. Because allele frequencies in the base population are usually unknown, $s_l$ needs to be estimated, which involves two problems. First, when the average level of AIS is estimated incorrectly, the average estimated relatedness of the current population will be biased. The observed average similarity and the estimated probability of alleles AIS together implicitly define the base population. The lower the estimated AIS, the further back in time this base population is set, and the higher the average estimated relatedness of the current population. Vice versa, an overestimation of AIS will result in underestimating relatedness (Toro et al. 2003). For example, when the base population is set equal to the current population, which is done implicitly when $s_l$ is calculated from current allele frequencies assuming random mating, IBD between all pairs of individuals will be $-1/(2N)$ on average, resulting in negative estimates of relatedness for many pairs of individuals. Negative estimates are difficult to interpret because relatedness is defined as twice the probability that alleles are IBD.

The second problem is that, although probabilities of alleles AIS differ per locus, expected coancestry for a pair of individuals is equal at all neutral loci by definition. Ideally, this should be taken into account when estimating $s_l$ for each locus. In the following we describe estimators based on Equations 2 and 3, in order of increasing complexity.

Molecular coancestry ($f_M$): Toro et al. (2002, 2003) used $f_M$ as an estimator of coancestry. Molecular coancestry ignores alleles AIS by setting $s_l = 0$ for all loci, so that estimated relatedness equals the average similarity over loci multiplied by a factor of two:

$$\hat{r}_{xy} = \frac{2}{L}\sum_{l=1}^{L} S_{xy,l}. \tag{4}$$

When founder alleles are unique, $\hat{r}_{xy}$ would be an unbiased estimator of relatedness.
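The similarity index and the method-of-moments estimator can be illustrated with a short Python sketch. This is not the authors' implementation: the genotype representation (one pair of allele labels per locus) and the function names `similarity`, `relatedness_moments`, and `relatedness_fm` are assumptions made here for illustration only. The code follows Equation 1 for similarity, twice Equation 3 for relatedness with known or estimated AIS probabilities, and Equation 4 for the special case in which all AIS probabilities are set to zero.

```python
def similarity(gx, gy):
    """Equation 1: mean of the four allele-pair identity indicators.

    gx and gy are genotypes at one locus, each a pair of allele labels
    (hypothetical representation chosen for this sketch).
    """
    a, b = gx
    c, d = gy
    return ((a == c) + (a == d) + (b == c) + (b == d)) / 4.0


def relatedness_moments(x, y, s):
    """Twice Equation 3: relatedness from similarity corrected for AIS.

    x, y : lists of genotypes (one pair of alleles per locus)
    s    : list with the AIS probability s_l assumed for each locus
    """
    if not (len(x) == len(y) == len(s)) or len(s) == 0:
        raise ValueError("x, y and s must have the same nonzero length")
    total = 0.0
    for gx, gy, sl in zip(x, y, s):
        if not 0.0 <= sl < 1.0:
            raise ValueError("each s_l must lie in [0, 1)")
        total += (similarity(gx, gy) - sl) / (1.0 - sl)
    return 2.0 * total / len(s)


def relatedness_fm(x, y):
    """Equation 4: molecular coancestry, i.e. s_l = 0 at every locus."""
    return relatedness_moments(x, y, [0.0] * len(x))


if __name__ == "__main__":
    # Two individuals genotyped at two loci (made-up example data).
    ind_x = [(1, 2), (1, 1)]
    ind_y = [(1, 2), (1, 2)]
    print(relatedness_fm(ind_x, ind_y))
    print(relatedness_moments(ind_x, ind_y, [0.25, 0.5]))
```

Setting every $s_l$ to zero reduces the estimate to twice the average similarity, whereas passing locus-specific values of $s_l$ lowers the estimate by the share of similarity attributed to alleles AIS.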

Unweighted corrected similarity: For the unweighted corrected similarity (UCS) estimator, $s_l$ is estimated assuming that all distinct alleles in the current population had equal frequencies ($p_l$) in the base population, $p_l = 1/n_l$, where $n_l$ is the number of distinct alleles at locus $l$ observed in the current population, which is often referred to as allelic diversity ($AD_l$) (Fernandez et al. 2005). Consequently, the probability that alleles are AIS equals $s_l = \sum_{n_l} p_l^2 = 1/n_l$. Estimates for UCS were obtained by substituting $s_l = 1/n_l$ into Equation 3 and multiplying by a factor of 2, giving

$$\hat{r}_{xy} = \frac{2}{L}\sum_{l=1}^{L}\frac{S_{xy,l} - 1/n_l}{1 - 1/n_l}. \tag{5}$$

The assumption that $s_l = 1/n_l$ ignores differences in allele frequencies among loci and consequently does not necessarily satisfy the condition that expected coancestry of a pair of individuals is equal at all loci. However, it is simple to apply and may turn out to be robust.

Weighted corrected similarity: Allele frequencies vary among loci. Consequently, different loci contribute differently to the estimated relatedness, and the variance of observed similarity around its expectation (Equation 2a) varies among loci. The weighted corrected similarity (WCS) estimator uses weights ($w_l$) to optimize the impact of loci on estimated relatedness,

$$\hat{r}_{xy} = \frac{2}{W}\sum_{l=1}^{L} w_l\,\frac{S_{xy,l} - \hat{s}_l}{1 - \hat{s}_l}, \tag{6}$$

where $W$ is the sum of weights $w_l$ over all loci and $\hat{s}_l = 1/n_l$. When variance of an estimator varies among observations, using reciprocals of the variance as weights minimizes the squared error of the estimate (Lynch and Ritland 1999; Eding and Meuwissen 2001). The variance of estimated coancestry is proportional to $\mathrm{Var}(S_{xy,l})/(1 - s_l)^2$ (Equation 3). An exact expression for $\mathrm{Var}(S_{xy,l})$ follows from the probabilities of occurrence of each similarity value and is given in the appendix. A simple approximation for $\mathrm{Var}(S_{xy,l})$ is obtained by assuming that $I_{ac}$ through $I_{bd}$ in Equation 1 are mutually independent, in which case $\mathrm{Var}(S_{xy,l})$ is proportional to $\mathrm{Var}(I_{\cdot\cdot})$ and we can use $\mathrm{Var}(I)$ to obtain weights. Since $I_{\cdot\cdot}$ is binomial, the reciprocal of weight $w_l$ for locus $l$ having $n_l$ alleles equals

$$w_l^{-1} = \frac{\mathrm{Var}(I_{xy,l})}{(1 - \hat{s}_l)^2} = \frac{\sum_{i=1}^{n_l}\hat{p}_i^2\left(1 - \sum_{i=1}^{n_l}\hat{p}_i^2\right)}{(1 - \hat{s}_l)^2}, \tag{7}$$

where $\hat{p}_i$ is the estimated allele frequency of allele $i$ at locus $l$ in the current population. Preliminary results showed that differences between exact or approximate weights were negligible. Values presented in results, therefore, are obtained using approximate weights (Equation 7), which are much simpler than exact weights.

Weighted equal drift similarity: The UCS and WCS estimators use the number of distinct alleles to estimate $s_l$ for each locus, which does not fully guarantee that coancestry between a pair of individuals is equal at all loci. The weighted equal drift similarity (WEDS) estimator solves this problem by calculating $s_l$ so that the increase in coancestry since the base population is equal at all loci. The WEDS estimator starts by setting $s_l = 0$ for the locus having the lowest expected similarity ($S_{\min}$) given its allele frequencies, $S_{\min} = \min\left(\sum_n \hat{p}_n^2\right)$, where $n$ is the number of alleles. This defines the base population such that estimated $s_l$ will be nonnegative for all loci. The next step is to calculate $s_l$ at other loci as the expected similarity at those loci, corrected with the same amount of coancestry. It follows from Equation 2b that for all loci

$$\hat{s}_l = \frac{\sum_{i=1}^{n_l}\hat{p}_i^2 - S_{\min}}{1 - S_{\min}}. \tag{8}$$

Finally, coancestries are estimated using Equations 6 and 7.

Weighted log-linear model: Eding and Meuwissen (2003) estimated average coancestries within and between populations, by using the logarithm of Equation 2b, which yields a linear model. Here we applied their approach on the individual level. In contrast to the previous estimators, this procedure obtains $\hat{r}_{xy}$ and $\hat{s}_l$ simultaneously. However, the weighted log-linear model (WLM) estimator required substantial computing time and yielded poor results (not presented), which seemed to originate from the log-transformation when $S_{xy,l} = 1$.

Category 2: Estimators based on two-gene and four-gene coefficients of identity: The second category of estimators is based on the relationship between relatedness and two-gene and four-gene coefficients of identity in "noninbred" populations (Lynch and Ritland 1999),

$$r_{xy} = \frac{\phi_{xy}}{2} + \Delta_{xy}, \tag{12}$$

where $\phi_{xy}$ is the probability that, at a certain locus, a single allele in individual $x$ is IBD to a single allele in individual $y$, and $\Delta$ is the probability that both alleles in individual $x$ are IBD to both alleles in individual $y$ ($\phi$ and $\Delta$ are denoted $\Delta_8$ and $\Delta_7$ in Lynch and Walsh 1998). In the following, we summarize the estimators of Lynch and Ritland (1999) and Wang (2002), which are based on Equation 12. Beware of a typo in Lynch and Ritland (1999) and Wang (2002), which reads $\phi = 0.25$ instead of $\phi = 0.5$ for half-sibs (Falconer and Mackay 1996).

Lynch and Ritland: Lynch and Ritland (1999) proposed an asymmetrical estimator that is now commonly used. Their estimator is based on regression of genotype probabilities of the one individual on the genotype of the other individual of a pair. A symmetrical multilocus estimator is obtained as the weighted average over loci, taking the average of the reciprocal multilocus estimates,

$$\hat{r}_{xy} = \frac{1}{2W_x}\sum_{l=1}^{L} w_{x,l}\,\hat{r}_{xy,l} + \frac{1}{2W_y}\sum_{l=1}^{L} w_{y,l}\,\hat{r}_{yx,l}. \tag{13}$$

The locus-specific estimator $\hat{r}_{xy,l}$ has as denominator $(1 + I_{ab})(p_a + p_b) - 4p_a p_b$, where $p_a$ is the frequency of allele $a$ at locus $l$ and, as in Equation 1, $I_{ab} = 1$ when alleles $a$ and $b$ of individual $x$ are identical and 0 otherwise. Consequently, a division by 0 occurs when $p_a = p_b = 0.5$ and $I_{ab} = 0$, and the Lynch and Ritland (L&R) estimator performs poorly at biallelic loci due to rounding errors at allele frequencies close to 0.5. We solved this problem by combining the product $w_{x,l}\,\hat{r}_{xy,l}$ in Equation 13 into a single term, yielding the estimator

$$\hat{r}_{xy} = \frac{1}{2W_x}\sum_{l=1}^{L}\frac{p_b(I_{ac} + I_{ad}) + p_a(I_{bc} + I_{bd}) - 4p_a p_b}{2p_a p_b} + \frac{1}{2W_y}\sum_{l=1}^{L}\frac{p_d(I_{ac} + I_{bc}) + p_c(I_{ad} + I_{bd}) - 4p_c p_d}{2p_c p_d}, \tag{14}$$

where $W_x$ and $W_y$ are the sums of all weighting factors $w_{x,l}$ and $w_{y,l}$, respectively (see Lynch and Ritland 1999 for details). Following Toro et al. (2002), we used estimated allele frequencies in Equation 14.

Wang (2002): Using Equation 12, Wang (2002) developed an estimator that takes into account the uncertainty of estimated allele frequencies. Briefly, the approach of Wang consists of the following. First, for a single locus, joint probabilities of observing a pair of genotypes are expressed as a function of $\phi$ and $\Delta$. Subsequently, resulting expressions are solved for $\phi$ and $\Delta$, by treating genotype probabilities as known observations. Next, solutions for $\phi$ and $\Delta$ are substituted into Equation 12, giving an estimate for $r_{xy,l}$. Finally, a multilocus estimate is obtained by using weighted least squares, where weights are obtained assuming that $\phi$ and $\Delta = 0$. Further details are in Wang (2002). We implemented Wang's estimator using his Fortran code available at http://www.zoo.cam.ac.uk/ioz/software.htm.

Category 3: The estimator of Queller and Goodnight: Queller and Goodnight (1989) (Q&G) developed an estimator that was originally designed for estimating average relatedness between populations, instead of individuals. However, it can be modified to obtain a pairwise asymmetric estimator for individuals, which is commonly used nowadays (Lynch and Ritland 1999; Toro et al. 2002; Wang 2002; Milligan 2003). With Q&G, relatedness of individual $x$ with individual $y$ at locus $l$ is

$$\hat{r}_{xy,l} = \frac{0.5\,(I_{ac} + I_{ad} + I_{bc} + I_{bd}) - p_a - p_b}{1 + I_{ab} - p_a - p_b}. \tag{15}$$

A number of alternative implementations of Equation 15 are possible. We obtained relatedness by averaging the reciprocal estimates over $L$ loci,

$$\hat{r}_{xy} = \frac{\sum_{l=1}^{L}\left(\hat{r}_{xy,l} + \hat{r}_{yx,l}\right)}{2L}, \tag{16}$$

where $L$ is the number of loci. For biallelic loci, Equation 15 is undefined when individual $x$ is heterozygous, because it results in a division by zero. The Q&G estimator was therefore omitted with biallelic loci.

Simulated populations: To compare estimators, populations with several discrete generations were simulated. The following two sections describe the standard population and five alternatives. Table 2 summarizes population parameters.

TABLE 2
Simulated standard population and alternatives

                        Alternative^a
Alleles  Loci  Generations  Structure^b     Capacity^e
2        10    5            Panmictic*      100*
5        20*   10*          Structured A^c  10
2–8*     50    15           Structured B^d
10       100   20
2–18
Unique

Values for the standard population are marked with an asterisk (*).
^a Input parameters were varied one at a time, and other parameters were as in the standard population.
^b The panmictic population had 10 male and 50 female parents until generation 10. The structured population had 10 male and 50 female parents until generation 5, after which it split into two subpopulations.
^c Ninety individuals were sampled from the subpopulation bred from 10 male and 50 female parents, and 10 were sampled from a subpopulation bred from 8 male and 40 female parents.
^d Ten individuals were sampled from the subpopulation bred from 10 male and 50 female parents, and 90 were sampled from a subpopulation bred from 8 male and 40 female parents.
^e Capacity denotes the number of individuals that can be conserved.

Standard population: The standard population was panmictic and was bred from a base generation of 10 male and 50 female founders. Twenty marker loci were simulated. Each locus had a random number of alleles ($n$), ranging from 2 through 8. At each locus, alleles were sampled with a probability of $1/n$ for each allele, so that, on average, alleles at a particular locus had the same frequency in the base generation. Alleles were codominant, autosomal, unlinked, neutral, without mutation, and followed Mendelian inheritance.

Ten discrete generations of 400 individuals were bred, using random mating and selection of 10 male and 50 female individuals as parents of the next generation. The last generation consisted of 100 individuals, which were genotyped for all 20 loci. Relatedness between all pairs of individuals was estimated from the marker data, for each of the estimators described above. In addition, relatedness between individuals was calculated from the pedigree, using the tabular method (Emik and Terrill 1949), and was considered to be the true value. Finally, quality of estimators was assessed by comparing estimated with pedigree relatedness, using both statistical and diversity criteria (see below).

Alternatives: The effect of the following five variables on quality of estimators was investigated (Table 2): (a) the number of alleles per locus in the base generation; (b) the number of loci; (c) the average level of relatedness in the current generation, by varying the number of generations simulated; (d) a structured population; and (e) a limitation to the number of individuals that could be used in a conservation program, which was either all 100 or only the genetically most important 10 (see Diversity criterion). Alternative d was included to investigate quality of estimators in structured populations. The structured population had 10 male and 50 female parents until generation 5, after which it split into two subpopulations, of which one was bred with 8 male and 40 female parents and the other with 10 male and 50 female parents. Two final generations of 100 individuals were simulated, and 90 individuals were sampled from one and 10 from the other population or vice versa. Alternative e resembles the situation in practice, where conservation funds are limited. For each alternative, one parameter was varied at a time, and other parameters were as in the standard population. One hundred replicates were run per alternative, and results were averaged over replicates.

Criteria: Two types of criteria were used: (1) statistical criteria that compared estimated with pedigree relatedness and (2) a diversity criterion that measures the genetic variation conserved by using an estimator in conservation strategies.

Statistical criteria: Four statistical criteria were used: (1) the average bias, being the difference between average estimated relatedness and average pedigree relatedness (bias); (2) the regression coefficient of estimated relatedness on pedigree relatedness ($b_1$), which is a measure for bias in the estimated differences in relatedness among pairs of individuals; (3) the regression coefficient of pedigree relatedness on estimated relatedness ($b_2$), which indicates whether estimated relatedness yields an "unbiased" prediction of pedigree relatedness, which is important in practice because conservation decisions are based on the estimates, not on the true values; and (4) the correlation between estimated and pedigree relatedness ($\rho$), the square of which measures the proportion of the variance in pedigree relatedness explained by the estimator. Relatedness of individuals with themselves was excluded from the calculation of these criteria.

Diversity criterion: Although statistical criteria are informative for the quality of estimators, they do not directly reveal the amount of genetic diversity conserved by using an estimator in practice. In addition to statistical criteria, therefore, we develop a criterion that evaluates the genetic diversity conserved when selection decisions are based on estimated relatedness.

In this section we argue that relatedness is a key factor in conservation. An important aspect in conservation genetics is to minimize inbreeding levels and maximize genetic diversity (Ballou and Lacy 1995; Frankham et al. 2002). Here we interpret genetic diversity as additive genetic variance, for the following reasons. Fisher's fundamental theorem of natural selection (Fisher 1958), stating that the rate of increase in fitness equals the additive genetic variance of relative fitness, shows that adaptive potential of populations should be measured by their additive genetic variance for fitness. In random mating populations, additive genetic variance in generation $t$ for any trait equals

$$V_{A,t} = (1 - F_t)\,V_{A,0}, \tag{17}$$

where $F_t$ is the average inbreeding level in the population in generation $t$, measured relative to the base generation, and $V_{A,0}$ is the additive genetic variance in the base generation (Falconer and Mackay 1996). With random mating, the inbreeding level in the next generation, $F_{t+1}$, equals the average coancestry of the current population and thus half the average relatedness of the current population ($r = 2f$). Thus, maximizing genetic diversity and minimizing inbreeding in generation $t + 1$ is identical to minimizing relatedness in generation $t$. In conclusion, therefore, conservation decisions within a species should aim at minimizing the average additive genetic relatedness in that species. Consequently, our diversity criterion measures the efficiency of estimators when the objective is to minimize average relatedness in a group of individuals.

With random mating, average relatedness in the next generation is given by Meuwissen (1997),

$$\bar{r} = \mathbf{c}'\mathbf{A}\mathbf{c}, \tag{18}$$

where $\mathbf{c}$ is a vector of proportional contributions of individuals to the next generation, so that elements of $\mathbf{c}$ sum to one, and $\mathbf{A}$ is a matrix of additive genetic relatedness between all individuals, including relatedness of individuals with themselves. Average relatedness among parents, and thus the inbreeding level in the next generation, can be decreased or increased by varying the contributions of individuals ($\mathbf{c}$). Thus average relatedness can be minimized by finding an optimum contribution vector $\mathbf{c}_o$ that minimizes $\mathbf{c}'\mathbf{A}\mathbf{c}$, which is given by

$$\mathbf{c}_o = \frac{\mathbf{A}^{-1}\mathbf{1}}{\mathbf{1}'\mathbf{A}^{-1}\mathbf{1}} \tag{19}$$

(Meuwissen 1997; Eding et al. 2002), where $\mathbf{1}$ is a column vector of ones. The matrix of additive genetic relationships has to be estimated from marker data. The amount of genetic diversity conserved by using estimated optimal contributions ($\hat{\mathbf{c}}_o$) will depend on the estimator used. To obtain estimated optimum contributions, we substituted the matrix of pedigree relatedness by the matrix of estimated relatedness ($\hat{\mathbf{A}}$) in Equation 19. When negative contributions were obtained, the most negative contribution was set to zero and optimal contributions were recalculated, until all contributions were nonnegative. In alternative e the lowest contribution was set to zero and optimal contributions were recalculated, until all contributions were nonnegative or only 10 contributions were left.

We evaluated the result on two scales. On the first scale, the diversity criterion equals the proportion of additive genetic variance conserved relative to the base generation,

$$H_e = 1 - \tfrac{1}{2}\,\hat{\mathbf{c}}_o'\mathbf{A}\hat{\mathbf{c}}_o, \tag{20}$$

which is derived by combining Equations 17 and 18. Note that, in Equation 20, $\mathbf{A}$ refers to relatedness calculated from the pedigree. With random mating, $H_e$ equals expected heterozygosity in a population with estimated optimum contributions of individuals, expressed as a proportion of heterozygosity in the base generation. On the second scale, the diversity criterion equals the number of founders ($N_{ge}$) that would have the same average coancestry (and thus the same additive genetic variance) as the population obtained using estimated optimum contributions. Average coancestry among $N$ founders equals $1/(2N)$, so that

$$N_{ge} = \frac{1}{\hat{\mathbf{c}}_o'\mathbf{A}\hat{\mathbf{c}}_o}. \tag{21}$$

Caballero and Toro (2000) referred to $N_{ge}$ as the number of founder genome equivalents. Equation 21 is an expression on the scale of effective population size, since it equals $N_{ge} = 1/(2\bar{f})$.

In contrast to the statistical criteria, relatedness of individuals with themselves was included in $\hat{\mathbf{A}}$ and was estimated by using $y = x$ in the relevant expressions for $\hat{r}_{xy}$.

RESULTS

Comparison of estimators on the standard population: Table 3 gives results for the standard population. Average pedigree relatedness in the simulated standard population in the 10th generation was 0.282. With $f_M$ and Q&G, average estimated relatedness deviated considerably from the pedigree average, as reflected by bias. Bias depends on the definition of the base population, which is essentially arbitrary (see discussion). Bias, therefore, is not an important quality criterion and is not presented further.

TABLE 3
Comparison of estimators in the standard population

Estimator  Bias^a  b1^b  b2^c  ρ^d   He^e  Nge^f
Pedigree   0       1     1     1     0.86  3.69
fM         0.43    0.76  0.40  0.55  0.84  3.10
UCS        0.02    1.02  0.27  0.52  0.84  3.08
WCS        0.08    1.04  0.32  0.57  0.84  3.17
WEDS       0.10    0.94  0.35  0.57  0.84  3.15
L&R        0.24    1.01  0.35  0.60  0.85  3.33
Wang       0.28    1.16  0.28  0.57  0.84  3.17
Q&G        0.96    1.66  0.15  0.50  0.82  2.82

Results are of 100 replicates. Standard errors of results were ≤0.01.
^a Estimated relatedness minus pedigree relatedness.
^b The regression of estimated relatedness on pedigree relatedness.
^c The regression of pedigree on estimated relatedness.
^d The correlation between estimated relatedness and pedigree relatedness.
^e The expected heterozygosity with estimated optimum contributions, Equation 20.
^f The number of founder genome equivalents with optimum contributions, Equation 21.

The regression of estimated relatedness on pedigree relatedness ($b_1$) was close to one for most estimators, except for $f_M$ and Q&G. Results indicate a relationship between bias and $b_1$, showing that $b_1$ is underestimated when bias is positive. The $f_M$ estimator performed best for the regression of pedigree on estimated relatedness ($b_2$), but in all cases $b_2$ was substantially less than one. The correlation between estimated and pedigree relatedness ($\rho$) ranged from 0.50 (Q&G) to 0.60 (L&R), indicating that differences between estimators are relatively small. When pedigree information was known, the use of optimum contributions maintained 3.69 founder genome equivalents ($N_{ge}$). Application of the estimators maintained between 2.82 (Q&G) and 3.33 (L&R) founder genome equivalents, which are 76 and 90% of the maximum value obtained with known pedigree. When quality of estimators is judged by the correlation and the number of founder genome equivalents, the following order is obtained: L&R performs best, followed by the group of WCS, WEDS, and Wang; next comes $f_M$; then UCS; and finally Q&G.

Number of alleles: Table 4 summarizes results for different numbers of alleles per locus. The number of distinct alleles in the current generation was reduced by almost 90% when alleles in the base generation were unique, whereas no reduction was observed when the

TABLE 4 Correlation between pedigree and estimated relatedness for a varying number of alleles in base populations

No. alleles: 2 2–8 5 2–18 10 120

Average no.: 2.00 4.56 4.71 6.71 7.45 13.3 Estimator % alleles left: 100 91 94 67 75 11

fM 0.37 0.54 0.57 0.62 0.66 0.77 UCS 0.37 0.52 0.57 0.60 0.65 0.77 WCS 0.37 0.56 0.58 0.65 0.67 0.79 WEDS 0.38 0.57 0.59 0.65 0.67 0.79 L&R 0.41 0.63 0.65 0.71 0.73 0.81 Wang 0.35 0.57 0.58 0.65 0.67 0.79 Q&G —a 0.49 0.52 0.60 0.65 0.78 Results are averages of 100 replicates. Standard errors of results were #0.01. a Q&G is not applicable to biallelic loci. base generation had only 2 alleles per locus. As ex- Structured populations: Table 6 summarizes results pected, the correlation between estimated and pedigree for the structured population for two sampling schemes. relatedness increased with the number of alleles. The In scheme A, 90 individuals were sampled from the benefit of increasing the number of alleles was smaller subpopulation bred from 10 and 50 parents and 10 when there were already many alleles. On average, the from the subpopulation from 8 and 40 bred parents. correlation increased by 50% when the number alleles increased from 2 to 5, whereas the correlation increased TABLE 5 by 16% when the number of alleles increased from 5 to Comparison of estimators for a varying number of loci 10. The L&R estimator had the highest correlation for

a b c d all schemes considered. WEDS, WCS, and Wang showed Estimator b1 b2 r Nge nearly identical correlations. It follows from Tables 3 and 4 that Q&G and UCS had 10 loci Pedigree 1 1 1 3.70 f 0.76 0.23 0.42 2.86 the poorest results, whereas WCS and WEDS had nearly M WEDS 0.92 0.21 0.44 2.93 identical results. This trend was observed in all alter- L&R 1.04 0.23 0.49 3.34 natives. No further results, therefore, are presented for Wang 1.15 0.17 0.43 2.83 UCS, Q&G, and WCS. Number of loci: Table 5 summarizes results for 20 loci Pedigree 1 1 1 3.69 schemes with different numbers of loci. The number fM 0.76 0.39 0.55 3.10 of loci did not affect the regression coefficient of es- WEDS 0.94 0.35 0.57 3.17 L&R 1.03 0.39 0.63 3.50 timated relatedness on pedigree relatedness (b1). In Wang 1.16 0.28 0.57 3.06 contrast, the regression coefficient of pedigree on es- timated relatedness (b2) increased considerably when 50 loci Pedigree 1 1 1 3.67 the number of loci increased, but still deviated clearly fM 0.75 0.68 0.72 3.30 from unity. The correlation increased by 30% when WEDS 0.95 0.58 0.74 3.35 going from 10 to 20 loci, by 30% when going from 20 L&R 1.02 0.60 0.78 3.55 to 50 loci, and by 14% when going from 50 to 100 loci. Wang 1.16 0.46 0.73 3.26 The L&R estimator showed the highest correlation 100 loci Pedigree 1 1 1 3.70 and maintained the most genetic variation, whereas fM fM 0.76 0.90 0.82 3.44 showed the lowest correlation and maintained the least WEDS 0.97 0.73 0.84 3.48 genetic variation. WEDS performed slightly better than L&R 1.02 0.73 0.86 3.59 Wang, but differences were small. Wang 1.16 0.59 0.83 3.39 Average level of relatedness: As expected, an in- Results are averages of 100 replicates. Standard errors of crease in the number of generations increased pedigree results were #0.01. relatedness and decreased the number of alleles surviv- a The regression of estimated relatedness on pedigree ing from the base to the current generation. Performance relatedness. 
b of estimators decreased in correspondence with the The regression of pedigree on estimated relatedness. c The correlation between estimated relatedness and pedi- decreasing number of alleles (results not shown). Apart gree relatedness. from an effect via the number of alleles, there was no effect d The number of founder genome equivalents with opti- of the level of relatedness on performance of estimators. mum contributions, Equation 20. Estimating Relatedness From Markers 491
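The optimum-contribution calculation of Equations 19 to 21, which underlies the Nge values reported in the tables, can be sketched in a few lines. What follows is a minimal Python sketch under stated assumptions, not the authors' Fortran implementation: the function names (`solve`, `optimum_contributions`, `founder_genome_equivalents`, `expected_heterozygosity`) and the two small example matrices are hypothetical, and the relatedness matrix is assumed to be positive definite so that the linear system has a unique solution.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def solve(a: Matrix, b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def quadratic_form(c: Sequence[float], a: Matrix) -> float:
    """Return c'Ac, the average relatedness of Equation 18."""
    n = len(c)
    return sum(c[i] * a[i][j] * c[j] for i in range(n) for j in range(n))


def optimum_contributions(a_hat: Matrix) -> List[float]:
    """Equation 19, applied repeatedly: while any contribution is negative,
    the most negative one is set to zero and the rest are recalculated."""
    active = list(range(len(a_hat)))  # individuals still contributing
    while True:
        sub = [[a_hat[i][j] for j in active] for i in active]
        x = solve(sub, [1.0] * len(active))   # A^-1 1
        total = sum(x)                        # 1' A^-1 1
        c_sub = [v / total for v in x]
        worst = min(range(len(active)), key=lambda k: c_sub[k])
        if c_sub[worst] >= 0.0:
            c = [0.0] * len(a_hat)
            for idx, v in zip(active, c_sub):
                c[idx] = v
            return c
        del active[worst]


def expected_heterozygosity(c: Sequence[float], a_ped: Matrix) -> float:
    """Equation 20: He = 1 - (1/2) c'Ac, with A from the pedigree."""
    return 1.0 - 0.5 * quadratic_form(c, a_ped)


def founder_genome_equivalents(c: Sequence[float], a_ped: Matrix) -> float:
    """Equation 21: Nge = 1 / (c'Ac), with A from the pedigree."""
    return 1.0 / quadratic_form(c, a_ped)


# Hypothetical pedigree relatedness: two noninbred full sibs and one unrelated individual.
A_PED = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
# Hypothetical marker-based estimate of the same matrix, with estimation error.
A_EST = [[1.0, 0.8, 0.0],
         [0.8, 1.0, 0.5],
         [0.0, 0.5, 1.0]]

c_ped = optimum_contributions(A_PED)  # contributions from pedigree relatedness
c_est = optimum_contributions(A_EST)  # contributions from estimated relatedness
nge_ped = founder_genome_equivalents(c_ped, A_PED)
nge_est = founder_genome_equivalents(c_est, A_PED)  # evaluated on the pedigree matrix
```

As in the text, contributions derived from the estimated matrix are evaluated against the pedigree matrix, so `nge_est` can never exceed `nge_ped`; the gap between the two is what the diversity criterion measures.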

TABLE 6
Comparison of estimators in structured populations

                     Structured A^a                   Structured B^b
Estimator   b1^c   b2^d   r^e    Nge^f      b1^c   b2^d   r^e    Nge^f
Pedigree    1      1      1      4.23       1      1      1      3.82
fM          0.75   0.43   0.57   3.52       0.74   0.65   0.70   3.22
WEDS        0.90   0.38   0.58   3.58       0.86   0.54   0.68   3.25
L&R         0.88   0.39   0.59   3.84       0.65   0.47   0.55   2.96
Wang        1.15   0.31   0.59   3.45       1.13   0.42   0.69   3.13

Results are averages of 200 replicates (instead of 100). Standard errors of results were ≤0.01 (except for Nge).
a Ninety individuals were sampled from the subpopulation bred from 10 male and 50 female parents, and 10 were sampled from a subpopulation bred from 8 male and 40 female parents.
b Ten individuals were sampled from the subpopulation bred from 10 male and 50 female parents, and 90 were sampled from a subpopulation bred from 8 male and 40 female parents.
c The regression of estimated relatedness on pedigree relatedness.
d The regression of pedigree on estimated relatedness.
e The correlation between estimated and pedigree relatedness.
f The number of founder genome equivalents with optimum contributions, Equation 21. Standard errors of results were ≤0.02.
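The statistical criteria reported in the tables (bias, b1, b2, and r) are ordinary regression and correlation statistics on paired values of pedigree and estimated relatedness. The following is a minimal Python sketch of how they could be computed, assuming one value per pair of individuals; the function names `criteria` and `regress_to_mean` and the six data points are hypothetical and serve only to illustrate the algebra. The sketch also shows that shrinking estimates toward their mean by a factor b2 yields a regression of pedigree on rescaled estimates equal to one while leaving the correlation unchanged. Here the realized b2 is used for the shrinkage, whereas the paper uses a predicted b2 (Equation 22).

```python
from statistics import mean
from typing import Dict, List, Sequence


def criteria(pedigree: Sequence[float], estimated: Sequence[float]) -> Dict[str, float]:
    """Bias, b1 (estimated on pedigree), b2 (pedigree on estimated), and correlation r."""
    mp, me = mean(pedigree), mean(estimated)
    cov = sum((p - mp) * (e - me) for p, e in zip(pedigree, estimated))
    var_p = sum((p - mp) ** 2 for p in pedigree)
    var_e = sum((e - me) ** 2 for e in estimated)
    return {
        "bias": me - mp,                      # average estimated minus average pedigree
        "b1": cov / var_p,                    # regression of estimated on pedigree
        "b2": cov / var_e,                    # regression of pedigree on estimated
        "r": cov / (var_p * var_e) ** 0.5,    # correlation
    }


def regress_to_mean(estimated: Sequence[float], b2: float) -> List[float]:
    """Shrink each estimate toward the mean estimate by the factor b2."""
    me = mean(estimated)
    return [me + b2 * (e - me) for e in estimated]


# Hypothetical pedigree and estimated relatedness for six pairs of individuals.
pedigree = [0.0, 0.25, 0.5, 0.25, 0.0, 0.5]
estimated = [-0.1, 0.4, 0.7, 0.1, 0.2, 0.5]

before = criteria(pedigree, estimated)
rescaled = regress_to_mean(estimated, before["b2"])
after = criteria(pedigree, rescaled)
```

Because b1 × b2 = r² holds for any data set, forcing b2 to one necessarily drives b1 down to r², which is the trade-off between the two definitions of unbiasedness discussed in the paper.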

In scheme B, the sampling of individuals was reversed. With scheme A, average relatedness was 0.26 and average inbreeding was 0.13. On average, correlations between estimated and pedigree relatedness had the same level as in the standard population. When judged by the correlation, estimators performed equally well. When judged by the number of founder genome equivalents, however, Wang showed poorest results and L&R highest, indicating that ranking of estimators depends on the criterion used. With scheme B, average relatedness was 0.38 and average inbreeding was 0.19. In contrast to the panmictic standard population and scheme A, L&R had the lowest correlation and lowest founder genome equivalents, whereas WEDS had the highest founder genome equivalents.

Use in conservation: Table 7 shows the number of founder genome equivalents conserved in sets of either 10 or all 100 individuals, having either optimal or equal contributions of individuals, and for the panmictic standard population or a structured population.

TABLE 7
Number of founder genome equivalents in sets of either 10 or 100 individuals, in a panmictic or structured population

                       Panmictic          Structured A^a      Structured B^b
Individuals in set     100      10        100      10         100      10
Pedigree               3.69     3.17      4.23     3.52       3.82     3.39
Equal                  3.56^c   2.78^d    3.85^c   2.89^d     2.61^c   2.17^d
fM                     3.12     2.88      3.52     3.20       3.22     3.03
WEDS                   3.17     2.90      3.58     3.21       3.25     3.02
L&R                    3.51     2.91      3.84     3.13       2.96     2.62
Wang                   3.07     2.87      3.45     3.18       3.13     2.97

Results are averages of 200 replicates (instead of 100). Standard errors of results were ≤0.02.
a Ninety individuals were sampled from the subpopulation bred from 10 male and 50 female parents, and 10 were sampled from a subpopulation bred from 8 male and 40 female parents.
b Ten individuals were sampled from the subpopulation bred from 10 male and 50 female parents, and 90 were sampled from a subpopulation bred from 8 male and 40 female parents.
c All 100 individuals have equal contributions to the set.
d Ten random individuals have equal contributions to the set.

In the standard population, the number of founder genome equivalents conserved using optimal contributions calculated from pedigree relatedness was only a little higher than when using equal contributions (3.69 vs. 3.56). In standard populations, variation in relatedness among pairs of individuals is relatively small, and the benefit of using optimum contributions is limited when the set contains all individuals. Surprisingly, when all 100 individuals were included, the use of optimum contributions based on estimated relatedness conserved fewer founder genomes than equal contributions did. Hence, conservation strategies based on estimated relatedness of limited accuracy can actually reduce the genetic variation conserved, instead of increasing it. When sets consisted of only 10 individuals, sets of optimum contributions always had higher founder genome equivalents than sets with equal contributions.

In the structured population, the number of founder genome equivalents conserved using optimal contributions calculated from pedigree relatedness was higher than when using equal contributions with scheme A (4.23 vs. 3.85) and substantially higher with scheme B (3.82 vs. 2.61), indicating that optimizing contributions is more important in structured than in standard populations. When only 10 individuals were included in the set, the use of optimal contributions calculated from estimated relatedness always conserved more founder genomes than the use of equal contributions. Scheme B always conserved more founder genomes, irrespective of the number of individuals in the set, illustrating the importance of the sampling procedure.

Differences between estimators are in agreement with results in Tables 3–6. The L&R estimator performed best in the standard population, whereas fM and WEDS performed best in the structured population.

DISCUSSION

We investigated quality of relatedness estimators in simulated populations with many generations of pedigree. The estimators UCS and Q&G showed lowest accuracy. Differences among fM, WCS, WEDS, Wang, and L&R were relatively small, and ranking of estimators depended on the population structure. In contrast to previously published results (Wang 2002), the L&R estimator clearly performed better than the Wang estimator in panmictic populations. The WEDS and fM estimators performed best in structured populations. The difference between UCS and WCS showed that weighting the impact of loci plays a significant role in relatedness estimation. Average level of relatedness in the population did not affect quality of estimators.

When interest is not in conservation, but merely in point estimates for relatedness between pairs of individuals, quality of estimators may be judged by the correlation between true and estimated relatedness. When judged by the correlation, L&R performs best in panmictic populations and WEDS, fM, and Wang in structured populations. Fernandez et al. (2005) argued that minimizing simple molecular coancestry (fM) is the optimum way to maximize diversity, which would imply that other relatedness estimators are redundant. Our results show, however, that there is a clear benefit of using more sophisticated relatedness estimators (see, e.g., L&R vs. fM in Table 5). In structured populations, sets of estimated optimum contributions had in most cases more diversity than sets of equal contributions of individuals. Surprisingly, in panmictic populations, sets of optimum estimated contributions sometimes had less diversity than sets with equal contributions of individuals, showing that estimates of relatedness can be useful in conservation programs, but should be used with caution.

L&R vs. Wang: In contrast to results presented by Wang (2002), the L&R estimator performed consistently better than the Wang estimator in panmictic populations, irrespective of the numbers of alleles and loci. We identified three reasons for this discrepancy:

1. We have used a modified version of the L&R estimator that avoids the numerical rounding errors that may occur when $p_a = p_b = 0.5$ or when estimates "blow up" when they approach this value. Wang (2002) noted this problem, but did not correct for it. We observed that the L&R estimator improved considerably when calculating the product of relatedness and weight in a single step (Equation 14). However, with the exception of biallelic loci, L&R also performed better than Wang when relatedness and the weight were calculated separately, indicating that this cannot be the only source of differences.
2. Wang (2002) presented results only for close relatives of a single type at a time, nonrelatives, full-sibs, or half-sibs. In reality, however, pedigree relatedness is unknown so that it is impossible to a priori distinguish between different types of relatives. Pedigree relatedness will take many distinct values, since all real populations have many generations of pedigree. It is not possible, therefore, to judge performance of the Wang estimator in general populations from results presented in Wang (2002). In our study, we considered populations with general relationships and evaluated estimators by the correlation between pedigree and estimated relatedness, without a priori distinguishing between categories of relatives.
3. Wang (2002) observed that the L&R estimator performed better for "unrelated" individuals (i.e., not sibs or parent–offspring), which will be the majority even in small populations.

In contrast to current belief, therefore, we find the L&R estimator to be superior to the Wang estimator in panmictic populations. Furthermore, the L&R estimator is substantially simpler.

Bias and base population: For most estimators, average estimated relatedness differed substantially from average pedigree relatedness, but the difference (bias) was unrelated to the accuracy (r) of estimators, illustrating that the choice of a base population is arbitrary in a panmictic population. Bias depends on the way estimators define the base population or, in other words, on how they divide the average observed similarity into a proportion due to IBD vs. a proportion due to AIS. The L&R and Wang estimators set the probability of AIS equal to the expected homozygosity in the current population, which implicitly defines the current generation as the base population. Average estimated relatedness is therefore close to zero for Wang and L&R, bias is negative, and many estimates are negative. Negative estimates may seem confusing, because relatedness equals twice the probability that alleles are IBD, which cannot be negative by definition. However, negative estimates can easily be scaled to positive values using an equation similar to Equation 8, which solves the interpretation problem (see also Eding and Meuwissen 2003). Alternatively, relatedness may be interpreted as a measure of additive genetic covariance between individuals, in which case below average values indicate individuals with dissimilar breeding values.

Regression of estimated relatedness on pedigree relatedness: The regression coefficient of estimated relatedness on pedigree relatedness (b1) is a measure of bias. Unbiasedness, i.e., $E[\hat{r}_{xy} \mid r_{xy}] = r_{xy}$, requires that b1 = 1. (Note that the criterion bias refers to average relatedness,

whereas b1 refers to pairs of individuals.) There was a clear relationship between bias and b1; positive bias was accompanied by underestimation of b1 (Table 3). This result is due to the population genetic relationship between absolute differences among coancestries of pairs of individuals and the average coancestry level of a population. Equation 3 illustrates this phenomenon. Positive bias, i.e., overestimation of $s_l$, reduces absolute differences between coancestries because similarities are scaled by $1 - s_l$. Alternatively, the relationship can be understood by considering coancestry as a function of generation number (t), $f_t = 1 - (1 - \Delta f)^t$, which is a function starting at zero at t = 0 and asymptoting to 1 when $t \to \infty$ (Falconer and Mackay 1996). At low values of $f_t$ the function is steep and differences in coancestries within a generation are large, whereas at high $f_t$ the function is flat and differences are small. Thus the relationship between bias and b1 is a direct consequence of standard population genetic theory, and estimators that are consistent with population genetic theory will always show this relationship.

Regression of pedigree on estimated relatedness: The regression coefficient of pedigree on estimated relatedness (b2) may be interpreted as the reciprocal of a usual measure of unbiasedness, i.e., $E[r_{xy} \mid \hat{r}_{xy}] = \hat{r}_{xy}$ requires that b2 = 1. (Note that $r_{xy}$ is treated as a random variable here.) In conservation practice, selection of breeding individuals relies on estimated relatedness; pedigree relatedness is unknown. To avoid overestimation of the genetic diversity conserved, it is important that estimated relatedness is an unbiased predictor of pedigree relatedness, which requires that b2 = 1.

However, $b_2 = \mathrm{Cov}(r, \hat{r})/\mathrm{Var}(\hat{r}) = 1$ requires that $\sigma_{\hat{r}} = \rho\sigma_r$, where $\rho$ is the correlation between pedigree and estimated relatedness (denoted r in the tables), indicating that estimates should have lower variance than pedigree values. As a consequence, $b_1 = \mathrm{Cov}(r, \hat{r})/\mathrm{Var}(r) = \rho\sigma_{\hat{r}}/\sigma_r = \rho^2$. Therefore, when b2 = 1, b1 must equal the square of the correlation between pedigree and estimated relatedness. Consequently, irrespective of the estimator used, b1 = b2 = 1 can be attained only when $\rho = 1$, which requires data on many loci. All estimators had values for b2 substantially <1 (Table 3), indicating that the amount of genetic diversity conserved will be overestimated when selecting least-related individuals on the basis of estimated relatedness.

To investigate the effect of b2 on the number of founder genome equivalents conserved, we rescaled relatedness estimates to obtain b2 = 1. First, we derived an empirical relationship between b2 and the amount of information and next regressed estimated relatedness to its mean, using predicted b2. The empirical prediction of b2 was

$$\hat{b}_2 = 0.079\,[\ln(\text{no. loci}) - 1]\,[\ln(\text{no. alleles}) + 1.22]. \tag{22}$$

For the WEDS estimator, Equation 22 explained 99% of the variation in b2 observed in the schemes analyzed (Table 2). We regressed relatedness estimates to their mean using $\hat{r}^{*}_{xy} = \bar{\hat{r}}_{xy} + \hat{b}_2(\hat{r}_{xy} - \bar{\hat{r}}_{xy})$, which was applied separately to relatedness between individuals and to relatedness of individuals with themselves. Finally, Nge was calculated using $\hat{r}^{*}_{xy}$ instead of $\hat{r}_{xy}$. Results showed a clear increase in Nge, in particular in the panmictic population with a conservation capacity of 100 individuals (Table 8 vs. Table 7). Furthermore, as indicated by the Nge values for equal vs. estimated optimal contributions, the use of $\hat{r}^{*}_{xy}$ almost completely removed the loss of diversity that occurred when using $\hat{r}_{xy}$ with limited accuracy. Those results show that, when conservation decisions are based on estimated relatedness, the reverse of unbiasedness, i.e., $E[r_{xy} \mid \hat{r}_{xy}] = \hat{r}_{xy}$, may be more important than the usual definition, $E[\hat{r}_{xy} \mid r_{xy}] = r_{xy}$. Regression of relatedness estimates to their mean will be particularly relevant when the amount of marker information differs between individuals, in which case individuals with little information would be selected too often because they have higher variance of their estimates.

TABLE 8
Number of founder genome equivalents in sets of either 10 or 100 individuals, in a panmictic or structured population after a priori b2 correction

                       Panmictic                Structured A^a            Structured B^b
Individuals in set     100         10           100         10            100         10
Pedigree               3.69        3.17         4.23        3.52          3.82        3.39
Equal                  3.56        2.78         3.85        2.89          2.61        2.17
fM                     3.47 (11)   2.93 (2)     3.93 (12)   3.24 (1)      3.45 (7)    3.09 (2)
WEDS                   3.49 (10)   2.93 (1)     3.95 (10)   3.25 (1)      3.41 (5)    3.06 (1)
L&R                    3.58 (2)    2.93 (1)     3.93 (2)    3.19 (2)      2.89 (2)    2.69 (3)
Wang                   3.45 (12)   2.91 (1)     3.92 (14)   3.23 (2)      3.36 (7)    3.00 (1)

Results are averages of 200 replicates (instead of 100). Standard errors of results were ≤0.02. Numbers in parentheses show percentage change due to b2 correction.
a Ninety individuals were sampled from the subpopulation bred from 10 male and 50 female parents, and 10 were sampled from a subpopulation bred from 8 male and 40 female parents.
b Ten individuals were sampled from the subpopulation bred from 10 male and 50 female parents, and 90 were sampled from a subpopulation bred from 8 male and 40 female parents.

As expected, the correlation between pedigree and estimated relatedness was not affected by regressing estimates to the mean. Consequently, for the purpose of conservation, the correlation between pedigree and estimated relatedness is not the optimal criterion for quality of an estimator, since results in Table 8 are clearly better than those in Table 7. A criterion such as the number of founder genome equivalents, which directly reflects the amount of diversity conserved, is to be preferred for conservation purposes.

Although Equation 22 was obtained using the WEDS estimator, results in Tables 3, 5, and 6 show that the relationship between b2 and the numbers of alleles and loci is nearly identical for the L&R estimator and very similar for fM. Equation 22 is, therefore, not restricted to the WEDS estimator, but useful in general. Equation 22 is a simple but rather crude two-step method to regress estimates to their mean value depending on the amount of information. A statistically more appropriate method is to treat relatedness as a random, instead of fixed, variable when estimating relatedness. However, such models involve the estimation of the variance of relatedness, which may not be trivial.

Diversity criterion with nonrandom mating and selection: The diversity criterion used in this study relates to the additive genetic variance in an unselected random-mating population; i.e., $1 - \tfrac{1}{2}c'Ac$ equals the additive genetic variance in the sampled population, expressed as a proportion of that in the founder population, assuming that the sampled population is generated by random mating and that there has been no selection between the founder and current generation. Most actual populations, however, undergo either natural or artificial selection and show nonrandom mating, which raises questions about the utility and generality of our criterion. In our opinion, however, the additive genetic variance under random mating and no selection is a useful measure for diversity, also when the actual population is selected or shows nonrandom mating. The reasoning is as follows. By definition, the additive genetic variance is the variance of the breeding values. In diploids, this variance is composed of two components: (1) the additive genetic variance with Hardy–Weinberg and linkage equilibrium, sometimes referred to as the genic variance (Wei et al. 1996), which depends solely on the allele frequencies, and (2) a deviation from the genic variance that depends on the way in which alleles at all loci are combined within individuals. This deviation is due to nonrandom mating causing deviations from Hardy–Weinberg equilibrium and to mutation, selection, and drift causing linkage disequilibrium. Part of the total linkage disequilibrium is generated by selection in the short term and is not related directly to linkage, but occurs between any two loci affecting the selected trait. It is, therefore, also known as gametic-phase disequilibrium (Bulmer 1971).

In principle, deviations from Hardy–Weinberg and gametic-phase equilibrium are transient, in contrast to changes in allele frequency and linkage disequilibrium due to tight linkage. With two sexes, Hardy–Weinberg equilibrium is restored in two generations of random mating. Positive deviations from Hardy–Weinberg equilibrium, e.g., due to obligatory selfing, increase the additive genetic variance, but it is unclear what value to attribute to such additional variance, since utilization of it involves between-family selection causing rapid loss of diversity. Furthermore, when selection ceases, the gametic-phase disequilibrium asymptotes quickly to zero (Bulmer 1971). Although natural selection will never cease, it probably generates little gametic-phase disequilibrium because components of fitness have low heritability. Hence, gametic-phase disequilibrium is mainly a phenomenon of artificial selection. Thus, in the long run, it is mainly the genic variance that represents true genetic diversity originating from the allelic variety. Transient components of the additive genetic variance should not be included in a diversity criterion. In our opinion, therefore, a diversity criterion based on additive genetic variance in an idealized population is still useful when real populations deviate from that situation.

Population structure: Populations in need of conservation predominantly have fragmented structures. In agriculture, species are generally composed of breeds and relatedness within breeds is much higher than that between breeds. Within rare breeds, fragmentation (over different countries, for example) is common as well (FAO 2000). This is logical because many domestic species are kept in herds and breeding programs are often organized nationally. Similarly, populations in zoos frequently descend from groups of founders derived from different locations (see EAZA in Situ Conservation Database: http://www.eaza.net). Furthermore, human-induced habitat loss and fragmentation are recognized as the primary causes of loss of biodiversity (Ballou and Lacy 1995; Frankham et al. 2002). Hence, structured populations are the rule and panmictic populations the exception.

Results of the structured population with scheme B in Table 6 and results of Toro et al. (2002) indicate that estimators based on two- and four-gene coefficients of identity are sensitive to the population structure. This result is probably because the basic relationship underlying these estimators (Equation 12) is valid only in the absence of inbreeding. For example, the maximum value for relatedness in Equation 12 is 1, whereas in inbred populations, relatedness of an individual with itself is 1 + F, which has a maximum of 2. Furthermore, in the derivation in Wang (2002), the genotype pairs AiAi–AiAi and AiAj–AiAj are grouped into a single category that has a similarity value of 1 (according to the definition in Wang 2002), which is correct on the basis of Equation 12. For example, in the hypothetical situation that founder alleles are unique, both genotypes have $\phi = 0$, $\Delta = 1$, and r = 1. However, from a population genetic point of view, those genotype pairs are clearly different: AiAi–AiAi has f = 1 and r = 2, whereas AiAj–AiAj has f = 1/2 and r = 1.

When noting that the basic equation underlying the L&R and the Wang estimators is invalid with inbreeding, the good performance of those estimators in inbred panmictic populations seems surprising at first. However, as argued above, the definition of a base population is arbitrary with a panmictic population. Occurrence of inbreeding, therefore, does not present a problem with random mating, because inbreeding coefficients can be shifted to approximately zero by redefining the base population. The L&R and Wang estimators "remove" inbreeding by determining the probability that alleles are AIS on the basis of observed allele frequencies, which defines the base population to be equal to the current population. The same would happen if $s_l$ in Equation 3 were set to the currently expected homozygosity, as in Ritland (1996). In a structured population, however, removing inbreeding by using the current population as the base population causes negative (true) coancestries between individuals in different subpopulations, which is theoretically incorrect. In structured populations, therefore, inbreeding cannot be removed by shifting the base population. We believe that the inability to fully remove inbreeding is the basis of the poor performance of the L&R estimator in scheme B of the structured population. The high number of founder genome equivalents of L&R with scheme A in Table 7 is probably because scheme A resembles a panmictic population, since the low number of sampled individuals from the high-drift population is in balance with a low contribution of diversity in this sample. The b2 correction hardly improves founder genome equivalents for L&R, whereas it improves all other estimators (Table 8).

Our results show that benefits of using relatedness estimates in conservation programs are substantially larger in structured than in panmictic populations (Table 7; Fernandez et al. 2005). What is needed in practice, therefore, is an estimator that can be applied to general populations. The WEDS estimator is based on: (1) the relationship between relatedness and coancestry (r = 2f) and (2) the relationship between molecular similarity and coancestry (Equation 2a). Both relationships are valid irrespective of the population structure (Lynch 1988; Falconer and Mackay 1996) and provide the theoretical basis for an estimator of both within- and between-population relatedness (Eding and Meuwissen 2001).

We obtained the WEDS estimator using a simple statistical approach, in which expected similarity was equated to observed similarity (Equation 3). Good results of the L&R estimator in panmictic populations indicate that estimators can be improved by using a more advanced statistical approach, such as conditional probabilities of observing genotypes, rather than similarity values. Hence, a promising approach to develop an estimator that performs better in both panmictic and structured populations is to follow the statistical approach of Lynch and Ritland (1999), but using r = 2f and the similarity definition of Equation 1 as a starting point (see also Toro et al. 2002).

Use in conservation practice: The benefit of optimal contributions based on relatedness estimators in conservation programs depends on the population structure and on the breeding capacity available for conservation, e.g., expressed as the size of a population that can be conserved (N). The use of optimal contributions based on relatedness estimates that are regressed to their mean always maintains more diversity than applying equal contributions (Table 8), when populations are structured, which is common for populations in need of conservation. Even when they are panmictic and there is a limited breeding capacity (e.g., N = 10), optimal contributions based on relatedness estimates are beneficial. With panmictic populations and large capacity, it is equally good or better to use equal instead of optimal contributions (Table 8, N = 100), although this is seldom the case in conservation.

When using Equation 22 to regress estimates to their mean value, fM and WEDS are overall the best estimators. They are robust with respect to population structure, which is important when it is unknown whether the population is truly panmictic.

More information and partial Fortran code can be found at http://www.geneticdiversity.net/estimators.html.

We thank Sipke Joost Hiemstra of the Centre for Genetic Resources (CGN) for his thorough comments on previous versions of this manuscript. We also thank Jinliang Wang for making the Fortran code of his estimator available on the internet. This work was financed by the Ministry of Agriculture, Nature and Food Quality through the CGN, The Netherlands.

LITERATURE CITED

Ballou, J. D., and R. C. Lacy, 1995 Identifying genetically important individuals for management of genetic variation in pedigreed populations, pp. 76–111 in Population Management for Survival and Recovery: Analytical Methods and Strategies in Small Population Conservation, edited by J. D. Ballou, M. E. Gilpin and T. J. Foose. Columbia University Press, New York.
Blouin, M. S., 2003 DNA-based methods for pedigree reconstruction and kinship analysis in natural populations. Trends Ecol. Evol. 18: 503–511.
Bulmer, M. G., 1971 The effect of selection on genetic variability. Am. Nat. 105: 201–211.
Caballero, A., and M. A. Toro, 2000 Interrelations between effective population size and other pedigree tools for the management of conserved populations. Genet. Res. 75: 331–343.
Eding, H., and T. H. E. Meuwissen, 2001 Marker-based estimates of between and within population kinships for the conservation of genetic diversity. J. Anim. Breed. Genet. 118: 141–159.
Eding, H., and T. H. E. Meuwissen, 2003 Linear methods to estimate kinships from genetic marker data for the construction of core sets in genetic conservation schemes. J. Anim. Breed. Genet. 120: 289–302.

Eding, H., R. Crooijmans, M. A. M. Groenen and T. H. E. Meuwissen, 2002 Assessing the contribution of breeds to genetic diversity in conservation schemes. Genet. Sel. Evol. 34: 613–633.
Emik, L. O., and C. E. Terrill, 1949 Systematic procedures for calculating inbreeding coefficients. J. Hered. 40: 51–55.
Falconer, D. S., and T. F. C. Mackay, 1996 Introduction to Quantitative Genetics. Longmans Green, Harlow, Essex, UK.
FAO, 2000 World Watch List for Domestic Animal Diversity. FAO Publications, Rome.
Fernandez, J., B. Villanueva, R. Pong-Wong and M. A. Toro, 2005 Efficiency of the use of pedigree and molecular marker information in conservation programs. Genetics 170: 1313–1321.
Fisher, R. A., 1958 The Genetical Theory of Natural Selection. Dover, New York.
Frankham, R., J. D. Ballou and D. A. Briscoe, 2002 Introduction to Conservation Genetics. Cambridge University Press, Cambridge, UK.
Hamilton, W. D., 1964 The genetical evolution of social behaviour. II. J. Theor. Biol. 7: 17–52.
Henderson, C. R., 1984 Application of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario, Canada.
Jacquard, A., 1983 Heritability: one word, three concepts. Biometrics 39: 465–477.
Li, C. C., and D. G. Horvitz, 1953 Some methods of estimating the inbreeding coefficient. Am. J. Hum. Genet. 5: 107–117.
Lynch, M., 1988 Estimation of relatedness by DNA fingerprinting. Mol. Biol. Evol. 5: 584–599.
Lynch, M., and K. Ritland, 1999 Estimation of pairwise relatedness with molecular markers. Genetics 152: 1753–1766.
Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.
Malécot, G., 1948 Les Mathématiques de l'Hérédité. Masson, Paris.
Meuwissen, T. H. E., 1997 Maximizing the response of selection with a predefined rate of inbreeding. J. Anim. Sci. 75: 934–940.
Milligan, B. G., 2003 Maximum-likelihood estimation of relatedness. Genetics 163: 1153–1167.
Nicholas, F. W., and C. Smith, 1983 Increased rates of genetic change in dairy cattle by embryo transfer and splitting. Anim. Prod. 36: 341–353.
Queller, D. C., and K. F. Goodnight, 1989 Estimating relatedness using genetic markers. Evolution 43: 258–275.
Ritland, K., 1996 Estimators for pairwise relatedness and individual inbreeding coefficients. Genet. Res. 67: 175–186.
Toro, M., C. Barragan, C. Ovilo, J. Rodriganez, C. Rodriguez et al., 2002 Estimation of coancestry in Iberian pigs using molecular markers. Conserv. Genet. 3: 309–320.
Toro, M. A., C. Barragan and C. Ovilo, 2003 Estimation of genetic variability of the founder population in a conservation scheme using microsatellites. Anim. Genet. 34: 226–228.
Wang, J. L., 2002 An estimator for pairwise relatedness using molecular markers. Genetics 160: 1203–1215.
Wei, M., A. Caballero and W. G. Hill, 1996 Selection response in finite populations. Genetics 144: 1961–1974.
Wright, S., 1978 Evolution and the Genetics of Populations. University of Chicago Press, Chicago.

Communicating editor: D. Houle

APPENDIX

From Equation 6, weights are $w_l^{-1} = \mathrm{Var}(S_{xy,l})/(1 - \hat{s}_l)^2$. For each locus $l$, we find the variance of the observed similarity as $\mathrm{Var}(S_l) = E(S_l^2) - E(S_l)^2$, where $E(S_l) = \sum_{q=1}^{4} \Pr(S = S_q)\,S_q$ and $E(S_l^2) = \sum_{q=1}^{4} \Pr(S = S_q)\,S_q^2$, where $S_q$ takes values 0, 1/4, 1/2, and 1 for $q$ = 1–4, and $\Pr(S = S_q)$ denotes the probability of observing $S = S_q$ on locus $l$, which depends on the allele frequencies. (We drop subscript $l$ for brevity.) Let $A_i$–$A_k$ denote distinct alleles, $p_i$ be the frequency of allele $A_i$ in the current population, and $n$ be the number of alleles at locus $l$. Then, for each locus:

Category $S = 1$ consists of the genotypes $A_iA_i$–$A_iA_i$, so that

$$\Pr(S = 1) = \sum_{i=1}^{n} p_i^4. \tag{A1}$$

Category $S = \tfrac{1}{2}$ consists of the genotypes $A_iA_i$–$A_iA_j$ and $A_iA_j$–$A_iA_j$, so that

$$\Pr(S = \tfrac{1}{2}) = 4 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( p_i^3 p_j + p_i p_j^3 + p_i^2 p_j^2 \right). \tag{A2}$$

Category $S = \tfrac{1}{4}$ consists of the genotypes $A_iA_j$–$A_iA_k$, so that

$$\Pr(S = \tfrac{1}{4}) = 8 \sum_{i=1}^{n-2} \sum_{j=i+1}^{n-1} \sum_{k=j+1}^{n} \left( p_i^2 p_j p_k + p_i p_j^2 p_k + p_i p_j p_k^2 \right). \tag{A3}$$

Note that $S = \tfrac{1}{4}$ requires at least three distinct alleles. Finally,

$$\Pr(S = 0) = 1 - \Pr(S = \tfrac{1}{4}) - \Pr(S = \tfrac{1}{2}) - \Pr(S = 1). \tag{A4}$$

Substitution of the probabilities in the above equations for $E(S_l)$, $E(S_l^2)$, $\mathrm{Var}(S_l)$, and $w_l^{-1}$ gives the weights for locus $l$.
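The calculation described in this Appendix can be illustrated with a minimal Python sketch. It is not the authors' Fortran code: the function names (`similarity_probabilities`, `similarity_moments`, `inverse_weight`) are hypothetical, and the sketch assumes that allele frequencies sum to 1 and that the two individuals compared are unrelated and drawn from a randomly mating population. The value of the estimated base-population similarity (`s_hat`) must be supplied by the user; setting it to the current expected homozygosity in the usage lines is an assumption made only for illustration.

```python
from itertools import combinations


def similarity_probabilities(p):
    """Return Pr(S = s) for s in {0, 1/4, 1/2, 1} at one locus.

    p: allele frequencies at the locus (assumed to sum to 1).
    """
    # Equation A1: both individuals homozygous for the same allele.
    pr_one = sum(pi ** 4 for pi in p)
    # Equation A2: sum over unordered pairs of distinct alleles (i < j).
    pr_half = 4 * sum(
        pi ** 3 * pj + pi * pj ** 3 + pi ** 2 * pj ** 2
        for pi, pj in combinations(p, 2)
    )
    # Equation A3: sum over unordered triples of distinct alleles (i < j < k);
    # the sum is empty, hence zero, with fewer than three alleles.
    pr_quarter = 8 * sum(
        pi ** 2 * pj * pk + pi * pj ** 2 * pk + pi * pj * pk ** 2
        for pi, pj, pk in combinations(p, 3)
    )
    # Equation A4: the remaining probability mass.
    pr_zero = 1 - pr_quarter - pr_half - pr_one
    return {0.0: pr_zero, 0.25: pr_quarter, 0.5: pr_half, 1.0: pr_one}


def similarity_moments(p):
    """Return (E(S), Var(S)) with Var(S) = E(S^2) - E(S)^2."""
    probs = similarity_probabilities(p)
    mean = sum(pr * s for s, pr in probs.items())
    second_moment = sum(pr * s ** 2 for s, pr in probs.items())
    return mean, second_moment - mean ** 2


def inverse_weight(p, s_hat):
    """Return w^-1 = Var(S) / (1 - s_hat)^2 for one locus."""
    _, variance = similarity_moments(p)
    return variance / (1 - s_hat) ** 2


# Usage: a locus with two equally frequent alleles.
freqs = [0.5, 0.5]
mean, variance = similarity_moments(freqs)
print(mean, variance)  # 0.5 0.0625
# Assumption for illustration only: s_hat equals current expected homozygosity.
s_hat = sum(f ** 2 for f in freqs)
print(1 / inverse_weight(freqs, s_hat))  # 4.0
```

In this sketch the expected similarity equals the expected homozygosity (the sum of squared allele frequencies), which provides a simple check on Equations A1–A4 for any set of allele frequencies.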