How Many Alleles Per Locus Should Be Used to Estimate Genetic Distances?

Heredity (2002) 88, 62–65 2002 Nature Publishing Group All rights reserved 0018-067X/02 $25.00 www.nature.com/hdy How many alleles per locus should be used to estimate genetic distances? ST Kalinowski Conservation Biology Division, Northwest Fisheries Science Center, National Marine Fisheries Service, N.O.A.A., Seattle, WA 98112, USA As more microsatellite loci become available for use in gen- results could be achieved by examining either a few loci with etic surveys of population structure, population geneticists many alleles or many loci with a few alleles. More specifi- are able to select loci to use in population structure surveys. cally, the total number of independent alleles appears to be This study used computer simulations to investigate how the a good indicator of how precise estimates of genetic dis- number of alleles at loci affects the precision of estimates of tance will be. four common genetic distances. This showed that equivalent Heredity (2002) 88, 62–65. DOI: 10.1038/sj/hdy/6800009 Keywords: genetic distance; estimation; alleles; sampling; population genetics Introduction Edwards, 1967), the standard genetic distance of Nei, DS (Nei, 1972, 1978), and the Weir and Cockerham estimator Analysis of selectively neutral molecular markers has ␪ of FST, (Weir and Cockerham, 1984). become a common method for inferring the evolutionary Each of these genetic distances has unique evolution- history of populations and species. Technological ary and statistical properties (see Nei, 1987 for a review). advances have enabled population and conservation gen- Given a few simple assumptions, such as random mating eticists to describe increasingly complex and subtle gen- and constant population size, the genetic distance etic relationships. However, molecular data remains between two halves of a population instantly split into expensive and many population level studies are limited two completely isolated new populations will initially by the amount of data that can be collected. Efficient increase linearly with time. For example, D between two study design remains important in order to maximize the S such population fragments will be equal to ability of genetic data to describe genetic relationships between populations. Increasingly often, researchers have the luxury of selecting a set of loci to use in a study D Ϸ 2␮t from a larger number of loci that have been previously S characterized. This is particularly true for microsatellite loci. One of the most common criteria used to select loci where t is the number of generations the populations is the number of alleles present. I shall call this the allele have been isolated, and ␮ is the mutation rate of the loci number of a locus. Loci with more alleles are generally examined. In this case, F increases approximately lin- thought to produce more precise estimates of genetic dis- ST early with time, provided t is small tances than loci with few alleles, especially for closely related populations. However, loci with a large number Ϸ t of alleles can be difficult to score and take up more space FST on electrophoretic gels. This space issue is usually not a 2Ne problem when only one locus is run per gel, but it becomes a critical consideration when multiple loci are where Ne is the effective size of each population. The run on each gel. Unfortunately, there have been few expectations of DA and DC have not been expressed as guidelines for balancing these two contradictory con- functions of isolation time or other evolutionary vari- cerns. In this investigation, I examine the relationship ables, however they will initially increase linearly with between allele number and the coefficient of variation of time. All of these genetic distances are equal to zero when four popular genetic distances: the DA distance (Nei et al, populations have the same allele frequencies. The 1983), the chord distance, DC (Cavalli-Sforza and maximum value of DA, DC, and FST is equal to 1.0. DS will have a value of infinity when two populations do not share any alleles. The rate at which DA, DC, and DS Correspondence: S Kalinowski, Conservation Biology Division, Northwest increase with time is proportional to mutation rate. Fisheries Science Center, National Marine Fisheries Service, National Therefore, each of these three genetic distances is Oceanic and Atmospheric Administration, 2725 Montlake Boulevard expected to be higher at loci with many alleles than loci East, Seattle, WA 98112, USA. E-mail: [email protected] Received 3 April 2001; accepted 6 August 2001 with fewer alleles. How many alleles? ST Kalinowski 63 Methods of alleles at the locus minus one, and the total number of independent alleles is the sum of the number of inde- An analytical evaluation of the coefficient of variation for pendent alleles at each locus). For example, 16 loci each most genetic distances would be formidably complex, so having two independent alleles had approximately the I have relied on computer simulation to examine a simple same coefficient of variation for each of the genetic dis- model of population bifurcation (a similar simulation tances as two loci having 16 independent alleles each. The program available for general use is described by ratio t/Ne appeared to determine the relationship Excoffier et al, 2000). In this model, a randomly mating between number of alleles, number of loci, and the coef- population having an effective population size of Ne dip- ficient of variation. For example, the coefficient of vari- loid individuals is instantaneously split into two com- ation of genetic distances for populations of 500 individ- pletely isolated populations, each also having Ne individuals separated for 50 generations was identical to the uals. The populations are assumed to remain completely coefficient of variation of estimates of genetic distances isolated until samples are collected t generations after between populations of 5000 individuals separated for fragmentation. Gene trees for each locus were simulated 500 generations. using standard coalescent techniques (see Hudson, 1990 These results are in good agreement with an analysis for review). The mutation rate for each locus was of the Sanghvi (1953) genetic distance made by Foulley − obtained by selecting a number, u, from the interval [ 7, and Hill (1999). The Sanghvi distance is not used often − u 2] and using 10 as the mutation rate for that locus. Two for describing population structure, but it has tractable mutational models were used: infinite alleles and single mathematical properties and has been shown to estimate step mutation. phylogenies effectively (Takezaki and Nei, 1996). Foulley The infinite alleles model of mutation assumes that and Hill showed analytically that the coefficient of vari- each mutation is unique. The single step model of ation of the Sanghvi distance is approximately pro- mutation assumes that each allele can be represented as portional to the sum of the number of independent alleles a sum, and that each mutation either adds or subtracts at each locus in the sample. one from that sum. Mutation increasing the number of When divergence time was substantial, ie t/Ne was 1.0 repeat units was assumed to be as likely as mutation or greater, the relationship between the coefficient of decreasing the number of repeat units. All mutations variation of genetic distances and the number of inde- were assumed to change the number of repeat units by pendent alleles observed at small to moderate divergence a single step and no bounds were placed on the number times broke down, especially for highly polymorphic loci. of repeat units possible at simulated loci. Large numbers This is not especially problematic, for the utility of these of loci were simulated and loci with 2, 3, 4, 8, 16, and 33 genetic distances to quantify genetic differences between alleles were selected for analysis. Loci not having one of highly differentiated populations is limited. Keep in these numbers of alleles were dropped from analysis. All mind that both genetic drift and mutation lead to differ- samples consisted of 100 diploid individuals from each entiation. Both DA and DC approach their maximum population. The number of loci in the samples was varied value of 1.0 when mutation rate and divergence time are = from 2 to 32. Three effective population sizes (Ne 500, high. This results in these statistics having a low coef- = 5000, and 50000) and three divergence times (t 50, 500 ficient of variation when divergence time and polymor- and 5000) were examined. All combinations of these four phism are high (Figure 1), but decreases their ability to parameters was examined (number of loci, number of describe the length of population separation. DS loses its alleles, Ne, t). Four commonly used genetic distances utility when polymorphism is high, divergence time is were estimated from the data: the DA distance (Nei et al, long, and few loci are scored. In this case, samples from 1983), the chord distance, DC (Cavalli-Sforza and each population often share no alleles and DS is unde- Edwards, 1967), the standard genetic distance of Nei, DS fined. For example, about half of the loci having 33 alleles (Nei, 1978), and the Weir and Cockerham estimator of had no alleles in common in populations of 5000 individ- ␪ FST, (Weir and Cockerham, 1984). uals after 5000 generations. Lastly, ␪ has two undesirable The coefficients of variation for each of these genetic properties when divergence time and polymorphism is distances were estimated from the data by dividing the high. It asymptotically approaches a maximum value, standard deviation of the estimates by the average esti- and this value is inversely proportional to the amount of mate. Contour plots showing the coefficient of variation polymorphism present in the populations (eg, Hedrick, for data sets with different numbers of loci and varying 1999). numbers of alleles per locus were created with Sigma- Of the four distances examined, DA and DC exhibited Plot 2000.

Load more