Copyright © 1996 by the Genetics Society of America

Hierarchical Analysis of Nucleotide Diversity in Geographically Structured Populations

Kent E. Holsinger and Roberta J. Mason-Gamer1

Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269-3043

Manuscript received August 1, 1994
Accepted for publication November 4, 1995

ABSTRACT

Existing methods for analyzing nucleotide diversity require investigators to identify relevant hierarchical levels before beginning the analysis. We describe a method that partitions diversity into hierarchical components while allowing any structure present in the data to emerge naturally. We present an unbiased version of NEI's nucleotide diversity statistics and show that our modification has the same properties as WRIGHT's $F_{ST}$. We compare its statistical properties with several other $F_{ST}$ estimators, and we describe how to use these statistics to produce a rooted tree of relationships among the sampled populations in which the mean time to coalescence of haplotypes drawn from populations belonging to the same node is smaller than the mean time to coalescence of haplotypes drawn from populations belonging to different nodes. We illustrate the method by applying it to data from a recent survey of restriction site variation in the chloroplast genome of Coreopsis grandiflora.

Genetics 142: 629-639 (February, 1996)

Corresponding author: Kent E. Holsinger, Department of Ecology and Evolutionary Biology, U-13, University of Connecticut, Storrs, CT 06269-3043. E-mail: [email protected]

1 Current address: Harvard University Herbaria, 22 Divinity Ave., Cambridge, MA 02138.

POPULATION geneticists have long recognized that the genetic diversity present in a species is hierarchically structured. In addition to differences among individuals within any one population, there may be differences among populations within a given geographical region, differences among populations from different geographical regions, and differences among entire geographical regions. SEWALL WRIGHT (1951, 1965) introduced $F$ statistics as a way of assessing genetic differentiation at each of these levels, and the intervening 40 years have repeatedly demonstrated how useful this approach can be (see, for example, the reviews on patterns of electrophoretic diversity by NEVO et al. 1984, NEI 1987, or HAMRICK and GODT 1990).

In the past 10 years, population geneticists have turned to increasingly sensitive techniques for detecting genetic variation, with a view toward uncovering fine-scale population structure that may be undetectable in routine allozyme surveys. In particular, population surveys of restriction site and nucleotide sequence variation are now becoming routine. Unfortunately, $F$ statistics, whether those originally proposed by WRIGHT or their more recent modifications (COCKERHAM 1969, 1973; NEI 1973; WEIR and COCKERHAM 1984; LONG 1986), are not appropriate for analysis of the variation revealed by these new techniques. $F$ statistics are calculated from allele frequencies at many, independently inherited loci. The variation revealed in restriction site and sequence studies, however, almost always results from differences at sites that are not independently inherited.

EXCOFFIER et al. (1992) recently described a method appropriate for the analysis of restriction site and sequence data that is related to WRIGHT's $F$ statistics. It uses a distance matrix to describe the relationship among haplotypes in the sample and partitions the variation present in the sample into three components: variation within populations, variation among populations within a particular geographical region, and variation among regions. It is formally equivalent to a nested analysis of variance of the nucleotide diversity in the sample, and it is a simple generalization to several hierarchical levels of the analysis of variance approach WEIR and BASTEN (1990) originally suggested. NEI (1982) proposed a method related to his $G_{ST}$ statistics for allozymes (NEI 1973), while TAKAHATA and PALUMBI (1985), LYNCH and CREASE (1990), and NEI and MILLER (1990) proposed similar methods for estimating the average number of nucleotide substitutions between populations. Valuable as these methods are, they all suffer from one important limitation. As with any nested analysis of variance, the investigator must specify the hierarchical structure of the data before beginning the analysis. In other words, each method requires that a predetermined hierarchical structure be imposed on the data, not inferred from it. Thus, they are not well suited to identifying the hierarchical structure that best reflects the pattern of genetic differentiation among populations.

In this paper, we describe a method for describing the hierarchical structure of nucleotide diversity in a sample that builds on these approaches. To develop this method we provide a bias correction to NEI's (1982) nucleotide diversity statistics. This statistic provides a direct analog of $F_{ST}$ that is appropriate for haplotype frequency data, and we compare its statistical behavior with that of other $F_{ST}$ measures based on nucleotide sequence data. We call this measure $g_{st}$, to emphasize its close relationship to NEI's (1982) $\gamma_{st}$. Using the fact that $F_{ST}$ is equal to the ratio of the average coalescence times for different pairs of genes (SLATKIN 1991), we show how $g_{st}$ can be used to group populations based on the average time to coalescence for pairs of haplotypes. The results of this analysis can be displayed as a tree diagram depicting the pattern of genetic differentiation among populations. The mean time to coalescence for two haplotypes drawn from the same node of the tree is less than that for two haplotypes drawn from different nodes. In addition to providing a more detailed description of the pattern of nucleotide diversity than methods that require pre-specified hierarchical categories, the structure of the tree can be interpreted as reflecting either patterns of gene exchange or phylogenetic relationship, and the statistical significance of the structure that is revealed can be assessed. We illustrate the method by applying it to data derived from a recent survey of restriction site variation in the chloroplast genome of Coreopsis grandiflora in the southern United States (MASON-GAMER et al. 1995).

THE METHOD

Measuring nucleotide sequence diversity: Let $n_{ik}$ be the number of individuals with haplotype $i$ collected from population $k$ and $n_k = \sum_i n_{ik}$. Then $\hat{x}_{ik} = n_{ik}/n_k$ is the maximum-likelihood estimate of $x_{ik}$, the frequency of haplotype $i$ in population $k$. Similarly, $\hat{x}_i = \sum_k \hat{x}_{ik}/n$ is the maximum-likelihood estimate of its average frequency, $x_i = \sum_k x_{ik}/n$, in the set of populations being considered, where $n$ is the number of populations included in the sample. Let $\delta_{ij}$ be the number of differences found between the $i$th and $j$th haplotypes in the sample. For restriction site surveys, $\delta_{ij}$ is the number of restriction site differences found between two haplotypes $i$ and $j$. For nucleotide sequence surveys, $\delta_{ij}$ is the number of nucleotide differences found between two sequences $i$ and $j$.

Let $v_k$ be the average number of differences between two haplotypes drawn at random from population $k$. NEI and TAJIMA (1981) showed that

$$\hat{v}_k = \frac{n_k}{n_k - 1} \sum_{ij} \hat{x}_{ik} \hat{x}_{jk} \delta_{ij}$$

is an unbiased estimator for $v_k$. Similarly, we define $v$ as the average number of differences between two haplotypes drawn at random without respect to the population from which they were drawn: $v = \sum_{ij} x_i x_j \delta_{ij}$. Because the sample is drawn from several populations, a correction is necessary to obtain an unbiased estimate of $v$ from sample data (cf. NEI and CHESSER 1983). In making this correction, we consider only the stochastic variation that arises from the sampling process involving the actual populations chosen, not the stochastic variation arising from the choice of populations (cf. WEIR and COCKERHAM 1984). Let $v^* = \sum_{ij} \hat{x}_i \hat{x}_j \delta_{ij}$. Then

$$E(v^*) = \sum_{ij} E(\hat{x}_i \hat{x}_j)\, \delta_{ij}.$$

Assuming there is no covariance in the sampling of haplotypes from different populations, $E(\hat{x}_{ik} \hat{x}_{jl}) = x_{ik} x_{jl}$ for $l \neq k$ (cf. TAKAHATA and PALUMBI 1985). Recalling this, we find that

$$E(v^*) = v - \frac{1}{n^2} \sum_k \frac{v_k}{n_k}.$$

Thus,

$$\hat{v} = v^* + \frac{1}{n^2} \sum_k \frac{\hat{v}_k}{n_k}$$

is an unbiased estimator for $v$. Let $\pi_{ij}$ be the probability that haplotypes $i$ and $j$ differ at any nucleotide. Then the probability that two randomly chosen haplotypes differ at any nucleotide is $\pi = \sum_{ij} x_i x_j \pi_{ij}$ (NEI and TAJIMA 1981). If the differences in the sample are nucleotide sequence differences, then an unbiased estimator for $\pi$ is

$$\hat{\pi} = \frac{\hat{v}}{N},$$

where $N$ is the length of the nucleotide sequences. If the differences in the sample are restriction site differences, then an unbiased estimator for the nucleotide diversity is

$$\hat{\pi} = \frac{\hat{v}}{2 \sum_i m_i r_i},$$

where $m_i$ is the number of restriction sites detected with the $i$th restriction enzyme and $r_i$ is the number of nucleotides in its recognition sequence (NEI and TAJIMA 1981; NEI 1987). The nucleotide diversity within each population, $\hat{\pi}_k$, is estimated in the same way.

We now have unbiased estimates for the average nucleotide diversity within populations, namely the mean of the $\hat{\pi}_k$, and for the nucleotide diversity in a set of populations. Following NEI (1973), let $g_{st}$ be the proportion of diversity in the sample due to differences among populations. Then

$$g_{st} = \frac{\hat{\pi} - \hat{\bar{\pi}}}{\hat{\pi}}, \tag{5}$$

where $\hat{\bar{\pi}} = \sum_k \hat{\pi}_k / n$ is the mean of the $\hat{\pi}_k$ (cf. NEI 1982;

TAKAHATA and PALUMBI 1985). $g_{st}$ differs from NEI's (1982) $\gamma_{st}$ in two ways: it includes a bias correction both for average nucleotide diversity within populations and for total nucleotide diversity in the sample, and average nucleotide diversity within populations is not a weighted average based on subpopulation size.

If $f_{ij}$ is the probability that haplotypes $i$ and $j$ are identical at any nucleotide, and $f$ is the probability that two haplotypes chosen at random match at any nucleotide, then

$$f = \sum_{ij} x_i x_j f_{ij} = \sum_{ij} x_i x_j (1 - \pi_{ij}) = 1 - \pi. \tag{6}$$

Similarly, if $\bar{f}$ is the mean across populations of the probability that two haplotypes chosen at random from the same population match at any nucleotide, then

$$\bar{f} = \frac{1}{n} \sum_k \sum_{ij} x_{ik} x_{jk} f_{ij} = \frac{1}{n} \sum_k \sum_{ij} x_{ik} x_{jk} (1 - \pi_{ij}) = 1 - \bar{\pi}. \tag{7}$$

Using (6) and (7) it is easy to see that $g_{st}$ is a good estimator for $F_{ST}$, because

$$g_{st} \approx \frac{\pi - \bar{\pi}}{\pi} = \frac{(1 - f) - (1 - \bar{f})}{1 - f} = \frac{\bar{f} - f}{1 - f},$$

which is precisely WRIGHT's definition for $F_{ST}$. Although $g_{st}$ is not an unbiased estimator of $F_{ST}$, because $E(\hat{\bar{\pi}}/\hat{\pi}) \neq E(\hat{\bar{\pi}})/E(\hat{\pi})$, the bias will be small for moderate to large sample sizes (STUART and ORD 1987).

Haplotype diversity and coalescence times: SLATKIN (1991) pointed out that if mutation is rare, then

$$F_{ST} = \frac{\bar{t} - \bar{t}_0}{\bar{t}}, \tag{8}$$

where $\bar{t}$ is the average time to coalescence of two genes drawn at random without respect to population and $\bar{t}_0$ is the average time to coalescence of two genes drawn at random from the same population. If $g_i$ is $F_{ST}$ in one set of populations, $g_j$ is $F_{ST}$ in a second set of populations, and $g$ is $F_{ST}$ in the combined set of populations, then WRIGHT (1951) showed that

$$1 - g = (1 - g_{ij})\{1 - (p g_i + q g_j)\}, \tag{9}$$

where $g_{ij}$ is the proportion of genetic variation in the whole sample owing to differences between populations in the first and second set, $p$ is the fraction of populations in the sample from the first group, and $q$ is the fraction of populations in the sample from the second group. For notational convenience we use $g_{ij}$ instead of $g_{st}$ when discussing the degree of genetic differentiation between two specified sets of populations, where $i$ and $j$ are the population indices.

Let $\bar{t}$ be the average time to coalescence of two genes drawn at random without respect to which set they are drawn from, $\bar{t}_i$ be the average time to coalescence of two genes drawn at random from the first set of populations, and $\bar{t}_j$ be the average time to coalescence of two genes drawn at random from the second set of populations. Then it follows from (8) that

[...]

where $t' = \{p(1/\bar{t}_i) + q(1/\bar{t}_j)\}^{-1}$ is the harmonic mean of $\bar{t}_i$ and $\bar{t}_j$ (cf. SLATKIN 1993). Thus,

$$g_{ij} = \frac{\bar{t} - t'}{\bar{t}}.$$

Just as $F_{ST}$ measures the increase in coalescence time resulting from the grouping of genes into populations, hierarchically expanded $F$ statistics measure the increase in coalescence time resulting from the grouping of populations into more inclusive sets of populations. Thus, pairwise $F$ statistics provide a convenient measure of the evolutionary divergence between two sets of populations (PIAZZA and MENOZZI 1983; NEI 1987; SLATKIN 1991). The increased time to coalescence for genes from different populations or from different groups of populations may mean either that the rate of gene exchange between the two sets of populations is smaller than that within each set or that the two sets of populations stopped exchanging genes entirely some time ago. The patterns produced by these processes cannot be distinguished from $F$ statistics alone (FELSENSTEIN 1982).

Assessing methods of estimating $F_{ST}$: The relationship between $F_{ST}$ and coalescence times allows us to compare the accuracy of several different estimators of $F_{ST}$. We used the algorithm described in APPENDIX A to build many replicate coalescent trees for sampled haplotypes under a wide variety of population sizes and migration structures. For each of the coalescent trees in our sample, we constructed a random nucleotide sequence to serve as the common ancestor of all alleles in that coalescent sample. All bases occurred at each position in this sequence with equal probability. Mutations were assigned randomly along each branch of the genealogy following a JUKES and CANTOR (1969) substitution model. Specifically, the number of mutations assigned was a Poisson random variate with mean equal to the product of branch length and mutation rate. The position of each mutation was chosen at random, and each of the three nucleotides that could replace the current nucleotide at that position was chosen with equal probability. We then compared estimates of $F_{ST}$ calculated from the simulated sequence data using NEI's (1982) $\gamma_{st}$, our $g_{st}$, LYNCH and CREASE's (1990) $N_{ST}$, and WEIR and BASTEN's $\beta'$ with the known $F_{ST}$ as calculated from the coalescent history of the sample.
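The mutation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' simulation code: the function names are hypothetical, the mutation rate is taken to be a per-sequence rate per unit of branch length, and the Poisson draw uses Knuth's multiplication method, which is adequate only for modest means.

```python
import math
import random

BASES = "ACGT"

def poisson(rng, mean):
    """Draw a Poisson variate by Knuth's multiplication method (modest means only)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def random_sequence(rng, length):
    """Ancestral sequence: every base is equally likely at every position."""
    return [rng.choice(BASES) for _ in range(length)]

def mutate_along_branch(rng, sequence, branch_length, mutation_rate):
    """Return a copy of `sequence` after Jukes-Cantor style mutation on one branch.

    Assumption: `mutation_rate` is per sequence per unit of branch length, so the
    expected number of mutations is branch_length * mutation_rate.
    """
    descendant = list(sequence)
    n_mutations = poisson(rng, branch_length * mutation_rate)
    for _ in range(n_mutations):
        site = rng.randrange(len(descendant))  # position chosen at random
        alternatives = [b for b in BASES if b != descendant[site]]
        descendant[site] = rng.choice(alternatives)  # three replacements, equal odds
    return descendant

rng = random.Random(1)
ancestor = random_sequence(rng, 500)
child = mutate_along_branch(rng, ancestor, branch_length=20000.0, mutation_rate=1e-3)
differences = sum(a != b for a, b in zip(ancestor, child))
```

Applying `mutate_along_branch` recursively from the root of a coalescent tree to its tips would yield one simulated data set; the tree-building step itself (APPENDIX A) is not sketched here.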

TABLE 1
Performance of alternative $F_{ST}$ measures

                                                Mean squared error of estimate
Nm    Population size   Sample configuration(a)   g_st       γ_st       N_ST       β
0.5   2500              1                         0.000499   0.000993   ...        ...
                        2                         0.000700   ...        ...        ...
                        3                         0.000644   0.002156   0.002561   0.01498
      500               4                         0.002152   0.004388   0.003688   0.00366
                        5                         0.003771   0.02002    0.005577   0.005544
                        6                         0.002521   0.007348   0.004104   0.01552
1.0   2500              1                         0.000240   0.000971   ...        ...
                        2                         0.000459   0.005369   0.001295   0.001246
                        3                         0.000865   0.002896   0.001228   0.01054
      500               4                         0.00124    0.00461    0.002325   0.002313
                        5                         0.002998   0.02345    0.004506   0.00448
                        6                         0.001747   0.008071   0.002896   0.01395
2.0   2500              1                         0.000107   0.001003   0.000454   0.00440
                        2                         ...        0.006071   ...        ...
                        3                         0.000216   0.003192   0.000610   0.007807
      500               4                         0.000749   0.004598   0.001315   0.001308
                        5                         0.002188   0.02626    0.00338    0.003361
                        6                         0.001113   0.008762   0.00187    0.01126

(a) See Table 2 for the list of sample configurations.
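For readers who want to see how a $g_{st}$ value of the kind scored in Table 1 could be computed from raw counts, the following Python sketch implements the estimator defined under "Measuring nucleotide sequence diversity". It is a minimal illustration, not the authors' program. The function and argument names are hypothetical, and the total-diversity correction is written as $v^* + (1/n^2)\sum_k \hat{v}_k/n_k$, which is an assumption that follows from multinomial sampling within populations and should be checked against the original equations. Because $\hat{\pi}$ is proportional to $\hat{v}$, the ratio is computed directly from the $\hat{v}$ values.

```python
def pairwise_diversity(freqs, delta):
    """Sum over ordered pairs (i, j) of x_i * x_j * delta_ij."""
    h = len(freqs)
    return sum(freqs[i] * freqs[j] * delta[i][j] for i in range(h) for j in range(h))

def g_st(counts, delta):
    """Bias-corrected g_st from haplotype counts.

    counts[k][i]: number of individuals with haplotype i in population k.
    delta[i][j]:  number of differences between haplotypes i and j.
    Assumes every population has at least two individuals and that the
    total diversity is non-zero.
    """
    n = len(counts)
    sizes = [sum(pop) for pop in counts]
    freqs = [[c / n_k for c in pop] for pop, n_k in zip(counts, sizes)]
    # Within-population estimates: n_k / (n_k - 1) * sum_ij x_ik x_jk delta_ij
    v_within = [n_k / (n_k - 1) * pairwise_diversity(x, delta)
                for x, n_k in zip(freqs, sizes)]
    # Unweighted mean haplotype frequencies across populations
    mean_freqs = [sum(x[i] for x in freqs) / n for i in range(len(delta))]
    v_star = pairwise_diversity(mean_freqs, delta)
    # Assumed correction for sampling within populations (see lead-in)
    v_total = v_star + sum(v_k / n_k for v_k, n_k in zip(v_within, sizes)) / n ** 2
    v_bar = sum(v_within) / n
    return (v_total - v_bar) / v_total

delta = [[0, 1], [1, 0]]                        # two haplotypes, one difference
fixed = g_st([[10, 0], [0, 10]], delta)         # populations fixed for different haplotypes
identical = g_st([[5, 5], [5, 5]], delta)       # identical sample frequencies
```

With these inputs `fixed` is 1.0 and `identical` is slightly negative (-1/19), which is the usual behavior of a bias-corrected ratio when the sample shows no differentiation.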

For each combination of parameters, we constructed a sample of 1000 coalescent trees. For each coalescent tree, we constructed 10 sequence data sets using different randomly generated starting sequences and independent realizations of the mutation process.

Results for one set of simulations involving sequences of 500 nucleotides drawn from five populations with 500 or 2500 individuals in each population are shown in Table 1. For each of the two population sizes, several different sample configurations were also considered (Table 2). Remarkably, the mean squared error of $g_{st}$ is two or more times smaller than that of any of the remaining estimators for every population size and sample configuration presented. Simulations involving sequences of 50 nucleotides or 1500 nucleotides showed the same pattern. With longer sequences, of course, the mean squared error of all estimators is reduced, but $g_{st}$ benefitted more from this effect than the other estimators. Increasing the sequence length from 500 to 1500 nucleotides reduced the mean squared error of $g_{st}$ by an average of 49% while the mean squared error of the remaining estimates was reduced by less than 24%.

TABLE 2
Sample configurations for the $F_{ST}$ analysis

Number   Sample configuration
1        25,25,25,25,25
2        10,10,10,10,10
3        5,15,25,35,45
4        12,12,12,12,12
5        5,5,5,5,5,5
6        4,8,12,16,20

The results presented in Table 1 ignore two aspects of haplotype sampling that may be critically important in interpreting $F_{ST}$ statistics: haplotype samples are typically gathered from a small subset of the populations that may actually be exchanging migrants, and effective population sizes may differ substantially from one sample population to another. To examine the first of these effects, we performed a series of simulations in which haplotypes were sampled from five randomly chosen populations out of a total of 21 populations. The mean squared errors reported in Table 3 are quite similar to those for simulations in which the five populations sampled represent the entire set of populations exchanging migrants. These results suggest that $F_{ST}$ estimates may be relatively insensitive to the proportion of populations sampled, although much more extensive simulations would be necessary to confirm this conjecture. To examine the second effect, we performed a series of simulations in which population sizes differed dramatically from one another. In the most extreme case, the five populations had sizes of 100, 250, 2500, 25,000, and 100,000 and resulted in a mean squared error for $g_{st}$ of $2.23 \times 10^{-[?]}$. In the least extreme case, the five populations had sizes of 500, 1000, 2500, 4000, and 4500 and resulted in a mean squared error for $g_{st}$ of $2.50 \times 10^{-[?]}$. These errors are comparable with those reported in Table 1 and Table 3. Thus, it appears that variation in population sizes has little impact on the accuracy of $F_{ST}$ estimates.

TABLE 3
Population sampling and $F_{ST}$ estimates

                                                Mean squared error of estimate
Nm     Population size   Sample configuration(a)   g_st       γ_st       N_ST       β
0.25   2500              1                         0.000469   0.000974   ...        ...
                         2                         0.000515   ...        ...        ...
                         3                         0.000417   0.002016   0.002689   0.020080
0.5    2500              1                         0.000318   0.000870   ...        ...
                         2                         0.000482   0.004608   0.002294   0.002196
                         3                         0.000344   0.002642   0.002563   0.015600
2.5    2500              1                         0.000036   0.000921   0.000394   0.000346
                         2                         0.000110   0.006372   0.000576   0.000526
                         3                         0.000087   0.003265   0.000386   0.006239
5.0    2500              1                         0.000022   0.000995   0.000168   0.000146
                         2                         0.000062   0.006259   0.000321   0.000243
                         3                         0.000052   0.003381   0.000272   0.006965
10.0   2500              1                         0.000014   0.001009   0.000014   0.000061
                         2                         0.000039   0.006543   0.000118   0.000086
                         3                         0.000029   0.003513   0.000139   0.004209

(a) See Table 2 for the list of sample configurations.

Our simulations included both multiple coalescent trees for each set of parameter values and multiple realizations of the mutation process for each coalescent tree. We took advantage of this structure to partition the overall variance in $F_{ST}$ estimates into two components. The first of these reflects the inaccuracies associated with estimating the actual coalescent history of the haplotypes in our sample. It is a result of estimation error. The small mean squared error associated with each of the $F_{ST}$ estimates, and especially with $g_{st}$, shows that estimation error is small and that $g_{st}$ and the other estimators provide a good estimate of the average coalescent history of a sample of haplotypes. The second variance component reflects the differences among coalescent histories that can be produced by the same population parameters. It reflects intrinsic variance associated with the drift-migration process. For the simulations reported in Table 1, the intrinsic drift variance is between five and 25 times greater than the estimation variance. In fact, the drift variance appears to be more than 10 times greater than the estimation variance whenever expected sequence divergence is greater than ~1%. In short, $g_{st}$ and other $F_{ST}$ estimators can provide a good estimate of the coalescent history of a particular sample of haplotypes, but inferences about migration rates based on those $F_{ST}$ estimates may be problematic.

Uncovering hierarchical structure: The results in the preceding sections show that $g_{st}$ is a useful measure of genetic differentiation between pairs of populations (or between paired sets of populations). How can we use this information to uncover hierarchical structure in the data? One way would be to treat the pairwise $g_{ij}$ as distances and analyze them using one of the existing methods for phylogenetic analysis of distance data (cf. PIAZZA and MENOZZI 1983; MERCURE et al. 1993). This approach is unsatisfactory in two ways. First, methods for phylogenetic analysis of distance data generally assume that branch lengths are additive, but inspection of (9) shows that the $g_{ij}$ are not additive. Second, none of the existing methods for phylogenetic inference from distance data take advantage of the mathematical partitioning of diversity components among hierarchical levels implicit in (9). The following algorithm corrects both of these deficiencies:

1. Construct a list of nodes, treating each population as a separate node.
2. Compute $g_{ij}$ between all pairs of nodes in the node list, using (5) to calculate $g_i$, $g_j$, and $g$. ($g_k = 0$ for any $k$ in which the node consists of a single population.)
3. Combine the two nodes with the smallest $g_{ij}$ into a single node.
4. Return to (2) until the node list is empty.

This algorithm is similar to the widely used unweighted pair-group method with arithmetic mean (UPGMA) for cluster analysis (SNEATH and SOKAL 1973). In both that method and this one, distances from each new node to every other node are recomputed at each step. Our method differs from UPGMA in the way in which the distances are recomputed. In UPGMA, if $i$ and $j$ are combined into a new node, $k$, then $g_{km} = (g_{im} + g_{jm})/2$. In our method, $g_{km}$ is computed from (9). As a result, our method guarantees that the average time to coalescence for haplotypes drawn from populations belonging to the same node is less than the average time to coalescence for a haplotype drawn from any of those populations and from a population belonging to a different node.

In addition to producing a tree that describes the hierarchical structure, it is possible to assess the statistical significance of the genetic differentiation detected at each hierarchical level. We show in APPENDIX B that if population haplotype frequencies are known, not estimated, $g_{ij} \geq 0$ with equality if and only if the haplotype frequencies in $i$ and $j$ are identical, as would be expected from its close relationship with $F_{ST}$ and $G_{ST}$. This provides a simple test for the presence of genetic differentiation between two nodes. Let

$$\Omega = \sum_k \left(\hat{x}_k^{(i)} - \hat{x}_k^{(j)}\right)^2, \tag{13}$$

where $\hat{x}_k^{(i)}$ is the sample frequency of haplotype $k$ in node $i$ and $\hat{x}_k^{(j)}$ is the sample frequency of haplotype $k$ in node $j$. Then $g_{ij} = 0$ if and only if $\Omega = 0$. To determine the distribution of $\Omega$ under the null hypothesis we construct a large number of random samples in which haplotypes are assigned randomly to populations in each node in proportion to their average frequency in the combined sample. We calculate $\Omega$ from each sample, producing a null distribution to which we compare our sample $\Omega$ (cf. LONG 1986; WEIR and COCKERHAM 1984; ZHIVOTOVSKY 1988).

AN EXAMPLE WITH CPDNA HAPLOTYPES

While restriction site and nucleotide sequence variation in chloroplast DNA (cpDNA) has been used extensively at the interspecific level and above (e.g., PALMER et al. 1988), it has been less widely used for intraspecific studies because of its low rate of sequence evolution (PALMER 1985, 1987; WOLFE et al. 1987; BIRKY 1988; CLEGG et al. 1990). Several recent studies have shown, however, that the amount of intraspecific cpDNA variation is greater than previously thought (reviewed in HARRIS and INGRAM 1991; SOLTIS et al. 1992). We illustrate here how the techniques described above can be used to provide new insight into patterns of population differentiation by applying them to data derived from an extensive restriction site survey of C. grandiflora, a morphologically variable member of the sunflower family found through much of the southern United States.

The data: MASON-GAMER et al. (1995) provide complete details on the sampling and laboratory procedures. Briefly, the sample consists of ~20 individuals from each of 14 populations in Georgia and Arkansas, representing both varieties found in Georgia (var. grandiflora and var. saxicola) and all three varieties found in Arkansas (var. grandiflora, var. saxicola, and var. harveyana). After extracting total DNA with a simple modification of standard procedures (SAGHAI-MAROOF et al. 1984; DOYLE and DOYLE 1987), samples were digested with eight restriction enzymes, each of which cuts cpDNA frequently: AluI (AG/CT), HaeIII (GG/CC), HhaI (GCG/C), HinfI (G/ANTC), MboI (/GATC), MspI (C/CGG), RsaI (GT/AC), and TaqI (T/CGA). The resulting fragments were separated on 1.25-1.5% agarose gels and bidirectionally blotted (SMITH and SUMMERS 1980) to reusable nylon membranes. Membrane-bound DNA fragments were hybridized with eight 32P-labeled, cloned fragments of the lettuce chloroplast genome (JANSEN and PALMER 1987) that an earlier study of Krigia (another member of the sunflower family) had suggested are the most variable regions of the genome (KIM et al. 1992).

Among the 273 cpDNAs assayed from these 14 populations, 33 of the 427 restriction sites detected were polymorphic, and we detected 13 distinct haplotypes (Table 4). The haplotypes may differ at only one site (0.030% sequence divergence) or at as many as 22 restriction sites (0.674% sequence divergence). The 14 sampled populations show a wide range of population structures (Table 5), from those in which only a single haplotype was detected to those in which as many as four haplotypes were detected. Not only do populations appear to be quite different from one another, haplotypes are unevenly distributed among populations. The A haplotype, for example, is found in eight populations while several others are found in only one.

TABLE 4
Divergence between cpDNA haplotypes in Coreopsis grandiflora

       A      A4     A15    A19    B      B2     B13a   B13b   B15    B16    B17a   B17b   B18
A      -      3      1      1      13     19     15     14     14     14     15     16     14
A4     0.090  -      4      4      16     22     18     17     17     17     18     19     17
A15    0.030  0.123  -      2      12     18     14     13     13     13     14     15     13
A19    0.030  0.123  0.061  -      14     20     16     15     15     15     16     17     15
B      0.395  0.487  0.365  0.426  -      8      2      1      3      1      2      3      1
B2     0.538  0.674  0.552  0.614  0.243  -      10     9      9      9      10     11     9
B13a   0.456  0.548  0.426  0.488  0.060  0.304  -      3      5      3      4      5      3
B13b   0.425  0.517  0.395  0.457  0.030  0.273  0.090  -      4      2      3      4      2
B15    0.428  0.519  0.397  0.459  0.091  0.275  0.151  0.121  -      4      5      6      4
B16    0.425  0.518  0.396  0.458  0.030  0.274  0.091  0.060  0.121  -      1      2      2
B17a   0.456  0.548  0.426  0.488  0.060  0.304  0.121  0.090  0.151  0.030  -      1      3
B17b   0.489  0.579  0.547  0.519  0.091  0.335  0.151  0.121  0.182  0.060  0.030  -      4
B18    0.425  0.518  0.396  0.458  0.030  0.274  0.091  0.060  0.121  0.060  0.091  0.121  -

Number of restriction site differences (above the diagonal) and percent nucleotide divergence (below the diagonal, calculated according to NEI and TAJIMA 1981). From MASON-GAMER et al. (1995).
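As a rough check on how the haplotype data translate into within-population diversity, the Python sketch below recomputes the diversity of population 13 from its haplotype frequencies (Table 5) and the restriction site differences among haplotypes B, B13a, and B13b (2, 1, and 3 sites, as read from Table 4). Two things are assumptions rather than statements from the text: that diversity per nucleotide is obtained by dividing the mean number of site differences by twice the summed recognition-sequence length, and that this length can be approximated as 427 sites times four bases. The function name is hypothetical.

```python
def within_population_diversity(n_k, freqs, site_diffs, total_site_length):
    """Bias-corrected nucleotide diversity for one population.

    n_k:               sample size
    freqs:             haplotype name -> sample frequency
    site_diffs:        frozenset({a, b}) -> restriction site differences
    total_site_length: sum over enzymes of (sites detected * recognition length)
    """
    haplotypes = list(freqs)
    v = 0.0
    for a in haplotypes:
        for b in haplotypes:
            if a != b:
                v += freqs[a] * freqs[b] * site_diffs[frozenset((a, b))]
    v_hat = n_k / (n_k - 1) * v          # NEI and TAJIMA (1981) sample-size correction
    return v_hat / (2 * total_site_length)  # assumed conversion to per-nucleotide scale

# Population 13: 20 individuals, three haplotypes
freqs_13 = {"B": 0.700, "B13a": 0.250, "B13b": 0.050}
diffs = {
    frozenset(("B", "B13a")): 2,
    frozenset(("B", "B13b")): 1,
    frozenset(("B13a", "B13b")): 3,
}
pi_13 = within_population_diversity(20, freqs_13, diffs, total_site_length=427 * 4)
```

Under these assumptions `pi_13` comes out near 0.00026, in line with the within-population value printed for population 13 in Figure 1. The same call with population 4 (A at 0.850, A4 at 0.150, three site differences, 20 individuals) gives roughly 0.000236.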

TABLE 5
Haplotypic composition of the population samples

Sample no.   Sample size   Haplotypes present   Haplotype frequencies
Georgia
1            21            A                    1.000
2            18            A                    0.500
                           B2                   0.500
3            17            A                    1.000
4            20            A                    0.850
                           A4                   0.150
8            23            A                    1.000
9/10         20            A                    1.000
Arkansas
13           20            B                    0.700
                           B13a                 0.250
                           B13b                 0.050
14           20            B13a                 1.000
15           31            A                    0.097
                           A15                  0.194
                           B13a                 0.258
                           B15                  0.452
16           21            B                    0.049
                           B16                  0.290
17           19            B                    0.895
                           B17a                 0.053
                           B17b                 0.053
18                         B                    0.500
                           B18                  0.500
19           19            A                    0.632
                           A19                  0.368
20           20            A                    0.400
                           A19                  0.600

From MASON-GAMER et al. (1995).

Results: The results of our hierarchical analysis of haplotype diversity in this sample are presented in Figure 1. Three major groups of populations are evident: those in which only A-type genomes are present, those in which only B-type genomes are present, and those in which both A-type and B-type genomes are present. Within the A-type genome group, the populations are further divided into western (Arkansas) and eastern (Georgia) populations. The B-type genome group is found only in Arkansas. The only B-type genome found in Georgia, B2, is found in a population that also contains haplotype A. The B2 genome is also unusual in that it is highly divergent from other B-type genomes and is found nowhere else in the sample (Tables 4 and 5; MASON-GAMER et al. 1995). MASON-GAMER et al. (1995) noted that A-type genomes are found both in other members of Coreopsis sect. Coreopsis, the section to which C. grandiflora belongs, and in species belonging to other sections of Coreopsis. Similarly, B-type genomes are found both in other members of sect. Coreopsis and in members of other sections. The pattern of genome distribution among taxa suggests that the divergence between the two genome groups predates the origin of C. grandiflora. If so, then each population having both A-type and B-type genomes actually represents two populations with respect to cpDNA haplotype, because there is no evidence of cpDNA recombination in natural populations.

In our sample, only two populations (population 2 from Georgia and population 15 from Arkansas) have both A-type and B-type genomes present. Figure 2 presents the results of a hierarchical analysis of haplotype diversity in which population 2 and population 15 are each treated as two populations: one composed entirely of A-type haplotypes and one composed entirely of B-type haplotypes (populations 2A/2B and 15A/15B, respectively). The major difference between the relationships as depicted here and in Figure 1 is, not surprisingly, that there are two major groups of populations: those in which only A-type genomes are present and those in which only B-type genomes are present. In fact, >90% of the total nucleotide diversity present in the sample is a result of the divergence between A-type and B-type genomes. With respect to maternal phylogeny, therefore, A-type genome populations in Arkansas are more closely related to A-type genome populations in Georgia than they are to B-type genome populations in Arkansas. Similarly, B-type genome populations in Arkansas share a more recent common maternal ancestor with population 2B in Georgia than they do with A-type genome populations in Arkansas.

In spite of the predominant role of genome type in structuring nucleotide diversity in C. grandiflora, Figure 2 also makes it apparent that both the A-type and B-type genomes found in population 15 (Arkansas) are more similar to other genomes of their type in Arkansas than to other genomes of their type in Georgia. Similarly, population 2A is more similar to other A-type genome populations from Georgia than it is to any A-type genome populations from Arkansas. In short, this analysis suggests that the geographical separation of populations in Georgia and Arkansas has been accompanied by significant genetic differentiation, a pattern that was not apparent in earlier analyses of electrophoretic variation (CRAWFORD and SMITH 1984; COSNER and CRAWFORD 1990). Comparable restriction site or nucleotide sequence data on nuclear encoded genes will be required before we can determine if the differences observed between these sets of data reflect the higher level of population subdivision expected for cpDNA markers than for nuclear markers because of their predominantly maternal transmission (BIRKY et al. 1983, 1989) or the influence of hybridization and introgression events (e.g., RIESEBERG 1991; WHITTEMORE and SCHAAL 1991; reviewed in RIESEBERG and BRUNSFELD 1992). It appears, however, that A-type genome populations are more closely connected with others in that genome group than with any in the B-genome group, even though both A-type and B-type populations occur in both geographic regions and there is significant genetic differentiation between them.
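The hierarchical analysis reported here rests on the four-step clustering algorithm given under "Uncovering hierarchical structure". A compact Python sketch of that procedure follows. It is illustrative only and makes several simplifying assumptions: haplotype frequencies are treated as known, so no bias correction is applied; node diversity uses unweighted mean frequencies across the populations in the node; and the loop stops once a single node remains. All function names are hypothetical, and the four populations in the example are invented, not taken from the survey.

```python
from itertools import combinations

def diversity(freqs, delta):
    """Sum over ordered pairs (i, j) of x_i * x_j * delta_ij."""
    h = len(freqs)
    return sum(freqs[i] * freqs[j] * delta[i][j] for i in range(h) for j in range(h))

def g_within(node, pop_freqs, delta):
    """Differentiation among the populations in `node`; 0 for a single population."""
    if len(node) == 1:
        return 0.0
    h = len(delta)
    mean_freqs = [sum(pop_freqs[k][i] for k in node) / len(node) for i in range(h)]
    total = diversity(mean_freqs, delta)
    within = sum(diversity(pop_freqs[k], delta) for k in node) / len(node)
    return (total - within) / total if total > 0 else 0.0

def g_between(node_i, node_j, pop_freqs, delta):
    """Solve 1 - g = (1 - g_ij) * (1 - (p*g_i + q*g_j)) for g_ij."""
    p = len(node_i) / (len(node_i) + len(node_j))
    q = 1.0 - p
    g_all = g_within(node_i + node_j, pop_freqs, delta)
    denom = 1.0 - (p * g_within(node_i, pop_freqs, delta)
                   + q * g_within(node_j, pop_freqs, delta))
    return 1.0 - (1.0 - g_all) / denom if denom > 0 else 0.0

def cluster(pop_freqs, delta):
    """Repeatedly merge the pair of nodes with the smallest g_ij."""
    nodes = [[k] for k in range(len(pop_freqs))]
    merges = []
    while len(nodes) > 1:
        a, b = min(combinations(range(len(nodes)), 2),
                   key=lambda pair: g_between(nodes[pair[0]], nodes[pair[1]],
                                              pop_freqs, delta))
        merges.append((nodes[a], nodes[b],
                       g_between(nodes[a], nodes[b], pop_freqs, delta)))
        merged = nodes[a] + nodes[b]
        nodes = [node for idx, node in enumerate(nodes) if idx not in (a, b)] + [merged]
    return merges

# Invented example: two haplotypes differing at one site, four populations
pops = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
delta = [[0, 1], [1, 0]]
merges = cluster(pops, delta)
```

In this toy example populations 0 and 1 join first, then 2 and 3, and the two resulting nodes are joined last at a much larger distance. Recomputing every pairwise distance at each step is wasteful for large data sets but keeps the correspondence with the numbered steps easy to follow.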

FIGURE 1.-Hierarchical analysis of haplotype diversity in Coreopsis grandiflora. Population numbers correspond to those in Table 5, and the numbers in parentheses following them are the nucleotide sequence diversity within each population, estimated following NEI and TAJIMA (1981). The number at each node is the distance ($g_{st}$) between its two daughter nodes, and the P-value reported is the probability of obtaining a distance that great or greater under the null hypothesis of no differentiation between the daughter nodes (based on 10,000 random resam-
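The node statistics described in the caption of Figure 1 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' program: the function names are hypothetical, diversity is taken as the plain mean pairwise distance among sampled haplotypes (not the unbiased estimator developed in the paper), the distance between two nodes is taken as (total - mean within) / total, and the resampling is assumed to be a random reshuffling of haplotypes between the two daughter nodes.

```python
import random


def diversity(haplotypes, dist):
    """Mean pairwise distance among sampled haplotypes (0.0 if fewer than two)."""
    n = len(haplotypes)
    if n < 2:
        return 0.0
    total = sum(dist[a][b]
                for i, a in enumerate(haplotypes)
                for b in haplotypes[i + 1:])
    return total / (n * (n - 1) / 2)


def node_distance(left, right, dist):
    """Assumed measure: (total - mean within) / total for two samples."""
    pooled = left + right
    total = diversity(pooled, dist)
    if total == 0.0:
        return 0.0  # no diversity at all, hence no differentiation
    p = len(left) / len(pooled)  # weight of the left node by sample size
    within = p * diversity(left, dist) + (1 - p) * diversity(right, dist)
    return (total - within) / total


def permutation_p_value(left, right, dist, n_resamples=10_000, seed=0):
    """Share of random reshufflings whose distance is as great as or greater
    than the observed one (assumed form of the resampling test)."""
    rng = random.Random(seed)
    observed = node_distance(left, right, dist)
    pooled = left + right
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        resampled = node_distance(pooled[:len(left)], pooled[len(left):], dist)
        if resampled >= observed:
            hits += 1
    return observed, hits / n_resamples
```

Here `dist` is a nested dictionary of distances between haplotype labels, so any distance (a simple count of differences, steps on a minimum-spanning tree, or a corrected evolutionary distance) can be supplied without changing the code.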

DISCUSSION

The method we introduced here for studying patterns of haplotype diversity within and among populations is similar to several existing procedures. Like $\Phi$ statistics calculated from a distance matrix in which the distance between two haplotypes is a simple count of the differences between them (EXCOFFIER et al. 1992), the measure of among population differentiation we propose, $g_{st}$, does not correct for multiple substitutions. It differs in this respect from the similar measures proposed by TAKAHATA and PALUMBI (1985), LYNCH and CREASE (1990), and NEI and MILLER (1990). At least when levels of divergence are less than a few percent, however, $g_{st}$ provides more accurate estimates of $F_{ST}$ than other statistics that have been proposed for this purpose. Like all of these measures, $g_{st}$ allows us to partition the observed diversity into within and among population components (cf. Equation 9). Unlike other methods that have been proposed, however, any hierarchical structure present in the data emerges naturally from the algorithm we suggest for pairwise calculations of $g_{st}$. The hierarchical structure need not be imposed prior to the analysis. We regard this as the most novel and useful feature of our method.

In circumstances where multiple substitutions are more frequent, our method may still be employed as a way to summarize the hierarchical structure of the data. Any hierarchical structure revealed, however, should not be interpreted in terms of mean coalescence times. In fact, when dealing with data of this type, it may be useful to consider other methods of calculating within and among population diversity. By interpreting $\delta_{ij}$ as the number of steps between haplotypes $i$ and $j$ on a minimum-spanning tree connecting all haplotypes, for example, the resulting tree will connect populations whose haplotypes are, on average, separated by fewer mutational events than those belonging to different nodes (cf. EXCOFFIER et al. 1992). Alternatively, the evolutionary distance between haplotypes $i$ and $j$ could be calculated using formulas for restriction sites or nucleotide sequences that correct for multiple substitutions, base composition biases, among-site rate variation, or mutational biases. In either of these cases, the hierarchical analysis we propose would be done directly on the $v$ and $v_k$, using $(v - \bar{v})/v$ as the measure of among population differentiation. Because $g_{st}$ depends on the ratio of $\bar{\pi}$ to $\pi$, and $\bar{v}$ and $v$ differ from these only by a constant for low levels of sequence divergence, the results of an analysis on this measure will differ substantially from those of an analysis on $g_{st}$ only when some of the haplotypes are very divergent from one another. Similarly, data from electrophoretic surveys could be used to uncover hierarchical structure simply by using NEI'S $G_{ST}$ statistics (NEI 1973; NEI and CHESSER 1983) in pairwise comparisons instead of $g_{st}$. In short, the method we describe here provides a flexible framework for the analysis of genetic variation in spatially structured populations, a framework that is particularly appropriate

FIGURE 2.-Hierarchical analysis of haplotype diversity in Coreopsis grandiflora, treating the A haplotypes and B haplotypes in populations 2 and 15 as separate populations (2a and 2b, 15a and 15b). See the caption in Figure 1 for an explanation of the statistics reported on the tree.

when interest is centered on discovering hierarchical structure that is present in the data rather than on determining whether the data fits a preconceived notion of what that structure should be.

This work was supported by a grant from the University of Connecticut Research Foundation to K.E.H. and by a Grant-in-Aid of Research from Sigma Xi and a National Science Foundation Doctoral Dissertation Improvement grant (BSR-9105167) to R.J.M.

Note: The computer programs used to evaluate the performance of $F_{ST}$ measures and to analyze restriction site diversity in C. grandiflora can be obtained from the senior author on request. They are available either as source code (ANSI C) or as a precompiled executable (386 MS-DOS).

LITERATURE CITED

BIRKY, C. W., 1988 Evolution and variation in plant chloroplast and mitochondrial genomes, pp. 23-53, in Plant Evolutionary Biology, edited by L. D. GOTTLIEB and S. K. JAIN. Chapman and Hall, New York.
BIRKY, C. W., P. FUERST and T. MARUYAMA, 1989 Organelle gene diversity under migration and drift: equilibrium expectations, approach to equilibrium, effects of heteroplasmic cells, and comparison to nuclear genes. Genetics 121: 613-627.
BIRKY, C. W., T. MARUYAMA and P. FUERST, 1983 An approach to population and evolutionary genetic theory for genes in mitochondria and chloroplasts, and some results. Genetics 103: 513-527.
CLEGG, M. T., G. H. LEARN and E. M. GOLENBERG, 1990 Evolution of chloroplast DNA. pp. 135-149, in Evolution at the Molecular Level, edited by R. K. SELANDER, A. G. CLARK and T. S. WHITTAM. Sinauer, Sunderland, MA.
COCKERHAM, C. C., 1969 Variance of gene frequencies. Evolution 23: 72-84.
COCKERHAM, C. C., 1973 Analyses of gene frequencies. Genetics 74: 679-700.
COSNER, M. B., and D. J. CRAWFORD, 1990 Allozyme variation in Coreopsis sect. Coreopsis (Asteraceae). Syst. Bot. 15: 256-265.
CRAWFORD, D. J., and E. B. SMITH, 1984 Allozyme divergence and interspecific variation in Coreopsis grandiflora (Compositae). Syst. Bot. 9: 219-225.
DOYLE, J. J., and J. L. DOYLE, 1987 A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem. Bull. 19: 11-15.
EXCOFFIER, L., P. E. SMOUSE and J. M. QUATTRO, 1992 Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics 131: 479-491.
FELSENSTEIN, J., 1982 How can we infer geography and history from gene frequencies? J. Theor. Biol. 96: 9-20.
HAMRICK, J. L., and M. J. W. GODT, 1990 Allozyme diversity in plant species. pp. 43-63, in Plant Population Genetics, Breeding, and Genetic Resources, edited by A. H. D. BROWN, M. T. CLEGG, A. T. KAHLER and B. S. WEIR. Sinauer Associates, Sunderland, MA.
HARRIS, S. A., and R. INGRAM, 1991 Chloroplast DNA and biosystematics: the effects of intraspecific diversity and plastid transmission. Taxon 40: 393-412.
JANSEN, R. K., and J. D. PALMER, 1987 Chloroplast DNA from lettuce and Barnadesia (Asteraceae): structure, gene localization, and characterization of a large inversion. Curr. Genet. 11: 553-564.
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules. pp. 21-132, in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KIM, K.-J., R. K. JANSEN and B. L. TURNER, 1992 Phylogenetic and evolutionary implications of interspecific chloroplast DNA variation in dwarf dandelions (Krigia; Asteraceae). Am. J. Bot. 79: 708-715.
KINGMAN, J. F. C., 1982a The coalescent. Stoch. Proc. Appl. 13: 235-248.
KINGMAN, J. F. C., 1982b On the genealogy of large populations. J. Appl. Prob. 19: 27-43.
LONG, J. C., 1986 The allelic correlation structure of Gainj- and Kalam-speaking people. I. The estimation and interpretation of Wright's F-statistics. Genetics 112: 629-647.
LYNCH, M., and T. J. CREASE, 1990 The analysis of population survey data on DNA sequence variation. Mol. Biol. Evol. 7: 377-394.
MASON-GAMER, R. J., K. E. HOLSINGER and R. K. JANSEN, 1995 Chloroplast DNA haplotype variation within and among populations of Coreopsis grandiflora (Asteraceae). Mol. Biol. Evol. 12: 371-381.
MERCURE, A., K. RALLS, K. P. KOEPFLI and R. K. WAYNE, 1993 Genetic subdivisions among small canids: mitochondrial DNA differentiation of swift, kit, and arctic foxes. Evolution 47: 1313-1328.
NEI, M., 1973 Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA 70: 3321-3323.
NEI, M., 1982 Evolution of human races at the gene level. pp. 167-181, in Human Genetics, Part A: The Unfolding Genome, Proceedings of the Sixth International Congress of Human Genetics, edited by B. BONNÉ-TAMIR, T. COHEN and R. M. GOODMAN. Alan R. Liss, Inc., New York.
NEI, M., 1987 Molecular Evolutionary Genetics. Columbia Univ. Press, New York.
NEI, M., and R. K. CHESSER, 1983 Estimation of fixation indices and gene diversities. Ann. Hum. Genet. 47: 253-259.
NEI, M., and J. C. MILLER, 1990 A simple method for estimating average number of nucleotide substitutions within and between populations from restriction data. Genetics 125: 873-879.
NEI, M., and F. TAJIMA, 1981 DNA polymorphism detectable by restriction endonucleases. Genetics 97: 145-163.
NEVO, E., A. BEILES and R. BEN-SHLOMO, 1984 The evolutionary significance of genetic diversity: ecological, demographic, and life history correlates. pp. 13-213, in Evolutionary Dynamics of Genetic Diversity, edited by G. S. MANI. Springer-Verlag, Berlin.
NOTOHARA, M., 1990 The coalescent and the genealogical process in geographically structured population. J. Math. Biol. 29: 59-75.
PALMER, J. D., 1985 Evolution of chloroplast and mitochondrial DNA in plants and algae. pp. 131-140, in Monographs in Evolutionary Biology: Molecular Evolutionary Genetics, edited by R. J. MACINTYRE. Plenum, New York.
PALMER, J. D., 1987 Chloroplast DNA evolution and biosystematic uses of chloroplast DNA variation. Am. Nat. 130: S6-S29.
PALMER, J. D., R. K. JANSEN, H. J. MICHAELS, M. W. CHASE and J. W. MANHART, 1988 Chloroplast DNA variation and plant phylogeny. Ann. MO Bot. Gard. 75: 1180-1206.
PIAZZA, A., and P. MENOZZI, 1983 Geographic variation in gene frequencies. pp. 444-450, in Numerical Taxonomy, NATO ASI series, Vol. 61, edited by J. FELSENSTEIN. Springer-Verlag, Heidelberg.
RIESEBERG, L. H., 1991 Homoploid reticulate evolution in Helianthus (Asteraceae): evidence from ribosomal genes. Am. J. Bot. 78: 1218-1237.
RIESEBERG, L. H., and S. J. BRUNSFELD, 1992 Molecular evidence and plant introgression. pp. 151-176, in Molecular Systematics of Plants, edited by P. S. SOLTIS, D. E. S
chloroplast DNA variation: systematic and phylogenetic implications. pp. 117-150, in Molecular Systematics of Plants, edited by P. S. SOLTIS, D. E. SOLTIS and J. J. DOYLE. Chapman and Hall, New York.
STUART, A., and J. K. ORD, 1987 Kendall's Advanced Theory of Statistics, Volume 1. Oxford Univ. Press, New York.
TAKAHATA, N., and S. R. PALUMBI, 1985 Extranuclear differentiation and gene flow in the finite island model. Genetics 109: 441-457.
WEIR, B. S., and C. J. BASTEN, 1990 Sampling strategies for distances between sequences. Biometrics 46: 551-582.
WEIR, B. S., and C. C. COCKERHAM, 1984 Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.
WHITTEMORE, A. T., and B. A. SCHAAL, 1991 Interspecific gene flow in sympatric oaks. Proc. Natl. Acad. Sci. USA 88: 2540-2544.
WOLFE, K. H., W.-H. LI and P. M. SHARPE, 1987 Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc. Natl. Acad. Sci. USA 84: 9054-9058.
WRIGHT, S., 1951 The genetical structure of populations. Ann. Eugen. 15: 323-354.
WRIGHT, S., 1965 The interpretation of population structure by F-statistics with special regards to systems of mating. Evolution 19: 395-420.
ZHIVOTOVSKY, L. A., 1988 Some methods of analysis of correlated characters. pp. 423-432, in Proceedings of the II International Conference on Quantitative Genetics, edited by B. S. WEIR, G. EISEN, M. M. GOODMAN and G. NAMKOONG. Sinauer Assoc., Sunderland, MA.

Communicating editor: B. S. WEIR

APPENDIX A

NOTOHARA (1990) extends KINGMAN'S coalescent (1982a,b) to geographically structured populations with arbitrary population sizes and arbitrary migration rates among populations. Let $m_{ij}$ be the proportion of individuals in population $i$ that are immigrants from population $j$. Let $q_{ij}$ be the proportion of individuals in population $i$ that move to population $j$. $m_{ij}$ is the backward migration rate, and $q_{ij}$ is the forward migration rate. They are related as

[...] if $k \neq j$

where $N_k$ is the size of population $k$. In this formulation, the effective population size is assumed to equal the census population size. Let $n_k$ be the number of haplotypes in the sample from population $k$.

Looking backward over the genealogy of the sampled alleles, at each generation one of two events may occur: coalescence or migration. Let be the probability of

Furthermore, the probability that neither a coalescence nor a migration event occurs is

$$\phi = 1 - \sum_k \Big( \alpha_k + \sum_{j \neq k} \beta_{jk} \Big). \tag{A4}$$

Thus, the time back to the first event is exponentially distributed with mean $1/\phi$. Let $\alpha = \sum_k \alpha_k$ and $\beta = \sum_k \sum_{j \neq k} \beta_{jk}$. Then the probability that the first event is a coalescent event is

$$p_c = \frac{\alpha}{\alpha + \beta}, \tag{A5}$$

and the probability that the first event is a migration event is

$$p_m = \frac{\beta}{\alpha + \beta}. \tag{A6}$$

To construct the coalescent structure of a sample we use the following algorithm:

(1) Calculate all $\alpha_k$ and $\beta_{jk}$, $\phi$, $\alpha$, $\beta$, $p_c$, and $p_m$ for the current sample configuration.
(2) Select the time for the next event at random from an exponential distribution with mean $1/\phi$.
(3) Select a random number, $u$, from a uniform distribution on the interval (0, 1). If $u < p_c$ as given by (A.5), then go to (4), otherwise go to (5).
(4) The next event is a coalescent event. Select a random number, $u$, on the interval (0, 1). Let $K$ be the largest integer $k$ such that the inequality [...] is satisfied. Count total number of haplotypes remaining after the coalescence event. If there is only one, the coalescence structure is complete. If there is more than one, return to (1).
(5) The next event is a migration event. Select a random number, $u$, on the interval (0, 1). Let $M$ be the largest integer $m$ such that the inequality [...] is satisfied.
    (a) The haplotype migrated into population $M$. Select a random number, $u$, on the interval (0, 1). Let $N$ be the largest integer $n$ such that the inequality [...] is satisfied.
    (b) The haplotype migrated from population $N$. Select a haplotype at random from population $M$ and move it to population $N$. Return to (1).

APPENDIX B

It is apparent from (A.9) that $g_{st} = 0$ if and only if $\pi_{ij} - (p\pi_i + q\pi_j) = 0$. Letting $\pi_{ij}$ represent the nucleotide diversity in the combined sample, $\bar{\pi}$ the average within population diversity in the combined sample, and $\pi_i$ the diversity in node $i$, it is a matter of algebra to show that

[...] (B1)

The partial derivative of the term in square brackets with respect to $x_l^{(i)}$ is

$$2pq \big( x_k^{(i)} - x_k^{(j)} \big) \delta_{kl}. \tag{B2}$$

[...] $> 0$ for $i \neq j$ and $\delta_{kk} = 0$ whenever each haplotype is distinct, which guarantees $|A| \neq 0$. Because $y = 0$ if and only if $x_k^{(i)} = x_k^{(j)}$ for all $k$, $x_k^{(i)} = x_k^{(j)}$ is a unique critical point for $\pi_{ij} - (p\pi_i + q\pi_j)$.

It is evident from (B.2) that $x_k^{(i)} = x_k^{(j)}$ is a minimum for each $k$ individually. Because the $x_k$ form a complete basis for the space, $x_k^{(i)} = x_k^{(j)}$ for all $k$ is also a minimum for (B.1). Thus,

$$\bar{\pi} - \pi_{ij} \le 0 \tag{B3}$$

with equality only when $x_k^{(i)} = x_k^{(j)}$ for all $k$.
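The event-selection steps of the Appendix A algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' ANSI C program: the names `pick_index` and `next_event` are hypothetical, the per-population coalescence probabilities `alpha[k]` and the migration probabilities `beta[j][k]` are taken as given inputs with at least one positive value, a population is assumed to be chosen in proportion to its share of the relevant total, and a migration is drawn as a single (j, k) pair, which simplifies the two-stage choice of $M$ and $N$ in steps (5a) and (5b). Sampling the waiting time is left out.

```python
import random


def pick_index(weights, u):
    """Return the first index whose cumulative share of the total weight exceeds u."""
    total = sum(weights)
    running = 0.0
    for index, weight in enumerate(weights):
        running += weight
        if u < running / total:
            return index
    return len(weights) - 1  # guard against rounding when u is close to 1


def next_event(alpha, beta, rng):
    """Choose the type and location of the next event.

    alpha[k]   : assumed coalescence probability in population k
    beta[j][k] : assumed migration probability for the ordered pair (j, k), j != k
    """
    n = len(alpha)
    pairs = [(j, k) for j in range(n) for k in range(n) if j != k]
    a = sum(alpha)
    b = sum(beta[j][k] for j, k in pairs)
    p_c = a / (a + b)  # probability that the event is a coalescence, as in (A5)
    if rng.random() < p_c:
        return ("coalescence", pick_index(alpha, rng.random()))
    choice = pick_index([beta[j][k] for j, k in pairs], rng.random())
    return ("migration", pairs[choice])
```

A simulation loop would call `next_event` repeatedly, update the sample configuration after each event, recompute `alpha` and `beta`, and stop when a single haplotype remains.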