Copyright  2004 by the Society of America

The Frequency Spectrum in Genome-Wide Human Variation Data Reveals Signals of Differential Demographic History in Three Large World Populations

Gabor T. Marth,1 Eva Czabarka, Janos Murvai and Stephen T. Sherry National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894 Manuscript received April 15, 2003 Accepted for publication September 4, 2003

ABSTRACT We have studied a genome-wide set of single-nucleotide polymorphism (SNP) allele frequency measures for African-American, East Asian, and European-American samples. For this analysis we derived a simple, closed mathematical formulation for the spectrum of expected allele frequencies when the sampled populations have experienced nonstationary demographic histories. The direct calculation generates the spectrum orders of magnitude faster than coalescent simulations do and allows us to generate spectra for a large number of alternative histories on a multidimensional parameter grid. Model-fitting experiments using this grid reveal significant population-specific differences among the demographic histories that best describe the observed allele frequency spectra. European and Asian spectra show a bottleneck-shaped history: a reduction of effective in the past followed by a recent phase of size recovery. In contrast, the African-American spectrum shows a history of moderate but uninterrupted population expansion. These differences are expected to have profound consequences for the design of medical association studies. The analytical methods developed for this study, i.e., a closed mathematical formulation for the allele frequency spectrum, correcting the ascertainment bias introduced by shallow SNP sampling, and dealing with variable sample sizes provide a general framework for the analysis of public variation data.

HE analysis of statistical distributions of genetic the effects of recombination or rate heterogene- Tvariations has a rich history in classical population ity as we show below. genetic studies (Crow and Kimura 1970), and recent Modeling the distribution of allele frequency: Prior genome-scale data collection projects have positioned study of the AFS has been restricted to properties of the field to apply, challenge, and improve traditional summary statistics such as Tajima’s D (Tajima 1989), or theory by examining data from thousands of loci simul- the proportion of rare- to medium-frequency (Fu taneously. The two most frequently studied distributions and Li 1993). There has been very little analysis of the of nucleotide sequence variation are the marker density general shape of observed spectral distributions. The (MD), or mismatch distribution (Li 1977; Rogers and analytical shape of the AFS, under a stationary history Harpending 1992; i.e., the distribution of the number of constant effective population size, was derived by Fu of polymorphic sites observed when a collection of se- (1995) who showed that, within n samples, the expected quences of a given length are compared), and the allele number of of size i is inversely proportional frequency spectrum (AFS; Ewens 1972; i.e., the distribu- to i. Important properties of the coalescent process un- tion of diallelic polymorphic sites according to the num- der deterministically changing population size have ber of chromosomes that carry a given allele within a been derived in publications of Griffiths and Tavare sample). The latter distribution is immediately applicable (1994a,b) and Tavare et al. (1997). These results show to the genotype data produced by projects that are char- that, for the purposes of genealogy, varying population acterizing a large subset of currently available single-nucle- size can be treated by appropriate scaling of the coales- otide polymorphisms (SNPs) with measures of individual cent time. Applying these results to obtain a formula allele counts (genotypes) for three ethnic populations for the allele frequency spectrum is not trivial, however, (http://snp.cshl.org/allele_frequency_project/). In addi- because mutations occur in nonscaled time. More re- tion to data availability, the AFS has other, analytical advan- cently, Wooding and Rogers (2002) derived a method tages over MD data, most notably its independence from called the matrix coalescent that overcomes these diffi- culties and calculates the AFS under arbitrarily changing population size histories. Their approach solves the problem for the general case, but leads to an involved 1Corresponding author: Department of , Boston College, 140 Commonwealth Ave., Chestnut Hill, MA 02467. computational procedure requiring numerical matrix E-mail: [email protected] inversion. In this study, we have taken a different ap-

Genetics 166: 351–372 ( January 2004) 352 G. T. Marth et al.

proach. By extending Fu’s result from a stationary popu- lations, practically all possible simple shapes of popula- lation history to a more general shape, a profile of demo- tion history have been proposed: constant effective size graphic history characterized by an arbitrary number of (stationary history), growth relative to an ancestral effec- epochs such that the effective population size is constant tive size (population expansion), size reduction (col- within each epoch, we have arrived at a very simple, lapse), and bottleneck (a phase of size reduction fol- easily computable formula for the AFS. The price we lowed by a phase of growth or recovery); see Figure 1. pay is the lack of generality of arbitrary shapes. In many These claims as well as the underlying data have been practical situations, however, these shapes can be ap- reviewed by various authors (Harpending and Rogers proximated by a piecewise constant effective size profile. 2000; Wall and Przeworski 2000; Jorde et al. 2001; The advantage is a formulation that permits very rapid Rogers 2001; Ptak and Przeworski 2002; Tishkoff generation of AFS under a large number of competing and Williams 2002). It is generally agreed that variation histories for accurate data fitting and hypothesis testing. patterns in mitochondrial DNA show rapid expansion This result is applicable when the sites under consider- of effective size in all human populations. Results in ation are selected randomly and the number of success- microsatellite data are less unanimous about which pop- fully genotyped samples is identical at each site. For the ulations experienced expansion or what the magnitude data set we are considering both of these assumptions and starting time of such demographic events were. are violated. First, the sites in question were selected Recent studies of SNP data sets in nuclear DNA propose for the population allele frequency characterization of the possibility of a population collapse to explain re- a large subset of SNPs from a genome-wide map (Sachi- duced haplotype diversity (Clark et al. 1998; Reich et danandam et al. 2001) of SNPs discovered by computa- al. 2001, 2002; Gabriel et al. 2002), especially in samples tional means, in large mining efforts in the public (Alt- of European ancestry, a hypothesis consistent with our shuler et al. 2000; Mullikin et al. 2000; Lander et al. observations in the current data set. 2001; Marth et al. 2003) and private (Venter et al. 2001) domains, numbering millions of sites. Common in these efforts is that SNP discovery was carried out in METHODS samples of a small number of chromosomes (two or Allele frequency spectrum under stepwise constant three). The samples used in the discovery phase were effective population size: We show that, for a population different from the samples used in the consequent geno- evolving under the Wright-Fisher model, and under se- type characterization experiments, and they repre- lective neutrality, the expectation for the number of ⌿ sented an unknown mixture of ethnicities. Second, be- mutations i of size i, within a sample of n chromosomes cause of genotyping failures, the number of successful under a demographic history of multi-epoch, piecewise genotypes varies from site to site, raising the question constant effective population size is of how to compare allele counts across these sites. In this ␮ ⌿ ϭ 4 N1 work, we propose methods to deal with these practical E( i) i problems. The resulting suite of tools enables us to analyze  the shape of the AFS observed in the data directly and to MϪ1 Ϫ Ϫ Ϫ1 ϩ  ␮Nmϩ1 Nm n 1 ͚ 4 ΂ ΃ evaluate competing scenarios of demographic history on ϭ i i m 1  the basis of how well they fit the observations.   n n   Demographic history: The reconstruction of human  n Ϫ k Ϫ j ␶ l(l Ϫ 1)  ϫ ͚ ͚  ΂ ΃ m* ͟  ΂ Ϫ ΃ e 2 Ϫ Ϫ Ϫ , demographic history is of direct biological and anthro- kϭ2 i 1 jϭk  l:l϶j; l(l 1) j(j 1)  Յ Յ  pological interest. Additionally, the history of effective k l n population size has a profound effect on important (1) quantities such as the extent of linkage disequilibrium ␮ where is the (constant) per-locus , Nm and is therefore important for medical association stud- is the effective population size in epoch m, Tm is the ies. There have been many attempts for demographic ␶ ϭ m corresponding epoch duration, and *m ͚lϭ1Tl/2Nl , inference from contemporary molecular data represent- the normalized epoch boundary time. A detailed deriva- ing different molecular mutation systems such as mito- tion of this result is given in the appendix. The normal- chondrial DNA polymorphisms (Di Rienzo and Wilson ized distribution of these expectations according to the 1991; Rogers and Harpending 1992; Sherry et al. frequency is the allele frequency spectrum: 1994; Ingman et al. 2000), microsatellites (Di Rienzo et ϭ al. 1998; Kimmel et al. 1998; Reich and Goldstein Pn(i) Pr(a given segregating site is size i in n samples) 1998; Relethford and Jorde 1999; Gonser et al. 2000; ⌿ ϭ E( i) ϭ Ϫ Zhivotovsky et al. 2000), and, more recently, SNPs in nϪ1 , i 1,...,n 1. (2) ͚ ϭ E(⌿ ) nuclear DNA (Harding et al. 1997; Clark et al. 1998; j 1 j Cargill et al. 1999; Zhao et al. 2000; Reich et al. 2001; It is sometimes useful to consider the “full” allele full Sachidanandam et al. 2001; Yu et al. 2001). For both frequency spectrum, P n (i), considering sizes 0 and n, global samples of human diversity, or specific subpopu- i.e., when all samples carry the ancestral or the derived Demographic Inference From SNP Data 353 allele, respectively. We have verified the accuracy of the the individual terms are close in value. Instability can complete allele frequency spectrum derived from this be avoided by accurate calculation of each term. The formulation by coalescent simulations (supplemental higher the sample size, the more accurately each term Figure S1 at http://www.genetics.org/supplemental/). has to be evaluated. We do not have a systematic way Three important properties of the allele frequency spec- to predict the accuracy requirement as a function of trum are clear from Equation 1. First, the expectation sample size, hence we determined the accuracy require- for a given frequency is linear under simultaneous scal- ment for a given sample size by trial and error. In our ing of all effective population sizes and epoch durations implementation, we have used high-accuracy numeric

(i.e., as long as Tm and Nm are multiplied by the same libraries with settable numeric precision. Our experi- constant for each m), hence the relative frequency spec- ence has been that, up to a sample size n ϭ 100, a trum remains unchanged. This fact can be exploited to numeric precision of 100 decimal places was sufficient reduce the number of parameters that characterizes a for our calculations. Evaluation of the allele frequency given demographic model under consideration. Sec- spectrum for a sample size of 1000 required a numerical .decimal places 500ف ond, the expected number of mutations of a given size precision of for more than one nucleotide site is simply the sum Correcting ascertainment bias: To describe the situa- of the individual expectations, without regard to any tion where polymorphic sites discovered in a set of sam- possible correlation among the site genealogy of proxi- ples are genotyped in a second, independently drawn mal sites. Therefore, our results for the expected num- set of samples for frequency characterization we divide ber of segregating sites as well as the allele frequency the two independent groups of samples into a “discov- spectrum are also valid for polymorphisms at a single ery” group consisting of k samples and a “genotyping” locus of arbitrary sequence length, without regard to group consisting of n samples. The discovery process is possible recombination within the locus, or for polymor- modeled by considering only those sites within the n ϩ phisms collected from throughout the genome. This k samples that are polymorphic (i.e., are of size between latter consideration allows us to apply the theoretical 1 and k Ϫ 1) within the discovery group of depth k and expectations derived here for the data set examined, discarding those sites that are monomorphic in this without regard to the amount and structure of linkage group, as these sites would not be considered for subse- between the sites represented within the set. Third, the quent genotyping. The conditional probability, Pn|k(i), allele frequency spectrum is independent of the actual that a site is of size i within the n genotyping samples value of the per-nucleotide, per-generation mutation given that it is polymorphic in the k discovery samples rate, as long as this rate is uniform for every site consid- is: ered. ϭ | Ϫ Minor allele frequency spectrum (folded spectrum): Pn|k(i) Pr(size i in n samples size between 1 and k 1ink samples)

In situations where allele frequency is determined ex- Ϫ ϭ Pr(size i in n samples AND size between 1 and k 1ink samples) perimentally by counting the two alternative alleles Pr(size between 1 and k Ϫ 1ink samples) within a sample of n chromosomes, it is uncertain which kϪ1 ϩ ϩ ϭ ͚lϭ1 Pr(size i l in n k samples AND size l in k samples) of the two alleles is the mutant allele. In such situations, Pr(size between 1 and k Ϫ 1ink samples) instead of the true frequency, we work with the fre- kϪ1 | ϩ ϩ ϩ ϩ ϭ ͚lϭ1 Pr(size l in k samples size l i in n k samples) · Pr(size l i in n k samples) quency of the less frequent (or minor) allele (Fu 1995). Pr(size between 1 and k Ϫ 1ink samples) The distribution of minor allele frequency is described k n k n kϪ1 ΂ ΃΂ ΃ nϩkϪ1 full kϪ1΂ ΃΂ ΃ l i ͚ ϭ ϩ l i ϭ 1 full ϩ ϭ l 1 P n k(l ) ϩ by the folded spectrum defined as Ϫ ͚ nϩk P nϩk(i l ) Ϫ ͚ nϩk Pnϩk(i l ) k 1 full ΂ ϩ ΃ k 1 full ΂ ϩ ΃ ͚lϭ1 P k (l )lϭ1 l i ͚lϭ1 P k (l ) lϭ1 l i n ˜ ϭ ϩ Ϫ Յ k n Pn(i) Pn(i) Pn(n i), i: i . (3) kϪ1΂ ΃΂ ΃ ϭ l i ϩ C ͚ nϩk Pnϩk(i l ). 2 ΂ ϩ ΃ lϭ1 l i (4) ϭ By this definition, if n is even, P˜ n(n/2) 2Pn(n/2), i.e., twice the value we would expect to measure, leading It is possible that a site that appears polymorphic within to a “doubling effect.” This fact needs to be taken into the k discovery samples is monomorphic within the n geno- account during the interpretation of measured data. typing samples. As a result, the conditional probabilities

Because in many data sets available for analysis the an- Pn|k(0) and Pn|k(n) are typically nonzero, and one has to cestral allelic state is currently unknown, the folded renormalize after the transformation to get the AFS. It spectrum is important in practice. is easy to verify that Equation 4 is also valid for calculat-

Numerical calculation of the allele frequency spec- ing the folded conditional spectrum P˜ n|k(i), as defined trum: Frequency spectrum calculations were imple- in Equation 3, provided that both folded spectra P˜ k(i) mented in the C programming language. Some care and P˜ nϩk(i) are available. This property makes it possible must be taken when calculating the expected spectrum, to account for the ascertainment bias when only the because computing Equation 1 requires the evaluation folded allele frequency distributions are available. For of alternating sums, a source of numeric instability when the sake of completeness, we include the conditional 354 G. T. Marth et al. spectrum for the important special case, k ϭ 2, i.e., number of relative counts as compared to the original ascertainment within a pair of chromosomes: observations. To obtain the AFS, one omits sizes 0 and nϩ1 full ϩ ϩ Ϫ m in Equation 7 and renormalizes. It is easy to verify 2͚kϭ1P nϩ2(k) (i 1)(n 1 i) P | (i) ϭ · P ϩ (i ϩ 1) that the equivalence reduction also works for the folded n 2 full ϩ ϩ n 2 P 2 (1) (n 1)(n 2) allele frequency distribution. ϭ ϩ ϩ Ϫ ϩ C(i 1)(n 1 i)Pnϩ2 (i 1). (5) We point out that our reduction procedure is not equivalent to frequency binning, a procedure some- It is easy to show that under a stationary history the times employed to compare allele counts available at spectrum is a linear function of i, and the folded spec- different samples sizes. Aggregating discrete allele fre- trum is constant (Figure 2a). quency data on the basis of a nominal allele frequency We point out that our method of ascertainment bias c/n, the ratio of allele counts and the sample size, results correction improves on an earlier method based on in data distortion stemming from two sources. First, for using the measured discrete allele frequency as an esti- ϭ Ϫ1 a given sample, the inherent base frequency is fn n . mator for the overall allele frequency within the popula- In general, only window sizes that are integer multiples tion (Sherry et al. 1997; see supplemental Figure S2 at of fn will preserve the uniform appropriation of allele http://www.genetics.org/supplemental/). sizes into frequency bins. This may be impossible if Reduction of allele frequency counts to equivalent multiple sample sizes are present in the data. Second, counts at a lower sample size: Often allele frequency sites with identical nominal allele frequencies but differ- data are the result of genotyping a target number, nt, ent sample sizes are not equivalent; e.g., a site with a minor of individuals at a collection of polymorphic sites. Because allele count of 1 in 3 samples is clearly not equivalent of genotyping failures, however, the actual number of to a site with a minor allele count of 10 in 30 samples. genotypes available at different locations is smaller and Distortions from both sources are most pronounced at often varies from site to site. At sites where an identical lower sample sizes. Our equivalence reduction proce- number, n, of successfully determined chromosomal dure is a technique of data aggregation that is free allelic states are available we denote the distribution of of such distortions. This point is further illustrated in allele counts by Cn(i) and the corresponding probability supplemental Figure S3 at http://www.genetics.org/ distribution obtained by normalizing these counts by supplemental/, where we compared the AFS resulting Pn(i). Sites with different numbers of successful geno- from simple binning of all available data for the Euro- types are not directly comparable. To enable joint analy- pean samples to the AFS we obtain by the equivalence sis of allele counts observed at all sites genotyped in the data reduction procedure presented here. experiment, we have devised a procedure that, given Coalescent simulations and tabulation of linkage dis- an observed distribution of allele frequencies among equilibrium: We used coalescent simulations to verify samples, produces an equivalent distribution at a lower the accuracy of our allele frequency spectrum calcula- sample size, m. This is achieved by, first, considering all tions (supplemental Figure S1), to tabulate measures possible choices of m subsamples selected from the total of linkage disequilibrium, and to tabulate distributions n available samples, in such a way that each choice is of mutation age. To perform these simulations, we have equally likely and, second, requiring that the total num- implemented a widely used, direct coalescent algorithm ber of observations remains the same. Under these as- (Hudson 1991). The simulation software was first imple- sumptions, the “equivalent” allele counts, Cm(i), for m mented in Perl for rapid coding and error checking subsamples are and then reimplemented in Cϩϩ for increased compu- m nϪm tational speed. To verify the direct formula, we have nϪmϩi ΂ ΃΂ jϪi ΃ ϭ ϭ ͚ i ϭ run coalescent simulations under a variety of population Cm(i) E(Cm(i)) ΂n΃ Cn(j), i 0,...,m, (6) jϭi j history scenarios, tabulated the allele frequency spectra,

m nϪm and compared them to the computed predictions. To nϪmϩi ΂ ΃΂ Ϫ ΃ i j i verify the conditional spectrum calculations, we have simu- ϭ ͚ full ϭ Pm(i) ΂n΃ P n (j), i 0,...,m. (7) ϩ jϭi j lated n k chromosomes within a common genealogy, designated k samples as the discovery group, and n sam- Note that this procedure does not allow one to gener- ples as the genotyping, or frequency measurement, ate a higher sample size distribution on the basis of a group. Of all the sites that were polymorphic within lower sample size distribution. Also note that, even if the n ϩ k samples, we discarded those sites that were the higher sample size distribution was a relative allele monomorphic within the k discovery samples and kept frequency spectrum, the resulting lower sample size dis- the remaining sites. We then tabulated the allele fre- tribution will contain nonzero terms for size 0 and for quency counts at these sites among the n genotyping size m. Clearly, the first case is the result of the possibility samples. that the omission of n Ϫ m chromosomes left us with 0 Expectations for the extent of linkage disequilibrium mutant alleles, and the second is that only mutant alleles were generated according to a previously published remained. This results in a slight reduction of the total method (Kruglyak 1999). For each population, we Demographic Inference From SNP Data 355 used the best-fitting three-epoch model for the coales- in the past) parameter at 10,000, for each model class. cent simulations, with samples size n ϭ 100. Marker We have generated the unbiased allele frequency spec- allele frequencies were restricted to the range between tra by direct calculation using Equation 1, for a sample 0.25n and 0.75n. For each value of recombination frac- size of m ϩ 2, where m ϭ 41 is the (common) sample tion, we tabulated r2, a commonly used measure of link- size after data reduction, and k ϭ 2 is the discovery age disequilibrium defined as size. We then computed the conditional spectrum using Equation 4. Finally, we folded the spectrum using the (p Ϫ p · p )2 r 2 ϭ AB A B , (8) definition given in Equation 3. To quantify the degree pA · pa · pB · pb of fit between a given model and the observations we have used the likelihood of the observed data condi- where A and a denote the mutant and the ancestral tioned on the model: alleles at the first marker location, and B and b are the Ϫ alternative alleles at the second marker location. The c m 1 ϭ ci P(data|model) ΂ ΃ ͟ pi . (9) quantities pA, pa, pB, and pb are the corresponding allele c1,...,cmϪ1 iϭ1 frequency measurements, and pAB is the measured fre- quency of the haplotype defined by the combination of For generating the likelihood surface for the Euro- ␹2 allele A at the first marker position and B at the second pean bottleneck size vs. duration we used the metric marker position. Finally, marker age was tabulated by defined as registering the time of occurrence for each of the muta- mϪ1 Ϫ 2 (ci c · pi) tions during the simulations. ␹2 ϭ ͚ . (10) ϭ c · p Model fitting to observed allele frequency spectra: The i 1 i primary objective of the fitting experiments is to deter- In the above notations, ci is the observed number of mine the distribution of the posterior probability of the sites of size i, c is the number of total sites, pi is the predicted model parameters given the observed data: P(model| (relative) probability of size i, and m is the common sample data). With the help of our closed formula for the direct size to which all observations were reduced using the equiv- calculation of the AFS we were able to generate the alence data reduction procedure outlined earlier. expected AFS for a complete, high-resolution, multidi- Comparison between models with different epoch num- mensional grid overlaid on the parameter space that bers: Models within the same structure (same epoch num- we intended to explore. This direct approach yielded ber) could be directly compared on the basis of any of the likelihood distribution, P(data|model), computed the three goodness-of-fit metrics discussed above. Models at each grid point. Given that there is no sensible way with different numbers of epochs were compared using to assign an “informed” prior distribution to the model methods of normal hypothesis testing for nested models parameters, the distribution of the likelihood function (Ott 1991), on the basis of the likelihood of the data is equivalent to the posterior distribution and can be given each of the two models compared. The quantity ␭ ϭ used in ranking competing parameters. We point out 2 ln( ) 2 ln(P(data|model1)/P(data|model2)) is as- that an alternative method of achieving the same goal ymptotically ␹2 distributed, with degrees of freedom is to use a Markov-chain Monte Carlo (MCMC) tech- equal to the difference in the number of parameters nique to obtain the posterior distribution (Griffiths characterizing the models (i.e., adding one extra epoch and Tavare 1994a; Kuhner et al. 1995). We opted for increases the number of parameters by two). The larger the direct method because it was simple but computa- this quantity, the more significant the improvement that tionally feasible, by its nature avoided the convergence was achieved by the introduction of the extra epoch. If issues usually associated with MCMC, and allowed us to the quantity is small, the improvement in data fit does evaluate the likelihood function at every grid point, for not warrant the introduction of the extra parameters. each of the three population-specific AFS analyzed. Stepwise constant models of one, two, and three ep- ochs were considered. For each model class defined by RESULTS the number of epochs, a vector of parameters describing Modeling allele frequency: We considered a diploid the model was considered, including the effective popu- population whose demographic history was described lation size and the duration of the epoch (expressed in by a series of epochs such that the effective population terms of generations). We have sampled each effective size was stepwise constant within each epoch (e.g., Figure size parameter, Ni, between 1000 and 150,000 in steps 1) and showed that the expected number of samples of 1000 up to 30,000 and in steps of 5000 beyond 30,000, carrying a mutant allele can be described by a closed, and each epoch duration parameter, Ti, between 100 easily computable mathematical formulation (see and 50,000 in steps of 100 up to 10,000 and in steps of methods). We derived a method for incorporating the 500 beyond 10,000. Because of the scaling equivalence same frequency ascertainment bias into AFS models that of the relative distribution discussed earlier, we fixed was introduced into real data by the sampling strategies the ancestral size (the effective size of the epoch farthest used during SNP discovery and for revealing the strate- 356 G. T. Marth et al.

the attempted sample sizes are different. In such cases one selects a target sample size and applies the reduc- tion procedure to transform allele counts observed at higher sample sizes to the equivalent counts at this lower target sample size. It is then possible to fit the resulting single AFS containing the contribution of all available data instead of fitting multiple, often sparse spectra, one for each sample size present in the data. Minor allele frequency spectra observed in samples representing different world populations show differen- tial demographic histories: The SNP Consortium (http:// snp.cshl.org), an organization formed primarily for the discovery of a large set of human SNPs, has made well Figure 1.—Example of a three-epoch, piecewise constant, over 1 million polymorphic sites available in the public bottleneck-shaped population history profile. The ancestral domain (Sachidanandam et al. 2001). Most of these effective population size (N ) is followed by an instant reduc- 3 SNPs were discovered by comparing sequencing read frag- tion of effective size (N2). The duration of this epoch is T2 generations. This is followed by a stepwise increase of effective ments from multi-ethnic, anonymous, whole-genome population size to N1, T1 generations before the present. shotgun subclone libraries to the public genome refer- ence sequence (Sachidanandam et al. 2001); i.e., the vast majority of the SNPs were found in a discovery size gies’s consequent effect on SNP population frequency of two chromosomes (k ϭ 2). Quasi-random subsets of (methods). We illustrate the effect of this bias under these candidate sites were then selected for frequency different values of ascertainment sample size (Figure characterization in samples representing European- 2a). As expected, the bias toward sample enrichment American, African-American, and East Asian populations for common polymorphisms is strongest when SNPs are (for sample identifiers see http://snp.cshl.org/allele_ discovered in a pair of chromosomes, and it gradually frequency_project/panels.shtml). In this study, we disappears as discovery sample size increases. Under a chose the largest data set of allele frequency counts stationary population history, the folded spectrum un- resulting from genotypes provided by Orchid Biosci- der ascertainment in two chromosomes is a constant ences, of 42 individuals (84 chromosomes) drawn from function of frequency (methods), and deviations from each of the three populations (http://snp.cshl.org/ a horizontal line signal a nonstationary history that is allele_frequency_project/). Experimental results were easy to detect and interpret. In Figure 2b, we contrast reported for 33,538 sites. For a significant fraction of the ascertainment bias-corrected, minor allele fre- the sites genotyping was unsuccessful for one or more quency spectra for notable, competing scenarios of de- of the populations attempted. In some other cases, al- mographic history. When a population expands, an in- though genotyping was successful, all samples carried creasing number of chromosomes simultaneously incur the same allele and hence the site could not be con- new mutations, which results in an overabundance of firmed as polymorphic. For the purpose of our study, rare alleles in the spectrum. Conversely, a population we restricted our attention to those sites where (1) geno- collapse is a rapid loss of chromosomes, and the alleles typing from each of the three sample groups was success- present at high frequency are more likely to be carried ful (genotyping for a given population was considered by surviving chromosomes than are their rare counter- successful if genotype data were obtained for at least parts. For that reason a collapse generates an overrepre- half the population samples, i.e., 21 individuals, even sentation of common alleles. Finally, AFS under a bottle- if only one of the alternative alleles was seen in that neck history (a reduction of effective size followed by population) and (2) the site was polymorphic within at a phase of recovery) carries the signature of both the least one of the three population samples. Of the total phase of collapse (a valley at intermediate frequencies) 21,407 sites that were successfully genotyped in all three and that of growth (elevated signal at low frequencies). populations the European samples were polymorphic We report a procedure to transform allele counts at 18,660 sites, the African samples at 20,587 sites, and at a given sample size to a lower, target sample size the Asian samples at 17,369 sites. At a given site, the (methods). Using this equivalence sample size reduction total number of alleles counted varied between 42 (the procedure, allele count observations at all sites can be minimum number possible, in case only 21 diploid indi- reduced to the equivalent counts at a lower, “common viduals were successfully genotyped within a popula- denominator” sample size, as illustrated in Figure 3. tion) and 84, the maximum possible if all 42 individuals This procedure is useful for analyzing allele counts at within a population sample were successfully genotyped. sites where the number of available genotypes is variable To use all the data available, we have applied our equiva- either because a fraction of attempted genotyping ex- lence sample size reduction procedure (methods)to periments failed or when merging data sets in which convert the allele count data to a common denominator Demographic Inference From SNP Data 357

Figure 2.—Ascertainment bias. (a) Folded spectra under stationary history, at various values of “discovery sample” size k (methods). (b) Allele frequency spectra predicted under competing scenarios of population history (conditioned on pairwise ascertainment k ϭ 2). Equilibrium his- ϭ ϭ ϭ tory, N1 10,000; expansion, N1 20,000, T1 ϭ ϭ ϭ 3000, N2 10,000; collapse, N1 2000, T1 500, ϭ ϭ ϭ N2 10,000; bottleneck history, N1 20,000, T1 ϭ ϭ ϭ 3000, N2 2000, T2 500, N3 10,000. (a and b) Sample size n ϭ 41.

sample size. Because the identity of the ancestral and our web site: www.ncbi.nlm.nih.gov/IEB/Research/ the mutant allele was not known, we used the allele GVWG/AFS-2003/. counts of the less frequent (or minor) allele, giving rise To assess the signals of population history within these to a folded spectrum (methods). To avoid the “dou- observed distributions, we generated allele frequency bling” effect associated with folding the allele frequency spectra as predicted under competing scenarios of pop- spectrum when the sample size is an even number, as ulation history of varying complexity: stationary history described in methods and in particular by Equation 3, (one epoch), expansion or collapse (two epoch), and we chose the common denominator sample size as m ϭ all possible shapes of three-epoch histories (methods). 41, i.e., the first odd number below the (even) sample For a given set of model parameters, we generated the size 42. The unfolded spectrum hence lies between 1 corresponding theoretically predicted, ascertainment and 40 (sizes 0 and 41 indicate monomorphisms). Ac- bias-corrected minor allele frequency spectrum and cordingly, the folded spectrum lies between minor allele evaluated the degree of fit between the prediction and sizes 1 and 20, for each of the three population-specific the observations (methods). For each population-spe- sample groups (Figure 4, first column). The allele fre- cific data set and for each model structure (number of quency data used in our analysis are available through epochs), we determined the best-fitting model parame- 358 G. T. Marth et al.

Figure 3.—Sample size reduction. Folded, normalized allele frequency dis- tribution for each sample size (n ϭ 42, ...,84)present in the European allele count data (gray) is shown. The allele frequency spectra obtained using the equivalence sample size reduction tech- nique (methods) are also shown for var- ious equivalence sample sizes (m ϭ 21, 31, and 41; green).

ters and the corresponding measures of goodness of fit. (N, effective number of individuals) and duration (T, By definition of the likelihood function used for data generations) of the recovery phase was within a narrow ϭ ϭ fitting, the best-fitting model parameters are the maxi- range (N1 19,000–21,000, T1 2700–3000). Parame- ϭ mum-likelihood parameter estimates for that model ters of the bottleneck phase were in a wider range (N2 ϭ class (Table 1). 1000–4000 and T2 200–1300), with several alternative The normalized observed allele frequency distribu- pairs available: longer but less severe bottlenecks or tions for each population group and the corresponding shorter, more severe bottlenecks. Given the potential best-performing distributions within each model class interest in a possible bottleneck in the history of Euro- are shown in Figure 4. In all three population-specific pean populations, we further investigated the strength spectra, stationary history is a poor descriptor of the of the bottleneck signal by fixing the recovery size and ϭ ϭ data, both by visual inspection and by examination of duration parameters (N1 20,000, T1 3000) and vary- the fit values in Table 1. The best-fitting two-epoch ing the bottleneck size N2 and duration T2 in fine incre- model for all three spectra is that of expansion (Table ments (20). For each parameter combination, we evalu- 1). In the European (Figure 4a) and in the Asian (Figure ated the goodness of fit to the European spectrum as 4b) samples the best-fitting three-epoch model is one measured by the ␹2 statistics and reported the resulting of a bottleneck-shaped history. In the European data, probability surface in Figure 5. The best-fitting parame- the curve fit produced by the bottleneck profile is a very ter combinations (ones not rejected by the ␹2 test even significant improvement over that produced by histories at the 99.8% level) lie on a slightly curved line between of expansion. In the Asian data, the improvement is still the following pairs: effective size of 1040 during the significant but to a lesser degree. The best-fitting three- bottleneck for 240 generations and effective size 2320 epoch models in African-American data (Figure 4c) rep- for 560 generations. The most likely model, at this reso- resent a two-step population increase of moderate size. lution, is a bottleneck effective size of 1560 for 360 In addition to the best-fitting models, a range of pa- generations. These values and the ratio of effective pop- rameter values produced comparably good fit to the ulation size and bottleneck duration being nearly con- observations. We have examined parameter sets that stant in a large region are in good agreement with previ- produced likelihood values that were at least 90% of ous reports (Reich et al. 2001). In the Asian data (Figure the value obtained for the best-fitting three-epoch pa- 4b), all parameters including those characterizing the ϭ rameter set. Analysis of these “close to optimal” parame- bottleneck phase were within a tight range: N2 3000– ϭ ϭ ϭ ter values in the European data shows that both the size 5000, T2 600–1000, N1 24,000–26,000, and T1 3000– Demographic Inference From SNP Data 359

Figure 4.—Model fitting to folded AFS observed in population-specific genotype data reduced to common sample size, m ϭ 41. (a) European spectrum. (b) Asian spectrum. (c) African-American spectrum. First column, observed allele frequency spectrum (black), best-fitting three-epoch theoretical model prediction (green), and prediction under stationary effective size (red); second column, breakdown of mutations according to age within each frequency class of the best-fitting model spectra [color bands correspond to a range of 1000 generations (e.g., black band, 1–1000 generations; red band, 1001–2000 generations)]; third column, distribution of mutation times (generations in the past) at each frequency, based on 1 million simulation replicates. Notched box: 25%, median, 75%. Whiskers: min/max values. Open square: mean value. Open circle: 5%, 95% values.

3200. Similarly narrow ranges were observed for the ple and rapid way to generate expected distributions ϭ ϭ African-American data (Figure 4c): N2 16,000, T2 of allele frequency under stepwise constant models of ϭ ϭ 13,000–15,000, N1 26,000–30,000, and T1 2000– effective population size history. This procedure is or- 2600. ders of magnitude faster than tabulating simulation rep- licates, especially for large sample sizes, permitting fast generation of model spectra to explore large parameter DISCUSSION spaces at high resolution. The method of ascertainment Significance of the allele frequency analysis methods bias calculation we have presented permits the interpre- presented here: Equation 1 (methods) provides a sim- tation of allele frequency spectra measured at polymor- 360 G. T. Marth et al.

TABLE 1 Results of fitting multi-epoch models of allele frequency spectrum to population-specific observed allele frequency data

Model Model Resulting pairwise ␪ Improvement over structure parameters (units of 10Ϫ4)lnP(data|model) lower-epoch model a. European data ϭ Ϫ One epoch N1 10,000 8.00 55.98 — ϭ Ϫ ␭ϭ Two epoch N2 10,000 8.74 38.11 2 ln 35.74 ϭ Ͻ Ϫ4 N1 140,000 P 10 ϭ (T1 2,000) Highly significant ϭ Ϫ ␭ϭ Three epoch N3 10,000 7.88 23.72 2 ln 28.78 ϭ Ͻ Ϫ4 N2 2,000 P 10 ϭ (T2 500) Highly significant ϭ N1 20,000 ϭ (T1 3,000)

b. Asian data ϭ Ϫ One epoch N1 10,000 8.00 74.26 — ϭ Ϫ ␭ϭ Two epoch N2 10,000 8.63 31.95 2 ln 84.62 ϭ Ͻ Ϫ4 N1 50,000 P 10 ϭ (T1 2,000) Highly significant ϭ Ϫ ␭ϭ Three epoch N3 10,000 8.24 26.39 2 ln 11.12 ϭ ϭ N2 3,000 P 0.0039 ϭ (T2 600) Significant ϭ N1 25,000 ϭ (T1 3,200)

c. African-American data ϭ Ϫ One epoch N1 10,000 8.00 197.86 — ϭ Ϫ ␭ϭ Two epoch N2 10,000 9.20 28.69 2 ln 338.34 ϭ Ͻ Ϫ4 N1 18,000 P 10 ϭ (T1 7,500) Highly significant ϭ Ϫ ␭ϭ Three epoch N3 10,000 10.29 26.72 2 ln 3.94 ϭ ϭ N2 16,000 P 0.1395 ϭ (T2 15,000) Not significant ϭ N1 26,000 ϭ (T1 2,400)

phic sites selected from existing variation resources. Our Table 1). Clearly, the shapes of the European and the procedure of equivalence sample size reduction enables Asian spectra are closer to each other than either is to the analysis of realistic data sets with genotyping failures. the shapes of the African spectra. On the basis of the All three of the above procedures are firmly rooted three-epoch models, both the European and the Asian within the coalescent framework. Model calculations data are best explained by bottleneck-shaped histories, directly correspond to experimentally observable quan- whereas the best-fitting third-order model for the Afri- tities, without referencing directly unobservable quanti- can-American data is a continued expansion. The results ties such as the overall population frequency of alleles. of hierarchical model testing (methods) in Table 1 The data-fitting methodology is conceptually simple and show that the inclusion of the third epoch did not sig- allows direct comparison of the degree of fit between nificantly improve the fit to the African-American data. each of the three population samples examined, at each However, the bottleneck history is a dramatic improve- grid point (parameter combination). ment over the best-fitting two-epoch growth models in Differential population histories in the three sample both the European and Asian data. Considering the sets: On the basis of the goodness of fit between models range of models that produced close to optimal fit val- and observations (Table 1), a history of stationary popu- ues, but using a fixed, 20-year generation time, the Euro- lation size can be confidently rejected for all three sets pean bottleneck represented a 2.5- to 10-fold decline of samples. Introduction of even very simple dynamics in population size, lasting 200–1300 generations [4–26 into the history has dramatically improved data fit. thousand years (KY)]. This was followed by a phase of There were large differences among the allele frequency 5- to 20-fold population expansion, starting 2700–4300 spectra observed in the three populations (Figure 4 and generations (54–86 KY) ago. The Asian bottleneck rep- Demographic Inference From SNP Data 361

Figure 5.—Bottleneck size and duration in the European samples. The probability surface of the effective size and the duration ϭ of a bottleneck are shown. Size of the ancestral epoch is fixed at N3 10,000, size of the present epoch is fixed at 20,000, and ϭ the duration of the present epoch is fixed at T1 3000. Parameter regions indicated by shading fall into the same bin of significance. Note that the P values indicated are the direct ␹2 probabilities (i.e., 1 minus the tail probability).

resented a 2- to 3-fold decline for 600–1000 generations neck severity index (in our notation T2/N2) and consider (12–20 KY), followed by 5- to 8-fold growth starting moderate bottlenecks where the expansion ratio is 20 3000–4200 generations (60–84 KY) ago. The best-fitting and the severity index is in the range of 0.25 and 4.0. Our models for the African-American data represent unin- own estimates (expansion ratio 5–20 for Europeans, 5–8 -for both popula 0.2ف terrupted growth of effective population size, with the for Asians, and severity index of expansion clearly starting earlier than is evident in our tions) are in general agreement with these values and European or the Asian data. signify bottlenecks on the less severe end of the spec- Earlier mitochondrial and microsatellite studies re- trum. Our estimates for the start of the recovery phase port data that are predominantly consistent with expan- (54–86 KYA for Europeans, 60–84 KYA for Asians) are sion-type histories of effective population size. The main well within the range of the mitochondrial and microsa- evidence that points to expansion is negative values of tellite estimates. The fact that our best-fitting two-epoch Tajima’s D and an excess of low-frequency alleles. The models indicate expansion-type histories for all three start of such expansion is estimated between 30 and 130 populations we examined is also consistent with conclu- KYA (Harpending and Rogers 2000). Nuclear data, sions from mitochondrial and microsatellite data. A val- especially in samples of non-African origin, seem to uable reality check of an inferred demographic model show a different pattern, an excess of common variants is its implied pairwise nucleotide diversity value, ␪. Al- (Hey 1997; Clark et al. 1998; Reich et al. 2001, 2002). though our data-fitting analysis of the relative spectrum Simulation results have suggested that a bottleneck- does not provide absolute estimates for ␪, these values shaped history of effective population size consisting of can be obtained on the basis of the best-fitting models ␮ a phase of collapse followed by a recent phase of size by fixing the ancestral size N3 and mutation rate . recovery can reconcile this seeming contradiction be- For each of the three populations, we use a common tween observations from different mutation systems ancestral effective size of 10,000 and common mutation (Fay and Wu 1999; Hey and Harris 1999). These stud- rate of 2 ϫ 10Ϫ8 [a value that lies between recent, promi- ies characterize bottleneck-shaped histories by a size nent estimates for average per-nucleotide, per-genera-

expansion ratio (in our notation N1/N2) and a bottle- tion human mutation rate (Nachman and Crowell 2000; 362 G. T. Marth et al.

Kondrashov 2003 )]. This leads to an estimate of ␪ϭ pean and Asian SNPs have originated Ͻ10,000 genera- 7.88 ϫ 10Ϫ4 for the European model, in good agreement tions ago and have drifted to high population frequency. with previously reported values for other genome-wide Finally, the third column of Figure 4 shows the average data sets (Sachidanandam et al. 2001; Venter et al. 2001; age of SNPs at given frequencies, confirming that SNPs Marth et al. 2003). The prediction from the Asian data at a higher frequency are expected to be older than is slightly higher, 8.24 ϫ 10Ϫ4. The pairwise ␪ predicted SNPs at lower frequencies. Also, in each frequency class, by the best-fitting model for the African-American data the expected age of African SNPs is substantially higher is 10.29 ϫ 10Ϫ4, significantly higher than that observed than that of European or Asian SNPs, corroborating within the European and Asian samples, and in agree- earlier observations noting the more ancient origins of ment with the general consensus that nucleotide diver- African SNPs. sity is higher in sub-Saharan samples than in non-African The differential demographic histories of the three data (Relethford and Jorde 1999; Przeworski et al. populations examined also have important conse- 2000; Jorde et al. 2001; Tishkoff and Williams 2002). quences for the extent of allelic association in the hu- All three estimates are well within realistic values, lend- man genome, when the different populations are con- ing further credence to the validity of our model param- sidered. To illustrate this point, we have carried out eters. coalescent simulations, taking into account the individ- A bottleneck-shaped history was also our best-fitting ual best-fitting histories, and tabulated the average ex- three-epoch model structure for MD distributions ob- tent of linkage disequilibrium (LD) between markers served in overlap fragments of public genome clone separated by different values of recombination fraction data (Marth et al. 2003). However, the parameter esti- (for a fixed value of per-nucleotide, per generation re- mates are significantly different between these two stud- combination rate, the recombination fraction translates ies. Our estimates from MD data indicated a less severe into physical distance), as shown in Figure 6. Similar bottleneck of nearly identical duration and a shorter demographic histories distilled from the Asian and Eu- phase of recovery of more modest size as compared to ropean samples result in similar values of LD at a given the AFS in the European samples. Multiple factors may marker distance. LD is predicted to decay more rapidly contribute to these differences. First, the DNA samples (roughly twice as fast) for the best-fitting demographic for the two studies came from different donors. Second, history for the African-American samples, in agreement some fraction of the large-insert clones sequenced for with previous reports (Reich et al. 2001). Differences the construction of the public genome reference se- in the extent of allelic association within the genome are quence originate from libraries that are not of European expected to have profound consequences for medical origin [although there appears to be an overrepresenta- association studies. tion of European sequences (Weber et al. 2002), pre- Caveats and open problems: Clearly, our multi-epoch, sumably due to the origin of a single bacterial artificial stepwise models of demographic history represent sim- chromosome library with the largest contribution]. If plified versions of the “true” demographic past. Never- indeed an appreciable fraction of the data represents theless, our three-epoch models go beyond the majority sub-Saharan DNA, the resultant MD in these mixed data of previous studies that explore even simpler models of could indicate a less severe bottleneck than would have past population dynamics such as expansion vs. collapse been evident in a distribution containing only European or are restricted to the rejection of stationary effective data. size on the basis of summary statistics. Consideration of To understand the consequences of the differential the third-order dynamics in this study allowed us to histories that best describe the three population-specific reveal a phase of bottleneck in the history characterizing data sets, we have partitioned the corresponding fre- the European and the Asian samples, permitting recon- quency spectra according to the age of the mutations ciliation of the signals of recent population growth ap- (methods) that gave rise to the polymorphisms (Figure parent in mitochondrial and microsatellite data with 4, second column). According to these tabulations, 35.9% realistic, observed values of nucleotide diversity. of the European polymorphisms originated in Ͻ10,000 Although the signal of differential history is undeni- generations, as did a similar fraction, 34.9%, in the Asian able in the data, the effect is confounded by the fact model. In contrast, only 29.6% of the African mutation that the discovery and genotyping data sets were not are younger than 10,000 generations. This indicates that drawn from a single population. SNP discovery was per- the bottleneck events that explain the European and formed in shotgun sequences from ethnically diverse Asian data have eliminated a large fraction of the poly- libraries (with ethnic association of individual reads un- morphisms that predated these events, and a larger frac- known) aligned to the public genome reference se- tion of current polymorphisms are of a more recent quence (Sachidanandam et al. 2001), presumably rep- origin as compared to the African data. This effect is resenting a mixture of ethnicities, with a bias toward most visible at the common end of the spectrum: only clones from European donors (Weber et al. 2002). Poly- a negligible fraction of the common African SNPs are morphic sites generated by this effort were then selected young, but an appreciable fraction of common Euro- for genotyping in ethnically well-defined samples. It has Demographic Inference From SNP Data 363

Figure 6.—The average ex- tent of linkage disequilibrium, as predicted by the best-fitting, three-epoch demographic models for the three popula- tion samples. Values of r 2 and the corresponding values of re- combination fraction are shown for each of the three populations. On the right- hand side, we have indicated the equivalent physical dis- tances assuming a genome av- erage per-nucleotide, per-gen- eration recombination rate, r ϭ 10Ϫ8 (methods).

been previously noted that collections of samples from netic hitchhiking can mimic the effects of population multiple ethnicities contain a surplus of rare SNPs when expansion in that it gives rise to an excess of low-fre- measured in the same mixed collection (Ptak and quency alleles (Kaplan et al. 1989; Braverman et al. Przeworski 2002). However, it is unclear what the allele 1995). Recent efforts have been aimed at detecting loci frequency of the same SNPs is when measured sepa- that exhibit signatures of positive selection (Cargill et rately, within subpopulations. If the ethnicity of the al. 1999; Sunyaev et al. 2000; Akey et al. 2002; Payseur discovery and the genotyping samples were known, one et al. 2002). However, the exact proportion of could estimate the effect of the ascertainment bias with that have been targets of strong positive selection within models of population subdivision using coalescent simu- our evolutionary past is unclear (Bamshad and Wood- lation (Pluzhnikov et al. 2002). The effect of ascertain- ing 2003). It is also unclear, in general, how far the ment bias between ethnically mismatched or undefined effects of hitchhiking extend beyond the locus under samples is the subject of future investigation. selection (Wiehe 1998). Given that only a few percent Additionally, internal population substructure can of the human genome represents coding DNA, and also distort the frequency spectrum (Przeworski 2002; that not all genes are expected to be targets of positive Ptak and Przeworski 2002). Unfortunately, the little selection, we speculate that the distortion due to selec- amount of information that was available concerning tive forces on the AFS in our data set of Ͼ20,000 ran- sample origin did not permit incorporation of this effect domly selected genomic loci is small when compared into our models in a meaningful fashion. Specifically, to the global effects of drift modulated by long-term we did not take into account in our models the effects demography. of recent admixture in the African-American samples. Conclusion: The allele frequency spectrum is an ex- Although the AFS in these samples are best modeled cellent data source for modeling demographic history by population growth, it carries a slight but noticeable because of its independence of the effects of recombina- dip at medium minor allele frequencies, a feature pres- tion and local, or sequence composition-specific varia- ent in a more pronounced form in both the European tions of mutation rates and because the experimental (Figure 4a) and the Asian (Figure 4b) spectra. This determination of the allele frequency spectrum requires potentially signifies the contribution of European ances- measurement of allelic states only at single-nucleotide tral lineages on the background of African lineages positions, instead of sequencing of long stretches of (Rybicki et al. 2002) in the AFS signal. contiguous DNA. The emergence of population-specific We must also acknowledge that the current shape of genotype sets on the genome scale provides sufficient human variation structure is the result of a combination data for the direct comparison of model-predicted and of neutral and nonneutral (selective) forces. The cur- observed spectra with great resolution. This permits us rent state of the art in recognizing the effects of selection to improve on previous conclusions drawn on the in variation data has been reviewed recently (Bamshad strength of summary statistics, on the basis of data from and Wooding 2003). Positive selection resulting in ge- a handful of loci. Recent advances in allele frequency 364 G. T. Marth et al. modeling should provide us with exciting, new tools et al., 1997 Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet. 60: 772–789. to explore our demographic past and explain human Harpending, H., and A. Rogers, 2000 Genetic perspectives on hu- haplotype structure. Accurate reconstruction of the his- man origins and differentiation. Annu. Rev. Genomics Hum. tory of world populations should also help us to detect Genet. 1: 361–385. Hey, J., 1997 Mitochondrial and nuclear genes present conflicting and interpret differences that must be taken into ac- portraits of human origins. Mol. Biol. Evol. 14: 166–172. count during the development of general resources for Hey, J., and E. Harris, 1999 Population bottlenecks and patterns medical use such as the recently initiated human Haplo- of human polymorphism. Mol. Biol. Evol. 16: 1423–1426. Hudson, R. R., 1991 genealogies and the coalescent process, type Map Project (Cardon and Abecasis 2003; Clark pp. 1–44 in Oxford Surveys in Evolutionary Biology, edited by D. 2003; Wall and Pritchard 2003). Futuyama and J. Antonovics. Oxford University Press, Lon- don/New York/Oxford. The authors are indebted to Andrew Clark for useful comments Ingman, M., H. Kaessmann, S. Paabo and U. Gyllensten, 2000 on the manuscript. We also thank Ravi Sachidanandam for kindly Mitochondrial genome variation and the origin of modern hu- providing earlier versions of the allele frequency data set analyzed in mans. Nature 408: 708–713. this study. Jorde, L. B., W. S. Watkins and M. J. Bamshad, 2001 Population genomics: a bridge from evolutionary history to genetic medicine. Hum. Mol. Genet. 10: 2199–2207. Kaplan, N. L., R. R. Hudson and C. H. Langley, 1989 The “hitch- LITERATURE CITED hiking effect” revisited. Genetics 123: 887–899. Kimmel, M., R. Chakraborty, J. P. King, M. Bamshad, W. S. Watkins Akey, J. M., G. Zhang, K. Zhang, L. Jin and M. D. Shriver, 2002 et al., 1998 Signatures of population expansion in microsatellite Interrogating a high-density SNP map for signatures of natural repeat data. Genetics 148: 1921–1930. selection. Genome Res. 12: 1805–1814. Kondrashov, A. S., 2003 Direct estimates of human per nucleotide Altshuler, D., V. J. Pollara, C. R. Cowles, W. J. Van Etten, J. mutation rates at 20 loci causing Mendelian diseases. Hum. Mutat. Baldwin et al., 2000 An SNP map of the human genome gener- 21: 12–27. ated by reduced representation shotgun sequencing. Nature 407: Kruglyak, L., 1999 Prospects for whole-genome linkage disequilib- 513–516. rium mapping of common disease genes. Nat. Genet. 22: 139–144. Bamshad, M., and S. P. Wooding, 2003 Signatures of natural selec- Kuhner, M. K., J. Yamato and J. Felsenstein, 1995 Estimating tion in the human genome. Nat. Rev. Genet. 4: 99–111. effective population size and mutation rate from sequence data Braverman, J. M., R. R. Hudson, N. L. Kaplan, C. H. Langley and using Metropolis-Hastings sampling. Genetics 140: 1421–1430. W. Stephan, 1995 The hitchhiking effect on the site frequency Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody et spectrum of DNA polymorphisms. Genetics 140: 783–796. al., 2001 Initial sequencing and analysis of the human genome. Cardon, L. R., and G. R. Abecasis, 2003 Using haplotype blocks Nature 409: 860–921. to map human complex trait loci. Trends Genet. 19: 135–140. Li, W. H., 1977 Distribution of nucleotide differences between two Cargill, M., D. Altshuler, J. Ireland, P. Sklar, K. Ardlie et al., randomly chosen cistrons in a finite population. Genetics 85: 1999 Characterization of single-nucleotide polymorphisms in 331–337. coding regions of human genes. Nat. Genet. 22: 231–238. Marth, G., G. Schuler, R. Yeh, R. Davenport, R. Agarwala et al., Clark, A. G., 2003 Finding genes underlying risk of complex disease 2003 Sequence variations in the public human genome data by linkage disequilibrium mapping. Curr. Opin. Genet. Dev. 13: reflect a bottlenecked population history. Proc. Natl. Acad. Sci. 296–302. USA 100: 376–381. Clark, A. G., K. M. Weiss, D. A. Nickerson, S. L. Taylor, A. Mullikin, J. C., S. E. Hunt, C. G. Cole, B. J. Mortimore, C. M. Buchanan et al., 1998 Haplotype structure and population ge- Rice et al., 2000 An SNP map of human chromosome 22. Nature netic inferences from nucleotide-sequence variation in human 407: 516–520. lipoprotein lipase. Am. J. Hum. Genet. 63: 595–612. Nachman, M. W., and S. L. Crowell, 2000 Estimate of the mutation Crow, J. F., and M. Kimura, 1970 An Introduction to Population Genetic rate per nucleotide in humans. Genetics 156: 297–304. Theory. Harper & Row, New York. Ott, J., 1991 Analysis of Human Genetic Linkage. Johns Hopkins Uni- Di Rienzo, A., and A. C. Wilson, 1991 Branching pattern in the versity Press, Baltimore. evolutionary tree for human mitochondrial DNA. Proc. Natl. Payseur, B. A., A. D. Cutter and M. W. Nachman, 2002 Searching Acad. Sci. USA 88: 1597–1601. for evidence of positive selection in the human genome using Di Rienzo, A., P. Donnelly, C. Toomajian, B. Sisk, A. Hill et al., patterns of microsatellite variability. Mol. Biol. Evol. 19: 1143– 1998 Heterogeneity of microsatellite mutations within and be- 1153. tween loci, and implications for human demographic histories. Pluzhnikov, A., A. Di Rienzo and R. R. Hudson, 2002 Inferences Genetics 148: 1269–1284. about human demography based on multilocus analyses of non- Ewens, W. J., 1972 The sampling theory of selectively neutral alleles. coding sequences. Genetics 161: 1209–1218. Theor. Popul. Biol. 3: 87–112. Przeworski, M., 2002 The signature of positive selection at ran- Fay, J. C., and C.-IWu, 1999 A human population bottleneck can domly chosen loci. Genetics 160: 1179–1189. account for the discordance between patterns of mitochondrial Przeworski, M., R. R. Hudson and A. Di Rienzo, 2000 Adjusting versus nuclear DNA variation. Mol. Biol. Evol. 16: 1003–1005. the focus on human variation. Trends Genet. 16: 296–302. Fu, Y. X., 1995 Statistical properties of segregating sites. Theor. Ptak, S. E., and M. Przeworski, 2002 Evidence for population Popul. Biol. 48: 172–197. growth in humans is confounded by fine-scale population struc- Fu, Y. X., and W. H. Li, 1993 Statistical tests of neutrality of muta- ture. Trends Genet. 18: 559–563. tions. Genetics 133: 693–709. Reich, D. E., and D. B. Goldstein, 1998 Genetic evidence for a Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy et al., Paleolithic human population expansion in Africa. Proc. Natl. 2002 The structure of haplotype blocks in the human genome. Acad. Sci. USA 95: 8119–8123. Science 296: 2225–2229. Reich, D. E., M. Cargill, S. Bolk, J. Ireland, P. C. Sabeti et al., Gonser, R., P. Donnelly, G. Nicholson and A. Di Rienzo, 2000 2001 Linkage disequilibrium in the human genome. Nature Microsatellite mutations and inferences about human demogra- 411: 199–204. phy. Genetics 154: 1793–1807. Reich, D. E., S. F. Schaffner, M. J. Daly, G. McVean, J. C. Mullikin Griffiths, R. C., and S. Tavare, 1994a Simulating probability distri- et al., 2002 Human genome sequence variation and the influ- butions in the coalescent. Theor. Popul. Biol. 46: 131–159. ence of gene history, mutation and recombination. Nat. Genet. Griffiths, R. C., and S. Tavare, 1994b Sampling theory for neutral 32: 135–142. alleles in a varying environment. Philos. Trans. R. Soc. Lond. B Relethford, J. H., and L. B. Jorde, 1999 Genetic evidence for Biol. Sci. 344: 403–410. larger African population size during recent human . Harding, R. M., S. M. Fullerton, R. C. Griffiths, J. Bond, M. J. Cox Am. J. Phys. Anthropol. 108: 251–260. Demographic Inference From SNP Data 365

Rogers, A. R., 2001 Order emerging from chaos in human evolu- Venter, J. C., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural et tionary genetics. Proc. Natl. Acad. Sci. USA 98: 779–780. al., 2001 The sequence of the human genome. Science 291: Rogers, A. R., and H. Harpending, 1992 Population growth makes 1304–1351. waves in the distribution of pairwise genetic differences. Mol. Wall, J. D., and J. K. Pritchard, 2003 Haplotype blocks and linkage Biol. Evol. 9: 552–569. disequilibrium in the human genome. Nat. Rev. Genet. 4: 587– Rybicki, B. A., S. K. Iyengar, T. Harris, R. Liptak, R. C. Elston 597. et al., 2002 The distribution of long range admixture linkage Wall, J. D., and M. Przeworski, 2000 When did the human popula- disequilibrium in an African-American population. Hum. Hered. tion size start increasing? Genetics 155: 1865–1874. 53: 187–196. Weber, J. L., D. David, J. Heil, Y. Fan, C. Zhao et al., 2002 Human Sachidanandam, R., D. Weissman, S. C. Schmidt, J. M. Kakol, L. D. diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet. Stein et al., 2001 A map of human genome sequence variation 71: 854–862. containing 1.42 million single nucleotide polymorphisms. Nature Wiehe, T., 1998 The effect of selective sweeps on the variance of 409: 928–933. the allele distribution of a linked multiallele locus: hitchhiking Sherry, S. T., A. R. Rogers, H. Harpending, H. Soodyall, T. Jen- kins et al., 1994 Mismatch distributions of mtDNA reveal recent of microsatellites. Theor. Popul. Biol. 53: 272–283. human population expansions. Hum. Biol. 66: 761–775. Wooding, S., and A. Rogers, 2002 The matrix coalescent and an Sherry, S. T., H. C. Harpending, M. A. Batzer and M. Stoneking, application to human single-nucleotide polymorphisms. Genetics 1997 Alu evolution in human populations: using the coalescent 161: 1641–1650. to estimate effective population size. Genetics 147: 1977–1982. Yu, N., Z. Zhao, Y. X. Fu, N. Sambuughin, M. Ramsay et al., 2001 Sunyaev, S. R., W. C. Lathe III, V. E. Ramensky and P. Bork, 2000 Global patterns of human DNA sequence variation in a 10-kb SNP frequencies in human genes an excess of rare alleles and region on chromosome 1. Mol. Biol. Evol. 18: 214–222. differing modes of selection. Trends Genet. 16: 335–337. Zhao, Z., L. Jin, Y. X. Fu, M. Ramsay, T. Jenkins et al., 2000 World- Tajima, F., 1989 Statistical method for testing the neutral mutation wide DNA sequence variation in a 10-kilobase noncoding region hypothesis by DNA polymorphism. Genetics 123: 585–595. on human chromosome 22. Proc. Natl. Acad. Sci. USA 97: 11354– Tavare, S., D. J. Balding, R. C. Griffiths and P. Donnelly, 1997 11358. Inferring coalescence times from DNA sequence data. Genetics Zhivotovsky, L. A., L. Bennett, A. M. Bowcock and M. W. Feldman, 145: 505–518. 2000 Human population expansion and microsatellite varia- Tishkoff, S. A., and S. M. Williams, 2002 Genetic analysis of African tion. Mol. Biol. Evol. 17: 757–767. populations: human evolution and complex disease. Nat. Rev. Genet. 3: 611–621. Communicating editor: L. Excoffier

APPENDIX: THE EXPECTED NUMBER OF SEGREGATING SITES IN A SAMPLE DRAWN FROM A POPULATION CHARACTERIZED BY A PIECEWISE CONSTANT, MULTI-EPOCH HISTORY OF EFFECTIVE SIZE Model: We consider a population of a given organism evolving under the Wright-Fisher model and under selective neutrality. Let us select a specific site in the genome of the organism. Furthermore, let us randomly draw n DNA samples from this population. Without regard to recombination, the samples possess a unique tree-shaped genealogy at the selected site (the site genealogy). Such a genealogy can be described within the framework of the coalescent: starting with n samples in the present and, through a series of coalescent events (pairs of samples finding their common ancestors), this number reduces to 1, the most recent common ancestor (MRCA), or the root of the genealogy at that site (site root). At a given time, the process is said to be in state j, if at that time the current number of samples is j. This process is Markovian, in that the length of time until the next coalescent event depends only on the current state and is independent of the previous states. Due to molecular mutation processes, the nucleotide observed at the site under consideration might be different in different individuals. Let us assume that, at any given site, only two possible nucleotides are observed (diallelic variations). Accordingly, an individual carries either the allele that was present in the site root (also known as the ancestral allele) or a mutant or derived allele. Let us further assume that the mutant allele is the result of a single mutation event (infinite-sites assumption) within an ancestral sample of the site genealogy. Under this assumption, the number of samples that carry the derived allele is identical to the number of descendants of that ancestor within the site genealogy. Conversely, the derived allele is found in exactly i samples if and only if the ancestor in which the mutation occurred gave rise to i descendants. Under the further assumption of a constant-rate mutation process (Hudson 1991), the likelihood that a given mutation is of size i is related to the number of ancestral nodes with i descendants within the site genealogy and to the “life span” of these ancestors. As Fu shows in a seminal work (Fu 1995), this likelihood can be expressed with the length of time the site genealogy spends in state k, i.e., while the number of ancestor samples within the genealogy is exactly k. Under the further assumption of constant effective population size N, Fu then derives an explicit formula for the expected length of time in state k, leading to a simple result for the expected number of mutations of a given size within n samples (Fu 1995). Our final goal is to extend this result from constant to merely piecewise constant population size. To this end, we use a standard continuous approximation according to which the probability density function of the length of time t spent in state k within the genealogy is exponential under a constant population size, and for a diploid population,

t k Ϫ͑͑k͒ ͒ ( ) 2 /2N 2 e . 2N 366 G. T. Marth et al.

Using this approximation, we derive the expectation for the length of time spent in state k, under piecewise constant population history of an arbitrary number of epochs. Under the assumption of a constant-rate mutation process, ⌿ this allows us to compute the expectation for the number of mutations of size i, denoted by i, observed at a single site, at sites having identical site genealogies (DNA without recombination), or at a collection of sites with completely independent site genealogies. Because the distributions are identical for every site, the result is also valid for a collection of sites. Conventions and useful identities: We use the convention that the value of an empty product is 1 and the value of an empty sum is 0. The probability density function of a random variable X is denoted by f X and its cumulative density function by F X. The variable X conditioned on the event Y is denoted by X|Y. Next, we briefly state three lemmas to aid further derivations. In the following we assume that the ai are different.

Lemma 1. For every value of x, for each 1 Յ l Յ n, n Ϫ ͚ ͟ am x ϭ Ϫ 1. (A1) iϭl m:m϶i am a i lՅmՅn

Proof. Let   n a Ϫ x f(x) :ϭϪ1 ϩ ͚  ͟ i  ; i:i϶j; Ϫ ϭ  a i a j j l lՅiՅn we need to show that f(x) ϵ 0. For r: l Յ r Յ n we have that Ϫ ϭϪ ϩ ͟ ai ar ϭ f(ar) 1 Ϫ 0. i:i϶r; a i ar lՅiՅn Since f(x) is of degree at most n Ϫ l and it has at least n Ϫ l ϩ 1 different zeros, necessarily f(x) ϵ 0. Q.E.D.

Lemma 2. For k, i:1Յ k Ͻ i Յ n we have     i a a a  ͚  i  ͟ l   ͟ m  ϭ Ϫ Ϫ 0. (A2) jϭk aj l:kϽlՅj al ak m:jՅm Ͻi am ai

Proof.     i a a a  ␤ :ϭ ͚  i  ͟ l   ͟ m  k,i Ϫ Ϫ . jϭk aj l:kϽlՅj al ak m:jՅm Ͻi am ai ␤ ϭ Ͼ ϩ k,kϩ1 0, and for i k 1       i a a a  a ␤ ϭ ͚  i  ͟ l   ͟ m  ϭ  ͟ l ␣ k,i Ϫ Ϫ Ϫ k,i , jϭk aj l:kϽlՅj al ak m:jՅm Ͻi am ai l:kϽl Յi al ak where

Ϫ     i 1 a a (a Ϫ a ) a Ϫ a a  ␣ ϭ ϩ ͚  i j i k  ͟ m k  ͟ m  k,i 1 · Ϫ · Ϫ jϭk aj (aj ai) ai m:jϽmϽi am  m:jϽm Ͻi am ai

Ϫ   i 1 a Ϫ a a Ϫ a  ϭ ϩ ͚  i k  ͟ m k 1 Ϫ Ϫ jϭk a j ai m:jϽm Ͻi am ai

iϪ2     a Ϫ Ϫ a a Ϫ a a Ϫ a ϭ i 1 k ϩ ͚ Ϫ ϩ j k  ͟ m k Ϫ 1 Ϫ Ϫ aiϪ1 ai jϭk  aj ai m:jϽm Ͻi am ai

 Ϫ   Ϫ    i 2  a Ϫ a  i 1 a Ϫ a a Ϫ a  ϭϪ͚  ͟ m k ϩ  ͚  j k  ͟ m k Ϫ Ϫ Ϫ jϭk m:jϽmϽi am ai jϭkϩ1 aj ai m:jϽmϽi am ai Demographic Inference From SNP Data 367

 Ϫ   Ϫ  i 2  a Ϫ a  i 1  a Ϫ a  ϭϪ͚  ͟ m k ϩ  ͚  ͟ m k ϭ Ϫ Ϫ 0. Q.E.D. jϭk m:jϽmϽi am ai jϭkϩ1 m:jՅmϽi am ai

Lemma 3. For s Ͻ k Ͻ i Յ n:

i      ai ͟ al ͟ am  ϭ ͚  ϶   ϶  0. (A3) ϭ l:l k; Ϫ m:m i; Ϫ j k aj sϩ1ՅlՅj al ak jՅmՅn am ai

Proof. From Lemma 2,       a a i a a a 0 ϭ  ͟ q   ͟ r  ͚ i ͟ l ͟ m  ϩ Յ Ͻ Ϫ Ͻ Յ Ϫ Ͻ Յ Ϫ Յ Ͻ Ϫ q:s 1 q k aq ak r:i r n ar ai jϭk aj l:k l j al ak m:j m i am ai

i      ϭ ai ͟ al ͟ am  ͚  ϶   ϶  . Q.E.D. ϭ l:l k; Ϫ m:m i; Ϫ j k aj sϩ1ՅlՅj al ak jՅmՅn am ai

Lemma 4.

n    n    1 ϭ 1  ͟ ai  Ϫ 1  ͟ ai  ͚ ϶  ͚  ϶ . ϭ l:i j; Ϫ ϭ ϩ i:i j; Ϫ as j s aj sՅiՅn ai aj j s 1 aj  sϩ1ՅiՅn ai aj

Proof. Using Lemma 1,

n     n   Ϫ     1 ϭ 1  ͟ ai  ϭ 1  ͟ ai  ϩ 1  Ϫ as aj  ͟ ai  ͚ ϶ ͚  1 ϶  ϭ i:i j; Ϫ i:sϩ1ՅiՅn Ϫ ϭ ϩ i :i j; Ϫ as as j s sՅiՅn ai aj as  ai aj j s 1 aj  as   sՅiՅn ai aj 

n    n    ϭ 1  ͟ ai  Ϫ 1  ͟ ai  ͚ ϶  ͚  ϶ . Q.E.D. ϭ i:i j; Ϫ ϭ ϩ i:i j; Ϫ j s aj sՅiՅn ai aj j s 1 aj  sϩ1ՅiՅn ai aj

Constant effective population size: First, we consider a demographic history characterized by a single, constant ϭ ͑j ͒ (1) ϭ population size N1. We introduce the notations aj 2 and a j a j/2N1. The length of time spent in state j (after Ϫ which the number of samples reduces from j to j 1) is denoted by Tj, jϪ1. The random variables Tj, jϪ1 Ϫ (1) ϶ ϭ (1) aj t and Ti,iϪ1 are independent for i j. The density function of Tj,jϪ1 is fTj, jϪ1(t) a j e , according to our model assumptions. The length of time from the present, when the number of samples is n, to the instant when the number {1} {1} ϭ ͚n of samples reduces to s, is denoted by Tn,s. Clearly Tn,s jϭsϩ1 Tj,jϪ1. The probability that, at time t, the genealogy {1} Յ Ͻ {1} {1} ϭ {1} ϩ Յ Ͻ is in state s is P(Tn,s t Tn,sϪ1). Since Tn,l Tn,lϩ1 Tlϩ1,l , for l :1 l n we can use the following convolution: {1} ϭ ͐t {1} Ϫ fTn,l(t) 0 fTn,lϩ1(t x)fTlϩ1,l(x)dx . Using these notations, the following are true:

Theorem 1. For s:1Յ s Ͻ n:

n    (1) (1) Ϫa t ͟ ai {1} ϭ ͚  j   fTn,s(t) a j e ϶ , (A4) ϭ ϩ i:i j; Ϫ j s 1  sϩ1ՅiՅn ai aj 

n    Ϫa (1)t ͟ ai {1} ϭ Ϫ ͚  j   FTn,s(t) 1 e ϶ , (A5) ϭ ϩ i:i j; Ϫ j s 1  sϩ1ՅiՅn ai aj 

n    n    {1} 1 ͟ ai 1 ͟ ai ͑Tn,s͒ ϭ ͚    ϭ ͚    E (1) ϶ 2N1 ϶ . (A6) ϭ ϩ i:i j; Ϫ ϭ ϩ i:i j; Ϫ j s 1 a j sϩ1ՅiՅn ai aj  j s 1a j sϩ1ՅiՅn ai aj  368 G. T. Marth et al.

For s:2Յ s Ͻ n:

n    {1} (1) f (t) {1} {1} a j Ϫa t ͟ ai Tn,sϪ1 ͑Tn,s͒ Յ Ͻ Tn,sϪ1͒ ϭ ͚ j   ϭ P t e ϶ (1) , (A7) ϭ i:i j; Ϫ j s as sՅiՅn ai aj  a s

1 ͑Ts,sϪ1͒ ϭ E (1) . (A8) a s For i:1Յ i Ͻ n: ␮ ⌿ ϭ 4N1 E( i) . (A9) i

Proof. First we show Equations A4 and A5 by downward induction on s. These equations are clearly valid for s ϭ n Ϫ 1. Assume they are valid for s : s Ͼ k. Then t {1} ϭ {1} Ϫ {1} f Tn,k (t) Ύ f Tn,kϩ1 (t x)f Tkϩ1,k(x)dx 0

n  t (1)Ϫ (1)  Ϫ (1) ai (a a ϩ )x ϭ  (1) (1) aj t ͟ j k 1  ͚ a kϩ1aj e ϶ Ύ e dx ϭ ϩ i:i j; Ϫ j k 2  kϩ2ՅiՅn (ai aj) 0 

n  (1)Ϫ (1)  Ϫ (1) ai (a a ϩ )t ϭ  (1) aj t ͟ Ϫ j k 1  ͚ aj e ϶ ΄1 e ΅ ϭ ϩ i:i j; Ϫ j k 2  kϩ1ՅiՅn (ai aj) 

 n   Ϫ (1) n    Ϫa (1) ai a ϩ t ai ϭ  (1) j t ͟  Ϫ k 1 (1) ͟  ͚ aj e ϶ e ͚ aj ϶ . ϭ ϩ i:i j; Ϫ ϭ ϩ i:i j; Ϫ j k 2  kϩ1ՅiՅn ai aj  j k 2  kϩ1ՅiՅn ai aj 

For Equation A4 we need to show that

n     Ϫ  ͟ ai  ϭ  ͟ ai  ͚ aj ϶ akϩ1 . ϭ ϩ i:i j; Ϫ kϩ2ՅiՅn Ϫ ϩ j k 2 kϩ1ՅiՅn ai aj  ai ak 1

This is equivalent to

n Ϫ ΂͚jϭkϩ2 aj͟i:i϶j; ai/(ai aj)΃ n   kϩ1ՅiՅn ͟ αι Ϫ ακϩ1 ϭϪ ϭ ͚  , 1 Ϫ ι:ι϶ϕ; ΂ α Ϫ α ΃ akϩ1͟kϩ2ՅiՅn ai/(ai akϩ1) jϭkϩ2 κϩ2ՅιՅν ι ϕ 

which follows from Lemma 1. Using Lemma 1 with l ϭ s ϩ 1 and x ϭ 0, we get

t n   t  (1) {1} ͟ ai (1) Ϫa x {1} ϭ T Յ ϭ {1} ϭ ͚   j  F Tn,s (t) P( n,s t) Ύ f Tn,s (x)dx ϶ Ύ aj e dx ϭ ϩ i:i j; Ϫ 0 j s 1 sϩ1ՅiՅn ai aj 0 

n    ai Ϫ (1) ϭ  ͟  Ϫ aj t  ͚ ϶ ΂1 e ΃ ϭ ϩ i:i j; Ϫ j s 1 sϩ1ՅiՅn ai aj  

 n   n  ͟ ai Ϫ (1) ͟ ai ϭ   Ϫ  aj t  ͚ ϶ ͚ e ϶ ϭ ϩ i:i j; Ϫ ϭ ϩ i:i j; Ϫ j s 1sϩ1ՅiՅn ai aj  j s 1 sϩ1ՅiՅn ai aj 

n Ϫa (1) ai ϭ Ϫ j t ͟ 1 ͚ e ϶ . ϭ ϩ i:i j; Ϫ j s 1 sϩ1ՅiՅn (ai aj) Demographic Inference From SNP Data 369

{1} {1} ͑T Ͼ ͒ ϭ Ϫ {1} ͑T Յ This completes the proof of Equations A4 and A5. For (A7), note that P n ,s t 1 F Tn,s(t) and P n,s Ͻ {1} ͒ ϭ ͑ {1} Ͼ ͒ Ϫ ͑ {1} Ͼ ͒ t Tn,sϪ1 P Tn,sϪ1 t P Tn,s t . Then

n    n    {1} {1} Ϫ (1) ai Ϫ (1) ai ͑ Յ Ͻ Ϫ ͒ ϭ aj t  ͟  Ϫ aj t  ͟  P Tn,s t Tn,s 1 ͚  e ϶  ͚  e ϶  ϭ i:i j; Ϫ ϭ ϩ i:i j; Ϫ j s  sՅiՅn ai aj  j s 1  sϩ1ՅiՅn ai aj 

 n  Ϫ   Ϫa (1)  ai as aj Ϫa (1) ͟ ai ϭ e s t  ͟  ϩ ͚  ΂1 Ϫ ΃e j t   i:i϶j;  ϩ Յ Յ Ϫ ϭ ϩ Ϫ s 1 i n ai aj  j s 1  as sՅiՅn ai aj 

Ϫ (1) Ϫ (1) as t   n  aj t   ϭ as e  ͟ ai  ϩ aj e  ͟ ai  ϶ ͚  ϶  i:i s; Ϫ Ϫ ϭ ϩ i:i j; Ϫ as sՅiՅn (ai as 1) j s 1  as sՅiՅn ai aj 

n    {1} a Ϫ a fT Ϫ (t) ϭ ͚  j ajt  ͟ i  ϭ n,s 1 e ϶ (1) . ϭ i:i j; Ϫ j s  as sՅiՅn (ai aj) a s

{1} Ն For (A6), since Tn,s 0,

∞ ∞ n    {1} {1} Ϫ (1) a ͑ ͒ ϭ Ն ͒ ϭ aj x  ͟ i  E Tn,s Ύ P(Tn,s x dx Ύ ͚  e ϶ dx ϭ ϩ i:i j; Ϫ 0 0 j s 1  sϩ1ՅiՅn ai aj 

n    ∞  n   1 a Ϫ (1) 1 a ϭ ͚   ͟ i  (1) aj x  ϭ ͚  ͟ i  (1) ϶ Ύ aj e dx (1) ϶ . ϭ ϩ i:i j; Ϫ ϭ ϩ i:i j; Ϫ j s 1  aj sϩ1ՅiՅn ai aj  0  j s 1  aj sϩ1ՅiՅn ai aj 

Equation A8 can be easily obtained from fs,sϪ1(t). Finally, Equation A9 follows from Equation A8, by the argument presented by Fu (1995) to derive Equation 22. Q.E.D.

Piecewise constant effective population size: Consider a demographic history of M distinct epochs indexed by 1, 2,

...,M, where the ancestral epoch is numbered M. For epoch i, the constant effective population size is Ni, and ϭ ∞ (i) ϭ ͑k͒ ␶ ϭ ͚i the duration of this epoch is Ti; in particular, TM . We define a k 2 /2Ni. We introduce i jϭ1Tj , the time ␶ ϭ ␶ ϭ ∞ from the present back until the end of the ith epoch (so 0 0 and M ). At a given time t, the index of the ϭ ␶ Ն ␶ ϭ ␶ Ͻ Յ␶ current epoch is denoted by m(t), in formula m(t) min {k : k t}. In particular, m( i) i, and m(t)Ϫ1 t m(t). We also introduce a “normalized” time t*:

m(t)Ϫ1 t Ϫ␶ Ϫ T t * ϭ m(t) 1 ϩ ͚ i . 2Nm(t) iϭ1 2Ni

The proof is based on induction on the number of epochs. To facilitate this, we consider two kinds of partial models with smaller numbers of epochs, as follows:

{i } 1. The first model has a single epoch, with effective population size Ni. The random variable T n, j denotes the time from the present (state n) to the beginning of state j, under the parameters of the first model. 2. The second model is a truncated version of the original M-epoch model: it consists of i epochs, with parameters ϭ ∞ that are identical to the parameters of the first i epochs of the original model, except Ti ; i.e., the ith of the [i ] original model becomes the ancestral epoch of the truncated model. The random variable T n, j denotes the time from the present (state n) to reach state j, under the parameters of the second model.

Note that the two types of models coincide when i ϭ 1. The following are true:

Theorem 2. For s :1Յ s Ͻ n:

[M] ϭ [m(t)] [M] ϭ [m(t)] f T n,s(t) f T n,s (t) and F Tn,s(t) F Tn,s (t), (A10)

n    1 Ϫa t * ͟ ai [M] ϭ ͚  j   f Tn,s(t) a j e ϶ , (A11) ϭ ϩ i:i j; Ϫ 2Nm(t) j s 1  sϩ1ՅiՅn ai aj  370 G. T. Marth et al.

n    Ϫa t * ͟ ai [M] ϭ Ϫ ͚  j   F Tn,s(t) 1 e ϶ , (A12) ϭ ϩ i:i j; Ϫ j s 1  sϩ1ՅiՅn ai aj 

n   MϪ1 n        m (t)  1 ͟ ai Ϫ͚ a j Tl ͟ ai 1 1 ͑T [M] ͒ ϭ ͚    ϩ ͚ ͚  lϭ1    Ϫ  E Tn,s (1) ϶ e ϶ (mϩ1) (m) ϭ ϩ i:i j; Ϫ ϭ ϭ ϩ i:i j; Ϫ j s 1 a j sϩ1ՅiՅn ai aj  m 1 j s 1 sϩ1ՅiՅn ai aj  aj aj 

n    ϭ 1  ͟ ai  2N1 ͚  ϶  ϭ ϩ i:i j; Ϫ j s 1  aj sϩ1ՅiՅn ai aj 

MϪ1  n    Ϫa ␶* 1 a ϩ  Ϫ j m  ͟ i   ͚ 2(Nmϩ1 Nm) ͚ e ϶  . (A13) ϭ ϭ ϩ i:i j; Ϫ m 1  j s 1  aj sϩ1ՅiՅn ai aj 

For s :2Յ s Ͻ n:

[M] [m(t)] [M] [M] fTn,sϪ1(t) fTn,sϪ1(t) ͑Tn,s Յ Ͻ Tn,sϪ1͒ ϭ ϭ P t (m(t)) (m(t)) , (A14) a s a s

MϪ1 n     Ϫ m (l) 1 ͚ ϭ a j Tl ͟ ai 1 1 ͑Ts,sϪ1͒ ϭ ϩ ͚ ͚  l 1   Ϫ  E (1) e ϶ (mϩ1) (m) ϭ ϭ i:i j; Ϫ a s m 1 j s  sՅiՅn ai aj  as as 

 MϪ1  n   2 Ϫa ␶* a ϭ ϩ  Ϫ  j m ͟ i  N1 ͚ (Nmϩ1 Nm) ͚ e ϶ . (A15) ϭ ϭ i:i j; Ϫ as  m 1  j s  sՅiՅn ai aj 

For i :1Յ i Ͻ n:   Ϫ M 1 Ϫ n  Ϫ n  (j(jϪ1)␶* ) 2 Ϫ  ⌿ ϭ ␮N1 ϩ Nmϩ1 Nm n k m / ͟ l(l 1)  E( i) 4 ͚  ͚΂ ΃͚ e  . (A16)  n Ϫ 1 Ϫ l:l϶j; Ϫ Ϫ Ϫ  i mϭ1 ͑ ͒ kϭ2  i 1 jϭk  Յ Յ l(l 1) j(j 1)   i i k l n 

Proof: (A12) and (A14) are consequences of (A11):

∞ n  Ϫ (M) Ϫ␶ Ϫ (l)   ∞  aj ( MϪ1) Ϫ͚M 1a j Tl Ϫ (M) lϭ1 ͟ ai (M) a t [M] ϭ Ϫ [M] ϭ Ϫ ͚    j  F Tn,s(t) 1 Ύ f Tn,s (t)dt 1 e ϶ Ύ aj e dt ϭ ϩ i:i j; Ϫ t j s 1  sϩ1ՅiՅn ai aj t 

n  Ϫ (M) Ϫ␶ MϪ1 (l)   ϭ Ϫ aj (t MϪ1) Ϫ ͚ a j Tl  ͟ ai  1 ͚ e lϭ1 ϶ . ϭ ϩ i:i j; Ϫ j s 1  sϩ1ՅiՅn ai aj 

 n     Ϫ (M) Ϫ␶ MϪ1 (l)  [M] [M] aj (t MϪ1) Ϫ͚ a j Tl lϭ1 ͟ ai ͑T Յ Ͻ T Ϫ ͒ ϭ [M] Ϫ [M] ϭ ͚    P n,s t n,s 1 FTn,s (t) FTn,sϪ1(t) e ϶ ϭ i:i j; Ϫ j s  sՅiՅn ai aj      n  (M) Ϫ  Ϫa (tϪ␶ Ϫ ) Ϫ M 1 a(l)T j M 1 ͚ ϭ j l ͟ ai Ϫ  l 1    ͚ e ϶  ϭ ϩ i:i j; Ϫ  j s 1  sϩ1ՅiՅn ai aj    (M) Ϫ Ϫa (tϪ␶ Ϫ ) Ϫ M 1 a(l)T s M 1 ͚ ϭ s l ͟ ai ϭ e l 1   i:i϶j; Ϫ sՅiՅn ai aj    n  (M) Ϫ  Ϫ Ϫa (tϪ␶ Ϫ ) Ϫ M 1 a(l)T as aj j M 1 ͚ ϭ j l ͟ ai ϩ Ϫ l 1   ͚ ΂1 ΃e ϶  ϭ ϩ i:i j; Ϫ j s 1  as sՅiՅn ai aj  Demographic Inference From SNP Data 371

  n  (M) Ϫ  [M] [m(t )] Ϫa (tϪ␶ Ϫ ) Ϫ M 1 a(l)T fT Ϫ (T) fT Ϫ (T) aj j M 1 ͚ ϭ j l ͟ ai n,s 1 n,s 1 ϭ ͚  l 1   ϭ ϭ e ϶ (M) (m(t )) . ϭ i:i j; Ϫ j s as sՅiՅn ai aj  a s a s

We prove (A10) and (A11) by induction on the number of epochs M. The statements are true for M ϭ 1by Theorem 1. For M Ͼ 1 assume that the statements are true if the number of epochs is less than M. Clearly,

 n  ͕ [M] ϭ ͖ ϭ ͕ [M] Յ␶ [M] ϭ ͖ ʜ ʜ ͕ [M] Յ␶ Ͻ [M] [M] ϭ ͖ Tn,j t Tn,j MϪ1 and Tn,j t  Tn,i MϪ1 Tn,iϪ1 and Tn,j t . iϭjϩ1  The right side is a union of disjoint events; therefore (using density functions of conditioned variables) we have

[M] [M] ϭ T Յ␶ Ϫ [M]| [M]Յ␶ f Tn,j (t) P( n,j M 1) f Tn,j Tn,j MϪ1(t) n [M] [M] ϩ ͚ T Յ␶ Ϫ Ͻ T Ϫ [M]| [M]Յ␶ Ͻ [M] P( n,i M 1 n,i 1 )f Tn,j Tn,i MϪ1 Tn,iϪ1(t). iϭjϩ1 Clearly

[M] f [MϪ1] (t)/P(T Յ␶ ), t Յ␶ Ϫ  Tn,j n,j MϪ1 M 1 f [M]| [M]Յ␶ (t) ϭ Tn,j Tn,j MϪ1 Ͼ␶ 0, t MϪ1 and for each i Ͼ j Յ␶ 0, t MϪ1 [M] [M] [M] ϭ f T | T Յ␶ Ϫ ϽT ϩ (t)  . n,j n,i M 1 n,i 1 {M} Ϫ␶ Ͼ␶ f i,j (t MϪ1), t MϪ1

Յ␶ [M] ϭ [MϪ1] [M] ϭ [MϪ1] Therefore for t MϪ1 we have f T (t) f T (t) and F T (t) F T (t), so using the induction hypothesis, for Յ␶ n,j n,j n,j n,j t MϪ1, Equations A10, and consequently A11, hold. In particular,

[MϪ1] ␶ n    f ( MϪ1) Ϫ MϪ1 a(l) T [M] [M] Tn,sϪ1 aj ͚ ϭ j l ͟ ai ͑Tn,s Յ␶ Ϫ Ͻ Tn,sϪ1͒ ϭ ϭ ͚  l 1   P M 1 (MϪ1) e ϶ . ϭ i:i j; Ϫ a s j s as sϩ1 ՅiՅn ai aj  Ͼ␶ ϭ If t MϪ1, i.e., m(t) M, then (A10) and (A11) follow from Lemma 3:

n  n   s    Ϫ (l) Ϫ (M) Ϫ␶ am Ϫ͚M 1 a m T ͟ ap (M) aj (t MϪ1) ͟ aq [M] ϭ ͚  ͚  lϭ1 l  ͚    fTn,i (t) e ϶ aj e ϶ ϭ ϩ ϭ p:p m; Ϫ ϭ ϩ q:q j; Ϫ s i 1 m s as sՅpՅn ap am j i 1  iϩ1ՅqՅs aq aj

n  n  m      Ϫa(M)(tϪ␶ ) (l) (M) j MϪ1 Ϫ͚MϪ1 a Tl am ͟ aq ͟ ap ϭ  lϭ1 m      ͚ aj e ͚ e ͚ ϶  ϶  ϭ ϩ ϭ ϭ q:q j; Ϫ p:p m; Ϫ j i 1  m j  s jas  iϩ1ՅqՅs aq aj   sՅpՅn ap am

n    Ϫ (M) Ϫ␶ (l) (M) aj (t MϪ1) Ϫ͚MϪ1 a Tl ͟ aq ϭ  lϭ1 j   ͚ aj e ϶ . ϭ ϩ q:q j; Ϫ j i 1   iϩ1ՅqՅn aq aj

We get Equation A13 in a way similar to the proof of Equation A8:

∞ ∞ n    Ϫ (m(t)) Ϫ␶ m(t)Ϫ1 (l) [M] aj ͑t m(t)Ϫ1͒Ϫ͚ ϭ a Tl ͟ ai ͑T ͒ ϭ ͑ Ϫ [M] ͒ ϭ ͚  l 1 j   E n,s Ύ 1 F Tn,s(t) dt Ύ e ϶ dt ϭ ϩ i:i j; Ϫ 0 0 j s 1  sϩ1ՅiՅn ai aj 

M T n    n    m Ϫ (m) mϪ1a(l)T Ϫ (l) aj tϪ͚ ϭ j l ͟ ai 1 Ϫ͚M 1a j Tl ͟ ai ϭ ͚ Ύ ͚ e l 1  dt ϭ ͚  e lϭ1   i:i϶j; (M) i:i϶j; ϭ ϭ ϩ Ϫ ϭ ϩ Ϫ m 1 0 j s 1  sϩ1ՅiՅn ai aj  j s 1 aj sϩ1ՅiՅn ai aj 

MϪ1 n    Ϫ (m) Ϫ (l) 1 aj Tm Ϫ͚m 1a j Tl ͟ ai ϩ ͚ ͚  ͑ Ϫ ͒ lϭ1   (m) 1 e e ϶ ϭ ϭ ϩ i:i j; Ϫ m 1 j s 1 aj sϩ1ՅiՅn ai aj 372 G. T. Marth et al.

M n    MϪ1 n    Ϫ (l) (l) 1 Ϫ͚m 1a j Tl ͟ ai 1 Ϫ͚m a j Tl ͟ ai ϭ ͚ ͚  e lϭ1   Ϫ ͚ ͚  e lϭ1   (m) i:i϶j; (m) i:i϶j; ϭ ϭ ϩ Ϫ ϭ ϭ ϩ Ϫ m 1j s 1 aj sϩ1ՅiՅn ai aj  m 1 j s 1aj sϩ1ՅiՅn ai aj 

n    MϪ1 n      (l) 1 ͟ ai Ϫ͚m a j Tl ͟ ai 1 1 ϭ ͚    ϩ ͚ ͚ e lϭ1    Ϫ  . (1) i:i϶j; i:i϶j; (mϩ1) (m) ϭ ϩ Ϫ ϭ ϭ ϩ Ϫ j s 1 aj sϩ1ՅiՅn ai aj  m 1 j s 1 sϩ1ՅiՅn ai aj  aj aj 

Using Lemma 4,

MϪ1 n      [M] [M] 1 m a(t)T ai 1 1 ͑ ͒ ϭ ͑ ͒ Ϫ ͑ ͒ ϭ ϩ Ϫ͚ ϭ j l ͟ Ϫ E Ts,sϪ1 E Tn,sϪ1 E Tn,s ͚ ͚e l 1     (1) i:i϶j; Ϫ (mϩ1) (m) as mϭ1 jϭs  sՅiՅn ai aj  aj aj 

MϪ1 n      (l) Ϫ͚m a j Tl ͟ ai 1 1 Ϫ ͚ ͚  lϭ1    Ϫ  e ϶ (mϩ1) (m) ϭ ϭ ϩ i:i j; Ϫ m 1j s 1  sϩ1ՅiՅn ai aj   ai aj 

MϪ1   n    1 m a(l)T ai 1 1 ϭ ϩ Ϫ͚ ϭ s l ͟ Ϫ ͚  e l 1    ϩ  (1) ϭ ϩ Ϫ (m 1) (m) as mϭ1  i s 1 ai as  aj as    MϪ1 n        (l) Ϫ  Ϫ͚m a j Tl ͟ ai 1 1 as aj ϩ ͚ ͚ e lϭ1    Ϫ  1 Ϫ   i:i϶j; Ϫ (mϩ1) (m)  ϭ ϭ ϩ  ai aj  aj aj   as  m 1 j s 1  sՅiՅn    MϪ1 n      (l) 1 Ϫ͚m a j Tl ͟ ai 1 1  ϭ ϩ ͚ ͚ e lϭ1    Ϫ  . (1)  i:i϶j; Ϫ (mϩ1) (m)  as ϭ jϭs  ai aj  aj aj  m 1  sՅiՅn  This gives Equation A15. Finally, using manipulations identical to those used by Fu (1995) we derive Equation A16:  n Ϫ k  ΂ ΃    n  MϪ1 n    E(⌿ ) i Ϫ 1   Ϫ ␶ a  i ϭ  ϩ Ϫ a j *m ͟ s  ͚ N1 ͚ (Nmϩ1 Nm) ͚ e  ␮ Ϫ   s:s϶j; Ϫ  4 kϭ2  n 1 mϭ1 jϭk  kՅsՅn as aj   i΂ ΃   i    Ϫ Ϫ n n  n Ϫ  M 1 Nmϩ1 Nm Ϫ    N1 n k  n k Ϫ ␶* ͟ as  ϭ ͚ ΂ ΃ ϩ ͚  ͚ ΂ ΃͚e aj m   Ϫ Ϫ n Ϫ 1 ϭ  ϭ s:s϶j; Ϫ  n 1  ϭ i 1  ϭ  k 2 i Ϫ 1 j k Յ Յ as aj   i΂ ΃ k 2 m 1 i΂ ΃  k s n  i  i

 Ϫ   MϪ1 Ϫ Ϫ 1 n Ϫ n   N1 Nmϩ1 Nm n 1  n k Ϫ ␶* ͟ as  ϭ ϩ ͚ ΂ ΃ ͚ ΂ ΃ ͚ e aj m  ,   s:s϶j; Ϫ  i ϭ i i ϭ i Ϫ 1 ϭ  Յ Յ as aj  m 1  k 2  j k k s n  ␶ ϭ ͚m where *m lϭ1(Tl/2Nl). This completes the proof. Q.E.D.