<<

INVESTIGATION

Estimating Variable Effective Population Sizes from Multiple Genomes: A Sequentially Markov Conditional Sampling Distribution Approach

Sara Sheehan,*,1 Kelley Harris,†,1 and Yun S. Song*,‡,2 *Computer Science Division, †Department of Mathematics, and ‡Department of Statistics, University of California, Berkeley, California 94720

ABSTRACT Throughout history, the population size of modern humans has varied considerably due to changes in environment, culture, and technology. More accurate estimates of population size changes, and when they occurred, should provide a clearer picture of human colonization history and help remove confounding effects from natural selection inference. Demography influences the pattern of genetic variation in a population, and thus genomic data of multiple individuals sampled from one or more present-day populations contain valuable information about the past demographic history. Recently, Li and Durbin developed a -based hidden Markov model, called the pairwise sequentially Markovian coalescent (PSMC), for a pair of chromosomes (or one diploid individual) to estimate past population sizes. This is an efficient, useful approach, but its accuracy in the very recent past is hampered by the fact that, because of the small sample size, only few coalescence events occur in that period. Multiple genomes from the same population contain more information about the recent past, but are also more computationally challenging to study jointly in a coalescent framework. Here, we present a new coalescent-based method that can efficiently infer population size changes from multiple genomes, providing access to a new store of information about the recent past. Our work generalizes the recently developed sequentially Markov conditional sampling distribution framework, which provides an accurate approximation of the probability of observing a newly sampled haplotype given a set of previously sampled haplotypes. Simulation results demonstrate that we can accurately reconstruct the true population histories, with a significant improvement over the PSMC in the recent past. We apply our method, called diCal, to the genomes of multiple human individuals of European and African ancestry to obtain a detailed population size change history during recent times.

ITH the rise of new sequencing technologies, it has Much can be learned about ancient population history Wbecome easier to obtain genetic data from multiple from present-day DNA data, since the genome of each in- individuals at many loci. Such data have been providing dividual is an imperfect mosaic of the genomes of its ancestors. a new wealth of information from which to estimate popu- Accurately inferring the past demographic changes of humans lation genetic parameters such as mutation rates, recombi- has several important applications, including properly account- nation rates, effective population sizes, divergence times, ing for population structure in association studies and reducing and migration rates. More data should enable more accurate confounding effects in inferences about natural selection. It may parameter estimation, but it is both theoretically and com- also help to resolve archaeological and historical questions. putationally challenging to model the of many Humans are not the only organism for which demography individuals. raises important questions. For example, the demography of Drosophila has very interesting dynamics, as investigated by Copyright © 2013 by the Genetics Society of America several recent studies (Haddrill et al. 2005; Thornton and doi: 10.1534/genetics.112.149096 Andolfatto 2006; Wang and Hey 2010). Manuscript received December 26, 2012; accepted for publication April 7, 2013 Supporting information is available online at http://www.genetics.org/lookup/suppl/ In humans, ancient effective population size estimates doi:10.1534/genetics.112.149096/-/DC1. vary widely, as do the estimates of demographic events 1These authors contributed equally to this work. 2Corresponding author: Department of EECS, University of California, 683 Soda Hall, such as the out-of-Africa migration. Gronau et al. (2011) No. 1776, Berkeley, CA 94720-1776. E-mail: [email protected] used a coalescent-based approach with six diploid genomes

Genetics, Vol. 194, 647–662 July 2013 647 each from a different population and estimated that Eura- The major obstacle to generalizing the PSMC to multiple sians and Africans diverged 38–64 thousand years ago sequences is the explosion in the state with the number (KYA) and that the effective population size of humans in of sequences; the number of distinct coalescent tree topolo- the ancient past was 9000. Gravel et al. (2011) used low- gies grows superexponentially with the number of leaves, and coverage whole-genome data and high-coverage exome data we furthermore need to consider edge-weighted trees (i.e., to fit a one-bottleneck model followed by exponential growth include time information). In a related line of research, inter- in European and Asian populations. They estimated that the esting progress has been made (Hobolth et al. 2007; Dutheil timing of the out-of-Africa migration was 51 KYA and that et al. 2009; Mailund et al. 2011) in performing “ancestral pop- the effective population size in the ancient past was 7300, ulation genomic” inference under a coalescent HMM, but its which then increased to 14,500 at 150 KYA. applicability is limited to only a modest number of sequences, In their analysis, Gronau et al. considered 37,574 loci again due to the explosion in the state space. each of length 1 kb and, for computational tractability, as- In this article, we describe an alternative method that is sumed that the loci are all independent and that there is no efficient in the number of sequences, while retaining the key recombination within each locus. The method employed by generality of the PSMC in incorporating an arbitrary piece- Gravel et al. (2011) is based on fitting allele frequency spectra, wise constant population size history. More precisely, the assuming that all sites are independent. Incidentally, Myers computational complexity of our method depends quadratically et al. (2008) investigated the limits of inferring population on the number of sequences, and the computation involved can size changes from the allele frequency spectrum alone and be easily parallelized. As more sequences are considered, we showed that two distinct population size histories may yield expect to see a larger number of coalescence events during the exactly the same expected allele frequency spectra. It remains recent past and should be able to estimate recent population an open question whether taking linkage information into sizes at a higher resolution. With only two sequences, the account may remedy the problem of nonidentifiability. distribution of coalescence events is shifted toward the ancient The distribution of lengths of shared identity-by-descent past,relativetothedistributionofthetimeanewlineagejoins (IBD) tracts between pairs of unrelated individuals is in- a coalescent tree for multiple sequences. Thus, even if all formative of recent demographic history. Recently, Palamara sequences are considered pairwise, the resolution in the recent et al. (2012) utilized the empirical distribution of IBD sharing past may not be as clear as that achieved by jointly modeling in pairs of 500 Ashkenazi Jewish individuals to infer two multiple sequences. rapid population expansions separated by a severe founder The input to our method, which is also based on an HMM, event over the past 200 generations. This approach requires is a collection of haplotype sequences. At present, our method first inferring IBD tracts from data, but the accuracy of exist- assumes that mutation and recombination rates are given, ing IBD detection methods has not been fully characterized and it employs the expectation-maximization (EM) algorithm when the population under consideration has undergone to infer a piecewise constant history of population sizes, with a complex demographic history. an arbitrary number of change points. The pairwise sequentially Markovian coalescent (PSMC), Our work generalizes the recently developed sequentially recently developed by Li and Durbin (2011) to estimate an Markov conditional sampling distribution (SMCSD) frame- arbitrary piecewise constant population size history, does work (Paul et al. 2011) to incorporate variable population take linkage information into account and efficiently models size. This approach provides an accurate approximation of recombination between sites, using the sequentially Markov the probability of observing a newly sampled haplotype coalescent (McVean and Cardin 2005; Marjoram and Wall given a set of previously sampled haplotypes, and it allows 2006) for a pair of sequences. The PSMC is based on a hid- one to approximate the joint probability of an arbitrary num- den Markov model (HMM) in which the hidden state at ber of haplotypes. Through a simulation study, we demon- a given position corresponds to the coalescence time of the strate that we can accurately reconstruct the true population two lineages at that position and the observed state corre- histories, with a significant improvement over the PSMC sponds to the observed genotype (homozygous/heterozygous) in the recent past. Moreover, we apply our method to at the position. As one moves along the sequence, the co- the genomes of multiple human individuals of European alescence time may change as a result of recombination, and and African ancestry to obtain a detailed population size the spatial distribution of homozygous and heterozygous change history during recent times. Our software, called sites is informative of the distribution of coalescence times, demographic inference using composite approximate likeli- which depends on the past population sizes. While this ele- hood (diCal), is publicly available at https://sourceforge. gant approach produces reasonably accurate population size net/projects/dical. estimates overall, its accuracy in the very recent past is ham- pered by the fact that, because of the small sample size, few Notation and a Review of the SMCSD Framework coalescence events occur in that period. As a consequence, the information in the pattern of genetic variation for a pair Our work stems from the SMCSD framework (Paul et al. of sequences is insufficient to resolve very recent demo- 2011), which describes the conditional genealogical process graphic history. of a newly sampled haplotype given a set of previously

648 S. Sheehan, K. Harris, and Y. S. Song sampled haplotypes. In what follows, we briefly review the key concepts underlying the SMCSD model. We consider haplotypes each of length L from the same genomic region. Suppose we have already observed n hap- lotypes, On ={h1, ..., hn} sampled at random from a well-mixed population; note that some of the observed hap- lotypes may be identical. In this article, we use the terms “sites” and “loci” interchangeably. Recombination may occur between any pair of consecutive loci, and we denote the set of potential recombination breakpoints by B ={(1,2),..., (L 2 1, L)}. Given a haplotype h,wedenotebyh[ℓ]theallele at locus ℓ and by h[ℓ: ℓ9](forℓ # ℓ9) the subsequence (h[ℓ], ..., h[ℓ9]). As described in Paul and Song (2010), given the geneal- A O ogy On for the already observed sample n, it is possible to sample a conditional genealogy C for the additional haplo- type according to the following description: An ancestral lineage in C undergoes mutation at locus ℓ at rate uℓ/2 according to the stochastic mutation transition matrix P(ℓ). Further, as in the ordinary coalescent with recombination, an ancestral lineage in C undergoes recombination at break- point b 2 B at rate rb/2, giving rise to two lineages. Each pair of lineages within C coalesces with rate 1, and lineages in C A get absorbed into the known genealogy On at rate 1 for each pair of lineages. See Figure 1A for an illustration. Unfortunately, we do not generally have access to the A true genealogy On, and marginalizing over all possibilities is a challenging problem. However, Paul and Song (2010) showed that the diffusion-generator approximation de- scribed in De Iorio and Griffiths (2004a,b; Griffiths et al. A Figure 1 Illustration of a conditional genealogy C for a three-locus model. 2008) implies the following approximation to On , which fi The three loci of a haplotype are each represented by a solid circle, with the simpli es the problem considerably. color indicating the allelic type at that locus. Mutation events, along with Approximation 1 (the trunk genealogy) the locus and resulting haplotype, are indicated by small arrows. Recombi- nation events, and the resulting haplotype, are indicated by branching Approximate AO by the so-called trunk genealogy A* in events. Absorption events are indicated by dotted horizontal lines. (A) The n On true genealogy AO for the already observed sample On. (B) Approximation which lineages do not mutate, recombine, or coalesce with n by the trunk genealogy A* . Lineages in the trunk do not mutate, recom- “ ” On one another, but instead form a nonrandom trunk extend- bine, or coalesce. (C) Marginal conditional genealogy for each locus. ing infinitely into the past, as illustrated in Figure 1B. Al- though A* is not a proper genealogy, it is still possible to tion, recombination, and coalescence are each a factor of 2 On sample a well-defined conditional genealogy C for the addi- larger than those mentioned earlier. Intuitively, this rate ad- tional haplotype given A* in much the same way as de- justment accounts for using the (inexact) trunk genealogy On scribed above, except that rates need to be modified. A* , which remains static. Note that the adjustment follows On Specifically, lineages within C evolve backward in time sub- as a mathematical consequence of the diffusion-generator ject to the following events: approximation (De Iorio and Griffiths 2004a,b; Griffiths ℓ et al. 2008), and it is supported by the fact that the CSD Mutation: Each lineage undergoes mutation at locus with p^ (ℓ) PS has been shown to be exact for a one-locus model with rate uℓ according to P . parent-independent mutation (Paul and Song 2010). Recombination: Each lineage undergoes recombination at It can be deduced from the diffusion-generator approx- breakpoint b 2 B with rate rb. imation that p^ PSðajOnÞ, the conditional probability of sam- Coalescence: Each pair of lineages coalesces with rate 2. a * pling an additional haplotype given a set of previously Absorption: Each lineage is absorbed into a lineage of AO n sampled haplotypes O , satisfies a recursion. Unfortunately, with rate 1. n this recursion is computationally intractable to solve for even The genealogical process described above completely char- modest-sized data sets. To address this issue, Paul et al. acterizes a conditional sampling distribution (CSD), which (2011) proposed further approximations, described below, Paul and Song (2010) denoted by p^ PS. Observe that the rate to obtain a CSD that admits efficient implementation, while of absorption is the same as before, but the rates for muta- retaining the accuracy of p^ PS.

Estimating Variable Effective Population Sizes 649 Approximation 2 (sequentially Markov CSD) A given conditional genealogy C contains a marginal condi- tional genealogy (MCG) for each locus, where each MCG comprises a series of mutation events and the eventual absorption into a lineage of the trunk A* . See Figure 1C On for an illustration. The key insight (Wiuf and Hein 1999) is that we can generate the conditional genealogy as a se- quence of MCGs across the sequence, rather than backward in time. Although the sequential process is actually not Markov, it is well approximated (McVean and Cardin 2005; Marjoram and Wall 2006; Paul et al. 2011) by a Markov process, using a two-locus transition density. Ap- plying this approximation to p^ PS yields the sequentially Markov CSD, p^ SMC. Conditional on the MCG Cℓ21 at locus ℓ 2 1, the MCG Cℓ at locus ℓ can be sampled by first placing recombination events onto Cℓ21 according to a Poisson process with rate r(ℓ21,ℓ).If no recombination occurs, Cℓ is identical to Cℓ21. If recombi- nation does occur, Cℓ is identical to Cℓ21 up to the time Tr of the most recent recombination event. At this point, the lin- eage at locus ℓ, independently of the lineage at locus ℓ 2 1, proceeds backward in time until being absorbed into a line- Figure 2 Illustration of the sequentially Markov approximation in which age of the trunk. This transition mechanism for the Markov the absorption time Tℓ at locus ℓ is sampled conditionally on the absorp- T t process is illustrated in Figure 2. McVean and Cardin (2005) tion time ℓ21 = ℓ21 at the previous locus. In the marginal conditional C ℓ 2 use this approximation as well, while the transition process genealogy ℓ21 for locus 1, recombination breakpoints are realized as a Poisson process with rate r ℓ2 ℓ . If no recombination occurs, Cℓ is in Marjoram and Wall (2006) does allow the lineage to co- ( 1, ) identical to Cℓ21. If recombination does occur, as in the example here, alesce back into itself. Cℓ is identical to Cℓ21 up to the time Tr of the most recent recombination Given Cℓ, mutations are superimposed onto it according event. At this point, the lineage at locus ℓ, independently of the lineage at ℓ 2 to a Poisson process with rate uℓ. The MCG is absorbed into locus 1, proceeds backward in time until being absorbed into a lineage ℓ T T T T a trunk lineage corresponding to some haplotype h,which of the trunk. The absorption time at locus is ℓ = r + a, where a is the remaining absorption time after the recombination event. specifies an “ancestral” allele h[ℓ]. This allele is then prop- agated to the present according to the superimposed where t 2 [0, N), with t = 0 corresponding to the present mutations and the transition matrix P(ℓ), thereby generat- time, N is some reference effective population size, and ing an allele at locus ℓ of the additional haplotype a.We ref N(t) is the effective population size at time t in the past. The refer to the associated distribution of alleles as the emis- population-scaled recombination and mutation rates are de- sion distribution. fined with respect to N . Specifically, for b =(ℓ 2 1, ℓ), we The generative process described above for the SMCSD ref define r =4N r , where r denotes the recombination rate p^ can be formulated as an HMM, in which the hidden b ref b b SMC per generation per individual between loci ℓ 2 1 and ℓ, and state at locus ℓ corresponds to the MCG Cℓ, excluding muta- uℓ =4N mℓ, where mℓ denotes the mutation rate per gener- tion events: We denote the hidden state at locus ℓ in the ref ation per individual at locus ℓ. HMM by Sℓ =(Tℓ, Hℓ), where Tℓ 2 [0, N) is the absorption time and Hℓ 2On is the absorption haplotype. The emission Initial density at locus ℓ corresponds to the allele a[ℓ]. See Paul et al. Inthecaseofaconstantpopulation size, the absorption time Tℓ (2011) for explicit expressions for the initial, transition, for locus ℓ follows an exponential distribution, but with a vari- and emission densities in the case of a constant population able population size the absorption time is described by a non- size. homogeneous Markov chain. See Griffiths and Tavaré (1994) for a more thorough discussion of the coalescent with variable Incorporating Variable Population Size population size. As in the constant population size case, how- Here, we extend the SMCSD framework described in the ever, the prior distribution of absorption haplotype Hℓ is still O previous section to incorporate variable population size. A uniform over the observed haplotypes n in the trunk geneal- history of relative effective population size is described by og y. I n su mmar y, th e marg inal density of the hidden state sℓ = the function (t, h)isgivenby Z ð Þ n t 1 lð Þ¼N t ; zðlÞð ; Þ¼ h 2 t ; t (1) t h lð Þ exp n lðtÞ d (2) Nref t 0

650 S. Sheehan, K. Harris, and Y. S. Song Z where n denotes the number of haplotypes in O that are ðlÞ ðlÞ h n p^ ðajO Þ ¼ f ða½1 : L; s Þds : (6) identical to haplotype h. SMC n SMC L L

Transition density As with the constant population size HMM, a backward ðlÞ algorithm can also be devised to compute p^ ðajO Þ, al- To obtain the transition density, we need to take into SMC n though we do not present it here. account recombination, which causes changes in the hidden state of our HMM. If no recombination occurs between loci ℓ 2 1 and ℓ (prior to Tℓ2 ), then sℓ = sℓ2 . If a recombination 1 1 Discretizing the State Space event occurs between loci ℓ 2 1 and ℓ, the absorption time for locus ℓ will be Tℓ = Tr + Ta, where Tr is the time of To efficiently evaluate the recursion (5) and the marginal- recombination (which must be less than Tℓ21 and Tℓ) and ization (6), we discretize the time component of the state

Ta is the remaining additional time to absorption, as illus- space. We partition time (in units of 2Nref generations) into trated in Figure 2. To compute the transition density, we d intervals, demarcated by need to convolve the hidden variables Tr and Ta. Letting b =(ℓ 21, ℓ) for ease of notation, the transition density from t0 ¼ 0 , t1 , ⋯ , td ¼ N; sℓ21 =(t, h)tosℓ =(t9, h9) is given by and assume that l(t)defined in (1) has a constant value li ðlÞ ... f ðsℓjsℓ21Þ in each interval Di := [ti21, ti), for i =1, , d, " # R ð ; 9Þ ðlÞ 2r min t t 2r z ðt9; h9Þ (3) Xd ¼ bt d þ r btr R ; e sℓ21;sℓ be N dtr lð Þ¼ 1ð # , Þl ; 0 zðlÞðtÞ t t ti21 t ti i (7) t d r i¼1 l ðlÞ where z( )(t9, h9)isdeR fined in (2) and z ðtÞ :¼ 1 P ðlÞ N ðlÞ where ( ) is the indicator function. Using this piecewise 2O z ðt; hÞ. Note that z ðtÞdt ¼ 1. h n 0 constant l(t), we can write the HMM probabilities in a more workable form, as detailed below. Emission probability

The probability of emitting allele a at locus ℓ (i.e., a[ℓ]=a) Initial probability given hidden state sℓ =(t, h)is For t 2 Di, (7) implies that the initial density (2) can be N h i X m written as ðlÞ 2uℓt 1 m ðℓÞ j ðajsℓÞ¼e ðuℓtÞ P : (4) m! h½ℓ;a m¼0 Yi21 ðlÞ n 2 ð 2 Þ=l 2 ð 2 Þ=l z ð ; Þ¼ h n t ti21 i n tj tj21 j : t h l e e (8) This is the same emission probability as in Paul et al. (2011), i j¼1 but when we discretize the state space in the following sec- tion we have to take into account the effects of variable To obtain the initial probability in the time-discretized population size. model, we integrate over the time interval Di to obtain Z ðlÞ ^ ðlÞ nh ðiÞ Sequentially Markov conditional sampling probability z ðDi; hÞ¼ z ðt; hÞdt ¼ w ; (9) Di n Using the initial, transition, and emission densities described above, we can write down an integral recursion for the where ðlÞ forward probability f ða½1 : ℓ; sℓÞ of observing the first SMC h i Yi21 ℓ a ... a ℓ ℓ ð Þ 2 ð 2 Þ=l 2 ð 2 Þ=l alleles [1], , [ ] and the state at locus being sℓ. w i ¼ 1 2 e n ti ti21 i e n tm tm21 m ; # ℓ # For 2 L, m¼1 ðlÞ f ða½1 : ℓ; sℓÞ which corresponds to the probability that a lineage in the SMC R ¼ jðlÞða½ℓj Þ fðlÞð j Þ ðlÞ ða½ : ℓ 2 ; Þ ; conditional genealogy gets absorbed into the trunk geneal- sℓ sℓ sℓ21 fSMC 1 1 sℓ21 dsℓ21 ogy within the interval D . (5) i with base case Transition probability ðlÞ ðlÞ ðlÞ ða½ ; Þ¼j ða½ j Þz ð Þ: For the transition density from state sℓ2 =(t, h) to state sℓ = fSMC 1 s1 1 s1 s1 1 (t9, h9), we let i denote the time interval index such that t 2

Finally, the conditional probability of sampling an additional Di =[ti21, ti) and let j denote the index such that t9 2 Dj = haplotype a having previously observed On ={h1, ..., hn}is [tj21, tj). After some simplification, the transition density (3) given by becomes

Estimating Variable Effective Population Sizes 651 ðlÞ f ðsℓjsℓ21Þ across the d time intervals. Alternatively, users have the "# 2 option of specifying their own discretization time points to n Yj 1 2r t h 2nðt92tj21Þ=lj 2nðt 2t 2 Þ=l ¼ e b d ; þ e e m m 1 m Rði; t; j; t9Þ; achieve a desired resolution. sℓ21 sℓ l j m¼1 As noted by Li and Durbin (2011), allowing separate (10) population size parameters during time intervals that con- tain too few expected coalescence events can lead to inac- where R(i, t; j, t9)isdefined in the Appendix. curate estimates. Following their lead, we mitigate this To compute the transition probability in the time-discretized problem by constraining a few consecutive time intervals to model, we use Bayes’ rule and integrate the transition density have the same population size. function to obtain ^ðlÞ f Dj; h9jDi; h Z Z Modifying the Trunk Genealogy ¼ 1 fðlÞð 9; 9j ; ÞzðlÞð ; Þ 9 ðlÞ t h t h t h dt dt (11) The trunk genealogy approximation in Paul and Song (2010) ^z ð ; Þ D D Di h j i was derived by making an approximation in the diffusion ðiÞ ði;jÞ nh9 process dual to the coalescent for a constant population size. ¼: y di;jdh;h9 þ z ; n It yields an accurate approximate CSD in the case of a popu- ðlÞ lation at equilibrium, and for parent-independent mutation where ^z ðD ; hÞ is defined in (9), and explicit formulas for i models, the CSD becomes exact in the limit as the recombi- y(i) and z(i,j) are provided in the Appendix. The first term nation rate approaches N. However, in the variable popula- arises from the case of no recombination, while the second tion size setting, we must modify the trunk genealogy term accounts for the case when recombination does occur. approximation for the following reason: In the formulation Note that y(i) and z(i,j) depend only on the time interval and described earlier, the trunk absorbs a lineage in the condi- not on the absorbing haplotype. tional genealogy C at the rate ndt/l(t) at time t. Our HMM uses this inverse dependence and the inferred distribution of Emission probability absorption times to estimate the population size scaling Although thus far the emission density has not been affected function l(t). In reality, at time t the number of ancestral by the population size being variable, discretizing time intro- lineages is n(t) # n and a lineage in C gets absorbed at rate duces dependence on the function l(t). Let a denote the emit- n(t)dt/l(t). Hence, assuming that the trunk genealogy con- ted allele of the newly sampled haplotype a at locus ℓ.Using tains n lineages in every time interval causes absorption Bayes’ rule again and then integrating over the absorption time events to occur too quickly, leaving the ancient population interval gives sizes overestimated. We later provide empirical results that support this intuition (see Figure 8). ^ðlÞ j ðajDi; hÞ To remedy the problem described above, in our work we Z use the expected number of lineages in the trunk to modify ¼ 1 jðlÞð j ; ÞzðlÞð ; Þ ðlÞ a t h t h dt the rate of absorption, while still forbidding mutation, re- ^z ð ; Þ D (12) Di h i combination, and coalescence in the trunk genealogy. Let PN A (t) denote the number of lineages at time t ancestral to ¼ ðiÞ ð Þ PðℓÞ m ; n v m h½ℓ;a m¼0 asampleofsizen at time 0. Under the coalescent, the prob- ability distribution of An(t)isknowninclosedform(Tavaré where a formula for v(i)(m) is provided in the Appendix. 1984), but using it directly to compute the expected number of lineages leads to numerically unstable results, due to alter- Discretizing time and grouping parameters nating signs. However, one can obtain the following expres- sion for the expectation (Tavaré 1984, equation 5.11), which To discover periods of population expansion or contraction is numerically stable: with the SMCSD, it is necessary to specify a time discretiza- tion that has high resolution during such time periods. This nðtÞ :¼ E ½AnðtÞ is challenging in cases where we have little a priori knowl- Z Xn i t 1 nðn 2 1Þ⋯ðn 2 i þ 1Þ edge of the demographic history. Ideally the (unknown) co- ¼ exp 2 dt lðtÞ nðn þ 1Þ⋯ðn þ i 2 1Þ alescence events would be distributed uniformly across the i¼1 2 0 time intervals of our discretization; if very few coalescence · ð2i 2 1Þ: events occur in an interval, the corresponding population (13) size will often be overestimated, leading to runaway behav- ior. In our implementation, we employ a heuristic method, For simplicity, we assume that throughout time interval detailed in the Appendix, for choosing the discretization time Di =[ti21, ti), there are nðti21Þ lineages, creating what points t1, ..., td21 based on the spacing of SNPs in the data, we call a “wedding-cake genealogy,” as illustrated in with the aim for coalescence events to be distributed evenly Figure 3.

652 S. Sheehan, K. Harris, and Y. S. Song Figure 3 Illustration of the wedding-cake genealogy approximation, in A* which the varying thickness of a lineage in On schematically represents the amount of contribution to the absorption rate. As shown, the wedding- cake genealogy never actually loses any of the n lineages, and absorption into any of the n lineages is allowed at all times; we are modifying the absorption rate only as a function of time.

To modify the HMM formulas, we simply replace each n in (9), (11), and (12) with the appropriate nðÞ from (13), (i) except in the ratio nh/n multiplying w in (9) and the ratio (i,j) nh9=n multiplying z in (11) (these ratios are kept intact to preserve the relative contributions of different haplotypes). Note that the trunk genealogy never actually loses any of the n lineages, and absorption into any of the n lineages is allowed at all times; we are modifying the absorption rate only as a function of time. In the case of two sequences (one trunk lineage and one additionally sampled lineage), nðtÞ¼1 for all t, so the wedding-cake approximation does not change the model. Making the number of lineages Figure 4 Population size histories considered in our simulation study, more accurate by using this approximation improves our with time t = 0 corresponding to the present. (A) History S1 containing ability to estimate absorption times and therefore popu- a bottleneck. (B) History S2 containing a bottleneck followed by a rapid expansion. lation sizes.

Population Size Inference with Expectation Maximization tion assumes that the population-scaled recombination rates rb and mutation rates uℓ, as well as the mutation transition To utilize all our data in an exchangeable way, we use matrices P(ℓ), are given and fixed. For computational sim- “ ” (ℓ) a leave-one-out approach where we leave each haplotype plicity we currently assume that uℓ and P are the same for out in turn and perform the SMCSD computation. More pre- each site ℓ and rb is the same for each pair of consecutive cisely, we define the leave-one-out composite likelihood sites. The time discretization is fixed throughout the EM (LCL) as algorithm. The output of the algorithm is an estimated pop- l Yn ulation size scaling factor i for each interval Di =[ti21, ti). ðl; ; ... ; Þ¼ p^ðlÞ ð j ; ... ; ; ; ... ; Þ: LLCL h1 hn SMC hi h1 hi21 hiþ1 hn To convert these scaling factors into diploid effective popu- i¼1 lation sizes, one would need to multiply by Nref. Similarly, (14) the discretization times can be converted to years by multi-

plying them by 2Nrefg, where g is an average number of Because we compute the conditional sampling probability years per generation. through dynamic programming and the probability depends The standard Baum–Welch algorithm gives an EM pro- on the effective population sizes in complex ways, we cannot cedure for learning the parameters of an HMM in which the find the maximum-likelihood estimates analytically. Although transition probabilities and emission probabilities are trea- direct optimization could be used, it is computationally ted as unknown independent parameters. However, our expensive. Thus we employ an EM algorithm to estimate the HMM is more constrained than a general one, with (dn)2 + piecewise constant function l(t). Our current implementa- d|S|2 (where S is the alphabet of alleles) unknown

Estimating Variable Effective Population Sizes 653 Figure 6 Results of PSMC and diCal on data sets simulated under history Figure 5 Results of PSMC and diCal on data sets simulated under history S2 with sample size n = 10 and four alleles (A, C, G, and T). The PSMC S1 with sample size n = 10 and four alleles (A, C, G, and T). PSMC shows runaway behavior during the recent past, overestimating the most significantly overestimates the most recent population size, whereas we recent time by over three orders of magnitude on average. (A) Results for obtain good estimates up until the very ancient past. (A) Results for 10 10 different data sets. (B) Average over the 10 data sets. different data sets. (B) Average over the 10 data sets.

ward probabilities of our HMM. The exact details of making ^ðlÞ ^ðlÞ probabilities f ðDj; h9jDi; hÞ and j ða½ℓjDi; hÞ that are this step computationally efficient are provided in the Appen- l ... l functions of the d parameters 1, , d. During the E-step, dix. After the E-step, we use the Nelder–Mead optimization we compute the matrix [Aij] of the expected number of Di to routine (Nelder and Mead 1965) to update the parameters in Dj transitions. We also compute the matrix [Ei(b)] of the the M-step. Because of local maxima in the likelihood surface, expected number of times allele b is emitted from time in- we run this optimization routine several times (10) with terval i. Then, during the M-step we maximize the likelihood different starting conditions and then retain the estimates function with the largest likelihood. In the analysis discussed in this article, we ran the EM procedure for 20 iterations to obtain lðkþ1Þ; ... ; lðkþ1Þ 1 d convergence. As pointed out by Li and Durbin (2011), run- h i ðkÞ h i ðkÞ Q ðlðkÞÞ A Q ðlðkÞÞ E ðbÞ ning the EM procedure for many iterations often leads to ¼ f^ ij ^j ð j Þ i ; argmax Dj Di b Di overfitting. lðkÞ i;j i;b (15) ^ðlÞ ^ðlÞ Results where f ðDjjDiÞ and j ðbjDiÞ refer to the transition and emission probabilities where we have marginalized over the We compared the performance of our method, diCal, with absorption haplotype. that of PSMC (Li and Durbin 2011) on both simulated and

We initialize the algorithm with li = 1 for all i =1,..., d. real data. We compared diCal, using an n-haplotype leave- To compute [Aij] and [Ei(b)], we use the forward and back- one-out scheme (Equation 14), with PSMC, using the same

654 S. Sheehan, K. Harris, and Y. S. Song Table 1 Goodness-of-fit for PSMC and diCal, averaged over 10 simulated data sets, each with a sample of n = 10 haplotypes Simulated history PSMC error diCal error

S1 0.40328 0.10283 S2 0.71498 0.29992 The underlying population size histories are shown in Figure 4. The error metric used is a normalized integral of the absolute difference between the true history and the inferred history over time. These results demonstrate that diCal is substantially more accurate than the PSMC method. n haplotypes paired up sequentially (i.e., haplotype 1 paired with haplotype 2, haplotype 3 with haplotype 4, etc.). Unless stated otherwise, we used 16 discretization intervals and inferred seven free population size parameters in both PSMC and diCal. In the notation introduced by Li and Durbin (2011), the pattern we used is “3+2+2+2+ 2+2+3,” which means that the first parameter spans the Figure 7 The effect of considering more haplotypes in diCal, using the SMCSD-based leave-one-out likelihood approach. Data were simulated first three discretization intervals, the second parameter under population size history S1 with two alleles. In this study, we spans the next two intervals, and so on. We found that grouped adjacent parameters to fit roughly with population size change grouping a few consecutive intervals to share the same pa- points for illustration purposes. Shown is the increase in the accuracy of rameter significantly improved the accuracy of estimates. our method with an increasing sample size n. The recent sizes are the For example, due to an insufficient number of coalescence most dramatically affected, while intermediate sizes remain accurate even with few haplotypes. events, the first and last intervals are particularly susceptible to runaway behavior if they are assigned their own free parameters, but grouping them with their neighboring inter- over PSMC, we used an error metric described in Li and vals prevented such pathological behavior. See Supporting Durbin (2011), which is a normalized integral of the abso- Information, File S1 for further details of running PSMC and lute difference between the true ms history and the inferred our method. history over time. The results, summarized in Table 1, show that our method had a substantially lower overall error than The accuracy of population size inference PSMC. on simulated data For inference using diCal, we examined the impact of We used ms (Hudson 2002) to simulate full ancestral re- considering more haplotypes on the accuracy of population combination graphs (ARGs) under two different population size estimation. In this study, we focused on history S1 and histories and then superimposed a quadra-allelic, finite-sites grouped adjacent parameters to fit roughly with population mutation process on the ARGs to generate sequence data size change points for illustration purposes. Figure 7 shows over {A, C, G, T}. As illustrated in Figure 4, both histories qualitatively that increasing the sample size n makes our es- contained bottlenecks in the moderately recent past. History timate of the recent population size more accurate. Interme- S2 in Figure 4B in addition contained a recent rapid popu- diate sizes changed little with increasing n, and ancient sizes lation expansion relative to the ancient population size. For were somewhat variable depending on the distribution of each history, we simulated 10 independent ARGs for L =106 coalescence events. Note that for n = 2, our method is very sites and 10 haplotypes, with the population-scaled recom- similar to PSMC; we compute the transition probabilities bination rate set to 0.01 per site in ms. To add mutations, we slightly differently, but the wedding-cake approximation does set the population-scaled mutation rate to 0.014 per site and not change the model in this case. We used the same error used the quadra-allelic mutation matrix described in File S1. metric mentioned above to quantify the advantage of increas- As shown in Figures 5 and 6, our method performed ing the sample size. As shown in Table 2, the overall error much better in the recent past than did PSMC. PSMC often decreased as the sample size increased, with improvement had the type of runaway behavior shown in Figure 6, where tapering to 8–10 haplotypes for this particular history. it overestimated the most recent population size by over Impact of the wedding-cake genealogy approximation three orders of magnitude. We note that our method began to lose accuracy for more ancient times, most likely because We examined the advantage of using the wedding-cake ancient absorption events in a 1-Mb region are few and genealogy approximation in the SMCSD computation, com- sparsely distributed in time in the leave-one-out SMCSD pared to assuming an unmodified trunk genealogy. Figure 8 computation. Both methods tend to smooth out sudden illustrates that the unmodified trunk genealogy leads to changes in population size, which is why the inferred recov- overestimation of population sizes in the distant past, as ery time from a bottleneck is more recent than it should be. discussed in Modifying the Trunk Genealogy. The wedding- To quantify the improvement in accuracy of our method cake genealogy approximation, which adjusts the absorption

Estimating Variable Effective Population Sizes 655 Table 2 Goodness-of-fit for diCal on simulated bottlenecked history S1 for different sample sizes Sample size n diCal error

2 0.2914 4 0.1901 6 0.1446 8 0.0802 10 0.0899 We used the same error metric as in Table 1. As the sample size n increases, the error decreases, with global improvement tapering at 8–10 haplotypes. rate by accounting for the expected number of ancestral lineages of the already observed sample, leads to a signifi- cant improvement in the accuracy of population size infer- ence in the ancient past. Accuracy of estimated coalescence times Figure 8 A comparison of the SMCSD-based leave-one-out likelihood To assess the accuracy of estimated coalescence times, we approach in diCal, using the wedding-cake genealogy (blue line), with produced the posterior decoding and the posterior mean of that using the unmodified trunk genealogy (green line). The results the times that a left-out haplotype got absorbed into shown are for n = 10 haplotypes simulated under history S1 with two a wedding-cake genealogy. The data were simulated under alleles. Without the wedding-cake genealogy, absorption of the left-out lineage into the trunk occurs too quickly, and the lack of absorption the full coalescent with recombination, using ms assuming events in the midpast to the ancient past leads to substantial overestima- a constant population size. The true coalescence time at tion of the population sizes. Recent population sizes remain unaffected each site was taken as the time the left-out lineage joined since during these times the absorption rates in the wedding-cake gene- the rest of the coalescent tree at that site. As shown in Figure alogy and in the trunk genealogy are roughly the same. In this example, fi 9, we found that our estimated absorption times closely we grouped adjacent parameters to t roughly with population size change points for illustration purposes. tracked the true coalescence times. Results on real data rather unrealistic results; see Figure 10A, for which we We applied our method to data from 10 of the 179 human truncated the y-axis at 40,000 for ease of comparison with genomes that were sequenced at low coverage and phased Figure 10B. The data set may be too small for PSMC to as part of the 1000 Genomes pilot project. Five of the work accurately. We note that PSMC was able to produce individuals were Yorubans from Ibadan, Nigeria (YRI) and more reasonable results on simulated data sets, probably five were Utah residents of central European descent (CEU) becausetheyweregeneratedwithmuchhighermutation (1000 Genomes Project Consortium 2010). To minimize and recombination rates, thus representing a larger geno- potential confounding effects from natural selection, we mic region for humans. chose a 3-Mb region on chromosome 1 with no genes and As shown in Figure 10B, our method inferred that CEU then used the middle 2 Mb for analysis. We used the human and YRI had very similar histories in the distant past up until reference (version 36) to create a full multiple-sequence 117 KYA; discrepancies up to this point are most likely due alignment of 10 haplotypes (five diploid individuals) for to few observed ancient coalescence events with the leave- each of the CEU and YRI populations. Although we filtered one-out approach. We inferred that the European popula- out unphased individuals and sites, the final sequences are tion underwent a severe (out-of-Africa) bottleneck starting based on low-coverage short read data, so phasing and 117 KYA, with the effective population size dropping by imputation errors could affect the accuracy of our coalescence a factor of 12 from 28,000 to 2250. Furthermore, at time inference. We assumed a per-generation mutation rate the level of resolution provided by our time discretization, of m = 1.25 · 1028 per site, which is consistent with recent our results suggest that the European population has recov- studies of de novo mutation in human trios (Awadalla et al. ered from the bottleneck to an average effective size of 2010; Roach et al. 2010; Kong et al. 2012), and a mutation 12,500 for the past 16,000 years. transition matrix estimated from the human and the chimp In contrast to previous findings (e.g., Li and Durbin reference genomes (shown in File S1). For simplicity, we 2011), our method did not infer a significant drop in the assumed that the per-generation recombination rate r be- YRI population size during the early out-of-Africa bottleneck tween consecutive bases is constant and equal to m. The gen- phase in Europeans. Instead, the African effective popula- eration time was assumed to be 25 years. For a reference tion size seems to have decreased more gradually over time population size, we used Nref = 10,000. (possibly due to changes in structure) to an average effective The results of PSMC and our method are shown in Fig- size of 10,000 for the past 16,000 years. We note that our ure 10. PSMC displayed runaway behavior and produced results for real data are fairly robust to the choice of

656 S. Sheehan, K. Harris, and Y. S. Song Figure 9 Estimated absorption times in diCal using the leave-one-out SMCSD method vs. the true coalescence times for a 100-kb region. The data were simulated using ms for n = 6 haplotypes, assuming a constant population size. The true coalescence time at each site, obtained from ms, was taken as the time the ancestral lineage of a left-out haplotype joined the rest of the coalescent tree at that site. Shown is the true coalescence time for the nth haplotype and our corresponding inferred absorption times, obtained from the posterior decoding and the posterior mean. Our estimates generally track the true coalescence times closely. discretization, given that a sufficient number of coalescence we have inferred a severe (out-of-Africa) bottleneck in Euro- events occur within each set of grouped intervals. peans that began 117 KYA, with a drop in the effective population size by a factor of 12. In contrast, we have ob- Run time served a much more mild population size decrease in the The run time of our method is O(Ld(d + n)n), where n is the African population. We remark that our estimate of the tim- number of sequences, L is the number of bases in each se- ing of the bottleneck may not be very accurate, since we quence, and d is the number of time discretization intervals; used only 16 discretization intervals and seven free popula- the run time for each CSD computation is O(Ld(d + n)), and tion size parameters. Furthermore, all of our inferred times each sequence is left out in turn (although this step is paral- and population sizes would be smaller by a factor of 2 if we lelizable). The run time for PSMC is O(Ld2P), where P is the had used m = 2.5 · 1028. See Scally and Durbin (2012) for number of pairs of sequences analyzed. In practice, PSMC a more thorough discussion of how new mutation rate esti- can run much faster when consecutive sites are grouped into mates are changing the way we view ancient population bins of size 100; a bin is considered heterozygous if it con- history. An earlier initial human dispersal out of Africa tains at least one SNP and homozygous otherwise. Creating would fit with archaeological evidence of human artifacts a reasonable binning scheme for multiple sequences is less dated at 74 KYA in India and 64 KYA in China (Scally and clear. We are currently exploring this avenue, which could Durbin 2012). significantly improve our runtime and potentially enable During the recent past, our results demonstrate that the whole-genome analysis. effective population size of Europeans has grown in the past 16,000 years, slightly surpassing the effective population size of Africans, which does not show a growth at this Discussion and Future Work resolution. Recent studies (Gutenkunst et al. 2009; Gravel In this article, we have generalized the recently developed et al. 2011) suggest that the European population size re- sequentially Markov conditional sampling distribution cently grew much faster than the African population size, framework (Paul et al. 2011) to accommodate a variable although the sample size we considered is not large enough population size. One important new idea central to the suc- to confirm this. cess and accuracy of our method is the wedding-cake gene- The main strength of our method is in the recent past. alogy approximation, which modifies the rate of absorption Large-scale sequencing studies (Coventry et al. 2010; Keinan into the trunk by accounting for the varying number of lin- and Clark 2012; Nelson et al. 2012) of a subset of genes eages over time. Under simulated data, we have shown that suggest that humans underwent recent explosive population our method produces substantially more accurate estimates growth. Our method should be well equipped to infer such of the recent effective population size than does PSMC (Li recent demographic histories, but we would need to con- and Durbin 2011). sider a very large sample to accurately infer the rate of Applying our method to a 2-Mb intergenic region of expansion and the time of onset. Because of issues of com- chromosome 1 from five Europeans and five Africans, putational speed and memory footprint, our current imple- sequenced as part of the 1000 Genomes Project, and using mentation of the SMCSD computation can handle up to 20 a per-generation mutation rate of m = 1.25 · 1028 per site, haplotypes and a few megabases, but we are in the process

Estimating Variable Effective Population Sizes 657 mutations may produce vastly different likelihoods. Li and Stephens suggested averaging the PAC likelihood over sev- eral random permutations to alleviate this problem and this strategy seems to work reasonably well in practice. In our work, we have avoided the problem by adopting the leave- one-out approach, which yields accurate estimates of popu- lation sizes for the recent past, but not as good results for the ancient past. Employing the PAC approach may produce accurate estimates for all times, but a challenge that needs to be addressed in the SMCSD framework is that the wed- ding-cake genealogy, which is based on the prior expectation of the number of lineages, may not be accurate when there are few lineages, since coalescence times are more variable when they involve fewer lineages. We are working on im- proving the accuracy of the SMCSD computation by using the posterior absorption time distributions in a recursive fash- ion so that locus-specific absorption rates tailored to data can be used. This approach, together with the PAC model, should yield more accurate estimates of population sizes. One factor that we have not investigated is the impact of variable recombination (and/or mutation) rates, although it is conceptually straightforward for our method to accom- modate them. We have chosen not to incorporate recombi- nation rate variation into our present implementation as it would make the method even more computationally expen- sive, since the transition probabilities would then be potentially different at each site. In addition, most fine-scale recombina- tion maps (Crawford et al. 2004; McVean et al. 2004; Fearn- head and Smith 2005; Chan et al. 2012) are inferred under the assumption of a constant population size, which is exactly the assumption we are not making.WealsonotethatLiandDur- Figure 10 Variable effective population size inferred from real human data for European (CEU) and African (YRI) populations. For each popula- bin (2011) found that recombination hotspots did not affect tion, we analyzed a 2-Mb region on chromosome 1 from five diploid their results significantly and that the important parameter is individuals (10 haplotypes), assuming a per-generation mutation rate of the average recombination rate. 2 m = 1.25 · 10 8 per site. (A) The results of PSMC, which had some A good choice of time discretization is critical to the runaway behavior and unrealistic results. The data set is probably too performance of both diCal and PSMC. It is better to subdivide small for PSMC to work accurately. (B) The results of diCal. We inferred fi that the European population size underwent a severe bottleneck 117 time more nely during periods with small population size KYA and recovered in the past 16,000 years to an effective size of than during periods with large population size when few 12,500. In contrast, our results suggest that the YRI population size coalescences occur. However, since the demography is what did not experience such a significant drop during the early out-of-Africa we are trying to infer, selecting an initial discretization is very bottleneck phase in Europeans. difficult. Creating adaptive discretization schemes for co- alescent HMMs is an important area of future research. of exploring ways to increase the scalability. One way in We have shown that posterior decodings of diCal enable which we should be able to reduce our run time is by in- accurate inference of coalescence times. Using this informa- corporating recently developed algorithms for blockwise tion, it should be possible to develop an efficient method of HMM computation (Paul and Song 2012), which have been sampling marginal coalescent trees from the posterior distri- shown to result in a speedup of several orders of magnitude bution. We expect such local tree inference to have interesting for large data sets. applications, including genome-wide association studies and All the results in this article make use of a leave-one-out tests of selective neutrality. approach (Equation 14) instead of the well-used product of The SMCSD framework has been recently extended approximate conditionals (PAC) method proposed by Li and (Steinrücken et al. 2013) to incorporate structured pop- Stephens (2003). Briefly, the PAC approach utilizes the ap- ulations with migration. We are currently working on proximate likelihood p^ðhsð1ÞÞp^ðhsð2Þjhsð1ÞÞ⋯p^ðhsðnÞjhsð1Þ; ...; combining this extension with the work presented here hsðn21ÞÞ, where p^ is an approximate conditional sampling to implement an integrated inference tool (to be incorpo- distribution and s is some permutation of {1, ..., n}. A rated into diCal) for general demographic models. Such well-known drawback of this approach is that different per- a method could provide a detailed picture of the demographic

658 S. Sheehan, K. Harris, and Y. S. Song history that created the diversity we see today in humans and multiple populations from multidimensional SNP frequency other species. data. PLoS Genet. 5(10): e1000695. Haddrill, P. R., K. R. Thornton, B. Charlesworth, and P. Andolfatto, 2005 Multi-locus patterns of nucleotide variability and the de- Acknowledgments mographic and selection history of Drosophila melanogaster pop- ulations. Genome Res. 15: 790–799. We thank the members of our group for helpful discussions, Hobolth, A., O. F. Christensen, T. Mailund, and M. H. Schierup, in particular Anand Bhaskar, Jack Kamm, Joshua Paul, and 2007 Genomic relationships and speciation times of human, Matthias Steinrücken. This research is supported in part by chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet. 3(2): e7. National Science Foundation Graduate Research Fellowships Hudson, R. R., 2002 Generating samples under a Wright-Fisher (to K.H. and S.S.), by a University of California (Berkeley) neutral model. Bioinformatics 18: 337–338. Regent’s fellowship (to K.H.), and by National Intitutes of Keinan, A., and A. G. Clark, 2012 Recent explosive human pop- Health grant R01-GM094402 and a Packard Fellowship for ulation growth has resulted in an excess of rare genetic variants. – Science and Engineering (to Y.S.S.). Science 336(6082): 740 743. Kong, A., M. L. Frigge, G. Masson, S. Besenbacher, P. Sulem et al., 2012 Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488(7412): 471–475. Literature Cited Li, H., and R. Durbin, 2011 Inference of human population history from individual whole-genome sequences. Nature 10: 1–5. 1000 Genomes Project Consortium, 2010 A map of human ge- Li, N., and M. Stephens, 2003 Modelling linkage disequilibrium, nome variation from population-scale sequencing. Nature 467: and identifying recombination hotspots using SNP data. Genet- 1061–1073. ics 165: 2213–2233. Awadalla, P., J. Gauthier, R. Myers, F. Casals, F. Hamdan et al., Mailund, T., J. Y. Dutheil, A. Hobolth, G. Lunter, and M. H. 2010 Direct measure of the de novo mutation rate in au- Schierup, 2011 Estimating divergence time and ancestral ef- tismandschizophreniacohorts.Am.J.Hum.Genet.87: fective population size of Bornean and Sumatran orangutan sub- 316–324. species using a coalescent hidden Markov model. PLoS Genet. 7: Chan, A. H., P. A. Jenkins, and Y. S. Song, 2012 Genome-wide e1001319. fine-scale recombination rate variation in Drosophila mela- Marjoram, P., and J. D. Wall, 2006 Fast “coalescent” simulation. nogaster. PLoS Genet. 8(12): e1003090. BMC Genet. 7: 16. Coventry, A., L. M. Bull-Otterson, X. Liu, A. G. Clark, T. J. Maxwell McVean, G. A., and N. J. Cardin, 2005 Approximating the coales- et al., 2010 Deep resequencing reveals excess rare recent var- cent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. iants consistent with explosive population growth. Nat. Com- 360: 1387–1393. mun. 1: 131. McVean, G. A. T., S. R. Myers, S. Hunt, P. Deloukas, D. R. Bentley Crawford, D. C., T. Bhangale, N. Li, G. Hellenthal, M. J. Rieder et al., 2004 The fine-scale structure of recombination rate var- – et al., 2004 Evidence for substantial fine-scale variation in re- iation in the human genome. Science 304: 581 584. combination rates across the human genome. Nat. Genet. 36: Myers, S., C. Fefferman, and N. Patterson, 2008 Can one learn – 700–706. history from the allelic spectrum? Theor. Popul. Biol. 73: 342 De Iorio, M., and R. C. Griffiths, 2004a Importance sampling on 348. coalescent histories. I. Adv. Appl. Probab. 36(2): 417–433. Nelder, J. A., and R. Mead, 1965 A simplex method for – De Iorio, M., and R. C. Griffiths, 2004b Importance sampling on function minimization. Comput. J. 7(4): 308 313. coalescent histories. II: Subdivided population models. Adv. Nelson,M.R.,D.Wegmann,M.G.Ehm,D.Kessner,P.S.Jeanet al., – 2012 An abundance of rare functional variants in 202 drug target Appl. Probab. 36(2): 434 454. – Dutheil, J. Y., G. Ganapathy, A. Hobolth, T. Mailund, M. K. Uoyenoyama genes sequenced in 14,002 people. Science 337(6090): 100 104. Palamara, P. F., T. Lencz, A. Darvasi, and I. Pe’er, 2012 Length et al., 2009 Ancestral population genomics: the coalescent hidden distributions of identity by descent reveal fine-scale demo- Markov model approach. Genetics 183: 259–274. graphic history. Am. J. Hum. Genet. 91(5): 809–822. Fearnhead, P., and N. G. C. Smith, 2005 A novel method with Paul, J. S., and Y. S. Song, 2010 A principled approach to deriving improved power to detect recombination hotspots from poly- approximate conditional sampling distributions in population morphism data reveals multiple hotspots in human genes. Am. – – genetics models with recombination. Genetics 186: 321 338. J. Hum. Genet. 77: 781 794. Paul, J. S., and Y. S. Song, 2012 Blockwise HMM computation for Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth large-scale population genomic inference. Bioinformatics 28: et al., 2011 Demographic history and rare allele sharing 2008–2015. among human populations. Proc. Natl. Acad. Sci. USA 108: Paul, J. S., M. Steinrücken, and Y. S. Song, 2011 An accurate – 11983 11986. sequentially Markov conditional sampling distribution for the fi Grif ths, R., and S. Tavaré, 1994 Sampling theory for neutral coalescent with recombination. Genetics 187: 1115–1128. alleles in a varying environment. Philos. Trans. R. Soc. Lond. Roach, J. C., G. Glusman, A. F. A. Smit, C. D. Huff, R. Hubley et al., B Biol. Sci. 344: 403–410. 2010 Analysis of genetic inheritance in a family quartet by Griffiths, R. C., P. A. Jenkins, and Y. S. Song, 2008 Importance whole-genome sequencing. Science 328: 636–639. sampling and the two-locus model with subdivided population Scally, A., and R. Durbin, 2012 Revising the human mutation rate: structure. Adv. Appl. Probab. 40(2): 473–500. implications for understanding human evolution. Nat. Rev. Gronau,I.,M.J.Hubisz,B.Gulko,C.G.Danko,andA.Siepel, Genet. 10: 745–753. 2011 Bayesian inference of ancient human demography Steinrücken,M.,J.S.Paul,andY.S.Song,2013 AsequentiallyMarkov from individual genome sequences. Nat. Genet. 43: 1031– conditional sampling distribution for structured populations with mi- 1034. gration and recombination. Theor. Popul. Biol. (in press). Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson, and C. D. Tajima, F., 1983 Evolutionary relationship of DNA sequences in Bustamante, 2009 Inferring the joint demographic history of finite populations. Genetics 105: 437–460.

Estimating Variable Effective Population Sizes 659 Tavaré, S., 1984 Lines of descent and genealogical processes, and Wang, Y., and J. Hey, 2010 Estimating divergence parameters their application in population genetics models. Theor. Popul. with small samples from a large number of loci. Genetics 184: Biol. 26: 119–164. 363–379. Thornton, K., and P. Andolfatto, 2006 Approximate Bayesian in- Wiuf, C., and J. Hein, 1999 Recombination as a point process ference reveals evidence for a recent, severe bottleneck in along sequences. Theor. Popul. Biol. 55: 248–259. a Netherlands population of Drosophila melanogaster. Genetics 172: 1607–1619. Communicating editor: J. Wakeley

Appendix HMM Formulas The expression R(i, t; j, t9) in (10) is defined as 8 P > ðiÞð Þþ i21 ðkÞ ; , ; > R t k¼0 R if i j < P ; ; ; 9 ¼ ðjÞð 9Þþ j21 ðkÞ ; . ; R i t j t > R t k¼0 R if i j > P : ðiÞð ∧ 9Þþ i21 ðkÞ ; ¼ ; R t t k¼0 R if i j where ∧ denotes the minimum operator and, for u 2 [tk21, tk),

r l kY2 1 ðkÞ b k 2r uþnðu2t 2 Þ=l 2r t 2 nðt 2t 2 Þ=l R ðuÞ :¼ e b k 1 k 2 e b k 1 e m m 1 m ; n 2 r l b k m¼1

r l kY2 1 ðkÞ b k 2r t þnðt 2t 2 Þ=l 2r t 2 nðt 2t 2 Þ=l R :¼ e b k k k 1 k 2 e b k 1 e m m 1 m : n 2 r l b k m¼1

After the state space has been discretized, we compute the transition probabilities using y(i) (the probability that no re- combination occurs) and z(i,j) (the probability that recombination does occur), Z ti ð Þ 1 ðlÞ 2r t y i ¼ z ðt; hÞe b dt ^ðlÞ z ðDi; hÞ ti21 Yi21 2nðt 2t 2 Þ=l 2r t 2 2r t 2nðt 2t 2 Þ=l ¼ 1 n e k k 1 k e b i 1 2 e b i i i 1 i ðiÞ þ r l w n b i k¼1 and

Z Z Z ∧ ðlÞ tj ti tℓ21 tℓ z ð ; Þ ð ; Þ n 2r t tℓ hℓ ðlÞ z i j ¼ r e b r R z ðtℓ2 ; hℓ2 Þdt dtℓ2 dtℓ ðiÞ b N zðlÞðtÞ t 1 1 r 1 w nhℓ2 tj21 ti21 0 d 1 tr i∧Pj21 :¼ Zði;jÞ þ wðjÞ RðkÞ; k¼1

(i,j) where Z corresponds to the case when the recombination event occurs during the time interval Di∧j (i.e., the latest it (k) (k) (i,j) could) and R corresponds to a recombination event in the time interval Dk. R is defined as before, and Z is

Z Z Z ∧ ðlÞ tj ti tℓ21 tℓ z ð ; Þ ði;jÞ n 2r t tℓ hℓ ðlÞ Z ¼ r e b r R z ðtℓ2 ; hℓ2 Þdt dtℓ2 dtℓ: ðiÞ b N zðlÞðtÞ t 1 1 r 1 w nhℓ2 tj21 ti21 tði∧jÞ21 d 1 tr

To evaluate Z(i,j), we must separate the computation into the cases i , j, i . j, and i = j,

660 S. Sheehan, K. Harris, and Y. S. Song 8 ðjÞ > w ðiÞ > f ; if i , j > wðiÞ > > ð Þ > f j ; if i . j < ði;jÞ ¼ r l r l Z 1 b i 2r t 2 2nðt 2t 2 Þ=l 2r t 2 b i 2r t 2 22nðt 2t 2 Þ=l > e b i 1 2 2e i i 1 i b i 1 2 e b i 1 i i 1 i > wðiÞ n þ r l n 2 l r > b i i > > 2 Yi21 > 2n 2r t 2nðt 2t 2 Þ=l Þ 2nðt 2t 2 Þ=l :> þ e b i i i 1 i e m m 1 m ; if i ¼ j; ðn 2 l rÞðn þ l rÞ i i m¼1 where we define

ð Þ 2r lir 2 ð 2 Þ=l 2r n 2r i :¼ bti21 þ b n ti ti21 i bti21 2 bti : f e 2 l r e 2 l r e n i b n i b

To compute the emission probabilities we define v(i)(k) below,

k Yi21 Xk h i ð Þ nðuℓÞ =l 2 ð 2 Þ=l 2ðjþ1Þ k! 2 k2j 2 k2j v i ðkÞ :¼ enti21 i e n tj tj21 j c e citi21 t 2 e citi t ; l ðiÞ ! i ðk 2 jÞ! i21 i iw k j¼1 j¼0 where n ci :¼ uℓ þ : li

Computation of the Expected Transition Counts During the E-Step

Naively, if we compute the expected number of transitions from state sℓ21 =(Di, hℓ21) to state sℓ =(Dj, hℓ) and then marginalize over the haplotypes, we obtain an O(n2) algorithm. To improve the run time, we can decompose the probability that a transition is used between locus ℓ 2 1 and ℓ into a part that depends on the absorption haplotype and a part that depends on the absorption time interval, and thus we can reduce the run time to O(n). First we compute the posterior (ℓ) probability A (sℓ21, sℓ) that a particular transition is used between locus ℓ 2 1 and ℓ, in terms of the discretized forward and backward probabilities Fℓ() and Bℓ(). Let the newly sampled haplotype have allele a at locus ℓ,soa[ℓ]=a. Then

ðlÞ ðlÞ ðℓÞð ; Þ¼ 1 ð Þf^ ð j Þ^j ð j Þ ð Þ: A sℓ21 sℓ p^ðaÞ Fℓ21 sℓ21 sℓ sℓ21 a sℓ Bℓ sℓ

Now we marginalize over the haplotypes, plugging in the transition density formula

P P X X ðlÞ ðlÞ ðℓÞð ; Þ¼ 1 ð Þf^ ð j Þ^j ð j Þ ð Þ A sℓ21 sℓ p^ðaÞ Fℓ21 sℓ21 sℓ sℓ21 a sℓ Bℓ sℓ hℓ2 hℓ 1 hℓ21 hℓ X X ðlÞ ðℓÞ 1 ^ ðiÞ ði;jÞnhℓ A D ; D ¼ Fℓ2 ðsℓ2 Þj ðajsℓÞBℓðsℓÞ y d ; þ z i j p^ðaÞ 1 1 sℓ21 sℓ n hℓ2 hℓ " 1 ! X ðlÞ ¼ 1 d ðiÞ ð ; Þ^j ð j ; Þ ð ; Þ p^ðaÞ i;jy Fℓ21 Di h a Di h Bℓ Di h h ! !# P P ðlÞ ð ; Þ nhℓ ^ þ z i j Fℓ2 ðsℓ2 Þ j ða jsℓÞBℓðsℓÞ ; 1 1 n hℓ21 hℓ which is linear in n since we are onlyP ever summing over one haplotype. To get the expected transition counts, we then sum ¼ L ðℓÞð ; Þ over all the breakpoints, so Aij ℓ¼2 A Di Dj .

Estimating Variable Effective Population Sizes 661 Discretizing Time With an ideal time discretization, coalescence events would be uniformly distributed across intervals, but inferring the distribution of coalescence times is equivalent to the problem of population size estimation. Our heuristic discretization procedure seeks to avoid poor discretization by using the observed spacing of SNPs in the data. Let T be the empirical distribution of absorption times for all the contiguous segments inferred by a posterior decoding of our data set. Then, for a discretization with d intervals, our goal is to compute t1, ..., td21 such that we see the same number (i.e.,|T |/d)of absorption times in each interval. We first tackle the problem of breaking up our data into segments with the same pairwise coalescence time and then compute the expectation of this time. The locations of ancestral recombination breakpoints divide up a sequence pair into segments that each have a single coalescence time, but we do not know these breakpoints. However, it will often be the case that all the base pairs between two adjacent SNPs will coalesce at the same time or be split between just two different times on either side of a recombination breakpoint. Moreover, in many cases, the positional distribution of SNPs and that of recombination breakpoints will be correlated—in particular, both SNPs and recombination breakpoints will be spaced farthest apart in regions of recent coalescence time. With this rationale, we take the observed distances between SNPs as a proxy for the length distribution of nonrecombining segments. To be more specific, let L be the list of all lengths between adjacent SNPs for all pairs of haplotypes, and let the d empirical quantiles of L be bounded by L1, ..., Ld21. Now we need the expectated coalescence time of an l-base segment with no mutation or recombination. Conditional on m mutation events and r recombination events, the coalescence time for two lineages under a constant population size is distributed as G(1 + m + r,1+lu + lr) (see Tajima 1983 for a derivation with mutation only), so the expected coalescence time for m = r =0is

1 : 1 þ lðu þ rÞ

In our implementation, we drop the 1 in the denominator since this represents our prior under constant population sizes of two lineages coalescing at rate 1. We want to minimize the use of our prior, so we put more weight on the term related to the empirical length distribution. Putting this all together, we plug the quantiles of L into this formula to obtain ti: 1 ti ¼ : Ld2iðr þ uÞ

If an approximate time range of interest is known (for example, in humans we might be interested in the last 1 million years), then the user can specify an end-time Tmax. Then all times are scaled by Tmax/td21.

662 S. Sheehan, K. Harris, and Y. S. Song GENETICS

Supporting Information http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.112.149096/-/DC1

Estimating Variable Effective Population Sizes from Multiple Genomes: A Sequentially Markov Conditional Sampling Distribution Approach

Sara Sheehan, Kelley Harris, and Yun S. Song

Copyright © 2013 by the Genetics Society of America DOI: 10.1534/genetics.112.149096 2 SI S. Sheehan, K. Harris, Y. S. Song

File S1 Supporting Information

1 Simulation details

The following ms commands were used to simulate data under three population size change histories:

S1: ms 10 1 -T -r 10000 1000000 -eN 0.05 0.1 -eN 0.2 0.5 -eN 0.5 1.25 S2: ms 10 1 -T -r 10000 1000000 -eN 0 10 -eN 0.05 0.1 -eN 0.2 0.5 -eN 0.5 1.25

Note that ms times are in units of 4N0 generations, so we multiplied the raw times above by 2 to compare to PSMC and our method diCal. Mutation rates were not specified above, since the only ms output used was tree at each base (-T flag). Mutations were then added to the trees using a finite sites model, the mutation matrix in Table 1, and a mutation rate θ = 0.01 × 1.443. The factor of 1.443 accounts for the fact that this mutation matrix allows mutations that do not actually change the base (i.e., an A → A transition); see Chan et al. (2012) for further explanation. This mutation matrix was also used for the real data analysis. The following style of command was used to run PSMC. We used 20 iterations as described in the PSMC paper (Li and Durbin, 2011), and the same pattern of parameters we used for diCal: psmc -p 3+2+2+2+2+2+3 -t 7 -N 20 -r 1 -o output.psmc input.psmcfa

To run our method on simulated data, the following style of command was used: java -Xmx25G -d64 diCal_EM -i input.fasta -p params.txt -n 9 -t 5 -a "3 2 2 2 2 2 3"

The parameter file includes the number of loci in each sequence, the number of alleles (4 in our case), an estimate of the mutation rate, mutation matrix, and recombination rate, and the discretization. The -n flag specifies the number of haplotypes to use in the trunk, so there are n + 1 total. The -t flag specifies the number of threads to use; memory requirements scale linearly with this parameter. If -t 1 was specified in the case, then -Xmx5G could be used for the memory requirement. The -a flag specifies the pattern of parameters, in an analogous fashion to PSMC. To run our method on real data, the following style of command was used: java -Xmx20G -d64 diCal_EM -i input.fasta -p params.txt -n 9 -t 2 -r 1.25 -a "4 2 2 2 2 2 2"

Table 1: Mutation matrix for realistic human data. The rows represent the original base, and the columns represent the mutated base.

base ACGT A 0.503 0.082 0.315 0.100 C 0.186 0.002 0.158 0.655 G 0.654 0.158 0 0.189 T 0.097 0.303 0.085 0.515 S. Sheehan, K. Harris, Y. S. Song 3 SI

The -r flag specifies the Tmax (analogous to the -t flag for PSMC), since for humans we know the approximate date range of interest. For the real data we used a longer sequence, so the memory requirements scale accordingly (linearly).

2 Comparison of diCal to PSMC

Although diCal and PSMC are both implementations of the sequentially Markov coalescent in a discrete- time framework, they have significant differences that must be considered when comparing results from the two programs. One difference is that PSMC scales all population sizes with respect to an inferred parameter θpsmc = 4Npsmcµ. In contrast, diCal scales population sizes with respect to a fixed input θsmcsd = 4Nsmcsdµ. Neither θ is right or wrong, they are just scaled with respect to a different N0. If we arbitrarily set Nsmcsd = 1, then Npsmc = θpsmc/θsmcsd

Thus when analyzing the results, we multiplied the PSMC sizes and times by Npsmc. We also multiplied the ms times by 2, since they are in units of 4N0 generations. To compare the performance of the two programs fairly, we gave both PSMC and diCal the same amount of data. Specifically, we compared the performance of diCal with a n-sequence leave-one-out scheme to the performance of PSMC with the same n sequences, but paired up sequentially (i.e. sequence 1 with 2, sequence 3 with 4, etc).

References

Chan, A. H., Jenkins, P. A., and Song, Y. S. 2012. Genome-wide fine-scale recombination rate variation in Drosophila melanogaster. PLoS Genet., 8,(12) e1003090.

Li, H. and Durbin, R. 2011. Inference of human population history from individual whole-genome sequences. Nature, 10, 1–5.