
bioRxiv preprint doi: https://doi.org/10.1101/2020.12.10.420042; this version posted December 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Linkage disequilibrium between rare mutations

Benjamin H. Good¹

¹Department of Applied Physics, Stanford University, Stanford, CA 94305

The statistical associations between mutations, collectively known as linkage disequilibrium (LD), encode important information about the evolutionary forces acting within a population. Yet in contrast to single-site analogues like the site frequency spectrum, our theoretical understanding of linkage disequilibrium remains limited. In particular, little is currently known about how mutations with different ages and fitness costs contribute to expected patterns of LD, even in simple settings where recombination and genetic drift are the major evolutionary forces. Here, we introduce a forward-time framework for predicting linkage disequilibrium between pairs of neutral and deleterious mutations as a function of their present-day frequencies. We show that the dynamics of linkage disequilibrium become much simpler in the limit that mutations are rare, where they admit a simple heuristic picture based on the trajectories of the underlying lineages. We use this approach to derive analytical expressions for a family of frequency-weighted LD statistics as a function of the recombination rate, the frequency scale, and the additive and epistatic fitness costs of the mutations. We find that the frequency scale can have a dramatic impact on the shapes of the resulting LD curves, reflecting the broad range of time scales over which these correlations arise. We also show that the differences between neutral and deleterious LD are not purely driven by differences in their mutation frequencies, and can instead display qualitative features that are reminiscent of epistasis. We conclude by discussing the implications of these results for recent LD measurements in bacteria. This forward-time approach may provide a useful framework for predicting linkage disequilibrium across a range of evolutionary scenarios.

The statistical associations between mutations, collectively known as linkage disequilibrium (LD), play a central role in modern evolutionary genetics (Slatkin, 2008). Correlations between mutations enable genome-wide association studies and related methods for mapping the genetic basis of diseases and other complex traits (Visscher et al., 2017). The co-occurrence of mutations in specific DNA molecules is also important for evolutionary dynamics, since these realized combinations provide the raw material on which natural selection and other evolutionary forces can act. As a result, contemporary patterns of linkage disequilibrium encode crucial information about the historical processes of recombination (McVean et al., 2004; Rosen et al., 2015), natural selection (Garud et al., 2015; Sabeti et al., 2002), and demography (Harris and Nielsen, 2013; Li and Durbin, 2011; Ragsdale and Gravel, 2019) that operate within a population. Yet while extensive theory has been developed for predicting marginal distributions of mutations (Coop and Ralph, 2012; Cvijović et al., 2018; Kamm et al., 2017; Kimura, 1964; Neher and Hallatschek, 2013; Polanski and Kimmel, 2003; Sawyer and Hartl, 1992), higher order correlations like linkage disequilibrium remain poorly understood in comparison.

Many previous studies of linkage disequilibrium have focused on pairwise correlations between mutations at different sites along a genome. These correlations are often summarized by the correlation coefficient,

$$ r^2 = \frac{(f_{AB} - f_A f_B)^2}{f_A (1 - f_A) f_B (1 - f_B)} , \tag{1} $$

where $f_{AB}$ is the fraction of individuals in the population that contain both mutations, and $f_A$ and $f_B$ are the marginal frequencies of the two mutations (Hill and Robertson, 1968). The $r^2$ statistic and related measures like $D'$ (Lewontin, 1964) quantify how the joint distribution of the two mutations in the population differs from a null model in which the alleles are independently distributed across individuals. A celebrated theoretical result by Ohta and Kimura (1971) shows that, for a neutrally evolving population of constant size $N$, the frequency-weighted expectation of $r^2$ is given by

$$ \mathbb{E}[r^2 \cdot W(f_A, f_B)] = \frac{5 + NR}{11 + 13 NR + 2 (NR)^2} , \tag{2} $$

where $R$ is the recombination rate between the two sites and $W(f_A, f_B) \propto f_A (1 - f_A) f_B (1 - f_B)$ is a weighting function. The expression in Eq. (2) approaches an $O(1)$ constant in the limit of low recombination rates ($NR \ll 1$) and decays as $\sim 1/NR$ when $NR \gg 1$. Similar behavior also occurs for the unweighted expectation $\mathbb{E}[r^2]$ (McVean, 2002), which shares the same $\sim 1/NR$ scaling when $NR \to \infty$ (Song and Song, 2007). As a result, the shape of this LD curve is frequently used to estimate rates of recombination, e.g. by examining how genome-wide averages of linkage disequilibrium decay as a function of the coordinate distance between sites (Ansari and Didelot, 2014; Chakravarti et al., 1984; Garud et al., 2019; Lynch et al., 2014; Rosen et al., 2015).

Yet while these classical results have been enormously influential for building intuition about linkage disequilibrium, they suffer from several limitations that have become increasingly important in modern genomic datasets. Chief among these is the absence of natural selection. While there has been some progress in predicting patterns of linkage disequilibrium under particular selection scenarios [e.g., hitchhiking near a recent selective sweep (Kim and Nielsen, 2004; McVean, 2007; Pfaffelhuber et al., 2008; Pokalyuk, 2012; Stephan et al., 2006)], we currently lack analogous theoretical predictions for the empirically relevant case where a subset of the observed mutations are deleterious. This is a crucial limitation, since numerous studies have documented differences in the genome-wide patterns of LD between synonymous and nonsynonymous mutations (Arnold et al., 2020; Garcia and Lohmueller, 2020; Rosen et al., 2018; Sohail et al., 2017) or for genic vs intergenic regions of the genome (Eberle et al., 2006), where purifying selection is thought to play an important role. Several recent studies have begun to explore these effects in computer simulations (Arnold et al., 2020; Garcia and Lohmueller, 2020; Ragsdale and Gravel, 2019). Yet without a corresponding analytical theory, it can be difficult to understand how these patterns depend on the underlying parameters of the model, or to determine when more exotic forces like positive selection, epistasis, or ecological structure are necessary to fully explain the observed data.

A second and related limitation arises from the averaging scheme in Eq. (2), which weights each pair of mutations by their joint heterozygosity, $W(f_A, f_B) \propto f_A (1 - f_A) f_B (1 - f_B)$. This weighting tends to favor mutations with intermediate frequencies (e.g., $10\% \leq f \leq 90\%$); these are older genetic variants that have been segregating for times comparable to the most recent common ancestor of the population (McVean, 2002; Rogers, 2014). Yet in practice, even a single population will typically harbor mutations that are distributed across an enormous range of frequency scales, reflecting the broad range of timescales over which these mutations occurred (Cvijović et al., 2018). This broad range of frequencies is increasingly accessible in modern genomic datasets, where sample sizes can range from several hundred to several hundred thousand individuals (Allix-Béguec et al., 2018; Karczewski et al., 2020; Pasolli et al., 2019; Petit III and Read, 2018; Shu and McCauley, 2017). Understanding how LD varies across these different frequency scales could therefore provide new information about the evolutionary processes that operate on different ancestral time scales, similar to existing approaches based on the single-site frequency spectrum (Lawrie and Petrov, 2014; Ragsdale et al., 2018). Such an approach could be particularly useful for probing aspects of the recombination process, which are difficult to observe from single-site statistics alone.

Yet at present, little is known about how different frequency and time scales contribute to the expected patterns of linkage disequilibrium within a population. Previous theoretical work has explored how mutation frequencies constrain the possible values of statistics like $r^2$, independent of the underlying evolutionary dynamics (Hedrick, 1987; Kang and Rosenberg, 2019; Lewontin, 1988; VanLiere and Rosenberg, 2008). However, few methods currently exist for predicting the quantitative values that are expected to emerge under a given evolutionary scenario. This limited frequency resolution is particularly problematic in the presence of natural selection, which is known to strongly influence the distribution of mutation frequencies. This makes it difficult to interpret the varying LD patterns that have been observed across different classes of selected sites in a variety of natural populations. Do the differences between synonymous and nonsynonymous LD arise purely due to differences in their mutation frequencies? Or are there residual signatures of selection that remain even after controlling for marginal mutation frequencies? What conclusions can we draw about the underlying selection and recombination processes when differences are observed for some frequency ranges but not others? The goal of this work is to develop the theoretical tools necessary to address these questions.

In the following sections, we present a new analytical framework for predicting linkage disequilibrium between pairs of neutral and deleterious mutations as a function of their present-day frequencies. We do so by generalizing the traditional weighted average in Eq. (2), defining a family of weights $W(f_A, f_B | f_0)$ that preferentially exclude mutations with frequencies $\gtrsim f_0$. We show that the dynamics of linkage disequilibrium become much simpler in the limit that mutations are rare ($f_0 \ll 1$), and can be analyzed within a forward-time framework using branching process approximations that have been useful in many other areas of population genetics. We use this approach to derive analytical expressions for statistics like $r^2$ as a function of the recombination rate, the frequency scale $f_0$, and the additive and epistatic fitness costs of the two mutations. We find that the frequency scale $f_0$ can have a dramatic impact on the shape of the weighted LD curve, reflecting the varying timescales over which these mutations coexist within the population. We show how this scaling behavior can serve as a probe for estimating recombination rates and distributions of fitness effects in large cohorts, and we discuss the implications of these results for recent LD measurements in bacteria. This forward-time approach may provide a useful framework for predicting linkage disequilibrium across a range of other evolutionary scenarios.

MODEL

We investigate the dynamics of linkage disequilibrium between mutations in a population of constant size $N$. We consider a pair of genomic loci that acquire mutations at rate $\mu$ per individual per generation, and we focus on the infinite sites limit where $N\mu \ll 1$. The mutant and wildtype alleles at each locus are denoted by $A/a$ and $B/b$, respectively. We assume that mutations at these loci reduce the fitness of wildtype individuals by an amount $s_A$ and $s_B$, respectively, while the fitness of the double mutant is reduced by $s_{AB} = s_A + s_B + \epsilon$. The parameter $\epsilon$ allows us to account for epistatic interactions between the two mutations, while the additive limit is recovered when $\epsilon = 0$. Since we envision eventual applications to bacteria, we initially focus on haploid genomes where we can neglect the further complications of dominance and ploidy. However, our main results will also apply to diploid organisms when mutations are semi-dominant ($h = 1/2$).

We assume that the two loci undergo recombination at rate $R$ per individual per generation, producing double mutant haplotypes from single mutants, and vice versa. These recombination events could be implemented through a variety of mechanisms, including crossover recombination, gene conversion, and homologous recombination of horizontally transferred DNA. In the context of our two-locus model, the differences between these mechanisms can be entirely absorbed in the definition of $R$, and will primarily influence how $R$ scales as a function of the coordinate distance $\ell$ between the two loci. In simple cases, this dependence can be captured by the linear relationship, $R(\ell) = r\ell$, where $r$ is a measure of the recombination rate per base pair. We will revisit this point in more detail when we discuss applications to genomic data.

These assumptions yield a standard two-locus model for the population frequencies of the four possible haplotypes, $f_{ab}$, $f_{Ab}$, $f_{aB}$, and $f_{AB}$, as well as the corresponding mutation frequencies $f_A \equiv f_{Ab} + f_{AB}$ and $f_B \equiv f_{aB} + f_{AB}$. In the diffusion limit, this model can be expressed as a coupled system of nonlinear stochastic differential equations,

$$ \frac{\partial f_{ab}}{\partial t} = -[0 - S(t)] f_{ab} - R D(t) + \frac{\xi_{ab}(t)}{\sqrt{N}} + \mu (f_{Ab} + f_{aB}) - 2\mu f_{ab} , \tag{3a} $$

$$ \frac{\partial f_{Ab}}{\partial t} = -[s_A - S(t)] f_{Ab} + R D(t) + \frac{\xi_{Ab}(t)}{\sqrt{N}} + \mu (f_{ab} + f_{AB}) - 2\mu f_{Ab} , \tag{3b} $$

$$ \frac{\partial f_{aB}}{\partial t} = -[s_B - S(t)] f_{aB} + R D(t) + \frac{\xi_{aB}(t)}{\sqrt{N}} + \mu (f_{ab} + f_{AB}) - 2\mu f_{aB} , \tag{3c} $$

$$ \frac{\partial f_{AB}}{\partial t} = -[s_{AB} - S(t)] f_{AB} - R D(t) + \frac{\xi_{AB}(t)}{\sqrt{N}} + \mu (f_{Ab} + f_{aB}) - 2\mu f_{AB} , \tag{3d} $$

where $S(t) \equiv s_A f_{Ab} + s_B f_{aB} + s_{AB} f_{AB}$ is the mean fitness reduction within the population,

$$ D(t) \equiv f_{AB} f_{ab} - f_{Ab} f_{aB} = f_{AB} - f_A f_B \tag{3e} $$

is the standard coefficient of linkage disequilibrium, and $\{\xi_i(t)\}$ are a collection of Brownian noise terms whose correlation structure is chosen to ensure that the haplotype frequencies remain normalized at all times (Good and Desai, 2013). We note that this Langevin formulation is formally equivalent to the traditional diffusion limit of population genetics, which is more commonly expressed in terms of the Fokker-Planck equation for the probability density, $p(f_{AB}, f_{Ab}, f_{aB}, f_{ab})$ (Ewens, 2004). In this case, the Langevin notation will turn out to be slightly more convenient for our present study.

Following previous work, our analysis will focus on measures of linkage disequilibrium that are derived from different moments of $D$, $f_A$, and $f_B$. For example, the weighted average in Eq. (2) is equivalent to the ratio of expectations,

$$ \mathbb{E}[r^2 \cdot W(f_A, f_B)] = \frac{\mathbb{E}[D^2]}{\mathbb{E}[f_A (1 - f_A) f_B (1 - f_B)]} , \tag{4} $$

which is traditionally denoted by the symbol $\sigma_d^2$ (Ohta and Kimura, 1971). A convenient property of this ratio of expectations is that it is independent of the underlying mutation rate $\mu$ in the limit that $N\mu \ll 1$, thereby eliminating the dependence on one of the parameters. In other words, $\sigma_d^2$ primarily measures properties of the segregating mutations, rather than the target size for those mutations to occur. Here, we generalize this definition of $\sigma_d^2$ to consider a family of weighted averages of the form

$$ \sigma_d^2(f_0) \equiv \frac{\mathbb{E}[D^2 \cdot e^{-f_A/f_0 - f_B/f_0}]}{\mathbb{E}[f_A (1 - f_A) f_B (1 - f_B) \cdot e^{-f_A/f_0 - f_B/f_0}]} , \tag{5} $$

which is equivalent to choosing a weighting function

$$ W(f_A, f_B | f_0) \propto f_A (1 - f_A) e^{-f_A/f_0} \times f_B (1 - f_B) e^{-f_B/f_0} \tag{6} $$

in the weighted expectation $\mathbb{E}[r^2 \cdot W(f_A, f_B | f_0)]$. The additional exponential factors in this definition act like a smeared out version of a step function, preferentially excluding contributions from mutations with frequencies much larger than $f_0$. By scanning over different values of $f_0$, this statistic allows us to quantify how linkage disequilibrium varies over different frequency scales. This weighting scheme is reminiscent of the frequency thresholds that have previously been employed in empirical and computational studies of LD. From a qualitative perspective, the precise shape of the weighting function will turn out to be relatively unimportant: we argue below that any sufficiently sharp transition will produce qualitatively similar behavior for $\sigma_d^2(f_0)$. However, the exponential function will turn out to have particularly convenient properties in the context of our analytical calculations below.

In addition to $\sigma_d^2(f_0)$, it will also be useful to consider more general weighted moments of $D$ defined by

$$ \sigma_d^k(f_0) \equiv \frac{\mathbb{E}[D^k \cdot e^{-(f_A + f_B)/f_0}]}{\mathbb{E}[f_A (1 - f_A) f_B (1 - f_B) \cdot e^{-(f_A + f_B)/f_0}]} . \tag{7} $$

We will be particularly interested in the $k = 1$ moment, which captures the degree to which mutations are in coupling linkage ($AB/ab$ haplotypes) vs repulsion linkage ($Ab/aB$ haplotypes), as well as the $k = 4$ moment, which can be combined with $\sigma_d^2$ to quantify fluctuations in $D^2$.

Despite the simplicity of the two-locus model in Eq. (3), the resulting patterns of linkage disequilibrium have been difficult to characterize theoretically. As in many population genetic problems, the major hurdles arise from nonlinearities in the $S$, $D$, and $\{\xi_i\}$ terms in the stochastic differential equations, which typically require numerical approaches to make further progress. However, dramatic simplifications can arise if we restrict our attention to frequency scales $f_0 \ll 1$. In this case, the exponential weights in Eq. (7) will be vanishingly small except in cases where $f_{AB}, f_{Ab}, f_{aB} \lesssim f_0 \ll 1$, such that the vast majority of the population is composed of wildtype individuals. Applying this approximation to the two-locus model in Eq. (3), we obtain a nearly linear system of stochastic differential equations for the three mutant haplotypes,

$$ \frac{\partial f_{Ab}}{\partial t} = -s_A f_{Ab} + \mu + R D(t) + \sqrt{\frac{f_{Ab}}{N}} \, \eta_{Ab}(t) , \tag{8a} $$

$$ \frac{\partial f_{aB}}{\partial t} = -s_B f_{aB} + \mu + R D(t) + \sqrt{\frac{f_{aB}}{N}} \, \eta_{aB}(t) , \tag{8b} $$

$$ \frac{\partial f_{AB}}{\partial t} = -s_{AB} f_{AB} - R D(t) + \sqrt{\frac{f_{AB}}{N}} \, \eta_{AB}(t) + \mu (f_{Ab} + f_{aB}) , \tag{8c} $$

where $\{\eta_i(t)\}$ are now independent Brownian noise terms with $\langle \eta_i(t) \eta_j(t') \rangle = \delta_{ij} \delta(t - t')$ (Gardiner, 1985). Note that there is some subtlety involved in this approximation, since our weighting scheme only places limits on the haplotype frequencies at the time of observation. Equation (8) requires the stronger assumption that $f_{AB}, f_{Ab}, f_{aB} \ll 1$ for all previous times as well. This distinction will become important in our analysis below.

When the approximations in Eq. (8) hold, the only remaining nonlinearity enters through the $f_{Ab} f_{aB}$ term in $D(t)$. This term is crucial for allowing double mutants to be produced from single mutants via recombination. However, in many cases of interest, the contribution from this term will turn out to be small, either because $R$ itself is sufficiently small, or because $D(t)$ is small (e.g. for sufficiently large $R$). This suggests that in many cases, we may be able to treat the nonlinearity in $D(t)$ as a perturbative correction to the otherwise linear dynamics in Eq. (8). Extensive theory has been developed for analyzing stochastic differential equations of this form, ranging from powerful heuristic approaches to exact analytical results (Cvijović et al., 2018; Desai and Fisher, 2007; Fisher, 2007; Good, 2016; Weissman et al., 2009, 2010). The goal of the following sections is to flesh out this basic intuition and show how it can be used to obtain predictions for linkage disequilibrium statistics like $\sigma_d^k(f_0)$. We will begin by presenting a heuristic analysis of the problem, which will allow us to identify the key timescales and dynamical processes involved, and will be useful for building intuition for the formal analysis that follows. We will conclude by discussing potential applications of these results in the context of genomic data.

HEURISTIC ANALYSIS

We begin by presenting a heuristic analysis of the two-locus model, which focuses on the underlying lineage dynamics that contribute to linkage disequilibrium statistics like $\sigma_d^2(f_0)$. Roughly speaking, this heuristic analysis will be accurate to leading order in the logarithms of various quantities (e.g. mutation frequencies, recombination rates), while providing a more mechanistic picture of the underlying lineage dynamics that are involved. Readers who prefer a more formal treatment first may skip to the Formal Analysis section that follows. Our heuristic approach is similar to the one developed by Weissman et al. (2010) to study fitness valley crossing in sexual populations. A key difference in this work is that we are now more interested in understanding the steady-state frequency distributions of new mutations, rather than the transient process of valley crossing.

Single-locus dynamics

Before considering the full two-locus problem, it will be useful to briefly review the evolutionary dynamics that arise at a single genetic locus. We will focus on the infinite sites limit ($N\mu \ll 1$), where the population is almost always fixed for the wildtype allele, and new mutations are introduced into the population at a total rate $\sim N\mu \ll 1$ per generation. The vast majority of these lineages will drift to extinction without ever leaving more than a few descendants in the population. However, a lucky minority will survive for sufficiently long times that they can grow to reach observable frequencies within the population. The dynamics of this process will strongly depend on the fitness cost $s$ of the mutation. We consider both neutral ($s = 0$) and deleterious ($s > 0$) mutations below.

Neutral mutations. When $s = 0$, the frequency of the mutant lineage is only influenced by genetic drift. These dynamics take on a particularly simple form when the mutation is sufficiently rare [$f(t) \ll 1$]. With probability $\sim 1/t$, the mutant lineage will survive for at least $\sim t$ generations and will reach a characteristic size $f(t) \sim t/N$

[Figure 1 panels: left, mutation frequency versus time up to the present day; right, site frequency spectrum (SFS).]

FIG. 1 Schematic illustration of mutation trajectories that contribute to the site frequency spectrum. Left: Mutations arise at different times and drift to their present-day frequencies (shaded region). Dark and light blue lines show examples of “upward triangle” and “downward triangle” trajectories, respectively. In both cases, deleterious mutations are prevented from growing much larger than the drift barrier, fsel∼1/Ns (dashed line). Right: The total probability of observing a mutation at frequency f (the site frequency spectrum, SFS) is a sum over all of the possible trajectories that could contribute to this present day frequency. When the SFS is dominated by upward triangle trajectories, the effects of negative selection resemble a present-day frequency threshold f0 (dashed line).
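The scaling quoted in the text around Fig. 1 (a rare neutral lineage survives for ∼t generations with probability ∼1/t, and a surviving lineage then contains of order t individuals, i.e. a frequency ∼t/N) can be checked with a small simulation. The sketch below is illustrative only and is not the method used in this work: it assumes a critical binary-splitting branching process (each individual leaves 0 or 2 offspring with equal probability) as a stand-in for genetic drift in the rare-mutation limit, and the function names `simulate_lineage` and `survival_and_size` are hypothetical. For this particular offspring distribution, classical branching-process results give prefactors of roughly 2/t for the survival probability and t/2 for the mean size of survivors, consistent with the ∼ scaling.

```python
import random


def simulate_lineage(t_max, rng):
    """Return the number of descendants of one new mutation after t_max generations."""
    n = 1
    for _ in range(t_max):
        if n == 0:
            break  # the lineage has gone extinct
        # Assumed offspring rule: 0 or 2 offspring with probability 1/2 each
        # (mean 1, variance 1). The number of set bits among n random bits
        # is a Binomial(n, 1/2) draw, i.e. the number of individuals that split.
        n = 2 * bin(rng.getrandbits(n)).count("1")
    return n


def survival_and_size(t_max, n_reps, seed=0):
    """Estimate Pr[survive t_max generations] and the mean size of surviving lineages."""
    rng = random.Random(seed)
    sizes = [simulate_lineage(t_max, rng) for _ in range(n_reps)]
    survivors = [n for n in sizes if n > 0]
    p_survive = len(survivors) / n_reps
    mean_size = sum(survivors) / len(survivors) if survivors else 0.0
    return p_survive, mean_size


if __name__ == "__main__":
    for t in (25, 50, 100):
        p, size = survival_and_size(t, n_reps=20000, seed=1)
        # Both rescaled quantities should stay roughly constant as t grows.
        print(t, round(p * t, 2), round(size / t, 2))
```

Dividing the surviving lineage size by N converts it to the frequency scale ∼t/N; the branching-process picture only applies while t ≪ N, so that the lineage remains rare.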

(Fisher, 2007). A more detailed calculation shows that the frequency of the surviving lineage is exponentially distributed around this characteristic size, so that

$$ p(f | t, f > 0) \, df \sim \frac{N}{t} e^{-Nf/t} \, df . \tag{9} $$

Most of this distribution is concentrated within an order of magnitude of the mean ($\sim t/N$), consistent with our notion of a "typical" value. However, there is also a small probability ($\sim Nf/t$) for the mutation to be observed at a much lower frequency $f \ll t/N$. This will become important in our discussion below.

Using these temporal dynamics as a building block, we can calculate the total probability of observing a mutation with present-day frequency $f$ by summing over all the possible times that the mutation could have originated in the past (Fig. 1). This allows us to recover the familiar neutral site frequency spectrum,

$$ p(f) \, df \sim \int_0^\infty N\mu \, dT \cdot \frac{1}{T} \cdot \frac{N}{T} e^{-Nf/T} \, df \sim \frac{N\mu}{f} \, df . \tag{10} $$

We can impose a maximum frequency threshold by inserting a step function, $\theta(f_0 - f)$, so that

$$ p(f | f_0) \, df \sim \frac{N\mu \cdot \theta(f_0 - f)}{f} \, df . \tag{11} $$

In contrast to the transient distribution in Eq. (9), which was concentrated around a typical frequency, this equilibrium site frequency spectrum is evenly distributed across all orders of magnitude between $f_0$ and 0. Since there are many more orders of magnitude near 0 than near $f_0$, most of the weight is concentrated at extremely low frequencies, so much so that the distribution is not even normalizable. However, these extremely rare mutations will only have a negligible influence on moments of the form

$$ \langle f^p | f_0 \rangle \sim \int_0^{f_0} f^p \cdot \frac{N\mu}{f} \, df \sim \frac{N\mu f_0^p}{p} , \tag{12} $$

which are dominated by mutations with frequencies of order $\sim f_0$. Similarly, the probability of observing a mutation with a frequency of order $\sim f_0$ has a well-defined value,

$$ \Pr[f \sim f_0] \sim N\mu , \tag{13} $$

which is insensitive to the divergence of the site frequency spectrum at low frequencies.

Historical trajectories of mutations. Our linkage disequilibrium analysis will also require information about the historical trajectories of the mutations that are observed at a given value of $f$ in the present. The sum in Eq. (10) suggests that the ages of these mutations will follow an inverse exponential distribution,

$$ p(T | f) \, dT \sim T^{-2} e^{-Nf/T} \, dT , \tag{14} $$

which has a peak at $T \sim Nf$ and a broad power law tail for $T \gg Nf$. It will therefore be useful to consider the trajectories of mutations with ages $T \sim Nf$ and $T \gg Nf$ separately.

Mutations with ages of order $T \sim Nf$ will tend to have "upward triangle" trajectories similar to the unconstrained dynamics in Eq. (9) (Fig. 1). We can see this by considering the frequency of the mutation at some intermediate timepoint $t < T$, and treating the final frequency $f(T)$ as a sum of the $Nf(t)$ independent lineages that are currently present at time $t$. Each of these lineages has a probability $\sim 1/(T - t)$ of surviving until time $T$ and growing to size $\sim (T - t)/N$. When $t \sim O(T)$, many small lineages will survive to contribute to the present-day frequency, and $f(t)$ will remain close to $f(T) \approx f$ by the central limit theorem. On the other hand, when $t \ll T$, the frequency of a single surviving lineage ($\sim T/N$) will already be as large as the final mutant frequency ($\sim f$), so the most likely way of reaching this final frequency is for all but one of the intermediate lineages to go extinct. The total probability of this coarse-grained trajectory can therefore be approximated as

$$ p(f(t) | f, T) \, df(t) \sim N\mu \times \frac{1}{t} \times \frac{N}{t} e^{-Nf(t)/t} \, df(t) \times \frac{Nf(t)}{T} e^{-Nf(t)/T} , \tag{15} $$

which reduces to

$$ p(f(t) | f, T) \, df(t) \propto f(t) e^{-Nf(t)/t} \, df(t) \tag{16} $$

in the limit that $t \ll T$. This distribution has a typical frequency of order $f(t) \sim t/N$, similar to Eq. (9), but with a significantly smaller probability of drifting to anomalously low frequencies at the intermediate timepoint ($\sim N^2 f^2/t^2$).

In contrast to these familiar "upward triangle" trajectories, older mutations ($T \gg Nf$) will tend to have "downward triangle" shapes similar to the one depicted in Fig. 1. We can quantify these trajectories using a lineage-based picture similar to the one we used for the recent mutations above. In this case, the typical size of a single surviving lineage [$\sim (T - t)/N$] will already be much larger than the final frequency $f$, so the most likely transition to $f(T) \sim f$ will again require all but one of the intermediate lineages to go extinct by time $T$. The major difference is that this sole surviving lineage must now also drift to an anomalously low frequency at the time of observation [$f(T) \sim f \ll (T - t)/N$], so that the total probability of this transition can be approximated by

$$ \Pr[0 < f(T) < f \,|\, f(t)] \sim \frac{Nf(t)}{T - t} e^{-\frac{Nf(t)}{T - t}} \cdot \frac{Nf}{T - t} . \tag{17} $$

The conditional distribution of $f(t)$ is therefore given by

$$ p[f(t) | f, T] \, df(t) \propto f(t) e^{-\frac{N T f(t)}{t (T - t)}} \, df(t) , \tag{18} $$

which has a typical frequency $f(t) \sim t(T - t)/NT$. In contrast to the upward triangle trajectories in Eq. (16), the typical frequency in Eq. (18) is now a nonmonotonic function of the intermediate timepoint $t$, thereby motivating our naming scheme. For a given age $T$, this historical frequency reaches its maximum value $f(t) \approx T/2N$ when $t \approx T/2$, similar to an ordinary neutral trajectory that drifted to size $\sim T/2N$ before going extinct. When $T \gg Nf$, these historical frequencies can be much larger than any present-day frequency threshold $f_0$, and can eventually reach a point where the rare mutation assumption starts to break down [$f(t) \sim 1$]. Fortunately, these ancient mutations have a negligible influence on the site frequency spectrum in Eq. (11), which is dominated by contributions from the upward triangle trajectories where $T \sim Nf$.

However, the downward triangle trajectories can play a more important role for other quantities, including the average age of a mutation as a function of its present-day frequency. Although the median age in Eq. (14) occurs for $T \sim Nf$, the broad $\sim 1/T^2$ tail causes the average age to diverge logarithmically, where it will be dominated by increasingly older mutations with $T \gg Nf$. This process cannot continue indefinitely, however, since mutations that drift to frequencies $f(t) \sim 1$ will eventually be more likely to fix than to drift back down to $f(T) \sim f$ by the time of observation. A useful approximation will therefore be to truncate the integral in Eq. (10) at a maximum age $T_{\max} \sim N$, corresponding to a maximum historical frequency $f_{\max} \sim 1$. The average age then reduces to a finite value,

$$ \langle T | f \rangle \sim Nf \log(1/f) , \tag{19} $$

which matches the well-known result from Kimura and Ohta (1973) in the limit that $f \to 0$. This same cutoff approximation will play a crucial role in analyzing the dynamics of linkage disequilibrium below.

Deleterious mutations. Similar considerations apply for deleterious mutations ($s > 0$), except that natural selection will prevent them from growing much larger than a critical frequency, $f_{\rm sel} \sim 1/Ns$, above which natural selection starts to dominate over genetic drift (Fisher, 2007; Good, 2016). Conversely, genetic drift will continue to dominate over natural selection for frequencies $f(t) \ll f_{\rm sel}$, and the dynamics will resemble those derived for neutral mutations above. This suggests that we can approximate the deleterious site frequency spectrum simply by adding an additional step function to the neutral result,

$$ p(f | f_0) \, df \sim \frac{N\mu \, \theta(f_0 - f) \, \theta\!\left(\frac{1}{Ns} - f\right)}{f} \, df , \tag{20} $$

which enforces the condition that mutations will rarely be found above the "drift barrier," $f_{\rm sel} \sim 1/Ns$.

The net effect of this new threshold will depend on the compound parameter $Nsf_0$, which captures the relative strength of selection on timescales of order $\sim Nf_0$. When $Nsf_0 \ll 1$, deleterious mutations are always sampled in their effectively neutral phase, and the site frequency spectrum reduces to the neutral version in Eq. (11). On the other hand, when $Nsf_0 \gg 1$, the frequency spectrum maintains a similar shape, but with an effective frequency threshold now set by the drift barrier $f_{\rm sel} \sim 1/Ns \ll f_0$. This implies that averages like $\langle f^p \rangle$ will be dominated by frequencies of order $\sim 1/Ns$, rather than the nominal frequency threshold $f_0$.

To streamline notation, it will be useful to summarize this behavior by defining an effective fitness cost,

$$ s^*(f_0) \sim \begin{cases} \frac{1}{Nf_0} & \text{if } Nsf_0 \ll 1 , \\ s & \text{if } Nsf_0 \gg 1 , \end{cases} \tag{21} $$

such that Eq. (20) can be written in the compact form

$$ p(f | s^*) \, df \sim \frac{N\mu \cdot \theta\!\left(\frac{1}{Ns^*} - f\right)}{f} \, df , \tag{22} $$

or alternatively,

$$ \Pr[f \sim 1/Ns^*] \sim N\mu , \tag{23} $$

where the dependence on the underlying fitness cost $s$ is entirely encapsulated in the definition of $s^*$. This notation emphasizes that the present-day frequencies of neutral and deleterious mutations will be essentially identical to each other, given an appropriate choice of $s^*$.

However, this correspondence between neutral and deleterious mutations starts to break down when we consider the historical trajectories of these mutations backward in time. The key difference is that natural selection will prevent deleterious mutations from growing much larger than $f_{\rm sel} \sim 1/Ns$ at any point during their lifetime, and not only at the point of observation (Fig. 1). This distinction has a negligible impact on the upward triangle trajectories that dominate $p(f | f_0)$, since their maximum sizes are by definition bounded by the present

at the two loci evolve in a completely asexual manner. Linkage disequilibrium is defined when both sites harbor segregating mutations at the same time ($f_A, f_B > 0$). In the weak mutation limit ($N\mu \ll 1$), there are only two different ways that this can occur:

Separate mutations. In the simplest scenario, a mutation will first occur at one of the two sites, and then a second mutation will arise in a different wildtype background while the first mutation is still segregating in the population (Fig. 2A). We refer to this scenario as the separate mutations case, since it involves only single-mutant haplotypes ($f_A = f_{Ab}$, $f_B = f_{aB}$), while the double mutant is absent ($f_{AB} = 0$). At low mutation frequencies ($f_A, f_B \ll 1$), these single-mutant haplotypes will be approximately independent of each other, and can therefore be predicted from the single-locus dynamics described above. Equation (23) then shows that with probability

$$ \Pr[\text{separate}] \sim (N\mu)^2 , \tag{25a} $$

the four haplotype frequencies will reach typical sizes of order

$$ \begin{pmatrix} f_{ab} & f_{aB} \\ f_{Ab} & f_{AB} \end{pmatrix}_{\text{separate}} \sim \begin{pmatrix} 1 & \frac{1}{Ns_B^*} \\ \frac{1}{Ns_A^*} & 0 \end{pmatrix} , \tag{25b} $$

where $s_A^*$ and $s_B^*$ are defined as in Eq. (21) above. This coarse-grained sampling distribution allows us to quickly estimate the contribution to various moments, e.g.

$$ \langle (f_{AB} - f_A f_B)^k \rangle_{\text{separate}} \sim \frac{(-1)^k (N\mu)^2}{(Ns_A^*)^k (Ns_B^*)^k} . \tag{26} $$

Nested mutations.
In the alternative scenario, a muta- day frequency f. However, the historical action of nat- tion at the second locus can be produced by the original ural selection has a much stronger impact on the down- mutant lineage rather than the wildtype (Fig. 2B,C). We ward triangle trajectories in Fig. 1, since it limits their refer to this scenario as the nested mutations case, since peak frequencies to a maximum size of order f ∼1/Ns. max it produces double mutant haplotypes (f = f + f , This frequency threshold implies a corresponding bound A Ab AB f = f ) without any single mutants at the second on the maximum age of a deleterious mutation of order B AB genetic locus (f = 0). At low mutation frequencies T ∼1/s, which alters the scaling of quantities like the aB max (f , f  1), we expect that these nested mutations will average age of a rare mutation, A B occur far less frequently than separate mutations, since  Nf log (1/f) if f  1  1 the mutant lineage will produce mutations at a much  Ns hT |fi ∼ Nf log (1/Nsf) if f  1  1, (24) lower rate ∼NµfAb  Nµ. To understand these con- Ns tributions, it is necessary to integrate over the historical 1/s if f ∼ 1  1. Ns frequencies of the Ab haplotype at the various times T These differences will play an important role when we that the second mutation could have occurred. We can consider the dynamics of linkage disequilibrium below. write this in the general form, ZZ 0 0 0 p(fAb, fAB) dfAb dfAB ∼ p(f )df · Nµf dT Two-locus dynamics Ab Ab Ab  1  ×p(f 0 → f |T )df · p → f |T df We are now in a position to analyze the joint behavior of Ab Ab Ab N AB AB a pair of genetic loci, which will allow us to develop a sim- (27) ilar heuristic picture for the dynamics of linkage disequi- 0 librium. 
To do so, it will be useful to first consider the dy- where p(fAb) denotes the equilibrium frequency distri- 0 namics in the absence of recombination, when mutations bution at the first site, and p(fi → fi|T ) denotes the bioRxiv preprint doi: https://doi.org/10.1101/2020.12.10.420042; this version posted December 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 8

[Figure 2: four schematic panels (a)-(d) showing the A, B, and AB lineages. Horizontal axis: Time. Vertical axis: Individuals.]

FIG. 2 Schematic of different lineage dynamics that contribute to linkage disequilibrium. (a) Separate mutations: A and B mutations arise on independent wildtype backgrounds and are both still segregating at the time of observation (blue region). (b) Recent nested mutations: a double mutant (AB) is produced by a single-mutant background (A) in the recent past, and both haplotypes are still segregating at the time of observation. (c) Older nested mutations: a double mutant is produced by a larger single-mutant lineage in the distant past, but drifts back down to lower frequencies by the time of observation. (d) Recombination produces double mutant lineages from single mutant lineages, and vice versa.

probability density of transitioning from $f'_i$ to $f_i$ in time $T$. Note that since $f'_{Ab}$ represents the historical frequency of the Ab haplotype, it is not directly constrained by the present-day frequency threshold, $f_{Ab}, f_{AB} \lesssim f_0$, but is instead constrained indirectly through the dynamics in the $p(f'_{Ab} \to f_{Ab}|T)$ term. This distinction will become important below.

To gain a more intuitive understanding of Eq. (27), it will be useful to further distinguish between the contributions of relatively recent nested mutations ($T \sim 1/s^*_{AB}$) and those that arose in the distant past ($T \gg 1/s^*_{AB}$), where $s^*_{AB}$ is the double mutant analogue of $s^*_A$ and $s^*_B$ in Eq. (25b). For simplicity, we will restrict our attention to regimes where $s^*_{AB} \gtrsim s^*_A, s^*_B$, such that the single mutant timescales ($\sim 1/s^*_i$) are at least as long as the double mutant timescale. This regime includes the traditional additive case ($\epsilon = 0$), as well as cases with strong synergistic ($\epsilon > 0$) and moderately antagonistic ($\epsilon < 0$) epistasis, and will therefore capture much of the interesting behavior.

Recent nested mutations. When the age of the double mutant is of order $T \sim 1/s^*_{AB}$, the characteristic dynamics will be dominated by "upward triangle" trajectories, in which the double mutants reach their maximum frequency $\sim 1/Ns^*_{AB}$ near the point of observation (Fig. 2B). Since $T \lesssim 1/s^*_A$, the historical frequency of the Ab haplotype cannot be much larger than $f'_{Ab} \sim 1/Ns^*_A$, since it would be unlikely to drift back down to this threshold by the time of observation. This suggests that recent nested mutations will occur with a total probability

$$\Pr\!\left[\begin{smallmatrix}\text{recent}\\ \text{nested}\end{smallmatrix}\right] \sim \iint_{\substack{T \sim 1/s^*_{AB} \\ f'_{Ab} \lesssim 1/Ns^*_A}} \frac{N\mu\, df'_{Ab}}{f'_{Ab}} \cdot N\mu f'_{Ab}\, dT \cdot \frac{1}{T} \sim \frac{(N\mu)^2}{Ns^*_A} , \tag{28a}$$

and that the typical haplotype frequencies will be of order

$$\begin{pmatrix} f_{ab} & f_{aB} \\ f_{Ab} & f_{AB} \end{pmatrix}_{\substack{\text{recent}\\ \text{nested}}} \sim \begin{pmatrix} 1 & 0 \\ \frac{1}{Ns^*_A} & \frac{1}{Ns^*_{AB}} \end{pmatrix} . \tag{28b}$$

An analogous distribution exists for recent nested mutations that arise on an aB background. As expected, the total probability of these events is much smaller than the separate mutations case. However, the smaller probability is balanced by the larger typical values of $D = f_{AB} - f_A f_B$ that occur in this case, such that the total contribution to the corresponding moments,

$$\left\langle (f_{AB} - f_A f_B)^k \right\rangle_{\substack{\text{recent}\\ \text{nested}}} \sim \frac{(Ns^*_A + Ns^*_B)(N\mu)^2}{(Ns^*_A \cdot Ns^*_B)(Ns^*_{AB})^k} , \tag{29}$$

is often of equal or larger magnitude than the separate mutations case in Eq. (26) above.

Older nested mutations. In contrast, the characteristic dynamics of older mutations ($T \gg 1/s^*_{AB}$) will be dominated by "downward triangle" trajectories that previously reached much higher frequencies in the distant past (Fig. 2C). We can analyze these trajectories using the same techniques that we used to study the average ages of neutral and deleterious mutations above. To streamline our notation, it will be useful to introduce a timescale for the maximum possible age of a double mutant,

$$T^{AB}_{\max} \sim \begin{cases} 1/s_{AB} & \text{if } Ns_{AB} \gg 1 \\ N & \text{if } Ns_{AB} \ll 1 \end{cases} \tag{30}$$

which corresponds to a maximum historical size of order $f_{\max} \sim T^{AB}_{\max}/N$. When $Ns_{AB}f_0 \gg 1$, this maximum age is actually located in the recent past ($T^{AB}_{\max} \sim 1/s^*_{AB}$), which implies that there is a negligible chance of producing older nested mutations. In contrast, when $Ns_{AB}f_0 \ll 1$, there will be a large range of timescales ($1/s^*_{AB} \lesssim T \lesssim T^{AB}_{\max}$) where older nested mutations can arise. These older double mutants will have a much smaller chance of surviving until the present while also maintaining a present-day frequency less than $\sim 1/Ns^*_{AB}$:

$$\Pr\left[0 < f_{AB} \lesssim 1/Ns^*_{AB} \,|\, T\right] \sim \frac{1}{T^2 s^*_{AB}} . \tag{31}$$

Similarly, if the historical frequency of the single mutant was much larger than $f'_{Ab} \sim T/N$, it would be unlikely to drift back down below its present-day threshold ($f_{Ab} \sim 1/Ns^*_A$) by the time of observation, while historical frequencies smaller than $f'_{Ab} \sim T/N$ have an $O(1)$ chance of drifting to extinction. Combining these two observations, we conclude that older nested mutations will occur with a total probability

$$\Pr\!\left[\begin{smallmatrix}\text{older}\\ \text{nested}\end{smallmatrix}\right] \sim \iint_{\substack{f'_{Ab} \lesssim T/N \\ 1/s^*_{AB} \lesssim T \lesssim T^{AB}_{\max}}} \frac{N\mu\, df'_{Ab}}{f'_{Ab}} \cdot N\mu f'_{Ab}\, dT \cdot \frac{1}{s^*_{AB} T^2} \sim \frac{(N\mu)^2}{Ns^*_{AB}} \log\left(1 + T^{AB}_{\max} s^*_{AB}\right) , \tag{32a}$$

and that the typical haplotype frequencies will be of order

$$\begin{pmatrix} f_{ab} & f_{aB} \\ f_{Ab} & f_{AB} \end{pmatrix}_{\substack{\text{older}\\ \text{nested}}} \sim \begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{Ns^*_{AB}} \end{pmatrix} . \tag{32b}$$

Note that the total probability of these events is still much smaller than the probability of arising on separate genetic backgrounds. However, it will be much larger than the contribution from recent nested mutations whenever $Ns_{AB}f_0 \ll 1$.

[Figure 3: two panels. Horizontal axis: $2Nsf_0$. Vertical axes: $\sigma^1_d$ (top) and $\sigma^2_d/f_0$ (bottom). Color scale: $\log_{10} f_0$. Legend: $s_A = s_B = s/2$, $\epsilon = 0$; $s_A = s_B = s$, $\epsilon = -s$; $s_A = s_B = 0$, $\epsilon = +s$.]

FIG. 3 Frequency-resolved LD between deleterious mutations as a function of the scaled fitness cost of the double mutant. Top: The signed LD moment, $\sigma^1_d(f_0)$, in Eq. (7) is depicted for pairs of nonrecombining loci with additive ($\epsilon = 0$), synergistic ($\epsilon > 0$) and antagonistic ($\epsilon < 0$) epistasis, which were chosen to have the same total cost for the double mutant ($s_{AB} = s$). Symbols denote the results of forward-time simulations across a range of parameters with $s > 10/N$ (Appendix A), and each symbol is colored by the corresponding value of $f_0$. The solid lines show the theoretical prediction from Eq. (C8). Bottom: An analogous figure for the squared LD moment, $\sigma^2_d(f_0)$, where solid lines show the theoretical predictions from Eq. (C10). The "data collapse" in both panels indicates that frequency-weighted LD is primarily determined by the compound parameters $Nsf_0$ and $N\epsilon f_0$. Weak scaled fitness costs ($Nsf_0 \lesssim 1$) lead to an excess of coupling linkage ($\sigma^1_d > 0$), which qualitatively resembles the effects of antagonistic epistasis ($\epsilon < 0$).

Estimating linkage disequilibrium. We now have all the ingredients necessary to calculate frequency-resolved LD statistics like $\sigma^k_d(f_0)$. For the term in the denominator, we note that the magnitude of $f_A(1-f_A)f_B(1-f_B)$ is roughly the same, regardless of whether the mutations occur on nested or separate backgrounds. However, since the separate backgrounds case is more likely by a factor

of $\sim Ns^*_i \gg 1$, it provides the dominant contribution to the average in the denominator, so that

$$\left\langle f_A(1-f_A)f_B(1-f_B) \right\rangle \sim \frac{1}{Ns^*_A} \cdot \frac{1}{Ns^*_B} \cdot (N\mu)^2 . \tag{33}$$

In contrast, the numerator of $\sigma^2_d(f_0)$ will usually be dominated by the contributions from nested mutations, since the double mutant frequency enters as a lower power in the linkage disequilibrium coefficient $D = f_{AB} - f_A f_B$. The precise form of this contribution will depend on the typical ages of the nested mutations, as described by Eqs. (25) and (28) above. By combining these results with the denominator term in Eq. (33), we find that

$$\sigma^2_d(f_0) \sim \begin{cases} f_0 \log\left(\frac{1}{f_0}\right) & \text{if } Ns_{AB} \ll 1, \\ f_0 \log\left(\frac{1}{Ns_{AB}f_0}\right) & \text{if } Ns_{AB}f_0 \ll 1, \\ \frac{1}{Ns_{AB}}\left[1 - \frac{\epsilon}{s_{AB}} + \frac{1}{Ns_{AB}f_0}\right] & \text{if } Ns_{AB}f_0 \gg 1, \end{cases} \tag{34}$$

which strongly depends on the magnitude of the scaled selection strength $Ns_{AB}f_0$. In the absence of epistasis ($\epsilon = 0$), we see that linkage disequilibrium between strongly deleterious mutations ($s_A \approx s_B \approx s \gg 1/N$) will generally be lower than among neutral mutations with comparable present-day frequencies ($f_{0,\rm eff} \sim 1/Ns$) (Fig. 3). Moreover, the directionality of this difference is qualitatively similar to the effects of synergistic epistasis ($\epsilon > 0$). Our lineage-based picture shows that these differences primarily reflect the contributions of older nested mutations (Fig. 2C), which are sensitive to the effects of natural selection long before the present day.

An analogous argument can be used to calculate the first moment $\sigma^1_d(f_0)$. We find that

$$\sigma^1_d(f_0) \sim \begin{cases} \log\left(\frac{1}{f_0}\right) & \text{if } Ns_{AB} \ll 1 \\ \log\left(\frac{1}{Ns_{AB}f_0}\right) & \text{if } Ns_{AB}f_0 \ll 1 \\ -\frac{\epsilon}{s_{AB}} & \text{if } Ns_{AB}f_0 \gg 1 \end{cases} \tag{35}$$

which closely parallels the three regimes that emerge in Eq. (34). Once again, we observe a dramatic difference between neutral and deleterious mutations that cannot be captured by an effective frequency threshold $f_{0,\rm eff} \sim 1/Ns$. In this case, we see that the logarithmic behavior in the neutral limit is associated with an excess of coupling linkage ($f_{AB} > f_A f_B$), which qualitatively resembles the effects of antagonistic epistasis ($\epsilon < 0$). These results emphasize the importance of older nested mutations in shaping contemporary patterns of LD among tightly linked loci.

Incorporating recombination. We can now ask how small amounts of recombination start to alter the basic picture described above. Recombination cannot change the rates at which separate or nested mutations are initially produced, but it can have a dramatic impact on the subsequent haplotype dynamics after these mutations arise.

For example, in the nested mutations case, recombination will start to break up the AB haplotype at a per capita rate $R$, creating recombinant Ab and aB offspring (Fig. 2D). From the perspective of the $f_{AB}$ lineage, this loss of individuals through recombination will resemble an effective fitness cost, which can be absorbed in an effective selection coefficient for the double mutant,

$$s_{AB,r} \equiv s_A + s_B + \epsilon + R . \tag{36}$$

When $R \ll s_{AB}$, this loss of individuals to recombination will have a negligible impact on $f_{AB}$, but it will become the primary limiting factor on the lineage size when $R \gg s_{AB}$. For sufficiently low rates of recombination (evaluated self-consistently below), the recombinant offspring of AB lineages are unlikely to grow to high enough frequencies where they could influence the linkage disequilibrium coefficient $D = f_{AB} - f_A f_B$. This suggests that we can approximate the future dynamics of the $f_{AB}$ lineage using the results for the asexual case above, but with Eq. (36) replacing $s_{AB}$.

Similarly, in the separate mutations scenario, recombination events between Ab and aB haplotypes will create recombinant AB haplotypes at a total rate $NRf_{Ab}f_{aB}$ per generation. In this case, the loss of individuals due to recombination is significantly smaller than for the AB lineages above, since the per capita rates of recombination for the single-mutant lineages are suppressed by additional factors of $f_{aB}$ or $f_{Ab}$. However, these rare recombination events can still have a large effect on linkage disequilibrium if they happen to seed a lucky AB lineage that drifts to observable frequencies. We can calculate the total probability of these events using a generalization of the approach that we used for nested mutations above.

In this recombinant scenario, the dominant contributions will come from relatively recent recombination events ($T \sim 1/s^*_{AB,r}$), which reach their maximum typical size ($f_{AB} \sim 1/Ns^*_{AB,r}$) near the time of observation (Fig. 2D). As above, the historical frequencies of the Ab and aB haplotypes cannot be much larger than $f'_{Ab} \sim 1/Ns^*_A$ and $f'_{aB} \sim 1/Ns^*_B$ at this timepoint, otherwise they would be unlikely to drift back down to these thresholds by the time of observation. This suggests that recombinant double mutants will occur with a total probability

$$\Pr\!\left[\begin{smallmatrix}\text{recomb.}\\ \text{double}\end{smallmatrix}\right] \sim \iiint_{\substack{T \sim 1/s^*_{AB,r} \\ f_{Ab} \lesssim 1/Ns^*_A \\ f_{aB} \lesssim 1/Ns^*_B}} \frac{N\mu\, df_{Ab}}{f_{Ab}} \cdot \frac{N\mu\, df_{aB}}{f_{aB}} \times NRf_{Ab}f_{aB}\, dT \cdot \frac{1}{T} \sim \frac{NR(N\mu)^2}{Ns^*_A \cdot Ns^*_B} \tag{37a}$$
and that the haplotype frequencies will be of order

$$\begin{pmatrix} f_{ab} & f_{aB} \\ f_{Ab} & f_{AB} \end{pmatrix}_{\substack{\text{recomb.}\\ \text{double}}} \sim \begin{pmatrix} 1 & \frac{1}{Ns^*_B} \\ \frac{1}{Ns^*_A} & \frac{1}{Ns^*_{AB,r}} \end{pmatrix} . \tag{37b}$$

Similar to the nested mutations case above, the total probability of the recombinant scenario is much smaller than the probability of separate mutations. However, as long as $f_{AB} \sim 1/Ns^*_{AB,r}$ is much larger than $f_{Ab}f_{aB} \sim 1/(Ns^*_A \cdot Ns^*_B)$, which will be true for all but the highest recombination rates, the smaller probability of this scenario will be counterbalanced by the significantly larger values of $D = f_{AB} - f_A f_B$ that it produces.

By combining these results with our previous formulae for nested and separate mutations above, we can obtain an analogous set of predictions for frequency-resolved LD statistics,

$$\sigma^2_d(f_0) \sim \begin{cases} f_0 \log\left(\frac{1}{f_0}\right) & \text{if } Ns_{AB,r} \ll 1 \\ f_0 \log\left(\frac{1}{Ns_{AB,r}f_0}\right) & \text{if } Ns_{AB,r}f_0 \ll 1 \\ \frac{1}{Ns_{AB,r}}\left[1 - \frac{\epsilon}{s_{AB,r}} + \frac{1}{Ns_{AB,r}f_0}\right] & \text{if } Ns_{AB,r}f_0 \gg 1 \end{cases} \tag{38a}$$

and

$$\sigma^1_d(f_0) \sim \begin{cases} \log\left(\frac{1}{f_0}\right) & \text{if } Ns_{AB,r} \ll 1 \\ \log\left(\frac{1}{Ns_{AB,r}f_0}\right) & \text{if } Ns_{AB,r}f_0 \ll 1 \\ -\frac{\epsilon}{s_{AB,r}} & \text{if } Ns_{AB,r}f_0 \gg 1 \end{cases} \tag{38b}$$

which are valid in the presence of recombination. For the special case of neutral mutations ($s_A = s_B = \epsilon = 0$), these results take on a particularly simple form:

$$\sigma^2_d(f_0) \sim \begin{cases} f_0 \log\left(\frac{1}{f_0}\right) & \text{if } NR \ll 1, \\ f_0 \log\left(\frac{1}{NRf_0}\right) & \text{if } NRf_0 \ll 1, \\ \frac{1}{NR} & \text{if } NRf_0 \gg 1, \end{cases} \tag{39a}$$

and

$$\sigma^1_d(f_0) \sim \begin{cases} \log\left(\frac{1}{f_0}\right) & \text{if } NR \ll 1, \\ \log\left(\frac{1}{NRf_0}\right) & \text{if } NRf_0 \ll 1, \\ 0 & \text{if } NRf_0 \gg 1, \end{cases} \tag{39b}$$

which constitute frequency-resolved analogues of the LD decay curves that are used to estimate recombination rates in genomic data (Fig. 4). In this case, we see that the behavior of the LD curves is strongly dependent on the compound parameter $NRf_0$. When $NRf_0 \gg 1$, we recover the well-known $\sigma^2_d \sim 1/NR$ scaling of Eq. (2), in which neither coupling nor repulsion linkage is favored ($\sigma^1_d \approx 0$) (Ohta and Kimura, 1971; Song and Song, 2007). However, when $NRf_0 \ll 1$, $\sigma^2_d$ no longer saturates at a constant value, as in Eq. (2), but instead transitions to a new logarithmic dependence similar to the asexual case above. These quasi-asexual dynamics are accompanied by high levels of coupling linkage ($\sigma^1_d \gtrsim 1$), reflecting the important contributions of older nested mutations. Interestingly, our results show that the transition to this mutation-dominated regime can occur even for nominally high rates of recombination ($NR \gg 1$), provided that the frequency scale $f_0$ is chosen to be sufficiently small ($NRf_0 \ll 1$). This highlights the utility of frequency-resolved LD statistics for probing the underlying timescales of the recombination process, a topic that we will explore in more detail below.

[Figure 4: two panels. Horizontal axis: $2NRf_0$. Vertical axes: $\sigma^1_d$ (top) and $\sigma^2_d/f_0$ (bottom). Color scale: $\log_{10} f_0$.]

FIG. 4 Frequency-resolved LD between neutral mutations as a function of the scaled recombination rate. An analogous version of Fig. 3, showing the first (top) and second (bottom) LD moments in Eq. (7) for pairs of neutral mutations with a range of recombination rates, $R > 2/N$. As above, symbols denote the results of forward time simulations, and solid lines denote the theoretical predictions from Eqs. D23 (top) and D24 (bottom). Dashed lines show the classical predictions for the $f_0 \to \infty$ limit (Ohta and Kimura, 1971). The "data collapse" in both panels indicates that frequency-weighted LD is primarily determined by the compound parameter $NRf_0$. Low scaled recombination rates ($NRf_0 \lesssim 1$) lead to an excess of coupling linkage ($\sigma^1_d > 0$), which qualitatively resembles the effects of antagonistic epistasis ($\epsilon < 0$).

Transition to Quasi-Linkage Equilibrium (QLE). Our previous results assumed that double mutants provide the dominant contribution to $D = f_{AB} - f_A f_B$ when they reach their maximum typical frequencies. This will be a good approximation in the recombinant scenario provided that

$$\frac{1}{Ns^*_{AB,r}} \gg \frac{1}{Ns^*_A} \cdot \frac{1}{Ns^*_B} , \tag{40}$$

which reduces to the simpler condition

$$NRf_0^2 \ll 1 , \tag{41}$$

for a pair of neutral mutations. At low frequencies ($f_0 \ll 1$), this condition will generally be satisfied even for large recombination rates ($NRf_0 \gg 1$), which are located deep into the recombination-dominated regions of the LD curves in Eq. (39). This suggests that these previous expressions will be valid across a broad parameter range, which is sufficient to capture the transition from mutation-dominated to recombination-dominated behavior ($NRf_0 \sim 1$). Nevertheless, for sufficiently distant loci, recombination rates may eventually grow to a point where $NRf_0^2 \sim 1$, where our existing analysis starts to break down.

Fortunately, we can obtain a relatively complete picture of this transition by taking advantage of the large gap between $NRf_0$ and $NRf_0^2$ that emerges when $f_0 \ll 1$, and focusing our attention on cases where $NRf_0 \gg 1$. In this limit, the typical lifetimes of recombinant double mutants ($T \sim 1/R$) are much shorter than the lifetimes of the single mutant lineages that produce them ($T \sim Nf_0$). This suggests that $f_{Ab}$ and $f_{aB}$ will be effectively "frozen" throughout the lifetime of an individual recombinant lineage, and that the production rate of these lineages will resemble a mutation process with an overall rate $\sim NRf_0^2$. When $NRf_0^2 \ll 1$, recombinant $f_{AB}$ lineages will be produced only rarely, and can occasionally fluctuate to frequencies of order $\sim 1/NR$ before they go extinct. This is sufficient to recover the familiar $\sigma^2_d \sim 1/NR$ scaling,

$$\sigma^2_d(f_0) \sim \frac{NRf_0^2 \cdot \left(\frac{1}{NR}\right)^2}{f_0^2} \sim \frac{1}{NR} , \tag{42}$$

that we observed in Eq. (39).

On the other hand, when $NRf_0^2 \gg 1$, many recombinant lineages will be produced every generation, and the total double mutant frequency $f_{AB}$ will reach sizes much larger than $\sim 1/NR$. In this case, the double mutant frequency will grow to the point where the production rate of new recombinants ($\sim NRf_0^2$) is exactly balanced by their loss due to further recombination with the wildtype population ($\sim NRf_{AB}$). This occurs when

$$f_{AB} \sim f_0^2 , \tag{43}$$

which is equivalent to the condition that the A and B mutations are in linkage equilibrium ($f_{AB} = f_A f_B$). In this way, we see that the $NRf_0^2 \gg 1$ regime can be identified with the traditional Quasi-Linkage Equilibrium (QLE) regime of multilocus population genetics, in which the haplotype frequencies remain close to the typical values expected under linkage equilibrium. Genetic drift will still drive fluctuations around the average value in Eq. (43), with a magnitude that is inversely proportional to the typical number of recombinant lineages that contribute to the total frequency:

$$\left(\frac{\delta f_{AB}}{f_{AB}}\right)^2 \sim \frac{1}{NRf_0^2} . \tag{44}$$

Interestingly, this behavior leads to identical predictions for the two lowest order LD statistics, $\sigma^1_d(f_0)$ and $\sigma^2_d(f_0)$, that we obtained in the $f_0^{-1} \ll NR \ll f_0^{-2}$ limit above. However, as we will demonstrate below, the differences between these two regimes will become apparent when considering higher moments (or other properties of the joint haplotype distribution), due to the dramatic differences in the typical fluctuations of $f_{AB}$. In this way, we see that low frequency mutations give rise to an entirely new regime of behavior ($f_0^{-1} \ll NR \ll f_0^{-2}$), in which previous notions of quasi-asexuality or quasi-linkage equilibrium do not apply. We will refer to this regime as the clonal recombinant regime, which reflects the fact that double mutants are primarily caused by rare recombinant lineages that drift to observable frequencies one-by-one. We will explore the consequences of these dynamics in more detail below.

FORMAL ANALYSIS

We now turn to a formal derivation of the heuristic results presented above. To implement our conditioning scheme for the maximum frequencies of the two mutations ($f_A, f_B \lesssim f_0$), we will focus on the joint generating function of the unconditioned process,

$$H(x,y,z,t) = \left\langle e^{-\frac{x f_{Ab}(t)}{f_0} - \frac{y f_{aB}(t)}{f_0} - \frac{z f_{AB}(t)}{f_0}} \right\rangle , \tag{45}$$

such that the weighted moments follow from the identity

$$\left\langle f_{Ab}^i f_{aB}^j f_{AB}^k\, e^{-\frac{f_A + f_B}{f_0}} \right\rangle = (-1)^p f_0^p \left. \frac{\partial^p H}{\partial x^i \partial y^j \partial z^k} \right|_{\substack{x=1,\; y=1 \\ z=2,\; t=\infty}} , \tag{46}$$

with $p = i + j + k$. We can also define an "effective conditional distribution" using a similar weighting scheme,

$$H(x,y,z|f_0) \equiv \frac{\left\langle e^{-\frac{x f_{Ab} + y f_{aB} + z f_{AB}}{f_0}} \cdot e^{-\frac{f_A + f_B}{f_0}} \right\rangle}{\left\langle e^{-\frac{f_A + f_B}{f_0}} \right\rangle} = \lim_{t \to \infty} \frac{H(x+1, y+1, z+2, t)}{H(1,1,2,t)} \tag{47}$$
such that weighted moments can also be obtained directly from the derivatives of $H(x,y,z|f_0)$. Thus, for this special choice of weighting function, the conditional moments are easily calculated from solutions of the unconditioned generating function, $H(x,y,z,t)$. By differentiating Eq. (45) with respect to time and applying the stochastic dynamics in Eq. (8), we find that the generating function $H(x,y,z,t)$ satisfies the partial differential equation,

$$\begin{aligned} \frac{\partial H}{\partial \tau} = &-\left(\gamma_A x + x^2\right)\frac{\partial H}{\partial x} - \left(\gamma_B y + y^2\right)\frac{\partial H}{\partial y} \\ &- \left[\gamma_{AB} z + z^2 - \rho(x+y)\right]\frac{\partial H}{\partial z} - \rho f_0 z \frac{\partial^2 H}{\partial x\, \partial y} \\ &+ \theta f_0 z \left(\frac{\partial H}{\partial x} + \frac{\partial H}{\partial y}\right) - \theta(x+y)H \end{aligned} \tag{48}$$

with initial condition $H(x,y,z,0) = 1$, where we have defined the scaled variables

$$\begin{gathered} \theta = 2N\mu , \quad \tau = t/2Nf_0 , \quad \rho = 2NRf_0 , \\ \gamma_A = 2Ns_Af_0 , \quad \gamma_B = 2Ns_Bf_0 , \quad \gamma_\epsilon = 2N\epsilon f_0 , \\ \gamma_{AB} = \gamma_A + \gamma_B + \gamma_\epsilon + \rho . \end{gathered} \tag{49}$$

Here we have used the conventional notation $\theta$, $\rho$, and $\gamma_i$ to define scaled rates of mutation, recombination, and selection, respectively. Note that in the latter two cases, we have defined these scaled variables to include an additional factor of $f_0$, in order to match the key control parameters that we obtained in our heuristic analysis above. Motivated by these results, we will also restrict our attention to scenarios where $s, R \gg 1/N$, but where the scaled parameters $\rho$ and $\gamma$ can be either large or small compared to one. This will ensure that the maximum historical frequencies remain sufficiently small that the branching process approximation remains valid, while still capturing the full range of the qualitative behavior identified above.

Perturbation expansion for small $f_0$

The partial differential equation in Eq. (48) is difficult to solve in the general case. To make progress, we will focus on a perturbation expansion in the limit that $\theta$ and $f_0$ are both small compared to one, using the series ansatz

$$H(x,y,z,\tau) = 1 + \sum_{i,j=0}^{\infty} \theta^i f_0^j H_{i,j}(x,y,z,\tau) , \tag{50}$$

with $H_{i,j}(x,y,z,0) = 0$. We solve for the first few terms in this expansion in Appendix B. The first order contributions are simply a product of the corresponding single-locus distributions,

$$H(x,y,z,\tau) \approx 1 - \theta H_A(x,\tau) - \theta H_B(y,\tau) + O(\theta^2) \tag{51a}$$

where we have defined

$$H_A(x,\tau) \equiv \log\left[1 + \frac{x(1 - e^{-\gamma_A \tau})}{\gamma_A}\right] , \qquad H_B(y,\tau) \equiv \log\left[1 + \frac{y(1 - e^{-\gamma_B \tau})}{\gamma_B}\right] . \tag{51b}$$

The conditional distribution in Eq. (47) then follows as

$$H(x,y,z|f_0) \approx 1 - \theta \log\left(1 + \frac{x}{\gamma_A + 1}\right) - \theta \log\left(1 + \frac{y}{\gamma_B + 1}\right) + O(\theta^2) . \tag{52}$$

This distribution has a well-defined value even when $\gamma_i = 0$, which shows how the frequency weighting in Eq. (47) can eliminate the well-known divergence of the neutral branching process when $\tau \to \infty$. We also see that the resulting distributions are equivalent to the unconditioned frequency spectrum of a deleterious mutation with an effective selection coefficient $\gamma^* = \gamma + 1$, which has a well-known form,

$$p(f|f_0)\,df \approx \frac{2N\mu \cdot e^{-2Nsf - f/f_0}}{f}\,df . \tag{53}$$

This constitutes a quantitative version of the heuristic result in Eq. (20), and confirms our previous intuition that the deleterious site frequency spectrum can be mimicked by neutral mutations with an appropriate choice of $f_0$.

By definition, these first order solutions do not provide any information about linkage disequilibrium between the two loci, which only starts to enter at order $\theta^2$. In Appendix B, we show that these next order contributions are given by

$$\begin{aligned} H \approx\; & 1 - \theta H_A - \theta H_B + \frac{\theta^2}{2}(H_A + H_B)^2 \\ & - \theta^2 f_0 \int_0^\tau d\tau'\, \psi(x,y,z,\tau') \left[\Phi_x(\tau,\tau') + \Phi_y(\tau,\tau') + \rho\, \Phi_x(\tau,\tau')\Phi_y(\tau,\tau')\right] , \end{aligned} \tag{54a}$$

where the functions $\Phi_x(\tau,\tau')$ and $\Phi_y(\tau,\tau')$ are defined by

$$\Phi_x(\tau,\tau') \equiv \frac{\left[1 - e^{-\gamma_A(\tau - \tau')}\right]\left[\gamma_A + x(1 - e^{-\gamma_A \tau'})\right]}{\gamma_A\left[\gamma_A + x(1 - e^{-\gamma_A \tau})\right]} , \qquad \Phi_y(\tau,\tau') \equiv \frac{\left[1 - e^{-\gamma_B(\tau - \tau')}\right]\left[\gamma_B + y(1 - e^{-\gamma_B \tau'})\right]}{\gamma_B\left[\gamma_B + y(1 - e^{-\gamma_B \tau})\right]} , \tag{54b}$$

and $\psi(x,y,z,\tau')$ is a solution to the characteristic curve,

$$\frac{\partial \psi}{\partial \tau'} = -\gamma_{AB}\psi - \psi^2 + \frac{\rho \gamma_A x e^{-\gamma_A \tau'}}{\gamma_A + x(1 - e^{-\gamma_A \tau'})} + \frac{\rho \gamma_B y e^{-\gamma_B \tau'}}{\gamma_B + y(1 - e^{-\gamma_B \tau'})} , \tag{54c}$$

with the initial condition $\psi(x,y,z,0) = z$. Using this formal solution, the weighted moments then follow as

$$\left\langle f_A f_B\, e^{-\frac{f_A + f_B}{f_0}} \right\rangle \approx \frac{\theta^2 f_0^2}{(\gamma_A + 1)(\gamma_B + 1)} + O(f_0^3) , \tag{55}$$

and

$$\left\langle f_{AB}^k\, e^{-\frac{f_A + f_B}{f_0}} \right\rangle \approx \theta^2 f_0^{k+1} \int_0^\infty d\tau'\, \psi_k(\tau') \left[\Phi_A(\tau') + \Phi_B(\tau') + \rho\, \Phi_A(\tau')\Phi_B(\tau')\right] , \tag{56a}$$

where we have defined the functions

$$\Phi_A(\tau') \equiv \Phi_x(\tau,\tau')\big|_{x=1,\,\tau=\infty} = \frac{\gamma_A + 1 - e^{-\gamma_A \tau'}}{\gamma_A(\gamma_A + 1)} , \qquad \Phi_B(\tau') \equiv \Phi_y(\tau,\tau')\big|_{y=1,\,\tau=\infty} = \frac{\gamma_B + 1 - e^{-\gamma_B \tau'}}{\gamma_B(\gamma_B + 1)} , \tag{56b}$$

and

$$\psi_k(\tau') \equiv (-1)^{k+1} \left. \frac{\partial^k \psi(x,y,z,\tau')}{\partial z^k} \right|_{x=1,\,y=1,\,z=2} . \tag{56c}$$

Substituting these moments into the definition of $\sigma^k_d(f_0)$, we find that the leading order solution collapses onto a lower dimensional manifold,

$$\frac{\sigma^k_d}{f_0^{k-1}} \approx \Sigma^k_d(\gamma_A, \gamma_B, \gamma_\epsilon, \rho) , \tag{57a}$$

where $\Sigma^k_d(\cdot)$ is a dimensionless function that depends only on the scaled parameters $\gamma_i$ and $\rho$:

$$\Sigma^k_d \approx -\delta_{k,1} + (1 + \gamma_A)(1 + \gamma_B) \int_0^\infty \psi_k(\tau') \left[\Phi_A(\tau') + \Phi_B(\tau') + \rho\, \Phi_A(\tau')\Phi_B(\tau')\right] d\tau' . \tag{57b}$$

This solution is independent of the mutation rate, as expected, and depends on the frequency scale $f_0$ only implicitly through the definitions of $\Sigma_d$, $\gamma_i$, and $\rho$. This is already an important constraint, as it implies that scenarios with different underlying values of $f_0$, $s_i$, and $R$, but similar values of the scaled parameters $\gamma_i$ and $\rho$, must necessarily have similar values of $\Sigma^k_d$. Closed form expressions for $\Sigma^k_d(\cdot)$ are more difficult to obtain in the general case, due to the difficulty in solving the differential equation for $\psi(x,y,z,\tau')$ for arbitrary parameter combinations. However, further analytical progress can still be made by examining the behavior of this equation in certain limits.

Non-recombining loci. The simplest behavior occurs in the absence of recombination ($\rho = 0$), when the characteristic curve has an exact solution,

$$\psi(z,\tau') = \frac{\gamma_{AB}\, z\, e^{-\gamma_{AB} \tau'}}{\gamma_{AB} + z(1 - e^{-\gamma_{AB} \tau'})} . \tag{58}$$

Substituting this expression into Eqs. (56c) and (57), we find that $\Sigma^2_d(\gamma_A, \gamma_B, \gamma_\epsilon)$ can be expressed as a definite integral,

$$\Sigma^2_d = \int_0^\infty \frac{\gamma_{AB}^2\, e^{-\gamma_{AB} \tau'}\left(1 - e^{-\gamma_{AB} \tau'}\right)}{4\left(\frac{\gamma_{AB}}{2} + 1 - e^{-\gamma_{AB} \tau'}\right)^3} \times \left[(\gamma_B + 1) \cdot \frac{\gamma_A + 1 - e^{-\gamma_A \tau'}}{\gamma_A} + (\gamma_A + 1) \cdot \frac{\gamma_B + 1 - e^{-\gamma_B \tau'}}{\gamma_B}\right] d\tau' , \tag{59}$$

which is straightforward to evaluate numerically. Analogous integral expressions can be derived for the other moments, $\Sigma^k_d(\gamma_A, \gamma_B, \gamma_\epsilon)$, as well as for the full conditional distribution, $H(x,y,z|f_0)$ (see Appendix C). Asymptotic solutions of these integrals for small and large $\gamma_{AB}$ show that they have the same limiting behavior that we identified in our heuristic analysis above, while the numerical solutions accurately capture the quantitative behavior across the full range of intermediate parameter values (Fig. 3).

Neutral loci. Another important limit occurs for neutral mutations ($\gamma_A = \gamma_B = \gamma_\epsilon = 0$), where the manifold in Eq. (57) reduces to a single parameter curve, $\Sigma^k_d(\rho)$. This is already a useful prediction, since it implies that changes in $R$ can be mimicked by changes in $f_0$, and vice versa. However, the characteristic curve in Eq. (54c) is now more difficult to solve than in the asexual case, due to the presence of the time-dependent terms, $\rho x/(1 + x\tau')$ and $\rho y/(1 + y\tau')$. Physically, these terms represent the additional Ab and aB lineages that are created when the AB haplotype recombines with the wildtype background. Fortunately, an exact solution can still be obtained in this case using special functions, which is derived in Appendix D. After substituting this solution into Eq. (56), we find that the scaling function $\Sigma^2_d(\rho)$ can again be expressed as a definite integral,

$$\Sigma^2_d(\rho) = \int_0^\infty \frac{e^{-\zeta}(\rho + \zeta)(2 + \rho + \zeta)}{D(\zeta)^2} \times \left[(2 + \rho) + \frac{(\rho + \zeta)(2 + \rho + \zeta)}{D(\zeta)}\right] d\zeta , \tag{60}$$

where we have defined the function

$$D(\zeta) \equiv (1 + \rho + \zeta)\rho e^{-\zeta} + (\rho + \zeta)(2 + \rho + \zeta)\left[-1 + \rho e^{\rho}\left(E_1(\rho) - E_1(\rho + \zeta)\right)\right] , \tag{61}$$

and $E_1(\zeta)$ is the standard exponential integral function,

$$E_1(\zeta) \equiv \int_\zeta^\infty \frac{e^{-u}}{u}\, du . \tag{62}$$

Analogous integral expressions for the moments $\Sigma^1_d(\rho)$ and $\Sigma^4_d(\rho)$ are presented in Appendix D. Once again, asymptotic evaluation of these integrals recovers the same $\sim 1/\rho$

and ∼log(1/ρ) scaling observed in our earlier heuristic analysis for large and small ρ, respectively, while the numerical solutions accurately capture the quantitative behavior across the full range of ρ (Fig. 4). This full solution is often quite useful in practice, since the convergence to the asymptotic limits can be rather slow. In particular, we see that the corrections to the small ρ limit scale as ∼1/log(1/ρ), which implies that extremely small values of ρ (as low as ∼10^{−5}) are required to achieve good numerical accuracy. This leaves a broad intermediate regime (10^{−5} ≲ ρ ≲ 10) in which Eq. (60) is critical for enabling quantitative comparisons with data.

Strong selection and/or recombination. The final limit we will consider is one in which at least one of the scaled selection coefficients (γ_i) or the scaled recombination rate (ρ) is large compared to one. Physically, this approximation means that genetic drift is weak compared to the forces of selection and/or recombination. In this limit, it is possible to solve for the characteristic curve in Eq. (54c) using a separation of timescales approach, treating the ψ^2 term as a perturbative correction. We outline this perturbation expansion in Appendix E. We find that the first two moments are given by

\[ \Sigma_d^1 \approx 0 , \qquad \Sigma_d^2 \approx \frac{2 + \gamma_A + \gamma_B + \rho}{(\gamma_A + \gamma_B + \gamma_\epsilon + \rho)^2} , \tag{63} \]

which matches the asymptotic behavior in the non-recombining (ρ = 0) and neutral (γ_i = 0) limits above. By comparing this result with the neutral version in Eq. (60), we observe a quantitative confirmation of our heuristic prediction that LD is lower among deleterious mutations than among neutral mutations with identical present-day frequencies (f_{0,eff} = 1/2Ns). The most pronounced difference occurs for the first moment σ_d^1, where frequency-matched neutral mutations display an excess of coupling linkage (σ_d^1 > 0) compared to the deleterious case (σ_d^1 ≈ 0), which can be observed for recombination rates as large as ρ ≈ 50 (Fig. 5).

[Figure 5: σ_d^1 (top panel) and σ_d^2 (bottom panel) as functions of 2NR.]

FIG. 5 LD contains residual signatures of purifying selection after controlling for mutation frequencies. The top and bottom panels compare the first (top) and second (bottom) LD moments for a pair of neutral (black) and strongly deleterious mutations (red) across a range of recombination rates. Symbols denote the results of forward-time simulations, and the solid lines denote the theoretical predictions from Eq. (2) (black) and Eq. (63) (red). The grey symbols show frequency-weighted neutral mutations with the same present-day frequency spectrum as the deleterious mutations. For small recombination rates, the neutral control group displays an excess of coupling linkage (σ_d^1 > 0) driven by ancient nested mutations, which are suppressed in the deleterious case.

Transition to Quasi-Linkage Equilibrium (QLE)

The perturbation expansion in Eq. (54) is valid to lowest order in f_0 ≪ 1, which means that it cannot capture the transition to the quasi-linkage equilibrium (QLE) regime that occurs when ρf_0 ∼ 1. Nevertheless, we can obtain an analogous set of predictions for this regime by returning to the underlying stochastic differential equations in Eq. (8), and focusing on cases where ρ ≫ 1, γ_i and f_0 ≪ 1, but where ρf_0 is not necessarily small compared to one. In this limit, Eq. (8) can be solved using a separation of timescales approach, in which f_AB evolves on a fast timescale,

\[ \frac{\partial f_{AB}}{\partial t} = R f_{Ab} f_{aB} - R f_{AB} + \sqrt{\frac{f_{AB}}{N}}\, \eta(t) , \tag{64} \]

while f_Ab and f_aB are effectively fixed. These single-mutant frequencies then evolve on longer timescales according to the single-locus dynamics in Eq. (52). In this approximation, the fast dynamics in Eq. (64) approach an instantaneous equilibrium,

\[ p(f_{AB} | f_{Ab}, f_{aB}) \propto f_{AB}^{2NR f_{Ab} f_{aB} - 1}\, e^{-2NR f_{AB}} , \tag{65} \]

on a timescale of order ∼1/R, which is much shorter than the fluctuation timescales of f_Ab and f_aB in the limit that ρ ≫ 1. This allows us to easily calculate various LD

[Figure 6 plot. Left panel: horizontal axis 2NRf_0, with symbols colored by f_0 = 0.001, 0.01, 0.03, and 0.1. Right panel: probability p(f_AB | f_A, f_B) versus double mutant frequency f_AB, for three recombination rates.]

FIG. 6 Higher-order fluctuations reveal the transition to quasi-linkage equilibrium. Left panel: an analogous version of the neutral collapse plot in Fig. 4 for the higher-order LD moment σ_d^4(f_0). Symbols denote the results of forward-time simulations for a range of recombination rates, which are colored by the corresponding value of f_0. The solid black line shows the prediction from the perturbation expansion in Eqs. (D24) and (D25), and the dashed lines indicate the position, NRf_0 ∼ 1/f_0, where the perturbation expansion is predicted to break down. The solid colored lines show the asymptotically matched predictions from Eq. (F6), which capture the transition to the quasi-linkage equilibrium regime. Right panel: the conditional distribution of the double mutant frequency for fixed values of the marginal mutation frequencies, f_A ≈ f_B ≈ f_0. Colored lines show forward-time simulations for pairs of neutral mutations, in which the marginal frequencies of both mutations were observed in the range 0.13 ≤ f_A, f_B ≤ 0.17; the double mutant frequency was further downsampled to n = 200 individuals to enhance visualization. The dashed lines indicate the approximate positions of linkage equilibrium (f_AB ≈ f_0^2, left) and perfect linkage (f_AB ≈ f_0; right). Conditional distributions are shown for three different recombination rates, whose characteristic shapes illustrate the transition between the mutation-dominated (NRf_0 ≪ 1; orange), clonal recombinant (1 ≪ NRf_0 ≪ 1/f_0; green) and quasi-linkage equilibrium (NRf_0 ≫ 1/f_0; blue) regimes.
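The quasi-linkage equilibrium shape in the right panel can be related to the fast-timescale equilibrium in Eq. (65), which has the form of a gamma distribution with shape 2NR·f_Ab·f_aB and rate 2NR. The following is a minimal sketch under that reading, not the paper's own code: the function names and the parameter values are hypothetical, and dividing the central moments by f_Ab·f_aB is an assumed normalization chosen so that the results can be compared with Eq. (66) at f_Ab ≈ f_aB ≈ f_0.

```python
import math


def qle_gamma_params(N, R, f_Ab, f_aB):
    """Shape and rate of the gamma distribution read off from Eq. (65)."""
    shape = 2.0 * N * R * f_Ab * f_aB
    rate = 2.0 * N * R
    return shape, rate


def gamma_central_moments(shape, rate):
    """Mean, variance, and fourth central moment of a gamma distribution."""
    mean = shape / rate
    var = shape / rate**2
    mu4 = 3.0 * shape * (shape + 2.0) / rate**4
    return mean, var, mu4


def normalized_ld_moments(N, R, f_Ab, f_aB):
    """Central moments of f_AB about f_Ab*f_aB, divided by f_Ab*f_aB.

    The normalization is an assumption made for this illustration.
    """
    c = f_Ab * f_aB
    shape, rate = qle_gamma_params(N, R, f_Ab, f_aB)
    mean, var, mu4 = gamma_central_moments(shape, rate)
    return (mean - c) / c, var / c, mu4 / c


if __name__ == "__main__":
    # Hypothetical values chosen so that NR = 100 and f0 = 0.1.
    N, R, f0 = 1e6, 1e-4, 0.1
    m1, m2, m4 = normalized_ld_moments(N, R, f0, f0)
    print("first moment :", m1)
    print("second moment:", m2, "vs 1/(2NR) =", 1.0 / (2 * N * R))
    print("fourth moment:", m4,
          "vs (6 + 6NR f0^2)/(2NR)^3 =",
          (6 + 6 * N * R * f0**2) / (2 * N * R) ** 3)
```

Under these assumptions the three normalized moments reproduce the functional forms 0, 1/(2NR), and (6 + 6NRf_0^2)/(2NR)^3, which suggests that the f_0-dependent term in the fourth moment comes from the Gaussian-like part of the gamma fluctuations, while the 6/(2NR)^3 term comes from its excess kurtosis.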

statistics from the conditional moments of Eq. (65), by averaging over the single-locus distributions of f_Ab and f_aB in Eq. (53). For the lowest few moments of σ_d^k(f_0), we find that

\[ \sigma_d^1(f_0) = 0 , \qquad \sigma_d^2(f_0) = \frac{1}{2NR} , \qquad \sigma_d^4(f_0) = \frac{6 + 6NR f_0^2}{(2NR)^3} . \tag{66} \]

This confirms our heuristic result that the first two moments of σ_d^k are identical in the clonal recombinant and quasi-linkage equilibrium regimes, while the higher moments start to diverge due to the differences in the statistical fluctuations of f_AB. These differences can be observed by examining the ratio

\[ \eta(f_0) = \frac{\sigma_d^4(f_0)}{3\,\sigma_d^2(f_0)} , \tag{67} \]

which shifts from a rapid ∼1/(NR)^2 decay in the clonal recombinant regime to a shallower ∼f_0^2/NR decay in the QLE regime, while saturating to a constant value when NRf_0 ≪ 1 (Fig. 6, left).

The differences between these regimes are even easier to observe by examining the full conditional distribution of the double mutant frequency, p(f_AB | f_A, f_B ≈ f_0), at a fixed value of the marginal mutation frequencies, f_A ≈ f_B ≈ f_0 (Fig. 6, right). In the QLE regime, Eq. (65) shows that the conditional distribution develops a peak around the linkage equilibrium value f_AB ≈ f_A f_B ≈ f_0^2, while the clonal recombinant regime has a much broader distribution with a mode at f_AB = 0 and an exponential cutoff at f_AB ∼ 1/NR. These characteristic shapes are qualitatively distinct from the conditional distributions that are observed in the mutation-dominated regime (NRf_0 ≪ 1), which have a bimodal shape with one peak at f_AB = 0 and a smaller peak at f_AB ≈ f_A ≈ f_B ≈ f_0. This suggests that the shape of the conditional distribution p(f_AB | f_A, f_B ≈ f_0) might provide a particularly robust test for distinguishing between different rates of recombination. We will return to this topic in the Discussion when we discuss potential applications of our results to genomic data from bacteria.

Estimating frequency-resolved LD in finite samples

So far, our formal analysis has focused on predicting ensemble averages of various LD statistics at a single pair of genetic loci. To connect these results with empirical data, we will often want to estimate these ensemble averages in a slightly different way, by summing over many functionally similar pairs of genetic loci observed in a finite sample of n genomes. In these cases, we will not be able to observe the haplotype frequencies that enter into σ_d^k(f_0) directly, but must instead infer them from the discrete counts, n_Ab, n_aB, n_AB, and n_ab, that are observed in our finite sample. We will assume that these haplotype counts are randomly sampled from the underlying population, so that they are multinomially distributed around the current haplotype frequencies:

\[ \Pr[\vec{n} | \vec{f}] = \frac{n! \cdot f_{Ab}^{n_{Ab}} \cdot f_{aB}^{n_{aB}} \cdot f_{AB}^{n_{AB}} \cdot f_{ab}^{n_{ab}}}{n_{Ab}! \cdot n_{aB}! \cdot n_{AB}! \cdot n_{ab}!} . \tag{68} \]

In sufficiently large samples (n → ∞), the haplotype counts will remain close to their expected values n_i ≈ nf_i, and σ_d^k(f_0) can be well-approximated by setting f_i = n_i/n in Eq. (7). However, for sufficiently low frequencies (nf_i ∼ 10), the additional uncertainty in f_i will cause the naive estimator to be biased away from the true value of σ_d^k(f_0). Our heuristic results show that these low frequencies will generically dominate LD estimates, even in large samples, for sufficiently large values of NR or Ns_i, or for sufficiently low choices of f_0. Unbiased estimators of σ_d^k(f_0) are therefore essential for extrapolating across the full range of frequency scales.

In this section, we will develop one particular class of estimators by generalizing an approach that we and others have previously used to estimate the unweighted version of σ_d^2 in Eq. (4) (Garud et al., 2019; Ragsdale and Gravel, 2020). To extend this result to the frequency-resolved case, we will first take advantage of the fact that the multinomial distribution in Eq. (68) reduces to a product of independent Poisson distributions,

\[ \Pr[\vec{n} | \vec{f}] \approx \frac{(n f_{Ab})^{n_{Ab}}}{n_{Ab}!} e^{-n f_{Ab}} \cdot \frac{(n f_{aB})^{n_{aB}}}{n_{aB}!} e^{-n f_{aB}} \times \frac{(n f_{AB})^{n_{AB}}}{n_{AB}!} e^{-n f_{AB}} , \tag{69} \]

in the limit that mutations are rare (f_A, f_B ≪ 1). This joint distribution admits a general moment formula,

\[ \left\langle \frac{n_{Ab}!\, n_{aB}!\, n_{AB}!\; e^{-x(n_{Ab}-i) - y(n_{aB}-j) - z(n_{AB}-k)}}{n^{i+j+k}\,(n_{Ab}-i)!\,(n_{aB}-j)!\,(n_{AB}-k)!} \,\middle|\, \vec{f} \right\rangle = f_{Ab}^i f_{aB}^j f_{AB}^k\, e^{-n f_{Ab}(1 - e^{-x}) - n f_{aB}(1 - e^{-y}) - n f_{AB}(1 - e^{-z})} \tag{70} \]

for arbitrary integers i, j, and k, and arbitrary real numbers x, y, and z. Thus, for the special choice

\[ x^* \equiv y^* \equiv -\log\left( 1 - \frac{1}{n f_0} \right) , \qquad z^* \equiv -\log\left( 1 - \frac{2}{n f_0} \right) , \tag{71} \]

the conditional expectation reduces to

\[ \left\langle \frac{n_{Ab}!\, n_{aB}!\, n_{AB}!\; e^{-x^*(n_{Ab}-i) - y^*(n_{aB}-j) - z^*(n_{AB}-k)}}{n^{i+j+k}\,(n_{Ab}-i)!\,(n_{aB}-j)!\,(n_{AB}-k)!} \,\middle|\, \vec{f} \right\rangle = f_{Ab}^i f_{aB}^j f_{AB}^k\, e^{-f_A/f_0 - f_B/f_0} . \tag{72} \]

This motivates us to define the function,

\[ M_{i,j,k}(\vec{n}) = \left[ \frac{n_{Ab}! \left( 1 - \frac{1}{n f_0} \right)^{n_{Ab}-i}}{n^i (n_{Ab}-i)!} \right] \times \left[ \frac{n_{aB}! \left( 1 - \frac{1}{n f_0} \right)^{n_{aB}-j}}{n^j (n_{aB}-j)!} \right] \cdot \left[ \frac{n_{AB}! \left( 1 - \frac{2}{n f_0} \right)^{n_{AB}-k}}{n^k (n_{AB}-k)!} \right] , \tag{73} \]

whose total expectation, which now averages over sampling noise in addition to the underlying evolutionary stochasticity, satisfies the identity

\[ \left\langle M_{i,j,k}(\vec{n}) \right\rangle = \left\langle f_{Ab}^i f_{aB}^j f_{AB}^k\, e^{-f_A/f_0 - f_B/f_0} \right\rangle . \tag{74} \]

Thus, we see that for this special choice of exponential weighting function, there is a simple relationship between the ensemble averages of haplotype frequencies and genome-wide averages over haplotype counts. Using this formula, it is a straightforward (though tedious) task to derive a corresponding set of estimators for σ_d^k(f_0), by expanding the f_A and f_B terms in Eq. (7) and iteratively applying Eq. (74). We list the relevant expressions for the first few moments of σ_d^k(f_0) in Appendix G.

Applications to synonymous and nonsynonymous LD

LD curves are frequently calculated for pairs of synonymous or nonsynonymous mutations separated by similar coordinate distances ℓ (or ideally, by similar map lengths R). In these cases, the empirical estimators in the previous section converge to a weighted average,

\[ \sigma_{d,i}^k(R, f_0) \approx \frac{\iiint \sigma_d^k(s_A, s_B, \epsilon, R, f_0) \left( \frac{1}{1 + 2N s_A f_0} \right) \left( \frac{1}{1 + 2N s_B f_0} \right) \rho_i(s_A)\rho_i(s_B)\rho_i^\epsilon(\epsilon | s_A, s_B)\, ds_A\, ds_B\, d\epsilon}{\iint \left( \frac{1}{1 + 2N s_A f_0} \right) \left( \frac{1}{1 + 2N s_B f_0} \right) \rho_i(s_A)\rho_i(s_B)\, ds_A\, ds_B} , \tag{75} \]

where ρ_i(s) denotes the distribution of fitness costs of synonymous (i = S) or nonsynonymous (i = N) mutations, and ρ_i^ε(ε | s_A, s_B) denotes the corresponding distribution of epistatic interactions. In this way, any differences between the observed σ_{d,N}^k and σ_{d,S}^k curves can provide additional information about the differences in their underlying fitness costs.

To understand the consequences of the average in Eq. (75), recall that our earlier analytical expressions showed that deleterious fitness costs generally lead to lower values of σ_d^k(s_A, s_B, ε, R, f_0), where the magnitude of this effect depends on the relative values of R and f_0. The additional factors that appear in the average in Eq. (75) will further downweight the contributions of mutations with costs larger than ∼1/Nf_0. This shows that strongly deleterious mutations (s ≫ 1/Nf_0) will have a negligible impact in the numerator of Eq. (75), as long as there is an appreciable fraction of mutations with smaller fitness costs. However, for the same reasons, these strongly deleterious mutations will also have a negligible impact on the denominator of Eq. (75), which implies that they will have a negligible overall contribution to the site-averaged LD statistics σ_{d,N}^k(R, f_0) and σ_{d,S}^k(R, f_0).

At the same time, we have seen that very weakly deleterious mutations (s_AB ≪ 1/Nf_0) produce σ_d^k values that are nearly indistinguishable from neutral mutations, differing only by a slowly varying ∼log(1/Ns_AB f_0) factor. These mutations contribute to the averages in σ_{d,N}^k(R, f_0) and σ_{d,S}^k(R, f_0), but they cannot contribute to differences between the two quantities. Thus, in the absence of epistasis, we expect that the differences between synonymous and nonsynonymous LD will be driven by a narrow range of mutations with fitness costs O(1/Nf_0), and will mainly be visible when NRf_0 ≪ 1.

On one hand, this sensitivity suggests that it might be possible to infer detailed information about ρ_N(s) by comparing σ_N^k(f_0) and σ_S^k(f_0) values across a range of frequency scales. On the other hand, our analytical expressions show that these marginal fitness costs will only lead to O(1) differences in σ_N^k(f_0), which will sensitively depend on the precise value of the integral in Eq. (75). We leave a more detailed exploration of this dependence for future work. We also note that this limited resolution is no longer an issue in the presence of epistasis: strong synergistic epistasis between weakly selected mutations can produce large changes in σ_N^k(f_0) if they are sufficiently common.

DISCUSSION

Contemporary patterns of linkage disequilibrium contain important information about the evolutionary forces at work within a population, which shape genetic variation over a vast range of different length and time scales. Here, we have introduced a forward-time framework for predicting linkage disequilibrium between pairs of neutral or deleterious mutations as a function of their present-day frequency scale f_0. This additional functional dependence proved to be more than a statistical curiosity, and instead enabled new insights into the dynamics of linkage disequilibrium that have been difficult to obtain from existing theoretical approaches (McVean, 2002; Ohta and Kimura, 1971; Song and Song, 2007).

By restricting our attention to rare mutations (f_0 ≪ 1), we were able to obtain a simple heuristic picture of linkage disequilibrium that emphasizes the underlying lineage dynamics of the two mutations (Fig. 2). We saw that the frequency scale f_0 can dramatically influence these dynamics, in a way that primarily depends on frequency-rescaled quantities like NRf_0 and Nsf_0. Our lineage-based picture highlighted the crucial importance of ancient nested mutations (Fig. 2C), which are substantially older than typical segregating variants, but which provide an increasingly large contribution to LD among tightly linked loci (NRf_0 ≲ 1) with neutral or weakly deleterious fitness costs (Nsf_0 ≲ 1). In these cases, we saw that ancient nested mutations will create an excess of coupling linkage (f_AB > f_A f_B) that qualitatively resembles the effects of antagonistic epistasis. This excess coupling linkage has previously been observed in computer simulations and in genomic data from diverse organisms (Garcia and Lohmueller, 2020; Sandler et al., 2020; Sohail et al., 2017), where it has fueled an ongoing debate about the inference of epistasis from patterns of nonsynonymous and synonymous LD in a variety of species. Our analytical calculations suggest a potential mechanism for this counterintuitive behavior, and they demonstrate that this effect will generically arise even in the absence of admixture or other complex demographic scenarios.

Our results also allow us to answer a question we posed at the beginning of this work: do differences between synonymous and nonsynonymous LD curves primarily arise from differences in their underlying mutation frequencies? Our analytical results demonstrate that this is not the case: the key difference is that strongly deleterious mutations (Nsf_0 ≫ 1) can no longer sustain the ancient nested mutations in Fig. 2C, leading to lower levels of LD compared to neutral mutations with similar present-day frequencies (f_0 ∼ 1/Ns; Fig. 5). This shows that ordinary negative selection can generate differences between synonymous and nonsynonymous LD that qualitatively resemble the effects of negative epistasis. This could be an important potential confounder for efforts to infer negative epistasis by comparing levels of nonsynonymous and synonymous LD (Sandler et al., 2020; Sohail et al., 2017). However, we also saw that these strongly deleterious mutations can have a negligible impact on certain genome-wide averages like σ_d^2, due to their lower marginal frequencies. Thus, the quantitative magnitude of this effect

FIG. 7 Frequency-resolved LD in the commensal human gut bacterium Eubacterium rectale. Single nucleotide variants (SNVs) were obtained for a sample of n = 109 unrelated strains reconstructed from different human hosts (Garud et al., 2019) (Appendix H). (a) Frequency-weighted LD (σ_d^2(f_0)) as a function of coordinate distance (ℓ) between 4-fold degenerate synonymous SNVs in core genes. Solid lines were obtained by applying the unbiased estimator in Appendix G to all pairs of SNVs within 0.2 log units of ℓ, while the points depict genome-wide averages calculated from randomly sampled pairs of SNVs from widely separated genes. The two estimates are connected by a dashed line for visualization. (b) Analogous σ_d^2(f_0) curves as a function of the frequency scale f_0. (c) The single site frequency spectrum, estimated from the fraction of SNV pairs in which the first mutation is observed with a given minor allele count, n_A. (d) The conditional distribution of the double mutant frequency for fixed values of the marginal mutation frequencies, f_A ≈ f_B ≈ f_0. Colored lines show the observed distributions for pairs of SNVs with marginal mutation frequencies in the range 0.13 ≤ f_A, f_B ≤ 0.17; dashed lines indicate approximate positions of linkage equilibrium (f_AB ≈ f_0^2, left) and perfect linkage (f_AB ≈ f_0; right). The shapes of the three distributions are qualitatively similar to the mutation-dominated (NRf_0 ≪ 1; orange), clonal recombinant (1 ≪ NRf_0 ≪ 1/f_0; green) and quasi-linkage equilibrium (NRf_0 ≫ 1/f_0; blue) regimes predicted in Fig. 6.

can strongly depend on the underlying distribution of fitness effects as well as the averaging scheme employed. Moreover, while additive fitness costs can lead to lower relative values of LD, we also saw that they cannot produce negative values of σ_d^1(f_0) on their own. This suggests that negative genome-wide values of signed LD may constitute a more robust indicator of negative epistasis than relative reductions in LD over synonymous sites.

More generally, our results provide a framework for leveraging the increasingly large sample sizes of modern genomic datasets to quantify the scaling behavior of linkage disequilibrium across a range of underlying frequency scales. These scaling behaviors have a long history of application in other areas of statistical physics (Meshulam et al., 2019; Stanley, 1999), and are commonly used in population genetics to infer evolutionary parameters from the shape of the single-site frequency spectrum (Lawrie and Petrov, 2014; Ragsdale et al., 2018). Our results provide a framework for extending this approach to multi-site statistics like linkage disequilibrium, potentially creating new opportunities to probe the underlying recombination process across a wide range of genomic length and time scales.

We illustrate this approach in Fig. 7. We used our empirical estimators for σ_d^2(f_0) to calculate frequency-resolved LD curves for 109 worldwide strains of the commensal human gut bacterium Eubacterium rectale (Appendix H). In a previous study, my colleagues and I used this dataset to infer the presence of widespread homologous recombination in the global population of E. rectale,

by examining how the unweighted version of σ_d^2 decays as a function of the coordinate distance ℓ (Garud et al., 2019). Our frequency-resolved estimators now provide an analogous manifold of LD curves, σ_d^2(ℓ, f_0), which allow us to examine the dynamics of LD across nearly two decades of frequency space (Fig. 7A,B). At a qualitative level, these empirical curves are similar to their theoretical counterparts above (Figs 4 and 5), with smaller mutation frequencies and/or longer coordinate distances leading to lower values of LD. However, the quantitative dependence on ℓ and f_0 indicates dramatic departures from the simplest neutral null models analyzed in this work. In particular, the E. rectale data suggest that larger coordinate distances are more sensitive to reductions in f_0 (Fig. 7A,B), while our theoretical models predict the opposite trend (Fig. 5). Moreover, this unusual frequency dependence is observed even at the largest coordinate distances (ℓ ∼ 10^6 bp), where the overall reduction in σ_d^2 might normally suggest convergence to recombination-dominated behavior (NR ≫ 1). This example shows how analytical predictions of frequency-resolved LD statistics can help identify surprising features of the data that warrant future study, yet would be difficult to identify from intuition alone.

In this case, the divergence between theory and data might have been anticipated, given that the marginal mutation frequencies in E. rectale already deviate from the ∼1/f dependence predicted under the simplest neutral null model (Fig. 7B). It is possible that the modest enrichment of rare mutations observed in the data could bias the relevant averages below the nominal value of f_0, leading to somewhat stronger realized levels of linkage than would be expected under the simplest versions of our model. We therefore attempted to quantify the same LD patterns in a different way, by examining the conditional distribution of the double mutant frequency for a specific value of the marginal mutation frequencies, f_A ≈ f_B ≈ f_0, in order to reduce potential uncertainties introduced by the average in σ_d^2(f_0) (Fig. 7D). When f_0 ≈ 10%, we see that the data display a transition between the three characteristic regimes identified in Fig. 6, with quasi-linkage equilibrium emerging at the largest coordinate distances (ℓ ∼ 10^6) and mutation-dominated behavior at shorter genetic distances (100 < ℓ < 300). On intermediate length scales (ℓ ∼ 1000), the LD distribution transitions to an exponential shape expected in the clonal recombinant regime (f_0^{−1} ≪ NR(ℓ) ≪ f_0^{−2}), which provides a direct empirical demonstration of this qualitatively new behavior. The boundaries between these regimes provide an independent set of bounds on the corresponding recombination rates,

\[ 10 \lesssim NR(\ell \approx 10^3) \lesssim 100 \lesssim NR(\ell \approx 10^6) , \tag{76} \]

which no longer require extrapolation over multiple distance scales. This example shows how the sampling distributions of different LD statistics can provide new insights into the dynamics of the underlying recombination process.

Of course, these empirical comparisons should be treated with a degree of caution, since our theoretical analysis focused on an extremely simple null model that lacks many of the complexities associated with real microbial populations. Our results suggest that it would be interesting to extend these approaches to account for other factors that might be relevant at short time scales, including time-varying population sizes, linked selection, and certain forms of spatial structure. We believe that our lineage-based framework will provide a useful starting point for predicting the dynamics of linkage disequilibrium across these diverse evolutionary scenarios, which would allow us to better exploit the unique features of modern genomic datasets.

DATA AVAILABILITY

All source code for forward-time simulations, data analysis, and figure generation is available at GitHub (https://github.com/bgoodlab/rare_ld). Polymorphism data from E. rectale were obtained from our previous study (Garud et al., 2019), and can be accessed using the accessions cited therein. Postprocessed SNV data used to create Fig. 7 are available in Supplementary Data.

ACKNOWLEDGMENTS

I thank Nandita Garud and Daniel Fisher for useful discussions. This work was supported in part by a Terman Fellowship from Stanford University and the Miller Institute for Basic Research in Science at the University of California, Berkeley.

REFERENCES

Allix-Béguec, C., Arandjelovic, I., Bi, L., Clifton, D., Crook, D., Fowler, P., Gibertoni Cruz, A., Golubchik, Hoosdally, S., Hunt, M., Iqbal, Z., Lipworth, S., Peto, T., Thwaites, G., Walker, A., Walker, T., Wilson, D., Wyllie, D., and Yang, Y. 2018. Prediction of susceptibility to first-line tuberculosis drugs by DNA sequencing. New England Journal of Medicine 379:1403–1415.
Ansari, M. A. and Didelot, X. 2014. Inference of the properties of the recombination process from whole bacterial genomes. Genetics 196:253–265.
Arnold, B., Sohail, M., Wadsworth, C., Corander, J., Hanage, W. P., Sunyaev, S., and Grad, Y. H. 2020. Fine-scale haplotype structure reveals strong signatures of positive selection in a recombining bacterial pathogen. Molecular Biology and Evolution 37:417–428.
Chakravarti, A., Buetow, K. H., Antonarakis, S., Waber, P., Boehm, C., and Kazazian, H. 1984. Nonuniform recombination within the human beta-globin gene cluster. American Journal of Human Genetics 36:1239.
Coop, G. and Ralph, P. 2012. Patterns of neutral diversity under general models of selective sweeps. Genetics 192:205–224.
Cvijović, I., Good, B. H., and Desai, M. M. 2018. The effect of strong purifying selection on genetic diversity. Genetics 209:1235–1278.
Desai, M. M. and Fisher, D. S. 2007. Beneficial mutation selection balance and the effect of linkage on positive selection. Genetics 176:1759–1798.
Eberle, M. A., Rieder, M. J., Kruglyak, L., and Nickerson, D. A. 2006. Allele frequency matching between SNPs reveals an excess of linkage disequilibrium in genic regions of the human genome. PLoS Genetics 2.
Ewens, W. J. 2004. Mathematical Population Genetics. Springer-Verlag, New York, second edition.
Fisher, D. S. 2007. Evolutionary dynamics, pp. 395–446. In M. M. Jean-Philippe Bouchaud and J. Dalibard (eds.), Complex Systems, volume 85 of Les Houches. Elsevier.
Garcia, J. A. and Lohmueller, K. E. 2020. Negative linkage disequilibrium between amino acid changing variants reveals interference among deleterious mutations in the human genome. bioRxiv.
Gardiner, C. 1985. Handbook of Stochastic Methods. Springer, New York.
Garud, N. R., Good, B. H., Hallatschek, O., and Pollard, K. S. 2019. Evolutionary dynamics of bacteria in the gut microbiome within and across hosts. PLoS Biology 17:e3000102.
Garud, N. R., Messer, P. W., Buzbas, E. O., and Petrov, D. A. 2015. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genetics 11.
Good, B. H. 2016. in Rapidly Evolving Populations. PhD thesis, Harvard University, Cambridge MA.
Good, B. H. and Desai, M. M. 2013. Fluctuations in fitness distributions and the effects of weak linked selection on sequence evolution. Theor Pop Biol 85:86–102.
Harris, K. and Nielsen, R. 2013. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet 9:e1003521.
Hedrick, P. W. 1987. Gametic disequilibrium measures: proceed with caution. Genetics 117:331–341.
Hill, W. and Robertson, A. 1968. Linkage disequilibrium in finite populations. Theoretical and Applied Genetics 38:226–231.
Kamm, J. A., Terhorst, J., and Song, Y. S. 2017. Efficient computation of the joint sample frequency spectra for multiple populations. Journal of Computational and Graphical Statistics 26:182–194.
Kang, J. T. and Rosenberg, N. A. 2019. Mathematical properties of linkage disequilibrium statistics defined by normalization of the coefficient D = pAB – pApB. Human Heredity 84:127–143.
Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., Collins, R. L., Laricchia, K. M., Ganna, A., Birnbaum, D. P., et al. 2020. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581:434–443.
Kim, Y. and Nielsen, R. 2004. Linkage disequilibrium as a signature of selective sweeps.
Journal of Applied Probability 1:177–232.
Kimura, M. and Ohta, T. 1973. The age of a neutral mutant persisting in a finite population. Genetics 75:199–212.
Lawrie, D. S. and Petrov, D. A. 2014. Comparative population genomics: power and principles for the inference of functionality. Trends in Genetics 30:133–139.
Lewontin, R. 1964. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49:49.
Lewontin, R. 1988. On measures of gametic disequilibrium. Genetics 120:849–852.
Li, H. and Durbin, R. 2011. Inference of human population history from individual whole-genome sequences. Nature 475:493–496.
Lynch, M., Xu, S., Maruki, T., Jiang, X., Pfaffelhuber, P., and Haubold, B. 2014. Genome-wide linkage-disequilibrium profiles from single individuals. Genetics 198:269–281.
McVean, G. 2007. The structure of linkage disequilibrium around a selective sweep. Genetics 175:1395–1406.
McVean, G. A. 2002. A genealogical interpretation of linkage disequilibrium. Genetics 162:987–991.
McVean, G. A., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R., and Donnelly, P. 2004. The fine-scale structure of recombination rate variation in the human genome. Science 304:581–584.
Meshulam, L., Gauthier, J. L., Brody, C. D., Tank, D. W., and Bialek, W. 2019. Coarse graining, fixed points, and scaling in a large population of neurons. Phys. Rev. Lett. 123:178103.
Neher, R. A. and Hallatschek, O. 2013. Genealogies in rapidly adapting populations. Proc Nat Acad Sci 110:437–442.
Ohta, T. and Kimura, M. 1971. Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics 68:571.
Pasolli, E., Asnicar, F., Manara, S., Zolfo, M., Karcher, N., Armanini, F., Beghini, F., Manghi, P., Tett, A., Ghensi, P., et al. 2019. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176:649–662.
Petit III, R. A. and Read, T. D. 2018. Staphylococcus aureus viewed from the perspective of 40,000+ genomes. PeerJ 6:e5261.
Pfaffelhuber, P., Lehnert, A., and Stephan, W. 2008. Linkage disequilibrium under genetic hitchhiking in finite populations. Genetics 179:527–537.
Pokalyuk, C. 2012. The effect of recurrent mutation on the linkage disequilibrium under a selective sweep. Journal of Mathematical Biology 64:291–317.
Polanski, A. and Kimmel, M. 2003. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165:427–436.
Ragsdale, A. P. and Gravel, S. 2019. Models of archaic admixture and recent history from two-locus statistics. PLoS Genetics 15:e1008204.
Ragsdale, A. P. and Gravel, S. 2020. Unbiased estimation of linkage disequilibrium from unphased data. Molecular Biology and Evolution 37:923–932.
Ragsdale, A. P., Moreau, C., and Gravel, S. 2018. Genomic inference using diffusion models and the allele fre-
Genetics 167:1513–1524. quency spectrum. Current Opinion in Genetics & Devel- Kimura, M. 1964. Diffusion models in population genetics. opment 53:140 – 147. Genetics of Human Origins. bioRxiv preprint doi: https://doi.org/10.1101/2020.12.10.420042; this version posted December 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 22

Rogers, A. R. 2014. How population growth affects linkage disequilibrium. Genetics 197:1329–1341.
Rosen, M. J., Davison, M., Bhaya, D., and Fisher, D. S. 2015. Fine-scale diversity and extensive recombination in a quasisexual bacterial population occupying a broad niche. Science 348:1019–1023.
Rosen, M. J., Davison, M., Fisher, D. S., and Bhaya, D. 2018. Probing the ecological and evolutionary history of a thermophilic cyanobacterial population via statistical properties of its microdiversity. PloS one 13.
Sabeti, P. C., Reich, D. E., Higgins, J. M., Levine, H. Z., Richter, D. J., Schaffner, S. F., Gabriel, S. B., Platko, J. V., Patterson, N. J., McDonald, G. J., et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419:832–837.
Sandler, G., Wright, S. I., and Agrawal, A. F. 2020. Using patterns of signed linkage disequilibria to test for epistasis in flies and plants. bioRxiv.
Sawyer, S. A. and Hartl, D. L. 1992. Population genetics of polymorphism and divergence. Genetics 132:1161–1176.
Shu, Y. and McCauley, J. 2017. Gisaid: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill 22:30494.
Slatkin, M. 2008. Linkage disequilibrium – understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics 9:477–485.
Sohail, M., Vakhrusheva, O. A., Sul, J. H., Pulit, S. L., Francioli, L. C., van den Berg, L. H., Veldink, J. H., de Bakker, P. I., Bazykin, G. A., Kondrashov, A. S., et al. 2017. Negative selection in humans and fruit flies involves synergistic epistasis. Science 356:539–542.
Song, Y. S. and Song, J. S. 2007. Analytic computation of the expectation of the linkage disequilibrium coefficient r2. Theoretical population biology 71:49–60.
Stanley, H. E. 1999. Scaling, universality, and renormalization: Three pillars of modern critical phenomena. Rev. Mod. Phys. 71:S358–S366.
Stephan, W., Song, Y. S., and Langley, C. H. 2006. The hitchhiking effect on linkage disequilibrium between linked neutral loci. Genetics 172:2647–2663.
VanLiere, J. M. and Rosenberg, N. A. 2008. Mathematical properties of the r2 measure of linkage disequilibrium. Theoretical population biology 74:130–137.
Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., and Yang, J. 2017. 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics 101:5–22.
Weissman, D. B., Feldman, M. W., and Fisher, D. S. 2010. The rate of fitness-valley crossing in sexual populations. Genetics 186:1389–1410.
Weissman, D. W., Desai, M. M., Fisher, D. S., and Feldman, M. W. 2009. The rate at which asexual populations cross fitness valleys. Theoretical Population Biology 75:286–300.

Appendix A: Forward-time simulations

We used forward-time simulations to compare our analytical predictions to the full two-locus model in Eq. (3). We used a discrete-generation Wright-Fisher sampling scheme, in which the number of individuals with each haplotype at generation t + 1 is drawn from a Poisson distribution with mean equal to the expected number of individuals predicted by Eq. (3), given the haplotype frequencies at time t. To enhance computational efficiency for calculating LD statistics, we only simulated timepoints in which both mutations were segregating in the population at the same time. We implemented this scheme by first drawing the frequency f of an initial (single-mutant) haplotype from the single-locus site frequency spectrum (Sawyer and Hartl, 1992),

$$ p(f|s) \propto \frac{e^{2Ns(1-f)} - 1}{f(1-f)} , \tag{A1} $$

and then introducing a second mutation in a single individual from either the mutant or wildtype background with probability $f$ or $1-f$, respectively. The resulting population was then evolved until one of the mutations went extinct, and the process was restarted with a new pair of mutations. The frequencies of the four haplotypes were recorded every $\Delta t$ generations, and were used to generate the figures in the main text. The simulations in this work were performed with a population size of $N = 10^5$ and a sampling interval of $\Delta t = 100$.
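The scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Eq. (3) is not reproduced in this appendix, so the deterministic update `expected_frequencies` below is a generic stand-in (multiplicative fitness costs `sA`, `sB`, `sA + sB + eps`, followed by recombination at rate `r`), and all function and parameter names are hypothetical. The run also stops if a mutation fixes, which the text does not discuss.

```python
import numpy as np

# Haplotype order used throughout this sketch: (ab, Ab, aB, AB).

def sample_initial_frequency(N, s, rng):
    """Draw f on the grid k/N (k = 1..N-1) with weights given by Eq. (A1).

    s >= 0 is a fitness cost; expm1 overflows once 2*N*s exceeds roughly 700.
    """
    f = np.arange(1, N) / N
    if s == 0:
        weights = 1.0 / f  # neutral limit of Eq. (A1), up to a constant factor
    else:
        weights = np.expm1(2 * N * s * (1 - f)) / (f * (1 - f))
    return rng.choice(f, p=weights / weights.sum())

def expected_frequencies(f, sA, sB, eps, r):
    """Assumed stand-in for Eq. (3): selection, then recombination."""
    w = np.array([1.0, 1.0 - sA, 1.0 - sB, 1.0 - sA - sB - eps])
    g = w * f
    g /= g.sum()
    D = g[3] * g[0] - g[1] * g[2]          # D = f_AB f_ab - f_Ab f_aB
    g = g + r * D * np.array([-1.0, 1.0, 1.0, -1.0])
    return np.clip(g, 0.0, None)

def simulate_pair(N, sA, sB, eps, r, dt, rng, max_generations=1_000_000):
    """Simulate one pair of mutations; return haplotype frequencies every dt generations."""
    fA = sample_initial_frequency(N, sA, rng)
    nA = int(round(fA * N))
    counts = np.array([N - nA, nA, 0, 0])
    # The second mutation lands on the mutant background with probability f.
    if rng.random() < fA:
        counts += np.array([0, -1, 0, 1])
    else:
        counts += np.array([-1, 0, 1, 0])
    recorded = []
    for t in range(max_generations):
        total = counts.sum()
        if total == 0:
            break
        nA_now, nB_now = counts[1] + counts[3], counts[2] + counts[3]
        if nA_now in (0, total) or nB_now in (0, total):
            break  # one of the mutations was lost (or fixed)
        f = counts / total
        if t % dt == 0:
            recorded.append(f)
        # Poisson resampling around the deterministic expectation.
        counts = rng.poisson(N * expected_frequencies(f, sA, sB, eps, r))
    return np.array(recorded).reshape(-1, 4)
```

Calling `simulate_pair` repeatedly with a shared `numpy.random.Generator` and concatenating the returned arrays gives a collection of haplotype-frequency snapshots from which LD statistics can be tabulated.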

Appendix B: Perturbative solution of the generating function for small f0

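To obtain the perturbative solution of Eq. (48) listed in Eq. (56), it will be helpful to define a zeroth-order differential operator,

$$ \mathcal{L} = \frac{\partial}{\partial \tau} + \left(\gamma_A x + x^2\right)\frac{\partial}{\partial x} + \left(\gamma_B y + y^2\right)\frac{\partial}{\partial y} + \left[\gamma_{AB} z + z^2 - \rho(x+y)\right]\frac{\partial}{\partial z} \tag{B1} $$

so that Eq. (48) can be written as

$$ \mathcal{L} H = -\theta (x+y) H + \theta f_0 z \left( \frac{\partial H}{\partial x} + \frac{\partial H}{\partial y} \right) - \rho f_0 z \frac{\partial^2 H}{\partial x\, \partial y} \tag{B2} $$

with all of the $\theta$ and $f_0$ dependence confined to the right-hand side. Substituting the series expansion in Eq. (50) into this equation and grouping like terms, we find that the zeroth-order solution is $H \approx 1$, as expected. The first-order correction, which enters at $O(\theta)$, satisfies the equation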

$$ \mathcal{L} H_{1,0} = -(x+y) , \tag{B3} $$

which can be solved using the method of characteristics. To see this, we will define a function

$$ \Phi_{1,0}(\tau_R) = H_{1,0}(x(\tau_R), y(\tau_R), \tau - \tau_R) , \tag{B4} $$

where the characteristic curves $x(\tau_R)$ and $y(\tau_R)$ are given by

$$ \partial_{\tau_R} x = -\gamma_A x - x^2 , \quad x(0) = x \quad \rightarrow \quad x(\tau_R) = \frac{x e^{-\gamma_A \tau_R}}{1 + \frac{x}{\gamma_A}\left(1 - e^{-\gamma_A \tau_R}\right)} , \tag{B5} $$

$$ \partial_{\tau_R} y = -\gamma_B y - y^2 , \quad y(0) = y \quad \rightarrow \quad y(\tau_R) = \frac{y e^{-\gamma_B \tau_R}}{1 + \frac{y}{\gamma_B}\left(1 - e^{-\gamma_B \tau_R}\right)} . \tag{B6} $$

Then $\Phi_{1,0}(\tau_R)$ satisfies a related differential equation,

$$ \partial_{\tau_R} \Phi_{1,0} = x(\tau_R) + y(\tau_R) , \quad \Phi_{1,0}(0) = H_{1,0}(x, y, \tau) , \quad \Phi_{1,0}(\tau) = 0 , \tag{B8} $$

whose solution is given by

$$ \Phi_{1,0}(\tau_R) - \Phi_{1,0}(0) = \int_0^{\tau_R} d\tau_R' \left[ x(\tau_R') + y(\tau_R') \right] \tag{B9} $$

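The definition of $\Phi_{1,0}(\tau_R)$ in Eq. (B4) then implies that

$$ H_{1,0}(x, y, \tau) = -\int_0^{\tau} d\tau_R \left[ x(\tau_R) + y(\tau_R) \right] = \underbrace{-\log\left[1 + \frac{x}{\gamma_A}\left(1 - e^{-\gamma_A \tau}\right)\right]}_{H_A(x,\tau)} + \underbrace{-\log\left[1 + \frac{y}{\gamma_B}\left(1 - e^{-\gamma_B \tau}\right)\right]}_{H_B(y,\tau)} \tag{B10} $$

which yields Eq. (51) in the main text. This same approach can be extended to higher orders of the series expansion in Eq. (50). At the next order ($\theta^2$), we have

$$ \mathcal{L} H_{2,0} = -(x+y) H_{1,0} , \tag{B11} $$

which can again be solved by the method of characteristics. We will define an analogous function,

$$ \Phi_{2,0}(\tau_R) = H_{2,0}(x(\tau_R), y(\tau_R), \tau - \tau_R) , \tag{B12} $$

where the characteristic curves $x(\tau_R)$ and $y(\tau_R)$ are the same as above. Then $\Phi_{2,0}(\tau_R)$ satisfies

$$ \partial_{\tau_R} \Phi_{2,0} = \left[x(\tau_R) + y(\tau_R)\right] \Phi_{1,0}(\tau_R) , \quad \Phi_{2,0}(0) = H_{2,0}(x, y, \tau) , \quad \Phi_{2,0}(\tau) = 0 , \tag{B13} $$

and hence

$$ H_{2,0}(x, y, \tau) = -\int_0^{\tau} d\tau_R \, \Phi_{1,0}(\tau_R)\, \partial_{\tau_R} \Phi_{1,0}(\tau_R) = \frac{1}{2}\left[ H_A(x, \tau) + H_B(y, \tau) \right]^2 . \tag{B14} $$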

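We note that this $O(\theta^2)$ term is independent of $z$, which means that it cannot contribute to averages involving $f_{AB}$. The lowest-order term in $f_0$ is $O(\theta^2 f_0)$, since the $O(\theta f_0)$ term vanishes. At this order, we have

$$ \mathcal{L} H_{2,1} = z \left[ \frac{\partial H_{1,0}}{\partial x} + \frac{\partial H_{1,0}}{\partial y} - \rho \frac{\partial^2 H_{2,0}}{\partial x\, \partial y} \right] = z \left[ \frac{\partial H_A}{\partial x} + \frac{\partial H_B}{\partial y} - \rho \frac{\partial H_A}{\partial x} \frac{\partial H_B}{\partial y} \right] \tag{B15} $$

where $H_A$ and $H_B$ are defined as above. The solution to this equation proceeds in a similar fashion. Let

$$ \Phi_{2,1}(\tau_R) = H_{2,1}(x(\tau_R), y(\tau_R), z(\tau_R), \tau_f - \tau_R) \tag{B16} $$

where the characteristic curve $z(\tau_R)$ is defined by

$$ \frac{\partial z(\tau_R)}{\partial \tau_R} = -\gamma_{AB}\, z(\tau_R) - z(\tau_R)^2 + \rho \left[ x(\tau_R) + y(\tau_R) \right] , \quad z(0) = z \tag{B17} $$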

Then $\Phi_{2,1}(\tau_R)$ satisfies

$$ \partial_{\tau_R} \Phi_{2,1} = -z(\tau_R) \left[ \Phi_x(\tau_R) + \Phi_y(\tau_R) - \rho\, \Phi_x(\tau_R) \Phi_y(\tau_R) \right] , \quad \Phi_{2,1}(0) = H_{2,1}(x, y, z, \tau) , \quad \Phi_{2,1}(\tau) = 0 \tag{B18} $$

where we have defined a pair of functions,

$$ \Phi_x(\tau_R) \equiv \left. \frac{\partial H_A}{\partial x} \right|_{x(\tau_R),\, \tau_f - \tau_R} = -\frac{\left(1 - e^{-\gamma_A \tau_f + \gamma_A \tau_R}\right)\left[\gamma_A + x\left(1 - e^{-\gamma_A \tau_R}\right)\right]}{\gamma_A \left[\gamma_A + x\left(1 - e^{-\gamma_A \tau_f}\right)\right]} \tag{B19} $$

$$ \Phi_y(\tau_R) \equiv \left. \frac{\partial H_B}{\partial y} \right|_{y(\tau_R),\, \tau_f - \tau_R} = -\frac{\left(1 - e^{-\gamma_B \tau_f + \gamma_B \tau_R}\right)\left[\gamma_B + y\left(1 - e^{-\gamma_B \tau_R}\right)\right]}{\gamma_B \left[\gamma_B + y\left(1 - e^{-\gamma_B \tau_f}\right)\right]} \tag{B20} $$

The solution for $H_{2,1}$ then follows as

$$ H_{2,1}(x, y, z, \tau) = \int_0^{\tau} d\tau_R \, z(\tau_R) \left[ \Phi_x(\tau_R) + \Phi_y(\tau_R) - \rho\, \Phi_x(\tau_R) \Phi_y(\tau_R) \right] \tag{B21} $$

Thus, if we can find a solution for the characteristic curve in Eq. (B17), then the leading-order contribution to the generating function can be obtained by a direct integration, yielding Eq. (54) in the main text. To minimize confusion, we use the notation $\psi(x, y, z, \tau_R)$ in place of $z(\tau_R)$ throughout the rest of this work, in order to emphasize the implicit dependence on the initial conditions $x(0) = x$, $y(0) = y$, and $z(0) = z$.
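The first-order pieces of this construction are easy to check numerically. The sketch below (plain Python, with illustrative parameter values that are not taken from the paper) verifies two facts: the closed-form characteristic in Eq. (B5) solves $\partial_{\tau_R} x = -\gamma_A x - x^2$, which is the sign convention implied by the operator in Eq. (B1), and minus its time integral reproduces $H_A(x, \tau)$ from Eq. (B10). The helper names are ours.

```python
import math

def x_char(x0, gamma, t):
    """Closed-form characteristic curve x(tau_R) from Eq. (B5)."""
    e = math.exp(-gamma * t)
    return x0 * e / (1.0 + (x0 / gamma) * (1.0 - e))

def H_A(x0, gamma, tau):
    """First-order single-site contribution H_A(x, tau) from Eq. (B10)."""
    return -math.log(1.0 + (x0 / gamma) * (1.0 - math.exp(-gamma * tau)))

def rk4_final(rhs, y0, t_end, steps):
    """Integrate dy/dt = rhs(y) from 0 to t_end with classical RK4."""
    y, h = y0, t_end / steps
    for _ in range(steps):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * h * k1)
        k3 = rhs(y + 0.5 * h * k2)
        k4 = rhs(y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return y

def simpson(func, a, b, n):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

x0, gamma, tau = 0.7, 1.3, 2.0  # arbitrary test values
ode_value = rk4_final(lambda x: -gamma * x - x * x, x0, tau, 2000)
integral = simpson(lambda t: x_char(x0, gamma, t), 0.0, tau, 2000)
print(ode_value - x_char(x0, gamma, tau))   # should be close to zero
print(-integral - H_A(x0, gamma, tau))      # should be close to zero
```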

Appendix C: Solution for nonrecombining loci

In the case of non-recombining loci ($\rho = 0$), the characteristic curve in Eq. (54c) reduces to a simple logistic equation, whose solution is given by

$$ \psi(z, \tau') = \frac{z e^{-\gamma_{AB} \tau'}}{1 + \frac{z}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right)} . \tag{C1} $$

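This solution will also be valid for finite recombination rates, provided that $\rho \ll \gamma_A, \gamma_B, \gamma_\epsilon$. Substituting this solution into Eq. (56), we find that the equilibrium generating function can be expressed as a definite integral,

$$
\begin{aligned}
H(x, y, z) \approx{}& 1 - \theta \log\left(1 + \frac{x}{\gamma_A}\right) - \theta \log\left(1 + \frac{y}{\gamma_B}\right) + \frac{\theta^2}{2} \log^2\left(1 + \frac{x}{\gamma_A}\right) + \frac{\theta^2}{2} \log^2\left(1 + \frac{y}{\gamma_B}\right) \\
& + \theta^2 \log\left(1 + \frac{x}{\gamma_A}\right) \log\left(1 + \frac{y}{\gamma_B}\right) - \theta^2 f_0 \int_0^\infty \frac{z e^{-\gamma_{AB} \tau'}}{1 + \frac{z}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right)} \\
& \times \left[ \frac{\gamma_A + x\left(1 - e^{-\gamma_A \tau'}\right)}{\gamma_A (\gamma_A + x)} + \frac{\gamma_B + y\left(1 - e^{-\gamma_B \tau'}\right)}{\gamma_B (\gamma_B + y)} \right] d\tau' .
\end{aligned} \tag{C2}
$$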

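In this case, the integral can be evaluated using special functions,

$$
\begin{aligned}
& \int_0^\infty \frac{z e^{-\gamma_{AB} \tau'}}{1 + \frac{z}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right)} \left[ \frac{\gamma_A + x\left(1 - e^{-\gamma_A \tau'}\right)}{\gamma_A (\gamma_A + x)} + \frac{\gamma_B + y\left(1 - e^{-\gamma_B \tau'}\right)}{\gamma_B (\gamma_B + y)} \right] d\tau' \\
& \quad = \left( \frac{1}{\gamma_A} + \frac{1}{\gamma_B} \right) \log\left(1 + \frac{z}{\gamma_{AB}}\right) - \frac{x z \; {}_2F_1\left(1, 1, 2 + \frac{\gamma_A}{\gamma_{AB}}, -\frac{z}{\gamma_{AB}}\right)}{\gamma_A (x + \gamma_A)(\gamma_{AB} + \gamma_A)} - \frac{y z \; {}_2F_1\left(1, 1, 2 + \frac{\gamma_B}{\gamma_{AB}}, -\frac{z}{\gamma_{AB}}\right)}{\gamma_B (y + \gamma_B)(\gamma_{AB} + \gamma_B)} ,
\end{aligned} \tag{C3}
$$

where ${}_2F_1(a, b, c, u)$ is the hypergeometric function. This integral simplifies even further in the limit that $\gamma_A = \gamma_B = 0$, where we find that

$$
\begin{aligned}
& \int_0^\infty \frac{z e^{-\gamma_{AB} \tau'}}{1 + \frac{z}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right)} \left[ \frac{\gamma_A + x\left(1 - e^{-\gamma_A \tau'}\right)}{\gamma_A (\gamma_A + x)} + \frac{\gamma_B + y\left(1 - e^{-\gamma_B \tau'}\right)}{\gamma_B (\gamma_B + y)} \right] d\tau' \\
& \quad \approx \frac{2}{\gamma_\epsilon} \mathrm{Li}_2\left( \frac{z}{\gamma_\epsilon + z} \right) + \left( \frac{1}{x} + \frac{1}{y} \right) \log\left(1 + \frac{z}{\gamma_\epsilon}\right) ,
\end{aligned} \tag{C4}
$$

where $\mathrm{Li}_n(u)$ is the polylogarithm function. For $\gamma_\epsilon \ll 1$, the resulting distribution of $f_{AB}$ is approximately uniform up to a cutoff around $\sim f_0$, with $f_{Ab} \approx f_{aB} \approx 0$. This is much broader than the standard single-locus prediction, and reflects the dominant contribution of ancient nested mutations that originated long before the time of sampling.

We can also use this same solution to evaluate various moments $\Sigma_d^k(\gamma_A, \gamma_B, \gamma_\epsilon)$ directly. To do so, it will be helpful to make use of the following identities: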

$$ \frac{\partial \psi(z, \tau')}{\partial z} = \frac{e^{-\gamma_{AB} \tau'}}{\left[ 1 + \frac{z}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right) \right]^2} \tag{C5} $$

and

$$ \frac{\partial^k \psi(z, \tau')}{\partial z^k} = \frac{(-1)^{k+1} k!}{\gamma_{AB}^{k-1}} \frac{e^{-\gamma_{AB} \tau'} \left(1 - e^{-\gamma_{AB} \tau'}\right)^{k-1}}{\left[ 1 + \frac{z}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right) \right]^{k+1}} \tag{C6} $$

for $k \geq 2$, such that

$$
\psi_k(\tau') =
\begin{cases}
\dfrac{e^{-\gamma_{AB} \tau'}}{\left[ 1 + \frac{2}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right) \right]^2} & \text{if } k = 1 , \\[3ex]
\dfrac{k!}{\gamma_{AB}^{k-1}} \dfrac{e^{-\gamma_{AB} \tau'} \left(1 - e^{-\gamma_{AB} \tau'}\right)^{k-1}}{\left[ 1 + \frac{2}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right) \right]^{k+1}} & \text{if } k \geq 2 .
\end{cases} \tag{C7}
$$
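The polylogarithm form in Eq. (C4) can be sanity-checked by direct quadrature. In the limit $\gamma_A = \gamma_B = 0$ the bracketed factor in the integrand tends to $1/x + 1/y + 2\tau'$ (this follows from expanding the exponentials to first order, and is our own intermediate step). The sketch below compares the resulting integral with the right-hand side of Eq. (C4), writing `gamma` for the remaining scaled cost; parameter values and function names are illustrative.

```python
import math

def psi(z, gamma, t):
    """Logistic characteristic curve from Eq. (C1)."""
    e = math.exp(-gamma * t)
    return z * e / (1.0 + (z / gamma) * (1.0 - e))

def li2(b, terms=400):
    """Dilogarithm Li_2(b) from its power series (adequate for 0 <= b < 1)."""
    return sum(b ** n / n ** 2 for n in range(1, terms + 1))

def simpson(func, a, b, n):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

def c4_integral(x, y, z, gamma, t_max=80.0, n=80000):
    """Left-hand side of Eq. (C4), truncated at t_max where the integrand is negligible."""
    integrand = lambda t: psi(z, gamma, t) * (1.0 / x + 1.0 / y + 2.0 * t)
    return simpson(integrand, 0.0, t_max, n)

def c4_closed_form(x, y, z, gamma):
    """Right-hand side of Eq. (C4)."""
    return (2.0 / gamma) * li2(z / (gamma + z)) + (1.0 / x + 1.0 / y) * math.log(1.0 + z / gamma)

print(c4_integral(1.0, 1.0, 2.0, 1.0), c4_closed_form(1.0, 1.0, 2.0, 1.0))
```

The two printed numbers should agree to many digits; the series for `li2` converges slowly when `z` is much larger than `gamma`, so more terms would be needed in that regime.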

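Substituting these expressions into Eq. (57), we obtain

$$ \Sigma_d^1 = -1 + \int_0^\infty \frac{e^{-\gamma_{AB} \tau'}}{\left[ 1 + \frac{2}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right) \right]^2} \left[ \frac{(1 + \gamma_B)\left(\gamma_A + 1 - e^{-\gamma_A \tau'}\right)}{\gamma_A} + \frac{(1 + \gamma_A)\left(\gamma_B + 1 - e^{-\gamma_B \tau'}\right)}{\gamma_B} \right] d\tau' \tag{C8} $$

$$
\begin{aligned}
= {}& \frac{\gamma_A + \gamma_B - \gamma_{AB}}{2 + \gamma_{AB}} + \frac{1 + \gamma_B}{(2 + \gamma_{AB})(\gamma_{AB} + \gamma_A)} \; {}_2F_1\left(1, 1, 2 + \frac{\gamma_A}{\gamma_{AB}}, -\frac{2}{\gamma_{AB}}\right) \\
& + \frac{1 + \gamma_A}{(2 + \gamma_{AB})(\gamma_{AB} + \gamma_B)} \; {}_2F_1\left(1, 1, 2 + \frac{\gamma_B}{\gamma_{AB}}, -\frac{2}{\gamma_{AB}}\right)
\end{aligned} \tag{C9}
$$

and

$$ \Sigma_d^2 = \frac{2}{\gamma_{AB}} \int_0^\infty \frac{e^{-\gamma_{AB} \tau'} \left(1 - e^{-\gamma_{AB} \tau'}\right)}{\left[ 1 + \frac{2}{\gamma_{AB}}\left(1 - e^{-\gamma_{AB} \tau'}\right) \right]^3} \left[ \frac{(1 + \gamma_B)\left(\gamma_A + 1 - e^{-\gamma_A \tau'}\right)}{\gamma_A} + \frac{(1 + \gamma_A)\left(\gamma_B + 1 - e^{-\gamma_B \tau'}\right)}{\gamma_B} \right] d\tau' \tag{C10} $$

$$
\begin{aligned}
= {}& \frac{(1 + \gamma_A)(1 + \gamma_B)}{(2 + \gamma_{AB})^3} \left[ \frac{\gamma_A + 3\gamma_{AB} + \gamma_{AB}^2 + \gamma_{AB}\gamma_A}{(\gamma_A + \gamma_{AB})(1 + \gamma_A)} + \frac{\gamma_B + 3\gamma_{AB} + \gamma_{AB}^2 + \gamma_{AB}\gamma_B}{(\gamma_B + \gamma_{AB})(1 + \gamma_B)} \right] \\
& + \frac{(1 + \gamma_B)(4 + \gamma_A + \gamma_{AB}) \; {}_2F_1\left(1, 1, 3 + \frac{\gamma_A}{\gamma_{AB}}, -\frac{2}{\gamma_{AB}}\right)}{(2 + \gamma_{AB})^3 (\gamma_A + 2\gamma_{AB})} + \frac{(1 + \gamma_A)(4 + \gamma_B + \gamma_{AB}) \; {}_2F_1\left(1, 1, 3 + \frac{\gamma_B}{\gamma_{AB}}, -\frac{2}{\gamma_{AB}}\right)}{(2 + \gamma_{AB})^3 (\gamma_B + 2\gamma_{AB})}
\end{aligned} \tag{C11}
$$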
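For readers who want numbers rather than special functions, the integral in Eq. (C8) can also be evaluated by simple quadrature. The sketch below is illustrative only: function names are ours, and `gAB` is passed as an independent argument. As a check, in the special case $\gamma_A = \gamma_B = 0$ the integral can be done by hand (integrating by parts), giving $\Sigma_d^1 = [\log(1 + 2/\gamma_{AB}) - \gamma_{AB}]/(2 + \gamma_{AB})$; this elementary limit is our own check rather than a formula quoted from the text.

```python
import math

def simpson(func, a, b, n):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

def single_site_factor(g, t):
    """(g + 1 - exp(-g t)) / g, with the g -> 0 limit 1 + t."""
    if g == 0:
        return 1.0 + t
    return (g - math.expm1(-g * t)) / g

def sigma_d1(gA, gB, gAB, n=40000):
    """Numerical evaluation of Eq. (C8) for nonrecombining loci."""
    t_max = 60.0 / gAB  # the integrand has decayed like exp(-60) by this point

    def integrand(t):
        e = math.exp(-gAB * t)
        weight = e / (1.0 + (2.0 / gAB) * (1.0 - e)) ** 2
        bracket = ((1.0 + gB) * single_site_factor(gA, t)
                   + (1.0 + gA) * single_site_factor(gB, t))
        return weight * bracket

    return -1.0 + simpson(integrand, 0.0, t_max, n)

def sigma_d1_neutral_sites(gAB):
    """Closed form for gA = gB = 0 (our own elementary check)."""
    return (math.log(1.0 + 2.0 / gAB) - gAB) / (2.0 + gAB)

print(sigma_d1(0.0, 0.0, 1.0), sigma_d1_neutral_sites(1.0))
```

The same pattern (swap the weight for the $k = 2$ kernel of Eq. (C7)) gives a numerical version of Eq. (C10).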

Appendix D: Solution for neutral loci

In the limit where $\gamma_A$, $\gamma_B$, and $\gamma_\epsilon$ are all small compared to both 1 and $\rho$, we can obtain exact solutions for $F_k(\cdot)$ using special functions. Recall that since each of these scaled variables contains a factor of $f_0$, this neutral regime can apply even when the nominal fitness costs are much larger than $1/N$. When these conditions hold, the differential equation for the characteristic curve reduces to

$$ \frac{\partial \psi(x, y, z, \tau')}{\partial \tau'} = -\rho \psi - \psi^2 + \frac{\rho x}{1 + x \tau'} + \frac{\rho y}{1 + y \tau'} , \quad \psi(x, y, z, 0) = z . \tag{D1} $$

This equation is difficult to solve due to the presence of the time-dependent terms on the right-hand side, which vary over two different timescales $\sim 1/x$ and $\sim 1/y$. Note, however, that moments like $\Sigma_d^k(\rho)$ only depend on the value of this function in the special case that $x = y = 1$. Thus, for the purposes of computing $\Sigma_d^k(\rho)$, it will be sufficient to focus on the simpler equation

$$ \frac{\partial \psi(z, \tau')}{\partial \tau'} = -\rho \psi - \psi^2 + \frac{2\rho}{1 + \tau'} , \quad \psi(z, 0) = z , \tag{D2} $$

which can be non-dimensionalized using the transformation $u = (\tau' + 1)\rho$ and $\Psi = \psi/\rho$, yielding

$$ \frac{\partial \Psi}{\partial u} = -\Psi - \Psi^2 + \frac{2}{u} , \quad \Psi(\rho) = \frac{z}{\rho} \tag{D3} $$

The general solution to this equation is of the form

$$ \Psi(u) = \frac{\mathrm{Ei}(-u) + (1 + u)^{-1} e^{-u} + c(z)}{\frac{u(2+u)}{2(1+u)} \mathrm{Ei}(-u) + \frac{1}{2} e^{-u} + \frac{u(2+u)}{2(1+u)} c(z)} \tag{D4} $$

or switching back to $\zeta = u - \rho = \rho \tau'$,

$$ \Psi(\zeta; z) = \frac{2(1 + \rho + \zeta)\left[A(\zeta) + c(z)\right] + 2\rho e^{-\zeta}}{(\rho + \zeta)(2 + \rho + \zeta)\left[A(\zeta) + c(z)\right] + (1 + \rho + \zeta)\rho e^{-\zeta}} \tag{D5} $$

where we have defined

$$ A(\zeta) \equiv \rho e^{\rho} \left[ \mathrm{Ei}(-\rho - \zeta) - \mathrm{Ei}(-\rho) \right] \tag{D6} $$

and where the constant $c(z)$ is chosen to satisfy the initial condition $\Psi = z/\rho$ when $\zeta = 0$:

$$ \frac{z}{\rho} = \frac{2(1 + \rho) c(z) + 2\rho}{\rho(2 + \rho) c(z) + (1 + \rho)\rho} \tag{D7} $$

For our purposes, it will be useful to solve for $z = 2 + \epsilon$. Solving for $c(\epsilon)$, we obtain

$$ c(\epsilon) = -\frac{1 + \frac{1+\rho}{2} \epsilon}{1 + \frac{2+\rho}{2} \epsilon} \tag{D8} $$

which has derivatives

$$ \frac{\partial^k c(\epsilon)}{\partial \epsilon^k} = (-1)^{k-1} k! \left( \frac{2 + \rho}{2} \right)^{k-1} \frac{1}{2} \frac{1}{\left[ 1 + \frac{2+\rho}{2} \epsilon \right]^{k+1}} \; \xrightarrow{\;\epsilon \to 0\;} \; (-1)^{k-1} k! \, (2 + \rho)^{k-1} 2^{-k} \tag{D9} $$

It is then straightforward to compute derivatives of $\Psi(\zeta; \epsilon)$ with respect to $\epsilon$. For the first derivative, we find,

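$$
\begin{aligned}
\frac{\partial \Psi(\zeta; \epsilon)}{\partial \epsilon} = {}& \frac{2(1 + \rho + \zeta)}{(\rho + \zeta)(2 + \rho + \zeta)\left[A(\zeta) + c(\epsilon)\right] + (1 + \rho + \zeta)\rho e^{-\zeta}} \frac{\partial c}{\partial \epsilon} \\
& - \frac{\left\{ 2(1 + \rho + \zeta)\left[A(\zeta) + c(\epsilon)\right] + 2\rho e^{-\zeta} \right\} (\rho + \zeta)(2 + \rho + \zeta)}{\left\{ (\rho + \zeta)(2 + \rho + \zeta)\left[A(\zeta) + c(\epsilon)\right] + (1 + \rho + \zeta)\rho e^{-\zeta} \right\}^2} \frac{\partial c}{\partial \epsilon}
\end{aligned} \tag{D10}
$$

$$ = \frac{2\rho e^{-\zeta}}{\left\{ (\rho + \zeta)(2 + \rho + \zeta)\left[A(\zeta) + c(\epsilon)\right] + (1 + \rho + \zeta)\rho e^{-\zeta} \right\}^2} \frac{\partial c}{\partial \epsilon} \tag{D11} $$

Higher derivatives are facilitated by writing this as

$$ \frac{\partial \Psi(\zeta; \epsilon)}{\partial \epsilon} = \frac{N(\zeta)}{\left[ D_1(\zeta) c(\epsilon) + D_2(\zeta) \right]^2} \frac{\partial c}{\partial \epsilon} \tag{D12} $$

where we have defined

$$ N(\zeta) = 2\rho e^{-\zeta} \tag{D13} $$

$$ D_1(\zeta) = (\rho + \zeta)(2 + \rho + \zeta) \tag{D14} $$

$$ D_2(\zeta) = (\rho + \zeta)(2 + \rho + \zeta) \rho e^{\rho} \left[ \mathrm{Ei}(-\rho - \zeta) - \mathrm{Ei}(-\rho) \right] + (1 + \rho + \zeta) \rho e^{-\zeta} \tag{D15} $$

The second, third, and fourth derivatives are therefore given by

$$ \frac{\partial^2 \Psi}{\partial \epsilon^2} = \frac{N}{(D_1 c + D_2)^2} \left[ \frac{\partial^2 c}{\partial \epsilon^2} - \frac{2 D_1}{D_1 c + D_2} \left( \frac{\partial c}{\partial \epsilon} \right)^2 \right] \tag{D16} $$

$$ \frac{\partial^3 \Psi}{\partial \epsilon^3} = \frac{N}{(D_1 c + D_2)^2} \left[ \frac{\partial^3 c}{\partial \epsilon^3} - \frac{6 D_1}{D_1 c + D_2} \left( \frac{\partial^2 c}{\partial \epsilon^2} \right) \left( \frac{\partial c}{\partial \epsilon} \right) + \frac{6 D_1^2}{(D_1 c + D_2)^2} \left( \frac{\partial c}{\partial \epsilon} \right)^3 \right] \tag{D17} $$

$$
\begin{aligned}
\frac{\partial^4 \Psi}{\partial \epsilon^4} = \frac{N}{(D_1 c + D_2)^2} \Bigg[ & \frac{\partial^4 c}{\partial \epsilon^4} - \frac{D_1}{D_1 c + D_2} \left[ 8 \left( \frac{\partial^3 c}{\partial \epsilon^3} \right) \left( \frac{\partial c}{\partial \epsilon} \right) + 6 \left( \frac{\partial^2 c}{\partial \epsilon^2} \right)^2 \right] \\
& + \frac{36 D_1^2}{(D_1 c + D_2)^2} \left( \frac{\partial^2 c}{\partial \epsilon^2} \right) \left( \frac{\partial c}{\partial \epsilon} \right)^2 - \frac{24 D_1^3}{(D_1 c + D_2)^3} \left( \frac{\partial c}{\partial \epsilon} \right)^4 \Bigg]
\end{aligned} \tag{D18–D19}
$$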

Evaluating at $\epsilon = 0$, we have

$$ \left. \frac{\partial \Psi}{\partial \epsilon} \right|_{\epsilon = 0} = \frac{N}{2\left[ D_1 c(0) + D_2 \right]^2} \tag{D20} $$

$$ \left. \frac{\partial^2 \Psi}{\partial \epsilon^2} \right|_{\epsilon = 0} = \frac{-N}{2\left[ D_1 c(0) + D_2 \right]^2} \left[ (2 + \rho) + \frac{D_1}{D_1 c(0) + D_2} \right] \tag{D21} $$

$$ \left. \frac{\partial^4 \Psi}{\partial \epsilon^4} \right|_{\epsilon = 0} = \frac{-3N}{2\left[ D_1 c(0) + D_2 \right]^2} \left[ (2 + \rho)^3 + \frac{3 D_1 (2 + \rho)^2}{D_1 c(0) + D_2} + \frac{3 D_1^2 (2 + \rho)}{\left[ D_1 c(0) + D_2 \right]^2} + \frac{D_1^3}{\left[ D_1 c(0) + D_2 \right]^3} \right] \tag{D22} $$
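The derivative formulas in Eqs. (D20)–(D22) depend on $\epsilon$ only through $c(\epsilon)$, so they can be checked without evaluating any exponential integrals. The sketch below is a consistency check under stated assumptions: it treats $e^{-\zeta}$ and $A(\zeta)$ as arbitrary rational placeholders (`E` and `A`, not their true values), uses exact `Fraction` arithmetic so that finite differences carry no round-off error, and compares the results with the closed forms. All names and numerical values are ours.

```python
from fractions import Fraction as F

def c_of_eps(eps, rho):
    """Integration constant c(epsilon) from Eq. (D8)."""
    return -(1 + (1 + rho) / 2 * eps) / (1 + (2 + rho) / 2 * eps)

def Psi(eps, rho, u, E, A):
    """Eq. (D5) with u = rho + zeta; E stands in for exp(-zeta), A for A(zeta)."""
    c = c_of_eps(eps, rho)
    numerator = 2 * (1 + u) * (A + c) + 2 * rho * E
    denominator = u * (2 + u) * (A + c) + (1 + u) * rho * E
    return numerator / denominator

# Arbitrary rational placeholders (assumptions made only for this check).
rho, u, E, A = F(3, 2), F(5, 2), F(1, 3), F(1, 5)
h = F(1, 1000)
f = lambda eps: Psi(eps, rho, u, E, A)

# Central finite differences in exact arithmetic.
d1 = (f(h) - f(-h)) / (2 * h)
d2 = (f(h) - 2 * f(0) + f(-h)) / h ** 2
d4 = (f(2 * h) - 4 * f(h) + 6 * f(0) - 4 * f(-h) + f(-2 * h)) / h ** 4

# Closed forms from Eqs. (D20)-(D22), using c(0) = -1.
N = 2 * rho * E
D1 = u * (2 + u)
D2 = D1 * A + (1 + u) * rho * E
D = -D1 + D2
a = 2 + rho
d1_exact = N / (2 * D ** 2)
d2_exact = -N / (2 * D ** 2) * (a + D1 / D)
d4_exact = -3 * N / (2 * D ** 2) * (
    a ** 3 + 3 * D1 * a ** 2 / D + 3 * D1 ** 2 * a / D ** 2 + D1 ** 3 / D ** 3
)

for approx, exact in ((d1, d1_exact), (d2, d2_exact), (d4, d4_exact)):
    print(float(approx), float(exact))
```

Each printed pair should agree up to the $O(h^2)$ truncation error of the finite-difference stencil.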

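This yields

$$ \Sigma_d^1(\rho) = -1 + \int_0^\infty \frac{e^{-\zeta} (\rho + \zeta)(2 + \rho + \zeta)}{D(\zeta)^2} \, d\zeta \tag{D23} $$

$$ \Sigma_d^2(\rho) = \int_0^\infty \frac{e^{-\zeta} (\rho + \zeta)(2 + \rho + \zeta)}{D(\zeta)^2} \left[ (2 + \rho) + \frac{(\rho + \zeta)(2 + \rho + \zeta)}{D(\zeta)} \right] d\zeta \tag{D24} $$

$$
\begin{aligned}
\Sigma_d^4(\rho) = \int_0^\infty \frac{e^{-\zeta} (\rho + \zeta)(2 + \rho + \zeta)}{D(\zeta)^2} \Bigg[ & (2 + \rho)^3 + \frac{3 (\rho + \zeta)(2 + \rho + \zeta)(2 + \rho)^2}{D(\zeta)} \\
& + \frac{3 \left[ (\rho + \zeta)(2 + \rho + \zeta) \right]^2 (2 + \rho)}{D(\zeta)^2} + \frac{3 \left[ (\rho + \zeta)(2 + \rho + \zeta) \right]^3}{D(\zeta)^3} \Bigg] d\zeta
\end{aligned} \tag{D25}
$$

where we have defined

$$ D(\zeta) = (\rho + \zeta)(2 + \rho + \zeta) \left[ -1 + \rho e^{\rho} \left[ \mathrm{Ei}(-\rho - \zeta) - \mathrm{Ei}(-\rho) \right] \right] + (1 + \rho + \zeta) \rho e^{-\zeta} \tag{D26} $$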

Appendix E: Solution for strong selection or recombination

We finally consider the regime where $\gamma_{AB} \gg 1$, which can occur when any of $\gamma_A$, $\gamma_B$, $\gamma_\epsilon$, or $\rho$ is large compared to one. In this limit, we can obtain solutions via perturbation theory, treating the $\psi^2$ term as a small correction. We first rescale time $u = \tau \gamma_{AB}$, so that

$$ \frac{\partial \psi}{\partial u} = -\psi - \frac{1}{\gamma_{AB}} \psi^2 + \frac{\rho}{\gamma_{AB}} \left[ \frac{e^{-\gamma_A u / \gamma_{AB}}}{1 + \frac{1}{\gamma_A}\left(1 - e^{-\gamma_A u / \gamma_{AB}}\right)} + \frac{e^{-\gamma_B u / \gamma_{AB}}}{1 + \frac{1}{\gamma_B}\left(1 - e^{-\gamma_B u / \gamma_{AB}}\right)} \right] \tag{E1} $$

We can put this in a more compact form by defining $\epsilon = 1/\gamma_{AB}$, $\alpha = \rho/\gamma_{AB}$, and $f(u) = x(u) + y(u)$, so that

$$ \frac{\partial \psi}{\partial u} = -\psi - \epsilon \psi^2 + \alpha f(u) \tag{E2} $$

with initial condition $\psi(0) = z$. We can solve this equation using a perturbation expansion in $\epsilon$, defining

$$ \psi(u) = \sum_{k=0}^{\infty} \epsilon^k \psi_k(u) \tag{E3} $$

$$ f(u) = \sum_{k=0}^{\infty} \epsilon^k f_k(u) \tag{E4} $$

with $\psi_k(0) = z \delta_{k,0}$. At zeroth order, we have

$$ \frac{\partial \psi_0}{\partial u} = -\psi_0 + \alpha f_0(u) \tag{E5} $$

and hence

$$ \psi_0(u; z) = z e^{-u} + \alpha e^{-u} \int_0^u e^{u'} f_0(u') \, du' \tag{E6} $$

For our purposes, it will suffice to continue this formal solution through third order in $\epsilon$. At first order, we have

$$ \frac{\partial \psi_1}{\partial u} = -\psi_1 + \alpha f_1(u) - \psi_0(u; z)^2 \tag{E7} $$

and hence

$$ \psi_1(u; z) = \alpha e^{-u} \int_0^u e^{u'} f_1(u') \, du' - e^{-u} \int_0^u e^{u'} \psi_0(u'; z)^2 \, du' \tag{E8} $$

$$ = \alpha e^{-u} \int_0^u e^{u'} f_1(u') \, du' - z^2 e^{-u} \left(1 - e^{-u}\right) + \ldots \tag{E9} $$

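At second order, we have

$$ \frac{\partial \psi_2}{\partial u} = -\psi_2 + \alpha f_2(u) - 2 \psi_0(u; z) \psi_1(u; z) \tag{E10} $$

and hence

$$ \psi_2(u; z) = \alpha e^{-u} \int_0^u e^{u'} f_2(u') \, du' - e^{-u} \int_0^u e^{u'} \, 2 \psi_0(u'; z) \psi_1(u'; z) \, du' \tag{E11} $$

$$ = z^3 e^{-u} \left(1 - e^{-u}\right)^2 + \ldots \tag{E12} $$

Finally, at third order, we have

$$ \frac{\partial \psi_3}{\partial u} = -\psi_3 + \alpha f_3(u) - \psi_1(u; z)^2 - 2 \psi_0(u; z) \psi_2(u; z) \tag{E13} $$

and hence

$$ \psi_3(u; z) = \alpha e^{-u} \int_0^u e^{u'} f_3(u') \, du' - e^{-u} \int_0^u e^{u'} \left[ \psi_1(u'; z)^2 + 2 \psi_0(u'; z) \psi_2(u'; z) \right] du' \tag{E14} $$

$$ = -z^4 e^{-u} \left(1 - e^{-u}\right)^3 + \ldots \tag{E15} $$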

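We can use these formal solutions to compute derivatives with respect to $z$:

$$ \left. \frac{\partial \psi}{\partial z} \right|_{z=2} = \left. \frac{\partial \psi_0}{\partial z} \right|_{z=2} + O(\epsilon) \approx e^{-u} \tag{E16} $$

$$ \left. \frac{\partial^2 \psi}{\partial z^2} \right|_{z=2} = \epsilon \left. \frac{\partial^2 \psi_1}{\partial z^2} \right|_{z=2} + O(\epsilon^2) \approx -\epsilon e^{-u} \int_0^u e^{u'} \left[ 2 e^{-2u'} \right] du' = -2 \epsilon e^{-u} \left(1 - e^{-u}\right) \tag{E17} $$

$$ \left. \frac{\partial^4 \psi}{\partial z^4} \right|_{z=2} = \epsilon^3 \left. \frac{\partial^4 \psi_3}{\partial z^4} \right|_{z=2} + O(\epsilon^4) \approx -24 \epsilon^3 e^{-u} \left(1 - e^{-u}\right)^3 \tag{E18} $$

Note that all three expressions are independent of $f(u)$ at lowest order. We can then use these expressions to calculate LD statistics: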

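$$
\begin{aligned}
\Sigma_d^1(\gamma_A, \gamma_B, \gamma_{AB}, \rho) = -1 + \frac{1}{\gamma_{AB}} \int_0^\infty du \, e^{-u} \Bigg[ & (\gamma_B + 1) \cdot \frac{\gamma_A + 1 - e^{-\gamma_A u / \gamma_{AB}}}{\gamma_A} + (\gamma_A + 1) \cdot \frac{\gamma_B + 1 - e^{-\gamma_B u / \gamma_{AB}}}{\gamma_B} \\
& + \rho \cdot \frac{\gamma_A + 1 - e^{-\gamma_A u / \gamma_{AB}}}{\gamma_A} \cdot \frac{\gamma_B + 1 - e^{-\gamma_B u / \gamma_{AB}}}{\gamma_B} \Bigg]
\end{aligned} \tag{E19}
$$

and

$$
\begin{aligned}
\Sigma_d^2(\gamma_A, \gamma_B, \gamma_{AB}, \rho) = \frac{1}{\gamma_{AB}^2} \int_0^\infty du \, 2 e^{-u} \left(1 - e^{-u}\right) \Bigg[ & (\gamma_B + 1) \cdot \frac{\gamma_A + 1 - e^{-\gamma_A u / \gamma_{AB}}}{\gamma_A} + (\gamma_A + 1) \cdot \frac{\gamma_B + 1 - e^{-\gamma_B u / \gamma_{AB}}}{\gamma_B} \\
& + \rho \cdot \frac{\gamma_A + 1 - e^{-\gamma_A u / \gamma_{AB}}}{\gamma_A} \cdot \frac{\gamma_B + 1 - e^{-\gamma_B u / \gamma_{AB}}}{\gamma_B} \Bigg]
\end{aligned} \tag{E20}
$$

and

$$
\begin{aligned}
\Sigma_d^4(\gamma_A, \gamma_B, \gamma_{AB}, \rho) = \frac{(\gamma_A + 1)(\gamma_B + 1)}{\gamma_{AB}^4} \int_0^\infty du \, 4 e^{-u} \left(1 - e^{-u}\right)^3 \Bigg[ & (\gamma_B + 1) \cdot \frac{\gamma_A + 1 - e^{-\gamma_A u / \gamma_{AB}}}{\gamma_A} + (\gamma_A + 1) \cdot \frac{\gamma_B + 1 - e^{-\gamma_B u / \gamma_{AB}}}{\gamma_B} \\
& + \rho \cdot \frac{\gamma_A + 1 - e^{-\gamma_A u / \gamma_{AB}}}{\gamma_A} \cdot \frac{\gamma_B + 1 - e^{-\gamma_B u / \gamma_{AB}}}{\gamma_B} \Bigg]
\end{aligned} \tag{E21}
$$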

In the limit that γA, γB  1, these expressions reduce to Eq. (63) in the main text.
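The structure of the perturbation series is easiest to see in the special case where the recombination source term is switched off ($\alpha = 0$, a simplifying assumption made only for this illustration). Eq. (E2) is then a logistic equation with the closed-form solution $\psi = z e^{-u} / [1 + \epsilon z (1 - e^{-u})]$, and expanding in $\epsilon$ gives $\psi_k = (-1)^k z^{k+1} e^{-u} (1 - e^{-u})^k$, matching the explicit terms in Eqs. (E9), (E12), and (E15). The sketch below compares the truncated series with the exact solution; function names and parameter values are ours.

```python
import math

def psi_exact(u, z, eps):
    """Exact solution of d(psi)/du = -psi - eps * psi**2 with psi(0) = z (alpha = 0)."""
    e = math.exp(-u)
    return z * e / (1.0 + eps * z * (1.0 - e))

def psi_series(u, z, eps, order=3):
    """Partial sum of eps**k * psi_k(u) through the given order."""
    e = math.exp(-u)
    return sum((-1) ** k * eps ** k * z ** (k + 1) * e * (1.0 - e) ** k
               for k in range(order + 1))

def d_psi_dz(u, z, eps, h=1e-5):
    """Central-difference estimate of the first z-derivative of the exact solution."""
    return (psi_exact(u, z + h, eps) - psi_exact(u, z - h, eps)) / (2.0 * h)

u, z, eps = 1.0, 2.0, 0.01
print(psi_exact(u, z, eps) - psi_series(u, z, eps))  # remainder is O(eps**4)
print(d_psi_dz(u, z, eps), math.exp(-u))             # Eq. (E16) at lowest order
```

The first difference shrinks like $\epsilon^4$ as $\epsilon$ decreases, which is the behavior expected of a series truncated after third order.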

Appendix F: Transition to the Quasi-Linkage Equilibrium (QLE) regime

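Using the quasi-stationary distribution in Eq. (65) in the main text, it is easy to show that the first several conditional averages are given by

$$ \langle D | f_{Ab}, f_{aB} \rangle = \langle f_{AB} | f_{Ab}, f_{aB} \rangle - f_A f_B \approx 0 \tag{F1} $$

$$ \langle D^2 | f_{Ab}, f_{aB} \rangle = \langle (f_{AB} - f_A f_B)^2 | f_{Ab}, f_{aB} \rangle \approx \frac{2Nr f_{Ab} f_{aB}}{(2Nr)^2} \tag{F2} $$

and

$$ \langle D^4 | f_{Ab}, f_{aB} \rangle \approx \langle f_{AB}^4 | f_{Ab}, f_{aB} \rangle - 4 \langle f_{AB}^3 | f_{Ab}, f_{aB} \rangle f_A f_B + 6 \langle f_{AB}^2 | f_{Ab}, f_{aB} \rangle (f_A f_B)^2 - 4 \langle f_{AB} | f_{Ab}, f_{aB} \rangle (f_A f_B)^3 + (f_A f_B)^4 \tag{F3} $$

$$ = \frac{(2Nr f_A f_B + 3)(2Nr f_A f_B + 2)(2Nr f_A f_B + 1) f_A f_B}{(2Nr)^3} \tag{F4} $$

in the limit that $\rho \gg 1$ and $f_A, f_B \ll 1$. Averaging over the slowly evolving $f_{Ab}$ and $f_{aB}$ frequencies then yields the moments in Eq. (66) in the main text. The ratio between the fourth and second moments then follows as

$$ \eta(f_0) = \frac{\sigma_d^4(f_0)}{3 \sigma_d^2(f_0)} = \frac{1 + Nr f_0^2}{2 (Nr)^2} \tag{F5} $$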

To obtain a formula for $\eta$ that works throughout the full range of $Nr$ and $f_0$ values, we asymptotically match this expression with the corresponding formula from Eq. (57), which yields

$$ \eta(Nr, f_0) = \frac{1 + Nr f_0^2}{2 (Nr)^2} \, e^{-\frac{1}{2Nr f_0}} + \frac{f_0^2 \, \Sigma_d^4(2Nr f_0)}{3 \Sigma_d^2(2Nr f_0)} \left( 1 - e^{-\frac{1}{2Nr f_0}} \right) \tag{F6} $$

where $\Sigma_d^2(\rho)$ and $\Sigma_d^4(\rho)$ are defined by Eqs. (D24) and (D25) above.

Appendix G: Estimating frequency-resolved LD in finite samples

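As described in the main text, we can obtain unbiased finite-sample estimators for $\sigma_d^k(f_0)$ by expanding the $D$ and $f_A(1 - f_A) f_B (1 - f_B)$ terms in Eq. (7) and applying the moment formula in Eq. (74). To ensure a smooth mapping to the $f_0 \to \infty$ limit, it is useful to define a modified $M(\vec{n})$ function,

$$ M_{i,j,k,l}(\vec{n}) = \left[ \frac{n_{Ab}! \left(1 - \frac{1}{n f_0}\right)^{n_{Ab} - i}}{n^i \, (n_{Ab} - i)!} \right] \cdot \left[ \frac{n_{aB}! \left(1 - \frac{1}{n f_0}\right)^{n_{aB} - j}}{n^j \, (n_{aB} - j)!} \right] \cdot \left[ \frac{n_{AB}! \left(1 - \frac{2}{n f_0}\right)^{n_{AB} - k}}{n^k \, (n_{AB} - k)!} \right] \cdot \left[ \frac{n_{ab}!}{n^l \, (n_{ab} - l)!} \right] \tag{G1} $$

which satisfies a related moment formula

$$ \langle M_{i,j,k,l}(\vec{n}) \rangle \approx \left\langle f_{Ab}^i f_{aB}^j f_{AB}^k f_{ab}^l \, e^{-f_A/f_0 - f_B/f_0} \right\rangle \tag{G2} $$

in the limits that $f_0 \ll 1$ or $f_0 \gg 1$. Carrying out this procedure for the first few moments, we obtain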

$$ \left\langle D \, e^{-\frac{f_A + f_B}{f_0}} \right\rangle = \left\langle M_{0,0,1,1}(\vec{n}) - M_{1,1,0,0}(\vec{n}) \right\rangle \tag{G3a} $$

$$ \left\langle D^2 e^{-\frac{f_A + f_B}{f_0}} \right\rangle = \left\langle M_{0,0,2,2}(\vec{n}) - 2 M_{1,1,1,1}(\vec{n}) + M_{2,2,0,0}(\vec{n}) \right\rangle \tag{G3b} $$

and

$$
\begin{aligned}
\left\langle f_A (1 - f_A) f_B (1 - f_B) \, e^{-\frac{f_A + f_B}{f_0}} \right\rangle = \big\langle & M_{2,2,0,0}(\vec{n}) + M_{1,2,0,1}(\vec{n}) + M_{2,1,1,0}(\vec{n}) + M_{1,1,1,1}(\vec{n}) \\
& + M_{2,1,0,1}(\vec{n}) + M_{1,1,0,2}(\vec{n}) + M_{2,0,1,1}(\vec{n}) + M_{1,0,1,2}(\vec{n}) \\
& + M_{1,2,1,0}(\vec{n}) + M_{0,2,1,1}(\vec{n}) + M_{1,1,2,0}(\vec{n}) + M_{0,1,2,1}(\vec{n}) \\
& + M_{1,1,1,1}(\vec{n}) + M_{0,1,1,2}(\vec{n}) + M_{1,0,2,1}(\vec{n}) + M_{0,0,2,2}(\vec{n}) \big\rangle
\end{aligned} \tag{G3c}
$$

which can be combined to obtain count-based formulae for $\sigma_d^1(f_0)$ and $\sigma_d^2(f_0)$. One can then apply these estimators to genomic data by replacing the ensemble averages with appropriate sums over many functionally similar pairs of sites.
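A count-based implementation of these estimators might look like the sketch below. Several choices here are assumptions rather than the paper's code: haplotype counts are passed as tuples ordered `(n_Ab, n_aB, n_AB, n_ab)` to match the index order $(i, j, k, l)$, the final statistic is formed as a ratio of sums over pairs of sites (Eq. (7) is not reproduced in this appendix), and all function names are hypothetical. The 16 terms of the denominator are generated by expanding $f_A (1 - f_A) f_B (1 - f_B) = (f_{Ab} + f_{AB})(f_{aB} + f_{ab})(f_{aB} + f_{AB})(f_{Ab} + f_{ab})$.

```python
import math
from itertools import product

def falling(n, i):
    """Falling factorial n! / (n - i)!, which is zero when i > n."""
    out = 1
    for m in range(i):
        out *= n - m
    return out

def M(idx, counts, f0=math.inf):
    """Modified moment function M_{i,j,k,l} from Eq. (G1)."""
    i, j, k, l = idx
    nAb, naB, nAB, nab = counts
    n = nAb + naB + nAB + nab
    coeff = falling(nAb, i) * falling(naB, j) * falling(nAB, k) * falling(nab, l)
    if coeff == 0:
        return 0.0
    w1 = 1.0 - 1.0 / (n * f0)  # equals 1 when f0 is infinite
    w2 = 1.0 - 2.0 / (n * f0)
    weight = w1 ** (nAb - i) * w1 ** (naB - j) * w2 ** (nAB - k)
    return coeff * weight / n ** (i + j + k + l)

def numerator_D(counts, f0=math.inf):
    """Right-hand side of Eq. (G3a) for a single pair of sites."""
    return M((0, 0, 1, 1), counts, f0) - M((1, 1, 0, 0), counts, f0)

def numerator_D2(counts, f0=math.inf):
    """Right-hand side of Eq. (G3b) for a single pair of sites."""
    return (M((0, 0, 2, 2), counts, f0) - 2.0 * M((1, 1, 1, 1), counts, f0)
            + M((2, 2, 0, 0), counts, f0))

def denominator(counts, f0=math.inf):
    """Right-hand side of Eq. (G3c): the 16 monomials of f_A(1-f_A)f_B(1-f_B)."""
    # Each factor lists the positions of its two terms in (Ab, aB, AB, ab).
    factors = [(0, 2), (1, 3), (1, 2), (0, 3)]
    total = 0.0
    for choice in product(*factors):
        idx = [0, 0, 0, 0]
        for position in choice:
            idx[position] += 1
        total += M(tuple(idx), counts, f0)
    return total

def sigma_d(pairs, k, f0=math.inf):
    """Ratio-of-sums estimate of sigma_d^k (k = 1 or 2) over many pairs of sites."""
    numerator = numerator_D if k == 1 else numerator_D2
    return (sum(numerator(c, f0) for c in pairs)
            / sum(denominator(c, f0) for c in pairs))
```

For example, `sigma_d([(3, 2, 1, 4), (5, 1, 0, 4)], k=2, f0=0.2)` pools two pairs of sites with ten sampled genomes each; in practice the sums would run over many functionally similar pairs.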

Appendix H: Applications to polymorphism data from E. rectale

The linkage disequilibrium estimates in Fig. 7 were obtained from a sample of n = 109 E. rectale genomes that we analyzed in a previous study (Garud et al., 2019). In that work, we used a reference-based approach to identify single nucleotide variants (SNVs) in the intra-host populations of several common species of gut bacteria from a panel of ∼1000 sequenced fecal samples. We also identified samples in which the haplotype of the dominant strain of a given species could be resolved with high confidence. This analysis yielded 159 “quasi-phaseable” E. rectale genomes, which we used to identify between-host SNVs in the global E. rectale population. After controlling for population structure, we identified a subset of n = 109 samples that were inferred to descend from the largest clade. These 109 genomes were used as the basis for the analysis in Fig. 7. We used the same collection of SNVs identified in Garud et al. (2019), and we focused on the subset of SNVs located at fourfold degenerate sites in core genes. We recorded the two-site haplotypes and coordinate distances between all pairs of SNVs located within 4 consecutive genes of each other on the E. rectale reference genome, defining A and B to be the minority alleles at each site. As a control, we also recorded analogous two-site haplotypes for pairs of SNVs that were constructed from a large number of randomly selected genes. The full collection of two-site haplotypes is provided in Supplementary Data. These data were used as inputs for each of the calculations described in Fig. 7.
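The tabulation step described above can be sketched as follows. This is a minimal illustration, not the pipeline used for Fig. 7: the input layout (one 0/1 allele list per SNV across the same genomes), the `max_gene_gap` reading of "within 4 consecutive genes", the handling of ties, and every function name are assumptions, and missing genotypes are not handled.

```python
from itertools import combinations

def polarize(column):
    """Recode a 0/1 allele column so that 1 marks the minority allele."""
    ones = sum(column)
    if 2 * ones > len(column):
        return [1 - allele for allele in column]
    return list(column)

def two_site_counts(column_a, column_b):
    """Return haplotype counts (n_Ab, n_aB, n_AB, n_ab) for one pair of SNVs."""
    a, b = polarize(column_a), polarize(column_b)
    n_AB = sum(x & y for x, y in zip(a, b))
    n_Ab = sum(a) - n_AB
    n_aB = sum(b) - n_AB
    n_ab = len(a) - n_Ab - n_aB - n_AB
    return n_Ab, n_aB, n_AB, n_ab

def nearby_pairs(positions, gene_index, genotypes, max_gene_gap=4):
    """Collect (coordinate distance, haplotype counts) for SNV pairs in nearby genes.

    positions[i]  : reference coordinate of SNV i
    gene_index[i] : ordinal index of the gene containing SNV i
    genotypes[i]  : list of 0/1 alleles for SNV i across all genomes
    """
    records = []
    for i, j in combinations(range(len(positions)), 2):
        if abs(gene_index[i] - gene_index[j]) <= max_gene_gap:
            distance = abs(positions[i] - positions[j])
            records.append((distance, two_site_counts(genotypes[i], genotypes[j])))
    return records
```

Binning the returned records by coordinate distance and feeding the counts to count-based LD estimators would then give distance-resolved LD curves.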