Correlated Mutations and Homologous Recombination Within Bacterial Populations

Genetics: Early Online, published on December 22, 2016 as 10.1534/genetics.116.189621 GENETICS | INVESTIGATION


Mingzhi Lin∗ and Edo Kussell∗, †, 1 ∗Department of Biology and Center for Genomics and Systems Biology, New York University, †Department of Physics, New York University

ABSTRACT Inferring the rate of homologous recombination within a bacterial population remains a key challenge in quantifying the basic parameters of bacterial evolution. Due to the high sequence similarity within a clonal population, and unique aspects of bacterial DNA transfer processes, detecting recombination events based on phylogenetic reconstruction is often difficult, and estimating recombination rates using coalescent model-based methods is computationally expensive and often infeasible for large sequencing data sets. Here, we present an efficient solution by introducing a set of mutational correlation functions computed using pairwise sequence comparison, which characterize various facets of bacterial recombination. We provide analytical expressions for these functions, which precisely recapitulate simulation results of neutral and adapting populations under different coalescent models. We use these to fit correlation functions measured at synonymous substitutions using whole-genome data on Escherichia coli and Streptococcus pneumoniae populations. We calculate and correct for the effect of sample selection bias, i.e. the uneven sampling of individuals from natural microbial populations that exists in most datasets. Our method is fast and efficient, and does not employ phylogenetic inference or other computationally intensive numerics. By simply fitting analytical forms to measurements from sequence data, we show that recombination rates can be inferred and the relative ages of different samples can be estimated. Our approach, which is based on population genetic modeling, is broadly applicable to a wide variety of data and its computational efficiency makes it particularly attractive for use in the analysis of large sequencing datasets.

KEYWORDS bacteria; homologous recombination; population diversity; sample selection bias; sample ages; adapting populations; Bolthausen-Sznitman coalescent

Copyright © 2016 by the Genetics Society of America
Manuscript compiled: Friday 9th December, 2016
1 Corresponding author: Center for Genomics and Systems Biology, Department of Biology, New York University, 12 Waverly Place, New York, New York, 10003. Email: [email protected]

Bacteria can receive DNA fragments from their environment by different mechanisms and integrate them into their genome in a set of processes collectively known as horizontal gene transfer (HGT) (Thomas and Nielsen 2005). While the importance of HGT in bacterial evolution is increasingly appreciated (Koonin and Wolf 2008; Shapiro et al. 2012; Oren et al. 2014; Rosen et al. 2015; Ravenhall et al. 2015), quantifying its impact across bacterial genomes and in diverse environmental samples remains a key challenge (Maynard Smith 1991; Soucy et al. 2015). In particular, a prevalent form of HGT involves homologous recombination of fragments that bear a high degree of sequence similarity to the recipient genome (Andam and Gogarten 2011; Williams et al. 2012). Such transfer events are particularly difficult to detect, since they leave no obvious marks and are indistinguishable based on nucleotide composition, yet they are likely to represent the majority of HGT events in bacteria (Fraser et al. 2007).

Three major mechanisms of HGT (transformation, conjugation, and transduction) mediate the passage of external DNA into bacterial cells. Once inside the cell, DNA fragments can be integrated into the genome either by homologous or non-homologous recombination mechanisms (Thomas and Nielsen 2005). In homologous recombination, the fragment usually recombines into a homologous genomic locus, replacing the existing DNA at the recipient locus. DNA transfer by homologous recombination is similar to gene conversion in eukaryotes: it is unidirectional, and changes occur only in the recipient locus. In non-homologous recombination (also known as illegitimate recombination), a comparatively rare event, the fragment is inserted directly into the genome without replacing DNA. Homologous recombination breaks genetic linkage within a bacterial population, thereby alleviating clonal interference as well as Muller's ratchet, and is therefore expected to increase the rate at which bacteria evolutionarily adapt to their environment. Theoretical analyses have shown that a principal determinant of the speed of evolutionary adaptation is the recombination rate (Cohen et al. 2005; Neher et al. 2010; Neher and Hallatschek 2013; Weissman and Barton 2012; Weissman and Hallatschek 2014). Measuring these rates and other population genetic parameters is therefore crucial for describing the evolutionary dynamics of different bacterial species.

To infer the parameters and rates of the HGT process, mathematical models based on coalescent theory (Kingman 1982; Hudson 1983) have been used to study the genealogy of a sample of sequences in the presence of gene conversion or homologous recombination (Wiuf and Hein 2000; Wiuf 2000; McVean et al. 2002; Didelot and Falush 2007; Touchon et al. 2009). Using a coalescent model with gene conversion, computationally intensive methods have been developed that allow estimation of recombination rates on a whole-genome scale (Didelot et al. 2010; Ansari and Didelot 2014). Other methods detect recombination events based on regional differences of SNP density (Croucher et al. 2015) or inferred phylogeny (Marttinen et al. 2012), and are often used to establish a clonal phylogeny for a sample based on the vertically inherited regions. Patterns of SNP density combined with Markov Chain Monte Carlo simulations have also been used to infer genome-wide recombination parameters (Dixit et al. 2015). One caveat is that phylogenetic methods are particularly sensitive to errors in the reference phylogeny. When recombination occurs as frequently as mutations accumulate, which appears to be the case in many species (Vos and Didelot 2009), it is not clear whether a reference phylogeny can be accurately constructed based on sequence data alone.

In natural populations, the genealogical trees reconstructed from the sampled sequences can exhibit substantial departure from neutrality. Kingman's coalescent (KC) allows only pairwise mergers, meaning that at most two ancestral lines can merge at each coalescence event, and the branches of the coalescent tree are distributed evenly (Kingman 1982). In spatially expanding or panmictic adapting populations, multiple mergers of ancestral lines occurring at a single step are not rare in the genealogical trees, and Brunet et al. (2007) showed that a special case of the multiple-merger coalescent, the Bolthausen-Sznitman coalescent (BSC) (Bolthausen and Sznitman 1998), describes their genealogies. More recently, Desai et al. (2013) and Neher and Hallatschek (2013) showed that in rapidly adapting populations, exponential amplification of fit lineages leads to multiple mergers, and the resulting genealogies are well described by the BSC.

Here, we present an analytically tractable framework for inferring homologous recombination from sequence data that is applicable for any exchangeable coalescent model, which subsumes KC and BSC models. We introduce new population-genetic quantities based on correlations of substitutions within a population, and use these to accurately infer recombination parameters and provide consistency tests of the model. We study the impact of selection on these quantities and demonstrate a smooth transition from KC to BSC predictions with the increase of selection strength. Importantly, our method does not rely on the construction of phylogenetic trees, and because it is analytically tractable it yields results as fast as sequence data can be read into memory. We demonstrate the power of this method using large datasets of whole-genome bacterial sequences.

Materials and Methods

Neutral population models

We consider a generalized Moran model that evolves a population of $N$ individuals in continuous time with overlapping generations (Moran 1958), and allows individuals to have multiple offspring at each reproduction event. We choose an arbitrary unit of time to measure all rates in the model. Reproduction events occur with rate $G$ at exponentially distributed times. At each event, exactly one individual reproduces, yielding $U - 1$ new individuals that replace $U - 1$ randomly chosen individuals out of the $N - 1$ existing individuals in the population not including the parent. The value $U$ is a random variable that can take values from $2, \ldots, N$, with probability distribution $P(U)$. To obtain the classic Moran model, one chooses $P(2) = 1$ and $P(U > 2) = 0$, and it is well known that in this case the resulting coalescent genealogies are given by KC statistics (Kingman 1982). Alternatively, one can choose $P(U) = \frac{N}{N-1}\,\frac{1}{U(U-1)}$, which we will call the Schweinsberg model, in which case the coalescent genealogies are given by BSC statistics (Schweinsberg 2012).

Mutations, recombination, and fragment size distribution

Mutations and DNA transfer occur with uniform rates throughout the population and across genomes. We model circular genomes of length $L$ with an 'alphabet' of $a$ letters, where the usual choice is $a = 4$, corresponding to the four bases of DNA. Each site on a genome mutates with rate $\mu$ (per generation) to any of the $a - 1$ remaining letters with equal probability. DNA transfer events, in which an external piece of DNA is imported and homologously recombined into the genome, occur at a rate $R$ (per generation) in each genome. We define $\gamma \equiv R/L$ as the transfer rate per site, where $L$ is the length of the genome. Each time a transfer occurs into a given genome (the 'recipient'), one of the $N$ individuals is chosen randomly to be the 'donor', and a randomly chosen fragment of size $f$ is copied from the donor, replacing the existing piece at the identical location in the recipient.

The fragment size $f$ is a random variable determined by a probability distribution function, $p(f)$. We define the rate at which a site is affected by recombination, $r \equiv \gamma \bar{f}$, where $\bar{f}$ is the average fragment size. We call the parameter $r$ the recombination coverage rate, since it measures the fraction of the genome that is 'covered' by horizontally transferred pieces of DNA newly acquired in each generation. For two sites separated by distance $l$ along the genome, the coverage rate $r$ at each site can be partitioned into two contributions: the rate at which only the single site is covered by a transfer, $r_1(l)$ (one-site transfer), and the rate at which both sites are covered, $r_2(l)$ (two-site transfer), given by

$$r_1(l) = \gamma \sum_{f=1}^{l} f\,p(f) + \gamma l \sum_{f=l}^{L} p(f) \tag{1}$$

$$r_2(l) = \gamma \sum_{f=l}^{L} (f - l)\,p(f), \tag{2}$$

where $r_1(l) + r_2(l) = r$. When $l$ is smaller than the minimal size of transferred fragments, these functions depend only on $\bar{f}$, with $r_1(l) = \gamma l$ and $r_2(l) = \gamma(\bar{f} - l)$. All simulations shown here used a constant fragment size $f_0$, setting $p(f) = \delta_{f f_0}$, where $\delta_{ij}$ is

the Kronecker delta function; all analytical results were derived in terms of $r_1(l)$ and $r_2(l)$, hence they are generally applicable for any fragment size distribution.
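The partition of the coverage rate into one-site and two-site contributions can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `coverage_rates` and the list representation of p(f) are assumptions made for illustration. It also assumes that the boundary term f = l is counted once, in the first sum of Eq. (1), so that r1(l) + r2(l) = r holds exactly.

```python
def coverage_rates(gamma, p, l):
    """One-site and two-site coverage rates, following Eqs. (1) and (2).

    gamma : transfer rate per site
    p     : list where p[f - 1] is the probability of fragment size f
            (hypothetical representation, f = 1 .. len(p))
    l     : distance between the two sites
    Assumption: a fragment of size f == l is counted only in the first sum.
    """
    r1 = 0.0
    r2 = 0.0
    for f, prob in enumerate(p, start=1):
        if f <= l:
            # Fragment too short to span both sites: f placements cover one site.
            r1 += gamma * f * prob
        else:
            # l placements cover only one site, f - l placements cover both.
            r1 += gamma * l * prob
            r2 += gamma * (f - l) * prob
    return r1, r2


# Constant fragment size f0 = 50, i.e. p(f) equal to the Kronecker delta at f0.
r1, r2 = coverage_rates(1e-4, [0.0] * 49 + [1.0], 10)
```

With a constant fragment size larger than l, the sketch reduces to r1(l) = γl and r2(l) = γ(f0 − l), as stated in the text.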

Genome sequences and population diversity

Each genome sequence $\vec{g}$ is specified as $g_i$, for $i = 0, 1, \ldots, L - 1$, over an alphabet of size $a$, with $g_i \in \{1, 2, \ldots, a\}$. For a pair of genomes $\vec{g}$ and $\vec{g}'$, we define the substitution sequence $S_i(\vec{g}, \vec{g}') \equiv 1 - \delta_{g_i g'_i}$. For notational convenience, we number the possible pairs of sequences $k = 1, \ldots, N(N - 1)/2$, and write the substitution sequence of the $k$-th pair as $S_i^k$, which we call the substitution variable and which takes values in $\{0, 1\}$. The substitution variable can be thought of as a set of observations indexed by the variables $i$ and $k$, and we can average over the indices in various ways. For example, we define the average sequence distance between the $k$-th pair of genomes as $d(k) \equiv \langle S_i^k \rangle$, where we let $\langle \cdot \rangle$ denote the average over sequence positions $i$. The population average pairwise distance, which we will also call the population diversity, is given by

$$d \equiv \overline{d(k)} = \overline{\langle S_i^k \rangle}, \tag{3}$$

where the bar denotes the average over all genome pairs $k$ across the population. The variance of pairwise distances is written as

$$\sigma^2 \equiv \mathrm{Var}(d(k)) = \overline{\langle S_i^k \rangle^2} - \overline{\langle S_i^k \rangle}^{\,2}. \tag{4}$$

Equivalently, and often more conveniently, we consider $S = S_i^k$ to be a random variable that depends on $i$ and $k$. Since we will be computing averages and correlations of $S$ over the indices, we take $i$ and $k$ to be random variables with a uniform distribution over their respective values. In this notation, we have $d(k) = E(S|k)$, where $S|k$ is the conditional random variable, and $E(\cdot)$ denotes expectation. The population diversity is given by $d = E(E(S|k)) = E(S)$. The variance of pairwise distances is written as $\sigma^2 = \mathrm{Var}(E(S|k))$.

It is important to emphasize that $S$ as defined here depends explicitly on the collection of genomes $\{\vec{g}\}$ that make up the population at any given time. Throughout the paper, any quantity that is defined using $S$ (e.g. $d$, $\sigma^2$, and the correlation functions below) is therefore a random variable whose distribution is determined by all possible realizations of the stochastic population dynamics process. In simulations, we average over many runs to calculate the expectations of these quantities, while our analytical results are constructed explicitly to compute their expectations over the stochastic dynamics.

Population genetic parameters

We define several additional quantities, which will be useful when analyzing recombination and diversity at the population level. We consider a pair of individuals that coalesced at time $t$ ago. Their mutational divergence, $2t\mu$, corresponds to the fraction of the genome that accumulated mutations on the two lineages since coalescence. In the neutral models that we consider here, the mean coalescence time $\bar{t}$ is $N/2$ generations (Appendix B), and the mean mutational divergence is given by the well-known population genetic parameter $\theta \equiv N\mu$. Analogously, we define the pairwise recombinational divergence, $2t\gamma$, which is the average number of recombination events per site that occurred on two lineages. The mean recombinational divergence is then given by the quantity $\phi \equiv N\gamma$. Since each event covers on average $\bar{f}$ basepairs, we define the mean recombination coverage as $\rho \equiv \phi \bar{f} = Nr$, which measures the fraction of the genome affected by recombination since divergence. We also define $\rho_1(l) \equiv N r_1(l)$ and $\rho_2(l) \equiv N r_2(l)$ as the one-site and two-site recombinational coverage, where $\rho_1 + \rho_2 = \rho$.

Figure 1 Illustration of population genetic correlation functions. On the left, a population of $N$ genomic sequences each of length $L$ is shown. A pair of sequences, $\vec{g}$ and $\vec{g}'$, is compared at each site $i$, yielding the substitution sequence $S_i^k$, where $k$ indexes the pair of genomes among all possible $N(N - 1)/2$ pairs. A pair of positions along the sequence separated by a distance $l$ is shown. The population diversity $d$, the variance of pairwise distances $\sigma^2$, and the correlation functions $c_M(l)$ (mutational correlation), $c_S(l)$ (structure correlation), and $c_R(l)$ (rate correlation) are shown to involve taking averages or covariances in different directions along the substitution sequences.

Correlation functions

We consider a pair of sites separated by distance $l$ along the genome, denoted $i$ and $i + l$. We define $X \equiv S_i^k$ and $Y \equiv S_{i+l}^k$ to be the substitution variables at any two such sites, and treat $X$ and $Y$ as random variables, as above. This allows us to define three distinct correlation functions, each of which involves all pairs of sites separated by a distance of $l$ along the genome (see Figure 1). The mutational correlation function,

$$c_M(l) \equiv \overline{\langle S_i^k S_{i+l}^k \rangle - \langle S_i^k \rangle \langle S_{i+l}^k \rangle} = E(\mathrm{Cov}(X|k, Y|k)), \tag{5}$$

assesses within each pair of genomes the correlation of mutations across all pairs of sites $i$ and $i + l$; the structure correlation function,

$$c_S(l) \equiv \left\langle \overline{S_i^k S_{i+l}^k} - \overline{S_i^k} \cdot \overline{S_{i+l}^k} \right\rangle = E(\mathrm{Cov}(X|i, Y|i)), \tag{6}$$

measures how strongly pairs of sites are correlated across the population; and the rate correlation function,

$$c_R(l) \equiv \left\langle \overline{S_i^k} \cdot \overline{S_{i+l}^k} \right\rangle - \left\langle \overline{S_i^k} \right\rangle \left\langle \overline{S_{i+l}^k} \right\rangle = \mathrm{Cov}(E(X|i), E(Y|i)), \tag{7}$$

quantifies the correlation of evolutionary rates at pairs of sites along the genome sequence. Note that due to the circular genomes, the averaging along the genome sequence goes once around the genome, for $i = 0, \ldots, L - 1$, using the convention that $i + l$ is taken modulo $L$.
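The definitions in Eqs. (3) to (7) translate directly into averages over the rows (sequence pairs k) and columns (sites i) of the substitution variables. The sketch below is illustrative only and is not the authors' implementation: the function names are hypothetical, and the four short sequences are the ones legible in the Figure 1 illustration. The final check relies on the law of total covariance, which makes the sum of mutational correlation and variance equal to the sum of structure and rate correlation when sites are indexed circularly.

```python
from itertools import combinations
from statistics import fmean


def substitution_rows(genomes):
    """One row per sequence pair k; entry i is 1 if the pair differs at site i."""
    return [[0 if x == y else 1 for x, y in zip(g, h)]
            for g, h in combinations(genomes, 2)]


def cov(xs, ys):
    """Population covariance: mean of products minus product of means."""
    return fmean(x * y for x, y in zip(xs, ys)) - fmean(xs) * fmean(ys)


def correlation_functions(S, l):
    """Return (d, sigma2, cM, cS, cR) at distance l for circular genomes."""
    L = len(S[0])
    shift = l % L
    X = S
    # Y[k][i] = S[k][(i + l) mod L], i.e. the site l positions downstream.
    Y = [row[shift:] + row[:shift] for row in S]
    d_k = [fmean(row) for row in X]                 # pairwise distances d(k)
    d = fmean(d_k)                                  # Eq. (3)
    sigma2 = fmean(x * x for x in d_k) - d * d      # Eq. (4)
    # Covariance along the genome within each pair, averaged over pairs: Eq. (5).
    cM = fmean(cov(x, y) for x, y in zip(X, Y))
    Xcols, Ycols = list(zip(*X)), list(zip(*Y))
    # Covariance across pairs at each site, averaged over sites: Eq. (6).
    cS = fmean(cov(x, y) for x, y in zip(Xcols, Ycols))
    # Covariance along the genome of the per-site averages: Eq. (7).
    cR = cov([fmean(c) for c in Xcols], [fmean(c) for c in Ycols])
    return d, sigma2, cM, cS, cR


genomes = ["AACTCTTAAG", "ATCCCTTCAG", "AATCCCTCAG", "ATCCCTGAAG"]
S = substitution_rows(genomes)
d, sigma2, cM, cS, cR = correlation_functions(S, 1)
# Consistency check from the law of total covariance.
assert abs((cM + sigma2) - (cS + cR)) < 1e-12
```

This version enumerates all N(N − 1)/2 pairs explicitly, which is fine for a toy example; for large samples one would presumably vectorize the same averages.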

Symbol | Description | Relations
$N$ | Population size |
$L$ | Genome length |
$\mu$ | Mutation rate, per site |
$\gamma$ | Recombination rate, per site |
$a$ | Number of alleles at each locus | $a = 4$ for sequence models; $\tilde{a} \equiv a/(a - 1)$
$f$, $\bar{f}$ | Fragment size, mean fragment size |
$p(f)$ | Probability distribution of fragment sizes |
$l$ | Physical distance between two loci |
$r$ | Recombination coverage rate | $r = \gamma \bar{f}$
$r_1(l)$ | One-site coverage rate | $r_1(l) \approx \gamma l$
$r_2(l)$ | Two-site coverage rate | $r_2(l) \approx \gamma(\bar{f} - l)$
$t$, $\bar{t}$ | Coalescence time, mean coalescence time | $\bar{t} = N/2$
$\lambda_{b,k}$ | Coalescence rate for subset of $k$ out of $b$ individuals | $\lambda_{b,k} = \lambda_{b+1,k} + \lambda_{b+1,k+1}$ (see Appendix B)
$\theta$ | Mean mutational divergence | $\theta = 2\bar{t}\mu = N\mu$
$\phi$ | Mean recombinational divergence | $\phi = 2\bar{t}\gamma = N\gamma$
$\rho$ | Mean recombination coverage | $\rho = \phi \bar{f} = Nr$
$\rho_1(l)$ | One-site coverage | $\rho_1(l) = N r_1(l)$
$\rho_2(l)$ | Two-site coverage | $\rho_2(l) = N r_2(l)$
$S_i^k$ | Substitution variable | $S_i^k \in \{0, 1\}$, with 0 indicating identity and 1 indicating a difference between the $k$-th pair of sequences at locus $i$
$d$ | Population diversity; average pairwise distance | $d = \overline{\langle S_i^k \rangle}$
$\sigma^2$ | Variance of pairwise distances | $\sigma^2 = \overline{\langle S_i^k \rangle^2} - \overline{\langle S_i^k \rangle}^{\,2}$
$c_M(l)$ | Mutational correlation function | $c_M(l) \equiv \overline{\langle S_i^k S_{i+l}^k \rangle - \langle S_i^k \rangle \langle S_{i+l}^k \rangle}$
$c_S(l)$ | Structure correlation function | $c_S(l) \equiv \left\langle \overline{S_i^k S_{i+l}^k} - \overline{S_i^k} \cdot \overline{S_{i+l}^k} \right\rangle$
$c_R(l)$ | Rate correlation function | $c_R(l) \equiv \left\langle \overline{S_i^k} \cdot \overline{S_{i+l}^k} \right\rangle - \left\langle \overline{S_i^k} \right\rangle \left\langle \overline{S_{i+l}^k} \right\rangle$
$w$ | Probability that an external one-site transfer affects $t$ | $w = 2\lambda_{3,2}(\lambda_{3,3} + 3\lambda_{3,2})^{-1}$ (see Fig. 7 & Appendix E)

Table 1 Table of symbols and key relations. All times and rates are measured in units of generation time. The notation $\langle S_i^k \rangle$ indicates averaging over all loci $i$, and $\overline{S_i^k}$ indicates averaging over all sequence pairs $k$.

Adapting population model

To simulate an adapting population (Fig. 3), we keep track of individuals' fitness and allow reproduction events to occur with rates proportional to individuals' fitness, choosing the individual that is replaced at each event randomly and uniformly across the population. To model fitness effects, we consider the genome as a series of $L$ "codons", each of which contains one selective site and one neutral site (mimicking aspects of real codons, in which the first two basepairs are under stronger selective pressures than the third one). Mutations at neutral sites occur with rate $\mu$ per site, as before. Mutations at selective sites occur at rate $\mu_s$ per site, and change the genome's fitness by $\pm s$, where the parameter $s$ is a constant during the simulation, and the randomly chosen sign of $s$ determines whether it is a beneficial or deleterious mutation. Recombination events are modeled as above.

Results

We initially study populations that evolve neutrally in the presence of recombination, both by simulations and theoretical analysis, and later determine how on-going adaptation would influence the results. We then show that the theory can be used to explain mutational correlations in whole genome data, and can be applied in a straightforward manner to infer recombination rates and other key population genetic parameters. Table 1 summarizes the key parameters and variables that are used throughout the text.

Effect of homologous recombination on genetic diversity

We simulated neutral dynamics of a population of sequences that mutate and recombine DNA fragments, and measured their average pairwise distance (the population diversity, $d$) at different recombination rates. We used two different neutral models that yield either Kingman or Bolthausen-Sznitman coalescent trees for a population of size $N$, using the mutation rate $\mu$ (per site per generation), recombination rate $\gamma$ (per site per generation), and transferred fragment size $f_0$. We define the recombination coverage rate, $r \equiv \gamma f_0$, which measures the fraction of the genome that is 'covered' by transferred pieces at each generation. We also define the mutational divergence, $\theta \equiv N\mu$, which under neutral evolution corresponds to the fraction of the genome that accumulated mutations in a pair of sequences since divergence, and the recombination coverage, $\rho \equiv Nr$, the fraction of the genome covered by recombination since divergence (see Methods).

To study how DNA transfer impacts diversity for different population sizes, we performed comparisons at fixed $\theta$ by choosing $\mu = \theta/N$. Figure 2A shows a linear decrease in diversity with increasing recombination, as well as the expected collapse of the values of $d$ obtained across different population sizes at constant mutational divergence $\theta$. The decrease in diversity occurs because each recombination event replaces a piece of one genome with a homologous piece of another. DNA transfer within a bacterial population can therefore only decrease the number of alleles segregating at any single site on DNA. However, as seen here and shown analytically below, the slope of the decrease in $d$ is very shallow, hence population diversity is insensitive to large fold changes in $r$.

While a greater recombination rate has only a weak effect on diversity, it has a stronger effect on the distribution of pairwise distances within the population. To see this, we plotted the variance of pairwise distances $\sigma^2$ as a function of $r$ (Figure 2A). When recombination coverage rates are low, the population exhibits clusters of similar sequences (Higgs and Derrida 1992), resulting in high values of $\sigma^2$. Increasing $r$ decreases $\sigma^2$ due to transfers between clusters, which reduce the differences between clusters and create an increasingly isotropic distribution of sequences. While $d$ exhibits only a moderate decrease as a function of $r$, $\sigma^2$ decreases much more rapidly with $r$. For fixed $\theta$ the variance exhibits strong dependence on $N$ that collapses onto a single curve as a function of the recombination coverage $\rho$ (Figure 2A, inset). These results are consistent with homologous recombination acting as an efficient cohesive force within bacterial populations (Hanage et al. 2006; Fraser et al. 2007, 2009), since a small increase of $r$ is able to destroy clusters while only marginally reducing population diversity.

Genomic correlations and the mutational covariance identity

Inferring transfer rates from population measurements requires more informative quantities than $d$ and $\sigma^2$. The standard population genetic approach is to explore the pattern of linkage disequilibrium (LD) across genomic regions and relate it to recombination rates (McVean et al. 2002; McVean et al. 2004; Ansari and Didelot 2014). However, when more than two alleles can segregate at a locus (e.g. in large, diverse populations), minority and majority alleles cannot be unambiguously defined and LD must be generalized to account for multiple alleles, which becomes mathematically cumbersome. Moreover, variations in substitution rates across the genome can confound LD measurements, and a framework that accounts for rate variations is particularly useful in bacteria.

To address these issues, we introduce three correlation functions that capture different effects of homologous recombination on the structure of genetic diversity in the population (see Figure 1 and Materials and Methods). To avoid arbitrarily assigning a reference sequence, or the related issues of majority and minority alleles mentioned above, we consider all possible pairwise comparisons of sequences, which we denote by the substitution variable $S_i^k$, where $k$ indexes the pair of sequences, and $i$ is the genomic position; if the two sequences are identical at position $i$, $S_i^k = 0$, otherwise $S_i^k = 1$. We consider any two loci that are $l$ basepairs apart in a randomly chosen sequence pair, and denote the substitution variables at these loci by $X = S_i^k$ and $Y = S_{i+l}^k$. We measure the correlation between $X$ and $Y$ across loci within each sequence pair, and then average over all pairs to obtain the mutational correlation function, $c_M(l)$. We measure the correlation between $X$ and $Y$ across the population, and then average over loci to obtain the structure correlation function, $c_S(l)$. Lastly, we compute the average substitution rates at the two loci across the population, and measure the correlation of substitution rates along the genome using the rate correlation function, $c_R(l)$. Figure 1 indicates pictorially how these functions are obtained by correlating or averaging substitution variables conditionally along the sequence or across the population. The tabular structure suggests that a relation should exist among these functions. Indeed, it is easy to check by substitution that

$$c_M(l) + \sigma^2 = c_R(l) + c_S(l), \tag{8}$$

a relation that we call the mutational covariance identity. In Appendix A, we show how this identity follows from the law of total covariance applied to the variables $X$ and $Y$. Additionally, the non-negativity of $\sigma^2$ yields the inequality $c_M(l) \le c_S(l) + c_R(l)$.

The three correlation functions are shown for different recombination rates $\gamma$ in Figure 2B,C. An important aspect of these

measurements is that since we simulate neutrally evolving sequences, the correlations exist only at population level, i.e. in the substitution variables which involve pairwise comparisons. Individual sequences in all cases are purely random, uncorrelated sequences of four letters. Since the process of reproduction builds correlations between pairs of sequences, while mutation breaks them, parameters related to reproduction, mutation, and recombination can all influence the shape of correlation functions.

Figure 2 Simulation results for population diversity and mutational correlations for Kingman (top row) and Bolthausen-Sznitman (bottom row) coalescents. Simulations used parameters $N = 1000$, $L = 1000$, $f_0 = 50$, and $\gamma = 10^{-4}$, except where indicated, and all had identical mutational divergence $N\mu = 0.1$. (A) The population diversity ($d$) and variance ($\sigma^2$) in a population are shown as functions of the recombination coverage rate $r$ for three different population sizes: $10^2$, $10^3$, and $10^4$. (B,C) Correlation functions are shown for different recombination rates ($\gamma$): $0$, $10^{-4}$, and $10^{-3}$. In (C) inset, structure correlation is shown for different population sizes with shapes corresponding to (A). Full and approximate analytical solutions are shown in solid and dashed lines, respectively. (D) Population variance and correlation functions at adjacent sites ($l = 1$) are shown as functions of $\gamma$.

In the absence of recombination, the mutational correlation $c_M$ is identically zero, as expected since mutations occur randomly throughout each sequence. DNA transfer causes pairs of sequences to become identical over random blocks of size $f_0$, thereby inducing local correlations which decay as a function of $l$ at a rate determined by $\gamma$; that is, the higher the recombination rate, the faster the decay of $c_M(l)$. However, the dependence of the magnitude of $c_M$ on $\gamma$ is non-monotonic: correlation is low for both low and high values of $\gamma$, with a maximum at intermediate values (see also Figure 2D), which we discuss below. The qualitative behavior of the rate correlation $c_R$ is similar to that of $c_M$, both as a function of $l$ (Figure 2B, inset) and $\gamma$ (Figure 2D), but with smaller overall magnitude. By definition, $c_R$ is constructed to detect correlation of substitution rates, and indeed, despite the fact that mutation rates in simulations are constant across all loci, DNA transfer reduces diversity over blocks of size $f_0$, hence it introduces a correlation length scale for substitution rates.

In contrast, the structure correlation $c_S$ is positive and constant in the absence of recombination, where its value is determined by drift-mutation balance, i.e. reproduction events build structure correlation while mutations destroy it (Figure 2C). Recombination causes $c_S$ to decay with $l$ at a rate determined by $\gamma$. To see why increasing recombination rate reduces the magnitude of structure correlation, we consider two sites separated by distance $l$. The recombination coverage rate $r$ at a single site can be partitioned into the rate $r_1(l) = \gamma l$ at which transferred fragments overlap only the one site but not the other, and the rate $r_2(l) = \gamma(f_0 - l)$ at which fragments span both sites, where $r_1(l) + r_2(l) = r$ (see Materials and Methods). For a given pair of sequences, the rate of one-site transfer is $4r_1$ since four sites are involved, while two-site transfers occur between the pair of sequences with rate $(2/N)r_2$. These two-site transfers will annihilate differences between the pair of individuals at both sites and build correlation, while any one-site transfer will break associations and reduce correlation. Since $N \gg 1$, $c_S$ is determined mainly by one-site transfers which destroy correlation, thus it decays monotonically with both $l$ (Figure 2C) and $\gamma$ (Figure 2D).

We plotted in Figure 2D the dependence of all three correlation functions as a function of the recombination rate $\gamma$ using the value $l = 1$, i.e. for proximate sites along the genome. The results indicate that the pair $c_M$ and $c_R$ behave qualitatively similarly, as do the pair $c_S$ and $\sigma^2$. These pair relations could be anticipated from the definitions in Eqs. (5), (6), and (7), since both $c_M$ and $c_R$ are calculated by taking covariance across the genome sequence and expectation over the population, whereas both $c_S$ and $\sigma^2$ involve the (co)variance over the population and

the expectation across the genome. In each case, however, covariance and expectation are taken in different orders, leading to four distinct quantities. Taking expectation before covariance reduces the overall magnitude of the correlation since averaging destroys transient correlations, hence cR ≤ cM and σ² ≤ cS. Indeed, the difference c′M ≡ cM − cR can be interpreted as measuring the portion of mutational correlations that result from equilibrium fluctuations within the population. By the mutational covariance identity, c′M = cS − σ², and knowing that cS and σ² are monotonically decreasing and have different half-maximal positions explains the pronounced peak of cM as a function of γ. Lastly, since σ² is the variance in pairwise distances, it is determined by the recombination coverage rate r = γ f0, while cS as discussed above is determined by r1(l) = lγ. For this reason, the correlations measured by cS(l = 1) persist for higher values of γ by a factor of f0 than the clusters measured by σ².

We note that all of the dependencies that were measured for correlation functions, genetic diversity d and variance σ² shown in Figure 2 were qualitatively very similar in both KC and BSC models. This opens the question of how these functional forms depend quantitatively on the coalescent process and model parameters, which is the subject of the following section.

Dependence of correlation functions and genetic diversity on the rate of recombination

Mathematical analysis of the models presented above was carried out in the context of a general, exchangeable coalescent model defined by a set of coalescence rates λb,k, with which any subset of k out of b ancestral lines coalesce into a single common ancestor (see Appendix B). Exact solution for the correlation functions could be obtained numerically as the solution of a system of linear equations with coefficients that are linear combinations of N, µ, γ, r1(l), r2(l), and λb,k (see Appendix D). The exact solution is shown in solid lines in all panels of Figure 2, indicating excellent agreement between simulations and calculations.

An analytically tractable form, however, could only be obtained by approximating the full model dynamics. To this end, we introduced a mean-field approximation for the one-site transfer process, which affects the allelic state of exactly one of the substitution variables X and Y of two sites. Instead of accounting for one-site transfers explicitly as we do in the full model, we approximate their effect as a mutational event that changes the substitution variable to its expected value d (see Appendix E). This approximation yields tractable expressions given below, which are shown as dashed lines in Figure 2. Generally, the agreement between approximate and full solutions is excellent for the Kingman model, while deviations are more pronounced under the BSC model though mainly for mutational and rate correlation functions.

The mean-field expression for average pairwise distance is

\[ d \approx \frac{\theta}{1 + r + \theta\tilde{a}}, \qquad (9) \]

where ã ≡ a/(a − 1), indicating that recombination has a very weak effect on the overall diversity (Figure 2A) since r, like µ, is orders of magnitude smaller than 1. Taking r ≪ 1, the mutational correlation is

\[ c_M(l) = \frac{2 w d^2 \rho_2(l)}{\left(1 + 2w\rho_1(l) + 2\theta\tilde{a}\right)\left(1 + 2w\rho + 2\theta\tilde{a}\right)}, \qquad (10) \]

where we define the one-site and two-site recombination coverage, respectively ρ1 ≡ Nr1 and ρ2 ≡ Nr2, with ρ1 + ρ2 = ρ, and where w = 2/3 (KC) or w = 1/2 (BSC). This expression confirms the basic intuition that two-site transfers build mutational correlation while one-site transfers destroy correlation. Since ρ1, ρ2, ρ ∼ γ, it also predicts that cM initially increases with increasing γ, goes through a maximum, and eventually decays to zero. Taking l = 1 above, we compute the value of γ at which cM is maximized (see Figure 2B) and obtain γ* = [µã + (2wN)^-1]/√f0, which has two regimes: if θ ≪ 1 then γ* ∼ 1/N, while if θ ≫ 1 then γ* ∼ µ. Calculation of the rate correlation yields

\[ c_R(l) \approx q\, c_M(l), \qquad (11) \]

where q < 1, which verifies that the overall shape of rate correlation is identical to mutational correlation, while its magnitude is smaller (see Appendix E).

The structure correlation takes the form

\[ c_S(l) = \frac{d^2}{1 + 2\theta\tilde{a} + 2w\rho_1(l)}, \qquad (12) \]

indicating that one-site transfer reduces correlation, and that two-site transfer plays a negligible role. The principal difference between cS and cM is their sensitivity to one- or two-site transfers: cM is strongly determined by ρ2, while cS is insensitive to it. Lastly, the variance σ² can be found by the large l limit of cS. In this limit, two loci are sufficiently far that a two-site transfer cannot occur, hence ρ2 = 0 and ρ1 = ρ. Substituting in (12), we find

\[ \sigma^2 = \frac{d^2}{1 + 2\theta\tilde{a} + 2w\rho}, \qquad (13) \]

which confirms our prediction that the values of γ at which σ² or cS are half-maximal (Figure 2B) differ by a factor of f0. Moreover, we recapitulate the collapse of curves shown in Figure 2A (inset) as a function of ρ. Importantly, these results indicate that while the effect of recombination on d is negligible compared to mutation (since r ≪ θ, Eq. 9), its impact on the fluctuations σ² and correlations is substantial and comparable to that of mutation (since ρ ≳ θ, Eq. 13).

We generalized the model above to account for recombination barriers, which can limit the efficiency of transfer depending on the divergence of donor and recipient sequences (Fraser et al. 2007). Assuming that transfer rates decay exponentially as a function of sequence divergence, we simulated and analytically calculated the impact on transfer rates within the population, and derived the relevant correction to the mean-field approximation (see Appendix H). This analysis demonstrates how recombination barriers quantitatively affect the magnitude of correlations in DNA sequences, and is useful in refining intuition about the basic structure of the theory. When applying our method to bacterial genomic datasets, however, our inference procedure described below accounts for recombination barriers in its basic assumptions, and does not require further corrections.

Mutational correlations in adapting populations

While our analysis can be applied for any exchangeable coalescent, genealogies at neutral sites are often linked to non-neutral sites that can undergo selection, which may violate the condition of exchangeability. In particular, in adapting populations in which fitness changes result from specific mutations that modulate individuals' reproductive rates, exchangeability does not

Figure 3 (panels A-E) Mutational correlation (cM) and population variance (σ²) in adapting populations. Simulation results are shown in circles, with error bars indicating standard error of the mean. Full analytical solutions either based on KC or BSC models are shown in dashed and solid lines, respectively. Panel (A) shows cM(l) for different values of the selection strength, s. Panels (B-E) show the dependence of cM(l = 1) and σ² on s and γ. The simulations used parameters N = 10^3, L = 10^3, µ = 10^-4, γ = 10^-4, µs = 10^-6, s = 10^-2, and f0 = 50, except where indicated.
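The mean-field expressions in Eqs. (9), (10), (12), and (13) are simple enough to evaluate directly. The following Python sketch is illustrative only and is not the authors' code. The function name `mean_field`, the default a = 4 (so that ã = 4/3), the relations ρ = Nγf0 and ρ1(l) = Nlγ (capped at ρ), and the treatment of θ as an independent input are all assumptions pieced together from the definitions given in the text.

```python
def mean_field(theta, gamma, N, f0, l=1, w=2/3, a=4):
    """Evaluate the mean-field forms of d, c_M(l), c_S(l) and sigma^2.

    Assumptions (not taken verbatim from the paper):
      - a = 4 allelic states, so a_tilde = a / (a - 1) = 4/3
      - r = gamma * f0 and rho = N * r
      - rho1(l) = N * l * gamma, capped at rho so that rho2 >= 0
      - theta is supplied directly rather than derived from N and mu
    w = 2/3 corresponds to KC and w = 1/2 to BSC, as stated in the text.
    """
    a_tilde = a / (a - 1)
    r = gamma * f0                      # recombination coverage rate
    rho = N * r                         # total recombination coverage
    rho1 = min(N * l * gamma, rho)      # one-site coverage
    rho2 = rho - rho1                   # two-site coverage
    d = theta / (1 + r + theta * a_tilde)                      # Eq. (9)
    base = 1 + 2 * theta * a_tilde
    c_M = (2 * w * d**2 * rho2
           / ((base + 2 * w * rho1) * (base + 2 * w * rho)))   # Eq. (10)
    c_S = d**2 / (base + 2 * w * rho1)                         # Eq. (12)
    sigma2 = d**2 / (base + 2 * w * rho)                       # Eq. (13)
    return {"d": d, "c_M": c_M, "c_S": c_S, "sigma2": sigma2}


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to illustrate the scan.
    for gamma in (1e-6, 1e-5, 1e-4, 1e-3, 1e-2):
        out = mean_field(theta=0.1, gamma=gamma, N=1000, f0=50)
        print(f"gamma={gamma:.0e}  c_M={out['c_M']:.3e}  "
              f"c_S={out['c_S']:.3e}  sigma2={out['sigma2']:.3e}")
```

With θ held fixed, scanning γ over a few decades in this way shows cM(l = 1) rising and then falling, while cS and σ² only decrease, which is the qualitative behavior described for Eqs. (10), (12), and (13).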

hold in general. To assess the magnitude of this effect on our predictions, we simulated an explicit sequence model with fitness effects in which both neutral and non-neutral sites exist within each genome (see Materials and Methods). Correlation functions were computed using the neutral sites, and in Figure 3 we show the effects of selection on cM(l). In the presence of DNA transfer, cM exhibits the characteristic decay that we observed for neutral models, across all values of the selection strength s (Figure 3A). Selection, however, diminishes the magnitude of correlations, leading to a more shallow decay.

We compared the simulation results of cM(l) for adapting populations with our predictions based on the exact solution for either KC or BSC models. Since the analytical results require knowledge of λ2,2, which is model-dependent, we infer λ2,2 from the measured population diversity, d, according to Eq. (26). We find that cM(l) of adapting populations lies between the two curves defined by the KC and BSC solutions (Figure 3A). As the selection strength increases, the adapting population transitions smoothly from the KC to the BSC curve (Figure 3B). This is consistent with previous results that showed how BSC statistics are obtained in the strong selection limit (Neher et al. 2013). These limiting coalescence statistics are model independent (e.g. Kingman's coalescent describes both Moran and Wright-Fisher models), hence the KC and BSC statistics are expected to provide the lower and upper bounds on correlation functions in many different models. We verified this to be the case using a substantially different adaptation model, in which fitness was controlled by a single locus (data not shown).

The effect of changing the recombination rate γ on cM in adapting populations is shown in Figure 3C, which shows a non-monotonic behavior similar to that seen in neutral populations (Figure 2D). We also observed a smooth transition from BSC to the KC statistics with increasing γ. For low recombination rates, the population is subject to strong selection, and cM is correctly predicted using the BSC model. As recombination rates increase, recombination decouples the selective loci, breaking linkage between selective sites and their neutral neighbors, and thus reduces the effects of linked-selection on the statistics at neutral sites. Deviations from BSC statistics are more readily detected using σ² (Figure 3D and 3E), where for either increasing selection strength or decreasing recombination rate, σ² is noticeably lower than the analytical predictions.

Parameter inference with sample selection bias

We have presented calculations of the correlation functions for randomly sampled individuals from a single large population. In principle, their functional forms could now be fit to measured correlations using bacterial sequencing data, provided that the sampled sequences constitute random individuals from the entire relevant population. In reality, there are strong biases associated with most sampling procedures. First, samples can often consist of very closely-related strains, particularly when considering pathogen samples from outbreaks, or strain samples from specific geographic locations which represent only a fraction of the total diversity within a species (Maynard Smith 1991). Second, sequencing of cultured samples involves selection bias for clones that are able to grow on specific media, which further reduces the sampled diversity. Third, different strains are often identified and grouped based on specific markers (e.g. resistance) or phenotypes (e.g. growth on selective media), which may or may not reflect their actual phylogenetic relationships, especially in the context of extensive recombination. Since all of these effects bias the sample composition in largely unknown ways, and thus most samples may not accurately reflect the genetic composition of the bulk population, accurate measurement of population genetic parameters must account for inherent

sampling biases. Here, we present an approach to accurately estimate bulk population parameters from biased samples consisting of closely related sequences.

Figure 4 (panels A, B; vertical axis: relative probability) Simulation results on inferring bulk population parameters from biased samples of closely-related sequences. Simulations used parameters N = 1000, L = 10000, f0 = 500, µ = 10^-4 and γ ranging from 10^-4 to 10^-3, and a total of 400 populations were simulated for a total time much longer than the coalescent time. For each population, we constructed a biased sample by selecting the cluster of five sequences having the lowest average pairwise coalescent time. A. Measurement of the sample's P̃_{s,2}^{(2)}(l) (circles) and the bulk population's P̃_2^{(2)}(l) ≡ P_2^{(2)}(l)/d (triangles) as a function of distance l, with γ = 10^-4. The blue curve is the analytical form given in Eq. 15, the red line is d(1 − l/f̄), and the horizontal dashed line indicates the bulk population diversity, d. B. Inferred values of bulk population parameters θ and φ, and fragment size f̄, are shown relative to their true values (open circles). The diversity of the biased sample, ds, is shown relative to the population diversity, d (filled circles). For each biased sample of five sequences, P̃_{s,2}^{(2)}(l) was calculated and fitting to Eq. 16 was used to infer θ, φ, and f̄. Fitting was performed over the range 1 ≤ l ≤ 50 using nonlinear least squares (R Core Team 2016).

We have shown that recombination reduces only very slightly the diversity of the bulk population (Fig. 2A and Eq. 9). However, in a sample of closely-related sequences (where relatedness is measured by the coalescent time of the strictly vertical tree of cell divisions) recombination increases the diversity of the sample by importing DNA fragments from more distant sequences. We denote the average diversity of the imported fragments by d, which we take to be larger than the diversity within the vertically-inherited portion of the sampled sequences. To calculate the average diversity ds within the sample, including both vertically-inherited and horizontally-acquired sequences, we consider a pair of sampled sequences that coalesce at time t ago. Vertically-inherited portions have a mutational divergence θs ≡ 2µt, while recombined portions have an expected divergence of θ and corresponding diversity d. The recombination coverage for the given pair is ρs ≡ 2rt, and as shown in Appendix F, for closely-related sequences (i.e. where θs, ρs ≪ 1) that recombine fragments from a much larger population (i.e. where ρ ≫ 1) the diversity of the sample is given by

\[ d_s \approx \rho_s d. \qquad (14) \]

The correlation functions cM, cS, and cR in the mean-field approximation can be expressed in terms of the function P_2^{(2)}(l), which is analytically derived in Appendix E, and takes the form

\[ P^{(2)}_2(l) = d^2 \left( \frac{1}{1 + 2\theta\tilde{a} + 2w\phi l} + 1 \right), \qquad (15) \]

where φ ≡ Nγ is the bulk population's recombinational divergence, i.e. the average number of recombination events per site since coalescence. This function expresses the probability that two sites separated by distance l both have a substitution in a randomly chosen pair of individuals, and the above expression is the analytical result for the bulk population. When the same function is computed within a sample of closely-related sequences, denoted by P_{s,2}^{(2)}(l), we obtain

\[ P^{(2)}_{s,2}(l) \approx \frac{d_s}{d} \left( 1 - \frac{l}{\bar{f}} \right) P^{(2)}_2(l), \qquad (16) \]

which is accurate for l ≪ f̄. Remarkably, we see that the quantity P̃_{s,2}^{(2)} ≡ P_{s,2}^{(2)}/ds depends only on the bulk population parameters θ and φ, and on the mean fragment size f̄, and does not involve the sample-specific quantities ρs and θs. Fitting this functional form therefore allows us to infer the bulk population parameters despite the biases inherent in natural population sampling.

As shown from simulation results (Fig. 4A), the above expression exhibits a crossover between two regimes. For l ≪ f̄, a substitution at both sites is most likely the result of an external two-site transfer, hence we have P̃_{s,2}^{(2)} ≈ P̃_2^{(2)} ≡ P_2^{(2)}/d, which exhibits a hyperbolic decay (blue curve). As l increases, the sites are increasingly likely to have experienced a one-site transfer, which results in the linear decrease of P̃_{s,2}^{(2)}(l) with a slope of 1/f̄ (red line). We note that the expression computed for the biased sample (16) involves the fragment size f̄, while the same expression for the bulk population (15) does not. Intuitively, in a sample of closely related sequences recombination coverage is small (ρs ≪ 1 per site) because the coalescence time is short, hence the boundaries of externally transferred fragments are relatively clear and the mean fragment size can be deduced. In the bulk population, coalescence times are long (∼ N/2) and recombination coverage is large (ρ ≫ 1), which results in overlapping fragments reducing the ability to infer fragment sizes. Sampling bias is therefore not exclusively a nuisance, since it

presents the opportunity to infer the mean size of transferred fragments.

To test how well one can infer bulk population parameters based on strongly biased samples, for each simulated population we sampled a small number of closely related sequences. As shown in Fig. 4B, in these biased samples the diversity ds (filled circles) is a small fraction of the bulk population's diversity d = 0.088. Depending on the recombination rate, these samples represent from about 1% to 10% of the total diversity, i.e. ds ≈ 0.001 − 0.01. We calculated P̃_{s,2}^{(2)}(l) for the samples, and by fitting Eq. (16) to the data we inferred the bulk population parameters θ and φ, and f̄, which are shown in Fig. 4B. These results indicate that over a realistic range of sample diversity and population transfer rates one can accurately infer bulk population parameters using strongly biased samples.

Using correlated mutations to infer recombination rates and global diversity of bacterial populations

We used whole genome sequences from a recently emerged multidrug-resistant Escherichia coli clone, sequence type 131 (ST131) containing 185 isolates (Price et al. 2013; Petty et al. 2014; Ben Zakour et al. 2016), and 1,216 Streptococcus pneumoniae isolates in a longitudinal pneumococcal carriage study (Chewapreecha et al. 2014), two species that are well-known for horizontal transfer and recombination (Lorenz and Wackernagel 1994). For each dataset, and several subtypes or clades, we calculated sample diversity and correlation functions, and inferred the bulk population parameters (Table 2) by fitting P̃_{s,2}^{(2)}(l) (see Appendix G for details of sequence analysis). As shown in Figure 5, all correlation functions of synonymous substitutions exhibit a decay pattern in both species, indicating the presence of recombination. The close agreement between measurements (circles) and predictions of the structure and rate correlation functions (solid green and blue lines) indicates that our model captures the essential features of the underlying transfer process across different clades of both species.

Parameter fitting yielded values of mean fragment size f̄, mutational divergence θ, and recombinational divergence φ (Table 2). For each sample, we directly measure the sample diversity ds, and using the fitted value of θ we calculate the global diversity d ≈ θ(1 + θã)^-1 (Eq. 9) and the sample's recombination coverage ρs ≈ ds/d (Eq. 14). From the inferred parameters we calculate ρ = φ f̄, and use the identity ρs θ = ρ θs to obtain the value of θs. The sample's mutational divergence θs provides a measure of the age of the sample since coalescence, t̄, and can be converted to generations if the mutation rate µ is known, i.e. t̄ = θs/(2µ). We also calculate the ratio φ/θ = γ/µ which gives the relative rate of recombination to mutation in the population. Another useful quantity is the ratio of the number of single nucleotide polymorphisms (SNPs) that are brought into a sample by recombination to the number of SNPs that are due to de novo mutations. This ratio is given by ρs θ/θs = ρ, which is the bulk population's recombinational coverage, and thus it does not depend on the sample itself.

Table 2 lists the inferred values of the model parameters. As shown above, our inference method is applicable for closely-related samples of sequences, in which ρs, θs ≪ 1 and ρ ≫ 1; and indeed these conditions hold for the inferred values, indicating that the samples consist of closely-related sequences that exchange DNA fragments with a much more diverse population. In E. coli, when considering all sequences as a single sample, we infer the bulk population's mutational divergence as approximately 7%, while sequences within the sample differ by ≈ 0.2%. Yet the sample's mutational divergence, θs = 1.2 × 10^-5, indicates that only about 0.001% of the genome has mutated since the coalescence, such that the vast majority of the sample's diversity has been acquired from external transfer events. Similarly, using the sample of all S. pneumoniae sequences, the bulk population exhibits 10% divergence, while sequences within the sample differ by 0.8% and the sample's mutational divergence, θs = 0.01%, indicates again that diversity has been acquired largely from external sequences. The high population diversities also imply that the recombination barrier within species might not be particularly strong, since recombination evidently can still occur between sequences with divergence as large as ten percent.

When analyzing separate clades within each species, we infer bulk population divergence levels θ that are relatively similar, with θ = 0.07 − 0.09 in E. coli and θ = 0.07 − 0.15 in S. pneumoniae, despite large variations in sample diversity among clades. This result is consistent with the separate isolates having access to a single shared gene pool via recombination. Likewise, the average sizes of recombination fragments f̄ are consistent across clades in both species, ranging from 0.5 to 2 kb, which is comparable to the length of a typical gene. One exception is clade BC4-6B of S. pneumoniae, which has an inferred size of 23.5 kb. It should be noted that our fitting process did not assume any particular distribution for sizes of recombination fragments, provided that l is much smaller than transferred fragment sizes, a condition that is satisfied here since fitting was performed for l < 150 bp.

Notably, the bulk population's recombination coverage, ρ, ranges from 73 to 240 for clades of E. coli and from 48 to 2200 for clades of S. pneumoniae, indicating that in each species two randomly chosen sequences would have recombined so extensively since divergence that transfers would cover each recombining region many times over. As mentioned above, the value of ρ inferred for each sample is equal to the ratio of the number of polymorphisms due to recombination versus the number of de novo mutations. In both species ρ ≫ 1, indicating that the diversity introduced by recombination dominates sample diversity, which is consistent with previous results (Price et al. 2013; Petty et al. 2014; Chewapreecha et al. 2014). Yet, since ρ varies substantially from clade to clade, it suggests that real variability exists between subpopulations as far as the portions of the shared gene pool that each is able to access. Indeed, the inferred recombinational divergence φ shows substantial variation among clades in both species, with φ = 0.048 − 0.347 for clades of E. coli and φ = 0.02 − 0.84 for clades of S. pneumoniae. Since mutational divergence is relatively consistent across clades, while γ/µ varies widely from 0.7 to 3.87 in E. coli and from 0.65 to 6.55 in S. pneumoniae, it appears the variation in φ is not related to differences in population size, and likely arises from differences in access to the shared gene pool.

Lastly, we observe large variation across clades of the sample mutational divergence θs, which provides a measure of each sample's age since coalescence. Within S. pneumoniae, θs varies from 6.0 × 10^-7 to 4.6 × 10^-5, i.e. nearly two orders of magnitude, while in E. coli θs varies from 1.3 × 10^-6 to 1.4 × 10^-5, or one order of magnitude. The entire collection of S. pneumoniae sequences is in fact substantially older than any given clade, with θs = 1.1 × 10^-4; while in E. coli the sample as a whole has θs = 1.2 × 10^-5 which is very similar to the age of its oldest clades. These ages can be converted to generations using

known values of the mutation rates in each species. For example, µ = 2.2 × 10^-10 in E. coli (Lee et al. 2012), and using this value, the age of the entire sample is found to be 55,000 generations. While generation times cannot be easily assessed in bacteria, we conservatively assume one generation per day, using which we estimate that the sampled strains diverged approximately 150 years ago. We note that since laboratory measurements of µ may not reflect the mutation rates in natural environments, and variation in mutation rates could exist between strains, genomic regions, and conditions, mapping mutational divergence to generations remains an open problem in the field.

Figure 5 Whole-genome sequence analysis of natural E. coli and S. pneumoniae isolates. Measured correlations of synonymous substitutions are shown as circles for P̃_{s,2}^{(2)}(l) (black) as well as c̃M(l) (red), c̃R(l) (blue) and c̃S(l) (green), where c̃x(l) ≡ cx(l)/ds for x = M, R, S. Dashed black line corresponds to the best fit of P̃_{s,2}^{(2)}(l) using the form given in Eq. 16. Parameter values are given in Table 2. The solid colored lines correspond to the predictions of the three correlation functions based on the fit, using the BSC model (see Appendix E). We note that the excellent fit of c̃M(l) is not surprising since this correlation function is determined entirely by P̃_{s,2}^{(2)}(l), which was fit, while the predictions of c̃R(l) and c̃S(l) present an independent test of the theory. The dashed green and blue lines correspond to results of fitting q, indicating that deviations from the predictions are due mainly to the choice of coalescent model, which can be inferred and used to improve the prediction. The fitted values of q range from 0.22 − 0.48, where q = 0.22 for KC and q = 0.33 for BSC coalescents, indicating that population tree structures often follow KC or BSC statistics, but certain clades may exhibit more general coalescent statistics.

Discussion

We presented a mutational correlation-based methodology for quantifying recombination rates in bacterial populations, which we showed can provide accurate estimates using whole-genome data. Our analytical calculation of the correlation functions enables accurate and rapid fitting to infer DNA recombination parameters, as well as a self-consistency check on the model provided by comparing the predicted correlation functions with measurements (see Fig. 5). By decomposing the mutational correlation function cM(l) into a sum of several distinct terms (Eq. 8), we provided an intuitive description of how various types of correlations are related within a population, partitioning observed correlations into meaningful components. We showed by simulations and theory how the correlation functions depend on the population genetic parameters (Fig. 2).

We compared our inferred parameters to those of previous work. In E. coli, we found that the ratio of recombinational to mutational events γ/µ = 2.1 inferred by our models is smaller than the 2.5 inferred in (Touchon et al. 2009), larger than the 1.0 estimated in (Didelot et al. 2012), and much larger than the 0.31 determined by (Dixit et al. 2015). However, as we noted in our clade-by-clade analysis, these values can vary substantially between clades of even the same species. Since previous studies used different sets of strains, it is not surprising that there exists a

                                                      |------ Best fit ------|  |----------- Calculated -----------|
Species        Clade               Strains  ds        θ       φ       f̄        γ/µ   ρ      θs            ρs

E. coli        All                 185      0.0018    0.071   0.147   990      2.1   150    1.2 × 10^-5   2.8 × 10^-2
               Clade A             15       0.0005    0.069   0.048   1500     0.7   73     7.2 × 10^-6   8.4 × 10^-3
               Clade B             43       0.0014    0.072   0.148   710      2.1   100    1.4 × 10^-5   2.2 × 10^-2
               Clade C (C1 + C2)   120      0.0003    0.090   0.346   700      3.9   240    1.3 × 10^-6   3.9 × 10^-3

S. pneumoniae  All                 1216     0.0080    0.101   0.138   540      1.4   75     1.1 × 10^-4   8.9 × 10^-2
               BC1-19F             365      0.0011    0.140   0.314   720      2.2   230    5.0 × 10^-6   9.6 × 10^-3
               BC2-23F             213      0.0015    0.129   0.843   710      6.6   600    2.6 × 10^-6   1.4 × 10^-2
               BC3-NT (all)        202      0.0043    0.131   0.284   1800     2.2   510    8.6 × 10^-6   3.9 × 10^-2
                 serotype 14       74       0.0009    0.069   0.020   6100     0.3   120    7.5 × 10^-6   1.4 × 10^-2
                 NT isolates       128      0.0050    0.146   0.366   2000     2.5   730    6.8 × 10^-6   4.0 × 10^-2
               BC4-6B              126      0.0013    0.103   0.095   23000    0.9   2200   6.0 × 10^-7   1.5 × 10^-2
               BC5-23A/F           106      0.0068    0.096   0.252   590      2.6   150    4.6 × 10^-5   8.0 × 10^-2
               BC6-15B/C           102      0.0017    0.084   0.161   490      1.9   79     2.1 × 10^-5   2.2 × 10^-2
               BC7-14              102      0.0012    0.067   0.043   1100     0.7   48     2.6 × 10^-5   2.0 × 10^-2

Table 2 Best-fit parameters for natural bacterial populations. The values of the three fitted parameters, θ, φ, and f̄, are given for each dataset, and correspond to the best fitting function for P̃_{s,2}^{(2)} shown in Fig. 5 (dashed black line). The mean fragment size f̄ is reported in basepairs.

range of estimates, and what is remarkable, given the different methodologies and datasets, is that the various methods are all within an order of magnitude of each other. Our approach has the distinct advantage that it is extremely rapid and computationally efficient, and can therefore handle very large datasets with ease. Since our method is formulated using a population genetics model, it provides meaningful connections between a large body of theoretical work and measurable quantities such as mutational correlation functions. Importantly, we identified sample selection bias, which is prevalent in most datasets, as a potential source of errors in parameter estimation, and by analyzing the correlation functions for samples of closely related sequences, we showed how to account for its effects.

In our analysis of S. pneumoniae, using the genomic sequences from (Chewapreecha et al. 2014), we found the ratio γ/µ ranges from 0.3 to 6.6 across clades, while measured across the set of all clades γ/µ = 1.4. These values are larger than the previous estimates, which range from 0.1 to 0.35 and were obtained by identifying recombination events as contiguous clusters of SNPs (Chewapreecha et al. 2014). As noted in previous studies (Croucher et al. 2011; Chewapreecha et al. 2014; Croucher et al. 2015), calling recombination events on a locus-by-locus basis is confounded by overlapping recombination events, which often cannot be detected, and is strongly affected by the age of the sample. For this reason, such methods are typically conservative, and their estimates are in effect lower bounds. Despite the order of magnitude difference in our parameter estimates, our method recapitulates the basic biological finding of the previous work, which showed that encapsulated pneumococci have consistently lower rates than non-encapsulated strains. This result is seen in the BC3-NT clade (Table 2), where serotype 14 isolates are encapsulated (γ/µ = 0.3) while the NT isolates are non-encapsulated (γ/µ = 2.5). In this regard, our analysis attributes a much bigger effect to encapsulation, which we infer reduces recombination rates by more than eightfold, whereas the previous analysis inferred less than a twofold effect.

The distinct correlation signature of recombination in bacterial genomes enables us to infer several additional quantities that are of broad interest for studies of microbial population structure and phylogeny. First, using closely-related genome samples, we are able to infer the bulk population's diversity d and divergence θ. Remarkably, our inferred divergence for the bulk population of E. coli (θ = 0.07) closely matches the species' diversity measured across a collection of global samples (θ = 0.075; Dixit et al. 2015), which provides additional confirmation of the methodology. Second, our approach provides an estimate of the age of each sample without comparison to an outgroup. Existing methods attempt to partition SNPs into those internal to a sample (de novo mutations) versus external SNPs (arising from recombination), and thus encounter difficulties with older samples or overlapping recombination fragments. In contrast, our method compares the decay of mutational correlations with the standing diversity of the sample, and uses the predicted quantitative relations to infer the various rates. Without partitioning SNPs, and thus avoiding the known technical difficulties, we are able to infer the rate of de novo mutations in a sample, i.e. its mutational divergence θs, and therefore its relative age.

Within the populations assayed here, we find a wide range of sample ages, e.g. the oldest S. pneumoniae clade is about eighty times the age of the youngest clade. We do not find any meaningful correlation between a sample's age and the

recombination parameters (γ/µ, ρ) or diversity (θ, d) for the bulk population with which it exchanges DNA, indicating that our method correctly models the overall scaling of divergence with time and appropriately removes such effects. Thus, we believe our method provides a reliable tool for comparing sample ages, although such comparisons inherently assume a relatively constant molecular clock, i.e. a value of µ that does not vary substantially between samples of the same species. At the same time, we expect that comparisons between species may be far less meaningful, since neutral evolutionary rates can differ substantially due to basic molecular differences and environmental factors that influence mutation rates.

Finally, given the prevalence of recombination in bacteria, our analysis can be used to quantify the utility (or futility) of inferring a phylogenetic tree for a given sample of DNA sequences. Construction of a tree based on sequence similarity requires some portion of the sequence to contain vertically-inherited (i.e. clonal) information. Each recombination event partially degrades the vertical signal in the data, and the recombination coverage ρs determines the proportion of the total sequence that has been degraded. Once ρs reaches a value of 1, we expect no remaining vertical signal in the data, hence such a value indicates that phylogenetic reconstruction of the sample would be futile. The values of ρs inferred from the samples we analyzed are all relatively low, ranging from about 0.1–1% coverage (Table 2), indicating that there exists a vertical signal in the data. However, reliable tree construction additionally requires a sufficient number of de novo mutations that track the vertical signal, quantified by the value of θs times the genome size L. For certain samples, the number of SNPs may be too low to construct a meaningful tree. Additionally, even when a sufficient number of SNPs is expected to be present, one must reliably identify the SNPs that correspond to the clonal signal, which typically constitute a small fraction of the sample diversity ds.

For example, for the largest clade of the E. coli sequences ρs = 0.004 (clade C, Table 2), indicating that for each pair of sequences more than 99% of the genome has been vertically inherited, which seems like good news for tree building. However, θs = 1.3 × 10⁻⁶ for this sample, meaning that between any pair of sequences fewer than 10 single nucleotide differences correspond to the vertical signal. These few crucial differences must be recognized from among >1,000 differences that exist between any pair of individuals in the sample, the majority of which are due to recombination events. Finding these few informative needles in the haystack of SNPs is the major challenge for sequence-based phylogeny, and while the problem may be quite difficult, at least in such cases one is relatively certain that a vertical signal exists in the data.

Much more severe, and possibly insurmountable, problems exist in establishing reliably the phylogeny of sequences from the bulk population of a bacterial species, where the relevant recombinational coverage is measured by ρ. Across the populations that we analyzed ρ = 48–2200, indicating that for any pair of sequences sampled from the bulk population, their genomes have been covered many times over by recombinational transfers since coalescence. We expect no vertical signal would remain in the data. Since the value of ρ represents a genome-wide average, it is of course possible that some genomic regions could exhibit extremely low recombination rates, and thus have effectively much lower values of ρ, which may enable phylogenetic inference in the bulk population, assuming ρ ≪ 1 in these regions. In the best case scenario, one would again be faced with identifying a tiny fraction of SNPs that capture the vertical signal, only now in a much smaller portion of the genome. Our results therefore indicate that phylogenetic inference in bacteria may in most cases be an extremely challenging problem, and our methodology provides the basic measures that quantify the (f)utility of tree building in any given case.

In conclusion, our new approach for inferring within-population recombination rates, based on correlation functions, provides a framework for measurements in a wide range of sequencing datasets. Further generalization of our method could incorporate variability in recombination rates across genomes, as well as modeling explicitly population subdivision and community structures. We expect that this framework and its generalizations will find fruitful application in inferring recombination rates within species, as well as providing a useful starting point for analyses of much more complex sets of data such as metagenomic samples.

Acknowledgements

We wish to thank Sergei Maslov, Erik van Nimwegen, Jane Carlton, Alexander Grosberg, Charles Peskin, Matthew Rockman, Wei-Hsiang Lin, and Long Qian for valuable discussions, as well as three anonymous referees for their comments on the manuscript. This work was funded by a Human Frontier Science Program Young Investigators’ grant.

Literature Cited

Andam, C. P. and J. P. Gogarten, 2011 Biased gene transfer in microbial evolution. Nature Reviews Microbiology 9: 543–555.
Ansari, M. A. and X. Didelot, 2014 Inference of the properties of the recombination process from whole bacterial genomes. Genetics 196: 253–265.
Ben Zakour, N. L., A. S. Alsheikh-Hussain, M. M. Ashcroft, N. T. Khanh Nhu, L. W. Roberts, M. Stanton-Cook, M. A. Schembri, and S. A. Beatson, 2016 Sequential acquisition of virulence and fluoroquinolone resistance has shaped the evolution of Escherichia coli ST131. MBio 7: e00347–16.
Bolthausen, E. and A. Sznitman, 1998 On Ruelle’s probability cascades and an abstract cavity method. Communications in Mathematical Physics 197: 247–276.
Brunet, E., B. Derrida, A. Mueller, and S. Munier, 2007 Effect of selection on ancestry: an exactly soluble case and its phenomenological generalization. Physical Review E 76: 041104.
Brunet, E., B. Derrida, and D. Simon, 2008 Universal tree structures in directed polymers and models of evolving populations. Phys Rev E 78: 061102.
Chewapreecha, C., S. R. Harris, N. J. Croucher, C. Turner, P. Marttinen, L. Cheng, A. Pessia, D. M. Aanensen, A. E. Mather, A. J. Page, S. J. Salter, D. Harris, F. Nosten, D. Goldblatt, J. Corander, J. Parkhill, P. Turner, and S. D. Bentley, 2014 Dense genomic sampling identifies highways of pneumococcal recombination. Nat Genet 46: 305–9.
Cohen, E., D. A. Kessler, and H. Levine, 2005 Recombination dramatically speeds up evolution of finite populations. Physical Review Letters 94: 098102.
Croucher, N. J., S. R. Harris, C. Fraser, M. A. Quail, J. Burton, M. van der Linden, L. McGee, A. von Gottberg, J. H. Song, K. S. Ko, B. Pichon, S. Baker, C. M. Parry, L. M. Lambertsen, D. Shahinas, D. R. Pillai, T. J. Mitchell, G. Dougan, A. Tomasz, K. P. Klugman, J. Parkhill, W. P. Hanage, and S. D. Bentley, 2011 Rapid pneumococcal evolution in response to clinical interventions. Science 331: 430–434.

Croucher, N. J., A. J. Page, T. R. Connor, A. J. Delaney, J. A. Keane, S. D. Bentley, J. Parkhill, and S. R. Harris, 2015 Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res 43: e15.
Desai, M. M., A. M. Walczak, and D. S. Fisher, 2013 Genetic diversity and the structure of genealogies in rapidly adapting populations. Genetics 193: 565–585.
Didelot, X. and D. Falush, 2007 Inference of bacterial microevolution using multilocus sequence data. Genetics 175: 1251–1266.
Didelot, X., D. Lawson, A. Darling, and D. Falush, 2010 Inference of homologous recombination in bacteria using whole-genome sequences. Genetics 186: 1435–1449.
Didelot, X., G. Méric, D. Falush, and A. E. Darling, 2012 Impact of homologous and non-homologous recombination in the genomic evolution of Escherichia coli. BMC genomics 13: 256.
Dixit, P. D., T. Y. Pang, F. W. Studier, and S. Maslov, 2015 Recombinant transfer in the basic genome of Escherichia coli. Proceedings of the National Academy of Sciences 112: 9070–9075.
Feingold, D. G. and R. S. Varga, 1962 Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem. Pacific J. Math 12: 1241–1250.
Fraser, C., E. J. Alm, M. F. Polz, B. G. Spratt, and W. P. Hanage, 2009 The Bacterial Species Challenge: Making Sense of Genetic and Ecological Diversity. Science 323: 741–746.
Fraser, C., W. P. Hanage, and B. G. Spratt, 2007 Recombination and the nature of bacterial speciation. Science 315: 476–480.
Garrison, E. and G. Marth, 2012 Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907.
Hanage, W. P., B. G. Spratt, K. M. E. Turner, and C. Fraser, 2006 Modelling bacterial speciation 361: 2039–2044.
Higgs, P. G. and B. Derrida, 1992 Genetic distance and species formation in evolving populations. Journal of Molecular Evolution 35: 454–465.
Hudson, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theoretical population biology 23: 183–201.
Kingman, J., 1982 On the genealogy of large populations. Journal of Applied Probability 19: 27.
Koonin, E. V. and Y. I. Wolf, 2008 Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic acids research 36: 6688–6719.
Kuzminov, A., 1999 Recombinational repair of DNA damage in Escherichia coli and bacteriophage λ. Microbiology and molecular biology reviews 63: 751–813.
Lee, H., E. Popodi, H. Tang, and P. L. Foster, 2012 Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by whole-genome sequencing. Proc Natl Acad Sci USA 109: E2774–83.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup, 2009 The sequence alignment/map format and SAMtools. Bioinformatics 25: 2078–9.
Lorenz, M. G. and W. Wackernagel, 1994 Bacterial gene transfer by natural genetic transformation in the environment. Microbiological reviews 58: 563–602.
Marttinen, P., W. P. Hanage, N. J. Croucher, T. R. Connor, S. R. Harris, S. D. Bentley, and J. Corander, 2012 Detection of recombination events in bacterial genomes from large population samples. Nucleic Acids Research 40: e6.
Maynard Smith, J., 1991 The population genetics of bacteria. Proceedings: Biological Sciences 245: 37–41.
McVean, G., P. Awadalla, and P. Fearnhead, 2002 A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231–1241.
McVean, G. A., S. R. Myers, S. Hunt, P. Deloukas, D. R. Bentley, and P. Donnelly, 2004 The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.
Moran, P. A. P., 1958 Random processes in genetics. Mathematical Proceedings of the Cambridge Philosophical Society 54: 60–71.
Neher, R. A. and O. Hallatschek, 2013 Genealogies of rapidly adapting populations. Proceedings of the National Academy of Sciences of the United States of America 110: 437–442.
Neher, R. A., T. A. Kessinger, and B. I. Shraiman, 2013 Coalescence and genetic diversity in sexual populations under selection. Proceedings of the National Academy of Sciences of the United States of America 110: 15836–15841.
Neher, R. A., B. I. Shraiman, and D. S. Fisher, 2010 Rate of adaptation in large sexual populations. Genetics 184: 467–481.
Oren, Y., M. B. Smith, N. I. Johns, M. K. Zeevi, D. Biran, E. Z. Ron, J. Corander, H. H. Wang, E. J. Alm, and T. Pupko, 2014 Transfer of noncoding DNA drives regulatory rewiring in bacteria. Proceedings of the National Academy of Sciences 111: 16112–16117.
Petty, N. K., N. L. Ben Zakour, M. Stanton-Cook, E. Skippington, M. Totsika, B. M. Forde, M.-D. Phan, D. Gomes Moriel, K. M. Peters, M. Davies, B. A. Rogers, G. Dougan, J. Rodriguez-Baño, A. Pascual, J. D. D. Pitout, M. Upton, D. L. Paterson, T. R. Walsh, M. A. Schembri, and S. A. Beatson, 2014 Global dissemination of a multidrug resistant Escherichia coli clone. Proc Natl Acad Sci U S A 111: 5694–9.
Pitman, J., 1999 Coalescents with multiple collisions. The Annals of Probability 27: 1870–1902.
Price, L. B., J. R. Johnson, M. Aziz, C. Clabots, B. Johnston, V. Tchesnokova, L. Nordstrom, M. Billig, S. Chattopadhyay, M. Stegger, P. S. Andersen, T. Pearson, K. Riddell, P. Rogers, D. Scholes, B. Kahl, P. Keim, and E. V. Sokurenko, 2013 The epidemic of extended-spectrum-β-lactamase-producing Escherichia coli ST131 is driven by a single highly pathogenic subclone, H30-Rx. MBio 4: e00377–13.
R Core Team, 2016 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Ran, R.-S. and T.-Z. Huang, 2006 The inverses of block tridiagonal matrices. Applied mathematics and computation 179: 243–247.
Ravenhall, M., N. Škunca, F. Lassalle, and C. Dessimoz, 2015 Inferring horizontal gene transfer. PLoS Computational Biology 11.
Rosen, M. J., M. Davison, D. Bhaya, and D. S. Fisher, 2015 Fine-scale diversity and extensive recombination in a quasisexual bacterial population occupying a broad niche. Science 348: 1019–1023.
Sagitov, S., 1999 The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability 36: 1116–1125.
Schweinsberg, J., 2000 Coalescents with simultaneous multiple collisions. Electronic Journal of Probability 5: 1–50.
Schweinsberg, J., 2012 Dynamics of the evolving Bolthausen-Sznitman coalescent. Electronic Journal of Probability 17.
Shapiro, B. J., J. Friedman, O. X. Cordero, S. P. Preheim, S. C. Timberlake, G. Szabó, M. F. Polz, and E. J. Alm, 2012 Population genomics of early events in the ecological differentiation

of bacteria. Science 336: 48–51.
Soucy, S. M., J. Huang, and J. P. Gogarten, 2015 Horizontal gene transfer: building the web of life. Nature Reviews Genetics 16: 472–482.
Thomas, C. M. and K. M. Nielsen, 2005 Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nature Reviews Microbiology 3: 711–721.
Touchon, M., C. Hoede, O. Tenaillon, V. Barbe, S. Baeriswyl, P. Bidet, E. Bingen, S. Bonacorsi, C. Bouchier, O. Bouvet, et al., 2009 Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet 5: e1000344.
Vos, M. and X. Didelot, 2009 A comparison of homologous recombination rates in bacteria and archaea. The ISME journal 3: 199–208.
Weissman, D. and N. H. Barton, 2012 Limits to the rate of adaptive substitutions in sexual populations. PLoS Genet 8: e1002740.
Weissman, D. and O. Hallatschek, 2014 The rate of adaptation in large sexual populations with linear chromosomes. Genetics 196: 1167–1183.
Williams, D., J. P. Gogarten, and R. T. Papke, 2012 Quantifying homologous replacement of loci between haloarchaeal species. Genome biology and evolution 4: 1223–1244.
Wiuf, C., 2000 A coalescence approach to gene conversion. Theoretical population biology 57: 357–367.
Wiuf, C. and J. Hein, 2000 The coalescent with gene conversion. Genetics 155: 451–462.

Appendix

A. Covariance decomposition and the law of total covariance

The law of total covariance states that

\[ \mathrm{Cov}(X, Y) = \mathrm{E}(\mathrm{Cov}(X|Z, Y|Z)) + \mathrm{Cov}(\mathrm{E}(X|Z), \mathrm{E}(Y|Z)) \tag{17} \]

As in Materials and Methods, we define random variables for two sites separated by distance $l$ as $X = S_i^k$ and $Y = S_{i+l}^k$, where $i$ is a site and $k$ is a sequence pair. We note that $\mathrm{E}(X|k) = \mathrm{E}(Y|k)$, since the expectation is taken over the same set of sites in both cases. The expression $\sigma^2 = \mathrm{Var}(\mathrm{E}(S|k))$ can therefore be written equivalently as $\sigma^2 = \mathrm{Cov}(\mathrm{E}(X|k), \mathrm{E}(Y|k))$. Together with Eq. (5) and the law of total covariance (setting $Z = k$), we have

\[ c_M + \sigma^2 = \mathrm{Cov}(X, Y) . \tag{18} \]

Similarly, using Eqs. (6) and (7) (setting $Z = i$), we obtain

\[ c_S + c_R = \mathrm{Cov}(X, Y) . \tag{19} \]

The total covariance $c_T(l) \equiv \mathrm{Cov}(X, Y)$ is the covariance of the substitution variables at any pair of sites separated by distance $l$ (i.e. over all sites and all sequence pairs). It can be computed either by conditioning on the sequence pair or by conditioning on the site; either way the result is of course the same, which yields the covariance decomposition

\[ c_M + \sigma^2 = c_S + c_R . \tag{20} \]

While such a decomposition would be possible for any two-dimensional arrangement of random variables (e.g. magnetic spins on a lattice), one interesting feature in our population genetics application is that a variance appears on the left-hand side rather than a covariance. This happens because the random variables $\mathrm{E}(X|k)$ and $\mathrm{E}(Y|k)$ turn out to be identical, which is due to the circular genome. The same result also holds to excellent approximation for a very large genome, with corrections of order $1/L$. Therefore we obtain the mutational covariance inequality

\[ c_S + c_R \geq c_M . \tag{21} \]

B. Coalescence Rates

The general exchangeable coalescent model is defined by a set of coalescence rates $\lambda_{b,k}$ with which any subset of $k$ out of $b$ ancestral lines coalesce into a single common ancestor. These rates satisfy a self-consistency relation $\lambda_{b,k} = \lambda_{b+1,k} + \lambda_{b+1,k+1}$ (see Pitman 1999; Sagitov 1999; Schweinsberg 2000; Brunet et al. 2008). For the Moran model, we have for all $b \geq 2$, $\lambda_{b,2} = G/\binom{N}{2}$, since $\binom{N}{2}$ is the total number of pairs that could coalesce; and since only pairwise mergers are possible, $\lambda_{b,k} = 0$ for all $k > 2$ (Moran 1958). For the Schweinsberg model, we obtain

\[ \lambda_{2,2} = G \sum_{U=2}^{N} p(U) \cdot \binom{U}{2} \Big/ \binom{N}{2} = G/(N-1) , \tag{22} \]

and since its genealogy follows the BSC, we have

\[ \lambda_{b,k}/\lambda_{2,2} = \frac{(k-2)!\,(b-k)!}{(b-1)!} . \tag{23} \]

For convenience, we take $G = N - 1$ for the Moran model and $G = 2(N-1)/N$ for the Schweinsberg model, so that both models have identical $\lambda_{2,2} = 2/N$. Setting $G$ arbitrarily only changes the unit of all times and does not change any results below. Explicit coalescence computations are not made for the adapting population model; instead, simulation results in the limits of weak and strong selection are compared with theoretical predictions of the KC and BSC models. This requires matching the overall coalescence rates of simulations with theory, which we accomplish by measuring $\lambda_{2,2}$ directly in simulations.

C. Calculation of population diversity

We consider substitutions at a single site on a randomly chosen pair of genomic sequences $\vec{g}$ and $\vec{g}'$ as shown in Figure 6A, and analyze the dynamics of $\vec{P}^{(1)} \equiv (p, 1-p)^T$, where $p$ is the probability that these two sequences are identical at that site. We trace their history back one instant $\Delta t$, and compute the rates and outcomes of different events that affect the sequences, including coalescence due to reproduction, DNA transfer, and mutation.

First, a coalescence event in which one of $\vec{g}$ or $\vec{g}'$ reproduces and replaces the other yields identity at the site, indicated by a star in Figure 6A corresponding to the state $\vec{P}^{(0)} \equiv (1, 0)^T$, and occurs with rate $\lambda_{2,2}$. The change in the state vector is then $\Delta\vec{P} = \vec{P}^{(0)} - \vec{P}^{(1)}$. Second, an internal DNA transfer, which occurs between the pair of sequences in question (Figure 6A), yields the same result as coalescence and thus the same $\Delta\vec{P}$. Indeed, such an internal DNA transfer is analogous to a reproduction event in the sense that a piece of DNA “reproduces” from the donor sequence and replaces its counterpart in the recipient sequence. The rate of such events is $2r/N$. An external DNA transfer, on the other hand, in which the piece of DNA is copied from a third sequence $\vec{g}'' \neq \vec{g}$ or $\vec{g}'$, does not change the probability that the two sequences are identical at the site, and thus does not change the state vector, i.e. $\Delta\vec{P} = 0$. Third, we assume mutations occur independently at each site and uniformly along the sequence. A

mutation causes a change $\Delta\vec{P}$ that can be represented as a linear operator $M\Delta t$ acting on $\vec{P}^{(1)}$. The overall rate for mutations for two sites is $2\mu$, and the linear operator is given by

\[ M = 2\mu \begin{pmatrix} -1 & (a-1)^{-1} \\ 1 & -(a-1)^{-1} \end{pmatrix} . \tag{24} \]

Summing the possible changes $\Delta\vec{P}$ due to each event multiplied by their rates, we find

\[ \frac{d\vec{P}^{(1)}}{dt} = M\vec{P}^{(1)} + (\lambda_{2,2} + 2r/N)(\vec{P}^{(0)} - \vec{P}^{(1)}) . \tag{25} \]

Setting the derivative to zero, we obtain the steady-state solution for $d \equiv 1 - p$:

\[ d = \frac{2\mu}{\lambda_{2,2} + 2r/N + 2\mu\tilde{a}} = \frac{\theta}{1 + r + \theta\tilde{a}} , \tag{26} \]

where $\tilde{a} \equiv a/(a-1)$. For the case $r = 0$, this yields the well-known expression for heterozygosity in the Moran model.

D. Calculation of correlation functions

We consider a pair of sites, $a$ and $b$, separated by distance $l$, in a randomly chosen pair of genomic sequences, $\vec{g}$ and $\vec{g}'$. The sequences can differ at 0, 1, or 2 sites, and we write the probability distribution of these three possible states as $\vec{P} = (P_0, P_1, P_2)^T$, where the components are non-negative numbers summing to 1. More generally, a pair of sites can be configured across 2, 3, or 4 sequences, as shown in Figure 6B. Due to the effect of genetic linkage, the distribution of states $\vec{P}$ depends on the configuration. We therefore denote by $\vec{P}^{(i)}$ the distribution of states for $i = 2$, 3, or 4 randomly chosen sequences.

The correlation functions $c_M$, $c_S$, and $c_R$, as well as $\sigma^2$ are determined by the values of four different terms (see Eqs. 4–7):

\[ \langle S_i^k S_{i+l}^k \rangle = P_2^{(2)}(l) ; \tag{27} \]

\[ \Big\langle S_i^k \cdot S_{i+l}^k \Big\rangle = \frac{1}{n_2} P_2^{(2)}(l) + \frac{2\binom{N-2}{1}}{n_2} P_2^{(3)}(l) + \frac{\binom{N-2}{2}}{n_2} P_2^{(4)}(l) ; \tag{28} \]

\[ \langle S_i^k \rangle \langle S_{i+l}^k \rangle = \frac{1}{L} \sum_{l=1}^{L} \langle S_i^k S_{i+l}^k \rangle ; \tag{29} \]

\[ \Big\langle S_i^k \Big\rangle \Big\langle S_{i+l}^k \Big\rangle = \frac{1}{L} \sum_{l=1}^{L} \Big\langle S_i^k \cdot S_{i+l}^k \Big\rangle , \tag{30} \]

where $n_2 = \binom{N}{2}$. For large $N$ we obtain

\[ \Big\langle S_i^k \cdot S_{i+l}^k \Big\rangle \approx P_2^{(4)}(l) . \tag{31} \]

Given that the maximal size of transferred segments, $f_{max}$, is far smaller than the genomic length ($f_{max} \ll L$), $\langle S_i^k \rangle \langle S_{i+l}^k \rangle$ is largely determined by the flat tail of $P_2^{(2)}(l)$ for $l \geq f_{max}$, which we denote by $P_2^{(2)}(L)$:

\[ \langle S_i^k \rangle \langle S_{i+l}^k \rangle = \frac{1}{L} \sum_{l=0}^{f_{max}} P_2^{(2)}(l) + \frac{L - f_{max}}{L} P_2^{(2)}(L) \approx P_2^{(2)}(L) . \tag{32} \]

This corresponds to loci that are sufficiently far such that they are not affected by two-site transfers, which means $r_2 = 0$ and therefore $r_1 = r$. Similar approximation yields

\[ \Big\langle S_i^k \Big\rangle \Big\langle S_{i+l}^k \Big\rangle \approx P_2^{(4)}(L) . \tag{33} \]

Based on Eq. (4)-(7), the calculations of $c_M$, $c_S$, $c_R$ and $\sigma^2$ are reduced to the analysis of $\vec{P}^{(i)}$:

\[ c_M(l) = P_2^{(2)}(l) - P_2^{(2)}(L) ; \tag{34} \]

\[ c_S(l) = P_2^{(2)}(l) - P_2^{(4)}(l) ; \tag{35} \]

\[ c_R(l) = P_2^{(4)}(l) - P_2^{(4)}(L) ; \tag{36} \]

\[ \sigma^2 = P_2^{(2)}(L) - P_2^{(4)}(L) = c_S(L) . \tag{37} \]

To analyze the dynamics of $\vec{P}^{(i)}$, we randomly choose $i$ sequences, trace their history back one instant $\Delta t$, and compute the rates and outcomes of different events that could affect the sequences. These events include coalescence by reproduction, DNA transfers, and mutations. Mutations have the same effect on $\vec{P}^{(i)}$, for $i = 2, 3, 4$, as follows. Since two pairs of sites are involved, the total rate of mutations is $4\mu$, and the linear operator of mutations acting on $\vec{P}^{(i)}$ can be written as follows:

\[ M = 4\mu M_\mu , \tag{38} \]

where

\[ M_\mu \equiv \begin{pmatrix} -1 & \frac{\tilde{a}}{2a} & 0 \\ 1 & -\frac{\tilde{a}}{2} & \frac{\tilde{a}}{a} \\ 0 & \frac{1}{2} & -\frac{\tilde{a}}{a} \end{pmatrix} . \tag{39} \]

D.1. Dynamics equation for $\vec{P}^{(2)}$. Possible events that could affect $\vec{P}^{(2)}(t)$ beside mutations are shown in Figure 6C. Coalescence events occur at rate $\lambda_{2,2}$, causing identity at both sites, and the state vector becomes $\vec{P}^{(0)} \equiv (1, 0, 0)^T$, hence $\Delta\vec{P} = \vec{P}^{(0)} - \vec{P}^{(2)}$. For DNA transfer events, as shown in Figure 6C, we need to consider (i) one-site internal transfers, (ii) one-site external transfers, (iii) two-site internal transfers, and (iv) two-site external transfer. (i) A one-site internal transfer, which occurs at rate $4r_1/N$, results in a coalescence at one site while leaving the other unchanged. The state vector becomes $\vec{P}^{(1)} \equiv (1-d, d, 0)^T$, where $d$ is the probability that two sequences differ at the unchanged site, hence $\Delta\vec{P} = \vec{P}^{(1)} - \vec{P}^{(2)}$. (ii) An external one-site transfer creates a configuration that involves three different sequences, and the state vector becomes $\vec{P}^{(3)}$, thus $\Delta\vec{P} = \vec{P}^{(3)} - \vec{P}^{(2)}$. This occurs with rate $4r_1(N-2)/N$, since each site can receive DNA from any of the $N-2$ sequences not in the given pair. (iii) A two-site internal transfer causes identity at both sites, i.e. $\Delta\vec{P} = \vec{P}^{(0)} - \vec{P}^{(2)}$. Lastly, (iv) the two-site external transfer does not change the state vector, $\Delta\vec{P} = 0$, since sequence labels are exchangeable. Multiplying the possible changes by their rates yields

\[ \frac{d\vec{P}^{(2)}}{dt} = M\vec{P}^{(2)} + \left(\lambda_{2,2} + \frac{2r_2}{N}\right)\vec{P}^{(0)} + \frac{4r_1(N-2)}{N}\vec{P}^{(3)} + \frac{4r_1}{N}\vec{P}^{(1)} - \left(\lambda_{2,2} + \frac{4r_1(N-1) + 2r_2}{N}\right)\vec{P}^{(2)} . \tag{40} \]

D.2. Dynamics equation for $\vec{P}^{(3)}$. Possible events that could affect $\vec{P}^{(3)}$ are illustrated in Figure 6D. Coalescence can affect the configuration in three different ways: (i) all three sequences could coalesce simultaneously, yielding identity at both sites, hence $\Delta\vec{P} = \vec{P}^{(0)} - \vec{P}^{(3)}$; (ii) the two-site sequence ($\vec{g}$) and one of the two single-site sequences ($\vec{g}'$ or $\vec{g}''$) could coalesce, causing identity at one site and leaving the other site unchanged; the state vector becomes $\vec{P}^{(1)}$, and $\Delta\vec{P} = \vec{P}^{(1)} - \vec{P}^{(3)}$; or (iii) the two

[Figure 6, panels A–F: configurations of one or two pairs of sites and the transitions between them through coalescence, one-site and two-site internal transfers, and one-site external transfers; panel F summarizes the transitions and their rates.]

Figure 6 Illustration of possible transitions between configurations for one or two pairs of sites. Each horizontal line represents a single sequence, and small vertical lines represent sites. Panel (A) shows the possible transitions and events for one pair of sites. When the pair become identical due to an indicated transition, we denote the site by a star. Panel (B) shows different configurations of two pairs of sites and the coalescent states in one or both sites, and their possible transitions and events are shown in panels (C), (D) and (E). Transitions between configurations are summarized in panel (F), where solid arrows represent transitions due to reproduction, internal one-site transfer, or two-site transfer, and dashed arrows correspond to an external one-site transfer into a sequence with two sites.
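
As a quick numerical check of the single-site dynamics shown in panel (A), the sketch below solves the steady state of Eq. (25) directly and compares it with the closed form of Eq. (26). The function names and parameter values are illustrative assumptions rather than part of the original analysis, and the coalescence rate is fixed to λ2,2 = 2/N following the convention of Appendix B.

```python
def diversity_numeric(N, mu, r, a):
    """Steady state of Eq. (25) for P = (p, 1 - p); returns d = 1 - p."""
    lam22 = 2.0 / N                  # pairwise coalescence rate (Appendix B convention)
    c = lam22 + 2.0 * r / N          # coalescence plus internal transfer
    k = 1.0 / (a - 1)                # chance that a mutation restores identity
    # Mutation operator of Eq. (24): M = 2*mu * [[-1, k], [1, -k]]
    m = [[-2 * mu, 2 * mu * k],
         [2 * mu, -2 * mu * k]]
    # Setting dP/dt = 0 gives (c*I - M) P = c * (1, 0)^T; solve by Cramer's rule.
    a00, a01 = c - m[0][0], -m[0][1]
    a10, a11 = -m[1][0], c - m[1][1]
    det = a00 * a11 - a01 * a10
    return -a10 * c / det            # second component of P, i.e. d


def diversity_closed_form(N, mu, r, a):
    """Eq. (26): d = theta / (1 + r + theta * a_tilde), with theta = N * mu."""
    theta = N * mu
    a_tilde = a / (a - 1)
    return theta / (1 + r + theta * a_tilde)


# Illustrative (assumed) parameter values, not taken from the paper.
params = dict(N=1000, mu=1e-5, r=0.02, a=4)
print(diversity_numeric(**params), diversity_closed_form(**params))
```

The two printed values should coincide up to floating-point rounding, since Eq. (26) is the exact solution of the 2 × 2 linear system.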

single-site sequences ($\vec{g}'$ and $\vec{g}''$) could coalesce onto an ancestral sequence that carries both sites, hence the state vector becomes $\vec{P}^{(2)}$ and $\Delta\vec{P} = \vec{P}^{(2)} - \vec{P}^{(3)}$. The rates for these events are $\lambda_{3,3}$, $2\lambda_{3,2}$ and $\lambda_{3,2}$, respectively.

Since an internal transfer involves choosing two sequences (one as the donor and the other as the recipient), there exist two different types of transfers (Figure 6D): ones between the two-site sequence ($\vec{g}$) and one of the two single-site sequences ($\vec{g}'$ or $\vec{g}''$), and the others between the two single-site sequences ($\vec{g}'$ and $\vec{g}''$). The results of these transfers are exactly the same as those of coalescing two sequences discussed above (ii and iii, see above), such that they change the state vector to $\vec{P}^{(1)}$ and $\vec{P}^{(2)}$, and occur with rates $4r/N$ and $2r/N$, respectively. An external one-site transfer from an external sequence to one of the two single-site sequences only changes the sequence labels, but not the configuration, thus $\Delta\vec{P} = 0$. On the other hand, when an external one-site transfer occurs into the two-site sequence, the configuration becomes that of $\vec{P}^{(4)}$, hence $\Delta\vec{P} = \vec{P}^{(4)} - \vec{P}^{(3)}$, with rate $2r_1(N-3)/N$. Multiplying rates by changes, we find

\[ \frac{d\vec{P}^{(3)}}{dt} = M\vec{P}^{(3)} + \left(\frac{2r}{N} + \lambda_{3,2}\right)(2\vec{P}^{(1)} + \vec{P}^{(2)} - 3\vec{P}^{(3)}) + \lambda_{3,3}(\vec{P}^{(0)} - \vec{P}^{(3)}) + \frac{2r_1(N-3)}{N}(\vec{P}^{(4)} - \vec{P}^{(3)}) . \tag{41} \]

D.3. Dynamics equation for $\vec{P}^{(4)}$. Possible events that could affect $\vec{P}^{(4)}$ are illustrated in Figure 6E. Since four sequences are involved, each of which carries only one site, four different types of coalescent events are possible: (i) simultaneous coalescence of all four sequences, which yields identity at both sites and $\Delta\vec{P} = \vec{P}^{(0)} - \vec{P}^{(4)}$, (ii) simultaneous coalescence of three sequences, producing identity at one of the two sites, hence $\Delta\vec{P} = \vec{P}^{(1)} - \vec{P}^{(4)}$, (iii) coalescence of two sequences with sites at the same locus, yielding identity at one of the two sites, hence $\Delta\vec{P} = \vec{P}^{(1)} - \vec{P}^{(4)}$, and (iv) coalescence of two sequences with two different sites, leading to an ancestral sequence that carries both sites; the state vector then becomes $\vec{P}^{(3)}$, and $\Delta\vec{P} = \vec{P}^{(3)} - \vec{P}^{(4)}$. The rates of these events are $\lambda_{4,4}$, $4\lambda_{4,3}$, $2\lambda_{4,2}$, and $4\lambda_{4,2}$, respectively.

For an internal DNA transfer, the two chosen sequences may carry their single site either at the same locus or at different loci, and therefore there are two types of internal DNA transfers (Figure 6E). The results of these two types of transfer on the state vector are exactly the same as for coalescence of two sequences discussed in types (iii) and (iv) above. The rates are $4r/N$ and $8r/N$, respectively. Since in the configuration of $\vec{P}^{(4)}$, each of the four sequences contains only one site and sequence labels are exchangeable, external transfers do not change the state vector, $\Delta\vec{P} = 0$. Multiplying these rates by changes, we find

\[ \frac{d\vec{P}^{(4)}}{dt} = M\vec{P}^{(4)} + \lambda_{4,4}(\vec{P}^{(0)} - \vec{P}^{(4)}) + 4\lambda_{4,3}(\vec{P}^{(1)} - \vec{P}^{(4)}) + (4r/N + 2\lambda_{4,2})(\vec{P}^{(1)} + 2\vec{P}^{(3)} - 3\vec{P}^{(4)}) . \tag{42} \]

D.4. Exact solution for steady-state values of $\vec{P}^{(i)}$. Combining the above, we express the equations as a system of non-homogeneous coupled linear differential equations:

\[ \frac{d}{dt} \begin{pmatrix} \vec{P}^{(2)} \\ \vec{P}^{(3)} \\ \vec{P}^{(4)} \end{pmatrix} = (S \otimes I + I \otimes M) \begin{pmatrix} \vec{P}^{(2)} \\ \vec{P}^{(3)} \\ \vec{P}^{(4)} \end{pmatrix} + \vec{B}^{(0)} \otimes \vec{P}^{(0)} + \vec{B}^{(1)} \otimes \vec{P}^{(1)} \tag{43} \]

in which $I$ is a $3 \times 3$ identity matrix, $\vec{B}^{(0)}$ and $\vec{B}^{(1)}$ describe transitions from $\vec{P}^{(2)}$, $\vec{P}^{(3)}$, and $\vec{P}^{(4)}$ to $\vec{P}^{(0)}$ and $\vec{P}^{(1)}$, respectively, with $\vec{P}^{(0)} = (1, 0, 0)^T$, $\vec{P}^{(1)} = ((1-d), d, 0)^T$, $\vec{B}^{(0)} = (\lambda_{2,2} + 2r_2/N, \lambda_{3,3}, \lambda_{4,4})^T$, and $\vec{B}^{(1)} = (4r_1/N, 2\lambda_{3,2} + 4r/N, 4\lambda_{4,3} + 2\lambda_{4,2} + 4r/N)^T$; and $S$ is the operator for reproduction and transfer, which has the following form

\[ S = \begin{pmatrix} -\frac{2r_2 + 4r_1(N-1)}{N} - \lambda_{2,2} & \frac{4r_1(N-2)}{N} & 0 \\ \frac{2r}{N} + \lambda_{3,2} & -\left(\frac{6r}{N} + 2r_1\frac{N-3}{N}\right) - (3\lambda_{3,2} + \lambda_{3,3}) & 2r_1\frac{N-3}{N} \\ 0 & \frac{8r}{N} + 4\lambda_{4,2} & -\frac{12r}{N} - (6\lambda_{4,2} + 4\lambda_{4,3} + \lambda_{4,4}) \end{pmatrix} . \tag{44} \]

The tridiagonal form of $S$ leads to a block tridiagonal matrix of the Kronecker product, $S \otimes I + I \otimes M$, which can be written in a general form

\[ A = \begin{pmatrix} A_1 & B_1 & 0 \\ D_1 & A_2 & B_2 \\ 0 & D_2 & A_3 \end{pmatrix} . \tag{45} \]

In the block matrix $A$, the off-diagonal blocks are scalar matrices and the diagonal blocks are sums of scalar matrices and the mutation operator, $M$, which is diagonalizable. Importantly, $A$ is a block diagonally dominant matrix in the norm induced by the $\ell_1$-norm, which guarantees that $A$ is nonsingular (Feingold and Varga 1962). The inverse of $A$ can be computed efficiently using an algorithm given by (Ran and Huang 2006), which yields the steady-state solution of Eq. (43). Using the steady-state solution for $\vec{P}^{(i)}$ in Eqs. 27–33 and Eqs. 4–7 yields the exact solution for correlation functions and variance. The resulting solutions precisely predict the simulation results for both the Moran model and the Schweinsberg model as shown in Figure 2. Since their exact expression is unwieldy, we seek an approximation that yields a simpler form for further analysis in the following section.

E. Mean-field approximation and solutions

We note that the one-site external transfers make the transitions between $\vec{P}^{(i)}$ a cyclic graph (Figure 6F, dotted arrows) and thus complicate the solutions of the linear equations (Eqs. 40-42). One possible approximation is thus to remove the transitions $\vec{P}^{(2)} \to \vec{P}^{(3)}$ and $\vec{P}^{(3)} \to \vec{P}^{(4)}$, and to account for them implicitly as mutations. As shown in Figure 7, an external transfer involves three sequences with four possible genealogical structures. After the external one-site transfers, the genealogical relationships of the two sequences in question will change (Figure 7), but the coalescent time will only change in two of the four genealogical structures, which happens with probability $w = 2\lambda_{3,2}/(\lambda_{3,3} + 3\lambda_{3,2})$. If the coalescent time does not change (as in cases i and ii, Figure 7), exchangeability implies that there will be no change in the probability distribution for the pair of sequences in question. If the coalescent time changes (as in cases iii and iv, Figure 7), the probability distribution changes, and our mean-field approximation is to assume that the two sites will then differ with probability $d$, i.e. the average pairwise distance in the population. Accordingly, the operator for these transfers,

[Figure 7: four genealogical trees, (i)–(iv), drawn along a time axis; the leaves are ordered X, D, Y in trees (i)–(iii) and D, X, Y in tree (iv).]

Figure 7 Illustration of external one-site transfers and their impacts on the possible genealogical trees. In each tree, X and Y are the pair of sequences under consideration, and D is the donor sequence of the external transfer shown as an arrow from D to X. A red circle represents the MRCA of X and Y before the transfer, and a red star denotes the MRCA after the transfer.
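
The probability w that an external one-site transfer changes the coalescent time depends only on the three-lineage coalescence rates. The sketch below evaluates w = 2λ3,2/(λ3,3 + 3λ3,2) for the two models of Appendix B and also checks the self-consistency relation λb,k = λb+1,k + λb+1,k+1. The function names are hypothetical, and all rates are expressed in units of λ2,2 (an assumption made here for convenience).

```python
from fractions import Fraction
from math import factorial


def bsc_rate(b, k):
    """Eq. (23): Bolthausen-Sznitman rate lambda_{b,k} in units of lambda_{2,2}."""
    return Fraction(factorial(k - 2) * factorial(b - k), factorial(b - 1))


def moran_rate(b, k):
    """Moran model: only pairwise mergers occur, each at rate lambda_{2,2}."""
    return Fraction(1) if k == 2 else Fraction(0)


def is_consistent(rate, max_b=8):
    """Check lambda_{b,k} = lambda_{b+1,k} + lambda_{b+1,k+1} for 2 <= k <= b <= max_b."""
    return all(
        rate(b, k) == rate(b + 1, k) + rate(b + 1, k + 1)
        for b in range(2, max_b + 1)
        for k in range(2, b + 1)
    )


def transfer_weight(rate):
    """w = 2*lambda_{3,2} / (lambda_{3,3} + 3*lambda_{3,2})."""
    return 2 * rate(3, 2) / (rate(3, 3) + 3 * rate(3, 2))


for name, rate in [("Moran", moran_rate), ("Schweinsberg (BSC)", bsc_rate)]:
    print(name, is_consistent(rate), transfer_weight(rate))
```

Under these conventions the weights evaluate to w = 2/3 for the Moran model and w = 1/2 for the Schweinsberg model, so the combination 2wNr1 reduces to Nr1 in the Schweinsberg case.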

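To get a feel for how the mean-field closure compares with the full dynamics, the following sketch assembles Eqs. (40)–(42) for the Moran model, solves for the steady state with a small Gaussian elimination routine, and evaluates $Q_2^{(2)} = (P_2^{(2)} - d^2)/d^2$ next to the simplified mean-field form of Eq. (50). All function names and parameter values are assumptions chosen for illustration; the rates use λb,2 = 2/N with no larger mergers, and w = 2/3.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (pure Python)."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def moran_steady_state(N, mu, a, r1, r2):
    """Return d and the steady-state vectors P^(2), P^(3), P^(4) of Eqs. (40)-(42)."""
    lam = 2.0 / N                        # Moran: lambda_{b,2} = 2/N, larger mergers absent
    r = r1 + r2
    at = a / (a - 1)                     # a-tilde
    d = N * mu / (1 + r + N * mu * at)   # Eq. (26)
    # Two-site mutation operator M = 4*mu*M_mu, Eqs. (38)-(39).
    M = [[-4 * mu, 4 * mu * at / (2 * a), 0.0],
         [4 * mu, -4 * mu * at / 2, 4 * mu * at / a],
         [0.0, 4 * mu / 2, -4 * mu * at / a]]
    u3 = 2 * r / N + lam                 # pair-event rate, 3-sequence configuration
    u4 = 4 * r / N + 2 * lam             # pair-event rate, 4-sequence configuration
    e2 = 4 * r1 * (N - 2) / N            # external one-site transfers out of P^(2)
    e3 = 2 * r1 * (N - 3) / N            # external one-site transfers out of P^(3)
    S = [[-(lam + (4 * r1 * (N - 1) + 2 * r2) / N), e2, 0.0],
         [u3, -(3 * u3 + e3), e3],
         [0.0, 2 * u4, -3 * u4]]
    B0 = [lam + 2 * r2 / N, 0.0, 0.0]    # transitions to P^(0) = (1, 0, 0)
    B1 = [4 * r1 / N, 2 * u3, u4]        # transitions to P^(1) = (1 - d, d, 0)
    P0, P1 = [1.0, 0.0, 0.0], [1.0 - d, d, 0.0]
    A = [[0.0] * 9 for _ in range(9)]
    rhs = [0.0] * 9
    for i in range(3):
        for s in range(3):
            rhs[3 * i + s] = -(B0[i] * P0[s] + B1[i] * P1[s])
            for j in range(3):
                A[3 * i + s][3 * j + s] += S[i][j]   # S (x) I
            for t in range(3):
                A[3 * i + s][3 * i + t] += M[s][t]   # I (x) M
    x = solve(A, rhs)
    return d, [x[0:3], x[3:6], x[6:9]]


def q2(P, d):
    """Q_2 = (P_2 - d^2) / d^2 for one configuration."""
    return (P[2] - d * d) / (d * d)


N, mu, a, r1, r2 = 1000, 1e-4, 4, 2e-4, 3e-4     # assumed, illustrative values
d, (P2, P3, P4) = moran_steady_state(N, mu, a, r1, r2)
w = 2.0 / 3.0
q_mean_field = 1.0 / (1 + 2 * w * N * r1 + 2 * N * mu * a / (a - 1))   # Eq. (50)
print(q2(P2, d), q_mean_field)
```

With r1 = r2 = 0 the two quantities agree to rounding error, since no external transfers remain to be approximated; for nonzero transfer rates the gap between them indicates how well the mean-field closure performs under the chosen parameters.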
$M_r$, can be written as

\[ M_r = \begin{pmatrix} -d & \frac{1-d}{2} & 0 \\ d & -\frac{1}{2} & 1-d \\ 0 & \frac{d}{2} & d-1 \end{pmatrix} . \tag{46} \]

We can combine these external one-site transfers and point mutations to form an effective mutation operator, which is $4\mu M_\mu + 4wr_1(1 - 2/N)M_r$ for $\vec{P}^{(2)}$ or $4\mu M_\mu + 2wr_1(1 - 3/N)M_r$ for $\vec{P}^{(3)}$. The explicit external one-site transfers appearing in Eq. 43 are removed and replaced by the appropriate mutation operator, yielding the simplified solutions given below.

E.1. Solution for the Moran model. We find the solutions of $\vec{P}^{(i)}$, which we will use later for the solutions of correlation functions, in the mean-field approximation for the Moran model as follows:

\[ Q_2^{(2)} \equiv \frac{P_2^{(2)} - d^2}{d^2} = \frac{1 + r_2}{1 + r + r_1 + 2w(N-2)r_1 + 2N\mu\tilde{a}} ; \tag{47} \]

\[ Q_2^{(3)} \equiv \frac{P_2^{(3)} - d^2}{d^2} = \frac{1 + r}{3(1 + r) + w(N-3)r_1 + 2N\mu\tilde{a}} \cdot Q_2^{(2)} ; \tag{48} \]

\[ Q_2^{(4)} \equiv \frac{P_2^{(4)} - d^2}{d^2} = \frac{2(1 + r)}{3(1 + r) + N\mu\tilde{a}} \cdot Q_2^{(3)} . \tag{49} \]

Given that $r \ll 1$ and $N \gg 1$, we can simplify the solutions further:

\[ Q_2^{(2)} \approx \frac{1}{1 + 2wNr_1 + 2N\mu\tilde{a}} ; \tag{50} \]

\[ Q_2^{(3)} \approx \frac{1}{3 + wNr_1 + 2N\mu\tilde{a}} \cdot Q_2^{(2)} ; \tag{51} \]

\[ Q_2^{(4)} \approx \frac{2}{3 + N\mu\tilde{a}} \cdot Q_2^{(3)} . \tag{52} \]

Using the mean-field solution above for $P_2^{(i)}$ and Eq. (34), we obtain the mean-field solution for $c_M$:

\[ c_M(l) = \frac{2wd^2Nr_2(l)}{(1 + 2wNr_1(l) + 2N\mu\tilde{a})(1 + 2wNr + 2N\mu\tilde{a})} . \tag{53} \]

Similarly, using the mean-field solution for $\vec{P}^{(4)}$ and Eq. (36), we find the solution for $c_R$:

\[ c_R(l) = q \cdot c_M(l) \tag{54} \]

where $q = \frac{1}{3 + wNr_1(l) + 2N\mu\tilde{a}} \cdot \frac{2}{3 + N\mu\tilde{a}}$.

For $c_S$, using the solutions for $\vec{P}^{(2)}$ and $\vec{P}^{(4)}$ and Eq. (35), we obtain

\[ c_S = \frac{d^2(1 - q)}{1 + 2wNr_1(l) + 2N\mu\tilde{a}} . \tag{55} \]

Lastly, we compute the variance

\[ \sigma^2 \equiv c_S(L) . \tag{56} \]

E.2. Solution for the Schweinsberg model. We find the solutions for $P_2^{(i)}$ in the mean-field approximation for the Schweinsberg model as follows:

\[ Q_2^{(2)} \approx \frac{1}{1 + 2wNr_1 + 2N\mu\tilde{a}} ; \tag{57} \]

\[ Q_2^{(3)} \approx \frac{1}{2} Q_2^{(2)} ; \tag{58} \]

\[ Q_2^{(4)} \approx \frac{2}{3} Q_2^{(3)} , \tag{59} \]

for $r \ll 1$.

Given the solutions above, we find:

\[ c_M(l) = \frac{2wd^2Nr_2(l)}{(1 + 2wNr_1(l) + 2N\mu\tilde{a})(1 + 2wNr + 2N\mu\tilde{a})} ; \tag{60} \]

\[ c_R(l) = q \cdot c_M(l) ; \tag{61} \]

\[ c_S(l) = \frac{d^2(1 - q)}{1 + Nr_1(l) + 2N\mu\tilde{a}} , \tag{62} \]

where $q = 1/3$.

E.3. Mean-field approximation and coalescent theory. The mean-field results for $\vec{P}^{(i)}$ can equivalently be obtained using coalescent theory, which we illustrate here for the case of $\vec{P}^{(2)}$. Given two pairs of sites on a pair of sequences, a coalescent event could involve both of the two pairs with rate $\lambda_1 \equiv \lambda_{2,2} + 2r_2(l)/N$, where the last term is the rate of coalescence due to two-site transfers between the two sequences, yielding the state $\vec{P}^{(0)}$; or it could involve only one of the two pairs with rate $\lambda_2 \equiv 4r_1(l)/N$, which is the rate of coalescence due to internal one-site transfers between the two sequences, yielding the state $\vec{P}^{(1)}$. The coalescent time $t$ thus follows an exponential distribution with rate $\lambda \equiv \lambda_1 + \lambda_2$. Given the coalescence time, one can compute $\vec{P}^{(2)}$ by propagating forward in time the process of mutations and external one-site transfers for a time $t$ starting from the two ancestral states, $\vec{P}^{(0)}$ and $\vec{P}^{(1)}$. We note that external two-site transfers that occur during this time impact both sites and are thus equivalent to exchanging one of the sequences for a different individual in the population. Since we study an exchangeable coalescent, these events

Correlated mutations and horizontal transfer 19 do not change the distribution of coalescent times, and there- (2) and substituting the expression for P2 yields fore have no impact on the calculation. We define a combined mutational operator that includes both mutations and external (2) 2 1 Ps,2 (l) ≈ 2r2(l)td + 1 . (67) one-site transfers, M ≡ 4µMµ + 4w(1 − 2/N)r1(l)Mr, and use 1 + 2wNr1(l) + 2Nµa˜ it to propagate forward in time, while taking the expectation over the coalescent time: We define ˜(2) (2) Ps,2 (l) ≡ Ps,2 (l)/ds (68) Z ∞ ~ (2) −tλ tM ~ (0) ~ (1) P (l) = e e (λ1P + λ2P )dt (63) and given that d ≈ ρ d = 2γ f¯ td, we obtain 0 s s = (λ − M)−1 (λ ~P(0) + λ ~P(1)) ( ) 1 2 ˜(2) dr2 l 1 Ps,2 (l) = + 1 . (69) γ f¯ 1 + 2wNr1(l) + 2Nµa˜ One can check that this is the same equation as the steady-state equation for ~P(2) in the mean-field approximation (see Eq. 40). If the distance between sites l is smaller than typical transferred ¯ Computing the matrix inverse and applying to ~P(0) = (1, 0, 0)T fragment sizes, we have r1(l) ≈ γl and r2(l) ≈ γ( f − l), using and ~P(1) = ((1 − d), d, 0)T therefore yields the same expression which we find (2) for P given in Eq. (47). (2) l 1 2 P˜ (l) = d 1 − + 1 . (70) s,2 f¯ 1 + 2wNγl + 2Nµa˜ F. Analytical forms for parameter inference in biased samples We consider a pair of individuals with coalescence time t N. G. Application to bacterial sequence analysis We assume that t is sufficiently short such that each pair of sites We obtained a curated list of 185 Escherichia coli sequence type in the genome separated by a distance l L has been affected by 131 (ST131) isolates from two recent studies (Price et al. 2013; at most one mutational or recombination event. For portions of Petty et al. 2014; Ben Zakour et al. 
2016), and a total list of 1,216 the genome that were not affected by recombination, the average Streptococcus pneumoniae isolates, which consisted of all isolates per site divergence of the pair of sequences is given by θs ≡ 2µt. from the seven largest clusters in a longitudinal pneumococcal The probability that a single site is affected by recombination is carriage study (Chewapreecha et al. 2014). ρs ≡ 2γ f¯ t. Since the donor is randomly chosen from the entire For each species, we applied a reference-based approach to population of size N, we can neglect transfers between the given generate whole-genome alignments from Illumina read data. pair of individuals, which have probability 1/N. After a transfer, We used E. coli strain EC958 (EMBL accession code HG941718) the expected divergence in the recombined region is the bulk and S. pneumoniae strain ATCC700669 (EMBL accession code population diversity d. Accounting for the rates and effects of FM211187) as the reference genomes, and mapped read pairs mutation and recombination events, we calculate the expected against them using SMALT version 0.7.6 (https://sourceforge.net/ diversity of the pair of sequences as projects/smalt/) with default settings. The resulting alignment was then processed with Samtools (Li et al. 2009) and FreeBayes ds = ρsd + (1 − ρs)θs. (64) (Garrison and Marth 2012) to generate a consensus sequence. To call bases at each position, we required at least two reads Recalling that ρ = γ f¯ N and θ = µN, where we consider N to spanning in each direction (i.e. for a total minimum read depth represent the size of the bulk population with which sequences of 4) and at least a 75% consensus on the major allele. The base can recombine (e.g. that of an entire species), we can assume quality at the site had to be at least 50, and the mapping quality that ρ 1, since typically we have γN ∼ µN ∼ 0.01 − 0.1, had to be at least 30, on the Phred scale. 
When a called base was a while f¯ is usually on the order of a kilobase or longer. These single nucleotide polymorphism (SNP), in addition, we required assumptions are later tested for self-consistency once fitting has it to be supported by both FreeBayes and Samtools with quality been performed. Thus, we have ρsd ≈ ρs θ = ρ 1, so that scores at least 30. All other sites that failed to pass these criteria, θs θs (1 − ρs)θs < θs ρsd, which allows us to approximate ds by as well as insertions and deletions, were masked as gaps. Finally, only considering the contribution of recombination: for each clade in a species, the resulting consensus sequences were combined to generate a whole-genome alignment. ds ≈ ρsd . (65) To avoid ambiguity in assigning distances along the genome due to the potential for genome rearrangement, the resulting (2) whole-genome alignment was split into multiple gene align- We now consider the calculation of P2 (l), i.e. the probability that substitutions have occurred at both sites, where we ments, and each gene alignment was further split into multiple ( ) alignment blocks with a fixed length of 300 bp. In each block distinguish the population’s P 2 , which corresponds to the ex- 2 we removed sequences in which gaps constituted more than pectation for two sequences chosen at random from the entire (2) 2% of the total length. After filtering out such sequences, we population of size N; and the sample’s P2 , which we denote additionally excluded alignment blocks consisting of fewer than (2) five sequences. Using the filtered alignment blocks, we com- as Ps,2 and is calculated for the pairs of individuals within the given sample. Since t is sufficiently short, the probability that pared DNA sequences pairwise. For each pair of sequences k, k two or more events occur at that the two sites in question is we obtained a substitution profile Si at the third base of each negligible. Thus, the only event that can introduce substitutions codon. 
To reduce the impact of selection, we masked positions at both sites over the timescale t is a two-site transfer from the containing non-synonymous substitutions, and did not use them bulk population into the sample, which occurs with probability in the analysis, but preserved their genome coordinate so that 2tr2(l). We obtain physical distances remained unchanged. Using the substitution profiles, we calculated the sample di- (2) (2) ˜(2) Ps,2 (l) ≈ 2 t r2(l)P2 (l) . (66) versity ds and the correlation functions Ps,2 (l), cM(l), cS(l) and

20 Edo Kussell et al. ˜(2) cR(l). To infer population parameters, we fit Ps,2 (l) using the first 50 data points (codons) in a the same manner as we tested using simulation results (Fig. 4 and Appendix F). To predict c˜M(l), c˜S(l), c˜R(l), we used the relations in Eqs. (34)-(36) with ˜(2) ˜(4) ˜(2) the fitted form of Ps,2 (l), and used Ps,2 (l) = qPs,2 (l), where q was determined by the coalescent structure of the sample (e.g. q = 1/3 for BSC statistics). When fitting q (dashed blue and green lines in Fig. 5), we performed linear regression using the ˜(2) ˜(2) relation c˜S(l) = (1 − q)Ps,2 (l), with the fitted form Ps,2 (l) and the measured values of c˜S(l). We note that, in principle, one could infer the full distribution of transferred fragment sizes by using the structure correlation function. One can invert Eq. (12) to obtain r1(l) in terms of 2 2 cS(l), and by differentiating obtain p( f ) ∼ (∂ /∂ f )(1/cS( f )). In practice, however, accurately measuring the curvature of the the correlation decay would require a very large amount of data, which we found to be well beyond the sizes of available datasets (data not shown). Nevertheless, as we discussed above, the Figure 8 Recombination barrier and the rate of successful mean fragment size can be efficiently inferred using currently transfers. The plot shows the population rate of successful available data. transfers, Nγ˜, as a function of b, which controls transfer efficiency such that a higher b corresponds to a larger recombi- H. Recombination barriers and population structure nation barrier. The theoretical prediction, Nγ˜ = φ/(1 + θb), In the above analysis and simulations, the transfer process was is shown in solid lines. Simulation results are shown for mea- unconstrained, such that each genome can recombine with any sured rates (open circles) and inferred rates based on fitting other sequence within the population with equal rate. In real- mutational correlations (solid triangles). 
The solid circles are ity, however, barriers exist that prevent free genetic exchange the inferred rates corrected by the scale factor, 1 + θb/3. In- among individuals within a bacterial population. For example, set shows the data collapsing onto the theoretical prediction the mismatch-repair system inhibits interspecies recombination, when plotted as a function of θb. Simulations used parameters −4 and reduces recombination frequencies among distantly related N = 1000, L = 1000, f0 = 50, and γ = 10 , with various individuals (Fraser et al. 2007). Moreover, the classical path- values of b and µ as indicated. Each data point corresponds way for homologous recombination involves RecA-mediated to an average over 10000 simulations, which were run over a homology recognition between donor and recipient sequences total time much longer than the coalescent time. Fitting muta- (Kuzminov 1999). In several bacterial species, laboratory studies tional correlations was carried out using the mean-field result have found a log-linear relationship between sequence diver- for cM(l) in Eq. 10. gence and the frequency of recombination (Fraser et al. 2007).
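As a small illustration of what a log-linear relationship between divergence and recombination frequency looks like, the sketch below assumes the exponential form exp(−b·d) that the simulations of Section H.1 use for the success probability. The function name and the numerical values of `b` and the divergences are hypothetical and chosen only for illustration; they are not measurements from any species.

```python
import math


def transfer_success_probability(divergence, b):
    """Probability that an attempted transfer succeeds.

    Assumes the exponential form exp(-b * divergence); `b` is the
    (hypothetical) strength of the recombination barrier.
    """
    return math.exp(-b * divergence)


if __name__ == "__main__":
    b = 20.0  # illustrative barrier strength, not an empirical estimate
    for d in (0.00, 0.05, 0.10, 0.15, 0.20):
        p = transfer_success_probability(d, b)
        # log10(p) drops by the same amount for each equal step in d
        print(f"divergence={d:.2f}  p={p:.4f}  log10(p)={math.log10(p):+.3f}")
```

Under this assumed form, the logarithm of the success probability falls by a constant amount for each equal increment in divergence, which is the sense in which the relationship is log-linear.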

H.1. Recombination barriers in simulation To assess the magnitude of the effect of recombination barriers on our estimation of transfer rates, we modified our simulations as follows. After randomly choosing a pair $k$ of genomes for recombination, we calculate their pairwise distance, $d(k)$, and we allow recombination to occur with probability $\exp[-b\, d(k)]$, where $b$ is the strength of the recombination barrier that reduces transfer efficiency. During the simulations, we record the total number of successful transfer events across the population. This process limits recombination among distantly related individuals, and thereby reduces the overall rate of successful transfers within the population (Fig. 8).

We compared the overall rates of successful transfers, $\tilde{\gamma}$, measured directly from simulations with our estimations based on the mean-field approximation. Due to the recombination barrier and its effect on the correlation functions, inference of transfer rates based on fitting of measurements from the bulk population will underestimate the rates of successful transfers by a factor of $1 + \theta b/3$, an increasing function of the product of population diversity and the recombination barrier (Fig. 8; and see derivation below). Thus, a significant deviation of our estimate requires both high population diversity and a large recombination barrier, and one can correct the estimates if the value of $b$ is known for a given species. However, our inference procedure for fitting genomic samples (Appendix F) implicitly accounts for transfer efficiency in the way it models the sample diversity $d_s$, which avoids having to explicitly measure and correct for recombination barriers in different datasets.

H.2. Rate of successful transfers in the population Here we calculate the average rate of successful transfers, $\tilde{\gamma}$, given the rate of attempted transfers, $\gamma$. For any sequence X, transfers occur from donor sequences D whose coalescence time with X is a random variable $t$, which is exponentially distributed with rate $\lambda_{2,2}$. The average divergence between sequences X and D is then given by $2\mu t$, and the probability that transfer is successful is therefore $\exp(-2\mu t b)$. Integrating over $t$ yields the mean rate of successful transfers,

$$\begin{aligned} \tilde{\gamma} &= \gamma \int_0^\infty \lambda_{2,2} \exp(-\lambda_{2,2} t) \exp(-2b\mu t)\, dt \\ &= \gamma \lambda_{2,2}/(\lambda_{2,2} + 2b\mu) \\ &= \gamma/k_b, \end{aligned} \tag{71}$$

where $k_b \equiv (\lambda_{2,2} + 2b\mu)/\lambda_{2,2} = 1 + \theta b$ measures the effective strength of the recombination barrier.

H.3. The impact of transfer efficiency on the mean-field approximation. As detailed in Appendix E, we obtained the mean-field approximation by studying the distribution of possible phylogenetic trees involved in an external one-site transfer (Figure 7). A recombination barrier changes this distribution: transfers are more likely to occur among closely related sequences than among distant ones. It thus overweights the first two types of trees shown in Fig. 7 and underweights the last two trees, which changes the value of $w$ and leads to an underestimate of the rate of successful transfers. To correct this bias, we study the effect of transfer efficiency on the last two trees in Fig. 7.

We consider a pair of sequences, X and Y, and an external one-site transfer from a donor sequence D to X. When we trace the three lineages backward in time, there exist two sequential mergers among them, and we let $t_1$ and $t_2$ denote the coalescent times of the first and second mergers, respectively. We compute the probability of observing tree (iii) with specified values of $t_1$ and $t_2$. The first merger occurs at $t_1$ between branches D-Y, while all other possible mergers (D-X, X-Y, or D-X-Y) do not occur; the associated probability is thus $\lambda_{3,2}\, e^{-(3\lambda_{3,2} + \lambda_{3,3}) t_1}$. After the first merger, there remains time $t_2 - t_1$ until the second merger, during which the sample contains two lineages, hence the associated probability is $\lambda_{2,2}\, e^{-\lambda_{2,2}(t_2 - t_1)}$. Multiplying the two probabilities, and using the identity $\lambda_{2,2} = \lambda_{3,3} + \lambda_{3,2}$, the probability of observing the given tree is

$$p_{\mathrm{tree}}(t_1, t_2) = \lambda_{2,2}\lambda_{3,2}\, e^{-2\lambda_{3,2} t_1 - \lambda_{2,2} t_2}. \tag{72}$$

Since the two trees (iii) and (iv) have the same topology, their probability is the same. In both cases, the divergence between D and X, which is given by $2\mu t_2$, determines the transfer efficiency. We can therefore calculate the value of $w$, which is the probability of observing one-site transfers as trees (iii) and (iv), by integrating over the coalescence times:

$$w = 2 \int_0^\infty \int_0^{t_2} p_{\mathrm{tree}}(t_1, t_2)\, e^{-2b\mu t_2}\, dt_1\, dt_2 \tag{73}$$

$$\phantom{w} = \frac{2\lambda_{2,2}\lambda_{3,2}}{(\lambda_{2,2} + 2b\mu)(\lambda_{2,2} + 2\lambda_{3,2} + 2b\mu)}. \tag{74}$$

With no recombination barrier (i.e. $b = 0$), we obtain the original probability $w = w_0 \equiv 2\lambda_{3,2}/(\lambda_{2,2} + 2\lambda_{3,2})$. In the mean-field expressions for the correlation functions (Appendix E), the population recombination rate $N\gamma$ occurs only in the combination of parameters $wN\gamma$. Using the original mean-field solution to fit the population transfer rate (i.e. using $w = w_0$) would thus yield a value $wN\gamma/w_0$, which can be written as $N\tilde{\gamma} w k_b/w_0$, indicating that we would underestimate the rate of successful transfers, $N\tilde{\gamma}$, by a factor $w_0/(w k_b)$. In the Moran model, by noting $\lambda_{2,2} = \lambda_{3,2} = 2/N$, we find the factor to be $1 + \theta b/3$, while in the Schweinsberg model $\lambda_{2,2} = 2\lambda_{3,2} = 2/N$, and the factor is $1 + \theta b/2$.
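The closed form in Eq. (74) and the two correction factors quoted above can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the nested midpoint-rule integrator, the truncation of the outer integral, and the parameter values in the demo are all assumptions made for illustration. It evaluates Eq. (73) by direct integration of Eq. (72), compares it with Eq. (74), and computes the factor w0/(w·k_b).

```python
import math


def w_closed_form(lam22, lam32, b, mu):
    """Closed-form w from Eq. (74)."""
    return 2.0 * lam22 * lam32 / (
        (lam22 + 2.0 * b * mu) * (lam22 + 2.0 * lam32 + 2.0 * b * mu)
    )


def w_numeric(lam22, lam32, b, mu, steps=400):
    """Eq. (73) evaluated with a nested midpoint rule (illustrative only).

    The outer integral is truncated at 20 / lam22, where the integrand
    exp(-lam22 * t2) is already negligible.
    """
    t_max = 20.0 / lam22
    h2 = t_max / steps
    total = 0.0
    for i in range(steps):
        t2 = (i + 0.5) * h2
        h1 = t2 / steps
        inner = 0.0
        for j in range(steps):
            t1 = (j + 0.5) * h1
            # p_tree(t1, t2) from Eq. (72)
            inner += lam22 * lam32 * math.exp(-2.0 * lam32 * t1 - lam22 * t2)
        total += inner * h1 * math.exp(-2.0 * b * mu * t2) * h2
    return 2.0 * total


def underestimation_factor(lam22, lam32, b, mu):
    """Factor w0 / (w * k_b) by which N*gamma_tilde is underestimated."""
    w0 = w_closed_form(lam22, lam32, 0.0, mu)
    w = w_closed_form(lam22, lam32, b, mu)
    k_b = (lam22 + 2.0 * b * mu) / lam22
    return w0 / (w * k_b)


if __name__ == "__main__":
    N, mu, b = 1000.0, 1e-5, 50.0  # hypothetical values; theta * b = 0.5
    print("Moran:       ", underestimation_factor(2 / N, 2 / N, b, mu))
    print("Schweinsberg:", underestimation_factor(2 / N, 1 / N, b, mu))
    print("numeric vs closed-form w (Moran):",
          w_numeric(2 / N, 2 / N, b, mu), w_closed_form(2 / N, 2 / N, b, mu))
```

Algebraically the factor reduces to 1 + 2bµ/(λ2,2 + 2λ3,2), which gives 1 + θb/3 when λ2,2 = λ3,2 = 2/N and 1 + θb/2 when λ2,2 = 2λ3,2 = 2/N, so the script serves only as a sanity check of that algebra.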
