INVESTIGATION
Purifying Selection, Drift, and Reversible Mutation with Arbitrarily High Mutation Rates
Brian Charlesworth*,1 and Kavita Jain† *Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3FL, United Kingdom, and †Theoretical Sciences Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore 560064, India
ABSTRACT Some species exhibit very high levels of DNA sequence variability; there is also evidence for the existence of heritable epigenetic variants that experience state changes at a much higher rate than sequence variants. In both cases, the resulting high diversity levels within a population (hyperdiversity) mean that standard population genetics methods are not trustworthy. We analyze a population genetics model that incorporates purifying selection, reversible mutations, and genetic drift, assuming a stationary population size. We derive analytical results for both population parameters and sample statistics and discuss their implications for studies of natural genetic and epigenetic variation. In particular, we find that (1) many more intermediate-frequency variants are expected than under standard models, even with moderately strong purifying selection, and (2) rates of evolution under purifying selection may be close to, or even exceed, neutral rates. These findings are related to empirical studies of sequence and epigenetic variation.
HE infinite sites model, originally proposed by Fisher There has recently been some discussion of how to go T(1922, 1930) and developed in detail by Kimura (1971), beyond the infinite sites assumption of a low scaled mutation has been the workhorse of molecular population genetics for rate, which breaks down for species with very large effective four decades. Its core assumption is that any nucleotide site population sizes, including some species of virus and bacteria, segregates for at most two variants and that the mutation and even eukaryotes such as the sea squirt and outbreeding rate scaled by effective population size (Ne) is so low that nematode worms, resulting in “hyperdiversity” of DNA se- new mutations arise only at sites that are fixed within the quence variability within a population (Cutter et al. 2013). population (see also Charlesworth and Charlesworth 2010, It is important to note, however, that this problem can arise p. 207). This assumption facilitates calculations of the the- even when the scaled mutation rate is relatively low, since oretical values of some key observable quantities, such as then the proportion of neutral nucleotide sites that are cur- the expected level of pairwise nucleotide site diversity or the rently segregating in a population (which depends on the expected number of segregating sites in a sample (Kimura scaled mutation rate) can be substantial when the population 1971; Watterson 1975; Ewens 2004). In the framework of size is sufficiently large. For example, with a neutral mutation coalescent theory, this implies a linear relation between the rate of u per site in a population of N breeding adults, the ex- genealogical distance between two sequences and the neu- pected fraction of sites that are segregating in a randomly tral sequence divergence between them, greatly simplifying mating population is fs = u [ln(2N) + 0.6775], where u = methods of inference and statistical testing (Hudson 1990; 4Neu (Ewens 2004, p. 298). Thus, with u = 0.01, a reasonable Wakeley 2008). value for many species (Leffler et al. 2012), we have fs =0.15 even when N has the implausibly low value of 1 million. This
Copyright © 2014 by the Genetics Society of America implies that 15% of new mutations are expected to arise at doi: 10.1534/genetics.114.167973 sites that are already segregating, suggesting a significant de- Manuscript received July 3, 2014; accepted for publication September 11, 2014; parture from the assumptions of the infinite sites model. (An published Early Online September 16, 2014. Supporting information is available online at http://www.genetics.org/lookup/suppl/ alternative way of looking at this is to determine the expected doi:10.1534/genetics.114.167973/-/DC1. number of new mutations that occur at a site while a preex- 1Corresponding author: Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Ashworth Laboratories, King’s Bldgs., W. Mains Rd., isting mutation is segregating, which is of a similar magnitude Edinburgh EH9 3FL, United Kingdom. E-mail: [email protected] to fs—see Appendix,EquationA1.)
Genetics, Vol. 198, 1587–1602 December 2014 1587 In addition, it has been known for nearly 20 years that Analysis of the Model of Purifying Selection, Drift, sufficiently high scaled mutation rates at some or all sites in a and Mutation sequence can lead to substantial departures from the infinite Basic assumptions sites expectations for statistics such as Tajima’s D,whichare commonly used to detect deviations from neutral equilibrium We assume a randomly mating, diploid, discrete generation caused by population size changes or selection (Bertorelle and population with N breeding adults, and effective population fi Slatkin 1995; Aris-Brosou and Excof er 1996; Tajima 1996; size Ne. Over a long sequence of m nucleotide sites, each site Yang 1996; Mizawa and Tajima 1997). This is because the has two alternative types, A1 and A2, with mutation rates u occurrence of mutations at sites that are already segregating and v from A2 to A1 and vice versa. (With diploidy, this means increases the pairwise diversity among sequences, but does that only three genotypes are present at a site: A1A1,A1A2 not increase the number of segregating sites (Bertorelle and and A2A2.) A1 and A2 might correspond to AT versus GC base Slatkin 1995). The analysis of data on DNA sequence variation pairs, unpreferred versus preferred synonymous codons, or in hyperdiverse species thus requires methods that deal with selectively favored versus disfavored nonsynonymous var- this problem, and a number of population genetics models iants. If epigenetic variation is being considered, then A1 that contribute to this have already been developed (Desai and A2 could be regarded as the methylated or unmethylated and Plotkin 2008; Jenkins and Song 2011; Cutter et al. states of a nucleotide site or a differentially methylated region 2012; Jenkins et al. 2014; Sargsyan 2014). (or vice-versa). This approach, while undoubtedly oversimpli- Finally, analyses of the inheritance of epigenetic markers, fied, avoids the problem of modeling mutation among all four such as methylated cytosines, have suggested that these can basepairs, which is difficult to deal with except by making the sometimes be transmitted across several sexual generations, unrealistic assumption of equal mutation rates in all direc- but with rates of origination or reversion that are several tions (Ewens 2004). orders of magnitude higher than the mutation rates of DNA If selection is acting, we assume semidominance, with A2 sequences (Johannes et al. 2009; Becker et al. 2011; Schmitz having a selective advantage s over A1 when homozygous et al. 2011; Lauria et al. 2014). In view of the current in- and with the fitness of A1A2 being exactly intermediate, al- terest in the possible functional and evolutionary signifi- though our general conclusions are probably not strongly de- cance of epigenetic variation (Richards 2006; Schmitz and pendent on this assumption. There is complete independence Ecker 2012; Grossniklaus et al. 2013; Klironomos et al. 2013), among sites (i.e., recombination is sufficiently frequent that it seems important to develop models that can shed light on linkage disequilibrium is negligible), and all evolutionary forces their population genetics, to understand the evolutionary are weak, so that the standard results of diffusion approxima- forces acting on them. tions can be employed. The purpose of this article is to develop a relatively sim- When the population is at statistical equilibrium, the ple analytical framework for examining the consequences of probability distribution of variant frequencies over sites remains high scaled mutation rates, in the framework of the classical stationary and the mean numbers of sites in each possible state random mating, finite population size model with forward are constant over time, despite continual changes at individual and backward mutations in the presence of selection and sites (Charlesworth and Charlesworth 2010, pp. 270–272). At genetic drift (Wright 1931, 1937). The approach is similar in any given time, some sites are fixed for the A1 type, some are spirit to the biallelic model used by Desai and Plotkin fixed for A2, and others segregate for both. Let the equilibrium (2008), but with a focus on sample statistics that summarize proportion of sites that are fixed for A1 and A2 be f1f and f2f, properties of the site frequency spectrum, as well as on the respectively. The proportion of sites that are segregating is expected rate of substitutions along a lineage. As has been fs =1– f1f – f2f. found in previous coalescent-based treatments with neutral- Results for some important population parameters ity (Bertorelle and Slatkin 1995; Aris-Brosou and Excoffier 1996; Cutter et al. 2012), the results derived below show These assumptions allow the use of Wright’s stationary dis- that very large departures from the infinite sites model occur tribution formula (Wright 1931, 1937) to describe the prob- when the scaled mutation rate is sufficiently high, even ability density of the frequency q of A2 at a site, when fairly strong purifying selection is acting, resulting in 2 2 features of the data such as a large excess of intermediate- fðqÞ¼C expðgqÞpa 1qb 1; (1) frequency variants. In addition, the signal of purifying selec- tion on substitutions along a lineage can be obscured or where p =1– q, a =4Neu, b =4Nev, g =2Nes, and the even converted into a signal of positive selection. The find- constant C is such that the integral of f(q)betweenq = 0 and ings have significant implications for the interpretation of q = 1 is equal to 1. It is convenient to write u in terms of the the results of studies of both epigenetic variability and mutational bias parameter, k; i.e., u = kv,sothata = kb. DNA sequence variability in species with large effective pop- Properties of the distribution ulation sizes. Readers who are interested primarily in the main biological conclusions may wish to skip over the details An explicit expression for C can be obtained by noting that of the derivations. the integral of the other terms on the right-hand side with
1588 B. Charlesworth and K. Jain respect to q is equal to the product of G(a) G(b)/ G(a + b) all sites are fixed and calculating the rate of flux between sites and the confluent hypergeometric function 1F1(a, b, z) fixed for A1 and A2; q is then taken to be the frequency of sites (Abramowitz and Stegun 1965, p. 503), with parameters that are fixed for A2,with1– q representing the frequency of a = b, b = a + b, z = g, where sites fixed for A2 (Bulmer 1991). This raises the question of how good an approximation we obtain by neglecting the term XN i fi z ðaÞi of order b,whenthein nite sites assumption is violated, so 1 F1ða; b; zÞ¼ (2) i! ðbÞ that a significant fraction of sites are in fact segregating for A1 i¼0 i and A2. ; ::: 2 First, we note that it is immediately obvious from and ðxÞ0 ¼ 1 ðxÞi ¼ xð1 þ xÞð2 þ xÞ ði 1 þ xÞ for i $ 1 (Pochhammer’s symbol). Equation 1 and Equations 5a and 5b that q with g =0is This can be seen by expanding the exponential term in equal to 1/(1 + k), so that Equation 6 for this case is exact, Equation 1 in powers of gq and integrating over the range as has long been known (Wright 1931). We can also obtain 0 to 1 (Kimura et al. 1963). a first-order approximation to Equations A3a and A3b when Integrating Equation 1, we have g 6¼ 0 by expanding in powers of b, which will be accurate when b is sufficiently small. Neglecting second-order and Gða þ bÞ 1 higher terms in b, as is also done in Equations 8a–8c, we C ¼ : (3) GðaÞGðbÞ 1 F1ðb; a þ b; gÞ obtain
1 2 bkg expð2 gÞ Furthermore, the jth moment of q around zero, obtained q ; (7a) ½1 þ k expð2 gÞ from the integral of qjf(q) between 0 and 1, is given by where F ðb þ j; a þ b þ j; gÞðbÞ 1 1 j: P P MjðqÞ¼ (4) N i N i 1 F1ðb; a þ b; gÞ ða þ bÞ g þ kexpð2 gÞ g i! ai 1 þ g i! ai 1 2 aiÞ j g ¼ i¼1 þ i¼2 þ ½1 þ kexpð2 gÞ In particular, the mean frequency of A2 is (7b) ; ; ... 1 1 F1ðb þ 1 a þ b þ 1 gÞ and ai is the harmonic series 1 + 1/2 + 1/3 + +1/ q ¼ (5a) – ð1 þ kÞ 1 F1ðb; a þ b; gÞ (i 1), with i $ 2. As shown after Equation A3b of the Appendix, the leading and the mean frequency of A1 is term in Equation 6 should provide a good approximation ; ; when bk #0.1. k 1 F1ðb a þ b þ 1 gÞ: p ¼ ; ; (5b) ð1 þ kÞ 1 F1ðb a þ b gÞ Approximations for large g: For examining what happens when g becomes very large, it is useful to note that the Taylor’s series expansion of Equation 5b for small b yields the expression Approximations for small b: Approximations to these ( " expressions for the case when b ,, 1andk is of order 1 are XN k expð2 gÞ gi derived in the Appendix. Equations A3a and A3b imply that p 1 þ b 1 þ k expð2 gÞ i! i¼1 #) 1 XN q ¼ þ OðbÞ: (6) k expð2 gÞ giþ1lnðiÞ ½1 þ k expð2 gÞ þ : (8a) 1 þ k expð2 gÞ ði þ 1Þ! i¼1
The left-hand side of Equation 6 is equivalent to the For large g, this gives fraction of sites that carry A2 in a random sequence sampled from the population; if A and A correspond to unpreferred k expð2 gÞ b expðgÞ 1 2 p 1 þ ; (8b) and preferred codons, respectively, this measures the fre- 1 þ k expð2 gÞ g quency of preferred codons, Fop (McVean and Charlesworth
1999). With epigenetic variation, if A1 and A2 correspond to i.e., methylated and unmethylated states, respectively, q mea- bk 2 sures the fraction of unmethylated sites or regions in a ran- p 1 þ Oðg 1Þ : (8c) g dom genome. The leading term on the right-hand side of Equation 6 is identical to the Li–Bulmer equation commonly used in anal- The first term on the right-hand site of Equation 8c is yses of selection on codon usage (Li 1987; Bulmer 1991). equivalent to the asymptotic expression for p with large g This result is, however, often derived by assuming that nearly given by Kimura et al. (1963). This implies that, for sufficiently
Population Genetics of Hyperdiversity 1589 large g compared with b, the mean frequency of the disfa- 2bk ½1 2 expð2 gÞ p : (11) vored variant is equal to its equilibrium frequency under g ½1 þ k expð2 gÞ mutation–selection balance with s .. u in an infinite pop- ulation, where p =2u/s =2vk/s (Haldane 1927), as As expected, this is identical to equation 15 of McVean and expected intuitively. Numerical studies show that Equation Charlesworth (1999) for the infinite sites model at statistical 8b performs well for g . 1ifb ,, 1, when it can give a good equilibrium, where new mutations arise only at sites that are approximation when neither the leading term in Equation 6 fixed either for A1 or for A2. When g .. b, this term con- nor the Kimura et al. (1963) large g approximation is accu- verges on the deterministic value under mutation–selection rate (results not shown). Equation 8b implies that the leading balance, 2bk/g, which corresponds to the diversity expected term in Equation 6 is accurate when g ,, –ln(b). at deterministic mutation–selection balance with p =2u/s (see above). Frequencies of fixed sites: Second, the approximate fre- In the case of neutrality, Equation 10b reduces to the quencies of sites that are fixed for A1 and A2, f1f and f2f, can following expression (Charlesworth and Charlesworth 2010, be found from the integrals of f(q) between 0 and 1/(2N) p. 237): and 1– 1/(2N) and 1, respectively (Ewens 2004, p. 178). For large N, such that gN–1 ,, 1, when q is close to zero we 2bk p ¼ : (12) have f(q)=C[qb–1 + O(gN–1)+O(bN–1)], and so ð1 þ kÞ½bð1 þ kÞþ1 Z 1=ð2NÞ f1f fðqÞdq As expected intuitively, the neutral diversity is always less 0 than for the infinite sites model with a given value of b and 21 2b 21 21 ¼ C½b ð2NÞ þ OðgN ÞþOðbN Þ (9a) k, where Equation 11 with g = 0 gives p =2bk/(1 + k), R because some new mutations arise at sites that are already 1 polymorphic; p approaches 2k/(1 + k)2 for large b, which is f2f 121=ð2NÞ fðqÞdq 2 2 fi ¼ C expðgÞ½ðbkÞ 1ð2NÞ bk þ OðgN21ÞþOðbN21Þ ; the value for an in nite population at equilibrium under reversible mutation between A and A . (9b) 1 2 Rate of substitution along a lineage where the terms in O(gN–1) and O(bN–1) can be neglected when N is sufficiently large (cf. Kimura 1981). Approxima- Analytic results: The rate of substitution of new mutations tions for these expressions for small b can readily be ob- along a lineage can be modeled as follows. Conditioning on tained (see Appendix). a frequency q of the A2 variant at a site in a given genera- tion, there is an expected number of 2Nvkq mutations per Nucleotide site diversity: Third, the expected pairwise site from A2 to A1 and 2Nvp from A1 to A2. The correspond- fi nucleotide site diversity, p, can be obtained from the expec- ing probability that A1 becomes xed, conditional on p,is – – tation of 2pq, given by 2E{q – q2}. From Equation 4, we have Q1(p) = [exp(gp) 1]/[exp(g) 1] (Kimura 1962). Condi- tioning on this fixation event, the probability that it is a new ; ; fi 2 bðb þ 1Þ 1 F1ðb þ 2 a þ b þ 2 gÞ A1 mutation that has been xed is 1/(2Np). The expected E q ¼ : fi ða þ bÞða þ b þ 1Þ 1 F1ðb; a þ b; gÞ number of new A1 mutations that become xed is thus equal –1 (10a) to vkp qQ1(p). Similarly, the conditional probability that A2 eventually becomes fixed is Q2(q)=[1– exp(–gq)]/[1 – exp 2 The expectation of E{q – q } is given by subtracting Equa- (–g)]; the net expected number of new A2 mutations that –1 tion 10a from Equation 5b. Using Equation 2 and simplify- become fixed is vpq Q2(q). (At first sight, it would seem ing, we obtain that this procedure cannot be applied to mutations arising