<<

Genet. Res., Camb. (2002), 79, pp. 129–139. With 6 figures. # 2002 Cambridge University Press 129 DOI: 10.1017\S0016672301005493 Printed in the United Kingdom Extending the coalescent to multilocus systems: the case of balancing selection

NICK H. BARTON  ARCADIO NAVARRO* Institute of Cell, Animal and Population Biology, UniŠersity of Edinburgh, UK (ReceiŠed 18 May 2001 and in reŠised form 31 August 2001)

Summary Natural populations are structured spatially into local populations and genetically into diverse ‘genetic backgrounds’ defined by different combinations of selected . If selection maintains genetic backgrounds at constant frequency then neutral diversity is enhanced. By contrast, if background frequencies fluctuate then diversity is reduced. Provided that the population size of each background is large enough, these effects can be described by the structured coalescent process. Almost all the extant results based on the coalescent deal with a single selected locus. Yet we know that very large numbers of are under selection and that any substantial effects are likely to be due to the cumulative effects of many loci. Here, we set up a general framework for the extension of the coalescent to multilocus scenarios and we use it to study the simplest model, where strong balancing selection acting on a set of n loci maintains 2n backgrounds at constant frequencies and at linkage equilibrium. Analytical results show that the expected linked neutral diversity increases exponentially with the number of selected loci and can become extremely large. However, simulation results reveal that the structured coalescent approach breaks down when the number of backgrounds approaches the population size, because of stochastic fluctuations in background frequencies. A new method is needed to extend the structured coalescent to cases with large numbers of backgrounds.

different locations or associated with different alleles 1. Introduction (cf. Hudson, 1990; Hey, 1991; Nordborg, 1997). A The increasing wealth of information on DNA powerful heuristic argument used in those studies is sequence that has accumulated during that the effect of selection on linked neutral variability the past two decades has stimulated a variety of is analogous to the effect of spatial subdivision. When studies that describe the patterns of neutral DNA a population is subdivided into demes of constant diversity that are to be expected under different size, its effective population size is increased and evolutionary forces, and that use those patterns to neutral variability can be larger than in a panmictic detect . Broadly, the effects of population (Nagylaki, 1982). By contrast, if the sizes selection on linked neutral variability depend on of demes fluctuates, for example by an - parameters such as the population size (N), the recolonization process, population size decreases and neutral mutation rate (µ), the recombination rate neutral variability is lost (Whitlock & Barton, 1997). between the selected and the neutral loci (r) and, of If one thinks in terms of the coalescent times, stable course, the strength of natural selection (s). The subdivision increases them because it takes time to general case is difficult, but a considerable amount of move from one location to another in order to progress has been made for strong selection (Ns ( 1) coalesce. Fluctuations make times shorter because using the structured coalescent, which follows back in lineages only tend to trace back to the limited number time the genealogies of sets of genes that can be in of successful demes. Natural populations are also structured into diverse * Corresponding author. Institute of Cell, Animal and Population backgrounds (i.e. haplotypes) defined by different Biology, University of Edinburgh, West Mains Road, Edinburgh combinations of selected alleles. Just as with spatial EH9 3JT, UK. Tel: j44 (0) 131 650 5513. Fax: j44 (0) 131 650 6564. e-mail: arcadi!holyrood.ed.ac.uk subdivision, genetic structure influences neutral di-

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 N. H. Barton and A. NaŠarro 130

versity. Diversity at linked neutral loci is reduced if where diallelic polymorphisms are maintained at background frequencies fluctuate more than expected equilibrium by strong balancing selection. We choose by drift either because selection eliminates deleteri- this model because of its simplicity. Frequencies can ous variants (Charlesworth et al., 1993; Hudson considered to be stable and no mutation in the & Kaplan, 1994; Charlesworth, 1994; Hudson & selected loci needs to be taken into account. We study Kaplan, 1995) or because it forces the fixation of variability by focusing on a pairwise measure: the advantageous alleles (Kaplan et al., 1989; Stephan et average probability of identity in state between two al., 1992; Aquadro & Begun, 1993; Aquadro et al., randomly chosen alleles. By doing so, we are studying 1994; Barton, 1998). By contrast, if balancing selection the distribution of pairwise coalescent times, of which maintains genetic backgrounds at constant frequencies identities are the Laplace transform (Hudson, 1990). then neutral diversity is enhanced (Strobeck, 1983; Afterwards, we check the validity of the analytical Hudson & Kaplan, 1988; Kaplan et al., 1988; Hey, approach by simulations. We aim to answer the 1991; Nordborg, 1997). question of how much neutral DNA sequence vari- The analogy with spatial subdivision, however, has ability can be inflated by balancing selection at many limitations. Genetic structure is more complex, and linked loci. harder to study, than its spatial analogue. Genomes are formed by very large numbers of genes (e.g. " 3i % 10 in humans) upon which selection can act, and 2. Results substantial effects of neutral variability are only expected through the accumulated effects of many (i) Framework loci. Thus, the relevant backgrounds will be defined In a random-mating diploid population of fixed size by combinations of alleles from several (perhaps N, the probability that two genes are identical by many) different loci, possibly spanning large regions descent (F) from the previous generation is 1\(2N). of the genome. This implies that one needs to consider The identity by descent of two alleles in the next genealogies that can be transferred between back- generation is: grounds by complex recombination events, rather E G than by simple migration. Owing to these difficulties, 1 1 1kF Fhl j 1k F l Fj .(1) almost all the theoretical results obtained up to now 2N F 2N H 2N deal with a single selected locus. Some multilocus results have been produced for purifying selection, This well-known result can be easily extended to the where the exact multilocus coalescent can be ap- multilocus case following the approach of Maruyama proximated by a simpler version, allowing the con- (1972; generalized by Nagylaki, 1982) for a spatially sideration of classes of backgrounds instead of the structured population and adapting it to genetical backgrounds themselves. Genetic backgrounds are structure. Suppose that genetic variation at many loci classified according to the number of deleterious defines a set of genetic backgrounds, labelled i. The mutations that they harbour. Backgrounds within a elements of the backward matrix, Γij, give the chance class are considered equivalent, so one only needs to that a that is presently in background i was follow lineages across classes and can ignore the exact in background j in the previous generation. The background they are associated with (Charlesworth et probability of i.b.d. in a given generation between a al., 1993; Hudson & Kaplan, 1994; Charlesworth, gene in background i and a gene in background j is 1994; Hudson & Kaplan, 1995). Other forms of given by Fij

multilocus selection, such as balancing selection, do E E G G ! δkl(1kFkk) not yield so easily to simplification and multilocus F ij l  Γik Γjl Fklj , (2) k l 2N results are only starting to emerge. For example, the F F k H H patterns of neutral sequence variation generated by where δkl l 1, δkl l 0ifk  l and Nk is the population balancing selection acting upon two diallelic loci that size of background k. The recursions for the identity interact epistatically has been recently investigated by in allelic state, f, are: Kelly and Wade (2000). Yet, the way in which E E G G multilocus balancing selection may be affecting linked ! # δkl(1kfkk) f ij l z  Γik Γjl fklj , (3) neutral variability in a genome-wide scenario is still k l 2Nk poorly understood. F F H H Here, we start by setting up a general analytical where z l 1kµ (µ is the mutation rate) and f can be method that extends the structured coalescent to any regarded as the generating function or the Laplace situation where selection is strong enough (i.e. Ns is Transform for the distribution of coalescence times large enough) for background frequencies to be (Hudson, 1990). constant. Then we apply the method to a model in These recursions can be simplified by changing the which a neutral locus is linked to a set of selected loci co-ordinates so as to diagonalize Γ. Let the eigenvalues

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 The multilocus coalescent and balancing selection 131

of Γ be λα ; the matrix of left eigenvectors is ναi, and be in background j in the next generation. It is the matrix of right eigenvectors is ηiα. The leading obtained as the product of two matrices; the first eigenvalue is λ" l 1, with ηi" independent of i, and accounting for the change in background frequency corresponds to complete mixing. ν"i gives the ultimate under selection or drift; the second, for recombination contribution that a gene currently in background i will in a pool with frequencies ph and qh.

make to future generations. E G Following Nagylaki (1982), we can define (1krqh)ph rqhph p p M l fij l ηiα ηjβ fαβ and fαβ l ναj νβk fjk, (4) rq p q (1 rp ) α β j k h h h k h

F q q H where E G 1krqh rqh l . (8) ναiηiβ l δαβ and ηiα ναj l δij. (5) rp 1 rp i α F h k h H Thus, in the new coordinate system, Equation (3): The elements in the background matrix Γkl give the chance that a gene that is presently in background k E G ! # ν kν k(1kfkk) was in background l in the previous generation. It can f l z λ λ f j α β . (6) αβ α β αβ k be obtained by applying Bayes’ rule and normalizing. F 2Nk H In the single locus case, it is Hence, at equilibrium, assuming that the background E G frequencies remain constant over time: 1krqh rqh Γ l . (9) # F rph 1krph H z λα λβ ναk νβk(1kfkk) fαβ l  # . (7) k 2Nk(1kz λα λβ ) Note that the forward and the backward matrices are identical because selection or drift do not This gives a simple formula for the full set of identities, move alleles between backgrounds (i.e. they are not fαβ, in terms of the diagonal elements, fkk. transitions). The eigenvectors and eigenvalues have very simple (ii) Simple examples: one and two selected loci forms Let us start with the simplest example: a single λ l (1, 1kr)(10) E G selected locus. Suppose that two genetic backgrounds qh ph are defined by the segregation of two alleles at a single ν l (11) 1 k1 locus (P, Q), at frequencies p and q at a given F H E G generation. Selection (or drift) then alters the fre- 1 ph η l .(12) quencies to ph, qh at the next generation. A linked F 1 kqh H neutral locus is associated with one or other of these backgrounds; the recombination rate is r. Consider now the case of two selected loci, each The forward transition matrix is M; element Mij segregating for two alleles P and Q with frequencies pi gives the chance that a gene now in background i will and qi. In this case, there are four possible back- grounds: P"P#, P"Q#, Q"P# and Q"Q#. We define the four background frequencies as b", b#, b$ and b% in one ! ! ! ! generation and b", b#, b$ and b% in the next. Assume a linear map nk1k2 (where n stands for ‘neutral’) as depicted in Fig. 1 and no interference. r1 r2 In this case, the backward transition matrix is (we Fig. 1. The neutral locus (gray) lies at the left of the set of selected loci (black). drop all the prime symbols)

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 N. H. Barton and A. NaŠarro 132

E b"(b"jr`"(b$jb%r`#)jb#(r"r#jr`"jr`#)) b#(b$r#r`"jb"(r#r`"jr"r`#)) b#b$r#kb"(k1jb%r#) b#b$r#jb"(1kb%r#) b"(b%r#r`"jb#(r#r`"jr"r`#)) b#(b#jr`"(b%jb$r`#)jb"(r"r#jr`"r`#)) b"b%r#jb#(1kb$r#) b"b%r#kb#(k1jb$r#) Γ l … b"r"(b$jb%r#) b#b$r"r`# b"b%r#jb$(1kb#r#) b"b%r#jb$(1kb#r#) b"b%r"r`# b#r"(b%jb$r#)

F b#b$r#jb%(1kb"r#) b#b$r#jb%(1kb"r#) (13) G b$r"(b"jb#r#) b"b%r"r`# b#b$r#jb"(1kb%r#) b#b$r#jb"(1kb%r#) b#b$r"r`# b%r"(b#jb"r#) b"b%r#jb#(1kb$r#) b"b%r#jb#(1kb$r#) , b$(b$jr`"(b"jb#r`#)jb%(r"r#jr`"r`#)) b%(b"r#r`"jb$(r#r`"jr"r`#)) b"b%r#kb$(k1jb#r#) b"b%r#jb$(1kb#r#) b$(b#r#r`"jb%(r#r`"jr"r`#)) b%(b%jr`"(b#jb"r`#)jb$(r"r#jr`"r`#))

b#b$r#jb%(1kb"r#) b#b$r#kb%(k1jb"r#) H where r" and r# are according to Fig. 1 and where r-i l 1kri. The eigenvectors are products of contributions Although the approach we use here is entirely from each locus. The first left eigenvector (first row of general and can be applied to diverse multilocus ν) corresponds to the steady state, where each scenarios, calculations become easier when it is background contributes in proportion to its fre- assumed that there is no linkage disequilibrium quency; in the long term, there is the same expected between selected loci, so the frequency of each marker frequency in each background (first column of background is just the product of allelic frequencies. η). The second eigenvalue corresponds to linkage In this case, the matrix can be greatly simplified. disequilibrium of the neutral marker with just the first locus, which breaks down at rate (1kr"). The third

E q"r`"(p#jq#r`#)jp"(p#jq#(r"r#jr`"r`#)) q#(q"r#r`"jp"(r#r`"jr"r`#)) p#(q"r#r`"jp"(r#r`"jr"r`#)) q"r`"(q#jp#r`#)jp"(q#jp#(r"r#jr`"r`#)) Γ l … p"r"(p#jq#r#) p"q#r"r`#

F p"p#r"r`# p"r"(q#jp#r#) (14) G q"r"(p#jq#r#) q"q#r"r`# p#q"r"r`# q"r"(q#jp#r#) . p#(q"jp"r`")jq#(p"r`"r`#jq"(r"r#jr`"r`#)) q#(p"r#r`"jq"(r#r`"jr"r`#))

p#(p"r#r`"jq"(r#r`"jr"r`#)) p"r`"(q#jp#r`#)jq"(q#jp#(r"r#jr`"r`#)) H Then, its eigenvectors and eigenvalues are eigenvalue corresponds to linkage disequilibrium with λ (1, r` , r` r` r r , r` r` )(15) l " " #j " # " # just the second locus, which breaks down at a rate E q"q# q"p# p"q# p"p# G equal to the chance that there will be no recombination kq# kp# q# p# of the neutral marker with the second locus, i.e. ν l (16) (1 r")(1 r#) r"r#. The fourth eigenvalue expresses kq" q" kp" p" k k j three-way disequilibrium. A different transition matrix 1 k1 k11 F H is obtained when the neutral marker lies between the E G 1 kp" kp# p"p# two selected loci (map 1-n-2). Its eigenvectors, however, are identical to the ones in (16) and (17); 1 kp" q# kp"q# η l ,(17) only the eigenvalues differ: 1 q" kp# kp#q" λ l (1, r`", r`#, r`"r`#). (18) F 1 q" q# q"q# H Assuming constant background frequencies, one where r" and r# are according to Fig. 1 and where can easily plug eigenvalues and eigenvectors into (7) r-i l 1kr-i. to obtain the equilibrium identities.

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 The multilocus coalescent and balancing selection 133

Because the second and third terms sum to zero, and (iii) Many loci the first and last terms sum to 1, we have δU,V,as The previous examples extend to many loci in a required. straightforward way by means of the multilocus machinery developed by Barton and Turelli (1991) and the principal components approach of Bennet (v) The effect of recombination of the identities (1954; see Dawson, 2000 for an insightful update and Now, consider the effect of recombination. Let the extension). The eigenvectors of the transition matrix chance that the marker remains associated with the set are products across loci, which correspond to the S be rS. Thus r! l 1 (where ! stands for the empty set), steady decay of various higher-order linkage disequi- and for a single selected locus i and recombination libria. They have a simple form because of the key rate c, roiq l 1kc, (we use c rather than r to avoid assumption that the backgrounds are in linkage confusion). Assuming a linear map and n equally equilibrium. Therefore, only the disequilibria of the spaced loci, with no interference, the chance of k n−k marker with each set of selected loci, which decay generating k junctions is c (1kc) . This leads to an geometrically (Bennet, 1954; Barton & Turelli, 1991; expression for rS. The transition matrix which gives Dawson, 2000), need to be tracked. If the backgrounds the numbers that move from background X to Y is were in linkage disequilibrium then the eigenvectors would be extremely complicated functions of recom- Γ(X, Y) l rSδS(X, Y)gL S(Y). (21) S B bination rates. Multiplying by νU(X), νV(Y) shows that these are indeed eigenvectors of Γ, and that the corresponding eigenvalues are just rV. (iv) Defining eigenŠalues We first define the multilocus eigenvectors and find (vi) The leading term in 1\N the eigenvalues in terms of the recombination rates. " This requires no assumptions about the map. How- We can find the leading term to O(N) by setting fkk to ever, with equally spaced loci on a linear map and no zero in (7). This leads to interference, the eigenvalues simplify. We then find " # E G the identities to leading order in , and give a * z λUλVδU,V 1 N f U,V l # # jO # , (22) recursion for the higher-order terms. 2NpUqU(1kz rU) F N H It is convenient to describe the backgrounds by a where f * is a linear transformation of the identity vector X, with values X 0or1 for alleles P or Q. U,V i l f(X, Y) between genes chosen from backgrounds X, Y. The eigenvectors correspond to linkage disequilibria It corresponds to the covariance between random between sets of loci U (set (2,4,9), for example, fluctuations in linkage disequilibria, just as f for a includes loci 2, 4 and 9). We write the eigenvectors as single locus corresponds to the variance in allelic functions of X, ν (X), η (X). Let S 2X 1, so that U U i l ik frequencies. (22) shows that fluctuations in disequi- S 1. Let g (1 X ) S p (so that g q or p ); i lp i l k i j i i i l i i libria involving different sets of loci are indepen- g- X S p (so that g- p or q ). Let δ (X, Y) 1 if i l ik i i i l i i i l dent (δ ), and their variance depends on the X Y , 0 otherwise. We will use the convention that U,V i l i probability of remaining associated with that set (r ). S  S , etc. The complete set of loci is L. Thus, U U l i ? U i Notice that the average identity between randomly the frequency of the gamete containing the set of loci # # chosen genes is f l z \(2N(1kz )), which is just the Uisg . !,! U same as for an unstructured population. This seems to The eigenvectors are contradict the decrease in identity which one expects for markers linked to a balanced polymorphism. νU l SUgLBU ηU l SUg` U.(19) However, it can be shown that this decrease is of order # Each of the terms in (19) is a function of the genotype 1\N and so does not appear in (22). X. To show that these eigenvectors are orthogonal, one only has to sum their product over X (vii) The full solution: transforming the identity within backgrounds νUηV l SUSVgLBUg` V l δU,V. (20) X X To get the full solution (valid for smaller N), one Because each of the elements in (5) is a product across needs to solve a matrix equation for the vector f(X, X ), loci, this sum can be simplified by separating it into which is the diagonal element of the full matrix of terms corresponding to the four kinds of loci: those in pairwise identities. It is easier to solve by transforming

U and V, in U but not V, in V but not U, and in to fU l XνU(X)f(X, X). Note that this is not the neither. These contribute g- i; Si; Si gi g- i; gi, respectively. same as fU,U, given by (22).

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 N. H. Barton and A. NaŠarro 134

Table 1. The eight classes of terms of the matrices H and H* (25). Each matrix is the product of some of these terms according to the presence (1) or absence (0) of loci in the sets W, U, V or Z, U, V

* HW,U,V HZ,U,V

Presence\Absence Term Presence\Absence Term WUV ZUV 000 1 00 0 1 001 0001 0 0 1 00 01 00 0 11 pq 0 11 1\(pq) 1 00 0 1 00 0 1 0 11 10 11 110 1110 1 111 (qkp) 11 1 (qkp)\(pq)

Applying this transformation to (7), and after some distance spanned by U (if the marker is embedded tedious algebra, we have within it) or the distance from the marker to the # * furthest locus if the marker is outside U. Suppose 1 z rUrVHW,U,V fU,V l  # (δW, kfW) (23) there are n+ loci to the right and n− to the left; the 2N W (1kz r r ) ! U V marker is αc from the right-hand locus and (1kα)c and from the left-hand locus. n+−" j fW l  HW,U,V fU,V, (24) 1 2 U,V Λ , l #j  # # ! ! 1kz j=! 1kz (1k(αjj)c) where n −" − 2j * ηZνUνV j  # # HW,U,V l νWηUηV, HZ,U,V l  (25) j=! 1kz (1k(1kαjj)c) X X gL n++n−−" j−" 2 (n+jn−kj) Note that (1kfX,X) transforms to (δZ,!kfZ). The j  # # (28) j=" 1 z (1 jc) matrices H, H* can be found by separating the sums k k over X into contributions from the eight classes of loci where j is the number of junctions. We have used that are defined by belonging or not to the three sets the fact that there are 2j sets which have all the loci to U, V and W. The matrix is the product of contributions the left of the neutral locus, and the leftmost locus * j−" (Table 1). The driving term involves ZδZ,!HZ,U,V l (jjα)c away; and there are 2 (n+jn− kj) sets * H !,U,Vl δU,V\(pUqU). The remaining sum involves which span a distance jc and which include the neutral HW,U,U l [(qkp)\pq]W for all U that contain W, and locus. zero otherwise. The driving term (which is just the transform of (22)) is therefore (ix) The aŠerage identity E G E G * 1 qkp f W l ΛW,W, (26) Because the actual background to which a neutral 2N pq F H F H W allele is linked is likely to be unknown, we focus on the where average identity between gametes chosen at random from the whole population, f , . From (23), this is 1 ! ! ΛW,W l  # # . (27) # V (1kz rWV) z (1kf! ) f , l # . (29) ! ! 2N(1 z ) As we can see (23), the full solution is rather complex, k

and obtaining numerical values for more than, say, The coefficient f! is the average identity between five loci becomes excruciatingly slow. However, several randomly chosen genes within the same background. simplifications can be made. It increases with tighter linkage.

(viii) Simplifying Λ for particular models of (x) An explicit solution for equal allele frequencies recombination −" Suppose that we have slow change (µ, r, N ' 1). Let The ΛU,V terms simplify for particular models of rU l 1kρU, z l 1kµ. In this case, (23) becomes Λ recombination. Consider !,!. For a linear map with * at most one crossover per generation (i.e. total HZ,U,V fU,V l  (δZ,!kfZ). (30) interference), we have rU l 1kkc, where k is the Z (4Nµj2NρUj2NρV)

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 The multilocus coalescent and balancing selection 135

Assuming a linear map, the recombination rates ρU (a) are just the total recombination rates between the 1 0·9 N = 10000 limits of the sets (including the marker). Overall, we 0·8 N = 1000 have a set of linear equations in the fZ 0·7 N = 100 0·6 * f HW,U,VHZ,U,V 0·5 fW l   (δZ,!kfZ). (31) U,V Z (4Nµj2NρUj2NρV) 0·4 0·3 In particular, the average identity between randomly 0·2 chosen gametes is 0·1 0 0510 15 20 (1kf!) f , l , (32) ! ! 4Nµ (b) 1 where 0·9 N = 10000 0·8 N = 1000 * 0·7 N = 100 H ,U,VHZ,U,V f ! (δ f ). (33) 0·6 ! l   Z,!k Z f U,V Z (4Nµj2NρUj2NρUρV) 0·5 0·4 The array H!,U,V only has contributions pU qU from 0·3 U l V, and from Z 7 U: 0·2 0·1 1 E G 0 f! l  1k  (qkp)Z fZ . (34) 0510 15 20 U (4Nµj4NρU) Z 7 U F H Number of selected loci Further simplification is achieved by assuming that Fig. 2. Theoretically expected identities between two some form of symmetrical balancing selection main- randomly chosen gametes for three different population tains equal and even frequencies at all the selected loci sizes. All the loci (both selected and neutral) are evenly " (pi l qi l #). Then, f can be easily expressed as spaced and the neutral locus lies at an extreme of the ! map (Fig. 1). Selection maintains even frequencies at all −% −& −$ 1 the selected loci. µ l 1n25i10 .(a) r l 10 .(b) r l 10 . f l 1k . (35) ! 1 1j U (4Nµj4NρU) loci, the number of possible backgrounds is 1024, and a population of, say, size 100 is clearly unable to Recall that U stands for all the possible sets of loci sustain such a level of subdivision. Although alleles and ρU is the total recombination between the limits of are maintained, some of the backgrounds they define each set. These results could be extended to arbitrary must be absent, rendering the population far less linear maps, because all that matters is the chance of structured than our analytical results assume. The a recombination event that dissociates the marker number of backgrounds grows exponentially with the ' from some set of loci. number of loci. For example, 20 loci produce " 10 ( Some values predicted using (35) and (32) are possible backgrounds, and 24 loci " 1.7i10 , so any plotted in Fig. 2. The predicted identity decreases (i.e. population is bound to lose selected polymorphism if diversity increases) steadily with the number of the number of loci is large. In the following, we use backgrounds. In fact, as the number of backgrounds simulations to check that the structured coalescent becomes large, the identity becomes absurdly small. breaks down for quite small numbers of loci even Examining (35), one can see an obvious reason for the when N is large. increase in variability predicted by the structured coalescent: the second term in the denominator is a n (xi) Simulations sum across all possible sets of loci that involves " 2 summands and, therefore, can become an enormous The coalescent approach developed above relies on quantity. In fact, with a large number of loci, the two related assumptions: that every background is number of summands may become so big that abundant; and that its frequency is both known and mutation rates can become almost irrelevant. (32) and constant. These assumptions will, in principle, be (35) predict that, as the number of backgrounds valid for strong and constant selection (i.e. Ns large

increases, f! 4 1 and f!,! 4 0 independently of mutation and s constant) but, when many selected loci are rates, because the backgrounds will eventually diverge, considered, the equations predict an unrealistically even for very low mutation. But, of course, as the large hitchhiking effect (Fig. 2), with identity dropping number of backgrounds approaches population size almost to zero. To find out why and under which (N), it is impossible to maintain all of them in the parameter values the theory breaks down, we simu- population. For example, if we consider 10 selected lated a multilocus system. Details of the simulation

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 N. H. Barton and A. NaŠarro 136

1 (a) 0·18 0·8

0·6 w 0·12 Simulated 0·4 Neutral f Predicted 0·2

0·06 0 0·2 0·4 0·6 0·8 1 h (heterozygosity) Fig. 3. Some instances of the fitness function. To make 1 0 values comparable, we divide by jα. Continuous lines 123456789 α l 1; dashed lines α l 10. Straight lines k l 1 (additive fitness); concave lines k l 10 (positive ); convex lines k l 0n1 (negative epistasis). (b) 0·8 system can be found in a companion paper (Navarro & Barton, 2002). Its basics are as follows: the program 0·6 performs forward simulations of a life cycle with selection and recombination for a population of size Simulated N. As in the previous section, we consider a number of f 0·4 Predicted selected loci, each segregating for two alleles, and a Neutral neutral locus lying alongside them. Selected loci are assumed to be spaced at equal intervals along the 0·2 genetic map, and the recombination between any two adjacent loci is r. The mutation rate at the neutral 0 locus is µ. All the selected loci have heterozygote 123456789 advantage and selection is symmetrical at each locus. Simulation values throughout this article were ob- (c) tained by running the population to drift-selection 1 equilibrium and calculating identities between ran- 0·9 domly chosen gametes ( f ) in the last generation. Every 0·8 f value plotted in the graphs is the average of 10 runs. 0·7 The coalescent approach presented is entirely 0·6 general and does not assume any specific multilocus f 0·5 fitness regime. Making some assumptions, we have 0·4 used it to study a polymorphic equilibrium where 0·3 n Neutral selection tries to maintain all the possible 2 back- 0·2 Predicted grounds at constant frequencies and without linkage 0·1 Simulated disequilibrium among selected loci. Such an equi- 0 librium can be reproduced under the n-locus sym- 123456789 metric viability model with negative epistasis (Karlin Number of selected loci & Avni, 1981; Christiansen, 1987, 1988, 1990; Barton Fig. 4. Identities (pSE) at a neutral locus with an & Shpak, 2000). This is the model we used in our increasing number of selected loci. All the loci (both selected and neutral) are evenly spaced. The neutral locus % simulations. For simplicity, we assumed that all loci lies at an extreme of the map (Fig. 1). µ l 1n25i10− , & % $ contribute equally to fitness and that selection acts r l 10− , α l 1 and k l 0n1.(a) N l 10 .(b) N l 10 . # only on the proportion of heterozygosity of an (c) N l 10 . individual. The fitness of an individual was given by k w (α, h, k) l 1jαh , where h is the proportion of with no linkage disequilibrium (Christiansen, 1987, heterozygous loci in a given individual (0 % h % 1) 1988). If k " 1, there is positive epistasis, so selection and α is the strength of selection. Epistasis enters the favours increased variance in heterozygosity. In this function by means of k, a parameter that allows for case, the equilibrium population has maximum linkage different selective regimes (Fig. 3). If k ! 1, there is disequilibrium and, when linkage is complete, it is negative epistasis, so selection favours decreased formed by two complementary backgrounds. If k l 1, variance in heterozygosity and, at equilibrium, all fitness is additive and selection only affects mean possible backgrounds are present in the population heterozygosity (i.e. individual loci), ignoring the

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 The multilocus coalescent and balancing selection 137

0·7 Kaplan et al., 1988; Hudson 1990; Nordborg 1997). It has been shown that the neutral variants linked to a 0·6 Neutral given selected allele (background) constitute a different 0·5 Predicted Sim. k = 0·1 ‘subpopulation’ and that differentiation between 0·4 Sim. k = 0·3 subpopulations increases the predicted sequence vari- Sim. k = 0·7 1 f Sim. k = 0·9 ability (Hudson & Kaplan, 988). Recently, Kelly and 0·3 Wade (2000) extended this single-locus theory to a 0·2 two-locus scenario and studied neutral sequence variation linked to a pair of balanced polymorphisms 0·1 held in strong linkage disequilibrium by epistatic 0 selection. Their main result is that neutral variability 12345 6789 can be increased across the intervening region for a Number of selected loci longer distance than expected with a single selected Fig. 5. Identities at a neutral locus with an increasing locus. As we detail elsewhere (Navarro & Barton, number of selected loci and different degrees of negative 2002), this two-locus selective scenario is a special case epistasis. Loci distribution and parameters as in Fig. 4 $ of the fitness model studied here (specifically, it is with N l 10 . equivalent to symmetric overdominance with positive epistasis, i.e., k " 1 in our fitness function). variance in heterozygosity and thus ignoring linkage Here, we have developed a general method to apply disequilibrium. Detailed results on neutral variability coalescent theory to multilocus scenarios. We have under these different fitness schemes can be found in used the method to extend the existing theory to Navarro & Barton (2002). Here, we only concern consider an arbitrary number of loci under balancing ourselves with the case of negative epistasis because it selection. The extension seems to work fairly well for allows straightforward comparisons with the analy- small number of loci but, when the number of loci tical coalescent approach. becomes too big, an absurdly large effect is predicted Fig. 4 shows predicted and simulated changes in the (Fig. 2). To perform a preliminary exploration of the value of identity as the number of selected loci causes of this behavior, we have simulated the effects increases, for different population sizes. For multi- on neutral variability of multilocus balancing selection locus systems with strong negative epistasis and few with negative epistasis. We find that such selection has selected loci, the analytical predictions hold quite well the potential to enhance neutral variability far beyond when compared with the simulations. With few loci, the predictions for a single selected locus, because it selection is successfully forcing all the possible allows many backgrounds to be maintained in the backgrounds to be present at even and constant population. There is, however, a limit to that increase. frequencies. The population is as subdivided as the Simulations show that variability stops increasing and theory assumes and neutral variability increases seems to stabilize about a certain value, independent proportionally to the level of subdivision. As expected, of the addition of more selected loci. An obvious when the number of loci in the system becomes too explanation is that, even with extremely strong large, the analytical and simulation results start to selection, the population size imposes a logical limit to diverge. Of course, the larger the population size, the the number of backgrounds that can coexist. However, more selected loci are needed for theoretically pre- this can hardly explain the results in Figs 4 and 5. A dicted identities to diverge from simulations. Fig. 5 close examination of these figures shows that simu- shows another factor affecting the divergence of lations diverge from the coalescent predictions even simulations form analytical results: the strength of when the number of backgrounds is relatively small epistasis in the system. The closer the fitness scheme is compared with population size. (e.g. in Fig. 4a, % to additivity, the more important is the divergence. As divergence starts with 64 backgrounds for N l 10 ). we discuss below, the cause of this divergence is that, Stochastic fluctuations in background frequencies are when the number of loci is large and\or epistasis is crucial for this divergence. zero or positive, the population is not as subdivided as The increase of variability produced by multilocus the theory assumes because fewer genetic backgrounds balancing selection is opposed by drift. With stable are maintained. background frequencies, drift acts independently within each subpopulation of size N\2n. This effect is taken into account by our analytical model. When the 3. Discussion number of backgrounds is large enough, another There is a substantial body of theory on the effects source of drift becomes important. The number of that balancing selection acting on a single selected backgrounds grows exponentially with the number of diallelic locus is expected to have on linked neutral loci and, although selection may still be very strong on variability (Strobeck, 1983; Hudson & Kaplan, 1988; each individual locus and allele frequencies are close

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 N. H. Barton and A. NaŠarro 138

long as their frequencies are kept constant. If this is 0·14 not the case, some way to account for perturbations 0·12 in background frequencies must be introduced. In 0·1 principle, one could consider genealogies conditioning on the random series of background frequencies and 0·08 Freq. then average over these sequences. This has been done 0·06 to study the earlier phases of selective sweeps, where 0·04 fluctuations in the frequency of the favoured allele are 0·02 important (Barton, 1998). Applying the same strategy to multilocus balancing selection is difficult for several 250 500 750 1000 1250 1500 1750 2000 reasons. First, the strength of perturbations does Number of generations depend on the form of selection in the system, which in our case includes epistasis and is therefore far more Fig. 6. Stochastic fluctuations of the frequencies of the 16 genetic backgrounds defined by four diallelic loci in a complex than simple positive selection. Second, a $ population of size N l 10 . Each line stands for the proper analysis must take simultaneously into account frequency of a background. Backgrounds had even fluctuations in all the backgrounds, which means frequencies at the first generation. Selected loci are evenly −& taking into account the whole population all time, spaced r l 10 , α l 1, k l 0n1. rather than only the favoured allele for just a few generations. Finally, in the case of selective sweeps, to deterministic expectations, selection quickly be- one can make the simplifying assumption that the comes very weak on each individual background. In favoured allele is destined for fixation, whereas, in the such circumstances, drift is not only caused by random case of multilocus balancing selection, one must sampling within each background but is also asso- account for the loss and recovery of backgrounds by ciated with stochastic fluctuations in background drift and recombination. The case for the application frequencies (Fig. 6). These fluctuations are not of our extension of the structured coalescent to the considered in the analytical model, in which fre- study of multilocus balancing selection is made worse quencies were assumed to be constant. To put things in by removing some of the simplifying assumptions that another way: the fluctuations violate the assumption underlie our model. First, our analytical model of no linkage disequilibrium between selected loci assumes that the selection–drift equilibrium has been because they generate random associations among reached, but Kelly and Wade (2000) show that the them. These random associations render the popu- patterns of variability expected during the approach lation less structured than assumed by the coalescent to equilibrium in a two-locus model are quite different approach, so that the increase in neutral variability is than the ones expected at equilibrium. In a multilocus smaller than predicted. Moreover, if fluctuations are scenario, equilibrium takes much longer to achieve, so very strong, some backgrounds will be eliminated by the relevance of equilibrium variability predictions is drift and the neutral variability associated with them doubtful. Second, although the analytical approach will be swept away. Backgrounds can eventually be developed here is completely general, our application recovered by recombination (Fig. 6), whereas the of it to balancing selection focused on symmetric neutral variability lost with them can only be replaced overdominance acting on several equally spaced, by mutation. diallelic loci that contributed equally to fitness. The The exact multilocus coalescent will be a useful tool stochastic fluctuations in background frequencies that to study balancing selection systems provided that the affect the validity of the coalescent will be even more number of relevant backgrounds is not very large and difficult to account for in more realistic systems in their frequencies are constant. It will be useful to which fitness is not symmetric; selection acts with study relatively simple multilocus systems in species different strength in different loci and the interactions with small effective population size, such as humans, between loci are not governed by exactly the same but more complex systems could be studies by this kind of epistasis. approach in species with larger population size, such An alternative is not to use the exact structured as Drosophila. Of course, some knowledge is needed coalescent, but an approximation. This has been about which loci are the targets of selection in order to successfully done in the case of purifying selection. In know how many backgrounds can potentially be that case, the coalescent is structured into different present in the population. One of the advantages of classes of backgrounds with different numbers of the coalescent approach is that the exact kind of deleterious mutations rather than into different indi- selection acting upon the system is not important. It vidual backgrounds. This approach is based on two does not matter if genetic backgrounds are maintained approximations: first, that all the backgrounds har- by, for example, some form of overdominance (as bouring a given number of mutations are selectively assumed here) or by frequency-dependent selection, as equivalent; and, second, that the fitness of a gamete

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493 The multilocus coalescent and balancing selection 139

depends only on the mutations it carries, so that one Christiansen, F. B. (1987). The deviation from linkage does not need to consider zygotes (Charlesworth et equilibrium with multiple loci varying in a stepping-stone al., 1993; Hudson & Kaplan, 1994; Charlesworth, . Journal of Genetics 66, 45–67. Christiansen, F. B. (1988). Epistasis in the multiple locus 1994; Hudson & Kaplan, 1995). Unfortunately, symmetric viability model. Journal of Mathematical neither of these approximations is valid under bal- Biology 26, 595–618. ancing selection. There is no similar way to classify Christiansen, F. B. (1990). Simplified models for viability different backgrounds into equivalent classes because selection at multiple loci. Theoretical Population Biology heterozygosity and linkage disequilibrium are crucial. 37, 39–54. Dawson, K. J. (2000). The decay of linkage disequilibrium The fitness of a given gamete depends on the zygotes under random union of gametes: how to calculate it will form and, therefore, on the other gametes in the Bennet’s Principal Components. Theoretical Population population. Biology 58, 1–20. It seems clear that, in the case of balancing selection, Hey,J.(1991). A multi-dimensional coalescent process the study of multilocus genealogies must take into applied to multiallelic selection models and migration models. Theoretical Population Biology 39, 30–48. account stochastic fluctuations on each of the possible Hudson, R. (1990). Gene genealogies and the coalescent genetic backgrounds. This can be done by our forward process. Oxford SurŠeys in EŠolutionary Biology 7, 1–44. simulations, which we have used extensively to study Hudson, R. B. & Kaplan, N. L. (1988). The coalescent the effects of balancing selection on several measures process in models with selection and recombination. of neutral variability (Navarro & Barton, 2002). Genetics 120,831–840. Hudson, R. R. & Kaplan, N. L. (1994). Gene trees with We thank B. Charlesworth, D. Charlesworth and F. background selection. In AlternatiŠes to the Neutral Depaulis for valuable discussion and criticism. We are also Model. (ed. G. B. Golding). Chapman Hall: Pages? grateful to an anonymous reviewer, who pointed out an Hudson, R. R. & Kaplan, N. L. (1995). The coalescent imprecision in the original manuscript. This work was process with background selection. Philosophical Trans- supported by the BBSRC. actions of the Royal Society Series B 349, 19–23. Kaplan, N. L., Darden, T. & Hudson, R. R. (1988). The References coalescent process in models with selection. Genetics 120, 819–829. Aquadro, C. F. & Begun, D. J. (1993). EŠidence for and Kaplan, N. L., Hudson, R. R. & Langley, C. H. (1989). The implications of genetic hitch-hiking in the Drosophila hitch-hiking effect revisited. Genetics 123, 887–899. genome.InMechanisms of molecular eŠolution. (ed. N. Karlin, S. & Avni, H. (1981). Analysis of central equilibria Takahata & A. G. Clark), pp. 159–178. Sunderland, in multilocus systems. Theoretical Population Biology 20, MA: Sinauer Press. 241–280. Aquadro, C. F. et al.(1994). Selection, recombination and Kelly, J. K. & Wade, M. J. (2000). near DNA polymorphism in Drosophila.InNon-neutral eŠo- a two-locus balanced polymorphism. Journal of Theoreti- lution (ed. B. Golding), pp. 46–56. New York: Chapman cal Biology 204, 83–101. & Hall. Maruyama, T. (1972). Rate of decrease of genetic variability Barton, N. H. (1998). The effect of hitch-hiking on neutral in a two-dimensional continuous population of finite size. genealogies. Genetical Research 72, 123–133. Genetics 70, 639–651. Barton, N. H. & Shpak, M. (2000). The stability of Nagylaki, T. (1982). Geographical invariance in population symmetrical solutions to polygenic models. Theoretical genetics. Journal of Theoretical Biology 99, 159–172. Population Biology 57, 249–263. Navarro, A. & Barton, N. H. (2002). The effects of Barton, N. H. & Turelli, M. (1991). Natural and sexual multilocus balancing selection on neutral variability. selection on many loci. Genetics 127, 229–255. Genetics, (in press). Barton, N. H. & Wilson, I. (1995). Genealogies and Nordborg, M. (1997). Structured coalescent processes on geography. Philosophical Transactions of the Royal Society different timescales. Genetics 146, 1501–1514. 349, 49–59. Stephan, W, Wiehe, T. H. & Lenz, M. (1992). The effect of Bennett, J. H. (1954). On the theory of random mating. strongly selected substitutions on neutral polymorphism: Annals of Eugenics 18,311–317. analytical results based on diffusion theory. Theoretical Charlesworth, B. (1994). The effect of background selection Population Biology 41, 237–254. against deleterious mutations on weakly selected, linked Strobeck, C. (1983). Expected linkage disequilibrium for a variants. Genetical Research 63,213–228. neutral locus linked to a chromosome rearrangement. Charlesworth, B. et al.(1993). The effect of deleterious Genetics 103, 545–555. mutations on neutral molecular variation. Genetics 134, Whitlock, M. C. & Barton, N. H. (1997). The effective size 1289–1303. of a subdivided population. Genetics 146, 427–441.

Downloaded from https://www.cambridge.org/core. IP address: 170.106.33.14, on 30 Sep 2021 at 16:04:21, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0016672301005493