<<

Genetics of Polymorphism and Divergence Stanley A. Sawyer∗, † and Daniel L. Hartl†

∗Department of Mathematics, Washington University, St. Louis, Missouri 63130, †Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110 November 3, 1992

ABSTRACT Frequencies of mutant sites are modeled as a Poisson random field in two species that share a sufficiently recent common ancestor. The selective effect of the new can be favorable, neutral, or detrimental. The model is applied to the sample configurations of nucleotides in the alcohol dehydrogenase (Adh) in Drosophila simulans and D. yakuba (McDonald and Kreitman 1991, Nature 351: 652–654). Assuming a synonymous rate of 1.5 × 10−8 per site per year and 10 generations per year, we obtain 6 estimates for the effective population size (Ne =6.5 × 10 ), the species divergence time −6 (tdiv =3.74 Myr), and an average selection coefficient (σ =1.53 × 10 per generation for advantageous or mildly detrimental replacements), although it is conceivable that only two of the amino acid replacements were selected and the rest neutral. The analysis, which includes a sampling theory for the independent infinite sites model with selection, also suggests the estimate that the number of amino acids in the enzyme that are susceptible to favorable mutation is in the range 2–23 out of 257 total possible codon positions at any one time. The approach provides a theoretical basis for the use of a 2 × 2 contingency table to compare fixed differences and polymorphic sites with silent sites and amino acid replacements.

t has been more than 25 years since Lewontin have been more at odds, and a great controversy en- I and Hubby (1966) first demonstrated high levels sued. To a large extent the issue has been clouded of molecular polymorphism in Drosophila pseudoob- by inadequate data (Lewontin 1974, 1991). Ob- scura. This finding had two strong immediate ef- servations of natural are snapshots of fects on evolutionary genetics: it stimulated molec- particular places and times, and the resulting infer- ular studies of many other organisms, and it led to ences about the long-term fate of molecular poly- a vigorous theoretical debate about the significance morphisms can be challenged by neutralists and se- of the observed polymorphisms (Lewontin 1991). lectionists alike. By the same token, laboratory ex- The experimental studies soon came to a consen- periments capable of detecting selection coefficients sus in demonstrating widespread molecular poly- as small as are likely to be important in nature morphism in numerous species of plants, animals, are currently impractical (Hartl and Dykhuizen and microorganisms. The theoretical debate was 1981, Hartl 1989), although some large effects have not so quickly resolved. One viewpoint (Kimura been documented (see Powers et al. 1991 for a re- 1968, 1983) held that most observed molecular vari- view). ation within and among species is essentially selec- In the 1980s, the increasing use of DNA se- tively neutral, with at most negligible effects on sur- quencing in evolutionary genetics gave some hope vival and reproduction. Opposed was the classical that the impasse could be overcome. Direct ex- Darwinian view that molecular polymorphism is the amination of , rather than the electrophoretic raw material from which fashions mobility of gene products, yields vast amounts of evolutionary progress, and that the newly observed information consisting of hundreds or thousands of molecular variation was unlikely to be any different nucleotides. The data are also of a different qual- Lewontin ( 1974). The two viewpoints could not ity, since the DNA sequences are unambiguous and 2 S.A.SawyerandD.L.Hartl

contain both synonymous nucleotide differences and ment fixed differences. When 30 aligned DNA se- differences that change amino acids. To the ex- quences from the alcohol dehydrogenase (Adh) tent that the synonymous differences are subjected of three species of Drosophila were compared (Mc- to weaker selective effects than amino acid differ- Donald and Kreitman 1991a), there were too few ences, comparisons between the two types of poly- polymorphic replacement sites (P =0.007, two- morphisms can serve as a basis of inference. Syn- sided Fisher exact test). McDonald and Kreit- onymous polymorphisms are more common than man (1991a) argue that the most likely reason for amino acid polymorphisms (Kreitman 1983), and the discrepancy is that some of the amino acid dif- also appear to be more weakly affected by selection ferences were fixed as a result of positive selection (Sawyer, Dykhuizen,andHartl 1987). acting on replacement . The possibility that the fixed differences could have resulted from With data from only one species, the level of a combination of slightly deleterious alleles (Ohta synonymous and replacement polymorphism must 1973), coupled with a dramatically changing popu- be substantial in order for statistical analysis to lation size, was also considered by McDonald and have enough power to detect selection (Sawyer, Kreitman (1991a) but considered implausible be- Dykhuizen,andHartl 1987; Hartl and Sawyer cause this would seem to require extraordinarily fine 1991). Most eukaryotic genes are not sufficiently tuning among a large number of independent pa- polymorphic to allow this approach. An alternative rameters. approach, pioneered by Hudson, Kreitman,and Aguade´ (1987), is based on comparing polymor- Although the McDonald-Kreitman test has con- phisms within species with fixed differences between siderable intuitive appeal, little quantitative theory species. This approach has been applied to the exists for the comparison of intraspecific polymor- Drosophila fourth (Berry, Ajioka, phism with interspecific divergence in the presence and Kreitman 1991) as well as to the tip of the of selection. In this paper we present such a the- X chromosome (Begun and Aquadro 1991), both ory. Among other things, it addresses the question of which are regions of reduced recombination. The of whether the imbalance in the Adh contingency level of polymorphism in these regions is also re- table could have resulted from the random fixation duced, and the analysis suggests strongly that the of mildly deleterious alleles over an extremely long reduction is the result of genetic hitchhiking associ- time in a population of constant size, rather than ated with periodic selective fixations. fixations of advantageous alleles in a shorter period of time. The theory also provides an estimate of the Comparison of molecular variation within and average amount of selection required to produce the between species is also the crux of a statistical test discrepancy observed, as well as an estimate of the proposed by McDonald and Kreitman (1991a). rate at which favorable mutations occur (or, equiv- The test is for homogeneity of entries in a 2 × 2 alently, an estimate of the average number of amino contingency table based on aligned DNA sequences. acids in the protein that are susceptible to a fa- The rows in the contingency table are the num- vorable mutation at any one time). Several objec- bers of replacement or synonymous nucleotide dif- tions to the details of the implementation of the ferences, and the columns are either the numbers McDonald-Kreitman test have been raised (Graur of fixed differences between species or else of poly- and Li 1991, Whittam and Nei 1991), and these morphic sites within species. Here polymorphic sites are also addressed briefly. are defined as sites that are polymorphic within one or more of the species, and fixed differences are de- fined as sites that are monomorphic (fixed) within Rationale and Results of the Analysis each species but differ between species. The term silent refers to nucleotide differences in codons that The first step in our method is to analyze the sample do not alter the amino acid, and replacement refers configurations of the nucleotides occurring at syn- to nucleotide differences within codons that do alter onymous sites in the aligned DNA sequences un- the amino acid. The McDonald-Kreitman test com- der the assumption that the synonymous variation pares the number of silent and replacement poly- is selectively neutral. This information is used to morphic sites with the number of silent and replace- estimate the mutation rate at silent sites and the Polymorphism and Divergence 3 divergence time between pairs of species. The diver- tween sites. There is considerable linkage dise- gence time is critical because, if the divergence time quilibrium in Adh around the Fast versus Slow between species is sufficiently long, then conceiv- electrophoretic polymorphism in D. melanogaster ably all of the fixed amino acid differences between (but not in D. simulans and D. yakuba). Possi- species could be due to the fixation of mildly delete- ble balancing or clinal selection on this polymor- rious alleles, and the significance of the McDonald- phism may not only affect nucleotide configurations Kreitman contingency table might be an artifact of in D. melanogaster, but also may not be appropri- saturation at silent sites. Using the estimated values ate for the model of genic selection that we apply of the silent mutation rate and the divergence time, below. Among the three species for which McDon- the numbers of synonymous polymorphic sites and ald and Kreitman (1991a) have Adh sequences, fixed differences predicted from the neutral configu- we are most confident in applying the analysis to ration theory are compared with the observed num- the D. simulans versus D. yakuba comparison. The bers. These estimates fit the observed Adh data resulting analysis of the joint nucleotide configura- very closely for all three pairwise species compar- tions at silent sites for D. simulans and D. yakuba isons, which suggests that the configuration distri- leads to the following estimates for the scaled silent butions at synonymous sites are roughly consistent mutation rate µs (summed over synonymous sites) with an equilibrium neutral model. and the species divergence time tdiv: The second step is to develop equations for µ =2.05 and t =5.8(1) the expected number of polymorphic sites and fixed s div differences between a pair of species in terms of both scaled in terms of the haploid effective popu- the magnitude and direction of selection, the mu- lation size N .Thatis,µ = u × N ,whereu tation rate to new alleles having a given (constant) e s s e s is the nucleotide mutation rate per generation selective effect, and the divergence time. From summed over all synonymous sites within amino these equations we estimate the amount of selection acid monomorphic codon positions in Adh other needed to explain the observed deficiency or excess than those coding for leucine and arginine (i.e., at in the number of replacement polymorphisms. We “regular” silent sites; see below). Analysis of the also estimate the rate of new mutations resulting in numbers of replacement polymorphisms and fixed amino acid replacements (or, equivalently, the num- differences then leads to the following minimal esti- ber of amino acid sites in the protein product at mates for the effective aggregate replacement muta- which favorable or mildly deleterious substitutions tion rate µ and average selection coefficient γ: are possible at any one time). While the config- r uration analysis takes into account the possibility µ =0.013µ and γ =9.95 (2) of multiple mutations at the same site, the second r s step assumes that this does not occur; i.e., that the again scaled in terms of the haploid effective popula- genetic locus involved has not been saturated by tion size. The quantity µ is the aggregate base mu- mutations since the divergence of the two species. r tation rate causing advantageous or mildly deleteri- This assumption was checked in two different ways. ous amino acid changes. The estimate of µ in (2) First, the expected number of silent polymorphisms r includes a correction factor to account for silent was calculated by both methods and found to agree polymorphisms that are destined to be fixed but within 12%. Second, the expected number of syn- are not yet fixed. The quantity γ is the estimated onymous sites with two or more neutral fixations average selection coefficient among advantageous or since the species diverged was estimated as less than mildly deleterious amino-acid changing mutations, two in all cases. There are no silent site poly- where “average” here means that γ is the selection morphisms with three or more nucleotides in the coefficient required to produce the same numbers of data considered, and only one silent site (between replacement polymorphisms and fixed differences if D. melanogaster and yakuba) is polymorphic in both these replacement mutations all had the same se- species. lective effect. The quantity γ is scaled in terms of Most of our analysis depends on the assump- the effective population size; i.e., γ = σNe,whereσ tion of linkage equilibrium or independence be- is the same selection coefficient per generation. In 4 S.A.SawyerandD.L.Hartl

this case, the estimates of γ and σ are the minimum Incidentally, the estimate of 1.5 × 10−9 muta- amount of selection required. Since there are no tions per silent site per generation, derived from replacement polymorphisms between D. simulans Rowan and Hunt (1991) and the assumption of and D. yakuba in the McDonald-Kreitman data, any 10 generations per year, compares well with the rule larger value than γ =9.95 in (2), along with a cor- of thumb (Drake 1991) that, in metazoans, the respondingly smaller value of µr, would explain the overall mutation rate is roughly one mutation per data just as well. genome per generation. For the Drosophila genome of 165 million base pairs, this implies 6.1×10−9 mu- The estimated aggregate mutation rate at silent tations per nucleotide pair per generation. These sites of µ =2.05 in (1) is based on 212 regu- s estimates are quite close, particularly since there lar amino acid monomorphic codon positions (see might be a slightly smaller substitution rate at silent below). This estimate therefore corresponds to sites within coding regions than for arbitrary nu- 2.05/212 = 0.0097 silent nucleotide changes per syn- cleotides. onymous site per Ne generations in the Adh se- quence. Assuming a silent nucleotide substitution rate of 0.015 per synonymous site per million years The analysis of the Adh data can also be in- (Myr)attheAdh locus, estimated from data on terpreted in another way. If the rate of replace- Hawaiian Drosophila (Rowan and Hunt 1991), Ne ment mutations is uniform across the coding region generations is 0.645 Myr. Therefore, tdiv =5.8 of Adh, then the overall replacement mutation rate in (1) implies a divergence time of 3.74 Myr be- of µr =0.013µs implies that an average of only tween D. simulans and D. yakuba. This value is in about 5.7 codons out of the 257 codon positions in the middle of a range 1.6–6.1 Myr implied by single- the molecule are susceptible to a favorable amino copy nuclear DNA hybridization data (Caccone, acid replacement at any one time, with all other Amato,andPowell 1988), where an estimated di- replacements at that time being strongly deleteri- vergence time between D. melanogaster and D. sim- ous. The estimate of 5.7 amino acid positions is ulans of 0.8–3 Myr (Lemeunier et al. 1986) is used based on equal mutation rates of each nucleotide. as the standard of comparison. As a consistency If the mutation rates vary according to nucleotide, check, the same analysis was carried out with the the estimated number of amino acids susceptible to Adh data from D. simulans and D. melanogaster. favorable replacement at any one time is between 2 This analysis yielded µs =2.07 for 213 monomor- and 23 (see discussion below). phic codon positions and a value of tdiv =1.24, from which the estimated divergence time is 0.80 Myr. The remainder of this paper focuses on the This estimate is at the low end of the range sug- technical details pertaining to the estimates and gested by Lemeunier et al. (1986). conclusions summarized above. The main themes Assuming 10 generations per year for D. sim- are, first, the analysis of joint nucleotide configura- ulans and D. yakuba, and a value of 0.645 Myr for tions at silent sites from a pair of species in order Ne generations, the estimated haploid effective pop- to estimate the mean mutation rate at silent sites 6 ulation size of either species is Ne =6.5 × 10 (and and the species divergence time; and, second, the hence 3.25 × 106 for the diploid population size). analysis of the expected probability density of poly- This estimate is in excellent agreement with the morphic site frequencies and of fixed differences at value of 2×106 suggested for D. simulans by Berry, the population level in order to estimate the amount Ajioka,andKreitman (1991). The value γ = of selection required if all favorable and weakly dele- 9.95 in (2) implies that the average selection coeffi- terious replacement mutations have the same selec- cient for advantageous or mildly deleterious amino tive effect. The population estimates are discussed −6 acid replacements in Adh is σ = γ/Ne =1.53×10 in the next section, with most of the detail deferred per generation. That is, only a very small average to later in the paper. We then discuss the numer- selection coefficient is required to account for the ical estimation of γ and µr, carry out the analysis observed lack of replacement polymorphisms (or, of joint nucleotide configurations, and finally give equivalently, the excess of fixed replacements) in the some supplementary comments on certain criticisms comparison of D. simulans with D. yakuba. of the McDonald-Kreitman test. Polymorphism and Divergence 5

Mutational Flux and Fixed Differ- Later in the paper we derive a diffusion approxima- {XN } ences tion for the expected number of processes i,k at equilibrium whose frequencies are in the range Suppose that new mutations arise with probabil- (p, p+dp), and we also calculate the equilibrium rate v > ity N 0 per generation in a population of haploid at which mutant alleles become fixed. size N. Let XN be the frequency of the descendants i,k Now, assume that the processes {XN } corre- ith i,k of the new mutant allele in the population, where spond to two-allele haploid Wright-Fisher models k , , ,... =0 1 2 is the number of generations that have (without mutation) in which organisms carrying a elapsed since the original mutation occurred. We new mutant allele have fitness w =1+σ rel- assume that the processes {XN : i =0, 1,...} are N N i,k ative to those carrying the non-mutant allele. If non-interacting, as would be the case if the muta- NσN → γ as N →∞, then the conditions for a tions occurred at distinct sites that remain in link- diffusion approximation hold with age equilibrium, or if the mutations were sufficiently well spaced in time. The mutations could be se- 1 b(x)= 2 x(1 − x)andc(x)=γx(1 − x)(5) lectively advantageous, disadvantageous, or neutral, but in any event we assume that the site frequency (Ewens 1979). The scale and speed measure (4) {XN } processes i,k are stochastically identical Markov are then chains. (That is, they have the same transition ma- 1 − e−2γx 2e2γx dx trices.) Note that XN =1/N for each i, since each s x dm x i,0 ( )= γ and ( )=x − x (6) new process begins with a single mutation. The 2 (1 ) states 0 and 1 are absorbing states that represent, 0 where s(x) is normalized so that s (0) = 1. respectively, the loss of the new allele and its fixa- T {t X tion. In general, the hitting times a =min : t = a} for a diffusion process X and the scale func- T N {k XN a} t Define i,a =min : i,k = as the num- tion s(x) in (4) are related by the identity ber of generations until the ith process attains the a T N ∞ s x − s frequency for the first time, and set i,a = if ( ) (0) P (T

so that µ is the limiting mutation rate per genera- This expression differs from the corresponding ex- tion at which new mutant alleles arise. For arbitrary pected value in the infinite alleles model (Ewens {XN } s0 , processes i,k satisfying (7a) and (0) = 1, we 1972 1979). However, the Poisson random field prove later in this paper that the limiting density model of the preceding section is not the same as of polymorphic mutant alleles with population fre- the infinite alleles model. The alleles in (12) do not quencies in the range (x, x + dx)is compete with one another, and there are no con- {XN } s(1) − s(x) straints on the sum of the frequencies i,k .The µ dm(x)(9) {XN } s(1) − s(0) processes i,k are also unaffected by subsequent mutations. This model also differs from the infinite s x dm x for ( )and ( ) in (4). This means that the sites model of Watterson (1975), since the pro- expected number of mutant alleles with population cesses {XN } are assumed to be independent (i.e., p

Table 1. Population fixational flux and polymorphic densities. Equilibrium Limiting density of freqs. flux of fixations of mutant nucleotides

dx µ µ Neutral s 2 s x Favored (γ>0) 2γ 1 − e−2γ(1−x) dx µ 2µ Unfavored r 1 − e−2γ r 1 − e−2γ x(1 − x) (γ<0)

of fixed differences, which equals numbers of fixed differences between the two species  at the present time are s(1) − s(x) dm(x) 1 − e−2γ(1−x) dx = 2γ 2t t 2γ(1 − x) x 2µ t and 2µ t (13) div div s div r div 1 − e−2γ In the next section we will use this expression as a for silent and replacement sites respectively. Note basis for estimating the average value of γ for re- that the second expression in (13) is much more sen- placement differences between species. sitive to γ if γ<0thanifγ>0. For example, the −8 factor multiplying 2µrtdiv is 4.1 × 10 if γ = −10, γ Sampling Formulas and Parameter but only 20 if = 10. Thus, if the two expres- sions in (13) are of comparable size, then γ cannot Estimates be strongly negative. Note that µr ≤ 10µs even if only 10% of the sites in a coding region are silent Suppose that two species diverged tdivNe genera- tions ago, and that both have the same haploid effec- and all amino acid replacements are advantageous tive population size N . Assume that the mutation or mildly deleterious. If the two expressions in (13) e µ /µ ≤ γ ≥− . γ rate for silent sites in the coding region of a particu- are equal and r s 10, then 1 81. If is lar gene is µ per gene per generation, and that the large and positive and the two expressions are equal, s µ /µ ≈ / γ mutation rate for nonlethal replacement mutations then r s 1 (2 ). is µr per gene per generation. Assume further that Similarly, the expected mutant frequency den- (i) all new replacement mutations bestow equal fit- sities at silent and replacement polymorphic sites ness w =1+γ/Ne, (ii) each new mutation since the are divergence of the species occurred at a different site dx (in particular, the gene has not been saturated with dν x µ s( )=2s x and mutations), and (iii) different sites remain in link- −2 (1− ) (14) age equilibrium. Under these assumptions, the fix- 1 − e γ x dx dν (x)=2µ ational flux and the expected frequency densities of r r 1 − e−2γ x(1 − x) mutant nucleotides at silent and replacement sites in a single random-mating population are those given respectively in either population. in Table 1. Now, suppose that we have aligned DNA se- Assume further that the number of polymor- quences from m from the first species phic sites at the present time that are destined to be- and n chromosomes from the second species. The come fixed, and the number of site polymorphisms next step is to convert the population level esti- surviving from the time of , can be ne- mates (13–14) to sample estimates. Suppose that glected in comparison with the number of fixed dif- a mutant nucleotide has population frequency x at ferences between the species and the number of sites a site. Then a random sample of m chromosomes that are presently polymorphic. Then the expected from that population will be monomorphic for the 8 S.A.SawyerandD.L.Hartl

m mutant nucleotide with probability qm(x)=x , However, the estimate (17) is about 45% too large and will be polymorphic at that site with probability for D. yakuba versus D. simulans and D. yakuba m m pm(x)=1−x −(1−x) . Thus the expected num- versus D. melanogaster, and it is more than 5 times ber of silent polymorphic sites in a random sample too large for D. melanogaster versus D. simulans of size m is (Table 9). The most likely reason for these dis- Z Z 1 1 − xm − − x m crepancies is that the silent polymorphic sites in p x dν x µ 1 (1 ) dx m( ) s( )=2s the present populations that are destined to become 0 0 x fixed are counted in 2µstdiv in (17) even though −1 mX 1 they are not yet fixed. The discrepancy is not large =2µ (15) s k except for the comparison D. melanogaster versus k=1 D. simulans, and it only affects the estimate of µr dν x for s( ) in (14). In fact, assuming linkage equi- in (1–2). A correction for this overestimation is in- librium between sites, the number of silent poly- cluded in the calculation of µr in (2). morphic sites is a Poisson random variable whose By a similar argument, the number of replace- mean (and hence also variance) is given by (15) (see ment sites that are monomorphic for a mutant nu- below). The expression (15) is the same as Wat- terson cleotide is Poisson with mean ’s (1975) formula for the expected number Z of polymorphic sites in a sample of size m from the γ 1 µ 2 t xm dν x r −2γ div + r( ) infinite sites model. However, the variance of the 1 − e 0 number of silent polymorphic sites in the infinite sites model is larger than the variance in the Pois- for dνr(x) in (14). The number of replacement fixed son model (15). differences between the two samples is then Poisson The number of silent polymorphic sites in both with mean samples together is then γ   µ 2 t G m G n 2 r −2γ div + ( )+ ( ) (18) 2µs L(m)+L(n) where 1 − e mX−1 1 (16) where L(m)= ≈ log m k Z 1 k=1 1 − e−2γ(1−x) G(m)= xm−1 dx The estimate (16) for the number of silent poly- 0 2γ(1 − x) morphic sites agrees with the observed numbers to within 7% for each of the three pairwise species com- Note G(m) ≤ 1/m for γ>0. By the same reason- parisons using the Adh data of McDonald and ing, the expected number of polymorphic replace- Kreitman (1991a), assuming the values of µs es- ment sites in a sample is timated from the joint configurations in the next  µ H m H n section. 2 r ( )+ ( ) (19) Similarly, a silent site will be monomorphic in a where sample of size m if it is fixed in the population, but may also be monomorphic in a sample by chance H(m)= Z if it is polymorphic in the population. Under the 1 1 − xm − (1 − x)m 1 − e−2γ(1−x) assumptions of (13), the number of silent sites in a dx x(1 − x) 1 − e−2γ sample of size m that are monomorphic for a mutant 0 nucleotide is Poisson with mean H m Z   The expression ( ) varies by only a factor of two 1 dx for γ>0, since the inequality A ≤ (1 − e−γA)/(1 − µ t xm µ µ t 2 s div + 2 s = s div + e−γ ≤ 0 the two samples is   P −1 1 1 for L(m)= m 1/k. However, H(m) will be 2µ t + + (17) k=1 s div m n smaller if γ is large and negative. Polymorphism and Divergence 9

Table 2. Sampling formulasa Fixed differences Polymorphic sites    µ t 1 1 µ L m L n Neutral 2 s div + m + n 2 s ( )+ ( )

2γ  2γ  Selected 2µ t + G(m)+G(n) 2µ F (m)+F (n) r 1 − e−2γ div r 1 − e−2γ

mX−1 L m 1 where ( )= k and k=1 Z 1 1 − xm − (1 − x)m 1 − e−2γx F (m)= dx, 0 1 − x 2γx Z 1 1 − e−2γx G(m)= (1 − x)m−1 dx 0 2γx

a Expected numbers for samples of m genes from one species and n genes from a second species.

The ratio of the number of replacement poly- for the ratio of the numbers of the observed replace- morphic sites to the number of replacement fixed ment polymorphic sites and replacement fixed dif- differences is given by the ratio of (19) to (18), which ferences. (Note that µr cancels in this particular is ratio. If γ>0 is sufficiently large, the approxi- F (m)+F (n) (20) mation (21) can be used instead.) If the number of tdiv + G(m)+G(n) replacement polymorphic sites is zero (as it is for the where D. simulans versus D. yakuba comparison), then we / 1 − e−2γ use 1 2 as a conservative value to replace the zero. F (m)= H(m) For example, if there are no replacement polymor- 2γ Z phic sites and 6 fixed replacement differences (as is 1 1 − xm − (1 − x)m 1 − e−2γx = dx thecaseforD. simulans versus D. yakuba), we set − x γx 1 0 1 2 the expression in (20) equal to 2 /6=1/12 and solve after a change of variables in x. The four basic sam- for γ. Using the estimate tdiv =5.8 derived in the pling formulas are summarized in Table 2. next section, the solution is γ =9.95. (The approxi- F m ∼ L m /γ G m → mation (21) gives γ =11.0 in this case.) Since there Note that ( ) P ( ) and ( ) 0as m−1 are no replacement polymorphic sites for D. simu- γ →∞for L(m)= 1/k in (16). Thus the k=1 lans versus D. yakuba, this value is a lower bound expression in (20) is asymptotic to for the average value of γ. Any larger value would L(m)+L(n) also be consistent with the data. as γ → +∞ (21) tdivγ Estimating parameters. We estimate γ by set- ting the expression (20) F (m)+F (n) (22) tdiv + G(m)+G(n) The replacement mutation rate µr can be esti- Numb. observed repl. polymorphisms mated by setting the ratio of the expected numbers = Numb. observed repl. fixed differences of replacement and silent fixed differences in (18) 10 S. A. Sawyer and D. L. Hartl

and (17) rate where Ne is the effective haploid population size, then Wright’s (1949) formula states that the µr 2γ tdiv + G(m)+G(n) (23) equilibrium population frequencies p1,...,p of the µ − e−2γ t /m /n K s(1 ) div +1 +1 K types are random with probability density Numb. observed repl. fixed differences = Γ(α) Numb. observed silent fixed differences pα1−1pα2−1 ...pαK −1 1 2 K (24) Γ(α1)Γ(α2) ...Γ(αK ) for the ratio of the numbers of the observed re- placement and silent fixed differences. Setting the N α ν N u α Pfor large e. In (24), j =2j =2 e j, = theoretical ratio (23) equal to the observed ratio K =1 αi,andΓ(α) is the gamma function. It follows 6/17 for D. simulans and D. yakuba yields µ = i r from (24) that E(p )=α /α for all i.Ifα1 = α2 = . . µ i i 0 037 = 0 018 s. (Recall that only regular silent ··· = α ,thenα = Kα and E(p )=1/K for µ K i i sites are counted in the estimation of s; see the next 1 ≤ i ≤ K. section.) However, as noted earlier, both 2µ t s div However, nucleotide frequencies at third-position and (17) overestimate the observed number of silent sites in fourfold degenerate codons tend to be far fixed differences, as well as the estimated number of from uniform (see Table 3). silent fixed differences using the probability distri- bution of joint configurations (see the next section). In both cases, the estimate (17) is too large by a Table 3. Nucleotide frequencies at 4-fold a factor of about 1.4 (Table 9). Since replacement degenerate sites in the Adh region polymorphisms with favorable mutant nucleotides Species Seqs Sites T C A G are likely to become fixed faster than silent poly- D. simulans 6 109 0.18 0.62 0.06 0.14 morphisms, we apply a correction of 1.4 to µ but s D. melanogaster 12 108 0.17 0.60 0.07 0.16 not to µ in (23). This correction leads to the esti- r D. yakuba 12 110 0.14 0.61 0.07 0.18 mate µr =0.0265 = 0.013µs for D. simulans versus a D. yakuba. The interpretation of µr in terms of an Data from McDonald and Kreitman (1991a) average of 5.7 amino acids susceptible to favorable replacement is discussed at the end of the next sec- tion. Assuming that the sites are independent, de- Under our assumptions, the numbers of silent partures from an even distribution are highly sig- −11 and replacement fixed differences and polymorphic nificant in all cases in Table 3 (P<10 ,3 sites are independent Poisson random variables (see d.f.), although the observed nucleotide frequencies below). Maximum likelihood estimators for µr do not differ significantly among the three species and γ (along with associated confidence intervals) (P =0.10, 6 d.f.). The mutation model (24) with α can be found by setting the expressions (18) and variable i provides a significantly better fit to the (19) (respectively) equal to the observed numbers of distribution of nucleotides at silent sites in the Adh α α P< −15 replacement fixed differences and replacement poly- data than does (24) with i = 1 ( 10 , 3 d.f., morphic sites. The maximum likelihood estimator in all three species, assuming independent sites). for γ is the same as the estimator (22) for γ.How- Although a mutation model with rates a function ever, we prefer to estimate µr in terms of µs and the of the final nucleotide, rather than the original, is more numerous silent site data rather than use the somewhat artificial, it is reasonable in the analysis maximum likelihood estimator for µr in this case. of these data since only the existing nucleotide se- quences are known. Furthermore, since the initial and final nucleotides may very well be correlated Divergence Times and Mutation Rates (e.g., because of a transition or transversion bias), α We treat a K-fold degenerate silent site as a neu- the model with variable i is probably reasonably tral Wright-Fisher model with K types, no selection, robust. In any event, it provides a significantly bet- ter fit to the Adh data. and mutation rate uij per generation from i to type j.Ifuij = uj depends only on j, and if Assuming Wright’s model (24) at K-fold de- νij = Neuij = νj = Neuj is the scaled mutation generate silent sites, the equilibrium distribution Polymorphism and Divergence 11

Table 4. Sample configurations. Configuration Probability from (25)

6! α1(α1 +1)α2α3(α3 +1)α4 TTCAAG 2! 1! 2! 1! α(α +1)(α +2)(α +3)(α +4)(α +5)

6! α3(α3 +1)(α3 +2)α4(α4 +1)(α4 +2) AAAGGG 0! 0! 3! 3! α(α +1)(α +2)(α +3)(α +4)(α +5)

6! α2(α2 +1)...(α2 +4)α3 CCCCCA 0! 5! 1! 0! α(α +1)(α +2)(α +3)(α +4)(α +5)

TTTTTC 12! α1(α1 +1)...(α1 +4)α2(α2 +1)α3(α3 +1)(α3 +2)α4(α4 +1) CAAAGG 5! 2! 3! 2! α(α +1)(α +2)...(α + 11)

TCCCCC 12! α1α2(α2 +1)...(α2 + 10) CCCCCC 1! 11! 0! 0! α(α +1)(α +2)...(α + 11)

of population nucleotide frequencies is a K-type (or third) codon position can change the degeneracy Dirichlet distribution (24) with parameters αi = in the third (or first) position. Codons for serine fall 2Neui,whereNe is the haploid effective population into two nonoverlapping classes, one 4-fold degener- size. If a sample of size m is randomly chosen from ate and one 2-fold degenerate; we treat these two this model, the probability that the sample will con- classes as separate amino acids. We define a regular tain mi nucleotides of type i for 1 ≤ i ≤ K is then silent site as either any 2-fold or 4-fold degenerate Q third-position site in an amino acid monomorphic m K α(mi) ! i=1 i codon position that does not code for leucine or argi- ,α α1 ··· α (m) = + + K m1! m2! ...mK! α nine, or else as a third-position site in an amino acid (25) monomorphic isoleucine codon position that does ( ) where α m = α(α +1)...(α + m − 1). Examples not contain an ATA codon (Sawyer, Dykhuizen, of (25) for some particular configurations are given and Hartl 1987). Regular isoleucine silent sites in Table 4. are considered 2-fold degenerate. The analysis of There are two types of twofold degenerate sites. nucleotide variation at silent sites is restricted to At sites of the first type, the nucleotide can be ei- regular silent sites. ther of the two pyrimidines T or C, for which (25) The scaled mutation rate to nucleotides of holds with K = 2. At sites of the second type, type j is νj = αj/2. Since E(pi)=αi/α by (24), the nucleotide can be either of the two purines A the overall mean mutation rate at 4-fold degenerate or G, and (25) holds with K =2andα1,α2 re- silent sites is placed by α3,α4. The only 3-fold degenerate amino acid, isoleucine, has third-position synonymous nu- X4 cleotides T, C, and A. Since the codon ATA is al- µ4 = αi(α − αi)/2α, α = α1 + ···+ α4 (26) most entirely absent in the Adh sequences, we treat i=1 3-fold degenerate sites as 2-fold degenerate in the analysis of silent site distributions, and ignore silent with the rates codon positions containing an ATA. Synonymous µTC = α1α2/(α1 + α2)and sites within leucine and arginine codons are of vari- (27) able degeneracy, since silent mutations in the first µAG = α3α4/(α3 + α4) 12 S. A. Sawyer and D. L. Hartl

at 2-fold degenerate sites. The aggregate silent mu- if b ≥ 1and tation rate per gene at regular 2-fold and 4-fold de- generate sites is then X∞ k α − − k−1α(k−1) (2 + 1)( 1) −λkt µ N µ N µ N µ d0(t)=1− e s = 2,T C TC + 2,AG AG + 4 4 (28) k! N N k=1 where 2,T C and 2,AG are the numbers of regular (30b) 2-fold degenerate silent sites of each type (TC or where α(m) = α(α+1)...(α + m − 1) as before and AG), N4 is the number of regular 4-fold degenerate − 2 3 λ = k(k + α − 1)/2. Since d (t)=O(e b t/ )for silent sites, and 3-fold degenerate sites are counted k b large b, the series (29) converges rapidly in b unless t in N2 unless the codon position contains an ATA. ,T C is small. Table 5 gives the within-species maximum like- Equation (29) has a simple intuitive explana- lihood estimates for α1,...,α4 based on the con- Tavare´ N →∞ figuration probabilities (25) at regular silent sites tion ( 1984). In the limit as and Nuj → αj /2, all individuals in an equilib- with all four αi’s varied independently in the maxi- mization. The last two columns in Table 5 give the rium Wright-Fisher population are the descendants of b<∞ “founding” ancestors that lived t × N resulting estimates for µ4 in (26) and µ in (27–28). e s d t Inferred 95% confidence intervals are generally on generations earlier. The expression b( ) in (30ab) b t the order of plus or minus half the size of the esti- is the probability distribution of for in the dif- mated MLE’s. Table 6 gives one-dimensional max- fusion time scale, under the assumption that mu- Tavare´ imum likelihood estimates for the single-α model tations break the line of descent ( 1984). That is, db(t) is the probability that all individu- αi ≡ α1. While the values for µ4 and µs in Table 6 are essentially the same as in the four-α model, the als in the present population either have descended b model with variable α provide a significantly better without intervening mutation from one of found- i t fit to the data in all three species. ing individuals who existed diffusion time units ago, or else are descended from a mutant ancestor An estimate of divergence time. Atime- that arose within the last t time units. If the nu- dependent version of Wright’s formula (24) was first cleotide frequencies were p1,...,pK in the ancestral Griffiths derived by (1979). Assume that two population, the probability that bi of the b founding K equilibrium Wright-Fisher -type populations di- ancestral individualsQ were of type i for 1 ≤ i ≤ K verged t × N generations ago, where N is the b /b ...b pbi div e e is ( ! 1! K !) i . The unmutated descendants haploid effective population size. The ancestral of these b individuals will have the same states at population and both offshoot populations are as- the present time. Thus the nucleotide frequencies sumed to have the same haploid effective popula- q1,...,qK at the present time have a Dirichlet dis- tion size and the same mutation structure uij ≡ uj. tribution (as in Wright’s formula (24)) conditioned p ,...,p Given present nucleotide frequencies 1 K in on a sample of size b having bi individuals of type i one of the populations, then the present frequencies for 1 ≤ i ≤ K. However, a K-type Dirichlet distri- q ,...,q 1 K in the other population are random with bution with parameters αi, conditioned that a sam- probability density ple of size b had bi individuals of type i, is Dirichlet ∞ K b α X X b! Y with parameters i + i. Finally, by time reversibil- P t, p, q d t pbi × ( )= b(2 ) i ity, the joint distribution of two present day popula- P b1 ...b =0 ! K! =1 b bi=b i tions, connected through an ancestral population t time units ago, is the same as the joint distribution Γ(b + α) qb1+α1−1 ...qbK +αK −1 of one population and an ancestral or descendant b α ... b α 1 K Γ( 1 + 1) Γ( K + K ) population 2t time units apart. This completes the (29) proof of Griffith’s formula (29). where αi =2Neui and α = α1+···+αK (Griffiths Suppose that a sample of size m is taken from 1979). The coefficients db(2t) in (29) satisfy ∞ one species, and of size n from another, closely re- X k α − − k−b α b (k−1) (2 + 1)( 1) ( + ) −λkt db(t)= e lated, species. It follows from (29) that the joint k! m k=b probability that the first sample has i bases of (30a) type i (1 ≤ i ≤ K), and that the second sample Polymorphism and Divergence 13

Table 5. MLE’s from Wright’s Formula (25)a

Species α1 (T) α2 (C) α3 (A) α4 (G) µ4 µs D. simulans 0.0101 0.0326 0.0021 0.0096 0.0156 2.33 D. melanogaster 0.0082 0.0259 0.0020 0.0083 0.0130 1.92 D. yakuba 0.0063 0.0281 0.0025 0.0100 0.0135 1.93

a Estimated from regular silent sites (see text)

Table 6. MLE’s assuming αi ≡ α1

Species Normal-theory 95% CI for α1 µ4 µs D. simulans (0.0104  0.0065) = (0.0039, 0.0169) 0.0155 2.26 D. melanogaster (0.0087  0.0052) = (0.0035, 0.0139) 0.0130 1.87 D. yakuba (0.0085  0.0051) = (0.0041, 0.0136) 0.0127 1.86

Table 7. Drosophila pairwise comparisons

Model with αi ≡ α1

Species µs tdiv 95% CI for tdiv tgene µs tdiv tgene simulans vs. yakuba 2.05 5.81 (3.28, 8.33) 6.72 1.98 5.55 6.46 melanogaster vs. yakuba 1.85 7.08 (4.06, 10.11) 8.18 1.80 6.54 7.85 melanogaster vs. simulans 2.07 1.24 (0.48, 1.99) 2.26 2.01 1.23 2.26

has nj bases of type j (1 ≤ j ≤ K), is efficients multiplying db(2t) in (31) for K =4and large b. ∞ X X b! Estimates of tdiv using this method and the Cmn db(2t) × P b1 ...b McDonald Kreitman =0 ! K! data of and (1991a) are b bi=b (31) given in Table 7. Note that our method gives a Q ( ) K α bi α b (mi) α b (ni) i=1 i ( i + i) ( i + i) direct estimate of species divergence times, rather α(b) (α + b)(m) (α + b)(n) than of gene divergence times (i.e., the time since the common ancestor of a set of genes, which may where Cmn = m!/(m1! ...mK !) × n!/(n1! ...nK !) be considerably older than the time since the species and α(m) = α(α +1)...(α + m − 1) as before. diverged). In practice, estimates of the average pair- t Given an aligned sample of DNA sequences wise gene divergence times gene (Table 7) were com- from two species, we estimate the divergence time puted as a starting point for the maximization of the t t between the two species as follows. First, all reg- likelihood for div. These estimates of gene tended t t ular silent sites (as defined earlier) within the two to be larger than div, particularly when div was species are pooled to find maximum likelihood esti- small. Table 7 also contains 95% confidence inter- t mates of α1,...,α4 using Wright’s formula (25) for vals for div based on the one-dimensional likelihood t the configuration probabilities. Using the estimated curvature at div. However, these confidence inter- values for α , a maximum likelihood estimate of t vals may overstate the accuracy of the estimates i div α is then found using the likelihoods (31) at regular since the i were held constant. The last three t silent sites in the two species. Since (31) depends columns in Table 7 give estimates for div based on (mi) the single-α model α ≡ α1. These were quite sim- on t only through db(2t), arrays (αi + bi) etc. i can be computed in advance once the α are known, ilar to the estimates in the preceding four columns i α independently of t. The most time-consuming part based on the four- model. of the maximization is the computation of the co- Watterson (1985, see also Padmadisastra 14 S. A. Sawyer and D. L. Hartl

1988) developed a maximum likelihood method for fewer polymorphisms than predicted by neutrality, estimating the divergence time between two popu- while a contracting population or balancing selec- lations based on the infinite alleles model at many tion would be likely to have more polymorphisms. unlinked loci. If mutation is sufficiently rare so that The second set of theoretical estimates are the recurrent mutation at polymorphic sites can be ne- Poisson-random-field-based sampling estimates de- glected, then this method should give estimates sim- rived earlier. The number of silent fixed differences ilar to ours, but it does not allow for the possibility is estimated as follows. By time reversibility, data of nucleotide-dependent mutation rates. Watter- from two species at the present time can be viewed son (1985) also discusses five other methods of esti- as two snapshots of a single species separated in mating species divergence times and compares them time by 2t time units. Under selective neutrality, by computer simulation. div the rate µs of regular silent mutations per individ- We also calculated bootstrap bias-corrected es- ual is the same as the long-term rate of fixations timates and bias-corrected 95% confidence intervals of regular silent differences at the population level, Efron t α ( 1982) for div in the four- model (Ta- so that the expected number of silent mutations in ble 8). These were based on 1000 nonparametric the population that will eventually become fixed dif- α bootstrap simulations for each species pair with i ferences is 2µstdiv. However, as we start with one fixed. The bias-corrected estimates are quite simi- contemporary species and evolve through the an- lar to the MLE’s in Table 7, but the bootstrap con- cestral species to the other contemporary species, fidence intervals, particularly the lower limits, are 2µstdiv overestimates the number of silent fixed dif- shifted upwards. ferences by the number of silent mutations that are destined to become fixed in this process but which Table 8. Bootstrap bias-corrected estimates have not yet had time to fix. There is no corre- sponding underestimate in the number of polymor- Species tdiv 95% CI phic sites, since an initial polymorphic site in the simulans vs. yakuba 5.89 (3.50, 8.90) first contemporary species cannot become a fixed melanogaster vs. yakuba 7.20 (4.56, 10.52) difference as we proceed backwards in time through melanogaster vs. simulans 1.26 (0.81, 1.92) the ancestral species and then forwards to the sec- ond species. (The site remains polymorphic in the present.) Finally, the estimate 2µ t is augmented Comparisons with data. We now compare the s div as in (17) by an estimate of the number of sites that observed numbers of silent polymorphic sites and are fixed in the sample but polymorphic in the pop- fixed differences with two sets of theoretical esti- ulation. mates of these numbers (Table 9). The first set of theoretical estimates is as follows. Since whether or The estimate (16) for the number of silent poly- not a site is polymorphic or a fixed difference de- morphic sites fits the observed numbers quite well. pends on its joint configuration in the two samples, However, the predicted numbers of fixed differences the probability that a silent site is polymorphic or 2µstdiv and (17) overstate, by factors of between a fixed difference can be computed from the joint 1.4 and 6.1, both the observed number of fixed dif- configuration probabilities (31) once αi and tdiv ferences and the number of fixed differences pre- are known. Estimates of the expected numbers of dicted from the sample configurations. The up- silent polymorphic sites and fixed differences based ward bias of this estimate probably reflects the fact on the joint configuration probabilities (31), maxi- that not all of the new mutations destined to be- mum likelihood estimates of αi and tdiv (Table 7), come fixed have as yet become fixed. Based on and the numbers N2,T C, N2,AG,andN4 of regular this argument, we propose that the total number silent sites of various degeneracies, are given in Ta- of silent fixed differences in terms of µs is approx- ble 9. These estimates fit the observed data very imately the theoretical expression (17) divided by closely, which suggests that the neutral joint con- 1.4 for D. simulans vs. D. yakuba, and the ratio figuration model (31) fits the data at synonymous of the number of replacement fixed differences to sites. Note that a consistently expanding popula- silent fixed differences can be estimated by 1.4 times tion or deleterious selection would tend to produce the ratio of (18) to (17). Setting the latter expres- Polymorphism and Divergence 15

Table 9. Silent polymorphic sites and fixed differences Obs. silenta Est. config.b Est. popn.c Species µs tdiv fixed poly fixed poly fixed poly sim. vs. yak. 2.05 5.81 17 21 16.0 19.9 24.8 21.8 mel. vs. yak. 1.85 7.08 18 21 17.5 20.5 26.9 22.4 mel. vs. sim. 2.07 1.24 1 21 2.0 19.6 6.1 21.9

a Regular silent sites only b From the joint configuration probabilities (31) c From (17) and (16)

sion equal to the observed ratio of replacement and be absorbing states for the Markov chains. Given silent fixed differences yields the corrected estimate the initial states xN = j/N → x, the processes µ . . µ {XN XN } r =00265 = 0 013 s of (2). k = i,k are assumed to satisfy the conditions  Selective constraints in Adh . The NE XN − XN XN x c x lim k+1 k k = N = ( ) µ N→∞ low value of the replacement mutation rate r,rela-  2 tive to the silent mutation rate µs, may reflect mu- NE XN − XN XN x b x lim ( k+1 k ) k = N = ( ) N→∞ tations occurring in only a small number of codons  at which a favorable amino acid change is possible, NE |XN − XN |2+δ XN x lim k+1 k k = N =0 with all other changes being strongly detrimental. N→∞ (32) The actual mutation rate at this small number of uniformly for 0 ≤ x ≤ 1forsomeδ>0, where susceptible sites may be the same as at silent sites. b(x)andc(x) are continuous functions on [0, 1]. In the single-α model α ≡ α1 for pooled data from i Let dm(x)ands(x) be the speed measure and scale D. simulans and yakuba, the mutation rate from function of the diffusion operator one nucleotide to another is ν = α1/2=0.00465, which suggests that the average number of codons 1 d2 d susceptible to a favorable amino acid change at any Lx = b(x) + c(x) 2 dx2 dx one time is n = µr/ν =0.0265/0.00465 = 5.7out of the 257 total codon positions available. In the (see equation (4)), and assume that the endpoints more general mutation model, the mutation rate de- 0, 1 are accessible boundaries for Lx (or for the pends on the nucleotide produced by the mutation, limiting diffusion process Xt). All of these condi- and the estimates ni = µr/(αi/2) = 2µr/αi range tions hold for the two-allele selection-and-drift hap- from an average of 1.8 susceptible codons (for muta- loid Wright-Fisher models discussed earlier (Ewens tions to C) to 23.1 susceptible codons (for mutations 1979). to A). v {XN } The probability N that a new process i,k begins in any time step is assumed to satisfy Mutational Flux and Fixed Differ- µ v ∼  as N →∞ ences: Derivations N NP T N 0(0 N starts at i,0 =1 with probability N 0in At equilibrium for fixed , the expected num- , {XN } x

Y XN ` T N is given by for = ` ,where = a+ is the first time at which the frequency XN ≥ a. For any fixed y<1, X∞ X∞ ` B x v P XN x v P XN x N N N N ( )= N ( k,k = )= N ( k = ) lim P (T1 0. In particular, if

Theorem 1. Suppose under the above assump- N N P (T 0.Then s(1/N ) − s(0) 1 s0(0) = ∼ s(1) − s(0) N s(1) − s(0) NX−1 f j/N B j/N 0 lim ( ) N ( ) then (33) implies that vN → µ/s (0). N→∞ j=1 Similarly, for a>0 in Theorem 1, before any ∞ X of the processes {XN } get to x ≥ a, they must first v QN f /N i,k = lim N k (1 ) N→∞ cross a. Thus the sum in (35) equals k=0 Z 1 s(1) − s(x) X∞  µ f x dm x N N N = ( ) ( ) (35) vN P (T +

(Ewens 1979) since f(x) ≡ 0forx0 iting distribution of the frequencies i,k form a and δ>0 such that Poisson random field by classical arguments (Kar- lin and McGregor 1966, Sawyer 1976, Karlin max P (T0 ∧ T1 ≥ t | X0 = x)=1− δ<1 Taylor 0≤ ≤1 and 1981). In particular, the limiting dis- x {XN } tribution of the numbers of frequencies i,k in where X ∧ Y =min{X, Y }. However, by (39) with any given set, as well as the number of processes f(x) ≡ 1, that have been fixed at 1 by any given time, are Poisson random variables that are independent for N N N lim P (T0 ∧ T1 ≥ k | X0 = xN ) nonoverlapping sets. N→∞ m = P (T0 ∧ T1 ≥ t | X0 = x) Given a sample of size , the population fre- {XN } quencies i,k of mutant nucleotides at those sites uniformly for k/N → t>0andxN → x.Thus that are polymorphic in the sample form a “ran- there exists constants C>0andδ>0 such that domly censored” version of the original Poisson ran- N N N dom field, where a process is “censored” if its site is sup max P (T0 ∧ T1 ≥ CN | X0 = j/N) N 0≤j≤N (40) monomorphic in the sample. If the censoring mecha- ≤ 1 − δ/2 < 1 nism is independent for different sites (which follows in this case from linkage equilibrium), then the cen- The Markov property implies that if CN in (40) is sored random field is also a Poisson random field. In replaced by mCN, then the right-hand side of (40) particular, given a sample of size m, the numbers of can be replaced by (1 − δ/2)m. Thus there exists silent and replacement fixed and polymorphic sites some ν>0 such that if |f(x)|≤M for 0 ≤ x ≤ 1, all have Poisson distributions. Similar arguments show that, if one has two |QN f(x)|≤MP(T N ∧ T N ≥ k | XN = x) k 0 1 0 or more random censoring mechanisms that are in- ≤ MΩe−νk/N dependent for different sites but mutually exclusive (i.e., a site cannot survive censoring by more than for some constant Ω. This provides the necessary one mechanism), then the respective censored ran- uniformity in (38). dom fields are independent Poisson random fields. As one example, given a sample of size m,thenum- We now compute the limiting flux into the ab- ber of monomorphic sites in the sample and the sorbing state 1. In any one time step, the proba- number of polymorphic sites in the sample are in- bility that a new process XN is begun with XN = i,k i,0 dependent Poisson random variables. In this exam- 1/N and is then eventually is absorbed at 1 is ple, a site is censored by the first mechanism if it is µ N N 1 polymorphic in the sample and censored by the sec- v P (T1

display a “(32)” polymorphism (i.e., with three se- acceptable approximation has to be determined by quences having one nucleotide and two sequences appropriate statistical tests on a case by case basis. having a second nucleotide) and the number of sites that display a “(41)” polymorphism (i.e., with four Other approaches to the analysis of DNA se- sequences having one nucleotide and one sequence quence variation within and between species were with a second nucleotide) are independent Poisson suggested as alternatives to the 2×2 contingency ta- random variables. Here the censoring mechanisms ble test by Graur and Li (1991) and Whittam and are to reject a site if it does not have a (32) (respec- Nei (1991). These tests appear to be less powerful tively (41)) nucleotide polymorphism in the sam- statistically than the 2 × 2 contingency table test, ple, where the mutant nucleotide could be either and may be subject to objections about their under- nucleotide. lying genetic assumptions (McDonald and Kreit- man 1991b). Furthermore, the test statistics in the proposed alternatives are assumed to have a sam- Discussion and Additional Comments pling distribution that is normal, when in fact the sampling distributions are unknown in most cases. The approach to polymorphism and divergence pre- As one example, Graur and Li (1991) compare the sented in this paper provides a framework for the observed number (k = 2) of replacement polymor- quantitative analysis and interpretation of DNA se- phisms within all three species in McDonald and quence variation within and between species. It Kreitman’s (1991a) data with the value of a test also provides a theoretical basis for the use of a statistic K = K1 + K2 + K3 (our notation), where recently proposed 2 × 2 contingency table test for each Ki is the number of segregating sites in an in- DNA sequences that compares polymorphisms and finite sites model (Watterson 1975). Separate pa- fixed differences with silent sites and amino acid re- rameter estimates are used within each species, and placements (McDonald and Kreitman 1991a), as the random variables Ki are independent. Wat- well as providing estimates for the relevant param- terson (1975) gives the mean and variance of K in × i eters. Hypothesis tests for 2 2 contingency table terms of its parameters. Watterson also gives a mo- data are well understood and quite robust, and this ment generating function for Ki that can be used to approach may be the most powerful and generally infer its exact distribution (as a sum of independent applicable test yet devised for detecting whether or geometrically distributed random variables with dif- not selection has been active in the recent evolution- ferent parameters), from which the exact one-sided ary history of particular proteins. P-value P (K ≤ 2) can be calculated. For one set One the other hand, the sample nucleotide con- of parameter values used by Graur and Li (1991), figurations also contain a tremendous amount of K =10.324.04 (mean and standard deviation), so information that can be interpreted as in the ap- that a normal approximation for K leads to a one- proach presented here. In particular, the joint sam- sided P-value P =0.020 for the observed K =2, ple configurations at silent sites can be used to esti- while the exact P (K ≤ 2) = 0.00981 from the the- mate both the synonymous mutation rate µs and oretical distribution. In this example the difference the species divergence time tdiv. Both parame- in the P-values is only twofold, but it is in the direc- ter estimates are scaled in terms of the effective tion of making the Graur and Li statistic less pow- population size Ne, and if independent informa- erful, and the discrepancy might be larger in other tion is available (as it is in this case from Hawaiian cases. Similar problems arise for a test proposed Drosophila), then all three parameters can be esti- by Whittam and Nei (1991), which is based on a mated. Given tdiv,the2× 2 contingency table pro- ratio of test statistics each of whose sampling dis- vides estimates of µr/µs (the ratio of aggregate mu- tributions is known only approximately. For ratio tations rates for synonymous and replacement sites) statistics, there is no guarantee that approximate and for γ (the average selection coefficient among P-values will even be conservative. In general, one favored and mildly deleterious replacement muta- should be careful about using a normal approxima- tions). Our analysis is based on the assumption that tion for P-values unless one is sure that the sampling the nucleotide sites are independent (i.e., in linkage distribution of the test statistic is at least approxi- equilibrium). Whether this assumption holds to an mately normally distributed. Polymorphism and Divergence 19

We would like to thank Richard Hudson, Tom Nagy- Hartl, D. L.,andDykhuizen, D.E., 1981 Po- laki, ZeAyala,´ and two referees for many helpful comments. tential for selection among nearly neutral al- This work was supported by National Science Foundation lozymes of 6-phosphogluconate dehydrogenase Grant DMS-9108262 and National Institutes of Health re- search grant GM44889 (SAS), and by National Institutes of in Escherichia coli. Proc. Nat. Acad. Sci. USA Health research grants GM30201, GM33741, and GM40322 78: 6344–6348. (DLH). Hartl, D. L., and S. A. Sawyer, 1991 Inference of selection and recombination from nucleotide Literature Cited sequence data. J. Evol. Biol. 4: 519–532.

Hilliker, A. J., S. H. Clark, and A. Chovnick, Begun, D. J. and C. F. Aquadro, 1991 Molec- 1991 The effect of DNA sequence polymor- ular of the distal portion of phisms on intragenic recombination in the rosy the X chromosome in Drosophila: evidence for locus of Drosophila melanogaster. Genetics 129: genetic hitchhiking of the yellow-achaete region. 779–781. Genetics 129: 1147–1158. Hudson, R. R., M. Kreitman, and M. Aguade´, Berry, A. J., J. W. Ajioka, M. Kreitman and , 1987 A test of neutral based 1991 Lack of polymorphism on the Drosophila on nucleotide data. Genetics 116: 153–159. fourth chromosome resulting from selection. Ge- netics 129: 1111–1117. Karlin, S., and J. McGregor, 1966 The num- ber of mutant forms maintained in a population. Caccone, A., G. D. Amato, and J. R. Pow- Proc. Fifth Berk. Symp. of Math. Stat. Prob. 4: ell, 1988 Rates and patterns of scnDNA 403–414. and mtDNA divergence within the Drosophila melanogaster subgroup. Genetics 118: 671–683. Karlin, S., and H. M. Taylor, 1981 A Second Course in Stochastic Processes. Academic Press, Drake, J. W., 1991 Spontaneous mutation. Annu. New York. Rev. Genet. 25: 125–146. Kimura, M., 1968 Evolutionary rate at the molec- Efron, B., 1987 Better bootstrap confidence in- ular level. Nature 217: 624–626. tervals. J. Amer. Statist. Assoc. 82: 171–200. Kimura, M., 1983 The Neutral Theory of Molecu- Ethier,S.N.and T. G. Kurtz, 1986 Markov lar Evolution. Cambridge University Press, Cam- Processes. Wiley and Sons, New York. bridge, England. Kreitman, M. (1983) Nucleotide polymorphism at Ewens, W. J. , 1972 The sampling theory of se- the alcohol dehydrogenase locus of Drosophila lectively neutral alleles. Theor. Popul. Biol. 3: melanogaster. Nature 304: 412–417. 87–112. Lemeunier, F., J. R. David, L. Tsacas, and M. Ewens, W. J., 1979 Mathematical Population Ge- Ashburner, 1986 The melanogaster species netics. Springer-Verlag, New York. group, pp. 147–256 in The Genetics and of Drosophila, edited by M. Ashburner and H. Graur, D. W.-H. Li ,and , 1991 Scientific corre- L. Carson. Academic Press, New York. spondence. Nature 354: 114–115. Lewontin, R. C., 1974 The Genetic Basis of Evo- Griffiths, R. C., 1979 A transition density ex- lutionary Change. Columbia University Press, pansion for a multi-allele diffusion model. Adv. New York. Appl. Probab. 11: 310–325. Lewontin, R. C., 1991 Electrophoresis in the de- Hartl, D. L., 1989 Evolving theories of enzyme velopment of evolutionary genetics: Milestone or evolution. Genetics 122: 1–6. millstone? Genetics 128: 657–662. 20 S. A. Sawyer and D. L. Hartl

Lewontin, R. C., and J. L. Hubby, 1966 A Watterson, G. A., 1975 On the number of seg- molecular approach to the study of genic het- regating sites in genetic models without recom- erozygosity in natural populations. II. Amount bination. Theor. Popul. Biol. 7: 256–276. of variation and degree of heterozygosity in nat- ural populations of Drosophila pseudoobscura. Watterson, G. A., 1985 Estimating species di- Genetics 54: 595–609. vergence times using multi-locus data, pp163–183 in Population Genetics and Molecular Evolution, McDonald, J. H. and M. Kreitman, 1991a edited by T. Ohta and K. Aoki, Springer-Verlag, Adaptive protein evolution at the Adh locus in Berlin. Drosophila. Nature 351: 652–654. Whittam,T.S.and M. Nei, 1991 Scientific cor- McDonald, J. H. M. Kreitman and , 1991b Sci- respondence. Nature 354: 115–116. entific correspondence. Nature 354: p116. Wright, S., 1938 The distribution of gene fre- Moran, P. A. P., 1959 The survival of a mutant quencies under irreversible mutation. Proc. Nat. gene under selection. II. Jour. Australian Math. Acad. Sci. USA 24: 253–259. Soc. I 1: 485–491. Wright, S. Ohta, T., 1973 Slightly deleterious mutant sub- , 1949 Adaption and selection, stitutions in evolution. Nature 246: 96–98. pp365–389 in Genetics, , and Evo- lution, edited by G. Jepson, G. Simpson,and Padmadisastra, S., 1988 Estimating divergence E. Mayr. Princeton Univ. Press, Princeton, times. Theor. Popul. Biol. 34: 297–319. N.J. Powers, D. A., T. Lauerman, D. Crawford, and L. DiMichele, 1991 Genetic mechanisms for adapting to a changing environment. Annu. Rev. Genet. 25: 629–659. Rowan, R. G., and J. A. Hunt, 1991 Rates of DNA change and phylogeny from the DNA se- quences of the alcohol dehydrogenase gene for five closely related species of Hawaiian Drosophila. Mol. Biol. Evol. 8: 49–70. Sawyer, S. A., 1976 Branching diffusion processes in population genetics. Adv. Appl. Probab. 8: 659–689. Sawyer, S. A., 1989 Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6: 526–538. Sawyer, S. A., D. E. Dykhuizen, and D. L. Hartl, 1987 Confidence interval for the num- ber of selectively neutral amino acid polymor- phisms. Proc. Nat. Acad. Sci. USA 84: 6225–6228. Smith, J. M., C. G. Dowson, and B. G. Spratt, 1991 Localized sex in bacteria. Nature 349: 29–31. Tavare,´ S., 1984 Line-of-descent and genealog- ical processes, and their applications in popu- lation genetics models. Theor. Popul. Biol. 26: 119–164.