Copyright 0 1992 by the Genetics Society of America

Population Geneticsof Polymorphism and Divergence

Stanley A. Sawyer*'+and Daniel L. Hartlt

*Department of Mathematics, Washington University, St.Louis, Missouri 63130 and TDepartmentof Genetics, Washington University School of Medicine, St. Louis, Missouri 631 10 Manuscript received March 7, 1992 Accepted for publication August12, 1992

ABSTRACT Frequencies of mutant sites are modeled as a Poisson random field in two species that share a sufficiently recent common ancestor.The selective effectof the new can be favorable,neutral, or detrimental. The model is applied to the sample configurations of nucleotides in the alcohol dehydrogenase (Adh) in Drosophila simulans and Drosophila yakuba. Assuming a synonymous rate of 1.5 X lo-' per site per year and 10 generations per year,we obtain estimates for the effective size (N, = 6.5 X lo')), the species divergence time (tdiv= 3.74 million years), and an average (u = 1.53 X 1O"j per generation for advantageousor mildly detrimental replacements), althoughit is conceivable that only twoof the amino acid replacements were selected and the rest neutral.The analysis, which includesa sampling theory forthe independent infinitesites model with selection, also suggests the estimate that the number of amino acids in the enzyme that are susceptible to favorable mutation is in the range 2-23 at any one time. The approach provides a theoretical basis for the use of a 2 X 2 contingency tableto compare fixed differences and polymorphic sites with silent sites and amino acid replacements.

T has been more than 25 years since LEWONTIN currently impractical (HARTLand DYKHUIZEN198 1 ; I and HUBBY(1 966) first demonstrated high levels HARTL 1989), althoughsome large effects have been of molecular polymorphism in Drosophilapseudoob- documented [see POWERSet al. (1 99 1) for a review]. scura. This finding had two strong immediate effects In the 1980s, theincreasing use of DNA sequencing on evolutionary genetics: it stimulated molecular stud- in evolutionary genetics gave some hopethat the ies of many other organisms, and it led to a vigorous impasse could be overcome.Direct examination of theoretical debate about the significance of the ob- , ratherthan the electrophoretic mobility of served polymorphisms (LEWONTIN199 1). The exper- geneproducts, yieldsvast amounts of information imental studies soon came to a consensus in demon- consisting of hundreds or thousands of nucleotides. strating widespreadmolecular polymorphism in The data are also of a different quality,since the DNA numerous species of plants,animals, and micro- sequences are unambiguous and contain both synon- organisms. The theoretical debate was not so quickly ymous nucleotidedifferences and differences that resolved. One viewpoint (KIMURA1968, 1983) held change amino acids. To the extent that synonymous that most observedmolecular variation within and differences are subjected to weaker selective effects among species is essentially selectively neutral, with at than aminoacid differences, comparisonsbetween the most negligible effects on survival and reproduction. two types of polymorphisms can serve as a basis of Opposed was the classical Darwinian view that molec- inference.Synonymous polymorphisms are more ular polymorphism is the raw materialfrom which common than amino acid polymorphisms (KREITMAN fashions evolutionary progress, and 1983), andalso appear to be moreweakly affected by that the newly observed molecular variation was un- selection (SAWYER, DYKHUIZENand HARTL 1987). likely to be any different (LEWONTIN1974). The two With data from only one species, the level of syn- viewpoints could not have been more at odds, and a onymous andreplacement polymorphism mustbe great controversy ensued. To a large extent theissue substantial in orderfor statistical analysis to have has been clouded by inadequatedata (LEWONTIN enough power to detect selection (SAWYER, DYKHU- 1974, 1991). Observations of natural are IZEN and HARTL 1987; HARTL and SAWYER1991). snapshots of particular places and times, and the re- Most eukaryotic genes are not sufficiently polymor- sulting inferences about the long-term fate of molec- phic to allow this approach. An alternative approach, ular polymorphisms can be challenged by neutralists pioneered by HUDSON, KREITMAN andAGUAD~ and selectionists alike. By the same token, laboratory (1 987),is based on comparing polymorphisms within experiments capable of detectingselection coefficients species with fixed differences between species. This as small as are likely to be important in nature are approach has been applied to the Drosophila fourth

Genetics 132: 1161-1 176 (December, 1992) 1162 S. A. Sawyer and D. L. Hart1 (BERRY,AJIOKA, and KREITMAN199 1) provides an estimate of the average amount of selec- as well as to the tipof the X chromosome (BEGUNand tion required to produce thediscrepancy observed, as AQUADRO 199l), both of whichare regions of reduced well as anestimate of therate at which favorable recombination. The level of polymorphism in these occur (or, equivalently, an estimate of the regions is also reduced,and the analysis suggests average number of amino acids in the protein that are strongly that the reduction is the result of a genetic susceptible to a favorable mutation at any one time). hitchhiking associated with periodic selective fixa- Several objections to thedetails of the implementation tions. of the McDonald-Kreitman test have been raised Comparison of molecular variation within and be- (GRAUR andLI 1991; WHITTAMand NEI 199 l), and tween species is also thecrux of a statistical test these are also addressed briefly. proposed by MCDONALDand KREITMAN(1 99la). The test is for homogeneity of entries in a 2 x 2 contin- RATIONALEAND RESULTS OF THE ANALYSIS gency table based on aligned DNA sequences. The rows in the contingency table arethe numbers of The first step in our method is to analyze the sample replacement or synonymous nucleotidedifferences, configurations of the nucleotides occurring at synon- and the columns areeither the numbers of fixed ymous sites in the aligned DNA sequences under the differences between species or else of polymorphic assumption thatthe synonymous variation is selec- sites within species. Here polymorphic sites are defined tively neutral. This informationis used to estimate the as sites that are polymorphic within one or more of mutation rate at silent sites and the divergence time the species, and Fxed dijferences are defined as sites between pairs of species. The divergence time is crit- that are monomorphic (fixed) within each species but ical because, if the divergence time between species is sufficiently long,then conceivably allof the fixed differ between species. The term silent refers to nu- amino acid differences between species could be due cleotide differences in codons that do not alter the to the fixation of mildly deleterious alleles, and the amino acid, and replacement refers to nucleotide dif- significance of the McDonald-Kreitman contingency ferences within codons that do alter the amino acid. table might be an artifact of saturation at silent sites. The McDonald-Kreitman test compares the number Using the estimated values of the silent mutation rate of silent and replacement polymorphic sites with the and the divergence time, the numbers of synonymous number of silent and replacement fixed differences. polymorphic sites and fixed differencespredicted When 30 aligned DNA sequences from the alcohol from the neutral configuration theory are compared dehydrogenase (Adh) of three species of Dro- with the observed numbers. These estimates fit the sophila were compared (MCDONALDand KREITMAN observed Adh data very closely for all three pairwise 1991a), there were too few polymorphic replacement species comparisons, which suggests that the configu- sites (P = 0.007, two-sided Fisher exact test). MC- ration distributions at synonymous sites are roughly DONALD andKREITMAN (1991a) argues that themost consistent with an equilibrium neutral model. likely reason for the discrepancy is that some of the The second step is to develop equationsfor the amino acid differences were fixed as a result of posi- expected number of polymorphic sites and fixed dif- tive selection acting on replacement mutations. The ferences between a pair of speciesin terms of the possibility that the fixed differences could have re- magnitude and direction of selection, the mutation sulted from a combination of slightly deleterious al- rate to new alleles having a given (constant) selective leles (OHTA 1973),coupled with a dramatically chang- effect, and thedivergence time. From these equations ing population size, was also considered byMC- we estimate the amountof selection needed toexplain DONALDand KREITMAN (199la) but considered the observed deficiency or excess in the number of implausible because this would seem to require ex- replacement polymorphisms. Wealso estimate the traordinarily fine tuning among a large number of rate of new mutations resultingin amino acid replace- independent parameters. ments (or, equivalently, the number of amino acid Although the McDonald-Kreitman test has consid- sites in the proteinproduct at which favorable or erable intuitive appeal, little quantitative theory exists mildly deleterious substitutionsare possible at any one for thecomparison of intraspecific polymorphism with time). While the configuration analysis takes into ac- interspecific divergence in the presence of selection. count thepossibility of multiple mutations at thesame In this paper we present such a theory. Among other site, the second step assumes that this does not occur; things, it addresses the question of whether the im- i.e., that the genetic locus involved has not been satu- balance in the Adh contingency table could have re- rated by mutations since the divergence of the two sulted from the random fixation of mildly deleterious species. This assumption was checked in two different alleles over an extremely long timein a population of ways. First, the expected number of silent polymor- constant size, rather than fixations of advantageous phisms was calculated by both methods and found to alleles in a shorter period of time. The theory also agree within 12%. Second, the expected number of Polymorphism and Divergence 1163 synonymous sites with two or more neutral fixations amount of selection required. Since thereare no since the species diverged was estimated as less than replacement polymorphisms between D. simulans and two in all cases.There are nosilent site polymorphisms D. yakuba in the McDonald-Kreitman data, any larger with three or morenucleotides in the data considered, value than y = 9.95 in (2), along with a correspond- and only one silent site (between Drosophila melano- ingly smaller value of pr, would explain the data just gaster and Drosophila yakuba) is polymorphic in both as well. species. The estimated aggregate mutation rateat silent sites Most of our analysis depends on the assumption of of ps = 2.05 in (1) is based on 212 regular aminoacid linkage equilibrium or independence between sites. monomorphic codon positions (see below). This esti- There is considerable linkage disequilibrium in Adh matetherefore corresponds to 2.05/212 = 0.0097 aroundthe Fast vs. Slow electrophoretic polymor- silent nucleotide changes per synonomous site per I?, phism in D. melanogaster (but not in Drosophila simu- generations in the Adh sequence. Assuming a silent lans and D.yakuba). Possible balancing or clinal selec- nucleotide substitution rate of 0.0 15 persynonomous tion on this polymorphism may not only affect nucleo- site per million years (Myr) at theAdh locus, estimated tide configurations in D. melanogaster, but also may from data onHawaiian Drosophila (ROWAN and HUNT not be appropriate for the model of genic selection 1991), N, generations is 0.645 Myr. Therefore, tdiv = that we apply below. Among the three species for 5.8 in (1) implies adivergence time of 3.74 Myr which MCDONALDand KREITMAN(1991a) have Adh between D. simulans and D. yakuba. This value is in sequences, we are most confident in applying the the middle of a range 1.6-6.1 Myr implied by single- analysis to the D. simulans vs. D. yakuba comparison. copy nuclear DNA hybridization data (CACCONE, The resulting analysis of the joint nucleotide config- AMATOand POWELL1988), where an estimated diver- urations at silent sites for D. simulans and D. yakuba gence time between D. melanogaster and D. simulans leads to the following estimates for the scaled silent of 0.8-3 Myr (LEMEUNIERet al. 1986) is used as the mutation rate p, (summed over synonomous sites) and standard of comparison. As a consistency check, the the species divergence time t& same analysis was carried out with the Adh data from simulans and D. melanogaster. This analysis yielded p, = 2.05and tdiv = 5.8 (1) D. p, = 2.07 for 2 13 monomorphic codon positions and both scaled in terms of the haploid effective popula- a value of tdjy = 1.24,from which theestimated tion size Ne. That is, ps = u, X Ne, where us is the divergence time is 0.80 Myr. This estimate is at the nucleotide mutationrate per generation summedover low end of the range suggested by LEMEUNIERet al. all synonomous sites within amino acid monomorphic (1986). codon positions in Adh other than those coding for Assuming 10 generations per year for D. simulans leucine and arginine (i.e., at “regular” silent sites; see and D.yakuba, and a value of 0.645 Myr for Ne below). Analysis of the numbers of replacement poly- generations, the estimated haploid effective popula- morphisms and fixed differencesthen leads to the tion size of either species is I?, = 6.5 X 1O6 (and hence following minimal estimates for the effective aggre- 3.25 X lo6 forthe diploid population size). This gate replacement mutation rate p, and average selec- estimate is in excellent agreement with the value of 2 tion coefficient y : X lo6 suggested for D.simulans by BERRY,AJIOKA and KREITMAN(1991). The value y = 9.95 in (2) = 0.01 3ps and = 9.95 (2) pr y implies that the average selection coefficient for ad- again scaled in terms of the haploid effective popula- vantageous or mildly deleterious amino acid replace- tion size. The quantity pr is the aggregate base muta- ments in Adh is u = r/Ne= 1.53 X 1O”j per generation. tion rate causing advantageous or mildly deleterious That is, only a very small average selection coefficient amino acid changes. The estimate of p, in (2) includes is required to account for theobserved lack of replace- acorrection factor to accountfor silent polymor- ment polymorphisms (or, equivalently, the excess of phisms that are destined to be fixed but are not yet fixed replacements) in the comparison of D. simulans fixed. The quantity y is the estimated average selec- with D.yakuba. tion coefficient amongthese amino acid changing Incidentally, the estimate of 1.5 X 1O-’ mutations ~nutations,where “average” here means that y is the per silent site per generation, derived from ROWAN selection coefficient requiredto produce the same and HUNT (1 991) and theassumption of 10 genera- numbers of replacement polymorphisms and fixed tions per year, compares well with the rule of thumb differences if all replacement mutations had the Same (DRAKE 1991) that,in metazoans, the overall mutation selective effect. The quantity is scaled in termsof rate is roughly one mutation per genome per gener- the effective population size; ie., y = ON,, where g is ation. For the Drosophila genome of 165 million base the same selection coefficient per generation. In this pairs, this implies 6.1 X 1O-’ mutations per nucleotide case,the estimates of y and u arethe minimum pair per generation. These estimates are quite close, 1164 S. A. SawyerL. and D. Hart1 particularly since there might be a slightly smaller ing states that represent, respectively, the loss of the substitution rate at silent sites within coding regions new and its fixation. than for arbitrary nucleotides. Define Tila = min(K : X$ = a) as the number of The analysis ofthe Adh data can also be interpreted generations until the ithprocess attains the frequency in another way. If the rate of replacement mutations a for the first time, and set TTa = 00 if the allele is is uniform across the coding region of Adh, then the fixed or lost before this occurs. For the sake of brevity, overall replacementmutation rate of p, = 0.013~~ let X: = X% and TC = Tila refer to a typical process implies that an average of only about 5.7 codons in X$. Then P(TY < T;) is the probability that a new the molecule are susceptible to a favorable aminoacid mutant allele is fixed in the population before it is replacement at any one time, with all other replace- lost. ments at that time being strongly deleterious. The We apply a diffusion approximation for the discrete estimate of 5.7 amino acid positions is based on equal processes {X&1. The diffusion process is denoted {Xl1, mutationrates of each nucleotide. If themutation where time is scaled in units of N generations (i.e., t rates vary according to nucleotide, the estimated num- = k/N for large N), and the infinitesimal generator of ber of amino acids susceptible to favorable replace- (X,) is assumed to be of the ment at any one time is between 2 and 23 (see discus- 1 d' d sion below). L, = - b(x) - + c(x) - The remainder of this paper focuses on the techni- 2 dx' dx cal details pertaining to the estimates and conclusions where b(x) and C(X) are continuous functions on [0, summarized above. The main themes are, first, the 11. The operator L, in (3)can be written in the form analysis of joint nucleotide configurations at silent sites from a pair of species in order to estimate the L,="-- dd mean mutation rateat silent sites andthe species dm(x) ds(x) divergence time; and, second, the analysis of the ex- pected probability density of polymorphic site fre- by introducing an integrating factor. The functions quencies and of fixed differences at the population dm(x) and s(x) are called the speed measure and the levelin order to estimate the amount of selection scalefunction of L,, respectively (EWENS 1979). Later required if all favorable and weakly deleterious re- in the paper we derive a diffusion approximation for placement mutations have the same selective effect. the expected numberof processes (XfiR1at equilibrium The population estimates are discussed in the next whose allele frequencies are in the range (p,p + dp), section, with most of the detail deferred to later in and we also calculate the equilibrium rate at which the paper. We then discuss the numerical estimation mutant alleles become fixed. of y and p,, carry out the analysis of joint nucleotide Now, assume that the processes (Xrh)correspond to configurations, and finally give some supplementary two-allele haploid Wright-Fisher models (without mu- commentson certain criticisms of the McDonald- tation) in which organisms carryinga new mutant Kreitman test. allele have WN = 1 + UN relative to those carrying the non-mutantallele. If NUN + y as N -+ CQ, then the conditions for diffusion a approximation hold MUTATIONALFLUX AND FIXED DIFFERENCES with

Suppose that new mutations arise with probability b(x) = bx(1 - x) and C(X) = yx( 1 - x) (5) vN> 0 per generation in a population of haploid size (EWENS1979). The scale and speed measure (4) are N. Let X?' be the frequency of the descendants of the then ith new mutant allele in the population, where k = 0, 1, 2, . . . is thenumber of generationsthat have elapsed since the original mutation occurred. We as- sumethat the processes (X6 : i = 0, 1, . . .\ are noninteracting, as would be the case if the mutations where s(x) is normalized so that s'(0) = 1. occurred at distinct sites that remain in linkage equi- In general, the hittingtimes T, = min(t : X, = a) for librium, or if the mutations were sufficiently well a diffusion process X, and the scale function s(x) in (4) spaced in time. The mutations could be selectively are related by the identity advantageous, disadvantageous, or neutral, but in any event we assume that the site frequency processes {X%1 are stochastically identical Markov chains. (That is, they have the same transition matrices.) Note that Xro = 1/N for each i, since each new process begins If {x&) arethe two-allele haploid Wright-Fisher with a single mutation. The states 0 and 1 are absorb- models of (5-6), then as N -+ 03 and DivergencePolymorphism and 1165

TABLE 1 Population fixational flux and polymorphic densities

S(1IN) - 40) 27 Equilibrium fluxLimiting density of frequen- - - Population of fixationscies of mutantnucleotides s(1) - s(0) N 1 - e-’” Neutral dx (MORAN1959; recall that X; = l/N). Note that (7) ; does not follow from standard diffusion approxima- tion theory, which implies only that the difference between the two probabilities in (7)converges to zero. In this case, this is trivial because the two terms in (7) converge to zero individually. Whether (7) holds for favorable and y < 0 that they are unfavorable. In the general diploid or dioecious two-allele selection-and- neutral case (y = 0), the limiting fixational flux and drift models, even those with the same diffusion ap- polymorphic frequency density in (1 1) are proximation (5), is apparently still an open problem dx (MORAN1959; T. NAGYLAKI,personal communica- p and 2p - (12) tion). X Next, assume respectively. The expressions in (1 1) and (12) are displayed in Table 1. Finally, from (12), the expected VN-~ as N+m (8) number of neutral alleles with freqeuencies p in the so that EL is the limiting mutation rate per generation range pl < p < p-4 is given by at which new mutant alleles arise. For arbitrary proc- esses {X?’) satisfying (7)and s’(0) = 1, we prove later 2P log( PZlPl). in this paper that the limiting density of polymorphic This expression differs from the corresponding ex- mutant alleles with populationfrequencies in the pected value in the infinitealleles model (EWENS 1972, range (x, x + dx) is 1979). However, the Poisson random field model of the preceding section is not the same as the infinite alleles model. The alleles in (12) do not compete with one another, and there areno constraints on the sum for s(x) and dm(x) in (4). This means that the expected of the frequencies {x$!.The processes { Xfik) are also number of mutant alleles with population frequency unaffected by subsequent mutations. This model also p in the range 0 < PI < p < pz < 1 is the integral of differs from the infinite sites model of WATTERSON (9) over the interval (PI, p2). (More precisely, the (1975), since the processes {X?,) are assumed to be limiting distribution of the population frequencies of independent (ie., in linkage equilibrium). A model of surviving non-fixed mutantalleles is a Poisson random unlinked or independent sites might be preferable on field with (9) as its mean density.) We also show that generalgrounds tothe tightly linked sites of the the limiting flux of the processes {X61 into the state infinite sites model, not only because of chromosomal I (ie., into fixation) is given by recombination, but also because of the likely frequent occurrence of short-segment gene conversion events in both prokaryotes and eukaryotes (SAWYER 1989; SMITH,DOWSON and SPRATT199 1; HILLIKER, CLARK in the time scale of the diffusion. That is, (1 0) is the and CHOVNICK199 1). limiting number of mutant alleles per N generations For more general processes {xrk1, where the con- that become fixed. Notethat VN and p in (8)are nection condition (7) may not hold, the relations (9) expressed as rates per generation, whereas (10) is the and (1 0)are still validprovided thatvN in (8) is divided rate per N generations. The difference in time scale by the ratio of the two probabilities in (7). SEWALL reflects the fact that most of the new mutant allele WRIGHT (1938) deriveda distribution similar to (9)as processes become lost in the first few generations. anapproximation to a quasi-stable distributionfor Frequencies of Wright-Fisher alleles: For the two- one allele under selection and irreversible mutation. allele Wright-Fisher model (5-6) with y # 0, the WRIGHT’S(1 938) problem involved a transient distri- equilibrium flux of fixations (10)and the limiting bution for a single allele, and does not carry over to density (9) of non-fixed mutant allele frequencies take the equilibrium distribution (3) for a random field of the respective forms alleles. Between-species comparisons:In order to analyze differences between species, we assume that a single populationhad become separatedinto two disjoint where y > 0 indicates thatthe mutant alleles are and reproductively isolated subpopulations or species 1166 Sawyer S. A. Hart1and D. L. at a time tdjv in the past, measured in terms of the are of comparable size, then y cannot be strongly diffusion time scale (ie., the split occurred tdjv X N negative. Note that P, S 1 0pSeven if only 10% of the generations before the present). Both subpopulations sites in a coding region are silent and all amino acid are assumed to have haploid effective size N. If mu- replacements are advantageous or mildly deleterious. tations that occur after the population split can be If the two expressions in (13) are equal and pT/ps< distinguished, and if the numberof fixations that have 10, then 3 - 1.8 1. If is large and positive and the occurred in the two species can be approximated by two expressions are equal, then p,/pS = 1/(2y). 2t,li, times the equilibrium flux of fixations, then the Similarly, the expected mutant frequency densities number of mutations that correspond to fixed differ- at silent and replacement polymorphic sites are ences between the two species is 2tdiv times the first dx expressions in (1 1) and (1 2). An important quantity is dus(x) = 2ps - and the ratio of the expectednumber of polymorphic X alleles in thefrequency range (x, x + dx) tothe expected number of fixed differences, which equals

(s( 1) - s(x))dm(x)- (1 - e-2y(1--X))dx - respectively, in either population. 2 tdiv tdiv2y(l - x)x * Now, suppose that we have aligned DNA sequences from m chromosomesfrom the first species and n In the next section we will use this expression as a from the second species. The next step basis for estimating the average value of y for replace- is to convert the populationlevel estimates (1 3-1 4) to ment differences between species. sample estimates. Suppose that a mutant nucleotide has population frequency x at a site. Then a random SAMPLING FORMULAS AND PARAMETER ESTIMATES sample of m chromosomes from that population will be monomorphic for the mutantnucleotide with prob- Suppose that two species divergedgenerations ability qm(x)= xm,and will be polymorphic at that site ago, and that both have the same haploid effective with probability pm(x)= 1 - xm - (1 - x)m.Thus the population size Ne. Assume that the mutation rate for expected number of silent polymorphic sites in a ran- silent sites in the coding region of a particular gene is dom sample of size m is p, per gene per generation,and that the mutation rate for nonlethal replacement mutations is p, per gene per generation. Assume further that (i) allnew re-

m-l ~ placement mutations bestow equal fitness w = 1 + y/ 1 = 2Ps c Ne,(ii) each new mutation since the divergence of the k= 1 species occurred at a different site (in particular, the for dv,(x)in (14). In fact, assuming linkage equilibrium gene has not been saturated with mutations), and (iii) between sites, the number of silent polymorphic sites different sites remain in linkage equilibrium. Under is a Poisson random variable whose mean (and hence these assumptions, the fixational flux andthe ex- also variance) is given by (1 5)(see below). The expres- pected frequency densities of mutant nucleotides at sion (1 5) is the same as WATTERSON'S(1 975) formula silent and replacement sites in a single random-mating for the number of polymorphic sites in a sample of population are those given in Table 1. size m from the infinite sites model. However,the Assume further that the number of polymorphic variance in the infinite sites model is larger than the sites at the present time that are destined to become variance of the number of silent polymorphic sites in fixed, and the number of site polymorphisms surviv- our case. ing from the time of , can be neglected in The number of silent polymorphic sitesin both comparison with the number of fixed differences be- samples together is then tween the species and the number of sites that are presently polymorphic. Then the expected numbers 2Ps(L(m) + Un)) (16) of fixed differences between the two species at the where present time are m-1 , 2-Y 2pstdlv and 2Prtdiv 1 - (13) The estimate (16) for the number of silent polymor- for silent and replacement sites, respectively. Note phic sites agrees with the observed numbers to within that the second expression in (13) is much more 7% of each of the three pairwise species comparisons sensitive to y if y < 0 than if y > 0. For example, the using the Adh data of MCDONALDand KREITMAN fktor multiplying 2p,tdivis 4.1 X lo-' if y = -10, but (1 99 la), assuming the values of P, estimated from the only 20 if y = 10. Thus,if the two expressions in (1 3) joint configurations in the next section. Polymorphism and Divergence and Polymorphism 1167

Similarly, a silent site will bemonomorphic in a L(m) S H(m) < 2L(m), y > 0 sample of size m if it is fixed in the population, but may also be monomorphic in a sample by change if it for L(m) = Ilk. However, H(m)will be smaller is polymorphic in the population. Under the assump- if y is large and negative. tions of (1 3), the number of silent sites in a sample of The ratio of the number of replacement polymor- size m that are monomorphic for a mutant nucleotide phic sites to the number of replacement fixed differ- is Poisson with mean ences is given by the ratio of (19) to (1 8), which is

and the number of silent fixed differences between where the two samples is

However, the estimate (17) is about 45% toolarge for D. yakuba vs. D. simulans and D.yakuba vs. D. mela- nogaster, and it is more than 5 times too large for D. after a change of variables in x. The four basic sam- melanogaster vs. D.simulans (see Table 9). The most pling formulas are summarized in Table 2. likely reason for these discrepancies is that the silent Note that F(m) - L(m)/yand G(m)-+ 0 as y -+ 03 polymorphic sites in the present populations that are for L(m) = 2rL-l l/k in (16). Thus the expression in destined to become fixed are counted in 2Pstdiv in (1 7) (20) is asymptotic to even though they are not yet fixed. The discrepancy is not large except forthe comparison D.melanogaster vs. D.simulans, and it only affects the estimate of pr in (1-2). A correction for this overestimation is in- cluded in the calculation of pr in (2). Estimating parameters: We estimate y by setting the By a similar argument, the numberof replacement expression (20) sites that are monomorphic for a mutant nucleotide F(m) + F(n) is Poisson with mean tdiv + G(m) + G(n) 27 Pr 1- e-2r tdiv + X"'dVr(X) - Numb. observed repl. polymorphisms s' Numb.observed repl. fixed differences (22) for dv,(x) in (14). The number of replacement fixed differences between the two samples is then Poisson for the ratioof the numbers of the observed replace- with mean ment polymorphic sites and replacement fixed differ- ences. (Note that p, cancels in this particular ratio. If 9 CII &I y > 0 is sufficiently large, the approximation (21) can 2Pr 1 - e-2r (tdiv + C(m) + G(n)) (18) be used instead.) If the number of replacement poly- where morphic sites is zero (as it is for the D.simulans vs. D. yakuba comparison), then we use ?has a conservative 1 1 - ,-2r(l-x) G(m) = xm-l dx. value to replace the zero. For example, if there are 1 2y(l - x) no replacement polymorphicsites and 6 fixed replace- Note G(m)d l/m for > 0. By the same reasoning, ment differences (as is the case for D. simulans vs. D. the expected number of polymorphicreplacement yakuba), we set the expression in (20) equal to %/6 = sites in a sample is '112 and solve for y. Using the estimate tdiv = 5.8 derived in the next section, the solution is y = 9.95. 2Pr(H(m) H(n)) (19) + (The approximation (2 1) gives y = 1 1.O in this case.) where Since there are noreplacement polymorphic sites for D.simulans vs. D.yakuba, this value is a lower bound 1 - - (1 - x)" 1 - e-2Y(l--x) dx. for the average value of y. Any larger value would 1 - e-2Y also be consistent with the data. The expression H(m) varies by only a factor of two The replacement mutation rate pr can be estimated for y > 0, since the inequality A c (1 - e"yA)/(I - e-7 by setting the ratio of the expected numbers of re- d 1 for 0 < A < 1 implies placement and silent fixed differences in (1 8) and (17) 1168 S. A. Sawyer and D. L. Hart1 TABLE 2 Sampling formulas

Population Fixed Population differences Polymorphic sites Neutral

Selected

Expected numbers for samples of m genes from one species and n genes from a second species.

TABLE 3 Nucleotide frequencies at 4-fold degenerate sites in theAdh region - Numb. observed repl. fixed differences - Species Sequences Sites T C A G Numb. observed silent fixed differences (23) D. simulans 6 109 0.18 0.62 0.06 0.14 for the ratio of the numbers of the observed replace- D. melanoguster 12 108 0.17 0.60 0.07 0.16 ment and silent fixed differences. Setting the theoret- D.yakuba 12 110 0.14 0.61 0.07 0.18 ~~ ical ratio (23) equal to the observed ratio 6/17 for D. Data from MCDONALDand KREITMAN(1 99 la). simulans and D.yakuba yields p, = 0.037 = 0.018ps. (Recall that only regular silent sites are counted in the estimate p, in terms of ps and the more numerous estimation of ps; see the next section.) However, as silent site data rather than use the maximum likeli- noted earlier, both 2potdiv and (1 7) overestimate the hood estimator for pr in this case. observed number of silent fixed differences, as well as the estimated number of silent fixed differences DIVERGENCE TIMESAND MUTATION RATES using the probability distribution of joint configura- tions (see the next section). In both cases, the estimate We treat a K-fold degenerate silent site as a neutral (17) is too large by a factor of about 1.4 (see Table Wright-Fisher model with K types, no selection, and 9). Since replacement polymorphisms with favorable mutation rate uq per generation from i to type j. mutant nucleotides are likely to become fixed faster If ug = uj depends only on j, and if ug = N,u,~= v, = than silent polymorphisms, we apply a correction of Neuj is the scaled mutation rate where Ne is the effec- 1.4 to ps but not to pr in (23). This correction leads to tive haploid population size, then WRIGHT'S(1949) the estimate pr = 0.0265 = 0.0 13p, for D.simulans vs. formula states that the equilibriumpopulation fre- D. yakuba. The interpretation of pr in terms of an quencies p,, . . . , p, of the K types are random with average of 5.7 amino acids susceptible to favorable probability density replacement is discussed at the endof the nextsection. Under our assumptions, the numbers of silent and replacement fixed differences and polymorphic sites are independent Poisson random variables (see be- for large Ne. In (24), aj = 2vj = 2Neuj, a = c:l a;, low). Maximum likelihood estimatorsfor pr and y and r(a)is the gamma function. It follows from (24) (along with associated confidence intervals) canbe that ,?(pi) = ai/a for all i. If a1 = ap = . . . = aK,then found by setting theexpressions (1 8)and (1 9) (respec- a = Kai and E(p;)= 1/K for 1 S i S K. tively) equal to the observed numbers of replacement However, nucleotide frequencies at third-position fixed differences and replacement polymorphic sites. sites in fourfolddegenerate codons tend to be far The maximum likelihood estimator for y is the same from uniform (see Table 3). Assuming that the sites as the estimator (22) for y. However, we prefer to are independent, departures from an even distribu- Polymorphism and Divergence and Polymorphism 1169 tion are highly significant in all cases in Table 3 (P < and HARTL1987). Regular isoleucine silent sites are lo-", 3 d.f.), although the observed nucleotide fre- considered 2-fold degenerate. The analysis of nucleo- quencies do not differ significantly among the three tide variation at silent sites is restricted to regular species (P = 0.10, 6 d.f,). The mutation model (24) silent sites. with variable ai provides a significantly better fit to The scaled mutation rate to nucleotides of type j is the distribution of nucleotides at silent sites in the Adh vj = aj/2. Since E(p,)= a,/a by (24), theoverall mean data thandoes (24) with ai= a1 (P < 3d.f., in mutation rate at 4-fold degenerate silent sites is all three species, assuming independent sites). Al- 4 though a mutationmodel with rates that area function p4 = (Y,((Y- (Y,)/~(Y,(Y = (YI + . . . + (~q (26) of the final nucleotide, rather than the original, is ,=1 somewhat artificial, it is reasonable in the analysis of with the rates these datasince only the existing nucleotide sequences are known. Furthermore, since the initial and final PTC = a~az/(a~+ az) and nucleotides may very well be correlated (e.g.,because PAC = asa4/(a3 a4) (27) of a transition or transversion bias), the model with + variable a,is probably reasonably robust. Inany event, at 2-fold degenerate sites. The aggregate silent mu- it provides a significantly better fit to the Adh data. tation rate per individual at regular 2-fold and 4-fold Assuming Wright's model (24) at K-fold degenerate degenerate sites is then silent sites, the equilibrium distribution of population nucleotide frequencies is a K-type Dirichlet distribu- P, = NZ,TCPTC+ NZ,AGPAC+ N4~4 (28) tion (24) with parameters ai = 2N,u,, where Ne is the where N2,Tcand N~,AGare the numbers of regular 2- haploid effective population size. If a sample of size m fold degenerate silent sites of each type (TC or AG), is randomly chosen from this model, the probability N4 is the number of regular 4-fold degenerate silent that the sample will contain m, nucleotides of type i sites, and 3-fold degenerate sites are counted in NZ,TC for 1 K is then < i < unless the codon position contains an ATA. Table 5 gives the within-species maximum likeli- m! nf=,af"J , = + . . . + aK (25) hood estimates for al, . . . , a4 based on the configu- ml!mp! . . dm) . mK! ration probabilities (25) at regular silent sites with all four ai's varied independently in the maximization. where a(")= a(a + 1) . . . (a + m - 1). Examples of (25) for some particular configurations are given in The last two columns in Table 5 give the resulting Table 4. estimates for p4 in (26) and ps in (27-28). Inferred There aretwo types of twofold degenerate sites. At 95% confidence intervals are generally on the order sites of the first type, the nucleotide can be either of of plus or minus half the size of the estimated MLEs. the two pyrimidines Tor C, for which (25) holds with Table 6 gives one-dimensional maximum likelihood K = 2. At sites of the second type, the nucleotide can estimates forthe single-a model as a1. While the be either of the two purines A or G, and (25) holds values for p4 and ps in Table 6 are essentially the same with K = 2 and aI,a2 replaced by as, a4.The only 3- as in thefour-a model, the model with variable ai folddegenerate amino acid, isoleucine, has third- provide asignificantly better fit to the datain all three position synonomous nucleotides T, C, and A. Since species. the codon ATA is almost entirely absent in the Adh An estimateof divergence time:A time-dependent sequences, we treat 3-fold degenerate sites as 2-fold version of Wright's formula (24)was derived by GRIF- degenerate in the analysis of silent site distributions, FITHS (1979). Assume that two equilibrium Wright- and ignore silent codon positions containing an ATA. Fisher K-type populations diverged tdiv X Ne genera- Synonymous sites within leucine and arginine codons tions ago, where Ne is the haploid effective population are of variable degeneracy, since silent mutations in size. The ancestral population and both offshoot pop- the first (orthird) codon position can changethe ulations are assumed to have the same haploid effec- degeneracy in the third (or first) position. Codons for tive population size and the same mutation structure serine fall into two nonoverlapping classes, one 4-fold uq = u,. Given present nucleotide frequencies pl, . . . , degenerate and one2-fold degenerate; we treat these PK in one of the populations, then the present fre- two classes as separate amino acids. We define a reg- quencies 41, . . . , qK in the other population are ran- ular silent site as either any 2-fold or 4-fold degenerate dom with probability density third-position site in anamino acid monomorphic codon position thatdoes not code for leucine or arginine, or else as a third-position site in an amino acid monomorphicisoleucine codon position that does not containan ATA codon(SAWYER, DYKHUIZEN, 1170 S. A. Sawyer and D. L. Hart1

TABLE 4 Sample configurations

Configuratiorl Probability from Equation 25 TTCAAG 6! a,(a1 + I)ana3(aa + I)a, 2!1!2!1! .(a + l)(a + 2)(a + 3)(a + 4)(a + 5)

6! a:i(aa + I)(.> + 2)a4(a4+ ])(a4+ 2) 0!0!3!3! a(a + 1)(a+ 2)(a + 3)(a + 4)(a + 5)

6! aq. . .(a2 + 4)a3 0!5!1!0! a(a + 1)(a + 2)(a + 3)(a + 4)(a + 5)

12! 01). . .(ail + 4)a2(012 + l)as(aa + 1)(aa + 2)a4(a4 + 1) 5!2!3!2! .(a + I)(a + 2). . .(a + 11)

TCC<:CC 12! a,a2(a:, + 1). . .(a2+ 10) cccccc 1!11!0!0! a(a + I)(a + 2). . .(a + 11)

TABLE 5 1984). That is, &(t) is the probability that all individ- MLEs from Wright's formula Equation 25 uals in the present population either have descended without intervening mutation from oneof 6 founding individuals who existed t diffusion time units ago, or I). simulans 0,0101 0.00326 0.0021 0.00960.0156 2.33 else are descended from a mutant ancestor thatarose I). melanogaster 0.0082 0.002590.0020 0.0083 0.0130 1.92 within the last t time units. If the nucleotide frequen- I). yakuba 0.0063 0.00281 0.0025 0.0100 0.0135 1.93 cies were pl, . . . , pK in the ancestral population, the V;~lursare estintated from regular silent sites (see text). probability that bi of the b founding ancestral individ- uals were of type for 1 < i S K is (6!/61! . . bK!)lI p:. TABLE 6 i . The unmutated descendants of these b individuals will MLEs assuming ai = a, have the same states at the present time. Thus the nucleotide frequencies 41, . . . , qK at the present time SpeciesNormaltheory 95% CI for al 84 P, have a Dirichlet distribution (as in Wright's formula D. simulans (0.0104 & 0.0065) = (0.0039, 0.0169) 0.0155 2.26 (24)) conditionedon a sample of size b having bi D. melano- (0.0087 & 0.0052) = (0.0035, 0.0139) 0.0130 1.87 individuals of type i for 1 S i S K. However, a K-type gaster Dirichlet distribution with parameters a,, conditioned D.yakuba (0.0085 ? 0.0051) = (0.0041, 0.0136) 0.0127 1.86 that a sample of size 6 had b; individuals of type i, is where a, = 2Neu, and a = al + . . . + aK (GRIFFITHS Dirichlet with parameters 6, + a;. Finally, by time 1979). The coefficients db(2t) in (29) satisfy reversibility, the joint distribution of two present day populations, connected through an ancestral popula- (2R + a - l)(-l)*+(a! + b)(k-l) db(t) = - e -*A' (3Oa) tion t time units ago, is the same as the joint distribu- k=b k! tion of one population and anancestral or descendant if 6 2 1 and population 2t time units apart.This completes the proof of Griffith's formula (29). (2k + a - l)(-l)k-'a(k-') do(t) = 1 - e-',' (30b) Suppose that a sample of size m is taken from one h= I R! species, and of size n from another, closely related, where dm) = a(a + 1) . . . (a + m - 1) as before and species. It follows from (29) that the joint probability ~k = k(~+ a - 1)/2. Since &(t) = forlarge that the first sample has mi bases of type i(1 S i S K), h, the series (29) converges rapidly in b unless t is and that the second sample has n, bases of typej(1 S small. j G K), is

Equation 29 has a simple intuitive explanation (TA- m br VAR~1984). In the limit as N + W and Nu, 4 a,/2, c,, d42t) Ax a11 individuals in an equilibrium Wright-Fisher popu- b=O b,!. . . bK! lation are the descendants of 6 < 03 "founding" ances- tors that lived t x Ne generations earlier. The expres- sion &(t) in (30a,b) is the probability distribution of 6 for t in the diffusion time scale, under the assumption where Cmn= m!/(ml!. . . mK!) X n!/(nl!. . . nK!) and dm) that mutations break the line of descent (TAVAR~= a(a + 1) . . . (a + m - 1) as before. Polymorphism and Divergence 1171

TABLE 7 Drosophila pairwise comparisons

Model with a,= at

Species Ps td," 95% CI for td,% t,,., PA tdSY t,,,W D. simulans vs. D.yakuba 5.81 2.05 (3.28, 8.33)1.98 6.72 5.55 6.46 D. melanogaster vs. D.yakuba 1.85 7.08 (4.06, 10.1 1) 8.18 1.so 6.54 7.85 D. melanogaster vs. D.simulans 2.07 1.24 (0.48, 1.99) 2.26 2.012.26 1.23

Given an aligned sample of DNA sequences from TABLE 8 two species, we estimate the divergence time between Bootstrap bias-corrected estimates the two species as follows. First, all regular silent sites (as defined earlier) within the two species are pooled Species td,, 95% CI to find maximum likelihood estimates of a1, . . . , a4 D. simulans vs. D.yakuba 5.89(3.50, 8.90) using Wright'sformula (25) for the configuration D.melanogaster vs. D.yakuba 7.20 (4.56, 10.52) probabilities. Using the estimates values for a,,a max- D. melanogaster vs. D. simulans 1.26 (0.81, 1.92) imum likelihood estimate of tdiv is then found using the likelihoods (31) at regular silent sites in the two divergence times and comparesthem by computer species. Since (31) depends on t only through db(2t), simulation. arrays (a, + bi)@*) etc. can be computed in advance Wealso calculated bootstrap bias-corrected esti- once the ai are known, independently oft. Themost mates and bias-corrected 95% confidence intervals time-consuming part of the maximization is the com- (EFRON 1982) for tdiv in the four-a model (Table 8). putation of the coefficients multiplying db(2t) in (31) These were based on 1000 nonparametric bootstrap for K = 4 and large b. simulations for each species pair with ai fixed. The bias-corrected estimates are quitesimilar to the MLEs Estimates of tdiv using this method and the data of MCDONALDand KREITMAN(1 99 la) aregiven in Table in Table 7, butthe bootstrap confidence intervals, 7. Note that our method gives a direct estimate of particularly the lower limits, are shifted upwards. species divergence times, rather than of gene diver- Comparisons with data: We now compare the ob- gence times (ie., the time since the common ancestor served numbers of silent polymorphic sites and fixed of a set of genes, which may be considerably older differences with two sets of theoretical estimates of these numbers (Table 9). The first set of theoretical than the times since the species diverged). In practice, estimates is as follows. Since whether or not a site is estimates of the average pairwise genedivergence polymorphic or a fixed difference depends onits joint times tgene(Table 7) were computed as a starting point configuration in the two samples, the probability that for the maximization of the likelihood for tdiv. These a silent site is polymorphic or a fixed difference can estimates of tgene tended to be larger than tdiv, partic- be computed from the joint configuration probabili- ularly when tdiv was small. Table 7 also contains 95% ties (31) once a, and tdiv are known. Estimates of the confidence intervals for tdiv based on the one-dimen- expectednumbers of silent polymorphic sites and sional likelihood curvatureat tdiv. However,these fixed differences based on thejoint configuration confidence intervals may overstate the accuracy of the probabilities (3 l), maximum likelihood estimates of ai estimates since the ai were held constant. The last and tdiv (Table 7), and the numbers N2,rC,N2,~~, and three columns in Table 7 give estimates for tdiv based N4 of regular silent sites of various degeneracies, are on the single-a model ai = a2. These were quite similar given in Table 9. These estimates fit the observed to the estimates in the preceding four columns based data very closely, whichsuggests that the neutral joint on the four-cu model. configuration model (31) fits the data at synonymous WATTERSON(1985; see also PADMADISASTRA1988) sites. Note that a consistently expanding population developed a maximum likelihood method for estimat- or deleterious selection would tend to produce fewer ingthe divergence time between two populations polymorphisms than predicted by neutrality, while a based on the infinite alleles model at many unlinked contracting population or would loci. If mutation is sufficiently rare so that recurrent be likely to have more polymorphisms. mutation at polymorphic sites can be neglected, then The second set of theoretical estimates arethe this method should give estimates similar to ours, but Poisson-random-field-based sampling estimates de- it does not allow forthe possibilityof nucleotide- rived earlier. The number of silent fixed differences dependent mutationrates. WATTERSON(1985) also is estimated as follows. By time reversibility, data from discussesfive other methods of estimating species two species at the present time can be viewed as two 1172 S. A. Sawyer and D. L. Hard snapshots of a single species separated in time by 2tdiv by the mutation, and the estimates ni = p,/(a,/2) = time units. Under selective neutrality, the rate p, of 2p,./a, range from an average of 1.8 susceptible co- regular silent mutations per individual is the same as dons (for mutations to C) to 23.1 susceptible codons the long-term rate of fixations of regular silent differ- (for mutations to A). ences at the population level, so that the expected number of silent mutations in the population that will eventually become fixed differences is2p,tdiv. How- MUTATIONALFLUX AND FIXED DIFFERENCES: ever, as we start with one contemporary species and DERIVATIONS evolve through the ancestralspecies to the othercon- The purpose of this section is to derive the limiting temporary species, 2p,tdiv overestimates the number density of the processes (x$]in the frequencyinterval of silent fixed differences by the number of silent [O, 13, and the limiting rate at which these processes mutations that are destined to become fixed in this are fixed at the state 1. For each N 3 1, let (Xyk](i= process but which have not yet had time to fix. There 1, 2, . . . ) be a sequence of Markov chains on (0, 1/N, is no corresponding underestimate in the number of 2/N, . . . , 1) with the same transition matrix. Assume polymorphic sites, since an initial polymorphic site in that one of these processes starts at X,16 = 1/N with the first contemporary species cannot become a fixed probability vN > 0 in each time step. The endpoints difference as we proceed backwards in time through 0,l are assumed to beabsorbing states for theMarkov the ancestral species and then forwards to the second chains. Given the initial states xN = j/N + x, the species. (The site remains polymorphic in the present). processes (X” = X$) are assumed to satisfy the con- Finally, the estimate 2p,tdivis augmented as in (17) by ditions an estimate of the number of sites that are fixed in the sample but polymorphic in the population. The estimate (16) for the number of silent poly- morphic sites fits the observed numbers quite well. lim NE((Xf++,- X”)’ I X?’ = xN) = b(x) (32) However, the predicted numbers of fixed differences N-m 2p,tdiv and (17) overstate, by factors of between 1.4 and 6.1, both the observed number of fixed differ- ences and the number of fixed differences predicted uniformly for O s x c 1 for some 6 > 0, where b(x) from the sample configurations. The upward bias of and C(X) are continuous functions on [0, 11. Let drn(x) this estimate probably reflects the fact that not all of and s(x) be the speed measure and scale function of the new mutations destined to become fixed have as the diffusion operator yet become fixed. Based on this argument, we propose 1 d2 d that the total number of silent fixed differences in L, = - b(x) 7+ c(x) - dx dx terms of p, is approximately the theoretical expression 2 (1 7) divided by 1.4 for D. simulans vs. D. yakuba, and (see Equation 4), and assume that the endpoints 0, 1 the ratio of the number of replacement fixed differ- are accessible boundaries for L, (or for the limiting ences to silent fixed differences can be estimated by diffusion process Xt). All of these conditions hold for 1.4 times the ratio of (18) to (17). Setting the latter the two-allele selection-and-drift haploid Wright- expression equal to the observed ratioof replacement Fisher models discussed earlier (EWENS1979). and silent fixed differences yields the corrected esti- The probability vN that a new process (XS)begins mate p, = 0.0265 = 0.013pSof (2). in any time step is assumed to satisfy Selective constraints in Adh : The low value of the replacement mutation rate pr,relative to P VN - as N (33) the silent mutation rate ps, may reflect mutations NP(T$+ T$)(s(a)- ~(0)) occurring in only a small number of codons at which a favorable amino acid change is possible, withall where TZ+ = min(k : X: 3 a)and a > O(0 < a s 1) is other changes being strongly detrimental. The actual arbitrary. In fact, condition (33) is independent of a mutation rate at this small number of susceptible sites for 0 < a 1 (see below). If the condition P(T?+ < may be the same as at silent sites. In the single a model Tg) - P(T, < To I X. = 1/N) holds as N 3 UJ for Tt = cyi = a, for pooled data fromD. simulans and D.yakuba, minit : X, = a] and s’(0) = 1, then (33) is equivalent the mutation rate from one nucleotide to another is v to VN -+ p. = a1/2 = 0.00465, which suggests that the average At equilibrium for fixed N, the expected number number of codons susceptible to a favorable amino of processes (X%] at the state x for 0 < x < N is given acid change at any one time is n = p,/v = 0.0265/ by 0.00465 = 5.7. In the more general mutation model, m m the mutation rate depends onthe nucleotide produced Polymorphism and Divergence 1173

TABLE 9

Silent polymorphic sites and fixed differences

Estimated Estimated Observed silenta configurationb population‘

Species Ps td,“ Fixed Poly Fixed Poly Fixed Poly D.simulans vs. D.yakuba 2.05 5.81 17 21 16.0 19.9 24.8 21.8 D.melanogaster vs. D. yakuba 1.85 7.08 18 21 17.5 20.5 26.9 22.4 D. melanogaster vs. D. simulans 2.07 1.24 1 21 2.0 19.6 6.1 21.9

a Regular silent sites only. Fro111 the joint configuration probabilities (Equation3 1). FIxm Equations 16 and 17.

where k in (34) represents a time k generations in the (EWENS 1979; ETHIERand KURTZ 1986).It follows past. Iff(.) is a function on [o, 11 withf(0) =f(l) = from (37) and (32) that the “overshoot”in (36) can be 0, then the expected number of processes (Xc)in the neglected as N + 00; i.e., it is sufficient to take Y = a open interval (0, l), weighted byf(x) where x is the in (36). Hence by (36) and (37) present frequency, is P(T~+CT~)(s(a)-s(O))-P(T~ 0. In particular, if where Qff(x)= E(j(Xf) I X: = x). Our first result is the limit theorem Theorem 1: Suppose under the above assumptions that 1 S’(0b f(x) iscontinuous on [0, 13 andthat f(x) = 0 in the .- interval [0, a)for some a > 0. Then N S( 1) - s(0) N- 1 m then (33) implies that ZIN + p/s’(O). Similarly, for a > 0 in Theorem 1, before any of the processes (X%] get to x 3 a, they must first cross a. Thus the sum in (35) equals

\ where s(x) is the scale and dm(x)the speed measure of the limiting d@usion. It follows from (35) that thelimiting expected num- ber of processes (Xg]with frequencies in the range pl C p S p2 is the integral of the integrand on the by (33), where Y = X? for 1 = Tf+ 3 a as before, since right-hand side of (35) over p2), provided that (PI, PI, we can neglect the overshoot in (38) for the same p2 are points of continuityfor dm(x). In fact, the reasons as in (36). Now by (32) and Trotter’s theorem limiting distribution of (X61is a Poisson random field (ETHIERand KURTZ 1986) with this integrand as its mean density (see the next section). Proof of Theorem 1: We first show that condition (33) is independent of a. If 0 < 1/N < a < 1, then uniformly for k/N * t and XN * x. Thus, if we can X: must first cross a before getting to 1. This leads to show that the sum in (38) converges uniformly in N the identity for k/N 2 C, then as N - 00 P(T:

for Y = X?, where 1 = Tf+ is the first time at which = (s(a)- s(0)) ’(l) - S(x)f(x)dm(x) s’a the frequency X7 2 a. For any fixed y < 1, S( 1) - ~(0) Iim P(T;Y

Since the endpoints 0, 1 are accessible exit bound- sored” version of the original Poisson random field, aries for L, or (Xt], we can choose t > 0 and 6 > 0 where aprocess is “censored”if its site is monomorphic such that in the sample. If the censoring mechanism is inde- maxP(ToAT1>tIXo=x)=1-6<1 pendent for different sites (which follows in this case osx-2 1 from linkage equilibrium), then the censored random field is also a Poisson random field. Inparticular, where X A Y = min(X, Y].However, by (39) withf(x) given a sample of size m, the numbers of silent and = 1, replacement fixed and polymorphic sites all have Pois- limP(T~ATY3k)X~=xN)=P(ToAT13tIXo=x) son distributions. N-m Similar arguments show that, if one has twoor more uniformly for k/N + t > 0 and xN + x. Thus there random censoring mechanisms that are independent exists constants C > 0 and 6 > 0 such that for different sites but mutually exclusive (i.e., a site cannot survive censoring by more than one mecha- su max P( TtA TY 3 CN I X: =j/N) nism), then the respective censored random fields are OSjSN 2 (40) independent Poisson random fields. As one example, s 1 -6/2< 1. given a sample of sizem, the numberof monomorphic sites in the sample and the number of polymorphic The Markov property implies that CN in (40) is if sites in the sample are independent Poisson random replaced by mCN, then the right-hand side of (40) can variables. In this example, a site is censored by the be replaced by (1 - 13/2)~. Thus thereexists some u > first mechanism if it is polymorphic in the sample and 0 such that if If(x) I C M for 0 S x S 1, censored by the second mechanism if it is mono- IQ~~(~)I~MP(T:AT~~~IX~=X)~MQ~-”~’~morphic in the sample. (The number of sites that are fixed in the population form a separate independent for some constant Q. This provides the necessary Poissonclass.) It follows from this that, if one has uniformity in (38). samples of sizes m and n from two populations, then We now compute the limiting flux into the absorb- thenumber of fixed differences between the two ing state 1. In any one time step, the probability that samples and the numberof sites that are polymorphic a new process Xrh is begun with X?, = 1/N and is then in either sample are realizations of independent Pois- eventually is absorbed at 1 is son random variables. As a second example, given a sample of size 5 from a single population, the number of sites that display a “(32)”polymorphism (i.e.,with three sequences having by (33) with a = 1. Since one time unit in the diffusion one nucleotide and two sequences having a second time scale equals N discrete timesteps, we have shown nucleotide) andthe number of sites that display a “(41)”polymorphism (i.e., with four sequences having Theorem 2: Under the above conditions, the limiting one nucleotide and one sequence with a second nu- expected number of processes {xfk1 that are absorbed at 1 cleotide) are independent Poisson random variables. in one time unit in the dflusion time scale is Here the censoring mechanisms are to reject a site if P it does not have a (32) [respectively (41)] nucleotide polymorphism in the sample, where the mutant nu- s(1) - s(0) * cleotide could be either nucleotide. POISSONRANDOM FIELDS ANDA SAMPLING THEORY FOR INDEPENDENTSITES DISCUSSION ANDADDITIONAL COMMENTS Since the (X%) are independent Markov processes The approachto polymorphism and divergence that arrive in a limiting Poisson stream, the limiting presented in this paper provides a framework for the distribution of the frequencies (XC] form a Poisson quantitative analysis and interpretation of DNA se- random field by classical arguments (KARLIN and quence variation within and between species. It also MCGRECOR1966; SAWYER 1976;KARLIN and TAY- provides a theoretical basis for the use of a recently LOR 1981). In particular, the limiting distribution of proposed 2 x 2 contingency table test for DNA se- the numbers of frequencies (X$] in any given set, as quences that compares polymorphisms and fixed dif- well as the number of processes that have been fixed ferences with silent sites and amino acid replacements at 1 by any given time, are Poisson random variables (MCDONALDand KREITMAN199 1 a), as well as provid- that are independentfor nonoverlapping sets. ing estimates for the relevant parameters. Hypothesis Given a sample of sizem, the population frequencies tests for 2 X 2 contingency table data are well under- {X$) of mutant nucleotides at those sites that are stood and quite robust, andthis approach may be the polymorphic in the sample forma “randomly cen- most powerful and generally applicable test yet de- Polymorphism and Divergence and Polymorphism 1175 vised for detecting whether or not selection has been but it is in the direction of making the Graur and Li active in the recent evolutionary history of particular statistic less powerful, and the discrepancy might be proteins. larger in other cases. Similar problems arise for a test On the other hand, the sample nucleotide config- proposed by WHITTAMand NEI (199 I), which is based urations also contain a tremendous amount of infor- on a ratio of test statistics each of whose sampling mation that can be interpreted as in theapproach distributions is known only approximately. For ratio presented here. In particular, the jointsample config- statistics, there is no guarantee that approximate P- urations at silent sites can be used to estimate both values will even be conservative. In general,one the synonymous mutation rate p, and the species di- should be careful aboutusing a normal approximation vergence time tdiv. Both parameter estimates are scaled for P-values unless one is sure that the sampling dis- in terms of the effective population size Ne, and if tribution of the test statistic is at least approximately independent information is available (as it isin this normally distributed. case from Hawaiian Drosophila), then all three param- We would like to thank RICHARDHUDSON, TOM NAGYLAKI, 2% eters can be estimated. Given tdiv, the 2 X 2 contin- AYALA andtwo referees for many helpful comments. This work gency table provides estimates of p,/ps (the ratio of was supported by NationalScience FoundationGrant DMS- aggregatemutations rates for synonymous and re- 9108262and National Institutes of Healthresearch grant placement sites) and for y (the average selection coef- GM44889 (S.A.S.),and by National Institutes of Health research ficient among favored and mildly deleterious replace- grants GM30201, GM33741, and GM40322 (D.L.H.). ment mutations). Our analysis is based on the assump- tion that the nucleotide sites are independent (ie., in LITERATURE CITED linkage equilibrium). Whether this assumption holds to an acceptable approximation has to be determined BEGUN,D. J., and C. F. AQUADRO,1991 Molecularpopulation genetics of the distal portion of the X chromosome in Drosoph- by appropriate statistical tests on a case by case basis. ila: evidence for genetic hitchhikingof theyellow-achaeteregion. Other approaches to the analysis of DNA sequence Genetics 129 1147-1 158. variation within and between species were suggested BERRY,A. J., J. W. AJIOKAand M. KREITMAN, 1991 Lack of as alternatives to the 2 X 2 contingency table test by polymorphism on the Drosophila fourth chromosome resulting from selection. Genetics 129 1 11 1-1 117. (1 GRAUR andLI (199 1) and WHITTAMand NEI 99 1). CACCONE,A., G. D. AMATOand J. R. POWELL,1988 Rttes and These tests appear tobe less powerful statistically than patterns of scnDN.4 and mtDNA divergence within the Dro- the 2 X 2 contingency table test, and may be subject sophila melanogaster subgroup. Genetics 118: 671-683. to objections about their underlying genetic assump- DRAKE,J. W., 1991 Spontmeousmutation. Annu. Rev. Genet. tions (MCDONALDand KREITMAN 1991b).Further- 25: 125- 146. EFRON,B., 1987Better bootstrap confidence intervals. J. Am. more, the test statistics in the proposed alternatives Statist. Assoc. 82: 17 1-200. are assumed to have a sampling distribution that is ETHIER,S. N., and T. G. KURTZ, 1988 Markov Processes. Wiley XC normal, when in fact the sampling distributions are Sons, New York. unknown in most cases. As one example, GRAUR and EWENS,W. J., 1972 The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87-1 12. LI (1991) compare the observed number = 2) of (k EWENS,W. J., 1979 Mathematical . Springer- replacement polymorphisms withinall three species Verlag, New York. in MCDONALDand KREITMAN'S(1991a) data with the GRAUR,D., and W.-H. LI, 1991 Scientific correspondence. Nature value of a test statistic K = K1+ K:!+ K3(our notation), 354: 114-1 15. where each K, is the number of segregating sites in an GRIFFITHS, R.C., 1979 A transition densityexpansion fora rnulti- allele diffusion model. Adv. Appl. Probab. 11: 310-325. infinite sites model (WATTERSON1975). Separate pa- HARTL,D. L., 1989 Evolving theories of enzyme evolution. Ge- rameter estimates are used within each species, and netics 122: 1-6. the random variables Ki are independent. WATTERSON HARTL,D. L., and D. DYKHUIZEN,1981 Potential for selection (1975) gives the mean and variance of Ki in terms of among nearly neutral allozymes of 6-phosphogluconate dehy- its parameters. WATTERSONalso gives a moment gen- drogenase in Escherichhia coli. Proc. Natl. Acad. Sci. USA 78: 6344-6348. erating function for K, that can be used to infer its HARTL,D. L., and S. A. SAWYER, 1991 Inferenceof selection and exact distribution (as a sum of independent geomet- recombination from nucleotide sequence data.J. Evol. Biol. 4: rically distributedrandom variables with different 519-532. parameters), from which the exact one-sided P-value HILLIKER,A. J., S. H. CLARK andA. CHOVNICK, 1991The effect ofDNA sequence polymorphisms on intragenic recombination P(K 6 2) can be calculated. For one set of parameter in the rosy locus of Drosophila melanogaster. Genetics 129: 779- values used by GRAUR andLI (1991), K = 10.32 & 781. 4.04 (mean and standard deviation), so that a norma] HUDSON,R. R., M. KREITMAN and M. AGUAD~, 1987A test of approximation for K leads to a one-sided P-value, P = neutral based on nucleotide data.Genetics 0.020, for theobserved K = 2, while the exact P(K c 116: 153-159. KARLIN,S., andJ. MCGREGOR,1966 The number of mutantforrrls 2) = 0.0098 1 from the theoretical distribution. In this maintained in a population. Proc. Fifth Berk. Synlp. Math. Stat. example the differencein the P-values is only twofold, Prob. 4: 403-4 14. 1176 S.and A. Sawyer D. L. Hart1

KARLIN,S., and H. M. TAYLOR,1981 A Second Course in Stochastic 1991 Genetic mechanisms for adapting to a changing envi- Processes. Academic Press, New York. ronment. Annu. Rev. Genet. 25 629-659. KIMURA.M., 1968 Evolutionary rate at the molecular level. Na- ROWAN,R. G., and J. A. HUNT,1991 Rates of DNA change and ture 217: 624-626. phylogeny from the DNA sequences of the alcohol dehydro- KIMURA,M., 1983 TheNeutral Theory of MolecularEvolution. genase gene for five closely related species of Hawaiian Dro- sophila. Mol. Biol. Evol. 8: 49-70. C;umbridge University Press, Cambridge, England. SAWYER, A,, 1976 Branching diffusion processes in population KRLITMAN, M., 1983 Nucleotide polymorphism at the alcohol S. genetics. Adv. Appl. Prohab. 8: 659-689. dehydroganse locus of Drosophilamelanogaster. Nature 304: SAWYER,S. A,, 1989 Statistical tests for detecting gene conver- 4 12-41 7. sion. Mol. Biol. Evol. 6: 526-538. I.~:MEUNIER,F., J. R. DAVID,L. TSACASand M. ASHBURNER, SAWYER,S. A., E.D. DYKHUIZENand D. L. HARTL, 1986 The melanogaster species group, pp. 147-256 in The 1987 Confidence interval for the number of selectively neu- Genetics and Biology of Drosophila, edited by M. ASHBURNERand tral amino acid polymorphisms. Proc. Natl. Acad. Sci. USA 84: H. L. CARSON.Academic Press, New York. 6225-6228. I.KWONTIN, R. C., 1974 The GeneticBasis of Evolutionary Change. SMITH,J. M., C. G. DOWSONand B. G. SPRATT,1991 Localized Columbia University Press, New York. sex in bacteria. Nature 349: 29-3 1. I,EWONTIN,R. C., 1991 Electrophoresis in the development of TAVAR~,S., 1984 Line-of-descent and genealogical processes, and evolutionary genetics: milestone or millstone? Genetics 128: their applications in population genetics models. Theor. Popul. 657-662. Biol. 26: 1 19- 164. IXWONTIN, R. C., and J. L. HUBBY,1966 A molecular approach WATTERSON,G. A,, 1975 On the number of segregating sites in to the study of genic heterozygosity in natural populations. 11. genetic models without recombination. Theor. Popul. Biol. 7: Amount of variation and degree of heterozygosity in natural 256-276. Watterson, G.A,, 1985 Estimating species divergence times using populations of Drosophila pseudoobscura.Genetics 54: 595-609. multi-locus data, pp. 163-183 in Population Genetics and Molec- MCDONALI),J. H., AND M. KREITMAN,1991a Adaptive protein ular Enolution, edited by T. OHTAand K. AOKI.Springer- evolution at the Adh locus in Drosophila. Nature 351: 652- Verlag, Berlin. 6.54. WHITTAM,T. S., and M. NEI, 1991 Scientific correspondence. MCDONALII,J. H., and M. KREITMAN, 1991b Scientific corre- Nature 354: 115-1 16. spondence. Nature 354: 1 16. WRIGHT,S., 1938 The distribution of gene frequencies under MORAN,1’. A. P., 1959 The survival of amutant gene under irreversible mutation. Proc. Natl. Acad.Sci. USA 24: 253- selection. 1l.J. Aust. Math. Sac. 1: 485-491. 259. OHTA, T., 1973 Slightly deleterious mutant substitutions in evo- WRIGHI,S., 1949 Adaption and selection, pp. 365-389 in Ge- lution. Nature 246 96-98. netics,, and Evolution, edited by G. JEPSON,G. SIMPSONand E. MAYR.Princeton University Press, Princeton, I’AIIMADISASTRA, s., 1988 Estimating divergence times. Theor. I’opul. Bid. 34: 297-319. N.J. I’OWLRS, 1). A,, T. LAUERMAN, D. CRAWFORD andL. DIMICHELE, Communicating editor: R. R. HUDSON