Quantitative Biology QB DOI 10.1007/s40484-013-0003-5

REVIEW Mathematics, genetics and

Warren J. Ewens*

Department of Biology and Statistics, The University of Pennsylvania, Philadelphia, PA 19104, USA * Correspondence: [email protected]

Received October 8, 2012; Revised October 22, 2012; Accepted November 6, 2012

The importance of mathematics and statistics in genetics is well known. Perhaps less well known is the importance of these subjects in evolution. The main problem that Darwin saw in his theory of evolution by was solved by some simple mathematics. It is also not a coincidence that the re-writing of the Darwinian theory in Mendelian terms was carried largely by mathematical methods. In this article I discuss these historical matters and then consider more recent work showing how mathematical and statistical methods have been central to current genetical and evolutionary research.

INTRODUCTION to measure variation at the genetic level, and to assess the extent to which his theory continues to hold when A brief description of the Darwinian theory of evolution investigated at the level and in terms of the modern by natural selection is as follows. In his revolutionary molecular notion of a gene. book, generally called The Origin of Species, Darwin [1] The main problem with the theory, as it was presented claimed that biological evolution arises by natural by Darwin in 1859, was that at that time the hereditary selection, operating on the variation that exists between mechanism was unknown. However, the nature of this the individuals in any biological population. The argu- mechanism is crucial to a complete understanding of the ment has four main components. First, the more fit argument. Worse than this, insofar as any idea of a theory individuals in the population leave disproportionately of heredity was known in 1859, the most prevalent theory more offspring than the less fit individuals. Second, the was based on the idea that any characteristic of a child, for offspring in large measure inherit the fitness of their example his blood pressure, is a mixture or blending of parents. Third, and following from the first two points, the that characteristic in that child’s parents, plus or minus offspring generation is on average more fit than the some small deviation deriving from unknown random parental generation. Finally, as generation succeeds effects. It is easy to see that under this so-called generation, the population steadily becomes more and “blending” theory an effective uniformity of any more fit, and eventually the more fit types replace the less characteristic among the individuals in the population fit. The changes in the population are often taken as being will soon arise, so that after no more than about ten or quite gradual, and may well be thought of as taking place twenty generations there will be essentially no individual- over a time span of many thousands or even hundreds of to-individual variation in any characteristic available for thousands of years. This is a population level theory. natural selection to act on. This difficulty was raised soon There is no concept of the evolution of the individual. The after the Darwinian theory was put forward, and was individual himself does not evolve. It is the population recognized immediately as a major criticism of the theory, that evolves as generation succeeds generation. and (unfortunately) Darwin amended later editions of his Clearly the existence of variation in the population is book in the light of it. To his dying day Darwin did not crucial to the argument. Without this variation there are no know of the resolution of the “variation preserving” fitness differentials between individuals, and the selective difficulty. process cannot proceed. Darwin considered variation Clearly some modification of the argument is neces- from one person to another in physical, mental and other sary, since we do not observe, in present-day populations, characteristics, but when his argument is recast in the uniformity of characters that the blending theory genetical terms, as it will be below, it will be necessary predicts. However, any modification to the blending

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 1 QB Warren J. Ewens

theory would probably require the assumption that the what happens in succeeding generations under the characteristics of children do not closely resemble those assumptions of random mating, no , no selective of the parents, and would thus remove one of the main differences, and indeed no complicating elements of any underpinnings of the Darwinian theory. kind. It is easy to see that the genotype frequencies in The above discussion brings us to Mendel. Mendel’s generations 1, 2 and 3 are as given below: work [2] appeared in 1866, seven years after the appearance of The Origin of Species. It was in effect A1A1 A1A2 A2A2 unread, and its importance unappreciated, until it was frequency, generation 1 X11 2X12 X22 rediscovered in 1900. It led, however, to the solution of 2 – – 2 the maintenance of variation puzzle that Darwin could not frequency, generation 2 x 2xð1 xÞð1 xÞ solve. frequency, generation 3 x2 2xð1 – xÞð1 – xÞ2 Mendel made the first and basic steps in elucidating the (1) hereditary mechanism. As is well known, he considered seven characters in peas, each of which happened to have In this table x = X11 + X12, so that x is the frequency of the a simple genetic basis. For example, he found that seed A1 in generation 1. The entries in this table show color, either green or yellow, is determined by the genes at three important features. First, the genotype frequencies a single gene locus, at which arose the “green” and the attained in generation 2 are of a binomial form, with x, the “yellow” . We will call the green/yellow gene frequency of A1 in generation 1, being the parameter of concept the billiard ball paradigm — a gene is seen under this distribution. Second, elementary calculations show this paradigm as being either a green or a yellow billiard that the frequency of A1 in generation 2 is also x. Finally, ball, with no known internal structure. In the first half of the genotype frequencies achieved in generation 2 the 20th century the billiard ball paradigm of the gene was continue to hold in generation 3, and hence also hold in fi all that was available. The resolution of the variation all future generations. This nal observation shows that problem was made using this paradigm for the gene, and it there is no tendency for the variation in the population to fi still holds even today when the DNA structure of the gene be dissipated. These elementary calculations, rst made is known. To see how this was so we turn to some independently by Hardy [3] and Weinberg [4] in 1908, mathematical consequences of the billiard ball paradigm. just a few years after Mendelism was re-discovered, show that under a Mendelian hereditary system, Darwinism is MATHEMATICAL IMPLICATIONS OF MENDELISM: saved: the variation needed for the operation of the THE RANDOM-MATING CASE Darwinian theory is preserved. A little bit of mathematics has gone a long way! Introductory concepts The “Hardy-Weinberg”, or binomial, form of the genotype frequencies in the above table, from generation Our focus throughout is on diploid organisms, such as 2 onwards, arise (in a randomly mating population) even Man, in which any individual receives half his genetic when selective differences between genotypes arise, composition from his mother and half from his father. We provided that genotype frequencies are taken at the time consider only autosomal loci: the sex-linked case involves of conception of any generation. Thus Hardy-Weinberg minor differences from the analysis below. Consider some frequencies will be used in the analysis of the selective specific gene locus on some specific , which case below, it then being understood that all frequencies we call locus “A”. In general, genes of many different are taken at this stage of the life cycle. In Genetics courses types might occur at this locus. Suppose for the moment the emphasis is often placed mainly on the binomial form that only two types of genes can arise at this locus, which of the Hardy-Weinberg frequencies, which, conveniently, we call the A1 allele and the A2 allele. (The words “gene” depend on the single quantity x. The importance of the and “allele” often become confused in the literature. Here permanence of these frequencies (at least in cases where we adhere to the concept that a gene is a physical entity there is no selection) to the Darwinian paradigm is often while the word “allele” refers to a gene type. Despite this not even mentioned. Yet this permanence is the true we often use the expression “gene frequency” instead of implication of the Hardy-Weinberg scheme. A similar the more logical “allele frequency”, since the term “gene result hold for haploid populations: again, since the gene frequency” has become embedded in the literature.) There is the unit of transmission, genetic variation is preserved are three possible genotypes, A1A1, A1A2 and A2A2.As in these populations also. mentioned above, the Darwinian theory is a population- Of course genetic variation might eventually be lost level theory, and it is therefore necessary to consider the through the action of selection or random drift (discussed population frequencies of these genotypes, and how they below), but the time-scales for changes brought about by change with time. We start with arbitrary frequencies in these is much longer than that appropriate to the blending generation 1, as shown in the table below, and consider theory.

2 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution

It is worthwhile asking why variation is preserved in Δ – – – – 2 w ¼ 2xð1 xÞfw11x þ w12ð1 2xÞ w22ð1 xÞg the Mendelian system. The basic reason is a “quantal” 1 1 one: a parent of genotype A1A2 does not pass on, say, two- w x2 þ w þ w þ w xð1 – xÞ thirds of an A gene and one-third of an A gene, but an 11 12 2 11 2 22 1 2 entire A1 gene or an entire A2 gene. It is also possible to “ ” 2 – 2: argue that the quantal Mendelian mechanism is the only þw22ð1 – xÞ w (7) hereditary system that preserves variation, and thus allows the evolution of superior forms by natural Clearly Δw is non-negative, so we may conclude that selection. Thus it can be argued that if intelligent, that is natural selection acts so as to increase, or at worst evolved, creatures exist elsewhere in the universe, they maintain, the mean fitness of the population. This also will have a Mendelian hereditary mechanism. provides a quantification in genetic terms of the Darwinian concept that an “improvement” in the popula- Selection and mutation tion has been brought about by the action of natural selection. So far there has been no mention of selection or of If the wij are all close to unity we may write, to a mutation. Selection occurs when the various genotypes sufficiently close approximation, have different fitnesses. These differences can arise for Δ – – – – 2: various reasons, in particular through different viabilities, w 2xð1 xÞfw11x þ w12ð1 2xÞ w22ð1 xÞg (8) different fertilities and different mating success probabil- We return to this approximation later. ities for the various genotypes. The theory for the latter Equation (6) gives useful information about the central two forms of selection is complex, and here we focus on microevolutionary process, namely the replacement of differential viability fitnesses, that is on different one allele in a population by another. For example, it is capacities of the various genotypes to survive from the easy to see that if w > w > w , the frequency of A time of conception to the age of reproduction. 11 12 22 1 steadily increases and asymptotically approaches the Suppose then that the fitness of any individual depends value 1. The time T x , x required to increase the on his genotype at a single locus A, and (as above) that ð 1 2Þ frequency of A from some low value x to some higher two alleles, A and A , and thus three genotypes, are 1 1 1 2 value x can be found by approximating (6) by the possible at that locus. The fitnesses of individuals of these 2 differential equation three genotypes are defined in the following table: dx= – – – – genotype A A A A A A xð1 xÞfw11x þ w12ð1 2xÞ w22ð1 xÞg, (9) 1 1 1 2 2 2 dt w11 w12 w22 (2) to obtain 2 – – 2 frequency x 2xð1 xÞð1 xÞ x =! 2 dx : Tðx1,x2Þ – – – – From this table we are able to calculate several quantities, x1 xð1 xÞfw11x þ w12ð1 2xÞ w22ð1 xÞg in particular (i) the mean fitness w in the population, (ii) (10) the σ2 in fitness, (iii) the frequency x0 of A in the 1 Note that the denominator w in (6) has been replaced by 1 offspring generation and from this the change Δx=x# – x in this calculation, in accordance with the view that, since in the frequency of A between parental and offspring 1 most population sizes are held at a constant value by generations, and finally (iv) the change Δw in the mean extrinsic forces such as food supply, the mean fitness of a fitness of the population between parental and offspring population can be taken as 1. generations. These values are found by straightforward Equation (10) can be solved for Tðx , x Þ, but is best algebra and are given below: 1 2 left in the form given, since this is quite informative. For = 2 – – 2 example, the denominator of the integrand on the right- w w11x þ 2w12xð1 xÞþw22ð1 xÞ , (3) hand side in (10) shows that when the frequency x of A1 is 2= 2 2 2 – 2 – 2 – 2 close to 0 and close to 1, only very slow changes in this w11x þ 2w12xð1 xÞþw22ð1 xÞ w , (4) frequency will arise through selection. A second important fitness configuration is that for w x þ w ð1 – xÞ x0=x 11 12 , (5) which w > > w12 w11, w12 w22 , (11) w x þ w ð1 – 2xÞ – w ð1 – xÞ fi Δx=xð1 – xÞ 11 12 22 , (6) that is if there is heterozygote superiority in tness. Here w there is a stable equilibrium at which the frequency of A1

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 3 QB Warren J. Ewens

is x*,defined by an individual and depends on that individual’s genotype, w – w but an individual passes on a gene, not his entire x*= 12 22 : (12) genotype, to an offspring. This means that for a complete 2w – w – w 12 11 22 evolutionary investigation of the process it is necessary to * fi When the frequency of A1 is x the mean fitness does not isolate that part of the tnesses of the various genotypes change from one generation to the next. These two that can be attributed to “genes within genotypes”. Doing examples show that the Mendelian system can explain this was the great achievement of Fisher [5], who showed both evolution (through changes in gene frequencies) or how to overcome, as far as is possible, the fault-line the existence of standing genetic variation, with unchan- referred to above. To isolate the “genes within genotypes” ging gene frequencies. component of fitness we approximate the respective Mutation is the spontaneous change of a gene from one fitnesses in (3) by “additive gene (or genetic) fitness” allele to another. Suppose that the mutation rate from A1 to values indicated as follows: A2 is u and that the mutation rate from A2 to A1 is v.Itis fi tness of A1A1 fitted by w þ 2α1, easy to see that the daughter generation frequency of A1 is fi fi α α tness of A1A2 tted by w þ 1 þ 2, given by fi fi α tness of A2A2 tted by w þ 2 2. daughter generation frequency=xð1 – uÞþvð1 – xÞ, Here α1 and α2 can be thought of as additive fitness (13) contributions of the alleles A1 and A2 respectively, relative fi so that the change in frequency from one generation to the to the mean tness w. Of course these approximate values fi next is do not represent the true tnesses of these genotypes, Δx=v – ðu þ vÞx: (14) since dominance effects (at the A locus) and epistatic effects (between the A locus and all other loci in the This implies that there is an equilibrium where the genome) have been ignored. Nevertheless, substantial frequency of A is given by 1 progress is possible using these approximations, as we v frequency of A = , (15) now see. 1 u þ v With these approximate fitted values as actual fitnesses, fi and it is easy to see that this equilibrium is stable. we would compute the population mean tness by When both mutation and selection exist, the change in x2ðÞþw þ 2α 2xð1 – xÞðÞw þ α þ α the frequency of A is, to a first approximation, the sum of 1 1 2 1 – 2 α the changes given in (6) and (14). Because mutation rates þð1 xÞ ðÞw þ 2 2 , are generally thought of as being smaller than selective which reduces to w þ 2½α x þ α ð1 – xÞ. This would be differences, the first of these changes is considered the 1 2 equal to w if more significant one, except in the neighborhood of an α α – = : equilibrium point. There is again an equilibrium point of 1x þ 2ð1 xÞ 0 (16) the frequency of A when both selection and mutation 1 α α exist, and the nature of this point depends on the relative The values of 1 and 2 are found by a weighted least squares procedure that is, by minimizing the weighted values of the fitnesses wij. When w11 > w12 > w22, for example, this equilibrium frequency is very close to sum of squares – = – 2 – – α 2 1 u ðw11 w12Þ. The inferior allele A2 is maintained in x ðÞw11 w 2 1 the population only because of the mutation from A to 1 – – – α – α 2 þ 2xð1 xÞðÞw12 w 1 2 A2, and this point is called one of selection-mutation – 2 – – α 2 balance. When the inequalities (11) hold and mutation þð1 xÞ ðÞw22 w 2 2 (17) exists there will also be a stable equilibrium frequency of α α α A . Since mutation rates are very low, this will not differ with respect to 1 and 2. The minimizing values of 1 1 α appreciably from the frequency given in (12). and 2 are found to be α – – 1 ¼ w11x þ w12ð1 xÞ w, GENES AND GENOTYPES: THE EVOLUTIONARY (18) α ¼ w x þ w ð1 – xÞ – w: “FAULT-LINE” 2 12 22 These values do satisfy (16), so that with the “fitnesses” of α Despite the fact that the Mendelian hereditary system the three genotypes given respectively by w þ 2 1, w þ α α α α α fi provides the perfect framework for Darwinian evolution, 1 þ 2 and w þ 2 2 and with 1 and 2 de ned by (18), there remains one serious problem in joining Darwinism the true mean fitness is recovered. In other words it was with Mendelism. This arises because there is a fault-line not necessary to impose the condition (16): it is in the Mendelian/Darwinian process. A fitness belongs to automatically satisfied by the values of α1 and α2 defined

4 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution by (18). On the other hand equations typified by (16) theorem is interesting, since it shows that Darwinian ideas will be needed later when the whole genome is have to be updated because of these mathematical considered. calculations, which show that because of the Mendelian The residual sum of squares after these values of α1 and mechanism it is only the “additive genetic” part of the α2 are inserted in (17) is called the dominance variance, total variance in fitness that contributes to an increase in 2 fi denoted D, and is given by mean tness. Thus it is not accurate to say that evolution by natural selection will occur if there is positive variance 2 = 2 – 2 – 2: D x ð1 xÞ fw11 2w12 þ w22g (19) in fitness: a more accurate statement is that evolution will If the actual genotype fitnesses had been of the additive occur if and only if there is positive additive genetic – = – fi variance in fitness in the population. Mathematics has form in which w11 w12 w12 w22, a perfect t to the fitnesses will be obtained by using the three approximat- played a key role here in up-dating and indeed amending the Darwinian paradigm in the light of Mendelian ing fitnesses w þ 2α1 for A1A1, w þ α1 þ α2 for A1A2 and α 2 genetics. w þ 2 2 for A2A2. In this case the dominance variance D is zero. In all other cases the fit will not be perfect; that is, We give the correct version of the Fundamental one can usually explain some but usually not all of the Theorem of Natural Selection later. For now we note variation in the fitness values by fitting the α values. that it is quintessentially a diploid population result. At any given gene locus a parent passes on a single gene to Of far greater importance than 2 is that part of the D an offspring, not his entire genotype. Thus it is essential to variance in fitness that is explained by fitting the additive isolate that part of the total variance in fitness that is due to values α and α . This is Fisher’s[5]additive genetic 1 2 “genes within genotypes”. This is precisely what the variance and is denoted 2 . It is found as the difference A additive genetic variance does. Evolution by natural between the total variance (4) and the dominance selection will not occur if the additive part of the total variance. Straightforward algebra shows that variance in fitness is zero. This is emphasized in the 2 = – – – – 2: A 2xð1 xÞðw11x þ w12ð1 2xÞ w22ð1 xÞÞ (20) following section.

The additive genetic variance thus measures the amount Standing genetic variance of variation in fitness that can be explained by genes within genotypes. It might be expected from the foregoing The “natural selection” change in gene frequency given in that the additive genetic variance plays an important role (6) can be contrasted with the situation where the genetic in several evolutionary considerations. This is indeed the composition of a population does not change from one case, and we now consider this concept in more detail. generation to the next. The case where the fitness values are of the form (11) provides one example of this. When * THE ADDITIVE GENETIC VARIANCE the frequency of A1 is at the value x as given in (12) neither of the alleles A1 and A2 is more “fit” than the other. The importance of the additive genetic variance can 2 = At this equilibrium, A 0. Thus while there is a positive perhaps best be appreciated by considering some variance in fitness at this equilibrium point, none of this examples of its use. We do this in this section. It is variation is “additive”, or residing within genes. Thus important to remember that the additive genetic variance neither allele is superior to the other and no evolution is a diploid population concept: it has no equivalent for occurs, as is in any event clear since genotype frequencies haploid populations. do not change from one generation to the next when the * frequency of A1 is x . The Fundamental Theorem of Natural Selection The correlation between relatives The Fundamental Theorem of Natural Selection (Fisher [5]) is a very complicated theorem and its full and correct One of the key components of the Darwinian theory is the version will be given later. The (incorrect and over- similarity, or in statistical terms the correlation, between simplified) version of the theorem states that since in a parent and offspring with respect to some measurement, population of constant size we can take w to be 1, the and in particular to fitness. Consider some character change Δw in mean fitness between parental and daughter determined entirely by the genotype at the locus A, and generations (as given approximately in (8)) is approxi- for which all A1A1 individuals have measurement m11, all mately equal to the additive genetic variance (given in A1A2 individuals have measurement m12, and all A2A2 (20)). This is often stated as the Fundamental Theorem of individuals have measurement m22.Tofind the parent- Natural Selection. It is not, however, the correct statement offspring correlation it is necessary to consider the flow of of the theorem. Nevertheless, this misinterpretation of the genes between parent and offspring at the A locus.

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 5 QB Warren J. Ewens

Elementary Mendelian arguments show that the genetic grandparent). Defining combinations between parent and offspring that can arise, = with various probabilities, are as indicated in the table Pff probðxf yf Þ, = below: Pfm probðxf ymÞ, P =probðx y Þ, offspring mf m f = Pmm probðxm ymÞ, genotype A1A1 A1A2 A2A2 α=1 3 2 – ðPff þ Pfm þ Pmf þ PmmÞ, A1A1 x x ð1 xÞ 0 2 2 – – – 2 β= parent A1A2 x ð1 xÞ xð1 xÞ xð1 xÞ Pff Pmm þ Pfm þ Pmf , – 2 – 3 A2A2 0 xð1 xÞ ð1 xÞ the correlation between any pair of relatives is given by (21) 2 2 correlation=α A þ β D : (22) 2 2 (The values in this table assume that the father and the mother of any individual are unrelated, an assumption that This formula can be used to verify the correlations given we make for now.) If one considers all possible above. For example, for two sibs, Pff = Pmm = 1/2, Pfm = combinations of parental-offspring measurements and Pmf = 0, and insertion of these values into (22) yields the their probabilities, it is found that the correlation between sib-sib correlation given above. parent and offspring for this measurement is This correlation formula (22) is used frequently in the literature. However it applies only under very restrictive 1 2 correlationðparent-offspringÞ= A , circumstances, namely (i) that the character in question 2 2 depends only on the genes at one single gene locus, (ii) where 2 and 2 are as given in (4) and (20) respectively, that there is no contribution to the correlation deriving A from the environment, (iii) that the population mates at with wij replaced by mij. Similar calculations for other relative pairs show that, for example, random, (iv) that stochastic effects have been ignored, and (v) that members of any mating pair are unrelated. Under 1 2 1 2 non-random mating and in cases where the character in correlationðsib-sibÞ= A þ D , 2 2 4 2 question depends on the genes at many loci the formulae become much more complex, as is shown later. All the 1 2 same, the additive genetic variance plays a key roll in correlationðuncle-nephewÞ= A , these more complex formulas. 4 2 1 2 correlationðgrandparent-grandchildÞ= A : 4 2 The concept of the (narrow) heritability (h2) is central in plant and animal breeding. In the context described above For our present purposes the main interest in these it is defined as formulae is in the central role that the additive genetic 2 2 h2= A , variance A plays in them. In view of the comments made 2 earlier about the interpretation of the additive genetic 2 2 variance as describing that part of the variance deriving where A and relate to the additive and total variance in from genes within genotypes, and the transmission of the character in question. This can be regarded as a genes from parent to offspring, it is not surprising that measure of the extent to which the stock can be improved parent-offspring correlation is directly proportional to the by breeding. If for example h2 = 0 there is no additive additive genetic variance. genetic variance in the character and no improvement is A more elegant way of finding these formulae was possible. In other words the gene frequencies are already proposed by Malécot [6]. We consider two individuals X at their optimum values as indicated by (12). The larger and Y, and define xf (xm) as the gene that X received from the heritability the more the stock can be improved by his father (mother), with a similar definition for yf and ym. . We say for example that xf ym if the gene that X received from his father is “identical by descent” with the GENERALIZATION TO THE CASE OF MANY gene that Y received from his mother, where two genes are ALLELES identical by descent if they can be traced back to a common ancestor gene (for example in a parent or a The theory developed above can be generalized in many

6 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution ways. In this section we consider one of these, namely the being generalization to the case of an arbitrary number of X 2 = α2: possible alleles at the gene locus A of interest. A 2 xi i (25) Suppose that the number m of alleles that can arise at i the locus A exceeds two. There is now a (symmetric) Provided that the wij values are close to each other, the matrix W of fitness values given by increase in mean fitness is found to be close to the additive 2 3 genetic variance, thus extending the incorrect and over- w11 w12 w13 w1m fi 6 7 simpli ed form of the Fundamental Theorem of Natural 6 w21 w22 w23 w2m 7 Selection as described above to the case of an arbitrary 6 w31 w32 w33 w3m 7, 4 5 number of alleles at the locusX under consideration. MMMMM fi = If we de ne wi by wi jwijxj the change in the wm1 wm2 wm3 wmm frequency of Ai between successive generations can be where wij is the fitness of the genotype AiAj. The condition conveniently expressed as that there be a stable equilibrium with all alleles present at Δ = – = : positive frequencies is that W have exactly one positive xi xiðÞwi w w (26) eigenvalue and at least one negative eigenvalue (Kingman This is consistent with the expression given in (6). In [7]). This is a far less transparent condition than that general, many results found for the two allele case arising in (11) when m = 2, and could not be found other continue to apply in the multi-allelic case. For example, than by mathematical methods. the correlation between relatives formula (22) continues The simplest example of this result arises in the case to hold, with 2 being defined as in (25) and the total – ≠ A where wii =1 s, wij =1,(i j), where s is a (small) genetic variance being defined appropriately. For this positive constant. In this case the eigenvalues of W are reason, much of the theory focuses on simpler the two- m – s, – s, :::, – s, the above condition is satisfied, and the –1 allele case, with the assumption that results found from equilibrium (where each xi takes the value m ) is stable. the analysis will apply exactly, or at least approximately, ::: If the frequencies of the various alleles are x1, x2, , xm, to the multi-allele case. fi theX meanX tness w of the population is given by = w i jwijxixj. It can be shown (Kingman [7]) that THE STOCHASTIC THEORY this increases (or at worst remains unchanged) from one generation to another, in agreement with the correspond- All of the above theory is deterministic — there is no ing conclusion when m =2. component of randomness involved. But random phe- The additive genetic variance is found by minimizing nomena arise at several levels in evolutionary genetics, the expression from the small-scale intrinsic effect of the random X X transmission of one of two genes from a parent to an – – α – α 2 wij w i j offspring, to large-scale extrinsic random ecological and i j other events which affect the genetic make-up of an entire α α ::: α population. It is therefore necessary to consider the with respect to parameters 1, 2, , m. It is found that α α ::: α stochastic theory which takes these random events into the minimizing values of 1, 2, , m are given by X account, since among other things it is important to assess α = – = ::: : i wijxj w, ði 1, 2, , mÞ (23) the effects of the stochastic behavior on the Darwinian j theory. Before this theory is described, it has to be emphasized These are the direct generalizations of the values given in that the models used are at best rough approximate (18). They also satisfy the generalization of (16), namely descriptions of what happens in reality. The complexity of that the real world makes it impossible to formulate models X that have the predictive value of many models in physics. x α =0: (24) i i For example, we assume in this section that the i population of interest mates at random. We later consider Thus this equation does not have to be imposed more realistic models that take the complexities of nature extrinsically, as is often done. As with (16) it will have somewhat more into account. a far greater importance later, when we consider the whole A key component of the stochastic theory is the size of genome. the population under consideration. The classic stochastic 2 fi The additive genetic variance A,de ned as the sum of model describing the evolution of the population, the so- fi α α ::: α “ ” squares removed by tting the values of 1, 2, , m, can called Wright-Fisher model (named after its two be expressed in various ways, one of the most useful independent originators Fisher [5] and Wright [8]) is a

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 7 QB Warren J. Ewens

Markov chain model and indeed is one of the first approximation is that if the initial frequencies of the two examples of these models that was examined in depth. alleles are p and 1 – p, the mean time until one or other This model is one of simple binomial sampling. allele is lost from the population is, to a very close Assuming that there is no selection, mutation, geographi- approximation, cal dispersal, no complications due to there being two – – – genders, and so on, the model assumes that the genetic 4Nðplogp þð1 pÞlogð1 pÞÞ composition of an offspring generation is found by generations. This implies that although in this model random sampling with replacement from the genes of the genetic variation will eventually be lost, in a population of parental generation until the full offspring generation of any reasonable size the mean time for such loss is genes is obtained. generally very large. In practice this time might be so long We adopt the standard notation N for the number of that other considerations, such as the creation of new diploid individuals, so that the population total number of variation by mutation, have to be taken into account. In genes at the locus A of interest is 2N. Then if there are iA1 any event the main conclusion, that the “variation genes in the parental generation, the probability pij that preserving” property of the Mendelian system continues there are j such genes in the offspring generation is given to hold, is essentially maintained. by Sometimes interest focuses on conditional mean times. j – 2N – j Suppose for example that the condition is made that A1 is = 2N i 2N i : pij (27) eventually lost from the population. The conditional mean j 2N 2N time for this to occur is When selection and mutation occur a more general plog p – 4N expression, namely 1 – p 2N j 2N – j generations. A parallel calculation applies when the ðx0Þ ð1 – x0Þ (28) j condition is made that it is A2 that is eventually lost. It is also of interest to find the probability that a 0 arises, where x depends on the selective and mutation selectively favored allele becomes fixed. This probability parameters. In the case where mutation is absent and depends on the population size, the fitness values of the fi parental generation tnesses and genotype frequencies are various genotypes and the initial frequency of the favored 0 given in (2) the value of x is as given in (5), where on the allele. Again a diffusion approximation (now to the model right-hand side of (5) x is replaced by i/(2N). When there (28)) is needed. If for example w11 =1+ s, w12 =1+ s/2 is mutation but no selection, x0 is given by (13). A more and w22 = 1, the probability that A1 becomes fixed complicated model allows both selection and mutation. depends on s and on the number of genes 2N in the Many other models, for example birth and death population at the locus of interest. The table below gives models, have been investigated in detail in the literature, some typical calculations, assuming that the initial but here we consider only models of the binomial type frequency of A1 is 0.001. (27) and (28). The qualitative behavior of the model (27) is that loss N 104 105 106 of one or other allele will eventually occur, so that genetic : : : : variation will no longer exist at the locus in question. It is 0 00001 0 001 0 002 0 020 thus of relevance to the Darwinian theory to assess how s 0:0001 0:002 0:020 0:181 (29) long genetic variation may be expected to persist at this : : : : locus, assuming that the (very simple) model (27) holds. 0 001 0 020 0 181 0 865 Unfortunately, even for the simple model (27), which 0:01 0:181 0:865 1:000 does not allow for selection, it seems to be impossible to find a simple closed expression for the mean fixation time, Clearly the larger the population size, the more likely it is that is for the mean time until one or other allele is lost that the favored allele becomes fixed. This occurs from the population. This problem leads to the approx- essentially because in a large population random factors imation of this model, which is in discrete time and are of less importance than in a small population. Also, of discrete space, by a continuous-time continuous-space course, the probability of fixation of the favored allele diffusion process. Approximations of this type have been increases with its selective advantage. common in the literature since the Particular interest attaches to the case where there is ’ 1920 s, and they sometimes have curious mathematical one single initial A1 gene, corresponding to a situation properties not shared by diffusion processes arising in where a single initial A1 mutant arises in an otherwise physics. The result obtained by the diffusion process purely A2A2 population. Branching process theory was

8 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution employed in the 1920’s to calculate the survival formulae by one or other of the definitions of the effective probability of this new mutant, and it is found that population size. The formula for Ne depends on the fi = fi when the tnesses are of the form w11 1 þ s, complicating factor involved and the de nition of Ne = = = w12 1 þ s 2, w22 1, this probability is about s. This used. In the case of two sexes, for example, most is clearly independent of the population size. definitions lead to an effective population size of about When from A1 to A2 (at rate u) and from A2 to 4NfNm/(Nf + Nm), where Nf and Nm are, respectively, the A1 (at rate v) both occur, there exists a stationary number of females and the number of males in the distribution of the frequency of A1. When there is no population. This formula implies that, when Nm = 1, the selection, this distribution depends on u, v and the number effective population size is very close to 4, no matter how 2N of genes in the population at the A locus. The diffusion many females there are in the population. This happens theory approximation to this distribution is because half the genes transmitted in any generation derive from a single male, leading to rapid changes in Γð4Nu þ 4NvÞ – – f ðxÞ= x4Nv 1ð1 – xÞ4Nu 1, (30) gene frequencies. This implies among other things that Γð4NuÞΓð4NvÞ there is a rapid loss of genetic variation over succeeding where x is the frequency of A . The mean of this generations, a matter of serious concern to animal 1 breeders. distribution is v=ðu þ vÞ, as might be expected from (15), and the variance is NON-RANDOM-MATING POPULATIONS uv : 2 (31) ðu þ vÞ ð4Nu þ 4Nv þ 1Þ In this section we generalize some of the results described This variance has no analogue in the deterministic above to the case of non-random-mating populations. analysis, and thus the stochastic analysis provides new Such a generalization is necessary, given the interest in information about the likely values of the frequencies of the human population deriving from data from the Human the two alleles under two-way mutation. Genome Project (see http://www.ornl.gov/sci/techre- The form of the distribution is U-shaped for small sources/Human-Genome/project/about.shtml for more population sizes, indicating that for such populations the on this) and the fact that the human population does not most likely situation is that one or other allele is quite rare mate at random. Because of the complications involved, (or even temporarily missing from the population). For we only consider the deterministic theory. large population sizes the distribution is unimodal, and concentrates closely around the mean. Genotype frequency changes When selection exists the form of the stationary distribution is more complicated, being of the form Suppose that at the time of conception of the parental generation Γð4Nu þ 4NvÞ – – f ðxÞ= x4Nv 1ð1 – xÞ4Nu 1gðxÞ, (32) = = ::: Γð4NuÞΓð4NvÞ frequency of AiAi xii, ði 1, 2, , mÞ, frequency of A A =2X , ði≠jÞ: where g(x) depends on the nature of the selective values of i j ij = ::: the various genotypes. From this, theX allelic frequencies xi,(i 1, 2, , m) are = m Generalizations of many of the above results are known given by xi j=1Xij. (All summations in this analysis when an arbitrary number of alleles m can arise at the are over the range (1, 2, :::, m), so this range is not locus in question, in particular of the stationary distribu- explicitly mentioned below.) fi tion (32). Thus the above results serve only as an The mean population tness at theX X time of conception introduction to the entire theory that is now known. All of of the parental generation is w ¼ i jXijwij, where as fi fi the above ignores many complicating factors that exist in above theP Ptness of AiAj is wij, while the variance in tness 2= 2 – 2 real populations, for example the presence of two sexes, is i jXijwij w , The within-generation change in the likely geographical dispersion of the population, the frequency of the allele Ai is then changes (either cyclical or monotonic) in the population X size, and so on. These complications are often dealt with X w – w =w: (33) usefully through the concept of the “effective population ij ij j size Ne”. This is a quite complicated concept and there are several definitions of an effective population size (see for There exist few if any population genetic models in which example Ewens [9]). For our purposes here it is sufficient the frequency of an allele at the time of reproduction in a to say that significant properties of the behavior of a parental generation differs from the frequency of that population can be found by replacing N in the above allele in the daughter generation at conception. Thus

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 9 QB Warren J. Ewens

assuming that this equality holds, the between generation The Fundamental Theorem of Natural Selection change Δxi in allelic frequencies is also given by (33), that is, It was remarked above that the commonly-stated version X of the Fundamental Theorem of Natural Selection is not Δ = – = : xi Xij wij w w (34) the true version. The incorrect version of the theorem j states that for a random-mating population in which fitness depends on the genes at a single locus only, Δ 2 = The additive genetic variance w A w. Thus the incorrect version assumes a random-mating population and provides an approximate statement only. Also, we shall see later that this version of fi To nd the additive generic variance in the non-random- the theorem cannot be true when fitness depends on the fi fi mating case, it is rst necessary to nd the average effects genes at two or more loci since in this case mean fitness α α ::: α 1, 2, , m of the various alleles at the locus of interest. can decrease, even under random mating. Thus this These are found by the natural generalization of the change in mean fitness then cannot be approximately procedure in the random-mating case that is by minimiz- equal to any form of variance. Thus commonly-stated (but ing the expression incorrect) version of the theorem has a limited usefulness, X X 2 since in practice fitness depends on the genes at many X w – w – α – α ij ij i j loci. Further, many populations of interest, especially the i j human population, do not mate at random. We now give α α ::: α with respect to the parameters 1, 2, , m, subject to a the correct version of the theorem for the case when fi constraintX generalizing (16), namely (in the present tness depends on the genes at a single locus only, and α = notation) xi i 0. (Once again, this constraint is not later generalize the theorem to the case where fitness necessary. However we nevertheless make it because in depends on any collection of genes in the entire genome. the multi-locus case considered below a constraint Random mating is not assumed. X X α α ::: α fi generalizing (16) is needed.) The values of 1, 2, , m We rewrite the mean tness as w ¼ i jXij α α fi Δ so found are given implicitly as the solutions of the w þ i þ j , and then de ne the partial change Pw equations in the mean fitness by X X X x α þ X α =wΔx , ði=1, 2, :::, mÞ: (35) i i ij j i ΔPw= ΔXij w þ αi þ αj : (38) j i j If we define matrices D and X by Then D ::: X X ¼ diagfx1, x2, , xmg, ¼fXijg, Δ = α Δ =2 = : Pw 2 i xi A w (39) and vectors α and Δ by i This is the correct version of the Fundamental Theorem of α= α , α , :::, α 0, ð 1 2 mÞ Natural Selection when fitness depends on the genes at a Δ= Δ Δ ::: Δ 0 ð x1, x2, , xmÞ , single locus only. The statement of the theorem is an exact one, not an approximation, and random mating is not then the various equations in (35) can be written in matrix assumed. form as ðD þ XÞα=wΔ: (36) The correlation between relatives

The additive genetic variance can be written in terms of Quite apart from its importance in determining the the average effects as evolution of a population, non-random mating has an X important effect on the correlation between relatives. 2 =2w α Δx : (37) A i i There is a vast literature on this subject, and here only i a few comparatively straightforward results will be This differs in general from the “random mating” given. expression given in (25). However it reduces to (25) The fundamental new parameter of interest is ρ, the when random mating isP the case. ThisP occurs because correlation between mating individuals in the measure- α = α = under random mating, jXij j xi j jxj 0, so that ment of interest. For a random-mating population, ρ =0. Δ = α ≠ from (35), w xi xi i. UsingP this equality in (37), the When 0 the correlations given in (22) need to be 2 α2 fi fi expression for A becomes 2 ixi i , as in (25). modi ed. De ning

10 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution

2 and A2) possible at the A locus and two alleles (B1 and B2) h2= A , 2 possible at the B locus, there are four possible gametes, A1B1, A1B2, A2B1 and A2B2 defined by these alleles, and 2 we write the frequencies of these in a parental generation d2= D , as c , c , c and c respectively. 2 1 2 3 4 Because of the phenomenon of crossing over, gametes equation (22) implies that in the very simple case are not necessarily passed on faithfully from a parent to an considered previously, the correlation between relatives offspring. An individual with one chromosome of the 2 2 in a random-mating population is of the form a1h þ a2d gametic form A1B1 and the other chromosome of gametic for some constants a1 and a2. However, under non- form A2B2 will pass on a chromosome of gametic form random mating, we find, for example, that A1B2 or of gametic form A2B1 if a crossing over (more exactly, an odd number of crossings over) occurs between 1 correlationðparent-offspringÞ= ð1 þ Þh2, the A and the B loci. We denote the probability of such a 2 crossing over by R; in practice it is always the case that 0£R£1=2. Assuming no selection at either locus, the 1 1 fi correlationðsib-sibÞ = ð1 þ h2Þh2 þ d2, recurrence relations de ning offspring generation gametic 2 4 frequencies in terms of parental generation frequencies are correlationðuncle-nephewÞ í= – c1 c1 þ Rðc2c3 c1c4Þ, 1 2 1 = ð1 þ Þh2 h2 þ h2d2: í= – – 2 8 c2 c2 Rðc2c3 c1c4Þ, The second and third of these correlations are not of the í 2 2 c =c – Rðc c – c c Þ, random-mating form a1h + a2d . Further, a dominance 3 3 2 3 1 4 (that is d2) term now enters uncle-nephew correlations í= – (and many other correlations) whereas under random c4 c4 þ Rðc2c3 c1c4Þ, mating it does not. This implies that conclusions drawn where the dash notation refers to the offspring generation from simple correlations such as those following from frequencies. (22) can be quite erroneous. Despite this, and maybe One of the important features of a two-locus analysis is through ignorance, correlations as given by (22) and its that one can assess whether some conclusion found from a various generalizations often appear in the literature. one-locus analysis gives results that are consistent with those found for the two-locus analysis. For example, one TWO LOCI: THE RANDOM-MATING CASE of the first one-locus analysis results given above is that if there is no selection, gene frequencies remain unchanged We now turn to the case where the fitness and indeed any from one generation to the next. In the two-locus analysis other characteristic of any individual is assumed to í í the offspring generation frequency of A is c þ c , and depend on the genes that the individual has two loci, not 1 1 2 the above equations show that this is equal to c þ c , the just one. Although this generalization is only a small step 1 2 parental generation frequency of A . Thus the one-locus towards considering the case where the individual’s 1 result is confirmed in this more general analysis. fitness depends on his entire genome, it does allow some However, as shown below, not all one-locus conclusions significant advances to be made beyond the one-locus are confirmed by a two-locus analysis. analysis. Clearly gametic frequencies, unlike gene frequencies, change from one generation to another. To analyze the Evolutionary calculations evolutionary properties of the system governed by the gametic frequency recurrence relations it is useful to In this section we assume random mating of the introduce the “coefficient of ” D, population in question, and restrict attention to the defined by D=c c – c c . Then defining η by deterministic case. To analyze the joint evolution at two 1 4 2 3 i fi loci, A and B, it is necessary and suf cient (in the case of η = þ 1, ði=2, 3Þ, random mating) to use the frequencies of the various i η =– = possible gametes formed by the alleles at these two loci. i 1, ði 1, 4Þ, fi For our purposes it is suf cient to assume that the two loci the gametic recurrence relations can be written are on the same chromosome, and then to think of a í gamete as this chromosome. If there are two alleles (A1 ci =ci þ ηiRD, ði=1, 2, 3, 4Þ: (40)

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 11 QB Warren J. Ewens

The recurrence relation for D itself is easily shown to be the simple version of the Fundamental Theorem of D0=ð1 – RÞD, so that DðtÞ=ð1 – RÞtD. Clearly the value Natural Selection given above, that mean fitness increases of D decreases geometrically fast as time goes on. When from one generation to the next, cannot be true. D = 0, the frequency of any gamete is the product of the When this observation was first made it caused frequencies of its constituent alleles, and the two loci consternation among the population genetics community, evolve in effect independently. A state of “linkage since it seemed to imply that the (textbook version of the) equilibrium” has been reached. Fundamental Theorem of Natural Selection is not true for A different picture emerges when selective differences a two-locus system (and, a fortiori, for the entire genome). exist between the nine genotypes possible at the two loci. This quickly led to many further developments, for Suppose that the fitnesses are as given in (41), and that at example an investigation of cases of fitness arrays for the time of conception of the parental generation the which mean fitness did necessarily increase from one frequencies of the various genotypes are as given in (42). generation to the next. It was shown that one such array is the “additive over loci” fitness case, where the fitnesses B1B1 B1B2 B2B2 are of the additive form shown in (43). A1A1 w11 w12 w22 B B B B B B (41) 1 1 1 2 2 2 = A1A2 w13 w14 w23 w24 A1A1 1 þ l1 1 þ l2 1 þ l3 (43) A2A2 w33 w34 w44 A1A2 2 þ l1 2 þ l2 2 þ l3 A2A2 3 þ l1 3 þ l2 3 þ l3 B1B1 B1B2 B2B2 2 2 A further line of enquiry concerns the additive genetic A1A1 c1 2c1c2 c2 fi (42) variance in the case where tness depends on the genes at A1A2 c1c3 2ðc1c4 þ c2c3Þ 2c2c4 two loci. This is found by a least-squares procedure, using the fitnesses in (41) and the frequencies in (42). This is not A A c2 2c c c2 2 2 3 3 4 4 described in detail here since a much more complete and The mean fitness w of the population is then given by general discussion is given later. X X One can also calculate an additive genetic variance = estimate found from the marginal fitnesses at the various w cicjwij, i j loci. Thus the marginal fitness of A1A1 is and it is straightforward to show that the gametic w c2 þ 2w c c þ w c2 11 1 12 1 2 22 2 , frequency recurrence relations are 2 ðc1 þ c2Þ í= – 1 η ci w ðciwi þ iRw14DÞ, with a similar definition for other A-locus genotypes. The P where w = c w and η is as defined above. These marginal A-locus additive genetic variance is then found i j j ij i using the procedure that led to (19). A similar calculation recurrence relations have some quite interesting proper- í can be done for the B locus, and from this one can ties. For example, at equilibrium (c =c ), i i calculate the sum of the single locus marginal values. It is = – 1η = found that the condition that the true additive genetic w wi þ ci iRw14D, ði 1, 2, 3, 4Þ, variance be equal to the sum of the single locus marginal whereas mean fitness is maximized when values is that D =0. = = : w wi, ði 1, 2, 3, 4Þ The correlation between relatives Thus mean fitness is not necessarily maximized at an equilibrium point, and indeed is maximized at such a The correlation structure given in (22) and its various point only if, at this point, the coefficient of linkage generalizations for the case when the character of interest disequilibrium D is zero. This condition rarely holds in is determined by the genes at one locus, and random practice when selection exists, and an immediate mating of the population is assumed, requires consider- implication of this is that the course of natural selection able amendment when the character of interest is is such that mean fitness does not necessarily increase determined by the genes at two loci. We now discuss from one generation to the next. The reason for this is that some of the complications involved. a parent can pass on a gamete to an offspring that he does Suppose that the measurement in question depends on not possess himself, and to this extent the genetic make- the genes that an individual has at two loci, A and B. up of the child does not resemble that of the parent. Thus There are now nine genotypes and thus nine potentially

12 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution different measurements, as shown in the table below. It is clear that even in the comparatively simple case of the sib-sib correlation, linkage causes complications to the B1B1 B1B2 B2B2 correlation formulae. This becomes increasingly the case for more distant relationships. Thus complications to A1A1 m11 m12 m22 (44) correlation formulae are evident even under the simplify- = A1A2 m13 m14 m23 m24 ing assumptions of random mating and linkage equili- brium. In the more general case with linkage A2A2 m33 m34 m44 disequilibrium (D≠0), linked loci and with non-random A collection of nine measurements implies a total of eight mating, it is almost impossible to find expressions for the degrees of freedom between measurements. Thus the total various correlations, and theoretical work on this problem variance σ2 in the measurement can be split up into eight has essentially come to a halt. It is even more difficult to different meaningful components. Assuming for the use correlations to estimate parameters such as herit- moment that mating is random and linkage equilibrium ability, and one is forced to rely on various approximation (D = 0) between loci occurs, σ2 can be split up into eight formulae. Theory, it appears, is of limited help so far as components in the form correlation questions are concerned. 2=2 2 2 2 2 Að1Þþ Að2Þþ AA þ Dð1Þþ Dð2Þ MANY LOCI: RANDOM-MATING AND 2 2 2 : þ DD þ ðADÞ þ ðDAÞ NON-RANDOM-MATING POPULATIONS 2 In this decomposition A (1) is the additive genetic Evolutionary considerations variance at the A locus as calculated from marginal 2 measurement values, A (2) is the corresponding quantity In this section we consider what can be said about the for the B locus, dominance are defined completely general case where the fitness of any similarly, while all other variances relate to various individual depends in an arbitrary way on all the genes interactions. It is convenient to write in the entire genome, random mating is not assumed, and 2 =2 2 the recombination structure in the entire genome is A Að1Þþ Að2Þ, completely arbitrary. Unfortunately it is not possible to derive the intra-generational changes in whole genome 2 =2 2 D Dð1Þþ Dð2Þ, genotypes in the absence of knowledge of the mating scheme and the recombination structure between loci. On 2 =2 2 the other hand it is possible to find the intragenerational AD ðADÞ þ ðDAÞ, changes in gene frequencies, and from this to the inter- so that generational changes in gene frequencies, and then to derive the correct whole-genome Fundamental Theorem 2=2 2 2 2 2 : A þ D þ AA þ DD þ AD of Natural Selection. That is our aim in this section. When A and B loci are unlinked (R = 1/2), it is found Assume that the various (gigantically large number of) that the parent-offspring correlation in the measurement is possible whole-genome genotypes are listed in some agreed order as genotypes 1, 2, :::, s, :::, S. The “time of conception” frequency of the typical genotype s in the 1 2 12 2 A þ AA = , fi 2 4 parental generation is denoted by gs and the tness of this genotype by ws. Thus at this timeX the parental generation and that the sib-sib correlation is fi = population mean tness is w sgsws. From this it is possible to find the intragenerational changes in gene 1 2 1 2 1 2 1 2 1 2 2: A þ D þ AA þ AD þ DD = frequencies, and from this to derive the whole-genome 2 4 4 8 16 Fundamental Theorem of Natural Selection. The intra- When A and B loci are linked ðR≠1=2Þ the parent- generational change in the frequency of the genotype s, offspring remains as above, but the sib-sib correlation that is the change from the time of conception to the time Δ = = – becomes of reproduction, is gs gsws w gs. This implies that the corresponding intra-generational change Δpai in the 1 2 1 2 1 2 2 frequency p of the allele A at the typical gene locus A is þ þ ð3 – 4R þ 4R Þ ai i 2 A 4 D 8 AA given by X 1 1 Δ = Δ 2 2 2 2 2 2: 2 pai cai s gs, (45) – – þ ð1 2R þ 2R Þ AD þ ð1 2R þ 2R Þ DD = ð Þ 4 4 s

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 13 QB Warren J. Ewens

Δ fi fi where the sum is taken over all whole-genome genotypes change Pw in mean tness is de ned as the change in the and cai(s) = 1, 2 or 0 depending on whether Aai arises once, expression (23) derived solely from the changes Δgs in the twice or not at all within the genotype gs. This is also the various whole-genome genotype frequencies and ignor- inter-generational change in the frequency of this allele, ing the changes in the α values, namely where frequencies are taken at the time of conception of X nX o Δ α : the two generations. This is all that can be said about gs cai ai (48) inter-generation frequency changes. It is, however, s enough for our purposes. The resulting expression can be shown to be equal to 2 =w, where 2 now denotes the whole-genome additive Average effects A A genetic variance, defined in a manner extending that for To make further progress it is necessary to define whole- the one-locus case. This simple and exact result is the genome average effects. As in the one locus case, the modern interpretation of the whole-genome Fundamental average effects of the various alleles at the various loci in Theorem of Natural Selection. It is true whatever the the genome are defined by minimizing the sum of squares mating scheme and the recombination structure might be. It is inconceivable that this result could have been X noX 2 obtained by anything other than a mathematical treatment. g w – w – c α : (46) s s aiðsÞ ai Two comments are in order. First, fitness is considered s here (and throughout this article) as a parameter, that is, as α In the expression (46), ai is the average effect of Ai, the an intrinsic property of a genotype. It is in practice outer sum is taken over all whole-genome genotypes and unknown, as is any parameter in statistical theory. the inner sum is taken, for each whole-genome genotype, Second, it is not clear that the FTNS makes a satisfactory over all alleles at all loci in the genome contained within statement — the concept of a partial change in mean that genotype, with cai defined above. The values of the fitness appears to some to be arbitrary. αai defined in this way are not unique, but become unique The investigation of multi-locus properties in non- if a condition of the form (24) is imposed for the average randomly-mating populations outlined above has hardly effects of the alleles at each locus in the genome. begun, and yet it will be essential to an investigation of It is not necessary to give explicit formulae for the the evolution of non-randomly-mating populations, in various α values defined by this least-squares procedure: particular the human population, using whole-genome indeed they can only be expressed implicitly as the data. (unique) solution of a gigantic set of simultaneous equations. In parallel with the one-locus case analysis, MOLECULAR POPULATION GENETICS the mean fitness of the population is now thought of as being X noX Introduction α gs w þ cai ai , (47) g Multi-locus evolutionary theory arises by going outwards from the single gene locus to the whole genome. In α α where now ai denotes the minimizing values. This considering molecular genetics we go in the other expression (as with the expression in the corresponding direction, and consider the internal nature of the gene. one-locus case) is numerically identicalP to that given by The investigation of evolutionary population genetics at fi fi the standard de nition of mean tness sgsws as given the molecular level was initiated by Kimura [10], and it is above. without doubt the most important development of 2 The whole-genome additive genetic variance A is the population genetics theory in the last fifty years. We sum of squares removed by fitting the α values. Standard therefore devote substantial attention to it. least-squares theory shows that Classical population genetics considered processes X X going forward in time, motivated by the need of rewriting 2 = α Δ A 2w aið paiÞ, andvalidatingtheDarwiniantheoryintermsof a i Mendelian genetics. In this task it was completely the sum being over all loci (a) in the genome and all successful. Current research, by contrast, is largely alleles (i) at each locus. focused on using current genetic data and assessing the evolutionary processes that led to them. The ensuing The Fundamental Theorem of Natural Selection: the theory is thus retrospective rather than prospective, and it correct whole-genome statement involves many statistical inference procedures. In carry- ing these procedures out it is essential to use the actual Again in parallel with the one-locus case, the partial genetic material and not abstract quantities like “the allele

14 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution

A1” so prevalent in the classical theory. Thus the entire ary order: the number of such configurations is p(2N), the retrospective theory and all the inferential procedures number of partitions of 2N into positive integers. The must be carried out using the theory of molecular quantity p(2N) has been extensively studied in the population genetics. mathematical literature. In molecular population genetics we view the gene for It is clear that (49) implies certain transition probabil- what it is, namely a sequence of the four A, G, ities from one configuration to another. Although these C and T. For a gene with 1000 nucleotides there are 41000 probabilities are extremely complex and the Markov different possible sequences. This number is so chain of configurations has an extremely large number of large that little accuracy is lost in taking it to be infinite. states, nevertheless standard theory shows that there Each one of these sequences corresponds (in our previous exists a stationary distribution of configurations. We first terminology) to some allele, so in using the word “allele” consider one simple property of this stationary distribu- in the following section we mean some specific DNA tion, namely the probability that two genes drawn at sequence. We therefore assume an effective infinity of random are of the same allelic type. For this to occur different possible alleles. This viewpoint motivates the neither gene can be a mutant and, further, both must be theory in the following sections. descended from the same parent gene (probability – ð2NÞ 1) or different parent genes which were of the The infinitely many alleles model ðtÞ same allelic type. Writing F2 for the desired probability in generation t, we get The theory of molecular genetics is a stochastic one, and – – the most frequently used analysis assumes a diploid Fðtþ1Þ=ð1 – uÞ2ðð2NÞ 1 þf1 – ð2NÞ 1gFðtÞÞ: (50) population of size N and a Wright-Fisher model of 2 2 ðtþ1Þ= ðtÞ= evolution. (Several other models have, of course, been At equilibrium, F2 F2 F2 and thus considered in the literature.) In this model one assumes, fi = – – – 2 – 1 since there is an effective in nity of possible alleles as just F2 f1 2N þ 2Nð1 uÞ g , (51) described, that when a gene mutates, for example by a where = 4Nu. This is the simplest property of the change A ! T at some place in its nucleotide sequence, it fi changes to a gene of an entirely new allele, not so far seen stationary distribution of the con guration process. ðtþ1Þ in the population. In the simplified version of the theory Consider next the probability F3 that three genes considered here, any gene of any allelic type is assumed to drawn at random in generation t + 1 are of the same allele. mutate at the same rate, denoted u. Further, all allelic These three genes will all be descendants of the same gene – types are assumed to be selectively equivalent. (A more in generation t, (probability ð2NÞ 2), of two genes – advanced theory of course drops these assumptions.) (probability 3ð2N – 1Þðð2NÞ 2Þ) or of three different – These various assumptions imply that, if in generation t genes (probability ð2N – 1Þð2N – 2Þðð2NÞ 2Þ). Further, = ::: there are Xi genes of allelic type Ai (i 1, 2, 3, ), then the none of the genes can be a mutant, and it follows that probability that in generation t + 1 there will be Yi genes of allelic type A , together with Y new mutant genes, all ðtþ1Þ= – 3 – 2 – ðtÞ i 0 F3 ð1 uÞ ð2NÞ ð1 þ 3ð2N 1ÞF2 of different novel allelic types, is þð2N – 1Þð2N – 2ÞFðtÞÞ: (52) ð2NÞ! 3 ::: ::: = ∏πY i probfY0, Y1, Y2, jX1, X2, g ∏ i , (49) ðtÞ ðtþ1Þ Yi! By equating F3 and F3 and using the value calculated above for F we can find the stationary probability F that where π =u and π =X 1 – u = 2N , i=1, 2, 3, :::. 2 3 0 i ið Þ ð Þ three genes drawn at random are of the same allelic type. This model differs fundamentally from previous Clearly a continuation of this process is cumbersome and mutation models in that since each allele will sooner or does not lead to any revealing results. We consider some later be lost from the population, there can exist no approximate results below. nontrivial stationary distribution for the frequency of any The above arguments do, however, lead to simple allele. Nevertheless we are interested in stationary closed form for the partition process Markov chain behavior, and it is thus important to consider what eigenvalues. Equation (50) can be written in the form concepts of stationarity exist for this model. To do this we fi ::: t 1 – t consider delabeled con gurations of the form fa, b, c, g, Fð þ Þ – Fð1Þ=ð1 – uÞ2f1 – ð2NÞ 1gfFð Þ – Fð1Þg, (53) where such a configuration implies that there exist a genes 2 2 2 2 – of one allele, b genes of another allele, and so on. The and this implies that ð1 – uÞ2f1 – ð2NÞ 1g is an eigenva- specific alleles involved are not of interest. The possible lue of the Markov chain configuration process discussed configurations can be written down as f2Ng, f2N – 1, 1g, above. A similar argument using (52) shows that a second – – f2N – 2, 2g, f2N – 2, 1, 1g, …, f1, 1, 1, :::,1g in diction- eigenvalue is ð1 – uÞ3f1 – ð2NÞ 1gf1 – 2ð2NÞ 1g. Simi-

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 15 QB Warren J. Ewens

lar arguments can be used for all of the eigenvalues, and it Xn = : is found that EðKÞ – (60) j=1 þ j 1 – – l ¼ð1 – uÞif1 – ð2NÞ 1gf1 – 2ð2NÞ 1g i We return to these calculations below, when considering – f1 – ði – 1Þð2NÞ 1g (54) inference procedures in population genetics and also when considering the coalescent process. is an eigenvalue of the configuration process matrix and that its multiplicity is pðiÞ – pði – 1Þ, where p(i) is the The infinitely many sites model partition number given above. This provides a complete listing of all the eigenvalues. For details see Ewens and The infinitely many alleles model refers to a situation Kirby [11]. (obtaining in the 1970’s) where it was assumed that it is An important aspect of the theory concerning the possible to assess whether two different genes were of the infinitely many alleles model is that it is used for same or of different alleles. However that was all that was inferential procedures. This implies that it is necessary assumed, since the actual nucleotide sequence of any gene to derive properties of a sample of genes taken from the was not, at that time, known. A more refined model is the population. The elements of this sampling theory are now “infinitely many sites” model, (Kimura [13]), applicable outlined. We call the limiting process N ↕ ↓1, u ↕ ↓0 in cases where the nucleotide sequence is known. The with =4Nu held constant, the “asymptotic limit”. Then various assumptions made above for the infinitely many (51) shows that in this limit alleles model are retained in the infinitely many sites model, in particular that the gene mutation rate is u.Itis = 1 : F2 (55) assumed that there is no recombination within any gene, 1 þ so that recombination does not create new alleles. The Similarly, (52) shows that in this limit DNA sequence of the gene is assumed to be sufficiently long that any mutation occurs at a site at which no 2 = : mutation has previously occurred. (As a less extreme and F3 (56) ð1 þ Þð2 þ Þ equivalent assumption so far as the mathematical theory is It is possible to continue in this way to find the limiting concerned, we assume any mutation occurs at a site at (N ↕ ↓1) probability distribution of any allelic partition which only one nucleotide currently exists in the fi of a sample of n genes. This partition is best described by population.) Thus as with the in nitely many alleles ::: model, all mutations are to new alleles, not so far seen in the vector ( a1, a2, , an), where aj denotes the number of the population. Of course this assumption might not be a alleles in the sampleP that are represented by exactly j genes (so that ja =n). The probability distribution of reasonable one in view of the properties of protein j “fi ” this vector is given by folding, and this has led to a nitely many sites theory not discussed here. n aj Most of the relevant theory was developed by ::: = n! ∏ Pða1, a2, , an; Þ a , (57) Watterson [14]. Because exact expressions are very S ðÞ = a !j j n j 1 j cumbersome, all of the results given here are approximate ↕ ↓ (Ewens [12]). Here Snð Þ¼ ð þ 1Þð þ 2Þð þ formulas deriving from the asymptotic limit N 1.We – 1 22 n n j define a “segregating site” in a sample (population) as a n 1Þ¼ Sn þ Sn þþjjSn , so that Sn is the absolute value of a Stirling number of the first kind. site where two different nucleotides exist in the sample (population). Watterson found that the mean of the A particular case of this formula arises when an = 1, that is there is only one allele represented in the sample. The number S of segregating sites in a sample of n genes is probability of this is approximately = ðn – 1Þ! EðSÞ g1 , (61) : (58) X 1 2 n – 1 = n – 1 – 1 ð þ Þð þ Þð þ Þ where g1 = j . We return to this equation later X j 1 when considering inferential procedures. For n > 10 the The total number of alleles observed in the sample is aj. Denoting this sum by K, it is found from (57) that the variance of S is very close to probability distribution of K is = 2 VðSÞ g1 þ g2 , (62) kk X Sn = n – 1 – 2 PðK=k; Þ= , ðk=1, 2, :::, nÞ: (59) where g2 = j . Watterson also found that the S ðÞ j 1 n probability that there are no segregating sites in this From this it follows that the mean of K is sample is

16 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution

ðn – 1Þ! θ is embodied in the observed value k of K. Thus any PðS=0Þ= : (63) θ ð1 þ Þð2 þ Þðn – 1 þ Þ statistical inference concerning using the sample information must be carried out entirely by using the This is identical to the probability that K = 1 in the observed value k of K. This implies that θ should be infinitely many alleles model as given in (58) above. This estimated by using only the observed value k. In doing provides a confirming link between the infinitely many this we regard Eq. (59) as providing the likelihood of θ, alleles model and the infinitely many sites model, since if given the observed value k of K. It is found that the there are no segregating sites there is only one DNA maximum likelihood estimate ^ is the implicit solution of sequence, or allele, in the sample of n sequences. the equation of Unfortunately it is extremely difficult to find other links between the two models, and it is also very difficult to find Xn ^ = : the joint probability distribution of K and S. k (66) = ^ þ j – 1 An important case arises when n = 2. Watterson found j 1 that the probability that there s segregating sites in a Given the observed value of k, this equation has to be sample of this size is ^ solved numerically for . 1 s We now turn to tests of selective neutrality. These are : “ ” þ 1 þ 1 based on Eq. (65), which is a neutral theory distribution and which does not contain any unknown parameters. It is This distribution has mean θ and variance þ 2. More therefore possible (at least in principle — in practice the generally it was shown by Tavaré [15] that in a sample of computations involved are quite complicated) to consider n sequences the probability distribution of S is any reasonable test statistic, evaluate its probability distribution from (65), and thus assess whether the Xn – 1 sþ1 n – 1 – n – 2 observed value of this test statistic is a reasonable one, probðS=sÞ= ð – 1Þj 1 : – given this distribution. The most popular form of this test j=1 j 1 j þ was devised by Watterson [16]: we do not provide any (64) details here.

θ Inferential procedures: estimating and testing for The infinitely many sites model: inference methods selective neutrality Suppose that we have data consisting of n DNA In this section we consider two inferential operations: sequences and that in these data there are s segregating θ estimating the parameter referred to above, and testing sites. Equation (61) leads to two possible estimates of θ. the hypothesis that the frequencies observed in our data The first (Ewens [17], Watterson [14]) follows directly are the result of purely random and are not, from (61): given the observed value s of S, the estimate of for example, the result of natural selection. We consider ^ of θ is these procedures for the two models outlined above, the S fi fi in nitely many alleles model and the in nitely many sites ^ =s=g : (67) model. In both models we use the simple approximate S 1 results such as (55), (57) deriving from the asymptotic This is clearly an unbiased estimate of θ. Assuming no limit theory, and Eq. (64) (given below) for the infinitely recombination between the sites in the gene of interest, many sites model. the variance of this estimate is 2= 2: g1 þ g2 g1 (68) The infinitely many alleles model: inference methods A second estimate follows from the fact that in the casen We start with Eqs. (57) and (59), both of which assume n = 2 the mean of S is θ. This implies that if all selective neutrality among the alleles observed. The 2 ::: conditional distribution of (a1, a2, , an) given that K = k combinations of pairs of sequences in the data are taken, is found from (57) and (59) to be and the average T of the number of sites at which two sequences differ is taken, then the mean of T is given ::: = n! : Pða1, a2, , an kÞ (65) by k∏n aj j Sn j=1aj!j Mean of T=: (69) The form of this expression shows that K is a sufficient statistic for θ: all the information in the sample relevant to Thus T could be used as an estimator of θ, and because of

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 17 QB Warren J. Ewens

^ 1 ab a this we write it as T , and we re-write (69) as β=ð þ Þ : – ^ =: b a Mean of T (70) This approximate null hypothesis distribution is then used θ Although this is an unbiased estimator of , it suffers to assess whether any observed value of D is significant. because of the fact that its variance, which is There are various approximations involved in the above 2 procedure. Various authors have assessed the importance n þ 1 2ðn þ n þ 3Þ2 – þ – , (71) of these and have attempted to devise improved tests. We 3ðn 1Þ 9nðn 1Þ do not provide the details here. does not approach 0 as the sample size n increases. However our interest here in this estimator is that it forms THE COALESCENT part of the neutrality hypothesis testing procedure and not as a possible estimator of θ. Coalescent theory The fact that there are two unbiased estimators of θ under the hypothesis of selective neutrality forms the The concept of the coalescent is now discussed at length basis of a test first put forward by Tajima [18]. The Tajima in many textbooks, and entire books (for example Hein, procedure is carried out in terms of the statistic D,defined Schierup and Wiuf [19] and Wakeley [20]) and book by chapters (for example Marjoram and Joyce [21] and Nordborg [22]) have been written about it. ^ – ^ D= TpffiffiffiffiS , (72) The aim of the coalescent is to describe the common V^ ancestry at various times in the past of the sample of n genes at some given gene locus through the concept of an ^ ^ – ^ where V is an unbiased estimate of the variance of T S equivalence class. (Essentially the same arguments apply and which we do not give here. The logic in choosing the to describe the common ancestry of all the 2N genes at a test statistic is that if selective neutrality is the case (which given locus in the population (of size N) at various times ^ ^ τ in statistical terms is the null hypothesis) T and S both in the past.) To do this we introduce the notation , have the same mean (θ), so that D is similar to a indicating a time τ in the past (so that if τ1 > τ2, time τ1 is standardized normal statistic. further in the past than time τ2). The sample of n genes is The next problem is to find the null hypothesis assumed taken at time τ =0. distribution of D. Under the null hypothesis of neutrality, Two genes in the sample of n are in the same D does not have a mean of zero, nor does it have a equivalence class at time τ if they have a common variance of 1, since the denominator of D involves a ancestor at this time. Equivalence classes are denoted by variance estimate rather than a known variance. Further, parentheses: thus if n = 8 and at time τ genes 1 and 2 have the distribution of D depends on the value of θ, which is in one common ancestor, genes 4 and 5 a second, and genes practice unknown. Thus there is no null hypothesis 6 and 7 a third, and none of the three common ancestors distribution of D invariant over all θ values. are identical and none is identical to the ancestor of gene 3 The Tajima procedure approximates the null hypothesis or of gene 8 at time τ, the equivalence classes at time τ are distribution of D in the following way. First, it is possible fð1, 2Þ, ð3Þ, ð4, 5Þ, ð6, 7Þ, ð8Þg: (74) to find both the smallest value that D can take and also the largest value that D can take. These are functions of n and We call any such set of equivalence classes an we denote them by a and b respectively. Second, it is equivalence relation, and denote any such equivalence assumed, as an approximation, that the mean of D is 0 and relation by a Greek letter. As two particular cases, the variance of D is 1. Finally, it is also assumed that the at time τ = 0 the equivalence relation is = density function of D is the generalized beta distribution f1 fð1Þ, ð2Þ, ð3Þ, ð4Þ, ð5Þ, ð6Þ, ð7Þ, ð8Þg, and at the time over the range (a, b), defined by of the most recent common ancestor of all eight genes, the equivalence relation is f = 1, 2, 3, 4, 5, 6, 7, 8 . Γ α β – α – 1 – β – 1 n fð Þg = ð þ Þðb DÞ ðD aÞ The Kingman coalescent process (Kingman [23,24]) is a f ðDÞ α β – , (73) ΓðαÞΓðβÞðb – aÞ þ 1 description of the details of the ancestry of the n genes moving from f to f . For example, given the with the parameters α and β chosen so that the mean of D 1 n equivalence relation in (74), one possibility for the is indeed 0 and the variance of D is indeed 1. This leads to equivalence relation following a coalescence is the choice fð1, 2Þ, ð3Þ, ð4, 5Þ, ð6, 7, 8Þg, so that equivalence classes ð1 þ abÞb α=– , ð6, 7Þ and ð8Þ have just amalgamated. Such an amalga- b – a mation is called a coalescence, and the process of successive amalgamations is called the coalescence

18 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution process. infinitely many alleles assumption). If at time τ in the Coalescences are assumed to take place according to a coalescent there are j equivalence classes, the probability Poisson process, but with a rate depending on the number that either a mutation or a coalescent event had occurred of equivalence classes present. Suppose that there are j in ðτ þ δτ, τÞ is equivalence classes at time τ. The probability that the jðj – 1Þ 1 process moves from one nominated equivalence class (at j δτ þ δτ= jðj þ – 1Þδτ: (76) time τ) to some nominated equivalence class which can be 2 2 2 derived from it is δτ. (Here and throughout we ignore We call such an occurrence a defining event, and given terms of order ðδτÞ2.) Since there are jðj – 1Þ=2 possible that a defining event did occur, the probability that it was a choices for pairs of equivalence classes, a coalescence mutation is =ðj þ – 1Þ and that it was a coalescence is takes place in this time interval with probability ðj – 1Þ=ðj þ – 1Þ. 1 The probability that k different alleles are seen in the jðj – 1Þδτ, since all of the jðj – 1Þ=2 amalgamations 2 sample is then the probability that k of these defining possible at time τ are equally likely to occur. This implies events were mutations. The above reasoning shows that that no coalescence takes places between time τ and time k= this probability must be proportional to Snð Þ, where 1 fi τ þ δτ with probability 1 – jðj – 1Þδτ. Snð Þ is de ned below Eq. (57), the constant of 2 proportionality being independent of θ. This argument “ In order for this process to describe the random leads directly to the expression (59) for the probability ” sampling evolutionary model described above, it is distribution of the number of alleles in the sample. necessary to scale time so that unit time corresponds to 2N Using these results and combinatorial arguments generations. With this scaling, the time Tj between the counting all possible coalescent paths from a partition formation of an equivalence relation with j equivalence ða , a , :::, a Þ back to the original common ancestor, – 1 2 n classes to one with j 1 equivalence classes has an Kingman (Kingman [23,24]) was able to derive the more = – with mean 2 ðjðj 1ÞÞ. detailed sample partition probability distribution (57), and = The (random) time TMRCAS Tn þ Tn – 1 þ Tn – 2 þ deriving this distribution from coalescent arguments is fi þ T2 until all genes in the sample rst had just one perhaps the most pleasing way of arriving at it. common ancestor has mean Xn “Age” and “time” results = 1 = – 1 : EðTMRCASÞ 2 – 21 (75) j=2 jðj 1Þ n Many further results are available from coalescent theory (or can be derived by special arguments). The ones listed fi “ ” “ (The suf x MRCAS stands for most recent common in this section are “age” results and “time” results (such as ancestor of the sample.) This is, of course close to 2 (75)). They all relate to the Wright-Fisher infinitely many coalescent time units, or 4N generations, when n is large. alleles model introduced above. Some are sample results Kingman (Kingman [23,24]) showed that for large and some are population results. In view of the nature of populations, many population models (including the the coalescent in which we look backward into time, some Wright-Fisher model considered in this paper) are well of these results concern properties of the oldest allele in a approximated in their sampling attributes by the coales- sample (or in the population). cent process. The larger the population the more accurate Kelly [25] showed that the probability that the oldest is this approximation. Calculations such as this are allele in the sample is represented j times in the sample is “ ” “ relevant to the time back to Eve and to the Out of Africa” migration. – 1 n n þ – 1 = ::: : These calculations provide a population equivalent , ðj 1, 2, , nÞ (77) n j j of the sample expression (61). The mean time for which these are j – 1 equivalence classes is 2=ðjðj – 1ÞÞ, The case j = n is of particular interest. If an allele is as shown above. The mean number of mutations represented n times in a sample, it must be the oldest allele – in the sample. Thus the expression (77) for the case j = n Xalong the j 1 arms of the coalescentX tree is then 2N – 1 = – – = 2N – 1= should reduce to the expression (58). It is easy to see that j=1 2 ðjðj 1ÞÞ 2Nuðj 1Þ j=1 j. This is the population form of the sample expression (61). this happens, so that (77) may be regarded as a We have just introduced mutation into the coalescent, generalization of (58). These calculations also show that and we now consider the effects of mutation in more the probability that a gene seen j times in the sample is of detail. Suppose that the probability that any particular the oldest allele in the sample is j/n. ancestral gene mutates in the time interval ðτ þ δτ, τÞ is Perhaps the most important sample distribution concerns the frequencies of the alleles in the sample when δτ. All mutants are assumed to be of new alleles (the ordered by age. This distribution was found by Donnelly 2 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 19 QB Warren J. Ewens

and Tavaré [26], who showed that the probability that the the entire probability distribution of its age, is indepen- number of alleles in the sample takes the value k, and that dent of its current frequency and indeed of the frequency the age-ordered numbers of these alleles in the sample are, of all alleles in the population. ::: in age order, nð1Þ, nð2Þ, , nðkÞ,is If an allele is observed in the population with frequency p, its mean age is kðn – 1Þ! , X2N S n n n – n n – n 4N j nð Þ ðkÞð ðkÞ þ ðk 1ÞÞð ðkÞ þ ðk 1Þ þþ ð2ÞÞ ð1 – ð1 – pÞ Þgenerations: (82) j j – 1 (78) j=1 ð þ Þ fi where Sjð Þ is de ned below (57). This formula can be This is a generalization of the expression in (81), since if p found in several ways, one being as the size-biased = 1 only one allele exists in the population, and it must version of (57). then be the oldest allele. These are many other interesting results concerning the A further calculation concerns the mean age of the oldest allele in a sample, and further results connecting oldest allele in a sample of n genes. This mean age is the oldest allele in the sample to the oldest allele in the Xn 1 population, but the above examples give the general 4N generations: (83) fl – avor of these. j=1 jðj þ 1Þ We now turn to population results and their relation with sample data. Kelly [25] has shown that the Except for small values of n, this is close to the mean age probability that the oldest allele in the population is of the oldest allele in the population, given in (81). In represented by j genes in the population is other words, unless n is small, it is likely that the oldest allele in the population is represented in the sample. In – – 1 fact we have seen above that the probability that the oldest 2N 2N þ 1 : (79) 2N j j allele in the population is represented in the sample is n=ðn þ Þ. This is the population analogue of (77). There is also a We conclude by discussing two very important sample result connected with this, namely that in the “population” probability distributions. For both distribu- limiting case N ↕ ↓1, the probability that the oldest allele tions we consider the limiting N ↕ ↓1 case and thus in the population is observed in a sample of size n is consider frequencies of than numbers of genes of some n=ðn þ Þ. allele, and density functions rather than discrete prob- Donnelly [27] showed, more generally, that the ability distributions. All results considered relate to the probability that the oldest allele in the population is infinitely many alleles model. observed j times in the sample is The first of these distributions is Kingman’s [29] Poisson-Dirichlet distribution. This is the joint density – 1 n n – 1 þ , ðj=0, 1, 2, :::, nÞ: (80) function of the allele frequencies in a population when n þ j j ordered by size (that is, the order statistics of the allele frequencies). Unfortunately this distribution is not “user- For the case j = 0 the probability (80) is = n , ð þ Þ friendly” and few explicit results are known. Perhaps the confirming the complementary probability n= n ð þ Þ most important is that of Watterson [30], which gives the found above. Conditional on the event that the oldest fi joint density function of the rst r order statistics xð1Þ, allele in the population does appear in the sample, the ::: probability that it arises j times in the sample is given by xð2Þ, , xðrÞ in the Poisson-Dirichlet distribution. This the expression (77). joint density function is “ ” We now consider age questions. It is found that the f ðx , x , :::, x Þ mean time, into the past, that the oldest allele in the ð1Þ ð2Þ ðrÞ rΓ g – 1 – 1 population entered the population (by a mutation event) ¼ ð Þe gðyÞfxð1Þxð2Þ xðrÞg xðrÞ , (84) is = – – – – = ’ where y ð1 xð1Þ xð2Þ xðrÞÞ xðrÞ, g is Euler s X2N 4N constant 0.57721…, and g(y) is best defined through the mean age of oldest allele= generations: Laplace transform equation (Watterson and Guess [28]) jðj þ – 1Þ j=1 1 1 – – – (81) ! e tygðyÞdy=exp ! u 1ðe tu – 1Þdu : (85) 0 0 It was shown by Watterson and Guess [28] and Kelly [25] that not only the mean age of the oldest allele, but indeed The expression (84) simplifies to

20 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution

::: f ðxð1Þ, , xðrÞÞ numerical values, were given in Watterson and Guess [28]. This leads to the numerical values in the first row of r – 1 – – – – 1 ¼ fxð1Þ xðrÞg ð1 xð1Þ xðrÞÞ , (86) Table 1 below. The fact that the Poisson-Dirichlet distribution is not when x þ x þþx – þ 2x ³1, and in parti- ð1Þ ð2Þ ðr 1Þ ðrÞ user-friendly makes it all the more interesting that a size- cular, biased distribution closely related to it, namely the GEM = – 1 – – 1 distribution, named for Griffiths [32], Engen [33] and f ðxð1ÞÞ ðxð1ÞÞ ð1 xð1ÞÞ , (87) McCloskey [34], who established its salient properties, is 1 when £x £1. both simple and elegant. This distribution gives the joint 2 ð1Þ density function of the frequencies of the alleles in the Equation (87) provides two interesting results. First, population when ordered by age (and not, as with the population geneticists are interested in the probability of Poisson-Dirichlet distribution, when ordered by frequen- “population monomorphism”,defined in practice as the cies). probability that the most frequent allele arises in the Suppose that a gene is taken at random from the population with frequency in excess of 0.99. Equation population. The probability that this gene will be of an (87) implies that this probability is close to 1 – ð0:01Þ . allele whose frequency in the population is x is just x. This Second, if there is an allele with frequency between 1/2 allele was thus sampled by this choice in a size-biased and 1 it must be the most frequent, and thus the mean way. It can be shown from properties of the Poisson- frequency of the most frequent allele must exceed Dirichlet distribution that the probability density of the frequency of the allele determined by this randomly 1 – 1 chosen gene is ! x x 1 – x 1= : = ð1Þ ð1Þð ð1ÞÞ 1 2 2 – f ðxÞ=ð1 – xÞ 1,0

Table 1. Mean frequency of (a) the most frequent allele, (b) the oldest allele, in a population as a function of θ. The probability that the most frequent allele is the oldest allele is also its mean frequency. θ 0.1 0.2 0.5 1.0 2.0 5.0 10.0 20.0 Most frequent 0.936 0.882 0.758 0.624 0.476 0.297 0.195 0.122 Oldest 0.909 0.833 0.667 0.500 0.333 0.167 0.091 0.048

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 21 QB Warren J. Ewens

the table are close, and this corresponds to the fact that for 5. Fisher, R. A. (1930) The Genetical Theory of Natural Selection. small θ the most frequent allele in the population is very Oxford: Clarendon Press. likely to be the oldest allele in the population. 6. Malécot, G. (1948) Les Mathématiques de l’Hérédité. Paris: Masson. Given the focus on retrospective questions arising in 7. Kingman, J. F. C. (1961) A mathematical problem in population molecular population genetics, it is natural to ask further genetics. Proc. Camb. Philol. Soc., 57, 574–582. questions about the oldest allele in the population. The 8. Wright, S. (1931) Evolution in Mendelian populations. Genetics, 16, – mean frequency ð1=ð1 þ ÞÞ of the oldest allele in the 97 159. population given above implies that when θ is very small, 9. Ewens, W. J. (2004) Mathematical Population Genetics. New York: this mean frequency is approximately 1 – .Itis Springer. 10. Kimura, M. (1971) Theoretical foundation of population genetics at the interesting to compare this with the mean frequency of molecular level. Theor. Popul. Biol., 2, 174–208. the most frequent allele when θ is small, found in effect 11. Ewens, W. J. and Kirby, K. (1975) The eigenvalues of the neutral alleles from the Poisson-Dirichlet distribution to be approxi- process. Theor. Popul. Biol., 7, 212–220. – mately 1 log2. More generally, the mean population 12. Ewens, W. J. (1972) The sampling theory of selectively neutral alleles. frequency of the jth oldest allele in the population is Theor. Popul. Biol., 3, 87–112. – 13. Kimura, M. (1969) The number of heterozygous nucleotide sites 1 j 1 : maintained in a finite population due to steady flux of mutations. 1 þ 1 þ Genetics, 61, 893–903. 14. Watterson, G. A. (1975) On the number of segregating sites in genetical The probability that a gene drawn at random from the models without recombination. Theor. Popul. Biol., 7, 256–276. population is of the type of the oldest allele is the mean = 15. Tavaré, S. (1984) Lines of descent and genealogical processes, and their frequency of the oldest allele, namely 1 ð1 þ Þ, as just applications in population genetic models. Theoret. Pop. Biol., 26, 119– shown, since “size-biased” sampling is equivalent to 164. “sampling by ages”. More generally the probability that n 16. Watterson, G. A. (1977) Heterosis or neutrality? Genetics, 85, 789–814. genes drawn at random from the population are all of the 17. Ewens, W. J. (1974) A note on the sampling theory for infinite alleles type of the oldest allele in the population is and infinite sites models. Theor. Popul. Biol., 6, 143–148. 18. Tajima, F. (1989) Statistical method for testing the neutral mutation 1 – n! ! xnð1 – xÞ 1dx= : (91) hypothesis by DNA polymorphism. Genetics, 123, 585–595. 0 ð1 þ Þð2 þ Þðn þ Þ 19. Hein, J., Schierup, M. H. and Wiuf, C. (2005) Gene Genealogies, Variation and Evolution. Oxford: Oxford University Press. From this result and the expression (58) we confirm that if 20. Wakeley, J. (2009) Coalescent Theory. Greenwood Village, Colorado: all genes in a sample of n are of the same allele, the Roberts and Company. probability that this is the oldest allele in the population is = 21. Marjoram, P. and Joyce, P. (2009) Practical implications of coalescent n ðn þ Þ. theory. In Lenwood, L. S. and Ramakrishnan, N. (eds.), Problem The elegance of many of these formulas make Solving Handbook in and Bioinformatics. New molecular population genetics an attractive field for York: Springer. research. However, this elegance arises because of the 22. Nordborg, M. (2001) Coalescent theory. In Balding, D. J., Bishop, M. J. simplifying assumptions that have been made, in and Cannings, C. (eds.), Handbook of Statistical Genetics. Chichester, particular that the gene locus of interest can be considered UK: Wiley. in isolation from linked gene loci, and that there is no 23. Kingman, J. F. C. (1982) The coalescent. Stoch. Proc. Appl., 13, 235– selection. Modern theory considers realistic problems 248. taking these complications into account, and (unfortu- 24. Kingman, J. F. C. (1982) On the genealogy of large populations. J. nately) much of the elegance of the above theory is then Appl. Probab., 19, 27–43. lost. Tavaré [35], Durrett [36] and Etheridge [37] give 25. Kelly, F. P. (1977) Exact results for the Moran neutral allele model. J. – accounts of this more recent theory. Appl. Probab., 14, 197 201. 26. Donnelly, P. J. and Tavaré, S. (1986) The ages of alleles and a REFERENCES coalescent. Adv. Appl. Probab., 18, 1–19. 27. Donnelly, P. J. (1986) Partition structures, Polya urns, the Ewens 1. Darwin, C. (1859) On the Origin of Species by Means of Natural sampling formula, and the ages of alleles. Theor. Popul. Biol., 30, 271– Selection or the Preservation of Favoured Races in the Struggle for Life. 288. London: John Murray. 28. Watterson, G. A. and Guess, H. A. (1977) Is the most frequent allele the 2. Mendel, G. (1866) Versuche über pflanzenhybriden (Experiments oldest? Theor. Popul. Biol., 11, 141–160. relating to plant hybridization). Verh. Naturforsch. Ver. Brunn, 4, 3–17. 29. Kingman, J. F. C. (1975) Random discrete distributions. J. R. Stat. Soc. 3. Hardy, G. H. (1908) Mendelian proportions in a mixed population. [Ser. A], 37, 1–22. Science, 28, 49–50. 30. Watterson, G. A. (1976) The stationary distribution of the infinitely- 4. Weinberg, W. (1908) Über den Nachweis der Vererbung beim many neutral alleles model. J. Appl. Probab., 13, 639–651. Menschen. (On the detection of heredity in man). Jahreshelfts. Ver. 31. Crow, J. F. (1972) The dilemma of nearly neutral mutations: how Vaterl. Naturf. Württemb., 64, 368–382. important are they for evolution and human welfare? J. Hered., 63, 306–

22 © Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 QB Mathematics, genetics and evolution

316. 35. Tavaré, S. (2004) Ancestral inference in population genetics. In Picard 32. Griffiths, R. C. (1980) Unpublished notes. J. (ed.), Êcole d’Êté de Probabilités de Saint-Fleur XXX1-2001, 1-188, 33. Engen, S. (1975) A note on the geometric series as a species frequency Berlin: Springer-Verlag. model. Biometrika, 62, 697–699. 36. Durrett, R. (2008) Probability Models for DNA Sequence Evolution. 34. McCloskey, J. W. (1965) A model for the distribution of individuals by Berlin: Springer-Verlag. species in an environment. Unpublished PhD. thesis. Michigan State 37. Etheridge, A. (2011) Some Mathematical Models from Population University. Genetics. Berlin: Springer-Verlag.

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013 23