Infinite Allele Model with Varying Mutation Rate (Protein Polymorphism/Heterozygosity/Genetic Distance)

Proc. Natl. Acad. Sci. USA, Vol. 73, No. 11, pp. 4164-4168, November 1976. Genetics.

MASATOSHI NEI, RANAJIT CHAKRABORTY, AND PAUL A. FUERST
Center for Demographic and Population Genetics, University of Texas at Houston, Houston, Tex. 77030

Communicated by Motoo Kimura, September 7, 1976

ABSTRACT Available data suggest that the variation in mutation rate among protein loci follows the gamma distribution. Thus, taking into account this variation, formulae are developed for the distribution of allele frequencies, mean and variance of heterozygosity, expected number of alleles, proportion of polymorphic loci, and genetic distance. These formulae should be more appropriate for the analysis of gene frequency data for protein loci than equivalent formulae with constant mutation rate.

In the last 10 years statistical methods based on the so-called infinite allele model (1, 2) have been used extensively to study the mechanism of maintenance of protein polymorphism. Because each allele in a population behaves in a random fashion comparable to a molecule in a mass of gases, these methods are applied to a collection of alleles from a large number of loci. They are, however, based on two unrealistic assumptions when applied to the current gene frequency data, most of which have been obtained by electrophoresis. First, in this model all new mutations are assumed to be novel, whereas at the level of electrophoresis some backward mutations may occur. Recently, Ohta and Kimura (3) introduced a new mutation model (stepwise mutation model), which is presumably appropriate to electrophoretic data. Second, in the application of the classical infinite allele model it is assumed that the mutation rate is the same for all loci. This assumption is certainly incorrect, and an enormous variation of the rate of amino acid substitution among different proteins suggests that the rate of mutations that can be incorporated into the population varies considerably with gene loci. The purpose of this paper is to develop a theory which is applicable to a collection of alleles from different loci.

Distribution of mutation rate

To incorporate the variation of mutation rate, we need some idea about the distribution of mutation rate among loci. In practice, virtually nothing is known about this distribution. A priori, one might assume that it is normally distributed. In the case of mutation rate, however, the normal distribution is not very suitable, because mutation rate never becomes smaller than 0. With this restriction, the alternative candidate is the gamma distribution. In fact, as will be discussed below, there is some evidence for this.

Although the mutation rates for most protein loci are not known at present, they can be estimated under certain assumptions. First, if we assume that a majority of gene substitutions in evolution are due to random fixation of selectively neutral genes, the mutation rate to such alleles may be estimated from the rate of amino acid substitution in proteins (4). This assumption may be incorrect, but is sufficient for testing the neutral mutation hypothesis. Dayhoff (5) has given the rates of amino acid substitution per residue per year for 20 different kinds of polypeptides. The mutation rate per polypeptide (locus) can therefore be estimated by multiplying this rate by the number of amino acids in each polypeptide (6). Fig. 1 gives the distribution of mutation rate thus obtained, excluding histone IV, which has an unusually low rate of amino acid substitution. The mean and standard deviation of this distribution are $2.47 \times 10^{-7}$ and $2.51 \times 10^{-7}$ per year, respectively. We fitted the following gamma distribution to these data:

$f(z) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} e^{-\beta z} z^{\alpha - 1}$, [1]

where $\alpha = \bar{z}^2/V_z$ and $\beta = \bar{z}/V_z$, in which $\bar{z}$ and $V_z$ are the mean and variance of the variate in question ($z$), respectively. The maximum likelihood estimates of $\alpha$ and $\beta$ obtained are 0.95 and $3.9 \times 10^{6}$, respectively. It is clear that the gamma distribution fits the data reasonably well (Fig. 1), though the number of polypeptides used is very small.

FIG. 1. Distribution of the rate of amino acid substitutions per polypeptide per year (horizontal axis: rate of amino acid substitution). The total number of polypeptides used is 19. The gamma distribution fits the data very well ($\chi^2_{(1)} = 0.23$; P > 0.60).

Another way to get a rough idea about the distribution of mutation rate is to examine the distribution of molecular weights of protein subunits, by assuming that the mutation rate at a locus is proportional to the molecular weight of the polypeptide produced. This assumption seems to be only roughly correct. Data on amino acid substitutions in polypeptides in Fig. 1 indicate that there is a significant correlation between the substitution rate per polypeptide and molecular weight, but the correlation coefficient is only 0.53. At any rate, we have examined the distribution of molecular weights of 119 protein subunits in mammalian species, by using the data compiled by Darnall and Klotz (7). The distribution obtained is given in Fig. 2. The mean and standard deviation of this distribution are 45,102 and 24,531, respectively. The mean molecular weight is much higher than that for the polypeptides in Fig. 1 (ca. 15,000) but close to that of proteins which are often used in electrophoresis (8). The shape of the distribution of molecular weights is somewhat different from that of Fig. 1, but the gamma distribution again fits the data surprisingly well. In this case the maximum likelihood estimates of $\alpha$ and $\beta$ are 3.7 and $8.14 \times 10^{-5}$, respectively. Clearly, the coefficient of variation for Fig. 2 is much smaller than that for Fig. 1. This is probably due to the fact that mutation rate is not strictly proportional to molecular weight and is affected by a number of other factors. Therefore, it is likely that the $\alpha$ value for actual mutation rate is closer to the value for the rate of amino acid substitution than to that for molecular weights.

FIG. 2. Distribution of molecular weights of protein subunits in mammalian species (horizontal axis: molecular weight in thousands). The total number of proteins used is 119. The gamma distribution fits the data very well ($\chi^2_{(9)} = 6.43$; P > 0.65).

In practice, what we really need is not the distribution of absolute mutation rate but that of $M = 4Nv$, where $N$ is the effective size of a population and $v$ is the mutation rate per generation. We note then that $v$ can be obtained by multiplying the mutation rate per year ($v_y$) by generation time ($g$), if the mutation rate per year rather than per generation is constant, as seems to be the case with protein loci (4). Clearly, the coefficient of variation of $M$ is identical to that of the mutation rate per year, and if $v_y$ follows the gamma distribution, $M$ also follows the gamma. In this case, the parameter $\alpha$ is the same for both $M$ and $v_y$, since $\alpha$ is the reciprocal of the squared coefficient of variation ($\bar{z}^2/V_z$). On the other hand, $\beta$ is given by $\bar{z}/V_z$, so that this value for the distribution of $M$ is $4Ng$ times smaller than that for $v_y$. Namely, unlike $\alpha$, $\beta$ depends on population size and generation time.

Let $\bar{M}$ and $V_M$ be the mean and variance of $M$ among loci for a particular population, which is at steady state. We assume that $M$ follows a gamma distribution. Then, $\bar{M}$ may be estimated from the average heterozygosity for randomly chosen loci, as will be seen later. Furthermore, if we know the value of $\alpha$, the variance of $M$ is given by $V_M = \bar{M}^2/\alpha$, and $\beta = \alpha/\bar{M}$. Our study on the distributions of the rate of amino acid substitution and molecular weight suggests that $\alpha$ is about 1 to 2.

Distribution of allele frequencies

Consider a randomly mating population of effective size $N$, and assume that mutation and random genetic drift are balanced. Let $\Phi_M(x)$ be the distribution of allele frequencies for a locus with a particular value of $M = 4Nv$, such that $\Phi_M(x)\,dx$ represents the expected number of alleles whose frequency is in the range from $x$ to $x + dx$. Wright (1) and Kimura and Crow (2) have shown that

$\Phi_M(x) = M(1 - x)^{M-1} x^{-1}$. [2]

...tion of constancy of mutation rate when this rate actually varies. Thus, we first computed the average heterozygosity ($H$) with given values of $\bar{M}$ and $V_M$, using formula [4] below, and then estimated the $M$ value by $M_c = H/(1 - H)$. With this $M_c$ value, we computed the distribution [2] and compared it with distribution [3], in which the $\bar{M}$ value was used. In this computation we considered two values of $\bar{M}$, i.e., $\bar{M} = 0.1$ and $\bar{M} = 1.0$. In both cases $\alpha = \bar{M}^2/V_M = 1$ was assumed. The average heterozygosities for $\bar{M} = 0.1$ and $\bar{M} = 1.0$ are 0.084 and 0.404, respectively (Table 1).

The results obtained are given in Fig. 3. The distribution [3] is given by solid lines, whereas [2] is given by broken lines. When $\bar{M}$ is 0.1, the difference between [3] and [2] is so small that the two distributions are practically indistinguishable. When $\bar{M}$ is 1.0, however, there is considerable difference between them. In this
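As a rough numerical check on the comparison described above, the following Python sketch averages the single-locus heterozygosity over a gamma distribution of M and then converts the result to M_c = H/(1 - H). It assumes the standard infinite allele result H = M/(1 + M) for a locus with fixed M, since formula [4] itself is not reproduced here. The function names, the Simpson integration, and the truncation point are illustrative choices, not the authors' procedure.

```python
import math


def mean_heterozygosity(m_bar, alpha, steps=20000):
    """Average M/(1+M) over a gamma distribution of M (mean m_bar, shape alpha)."""
    beta = alpha / m_bar                       # beta = alpha / M-bar, as in the text
    norm = beta ** alpha / math.gamma(alpha)   # normalising constant of the gamma density
    upper = 60.0 / beta                        # assumed cut-off; tail is negligible for alpha near 1

    def integrand(m):
        # M/(1+M) times the gamma density; M * M**(alpha-1) is written as M**alpha
        return norm * m ** alpha * math.exp(-beta * m) / (1.0 + m)

    # Composite Simpson rule on [0, upper]; steps must be even
    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0


def constant_rate_m(h_bar):
    """M_c = H/(1 - H): the M a constant-rate model would infer from H."""
    return h_bar / (1.0 - h_bar)


for m_bar in (0.1, 1.0):
    h_bar = mean_heterozygosity(m_bar, alpha=1.0)
    print(m_bar, round(h_bar, 3), round(constant_rate_m(h_bar), 3))
```

With alpha = 1 the averages should land close to the heterozygosities of 0.084 and 0.404 quoted in the text. M_c comes out below the mean of M in both cases because M/(1 + M) is concave in M.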

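The relations between the gamma parameters and the mean and variance used for the fits in Figs. 1 and 2 can be checked against the reported summary statistics. The sketch below is a moment-based calculation, not the maximum likelihood fit used in the paper, so it only approximates the quoted estimates. The standard deviation for Fig. 1 is taken as 2.51e-7 per year (the exponent is an assumption, chosen for consistency with the reported parameter estimates), and the values of N and g are hypothetical, included only to illustrate how rescaling to M = 4Ng times the yearly rate leaves alpha unchanged and divides beta by 4Ng.

```python
def gamma_moment_estimates(mean, sd):
    """Return (alpha, beta) with alpha = mean^2 / variance and beta = mean / variance."""
    variance = sd ** 2
    return mean ** 2 / variance, mean / variance


# Fig. 1: substitution rate per polypeptide per year (SD exponent is an assumption)
alpha_rate, beta_rate = gamma_moment_estimates(2.47e-7, 2.51e-7)

# Fig. 2: molecular weights of 119 protein subunits
alpha_mw, beta_mw = gamma_moment_estimates(45102.0, 24531.0)

# Hypothetical effective size and generation time (years), for illustration only
N, g = 1.0e4, 5.0
scale = 4.0 * N * g

# Rescaling every rate by 4Ng multiplies both the mean and the SD by 4Ng
alpha_m, beta_m = gamma_moment_estimates(2.47e-7 * scale, 2.51e-7 * scale)

print("rate:   alpha = %.2f, beta = %.3g" % (alpha_rate, beta_rate))
print("weight: alpha = %.2f, beta = %.3g" % (alpha_mw, beta_mw))
print("M:      alpha = %.2f, beta = %.3g" % (alpha_m, beta_m))
```

The moment estimates for Fig. 1 fall near the maximum likelihood values of 0.95 and 3.9 × 10^6 given in the text, while those for Fig. 2 differ somewhat more from 3.7 and 8.14 × 10^-5, as expected when two different estimation methods are applied to a modest sample.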