Codon Substitution in Evolution and the “Saturation” of Synonymous Changes


TAKASHI GOJOBORI¹
Center for Demographic and Population Genetics, University of Texas at Houston, Houston, Texas 77025
Manuscript received May 26, 1983
Revised copy accepted August 2, 1983

ABSTRACT

A mathematical model for codon substitution is presented, taking into account unequal mutation rates among different nucleotides and purifying selection. This model is constructed by using a 61 × 61 transition probability matrix for the 61 nonterminating codons. Under this model, a computer simulation is conducted to study the numbers of silent (synonymous) and amino acid-altering (nonsynonymous) nucleotide substitutions when the underlying mutation rates among the four kinds of nucleotides are not equal. It is assumed that the substitution rates are constant over evolutionary time, the codon frequencies being in equilibrium, and, thus, the numbers of synonymous and nonsynonymous substitutions both increase linearly with evolutionary time. It is shown that, when the mutation rates are not equal, the estimate of synonymous substitutions obtained by F. PERLER, A. EFSTRATIADIS, P. LOMEDICO, W. GILBERT, R. KOLODNER and J. DODGSON'S “Percent Corrected Divergence” method increases nonlinearly, although the true number of synonymous substitutions increases linearly. It is, therefore, possible that the “saturation” of synonymous substitutions observed by PERLER et al. is due to the inefficiency of their method to detect all synonymous substitutions.

It is well known that the rate of amino acid substitution in proteins is approximately constant per site per year in various organisms as long as the function and tertiary structure of the molecule remain unaltered (ZUCKERKANDL and PAULING 1965; KIMURA and OHTA 1974; FITCH 1976). Recently, studying the relationship between the numbers of synonymous and nonsynonymous (amino acid altering) nucleotide substitutions and the time since divergence between organisms, PERLER et al. (1980) concluded that, for globin and preproinsulin genes, the accumulation of synonymous substitutions is not linear with time, although the accumulation of nonsynonymous substitutions is linear. They stated that the rate of synonymous substitution was initially seven times higher than that of nonsynonymous substitution, but it was later (85-100 million years) reduced to the same level as that for nonsynonymous substitutions. They explained this change in rate of synonymous substitutions by a “saturation” of substitutions.

For estimating synonymous and nonsynonymous substitutions PERLER et al. (1980) used the so-called “Percent Corrected Divergence” (PCD) method, in which the substitution rates among the four different kinds of nucleotides are assumed to be equal, as in the case of the JUKES-CANTOR method (JUKES and CANTOR 1969; KIMURA and OHTA 1972). Several studies, however, have shown that the actual substitution rates for the four different kinds of nucleotides are not equal (e.g., GOJOBORI, LI and GRAUR 1982), and the assumption of equal rates gives a serious underestimate of the number of nucleotide substitutions, particularly when divergence time is long (KIMURA 1980, 1981; TAKAHATA and KIMURA 1981; GOJOBORI, ISHII and NEI 1982). I, therefore, examined the accuracy of the PCD method by using a transition probability matrix of order 61 × 61 for the 61 nonterminating codons and incorporating the effects of mutation and purifying selection against nonsynonymous changes. The results obtained show that, when the model of unequal substitution rates is used (KIMURA 1981; TAKAHATA and KIMURA 1981; GOJOBORI, ISHII and NEI 1982), the estimated number of nucleotide substitutions at the three positions of a codon increases linearly with evolutionary time. They also suggest that the nonlinear increase of synonymous substitutions observed by PERLER et al. might be due to the inefficiency of their method to detect synonymous substitutions when the mutation rates among nucleotides are not equal.

¹ Present address: National Institute of Genetics, Mishima, Shizuoka-ken, 411, Japan.

Genetics 105: 1011-1027 December, 1983.

THEORY

Codon substitution matrix: Let us consider the evolution of a nucleotide sequence consisting of L codons. The 61 × 61 transition probability matrix, P, for the 61 nonterminating codons is used to describe the codon substitution process. This codon substitution matrix is similar to the 20 × 20 amino acid mutation probability matrix of DAYHOFF (DAYHOFF 1972; NEI and TATENO 1978), but I consider all nonterminating codons separately rather than 20 amino acids. The element, $p_{ij}$, of this matrix (P) represents the probability of substitution from the ith codon to the jth codon during a unit evolutionary time. The frequency of the jth codon at time t (tth unit evolutionary time) is then given by

$$q_j(t) = \sum_{i=1}^{61} q_i(0)\, p_{ij}^{(t)} \qquad (j = 1, 2, \ldots, 61), \tag{1}$$

where $p_{ij}^{(t)}$ is the i, jth element of the matrix $P^t$ and $q_i(0)$ is the initial frequency of the ith codon. In many cases we can assume that the codon frequencies in the common ancestral sequence at time 0 are in equilibrium, i.e., $q_j(0) = q_j(\infty) = q_j$ for all j.

Let $\{v_{rs}\}$ (r, s = A, U, G and C) be the nucleotide mutation matrix, the r, sth element of which represents the probability that the rth nucleotide at a site changes by mutation to the sth nucleotide during a unit evolutionary time. The diagonal element, $v_{rr}$, represents the probability that the rth nucleotide remains unchanged. The same matrix is used for the three positions of a codon. Under the assumption that mutations occur independently at the three positions, the element ($p_{ij}$) of the codon substitution matrix is given by a product of three elements of the nucleotide mutation matrix when there is no purifying selection.

Let us now incorporate purifying selection into the codon substitution matrix. For simplicity, I assume that the relative rates of synonymous and nonsynonymous substitutions are 1 and f, respectively, f being less than 1 because of purifying selection. Thus, for example, the element $p_{AGU,AGG}$ of the codon substitution matrix is given by

$$p_{AGU,AGG} = v_{AA}\, v_{GG}\, v_{UG}\, f \tag{2}$$

because the codon change from AGU to AGG causes an amino acid change from serine to arginine. In the present model, a diagonal element, $p_{ii}$, of the matrix is given by $1 - \sum_{j \neq i} p_{ij}$ for a given i.

Expected rates of substitutions at the three positions of a codon: Let $D_{ij}^{(k)}$ be the number (0 or 1) of nucleotide differences at the kth nucleotide position (k = 1, 2, 3) between a particular pair of codons i and j (i, j = 1, 2, ..., 61). For example, $D_{ACG,CAG}^{(1)} = 1$, $D_{ACG,CAG}^{(3)} = 0$ and so on. If it is assumed that nucleotide substitution occurs once, at most, for a given nucleotide position during a unit evolutionary time, then the expected substitution rate, $\lambda_k$, at the kth position of a codon is given by

$$\lambda_k = \sum_{i=1}^{61} \sum_{j=1}^{61} D_{ij}^{(k)}\, p_{ij}\, q_i \qquad (k = 1, 2, 3). \tag{3}$$

Thus, the expected substitution rate at each of the three codon positions can easily be computed by (3).

Expected rates of synonymous and nonsynonymous substitutions: Define a “type I site” as a nucleotide site of a codon at which some, but not necessarily all, of the potential substitutions are synonymous, and define a “type II site” as a nucleotide site at which no synonymous substitution can occur. [The type I and type II sites are, respectively, the same as the synonymous and replacement sites defined by KAFATOS et al. (1977).] Every nucleotide site of all of the 61 nonterminating codons can, therefore, be classified into one of the two aforementioned types. Among the 183 nucleotide sites for the 61 nonterminating codons, 67 nucleotide sites are of type I and 116 are of type II. If the mutation rate is the same for all nucleotide pairs, the expected numbers of type I and type II sites per codon are 1.10 (= 67/61) and 1.90 (= 3 - 1.10), respectively. In general, the expected number, S, of type I sites per codon in a sequence whose codon frequencies are in equilibrium can be computed by

$$S = \sum_{i=1}^{61} S_i\, q_i, \tag{4}$$

where $S_i$ represents the number of type I sites in the ith codon (see Table 1). Since only a certain proportion of the potential substitutions at type I sites lead to synonymous changes, the number of synonymous sites where every possible substitution is supposed to be synonymous can be obtained by multiplying S by a correction factor, c. This c may be estimated by the proportion of possible nucleotide changes that lead to synonymous substitutions among all allowable nucleotide changes at type I sites.

TABLE 1

The number of type I sites ($S_i$), the total number of all allowable nucleotide changes at type I sites ($Q_i$) and the total number of possible synonymous changes at type I sites ($Q_i'$), in the ith codon (i = 1, 2, ..., 61) (a)

Codon      S_i  Q_i Q_i'  Codon      S_i  Q_i Q_i'  Codon      S_i  Q_i Q_i'  Codon      S_i  Q_i Q_i'
(1) TTT      1    3    1  (5) TCT      1    3    3  (9) TAT      1    1    1  (11) TGT     1    2    1
(2) TTC      1    3    1  (6) TCC      1    3    3  (10) TAC     1    1    1  (12) TGC     1    2    1
(3) TTA      2    6    2  (7) TCA      1    3    3  TAA (stop)                TGA (stop)
(4) TTG      2    6    2  (8) TCG      1    3    3  TAG (stop)                (13) TGG     0    0    0

(14) CTT     1    3    3  (18) CCT     1    3    3  (22) CAT     1    3    1  (26) CGT     1    3    3
(15) CTC     1    3    3  (19) CCC     1    3    3  (23) CAC     1    3    1  (27) CGC     1    3    3
(16) CTA     2    6    4  (20) CCA     1    3    3  (24) CAA     1    3    1  (28) CGA     2    5    4
(17) CTG     2    6    4  (21) CCG     1    3    3  (25) CAG     1    3    1  (29) CGG     2    6    4

(30) ATT     1    3    2  (34) ACT     1    3    3  (38) AAT     1    3    1  (42) AGT     1    3    1
(31) ATC     1    3    2  (35) ACC     1    3    3  (39) AAC     1    3    1  (43) AGC     1    3    1
(32) ATA     1    3    2  (36) ACA     1    3    3  (40) AAA     1    3    1  (44) AGA     2    5    2
(33) ATG     0    0    0  (37) ACG     1    3    3  (41) AAG     1    3    1  (45) AGG     2    6    2

(46) GTT     1    3    3  (50) GCT     1    3    3  (54) GAT     1    3    1  (58) GGT     1    3    3
(47) GTC     1    3    3  (51) GCC     1    3    3  (55) GAC     1    3    1  (59) GGC     1    3    3
(48) GTA     1    3    3  (52) GCA     1    3    3  (56) GAA     1    3    1  (60) GGA     1    3    3
(49) GTG     1    3    3  (53) GCG     1    3    3  (57) GAG     1    3    1  (61) GGG     1    3    3

(a) In the printed table, underlines in the codons show the locations of the type I sites: the third position of every codon with $S_i \geq 1$ and, in addition, the first position of the codons with $S_i = 2$.

Now let $N_p$ be the expected number of all allowable nucleotide changes at type I sites per codon. $N_p$ is then given by $\sum_i Q_i q_i$, where $Q_i$ represents the total number of all allowable nucleotide changes at type I sites in the ith codon (see Table 1). Moreover, the expected number, $N_{ps}$, of possible synonymous changes at type I sites per codon is given by $\sum_i Q_i' q_i$, where $Q_i'$ represents the total number of possible synonymous changes at type I sites in the ith codon (see Table 1). Then c is given by $N_{ps}/N_p$. Thus, the expected number of synonymous sites per codon is given by

$$N_s = S\, N_{ps} / N_p. \tag{5}$$

The expected number, $N_n$, of nonsynonymous sites per codon is given by $3 - N_s$, since $N_s + N_n = 3$.

To compute the expected number of synonymous substitutions, define $D_{ij}^{(S)}$ as the number of different nucleotides when the ith and jth codons (i ≠ j) represent the same amino acid, and define $D_{ij}^{(S)} = 0$ when i and j are codons for different amino acids. All values of $D_{ij}^{(S)}$ can be computed from the genetic code. For example, $D_{CUU(Leu),CUA(Leu)}^{(S)} = 1$ and $D_{UUA(Leu),CUG(Leu)}^{(S)} = 2$. The expected number, $M_s$, of synonymous substitutions per codon during a unit evolutionary time is then given by replacing $D_{ij}^{(k)}$ in (3) by $D_{ij}^{(S)}$. Thus, the rate, $\lambda_s$, of synonymous substitutions per synonymous site is given by $M_s/N_s$. Similarly, define $D_{ij}^{(N)}$ as the number of different nucleotides when the ith and jth codons (i ≠ j) represent different amino acids, and define $D_{ij}^{(N)} = 0$ when i and j are codons for the same amino acid. (Note that $\sum_{k=1}^{3} D_{ij}^{(k)} = D_{ij}^{(S)} + D_{ij}^{(N)}$.) The rate, $\lambda_n$, of nonsynonymous substitutions is then given by $M_n/N_n$, where $M_n$ is the expected number of nonsynonymous substitutions per codon. Note that $\lambda_s$ and $\lambda_n$ should not be computed by $M_s/3$ and $M_n/3$, respectively, because neither the number of synonymous sites nor of nonsynonymous sites per codon is 3. Of course, the expected rate of all substitutions per nucleotide site is $(M_s + M_n)/3$, which can be computed either by the weighted average of $\lambda_s$ and $\lambda_n$, $(\lambda_s N_s/3 + \lambda_n N_n/3)$, or by $(\lambda_1 + \lambda_2 + \lambda_3)/3$.

PCD method: This method was developed by PERLER et al. (1980) for estimating the numbers of synonymous and nonsynonymous substitutions in the protein-coding regions between two nucleotide sequences compared. In this method the numbers of potential synonymous (silent in their paper) and nonsynonymous (replacement) sites are counted for all nucleotide sites. The observed numbers of synonymous and nonsynonymous nucleotide differences are also computed. The numbers of synonymous and nonsynonymous nucleotide substitutions per site are estimated from the ratio of the observed number of nucleotide differences to the number of potential sites. This method depends on the assumption that the rate of nucleotide substitution is the same for all nucleotides. In this respect it is very similar to the JUKES-CANTOR method. [See the paper by PERLER et al. (1980) for details of the PCD method.] This method has been used by a number of molecular geneticists (e.g., EFSTRATIADIS et al. 1980; CLEARY, SCHON and LINGREL 1981; RONINSON and INGRAM 1982).
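The construction of the codon substitution matrix and the position-wise rates of equation (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used in the paper: the function names are hypothetical, the standard genetic code is assumed, changes to terminating codons are simply disallowed (their probability stays on the diagonal), and the 1-p scheme is used so that the equilibrium codon frequencies are uniform (the matrix is symmetric in that case).

```python
from itertools import product

NUCS = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code; '*' marks the three terminating codons.
CODE = {"".join(c): a for c, a in zip(product(NUCS, repeat=3), AMINO)}
CODONS = [c for c in CODE if CODE[c] != "*"]      # the 61 nonterminating codons


def codon_matrix(v, f):
    """Product of three nucleotide terms, times f for amino acid changes (eq. 2)."""
    P = {}
    for i in CODONS:
        row = {}
        for j in CODONS:
            if j == i:
                continue
            p = v[i[0]][j[0]] * v[i[1]][j[1]] * v[i[2]][j[2]]
            row[j] = p if CODE[i] == CODE[j] else f * p
        row[i] = 1.0 - sum(row.values())          # diagonal: 1 - sum of off-diagonals
        P[i] = row
    return P


def position_rates(P, q):
    """Expected substitution rate at codon positions 1, 2 and 3 (eq. 3)."""
    lam = [0.0, 0.0, 0.0]
    for i in CODONS:
        for j in CODONS:
            for k in range(3):
                if i[k] != j[k]:                  # D_ij^(k) = 1
                    lam[k] += P[i][j] * q[i]
    return lam


def type_one_sites(codon):
    """S_i: sites with at least one synonymous single-nucleotide change."""
    count = 0
    for k in range(3):
        for n in NUCS:
            alt = codon[:k] + n + codon[k + 1:]
            if n != codon[k] and CODE[alt] == CODE[codon]:
                count += 1
                break
    return count


alpha = 0.01 / 3                                  # 1-p scheme, total rate 0.01
v = {r: {s: (1 - 3 * alpha if r == s else alpha) for s in NUCS} for r in NUCS}
P = codon_matrix(v, f=0.2)
q = {c: 1.0 / 61 for c in CODONS}                 # uniform equilibrium (symmetric P)
lam = position_rates(P, q)
total_type_one = sum(type_one_sites(c) for c in CODONS)
print([round(x, 5) for x in lam], total_type_one)
```

With these inputs the three rates should come out close to the expected values reported for the 1-p scheme (about 0.0022, 0.0019 and 0.0073 for the first, second and third positions), and the count of type I sites over the 61 codons is 67. For unequal mutation rates the uniform `q` would have to be replaced by the equilibrium frequencies of the corresponding matrix.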

COMPUTER SIMULATION

To check the accuracy of the PCD method, a computer simulation was conducted by using the codon substitution matrix. Since there are 61 different codons, compared with only four different nucleotides, the effect of stochastic errors could be much greater in codon substitution than in nucleotide substitution. To reduce this effect, I first considered a DNA sequence (732 codons) longer than that of the hemoglobin gene (140-150 codons). The length of the DNA sequence used is close to the maximum length permitted by our computer capacity. To see the effect of stochastic errors, a nucleotide sequence of 144 codons was also studied (see DISCUSSION).

Four different schemes of mutation represented in matrix form were considered. In the first scheme, equal mutation rates among the four nucleotides were assumed, whereas in the other three schemes unequal mutation rates were considered. Since the mutation scheme with equal rates involves only one parameter, it will be called the one-parameter (1-p) mutation scheme (see Table 2). This is identical with the assumption of the JUKES-CANTOR method [the one-parameter (1-p) method] and the PCD method. The second and the third mutation schemes considered were the six-parameter (6-p) scheme of KIMURA (1981) and GOJOBORI, ISHII and NEI (1982) and the four-parameter (4-p) scheme of TAKAHATA and KIMURA (1981) [see Table 2 and GOJOBORI, ISHII and NEI (1982)]. In the absence of purifying selection, there are methods for estimating the number of nucleotide substitutions at the three positions of a codon. They are the 6-p estimation method (GOJOBORI, ISHII and NEI 1982) for the 6-p mutation scheme and the 4-p estimation method (TAKAHATA and KIMURA 1981) for the 4-p mutation scheme. The last mutation scheme used is that for the rates of mutations among the four nucleotides in pseudogenes observed by GOJOBORI, LI and GRAUR (1982). Since all mutations in pseudogenes would be selectively neutral, this mutation scheme is considered to represent a possible pattern of actual spontaneous mutations.

TABLE 2

Mutation schemes with equal or unequal rates among nucleotides

O/M (a)  A                      U                      C                      G

(a) One-parameter (1-p) scheme
A        1 - 3α                 α                      α                      α
U        α                      1 - 3α                 α                      α
C        α                      α                      1 - 3α                 α
G        α                      α                      α                      1 - 3α

(b) Four-parameter (4-p) scheme
A        1 - (γ + θα + α)       γ                      θα                     α
U        γ                      1 - (γ + θα + α)       α                      θα
C        θβ                     β                      1 - (γ + θβ + β)       γ
G        β                      θβ                     γ                      1 - (γ + θβ + β)

(c) Six-parameter (6-p) scheme
A        1 - (2α + α₁)          α₁                     α                      α
U        β₁                     1 - (2α + β₁)          α                      α
C        β                      β                      1 - (2β + α₂)          α₂
G        β                      β                      β₂                     1 - (2β + β₂)

(d) Pseudogene scheme
A        0.98935                0.00235                0.0026                 0.0057
U        0.00225                0.99235                0.0031                 0.0023
C        0.00415                0.011                  0.9825                 0.00235
G        0.008                  0.0035                 0.00275                0.98575

(a) O, original nucleotide; M, mutated nucleotide.

In all of the mutation matrices used, the values of $v_{rs}$ were chosen so that the total mutation rate per nucleotide was 0.01, and the time unit in which this amount of mutation occurred on the average was assumed to be a unit evolutionary time. This was achieved by using α = 0.0033333 for the 1-p scheme, α = 0.00375, β = 0.015, γ = 0.001 and θ = 0.5 for the 4-p scheme, and α = 0.00125, α₁ = 0.008, α₂ = 0.0118, β = 0.005, β₁ = 0.004 and β₂ = 0.0059 for the 6-p scheme (see Table 2 for the definitions of parameters). In the mutation scheme estimated from pseudogenes, all off-diagonal elements of the original mutation matrix by GOJOBORI, LI and GRAUR (1982) were adjusted so as to make the total mutation rate per nucleotide approximately equal to 0.01. Note that the total mutation rate per nucleotide is given by $\sum_r u_r \sum_{s \neq r} v_{rs}$, where $u_r$ is the equilibrium frequency of the rth nucleotide, which is obtained from the mutation matrix, $\{v_{rs}\}$. In all cases, it was assumed that f is equal to 0.2, which seems to be a realistic value for hemoglobin genes (MIYATA, YASUNAGA and NISHIDA 1980). The codon substitution matrices were then constructed using (2).

The equilibrium codon frequencies were obtained by squaring the codon substitution matrix repeatedly. Squaring was continued until the ratio of the standard deviation of the elements in each column of the matrix to the mean became less than 0.01. It is noted that all of the elements in each column should have the same values at equilibrium. Following the equilibrium codon frequencies, an ancestral sequence was produced by generating uniform quasirandom variables. Two descendant sequences were independently generated from the ancestral sequence after 10 units of evolutionary time. From each of the two sequences, a new descendant sequence was generated after another 10 units of evolutionary time. This process was repeated for every 10 units of evolutionary time until the total number of evolutionary time units became 100. The transition probabilities of codons after 10 units of evolutionary time were calculated by iteration of the substitution matrix ten times ($= P^{10}$). The codon at each of the 732 sites of a descendant DNA sequence was then determined by choosing random numbers to follow the probability distribution of different codons obtained by $P^{10}$. The two sequences generated for each evolutionary time were then compared, and the number of nucleotide substitutions was estimated by the PCD method for synonymous and nonsynonymous changes and also by the 1-p, 4-p and 6-p estimation methods for nucleotide changes at the three positions of a codon. Three replications were made under each of the four different schemes of mutation. In practice, however, the three replications gave essentially the same results, so that I shall present only one of the three replications in the following.
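The repeated-squaring rule and the total mutation rate can be illustrated at the nucleotide level, where the matrices are only 4 × 4. The sketch below is an assumption-laden illustration rather than the original program: the function names are hypothetical, the rows and columns follow the A, U, C, G order of Table 2, and the 6-p parameter α₂ is taken as 0.0118, the value for which the total mutation rate comes out at 0.01.

```python
from statistics import mean, pstdev


def four_param(a, b, g, th):
    """4-p scheme of Table 2(b); rows = original, columns = mutated (A, U, C, G)."""
    return [[1 - (g + th * a + a), g, th * a, a],
            [g, 1 - (g + th * a + a), a, th * a],
            [th * b, b, 1 - (g + th * b + b), g],
            [b, th * b, g, 1 - (g + th * b + b)]]


def six_param(a, a1, a2, b, b1, b2):
    """6-p scheme of Table 2(c), same row and column order."""
    return [[1 - (2 * a + a1), a1, a, a],
            [b1, 1 - (2 * a + b1), a, a],
            [b, b, 1 - (2 * b + a2), a2],
            [b, b, b2, 1 - (2 * b + b2)]]


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def equilibrium(V, tol=0.01):
    """Square V until, in every column, (standard deviation / mean) < tol."""
    M = V
    while True:
        cols = list(zip(*M))
        if max(pstdev(c) / mean(c) for c in cols) < tol:
            return [mean(c) for c in cols]     # column means approximate u_r
        M = matmul(M, M)


def total_rate(V, tol=1e-6):
    """Sum over r of u_r times the probability that nucleotide r changes."""
    u = equilibrium(V, tol)
    return sum(u[r] * (1 - V[r][r]) for r in range(len(V)))


V4 = four_param(a=0.00375, b=0.015, g=0.001, th=0.5)
V6 = six_param(a=0.00125, a1=0.008, a2=0.0118, b=0.005, b1=0.004, b2=0.0059)
u4 = equilibrium(V4, tol=1e-6)
rate4 = total_rate(V4)
rate6 = total_rate(V6)
print([round(x, 3) for x in u4], round(rate4, 5), round(rate6, 5))
```

The default tolerance of 0.01 mirrors the stopping rule described in the text; a tighter tolerance is used in the calls only so that the printed rates are not blurred by the approximation. The same `equilibrium` routine would apply unchanged to a 61 × 61 codon matrix, at a higher cost per squaring.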

RESULTS

Synonymous and nonsynonymous changes under equal mutation rates: To examine the accuracy of the PCD method under the mutation scheme with equal mutation rates (the 1-p scheme), I studied the relationship among evolutionary time, the expected number of nucleotide substitutions and the PCD values for synonymous and nonsynonymous changes. The expected number of nucleotide substitutions should increase linearly with evolutionary time, because the substitution rate is constant over time. The expected substitution rates were obtained from the computation through (4) and (5). These were 0.00942 and 0.00193 per nucleotide site per unit evolutionary time for synonymous and nonsynonymous changes, respectively. Figure 1a shows the estimated and expected numbers of synonymous and nonsynonymous substitutions when the mutation rate is the same for all nucleotide pairs. It is clear that the linearity of the PCD values with evolutionary time holds satisfactorily for both synonymous and nonsynonymous changes for this mutation scheme. The average substitution rates, which were calculated by the regression coefficient of the PCD values on evolutionary time, were 0.00840 and 0.00203 in the synonymous and nonsynonymous changes, respectively. Compared with the expected values, the relative errors of the PCD values were only 10.9% for synonymous changes and only 5.6% for nonsynonymous changes. Thus, the relative errors of the PCD values to the expected values were quite small for both synonymous and nonsynonymous changes. For the synonymous changes, however, a slight underestimation of the number of substitutions is observed for the PCD values greater than 1.4 (or 90 units of evolutionary time). This can be explained by the following fact. When the evolutionary time is long, multiple synonymous substitutions followed by a single nonsynonymous substitution tend to be regarded as a single nonsynonymous substitution in the PCD method because of a high rate of synonymous substitutions. Nevertheless, the extent of underestimation is still small compared with the cases of unequal mutation rates, which will be mentioned below.

Synonymous and nonsynonymous changes under unequal mutation rates: The PCD values of synonymous and nonsynonymous changes for the 6-p, 4-p and pseudogene mutation schemes are shown in Figure 1, b, c and d, respectively. For all three schemes, the PCD values of nonsynonymous changes are linearly related with time, and most of them are on the expected line. Indeed, their average rates of nonsynonymous substitutions, which were computed from the regression coefficients of the PCD values on evolutionary time, are nearly the same as the expected rates for all three schemes (see Table 3). By contrast, the PCD values of synonymous changes do not increase linearly with evolutionary time. The relationship between PCD and evolutionary time is nearly the same for all three mutation schemes. As observed by PERLER et al. (1980), the PCD method shows an effect of “saturation.” It is possible to fit two lines of different rates of substitutions to the data for synonymous substitutions, as PERLER et al. (1980) did. These two lines have different slopes, and the slope changes around the substitution numbers of 0.6 and 0.8 for the 6-p mutation scheme and for the 4-p and pseudogene mutation schemes, respectively. If we had not known the actual process of nucleotide substitution, we would have been tempted to explain this simulation result by a change in the substitution rate.

Let us consider the case of the 6-p mutation scheme as an example. In Figure 1b, the two broken lines represent the regression lines of the PCD values from 0 to 40 and from 40 to 90 units of evolutionary time. The PCD value at 100 units of time was excluded because the PCD value at this time no longer increased with time, so that the value appeared to have already reached the saturation level. The rate of synonymous substitution estimated from the slope of the steeper line was 0.00761, which was very close to the expected rate (0.00758). Thus, the PCD method could give the rate of synonymous substitution correctly up to PCD = 0.6. However, the other broken line for the synonymous changes has a slope that is similar to that of the nonsynonymous changes. This pattern is essentially the same as the observation by PERLER et al. (1980) for actual nucleotide sequences. As shown in Figure 1, c and d, the 4-p and pseudogene mutation schemes showed essentially the same trend as the 6-p mutation scheme (also see Table 3). These results, therefore, indicate the possibility that the saturation of synonymous changes observed by PERLER et al. is due to the inefficiency of their method to detect nucleotide substitutions.

FIGURE 1. Relationships between evolutionary time (measured in evolutionary time units, t) and the number (d) of nucleotide substitutions for synonymous (Syn) and nonsynonymous (Non) changes estimated by the PCD method for (a) the 1-p mutation scheme; (b) the 6-p mutation scheme; (c) the 4-p mutation scheme; and (d) the pseudogene mutation scheme. The solid lines represent the expected numbers of substitutions that were computed by equations (4) and (5). The solid dots represent the PCD values that were obtained from the comparison of nucleotide sequences. The broken lines in (b), (c) and (d) show the regression lines of the PCD values for synonymous changes for the two different ranges of evolutionary time.

TABLE 3

Expected rates and estimated rates (a) of nucleotide substitutions for various mutation schemes

                      Equal mutation rate   Unequal mutation rate
Site and method       1-p scheme            6-p scheme        4-p scheme        Pseudogene scheme

Nonsynonymous site
  Expected            0.00193               0.00197           0.00180           0.00205
  PCD method          0.00203               0.00193           0.00173           0.00205

Synonymous site
  Expected            0.00942               0.00758           0.01196           0.01178
  PCD method          0.00840               0.00761           0.01171           0.01153
                                            (t = 10-40) (b)   (t = 10-30)       (t = 10-30)
                                            0.00335           0.00354           0.00615
                                            (t = 40-90)       (t = 30-80)       (t = 30-90)

First position
  Expected            0.00225               0.00219           0.00252           0.00270
  1-p method          0.00192               0.00198           0.00183           0.00219
  6-p method          0.00193               0.00223           0.00207           0.00231
  4-p method          0.00191               0.00247           0.00207           0.00230

Second position
  Expected            0.00192               0.00182           0.00192           0.00211
  1-p method          0.00216               0.00174           0.00194           0.00225
  6-p method          0.00216               0.00188           0.00224           0.00233
  4-p method          0.00217               0.00226           0.00226           0.00233

Third position
  Expected            0.00733               0.00538           0.00719           0.00794
  1-p method          0.00686               0.00431           0.00397           0.00538
  6-p method          0.00702               0.00517           0.00636           0.00710     (t = 10-90) (t = 10-70)
  4-p method          0.00705               0.00506           0.00790           0.00671     (t = 10-80) (t = 10-80)

(a) The estimated rates of substitution were obtained from regression coefficients.
(b) Evolutionary time period for which a regression line was fitted (see text for details).

Nucleotide changes at the three nucleotide positions of codons under equal mutation rates: Using the same nucleotide sequences as those used for the study of the PCD method, I examined the substitution patterns at the three nucleotide positions of codons under the 1-p mutation scheme. The substitution numbers at the three positions of codons were estimated by three methods, i.e., the 1-p, 4-p and 6-p estimation methods mentioned earlier. The results obtained are shown in Figure 2a. The estimates of substitution numbers by all three methods were nearly the same at each of the three positions of codons. As shown in Table 3, the substitution rates estimated by those methods were in ranges of 0.00191-0.00193, 0.00216-0.00217 and 0.00686-0.00705 at the first, second and third positions, respectively. The expected rates of nucleotide substitutions obtained by the method of (3) for the first, second and third positions were 0.00225, 0.00192 and 0.00733, respectively. Thus, the estimated rates by all three methods are very close to their expected values. When the evolutionary time was long, the substitution numbers at the first and third positions were slightly underestimated, and those at the second position were slightly overestimated. This can be explained by the effect of stochastic errors or the fact that the substitution scheme at each of the three positions may be slightly different from the 1-p mutation scheme because of purifying selection involved. This difference may cause accumulation of slight deviations of estimates from the expected values when the evolutionary time is long.
At any rate, their relative errors to the expected values are very small. Moreover, the numbers of nucleotide substitutions estimated by the three methods seem to have a reasonably good linear relationship with time.

Nucleotide changes at the three nucleotide positions of codons under unequal mutation rates: In the study of the nucleotide substitutions at the three positions of codons for the 6-p, 4-p and pseudogene mutation schemes, I again used the same sequence data as those used previously. The number of substitutions was estimated again by the 1-p, 4-p and 6-p estimation methods, and the results obtained are shown in Figure 2, b, c and d. To avoid complexity in these figures, the numbers of substitutions estimated by the 1-p method only are presented for the first two positions of codons, although for the third position the estimates obtained by all three methods are presented. In Table 3 the expected substitution rates computed by (3) and the estimated substitution rates by all three methods are presented. At the third nucleotide position, the numbers of substitutions estimated by the 1-p method show a saturation pattern for all three mutation schemes. At the first and second nucleotide positions, however, the numbers of substitutions estimated by the 1-p method showed a good linear relationship with time for all three mutation schemes, although the substitution rates at the first position were slightly underestimated. This observation for the 1-p estimation method is quite similar to that in the PCD method. A saturation pattern at the third nucleotide position and a slight underestimation at the first nucleotide position may be explained by the previous observation that multiple synonymous substitutions followed by a single nonsynonymous substitution are often mistaken as a single nonsynonymous substitution. For this reason, the number of substitutions at the third position, where synonymous substitutions occur predominantly, would be underestimated considerably by the 1-p method, whereas those at the first position, where synonymous substitutions occur very rarely, would be underestimated only slightly.

Under the 6-p mutation scheme, the numbers of substitutions at all of the three nucleotide positions were estimated correctly by the 6-p and 4-p methods, although the 4-p method gave an underestimate at the third position when evolutionary time was long (see Figure 2b). Under the 4-p mutation scheme, the numbers of substitutions at the third positions were also estimated correctly by the 4-p and 6-p methods until 70 units of evolutionary time. When evolutionary time was longer than 70 units, the 4-p and 6-p methods did not give correct estimates at the third position. Under the pseudogene mutation scheme, the 4-p and 6-p methods gave underestimates of the numbers of nucleotide substitutions at the third position. However, the degree of underestimation for the 4-p and 6-p methods was much smaller than that for the 1-p method (also see Table 3).
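The size of the apparent slowdown can be read directly from the synonymous-site entries of Table 3. The short sketch below only divides the tabulated PCD regression slopes by the expected rates for the three schemes with unequal mutation rates; the dictionary names are illustrative and no new data are introduced.

```python
# Synonymous-site rates copied from Table 3 (per site per unit evolutionary time).
expected_syn = {"6-p": 0.00758, "4-p": 0.01196, "pseudogene": 0.01178}
pcd_early = {"6-p": 0.00761, "4-p": 0.01171, "pseudogene": 0.01153}   # first regression line
pcd_late = {"6-p": 0.00335, "4-p": 0.00354, "pseudogene": 0.00615}    # second regression line

# Ratio of each fitted PCD slope to the expected (true) rate.
early_ratio = {s: pcd_early[s] / expected_syn[s] for s in expected_syn}
late_ratio = {s: pcd_late[s] / expected_syn[s] for s in expected_syn}

for scheme in expected_syn:
    print(f"{scheme}: early {early_ratio[scheme]:.2f}, late {late_ratio[scheme]:.2f}")
```

The early slopes recover the expected rate to within a few percent, whereas the late slopes fall to roughly 0.3 to 0.5 of it, even though the true rate in the simulation never changes.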

DISCUSSION In the present study, the equilibrium codon frequencies were used as the initial codon frequencies, for the initial codon frequencies are unknown. In reality, however, it is possible that the codon frequencies have not yet reached the equilibrium state. Therefore, I examined the effect of nonequilibrium codon frequencies on the number of nucleotide substitutions under the 6-p mutation scheme. In this examination, the initial codon frequencies were assumed to be equal, i.e., 1/61. Note that these are not the equilibrium frequencies for the 6-p mutation scheme. When the other parameters were kept the same as those for the previous simulation, the substitution numbers estimated by the PCD method and the three other methods were examined for synonymous and nonsynonymous substitutions and for each of the three nucleotide positions of codons. As shown in Figure 3, a and b, the result obtained was quite similar to the previous case in which the initial codon frequencies are at equilibrium. In fact, the PCD values again showed the saturation pattern for synonymous substitutions but a good linear relationship with time for nonsynonymous substitutions. Similarly, the number of substitutions estimated by the 1-p method for the third position showed a nonlinear relationship with time, whereas those at the first two positions showed a linearity. In this case, both 6-p and 4-p methods also gave serious underestimates of substitution numbers when evolutionary time was long. In Figure 3, a and b, the expected rates of substitutions during the period between 0 and 100 units of evolutionary time were assumed to be constant over time and equal to the initial expected rates of substitutions, which were computed by (3), (4) and (5) under the assumption of ql = ql (0). Although the real expected rates should change with time in the nonequilibrium case, the rate of substitutions for the early period should be determined mainly by the initial expected rate of substitutions. 
As shown in Figure 3, a and b, this is true particularly for substitutions at nonsynonymous sites and the first two nucleotide positions of a codon,

FIGURE 2.-Relationships between evolutionary time (t) and the numbers (d) of nucleotide substitutions at the first (1st), second (2nd) and third (3rd) nucleotide positions of codons estimated by the 1-p estimation method (●), the 4-p estimation method (▲) and the 6-p estimation method (■) under (a) the 1-p mutation scheme; (b) the 6-p mutation scheme; (c) the 4-p mutation scheme; and (d) the pseudogene mutation scheme. At the first and second nucleotide positions, only the numbers of substitutions estimated by the 1-p method are presented. The open circles represent the number of substitutions estimated by the 1-p method for the second position. The solid lines represent the expected substitution numbers computed by eq. (3). In the 4-p and 6-p estimation methods, only applicable cases are presented. See GOJOBORI, ISHII and NEI (1982) for inapplicable cases.

[Figure 3 panels: (a) 6-p scheme, nonequilibrium; (b) 6-p scheme, nonequilibrium. Axes: evolutionary time t (0 to 100 units) versus number of substitutions d.]

because the estimated numbers of these substitutions fit well the expected numbers computed by the initial expected rates of substitutions. At any rate, it is clear that, unless the initial frequencies of codons are very different from the equilibrium frequencies or equal frequencies, they do not affect the results very much.

Although the evolutionary change of nucleotide sequences with 732 codons was studied here, the hemoglobin sequences examined by PERLER et al. consisted of only 140-150 codons. Therefore, it is possible that a stochastic error due to a finite nucleotide length is the main cause of the saturation of synonymous substitutions rather than the PCD method itself. To check this point, a nucleotide sequence with 144 codons was generated for each of the 1-p and 6-p mutation schemes, keeping all other parameters unchanged. The results obtained are shown in Figure 3, c and d. It is seen that the PCD values show an approximately linear relationship with time for both synonymous and nonsynonymous changes when the 1-p mutation scheme is used, although for synonymous changes the numbers of substitutions for the sequence of 144 codons were underestimated more severely than those for the sequence of 732 codons. Under the 6-p mutation scheme, however, the PCD value for synonymous substitutions again showed a pattern of saturation, whereas a good linear relationship with time was observed for nonsynonymous substitutions. These results indicate that the stochastic error due to a finite nucleotide length is unlikely to be the main cause of the saturation phenomenon.

Earlier studies (KIMURA 1981; GOJOBORI, NEI and ISHII 1981; GOJOBORI, ISHII and NEI 1982) have shown that in the presence of unequal rates of nucleotide substitution the estimate of substitution numbers obtained by the method of JUKES and CANTOR does not increase linearly with evolutionary time.
Although they did not divide substitutions into synonymous and nonsynonymous, it is probable that the nonlinear increase of the PCD values for synonymous substitutions in the study of PERLER et al. is caused by the assumption of equal substitution rates, for the substitution rate for synonymous sites is similar to that for the third nucleotide positions of codons. This view is supported by the present study in which synonymous and nonsynonymous substitutions are considered separately.

BROWN et al. (1982) modified the PCD method when they applied the method to mitochondrial DNA sequences, since for these sequences transition substitution seemed to occur much more frequently than transversion substitution. Unfortunately, the method of BROWN et al. appears incorrect, as pointed out by HOLMQUIST (1983). BROWN et al. (1982) partitioned synonymous substitutions into transition and transversion types. This partition is not justified since at some sites synonymous substitutions can be caused by both transition

FIGURE 3.-The effects of nonequilibrium codon frequencies [(a) and (b)] and the stochastic errors [(c) and (d)] on the nucleotide substitutions. The PCD values and the substitution numbers estimated by the 1-p, 4-p and 6-p estimation methods under the 6-p mutation scheme are presented in (a) and (b), respectively. The symbols in (b) are the same as those in Figure 2. In (c) and (d), the PCD values in the nucleotide sequence of 144 codons are presented under the 1-p and 6-p mutation schemes, respectively.

and transversion. HOLMQUIST criticized the PCD method of PERLER et al. for the same reason. However, PERLER et al. partitioned synonymous sites into three types of sites, i.e., those for which the total number of possible nucleotide substitutions is equal to one, two and three. This partitioning does not seem to cause any problem, since nucleotide substitutions at these sites can occur independently. Thus HOLMQUIST'S criticism against the PCD method is not valid. Rather, the problem lies in the fact that PERLER et al. used a Poisson correction of multiple substitutions for each category under the assumption of equal substitution rates. As shown in this study, the PCD method gives good estimates of substitution numbers when the substitution rate is the same for all nucleotides, but a saturation pattern appears when the substitution scheme deviates from equality.

In conclusion, the nonlinear increase of the PCD values for synonymous substitutions is apparently caused by the inefficiency of the PCD method for detecting synonymous substitutions when there are unequal mutation rates among the four different kinds of nucleotides and purifying selection operates. This conclusion is supported by our earlier studies (GOJOBORI, NEI and ISHII 1981; GOJOBORI, ISHII and NEI 1982) and by the study of BROWN et al. (1982) on mitochondrial DNA sequences.
This study, therefore, suggests that the PCD method can give a serious underestimate of the number of synonymous substitutions when the PCD value is more than 0.6.
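The central point of the argument above, that a correction assuming equal substitution rates underestimates the true number of substitutions once the rates are unequal, can be illustrated with a much simpler model than the 61 X 61 codon matrix used in this paper. The following is a minimal sketch, not the simulation reported here: it assumes a four-nucleotide model with only two rates (a transition rate `alpha` and a transversion rate `beta`, as in the two-parameter model of KIMURA 1980), uses the standard closed-form expected proportions of differences for that model, and applies the one-parameter (JUKES and CANTOR) correction to them. The function names and the numerical rates are hypothetical and chosen only for illustration.

```python
import math


def k2p_differences(alpha, beta, t):
    """Expected proportions of transition (P) and transversion (Q) differences
    after total elapsed time t, under an assumed two-rate model in which each
    site undergoes transitions at rate alpha and each of the two possible
    transversions at rate beta."""
    a = math.exp(-2.0 * (alpha + beta) * t)
    b = math.exp(-4.0 * beta * t)
    p_ts = 0.25 * (1.0 - 2.0 * a + b)
    q_tv = 0.5 * (1.0 - b)
    return p_ts, q_tv


def true_substitutions(alpha, beta, t):
    """Expected number of substitutions per site; linear in t by construction."""
    return (alpha + 2.0 * beta) * t


def jukes_cantor_estimate(p):
    """One-parameter correction, which assumes equal rates for all changes.
    p is the total proportion of differing sites (must be below 3/4)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def compare(alpha, beta, times):
    """Return (t, true d, one-parameter estimate of d) for each time point."""
    rows = []
    for t in times:
        p_ts, q_tv = k2p_differences(alpha, beta, t)
        d_hat = jukes_cantor_estimate(p_ts + q_tv)
        rows.append((t, true_substitutions(alpha, beta, t), d_hat))
    return rows


if __name__ == "__main__":
    # Hypothetical rates: transitions ten times as frequent as each transversion.
    for t, d_true, d_hat in compare(0.010, 0.001, [20, 40, 60, 80, 100]):
        print(f"t={t:3d}  true d={d_true:.3f}  estimate={d_hat:.3f}  "
              f"ratio={d_hat / d_true:.3f}")
```

Under these assumptions the estimate agrees with the true value when `alpha` equals `beta`, whereas with unequal rates the ratio of the estimate to the true value falls steadily as t increases, which is the same qualitative pattern as the saturation discussed above. The sketch ignores codon structure, purifying selection and sampling error, so it should not be read as a reproduction of the PCD results.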

I am indebted to M. NEI who suggested the subject of this paper. I am also indebted to K. ISHII who helped to make the model at the early stage of this study. I would like to thank M. NEI and K. ISHII for their help and valuable suggestions. Thanks are also due to P. MAJUMDER for helpful comments to improve the presentation. This study was supported by M. NEI'S grants NIH-GM20293 and NSF-DEB8110461.

LITERATURE CITED

BROWN, W. M., E. M. PRAGER, A. WANG and A. C. WILSON, 1982 Mitochondrial DNA sequences of primates: tempo and mode of evolution. J. Mol. Evol. 18: 225-239.
CLEARY, M. L., E. A. SCHON and J. B. LINGREL, 1981 Two related pseudogenes are the result of a gene duplication in the goat β-globin locus. Cell 26: 181-190.
DAYHOFF, M. O. (Editor), 1972 Atlas of Protein Sequence and Structure, Vol. 5. National Biomedical Research Foundation, Silver Spring, Maryland.
EFSTRATIADIS, A., J. W. POSAKONY, T. MANIATIS, R. M. LAWN, C. O'CONNELL, R. A. SPRITZ, J. K. DERIEL, B. G. FORGET, S. M. WEISSMAN, J. L. SLIGHTOM, E. A. BLECHL, O. SMITHIES, F. E. BARALLE, C. C. SHOULDERS and N. J. PROUDFOOT, 1980 The structure and evolution of the human β-globin gene family. Cell 21: 653-668.
FITCH, W. M., 1976 The molecular evolution of cytochrome c in eukaryotes. J. Mol. Evol. 8: 13-40.
GOJOBORI, T., K. ISHII and M. NEI, 1982 Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. J. Mol. Evol. 18: 414-423.
GOJOBORI, T., W.-H. LI and D. GRAUR, 1982 Pattern of nucleotide substitution in pseudogenes and functional genes. J. Mol. Evol. 18: 360-369.
GOJOBORI, T., M. NEI and K. ISHII, 1981 Mathematical model of nucleotide substitutions with unequal substitution rates. Genetics 97(Suppl.): s43.
HOLMQUIST, R., 1983 Transitions and transversions in evolutionary descent: an approach to understanding. J. Mol. Evol. 19: 134-144.
JUKES, T. H. and C. R. CANTOR, 1969 Evolution of protein molecules. pp. 21-123. In: Mammalian Protein Metabolism, Edited by H. N. MUNRO. Academic Press, New York.
KAFATOS, F. C., A. EFSTRATIADIS, B. G. FORGET and S. M. WEISSMAN, 1977 Molecular evolution of human and rabbit β-globin mRNAs. Proc. Natl. Acad. Sci. USA 74: 5618-5622.
KIMURA, M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
KIMURA, M., 1981 Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78: 454-458.
KIMURA, M. and T. OHTA, 1972 On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. 2: 87-90.
KIMURA, M. and T. OHTA, 1974 On some principles governing molecular evolution. Proc. Natl. Acad. Sci. USA 71: 2848-2852.
MIYATA, T., T. YASUNAGA and T. NISHIDA, 1980 Nucleotide divergence and functional constraint in mRNA evolution. Proc. Natl. Acad. Sci. USA 77: 7328-7332.
NEI, M. and Y. TATENO, 1978 Nonrandom amino acid substitution and estimation of the number of nucleotide substitutions in evolution. J. Mol. Evol. 11: 333-347.
PERLER, F., A. EFSTRATIADIS, P. LOMEDICO, W. GILBERT, R. KOLODNER and J. DODGSON, 1980 The evolution of genes: the chicken preproinsulin gene. Cell 20: 555-566.
RONINSON, I. B. and V. M. INGRAM, 1982 Gene evolution in the chicken β-globin cluster. Cell 28: 515-521.
TAKAHATA, N. and M. KIMURA, 1981 A model of evolutionary base substitutions and its application with special reference to rapid change of pseudogenes. Genetics 98: 641-657.
ZUCKERKANDL, E. and L. PAULING, 1965 Evolutionary divergence and convergence in proteins. pp. 97-166. In: Evolving Genes and Proteins, Edited by V. BRYSON and H. J. VOGEL. Academic Press, New York.

Corresponding editor: B. S. WEIR