
J. Genet., Vol. 75, Number 1, April 1996, pp. 91-115. Indian Academy of Sciences

Pattern of synonymous and nonsynonymous substitutions: an indicator of mechanisms of molecular evolution

YASUO INA Department of , National Institute of Genetics, Mishima 411, Japan

Abstract. Comparison of numbers of synonymous and nonsynonymous substitutions is useful for understanding mechanisms of molecular evolution. In this paper, I examine the statistical properties of six methods of estimating numbers of synonymous and nonsynonymous substitutions. The six methods are Miyata and Yasunaga's (MY) method; Nei and Gojobori's (NG) method; Li, Wu and Luo's (LWL) method; Pamilo, Bianchi and Li's (PBL) method; and Ina's (Ina) two methods. When the transition/transversion bias at the mutation level is strong, the numbers of synonymous and nonsynonymous substitutions are estimated more accurately by the PBL and Ina methods than by the NG, MY and LWL methods. When the nucleotide-frequency bias is strong and distantly related sequences are compared, all six methods underestimate the number of synonymous substitutions. The concept of synonymous and nonsynonymous categories is also useful for analysis of DNA polymorphism data.

Keywords. Synonymous substitutions; nonsynonymous substitutions; neutral theory of molecular evolution; functional constraints; ; gene conversion.

1. Introduction

Nucleotide substitutions reflect the history of DNA. Thus, to understand mechanisms of molecular evolution, it is of great importance to know the number and pattern of nucleotide substitutions between homologous sequences. Nucleotide substitutions in coding regions are classified into synonymous and nonsynonymous substitutions; synonymous substitutions do not lead to amino acid substitutions, whereas nonsynonymous substitutions result in amino acid substitutions. The numbers of synonymous ($d_S$) and nonsynonymous ($d_N$) substitutions per site have been used to clarify mechanisms of molecular evolution. In particular, estimates of $d_S$ and $d_N$ are used for a statistical test of the neutral theory of molecular evolution (Kimura 1968a, 1983).

Various methods of estimating $d_S$ and $d_N$ have been developed. Among them, Miyata and Yasunaga's (1980) method (MY method); Li, Wu and Luo's (1985a) method (LWL method); Nei and Gojobori's (1986) method (NG method); Pamilo and Bianchi's (1993), and Li's (1993) method (PBL method); and Ina's (1995) methods (Ina methods) are often used. In this paper I call these methods 'currently used methods'. In the currently used methods the principle of maximum parsimony is used for estimation of the numbers of synonymous and nonsynonymous differences between the codons compared, i.e. only the shortest pathways between the codons are considered. Recently, Goldman and Yang (1994), Muse and Gaut (1994), and Muse (1996a) defined statistical models of synonymous and nonsynonymous substitutions in a mathematically rigorous way. On the basis of these models, Goldman and Yang (1994) and Muse (1996a) developed maximum likelihood methods for estimating $d_S$ and $d_N$. In the maximum likelihood methods all possible pathways between the codons compared are taken into account. Since maximum likelihood estimation of $d_S$ and $d_N$ is well reviewed by Yang (1995) and Muse (1996b), I shall not describe these methods in detail.

In section 2 I explain the algorithms of the currently used methods. The statistical properties of these methods are not well understood. So, using computer simulation, I examined them with special reference to a neutrality test and the relation between the nucleotide-frequency bias (unequal frequencies of the four nucleotides) and estimates of $d_S$ (section 3). I did not examine Perler et al.'s (1980) method because estimates of $d_S$ obtained by their method are inaccurate (Gojobori 1983). On the basis of the results obtained by computer simulation, I discuss studies related to estimation of $d_S$ and $d_N$ (sections 4-7). Furthermore, I very briefly review Kimura's contribution to these studies.

2. Methods for estimating the numbers of synonymous and nonsynonymous substitutions

Let $\lambda_{ij}$ be the rate of mutation from nucleotide $i$ to nucleotide $j$. The number of synonymous sites at a site is defined as the proportion of synonymous to total mutations at the site (Ina 1995). (This is not the only definition of the number of synonymous sites. For another definition, see Muse and Gaut (1994) and Muse (1996a).) Thus the number ($s_i$) of synonymous sites for codon $i$ is the sum of the number of synonymous sites at each position of the codon. For example, the number of synonymous sites for codon TTT, which encodes phenylalanine, is given by

$$s_{\mathrm{TTT}} = 0 + 0 + \frac{\lambda_{TC}}{\lambda_{TC} + \lambda_{TA} + \lambda_{TG}} = \frac{\lambda_{TC}}{\lambda_{TC} + \lambda_{TA} + \lambda_{TG}}, \tag{1}$$

because only a mutation from T to C at the third position is synonymous. The number ($n_i$) of nonsynonymous sites for codon $i$ is obtained by $n_i = 3 - s_i$. In reality it is difficult to estimate $\lambda_{ij}$ accurately. So, to estimate $s_i$ and $n_i$, we use simplified models such as Jukes and Cantor's (1969) model and Kimura's (1980) two-parameter model.

The currently used methods for estimating $d_S$ and $d_N$ fall into two categories according to how they apply a multiple-hit correction formula. Methods in the first category (the MY, NG and Ina methods) assume that sites are either entirely synonymous (no substitutions cause amino acid changes) or entirely nonsynonymous (all substitutions cause amino acid changes). Methods in the second category (the LWL and PBL methods) treat nondegenerate, two-fold-degenerate and four-fold-degenerate sites separately.

When two or three nucleotide differences are observed between a pair of codons, there are two or more pathways between the codons. The NG and Ina methods give equal weights to different pathways between a pair of codons, whereas the MY, LWL and PBL methods give different weights to different pathways between the codon pair. Thus, in Nei's (1987, p. 73) terminology, the NG and Ina methods are characterized as unweighted pathway methods, while the MY, LWL and PBL methods are characterized as weighted pathway methods. However, the weighting method differs between the MY method and the LWL and PBL methods. In the MY method the weights are determined from Miyata et al.'s (1979) index, which represents a physicochemical difference between a pair of amino acids, whereas in the LWL and PBL methods the weights for different pathways are determined from empirical frequencies of codon substitution.

In this paper estimates are signified by a hat (^), e.g. $\hat{d}_S$ means an estimate of $d_S$.

2.1 Miyata and Yasunaga's (MY) method

In the MY method the rate of acceptance ($\delta_{ij}$) of mutations between codons $i$ and $j$ is given by

$$\delta_{ij} = \begin{cases} 1 & \text{for } \delta = 0 \text{ (i.e. synonymous change)} \\ 1 - \delta/3.5 & \text{for } 0 < \delta < 3.465 \\ 0.01 & \text{for } \delta \ge 3.465, \end{cases} \tag{2}$$

where $\delta$ is the degree of polarity and volume differences of the amino acids encoded by codons $i$ and $j$ (Miyata et al. 1979). The probability ($P_{k,ij}$) that pathway $k$ between codons $i$ and $j$ appears in the evolutionary process is given by the product of the $\delta_{ij}$ values for adjacent codons involved in the pathway. For example, if intermediate codons $i'$ and $i''$ are involved in pathway $k$ between codons $i$ and $j$, the probability is given by $P_{k,ij} = \delta_{ii'}\,\delta_{i'i''}\,\delta_{i''j}$. Thus the weight ($\omega_{k,ij}$) for pathway $k$ between codons $i$ and $j$ is given by

$$\omega_{k,ij} = \frac{P_{k,ij}}{\sum_k P_{k,ij}}. \tag{3}$$

In the MY method the $\omega_{k,ij}$ value is used when the numbers of synonymous and nonsynonymous sites and of synonymous and nonsynonymous differences are estimated.

Nucleotide mutations are assumed to follow Jukes and Cantor's model. In this model the proportions of synonymous mutations are 0, 1/3, 2/3 and 1 for nondegenerate, two-fold-degenerate, three-fold-degenerate and four-fold-degenerate sites respectively. Thus the numbers of synonymous sites are 0, 1/3, 2/3 and 1 for nondegenerate, two-fold-degenerate, three-fold-degenerate and four-fold-degenerate sites respectively. From these values, the number ($s_i$) of synonymous sites for codon $i$ is computed. For example, the number of synonymous sites for codon TTT is given by

$$s_{\mathrm{TTT}} = 0 + 0 + 1/3 = 1/3, \tag{4}$$

because the first and second positions are nondegenerate sites and the third position is a two-fold-degenerate site. Since the number ($n_i$) of nonsynonymous sites for codon $i$ is given by $n_i = 3 - s_i$, we obtain $n_{\mathrm{TTT}} = 8/3$.

Intermediate codons between the codons compared are considered when the numbers of synonymous and nonsynonymous sites are estimated. This estimation method is unique to the MY method. The number ($s_{ij}$) of synonymous sites for a pair of codons $i$ and $j$ is estimated as the weighted average of the numbers of synonymous sites for codons $i$ and $j$ ($s_i$ and $s_j$) and for intermediate codons between codons $i$ and $j$:

$$\hat{s}_{ij} = \sum_k \omega_{k,ij}\,\bar{s}_{k,ij}, \tag{5}$$

where $\bar{s}_{k,ij}$ is the simple average of the numbers of synonymous sites for codons $i$ and $j$ and intermediate codons involved in pathway $k$ between codons $i$ and $j$. For example, if intermediate codons $i'$ and $i''$ are involved in pathway $k$ between codons $i$ and $j$, $\bar{s}_{k,ij} = (s_i + s_{i'} + s_{i''} + s_j)/4$. The number ($n_{ij}$) of nonsynonymous sites for a pair of codons $i$ and $j$ is estimated by $\hat{n}_{ij} = 3 - \hat{s}_{ij}$. The numbers of synonymous ($S$) and nonsynonymous ($N$) sites for a pair of nucleotide sequences of $L$ codons are estimated by $\hat{S} = \sum \hat{s}_{ij}$ and $\hat{N} = \sum \hat{n}_{ij} = 3L - \hat{S}$ respectively, where $\sum$ stands for the summation of $\hat{s}_{ij}$ or $\hat{n}_{ij}$ over all codon pairs in the nucleotide sequences.

Let $s_{d,ij}$ and $n_{d,ij}$ be the numbers of synonymous and nonsynonymous differences between codons $i$ and $j$. When only one nucleotide difference is observed between codons $i$ and $j$, the nucleotide difference is assigned as one synonymous ($\hat{s}_{d,ij} = 1$, $\hat{n}_{d,ij} = 0$) or one nonsynonymous ($\hat{s}_{d,ij} = 0$, $\hat{n}_{d,ij} = 1$) difference. When two or three nucleotide differences are observed between codons $i$ and $j$, the numbers of synonymous and nonsynonymous differences between the codons are estimated as the weighted averages of the $\hat{s}_d$ and $\hat{n}_d$ values for adjacent codons involved in pathways between codons $i$ and $j$:

$$\hat{s}_{d,ij} = \sum_k \omega_{k,ij}\,\hat{s}_{dk,ij} \tag{6}$$

and

$$\hat{n}_{d,ij} = \sum_k \omega_{k,ij}\,\hat{n}_{dk,ij}, \tag{7}$$

where $\hat{s}_{dk,ij}$ and $\hat{n}_{dk,ij}$ are the numbers of synonymous and nonsynonymous differences observed in pathway $k$ between codons $i$ and $j$. For example, if intermediate codons $i'$ and $i''$ are involved in pathway $k$ between codons $i$ and $j$, $\hat{s}_{dk,ij} = \hat{s}_{d,ii'} + \hat{s}_{d,i'i''} + \hat{s}_{d,i''j}$ and $\hat{n}_{dk,ij} = \hat{n}_{d,ii'} + \hat{n}_{d,i'i''} + \hat{n}_{d,i''j}$. The numbers of synonymous ($S_d$) and nonsynonymous ($N_d$) differences between a pair of nucleotide sequences are estimated by $\hat{S}_d = \sum \hat{s}_{d,ij}$ and $\hat{N}_d = \sum \hat{n}_{d,ij}$ respectively, where $\sum$ stands for the summation of $\hat{s}_{d,ij}$ or $\hat{n}_{d,ij}$ over all codon pairs in the nucleotide sequences.

To correct for multiple substitutions the MY method uses Jukes and Cantor's formula. Thus estimates of $d_S$ and $d_N$ are given by

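$$\hat{d}_S = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\hat{p}_S\right) \tag{8}$$

and

$$\hat{d}_N = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\hat{p}_N\right) \tag{9}$$

respectively, where $\hat{p}_S = \hat{S}_d/\hat{S}$ and $\hat{p}_N = \hat{N}_d/\hat{N}$. In equations (8) and (9) sites are treated as if all substitutions are either synonymous or nonsynonymous.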
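The arithmetic of equations (2), (3), (8) and (9) can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the distances passed to `pathway_weights` are made-up numbers, not values from Miyata et al.'s (1979) table.

```python
import math

def acceptance_rate(delta):
    """Rate of acceptance for one codon change, as in equation (2)."""
    if delta == 0:          # synonymous change
        return 1.0
    if delta < 3.465:
        return 1.0 - delta / 3.5
    return 0.01

def pathway_weights(pathways):
    """Pathway weights as in equation (3).

    `pathways` is a list of pathways; each pathway is the list of
    amino acid distances for its successive single-nucleotide steps.
    """
    probs = [math.prod(acceptance_rate(d) for d in steps) for steps in pathways]
    total = sum(probs)
    return [p / total for p in probs]

def jukes_cantor(p):
    """Jukes and Cantor's correction used in equations (8) and (9)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical distances, chosen only to illustrate the arithmetic:
# pathway 1 has one synonymous step, pathway 2 has none.
weights = pathway_weights([[0.0, 1.75], [1.75, 1.75]])
```

With these made-up inputs the pathway containing a synonymous step receives twice the weight of the other pathway.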

2.2 Nei and Gojobori's (NG) method

Nei and Gojobori (1986) developed two methods for estimating $d_S$ and $d_N$. In this paper, however, I shall not describe their method II because (i) their method I is used much more often than their method II, and (ii) their method II cannot estimate $d_S$ and $d_N$ accurately unless the divergence of the nucleotide sequences compared is small and the number of codons common to the sequences is large. Even when the nucleotide sequences compared are long and not greatly divergent, the performance of method II is not superior to that of method I. In this paper I refer to Nei and Gojobori's method I as the NG method.

The NG method is a simplified version of the MY method. Intermediate codons involved in pathways between a pair of codons are not considered when the numbers of synonymous ($S$) and nonsynonymous ($N$) sites are estimated. Estimates of $S$ and $N$ for a nucleotide sequence of $L$ codons are given by $\hat{S} = \sum s_i$ and $\hat{N} = \sum n_i = 3L - \hat{S}$ respectively, where $\sum$ stands for the summation of $s_i$ or $n_i$ over all codons in the nucleotide sequence. In practice the estimates of $S$ or $N$ for the two nucleotide sequences compared are not always the same. In such a case the average of the estimates for the two nucleotide sequences is used as $\hat{S}$ or $\hat{N}$.

Using an equal weight ($\omega_{k,ij}$) for all pathways between codons $i$ and $j$, the numbers of synonymous and nonsynonymous differences between the codons are estimated by equations (6) and (7) respectively. The numbers of synonymous ($S_d$) and nonsynonymous ($N_d$) differences between a pair of nucleotide sequences are estimated by $\hat{S}_d = \sum \hat{s}_{d,ij}$ and $\hat{N}_d = \sum \hat{n}_{d,ij}$ respectively, where $\sum$ stands for the summation of $\hat{s}_{d,ij}$ or $\hat{n}_{d,ij}$ over all codon pairs in the nucleotide sequences. As in the MY method, nucleotide substitutions are assumed to follow Jukes and Cantor's model. Thus $d_S$ and $d_N$ are estimated by equations (8) and (9) respectively.
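The per-codon site counting under Jukes and Cantor's model can be sketched in Python as below. The names `CODE`, `syn_sites_jc` and `ng_sites` are illustrative. The sketch assumes the standard genetic code and counts mutations to stop codons as nonsynonymous, a choice the description above does not spell out.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def syn_sites_jc(codon):
    """Number of synonymous sites s_i for one codon (equal mutation rates)."""
    aa = CODE[codon]
    s = 0.0
    for pos in range(3):
        syn = 0
        for b in BASES:
            if b == codon[pos]:
                continue
            mutant = codon[:pos] + b + codon[pos + 1:]
            # Assumption: a change to a stop codon counts as nonsynonymous.
            if CODE[mutant] == aa:
                syn += 1
        s += syn / 3          # proportion of synonymous mutations at this site
    return s

def ng_sites(seq):
    """Return (S, N) for a coding sequence whose length is a multiple of 3."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    S = sum(syn_sites_jc(c) for c in codons)
    return S, 3 * len(codons) - S
```

For codon TTT this reproduces the value 1/3 of equation (4); for two sequences, the two `ng_sites` results would be averaged as described above.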

2.3 Ina's methods

In Ina's methods, which are extensions of the NG method, mutations are assumed to follow Kimura's two-parameter model ($\lambda_{TC} = \lambda_{CT} = \lambda_{AG} = \lambda_{GA} = \alpha$ and $\lambda_{TA} = \lambda_{TG} = \lambda_{CA} = \lambda_{CG} = \lambda_{AT} = \lambda_{AC} = \lambda_{GT} = \lambda_{GC} = \beta$). Under this assumption, the number ($s_i$) of synonymous sites for codon $i$ becomes much simpler than that in the general case (e.g. equation (1) for $s_{\mathrm{TTT}}$). For example, the number of synonymous sites for codon TTT is given by

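$$s_{\mathrm{TTT}} = \frac{\lambda_{TC}}{\lambda_{TC} + \lambda_{TA} + \lambda_{TG}} = \frac{\alpha}{\alpha + 2\beta} = \frac{\alpha/\beta}{\alpha/\beta + 2}. \tag{10}$$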
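Equation (10) generalizes to any codon by weighting each transition by the $\alpha/\beta$ ratio and each transversion by 1. A minimal Python sketch follows; the function name `syn_sites_k2p` is hypothetical, the standard genetic code is assumed, and mutations to stop codons are counted as nonsynonymous (an assumption of this sketch).

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
TRANSITIONS = {"TC", "CT", "AG", "GA"}

def syn_sites_k2p(codon, ratio):
    """Synonymous sites for one codon when alpha/beta equals `ratio`."""
    aa = CODE[codon]
    s = 0.0
    for pos in range(3):
        syn = 0.0
        total = 0.0
        for b in BASES:
            if b == codon[pos]:
                continue
            # Relative mutation rate: alpha/beta for a transition, 1 otherwise.
            weight = ratio if codon[pos] + b in TRANSITIONS else 1.0
            total += weight
            mutant = codon[:pos] + b + codon[pos + 1:]
            if CODE[mutant] == aa:   # stop codons never match a sense codon
                syn += weight
        s += syn / total
    return s
```

With `ratio = 1` the function returns the Jukes and Cantor value, e.g. 1/3 for TTT.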

The value of $s_i$ for codon TTT depends on the $\alpha/\beta$ ratio alone. Similarly it can be shown that the values of $s_i$ for the other codons are also dependent on the $\alpha/\beta$ ratio alone. Thus we can estimate the total numbers of synonymous and nonsynonymous sites for a given nucleotide sequence if we obtain an estimate of the $\alpha/\beta$ ratio. Ina proposed two approaches to estimate the $\alpha/\beta$ ratio: from the ratio of transitional substitution rate ($\alpha_3$) to transversional substitution rate ($\beta_3$) at the third position of codons (Ina1 method), or from the ratio of transitional substitution rate ($\alpha_S$) to transversional substitution rate ($\beta_S$) at synonymous sites (Ina2 method).

I shall first explain the Ina1 method. Using Kimura's two-parameter method, we estimate the transitional and transversional substitution rates at the third position of codons as

$$\hat{\alpha}_3 t = -\frac{1}{4}\ln(1 - 2\hat{P}_3 - \hat{Q}_3) + \frac{1}{8}\ln(1 - 2\hat{Q}_3) \tag{11}$$

and

$$\hat{\beta}_3 t = -\frac{1}{8}\ln(1 - 2\hat{Q}_3) \tag{12}$$

respectively, where $t$ is the divergence time of the two nucleotide sequences compared, and $\hat{P}_3$ and $\hat{Q}_3$ are the estimates of the proportions of transitional and transversional differences respectively at the third position of codons. Although $t$ is unknown in general, we do not have to know $t$ in the present method for estimating $d_S$ and $d_N$. This is because the $t$'s cancel each other out in $\hat{\alpha}_3 t/\hat{\beta}_3 t$. From the resulting value of $\hat{\alpha}_3 t/\hat{\beta}_3 t$, we estimate the number of synonymous sites for codon $i$, assuming that $\alpha/\beta \approx \hat{\alpha}_3 t/\hat{\beta}_3 t$.
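The ratio used by the Ina1 method can be computed from the observed proportions of third-position differences as sketched below. The function names are hypothetical, and the scaling of the two rates follows Kimura's (1980) parameterization with $t$ taken as the time since divergence, which is an assumption of this sketch; the ratio itself does not depend on that scaling.

```python
import math

def k2p_rates(P, Q):
    """Return (alpha*t, beta*t) from proportions of transitional (P)
    and transversional (Q) differences under the two-parameter model."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("P and Q are too large for the correction to apply")
    alpha_t = -0.25 * math.log(a) + 0.125 * math.log(b)
    beta_t = -0.125 * math.log(b)
    return alpha_t, beta_t

def ts_tv_ratio(P, Q):
    """Estimate of alpha/beta; t cancels in the ratio."""
    alpha_t, beta_t = k2p_rates(P, Q)
    if beta_t == 0.0:
        raise ValueError("no transversional differences: ratio is undefined")
    return alpha_t / beta_t
```

The `ValueError` for `Q = 0` reflects the practical limitation that the ratio cannot be estimated when no transversional differences are observed.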

The total numbers of synonymous ($S$) and nonsynonymous ($N$) sites for a given nucleotide sequence of $L$ codons are estimated by $\hat{S} = \sum \hat{s}_i$ and $\hat{N} = 3L - \hat{S}$ respectively, where summation is made over all codons in the nucleotide sequence. In practice the estimates of $S$ or $N$ for the two nucleotide sequences compared are not always the same. In such a case the average of the estimates for the two nucleotide sequences is used as $\hat{S}$ or $\hat{N}$.

The numbers of synonymous and nonsynonymous differences are estimated in a manner similar to the NG method. However, Ina's methods consider transitional and transversional changes separately, whereas the NG method does not distinguish these changes. Let $s_{Ts,ij}$ and $s_{Tv,ij}$ be the numbers of synonymous transitional and transversional differences respectively between codons $i$ and $j$. Furthermore, let $n_{Ts,ij}$ and $n_{Tv,ij}$ be the numbers of nonsynonymous transitional and transversional differences respectively between codons $i$ and $j$.

When only one nucleotide difference is observed between a pair of codons, we can immediately assign it as one synonymous or nonsynonymous difference. The difference can be further identified as a transitional or transversional difference. For example, let us consider codons TTT and TTC, both of which encode phenylalanine. The nucleotide difference at the third position is counted as one synonymous transitional difference. Thus we have $\hat{s}_{Ts,\mathrm{TTT,TTC}} = 1$, $\hat{s}_{Tv,\mathrm{TTT,TTC}} = 0$, $\hat{n}_{Ts,\mathrm{TTT,TTC}} = 0$ and $\hat{n}_{Tv,\mathrm{TTT,TTC}} = 0$.

When two or three nucleotide differences are observed between a pair of codons, two or more pathways between the codons are possible. As an example, consider codons ATT and ACG, between which there are two possible pathways. In pathway 1 (ATT ↔ ACT ↔ ACG), one synonymous transversional difference and one nonsynonymous transitional difference are involved. In pathway 2 (ATT ↔ ATG ↔ ACG), one nonsynonymous transitional difference and one nonsynonymous transversional difference are involved.
We assume, as in the NG method, that these pathways can occur with equal probability (1/2). Thus we have $\hat{s}_{Ts,\mathrm{ATT,ACG}} = (0 + 0)/2 = 0$, $\hat{s}_{Tv,\mathrm{ATT,ACG}} = (1 + 0)/2 = 1/2$, $\hat{n}_{Ts,\mathrm{ATT,ACG}} = (1 + 1)/2 = 1$, and $\hat{n}_{Tv,\mathrm{ATT,ACG}} = (0 + 1)/2 = 1/2$. Three nucleotide differences observed between a pair of codons are assigned in a similar way, although it is much more complicated. If a stop codon is involved in any pathway, that pathway is eliminated, reflecting the assumption that expression of truncated proteins is irreversible in the evolutionary process.

The total numbers of synonymous transitional ($S_{Ts}$) and transversional ($S_{Tv}$) differences between two nucleotide sequences are estimated by $\hat{S}_{Ts} = \sum \hat{s}_{Ts,ij}$ and $\hat{S}_{Tv} = \sum \hat{s}_{Tv,ij}$ respectively, where summation of $\hat{s}_{Ts,ij}$ and $\hat{s}_{Tv,ij}$ is made over all codon pairs between the two nucleotide sequences. Similarly the total numbers of nonsynonymous transitional ($N_{Ts}$) and transversional ($N_{Tv}$) differences between the two nucleotide sequences are estimated by $\hat{N}_{Ts} = \sum \hat{n}_{Ts,ij}$ and $\hat{N}_{Tv} = \sum \hat{n}_{Tv,ij}$. Note that $\hat{S}_{Ts} + \hat{N}_{Ts} = \hat{L}_{1,Ts} + \hat{L}_{2,Ts} + \hat{L}_{3,Ts}$ and $\hat{S}_{Tv} + \hat{N}_{Tv} = \hat{L}_{1,Tv} + \hat{L}_{2,Tv} + \hat{L}_{3,Tv}$, where $\hat{L}_{k,Ts}$ and $\hat{L}_{k,Tv}$ are the estimates of the total numbers of transitional and transversional differences at position $k$ of codons between the two nucleotide sequences. The proportions of synonymous transitional ($P_S$) and transversional ($Q_S$) differences are estimated by

$$\hat{P}_S = \frac{\hat{S}_{Ts}}{\hat{S}} \tag{13}$$

and

$$\hat{Q}_S = \frac{\hat{S}_{Tv}}{\hat{S}} \tag{14}$$

respectively.
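The equal-weight pathway counting described above can be sketched in Python as follows. The names `step_class` and `codon_differences` are hypothetical, the standard genetic code is assumed, and both input codons are assumed to be sense codons.

```python
from itertools import permutations

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
TRANSITIONS = {"TC", "CT", "AG", "GA"}

def step_class(c1, c2):
    """Index of a single-nucleotide step: 0 = synonymous transition,
    1 = synonymous transversion, 2 = nonsynonymous transition,
    3 = nonsynonymous transversion."""
    pos = next(i for i in range(3) if c1[i] != c2[i])
    is_ts = c1[pos] + c2[pos] in TRANSITIONS
    is_syn = CODE[c1] == CODE[c2]
    return (0 if is_syn else 2) + (0 if is_ts else 1)

def codon_differences(ci, cj):
    """Return (s_Ts, s_Tv, n_Ts, n_Tv) averaged over all shortest pathways
    that do not pass through a stop codon."""
    diff = [p for p in range(3) if ci[p] != cj[p]]
    totals = [0.0, 0.0, 0.0, 0.0]
    n_paths = 0
    for order in permutations(diff):
        counts = [0, 0, 0, 0]
        cur = ci
        for p in order:
            nxt = cur[:p] + cj[p] + cur[p + 1:]
            if CODE[nxt] == "*":      # pathway through a stop codon: discard
                break
            counts[step_class(cur, nxt)] += 1
            cur = nxt
        else:                          # pathway completed without a stop codon
            n_paths += 1
            totals = [t + c for t, c in zip(totals, counts)]
    if n_paths == 0:
        raise ValueError("every pathway passes through a stop codon")
    return tuple(t / n_paths for t in totals)
```

Applied to the pair ATT and ACG, the function returns the values 0, 1/2, 1 and 1/2 worked out in the text.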

To correct for multiple substitutions, we use Kimura's two-parameter formula. An estimate of $d_S$ is obtained by

$$\hat{d}_S = -\frac{1}{2}\ln\left\{(1 - 2\hat{P}_S - \hat{Q}_S)\sqrt{1 - 2\hat{Q}_S}\right\}. \tag{15}$$
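Formula (15) is Kimura's two-parameter distance applied to the synonymous proportions. A minimal Python sketch, with hypothetical function names:

```python
import math

def k2p_distance(P, Q):
    """Kimura's two-parameter distance from proportions of transitional (P)
    and transversional (Q) differences."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("P and Q are too large for the correction to apply")
    return -0.5 * math.log(a * math.sqrt(b))

def ina_distance(n_ts, n_tv, n_sites):
    """d_S from (S_Ts, S_Tv, S), or d_N from (N_Ts, N_Tv, N)."""
    return k2p_distance(n_ts / n_sites, n_tv / n_sites)
```

The same function gives the nonsynonymous estimate when it is called with the nonsynonymous counts and the number of nonsynonymous sites.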

We can obtain an estimate of $d_N$ by replacing $\hat{P}_S$ and $\hat{Q}_S$ with $\hat{P}_N$ and $\hat{Q}_N$ respectively in formula (15), where $\hat{P}_N = \hat{N}_{Ts}/\hat{N}$ and $\hat{Q}_N = \hat{N}_{Tv}/\hat{N}$. This correction implies that the sequences comprise entirely synonymous or entirely nonsynonymous sites.

In the Ina2 method we use an iterative procedure to estimate the $\alpha/\beta$ ratio. The other steps (i.e. estimation of the numbers of synonymous and nonsynonymous sites and differences) are the same as in the Ina1 method. Let $\hat{\alpha}_{S,r}$ and $\hat{\beta}_{S,r}$ be the estimates of the synonymous transitional and transversional substitution rates, respectively, at the $r$th iteration cycle. Let $\hat{S}_r$ be the estimate of the number of synonymous sites at the $r$th iteration cycle. Furthermore, let $\hat{P}_{S,r}$ and $\hat{Q}_{S,r}$ be the estimates of the proportions of synonymous transitional and transversional differences respectively at the $r$th iteration cycle. We use the number ($L$) of codons compared as $\hat{S}_0$, so $\hat{P}_{S,1} = \hat{S}_{Ts}/\hat{S}_0$ and $\hat{Q}_{S,1} = \hat{S}_{Tv}/\hat{S}_0$. Note that $\hat{S}_{Ts}$ and $\hat{S}_{Tv}$ are estimated in the same way as in the Ina1 method. At the first iteration cycle $\hat{\alpha}_{S,1}$ and $\hat{\beta}_{S,1}$ are obtained by

$$\hat{\alpha}_{S,1} t = -\frac{1}{4}\ln(1 - 2\hat{P}_{S,1} - \hat{Q}_{S,1}) + \frac{1}{8}\ln(1 - 2\hat{Q}_{S,1}) \tag{16}$$

and

$$\hat{\beta}_{S,1} t = -\frac{1}{8}\ln(1 - 2\hat{Q}_{S,1}) \tag{17}$$

respectively, where $t$ is the divergence time of the two nucleotide sequences compared. Unlike in the Ina1 method, we cannot use the $\hat{\alpha}_{S,1} t/\hat{\beta}_{S,1} t$ value as an estimate of the $\alpha/\beta$ ratio to estimate the values of $s_i$. This is because $\alpha_S/\beta_S > \alpha/\beta$. Note that although the nucleotide changes at the first position between codons CGA and AGA and between codons CGG and AGG (two-fold-degenerate sites) are synonymous and transversional, synonymous changes at most two-fold-degenerate sites are transitional alone. Note also that although synonymous transversional changes occur at the third position of codons ATT, ATC and ATA (three-fold-degenerate sites), the frequencies of these codons are much less than those of two-fold-degenerate codons.

We introduce a weighting factor ($W$) so that $\alpha/\beta \approx W\alpha_S/\beta_S$ can hold. Since most synonymous changes occur at the third position of codons, here we consider only the third position of codons. We treat three-fold-degenerate codons as two-fold-degenerate codons. At the third position of two-fold-degenerate codons, only transitional substitutions are synonymous. At the third position of four-fold-degenerate codons, both transitional and transversional substitutions are synonymous. Thus $\alpha_S/\beta_S \approx (q_2\alpha + q_3\alpha + q_4\alpha)/(q_4\beta)$, where $q_2$, $q_3$ and $q_4$ are the frequencies of two-fold-degenerate, three-fold-degenerate and four-fold-degenerate codons respectively. We have, therefore, $\alpha/\beta \approx [q_4/(q_2 + q_3 + q_4)]\,\alpha_S/\beta_S$. The weighting factor, $W$, is given by

$$W = \frac{q_4}{q_2 + q_3 + q_4}. \tag{18}$$

Replacing $q_2$, $q_3$ and $q_4$ with their estimates in the two nucleotide sequences compared, we estimate $W$.
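The weighting factor of equation (18) can be estimated from codon counts as sketched below. The function names are hypothetical; classifying codons by the number of synonymous alternatives at the third position under the standard genetic code, and pooling the codons of both sequences into one list, are assumptions of this sketch.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def third_position_degeneracy(codon):
    """Return 0, 2, 3 or 4 for a nondegenerate, two-fold-, three-fold- or
    four-fold-degenerate third position."""
    aa = CODE[codon]
    syn = sum(1 for b in BASES
              if b != codon[2] and CODE[codon[:2] + b] == aa)
    return {0: 0, 1: 2, 2: 3, 3: 4}[syn]

def weighting_factor(codons):
    """W = q4 / (q2 + q3 + q4), using counts in place of frequencies."""
    counts = {0: 0, 2: 0, 3: 0, 4: 0}
    for codon in codons:
        counts[third_position_degeneracy(codon)] += 1
    denom = counts[2] + counts[3] + counts[4]
    if denom == 0:
        raise ValueError("no degenerate codons in the input")
    return counts[4] / denom
```

Because $W$ is a ratio, codon counts can be used directly instead of frequencies.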

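Let $\hat{\alpha}_r/\hat{\beta}_r$ be the estimate of the $\alpha/\beta$ ratio at the $r$th iteration cycle. From $\hat{\alpha}_{S,1} t$, $\hat{\beta}_{S,1} t$ and $\hat{W}$, $\hat{\alpha}_1/\hat{\beta}_1$ is computed by

$$\frac{\hat{\alpha}_1}{\hat{\beta}_1} = \hat{W}\,\frac{\hat{\alpha}_{S,1} t}{\hat{\beta}_{S,1} t}. \tag{19}$$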

As in the Ina1 method, $t$ in equations (16) and (17) cancels out. Thus, from the resulting value of $\hat{\alpha}_1/\hat{\beta}_1$, the number of synonymous sites ($s_i$) for codon $i$ can be estimated. The total number of synonymous sites for a given nucleotide sequence is estimated at the first iteration cycle by $\hat{S}_1 = \sum \hat{s}_{i,1}$, where $\hat{s}_{i,1}$ is the estimate of $s_i$ at the first iteration cycle and $\sum$ stands for the summation of $\hat{s}_{i,1}$ over all codons in the nucleotide sequence. When the $\hat{S}_1$'s are not the same for the two nucleotide sequences compared, the average of the estimates for the two nucleotide sequences is used as $\hat{S}_1$. Similarly, at the $r$th iteration cycle, $\hat{S}_r$ is computed from $\hat{P}_{S,r} = \hat{S}_{Ts}/\hat{S}_{r-1}$, $\hat{Q}_{S,r} = \hat{S}_{Tv}/\hat{S}_{r-1}$, and $\hat{W}$. This iterative procedure is continued until $\hat{S}_r$ converges to $\hat{S}_\infty$ ($\hat{S}_r = \hat{S}_{r+1} = \cdots = \hat{S}_\infty$). The number ($N$) of nonsynonymous sites for a given nucleotide sequence of $L$ codons is given by $\hat{N}_\infty = 3L - \hat{S}_\infty$.

To correct for multiple substitutions we use Kimura's two-parameter formula. An estimate of $d_S$ is obtained by

$$\hat{d}_S = -\frac{1}{2}\ln\left\{(1 - 2\hat{P}_{S,\infty} - \hat{Q}_{S,\infty})\sqrt{1 - 2\hat{Q}_{S,\infty}}\right\}. \tag{20}$$

We can obtain an estimate of $d_N$ by replacing $\hat{P}_{S,\infty}$ and $\hat{Q}_{S,\infty}$ with $\hat{P}_{N,\infty}$ and $\hat{Q}_{N,\infty}$ respectively in formula (20), where $\hat{P}_{N,\infty} = \hat{N}_{Ts}/\hat{N}_\infty$ and $\hat{Q}_{N,\infty} = \hat{N}_{Tv}/\hat{N}_\infty$.

The difference between the Ina1 and Ina2 methods lies only in estimation of the $\alpha/\beta$ ratio. For a given value of the $\alpha/\beta$ ratio, these methods give the same estimates of $d_S$, $d_N$, $S$ and $N$. In the special case of $\alpha/\beta = 1$, the Ina methods give the same estimates of $S$ and $N$ as the NG method. Furthermore, in this case, if we use Jukes and Cantor's formula instead of Kimura's two-parameter formula to correct for multiple substitutions, the Ina methods reduce to the NG method. Note that, if we do not distinguish transitional and transversional differences, the estimates of the numbers of synonymous and nonsynonymous differences are always the same for the NG method and the Ina methods ($\hat{S}_d = \hat{S}_{Ts} + \hat{S}_{Tv}$ and $\hat{N}_d = \hat{N}_{Ts} + \hat{N}_{Tv}$).
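The iteration of the Ina2 method can be sketched in Python as below. This is a rough illustration under several assumptions: the names are hypothetical, $\hat{S}_{Ts}$, $\hat{S}_{Tv}$ and $\hat{W}$ are taken as already computed inputs, the standard genetic code is used, mutations to stop codons are counted as nonsynonymous, and convergence is declared when successive values of $\hat{S}_r$ differ by less than a tolerance.

```python
import math

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
TRANSITIONS = {"TC", "CT", "AG", "GA"}

def syn_sites_k2p(codon, ratio):
    """Synonymous sites for one codon when alpha/beta equals `ratio`."""
    aa, s = CODE[codon], 0.0
    for pos in range(3):
        syn = total = 0.0
        for b in BASES:
            if b == codon[pos]:
                continue
            weight = ratio if codon[pos] + b in TRANSITIONS else 1.0
            total += weight
            if CODE[codon[:pos] + b + codon[pos + 1:]] == aa:
                syn += weight
        s += syn / total
    return s

def ina2_sites(codons1, codons2, S_ts, S_tv, W, tol=1e-8, max_iter=100):
    """Iterate until S converges; return (S, N, alpha/beta)."""
    L = len(codons1)
    S_prev = float(L)                      # S_0 = L
    for _ in range(max_iter):
        P, Q = S_ts / S_prev, S_tv / S_prev
        a, b = 1.0 - 2.0 * P - Q, 1.0 - 2.0 * Q
        if a <= 0.0 or b <= 0.0 or Q == 0.0:
            raise ValueError("ratio cannot be estimated from these counts")
        alpha_t = -0.25 * math.log(a) + 0.125 * math.log(b)
        beta_t = -0.125 * math.log(b)
        ratio = W * alpha_t / beta_t       # t cancels in the ratio
        # Average the site counts of the two sequences.
        S = 0.5 * (sum(syn_sites_k2p(c, ratio) for c in codons1)
                   + sum(syn_sites_k2p(c, ratio) for c in codons2))
        if abs(S - S_prev) < tol:
            return S, 3 * L - S, ratio
        S_prev = S
    raise RuntimeError("iteration did not converge")
```

The `ValueError` branch corresponds to the situation, noted later for short and closely related sequences, in which the numbers of sites cannot be estimated.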

2.4 Li, Wu and Luo's (LWL) method

Unlike the MY and NG methods, the LWL method assumes that nucleotide substitutions follow Kimura's two-parameter model. Nucleotide mutations, however, are assumed to follow Jukes and Cantor's model as in the MY and NG methods. Nucleotide sites are classified into nondegenerate, two-fold-degenerate and four-fold-degenerate sites, and three-fold-degenerate sites are treated as two-fold-degenerate sites. Nucleotide differences at the third position between codons ATT and ATA and between codons ATC and ATA are regarded as transitional differences. Nucleotide differences at the first position between codon CGA or CGG and codon AGA or AGG are regarded as transitional differences. In reality the above differences are transversional differences. Thus these definitions are a theoretical problem in the LWL method.

The numbers of nondegenerate ($L_0$), two-fold-degenerate ($L_2$) and four-fold-degenerate ($L_4$) sites are estimated as the averages of the corresponding numbers for the nucleotide sequences compared. Since all nucleotide mutations at four-fold-degenerate sites are synonymous, these sites are synonymous sites. At two-fold-degenerate sites only transitional mutations are synonymous, and the probability that such mutations occur is 1/3 in Jukes and Cantor's model. Thus the number of synonymous sites ($S$) for a nucleotide sequence is estimated by

$$\hat{S} = \frac{1}{3}L_2 + L_4. \tag{21}$$

Since the total number of nucleotide sites is $L_0 + L_2 + L_4$, the number of nonsynonymous sites ($N$) is estimated by

$$\hat{N} = L_0 + L_2 + L_4 - \hat{S} = L_0 + \frac{2}{3}L_2. \tag{22}$$

The first term of the rightmost side in equation (22) represents the number of nonsynonymous sites at nondegenerate sites because any mutations at nondegenerate sites are nonsynonymous.
The second term of the rightmost side in equation (22) represents the number of nonsynonymous sites at two-fold-degenerate sites, because at two-fold-degenerate sites only transversional mutations are nonsynonymous and the probability that such mutations occur is 2/3 under Jukes and Cantor's model. In practice the estimates of $S$ and $N$ for the two nucleotide sequences compared are not always the same. In such a case the average of the estimates for the two nucleotide sequences is used as $\hat{S}$ or $\hat{N}$.

In the LWL method the numbers of nucleotide substitutions per site are estimated at nondegenerate, two-fold-degenerate and four-fold-degenerate sites separately. In Kimura's two-parameter method the numbers of transitional ($A_i$) and transversional ($B_i$) substitutions at $i$-fold-degenerate sites are estimated by

$$\hat{A}_i = -\frac{1}{2}\ln(1 - 2\hat{P}_i - \hat{Q}_i) + \frac{1}{4}\ln(1 - 2\hat{Q}_i) \tag{23}$$

and

$$\hat{B}_i = -\frac{1}{2}\ln(1 - 2\hat{Q}_i) \tag{24}$$

respectively, where $\hat{P}_i$ and $\hat{Q}_i$ are the estimates of the proportions of transitional and transversional differences at $i$-fold-degenerate sites respectively. The number ($K_i$) of nucleotide substitutions at $i$-fold-degenerate sites is estimated by $\hat{K}_i = \hat{A}_i + \hat{B}_i$. Only transitional substitutions are synonymous at two-fold-degenerate sites and all nucleotide substitutions are synonymous at four-fold-degenerate sites. Only transversional substitutions are nonsynonymous at two-fold-degenerate sites and all nucleotide substitutions are nonsynonymous at nondegenerate sites. Thus we can estimate $d_S$ and $d_N$ by

$$\hat{d}_S = \frac{L_2\hat{A}_2 + L_4\hat{K}_4}{\frac{1}{3}L_2 + L_4} \tag{25}$$

and

$$\hat{d}_N = \frac{L_2\hat{B}_2 + L_0\hat{K}_0}{\frac{2}{3}L_2 + L_0}. \tag{26}$$
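Equations (23) to (26) can be combined into a short Python sketch. The function names and the argument layout (a pair of proportions for each degeneracy class) are hypothetical choices made for this illustration.

```python
import math

def k2p_counts(P, Q):
    """Numbers of transitional (A) and transversional (B) substitutions
    per site, as in equations (23) and (24)."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("P and Q are too large for the correction to apply")
    A = -0.5 * math.log(a) + 0.25 * math.log(b)
    B = -0.5 * math.log(b)
    return A, B

def lwl_estimates(L0, L2, L4, PQ0, PQ2, PQ4):
    """Return (d_S, d_N) by the LWL method.

    L0, L2, L4: numbers of nondegenerate, two-fold and four-fold sites.
    PQ0, PQ2, PQ4: (P_i, Q_i) pairs for the three classes of sites.
    """
    A0, B0 = k2p_counts(*PQ0)
    A2, B2 = k2p_counts(*PQ2)
    A4, B4 = k2p_counts(*PQ4)
    K0, K4 = A0 + B0, A4 + B4
    d_S = (L2 * A2 + L4 * K4) / (L2 / 3.0 + L4)
    d_N = (L2 * B2 + L0 * K0) / (2.0 * L2 / 3.0 + L0)
    return d_S, d_N
```

The denominators are the numbers of synonymous and nonsynonymous sites of equations (21) and (22).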

2.5 Pamilo and Bianchi's and Li's (PBL) method

The PBL method is a modified version of the LWL method. Here also it is assumed that nucleotide substitutions follow Kimura's two-parameter model. The method for estimating $P_i$, $Q_i$, $A_i$ and $B_i$ is the same for the PBL and LWL methods. However, the weighting methods for estimating $d_S$ and $d_N$ are different between the two methods. This is because, unlike the LWL method, the PBL method does not assume Jukes and Cantor's model as a mutation matrix. In the PBL method estimates of $d_S$ and $d_N$ are given by

$$\hat{d}_S = \frac{L_2\hat{A}_2 + L_4\hat{A}_4}{L_2 + L_4} + \hat{B}_4 \tag{27}$$

and

$$\hat{d}_N = \hat{A}_0 + \frac{L_0\hat{B}_0 + L_2\hat{B}_2}{L_0 + L_2} \tag{28}$$

respectively. Note that transitional substitutions are synonymous at both two-fold-degenerate and four-fold-degenerate sites, whereas transversional substitutions are synonymous only at four-fold-degenerate sites. Note also that transitional substitutions are nonsynonymous only at nondegenerate sites, whereas transversional substitutions are nonsynonymous at both nondegenerate and two-fold-degenerate sites.

As in the LWL method, three-fold-degenerate sites are regarded as two-fold-degenerate sites. Nucleotide changes at the third position between codons ATT and ATA and between codons ATC and ATA are regarded as transitional changes. Nucleotide changes at the first position between codon CGA or CGG and codon AGA or AGG are regarded as transitional changes. As in the case of the LWL method, these definitions are a theoretical problem with the PBL method, which was partly resolved by Comeron (1995).

Since the PBL method does not assume a particular mutation matrix, the method cannot estimate $S$ and $N$. However, it is necessary to estimate $S$ and $N$ when we estimate $d_S$ or $d_N$ for two or more genes, or take the average of $\hat{d}_S$ or $\hat{d}_N$ weighted by $\hat{S}$ or $\hat{N}$ over genes (e.g. Ohta 1993). For such purposes Ina (1995) presented formulae [equations (27) and (28) in Ina (1995)] for estimating $S$ and $N$ based on the assumption that nucleotide mutations follow Kimura's two-parameter model.
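For comparison with the LWL method, equations (27) and (28) can be sketched in Python as follows. The function names and argument layout are hypothetical; the transitional and transversional counts are computed with the same two-parameter formulae as in the LWL method.

```python
import math

def k2p_counts(P, Q):
    """Transitional (A) and transversional (B) substitutions per site."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("P and Q are too large for the correction to apply")
    return -0.5 * math.log(a) + 0.25 * math.log(b), -0.5 * math.log(b)

def pbl_estimates(L0, L2, L4, PQ0, PQ2, PQ4):
    """Return (d_S, d_N) by the PBL method.

    L0, L2, L4: numbers of nondegenerate, two-fold and four-fold sites.
    PQ0, PQ2, PQ4: (P_i, Q_i) pairs for the three classes of sites.
    """
    A0, B0 = k2p_counts(*PQ0)
    A2, B2 = k2p_counts(*PQ2)
    A4, B4 = k2p_counts(*PQ4)
    # Transitions are synonymous at two-fold and four-fold sites;
    # transversions are synonymous only at four-fold sites.
    d_S = (L2 * A2 + L4 * A4) / (L2 + L4) + B4
    # Transitions are nonsynonymous only at nondegenerate sites;
    # transversions are nonsynonymous at nondegenerate and two-fold sites.
    d_N = A0 + (L0 * B0 + L2 * B2) / (L0 + L2)
    return d_S, d_N
```

When every class of sites shows the same proportions of differences, both estimates reduce to $\hat{A} + \hat{B}$, unlike the LWL estimates, which depend on the site counts through the factor 1/3.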

2.6 Variation in substitution rates among nonsynonymous sites

In an analysis of amino acid sequences of cytochrome c, Uzzell and Corbin (1971) found that the rate of amino acid substitution varies among sites and that the rate approximately follows a gamma distribution. It is also known (Dayhoff et al. 1978; Miyata et al. 1979) that the rate of amino acid substitution depends on the difference in physicochemical properties of the amino acids interchanged. Taking into account these observations, Dayhoff et al. (1978), Kimura (1983, p. 75), Ota and Nei (1994a), and Grishin (1995) proposed various methods for estimating the number of amino acid substitutions (see also Ina 1996a). Since the rate of amino acid substitution varies among sites or amino acids interchanged, or both, it is unlikely that the rate of nonsynonymous substitution is uniform among sites or nonsynonymous codons interchanged or both. In all the above methods for estimating $d_N$, however, the rate of nonsynonymous substitution is assumed to be uniform among sites. When the rate of nonsynonymous substitution depends on Miyata et al.'s (1979) index of amino acid interchangeability, these methods tend to underestimate $d_N$, while the gamma distance methods give better estimates of $d_N$ (Nei and Gojobori 1986; Ina 1995).

At present formulae that incorporate a gamma distribution for substitution rates among sites are available only for the NG and Ina methods. However, equation (5) of Nei and Gojobori may be applicable to the MY method as well. Gamma distance formulae for the LWL and PBL methods can also be obtained by replacing equations (23) and (24) with

$$\hat{A}_i = \frac{a_i}{2}\left\{(1 - 2\hat{P}_i - \hat{Q}_i)^{-1/a_i} - \frac{1}{2}(1 - 2\hat{Q}_i)^{-1/a_i} - \frac{1}{2}\right\} \tag{29}$$

and

$$\hat{B}_i = \frac{a_i}{2}\left\{(1 - 2\hat{Q}_i)^{-1/a_i} - 1\right\} \tag{30}$$

respectively, where $a_i$ is the gamma parameter at $i$-fold-degenerate sites. Equations (29) and (30) are obtained from unnumbered equations of Jin and Nei (1990, p. 100).

There are the following problems with the gamma distance methods. (i) Several methods for estimating the gamma parameter are available [e.g. the method of moments (Johnson and Kotz 1973, p. 131), the maximum likelihood method (Yang 1993), the improved parsimony-based method (Yang and Kumar 1996)]. However, none of them was developed for estimation of the gamma parameter for nonsynonymous substitution rates. (ii) In formulae for the variance of gamma distances (e.g. Jin and Nei 1990), the gamma parameter is assumed to be estimated without any errors. (iii) It seems unlikely that the rate of nonsynonymous substitution always follows a gamma distribution. So caution is needed when we use the gamma distance methods.
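The gamma versions of the transitional and transversional counts can be sketched in Python as below. The function name is hypothetical, and the gamma parameter is taken as a known input, which is exactly the simplification questioned in point (ii) above.

```python
def gamma_k2p_counts(P, Q, a):
    """Gamma-distance versions of A_i and B_i, as in equations (29) and (30).

    P, Q: proportions of transitional and transversional differences.
    a: gamma parameter (assumed known without error in this sketch).
    """
    x = 1.0 - 2.0 * P - Q
    y = 1.0 - 2.0 * Q
    if x <= 0.0 or y <= 0.0 or a <= 0.0:
        raise ValueError("inputs are outside the range of the formulae")
    A = 0.5 * a * (x ** (-1.0 / a) - 0.5 * y ** (-1.0 / a) - 0.5)
    B = 0.5 * a * (y ** (-1.0 / a) - 1.0)
    return A, B
```

As the gamma parameter grows large, the values approach those given by the logarithmic formulae (23) and (24); for small values of the parameter the estimates are larger.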

2.7 Variances of $\hat{d}_S$ and $\hat{d}_N$

Since the process of nucleotide substitution is stochastic, it is of great importance to consider the variances of $\hat{d}_S$ and $\hat{d}_N$. Following Kimura and Ohta (1972) and Kimura (1980), formulae for the variances of $\hat{d}_S$ and $\hat{d}_N$ were proposed (Li et al. 1985a; Nei 1987, p. 76; Li 1993; Ina 1995). These formulae are approximate because the estimated numbers of synonymous and nonsynonymous sites are regarded as constants. Recently, taking into account estimation of the numbers of synonymous and nonsynonymous sites, Ota and Nei (1994b) obtained more rigorous formulae for the variances of $\hat{d}_S$ and $\hat{d}_N$ for the NG method. Their formulae may also be applicable to the MY method. Ota and Nei's method gives the variances of the means of $\hat{d}_S$ and $\hat{d}_N$ for the NG and MY methods. For the LWL, PBL and Ina methods, however, such formulae are not available. As suggested by Ina (1993, p. 78), the bootstrap method (Efron 1979) may be used for the LWL, PBL and Ina methods. Actually, Ina and Gojobori (1994) used the bootstrap method to examine the statistical significance of the difference between the mean of $\hat{d}_S$ and that of $\hat{d}_N$.
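A bootstrap estimate of the variance can be sketched in Python as follows. This is a generic illustration, not the procedure of Ina and Gojobori (1994): resampling aligned codon pairs with replacement is an assumption of this sketch, and `bootstrap_variance` and `p_distance` are hypothetical names. In practice `estimator` would be a function returning $\hat{d}_S$ or $\hat{d}_N$ for a list of codon pairs.

```python
import random
import statistics

def bootstrap_variance(codon_pairs, estimator, n_rep=500, seed=0):
    """Return (mean, variance) of `estimator` over bootstrap replicates.

    codon_pairs: list of (codon_from_seq1, codon_from_seq2) tuples.
    estimator: function mapping a list of codon pairs to a number.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(codon_pairs)
    estimates = []
    for _ in range(n_rep):
        # Resample codon positions with replacement.
        sample = [codon_pairs[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    return statistics.mean(estimates), statistics.variance(estimates)

def p_distance(pairs):
    """Toy estimator used only for demonstration: proportion of
    codon pairs that differ."""
    return sum(1 for x, y in pairs if x != y) / len(pairs)
```

Replacing `p_distance` with an estimator for one of the LWL, PBL or Ina methods would give the corresponding bootstrap variance.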

2.8 Merits and demerits of different methods for estimating $d_S$ and $d_N$

When nucleotide mutations follow Jukes and Cantor's model, the MY, NG and LWL methods may give good estimates of $d_S$ and $d_N$. Indeed, Ota and Nei's (1994b) computer simulation showed that the NG method gives good estimates of $d_S$ and $d_N$ under Jukes and Cantor's model of nucleotide mutations. In this case simple methods such as the MY and NG methods may be preferable to complicated methods such as the PBL and Ina methods because the variances of $\hat{d}_S$ and $\hat{d}_N$ are expected to be smaller for the former than for the latter.

In actual nucleotide sequences, however, it is observed (Gojobori et al. 1982a; Li et al. 1984; for review see Nei 1987, pp. 27-29, and Wakeley 1996) that transitional mutations occur more frequently than transversional ones. In this case the MY, NG and LWL methods tend to give biased estimates of $d_S$ and $d_N$ (Kondo et al. 1993; Pamilo and Bianchi 1993; Li 1993; Ina 1995). For example, applying the LWL and PBL methods to 14 genes of mouse and rat, Li (1993) found that estimates of $d_S$ were about 30% larger by the LWL method than by the PBL method. When the transition/transversion bias is strong, the PBL and Ina methods are recommended because these methods do not assume Jukes and Cantor's model for estimation of the numbers of synonymous and nonsynonymous sites.

The Ina methods are sometimes inapplicable to short and closely related nucleotide sequences, because the numbers of synonymous and nonsynonymous sites cannot be estimated. This is a problem with the Ina methods. However, if an estimate of the $\alpha/\beta$ ratio is available from other data (e.g. other parts of sequences, other pairs of sequences, intron sequences), the Ina methods may be applicable to short and closely related nucleotide sequences (for details see Ina 1995, pp. 217-220). The Ina1 method tends to give underestimates of $d_S$ and overestimates of $d_N$ when the transition/transversion bias is weak and negative selection against amino acid changes is strong. Using the estimated numbers of transitional and transversional substitutions at synonymous sites, we may be able to correct this kind of bias (for details see Ina 1995, pp. 222-223).
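To see why the assumed mutation model matters for the number of synonymous sites, the following sketch counts the synonymous sites of a single codon with equal weights for all changes (as under Jukes and Cantor's model) and with transitions weighted by an assumed $\alpha/\beta$ ratio. It is a simplified illustration, not the exact procedure of any method discussed here. The function name is hypothetical, and skipping changes to stop codons is an assumption of this sketch.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
TRANSITIONS = {("T", "C"), ("C", "T"), ("A", "G"), ("G", "A")}

def synonymous_sites(codon, ratio=1.0):
    """Expected number of synonymous sites in one sense codon.

    ratio: assumed transition/transversion rate ratio (alpha/beta);
           1.0 corresponds to equal rates for all changes.
    Changes that create a stop codon are skipped (assumption).
    """
    total = 0.0
    for pos in range(3):
        syn_weight = all_weight = 0.0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == "*":
                continue
            weight = ratio if (codon[pos], base) in TRANSITIONS else 1.0
            all_weight += weight
            if CODE[mutant] == CODE[codon]:
                syn_weight += weight
        total += syn_weight / all_weight
    return total

if __name__ == "__main__":
    for r in (1.0, 6.0):
        print(r, synonymous_sites("TTT", r), synonymous_sites("GGG", r))
```

For a two-fold-degenerate codon such as TTT the only synonymous change at the third position is a transition, so the count rises from 1/3 with equal weights to 0.75 when transitions are six times as frequent as each transversion. For a four-fold-degenerate codon such as GGG the count is 1 in both cases. Underestimating the number of synonymous sites in this way inflates the estimate of $d_S$.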

3. Statistical properties

Using computer simulation, Ina (1995) examined the statistical properties of $\hat{d}_S$ and $\hat{d}_N$ obtained by the MY, LWL, NG, PBL and Ina methods. Here I describe only a part of the results of the simulation: this is the part related to the $d_S$ vs $d_N$ neutrality test (for the $d_S$ vs $d_N$ neutrality test, see section 4). In the influenza virus mutation scheme, the transition/transversion bias is strong but the nucleotide-frequency bias is weak. On the other hand, in the pseudogene mutation scheme, the transition/transversion bias is weak but the nucleotide-frequency bias is strong. The fraction of neutral mutations was assumed to be 1 for both synonymous and nonsynonymous sites (no selection scheme). Thus the expected values of $d_S$ and $d_N$ should be the same under this selection scheme. Nucleotide sequences of 1000 codons were used in the simulation. The expected values of $d_S$ and $d_N$ were E($d_S$) = E($d_N$) = 0.1, 0.2, 0.3, ..., 1.0. The number of replications was 100 for each set of parameters.

Figure 1 shows the relation between the means of $\hat{d}_S$ and $\hat{d}_N$ obtained by the MY, LWL, NG, PBL and Ina methods. The MY, LWL and NG methods tend to overestimate $d_S$ and to underestimate $d_N$ (figure 1, a and b) even when E($d_S$) and E($d_N$) are small. These biases are stronger when the transition/transversion bias at the mutation level is stronger. This is because Jukes and Cantor's model is assumed for estimation of the numbers of synonymous and nonsynonymous sites in the MY, LWL and NG methods. It is possible that if we use these methods the $d_S$ vs $d_N$ neutrality test may not be able to detect positive selection. In other words the $d_S$ vs $d_N$ neutrality test may be too conservative if we use the MY, LWL and NG methods. Since Jukes and Cantor's model is not assumed in the PBL and Ina methods, these methods give less biased estimates of $d_S$ and $d_N$ (figure 1, c and d), in particular when the transition/transversion bias at the mutation level is strong.

Figure 1. Relation between the means of $\hat{d}_S$ (horizontal axis) and $\hat{d}_N$ (vertical axis) obtained by the NG, MY and LWL methods (a, b) and by the PBL and Ina methods (c, d). The influenza virus gene mutation scheme (a, c) and the pseudogene mutation scheme (b, d) were used. The diagonal line represents $d_S = d_N$.

Following the simulation method of Ina (1995), I examined the relation between $\hat{d}_S$ and the nucleotide-frequency bias at the third position of codons. For simplicity Tamura's (1992) model was used as mutation matrices. In this model the transition/transversion and nucleotide-frequency biases are considered. The ratio of relative transitional rate ($\alpha$) to relative transversional rate ($\beta$) was assumed to be 6. The $\theta$ values in the model were assumed to be 0.5, 0.6, 0.7, 0.8 and 0.9. (By reversing $\theta$ and $1 - \theta$ in Tamura's model, we can predict the results for $\theta$ = 0.1, 0.2, 0.3 and 0.4 from the results for 0.9, 0.8, 0.7 and 0.6 respectively.) The fraction of neutral mutations at synonymous sites was assumed to be 1. The fraction of neutral mutations at nonsynonymous sites was assumed to depend on Miyata et al.'s (1979) index, which is strongly correlated with the rate of amino acid substitution. This selection scheme is the moderate selection scheme in Ina (1995). Nucleotide sequences of 1000 codons were used in the simulation. The expected values of $d_S$ were E($d_S$) = 0.1, 0.3, 0.5, 0.7 and 1.0. The number of replications was 100 for each set of parameters.
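A mutation matrix of the kind used in this simulation can be sketched as follows. The code builds one common parameterization of a Tamura (1992)-type rate matrix, in which the rate of change to a nucleotide is the transitional or transversional rate multiplied by the equilibrium frequency of that nucleotide, with G + C content $\theta$. The function name, the scaling of the rates and the default values are assumptions of this sketch and may differ from the matrices actually used in the simulation.

```python
BASES = "TCAG"
PURINES = {"A", "G"}

def tamura_rate_matrix(theta, alpha=6.0, beta=1.0):
    """Return (freq, q) for a Tamura (1992)-type model.

    theta: equilibrium G + C content.
    alpha, beta: relative transitional and transversional rates.
    q[x][y] is the rate of change from x to y; diagonal entries make
    each row sum to zero.
    """
    freq = {"A": (1 - theta) / 2, "T": (1 - theta) / 2,
            "G": theta / 2, "C": theta / 2}
    q = {}
    for x in BASES:
        row = {}
        for y in BASES:
            if x == y:
                continue
            # A transition stays within purines or within pyrimidines.
            is_transition = (x in PURINES) == (y in PURINES)
            row[y] = (alpha if is_transition else beta) * freq[y]
        row[x] = -sum(row.values())
        q[x] = row
    return freq, q

if __name__ == "__main__":
    freq, q = tamura_rate_matrix(theta=0.9)
    for x in BASES:
        print(x, [round(q[x][y], 3) for y in BASES])
```

Setting $\theta$ = 0.5 removes the nucleotide-frequency bias and leaves only the transition/transversion bias, which is the starting point of the series of $\theta$ values described above.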

Figure 2. Relation between G + C content (%) and the mean of $\hat{d}_S$ obtained by the NG method (a), the MY method (b), the LWL method (c), the PBL method (d), the Ina1 method (e) and the Ina2 method (f). In each panel the horizontal axis is G + C content (50 to 100%) and the vertical axis is $\hat{d}_S$. The dotted lines represent E($d_S$) = 0.1, 0.3, 0.5, 0.7 and 1.0.

Figure 2 shows the relation between the G + C content at the third position of codons and the mean of $\hat{d}_S$ obtained by the MY, LWL, NG, PBL and Ina methods. As G + C content at the third position deviates from 50%, all the methods examined tend to underestimate $d_S$. The underestimation of $d_S$ becomes clearer as E($d_S$) becomes larger. The underestimation of $d_S$ was also observed under different simulation schemes (e.g. no selection scheme or strong selection scheme, nucleotide sequences of 290 codons, $\alpha/\beta$ = 2 or 10) when E($d_S$) was large and G + C content at the third position deviated from 50%. This underestimation reflects use of Jukes and Cantor's formula (the MY and NG methods) or Kimura's two-parameter formula (the LWL, PBL and Ina methods) for multiple-hit correction.

Since none of the methods examined above takes into account the nucleotide-frequency bias, these methods tend to underestimate $d_S$ and $d_N$ when distantly related sequences with strong nucleotide-frequency bias are analysed. Ina's (1995) computer simulation showed the possibility that the extent of underestimation of $d_S$ might be examined by using the estimated numbers ($\hat{d}_3$) of nucleotide substitutions at the third position of codons: $d_S$ is possibly underestimated if $\hat{d}_3$ obtained by Jukes and Cantor's method and Kimura's two-parameter method differs substantially from $\hat{d}_3$ obtained by methods that take into account the nucleotide-frequency bias, e.g. Takahata and Kimura's (1981) method, Gojobori et al.'s (1982b) method, Tajima and Nei's (1984) method, and Tamura and Nei's (1993) method. [Development of these methods was stimulated by Kimura's (1981a) two-frequency-class model. For the statistical properties of these methods, see Zharkikh (1994).] Similarly, it might be possible to examine the extent of underestimation of $d_N$ using the estimated numbers of nucleotide substitutions at the first and second positions of codons.
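The effect of ignoring the nucleotide-frequency bias in the multiple-hit correction can be illustrated with a small calculation. The sketch below assumes a simple equal-input mutation model (no transition/transversion bias), which is simpler than the model used in the simulation above, and compares Jukes and Cantor's correction with a correction that uses the actual nucleotide frequencies. Function names and the example frequencies are illustrative assumptions.

```python
import math

def expected_p_equal_input(d, freqs):
    """Expected proportion of differences after d substitutions per site
    under an equal-input model with equilibrium frequencies `freqs`."""
    b = 1.0 - sum(f * f for f in freqs)
    return b * (1.0 - math.exp(-d / b))

def jukes_cantor_distance(p):
    """Jukes and Cantor's correction (assumes equal frequencies)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def equal_input_distance(p, freqs):
    """Correction that takes the nucleotide frequencies into account."""
    b = 1.0 - sum(f * f for f in freqs)
    return -b * math.log(1.0 - p / b)

if __name__ == "__main__":
    biased = [0.05, 0.05, 0.45, 0.45]   # A, T, G, C with 90% G + C
    for d in (0.1, 0.5, 1.0):
        p = expected_p_equal_input(d, biased)
        print(d, jukes_cantor_distance(p), equal_input_distance(p, biased))
```

In this setting Jukes and Cantor's formula recovers a true distance of 1.0 as only about 0.77 when G + C content is 90%, whereas the frequency-aware correction returns the true value. The gap shrinks for small distances, which matches the pattern described above.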

4. Neutrality test

Kimura (1977) formulated the relation between the substitution rate ($k$) and the total mutation rate ($v_T$) as

$$k = v_T f_0 \quad (0 \le f_0 \le 1), \quad (31)$$

where $f_0$ is the fraction of neutral mutations. Since advantageous mutations are assumed to be rare enough to be neglected as a first approximation in the neutral theory (Kimura 1968a, 1983), $1 - f_0$ is the fraction of deleterious mutations. When all mutations are neutral (i.e. $f_0 = 1$), the substitution rate $k$ is equal to the total mutation rate $v_T$, which is the maximum substitution rate in the framework of the hypothesis. Kimura (1968b) discussed the possibility of synonymous changes being neutral and suggested that synonymous changes are subject to natural selection very much less than nonsynonymous changes are. Thus it is expected that $d_S \ge d_N$ under the neutral mutation hypothesis. Note that the divergence time and the total mutation rate are the same for both synonymous and nonsynonymous sites.

Many workers have tested the neutral mutation hypothesis by comparing $\hat{d}_S$ with $\hat{d}_N$. Well-known studies based on the $d_S$ vs $d_N$ neutrality test include ones on pseudogenes ('dead genes') as a paradigm of neutral evolution and ones on the major histocompatibility complex (MHC) genes as exceptional cases for the neutral theory.

Miyata and Yasunaga (1981) found that a pseudogene of the α-globin gene is evolving rapidly even at nonsynonymous sites, whereas for its functional counterparts synonymous substitutions predominate over nonsynonymous substitutions, i.e. $\hat{d}_S > \hat{d}_N$. This observation can be explained well by the neutral theory: most nonsynonymous mutations are deleterious (i.e. $f_0 < 1$) for the functional genes but are neutral (i.e. $f_0 \approx 1$) for the pseudogene. On the other hand, it is difficult to explain by positive selection why the pseudogene is evolving rapidly. Li et al. (1981) also obtained essentially the same result, although they did not estimate $d_S$ and $d_N$ separately but compared the estimated numbers of nucleotide substitutions at the first, second and third positions of codons. Kimura himself also examined the substitution rates for pseudogenes (Kimura 1980; Takahata and Kimura 1981).

Hughes and Nei (1988) found $\hat{d}_S < \hat{d}_N$ for the antigen recognition region of the MHC class I genes. This is the first case where the $d_S$ vs $d_N$ neutrality test rejected the neutral mutation hypothesis. The MHC class II genes also showed $\hat{d}_S < \hat{d}_N$ for the antigen recognition region (Hughes and Nei 1989). Polymorphisms at the MHC loci predate the divergence of humans and chimpanzees (Lawlor et al. 1988) and that of mice and rats (Figueroa et al. 1988). These polymorphisms are too old to be compatible with the neutral mutation hypothesis. The fixation time for a neutral mutation (excluding the cases of eventual loss) is expected to be $4N_e$ generations (Kimura and Ohta 1969), where $N_e$ is the effective population size (for effective population size, see Crow and Kimura 1970, pp. 345-365). Overdominant selection can maintain polymorphisms for a much longer period (e.g. Takahata 1990; Takahata and Nei 1990). Taking into account other observations as well, Hughes and Nei suggested that overdominant selection is operating on the antigen recognition region of the MHC class I and class II genes.

Using the $d_S$ vs $d_N$ neutrality test, we can examine whether or not nucleotide sequence data are compatible with the neutral mutation hypothesis. However, from results of this test alone we cannot know what kind of positive selection is responsible for the evolution of the nucleotide sequences examined when the neutral mutation hypothesis is rejected. Population-genetics studies are helpful for understanding the selection operating on the nucleotide sequences. In particular, genealogical studies are powerful for analysis of nucleotide sequence data. For the theory of genealogy, see Hudson (1990), Takahata (1991) and Tajima (1993).
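The logic of the $d_S$ vs $d_N$ comparison in this section can be sketched numerically. The first function below is equation (31). The second compares two estimates through a normal approximation, assuming that the estimates are independent and that their variances are known. This Z statistic is one simple way to make the comparison and is an assumption of this sketch, not necessarily the procedure used in the studies cited above. All names and numerical values are illustrative.

```python
import math

def substitution_rate(v_total, f0):
    """Equation (31): k = v_T * f0, with 0 <= f0 <= 1."""
    if not 0.0 <= f0 <= 1.0:
        raise ValueError("f0 must lie between 0 and 1")
    return v_total * f0

def ds_vs_dn_z(d_s, var_s, d_n, var_n):
    """Z statistic and two-sided P value for d_N - d_S.

    Assumes independent, approximately normal estimates (an assumption
    of this sketch); covariance between the estimates is ignored.
    """
    z = (d_n - d_s) / math.sqrt(var_s + var_n)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

if __name__ == "__main__":
    # Under neutrality the expected d_N/d_S ratio is f0(N)/f0(S),
    # because divergence time and v_T are shared by both kinds of sites.
    print(substitution_rate(1.0, 0.2) / substitution_rate(1.0, 1.0))
    print(ds_vs_dn_z(0.20, 0.0016, 0.50, 0.0009))
```

A significantly positive Z (estimated $d_N$ larger than estimated $d_S$) is the pattern that is hard to reconcile with the neutral mutation hypothesis, whereas a negative Z is expected whenever $f_0 < 1$ at nonsynonymous sites.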

5. Gene conversion test

When gene conversion or recombination occurs in a sample of sequences, the divergence time is not the same across the whole region of the sequences. Thus gene conversion affects the number of substitutions. To detect gene conversion in coding regions, it is useful to compare $\hat{d}_S$ from region to region. Since synonymous changes are thought to be less affected by selection than nonsynonymous changes are, a nonrandom distribution of synonymous differences may be attributable to gene conversion. Indeed, this idea has often been used (e.g. Hayashida and Miyata 1983, 1985; Hayashida et al. 1992). Hughes (1991) discussed criteria for detecting gene conversion and summarized them as three criteria. In criterion 3 he suggested that $\hat{d}_N$ also be examined. This is because similarity at synonymous sites may be a consequence of the nucleotide-frequency bias (e.g. high G + C content), not of gene conversion. Synonymous and nonsynonymous sites are tightly linked to each other and are interspersed in nucleotide sequences. Thus gene conversion affects both synonymous and nonsynonymous substitutions in exchanged regions. Hughes's three criteria are not the only method for detecting gene conversion. Several other methods (e.g. Stephens 1985; Sawyer 1989; Takahata 1994) are available. At present it is not clear which method is the best (for comparison of the latter three methods, see Takahata 1994).

Statistical methods may not be able to detect gene conversion if exchanged regions are very short. For example, in the case of the MHC genes, very short regions (mostly less than 100 base pairs) are thought to be exchanged by gene conversion (for review, Parham and Ohta 1996). It may not be easy to detect such gene conversion events by statistical tests. So, without rigorous tests, Ohta (1995a) examined $\hat{d}_S$ and $\hat{d}_N$ for different regions of the MHC genes to study the role of gene conversion in generating genetic variability. There is an important point about the effects of gene conversion on synonymous and nonsynonymous substitutions. Gene conversion affects both synonymous and nonsynonymous substitutions in the same way. Gene conversion itself does not affect the $d_N/d_S$ ratio if an exchanged segment is similar to the original one in function. Thus, when we observe $\hat{d}_S < \hat{d}_N$ for a certain region, we have to consider factors other than gene conversion (e.g. positive selection and reduction of synonymous substitutions) as the major cause.

6. Window analysis

For a good understanding of the nature of molecular evolution, it is of great importance to examine whether positive selection operates on genes or parts of genes. Ina (1993, p. 153) suggested that for such a purpose window analysis (e.g. Clark and Kao 1991; Ina et al. 1994) may have potential power to detect parts of genes that show $\hat{d}_S < \hat{d}_N$. However, this analysis may not detect positive selection if selected sites are scattered in nucleotide sequences, or if the window size is not appropriate. Thus a result from this analysis showing $\hat{d}_S \ge \hat{d}_N$ for genes or parts of genes does not preclude the possibility that positive selection operates on the genes or parts of the genes. Rigorous methods for determining optimal window size and statistical significance level have never been developed for the $d_S$ vs $d_N$ neutrality test (for tests of a random distribution of different sites along sequences, see Clark and Kao 1991 and Tajima 1991). This is also a problem with window analysis. Nevertheless, window analysis is helpful for visualizing the change in $\hat{d}_S$ and $\hat{d}_N$ from region to region to find unusual patterns, e.g. $\hat{d}_S < \hat{d}_N$ or a reduction in synonymous or nonsynonymous substitutions or both.
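Window analysis of the kind described above might be sketched as follows. The inputs are hypothetical per-codon quantities (numbers of synonymous and nonsynonymous differences and sites), which would have to come from one of the estimation methods. Within each window the proportions of differences are corrected with Jukes and Cantor's formula, as in the NG method. The window size and step are arbitrary choices of this sketch, since no rigorous rule for them is available.

```python
import math

def jc(p):
    """Jukes and Cantor's correction; undefined for p >= 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0) if p < 0.75 else float("nan")

def window_scan(syn_diff, non_diff, syn_sites, non_sites, window=20, step=5):
    """Slide a window along codons and return (start, dS, dN) tuples.

    All four inputs are per-codon lists of equal length (hypothetical
    precomputed values). Windows with no synonymous or nonsynonymous
    sites are not handled in this sketch.
    """
    results = []
    for start in range(0, len(syn_diff) - window + 1, step):
        end = start + window
        p_s = sum(syn_diff[start:end]) / sum(syn_sites[start:end])
        p_n = sum(non_diff[start:end]) / sum(non_sites[start:end])
        results.append((start, jc(p_s), jc(p_n)))
    return results

if __name__ == "__main__":
    n = 60
    syn_diff = [0.1] * n
    non_diff = [0.05] * 20 + [0.9] * 20 + [0.05] * 20
    for start, d_s, d_n in window_scan(syn_diff, non_diff,
                                       [0.75] * n, [2.25] * n,
                                       window=20, step=20):
        flag = "dS < dN" if d_n > d_s else ""
        print(start, round(d_s, 3), round(d_n, 3), flag)
```

In the toy data only the middle window shows an excess of nonsynonymous differences. If the same differences were scattered evenly over all 60 codons, no window would stand out, which is the limitation noted in the text.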

7. Discussion

It is assumed in the $d_S$ vs $d_N$ neutrality test that all mutations at synonymous sites are neutral, i.e. $f_0 = 1$ for synonymous mutations. However, this assumption may not always be valid. Therefore results of the $d_S$ vs $d_N$ neutrality test must be interpreted with caution. For example, Krushkal and Li (1995) found $\hat{d}_S < \hat{d}_N$ for the delta antigen gene of hepatitis D (delta) virus. Since the estimated number of nucleotide substitutions in the noncoding region was larger than $\hat{d}_S$ and $\hat{d}_N$ and there was a strong preference of G and C at the third position of codons in the delta antigen gene, Krushkal and Li suggested reduction of synonymous substitutions. When secondary structure in mRNA or RNA genomes is functionally important, some portion of mutations even at synonymous sites may be deleterious. For overlapping genes, which are often found in phage and virus genomes, synonymous changes in a gene may lead to nonsynonymous changes in its counterpart gene, so not all mutations at synonymous sites may be neutral. Miyata and Yasunaga (1978, 1980) studied the effects of these two kinds of constraints on synonymous and nonsynonymous substitutions. They found reduction of synonymous substitutions for pairing regions of the RNA genomes of phages MS2 and R17 and for overlapping genes of phages φX174 and G4. Recently, using maximum likelihood methods, Muse (1995) and Rzhetsky (1995) studied the effects of secondary structure of RNA (autocorrelation among sites) on nucleotide substitutions (see also Schöniger and von Haeseler 1994; Tillier 1994; Tillier and Collins 1995). However, at present no methods are available for estimation of $d_S$ and $d_N$ for autocorrelated sequences. To analyse overlapping genes, Hein (1995) developed a maximum likelihood method for estimating $d_S$ and $d_N$. His method is a modified version of the LWL method.
It is also possible that nonrandom usage of synonymous codons, or codon usage bias (for review, Ikemura 1985), retards the rate of synonymous substitution. Using the diffusion approximation method (e.g. Kimura 1964), Kimura (1981b) studied quantitatively the relation between codon usage bias and rate of synonymous substitution. Since Kimura's study, many others (e.g. Bulmer 1987; Li 1987; Berg and Martelius 1995) have studied this problem theoretically using different models and methods. For some organisms such as bacteria (Sharp and Li 1987) and Drosophila (Shields et al. 1988), it is known that codon usage bias and number of synonymous substitutions are negatively correlated. Wolfe et al. (1989) found considerable variation in $\hat{d}_S$ among genes of mouse and rat and suggested that mutation rate varies among the genes. Wolfe and Sharp (1993) analysed many more data sets from mouse and rat and found that $\hat{d}_S$ ranged about 20-fold. Moreover, they reported a significantly positive correlation between $\hat{d}_S$ and $\hat{d}_N$. They suggested that there is variation in mutation rates and that doublet mutation (mutations occurring simultaneously at two adjacent sites) is responsible for the correlation. Mouchiroud et al. (1995) also obtained similar results from different data sets for mammalian genes. They, however, suggested that functional constraints on synonymous changes are gene-specific, just as in the case of nonsynonymous changes, and that these two types of constraints are positively correlated. Ohta and Ina (1995) and Ina (1996b) pointed out that variation in mutation rates among genes leads to variation in synonymous substitution rates among them and a positive correlation between synonymous and nonsynonymous substitution rates. It is still unclear which explanation is correct. However, it is true that estimates of the rate of synonymous substitution are more variable than expected by chance (Wolfe and Sharp 1993; Ohta and Ina 1995; Ina 1996b).
Thus we have to carefully interpret and use the number of synonymous substitutions, although synonymous substitutions have been used as a gene-independent molecular clock at least among closely related species since Miyata et al.'s (1980) study (see also Li et al. 1985b). If a species with a short generation time has a large population size and a species with a long generation time has a small population size, in the nearly neutral theory (Ohta 1973, 1992; see also Gillespie 1995) the generation-time effect is expected to be larger for synonymous substitutions than for nonsynonymous substitutions (for quantitative argument, see Ohta 1977 and Kimura 1979). Thus the $d_N/d_S$ ratio is expected to be larger for a species with a small population size and a long generation time than for a species with a large population size and a short generation time. Ohta (1993, 1995b) examined the ratio of the mean of $\hat{d}_N$ to that of $\hat{d}_S$ among genes of primates, artiodactyls and rodents. The ratio was highest for primates and lowest for rodents. She suggested that this result is consistent with the nearly neutral theory. Easteal and Collet (1994) examined the same problem for different data sets of primate, artiodactyl and rodent genes using marsupials as an outgroup species. On the basis of the result obtained, they also supported the nearly neutral theory. However, they suggested that the substitution rate at four-fold-degenerate (synonymous) sites is uniform among the lineages but the substitution rate at nondegenerate (nonsynonymous) sites varies among the lineages. It is not clear whether Easteal and Collet's suggestion is correct. However, it is possible that marsupials might be so distantly related as an outgroup species that $d_S$ could not be estimated accurately.
As E($d_S$) becomes larger, the difference in the mean of $\hat{d}_S$ among lineages, if any, becomes smaller because of underestimation of $d_S$ (see figure 2) and the variance of $\hat{d}_S$ becomes larger (the statistical power of the relative-rate test decreases). Moreover, it seems that Easteal and Collet's suggestion is incompatible with studies on male-driven molecular evolution (Miyata et al. 1987; Shimmin et al. 1993). The studies show that the rate of nucleotide substitution at synonymous sites and in noncoding regions is proportional to the number of cell divisions per unit time, suggesting that the rate of synonymous substitution is higher for rodents than for primates. However, Easteal et al. (1995) suggested that the substitution rate difference between X-linked genes and Y-linked genes might be caused by the difference in methylation between these two kinds of genes. For the generation-time effect, see also Wu and Li (1985), Easteal (1990) and Li et al. (1996). Recently Comeron (1995) developed a modified version of the PBL method. Nucleotide differences at the first position of arginine codons, CGA, CGG, AGA and AGG, are treated more rigorously in Comeron's method than in the PBL method. Thus the numbers of transitional and transversional substitutions at two-fold-degenerate sites are estimated more accurately in Comeron's method than in the LWL and PBL methods. The weighting method for averaging the numbers of transitional and transversional substitutions was also modified. These modifications improved the accuracy of estimation of $d_S$ and $d_N$. Using computer simulation Comeron showed that his method was better than the Ina2 method, in particular when the transition/transversion bias was weak. However, there seems to be no need to use complicated methods such as the PBL, Ina and Comeron's methods when the transition/transversion bias is weak.
In such a case the NG and MY methods may be preferable to Comeron's method because the variances of $\hat{d}_S$ and $\hat{d}_N$ are expected to be smaller for simple methods than for complicated methods. As in the LWL and PBL methods, the estimated numbers of transitional and transversional substitutions are available for nondegenerate, two-fold-degenerate and four-fold-degenerate sites separately in Comeron's method. Equations (29) and (30) may be applicable to Comeron's method. More recently Muse (1996a) has developed a maximum likelihood method for estimating $d_S$ and $d_N$. In this method Muse and Gaut's (1994) model is used. The model is similar to the proportional model (Felsenstein 1981) or the equal-input model (Tajima and Nei 1982) in that the substitution rate is proportional to the frequency of a nucleotide replaced. In the model the nucleotide-frequency bias is considered but the transition/transversion bias is not considered. The equal-input model for multiple-hit correction is known to be robust under various conditions (Tajima and Nei 1984; Zharkikh 1994). However, the performance of Muse's method is not well understood when the transition/transversion bias is strong. As we have seen earlier, the transition/transversion bias is an important factor for estimation of the numbers of synonymous and nonsynonymous sites. Further studies are needed to examine the effects of the transition/transversion bias on $\hat{d}_S$ and $\hat{d}_N$ obtained by Muse's method. Goldman and Yang (1994) proposed a more complicated maximum likelihood method for estimating $d_S$ and $d_N$. In their method the transition/transversion bias and the nucleotide-frequency bias are incorporated. The rates of substitution between nonsynonymous codons are assumed to depend on Grantham's (1974) physicochemical distances between amino acids interchanged.
The performance of Goldman and Yang's method is not clear because no one has ever examined the statistical properties of their method using computer simulation. Obviously the methods for estimating $d_S$ and $d_N$ are still in the stage of development and are not well established. Goldman and Yang's method and Muse's method use the maximum likelihood method, which is known to be an efficient statistical method. However, this does not necessarily mean that their methods are better than the currently used ones. Indeed, Muse's (1996a) computer simulation under his model showed that his method gives essentially the same estimates of $d_S$ and $d_N$ as those obtained by the NG method unless the divergence of nucleotide sequences compared is very large. Moreover, the assumptions made in these methods are unlikely to hold in general. At the present time all methods depend on some simplifying assumptions, and it is unclear which method is most realistic. The actual process of synonymous and nonsynonymous substitutions is very complicated, and thus the properties and applicabilities of these methods should be clarified under various conditions. Further studies, both theoretical and empirical, are needed. In this paper I have considered mainly molecular evolution. However, as pointed out by Kimura and Ohta (1971), molecular evolution and polymorphism are not two separate phenomena. Thus, to gain a deeper understanding of mechanisms of evolution at the molecular level, we have to consider polymorphism as well. The concept of synonymous and nonsynonymous categories is also useful for analysis of DNA polymorphism data. For example, McDonald and Kreitman (1991) used this concept and developed a neutrality test. Since synonymous and nonsynonymous sites are tightly linked to each other and are interspersed in nucleotide sequences, the McDonald-Kreitman test is insensitive to change in population size, population subdivision, selection at linked loci, and recombination.
In this respect the McDonald-Kreitman test is superior to other neutrality tests such as the HKA (Hudson-Kreitman-Aguadé) test (Hudson et al. 1987), Tajima's (1989) test, Fu and Li's (1993) tests, and Fu's (1996) tests. All of these neutrality tests originated from Kimura's (1969) pioneering work on the infinite-site model.
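The comparison underlying the McDonald-Kreitman test can be sketched as a 2 x 2 contingency table of fixed and polymorphic differences at nonsynonymous and synonymous sites. The sketch below evaluates the table with Fisher's exact test, which is used here as one common choice and is an assumption of this sketch, not necessarily the procedure of McDonald and Kreitman (1991). The counts are illustrative and are not taken from this paper.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the table [[a, b], [c, d]].

    Rows: nonsynonymous, synonymous. Columns: fixed, polymorphic.
    Sums the probabilities of all tables (with the same margins) that
    are no more probable than the observed one.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(x):
        return comb(col1, x) * comb(total - col1, row1 - x) / denom

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    p_obs = prob(a)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1.0 + 1e-9))

if __name__ == "__main__":
    # Illustrative counts: 7 fixed and 2 polymorphic nonsynonymous
    # differences, 17 fixed and 42 polymorphic synonymous differences.
    print(fisher_exact_two_sided(7, 2, 17, 42))
```

Under neutrality the ratio of fixed to polymorphic differences is expected to be the same for both categories of sites, so a small P value with an excess of fixed nonsynonymous differences points away from the neutral expectation.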

Acknowledgements

I thank Drs T. Ohta and W. B. Provine for providing me the opportunity to write this paper. I thank Drs T. Ohta, N. Takahata, S. V. Muse, M. K. Uyenoyama, A. Rzhetsky and M. Nei for their comments, which improved the manuscript. This study was supported by a Grant-in-Aid from the Ministry of Education, Science, Sports, and Culture of Japan. This is contribution no. 2055 from the National Institute of Genetics, Mishima 411, Japan.

References

Berg O. G. and Martelius M. 1995 Synonymous substitution-rate constraints in Escherichia coli and SalmoneUa typhimurium and their relationship to and selection pressare. J. Mol. EvoL 41:449-456 Bulmer M. 1987 Coevolution of codon usage and transfer RNA abundance. Nature 325:728-730 Clark G. A. and Kao T.-H. 1991 Excess nonsynonyrnous substitution of shared polymorphic sites among self-incompatibility alleles of Solanaceae. Proc. Natl. Acad. Sci. USA 88:9823-9827 Synonymous and nonsynonymous substitutions 111

Comeron J. M. 1995 A method for estimating the numbers of synonymous and nonsynonymous substitutions per site. J. Mol. Evol. 41:1152-1159
Crow J. F. and Kimura M. 1970 An introduction to population genetics theory (New York: Harper and Row)
Dayhoff M. O., Schwartz R. M. and Orcutt B. C. 1978 A model of evolutionary change in proteins. In Atlas of protein sequence and structure (ed.) M. O. Dayhoff (Washington, DC: National Biomedical Research Foundation) vol. 5, suppl. 3, pp. 345-352
Easteal S. 1990 The pattern of mammalian evolution and the relative rate of molecular evolution. Genetics 124:165-173
Easteal S. and Collet C. 1994 Consistent variation in amino-acid substitution rate, despite uniformity of mutation rate: Protein evolution in mammals is not neutral. Mol. Biol. Evol. 11:643-647
Easteal S., Collet C. and Betty D. 1995 The mammalian molecular clock (New York: Springer) pp. 135-145
Efron B. 1979 Bootstrap methods: Another look at the jackknife. Ann. Statist. 7:1-26
Felsenstein J. 1981 Evolutionary trees from DNA sequences: A maximum likelihood method approach. J. Mol. Evol. 17:368-376
Figueroa F., Gunther E. and Klein J. 1988 MHC polymorphism pre-dating speciation. Nature 335:265-267
Fu Y.-X. 1996 New statistical tests of neutrality for DNA samples from a population. Genetics 143:557-570
Fu Y.-X. and Li W.-H. 1993 Statistical tests of neutrality of mutations. Genetics 133:693-709
Gillespie J. H. 1995 On Ohta's hypothesis: Most amino acid substitutions are deleterious. J. Mol. Evol. 40:64-69
Gojobori T. 1983 Codon substitution in evolution and the "saturation" of synonymous changes. Genetics 105:1011-1027
Gojobori T., Li W.-H. and Graur D. 1982a Patterns of nucleotide substitution in pseudogenes and functional genes. J. Mol. Evol. 18:360-369
Gojobori T., Ishii K. and Nei M. 1982b Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. J. Mol. Evol. 18:414-423
Goldman N. and Yang Z. 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736
Grantham R. 1974 Amino acid difference formula to help explain protein evolution. Science 185:862-864
Grishin N. V. 1995 Estimation of the number of amino acid substitutions per site when the substitution rate varies among sites. J. Mol. Evol. 41:675-679
Hayashida H. and Miyata T. 1983 Unusual evolutionary conservation and frequent DNA segment exchange in class I genes of the major histocompatibility complex. Proc. Natl. Acad. Sci. USA 80:2671-2675
Hayashida H. and Miyata T. 1985 On the direction of gene conversion. Proc. Jpn. Acad. B61:204-207
Hayashida H., Kuma K. and Miyata T. 1992 Interchromosomal gene conversion as a possible mechanism for explaining divergence patterns of ZFY-related genes. J. Mol. Evol. 35:181-183
Hein J. 1995 A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames. J. Mol. Evol. 40:181-189
Hudson R. R. 1990 Gene genealogy and the coalescent process. Oxford Surv. Evol. Biol. 7:1-44
Hudson R. R., Kreitman M. and Aguadé M. 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116:153-159
Hughes A. L. 1991 Testing for interlocus genetic exchange in the MHC: A reply to Andersson and co-workers. Immunogenetics 33:243-246
Hughes A. L. and Nei M. 1988 Pattern of nucleotide substitution at major histocompatibility complex loci reveals overdominant selection. Nature 335:167-170
Hughes A. L. and Nei M. 1989 Nucleotide substitution at major histocompatibility complex class II loci: Evidence for overdominant selection. Proc. Natl. Acad. Sci. USA 86:958-962
Ikemura T. 1985 Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13-34
Ina Y. 1993 Estimation of the numbers of synonymous and nonsynonymous substitutions with special reference to viral evolution. Ph.D. thesis, Department of Genetics, School of Life Science, The Graduate University for Advanced Studies, Hayama, Japan
Ina Y. 1995 New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J. Mol. Evol. 40:190-226
Ina Y. 1996a Variance and covariance of the number of amino acid substitutions estimated by Kimura's method. Genes Genet. Syst. 71:43-46

Ina Y. 1996b Correlation between synonymous and nonsynonymous substitutions and variation in synonymous substitution numbers. In Current topics on molecular evolution (eds) M. Nei and N. Takahata (Pennsylvania: Institute of Molecular Evolutionary Genetics, The Pennsylvania State University, and Hayama: The Graduate University for Advanced Studies) pp. 105-113
Ina Y. and Gojobori T. 1994 Statistical analysis of nucleotide sequences of the hemagglutinin gene of human influenza A viruses. Proc. Natl. Acad. Sci. USA 91:8388-8392
Ina Y., Mizokami M., Ohba K. and Gojobori T. 1994 Reduction of synonymous substitutions in the core protein gene of hepatitis C virus. J. Mol. Evol. 38:50-56
Jin L. and Nei M. 1990 Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. 7:82-102
Johnson N. L. and Kotz S. 1973 Distribution in statistics: Discrete distributions (Boston: Houghton-Mifflin)
Jukes T. H. and Cantor C. R. 1969 Evolution of protein molecules. In Mammalian protein metabolism (ed.) H. N. Munro (New York: Academic Press) pp. 21-132
Kimura M. 1964 Diffusion models in population genetics. J. Appl. Prob. 1:177-232
Kimura M. 1968a Evolutionary rate at the molecular level. Nature 217:624-626
Kimura M. 1968b Genetic variability maintained in a finite population due to mutational production of neutral and nearly neutral isoalleles. Genet. Res. 11:247-269
Kimura M. 1969 The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61:893-903
Kimura M. 1977 Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275-276
Kimura M. 1979 Model of effectively neutral mutations in which selective constraint is incorporated. Proc. Natl. Acad. Sci. USA 76:3440-3444
Kimura M. 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120
Kimura M. 1981a Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454-458
Kimura M. 1981b Possibility of extensive neutral evolution under stabilizing selection with special reference to nonrandom usage of synonymous codons. Proc. Natl. Acad. Sci. USA 78:5773-5777
Kimura M. 1983 The neutral theory of molecular evolution (Cambridge: Cambridge University Press)
Kimura M. and Ohta T. 1969 The average number of generations until fixation of a mutant gene in a finite population. Genetics 61:763-771
Kimura M. and Ohta T. 1971 Protein polymorphism as a phase of molecular evolution. Nature 229:467-469
Kimura M. and Ohta T. 1972 On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. 2:87-90
Kondo R., Horai S., Satta Y. and Takahata N. 1993 Evolution of hominoid mitochondrial DNA with special reference to the silent substitution rate over the genome. J. Mol. Evol. 36:517-531
Krushkal J. and Li W.-H. 1995 Substitution rates in hepatitis delta virus. J. Mol. Evol. 41:721-726
Kumar S., Tamura K. and Nei M. 1993 MEGA: Molecular evolutionary genetics analysis (version 1.0). The Pennsylvania State University, University Park, USA
Lawlor D. A., Ward F. E., Ennis P. D., Jackson A. P. and Parham P. 1988 HLA-A and B polymorphisms predate the divergence of humans and chimpanzees. Nature 335:268-271
Li W.-H. 1987 Models of nearly neutral mutations with particular implications for nonrandom usage of synonymous codons. J. Mol. Evol. 24:337-345
Li W.-H. 1993 Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36:96-99
Li W.-H., Gojobori T. and Nei M. 1981 Pseudogenes as a paradigm of neutral evolution. Nature 292:237-239
Li W.-H., Wu C.-I. and Luo C.-C. 1984 Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications. J. Mol. Evol. 21:58-71
Li W.-H., Wu C.-I. and Luo C.-C. 1985a A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2:150-174
Li W.-H., Luo C.-C. and Wu C.-I. 1985b Evolution of DNA sequences. In Molecular evolutionary genetics (ed.) R. J. MacIntyre (New York: Plenum Press) pp. 1-94
Li W.-H., Ellsworth D. L., Krushkal J., Chang B. H.-J. and Hewett-Emmett D. 1996 Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Mol. Phylogenet. Evol. 5:182-187

McDonald J. H. and Kreitman M. 1991 Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652-654
Miyata T. and Yasunaga T. 1978 Evolution of overlapping genes. Nature 272:532-535
Miyata T. and Yasunaga T. 1980 Molecular evolution of mRNA: A method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its application. J. Mol. Evol. 16:23-36
Miyata T. and Yasunaga T. 1981 Rapidly evolving mouse α-globin-related pseudo gene and its evolutionary history. Proc. Natl. Acad. Sci. USA 78:450-453
Miyata T., Miyazawa S. and Yasunaga T. 1979 Two types of amino acid substitutions in protein evolution. J. Mol. Evol. 12:219-236
Miyata T., Yasunaga T. and Nishida T. 1980 Nucleotide sequence divergence and functional constraint in mRNA evolution. Proc. Natl. Acad. Sci. USA 77:7328-7332
Miyata T., Hayashida H., Kuma K., Mitsuyasu K. and Yasunaga T. 1987 Male-driven molecular evolution: A model and nucleotide sequence analysis. Cold Spring Harbor Symp. Quant. Biol. 52:863-867
Mouchiroud D., Gautier C. and Bernardi G. 1995 Frequencies of synonymous substitutions in mammals are gene-specific and correlated with frequencies of nonsynonymous substitutions. J. Mol. Evol. 40:107-113
Muse S. V. 1995 Evolutionary analyses of DNA sequence subject to constraints on secondary structure. Genetics 139:1429-1439
Muse S. V. 1996a Estimating synonymous and nonsynonymous substitution rates. Mol. Biol. Evol. 13:105-114
Muse S. V. 1996b Evolutionary analysis when nucleotides do not evolve independently. In Current topics on molecular evolution (eds) M. Nei and N. Takahata (Pennsylvania: Institute of Molecular Evolutionary Genetics, The Pennsylvania State University, and Hayama: The Graduate University for Advanced Studies) pp. 115-124
Muse S. V. and Gaut B. S. 1994 A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715-724
Nei M. 1987 Molecular evolutionary genetics (New York: Columbia University Press)
Nei M. and Gojobori T. 1986 Simple methods for estimating the numbers of synonymous and nonsynonymous substitutions. Mol. Biol. Evol. 3:418-426
Ohta T. 1973 Slightly deleterious mutant substitutions in evolution. Nature 246:96-98
Ohta T. 1977 Extension to the neutral mutation random drift hypothesis. In Molecular evolution and polymorphism (ed.) M. Kimura (Mishima: National Institute of Genetics) pp. 148-167
Ohta T. 1992 The nearly neutral theory of molecular evolution. Annu. Rev. Ecol. Syst. 23:263-286
Ohta T. 1993 An examination of generation-time effect on molecular evolution. Proc. Natl. Acad. Sci. USA 90:10676-10680
Ohta T. 1995a Gene conversion vs point mutation in generating variability at the antigen recognition site of major histocompatibility complex loci. J. Mol. Evol. 41:115-119
Ohta T. 1995b Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J. Mol. Evol. 40:56-63
Ohta T. and Ina Y. 1995 Variation in synonymous rates among mammalian genes and the correlation between synonymous and nonsynonymous divergences. J. Mol. Evol. 41:717-720
Ota T. and Nei M. 1994a Estimation of the number of amino acid substitutions per site when the substitution rate varies among sites. J. Mol. Evol. 38:642-643
Ota T. and Nei M. 1994b Variance and covariances of the numbers of synonymous and nonsynonymous substitutions per site. Mol. Biol. Evol. 11:613-619
Pamilo P. and Bianchi N. O. 1993 Evolution of the Zfx and Zfy genes: Rates and interdependence between the genes. Mol. Biol. Evol. 10:271-281
Parham P. and Ohta T. 1996 Population biology of antigen presentation by MHC class I molecules. Science 272:67-74
Perler F., Efstratiadis A., Lomedico P., Gilbert W., Kolodner R. and Dodgeson J. 1980 The evolution of genes: The chicken preproinsulin gene. Cell 20:555-566
Rzhetsky A. 1995 Estimating substitution rates in ribosomal RNA genes. Genetics 141:771-783
Sawyer S. 1989 Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6:526-538
Schöniger M. and von Haeseler A. 1994 A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phylogenet. Evol. 3:240-247
Sharp P. M. and Li W.-H. 1987 The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol. Biol. Evol. 4:222-230

Shields D. C., Sharp P. M., Higgins D. G. and Wright F. 1988 "Silent" sites in Drosophila genes are not neutral: Evidence of selection among synonymous codons. Mol. Biol. Evol. 5:704-716
Shimmin L. C., Chang B. H.-J. and Li W.-H. 1993 Male-driven evolution of DNA sequences. Nature 362:745-747
Stephens J. C. 1985 Statistical methods of DNA sequence analysis: Detection of intragenic recombination or gene conversion. Mol. Biol. Evol. 2:539-556
Tajima F. 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595
Tajima F. 1991 Determination of window size for analyzing DNA sequences. J. Mol. Evol. 33:470-473
Tajima F. 1993 Statistical analysis of DNA polymorphism. Jpn. J. Genet. 68:567-595
Tajima F. and Nei M. 1982 Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. J. Mol. Evol. 18:115-120
Tajima F. and Nei M. 1984 Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1:269-285
Takahata N. 1990 A simple genealogical structure of strongly balanced allelic lines and trans-species evolution of polymorphism. Proc. Natl. Acad. Sci. USA 87:2419-2423
Takahata N. 1991 A trend in population genetics theory. In New aspects of the genetics of molecular evolution (eds) M. Kimura and N. Takahata (Tokyo: Japan Scientific Societies Press) pp. 27-47
Takahata N. 1994 Comments on the detection of reciprocal recombination or gene conversion. Immunogenetics 39:146-149
Takahata N. and Kimura M. 1981 A model of evolutionary base substitutions and its application with special reference to rapid change of pseudogenes. Genetics 98:641-657
Takahata N. and Nei M. 1990 Allelic genealogy under overdominant and frequency-dependent selection and polymorphism of major histocompatibility complex loci. Genetics 124:967-978
Tamura K. 1992 Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G + C-content biases. Mol. Biol. Evol. 9:678-687
Tamura K. and Nei M. 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526
Tillier E. R. M. 1994 Maximum likelihood with multiparameter models of substitution. J. Mol. Evol. 39:409-417
Tillier E. R. M. and Collins R. A. 1995 Neighbor joining and maximum likelihood with RNA sequences: Addressing the interdependence of sites. Mol. Biol. Evol. 12:7-15
Uzzell T. and Corbin K. W. 1971 Fitting discrete probability distributions to evolutionary events. Science 172:1089-1096
Wakeley J. 1996 The excess of transition among nucleotide substitutions: New methods of estimating transition bias underscore its significance. Trends Ecol. Evol. 11:158-163
Wolfe K. H. and Sharp P. M. 1993 Mammalian gene evolution: Nucleotide sequence divergence between mouse and rat. J. Mol. Evol. 37:441-456
Wolfe K. H., Sharp P. M. and Li W.-H. 1989 Mutation rates differ among regions of the mammalian genome. Nature 337:283-285
Wu C.-I. and Li W.-H. 1985 Evidence for higher rates of nucleotide substitution in rodents than in man. Proc. Natl. Acad. Sci. USA 82:1741-1745
Yang Z. 1993 Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401
Yang Z. 1995 Phylogenetic analysis by maximum likelihood (PAML) version 1.1. Institute of Molecular Evolutionary Genetics, The Pennsylvania State University, University Park, USA
Yang Z. and Kumar S. 1996 Approximate methods for estimating the pattern of nucleotide substitutions and the variation of substitution rates among sites. Mol. Biol. Evol. 13:650-659
Zharkikh A. 1994 Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-329

Computer programs

NG method
The NG method is implemented in MEGA (Kumar et al. 1993), a program package for molecular evolutionary genetics analysis. For details, contact the following e-mail address: [email protected].

LWL and PBL methods
A computer program for the LWL and PBL methods was developed by Li. For details, contact the following e-mail address: li@hgc.sph.uth.tmc.edu.

Comeron's method
A computer program developed by Comeron (1995) is available on request (e-mail address: comeron@porthos.bio.ub.es).

Muse's method
A computer program developed by Muse (1996a) is available by anonymous ftp at kurtz.bio.psu.edu (128.118.180.141) in the directory /pub/distances.

Goldman and Yang's method
Goldman and Yang's method is implemented in PAML (Yang 1995), a program package for phylogenetic analysis by maximum likelihood. PAML is available by anonymous ftp at ftp.bio.indiana.edu (129.79.225.25) in the directory /molbio/evolve.

Ina's methods
Computer programs developed by Ina (1995) are available by anonymous ftp at ftp.nig.ac.jp (133.39.3.6) in the directories /pub/unix/syn/new1 (Ina1) and /pub/unix/syn/new2 (Ina2), and at ftp.dna.affrc.go.jp (150.26.230.101) in the directories /pub/unix/syn/new1 (Ina1) and /pub/unix/syn/new2 (Ina2).

Window analysis
A computer program developed by Ina is available by anonymous ftp at ftp.nig.ac.jp (133.39.3.6) in the directory /pub/unix/windows and at ftp.dna.affrc.go.jp (150.26.230.101) in the directory /pub/unix/windows.