Proc. Natl. Acad. Sci. USA
Vol. 84, pp. 166-169, January 1987

The guanine and cytosine content of genomic DNA and bacterial evolution
(biased mutation pressure/codon usage/neutral theory)

AKIRA MUTO AND SYOZO OSAWA

Laboratory of Molecular Genetics, Department of Biology, Faculty of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464, Japan

Communicated by Motoo Kimura, September 15, 1986

ABSTRACT The genomic guanine and cytosine (G+C) content of eubacteria is related to their phylogeny. The G+C content of various parts of the genome (protein genes, stable RNA genes, and spacers) reveals a positive linear correlation with the G+C content of their genomic DNA. However, the plotted correlation slopes differ among various parts of the genome or among the first, second, and third positions of the codons depending on their functional importance. These facts suggest that biased mutation pressure, called A·T/G·C pressure, has affected whole DNA during evolution so as to determine the genomic G+C content in a given bacterium. The role of A·T/G·C pressure in diversification of bacterial DNA sequences and codon usage patterns is discussed in the perspective of the neutral theory of molecular evolution.

Among eubacteria, the mean guanine and cytosine (G+C) content of genomic DNA varies from approximately 25% to 75%. That the bacterial genomic G+C content is somehow related to phylogeny has been suggested (1, 2). The phylogenetic tree of eubacterial 5S rRNA clearly indicates this relationship (3). According to this tree, Gram-negative and Gram-positive bacteria separated earliest. Among Gram-positive bacteria, those with low genomic G+C content, such as Bacillus subtilis (genomic G+C, 42%), Lactobacillus viridescens (40%), Staphylococcus aureus (33%), Clostridium perfringens (38%), and Mycoplasma capricolum (25%), are phylogenetically close, whereas those with high genomic G+C, such as Micrococcus luteus (75%), Streptomyces griseus (73%), and Mycobacterium tuberculosis (67%), comprise one phylogenetic group. These two groups separated long ago. Gram-negative bacteria with intermediate G+C content, such as Escherichia coli (50%), Serratia marcescens (58%), Salmonella typhimurium (51%), and Pseudomonas fluorescens (60%), belong to the common Gram-negative branch.

These facts suggest that the differences in genomic G+C content may have been caused by mutation pressure, with the direction and magnitude of this pressure varying among the phylogenetic lines. Such mutation pressure, which we call biased A·T/G·C pressure due to biased mutation rates among the four bases, seems to have been exerted on the entire genome during evolution. Mutations caused by A·T/G·C pressure are subject to selective constraints that usually operate in the form of negative selection to eliminate functionally deleterious changes. A certain fraction of nondeleterious (i.e., neutral) mutations are then fixed in the population due to random genetic drift (4). Thus, functionally less important parts in the genome evolve faster than more important ones in accordance with the neutral theory of molecular evolution (see ref. 5). In other words, for a given species, the A·T/G·C pressure changes the G+C content of various parts of the genome in the same direction, but to different extents depending on their functional importance.

A large amount of data on the DNA sequences of different genes and spacers from various bacteria is now available. Using these data, we present here lines of evidence that support the above view. The study further suggests that A·T/G·C pressure has played a major role in diversification of genomic DNA sequences and codon usage in bacterial evolution.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Correlation of G+C Content Between Entire Genome and the Specific Parts of Genome

The bacterial genome is roughly composed of protein genes (70-80%), spacers including various signals (20-30%), and stable RNA genes (<1%). Fig. 1 shows a plot of the G+C content of different parts of the genome vs. the mean G+C content of various bacterial species and indicates that all components positively but differentially correlate to the genomic G+C content. In 1962 Miura (6) demonstrated that among bacteria the G+C content of rRNA (16S + 23S + 5S rRNAs) is about the same, whereas that of tRNA weakly correlates to genomic G+C content. There exists a weak positive correlation of the G+C content of both rRNA and tRNA to genomic G+C content. However, the G+C content of spacers and protein genes reveals a strong linear correlation with genomic G+C content: among various bacteria the G+C content of spacers ranges from about 20% to 80% and that of protein genes ranges from about 30% to 75%, as genomic DNA G+C content varies from 25% (M. capricolum) to 75% (M. luteus). Thus, for a given bacterium, the G+C contents of spacers, protein genes, and stable RNA genes are all biased in the same direction as the G+C content of the total genome.

This bias is strongest for spacers, less so for protein genes, and least for stable RNA genes, although their contribution to the average G+C content is in the order of protein genes, spacers, and stable RNA genes. These positive colinearities shown in Fig. 1 strongly support the idea that the A·T/G·C pressure strongly influenced the G+C content of all DNA during evolution. The differential levels of G+C content in the different components of the genome in a given bacterium can be understood as the consequence of selective constraints that have been exerted on the genome to eliminate functionally deleterious mutants. Since most parts of spacers are functionally the least important in the genome, the larger part of mutations in these regions is selectively neutral, and therefore the evolutionary rate is higher than for other parts. The rRNA and tRNA genes are less variable, because their transcripts are nontranslatable, and most, if not all, of their sequences are important for biological functions. Protein genes are more variable than the stable RNA genes, because larger parts of synonymous codon changes and conservative amino acid changes occur without deleterious effects.
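
As a rough worked illustration of the differing slopes, the ranges quoted above can be turned into two-point slope estimates. This is only a sketch: it assumes that the quoted endpoints (spacers about 20% to 80%, protein genes about 30% to 75%, genome 25% to 75%) lie on straight lines, and the function name two_point_slope is hypothetical. The results are not the fitted slopes of Fig. 1.

```python
def two_point_slope(part_low, part_high, genome_low=25.0, genome_high=75.0):
    """Slope of a part's G+C (%) against genome G+C (%) from two endpoints.

    The default genome endpoints are the 25% (M. capricolum) and
    75% (M. luteus) values quoted in the text.
    """
    return (part_high - part_low) / (genome_high - genome_low)


# Approximate endpoints quoted in the text (percent G+C); treating them
# as exact points on a line is an assumption of this sketch.
spacer_slope = two_point_slope(20.0, 80.0)
protein_slope = two_point_slope(30.0, 75.0)

print(f"spacers:       {spacer_slope:.2f}")
print(f"protein genes: {protein_slope:.2f}")
```

Under these assumptions the spacer estimate is steeper than the protein-gene estimate, in line with the stronger bias described for spacers.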


[Figs. 1 and 2: plots not reproduced. Surviving labels: tRNA, Protein, Spacer (Fig. 1); First, Second (Fig. 2); axis ticks from 20 to 100.]

FIG. 1. Correlation of G+C content between total genomic DNA and designated parts of the genome. Values for rRNA and tRNA are taken from Miura (6), Midgley (7), and Maniloff and Morowitz (8). References to protein genes are as follows: ribosomal protein genes of M. capricolum (9); penicillinase of B. cereus (10); 21 genes of B. subtilis collected by Ogasawara (11); penicillinase gene of B. licheniformis (12); 27 genes of E. coli collected by Koningsberg and Godson (13); ribulose-1,5-biphosphate carboxylase/oxygenase gene of A. nidulans (14); nitrogenase gene of R. parasponia (15); mercuric reductase gene of P. aeruginosa (16); isopropylmalate dehydrogenase gene of T. thermophilus (17); viomycin phosphotransferase gene of S. vinaceus (18); ribosomal protein-encoding genes of M. luteus (T. Ohama, F. Yamao, A.M., and S.O., unpublished data). Values of spacers are calculated from the 5'- and 3'-flanking sequences of the above genes.

FIG. 2. Correlation of the G+C content between total genomic DNA and the first, second, and third codon positions. Values are taken from the references in the legend for Fig. 1.

[Horizontal axis label in both figures: Genome G+C, %]

Codon Usage

A nonrandom feature of codon usage has been widely recognized (19, 20). In Fig. 2 are plotted the G+C contents of the first, second, and third codon positions, respectively, from various bacterial species vs. G+C content of the corresponding genome. The G+C contents of all three positions reveal a linear positive relationship with genome G+C content, although the slopes differ, the steepness rank order being third, first, and second codon positions. Most striking is that the G+C content of the third positions varies linearly from about 10% in M. capricolum to >90% in M. luteus. The strong correlation is thus due to A·T/G·C substitutions at the codon third position.

In the present study, comparisons are made between different protein genes. It is preferable to use homologous genes for this purpose, but this kind of available data is limited at present. The most valuable genes for such comparisons may be those for ribosomal proteins. Functional structure of ribosomes is well conserved throughout the eubacterial group, and hence most, if not all, of the codon- and amino acid-substitutions would be predicted to be neutral or nearly neutral. In other words, most of the observable substitutions would have no selective advantages for the ribosomal functions. Thus, we compared substitutions at 1188 homologous codon sites in the genes for nine ribosomal proteins in spc operon between M. capricolum (genome G+C, 25%; ref. 21 and unpublished work) and E. coli (50%; ref. 22). Table 1 summarizes the codon substitution patterns. Between the two species, 199 codons are identical. There exist 360 synonymous (silent) codon substitutions, 169 conservative amino acid substitutions (see legend for Table 1), and 459 other substitutions between them. About 81% of the synonymous codon substitutions occur at the third position. In most cases, codons ending in guanine or cytosine in E. coli are replaced by a synonymous codon ending in adenine or uracil (thymine) in M. capricolum (see also ref. 21). About 17% of the synonymous changes occur at the first plus third positions, and all of them result in an adenine- or uracil(thymine)-richness in M. capricolum as compared with E. coli. Among 169 conservative amino acid substitutions, which occur primarily by changing the codon first and/or third positions, 114 codon changes occur in the direction of higher A+U(T) content in M. capricolum.

Table 1. A+T* substitutions at 1188 homologous codon sites of ribosomal protein-encoding genes† between M. capricolum and E. coli

Substituted codons, no. (G, gain; L, loss; N, no gain/loss; Sub, subtotal)

                                 Synonymous‡              Conservative§                Others
Codon position                 G     L     N   Sub       G     L     N   Sub       G     L     N   Sub        Total
1                              1     0     0     1      16     1     6    23      22     9    12    43           67
2                              0     0     0     0       0     3     9    11       7     6     0    13           24
3                            201    14    77   292       4     0     6    10       9     1     7    17          319
1 + 2                          0     0     5     5       1     0     0     1      29    11    20    60           66
1 + 3                         61     0     0    61      53     1    28    82      45    11    17    73          216
2 + 3                          0     0     0     0      13     0     1    14      43    10    26    79           93
1 + 2 + 3                      1     0     0     1      27     2     0    29     110    35    29   174          204
Total                        264    14    82   360     114     6    50   170     265    83   111   459  989 (1188)¶
No. of changed positions     327    14    87   428     235    11    79   325     602   185   232  1019         1772
Net A+T gain‖               +303   -14     0  +289    +177    -6     0  +165    +401   -91     0  +310         +764

Of the 1188 codon sites, 199 are identical.
*Number of codons that gain or lose A+T in M. capricolum as compared with E. coli.
†Nine ribosomal protein-encoding genes (S5, S8, S14, L5, L6, L14, L15, L18, and L24) in spc operon of E. coli (22) and M. capricolum (ref. 21; unpublished data).
‡Synonymous (silent) codon substitution.
§Conservative amino acid substitution includes Lys/Arg, Leu/Ile, Leu/Val, Ile/Val, Ser/Thr, Ala/Gly, Gln/Asn, Glu/Asp and Phe/Tyr.
¶Inclusive of identical codons.
‖Net A+T gained in M. capricolum as compared with E. coli.
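
The bookkeeping behind a table of this kind can be sketched as follows. This is a hypothetical illustration, not the authors' procedure: it assumes that a codon site is scored as a gain when the A+U(T) count is higher in the M. capricolum codon than in the E. coli codon, and the function names and the example codon pair are invented for demonstration.

```python
def normalize(codon):
    """Upper-case a codon and write thymine as uracil so DNA and RNA compare equal."""
    return codon.upper().replace("T", "U")


def au_count(codon):
    """Number of A or U(T) bases in a codon."""
    return sum(base in "AU" for base in normalize(codon))


def classify_pair(ecoli_codon, mcap_codon):
    """Classify one homologous codon site.

    Returns (positions, change). `positions` lists the changed codon
    positions in the style '3' or '1+3' (empty string if identical).
    `change` is 'gain', 'loss', or 'none', judged by the net A+U(T)
    count in the second codon relative to the first; scoring by net
    count is an assumption of this sketch.
    """
    first, second = normalize(ecoli_codon), normalize(mcap_codon)
    changed = [str(i + 1) for i, (a, b) in enumerate(zip(first, second)) if a != b]
    diff = au_count(second) - au_count(first)
    change = "gain" if diff > 0 else "loss" if diff < 0 else "none"
    return "+".join(changed), change


# Hypothetical example: GGC and GGU both encode glycine, so this pair
# would count as a synonymous third-position substitution gaining A+U.
print(classify_pair("GGC", "GGU"))
```

Tallying such labels over all homologous sites, split by substitution class, would give counts laid out like the rows and columns of Table 1.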

For example, many CGC or CGU codons for arginine, or CUG for leucine in E. coli are replaced by AAA for lysine or AUU for isoleucine in M. capricolum. Among 459 other codon substitutions, which are also probably neutral for the reason previously discussed, a gain of A+U(T) is observed in 256 codons of M. capricolum, whereas a loss of A+U(T) is observed in 83 codons. Among 989 substituted codons out of the total 1188 codons, 643 codons (65%) gain A+U(T), 103 (10%) lose A+U(T), and 243 (25%) show neither gain nor loss of A+U(T). The total of A+U(T) gain and loss in all nucleotide sites (2967) in the substituted codons is 875 and 111, respectively.

All these facts indicate that a strong biased mutation pressure has caused M. capricolum to discriminate against guanine and cytosine and to use adenine and uracil wherever possible. In contrast to the A+T-biased codon usage in M. capricolum, a strongly G+C-biased codon usage has been shown in G+C-rich bacteria, such as Streptomyces species (18, 23) and M. luteus (unpublished data). Thus, we conclude that this biased mutation pressure is the major determinant of codon usage in these bacteria. Ikemura (19, 20) demonstrated a positive correlation between codon usage and tRNA abundance in E. coli and in yeast, claiming that tRNA population is the major factor providing selective constraints on codon use. Kagawa et al. (17) stressed that the high G+C content in the codon third positions in an extremely thermophilic bacterium, Thermus thermophilus, is the consequence of high temperature selection, because G·C pairs are more thermostable than A·T pairs in the double-stranded structure of DNA. This explanation cannot be applied to other G+C-rich bacteria, such as Micrococcus, Streptomyces, etc., which are not thermophilic. Thus, the results shown in Fig. 2 and Fig. 1 strongly suggest that biased A·T/G·C pressure predominates over other selective forces in determining nonrandom codon usage.

Discussion

More than twenty years ago, Sueoka (24) presented a theory to account for diversification of the G+C content of the bacterial genome; the G+C content of DNA of a given bacterium will be determined by the effective base conversion rate u (G·C to A·T) and v (A·T to G·C). The G+C content of equilibrium (p) is v/(v + u). When u/v is 3.0, 1.0, and 0.33, the G+C content of the bacterium will be 25% (e.g., M. capricolum), 50% (E. coli), and 75% (M. luteus), respectively. In this theory only the mutation pressure in two directions, G·C to A·T and A·T to G·C, is assumed. Thus, our A·T/G·C pressure is equivalent to Sueoka's u/v. Remember, however, that at each nucleotide site, one of the four bases is fixed most of the time and that such a concept of equilibrium can be applied only when a very large number of sites are considered. Superimposed on mutation pressure, mutational changes are subject to selective constraints, and a certain fraction of nondeleterious mutations can become fixed in a given population (4, 5).

The nature of biased A·T/G·C pressure is unknown. One possibility is that mutational modification of some components in DNA synthesis biases the mutation rate towards either A·T pairs (25) or G·C pairs (26). Another possibility is that methylation and deamination of DNA bases lead to the accumulation of A·T pairs in DNA (27, 28); the abundance of methylation/deamination and/or the activity of the deamination-repair system could thus be candidates for biasing mutation pressure.

Biased mutation pressure acts on the entire genome. This is best deduced from the genome structures of the extremely G+C-poor bacterium M. capricolum, in which all parts of the genome (protein genes, stable RNA genes, and spacers) contribute systematically but to a different extent to the low G+C content of the genome (21, 29-31). This is also the case for G+C-rich bacteria such as M. luteus (unpublished data) and Streptomyces species (18, 23), in which all parts of the genomes contribute to high genomic G+C content. These facts, together with Figs. 1 and 2, strongly suggest that biased mutation pressure has been exerted uniformly on all of the DNA.

Different evolution rates at different parts of a given genome can best be explained by the negative selection principle of neutral theory (4, 5, 32, 33), i.e., functionally less important parts of the genome have evolved more rapidly than more important ones. The most striking example is the fast evolutionary rate of spacer regions (Fig. 1) and the synonymous codon substitution that occurred predominantly at codon third positions (Fig. 2). The correlation of mean G+C content of genomic DNA to bacterial phylogeny also favors the view that A·T/G·C pressure has been exerted on the genome in bacterial evolution.
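
Sueoka's equilibrium relation quoted earlier in this Discussion, p = v/(v + u), can be checked numerically. The following is a minimal sketch, assuming a simple two-state reading in which the G+C fraction changes per step by v(1 - p) - up. The function names, the per-step rates, and the number of steps are illustrative choices and are not taken from the paper.

```python
def equilibrium_gc(u, v):
    """Equilibrium G+C fraction p = v / (v + u).

    u: effective conversion rate G.C -> A.T
    v: effective conversion rate A.T -> G.C
    """
    return v / (u + v)


def iterate_gc(u, v, p0=0.5, steps=10000):
    """Iterate the assumed two-state update p <- p + v*(1 - p) - u*p.

    The update gains G+C from the A.T fraction (1 - p) at rate v and
    loses it from the G.C fraction p at rate u; its fixed point is
    v / (u + v) when the rates are small.
    """
    p = p0
    for _ in range(steps):
        p += v * (1.0 - p) - u * p
    return p


# The three u/v ratios quoted in the text; v itself is an arbitrary
# small rate chosen for illustration.
v = 0.001
for ratio in (3.0, 1.0, 0.33):
    u = ratio * v
    print(f"u/v = {ratio}: closed form {equilibrium_gc(u, v):.3f}, "
          f"iterated {iterate_gc(u, v):.3f}")
```

Only the ratio u/v enters the closed form, which is why the text can equate A·T/G·C pressure with Sueoka's u/v; the ratios 3.0, 1.0, and 0.33 give approximately the 25%, 50%, and 75% values quoted above.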

We have found that one of the chain termination codons, UGA (opal nonsense codon), in the universal code is read as tryptophan in M. capricolum (9, 31) and suggested that this change in the dictionary had occurred through biased mutation pressure. Jukes (34) has proposed a plausible pathway whereby UGA tryptophan codon evolved in M. capricolum without deleterious or lethal effects under a strong mutation pressure replacing G·C pairs by A·T pairs. He has further suggested that such biased mutation pressure could cause the genetic code to change in other organisms having biased DNA base composition. In fact, certain ciliates, in which a deviation from the universal genetic code has been discovered (35-39), have genomes with extremely low (about 30%) G+C content. The evolution of the genetic code in reference to A·T/G·C pressure requires further discussion (unpublished work).

We thank Dr. Motoo Kimura and our colleagues for valuable discussions. This work was supported by Grant-in-Aid from the Ministry of Education, Science and Culture of Japan.

1. Lee, K. Y., Wahl, R. & Barbu, E. (1956) Ann. Inst. Pasteur (Paris) 91, 212-214.
2. Sueoka, N. (1961) J. Mol. Biol. 3, 31-40.
3. Hori, H. & Osawa, S. (1986) Biosystems 19, 162-172.
4. Kimura, M. (1968) Nature (London) 217, 624-626.
5. Kimura, M. (1983) The Neutral Theory of Molecular Evolution (Cambridge Univ. Press, Cambridge, UK).
6. Miura, K. (1962) Biochim. Biophys. Acta 55, 62-70.
7. Midgley, J. E. M. (1962) Biochim. Biophys. Acta 61, 513-525.
8. Maniloff, J. & Morowitz, H. J. (1972) Bacteriol. Rev. 36, 263-290.
9. Muto, A., Yamao, F., Kawauchi, Y. & Osawa, S. (1985) Proc. Jpn. Acad. B 61, 12-15.
10. Sloma, A. & Gross, M. (1983) Nucleic Acids Res. 11, 4997-5004.
11. Ogasawara, N. (1985) Gene 40, 145-150.
12. Neugebauer, K., Sprengel, R. & Schaller, H. (1981) Nucleic Acids Res. 9, 2577-2588.
13. Koningsberg, W. & Godson, G. N. (1983) Proc. Natl. Acad. Sci. USA 80, 687-691.
14. Shinozaki, K., Yamada, C., Takahata, N. & Sugiura, M. (1983) Proc. Natl. Acad. Sci. USA 80, 4050-4054.
15. Scott, K. F., Rolfe, B. G. & Shine, J. (1983) DNA 2, 141-148.
16. Brown, N. L., Ford, S. J., Primore, R. D. & Fritzinger, D. C. (1983) Biochemistry 22, 4089-4095.
17. Kagawa, Y., Nojima, H., Nukiwa, N., Ishizuka, M., Nakajima, T., Yasuhara, T., Tanaka, T. & Oshima, T. (1984) J. Biol. Chem. 259, 2956-2960.
18. Bibb, M. J., Bibb, B. J., Ward, J. M. & Cohen, S. N. (1985) Mol. Gen. Genet. 199, 26-36.
19. Ikemura, T. (1981) J. Mol. Biol. 151, 389-409.
20. Ikemura, T. (1982) J. Mol. Biol. 158, 573-597.
21. Muto, A., Kawauchi, Y., Yamao, F. & Osawa, S. (1984) Nucleic Acids Res. 12, 8209-8217.
22. Cerretti, D. P., Dean, D., Davis, G. R., Bedwell, D. M. & Nomura, M. (1983) Nucleic Acids Res. 11, 2599-2616.
23. Gray, G. S. & Thompson, C. J. (1983) Proc. Natl. Acad. Sci. USA 80, 5190-5194.
24. Sueoka, N. (1962) Proc. Natl. Acad. Sci. USA 49, 582-592.
25. Speyer, J. F. (1965) Biochem. Biophys. Res. Commun. 21, 6-10.
26. Cox, J. F. & Yanofsky, C. (1967) Proc. Natl. Acad. Sci. USA 58, 1895-1902.
27. Bird, A. P. (1980) Nucleic Acids Res. 8, 1499-1504.
28. Dujon, B. (1981) in The Molecular Biology of the Yeast Saccharomyces, eds. Strathern, J. N., Jones, E. W. & Broach, J. R. (Cold Spring Harbor Laboratory, Cold Spring Harbor, NY), pp. 505-635.
29. Iwami, M., Muto, A., Yamao, F. & Osawa, S. (1984) Mol. Gen. Genet. 196, 317-322.
30. Sawada, M., Muto, A., Iwami, M., Yamao, F. & Osawa, S. (1984) Mol. Gen. Genet. 196, 311-316.
31. Yamao, F., Muto, A., Kawauchi, Y., Iwami, M., Iwagami, S., Azumi, Y. & Osawa, S. (1985) Proc. Natl. Acad. Sci. USA 82, 2306-2309.
32. King, J. L. & Jukes, T. H. (1969) Science 164, 788-798.
33. Kimura, M. & Ohta, T. (1974) Proc. Natl. Acad. Sci. USA 71, 2848-2852.
34. Jukes, T. H. (1985) J. Mol. Evol. 22, 361-362.
35. Caron, F. & Meyer, E. (1985) Nature (London) 314, 185-188.
36. Preer, J. R., Jr., Preer, L. B., Rudman, B. M. & Barnett, A. J. (1985) Nature (London) 314, 188-190.
37. Helftenbein, E. (1985) Nucleic Acids Res. 13, 415-433.
38. Horowitz, S. & Gorovsky, M. A. (1985) Proc. Natl. Acad. Sci. USA 82, 2452-2455.
39. Kuchino, Y., Hanyu, N., Tashiro, F. & Nishimura, S. (1985) Proc. Natl. Acad. Sci. USA 82, 4758-4762.