Letter to the Editor

Accounting for Background Composition When Measuring Codon Usage Bias John A. Novembre Department of Integrative Biology, University of CaliforniaÐBerkeley The phenomenon of codon usage bias has been im- Of the summary statistics that do not require portant in the study of evolution because it provides knowledge of preferred codons, the effective number of à examples of weak selection working at the molecular codons (ENC, or Nc as proposed by Wright 1990) has level. During the last two decades, evidence has accu- been found to be the most statistically well behaved, in mulated that some examples of codon usage bias are that it is the least affected by short gene lengths (Com- à driven by selection, particularly for species of fungi eron and Aguade 1998). Nc is inversely proportional to (e.g., Bennetzen and Hall 1982; Ikemura 1985), the extent of nonuniform codon usage. It takes the value (e.g., Ikemura 1981; Sharp and Li 1987), and insects of 61 when all codons are being used with equal fre- (e.g., Akashi 1997; Moriyama and Powell 1997). This quency, and its value decreases as codon usage becomes connection between codon usage bias and selection has less uniform. It is intuitively accessible because its value been important in stimulating the development of alter- is intended to correspond to the number of codons being natives to strictly neutral theories of molecular evolution used in a sequence. (Ohta and Gillespie 1996; Kreitman and Antezana à One limitation of Nc is that measuring the depar- 2000). Uncertainty persists, however, regarding the phy- tures from uniform codon usage is not always desirable. logenetic distribution of codon usage bias (such as Knowledge of background nucleotide composition pat- whether selection-based codon usage bias is present in terns may suggest a null distribution of codon usage that mammals; e.g., Karlin and Mrazek 1996; Iida and Aka- is nonuniform. For instance, in Drosophila, where the shi 2000; Sueoka and Kawanishi 2000; Smith and Eyre- background nucleotide composition is 60% AT, one Walker 2001a; Urrutia and Hurst 2001). Questions also might like to measure how far codon usage departs from remain as to what models of selection underlie codon 60% usage of AT-ending codons. Indeed, accounting for preferences (Kreitman and Antezana 2000), and specif- ically whether the presence of suboptimal codons is the background nucleotide composition when studying co- result of and drift, variation in selection pres- don usage has been recognized as being important (Aka- sure across sites, or antagonistic selection pressures shi, Kliman, and Eyre-Walker 1998; Marais, Mouchi- (Smith and Eyre-Walker 2001b). roud, and Duret 2001; Urrutia and Hurst 2001). Taking For investigating certain questions regarding the composition into account is particularly important for evolution of codon usage bias, it is useful to have a phylogenetic studies of codon usage bias where com- summary statistic describing the pattern of codon usage parisons are made among species with differing back- across all amino acids. Many summary statistics have ground nucleotide compositions or for studies of codon already been developed to describe the patterns of codon usage bias in genomic regions that may differ in back- usage. They can be divided roughly into two classes ground nucleotide composition. If not taken into consid- (Comeron and Aguade 1998). One class summarizes the eration, differences in background nucleotide composi- usage of certain preferred codons, and the other com- tion might lead one to conclude that differences in co- pares every codon's usage to a null distribution (typi- don preference exist when they do not. à cally uniform usage of synonymous codons). The former The departure of Nc from 61 because of variation class of methods has the disadvantage that it requires a in background nucleotide composition was recognized prior knowledge of the preferred codons. With summary by Wright and described as the following relationship à statistics one can explore general patterns, such as the between the third position GC content (fGC) and Nc: relationship of codon usage bias to recombination rate, gene length, or synonymous substitution rate. Observing à 29 broad patterns such as these has already provided insight NcGCϭ 2 ϩ f ϩ 22. into the evolutionary dynamics of codon usage bias ΂΃f GCϩ (1 Ϫ f GC) (Kliman and Hey 1993; Akashi and Eyre-Walker 1998; Using this relationship one can plot the expected value Kreitman and Comeron 1999). à of Nc versus the GC content. Such plots are known as à Nc-plots (e.g., see ®g. 2 in Wright 1990). Points that fall Abbreviations: ENC, effective number of codons. below the bell-shaped line suggest deviations from a Key words: codon usage bias, background nucleotide composi- null model of no-codon preferences. Unfortunately, tion, mutation bias. studying such patterns does not have a clear statistical Address for correspondence and reprints: John A. Novembre, De- à basis and detracts from the value of Nc as a quantitative partment of Integrative Biology, University of CaliforniaÐBerkeley, summary statistic. VLSB 3060, Berkeley, California 94720. E-mail: novembre@ socrates.berkeley.edu. Other codon usage bias summary statistics such as 2 Mol. Biol. Evol. 19(8):1390±1394. 2002 Akashi's scaled ␹ (Akashi 1995), the maximum-likeli- ᭧ 2002 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038 hood codon bias statistic (MCB; Urrutia and Hurst

1390 Letter to the Editor 1391

2001), and B*(a) (Karlin and Mrazek 1996) account for calculated in a number of ways, including using mono-, background nucleotide composition. However, as pre- di-, or trinucleotide frequencies. The ␹2 statistic for each 2 sented below, these statistics are affected strongly by the amino acid,Xa , is calculated as sequence lengths being studied. Clearly, it would be use- k n (p Ϫ e )2 ful to have a statistic similar to Nà that has low se- 2 ai i c X a ϭ ͸ . quence-length effects but that relaxes the assumption of iϭ1 ei equal usage of all synonymous codons. Using theXF2 values,à is de®ned as follows: à aaЈ Here, a statistic (NcЈ ) is presented that accounts for 2 background nucleotide composition and is minimally af- à X aaϩ n Ϫ k FЈϭa . (5) fected by sequence length. The statistic is based on Pear- k(na Ϫ 1) 2 son's X statistics and describes the departure of the ob- à served codon usage from some expected distribution. The calculations from here on mirror those of Nc. The à ¯ FarЈ values are averaged for each redundancy class (Fˆ Ј ) The expected distribution can be derived from knowl- à edge of the background nucleotide composition. When and used to calculate NcЈ: à à uniform usage of codons is expected,NcЈ reduces to Nc. à ¯ NЈϭcrrn /Fˆ Ј. (6) à à ∈͸ Like Nc, NcЈ is relatively insensitive to sequence length. r ARC The properties ofNà are tested on simulated sequences, cЈ 2 2 Note that becauseXa is ␹ distributed, the expected val- and compared with related summary statistics. The re- à sults suggest thatNà Ј is useful for studying codon usage ue of FЈ will be 1/k when the codon usage matches the c expected usage pattern. Using the ␦-method, it can also among sequences that vary in background nucleotide à composition. be shown that the expected value ofNcЈ will be equal to à 61 for the standard genetic code. To understand the construction ofNcЈ, it is ®rst nec- à à à à Notably,NcЈ simpli®es to Nc when synonymous co- essary to provide background on Nc.Nc, or the ENC (Wright 1990), is derived by making an analogy with dons are expected to be in equal frequency. In this case 2 the ei are all 1/k, and theXa statistic for each amino acid the effective number of alleles ne at a locus (Kimura and à à reduces to Crow 1964). An estimate of ne is nÃe ϭ 1/F, where F is an estimate of the expected heterozygosity at the locus. k 22 For quantifying codon usage bias at an amino acid a, X aaϭ nk͸ p iϪ n a. (7) Wright de®ned an estimate of the homozygosity of co- iϭ1 don usage as follows: 2 à Substitution of thisXFaa into the equation forЈ (eq. 5) k produces Fà ϭ np2 Ϫ 1(n Ϫ 1) (1) aai͸ a k ΂΃iϭ1 ΋ à 2 FЈϭaainp͸ Ϫ 1(n aϪ 1). (8) ΂΃iϭ1 ΋ Here, pi is the frequency of the ith codon, k is the num- ber of synonymous codons for the amino acid of inter- à à This value ofFaЈ is equal to the Fa used by Wright in est, and na is the observed number of codons for the the calculation of Nà (eq. 1). Because the calculation of à c amino acid. The average of the Fa for each r-fold re- ÃÃÃà NcacЈ matches that of Nc onceFЈ is obtained,NЈ is equiv- dundancy class (e.g., onefold, twofold, threefold, four- à alent to Nc when uniform usage is expected. fold, sixfold) is then computed: à This derivation shows thatNcЈ is a generalization of Nà to the case in which expected codon usage is not ¯ 1 à c Fˆ raϭ F (2) uniform. Whereas Nà is implicitly based on an expected ͸∈ c nRC a RC à distribution of uniform usage,NcЈ can have the expected where nRC is the number of amino acids in the RC re- proportions of synonymous codon usage given by the à à dundancy class. Finally, Nc is computed as follows: background nucleotide composition.NcЈ will decrease à ¯¯¯¯ from 61 when the codon usage does not match the dis- Nc ϭ 2 ϩ (9/Fˆ 2356) ϩ (1/Fˆ ) ϩ (5/Fˆ ) ϩ (3/Fˆ ). (3) tribution predicted by nucleotide composition. In con- à à The calculation of Nc can also be expressed more gen- trast, Nc will decrease from 61 whenever the codon us- erally as age departs from uniform usage of each synonymous codon. The result is that the value ofNà Ј should be less à ¯ c Ncrrϭ n /Fˆ (4) dependent on nucleotide composition and more depen- ∈͸ r ARC dent on the codon preferences that go beyond those where ARC is the set of all redundancy classes, and nr caused by unequal nucleotide composition. à à is the number of amino acids in the redundancy class. In the practical implementation of both Nc andNcЈ , à For any particular amino acid we can also compute an care must be taken to exclude Fa values that are unde- à à à Nc value that is simply 1/F, where F is the homozygosity ®ned or equal to zero, which occurs when amino acids of codon usage for that particular amino acid. are rare or missing (see Wright 1990). The calculations à à To deriveNcЈ and show its relationship to Nc, I use presented subsequently follow Wright's suggestion that if Pearson's X2 statistics to quantify the departure of each no threefold redundant codons are observed, one should ¯¯ ¯ codon's usage (pi) from some expected usage (ei) for averageFˆ 24 andFˆ to obtainFˆ 3 . If other k-fold redundancy ¯ each amino acid. The expected usage of codons can be classes are unobserved,Fˆ k is assumed to equal 1/k. Such 1392 Novembre

Table 1 Nucleotide Compositions Used for the Simulated Sequences None Low-1 Low-2 Med-1 Med-2 High-1 High-2 fA ... 0.25 0.20 0.20 0.125 0.125 0.05 0.05 fG... 0.25 0.30 0.20 0.375 0.125 0.45 0.05 fC ... 0.25 0.30 0.40 0.375 0.625 0.45 0.85 fT ... 0.25 0.20 0.20 0.125 0.125 0.05 0.05 an assumption is conservative with regard to measuring strong codon usage bias. Finally, in the calculation of ¯ Fˆ k we exclude amino acids that are observed fewer than ®ve times. Here, these methods of dealing with missing à data are applied equally to the calculation of both Nc and à NcЈ. FIG. 1.ÐSummary statistic values relative to their values when à nucleotide compositions are equal. To explore the properties of the statistics Nc and à NcЈ, pseudorandomly generated coding sequences were generated for various lengths and nucleotide composi- cleotide composition. As expected when the nucleotide tions. Amino acids were chosen uniformly, and codon à à frequencies are uniform, Nc andNcЈ are equal, and both usage was assumed to be multinomial with proportions share a value of 61. However, as the unevenness in the determined by the nucleotide composition. The back- background nucleotide composition increases, the mean ground nucleotide composition estimates are given as f , à à A value of Nc decreases, whereas that ofNcЈ remains rela- f ,f, and f for , , , and thy- à C G T tively constant. The value ofNcЈ is nearly 61 for all nu- mine, respectively. For example, for a codon i that has à cleotide composition conditions, whereas that of Nc the sequence CAT, the expected frequency was calcu- ranges from 61 with equal nucleotide compositions to lated as ei ϭ fCfAfT/K, where K is a renormalization con- nearly 26 when the compositions are highly nonuniform. stant for ensuring the e for an amino acid sum to one. à i It is also worth noting that Nc is particularly sensitive A range of nucleotide compositions were used to à to GC skew. The value of Nc is on average nine units generate the sequences (Table 1). The nucleotide com- smaller for the composition sets with high GC skew than positions are modeled after those used in a study by à 2 for those with none. LikeNcЈ, scaled ␹ and MCB remain Comeron and Aguade (1998) and contain conditions constant over various nucleotide composition sets. The corresponding to different levels of background GC-AT B*(a) statistic remains nearly constant for modest levels content. The compositions also vary in their degree of of unevenness in nucleotide composition but increases GC skew (Lobry 1996; Shioiri and Takahata 2001). The for the medium and high levels. Low-2, Med-2, and High-2 have higher GC skew rela- The statistics were also studied over a range of tive to Low-1, Med-1, and High-1, respectively. For gene lengths to observe the effects of gene length. Fig- each composition and gene length, 10,000 sequences ure 2 represents the values of each statistic relative to à were generated. The mean and variance of the Nc and their values at 7,500 bp. At short sequence lengths (less à NcЈ statistics were calculated for each set of sequences. à à To compare the performance of Nc andNcЈ with other summary statistics that incorporate the effects of background nucleotide composition, Akashi's scaled ␹2 statistic (Akashi 1995), the MCB statistic of Urrutia and Hurst (2001), as well as the B*(a) measure proposed by Karlin and Mrazek (1996) were also computed for each sequence. Each of these measures attempts to account for background nucleotide composition. In the imple- mentation of these calculations the same expected val- à ues that are used for calculatingNcЈ are used for the scaled ␹2 and B*(a) measures. For calculating MCB the expected values were generated according to the method given by Urrutia and Hurst (2001). For the sequence lengths of 150 and 210 bp, there was insuf®cient data for the Urrutia and Hurst expectations to be computed, and so MCB was not calculated for these sequence lengths. Figure 1 shows the mean values of the various bias statistics for sequences of 7,500 bp for the various back- FIG. 2.ÐSummary statistic values relative to their values at 7,500 ground nucleotide compositions. The mean values are bp. Sequences were generated with the Medium-2 composition set. shown relative to their mean values with uniform nu- Results for other composition sets are qualitatively similar. Letter to the Editor 1393

à the value of Nc, to change it even more from that of à NcЈ. à If differences of less than 10 in the value of Nc are à à of interest, then usingNcЈ rather than Nc is important for even lower levels of unevenness in nucleotide compo- sition. For instance, a nucleotide composition of ap- à proximately 35% GC or AT can cause the value of Nc to change by ®ve units relative to equal usage. In ad- dition, modest levels of GC skew, such as in composi- à tion set Low-2, can cause the value of Nc to change by ®ve units relative to no skew. à UsingNcЈ with nonuniform expectations will be par- ticularly advantageous when comparing codon usage bias among sequences from genomic regions that vary signi®cantly in background nucleotide composition. Such variation is especially notable in mammals and birds, where variation in nucleotide composition is structured into isochores (Bernardi et al. 1985). In ad- FIG. 3.ÐCVs for the various summary statistics applied to se- dition,Nà Ј may also be useful in comparative studies of quences generated with the Medium-2 composition set. Results for c other composition sets are qualitatively similar. codon usage bias in which the species being studied have substantially different nucleotide compositions. In summary, this paper has introduced a summary à à à than 600 bp), the values of Nc andNcЈ are higher than statistic of codon usage bias (NcЈ ) that incorporates var- their asymptotic values. In general, however, the asymp- iation in background nucleotide composition among se- totic value is well approximated when the gene length quences. Simulations show that relative to the available 2 à reaches 600 bp. In contrast, the values of scaled ␹ , statistics,NcЈ effectively adjusts for background nucleo- MCB, and B*(a) do not closely match their asymptotic tide composition, is the least affected by gene length, values even at long sequence lengths (3,000 bp). Figure and has a low CV. With its use the analysis of codon 3 shows the coef®cients of variation (CVs) for the var- usage bias can be accomplished without the confound- ious summary statistics. For sequence lengths of above ing effects of variation in nucleotide composition. 600 bp the order of the statistics in terms of increasing A freely distributable program that calculates both à à 2 à à CV is NcЈ, Nc,B*(a), scaled ␹ , and MCB. Below 600 Nc andNcЈ is available on the Web page: http://ib. bp the behavior of each method's CV becomes compli- berkeley.edu/labs/slatkin/software.html. cated by the ways in which each measure handles the missing values. à Acknowledgments These results show that the statisticNcЈ performs well in terms of accounting for nucleotide composition, The author thanks Josh Herbeck and Montgomery being the least affected by gene length and having a low Slatkin for helpful discussions, Araxi Urrutia for dis- CV. cussion and for providing the script to calculate MCB, à à AlthoughNcЈ differs from Nc, they are similar and two reviewers for their comments. The research was à à enough in some cases so that usingNcЈ rather than Nc supported by a Howard Hughes Medical Institute Pre- may not always produce different results. For example, doctoral Fellowship and the National Institutes of Health à a reanalysis withNcЈ of data from Dunn, Bielawski, and (NIH GM-40282 to M. Slatkin). Yang (2001) on the relationship between Drosophila substitution rates and codon usage bias produced results that are qualitatively similar to those from the original LITERATURE CITED à analysis with Nc (data not shown). AKASHI, H. 1995. Inferring weak selection from patterns of à However, in other cases the difference between Nc polymorphism and divergence at ``silent'' sites in Drosoph- à andNcЈ will be important. For example, a difference in ila DNA. Genetics 139:1067±1076. à Nc of 10 may be seen as biologically signi®cant (e.g., ÐÐÐ. 1997. Distinguishing the effects of mutational biases Moriyama and Powell 1997, 1998). At that resolution and natural selection on DNA sequence variation. Genetics à the results presented here show that usingNcЈ with non- 147:1989±1991. uniform expectations will result in a considerable dif- AKASHI, H., and A. EYRE-WALKER. 1998. Translational selec- ference if the GC or AT content is on the order of 85%. tion and molecular evolution. Curr. Opin. Genet. Dev. 8: Such nucleotide composition is extreme in nature and is 688±693. found -wide in protozoan species, such as Plas- AKASHI, H., R. KLIMAN, and A. EYRE-WALKER. 1998. Muta- tion pressure, natural selection, and the evolution of base modium falciparum (Gardner et al. 1998), or in mito- composition in Drosophila. Genetica 102±103:49±60. chondrial of arthropods, such as Drosophila, BENNETZEN, J., and B. HALL. 1982. Codon selection in yeast. Anopheles, Bombyx, and Apis (Shioiri and Takahata J. Biol. Chem. 257:3026±3031. 2001). In addition, the results presented here show that BERNARDI, G., B. OLOFSSON,J.FILIPSKI,M.ZERIAL,J.SALI- the presence of nucleotide composition skew will affect NAS,G.CUNY,M.MEUNIER-ROTIVAL, and F. RODIER. 1985. 1394 Novembre

The mosaic genome of warm-blooded vertebrates. Science LOBRY, J. 1996. Asymmetric substitution patterns in the two 228:953±958. DNA strands of bacteria. Mol. Biol. Evol. 13:660±665. COMERON, J., and M. AGUADE. 1998. An evaluation of mea- MARAIS, G., D. MOUCHIROUD, and L. DURET. 2001. Does re- sures of synonymous codon usage bias. J. Mol. Evol. 47: combination improve selection on codon usage? Lessons 268±274. from nematode and ¯y complete genomes. Proc. Natl. DUNN, K., J. BIELAWSKI, and Z. YANG. 2001. Substitution rates Acad. Sci. USA 98:5688±5692. in Drosophila nuclear genes: implications for translational MORIYAMA, E., and J. POWELL. 1997. Codon usage bias and selection. Genetics 157:295±305. tRNA abundance in Drosophila. J. Mol. Evol. 45:514±523. ÐÐÐ. 1998. Gene length and codon usage bias in Drosophila GARDNER, M., H. TETTELIN,D.CARUCCI et al. (24 co-authors). 1998. Chromosome 2 sequence of the human malaria par- melanogaster, Saccharomyces cerevisiae and . Nucleic Acids Res. 26:3188±3193. asite Plasmodium falciparum. Science 282:1126±1132. OHTA, T., and J. GILLESPIE. 1996. Development of neutral and IIDA, K., and H. AKASHI. 2000. A test of translational selection nearly neutral theories. Theor. Popul. Biol. 49:128±142. at `silent' sites in the human genome: base composition SHARP, P., and W. LI. 1987. The rate of synonymous substi- comparisons in alternatively spliced genes. Gene 261:93± tution in enterobacterial genes is inversely related to codon 105. usage bias. Mol. Biol. Evol. 4:222±230. IKEMURA, T. 1981. Correlation between the abundance of Esch- SHIOIRI, C., and N. TAKAHATA. 2001. Skew of mononucleotide erichia coli transfer and the occurrence of the re- frequencies, relative abundance of dinucleotides, and DNA spective codons in its protein genes. J. Mol. Biol. 146:1± strand asymmetry. J. Mol. Evol. 53:364±376. 21. SMITH, N., and A. EYRE-WALKER. 2001a. Synonymous codon ÐÐÐ. 1985. Codon usage and tRNA content in unicellular bias is not caused by mutation bias in GϩC-rich genes in and multicellular organisms. Mol. Biol. Evol. 2:13±34. humans. Mol. Biol. Evol. 18:982±986. KARLIN, S., and J. MRAZEK. 1996. What drives codon choices ÐÐÐ. 2001b. Why are translationally sub-optimal synony- in human genes? J. Mol. Biol. 262:459±472. mous codons used in Escherichia coli? J. Mol. Evol. 53: KIMURA, M., and J. CROW. 1964. The number of alleles that 225±236. can be maintained in a ®nite population. Genetics. 49:725± SUEOKA, N., and Y. KAWANISHI. 2000. DNA GϩC content of 738. the third codon position and codon usage biases of human genes. Gene 261:53±62. KLIMAN, R., and J. HEY. 1993. Reduced natural selection as- URRUTIA, A., and L. HURST. 2001. Codon usage bias covaries sociated with low recombination in Drosophila melanogas- with expression breadth and the rate of synonymous evo- ter. Mol. Biol. Evol. 10:1239±1258. lution in humans, but this is not evidence for selection. Ge- KREITMAN, M., and M. ANTEZANA. 2000. The population and netics 159:1191±1199. evolutionary genetics of codon bias. Pp. 82±101 in R. WRIGHT, F. 1990. The `effective number of codons' used in a SINGH and C. KRIMBAS, eds. Evolutionary genetics: from gene. Gene 87:23±29. molecules to morphology. Cambridge University Press, Cambridge, U.K. ADAM EYRE-WALKER, reviewing editor KREITMAN, M., and J. COMERON. 1999. Coding sequence evo- lution. Curr. Opin. Genet. Dev. 9:637±641. Accepted April 5, 2002