Letter to the Editor Accounting for Background Nucleotide
Total Page:16
File Type:pdf, Size:1020Kb
Letter to the Editor Accounting for Background Nucleotide Composition When Measuring Codon Usage Bias John A. Novembre Department of Integrative Biology, University of CaliforniaÐBerkeley The phenomenon of codon usage bias has been im- Of the summary statistics that do not require portant in the study of evolution because it provides knowledge of preferred codons, the effective number of à examples of weak selection working at the molecular codons (ENC, or Nc as proposed by Wright 1990) has level. During the last two decades, evidence has accu- been found to be the most statistically well behaved, in mulated that some examples of codon usage bias are that it is the least affected by short gene lengths (Com- à driven by selection, particularly for species of fungi eron and Aguade 1998). Nc is inversely proportional to (e.g., Bennetzen and Hall 1982; Ikemura 1985), bacteria the extent of nonuniform codon usage. It takes the value (e.g., Ikemura 1981; Sharp and Li 1987), and insects of 61 when all codons are being used with equal fre- (e.g., Akashi 1997; Moriyama and Powell 1997). This quency, and its value decreases as codon usage becomes connection between codon usage bias and selection has less uniform. It is intuitively accessible because its value been important in stimulating the development of alter- is intended to correspond to the number of codons being natives to strictly neutral theories of molecular evolution used in a sequence. (Ohta and Gillespie 1996; Kreitman and Antezana à One limitation of Nc is that measuring the depar- 2000). Uncertainty persists, however, regarding the phy- tures from uniform codon usage is not always desirable. logenetic distribution of codon usage bias (such as Knowledge of background nucleotide composition pat- whether selection-based codon usage bias is present in terns may suggest a null distribution of codon usage that mammals; e.g., Karlin and Mrazek 1996; Iida and Aka- is nonuniform. For instance, in Drosophila, where the shi 2000; Sueoka and Kawanishi 2000; Smith and Eyre- background nucleotide composition is 60% AT, one Walker 2001a; Urrutia and Hurst 2001). Questions also might like to measure how far codon usage departs from remain as to what models of selection underlie codon 60% usage of AT-ending codons. Indeed, accounting for preferences (Kreitman and Antezana 2000), and specif- ically whether the presence of suboptimal codons is the background nucleotide composition when studying co- result of mutation and drift, variation in selection pres- don usage has been recognized as being important (Aka- sure across sites, or antagonistic selection pressures shi, Kliman, and Eyre-Walker 1998; Marais, Mouchi- (Smith and Eyre-Walker 2001b). roud, and Duret 2001; Urrutia and Hurst 2001). Taking For investigating certain questions regarding the composition into account is particularly important for evolution of codon usage bias, it is useful to have a phylogenetic studies of codon usage bias where com- summary statistic describing the pattern of codon usage parisons are made among species with differing back- across all amino acids. Many summary statistics have ground nucleotide compositions or for studies of codon already been developed to describe the patterns of codon usage bias in genomic regions that may differ in back- usage. They can be divided roughly into two classes ground nucleotide composition. If not taken into consid- (Comeron and Aguade 1998). One class summarizes the eration, differences in background nucleotide composi- usage of certain preferred codons, and the other com- tion might lead one to conclude that differences in co- pares every codon's usage to a null distribution (typi- don preference exist when they do not. à cally uniform usage of synonymous codons). The former The departure of Nc from 61 because of variation class of methods has the disadvantage that it requires a in background nucleotide composition was recognized prior knowledge of the preferred codons. With summary by Wright and described as the following relationship à statistics one can explore general patterns, such as the between the third position GC content (fGC) and Nc: relationship of codon usage bias to recombination rate, gene length, or synonymous substitution rate. Observing à 29 broad patterns such as these has already provided insight NcGC5 2 1 f 1 22. into the evolutionary dynamics of codon usage bias 12f GC1 (1 2 f GC) (Kliman and Hey 1993; Akashi and Eyre-Walker 1998; Using this relationship one can plot the expected value Kreitman and Comeron 1999). à of Nc versus the GC content. Such plots are known as à Nc-plots (e.g., see ®g. 2 in Wright 1990). Points that fall Abbreviations: ENC, effective number of codons. below the bell-shaped line suggest deviations from a Key words: codon usage bias, background nucleotide composi- null model of no-codon preferences. Unfortunately, tion, mutation bias. studying such patterns does not have a clear statistical Address for correspondence and reprints: John A. Novembre, De- à basis and detracts from the value of Nc as a quantitative partment of Integrative Biology, University of CaliforniaÐBerkeley, summary statistic. VLSB 3060, Berkeley, California 94720. E-mail: novembre@ socrates.berkeley.edu. Other codon usage bias summary statistics such as 2 Mol. Biol. Evol. 19(8):1390±1394. 2002 Akashi's scaled x (Akashi 1995), the maximum-likeli- q 2002 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038 hood codon bias statistic (MCB; Urrutia and Hurst 1390 Letter to the Editor 1391 2001), and B*(a) (Karlin and Mrazek 1996) account for calculated in a number of ways, including using mono-, background nucleotide composition. However, as pre- di-, or trinucleotide frequencies. The x2 statistic for each 2 sented below, these statistics are affected strongly by the amino acid,Xa , is calculated as sequence lengths being studied. Clearly, it would be use- k n (p 2 e )2 ful to have a statistic similar to Nà that has low se- 2 ai i c X a 5 O . quence-length effects but that relaxes the assumption of i51 ei equal usage of all synonymous codons. Using theXF2 values,à is de®ned as follows: à aa9 Here, a statistic (Nc9 ) is presented that accounts for 2 background nucleotide composition and is minimally af- à X aa1 n 2 k F95a . (5) fected by sequence length. The statistic is based on Pear- k(na 2 1) 2 son's X statistics and describes the departure of the ob- à served codon usage from some expected distribution. The calculations from here on mirror those of Nc. The à ¯ Far9 values are averaged for each redundancy class (Fˆ 9 ) The expected distribution can be derived from knowl- à edge of the background nucleotide composition. When and used to calculate Nc9: à à uniform usage of codons is expected,Nc9 reduces to Nc. à ¯ N95crrn /Fˆ 9. (6) à à ∈O Like Nc, Nc9 is relatively insensitive to sequence length. r ARC The properties ofNà are tested on simulated sequences, c9 2 2 Note that becauseXa is x distributed, the expected val- and compared with related summary statistics. The re- à sults suggest thatNà 9 is useful for studying codon usage ue of F9 will be 1/k when the codon usage matches the c expected usage pattern. Using the d-method, it can also among sequences that vary in background nucleotide à composition. be shown that the expected value ofNc9 will be equal to à 61 for the standard genetic code. To understand the construction ofNc9, it is ®rst nec- à à à à Notably,Nc9 simpli®es to Nc when synonymous co- essary to provide background on Nc.Nc, or the ENC (Wright 1990), is derived by making an analogy with dons are expected to be in equal frequency. In this case 2 the ei are all 1/k, and theXa statistic for each amino acid the effective number of alleles ne at a locus (Kimura and à à reduces to Crow 1964). An estimate of ne is nÃe 5 1/F, where F is an estimate of the expected heterozygosity at the locus. k 22 For quantifying codon usage bias at an amino acid a, X aa5 nkO p i2 n a. (7) Wright de®ned an estimate of the homozygosity of co- i51 don usage as follows: 2 à Substitution of thisXFaa into the equation for9 (eq. 5) k produces Fà 5 np2 2 1(n 2 1) (1) aaiO a k 12i51 @ à 2 F95aainpO 2 1(n a2 1). (8) 12i51 @ Here, pi is the frequency of the ith codon, k is the num- ber of synonymous codons for the amino acid of inter- à à This value ofFa9 is equal to the Fa used by Wright in est, and na is the observed number of codons for the the calculation of Nà (eq. 1). Because the calculation of à c amino acid. The average of the Fa for each r-fold re- ÃÃÃà Ncac9 matches that of Nc onceF9 is obtained,N9 is equiv- dundancy class (e.g., onefold, twofold, threefold, four- à alent to Nc when uniform usage is expected. fold, sixfold) is then computed: à This derivation shows thatNc9 is a generalization of Nà to the case in which expected codon usage is not ¯ 1 à c Fˆ ra5 F (2) uniform. Whereas Nà is implicitly based on an expected O∈ c nRC a RC à distribution of uniform usage,Nc9 can have the expected where nRC is the number of amino acids in the RC re- proportions of synonymous codon usage given by the à à dundancy class. Finally, Nc is computed as follows: background nucleotide composition.Nc9 will decrease à ¯¯¯¯ from 61 when the codon usage does not match the dis- Nc 5 2 1 (9/Fˆ 2356) 1 (1/Fˆ ) 1 (5/Fˆ ) 1 (3/Fˆ ).