Proc. Natl. Acad. Sci. USA Vol. 96, pp. 5123–5128, April 1999 Genetics

Strand asymmetry and in the chloroplast of Euglena gracilis (codon bias͞nucleotide composition͞selection)

BRIAN R. MORTON*

Department of Biological Sciences, Barnard College, Columbia University, 3009 Broadway, New York, NY 10027

Communicated by M. T. Clegg, University of California, Riverside, CA, February 22, 1999 (received for review November 15, 1998)

ABSTRACT It is shown that the two strands of the major codons of the plastid are the C-terminating codons (2). chloroplast genome from Euglena gracilis are asymmetric with As a result, there is a bias toward C at twofold-degenerate sites regards to composition. This asymmetry switches in highly expressed genes, which is noticeable given the strong at both the origin of replication and a location that is halfway AϩT bias in the genome overall (2). The second set is around the circular genome from the origin. In both halves of composed of the fourfold-degenerate codon groups, and for the genome the leading strand is G؉T-rich, having a bias each of these codon groups, the codon withaTatthethird toward G over C and T over A, and the lagging strand is position is the major codon, with the exception of valine, for .(A؉C-rich. This asymmetry is probably the result of a differ- which the major codon is GTA (3 ence in mutation dynamics between the leading and lagging The single known exception to the general pattern discussed strands. In addition to composition asymmetry, the two above is the green alga Euglena gracilis. In terms of anticodon strands differ with regards to coding content. In both halves sequence, the tRNA genes coded by Euglena are essentially of the genome the vast majority of genes are coded by the identical to all other fully sequenced plastid (4). leading strand. These two aspects of strand asymmetry are Despite this, an increased of C at twofold- then applied to a statistical test for selection on codon usage. degenerate sites is not observed in any chloroplast gene of The results indicate that selection on codon usage is limited Euglena, including psbA (1, 4) which codes for the most ,to genes on the leading strand; no gene on the A؉C-rich prominent product of the chloroplast (5). Instead lagging strand shows evidence for selection, suggesting that all of the genes coded by the Euglena chloroplast genome have highly expressed genes are coded predominantly on the strand a high AϩT content at the third codon position of every codon of DNA that is the leading strand during replication. On the group (1). This apparent lack of increase in major codon usage basis of these observations it is proposed that the coding in highly expressed genes seems to suggest that selection is not strand bias is generated by selection to code highly expressed adapting codon usage to the tRNA population in Euglena. genes on the leading strand to coordinate the direction of However, a test comparing the codon usage of each gene to an replication and , thereby increasing the potential expected distribution generated from noncoding base frequen- rate of both reactions. cies indicated that several highly expressed genes from Euglena have a codon usage bias that is significantly different from There is strong evidence that the codon usage of certain plastid expectation (1). In addition, codon bias levels in Euglena genes genes is influenced by selection (1). The frequency of specific are strongly correlated with levels from homologous genes of codons, referred to as major codons, is significantly increased other algae (6). Both of these observations suggest that, in highly expressed genes such as some of the genes that code although there is no obvious increase in the use of major for products involved in the process of photosynthesis. In codons for twofold-degenerate groups, selection does act on contrast, low-expression genes, such as those coding subunits the codon usage of at least some Euglena chloroplast genes. of the plastid RNA polymerase, have a codon usage bias that These apparently conflicting lines of evidence make Euglena is dominated by the genome composition bias, which is AϩT- an interesting case study for codon usage bias. rich. Since the major codons match the plastid tRNA popu- In the current work, the codon usage of Euglena chloroplast lation, it has been suggested that this pattern of codon usage genes is studied in more detail with respect to a previously variation among genes is a result of selection for increased unknown asymmetry in the composition of the genome. A translation efficiency in highly expressed genes (1, 2). Evidence number of genomes have been found to be asymmetrical, for selection has been observed in all plants and algae studied particularly in compositional properties (7, 8). For example, to date, but there are differences in selective intensity between the bacterium genitalium (9) and the metazoan mitochondrial genome (10) have one GϩT-enriched strand lineages as indicated by variation among lineages in both the ϩ number of genes with an increased frequency of major codons and one A C-enriched strand. In some bacterial species, the and the degree of bias toward major codons in highly expressed strand-specific bias switches at both the origin and terminus of genes. In general, codon bias is strong in the green algae, of replication in such a way that the leading strand has the same intermediate strength in other algae, and extremely weak in the bias in all regions of the genome (7, 8, 9, 11). It is likely that flowering plants (1). this is the result of observed differences in the mutational Work on codon usage of plastid genes has focused on the dynamics of the polymerase subunits responsible for replicat- major codons in two sets of codon groups. The first set is ing the different strands of DNA (7). Euglena composed of the twofold-degenerate groups with a pyrimidine In the case of the chloroplast genome, one asym- metrical feature has been noted previously: the circular chro- at the third position (particularly GAY, CAY, TAY, AAY, and mosome is strongly biased in terms of which strand is the TTY). For these codon groups it has been shown that the coding strand (4). The entire genome consists of 143,172 bp,

The publication costs of this article were defrayed in part by page charge Abbreviations: R, purine; Y, pyrimidine; CAI, codon adaptation payment. This article must therefore be hereby marked ‘‘advertisement’’ in index. accordance with 18 U.S.C. §1734 solely to indicate this fact. *To whom reprint requests should be addressed. e-mail: bmorton@ PNAS is available online at www.pnas.org. barnard.columbia.edu.

5123 Downloaded by guest on September 30, 2021 5124 Genetics: Morton Proc. Natl. Acad. Sci. USA 96 (1999)

with a total of 62 -coding genes, including ORFs, and MATERIALS AND METHODS 41 RNA-coding genes. Moving outwards from the origin of All sequences were extracted directly from the complete replication (site 1) in one direction, almost all genes from site genome sequence of Euglena gracilis (ref. 4, GenBank acces- 1 to about 70000 are coded on the strand designated as strand sion no. X70810). Genes greater than 350 in A. In the other direction from the origin of replication, almost length, to avoid sampling bias in codon usage calculations, and all genes from site 143172 to 70000 are coded on strand B (ref. noncoding regions greater than 30 bases in length were used 4, also see Fig. 1). for analyses. Coding sequence locations were determined from In this study it is shown that the two strands of the Euglena information in the GenBank file, and the definition of strands chloroplast genome also differ in compositional properties. A and B are taken from ref. 4. The codon adaptation index Strand A has a bias toward G over C and T over A from (CAI; ref. 12) of each gene was calculated as described nucleotide 1 to approximately 70000 but a bias toward A and previously (1). C from about nucleotide 70000 to the last base of the circular To test CAI relative to the composition of noncoding genome. Therefore, the strand that codes the majority of genes regions, the circular genome was divided into two sections in each half of the genome also has a bias toward G and T. which were separated by sites 0 and 69075 of the genome (the Since the statistical test of codon usage applied previously was numbering is that used in the GenBank file). This division is based on an expectation generated from noncoding base based on the analysis presented below. For both halves of the frequencies (1), the strand asymmetry demonstrated in this genome, cumulative dinucleotide frequencies from all non- work requires that the test utilize the base frequencies for a coding regions were calculated. Dinucleotide frequencies for specific strand and genome location. When these factors are the opposite strand were determined from the complement so controlled for it is found that many more genes than previously that four tables of dinucleotide frequencies were obtained; one thought have a codon bias that is significantly greater than for each strand in each half of the genome. These tables were expected. In addition, selection appears to be limited to genes used to perform a test on each individual gene as described on the GϩT-rich strand, which is the leading strand in both previously (1). For 500 replicates, a random codon table was directions from the origin of replication. Therefore, coding generated with the same usage as the actual gene. strand bias is correlated with level. It is Codon frequencies were assigned at random for each codon proposed that this bias is due to selection to coordinate group on the basis of the composition of the second codon transcription and DNA replication. position of that codon group as well as the dinucleotide

FIG. 1. Plot of (T Ϫ A)͞(T ϩ A) (solid line) and (C Ϫ G)͞(C ϩ G) (dashed line) for strand A of the chloroplast genome of Euglena gracilis. The plot is performed over a sliding window of length 5,000 and a shift of 500, starting at the origin of replication. Above the plot is a representation of the coding strand, either strand A or B, for known genes and ORFs (4). Each box represents the number of genes indicated but does not imply anything about operon structure or spacing of the genes in that region. The shaded box represents the region where the three repeated rDNA operons are coded. The approximate location of the atpA gene is also indicated. Downloaded by guest on September 30, 2021 Genetics: Morton Proc. Natl. Acad. Sci. USA 96 (1999) 5125

frequency table for that region and strand in which the gene It can be seen in Fig. 1 that strand A has a bias toward T over is coded. The CAI value was then calculated for the random A and G over C over most of the first half of the genome, codon table. From the 500 replicates for the gene a distribution whereas in the second half of the genome, strand A has a was calculated and compared with the observed value. Any complementary bias toward A and C. Any strand with a bias gene with a CAI value greater than 2 standard deviations toward T over A and G over C will be referred to here as above the mean has a codon usage bias that is significantly GϩT-rich, although there is actually a higher representation of ϩ different than expected on the basis of the composition of A than G because of the strong A T bias of the genome. ϩ noncoding regions, which is regarded as evidence for selection Therefore, strand A is G T-rich over the half of the genome on codon use, assuming that noncoding dinucleotide frequen- between sites 1 and roughly 70000, whereas strand B is ϩ cies are an accurate reflection of mutation bias (1). G T-rich over the other half. The main exception to this Strand asymmetry was calculated for the genome overall and general trend is the small bias in strand A toward T around for noncoding regions by using the method described by Lobry position 125 kb from the origin. This is the location of the three (9, 11). For noncoding regions, the value (T Ϫ A)͞(T ϩ A), repeats of the rDNA operon on strand B, indicated by the where A and T represent the number of occurrences of the two shaded box above the plot. This deviation may simply reflect nucleotides, respectively, was calculated separately for each selection on the structural properties of the rRNA products. There are also two short sections that are an exception to the region. For the entire genome, the values (T Ϫ A)͞(T ϩ A) and strand bias. One is where strand A has a slight bias toward C (C Ϫ G)͞(C ϩ G) were calculated for windows of length 5000, over G roughly 30 kb from the origin, and the other where it starting with nucleotides 1 to 5000 and continuing by shifting shows a slight bias toward T over A about 90 kb from the origin. 500 nucleotides downstream along strand A for each new Because the first correlates with a region where 7 genes are window. coded by strand B and the second with a region where strand B codes 12 genes, these small exceptions may reflect selection RESULTS AND DISCUSSION for amino acid composition of these genes. The strand asymmetry in base composition and the differ- Strand Asymmetry in Euglena. The nucleotide composition ence between the two halves of the genome is most apparent of strand A from the Euglena chloroplast genome is shown in in noncoding DNA (Fig. 2). The position of the atpA gene, Fig. 1, starting and ending at the origin of replication. The which lies at the approximate point in Fig. 1 where the strand Ϫ ͞ ϩ Ϫ ͞ ϩ values (T A) (T A) and (C G) (C G) are plotted over shift occurs, is indicated in the figure. Almost all sequences to a sliding window, so that the plot represents regional biases the left (upstream) of atpA have a bias of T over A, whereas toward pyrimidines. Above the plot of compositional bias is a most sequences downstream have a bias toward A over T. The representation of the coding strand for all genes. As noted bias toward C over G was not measured for individual non- previously, there is a bias in the coding strand of the Euglena coding regions because many are under 100 nucleotides in chloroplast genome. Over the first half of the genome, most length and strongly biased toward A and T. However, the bias genes (38 of 51) are coded by strand A, whereas over the is observed; the noncoding regions up to nucleotide 69075 have second half most genes (45 of 52, including the 3 rDNA a cumulative composition of 2,694 A vs. 3,771 T and 804 G vs. operons) are coded by strand B (4). 712 C, whereas noncoding regions downstream of nucleotide

FIG. 2. Plot of (T Ϫ A)͞(T ϩ A) of strand A for all noncoding regions greater than 30 nucleotides in length. The noncoding regions are numbered from 1 to 75, starting at the origin of replication and proceeding downstream. These numbers represent the order of the regions but not genome location. The relative location and nucleotides of the atpA gene, which is located between noncoding regions 40 and 41, are indicated to show the genome location at which the symmetry changes in Fig. 1. Downloaded by guest on September 30, 2021 5126 Genetics: Morton Proc. Natl. Acad. Sci. USA 96 (1999)

69075 have a composition of 4,041 A vs. 3,344 T and 871 G vs. Table 1. Compositional properties of genes in different genomic 1,180 C. The few regions that are exceptions to the rule are regions and on different strands quite short, and the two regions downstream of the atpA gene Mean CAI Mean % T* with a strong T bias are identical 86-bp spacer sequences within Position Strand Genes (Range) (Range) the rDNA repeat, that is, within the region around 125 kb in Fig. 1 that is noted above. Overall, noncoding regions are Entire Both 34 0.355 0.484 asymmetrical, and the two halves of the genome display genome A 20 0.358 0.518 different strand-specific biases. Although this pattern of asym- B 14 0.352 0.435 metry is very similar to what has been observed in some other 0–70 kb Both 23 0.351 0.475 organisms (8, 9, 10), none of the other fully sequenced plastid A(GϩT-rich) 17 0.363 0.555 genomes has a composition asymmetry that is similar to what (0.242–0.436) (0.383–0.696) is observed in Euglena (data not shown). B(AϩC-rich) 6 0.316 0.248 Figs. 1 and 2 demonstrate that there are two asymmetrical (0.246–0.380) (0.147–0.312) features of the genome, coding strand bias and compositional 70–143 kb Both 11 0.365 0.503 bias. Interestingly, both of these biases switch strands at ϩ essentially the same genomic location, roughly 69 kb from the A(A C-rich) 2 0.300 0.322 (0.295–0.305) (0.281–0.361) origin, just downstream from the atpA gene (Fig. 1). As a result ϩ of this correlated change in biases the GϩT-rich strand codes B(G T-rich) 9 0.379 0.543 for the majority of genes over both halves of the genome. (0.310–0.509) (0.458–0.667) Strand A is defined as the strand that runs 5Ј to 3Ј as base *T composition at fourfold-degenerate sites. numbering increases (4). Since the switch in asymmetry occurs at the origin and again at the point roughly halfway around the from the composition properties of that strand. Therefore, to genome from the origin where replication terminates (4, 13), make inferences about selection we have to determine whether it is the leading strand of replication that is GϩT-rich in both the differences in codon usage between genes are due simply halves of the genome. Because the composition asymmetry is to the asymmetrical composition bias in the genome. most apparent in noncoding sequences, it is probably the result To control for variation in the composition properties of of an asymmetrical mutation process. A difference between different genome locations, we applied a previously developed the leading and lagging strand polymerases (or subunits), test for selection on codon usage. This test utilizes noncoding either in misincorporation properties or proofreading func- base compositions to generate an expected distribution of CAI tions, could account for the asymmetry. This possibility is for each gene. The distribution is then used to determine supported by the observation that the two replication strands whether the actual gene has a codon usage (CAI) that is in have significantly different mutational significantly different from expectation (1). A significant dynamics because of differences in the polymerase subunits difference means that composition properties of the genome, responsible for replicating the different strands (7). or region, cannot account for codon usage of that gene. This Strand Asymmetry and Codon Bias. In the current study we deviation is then taken as evidence that selection acts on that are interested in the codon usage of Euglena chloroplast genes: particular gene (1), based on an assumption that noncoding specifically, whether there is any evidence for selective con- DNA is an accurate reflection of composition bias. The straints and on what genes those constraints act. In addressing asymmetry of the Euglena genome means that the test for a this issue I studied each gene separately, comparing the level specific gene must utilize the base composition of the non- of codon bias of that gene to an expected value. Because coding regions from the same strand and genome location that composition bias contributes to codon bias (14, 15), the codes the gene. Therefore, four sets of noncoding dinucleotide compositional asymmetry demonstrated in Fig. 2 requires that frequencies were used, one for each of strands A and B over both the strand and the genomic location of the gene be taken the first half of the genome, and one for each of strands A and into consideration. To accomplish this, genes were divided into B over the second half of the genome. For each gene the two groups, those coded by a GϩT-rich strand (strand A from appropriate frequency table was applied to generate the 0 to 70 kb and strand B from 70 kb to 143 kb) and those coded distribution. This will test whether the codon usage of a specific by an AϩC-rich strand. The level of codon bias was measured gene is within the expected range, given the genomic location by using CAI. CAI measures, in essence, the relative frequency of that gene, allowing us to assess selection with regards to the of major codons in different genes and is positively correlated differences between the sets of genes in Table 1. with expression level in plastid genomes (1), probably because The results of this test are presented in Table 2. Two main of selection acting on highly expressed genes to match codon conclusions can be reached. First, many more genes give a usage to the tRNA population (1, 16). significant result than when the test was applied previously. As expected, the two sets of genes have different average The initial test, which did not account for strand asymmetry, compositional properties. Genes coded by the GϩT-rich gave evidence for selection on only 6 of 36 chloroplast genes strand have a much higher T content at fourfold-degenerate (including two ORFs) from Euglena (1). The test accounting sites and higher CAI values than genes coded on the AϩC-rich for asymmetry gives a significant result for 23 of 34 genes. strand, and there is almost no overlap between the ranges of Second, there is a clear asymmetry in the distribution of CAI values for GϩT-rich strand genes and AϩC-rich strand significant results. All of the genes that have a significant CAI genes (Table 1). The two exceptions are in region 1; the ccsA value are on the GϩT-rich strand; none of the 8 protein-coding gene, coded on strand B from nucleotides 2171–3549, which genes greater than 350 nucleotides in length that are coded by has a CAI of 0.380, and rps3, coded on strand A from an AϩC-rich strand have a significant value. Therefore, the nucleotides 52501–53668, which has a CAI of 0.242. Without higher CAI values of genes on the GϩT-rich strand (Table 1) these two genes the CAI ranges are 0.246–0.338 for strand B are not simply due to compositional properties; selection and 0.346–0.436 for strand A. However, variation in CAI will appears to contribute to codon bias to at least some degree. It necessarily follow from variation in composition properties. As should be noted, however, that the intensity of selection is not discussed above, selection in the plastid favors T at the third as strong in Euglena as in other green algae. Although the position of most fourfold-degenerate codon groups. Because results presented here show that CAI values of a number of CAI measures the frequency of major codons, genes on the genes are significantly greater than expectation, they are still GϩT-rich strand could have higher CAI values because of the low relative to homologous genes from other algae (1). Re- higher frequency of T at fourfold-degenerate sites resulting gardless, the results from Table 2 strongly suggest that selec- Downloaded by guest on September 30, 2021 Genetics: Morton Proc. Natl. Acad. Sci. USA 96 (1999) 5127

Table 2. Results from the codon bias test almost exclusively coded on the leading strand (4). It should be Genome location* Gene CAI† noted that this type of correlation is not limited to Euglena;a similar correlation between CAI and coding strand exists in E. Region 1 psbD P Ͻ 0.001 coli, with high CAI genes being coded at a significantly higher strand a (GϩT-rich) ycf13 P Ͻ 0.001 frequency on the leading strand (7). In addition, a recent psaA P Ͻ 0.0001 analysis of bacterial genomes showed that in every case the psaB P Ͻ 0.001 majority of genes are coded by the leading strand of replication Ͻ rpl2 P 0.01 (8). Ͻ rpl22 P 0.001 On the basis of results in Table 2, then, it is proposed that rps3 NS the asymmetrical organization of the Euglena chloroplast Ͻ rpl16 P 0.05 genome is caused by a selective pressure to coordinate repli- rpl14 NS cation and transcription similar to what has been suggested Ͻ rpl5 P 0.001 previously for B. burgdorferi (17). This coordination would be Ͻ rps8 P 0.01 independent of the compositional asymmetry, which is prob- Ͻ rps2 P 0.001 ably generated by different misincorporation biases of the Ͻ atpI P 0.001 enzymes that replicate the leading and lagging strands. The atpF P Ͻ 0.001 Ͻ fact that the two features switch strands at roughly the same psbC P 0.0001 genomic location (Fig. 1) is probably due to the shared psbA P Ͻ 0.0001 Ͻ dependence on replication, not to any real influence of one on atpA P 0.001 the other. Region 1 ycf4 NS Asymmetry and Amino Acid Composition. The strand asym- strand B (AϩC-rich) ccsA NS metry demonstrated in Figs. 1 and 2 has been shown above to tufA NS be an important factor in the analysis of codon bias. In rps7 NS addition, because genomic base composition is correlated with rps12 NS amino acid composition in a number of organisms (19–22), it rpl20 NS is possible that the asymmetry in Euglena gives rise to variation Region 2 rps4 NS in amino acid composition among genes in different genomic strand A (AϩC-rich) rps11 NS sections. The amino acids that would be expected to be influenced are those coded by codons that are either GϩT-rich Region 2 rpl12 P Ͻ 0.001 or AϩC-rich. There are 8 that fall into this category: Phe strand B (GϩT-rich) rps9 NS (TTY), Cys (TGY), Trp (TGG), Gly (GGN), Val (GTN), Lys atpB P Ͻ 0.001 (AAR), Asn (AAY), and Pro (CCN). The first five are petB P Ͻ 0.001 expected to be present in an increased frequency on the psbB P Ͻ 0.0001 leading strand in both halves of the genome and the last three rpoC2 P Ͻ 0.001 to be present in a decreased frequency. Ͻ rpoC1 P 0.001 The number of occurrences of for each of these 8 amino Ͻ rpoB P 0.001 acids is given in Table 3 for both the GϩT-rich and AϩC-rich Ͻ rbcL P 0.0001 genome regions. As expected, the five amino acids coded by *Regions are 1–70000 (1) and 70001-143000 (2). GϩT-rich codons occur at a significantly higher frequency on †P values are given for deviation of the observed CAI from the mean the leading strand (␹2 ϭ 7.05, P Ͻ 0.01). The only exception of the generated distribution. NS indicates a result that is not is glycine (GGN), which is present in higher frequency on the significant at the 5% level. AϩC-rich strand. When this amino acid is excluded the difference between the two strands is strongly significant (␹2 tion does influence codon usage of a significant number of ϭ 33.9, P Ͻ 0.001). In addition, the three amino acids coded genes in Euglena. by AϩC-rich codons are present at a significantly lower The fact that significant results were obtained only for genes ␹2 ϭ Ͻ ϩ frequency on the leading strand ( 13.3, P 0.01), primarily on the G T-rich strand indicates that there is a correlation because of a difference in the representation of Lys (AAR). between codon usage and genome organization in Euglena.On These results are consistent with strand bias having an influ- the basis of the model developed for Borrelia burgdorferi (17) ence on amino acid composition of the coded by each it is proposed that this correlation is a secondary effect of two strand, although differences in coding properties will have to different correlations: first, the established correlation be- tween expression level and CAI (1), and second, a correlation Table 3. Differences in amino acid composition of genes coded on between gene expression and genome organization. The pro- different strands posal is that the second correlation exists and that it is due to GϩT-rich strand AϩC-rich strand selective pressure to code most genes, and in particular highly expressed genes, on the leading strand. This organization Codons Number Frequency Number Frequency would increase the likelihood that the RNA and DNA poly- All 9,778 — 1,854 — merases are moving in the same direction during replication, reducing the number of ‘‘head-on collisions’’ between the two TTY 635 0.065 55 0.030 enzymes. Such collisions have been shown to significantly TGY 110 0.011 13 0.007 decrease the rate of replication (17, 18), so the basis of the TGG 198 0.020 8 0.004 selective pressure would be to increase the potential rate of GGN 729 0.075 185 0.100 DNA replication and, possibly, the rate of transcription during GTN 632 0.065 116 0.063 the replication cycle (17). Under such a model, the selective Total 2,304 0.236 377 0.203 pressure to code any particular gene on the leading strand would increase with the expression level of that gene, leading AAY 505 0.052 101 0.054 indirectly to the observed correlation between CAI and strand. AAR 697 0.071 196 0.106 The secondary nature of this correlation is supported by the CCN 379 0.039 73 0.039 observation that the nontranslated rRNA and tRNA genes are Total 1,581 0.162 370 0.200 Downloaded by guest on September 30, 2021 5128 Genetics: Morton Proc. Natl. Acad. Sci. USA 96 (1999)

be tested in future studies to evaluate the significance of the yses, and an accurate assessment of its relation to other algae differences observed between the strands. will require a serious consideration of the unusual composition Concluding Remarks. The data presented here demonstrate properties of the Euglena chloroplast genome. that there is a strong asymmetry in the chloroplast genome of Euglena gracilis. Two features are found to be asymmetrical: I thank Sean Graham for reading and commenting on this manu- nucleotide composition bias and which strand acts as the script and two anonymous reviewers for helpful comments. This work coding strand for genes. The strand-specific biases of both was supported in part by National Science Foundation Grant MCB- features are correlated in that they change at both the origin 9727906. and the putative terminus of DNA replication. It is suggested 1. Morton, B. R. (1998) J. Mol. Evol. 46, 449–459. that these are two independent features that are correlated 2. Morton, B. R. (1993) J. Mol. Evol. 37, 273–280. because of a common reliance on replication. Composition 3. Morton, B. R. (1996) J. Mol. Evol. 43, 28–31. asymmetry is proposed to arise from different mutational 4. Hallick, R. B., Hong, L., Drager, R. G., Favreau, M. R., Montfort, dynamics of the leading and lagging strand. When these biases A., Orsat, B., Spielman, A. & Stutz, E. (1993) Nucleic Acids Res. are accounted for there is statistical evidence that selection 21, 3537–3544. influences the codon usage of a large number of genes, but 5. Mullet, J. E. & Klein, R. R. (1987) EMBO J. 6, 1571–1579. apparently only genes coded on the leading strand of replica- 6. Morton, B. R. (1999) in Evolutionary Biology, eds. Hecht, M., tion. It is proposed that this arises from a correlation between MacIntyre, R. & Clegg, M. (Plenum, New York), in press. 13, expression level and selection for codon usage coupled with 7. Francino, M. P. & Ochman, H. (1997) Trends Genet. 240–245. 8. McLean, M. J., Wolfe, K. H. & Devine, K. M. (1998) J. Mol. Evol. selection to code genes, and in particular highly expressed 47, 691–696. genes, on the leading strand. This organization would limit 9. Lobry, J. R. (1996) Science 272, 745–746. ‘‘collisions’’ between RNA and DNA polymerase and, there- 10. Jermiin, L. S., Graur, D. & Crozier, R. H. (1995) Mol. Biol. Evol. fore, increase the speed of replication. The result is the coding 12, 558–563. strand asymmetry that is observed. Finally, it is also shown that 11. Lobry, J. R. (1996) Mol. Biol. Evol. 13, 660–665. amino acid frequencies vary between genes coded on each 12. Sharp, P. M. & Li, W.-H. (1987) Nucleic Acids Res. 15, 1281–1295. strand in a manner consistent with the mutation bias. 13. Koller, B. & Delius, H. (1982) EMBO J. 1, 995–998. If this model concerning selection and coding strand asym- 14. Sharp, P. M. (1991) J. Mol. Evol. 33, 23–33. 15. Sharp, P. M., Stenico, M. J., Peden, F. & Lloyd, A. T. (1993) metry for Euglena is correct, then it raises questions about the Biochem. Soc. Trans. 21, 835–841. plastid genome of other algae as well as plants. No other plastid 16. Ikemura, T. (1985) Mol. Biol. Evol. 2, 13–35. genome for which we have a complete sequence available— 17. McInerney, J. O. (1998) Proc. Natl. Acad. Sci. USA 95, 10698– which includes a diatom, a red alga, the cyanelle of Cyanophora 10703. paradoxa, a green alga, and several land plants—has such an 18. French, S. (1992) Science 258, 1362–1365. obvious organization in terms of coding strand bias. If the 19. D’Onofrio, G., Mouchiroud, D., Aissani, B., Gautier, C. & organization in Euglena does in fact reflect selection to Bernardi, G. (1991) J. Mol. Evol. 32, 504–510. coordinate replication and transcription, we need to determine 20. Berkhout, B. & van Hemert, F. J. (1994) Nucleic Acids Res. 22, 1705–1711. whether there is any evidence that other plastid genomes have 21. Porter, T. D. (1995) Biochim. Biophys. Acta 1261, 394–400. a similar organization. Finally, the evolutionary relationship of 22. Foster, P. G., Jermiin, L. S. & Hickey, D. A. (1997) J. Mol. Evol. the Euglena chloroplast to other plastids has been problematic 44, 282–288. for some time (23). Both the asymmetry described here and the 23. Martin, W., Somerville, C. C. & Loiseaux-de Goer, S. (1992) J. unusual codon usage bias (6) could influence sequence anal- Mol. Evol. 35, 385–404. Downloaded by guest on September 30, 2021