<<

Proc. Nati. Acad. Sci. USA Vol. 91, pp. 6795-6801, July 1994 Colloquium Paper

This paper was presented at a colloquium entitled "Tempo and Mode in " organized by Walter M. Fitch and Francisco J. Ayala, held January 27-29, 1994, by the National Academy of Sciences, in Irvine, CA. Rates and patterns of DNA evolution (chloroplast DNA/ribulose-l,5-bisphosphate carboxylase/ evolution/evolutionary rates) MICHAEL T. CLEGG*, BRANDON S. GAUTt, GERALD H. LEARN, JR.*, AND BRIAN R. MORTON* *Department of and Plant Science, University of California, Riverside, CA 92501; and tDepartment of Statistics, North Carolina State University, Raleigh, NC 27895

ABSTRACT The chloroplast (cpDNA) of Because the photosynthetic machinery requires many has been a focus of research In plant molecular evolution and more functions than are specified on the cpDNA of systematics. Several features of this genome have facilitated plants and , it is assumed that many gene functions were molecular evolutionary analyses. First, the genome is smail and transferred to the eukaryotic nuclear genome. The strong constitutes an abundant component of cellular DNA. Second, conservation of cpDNA gene content cited above indicates the chloroplast genome has been extensively characterized at that most transfers of gene function from the chloroplast to the molecular level providing the basic information to support the nuclear genome occurred early following the primordial comparative evolutionary research. And third, rates of nucde- endosymbiotic event. Among land plants, gene content is otide substitution are relatively slow and therefore provide the almost perfectly conserved, although a few recent transfers appropriate window of resolution to study plant phylogeny at of function from the chloroplast to the nuclear genome have deep levels ofevolution. Despite a conservative rate ofevolution been demonstrated (10, 11). and a relatively stable gene content, comparative molecular Conservation of gene content and a relatively slow rate of analyses reveal complex patterns of mutational change. Non- substitution in -coding has made the coding regions of cpDNA diverge through insertion/deletion chloroplast genome an ideal focus for studies of plant evo- changes that are sometimes site dependent. Coding genes lutionary history (reviewed in ref. 12). These efforts have codon bias that to violate culminated in the past year with the publication of a volume exhibit different patterns of appear of papers that presents a detailed molecular phylogeny for the equilibrium assumptions of some evolutionary models. plants (13). This effort has involved many laboratories Rates ofmolecular change often vary among plant famile and that have together determined the DNA sequence for more orders in a manner that violates the assumption of a simple than 750 copies (by latest count) of the cpDNA gene rbcL molecular clock. Finally, protein-coding genes exhibit patterns encoding the large subunit of ribulose-1,5-bisphosphate car- of amino acd change that appear to depend on protein boxylase (RuBisCo). The sequence data span the full range of structure, and these patterns may reveal subtle aspects of plant diversity from algae to flowering plants. This very large structure/function relationships. Only comparative studies of collection ofgene sequence data has allowed the reconstruc- molecular sequences have the resolution to reveal this under- tion of plant evolutionary history at a level of detail that is lying complexity. A complete description of the complexity of unprecedented in molecular systematics. molecular change is essential to a full understanding of the Accurate phylogenies are of more than academic interest. mechanisms of evolutionary change and in the formulation of They provide an organizing framework for addressing a host realistic models of mutational processes. of other important questions about biological change. For example, questions about the origins of major morphological The chloroplast genome was almost certainly contributed to adaptions must be placed in a phylogenetic context to recon- the eukaryotic through an endosymbiotic association with struct the precise sequence of genetic (and molecular) a -like . Moreover, present evidence changes that give rise to novel structures. The mutational suggests that the association that led to land plants occurred processes that subsume biological change can be revealed in only once in evolution (1). The derivative genome great detail through a phylogenetic analysis. And, questions (cpDNA) is a relictual molecule ofabout 150 kbp that encodes about the frequency and mode of horizontal transfer of roughly 100 genetic functions (reviewed in refs. 2 and 3). This can only be addressed in a phylo- genome is the most widely studied plant genome with regard genetic context. One can make a very long list of biological to both molecular organization and evolution. Complete se- problems that are illuminated by phylogenetic analyses. quences of six chloroplast are now available, and Despite their obvious utility, many important questions these represent virtually the full range of plant diversity [an remain about the accuracy of molecular phylogenies. All alga, (4); a , Marchantia (5); a conifer, methods of phylogenetic reconstruction assume a fairly high black pinet; a dicot, tobacco (6); a monocot, rice (7); and a degree of statistical regularity in the underlying mutational parasitic dicot, Epifagus (8)]. With the possible exception of dynamics. Our goal in this article is to analyze the tempo and Euglena and Epifagus, the picture that has emerged is one of mode of cpDNA evolution in greater depth. We will show a relatively stable genome with marked conservation of gene that below the surface impression of conservation there are content and a substantial conservation of structural organiza- a variety of mutational processes that often violate assump- tion. Mapping studies that span land plants and algae confirm the impression of strong conservation in gene content (9). Abbreviations: cpDNA, chloroplast DNA; RuBisCo, ribulose-1,5- bisphosphate carboxylase; indel, insertion/deletion; RFLP, restric- tion fragment length polymorphism. The publication costs of this article were defrayed in part by page charge *Wakasugi, T., Tsudzuki, J., Itoh, S., Nakashima, K., Tsudzuki, T., payment. This article must therefore be hereby marked "advertisement" Shibata, M. & Sugiura, M., 15th International Botanical Congress, in accordance with 18 U.S.C. §1734 solely to indicate this fact. Aug. 28-Sept. 3, 1993, Tokyo, abstr. 7.2.1-6. 6795 Downloaded by guest on September 30, 2021 67% Colloquium Paper: Clegg et al. Proc. Nad. Acad Sci. USA 91 (1994) tions of rate constancy and site independence of mutational 17). Based on flanking sequence data, it has been suggested change. To facilitate an analysis of patterns of cpDNA that recombination between short direct repeats was respon- mutational change, we divide the genome into functional sible for these deletions (17). The four large deletions ob- categories and we then study the pattern of mutational served in the grasses all extend from within, or very close to, change within each functional category. The functional cat- Orpl23 to an area about 400 bases upstream of psaL. The egories are (i) DNA regions that do not code for tRNA, region spanned by the deletions has a low A+T content ribosomal RNA (rRNA), or protein (referred to as "noncod- relative to other chloroplast noncoding sequences (including ing DNA"); (ii) protein-coding genes; and (iii) chloroplast the surrounding noncoding sequences) and may have been a . The methodological approach is that of comparative coding sequence inserted by recombination, as was Orpl23, sequence analysis, where complete DNA sequences from prior to the divergence of the grass family (16). The rate of phylogenetically structured samples are analyzed to reveal nucleotide substitution in this noncoding region is roughly the pattern of mutational change in evolution. equal to the rate of synonymous substitution (with the exception of OrpI23) for the neighboring rbcL gene (16), but Evolution of Noncoding DNA Regions a large number of short insertion/deletions (indels) have occurred. In addition to both the high rate of indel The chloroplast genome is highly condensed compared with and the large deletions, an inverted repeat in the noncoding eukaryotic genomes; for example, only 32% of the rice region about 300 bases upstream of the psal gene appears to genome is noncoding. Most of this noncoding DNA is found be labile to complex rearrangement events in the grass family in very short segments separatingfunctional genes. A number (16). of recent studies have revealed complex patterns of muta- A study of the noncoding region upstream of rbcL in the tional change in noncoding regions. The most widely studied grass family revealed similarpatterns ofcomplex change (18). noncoding region of the chloroplast genome is the region Indels were found to occur at a greater frequency than downstream of the rbcL gene in the grass family (Poaceae). nucleotide substitutions, and more complex changes, includ- This noncoding sequence is flanked by the genes rbcL and ing multiple tandem duplication events, were found to be psaf (the gene encoding I polypeptide I) and is confined to highly labile sites, resulting in similar although 1694 bases long in the rice genome, making it one of the independent indels at identical sites. Taken together, both longest noncoding regions of the genome and the longest noncoding regions reveal a marked incidence of parallel when introns are excluded (Fig. 1). A for the at labile sites that may be positively misleading for chloroplast gene rp123 is also located within this noncoding phylogenetic studies based on restriction fragment length segment (7). Hiratsuka et al. (7) argued that a large inversion, polymorphism (RFLP) data. unique to the grass family, arose through a recombinational Such complex patterns of mutation are not confined to interaction between nonhomologous tRNA genes, and the noncoding sequences of the chloroplast. The chloroplast same process of recombination between short repeats has gene for the RNA polymerase subunit B", rpoC2, has an been invoked as the mechanism responsible for the origin of insertion within the coding sequence in rice relative to the rp123 pseudogene qirpl23 (14). tobacco. This insertion has been shown to be confined to the The functional rpl23 gene of the rice genome is located in grass family. A comparison of the insertion among members the inverted repeat about 27 kb away. Bowman et al. (15) of the grass family revealed widespread diversity in terms of suggested that the pseudogene was being converted by the indel mutations as well as an increased rate of nucleotide functional rpl23 gene because the genetic divergence among substitution (19). was lower than the divergence observed for the The low noncoding content of the cpDNA generally and surrounding noncoding regions. Subsequent work, based on the high rate of large deletion events observed in the largest a phylogenetic analysis (16), has provided additional support contiguous noncoding segment in the rice genome suggest for a model ofgene conversion between the rpl23 pseudogene that nonessential sequences are rapidly removed from the and its functional counterpart in at least two lineages of the chloroplast genome as a result ofboth intra- and intermolec- grass family. ular recombination. This is supported by the rapid loss of all Four independent deletion events of at least 850 bases in photosynthetic genes from the chloroplast genome of the length have been observed spanning almost identical nonphotosynthetic plant Epifagus (8). Given the low simi- stretches ofthe noncoding region between rbcL and psal (16, larity of the noncoding sequence between the chloroplast Noncoding Region

j3-F3Pl23 Lpsa

Deleted region

Y;) Inverted repeat

FIG. 1. Diagram ofthe noncoding region flanked by rbcL andpsaIgenes. The rpl23 pseudogene is indicated as are four independent deletions. The hypervariable inverted repeat region referred to in the text is marked by a hairpin structure. Downloaded by guest on September 30, 2021 Colloquium Paper: Clegg et al. Proc. Nati. Acad. Sci. USA 91 (1994) 6797 genomes of rice and tobacco and the observed degree of simple counting method (29). The power of these methods to variation in noncoding sequences within the grasses, the detect deviations from a molecular clock depends on a conserved chloroplast genes appear to exist in a very fluid number of factors (e.g., the number of substitution events in medium of surrounding noncoding sequence. the lineages and the transition/transversion bias), but in many situations the three methods have comparable power Evolution of Protein Coding Genes (28, 29). Relative rate tests cannot detect changes in evolu- tionary rates if rates change proportionally among lineages Codon Bias of Chloroplast Genes. The codon utilization (30), although this precise condition seems unlikely to occur pattern of chloroplast genes of the algae Chlamydom- often. Relative rate tests may also fail to detect stochastic onas reinhardtii and moewusii appears to changes in rate within a lineage because the test compares be closely adapted to the chloroplast tRNA population. In average substitution rates. More importantly, relative rates contrast, land plant chloroplast genes are dominated by a contain less information on variation than do absolute rates, genomic bias towards a high A+T content, which is reflected but absolute rates are difficult to estimate because of imper- in a high third-codon position A+T content. An interesting fect knowledge of the time dimension. exception is the gene psbA coding for the central protein of The large rbcL data base has facilitated the characteriza- the photosystem II (PSII) reaction center, the D1 reaction tion of substitution rate variation in a number of evolutionary center polypeptide (20). The psbA gene product turns over at lineages. A few studies have applied relative rate tests to rbcL a very high rate and, consequently, is the most abundant sequences from intrafamilial taxa (31-33); on the whole, these product of the plant chloroplast (21). The psbA studies have uncovered limited rate variation. Two studies gene of flowering plants has a codon use very similar to have characterized rbcL rate variation over a wider sampling Chlamydomonas chloroplast genes (22). Given the tRNAs of plant taxa (34, 35). Gaut et al. (35) examined rbcL encoded by the sequenced chloroplast genomes, the pattern sequences from 35 monocotyledonous taxa and found sub- of codon bias of the Chlamydomonas genes appears to be the stantial synonymous rate variation among evolutionary lin- result of selection for an intermediate codon-anticodon in- eages, but little missense rate variation. The analyses re- teraction strength (B.R.M., unpublished data). Further, vealed rate homogeneity for rbcL sequences within well- based on the fact that psbA is the sole defined families, but substantial rate heterogeneity in chloroplast gene following this codon utilization pattern, interfamilial contrasts. The pattern of interfamilial rate vari- selection most likely acted on psbA codon use to increase ation revealed the most rapid nucleotide substitution rate in translation efficiency (22). the grass family, followed by families in the orders Or- Despite the difference in codon use by psbA relative to chidales, Liliales, Bromeliales, and Arecales (represented by other plant chloroplast genes, this gene has a much lower bias the palms). Rates of synonymous nucleotide substitution in in codon use than do Chlamydomonas chloroplast genes. In grass rbcL sequences exceed rates in other monocot se- fact, there is good evidence that selection no longer acts on quences by as little as 130%6 and by as much as 800%. psbA codon use in flowering plant lineages; it is merely a Bousquet et al. (34) reported extensive rate variation remnant of the ancestral codon use. Further, the highly among 15 rbcL sequences representing monocot, dicot, and expressed gene rbcL displays a codon use more like psbA taxa. They concluded that missense rate vari- than does any other flowering plant chloroplast gene (23). ation is more extensive than synonymous rate variation. This These two factors, the apparently recent loss of selection on result differs with that reported by Gaut et al. (35). The flowering plant psbA codon use and the similarity of rbcL to differences between these studies can probably be ascribed to psbA, indicate that loss of selection for codon adaptation on the wide phylogenetic range of taxa surveyed. The relatively plant chloroplast genes may have occurred at different times narrow monocot comparisons included few missense substi- for each plant chloroplast gene, most likely as genome tutions; the paucity of nonsynonymous substitutions may number increased over time (B.R.M., unpublished data). have made detection of missense rate heterogeneity difficult. Such a scenario may have important implications for studies On the other hand, the study of Bousquet et al. (34) may have of chloroplast origins because standard phylogenetic estima- underestimated synonymous rate variation because some of tion methods are likely to be biased and, therefore, unreliable their sequence comparisons had probably been saturated for at very deep evolutionary levels. The codon bias results also synonymous substitutions. suggest that nucleotide composition cannot be assumed to be Clearly a simple stochastically constant molecular clock at equilibrium, contrary to the assumptions incorporated in does not hold for the rbcL locus; however, there may be a most distance estimators. generation-time effect (36). While it is difficult to estimate Relative Rates of Nucleotide Substitution. Early work on generation times in most plant taxa, the monocot sequences chloroplast sequences suggested that substitution rates do show a clear negative correlation between the minimum not follow a constant molecular clock. Rodermel and Bogo- generation time and substitution rates (35). Given that rate rad (24) first claimed that substitution rates in chloroplast loci variation in monocot sequences is largely synonymous (and can vary among evolutionary lineages. They found that the therefore presumably close to neutral) and that rate variation atpE gene had slower missense rates, and the atpH gene had is correlated with minimum generation time, rate variation at slower synonymous rates, in grass lineages relative to dicot the rbcL locus of monocot taxa is consistent with neutral lineages. Similar comparisons of substitution rates in the predictions. The conclusions of Bousquet et al. (34) are not rbcL locus indicated that both missense and synonymous as straightforward, but their results do indicate clear differ- rates were faster in grass species relative to palm species (25). ences in substitution rates among annual and perennial taxa, These studies relied on fossil-based divergence times for a result not inconsistent with a generation-time effect. Bous- substitution rate estimates. Fossil-based divergence times quet et al. (34) also speculate that speciation rates influence can have large errors that introduce large (and unmeasurable) substitution rates. Other factors hypothesized to contribute uncertainty into rate estimates. Relative rate tests allow to rate variation among evolutionary lineages include poly- comparison of substitution rates between evolutionary lin- merase fidelity (27), selection (37), and G+C content (38). eages without dependence on knowledge of the time dimen- What is the pattern of rate variation at other chloroplast sion. First utilized by Sarich and Wilson (26), Wu and Li (27) loci? If rate variation is predominantly a function of an later extended relative rate tests to nucleotide sequences evolutionary factor that affects the whole genome (like, using a distance measure formulation. Subsequent refine- presumably, the generation-time effect), then one would ments include a maximum likelihood relative test (28) and a expect to find similar patterns of rate variation in most Downloaded by guest on September 30, 2021 6798 Colloquium Paper: Clegg et al. Proc. Natl. Acad Sci. USA 91 (1994) chloroplast loci. Conversely, widely divergent patterns of phytes and , eight , and 85 angiosperms rate variation among chloroplast loci would argue for locus- (for details, contact the authors). In general, conventional specific factors (like selection) as the motive force behind phylogenetic relationships that were supported by the anal- substitution rate variation. Ideally, one should sample chlo- ysis of the rbcL nucleotide sequence data (13, 43) were used roplast loci from the taxa used in the rbcL studies to directly to construct this . The amino sequences for the 105 compare patterns of rate variation among loci. Unfortu- taxa were translated from the nucleotide sequences and nately, data for these kinds of studies are not yet available. aligned (along with the >700 sequences in the As a first step in the analysis of rate variation among large data set) by using the following method. A preliminary chloroplast loci, Gaut et al. (39) examined rate heterogeneity alignment obtained using CLUSTAL v (46) was refined by among a number of chloroplast loci from three taxa (, aligning similar, presumably homologous features of the rice, and tobacco, using Marchantia as an outgroup). Com- solved three-dimensional crystal structures for the RuBisCo parison of sequence data from the maize and rice chloroplast large subunit. Nine gaps were required to align the se- genomes revealed little rate heterogeneity (using tobacco as quences. There were 494 sites in the aligned data set. The an outgroup). However, comparisons of sequence data from computer program MACCLADE version 3.04 was then used to rice and tobacco (using Marchantia as the outgroup) revealed locate amino acid replacement on branches of the tree and to much heterogeneity: significant deviation from a molecular count the total number of amino acid changes required clock was detected at 14 of 40 loci. All 14 loci have acceler- through the phylogeny. ated substitution rates in rice relative to tobacco, suggesting Of the 1350 amino acid replacements, 762 could be unam- concerted rate increases in the rice lineage. In addition, 17 biguously inferred; 568 and 182 were in a-helices and loci had nonsignificantly accelerated rates in rice, while the (3-strands, respectively. Only 488 (36%) of the replacements remaining 9 loci had nonsignificantly slower rates in rice, were in the a/(-barrel structure, which constitutes 46% of relative to tobacco. the sites. For the complete sequence, the most common Interestingly, many of the 14 loci that exhibit significant unambiguous replacements were Glu -* Asp, and Ala -- Ser rate heterogeneity between rice and tobacco lineages encode (40 and 35 changes, respectively). When the number and protein products of related function. For example, three of types of replacements were examined for the various struc- the four loci that encode RNA polymerase subunits demon- tures, interesting patterns emerged. Fig. 2 shows the frac- strate rate heterogeneity between rice and tobacco lineages. tions of sites with replacements and the numbers of changes Further analysis of RNA polymerase genes suggests rate for the complete sequences, a-helices, (-strands, and other acceleration with subsequent rate deceleration (B.S.G., un- structures. The distributions are highly skewed, indicating published data), perhaps indicating the episodic rate pattern that some sites may accept as many as 26 replacement events thought to result from selective pressures (37). while other sites do not accept amino acid replacements. [Character state distributions (amino at a given site) for Patterns of Amino Acid Replacement in the the 105 taxon dendrogram does not allow unequivocal re- RuBisCo Protein construction of the particular residues at all nodes in the dendrogram. For these residues the number of unambiguous The question of site-dependent probabilities of amino acid replacements underestimates the total number of replace- replacement can be addressed in considerable detail by using ments.] the very large rbcL data base. This large data base may be For the complete sequence, 33% of all sites are unvaried, used to ask whether models of nucleotide substitution pro- showing no changes. For a-helices, (-strands, and non-a/(3 vide an acceptable fit to the data, and it is even more structures, the percentages of unvaried sites are 25%, 35%, important to ask whether the pattern of accepted amino acid and 37%, respectively. The length of the tails of the distri- change provides useful information on protein adaptation. butions also differ among structural classes; there are as The three-dimensional structure of the RuBisCo protein has many as 26 unambiguous changes for the a and non-a/,8 been determined for a wide phylogenetic range of species structural classes, while the most changes allowed for sites in (40-42). The pattern of amino acid replacement can be (3-strands is 19. The unambiguous amino acid replacements mapped onto the physical structure to identify major con- that were most common among all sites in the a-helical straints in molecular change. Such an analysis, when placed regions were Ala-- Ser (31 changes), Glum- Asp (15), and Ser in a phylogenetic context, may also help to identify amino -- Thr (15). For the (3 regions among all sites, the most acid replacement of functional importance. common replacements were Hle -- Met (9) and Ile -- Val (8). The large subunit of RuBisCo contains two domains: (i) a The most common replacements for the non-a/13 regions carboxyl-terminal C domain that includes (a) an a/(-barrel were Glu -- Asp (21) and Tyr -* Phe (14). These are all fairly consisting of alternating a-helices and (3-strands, with the conservative replacements. More noteworthy is the fact that parallel (3-strands forming an interior barrel; (b) an interior the predominant changes for the (3-strands involve replace- extension with an a-helix followed by a pair of antiparallel ments among nonpolar residues. (-strands; and (c) a terminal extension of two to four a-hel- Table 1 shows the sites that are most variable for a and ices; and (ii) an amino-terminal N domain consisting of a non-a/( (>19 changes) and ( (>8 changes) structures. Many five-stranded antiparallel (3-sheet and four or five a-helices of the highly variable sites for all of the structural classes that form a lid-like structure covering the barrel of an involve replacements among nonpolar residues. Also inter- adjacent large subunit. Furthermore, the active site is well esting is the replacement of Ala-103 by Cys. This change known. It consists of charged and polar residues at the apparently occurs in six different lineages. In none of the carboxyl-terminal ends of the (strands in the a/(3-barrel of lineages is the change reversed. the C domain ofone subunit and asparagine and What conclusions may we draw from these analyses ofthe residues from the lid of the adjacent N domain. patterns ofamino acid variation? The analyses suggest that in One approach to examining patterns ofamino acid replace- angiosperms, amino acid replacements among hydrophobic ments in a structural context is to map amino acid replace- (nonpolar) residues occurmore frequently at some sites in the ments on a fully resolved, unambiguous tree. Since the large subunit of RuBisCo. While the frequency of the highly evolutionary relationships among all ofthe >750 taxa are not variable sites and the degree of variability differ for (-strand known to any degree of certainty, a subset of the taxa were sites as compared to other structural regions, these variable chosen for this analysis. We used 105 taxa, including three sites may be found throughout the large subunit sequence. , four algae (including Charophytes), five bryo- This sort of variability may have an impact on the use of Downloaded by guest on September 30, 2021 Colloquium Paper: Clegg et al. Proc. Natl. Acad. Sci. USA 91 (1994) 6799

Not Aipha/Beta 35 r EL M Alpha 30 U) Beta CI 25 UAll 0 coC 20

0) 1 5 0 1 10

0 .M WM.Un.nm~m L.J M . n -N M - a.J1E.-..c .J0 A S 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 lb6 - 8 2ya, .3.2 2 4 E 2 E Number of Replacements FIG. 2. The percentage ofsites with replacements versus the numbers ofchanges at a given site for the complete sequences, a-helix, (-strand, and other structures. The replacements were inferred for a dendrogram of 105 taxa (see text) by using the computer program MACCLADE version 3.04. Of the 494 aligned positions, 169 lie in a-helices, and 81 lie in (-strands. nucleotide sequence data for phylogeny reconstruction be- site-dependent probabilities that may also change between cause methods of phylogenetic inference assume the proba- major evolutionary lineages. bility of replacement is independent of site. The fact that Some of the more variable sites show high levels of third-nucleotide positions in a codon differ in substitution replacement for amino acids that are coded for by codons rate compared to first and second position sites is widely with third-site differences [Asp (GAY) < Glu (GAR), Ile appreciated, but this is a simple and easily corrected source (ATH) +- Met (ATG)], where Y = T or C, R = G or A, and of variation. It is much more difficult to correct for complex H = A, C, or T. The fact that these replacements are fairly frequent and readily tolerated is concordant with the higher Table 1. Highly variable sites in RuBisCo large subunits in the rate of substitution for third-position sites. Other third- phylogenetic tree for 105 taxa for a and non-a/,B sites (-20 position substitutions [e.g., Gin (CAR) *. His (CAY), Lys replacements per site) and for P-strand sites (.9 replacements (AAR) + Asn (AAY), Arg (AGR) +* Ser (AGY)] are not per site) among the replacements seen for highly variable sites; fur- Replace- Consistency Predominant thermore, replacements such as Ser (TCN) -+ Ala (GCN), Ile Site* ments Statest index replacementt (ATH) -+ Val (GTN), and Phe (TTY) -* Tyr (TAY) are similarly favorable even though they require first- or second- a and non-a/,B position substitutions. The rate heterogeneities that follow 32 22 2 0.05 Asp Glu (6) from such patterns make methods of phylogenetic inference 95 26 5 0.15 using a simple weighting scheme for different codon positions 99 25 7 0.24 problematical. 146 a 23 6 0.22 149 a 20 7 0.30 Patterns of Evolution 229 a 21 4 0.14 Ile Leu (13) 255 a 26 5 0.15 Ile Met (9) Data from the plastid genomes that have been completely 259 a 22 7 0.27 sequenced reveal that introns are a general feature of these 332 24 5 0.17 genomes. Each of the three classes of introns (groups I, II, 453 26 6 0.19 and III) have a characteristic secondary structure that is (-strand related to the mechanisms by which the intervening sequence 90 9 7 0.67 is excised from RNA precursors. The secondary structure 103 10 3 0.20 Ala Cys (6) requirements, along with other constraints, appear to limit 313 19 3 0.11 Ile- Met (8) the acceptance of mutational changes in intron sequences 330 10 4 0.30 lle Val (3) through evolutionary time. Group I introns, which were the 357 10 4 0.30 Phe- Tyr (6) first self-splicing introns identified (45), are found in a single 358 11 5 0.36 Hle Val (3) tRNA gene, trnL(UAA), from a number of plastid genomes. *Site is the position number in the 494 aligned positions. An a is Group III introns have thus far only been identified in the appended to the number of the residue for those in a-helices to euglenophytes and Astasia longa (46, 47) distinguish them from non-a/,8 sites. and appear to be truncated (or streamlined) group II introns. tStates is the number of different kinds of amino acids found at the It is noteworthy that group I introns are absent from the indicated site. ofEuglena (4) and *Predominant replacement indicates the most frequent unambigu- newly sequenced plastid genomes graciis ously inferred amino acid replacement in the character state recon- beechdrops, Epifagus virginiana (8). Both genomes are atyp- struction for the dendrogram at the indicated site; the number of ical, however: the Euglena genome is markedly different in unambiguously inferred independent occurrences is in parentheses. structural organization and may have had a separate origin The number of unambiguous replacements is an underestimate of from the plastid genomes of land plants; beechdrops is a the true number of replacements for some sites (see text). nonphotosynthetic parasitic plant, and its genome is highly Downloaded by guest on September 30, 2021 6800 Colloquium Paper: Clegg et al. Proc. Nad. Acad Sci. USA 91 (1994) reduced, lacking trnL(UAA) among other genes. Although otide frequencies are likely to be violated. Patterns of amino comparative sequence analyses of both group I and group III acid replacement in the rbcL gene also reveal substantial introns could provide information about rates and mecha- variation in site-dependent probabilities of substitution. nisms of chloroplast DNA evolution, there are few compar- Taken in toto, these complexities in mutational change ative data, and consequently our discussion will be limited to should motivate the development of more realistic algorithms group II introns. for phylogenetic inference when based on molecular data. In Group II introns form the most numerous and best char- the interim, one must view molecular phylogenetic recon- acterized class of plastid intervening sequences. They are structions as approximate, especially at very deep levels of found in both protein-coding genes and tRNA genes. The evolution. secondary structure ofgroup II introns is characterized by six Comparative sequence analyses have utility beyond the domains (I-VI) (reviewed in ref. 48). Domain I has a com- study of phylogeny. The pattern of amino acid replacements plicated structure and contains sites that probably form base observed over evolutionary time reflects the pattern of func- pairs with the 5' exon and are important for intron processing tional constraints that are imposed by natural selection on the (49). Domains II-VI are typically simple stem-loop struc- protein molecule. Natural selection is a sensitive filter be- tures. Domains V and VI have also been shown to be required cause it is capable ofdetecting subtle changes associated with for proper processing of the transcript (50, 51). Learn et al. small fitness effects. Regions that do not accept change are (52) examined the evolutionary constraints on the various clearly strongly constrained by functional requirements, but domains of group II introns in a comparative study of the those regions that do accept change may be simply uncon- intron found in a tRNA gene, trnV(UAC), from seven land strained, or in a few cases they may reflect responses to plants. They found that domain II evolves most rapidly, It is to comparable to the synonymous substitution rate of protein- adaptive change. difficult distinguish these latter two coding genes, consistent with the finding that domain II may possibilities, but unusual patterns, such as the repeated be dispensable in self-splicing introns (53). Portions of do- replacement of Ala-103 by Cys, may be suggestive of adap- main I (that are important in binding to the 5' exon) and tive change. domain V evolve at the slowest rates, comparable to mis- Comparative molecular sequence analyses also reveal as- sense rates for a number of protein-coding genes. These pects of the evolutionary process that would otherwise be results illustrate that evolutionary rates within group II opaque. For example, the recombinational processes that introns vary and appear to be related to the functional acted to convert 4irpl23 to the functional gene are only importance of intron structural features. resolvable at the sequence level. Similarly, the identification of labile sites for indel mutation, and more importantly, the Conclusions identification of local sequence features that may promote indel mutation, depend on a substantial comparative se- The uniformitarian assumption plays a fundamental role in quence base. When viewed broadly, comparative sequence the science of molecular evolution, just as it did in the analyses reveal the rich variety of mutational mechanisms evolutionary paleontology of G. G. Simpson. We must as- that subsume the processes of molecular evolution. sume that the kinds of mutational changes that we can in Support from National Institutes of Health Grants GM 45144 (to demonstrate "microscopic detail" from comparative se- M.T.C.), GM 15528 (to B.S.G.) and GM 45344 (to North Carolina quence analyses are the substance of molecular evolutionary State University) is gratefully acknowledged. The of author- change. When viewed in detail, the patterns of mutational ship is listed alphabetically. All authors made an equal contribution change in the chloroplast genome are complex, and they belie to this article. the notion that the cpDNA is a staid and conservative molecule. 1. Gray, M. W. (1993) Curr. Opin. Genet. Dev. 3, 884-890. The use ofcpDNA RFLPs and cpDNA gene sequence data 2. Palmer, J. D. (1991) in and Somatic Cell Genetics for phylogenetic inference has had an enormous impact on ofPlants, eds. Bogorad, L. & Vasil, K. (Academic, New York), studies ofplant phylogenetics and systematics. This has been Vol. 7A, pp. 5-53. facilitated by the beliefthat cpDNA-based mutational change 3. Clegg, M. T., Learn, G. H. & Golenberg, E. M. (1991) in is and the regu- Evolution at the Molecular Level, eds. Selander, R. K., Clark, regular. By large, assumption of statistical A. G. & Whittam, T. S. (Sinauer, Sunderland, MA), pp. 135- larity is adequate, and cpDNA data are an important addition 149. to the previous evidence available to students of plant 4. Hallick, R. B., Hong, L., Drager, R. G., Favreau, M. R., evolution. Nevertheless, it is important to investigate the Monfort, A., Orsat, B., Spielmann, A. & Stutz, E. (1993) ways that mutational change in this genome departs from the Nucleic Acids Res. 21, 3537-3544. kinds ofregularity assumed by most methods ofphylogenetic 5. Ohyama, K., Fukuzawa, H., Kohchi, T., Shirai, H., Sano, T., inference. When this question is asked, we find that noncod- Sano, S., Umesono, K., Shiki, Y., Takeuchi, M., Chang, Z., ing regions exhibit a number of mutational mechanisms. Aota, S.-I., Inokuchi, H. & Ozeki, H. (1986) Nature (London) Some sites are clearly labile to small indel mutations, and in 322, 572-574. at least one be site 6. Shinozaki, K., Ohme, M., Tanaka, M., Wakasugi, T., Hayash- case, large deletions also appear to ida, N., Matsubayashi, T., Zaita, N., Chunwongse, J., dependent. Complex recombinational processes are also Obokata, J., Yamaguchi-Shinozaki, K., Ohto, C., Torazawa, found to influence the evolution of at least some noncoding K., Meng, B. Y., Sugita, M., Deno, H., Kamogashira, T., regions of cpDNA. Group II introns show a strong relation- Yamada, K., Kusuda, J., Takaiwa, F., Kato, A., Tohdoh, N., ship between structure and probability of mutational change. Shimada, H. & Sugiura, M. (1986) EMBO J. 5, 2043-2049. Analyses of protein-coding genes also reveal complex pat- 7. Hiratsuka, J., Shamida, H., Whittier, R., Ishibashi, T., Saka- terns of mutational change. Finally, rates of nucleotide moto, M., Mori, M., Knodo, C., Honji, Y., Sun, C.-R., Meng, substitution are quite variable at the level oforder and above. B.-Y., Li, Y.-Q., Kano, A., Nishizawa, Y., Hirai, A., Shi- Current data suggest that much of the variation in evolution- nozaki, K. & Sugiura, M. (1989) Mol. Gen. Genet. 217, 185- rate can be accounted for by the generation-time effect 194. ary 8. Wolfe, K. H., Morden, C. W. & Palmer, J. D. (1992) Proc. hypothesis, although this requires further investigation. Natd. Acad. Sci. USA 89, 10648-10652. There are also interesting patterns of codon bias. When land 9. Palmer, J. D. (1985) Annu. Rev. Genet. 19, 325-354. plant and algal genes are compared, it is evident that patterns 10. Downie, S. R. & Palmer, J. D. (1991) in Molecular Systematics of selection for codon utilization have changed over evolu- of Plants, eds. Soltis, P. S., Soltis, D. E. & Doyle, J. J. tionary time and that models that assume equilibrium nucle- (Chapman and Hall, New York), pp. 14-35. Downloaded by guest on September 30, 2021 Colloquium Paper: Clegg et aL Proc. Natl. Acad. Sci. USA 91 (1994) 6801 11. Gantt, J. S., Baldauf, S. L., Calie, P. J., Weeden, N. F. & 31. Soltis, D. E., Soltis, P. S., Clegg, M. T. & Durbin, M. (1990) Palmer, J. D. (1991) EMBO J. 10, 3073-3078. Proc. Natl. Acad. Sci. USA 87, 4640-4644. 12. Clegg, M. T. (1993) Proc. Natl. Acad. Sci. USA 90, 363-367. 32. Doebley, J., Durbin, M., Golenberg, E. M., Clegg, M. T. & 13. Chase, M. W., Soltis, D. E., Olmstead, R. G., Morgan, D., Ma, D. P. (1990) Evolution 44, 1097-1108. Les, D. H., Duvall, M. R., Price, R. A., Hills, H. G., Qiu, 33. Bousquet, J., Strauss, S. H. & Li, P. (1992) Mol. Biol. Evol. 9, Y.-L., Kron, K. A., Rettig, J. H., Conti, E., Palmer, J. D., 1076-1088. Manhart, J. R., Sytsma, K. J., Michaels, H. J., Kress, W. J., 34. Bousquet, J., Strauss, S. H., Doerksen, A. H. & Price, R. A. Karol, K. G., Clark, W. D., Hedren, M., Gaut, B. S., Jansen, (1992) Proc. Natl. Acad. Sci. USA 89, 7844-7848. R. K., Kim, K.-J., Wimpee, C. F., Smith, J. F., Furnier, 35. Gaut, B. S., Muse, S. V., Clark, W. D. & Clegg, M. T. (1992) G. R., Straus, S. H., Xiang, Q.-Y., Plunkett, G. M., Soltis, J. Mol. Evol. 35, 292-303. P. S., Swensen, S. M., Williams, S. E., Gadek, P. A., Quinn, 36. Li, W.-H. (1993) Curr. Opin. Genet. Dev. 3, 896-901. C. J., Equiarte, L. E., Golenberg, E., Learn, G. H., Jr., Gra- 37. Gillespie, J. H. (1986) Mol. Biol. Evol. 3, 138-155. ham, S. W., Barrett, S. C. H., Dayanandan, S. & Albert, V. A. 38. Bulmer, R., Wolfe, K. H. & Sharp, P. M. (1991) Proc. Natl. (1993) Ann. Mo. Bot. Gard. 80, 528-580. Acad. Sci. USA 88, 5974-5978. 14. Ogihara, Y., Terachi, T. & Sasakuma, T. (1992) Curr. Genet. 39. Gaut, B. S., Muse, S. V. & Clegg, M. T. (1993) Mol. Phylo- 22, 251-258. genet. Evol. 2, 89-%. 15. Bowman, C. M., Barker, R. F. & Dyer, T. (1988) Curr. Genet. 40. Chapman, M. S., Suh, S. W., Curmi, P. M. G., Cascio, D., 14, 127-136. Smith, W. W. & Eisenberg, D. S. (1988) Science 241, 71-74. 41. Knight, S., Andersson, I. & Branden, C.-I. (1989) Science 244, 16. Morton, B. R. & Clegg, M. T. (1993) Curr. Genet. 24, 357-365. 702-705. 17. Ogihara, Y., Terachi, T. & Sasakuma, T. (1988) Proc. Natl. 42. Schneider, G., Lindqvist, Y. & Lundqvist, T. (1990) J. Mol. Acad. Sci. USA 85, 8573-8577. Biol. 211, 989-1008. 18. Golenberg, E. M., Clegg, M. T., Durbin, M. L., Doebley, J. & 43. Duvall, M. R., Clegg, M. T., Chase, M. W., Clark, W. D., Ma, D. P. (1993) Mol. Phylogenet. Evol. 2, 52-64. Kress, W. J., Hills, H. G., Eguiarte, L. E., Smith, J. F., Gaut, 19. Cummings, M. P., Mertens King, L. & Kellogg, E. A. (1994) B. S., Zimmer, E. A. & Learn, G. H., Jr. (1993) Ann. Mo. Bot. Mol. Biol. Evol. 11, 1-8. Gard. 80, 607-619. 20. Umesono, K., Inokuchi, H., Shiki, Y., Takeuchi, M., Chang, 44. Higgins, D. G., Bleasby, A. J. & Fuchs, R. (1992) Comp. Appl. Z., Fukuzawa, H., Kohchi, T., Shirai, H., Ohyama, K. & Biosci. 8, 189-191. Ozeki, H. (1988) J. Mol. Biol. 203, 299-331. 45. Zaug, A. J. & Cech, T. R. (1980) Cell 19, 331-338. 21. Mullet, J. E. & Klein, R. R. (1987) EMBO J. 6, 1571-1579. 46. Christopher, D. A. & Hallick, R. B. (1989) Nucleic Acids Res. 22. Morton, B. R. (1993) J. Mol. Evol. 37, 273-280. 17, 7591-7608. 23. Morton, B. R. (1994) Mol. Biol. Evol. 11, 231-238. 47. Seimeister, G., Buchholz, C. & Hachtel, W. (1990) Mol. Gen. 24. Rodermel, S. R. & Bogorad, L. (1987) Genetics 116, 127-139. Genet. 220, 425-432. 25. Wilson, M., Gaut, B. & Clegg, M. T. (1990) Mol. Biol. Evol. 7, 48. Michel, F., Umesono, K. & Ozeki, H. (1989) Gene 82, 5-30. 303-314. 49. Jacquier, A. & Michel, F. (1987) Cell 50, 17-29. 26. Sarich, V. M. & Wilson, A. C. (1973) Science 179, 1144-1147. 50. Schmelzer, C. & Mueller, M. W. (1987) Cell 51, 753-767. 27. Wu, C.-I. & Li, W.-H. (1985) Proc. Natl. Acad. Sci. USA 82, 51. Jarrell, K. A., Dietrich, R. C. & Perlman, P. (1988) Mol. Cell. 1741-1745. Biol. 8, 2361-2366. 28. Muse, S. V. & Weir, B. S. (1992) Genetics 132, 269-276. 52. Learn, G. H., Jr., Shore, J. S., Furnier, G. R., Zurawski, G. & 29. Tajima, F. (1993) Genetics 135, 599-607. Clegg, M. T. (1992) Mol. Biol. Evol. 9, 856-871. 30. Fitch, W. M. (1976) in Molecular Evolution, ed., Ayala, F. J. 53. Kwakman, J. H., Konings, D., Pel, H. J. & Grivell, L. A. (Sinauer, Sunderland, MA), pp. 160-178. (1989) Nucleic Acids Res. 17, 4205-4216. Downloaded by guest on September 30, 2021