Copyright © 1993 by the Genetics Society of America

Rates and Patterns of Base Change in the Small Subunit Ribosomal RNA Gene

Lisa Vawter¹ and Wesley M. Brown

Molecular Systematics Laboratory and Insect Division, Museum of Zoology, and Department of Biology, University of Michigan, Ann Arbor, Michigan 48109-1079

Manuscript received October 28, 1992
Accepted for publication February 4, 1993

ABSTRACT

The small subunit ribosomal RNA gene (srDNA) has been used extensively for phylogenetic analyses. One common assumption in these analyses is that substitution rates are biased toward transitions. We have developed a simple method for estimating relative rates of base change that does not assume rate constancy and takes into account base composition biases in different structures and taxa. We have applied this method to srDNA sequences from taxa with a noncontroversial phylogeny to measure relative rates of evolution in various structural regions of srRNA and relative rates of the different transitions and transversions. We find that: (1) the long single-stranded regions of the RNA molecule evolve slowest, (2) biases in base composition associated with structure and phylogenetic position exist, and (3) the srDNAs studied lack a consistent transition/transversion bias. We have made suggestions based on these findings for refinement of phylogenetic analyses using srDNA data.

THE accumulation of DNA sequence data from disparate taxa makes study of the nature of DNA evolution possible. Because certain classes of DNA characters tend to change in a related fashion, the nature of change within these classes can be explored statistically, so that we might learn about the evolution of molecules, as well as refine the assumptions for the use of molecular data in phylogenetic analysis.

A transition-transversion rate bias has been observed in primate mitochondrial DNA (BROWN et al. 1982). These authors noted that transitions (C ↔ T and A ↔ G changes) were more common than expected, given random change, and proposed that the mitochondrial DNA bias toward transitions is due to [...] bias. LI, WU and LUO (1985) found a transition bias in nuclear protein-coding genes and pseudogenes, though the bias is not as pronounced here as in mitochondrial genes. This bias is often discussed theoretically in terms of the silent sites of protein-coding genes (e.g., JUKES and BHUSHAN 1986; JUKES 1987). Because both BROWN et al. and LI et al. grouped all transitions and all transversions in their studies, they did not discuss specifically whether each transition, taken individually, was more common than each individual transversion. Though this does hold true of the BROWN et al. data set, whether it holds for the LI et al. data set is less clear (L. VAWTER, unpublished observations). However, except for those structural RNA genes in the mitochondrial genome (HIXSON and BROWN 1986), a transition/transversion bias has yet to be demonstrated in structural RNA genes.

Despite the lack of evidence bearing upon individual transition/transversion rates in structural RNA genes, phylogenetic analyses that require specific assumptions about them have been undertaken. Though MINDELL and HONEYCUTT (1990) did tally transitions and transversions for srDNAs, they did not calculate rates of change, and used taxa in their study that were too closely related to allow calculation of rates from their tallies because of sample sizes. A transition rate bias and equal transversion rates for different [...] have been used as assumptions for phylogenetic analysis of rDNAs using the method of invariants (LAKE 1988, 1989). The transition-transversion bias assumption has also been suggested for cladistic analyses of rDNAs (MISHLER et al. 1988; PATTERSON 1989; MICKEVICH and WELLER 1990). It is unclear, however, how applicable the transition-transversion rate bias assumption is for phylogenetic analysis using non-protein-coding genes, including srRNA (OLSEN and WOESE 1989; WOESE 1989).

We demonstrate a simple method to estimate relative rates of change among nucleotides for different structural classes within srDNA. This method has the advantage that it circumvents a constant molecular clock assumption. We then use these relative rates to evaluate the assumptions of a transition-transversion bias and equal transversion probabilities for srDNA.

¹ Current address: Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts 02138.

METHODS AND RESULTS

The data set: Here, we present the set of srDNA sequence characters we used (nucleic acid positions),

Genetics 134: 597-608 (June, 1993)

as well as the methods we used for inferring homology (in the evolutionary sense) among the characters. We used a parsimony method to infer hypothetical ancestral sequences for the evolutionary ancestors (HTUs) of the taxa that bear these srDNA characters. For this study, we used published srDNA sequences from a mouse, Mus musculus (RAYNAL, MICHOT and BACHELLERIE 1984), a rat, Rattus norvegicus (CHAN et al. 1984), a human, Homo sapiens (TORCZYNSKI, FUKE and BOLLON 1985), a frog, Xenopus laevis (SALIM and MADEN 1981), a brine shrimp, Artemia salina (NELLES et al. 1984) and a nematode, Caenorhabditis elegans (ELLIS, SULSTON and COULSON 1986). We chose these taxa because their evolutionary relationships are noncontroversial (Figure 1) (KEMP 1988; MILNER 1988; NOVACEK, WYSS and MCKENNA 1988; WILLMER 1990) and are derived from data sets other than srDNA. By doing this, we avoid the logical circularity that would result if we were to use srDNA sequence to derive an evolutionary network, and then derive conclusions about changes of srDNA sequence from the network derived from those same changes.

FIGURE 1.-Unrooted phylogenetic network of all OTUs and HTUs in the data set. The relationships among these taxa are noncontroversial (KEMP 1988; MILNER 1988; NOVACEK, WYSS and MCKENNA 1988; WILLMER 1990). We used a parsimony method (HENNIG 1966), as implemented in the computer program PAUP, to assign base changes to branches. Branch lengths, excluding the dashed portion of the branch between Caenorhabditis and Artemia, are proportional to the number of changes they represent and are based on all characters, including cladistically noninformative characters. The consistency index (KLUGE and FARRIS 1969), based only on informative characters, is 0.94 and is discussed later in the text.

FIGURE 2.-Illustration of terminology for categorization of bases into bulges (B), loops (L), stems (S) and “other” (O). The “other” category comprises long single-stranded regions that are thought to interact with the ribosomal proteins (WOESE et al. 1983).

We inferred homology among the srDNA nucleic acid characters through alignment of both primary sequence and secondary structure. These alignments were performed as detailed in SOGIN and ELWOOD (1986), except that we discovered regions of sequence similarity by the method of LAWRENCE and GOLDMAN (1988), as implemented in EuGene 3.2 ([...] Information Resource, 1989). We confirmed the secondary structures from the literature using energetic (ZUKER 1989; JAEGER, TURNER and ZUKER 1990) and phylogenetic considerations (WOESE et al. 1983). Where the structure was ambiguous, we discarded the structural information. The aligned sequences will be provided electronically by the author (L.V.) upon request and are given in VAWTER (1991). The method of structural classification we used (stem, loop, bulge or “other”) is shown in Figure 2. We emphasize here that we are including G-U pairs as stem structure, where they are not terminal to the stem. Unpaired regions within stems are classified as bulges, so that no unpaired positions are included as stem bases. The “other” class is not an arbitrary designation for positions with unknown structure; rather, it comprises long single-stranded regions that interact with ribosomal proteins (WOESE et al. 1983). All data, including those for which structure was ambiguous, were included in calculation of overall base compositions. Those positions where the alignment was unambiguous but the structure unknown (e.g., positions 1226-1321 of the human sequence) were excluded from structural analysis. Sequence data that were not included in the remaining analyses because of ambiguity of alignment or structure are detailed in VAWTER (1991) and are available from the author (L.V.) upon request.

For the phylogeny in Figure 1, we calculated hypothetical ancestral sequences (HTUs) and predicted base changes for all varying positions using parsimony (PAUP; SWOFFORD 1989). As was pointed out by FITCH and MARKOWITZ (1970), it is necessary to superimpose sequence data on a phylogeny, rather than merely to tally differences between the taxa when taken pairwise, in order to utilize all changes required by the phylogeny. This approach to identifying base

FIGURE 3.-Matrix of genetic distances between the taxa in the analysis. We calculated these as strict percent difference, with no corrections for multiple hits. As suggested by SOGIN and ELWOOD (1986), we omitted unique insertions from the calculations.
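The distance measure named in the Figure 3 caption can be sketched in a few lines. This is only an illustration, not the authors' program: the function name `percent_difference`, the gap symbol `-`, and the two short aligned sequences are assumptions made for the example, and skipping every alignment column that contains a gap is a simplification of "omitting unique insertions".

```python
def percent_difference(seq_a, seq_b, gap="-"):
    """Strict percent difference between two aligned sequences.

    No correction for multiple hits is applied. Columns in which either
    sequence has a gap are skipped (an assumed stand-in for omitting
    unique insertions).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a == gap or base_b == gap:
            continue  # insertion/deletion column: not counted
        compared += 1
        if base_a != base_b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


# Hypothetical aligned fragments: 10 comparable positions, 1 mismatch.
distance = percent_difference("ACGTACGT--AC", "ACGAACGTTTAC")
```

Applying such a function to every pair of taxa would give a matrix of the kind described in the caption.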

changes has the advantage over utilizing correction formulae to estimate numbers of changes between pairs of taxa, as it allows not only estimation of the number of changes that took place, but estimation of the specific changes that took place. Given a phylogenetic network and character states for the endpoints of that network, the algorithm in PAUP allocates character changes along that network so as to minimize its length. The prime assumption of this method is minimum evolution, which is reasonable in this case, because the small pairwise genetic distances (calculated as strict percent difference, not corrected for multiple hits) between these taxa (Figure 3) suggest that multiple changes at the same site are unlikely. The consistency index (KLUGE and FARRIS 1969) of this network (Figure 1, C.I. = 0.94, with cladistically noninformative sites excluded) suggests that multiple hits should not be a problem in this analysis or the analyses that follow. The consistency index is a measure of homoplasy that can vary between zero and one, with one indicating no homoplasy in the data set. Because they cannot be aligned, unique stretches of sequence [e.g., positions 246-268 (human numbering system) in the mammals when compared to the rest of the data set] were excluded from the pairwise genetic distance calculations, as suggested by SOGIN and ELWOOD (1986).

We assigned base changes at nucleotide positions inferred to be homologous to appropriate branches of the phylogenetic network, and then calculated branch lengths as sums of changes assigned by the parsimony algorithm. In cases where the branch on which a change was inferred to have occurred was ambiguous, we divided the substitutional increment equally among the branches to which it could be assigned. We included all character changes in this analysis, including cladistically noninformative changes, because our aim was to study tendencies in character change statistically, rather than to derive a phylogeny.

All analyses were done twice, once with sequence and structural data from Caenorhabditis and the hypothetical ancestral taxa included, and once with them eliminated. Because of ambiguities in structure or alignment for the Caenorhabditis srDNA sequence, positions had to be excluded from it that were included in the other taxa. This alternate inclusion and elimination of Caenorhabditis from the analysis allowed comparison of results of the analyses performed on the same positions in all taxa with the results of the analysis performed with one of the data sets (Caenorhabditis) incomplete. Analyses were also done with the hypothetical ancestral taxa eliminated, because they are artificial constructs. Removing neither Caenorhabditis nor the hypothetical ancestral taxa from the analysis changed the results qualitatively, except for lowering the sample sizes. Therefore, the results of these additional manipulations will not be presented. The phylogeny shown in Figure 1 was imposed on the sequence data and is the foundation for all the analyses.

Structural composition: Analysis of structural composition allows comparison of relative rates of change in different structural classes. These relative rates can then be used to infer the importance of base sequence to the functions of the various structural classes. Structural composition of the data set is shown in APPENDIX A. Almost 43% of the total 11,340 positions studied are involved in stem pairing. The “other” category (defined earlier) comprises about 20% of the positions. The bulge, loop and “unknown” categories were the smallest structural categories, comprising about 15%, 10% and 12% of the positions, respectively. The unknown category (positions for which structure could not be unambiguously assigned) comprised 9.8% and 13.7% of the data set in Artemia and Caenorhabditis, respectively. This reflects the finding that the Escherichia coli model of srRNA secondary structure (WOESE et al. 1983) may not fit arthropods and Caenorhabditis as well as it does other taxa (L. VAWTER and W. M. BROWN, unpublished results). The unknown category is largest in HTU 7 (almost 40% of the positions), because the genetic distance between Caenorhabditis and Artemia generates a large number of ambiguous character states (bases) in their hypothetical ancestor. These ambiguous character states are expected in any structural analysis that involves hypothetical ancestral taxa, because it cannot always be predicted which of two or more bases was more likely to be present at a position. These ambiguous character states frequently produce ambiguities in structure. For example, though it might be possible to predict that an ambiguous C or U would pair with G in a stem structure, it would be impossible to make a structural prediction for that same ambiguous C or U if it were opposite an A.

Figure 4 summarizes the means and ranges of structural compositions for the extant taxa studied, excluding positions in the unknown category. Excluding the unknown category gives a more accurate representation of the actual distribution of the structural classes of srRNA. The stem and loop classes are the most constant classes, in terms of proportion, with a range of less than 2%. The proportions allotted to the bulge and “other” categories vary a bit more, having ranges of 4.1% and 4.4%, respectively.

FIGURE 4.-The average percent of different structures, excluding unknown regions, for the extant taxa listed in APPENDIX A. The ranges of the structural compositions are indicated by the horizontal bars.

Base composition: Phylogenetic biases in base composition of genomic DNA (e.g., BERNARDI and BERNARDI 1985; WADA, SUYAMA and HANAI 1991) and mitochondrial DNA (CLARY and WOLSTENHOLME 1985; CROZIER and CROZIER 1993) have been noted. Biases in base composition among srDNA structural classes might be predicted on the basis of energy considerations. For example, because the G-C pair has a lower free energy value than do A-U or G-U pairs (e.g., FREIER et al. 1986; TURNER, SUGIMOTO and FREIER 1988), structural regions requiring base pairing might be expected to have a G/C base composition bias. Base compositions for the srDNAs studied, as well as of various structural components taken individually, are shown in Figure 5. Compositions are shown for each taxon separately, and for different structural regions of srRNA. As expected, stems are more G/C rich than are any of the other structural categories. Loop, bulge and “other” regions are much more A-rich than are stem regions. GUTELL et al. (1985) also noted that single-stranded regions were A-rich and suggested that this might be because adenine is the least polar of the bases and thus might facilitate hydrophobic interactions with proteins. Regions of unknown structure do not differ in base composition appreciably from all structural classes combined (Figure 5). This is consistent with a lack of bias as to which structural components comprise the unknown class. Thus, it is unlikely that the inability to categorize unambiguously members of the unknown class biased the results of this analysis of base composition.

Phylogenetic biases in base composition are notable. Vertebrate srDNAs are more G/C rich overall [(G + C)/(A + T) = 1.24] than are those of non-vertebrate multicellular animals in this data set [(G + C)/(A + T) = 0.96]. In addition to the data presented in Figure 5, partial srRNA and srDNA sequences for other multicellular animals listed in GenBank were analyzed for base composition. In the data set presented in Table 1, it is notable that a cephalochordate, Branchiostoma californiense [(G + C)/(A + T) = 1.15], and an echinoderm, Anisodoris nobilis [(G + C)/(A + T) = 1.12], which are more closely related to vertebrates than the other non-vertebrates listed, also have a higher (G + C)/(A + T) ratio than the other non-vertebrate animals. Thus, using the other non-vertebrate taxa as outgroups, it might be inferred that this G/C-richness evolved prior to the evolution of vertebrates from the ancestral state of (G + C)/(A + T) ratio of approximately 1. Base composition in each of the structural classes is generally reflective of this phylogenetic base composition difference, with the only exceptional occurrence being the extraordinary T-richness and C-dearth for loop regions in Caenorhabditis srDNA.

Relative rates of change among structural classes: To compare relative rates of change among srDNA structural classes, we tabulated frequencies of change among the structural classes and then corrected those for the relative frequencies of the structural classes. Figure 6a shows a histogram of frequencies of change in the structural classes that comprise srDNA. The most frequent is clearly change in stem regions. However, when frequency of change is corrected for relative sizes of the structural classes (APPENDIX A), one can see that stem regions change no faster than other regions of the srDNA molecule on a per nucleotide basis (Figure 6b). Stem, loop and bulge regions change at about the same rate, with the “other” category (long single-stranded regions which presumably interact primarily with the ribosomal proteins) evolving slowest. GUTELL et al. (1985) partitioned universally conserved nucleotides in srDNAs into those that occur in single-stranded (loop, bulge or other) as opposed to double-stranded (stem) regions. They noted that while only 39% of all nucleotides occur in single-stranded regions, 59% of universally conserved nucleotides occur in them. The finding of this study is consistent with the results of GUTELL et al. (1985) and suggests an interesting possibility. Because GUTELL et al. (1985) categorized nucleotides as single or double-stranded and did not consider the loop, bulge and “other” categories as distinct entities, the present study suggests that the distribution of universally conserved positions should be reexamined. Because stem and “other” categories comprise about 49% and 24% of the data set, respectively, it is possible that GUTELL et al.'s (1985) classification of srRNA positions as single- or double-stranded regions is an oversimplification. The “other” class might have numerically dominated the loop and bulge classes, incorrectly implying that single-stranded regions have a higher proportion of universally conserved nucleotides than double-stranded regions. That “other” regions evolve more slowly than the rest of the structural classes suggests that nucleotide sequence is critical in the function of these regions, perhaps in their interactions with ribosomal proteins or with the 5S or 28S rRNAs, in the tertiary structure of the 18S rRNA, or even in translation. A reexamination of the distribution of universally conserved positions might reinforce the idea of the importance of base sequence in “other” regions,

FIGURE 5.-(a-f) Histograms of base compositions of various structures of different taxa.
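The (G + C)/(A + T) ratios quoted in the text follow directly from percentage base compositions. The sketch below recomputes three of the ratios listed in Table 1; the function name `gc_at_ratio` is illustrative, and reading the table's four percentage columns in the order %A, %C, %G, %T is an assumption that is at least consistent with the published ratios.

```python
def gc_at_ratio(pct_a, pct_c, pct_g, pct_t):
    """Return (G + C)/(A + T) from percentage base composition."""
    return (pct_g + pct_c) / (pct_a + pct_t)


# Percentages (%A, %C, %G, %T) as read from Table 1.
table1_rows = {
    "Tenebrio molitor": (24.6, 23.3, 27.4, 24.7),
    "Anisodoris nobilis": (25.1, 24.9, 28.0, 22.0),
    "Hydra sp.": (28.4, 18.7, 25.8, 27.1),
}

ratios = {
    taxon: round(gc_at_ratio(*percentages), 2)
    for taxon, percentages in table1_rows.items()
}
# Table 1 lists 1.03, 1.12 and 0.80 for these three taxa.
```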

TABLE 1

(G+C)/(A+T) ratios for additional srDNA sequences of multicellular animals

Taxa                                                                  %A    %C    %G    %T    (G+C)/(A+T)
Tenebrio molitor, insect (X07801)                                     24.6  23.3  27.4  24.7  1.03
Anisodoris nobilis, echinoderm (M20097, M20098, M20099)               25.1  24.9  28.0  22.0  1.12
Bombyx mori, insect (X01339)                                          25.2  23.1  28.2  23.5  1.05
Branchiostoma californiense, cephalochordate (M20044, M20045, M20046) 24.9  24.5  29.1  21.6  1.15
Chaetopterus sp., polychaete (M20103, M20104, M20105)                 26.8  21.2  27.2  24.9  0.94
Spisula solidissima, bivalve (M20122, M20127, M20113)                 25.4  23.7  28.3  22.6  1.08
Limulus polyphemus, chelicerate (M20083, M20084, M20085)              27.2  23.2  27.3  22.3  1.02
Dugesia tigrina, planarian (M20068, M20069, M20070)                   30.7  18.5  23.6  27.2  0.73
Hydra sp., cnidarian (M20077, M20078, M20079)                         28.4  18.7  25.8  27.1  0.80
Lingula reevi, brachiopod (M20086, M20087, M20088)                    26.3  20.7  27.7  25.3  0.94

The numbers in parentheses following the taxon names are GenBank accession numbers for the sequences analyzed. Of these sequences, only the T. molitor sequence is a complete sequence.

FIGURE 6.-(a) Histogram of frequencies of change among structural classes in srDNA. (b) Histogram of frequencies of change among structural classes in srDNA, corrected for relative frequencies of structural classes.

and it might suggest that certain subsets of the loop and bulge regions are important in these interactions as well.

Base change matrices: To examine whether the finding of a transition-transversion bias in protein-coding genes (BROWN et al. 1982; JUKES and BHUSHAN 1986; JUKES 1987) can be extended to structural RNA genes, we calculated relative rates of base change for the srDNA data set overall, as well as for the different structural classes. We chose to calculate relative, rather than absolute, rates of change because a calculation of absolute rates requires estimates of time since divergence of taxa, a quantity that is rarely known accurately. Here, we present a method for estimating relative rates of base change that does not require a constant rate assumption. We calculated matrices of base change in the following manner.

Step 1: Letting X and Y represent any two of the four nucleotides (A, C, G and U), we combined X → Y and Y → X state changes into a single category of X ↔ Y state changes. This was necessary because changes are inferred from an unrooted network. Because of this, the polarity of base change cannot be inferred from the network, except in the case of changes on terminal branches.

For each of the pairs of taxa, these counts of base change, N(X↔Y) (where N = number of changes between bases X and Y; X = A, C, G, T; Y = A, C, G, T), were organized as 4 × 4 matrices. The raw base change counts for each of the pairs of taxa and for each structural class are presented in VAWTER (1991) and can be obtained from the author (L.V.). Base change counts in the structural categories are not always whole numbers. This is because a base change may cause the structural categories to be different in each of the taxa. For example, a G → C change may cause a U-G stem pair to become a U C bulge. In such cases, a half-count was added to each of the appropriate structural categories for that particular base change.

Step 2: Various structural classes and taxa had different base composition biases. Because stems, for example, are G/C rich, G and C are inherently more likely to be involved in stem base changes. Likewise, because Caenorhabditis and Artemia are more A/T rich than are the vertebrates in this data set (Figure 5), more changes between A and T are expected in Caenorhabditis and Artemia than in the vertebrates. We corrected counts of character state changes for the probability of finding each base in a particular structural category and pair of taxa, so that base compositions of the different taxa did not affect estimates of relative rates of base change (see Figure 7 for example). The correction factors were scaled to

Artemia × Caenorhabditis (stems)
X→Y and Y→X changes combined

Uncorrected for base compositions      Corrected for base compositions
     A     C     G     T                    A     C     G     T
A    --    21    31.5  27.5            A    --    26.2  39.6  38.0
C    --    --    33.5  46              C    --    --    25.1  46.2
G    --    --    --    34.5            G    --    --    --    29.4

FIGURE 7.-Examples of matrices showing tallies of raw base changes and base change tallies corrected for base composition as described in the text.
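The tallying described in Step 1 can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the input format (a list of inferred changes, each tagged with one or two structural categories), the function name `fold_changes`, and the example changes are all assumptions. Direction is discarded, so X → Y and Y → X fall into one X ↔ Y cell, and a change whose position sits in two different structural categories contributes a half-count to each.

```python
from collections import defaultdict


def fold_changes(changes):
    """Tally undirected base changes per structural category.

    `changes` is an iterable of (from_base, to_base, categories), where
    `categories` lists the structural class(es) of the position in the
    taxa being compared. One change is shared equally among its
    categories, so two categories receive a half-count each.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for from_base, to_base, categories in changes:
        if from_base == to_base:
            continue  # not a change
        pair = tuple(sorted((from_base, to_base)))  # X->Y and Y->X share a key
        share = 1.0 / len(categories)
        for category in categories:
            counts[category][pair] += share
    return {category: dict(pairs) for category, pairs in counts.items()}


# Hypothetical inferred changes for one pair of taxa.
example_changes = [
    ("C", "G", ["stem"]),           # stem in both taxa
    ("G", "C", ["stem", "bulge"]),  # stem in one taxon, bulge in the other
    ("A", "G", ["loop"]),
]
folded = fold_changes(example_changes)
```

Because only the upper triangle is needed once direction is dropped, the result has the same shape as the half-matrices shown in Figure 7.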

[Figure 9, panels a-e: plots of corrected base-change counts against the number of C↔T changes (one panel is labeled "Bulges"), each accompanied by Slope, R, P-value and N for the fitted lines. The plotted values are not recoverable here.]
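Figure 9 reports a slope for each line of corrected X ↔ Y counts plotted against corrected C ↔ T counts, and several of the surviving slope values coincide with the stem entries of Table 2, which suggests that a relative rate is the slope of such a line. The sketch below shows one way to obtain such a slope. It assumes an ordinary least-squares fit with an intercept; the fitting procedure actually used (Step 3) is not preserved in this text, the counts are hypothetical, and the function name is illustrative. It also does nothing about the non-independence among pairs of taxa, for which the authors used the Mantel test.

```python
def least_squares_slope(x_values, y_values):
    """Ordinary least-squares slope of y on x (intercept fitted)."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sum_xx = sum((x - mean_x) ** 2 for x in x_values)
    sum_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values)
    )
    return sum_xy / sum_xx


# Hypothetical corrected counts, one value per pair of taxa.
ct_changes = [10.0, 20.0, 40.0, 80.0]  # C<->T changes (reference class)
ag_changes = [7.0, 13.0, 25.0, 49.0]   # A<->G changes for the same pairs

# Rate of A<->G change relative to C<->T, whose rate is taken as 1.000.
relative_rate = least_squares_slope(ct_changes, ag_changes)
```

Because the slope depends only on how one class of change scales with another across pairs of taxa, no divergence times enter the calculation.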

FIGURE 9.-(a-e) In these graphs, the points on a line represent the numbers of corrected base changes for each of the pairs of extant taxa for one of the X ↔ Y base change pairs (where X ↔ Y = A ↔ C, A ↔ G, A ↔ T, C ↔ G, or G ↔ T), graphed against corrected numbers of C ↔ T changes, for those pairs of taxa. In these graphs, when several points lay directly on top of one another, each was moved very slightly such that all might be visible. Slopes, R values, and statistical significances of the lines, using the Mantel test (MANTEL 1967; SMOUSE, LONG and SOKAL 1986), are given.

common evolutionary history with each other than either does with Xenopus (Figure 1). Thus it is helpful to be able to test the statistical significance of relationships between variables in a comparative study, because of the non-independence of points caused by some organisms in the study being more closely related to each other than to others.

In the manner presented above, one can estimate relative probabilities of change between all pairs of taxa for a particular class of characters without knowing times since divergence and without making a constant rate assumption. The use of a time variable is often impractical because time since divergence is generally not known accurately. More importantly, graphing change against time since divergence assumes a constant rate and, as is clear from Figure 1, this assumption is inappropriate in the case of these srDNAs.

Relative rates of base change: To examine the validity of a transition-transversion bias assumption for srDNA, we estimated relative rates of base change in the different structural categories and overall. We estimated relative rates as described above, in Step 3. The rate of C ↔ T changes was arbitrarily designated as 1.000. These relative rates are shown in Table 2 and are summarized in Figure 10. Across all structural

categories, C ↔ T changes, which are transitions, occur at the greatest rate. The other transition, G ↔ A, occurs at about the same rate as the remaining base changes, when all structural categories are considered together. However, amount of change between G and A varies widely, from 20.8% of all changes in stems to only 6.4% in loops. The C ↔ T transition occurs at the highest rate of any change in all of the structural classes, as well as in all structural classes taken together. Though the A ↔ G transition occurs at almost the same rate as the C ↔ T transition in stems, it is actually the least common change in loops, and there are transversions that occur at a higher rate than the A ↔ G transition in all of the rest of the structural classes. Evidence also exists for a C ↔ T bias in amniotes (MARSHALL 1992), but no formal analysis was undertaken. The increased rate of A ↔ G transitions in stems is consistent with selection for maintenance of base pairing in stem structures, as illustrated in Figure 11. In DNA, where the complementary base pairs are C-G and A-T, it is not possible for base pairing to be maintained with only a single base change. For example, a C-G pair requires two base changes to become a G-C pair, an A-T pair, or a T-A pair. However, in RNA, where uracil (U) is substituted for thymine (T) and the G-U pair is stable, certain single changes in srRNA stems allow stem structure to be maintained (Figure 11). The effects of selection on base change in srRNA stem structure have been explored in detail (L. VAWTER and W. M. BROWN, unpublished data). It is clear that a transition-transversion bias does not hold for srDNA. Thus, a transition bias assumption, per se, should not be made for phylogenetic analysis of srDNA data. However, though it would require much structural analysis and a large investment of computer time, the relative rates of change in the various structural classes could be employed in phylogenetic analysis using either the method of invariants (LAKE 1988) or a maximum likelihood analysis. Also, this relative rate information could be used as ranking criteria for statistical comparisons of alternate tree topologies using the method of TEMPLETON (1983).

TABLE 2

Relative rates of change by structural category

Structure   A↔C     A↔G     A↔T     C↔G     C↔T     G↔T
All         0.565   0.672   0.557   0.630   1.000   0.587
Bulges      0.635   0.538   0.289   0.604   1.000   0.444
Loops       0.672   0.206   0.329   0.501   1.000   0.496
Other       0.427   0.512   0.525   0.540   1.000   0.431
Stems       0.614   0.980   0.758   0.685   1.000   0.675

FIGURE 10.-This histogram illustrates relative rates of base change in the different structural categories as well as overall.

FIGURE 11.-Illustration of the single base changes which allow maintenance of base pairing in structural RNAs, though not DNAs. In structural RNAs, stems can be preserved with certain single base changes, because of G-U pairing.

To examine the validity of the assumption of equal transversion probabilities used for phylogenetic analysis of rDNAs with the method of invariants (LAKE 1988, 1989), we compared rates of transversions in the different structural regions as well as overall. Across structural categories, the rate of C ↔ G changes is most constant, ranging across a factor of only 1.36. The rate of A ↔ T changes is least constant, varying by a factor of 2.62. Though this may be interesting, it does not, however, address the validity of a constant transversion rate assumption. Within a structural category, the rate of transversions was least variable in the stem and “other” categories, with the highest rate of change being approximately 1.2 times more common than the lowest rate of change. Bulges and loops had, respectively, a highest rate of change 2.2 and 2.6 times that of their lowest rates of change. However, only a small portion of the bases in srDNA take part in loop and bulge structures (APPENDIX A, Figure 4). Overall, with the bases classed as structurally unknown included, transversions vary only 1.1-fold from the highest to the lowest rate. Thus the assumption of equal rates of transversions made in applying the method of invariants to srDNA is most likely valid.

Except for stem regions, we can find no particularly convincing rationalization for the patterns of base change we observed. We note that the most common base changes in stem regions, C ↔ T and G ↔ A, are those base changes which are expected to be favored under a regime of selection for maintenance of srRNA base pairing structure (L. VAWTER and W. BROWN, unpublished results). It is those specific changes which allow G-U pairs to be formed and base-paired structures to be maintained when particular members of C-G and A-U pairs change.

Different rates of base change are also apparent among the different taxa. The result of assigning nucleotide changes to branches of the phylogenetic network is shown in Figure 1. Because the allocation of changes was done using cladistic methodology and including all changes, as opposed to only cladistically informative changes, branch lengths for sister taxa in the rooted part of the network would be approximately the same if a constant rate of substitution held for this data set. Instead, it is apparent that a constant rate assumption is inappropriate for this set of srDNAs. High rate variation is expected in rDNAs when gene family size varies among taxa (OHTA 1983), as it does in this group of taxa (LEWIN 1987). This illustration of rate variation emphasizes the importance of verifying an assumption of rate constancy if it is to be used in phylogenetic analysis.

CONCLUSIONS

We have developed and applied a method of analysis for relative rates that is simple, standardizes for taxonomic and structural biases in base composition, and avoids a constant rate assumption. The examination of relative rates of change of different categories of nucleotide in the srRNA gene permits the following conclusions.

First, stem, loop and bulge regions evolve at about the same rate. The long, single-stranded regions that presumably interact with proteins evolve slowest.

Second, there are structure-associated biases in base composition. Stems are more G/C rich than are any of the other structural categories, presumably because the G-C pairing confers maximum thermodynamic stability. Loop and “other” regions, which are suspected of interacting with proteins, are much more A-rich than are stem and bulge regions. This is in accord with the suggestion of GUTELL et al. (1985) that A-richness of these structures might facilitate interactions with proteins.

Third, there is also phylogenetic bias in base composition. Vertebrate srDNAs were much more G/C rich than were those of invertebrates [(G + C)/(A + T) of 1.24 for vertebrates, 0.96 for invertebrates], with base compositions of the various structural categories being generally reflective of this overall bias.

Fourth, a consistent transition-transversion bias does not exist in these srDNAs. Indeed, the relative rates of the various transitions and transversions vary more than fourfold among all structural categories. Though the C ↔ T transition occurs at a higher rate than do the other base changes, the other transition, A ↔ G, varies from being quite a bit more common than transversions in stems to being the least common change in loops. The relative rate scheme suggested in this manuscript is a more appropriate set of assumptions than the transition-transversion bias for phylogenetic analysis using the method of invariants, maximum likelihood, or the method of statistical inference suggested by TEMPLETON (1983). The results presented here call into question the validity of conclusions of phylogenetic analyses using the assumption that srDNA shows a transition-transversion bias.

Last, transversions vary only 1.1-fold from the highest to the lowest rate. Thus the assumption of equal rates of transversions made in applying the method of invariants to srDNA is probably a valid assumption.

The authors are grateful to C. W. BIRKY, B. CRESPI, D. FISHER, W.-H. LI, B. O'CONNOR, M. SLATKIN, P. SMOUSE and two anonymous reviewers for helpful suggestions. This research was supported by grants from the University of Michigan, the National Science Foundation and the U.S. Public Health Service, and was done in partial fulfillment of the requirements of the Ph.D. at the University of Michigan by L.V.

LITERATURE CITED

BERNARDI, G., and G. BERNARDI, 1985 Codon usage and genome composition. J. Mol. Evol. 22: 363-365.

BROWN, W. M., E. M. PRAGER, A. WANG and A. C. WILSON, 1982 Mitochondrial DNA sequences in primates: tempo and mode of evolution. J. Mol. Evol. 18: 225-239.

CHAN, Y.-L., R. GUTELL, H. F. NOLLER and I. G. WOOL, 1984 The

nucleotide sequence of a rat 18s ribosomal ribonucleic acid generalized regression approach. Cancer Res. 27: 209-220. gene and a proposal for the secondary structure of 18s ribo- MARSHALL,C.R., 1992 Substitution bias, weighted parsimony, somal ribonucleic acid. J. Biol. Chem. 259 224-230. and amniote phylogeny as inferred from 18s rRNA sequences. CLARY,D. O., and D. R. WOLSTENHOLME,1985 The mitochon- Mol. Biol. Evol. 9 370-373. drial DNA molecule of Drosophila yakuba:nucleotide sequence, MICKEVICH,M. F., and S.J. WELLER,1990 Evolutionary character gene organization, and genetic code. J. Mol. Evol. 22: 252- analysis: tracing character change on a cladogram. Cladistics 6 271. 137-1 76. CROZIER,R. H., and Y.4. CROZIER,1993 The mitochondrial MILNER,A. R., 1988 The relationships and origin of living am- genome of the honeybee Apis mellqera:complete sequence and phibians, pp. 59-1 02 in The Phylogeny and Class@ation of the genome organization. Genetics 133: 97-1 17. Tetrapods, Vol. I, edited byM. J. BENTON.Clarendon Press, ELLIS,R. E., J. E. SULSTONand A. R. COULSON,1986 The rDNA Oxford. of C. elegans: sequence and structure. Nucleic Acids Res. 14: MINDELL,D. P., and R. L. HONEYCUTT,1990 Ribosomal RNA in 2345-2364. vertebrates: evolution and phylogenetic implications. Annu. FELSENSTEIN,J., 1985 Phylogenies and the comparative method. Rev. Ecol. Syst. 21: 541-566. Am. Nat. 125 1-15. MISHLERB. D., K. BREMER,C. J. HUMPHRJESand S. P. CHURCHILL, FITCH,W. M., and E. MARKOWITZ,1970 An improved method 1988 The use of nucleic acid sequence data in phylogenetic for determining codon variability in a gene and its application reconstruction. Taxon 37: 391-395. to the rate of fixation of in evolution. Biochem. Molecular Biology Information Resource, 1989 Eugene User’s Genet. 4: 579-593. Manual, Release 3.2. Department of Cell Biology, Baylor Med- FREIER,S. M., R.KIERZEK, J. A. JAEGER,N. SUGIMOTO,M. H. ical College, Houston. CARUTHERS,T. NEILSONand D. H. 
TURNER, 1986Improved NELLES,L., B. L. FANG,G. VOLCKAERT,A. VAN DEN BERG HE^^^ R. free-energy parameters for predictions of RNA duplex stabil- DEWACHTER,1984 Nucleotide sequence of a crustacean 18s ity. Proc. Natl. Acad. Sci. USA 83: 9373-9377. ribosomal RNA gene and secondary structure of eukaryotic GUTELL,R. R.,B. WEISER,C. R. WOESE and H. F. NOLLER, small subunit ribosomal RNAs. Nucleic Acids Res. 12: 8749- 1985 Comparative anatomy of 16s-like ribosomal RNA. 8768. Prog. Nucleic Acid Res. Mol. Biol. 32: 155-2 16. NOVACEK,M. J., A. R. WYSSand M.C. MCKENNA, 1988 The HENNIG,W., 1966 Phylogenetic Systematics. University of Illinois major groups of eutherian mammals, pp. 31-72 in The Phylo- Press, Urbana. geny and Classzjication of the Tetrapods, Vol. 11, edited by M. J. HIXSON,J. E., and W. M. BROWN,1986 A comparison ofsmall BENTON.Clarendon Press, Oxford. ribosomal RNA genes from the mitochondrial DNA of great OHTA,T., 1983 On the evolution of multigene families. Theor. apes and humans: sequence, structure, evolution and phyloge- Popul. Biol. 23: 216-240. netic implications. Mol. Biol. Evol. 3: 1-18. OLSENG. J., and C. R.WOESE, 1989A brief note concerning JAEGER, J. A,, D. H. TURNERand M. ZUKER, 1990 Predicting archaebacterial phylogeny. Can. J. Microbiol. 35 119-123. optimal and suboptimal secondary structure forRNA. Methods PATTERSON,C., 1989 Phylogenetic relations of major groups: in Enzymol. 183: 281-306. conclusions and prospects, pp. 471-488 in TheHierarchy of JUKES,T. H., 1987 Transitions, transversions, and the molecular Lije, edited byB. FERNHOLM,K. BREMERand H. JORNVALL. evolutionary clock. J. Mol. Evol. 2687-98. Elsevier, Amsterdam. JUKES, T. H., and V. BHUSHAN,1986 Silent nucleotide substitu- RAYNAL,F., B. MICHOT and J.-P. BACHELLERIE,1984 Complete tions and G + C content of some mitochondrial and bacterial genes. J. Mol. Evol. 24 39-44. nucleotide sequence of mouse 18s rRNA gene: comparison KEMP,T. S., 1988 Interrelationships of the Synapsida, pp. 
1-22 with other available homologs. FEBS Lett. 167: 263-268. SALIM,M., and B.E. H. MADEN,1981 Nucleotide sequence of in The Phylogeny andClassajkation of the Tetrapods, Vol. 11, edited by M. J. BENTON.Clarendon Press, Oxford. Xenopus laevis 18s ribosomal RNA inferred from gene se- KLUGE,A. G., and J. S. FARRIS,1969 Quantitative phyletics and quence. Nature 256: 205-208. the evolution of anurans. Syst. Zool. 18: 1-32. SMOUSE,P. E., J. C. LONGand R. R. SOKAL,1986 Multiple LAKE,J. A., 1988 Origin of the eucaryotic nucleus determined by regression and correlation extensions of the Mantel test of rate-invariant analysis of rRNA sequences. Nature 331: 184- matrix correspondence. Syst. Zool. 35: 627-632. 186. SOGIN,M. L., and H. J. ELWOOD,1986 Primary structure of the LAKE,J. A., 1989 Origin of the eukaryotic nucleus determined Paramecium tetraurelia small-subunit rRNA : phy- by rate-invariant analyses of ribosomal RNA genes, pp. 87- logenetic relationships within the Ciliophora. J. Mol. Evol. 23: IO 1 in The Hierarchy ofLije,edited by B. FERNHOLM,K. BREMER 53-60. and H. JORNVALL. Elsevier, Amsterdam. SWOFFORD,D. L., 1989 PAUP: Phylogenetic AnalysisUsing Parsi- LAWRENCE,C.B., and D. A. GOLDMAN,1988 Definition and mony. Illinois Natural History Survey, Champaign, Ill. identification of homology domains. Comp. Appl. Biosci. 4 TEMPLETON,A. R., 1983 Convergent evolution and nonparamet- 25-33. ric inferences from restriction data and DNA sequences, pp. LEWIN,B., 1987 Genes III. Wiley & Sons, New York. 151-1 79 in Statistical Analysis of DNA Sequence Data, edited by LI, W.-H., C.-I WU and C.-C. LUO, 1985 A new method for B. S. WEIR.Marcel Dekker, New York. estimating synonymous and nonsynonymous rates ofnucleotide TORCZYNSKI,R.M., M. FUKEand A.P. BOLLON,1985 Cloning substitution considering the relative likelihood of nucleotide and sequencing of a human 18s ribosomal RNA gene. DNA and codon changes. Mol. Biol. Evol. 2: 150-174. 4 283-291. MANTEL,N. 
A., 1967 The detection of disease clustering and a TURNER,D. H., N. SUCIMOTOand S. M. FREIER,1988 RNA 608 L. Vawter and W. M. Brown

structure prediction. Annu. Rev. Biophys. Biophys. Chem. 17: TABLE 3 167-192. VAWTER,LISA, 1991 Evolution of blattoid insects and of the small Structural composition of the srDNA data set subunit ribosomal RNA gene. Ph.D. Thesis, University of Michigan, Ann Arbor. WADA,A., A. SUYAMAand R. HANAI,1991 Phenomenological Composition (W) theory of GC/AT pressure on DNA base composition. J. Mol. Stem Loop Bulge Other Unknown EvoI. 32: 374-378. WILLMER,P., 1990 InvertebrateRelationships: Patterns in Animal HTU 7 27.5 6.1 9.4 17.1 39.9 Evolution. Cambridge University Press. Cambridge. HTU 8 44.0 9.7 14.6 18.3 13.4 WOESE,C. R., 1989 Archaebacteria and the nature of their evo- HTU 9 44.9 10.7 17.0 19.7 7.7 lution, pp. 119-130 in TheHierarchy of Life, edited by B. HTU 10 45.4 10.7 16.9 19.7 7.2 FERNHOLM,K. BREMICRand H. JORNVALL.Elsevier, Amster- C. elegans 42.2 8.5 13.6 22.1 13.7 dam. A. salina 43.1 10.5 14.4 22.2 9.8 WOESE,C. R., R. GUTELL, R. GUFTA and H. F. NOLLER, X. taeuis 44.9 10.3 15.3 20.1 9.5 1983 Detailed analysis of the high order structureof 16s-like H. sapiens 44.6 10.7 18.0 19.6 7.1 ribosomal ribonucleic acids. Microbiol. Rev. 47: 621-1369. R. rattus 45.3 10.7 14.5 22.6 7.0 ZUKER,M., 1989 Computer prediction of RNA structure. Meth- M. musculus 45.0 10.7 14.2 23.0 7.2 ods Enzymol. 180: 262-288. OTUs 44.2 10.2 15.0 21.6 9.0 Communicating editor: M. SLATKIN HTUsand OTUs 42.7 9.9 14.8 20.4 12.3 Structural composition of the entire dataset, including all HTUs APPENDIXA and unknown regions. As explained in the text, the large genetic distance between Caenorhabditis and Artemia accounts for the The structural composition of the srDNA data set large percentage of positions belonging to the unknown category is given in Table 3. in HTU 7.
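The arithmetic behind Table 3 can be checked with a short script. What follows is a minimal sketch and not part of the original analysis: it transcribes the six OTU rows of Table 3, confirms that each row sums to roughly 100%, and recomputes the "OTUs" summary row. The names `TABLE3_OTUS`, `REPORTED_OTU_MEAN`, `row_sums` and `column_means` are hypothetical, introduced only for illustration, and the script assumes that the "OTUs" row is an unweighted mean across the six taxa, which the text does not state explicitly.

```python
# Percent composition by structural category, transcribed from Table 3.
CATEGORIES = ("stem", "loop", "bulge", "other", "unknown")

TABLE3_OTUS = {
    "C. elegans":  (42.2, 8.5, 13.6, 22.1, 13.7),
    "A. salina":   (43.1, 10.5, 14.4, 22.2, 9.8),
    "X. laevis":   (44.9, 10.3, 15.3, 20.1, 9.5),
    "H. sapiens":  (44.6, 10.7, 18.0, 19.6, 7.1),
    "R. rattus":   (45.3, 10.7, 14.5, 22.6, 7.0),
    "M. musculus": (45.0, 10.7, 14.2, 23.0, 7.2),
}

# The "OTUs" summary row as printed in Table 3.
REPORTED_OTU_MEAN = (44.2, 10.2, 15.0, 21.6, 9.0)


def row_sums(table):
    """Return {taxon: sum of its percentages}; each should be close to 100."""
    return {taxon: sum(values) for taxon, values in table.items()}


def column_means(table):
    """Unweighted mean of each category across taxa (an assumption, see lead-in)."""
    n = len(table)
    return tuple(sum(column) / n for column in zip(*table.values()))


if __name__ == "__main__":
    for taxon, total in row_sums(TABLE3_OTUS).items():
        print(f"{taxon:12s} total = {total:.1f}%")
    means = column_means(TABLE3_OTUS)
    for name, mean, reported in zip(CATEGORIES, means, REPORTED_OTU_MEAN):
        print(f"{name:8s} mean = {mean:.2f}  (Table 3: {reported})")
    # Share of positions assigned to loops and bulges combined.
    print(f"loop + bulge = {means[1] + means[2]:.1f}% of positions")
```

Under these assumptions the recomputed means agree with the printed OTU row to within rounding, and loop and bulge positions together account for about a quarter of the positions in the OTUs.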