
Pseudogene Evolution and Natural Selection for a Compact Genome


D. A. Petrov and D. L. Hartl

Pseudogenes are nonfunctional copies of protein-coding genes that are presumed to evolve without selective constraints on their coding function. They are of considerable utility in evolutionary genetics because, in the absence of selection, different types of mutations in pseudogenes should have equal probabilities of fixation. This theoretical inference justifies the estimation of patterns of spontaneous mutation from the analysis of patterns of substitutions in pseudogenes. Although it is possible to test whether pseudogene sequences evolve without constraints for their protein-coding function, it is much more difficult to ascertain whether pseudogenes may affect fitness in ways unrelated to their sequence. Consider the possibility that a pseudogene affects fitness merely by increasing genome size. If a larger genome is deleterious (for example, because of increased energetic costs associated with genome replication and maintenance), then deletions, which decrease the length of a pseudogene, should be selectively advantageous relative to insertions or nucleotide substitutions. In this article we examine the implications of selection for genome size relative to small (1–400 bp) deletions, in light of empirical evidence pertaining to the size distribution of deletions observed in Drosophila and mammalian pseudogenes. There is a large difference in the deletion spectra between these organisms. We argue that this difference cannot easily be attributed to selection for overall genome size, since the magnitude of selection is unlikely to be strong enough to significantly affect the probability of fixation of small deletions in Drosophila.

From the Harvard University Society of Fellows (Petrov) and the Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Ave., Cambridge, MA 02138. Address correspondence to Daniel L. Hartl at the address above or e-mail: dhartl@oeb.harvard.edu. This paper was delivered at a symposium entitled "Genetic Diversity and Evolution" sponsored by the American Genetic Association at the Pennsylvania State University, University Park, PA, USA, June 12–13, 1999.

© 2000 The American Genetic Association 91:221–227

The Darwinian theory of evolution treats heritable variation as random and undirected. Mutation merely supplies the raw material for natural selection, which is seen as the directional force that drives evolution toward greater adaptation. But "random" is a tricky word. Although mutational variation is not generally biased toward any particular adaptation, this does not imply that mutational variation is random in other respects. Some variants arise by mutation more frequently than others, independently of natural selection, and these also affect the ultimate course of evolution since the probability of fixation under mutation–selection–drift is a function of the mutation rate.

The possible existence of biases in spontaneous mutation, and the importance of assessing these biases quantitatively, have been recognized for a long time (Beale and Lehmann 1965; Li et al. 1984, 1985; Zuckerkandl et al. 1971). However, empirical studies have proven difficult. Because spontaneous mutation is generally too rare to be investigated directly in laboratory studies, and because laboratory studies are almost inevitably subject to experimental bias according to the methods by which mutations are detected, attention has been devoted primarily to making inferences about patterns of spontaneous mutation from analyses of the observed patterns of nucleotide substitution in functional genes. Although this approach is very powerful, its reliability remains questionable. The main problem is that patterns of substitution in functional genes are affected not only by the relative rates at which different mutations occur spontaneously, but also, and often decisively, by natural selection. Some types of mutation may be, on average, more deleterious than others, and as a result, such mutations will be underrepresented in any observed sample.

The problem of bias is most acute for substitutions in coding sequences. In the exons of protein-coding genes, for example, insertions and deletions (indels) are observed much less frequently than point substitutions (simple nucleotide substitutions). This observation tells us essentially nothing about the relative rates of spontaneous indel formation versus point substitution, since indels in exons are almost certain to be severely deleterious. In this example, the possible effects of mutational bias on the fixation of mutations are hopelessly confounded with the effects of selection.

An alternative to dealing with the biases inherent in studying functional genes is to study their nonfunctional counterparts, pseudogenes, instead. Because pseudogenes are presumed to evolve without functional constraints, it can be argued that all types of mutation should have an equal chance of fixation (or persistence in populations), and thus the pattern of substitutions in pseudogenes should be congruent with the pattern of mutations (Graur et al. 1989; Li et al. 1981, 1984). This approach has been widely used to estimate patterns of mutation in mammals (Graur et al. 1989; Li et al. 1981, 1984). More recently we invoked essentially the same logic to estimate patterns of mutation in Drosophila using immobile, nonfunctional, "dead-on-arrival" (DOA) copies of non-LTR (long terminal repeat) retrotransposable elements in lieu of conventional pseudogenes (Petrov et al. 1996; Petrov and Hartl 1998).

How confident can one be that a sequence designated as a pseudogene does, in fact, have no coding function? The current standard for classifying a sequence as a pseudogene is to show either that it is not transcribed and translated, or that it lacks a complete open reading frame (ORF). Lack of coding function can also be argued on molecular evolutionary grounds if the pattern of nucleotide substitution in the sequence shows no evidence of functional constraints on the coding capacity. In practice, this usually means that the rates of synonymous and nonsynonymous nucleotide substitutions are identical ($K_s/K_a \approx 1$), and that the ORF in the sequence is disrupted by stop codons, deletions, and/or insertions.

Current criteria for pseudogenes address only the protein-coding function of functional genes. But some pseudogene sequences may have position effects on the expression of nearby genes, or they may be mutagenic through homologous DNA interactions with their functional counterparts (Wu and Morris 1999). When position effects occur, most of them would be expected to be deleterious, but a few may be beneficial. In cases where pseudogenes exert detrimental effects on the expression of nearby genes, more drastic mutations in the pseudogene may be selectively advantageous, as they are more likely to disrupt the deleterious activity of the pseudogene. The reverse will be true if the activity of a pseudogene is advantageous. The problem is that generally one does not know whether a pseudogene has any noncoding phenotypic effect and whether the effect is deleterious or advantageous.

Besides having position-specific or sequence-specific effects on the expression of genes, pseudogenes may exert more global effects through their aggregate effects on genome size. If a compact genome is favored by selection, then it follows that deletions occurring in pseudogenes should be slightly beneficial and insertions slightly deleterious, even though point substitutions may be selectively neutral. On the other hand, if a large genome is favored by selection, then deletions will be slightly deleterious and insertions slightly beneficial. We emphasize the word "slightly." There's the rub, for the effectiveness of selection on any mutant allele is determined by the magnitude of $N_e s$, where $N_e$ is the effective population number and $s$ is the selection coefficient. If $N_e s$ is large enough, then the probabilities of fixation will be skewed from the spontaneous mutation frequencies, and so inferences about spontaneous mutation from the analysis of pseudogene substitutions become problematic.

In this article we consider the model of global selection for genome size and consider its implications for the probability of fixation of small deletions and insertions in individual copies of pseudogenes scattered throughout the genome. The central issue is the likely magnitude of $N_e s$ for small indels, which we discuss in light of the observed size distributions of indels. Based on a number of considerations we conclude that natural selection for total genome size must have a negligible effect on small (<400 bp) deletions and insertions. Hence we affirm an earlier suggestion that the insertion/deletion spectra found in pseudogenes or in DOA copies of non-LTR elements may generally afford nearly unbiased estimates of the insertion/deletion spectra among spontaneous mutations (Petrov et al. 1996).

Background and Theoretical Considerations

The data we examine are estimates of the insertion/deletion spectra in mammals based on an analysis of pseudogenes (Graur et al. 1989) and in Drosophila based on DOA copies of non-LTR retrotransposable elements (Petrov et al. 1996; Petrov and Hartl 1998). In these studies genome size varies from $1.6 \times 10^8$ bp in Drosophila (a relatively compact genome) to $3.0 \times 10^9$ bp in humans (a relatively large genome). The deletion/insertion sizes ranged from 1 bp to approximately 500 bp and were limited in part by the length of the pseudogene sequences studied (∼0.4–2 kb). Although the estimated values are specific to the organisms examined, the implications from them are more general in that the sizes of the indels examined are much smaller than the total genome size.

The Shape of the Selection Curve for Small Deletions and Insertions

We consider a model of global selection for genome size in which noncoding DNA sequences are subjected to selection only insofar as they affect genome size. In this model we ignore any selective effects due to local position effects on gene expression, and any other sequence-specific or position-dependent mechanisms of selection.

Consider a newly formed indel (deletion or insertion) of a length $\Delta G$ bp. The presence of such an indel changes the genome size by the value of its length, namely, $\Delta G$ bp. On average, the proportionate change in the genome size due to such an indel will be $\Delta G/G_0$, where $G_0$ bp is the average genome size in the population. If $\Delta G/G_0$ is small, then the selection coefficient, $s(\Delta G/G_0)$, associated with the indel in question, owing only to its effect on genome size, can be found by expanding $s(\Delta G/G_0)$ around $\Delta G/G_0 = 0$ in a Taylor series and keeping terms to first order,

$$s(\Delta G/G_0) \approx s(0) + s'(0)\,[\Delta G/G_0] = s'(0)\,[\Delta G/G_0] \qquad (1)$$

where $s(0) = 0$ because the fitnesses are measured relative to the current average genome size in the population, and $s'(0)$ is the slope of the selection coefficient function evaluated at the current average genome size. If $s'(0) > 0$ then a genome larger than the current average is locally favored, if $s'(0) < 0$ then a genome more compact than the current average is locally favored, and if $s'(0) = 0$ then the current average is a local optimum of genome size (the alternative possibility that $G_0$ defines a local fitness minimum seems too remote to consider).

It is unlikely that the current average genome size represents the optimal value, because there is undoubtedly some mutational load, due to multiple sources, of indels being formed continuously. This means that $s'(0) \neq 0$, and equation (1) therefore implies that, for indels that are small relative to the current average genome size, the selection coefficient should be linear in the magnitude of the change in genome size. For the sizes of indels observed in DOA retrotransposons in Drosophila and pseudogenes in mammals (1–1000 bp), relative to the total genome size ($10^8$–$10^9$ bp), the values of $\Delta G/G_0$ are in the range $10^{-5}$–$10^{-9}$, so the assumption in equation (1) that $\Delta G/G_0$ is small is certainly justified.

Consequences of Linear Fitness Response to Small Changes in Genome Size

The ultimate fate of an indel of a particular size $\Delta G \ll G_0$ is determined by the magnitude of the selection coefficient, $s(\Delta G/G_0)$, relative to the reciprocal of four times the effective population number, $1/(4N_e)$. In particular, the magnitude of $|4N_e s|$ determines whether the selection-drift process leading ultimately to fixation or loss is essentially stochastic ($|4N_e s| \ll 1$), essentially deterministic ($|4N_e s| \gg 1$), or subject to nonnegligible effects from both selection and random genetic drift ($|4N_e s| \approx 1$). In Figure 1 we designate the critical length, $m$, as the size of an indel, $\Delta G$, for which $|4N_e s| = 1$. Because of the linearity of the selection coefficient [equation (1)], it follows that the ultimate fate of any indel that is smaller than $m$ by a factor of 10 or more will be essentially determined by random genetic drift ($|4N_e s| \le 0.1$), whereas the ultimate fate of any indel that is larger than $m$ by a factor of 10 or more will be essentially determined by selection ($|4N_e s| \ge 10$) (Figure 1).

Figure 1. The "critical length" of an indel is here defined as the length for which $|4N_e s| \approx 1$. Indels with lengths smaller than the critical length are nearly neutral. When there is genome-wide selection for a compact genome size, indels larger or smaller are subjected to a selection coefficient proportional to their size, positive if the indel is a deletion and negative if it is an insertion. For $|4N_e s| \gg 1$, the ultimate fate of an indel is determined almost exclusively by selection.

Unfortunately the critical indel length $m$ is unknown for any organism. We can nevertheless make some important inferences by identifying a range of indel sizes over which the extremes give evidence of being subjected to large selective forces of opposite sign, for then the critical length must be within this range. For example, assuming a uniform distribution of indel sizes, if one could show that deletions of size 10 bp were present at much lower frequencies in a population sample than those of 1000 bp, and that deletions of 1000 bp were present at much higher frequencies than insertions of 1000 bp, then one could infer that the critical length must be somewhere in the range 10–1000 bp and, moreover, that deletions are selectively favored. The same inference would follow from the finding of drastically different fixation probabilities, or persistence times, for deletions and insertions spanning the 10–1000 bp interval. The key point is that, in view of the linearity of the selection coefficient as a function of indel size, so long as the indels are small, then the interval over which selective difference should be detectable at the extremes includes the critical length.

Figure 2 shows the probability of fixation of a newly arising indel with some value of $4Ns$, relative to the probability of fixation of a newly arising neutral allele ($4Ns = 0$), assuming $N_e = N$. This ratio of probabilities is given explicitly by $4Ns/[1 - \exp(-4Ns)]$, and the range of values of $4Ns$ has been chosen so that the ratio of fixation probabilities ranges from approximately 1/10 (actually 0.075) for $4Ns = -4$ to 10 for $4Ns = +10$. This curve implies that an indel favored by selection must be 10 times longer than the critical value in order to have about a sixfold greater probability of fixation than an indel matching the critical value. An increase of sixfold in the probability of fixation is roughly what one might expect to be detectable in datasets of the size of those presently available. Hence if the critical value were small (say, 10 bp), then there would be a noticeable effect of selection for indels larger than 100 bp, whereas if the critical value were large (for example, 100 kb), then only very large indels on the order of 1 Mb would have a greatly increased chance of fixation due to positive selection.

Figure 2. Probability of ultimate fixation of a newly arising indel with a value of $4N_e s$ given along the abscissa, relative to a newly arising mutation with $4N_e s = 0$.

Empirical Evidence

In the rest of this article we examine empirical distributions of sizes of deletions and insertions in DNA sequences in Drosophila and mammals that are unconstrained with respect to their protein-coding function. We concentrate primarily on the possibility that the difference in the observed distributions, which is skewed toward more deletions and longer deletions in Drosophila, is due to stronger natural selection for a compact genome in Drosophila as compared to that in mammals. We conclude that, in Drosophila, the critical indel length at which $|4N_e s| = 1$ is larger (and possibly much larger) than the size of indels we observe. Consequently the ultimate fate of individual indels in this size range should be determined largely, if not exclusively, by random genetic drift, even if Drosophila is subjected to a genome-wide selective force favoring a compact genome.

Defunct Transposable Elements as Pseudogene Surrogates

One of the most enigmatic differences in genome organization of Drosophila and mammals is the drastically different frequencies of pseudogenes. Mammalian genes often have tens or even hundreds of pseudogene counterparts (Weiner et al. 1986), whereas pseudogenes in Drosophila are exceedingly rare (Jeffs and Ashburner 1991). The paucity of pseudogenes in Drosophila has hampered studies of DNA sequences that are unconstrained with respect to their protein-coding function in this organism and has precluded comparative studies of pseudogene evolution outside of mammals.

We have proposed that the molecular evolutionary information that can be gleaned from bona fide pseudogenes (nonfunctional copies of single-copy functional genes) can be obtained in Drosophila and other organisms from the study of defunct, DOA copies of non-LTR retrotransposable elements (Petrov et al. 1996; Petrov and Hartl 1997, 1998). Here we use defunct in the proper dictionary sense of "no longer living, dead," because non-LTR elements commonly create truncated, immobile, and nonfunctional copies that are "virtual" pseudogenes in the sense that they are unconstrained in their protein-coding function. We have shown how phylogenetic analyses of multiple, independently transposed copies of a non-LTR element can be used to separate the constrained evolution of lineages of transpositionally active elements from the unconstrained, pseudogene-like evolution of lineages of defunct elements (Petrov et al. 1996; Petrov and Hartl 1997, 1998). Because non-LTR elements are virtually ubiquitous in eukaryotes, and can easily be cloned or amplified with universal primers, the approach using defunct non-LTR elements should provide a general means to study unconstrained DNA in practically any eukaryote.

Pseudogene Evolution in Drosophila and Mammals

Application of this approach allowed the first estimate of the pattern of spontaneous substitutions in Drosophila pseudogenes, revealing features of evolutionary stability as well as features of extreme variation. The pattern of spontaneous, simple nucleotide substitutions (point substitutions) in Drosophila pseudogenes proved to be surprisingly similar to that in mammals, despite the very long evolutionary distance separating these animals. In fact, excluding the much higher rate of G:C to A:T transitions in mammals (probably attributable to methylation of cytosines in mammals but not in Drosophila), the patterns are statistically indistinguishable (Petrov and Hartl 1999). The same is true for insertions: in both organisms pseudogenes accumulate on average 1–1.5 small insertions (2–3 bp in length) per 100 point substitutions. On the other hand, the pattern of deletions is very different in Drosophila than in mammals. In Drosophila, deletions in defunct retroelements are 2.5 times more frequent, and on average 7 times longer, than in mammals. These differences alone imply almost a 20-fold higher rate of loss of unconstrained DNA in Drosophila (Petrov et al. 1996; Petrov and Hartl 1998).

Differences in deletion spectra

Surprisingly, the difference in the size of deletions between Drosophila and mammals is due exclusively to a much higher incidence of deletions exceeding 5 bp in Drosophila (Figure 3). In particular, the rates of deletions of 1–5 bp are indistinguishable in Drosophila and mammals (G test, P = .96), whereas deletions of 6–15 bp are sixfold more common in Drosophila (G test, P = $2 \times 10^{-5}$), and deletions of 16–30 bp are fourfold more frequent in Drosophila (G test, P = .01).

Figure 3. Size distributions observed in pseudogenes in mammals and in defunct retroelements in Drosophila. The numbers in the size range 1–5 bp are not significantly different.

The difference in the deletion spectra between Drosophila and mammals may be due to differences in spontaneous deletion formation, on the one hand, or to differences in the strength and direction of natural selection, on the other hand. If it is due to differences in spontaneous deletion formation, then the finding implies only that deletions larger than 5 bp are generated at a much greater frequency in Drosophila than in mammals. If one rejects this hypothesis, and instead invokes genome-wide selection for a compact genome to explain the difference, then the implication is that deletions larger than 5 bp have a greater probability of fixation and a longer persistence in Drosophila than in mammals. To explain why selection affects deletions larger than 5 bp in Drosophila more than in mammals, one would have to argue that the selection for a compact genome is either stronger in Drosophila, for whatever reasons, or that it is more effective in Drosophila because of a larger effective population number.

Can Selection Alone Explain the Difference in Deletion Spectra Between Mammals and Drosophila?

The null hypothesis under consideration is that the underlying distribution of spontaneous deletions is the same in Drosophila as in mammals, and that the observed differences in the deletion spectra result solely from genome-wide selection for a compact genome in Drosophila (Charlesworth 1996). If this were true, then it is unclear how selection for a compact genome could create such a sharp difference in probability of fixation over the small range of deletion sizes shown in Figure 3. There is a very sharp discontinuity at 5 bp. Deletions of 3–5 bp are found at equal frequency in Drosophila and mammals (G test, P = .75), whereas deletions of 6–8 bp are 25-fold more frequent in Drosophila (G test, P = $7 \times 10^{-5}$).

To clarify the implications of Figure 3 relative to the null hypothesis, assume that all of the indels observed in mammalian pseudogenes and in defunct retroelements in Drosophila are fixed (not polymorphic) in the population. This is a conservative assumption with regard to the null hypothesis, because to assume that a deletion has become fixed in the population is more favorable for positive selection than to assume that it is a segregating polymorphism. Now consider the effect of positive selection for a compact genome, given that, over the range of indel sizes in Figure 3, the selection coefficient must be linear in indel size. Because there is no difference in the fixation of 3–5 bp deletions, we have to assume that $4N_e s$ is between 0 and 1 for deletions of 3–5 bp. Therefore, because of linearity, the value of $4N_e s$ for deletions of 6–8 bp should be no more than 3 (and smaller on the average). The increase in the fixation probability of mutations with $4N_e s$ of 3, compared with neutral mutations, is 3.16 (Figure 2). But this is very much smaller than the 25-fold elevation actually observed (likelihood ratio test, P = .015).

An alternative scenario is to suppose that longer deletions are deleterious in mammals owing to genome-wide selection for a larger genome, whereas deletions in Drosophila are not subjected to selection. If we suppose that $4N_e s = -1$ for mammalian deletions of 3–5 bp, then we infer $4N_e s = -3$ for deletions of 6–8 bp. Based on these values we would predict about a sixfold deficit in the observed frequency of deletions of 6–8 bp in mammals, which is smaller than the deficit that we observed but not significantly so (likelihood ratio test, P = .13). However, for deletions in the size range 11–15 bp, we would expect $4N_e s \le -5$, which predicts at least a 30-fold deficit in this range of sizes, whereas the observed reduction is only sevenfold. For deletions in the size range of 16–30 bp, we would expect $4N_e s \le -9$, which predicts at least a 900-fold deficit in this range of sizes, while the observed reduction is only fourfold. An omnibus likelihood ratio test demonstrates that such a scenario is extremely unlikely (P = $1 \times 10^{-11}$).

No matter what scenario of genome-wide selection might be invoked, unless the underlying size distributions of spontaneous deletions are different in mammals and Drosophila, the finding of equal frequencies of deletions of 3–5 bp implies that $|4N_e s| \approx 1$ for this range of sizes, and then the implications of equation (1) for linearity of the selection coefficients are inescapable. It follows that at least part of the difference in the deletion spectra must be attributed to real differences in the size distribution of spontaneous deletions.

Can the Critical Deletion Size in Drosophila be as Small as Those Observed?

So far we conclude that genome-wide selection alone cannot explain the discrepancy in the deletion spectra between Drosophila and mammals. But genome-wide selection could still account for part of the difference, and this is the possibility that we shall now consider.

Each defunct retroelement begins its evolutionary lineage as a unique transposition event (a single insertion somewhere in the genome) that occurs in a single individual and therefore has an initial frequency in the entire population of $1/(2N)$. In other words, each defunct retroelement begins as an insertion/deletion polymorphism in the population, in which the insertion has a frequency of $1/(2N)$ and the "deletion" (absence of the defunct element) has a frequency of $1 - 1/(2N)$. If there were genome-wide selection for a compact genome, then this selection must also operate on the insertion/deletion polymorphism associated with the creation of each defunct element. This line of reasoning makes it difficult to fathom why the defunct elements should persist in the population long enough to accumulate point substitutions and indels within themselves, let alone to become fixed. One could argue that certain regions of the genome may be more permissive to the fixation of defunct elements, either because they are somehow protected from genome-wide selection for a compact genome, or because they are in a region of high background selection (Charlesworth et al. 1995), and so can become fixed by chance, in spite of being deleterious. Such conditions might be found in centromeric heterochromatin. However, if defunct elements are fixed because they happen to reside in regions protected from selection for genome size, then how could selection for genome size operate with respect to indel mutations that occur within these very elements? And if they reside in regions in which selection is ineffective because of high background selection, then again how could the selection be more effective with respect to indel mutations that occur within the selfsame elements?

Although we have made no attempt to estimate what fraction of the defunct retroelements in the Drosophila dataset are fixed in any particular species, we have noted a 1300 bp element that was fixed in the common ancestor of the D. simulans, D. mauritiana, and D. sechellia clade (Petrov and Hartl 1998). Consider this 1300 bp element in relation to the average size of the deletions observed in all defunct elements, which is 25 bp. If we assume that $N_e s = 1$ for a deletion of 25 bp (assuming any smaller value makes the following arguments even stronger), then $N_e s$ for a 1300 bp deletion would be $-52$. The probability of fixation of the initial 1300 bp polymorphism, given $N_e = 10^6$ (Akashi 1997) and $N = N_e$, is about $7 \times 10^{-28}$. To get a sense of how small this probability is, imagine that we were in a position to observe every single fixation of a 1300 bp element over $10^9$ generations (∼10 million years for Drosophila) in a population of $10^7$ individuals. In order to observe, on average, one such fixation, the rate of transposition would have to be on the order of $10^{12}$ per genome per individual. This is clearly an absurdly high transposition rate, even under the most congenial assumptions.

Even if the critical value at which $N_e s = 1$ were a deletion size of 50, then $N_e s$ for a 1300 bp polymorphism would be $-26$ and the probability of fixation would be $7 \times 10^{-17}$. Granted that this is just one example: perhaps this particular defunct element does reside in a region of relaxed selection, or is even beneficial owing to a local position effect. On the other hand, this particular element is typical: the average rate of deletions and the average size of deletions are indistinguishable from those of other defunct elements (G test for rate, P = 0.42; Wilcoxon test for size, P = 0.87).

Even if special circumstances account for the fixed element of 1300 bp, these circumstances are not expected to apply broadly to other defunct elements. Under the hypothesis of genome-wide selection for a compact genome, owing to size alone, we expect $|4N_e s|$ for each initially polymorphic insertion to be much greater than the $|4N_e s|$ for a 25 bp deletion. Such deleterious alleles are not expected to achieve high frequencies in the population or, equivalently, to persist in the population for long periods of time. The persistence times of defunct elements can be estimated from the number of apomorphic (terminal-branch) nucleotide substitutions per site within these elements (Petrov et al. 1996; Petrov and Hartl 1997, 1998). If we assume a constant rate of neutral point mutations of $15 \times 10^{-3}$ per million years (Sharp and Li 1989), the average age of defunct elements in the D. melanogaster dataset is 1.33 ± 0.292 million years (range 0.1–4.8 million years), which corresponds to approximately 26 million generations assuming about 20 generations per year. This is a very long persistence time for large insertions, which under the null hypothesis should be very deleterious relative to the average size of indels. By way of comparison, for new neutral alleles in a population of size $N_e = N = 10^6$, the average time to loss of alleles destined to be lost is about $2 \ln(2N) = 29$ generations. We might argue that the persistent elements are not destined to be lost, but rather are fixed or destined to be fixed, but this merely throws us back on the other horn of the dilemma of specifying a mechanism by which large numbers of deleterious insertions can become fixed.

Without invoking an ad hoc assumption that each defunct element in our study was positively selected for some unknown favorable local effect, we must conclude that the critical indel length in Drosophila is larger than the size of most of the indels observed. Furthermore, the long persistence times and high probabilities of fixation of defunct elements suggest that the critical value may be much larger.

Population Persistence and the Length of Deletions

Deletions in the D. melanogaster dataset range from 1 to 432 bp. Under the null hypothesis of genome-wide selection for a compact genome, if the smallest selection coefficient yields $4N_e s \approx 1$, then assuming linearity [equation (1)] the largest should yield $4N_e s \approx 100$. Since the persistence times of mutations that differ by such a large range of selective values are also very different, we should see some indication that defunct elements with large deletions persist longer than defunct elements with small deletions.

We can use the number of accumulated point substitutions to estimate the age of [...] dataset (Petrov et al. 1996; Petrov and Hartl 1997).

Figure 4. Deletion size (ordinate) as a function of age of the defunct element in which the deletion is found (abscissa). There is no significant correlation.

We conclude that genome-wide selection for a compact genome, if it occurs, is not strong enough to differentially affect the persistence times of Drosophila deletions ranging in length from 1 to 400 bp, suggesting that our sample of deletions in defunct retroelements affords a virtually unbiased estimate of the rate and size distribution of spontaneous deletion formation.

Summary

Studies of pseudogene evolution are important in that they may yield estimates of the rates and patterns of spontaneous mutation. The operational definition of a pseudogene is a gene duplication that has no protein-coding function, and that shows no evidence of selective con- [...] of deletions observed in Drosophila. Our analysis is based on theoretical considerations showing that the intensity of genome-wide selection for a more compact genome acting on any indel must be linearly proportional to its indel length, as long as the indel is a very small fraction of the total genome size.

Among pseudogenes in mammals and defunct retrotransposons in Drosophila, the observed number of deletions, relative to the observed number of nucleotide substitutions, is approximately equal for the size range of 1–5 bp, but much larger in Drosophila for size classes 6–10 bp, 11–15 bp, and 16–30 bp. Surprisingly the magnitude of the excess in each of the categories of larger deletions does not increase according to the size of the deletion, as would be expected if genome-wide selection for a compact genome were acting on deletions in proportion to their size.
Fur- a particular defunct element assuming a straints on its nucleotide sequence. Pseu- thermore, any assumption of weak selec- Ϫ tion favoring deletions as small as those nucleotide substitution rate of 15 ϫ 10 3 dogenes are presumed to accumulate var- actually observed necessarily implies a per million years (Sharp and Li 1989). The ious types of substitutions by random much larger intensity of selection acting ages of defunct elements in the D. melan- genetic drift, unbiased by natural selec- against the persistence or fixation of new- ogaster dataset range from 0.1 to 4.8 mil- tion, and thus are thought to reflect the ly arisen defunct retroelements in the first lion years, and there is a very good cor- underlying patterns of spontaneous mu- place, but strong deleterious effects are in- relation between the number of deletions tation. consistent with our observations of very and the age of individual elements (Petrov In this article we discuss a possibility long persistence times and even fixation of and Hartl 1998). The simplest interpreta- that insertions and deletions (indels) in such elements in Drosophila species. Fur- tion of these data is that deletions have pseudogenes may be subject to selection thermore, the persistence times of dele- had more time to accumulate in older el- pressure resulting from their incremental tions ranging from 1 to 400 bp are not sig- ements and thus, on average, deletions effects on total genome size (Charles- nificantly different, which is unexpected in found in older elements should be older worth 1996; Petrov et al. 1996; Petrov and a model of genome-wide selection for a than deletions found in younger elements. Hartl 1997). For example, if there is a ge- compact genome. 
If there is genome-wide selection for a nome-wide selection pressure toward a We therefore conclude that the frequen- compact genome, then longer deletions, more compact genome, then one might ex- cy and size distribution of small indels being more strongly favored, should per- pect that deletions, especially longer ones, found in mammalian pseudogenes and de- sist longer than small deletions, and the would be overrepresented among ob- funct retroelements in Drosophila do not effect should be detectable, since it in- served mutations, relative to nucleotide appear to be biased to any significant ex- volves values of 4Nes ranging over a factor substitutions and even more so relative to tent by genome-wide directional selection of 100. By ‘‘persistence’’ we do not neces- insertions. acting on total genome size. This does not sarily mean persistence as polymor- A genome-wide selection hypothesis mean to imply in any way that natural se- phisms, merely long-term persistence in was prompted by our recent findings that lection is not acting on genome size in the genome, even if fixed. This possibility Drosophila pseudogene-like sequences Drosophila, but rather that such selection, would yield a positive correlation between lose DNA through frequent deletions if present, is not strong enough to notice- the length of a deletion and the age of the much faster than mammalian pseudoge- ably bias patterns of small deletions/inser- defunct element in which it is found. On nes, which we interpreted as reflecting in- tions (1–400 bp). This justifies the use of ഠ the other hand, if |4Nes| 1 or less across trinsic differences in the rate and size dis- these sequences to estimate the underly- the whole observed range of deletion siz- tribution of spontaneous deletions (Petrov ing rates and patterns of spontaneous nu- es, then no such correlation is expected. et al. 1996; Petrov and Hartl 1998). An al- cleotide substitution and indel formation. 
Figure 4 shows that there is actually no ternative interpretation based on genome- The analysis of pseudogenes and defunct correlation between the length of dele- wide selection for a more compact ge- retroelements promises to yield rich new tions and the age of the elements (Spear- nome in Drosophila was also suggested information about the basic properties ϭ ϭ man’s rs 0.06, P .62). We have also (Charlesworth 1996). and patterns of spontaneous mutation, tested whether the smallest deletions (1– In this article we argue that selection for and about how these patterns may change 10 bp) are different in average age com- smaller genome size is very unlikely to be in organisms over evolutionary time. pared with the largest deletions (77–432 the sole agent responsible for the differ- bp), and here again there is no detectable ence in deletion spectra between Drosoph- References ϭ Akashi H, 1997. Codon bias evolution in Drosophila: effect (Wilcoxon test, P .63). The same ila and mammals, and also is very unlikely of mutation-selection drift. Gene finding holds for deletions in the D. virilis to be intense enough to bias the frequency 205:269–278.

Beale D and Lehmann H, 1965. Abnormal hemoglobin and the genetic code. Nature 207:259–261.
Charlesworth B, 1996. The changing sizes of genes. Nature 384:315–316.
Charlesworth D, Charlesworth B, and Morgan MT, 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141:1619–1632.
Graur D, Shuali Y, and Li W-H, 1989. Deletions in processed pseudogenes accumulate faster in rodents than in humans. J Mol Evol 28:279–285.
Jeffs P and Ashburner M, 1991. Processed pseudogenes in Drosophila. Proc R Soc Lond B 244:151–159.
Li W-H, Gojobori T, and Nei M, 1981. Pseudogenes as a paradigm of neutral evolution. Nature 292:237–239.
Li W-H, Luo C-C, and Wu C-I, 1985. Evolution of DNA sequences. In: Molecular evolutionary genetics (MacIntyre RJ, ed). New York: Plenum; 1–94.
Li W-H, Wu C-I, and Luo C-C, 1984. Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications. J Mol Evol 21:58–71.
Petrov DA and Hartl DL, 1997. Trash DNA is what gets thrown away: high rate of DNA loss in Drosophila. Gene 205:279–289.
Petrov DA and Hartl DL, 1998. High rate of DNA loss in the D. melanogaster and D. virilis species groups. Mol Biol Evol 15:293–302.
Petrov DA and Hartl DL, 1999. Patterns of nucleotide substitution in Drosophila and mammalian genomes. Proc Natl Acad Sci USA 96:1475–1479.
Petrov DA, Lozovskaya ER, and Hartl DL, 1996. High intrinsic rate of DNA loss in Drosophila. Nature 384:346–349.
Sharp PM and Li W-H, 1989. On the rate of DNA sequence evolution in Drosophila. J Mol Evol 28:398–402.
Weiner AM, Deininger PL, and Efstratiadis F, 1986. Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu Rev Biochem 55:631–661.
Wu CT and Morris JR, 1999. Transvection and other homology effects. Curr Opin Genet Dev 9:237–246.
Zuckerkandl E, Derancourt J, and Vogel H, 1971. Mutational trends and random process in the evolution of informational macromolecules. J Mol Biol 59:473–490.

Corresponding Editor: Masatoshi Nei

The Journal of Heredity 2000:91(3)
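The fixation argument made in the text for a 1300 bp defunct element can be illustrated with a short numerical sketch. It assumes Kimura's diffusion approximation for a new mutation at initial frequency 1/(2N), which may not be the exact expression used in the article, so the absolute values need not match those quoted in the text. The function names and the effective population size of 10⁶ are hypothetical choices for illustration only. The linear scaling of Nₑs with indel length follows the article's assumption, with the critical length defined as the indel size at which |Nₑs| = 1.

```python
import math

LN10 = math.log(10.0)


def scaled_ne_s(indel_length_bp, critical_length_bp):
    """N_e*s for an insertion of the given length under linear scaling.

    critical_length_bp is the (assumed) indel length at which |N_e*s| = 1.
    Insertions enlarge the genome, so the value is negative (deleterious).
    """
    return -indel_length_bp / critical_length_bp


def log10_fixation_probability(ne_s, n_e, n=None):
    """log10 of Kimura's fixation probability for a new mutation at 1/(2N).

    u = (1 - exp(-2*(N_e/N)*s)) / (1 - exp(-4*N_e*s)), evaluated in log
    space so that strongly deleterious cases do not underflow to zero.
    """
    if n is None:
        n = n_e  # the text takes N to be approximately N_e
    if ne_s == 0:
        return math.log(1.0 / (2.0 * n)) / LN10  # neutral case: 1/(2N)
    s = ne_s / n_e
    a = 2.0 * (n_e / n) * abs(s)  # magnitude of the numerator exponent
    x = 4.0 * abs(ne_s)           # magnitude of the denominator exponent
    if ne_s < 0:
        # (exp(a) - 1) / (exp(x) - 1), rewritten to avoid overflow in exp(x)
        log_u = math.log(math.expm1(a)) - x - math.log(-math.expm1(-x))
    else:
        log_u = math.log(-math.expm1(-a)) - math.log(-math.expm1(-x))
    return log_u / LN10


if __name__ == "__main__":
    n_e = 1e6  # hypothetical effective population size, not from the article
    for critical in (1, 50):  # critical indel lengths discussed in the text
        ne_s = scaled_ne_s(1300, critical)
        print(critical, ne_s, log10_fixation_probability(ne_s, n_e))
```

Working in log space is a design choice: with a critical length of 1 bp the exponent is in the thousands, and a direct evaluation would either overflow or return zero. Under this sketch, a larger critical length makes the 1300 bp insertion less strongly selected against, yet its fixation probability remains many orders of magnitude below the neutral value of 1/(2N), which is the qualitative point of the argument.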
