
Pack-Mutator–like transposable elements (Pack-MULEs) induce directional modification of genes through biased insertion and DNA acquisition

Ning Jiang (a, 1), Ann A. Ferguson (a), R. Keith Slotkin (b), and Damon Lisch (c)

(a) Department of Horticulture, Michigan State University, East Lansing, MI 48824; (b) Department of Cellular and Molecular Biology, Ohio State University, Columbus, OH 43210; and (c) Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720

Edited by Hugo K. Dooner, Rutgers, The State University of New Jersey, New Brunswick, Piscataway, NJ, and approved December 17, 2010 (received for review July 23, 2010)

In monocots, many genes demonstrate a significant negative GC gradient, meaning that the GC content declines along the orientation of transcription. Such a gradient is not observed in the genes of the dicot plant Arabidopsis. In addition, a lack of homology is often observed when comparing the 5′ end of the coding region of orthologous genes in rice and Arabidopsis. The reasons for these differences have been enigmatic. The presence of GC-rich sequences at the 5′ end of genes may influence the conformation of chromatin, the expression level of genes, and the recombination rate. Here we show that Pack-Mutator–like transposable elements (Pack-MULEs) that carry gene fragments specifically acquire GC-rich fragments and preferentially insert into the 5′ end of genes. The resulting Pack-MULEs form independent, GC-rich transcripts with a negative GC gradient. Alternatively, the Pack-MULEs evolve into additional exons at the 5′ end of existing genes, thus altering the GC content in those regions. We demonstrate that Pack-MULEs modify the 5′ end of genes and are at least partially responsible for the negative GC gradient of genes in grasses. Such a unique and global impact on gene composition and gene structure has not been observed for any other transposable elements.

gene duplication | gene modification

Author contributions: N.J. and D.L. designed research; N.J. and A.A.F. performed research; N.J., A.A.F., R.K.S., and D.L. analyzed data; and N.J., A.A.F., R.K.S., and D.L. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
1. To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1010814108/-/DCSupplemental.

Transposable elements (TEs) are DNA sequences that are capable of moving in the genome and that in the process increase their copy numbers. In plants, it is well known that the amplification of transposable elements is largely responsible for variation in genome size (1). For example, TEs account for only 14% of the genome of the model dicot plant Arabidopsis thaliana, which possesses one of the smallest genome sizes among plants (2). In contrast, 85% of the maize genome, which is about 20 times the size of the Arabidopsis genome, is composed of TEs (3). Although TEs amplify and contribute to genome size variation, the activity of TEs influences other genome components as well. One of these activities is the duplication of genes and gene fragments by TEs such as Pack-Mutator–like transposable elements (Pack-MULEs).

Pack-MULEs are nonautonomous DNA transposable elements that belong to the Mutator superfamily (4). First isolated in maize (5, 6), Pack-MULEs can amplify genes or gene fragments on a massive scale (4, 5, 7–9). In rice, there are nearly 3,000 Pack-MULEs, of which 22% are transcribed and at least 1% are translated (10). Pack-MULEs are frequently associated with small RNAs, which may suppress the expression of themselves and the parental genes from which the gene fragments are derived (10). This suggests that Pack-MULEs have a great potential in regulating gene expression and providing unique resources for coding sequences.

In addition to the variation in genome size and TE content, plants differ in the composition of their genes. In the genomes of flowering plants, the genes can be categorized as GC rich and GC poor (11). In general, the Gramineae (grass) genomes contain many more GC-rich genes than the genomes of dicot plants such as Arabidopsis (11). Most of the variation in GC content occurring among genes in grasses is present in the form of a negative gradient in the direction of transcription (12). This negative gradient is defined by higher GC content at the 5′ end of genes than at the 3′ end, and genes with a negative gradient are more abundant in rice than in Arabidopsis (12). Several hypotheses, including transcription-coupled DNA repair (12), GC-biased gene conversion (13, 14), and translational advantage (15), have been proposed to explain the emergence of GC richness and the formation of a negative GC gradient among grass genes. However, none of these hypotheses fully explains the dramatic difference in GC gradient among individual genes (also see Results). In this study, we present evidence that Pack-MULEs in grasses specifically amplify GC-rich gene fragments and preferentially insert into the 5′ end of genes. We suggest that their acquisition bias and insertion preference, combined with their capability to initiate transcription, allow Pack-MULEs to modify the 5′ end of their target-site genes and raise the local GC content. As such, Pack-MULEs have contributed to the emergence of GC-rich genes and the formation of negative gradients. Our analysis provides insights into how TEs may shape their host genomes.

Results

GC-Rich Internal Regions of Pack-MULEs from Maize and Rice. To further evaluate the impact of Pack-MULEs, we identified Pack-MULEs from maize and Arabidopsis, two species (in addition to rice) with high-quality genomic sequences and gene annotation. A complete list of all maize, rice, and Arabidopsis Pack-MULEs is provided in Table S1. Among the three species, Pack-MULEs are most abundant in rice (2,853), followed by maize (276), and finally Arabidopsis, which contains only 46 Pack-MULEs. The average sequence identity of terminal inverted repeats (TIRs) of Pack-MULEs is 84% for rice, 87% for maize, and 72% for Arabidopsis. If the identity of the TIRs reflects the age of the elements, then the Arabidopsis Pack-MULEs are more ancient than those in rice and maize.

The internal regions of Pack-MULEs in rice and maize are particularly GC rich. Individual Pack-MULE elements were divided into 10 equal-sized bins, with bin 1 and bin 10 largely representing TIRs (Fig. S1), which are sequences required for transposition. The other bins represent the internal sequences flanked by the TIRs. As shown in Fig. S1, the GC content of the internal sequences of Pack-MULEs in maize and rice is significantly higher than that of the TIRs, or of the genome as a whole. In contrast, the internal sequences of Arabidopsis Pack-MULEs are only slightly more GC rich than the TIRs, and comparable to the genomic average.

Because a large part of the internal regions of Pack-MULEs are duplicated fragments of the respective parental genes, these data

www.pnas.org/cgi/doi/10.1073/pnas.1010814108 | PNAS Early Edition | 1 of 6

Fig. 1. The GC content of all non-TE genes, Pack-MULE parental genes, and fragments acquired by Pack-MULEs. [Panels: Arabidopsis, maize, and rice. x axis: GC content (%); y axis: percent of genes or fragments. Series: all non-TE genes, parental genes, acquired sequences.]
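The distributions summarized in Fig. 1 rest on a per-sequence GC percentage followed by binning. The following is a minimal sketch of that kind of calculation, not the authors' script: the function names `gc_percent` and `gc_histogram`, the 5-point bin width, and the choice to ignore ambiguous bases such as N are all assumptions made for illustration.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """GC content (%) computed over unambiguous bases (A, C, G, T) only."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * gc / total


def gc_histogram(seqs, bin_width=5.0):
    """Percent of sequences per GC bin, keyed by the lower edge of each bin.

    The top bin is closed on the right so that a 100% GC sequence
    is counted in the last bin rather than in a bin of its own.
    """
    seqs = list(seqs)
    if not seqs:
        return {}
    counts = Counter()
    for s in seqs:
        value = gc_percent(s)
        lower = min(int(value // bin_width) * bin_width, 100.0 - bin_width)
        counts[lower] += 1
    n = len(seqs)
    return {edge: 100.0 * c / n for edge, c in sorted(counts.items())}
```

Running `gc_histogram` separately on all non-TE genes, on parental genes, and on acquired fragments would give three series comparable in spirit to those plotted in the figure.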

imply that the acquired region is GC rich as well. To test this idea, we compared the GC content of all non-TE genes, full-length parental genes of Pack-MULEs, and Pack-MULE–acquired regions within parental genes. For both maize and rice, the major maximum of non-TE genes falls into the bins of sequences that are 40–45% in GC content, with a minor maximum (or “shoulder”) at 60%, suggesting the presence of a subgroup of GC-rich genes (Fig. 1). In contrast, the distribution of GC content for parental genes is bimodal due to an enrichment of GC-rich genes relative to all non-TE genes. The GC content of the acquired fragments in Pack-MULEs is even higher, with the major maximum around 70–75% (Fig. 1). In rice, the GC content of acquired sequences is about 16 percentage points higher [median value: 71% vs. 55%; P < 2.2 × 10^−16, Wilcoxon rank sum (WRS) test] than that of the nonacquired regions of the parental genes. The difference is about 9 percentage points (62% vs. 53%; P = 1.7 × 10^−8, WRS test) in maize. In contrast, the GC content of Arabidopsis genes demonstrates a largely normal distribution. Compared with the parental genes and other genes, the GC content of acquired regions is slightly higher, but the difference is not as dramatic as that in maize and rice (43% vs. 40%; P = 0.013, WRS test; Fig. 1).

Acquisition Position in Parental Genes Is Influenced by GC Gradient of Genes. As mentioned above, many grass genes are associated with a negative gradient in GC content in the direction of transcription, and therefore the 5′ ends of genes are more GC rich than their 3′ ends (12). Thus, we hypothesize that the high GC content of Pack-MULEs in maize and rice is due to the presence of large numbers of genes with negative GC gradients, coupled with an acquisition preference for the 5′ regions. To test whether this is the case, we examined the acquisition pattern from genes with different GC profiles. Genes were categorized as “negative genes” (the GC content of the 5′ half of the gene is at least 5% higher than that of the 3′ half, i.e., a negative gradient), “positive genes” (the GC content of the 3′ half is at least 5% higher than that of the 5′ half, i.e., a positive gradient), and “moderate genes” (all genes in which the GC gradient is not significant). The GC profile of a typical gene from each category is shown in Fig. S2.

Consistent with the previous report (12), most Arabidopsis genes are moderate genes, yet nearly half of the maize and rice genes exhibit a negative GC gradient (Fig. 2). For all three organisms, the relative fraction of negative genes is slightly higher among parental genes of Pack-MULEs than among all genes, although the bias is significant only for rice (χ² test: P < 10^−6). To compare the acquisition frequency in different regions of genes, gene sequences were divided into 10 equal-sized bins from the transcription start site (TSS) to the transcription termination site (TTS). Interestingly, the acquisition patterns are significantly different among different types of genes in maize and rice (χ² test: P = 0.050 for maize, P < 10^−6 for rice; Fig. 3). For negative genes, there is a significant bias toward the 5′ end, whereas such a bias is not obvious for moderate genes (Fig. 3). For positive genes, the bias is shifted to the 3′ end (bins 7, 8, and 9) in both maize and rice. In Arabidopsis, the acquisition bias for the 5′ end is observed for both negative genes and moderate genes. Unfortunately, only two parental genes are associated with a positive GC gradient in Arabidopsis. One acquisition event is at the 5′ end and the other is at the 3′ end, which is insufficient for drawing conclusions. Thus, it appears that, at least for maize and rice, the acquisition position is influenced by the GC gradient of genes, and acquisition events occur more frequently in the GC-rich regions than in other regions.

Because there are many more negative genes than positive genes in maize and rice (Fig. 2), the consequence of such acquisition bias is an excess of 5′ end sequences among the acquired fragments. The over-representation of 5′ end sequences also holds for Arabidopsis because most of its genes are moderate genes from which the 5′ ends are more frequently acquired (Fig. 3). The bias for the 5′ end sequence was also reported in a previous study where a few Pack-MULEs from Arabidopsis were examined (16).

If Pack-MULEs in rice and maize preferentially acquire GC-rich sequences, why is the GC content of acquired regions in Arabidopsis not dramatically different from other genic sequences (Fig. 1)? To answer this question, we calculated the GC content of all genes in a 325-bp sliding window, which is the average size of acquired fragments in Pack-MULEs (4). In Arabidopsis, less than 4% of the genic regions (sliding windows) have a GC content that is 10% higher than the average value of genes, and few of them (0.05%) have a GC content that is 20% higher than the average value. In contrast, nearly 10% of the fragments in maize and rice have a GC content that is 20% higher than the average value of genes. This is about 200 times more abundant than that in Arabidopsis, even when accounting for the difference in gene numbers between the organisms. Accordingly, one explanation for the lack of GC-rich Pack-MULEs in Arabidopsis is the relatively uniform GC content and the absence of GC-rich islands within Arabidopsis genes. Alternatively, the Pack-MULEs in Arabidopsis represent more ancient insertion events, and various evolutionary forces (e.g., deletion processes) may have homogenized GC content biases that may once have been more prevalent.

Insertion Preference of Pack-MULEs. In addition to their acquisition bias, Pack-MULEs also demonstrate an insertion preference for regions flanking the 5′ termini of genes. Many Pack-MULEs are located within 1 kb of a non-TE gene, and the fraction of these Pack-MULEs is negatively correlated with the genome size (25% for maize, 33% for rice, and 44% for Arabidopsis). Moreover, the majority of those Pack-MULEs are within 500 bp of the 5′ termini of adjacent genes (Fig. 4). This is consistent with previous

Fig. 2. The relative abundance of genes and parental genes with different GC gradients from maize, rice, and Arabidopsis. [Panels: Arabidopsis, maize, and rice. x axis: gene types (negative, moderate, positive); y axis: percent of all genes or parental genes. Series: all genes, parental genes.]
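The three gene types counted in Fig. 2 follow from comparing the GC content of the 5′ half of a gene with that of its 3′ half. A minimal sketch of that rule is given below. It assumes that the 5% cutoff is a difference in percentage points, that the sequence is supplied in the direction of transcription, and that an odd-length sequence gives its extra base to the 3′ half; the function names are hypothetical.

```python
def gc_percent(seq: str) -> float:
    """GC content (%) of a nonempty sequence."""
    if not seq:
        raise ValueError("empty sequence")
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def classify_gradient(gene_seq: str, threshold: float = 5.0) -> str:
    """Label a gene as 'negative', 'positive', or 'moderate'.

    gene_seq must run from the 5' end to the 3' end (direction of
    transcription). threshold is in percentage points (an assumption).
    """
    if len(gene_seq) < 2:
        raise ValueError("need at least two bases to form two halves")
    mid = len(gene_seq) // 2
    five_prime, three_prime = gene_seq[:mid], gene_seq[mid:]
    diff = gc_percent(five_prime) - gc_percent(three_prime)
    if diff >= threshold:
        return "negative"   # 5' half is more GC rich
    if diff <= -threshold:
        return "positive"   # 3' half is more GC rich
    return "moderate"
```

Tallying the labels over all genes and over parental genes only would give the two series of the figure.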

reports describing the de novo insertion site preference of Mutator-like elements (17–20), particularly the recent finding that the highest insertion density of active Mutator elements in maize is found in the regions around the TSS (21). Our data suggest that the insertion preference for sequences flanking the 5′ termini of genes is conserved for Mutator-like elements in all three plants, and a significant negative GC gradient of genes is not required for such targeting specificity.

Lack of Linkage Between Pack-MULEs and Parental Genes. If Pack-MULEs preferentially insert into the 5′ end of genes and the 5′ end sequences are over-represented in the acquired regions, a parsimonious explanation for this phenomenon is that Pack-MULEs acquire sequences from adjacent genic regions. If such an acquisition event is not always associated with transposition, one would expect a linkage between the Pack-MULEs and the parental genes. This would predict that the frequency of a Pack-MULE and its parental gene being on the same chromosome is higher than what is expected if genes on different chromosomes have equal chances of being acquired. However, this is not the case. For all three organisms, the number of cases where the Pack-MULE and its parental gene are on the same chromosome or within a certain distance is not significantly different from the expected value (Table S2). As a result, there is no linkage between recognizable Pack-MULEs and their parental genes.

Transcripts Derived from Pack-MULEs Are Associated with Negative GC Gradients. On the basis of current gene annotation, the majority of Pack-MULEs in rice are annotated as independent genes (39%) or as part of genes (29%). To minimize the artifacts of gene annotation, we focused only on the loci with corresponding full-length cDNA (FL-cDNA) sequences. The relevant cDNAs were classified into two categories: (i) “Pack-MULE transcripts,” in which the entire ORF is located inside a Pack-MULE; and (ii) “chimeric transcripts,” which are fusions of Pack-MULEs and flanking sequences (or flanking genes), in which the Pack-MULEs contribute to only part of the ORF and/or UTRs.

In rice, there are 22 chimeric transcripts matching the above criteria (see Table S3 for detailed information). For most of them (18 of 22), the GC content of the sequences contributed by the Pack-MULE is higher than that of the non-Pack-MULE–derived sequence of the transcripts (Table S3). In addition, the Pack-MULE–derived parts in these chimeric transcripts are more frequently located in the 5′ as opposed to the 3′ regions of genes (Fig. S3A). On the other hand, Pack-MULE transcripts (84 loci with FL-cDNA support) often initiate in the GC-rich internal regions and terminate within the other TIR or immediate flanking sequences, which have relatively low average GC content (e.g., Fig. S3 C and D). For this reason, the Pack-MULE transcripts are in general more GC rich and have a more dramatic negative GC gradient than is observed in the non-TE genes (Fig. S3B). Most of

Fig. 3. Acquisition position varies among genes with distinct GC gradients. Sequences of genes with different GC profiles (negative, positive, and moderate) were divided into 10 equal-sized bins, and the fraction of acquisition events in each bin is shown. [Panels: maize, rice, and Arabidopsis. x axis: bin position in parental genes (1–10); y axis: percent of acquisition events. Series: negative genes, moderate genes, positive genes.]
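The binning behind Fig. 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: each acquisition event is reduced to the midpoint of the acquired region, coordinates are 0-based, half-open, and measured from the TSS in the direction of transcription (so minus-strand genes would need to be flipped beforehand), and the names `bin_index` and `acquisition_profile` are hypothetical.

```python
def bin_index(position: int, gene_length: int, n_bins: int = 10) -> int:
    """1-based bin of a 0-based position along a gene oriented TSS -> TTS."""
    if not 0 <= position < gene_length:
        raise ValueError("position lies outside the gene")
    return position * n_bins // gene_length + 1


def acquisition_profile(events, n_bins: int = 10):
    """Percent of acquisition events per bin.

    events: iterable of (acq_start, acq_end, gene_length), where
    acq_start/acq_end are 0-based, half-open offsets from the TSS.
    Each event is assigned to the bin containing its midpoint.
    """
    counts = [0] * n_bins
    for start, end, length in events:
        midpoint = (start + end - 1) // 2
        counts[bin_index(midpoint, length, n_bins) - 1] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [100.0 * c / total for c in counts]
```

Computing one profile per gene type (negative, moderate, positive) would give the three series in each panel.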

Fig. 4. Preferential insertion of Pack-MULEs in the regions adjacent to the 5′ termini of genes. [x axis: maize, rice, and Arabidopsis; y axis: percent of Pack-MULEs. Series: 5′ end within 0.5 kb, 5′ end 0.5–1 kb, 3′ end within 0.5 kb, 3′ end 0.5–1 kb.]
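The four categories in Fig. 4 depend on which terminus of the neighboring gene an element lies near and how far away it is. The sketch below shows one way such a classification could be written. It assumes 1-based inclusive coordinates on the plus strand, measures distance from the nearest edge of the element to the gene boundary, and leaves elements that overlap the gene unclassified; none of these details are specified in the text, and the function name is hypothetical.

```python
def classify_insertion(elem_start, elem_end, gene_start, gene_end, strand):
    """Return the Fig. 4-style category of an element near a gene, or None.

    All coordinates are 1-based, inclusive, on the + strand.
    strand is '+' or '-' and gives the gene's orientation.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if elem_end < gene_start:
        # Element lies to the left of the gene on the + strand.
        distance = gene_start - elem_end
        side = "5'" if strand == "+" else "3'"
    elif elem_start > gene_end:
        # Element lies to the right of the gene on the + strand.
        distance = elem_start - gene_end
        side = "3'" if strand == "+" else "5'"
    else:
        return None  # overlapping element: not handled in this sketch
    if distance <= 500:
        return f"{side} end within 0.5 kb"
    if distance <= 1000:
        return f"{side} end 0.5-1 kb"
    return None  # farther than 1 kb from the gene
```

Counting the returned labels over all Pack-MULEs of a species, and dividing by the total number of Pack-MULEs, would give percentages of the kind plotted in the figure.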

the Pack-MULE transcripts (66 of 84, or 79%) would be classified as negative genes and would belong to the GC-rich “shoulder” in Fig. 1. In maize, there are five Pack-MULE loci that match FL-cDNAs [three Pack-MULE transcripts and two chimeric transcripts (Table S4)], all of which represent negative gradients. In Arabidopsis, there is only one Pack-MULE transcript with a moderate GC gradient.

Pack-MULE–Derived Sequences as Unique 5′ Ends of Genes. To further elaborate the role of rice Pack-MULEs in generating chimeric transcripts, the sequence of relevant transcripts was used to search against gene sequences in rice for paralogs, or in maize for putative orthologs, of the genes that Pack-MULEs fused with or inserted into. This allowed us to deduce the gene structure before the insertion of Pack-MULEs. This search led to the identification of five gene pairs [the Pack-MULE–modified gene and the copy without modification by the Pack-MULE (SI Materials and Methods)]. Two of the examples are shown in Fig. 5 and the other examples are shown in Fig. S4. To distinguish them from the parental genes, the genes adjacent to Pack-MULEs (or inserted by Pack-MULEs) will be called “target-site genes.” For two of the gene pairs found, Pack-MULEs did not change the structure of the target-site genes except that the Pack-MULE sequence represents an addition at the 5′ end of the gene, resulting in an elongated 5′ UTR (Fig. S4 A and B). For the remaining three genes, Pack-MULEs seem to have replaced the 5′ UTR (Fig. 5A) or the 5′ UTR plus the 5′ end of the coding regions (Fig. 5B; Fig. S4D) of the target-site genes. Due to the GC richness of Pack-MULE sequences, most of those transcripts modified by Pack-MULEs have a GC-rich 5′ end.

In rice, the first 3 Mb of chromosome 11 and chromosome 12 were derived from a duplication that occurred 7.7 million y ago (22). This duplication enabled us to dissect an example at a relatively high resolution. Among the duplicated genes in this region, there is a pair of α/β fold family hydrolase genes [LOC_Os11g02660 on chromosome 11, referred to as ABFH11, and LOC_Os12g02589 on chromosome 12, referred to as ABFH12 (Fig. 5B)]. The majority of the coding and noncoding sequence of the gene pair is highly similar, with an average identity of 90% at the nucleotide level. Each gene harbors a Pack-MULE at the 5′ end. The two Pack-MULEs were derived from independent insertions, as indicated by their distinct insertion sites and internal regions. Again, this illustrates the observed insertion preference of Pack-MULEs in the 5′ region of genes. It also suggests that some genes are “hot spots” for the insertion of Pack-MULEs, a phenomenon that has also been observed for de novo insertions of Mu elements in maize (17, 20, 23). The Pack-MULE on chromosome 11 (OsChr11-00120) is located 92 bp upstream of the TSS of ABFH11, according to gene annotation and the FL-cDNA sequence (Fig. 5B). On the basis of the comparison of the genomic sequence, the insertion site of the Pack-MULE on chromosome 12 (OsChr12-00120) is 54 bp upstream of the corresponding insertion site of OsChr11-00120. ABFH11 harbors a premature stop codon (Fig. 5B), suggesting that ABFH12 is more likely to have inherited the full function of the ancestral ABFH gene, a conclusion that is supported by a Ka/Ks analysis (SI Text).

Unlike ABFH11, ABFH12 is associated with different splicing forms and distinct TSSs (SI Text). One TSS [ABFH12-Original, supported by EST (Fig. 5B)] is located downstream of the Pack-MULE and corresponds to the TSS of ABFH11. The other TSS [ABFH12-PM, supported by FL-cDNA (Fig. 5B)] is located inside the Pack-MULE, and the internal region of the Pack-MULE serves as the 5′ UTR and the 5′ end of the coding region (42 amino acids). The original exon 1 and part of exon 2 were spliced out and became novel intron sequences in this particular transcript. Thus, it appears that the Pack-MULE on chromosome 12 inserted in the upstream region of the gene and consequently modified the splicing pattern of the gene, and that this modification led to a new transcript in which the original exon 1 was replaced. Both ABFH11 and ABFH12-Original are associated with a negative GC gradient. However, the contribution of the Pack-MULE to ABFH12-PM led to a more GC-rich 5′ terminus and a more dramatic negative GC gradient (Fig. 5D). This shows that Pack-MULEs are indeed capable of modifying the 5′ end of genes and raising the local GC content. RT-PCR experiments were conducted to confirm that the expression and fusion of Pack-MULEs with relevant genes in Fig. 5 are reproducible (Fig. S5).

Discussion

Gene duplication and divergence is one of the most important means for the generation of new genes that are associated with novel functions (24–26). With the advancement of DNA sequencing technology, it is now obvious that TEs play important roles in gene duplication. Examples of gene duplication by TEs have been reported for all major families of TEs (reviewed in refs. 27–29). For example, the LINE-1 elements in humans have transduced at least 1% of the human genome (30, 31). In plants, Pack-MULEs and Helitrons are the elements most frequently found to carry genes and gene fragments (4, 9, 32–34). Despite the large number of gene duplication events made by TEs, this is the first report to demonstrate that TEs select their acquisition targets on the basis of the GC content of the sequence and use the acquired sequence to modify cellular genes.

Our results demonstrate that there are at least three different outcomes when a Pack-MULE inserts at the 5′ end of genes: (i) the Pack-MULE is not directly involved in the transcription of the downstream gene (e.g., ABFH11, Fig. 5B); (ii) the Pack-MULE–associated transcript represents (or becomes) the sole transcript of the gene (e.g., LOC_Os05g34920, Fig. 5A); and (iii) the Pack-MULE–associated transcript coexists with the original transcript (e.g., ABFH12-PM and ABFH12-Original). One possibility is that ABFH12 (Fig. 5B) represents an intermediate status between

Fig. 5. Pack-MULEs modify transcripts of existing genes. In A and B, Pack-MULE TIRs are shown as black triangles, and black horizontal arrows indicate target-site duplications. Exons are depicted as colored or white boxes and introns as lines connecting exons. Homologous regions between paralogs or orthologs are shown in the same color. Pack-MULE ID and gene ID are also shown. Red arrows indicate the position of primers used for RT-PCR (Fig. S5). The drawing is to scale except for the primers, which are enlarged for visibility. The GenBank accession numbers of the FL-cDNA sequence for each rice gene are AK110840 (LOC_Os05g34920.1), AK111804 (LOC_Os12g02589.1, ABFH12-PM), and AK111574 (LOC_Os11g02660.1, ABFH11). In C and D, green arrows delimit regions contributed by Pack-MULEs. (A) A rice gene containing a novel 5′ UTR region derived from a Pack-MULE. (B) Divergence of a pair of recently duplicated rice genes on chromosomes 11 and 12. The colored narrow boxes in ABFH12-PM and ABFH12-Original indicate that the relevant sequences represent exons in other transcripts but serve as introns or untranscribed sequences in these particular transcripts. The empty triangles indicate an insertion of another TE in the TIR of OsChr12-00120. (C and D) The GC content as a function of the position from the TSS in the Pack-MULE–related transcripts in A and B, respectively. [Panel A labels: OsChr05-01710, LOC_Os05g34920.1, Rice-Chr05, Maize-Chr09, GRMZM2G003389_T04. Panel B labels: OsChr12-00120, ABFH12-PM, ABFH12-Original, Rice-Chr12, OsChr11-00120, ABFH11, Rice-Chr11, Maize-Chr06, GRMZM2G118597_T02. Legend: transcription start site, transcription termination site, start codon, stop codon, ancestral stop codon, alternative start codon; scale bar, 500 bp. Panels C and D: x axis, distance from TSS (bp); y axis, GC content (%).]
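Profiles such as those in Fig. 5 C and D plot GC content against position from the TSS, which can be computed with a sliding window. The sketch below is one possible implementation. The window size used for the figure is not stated, so the default of 325 bp is borrowed from the sliding-window analysis of genes described in Results and is an assumption here, as are the step size and the function name.

```python
from itertools import accumulate


def sliding_gc(seq: str, window: int = 325, step: int = 1):
    """Return [(start, gc_percent), ...] for every full window.

    start is the 1-based position of the first base of the window,
    counted from the first base of seq (e.g., the TSS of a transcript).
    Sequences shorter than the window yield an empty list.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    s = seq.upper()
    # prefix[i] = number of G/C bases among the first i bases
    prefix = [0] + list(accumulate(1 if base in "GC" else 0 for base in s))
    return [
        (start + 1, 100.0 * (prefix[start + window] - prefix[start]) / window)
        for start in range(0, len(s) - window + 1, step)
    ]
```

The prefix-sum array keeps the cost linear in sequence length regardless of window size, which matters when the same profile is computed for every gene in a genome.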

a “normal” gene and a Pack-MULE–modified gene. Because we focused only on intact Pack-MULEs in this study, it is conceivable that many more genes might have been modified by Pack-MULEs that are degenerated and no longer recognizable. If the 5′ ends of normal genes represent ancient Pack-MULEs, the combination of the acquisition bias with the insertion preference of Pack-MULEs would be expected to form a self-perpetuating cycle that would maintain and enhance the negative GC gradient of genes.

Because little is known about how Pack-MULEs acquire their internal sequences, it is not clear how the bias toward GC-rich sequences is achieved. According to one model, novel sequences can be introduced into the element through the nicks formed when the Pack-MULE sequences are present as stem loops (35). Alternatively, genomic sequences can be acquired when a template is switched during gap-repair processes (36, 37). Both models predict that transposition of the element to another locus is not required for acquisition. If this is the case and Pack-MULEs acquire sequences from adjacent genes, one might expect a linkage between Pack-MULEs and parental genes. Such a linkage was not detected in this study. However, because only a small subset of TEs becomes fixed in the genome, we cannot rule out that the failure to detect linkage arises because most of the recognizable Pack-MULEs are not the initial element involved in the acquisition. An alternative possibility is that the acquisition occurred during the transposition process, when the transposition complex is moving from the donor site to the insertion site. In this case, GC-rich regions, which are frequently associated with open chromatin, may be more accessible for acquisition (also see below).

The novel features of Pack-MULEs revealed in this study prompt us to revisit the prevalence of genes with a negative GC gradient in grasses. In rice, there is a codon usage bias and amino acid usage bias toward GC-rich codons, and it has been suggested that the formation of the negative GC gradient is due to the transcription-coupled DNA repair (TCR) process (12). However, the GC gradient seems to vary dramatically among individual genes (Fig. 2 and Fig. S2). Because TCR is a fundamental cellular process, it is difficult to understand why it would confer a negative GC gradient on some genes and a positive GC gradient on others. Here we propose an alternative model where multiple factors are responsible for the emergence of negative GC gradients in grasses. First, codon usage bias allowed the appearance of GC-rich islands and GC-rich genes in grasses (38), especially for the genes involved in certain processes such as stress responses (14). Pack-MULEs have been accelerating this process by preferential acquisition and amplification of GC-rich gene fragments. Due to their insertion preference around the 5′ termini of genes, we suggest that Pack-MULEs have been introducing additional GC-rich sequences at the 5′ region of genes or replacing existing 5′ termini with more GC-rich sequences. Because such action is on individual genes, this would explain the high degree of variation among different genes. In addition to the modification of existing genes, Pack-MULEs also form de novo transcripts with a negative GC gradient, adding to the overall observed bias. Other factors, such as codon usage bias, GC-biased gene conversion, and translational advantage (13–15, 39), may also contribute to the formation of the GC gradient, but these are not the focus of this study.

The presence of GC-rich Pack-MULEs at the 5′ end of genes (either through fusion to their target-site gene or as independent elements in close proximity to the 5′ end of genes) may have

further evolutionary impact beyond the alteration of gene sequences. For example, it is known that in many organisms the recombination rate is correlated with local GC content (40, 41). Thus, the insertion of Pack-MULEs may alter the local recombination rate. In mammalian cells, GC-rich genes are associated with elevated expression levels compared with their GC-poor counterparts, even if they encode the same protein and are driven by the same promoters (42). GC-rich sequences also provide more targets for DNA methylation, which can contribute to the regulation of gene expression (14, 43). GC-rich regions are associated with more CG and CHG sites (where H = A, T, or C), and therefore they have the potential to maintain epigenetic patterns of DNA methylation more efficiently than GC-poor regions (44). In plants, only cytosines in CG or CHG contexts can propagate the DNA methylation state from one cell to the next or from the parent to the daughter generation (reviewed in ref. 45). GC-sparse regions have larger amounts of CHH, which are incapable of maintaining the methylation pattern upon S-phase DNA replication. Therefore, GC-rich sequences retain heritable DNA methylation levels more efficiently, and the presence of Pack-MULEs may stabilize the epigenetic control of the expression of the neighboring genes. Finally, GC-rich sequences tend to augment bendability and the ability to undergo a B-Z transition of DNA helical structure, which is often associated with open chromatin and active transcription (46). This might explain why the fraction of expressed Pack-MULEs is much higher than that of other transposable elements, including the autonomous elements of MULEs (10). It is also interesting to note that a de novo insertion of Mu1 (a GC-rich Pack-MULE) in the 5′ UTR of one of the tandemly duplicated p1 genes in maize induced the expression of other p1 genes in the same cluster, possibly through the alteration of DNA methylation and chromatin structure (20). From this point of view, some of the Pack-MULEs could be favored by selection simply due to their GC content. Taken together, our analysis indicates that the activity of Pack-MULEs may have had a profound influence on gene structure and expression.

Materials and Methods

The procedure for the annotation of Pack-MULEs in the three genomes was similar to that described previously (10). The sequences for rice pseudomolecules and gene annotation information were downloaded from the rice annotation group at Michigan State University (http://rice.plantbiology.msu.edu/, release 6.0). Maize chromosome sequences and gene annotation information (4a.53) were downloaded from the maize sequencing project (http://www.maizesequence.org/, B73 RefGen_v1) (3). The Arabidopsis chromosome sequence and gene annotation information were from TAIR9 (http://www.arabidopsis.org/). To identify the origin of the sequences captured by Pack-MULEs, the internal regions of Pack-MULEs were masked using non-MULE TEs and then used to query the genome for similar sequences (BLASTN e-value < 1.0E-10). For an individual Pack-MULE, the sequence with the highest score that was not associated with a MULE TIR was considered the parental copy of the Pack-MULE. The GC content of individual sequences was calculated using a custom script. The insertion sites of Pack-MULEs were determined according to the positions of the Pack-MULE and the relevant genes. Full methods and associated references are available in SI Materials and Methods.

ACKNOWLEDGMENTS. We thank Ms. Dongmei Yin, Dr. Veronica Vallejo, and Ms. Channa Lindsay for assistance with RT-PCR analysis. We also thank Drs. Jeff Bennetzen and Thomas Peterson for valuable discussion about the acquisition mechanism of Pack-MULEs. This study was supported by Grant DBI-0607123 from the National Science Foundation to N.J. and Grant DBI-0820828 to D.L.

1. Bennetzen JL, Kellogg EA (1997) Do plants have a one-way ticket to genomic obesity? Plant Cell 9:1509–1514.
2. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815.
3. Schnable PS, et al. (2009) The B73 maize genome: Complexity, diversity, and dynamics. Science 326:1112–1115.
4. Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431:569–573.
5. Talbert LE, Chandler VL (1988) Characterization of a highly conserved sequence related to Mutator transposable elements in maize. Mol Biol Evol 5:519–529.
6. Walbot V, Rudenko GN (2002) MuDR/Mu Transposons of Maize. Mobile DNA II, eds Craig N, Craigie R, Gellert M, Lambowitz A (American Society of Microbiology Press, Washington, DC), pp 533–564.
7. Lisch D (2005) Pack-MULEs: Theft on a massive scale. Bioessays 27:353–355.
8. Juretic N, Hoen DR, Huynh ML, Harrison PM, Bureau TE (2005) The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome Res 15:1292–1297.
9. Holligan D, Zhang X, Jiang N, Pritham EJ, Wessler SR (2006) The transposable element landscape of the model legume Lotus japonicus. Genetics 174:2215–2228.
10. Hanada K, et al. (2009) The functional role of pack-MULEs in rice inferred from purifying selection and expression profile. Plant Cell 21:25–38.
11. Carels N, Bernardi G (2000) Two classes of genes in plants. Genetics 154:1819–1825.
12. Wong GK, et al. (2002) Compositional gradients in Gramineae genes. Genome Res 12:851–856.
13. Glémin S, Bazin E, Charlesworth D (2006) Impact of mating systems on patterns of sequence polymorphism in flowering plants. Proc Biol Sci 273:3011–3019.
14. Tatarinova TV, Alexandrov NN, Bouck JB, Feldmann KA (2010) GC3 biology in corn, rice, sorghum and other grasses. BMC Genomics 11:308.
15. Gouy M, Gautier C (1982) Codon usage in bacteria: Correlation with gene expressivity. Nucleic Acids Res 10:7055–7074.
16. Yu Z, Wright SI, Bureau TE (2000) Mutator-like elements in Arabidopsis thaliana: Structure, diversity and evolution. Genetics 156:2019–2031.
17. Dietrich CR, et al. (2002) Maize Mu transposons are targeted to the 5′ untranslated region of the gl8 gene and sequences flanking Mu target-site duplications exhibit nonrandom nucleotide composition throughout the genome. Genetics 160:697–716.
18. Lisch D (2002) Mutator transposons. Trends Plant Sci 7:498–504.
19. Slotkin RK, et al. (2009) Epigenetic reprogramming and small RNA silencing of transposable elements in pollen. Cell 136:461–472.
20. Robbins ML, Sekhon RS, Meeley R, Chopra SA (2008) A Mutator transposon insertion is associated with ectopic expression of a tandemly repeated multicopy Myb gene pericarp color1 of maize. Genetics 178:1859–1874.
21. Liu S, et al. (2009) Mu transposon insertion sites and meiotic recombination events co-localize with epigenetic marks for open chromatin across the maize genome. PLoS Genet 5:e1000733.
22. Rice Chromosomes 11 and 12 Sequencing Consortia (2005) The sequence of rice chromosome 11 and 12, rich in disease resistance genes and recent gene duplication. BMC Biol 3:20.
23. Bennetzen JL, Springer PS, Cresse AD, Hendrickx M (1993) Specificity and regulation of the Mutator transposable element system in maize. Crit Rev Plant Sci 12:57–95.
24. Ohno S (1970) Evolution by Gene Duplication (Springer-Verlag, New York).
25. Lynch M, Conery JS (2000) The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
26. Zhang J (2003) Evolution by gene duplication: An update. Trends Ecol Evol 18:292–298.
27. Kazazian HH, Jr. (2004) Mobile elements: Drivers of genome evolution. Science 303:1626–1632.
28. Bennetzen JL (2005) Transposable elements, gene creation and genome rearrangement in flowering plants. Curr Opin Genet Dev 15:621–627.
29. Feschotte C, Pritham EJ (2007) DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet 41:331–368.
30. Moran JV, DeBerardinis RJ, Kazazian HH, Jr. (1999) Exon shuffling by L1 retrotransposition. Science 283:1530–1534.
31. Pickeral OK, Makałowski W, Boguski MS, Boeke JD (2000) Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res 10:411–415.
32. Morgante M, et al. (2005) Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat Genet 37:997–1002.
33. Yang L, Bennetzen JL (2009) Distribution, diversity, evolution, and survival of Helitrons in the maize genome. Proc Natl Acad Sci USA 106:19922–19927.
34. Du C, Fefelova N, Caronna J, He L, Dooner HK (2009) The polychromatic Helitron landscape of the maize genome. Proc Natl Acad Sci USA 106:19916–19921.
35. Bennetzen JL, Springer PS (1994) The generation of Mutator transposable element subfamilies in maize. Theor Appl Genet 87:657–667.
36. Engels WR, Johnson-Schlitz DM, Eggleston WB, Sved J (1990) High-frequency P element loss in Drosophila is homolog dependent. Cell 62:515–525.
37. Yamashita S, Takano-Shimizu T, Kitamura K, Mikami T, Kishima Y (1999) Resistance to gap repair of the transposon Tam3 in Antirrhinum majus: A role of the end regions. Genetics 153:1899–1908.
38. Wang HC, Hickey DA (2007) Rapid divergence of codon usage patterns within the rice genome. BMC Evol Biol 7 (Suppl 1):S6.
39. Duret L, Galtier N (2009) Biased gene conversion and the evolution of mammalian genomic landscapes. Annu Rev Genomics Hum Genet 10:285–311.
40. Fullerton SM, Bernardo Carvalho A, Clark AG (2001) Local rates of recombination are positively correlated with GC content in the human genome. Mol Biol Evol 18:1139–1142.
41. Birdsell JA (2002) Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Mol Biol Evol 19:1181–1197.
42. Kudla G, Lipinski L, Caffin F, Helwak A, Zylicz M (2006) High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biol 4:e180.
43. Kalisz S, Purugganan MD (2004) Epialleles via DNA methylation: Consequences for plant evolution. Trends Ecol Evol 19:309–314.
44. Chan SW, et al. (2006) RNAi, DRD1, and histone methylation actively target developmentally important non-CG DNA methylation in Arabidopsis. PLoS Genet 2:e83.
45. Law JA, Jacobsen SE (2010) Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nat Rev Genet 11:204–220.
46. Vinogradov AE (2003) DNA helix: The importance of being GC-rich. Nucleic Acids Res 31:1838–1844.
