Research

LTR-mediated retroposition as a mechanism of RNA-based duplication in metazoans

Shengjun Tan,1 Margarida Cardoso-Moreira,2,6 Wenwen Shi,1 Dan Zhang,1,3 Jiawei Huang,1,3 Yanan Mao,1 Hangxing Jia,1,3 Yaqiong Zhang,1 Chunyan Chen,1,3 Yi Shao,1,3 Liang Leng,1 Zhonghua Liu,4 Xun Huang,4 Manyuan Long,5 and Yong E. Zhang1,3

1Key Laboratory of Zoological Systematics and Evolution and State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
2Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
3University of Chinese Academy of Sciences, Beijing 100049, China
4State Key Laboratory of Molecular Developmental Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China
5Department of Ecology and Evolution, The University of Chicago, Chicago, Illinois 60637, USA

In a broad range of taxa, genes can duplicate through an RNA intermediate in a process mediated by retrotransposons (retroposition). In mammals, L1 retrotransposons drive retroposition, but the elements responsible for retroposition in other animals have yet to be identified. Here, we examined young retrocopies from various animals that still retain the sequence features indicative of the underlying retroposition mechanism. In Drosophila melanogaster, we identified and de novo assembled 15 polymorphic retrocopies and found that all retroposed loci are chimeras of internal retrocopies flanked by discontinuous LTR retrotransposons. At the fusion points between the mRNAs and the LTR retrotransposons, we identified shared short similar sequences that suggest the involvement of microsimilarity-dependent template switches. By expanding our approach to mosquito, zebrafish, chicken, and mammals, we identified in all these species recently originated retrocopies with a similar chimeric structure and shared microsimilarities at the fusion points. We also identified several retrocopies that combine the sequences of two or more parental genes, demonstrating LTR-retroposition as a novel mechanism of exon shuffling. Finally, we found that LTR-mediated retrocopies are immediately cotranscribed with their flanking LTR retrotransposons. Transcriptional profiling coupled with sequence analyses revealed that the sense-strand transcription of the retrocopies often leads to the origination of in-frame proteins relative to the parental genes. Overall, our data show that LTR-mediated retroposition is highly conserved across a wide range of animal taxa; combined with previous work from plants and yeast, it represents an ancient and ongoing mechanism continuously shaping gene content evolution in eukaryotes.

[Supplemental material is available for this article.]

Retrotransposons are pervasive in eukaryotic genomes and consist of two major classes: long terminal repeat (LTR) retrotransposons and non-LTR retrotransposons (mainly long interspersed nuclear elements or LINEs). Each class can be further divided into autonomous and nonautonomous retrotransposons. Autonomous retrotransposons encode the proteins required for their own mobilization, whereas nonautonomous retrotransposons are trans-mobilized by the autonomous elements. For example, in mouse, nonautonomous MaLR elements can be copied by the autonomous MERVL elements since they share similar LTRs (McCarthy and McDonald 2004). A key contribution of retrotransposons to evolution is their capacity to mediate the formation of new genes in a process termed retroposition (Soares et al. 1985; Brosius 1991; Eickbush 1999). Herein, the retrotransposon machinery reverse transcribes messenger RNAs (mRNAs) into complementary DNAs (cDNAs) and inserts them back into the genome as retrocopies (Kaessmann et al. 2009). Unlike DNA-based duplicates, functional retrocopies or retrogenes inherently lack the ancestral regulatory sequences, having to recruit new ones from the vicinity and/or by evolving them de novo (Kaessmann et al. 2009; Carelli et al. 2016). The contribution of retrogenes to phenotypic evolution is well recognized, ranging from the avoidance of male–male courtship in fruit fly to the short-legged phenotype in domestic dogs (Parker et al. 2009; Chen et al. 2013).

Currently, our understanding of the mutational mechanisms underlying retroposition is largely limited to mammalian systems in which it has been conclusively shown that retrocopies can be generated by the reverse transcriptase (RT) of LINE1 or L1 retrotransposons (Eickbush 1999; Moran et al. 1999; Esnault et al. 2000; Wei et al. 2001). L1-mediated retrocopies harbor several hallmark sequences in their flanking regions: a TTTT/AA endonuclease cleavage site, a poly(A) tail priming the reverse transcription, target site duplications (TSDs) generated when fixing the insertion nick overhang, and a bias favoring truncation of the 5′ rather than the 3′ end (Kaessmann et al. 2009; Richardson et al. 2014).

6Present address: Zentrum für Molekulare Biologie der Universität Heidelberg (ZMBH), 69120 Heidelberg, Germany
Corresponding author: [email protected]
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.204925.116. Freely available online through the Genome Research Open Access option.
© 2016 Tan et al. This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

Genome Research 26:1663–1675. Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/16; www.genome.org

In principle, LTR retrotransposons could also mediate the formation of retrocopies. Specifically, LTR retrotransposons have a close evolutionary relationship with retroviruses (Wicker et al. 2007), which replicate by undergoing two rounds of template switches during their viral DNA synthesis and use a low affinity RT enzyme (Delviks-Frankenberry et al. 2011). Retroviruses can capture host mRNAs (Hajjar and Linial 1993), and their RT can use the mRNA as a template through switching induced by microsimilarity (between short similar sequences shared by the mRNA and the retrovirus, also called microhomology) (Goodrich and Duesberg 1990; Luo and Taylor 1990). Sometimes sequence similarity–independent mechanisms (e.g., 3′ transduction) (Swain and Coffin 1992) may also work. Thus, it is conceivable that mRNAs could be captured in the virus-like particles (VLPs) where LTR retrotransposons replicate (Yoshioka et al. 1990; Havecker et al. 2004), and the latter could mediate the retroposition of these mRNAs in a manner analogous to that of retroviruses. The evidence for this process is, however, very limited. In yeast, reporter assays demonstrated that the LTR retrotransposon Ty can capture partial mRNA sequences and form a chimera consisting of internal retrocopies flanked by the retrotransposon, and this is often accompanied by microsimilarity at the fusion points (Derr et al. 1991; Schacherer et al. 2004). In maize, a chimera consisting of partial sequences derived from three unlinked cellular genes flanked by the LTR retrotransposon Bs1 has been described (Elrouby and Bureau 2001, 2010); in rice, 27 retrocopies were found to be located within LTR retrotransposons (Wang et al. 2006). Finally, a single case has been described in human in which the LTR retrotransposon HERV-K captured the gene FAM8A1 (with microsimilarity present at the fusion points), resulting in a currently untranscribed processed pseudogene (Jamain et al. 2001).

Here, we performed a genome-wide study across a broad range of taxa (fruit fly, mosquito, zebrafish, chicken, mouse, and human) to determine whether LTR retrotransposons can generally drive retroposition in animals and to identify the features associated with LTR-mediated retrocopies.

Results

De novo assembly reveals a conservative but highly reliable data set of 15 polymorphic retrocopies in the Drosophila melanogaster Genetic Reference Panel (DGRP)

We started our analyses by focusing on polymorphic retrocopies, which are encoded by at least one inbred line but are absent from the reference genome. Specifically, we analyzed the 40 core DGRP lines whose average sequencing depth reaches 52× (Supplemental Table 1). Based on previous work (Schrider et al. 2011, 2013; Ewing et al. 2013), we developed a refined pipeline to identify polymorphic retrocopies that combines both exon–exon and exon–intron junction read mapping and that includes targeted local de novo assembly (Methods; Supplemental Fig. 1). We identified a total of 31 candidate retrocopies supported by more than one read. Of these, we assembled 15 retrocopies and their flanking regions (Table 1; Supplemental Table 2; Supplemental Data Set 1). The remaining 16 candidate retrocopies could not be assembled because of the low sequencing depth of the retroposed loci (for 12 candidate retrocopies) (Supplemental Table 3) or because of mapping problems (for four retrocopies). We chose to focus on the set of 15 retrocopies because knowing their flanking regions is critical to infer the underlying mutational mechanism. For 10 of the 15 retrocopies for which we could design probes, we confirmed their existence by PCR experiments (Methods; Supplemental Fig. 2; Supplemental Table 4). We further randomly chose three retrocopies for Sanger sequencing and confirmed the assembled sequences (Supplemental Data Set 2). Based on the presence/absence of exon–exon junction reads across the 40 lines, we estimate the lower-bound population frequencies of the 15 retrocopies to range from 1 to 31 (Supplemental Table 3). Notably, for both CG4799_CG11924_r and CG5119_CR42443_r, the sequences of two parental genes were retroposed together, generating new chimeric gene structures, each containing the partial sequences of two unlinked genes. In total, 17 parental genes were involved in the formation of 15 polymorphic retrocopies (Table 1).

Table 1. The 15 D. melanogaster polymorphic retrocopies

| Parental gene | Parent/retrocopy transcript length (bp) | LTR retrotransposon annotation | Microsimilarity (bp) |
|---|---|---|---|
| CG17604 | 2651/975 | BLASTOPIA, Gypsy | 16, 2 |
| CG10212 | 3687/1323 | BLASTOPIA, Gypsy | 6, 0 |
| CG32082 | 5246/704 | BLASTOPIA, Gypsy | 13, 1 |
| CG4799, CG11924 | 2571/243, 3530/917 | BATUMI, Pao | 6, 9, 12 |
| CR42443, CG5119 | 1023/441,ᵃ 3127/1294 | STALKER2, Gypsy | −2, 16 |
| CG33957 | 3879/745 | BLASTOPIA, Gypsy | 9, 1 |
| CG14796 | 5388/1073 | MAX, Pao | 10, 10 |
| CG31605 | 2792/1564ᵃ | TABOR, Gypsy | 9 |
| CG8567 | 2425/1683 | 412, Gypsy | −2, 3 |
| CG2662 | 1647/377 | 412, Gypsy | −26, 15 |
| CG5020 | 5625/1469 | BLASTOPIA, Gypsy | 7, 2 |
| CG3631 | 2757/1695 | Gypsy5, Gypsy | 4, 7 |
| CG1242 | 2713/915 | MAX, Pao | 16, 2 |
| CG10652 | 513/205 | 412, Gypsy | 16, 1 |
| CG8392 | 846/394ᵃ | 412, Gypsy | 12 |

Transcript length is the length of the parental gene’s mRNA. When a parental gene encodes multiple alternative isoforms compatible with the retroposed sequences, one isoform is randomly chosen (see Supplemental Data Set 1). The LTR retrotransposon is described as the element name followed by its family name, e.g., BLASTOPIA, Gypsy. Microsimilarity refers to the stretch of similar sequence present at the switch points shared by the retrocopies and the flanking LTR retrotransposons. The numbers in this column follow the order of retroposition, i.e., from retrotransposons’ 3′ terminal to their 5′ terminal. The three negative values refer to the length of de novo sequences inserted at the switch point. For the retrocopy derived from CG10212, there is no microsimilarity or de novo sequence at its 5′ end; therefore, it is referred to as “0.”
ᵃRetrocopies for which we failed to assemble one flanking region. The length of the current contig is shown.

LTR retrotransposons mediate retroposition in Drosophila populations

The 15 retrocopies tend to be partial with a median length of 26% of their parental transcripts’ size, a consequence of both 5′ and 3′ truncations with the latter being more extensive (median truncation 419 bp vs 1387 bp, Wilcoxon rank test P = 0.04) (Supplemental Table 2). In addition to the lack of the 5′ truncation bias, all other hallmarks suggestive of L1-mediated retroposition (i.e., presence of a poly[A] tail, TSDs, and a recognition site) are absent from the 15 retrocopies. When we aligned the retrocopies’ flanking sequences against the reference genome, we found that

all flanking sequences consisted of LTR retrotransposons (Table 1). Most notably, this association was not created by the insertion of the retrocopies into preexisting LTR retrotransposons but instead by recombination events mediated by microsimilarity between the retrocopies and the LTR retrotransposons. Specifically, no retrotransposons flanking the retrocopies showed the signature of a previously continuous element that was interrupted by the insertion of a foreign sequence. Instead, in all cases, a portion of the internal retrotransposon open reading frames (ORFs) was replaced by the retrocopies (Supplemental Data Set 1). Moreover, for 25 of the 29 (86%) junctions identified, we found stretches of microsimilarity shared between the LTR retrotransposons and the mRNAs (median length of 6 bp) (Table 1; Supplemental Data Set 1). For three junctions, de novo sequences (2–26 bp) were found inserted between the two elements. Simulations show that the extent of microsimilarity cannot be explained by the random insertion of retrocopies within LTR retrotransposons (Methods; Supplemental Fig. 3).

Thus, retroposition in Drosophila seems to occur via a microsimilarity-dependent template switch mechanism of LTR retrotransposons, which is analogous to the one reported for retroviruses and yeast (Goodrich and Duesberg 1990; Luo and Taylor 1990; Derr et al. 1991; Hajjar and Linial 1993; Schacherer et al. 2004). One implication of this mechanism is that both 5′ and 3′ LTRs should coexist in the retroposed product because the integrase has to recognize 15- to 20-bp sequences located at each terminal (Hindmarsh and Leis 1999; Engelman et al. 2009). Consistently, we found the 5′ LTR in the assembled sequences of nine (60%) retrocopies (the data are unclear for the other retrocopies because the assembled sequences are too short). In addition, for five retrocopies for which we obtained sequences at the two terminals, we confirmed the presence of both 5′ and 3′ LTRs (Methods; Supplemental Data Sets 3–5). Although various families of LTR retrotransposons are associated with the retrocopies (Table 1), for the five retrocopies associated with the Gypsy BLASTOPIA, we found a unique pattern: The RT always switches back from the mRNA to the retrotransposon at one particular site, TAAAAAACAAGC-GGTTG, which is always near an exon–exon junction (Fig. 1A; Supplemental Data Set 1). Our data also show that more than two switches can occur. For example, a segment from the 5′ untranslated region (UTR) of CG4799 and a segment from the coding sequence (CDS) of CG11924 were co-retroposed by the Pao BATUMI element to form a new chimeric retrocopy, in which all three switch points are associated with stretches of microsimilarity (Fig. 1B). Thus, two or more genes can be involved in this process to form a chimeric gene, which could be viewed as a new mechanism of exon shuffling (Gilbert 1978; Long et al. 1995; Li 1997).

In Drosophila, 44% of recently inserted LTR retrotransposons are located within or nearby genes, which is an overrepresentation compared to a random insertion model (Ganko et al. 2006). One would expect that retrocopies created by LTR retrotransposons should show a similar pattern. We could not test this hypothesis with the assembled retrocopies, because they do not include the sequences outside the flanking LTR retrotransposons. So, we generated whole-genome sequencing data with longer read sizes and mobile element targeted sequencing data via the ME-Scan methodology (Methods; Supplemental Table 5; Witherspoon et al. 2010). With these additional data, we identified 18 insertion sites for nine retrocopies (Table 2; Supplemental Table 6). Twelve insertion sites were further validated by PCR followed by Sanger sequencing. In agreement with the biased insertion pattern of LTR retrotransposons, most (8/12) retrocopies were inserted within genes.

Intriguingly, two retrocopies (CG17604_r and CG10212_r) are associated with more than one insertion site (Table 2),

[Figure 1 image: junction sequences for four retroposed loci. Panel labels: (A) CDS of CG17604; 5′ BLASTOPIA; 3′ BLASTOPIA. (B) 5′ UTR of CG4799; CDS of CG11924; 3′ BATUMI; 5′ BATUMI. (C) CDS of AGAP000264; 5′ BEL18-I_AG; 3′ BEL18-I_AG. (D) CDS of ENSMUSG00000021752; 5′ IAPEy; 3′ IAPEy.]

Figure 1. Schematic representations of a subset of the retrocopies identified. (A) CG17604_r in D. melanogaster. (B) CG4799_CG11924_r in D. melanogaster. (C) ENSANGG00000011308 in mosquito. (D) ENSMUSG00000083549 in mouse. LTR retrotransposons and retrocopies are marked in gray and blue, respectively. Microsimilarity stretches are marked in yellow, and de novo sequence insertions in green. The red line in A marks the “TAAAAAACAAGC-GGTTG” motif. (CDS) coding sequences; (UTR) untranslated regions.
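The signed values used for switch points in this article (positive for a stretch of microsimilarity, negative for de novo sequence, 0 for neither) can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that one junction of the retroposed locus has already been aligned column by column, without gaps, to its two donors (the sequence the locus follows on the left and the sequence it follows on the right); the function name `junction_signature` and the toy sequences are hypothetical.

```python
def junction_signature(chimera: str, left_donor: str, right_donor: str) -> int:
    """Signed junction length for one switch point.

    All three strings are assumed to be gapless, column-wise alignments of the
    same window around the junction (an assumption made for this sketch).
    Returns > 0 for microsimilarity (columns matching both donors),
    < 0 for de novo sequence (columns matching neither donor), 0 for a blunt switch.
    """
    if not (len(chimera) == len(left_donor) == len(right_donor)):
        raise ValueError("all three aligned sequences must have the same length")
    n = len(chimera)

    # How far the chimera follows the left donor, reading left to right.
    left_end = 0
    while left_end < n and chimera[left_end] == left_donor[left_end]:
        left_end += 1

    # Where the trailing run that follows the right donor begins, reading right to left.
    right_start = n
    while right_start > 0 and chimera[right_start - 1] == right_donor[right_start - 1]:
        right_start -= 1

    # Overlap of the two runs is microsimilarity; a gap between them is de novo sequence.
    return left_end - right_start


# Toy examples (hypothetical sequences, not taken from the article).
shared = junction_signature("AAACGTTTT", "AAACGTGGG", "CCCCGTTTT")   # "CGT" is shared
de_novo = junction_signature("AAATATTT", "AAAGGGGG", "CCCCCTTT")     # "TA" matches neither
blunt = junction_signature("AAATTT", "AAAGGG", "CCCTTT")
print(shared, de_novo, blunt)
```

Chance matches next to a real junction would inflate the positive values, which is presumably why the article compares the observed microsimilarity against simulations of random insertion.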


although they appear to be derived from a single template switch event because they share the same gene structure. For CG17604_r, the evidence for a single template switch event is even stronger, because the different retrocopies share eight nucleotide differences that distinguish them from the parental gene (Supplemental Data Set 6). Considering that >80% of DNA-based segmental duplication (SD) occurs in tandem in D. melanogaster (Zhou et al. 2008), the presence of multiple identical retrocopies located dispersedly in the genome is unlikely to be a consequence of independent SDs. One possible scenario is that the whole locus acts as a nonautonomous LTR retrotransposon that can be further retroduplicated by the enzymes of autonomous LTR retrotransposons. This scenario is not unlikely considering that nonautonomous LTR retrotransposons are much more likely to proliferate than their corresponding autonomous elements, presumably because the former are less likely to be recognized as retroviruses and thus repressed by the host machinery (Maksakova et al. 2006).

Table 2. Insertion sites of nine retrocopies

| Retrocopy | Position | Method |
|---|---|---|
| CG17604_r | SR:TATAA | 2×250 bp |
| CG17604_r | Satelliteᵃ | 2×250 bp |
| CG17604_r | 2L:11489564ᵃ (intergenic) | 2×250 bp and ME-Scan |
| CG17604_r | 2R:21009193ᵃ (intronic of CG16778) | ME-Scan |
| CG17604_r | 3R:3618946ᵃ (3′ UTR of CG10086) | ME-Scan |
| CG17604_r | 3R:9536417ᵃ (intronic of CG33748) | ME-Scan |
| CG10212_r | LINE:DOC_DM | 2×250 bp |
| CG10212_r | 2L:13226230ᵃ (intergenic) | ME-Scan |
| CG10212_r | 2R:3607631ᵃ (intronic of CG18812) | 2×250 bp |
| CG10212_r | 3R:21723045ᵃ (intergenic) | 2×250 bp |
| CG32082_r | 3L:230088887ᵃ (intronic of CG8779) | ME-Scan |
| CG4799_CG11924_r | 3L:21967546ᵃ (intronic of CG7470) | 2×250 bp and ME-Scan |
| CG33957_r | 2L:17134768ᵃ (intronic of CG34106) | ME-Scan |
| CG14796_r | X:8740722-8741237ᵇ (intronic of CG12075) | ME-Scan |
| CG31605_r | 3L:17668235ᵇ (intronic of CG34251) | ME-Scan |
| CG5020_r | 3R:13789450ᵃ (intronic of CG42457) | mate pair and ME-Scan |
| CG1242_r | 3L:2624322-2624753ᵇ (intronic of CG12002) | ME-Scan |
| CG1242_r | 3R:15303121-15303991ᵇ (exonic/intronic of CG34157) | ME-Scan |

All insertion sites except CG17604_r in a simple repeat (SR) and CG10212_r in a LINE were tested by PCR and Sanger sequencing.
ᵃValidated sites.
ᵇFalse positives.

In summary, we have described three lines of evidence that identify LTR retrotransposons as capable of mediating retroposition in Drosophila: (1) All polymorphic retrocopies are flanked by LTR retrotransposons, and the switch points between the LTR retrotransposons and the retrocopies share stretches of microsimilarity, suggesting the occurrence of microsimilarity-mediated recombination; (2) the insertion sites are mostly genic, the preferential insertion site of LTR retrotransposons; and (3) these new loci show signs of acting as nonautonomous LTR retrotransposons, being subjected to further retroposition cycles.

LTR retrotransposons mediate retroposition across animals

Since Drosophila polymorphic retrocopies show the signatures of LTR-mediated retroposition, we also expect to find these features in recently originated retrocopies encoded by the D. melanogaster reference genome. By modifying the strategy in Bai et al. (2007) (Methods), we identified five retrocopies present in the reference genome but absent from the closely related D. simulans and D. sechellia (Table 3; Supplemental Data Set 7). The youngest,

Table 3. The five newly originated Drosophila retrocopies

| Parent/Retrocopyᵃ | Retrocopy location | Parent/retrocopy transcript length (bp) | Divergenceᵇ | Microsimilarity (bp) | Flanking transposon, transposon family, divergenceᶜ | Poly(A) | TSDs |
|---|---|---|---|---|---|---|---|
| CG33932/CG33932_r | Uextra:5521091–5521902ᵉ | 1244/812 | 0.10% | −11ᵈ | MAX, Pao, 3.70% | – | – |
| CG16728/CG16728_r | X:17689090–17689601 | 2489/512 | 0.90% | 7, 2 | BLASTOPIA, Gypsy, 0.20% | – | – |
| CG2033/CG12324 | 2R:6709256–6709889 | 678/634 | 2.30% | – | No associated repeat elements | AAAA | – |
| CG1742/CR12628 | 2L:22226151–22226773 | 667/623 | 3.50% | −14, 9 | G3_DM, Jockey, LINE, 10%ᶠ | AAAAAAA | AATCA/AATCA |
| CG3265/CG40354ᵍ | 3LHet:687408–688819 | 1873/1412 | 4.00% | 2ʰ | TV1, Gypsy, 27.50% | – | ATTCTTCTTG/ATCTTCTTG |

ᵃBecause the function of these genes (except CG12324 or Rps15ab) is not characterized, we use the term retrocopy rather than retrogene.
ᵇDivergence reflects the nucleotide differences between parental genes and retrocopies.
ᶜThe average divergence relative to the consensus sequence generated by RepeatMasker.
ᵈA sequencing gap flanks one end of the retrocopy, preventing the identification of microsimilarity stretches.
ᵉThe sequence of this retrocopy is absent from the new D. melanogaster Release 6 (Dm6). However, we could find exon–exon spanning reads in more than one DGRP line, suggesting that it is genuine.
ᶠThe presence of a continuous Jockey element in the closely related species, D. simulans, suggests that the retrocopy was inserted into this preexisting element (Supplemental Fig. 11).
ᵍCG40354 is annotated as a pseudogene in Dm6.
ʰTV1 is immediately adjacent to this retrocopy at the 3′ flanking region, whereas the QUASIMODO Gypsy element is located in the 5′ flanking region 111 bp away (Supplemental Fig. 4B). So, only microsimilarity relative to TV1 is examined.
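The divergence column of Table 3 is the percentage of nucleotide differences between a parental gene and its retrocopy. As a hedged illustration only, the sketch below computes such a percentage from a pairwise alignment. It assumes the alignment already exists and that columns containing a gap are skipped; the source does not say how gaps or ambiguous bases were handled, and `percent_divergence` is a hypothetical helper, not the authors' code.

```python
def percent_divergence(parent_aln: str, retro_aln: str, gap: str = "-") -> float:
    """Percentage of mismatching columns between two aligned sequences.

    Columns where either sequence has a gap are ignored (an assumption of this
    sketch); comparison is case-insensitive.
    """
    if len(parent_aln) != len(retro_aln):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for p, r in zip(parent_aln.upper(), retro_aln.upper()):
        if p == gap or r == gap:
            continue  # skip indel columns
        compared += 1
        if p != r:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable (ungapped) columns")
    return 100.0 * mismatches / compared


# Toy alignment (hypothetical): one substitution in ten compared sites.
print(percent_divergence("ACGTACGTAC", "ACGTACGTAA"))
```

Read this way, a low value marks a recent retroposition event, which is how the table orders the five retrocopies from youngest to oldest.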


CG33932_r and CG16728_r, are flanked by LTR retrotransposons, and CG16728_r additionally has microsimilarities at both junctions. Consistently, no hallmarks of L1-mediated retroposition exist in the flanking regions of these two retrocopies (Supplemental Table 7). In contrast, the relatively older retrocopies, CG12324 and CR12628, are not flanked by LTR retrotransposons and are instead associated with candidate poly(A) tails and/or TSDs (Toba and Aigaki 2000), suggesting a non-LTR-mediated retroposition. The final retrocopy (CG40354) is ambiguous because it harbors a 2-bp microsimilarity and a 3′ truncation (121 bp) but also candidate TSDs (Supplemental Fig. 4A; Supplemental Data Set 7). Since this locus is situated in a repeat-rich region that does not have a long synteny with closely related species (Supplemental Fig. 4B), we could not infer if it was formed through an LTR-mediated or non-LTR-mediated retroposition followed by secondary mutations.

However, no LTR retrotransposon remnants were found flanking 96 old retrogenes that predate the species split of D. melanogaster and D. simulans (Bai et al. 2007). Similarly, only three retrogenes (CG1924, CG13732, and CG5650) are associated with either candidate poly(A) tails or TSDs (Supplemental Table 7). The absence of the sequence features diagnostic of the type of retroposition event is expected for old retrogenes. It is a consequence of the rapid removal of flanking hallmark sequences by the strong deletion bias of D. melanogaster (Petrov and Hartl 1998) and by the accumulation of substitutions after the retroposition event.

Because LTR retrotransposons are widely shared across species (Havecker et al. 2004; Eickbush and Jamburuthugoda 2008), we searched additional animal genomes for signs of LTR-mediated retroposition. Specifically, we applied the pipeline used to identify recently originated retrocopies in Drosophila to five other species (mosquito, zebrafish, chicken, mouse, and human) representing a broad taxonomic breadth (Methods). We found retrocopies potentially generated by LTR retrotransposons in all species. In mosquito, zebrafish, and chicken we identified 3, 7, and 10 retrocopies, respectively, still retaining sequence features indicative of the underlying retroposition mechanism (Supplemental Table 8). We found that LTR retrotransposons are responsible for ∼50% of retrocopies (1/3 in mosquito, 5/7 in zebrafish, and 4/10 in chicken). One example is the mosquito retrocopy ENSANGG00000011308, which is flanked by the LTR retrotransposon BEL18_AG. It is associated with 5- and 7-bp microsimilarity stretches at the switch points (Fig. 1C), an extensive 3′ truncation (∼1 kb), and no signs of a poly(A) tail or TSDs (Supplemental Table 8).

In mammals, L1 retrotransposons are known to be a major driver of retroposition (Esnault et al. 2000; Kaessmann et al. 2009; Abyzov et al. 2013) and have generated thousands of retrocopies (Carelli et al. 2016). We implemented multiple filters (Methods) to identify the possibly small number of retrocopies created by LTR retrotransposons. We identified three retrocopies in human (including the previously described FAM8A1_r) (Jamain et al. 2001) and eight retrocopies in mouse that are flanked by LTR retrotransposon remnants and are not associated with any L1 hallmark sequences (Supplemental Table 8). The sequence structures of these retrocopies match those observed in Drosophila, mosquito, zebrafish, and chicken. For example, in mouse, a partial transcript (549/1646 bp) of ENSMUSG00000021752 was retroposed by the LTR retrotransposon IAPEy, mediated by 10 bp of microsimilarity and 2 bp of de novo sequence (Fig. 1D). In an exceptional example, we identified a “Super” retrocopy in which mRNAs derived from four distinct genes and one were co-retroposed into a single locus (Supplemental Table 8; Supplemental Fig. 5). Overall, despite the older age of the mouse retrocopies, we were still able to identify stretches of microsimilarity for nine of the 22 switch points. Unlike in Drosophila, where most retrocopies are located within genes (Table 2), in mouse, seven of the eight retrocopies are intergenic (Fisher’s exact test P = 0.03) (Supplemental Table 8), which is consistent with the well-known depletion of LTR retrotransposons within or in the vicinity of genes in mammals (Jern and Coffin 2008).

LTR retrotransposons drive the initial transcription of retrocopies in both fly and mouse

The prevalence of polymorphic retrocopies with more than one insertion site in Drosophila suggests that retrocopies mediated by LTR retrotransposons can be retroposed again as nonautonomous retrotransposons (Table 2). This also implies that the internal retrocopies and flanking retrotransposons are cotranscribed as chimeric transcripts by the regulatory sequences of the LTR retrotransposons. Hence, we tested whether the retroposed loci are transcribed as a single unit and whether the retrocopies share the transcriptional profile of the corresponding LTR retrotransposons. First, we performed RT-PCR on the 10 retrocopies for which we could design primers spanning the switch points (Supplemental Table 9). We detected the expression of all 10 retrocopies plus the flanking LTRs in both male and female adult bodies (Fig. 2A). In addition, both long-range RT-PCR for retrocopy CG5020_r and rapid amplification of cDNA ends (RACE) for CG17604_r confirmed the transcription of the whole chimeric unit from the 5′ to the 3′ LTRs (Methods; Supplemental Data Sets 3, 4). Second, we profiled the expression of the 10 retrocopies and corresponding retrotransposons in testis, ovary, and head (Methods). We targeted these three organs because in Drosophila, a large fraction of retrogenes are testis-specific or testis-dominant (Supplemental Fig. 6; Betrán et al. 2002; Vibranovski et al. 2009), but LTR retrotransposons tend instead to be abundantly transcribed in the head (Perrat et al. 2013). The ovary was included for comparison. We found that nine of the 10 retrocopies show a strong positive correlation between the expression of the retrocopy and the corresponding LTR retrotransposon, with eight retrocopies showing their highest expression in the head, more than expected by chance (binomial test P = 0.02) (Fig. 2B). RT-PCR further confirmed the expression in the head of four retrocopies encoded by line 208 (Fig. 2C). Interestingly, the expression intensity of LTR retrotransposons is generally higher than that of the corresponding retrocopies except for STALKER2 (CG5119_CR42443_r) and Gypsy5 (CG3631_r) (Fig. 2B), which could be explained by the fact that these two retrotransposons have a lower copy number in the genome relative to the others (1–3 versus 4–27, Wilcoxon rank test one-tailed P = 0.03) (Supplemental Table 10). This also suggests that retrocopies as nonautonomous retrotransposons could be less repressed than autonomous elements (Maksakova et al. 2006).

The transcriptional characteristics of the mouse LTR-mediated retrocopies are similar to those of Drosophila with respect to both the transcription structure and expression profile. Specifically, we found Mouse ENCODE Consortium RNA-sequencing (RNA-seq) reads (Yue et al. 2014) spanning the switch points between the LTR retrotransposons and the retrocopies for four of the eight LTR-mediated retrocopies, suggesting that they are transcribed as a single transcript (Methods; Supplemental Table 11;


Figure 2. The expression of retrocopies in fruit fly and mouse. (A) Whole-body RT-PCR results for 10 retrocopies validated in Supplemental Figure 2, which are expressed in both male (M) and female (F). The primer sequences and expected product sizes are listed in Supplemental Table 9. For the chimeric genes, CG5119_CR42443_r and CG4799_CG11924_r, we also designed primers flanking the fusion point between the two parental genes, the product of which is marked with “C.” (B) Tissue profiling via qRT-PCR. CG2662_r has no detectable expression in ovary. The Pearson correlation (r) between the retrocopies and corresponding retrotransposons across the three tissues is displayed above. (C) Tissue level RT-PCR results for four of 10 retrocopies encoded by line 208 across Testis (T), Ovary (O), and Head (H). (D) Expression profile of retrocopies encoded by the mouse genome. Tissues are hierarchically clustered on the basis of expression similarity across genes. Retrocopies are not clustered but instead are sorted by age as approximated by nucleotide divergence. To reveal relative moderate expression, the color-code scheme is truncated at the FPKM cutoff of 1 (i.e., all expression higher than 1 is shown in dark blue).
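Two count-based tests are quoted for the mouse retrocopies in this section: a Fisher's exact test on genic versus intergenic insertion (P = 0.03) and a binomial test on peak expression at the two-cell or eight-cell stage among 17 tissues/developmental stages (P = 0.009; panel D). The sketch below shows how such tests can be computed with the Python standard library. It is not the authors' code. The 2×2 table (8 genic and 4 intergenic validated fly sites against 1 genic and 7 intergenic mouse retrocopies), the two-sided alternative, and the null probability of 2/17 are assumptions inferred from the counts given in the text, since the source does not state them.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """One-sided binomial test: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same margins)
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-9  # guard against floating-point ties
    return sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_observed * tolerance
    )


# Assumed null: each of the 17 tissues/stages is equally likely to be the peak,
# so "two-cell or eight-cell" has probability 2/17; observed 4 of 8 retrocopies.
p_binom = binomial_upper_tail(4, 8, 2 / 17)

# Assumed table: fly 8 genic / 4 intergenic, mouse 1 genic / 7 intergenic.
p_fisher = fisher_exact_two_sided(8, 4, 1, 7)

print(f"binomial P = {p_binom:.4f}")
print(f"Fisher P = {p_fisher:.4f}")
```

Under these assumptions the sketch gives roughly 0.009 and 0.028, in line with the values reported in the text, which suggests (but does not prove) that the assumed counts and null model match those used in the article.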

Supplemental Fig. 7). In the case of the Super retrocopy, long-range RT-PCR experiments showed that the fragments derived from the four different parental genes are transcribed together, also suggesting the whole locus represents a single unit (Supplemental Fig. 8). Furthermore, unlike in the fly where LTR retrotransposons tend to be broadly transcribed (Supplemental Fig. 9), LTR retrotransposons in mouse are generally tightly controlled at the transcriptional level and tend to be tissue-specific with a strong bias toward early embryogenesis (Maksakova et al. 2006; Faulkner et al. 2009; Macfarlan et al. 2012). Consistently, four of the eight retrocopies show the highest expression at the two-cell or eight-cell stages across 17 tissues/developmental stages, which is not expected by chance (binomial test P = 0.009) (Fig. 2D; Supplemental Table 11). Actually, four could be an underestimation given that genes without high transcriptional levels (Fragments Per Kilobase of exon per Million fragments mapped, FPKM > 100) like LTR-mediated retrocopies (Supplemental Table 11) tend to be neglected in single-cell RNA-seq data (Marinov et al. 2014). Overall, these lines of evidence in mouse again support the hypothesis that the expression of LTR-mediated retrocopies is driven by the LTR retrotransposons’ regulatory sequences.

LTR-mediated retrocopies are more often transcribed from the same strand as the parental genes and capable of generating in-frame proteins

Since LTRs (at the two ends) are known to have the potential to act as bidirectional promoters (Feuchter and Mager 1990; Dunn et al. 2006; Faulkner et al. 2009), it is unclear whether the transcription of the retrocopies occurs from the sense strand relative to the parental genes. In Drosophila, a RACE experiment demonstrated that CG17604_r is transcribed from the sense strand (Supplemental Data Set 4). In order to identify the direction of transcription for more retrocopies, we generated strand-specific RNA-seq data for testis, ovary, and head for line 208 (Methods;


Supplemental Table 5; Supplemental Fig. 10). Given the high sequence identity between polymorphic retrocopies and their parental genes, we evaluated the strandedness using only the reads spanning the switch points. We did not find any spanning reads for CG17604_r and CG33957_r, but we detected reads supporting the transcription of CG14796_r and CG4799_CG11924_r across all three tissues (Fig. 3A). Specifically, we found that CG14796_r is transcribed from the antisense strand in head, but from the sense strand in ovary. In agreement with the relatively low expression of CG14796_r in testis (Fig. 2B), we only found weak expression in this tissue. CG4799_CG11924_r is bidirectionally transcribed in ovary and head, and additionally shows very high levels of sense transcription in testis (Fragment count Per Million fragments mapped, FPM = 14,111), a result also consistent with the qRT-PCR result (Fig. 2B). Thus, the RACE and RNA-seq experiments showed that three of four retrocopies are capable of transcription from the sense strand relative to the parental genes.

Analogously, by taking advantage of public strand-specific RNA-seq data from mosquito, chicken, zebrafish, and mouse (Methods), we found that recently evolved retrocopies in these species are also transcribed more often or more strongly from the sense strand (relative to their parental genes). Overall, 12 of 18 retrocopies are moderately transcribed (FPKM > 0.1) from the sense strand in at least one tissue, whereas only six reach the same intensity from the antisense strand (Fig. 3).

The transcription of the retrocopies from the same strand as the parental genes suggests the possibility that the retrocopies could lead to in-frame translation. In support of this idea, for 14 (82%) of the 17 Drosophila polymorphic or recently evolved retrocopies mediated by LTR-retrotransposons, the predicted longest ORF is in-frame relative to the parental proteins (Supplemental Table 12) possibly because they are too young (nucleotide divergence <1%) (Table 3) to have accumulated loss-of-function mutations. For relatively older retrocopies in other animals (nucleotide divergence ≥1%) (Supplemental Table 8), 14 of 21 have accumulated at least one loss-of-function mutation (Supplemental Data Set 8). We examined directly the level of constraint at the protein-level by calculating KA/KS values (the ratio between the nonsynonymous and the synonymous substitution rates) for the longest ORF predicted relative to the parental protein. We found that ENSDARG00000076317_r from zebrafish and ENSGALG00000011896_r from chicken have a KA/KS ratio of 0.145 and 0.239, respectively, which is significantly lower than the conservative cutoff of 0.5 and suggestive of constraint (P = 0.041 and 0.004) (Supplemental Table 13; Betrán et al. 2002). ENSANGG00000011308 from mosquito also reached a marginal significance (KA/KS = 0.261, P = 0.075). Therefore, these retrocopies, together with genes like Bs1 from maize (Elrouby and Bureau 2001, 2010), suggest that a fraction of LTR-mediated retrocopies can encode functional proteins.
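The KA/KS comparison against the cutoff of 0.5 is described in the Methods as a likelihood ratio test run through codeml. As a rough illustration of how a P-value follows from two model fits, the sketch below assumes the common setup of a null model with the ratio fixed at 0.5 and an alternative with the ratio estimated freely, compared with one degree of freedom. The function name and the log-likelihood values are hypothetical, not taken from the paper, and the exact codeml settings used by the authors are not given in this section.

```python
from math import erfc, sqrt

def lrt_pvalue(lnl_null: float, lnl_alt: float) -> float:
    """P-value of a likelihood ratio test with one degree of freedom.

    lnl_null: log-likelihood with KA/KS fixed (assumed here: fixed at 0.5)
    lnl_alt:  log-likelihood with KA/KS estimated from the data
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        return 1.0
    # Chi-square survival function for 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    return erfc(sqrt(stat / 2.0))

# Hypothetical log-likelihoods, for illustration only
print(round(lrt_pvalue(lnl_null=-1002.0, lnl_alt=-1000.0), 4))
```

A small P-value alone does not show constraint; the estimated ratio also has to fall below 0.5, as it does for the zebrafish and chicken retrocopies cited above.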

Figure 3. Strand-specific RNA-seq-based quantification. On the basis of newly generated and public strand-specific RNA-seq data, we quantified the expression of LTR-mediated retrocopies in fruit fly and mosquito (A), zebrafish (B), chicken (C), and mouse (D). For fruit fly, given the high similarity between polymorphic retrocopies and parental genes, only reads spanning the switch points were counted, and FPM was calculated. For all other cases, FPKM was calculated. For mosquito, embryo, the third instar larvae, the fourth instar larvae, and pupae were profiled. For zebrafish, 4-, 6- and 8-h post fertilization (pf) stages were profiled.
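The two units used in this figure differ only in whether the count is normalized by transcript length. A minimal sketch of the standard definitions follows; the function names and the example numbers are illustrative assumptions, and the effective-length adjustment the authors cite from Zhang et al. (2011) is represented only as an input parameter.

```python
def fpm(fragments: int, total_mapped: int) -> float:
    """Fragments per million mapped fragments (no length normalization)."""
    return fragments * 1e6 / total_mapped

def fpkm(fragments: int, total_mapped: int, effective_length_bp: float) -> float:
    """Fragments per kilobase of (effective) length per million mapped fragments."""
    return fragments * 1e9 / (total_mapped * effective_length_bp)

# Hypothetical library: 5 switch-point-spanning fragments out of 2 million mapped
print(fpm(5, 2_000_000))
# Hypothetical retrocopy: 9 fragments, 30 million mapped, 1500 bp effective length
print(fpkm(9, 30_000_000, 1500))
```

FPM suits the fruit fly case because only junction-spanning reads are counted, so there is no meaningful transcript length to divide by.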


Discussion

LTR-mediated retroposition is deeply conserved across eukaryotes

By looking at young retrocopies in a broad range of taxa, we showed that retrogenes can be created in both invertebrates and vertebrates through an LTR-mediated retroposition mechanism. Our data suggest a model in which the RT starts synthesizing cDNA using the template from the RNA of the LTR retrotransposon, but induced by a stretch of microsimilarity between the LTR retrotransposon and a cellular mRNA captured within the VLP, the enzyme moves to the latter and takes the cellular mRNA as the template (Fig. 4). A second stretch of microsimilarity between the cellular mRNA and the LTR retrotransposon leads to another switch, this time back to the LTR retrotransposon to end the reverse transcription process.

This mechanism creates retrocopies with four unique features. First, because the stretches of microsimilarity can occur anywhere, mRNAs are often only partially duplicated (Table 1). Second, because only small stretches of microsimilarity are required, LTR-mediated retroposition can give rise to chimeras involving multiple parental genes. It can therefore be considered as a novel mechanism of gene recombination, including domain- or exon-shuffling to capture the essence whereby the sequences of previously existing genes are combined as novel gene structures (Gilbert 1978; Moran et al. 1999; Graur and Li 2000). Third, a key feature of LTR-mediated retrocopies is their immediate transcription by the LTR’s promoters. Fourth, LTR-mediated retrocopies can act as nonautonomous LTR retrotransposons and be subject to further duplication (Fig. 4).

This mechanism is compatible with the results from the limited studies on LTR-mediated retroposition in plants and yeast. In plants, although the retrocopies described are generally old (Elrouby and Bureau 2001, 2010; Wang et al. 2006), they show the typical chimeric structure with internal retrocopies flanked by retrotransposons. Specifically, in maize, the gene Bs1 combines the partial exonic sequences of three parental genes and is flanked by an LTR retrotransposon (Johns et al. 1985; Elrouby and Bureau 2001, 2010). The copy number of Bs1 ranges from one to five across various species of the genus Zea, and all individual copies share the same switch points (Elrouby and Bureau 2010). Thus, Bs1 likely has undergone secondary retropositions as seen for CG17604_r or CG10212_r (Table 2). In contrast to these efforts in plants and our work on animals, the work in yeast studied retroposition by inducing or screening artificial retrocopies via cellular assays (Derr et al. 1991; Schacherer et al. 2004; Maxwell and Curcio 2007). Still, these experiments recapitulated the unique features associated with LTR-mediated retroposition such as the presence of microsimilarity and the transcriptional capability (Schacherer et al. 2004; Maxwell and Curcio 2007).

Thus, if we combine our current work in animals with previous work on individual genes in plants and artificially induced systems in yeast, we see a striking consistency of the features that characterize LTR-mediated retroposition: chimeric structures between partial duplicates and LTR retrotransposons, microsimilarity at the junction points, the capacity for transcription, and the existence of secondary retroposition events (Fig. 4). Template-switch dependent LTR-mediated retroposition is therefore highly conserved across multiple eukaryotic lineages.
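Microsimilarity at a junction can be made concrete as the stretch of bases around the switch point that is identical in both templates, so that the chimera could have been copied from either one across that stretch. The sketch below is one way to measure it, assuming the two template sequences and the switch coordinates are already known; the function name, the coordinate convention, and the toy sequences are assumptions for illustration and are not the authors' pipeline.

```python
def microsimilarity_length(donor: str, acceptor: str,
                           donor_pos: int, acceptor_pos: int) -> int:
    """Length of the identical stretch shared by both templates at a switch.

    The chimera is assumed to be donor[:donor_pos] + acceptor[acceptor_pos:].
    Matching bases are counted outward from the junction in both directions.
    """
    left = 0
    while (donor_pos - 1 - left >= 0 and acceptor_pos - 1 - left >= 0
           and donor[donor_pos - 1 - left] == acceptor[acceptor_pos - 1 - left]):
        left += 1
    right = 0
    while (donor_pos + right < len(donor) and acceptor_pos + right < len(acceptor)
           and donor[donor_pos + right] == acceptor[acceptor_pos + right]):
        right += 1
    return left + right

# Toy sequences (hypothetical): both templates contain the stretch GCATC
ltr = "TTAGGCATCGA"
mrna = "CCGCATCTT"
print(microsimilarity_length(ltr, mrna, donor_pos=6, acceptor_pos=4))
```

Because the count extends in both directions, the result does not depend on where inside the shared stretch the junction is placed, which matters since the exact switch position within a microsimilarity cannot be determined from the sequence alone.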

Figure 4. A schematic representation of the template switch model for LTR-mediated retroposition. (VLP) virus-like particle; (RT) reverse transcriptase; (PR) protease; (IN) integrase. The black boxes in LTR retrotransposons correspond to the LTRs and the light green box to the ORF. The blue boxes correspond to exons encoded by the host genome and the white boxes to introns. The double template switch occurs in the VLP, generating a chimeric cDNA (multiple switches can occur) (Derr et al. 1991; Schacherer et al. 2004; Maxwell and Curcio 2007). After integration, the chimeric sequences can be transcribed as pseudo-LTR retrotransposons, which could be subject to further cycles of retroposition (Elrouby and Bureau 2010).
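The double template switch in this schematic can be written as a simple sequence operation: an internal part of the LTR retrotransposon is replaced by a fragment of a cellular mRNA, leaving discontinuous LTR retrotransposon sequence on both sides. The sketch below is a deliberately simplified model, assuming both sequences are given in the same orientation and ignoring the direction of cDNA synthesis and the microsimilarity at the junctions; all names and sequences are hypothetical.

```python
def double_template_switch(ltr_rna: str, mrna: str,
                           ltr_out: int, mrna_in: int,
                           mrna_out: int, ltr_in: int) -> str:
    """Build a chimeric sequence from two template switches.

    ltr_rna[:ltr_out]       retrotransposon sequence before the first switch
    mrna[mrna_in:mrna_out]  captured fragment of the cellular mRNA
    ltr_rna[ltr_in:]        retrotransposon sequence after the switch back
    The region ltr_rna[ltr_out:ltr_in] is replaced and therefore lost.
    """
    if not 0 <= ltr_out <= ltr_in <= len(ltr_rna):
        raise ValueError("LTR switch coordinates out of order or out of range")
    if not 0 <= mrna_in < mrna_out <= len(mrna):
        raise ValueError("mRNA switch coordinates out of order or out of range")
    return ltr_rna[:ltr_out] + mrna[mrna_in:mrna_out] + ltr_rna[ltr_in:]

# Toy example: the internal CCCC of the element is replaced by ATG from the mRNA
chimera = double_template_switch("AAAACCCCGGGG", "TTTTATGTTTT",
                                 ltr_out=4, mrna_in=4, mrna_out=7, ltr_in=8)
print(chimera)
```

Applying the same function again with a second mRNA and coordinates inside the first insert would give a chimera of several parental genes, which is how this simplified model represents the multi-gene retrocopies described in the text.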


Although LTR-mediated retroposition is active across different lineages, its contribution to the formation of retrogenes varies widely across taxa. Our previous work in plants has found that LTR- and non-LTR-mediated retroposition appear to occur with roughly equal rates in dicots ( and Manihot esculenta) (Zhu et al. 2016). In this work, we confirm that L1-mediated retroposition is the dominant mechanism of retroposition in human and mouse (Kaessmann et al. 2009). In mosquito, zebrafish, and chicken the contribution is more balanced with each contributing roughly equally to the formation of retrocopies. Although in Drosophila, we found similar numbers of recently originated retrocopies created by both mechanisms, all polymorphic retrocopies were associated with LTR-mediated events. Through simulations (Methods; Supplemental Table 14) we show that our pipeline works equally well detecting LTR-mediated and non-LTR-mediated retroposition. Therefore, the enrichment of LTR-mediated retroposition among polymorphic retrocopies when compared to those recently originated (Fisher’s exact test P = 0.04) likely reflects the recent increase in activity of LTR retrotransposons in flies (Bergman and Bensasson 2007; Kofler et al. 2015).

The different evolutionary trajectories of L1-mediated and LTR-mediated retrocopies

L1-mediated retrocopies are long known to be mostly dead-on-arrival because of the absence of the parental genes’ regulatory sequences (Mighell et al. 2000; Kaessmann et al. 2009). A recent work identified approximately 5000 retrocopies across various placental mammals and found that only 4% were robustly expressed (FPKM ≥ 1) (Carelli et al. 2016). The expressed retrocopies were found to have evolved their promoters de novo or to have recruited proto-promoters from nearby regions (Carelli et al. 2016). In our study, the two recently evolved non-LTR-mediated retrocopies in Drosophila also demonstrate that the fate of these retrocopies depends on whether a promoter is fortuitously gained. CG12324 is broadly transcribed, likely because it hitchhiked regulatory sequences already present at the insertion site, as suggested by the correlation between its expression and that of its neighboring gene, CG12323 (Pearson r = 0.76, P = 0.005) (Supplemental Table 15). The other one, CR12628, is not transcribed, presumably because of the absence of regulatory sequences at the insertion site (Table 3; Supplemental Fig. 11). In contrast, we found that LTR-mediated retrocopies are transcribed immediately after the retroposition (Fig. 2). This phenomenon may be restricted to the fixation stage and to a short period after fixation. Consistently, we only detect reads spanning switch points for four retrocopies encoded by the mouse reference genome, suggesting that their coexpression may eventually get decoupled. More importantly, four retrocopies (Supplemental Table 8) encoded by zebrafish and chicken already lost the LTR retrotransposon remnants on one side, suggesting that LTR-mediated retrocopies eventually diverge from nonautonomous LTR retrotransposons.

A significant proportion of LTR-mediated retrocopies may play novel functions

Our work shows that LTR retrotransposons can create new transcribed retrocopies. Therefore, from the viewpoint of transcription, the potential of LTR-mediated retroposition to create functional retrocopies is higher than that of L1-mediated retroposition. But are any of these retrocopies actually functional (i.e., bona fide retrogenes)? Although the young age of the retrocopies makes it difficult to test the levels of protein constraint, we have identified two or three retrocopies showing evidence for constraint at the protein level (Supplemental Table 13).

We should note that translation does not absolutely require that the retrocopies are in the same frame as the parental genes. In the case of Bs1 in maize, the region derived from one parental gene is out-of-frame, whereas the regions derived from the other two parental genes are in-frame (Elrouby and Bureau 2001, 2010). Moreover, loss-of-function mutations can be rescued by the evolution of new introns within the retrocopies (Tan et al. 2014). In addition, evidence has accumulated suggesting the general significance of long noncoding RNAs (lncRNA) (Ponting et al. 2009; Guttman and Rinn 2012; Rinn and Chang 2012). The Drosophila retrocopy, CG4799_CG11924_r, is more highly expressed than 99.5% of the genes transcribed in testis (Supplemental Fig. 12) and has therefore the potential to act as a functional lncRNA if it could not encode a protein. Consistently, we detected reduced levels of nucleotide diversity and increased levels of linkage disequilibrium (LD) around this locus, suggesting that positive selection could be driving its spread (Supplemental Fig. 13).

If some LTR-mediated retrocopies are indeed functional, their chimeric structure implies that they should have a radically different functional role compared to their parental genes. The most extreme examples are those of retrocopies involving several parental genes. Bs1 in maize shows that these complex chimeric genes can indeed sometimes become functional hybrid proteins (Jin and Bennetzen 1994; Elrouby and Bureau 2001, 2010). Therefore, it is reasonable to conclude that LTR retrotransposons are capable of creating retrocopies across a wide range of eukaryotes, which could subsequently evolve to be neofunctionalized retrogenes given their novel gene structure and expression profile.

Methods

Public Drosophila population resequencing data

We focused on the subset of 40 inbred lines from the DGRP project (NCBI SRA database: SRP000694 and SRP000224), which were also sequenced by the Drosophila Population Genomics Project (DPGP) (Langley et al. 2012).

Terminology

Traditionally, “retrocopy” is used to refer to an intronless copy derived from a host mRNA. This terminology was developed for L1-mediated retrocopies, which generally have no additional flanking sequences. However, for LTR-mediated retroposition, the retroposed locus consists not only of the sequence derived from the host mRNA but also of the flanking LTR retrotransposon sequences. To be consistent with previous studies, we use the term retrocopy to refer only to the host mRNA-derived sequence. We refer to retrocopies using the ID of the parental gene followed by “_r” except when the gene is already annotated.

Detection of Drosophila polymorphic retrocopies

To create the search library, we extracted exon–exon junction sequences by pooling 2×100 bp from each pair of consecutive exons (Supplemental Fig. 1A). All reads were mapped to this library with Novoalign (http://www.novocraft.com), and only uniquely mapping reads were retained (Supplemental Methods). We called candidate retroposition or intron loss events if there were at least two reads spanning one exon–exon junction. Since the introns of parental genes are not deleted by retroposition, we retained 31

candidates with reads mapping to both exon–intron and intron–exon junctions (Supplemental Fig. 1B).

Local assembly and validation of the polymorphic retrocopies

For each candidate retrocopy, we extracted the reads that mapped uniquely to the parental genes, 500-bp flanking regions, and to exon–exon junctions and performed de novo local assembly with MIRA (Supplemental Fig. 1C; Chevreux et al. 1999). MIRA will export at least two different contigs corresponding to the parental and retrocopy locus (Supplemental Fig. 1D). If the contig encoded by the retrocopy was too short to identify its exact switch points, it was further assembled with PRICE (Ruby et al. 2013). We assembled 15 retrocopies derived from 17 parental genes as well as their flanking regions (10–200 bp) (Supplemental Data Set 1). We validated this data set by PCR (Schrider et al. 2011, 2013) in which we used Primer3 (Untergasser et al. 2012) to design primers targeting the exons shared by parental genes and retrocopies that flank at least one intron in the parental genes (Supplemental Table 4). If retrocopies are present, we should amplify two PCR products of different sizes, the larger corresponding to the parental gene (introns present) and the smaller to the retrocopy (introns absent).

Evaluation of potential bias in the detection of LTR-mediated and non-LTR-mediated retrocopies

We ran simulations approximating the process of retroposition in Drosophila. Because there are more full-length LTR retrotransposons in this species (Supplemental Table 16), we randomly selected 105 genes out of the top 600 most highly transcribed genes in testis and ovary as the parental genes to simulate 90 retrocopies mediated by LTR retrotransposons and 15 retrocopies mediated by non-LTR (L1) retrotransposons (Supplemental Table 14). We determined the size of the retroposed sequences, of the replaced regions of the LTR retrotransposons, and of the microsimilarities following the distributions learned from the empirical set of 15 retrocopies. We simulated non-LTR-mediated retrocopies with 5′ truncation (≤500 bp), poly(A) tails (≤100 bp), and TSD (≤10 bp) based on previous studies (Linheiro and Bergman 2012; Subtelny et al. 2014). We randomly inserted the simulated retrocopies into intergenic or intronic regions. Finally, we simulated sequencing reads with a coverage of 50× and with variable read length as in the DGRP data (Supplemental Table 14). We were able to recover 85/90 simulated LTR-mediated retrocopies and 15/15 simulated L1-mediated retrocopies (Supplemental Table 14; Supplemental Methods), demonstrating that our pipeline is not biased in the detection of LTR-mediated and non-LTR-mediated retrocopies.

Estimation of the nonrandomness of the microsimilarity length

To test whether the extent of microsimilarity could have occurred purely by chance, we performed simulations for all 11 Drosophila retrocopies consisting of only two rounds of template switches. For each retrocopy, we randomly selected switch points. We selected the size of the retroposed sequences and of the replaced regions of the LTR retrotransposons following the normal distribution learned from our empirical set of retrocopies. For each retrocopy, we generated 100 random samples and recorded the size of the microsimilarity observed at both switch points. Supplemental Figure 3 shows the distribution of the microsimilarity length at both ends across 100 simulated events. The observed length (the red dot) is an outlier (P < 0.01) for 10 retrocopies. The exception is CG8567_r, but even for this retrocopy, the total length of microsimilarity was only smaller or equal to the value observed in eight of the 100 simulations (P = 0.08). Overall, these data show that the extent of microsimilarity observed between the retrocopies and the LTR retrotransposons cannot be explained by chance.

Inference of the retrocopies’ insertion sites

In order to identify the insertion sites, we generated whole-genome sequencing data using longer reads (2×250 bp) for six lines (e.g., 208) and mate-pair sequencing data for line 437 (Supplemental Table 5). We detected heterozygosity in these sequencing data that suggested contamination (admixture) between the lines sequenced. Thus in all analyses, these data were treated as pooled data. To confirm that the contamination occurred exclusively during the production of the novel sequencing data, we selected three loci with high SNP density and performed PCR experiments that confirmed that our fly stocks were not contaminated, showing the expected levels of homozygosity (Supplemental Fig. 14). With these whole-genome sequencing data, we were able to extend the flanking regions of the assembled retrocopies from a median of 163 bp to 210 bp and to locate four retrocopies. To identify the genomic insertion sites of more retrocopies, we further performed targeted high-throughput sequencing following the mobile element scanning (ME-Scan) protocol (Witherspoon et al. 2010) and located nine retrocopies (Table 2). Then, by PCR and Sanger sequencing, we validated 12 insertion sites for six retrocopies (Table 2; Supplemental Table 6).

We cloned and sequenced the two terminals of five retroposed loci and found that they include LTRs that are required for their integration into the genome (Supplemental Data Sets 3–6; Supplemental Methods).

Identification of retrocopies in reference genomes

In order to detect recently originated retrocopies in the reference genome of D. melanogaster, we refined a previous strategy (Bai et al. 2007) and searched the exon–exon junction library against the reference genome with BLASTN (Altschul et al. 1990). We retained only uniquely mapped regions and identified five D. melanogaster–specific retrocopies. Similarly, we scanned five other species: mosquito (UCSC assembly anoGam1), zebrafish (danRer7), chicken (galGal4), mouse (mm9), and human (hg19). In mouse, we found 208 retrocopies flanked by LTR retrotransposons on both sides. We excluded retrocopies inserted into preexisting LTR retrotransposons by implementing four filters: (1) The flanking LTR retrotransposons have to belong to the same LTR retrotransposon type or the nonautonomous retrotransposon mated with autonomous retrotransposon (ORR1/MaLR with MERVL, ETn with MusD) (McCarthy and McDonald 2004); (2) both the internal mRNAs and the flanking LTR retrotransposons have to be absent from the rat genome (rn4); (3) the LTR retrotransposons have to be discontinuous with a gap >20 bp; and (4) there are no L1 hallmarks. Analogously, we filtered 518 human retrocopies flanked by LTR retrotransposons with the whole retroposed locus being required to be absent in the nonprimate placental mammals. In Supplemental Table 8, only the retrocopies that passed these filters are shown for mouse and human.

To approximate the evolutionary age, we calculated the overall nucleotide divergence between parental gene and retrocopy via BLAT (Kent 2002) for young retrocopies. For old retrogenes reported in Bai et al. (2007), we directly downloaded the synonymous substitution rate (KS) values.

Analysis of hallmark sequences of L1-mediated retroposition

We called L1-mediated retrocopies by the presence of one of the following three features: (1) a poly(A) tail at the 3′ end; (2) TSDs at both ends; and (3) bias favoring truncation of 5′ rather than 3′

ends. The presence of a TTTT/AA endonuclease cleavage site has been shown to not be reliable (Richardson et al. 2014). We inferred the boundary between retrocopies and flanking regions by BLAT. In order to generate a conservative data set of LTR-mediated retrocopies, we specified relatively loose cutoffs for L1 hallmark sequences: a minimum of three continuous As in a 5-bp-long sequence at the 3′ end, and a minimal TSD not <5 bp with an identity ≥80%. When retrocopies flanked by LTR retrotransposons were also associated with candidate hallmark sequences for L1-mediated retroposition, we determined whether the poly(A) tail and TSDs were part of the LTR retrotransposons and whether the LTR retrotransposons were discontinuous given the canonical annotations in Repbase (Kaminker et al. 2002).

Transcriptional profiling of Drosophila polymorphic retrocopies

We investigated the expression profile of polymorphic retrocopies using RT-PCR, qRT-PCR, RACE, and RNA-seq (Supplemental Methods). For LTR retrotransposons, we designed primers on the basis of the canonical sequence (Supplemental Table 9). To determine the internal controls in qRT-PCR, we followed the protocol in Gan et al. (2013) and chose the three most transcriptionally constant genes (eIF-1A, Rap2l, and SdhA) out of 26 in Ling and Salvaterra (2011) and Ponton et al. (2011). Strand-specific RNA-seq data for line 208 were generated by following the dUTP protocol (Levin et al. 2010). Reads were mapped following a Novoalign-based RNA-seq pipeline, with the 15 retrocopies added to the reference sequence set. We only kept the reads spanning the switch points.

Transcriptional profiling of recently originated retrocopies across various animals

The FPKM data of the 96 Drosophila retrogenes were downloaded from FlyBase (Marygold et al. 2013). Only the genes CG5150 and CG9518 were absent from this data set. To investigate the expression profiles of the three older Drosophila retrocopies and of eight mouse retrocopies, we downloaded the FASTQ-format unstranded modENCODE RNA-seq data (Celniker et al. 2009) and stranded ENCODE mouse transcriptome data (http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeCshlLongRnaSeq/). Since LTR retrotransposons are highly active in early mouse embryogenesis (Macfarlan et al. 2012), we also downloaded the single-cell RNA-seq data for two-cell and eight-cell stages from the NCBI GEO database (GSE44183). To better represent the transcriptome diversity, 12 representative fly tissues and 15 mouse tissues were chosen according to a tissue-level clustering analysis (Supplemental Fig. 15). Because retrocopies are highly similar (≥90% identity) to parental genes, we used only uniquely mapping reads and calculated the effective length of the retrocopy defined in Zhang et al. (2011) when calculating the FPKM values.

Furthermore, we analyzed public strand-specific data sets from mosquito (ArrayExpress ERP006838), zebrafish (Lee et al. 2013), and chicken (Necsulea et al. 2014). We estimated the quality of strandedness based on exon–exon junction spanning reads. Overall, all data sets are reasonably accurate, with >89% reads (Supplemental Fig. 10) supporting the standard splicing site “GT-AG.”

In order to determine whether a gene is called as present, we specified two FPKM cutoffs. Because it has been shown that FPKM values from 0.1 to 0.4 could reflect active transcription (Hart et al. 2013), we took the high cutoff of 1 to define robust expression as done in Carelli et al. (2016) and the low cutoff of 0.1 to define modest expression. For read counts, a cutoff of 0.1 (FPKM) corresponds to at least nine reads in Figure 3.

Evolutionary analyses

To examine protein-level constraints, we performed KA/KS tests between the parental genes and corresponding retrocopies. By following Betrán et al. (2002), we tested whether KA/KS is significantly smaller than 0.5, which suggests functionality for both genes. First, we translated retrocopies by aligning parental proteins against retrocopies with exonerate (Slater and Birney 2005). Then, we aligned the proteins (for retrocopies with candidate loss-of-function mutations, the longest continuous peptide was used) with MAFFT (Katoh and Toh 2008) and then converted protein-level alignments into codon-level alignments via PAL2NAL (Suyama et al. 2006). Finally, we performed likelihood ratio tests via the codeml program in the PAML package (Yang 2007).

We tested whether CG4799_CG11924_r is under positive selection following Cardoso-Moreira et al. (2016). All SNPs located within 50 kb upstream of and downstream from 3L: 21,967,546 (the insertion site) (Table 2) were extracted from the DGRP variant annotation file (VCF) using VCFtools (Danecek et al. 2011). We then calculated levels of LD (as measured by r²) and estimated diversity levels using VCFtools (Danecek et al. 2011).

Data access

The sequence data generated for this study have been submitted to the NCBI BioProject database (http://www.ncbi.nlm.nih.gov/bioproject/) under accession numbers PRJNA282433, PRJNA285001, PRJNA325420, and PRJNA325579, and Genome Sequence Archive (GSA; http://gsa.big.ac.cn/) under accession numbers PRJCA000123, PRJCA000124, PRJCA000279, and PRJCA000280 (Supplemental Table 5). Sanger trace files have been submitted to the NCBI Trace Archives (http://www.ncbi.nlm.nih.gov/Traces/home/index.cgi) with TI numbers in 2344039996–2344040077.

Acknowledgments

We would like to thank three reviewers for their comprehensive and insightful comments. We thank the DGRP and DPGP groups for sharing the genomic data. We thank Bin He, Yang Shen, and Zhang laboratory members for helpful discussions. We also thank Wanzhu Jin for providing mouse samples. This research was supported by grants from the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13010400), the National Key Basic Research Program of China (Ministry of Science and Technology) (2015CB943001, 2013CB531202), the National Natural Science Foundation of China (91331114, 31322050), and the 1000 Young Talents Program to Y.E.Z. M.C.M. was supported by a Marie Curie incoming postdoctoral fellowship. Computing was jointly supported by the HPC Platform in BIG, CAS, and by HPC Platform, Scientific Information Center, Institute of Zoology, CAS, China.

Author contributions: Y.E.Z. conceived and designed the projects. S.T., C.C., Y.S., and L.L. performed computational analyses. S.T., W.S., D.Z., J.H., Y.M., H.J., Y.Z., Z.L., and X.H. performed the experiments. S.T., Y.E.Z., M.C.M., and M.L. analyzed the data. Y.E.Z., S.T., M.C.M., and M.L. wrote the paper.

References

Abyzov A, Iskow R, Gokcumen O, Radke DW, Balasubramanian S, Pei B, Habegger L, The 1000 Genomes Project Consortium, Lee C, Gerstein


M. 2013. Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division. Genome Res 23: 2042–2052.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215: 403–410.
Bai Y, Casola C, Feschotte C, Betrán E. 2007. Comparative genomics reveals a constant rate of origination and convergent acquisition of functional retrogenes in Drosophila. Genome Biol 8: R11.
Bergman CM, Bensasson D. 2007. Recent LTR retrotransposon insertion contrasts with waves of non-LTR insertion since speciation in Drosophila melanogaster. Proc Natl Acad Sci 104: 11340–11345.
Betrán E, Thornton K, Long M. 2002. Retroposed new genes out of the X in Drosophila. Genome Res 12: 1854–1859.
Brosius J. 1991. Retroposons—seeds of evolution. Science 251: 753.
Cardoso-Moreira M, Arguello JR, Gottipati S, Harshman LG, Grenier JK, Clark AG. 2016. Evidence for the fixation of gene duplications by positive selection in Drosophila. Genome Res 26: 787–798.
Carelli FN, Hayakawa T, Go Y, Imai H, Warnefors M, Kaessmann H. 2016. The life history of retrocopies illuminates the evolution of new mammalian genes. Genome Res 26: 301–314.
Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, et al. 2009. Unlocking the secrets of the genome. Nature 459: 927–930.
Chen S, Krinsky BH, Long M. 2013. New genes as drivers of phenotypic evolution. Nat Rev Genet 14: 744–744.
Chevreux B, Wetter T, Suhai S. 1999. Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. In German Conference on Bioinformatics, pp. 45–56. Hannover, Germany.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. 2011. The variant call format and VCFtools. Bioinformatics 27: 2156–2158.
Delviks-Frankenberry K, Galli A, Nikolaitchik O, Mens H, Pathak VK, Hu WS. 2011. Mechanisms and factors that influence high frequency retroviral recombination. Viruses 3: 1650–1680.
Derr LK, Strathern JN, Garfinkel DJ. 1991. RNA-mediated recombination in S. cerevisiae. Cell 67: 355–364.
Dunn CA, Romanish MT, Gutierrez LE, van de Lagemaat LN, Mager DL. 2006. Transcription of two human genes from a bidirectional endogenous retrovirus promoter. Gene 366: 335–342.
Eickbush T. 1999. Exon shuffling in retrospect. Science 283: 1465–1467.
Eickbush TH, Jamburuthugoda VK. 2008. The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Res 134: 221–234.
Elrouby N, Bureau TE. 2001. A novel hybrid open reading frame formed by multiple cellular gene transductions by a plant retroelement. J Biol Chem 276: 41963–41968.
Elrouby N, Bureau TE. 2010. Bs1, a new chimeric gene formed by retrotransposon-mediated exon shuffling in maize. Plant Physiol 153: 1413–1424.
Havecker ER, Gao X, Voytas DF. 2004. The diversity of LTR retrotransposons. Genome Biol 5: 225.
Hindmarsh P, Leis J. 1999. Retroviral DNA integration. Microbiol Mol Biol Rev 63: 836–843.
Jamain S, Girondot M, Leroy P, Clergue M, Quach H, Fellous M, Bourgeron T. 2001. Transduction of the human gene FAM8A1 by endogenous retrovirus during primate evolution. Genomics 78: 38–45.
Jern P, Coffin JM. 2008. Effects of retroviruses on host genome function. Annu Rev Genet 42: 709–732.
Jin YK, Bennetzen JL. 1994. Integration and nonrandom mutation of a plasma membrane proton ATPase gene fragment within the Bs1 retroelement of maize. Plant Cell 6: 1177–1186.
Johns MA, Mottinger J, Freeling M. 1985. A low copy number, copia-like transposon in maize. EMBO J 4: 1093–1101.
Kaessmann H, Vinckenbosch N, Long M. 2009. RNA-based gene duplication: mechanistic and evolutionary insights. Nat Rev Genet 10: 19–31.
Kaminker JS, Bergman CM, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DA, Lewis SE, Rubin GM, et al. 2002. The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol 3: RESEARCH0084.
Katoh K, Toh H. 2008. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinformatics 9: 286–298.
Kent WJ. 2002. BLAT—the BLAST-like alignment tool. Genome Res 12: 656–664.
Kofler R, Nolte V, Schlötterer C. 2015. Tempo and mode of transposable element activity in Drosophila. PLoS Genet 11: e1005406.
Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192: 533–598.
Lee MT, Bonneau AR, Takacs CM, Bazzini AA, DiVito KR, Fleming ES, Giraldez AJ. 2013. Nanog, Pou5f1 and SoxB1 activate zygotic gene expression during the maternal-to-zygotic transition. Nature 503: 360–364.
Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, Gnirke A, Regev A. 2010. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods 7: 709–715.
Li WH. 1997. Molecular evolution. Sinauer Associates, Sunderland, MA.
Ling D, Salvaterra PM. 2011. Robust RT-qPCR data normalization: validation and selection of internal reference genes during post-experimental data analysis. PLoS One 6: e17762.
Linheiro RS, Bergman CM. 2012. Whole genome resequencing reveals natural target site preferences of transposable elements in Drosophila melanogaster. PLoS One 7: e30008.
Long M, Rosenberg C, Gilbert W. 1995. Intron phase correlations and the evolution of the intron/exon structure of genes. Proc Natl Acad Sci 92: 12495–12499.
Luo GX, Taylor J. 1990. Template switching by reverse transcriptase during DNA synthesis. J Virol 64: 4321–4328.
Engelman A, Oztop I, Vandegraaff N, Raghavendra NK. 2009. 
Quantitative 47: – Macfarlan TS, Gifford WD, Driscoll S, Lettieri K, Rowe HM, Bonanomi D, analysis of HIV-1 preintegration complexes. Methods 283 290. Firth A, Singer O, Trono D, Pfaff SL. 2012. Embryonic stem cell potency Esnault C, Maestre J, Heidmann T. 2000. Human LINE retrotransposons Nature 487: – 24: – fluctuates with activity. 57 63. generate processed . Nat Genet 363 367. Maksakova IA, Romanish MT, Gagnier L, Dunn CA, van de Lagemaat LN, Ewing AD, Ballinger TJ, Earl D, Broad Institute Genome Sequencing and Mager DL. 2006. Retroviral elements and their hosts: insertional muta- Analysis Program and Platform, Harris CC, Ding L, Wilson RK, genesis in the mouse germ line. PLoS Genet 2: e2. Haussler D. 2013. Retrotransposition of gene transcripts leads to struc- Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, Wold 14: tural variation in mammalian genomes. Genome Biol R22. BJ. 2014. From single-cell to cell-pool transcriptomes: stochasticity in Faulkner GJ, Kimura Y, Daub CO, Wani S, Plessy C, Irvine KM, Schroder K, gene expression and RNA splicing. Genome Res 24: 496–510. Cloonan N, Steptoe AL, Lassmann T, et al. 2009. The regulated retro- Marygold SJ, Leyland PC, Seal RL, Goodman JL, Thurmond J, Strelets VB, 41: – transposon transcriptome of mammalian cells. Nat Genet 563 571. Wilson RJ, FlyBase Consortium. 2013. FlyBase: improvements to the Feuchter A, Mager D. 1990. Functional heterogeneity of a large family of hu- bibliography. Nucleic Acids Res 41: D751–D757. 18: man LTR-like promoters and enhancers. Nucleic Acids Res Maxwell PH, Curcio MJ. 2007. Retrosequence formation restructures the – 1261 1270. yeast genome. Genes Dev 21: 3308–3318. Gan H, Wen L, Liao S, Lin X, Ma T, Liu J, Song CX, Wang M, He C, Han C, McCarthy EM, McDonald JF. 2004. Long terminal repeat retrotransposons et al. 2013. Dynamics of 5-hydroxymethylcytosine during mouse sper- of Mus musculus. Genome Biol 5: R14. matogenesis. Nat Commun 4: 1995. 
Mighell AJ, Smith NR, Robinson PA, Markham AF. 2000. Vertebrate pseudo- Ganko EW, Greene CS, Lewis JA, Bhattacharjee V, McDonald JF. 2006. LTR genes. FEBS Lett 468: 109–114. retrotransposon-gene associations in Drosophila melanogaster. J Mol Evol Moran JV, DeBerardinis RJ, Kazazian HH. 1999. Exon shuffling by L1 retro- 62: 111–120. transposition. Science 283: 1530–1534. Gilbert W. 1978. Why genes in pieces? Nature 271: 501. Necsulea A, Soumillon M, Warnefors M, Liechti A, Daish T, Zeller U, Baker Goodrich DW, Duesberg PH. 1990. Retroviral recombination during reverse JC, Grützner F, Kaessmann H. 2014. The evolution of lncRNA reper- transcription. Proc Natl Acad Sci 87: 2052–2056. toires and expression patterns in tetrapods. Nature 505: 635–640. Graur D, Li WH. 2000. Fundamentals of molecular evolution. Sinauer Parker HG, VonHoldt BM, Quignon P, Margulies EH, Shao S, Mosher DS, Associates, Sunderland, MA. Spady TC, Elkahloun A, Cargill M, Jones PG, et al. 2009. An expressed Guttman M, Rinn JL. 2012. Modular regulatory principles of large non-cod- fgf4 retrogene is associated with breed-defining chondrodysplasia in do- ing RNAs. Nature 482: 339–346. mestic dogs. Science 325: 995–998. Hajjar AM, Linial ML. 1993. A model system for nonhomologous recombi- Perrat PN, DasGupta S, Wang J, Theurkauf W, Weng Z, Rosbash M, Waddell nation between retroviral and cellular RNA. J Virol 67: 3845–3853. S. 2013. Transposition-driven genomic heterogeneity in the Drosophila Hart T, Komori H, LaMere S, Podshivalova K, Salomon D. 2013. Finding the brain. Science 340: 91–95. active genes in deep RNA-seq gene expression studies. BMC Genomics Petrov DA, Hartl DL. 1998. High rate of DNA loss in the Drosophila mela- 14: 778. nogaster and Drosophila virilis species groups. Mol Biol Evol 15: 293–302.


Ponting CP, Oliver PL, Reik W. 2009. Evolution and functions of long noncoding RNAs. Cell 136: 629–641.
Ponton F, Chapuis MP, Pernice M, Sword GA, Simpson SJ. 2011. Evaluation of potential reference genes for reverse transcription-qPCR studies of physiological responses in Drosophila melanogaster. J Insect Physiol 57: 840–850.
Richardson SR, Morell S, Faulkner GJ. 2014. L1 retrotransposons and somatic mosaicism in the brain. Annu Rev Genet 48: 1–27.
Rinn JL, Chang HY. 2012. Genome regulation by long noncoding RNAs. Annu Rev Biochem 81: 145–166.
Ruby JG, Bellare P, Derisi JL. 2013. PRICE: software for the targeted assembly of components of (meta) genomic sequence data. G3 (Bethesda) 3: 865–880.
Schacherer J, Tourrette Y, Souciet JL, Potier S, De Montigny J. 2004. Recovery of a function involving by retroposition in Saccharomyces cerevisiae. Genome Res 14: 1291–1297.
Schrider DR, Stevens K, Cardeno CM, Langley CH, Hahn MW. 2011. Genome-wide analysis of retrogene polymorphisms in Drosophila melanogaster. Genome Res 21: 2087–2095.
Schrider DR, Navarro FC, Galante PA, Parmigiani RB, Camargo AA, Hahn MW, de Souza SJ. 2013. Gene copy-number polymorphism caused by retrotransposition in humans. PLoS Genet 9: e1003242.
Slater GS, Birney E. 2005. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6: 31.
Soares MB, Schon E, Henderson A, Karathanasis SK, Cate R, Zeitlin S, Chirgwin J, Efstratiadis A. 1985. RNA-mediated gene duplication: The rat preproinsulin I gene is a functional . Mol Cell Biol 5: 2090–2103.
Subtelny AO, Eichhorn SW, Chen GR, Sive H, Bartel DP. 2014. Poly(A)-tail profiling reveals an embryonic switch in translational control. Nature 508: 66–71.
Suyama M, Torrents D, Bork P. 2006. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res 34: W609–W612.
Swain A, Coffin JM. 1992. Mechanism of transduction by retroviruses. Science 255: 841–845.
Tan S, Zhu Z, Zhu T, Te R, Zhang YE. 2014. Chance and necessity: emerging introns in intronless retrogenes. eLS doi: 10.1002/9780470015902.a0022886.
Toba G, Aigaki T. 2000. Disruption of the Microsomal glutathione S-transferase-like gene reduces life span of Drosophila melanogaster. Gene 253: 179–187.
Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, Rozen SG. 2012. Primer3—new capabilities and interfaces. Nucleic Acids Res 40: e115.
Vibranovski MD, Zhang Y, Long M. 2009. General gene movement off the X chromosome in the Drosophila genus. Genome Res 19: 897–903.
Wang W, Zheng H, Fan C, Li J, Shi J, Cai Z, Zhang G, Liu D, Zhang J, Vang S, et al. 2006. High rate of chimeric gene origination by retroposition in plant genomes. Plant Cell 18: 1791–1802.
Wei W, Gilbert N, Ooi SL, Lawler JF, Ostertag EM, Kazazian HH, Boeke JD, Moran JV. 2001. Human L1 retrotransposition: cis preference versus trans complementation. Mol Cell Biol 21: 1429–1439.
Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, et al. 2007. A unified classification system for eukaryotic transposable elements. Nat Rev Genet 8: 973–982.
Witherspoon DJ, Xing JC, Zhang YH, Watkins WS, Batzer MA, Jorde LB. 2010. Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics 11: 410.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24: 1586–1591.
Yoshioka K, Honma H, Zushi M, Kondo S, Togashi S, Miyake T, Shiba T. 1990. Virus-like particle formation of Drosophila copia through autocatalytic processing. EMBO J 9: 535–541.
Yue F, Cheng Y, Breschi A, Vierstra J, Wu WS, Ryba T, Sandstrom R, Ma ZH, Davis C, Pope BD, et al. 2014. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515: 355–364.
Zhang YE, Landback P, Vibranovski MD, Long M. 2011. Accelerated recruitment of new brain development genes into the human genome. PLoS Biol 9: e1001179.
Zhou Q, Zhang G, Zhang Y, Xu S, Zhao R, Zhan Z, Li X, Ding Y, Yang S, Wang W. 2008. On the origin of new genes in Drosophila. Genome Res 18: 1446–1455.
Zhu Z, Tan S, Zhang Y, Zhang YE. 2016. LINE-1-like retrotransposons contribute to RNA-based gene duplication in dicots. Sci Rep 6: 24755.

Received January 29, 2016; accepted in revised form October 18, 2016.

Genome Research 1675 www.genome.org