Molecular paleontology of transposable elements in the Drosophila melanogaster genome

Vladimir V. Kapitonov* and Jerzy Jurka*

Genetic Information Research Institute, 2081 Landings Drive, Mountain View, CA 94043

Communicated by Margaret G. Kidwell, University of Arizona, Tucson, AZ, April 7, 2003 (received for review December 23, 2002) We report here a superfamily of ‘‘cut and paste’’ DNA transposons and found in the DMG. Surprisingly, the called Transib. These transposons populate the Drosophila mela- diversity of TEs fossilized in the DMG is higher than in mam- nogaster and Anopheles gambiae genomes, use a transposase that malian or vertebrate genomes studied so far. TEs can become is not similar to any known proteins, and are characterized by 5-bp stable structural components of the eukaryotic heterochromatin target site duplications. We found that the fly genome, which was (8–10). Perhaps the best example in this respect is AT, where TEs thought to be colonized by the <100 years ago, harbors are highly compartmentalized: Ϸ90% of genomic TEs are Ϸ5 million year (Myr)-old fossils of ProtoP, an ancient ancestor of localized in the paracentromeric heterochromatin, and Ϸ90% of the P element. We also show that Hoppel, a previously reported the paracentromeric heterochromatin is composed of TEs (6, (TE), is a nonautonomous derivate of ProtoP. 11). We show that similar compartmentalization characterizes We found that the ‘‘rolling-circle’’ transposons identified distributions of TEs in the DMG. previously in plants and worms populate also insect genomes. Our results indicate that Helitrons were horizontally transferred into Materials and Methods the fly or͞and mosquito genomes. We have also identified a most Computational Analysis. All TEs reported in this article were abundant TE in the fly genome, DNAREP1࿝DM, which is an Ϸ10- found by using various methods of computational analysis. We Myr-old footprint of a Penelope-like . We esti- began by compiling DNA sequences of known TEs and other mated that TEs are three times more abundant than reported repetitive elements deposited in the DM section of Repbase previously, making up Ϸ22% of the whole genome. The chromo- Update or ‘‘RU’’ (12). Using this database, we annotated TEs in somal and age distributions of TEs in D. melanogaster are very the DM DNA sequences deposited into GenBank and the similar to those in Arabidopsis thaliana. Both genomes contain only Berkeley Drosophila Genome Project public database (www. relatively young TEs (<20 Myr old), constituting a main component fruitfly.org͞sequence͞assembly.html). The annotation began of paracentromeric regions. with comparing DNA sequences of known TEs against the sequenced portion of DMG by using CENSOR (13). Putative new rosophila melanogaster (DM) is among the most important TEs were identified either as insertions into copies of known TEs Dspecies shaping our knowledge of mobile or transposable or as elements distantly similar to known TEs. These insertions elements (TEs) that populate eukaryotic and prokaryotic ge- were analyzed in detail for presence of hallmarks such as nomes and multiply by transferring their copies from one terminal inverted repeats (TIRs), target site duplications genomic place to another (1–4). The high variety of currently or (TSDs), long terminal repeats (LTRs), etc. For each identified ͞͞ recently active TEs and the role of DM as one of the basic genetic insertion, we applied BLASTN (14) or WU BLASTN (http: blast. models were among the major factors behind successful studies wustl.edu) to extract DM GenBank sequences whose fragments of TEs. All fly TEs reported in the literature or public databases were similar to the insertion sequence. Subsequently, precise before 1999 were found experimentally. However, recent explo- coordinates of these fragments were determined by running the sion of the available sequence data (5) has brought methods of insertion sequence against the extracted GenBank sequences by computational analysis to the forefront. Computer-assisted iden- using CENSOR. tification of DNA patterns similar to a query sequence is fast, DNA regions of a very low but significant partial identity sensitive, and cheap. Computational analysis also permits effi- (60–70%) to the known TE were also analyzed in the same way cient reconstruction of ancient TEs from their incomplete or as the insertions described above. As noted before (6), such mutated copies (we consider them as genomic DNA fossils). regions usually belong to highly diverged families of young TEs. Nevertheless, there is no guarantee that any TE reconstructed Using the majority rule applied to multiple aligned copies of from its footprints is an active element. Therefore, the final TEs, we built their consensus sequences. Copies of TEs created isolation of bona fide active transposable elements or verification by chromosomal duplications or redundant sequencing were of biological activity of reconstructed transposons must be done discarded on the basis of similarity between the corresponding in a test tube. flanking regions. Using approaches similar to those applied in identifying TEs Distantly related proteins were identified by using PSI-BLAST in the Arabidopsis thaliana (AT) and Caenorhabditis elegans (15). Multiple alignments of protein sequences were created by genomes (6, 7), we began a systematic computational analysis of CLUSTAL-W (16) and edited manually by using GENEDOC (17). TEs in the DM genome (DMG) in April 1999. At that time, Ϸ50 Multiple alignments of DNA sequences were obtained by using families of DM TEs were reported in the literature. On the basis the VMALN2 and PALN2 programs developed at the Genetic of computational analysis, we identified and characterized Ϸ80 Information Research Institute. We used FGENESH (ref. 18; EVOLUTION additional families of TEs harbored by the DMG. Despite stop www.softberry.com) for identifying genes encoded by TEs. Phy- codons, insertions, and deletions present in genomic copies of logenetic analysis was conducted by using MEGA (19). The ages ϭ ͞ different TEs, the consensus sequences reconstructed for most of TEs were calculated by using the formula t K v, where t is of the new families are free of these mutations and contain valuable information on the evolution and function of TEs. For Abbreviations: TE, transposable element; TIR, terminal ; TSD, target site example, analysis of conserved protein motifs in the recon- duplication; RU, Repbase Update; DM, Drosophila melanogaster; DMG, DM genome; AG, structed proteins may be the easiest way to identify catalytic Anopheles gambiae; AT, Arabidopsis thaliana; Myr, million years. centers and other motifs involved in transpositions. Here we *To whom correspondence may be addressed. E-mail: [email protected] or describe several novel TEs and an overview of DNA transposons [email protected].

www.pnas.org͞cgi͞doi͞10.1073͞pnas.0732024100 PNAS ͉ May 27, 2003 ͉ vol. 100 ͉ no. 11 ͉ 6569–6574 Downloaded by guest on September 29, 2021 alignment of Transib1 copies (Fig. 1), and it includes 43-bp TIRs. CENSOR-assisted screening of the DMG has shown that it harbors multiple repetitive elements, which are only Ϸ60% identical to the Transib1 consensus. On the basis of pairwise nucleotide identities, the identified repetitive elements related to Transib1 were split into three groups called Transib2–4. The Transib families are highly divergent from each other; their consensus sequences are only 60–62% identical to each other. At the same time, the identities between the correspond- ing consensus sequences and copies from the same family are over 90%. The Transib2-4 consensus sequences are 2,844, 2,883, and 2,656 bp long, including 42-, 45-, and 40-bp TIRs, respec- tively. Analogously to Transib1, Transib2-4 elements have gen- erated 5-bp TSDs upon their insertions into the genome. The identified TSDs and TIRs indicate that Transib1–4 elements are DNA transposons. Moreover, the Transib1–4 consensus se- quences encode hypothetical proteins, 461-aa Transib1p, 690-aa Transib2p, 685-aa Transib3p, and 533-aa Transib4p, respec- tively, which are 32–46% identical to each other (Table 4, which is published as supporting information on the PNAS web site). Such a high identity between the proteins, encoded by the Transib1–4 DNA sequences that are only Ϸ60% identical to each other, indicates that Transib1–4p proteins were necessary for the Transibs transposition. Because there are no other proteins encoded by these elements, we consider Transib1–4p as pre- dicted transposases. Based on a TBLASTN search, there are about 100 elements in the DMG that code for proteins similar to Transib1p-4p. Autonomous DNA transposons, i.e., elements Fig. 1. Reconstruction of the Transib1 (A), Transib2 (B), Transib3 (C), and that encode transposases, usually constitute a minor fraction of Transib4 (D) consensus sequences. The consensus sequences are schematically DNA transposons fixed in eukaryotic genomes. The major depicted as rectangles capped by black triangles indicating TIRs. Transposase- fraction is composed of nonautonomous elements that do not coding regions are shaded in gray. Their coordinates in the consensus se- encode the transposase but share common termini with auton- quences are shown above the rectangles. Transib copies used for reconstruc- omous elements from which they have descended. The same tion of the consensus sequences are shown as thick lines beneath the pattern is seen for Transibs (Fig. 1). rectangles. Gaps in the lines mark deletions. GenBank accession numbers and We noted a distant Ϸ60% identity between Transib1 and the corresponding sequence coordinates of TEs are indicated. Hopper, a 1,435-bp nonautonomous DNA transposon identified as a de novo insertion associated with the SxlM4 allele (21). the average time elapsed since the TE copies were transposed in Hopper has 33-bp TIRs similar to these in Transib, and its copies the genome, K is the average divergence of the TE copies from are also flanked by 5-bp TSDs (21) (Fig. 6). Most likely, Hopper their consensus sequence, and v is the neutral substitution rate, is the youngest member of the Transib clade. Hopper elements 0.016 substitution per site per million years (Myr) (20). Ages of identified in the sequenced genome do not encode proteins. Presumably, the autonomous Hopper encoding a Transib-like retroviruses were estimated on the basis of pairwise divergences transposase is not fixed in the DMG. of their LTRs by using the formula t ϭ K͞2v, where K is the Currently, Anopheles gambiae (AG) is the only other species divergence between LTRs that flanked the same provirus. whose sequenced portion of the genome (22) encodes proteins similar to the Transib transposases. Using WU BLAST,weex- Databases. The sequences of TEs reported in this manuscript are tracted all DNA fragments encoding proteins similar to the deposited in the DM section of RU (ref. 12, www.girinst.org͞ ࿝ Transib transposases from the public set of assembled AG Repbase Update.html). TE families originally reported in RU scaffolds. These fragments were split into several homogenous were also included in the data set of TEs maintained by the clusters. On the basis of the multiple alignment of sequences Drosophila Genome Project Consortium (www.fruitfly.org). from the most abundant cluster, we reconstructed their 4,303-bp consensus sequence. This consensus sequence is a ‘‘recovered’’ Results and Discussion ࿝ ࿝ mosquito Transib-like DNA transposon, called Transib1 AG, Transib. One particular copy of the ProtoP B transposon (Gen- which encodes a 708-aa transposase (35% identity to Transib2p) Bank AE003048, positions 9814–12479) harbors a 1693-bp in- and has 9-bp TIRs similar to those in the fly Transib elements sertion, positions 9932–11624, which has characteristics of DNA (Fig. 7, which is published as supporting information on the transposons. The insertion is flanked by the 5-bp CAGCG TSD PNAS web site). Moreover, Transib1࿝AG copies also are flanked and has a 44-bp TIR. After a search by BLASTN and CENSOR,we by the 5-bp TSDs. identified an additional six copies of the insertion (Fig. 1) Given the multiple copies, TIRs, TSDs, and insertions of some Ϸ scattered throughout the DMG and 90% identical to each Transib copies into other TEs, there is no doubt that Transib-like other. Analysis of the corresponding flanking regions confirmed elements are molecular fossils of recently active DNA trans- the presence of 5-bp flanking direct repeats, which can be posons. As there is no significant similarity of the Transib considered as remnants of 5-bp TSDs induced by transpositions transposases to any known proteins (PSI-BLAST, BLASTP, E Ͻ 1), (Fig. 6, which is published as supporting information on the it is certain that Transib1p–Transib4p and Transib1࿝AGp belong PNAS web site, www.pnas.org). This family of TEs was named to a previously unrecognized class of DNA transposases, cata- Transib1 (after the Trans-Siberian Express running between lysts of Transib ‘‘cut and paste’’ transpositions. The alignment of Moscow and the Pacific Ocean). The 2167-bp Transib1 consen- Transib1–4p, Transib1࿝AGp, and Transib2࿝AGp revealed seven sus sequence was reconstructed on the basis of a multiple conserved motifs, presumably most important for the transpo-

6570 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0732024100 Kapitonov and Jurka Downloaded by guest on September 29, 2021 Fig. 2. Multiple alignment of the Transib transposases. Diamonds mark amino acid residues that belong to a putative noncanonical DDE catalytic site. Asterisks mark bp 10, 30, etc. Conserved motifs and residues are underlined and shaded, respectively.

sition (Fig. 2). A putative catalytic D(34)D(35)E signal found in Ϸ30-bp TIR, this element was classified as a putative DNA many cut-and-paste transposases is also present in the fly Transib transposon (23, 24). However, its relationship to known super- transposases (Fig. 2). However, this signal is not present in the families of DNA transposons was obscure. We found that an mosquito transposases. Ϸ2-kb internal portion of one unusually long Hoppel element Given the unique conserved transposase, the 5-bp TSD, the (GenBank gi 5656725, 634-4052) encoded a protein distantly specific TIRs, and their presence in the fly and mosquito related to the P transposase (BLASTX,25–29% similarities to Ϫ genomes, we conclude that Transib-like TEs constitute an eighth known P transposases, E Ͻ 10 20, two stop codons). Surprisingly, superfamily of cut-and-paste DNA transposons (Table 1). Hoppel was the only TE linked to the identified internal portion Interestingly, the Transib codon composition is atypical for the (Fig. 3). Moreover, the internal sequence does not have any DMG. As a result, commonly applied programs, including those hallmarks of a TE inserted accidentally into a Hoppel element. used for systematic annotation of the DMG (18), fail to predict Therefore, we concluded that the identified long Hoppel element Transib1p-4p unless one presumes that the fly genome codes for was a fossilized autonomous DNA transposon that gave birth to ͞ genes characterized by human-, plant-, or worm-specific fea- the nonautonomous Hoppel 1360 elements. Given its antiquity tures, including their codon composition. Presumably, the atyp- (described below) and the similarity to all known P transposases, we called this transposon ProtoP. The 1,105-bp Hoppel consensus ical codon composition is related to the horizontal transfer of Ϸ Transibs. sequence is 97% identical to Hoppel elements, indicating that this family is Ϸ2 Myr old. ProtoP࿝B is a second family of ProtoP. The Ϸ1100-bp Hoppel (23) or ‘‘element 1360’’ (24) is one nonautonomous ProtoP-like transposons described here. of the most abundant TEs in the DMG (25). On the basis of its

Table 1. Cut-and-paste DNA transposons Homo A. C. Superfamily sapiens thaliana elegans DM

mariner͞Tc1 ϩϩϩϩ hAT ϩϩϩϩ PiggyBac ϩϪϪϩ MuDr ?* ϩϩ† ?‡ En͞Spm ϪϩϪϪ Harbinger ϩ§ ϩϩϩ§ EVOLUTION P ϩ§ ϪϪϩ Transib ϪϪϪϩ Fig. 3. Reconstruction of the ProtoP consensus sequence. The consensus *MuDr-like transposase is not found the human genome. However, Ricksha sequence is schematically depicted as a rectangle capped by the black triangles (unpublished data; RU) has MuDr hallmarks: 9-bp TSD, 70-bp TIR, and 5Ј-GGG indicating TIRs. The region coding for the transposase is shaded in gray. Its and CCC-3Ј termini. coordinates in the consensus sequence are indicated above the rectangle. †CEMUDR1 (unpublished data; RU). Copies of ProtoP used for reconstructing the consensus sequences are shown ‡Putatively, FB belongs to the MuDr superfamily. as thick lines beneath the ProtoP, Hoppel, and ProtoP࿝B consensus sequences. §Harbinger- and P-like transposases are present as single-copy genes in the DM Gaps in the lines mark deletions. GenBank accession numbers and the corre- (unpublished data; RU) and human (27) genomes. sponding sequence coordinates of TEs are indicated.

Kapitonov and Jurka PNAS ͉ May 27, 2003 ͉ vol. 100 ͉ no. 11 ͉ 6571 Downloaded by guest on September 29, 2021 did not show any evidence supporting this hypothesis. Alterna- tively, gain or loss of introns is relatively common and may be independent of retrotransposition. For example, the mosquito Helitron transposases are intronless (described below), whereas similar transposases in plants and worms are encoded by mul- tiple exons (7). Similarly, accidental intron gains were reported for plant mariners (29).

Helitrons in Insects. A TBLASTN search for fly proteins similar to those encoded by the plant and worm Helitron rolling-circle Fig. 4. A phylogenetic tree for P-like transposases. The unrooted tree was DNA transposons (7) did not reveal any evident matches. constructed by using the neighbor-joining method implemented in MEGA (19). ࿝ However, using an approach similar to that described in ref. 7, Transposases from the next P-like transposons are shown: P DH, Drosophila we identified in AG the insect Helitron transposon, called helvetica, GenBank accession no. AAK08181; P࿝SP, Scaptomyza pallida, joined ࿝ ࿝ ࿝ Helitron1 AG. Its 8,200-bp consensus sequence encodes an in- AAA29959–61; P DM, DM, A24786; P DB, Drosophila bifasciata, AAB31526; ࿝ P࿝LC, Lucilia cuprina, A46361. P1࿝AG and P3࿝AG from A. gambiae and ProtoP tronless 1,700-aa Helitron1 AGp protein composed of the ca- from DM are transposases from P elements reported in this manuscript. HS is nonical Helitron-like Rep and SF1 helicase domains. Analo- a putative human gene derived from the P-like transposase (27). The scale of gously to their plant and worm relatives (7), insect Helitron the Poisson correction distances between the protein sequences is indicated. transposons are characterized by the 5Ј-TC and CTAG-3Ј ter- Bootstrap values are shown at the nodes. mini, the 3Ј-terminal Ϸ12-bp hairpin, and TSD-free insertions between the target A-3Ј and 5Ј-T nucleotides. The Helitron1࿝AG Ϸ ࿝ Ϸ ࿝ consensus sequence is only 5% divergent from several ProtoP B copies are 95% identical to the 1,153-bp ProtoP B Helitron1࿝AG copies, indicating that these elements were trans- Ϸ consensus sequence ( 3 Myr old). There is only a 5% divergence posed in AG during the last few million years. The AG genome ࿝ between the Hoppel and ProtoP B consensus sequences that harbors more than 100 copies of different Helitron transposons Ϸ contain 191-bp and 136-bp internal regions, respectively, 95% that belong to more than 10 different families (not shown), identical to different portions of ProtoP (Fig. 3). The 4,480-bp characterized by a high (Ͼ90%) intra- and low (Ͻ70%) inter- consensus sequence of ProtoP was reconstructed on the basis of family nucleotide identities. Additionally to the Helitron1࿝AG, Ϸ a multiple alignment of its 28 copies, which are 95% identical we also reconstructed the consensus sequence of a Helitron2࿝AG ࿝ to the consensus (Fig. 3). Hoppel and ProtoP B elements were family. Surprisingly, we found that a 649-bp portion of the excluded from the last set of sequences. The ProtoP consensus Helitron2࿝AG (positions 972-1620) was 75% identical to a ࿝ sequence is 97% and 96% identical to the Hoppel and ProtoP B Ϸ600-bp fly sequence (AE002840, 9827–9264) (Fig. 7). Because consensus sequences, respectively. Therefore, the peak of this Helitron-like element is flanked by other DM-specific TEs, ࿝ ProtoP, Hoppel, and ProtoP B transpositional activity occurred sequencing-related artifacts can be discarded. The identified 3–5 Myr ago. The last estimate is supported by studies of TEs element encodes a portion of the Rep͞Helicase protein, and its inserted into ProtoP. For example, a ProtoP element identified coding sequence is interrupted by a few stop codons. Therefore, in the AE002983 GenBank sequence (positions 8879–1919) this element is a remnant of a Helitron. Ancestral lineages of DM harbors a copy of the Gypsy6A (unpub- and AG are thought to have separated Ϸ250 Myr ago (30). The lished data; RU), including its flanking LTRs (positions 2325– neutral substitution rate in AG is even higher than in DM (31). 8774). The 14% divergence between these LTRs corresponds to Therefore, transposable elements that were present in a lineage Ϸ5 Myr elapsed since the insertion of the retrovirus into the ancestral to both AG and DM would not be recognizable ProtoP element. A ProtoP࿝B copy (AE002743, positions 1–7347) anymore. For example, exons from Ϸ6,000 orthologous genes harbors the Invader5 retrovirus (unpublished data; RU). The identified in the mosquito and fly genomes show only a Ϸ56% 15% divergence between its LTRs also indicates that ProtoP nucleotide identity (32). The 75% identity between the mosquito transposons colonized the DMG Ϸ5 Myr ago. Therefore, P-like and fly Helitrons indicates that these elements were transferred elements, which were thought to be recent invaders of the DMG horizontally rather than transmitted vertically. (26), colonized this genome several Myr ago. The ProtoP consensus sequence encodes an 864-aa ProtoP1 piggyBac. The 1,881-bp consensus sequence of TE called transposase (positions 1194–3788). Phylogenetic analysis (Fig. 4) Looper1࿝DM, reconstructed from five copies, encodes a 358-aa shows that the ProtoP transposase separates a cluster formed by protein, which is more similar to the piggyBac-like transposase insect P elements and a putative human gene derived from from the human Looper (7) than to the transposase encoded by a P-like transposase (27). Computer-assisted analysis of Ϸ30 the moth piggyBac (33, 34). DM was the third species where ProtoP, Hoppel, and ProtoP࿝B copies and their target sites shows piggyBac DNA transposons were identified. Recently, piggyBac that ProtoP-like elements are flanked by 7-bp TSDs (not shown). TEs were also identified in mosquito (22) and fishes (35, 36). However, all studied P elements, including the P1࿝AG, P2࿝AG, These piggyBacs are characterized by Ϸ15-bp TIRs, including and P3࿝AG families present in AG (ref. 22; unpublished data; 5Ј-CCC and GGG-3Ј, and the TTAA TSDs. Using PSI-BLAST,we RU) are characterized by 8-bp TSDs. Usually, the TSD size is found that two hypothetical genes annotated in the DMG (gi one of a few hallmarks of different superfamilies of eukaryotic 7303440, 21355395) are encoding proteins similar to the piggyBac Ϫ DNA transposons. For example, 8-bp TSD is a characteristic of transposase (PSI-BLAST, E Ͻ 10 3). However, because the bio- hAT-like transposons; 4-bp, piggyBac; 5-bp, Transib; 9-bp, MuDr; logical function of these genes is unknown, it is not clear whether 3-bp, En͞Spm and Harbinger. mariner͞Tc1 is the only superfam- they are ‘‘host genes’’ derived from the transposases or remnants ily whose members are characterized by either 2-bp or 3-bp of young piggyBacs. Analogously, several hypothetical DM genes TSDs. Presumably, the phylogenetic place of the ProtoP trans- are former Harbinger transposases (ref. 6; RU). posase as an ‘‘outlier’’ among other P transposases (Fig. 4) may be related to the TSD size. While this paper was in review, we DNAREP1࿝DM. On the basis of a multiple alignment of Ϸ100 became aware of a similar report (28) describing a 3,410-bp expanded copies of the 300-bp ARS320 repetitive element autonomous Hoppel transposon (nearly identical to ProtoP de- identified in fly (GenBank J01068), we reconstructed a 600-bp posited in RU in 1999). The authors suggested that the intronless consensus sequence. The ARS320 element matches an internal P-like transposase is a result of retrotransposition. However, they portion of a 600-bp repetitive element called DNAREP1࿝DM

6572 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0732024100 Kapitonov and Jurka Downloaded by guest on September 29, 2021 Table 2. Non-LTR retrotransposons in the DMG elements from as many different clades of non-LTR retrotrans- Clade Families identified in DM posons as the DMG does. Among the seven different clades, Jockey is the most diverse and abundant in the DMG.We Jockey Fw, Jockey, Helena, BS, G, Doc, Strider, X, identified 15 previously undescribed families, bringing the total Het-A, TART, RU: Jockey2, Doc2࿝DM–Doc5࿝DM, number to 25 known families (Table 2). Like other TEs, families G2࿝DM–G7࿝DM, Bs3࿝DM, Bs4࿝DM, Fw2࿝DM, Fw3࿝DM of non-LTR retrotransposons that belong to the same clade are CR1 RU: DMCR1A characterized by Ͻ75% inter- and Ͼ85% intrafamily DNA II, RU: IVK࿝DM,* DMRT1A,** DMRT1B,** DMRT1C identities. R1 R1, RU: R1–2࿝DM LOA Bilbo, RU: Baggins1 LTR Retrotransposons. We identified and characterized 28 families R2 R2 of LTR retrotransposons in DM (Table 3). All fly LTR retro- ࿝ Penelope RU: DNAREP1 DM transposons belong to Gypsy, Copia, or Bel superfamilies (39), of Families following ‘‘RU:’’ are reported in this article and RU. Subsequent which Gypsy is the most diverse and abundant. Surprisingly, the nomenclature changes are marked by asterisks, and the secondary names are DMG does not contain nonautonomous LTR retrotransposons, *, You (49), and **, Waldo (49). present in mammals (40) and plants (6). On the basis of sequence identities and structural commonalities, Gypsy can by divided further into four major groups (Table 3), called Gypsy, MDG1, ࿝ (positions 100–400). Some DNAREP1 DM copies are inserted MDG3, and Osvaldo. The best-studied Gypsy group is composed ࿝ into other TEs (not shown). Therefore, DNAREP1 DM is a of canonical Gypsy-like families, including Gypsy, 297, 17.6, and transposable element. It is the most abundant TE in the DMG. Zam. The Osvaldo group includes retrotransposons similar to ࿝ Several thousand copies of DNAREP1 DM are present in the Osvaldo and Ulysses, discovered in Drosophila buzzatii (41) and Ϸ sequenced genome. They are 85% identical to the consensus D. virilis (42). Gypsy12 is an Osvaldo family we found in DM. The ࿝ sequence. Therefore, DNAREP1 DM is an old family of TEs, Ϸ2,300-bp Gypsy12 LTR is the longest LTR identified in DM. Ϸ transposed mainly 10 Myr ago. Currently, there are no young Some members of the Gypsy and Osvaldo groups encode env, the ࿝ subfamilies of DNAREP1 DM, i.e., with copies at least 95% surface envelope proteins, characteristic for animal retroviruses. ࿝ identical to each other. It indicates that DNAREP1 DM elements Independently, the Gypsy superfamily was split into four groups Ͼ lost their mobility 3 Myr ago. Surprisingly, members of the (43), on the basis of the phylogenetic study of reverse transcrip- ࿝ DNAREP1 DM family do not reveal any hallmarks of DNA tase, RNase H, and integrase domains, and by us (unpublished transposons and retrotransposons (TIRs, TSDs, 3Ј-microsatel- work; RU 1999) on the basis of the structural commonalities lites, etc). In 1999, we characterized DNAREP1࿝DM as a putative (Table 3). In RU, the Accord, Tabor, and Invader groups were DNA transposon (unpublished work; RU) because of its con- introduced instead of Gypsy, MDG1, and MDG3, respectively served termini and multiple internal deletions. Based on the (43). Members of the Bel superfamily (44) encode one long most recent data, another mechanism of DNAREP1࿝DM trans- polyprotein composed of gag, protease, reverse transcriptase, positions can be suggested. We found that a Ϸ200-bp 3Ј un- RNase H, and endonuclease domains. This superfamily was also translated region (UTR) portion of Penelope, an active Drosoph- called BCDRP (after Bel–Catch–Diver–Roo–Pao) and included ila virilis retrotransposon (37), is Ϸ72% identical to the Diver in fly and Catch in fugu (unpublished work; RU 1999). DNAREP1࿝DM consensus sequence (Fig. 8, which is published as supporting information on the PNAS web site). Moreover, the 3Ј Abundance of TEs. TEs account for 6% and 60% of the sequenced UTR of Penelope is required for its successful transposition (38). DM euchromatin and heterochromatin regions, respectively. Therefore, we consider DNAREP1࿝DM as a remnant of Given that heterochromatin constitutes Ϸ30% of the DMG (5), Penelope-like retrotransposon that was highly active in the TEs make up Ϸ22% of the whole genome. Our estimates are ancestral lineage of DM Ϸ10 Myr ago. We consider Penelope as three times higher than those reported recently (25). The a non-LTR retrotransposon. Alternatively, Penelope is also difference can be attributed to the different sets of TEs used as considered as a separate class of retroelements (38). Given that query sequences. Whereas the previous report (25) included TEs the DNAREP1࿝DM consensus sequence cannot be expanded, it from 39 families, we used elements representing Ϸ150 families. is unlikely that it is a solo LTR. Also, CENSOR and WU BLAST, used in our studies, are more sensitive than BLAST used previously (25). As a result, we Non-LTR Retrotransposons. Members of seven different clades of identified numerous families of TEs that are only 60% identical non-LTR retrotransposons populate the fly genome (Table 2). to known elements and are represented by only a few truncated Remarkably, of all sequenced genomes, no genome contains copies. There is a strong increase in the TE density on the

Table 3. LTR retrotransposons in the DMG TSD, LTR Superfamily͞group Family bp PBS* termini env

Gypsy͞Gypsy Gypsy, 297, 17.6, Tirant, Zam, Idefix, Nomad, Ninja, Burdock, HMS Beagle, Transpac, 4 Ϫ1 AG. . .YT ϩ

Springer; RU: Gypsy2–Gypsy4, Gypsy5,I Gypsy6, Gypsy7, Gypsy9, Gypsy10, EVOLUTION Quasimodo,II Accord, Accord2, GtwinIII Gypsy͞MDG1 MDG1, 412, Blood, Stalker; RU: Tabor,IV Stalker, Stalker2, Gypsy11 4 ϩ1 TG. . .CA Ϫ Gypsy͞MDG3 MDG3, Micropia, Blastopia; RU: Invader1–Invader6 4 5–8 TG. . .CA Ϫ Gypsy͞Osvaldo Circe; RU: Gypsy8, Gypsy12 4 35 TG. . .CA Ϫ Bel Bel, Roo, Batumi, Max; RU: Diver,V Diver2, RooA 5 1–3 TG. . .CA Ϫ Copia Copia, 1731; RU: Copia2࿝DM 5 5–7 TG. . .CA Ϫ

Families following ‘‘RU:’’ are reported in this article and RU. Subsequent nomenclature changes are marked as superscripts I–V and the secondary names are I, nik; II, antonia; III, hamilton; IV, wolfman; and V, mazi (50). *Start position of primer binding site (PBS); an internal portion starts at 1.

Kapitonov and Jurka PNAS ͉ May 27, 2003 ͉ vol. 100 ͉ no. 11 ͉ 6573 Downloaded by guest on September 29, 2021 Fig. 5. Density of TEs across chromosome 2. It was calculated as a percentage of TE-derived sequences per 100 kb in nonoverlapping windows. The Ϸ8-megabase (Mb) centromeric region is shaded in gray.

acrocentric chromosomes 2 (Fig. 5) and 3 (not shown) in major factor, one has to see families with copies Ͻ80% identical Ϸ500-kb regions separating euchromatin and paracentromeric to their consensus sequence. Most likely, this phenomenon can heterochromatin, which is mostly not sequenced. This chromo- be explained by relatively infrequent Ͼ100-kb deletions in somal distribution of TEs is very similar to that observed in AT paracentromeric heterochromatin composed mainly of TEs, (6). Given the fact that Ϸ80% of the paracentromeric͞ which have been accumulated in heterochromatin as a result of centromeric heterochromatin in DMG has not yet been se- a strong selection against their presence in the gene-rich eu- quenced, the percentage of TEs may be even higher. chromatin. It is unlikely that paracentromeric heterochromatin attracts de novo insertions of TEs as was suggested (46). For ͞ Age of TEs. All TEs identified in DMG are younger than 20 Myr, example, studies of de novo insertions of hAT and En Spm DNA analogous to those in AT (6). Importantly, both genomes lack transposons in AT show (47, 48) that they are targeted prefer- families with copies Ͻ80% identical to the family consensus, and entially outside paracentromeric heterochromatin, despite the they also do not contain LTR retrotransposons whose flanking apparent long-time accumulation of the same class TEs in LTRs are Ͻ80% identical to each other. It is unlikely that high heterochromatin (6, 11). rates of point mutation and short deletion in the fly genome (20, We thank Margaret Kidwell and reviewers for helpful comments, 45) are factors that alone can explain the age distribution of TEs. Dominique Anxolabehere for sharing unpublished results, and Jolanta For example, despite the 2- to 3-fold difference in the mutation Walichiewicz and Michael Jurka for assistance in preparing the manu- rates, there is no noticeable difference between the TE age script. This work was supported by the Grant 2 P41 LM06252-04A1 from distributions in AT and DM. Also, if the mutation rate is the the National Institutes of Health.

1. Berg, D. E. & Howe, M. H., eds. (1987) Mobile DNA (Am. Soc. Microbiol., 26. Clark, J. B. & Kidwell, M. G. (1997) Proc. Natl. Acad. Sci. USA 94, 11428–11433. Washington, DC). 27. Hagemann, S. & Pinsker, W. (2001) Mol. Biol. Evol. 18, 1979–1982. 2. Kidwell, M. G. & Lisch, D. R. (2001) Evol. Int. J. Org. Evol. 55, 1–24. 28. Reiss, D. E., Quesneville, H., Noaud, D., Andrieu, O. & Anxolabehere, D. 3. Fedoroff, N. V. (1999) Ann. N.Y. Acad. Sci. 870, 251–264. (2003) Mol. Biol. Evol. 20, 869–879. 4. Craig, N. L. (1995) Science 270, 253–254. 29. Feschotte, C. & Wessler, S. R. (2002) Proc. Natl. Acad. Sci. USA 99, 280–285. 5. Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., 30. Gaunt, M. W. & Miles, M. A. (2002) Mol. Biol. Evol. 19, 748–761. Amanatides, P. G., Scherer, S. E., Li, P. W., Hoskins, R. A., Galle, R. F., et al. 31. Sharakhov, I. V., Serazin, A. C., Grushko, O. G., Dana, A., Lobo, N., (2000) Science 287, 2185–2195. Hillenmeyer, M. E., Westerman, R., Romero-Severson, J., Costantini, C., 107, 6. Kapitonov, V. V. & Jurka, J. (1999) Genetica 27–37. Sagnon, N., et al. (2002) Science 298, 182–185. 7. Kapitonov, V. V. & Jurka, J. (2001) Proc. Natl. Acad. Sci. USA 98, 8714–8719. 32. Zdobnov, E. M., von Mering, C., Letunic, I., Torrents, D., Suyama, M., Copley, 8. Pimpinelli, S., Berloco, M., Fanti, L., Dimitri, P., Bonaccorsi, S., Marchetti, E., R. R., Christophides, G. K., Thomasova, D., Holt, R. A., Subramanian, G. M., Caizzi, R., Caggese, C. & Gatti, M. (1995) Proc. Natl. Acad. Sci. USA 92, et al. (2002) Science 298, 149–159. 3804–3808. 5, 9. Kapitonov, V. V., Holmquist, G. P. & Jurka, J. (1998) Mol. Biol. Evol. 15, 33. Fraser, M. J., Ciszczon, T., Elick, T. & Bauser, C. (1996) Insect Mol. Biol. 611–612. 141–151. 10. Ananiev, E. V., Phillips, R. L. & Rines, H. W. (1998) Proc. Natl. Acad. Sci. USA 34. Cary, L. C., Goebel, M., Corsaro, B. G., Wang, H. G., Rosen, E. & Fraser, M. J. 95, 10785–10790. (1989) Virology 172, 156–169. 11. The Arabidopsis Genome Initiative (2000) Nature 408, 796–815. 35. Smit, A. F. A. (2002) Repbase Rep. 2 (1), 35. 12. Jurka, J. (2000) Trends Genet. 16, 418–420. 36. Kapitonov, V. V. & Jurka, J. (2002) Repbase Rep. 2 (6), 21. 13. Jurka, J., Klonowski, P., Dagman, V. & Pelton, P. (1996) Comput. Chem. 20, 37. Evgen’ev, M. B., Zelentsova, E. S., Shostak, N. G., Kozitsina, M., Barskyi, V., 119–121. Lankenau, D. H. & Corces, V. G. (1997) Proc. Natl. Acad. Sci. USA 94, 196–201. 14. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. 38. Pyatkov, K. I., Shostak, N. G., Zelentsova, E. S., Lyozin, G. T., Melekhin, M. I., Mol. Biol. 215, 403–410. Finnegan, D. J., Kidwell, M. G. & Evgen’ev, M. B. (2002) Proc. Natl. Acad. Sci. 15. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. USA 99, 16150–16155. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389–3402. 39. Eickbush, T. & Malik, H. (2002) in Mobile DNA II, eds. Craig, N. L., Craigie, 16. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic Acids Res. 22, R., Gellert, M. & Lambowitz, A. M. (Am. Soc. Microbiol., Washington, DC), 4673–4680. pp. 1111–1144. 17. Nicholas, K. B., Nicholas, H. B. & Deerfield, D. W. (1997) EMBNET News 4, 1–4. 40. Smit, A. F. (1999) Curr. Opin. Genet. Dev. 9, 657–663. 18. Salamov, A. A. & Solovyev, V. V. (2000) Genome Res. 10, 516–522. 41. Pantazidis, A., Labrador, M. & Fontdevila, A. (1999) Mol. Biol. Evol. 16, 909–921. 19. Kumar, S., Tamura, K., Jakobsen, I. B. & Nei, M. (2001) Bioinformatics 17, 42. Scheinker, V. S., Lozovskaya, E. R., Bishop, J. G., Corces, V. G. & Evgen’ev, 1244–1245. M. B. (1990) Proc. Natl. Acad. Sci. USA 87, 9615–9619. 20. Li, W.-H. (1997) Molecular Evolution (Sinauer, Sunderland, MA). 43. Malik, H. S. & Eickbush, T. H. (1999) J. Virol. 73, 5186–5190. 21. Bernstein, M., Lersch, R. A., Subrahmanyan, L. & Cline, T. W. (1995) Genetics 44. Malik, H. S., Henikoff, S. & Eickbush, T. H. (2000) Genome Res. 10, 139, 631–648. 1307–1318. 22. Holt, R. A., Subramanian, G. M., Halpern, A., Sutton, G. G., Charlab, R., 17, Nusskern, D. R., Wincker, P., Clark, A. G., Ribeiro, J. M., Wides, R., et al. 45. Petrov, D. A. (2001) Trends Genet. 23–28. (2002) Science 298, 129–149. 46. Dimitri, P. & Junakovic, N. (1999) Trends Genet. 15, 123–124. 23. Kurenova, E. V., Leibovich, B. A., Bass, I. A., Bebikhov, D. V., Pavlova, M. N. 47. Tissier, A. F., Marillonnet, S., Klimyuk, V., Patel, K., Torres, M. A., Murphy, & Danilevskaia, O. N. (1990) Genetika 26, 1701–1712. G. & Jones, J. D. (1999) Plant Cell 11, 1841–1852. 24. Kholodilov, N. G., Bolshakov, V. N., Blinov, V. M., Solovyov, V. V. & 48. Parinov, S., Sevugan, M., De, Y., Yang, W. C., Kumaran, M. & Sundaresan, V. Zhimulev, I. F. (1988) Chromosoma 97, 247–253. (1999) Plant Cell 11, 2263–2270. 25. Bartolome, C., Maside, X. & Charlesworth, B. (2002) Mol. Biol. Evol. 19, 49. Berezikov, E., Bucheton, A. & Busseau, I. (2000) Genome Biol. 1, 1–15. 926–937. 50. Bowen, N. J. & McDonald, J. F. (2001) Genome Res. 11, 1527–1540.

6574 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0732024100 Kapitonov and Jurka Downloaded by guest on September 29, 2021