
Remarkable compartmentalization of transposable elements and pseudogenes in the heterochromatin of the Tetraodon nigroviridis genome

Corinne Dasilva*, Hajer Hadji*, Catherine Ozouf-Costaz†, Sophie Nicaud*, Olivier Jaillon*, Jean Weissenbach*, and Hugues Roest Crollius*‡

*Genoscope and Centre National de la Recherche Scientifique, Unité Mixte de Recherche 8030, 2 Rue Gaston Crémieux, 91057 Evry Cedex, France; and †Muséum National d’Histoire Naturelle, Centre National de la Recherche Scientifique, FR 1541, 43 Rue Cuvier, 75231 Paris Cedex 05, France

Edited by Margaret G. Kidwell, University of Arizona, Tucson, AZ, and approved August 9, 2002 (received for review May 13, 2002)

PNAS | October 15, 2002 | vol. 99 | no. 21 | 13636–13641 | www.pnas.org/cgi/doi/10.1073/pnas.202284199

Tetraodon nigroviridis is among the smallest known vertebrate genomes and as such represents an interesting model for studying genome architecture and evolution. Previous studies have shown that Tetraodon contains several types of tandem and dispersed repeats, but that their overall contribution is <10% of the genome. Using genomic library hybridization, fluorescent in situ hybridization, and whole genome shotgun and directed sequencing, we have investigated the global and local organization of repeat sequences in Tetraodon. We show that both tandem and dispersed repeat elements are compartmentalized in specific regions that correspond to the short arms of small subtelocentric chromosomes. The concentration of repeats in these heterochromatic regions is in sharp contrast to their paucity in euchromatin. In addition, we have identified a number of pseudogenes that have arisen through either duplication of genes or the retrotranscription of mRNAs. These pseudogenes are amplified to high numbers, some with more than 200 copies, and remain almost exclusively located in the same heterochromatic regions as transposable elements. The sequencing of one such heterochromatic region reveals a complex pattern of duplications and inversions, reminiscent of active and frequent rearrangements that can result in the truncation and hence inactivation of transposable elements. This tight compartmentalization of repeats and pseudogenes is absent in large vertebrate genomes such as those of mammals and is reminiscent of genomes that remain compact during evolution such as Drosophila and Arabidopsis.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: WGS, whole genome shotgun; BAC, bacterial artificial chromosome; FISH, fluorescent in situ hybridization; iSET, isolated SET; TRAP, translocon-associated protein; EZH2, enhancer of Zeste homolog 2; LINE, long interspersed element.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. AJ457054, AJ313481, AJ313482, and AJ313483).

‡To whom correspondence should be addressed.

The reasons for which some genomes accommodate large quantities of apparently unnecessary DNA, whereas others seem to either engage in a more efficient "pruning" process or are equipped with better protective mechanisms against parasitic elements, remain a mystery (1). It is, however, largely accepted that transposable elements and pseudogenes are two categories of such sequences that generate large amounts of "ballast" DNA and thus contribute to increasing genome size. The genomes of Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana comprise between 100 and 125 Mb of euchromatin, and this fraction has now been nearly fully sequenced (2–4). In the euchromatin of these small genomes, only between 3% and 10.5% of nucleotides are contributed by transposable elements (5). Only 10 pseudogenes have been documented in Flybase (6), 432 in Wormbase (7), and 750 in the A. thaliana genome (4). The heterochromatic fraction (centromeres, short chromosome arms or dispersed heterochromatin) varies between approximately 10 and 60 Mb for A. thaliana and D. melanogaster, respectively, and concentrates a large proportion of the transposable elements of these genomes (β heterochromatin) while containing few active genes (4, 8, 9), which are often interspersed with alternating blocks of tandem repeats (α heterochromatin). The majority of this sequence has not yet been sequenced and/or assembled, mostly because the repetitive nature of these regions is likely to make any shotgun sequence assembly extremely difficult.

The only repeat-rich genome that has been extensively sequenced and analyzed is the human genome (5). Here the density and distribution of transposable elements and pseudogenes in the euchromatin contrasts severely with the situation in sequenced small genomes. At least 45% of human euchromatin is contributed by transposable elements, and the annotation of chromosomes 20, 21, and 22 (10–12) shows that ≈20% of annotated features are pseudogenes, which provides an estimate of 6,000 pseudogenes in the entire genome. Repeats and pseudogenes are scattered in the euchromatin as well as in the heterochromatin, with variations in local densities but with a relative overall uniformity. Is this situation seen in human general for vertebrates or is it typical of a genome that is much less efficient at the pruning of transposable elements and pseudogenes than a compact genome?

To investigate this issue, we have studied the distribution of transposable elements and pseudogenes in a compact vertebrate genome. Among vertebrates, the pufferfish Tetraodon nigroviridis possesses the smallest known genome (13), and we have previously characterized the sequence and genomic localization of several repeat sequences (including the centromeric and subtelocentric satellite sequences) and established a preliminary catalog of transposable elements (14). This initial analysis based on 50 Mb of random shotgun reads indicated that the genome contains very few repeats (≈7%, of which 1% is contributed by transposable elements) compared with other known vertebrates such as mammals, but that most families of mobile elements, including LTR retrotransposons, non-LTR long interspersed elements (LINEs), and DNA transposons are represented. Tetraodon is 18–30 million years distant from Takifugu rubripes, a marine pufferfish for which a whole genome shotgun (WGS) assembly is now available (15). Here, we show that the distribution of transposable elements and pseudogenes in Tetraodon is highly compartmentalized between heterochromatin and euchromatin. The genomic landscape is reminiscent of that seen in compact genomes across other phyla and contrasts with the larger mammalian genomes. Based on these results we discuss the implications that compact eukaryotic genomes may deal with transposable elements using common mechanisms.

Materials and Methods

Colony Filter Hybridizations. The construction of two Tetraodon bacterial artificial chromosome (BAC) libraries has been described (14). An equal number of clones from library A (pBACe.3.6, EcoRI) and library B (pBeloBAC11, HindIII) were spotted on nylon membranes (Appligene, Strasbourg, France) by using a robotic arrayer (QBOT, Genetix, New Milton, U.K.). Membranes were placed on 2YT agar overnight at 37°C, and colonies were then lysed and DNA was fixed as described (16). Each membrane contains 27,648 clones in duplicate (55,296 colonies) or 10 equivalents of the Tetraodon genome. Transposable element hybridization probes were PCR products amplified with incorporation of digoxigenin-labeled nucleotides (Roche Diagnostics) from the Tetraodon BAC clone C0AA029L14 (GenBank accession no. AJ457054; see below) and correspond to the following positions on this sequence: Dm-Line, 5′ end of putative coding region between base pairs 62656 and 62902; TC1-like, nearly the entire coding sequence between base pairs 82039 and 83147; Dr-Line, 5′ end of putative coding region between base pairs 108856 and 109120; and Copia-like, central region of the putative coding region between base pairs 66851 and 67038. The 10-bp satellite probe was prepared as described (14). After denaturation, all probes from transposable elements were hybridized without competition by using a nonisotopic protocol as described (17).

BAC Sequencing and Assembly. The insert of BAC clone C0AA029L14 was sequenced by using standard protocols for shotgun sequencing of BAC inserts (GenBank accession no. AJ457054). Because of the particular complexity of this sequence, each of the 1,299 reads was manually edited and assembled by using the Staden package (18), taking into account the pairing and estimated distances between end-clones. The assembly was verified and confirmed by restriction digest and electrophoresis of the BAC DNA with six enzymes expected to cut frequently (AatII, BglII, BsrGI, EagI, SpeI, SphI), in comparison with the patterns expected from the assembled sequence (data not shown). In addition, the total size of the assembled sequence is in perfect agreement with restriction digests obtained with three rare cutting enzymes (NotI, PmeI, PacI; data not shown).

Fluorescent in Situ Hybridization (FISH). Biotin- and digoxigenin-labeled Dm-Line and TC1-like probes were mixed in Quantum-Appligene high stringency Hybrisol VI at 10 ng/µl each. Probes were hybridized and detected on Tetraodon freshly thawed chromosome preparations without any pretreatment, according to the Quantum-Appligene protocol for repetitive probes (19). Preparations were mounted in 1.2 ng/µl 4′,6-diamidino-2-phenylindole in Antifade (Vector Laboratories) and analyzed by using Genus FISH-imaging equipment and software for animal chromosomes from Applied Imaging (Santa Clara, CA).

Sequence Comparisons. The sequence of BAC clone C0AA029L14 was annotated by comparisons to nucleotide and protein databases using BLAST (20). WGS sequencing of the Tetraodon genome at Genoscope and the Whitehead Institute for Genome Research (Cambridge, MA) will be described in full elsewhere. To estimate the copy numbers of enhancer of Zeste homolog 2 (EZH2), translocon-associated protein α (TRAPα), Trapeze, and isolated SET (iSET), exons of ≈100 bp were selected and compared by using BLASTN to 2.6 and 2.9 genome equivalents of shotgun reads from Genoscope and the Whitehead Institute, respectively. Alignments that were retained contained at least 20 consecutive identical bases (W = 20) and covered the entire length of the exons, and consecutive mismatches were not allowed (X = 5). The number of alignments was then adjusted to one genome equivalent for comparison purposes. Substitutions and insertion/deletion analysis of the Trapeze sequence was performed by first identifying all shotgun sequences that contained exons 5–8 and exons 9–12. Each set of sequences was aligned by using PILEUP in GCG version 10-2 (21) and manually edited. Substitutions, deletions, and insertions were then automatically identified by comparison to the original EZH2 and TRAPα sequences with purpose-built scripts.

Results

Tetraodon Transposable Elements Are Highly Compartmentalized in Heterochromatin. To investigate the distribution of transposable elements in the Tetraodon genome, we hybridized four probes to a 10-genome equivalent Tetraodon BAC library. Each probe was specific for a different element, which belongs to the two main classes of transposable elements: class I retrotransposons with and without LTR (probes copia-like, Dm-Line, and Dr-Line), and DNA transposons (probe TC1-like). Each probe identifies between 1.6% and 3.3% of the clones arrayed on the membranes (Fig. 1A), in agreement with previous results that showed that the Tetraodon genome contains few transposable elements compared with other vertebrate genomes (14). Strikingly, the four probes hybridize to an unexpectedly high fraction of clones in common. For instance, of 922 clones that hybridize to the Dm-Line probe (a retrotransposon), 423 clones (46%) also hybridize to the TC1-like probe (a DNA transposon). This finding shows that the four transposable elements are clustered in close proximity to each other in regions that cover ≈5% of the genome (1,562 clones of 27,648).

The BAC library was then hybridized with a 10-bp tandem repeat that we previously localized specifically in short heterochromatic arms of approximately 10 pairs of subtelocentric chromosomes (19). Of the 259 positive clones, 30% are shared with the set of clones identified by the Dr-Line element, instead of the 0.015% expected if the two sequences were independently distributed in the genome. This finding clearly suggests that the satellite and the LINE element are located in the same regions of the genome. To confirm this result, we cohybridized the Dm-Line and the TC1-like probes to metaphase chromosomes by double-color FISH (Fig. 1 B and C). The two probes specifically hybridize to the same short heterochromatic arms of six pairs of subtelocentric chromosomes, where the 10-mer satellite resides.

Fig. 1. One satellite and four transposable element probes were hybridized to a 10-genome equivalent Tetraodon BAC library, and a large fraction of clones in common were identified. (A) The four transposable elements include Dr-Line, a LINE element similar to a Danio rerio (zebrafish) LINE; Dm-Line, which is a different LINE element homologous to Drosophila factor I; a TC1-like sequence; and a copia-like sequence. Together the five probes (10-mer satellite, Dr-Line, Dm-Line, TC1-like, and Copia-like) identify 1,743 clones, of which 1,562 are identified with at least one transposable element and 76 clones hybridize with all four elements. The Dm-Line (B) and the TC1-like (C) probes were cohybridized by FISH to the same Tetraodon metaphase chromosome preparations and show distinct signals in the same pairs of small subtelocentric chromosomes. (Magnification: ×750.)

Frequent Rearrangements in Heterochromatin. We further investigated the local organization of Tetraodon heterochromatin by sequencing one of the 76 BAC clones that was positive with all four probes tested on the library (BAC clone C0AA029L14, Fig. 1A). A comparison of the 115-kb insert sequence against itself shows that 34% is duplicated at least once either in tandem or inverted orientation, in multiple fragments ranging from a few hundred base pairs to 15 kb (Fig. 2). This organization suggests that frequent rearrangements take place between such sequences in this region. The insert contains 18 transposable elements (Table 1) of which only two are complete. These are duplicate copies of a TC1-like transposon with the same in-frame within the gene, indicating that both are nonfunctional and that one originates from a local duplication. The abundance of transposable elements in this BAC clone is in sharp contrast with their average frequency in the Tetraodon sequence database (0.9% of nucleotides) (14), thus supporting at the sequence level our observations based on hybridization results that indicate a high density of transposable elements in heterochromatic regions compared with the rest of the genome.

Fig. 2. Structure and annotation of BAC clone C0AA029L14 positive with all four transposable elements tested on the BAC library. Analysis of duplicated sequences larger than 1,000 bases with MIROPEATS (45) shows that >34% of the sequence is not unique. Arched lines join the beginning and end of duplicated regions that are represented as thick horizontal lines. The Trapeze pseudogene (filled box) and the six iSET pseudogenes (open boxes) are represented on either side of the scale bar (forward and reverse strand).

Table 1. Transposable elements, genes, and pseudogenes identified in BAC C0AA029L14

Category                 Name              Copies   Total   % of BAC DNA
Transposable elements    Copia-Ten         3        18      16.6
                         Rex1-Ten          4
                         TC1-Ten           2
                         FactorI-Ten       3
                         Maui-Ten          1
                         BUSTER2 like      5
Genes and pseudogenes    iSET domains      6        13      8.2
                         Protein kinases   5
                         Trapeze           1
                         Unknown ORF       1

Trapeze, a Chimeric Pseudogene, Is Present in 50 Copies. The 13 genes and gene fragments identified in the BAC insert can be grouped into three categories: (i) one highly conserved ORF; (ii) five fragments with homologies to protein kinase genes; and (iii) one Trapeze and six iSET pseudogenes. The highly conserved ORF finds similar sequences in metazoans, plants, and prokaryotes, and belongs to the Cluster of Orthologous Groups (COG) of proteins COG1990 (22) whose members are of unknown function. The pseudogene that we have called Trapeze is made up of 13 exons based on similarities to known mammalian genes. The first eight exons are similar to the human EZH2 gene, and the last four exons are similar to the human TRAPα gene (Fig. 3), whereas exon 9 is chimeric, containing sequences similar to both EZH2 (exon 9) and TRAPα (exon 5). This fusion event introduces a frame shift that comes in addition to the absence of a start methionine and suggests that Trapeze is a pseudogene. In addition, no expression product of a potential Trapeze gene could be found by either RT-PCR or cDNA library screenings. Several shotgun plasmid clones from the WGS sequence project were identified as containing the native Tetraodon EZH2 and TRAPα genes and fully sequenced, revealing the complete structure of the two genes (Fig. 3). This result, in addition to the identification of a TRAPα cDNA clone in a Tetraodon cDNA library by hybridization, indicates that the native EZH2 and TRAPα genes still exist in the Tetraodon genome. Trapeze is thus the result of a fusion event between duplicated copies of the Tetraodon EZH2 and TRAPα genes.

In the process of exploring the Tetraodon sequence database for sequences homologous to either EZH2 or TRAPα, we identified an unexpectedly high number of matching copies. To confirm this finding, we selected one exon in each unique half of the EZH2 and TRAPα genes and two exons from Trapeze (exons 3 and 12, Fig. 3) and searched for matching sequences in two independent Tetraodon WGS databases. Results show a striking difference between exons that are in Trapeze and exons that are uniquely found in the single-copy genes EZH2 and TRAPα (Table 2). Exon 2 of TRAPα and exon 12 of EZH2 identify an average of one sequence as expected for single-copy exons (see Materials and Methods), whereas in contrast exons 3 and 12 of Trapeze identify an average of 51 sequences. This finding suggests that the fusion of EZH2 and TRAPα to create Trapeze was followed by an amplification process that results today in the presence of approximately 50 copies of this chimeric pseudogene. Indeed, because there is no evidence of more than one original complete copy of EZH2 and TRAPα, the amplification of sequences derived from EZH2 and TRAPα must be subsequent to the fusion event.

Fig. 3. Exon–intron structure of the Trapeze pseudogene (GenBank accession no. AJ313483; Middle) and the native Tetraodon EZH2 gene (GenBank accession no. AJ313481, Top) and TRAPα genes (GenBank accession no. AJ313482, Bottom). Filled exons indicate EZH2 exons both in the native gene and Trapeze, and open exons indicate TRAPα exons, both in Trapeze and the native gene. Exon 9 of Trapeze is the result of a fusion between part of exon 9 of EZH2 and part of exon 5 of TRAPα. The Tetraodon EZH2 gene spans <4 kb, a 10-fold reduction in size compared with the human EZH2 gene (46). Underlined exons were used to evaluate the number of copies of each gene in the Tetraodon genome by sequence comparison. The region in Trapeze underlined with a thin dashed line was used as probe in FISH and BAC library hybridization, and the region underlined with a thick dashed line was used for deletion rate measurements. The dashed box spans the exons that code for the SET domain of EZH2.

Table 2. Copies of EZH2, TRAPα, Trapeze, and iSET exons identified in Tetraodon shotgun data

Sequence          Number of copies in 1×      Number of copies in 1×
                  Genoscope shotgun           WICGR shotgun
TRAP_exon2        0                           1
EZH2_exon12       3                           0
Trapeze_exon3     100                         31
Trapeze_exon12    75                          28
iSET              192                         318

WICGR, Whitehead Institute for Genome Research.

iSET, a Retrotransposed Pseudogene, Is Highly Amplified in the Genome. The C-terminal region of human EZH2, mouse EZH1, and fly E(z) proteins contains a highly conserved SET domain (23). This domain spans the last five exons of the predicted Tetraodon EZH2 protein. Because Trapeze is similar only to the first eight exons of EZH2, it does not contain the SET sequence. However, the sequence of the BAC clone C0AA029L14 contains six dispersed copies of an iSET domain. The six 125-aa sequences are nearly identical to each other but only 57% similar and 39% identical to the Tetraodon EZH2 SET domain (see Fig. 5, which is published as supporting information on the PNAS web site, www.pnas.org). The iSET sequence is intronless and contains a stop codon in position 77, implying that it is not part of a functional protein. A comparison of iSET to TREMBL (24) shows that the best alignment is obtained with the SET domain of the Tetraodon EZH2 protein. Taken together, these observations suggest that iSET is a pseudogene that originates from the reverse transcription of a truncated 3′ EZH2 mRNA. The sequences flanking the six copies of iSET in the insert of the BAC clone are always identical up to at least a few hundred base pairs upstream and downstream, indicating that iSET does not multiply as an isolated sequence but rather as part of larger segmental duplications. The minimal region of sequence identity between the six loci containing iSET is 2.3 kb. Interestingly, the sequences flanking iSET sequences always include the same 3′ segment of a LINE element 900 bp upstream, pointing toward a possible mechanism for the appearance of this pseudogene (see Discussion). No poly(A) tail in the immediate downstream region of iSET could be identified. The high level of divergence between iSET and the EZH2 SET domain indicates a very ancient retrotranscription event, whereas the high similarity between six copies that are presumably under no selection indicates a very recent serial duplication event. When searched against the two available Tetraodon WGS databases, iSET identifies an average of 240 sequences per genome, and thus is ≈5 times more abundant than Trapeze (Table 2).

Multiple Copies of Trapeze and iSET Are Tightly Linked to Transposable Elements and Heterochromatin. We hybridized one probe from Trapeze (spanning exons 1–4) and one from iSET to the Tetraodon BAC library. Trapeze identified 396 positive clones, of which 368 (93%) are included in the 1,562 clones that we previously identified with the four transposable element probes (Fig. 4), indicating that the 50 copies of Trapeze are not randomly distributed in the genome but are located mainly in heterochromatin, in close association with transposable elements. We hybridized the same probe onto metaphase chromosomes, revealing dispersed and abundant signals in short heterochromatic arms of ≈6–10 pairs of subtelocentric chromosomes (data not shown), i.e., at the same location where transposable elements and the 10-mer satellite repeat are preferentially located. No signals could be seen in the rest of the genome. The 1,136 positive clones identified by the iSET probe in the 10× BAC library are consistent with approximately 240 copies in the genome. Notably, 1,028 of these positive clones (90% of the iSET-positive clones, and 66% of the 1,562 clones identified by transposable elements) are common with clones identified by transposable elements and include the 368 clones previously identified by Trapeze (Fig. 4). These data indicate that iSET and Trapeze are invariably found in the same BAC clones, and that these clones represent a distinct subset of the entire library, which was also previously shown to concentrate most transposable elements. This is evidence, in a vertebrate genome, of tight linkage between two different pseudogenes that originally seem to derive from the same gene, are amplified to approximately 50 and 240 dispersed copies, respectively, and are confined to a very specific compartment of the genome, i.e., the heterochromatin.

Fig. 4. Hybridization results from the iSET domain and the Trapeze pseudogene, in comparison with those of transposable elements. Of the 1,136 BAC clones identified by iSET, 1,028 were already identified by one of the four transposable elements previously hybridized to the library. Similarly, of the 396 clones positive with Trapeze, 368 are also positive with iSET, indicating that both sequences are always physically associated in the genome.

The Trapeze Pseudogene Is Submitted to a High Rate of DNA Loss. Pseudogenes are sequences that are generally thought to be unconstrained by natural selection and thus diverge neutrally by the accumulation of substitutions, insertions, and deletions. This feature makes it possible to use pseudogenes to study the fate of nonfunctional DNA in a genome (25, 26). For instance, it has been shown that pseudogenes in the small genomes of different species of Drosophila are submitted to a higher rate of deletions compared with the larger genomes of mammals (27) and to the 11 times larger genome of the Laupala cricket (28), thus providing evidence that DNA loss may be a long-term determinant of genome size.

Here, we have examined the pattern of substitutions, insertions, and deletions in multiple copies of two regions of the Trapeze pseudogene: 19 copies of the sequence encompassing exons 5–8 in comparison to the homologous region of the EZH2 gene (874 bp), and 47 copies of the sequence encompassing exons 9–12 in comparison to the homologous region in the TRAPα gene (690 bp) (Fig. 3 and Table 3). This set of 66 Trapeze sequences represents all of the single reads that spanned the required region of the pseudogene sampled from the entire Tetraodon WGS database. From the results it is striking that exons and introns are submitted to very different rates. Substitutions, insertions, and deletions are 1.7, 2.1, and 9.9 times more frequent in exons, respectively, and the average deletion size in exons is more than three times larger than in introns. This limited dataset provides a glimpse at the rate of DNA loss in unconstrained DNA in the Tetraodon genome. The mutational rate of DNA loss through small deletions and insertions can be expressed per nucleotide per nucleotide substitution as D = [(rate of deletion per nucleotide substitution) × (average deletion size)] − [(rate of insertion per nucleotide substitution) × (average insertion size)] (29). We find that the rate of DNA loss for the Trapeze sequences examined here is D = 0.5619, suggesting that Tetraodon nonfunctional DNA is submitted to a higher rate of deletions than in human or mouse, but lower than in Drosophila (29). Overall it is striking that this value is in excellent agreement with the negative correlation between genome size and rate of DNA loss observed across genomes from dipterans, mammals, and nematodes, i.e., that smaller genomes tend to be submitted to a higher rate of DNA loss.

Table 3. Substitutions, insertions, and deletions in exons and introns of Trapeze pseudogenes

           Substitutions   Insertions   Deletions   Average insertion   Average deletion
                                                    sizes, bp           sizes, bp
Exons      803             44           82          1.3                 8.3
Introns    481             21           9           1.8                 2.7
Total      1,282           65           91          1.5                 7.7

To enable comparisons between exons and introns, values for introns are adjusted to correct for the fact that the total size of introns is smaller than the total size of exons. Values for exons are absolute values. The value of DNA loss rate (D; see text) was, however, calculated across all exons and introns by using uncorrected values.

Discussion

The distribution of transposable elements and pseudogenes in the compact pufferfish T. nigroviridis genome is in sharp contrast to the situation seen in the larger genomes of other vertebrates such as human or mouse. In Tetraodon, transposable elements are tightly clustered in a specific compartment where abundant satellite sequences are found and they constitute a minimal fraction of the genome (<10%). These regions are the short heterochromatic arms of approximately 10 pairs of small subtelocentric chromosomes. In addition, although pseudogenes were previously thought to be an extremely rare feature in pufferfishes with a single documented case in Takifugu rubripes (30), we show that the Tetraodon genome contains several hundred copies of a few highly amplified pseudogenes, found in close association with transposable elements.

A number of studies have already shown that transposable elements are preferentially clustered in or around centric and telomeric heterochromatin in other sequenced genomes such as those of Drosophila and Arabidopsis (4, 8, 31–34). Although some plant and insect genomes may reach enormous sizes in some species, those of Arabidopsis and Drosophila each represent a very compact version of the genome in their respective phylum, as does the genome of Caenorhabditis elegans among nematodes. It is striking that the genome of Tetraodon, the smallest known vertebrate genome, shares a number of features with these compact genomes, such as a preferential clustering of transposable elements in heterochromatin and the paucity of euchromatic pseudogenes. This feature is not shared with other vertebrate genomes such as mouse or human, where transposable elements and pseudogenes are abundant and well dispersed in euchromatin (5). Thus, we can delineate across several phyla a common trend in four small genomes: they are not immune to repeat invasions but the amplification and dispersion of parasitic sequences are somehow contained and show similar distribution profiles.

In Tetraodon, we show that some pseudogenes are highly amplified but remain exclusively in heterochromatin. The sequencing of a BAC clone from such a region shows that the multiple copies amplify by duplications of large segments as opposed to being restricted to the pseudogene sequences only. In all copies found, the iSET pseudogene, which corresponds to the last four exons of the EZH2 gene, is systematically downstream of and on the same segment as the 3′ end of a LINE element highly similar to the Drosophila factor I element (35) (Dm-Line in the hybridization experiments). It is thus possible that this LINE element was once inserted in intron 14 (Fig. 3) of the EZH2 gene and generated the iSET pseudogene via a transduction or "read through" mechanism where the LINE RNA polymerase skipped its own 3′ end to instead use the poly(A) signal of the EZH2 gene (36–38). Because the current Tetraodon EZH2 gene does not show any trace of this LINE element, this event would have been followed by the excision of the LINE copy from the intron of the EZH2 gene. We believe that the emergence of the first copy of the iSET pseudogene and the amplification of the different copies seen in the genome today are two independent events separated by a long evolutionary distance, judging by the high sequence similarity between the different copies (98% amino acid similarity), but the high divergence between these sequences and the original EZH2 exons (57% similarity).

Trapeze, the second highly prevalent pseudogene in T. nigroviridis heterochromatin, is not retroprocessed but is a chimeric structure between part of the EZH2 gene and part of the TRAPα gene. The much higher sequence similarity between the homologous parts of Trapeze and EZH2 compared with iSET and EZH2 suggests that the emergence of Trapeze occurred later than that of iSET.

The analysis of the spectrum of substitutions, insertions, and deletions in Trapeze is in agreement with previous studies showing that the number of deletions is higher than insertions in neutrally evolving DNA (27, 39). However, although most studies have focused on defunct transposable elements or retroprocessed pseudogenes, we have studied a pseudogene that retains an intron/exon structure. Strikingly, across all types of mutations (deletions, insertions, and substitutions) the mutational rate is significantly higher in exons than in introns. Assuming a model of uniform mutation frequencies in neutrally evolving DNA, these data suggest that mutations affecting exons are more likely to be retained than those affecting introns, implying that natural selection may be acting to reduce the potential of a pseudogene to produce a functional mRNA, let alone a functional protein. Overall, the deletion rate, in terms of frequency and size of deletions, is higher than in mammals and supports the view that small, but constant, deletions over long evolutionary periods may be one of the forces that tend to decrease the size of a genome (1, 40). Because Trapeze and iSET sequences seem to reside almost exclusively in heterochromatin, one question is whether their absence from euchromatin is the result of (i) a more intense deletion rate of parasitic sequences in this compartment of the genome, (ii) a transfer of such sequences from euchromatin to heterochromatin, or (iii) an amplification that is itself confined to heterochromatin. Any of these hypotheses would require a mechanism that is not yet known to occur in eukaryotic cells.

It is intriguing that the EZH2 gene is at the origin of both Trapeze and iSET pseudogenes, albeit via different routes. Strong experimental evidence links EZH2 to a conserved mechanism of eukaryotic gene silencing (41), and this protein has been shown to form a complex with the embryonic ectoderm development protein (42) and histone deacetylases that mediate the repression of gene transcription (43). More generally, SET domain proteins are seen as multifunctional chromatin regulators with roles that include telomeric and centromeric gene silencing (23) and possibly the determination of chromosome architecture. This observation raises the question of a potential link between the presence of high copy number EZH2-derived pseudogenes in transcriptionally repressed heterochromatin and the known function of EZH2 in eukaryotic cells.

In the human (5), Drosophila (44), and Takifugu (15) genome assembly projects, heterochromatin has been left out from the initial assembly for technical reasons, although many clones and shotgun sequences are available. This compartment is, however, likely to be a dynamic and vital part of the nucleus, and essential for understanding many aspects of and nuclear organization. We would like to draw attention to the vast amount of shotgun sequences that remain to be studied in human, Drosophila, Takifugu, and probably also mouse after the initial assembly, as a way to access and study the heterochromatic compartment in different eukaryotic genomes as was done here for Tetraodon. In particular, comparisons with Takifugu heterochromatic sequences should reveal whether the same clustering of transposable elements occurs and whether Trapeze and iSET pseudogenes also exist and are similarly amplified.

Our results for the Tetraodon genome contrast with well-known mammalian genomes and underline the fact that organisms from different phyla (plants, invertebrates, and vertebrates) with small genomes appear to share common features in terms of transposable element and pseudogene content and distribution. This finding may suggest common mechanisms in dealing with such sequences and may guide further studies on the of pseudogenes and transposable elements for better understanding of how the smallest known vertebrate genome achieves this extraordinary efficiency in packaging genetic information.

We thank Fiona Francis for critical reading of the manuscript, Frederic Brunet, Pierre Capy, and an anonymous reviewer for insightful comments on an earlier version of the text, and the Whitehead Institute Center for Genome Research for early access to part of the Tetraodon shotgun sequences used in this study.

1. Hartl, D. L. (2000) Nat. Rev. Genet. 1, 145–149.
2. The C. elegans Sequencing Consortium (1998) Science 282, 2012–2018.
3. Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., Amanatides, P. G., Scherer, S. E., Li, P. W., Hoskins, R. A., Galle, R. F., et al. (2000) Science 287, 2185–2195.
4. The Arabidopsis Genome Initiative (2000) Nature 408, 796–815.
5. International Human Genome Sequencing Consortium (2001) Nature 409, 860–921.
6. The Flybase Consortium (1999) Nucleic Acids Res. 27, 85–88.
7. Stein, L., Sternberg, P., Durbin, R., Thierry-Mieg, J. & Spieth, J. (2001) Nucleic Acids Res. 29, 82–86.
8. Pimpinelli, S., Berloco, M., Fanti, L., Dimitri, P., Bonaccorsi, S., Marchetti, E., Caizzi, R., Caggese, C. & Gatti, M. (1995) Proc. Natl. Acad. Sci. USA 92, 3804–3808.
9. Dimitri, P. & Junakovic, N. (1999) Trends Genet. 15, 123–124.
10. Deloukas, P., Matthews, L. H., Ashurst, J., Burton, J., Gilbert, J. G., Jones, M., Stavrides, G., Almeida, J. P., Babbage, A. K., Bagguley, C. L., et al. (2001) Nature 414, 865–871.
11. Hattori, M., Fujiyama, A., Taylor, T. D., Watanabe, H., Yada, T., Park, H. S., Toyoda, A., Ishii, K., Totoki, Y., Choi, D. K., et al. (2000) Nature 405, 311–319.
12. Dunham, I., Shimizu, N., Roe, B. A., Chissoe, S., Hunt, A. R., Collins, J. E., Bruskiewich, R., Beare, D. M., Clamp, M., Smink, L. J., et al. (1999) Nature 402, 489–495.
13. Hinegardner, R. (1968) Am. Naturalist 102, 517–523.
14. Roest Crollius, H., Jaillon, O., Dasilva, C., Ozouf-Costaz, C., Fizames, C., Fischer, C., Bouneau, L., Billault, A., Quetier, F., Saurin, W., et al. (2000) Genome Res. 10, 939–949.
15. Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J. M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A. F., et al. (2002) Science 297, 1301–1310.
16. Nizetic, D., Drmanac, R. & Lehrach, H. (1991) Nucleic Acids Res. 19, 182.
17. Maier, E., Roest Crollius, H. & Lehrach, H. (1994) Nucleic Acids Res. 22, 3423–3424.
18. Staden, R. (1996) Mol. Biotechnol. 5, 233–241.
19. Fischer, C., Ozouf-Costaz, C., Roest Crollius, H., Dasilva, C., Jaillon, O., Bouneau, L., Bonillo, C., Weissenbach, J. & Bernot, A. (2000) Cytogenet. Cell Genet. 88, 50–55.
20. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403–410.
21. Devereux, J., Haeberli, P. & Smithies, O. (1984) Nucleic Acids Res. 12, 387–395.
22. Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. (2000) Nucleic Acids Res. 28, 33–36.
23. Jenuwein, T., Laible, G., Dorn, R. & Reuter, G. (1998) Cell. Mol. Life Sci. 54, 80–93.
24. Bairoch, A. & Apweiler, R. (2000) Nucleic Acids Res. 28, 45–48.
25. Li, W. H., Gojobori, T. & Nei, M. (1981) Nature 292, 237–239.
26. Petrov, D. A. & Hartl, D. L. (2000) J. Hered. 91, 221–227.
27. Petrov, D. A., Lozovskaya, E. R. & Hartl, D. L. (1996) Nature 384, 346–349.
28. Petrov, D. A., Sangster, T. A., Johnston, J. S., Hartl, D. L. & Shaw, K. L. (2000) Science 287, 1060–1062.
29. Petrov, D. (2002) Theor. Popul. Biol. 61, 531–543.
30. Clark, M. S., Pontarotti, P., Gilles, A., Kelly, A. & Elgar, G. (2000) J. Immunol. 165, 4446–4452.
31. Biessmann, H., Champion, L. E., O’Hair, M., Ikenaga, K., Kasravi, B. & Mason, J. M. (1992) EMBO J. 11, 4459–4469.
32. Le, M.-H., Duricka, D. & Karpen, G. H. (1995) 141, 283–303.
33. Junakovic, N., Terrinoni, A., Di Franco, C., Vieira, C. & Loevenbruck, C. (1998) J. Mol. Evol. 46, 661–668.
34. Kapitonov, V. V. & Jurka, J. (1999) Genetica 107, 27–37.
35. Bucheton, A., Paro, R., Sang, H. M., Pelisson, A. & Finnegan, D. J. (1984) Cell 38, 153–163.
36. Moran, J. V., DeBerardinis, R. J. & Kazazian, H. H., Jr. (1999) Science 283, 1530–1534.
37. Goodier, J. L., Ostertag, E. M. & Kazazian, H. H., Jr. (2000) Hum. Mol. Genet. 9, 653–657.
38. Esnault, C., Maestre, J. & Heidmann, T. (2000) Nat. Genet. 24, 363–367.
39. Ophir, R. & Graur, D. (1997) Gene 205, 191–202.
40. Petrov, D. A. (2001) Trends Genet. 17, 23–28.
41. Laible, G., Wolf, A., Dorn, R., Reuter, G., Nislow, C., Lebersorger, A., Popkin, D., Pillus, L. & Jenuwein, T. (1997) EMBO J. 16, 3219–3232.
42. Sewalt, R. G., van der Vlag, J., Gunster, M. J., Hamer, K. M., den Blaauwen, J. L., Satijn, D. P., Hendrix, T., van Driel, R. & Otte, A. P. (1998) Mol. Cell. Biol. 18, 3586–3595.
43. van der Vlag, J. & Otte, A. P. (1999) Nat. Genet. 23, 474–478.
44. Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., Kravitz, S. A., Mobarry, C. M., Reinert, K. H., Remington, K. A., et al. (2000) Science 287, 2196–2204.
45. Parsons, J. D. (1995) Comput. Appl. Biosci. 11, 615–619.
46. Cardoso, C., Mignon, C., Hetet, G., Grandchamps, B., Fontes, M. & Colleaux, L. (2000) Eur. J. Hum. Genet. 8, 174–180.

Dasilva et al. PNAS | October 15, 2002 | vol. 99 | no. 21 | 13641