
Genomics 88 (2006) 323–332 www.elsevier.com/locate/ygeno

Tandem repeats in the CpG islands of imprinted genes

Barbara Hutter a, Volkhard Helms a, Martina Paulsen b,⁎

a Bioinformatik, FR 8.3 Biowissenschaften, Universität des Saarlandes, Postfach 151150, D-66041 Saarbrücken, Germany
b Genetik/Epigenetik, FR 8.3 Biowissenschaften, Universität des Saarlandes, Postfach 151150, D-66041 Saarbrücken, Germany

Received 21 December 2005; accepted 30 March 2006
Available online 11 May 2006

Abstract

In contrast to most genes in mammalian genomes, imprinted genes are monoallelically expressed depending on the parental origin of the alleles. Imprinted expression is regulated by distinct DNA elements that exhibit allele-specific epigenetic modifications, such as DNA methylation. These so-called differentially methylated regions frequently overlap with CpG islands. Thus, CpG islands of imprinted genes may contain special DNA elements that distinguish them from CpG islands of biallelically expressed genes. Here, we present a detailed study of CpG islands of imprinted genes in mouse and in human. Our study shows that imprinted genes more frequently contain tandem repeat arrays in their CpG islands than randomly selected genes in both species. In addition, mouse imprinted genes more frequently possess intragenic CpG islands that may serve as promoters of allele-specific antisense transcripts. This feature is much less pronounced in human, indicating an interspecies variability in the evolution of imprinting control elements. © 2006 Elsevier Inc. All rights reserved.

Keywords: Imprinting; CpG islands; Tandem repeats; DNA methylation; Repetitive elements; Epigenetics; Regulatory elements

To date, approximately 40 imprinted genes have been identified in the human and mouse genomes, respectively. Since their monoallelic expression depends on the parental origin, it is assumed that distinct DNA elements in imprinted genes are differentially recognized and modified in the parental germ lines and that the resulting epigenetic marks are maintained after fertilization [1]. Therefore, it has been questioned what sequence features distinguish imprinted from nonimprinted genes [2,3]. In several studies it was observed that many imprinted genes possess an unusual density of repetitive retroviral elements. Imprinted genes showed a depletion of short interspersed transposable elements (SINEs) [2] and an enrichment of long interspersed nuclear element 1 (LINE-1) repeats [4,5] compared to randomly selected genes. Although it was shown that LINE-1 and some SINEs are indeed differentially marked in the female and male germ lines [6–9], there is so far no evidence that these methylation marks can be stably maintained after fertilization. Thus, it is likely that additional elements are needed. Potential candidate elements for this purpose are so-called differentially methylated regions (DMRs) that exhibit distinct DNA methylation patterns on their parental alleles. Many DMRs have been shown to acquire a specific methylation pattern in one germ line but remain unmethylated in the other germ line [10]. Such DMRs represent key regulatory elements that are responsible for imprinted expression of neighboring genes, thereby contributing to the domain-like organization of imprinted genes.

Most DMRs overlap with CpG islands. These are characterized by a high density of CpG dinucleotides that can be targeted by DNA methylation. However, CpG islands are regulatory elements that are also common to biallelically expressed genes [11,12]. Usually they are located in the promoter region [13–15]. In contrast, DMRs of imprinted genes are not always related to transcriptional start sites of protein-coding genes. Some DMRs reside in introns where they serve as promoters for antisense transcripts [16–19]. Other DMRs are located upstream of imprinted genes, at some distance from the promoter region [20–22]. These DMRs may act as insulator or silencer elements as was shown for the DMR 5′ of the imprinted H19 gene [23,24].

⁎ Corresponding author. Fax: +49 0 681 302 2703.
E-mail address: [email protected] (M. Paulsen).

0888-7543/$ - see front matter © 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2006.03.019

Table 1
General sequence properties of human and mouse imprinted genes and control groups

| Group | Total sequence length (bp) | Median gene length (bp) | Total G + C content (%) | G + C content in repetitive elements (%) | Total CpG content (%) | CpG content in repetitive elements (%) |
|---|---|---|---|---|---|---|
| Mouse imprinted | 2,452,775 | 20,697 | 45.43 ± 4.65 | 43.90 ± 3.18 | 1.28 ± 0.46 | 0.87 ± 0.42 |
| Mouse ctrl1 | 5,864,555 | 21,603 | 46.37 ± 4.22 | 45.46 ± 2.43 | 1.35 ± 0.47 | 0.93 ± 0.29 |
| Mouse ctrl2 | 5,615,657 | 15,103 | 45.87 ± 4.11 | 45.59 ± 2.30 | 1.35 ± 0.52 | 0.93 ± 0.26 |
| Mouse all | 13,932,987 | 20,912 | 45.98 ± 4.26 | 45.21 ± 2.62 | 1.34 ± 0.49 | 0.95 ± 0.35 |
| Human imprinted | 3,259,027 | 26,583 | 46.93 ± 7.68 | 44.96 ± 5.65 | 1.82 ± 0.97 | 1.54 ± 0.79 |
| Human ctrl1 | 6,181,623 | 27,992 | 46.27 ± 6.37 | 45.82 ± 4.18 | 1.66 ± 0.83 | 1.68 ± 0.88 |
| Human ctrl2 | 6,783,964 | 24,270 | 46.02 ± 6.14 | 46.29 ± 4.37 | 1.70 ± 0.83 | 1.72 ± 0.87 |
| Human all | 16,224,614 | 27,247 | 46.30 ± 6.53 | 45.84 ± 4.57 | 1.71 ± 0.86 | 1.67 ± 0.86 |

Sequence lengths were accumulated per group. For mouse, undetermined nucleotides were subtracted. Columns 4 to 7 list means and standard deviations. The control groups (79 genes each) have approximately twice the number of sequences as the imprinted groups (39 murine and 38 human genes). Median gene lengths and G + C content are similar in imprinted and control groups.
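The per-group values in Table 1 are means and standard deviations of per-sequence percentages, with undetermined nucleotides subtracted for mouse. A minimal Python sketch of such a calculation follows. It is not the authors' Perl code: the function names `gc_percent` and `summarize` are hypothetical, the toy sequences are invented, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def gc_percent(seq: str) -> float:
    """G + C percentage of a sequence, ignoring undetermined nucleotides (N)."""
    seq = seq.upper()
    # Only A, C, G, and T count toward the length; N and other symbols are skipped.
    defined = sum(seq.count(base) for base in "ACGT")
    if defined == 0:
        raise ValueError("sequence contains no defined nucleotides")
    return 100.0 * (seq.count("G") + seq.count("C")) / defined


def summarize(values):
    """Return (mean, sample standard deviation) of per-sequence values.

    Assumption: the sample standard deviation is used; the source does not
    state which estimator was applied.
    """
    return mean(values), stdev(values)


# Toy sequences for illustration only, not data from the study.
group = ["ACGTNNNN", "GGCCAATT", "GCGCGCAT"]
per_sequence = [gc_percent(s) for s in group]
group_mean, group_sd = summarize(per_sequence)
```

Excluding N from the denominator keeps assembly gaps from lowering the apparent G + C content of a sequence.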

Frequent features of DMRs are arrays of tandem repeat motifs, and it has been suggested that such repeats are involved in the regulation of imprinting [25]. In addition to a potential role in imprinting, tandem repeat arrangements of DNA sequences are likely to be involved in various other epigenetic silencing and heterochromatin formation processes [25–27]. Many multicopy transgene arrays in mammalian genomes show increased methylation levels and reduced gene expression compared to corresponding single-copy transgenes [28]. Similarly, tandem repeat arrangements can attract DNA methylation in meiotic processes in filamentous fungi [29]. Tandem repeats are found within or adjacent to both paternally and maternally methylated DMRs. The so far described tandem repeat motifs vary from 5 to 400 bp in length (Supplementary Material Table S8, [5]). Although repeat arrays have been identified in mouse as well as in human, only a few orthologous DMRs possess highly conserved tandem repeat motifs. Likewise, the number of repeated motifs and their arrangement within the DMR are highly variable in different species [19,21,30,31]. Even though the frequent appearance of tandem repeat arrays associated with imprinted genes has been noted in many publications (see references in [5], Supplementary Material Table S8), it has so far not been systematically investigated whether such sequence elements are indeed enriched in imprinted genes compared to biallelically expressed genes.

Therefore, we compared in this work the CpG islands of imprinted genes against CpG islands of randomly selected genes in mouse and human and investigated the positions of CpG islands within the genes and their overlaps with repetitive elements. Searching specifically for tandem repeat arrays, we observed a significant enrichment of these motifs in CpG islands of imprinted genes.

Results

General sequence properties

For the analyses of CpG islands, genomic sequences of 39 murine and 38 human genes that showed pronounced imprinting effects in at least one of the two species were selected from the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov) and Ensembl (http://www.ensembl.org) databases. The genomic DNA segment used for each gene comprised the entire intragenic region as annotated in the database plus 10 kb upstream of the transcriptional start site and 10 kb downstream of the last exon. Likewise, two control groups, each including 79 randomly selected genes, were collected for mouse and for human (see Material and methods). In most cases, genes used in the human control groups were also included into the murine groups. Thus, similar to the selected imprinted genes, the murine and human control groups contained a large number of homologous genes (Supplementary Material Table S1). Since it cannot be excluded that the random selection of genes is biased, for example toward strongly expressed genes or toward sequences with a high G + C content, we collected the two control groups using different approaches. Actual differences between imprinted and nonimprinted genes should differ significantly for both control groups. As the assembly/annotation of the mouse genome is still not finished, some mouse sequences contain stretches of undefined nucleotides (termed N), with an average N content of 0.71% (standard deviation 1.22%). The transcribed regions of most genes are less than 25 kb long (Table 1). In general, mouse genes are shorter than human ones [32]. The gene length distribution is similar in imprinted and randomly selected genes (Wilcoxon–Mann–Whitney test, p > 0.4).

Comparing the G + C content did not reveal any notable differences, neither between human and mouse nor between imprinted and control sequences. The mean is 46.30% for human and 45.99% for mouse. The standard deviation of the G + C content is larger in human (6.53%) than in mouse (4.26%), indicating that the G + C content of some human sequences is more extreme, as was reported before [32]. Human sequences have a mean CpG content of 1.71% (std dev 0.86%) and are significantly enriched in CpG dinucleotides compared to mouse with 1.34% (std dev 0.49%) (t test, p < 0.001). This difference is predominantly caused by a higher CpG content of interspersed repeat elements in human sequences. SINE repeats are significantly reduced in imprinted genes ([2], Supplementary Material Table S2). In all analyzed groups the CpG and G + C contents are linearly correlated (Pearson correlation coefficient between 0.71 and 0.94, p < 0.0001).

CpG islands

To obtain a comprehensive overview of CpG island frequency and quality, we used the CpG Island Searcher from Takai and Jones [33] and searched for CpG islands using two different parameter sets: the Gardiner-Garden and Frommer [11] parameters that detect CpG islands of at least 200 bp length, G + C ≥50%, and CpGobs/CpGexp ≥0.6 and the more stringent Takai and Jones [33] parameters that identify only strong CpG islands of at least 500 bp length, G + C ≥55%, and CpGobs/CpGexp ≥0.65. We determined the CpG island distribution before and after removal, i.e., masking, of all repetitive elements (including microsatellites). We compared the absolute numbers (Table 2) and the lengths (Table 3) of all CpG islands found in masked and unmasked sequences using both criteria and the portions of the whole sequences covered by CpG islands (Supplementary Material Table S3).

Using the Gardiner-Garden and Frommer criteria, we identified significantly more CpG islands in human than in mouse, both in the imprinted and, even more pronounced, in the control groups (see Table 2) (t test, p < 0.05 for imprinted group; p < 0.001 for control groups). This is due mainly to the frequent occurrence of CpG islands in human Alu repetitive elements [33–36]. Comparing the frequencies of short CpG islands (using Gardiner-Garden and Frommer criteria) before and after removal of repetitive elements it becomes evident that in contrast to the human, only few repetitive elements in mouse overlap with CpG islands. As a consequence of SINE depletion [2], fewer CpG islands in the human imprinted group are indeed associated with repetitive sequences compared to the control groups of randomly selected genes.

In both species, the stringent Takai and Jones parameters exclude not only most CpG-rich sections located in repetitive elements, but also small CpG islands in unique sequences. Since application of the two different parameter sets sometimes results in splitting or fusion of CpG islands, their number for murine imprinted genes increased slightly after masking of repetitive elements (from 1.5 to 1.6 CpG islands per gene). Comparing the lengths (Table 3) and frequencies (Table 2) of CpG islands and the sequence proportions covered by CpG islands (Supplementary Material Table S3) in imprinted and randomly selected genes, no consistent differences were found in both species. This result is in agreement with previous findings [4,37,38].

Table 2
Mean number of CpG islands per gene

| Group | unm200 | msk200 | unm500 | msk500 |
|---|---|---|---|---|
| Mouse imprinted | 6.5 | 5.9 | 1.5 | 1.6 |
| Mouse ctrl1 | 8.3 | 6.6 | 1.2 | 1.0 |
| Mouse ctrl2 | 8.2 | 6.3 | 1.5 | 1.5 |
| Mouse all | 7.9 | 6.3 | 1.4 | 1.3 |
| Human imprinted | 14.9 | 9.9 | 2.5 | 2.4 |
| Human ctrl1 | 14.0 | 5.3 | 1.8 | 1.7 |
| Human ctrl2 | 18.8 | 8.2 | 2.3 | 2.0 |
| Human all | 16.1 | 7.4 | 2.1 | 1.9 |

The CpG island numbers were summed up per sequence and group and then divided by the number of sequences in the respective group. unm200, Gardiner-Garden and Frommer [11] criteria (G + C content ≥50%, CpGobs/CpGexp ≥0.60, length ≥200 bp); msk200, masking of repetitive elements, applying Gardiner-Garden and Frommer criteria; unm500, Takai and Jones [33] criteria (G + C ≥55%, CpGobs/CpGexp ≥0.65, length ≥500 bp); msk500, masking of repetitive elements, applying Takai and Jones criteria.

Properties of intergenic and intragenic CpG islands

CpG islands at the transcriptional start sites of genes have been shown to contain promoter elements. In contrast, the DMRs in imprinted regions are frequently located in intronic or intergenic CpG islands. To investigate possible differences in these different types of CpG islands we classified CpG islands by their locations into five groups: (1) Promoter CpG islands were defined as overlapping with the transcriptional start site as annotated in the database. (2) Since transcriptional start sites are frequently not well defined, we collected a group of prepromoter CpG islands that overlap with a 1-kb window upstream of the transcriptional start site. CpG islands within this group might therefore include functional promoter CpG islands. (3) Upstream CpG islands reside in the remaining 9 kb of upstream sequence and (4) intragenic CpG islands are located between the transcriptional start site and the 3′ end of the gene. (5) Downstream CpG islands were found in the 10 kb window downstream of the 3′ end of the respective gene.

More human than mouse sequences possess short CpG islands that fulfill the Gardiner-Garden and Frommer criteria in their upstream and downstream regions (χ² test, p < 0.05). In these sequence segments the CpG island distributions of imprinted genes and randomly selected genes did not differ significantly. Promoter CpG islands are only marginally affected by masking and parameter choice and appear with similar frequencies in the groups of imprinted genes and randomly selected genes in human and in mouse. In general, promoter CpG islands are longer in human than in mouse [14,39]. The CpG island length distributions of imprinted genes and control genes are similar. Including the small group of prepromoter CpG islands that may contain functional promoter CpG islands yielded similar results (data not shown).

Table 3
Average length of the CpG islands in human and mouse imprinted and control sequences

| Group | unm200 (bp) | msk200 (bp) | unm500 (bp) | msk500 (bp) |
|---|---|---|---|---|
| Mouse imprinted | 502 ± 545 | 509 ± 505 | 1334 ± 741 | 1135 ± 573 |
| Mouse ctrl1 | 369 ± 323 | 390 ± 349 | 1057 ± 479 | 995 ± 364 |
| Mouse ctrl2 | 405 ± 390 | 434 ± 404 | 1138 ± 558 | 1084 ± 544 |
| Human imprinted | 442 ± 553 | 484 ± 554 | 1269 ± 883 | 1189 ± 721 |
| Human ctrl1 | 408 ± 431 | 513 ± 519 | 1298 ± 741 | 1231 ± 636 |
| Human ctrl2 | 378 ± 390 | 456 ± 475 | 1175 ± 734 | 1143 ± 676 |

The standard deviation is very large due to highly skewed length distributions. unm200, Gardiner-Garden and Frommer [11] criteria (G + C content ≥50%, CpGobs/CpGexp ≥0.60, length ≥200 bp); msk200, masking of repetitive elements, applying Gardiner-Garden and Frommer criteria; unm500, Takai and Jones [33] criteria (G + C ≥55%, CpGobs/CpGexp ≥0.65, length ≥500 bp); msk500, masking of repetitive elements, applying Takai and Jones criteria.
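The two parameter sets used throughout (Gardiner-Garden and Frommer: length ≥200 bp, G + C ≥50%, CpGobs/CpGexp ≥0.60; Takai and Jones: length ≥500 bp, G + C ≥55%, CpGobs/CpGexp ≥0.65) can be illustrated with a minimal Python sketch that tests one candidate window. This is not the CpG Island Searcher algorithm: its window shifting, trimming, and rejoining steps are not reproduced, and the names `cpg_window_stats`, `CRITERIA`, and `meets_criteria` are hypothetical. The observed/expected ratio follows the usual definition, (number of CpG × N) / (number of C × number of G).

```python
def cpg_window_stats(seq: str):
    """Return (length, G + C percentage, CpG observed/expected ratio) of a window."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides cannot overlap one another
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    # Observed/expected ratio: (number of CpG * N) / (number of C * number of G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_percent, obs_exp


# (minimum length in bp, minimum G + C percentage, minimum CpGobs/CpGexp)
CRITERIA = {
    "gardiner_garden_frommer": (200, 50.0, 0.60),
    "takai_jones": (500, 55.0, 0.65),
}


def meets_criteria(seq: str, name: str) -> bool:
    """Check a single window against one of the two parameter sets."""
    min_len, min_gc, min_oe = CRITERIA[name]
    n, gc, oe = cpg_window_stats(seq)
    return n >= min_len and gc >= min_gc and oe >= min_oe


# Toy windows for illustration only.
short_island = "CG" * 150   # 300 bp, passes only the less stringent set
at_rich = "AT" * 300        # 600 bp, fails both sets
```

A 300-bp window can satisfy the Gardiner-Garden and Frommer set while failing the Takai and Jones set on length alone, which is one reason island counts differ so strongly between the unm200 and unm500 columns.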

The imprinted genes analyzed here contain a higher proportion of intragenic CpG islands satisfying the stringent Takai and Jones criteria (Fig. 1). In mouse, more imprinted genes than control genes have at least one strong intragenic CpG island (χ² test, p < 0.05). This is the case for both unmasked and masked sequences. Applying the Gardiner-Garden and Frommer criteria, we found that murine intragenic CpG islands are also longer in imprinted genes than in randomly selected genes (Wilcoxon–Mann–Whitney test, p < 0.01). Compared to imprinted genes, randomly selected human genes are enriched in short intragenic CpG islands that overlap with Alu elements. On the other hand, human imprinted genes possess more strong intragenic CpG islands that fulfill the stringent Takai and Jones criteria than the control sequences. However, the number of genes possessing at least one strong intragenic CpG island was only tentatively different in both groups (χ² test, p < 0.1). We also did not find consistent differences between imprinted genes and the control groups when investigating the length of human intragenic CpG islands. For human, the subtle effect of slightly increased numbers of tentatively longer intragenic CpG islands leads to the difference observed in Fig. 1.

Fig. 1. Average total length of CpG islands per gene. CpG islands in repeat-masked sequences at five different locations (see text) were identified in imprinted sequences and two groups of randomly selected control genes (ctrl1, ctrl2) according to the Takai and Jones [33] parameters (G + C content ≥55%, CpGobs/CpGexp ≥0.65, length ≥500 bp). (A) Mouse, (B) human. Precisely, the CpG island lengths were summed up per sequence, group, and location; the sum was divided by the number of sequences in the respective group.

Tandem repeats in CpG islands

CpG-rich tandem repeats in DMRs of imprinted genes have been assumed to function in imprinted gene regulation [25]. For this reason we examined whether tandem repeats are significantly enriched in CpG islands of imprinted genes in comparison to randomly chosen genes (Supplementary Material Table S4). Direct tandem repeats were identified using the Tandem Repeats Finder ([40], http://tandem.bu.edu/trf/trf.html). Most repeats in both species, in imprinted as well as in control sequences, were found in intragenic CpG islands. Except for short CpG islands in human sequences that still contain repetitive elements (Table 4, unm200), tandem repeats are indeed enriched in imprinted sequences (χ² test, at least one tandem repeat array per sequence, p < 0.01). CpG islands in unmasked sequences include a high number of microsatellites as well as tandem repeats of retroviral origin. When microsatellites and (retro)transposable elements were discarded, the enrichment of tandem repeats in imprinted genes became even more pronounced (χ² test, p < 0.005).

Thereafter we excluded CpG islands in repetitive elements and analyzed only tandem repeats in CpG islands of unique sequences. Tandem repeats were identified for the imprinted genes Commd1, Grb10, Gtl2, H19, Igf2r, Impact, Magel2, Peg10, Peg3, Sgce, Slc22a18, Slc22a3, and Slc38a4 and the Nesp–Gnas locus in mouse and for ATP10A, CD81, GRB10, H19, KCNQ1, MAGEL2, MEG3, MEST, PEG3, PHLDA2, SLC22A18, SNRPN, and ZNF264 and the NESP–GNAS locus in human (Table 5, Supplementary Material Table S6). From the total of 40 repeats, 13 in human and 8 in mouse have not been described before. In human and mouse, respectively 6 and 4 newly identified repeats are associated with genes for which no tandem repeats have been reported yet. The human repeat arrays are located in intragenic CpG islands of ATP10A, ZNF264, and MEST (two arrays). One of the two tandem repeats found between the ends of PHLDA2 and SLC22A18 is very well conserved, with a consensus similar to the reverse complement of the MD motif in the IC2 imprinting center in Kcnq1 [19,41]. The new tandem repeats in mouse are located in the bidirectional promoter CpG island of Peg10 and Sgce and in intragenic CpG islands of Slc22a18 and Slc22a3.

To identify homologous repeats in human or mouse, we used the consensus sequences of all repeats for local BLASTN (http://www.ncbi.nlm.nih.gov/BLAST) searches in the groups of imprinted and control genes. Highly significant matches with an expectation value <0.001 (Supplementary Material Table S7) were identified for the murine Peg3 repeat motif, which was found in the human PEG3 gene with two repetitions. The corresponding human location has no CpG island. In the human TRRAP gene, which belongs to the randomly selected genes, we identified one truncated version of the repeat motif in the mouse Trrap gene (LOC433959, similar to transformation/transcription domain-associated protein). The human motif is located in a CpG island in which no tandem repeat array was found. The general structures of repeat arrays, i.e., motif length (10–140 bp), number of repetitions (1.9–50.5), and array length (50–1618 bp), were not found to be significantly different, neither between imprinted and control sequences nor between mouse and human (Supplementary Material Table S5). Also for the CpG content of the repeat arrays we did not find any differences for the gene groups analyzed.

Table 4
Percentage of sequences that possess at least one tandem repeat array in one of their CpG islands

| Group | unm200 (%) | msk200 (%) | unm500 (%) | msk500 (%) |
|---|---|---|---|---|
| Mouse imprinted | 51.28 | 35.90 | 25.64 | 17.95 |
| Mouse ctrl1 | 21.52 | 11.39 | 6.33 | 1.27 |
| Mouse ctrl2 | 22.78 | 8.86 | 7.59 | 2.53 |
| Human imprinted | 44.74 | 36.84 | 42.11 | 23.68 |
| Human ctrl1 | 30.38 | 12.66 | 15.19 | 5.06 |
| Human ctrl2 | 29.11 | 11.39 | 18.99 | 5.06 |

More imprinted than control sequences contain at least one tandem repeat in a CpG island, with the exception of human unm200 (χ² test, p < 0.01). unm200, Gardiner-Garden and Frommer [11] criteria (G + C content ≥50%, CpGobs/CpGexp ≥0.60, length ≥200 bp); msk200, masking of repetitive elements, applying Gardiner-Garden and Frommer criteria; unm500, Takai and Jones [33] criteria (G + C ≥55%, CpGobs/CpGexp ≥0.65, length ≥500 bp); msk500, masking of repetitive elements, applying Takai and Jones criteria.

Table 5
Localization of tandem repeat arrays in imprinted sequences of mouse and human

(A) Mouse

| Gene | Repeat | Location of CpG island | Motif length (bp) | No. of repetitions | Array length (bp) | Consensus sequence | Previously identified by |
|---|---|---|---|---|---|---|---|
| Commd1 | 1 | Intra | 38 | 5.0 | 194 | CCTGCGCAGTTACCCGGTTATCCGCAGTACGTAGCCAG | [16] |
| Commd1 | 2 | Intra | 45 | 2.3 | 106 | CTGCGCAGTTACCCGATTATCCAGTTATCCGCAGTACAGGCCTGC | [16] |
| Grb10 | | Intra | 10 | 32.1 | 321 | GCGTGTCGGC | [22,46] |
| Gtl2 (Meg3) | | Intra | 23 | 3.3 | 79 | GAGGACCCCAGGAAGCCCAGCGC | |
| H19 | | Upstream | 11 | 9.5 | 109 | GGGGGTATAGT | [31] |
| Igf2r | 1 | Intra | 31 | 2.8 | 88 | TCTCCTGCAACGTGGCACTTTTGAGCTCACC | [17] |
| Igf2r | 2 | Intra | 24 | 11.1 | 265 | CACACACCCACGGCATGGCGGTCT | |
| Impact | | Intra | 140 | 2.5 | 362 | GCTTTGCTGCATTGTCACATGAGCAGGCCCGGCCCACTCGGCTCGGCTCGGCACAGCTCGGCTGTTGCGTCACTGGCGCCTGCTCGGCTGCGTTGTCACATGTTAGCAAGGCCGACTAGGCTGCTGCGTCACACGAGCAG | [52] |
| Magel2 | | Upstream | 33 | 3.4 | 113 | GCTGAGAGCTGCGGTGCCAGCCAGGCAGCGCTC | |
| Nesp–Gnas | | Intra | 36 | 3.3 | 120 | GCCGAGCCTGCCTCCGAGGCAGTCCCTGCCACCCAG | [18] |
| Peg10, Sgce | 1 | Upstream/promoter | 17 | 3.8 | 65 | CTCCCACCTCCCATCAT | N |
| Peg10, Sgce | 2 | Promoter | 29 | 10.8 | 309 | ACTAATGGGCGCTTCATGCGCTACAAAAT | N |
| Peg3 | | Intra | 21 | 9.2 | 196 | ATGGAGAGGCTGAAGAGCCAG | |
| Slc22a18 | | Intra | 31 | 2.0 | 62 | AATACACCCACTCTCTCCCGGAGAAAGCAGG | N |
| Slc22a3 | | Intra | 44 | 2.0 | 88 | AGACACACGGGGACATATATGACAGACGGAAGGAAGCTAGCGAC | N |
| Slc38a4 | | Intra | 37 | 9.9 | 367 | GGGATCGGGCTGGGGTTCCCGTGGAGGGACCCTCGCG | P |

(B) Human

| Gene | Repeat | Location of CpG island | Motif length (bp) | No. of repetitions | Array length (bp) | Consensus sequence | Previously identified by |
|---|---|---|---|---|---|---|---|
| ATP10A | | Intra | 53 | 1.9 | 103 | ATGGCTGAAAACATGGGTGGGGCCCCTCCCCACCTCGCCCGGGTCGTGTTGTG | N |
| CD81 | 1 | Intra | 17 | 15.6 | 269 | TCTTGTGGGGTGGGGCG | [53] |
| CD81 | 2 | Intra | 31 | 7.0 | 221 | GCACCCGTGCTGTGGCGTGCCCGTCGTCTGT | [53] |
| CD81 | 3 | Downstream | 30 | 2.3 | 69 | CTCGGCCTCACCCAGGTGCTCCCGCTTGTG | |
| GRB10 | | Downstream | 41 | 12.2 | 498 | TACTCACACGTGGGACACAGATCCACTCTGCCGTAGCTGTA | |
| H19 | 1 | Upstream | 29 | 14.5 | 426 | TGTCCCACCCGGGTGACGTGCCGTACCCG | [31] |
| H19 | 2 | Downstream | 39 | 3.3 | 127 | GGGTGTGCGGGCGATGGGGGAGATGGACAACAGGACCGA | |
| KCNQ1 | 1 | Intra | 44 | 3.6 | 160 | TCCACATGCCCGTCTGCAGCTCGAGAATTAGACGTGCCCTGGGC | |
| KCNQ1 | 2 | Intra | 40 | 1.9 | 76 | GGAATCCTGGGCTGGAACCGGAAACTTCCCCGAGTACATA | |
| KCNQ1 | 3 | Intra | 50 | 3.9 | 193 | CCTGACTCAGAACCACAACGTGGATTCCCAACTCCGATCCCAATTCGGGC | [19] |
| KCNQ1 | 4 | Intra | 26 | 5.8 | 152 | GGGAGGGCCGCGCTGAGGAGCCCCCA | [19] |
| KCNQ1 | 5 | Intra | 26 | 4.0 | 102 | AGAACCGCGCCGAAGAACCCCCGGGG | [19] |
| KCNQ1 | 6 | Intra | 27 | 9.0 | 235 | CCGAGGAGAACCGCGCTGAGGGGCGC | [19] |
| MAGEL2 | 1 | Upstream | 30 | 6.8 | 204 | TCCCCCTCCGGGGACACCGATGGCTCATCC | |
| MAGEL2 | 2 | Prepro | 21 | 12.2 | 257 | CCCACCACCGATCCGACAGGC | [48] |
| MEG3 (GTL2) | | Upstream | 13 | 8.2 | 106 | CAGGCAGCGGTGG | |
| MEST | 1 | Intra | 20 | 2.9 | 57 | CCTGTGGGGTTTGTGGGCAG | N |
| MEST | 2 | Intra | 37 | 2.3 | 85 | TTAGGATTTTTAGACCCCGGCATCCCTCTGGTGCGAT | N |
| NESP–GNAS | | Intra | 27 | 8.7 | 238 | GCAGCCCCAGCCGATCCCGACTCCGGG | [18] |
| PEG3 | | Intra | 84 | 1.9 | 162 | CAAGCCCCACCCACCTGGGCGCCATCTTTAATGAAAGAGCTTGAGATTTGCCGCGCAGGCGCTGCCCGAATTGGTTGGGCGAGA | [30] |
| PHLDA2, SLC22A18 | 1 | Downstream | 43 | 7.0 | 303 | CCGGGGATGGGCTCGGTGGGACAGGCTCGGCCGCAGGCTGCTC | (N) |
| PHLDA2, SLC22A18 | 2 | Downstream | 36 | 14.7 | 529 | CTGCCAGCCACCCGAACCCCAGAACCGCACCAGACA | (N) |
| SNRPN | | Intra | 16 | 6.9 | 113 | GTGGGCATTGGCGGCG | [54] |
| ZNF264 | | Intra | 40 | 2.3 | 90 | GGCGGCGGCCCTGCGTCTGGAACGCCGTTGCCACCGAGGA | N |

Location indicates the position of the CpG island with reference to the transcribed portion of the gene: Prepro, prepromoter; Intra, intragenic (see also text). The consensus sequence is given as reported by Tandem Repeats Finder [40]. Previous identifications are indicated by the original reference. N, no tandem repeats reported for this gene; P, Rachel Smith, Babraham Institute, Cambridge, UK, personal communication.

Discussion

This study presents a comprehensive analysis of the nucleotide sequence properties of CpG islands of 38 human and 39 murine imprinted genes and of two control groups per species consisting of 79 randomly chosen genes each. Even though imprinted gene expression is associated with allele-specific DNA methylation at regulatory elements, we observed surprisingly few differences in the CpG island properties of imprinted and randomly selected genes. The most prominent feature was the enrichment of tandem repeats in CpG islands of imprinted genes.

Intragenic CpG islands are overrepresented in imprinted genes in mouse

As differentially methylated CpG islands are a major feature of imprinted genes, we expected the imprinted group to have more or longer CpG islands. However, this was not the case regarding total frequencies, lengths, or proportions. This result corroborates previous findings [4,37,38]. Classifying the CpG islands by their location within the investigated sequences, we found that murine imprinted genes possess more often intragenic CpG islands than randomly selected genes. Such intragenic CpG islands may represent promoters of alternative sense or antisense transcripts. Indeed, the strong intragenic CpG islands (according to Takai and Jones parameters) of Commd1 [42], Igf2 [43,44], Igf2r [45], and Kcnq1 [19,41] and in the Nesp–Gnas locus [18] have previously been identified as promoters of antisense transcripts. Furthermore, Igf2 [43,44], the Nesp–Gnas locus [18], and Grb10 [22,46] possess alternative start sites of sense transcripts that overlap with intragenic CpG islands.

Antisense transcripts have been proposed to serve as mediators of allele-specific silencing [1]. The existence of antisense transcripts has been extensively studied in imprinted genes but not to a comparable degree in biallelically expressed genes. Data are currently accumulating [47]; therefore it is hoped an analysis of whether they are significantly enriched in imprinted genes will soon be possible. In human, intragenic CpG islands are not significantly overrepresented in imprinted genes compared to control genes. This could be related to species-specific evolution of imprinting elements so that intragenic CpG islands might be more important for regulation of imprinting in mouse than in human. Alternatively, this species-specific difference in CpG island density might be attributed to the finding that randomly selected genes in human contain more intragenic CpG islands than in mouse. A possible enrichment of intragenic CpG islands due to imprinting may thus be less visible here.

The CpG islands of imprinted genes contain more tandem repeats in mouse and human

In this study we identified 28 human and mouse imprinted genes that possess CpG islands containing tandem repeat arrays. In summary, we identified 8 imprinted genes for which repeat arrays were not previously described in the literature (Table 5). A number of known repeats (Supplementary Material Table S8) were not identified for several reasons: Some known repeats, such as the array in the DMR upstream of Gtl2 [21], are located outside our range of 10 kb upstream and downstream of the transcribed sequence. Others, for example the repeat arrays at the Magel2 promoter [48], were not identified since this region escaped identification as a CpG island because of a too low CpGobs/CpGexp ratio. Many known repeat arrays in mouse like those in IC2 [19], the YY1 transcription factor binding sites in Peg3 [30], and the CTCF binding sites in H19 [31] are not arranged in the contiguous head-to-tail fashion required by Tandem Repeats Finder [40] and were therefore missed. Some of the repeat arrays at the Nesp–Gnas locus in mouse and human [18] did not achieve a score fulfilling the selected threshold.

Except for the repeated motifs in the murine and human Peg3 genes, we did not identify any new repeats that were present in similar positions in both species. The absence of common motifs in the identified repeats and the poor conservation of tandem repeats at homologous positions in mouse and human [19] indicate that specific motifs are not of functional importance. Instead, we suppose that the possible function of these elements might be related to a typical DNA secondary structure that may attract DNA methylation and/or histone modifications.

Although our studies showed that tandem repeat arrays are significantly enriched in CpG islands of imprinted genes, such tandem repeats were not identified in all of the analyzed imprinted genes but only in a rather small portion of 37 and 36% in human and mouse, respectively. This does not necessarily exclude a function of tandem repeats in imprinting, since for major imprinted regions it has been shown that many imprinted genes do not possess individual regulatory elements. Instead, allele-specific gene regulation depends on a few central regulatory elements. Particularly such elements have often been reported to contain repeat arrays. Hence the identified repeats might indeed highlight the rather few important imprinting control elements.

Material and methods

Gene selection

Thirty-nine murine and 38 human genes that showed imprinting effects in at least one species were selected for analysis from the Catalogue of Imprinted Genes (http://igc.otago.ac.nz/IGC), the Mammalian Genetics Unit at Harwell (http://www.mgu.har.mrc.ac.uk/research/imprinting), and the literature: for human and mouse ASB4, ASCL2, ATP10A, CDKN1C, CD81, COPG2, DIO3, DLK1, DLX5, GATM, the NESP–GNAS locus, GRB10, GTL2 (MEG3), H19, HTR2A, IGF2, KCNQ1, MAGEL2, NAP1L5, NDN, NNAT, PEG3, PEG10, PHLDA2, PLAGL1, RASGRF1, SGCE, SLC22A2, SLC22A3, SLC22A18, SLC38A4, SNRPN, UBE3A, and WT1; for human also MEG8, MEST, USP29, and ZNF264; for mouse additionally Peg12, Commd1, Ins2, Igf2r, and Impact (Supplementary Material Table S1).

Two groups of randomly selected control genes were collected for each species. For the first group (ctrl1), we generated random integer numbers within a range of 1 to 300,000 that cover the entire NCBI nucleotide database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide). For the second group (ctrl2), random numbers were restricted to a range from 1 to 16,000. The random integer numbers were converted into reference cDNA accession numbers starting with “NM_”. The second group may contain a bias because genes with low accession numbers were identified earlier and might therefore be more strongly expressed and also might have a higher G + C content than average genes. However, we found no significant differences in G + C content (see Results). On the other hand, the transcription start and end points are expected to be better characterized for this group of genes that have been studied for a longer time.

Genes were chosen from the NCBI nucleotide database if the corresponding gene had an autosomal location and was available for human as well as for mouse, with the exception of two genes in control group 1. This procedure resulted in 79 control genes per control group and species. The exact sequences are available from the authors upon request.

Sequence retrieval

For sequence retrieval, the NCBI Map Viewer (http://www.ncbi.nlm.nih.gov/mapview) for human (build 35.1) and mouse (build 33.1) was used. We retrieved the genomic sequence of each gene according to the reference Genes_seq map annotation extended by 10 kb upstream of the transcriptional start site and 10 kb downstream of the transcriptional termination site. The murine sequences of the imprinted genes Dlx5, Peg10, Sgce, and Slc22a2 and seven control genes containing over 5% of undefined nucleotides were replaced by improved assemblies of build 34.1. The Copg2 and MEG8 sequences were taken from the Ensembl Genome Browser (http://www.ensembl.org) version 30 and the Grb10 sequence from version 32.

CpG island identification

The CpG island sequences were extracted using a custom Perl script, applying either of the criteria for both repeat-masked and unmasked sequences. We noticed an unexpected increase in CpG islands in some sequences upon masking: A stretch of masked nucleotides in the middle of a CpG island could split this island into two and even enlarge them if the adjacent regions still fulfilled the criteria. This effect probably depends on the window shifting, trimming, and rejoining steps of the algorithm. Sometimes, a window of size 500 bp may meet the criteria, whereas shorter fragments of 200 bp length may not and vice versa. We therefore examined the numbers, absolute lengths, and fractions of sequences covered by CpG islands.

Repeat masking

Repeat masking was performed using local installations of RepeatMasker open version 3.08 (Smit and Green, unpublished data, http://www.repeatmasker.org) using the Washington University BLASTP program (http://blast.wustl.edu) as search engine and the Repbase RepeatMasker Libraries July 2004 from the Genetic Information Research Institute ([49,50], http://www.girinst.org). We applied default settings. All repetitive elements including low-complexity DNA regions and simple repeats were masked, i.e., replaced by stretches of N. Only di- to pentameric and some hexameric tandem repeats are scanned for; simple repeats shorter than 20 bp are ignored by RepeatMasker.

Tandem repeat identification

As a tandem repeat, also called (tandem) repeat array, we defined a sequence consisting of at least two approximate repetitions of a nucleotide motif. Direct tandem repeat identification was performed by Tandem Repeats Finder version 3.21 ([40], http://tandem.bu.edu/trf/trf.html) on all determined CpG island sequences. The program detects only tandem repeats that are arranged in a contiguous head-to-tail fashion. It was carried out on the CpG island sequence samples with the following parameters, using default values adapted from the Tandem Repeats Finder description page (http://tandem.bu.edu/trf/trf.html/trfdesc.html): match score 2, mismatch score 5, indel score 7, match probability 80, indel probability 10, minscore to report 100, maxperiod 2000.

As Tandem Repeats Finder reports redundant and overlapping repeat arrays, the output had to be edited manually: Redundant repeat arrays, starting and ending at the same position with different single-copy lengths, were counted as one. Two repeat arrays were regarded as a single one if the smaller one was overlapped by the larger one by more than 50%. The length was counted from the start of the first to the end of the second array. As a consensus sequence, the smaller motif was chosen.

The genomic location of the tandem repeat arrays was determined using BLASTN ([51], http://www.ncbi.nlm.nih.gov/BLAST). We used a local installation of BLASTN version 2.2.11 to identify occurrences of repeat motifs in homologous sequences of human and mouse. To obtain a query sequence that could match a similar tandem repeat array, the consensus sequence of each motif was concatenated to a dimer.

Statistical analysis

To calculate feature frequencies, custom Perl scripts were used. Stretches of undefined nucleotides in murine sequences were omitted for the calculations.
For the identification of CpG islands, the CpG Island Searcher command As most of the data are not Gaussian distributed, special care had to be line version 1.3 from Takai and Jones ([33], http://cpgislands.usc.edu) was taken when comparing the different groups. The χ2 test is a robust, obtained and applied to the retrieved sequences. Currently, there exist two nonparametric method to assess whether a certain feature is overrepresented popular definitions of CpG islands: According to the original criteria by in one group. The t test is robust to differing variance and to data with Gardiner-Garden and Frommer [11], they must have at least 50% G + C content, non-Gaussian distribution as long as the groups are of similar size. We a CpGobs/CpGexp ratio of at least 0.6, and a minimum length of 200 bp. Takai also performed Wilcoxon–Mann–Whitney tests, especially for comparisons and Jones [33] introduced more stringent lower limits used as a default in their between groups of different size. We denote error probabilities of program: minimum 55% G + C content, CpGobs/CpGexp ≥0.65, and length 0.05 < p < 0.1 to indicate a tendency, a significant difference is fulfilled ≥500 bp. The CpG Island Searcher requires the number of CpG dinucleotides to for 0.01 < p < 0.05, and a property is found highly significant for be at least 7 to exclude artifact CpG islands. p < 0.01. B. Hutter et al. / Genomics 88 (2006) 323–332 331
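The two CpG island definitions given under CpG island identification can be restated as a simple per-sequence check. The following Python sketch is an illustration only. It is neither the CpG Island Searcher algorithm nor the custom Perl script used in this study: it evaluates one candidate sequence against either set of thresholds and omits the window shifting, trimming, and rejoining steps. The names `cpg_stats`, `is_cpg_island`, and `CRITERIA` are hypothetical, and applying the minimum of 7 CpG dinucleotides to both definitions is an assumption made for this example.

```python
# Thresholds as quoted in the text: (min G+C fraction, min obs/exp ratio, min length in bp).
CRITERIA = {
    "gardiner_garden_frommer": (0.50, 0.60, 200),
    "takai_jones": (0.55, 0.65, 500),
}


def cpg_stats(seq):
    """Return (G+C fraction, CpG observed/expected ratio, CpG count) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0, 0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n
    # Expected CpG count under independence of C and G: (#C * #G) / length.
    expected = c * g / n
    ratio = cpg / expected if expected > 0 else 0.0
    return gc_fraction, ratio, cpg


def is_cpg_island(seq, criteria="takai_jones", min_cpg=7):
    """Check a single candidate sequence; min_cpg=7 for both definitions is an assumption."""
    min_gc, min_ratio, min_len = CRITERIA[criteria]
    gc_fraction, ratio, cpg = cpg_stats(seq)
    return (
        len(seq) >= min_len
        and gc_fraction >= min_gc
        and ratio >= min_ratio
        and cpg >= min_cpg
    )
```

A 200-bp fragment can satisfy the Gardiner-Garden and Frommer thresholds while failing the Takai and Jones length limit, which is one reason why the two definitions can yield different island counts for the same sequence.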

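The rules used for editing the Tandem Repeats Finder output by hand (see Tandem repeat identification) can be sketched in code. The Python example below is a hedged illustration, not the procedure actually applied, which was manual. It assumes 1-based inclusive coordinates, merges arrays greedily in order of start position, and interprets "the smaller motif" as the shorter consensus sequence. The names `RepeatArray` and `merge_arrays` are hypothetical.

```python
from typing import List, NamedTuple


class RepeatArray(NamedTuple):
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    consensus: str    # consensus motif reported for the array


def merge_arrays(arrays: List[RepeatArray]) -> List[RepeatArray]:
    """Collapse redundant arrays and arrays overlapping by more than 50% of the smaller one."""
    merged: List[RepeatArray] = []
    for cur in sorted(arrays, key=lambda a: (a.start, a.end)):
        if merged:
            prev = merged[-1]
            overlap = min(prev.end, cur.end) - max(prev.start, cur.start) + 1
            smaller = min(prev.end - prev.start + 1, cur.end - cur.start + 1)
            if overlap > 0.5 * smaller:
                # Length runs from the start of the first to the end of the second array;
                # the shorter consensus is kept (an interpretation of "the smaller motif").
                consensus = min(prev.consensus, cur.consensus, key=len)
                merged[-1] = RepeatArray(
                    min(prev.start, cur.start), max(prev.end, cur.end), consensus
                )
                continue
        merged.append(cur)
    return merged
```

Arrays that start and end at the same position overlap completely, so the redundancy rule is covered by the same 50% test.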
Acknowledgments

We thank Jörn Walter (Saarland University) for stimulating discussions, Christoph Bock (Max Planck Institute for Computer Science) for critical comments on the manuscript, and Tihamér Geyer (Saarland University) for helpful advice on programming. We highly appreciate the work of numerous sequencing centers that made the mouse and human genomic DNA sequences for the analyzed region publicly available. This work was supported by the Forschungsausschuss of Saarland University and by the Deutsche Forschungsgemeinschaft (WA1029/3-2).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ygeno.2006.03.019.

References

[1] W. Reik, J. Walter, Genomic imprinting: parental influence on the genome, Nat. Rev., Genet. 2 (2001) 21–32.
[2] J.M. Greally, Short interspersed transposable elements (SINEs) are excluded from imprinted regions in the human genome, Proc. Natl. Acad. Sci. USA 99 (2002) 327–332.
[3] P.P. Luedi, A.J. Hartemink, R.L. Jirtle, Genome-wide prediction of imprinted murine genes, Genome Res. 15 (2005) 875–884.
[4] E. Allen, et al., High concentrations of long interspersed nuclear element sequence distinguish monoallelically expressed genes, Proc. Natl. Acad. Sci. USA 100 (2003) 9940–9945.
[5] J. Walter, B. Hutter, T. Khare, M. Paulsen, Repetitive elements in imprinted genes, Cytogenet. Genome Res. 113 (2006) 109–115.
[6] C.M. Rubin, C.A. VandeVoort, R.L. Teplitz, C.W. Schmid, Alu repeated DNAs are differentially methylated in primate germ cells, Nucleic Acids Res. 22 (1994) 5121–5127.
[7] U. Hellmann-Blumberg, M.F.M. Hintz, J.M. Gatewood, C.W. Schmid, Developmental differences in methylation of human Alu repeats, Mol. Cell. Biol. 13 (1993) 4523–4530.
[8] S.K. Howlett, W. Reik, Methylation levels of maternal and paternal genomes during preimplantation development, Development 113 (1991) 119–127.
[9] N. Lane, et al., Resistance of IAPs to methylation reprogramming may provide a mechanism for epigenetic inheritance in the mouse, Genesis 35 (2003) 88–93.
[10] P. Hajkova, et al., Epigenetic reprogramming in mouse primordial germ cells, Mech. Dev. 117 (2002) 15–23.
[11] M. Gardiner-Garden, M. Frommer, CpG islands in vertebrate genomes, J. Mol. Biol. 196 (1987) 261–282.
[12] L. Ponger, L. Duret, D. Mouchiroud, Determinants of CpG islands: expression in early embryo and isochore structure, Genome Res. 11 (2001) 1854–1860.
[13] F. Larsen, G. Gundersen, R. Lopez, H. Prydz, CpG islands as gene markers in the human genome, Genomics 13 (1992) 1095–1107.
[14] F. Antequera, Structure, function and evolution of CpG island promoters, Cell. Mol. Life Sci. 60 (2003) 1647–1658.
[15] R. Yamashita, Y. Suzuki, S. Sugano, K. Nakai, Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity, Gene 350 (2005) 129–136.
[16] R.S. Pearsall, et al., Absence of imprinting in U2AFBPL, a human homologue of the imprinted mouse gene U2afbp-rs, Biochem. Biophys. Res. Commun. 222 (1996) 171–177.
[17] B. Reinhart, M. Eljanne, J.R. Chaillet, Shared role for differentially methylated domains of imprinted genes, Mol. Cell. Biol. 22 (2002) 2089–2098.
[18] C. Coombes, et al., Epigenetic properties and identification of an imprint mark in the Nesp–Gnasxl domain of the mouse Gnas imprinted locus, Mol. Cell. Biol. 23 (2003) 5475–5488.
[19] M. Paulsen, T. Khare, C. Burgard, S. Tierling, J. Walter, Evolution of the Beckwith–Wiedemann syndrome region in vertebrates, Genome Res. 15 (2005) 146–153.
[20] R.S. Pearsall, et al., A direct repeat sequence at the Rasgrf1 locus and imprinted expression, Genomics 55 (1999) 194–201.
[21] M. Paulsen, et al., Comparative sequence analysis of the imprinted Dlk1–Gtl2 locus in three mammalian species reveals highly conserved genomic elements and refines comparison with the Igf2–H19 region, Genome Res. 11 (2001) 2085–2094.
[22] P. Arnaud, et al., Conserved methylation imprints in the human and mouse GRB10 genes with divergent allelic expression suggests differential reading of the same mark, Hum. Mol. Genet. 12 (2003) 1005–1019.
[23] A.C. Bell, G. Felsenfeld, Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene, Nature 405 (2000) 482–485.
[24] A.T. Hark, et al., CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus, Nature 405 (2000) 486–489.
[25] B. Neumann, P. Kubicka, D.P. Barlow, Characteristics of imprinted genes, Nat. Genet. 9 (1995) 12–13.
[26] T.A. Volpe, et al., Regulation of heterochromatic silencing and histone H3 lysine-9 methylation by RNAi, Science 297 (2002) 1833–1837.
[27] A. Saveliev, C. Everett, T. Sharpe, Z. Webster, R. Festenstein, DNA triplet repeats mediate heterochromatin-protein-1-sensitive variegated gene silencing, Nature 422 (2003) 909–913.
[28] D. Garrick, S. Fiering, D.I.K. Martin, E. Whitelaw, Repeat-induced gene silencing in mammals, Nat. Genet. 18 (1998) 56–59.
[29] F. Malagnac, et al., A gene essential for de novo methylation and development in Ascobolus reveals a novel type of eukaryotic DNA methyltransferase structure, Cell 91 (1997) 281–290.
[30] J. Kim, A. Kollhoff, A. Bergmann, L. Stubbs, Methylation-sensitive binding of transcription factor YY1 to an insulator sequence within the paternally expressed imprinted gene, Peg3, Hum. Mol. Genet. 12 (2003) 233–245.
[31] A. Lewis, M. Kohzoh, M. Constância, W. Reik, Tandem repeat hypothesis in imprinting: deletion of a conserved direct repeat element upstream of H19 has no effect on imprinting in the Igf2–H19 region, Mol. Cell. Biol. 24 (2004) 5650–5656.
[32] R.H. Waterston, et al., Initial sequencing and comparative analysis of the mouse genome, Nature 420 (2002) 520–562.
[33] D. Takai, P.A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, Proc. Natl. Acad. Sci. USA 99 (2002) 3740–3745.
[34] K. Jabbari, G. Bernardi, CpG doublets, CpG islands and Alu repeats in long human DNA sequences from different isochore families, Gene 224 (1998) 123–128.
[35] E.S. Lander, et al., Initial sequencing and analysis of the human genome, Nature 409 (2001) 860–921.
[36] S.-L. Oei, et al., Clusters of regulatory signals for RNA polymerase II transcription associated with Alu family repeats and CpG islands in human promoters, Genomics 83 (2004) 873–882.
[37] X. Ke, N.S. Thomas, D.O. Robinson, A. Collins, A novel approach for identifying candidate imprinted genes through sequence analysis of imprinted and control genes, Hum. Genet. 111 (2002) 511–520.
[38] X. Ke, N.S. Thomas, D.O. Robinson, A. Collins, The distinguishing sequence characteristics of mouse imprinted genes, Mamm. Genome 13 (2002) 639–645.
[39] M. Cuadrado, M. Sacristán, F. Antequera, Species-specific organization of CpG island promoters at mammalian homologous genes, EMBO Rep. 2 (2001) 586–592.
[40] G. Benson, Tandem repeats finder: a program to analyze DNA sequences, Nucleic Acids Res. 27 (1999) 573–580.
[41] D. Mancini-DiNardo, S.J.S. Steele, R.S. Ingram, S.M. Tilghman, A differentially methylated region within the gene Kcnq1 functions as an imprinted promoter and silencer, Hum. Mol. Genet. 12 (2003) 283–294.

[42] Y. Wang, et al., The mouse Murr1 gene is imprinted in the adult brain, presumably due to transcriptional interference by the antisense-oriented U2af1-rs1 gene, Mol. Cell. Biol. 24 (2004) 270–279.
[43] T. Moore, et al., Multiple imprinted sense and antisense transcripts, differential methylation and tandem repeats in a putative imprinting control region upstream of mouse Igf2, Proc. Natl. Acad. Sci. USA 94 (1997) 12509–12514.
[44] H. Sasaki, et al., Nucleotide sequence of a 28-kb mouse genomic region comprising the imprinted Igf2 gene, DNA Res. 3 (1996) 331–335.
[45] A. Wutz, D.P. Barlow, Imprinting of the mouse Igf2r gene depends on an intronic CpG island, Mol. Cell. Endocrinol. 140 (1998) 9–14.
[46] T. Hikichi, T. Kohda, T. Kaneko-Ishino, F. Ishino, Imprinting regulation of the murine Meg1/Grb10 and human GRB10 genes: roles of brain-specific promoters and mouse-specific CTCF-binding sites, Nucleic Acids Res. 31 (2003) 1398–1406.
[47] G. Lavorgna, et al., In search of antisense, Trends Biochem. Sci. 29 (2004) 88–94.
[48] I. Boccaccio, et al., The human MAGEL2 gene and its mouse homologue are paternally expressed and mapped to the Prader–Willi region, Hum. Mol. Genet. 8 (1999) 2497–2505.
[49] J. Jurka, Repeats in genomic DNA: mining and meaning, Curr. Opin. Struct. Biol. 8 (1998) 333–337.
[50] J. Jurka, Repbase update: a database and an electronic journal of repetitive elements, Trends Genet. 16 (2000) 418–420.
[51] S.F. Altschul, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389–3402.
[52] K. Okamura, et al., Comparative genome analysis of the mouse imprinted gene Impact and its nonimprinted human homolog IMPACT: toward the structural basis for species-specific imprinting, Genome Res. 10 (2000) 1878–1889.
[53] M. Paulsen, et al., Sequence conservation and variability of imprinting in the Beckwith–Wiedemann syndrome gene cluster in human and mouse, Hum. Mol. Genet. 9 (2000) 1829–1841.
[54] A.H.M.M. Huq, et al., Sequencing and functional analysis of the SNRPN promoter: in vitro methylation abolishes promoter activity, Genome Res. 7 (1997) 642–648.

Further reading

Catalogue of Imprinted Genes: http://igc.otago.ac.nz/IGC.
Mammalian Genetics Unit at Harwell: http://www.mgu.har.mrc.ac.uk/research/imprinting.
National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov.
NCBI Nucleotide Database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide.
NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview.
BLASTN: http://www.ncbi.nlm.nih.gov/BLAST.
Sanger Institute Ensembl Genome Browser (Ensembl): http://www.ensembl.org.
RepeatMasker: http://www.repeatmasker.org.
Washington University BLAST: http://blast.wustl.edu.
Genetic Information Research Institute: http://www.girinst.org.
CpG Island Searcher: http://cpgislands.usc.edu.
Tandem Repeats Finder: http://tandem.bu.edu/trf/trf.html.
Tandem Repeats Finder description page: http://tandem.bu.edu/trf/trf.html/trfdesc.html.