High Concentrations of Long Interspersed Nuclear Element Sequence Distinguish Monoallelically Expressed Genes
Total Page:16
File Type:pdf, Size:1020Kb
High concentrations of long interspersed nuclear element sequence distinguish monoallelically expressed genes Elena Allen*, Steve Horvath*†, Frances Tong*, Peter Kraft†, Elizabeth Spiteri‡, Arthur D. Riggs‡, and York Marahrens*§ Departments of *Human Genetics and †Biostatistics, University of California, Los Angeles, CA 90095; and ‡Division of Biology, Beckman Research Institute of the City of Hope, 1450 East Duarte Road, Duarte, CA 91010 Edited by Stanley M. Gartler, University of Washington, Seattle, WA, and approved June 26, 2003 (received for review December 5, 2002) Genes subject to monoallelic expression are expressed from only cation of intact elements by the formation of heterochromatin one of the two alleles either selected at random (random mono- over their sequence (12–15). allelic genes) or in a parent-of-origin specific manner (imprinted The high density of LINE-1 elements around X-inactivated genes). Because high densities of long interspersed nuclear ele- genes raises the possibility that monoallelically expressed auto- ment (LINE)-1 transposon sequence have been implicated in X- somal genes are also flanked by high densities of repetitive inactivation, we asked whether monoallelically expressed autoso- sequence. Here we show that monoallelic autosomal genes mal genes are also flanked by high densities of LINE-1 sequence. A display significantly higher densities of LINE-1 sequence, less statistical analysis of repeat content in the regions surrounding truncated LINE-1 elements, and fewer CpG islands and SINE monoallelically and biallelically expressed genes revealed that elements in their flanking regions. We find evidence that random monoallelic genes were flanked by significantly higher imprinted and random monoallelic genes each separate into two densities of LINE-1 sequence, evolutionarily more recent and less statistically significant distinct groups. We identify Ͼ1,000 ad- truncated LINE-1 elements, fewer CpG islands, and fewer base- ditional genes that share the characteristics of one of these pairs of short interspersed nuclear elements (SINEs) sequence than groups but are not currently known to exhibit allelic differences biallelically expressed genes. Random monoallelic and imprinted in gene expression or chromatin structure. genes were pooled and subjected to a clustering analysis algo- rithm, which found two clusters on the basis of aforementioned Methods sequence characteristics. Interestingly, these clusters did not fol- Data. We identified 33 random monoallelically expressed (20 low the random monoallelic vs. imprinted classifications. We infer mouse, 13 human) genes; 39 imprinted (15 mouse, 24 human) that chromosomal sequence context plays a role in monoallelic genes; and 28 (15 mouse, 13 human) genes that we designated as gene expression and may involve the recognition of long repeats being biallelically expressed (Table 1). Genes were assigned to or other features. The sequence characteristics that distinguished the biallelically expressed category on the basis of gene expres- the high-LINE-1 category were used to identify more than 1,000 sion data and͞or evidence of synchronous replication timing. All additional genes from the human and mouse genomes as candi- monoallelic genes that have been assayed for replication timing date genes for monoallelic expression. are asynchronously replicating (Table 1 and ref. 16). We con- sequently assumed that synchronously replicated genes are bi- enes expressed from only one allele (monoallelic genes) are allelically expressed. However, we consider it plausible that Geither selected at random (random monoallelic genes) or in biallelic genes that are replicated asynchronously exist. The a parent-of-origin specific manner (imprinted genes). Both types presence of biallelically expressed genes within the aforemen- of monoallelic genes reside in chromosomal regions defined by tioned monoallelic gene set would bias the results of the current allelic differences in chromatin properties, including replication study toward the null hypothesis of no difference between the timing (Table 1). These regions vary in scale from single genes monoallelic and biallelic groups. (e.g., IL-2͞Il-2), to small gene clusters [e.g., PWS͞AS gene The majority of known autosomal monoallelic genes belong to region (1)], to the Ϸ3,000 genes that are silenced during X- gene families. A single representative from each of these gene inactivation (2). families was selected (Table 1). In addition, 150 genes (75 mouse, On the basis of the inability of X-inactivation to spread into 75 human) were randomly selected from the National Center for ͞ many autosomal regions in X autosome translocations, it was Biotechnology Information Reference Sequence (RefSeq) proposed that way stations located throughout the X chromo- genes. The location of each gene and the sequence characteris- some are required for gene silencing to spread (3, 4). The tics surrounding it were determined by using the University of discovery that inactivation correlates with high densities of long California, Santa Cruz (UCSC) Genome Browser (http:͞͞ interspersed nuclear elements (LINEs) in mice led to the genome.cse.ucsc.edu), February 2002 and June 2002 builds for hypothesis that LINE elements are way stations (5). An analysis mouse and human, respectively. Repeat information was deter- of human genome sequence supported this hypothesis (6). mined by using the REPEATMASKER (A. F. A. Smit and P. Green, LINEs are mostly truncated descendants of 6- to 7-kb (7) http:͞͞ftp.genome.washington.edu͞RM͞RepeatMasker.html) transposons enriched in chromosomal G bands (8) and in L1 and track of the UCSC Genome Browser. See Supporting Methods, L2 isochores (genomic DNA fractions in Cs2SO4 density gradi- which is published as supporting information on the PNAS web ents) (9). LINE elements make up 14.5% and 17.6% of mouse site, www.pnas.org, for a more detailed description of the and human autosomes and 28.9% and 31.0% of mouse and procedures used to obtain data. human X chromosomal sequence (E.A. and Y.M., unpublished data). Other abundant repeats throughout the genomes are the Ͻ350-bp short interspersed nuclear elements (SINEs), enriched This paper was submitted directly (Track II) to the PNAS office. in chromosomal R bands (8) and H3 and H4 isochores (9) and Abbreviations: LINEs, long interspersed nuclear elements; SINEs, short interspersed nuclear the retrovirus-derived long terminal repeats (LTRs). A host- elements; RefSeq, Reference Sequence. defense mechanism (10, 11) is thought to suppress the multipli- §To whom correspondence should be addressed. E-mail: [email protected]. 9940–9945 ͉ PNAS ͉ August 19, 2003 ͉ vol. 100 ͉ no. 17 www.pnas.org͞cgi͞doi͞10.1073͞pnas.1737401100 Downloaded by guest on September 25, 2021 Table 1. Mouse and human genes that are known to exhibit Table 1. (continued) allelic differences or allelic equivalence Allelic Rep. Allelic Rep. Species͞gene Chromosome expression timing Reference Species͞gene Chromosome expression timing Reference Random monoallelic m Interleukins p57(KIP2) 7 M A 57, 58 m Zfp127 7M– 59 Il-2 3MA39 Igf2-R 17 M – 40, 41 Il-1a*2M– 42 Ndn 7M– 62 Mash2 7M– 63 Il-4 11 M – 44 h h H19 11 M – 64 IL-2 4M– 46 ARHI 1M– 66 X chromosome MAGEL2 15 M – 68 m PEG10 7M– 69 Xist† XM– 49 PLAGL1 6M– 72 h ZIM2 19 M – 73 XIST X M A 51, 52 ZNF264 19 M – 74 Immunoglobin loci TP73 1M– 75 m CDKN1C 11 M – 76 Ig 6MA55 HYMAI 6M– 78 IgL‡ 16 M A 55 NDN 15 M – 79 NNAT 20 M – 81 IgH‡ 12 M A 55 GRB10 7M– 82 TCRB 6MA55 ZIM3 19 M – 74 h GNAS1 20 M – 84 Ig 2M– 61 DLK1 14 M – 85 ‡ IgL 22 M – 61 MEST 7M– 86 GENETICS IgH‡ 14 M – 61 ZNF215*11 M– 88 TCRA*7M– 61 UBE3A 15 M – 89 TCRB‡ 14 M – 61 ATP10C 15 M – 90 TCRD‡ 14 M – 61 IGF2 11 M – 41, 91 TCRG‡ 7M– 71 p57(KIP2) 11 M – 76 Olfactory receptors WT1*11M– 93 m SNRPN 15 M – 94 Biallelic Or10 14 M A 60 m Olfr41‡ 7MA60 Cebpb 2B– 38 ‡ Olfr72 9MA60 Edn3 2B– 38 ‡ Ors25 7MA60 Chrna4 2B– 38 Ora16‡ 2MA60 E2f1 2B– 38 OrM71‡ 9MA60 Ntsr 2B– 38 h Kcnb1 2B– 38 OR1E1 17 M – 61 Mc3r 2B– 38 OR12D2‡ 2M– 61 C-mpl 4 – S39 OR2S2‡ 9M– 61 Tmp 6 – S55 OR51B2‡ 11 M – 61 Cd2 3 – S55 NK receptors Rras 7 – S41 Alb 5 – S41 m Myc 15 – S60 Nkg2d*6M– 87 Pfkl 10 – S41 ‡ Ly49A* 6M– 87 Trp53 11 – S41 ‡ Ly49G2* Un M – 87 h Genes near t complex ACTB 7B– 65 m TSGA14 7B– 67 Nubp2 17 M – 92 NAP1L4 11 B – 64 Igfals‡ 17 M – 92 IGFBP1 7B– 70 Jsap‡ 17 M – 92 ACHE 7 – S41 Imprinted APOB 2 – S41 m CD3D 11 – S41 Tssc3 7M– 37 MYC 8 – S41 TP53 17 – S77 U2af1-rs1 11 M – 40, 41 PYGM 11 – S41 Peg3 7M– 43 ERBB2 17 – S80 Mas1 17 M – 45 RB1 13 – S80 Gnas 2M– 47 RUNX 21 – S83 Rasgrf1 9M– 48 Ins1 19 M – 50 m, mouse; h, human; Rep., replication. H19 7 M A 41, 53 *Monoallelically expressed in a subset of cells. † Snrpn 7 M A 41, 54 Paternally imprinted in some tissues. ‡Excluded from the analyses that involve individual representatives from each Igf2 7MA56 gene family (see supporting information). Allen et al. PNAS ͉ August 19, 2003 ͉ vol. 100 ͉ no. 17 ͉ 9941 Downloaded by guest on September 25, 2021 Sequence Analysis. A set of 68 covariates that described each gene and the sequence in the region surrounding it, defined as 100 kb to the 3Ј and 5Ј sides of the gene, was written. Definitions of covariates are in Supporting Methods. Statistical Methods. Univariate differences in covariates were tested across categorical groupings by using the Kruskal–Wallis test (17). Distributions of covariates across categorical groupings were visualized with boxplots. To facilitate unsupervised learning, an intrinsic dissimilarity measure was constructed with a random-forest analysis of the covariates describing the 200-kb regions flanking the genes (18–20).