Specific subfamilies of transposable elements contribute to different domains of T lymphocyte enhancers
Mengliang Yea, Christel Goudota, Thomas Hoylera, Benjamin Lemoineb, Sebastian Amigorenaa,1, and Elina Zuevaa,1
aInstitute Curie, Paris Sciences-et-Lettres Research University, Institut National de la Santé et de la Recherche Médicale, U932, 75005 Paris; and bGenomics Core, Baylor Scott & White Research Institute, Dallas, TX 75204
Edited by Akiko Iwasaki, Yale University, New Haven, CT, and approved February 21, 2020 (received for review July 18, 2019) Transposable elements (TEs) compose nearly half of mammalian elements (TEs) (10), which are abundant in mammalian ge- genomes and provide building blocks for cis-regulatory elements. nomes (up to 50% of total DNA). TEs are selfish genetic ele- Using high-throughput sequencing, we show that 84 TE subfam- ments that disseminate throughout genomes via autonomous ilies are overrepresented, and distributed in a lineage-specific fash- self-replication and reinsertion. Based on distinct mobilization + ion in core and boundary domains of CD8 T cell enhancers. machinery, TEs broadly divide into retrotransposons and DNA Endogenous retroviruses are most significantly enriched in core transposons (11). Retrotransposons are by far more abundant domains with accessible chromatin, and bear recognition motifs and split into long-terminal repeat (LTR) and non-LTR groups. for immune-related transcription factors. In contrast, short inter- LTRs include endogenous retroviruses (ERVs), while non-LTR spersed elements (SINEs) are preferentially overrepresented in TEs subdivide into long-interspersed (LINEs) and short-interspersed nucleosome-containing boundaries. A substantial proportion of elements (SINEs), nonautonomous transposons mobilized by the these SINEs harbor a high density of the enhancer-specific histone LINE integration machinery. These lineages are composed of mark H3K4me1 and carry sequences that match enhancer bound- phylogenetically related families, further branching out into ary nucleotide composition. Motifs with regulatory features are multiple subfamilies, each originating from one precursor copy. better preserved within enhancer-enriched TE copies compared With time, the accumulation of mutations introduced divergence
to their subfamily equivalents located in gene deserts. TE-rich GENETICS and TE-poor enhancers associate with both shared and unique in the consensus sequence within members of each subfamily. gene groups and are enriched in overlapping functions related A growing body of evidence suggests that specific TE sub- families have been coopted for transcriptional regulation in a to lymphocyte and leukocyte biology. The majority of T cell en- – hancers are shared with other immune lineages and are accessible number of biological contexts (12 22). In particular, TE- in common hematopoietic progenitors. A higher proportion of im- associated enhancers contributed to the evolution of early de- mune tissue-specific enhancers are TE-rich compared to enhancers velopment (17), placentation (18), mammalian pregnancy (14), specific to other tissues, correlating with higher TE occurrence in immune gene-associated genomic regions. Our results sug- Significance gest that during evolution, TEs abundant in these regions and carrying motifs potentially beneficial for enhancer architecture We performed a systematic analysis of transposable elements (TEs) + and immune functions were particularly frequently incorporated in enhancers associated with immune functions. In CD8 T cells, by evolving enhancers. Their putative selection and regulatory specific TE subfamilies are enriched in enhancers and provide tran- cooption may have accelerated the evolution of immune regulatory scription factor recognition motifs and architectural sequences to networks. central and peripheral domains, respectively. Enhancers unique to immune cell types are more prone to putative regulatory TE transposable element | enhancer | T lymphocyte | immune tissue cooption as compared to enhancers unique to other tissues. Ge- nomic neighborhoods of immune-related genes harbor more TEs + pon encountering cognate antigens, resting naïve CD8 compared to other gene-bearing regions, likely reflecting reduced UT cells undergo activation and functional differentiation local purifying selection. We hypothesize that TE abundance in into cytotoxic effectors, which fight infections. Mechanistically, immune-associated genomic regions facilitated active selection and adaptive immune responses rely on a profound reorganization of functionalization of TE motifs beneficial for immune functions. This enhancer networks that orchestrate transcriptional outputs (1, may have favored accelerated evolution of enhancers in immune 2). Mutations in enhancers have been associated with autoim- cells adapting to changing infectious challenges. mune and inflammatory disorders, such as asthma and rheu- matoid arthritis (3, 4). Epigenetic events accompanying enhancer Author contributions: M.Y., S.A., and E.Z. designed research; M.Y. and E.Z. performed + research; T.H. and B.L. contributed new reagents/analytic tools; M.Y., C.G., and E.Z. ana- remodeling during CD8 T cell differentiation have been com- lyzed data; and S.A. and E.Z. wrote the paper. prehensively documented (1, 2). Genomic features constituting + The authors declare no competing interest. CD8 T cell enhancers and their role in regulation, however, This article is a PNAS Direct Submission. have not yet been explored. This open access article is distributed under Creative Commons Attribution-NonCommercial- Enhancers are distantly operating modular elements that NoDerivatives License 4.0 (CC BY-NC-ND). stimulate gene transcription by looping with target promoters Data deposition: The data reported in this paper have been deposited in the Gene Ex- (5), acting individually or jointly on one or multiple genes. The pression Omnibus (GEO) database, https://www.ncbi.nlm.nih.gov/geo (accession no. structural framework of an active enhancer includes an accessi- GSE142151). ble DNA core, surrounded by more condensed chromatin har- See online for related content such as Commentaries. boring short arrays of nucleosomes (6, 7). These flanking 1To whom correspondence may be addressed. Email: [email protected] or nucleosomes are decorated with H3K4me1, which broadly marks [email protected]. both active and poised regulatory regions (8), and eventually This article contains supporting information online at https://www.pnas.org/lookup/suppl/ with H3K27ac, which correlates with enhancer activity (9). En- doi:10.1073/pnas.1912008117/-/DCSupplemental. hancers evolving from background DNA include transposable First published March 19, 2020.
www.pnas.org/cgi/doi/10.1073/pnas.1912008117 PNAS | April 7, 2020 | vol. 117 | no. 14 | 7905–7916 Downloaded by guest on September 24, 2021 and innate immune responses (21). TEs with regulatory potential and differentiated cells from the above-mentioned GSE95237 + are thought to accelerate evolution. Identical regulatory copies series. As these H3K27ac distal ATAC peaks may contain a spread across the genome can rapidly rewire novel transcrip- small portion of not yet unannotated promoters, we intersected tional patterns, resulting in functional novelty or an increase in them with the H3K4me3 peaks (in-house ChIP-seq), typically regulatory complexity (23, 24). This mechanism could be par- higher in promoters than in enhancers (27) (cut-off is shown in ticularly relevant for the evolution of the immune system, which SI Appendix, Fig. S2B). After filtering out unannotated pro- needs to adapt rapidly to effectively respond to a variety of moters, we observe an expected proportion of distal ATAC mutable stimuli. peaks with features of active enhancers (9), 67% in D0 and 69% Here, we use a combination of genome-wide approaches to in D3 (Fig. 1A). Finally, we obtained a total of 16,807 putative characterize enhancers enriched in TEs in differentiating mouse active enhancers and 12,526 of active promoters (all stages + CD8 T lymphocytes. We map the topology of TE-lineages pooled). across enhancer central and boundary domains and show that To identify regions with significant alterations in accessibility core-enriched TEs carry ancestral transcription factor (TF) rec- during D0–D3 transition, we performed differential quantifica- ognition motifs, better preserved compared to nonenhancer TE tion of ATAC signals in enhancers found in different conditions. equivalents. Flanking TEs harbor high densities of H3K4me1 Enhancers were assigned to a differentiation stage only if they histone mark and provide dA:dT-rich sequences that match en- were either more accessible or uniquely found in this stage hancer boundary nucleotide composition. TE-rich enhancers are compared to another. D1 mostly served to follow the kinetics of linked to important T cell functions and are part of a regulatory enhancers’ chromatin accessibility between D0 and D3. As network with a potentially high degree of functional redundancy. shown in Fig. 1 B, Upper, the amplitude (fold-change) of en- + Enhancers selectively active in CD8 T cells and other immune hancer chromatin accessibility during differentiation in most tissues are richer with TEs, as compared to enhancers active in cases exceeds twofold. Promoters are modulated to a somewhat other tissues. Similarly, TEs are more abundant in genomic lesser extent (Fig. 1 B, Lower). The vast majority of enhancers neighborhoods of immune-related genes, as compared to regions (nearly 80%) undergo statistically significant modulation of bearing genes silent in immune tissues. These results reveal the chromatin accessibility during differentiation, in contrast to 34% differential contribution of distinct TE lineages to enhancer of modulated promoters (see numbers in Fig. 1B and propor- domains and highlight the putative role of TEs in the evolution tions in Fig. 1C). Together, these data highlight the para- + of immune functions. mount importance of enhancers in the process of CD8 T cell differentiation. Results Based on differential quantification, we generated a super- Enhancer Network Reorganizes Early on during CD8+ TCell vised heatmap of ATAC signals clustered by their dynamic Differentiation. To study the role of TEs in cis-regulation in T profile during differentiation (Fig. 1D). We observed a broader lymphocytes, we isolated naïve (D0) C57BL/6J TCR transgenic D0-specific enhancer cluster (50% of the total enhancers), as + (OT-I) mouse CD8 T cells (expressing transgenic T cell re- compared to D3. The majority of D0-specific enhancers shrink ceptors specific for ovalbumin). OT-I cells were activated in vitro early on, at D1 (Fig. 1D). Less than 20% either gain accessibility for 24 h (D1, activated) or 3 d (D3, cycling cells) using antibodies by D3 or remain stable across differentiation (Fig. 1D). A small directed against CD3 and CD28. To identify active enhancers, proportion of enhancers is temporarily more accessible at D1. we first performed genome-wide mapping of open chromatin Higher numbers of the D0-specific enhancers cannot be (accessible to nuclease digestion), using an assay for transposase- explained by higher numbers of genes overexpressed at D0, as accessible chromatin coupled with high-throughput sequenc- compared to D3. The differential analysis of the stage-matched ing (ATAC-seq) (25). After peak calling, we identify a total RNA sequencing (RNA-seq; in-house) reveals comparable of 42,019 high-fidelity ATAC sites (see mapping and re- numbers of genes selectively up-regulated at D0 or D3 (SI Ap- producibility details in SI Appendix, Fig. S1 A and B). pendix, Fig. S3). This result suggests a higher enhancer/gene ratio To ascertain possible bias induced by in vitro differentiation, in naïve, as compared to activated T cells. + we compared our dataset to the published ATAC-seq carried out Loci of the key determinants of CD8 T cell differentiation on OT-I cells, naïve and in vivo-differentiated in response to (Il7ra coding for a major marker of naïve cells, and Tbx21 coding bacterial infection (GSE95237 series) (1). Of ATAC peaks from for T-bet, the master regulator of effectors), show the expected our dataset, 97% in naïve and 87% in D3 cells are also present in modulation of chromatin accessibility, including the experimen- naïve and differentiated cells from the GSE95237 series, re- tally validated enhancer that regulates Il7ra gene expression in spectively, yielding highly similar profiles in the genome browser naïve T cells (28) and a putative enhancer in the proximity of (SI Appendix, Fig. S1C). This not only demonstrates the reli- Tbx1 gene (Fig. 1E). + ability of our ATAC regions but also reveals that the majority of Altogether, these results show that during CD8 T cell dif- peaks found in D3 are independent of the type of stimulus. ferentiation chromatin remodeling is initiated early on (before Importantly, we further used the in-house dataset because the the first cell division) and that naïve cells embody broader en- sequencing conditions (longer paired-end reads of 100 bp) hancer accessibility and potentially higher enhancer redundancy, allowed less ambiguous mapping of ATAC signals on genomic as compared to activated cells. repeats and transposons, as compared to single-end sequencing of shorter reads (26). Higher resolution of this strategy allows Selected TE Species Are Differentially Enriched in Central and mapping TEs to ATAC peak summits, as well as slopes to better Peripheral Enhancer Domains. To colocalize TEs with active en- define boundaries of the accessible chromatin. hancers, we first defined enhancer boundaries using hallmark To distinguish putative enhancers, we split ATAC peaks into histone modifications: H3K27ac (GSE95237 series) and two groups: Promoters and the distal peaks (see the full pipeline H3K4me1 (GSE95237 series), as well as H3K4me3 (in-house), in SI Appendix, Fig. S2A). Regions situated within ±2 kb from present at a low level in active enhancers. We observed a annotated transcription start sites (TSS) were considered as comodulation of H3K4me1 and ATAC signals in D0- and D3- holding a promoter activity. More distal peaks were considered specific developmental enhancers (SI Appendix, Fig. S4). Ag- as cis-regulatory regions. To select distal ATAC peaks with the gregation of histone marks around ATAC peak centers reveals a signature of active enhancers, we performed an overlap with the classic peak–valley–peak pattern (29), with the ATAC peak hallmark histone modification H3K27ac using chromatin nearly perfectly coinciding with the valley (Fig. 2A). The highest immunoprecipitation-sequencing (ChIP-seq) datasets for naïve densities of H3K27ac and H3K4me3 occur within ATAC-adjacent
7906 | www.pnas.org/cgi/doi/10.1073/pnas.1912008117 Ye et al. Downloaded by guest on September 24, 2021 A Overlap between distal ATAC peaks B Chromatin accessibility amplitude in C Proportion of enhancers and and the enhancer and promoter enhancers and promoters during promoters modulated during discriminatory histone marks differentiation; D3 vs. D0 differentiation; D1 and D3 vs.D0
D0 D3 Enhancers 60 #9000 #2244 50 30% 29.5% 100 40 80 Enhancers Promoters 67% 69% 60
-log10 Padj 30 40 3% 1.5% 20 20 0 10 Total=21971 Total=10694 -20 -40 -6-5-4 -3-2-1 0 1 2 3 4 5 6 -60 pos low Enhancers: H3K27ac /H3K4me3 Log2 Fold Change -80 pos high -100
Unannotated promoters: H3K27ac /H3K4me3 % of the total regulatory regions D1 D3 D1 D3 Other regulatory regions: H3K27acneg Promoters 60 #1802 #2535 Lost accessibility 50 Gained accessibility D Clustering of invariant and 40 dynamic enhancers
-log10 Padj 30 20 D0 D1 D3 10
Invariant -6-5-4-3-2-1 0 1 2 3 4 5 6 Log2 Fold Change (19%) GENETICS TPM log2 8 6 E Dynamic remodeling of chromatin in the loci of 4 key factors of CD8+ T cell differentiation 2 0 Il7r Tbx21 3.5kb 12kb D0 (53%) ATAC signal ATAC D0 D1
RNA-Seq D3 D0 D1 (13%) D1
ATAC-Seq D3 naive (GSE95237) D3 effectors (15%)
H3K27ac (GSE95237)
+ Fig. 1. Remodeling of enhancers during CD8 T cell differentiation. (A) Pie charts depicting overlap between distal ATAC-peaks in D0 and D3 and the enhancer- and promoter-distinguishing histone marks (H3K27ac, GSE95237 series, and H3K4me3, in-house). (B) Volcano plots showing differential accessibility + of ATAC regions during differentiation (log2 fold-change of ATAC signal modulation) in promoters and active enhancers during CD8 T cell differentiation (shown is D3 vs. D0). (C) Bar plot showing proportion of promoters and enhancers significantly modulated during D0–D1 and D0–D3 transition (false-discovery rate 0.05, fold-change ≥1.5). Shown are proportions of the total promoters or enhancers. (D) Supervised heatmap of the transcripts per million (TPM)- normalized ATAC signal in enhancers, clustered by the significant or not modulation of accessibility during the D0–D3 transition. Each dynamic cluster is further subclustered into early (modulated at D1) and late responders. (E) Parallels between normalized RNA-seq, ATAC-seq, and CHIP-seq tracks on the IL7ra gene and its previously validated enhancer, and the Tbx21 gene and a putative enhancer in the vicinity (shadowed in gray).
regions (roughly equal to the size of one nucleosome in deep se- sequence environment and DNA shape (30–33), as well as host quencing; 170 bp), whereas H3K4me1 spans 500-bp boundaries of sites of attachment to nuclear scaffold/matrix (S/MARs) carrying the size-averaged ATAC-peak (Fig. 2A). a variety of regulatory codes (34, 35). These observations suggest that enhancers are composed of To investigate the contribution of TEs to different domains, the three distinct domains: An accessible core (ATAC peak, we intersected TE annotations from the RepeatMasker with orange label in Fig. 2A), proximal flanks (170 bp, green labels in enhancer coordinates. To cut through pervasive association and Fig. 2A), and distal flanks (330 bp, yellow labels in Fig. 2A). to reveal potentially functional TEs, we compared proportions of Although this separation is somewhat arbitrary, it fits enhancer the major TE lineages in enhancer domains with their genomic biology, with most accessible chromatin—including proximal abundance. In parallel, we analyzed local genomic neighbor- nucleosomes—serving as the epicenter for TF binding (25). hood, represented by 300-bp and 170-bp regions 2 kb away from Least-accessible flanks harbor enhancer signature histone marks enhancers that roughly match by size the ATAC peak (300 bp on and may contribute to TF binding by providing favorable average) and distal flanks (330 bp), as well as proximal flanks
Ye et al. PNAS | April 7, 2020 | vol. 117 | no. 14 | 7907 Downloaded by guest on September 24, 2021 A Drawing enhancer boundaries; B Enrichment or depletion of TE aggregation of histone mark signals lineage in enhancer domains around ATAC peak midpoints y 0.5 H3K4me1 H3K27ac 0.4 Lineage ProximalDistal flanks flanks H3K4me3 ATAC 170bp - 300bp2kb away - 2kb awa 0.3 LTR/ERVL−MaLR 2.2 1.5 1.2 ns ns LTR/ERV1 1.4 1.1 ns ns 0.9 0.2 LTR/ERVL 1.4 1.2 ns ns ns 0.1 LTR/ERVK 0.9 0.7 0.7 0.9 ns
Histone mark signal LINE/L2 3.0 2.5 2.1 ns 1.8 0.0 LINE/CR1 2.8 ns ns ns ns -1kb ATAC 1kb LINE/RTE−BovB 2.8 ns ns 0.8 0.9 peak 0.5kb 0.5kb LINE/L1 0.5 0.4 0.4 0.5 0.5 SINE/MIR 3.0 3.0 2.6 1.8 1.8 C Top TE subfamilies enriched in SINE/ID 2.5 ns ns ns 2.0 enhancer domains SINE/5S−Deu−L2 9.4 ns ns ns ns SINE/B4 1.5 2.2 2.5 2.0 ns SINE/B1 1.3 ns 2.7 ns 2.2 SINE/B2 1.0 1.7 2.1 2.1 ns
Lineage Subfamily ATAC ProximalDistal flanks flanks170bp - 300bp2kb away - 2kb away DNA/hAT−Charlie 1.9 2.0 2.1 ns ns ERVL−MaLR ORR1E 7.6 2.7 ns ns ns ERVL−MaLR ORR1D2 6.4 1.9 ns ns ns Enriched Depleted ERVL−MaLR ORR1B2 4.2 2.3 ns ns ns 80 0 240 ERVL−MaLR ORR1F 4.2 2.8 ns ns ns ERVL−MaLR ORR1G 3.2 ns ns ns ns −Log10(adj.pvalue) ERVL−MaLR MLT1N2 15 7.5 ns ns ns ERVL−MaLR MLT1B 2.5 ns ns ns ns Spatial distribution of enriched TEs ERVL−MaLR MTD 2.5 1.9 ns ns ns D ERVL−MaLR MTEa 2.7 ns ns ns ns with respect to ATAC peak midpoint ERV1 MTEb 3.9 ns ns ns ns ERV1 RLTR14 8.7 2.5 ns ns ns ERV1 RLTR14_RN 5.9 2.4 ns ns ns 0.008 ORR1 L1 L1M4b 3.6 ns ns ns ns 0.006 RLTR L1 L1MCa 3.3 2.4 1.9 1.2 1.2 MTD L1 L1MB8 1.9 2.2 2.1 1.6 1.6 0.004 MLT L2 L2b 4 3.7 3.1 2.7 2.7 RMER L2 L2 3.4 2.5 ns ns ns 0.002 L2 L2a 3.5 3.3 2.7 ns 2.0 0.000 MIR MIR1_Amn 3.5 5.1 ns ns ns 0.008 MIR MIR3 4.7 4.7 3.8 ns ns L1 MIR MIRb 3.6 3.5 3 ns ns 0.006 L2 MIR MIRc 4.0 3.6 3.3 ns ns B1 PB1D10 ns ns 3.7 ns ns 0.004 B1 PB1D7 3.3 ns 4.1 ns ns 0.002 B1 B1_Mur4 ns ns 3.5 ns ns B1 B1_Mur1 ns ns 3.4 ns ns 0.000 B1 B1_Mur2 ns ns 3.2 ns ns 0.008 B1 B1_Mur3 ns ns 3.1 ns ns MIR B1 B1_Mus1 ns ns 3.0 ns ns TE density B4 0.006 B1 B4 RSINE1 ns ns 3.2 ns ns B2 DNA/hat Charlie URR1A ns 3.2 3.1 ns ns 0.004 DNA/hat Charlie URR1B ns ns 3.0 ns ns 0.002 0.000 45 3 0.008 -Log10(adj.pvalue) 0.006 DNA E Proportional localization of 0.004 enriched TEs in enhancer domains 0.002 0.000 100 −1kb −0.5kb 0 0.5kb 1kb 8% 20% Distance of TE from the ATAC peak midpoint 27% 14% 32% 75 F 11% 57% Proportion of enhancers with 90% Distal flanks TE-rich domains 24% Proximal flanks 100 50 32% Core 80 78% 69% 60 28% 25 49% 40 36% 20 15%8% % of enhancers with TE-rich domains with 0 % of the total enhancer-linked copies 0 2% L2 ERV L1M MIR DNA SINE ATAC Distal Proximal flanks flanks
Fig. 2. Specific lineages of transposable elements are differentially enriched and distributed within CD8+ T cell enhancers. (A) Signal aggregation for histone marks H3K27ac/H3K4me1 (GSE95237 series), and H3K4me3 (in-house) around ATAC peak midpoints in naïve (D0) cells. Enhancer domains are represented by colored boxes; size-averaged ATAC peak (300 bp), proximal flanks (170 bp), and more distal flanks (330 bp). (B) Heatmap showing significance of enrichment (red) or depletion (green) vs. genome of TE family according to the binomial two-tailed test inferred on TE counts in enhancer domains and in control size- matched regions located 2 kb away from the enhancer. Labels represent fold-change ratio in-set vs. in-genome (by length). ns = not significant. (C) Heatmap showing significance of enrichment vs. genome of TE subfamilies in enhancer domains inferred on the number of counts using standard hypergeometric test. Labels represent fold-change ratio in-set vs. in-genome (by length). ns = not significant. (D) Distribution of distance between enhancer-enriched TEs and the ATAC peak midpoint (related subfamilies are combined into groups). (E) Proportional localization of enhancer-enriched TEs within distinct enhancer domains (related subfamilies are combined into groups). (F) Proportion of enhancers enriched for TE subfamilies in ATAC-peaks, proximal, and distal flanks. Shown are proportions from the total number of identified enhancers.
(170 bp). This analysis showed that TEs are overrepresented in apparent LTR retrotransposons (MaLRs), LINEs L2, and an- enhancer domains in a lineage-specific way. Accessible cores are cient SINEs belonging to the mammalian interspersed repeats significantly enriched in ERVs, in particular in the mammalian (MIR) (Fig. 2B). Both their significance and fold-enrichment
7908 | www.pnas.org/cgi/doi/10.1073/pnas.1912008117 Ye et al. Downloaded by guest on September 24, 2021 over background decrease while moving away from the enhancer copies directly overlapping ATAC peaks. Since related TE spe- core. In contrast, other SINE families, in particular the most cies show a similar distribution patterns, we combined them into numerous rodent-specific B1s, are depleted from cores but higher hierarchical groups. As shown in Fig. 3A, only a small enriched in distal flanks. LINEs L1s, although being most proportion of TEs can fully cover the ATAC peak. ERVs and abundant in the mouse genome, are depleted from enhancers LINEs on average cover up to 50% of the ATAC peak, while altogether. Certain lineages, such as MalRs, L2s, MIRs, and other species, either because they are smaller or further away, hat-Charlies, are recurrently linked to human and mouse en- contribute less. Thus, most accessible chromatin domains are hancers, likely reflecting their regulatory propensity in a variety composed of chimeric TE/non-TE DNA. As TEs can span both of biological contexts (13–15, 36). open and nucleosome-occupied regions, we quantified their as- For a better resolution within large families, we performed sim- sociation with ATAC sequencing fragments derived from shorter ilar tests for subfamilies, applying statistics to their enrichment in nucleosome-free regions (harboring the highest signal), and enhancer domains vs. genome by both length and copy number. To longer fragments containing mononucleosomes (less accessible ensure that enriched subfamilies are associated with enhancers chromatin) and multinucleosomes (condensed chromatin). specifically, we sampled size-matching regions (170 bp and 300 bp) Typically, mononucleosomes are closed to the open chromatin, 10,000 times genome-wide. No significant association of the while nucleosomes from longer fragments are further away (25). enhancer-enriched subfamilies with random regions was observed, This analysis shows that ERVs, in particular the most numerous except for a few L1M species. After removing them, we obtained 84 ORR1 species, avoid nucleosomes and mostly overlap with subfamilies (among 1,200 in total) enriched in enhancer domains nucleosome-free reads (Fig. 3B), except for MTD and RMER both by length and by copy number (see enriched subfamilies in species that may span nucleosomes. LINEs overlap with all types Dataset S1 and random sampling results in Dataset S2). The most of fragments by virtue of their length. MIRs are associated enriched are subfamilies belonging to ORR1, MTE, and MTD with both mononucleosomes and multinucleosome fragments, species of ERVs, L2s, and MIRs, as well as B1, B2, and B4 SINEs again contrasting other SINE species, mostly associated with (see Fig. 2C for the most numerically abundant subfamilies). Most multinucleosome reads (Fig. 3B). L1s are depleted from enhancers, but 11 (of 300) L1M subfam- Association of core TEs with nucleosome-free sequencing ilies are significantly enriched. Recently, the link between immunity reads suggests that they may provide origins of chromatin ac- and ORR1Es, MTEa and -b, and MIR3s has been highlighted by cessibility. For each enriched TE lineage, we quantified the their enrichment in enhancers of dendritic cells (37). overlap with the ATAC peak summit. Over half of ERV copies Subfamilies demonstrate lineage-specific enrichment prefer- underlie summits, while this proportion decreases for other TE ences, with ERVs enriched in cores and B1, B2, and B4 SINEs, species (Fig. 3C). As a result, ERVs are present in summits of GENETICS in distal flanks (Fig. 2C). Because MIR species enrich all en- the higher proportion of enhancers, as compared to other TE hancer domains, in contrast to other SINEs, mostly found in species (Fig. 3D). Overall, 10% of enhancer chromatin accessi- distal flanks, we separated MIRs from other SINEs in the sub- bility seems to be TE-originated (Fig. 3D). sequent analyses. The majority of overrepresented subfamilies We next studied the contribution of flanking TEs to boundary are not enriched in the local genomic environment (Fig. 2C), epigenetics. H3K4me1 signal is stronger at multinucleosome except for longer LINEs, which nevertheless lose enrichment fragments, as compared to single nucleosomes (Fig. 3E). As while moving away from the core. In concordance with the en- SINEs also colocalize with these fragments, we quantified their richment, TEs are distributed within enhancers in a lineage- association with the H3K4me1. We reasoned that a positive ratio dependent way. Distinct groups of ERVs, LINEs, and MIRs between the H3K4me1 signal on SINE elements and adjacent are positioned toward the most accessible chromatin, while other nucleosome-sized flanks would indicate a direct deposition of SINE species favor flanks (Fig. 2D). More precisely, nearly 80% this histone mark on SINE-bound nucleosomes (higher of ERVs are located in most accessible regions, while up to 90% H3K4me1 signal in SINEs compared to surrounding regions is of SINEs are in distal flanks (Fig. 2E). illustrated in Fig. 3 F, Left and Upper Right). As shown in Fig. 3 F, For further analyses, we defined enhancers as “TE-rich” only Lower Right, for nearly half of the copies of the most abundant if they contain at least one copy of a TE belonging to a subfamily SINE species, this ratio is positive. In total, 25% of enhancers enriched in one or more enhancer domains. All other enhancers are enriched for these SINEs. This suggests that flanking SINEs were qualified as “TE-poor,” even if they overlapped with TEs may contribute to nucleation and spreading of the H3K4me1, (from nonenriched subfamilies). Finally, 60% of enhancers were whichhasbeenshowntobindimportantchromatinmodi- defined as TE-rich, including 15% TE-rich in cores, and 56% in fiers (38). H3K4me1 signal at these SINEs decreases during flanks (Fig. 2F). While SINE-rich enhancers numerically activation (Fig. 3 F, Left), suggesting their putative developmental dominate, 10% include TEs from different lineages (SI Ap- role. Altogether, these results highlight substantial contribution of pendix,Fig.S5A). enhancer-enriched TEs to building open chromatin and boundary Enrichment tests applied to the developmental clusters reveal regions. a stronger association of TEs with the D0-specific enhancers, as compared to the D3-specific and invariant enhancers (SI Ap- TEs Enriched in Enhancers Carry Ancestral Regulatory Motifs. As the pendix, Fig. S5B). TE-rich enhancers in naïve cells seem to be majority of genomic transposons are truncated and diverged more responsive to stimulation, as higher proportions of ERV- from consensus sequence during evolution, we identified con- and LINE-containing enhancers lose accessibility during dif- stituent parts of the enhancer-enriched TE subfamilies that ferentiation, as compared to TE-poor enhancers (SI Appendix, overlap with enhancer domains. To compare them with inac- Fig. S5C). cessible TEs, we analyzed in parallel the equivalent subfamilies Altogether, these results demonstrate an enrichment of spe- located in gene-poor regions (over 200 kb from the annotated + cific TE subfamilies in CD8 T cell enhancers and their partic- TSS) with lower enhancer content and less accessible chromatin. ular topology. They also suggest that some species might have Similar to genomic ERVs, enhancer-linked counterparts are been coopted to specifically contribute to the maintenance of reduced to solo LTRs (viral promoters) (SI Appendix, Fig. S6A). naïve T cell homeostasis. They, however, systematically show higher preservation of the consensus sequence (SI Appendix, Fig. S6A) and longer length TE Species Enriched in Enhancers Differentially Associate with (by 70 bp on average) (SI Appendix, Fig. S6B). As predicted, Domain-Specific Chromatin Features. To investigate the possible LINEs are mostly truncated to 3′UTRs and strongly diverged contribution of TEs to enhancer core domains, we next analyzed from the consensus. Only a small proportion of younger L1Ms
Ye et al. PNAS | April 7, 2020 | vol. 117 | no. 14 | 7909 Downloaded by guest on September 24, 2021 A Coverage of ATAC peak B Association of TEs with nucleosome-free by TE sequence and nucleosome-bound chromatin
100 Nucleosome-free Mononucleosome Multinucleosome -0.5 0.5kb -0.5 0.5kb -0.5 0.5kb 1.0 8.0 ERV 1.5 50 6.0 0.5 1.0 4.0
% of ATAC % of ATAC 0.0 2.0 0.5 0 −0.5
peak coverage by TE TE TE TE ERV LINE DNA MIR SINE ORR1 RLTR RMER MLT MTD LINE 0.9 0.5 Number of TEs in enhancers 2.0 C 0.8 0.4 and in ATAC peak summits 1.5 0.7 0.3 0.6 0.2 10000 1.0 0.1 Total TE TE TE 8000 Summit-linked L1 L2 2000 3.0 0.80 0.6 DNA 1500 0.75 0.5 2.0 0.4 1000 0.70 0.65 0.3 Number of copies 500 1.0 TE TE TE 0 2.0 0.9 1.9 MIR 0.50 LINE DNA MIR SINE 0.85 ERV 1.8 0.45 D Proportion of enhancers 1.7 0.80 0.40 1.6 0.75 0.35 with TE in ATAC peak summit TE TE TE 6 2.5 SINE 0.7 0.5 2.0 0.4 4 1.5 0.6 0.3 1.0 0.5 0.2 0.5 0.1 TE TE TE 2 B1 B2 B4 F SINEs harboring high H3K4me1 signal
% of the total enhancers 0 D0 D3 Genome browser example MIR 0.35 ERV LINE DNA SINE 0.30 H3K4me1 signal at E 0.25 ATAC-seq fragments derived from mono- 0.20 H3K4me1 and multi-nucleosomes kb -0.5 SINE 0.5-0.5 SINE 0.5 B4
0.27 H3K4me1-harboring mono SINEs multi 60 0.26 40 0.25 20 0.24 H3K4me1 signal
% of the of total % 0
-0.5 Nucleosome0.5kb enhancer copies signal B1 B2 B4
+ Fig. 3. TEs enriched in CD8 T cell enhancers differentially contribute to chromatin states. Related TE lineages are hierarchically grouped to produce. (A) Violin plot showing distribution of ATAC-peak coverage by TE sequence (percent of ATAC peak length composed of TE). Median values are indicated. (B)TE- centered coverage plots showing association of enhancer-enriched TEs with ATAC-seq fragments derived from nucleosome-free, mono-, or multinucleosomes (di- and trinucleosomal fragments are combined). (C) Bar plot showing total number of enriched TEs in enhancers (colored bars) and TEs overlapping with ATAC peak summits (black bars). (D) Bar plot showing proportion of enhancers with TE in ATAC-peak summit. (E) Coverage plot showing intensity of H3K4me1 signal (GSE95237 series) in ATAC sequencing fragments derived from mono- or multinucleosomes. (F, Left) Heatmaps representing bins per million mapped reads-normalized H3K4me1 signal plotted over enhancer-linked SINEs harboring higher reads, as compared to adjacent regions. (Upper Right)A genome browser example of the H3K4me1-loaded B4 SINE. Upper bar plot shows proportion of the most numerically abundant enhancer-enriched SINE species targeted by H3K4me1.
carry promoter parts (5′UTRs). Enhancer-linked copies, however, recognized by ETS and RUNX domain-binding proteins are the are strikingly longer (by hundreds of base pairs) than their coun- most abundant. ETS1 and GAPBA (ETS domain) and Runx1 terparts from gene deserts. Relatively young B1, B2s, and B4s (RUNX domain) are linked to the key T cell functions (41–44). show high degrees of conservation and almost no difference from Neither enhancer flanks nor the size-matched control regions their genomic equivalents (SI Appendix,Fig.S6A and B). from the local genomic neighborhood are enriched for these Since transposons are proficient in binding multiple TFs (16– motifs (Fig. 4). The sequence predicted to bind the pleiotropic 20, 39) and some of them have preserved original promoters that transcriptional activators Sp1 and Sp1-like Klf (Kruppel-like may contain ancestral regulatory elements, we performed a Zinc finger proteins) is overrepresented in proximal flanks, to- search for TF-recognition motifs. We first analyzed domains of gether with AT-rich motifs (in both flanking domains), with + all CD8 T cell enhancers (pooled from distinct stages of dif- putative affinity to Forkhead family proteins that play various ferentiation) and then enhancer-enriched TEs. The Regulatory context-dependent functions in immune cells (45), as well as A- Sequence Analysis Tool (RSAT) (40) helped to reveal that core favoring Zinc finger proteins. The main overall flanking feature domains specifically harbor the majority of recognition motifs for is the abundance in A- and T-rich repeats. Some of these re- various TFs important for T cell biology (Fig. 4). Sequences peats (ATTTA or AACAAA) are enriched in S/MARs (46),
7910 | www.pnas.org/cgi/doi/10.1073/pnas.1912008117 Ye et al. Downloaded by guest on September 24, 2021 Predicted TF Logo ATAC Proximal Distal 170bp 300bp TE counterparts from gene deserts, suggesting a putative selec- or TF family peak flanks flanks 2kb away 2kb away tive advantage for T cell cis-regulation. Ets/Runx composite 1e-300 - - - - Since the A-rich Forkhead-like motifs are inherent to many Ets/Runx reverse - - - - 1e-300 TEs, we examined their capacity to bind Forkhead protein Ets_Gabpa 2.1e-136 - - - - Runx 4.7e-126 - - - - Foxo1, a major regulator of T cell homeostasis (45). To this end, Fos_JunB - - - - we exploited the public dataset of ChIP-seq for Foxo1 performed 1.3e-47 + Creb1 3e-96 - - - - on naïve CD4 T cells (52), whose transcriptome is fundamen- + Klf_Sp1 2.9e-57 - - - - tally similar to the one of naïve CD8 T cells (53). We find a Sry/Zfp422_384/ - - - - 1.1e-56 remarkable overlap of Foxo1 peaks with the promoter- and ac- Forkhead - - - - Egr2_4 1.3e-55 tive enhancer-linked ATAC peaks in naïve (D0) cells (SI Ap- - - - - Egr1_4 1.8e-52 pendix, Fig. S7A), confirming the biological relevance of the Forkhead - 2.6e-34 6.9e-68 2.2e-16 - dataset. Our quantification shows that various TE species Sp1, Klf - 1.4e-23 - - - colocalize with Foxo1 peak summits in enhancers (SI Appendix, - - 2.2e-20 - - - Fig. S7B), providing altogether 20% of binding sites. This sug- Sry/Zfp422_384/ gests substantial participation of TEs in the maintenance of naïve - - 9.1e-132 - 2e-73 Forkhead - - 2e-111 - - cell homeostasis. - - 3e-08 1.3e-24 2e-08 2e-04 In sum, we show that enhancer core-linked TEs carry a rich - - - 6e-12 - - array of ancestral motifs recognized by TFs relevant to T cell - - - 0.001 - - biology. These motifs are better preserved in enhancer-core TEs compared to nonaccessible TE equivalents. Flanking TEs carry Fig. 4. Distinct sequence composition in enhancer cores and flanks. Se- sequences matching the inherent enhancer flanking lexicon, quence enrichment analysis using RSAT for enhancer domains and size- and particularly featured in naïve cells, and may putatively contribute number-matching control regions 2 kb away from enhancer. Shown are lo- gos of the most enriched sequences, predicted TF or TF family, and e-values to sequence-based boundary regulation. of enrichment. Recognition motifs are embedded within enriched sequences in direct or reverse orientation. Dash indicates that the motif is not found. The Link Between TEs and Immune Functions. To identify tran- scriptional programs putatively rewired by TEs, we predicted enhancers’ gene targets via the Genomic Regions Enrichment of + suggesting that boundaries of CD8 T cell enhancers may foster Annotations Tool (GREAT) (54). Using matching RNA-seq, we
sites of the attachment to nuclear matrix. kept only developmentally comodulated enhancer/gene pairs GENETICS Overall, CpG dinucleotide frequency in flanks is lower than in (modulation of enhancer accessibility [by ATAC-seq] and change cores (0.08 in core vs. 0.05 in distal flanks), consistent with a in gene transcription [by RNA-seq] are in concordance). This recent report showing that the low-flanking CG content is a strategy allowed coupling of nearly 1,000 D0-specific genes to hallmark of human enhancers associated with immune functions developmentally cobehaving enhancers (Dataset S3). Over 40% (47). dA:dT-rich environments carry regulatory properties and of the gene targets are shared between TE-rich and TE-poor yield specific DNA shapes sensed by certain types of TFs (48), as enhancers, suggesting a high degree of putative enhancer re- well as dictate nucleosome positioning (49). This particular dundancy (Fig. 6 A, Upper). A substantial proportion is uniquely flanking composition seems to be enhancer-specific since the assigned to TE-rich enhancers. Despite SINEs outnumbering majority of sequences are not enriched in enhancer vicinity. The other TE families, the majority of genes are shared between cross-comparison of similar numbers of the D0- and D3-specific enhancers rich in SINEs and rich in other TE types (Fig. 6 A, enhancers (using counterpart as a background) reveals that the Lower). Nearly one-third of the predicted enhancer/gene pairs in the D0-specific cluster has been previously identified by Hi-C (2) D0-specific cluster is enriched in binding motif for Tcf7, a critical + TF for naïve and memory T cell-related functions (50) and AT- performed on naïve CD8 T cells, likely reflecting a direct rich flanking sequences of various lengths and An:Tn periodicity, physical contact (Dataset S4). SI Appendix, Fig. S8A illustrates which are positionally biased toward flanks (SI Appendix, Fig. the loci of Irf1 and FbxI5 genes and corresponding TE-rich en- S6C). In contrast, the D3-specific cluster outnumbers by hancers confirmed by Hi-C. Gene ontology (GO) enrichment analysis performed on activation-linked Fos-Jun motifs and various CG-rich sequences genes linked to TE-rich and TE-poor enhancers (split by TE that include Sp1/Klf motifs, all positioned toward enhancer cores lineage) showed that overrepresented functional terms mostly (SI Appendix, Fig. S6C). divide into two functional classes: Immune and metabolic (Fig. To study the putative contribution of TEs to TF-binding, we 6B). All of the enhancer categories are enriched for the “reg- performed sequence analysis of the enhancer-enriched TEs and ulation of response to stimuli” biological process. The ma- their subfamily equivalents from gene deserts, pooling them into jority of them associate with various semiredundant processes phylogenetically related groups. This reveals that a high pro- linked to leukocyte and lymphocyte biology, such as migra- portion of core TEs, in particular ERVs, carry recognition motifs tion, adhesion, and differentiation. LINE-linked enhancers for various TFs, including those for ETS and Runx domain themselves are restricted to a few processes, perhaps due to proteins (Fig. 5). Both L1s and L2s bear Sp1-Klf and other Zinc their lower number. ERV-, MIR-, and SINE-rich enhancers finger protein motifs. MIRs also carry several TF recognition functionally overlap with each other and with TE-poor en- motifs, in contrast to flanking SINEs that are almost uniquely hancers. Despite a high proportion of genes uniquely linked to enriched for dA:dT-rich sequences (including CA repeats), TE-rich enhancers, functional patterns highly overlap be- which are predicted to be favored by A-prone proteins. Similar tween the distinct enhancer categories. This suggests that TE- dA:dT-rich sequences are overrepresented in LINEs that span rich enhancers may have complemented regulatory branches flanks. These sequences almost perfectly match enhancer dA:dT to each process instead of rewiring innovative functions. flanking lexicon. Nearly all motifs enriched in enhancer-linked Similarly, D3-specific TE-rich and TE-poor enhancers show TEs are present in the corresponding consensus, including A- high functional overlap in leukocyte and immune effector rich tails of SINEs (51). The Fos-Jun binding motif found in functions (T cell activation, cytokine production) (SI Appen- B2s has been presumably evolved together with enhancers be- dix,Fig.S8B). In the invariant cluster, ERV- and LINE-rich cause it is not present in B2 consensus sequences. The majority enhancers alone are not associated with any particular term, of enriched motifs are absent or proportionally decreased from while MIR and other SINE-type–rich enhancers are linked to
Ye et al. PNAS | April 7, 2020 | vol. 117 | no. 14 | 7911 Downloaded by guest on September 24, 2021 Enhancer Gene desert Predicted TF similar sequence or TF family logo e-value % e-value % in the consensus ORR1 ETS 2.6e-244 50% 5.1e-30 22% CAGGAAGT(T/G) (Etv-Ets-Gabpa) 2.5e-92 49% 5.1e-77 27% CAGGAAGT(T/G) 4.5e-28 46% - - TTCCTCT RUNX (1,2,3) 3.9e-64 16% - - TGTGGTTT (AAACCACA) Lin54 4.8e-31 27% 4.7e-22 21% TTTGAATG (CATTCAAA) Max_Myc 6.3e-18 30% 5.5e-11 14% ACACTTGGT MTD 2.8e-56 60% 5.1e-30 27% TTCCTGC (GCAGGAA) E2f/Erf_Fli1 1.5e-46 53% - - TTCCTGC (GCAGGAA) 5.6e-41 60% 2.1e-30 57% TTCCTGC Runx1 2e-61 43% 1.5e-15 29% CTGTGGG (CCCACAG) RMER 1.3e-13 44% - - TCCCTTCCCC Sp1, Klf, E2f2 4.6e-06 44% - - CCCCTCCCC Rel_Rela_Bcl6 9.8e-22 55% - - TCCCTTCCCC Tcf7_Lef1 5.8e-07 20% - - AGACCAAC, TTTGGTCT Tead3 6.8e-09 35% - - ACCATACC MTE ETS 3.7e-44 41% 2e-17 36% AGGAGAAA, AGGAGACA (Etv, Gabpa, Elk ) Sp1, Klf 2.5e-18 30% 3.3e-13 26% ACCCACCC Zbtb26_Smad4 4.7e-07 15% --ATCTAGAAT RLTR Klf1, RUNX 1.3e-10 53% - - TGTGGTT Prdm1_RelA 2.8e-06 32% - - GAAAGTC Zfp523_Zfp143 9.5e-07 34% - - ACTAAAACA MLT Rbpj 0.015 15% - - TCCCCCCA Sp1/2_, Klf 0.025 12% - - CCCTCCC Hic1 0.054 11% - - GCCACC Forkhead, Znf384 0.009 22% - - AAATAAAT MIR Zfp787 9.5e-27 24% 5.8e-13 13% GGGCCTCAGTTTC (GGAAACTGAG) Zfp768 4.5e-20 18% - - Tbp 1.7e-15 16% - - GTAAAATG (CATTTTAC) Nfat 4.5e-20 16% 6.3e-14 23% GTAAAATGG Nr4f2_Essra 4.70e-16 19% - - GTGACCT (AGGTCAC) Gata 2.6e-11 11% - - AGATGA L2 Sry/Zfp422_ 3e-19 20% 6.3e-21 13% AATAAA 384/Forkhead 4.7e-48 18% 1.9e-24 10% AAAAAACAAAAA Sp1, Klf_Znf263 6.3e-51 15% - - CCCCTCCCC Fli1 3.1e-11 21% - - AGGAG 3.5e-46 12% - - CACACA L1 Sry/Zfp422_ 1e-14 42% - - TATTTTA (ATAAAAT) 384/Forkhead 1.1e-34 37% - - AAAAACAAA Setbp1_Ahctf1 1.5e-81 42% 3e-22 12% TATTTTAA Sp1/Klf, Znf148 3.4e-64 40% ns CCCCCCT(CT)CCCC 7.4e-95 26% 9.8e-10 16% CACACCCA DNAhat GGGTCACCACAA Srebf1 7.40e-38 54% - - (TTGTGGTGACCC) 5.50e-48 51% 3.5e-31 40% TTAAAGGGTC Rxra_Rxrb_Zfp652 1.3e-142 46% --GACCCCT B2 Zfp384/Forkhead 1e-79 30% 1e-56 23% ATAAAAATAAA Fos_Jun 7.1e-205 47% --GATGGCTCA B1 1e-300 10% 1e-300 2% AAAAAACAAAA 3.5e-138 18% 2.3e-63 9% AAAAAACAAAA B4 1e-300 13% 9.8e-300 11% CACACACACACA
Fig. 5. Enriched TEs carry regulatory motifs. Sequence analysis (RSAT) performed on enhancer-enriched TE subfamilies and their equivalents from gene-poor regions (>200 kb away from the TSS). Related subfamilies combined into bigger groups. Shown are logos of the sequences most enriched vs. background, predicted TF, e- values of enrichment, proportion of TEs carrying at least one sequence, and the same or highly similar motifs in the TE consensus from the Dfam database.Incaseof opposite orientation, reverse-complement of the sequence is shown in brackets. Dash indicates that the motif is not found (in any orientation).
metabolic functions related to protein, DNA, and RNA assigned to SINE-bearing enhancers, suggesting their possi- turnover (SI Appendix,Fig.S8C). Several processes—for ex- ble contribution to the evolution of specific housekeeping ample, catabolic and “chromosome localization”—are uniquely processes.
7912 | www.pnas.org/cgi/doi/10.1073/pnas.1912008117 Ye et al. Downloaded by guest on September 24, 2021 AB Enhancer/gene association GO enrichment in the D0-specific cluster in the D0-specific cluster
TE-poorSINE-richMIR-richERV-rich L2-richL1M-rich regulation of response to stimulus TE-poor TE-rich regulation of signal transduction regulation of cell communication regulation of cell adhesion regulation of cell migration 166 417 400 regulation of leukocyte migration 17% 42% 41% immune system process leukocyte activation Immune lymphocyte activation regulation of mononuclear cell proliferation regulation of leukocyte proliferation regulation of lymphocyte proliferation lymphocyte differentiation Non-SINE SINE-rich leukocyte differentiation TE-rich T cell differentiation regulation of cellular metabolic process regulation of nitrogen compound metabollic process regulation of macromolecule metabolic process
15% 45% 40% Metabolic protein modification process protein phosphorylation regulation of programmed cell death homeostatic process regulation of tumor necrosis factor production
1015 5 -log10 FDR C Overlap between enhancers in D Overlap of active enhancers immune cells and ATAC peaks and genes between T cells and in common progenitor other immune cells 100 80 100 No overlap 60 Overlap 40 50 60% 74% 20 GENETICS % of overlap 0 0 % of the total active genes and enhancers B cell + T cell Spleen Thymus Genes CD8 Myeloid cell Bone marrowDendriticMacrophage cell Enhancers E Proportion of TE-rich enhancers F TE occurences in immune and in immune and non-immune groups non-immune genomic regions **** **** immune TE-rich vs. non-immune TE-rich; 800 **** **** p-value < 2.2e-16 100 600 TE-poor TE-rich 400