bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Comprehensive analysis of epigenetic signatures of human transcrip- tion control†

Guillaume Devailly∗a and Anagha Joshib

Advances in sequencing technologies have enabled exploration of epigenetic and transcription profiles at a genome-wide level. The epigenetic and transcriptional landscape is now available in hundreds of mammalian cell and tissue contexts. Many studies have performed multi-omics analyses using these datasets to enhance our understanding of relationships between epigenetic modifications and transcription regulation. Nevertheless, most studies so far have focused on the promoters/enhancers and transcription start sites, and other features of transcription control including exons, introns and transcription termination remain under explored. We investigated interplay between epigenetic mod- ifications and diverse transcription features using the data generated by the Roadmap Epigenomics project. A comprehensive analysis of modifications, DNA , and RNA-seq data of about thirty human cell lines and tissue types, allowed us to confirm the generality of previously described relations, as well as to generate new hypotheses about the interplay between epigenetic modifications and transcript features. Importantly, our analysis included previously under-explored features of transcription control namely, transcription termination sites, exon-intron boundaries, mid- dle exons and exon inclusion ratio. We have made the analyses freely available to the scientific com- munity at joshiapps.cbu.uib.no/perepigenomics_app/ for easy exploration, validation and hypotheses generation. Background available epigenomic data 16–20 to gain insights into mammalian Epigenetic modifications of the DNA sequence and DNA- epigenetic control. These tools used diverse computational frame- associated proteins along with transcriptional machinery is works ranging from data integration and visualisation (e.g. ChIP- thought to be the main driver shaping mammalian genome dur- Atlas allows visualisation multiple histone modifications and tran- 1 ing development and disease . Epigenetic modifications include scription factor binding sites at given genomic locus by using DNA methylation, histone variants and histone post-translational public ChIP-seq and DNase-seq data 21) to semi-automated an- modifications (such as and ), and fa- notation of genome (e.g. Segway performed genomic segmen- 2 cilitate tissue specific expression . The advent and maturation tation of human by integrating histone modifications, of sequencing technologies have facilitated large scale genera- transcription-factor binding and open chromatin 22). Though tion of epigenomic data across diverse organisms in multiple cell identification of functional elements from epigenetic data 2,22 has and tissue types. Accordingly, consortia were established to gen- been highly effective to annotate the enhancer and promoter re- 3 erate large epigenomic datasets, including ENCODE , Roadmap gions of a genome, they failed to capture other transcription reg- 2 4 5 Epigenomics , Blueprint epigenome for human, modENCODE ulation features such as exon-intron boundaries and transcrip- 6,7 for model organisms, and FAANG for farm animal species. tion termination features. For example, Roadmap Epigenomics The International Human Epigenome Consortium (IHEC) was project mapped about 30 epigenetic modifications across human 8 set up to gather reference maps of human epigenomes . The cell lines and tissues to gather a representative set of "complete" data from these efforts have generated new findings through epigenomes 2. Using this data, Kundaje et al. built a hidden integrated analyses. Such analyses are facilitated by consortia Markov model based classifier to define 15 distinct chromatin 8,9 data portals as well as portals gathering data from multiple states, including active or inactive promoters, active or inactive 10–14 sources , which allow easy browsing as well as download- enhancers, condense and quiescent states. Notably, this unsu- ing both sequences and processed data. In addition, many data pervised approach did not lead to the definition of "exon" states, portals include (or link to) genome browsers to allow online so- and even less to "exon-included" and "exon-excluded" states. This 15 lutions for the data exploration . might be because epigenetic modifications enriched at enhancers Several online tools have been developed to explore publicly and promoters have a strong signal (or peaks), while the ones abundant at gene bodies (DNA methylation, ) are wide and diffuse. The promoter and enhancers features therefore a GenPhySE, Université de Toulouse, INRAE, ENVT, 31326, Castanet Tolosan, France. dominate in epigenetic data analyses, hindering recovery of as- E-mail: [email protected] sociations between epigenetic modifications and other transcrip- b Computational Biology Unit, Department of Clinical Science, University of Bergen, tion control events such as splicing (constitutive or alternative). 5021, Bergen, Norway. E-mail:[email protected] † Electronic Supplementary Information (ESI) available at the end of this document. Moreover, some transcription features might not have a strong See DOI: 00.0000/00000000.

Journal Name, [year], [vol.],1–19 | 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. correlation with any chromatin modification studied. For exam- malised expression (Transcript per millions, TPM) were calcu- ple, Curado et al. 23 have estimated that only 4% of differential in- lated by pseudo-mapping of the reads to the human transcrip- cluded exons were associated with changes in H3K9ac, H3K27ac, tome using Salmon 34 in each sample. Moreover, for every middle and/or H3K4me3 across 5 different cell lines. A gap therefore exon in each cell type, exon inclusion ratio (or ψ) was calculated remains in genome-wide computational analyses towards getting (see methods), ranging from 0 to 1 (0 for exons not included, and a comprehensive overview of the associations between epigenetic 1 for exons included in all the transcripts). modifications and transcription control features. The genome-wide histone modification and DNAseI profiles On the other hand, individual targeted studies have provided for the cell types corresponding to the transcriptome data were evidence for interplay between and other transcrip- obtained from the Roadmap Epigenomics consortium. For the tional features. DNA methylation at gene bodies has been pos- Whole Genome Bisulfite Sequencing (WGBS) data, we computed itively correlated with gene expression level 24,25. Maunakea et three new tracks (figure S1†): CpG nucleotide density (consistent al. 26 observed that DNA methylation was positively correlated across cell and tissue types), CpG methylation ratio (average ra- with splicing at alternatively spliced exons, and proposed a mech- tio of methylation at CpG sites in the window), CpG methylation anism involving DNA methylation reader MECP2. Lev Maoret density (number of methylated CpG sites in each window), for al. 27 observed that DNA methylation at exons can be either posi- each sample, using the WGBS CpG coverage track as a control. tively or negatively correlated with splicing depending on the ex- For each pair of transcription feature and epigenetic modifica- ons, through a mechanism involving CTCF and MECP2. A causal tion, the associations were explored at two levels. (1) Cell or tis- role of DNA methylation in alternative splicing was established sue level: for all genes (or exons) in each cell or tissue type and by drug-induced de-menthylation 28, as well as by targeted DNA (2) Gene level: for each gene (or exon) across all cell or tissue methylations and de-methylations by Shayevitch et al. 29. Xu et types. Cell or tissue level analysis allows within-assay comparison al. 30,31 identified H3K36me3 epigenetic modification associated of highly and lowly expressed genes (or exons), but is sequence with alternative splicing. There is some evidence for epigenetic and genomic context (unique for each gene or exon) dependent. control at transcription termination as well: the loss of gene Whereas Gene level analysis fixes the sequence and genomic con- body DNA methylation was found to favour the usage of a proxi- text, but is more sensitive to technical variability across experi- mal alternative poly-adenylation site by unmasking CTCF binding ments. 32 sites . The main observations from these analyses are summarised In summary, many large data integration approaches only al- in Table 1 and elaborated in sections below. We have de- low extraction of epigenetic signatures for dominant features veloped a web application to allow exploration of analyses at of transcription control (e.g. enhancer, promoter, TSS), miss- https://joshiapps.cbu.uib.no/perepigenomics_app/. ing many other transcription features (e.g. exon, intron, TSS). We therefore performed a systematic analysis of associations Transcription activity and epigenetic modifications near Tran- between epigenetic modifications and diverse transcription fea- scription Start Sites tures. Using the Roadmap Epigenomics project data for 30 epigenetic modifications in about 30 cell and tissue contexts, Many epigenetic modifications are enriched around the TSS of ex- we explored links between epigenetic modifications and tran- pressed genes, a region containing gene promoters. To investigate scription control. We confirmed previously known associa- the link between transcription level and epigenetic modifications tions as well as generated novel observations. We have pro- at the TSS, we generated stack profiles of each epigenetic modi- vided our analyses freely through a web application available at fication around TSS. When epigenetic modifications were sorted https://joshiapps.cbu.uib.no/perepigenomics_app/, allowing re- according to gene expression (figure 1A), most histone modifica- searchers to browse the results and generate working hypotheses. tions studied were more abundant at highly expressed genes than at lowly expressed ones. Specifically, only one histone Results (H2BK20ac), three histone methylation (H3K9me3, H3K27me3, H3K36me3), and one histone variant H2A.Z did not show a posi- Exploration of epigenetic signatures of transcription control tive correlation with gene expression level. We further noted that using the Roadmap Epigenomics project data only H3K27me3 was more abundant at the TSSs of lowly or not To explore the epigenetic signatures at transcription control sites, expressed protein coding genes than at the TSSs of highly ex- we first extracted three gene features: Transcription start site pressed protein coding genes. H3K27me3 was not present at the (TSS), transcription termination site (TTS) and middle exons TSSs of not expressed, non-protein coding genes in any of the cell from GENCODE annotation version 29 33 (Table 1). We classi- or tissue type, highlighting the fact that the associations between fied genes in two different ways. Firstly, we partitioned all genes an epigenetic modification and transcriptional level may be gene- based on the gene length "long" (>3kb), "short" (≤1kb), and "in- type specific. termediate" length genes. We also classified genes based on a We explored the trends in peak shapes and noted that a ’double simplified GENCODE gene types, namely: protein coding, RNA hill’ (or ’M’ shape) with a gap at the exact location of the TSS was genes, pseudogenes and other genes (see method section). the most common shape. The ’gap’ of ChIP-seq signal between We obtained RNA sequencing data for about 30 cell or tis- the ’hills’ was located exactly at a sharp peak of DNAseI signal sue types from the Roadmap data portal 2. Gene and exon nor- around TSS (figure 1A), denoting a very high DNA accessibility

2 |Journal Name, [year], [vol.], 1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Table 1 Summary of the associations between epigenetic modifications and transcription features in the Roadmap epigenomics project data. Tick (or cross): whether (or not) an epigenetic assay shows an increase or decrease in signal at the feature. Pluses: positive associations. Minuses: negative associations. Zeroes: no correlation. NA: not available (data not available in enough cell types).

Journal Name, [year], [vol.],1–19 | 3 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Fig. 1 Relationships between epigenetic modifications near the Transcription Start Sites (TSS) and gene expression level. A. Association between epigenetic modifications near the TSS and gene expression level in the small intestine. Upper part, from left to right: Gene expression level in all 47,812 autosomal genes annotated by Gencode. First side bar indicates the gene type (green: protein coding genes, blue: pseudogenes, purple: RNA genes, red: other types of genes), the second side bar indicates the genes sorted according to expression level (figure 1A bottom, 5 bins in total, purple: highly expressed genes, green: lowly expressed genes). Stacked profiles of (i) DNAse-seq and respective control, of (ii) H3K4me3 ChIP-seq and respective input control, of (iii) CpG density, mCpG ratio (mCpG/CpG), mCpG density, and the WGBS coverage near TSS, sorted according to the corresponding gene expression level. Bottom part, from left to right: Boxplot of gene expression levels in each of the 5 expression bins defined in the upper part. Average profiles of DNAse-seq and respective control, of H3K4me3 ChIP-seq and respective input control, CpG density, mCpG ratio, mCpG density and WGBS coverage, ± SEM (Standard Error of the Mean) for each bin of promoters. B-E. Association between epigenetic marks near the TSS and gene expression level across cell types. B. Regression of expression level (in log10(TPM+1)) of the MKRN3 gene and the mean DNA methylation ratio at CpG sites 500bp around the TSS of the MKRN3 gene. Each dot corresponds to a cell type. The slope is negative and the correlation coefficient (R2) is greater than 0.75 for MKRN3. Similar regressions were generated for each gene and epigenetic modification pair. C-E. Distribution of the slopes from the gene regressions (as in B.) according to the r-squared correlation coefficients for DNAse-seq signal (C.), H3K4me3 ChIP-seq signal (D.) and mCpG ratio (E.) near the TSS of the corresponding genes. Numbers of genes and percentage of genes in each category are displayed below each box.

4 |Journal Name, [year], [vol.], 1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. at the TSS. This suggests presence of a -free region For example, the linear regression between the MKRN3 gene ex- at promoter terminating at the TSS, with a first nucleosome po- pression level and the average CpG methylation ratio near the sitioned near the +1 of transcription. This double hill pattern TSS of the MKRN3 gene resulted in a slope of -1.41, an R2 of 0.82, was either symmetric or asymmetric depending on the mark and and a p-value of 3.4 · 10−13. For each epigenetic modification, the cell or tissue type (e.g. Stronger peak at the downstream hill distribution of slopes across all genes was plotted against their R2 than the upstream hill in H3K4me3 in the small intestine, fig- (figure 1C, 1D and 1E). The epigenetic modifications with posi- ure 1A). H3K79me2 was particularly strongly asymmetric (figure tive slopes across cell or tissue types were: H3K4me3 (figure 1D), S2†), with a strong peak at around ± 500bp downstream of the H3K36me3, H2BK5ac, H3K4ac, H3K9ac, H3K18ac, H3K27ac and TSS across 2 of the 3 cell lines for which the mark was studied in chromatin accessibility measured by DNAse1 digestion assay (fig- Roadmap Epigenomics. ure 1C). DNA methylation showed a negative slope for the R2 CpG density around the TSS was positively correlated with greater (figure 1E). gene expression in all cell and tissues in the dataset. CpG methy- lation ratio and CpG methylation density near gene TSS were neg- atively correlated with gene expression levels. While CpG methy- Transcription activity and epigenetic modifications near Tran- lation ratio showed a flat profile near the TSS of non-expressed scription Termination Sites genes, CpG methylation density transitioned from a gap at highly expressed genes to a peak at non expressed genes at the TSS (fig- We repeated the analyses described above at transcription termi- ure 1A). nation sites (TTS). We noted that highly expressed genes tend to We further explored these trends for individual gene types. As be longer than not expressed genes. In short genes, is is difficult protein-coding genes formed the majority of all genes, the above to distinguish the effect of epigenetic modifications at TSS from observations for all gene types were preserved when the analyses that at the TTS. We therefore defined three classes of genes: short were restricted to protein coding genes only. Separating genes ac- (≤ 1kb), long (> 3kb), and intermediate length genes, to mitigate cording to gene type highlighted differences in epigenetic profiles this gene-length effect. at lincRNAs and unprocessed pseudogenes compared to protein coding genes. Processed pseudogenes often had a different rela- While many epigenetic modifications showed a peak (or a gap) tionship between their expression level and the epigenetic status centred at the gene TSS, only two modifications were enriched at of their promoter. For example, expressed processed pseudogenes the gene TTS. First, H3K36me3 displayed a broad hill-shape pro- showed neither DNAse1 accessibility peak at the TSS, nor an en- file, with a peak at TTS (figure 2A). Levels of H3K36me3 at TTS richment of active epigenetic modifications at the TSS. While CpG were positively correlated with gene expression level for different density near the TSS was correlated with processed pseudogene genes within a cell type (figure 2A), and also for gene expression expression level, their promoters did not show any decrease in level of the same gene across cell type (figure 2C). In some sam- DNA methylation ratio, resulting in a positive relation between ples, the H3K36me3 profile was slightly asymmetric at TTS, with DNA methylation density and processed pseudogene expression. more signal in the gene body than after the TTS. Expressed genes of the ’antisense’ gene type often mirrored the epigenetic signature of the ’protein coding’ gene type. For exam- DNA methylation density increased at the TTS (figure 2A). As ple, H3K79me2 peak was pronounced 500 bp before the TSS of DNA methylation ratio was nearly constant, the increase of DNA ’antisense’ genes, whereas it peaked 500bp after the TSS of pro- methylation density was mostly due to an increase of CpG density tein coding genes. However, it is likely that the observed epige- at the TTS. DNA methylation density at the TTS was positively netic signal at antisense genes might be due to the corresponding correlated with gene expression level when comparing different ’sense’ gene. genes within a cell type. A weak negative correlation between Altogether, gene expression level was positively or negatively DNA methylation and gene expression level at TTS was observed correlated with many epigenetic modifications at gene promot- for a subset of genes when comparing the same gene across cell ers when comparing expressed and not expressed genes within a and tissue types (figure 2D). This is in agreement with recent find- cell type. The associations between epigenetic modifications and ings 32. The negative correlation between DNA methylation and gene expression in a cell or tissue type are promoter sequence gene expression was evident only in ’long’ genes, and in protein and gene context dependent. For example, the CpG density at the coding genes. TSS is a good predictor of both the gene expression level and the CpG methylation ratio and density (figure 1A). The investigation Though most epigenetic modifications did not show enrich- of associations across cell types is therefore important. To study ment at TTS, four epigenetic modifications (DNAseI accessibil- association between epigenetic modifications and transcription ity, H3K4me1, H3K79me1, H3K27ac), were positively correlated across different cell and tissue types, we calculated linear regres- with expression level of a gene across cell and tissue types. These sion between each epigenetic modification and gene expression epigenetic modifications showed no enrichment at TTS, yet the level across cell or tissue types. Specifically, the epigenetic sig- change in expression level was associated with the change in epi- nals in ±500bp window around the TSS and the gene expression genetic modification strength. These modifications could be re- level (log10(TPM +1)) for each gene (figure 1B) were linearly re- flecting broader chromatin organisational features such as topo- gressed to obtain a slope and a linear correlation coefficient (R2). logically associating domain and A/B chromatin domains 35.

Journal Name, [year], [vol.],1–19 | 5 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Fig. 2 Relationships between epigenentic modifications near the Transcription Termination Sites (TTS) and expression level. A. Association between epigenetic marks near the TTS and gene expression level in the spleen. Upper part, from left to right: Gene expression level in all 47,812 genes annotated by Gencode. First side bar indicates the gene type (green: protein coding genes, blue: pseudogenes, purple: RNA genes, red: other types of genes), the second side bar indicates the 5 bins used in the figure 2A bottom panels (purple: highly expressed genes, green: lowly expressed genes). Stacked profiles of (i) H3K36me3 ChIP-seq and respective input control, of (ii) CpG density, mCpG ratio (mCpG/CpG), mCpG density, and the WGBS coverage near TTS, sorted according to the corresponding gene expression level. Bottom part, from left to right: Boxplot of gene expression level in each of the 5 bins defined in the upper part. Then, average profiles of DNAse-seq and respective control, of H3K4me3 ChIP-seq and respective input control, CpG density, mCpG ratio, mCpG density and WGBS coverage ± SEM (Standard Error of the Mean) for each bin of promoters. B-D. Association between epigenetic marks near the TTS and gene expression level across cell types. B. Regression of expression level (in log10(TPM+1)) of the GATA3 gene with the mean H3L36me3 ChIP-seq signal 500bp around the TTS of the GATA3 gene, where each dot corresponds to a cell or tissue type. The slope was positive and the correlation coefficient (R2) was greater than 0.75 in this case. Similar regressions were conducted for each gene and epigenetic modification pair. C. and D. Distribution of the slopes from the gene regressions (as in B.) according to the r-squared correlation coefficients for H3K36me3 modification (C.), and mCpG ratio (D.) near the TTS of the corresponding genes. Number of genes and percentage of genes in each category are displayed below each box.

6 |Journal Name, [year], [vol.], 1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Fig. 3 Relationships between epigenentic modifications near middle exons starts and exon expression level or exon inclusion ratio. A. Association between epigenetic marks near middle exons start sites and exon expression levels in the HUES64 cell line. Upper part, from left to right: middle exon expression levels in 16,811 middle exons annotated by Gencode. The side bar indicates 5 bins used in the figure 3A bottom panels (purple: highly expressed exons, green: lowly expressed exons). Then: Stacked profiles of H3K4me3 ChIP-seq, H3K36me3, and input control, sorted according to the exon expression levels. Bottom part, from left to right: Boxplot of exon expression levels in each of the 5 bins defined in the upper part. Then, average profiles of H3K4me3 ChIP-seq, H3K36me3, and respective input control ± SEM (Standard Error of the Mean) for each bin of exons. B. Association between epigenetic marks near middle exons start sites and exon inclusion ratios in the HUES64 cell line. Upper part, from left to right: middle exon inclusion ratios in 17,517 middle exons annotated by Gencode. The side bar indicates 5 bins used in the figure 3B bottom panels (purple: included exons, green: excluded expressed exons). Then: Stacked profiles of H3K4me3 ChIP-seq, H3K36me3, and respective input control, sorted according to the corresponding exon inclusion ratio. Bottom part, from left to right: Boxplot of exon inclusion ratios in each of the 5 bins defined in the upper part. Then, average profiles of H3K4me3 ChIP-seq, H3K36me3, and respective input control ± SEM (Standard Error of the Mean) for each bin of exons. C. and D. Distribution of the slopes from exon expression level regressions and epigenetic mark presence at exon start ( ± 100 bp) according to the R2 correlation coefficients for H3K4me3 signal (C.), and H3K36me3 (D.) near the start of the corresponding middle exons. E. Distribution of the slopes from exon inclusion ratio regressions and epigenetic modification at exon start ( ± 100 bp) grouped by the R2 correlation coefficients for H3K36me3 modification. Number of exons and percentage of exons in each category are displayed below each box.

Journal Name, [year], [vol.],1–19 | 7 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Exon transcription and epigenetic modifications at middle ex- less very weak associations for mCpG ratio, DNAse1, H3K4me1, ons H3K27me3, H3k9ac and H4K8ac. Several studies have found a correlation between DNA methyla- tion 26,27,36 or histone modifications 23,30,31 and exon and splic- A linear model for gene expression ing events, in either single or a few cell and tissue contexts. We So far, we analysed the associations between epigenetic modifi- explored whether these observations hold true in the Roadmap cations and transcription control in a pair-wise manner. In order Epigenomics dataset. We focused on middle exons and excluded to better model the combinatorial effect of modifications, we se- first and last exons of protein coding genes. A total of 16,811 lected 6 epigenetic modifications (DNA methylation, H3K4me1, middle exons were expressed in at least one cell or tissue type H3K4me3, H3K9me3, H3K27me3 and H3K36me3) for which epi- in the Roadmap dataset. The expression level of an exon was genetic and transcriptome data was available for 27 cell types in defined as the sum of the TPM of the transcripts including that the Roadmap dataset, and regressed a linear model at four tran- exon. Similar to TSS and TTS analysis, we performed epigenetic scriptional features: (i) epigenetic modifications around TSS and modifications enrichment analysis at middle exons. H3K36me3 gene expression level (figure 4A), (ii) epigenetic modifications showed enrichment at middle exons, correlated with the exon ex- around TTS and gene expression level (figure 4B), (iii) epigenetic pression level within a cell or tissue type (figure 3A). Changes modifications around the start of middle exons and exon expres- in H3K36me3 at exons was also strongly associated with exons sion level (figure 4C), and (iv) epigenetic modifications around expression level across cell and tissue types (figure 3D). Though the start of middle exons and exon inclusion ratio (figure 4D). some other epigenetic modifications showed a weak enrichment At TSS, amongst the 6 marks studied, DNA methylation was at exons, similar to input samples used as negative controls, associated with gene repression, while all studied histone methy- therefore likely reflecting a technical artefact. Epigenetic mod- lations were associated with activation of expression level (figure ifications including H3K4me1, H3K4me3 (figure 3C), H3K27ac, 4A). At TTS, we noted similar pattern as TSS, with a difference H3K79me1, H3K79me2, H3K9ac, and H3K8ac, were correlated that H3K36me3 at TTS was the modification most strongly associ- with exon expression level across cell and tissue types, but did ated with an increase in expression level, followed by H3K4me1 not show any enrichment at the middle exons. The H3K4me3 sig- as the second most positively associated (figure 4B). We noted nal, peaked at the gene TSS, and terminated around the start of that restricting the analysis to long genes only, TTS associations the first internal exon (figure 3A), resulting in a transition from with epigenetic modifications were much weaker (figure S4†), H3K4me3 marked chromatin to H3K36me3 marked chromatin highlighting that a important portion of the TTS associations were near the beginning of the second exon of genes. due to short genes for which the promoter marks might confound There was no difference between DNA methylation ratio at ex- the TTS signal. H3K36me3 at middle exons was also strongly ons and introns, but exons had overall more CpG sites than in- associated with exon expression levels (figure 4C). Finally, no sig- trons, resulting in an higher DNA methylation density at middle nificant associations were detected for epigenetic modifications exons than at surrounding introns. DNAse1 accessibility further (of 6 selected ones) at exon inclusion ratio (figure 4D). decreased at exon start (Figure S3†), from the already low sur- rounding accessibility, suggesting that the splicing acceptor site PEREpigenomics, a web resource to explore associations be- has even lower accessibility than surrounding regions. tween epigenetic modifications and transcription features Using the Roadmap dataset, we unravelled a range of associa- Exon inclusion ratio and epigenetic modifications at middle tions between epigenetic modifications and transcription control. exons We have included a total of 9,024 stacked profiles of epigenetic Exon expression level in a cell type consists of both constitutive modifications near TSS, TTS and middle exons, sorted according and alternative splicing events. To study association between to the gene expression level, exon expression level or exon inclu- epigenetic modifications and alternate splicing events, we calcu- sion ratio, in about 30 different cell and tissue types. We have lated exon inclusion ratio for each exon. The exon inclusion ra- also provided stacked profiles for TSS and TTS for each gene type tion was calculated for all transcripts of a given gene (see meth- (protein-coding, RNA, pseudogene, other), as well as for short ods), in each cell or tissue type and sorted using the inclusion (≤ 1kb), long (> 3kb) and intermediate length genes. Users fur- ratio (also known as Psi). Accordingly, we obtained Psi values thermore can generate regressions between an epigenetic modi- for 17,517 exons, including 706 exons that were never included fication at TSS or TTS and gene expression level across cell and in the Roadmap datasets, but were part of genes that were ex- tissue types for a gene of interest, and the same feature is avail- pressed in this dataset. We checked whether epigenetic mod- able for middle exons and exon expression level or exon inclusion ifications were correlated with exon inclusion ratio and noted ratio as well. that only a few modifications; namely H3K27ac, H3K4me3, and We have made the analyses available to users through a web- H3K36me3 were associated with exon inclusion ratio within a application at joshiapps.cbu.uib.no/perepigenomics_app/. cell type (figure 3B). DNA methylation was also associated with exon inclusion ratio. No epigenetic modification showed a signifi- Discussion cant association with the changes in inclusion ratio at the alterna- In summary, this multi-faceted analysis of the Roadmap Epige- tively included exons (figure 3E, Table 1). There were neverthe- nomics data, freely available through a web application, has en-

8 | Journal Name, [year], [vol.],1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Fig. 4 Linear regression models including six epigenetic modifications characterised in 27 cell types. Each bar represents the number of genes or exons with a statistically significant slope (p <= 0.01), either positive (golden) or negative (deep blue). A. Linear regression model of gene expression level and the levels of 6 epigenetic modifications near their respective TSS (±500 bp). B. Linear regression model of gene expression level and the levels of 6 epigenetic modifications near their respective TTS (±500 bp). C. Linear regression model of middle exon expression level and the levels of 6 epigenetic modifications near their respective start (±100 bp). D. Linear regression model of middle exon inclusion ratio and the levels of 6 epigenetic modifications near their respective start (±100 bp).

Journal Name, [year], [vol.],1–19 | 9 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. abled us confirm known (or previously observed in only one of DNA methylation few cell types) associations between epigenetic modifications and CpG density is highly variable in the human genome, with CpG transcription on a large data set and further formulate new hy- depleted or CpG poor regions spanning most of the genome. We potheses. We specifically discuss epigenetic associations of pre- noted that CpG density at TSS was strongly associated with gene viously under-explored transcription features below grouped ac- expression across genes within a cell type, where most expressed cording to the epigenetic modification. genes had a CGI centred at their TSS. Accordingly, at TSS, the mCpG ratio, or the average methylation at CpG sites, was nega- tively associated with gene expression across genes within a cell Histone modifications type. This trend overlaps with the CpG density where most of CpG deserts are heavily methylated, and most CGI are unmethy- For 28 studied histone modifications near the TSS, 25 were pos- lated 40 (figure S1†). Increase in mCpG ratio at TSSs was also itively correlated with gene expression across genes within a cell associated with the down-regulation of gene expression in the type, and a subset of them (16) were also correlated with gene gene level analysis. Non-promoter regions were methylated with expression when comparing the same gene across cell and tis- mCpG ratios around 85%, this ratio dropped to around 30% at the sue types (table 1). H3K27me3 was the only mark that was TSSs of the expressed protein-coding genes. We noted that this more abundant at TSS of lowly or not expressed genes than at not the case for expressed pseudogenes, whose TSSs had higher highly expressed protein coding genes. It is important to note CpG density than surrounding regions, but remained methylated. that H3K27me3 modification was not present at non-protein cod- mCpG density is the number of methylated sites at a given ge- ing genes, highlighting that the negative correlation between nomic region, calculated as a product of the CpG density and the H3K27me3 and gene expression is a gene type specific associa- average mCpG ratio at a give genomic region. It has been shown 37,38 tion. H3K9me3 has been associated with closed chromatin , that mCpG density, but not mCpG ratio, was the main driver of the and was not enriched at the TSS of non expressed genes in the binding of DNA methylation readers of the MBD family 41. While Roadmap Epigenomics data set. While many active histone marks many publications focus solely on the mCpG ratio as a metric to showed a TSS asymmetry, with higher signal downstream of the evaluate DNA methylation, we argue that multiple metrics pro- TSS than at the upstream of the TSS. H3K79me1 and H3K79me2 vide complementary information. For example, while mCpG ratio were enriched downstream of the TSS of expressed genes (figure stays constant across the promoters of repressed genes, mCpG † S2 ), suggesting an association with transcription direction. In- density peaks near the TSS of repressed genes, suggesting that deed, H3K79 methylations are catalysed by the DOT1L these regions might preferentially recruit repressive DNA binding 39 during transcription elongation . It should be noted that their proteins (e.g. MBP). DNA methylation density also peaks at the profile was only available in 5 and 3 cell lines respectively, all TSSs of expressed processed pseudogenes, as their CGIs remain derived from the embryonic stem cell line H1. H3K79me1 and methylated. At exons, mCpG ratio is as high as at introns, but H3K79me2 asymmetries were less marked in the H1-derived tro- the CpG density is higher at exons than at introns, resulting in an phoblast, which could reflect either a relevant biological differ- higher mCpG density. ence or be caused by experimental issues. It has been observed that GC rich region might be more difficult to sequence using some Illumina sequencing protocols, resulting Among histone modifications, H3K36me3 displayed a unique in lower coverage at CGI 42. We noted this bias in about half of the profile across all studied transcription features. The H3K36me3 WGBS samples, where WGBS coverage at TSS was anti-correlated modification is positively correlated with gene expression at the with gene expression across genes within a cell type. Some WGBS TTS across genes within a cell type, and also when comparing samples were less affected by this bias, while a few showed even the same gene across cell and tissue types. While largely absent an inverse trend, with higher coverage at GC rich regions. Luckily from the TSS region, it is enriched at all middle exons and on these biases leave mCpG ratio and mCpG density profiles largely the last exon. We noted that gene body H3K36me3 begins at unaffected, therefore preserving the validity of the analysis. the start of the second exon, where H3K4me3 peak diminishes. We further explored a potential link between exonic H3K36me3 and alternative splicing. Our approach using exon inclusion ratio DNA accessibility and nucleosome positioning revealed that any such association was either weak or restricted to DNAse1 assay showed a narrow (<100 bp) peak before and at only a few exons. Similar observations have been made by others: the TSS of expressed genes. This narrow peak of DNA accessibil- Xu et al 31 noted that from about 3000 alternative splicing events, ity matched the location of a dip in the bi-modal signal present 800 were positively correlated with changes in H3K36me3 and in many histone modifications positively correlated with expres- 700 where negatively correlated with changes in H3K36me3. It sion level (e.g. H3K4me3 and many histone acetylations). These should be noted that H3K36me3 modified genomic regions (wide observations suggest that there is a short nucleosome-free region peaks) tend to be an order of magnitude larger than an average before the TSS of expressed genes, with a nucleosome positioned exon size (1000 pb vs 100 pb). Altogether, though 10 histone just after the +1 of transcription. Such nucleosome positioning modifications showed some enrichment at exons, the associations effect is well described in yeast 43 and in mammals 44. Intrigu- between epigenetic modifications and changes in exon inclusion ingly, middle exon starts and TTS positions appear to be depleted ratio were weak and/or limited to a subset of genes. of DNAse1 signal, even more so than the surrounding regions

10 |Journal Name, [year], [vol.], 1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

(figure S3†). This suggests that middle and last exon starts are Methods particularly inaccessible regions, maybe due to nucleosome posi- tioning 45 or the presence of the splicing machinery at acceptor Data retrieving sites. Gencode human annotation version 29 (main annotation file), were downloaded from theGencode as gff3 files. Reads from RNA-seq data were retrieved from the European Nucleotide Analysis across cell and tissue types Archive using the the Roadmap sample table as a reference. Roadmap Epigenomics whole genome bisulphite sequencing For each gene and middle exon, we correlated the level of epi- (WGBS) data sets were downloaded bigwig files of fractional genetic modification with expression level or exon inclusion ra- methylation and read coverage from the Roadmap Epigenomics tio across the different cell and tissue types in the Roadmap the Roadmap data portal. dataset. When regressing promoter DNA methylation with gene expression levels, 2,506 genes had a linear regression coefficient Histone modifications and DNAse1 data were downloaded as R2 ≥ 0.75 (5.2% of all genes), including 1,173 protein coding consolidated, not subsampled, tagAlign files from the the genes (6.3% of protein coding genes). The average slope was neg- Roadmap data portal. ative, indicating that for an individual gene, higher level of pro- Altogether, we retrieved 27 RNA-seq and corresponding WGBS moter DNA methylation is associated with lower gene expression datasets, 13 DNAse1 profiles (with matching controls), and 242 level. Conversely, 2,485 genes had a linear regression coefficient histone ChIP-seq datasets in 27 human cell lines or tissues (with 2 R ≥ 0.75 (5.2% of all genes) when regressing H3K4me3 levels at 27 matching controls). the promoter and gene expression level, including 1,367 protein coding genes (7.3% of protein coding genes). The average slope in this case was positive, i.e. higher level of promoter H3K4me3 RNA sequencing analysis was associated with higher gene expression level. Nonetheless, RNA-sequencing reads were quantified using Salmon 34 v12.0 by most genes or exons did not have a regression coefficient ≥ 0.25. pseudoalignement to the human reference genome hg38 and an- This might be because many genes and exons might not have notations v29 provided by Gencode 33. We selected the param- large enough epigenetic or expression variability in this dataset. eters validateMappings, seqBias, gcBias where on, with Furthermore, ChIP-seq and DNAse-seq peak height can also be bi- biasSpeedSamp equal to 5, and libType equals to A. For sam- ased by the epigenetic modifications (or accessibility) in the dom- ples with biological replicates, the median expression value in inant fraction of alleles and cells, as well as biased by changes in TPM samples was used for genes and transcripts. For each exon, ChIP efficiency due to hard-to-control experimental variations. exon expression level was calculated as the sum of the TPM of the transcripts including this exon. Exon inclusion ratios as com- Conclusions puted as the sum of TPM of transcripts including the exon divided by the TPM value of the gene. In summary, we performed a comprehensive analysis to study We considered each gene uniquely, by selecting a representa- links between epigenetic modifications and transcription control tive TSS (ory TTS) per gene, from all annotated TSSs (or TTSs). using the Roadmap Epigenomic data. The Roadmap Epigenomic We sorted all transcripts of a gene according to their TSS (or TTS) histone modifications, whole genome bisulfite sequencing, and genomic coordinates and selecting the TSS (or TTS) of the tran- RNA-seq data, across diverse human cell and tissue types, al- script at the middle of the list, i.e. the median TSS (or TTS). The lowed us to confirm the generality of previously described re- list of middle exons was obtained by taking the shortest transcript lations between epigenetic modifications and transcription con- of each protein coding gene, then selecting genes with 3 or more trol in one or few cell types, as well as to generate new hy- exons, and excluding the first and last exons of each transcript. potheses about the interplay between epigenetic modifications The shortest known transcript isoform was used to ensure that and transcript diversity. Importantly, our analysis focused on pre- the list of middle exons could not contain first or last exons of viously under-explored features of transcription control includ- other isoforms. Most annotated non-protein coding genes were ing, transcription termination sites, exon-intron boundaries, mid- monoexonic, and were excluded from the exonic analyses. dle exons and exon inclusion ratio. We have produced thou- sands of stack profile plots of epigenetic marks around gene fea- Epigenetic modifications data processing tures sorted according to gene expression level, exon expression level or exon inclusion ratio, and filterable by gene type. These For WGBS, three different tracks were generated from the plots are made freely available through a web application, joshi- FractionalMethylation.bigwig files using bedtools 46 and apps.cbu.uib.no/perepigenomics_app/. We hope this web appli- rtracklayer 47: number of CpG sites per window, mean DNA cation will serve the community as a resource i) to validate known methylation ratio per window, density of mCpG sites per win- or previously described epigenetic modifications associated with dow, using windows of 250 base pair width, sliding by 100 base transcription features ii) as well as an interactive tool to allow pairs. The WGBS coverage file was processed similarly to produce exploration of the data as a novel hypotheses generator of epige- a fourth track serving as a control. No post-processing was done netic and transcriptional control. for DNAse1 and Histone tagAlign files.

Journal Name, [year], [vol.],1–19 | 11 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Heatmap generation Author’s contributions Gene types were derived from Gencode 33, where the pseudogene GD and AJ designed the analyses and the web applications, and category contained all genes with a word "pseudogene", and the wrote this manuscript. GD performed the analysis and developed "other" type of gene was defined as neither protein coding, nor the web application. pseudogene, nor RNA gene. Genes were binned in 5 groups of equal sizes according to their expression values, and exons in 5 Acknowledgements groups according to their expression values or exon inclusion ra- GD was funded for this work by the People Program (Marie tio. Curie Actions FP7/2007-2013) under REA grant agreement No For genes with multiple Gencode annotations for TSSs (or PCOFUND-GA-2012-600181. AJ is supported by the Bergen Re- TTSs), we sorted all TSSs (or TTSs) according to their genomic search Foundation Grant no. BFS2017TMT01. The authors would coordinate (5’ to 3’, taking into account their orientation) and like to thanks Olaf Sarnow, Kjell Petersen and Stanislav Oltu for took the TSS (or TTS) of the transcript in the middle of the list. their help with the web server configuration, and Anna Mantsoki The list of middle exons was obtained using protein coding genes. and Barry Horne for her help at the beginning of the project. For genes with several annotated transcripts, the transcript with Abreviations the smallest length was selected, first and last exons were filtered, and the remaining exons were kept. Stacked profiles of tracks • TSS: Transcription Start Sites centred at TSS or TTS were generated using a region of ± 2.5kb • TTS: Transcription Termination Sites around the TSS or TTS, using windows of 100 bases every 100 bases. Stacked profiles of tracks centred at middle exons were • TPM: Transcript Per Million of reads drawn using a region of ± 1kb around middle exon starts, with windows of 50 bases every 50 bases. For histone modifications • CpG: Cytosine - Guanine dinucleotide and DNAse1 data, and corresponding input controls, coverage was expressed as FPKM values. For CpG density and mCpG den- • CGI: CpG island sity was defined as the number of (methylated) sites per window • Psi: Exon inclusion ratio (250 bp for TSS or TTS, 100 bp for exons). The mCpG ratio rnged from 0 (all the CpG sites in the window fully unmethylated) to 1 • WGBS: Whole Genome Bisulfite Sequencing (all the CpG sites in window fully methylated), and the coverage was expressed as number of reads covering a region. For each • H3K9me3: tri-methylation of 9 of histone 3 of the five gene (or exon) bins, we displayed the value distribu- tion as boxplots, and the average profiles ± standard error of the • H4K5ac: acetylation of lysine 5 of histone 4 mean (SEM) for each bin. Heatmaps were drawn by a custom script using the following packages: seqplots 48, Repitools 49, Ge- Notes and references 50 47 51 nomicRanges , rtracklayer , and plotrix . 1 J. Romanowska and A. Joshi, Genes, 2019, 10, 76. 2 A. Kundaje, , W. Meuleman, J. Ernst, M. Bilenky, A. Yen, Regression analysis A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, V. Amin, J. W. Whitaker, M. D. Schultz, L. D. Ward, At each gene TSS (or TTS), the epigenetic modification level in a A. Sarkar, G. Quon, R. S. Sandstrom, M. L. Eaton, Y.-C. Wu, sample was averaged in the ± 500 bp region around the TSS (or A. R. Pfenning, X. Wang, M. Claussnitzer, Y. Liu, C. Coarfa, TTS) and a linear regression was calculated for each gene using R. A. Harris, N. Shoresh, C. B. Epstein, E. Gjoneska, D. Le- expression values in log10(TPM + 1). Epigenetic marks at the ± ung, W. Xie, R. D. Hawkins, R. Lister, C. Hong, P. Gascard, 100 bp region around middle exon starts were regressed with A. J. Mungall, R. Moore, E. Chuah, A. Tam, T. K. Canfield, either log10(TPM + 1) of the exon or the exon inclusion ratio. R. S. Hansen, R. Kaul, P. J. Sabo, M. S. Bansal, A. Carles, For each regression the slope and regression coefficient (R2) were J. R. Dixon, K.-H. Farh, S. Feizi, R. Karlic, A.-R. Kim, A. Kulka- obtained, using dplyr 52, purrr 53, and broom 54. rni, D. Li, R. Lowdon, G. Elliott, T. R. Mercer, S. J. Neph, V. Onuchic, P. Polak, N. Rajagopal, P. Ray, R. C. Sallari, K. T. Web application and script availability Siebenthall, N. A. Sinnott-Armstrong, M. Stevens, R. E. Thur- man, J. Wu, B. Zhang, X. Zhou, A. E. Beaudet, L. A. Boyer, PEREpigenomics is developed in Shiny 55. Source P. L. D. Jager, P. J. Farnham, S. J. Fisher, D. Haussler, S. J. M. code and data of the application are available at Jones, W. Li, M. A. Marra, M. T. McManus, S. Sunyaev, J. A. forgemia.inra.fr/guillaume.devailly/perepigenomics_app. Thomson, T. D. Tlsty, L.-H. Tsai, W. Wang, R. A. Waterland, Scripts used to process the data and generate the plots can be M. Q. Zhang, L. H. Chadwick, B. E. Bernstein, J. F. Costello, found at: github.com/gdevailly/perepigenomicsAnalysis. J. R. Ecker, M. Hirst, A. Meissner, A. Milosavljevic, B. Ren, J. A. Conflicts of interest Stamatoyannopoulos, T. Wang and M. Kellis, Nature, 2015, 518, 317–330. The authors declare that they have no competing interests. 3 Nature, 2012, 489, 57–74.

12 | Journal Name, [year], [vol.],1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

4 D. Adams, L. Altucci, S. E. Antonarakis, J. Ballesteros, S. Beck, wick, K. M. Chan, W. Chen, T. H. Cheung, L. Chiapperino, A. Bird, C. Bock, B. Boehm, E. Campo, A. Caricasole, F. Dahl, N. H. Choi, H.-R. Chung, L. Clarke, J. M. Connors, P. Cronet, E. T. Dermitzakis, T. Enver, M. Esteller, X. Estivill, A. Ferguson- J. Danesh, M. Dermitzakis, G. Drewes, P. Durek, S. Dyke, Smith, J. Fitzgibbon, P. Flicek, C. Giehl, T. Graf, F. Grosveld, T. Dylag, C. J. Eaves, P. Ebert, R. Eils, J. Eils, C. A. Ennis, T. En- R. Guigo, I. Gut, K. Helin, J. Jarvius, R. Küppers, H. Lehrach, ver, E. A. Feingold, B. Felder, A. Ferguson-Smith, J. Fitzgib- T. Lengauer, Å. Lernmark, D. Leslie, M. Loeffler, E. Macin- bon, P. Flicek, R. S.-Y. Foo, P. Fraser, M. Frontini, E. Furlong, tyre, A. Mai, J. H. Martens, S. Minucci, W. H. Ouwehand, S. Gakkhar, N. Gasparoni, G. Gasparoni, D. H. Geschwind, P. G. Pelicci, H. Pendeville, B. Porse, V. Rakyan, W. Reik, P. Glažar, T. Graf, F. Grosveld, X.-Y. Guan, R. Guigo, I. G. M. Schrappe, D. Schübeler, M. Seifert, R. Siebert, D. Sim- Gut, A. Hamann, B.-G. Han, R. A. Harris, S. Heath, K. Helin, mons, N. Soranzo, S. Spicuglia, M. Stratton, H. G. Stunnen- J. G. Hengstler, A. Heravi-Moussavi, K. Herrup, S. Hill, J. A. berg, A. Tanay, D. Torrents, A. Valencia, E. Vellenga, M. Vin- Hilton, B. C. Hitz, B. Horsthemke, M. Hu, J.-Y. Hwang, N. Y. gron, J. Walter and S. Willcocks, Nature Biotechnology, 2012, Ip, T. Ito, B.-M. Javierre, S. Jenko, T. Jenuwein, Y. Joly, S. J. 30, 224–226. Jones, Y. Kanai, H. G. Kang, A. Karsan, A. K. Kiemer, S. C. Kim, 5 S. E. Celniker, , L. A. L. Dillon, M. B. Gerstein, K. C. Gunsalus, B.-J. Kim, H.-H. Kim, H. Kimura, S. Kinkley, F. Klironomos, S. Henikoff, G. H. Karpen, M. Kellis, E. C. Lai, J. D. Lieb, D. M. I.-U. Koh, M. Kostadima, C. Kressler, R. Kreuzhuber, A. Kun- MacAlpine, G. Micklem, F. Piano, M. Snyder, L. Stein, K. P. daje, R. Küppers, C. Larabell, P. Lasko, M. Lathrop, D. H. White and R. H. Waterston, Nature, 2009, 459, 927–930. Lee, S. Lee, H. Lehrach, E. Leitão, T. Lengauer, Å. Lern- 6 L. Andersson, , A. L. Archibald, C. D. Bottema, R. Brauning, mark, R. D. Leslie, G. K. Leung, D. Leung, M. Loeffler, Y. Ma, S. C. Burgess, D. W. Burt, E. Casas, H. H. Cheng, L. Clarke, A. Mai, T. Manke, E. R. Marcotte, M. A. Marra, J. H. Martens, C. Couldrey, B. P. Dalrymple, C. G. Elsik, S. Foissac, E. Giuf- J. I. Martin-Subero, K. Maschke, C. Merten, A. Milosavlje- fra, M. A. Groenen, B. J. Hayes, L. S. Huang, H. Khatib, J. W. vic, S. Minucci, T. Mitsuyama, R. A. Moore, F. Müller, A. J. Kijas, H. Kim, J. K. Lunney, F. M. McCarthy, J. C. McEwan, Mungall, M. G. Netea, K. Nordström, I. Norstedt, H. Okae, S. Moore, B. Nanduri, C. Notredame, Y. Palti, G. S. Plastow, V. Onuchic, F. Ouellette, W. Ouwehand, M. Pagani, V. Pan- J. M. Reecy, G. A. Rohrer, E. Sarropoulou, C. J. Schmidt, J. Sil- caldi, T. Pap, T. Pastinen, R. Patel, D. S. Paul, M. J. Pazin, verstein, R. L. Tellam, M. Tixier-Boichard, G. Tosser-Klopp, P. G. Pelicci, A. G. Phillips, J. Polansky, B. Porse, J. A. Pospisi- C. K. Tuggle, J. Vilkki, S. N. White, S. Zhao and H. Zhou, lik, S. Prabhakar, D. C. Procaccini, A. Radbruch, N. Rajew- Genome Biology, 2015, 16, year. sky, V. Rakyan, W. Reik, B. Ren, D. Richardson, A. Richter, 7 S. Foissac, S. Djebali, K. Munyard, N. Vialaneix, A. Rau, D. Rico, D. J. Roberts, P. Rosenstiel, M. Rothstein, A. Salhab, K. Muret, D. Esquerré, M. Zytnicki, T. Derrien, P. Bardou, H. Sasaki, J. S. Satterlee, S. Sauer, C. Schacht, F. Schmidt, F. Blanc, C. Cabau, E. Crisci, S. Dhorne-Pollet, F. Drouet, T. Fa- G. Schmitz, S. Schreiber, C. Schröder, D. Schübeler, J. L. raut, I. Gonzalez, A. Goubil, S. Lacroix-Lamandé, F. Laurent, Schultze, R. P. Schulyer, M. Schulz, M. Seifert, K. Shirahige, S. Marthey, M. Marti-Marimon, R. Momal-Leisenring, F. Mom- R. Siebert, T. Sierocinski, L. Siminoff, A. Sinha, N. Soranzo, part, P. Quéré, D. Robelin, M. S. Cristobal, G. Tosser-Klopp, S. Spicuglia, M. Spivakov, C. Steidl, J. S. Strattan, M. Stratton, S. Vincent-Naulleau, S. Fabre, M.-H. P.-V. der Laan, C. Klopp, P. Südbeck, H. Sun, N. Suzuki, Y. Suzuki, A. Tanay, D. Tor- M. Tixier-Boichard, H. Acloque, S. Lagarrigue and E. Giuffra, rents, F. L. Tyson, T. Ulas, S. Ullrich, T. Ushijima, A. Valen- BMC Biology, 2019, 17, year. cia, E. Vellenga, M. Vingron, C. Wallace, S. Wallner, J. Walter, 8 D. Bujold, D. A. de Lima Morais, C. Gauthier, C. Côté, H. Wang, S. Weber, N. Weiler, A. Weller, A. Weng, S. Wilder, M. Caron, T. Kwan, K. C. Chen, J. Laperle, A. N. Markovits, S. M. Wiseman, A. R. Wu, Z. Wu, J. Xiong, Y. Yamashita, T. Pastinen, B. Caron, A. Veilleux, P.-É. Jacques and X. Yang, D. Y. Yap, K. Y. Yip, S. Yip, J.-I. Yoo, D. Zerbino and G. Bourque, Cell Systems, 2016, 3, 496–499.e2. G. Zipprich, Cell, 2016, 167, 1145–1149. 9 C. A. Davis, B. C. Hitz, C. A. Sloan, E. T. Chan, J. M. David- 12 S. J. Marygold, , M. A. Crosby and J. L. Goodman, Methods in son, I. Gabdank, J. A. Hilton, K. Jain, U. K. Baymuradov, A. K. Molecular Biology, Springer New York, 2016, pp. 1–31. Narayanan, K. C. Onate, K. Graham, S. R. Miyasato, T. R. 13 J. Chèneby, M. Gheorghe, M. Artufel, A. Mathelier and Dreszer, J. S. Strattan, O. Jolanki, F. Y. Tanaka and J. M. B. Ballester, Nucleic Acids Research, 2017, 46, D267–D275. Cherry, Nucleic Acids Research, 2017, 46, D794–D801. 14 J. Chèneby, Z. Ménétrier, M. Mestdagh, T. Rosnet, A. Douida, 10 M. Sánchez-Castillo, D. Ruau, A. C. Wilkinson, F. S. Ng, W. Rhalloussi, A. Bergon, F. Lopez and B. Ballester, Nucleic R. Hannah, E. Diamanti, P. Lombard, N. K. Wilson and Acids Research, 2019. B. Gottgens, Nucleic Acids Research, 2014, 43, D1117–D1123. 15 D. Li, S. Hsu, D. Purushotham, R. L. Sears and T. Wang, Nu- 11 H. G. Stunnenberg, M. Hirst, S. Abrignani, D. Adams, cleic Acids Research, 2019, 47, W158–W165. M. de Almeida, L. Altucci, V. Amin, I. Amit, S. E. Antonarakis, 16 C. Coarfa, C. S. Pichot, A. Jackson, A. Tandon, V. Amin, S. Aparicio, T. Arima, L. Arrigoni, R. Arts, V. Asnafi, M. Es- S. Raghuraman, S. Paithankar, A. V. Lee, S. E. McGuire and teller, J.-B. Bae, K. Bassler, S. Beck, B. Berkman, B. E. Bern- A. Milosavljevic, BMC Bioinformatics, 2014, 15, year. stein, M. Bilenky, A. Bird, C. Bock, B. Boehm, G. Bourque, 17 F. Albrecht, M. List, C. Bock and T. Lengauer, Nucleic Acids C. E. Breeze, B. Brors, D. Bujold, O. Burren, M. J. Bussemak- Research, 2016, 44, W581–W586. ers, A. Butterworth, E. Campo, E. C. de Santa-Pau, L. Chad- 18 G. Devailly, A. Mantsoki and A. Joshi, Bioinformatics, 2016,

Journal Name, [year], [vol.],1–19 | 13 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

32, 3354–3356. M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. 19 Y. He and T. Wang, Bioinformatics, 2017, 33, 3268–3275. Sabo, M. O. Dorschner, R. Sandstrom, B. Bernstein, M. A. Ben- 20 M. G. Dozmorov, Bioinformatics, 2017, 33, 3323–3330. der, M. Groudine, A. Gnirke, J. Stamatoyannopoulos, L. A. 21 S. Oki, T. Ohta, G. Shioi, H. Hatanaka, O. Ogasawara, Mirny, E. S. Lander and J. Dekker, Science, 2009, 326, 289– Y. Okuda, H. Kawaji, R. Nakaki, J. Sese and C. Meno, EMBO 293. reports, 2018, 19, year. 36 S. Shukla, E. Kavak, M. Gregory, M. Imashimizu, B. Shuti- 22 M. M. Hoffman, O. J. Buske, J. Wang, Z. Weng, J. A. Bilmes noski, M. Kashlev, P. Oberdoerffer, R. Sandberg and S. Ober- and W. S. Noble, Nature Methods, 2012, 9, 473–476. doerffer, Nature, 2011, 479, 74–79. 23 J. Curado, C. Iannone, H. Tilgner, J. Valcárcel and R. Guigó, 37 T. Kouzarides, Cell, 2007, 128, 693–705. Genome Biology, 2015, 16, year. 38 I. A. Tchasovnikarova, R. T. Timms, N. J. Matheson, K. Wals, 24 Y. Li, J. Zhu, G. Tian, N. Li, Q. Li, M. Ye, H. Zheng, J. Yu, R. Antrobus, B. Gottgens, G. Dougan, M. A. Dawson and P. J. H. Wu, J. Sun, H. Zhang, Q. Chen, R. Luo, M. Chen, Y. He, Lehner, Science, 2015, 348, 1481–1485. X. Jin, Q. Zhang, C. Yu, G. Zhou, J. Sun, Y. Huang, H. Zheng, 39 K. Wood, M. Tellier and S. Murphy, Biomolecules, 2018, 8, 11. H. Cao, X. Zhou, S. Guo, X. Hu, X. Li, K. Kristiansen, L. Bol- 40 A. M. Deaton and A. Bird, Genes & Development, 2011, 25, und, J. Xu, W. Wang, H. Yang, J. Wang, R. Li, S. Beck, J. Wang 1010–1022. and X. Zhang, PLoS Biology, 2010, 8, e1000533. 41 T. Baubec, R. Ivánek, F. Lienert and D. Schübeler, Cell, 2013, 25 M. B. Stadler, R. Murr, L. Burger, R. Ivanek, F. Lienert, 153, 480–492. A. Schöler, C. Wirbelauer, E. J. Oakeley, D. Gaidatzis, V. K. 42 J. C. Dohm, C. Lottaz, T. Borodina and H. Himmelbauer, Nu- Tiwari and D. Schübeler, Nature, 2011. cleic Acids Research, 2008, 36, e105–e105. 26 A. K. Maunakea, I. Chepelev, K. Cui and K. Zhao, Cell Research, 43 K. Struhl and E. Segal, Nature Structural & Molecular Biology, 2013, 23, 1256–1269. 2013, 20, 267–273. 27 G. L. Maor, A. Yearim and G. Ast, Trends in Genetics, 2015, 31, 44 Y. Li, C. Li, S. Li, Q. Peng, N. A. An, A. He and C.-Y. Li, Pro- 274–280. ceedings of the National Academy of Sciences, 2018, 115, 8817– 28 X.-L. Ding, X. Yang, G. Liang and K. Wang, Scientific Reports, 8822. 2016, 6, year. 45 H. Guo, B. Hu, L. Yan, J. Yong, Y. Wu, Y. Gao, F. Guo, Y. Hou, 29 R. Shayevitch, D. Askayo, I. Keydar and G. Ast, RNA, 2018, X. Fan, J. Dong, X. Wang, X. Zhu, J. Yan, Y. Wei, H. Jin, 24, 1351–1362. W. Zhang, L. Wen, F. Tang and J. Qiao, Cell Research, 2016, 30 Y. Xu, Y. Wang, J. Luo, W. Zhao and X. Zhou, Nucleic Acids 27, 165–183. Research, 2017, 45, 12100–12112. 46 A. R. Quinlan and I. M. Hall, Bioinformatics, 2010, 26, 841– 31 Y. Xu, W. Zhao, S. D. Olson, K. S. Prabhakara and X. Zhou, 842. Genome Biology, 2018, 19, year. 47 M. Lawrence, R. Gentleman and V. Carey, Bioinformatics, 32 V. Nanavaty, E. W. Abrash, C. Hong, S. Park, E. E. Fink, Z. Li, 2009, 25, 1841–1842. T. J. Sweet, J. M. Bhasin, S. Singuri, B. H. Lee, T. H. Hwang 48 P. Stempor and J. Ahringer, Wellcome Open Research, 2016, 1, and A. H. Ting, Molecular Cell, 2020, 78, 752–764.e6. 14. 33 A. Frankish, M. Diekhans, A.-M. Ferreira, R. Johnson, I. Jun- 49 A. L. Statham, D. Strbenac, M. W. Coolen, C. Stirzaker, S. J. greis, J. Loveland, J. M. Mudge, C. Sisu, J. Wright, J. Arm- Clark and M. D. Robinson, Bioinformatics, 2010, 26, 1662– strong, I. Barnes, A. Berry, A. Bignell, S. C. Sala, J. Chrast, 1663. F. Cunningham, T. D. Domenico, S. Donaldson, I. T. Fiddes, 50 M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, M. Carlson, C. G. Girón, J. M. Gonzalez, T. Grego, M. Hardy, T. Hourlier, R. Gentleman, M. T. Morgan and V. J. Carey, PLoS Computa- T. Hunt, O. G. Izuogu, J. Lagarde, F. J. Martin, L. Martínez, tional Biology, 2013, 9, e1003118. S. Mohanan, P. Muir, F. C. P. Navarro, A. Parker, B. Pei, F. Pozo, 51 L. J, R-News, 2006, 6, 8–12. M. Ruffier, B. M. Schmitt, E. Stapleton, M.-M. Suner, I. Sy- 52 H. Wickham, R. François, L. Henry and K. Müller, dplyr: A cheva, B. Uszczynska-Ratajczak, J. Xu, A. Yates, D. Zerbino, Grammar of Data Manipulation, 2020. Y. Zhang, B. Aken, J. S. Choudhary, M. Gerstein, R. Guigó, 53 L. Henry and H. Wickham, purrr: Functional Programming T. J. P. Hubbard, M. Kellis, B. Paten, A. Reymond, M. L. Tress Tools, 2019. and P. Flicek, Nucleic Acids Research, 2018, 47, D766–D773. 54 D. Robinson and A. Hayes, broom: Convert Statistical Analysis 34 R. Patro, G. Duggal, M. I. Love, R. A. Irizarry and C. Kingsford, Objects into Tidy Tibbles, 2019. Nature Methods, 2017, 14, 417–419. 55 W. Chang, J. Cheng, J. Allaire, Y. Xie and J. McPherson, shiny: 35 E. Lieberman-Aiden, N. L. van Berkum, L. Williams, Web Application Framework for R, 2018.

14 | Journal Name, [year], [vol.],1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary figures

Journal Name, [year], [vol.],1–19 | 15 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary Figure S1: WGBS is analysed through CpG density, mCpG ratio and mCpG density metrics. The mCpG ratio an mCpG density metrics can behave differently if comparing regions with different CpG density. A. On these hypothetical 1 kb genomic windows, the CpG rich region (left) has an lower mCpG ratio than the CpG poor region (right), but an higher mCpG density. B. Four tracks have been obtained from WGBS data (here from the H1-hESC cell line). The 36kb region including the gene KLHL36 is shown. From top to bottom: The CpG density tracks, higher at the three CpG islands in the region. The mCpG ratio, showing high values (> 80%) everywhere but at the first CpG island. The mCpG density, showing low values everywhere but at the second and thris CpG island. Finaly, the WGBS coverage is used as a control track. Here it shows a lower coverage at the three CpG islands.

16 |Journal Name, [year], [vol.], 1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary Figure S2: H3K79me2 near the transcription start sites in H1-BMP4 derived mesoderm (A) or H1-BMP4 derived throphoblast (B). Upper parts, from left to right: Gene expression level in all 47,812 autosomal genes annotated by Gencode. First side bar indicates the gene type (green: protein coding genes, blue: pseudogenes, purple: RNA genes, red: other types of genes), the second side bar indicates the genes sorted according to expression level (5 bins in total, purple: highly expressed genes, green: lowly expressed genes). Stacked profiles of H3K79me2 ChIP-seq and respective input control, sorted according to the corresponding gene expression level. Bottom parts, from left to right: Boxplot of gene expression levels in each of the 5 expression bins defined in the upper part. Average profiles of H3K79me2 ChIP-seq and respective input control, ± SEM (Standard Error of the Mean) for each bin of promoters.

Journal Name, [year], [vol.],1–19 | 17 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary Figure S3: DNAse1 profile around middle exon start sites in pancreas show a narrow decrease in DNAse accessibility near splicing acceptor sites. Upper part, from left to right: middle exon expression levels in 16,811 middle exons. The side bar indicates 5 bins used in the figure S3 bottom panels (purple: highly expressed exons, green: lowly expressed exons). Then: Stacked profiles of DNAse1 profile and control, sorted according to the exon expression levels. Bottom part, from left to right: Boxplot of exon expression levels in each of the 5 bins defined in the upper part. Then, average DNAse profiles and control ± SEM (Standard Error of the Mean) for each bin of exons.

18 |Journal Name, [year], [vol.], 1–19 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary Figure S4: Linear regression models including six epigenetic modifications characterised in 27 cell types, focusing on the 25,068 long genes (> 3kb, amongst 47,812 analysed genes). Each bar represents the number of genes with a statistically significant slope (p <= 0.01), either positive (golden) or negative (deep blue). A. Linear regression model of long gene expression level and the levels of 6 epigenetic modifications near their respective TSS (±500 bp). B. Linear regression model of long gene expression level and the levels of 6 epigenetic modifications near their respective TTS (±500 bp).

Journal Name, [year], [vol.],1–19 | 19