Integrative Prediction of Gene Expression with Chromatin Accessibility and Conformation Data Florian Schmidt1,2,3,4†, Fabian Kern1,3,5† and Marcel H

Schmidt et al. Epigenetics & Chromatin (2020) 13:4 https://doi.org/10.1186/s13072-020-0327-0 Epigenetics & Chromatin RESEARCH Open Access Integrative prediction of gene expression with chromatin accessibility and conformation data Florian Schmidt1,2,3,4†, Fabian Kern1,3,5† and Marcel H. Schulz1,2,3,6,7* Abstract Background: Enhancers play a fundamental role in orchestrating cell state and development. Although several methods have been developed to identify enhancers, linking them to their target genes is still an open problem. Several theories have been proposed on the functional mechanisms of enhancers, which triggered the development of various methods to infer promoter–enhancer interactions (PEIs). The advancement of high-throughput techniques describing the three-dimensional organization of the chromatin, paved the way to pinpoint long-range PEIs. Here we investigated whether including PEIs in computational models for the prediction of gene expression improves perfor- mance and interpretability. Results: We have extended our TEPIC framework to include DNA contacts deduced from chromatin conformation capture experiments and compared various methods to determine PEIs using predictive modelling of gene expression from chromatin accessibility data and predicted transcription factor (TF) motif data. We designed a novel machine learning approach that allows the prioritization of TFs binding to distal loop and promoter regions with respect to their importance for gene expression regulation. Our analysis revealed a set of core TFs that are part of enhancer–promoter loops involving YY1 in diferent cell lines. Conclusion: We present a novel approach that can be used to prioritize TFs involved in distal and promoter-proximal regulatory events by integrating chromatin accessibility, conformation, and gene expression data. We show that the integration of chromatin conformation data can improve gene expression prediction and aids model interpretability. Keywords: Machine learning, Chromatin accessibility, DNase1-seq, Chromatin conformation, Gene regulation, HiC, HiChIP, Gene expression prediction Introduction establishing and maintaining cellular identity and their Understanding the processes involved in gene regula- dysfunction is related to several diseases [1]. tion is an important endeavour in computational biology. TFs bind to promoters of genes, which are in close Key players in gene regulation are transcription factors proximity to their transcription start site (TSS) and to (TFs), DNA binding proteins that are essential in regu- enhancers, regulatory regions that can be several thou- lating transcriptional processes. Tey are important in sand base pairs away from the regulated gene [2]. Since enhancers have been described for the frst time in 1981 by Banerji et al. [3], numerous studies shed light on their *Correspondence: [email protected] functional role. †Florian Schmidt and Fabian Kern contributed equally to this work 6 Institute of Cardiovascular Regeneration, Goethe-University, For example, enhancers were shown to be essential in Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany cell diferentiation [4]. Also, it has been reported that Full list of author information is available at the end of the article mutations occurring in enhancer regions, can not only © The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creat iveco mmons .org/publi cdoma in/ zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Schmidt et al. Epigenetics & Chromatin (2020) 13:4 Page 2 of 17 lead to changes in gene expression [5, 6], but can also on a gene-specifc level, these methods require the avail- increase the probability to contract certain diseases, ability of large data sets for the considered species and for instance Hirschsprung’s disease [7]. Tese efects tissues, which is generally not the case. In practice, the are likely to be caused by an altered binding of TFs due established window and nearest gene-based linkage para- to SNPs occurring in enhancer sequences [2, 8, 9]. To digms are still being used [25]. However, the drawback understand the function of enhancers, a crucial step after of those approaches is that they do not include long- identifcation of putative enhancer regions is to link them range enhancer–gene interactions, as proposed by the to their target genes. looping model. Tese have been experimentally deter- Recently, considerable progress has been made in iden- mined using, for example fuorescence in situ hybridi- tifying putative enhancer regions: In the past decade, zation (FISH), via the identifcation of enhancer RNAs many epigenetic data sets have been generated in con- (eRNAs) and their correlation to target genes, or via sortia like ENCODE [10], Blueprint [11] and Roadmap 3C-based high-throughput methods, for instance HiC, [12]. Histone Modifcations, especially H3K27ac and Capture-HiC, and HiChIP [30]. Especially the develop- H3K4me1, have been used in unsupervised computa- ment of such high-throughput methods to analyse the tional approaches, such as CHROMHMM [13], EPICSEG 3D organization of the genome enables us to determine [14], or REPTILE [15] to highlight putative enhancer genome-wide DNA contacts [31]. Detailed analyses of regions genome-wide. individual genes, e.g. the β-globin gene showed that mul- Also (semi-)supervised methods, e.g. ENHANCER tiple contacts occur simultaneously at one genomic loci [16], ENHANCERDBN [17], or DECRES [18], relying on and also overlap with DHSs [32]. It was shown that loops experimentally validated enhancer regions used as train- are established by Cohesin, Mediator complexes and ing data have been proposed. Furthermore, it was shown CTCF, which is known to act as an insulator protein. By that DNase-hypersensitive sites (DHSs) are good candi- performing genome-wide chromatin conformation cap- date sites for TF-binding [19, 20] and that DNase1-seq ture experiments, it is possible to segment the genome in signal is also predictive for gene expression [20, 21]. Tus multiple topological associating domains (TADs). Also, DHS sites, which are not located nearby promoters can it is known that there is more intra-TAD interaction be considered as candidate enhancer regions. However, among genes and enhancers than between TADs [33]. To it is still a fundamental biological question how enhanc- mine the variety of data types that inform on chromatin ers interact with their potentially distantly located target structure and to establish promoter–enhancer interac- genes. Te most prevalent hypothesis is that enhancers tions (PEIs) following the looping model, a wealth of tools are brought to close proximity to their target genes by have been published. Te JEME method by Cao et al. for chromosomal re-organization and DNA-looping. Tis instance is a two-level learning algorithm utilizing linear hypothesis is known as the looping model. It is oppos- regression and random forest (RF) models that not only ing the so-called scanning model, which states that an links candidate enhancers, derived from HMs, eRNAs enhancer is usually regulating only its nearest active pro- and/or sites of ac JEMEcessible chromatin, to their tar- moter [22]. Experimental evidence could be found for get genes, it also predicts sample-specifc activity of the both models [2], hence it is likely that both mechanisms enhancers. In this process, is able to consider data on are occurring in nature. long-range interactions, e.g. ChIA-PET data [27]. A dif- Inspired by these models, several experimental and ferent approach has been taken by He et al. [34], in which computational methods have been proposed to link a gold standard set of PEIs derived from ChIA-PET data enhancers to their target genes. Following the scan- are used to train classifers to predict enhancer–promoter ning model, two approaches are common in the feld: pairs. Te classifers consider various HMs, TF motif (1) window-based linkage and (2) nearest gene linkage. scores as well as sequence conservation. Te approach In the window-based approach, a gene is associated taken by TARGETFINDER , suggested by Whalen et al., with regulatory regions that are located within a defned is yet another way to tackle the PEI inference problem genomic region around this gene [23, 24]. Alternatively, [35]. Tey construct a gold standard set of active PEIs by in the nearest gene approach, an enhancer is only associ- combining chromatin state segmentations for ENCODE ated with its nearest gene [25]. To reduce false-positive and Roadmap data sets with several HiC data sets. Using assignments, the nearest gene linkage is also often cou- several genomic features such as DNA methylation, pled to a correlation test between epigenetic signals in HMs, CAGE data, and binding information of proteins the enhancer and the expression of the candidate gene (e.g. Cohesin), they learn ensemble models that predict [26]. PEI interactions across cell-types. A general overview on While approaches like JEME [27], FOCS [28] or available methods is provided in Yao et al. [2]. STITCHIT [29] ofer the linkage of regulatory elements Schmidt et al. Epigenetics & Chromatin (2020) 13:4 Page 3 of 17 Despite the availability of PEI data derived from chro- passing the automated fltering of JAMM are considered.

Integrative Prediction of Gene Expression with Chromatin Accessibility and Conformation Data Florian Schmidt1,2,3,4†, Fabian Kern1,3,5† and Marcel H

Mediator Head Subcomplex Med11/22 Contains a Common Helix Bundle

Transcriptomic Analysis of Pluripotent Stem Cells: Insights Into Health and Disease Jia-Chi Yeo1,2 and Huck-Hui Ng1,2,3,4,5,*

The Title of the Dissertation

Regulation of Adult Neurogenesis in Mammalian Brain

Variation in Protein Coding Genes Identifies Information

Effects of Downregulating GLIS1 Transcript On

The Suppressive Effects of 1,25-Dihydroxyvitamin D3 and Vitamin D Receptor on Brown Adipocyte Differentiation and Mitochondrial Respiration

Table S1. Complete Gene Expression Data from Human Diabetes RT² Profiler™ PCR Array Receptors, Transporters & Channels* A

Nrf1 and Nrf2 Positively and C-Fos and Fra1 Negatively Regulate the Human Antioxidant Response Element-Mediated Expression of NAD(P)H:Quinone Oxidoreductase1 Gene

Integrative Prediction of Gene Expression with Chromatin Accessibility and Conformation Data

Transcription Factors Recognize DNA Shape Without Nucleotide Recognition

Genetic Variation As a Tool for Identifying Novel Transducers of Itch