
bioRxiv preprint doi: https://doi.org/10.1101/311167; this version posted May 3, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.


Bioinformatics doi.10.1093/bioinformatics/xxxxxx Advance Access Publication Date: Day Month Year Manuscript Category

Research article

Causal regulatory network inference using enhancer activity as a causal anchor

Deepti Vipin 1, Lingfei Wang 2, Guillaume Devailly 1, Tom Michoel 2,∗ and Anagha Joshi 1,∗

1 Division of Developmental Biology and 2 Division of Genetics and Genomics, The Roslin Institute, The University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK.

∗To whom correspondence should be addressed. Associate Editor: XXXXXXX

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract

Motivation: Transcriptional control plays a crucial role in establishing a unique signature for each of the hundreds of mammalian cell types. Though gene expression data have been widely used to infer cellular regulatory networks, the methods mainly infer correlations rather than causality. We propose that a causal inference framework successfully used for eQTL data can be extended to infer causal regulatory networks using enhancers as causal anchors and enhancer RNA expression as a readout of enhancer activity.

Results: We developed statistical models and likelihood-ratio tests to infer causal gene regulatory networks using enhancer RNA (eRNA) expression information as a causal anchor and applied the framework to eRNA and transcript expression data from the FANTOM consortium. Predicted causal targets of transcription factors (TFs) in mouse embryonic stem cells, macrophages and erythroblastic leukemia overlapped significantly with experimentally validated targets from ChIP-seq and perturbation data. We further improved the model by taking into account that some TFs might act in a quantitative, dosage-dependent manner, whereas others might act predominantly in a binary on/off fashion. We predicted TF targets from concerted variation of eRNA and TF and target expression levels within a single cell type as well as across multiple cell types. Importantly, TFs with high-confidence predictions were largely different between these two analyses, demonstrating that variability within a cell type is highly relevant for target prediction of cell type specific factors. Finally, we generated a compendium of high-confidence TF targets across diverse cell and tissue types.

Availability: Methods have been implemented in the Findr software, available at https://github.com/lingfeiwang/findr

Contact: [email protected], [email protected]

1 Introduction

Although every cell in the human body carries the same DNA, gene expression is unique to each cell type. Cell type specific gene expression is controlled by short DNA sequences called enhancers, located distal to the transcription start site of a gene. Collaborative efforts such as the FANTOM (Andersson et al., 2014) and Roadmap Epigenomics (Kundaje et al., 2015) projects have now successfully built enhancer and promoter repertoires across hundreds of human cell types, with an estimated 1.4% of the human genome associated with putative promoters and about 13% with putative enhancers. Enhancers physically interact with promoters to activate gene expression. Although the general rules governing these interactions (if any) remain poorly understood, experimental techniques such as chromosome conformation capture (3C, 4C) combined with next generation sequencing (Hi-C) (Mifsud et al., 2015) as well as computational methods based on correlations between histone modifications or DNase I hypersensitivity at enhancers with the expression of nearby promoters (Thurman et al.,

© The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected] 1


2012; He et al., 2014) are continually improving at predicting enhancer-promoter interactions. In contrast, understanding how the activation of one gene leads to the activation or repression of other genes, i.e. uncovering the structure of cell-type specific transcriptional regulatory networks, remains a major challenge. It is known that promoter expression levels of transcription factors (TFs) co-express and cluster together with promoters of functionally related genes (Forrest et al., 2014), but without any additional information such associations are merely correlative and do not indicate a causal regulation by the TF.

Statistical causal inference aims to predict causal models where the manipulation of one variable (e.g. expression of gene A) alters the distribution of the other (e.g. expression of gene B), but not necessarily vice versa (Pearl, 2009). A key role in causal inference is played by causal anchors, variables that are known a priori to be causally upstream of others, and can be used to orient the direction of causality between other, relevant variables. A major application of this principle has been found in genetical genomics or systems genetics: genetic variations between individuals alter molecular and organismal phenotypes, but not vice versa, so these quantitative trait loci (QTL) can be used as causal anchors to determine the direction of causality between correlated traits from population-based data (Schadt et al., 2005; Chen et al., 2007; Rockman, 2008; Li et al., 2010). Such pairwise causal associations can then be assembled into causal gene networks to model how genetic variation at multiple loci collectively affects the status of molecular networks of genes, proteins, metabolites and downstream phenotypes (Schadt, 2009).

Interestingly, several experiments have recently shown that enhancer regions can be transcribed to form short, often bi-directional transcripts, called enhancer RNAs or eRNAs (Natoli and Andrau, 2012). eRNA expression is correlated with, and crucially precedes, the expression of target genes (Arner et al., 2015). Though the functional role of eRNAs remains to be understood, the presence of eRNA from a regulatory region is an indicator of enhancer activity (Danko et al., 2015), and eRNA expression has been successfully used to predict transcription factor activity (Azofeifa et al., 2018). We therefore hypothesized that eRNA expression as a readout of enhancer activity could act as a causal anchor, opening new avenues to reconstruct causal gene regulatory networks.

To test this hypothesis, we developed novel statistical models and likelihood-ratio tests for using (continuous) eRNA expression data in causal inference, based on existing methods for discrete eQTL data, and implemented these in the Findr software (Wang and Michoel, 2017). We then applied this new method to CAGE data generated by the FANTOM5 project, a unique resource of enhancer and promoter expression across hundreds of human and mouse cell types (Forrest et al., 2014; Arner et al., 2015), and validated predictions using ChIP-seq and perturbation data. We noted that continuous eRNA expression values increased target prediction performance for some factors, while for other factors, a binarized presence or absence of the enhancer signal performed better. Leveraging this observation, we found that a data-driven approach to classify enhancer expression as either binary or continuous was sufficient to automatically select the best target prediction method, allowing parameter-free application of the method to organisms and cell types where validation data is not currently available.

2 Approach

We used enhancer expression as a causal anchor to infer causal gene interactions, within the Findr framework (Wang and Michoel, 2017). Findr provides accurate and efficient inference of gene regulations using eQTLs as causal anchors by accounting for hidden confounding factors and weak regulations. This is achieved by performing and combining five likelihood ratio tests (Figure 1B), each of which consists of a null ($H_{\mathrm{null}}$) and an alternative ($H_{\mathrm{alt}}$) hypothesis, to support or reject the causal model E → A → B, where E is an eQTL/enhancer in the regulatory region of gene A, and B is a putative target gene: primary linkage (E → A), secondary linkage (E → B), conditional independence (E → B only through A), B’s relevance (E → B or correlation between A and B), and excluding pleiotropy (partial correlation between A and B after conditioning on E). The log-likelihood ratios (LLRs) are computed for all possible targets of each gene, and then converted into p-values and posterior probabilities (Storey and Tibshirani, 2003) of the alternative hypothesis being true; see Wang and Michoel (2017) for details.

We applied three treatments to enhancer expression data. First, we regarded enhancers as binary (on/off) variables, and after binarizing the data (see Methods), used the existing Findr to predict TF targets directly. This approach will be referred to as Findr-B (“binary”). Second, we adapted all five tests in Findr to use continuous instead of discrete causal anchor data, and used this method on (untransformed) eRNA data. This approach will be referred to as Findr-C (“continuous”). Third, to accommodate the co-existence of binary and continuous enhancers for different TFs within the same dataset, we developed an automatic adaptive method to independently treat each enhancer as binary or continuous, depending on the relative strength of the primary enhancer-TF linkage with either method. We call this approach Findr-A (“adaptive”).

3 Methods

3.1 Datasets

• We used Cap Analysis of Gene Expression (CAGE) data (TPM expression values) from the FANTOM5 Consortium for enhancers and transcription start sites (TSSs) in mouse embryonic stem (ES) cells (36 experiments), macrophages (224 experiments) and erythroblastic leukemia (52 experiments) (Forrest et al., 2014; Arner et al., 2015). We also selected 1036 samples from all cell types and tissues in mouse.
• We used ChIP-seq data from the Codex consortium (Sánchez-Castillo et al., 2015). Data available for 78 TFs in mouse ES cells, 12 TFs in macrophages and 17 TFs in erythroleukemia cells were used.
• For validation using knock-out data, we collected differentially expressed gene lists after perturbation of factors from published studies in mouse and gene lists after over-expression of factors in mouse ES cells from Xu et al. (2013).
• We obtained CAGE data (TPM expression values) from the FANTOM5 Consortium for enhancers and TSSs for human cell types and tissues (Forrest et al., 2014; Arner et al., 2015). We selected 360 samples from all cell types and tissues (1826 in total), by removing technical and biological replicates.

3.2 Data processing

• CAGE data were filtered to remove unannotated and non-expressed genes, and expression levels were log-transformed. Only enhancers expressed in more than one third of experiments were retained.
• For each TF, we selected the promoter with the highest median expression level as the promoter for that TF.
• For each TF, all enhancers within 50 kb of the TF promoter region were detected using the GenomicRanges package in Bioconductor (Lawrence et al., 2013), and considered as candidate causal anchor enhancers for that TF.
• Enhancer data were binarized by setting all experiments with zero read count to 0 and all others to 1.
• For the ChIP-seq data, genes with a TF binding site within 1 kb of their TSS were defined as targets for that TF.


Fig. 1. Overview of the Findr framework. A. Schematic representation of causal gene regulatory network inference using enhancer activity as a causal anchor. B. Five statistical tests used by Findr for causal inference. C. Workflow of the Findr-A framework.


• For the knockout and over-expression data, genes with differential expression q-value < 0.05 were defined as targets for the TF.

3.3 Likelihood ratio tests with continuous causal anchor data

Given a causal relation E → A → B to test, where E is a (continuous) enhancer for TF A and B is a putative target gene, with their expression data for samples 1, ..., n indicated by subscripts, we first convert each continuous variable into a standard normal distribution by rank. Each variable is modeled as a normal distribution with the mean linearly and additively dependent on its regulators in the five tests below (illustrated in Figure 1B).

1. Primary linkage test: The primary linkage test verifies that the enhancer E regulates the regulator gene A. Its null and alternative hypotheses are

$$H^{(1)}_{\mathrm{null}} \equiv E \qquad A, \qquad H^{(1)}_{\mathrm{alt}} \equiv E \to A.$$

The log likelihood ratio (LLR) and its null distribution are identical with those of the correlation test in Wang and Michoel (2017). Therefore the LLR is

$$\mathrm{LLR}^{(1)} = -\frac{n}{2}\ln\left(1-\hat\rho_{EA}^{2}\right),$$

where

$$\hat\rho_{XY} \equiv \frac{1}{n}\sum_{i=1}^{n} X_i Y_i,$$

and its null distribution is

$$\mathrm{LLR}^{(1)}_{\mathrm{null}}/n \sim D(1, n-2).$$

The probability density function (PDF) for $z \sim D(k_1, k_2)$ is defined as: for $z > 0$,

$$p(z \mid k_1, k_2) = \frac{2}{B(k_1/2,\, k_2/2)}\left(1-e^{-2z}\right)^{(k_1/2-1)} e^{-k_2 z},$$

and for $z \le 0$, $p(z \mid k_1, k_2) = 0$, where $B(a, b)$ is the Beta function.

2. Secondary linkage test: The secondary linkage test verifies that the enhancer E regulates the target gene B. The LLR and its null distribution are identical with those of the primary linkage test, except for replacing A with B.

3. Conditional independence test: The conditional independence test verifies that E and B become independent after conditioning on A, with its null and alternative hypotheses as:

$$H^{(3)}_{\mathrm{null}} \equiv E \to A \to B,$$
$$H^{(3)}_{\mathrm{alt}} \equiv B \leftarrow E \to A \wedge (A \text{ correlates with } B).$$

Correlated genes are modelled as having a multivariate normal distribution whose mean linearly depends on their regulator gene. Therefore,

$$\mathrm{LLR}^{(3)} = -\frac{n}{2}\ln\left[\left(1-\hat\rho_{EA}^{2}\right)\left(1-\hat\rho_{EB}^{2}\right) - \left(\hat\rho_{AB}-\hat\rho_{EA}\hat\rho_{EB}\right)^{2}\right] + \frac{n}{2}\ln\left(1-\hat\rho_{EA}^{2}\right) + \frac{n}{2}\ln\left(1-\hat\rho_{AB}^{2}\right).$$

Following the same definition of the null data, its null distribution is

$$\mathrm{LLR}^{(3)}_{\mathrm{null}}/n \sim D(1, n-3).$$

4. Relevance test: The relevance test verifies that B is regulated by either E or A. Its hypotheses are

$$H^{(4)}_{\mathrm{null}} \equiv E \to A \qquad B,$$
$$H^{(4)}_{\mathrm{alt}} \equiv E \to A \wedge E \to B \leftarrow A.$$

Similarly,

$$\mathrm{LLR}^{(4)} = -\frac{n}{2}\ln\left[\left(1-\hat\rho_{EA}^{2}\right)\left(1-\hat\rho_{EB}^{2}\right) - \left(\hat\rho_{AB}-\hat\rho_{EA}\hat\rho_{EB}\right)^{2}\right] + \frac{n}{2}\ln\left(1-\hat\rho_{EA}^{2}\right),$$

$$\mathrm{LLR}^{(4)}_{\mathrm{null}}/n \sim D(2, n-3).$$

5. Controlled test: The controlled test verifies that E regulates B through A, partially or fully, with the hypotheses as

$$H^{(5)}_{\mathrm{null}} \equiv B \leftarrow E \to A,$$
$$H^{(5)}_{\mathrm{alt}} \equiv B \leftarrow E \to A \wedge A \to B.$$

Its LLR is

$$\mathrm{LLR}^{(5)} = -\frac{n}{2}\ln\left[\left(1-\hat\rho_{EA}^{2}\right)\left(1-\hat\rho_{EB}^{2}\right) - \left(\hat\rho_{AB}-\hat\rho_{EA}\hat\rho_{EB}\right)^{2}\right] + \frac{n}{2}\ln\left(1-\hat\rho_{EA}^{2}\right) + \frac{n}{2}\ln\left(1-\hat\rho_{EB}^{2}\right),$$

with the null distribution

$$\mathrm{LLR}^{(5)}_{\mathrm{null}}/n \sim D(1, n-3).$$

The LLR and its null distribution then allow the p-values and the posterior probabilities of the null and alternative hypotheses to be computed separately for each subtest, as detailed in Wang and Michoel (2017).

3.4 Findr-B and Findr-C

In Wang and Michoel (2017), it was shown that a combined causal inference test performs best in terms of sensitivity and specificity for recovering true regulatory interactions, using both real and simulated test data. The combined test score is:

$$P = \frac{1}{2}\left(P_2 P_5 + P_4\right),$$

where $P_i$ is the posterior probability for subtest $i$.

The Findr-B method returns this combined score $P$ using the original Findr on binarized enhancer data. Findr-C does the same using the new tests on continuous enhancer data.

3.5 Adaptive method Findr-A

Given a set of TFs and, for every TF, a set of candidate causal anchor enhancers, the adaptive Findr-A method performs the following for each TF A (Figure 1C):

1. Compute the primary linkage test p-value for all candidate enhancers of A, both continuous and binarized.
2. Find the enhancer E with the lowest p-value overall.
3. If the lowest p-value occurred for a binarized enhancer, use Findr-B for TF A with E as its causal anchor, else use Findr-C.
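The continuous-anchor tests of Section 3.3 and the selection rule of Section 3.5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Findr implementation: all function names are hypothetical, ties in the rank transform are broken by sample order, and the rank-transformed data are rescaled to zero mean and unit variance so that the estimator for the correlation given above applies directly. The p-value uses the fact that substituting u = 1 − exp(−2z) into the PDF of D(k1, k2) gives a Beta(k1/2, k2/2) density for u, so the upper tail is a Beta tail integral, evaluated here with Simpson's rule (valid for k2 ≥ 2).

```python
import math
from statistics import NormalDist

# (k1, offset) per test, so that LLR_null / n ~ D(k1, n - offset) as in Section 3.3.
NULL_PARAMS = {1: (1, 2), 2: (1, 2), 3: (1, 3), 4: (2, 3), 5: (1, 3)}


def rank_normalize(values):
    """Map values to normal quantiles of their ranks, scaled to mean 0 and variance 1."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # ties by sample order (assumption)
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = NormalDist().inv_cdf(rank / (n + 1))
    mean = sum(z) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / n)
    return [(v - mean) / sd for v in z]


def corr(x, y):
    """rho_hat_XY = (1/n) * sum_i X_i * Y_i for rank-normalized inputs."""
    return sum(a * b for a, b in zip(x, y)) / len(x)


def continuous_llrs(e, a, b):
    """Return the five LLRs for rank-normalized enhancer E, regulator A and target B."""
    h = len(e) / 2
    r_ea, r_eb, r_ab = corr(e, a), corr(e, b), corr(a, b)
    # Shared term: determinant of the 3x3 correlation matrix of (E, A, B).
    full = (1 - r_ea**2) * (1 - r_eb**2) - (r_ab - r_ea * r_eb) ** 2
    l_ea, l_eb, l_ab = (math.log(1 - r**2) for r in (r_ea, r_eb, r_ab))
    return {
        1: -h * l_ea,
        2: -h * l_eb,
        3: -h * math.log(full) + h * l_ea + h * l_ab,
        4: -h * math.log(full) + h * l_ea,
        5: -h * math.log(full) + h * l_ea + h * l_eb,
    }


def null_pvalue(llr, n, k1, k2, steps=2000):
    """P(LLR_null >= llr) when LLR_null / n ~ D(k1, k2); requires k2 >= 2."""
    u = 1.0 - math.exp(-2.0 * llr / n)
    if u <= 0.0:
        return 1.0
    a, b = k1 / 2.0, k2 / 2.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def beta_pdf(t):
        if t >= 1.0:
            return math.exp(log_norm) if b == 1.0 else 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log1p(-t))

    # Simpson's rule on [u, 1] with an even number of intervals.
    width = (1.0 - u) / steps
    total = beta_pdf(u) + beta_pdf(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(u + i * width)
    return min(1.0, total * width / 3.0)


def adaptive_choice(p_continuous, p_binary):
    """Steps 2-3 of Findr-A: pick the enhancer with the lowest primary p-value."""
    best_c = min(p_continuous, key=p_continuous.get)
    best_b = min(p_binary, key=p_binary.get)
    if p_binary[best_b] < p_continuous[best_c]:
        return best_b, "Findr-B"
    return best_c, "Findr-C"
```

For test i on n samples, the p-value would be obtained as `null_pvalue(llrs[i], n, k1, n - offset)` with `(k1, offset) = NULL_PARAMS[i]`. The primary-linkage p-values for binarized enhancers come from the original discrete-anchor test in Wang and Michoel (2017), which this sketch does not reproduce; `adaptive_choice` simply takes both sets of p-values as dictionaries keyed by enhancer name.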


Fig. 2. Recall-precision curves for target predictions by Findr-B and Findr-C using ChIP-seq and perturbation data.

Fig. 3. Robustness of Findr performance demonstrated by using different score thresholds.

3.6 Validation methods

For the purpose of evaluation, we calculated the Findr-B and Findr-C scores for all TF-gene combinations. Genes with scores exceeding a certain threshold were considered as predicted targets for each TF. Precision-recall curves, calculated using the “PRROC” package, and hypergeometric overlap p-values with respect to known targets from ChIP-seq and knockout data in the same cell type were used to compare the performance of the two methods.

4 Results

4.1 Causal inference from enhancer and transcript/gene expression CAGE data

To test our hypothesis that causal inference using enhancer expression as a causal anchor predicts true TF targets, we used CAGE data generated by the FANTOM consortium across hundreds of cell types and tissues in human and mouse (Forrest et al., 2014). We used bi-directional expression in non-promoter regions as an indicator of likely enhancer activity in each CAGE sample (Andersson et al., 2014). The presence of enhancer expression (as a proxy for enhancer activity) in each sample is crucial for the ability to apply causal inference techniques. We first selected three mouse cell types (embryonic stem cells, macrophages and erythroblastic leukemia) for systematic characterization, as these had more than 20 samples per cell type, with diverse treatments or time series. Furthermore, ChIP-seq and TF knockout validation datasets were available for each of these cell types. We selected predicted enhancers (Andersson et al., 2014) within 50 kb of each transcription factor in a cell type, resulting in 109 enhancers for 48 transcription factors in ES cells, 55 enhancers for 8 transcription factors in macrophages and 5 enhancers for 4 transcription factors in erythroleukemia, with an average of 3.8 enhancers per transcription factor across cell types.

We inferred causal transcription factor-target interactions for each transcription factor using the enhancer element most strongly linked to each TF in each cell type. We predicted targets for 48 transcription factors with two methods, one using continuous enhancer data (“Findr-C”), and one using discretized, binary (on/off) enhancer data (“Findr-B”) (see Methods). The Findr software outputs a score representing the putative probability of a causal interaction for each transcription factor-target pair (see Methods). For both methods, the targets with predicted probability of a causal interaction greater than 0.8 (see Methods) were validated using a compendium of ChIP-seq data (Sánchez-Castillo et al., 2015), containing 78 factors in ES cells, 12 factors in macrophages and 17 factors in erythroblasts. Of these factors, 18 in ES cells, 7 in macrophages and 4 in erythroblasts had enhancer expression in CAGE data. We noted that the suitability of Findr-B or Findr-C for causal inference was dependent on the factor, i.e. using continuous enhancer data performed better for some factors (Figure 2: Gata1, Fli1), while on/off data performed better for others (Figure 2: ,Klf2). Because the number of putative ChIP-seq targets for each factor varied widely across factors and cell types, from only 420 gene targets for JunD in erythroblasts to over 12,000 gene targets for ESRRB in ES cells, the background precision levels differed highly between factors (Figure 2).

Transcription factor binding inferred using ChIP-seq data is thought to be mostly opportunistic and therefore might not provide direct clues about the functional targets of the factor (Cusanovich et al., 2014). We therefore collected available perturbation data, specifically expression data after knock-out or knock-down (KO) of a factor, for 85 factors in ES cells and 11 factors in macrophages (see Methods). We also collected gene expression data after over-expression for 55 factors in mouse ES cells (Xu et al., 2013). Using differentially expressed gene lists as known targets, we evaluated the predictions of both methods. This confirmed the factor-specific suitability of either the Findr-B or Findr-C method (Figure 4).

We further tested whether these results were sensitive to the target probability prediction threshold. The enrichments of true positives were stable over a wide range around this threshold for both methods (Figure 3).

4.2 Development and validation of an adaptive model-selection approach for causal inference using discretized or continuous data

As the optimal prediction performance depended on a factor-specific choice between discretizing enhancer expression data or not, we investigated if this decision could be made in a data-driven, adaptive approach (henceforth called “Findr-adaptive” or “Findr-A”, see Figure 1C), in the absence of validation data. In short, Findr-A selects for each TF among all its candidate enhancers, both continuous and binary, the one with the strongest primary linkage to the TF’s expression, and then uses that enhancer and its corresponding method (Findr-B or Findr-C) to predict downstream targets for that TF (see Methods for details). This adaptive


Fig. 4. Comparison of Findr-A predictions using mouse ES cell samples and samples from all cell types. A. Bar plots representing enrichments for Findr-B, -C and -A predictions using ChIP-seq data as a validation dataset for ES cells (left) and all cell types (right). B. Bar plots representing enrichments for Findr-B, -C and -A predictions using knock-out data as a validation dataset for ES cells (left) and all cell types (right). C. Overlap of factors between ES and all cell types using ChIP-seq (left) and knock-out (right) as validation datasets.


approach was indeed able to select the best performing method for most of the factors (Figure 4A,B, Findr-A selection marked by a green star).

We further performed functional enrichment analysis of gene target sets predicted by Findr-A. 1119 targets of JunB in macrophages were enriched for ‘LPS signalling pathway’ ($p < 10^{-8}$) and ‘RNA binding’ ($p < 10^{-8}$). The macrophage CAGE samples indeed measured the response to LPS signalling. JunB is known to be a delayed response gene, attenuating transcriptional activity of immediate early genes and RNA binding proteins, specifically terminating of mRNAs induced by immediate early genes (Healy et al., 2013). 131 targets for Cbx7 in ES cells (Figure 3) were enriched for ‘regulation of transcription from RNA polymerase II promoter’ ($p < 10^{-10}$), and included several developmental genes such as Hox family proteins. This enrichment was much stronger than for the ChIP bound targets of Cbx7 ($p < 10^{-5}$). Cbx7 is a part of the PRC complex, which binds predominantly at bivalent chromatin at promoters of transcription regulators in ES cells (Mantsoki et al., 2015).

Traditional causal inference using causal anchors relies on a conditional independence test (Figure 1B, test 3), but the presence of common upstream regulatory factors results in low sensitivity for this test in applications to gene network inference (Wang and Michoel, 2017). To address this, both Findr-B and Findr-C (and hence also Findr-A) employ a combination of alternative tests (Figure 1B, tests 2, 4 and 5, see Methods for details) that resulted in significantly improved prediction of TF targets from eQTL data (Wang and Michoel, 2017). This remained true for the CAGE data in all three mouse cell types (data not shown), and hence the combined test remains the recommended default.

4.3 Perturbations within and across cell types provide causal targets for a distinct set of transcription factors

We noted that most factors performed better using binary enhancer expression values and wondered if this might be due to the limited expression data available for each cell type. The FANTOM5 CAGE data contains over 1000 samples across many more mouse cell types and tissues, henceforth called “all-data”. Findr-C indeed performed marginally better on all-data than on cell type specific data. Importantly, all-data and cell type (ES) specific data resulted in causal targets for a distinct set of transcription factors. In particular, variation within a cell type was more informative for causal target predictions of cell type specific factors. For example, the targets of key pluripotency factors Sox2 and Esrrb (Dunn et al., 2014) were enriched in ES-data but not all-data (Figure 4A,B). There were only five common factors with causal targets predicted using both all-data and ES-specific data that were validated by ChIP-seq data, and only six common factors validated by KO data (Figure 4C). Interestingly, the target genes predicted from ES-data and all-data for the same factor overlapped significantly. For example, 72% of Eed predicted targets using ES-data overlapped with Eed predicted targets using all-data.

4.4 Multiple enhancers of the same factor have a highly correlated expression

Mammalian genes are controlled by multiple enhancers. We investigated the stability of inference outcome under different choices of enhancers as causal anchors. For all factors for which Findr-A predicted targets in ES cells that overlapped significantly with perturbation data (Figure 4), we predicted additional target sets using other available enhancers, resulting in target sets for 76 enhancer-transcription factor pairs for 46 unique transcription factors.

The hierarchical clustering of transcription factor-enhancer pairs based on these target sets clustered mostly by transcription factors, indicating that the expression of multiple enhancers contains highly redundant information about the activity of the associated transcription factor (Figure 5). Reassuringly, factors known to form regulatory complexes, including Max and Mxi1 or Runx1 and Smad1, also clustered together, i.e. shared predicted causal targets (Figure 5).

To investigate whether combining multiple enhancers was more informative for determining causal targets, we compared two integrative methods against the predictions of taking individual enhancers. Firstly, we used the median expression level of all the putative enhancers for each transcription factor as a ‘meta-enhancer’ in Findr-A. Secondly, we calculated the first principal component of the binary target prediction matrix for all enhancers of a TF in order to ‘average’ predictions. However, we did not observe any significant overall improvement in performance using either method.

4.5 Causal inference using CAGE expression data across human cell types

Finally, we inferred causal interactions between transcription regulators and targets using CAGE enhancer and TSS expression data in human. Specifically, we inferred causal interactions for 20 transcription factors (with eRNA expression) using Findr-A (see Methods). We first validated the predicted interactions using a database of experimentally validated regulatory interactions in human (Han et al., 2018) and noted a statistically significant overlap between the predicted and experimentally validated gene sets ($p < 10^{-5}$). Figure 6 represents the network of top 200 interactions for each factor. The factors involved in biologically related processes shared predicted causal targets. For example, two members of the SMAD family, SMAD3 and SMAD6, as well as BCOR and SIN3A involved in histone deacetylase activity showed a high overlap of predicted targets.

5 Discussion & Conclusion

We explored the utility of eRNA expression as a causal anchor to predict transcription regulatory networks, by leveraging the observation that eRNAs mark the activity of regulatory regions. Previous studies support this notion, as eRNA expression has been shown to temporally precede the expression of its effector gene (Arner et al., 2015), and to correlate strongly with active regulatory regions across cell types (Danko et al., 2015). We therefore developed a novel statistical framework to infer causal gene networks (Findr-A), by extending the Findr software for causal inference using eQTL data (Wang and Michoel, 2017).

We demonstrated the applicability of Findr-A by predicting causal interactions from CAGE data generated through the FANTOM consortium, and validating them with ChIP-seq and perturbation data for three mouse cell types as well as on the entire FANTOM5 data. Notably, different factors were enriched for within cell type analysis as compared to across cell types. The causal regulatory network of cell type specific factors (e.g. Sox2, Esrrb in ES cells) could be inferred only using expression variation within a cell type and not across cell types. Due to the limited availability of validation data, a more comprehensive assessment was not possible.

The current approach can be extended in several aspects in the future. Firstly, Findr assumes equal (or no) relations between all sample pairs (Wang and Michoel, 2017), which holds for the majority of eQTL datasets. By accounting for heterogeneous sample relationships, such as biological and/or technical replicates, time series, or population structure, we may be able to reconstruct more accurate networks. Secondly, the assumption that eRNAs act as causal anchors is only approximately true, because their activity ultimately is regulated by other regulatory factors, i.e. the assumption that they are a priori causally upstream of correlated TF–gene pairs will not hold for all genes. Because eRNAs are temporally expressed before their direct target genes, we hypothesize that explicit modelling of


Fig. 5. Circular plot of hierarchical clustering of overlap of targets predicted using multiple enhancers for the same factor in mouse ES cells

gene expression dynamics in the Findr framework will allow us to detect and correct for such feedback loops. Thirdly, eRNAs are expressed at relatively low levels, and are therefore susceptible to noise, and a reliable eRNA signal was available for only a limited number of known transcription factors in mouse or human. Generating deep sequencing data for CAGE, or utilizing GRO-seq or epigenetic or transcription factor ChIP-seq data to estimate enhancer activity, could be possible ways to get around this.

In conclusion, we have demonstrated that enhancer activity can be used to infer causal gene regulatory networks. We foresee this approach to be of high value in the context of human medicine, by combining genetic, epigenetic and transcriptomic information across individuals to unravel causal disease networks.

Funding

This work was funded by grants from the Biotechnology and Biological Sciences Research Council (BBSRC) [BB/P013732/1, BB/M020053/1].

References

Andersson, R. et al. (2014). An atlas of active enhancers across human cell types and tissues. Nature, 507(7493), 455–461.
Arner, E. et al. (2015). Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science, 347(6225), 1010–1014.
Azofeifa, J. G. et al. (2018). Enhancer RNA profiling predicts transcription factor activity. Genome Res.


Fig. 6. Network representation of transcription factor targets predicted using Findr-A causal inference on the FANTOM5 human dataset
