<<

bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Virtual ChIP-seq: predicting binding

2 by learning from the transcriptome

1,2,3 1,2,3,4,5 3 Mehran Karimzadeh and Michael M. Hoffman

1 4 Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada 2 5 Princess Margaret Centre, Toronto, ON, Canada 3 6 Vector Institute, Toronto, ON, Canada 4 7 Department of Computer Science, University of Toronto, Toronto, ON, Canada 5 8 Lead contact: michael.hoff[email protected]

9 March 8, 2019

10 Abstract

11 Motivation:

12 Identifying transcription factor binding sites is the first step in pinpointing non-coding mutations 13 that disrupt the regulatory function of transcription factors and promote disease. ChIP-seq is 14 the most common method for identifying binding sites, but performing it on patient samples is 15 hampered by the amount of available biological material and the cost of the experiment. Existing 16 methods for computational prediction of regulatory elements primarily predict binding in genomic 17 regions with sequence similarity to known transcription factor sequence preferences. This has limited 18 efficacy since most binding sites do not resemble known transcription factor sequence motifs, and 19 many transcription factors are not even sequence-specific.

20 Results:

21 We developed Virtual ChIP-seq, which predicts binding of individual transcription factors in new 22 cell types using an artificial neural network that integrates ChIP-seq results from other cell types 23 and accessibility data in the new cell type. Virtual ChIP-seq also uses learned associ- 24 ations between expression and transcription factor binding at specific genomic regions. This 25 approach outperforms methods that predict TF binding solely based on sequence preference, pre- 26 dicting binding for 36 transcription factors (Matthews correlation coefficient > 0.3).

27 Availability:

28 The datasets we used for training and validation are available at https://virchip.hoffmanlab. 29 org. We have deposited in Zenodo the current version of our software (http://doi.org/10. 30 5281/zenodo.1066928), datasets (http://doi.org/10.5281/zenodo.823297), predictions for 36 31 transcription factors on Roadmap Epigenomics cell types (http://doi.org/10.5281/zenodo. 32 1455759), and predictions in Cistrome as well as ENCODE-DREAM in vivo TF Binding Site 33 Prediction Challenge (http://doi.org/10.5281/zenodo.1209308).

1 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

34 1 Introduction

35 Transcription factor (TF) binding regulates . Each TF can harmonize expression of 36 many by binding to genomic regions that regulate transcription. Cellular machinery utilizes 37 these master regulators to regulate key cellular processes and adapt to environmental stimuli. 38 Alteration in sequence or quantity of a given TF can impact expression of many genes. In fact, 39 these alterations can be the primary cause of hereditary disorders, complex disease, autoimmune 1 40 defects, and cancer . 41 TFs bind to accessible chromatin based on weak non-covalent interactions between amino acid 2 3 42 residues and nucleic acids. DNA’s primary structure (sequence) , secondary structure (shape) , and 4 43 tertiary structure (conformation) all play roles in TF binding. Many TFs form a complex with 44 others as well as chromatin-binding and therefore bind to DNA indirectly. Some TFs also 45 have different isoforms and undergo various post-translational modifications. In vitro assays, such 5 46 as high throughput systematic evolution of ligands by exponential enrichment (HT-SELEX) and 6 47 binding microarrays , have provided a compelling understanding of context-independent 7 48 TF sequence and shape preference . Yet, for the aforementioned reasons, performance of models 8,9 49 trained on these in vitro data are poor when applied on in vivo experiments . To address this 50 challenge, we must explore how to better model DNA shape, TF-TF interactions, and context- 51 dependent TF binding. 10 52 Chromatin immunoprecipitation and sequencing (ChIP-seq) and similar methods, such as 11 12 53 ChIP-exo and ChIP-nexus , can map the presence of a given TF in the genome of a biological 54 sample. To map TFs, these assays require a minimum of 1,000,000 to 100,000,000 cells, depending 55 on properties of the TF itself and available antibodies. Such large numbers of cells are not often 56 available from clinical samples. Therefore, it is impossible to systematically assess TF binding in 57 most disease systems. Assessing chromatin accessibility through transposase-accessible chromatin 13 58 using sequencing (ATAC-seq) , however, requires only hundreds or thousands of cells. One can 59 obtain this many cells from many more clinical samples. While chromatin accessibility does not de- 60 termine TF binding, several methods use this information together with knowledge of TF sequence 14,15,16 61 preference, genomic conservation, and other genomic features to predict TF binding . 62 Predicting TF binding with motif discovery tools within chromatin accessible regions has helped 17 63 us understand the role of several TFs in various disease. For example, He et al. used motif discov- 64 ery tools to identify the role of OCT1 and NKX3-1 after prolonged androgen stimulation in prostate 18 65 cancer. Similarly, Bailey et al. discovered that a known risk single nucleotide poly- 66 morphism (SNP) upstream of ESR1 disrupts GATA3 binding and enhances expression of ESR1. 67 We propose that using more accurate tools to predict TF binding will allow understanding the role 68 of TF binding in more contexts. 69 Previous studies have used various approaches to predict TF binding. Several methods use unsu- 14 15 70 pervised approaches such as hierarchical mixture models or hidden Markov models to identify 71 transcription factor footprint using chromatin accessibility data. These approaches use sequence 72 motif scores to attribute footprints to different transcription factors. Convolutional neural network 73 models can boost precision by learning sequence preferences from in vivo, rather than in vitro 20,21 74 data . Variation in sequence specificity and cooperative binding of some transcription factors 75 prevents these methods from accurately predicting binding of all transcription factors. A more re- 76 cent approach uses matrix completion to impute TF binding using a 3-mode tensor representing 22 77 genomic positions, cell types, and TF binding . This method doesn’t rely on sequence specificity, 78 but can only predict TF binding in well-studied cell types with many ChIP-seq datasets. This 79 means one cannot use it to predict binding in a cell type where ChIP-seq is not possible, such as 80 limited clinical samples. 81 Identifying the best approach for predicting TF binding remains a challenge, because most

2 bioRxiv preprint b a c acigsqec oi ftesm F uha T3 eta niheto h oi salso is enrichment. motif central show the and of motif enrichment sequence central its ATF3, contain as both peaks such MAFK TF, most same contrast, the In more. low. of or 50% motif blue). of occupancy, sequence occupancy motif occupancy, matching motif (high motif with motif (low sequence TFs motif the to have compared sequence peaks the of have more peaks or (c) 50% of where 50% TFs than and less green), where whisker. TFs quartile a (red), within me- JASPAR value beyond boxplot: extreme outliers of most Whisker: points: line the (IQR). Horizontal Individual show range replicates. Boxplots interquartile and family. range: TF’s types Box the cell dian. from different motif from sequence datasets JASPAR any among with distribution TF a for peaks ChIP-seq 1: Figure

0%

No sequencemotif CUX1 CHD1 SRF CHD7

POU2F2 EP300 doi: RELA HDAC2

otCI-e ek akteT’ eunemtf (a) motif. sequence TF’s the lack peaks ChIP-seq Most STAT5A HDAC6 MNT KDM4A certified bypeerreview)istheauthor/funder.Allrightsreserved.Noreuseallowedwithoutpermission. BCLAF1 KDM5A https://doi.org/10.1101/168419 IRF2 KDM5B STAT3 PHF8 ZBTB33 RBBP5 SMARCB1 SAP30 SOX6 SIRT6 20 0 100 0 RFX1 SUZ12 ATF3 CBX2 Motif occupancy NR2C2 CBX3 ESRRA CBX8 SREBF1 CHD4 ATF2 HDAC1 seq data in ENCODE −seq datain withChIP SMAD1 KAT2B 19 Number of chromatin factors ofchromatin Number

Central enrichment E−valueCentral enrichment SMARCA4 KDM1A GFI1B NCOR1

Chromatin factors with ChIP-seq data in ENCODE dataset ENCODE datain factorswithChIP-seq Chromatin RNF2

faT’ oi slwrfrTswt oi cuac fls hn50% than less of occupancy motif with TFs for lower is motif TF’s a of PBX2 MEF2A SETDB1 GATA3 WHSC1 TCF12 H3F3A BCL11A KAT2A SMARCC1 SUPT20H ZZZ3

060 40 SP4 THAP1 BDP1 YY1 BRF1 BRF2 No sequence motif 0% < Motif occupancy < 50% Motif occupancy ≥50% Motifoccupancy Motifoccupancy<50% 0%< No sequencemotif KLF5 ; CCNT2

ZBTB7A this versionpostedMarch12,2019. 0 300 200

● GTF2B ● RUNX1 ● MAX GTF3C2 ● ● ● MEIS2 HMGN3

● NELFE ● REST t-test p=0.02 ● ● ETS1 POLR3A

● BRCA1 ● POU5F1 CHD2 ●

● TEAD4 ● ● STAT1 MAZ IRF3 RAD21 ● ● ZNF274 RCOR1 NFE2 SIN3A

(b) ZC3H11A 80 PAX5 ● ● MEF2C ZFP36 ZEB1 POLR2A 3 GTF2F1 HCFC1

d ETV6

ubro atr ihu eunemtfin motif sequence a without factors of Number MXI1 MYB RB1 CentriMo motif occurence site-probability ZNF263 SMC3

0.00 0.01 0.02 0.03 0.04 NFIC ELF1 TBL1XR1 BHLHE40 TRIM22 RUNX3 UBTF

(d) WRNIP1 P P PAX8 ZHX1 e e SP1

a SIN3B a E2F6 k

k ZKSCAN1

s TCF7 s BMI1

w o F ihasalnme fpeaks of number small a with TFs For FOSL1

w MTA3

i TCF7L2

t NONO i h t JUND The copyrightholderforthispreprint(whichwasnot h

POLR3G

m GABPA ZMIZ1 m

o ELK1 MTA1 00 −20 t o IRF4 i IKZF1 f t

FOXA1 i

= CTBP2 f Relative genomicposition

FOXA2

= TRIM28 5 TCF3

1 HA− 8

5 USF1 PML 8

6 BATF

0 TAF1 9 Bits Bits TARDBP 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 GATA2 NANOG HNF4A TAF7 SP2 MBD4

6 5 4 3 2 1 FOS ZHX2 3 2 1 EBF1 CBX5 NFYB rcino ENCODE of Fraction TFDP1 TAL1 SIX5 ZNF217 MYBL2 FOXM1 ZBED1

JASPAR MA0496.1MAFK NRF1 NR3C1 JASPAR MA0605.1Atf3 MAFF TBP CEBPB CREBBP

01 21 14 13 12 11 10 9 8 7 HNF4G 6 5 4 SMAD2 BACH1 ATF7 RXRA RFX5 JUN CEBPZ 20 FOSL2 CREB3L1 ZNF384 CREM EGR1 IQR. ±1.5 ATF1 NFYA NR1H2 ELK4 HSF1 8 7 USF2 NEUROD1 ZNF207 ZNF143 NR2F2 ARID3A

15 SPI1 SREBF2 MAFK PBX3 GATA1 CREB1 NFE2L2 FOXP2 CTCFL CEBPD CTCF NFATC1 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

14 82 studies use different benchmarking approaches. For example, one earlier study only assesses pre- 83 diction on genomic regions that match the TF’s sequence motif. By excluding ChIP-seq peaks not 84 matching the TF’s sequence motif from benchmarking, it underestimates false negative peaks and 85 overestimates prediction accuracy. Most previous studies benchmark their predictions using the 22,23,24 86 area under receiver operating characteristic curve (auROC) statistic . When test data is im- 87 balanced, meaning it has very different numbers of positive and negative examples, using auROC 25,26 88 misleads evaluators . Unfortunately, the TF binding status of genomic regions is highly imbal- 89 anced, making auROC alone a poor metric for evaluating TF binding prediction. Evaluation is 90 further complicated by wildly varying prediction performance across different TFs. Recently, the 91 ENCODE-DREAM in vivo TF Binding Site Prediction Challenge (DREAM Challenge) introduced 27 92 guidelines for assessing TF binding prediction . They recommend reporting both auROC, which 93 assesses false negative predictions and the area under precision-recall curve (auPR), which also 94 assesses false positives. 95 RNA-seq allows us to obtain transcriptome data from samples with small cell counts, including 96 patient samples. We hypothesized that we could leverage the transcriptome to better predict TF 97 binding. Previous methods have predicted gene expression using information on active regulatory 28,29,30 31 98 elements . Others have predicted chromatin accessibility using gene expression data , or 99 used differences in gene expression to identify statistically significant sequence motifs in specific 32 100 conditions , but they haven’t predicted in vivo TF binding using transcriptome data, as we do 101 below. 102 Here, we introduce Virtual ChIP-seq, a novel method for more accurate prediction of TF binding. 103 Virtual ChIP-seq predicts TF binding by learning from publicly available ChIP-seq experiments. 23 104 Unlike Qin and Feng , it can do this in new cell types with no existing ChIP-seq data. Virtual 105 ChIP-seq also learns from other data such as genomic conservation, and the association of gene 106 expression with TF binding. 107 Virtual ChIP-seq also accurately predicts the locations of DNA-binding proteins without known 108 sequence preference. This would be impossible for most existing methods, which rely on sequence 33 109 preference. Strictly speaking, only some of these proteins are TFs. As Lundberg et al. , we use 110 the term chromatin factors in this paper to refer to factors subject to ChIP. 111 Virtual ChIP-seq predicted binding of 36 chromatin factors in new cell types with a minimum 112 Matthews correlation coefficient (MCC) of 0.3. Eight of these chromatin factors (GTF2F1, HCFC1, 113 HDAC2, NRF1, RAD21, SIN3A, SMC3, and TAF1) do not have DNA-binding domains and there- 34 114 fore are not TFs according to Lambert et al. . These chromatin factors had minimum accuracy 115 (fraction of all predictions that were correct) of 0.99 and minimum specificity (fraction of negative 116 predictions that were correct) of 0.99. Precision (fraction of positive predictions that were correct) 117 ranged between 0.16 and 0.78 (Table 1). We predicted binding of these 36 chromatin factors on 33 35 118 Roadmap Epigenomics cell types and provide these predictions as a track hub for community use 119 (https://virchip.hoffmanlab.org).

120 2 Results

121 2.1 Sequence motifs are absent in most TF binding sites

122 2.1.1 Most ChIP-seq peaks lack the TF’s relevant sequence motif

14,15 123 Many computational tools predict TF binding using sequence preference data . Most tools 124 represent TF sequence preference in position weight matrix (PWM) format. PWMs the 125 likelihood for presence of each nucleotide at different positions of a sequence motif. With tools such 36 126 as FIMO , we can efficiently search and rank genomic regions that match TF sequence motifs.

4 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

127 One cannot determine a TF’s binding sites based solely on its sequence preference. We can iden- 128 tify some additional properties, such as co-binding partners, from high-throughput experiments. For 129 other properties, such as post-translational modifications to the TF, we lack corresponding large- 130 scale data. Many post-translational modifications affect cellular localization, binding partners, and 37 131 DNA-recognition of chromatin factors . Therefore, we expect existing computational prediction 132 methods to be more accurate for chromatin factors where post-translational modifications and co- 133 binding partners contribute less to TF binding. For chromatin factors with more complex biology, 134 however, we expect computational prediction methods to fail. 135 Using ChIP-seq data on 201 chromatin factors in 54 different cell types, we investigated whether 136 the majority of binding sites matched the sequence motif of the same TF. Among these 201 proteins, 137 76 lacked a sequence motif in JASPAR (Figure 1a, Supplementary Table 1). Some of these motif- 138 free proteins, such as EZH2 and HDAC, are chromatin-binding proteins rather than true TFs. For 139 simplicity in describing the prediction task, we refer to them as chromatin factors. Others are TFs 140 without known sequence preference. For sequence-specific TFs, the fraction of peaks that match a 141 sequence motif ranges from 4.55% (for SIX5) to 94.2% (for CTCF) with a mean of 49.4% (Figure 1b).

142 2.1.2 Many sequence motifs are not centrally enriched

143 Central enrichment measures how close a sequence motif occurs to a set of ChIP-seq peak summits. 19 144 According to Bailey and Machanick , high central enrichment indicates direct TF binding. We 19 145 used CentriMo to measure central enrichment. We compared central enrichment between TFs 146 with low motif occupancy (< 50% of ChIP-seq peaks contain the motif) and high motif occupancy 147 (≥ 50% of peaks contain the motif; Figure 1c). TFs with low motif occupancy had weaker central 148 enrichment (t-test; p = 0.02). For example, 30.87% of ATF3 peaks overlapped with the MA0605.1 149 JASPAR motif. ATF3 peaks also had lower central enrichment than MAFK peaks, which had 150 74.29% overlap with the MA0496.1 JASPAR motif (Figure 1d).

151 2.2 Model, performance, and benchmarking

152 2.2.1 Datasets

153 Virtual ChIP-seq learns from the association of gene expression and chromatin factor binding 154 in publicly available datasets. Our method requires ChIP-seq data of each chromatin factor in 155 as many cell types as possible, with matched RNA-seq data from the same cell types. We used 38 39 40 156 ChIP-seq data (from Cistrome DB and ENCODE ) and RNA-seq data (from CCLE and 41 157 ENCODE ) to assess Virtual ChIP-seq’s binding predictions for 63 DNA-binding proteins in new 158 cell types. We considered a 200 bp genomic bin as bound if it overlapped a peak summit with −4 159 FDR < 10 (Methods). 160 In addition to benchmarking on our own held-out test cell types, we wanted to compare against 27 161 the DREAM Challenge . To do this, we also used their datasets, which include ChIP-seq data for 162 31 chromatin factors. For most of these chromatin factors, the DREAM Challenge held out test 163 instead of test cell types. The DREAM Challenge included ChIP-seq data for only 164 12 chromatin factors in completely held-out cell types. Completely holding out cell types better 165 fits the real-world scenarios that require binding site prediction. Using the datasets we generated, 166 we had matched data in enough cell types to train and validate models for 9 of these 12 chromatin 167 factors (CTCF, E2F1, EGR1, FOXA1, GABPA, JUND, MAX, REST, and TAF1).

5 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Gene 1 Gene 5000 Gene expression a

Raji row min 0 row MDA−MB−453 MCF−7 LoVo keratinocyte HepG2 HeLa−S3

Cell types H1−hESC CA46 column max BL−41 BJ Quantile-normalized A549 ChIP-seq signal 0 0.5 1 b c Genomic bins (100 bp) BJ FGG NMU KRT8 Raji SBP1 NPPA CBX8 NEU4 TUBB IGLJ2 PGK1 MT1G MFNG GATA1 SYT14 IGHG1 NAA11 RPS18 RPL7A LoVo MFAP4 ANXA1 DPEP1 TONSL HCG16 EIF1AY A549 HOXC9 CEBPD KDM5D CA46 CCL4L2 RPL36A BTBD16 FAM72B GYG2P1 ST8SIA4 BL − 41 SLC27A3 TBC1D25 HepG2 KIAA0040 RBBP8NL MCF − 7 SLC25A27 HORMAD1 LINC00162 CACNA2D3 IGKV2D − 29 HIST2H2BA AF131215.4 AP001619.3 AC024937.6 HeLa − S3 5p12 H1 − hESC RP11 − 384K6.4 RP11 − 25H12.1 keratinocyte RP11 − 512H23.2 Pearson correlation CTD − 3099C6.11 5q11.2 − MB 453 MDA Cell types −0.5 0.0 0.5 Genes p-value > 0.1

d GM12878 K562 e GM12878 K562 Expression score = 0.488 Expression score = −0.512 Expression score = −0.602 Expression score = 0.407 5q14.3 5q15 150 150 400 400 5 chr5:132294001 chr5:111758001

100 100

200 200 5q32 50 50 5q34

0 0 0 0 TF binding at genomic bin Rank of correlation of gene expression and Rank of correlation gene expression TF binding at genomic bin Rank of correlation of gene expression and Rank of correlation gene expression 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 Rank of expression of top 5000 genes with highest variance in expression Rank of expression of top 5000 genes with highest variance in expression Scale 200 bases hg38 Scale 200 bases hg38 chr5: 111,757,800 111,757,900 111,758,000 111,758,100 111,758,200 chr5: 132,294,000 132,294,100 132,294,200 132,294,300 132,294,400 ChIP-seq of MYC in GM12878 ChIP-seq of MYC in GM12878 GM12878 MYC GM12878 MYC ChIP-seq of MYC in K562 ChIP-seq of MYC in K562 K562 MYC K562 MYC Figure 2: Virtual ChIP-seq learns from association of gene expression and chromatin factor binding at each genomic bin. This example shows Virtual ChIP-seq analysis for the MYC TF. (a) Gene expression levels for 5000 genes in 12 cell types. For simplicity of visualization, we showed only 100 of these genes in the matrix and labeled only every other gene. We ranked RNA-seq RPKM expression values within each cell type. This matrix shows a subset of 5000 high- variance genes, sorted by variance of each gene’s expression between cell types. Blue: row minimum; white: median expression; red: row maximum. (b) ChIP-seq signal for 100 bp bins in 12 cell types, taken from four larger regions (25 bins each) on chromosome 5. We quantile-normalized ChIP signal from MACS software among cell types. This matrix shows a subset of the 54,037 bins on chromosome 5 which have TF binding in at least one training cell type. White: column mini- mum (0.0); black: column maximum (1.0). Cyan: a region in the NREP with MYC binding in GM12878; magenta: a region upstream of SLC22A4 with MYC binding in K562. (c) Association matrix: gene expression–ChIP signal correlation between 100 genomic bins and 5000 high-variance genes. This is a subset of the larger 54,037 × 5,000 association matrix for chromosome 5. Each cell shows the Pearson correlation for 12 cell types between expression for a particular gene and ChIP signal at a particular genomic bin. Orange: negative correlation; white: p-value of Pearson correlation greater than 0.1 (NA); Purple: positive correlation. (d) (Top) Expression score plots for a 100 bp bin in the NREP locus. Each plot has one point for each of 184 genes with non-NA corre- lation values at that bin in the association matrix. Each point displays (Continued on next page.)

6 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

168 2.2.2 Learning from the transcriptome

42 169 Different cell types have distinct transcriptomic and epigenomic states . Changing gene expression 170 levels can affect patterns of chromatin factor binding and chromatin structure. We hypothesized that 171 some gene expression changes would lead to consistent and observable changes in chromatin factor 172 binding. As an extreme example, eliminating expression of a chromatin factor would eventually 173 eliminate binding of that chromatin factor genome-wide. Other changes in gene expression could 174 lead to competitive, cooperative, allosteric, and other indirect effects that would affect chromatin 175 factor binding. To account for both direct and indirect effects of the expression of regulatory genes, 31 176 one must model the dependency of each chromatin factor binding site on expression of all genes . 177 To exploit this model, we identified genes with significant positive or negative correlation with 178 chromatin factor binding at any given genomic bin. We did this for genes all over the genome, 179 irrespective of distance from the binding site. 180 For each chromatin factor, we created an association matrix measuring correlation between 181 gene expression and binding of that chromatin factor in previously collected datasets (Figure 2a– 182 c). In this matrix, each value corresponds to the Pearson correlation between ChIP-seq binding of 183 that chromatin factor at one genomic bin and the expression level of one gene. We used missing 184 values when there was no significant association between gene expression and chromatin factor 185 binding (p > 0.1). 186 Power analysis (Methods) identified which correlations the p > 0.1 cutoff would exclude de- 187 pending on the number of available cell types with matched ChIP-seq and RNA-seq data. For 188 CTCF, which had the largest number of cell types available—21 cell types with matched ChIP-seq 189 and RNA-seq—this cutoff provided 80% power to detect an absolute value of Pearson correlation 190 |r| ≥ 0.52. Many chromatin factors had only 5 cell types with matched data and the cutoff provided 191 80% power to detect only larger correlations, |r| ≥ 0.92. 192 We calculated an expression score for a chromatin factor in a new cell type using the association 193 matrix and RNA-seq data for the new cell type, but no ChIP-seq data. The expression score 194 is the Spearman correlation between the non-NA values for that genomic bin in the association 195 matrix and the expression levels of those genes in the new cell type (Figure 2d, Figure 3a). We 196 used the rank-based Spearman correlation to make the score robust against slight differences in 197 analytical methodology used to estimate gene expression. We used the expression score as one 198 of the variables in a multi-layer perceptron (Methods). Compared to using the high-dimensional 199 and sparse association matrix as an input to the multi-layer perceptron, this approach has several 200 advantages. Most importantly, the expression score is dependent on the transcriptome of the new 201 cell type. Other advantages include a much lower number of parameters to optimize, and avoiding 202 the problem of NA values in the association matrix.

Figure 2: (Continued from previous page.) the rank of correlation value for that gene among one row of the association matrix against the rank of expression for that gene among 5000 high-variance genes in (left) GM12878 and (right) K562 cell types. The expression score at a bin for a cell type is Spearman’s rank correlation coefficient ρ between those two ranks. Blue line: best linear fit to data; grey region: 95% confidence interval of the fit. (Bottom) UCSC Genome Browser display of 550 bp around that region. Blue rectangle: MYC ChIP-seq peak in GM12878 or K562. Here, MYC binds only in GM12878. (e) Expression score plot and Genome Browser display for a 100 bp bin upstream of SLC22A4. Here, MYC binds only in K562.

7 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

203 2.2.3 Learning from other predictive features

204 We included a number of other predictive features beyond expression score. Virtual ChIP-seq in- 205 cludes as input for each genomic bin the frequency of the chromatin factor’s presence in existing 206 ChIP-seq data (Figure 3b). Since most chromatin factor binding occurs within accessible chro- 43 207 matin , we also used evidence of chromatin accessibility from DNase-seq or ATAC-seq (Figure 3c). 44 208 While many intra-species genomic differences lie in the non-coding genome , we expect some 209 regulatory elements to be conserved among closely related species. Previous studies highlight the 45 210 association of genomic conservation and chromatin factor binding in organisms as simple as yeast 46 47,48 211 or as complex as human . To learn from patterns of genomic conservation, we used PhastCons 212 scores from a 7-way primate and placental mammal comparison (http://hgdownload.cse.ucsc. 213 edu/goldenPath/hg38/phastCons7way) in our model (Figure 3d). 214 We used sequence motif score where available (Figure 3e; see Methods). Relying only on TF 9 215 sequence preference, however, would prevent accurate prediction of most true TF binding sites 216 (Figure 1). For each TF, we represented sequence preference using the FIMO score of JASPAR 217 sequence motifs of that TF or a similar TF. JASPAR has no motif for some chromatin factors, such 218 as EP300. Where JASPAR has more than one motif for a TF, additional motifs often represent 219 different versions of the motif such as SREBF2 (MA0596.1) and SREBF2-var2 (MA0828.1). In 220 some cases, the additional motif represents a preference of a cooperative TF heterodimer, such as 221 MAX-MYC (MA0059.1). Regardless of reason, we included all of each TF’s motifs as features in 222 its model (Supplementary Table 2). 223 We also investigated potential improvements by adding a couple of additional integrative fea- 224 tures available for a limited number of chromatin factors and cell types (Supplementary Table 2). 15 225 First, we used the output of Hidden Markov model-based Identification of TF footprints (HINT) 226 which identifies TF footprints within accessible chromatin. Second, we used a boolean feature in- 227 dicating overlap of each genomic bin with clusters of chromatin accessibility peaks identified by 49 228 CREAM .

229 2.2.4 Selecting hyperparameters and training

230 We created an input matrix with rows corresponding to 200 bp genomic windows and columns rep- 231 resenting the features described above. Specifically, these features included expression score (Fig- 232 ure 3a), previous evidence of binding of chromatin factor of interest in publicly available ChIP-seq 233 data (Figure 3b), chromatin accessibility (Figure 3c), genomic conservation (Figure 3d), sequence 234 motif scores (Figure 3e), HINT footprints, and CREAM peaks. We used sliding genomic bins with 235 50 bp shifts, where most 200 bp bins overlap six other bins. This provided a maximum resolution of 236 50 bp in binding prediction. This resulted in a sparse matrix with 60,620,768 rows representing each 50 237 bin in the GRCh38 genome assembly . The sparse matrix used in the main model had between 238 4 and 11 columns, depending on the number of available sequence motifs. When we added HINT 239 footprints and CREAM peaks, the matrix had between 6 and 13 columns instead. We trained on an 240 imbalanced subset of genomic regions which had chromatin factor binding or chromatin accessibil- −4 241 ity (FDR < 10 ) in any of the training cell types. To speed the process of training and evaluation, 242 we further limited training input data to four chromosomes (chr5, chr10, chr15, and chr20). For 243 validation, however, we used data from these same four chromosomes in completely different cell 244 types held out from training. We evaluated the performance on all of the 9,635,407 bins in these 245 four chromosomes (Figure 3f), not just those with prior evidence of chromatin factor binding or 246 chromatin accessibility.

8 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Cross validation training Cross validation evaluation a GM12878 K562 Expression score = 0.488 Expression score = −0.512 f chr5 12 q11.2 q14.3q15 23.1 32 5q34

150 150 Expression chr10

100 100 score chr15

50 50 chr20 11.111.1 Accessible chromatin or previous TF binding 0 0 Number 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 4-fold cross validation scheme expression and TF binding b Rank of correlaiton of gene Rank of genes by expression of cell types GM12878 with TF binding g Activation function ∊ {sigmoid, tanh, rectifier}

K562 Chromatin h Number of hidden units ∊ {2×(pP+1), 50,100}

ChIP − seq signal accessibility Genomic position Bound in 2 cell types Bound in 1 cell type c PhastCons genomic conservation ...... Bound? DNase-seq signal Genomic position Score of d Human A A T G A C C A T G G A motif 1 Chimp A A T G C C C A T G A A Rhesus A T T G G C C A T T A A ... Dog A T T A T C C A T T A A e 2 2

1 1 Bits Score of i Number of hidden layers ∊ {2, 5, 10, 50}

A G G C C A A C G C T G motif 7 C T G CG A CG G T TAAGATGT TCAA CAA T 0 C G A 0 GT G C T LL2penalty ∊ {0.0001, 0.001, 0.01} T A C CA G j 2 k Specific to genomic region ● ● ● ● ● ● ● Specific to cell type ● ● ● ●

PhastCons Motif Number of cell Expression Chromatin Genomic position types with TF HINT CREAM conservation Score score accessibility binding chr1:1-200 0.01 0.1 0 0.0 0.1 0.0 − chr1:51-250 0.10 0.8 2 0.4 0.3 0.2 + chr1:101-300 0.20 0.1 2 0.4 0.1 0.0 − Training cell type 1 chr1:151-350 0.30 0.0 2 0.4 0.0 0.0 − chr1:1-200 0.01 0.1 0 0.0 0.5 0.5 + chr1:51-250 0.10 0.8 2 0.4 0.7 0.2 + chr1:101-300 0.20 0.1 2 0.4 0.9 0.0 + Training

cell type 2 chr1:151-350 0.30 0.0 2 0.4 0.7 0.0 + chr1:1-200 0.01 0.1 0 0.0 0.5 0.2 + chr1:51-250 0.10 0.8 2 0.4 0.2 0.0 + chr1:101-300 0.20 0.1 2 0.4 0.1 0.0 − Training

cell type 3 chr1:151-350 0.30 0.0 2 0.4 0.0 0.0 − ...... Figure 3: Optimizing and training a multi-layer perceptron. We used a number of fea- tures to predict chromatin factor binding in each bin. These include (a) expression score (Fig- ure 2d–e), (b) the number of training cell types with binding of that chromatin factor, (c) chro- matin accessibility, (d) PhastCons genomic conservation in placental mammals, and (e) any sequence motif corresponding to that TF in the JASPAR database. In JASPAR, some chro- matin factors have no sequence motifs, while others have up to seven different sequence mo- tifs. This led to a number of features p ∈ [4, 11], excluding features from HINT footprints or CREAM peaks not used in the main model. (f) For each chromatin factor, we trained a multi-layer perceptron using these features for selected bins in four chromosomes (5, 10, 15, and 20). Specifically, we selected bins with accessible chromatin or ChIP-seq signal in at least one training cell type (selected regions with vertical blue bars are for illustration purpose). To optimize hyperparameters, we repeated the training process with different hyperparameters us- ing four-fold cross validation, excluding one chromosome at a time. (Continued on next page.)

9 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

247 To build a generalizable classifier that performs well on new cell types with only transcrip- 248 tome and chromatin accessibility data, we concatenated input matrices from 12 training cell types: 249 A549, GM12878, HepG2, HeLa-S3, HCT-116, BJ, Jurkat, NHEK, Raji, Ishikawa, LNCaP, and 250 T47D (Supplementary Table 3).

251 2.2.5 The multi-layer perceptron

51 252 The multi-layer perceptron is a fully connected feed-forward artificial neural network . Our multi- 253 layer perceptron assumes binding at each genomic window is independent of upstream and down- 254 stream windows (Figure 3). For each chromatin factor, we trained the multi-layer perceptron with 52 255 adaptive momentum stochastic gradient descent and a minibatch size of 200 samples. We used 256 4-fold cross validation to optimize hyperparameters including activation function (Figure 3g), num- 257 ber of hidden units per layer (Figure 3h), number of hidden layers (Figure 3i), and L2 regularization 258 penalty (Figure 3j). For training, we only used genomic bins which overlapped chromatin acces- 259 sibility peaks or previous evidence of chromatin factor binding in any of the training cell types. 260 For assessing performance, however, we used all genomic bins of the chromosome (Methods). In 261 each cross validation fold, we iteratively trained on 3 of the 4 chromosomes (5, 10, 15, and 20) at 262 a time, and assessed performance in the remaining chromosome. We selected the model with the 53 263 highest average Matthews correlation coefficient (MCC) after 4-fold cross validation. MCC incor- 264 porates all four categories of a confusion matrix and assesses performance well even on imbalanced 54 265 datasets . For 23 chromatin factors, the optimal model had 10 hidden layers. For another set of 266 23 chromatin factors, the optimal model had 5 hidden layers. For the final 17 chromatin factors, 267 the optimal model had only 2 hidden layers. 268 For 57 out of the 63 chromatin factors examined, the best-performing model had 100 hidden 269 units in each layer—the maximum number of hidden units per layer examined in the grid search. 270 For the remaining 6 chromatin factors, the optimal model had 10–24 hidden units in each layer. 271 Different activation functions—hyperbolic tangent (tanh) or rectifier—proved optimal for different 272 chromatin factors (Supplementary Table 4). 273 We investigated if chromatin factors with the same DNA binding domain (as reported in Lam- 34 274 bert et al. ) also have similar optimized hyperparameters. All C2H2 zinc finger TFs (EGR1, CTCF, 275 MAZ, REST, YY1, ZBTB33, ZNF143, and ZNF274) had a rectifier activation function, 100 hidden −4 276 units, and L2 regularization penalty of 10 . The number of hidden layers ranged from 2 to 10. 277 The other DNA binding domains which had more than 4 TFs in our datasets, bHLH and bZIP, did 278 not have the same hyperparameter among their TFs (Supplementary Table 4). There was also no 279 significant correlation between number of hidden layers, hidden units, or activation function with 280 performance of the model in validation cell types.

Figure 3: (Continued from previous page.) For each chromatin factor, we performed a grid search over (g) activation function (sigmoid, tanh, and rectifier), (h) number of hidden units per layer (2(p+1), 50, or 100), (i) number of hidden layers (2, 5, 10, or 50), and (j) L2 regularization penalty (0.0001, 0.001, or 0.01). We chose the quadruple of hyperparameters which resulted in the highest mean Matthews correlation coefficient (MCC) over all four chromosomes. (k) Schematic of the matrix of input features for training the multi-layer perceptron. We used the input features from all the available training cell types to train the multi-layer perceptron.

10 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

281 2.2.6 Virtual ChIP-seq predicts chromatin factor binding with high accuracy

282 We evaluated the performance of Virtual ChIP-seq in validation cell types (K562, PANC-1, MCF- 283 7, IMR-90, H1-hESC, and primary liver cells) which we did not use in calculating the expression 284 score, training the multi-layer perceptron, or optimizing hyperparameters. Before predicting in new 285 cell types, we chose a posterior probability cutoff for use in point metrics such as accuracy and F1 286 score. When a chromatin factor had ChIP-seq data in more than one of the validation cell types, 287 we chose the cutoff that maximizes MCC of that chromatin factor in H1-hESC cells. Then, we 288 excluded H1-hESC when reporting threshold-requiring metrics. For these chromatin factors, we 289 pre-set a posterior probability cutoff of 0.4, the mode of the cutoffs for other chromatin factors 290 (Supplementary Table 5). 291 Virtual ChIP-seq predicts binding of 36 chromatin factors in validation cell types with MCC > 292 0.3, auROC > 0.9, and 0.3 < auPR < 0.8 (Figure 4a, Table 1, Supplementary Table 6).

293 2.2.7 Virtual ChIP-seq correctly predicts binding sites in genomic locations not found 294 in training data

295 We evaluated the performance of Virtual ChIP-seq for 63 chromatin factors with binding in valida- 296 tion cell types. For 59 of these chromatin factors, Virtual ChIP-seq predicted true chromatin factor 297 binding in regions without conservation among placental mammals. For 44 out of 63 chromatin fac- 298 tors, Virtual ChIP-seq predicted true chromatin factor binding in regions without chromatin factor 299 binding in any of the training ChIP-seq data. From these 63 chromatin factors, 43 are sequence- 300 specific, and for all of these chromatin factors, Virtual ChIP-seq predicted true binding for regions 301 that did not match the TF’s sequence motif. For 47 chromatin factors, Virtual ChIP-seq even 302 correctly predicted chromatin factor binding in regions that didn’t overlap chromatin accessibility 303 peaks (Supplementary Table 7). Most of these regions were frequently bound to the chromatin fac- 304 tor in publicly available ChIP-seq data. These predictions showed that the multi-layer perceptron 305 learned to leverage multiple kinds of information and predict chromatin factor binding accurately, 306 even in the absence of features required by previous generations of binding site classifiers.

307 2.2.8 Comparison with DREAM Challenge

308 DREAM Challenge rules forbid using genomic conservation or ChIP-seq data as training features. 309 This also excludes the expression score, as creating its association matrix relies on ChIP-seq data. 310 The challenge also required training and validation on its own provided datasets. These datasets 311 have ChIP-seq data in only a few cell types. This restricts Virtual ChIP-seq’s approach which lever- 312 ages all publicly available datasets. The DREAM Challenge ChIP-seq datasets use only two repli- 55 313 cates for each experiment and requires that peaks have a irreproducibility discovery rate (IDR) 314 of less than 5%. −4 315 In these cases, we included peaks that pass a false discovery rate (FDR) threshold of 10 in 316 at least two replicates (Supplementary Table 8). 317 The DREAM Challenge assessed participant entries by measuring performance on three valida- 318 tion chromosomes (chr1, chr8, and chr21), combined. To assess performance of Virtual ChIP-seq on 319 DREAM Challenge data, we did the same. To assess performance on Cistrome data, however, we 320 measured performance on each chromosome independently. This allowed us to examine the variance 321 in performance among these chromosomes. 322 Although Virtual ChIP-seq used features not allowed in the DREAM Challenge, comparing with 323 DREAM Challenge participants is the only sound way to show how any method including these 324 features compares to the state of the art. Before the DREAM Challenge, TF binding prediction 325 methods mostly reported performance measurements only in those parts of a chromosome where

11 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a 1.00 auPR > 0.5 0.5 > auPR > 0.25 auPR < 0.25 0.75 ● ●

● 0.50 ● ● ● ● ● auPR ● ● ● ● ● ● ● 0.25 ● ● ● ● ● ● ● 0.00 ● YY1 JUN TBP SRF FOS MAZ MAX MYC TAF1 ATF3 ATF2 MXI1 TP53 E2F1 ELF1 ELK1 H2AZ EZH2 HSF1 USF2 USF1 NRF1 RELA ESR1 CTCF JUND CUX1 BRD4 REST CHD1 CHD2 EGR1 RXRA SMC3 MAFK SIN3A EP300 GATA3 KAT2B TCF12 FOSL1 FOXA1 RAD21 TEAD4 BACH1 HDAC2 GABPA HCFC1 CREB1 BRCA1 RUNX1 FOXM1 CEBPB KDM1A RCOR1 ZNF143 ZNF274 TCF7L2 ZBTB33 GTF2F1 SREBF1 BHLHE40

1 2 3 4 5 6 7 8 9 b 1. CTCF 2. FOXA1 3. TAF1 4. MAX 5. GABPA 6. REST 7. JUND 8. EGR1 9. E2F1 1.00

0.75

0.50 auPR 0.25

0.00 adbc adbc HINT J − Team J − Team J − Team J − Team J − Team J − Team J − Team J − Team J − Team dxquang dxquang dxquang dxquang dxquang dxquang dxquang dxquang BlueWhale NittanyLions NittanyLions autosome.ru autosome.ru autosome.ru autosome.ru davidaknowles davidaknowles yuanfang.guan yuanfang.guan yuanfang.guan yuanfang.guan yuanfang.guan yuanfang.guan yuanfang.guan yuanfang.guan yuanfang.guan yuanfang.guan Ramil Nurtdinov Virtual ChIP − seq DREAM Virtual ChIP − seq DREAM Virtual ChIP − seq DREAM Virtual ChIP − seq DREAM Virtual ChIP − seq DREAM Virtual ChIP − seq DREAM Virtual ChIP − seq DREAM Virtual ChIP − seq DREAM Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome Virtual ChIP − seq Cistrome

H1−hESC K562 MCF−7 PC−3 DREAM validation chromosomes Cell type Chromosome iPSC liver PANC−1 T47D chr5 chr10 chr15 chr20 GRCh37 IMR-90 auPR baseline GRCh38 Figure 4: Virtual ChIP-seq predicts chromatin factor binding with high accuracy. Using ChIP-seq and RNA-seq data, we learned from the association of gene expression and chromatin factor binding for 63 chromatin factors. (a) Box plots show distribution of auPR among 4 chro- mosomes (5, 10, 15, and 20) for 63 chromatin factors assessed in four cell types (blue: H1-hESC; orange: IMR-90; green: K562; brown: MCF-7). Dashed line: medians. Grey shapes: prevalence of bound bins in the chromosome, the auPR baseline. Axis label colors categorize median auPR (pur- ple: greater than 0.5, red: between 0.25 and 0.5, black: below 0.25). Sequence logos indicate one of a TF’s JASPAR motifs, when available. When multiple motifs existed, we displayed the shortest motif here. Numbers 1–9: The nine chromatin factors that the DREAM Challenge also evaluated in its final round. (b) We compared Virtual ChIP-seq’s performance to that of the top 4 performing methods in the DREAM Challenge across-cell type final round. For CTCF, MAX, GABPA, REST, and JUND, we had enough cell types to train and validate the performance of Virtual ChIP-seq on DREAM data. For these chromatin factors we trained on chromosomes 5, 10, 15, and 20 in training cell types and validated performance on merged data of chromosomes 1, 8, and 21 in validation cell types. For other chromatin factors, we trained the model and validated our performance using publicly available Cistrome and ENCODE data. auPR values are only directly comparable for the same cell type and test set. The black vertical line in each panel separates test sets based on genome assembly and source. Axis label color: reference genome assembly (black: GRCh37, grey: GRCh38).

12 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

chromatin F1 Accuracy MCC auROC auPR N factor ATF2 0.270±0.002 0.990±0.001 0.314±0.008 0.917±0.026 0.443±0.022 1 BHLHE40 0.334±0.021 0.997±0.000 0.356±0.010 0.974±0.002 0.382±0.01 1 CEBPB 0.510±0.091 0.992±0.002 0.515±0.072 0.964±0.017 0.534±0.073 3 CHD2 0.399±0.038 0.998±0.000 0.406±0.034 0.950±0.012 0.386±0.046 1 CREB1 0.362±0.131 0.997±0.002 0.371±0.121 0.868±0.135 0.335±0.174 2 CTCF 0.667±0.126 0.995±0.004 0.675±0.092 0.985±0.050 0.841±0.108 6 E2F1 0.256±0.097 0.998±0.002 0.314±0.078 0.978±0.019 0.291±0.105 2 ELF1 0.431±0.047 0.997±0.001 0.456±0.038 0.949±0.042 0.493±0.066 2 ELK1 0.430±0.069 1.000±0.000 0.465±0.054 0.991±0.009 0.420±0.054 2 ESR1 0.372±0.103 0.993±0.006 0.430±0.049 0.883±0.033 0.461±0.019 2 FOS 0.333±0.027 0.997±0.001 0.393±0.020 0.861±0.004 0.394±0.008 1 FOSL1 0.319±0.006 0.994±0.001 0.316±0.006 0.929±0.006 0.272±0.012 1 FOXA1 0.433±0.082 0.997±0.004 0.492±0.072 0.981±0.022 0.568±0.117 3 GABPA 0.298±0.049 0.994±0.002 0.393±0.036 0.986±0.012 0.496±0.036 3 GTF2F1 0.235±0.120 0.996±0.001 0.312±0.070 0.985±0.015 0.191±0.081 2 HCFC1 0.459±0.021 0.999±0.000 0.487±0.024 0.990±0.005 0.515±0.044 2 HDAC2 0.303±0.033 0.986±0.005 0.370±0.018 0.948±0.051 0.281±0.040 2 HSF1 0.350±0.149 1.000±0.000 0.378±0.145 0.999±0.012 0.309±0.240 1 JUN 0.218±0.127 0.998±0.001 0.311±0.153 0.983±0.009 0.456±0.257 2 JUND 0.341±0.163 0.993±0.002 0.386±0.135 0.979±0.019 0.326±0.161 4 MAFK 0.354±0.041 0.997±0.001 0.423±0.028 0.989±0.005 0.513±0.103 3 MAX 0.400±0.045 0.996±0.002 0.444±0.059 0.961±0.012 0.491±0.111 3 MAZ 0.370±0.025 0.997±0.001 0.422±0.019 0.987±0.005 0.493±0.070 2 MXI1 0.394±0.018 0.999±0.000 0.402±0.017 0.993±0.004 0.381±0.025 1 NRF1 0.658±0.042 1.000±0.000 0.664±0.038 0.994±0.014 0.720±0.051 3 RAD21 0.593±0.062 0.996±0.002 0.626±0.056 0.983±0.033 0.740±0.095 3 REST 0.482±0.120 0.999±0.001 0.493±0.091 0.985±0.008 0.567±0.095 3 SIN3A 0.389±0.048 0.998±0.002 0.394±0.029 0.966±0.004 0.411±0.037 3 SMC3 0.733±0.016 0.999±0.000 0.734±0.016 0.998±0.001 0.792±0.018 1 SRF 0.353±0.060 0.998±0.001 0.364±0.070 0.982±0.008 0.365±0.115 2 TAF1 0.378±0.073 0.999±0.001 0.437±0.097 0.987±0.009 0.490±0.168 3 TEAD4 0.344±0.061 0.990±0.002 0.385±0.020 0.967±0.023 0.343±0.019 2 TP53 0.275±0.103 1.000±0.000 0.382±0.086 1.000±0.008 0.660±0.222 1 USF1 0.353±0.047 0.993±0.001 0.382±0.040 0.891±0.012 0.372±0.046 1 USF2 0.410±0.040 0.999±0.000 0.427±0.028 0.982±0.007 0.437±0.032 1 YY1 0.397±0.049 0.996±0.001 0.408±0.058 0.945±0.043 0.417±0.104 2

Table 1: Performance of Virtual ChIP-seq for 36 chromatin factors on validation cell types. Each row displays median values ± standard deviation of several performance metrics for prediction of a chromatin factor across 4 chromosomes for each available validation cell type. MCC: Matthews correlation coefficient, auROC: area under receiver-operating characteristic curve, auPR: area under precision-recall, N: number of validation cell types for 36 chromatin factors with MCC > 0.3. We reported auROC and auPR across all the validation cell types across all posterior probability cutoffs. Black chromatin factors: we found the posterior probability cutoff which max- imized MCC in H1-hESC, and then reported F1, accuracy, and MCC of the other validation cell types. 13 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

326 a method had more likelihood of success. The DREAM Challenge, like Virtual ChIP-seq, instead 327 reports performance on the intended deployment domain of such methods: whole chromosomes. 328 Leading DREAM Challenge methods potentially could improve their performance by including the 329 features used by Virtual ChIP-seq. We compared Virtual ChIP-seq with DREAM Challenge results 330 when we trained and validated on either Cistrome DB data or DREAM Challenge data.

331 2.2.9 Prediction accuracy varies by transcription factor

332 The DREAM Challenge evaluates predictions on binding of 31 chromatin factors. The final sub- 333 mission round evaluates predictions for 12 chromatin factors in held-out cell types. The datasets 334 we used, however, allow us to predict binding of 63 chromatin factors in new cell types. Of these 335 chromatin factors, 41 are unique to our dataset and do not overlap any of the DREAM Challenge 336 chromatin factors (Supplementary Table 9). The DREAM Challenge has data on the other 22 337 chromatin factors, but the challenge evaluated only 9 of these chromatin factors in its final round. 338 For CTCF, FOXA1, TAF1, and REST, Virtual ChIP-seq had a higher auPR in at least one 56,57 339 validation cell type than any DREAM Challenge participant . For EGR1 and E2F1, Virtual 340 ChIP-seq performed better than at least one of the four top-performing methods of the challenge in 341 one of the validation cell types (Figure 4b). DREAM Challenge and Cistrome ChIP-seq peak calls 342 had different class imbalances, making auPR statistics not directly comparable (Supplementary 343 Table 10). These imbalances were not always in the same direction. In FOXA1 peak calls in liver, 344 for example, Cistrome called 0.12% of genomic bins bound to a chromatin factor, half the fraction 345 of the DREAM Challenge (0.25%). Our predictions for FOXA1 binding in T47D and MCF-7 using 346 Cistrome had a higher auPR than participants of DREAM Challenge for liver. The FOXA1 peak 347 calls for these cell types also had a higher fraction of chromatin factor-bound genomic bins: 1.36% 348 for MCF-7, and 0.39% for T47D. This opposed the smaller fraction of bins bound in Cistrome 349 data in CTCF (in PANC-1, liver, and T47D), TAF1 (in liver, H1-hESC, K562, and T47D), and 350 REST (in H1-hESC, K562, and PANC-1). The differences in class prevalence are both minor and in 351 diverging directions. Because of this, they do not bias the baseline auPR of evaluation on Cistrome 352 datasets in a particular direction when compared to evaluation on DREAM Challenge datasets. 353 The power of Virtual ChIP-seq to learn from the transcriptome data diminishes when fewer cell 354 types are available, as in the DREAM Challenge data. Nonetheless, when trained on DREAM Chal- 355 lenge data, Virtual ChIP-seq outperformed 13/14 DREAM Challenge participants when predicting 356 CTCF binding in PC-3 cells. When predicting CTCF binding in iPSC cells, Virtual ChIP-seq had 357 a higher auPR than 8/14 Challenge participants. The Virtual ChIP-seq auPR for binding of REST 358 in liver was also higher than that of 9/14 DREAM Challenge participants (Supplementary Table 359 11). 360 Virtual ChIP-seq predicted binding of 36 chromatin factors with a median MCC > 0.3. These 361 36 chromatin factors had a auPR between 0.27 and 0.84 (Table 1). Some of these chromatin factors 362 show high levels of consistent binding among different cell types, which makes predictions easier. 363 The fraction of bins bound to a chromatin factor in at least half of training cell types, however, 364 varies between 0 to 15.75% across all chromatin factors. Even for chromatin factors with a median 365 auPR > 0.5 (purple in Figure 4a) the fraction of bins bound in half of training cell types varied 366 from 0.5% in FOXA1 to 10.5% in NRF1. For some DNA-binding proteins, Virtual ChIP-seq fails 367 to predict binding accurately (auPR < 0.3). DNA-binding proteins with low auPR and low MCC 368 include chromatin modifiers such as KAT2B, KDM1A, EZH2 and chromatin binding proteins such 369 as CHD1 and BRD4. Chromatin factors with low prediction accuracy include ATF2, CUX1, E2F1, 370 EP300, FOSL1, FOXM1, JUN, RCOR1, RELA, RXRA, SREBF1, TCF12, TCF7L2, and ZBTB33. 371 For some proteins, such as ATF2, EP300, EZH2, FOXM1, KAT2B, KDM1A, TCF12, and TCF7L2, 372 in at least one validation cell type, most ChIP-seq peaks didn’t overlap with chromatin accessible

14 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

373 regions.

374 2.2.10 Features of true and false predictions

375 To better understand why the model sometimes predicted incorrectly, we examined predictions of 376 52 chromatin factors in validation chromosomes (chr5, chr10, chr15, and chr20) in K562. We inves- 377 tigated true positive (TP), false positive (FP), and false negative (FN) predictions. We excluded 378 true negative (TN) predictions because their high numbers mainly reflect imbalanced class preva- 379 lence and potential ascertainment bias in the ground truth. Among the three labels, TP genomic 380 bins varied from 0.19% for RELA to 58% for CTCF (Figure 5a). For 24 of these 52 chromatin 381 factors, most incorrect predictions were FN (Figure 5a, left). For the other 28 chromatin factors, 382 most incorrect predictions were FP (Figure 5a, right). 383 We investigated presence and absence of predictive features among genomic bins labeled TP, 384 FP, and FN. We defined presence of a feature as a positive value, and absence as a non-positive 385 value. Expression score has values in [−1, 1] when a region had chromatin factor binding in any of 386 the training cell types that have matched RNA-seq data. For expression score, non-positive values 387 include both 0 and negative values. All other input features only have values in [0, 1]. For most 388 chromatin factors, the model performed better when all features were present. This means higher 389 TP, lower FN, and lower FP (Figure 5b). 390 For CTCF, incorrect predictions represented less than 5% of TPs when all predictive features 391 were present, when only sequence motif was absent, or only the expression score was absent (Fig- 392 ure 5b). Without presence of chromatin accessibility, the model made a higher number of false 393 predictions, but still made some correct predictions. 394 The model only predicted novel binding sites not present in training cell types when the site 395 matched the TF’s sequence motif (Figure 5b). For NRF1, MAFK, and ZNF274, the model made 396 frequent FN predictions when expression score and sequence motif match were absent. REST, 397 JUND, YY1, and E2F1 have more FP than FN. For these TFs, FP predictions were frequent when 398 expression score and sequence motif match were absent. For ZBTB33, both FP and FN predictions 399 were high when expression score and sequence motif match were both absent. 400 ZNF274 had only 117 correctly predicted binding sites and RELA had only 5 correctly pre- 401 dicted binding sites in the four validation chromosomes. In both of these cases, the model had low 402 specificity and sensitivity, predicting a much higher number of FNs and FPs than TPs.

403 2.2.11 The expression score leverages similarity with training cell types to improve 404 predictions

405 The expression score for a genomic bin is the Spearman correlation between expression of specific 406 genes in a new cell type and a measure of how chromatin factor binding in that genomic bin 407 correlates with expression of those genes among training cell types. For each genomic region, the 408 expression score uses the expression values of a different set of genes to provide a low or high 409 probability for chromatin factor binding in the new cell type. 410 We investigated whether the expression score serves as a way of encoding the ChIP-seq data 411 of the training cell type with the most similar transcriptome to the new cell type. To do this, 412 we randomly permuted expression scores across the genome. We identified bins that have TP 413 predictions with the original expression score but switch to FN with the permuted score. The 414 correct predictions that require the original expression score usually had ChIP-seq peaks in one 415 or more training cell type. In rare cases, these apparently expression-requiring predictions did not 416 have corresponding binding in any of the training cell types. In these cases, the expression score 417 may have contributed little to original prediction, but a permuted expression score penalized the

15 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a FN > FP FP > FN 10,000

100 TPs

Number of 1 100 75 50 25 0 (TP + FP + FN) YY1 Genomic bins × 100 JUN TBP SRF FOS MAZ MAX MYC ATF3 ATF2 TAF1 MXI1 ELF1 E2F1 ELK1 EZH2 USF2 USF1 H2AZ NRF1 RELA CTCF JUND REST BRD4 CUX1 CHD2 CHD1 EGR1 MAFK SMC3 SIN3A EP300 KAT2B TCF12 FOSL1 RAD21 TEAD4 BACH1 GABPA HCFC1 CREB1 RUNX1 FOXM1 CEBPB RCOR1 TCF7L2 ZNF274 ZNF143 ZBTB33 GTF2F1 SREBF1

False negative False positive True positive

b CTCF NRF1 REST JUND YY1 80 80 80 80 80 TP = 53,367 TP = 1,468 TP = 2,931 TP = 25,380 TP = 11,090

60 60 60 60 60

40 40 40 40 40

20 20 20 20 20 (TP + FP + FN) Genomic bins × 100 0 0 0 0 0 Sequence motif ● ● ● ● ● ● ● ● ● ● Previous TF binding ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Expression score ● ● ● ● ● ● ● ● ● ● ● Chromatin accessibility ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● PhastCons ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● MAFK ZBTB33 E2F1 ZNF274 RELA 80 80 80 80 80 TP = 7,649 TP = 913 TP = 1,237 TP = 117 TP = 5

60 60 60 60 60

40 40 40 40 40

20 20 20 20 20 (TP + FP + FN) Genomic bins × 100 0 0 0 0 0 Sequence motif ● ● ● ● ● ● ● ● ● Previous TF binding ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Expression score ● ● ● ● ● ● ● ● ● ● Chromatin accessibility ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● PhastCons ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Figure 5: True and false predictions and associated features. Fraction of potential chromatin factor binding sites in K562 categorized as false negative (FN: orange), false positive (FP: green), and true positive (TP: blue). This excludes any sites deemed true negative (TN). (a) Stacked bar plot of prediction categorization for the 52 chromatin factors with K562 ChIP-seq data, sorted by the fraction of TP genomic bins. Grey bars show number of TP predictions. (Left) 24 chromatin factors where FN fraction exceeded FP fraction. We selected 4 factors to examine in more detail below (orange names). (Right) 28 chromatin factors where FP fraction exceeded FN fraction. We selected 6 factors to examine in more detail below (green names). (b) UpSet58 plot of prediction categorization in 10 factors given the 4 most common combinations of positive values for input features, and all other combinations (asterisks). Black dots indicate the features with positive values in each combination. Number of TPs indicated is in validation chromosomes (chr5, chr10, chr15, and chr20). We took the 10 factors from a wide range of those with best performance (top left) to worst performance, as sorted by ratio of TP to FP+FN.

418 bin below the prediction threshold. 419 We investigated the TF JUND in more detail. In JUND, 126 out of 1,155 expression-requiring

16 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a b 1,155 expression-score–specific CTCF RELA REST 0.4 SREBF1 Other TP predictions for JUND binding in K562 PANC-1

117 expression-score–specific 160 0.3 TP predictions for JUND present IMR-90 in only one training cell type 0.2 MCF-7

120 0.1 40

0.0 30

H1-hESC 20 −0.1

80 Expression score permutation importance

0.0 0.2 0.4 0.6 10 MCC between validation cell type and training cell types

ELK1 c CHD2 NRF1 0 HCFC1 100 JUN Smooth muscle (no RNA) ● 40 HCT−116 (r = −0.13) ● MAZ USF2 SK−N−SH (r = 0.4) ● 50 SMC3 FOSL1 HepG2 (r = 0.39) ● ● HeLa−S3 (r = 0.44) 0 GM12878 (r = 0.43) ● MXI1 FOS

RAD21 ATF2 0 ● ● ● ● Smooth muscle (no RNA) TAF1 CTCF ● ● ● HCT−116 (r = −0.13) SIN3A ZBTB33 SK−N−SH (r = 0.4) ● CREB1 REST HepG2 (r = 0.39) ● TEAD4 MAFK HeLa−S3 (r = 0.44) ● ● ● ● SRF USF1

Percent of correctly predicted binding sites Percent MAX JUND YY1 CEBPB Lack of ELF1GABPA Genomic conservation Chromatin accessibility Previous TF binding Sequence motif Figure 6: Expression score leverages similarity with training cell types. (a) UpSet plots of TP predictions of JUND binding in K562 which did not pass the posterior probability threshold when we permuted the expression score. Each bar represents a combination of training cell types with the binding site (black dots below plot). r: genome-wide correlation of rank of expression of the top 5,000 genes with highest variation among a training cell type with rank of expression of the same genes in K562. Smooth muscle lacked matched RNA-seq data. Blue plot: the top 5 combinations with the highest number of TP genomic bins. Orange plot: the TP predictions which were bound to chromatin factor in only one training cell type. (b) Scatter plot of expression score permutation importance for 160 pairs of 63 chromatin factors and 6 validation cell types against ChIP-seq peak similarity between that cell type and 1–10 training cell types. Permutation importance is the difference in auPR when permuting expression score. Similarity is measured by MCC of validation cell type ChIP-seq peaks, treating each training cell type in turn as ground truth. Each point indicates median quantities, and error lines indicate median absolute deviation. (c) Bar plot of the fraction of binding sites for 29 chromatin factors correctly predicted on K562 validation chromosomes (chr5, 10, 15, and 20) which lacked particular predictive features. These features include genomic conservation (red), chromatin accessibility (green), sequence motif (turquoise), and evidence of chromatin factor binding in another cell type (purple). For chromatin factors with no sequence motif, we deemed every binding site to lack a sequence motif.

420 TP predictions did not exist in any of the training cell types (Figure 6a, blue). Some of these

17 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

421 true predictions (117/1,155) existed in only one of the training cell types (Figure 6a, orange). We 422 investigated correlation of the rank of expression of the top 5,000 genes with the highest variance 423 among training cell types and the validation cell type K562 (Figure 6a). The training cell type with 424 the highest correlation was not necessarily the cell type with the highest number of expression- 425 requiring predictions. For example, although the correlation among expression of all of the 5000 426 genes is highest between HeLa-S3 and K562 (r = 0.44), HCT-116 (r = −0.13) is the source of the 427 highest number of correct expression score specific predictions. This is unsurprising since, for each 428 region’s expression score, we used only a subset of the 5,000 genes in the global calculation here. 429 The other 912 predictions existed in 2 or more training cell types. This implies that, at least for 430 JUND, the expression score did not simply encode ChIP-seq data of a single training cell type with 431 the most similar global transcriptome to the new cell type. 432 We also examined whether the expression score’s effectiveness depends on the similarity of 433 chromatin factor binding among training and validation cell types. Under this hypothesis, we would 434 expect high correlation between the expression score’s contribution to model performance and the 435 similarity of ChIP-seq data between the validation cell type and the training cell types. To examine 436 this hypothesis, we calculated pairwise similarity in ChIP-seq data between the validation cell 437 type and each training cell type. Due to the highly imbalanced class prevalence of ChIP-seq data, 59 438 we used pairwise MCC as the similarity measure. We also calculated permutation importance , 439 the difference in auPR when permuting the expression score (auPR − auPRpermuted expression score). 440 Permutation importance indicates a feature’s contribution to a predictive model. 441 For each validation cell type, we calculated the median MCC of its ChIP-seq data with that 442 of training cell types and median expression score permutation importance among the 4 validation 443 chromosomes (Figure 6a). These two variables correlate in general (Spearman’s ρ = 0.41; p = −8 444 3 × 10 ). CTCF binding in PANC-1 similarity with training cell types ranges from MCC = 0.38 445 to MCC = 0.76 (Figure 6b). Only CTCF binding in IMR-90 has a higher similarity to training 446 cell types (MCC ∈ [0.35, 0.79]). The permutation importance of CTCF predictions in PANC-1 is 447 0.27, while the permutation importance of CTCF predictions in IMR-90 is 0.15. The variation in 448 correlation of similarity to training cell types and permutation importance of the expression score 449 is more evident for REST (Figure 6b). While the median similarity of REST binding with training 450 cell types is 0.32 for MCF-7 and 0.38 for H1-hESC, the permutation importance for REST binding 451 is 0.07 for MCF-7 but −0.02 for H1-hESC. 452 Using the expression score generally improved performance when validation cell types had sim- 453 ilar TF location patterns to training cell types. For example, some validation cell types similar to 454 the training cell types often had high expression score permutation importance (≥ 0.1) for CTCF 455 (IMR-90, liver, MCF-7, PANC-1) and REST (K562, PANC-1). For RELA and SREBF1, however, 456 all validation cell types had low expression score permutation importance (< 0.1), and low similarity 457 of ChIP-seq data to training cell types (Figure 6b).

458 2.2.12 Some correct predictions lack known predictive features

459 Many correctly predicted binding sites in K562 lack important predictive features of chromatin 460 factor binding (Figure 6c). Among 29 chromatin factors with MCC > 0.3 in K562, almost all 47,48 461 correct predictions are in genomic bins conserved among placental mammals . The exceptions 462 include 3.72% of predictions for ZBTB33, 2.11% of predictions for REST, 2.07% of predictions 463 for USF2, 1.49% of predictions for NRF1, 1.47% of predictions for CHD2 and 0.18%–0.89% for 464 other chromatin factors. Many correctly predicted binding sites for ATF2, MAFK, REST, CEBPB, 465 USF1, FOSL1, and CTCF don’t overlap chromatin accessibility peaks. We correctly predicted many 466 binding sites for TEAD4, GABPA, JUND, CREB1, USF1, CHD2, and FOSL1 in regions which 467 had no binding in training cell types. For all these factors except JUND, the nearest upstream or

18 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

468 downstream genomic bin of these novel predictions in K562 bound the chromatin factor as well. 469 The nearest training cell type binding site to these correct novel predictions were 50 bp–3.6 Mbp 470 away. The nearest peak in training cell types for these novel predictions was not significantly 471 closer compared to other K562 ChIP-seq peaks (Wilcoxon rank sum test; p = 1). In these cases, the 472 multi-layer perceptron learned from other available predictive features. For example, in TEAD4, all 473 novel correctly predicted binding sites in validation chromosomes overlapped chromatin accessibility 474 peaks. These correct predictions also had a mean PhastCons conservation of 0.182, significantly −16 475 higher than the mean of 0.150 in other genomic bins (Welch t-test; p < 2 × 10 ).

476 2.3 The choice of input features determines prediction performance

477 2.3.1 The most important features

478 To evaluate the importance of each feature in our predictive model, we performed an ablation study 479 on training data. First, we systematically removed features. Second, we fitted the model without 480 these features on some of the training cell types (HeLa-S3, GM12878, HCT-116, LNCaP). Third, 481 we evaluated performance on one held-out training cell type (HepG2; Supplementary Table 12). 482 This ablation study did not use any of the validation cell types which we used for final evaluation 483 of the model. 484 We called the effect of excluding an input feature substantive only when the average increase 485 or decrease in auPR was at least 0.05. Excluding genomic conservation, sequence motif, HINT, or 486 CREAM did not substantively change performance of the model for most chromatin factors (Fig- 487 ure 7). Excluding chromatin accessibility, publicly available ChIP-seq data, and the expression 488 score decreased performance in most chromatin factors. Excluding expression score substantively 489 decreased median auPR in 13/21 chromatin factors, while excluding publicly available ChIP-seq 490 data substantively decreased auPR in 18/21 chromatin factors.

491 2.3.2 Inclusion of some features have opposite effects on prediction of different chro- 492 matin factors

493 Beyond the most important features—chromatin accessibility, ChIP-seq, and expression score— 494 excluding other features rarely substantively decreased prediction performance (Figure 7b–c). When 495 we excluded sequence motifs, auPR decreased substantively for ZBTB33, JUN, JUND, FOXA1, 496 and ELF1. Excluding HINT footprints decreased auPR substantively only for CEBPB, JUN, and 497 JUND. Excluding CREAM clusters of chromatin accessibility peaks decreased auPR substantively 498 only for ZBTB33, ELF1, and FOXA1. 499 Removing certain input features actually improved prediction for some chromatin factors (Fig- 500 ure 7b–c). Associations that differed between training cell types and validation cell types sug- 501 gested that these input features generalize poorly. For example, CREAM clusters’ overlap with 502 NRF1 ChIP-seq peaks was not consistent among GM12878 (7.52%), HeLa-S3 (31.8%), and HepG2 −4 503 (25.78%). This represented a significant variation among these cell types (ANOVA; p = 1.9×10 ). 504 While most TF footprints (95.96%) overlapped NRF1 peaks, TF footprints constituted only a 505 small fraction of NRF1 peaks (0.73%). NRF1 peaks overlapped a smalls proportion of TF footprints 506 in training cell types GM12878 (1.14%) and HeLa-S3 (0.59%), but significantly greater than the 507 0.45% overlap in HepG2 (Welch t-test; p = 0.007). In HepG2, 7.28% of YY1 peaks overlap TF 508 footprints while in the training cell type GM12878, the overlap is only 1.22% (Welch t-test; p = −5 509 5 × 10 ) and in the other training cell type HCT-116 the overlap is much higher (17.92%; Welch −6 510 t-test; p = 5 × 10 ). Overlap of ZBTB33 peaks with TF footprints is much smaller in HepG2 511 (0.49%) compared to training cell types GM12878 (2.32%) and HCT-116 (5.27%; Welch t-test; −4 512 p = 6 × 10 ). Features with varying and cell-specific association with chromatin factor binding

19 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a CEBPB CTCF ELF1 EP300 FOXA1 GABPA JUN 1.00

0.75

0.50

0.25

0.00 JUND MAX MYC NRF1 RAD21 REST RFX5 1.00

0.75

0.50

0.25

0.00 SMC3 TAF1 TCF12 TEAD4 USF1 YY1 ZBTB33 1.00

0.75

area under precision − recall curve (auPR) 0.50

0.25

0.00 ChIP−seq ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Expression score ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Accessibility ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Sequence motif ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

CREAM ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

HINT ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

PhastCons ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Chromosome chr5 chr10 chr15 chr20 b Excluding PhastCons c x = No CREAM x = No HINT x = No sequence motif ● Excluding HINT ● ● ● ● ● ● ● ● 0.1 ● ● ● ● ● ● ● ● ● ● ● Excluding CREAM ● ● ●

All ● ●

● ● ● 0.0 ● ● Excluding sequence motifs ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Excluding chromatin − auPR ● ● ● ● ● ● accessibility x −0.1 ● ● ● ● ● ●

Excluding expression score ● auPR ● −0.2 ● ● Excluding ChIP-seq

● −0.3 ● ● Excluding ChIP-seq & expression score YY1 JUN JUN ELF1 ELF1 NRF1 NRF1 NRF1 JUND −15−10−5 0 JUND TCF12 FOXA1 FOXA1 TEAD4 CEBPB ZBTB33 Effect of excluding Number of TFs ZBTB33 ZBTB33 feature on performance Decrease Increase Transcription factors

Figure 7: Virtual ChIP-seq’s most important features consist in ChIP-seq data and expression score. (a) Area under precision-recall curve (auPR) for predicting a chromatin factor’s binding sites after training on only a subset of input features. We trained on five cell types (HeLa- S3, GM12878, HCT-116, and LNCaP) and predicted on either HepG2. Ablating a feature caused either substantive decrease (orange), substantive increase (turquoise), or no substantive change in auPR. An UpSet58-like matrix shows the subset of features used for each column. Dark grey strip above facet: when ablating HINT, CREAM, or sequence motifs substantively changed auPR. (b) Double-ended bar plot of the number of chromatin factors with average auPR increase or decrease of at least 0.05 when ablating each feature. Bars show the number of chromatin factors where ablation caused the average auPR to decrease (orange, left) or increase (turquoise, right). (c) Change in auPR for those chromatin factors with an average auPR increase or decrease of at least 0.05 when we excluded clusters of regulatory elements (CREAM), footprints (HINT), or sequence motifs. Backgrounds indicate auPR decrease (orange) or increase (turquoise).

20 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

513 complicate convergence of the multi-layer perceptron and may result in overfitting. As a result, the 514 multi-layer perceptron achieved a higher performance on some chromatin factors when we ablated 515 those features. 516 Association of clusters of regulatory elements and chromatin factor footprints with chromatin 517 factor binding varies among cell types. Using a CREAM feature substantively improved perfor- 518 mance in 3/21 chromatin factors and using a HINT feature substantively improved performance 519 in 3/21 chromatin factors (Figure 7b–c). In contrast, including CREAM substantively decreased 520 performance for 1 case and including HINT for 4 cases. When we repeat this experiment by using 521 different training and validation cell types, clusters of regulatory elements and TF footprints result 522 in increase or decrease in performance of different chromatin factors, while they barely result in an 523 increase in auPR above 0.05. Because of the limited upside and apparent downside, we didn’t use 524 these two cell-type–specific features for our final model.

525 2.4 Transcription factors and their targets regulate similar biological pathways

526 2.4.1 Gene set enrichment analysis of chromatin factor targets

527 To calculate the expression score, we investigate correlation of chromatin factor binding at each 528 genomic bin with expression of 5,000 genes across the genome (Methods). This brings us to our 529 hypothesis that genes whose expression is perturbed with binding of a chromatin factor regulate 530 the same biological processes as the chromatin factor. To understand biological implications of 531 transcriptome perturbation in response to chromatin factor binding, we measured how frequently 532 each gene’s expression associated with binding of each chromatin factor. We hypothesized that if 533 expression of a gene consistently correlates with binding of a chromatin factor, it is a potential target 534 of that chromatin factor. Similarly, if the expression of a gene negatively correlates with binding 535 of a chromatin factor, cellular machinery upregulated by that chromatin factor might cause net 536 suppression of that gene’s expression. 537 To identify such genes, for each chromatin factor, we ranked genes by subtracting the number 538 of genomic bins they are positively correlated with from the number of genomic bins they are 539 negatively correlated. We call this difference the association delta. For each chromatin factor, we 540 identified the 5,000 genes with the highest variance in expression among cells with matched RNA- 541 seq and ChIP-seq data (Figure 2a). We measured correlation of expression of each of the 5,000 542 genes with chromatin factor binding at every 100 bp genomic window in 4 chromosomes (chr5, 543 chr10, chr15, and chr20). This approach identified genes that have consistent positive or negative 544 association with chromatin factor binding (Figure 8a). We considered these genes as potential 60 545 targets of each chromatin factor, and used the Gene Set Enrichment Analysis (GSEA) tool to 546 identify pathways with significant enrichment in either direction (Figure 8a.) Only the rank of 547 association delta affects these results, and we presumed that there would be little difference in 548 using all chromosomes instead of just 4. The 4-chromosome analysis for JUND had no significant 549 rank difference from an analysis of chromosome 10 alone (Wilcoxon rank sum test p = 0.3). We 550 only investigated (GO) terms annotated to a minimum of 10 and a maximum of 551 500 out of a total of 17,106 GO-annotated genes. 552 We identified 1,681 GO terms with significant enrichment (GSEA p < 0.001) among potential 553 targets of at least one of the 113 chromatin factors we investigated (Figure 8b). Only 63 of these 113 554 chromatin factors had matched ChIP-seq and RNA-seq in at least 5 of the training cell types and 555 one of the validation cell types we used for learning from the transcriptome. Each chromatin factor 556 had potential targets with significant enrichment in a mean of 92 terms (median 76; Figure 8c). 557 Each of the 1,681 terms had significant enrichment in potential targets of a mean of 6 chromatin 558 factors (median 2; Figure 8d). Furthermore, 300 of these GO terms had significant enrichment in

21 bioRxiv preprint eaiey(le orltdwt hoai atrbnigfrec cluster. terms. each GO for 1,681 binding in factor enrichment chromatin of with correlated correlation (blue) on negatively based not. clusters increased did 6 stability (g–l) matrix cluster defined We Gaussian matrix matrix. GSEA random where correlation the (red). clusters dimensions of same of that the number fac- of but chromatin smallest matrix of Gaussian the random matrix background: subsampled correlation a Grey subsampled and the (turquoise) both GSEA by in tors found clusters term. between GO index, each Jaccard in GSEA enrichment line: with Green targets factor. term. potential chromatin GO each each with of annotated targets genes Super- for Verti- delta factor. association. score. association chromatin (green) enrichment of each negative rank of targets bars: and black potential (red) cal in gene. enrichment positive significant each with with for pathways genes, delta identified 5,000 association curve: the bins blue of of imposed each number the for the subtracts plot takes and bar expression delta gene’s association associated. a delta The with negatively association binding. associated an that positively with by that genes associated bins 5,000 bins of ranked genomic number we many factor, how chromatin each summarizes For that bins. genomic some at ing clusters. 8: Figure Oxidoreductase acting on NADPH oxygenase onNADPH acting Oxidoreductase e Rank of 5,000 genes based on a Pathway enrichment Pathway 00 50 50 50000 25000 0 25000 50000 Anatomical structure andmorphogenesis Anatomical structure Jaccard index Sensory perception of chemical stimulus perceptionofchemical Sensory

0.00 0.25 0.50 0.75 1.00 association with JUND binding Upregulation Downregulation Negative association Negative Superoxide - NADPH oxidase activity oxidase -NADPH Superoxide Number ofgenomicbins CCR chemokine binding receptor CCR chemokine oposo SAsaitcfrtetp5ptwy nihdi ee oiiey(e)and (red) positively genes in enriched pathways 5 top the for statistic GSEA of Boxplots Circulatory systemdevelopmentCirculatory Macromolecular complex asse... 8 7 6 5 4 3 2 doi: Protein complex biogenesis Adaptive immune response Cellular defense response Olfactory receptoractivity Olfactory receptoractivity Vasculature development o ilgclptwy euae yptniltreso hoai factor chromatin of targets potential by regulated pathways biological Top ahgn a aebt oiieadngtv orlto ihcrmtnfco bind- factor chromatin with correlation negative and positive both have may gene Each Epithelium development certified bypeerreview)istheauthor/funder.Allrightsreserved.Noreuseallowedwithoutpermission. Embryo developmentEmbryo https://doi.org/10.1101/168419 Number ofclusters Blood microparticle Positive association Positive Protein localization Membrane region Association delta Association Odorant binding Enzyme binding Chromatin factorGSEA Chromatin (c) k Gaussian random Gaussian j → . . . 1.0 0.5 0.0 −0.5 −1.0 . . . 1.0 0.5 0.0 −0.5 −1.0 ● itga hwn o ayo ,8 Otrsaeerce npotential in enriched are terms GO 1,681 of many how showing Histogram GSEA statistic GSEA statistic Olfactory cluster Defense cluster Defense transcription on potential enrichment targets of Gene set analysis factors 10 9 ● ● ● ● ● ● ● ● ● (a) b ● → → ● ● ● ● ● h soito akn rcs o UDbnig Double-ended binding. JUND for process ranking association The ● ● ● ● ● ● l ● ● ● ● ● ● ● ●

. . . 1.0 0.5 0.0 −0.5 −1.0

(d) HDAC2 TCF7L2 Angiogenesis cluster Angiogenesis RAD21 KAT2B

MAFK MAFF KMT2B ; GABPA

ZKSCAN1 EZH2 TAF1 GSEA enrichment score

REST 0.00 0.25 0.50 0.75

SRF

GSEA statistic EGR1 0.0 0.2 0.4 0.6 RXRA this versionpostedMarch12,2019. RFX5

f

ARID3A MXI1 GATA3 NR3C1 RUNX1

CREB1USF2 AR

Rank of genes by association delta byassociation Rank ofgenes

Genes annotated withGO:0034101 Genes annotated TP53 withGO:0055090 Genes annotated

0020 0040 5000 4000 3000 2000 1000 0

itga hwn o ayo 3crmtnfcoshave factors chromatin 63 of many how showing Histogram NR2C2 ARNT 5000 4000 3000 2000 1000 0

ZNF143 CHD2 E2F1

● ● ●

TBP ZNF274

ATF2

YY1 NFIC BACH1

USF1

SIN3A SUZ12

RFX1 BRD4

NRF1 MYC

ATF3 RCOR1 SMC3 ZBTB33

Regulation ofcellactivation Ribonucleoprotein complex biogenesis RNA processing Ribosome biogenesis PolyA RNAbinding PolyA Blood vessel morphogenesis ncRNA processing ncRNA Skeletal systemdevelopment Skeletal Vasculature development Angiogenesis

MAX SOX2

22 CEBPB TCF12

SPI1 BRCA1

MNT KDM1A

NR1H2CEBPZ

(f) TFAP2C (e) MAZ HSF1

(b) ASCL1 FOXM1 GATA2

RELA EP300 PPARG

TEAD4 ESR1 edormo lsesietfidi the in identified clusters 6 of Dendrogram opo fcutrsaiiy smaue by measured as stability, cluster of Boxplot JUND STAT3

eeStErcmn nlss(GSEA) Analysis Enrichment Set Gene FOXA1 STAG1 ELF1 MED1

TAL1 ZZZ3 NFYB CTCF

TEAD1

ELK1 ZFX

SREBF1 FOXK2 E2F4 CREM UBTF

CTBP1 chromatin Clustering

BHLHE40 ESRRA pathways enriched CUX1 The copyrightholderforthispreprint(whichwasnot

FOS on factors based potential targets potential

HCFC1 NR2F1 ZBTB40

MLLT1 of correlation

RAD51

among their among SP2 KDM5B

g NFE2L2 POLR2A-pS5

H2AFZ

FOSL1 H2AZ

NR2F2 NFYA

CEBPG STAT1 JUN IRF3

SREBF2 GTF2F1

MEF2A CHD1 . . . 1.0 0.5 0.0 −0.5 −1.0 ● ● GSEA statistic ● ● ● h ● ● ● ● ● ● ● ● ● ● Neural cluster Neural i ● . . . 1.0 0.5 0.0 −0.5 −1.0 ● ● ● ● ● . . . 1.0 0.5 0.0 −0.5 −1.0 GSEA statistic Inflammation cluster Inflammation GSEA statistic Motility cluster Motility c Count of GO terms d Count of TFs 600 200 400 15 10 0 5 0 ● ● ● Number of chromatin factors whose targets had whosetargets factors ofchromatin Number ● ● ● ● ● ● ● ● ● ● potential targets of each chromatin factor chromatin targetsofeach potential ● 0 ● ● ● 2 significant enrichment ineachGOterm enrichment significant Number of GO terms enriched in enriched Number ofGOterms Ribonucleoprotein complex biogenesis Postsynapse Postsynaptic membrane Protein localization Establishment ofproteinlocalization Establishment incell oflocalization Synaptic membrane Synaptic signaling Central systemdevepment nervous Spliceosomal complex 76 0 0 300 200 100 04 60 40 20 Perinuclear regionofcytoplasm Protein activation cascade Blood microparticle -fibrinclot coagulation Blood Cytoskeleton organization signaling mediated GTPase Small Cytoskeletal proteinbinding Cholesterol efflux Anchoring junction Anchoring Acute phaseresponse Ribonucleoprotein complex Ribosome biogenesis Blood coagulation - fibrin clot Blood coagulation-fibrin ncRNA metabolic process metabolic ncRNA processing ncRNA rRNA metabolic process metabolic rRNA Superoxide - NADPH oxidase -NADPH Superoxide Acute inflammatory response Acute inflammatory Acute phaseresponse Acylglycerol homeostasis Median 60 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

chromatin Upregulated Downregulated chromatin factors in cluster factor cluster pathways pathways with relevant biology

ASCL161, CTCF62, ESR163, Neural activity Protein FOXA164, GATA265, HSF166, Neural and biosynthesis PPARγ 67, STAT368, TAL169, development TEAD170 Cytoskeletal CTBP171, KDM5B72, MEF2A73, Motility Inflammation organization STAT174 BHLHE4075, CEBPG76, CUX177, Inflammation Inflammation RNA biosynthesis ELK178, FOXM179, JUN80, JUND81, RELA82 Vasculature, Olfactory blood, and NFIC83, ATF284, ATF385, Olfactory perception structural SIN3A86, CEBPB87, RFX188 development Cell defense Protein ARID3A89, CREB190, EGR191, Defense and chemokine biogenesis and KAT2B92, KMT2B93, MAFF94, signaling localization RFX595, RXRA96, SRF97 AR98, ARNT99, BACH1100, Angiogenesis BRCA1101, BRD4102, E2F1103, Angiogenesis RNA biosynthesis and GATA3104, KDM1A105, MYC106, vasculature RUNX1107, TP53108

Table 2: Many chromatin factors within each biological function cluster are involved in the same pathways as their potential target genes. We summarized each cluster of chromatin factors according to top over-represented GO terms in the first 3 columns. Chromatin factors in the 4th column are involved in the same biological mechanism as the bold pathways mentioned in 2nd or 3rd column.

559 potential targets of at least 10 chromatin factors. 560 To identify chromatin factors involved in similar biological processes, we searched for enrichment 561 of any of the 1,681 GO terms in 113 chromatin factors. This analysis relied on the GSEA enrichment 562 score as a normalized test statistic. We examined the pairwise correlation between the vector 563 of enrichment scores for each pair of chromatin factors. These pairwise correlations constitute a 564 symmetric correlation matrix. We hypothesized that chromatin factors with high correlation are 565 involved in similar biological processes. 566 To identify groups of chromatin factors involved in similar biological processes, we performed 567 hierarchical clustering on the correlation matrix. We sought to identify clusters of chromatin factors, 568 and the best number of clusters between 2 and 10, inclusive. As a control, we generated a correlation 569 matrix of same dimensions from a matrix of random Gaussian values (Methods). For each matrix 570 we repeatedly generated random subsamples and clustered them. For each subsample, we found the 571 set of pairs of chromatin factors with the same cluster membership. For couples of these subsamples, 109 572 we identified the Jaccard index between these sets as a measure of cluster stability (Methods). 573 We then compared the increase or decrease in Jaccard indices from each number of clusters to the 574 number of clusters one larger.

23 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

575 The smallest number of clusters with an increase in Jaccard index only for the correlation ma- 576 trix was 6 (Figure 8e–f). We assigned names to these clusters based on their enriched biological 577 pathways. We then examined the chromatin factors included in those clusters. The Neural clus- 61 66 65 67 578 ter (Figure 8g) includes ASCL1 , HSF1 , GATA2 , and PPARγ . These chromatin factors play 579 important roles in the development of the nervous system and are implicated in neurological disor- 61,65,66,67 580 ders . The top 5 GO terms enriched in the potential targets of these chromatin factors are 581 all related to nervous system development and function (Figure 8g). The downregulated pathways of 582 the Motility cluster (Figure 8h) relate to cytoskeletal organization. The included chromatin factors, 71 72 73 74 583 CTBP1 , KDM5B , MEF2A , and STAT1 , all play a role in the epithelial-to-mesenchymal 584 transition, which involves re-organization of the cytoskeleton. Similarly, we found that for other 585 clusters, specific upregulated or downregulated pathways of cluster’s targets are also regulated by 586 many of the cluster’s chromatin factors (Figure 8i–l, Table 2).

587 2.5 A compendium of chromatin factor binding predictions for 33 tissues and 588 cell types

589 2.5.1 Predicting chromatin factor binding in Roadmap datasets

35 590 The Roadmap Epigenomics Project performed DNase-seq on 55 and RNA-seq on 39 human tissues 591 and cell types, but not ChIP-seq of any chromatin factor. For 33 of these tissues, they produced 592 matched DNase-seq and RNA-seq data. This makes the Roadmap data an ideal application for 593 Virtual ChIP-seq. 594 We generated an annotation similar to peak calls by converting the multi-layer perceptron’s 595 posterior probabilities to a presence or absence call. We made this call based on a different cutoff 596 for each chromatin factor. We defined this cutoff as the posterior probability which maximized 597 MCC in H1-hESC. For chromatin factors without ChIP-seq data in H1-hESC, we used the mode 598 of cutoffs from the other different chromatin factors (0.4). We excluded H1-hESC when reporting 599 all performance metrics that depend on this threshold. The number of binding sites we predicted 600 in other validation cell types and Roadmap data is similar to ChIP-seq peaks in other validation 601 cell types (Figure 9a). 602 Using the cutoff which maximized MCC in H1-hESC only slightly decreased performance mea- 603 surements from what one could achieve with the optimal cutoff for each cell type (Figure 9b). For 604 example, the MCC score showed a median decrease of 0.06 and F1 score showed a median decrease 605 of 0.1. 606 As a community resource, we created a public track hub (https://virchip.hoffmanlab.org) 607 with predictions for 33 Roadmap cell types (Figure 9c). This track hub contains predictions for 36 608 chromatin factors which had a median MCC > 0.3 in validation cell types (Table 1).

24 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

SRF Best cutoff for validation data Cutoff based on H1-hESC a 0.5 b 100,000 1.00

0.4

0.75 0.3 1,000

MCC 0.2 0.50

10 0.1 Performance 0.25

0.0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Number of bins predicted to bind TF Posterior probability cutoff Posterior probability cutoff 0.00 F1 Recall Precision MCC CTCF Roadmap H1-hESC Validation cell types c Scale 2 kb hg38 0.8 100,000 chr20: 951,000 952,000 953,000 954,000 ChIP-seq peaks for CTCF in Cistrome DB and ENCODE A549 Bone 0.6 H1-hESC HCT-116 HCT116 K562 1,000 HepG2 Mammary Gland 0.4

MCC ChIP-seq peaks for CTCF in fibroblast of lung - ENCFF510CUI CTCF in NHLF 0.2 CTCF in Virtual ChIP-seq predictions of CTCF in left lung 10 left lung CTCF in Virtual ChIP-seq predictions of CTCF in adrenal gland 0.0 adrenal gland 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Number of bins predicted to bind TF Virtual ChIP-seq predictions of CTCF in B-cell Posterior probability cutoff Posterior probability cutoff CTCF in B-cell Number of bins Posterior cutoff maximizing Virtual ChIP-seq predictions of CTCF in T-cell CTCF in T-cell bound to TF MCC in H1-hESC

Figure 9: Chromatin factor binding predictions in validation cell types and Roadmap datasets. (a) Number of genomic bins that chromatin factor is predicted to bind (left) and MCC (right) as a function of posterior probability cutoff for SRF (top) and CTCF (bottom). This - tionship is shown for H1-hESC (turquoise), 2 validation cell types for SRF (blue), and 6 validation cell types for CTCF (blue). Each curve represents predictions for one of the 4 chromosomes (chr5, chr10, chr15, and chr20). Left panels also show how many genomic bins are predicted to bind the chromatin factor in 18 Roadmap datasets (red). Vertical red dashed line: posterior probability cutoff which maximized MCC of the chromatin factor in H1-hESC. Horizontal red dashed lines: number of genomic bins with chromatin factor binding in validation cell types. (b) Boxplot of vari- ous performance measures when using the best cutoff for each dataset (red) and the optimal cutoff in H1-hESC (turquoise). (c) UCSC Genome Browser display of a 4000 bp region on using the Virtual ChIP-seq track hub (https://virchip.hoffmanlab.org). The track hub has a supertrack for each chromatin factor. Each supertrack contains 34 tracks: one track specifying genomic bins bound by that chromatin factor in Cistrome and ENCODE, and one track for each of the 33 Roadmap cell types with predictions for that chromatin factor. This example shows parts of the track hub related to CTCF, including a track with experimental results in Cistrome DB and ENCODE with 7 out of 144 cell types enabled, and Virtual ChIP-seq predictions in left lung, adrenal gland, B-cell, and T-cell. The height of predictions indicates the number of overlapping genomic bins predicted to bind the chromatin factor, ranging between 0–4. Between are MACS2 narrow peak calls for CTCF in normal human lung fibroblasts (NHLF) from ENCODE (ENCFF510CUI). Blue: peaks; orange: peak summits.

25 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

609 3 Methods

610 3.1 Data used for prediction

611 3.1.1 Overlapping genomic bins

612 To generate the input matrix for training and validation, we used 200 bp genomic bins with slid- 613 ing 50 bp windows. We excluded any genomic bin which overlaps with ENCODE blacklist regions 614 (https://www.encodeproject.org/files/ENCFF419RSJ/@@download/ENCFF419RSJ.bed.gz). Ex- 615 cept where otherwise specified, we used the Genome Reference Consortium GRCh38/hg38 assem- 50 616 bly .

617 3.1.2 Chromatin accessibility

618 We used Cistrome DB ATAC-seq and DNase-seq narrowPeak files for assessing chromatin acces- 619 sibility (Supplementary Table 8). We mapped the signal value of peak summits to all the bins 620 overlapping that summit. In rare cases where a genomic bin overlaps more than one summit, we 621 used the signal value of the summit closest to the p terminus of the chromosome When data were 622 available from multiple experiments, we averaged signal values. Because Cistrome DB does not 623 include raw data that one can use for DNase footprinting, we limited the analysis of HINT TF 624 footprinting and CREAM regulatory element clustering to ENCODE DNase-seq experiments on 625 GM12878, HCT-116, HeLa-S3, LNCaP, and HepG2.

626 3.1.3 Genomic conservation

47,48 627 We used GRCh38 primate and placental mammal 7-way PhastCons genomic conservation 110 628 scores from the UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/goldenPath/hg38/ 629 phastCons7way). We assigned each bin the mean PhastCons score of the nucleotides within.

630 3.1.4 Sequence motif score

36 111 631 We used FIMO (version 4.11.2) to search for motifs from JASPAR 2016 to identify binding 632 sites of each TF that have the sequence motif of that TF. We used the curated, non-redundant 633 JASPAR database of vertebrate sequence motifs to avoid the unnecessary complexity of having 634 similar redundant motifs. To get a liberal set of motif matches, we used a liberal p-value threshold 635 of 0.001 and didn’t adjust for multiple testing. If the motif for the TF didn’t exist in JASPAR, we 636 used other motifs with same initial 3 letters and counted any TF binding site which had overlap 637 with any of those motifs (Supplementary Table 1). 638 We also used FIMO and JASPAR 2016 to identify the sequence specificity of chromatin ac- 639 cessible regions. For this analysis, we used a false discovery rate threshold of 0.01%. We used any 640 sequence motif matching the initial 3 letters of a TF as a predictive feature of binding for that 641 TF. For many chromatin factors, more than one motif matched this criteria, and we used all as 642 independent features in the model (Supplementary Table 2).

643 3.1.5 ChIP-seq data

644 We used Cistrome DB and ENCODE ChIP-seq narrowPeak files. We only used peaks with FDR < −4 645 10 . When multiple replicates of the same experiment existed, we only considered peaks that 646 passed the FDR threshold in at least two replicates. We considered bound only those genomic bins

26 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

647 overlapping peak summits. We calculated prevalence of bound bins in each chromosome as bound prevalence = bound + unbound

25 648 and used it as an auPR baseline .

649 3.1.6 RNA-seq data

650 We downloaded an ENCODE expression matrix (https://public-docs.crg.es/rguigo/encode/ 651 expressionMatrices/H.sapiens/hg19/2014_10/gencodev19_genes_with_RPKM_and_npIDR_oct2014. 41 652 txt.gz) with RNA-seq data for each gene, measured in reads per kilobase per million mapped 653 reads (RPKM). We retrieved similar Cancer Cell Line Encyclopedia (CCLE) RNA-seq data using 112 654 PharmacoGx . Since these data are processed differently, we limited our analysis to Ensembl 655 gene IDs shared between the two datasets, and ranked gene expression values by cell type. The 656 two datasets have 4 shared cell types: A549, HepG2, K562, and MCF-7. Within each of these cell 657 types, we examined the concordance of RNA-seq data between ENCODE and CCLE after possible 113 658 transformations. The concordance correlation coefficient of rank of RPKM (0.827) was higher 659 compared to untransformed RPKM (0.007) or quantile-normalized RPKM (0.006; Welch t-test −6 660 p = 10 ). The DREAM Challenge, however, had processed RNA-seq of all cell types uniformly, 661 allowing us to directly use transcripts per million reads (TPM) in analysis of DREAM Challenge 662 datasets.

663 3.1.7 Expression score

664 We created an expression matrix for each chromatin factor with matched ChIP-seq and RNA-seq 665 data in N ≥ 5 training cell types with the following procedure:

666 1. We divided the genome into M 100 bp non-overlapping genomic bins.

M×N 667 2. We created a non-negative ChIP-seq matrix C ∈ R≥0 (Figure 2a). We used signal mean 114 668 among replicate narrowPeak files generated by MACS2 for each of M bins and N cell types 669 and quantile-normalized this matrix.

0 670 3. We row-normalized C to C , scaling the values of each row between 0 and 1.

671 4. We identified the G = 5000 genes with the highest variance among the N cell types.

N×G 672 5. We created an expression matrix E ∈ R∈[0,1] containing the row-normalized rank of expression 673 each of the G = 5000 genes in N cell types (Figure 2b).

674 6. For each bin i ∈ [1, M] and each gene g ∈ [1, G], we calculated the Pearson correlation 0 675 coefficient Ai,g between the ChIP-seq data for that bin Ci,: and the expression ranks for that 676 gene E:,j over all cell types. If the Pearson correlation was not significant (p > 0.1), we set M×G 677 Ai,g to NA. These coefficients constitute an association matrix A ∈ (R∈[−1,1] ∪ {NA}) 678 (Figure 2c).

115 679 We performed power analysis of the Pearson correlation test using the R pwr package . 680 To predict ChIP-seq binding for a new cell type (Figure 2d), we calculated an expression score 681 for each genomic bin in that cell type. The expression score is Spearman’s ρ for expression of the 682 same G = 5000 genes in the new cell type with every row of the association matrix A. Each of 683 these rows represents a single genomic bin. An expression score close to 1 indicates that genes with

27 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

684 high expression have high values in the association matrix, and genes with low expression genes 685 have low values. An expression score close to −1 indicates that genes with high or low expression 686 have opposite values in the association matrix (Figure 2d).

687 3.2 Training, optimization, and benchmarking

688 3.2.1 Training and optimization

689 For the purpose of training and validating the model on Cistrome datasets, we only used chromo- 690 somes 5, 10, 15, and 20. These 4 chromosomes constitute 481.78 Mbp (15.6% of the genome). For 691 training only, we excluded any genomic region without chromatin accessibility signal and previous 692 evidence of chromatin factor binding. For validation and reporting performance, we included these 693 regions, using the totality of the 4 chromosomes. We concatenated data from training cell types 694 (A549, GM12878, HepG2, HeLa-S3, HCT-116, BJ, Jurkat, NHEK, Raji, Ishikawa, LNCaP, and 695 T47D; Supplementary Table 3) into the training matrix. 116 696 We used Python 2.7.13, Scikit-learn 0.18.1 , NumPy 1.11.0, and Pandas 0.19.2 for processing 51 697 data and training classifiers. We used the default Scikit-learn method to initialize the multi-layer 698 perceptron’s parameters β and coefficients β0. This uses random values from a uniform distribution. 699 The support of the uniform distribution used depends on properties of the current layer i and the 700 next layer i + 1. Specifically, the maximum value of the uniform distribution b is a function of the 701 number of the hidden units ui in the current layer, the number of hidden units ui+1 in the next 702 layer, and an activation factor l based on the activation function of the current layer. For sigmoid 703 activation, l = 2.0 and for other activation functions, l = 6.0. For each layer i, Scikit-learn sets s l b = . ui + ui+1

704 Scikit-learn samples each parameter βi and each coefficient β0,i from the uniform distribution U(−b, b). 51 705 We optimized hyperparameters of the multi-layer perceptron using grid search and 4-fold cross 706 validation. We used minibatch training with 200 genomic bins in each minibatch. We searched for 707 several options to optimize the activation function (Figure 3g), number of hidden units per hidden 708 layer (Figure 3h), number of hidden layers (Figure 3i), and L2 regularization penalty (Figure 3j). In 709 each round of 4-fold cross-validation, we trained on data of 3 chromosomes, and assessed best MCC 710 on the remaining chromosome. We selected the set of hyperparameters yielding highest average 711 MCC after 4-fold cross validation.

712 3.2.2 Benchmarking

117 713 We used the R precrec package to calculate auPR and auROC. Precision-recall curves better 25,54 714 assess a binary classifier’s performance on imbalanced test data than ROC .

715 3.2.3 DREAM Challenge comparison

716 For comparison to DREAM results, we also trained and validated the Virtual ChIP-seq model 717 on GRCh37 DREAM Challenge data. For training the model on DREAM Challenge datasets, we 718 used the data of chr5, chr10, chr15, and chr20 of training cell types. We evaluated performance 719 against the union of the DREAM validation chromosomes (chr1, chr8, and chr21) in validation cell 720 types. For CTCF, we trained on all cell types except MCF-7, PC-3, and iPSC which we used for 721 validation. For MAX, we used all cell types except liver and K562 for training. For GABPA, REST, 722 and JUND, we used all cell types except liver for training. We compared these metrics to those of 723 DREAM Challenge participants in the final round of cross–cell-type competition.

28 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

724 3.3 Clustering chromatin factors based on enrichment of their potential targets 725 in GO terms

726 To identify groups of chromatin factors involved in similar biological processes, we performed hier- 727 archical clustering on the correlation matrix. We sought to identify clusters of chromatin factors, 728 and the best number of clusters between 2 and 10, inclusive. For use in this process, we created a 729 Gaussian random matrix of 1,681 rows and 113 columns as a control, and calculated its correla- 730 tion matrix. Then, we compared cluster stability between the original correlation matrix and the 731 control for each potential number of clusters. To do this, we subsampled 75% of each correlation 732 matrix rows twice without replacement. Then, we clustered chromatin factors in each matrix into 733 the specified number clusters. For both of these clusterings, we constructed the set of every pair 734 of chromatin factors present in the same cluster. We then calculated the Jaccard index between 109 735 the first clustering’s constructed set and that of the second . We repeated this subsampling and 736 clustering process 50 times for each number of clusters. We picked the smallest number of clusters 737 which had an increase in Jaccard index compared to the number of clusters one smaller only in the 738 chromatin factor correlation matrix.

739 3.4 Chromatin factor prediction on Roadmap data

740 We downloaded Roadmap DNase-seq and RNA-seq data aligned to GRCh38 from the ENCODE 35 741 DCC . For each DNase-seq narrowPeak file with matched RNA-seq, we predicted binding of 36 742 chromatin factors with MCC > 0.3 in validation cell types (Table 1, Supplementary Table 6, 743 https://virchip.hoffmanlab.org).

744 3.5 Colors

745 For plots with three categories, we used a color palette optimized for viewers with deuteranopia 746 (http://mkweb.bcgsc.ca/colorblind) and chose colors also distinguishable by those with protanopia 747 and tritanopia. 118 748 For other plots, we either used the default ggplot2 color palette or manually-adjusted Col- 119 749 orBrewer palettes.

750 4 Discussion

751 Performing functional genomics assays to assess binding of all chromatin factors may never be pos- 752 sible in patient tissues. Nevertheless, computational prediction of chromatin factor binding based on 1 753 sequence specificity has identified the role of many chromatin factors in various diseases . Scanning 754 the genome for occurrences of each sequence motif, results in a range of 200–2000 predictions/Mbp. 755 In some cases, this is 1,000 times more frequent than experimental data from ChIP-seq peaks. 756 Similar observations led to a futility conjecture that almost all TF binding sites predicted in this 120 757 way will have no functional role . 758 Nevertheless, there is more to TF binding than sequence preference. Most chromatin factors 9 759 don’t have any sequence preference (Figure 1), and indirect TF binding through complexes of 760 chromatin-binding proteins complicates predictions based solely on sequence specificity. In addition 761 to the high number of false positive motif occurrences, many ChIP-seq peaks lack the TF’s sequence 762 motif. Therefore, relying on sequence specificity alone not only generates too many false positives, 763 but also many false negatives. We call this latter observation the dual futility conjecture, although 764 it differs in degree from the original. Adding additional data about cellular state allows us to move 765 beyond both conjectures.

29 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

766 We can assess chromatin factor binding through ChIP-seq or its more precise variations ChIP- 12 11 767 nexus or ChIP-exo . These experiments may still not properly reflect in vivo chromatin factor 121 768 binding due to technical difficulties such as non-specific or low affinity antibodies or false de- 122 769 tection of unrelated factors in hyper-ChIPable regions . Using publicly available ChIP-seq data 770 produced with different protocols and reagents, complicates prediction of chromatin factors more 56 771 sensitive to experimental conditions . Variations among training and validation cell types in our 772 datasets, overfitted the multi-layer perceptron to certain input features of some chromatin factors. 773 More robust approaches in assessment of chromatin factor binding—such as CRISPR epitope tag- 123 774 ging ChIP-seq (CETCh-seq) , which doesn’t rely on specific antibodies—may provide less noisy 775 reference data for learning and prediction of chromatin factor binding. 776 Virtual ChIP-seq predicted binding of 36 chromatin factors in new cell types, using from the new 777 cell types only chromatin accessibility and transcriptome data. By learning from direct evidence of 778 chromatin factor binding and the association of the transcriptome with chromatin factor binding 779 at each genomic region, most use of sequence motif scores becomes redundant. As more ChIP-seq 780 data in diverse cell types and tissues becomes available, our approach allows predicting binding of 781 more chromatin factors with high accuracy. This is true even in the case of factors that are not 782 sequence-specific. Although Virtual ChIP-seq uses direct evidence of chromatin factor binding at 783 each genomic region as one of the input features, it is able to correctly predict new peaks which 784 don’t exist in training cell types. For 39 of 41 sequence specific chromatin factors, Virtual ChIP-seq 785 correctly predicted chromatin factor binding in regions without any match to sequence motifs. 786 Virtual ChIP-seq’s performance varies over different chromatin factors, and for each chro- 787 matin factor it varies over different genomic regions. When all predictive features had positive 788 values, for example, model performance exceeded conditions where some features were absent. 789 Post-translational modifications to chromatin factors, which none of our input features assess, 790 might explain the varying performance of our model. For example, Virtual ChIP-seq predicts both 791 MYC (MCC = 0.03) and RUNX1 (MCC = 0.27) poorly, and post-translational modifications are 124,125 792 known to influence their activity . But post-translational modifications also influence the ac- 126,127 793 tivity of well-predicted chromatin factors CTCF (MCC = 0.68) and SMC3 (MCC = 0.73) . 794 Incorporating post-translational modification information remains a future challenge for building 795 more accurate models of chromatin factor binding. 796 The DREAM Challenge datasets provide data for training and validating machine learning mod- 797 els for predicting binding of 31 chromatin factors. Our datasets, using a combination of Cistrome 798 DB and ENCODE, allow training and validating models for predicting binding in a more extensive 799 63 chromatin factors. Our provided predictions of binding of 36 high-confidence chromatin fac- 800 tors in 33 different Roadmap tissue types will allow the research community to better investigate 801 epigenomics of disease affecting those tissues (https://virchip.hoffmanlab.org/). In addition to 802 providing our predictions as a resource for use by biologists, we also provide the processed datasets 803 we use as a resource for machine learning researchers. This should accelerate the development of 804 future methods by many groups.

805 Acknowledgments

806 We thank X. Shirley Liu (ORCID: 0000-0003-4736-7339) for providing the Cistrome DB narrow- 807 Peak files. We thank the Roadmap Epigenomics Mapping Consortium and the ENCODE Project 808 Consortium for generating the datasets which enabled this work. We thank Sage Bionetworks- 809 DREAM and the ENCODE-DREAM Challenge organizers for providing data and results before 810 publication. We thank Carl Virtanen and Zhibin Lu (University Health Network High Performance 811 Computing Centre and Bioinformatics Core) for technical assistance. We thank Anshul Kundaje

30 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

812 (ORCID: 0000-0003-3084-2287), Nicolae R. Zabet (ORCID: 0000-0001-9964-6271), Patrick Martin 813 (ORCID: 0000-0002-4093-8277) and those at Banff International Research Station Workshop on 814 “The Role of Genomics and Metagenomics in Human Health: Recent Developments in Statisti- 815 cal and Computational Methods” for comments on this manuscript. This work was supported by 816 the Canadian Cancer Society (703827 to M.M.H.), the Ontario Ministry of Training, Colleges and 817 Universities (Ontario Graduate Scholarship to M.K.), and the University of Toronto Faculty of 818 Medicine Frank Fletcher Memorial Fund (M.K.).

819 Competing interests

820 The authors declare that they have no competing interests.

821 References

822 [1] Tong Ihn Lee and Richard A. Young. Transcriptional regulation and its misregulation in disease. Cell, 152(6):1237–1251, 2013.

823 [2] Pamela J. Mitchell and Robert Tjian. Transcriptional regulation in mammalian cells. Science, 245:371–378, 1989.

824 [3] Remo Rohs, Sean M. West, Alona Sosinsky, Peng Liu, Richard S. Mann, and Barry Honig. The role of DNA shape in protein- 825 DNA recognition. Nature, 461(7268):1248, 2009.

826 [4] Susan Jones, Paul van Heyningen, Helen M. Berman, and Janet M. Thornton. Protein-DNA interactions: a structural analysis. 827 Journal of Molecular Biology, 287(5):877–896, 1999.

828 [5] Nobuo Ogawa and Mark D. Biggin. High-throughput SELEX determination of DNA sequences bound by transcription factors 829 in vitro. Gene Regulatory Networks: Methods and Protocols, pages 51–63, 2012.

830 [6] Martha L. Bulyk. Protein binding microarrays for the characterization of DNA–protein interactions. In Analytics of Protein– 831 DNA Interactions, pages 65–85. Springer, 2006.

832 [7] Sachi Inukai, Kian Hong Kock, and Martha L. Bulyk. Transcription factor–DNA binding: beyond binding site motifs. Current 833 Opinion in Genetics & Development, 43:110–119, 2017.

834 [8] Matthew T. Weirauch, Atina Cote, Raquel Norel, Matti Annala, et al. Evaluation of methods for modeling transcription factor 835 sequence specificity. Nature Biotechnology, 31(2):126–134, 2013.

836 [9] Md. Abul Hassan Samee, Benoit Bruneau, and Katherine Pollard. Transcription factors recognize DNA shape without nucleotide 837 recognition. bioRxiv, 2017. doi: 10.1101/143677.

838 [10] David S. Johnson, Ali Mortazavi, Richard M. Myers, and Barbara Wold. Genome-wide mapping of in vivo protein-DNA 839 interactions. Science, 316(5830):1497–1502, 2007.

840 [11] Ho Sung Rhee and B. Franklin Pugh. ChIP-exo method for identifying genomic location of DNA-binding proteins with near- 841 single-nucleotide accuracy. Current Protocols in Molecular Biology, pages 21–24, 2012.

842 [12] Qiye He, Jeff Johnston, and Julia Zeitlinger. ChIP-nexus enables improved detection of in vivo transcription factor binding 843 footprints. Nature Biotechnology, 33(4):395–401, 2015.

844 [13] Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, and William J. Greenleaf. Transposition of native 845 chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature 846 Methods, 10(12):1213–1218, 2013.

847 [14] Roger Pique-Regi, Jacob F. Degner, Athma A. Pai, Daniel J. Gaffney, et al. Accurate inference of transcription factor binding 848 from DNA sequence and chromatin accessibility data. Genome Research, 21(3):447–455, 2011.

849 [15] Eduardo G. Gusmao, Manuel Allhoff, Martin Zenke, and Ivan G. Costa. Analysis of computational footprinting methods for 850 DNase sequencing experiments. Nature Methods, 13(4):303–309, 2016.

851 [16] Xi Chen, Bowen Yu, Nicholas Carriero, Claudio Silva, and Richard Bonneau. Mocap: Large-scale inference of transcription 852 factor binding sites from chromatin accessibility. Nucleic Acids Research, 45(8):4315, 2017.

853 [17] Housheng Hansen He, Clifford A. Meyer, Hyunjin Shin, Shannon T. Bailey, et al. Nucleosome dynamics define transcriptional 854 enhancers. Nature Genetics, 42(4):343–347, 2010.

855 [18] Swneke D. Bailey, Kinjal Desai, Ken J. Kron, Parisa Mazrooei, et al. Noncoding somatic and inherited single-nucleotide variants 856 converge to promote ESR1 expression in breast cancer. Nature Genetics, 48(10):1260–1266, 2016.

31 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

857 [19] Timothy L. Bailey and Philip Machanick. Inferring direct DNA binding from ChIP-seq. Nucleic Acids Research, 40(17): 858 e128–e128, 2012.

859 [20] Jian Zhou and Olga G. Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature 860 Methods, 12(10):931, 2015.

861 [21] Daniel Quang and Xiaohui Xie. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function 862 of DNA sequences. Nucleic Acids Research, 44(11):e107–e107, 2016.

863 [22] Wei-Li Guo and De-Shuang Huang. An efficient method to transcription factor binding sites imputation via simultaneous 864 completion of multiple matrices with positional consistency. Molecular BioSystems, 13:1827–1837, 2017.

865 [23] Qian Qin and Jianxing Feng. Imputation for transcription factor binding predictions based on deep learning. PLOS Compu- 866 tational Biology, 13(2):e1005403, 2017.

867 [24] Richard I. Sherwood, Tatsunori Hashimoto, Charles W. O’Donnell, Sophia Lewis, et al. Discovery of directional and nondi- 868 rectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nature Biotechnology, 32(2):171–178, 869 2014.

870 [25] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary 871 classifiers on imbalanced datasets. PLOS One, 10(3):e0118432, 2015.

872 [26] Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, et al. Opportunities and obstacles for 873 deep learning in biology and medicine. bioRxiv, 2017. doi: 10.1101/142760.

874 [27] ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge. https://synapse.org/encode, 2017. Ac- 875 cessed: 2018-01-31.

876 [28] Michael A. Beer and Saeed Tavazoie. Predicting gene expression from sequence. Cell, 117(2):185–198, 2004.

877 [29] Zhengqing Ouyang, Qing Zhou, and Wing Hung Wong. ChIP-Seq of transcription factors predicts absolute and differential gene 878 expression in embryonic stem cells. Proceedings of the National Academy of Sciences, 106(51):21521–21526, 2009.

879 [30] David R. Kelley and Yakir A. Reshef. Sequential regulatory activity prediction across chromosomes with convolutional neural 880 networks. bioRxiv, 2017. doi: 10.1101/161851.

881 [31] Weiqiang Zhou, Ben Sherwood, Zhicheng Ji, Yingchao Xue, et al. Genome-wide prediction of DNase I hypersensitivity using 882 gene expression. Nature Communications, 8(1):1038, 2017.

883 [32] Harmen J. Bussemaker, Hao Li, and Eric D. Siggia. Regulatory element detection using correlation with expression. Nature 884 Genetics, 27(2):167, 2001.

885 [33] Scott M. Lundberg, William B. Tu, Brian Raught, Linda Z. Penn, Michael M. Hoffman, and Su-In Lee. ChromNet: Learning 886 the human chromatin network from all ENCODE ChIP-seq data. Genome Biology, 17(1):82, 2016.

887 [34] Samuel A. Lambert, Arttu Jolma, Laura F. Campitelli, Pratyush K. Das, Yimeng Yin, Mihai Albu, et al. The human tran- 888 scription factors. Cell, 172(4):650–665, 2018.

889 [35] Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.

890 [36] Charles E. Grant et al. FIMO: scanning for occurrences of a given motif. Bioinformatics, 27(7):1017–1018, 2011.

891 [37] Theresa M. Filtz, Walter K. Vogel, and Mark Leid. Regulation of transcription factor activity by interconnected post- 892 translational modifications. Trends in Pharmacological Sciences, 35(2):76–85, 2014.

893 [38] Shenglin Mei, Qian Qin, Qiu Wu, Hanfei Sun, et al. Cistrome Data Browser: a data portal for ChIP-Seq and chromatin 894 accessibility data in human and mouse. Nucleic Acids Research, 45(D1):D658–D662, 2017.

895 [39] ENCODE Project Consortium. An integrated Encyclopedia of DNA elements in the . Nature, 489(7414):57–74, 896 2012.

897 [40] Jordi Barretina, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, et al. The Cancer Cell Line Encyclopedia enables 898 predictive modeling of anticancer drug sensitivity. Nature, 483(7391):603–607, 2012.

899 [41] Sarah Djebali, Carrie A. Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, et al. Landscape of transcription in human 900 cells. Nature, 489(7414):101, 2012.

901 [42] Nathan C. Sheffield et al. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription 902 factor binding, and long-range interactions. Genome Research, 23(5):777–788, 2013.

903 [43] Robert E. Thurman, Eric Rynes, Richard Humbert, Jeff Vierstra, et al. The accessible chromatin landscape of the human 904 genome. Nature, 489(7414):75–82, 2012.

32 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

905 [44] Jeffrey Rogers and Richard A. Gibbs. Comparative primate genomics: emerging patterns of genome content and dynamics. 906 Nature Reviews Genetics, 15(5):347–359, 2014.

907 [45] Moshe Pritsker, Yir-Chung Liu, Michael A. Beer, and Saeed Tavazoie. Whole-genome discovery of transcription factor binding 908 sites by network-level conservation. Genome Research, 14(1):99–108, 2004.

909 [46] Eugene Berezikov, Victor Guryev, and Edwin Cuppen. Exploring conservation of transcription factor binding sites with CON- 910 REAL. Methods in Molecular Biology, 395:437–448, 2007.

911 [47] Adam Siepel, Gill Bejerano, Jakob S. Pedersen, Angie S. Hinrichs, et al. Evolutionarily conserved elements in vertebrate, insect, 912 worm, and yeast genomes. Genome Research, 15(8):1034–1050, 2005.

913 [48] Katherine S. Pollard, Melissa J. Hubisz, Kate R. Rosenbloom, and Adam Siepel. Detection of nonneutral substitution rates on 914 mammalian phylogenies. Genome Research, 20(1):110–121, 2010.

915 [49] Seyed Ali Madani Tonekaboni, Parisa Mazrooei, Victor Kofia, Benjamin Haibe-Kains, and Mathieu Lupien. CREAM: Clustering 916 of genomic REgions Analysis Method. bioRxiv, 2017. doi: 10.1101/222562.

917 [50] Valerie A. Schneider, Tina Graves-Lindsay, Kerstin Howe, Nathan Bouk, et al. Evaluation of GRCh38 and de novo haploid 918 genome assemblies demonstrates the enduring quality of the reference assembly. Genome Research, 27(5):849–864, 2017.

919 [51] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings 920 of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

921 [52] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv, abs/1412.6980, 2014. 922 arxiv.org/abs/1412.6980.

923 [53] Brian W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et 924 Biophysica Acta (BBA)-Protein Structure, 405(2):442–451, 1975.

925 [54] Davide Chicco. Ten quick tips for machine learning in computational biology. BioData Mining, 10(1):35, 2017.

926 [55] Qunhua Li, James B. Brown, Haiyan Huang, Peter J. Bickel, et al. Measuring reproducibility of high-throughput experiments. 927 The Annals of Applied Statistics, 5(3):1752–1779, 2011.

928 [56] Daniel Quang and Xiaohui Xie. FactorNet: a deep learning framework for predicting cell type specific transcription factor 929 binding from nucleotide-resolution sequential data. bioRxiv, 2017. doi: 10.1101/151274.

930 [57] Jens Keilwagen, Stefan Posch, and Jan Grau. Accurate prediction of cell type-specific transcription factor binding. Genome 931 Biology, 20(1):9, 2019.

932 [58] Alexander Lex, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister. UpSet: visualization of intersecting 933 sets. IEEE Transactions on Visualization and Computer Graphics, 20(12):1983–1992, 2014.

934 [59] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

935 [60] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, et al. Gene set enrichment analysis: a knowledge- 936 based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43): 937 15545–15550, 2005.

938 [61] Alexandre A. Raposo, Francisca F. Vasconcelos, Daniela Drechsel, Corentine Marie, et al. Ascl1 coordinately regulates gene 939 expression and the chromatin landscape during neurogenesis. Cell Reports, 10(9):1544–1556, 2015.

940 [62] L. Ashley Watson, Xu Wang, Adrienne Elbert, Kristin D. Kernohan, et al. Dual effect of CTCF loss on neuroprogenitor 941 differentiation and survival. Journal of Neuroscience, 34(8):2860–2870, 2014.

942 [63] Elise Lamar and Chris Kintner. The Notch targets Esr1 and Esr10 are differentially regulated in Xenopus neural precursors. 943 Development, 132(16):3619–3630, 2005.

944 [64] Anna L. M. Ferri, Wei Lin, Yannis E. Mavromatakis, Julie C. Wang, et al. Foxa1 and Foxa2 regulate multiple phases of midbrain 945 dopaminergic neuron development in a dosage-dependent manner. Development, 134(15):2761–2769, 2007.

946 [65] Ryan T. Willett and Lloyd A. Greene. Gata2 is required for migration and differentiation of retinorecipient neurons in the 947 superior colliculus. Journal of Neuroscience, 31(12):4444–4455, 2011.

948 [66] Seiji Ishii and Kazue Hashimoto-Torii. HSF modulates neural development under normal and stress conditions. In Heat Shock 949 Factor, pages 115–129. Springer, 2016.

950 [67] Rodrigo A. Quintanilla, Elias Utreras, and Fabi´anA. Cabezas-Opazo. Role of PPARγ in the differentiation and function of 951 neurons. PPAR Research, 2014, 2014.

952 [68] Seunghee Lee, Rongkun Shen, Hyong-Ho Cho, Ryuk-Jun Kwon, et al. STAT3 promotes motor neuron differentiation by col- 953 laborating with motor neuron-specific LIM complex. Proceedings of the National Academy of Sciences, 110(28):11445–11450, 954 2013.

33 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

955 [69] Kaia Achim, Paula Peltopuro, Laura Lahti, Hui-Hsin Tsai, et al. The role of Tal2 and Tal1 in the differentiation of midbrain 956 GABAergic neuron precursors. Biology Open, 2(10):990–997, 2013.

957 [70] Xinwei Cao, Samuel L. Pfaff, and Fred H. Gage. YAP regulates neural progenitor cell number via the TEA domain transcription 958 factor. Genes & Development, 22(23):3320–3334, 2008.

959 [71] Xiao-Ling Zhang, Cheng-Xin Huang, Jie Zhang, Akira Inoue, et al. CtBP1 is involved in epithelial-mesenchymal transition and 960 is a potential therapeutic target for hepatocellular carcinoma. Oncology Reports, 30(2):809–814, 2013.

961 [72] Zanabazar Enkhbaatar, Minoru Terashima, Dulamsuren Oktyabri, Shoichiro Tange, et al. KDM5B histone demethylase controls 962 epithelial-mesenchymal transition of cancer cells by regulating the expression of the microRNA-200 family. Cell Cycle, 12(13): 963 2100–2112, 2013.

964 [73] Wei Yu, Changshan Huang, Qian Wang, Tao Huang, et al. transcription factors promotes EMT and invasiveness of 965 hepatocellular carcinoma through TGF-β1 autoregulation circuitry. Tumor Biology, 35(11):10943–10951, 2014.

966 [74] Puja Kachroo, Mi-Heon Lee, Ling Zhang, Felicita Baratelli, et al. IL-27 inhibits epithelial-mesenchymal transition and angiogenic 967 factor production in a STAT1-dominant pathway in human non-small cell lung cancer. Journal of Experimental & Clinical 968 Cancer Research, 32(1):97, 2013.

969 [75] Chih-Chung Lin, Tara R. Bradstreet, Elizabeth A. Schwarzkopf, Julia Sim, et al. Bhlhe40 controls cytokine production by T 970 cells and is essential for pathogenicity in autoimmune neuroinflammation. Nature Communications, 5:3551, 2014.

971 [76] Christopher J. Huggins, Radek Malik, Sook Lee, Jacqueline Salotti, et al. C/EBPγ suppresses senescence and inflammatory 972 gene expression by heterodimerizing with C/EBPβ. Molecular and Cellular Biology, 33(16):3242–3258, 2013.

973 [77] Mathieu Darsigny, St´ephanie St-Jean, and Fran¸coisBoudreau. Cux1 transcription factor is induced in inflammatory bowel 974 disease and protects against experimental colitis. Inflammatory Bowel Diseases, 16(10):1739–1750, 2010.

975 [78] Aneta Kasza, Paulina Wyrzykowska, Irena Horwacik, Piotr Tymoszuk, et al. Transcription factors Elk-1 and SRF are engaged 976 in IL1-dependent regulation of ZC3H12A expression. BMC Molecular Biology, 11(1):14, 2010.

977 [79] David Balli, Xiaomeng Ren, Fu-Sheng Chou, Emily Cross, et al. Foxm1 transcription factor is required for macrophage migration 978 during lung inflammation and tumor formation. Oncogene, 31(34):3875–3888, 2012.

979 [80] Bozena Kaminska. Molecular characterization of inflammation-induced JNK/c-Jun signaling pathway in connection with tu- 980 morigenesis. Methods in Moleular Biology, 512:249–264, 2009.

981 [81] H. Terence Cook, Ruth Tarzi, Zelpha D’Souza, Gaelle Laurent, et al. AP-1 transcription factor JunD confers protection from 982 accelerated nephrotoxic nephritis and control podocyte-specific Vegfa expression. The American Journal of Pathology, 179 983 (1):134–140, 2011.

984 [82] Samaneh Yazdani, Mohammad Hasan Karimfar, Abbas Ali Imani Fooladi, Leila Mirbagheri, et al. Nuclear factor κB1/RelA 985 mediates the inflammation and/or survival of human airway exposed to sulfur mustard. Journal of Receptors and Signal 986 Transduction, 31(5):367–373, 2011.

987 [83] Hany E. S. Marei and Abd-Elmaksoud Ahmed. Transcription factors expressed in embryonic and adult olfactory bulb neural 988 stem cells reveal distinct proliferation, differentiation and epigenetic control. Genomics, 101(1):12–19, 2013.

989 [84] Mercedes Lachn-Montes, Andrea Gonzlez-Morales, Maria Victoria Zelaya, Estela Prez-Valderrama, et al. Olfactory bulb neu- 990 roproteomics reveals a chronological perturbation of survival routes and a disruption of prohibitin complex during Alzheimer’s 991 disease progression. Scientific Reports, 7:9115, 2017.

992 [85] Shreelatha Bhat and Walton D. Jones. An accelerated miRNA-based screen implicates Atf-3 in Drosophila odorant receptor 993 expression. Scientific Reports, 6:20109, 2016.

994 [86] Josefine S. Witteveen, Marjolein H. Willemsen, Tha´ısC. D. Dombroski, Nick H. M. Van Bakel, et al. Haploinsufficiency of 995 MeCP2-interacting transcriptional co-repressor SIN3A causes mild intellectual disability by affecting the development of cortical 996 integrity. Nature Genetics, 48(8):877–887, 2016.

997 [87] Adele J. Vincent, Jennifer M. Taylor, Derek L. Choi-Lundberg, Adrian K. West, and Meng Inn Chuah. Genetic expression 998 profile of olfactory ensheathing cells is distinct from that of Schwann cells and astrocytes. Glia, 51(2):132–147, 2005.

999 [88] Chenzhuo Feng, Jiejie Li, and Zhiyi Zuo. Expression of the transcription factor regulatory factor X1 in the mouse brain. Folia 1000 Histochemica et Cytobiologica, 49(2):344–351, 2011.

1001 [89] Julie M. Ward, Kira Rose, Courtney Montgomery, Indra Adrianto, et al. Disease activity in systemic lupus erythematosus 1002 correlates with expression of the transcription factor AT-rich–interactive domain 3A. Arthritis & Rheumatology, 66(12):3404– 1003 3412, 2014.

1004 [90] Andy Y. Wen, Kathleen M. Sakamoto, and Lloyd S. Miller. The role of the transcription factor CREB in immune function. 1005 The Journal of Immunology, 185(11):6413–6419, 2010.

34 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1006 [91] Steven B. McMahon and John G. Monroe. The role of early growth response gene 1 (EGR-1) in regulation of the immune 1007 response. Journal of Leukocyte Biology, 60(2):159–166, 1996.

1008 [92] Atsuko Masumi, I-Ming Wang, Bruno Lefebvre, Xing-Jiao Yang, et al. The histone acetylase PCAF is a phorbol-ester-inducible 1009 of the IRF family that confers enhanced responsiveness. Molecular and Cellular Biology, 19(3):1810–1820, 1010 1999.

1011 [93] Chia-Hsin Su, I-Hsuan Lin, Tsai-Yu Tzeng, Wen-Ting Hsieh, and Ming-Ta Hsu. Regulation of IL-20 expression by estradiol 1012 through KMT2B-mediated epigenetic modification. PLOS One, 11(11):e0166090, 2016.

1013 [94] Wael Massrieh, Anna Derjuga, Florence Doualla-Bell, Chun-Ying Ku, et al. Regulation of the MAFF transcription factor by 1014 proinflammatory cytokines in myometrial cells. Biology of Reproduction, 74(4):699–705, 2006.

1015 [95] Jean Villard, Marie Peretti, Krzysztof Masternak, Emmanu`eleBarras, et al. A functionally essential domain of RFX5 mediates 1016 activation of major histocompatibility complex class II promoters by promoting cooperative binding between RFX and NF-Y. 1017 Molecular and Cellular Biology, 20(10):3364–3376, 2000.

1018 [96] Feng Ma, Su-Yang Liu, Bahram Razani, Neda Arora, et al. α attenuates host antiviral response by 1019 suppressing type I interferon. Nature Communications, 5:5494, 2014.

1020 [97] Lan Xie. MKL1/2 and ELK4 co-regulate distinct (SRF) transcription programs in macrophages. BMC 1021 Genomics, 15(1):301, 2014.

1022 [98] Sumiko Yoshida, Ken-ichi Aihara, Yasumasa Ikeda, Yuka Sumitomo-Ueda, et al. promotes gender- 1023 independent angiogenesis in response to ischemia and is required for activation of VEGF receptor signaling. Circulation, 1024 128(1):60–71, 2013.

1025 [99] Bryan L. Krock, Nicolas Skuli, and M. Celeste Simon. Hypoxia-induced angiogenesis: good and evil. Genes & Cancer, 2(12): 1026 1117–1133, 2011.

1027 [100] Li Jiang, Meng Yin, Xiangxiang Wei, Junxu Liu, et al. Bach1 represses Wnt/β-catenin signaling and angiogenesis. Circulation 1028 Research, 117(4):364–375, 2015.

1029 [101] Hideki Kawai, Huchun Li, Philip Chun, Shalom Avraham, and Hava Karsenty Avraham. Direct interaction between BRCA1 1030 and the regulates vascular endothelial growth factor (VEGF) transcription and secretion in breast cancer 1031 cells. Oncogene, 21(50):7730, 2002.

1032 [102] Mingcheng Huang, Qian Qiu, Youjun Xiao, Shan Zeng, Mingying Zhan, et al. BET bromodomain suppression inhibits VEGF- 1033 induced angiogenesis and vascular permeability by blocking VEGFR2-mediated activation of PAK1 and eNOS. Scientific 1034 Reports, 6:23770, 2016.

1035 [103] David Engelmann, Deborah Mayoli-N¨ussle,Christian Mayrhofer, Katharina F¨urst,et al. E2F1 promotes angiogenesis through 1036 the VEGF-C/VEGFR-3 axis in a feedback loop for cooperative induction of PDGF-B. Journal of Molecular Cell Biology, 5 1037 (6):391–403, 2013.

1038 [104] Haihua Song, Jun-ichi Suehiro, Yasuharu Kanki, Yoshiko Kawai, et al. Critical role for GATA3 in mediating Tie2 expression 1039 and function in large vessel endothelial cells. Journal of Biological Chemistry, 284(42):29109–29124, 2009.

1040 [105] Vasundhra Kashyap, Shafqat Ahmad, Emeli M. Nilsson, Leszek Helczynski, et al. The lysine specific demethylase-1 1041 (LSD1/KDM1A) regulates VEGF-A expression in prostate cancer. Molecular Oncology, 7(3):555–566, 2013.

1042 [106] Troy A. Baudino, Catriona McKay, Helene Pendeville-Samain, Jonas A. Nilsson, et al. c-Myc is essential for vasculogenesis and 1043 angiogenesis during development and tumor progression. Genes & Development, 16(19):2530–2543, 2002.

1044 [107] Ken Iwatsuki, Kiyoko Tanaka, Tsuyoshi Kaneko, Ritsuko Kazama, et al. Runx1 promotes angiogenesis by downregulation of 1045 insulin-like growth factor-binding protein-3. Oncogene, 24(7):1129–1137, 2005.

1046 [108] Farhang M. Ghahremani, Steven Goossens, David Nittner, Xavier Bisteau, et al. promotes VEGF expression and angio- 1047 genesis in the absence of an intact p21-Rb pathway. Cell Death & Differentiation, 20(7):888–897, 2013.

1048 [109] Gilbert Saporta and Genane Youness. Comparing two partitions: some proposals and experiments. In Compstat, pages 243–248. 1049 Springer, 2002.

1050 [110] James W. Kent, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, et al. The human genome browser at UCSC. 1051 Genome Research, 12(6):996–1006, 2002.

1052 [111] Anthony Mathelier, Oriol Fornes, David J. Arenillas, Chih-yu Chen, Grgoire Denay, et al. JASPAR 2016: a major expansion 1053 and update of the open-access database of transcription factor binding profiles. Nucleic Acids Research, 44(D1):D110–115, 1054 2016.

1055 [112] Petr Smirnov, Zhaleh Safikhani, Nehme El-Hachem, Dong Wang, et al. PharmacoGx: an R package for analysis of large 1056 pharmacogenomic datasets. Bioinformatics, 32(8):1244–1246, 2015.

1057 [113] Laurence I-Kuei Lin. A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45(1):255–268, 1989.

35 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1058 [114] Yong Zhang, Tao Liu, Clifford A. Meyer, Jrme Eeckhoute, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biology, 1059 9(9):R137, 2008.

1060 [115] Stephane Champely. pwr: basic functions for power analysis, 2017. URL https://CRAN.R-project.org/package=pwr. R package 1061 version 1.2-1.

1062 [116] Fabian Pedregosa, Ga¨elVaroquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, et al. Scikit-learn: 1063 machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

1064 [117] Takaya Saito and Marc Rehmsmeier. Precrec: fast and accurate precision–recall and ROC curve calculations in R. Bioinfor- 1065 matics, 33(1):145–147, 2017.

1066 [118] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. ISBN 978-0-387-98140-6.

1067 [119] Erich Neuwirth. RColorBrewer: ColorBrewer Palettes, 2014. R package version 1.1-2.

1068 [120] Wyeth W. Wasserman and Albin Sandelin. Applied bioinformatics for the identification of regulatory elements. Nature Reviews 1069 Genetics, 5(4):276, 2004.

1070 [121] Benjamin L. Kidder, Gangqing Hu, and Keji Zhao. ChIP-seq: technical considerations for obtaining high-quality data. Nature 1071 Immunology, 12(10):918–922, 2011.

1072 [122] Leonid Teytelman, Deborah M. Thurtle, Jasper Rine, and Alexander van Oudenaarden. Highly expressed loci are vulnerable 1073 to misleading ChIP localization of multiple unrelated proteins. Proceedings of the National Academy of Sciences, 110(46): 1074 18602–18607, 2013.

1075 [123] Daniel Savic, Christopher E. Partridge, Kimberly M. Newberry, Sophia B. Smith, et al. CETCh-seq: CRISPR epitope tagging 1076 ChIP-seq of DNA-binding proteins. Genome Research, 25(10):1581–1589, 2015.

1077 [124] Xiaoyan Wang, Melissa Cunningham, Xiaoli Zhang, Sara Tokarz, Bryan Laraway, et al. Phosphorylation regulates c-Myc’s 1078 oncogenic activity in the mammary gland. Cancer Research, 71(3):925–936, 2011.

1079 [125] Mayumi Yoshimi, Susumu Goyama, Masahito Kawazu, Masahiro Nakagawa, Motoshi Ichikawa, et al. Multiple phosphorylation 1080 sites are important for RUNX 1 activity in early hematopoiesis and T-cell differentiation. European Journal of Immunology, 1081 42(4):1044–1050, 2012.

1082 [126] Takeshi Sekiya, Kensaku Murano, Kohsuke Kato, Atsushi Kawaguchi, and Kyosuke Nagata. Mitotic phosphorylation of CCCTC- 1083 binding factor (CTCF) reduces its DNA binding activity. FEBS Open Bio, 7(3):397–404, 2017.

1084 [127] Tomoyuki Fukuda, Florencia Pratto, John C. Schimenti, James M. A. Turner, R. Daniel Camerini-Otero, and Christer H¨o¨og. 1085 Phosphorylation of chromosome core components may serve as axis marks for the status of chromosomal events during mam- 1086 malian meiosis. PLOS Genetics, 8(2):e1002485, 2012.

36