bioRxiv preprint doi: https://doi.org/10.1101/406322; this version posted September 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

In silico prediction of high-resolution Hi-C interaction matrices

Shilu Zhang1, Deborah Chasman1, Sara Knaack1, and Sushmita Roy1,2*

1Wisconsin Institute for Discovery, 330 N. Orchard Street, Madison, WI 53715, USA 2Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715, USA *To whom correspondence should be addressed.


Abstract

The three-dimensional organization of the genome plays an important role in gene regulation by enabling distal sequence elements to control the expression level of genes hundreds of kilobases away. Hi-C is a powerful genome-wide technique to measure the contact counts of pairs of genomic loci, which are needed to study three-dimensional organization. Due to experimental costs, high-resolution Hi-C datasets are available only for a handful of cell lines. Computational prediction of Hi-C contact counts can offer a scalable and inexpensive approach to examine three-dimensional genome organization across many cellular contexts. Here we present HiC-Reg, a novel approach to predict contact counts from one-dimensional regulatory signals such as epigenetic marks and regulatory protein binding. HiC-Reg exploits the signal from the region spanning two interacting regions and from across multiple cell lines to generalize to new contexts. Using existing feature importance measures and a new matrix factorization based approach, we found CTCF and chromatin marks, especially repressive and elongation marks, to be important for predictive performance. Predicted counts from HiC-Reg identify topologically associated domains as well as significant interactions that are enriched for CTCF bi-directional motifs and agree well with interactions identified from complementary long-range interaction assays. Taken together, HiC-Reg provides a powerful framework to generate high-resolution profiles of contact counts that can be used to study individual locus level interactions as well as higher-order organizational units of the genome.


Introduction

The three-dimensional organization of the genome has emerged as an important component of the gene regulation machinery that enables distal regulatory elements, such as enhancers, to control the expression of genes hundreds of kilobases away. These long-range interactions can have major roles in tissue-specific expression [1–3] and in how regulatory sequence variants impact complex phenotypes [4, 5], including diseases such as cancer, diabetes and obesity [6, 7]. Chromosome Conformation Capture (3C) technologies such as 4C, 5C, ChIA-PET, Hi-C [8] and Capture-Hi-C [9], which are used to measure the 3D proximity of genomic loci, have rapidly matured over the past decade. However, there are several challenges for identifying such interactions in diverse cell types and contexts. First, the majority of these experimental technologies have been applied to well-studied cell lines. Second, very few Hi-C datasets are available at resolutions high enough (e.g., 5kbp) to identify enhancer-gene interactions, due to the tradeoff between measuring genome-wide chromosome organization and achieving higher resolution. Third, long-range gene regulation involves a complex interplay of transcription factors, histone marks and architectural proteins [10–13], making it important to examine three-dimensional proximity in concert with these components of the transcription machinery.

Recently, numerous computational approaches for predicting long-range interactions have been developed [14–18], which leverage the fact that regulatory regions that participate in long-range regulatory interactions have characteristic one-dimensional genomic signatures [14, 15, 19]. However, these methods have all used a binary classification framework, which is not optimal because the interaction prediction problem is a one-class problem with no negative examples of interactions. In addition, these methods do not exploit the genome-wide nature of high-throughput chromosome capture datasets such as Hi-C, as they focus on significantly interacting pairs, which can be sensitive to the method used to call significant interactions. A recent comparison of methods for calling significant interactions showed that there are substantial differences in the interactions identified by different methods [20].

To overcome the limitations of a classification approach and to maximally exploit the information in high-throughput assays such as Hi-C, we developed a Random Forests regression-based approach, HiC-Reg. HiC-Reg integrates published Hi-C datasets with one-dimensional regulatory genomic datasets such as chromatin marks, architectural and transcription factor proteins, and chromatin accessibility, to predict interaction counts between two genomic loci in a cell line-specific manner. We applied our approach to high-resolution Hi-C data from five cell lines to predict interaction counts at 5kb resolution and systematically evaluated its ability to predict interactions within the same cell line (tested using cross-validation), across different chromosomes, as well as across different cell lines. Our work shows that a Random Forests-based regression framework can predict genome-wide Hi-C interaction matrices within cell lines and can generalize to new chromosomes and cell lines, and that modeling the signal between regions as well as integrating data from multiple cell lines improves predictive power. Feature analysis of the predicted interactions suggests that CTCF and chromatin signals are both important for high-quality predictions. Computational validation shows that HiC-Reg predictions agree well with ChIA-PET datasets, recapitulate well-known examples of long-range interactions, exhibit bidirectional CTCF loops, and can recover topologically associated domains. Overall, HiC-Reg provides a computational approach for predicting the interaction counts discovered by Hi-C technology, which can be used for examining long-range interactions for individual loci as well as for studying the organizational properties of chromosome conformation.


Results

HiC-Reg: A Random Forests regression approach for predicting high-resolution contact count of genomic regions

We developed a regression-based approach called HiC-Reg to predict cell line-specific contact counts between pairs of 5kb genomic regions using cell line-specific one-dimensional regulatory signals, such as histone marks and transcription factor binding profiles (Fig 1). Our regression framework treats contact counts as outputs of a regression model whose inputs are the one-dimensional regulatory signals associated with a pair of genomic regions. As cell line-specific training count data we used high-resolution (5kb) Hi-C datasets from five cell lines from Rao et al. [21]. We used features from 14 cell line-specific regulatory genomic datasets and genomic distance to represent a pair of regions. These datasets include the architectural protein CTCF, repressive marks (H3K27me3, H3K9me3), marks associated with active gene bodies and elongation (H3K36me3, H4K20me1, H3K79me2), enhancer-specific marks (H3K4me1, H3K27ac), activating marks (H3K9ac, H3K4me2, H3K4me3), a cohesin component (RAD21), a general transcription factor (TBP) and DNase I (open chromatin) [14]. HiC-Reg uses a Random Forests regression model as its core predictive model. The Random Forests regression model significantly outperforms a linear regression model, suggesting that capturing non-linear dependencies is important for solving the count prediction problem effectively (Supplementary Fig 1).

A key consideration for chromosomal count prediction is how to represent the genomic region pairs as examples for training a regression model. To represent a pair in HiC-Reg, we considered three main feature encodings in addition to the genomic distance between the two regions of interest: PAIR-CONCAT, WINDOW and MULTI-CELL (Fig 1). PAIR-CONCAT simply concatenates the features for each region into a single vector. The WINDOW feature encoding additionally uses the average signal between the two regions and was proposed in TargetFinder, a classification-based method [15]. MULTI-CELL incorporates signals from other cell lines. We assessed the performance of these encodings using the distance-stratified Pearson’s correlation computed on test set pairs in a five-fold cross-validation setting (Methods). The distance-stratified correlation measures the correlation between predicted and true counts for genomic pairs at a particular distance threshold. Distance stratification is important due to the high dependence of contact count on genomic distance [22]. As a baseline we compared the performance of these models to a Random Forests regression model trained on distance alone. HiC-Reg performs much better than distance alone, demonstrating that the addition of regulatory signals significantly improves performance (Fig 2). Between PAIR-CONCAT and WINDOW, WINDOW features are significantly better, and the performance of PAIR-CONCAT often decreases as a function of distance (Fig 2A). WINDOW and MULTI-CELL have similar performance and are both better than PAIR-CONCAT.
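The three encodings can be illustrated with a minimal sketch. It assumes that each cell line's signals are stored as a list of 5kb bins, each bin holding one value per regulatory dataset. The function names, the zero vector used for adjacent bins, and the choice to build MULTI-CELL by concatenating WINDOW features across cell lines are illustrative assumptions, not details taken from the HiC-Reg implementation.

```python
from statistics import mean

def pair_concat(signal, i, j):
    """PAIR-CONCAT: features of bin i followed by features of bin j."""
    return signal[i] + signal[j]

def window(signal, i, j):
    """WINDOW: PAIR-CONCAT plus the mean signal over the bins strictly between i and j."""
    n_marks = len(signal[i])
    between = signal[i + 1:j]
    if between:
        w = [mean(row[m] for row in between) for m in range(n_marks)]
    else:
        w = [0.0] * n_marks  # assumption: adjacent bins get an all-zero window
    return signal[i] + signal[j] + w

def multi_cell(signals_by_cell, i, j):
    """MULTI-CELL (assumed form): WINDOW features concatenated over all cell lines."""
    feats = []
    for cell in sorted(signals_by_cell):
        feats += window(signals_by_cell[cell], i, j)
    return feats

def with_distance(features, i, j, bin_size=5000):
    """Append the genomic distance (in bp) between the two bins."""
    return features + [abs(j - i) * bin_size]

# Toy example: 4 bins, 2 hypothetical marks per bin.
toy = [[1.0, 0.0], [2.0, 4.0], [4.0, 0.0], [0.0, 1.0]]
example = with_distance(window(toy, 0, 3), 0, 3)
```

For the toy input, `example` holds the two end point vectors, the averaged window vector and the distance, which is the kind of row a regression model would be trained on.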

To examine the generality of this behavior across all chromosomes, we applied HiC-Reg to all chromosomes in a five-fold cross-validation setting. To summarize the performance captured in the distance-stratified correlation curve, we computed the area under this curve (AUC, Methods). The AUC metric ranges from 0 to 1, with 1 representing the best performance, and succinctly describes the performance across all five cell lines and chromosomes. We find that across cell lines and all chromosomes, both the WINDOW and the MULTI-CELL features are significantly better than simple concatenation of the features of the two regions, suggesting that incorporating the signal between two regions is important for capturing interaction counts at longer distances (Fig 2B). When examining the performance across the five cell lines, the AUC for the Nhek cell line was lowest, suggesting it is the hardest to predict. Due to the superior performance of the WINDOW and MULTI-CELL features, we excluded the PAIR-CONCAT feature from subsequent experiments.
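As a rough illustration of the evaluation metric, the sketch below groups test pairs by genomic distance, computes Pearson's correlation within each group, and summarizes the resulting curve by its trapezoidal area divided by the distance span. The exact stratification and normalization used by HiC-Reg are defined in its Methods; grouping by exact distance and normalizing by the span are assumptions made here, and with this normalization the value falls in [0, 1] only when all per-distance correlations are non-negative.

```python
from collections import defaultdict
from math import sqrt

def pearson(x, y):
    """Pearson's correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def distance_stratified_correlation(pairs):
    """pairs: iterable of (distance, true_count, predicted_count) tuples.

    Returns a list of (distance, correlation) sorted by distance, skipping
    distances with fewer than two pairs.
    """
    by_dist = defaultdict(list)
    for d, t, p in pairs:
        by_dist[d].append((t, p))
    curve = []
    for d in sorted(by_dist):
        if len(by_dist[d]) >= 2:
            true, pred = zip(*by_dist[d])
            curve.append((d, pearson(true, pred)))
    return curve

def curve_auc(curve):
    """Trapezoidal area under the curve, normalized by the distance span (assumed)."""
    if not curve:
        return 0.0
    if len(curve) == 1:
        return curve[0][1]
    area = sum((d2 - d1) * (r1 + r2) / 2
               for (d1, r1), (d2, r2) in zip(curve, curve[1:]))
    return area / (curve[-1][0] - curve[0][0])
```

A model whose predictions correlate perfectly with the true counts at every distance would score 1 under this summary.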

To examine whether our approach can generalize across chromosomes, we trained HiC-Reg on one chromosome and used it to predict counts in a different chromosome for the same cell line. We considered five chromosomes, chr9, chr11, chr14, chr17 and chr19, each as training and test chromosomes, in all five cell lines. HiC-Reg predictions across chromosomes were slightly worse than when training and testing on the same chromosome in all but the Gm12878 cell line, where the cross-chromosome performance was much worse. Interestingly, while both feature types, WINDOW and MULTI-CELL, were comparable when training and testing on the same chromosome (Fig 2), when comparing models across chromosomes we found that the MULTI-CELL feature was better than the WINDOW feature (Fig 3, Supplementary Figs 2-6), especially for the Gm12878 cell line. Some chromosomes produced worse models than others. For example, chromosome 19 was a poor predictor of other chromosomes in all cell lines, and chromosome 14 was a poor predictor of all other chromosomes in K562. These performance differences point to chromosome-specific properties, some shared across cell lines (chr19) and some unique to specific cell lines (chr14 in K562). Taken together, our analysis showed that HiC-Reg is able to successfully predict interaction counts in different cell lines using one-dimensional features, and that the MULTI-CELL features in particular are important in a cross-chromosome setting.

Feature analysis identifies important determinants of chromosome contact counts

To gain insight into the relative importance of different one-dimensional regulatory signals, such as chromatin marks and transcription factor binding signals, for predicting contact counts, we conducted different types of feature analyses on our learned Random Forests regression models. We focused on the MULTI-CELL features as they had the best performance. We first ranked features using a standard feature importance score, “Out Of Bag” (OOB) error (Methods, Fig 4A, Supplementary Fig 7). For a given cell line, the features from that cell line ranked highly (e.g., in Gm12878 all top 10 features were from the Gm12878 cell line, Fig 4A); however, there were some exceptions where a feature from another cell line was also important (e.g., several Gm12878 features were important for the HUVEC and K562 cell lines, Fig 4A). The feature rankings across different chromosomes were similar (Spearman correlation 0.72-0.87, Supplementary Fig 7D). Based on this ranking, the most important features included Distance, elongation chromatin marks like H3K36me3 and H4K20me1, a repressive mark H3K9me3, an enhancer-associated mark H3K4me1, and DNase I, typically on one or both end point regions (R1 and R2) and rarely on the Window region (W). We next used a complementary strategy of counting the number of times a feature was used for predicting the count of a test pair (Fig 4B, Supplementary Fig 8). This analysis also found Distance to be important, along with other factors like CTCF, TBP, DNase I, as well as repressive marks (H3K9me3 and H3K27me3), an enhancer-associated mark (H3K4me1), and an elongation mark (H4K20me1). Here too we observed that the features measured in the training cell line were typically most highly ranked, with some variations (e.g., Gm12878 features were useful for predicting counts in most cell lines). Comparing across chromosomes, the feature rankings were very similar (Spearman correlation 0.85-0.94, Supplementary Fig 8D). The two feature analysis methods agreed on the importance of the elongation mark H4K20me1, the enhancer mark H3K4me1 and several repressive marks, but there were some differences. The OOB method identified histone mark features in the end point regions (R1 and R2), while the feature usage counts implicated CTCF and TBP Window signals as the most important.
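The usage-count strategy can be sketched as follows, under the assumption that each regression tree is available as a nested dictionary with hypothetical keys `feature`, `threshold`, `left`, `right` and `leaf`. This representation is chosen only for illustration and is not the data structure used by HiC-Reg.

```python
from collections import Counter

def path_features(tree, x):
    """Return the features tested on the root-to-leaf path followed by sample x."""
    used = []
    node = tree
    while "leaf" not in node:
        used.append(node["feature"])
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return used

def feature_usage_counts(forest, samples):
    """Count how often each feature is tested while predicting the given samples."""
    counts = Counter()
    for x in samples:
        for tree in forest:
            counts.update(path_features(tree, x))
    return counts

# Toy tree with two hypothetical features.
toy_tree = {
    "feature": "CTCF_W", "threshold": 0.5,
    "left": {"leaf": 1.0},
    "right": {
        "feature": "Distance", "threshold": 10000,
        "left": {"leaf": 5.0},
        "right": {"leaf": 2.0},
    },
}
toy_counts = feature_usage_counts(
    [toy_tree, toy_tree],
    [{"CTCF_W": 0.9, "Distance": 5000}, {"CTCF_W": 0.1, "Distance": 5000}],
)
```

Ranking features by these counts gives an importance measure that reflects how often a feature actually participates in a prediction, in contrast to the OOB error, which measures the loss in accuracy when a feature is perturbed.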

While the above approach identified the top features for all pairs of regions, it does not tell us whether different feature sets are useful for different sets of pairs. In particular, some interacting pairs could be driven largely by chromatin marks, while another set of pairs could be driven by transcription factors. Furthermore, it does not inform us about dependencies among features that might be important for making these predictions. To address this, we developed a novel feature analysis method based on non-negative matrix factorization (NMF, Methods). Briefly, we obtained region pairs that were in the lowest 5% of test error and counted the number of times a feature or a pair of features was used on the tree path traversed for these region pairs. We restricted the analysis to region pairs in the lowest 5% of test error for computational efficiency and because the individual feature rankings did not change compared to when considering all pairs (Supplementary Fig 9, 10). Furthermore, we focused on chromosome 17 because the feature importances were similar across chromosomes (Supplementary Fig 7, 8). We used these counts to create a region-pair by feature-pair matrix, with each entry of the matrix denoting the number of times the feature pair was used in the trees. We next applied NMF to this matrix to obtain clusters of region pairs associated with clusters of feature pairs (Fig 4C). Such “bi-clusters” are indicative of different classes of region pairs and the most important features associated with them. For example, for Gm12878, one cluster (Cluster 1) was associated with H4K20me1 and CTCF, while another cluster (Cluster 2) was associated with repressive marks (H3K27me3) together with other histone marks but not CTCF. A third cluster (Cluster 3) was associated with features from the K562 cell line (CTCF and H4K20me1) in addition to those from Gm12878. Cluster 4 was associated with CTCF in Gm12878 and Huvec and H3K4me1 in Gm12878, and finally Cluster 5 was largely associated with TBP. We found similar behavior in other cell lines as well (Supplementary Fig 11-14). For example, in K562, we found two types of clusters: one cluster (Cluster 2) was associated with different chromatin marks, while the other clusters had CTCF in combination with different chromatin marks. While CTCF was important as a feature hub in all cell lines, some cell lines exhibited importance of other types of features like H3K9me3 (Nhek), DNase I (Huvec and Hmec) and RAD21 (Nhek). We applied the same procedure to the matrix of region pairs by individual features (Supplementary Fig 15, 16); however, the groups were essentially driven by the cell line in which the feature was measured. While this showed that NMF captures the most important grouping structure (cell line), it was not unexpected and only served as a sanity check.
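The construction of the region-pair by feature-pair matrix and its factorization can be sketched as below. The input format (a mapping from each region pair to the list of tree paths used to predict it), the multiplicative-update NMF solver, and the assignment of rows and columns to the factor with the largest loading are all assumptions made for illustration; the procedure actually used by HiC-Reg is described in its Methods.

```python
import random
from collections import Counter
from itertools import combinations

def feature_pair_matrix(paths_by_region_pair):
    """Rows: region pairs; columns: unordered feature pairs; entries: co-usage counts."""
    rows = sorted(paths_by_region_pair)
    counts = {r: Counter() for r in rows}
    for r in rows:
        for path in paths_by_region_pair[r]:
            counts[r].update(combinations(sorted(set(path)), 2))
    cols = sorted({fp for c in counts.values() for fp in c})
    return rows, cols, [[float(counts[r][fp]) for fp in cols] for r in rows]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, k, iters=300, seed=0, eps=1e-9):
    """Factor v (n x m, non-negative) as w (n x k) times h (k x m) using
    multiplicative updates for the squared-error objective."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

def hard_clusters(w, h):
    """Assign each row and each column to the factor with the largest loading."""
    k = len(h)
    row_cluster = [max(range(k), key=lambda a: w[i][a]) for i in range(len(w))]
    col_cluster = [max(range(k), key=lambda a: h[a][j]) for j in range(len(h[0]))]
    return row_cluster, col_cluster
```

Region pairs and feature pairs that share a factor form one bi-cluster, so a factor dominated by CTCF-containing feature pairs would correspond to a CTCF-driven class of interactions.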

Taken together, our feature analysis showed that there were largely CTCF-driven and chromatin mark-driven clusters of interactions. CTCF was a key feature for predicting interactions and was associated with other chromatin marks. Furthermore, we found that elongation marks like H3K36me3 or H4K20me1 and repressive marks like H3K27me3 were often important as feature hubs.

HiC-Reg predictions exhibit hallmarks of true looping interactions and identify TADs

We next assessed the quality of our predictions based on several additional metrics that examined the occurrence of specific properties of true interactions. One such property is the occurrence of bi-directional CTCF motifs [21]. Briefly, a pair can have one of four configurations of the CTCF motif, (++), (+-), (-+) and (--), where “+” corresponds to the motif being on the forward strand and “-” corresponds to the motif being on the reverse strand. The pairs with the (+-) configuration are most likely to be true loops. This property of looping interactions was also used by Forcato et al. [20] to compare different Hi-C peak calling programs. To assess the occurrence of CTCF bidirectional motifs in HiC-Reg predictions, we applied Fit-Hi-C [22] to identify significant interactions in both true and predicted counts (Methods). Following [20], we only focus on pairs with at least one signed CTCF motif mapped to each region of the pair, discarding a pair if either region has both orientations of the motif. We quantified the tendency of CTCF bidirectional motifs to occur in the significant pairs versus all pairs using fold enrichment, which compares the fraction of significant interactions that have a particular configuration to the fraction expected by random chance. Across all five cell lines, significant pairs called on both the true and predicted counts are enriched for the bidirectional motif (+-) configuration (Fig 5A). The level of enrichment is comparable for the interactions identified from the true and predicted counts and is sometimes higher in the predicted counts.
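The filtering rule and the fold enrichment can be made concrete with a small sketch. It assumes that each region comes with the list of strands (`"+"` or `"-"`) of the CTCF motifs mapped to it, and that the chance expectation is estimated from the fraction of all eligible pairs with a given configuration; the function names and this choice of background are illustrative.

```python
from collections import Counter

CONFIGS = ("++", "+-", "-+", "--")

def orientation(strands_r1, strands_r2):
    """Return the motif configuration of a pair, or None if the pair is discarded.

    A pair is kept only if each region has at least one motif and all motifs
    within a region share the same strand.
    """
    s1, s2 = set(strands_r1), set(strands_r2)
    if len(s1) != 1 or len(s2) != 1:
        return None
    return next(iter(s1)) + next(iter(s2))

def fold_enrichment(significant, background):
    """Ratio of the configuration frequency in significant pairs to that in all pairs.

    Both arguments are lists of configuration strings; configurations absent
    from the background are skipped.
    """
    sig, bg = Counter(significant), Counter(background)
    result = {}
    if not significant or not background:
        return result
    for config in CONFIGS:
        if bg[config] == 0:
            continue
        sig_frac = sig[config] / len(significant)
        bg_frac = bg[config] / len(background)
        result[config] = sig_frac / bg_frac
    return result
```

A value above 1 for the (+-) configuration indicates that convergent motifs are over-represented among the significant pairs.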

As a second evaluation metric, we compared the significant interactions identified by Fit-Hi-C on HiC-Reg’s predictions and true counts with interactions identified using a complementary experiment, ChIA-PET (Methods). We obtained 10 published datasets for different factors (RNA PolII, CTCF and RAD21) and histone marks in multiple cell lines [23, 24]. We estimated the fold enrichment of the ChIA-PET interactions in the significant interactions compared to background. We found that interactions from both true counts and HiC-Reg predictions were enriched for ChIA-PET interactions, and these enrichments were often higher for HiC-Reg predictions than for the true counts (Fig 5B). The CTCF directionality and ChIA-PET enrichment suggest that significant interactions from HiC-Reg predictions exhibit hallmarks of true looping interactions.
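One way to compute such an enrichment is sketched below. It assumes that ChIA-PET interactions have already been mapped to the same 5kb bin pairs as the Hi-C data and that the background is the set of all candidate pairs; both are assumptions, since the exact background is defined in the Methods.

```python
def chiapet_fold_enrichment(significant, all_pairs, chiapet):
    """Fraction of significant pairs supported by ChIA-PET, divided by the
    fraction of all candidate pairs supported by ChIA-PET.

    Each argument is an iterable of (bin_i, bin_j) tuples. Returns None when
    the ratio is undefined.
    """
    significant, all_pairs, chiapet = set(significant), set(all_pairs), set(chiapet)
    if not significant or not all_pairs:
        return None
    background_frac = len(all_pairs & chiapet) / len(all_pairs)
    if background_frac == 0:
        return None
    return (len(significant & chiapet) / len(significant)) / background_frac
```

For example, if 2 of 4 significant pairs but only 2 of 10 candidate pairs are supported by ChIA-PET, the fold enrichment is 0.5 / 0.2 = 2.5.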

As a third validation metric, we asked to what extent the HiC-Reg counts can be used to study structural units of organization of the chromosome, such as topologically associated domains (TADs, [25]). One of the advantages of a regression versus a classification framework is that the output counts can be examined with downstream TAD finding algorithms [20, 26]. We applied the Directionality Index (DI) TAD finding method to HiC-Reg predicted counts and true counts and compared the similarity of the identified TADs using a metric derived from the Jaccard coefficient (Methods). The Jaccard coefficient assesses the overlap between two sets of objects (e.g., regions in one TAD versus regions in another TAD) and is a number between 0 and 1, with 0 representing no overlap and 1 representing perfect overlap. We aggregated the Jaccard coefficient across all TADs identified on a chromosome into a single average Jaccard coefficient (Methods). Across all cell lines and chromosomes, the average Jaccard coefficient ranged from 0.79 to 0.83, indicating good agreement between TADs from true and predicted counts (Fig 5C). This high agreement is shown for a selected region (a 2Mb block of 5kb regions on chr17: 32000000-34000000), where the identified TADs (cyan boxes) agree between the true and predicted count matrices (Fig 5D).
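A minimal sketch of a Jaccard-based TAD comparison follows. TADs are represented as inclusive (start_bin, end_bin) intervals, and the aggregation shown here (match each TAD to its best-overlapping TAD in the other set, average, then symmetrize) is one plausible reading of "a single average Jaccard coefficient"; the exact definition is given in the Methods and may differ.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def tad_bins(tad):
    """Bins covered by a TAD given as an inclusive (start_bin, end_bin) interval."""
    start, end = tad
    return range(start, end + 1)

def average_best_jaccard(tads_a, tads_b):
    """Mean, over TADs in tads_a, of the best Jaccard match found in tads_b."""
    if not tads_a or not tads_b:
        return 0.0
    best = [max(jaccard(tad_bins(a), tad_bins(b)) for b in tads_b) for a in tads_a]
    return sum(best) / len(best)

def tad_similarity(tads_true, tads_pred):
    """Symmetrized average Jaccard between two TAD sets (assumed aggregation)."""
    return (average_best_jaccard(tads_true, tads_pred)
            + average_best_jaccard(tads_pred, tads_true)) / 2
```

With true TADs at bins (0, 9) and (10, 19) and predicted TADs at (0, 9) and (10, 14), the first TAD matches perfectly and the second has a Jaccard coefficient of 0.5, giving an average of 0.75.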

Overall, these validation results show that HiC-Reg predictions can be used to study three-dimensional genome organization at the level of individual loops as well as at the level of higher-order structural units such as topologically associated domains.

HiC-Reg can be used to predict contact counts in new cell lines

Our analysis so far demonstrates the feasibility of using regression to predict chromosome contact counts in cell lines with available Hi-C data for at least some chromosomes. We next asked if we could apply this approach to predict interactions in test cell lines different from the training cell line. This would enable us to study the utility of HiC-Reg in cell lines where Hi-C data are not available. We applied HiC-Reg trained on one cell line to predict counts for pairs from a different test cell line. We evaluated the quality of the predictions in the test cell line using the distance-stratified Pearson’s correlation (CrossCell, Fig 6) and additional validation metrics (CTCF directionality, ChIA-PET and TAD recovery, Fig 7). These validation metrics were computed on the test cell line and compared against different models: (i) a model trained on distance alone, (ii) a model trained with cross-validation (CV) on the test cell line (CV, Fig 6), and (iii) a new baseline model that simply “transferred” the count from the training cell line to the test cell line (TransferCount, Fig 6).

A model trained on a cell line different from the test cell line is significantly better than a model trained on distance alone, but is often worse than a model trained on the same cell line (Fig 6). For example, for chr17, the same cell line CV model has the best performance (blue line, Fig 6A) compared to all versions of cross-cell line predictions. Here too we observe that the MULTI-CELL features (Fig 6A, red line) are better than or at least as good as the WINDOW features (Fig 6A, cyan line). Compared to transferring counts (TransferCount, green line, Fig 6A), both MULTI-CELL and WINDOW have significant benefits at long distances (the green line is usually below the red and cyan lines beyond ∼250kb). The summarized AUC offers a concise summary of this behavior (Fig 6B), with models using MULTI-CELL being at least as good as TransferCount models in the majority of training-test cell line combinations. Overall, the Gm12878 cell line was the hardest to predict using a model from other cell lines.

When comparing the predictions in other chromosomes (Supplementary Fig 17, 18), we observe a similar behavior. The MULTI-CELL features were at least as good as or better than transferring counts or WINDOW features, and the Gm12878 cell line remained the hardest cell line to predict. Interestingly, the extent to which a cell line can be predicted from a model trained on a different cell line depends greatly on the test cell line. In particular, on Gm12878, which has the highest sequencing depth, none of the models were able to match the model trained and tested on Gm12878 (Fig 6A, fourth row, Fig 6B, fourth column). In contrast, for HUVEC, the K562 model is able to predict interactions nearly as well as the CV model, especially when using the MULTI-CELL features. Similarly, for HMEC, the HUVEC-trained model was able to recapitulate the performance of the HMEC CV-trained model to a great extent (Fig 6B). Previously, we have shown that an ensemble model that combines predictions from multiple models provides robust performance in a new cell line [14]. Therefore, we combined the predictions from the models trained on each of the cell lines by taking the average of the predictions (Fig 6, ENSEMBLE). We find that the ensemble predictions are at least as good as the predictions from the individual models (Fig 6A, orange line) and that the ensemble for the MULTI-CELL features is better than that for the WINDOW features (Fig 6B). In the K562, Huvec and Nhek cell lines, we found that ensemble approaches performed comparably with the cross-validation performance and better than the best performing cross-cell line prediction.
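The ensemble and the TransferCount baseline are both simple to state in code. The sketch below assumes that predictions and counts are stored as dictionaries keyed by region pair; restricting the ensemble to pairs predicted by every model and using 0 for pairs missing from the training cell line are assumptions made here for illustration.

```python
def ensemble_average(predictions_by_model):
    """Average the predicted count of each pair over all models.

    predictions_by_model maps a model name (e.g. the training cell line) to a
    dict of {region_pair: predicted_count}. Only pairs predicted by every
    model are returned.
    """
    models = list(predictions_by_model.values())
    if not models:
        return {}
    shared = set(models[0])
    for preds in models[1:]:
        shared &= set(preds)
    return {pair: sum(preds[pair] for preds in models) / len(models)
            for pair in shared}

def transfer_count(train_counts, test_pairs, default=0.0):
    """TransferCount baseline: reuse the training cell line's measured count."""
    return {pair: train_counts.get(pair, default) for pair in test_pairs}
```

For instance, if models trained on two cell lines predict counts of 10 and 6 for the same pair, the ensemble prediction for that pair is 8.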

As additional validation of our predictions, we tested the significant interactions from these predicted counts for enrichment of ChIA-PET interactions and CTCF bidirectional loops, and we assessed TAD recovery (Fig 7). We again applied Fit-Hi-C to the true and predicted counts and found that the predicted interactions in each of the chromosomes tested were significantly enriched for bi-directional motifs and ChIA-PET interactions at levels comparable to those from true counts. Furthermore, there was good agreement with the TADs identified in each of these chromosomes based on the Jaccard coefficient score.

Overall, these results suggest that HiC-Reg can be used to predict interactions in new cell lines and that it performs better than baseline approaches based on distance alone or simply transferring counts. Performance depends on the training cell line, and an ensemble approach offers a robust way to combine predictions from multiple training models.

HiC-Reg can recover interactions in manually curated high confidence interaction pairs

So far our results show that HiC-Reg can predict genome-wide Hi-C counts and that these counts can be used for identifying significantly interacting pairs as well as higher-order organizational units. To gain deeper insight into the features that drive interactions between two specific loci, we next focused on well-characterized long-range interactions. In particular, we examined two loci, one associated with the hemoglobin gene, HBA1 (Fig 8), and the other associated with the pregnancy-associated plasma protein A (PAPPA) gene implicated in age-related breast cancer susceptibility.


Distal regulation of the HBA1 gene by a regulatory element 33-48kb away [27] has been experimentally characterized using low-throughput [27] and high-throughput methods such as 5C [28]. We applied Fit-Hi-C to our predicted counts to obtain significant interactions associated with the 5kb bins containing the HBA1 promoter. We found 14 significantly interacting pairs associated with the HBA1 gene, the majority of which came from the K562 cell line (CV), which is consistent with this gene being specific to erythroid cells [27]. Among the 14 significantly interacting pairs, 2 overlapped with 5C datasets (Fig 8B, D, green lines) at 35kb and 70kb. Visualization of the regulatory signals spanning the HBA1 gene and its interacting regions in the WashU genome browser [29] showed chromatin marks including H3K9me3, H4K20me1, H3K36me3, H3K4me1, H3K4me2 and H3K27ac as important features, with CTCF and DNase I as additional contributors to these interactions (Fig 8B). Examination of individual and pairwise features based on their usage count in the significant interactions showed that the top features for making these predictions came from the Window region of K562 or HMEC (Supplementary Fig 19A, B, Fig 8C), and included features such as TBP, CTCF and DNase I as well as chromatin marks such as H3K4me1 and H3K27ac.

276 We next investigated the PAPPA gene locus, which is implicated in the development of mammary

277 glands and is a gene of interest in breast cancer studies [30,31]. The rat ortholog of the PAPPA gene was

278 shown to be regulated by an 8.5kb genomic region called the temporal control element (TCE) in rat mammary

279 epithelial cells [31]. The TCE resides within the MCS5C genomic locus associated with breast

280 cancer susceptibility and is conserved in human and mouse [31]. We examined significant interactions

281 associated with the human PAPPA TSS and found the largest number of significant interactions in the

282 HMEC cell line or in cross cell line predictions using the HMEC model (Supplementary Fig 20D). The

283 HMEC cell line is a primary mammary epithelial cell line, which indicates that these interactions are

284 relevant to the breast tissue. We focus on the significant interactions identified in HMEC when trained in

285 CV mode as these are likely the highest quality. We found a total of 7 significant interactions of which

286 2 overlap the TCE element. Visualization of the signals (Supplementary Fig 20B) and feature analy-

287 sis on the significant interactions indicated that CTCF, TBP and H3K4me1 (Supplementary Fig 20C)

288 measured in the HMEC cell line are important for this interaction. In summary, HiC-Reg predictions

289 provide computational support for the long-range regulation of the PAPPA gene in a relevant human cell

290 line, which was originally studied in the rat mammary cells.

291 Taken together, our fine-grained analysis of two loci known to be involved in long-range regulatory

292 interactions provided further support of our predictions, highlighted potentially important features that

293 facilitate these interactions, and served as case studies of how HiC-Reg could be used to characterize a

294 particular locus of interest. In both cases we found additional loci that are predicted to interact with

295 these genes, which can be followed up with future experiments.

296 Discussion

297 The three-dimensional organization of the genome can affect the transcriptional status of a single gene

298 locus as well as larger chromosomal domains, which can both have significant downstream consequences

299 on complex phenotypes. Although high-throughput chromosome conformation capture assays

300 are rapidly evolving, measuring cell line-specific interactions on a genome-wide scale and at high resolution

301 remains a significant challenge. In this work, we described a novel computational approach, HiC-Reg,

302 that can predict the contact count of two genomic regions from their one-dimensional regulatory signals,

303 which are available for a large number of cell lines and experimentally more tractable to generate than

304 Hi-C datasets. Because HiC-Reg directly predicts counts, instead of classifying interactions from non-

305 interactions as has been commonly done [14, 15, 19], the output from HiC-Reg can be used to identify

306 significant interactions using peak-calling algorithms (e.g., Fit-Hi-C [22]) as well as to examine larger-scale

307 organizational properties using domain-finding algorithms (e.g., the Directionality Index method [32]).

308 We studied several technical issues within the HiC-Reg regression framework relevant to model de-

309 sign, feature representation and important datasets for predicting counts. A key challenge we tried to

310 address using HiC-Reg was to generate high resolution interaction counts in a new chromosome or cell

311 line of interest. The former is relevant for predicting interactions among regions that might not have

312 been experimentally assayed. The latter is useful because some cell types and developmental stages may

313 not be amenable to large scale high-throughput 3C experiments and computational predictions could pri-

314 oritize specific regions for targeted experimental studies. Our cross-chromosome experiments showed

315 that the performance is decreased when training on one chromosome and testing on another. The fea-

316 tures identified as important across different chromosomes are very similar (Supplementary Fig 7, 8),

317 suggesting that the overall properties governing chromosomal contact are similar across chromosomes;

318 however, there may be fine-grained differences that are not being captured by the Random Forests regression

319 model. Incorporation of additional measurements from transcription factor binding could be

320 beneficial for capturing these differences. Our cross-cell line prediction shows that a regressor trained

321 on one cell line can predict interactions in other cell lines. However, the performance can vary from

322 one training cell line to another, making the choice of the training cell line non-trivial. Ensemble ap-

323 proaches that aggregate predictions from multiple predictive models are a natural way to address this

324 problem. While the simple ensemble predictor was better than or comparable to the best performing cell line,

325 more systematic approaches to combine shared information across different cell lines such as considering

326 different weighting strategies will be an important direction of future work.

327 A second issue we considered was determining the most important genomic datasets for predicting

328 contact counts. Our predictive framework enabled us to study known players of long-range gene regula-

329 tion such as CTCF and cohesin [33], together with additional components of the transcription machinery

330 such as chromatin marks and general transcription factors, several of which have not been thoroughly

331 characterized in the context of long-range interactions. We examined the importance of these features

332 globally for all pairs, for sets of pairs, and for individual pairs. Our analysis showed that in addition to

333 CTCF, elongation and repressive marks can also be important for predicting counts. The importance of

334 elongation marks such as H3K36me3 in predicting contact count is consistent with existing work [9,34],

335 which showed that H3K36me3 and elongation-related signals are enriched in regions participating in

336 long-range interactions [9] and higher-order genome organization [34]. The identification of repressive

337 marks such as H3K27me3 could be because of their association with large-scale transcriptional units such

338 as compartments [11], TADs [12] and intra-TAD loops [35], and the specific pairs associated with these

339 repressive marks could be specifically transcriptionally silenced or be in a poised state [11].

340 We evaluated the predictions from HiC-Reg using different validation metrics, both globally using complementary

341 assays and at specific loci that have been studied in the literature through high-quality,

342 albeit low-throughput, experiments. This was important because predictive performance based on

343 the ability to predict contact counts may not necessarily reveal biological insights. In particular, we

344 compared the statistically significant interactions identified from HiC-Reg with those measured experimentally

345 using ChIA-PET and tested whether they were enriched for CTCF bidirectional loops. Using both

346 metrics, we showed that HiC-Reg predictions can be used to identify significant interactions that have

347 as good experimental support as those from true counts. We also assessed the ability of HiC-Reg pre-

348 dictions to recover higher order units of organization such as TADs and found good agreement between

349 TADs identified using our predicted counts compared to true counts. We demonstrated the utility of

350 HiC-Reg in studying the long-range regulatory landscape of two loci including the well-studied HBA1

351 locus as well as a relatively less studied locus, PAPPA. Interestingly, our analysis of the PAPPA locus

352 provided support of long-range regulation of PAPPA, originally identified in rat, in a relevant human cell

353 line.

354 HiC-Reg exploits the widely available chromatin mark signals that are experimentally easier to measure

355 compared to the Hi-C experiment. HiC-Reg, however, relies on the availability of these marks in

356 new contexts, which may not always be available. Several groups have started to explore imputation

357 strategies of chromatin marks [36,37]. Another important direction of future work would be to examine

358 how HiC-Reg performs with imputed marks as this would greatly increase the impact of a predictive

359 modeling framework such as HiC-Reg.

360 In summary, we have developed a regression-based framework to predict interactions between pairs

361 of regions across multiple cell lines by integrating published Hi-C datasets with one-dimensional regu-

362 latory genomic datasets. The predictions from HiC-Reg can be used to identify significant interactions

363 as well as examine higher-order organizational units. As additional chromatin mark signals and Hi-C

364 data become available, our method can take advantage of these datasets to learn better predictive models.

365 This can be helpful to systematically link genes to enhancers as well as to interpret regulatory variants

366 across diverse cell types and diseases.

367 Material and Methods

368 Learning a cell line-specific regression model for predicting Hi-C interaction

369 matrices

370 HiC-Reg is based on a regression model to predict contact counts measured in a Hi-C experiment using

371 features derived from various regulatory genomic data sets (e.g. ChIP-seq data sets for histone modifi-

372 cations, transcription factor occupancies, Fig 1). HiC-Reg uses Random Forests as its main predictive

373 algorithm. Random Forests are powerful tree ensemble learning approaches that have been shown to

374 have very good generalization performance [38], and have been applied to a variety of predictive problems

375 in gene regulation [39–41]. We trained Random Forests on the different feature encodings described above

376 and on Hi-C SQRTVC normalized contact counts downloaded from Rao et al. [21]. We

377 experimented with different numbers of trees (Supplementary Fig 1D) and found that beyond 20 trees

378 there was no improvement in performance. Hence we performed all subsequent experiments with 20

379 trees. We compared the Random Forests regression model to a linear regression model using the WIN-

380 DOW feature encodings of a pair, which includes the signals associated with each region in the pair as

381 well as the average signal between the two regions, and the Distance between two regions. We found

382 that the non-linear regression model based on Random Forests performs significantly better than a linear

383 regression approach (Supplementary Fig 1). We next describe the generation of training and test sets and the

384 different feature representations of a pair of regions.

385 Generation of training and test sets

386 To generate training and test datasets for our regression models, we first binned each chromosome into

387 5kb non-overlapping regions. We randomized the regions and split them into five sets. Each time, we

388 select one of the five sets as the test set of regions and the remaining four as training, repeating this for

389 all five sets of regions. Within each training or test set of regions we generate all pairs of interactions

390 that are within a 1MB radius. Because the Hi-C matrix is symmetric, we need to only predict the upper

391 triangle of the matrix and hence our pairs are not redundant. In each pair, the region with the smaller

392 coordinate is referred to as the “R1” region and the region with the larger coordinate as the “R2” region.

393 We conducted three different types of experiments to evaluate the performance of HiC-Reg: (i) same cell

394 line, same chromosome cross-validation (CV), (ii) same cell line cross chromosome comparison, and,

395 (iii) different cell line same chromosome comparison.
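The region-level split described above can be illustrated with a minimal Python sketch. This is not the HiC-Reg implementation: the function names (split_regions, pairs_within, cv_splits), the fixed random seed, and the representation of regions as integer bin indices are assumptions made for illustration only.

```python
import random
from itertools import combinations

BIN_SIZE = 5_000        # 5kb bins, as in the text
MAX_DIST = 1_000_000    # 1MB radius, as in the text


def split_regions(n_bins, n_folds=5, seed=0):
    """Shuffle bin indices and deal them into n_folds disjoint sets of regions."""
    bins = list(range(n_bins))
    random.Random(seed).shuffle(bins)
    return [sorted(bins[k::n_folds]) for k in range(n_folds)]


def pairs_within(regions, bin_size=BIN_SIZE, max_dist=MAX_DIST):
    """All (R1, R2) pairs with R1 < R2 whose genomic distance is at most max_dist.

    Only the upper triangle is produced, so no pair is listed twice.
    """
    regions = sorted(regions)
    return [(r1, r2) for r1, r2 in combinations(regions, 2)
            if (r2 - r1) * bin_size <= max_dist]


def cv_splits(n_bins, n_folds=5, seed=0):
    """Yield (train_pairs, test_pairs) for each fold of regions."""
    folds = split_regions(n_bins, n_folds, seed)
    for k, test_regions in enumerate(folds):
        train_regions = [r for j, f in enumerate(folds) if j != k for r in f]
        yield pairs_within(train_regions), pairs_within(test_regions)
```

Because pairs are formed only within the training regions or only within the test regions, no test pair in this sketch shares a region with a training pair. The quadratic enumeration in pairs_within is kept for readability; a real chromosome would call for a windowed loop over neighbouring bins.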

396 For the same cell line same chromosome setting, HiC-Reg was trained and tested using five-fold

397 cross-validation using training and test pairs generated as described above. In each fold, we trained

398 random forests regression models on 4 folds, and predicted contact counts for the left-out fold. We concatenated

399 predictions from the five folds and assessed performance using distance-stratified Pearson’s correlation

400 of true and predicted counts. Because each training/test example is a pair of regions, we need

401 to consider two types of examples: those that share a region with the training data (easy examples) and

402 those that do not share a region with the training data (hard examples). Our same cell line and chromo-

403 some cross-validation results are generated using hard pairs only. The cross-validation experiments were

404 done in all autosomal chromosomes.

405 For the same cell line cross chromosome setting, we used the five random forests regression models

406 trained on each fold from the training chromosome to predict contact counts for all pairs in a test chromo-

407 some. Each pair in the test chromosome had five predictions and we took the average of these predictions

408 as the final predicted count. We note that in this setting, all test pairs are hard. Cross-chromosome ex-

409 periments were done on five chromosomes, 9, 11, 14, 17 and 19.

410 For the different cell line, same chromosome setting, we again used the random forests regression models

411 trained on the training data from each fold in one cell line and generated predictions for all pairs in the

412 test cell line. Next, we took the average of these predictions as the final predicted count. When using

413 MULTI-CELL features, we excluded the features derived from the regulatory signals measured in the

414 test cell line. In this setting as well, all test pairs are hard pairs. Our ensemble approach took a further

415 average of the predictions from all four training cell lines as the predictions in a test cell line. Cross-cell

416 line experiments were done in three chromosomes: 14, 17 and 19.

417 Feature extraction and representation

418 To extract features for HiC-Reg’s regression framework, we used data sets from the ENCODE project

419 [42] for five cell lines: K562, Gm12878, Huvec, Nhek and Hmec, learning a separate model for each cell

420 line. We selected 14 data sets that were measured in all five cell lines. These 14 data sets included ChIP-

421 seq datasets for 10 histone marks and CTCF, DNase I-seq and DNase I-seq derived motifs of RAD21

422 and TBP, which we had previously found to be helpful for predicting enhancer-promoter interactions

423 in a classification setting [14]. A ChIP-seq signal is represented as the average read count aggregated

424 into a 5kb non-overlapping bin. We obtained the raw fastq files from the ENCODE consortium [42],

425 aligned reads to the human hg19 assembly using bowtie2 [43], retrieved reads aligned to a locus using

426 SAMtools [44] and applied BEDTools [45] to obtain a base pair level read count. We next aggregated

427 the read counts of each base pair in a 5kb region. Next, we normalized the aggregated signal by sequencing

428 depth and collapsed replicates by taking the median. As TBP and RAD21 ChIP-seq data are not available

429 in Huvec, Nhek and Hmec cell lines, we predicted the binding sites using PIQ [46] on the DNase I data

430 and used the sum of purity scores for all motifs mapped to the same 5kb bin as the signal value.

431 We represented the features of a region as a 14-dimensional feature vector, with each dimension corresponding

432 to one of the 14 genome-wide data sets. To generate a feature vector for a pair of regions, we used

433 different strategies: PAIR-CONCAT, WINDOW and MULTI-CELL. In the PAIR-CONCAT case, we

434 concatenated the 14-dimensional feature vectors of the two regions to obtain a feature vector of size 28.

435 In the WINDOW case, we concatenated the 14-dimensional feature vectors of the two regions together

436 with the feature vectors of the intervening region between the two regions to obtain a feature vector of

437 size 42. We call this the WINDOW feature following Whalen et al [15]. The feature associated with

438 the intervening region is a mean signal value of the feature in the region. In the MULTI-CELL case,

439 we concatenated the 42-dimensional feature vectors of the two regions from all five cell lines to obtain

440 a merged feature vector of size 210. Finally, for all these feature representations, we included genomic

441 distance between the two regions of a pair as an additional feature.
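The three encodings can be made concrete with a short Python sketch. It is illustrative only: the function names, the input layout (a list of per-bin signal lists, one value per dataset), the choice to average over the bins strictly between R1 and R2, and the use of zeros when two bins are adjacent are all assumptions rather than details taken from the HiC-Reg code.

```python
def window_mean(signals, r1, r2):
    """Mean signal per dataset over the bins strictly between r1 and r2."""
    n_feat = len(signals[0])
    between = signals[r1 + 1:r2]
    if not between:
        # Assumption: adjacent bins have no intervening region, so use zeros.
        return [0.0] * n_feat
    return [sum(row[f] for row in between) / len(between) for f in range(n_feat)]


def pair_concat(signals, r1, r2, bin_size=5_000):
    """PAIR-CONCAT: features of R1 and R2, plus genomic distance."""
    return signals[r1] + signals[r2] + [(r2 - r1) * bin_size]


def window(signals, r1, r2, bin_size=5_000):
    """WINDOW: features of R1, R2 and the intervening region, plus distance."""
    return (signals[r1] + signals[r2] + window_mean(signals, r1, r2)
            + [(r2 - r1) * bin_size])


def multi_cell(signals_by_cell, r1, r2, bin_size=5_000):
    """MULTI-CELL: WINDOW features from every cell line, plus distance."""
    feats = []
    for cell in sorted(signals_by_cell):
        s = signals_by_cell[cell]
        feats += s[r1] + s[r2] + window_mean(s, r1, r2)
    return feats + [(r2 - r1) * bin_size]
```

With 14 datasets per bin and five cell lines, these functions return vectors of length 29, 43 and 211, that is, the 28, 42 and 210 signal features stated in the text plus the distance feature.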

442 Application of Fit-Hi-C to call significant interactions.

443 The output of HiC-Reg can be analyzed using a peak calling method such as Fit-Hi-C [22]. Fit-Hi-

444 C uses spline models to estimate expected contact probability at a given distance. The input to Fit-

445 Hi-C is a raw count interaction file and an optional bias file calculated by the ICE method [47]. Fit-

446 Hi-C estimates the statistical significance of interactions using a binomial distribution and corrects for

447 multiple testing using the Benjamini-Hochberg method and outputs the P-value and corrected Q-value

448 for each pair of interactions. We adopted the two-phase spline fitting procedure and used an FDR < 0.05

449 to define significant pairs. As our predictions are based on normalized contact counts, we directly used

450 our predicted counts as input for Fit-Hi-C without a bias file. For each cell line, we first concatenated

451 predictions across all chromosomes and conducted Fit-Hi-C analysis on these pairs. We applied the

452 same procedure on true counts to find significant pairs.
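Fit-Hi-C itself is not reproduced here, but the multiple testing step it relies on is simple to sketch. The Python function below is an illustrative implementation of the Benjamini-Hochberg adjustment (the function name and list-based interface are assumptions), showing how P-values are turned into Q-values that can then be thresholded at 0.05.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def significant_pairs(pairs, pvalues, fdr=0.05):
    """Keep the pairs whose q-value is below the chosen FDR threshold."""
    qvalues = benjamini_hochberg(pvalues)
    return [p for p, q in zip(pairs, qvalues) if q < fdr]
```

In practice the P-values would come from Fit-Hi-C's binomial test against the spline-fitted distance expectation, which this sketch does not attempt to model.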

453 Evaluation metrics

454 We used different metrics for assessing the quality of our predictions for contact counts. The distance-

455 stratified Pearson’s correlation was used to directly measure the accuracy of the predicted counts. The

456 other metrics were useful to assess the quality of results after further downstream analysis of HiC-

457 Reg predictions and compare them to similar results obtained from actual measured data. In particular,

458 enrichment of CTCF bi-directional motifs and ChIA-PET datasets enabled us to study the quality of

459 significant interactions identified from HiC-Reg predictions, while the TAD similarity enabled us to

460 study the ability of HiC-Reg predictions to capture structural units of organization.

461 Distance stratified Pearson’s correlation. To assess the quality of count prediction of HiC-Reg,

462 we used Pearson’s correlation of predicted contact counts and true contact counts as a function of ge-

463 nomic distance. We grouped pairs of regions based on their genomic distance and calculated the Pear-

464 son’s correlation of predicted and true contact counts for pairs that fall into each distance bin. We

465 considered all pairs up to a distance of 1MB in distance bins of 5kb. To easily compare the performance

466 between different methods, chromosomes and cell lines, we summarized the distance-stratified Pearson’s

467 correlation curve into the Area Under the Curve (AUC). AUC was computed using the Trapezoidal rule

468 and divided by the number of distance bins, so that it ranged from 0 to 1. The higher the AUC, the better the

469 performance.
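A minimal Python sketch of this metric is given below. The function names and the input layout (parallel lists of bin-index pairs, true counts and predicted counts) are assumptions. The normalization also reflects an assumption: the trapezoidal area is divided by the number of intervals between distance bins, so that a curve that is 1 everywhere yields an AUC of exactly 1.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson's correlation of two equal-length sequences (NaN if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def distance_stratified_pearson(pairs, true, pred,
                                bin_size=5_000, max_dist=1_000_000):
    """Map each genomic distance to the correlation of true and predicted counts."""
    groups = defaultdict(lambda: ([], []))
    for (r1, r2), t, p in zip(pairs, true, pred):
        d = (r2 - r1) * bin_size
        if d <= max_dist:
            groups[d][0].append(t)
            groups[d][1].append(p)
    return {d: pearson(*groups[d])
            for d in sorted(groups) if len(groups[d][0]) >= 2}


def auc(curve):
    """Trapezoidal area under the correlation curve with unit bin spacing,
    divided by the number of intervals (an assumed normalization)."""
    vals = [curve[d] for d in sorted(curve) if not math.isnan(curve[d])]
    if len(vals) < 2:
        return float("nan")
    area = sum((a + b) / 2 for a, b in zip(vals, vals[1:]))
    return area / (len(vals) - 1)
```

Distances at which the correlation is undefined, for example because all true counts are identical, are skipped when computing the AUC in this sketch.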

470 Enrichment of bidirectional CTCF motifs on significant interactions. Cell line-specific DNase

471 I-seq fastq files were retrieved from ENCODE (http://hgdownload.soe.ucsc.edu/goldenPath/

472 hg19/encodeDCC/wgEncodeOpenChromDnase/) and further processed with PIQ [46] to obtain

473 the coordinates and orientation of CTCF motifs. PIQ gives a score from 0.5-1, which is proportional

474 to the true occurrence of the motif. We selected a threshold of 0.9 to identify high confidence CTCF

475 motifs. We use R1 to denote the region with the smaller starting coordinate, and R2 to denote the re-

476 gion with the larger starting coordinate. Following Rao et al. [21], an interaction is labeled as having a

477 convergent CTCF orientation if R1 contains CTCF motifs on the forward strand (“+” orientation) and

478 R2 contains CTCF motifs on the reverse strand (“-” orientation). We only focus on pairs with at least

479 one CTCF motif mapped to R1 and at least one CTCF motif mapped to R2. A pair can have one of the

480 four configurations: (i) “++” configuration where both R1 and R2 have the motifs in the “+” orientation,

481 (ii) “+-” configuration where R1 has CTCF motifs in the “+” orientation and R2 has CTCF motifs in the “-”

482 orientation, (iii) “-+” configuration, where R1 has motifs in the “-” orientation and R2 in the “+” orientation, and (iv) “--” configuration, where

483 both R1 and R2 have motifs in the “-” orientation. We counted the total number of pairs with each of

484 these configurations and compared this with the number of significant pairs called by Fit-Hi-C using the

485 Hypergeometric test and fold enrichment. Briefly, assume we are testing the enrichment for the “+-”

486 configuration. Let the total number of possible pairs in the background be S. Let k be the total number

487 of significant interactions with any of the configurations of CTCF motifs, let m be the total number of

488 interactions with the “+-” configuration of CTCF motifs, and let q be the number of significant interactions

489 with the “+-” configuration of CTCF motifs. Using the Hypergeometric test, we compute the probability of observing

490 q or more of the k interactions with the “+-” configuration, given that there are m out

491 of S total interactions that have the “+-” configuration. Fold enrichment is computed as $\frac{q/k}{m/S}$ and must

492 be greater than 1 to be considered as significant enrichment over background.
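As a worked illustration, the Python sketch below computes the upper-tail hypergeometric probability and the fold enrichment, the ratio of q/k to m/S, using exact integer binomial coefficients. The function name and argument order are assumptions; S, m, k and q follow the definitions in the text.

```python
from math import comb


def hypergeom_enrichment(S, m, k, q):
    """Return (p-value, fold enrichment).

    p-value: probability of drawing q or more pairs with the configuration
    among k significant pairs, when m of the S background pairs have it.
    fold enrichment: (q / k) / (m / S).
    """
    upper = min(k, m)
    total = comb(S, k)
    tail = sum(comb(m, x) * comb(S - m, k - x) for x in range(q, upper + 1))
    pval = tail / total
    fold = (q / k) / (m / S)
    return pval, fold
```

For example, with S = 10, m = 4, k = 3 and q = 2, the tail probability is (36 + 4)/120 = 1/3 and the fold enrichment is (2/3)/(4/10), about 1.67.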

493 Validation of genome-wide predictions using ChIA-PET experiments. We downloaded 10

494 ChIA-PET data sets from Li et al. and Heidari et al.: PolII in HeLa and K562 and CTCF in the K562

495 cell line [23], and the remaining seven ChIA-PET data sets from Heidari et al. [24], which included RNA

496 PolII, CTCF, RAD21 and multiple chromatin marks in K562 and Gm12878 cell lines. Our metric for

497 evaluating these genome-wide maps is fold enrichment, which assesses the fraction of significant inter-

498 actions identified from HiC-Reg that overlapped with experimentally detected measurements, compared

499 to the fraction of interactions expected by random chance. We mapped ChIA-PET interactions onto the

500 pairs of regions used in HiC-Reg by requiring one region of an interaction from the ChIA-PET dataset

501 to overlap with one region of a HiC-Reg pair (e.g., R1), and the other ChIA-PET region to map to the

502 second region (e.g., R2). Fold enrichment is defined as $\frac{n1/n2}{m1/m2}$, where n1 is the number of significant

503 interactions from HiC-Reg that overlap with an interaction in the ChIA-PET data set, n2 is the total

504 number of HiC-Reg significant interactions, m1 is the total number of interactions in the ChIA-PET

505 dataset that can be mapped to any of the HiC-Reg pairs, and m2 is the total number of possible pairs in

506 the universe. The observed overlap fraction of interactions is n1/n2 and the expected overlap fraction of

507 interactions is m1/m2. A fold enrichment greater than 1 is needed in order to be considered significant.
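The overlap-based fold enrichment can be sketched in Python as follows. The function names are hypothetical, and mapping each ChIA-PET anchor to a single 5kb bin through its midpoint is a simplifying assumption; the text only requires that the anchors overlap the two regions of a pair.

```python
def to_bin_pair(anchor1, anchor2, bin_size=5_000):
    """Map two (start, end) anchors to an (R1, R2) bin pair via their midpoints."""
    b1 = (anchor1[0] + anchor1[1]) // 2 // bin_size
    b2 = (anchor2[0] + anchor2[1]) // 2 // bin_size
    return (min(b1, b2), max(b1, b2))


def chiapet_fold_enrichment(significant, chiapet, universe):
    """Compute (n1/n2) / (m1/m2) from sets of (R1, R2) bin pairs.

    significant: pairs called significant from the predicted counts
    chiapet:     ChIA-PET interactions mapped to bin pairs
    universe:    all possible pairs considered
    """
    mapped = chiapet & universe          # ChIA-PET pairs that map to a pair
    n1 = len(significant & mapped)
    n2 = len(significant)
    m1 = len(mapped)
    m2 = len(universe)
    if n2 == 0 or m1 == 0:
        return float("nan")              # enrichment is undefined
    return (n1 / n2) / (m1 / m2)
```

Using Python sets keeps the overlap counts explicit: n1 is the size of an intersection, and the observed and expected fractions are simple ratios of set sizes.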

508 Identification of TADs and TADs similarity. To identify topologically associated domains (TADs),

509 we applied the Directionality Index (DI) method described in Dixon et al [32]. The method is based on

510 Hidden Markov Model (HMM) segmentation of the DI. The DI is a score for a genomic region to mea-

511 sure the bias in the directionality of interactions for that region as measured in a Hi-C dataset. It is

512 determined by the difference in the number of reads between the region and a genomic window (e.g.,

513 2MB) upstream of the region and the number of reads between the region and a window downstream

514 of the region. The window is user-defined. The DI score is segmented into three states of upstream,

515 downstream or no bias. A TAD is then defined by a contiguous stretch of downstream biased states.
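To illustrate the score that the HMM segments, the following Python sketch computes a Directionality Index for one bin of a symmetric contact matrix. It follows the DI formula of Dixon et al. [32] as we understand it, with A the upstream counts, B the downstream counts and E = (A + B)/2; the function name and the list-of-lists matrix layout are assumptions, and the HMM segmentation step is not shown.

```python
def directionality_index(matrix, i, window_bins):
    """Directionality Index of bin i for a symmetric contact matrix.

    A = contacts with the window_bins bins upstream of i
    B = contacts with the window_bins bins downstream of i
    E = (A + B) / 2
    DI = sign(B - A) * ((A - E)^2 / E + (B - E)^2 / E)
    """
    n = len(matrix)
    a = sum(matrix[i][j] for j in range(max(0, i - window_bins), i))
    b = sum(matrix[i][j] for j in range(i + 1, min(n, i + window_bins + 1)))
    e = (a + b) / 2
    if e == 0 or a == b:
        return 0.0                       # no contacts or no bias
    sign = 1.0 if b > a else -1.0
    return sign * ((a - e) ** 2 / e + (b - e) ** 2 / e)
```

At 5kb resolution, a 2MB window corresponds to window_bins = 400. Positive values indicate a downstream bias and negative values an upstream bias.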

516 We transformed our upper triangle prediction matrix, at a resolution of 5kb, into a symmetric interaction

517 matrix and used it together with its genomic coordinates as the input. We used the default parameters of the package with

518 a window size of 2Mb for defining the Directionality Index. For comparison, we applied the same pro-

519 cedure on SQRTVC normalized matrices (i.e. our true counts matrices) to identify TADs. We compared

520 the similarity of TADs identified from the HiC-Reg predicted counts and true counts for each chromo-

521 some. Specifically, we matched a TAD found in the true count data to a TAD found in the predicted

522 counts based on the highest Jaccard coefficient. The Jaccard coefficient measures the overlap between

523 two sets, A and B, and is defined as the ratio of the size of the intersection to the size of the union, $\frac{|A \cap B|}{|A \cup B|}$. The

524 Jaccard coefficient ranges from 0 to 1, with 1 denoting complete overlap. The Jaccard coefficients for

525 each match were averaged across all TADs from the true counts. We repeated the matching procedure

526 for each TAD from the predicted counts to a TAD in the true count and averaged the Jaccard coefficient

527 across all TADs from the predicted counts. The overall similarity between TADs was then the average

528 of these two averages.
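The symmetric matching procedure can be sketched in Python as below. Treating each TAD as a half-open interval of bin indices (start, end) is an assumption made for illustration, as are the function names.

```python
def jaccard(a, b):
    """Jaccard coefficient of two half-open bin intervals (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def best_match_mean(tads_a, tads_b):
    """Average, over TADs in tads_a, of the best Jaccard match in tads_b."""
    return sum(max(jaccard(a, b) for b in tads_b) for a in tads_a) / len(tads_a)


def tad_similarity(true_tads, pred_tads):
    """Average of the true-to-predicted and predicted-to-true mean Jaccard."""
    return (best_match_mean(true_tads, pred_tads)
            + best_match_mean(pred_tads, true_tads)) / 2
```

Averaging both directions penalizes a prediction that splits one true TAD into several smaller ones, as well as one that merges several true TADs into a single domain.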

529 Feature Analysis

530 Single feature importance. To assess the importance of individual features we used Out of Bag

531 Variable Importance measure [38], which computes importance of a feature based on the change in error

532 on out of bag example pairs when the feature values are permuted. We used MATLAB’s implementation

533 of this measure. In addition we counted the number of times a particular feature was used to predict the

534 count for an example pair when it is part of the test set. Briefly, for each test example i and each tree t

535 in the ensemble, let $n_i^t(f)$ denote the number of times feature f is used on the path from the root to the

536 leaf in tree t for example i. The overall importance of feature f is $\sum_{t \in T} \sum_{i \in D} n_i^t(f)$, where T stands

537 for the ensemble of regression trees and D is the dataset of examples. We computed these counts on

538 all test example pairs as well as examples with the top 5% lowest errors. The feature importances were

539 very similar (Supplementary Fig 7, 8). A key difference between MATLAB’s OOB importance and our

540 feature usage count importance is that the OOB is computed on left out examples from the training set,

541 which includes the “easy” examples. The feature usage count is computed only on the hard pairs, which

542 do not share a region with the training pairs.

543 Pairwise feature importance. To assess the importance of pairs of features, we considered all

544 pairs of features that occur on the tree path from root to a leaf for a test example (similar to above). This

545 approach is similar to the Foresight method [48], which counts the number of times a pair of features

546 co-occur on the path from the root to the leaf. The main difference between our approach and that

547 of Foresight is that we estimate these counts on the test examples, while Foresight estimates these on

548 the examples in each leaf node in the tree identified during training. By using the counts on the test

549 examples, our approach is less prone to overfitting. Because the feature rankings of individual features

550 are similar when using all pairs and pairs with the top (smallest) 5% errors, we computed these counts

551 only on the pairs with the top (smallest) 5% errors.
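Both usage-count measures reduce to walking each test example down each tree and recording the features tested along the way. The Python sketch below shows this on a toy tree representation. The nested-dictionary tree format, the feature names in the usage note and the function names are hypothetical and are not taken from the HiC-Reg or MATLAB code.

```python
from collections import Counter
from itertools import combinations


def path_features(tree, x):
    """Features tested on the root-to-leaf path that example x follows."""
    used = []
    node = tree
    while "value" not in node:           # internal nodes have no "value" key
        f = node["feature"]
        used.append(f)
        node = node["left"] if x[f] <= node["threshold"] else node["right"]
    return used


def feature_usage(forest, examples):
    """Sum over trees and examples of the per-path usage count of each feature."""
    counts = Counter()
    for tree in forest:
        for x in examples:
            counts.update(path_features(tree, x))
    return counts


def feature_pair_usage(forest, examples):
    """Number of (tree, example) paths on which two features co-occur."""
    counts = Counter()
    for tree in forest:
        for x in examples:
            feats = sorted(set(path_features(tree, x)))
            counts.update(combinations(feats, 2))
    return counts
```

A tree is written here as, for instance, {"feature": "CTCF_W", "threshold": 0.5, "left": {"value": 1.0}, "right": {"value": 2.0}}, and an example as a dictionary from feature name to value. Restricting the examples to the pairs with the smallest 5% errors only changes which examples are passed in.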

552 Non-negative matrix factorization for identifying sets of features associated with sets of

553 pairs. We developed a novel feature analysis method to identify feature sets associated with sets of

554 pairs based on non-negative matrix factorization (NMF). The input to this approach is an n × m matrix

555 X with n rows corresponding to test pairs and m columns corresponding to pairs of features, where each

556 entry Xij corresponds to the number of times feature pair j is used to make a prediction for pair i. NMF

557 decomposes the matrix X into two lower rank non-negative matrices U and V, where $X \approx UV$, U

558 is n × k, V is k × m, and k is the rank. We used k = 5 for NMF. The U and V matrices are chosen

559 to minimize the squared error $\|X - UV\|_2^2$ of the lower dimensional reconstruction. We use MATLAB’s

560 non-negative matrix factorization function nnmf, with k = 5 factors to perform this factorization, which

561 uses an iterative algorithm to estimate U and V . The U and V matrices provide a low-dimensional repre-

562 sentation of the interaction pairs and feature pairs respectively. For ease of interpretation, we normalized

563 the U matrix to make each row sum to unity (denoted as U¯) and normalized V matrix each column

564 sum to unity (denoted as V¯ ). To identify sets of features associated with sets of examples, we used

565 the columns and rows of the U¯ and V¯ matrices. Specifically, we assigned a example i to cluster c if

0 0 566 i = arg maxc0 U¯(i, c ). Similarly a feature pair, j is assigned to cluster c if c = arg maxc0 V¯ (c , j). The

567 columns of U and rows of V have one-to-one correspondence, thus providing a natural bi-clustering out-

568 put. We were able to get striking factorization results where there were groups of pairs clearly associated

569 with groups of pairs. To further interpret these feature pairs, we visualized the pairwise interactions as

570 networks in Cytoscape [49], with node size proportional to the number of feature pairs they are asso-

571 ciated with and edge weights corresponding to the strength of the association. The NMF based feature

572 analysis enabled us to extract groups of examples associated with sets of feature pairs.
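The factorization and cluster assignment can be illustrated with a small pure-Python sketch. It is written under stated assumptions and is not the authors' MATLAB pipeline: the factors are fitted with Lee-Seung multiplicative updates for the squared error, used here as a simple stand-in for whichever iterative algorithm nnmf ran, and the function names `nmf` and `assign_clusters`, the toy matrix, and the choice of two factors are all hypothetical.

```python
import random

def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def nmf(X, k, iters=500, seed=0, eps=1e-9):
    """Approximate non-negative X (n x m) by U (n x k) times V (k x m).

    Uses multiplicative updates for the squared error (an assumption made for
    this sketch, not necessarily the algorithm used in the paper).
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    V = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        Ut = transpose(U)
        num, den = matmul(Ut, X), matmul(matmul(Ut, U), V)
        V = [[V[c][j] * num[c][j] / (den[c][j] + eps) for j in range(m)]
             for c in range(k)]
        Vt = transpose(V)
        num, den = matmul(X, Vt), matmul(U, matmul(V, Vt))
        U = [[U[i][c] * num[i][c] / (den[i][c] + eps) for c in range(k)]
             for i in range(n)]
    return U, V

def assign_clusters(U, V):
    """Normalize rows of U and columns of V to sum to one, then take the arg max."""
    k, m = len(V), len(V[0])
    U_bar = [[u / (sum(row) or 1.0) for u in row] for row in U]
    col_sums = [sum(V[c][j] for c in range(k)) or 1.0 for j in range(m)]
    V_bar = [[V[c][j] / col_sums[j] for j in range(m)] for c in range(k)]
    row_cluster = [max(range(k), key=lambda c: row[c]) for row in U_bar]
    col_cluster = [max(range(k), key=lambda c: V_bar[c][j]) for j in range(m)]
    return row_cluster, col_cluster

# Toy count matrix: 4 region pairs (rows) by 4 feature pairs (columns).
X = [[4, 4, 0, 0],
     [2, 2, 0, 0],
     [0, 0, 3, 6],
     [0, 0, 1, 2]]
U, V = nmf(X, k=2)
pair_clusters, feature_pair_clusters = assign_clusters(U, V)
```

Because each row of $U$ and each column of $V$ is divided by a single positive constant, the normalization does not change which factor attains the maximum; it only places the memberships on a common 0 to 1 scale for display.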


Implementation and availability

The HiC-Reg code and associated MATLAB and R scripts to compute various validation metrics are available at https://github.com/Roy-lab/HiC-Reg/tree/master/Scripts.

Funding

This work is supported by the National Institutes of Health (NIH), BD2K grant U54 AI117924 and NIH R01GM11733.

Acknowledgements

We thank the Center for High Throughput Computing at UW Madison for computational resources and Alireza Fotuhi Siahpirani for assistance with the CTCF motif and TAD analysis.


[Figure 1 schematic. Datasets available in all 5 cell lines: histone modifications (H3k27ac, H3k27me3, H3k36me3, H3k4me1, H3k4me2, H3k4me3, H3k79me2, H3k9ac, H4k20me1, H3k9me3), TF binding (Tbp, Ctcf, Rad21), and DNase I. Region-level features (5kb) are combined into three types of pair features: PAIR-CONCAT (Region1, Region2, Distance), WINDOW (Region1, Window, Region2, Distance), and MULTI-CELL (Cell line 1 window, ..., Cell line m window, Distance). The pair features are passed to Random Forest regression (HiC-Reg), which outputs a count for the pair Region1, Region2 (e.g., Count: 4).]


Fig 1. Overview of the HiC-Reg framework. HiC-Reg makes use of 14 datasets: 9 chromatin marks and 5 datasets related to Transcription Factor (TF) binding. A 5kb genomic region is represented by a vector of aggregated signals of either chromatin marks, accessibility, or TF occupancy. A pair of regions in HiC-Reg is represented using one of three types of features: PAIR-CONCAT, WINDOW, and MULTI-CELL. HiC-Reg uses Random Forests regression to predict the contact count between pairs of genomic regions. HiC-Reg takes as input a feature vector for a pair of regions and gives as output a predicted contact count for that pair (e.g., Count: 4).


[Figure 2 plots. Panel A: correlation (y-axis) versus distance in Mbp (x-axis, 0 to 1) on chr17 for Gm12878, K562, Huvec, Hmec, and Nhek; legend: MULTI-CELL, WINDOW, PAIR-CONCAT, DISTANCE. Panel B: AUC (x-axis, 0 to 1) for chr1 through chr22 (rows) in each of the five cell lines; legend: MULTI-CELL, WINDOW, PAIR-CONCAT.]


Fig 2. HiC-Reg cross-validation performance. A. The distance-stratified Pearson’s correlation plots of test data when training on the same cell line for chromosome 17 in five cell lines: Gm12878, K562, Huvec, Hmec, Nhek. The x-axis corresponds to a particular distance bin and the y-axis corresponds to the Pearson’s correlation between the counts predicted by HiC-Reg for pairs at that distance and the true counts. The correlation plots for three types of feature representations of pairs are shown: MULTI-CELL, WINDOW and PAIR-CONCAT. B. Area under the curve (AUC) for the distance-stratified Pearson’s correlation in all chromosomes for each of the five cell lines. The AUC is computed for test pairs using 5-fold cross-validation.
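The two quantities in this figure can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation scripts used for the paper: the 5kb bin width, the trapezoidal rule, the normalization of the area by the distance range, and the names `distance_stratified_correlation` and `normalized_auc` are all choices made for this example.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation of two equal-length sequences (None if undefined)."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def distance_stratified_correlation(pairs, bin_size=5000):
    """pairs: iterable of (distance_bp, true_count, predicted_count).

    Returns a list of (bin_start_bp, correlation) sorted by distance; bins where
    the correlation is undefined are skipped.
    """
    bins = defaultdict(lambda: ([], []))
    for dist, true, pred in pairs:
        trues, preds = bins[(dist // bin_size) * bin_size]
        trues.append(true)
        preds.append(pred)
    curve = []
    for start in sorted(bins):
        r = pearson(*bins[start])
        if r is not None:
            curve.append((start, r))
    return curve

def normalized_auc(curve):
    """Trapezoidal area under the curve divided by the distance range (assumed)."""
    if len(curve) < 2:
        return None
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))
    return area / (curve[-1][0] - curve[0][0])

# Made-up (distance, true count, predicted count) triples for three distance bins.
pairs = [(5000, 1, 2), (6000, 2, 4), (7000, 3, 6),
         (10000, 1, 1), (11000, 2, 2), (12000, 3, 3),
         (15000, 1, 3), (16000, 2, 2), (17000, 3, 1)]
curve = distance_stratified_correlation(pairs)
auc = normalized_auc(curve)
```

With this normalization, a model whose correlation is 1 at every distance would score 1, which makes values comparable across chromosomes of different lengths.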


[Figure 3 plots. Panel A: correlation (y-axis) versus distance in Mb (x-axis, 0 to 1) for models trained on chr9 and chr11 and tested on chr9, chr11, chr14, chr17, and chr19 in Gm12878, K562, Huvec, Hmec, and Nhek; legend: CV:MULTI-CELL, Cross-Chrom (MULTI-CELL), Cross-Chrom (WINDOW), Distance. Panel B: heatmaps of AUC (color scale 0.2 to 0.65) with training chromosomes as rows and test chromosomes as columns, shown for each cell line with MULTI-CELL and WINDOW features.]


Fig 3. HiC-Reg cross-chromosome performance. A. The distance-stratified Pearson’s correlation plot when training on one chromosome and testing on a different chromosome from the same cell line. Shown are the distance-stratified correlations for models trained on chromosomes 9 and 11 (row groups) and tested on chromosomes 9, 11, 14, 17, and 19 (columns) at 5kb in different cell lines. Each pair of rows corresponds to a particular cell line and results are shown for five cell lines: Gm12878, K562, Huvec, Hmec, Nhek. B. Heatmap of AUC for all pairs of tested cross-chromosome experiments. Each off-diagonal entry in the heatmap denotes the AUC when trained on the row chromosome and tested on the column chromosome. The redder an entry, the better the performance for that chromosome pair. The diagonal entries are the AUC values when training and testing on the same chromosome in cross-validation mode.


[Figure 4 plots. Panel A: bar plots of OOB importance for the top 20 features on chr17 in Gm12878, K562, Huvec, Hmec, and Nhek. Panel B: bar plots of feature usage count for the top 20 features on chr17 in the same five cell lines. Panel C: NMF heatmaps showing the U factors (region pairs), the V factors (feature pairs), the factorized matrix, and per-pair feature values grouped by cell line (Gm12878, K562, Huvec, Hmec, Nhek), together with Cytoscape feature networks for Cluster1 through Cluster5.]


Fig 4. Analysis of features important for predicting Hi-C contact counts. A. Shown are the top 20 MULTI-CELL features ranked by Out Of Bag (OOB) feature importance on chromosome 17 in all five cell lines. Each horizontal bar corresponds to one feature. The feature name includes the name of the histone mark, DNase I or TF, whether it is on one of the interacting regions (R1, R2) or in the intervening window (W), and the specific cell line from which the feature is extracted. B. Shown are the top 20 features ranked by the number of times a feature is used for test set predictions. Feature rankings are for chromosome 17 for all five cell lines. C. Non-negative matrix factorization (NMF) of the region-pair by feature-pair matrix for Gm12878 chromosome 17. The U and V factors are the NMF factors that provide the membership of region pairs or feature pairs in a cluster (white lines demarcate the region-pair and feature-pair clusters). The factorized feature count matrix is shown below the V factors and to the right of the U factors. The heatmap on the right shows the features associated with each of the pairs, with rows corresponding to pairs of regions and columns corresponding to the feature values grouped by the cell line from which they are obtained. Bottom: Cytoscape network representations of important pairs of features. The node size is proportional to the number of features with which the specific feature co-occurs on a path in the decision tree. The thickness of an edge is proportional to the number of times the pair of features is used on the path from the root to the leaf for a test example pair. The font size of a node label is proportional to the node size.


[Figure 5 plots. Panel A: fold enrichment of the four CTCF motif configurations (−−, −+, +−, ++) in Gm12878, K562, Huvec, Hmec, and Nhek; legend: True, Predicted, Random. Panel B: fold enrichment of ChIA-PET interactions (K562_PolII, K562_CTCF, Hela_RNAPII, K562_RAD21, K562_RNAPII, K562_H4K27ac, K562_H4K4Me1, K562_H4K4Me2, K562_H4K4Me3, GM12878_RAD21) in the five cell lines; legend: True, Predicted, Random. Panel C: box plot of TAD similarity (Jaccard coefficient, axis 0.65 to 0.90) for Gm12878, Hmec, Huvec, K562, and Nhek. Panel D: heatmaps of true and predicted Hi-C counts for Gm12878 chr17 (32.25 to 34 Mbp) and K562 chr17 (2.25 to 4 Mbp).]


Fig 5. Assessing significant interactions and TADs from HiC-Reg predictions. A. Fold enrichment of four configurations of CTCF motifs in significant interactions identified using true and predicted counts. A fold enrichment >1 (red horizontal line) is considered as enriched. Fold enrichment in all five cell lines is shown. B. Fold enrichment of interactions identified using ChIA-PET experiments in significant interactions from HiC-Reg’s predictions and true counts. A fold enrichment >1 (red horizontal line) is considered as enriched. C. Distribution of TAD similarity identified from true and predicted counts for all chromosomes. Each point in the box plot corresponds to the average Jaccard coefficient for a chromosome. D. TADs identified on true (left) and predicted (right) Hi-C count matrices for selected regions. Top: Gm12878 cell line, chr17:32-34Mbp. Bottom: K562 cell line, chr17:2-4Mbp.
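As a rough illustration of the TAD similarity in panel C, the sketch below computes an average Jaccard coefficient between two sets of TADs on one chromosome. The caption does not specify how TADs are matched, so the scheme here is an assumption: each TAD is treated as a set of consecutive bins, each true TAD is scored against its best-matching predicted TAD, and the scores are averaged. The function names and the toy TAD coordinates are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def tad_bins(tad):
    """Expand a TAD given as (start_bin, end_bin), end inclusive, into its bins."""
    start, end = tad
    return range(start, end + 1)

def average_tad_similarity(true_tads, predicted_tads):
    """Mean, over true TADs, of the best Jaccard match among predicted TADs.

    The best-match averaging is an assumption made for this sketch.
    """
    if not true_tads or not predicted_tads:
        return 0.0
    best = [max(jaccard(tad_bins(t), tad_bins(p)) for p in predicted_tads)
            for t in true_tads]
    return sum(best) / len(best)

# Toy TADs in units of bins: the second predicted TAD starts two bins late.
true_tads = [(0, 9), (10, 19)]
predicted_tads = [(0, 9), (12, 19)]
similarity = average_tad_similarity(true_tads, predicted_tads)
```

In this toy case the first TAD matches exactly and the second shares 8 of 10 bins, so the two per-TAD scores are 1.0 and 0.8.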


[Figure 6 plots. Panel A: grid of correlation (y-axis, −0.2 to 0.6) versus distance in Mb (x-axis, 0 to 1), with training cell lines as columns and test cell lines as rows (Huvec, Hmec, Nhek, Gm12878, K562); legend: CV:MULTI-CELL, CrossCell:MULTI-CELL, CrossCell:WINDOW, Ensemble:MULTI-CELL, Transfer Counts, Distance. Panel B: AUC (x-axis, 0 to 1) for each test cell line (columns) and each training cell line or Ensemble (rows); legend: MULTI-CELL, WINDOW, TransferCounts.]


Fig 6. Assessing HiC-Reg predicted counts in new test cell lines. A. Distance-stratified Pearson’s correlation plots using models trained on one cell line (columns) and tested on a different cell line (rows) for chromosome 17 and all five cell lines: Huvec, Hmec, Nhek, Gm12878, K562. Each plot (except those on the diagonal) shows the distance-stratified correlation when using the same cell line (blue), when using a different cell line with MULTI-CELL (red) or WINDOW (cyan) features, when using the ensemble of all predictions from a different cell line (orange), and when simply transferring counts (green). The plots on the diagonal show the CV performance. B. Area under the curve (AUC) for the distance-stratified correlation in chromosome 17. Shown are the AUCs for the different training cell line models for both feature types (red and green), ensembles from both feature types, as well as transfer counts when predicting in a new cell line.


[Figure 7 heatmaps, each with training cell lines (Gm12878, Hmec, Huvec, K562, Nhek) and the Ensemble against test cell lines. Panel A: fold enrichment of the four CTCF motif configurations (++, +−, −+, −−). Panel B: fold enrichment of ChIA-PET datasets (K562 CTCF, K562 H4K4Me1, K562 H4K4Me2, K562 H4K4Me3, K562 H4K27ac, K562 PolII, K562 RAD21, K562 RNAPII, Hela RNAPII, GM12878 RAD21). Panel C: Jaccard coefficient (color scale 0.3 to 1.0) for chr14, chr17, and chr19.]


Fig 7. Ability of HiC-Reg to capture significant interactions and TADs in new cell lines. A. Enrichment of CTCF motifs in four configurations for cross-cell predictions. Shown are the fold enrichments when testing using a model trained on a different cell line or an ensemble of models using MULTI-CELL features. The diagonal entries correspond to the fold enrichment obtained in the cross-validation setting. B. Fold enrichment of ChIA-PET interactions in predicted interactions when trained on a different cell line as well as when using the MULTI-CELL ensemble. C. Jaccard coefficient-based similarity of TADs called on true and predicted interaction counts when training on a different cell line. Shown are Jaccard similarity values for three chromosomes: 14, 17 and 19.


[Figure 8 plots for the HBA1 locus on chr16. Panel A: heatmaps of feature signal (14 datasets in each of Gm12878, K562, Huvec, Hmec, and Nhek), distance, predicted count, and q-value significance for the 5 CV and 20 cross-cell line models, with rows for 5kb bins (e.g., chr16_145000_150000). Panel B: WashU Epigenome Browser tracks (RAD21, TBP, H3k9me3, H3k27me3, H4k20me1, H3k36me3, H3k79me2, H3k4me3, H3k4me2, H3k4me1, H3k27ac, H3k9ac, DNase1, Ctcf) over chr16 p13.3 (60,000 to 500,000 bp), with interactions from 5C, significant interactions from CV predictions, and RefSeq genes. Panel C: network and importance ranking of feature pairs. Panel D: predicted counts (y-axis) versus distance (x-axis) around HBA1 for the K562, Hmec, HuvectoNhek, HmectoNhek, HuvectoHmec, and NhektoHmec models; legend: Mapped to 5C (Significant), Significant, Non-Significant.]


Fig 8. Examining HiC-Reg predictions at the HBA1 locus. A. Shown are the features, predicted counts and P-values for the 1MB radius around the 5kb bin spanning the HBA1 gene TSS. The predicted counts (white-red heatmap) and P-value significance (white-magenta heatmap) for the 5 CV models and the 20 cross-cell line models are shown. The columns correspond to the specific model. Gray corresponds to missing count data in the original Hi-C dataset. B. Visualization of feature signals using the WashU Epigenome Browser for significant interactions obtained from a model trained in K562 and tested in K562. Also shown are interactions that overlap 5C pairs. Red vertical lines demarcate the 5kb bin overlapping the HBA1 gene TSS and the distal region that interacts with it, as supported by 5C. The green arcs depict the interactions predicted by HiC-Reg and supported by 5C. C. Pairwise feature analysis based on pairs of features used for predicting significant interactions associated with the HBA1 gene in the K562 cell line (Left: network representation of important pairwise features. Right: ranking of the top 20 important pairwise features). D. Manhattan-style plots of predicted counts within the 1MB radius of the HBA1 gene. Only predictions from models that had significant interactions are shown (6 out of 25). The left part of each Manhattan plot is shortened because this region is towards the beginning of the chromosome. Blue dots are significant interactions and red dots are significant interactions that overlap a 5C interaction.
