<<

Manuscript submitted to eLife

1 Generative Modeling of

2 Multi-mapping Reads with mHi-C

3 Advances Analysis of Hi-C Studies

1 2,3 1,4* 4 Ye Zheng , Ferhat Ay , Sündüz Kele

*For correspondence: [email protected] (SK) 1 2 5 Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA; La 3 6 Jolla Institute for Allergy and Immunology, La Jolla, CA 92037, USA; School of Medicine, 4 7 University of California San Diego, La Jolla, CA 92093, USA; Department of Biostatistics

8 and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA

9

10 Abstract Current Hi-C analysis approaches are unable to account for reads that align to multiple

11 locations, and hence underestimate biological signal from repetitive regions of genomes. We

12 developed and validated mHi-C,amulti-read mapping strategy to probabilistically allocate Hi-C

13 multi-reads. mHi-C exhibited superior performance over utilizing only uni-reads and heuristic

14 approaches aimed at rescuing multi-reads on benchmarks. Specically, mHi-C increased the

15 sequencing depth by an average of 20% resulting in higher reproducibility of contact matrices and

16 detected interactions across biological replicates. The impact of the multi-reads on the detection of

17 signicant interactions is inuenced marginally by the relative contribution of multi-reads to the

18 sequencing depth compared to uni-reads, cis-to-trans ratio of contacts, and the broad data quality

19 as reected by the proportion of mappable reads of datasets. Computational experiments

20 highlighted that in Hi-C studies with short read lengths, mHi-C rescued multi-reads can emulate the

21 eect of longer reads. mHi-C also revealed biologically supported bona de promoter-enhancer

22 interactions and topologically associating domains involving repetitive genomic regions, thereby

23 unlocking a previously masked portion of the genome for conformation capture studies.

24

25 Introduction

26 DNA is highly compressed in the nucleus and organized into a complex three-dimensional structure.

27 This compressed form brings distal functional elements into close spatial proximity of each other

28 (Dekker et al., 2002; de Laat and Duboule, 2013) and has far-reaching inuence on regulation.

29 Changes in DNA folding and chromatin structure remodeling may result in cell malfunction with

30 devastating consequences (Corradin et al., 2016; Won et al., 2016; Javierre et al., 2016; Rosa-

31 Garrido et al., 2017; Spielmann et al., 2018). Hi-C technique (Lieberman-Aiden et al., 2009; Rao

32 et al., 2014) emerged as a high throughput technology for interrogating the three-dimensional

33 conguration of the genome and identifying regions that are in close spatial proximity in a genome-

34 wide fashion. Thus, Hi-C data is powerful for discovering key information on the roles of the

35 chromatin structure in the mechanisms of gene regulation.

36 There are a growing number of published and well-documented Hi-C analysis tools and pipelines

37 (Heinz et al., 2010; Hwang et al., 2014; Ay et al., 2014a; Servant et al., 2015; Mifsud et al., 2015;

38 Lun and Smyth, 2015), and their operating characteristics were recently studied (Ay and Noble,

39 2015; Forcato et al., 2017; Yardımcı et al., 2017) in detail. However, a key and common step in

40 these approaches is the exclusive use of uniquely mapping reads. Limiting the usable reads to only

1 of 29 Manuscript submitted to eLife

41 uniquely mapping reads underestimates signal originating from repetitive regions of the genome

42 which are shown to be critical for tissue specicity (Xie et al., 2013). Such reads from repetitive

43 regions can be aligned to multiple positions (Figure 1A) and are referred to as multi-mapping reads

44 or multi-reads for short. The critical drawbacks of discarding multi-reads have been recognized in

45 other classes of genomic studies such as transcriptome sequencing (RNA-seq) (Li and Dewey, 2011),

46 chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) (Chung et al.,

47 2011; Zeng et al., 2015), as well as genome-wide mapping of -RNA binding sites (CLIP-seq or

48 RIP-seq) (Zhang and Xing, 2017). More recently, Sun et al. (2018) and Cournac et al. (2015) argued

49 for a fundamental role of repeat elements in the 3D folding of genomes, highlighting the role of

50 higher order chromatin architecture in repeat expansion disorders. However, the ambiguity of

51 multi-reads alignment renders it a challenge to investigate the repetitive elements co-localization

52 with the true 3D interaction architecture and signals. In this work, we developed mHi-C (Figure 1–

53 Figure supplements 1 and 2), a hierarchical model that probabilistically allocates Hi-C multi-reads to

54 their most likely genomic origins by utilizing specic characteristics of the paired-end reads of the

55 Hi-C assay. mHi-C is implemented as a full analysis pipeline (https://github.com/keleslab/mHiC) that

56 starts from unaligned read les and produces a set of statistically signicant interactions at a given

57 resolution. We evaluated mHi-C both by leveraging replicate structure of public of Hi-C datasets of

58 dierent species and cell lines across six dierent studies, and also with computational trimming

59 and data-driven simulation experiments.

60 Results

61 Multi-reads signicantly increase the sequencing depths of Hi-C data

62 For developing mHi-C and studying its operating characteristics, we utilized six published studies,

63 resulting in eight datasets with multiple replicates, as summarized in Table 1 with more details

64 in Figure 1-source data 1: Table 1. These datasets represent a variety of study designs from

65 dierent organisms, i.e., and mouse cell lines as examples of large genomes and three

66 dierent stages of Plasmodium falciparum red blood cell cycle as an example of a small and AT-rich

67 genome. Specically, they span a wide range of sequencing depths (Figure 1B), coverages and

68 cis-to-trans ratios (Figure 1–Figure supplement 3), and have dierent proportions of mappable

69 and valid reads (Figure 1–Figure supplement 4). Before applying mHi-C to these datasets and

70 investigating biological implications, we rst established the substantial contribution of multi-reads

71 to the sequencing depth across these datasets with diverse characteristics. At read-end level

72 (Supplementary le 1 for terminology), after the initial step of aligning to the reference genome

73 (Figure 1–Figure supplements 1 and 2), multi-reads constitute approximately 10% of all the mappable

74 read ends (Figure 1–Figure supplement 5A). Moreover, the contribution of multi-reads to the set

75 of aligned reads further increases by an additional 8% when chimeric reads (Supplementary le

76 1) are also taken into account (Figure 1–source data 1: Table 2). Most notably, Figure 1–Figure

77 supplement 6A demonstrates that, in datasets with shorter read lengths, multi-reads constitute a

78 larger percentage of usable reads compared to uniquely mapping chimeric reads that are routinely

79 rescued in Hi-C analysis pipelines (Servant et al., 2015; Lun and Smyth, 2015; Durand et al., 2016).

80 Moreover, multi-reads also make up a signicant proportion of the rescued chimeric reads (Figure 1–

81 Figure supplement 6B). At the read pair level, after joining of both ends, multi-reads increase the

82 sequencing depth by 18% to 23% for shorter read length datasets and 10% to 15% for longer read

83 lengths, thereby providing a substantial increase to the depth before the read pairs are further

84 processed into bin pairs (Figure. 1C; Figure 1–source data 1: Table 3).

85 Multi-reads can be rescued at multiple processing stages of mHi-C pipeline

86 As part of the post-alignment pre-processing steps, Hi-C reads go through a series of validation

87 checking to ensure that the reads that represent biologically meaningful fragments are retained

88 and used for downstream analysis (Figure 1–Figure supplements 1 and 2, Supplementary le 1).

2 of 29 Manuscript submitted to eLife

Table 1. Hi-C Data Summary

Cell line Replicate Read length (bp) Restriction HiC Protocol Source Resolution (kb) IMR90 rep1-6 36 HindIII dilution Jin et al. (2013) 40 GM12878 rep2-9 101 MboI in situ Rao et al. (2014) 5, 10*, 40* GM12878 rep32, rep33 101 DpnII in situ Rao et al. (2014)5 A549 rep1-4 151 MboI in situ Dixon et al. (2018) 10, 40 ESC(2012) rep1, rep2 36 HindIII dilution Dixon et al. (2012) 40 ESC(2017) rep1-4 50 DpnII in situ Bonev et al. (2017) 10, 40 Cortex rep1-4 50 DpnII in situ Bonev et al. (2017) 10, 40 P.falciparum 3 stages 40 MboI dilution Ay et al. (2014b) 10, 40 * Replicates 2, 3, 4, and 6 of the GM12878 cell line datasets were process at 10kb and 40kb resolutions.

89 mHi-C pipeline tracks multi-reads through these processing steps. Remarkably, in the application of

90 mHi-C to all six studies, a subset of the high-quality multi-reads are reduced to uni-reads either in

91 the validation step when only one candidate contact passes the validation screening, or because

92 all the alignments of a multi-read reside within the same bin (Figure 1C; Supplementary le 1 and

93 see Materials and Methods). Collectively, mHi-C can rescue as high as 6.7% more valid read pairs

94 (Figure 1D) that originate from multi-reads and are mapped unambiguously without carrying out any

95 multi-reads specic procedure for large genomes and 10.4% for P. falciparum. Such improvement

96 corresponds to millions of reads for deeper sequenced datasets (Figure 1–source data 1: Table 4).

97 For the remaining multi-reads (Figure 1D, colored in pink), which, on average, make up 18% of all

98 the valid reads (Figure 1–Figure supplement 5B), mHi-C implements a novel multi-mapping model

99 and probabilistically allocates them.

100 mHi-C generative model (Figure 1E and see Materials and Methods) is constructed at the bin-pair

101 level to accommodate the typical signal sparsity of genomic interactions. The bins are either xed-

102 size non-overlapping genome intervals or a xed number of restriction fragments derived from the

103 Hi-C protocol. The resolutions at which seven cell lines are processed are summarized in Table 1.

104 In the mHi-C generative model, we denote the observed alignment indicator vector for a given

105 paired-end read i by vector Yi and use unobserved hidden variable vector Zi to indicate its true 106 genomic origin. Contacts captured by Hi-C assay can arise as random contacts of nearby genomic

107 positions or true biological interactions. mHi-C generative model acknowledges this feature by

108 utilizing data-driven priors, ⇡(j,k) for bin pairs j and k, as a function of contact distance between the 109 two bins. mHi-C updates these prior probabilities for each candidate bin pair that a multi-read can 110 be allocated to by leveraging local contact counts. As a result, for each multi-read i, it estimates

111 posterior probabilities of genomic origin variable Zi. Specically, Pr(Zi,(j,k) =1 Yi, ⇡) denotes the 112 posterior probability, i.e., allocation probability, that the two read ends of multi-read i originate 113 from bin pairs j and k. These posterior probabilities, which can also be viewed as fractional contacts 114 of multi-read i, are then utilized to assign each multi-read to the most likely genomic origin. Our 115 results in this paper only utilized reads with allocation probability greater than 0.5. This ensured the

116 output of mHi-C to be compatible with the standard input of the downstream normalization and

117 statistical signicance estimation methods (Imakaev et al., 2012; Knight and Ruiz, 2013; Ay et al.,

118 2014a).

119 Probabilistic assignment of multi-reads results in more complete contact matrices

120 and signicantly improves reproducibility across replicates

121 Before quantifying mHi-C model performance, we provide a direct visual comparison of the contact

122 matrices between Uni-setting and Uni&Multi-setting using raw and normalized contact counts.

123 Figure 2A and Figure 2–Figure supplements 1-4 clearly illustrate how multi-mapping reads ll in the

3 of 29 Manuscript submitted to eLife

124 low mappable regions and lead to more complete matrices, corroborating that repetitive genomic

125 regions are under-represented without multi-reads. Quantitatively, for the combined replicates of

126 GM12878, 99.61% of the 5kb bins with interaction potential are covered by at least 100 raw contacts

127 under the Uni&Multi-setting, compared to 98.72% under Uni-setting, thereby allowing us to study

128 25.55Mb more of the genome. For normalized contact matrices, the coverage increases from

129 99.42% in Uni-setting to 99.97% in Uni&Multi-setting (Figure 2–Figure supplement 5). In addition to

130 increasing the sequencing depth in extremely low contact bins for both raw and normalized contact

131 counts, higher bin-level coverage after leveraging multi-mapping reads appears as a global pattern

132 across the genome for raw contact matrices (Figure 2–Figure supplements 6 and 7). Figure 2–

133 Figure supplement 8 provides the histogram of bin-level dierences of normalized contact counts

134 between the two settings and indicates a positive average dierence. While some bins appear

135 to have their contact counts decreased in the Uni&multi-setting compared to Uni-setting after

136 normalization (purple bar in Figure 2–Figure supplement 8A), comparison of the raw contact counts

137 in Figure 2–Figure supplement 8B shows that these bins do indeed have lower raw contact counts

138 in the Uni-setting compared to Uni&Multi-setting and that the reduction observed is an artifact of

139 normalization. This also highlights that multi-reads alleviate the ination of low raw contact count

140 regions due to normalization. These major improvements in coverage provide direct evidence that

141 mHi-C is rescuing multi-reads that originate from biologically valid fragments.

142 We assessed the impact of multi-reads rescued by mHi-C on the reproducibility from the point

143 of both raw contact counts and signicant interactions detected. We used the stratum-adjusted

144 correlation coecient proposed in HiCRep(Yang et al., 2017) for evaluating the reproducibility of

145 Hi-C contact matrices. Figure 2B and Figure 2–Figure supplement 9 and 10 illustrate that integrating

146 multi-reads leads to increased reproducibility and reduced variability of stratum-adjusted correla-

147 tion coecients among biological replicates across all the study datasets. Furthermore, we observe

148 that, for some , e.g., chr17 of IMR90 and chr16 of GM12878, the improvement in

149 reproducibility stands out, without a systematic behavior across datasets. A close examination of

150 improvement in reproducibility as a function of the ratio of rescued multi-reads to uni-reads across

151 chromosomes highlights the larger proportion of multi-reads rescued for these chromosomes

152 (Figure 2–Figure supplement 11). To further assess that the improvement in reproducibility did not

153 manifest due to an unaccounted systematic bias in the assignment of multi-reads, we evaluated

154 reproducibility similarly between replicates of GM12878 and replicates of IMR90. Figure 2–Figure

155 supplement 12 shows that Uni-setting and Uni&Multi-setting lead to similar levels of reproducibility

156 between replicates of these unrelated samples with all Wilcoxon rank-sum test p-values of the pair-

157 wise comparisons between Uni- and Uni&Multi-settings > 0.21; therefore, ruling out the possibility

158 of systematic bias as the source of improvement in reproducibility due to multi-reads.

159 In addition to the direct comparison of the raw contact matrices and their reproducibility,

160 we identied the set of signicant interactions by Fit-Hi-C (Ay et al., 2014a) and assessed the

161 reproducibility of the identied interactions. Figure 2C shows that mHi-C signicantly improves

162 reproducibility of detected interactions across all the pairwise comparisons of replicates within

163 each study dataset. Figure 2–Figure supplement 13A presents more details on the degree of

164 overlap among the signicant interactions identied at 5% and 10% false discovery rate (FDR)

165 across replicates for the IMR90 datasets. These comparisons highlight that signicant interactions

166 specic to Uni&Multi-setting have consistently higher reproducibility than those specic to Uni-

167 setting across all pairwise comparisons. Since random contacts tend to arise due to short genomic

168 distances between loci, we stratied the signicant interactions based on distance and reassessed

169 the reproducibility as a function of the genomic distance between the contacts (Figure 2–Figure

170 supplement 13B). Notably, signicant interactions identied only by the Uni-setting and those

171 common to both settings have a stronger gradual descending trend as a function of the genomic

172 distance, indicating decaying reproducibility for long-range interactions. In contrast, Uni&Multi-

173 setting maintains a relatively higher and stable reproducibility for longer genomic distances.

4 of 29 Manuscript submitted to eLife

174 Multi-reads detect novel signicant interactions

175 At 5% false discovery rate, mHi-C detects 20% to 50% more novel signicant interactions for relatively

176 highly sequenced study datasets (Figure 3A and Figure 3–Figure source data 1; Figure 3A–Figure

177 supplement 1 for other FDR thresholds and resolutions). The gains are markedly larger for datasets

178 with smaller sequencing depths (e.g., ESC-2012) or extremely high coverage (e.g., P. falciparum).

179 Overall gains in the number of novel contacts persist as the cuto for mHi-C posterior probabilities of

180 multi-read assignments varies (Figure 3–Figure supplement 2). At xed FDR, signicant interactions

181 identied by the Uni&Multi-setting also include the majority of signicant interactions inferred from

182 the Uni-setting, indicating that incorporating multi-reads is extending the signicant interaction list

183 (low level of purple lines in Figure 3–Figure supplement 2).

184 We leveraged the diverse characteristics of the study datasets and investigated the factors

185 that impacted the gain in the detected signicant interactions due to multi-reads. The top row of

186 Figure 3–Figure supplement 3 summarizes the marginal correlations of the percentage change in

187 the number of identied signicant interactions (at 40Kb resolution and FDR of 0.05) with the data

188 characteristics commonly used to indicate the quality of Hi-C datasets (excluding the high coverage

189 P. falciparum dataset). These marginal associations highlight the signicant impact of the relative

190 contribution of multi-reads to the sequencing depth compared to uni-reads and cis-to-trans ratio of

191 contacts (Figure 3–Figure supplement 4). Figure 3–Figure supplement 5 increase in the number of

192 novel signicant interactions for the GM12878 datasets in more detail across a set of FDR thresholds

193 and at dierent resolutions, and includes two types of restriction . Specically, Figure 3–

194 Figure supplement 5C illustrates a clear negative association between the sequencing depth and

195 the percent improvement in the number of identied signicant interactions at 5kb resolution due

196 to the larger impact of multi-reads on the smaller depth replicates. As an exception, we note that P.

197 falciparum datasets tend to exhibit signicantly higher gains in the number of identied contacts

198 especially under stringent FDR thresholds (Figure 3A), possibly due to the ultra-high coverage of

199 these datasets (Figure 3–Figure supplement 6). In addition to these marginal associations, Figure 3B

200 and Figure 3–Figure supplement 7 display the percentage increase in the number of identied

201 signicant interactions as a function of the percentage increase in the real depth due to multi-reads

202 and the cis-to-trans ratio across all the study datasets. A consistent pattern highlights that short

203 read datasets with large proportion of mHi-C rescued multi-reads compared to uni-reads enjoy a

204 larger increase in the number of identied signicant interactions regardless of the FDR threshold,

205 while for datasets with similar relative contribution of multi-reads, e.g., within lower depth IMR90,

206 cis-to-trans ratios positively correlate with the increase in the number of identied signicant

207 interactions.

208 We next asked whether novel signicant interactions due to rescued multi-reads could have

209 been identied under the Uni-setting by employing a more liberal FDR threshold. Leveraging multi-

210 reads with posterior probability larger than 0.5 and controlling the FDR at 1%, Fit-Hi-C identied

211 32.49% more signicant interactions compared to Uni-setting (comparing dark green to dark purple

212 bar in Figure 3C) and 36.43% of all signicant interactions are unique to Uni&Multi-setting (light

213 green bar over dark green bar in Figure 3C) collectively for all the four replicates of GM12878 at

214 40kb resolution. We observed that 34.89% of these novel interactions (yellow bar over the light

215 green bar in Figure 3C) at 1% FDR (i.e., 12.71% compared to the all the signicant interactions

216 under Uni&Multi-setting) cannot be recovered even by a more liberal signicant interaction list 217 under Uni-setting at 10% FDR. Conversely, Uni&Multi-setting is unable to recover only 4.60% of the 218 Uni-setting contacts once the FDR is controlled at 10% for the Uni&Multi-setting (light blue over 219 dark purple bar in Figure 3C), highlighting again that Uni&Multi-setting predominantly adds on

220 novel signicant interactions while retaining contacts that are identiable under the Uni-setting. A

221 similar analysis for individual replicates of IMR90 are provided in Figure 3–Figure supplement 8 as

222 well as those of GM12878 at the individual replicate level or collective analysis at 5kb, 10kb, and

223 40kb resolutions in Figure 3–Figure supplements 9-12. We further conrmed this consistent power

5 of 29 Manuscript submitted to eLife

224 gain by a Receiver Operating Characteristic (ROC) and a Precision-Recall (PR) analysis (Figure 3–

225 Figure supplement 13). The PR curve illustrates that at the same false discovery rate (1-precision),

226 mHi-C achieves consistently higher power (recall) than the Uni-setting in addition to better AUROC

227 performance.

228 Chromatin features of novel signicant interactions

229 To further establish the biological implications of mHi-C rescued multi-reads, we investigated

230 genomic features of novel contacts. Annotation of the signicant interactions with ChromHMM

231 segmentations from the Roadmap Epigenomics project (Roadmap Epigenomics Consortium, 2015)

232 highlights marked enrichment of signicant interactions in annotations involving repetitive DNA

233 (Figure 3D, Figure 3–Figure supplement 14A). Most notably, ZNF & repeats and Heterochro-

234 matin states exhibit the largest discrepancy of the average signicant interaction counts between

235 the Uni- and Uni&Multi-settings. To complement the evaluation with ChromHMM annotations,

236 we evaluated the Uni-setting and Uni&Multi-setting signicant interaction enrichment of genomic

237 regions harboring histone marks and other biochemical signals (The ENCODE Project Consortium,

238 2012; Jin et al., 2013) (See Materials and Methods) by comparing their average contact counts to

239 those without such signal (Figure 3E and Figure 3 – source data 2). Notably, while we observe that

240 multi-reads boost the average number of contacts with biochemically active regions of the genome,

241 they contribute more to regions that harbor H3K27me3 peaks (Figure 3E, Figure 3–Figure supple-

242 ment 14B). Such regions are associated with downregulation of nearby genes through forming

243 heterochromatin structure (Ferrari et al., 2014). Figure 3–Figure supplements 15-17 further provide

244 specic examples of how increased marginal contact counts due to multi-reads are supported by

245 signals of histone modications, CTCF binding sites, and . Many genes of biolog-

246 ical signicance reside in these regions. For example, NBPF1 (Figure 3–Figure supplement 15) is

247 implicated in many neurogenetic diseases and its family consists of dozens of recently duplicated

248 genes primarily located in segmental duplications (Safran et al., 2010). In addition, RABL2A within

249 the highlighted region of Figure 3–Figure supplement 16 is a member of RAS oncogene family.

250 Multi-reads discover novel promoter-enhancer interactions

251 We found that a signicant impact of multi-reads is on the detection of promoter-enhancer interac-

252 tions. Overall, mHi-C identies 14.89% more signicant promoter-enhancer interactions at 5% FDR

253 collectively for six replicates for IMR90 (Figure 4–source data 1: Table 1 and Figure 4 – source data 2).

254 Of these interactions, 13,313 are reproducible among all six replicates under Uni&Multi-setting

255 (Figure 4–source data 1: Table 2) and 62,971 are reproducible for at least two replicates (Figure 4–

256 source data 1: Table 3) leading to 15.84% more novel promoter-enhancer interactions specic

257 to Uni&Multi-setting. Figure 4A provides WashU epigenome browser (Zhou et al., 2011) display

258 of such novel reproducible promoter-enhancer interactions on 1. Figure 4–Figure

259 supplements 1-2 provides more such reproducible examples and Figure 4–Figure supplement 3

260 depicts the reproducibility of these interactions in more details across the 6 replicates.

261 We next validated the novel promoter-enhancer interactions by investigating the expression

262 levels of the genes contributing promoters to these interactions. Figure 4B supports that genes with

263 signicant interactions in their promoters generally exhibit higher expression levels (comparing bars

264 1-5 to bars 8-9 in Figure 4B). Furthermore, if these interactions involve an enhancer, the average gene

265 expression can be 38.17% higher than that of the overall promoters with signicant interactions

266 (comparing bars 10-11 to bars 1-2 in Figure 4B). Most remarkably, newly detected signicant

267 promoter-enhancer interactions (bar 13 in Figure 4B) exhibit a stably higher gene expression level,

268 highlighting that, without multi-reads, biologically supported promoter-enhancer interactions are

269 underestimated. In addition, an overall evaluation of signicant interactions (5% FDR) that considers 270 interactions from promoters with low expression (TPM f 1) as false positives illustrates that mHi-C 271 specic signicant promoter interactions have false positive rates comparable to or smaller than

272 those of signicant promoter interactions common to Uni- and Uni&Multi-settings (Figure 4–Figure

6 of 29 Manuscript submitted to eLife

273 supplement 4). In contrast, Uni-setting specic contacts have elevated false positive rates.

274 Multi-reads rene the boundaries of topologically associating domains

275 We next investigated the impact of mHi-C rescued multi-reads on the topologically associating

276 domains (TADs) (Pombo and Dillon, 2015), where we used a broad denition of TADs to include

277 contact and loop domains. We used the DomainCaller (Dixon et al., 2012, 2015) to infer TADs

278 of IMR90 datasets at 40kb resolution and Arrowhead (Rao et al., 2014) for GM12878 datasets at

279 5kb resolution under Uni&Multi-settings (Figure 5 – source data 1, 2). The detected TADs are

280 compared to those under the Uni-setting. While this comparison did not reveal stark dierences in

281 the numbers of TADs identied under the two settings (Figure 5–Figure supplement 1), we found

282 that Uni&Multi-setting identies 2.01% more reproducible TADs with 2.36% lower non-reproducible

283 TADs across replicates (Figure 5A). Several studies have revealed the role of CTCF in establishing

284 the boundaries of genome architecture (Ong and Corces, 2014; Tang et al., 2015; Hsu et al., 2017).

285 While this is an imperfect indicator of TAD boundaries, we observed that a slightly higher proportion

286 of the detected TADs have CTCF peaks with convergent CTCF motif pairs at the boundaries once

287 multi-reads are utilized (Figure 5–Figure supplements 2A-C). Figure 5B provides an explicit example

288 of how the gap in the contact matrix due to depletion of multi-reads biases the inferred TAD

289 structure. In addition to discovery of novel TADs (Figure 5–Figure supplement 3)bylling in the

290 gaps in the contact matrix and boosting the domain signals, mHi-C also renes TAD boundaries

291 (Figure 5–Figure supplements 4 and 5), and eliminates potential false positive TADs that are split by

292 the contact depleted gaps in Uni-setting (Figure 5–Figure supplements 6-8). The novel, adjusted,

293 and eliminated TADs are largely supported by CTCF signal identied using both uni- and multi-

294 reads ChIP-seq datasets (Zeng et al., 2015) as well as convergent CTCF motifs (Figure 5–Figure

295 supplement 2D), providing evidence for mHi-C driven modications to these TADs and revealing a

296 slightly lower false discovery rate for mHi-C compared to Uni-setting (Figure 5C, Figure 5–Figure

297 supplement 2E, and Figure 5–Figure supplement 9).

298 Next, we assessed the abundance of dierent classes of repetitive elements, from the Re-

299 peatMasker (Open, 2015) and UCSC genome browser (Casper et al., 2017) hg19 assembly, at the

300 reproducible TAD boundaries. Specically, we considered segmental duplications (DUP), short

301 interspersed nuclear elements (SINE), long interspersed nuclear elements (LINE), long terminal

302 repeat elements (LTR), DNA transposon (DNA) and satellites (SATE). We utilized ±1 bin on either

303 side of the edge coordinate of a given domain as its TAD boundary. At a lower resolution, i.e., 40kb

304 for IMR90, each boundary is 120kb region and the percentages of TAD boundaries with each type

305 of repetitive element illustrate negligible dierences between the Uni-setting and Uni&Multi-setting

306 (Figure 5–Figure supplement 10A). Similarly, due to the large sizes of the TAD boundaries, a majority

307 of TAD boundaries harbor SINE, LINE, LTR, and DNA transposon elements. However, higher res-

308 olution analysis of the GM12878 dataset at 5kb reveals SINE elements as the leading category of

309 elements that co-localizes with more than 99% of TAD boundaries followed by LINEs (Figure 5–Figure

310 supplement 10B). This is consistent with the fact that SINE and LINE elements are relatively short

311 and cover a larger portion of the compared to other subfamilies (15% for SINE and

312 21% for LINE) (Treangen and Salzberg, 2012). We further quantied the enrichment of repetitive

313 elements at TAD boundaries by comparing their average abundance with those within TADs and

314 the genomic intervals of the same size across the whole genome as the baseline. Figure 5D and

315 Figure 5–Figure supplement 11 show that SINE elements, satellites, and segmental duplications are

316 markedly enriched at the TAD boundaries compared to the whole genome and within TADs. More

317 interestingly, at higher resolution, i.e., 5kb for GM12878, the SINE category both have the highest

318 average enrichment and is enhanced by mHi-C (Figure 5D). In summary, under uni&Multi-setting,

319 the detected TAD boundaries tend to harbor more SINE elements supporting prior work that human

320 genome folding is markedly associated with the SINE family Cournac et al. (2015).

7 of 29 Manuscript submitted to eLife

321 Large-scale evaluation of mHi-C with computational trimming experiments and

322 simulations establishes its accuracy

323 Before further investigating the accuracy of mHi-C rescued multi-reads with computational ex-

324 periments, we heuristic strategies for rescuing multi-reads at dierent stages of the Hi-C analysis

325 pipeline as alternatives to mHi-C (Figure 6A; see Materials and Methods for the detailed description

326 of the model-free approaches and related analysis). Specically, AlignerSelect and DistanceSelect

327 rescue multi-reads by simply choosing one of the alignments of a multi-read pair by default aligner

328 strategy and based on distance, respectively. In addition to these, we designed a direct alternative

329 to mHi-C, named SimpleSelect, as a model-free approach that imposes genomic distance priors in

330 contrast to leveraging of the local interaction signals of the bins by mHi-C (e.g., local contact counts

331 due to other read pairs in candidate bin pairs).

332 To evaluate the accuracy of mHi-C in a setting with ground truth, we carried out trimming

333 experiments with the A549 151bp read length dataset and Hi-C data simulations where compared

334 mHi-C to both a random allocation strategy as the baseline and the additional heuristic approaches

335 we developed (Figure 6A). Specically, we trimmed the set of 151bp uni-reads from A549 into

336 read lengths of 36bp, 50bp, 75bp, 100bp, and 125bp. As a result, a subset of uni-reads at the full

337 read length of 151bp with known alignment positions were reduced into multi-reads, generating

338 gold-standard multi-read sets with known true origins. The resulting numbers of valid uni- and

339 multi-reads are summarized in Figure 6–Figure supplement 1A in comparison with the numbers of

340 valid uni-reads in the original A549 datasets. The corresponding multi-to-uni ratios of these settings

341 vary with the lengths of the trimmed reads and their range covers the typical multi-to-uni ratios

342 observed in the full read length datasets (Figure 6–Figure supplement 1B).

343 We rst investigated the multi-read allocation accuracy with respect to trimmed read length,

344 sequencing depth, and mappability at resolution 40kb. Figure 6B exhibits superior performance of

345 mHi-C over both the model-free methods and the random baseline in correctly allocating multi-

346 reads of dierent lengths to their true origins across intra- and inter-chromosomal contacts. As

347 expected and illustrated by Figure 6B, the accuracy of multi-read assignment has an increasing

348 trend with the read length. Specically, it ranges between 70% and 90% for mHi-C and 20% and 35%

349 for the random allocation strategy for the shortest and longest trimmed read lengths of 36bp and

350 125bp, respectively. When the allocated multi-reads are stratied as a function of the mappability,

351 reads with the lowest mappability (<0.1) have accuracy levels of less than 32% to 70% across the

352 trimmed read lengths (Figure 6C for 75bp, Figure 6–Figure supplement 2 for the other trimming

353 lengths). Notably, the accuracy quickly reaches 74% to 87% for reads with mappability of at least

354 0.5 (Figure 6C, Figure 6–Figure supplement 2).

355 Next, we assessed the allocation accuracy among dierent classes of repetitive elements (Fig-

356 ure 6D). Allocations involving segmental duplication regions exhibit a systematically lower per-

357 formance compared to other repeat classes and the average across the whole genome for all

358 ve trimming settings. Notably, even for these segmental duplication regions, the accuracy of

359 mHi-C is markedly higher than both the model-free approaches and the random selection baseline

360 displayed in Figure 6B. To nalize the accuracy investigation, we further varied the trimming setting

361 by mixing uni-reads and multi-reads from dierent replicates (see setting (ii) of trimming strategies

362 in Materials and Methods) and considering resolutions of 10kb and 40kb in addition to an empirical

363 Hi-C simulation. Figure 6–Figure supplements 3-6 provide accuracy results closely following the

364 results presented in this section from these additional settings and further validate signicantly

365 better performance of mHi-C compared to the random allocation and other heuristic approaches

366 across dierent trimmed read lengths.

367 After establishing accuracy, we evaluated the impact of mHi-C rescued multi-reads of the

368 trimmed datasets on the recovery of the (original) full read length contact matrices, topological

369 domain structures, and signicant interactions. To assess the recovery of the original contact matrix,

370 we compared both the trimmed Uni- and Uni&Multi-settings with the gold standard Uni-setting at

8 of 29 Manuscript submitted to eLife

371 the full read length utilizing HiCRep (Yang et al., 2017). Figure 7A and Figure 7–Figure supplement 1

372 illustrate that mHi-C achieves signicant improvement in the reproducibility across all chromosomes

373 under all trimming settings compared to the Uni-setting. While the pattern of reproducibility with

374 dierent read lengths in Figure 7A is consistent with the expectation that the longer trimmed

375 reads yield contact matrices that are more similar to the full read length one, the improvement in

376 reproducibility due to mHi-C is markedly larger compared to the gains from longer read sequences

377 making multi-read rescue essential. For example, the reproducibility for Uni&Multi-setting at 50bp is

378 8.84% to 27.33% higher than that of Uni-setting at 125bp. We further evaluated reproducibility under

379 each trimming setting across replicates and benchmarked the results against the reproducibility

380 of the Uni-setting with the original read length of 151bp. Figure 7–Figure supplements 2 and 3

381 conrm the gain in reproducibility across replicates due to Uni&Multi-setting for all the trimming

382 lengths. As expected, the reproducibility across replicates based on the uni-reads of the original read

383 length is higher than the levels achievable by the Uni&Multi-Setting at trimmed read lengths. This

384 comparison further supports that mHi-C assigns multi-reads in a biologically meaningful manner

385 as was evidenced earlier by the increased reproducibility among replicates of the same condition

386 (Figure 2B and Figure 2–Figure supplements 9 and 10) but not across replicates of the dierent

387 conditions (Figure 2–Figure supplement 12). It also rules out the possibility of ination of the

388 reproducibility metric by consistent but biologically irrelevant assignment of multi-reads to certain

389 loci. TAD identication with these trimmed sets highlights the sensitivity of TAD boundary detection

390 to the sequencing depth. Figure 7B and Figure 7–Figure supplements 4-7 display examples where

391 the trimmed Uni&Multi-setting achieved better recovery of TAD structure of the full-length dataset

392 compared to trimmed Uni-setting. Overall, this performance is attributable to the accuracy of mHi-C

393 assignments and the resulting increase in sequencing depth of the trimmed uni-read dataset. Finally,

394 we compared the signicant interactions detected by the trimmed Uni- and Uni&Multi-settings and

395 observed that mHi-C rescued multi-reads in trimmed datasets enable detection of a larger number

396 of interactions across a range of FDR thresholds (Figure 7–Figure supplement 8). Most notably,

397 an evaluation of detection power for the top 10K signicant interactions of the full-length dataset

398 demonstrates that, while the Uni-setting can only recover 50% of these at the trimmed read length

399 of 36bp, Uni&Multi-setting recovers 70% (Figure 7C and Figure 7–Figure supplement 9). We note

400 that these power values are slightly underestimated because the full-length uni-read dataset also

401 included chimeric reads that were rescued as uni-reads. In contrast, trimmed reads in the trimming

402 experiments were generated from uni-reads without rescuing chimeric reads (see Materials and

403 Methods; Figure 7–Figure supplement 10). Despite this, the 26.19% increase in sequencing depth

404 due to multi-reads at the trimmed read length of 36bp (Figure 6–Figure supplement 1B) translated

405 into a signicantly better recovery of the signicant interactions. Further assessment by ROC and PR

406 analysis (Figure 7D, E and Figure 7–Figure supplement 11) of the set of signicant contacts identied

407 by both settings illustrates that Uni&Multi-setting exhibits these advantages without inating the

408 false discoveries. As reads get longer towards the full length, the ROC and PR curves converge

409 under the two settings (Figure 7–Figure supplement 11).

410 Discussion

411 Hi-C data are powerful for identifying long-range interacting loci, chromatin loops, topologically

412 associating domains (TADs), and A/B compartments (Lieberman-Aiden et al., 2009; Yu and Ren,

413 2017). Multi-mapping reads, however, are absent from the typical Hi-C analysis pipelines, resulting

414 in under-representation and under-study of three-dimensional genome organization involving

415 repetitive regions. Consequently, downstream analysis of Hi-C data relies on the incomplete Hi-C

416 contact matrices which have frequent and, sometimes, severe interaction gaps spanning across

417 the whole matrix. While centromeric regions contribute to such gaps, our results indicate that lack

418 of multi-reads in the analysis is a signicant contributor. Our Hi-C multi-read mapping strategy,

419 mHi-C, probabilistically allocates high-quality multi-reads to their most likely positions (Figure 1E)

420 and successfully lls in the missing chunks of contact matrices (Figure 2A and Figure 2–Figure

9 of 29 Manuscript submitted to eLife

421 supplements 1-4). As a result, incorporating multi-reads yields remarkable increase in sequencing

422 depth which is translated into signicant and consistent gains in reproducibility of the raw contact

423 counts (Figure 2B) and detected interactions (Figure 2C). Analysis with mHi-C rescued reads identies

424 novel signicant interactions (Figure 3), promoter-enhancer interactions (Figure 4), and renes

425 domain structures (Figure 5). Our computational experiments with trimmed and simulated Hi-C

426 reads validate mHi-C and elucidate the signicant impact of multi-reads in all facets of the Hi-C data

427 analysis. We demonstrate that even for the shortest read length of 36bp, mHi-C accuracy exceeds

428 74% (85% for longer trimmed reads) for regions with underlying mappability of at least 0.5 (Figure 6C

429 and Figure 6–Figure supplement 2). mHiC signicantly outperforms a baseline random allocation

430 strategy as well as several other model-free and intuitive multi-read allocation strategies while

431 achieving its worst allocation accuracy of 63% for reads originating from segmental duplications

432 (Figure 6B and D). Trimming experiments further demonstrated the utility of multi-reads for contact

433 matrix, TAD, and signicant interaction recovery (Figure 7).

434 The default setting of mHi-C is intentionally conservative. In this default setting, mHi-C rescues

435 high-quality multi-reads that can be allocated to a candidate alignment position with a high proba-

436 bility of at least 0.5. mHi-C allows relaxation of this strict ltering where instead of keeping reads

437 with allocation probability greater than 0.5, these posterior allocation probabilities can be utilized as

438 fractional contacts. We chose not to pursue this approach in this work as the current downstream

439 analysis pipelines do not accommodate such fractional contacts. Currently, mHi-C model does not

440 take into account potential copy number variations and genome arrangements across the genome.

441 While mHi-C model can be extended to take into account estimated copy number and arrangement

442 maps of the underlying sample genomes as we have done for other multi-read problems (Zhang

443 and Kele, 2014), our computational experiments with cancerous human alveolar epithelial cells

444 A549 does not reveal any notable deterioration in mHi-C accuracy for these cells with copy number

445 alternations.

446 Materials and Methods

447 mHi-C workow

448 We developed a complete pipeline customized for incorporating high-quality multi-mapping reads

449 into the Hi-C data analysis workow. The overall pipeline, illustrated in Figure 1–Figure supple-

450 ments 1 and 2, incorporates the essential steps of the Hi-C analysis pipelines. In what follows,

451 we outline the major steps of the analysis to explicitly track multi-reads and describe how mHi-C

452 utilizes them.

453 Read end alignment: uni- and multi-reads and chimeric reads.

454 The rst step in the mHi-C pipeline is the alignment of each read end separately to the reference

455 genome. The default aligner in the mHi-C software is BWA (Li and Durbin, 2010); however mHi-C

456 can work with any aligner that outputs multi-reads. The default alignment parameters are (i) edit

457 distance maximum of 2 including mismatches and gap extension; and (ii) a maximum number

458 of gap open of 1. mHi-C sets the maximum number of alternative hits saved in the XA tag to be

459 99 to keep track of multi-reads. If the number of alternative alignments exceeds the threshold

460 of 99 in the default setting, these alignments are not recorded in XA tag. We regarded these

461 alignments as low-quality multi-mapping reads compared to those multi-mapping reads that have

462 a relatively smaller number of alternative alignments. In summary, low-quality multi-mapping

463 reads are discarded together with unmapped reads, only leaving uniquely mapping reads and

464 high-quality multi-mapping reads for downstream analysis. mHi-C pipeline further restricts the

465 maximum number of mismatches (maximum to be 2 compared to 3 in BWA default setting) to

466 ensure that the alignment quality of multi-reads is comparable to that of standard Hi-C pipeline.

467 Chimeric reads, that span ligation junction of the Hi-C fragments (Figure 1–Figure supplement 1)

468 are also a key component of Hi-C analysis pipelines. The ligation junction sequence can be derived

469 from the restriction enzyme recognition sites and used to rescue chimeric reads. mHi-C adapts

10 of 29 Manuscript submitted to eLife

470 the pre-splitting strategy of diHiC (Lun and Smyth, 2015), which is modied from the existing

471 Cutadapt (Martin, 2011) software. Specically, the read ends are trimmed to the center of the ® 472 junction sequence. If the trimmed left 5 ends are not too short, e.g., g 25 bps, these chimeric reads 473 are remapped to the reference genome. As the lengths of the chimeric reads become shorter, these

474 reads tend to become multi-reads.

475 Valid fragment ltering.

476 While each individual read end is aligned to reference genome separately, inferring interacting loci

477 relies on alignment information of paired-ends. Therefore, read ends are paired after unmapped

478 and singleton read pairs as well as low-quality multi-mapping ends (Figure 1–Figure supplement 1

479 and Supplementary le 1) are discarded. After pairing, read end alignments are further evaluated

480 for their representation of valid ligation fragments that originate from biologically meaningful long-

481 range interactions (Figure 1–Figure supplement 1). First, reads that do not originate from around

482 restriction enzyme digestion sites are eliminated since they primarily arise due to random breakage

483 by sonication (Belaghzal et al., 2017). This is typically achieved by ltering the reads based on the

484 total distance of two read end alignments to the restriction site. We required the total distance

485 to be within 50-800 bps for the mammalian datasets and 50-500 bps for P. falciparum. The lower

486 bound of 50 for this parameter is motivated by the chimeric reads with as short as 25 bps on both

487 ends. Second, a single Hi-C interaction ought to involve two restriction fragments. Therefore, read

488 ends falling within the same fragment, either due to dangling end or self-circle ligation, are ltered.

489 Third, because the nature of chromatin folding leads to the abundance of random short-range

490 interactions, interactions between two regions that are too close in the genomic distance are highly

491 likely to be random interaction without regulatory implications. As a result, reads with ends aligning

492 too close to each other are also ltered according to the twice the resolution rule. Notably, as

493 a result of this valid fragment ltering, some multi-mapping reads can be counted as uniquely

494 mapping reads (Supplementary le 1 - 2b). This is because, although a read pair has multiple

495 potential genomic origins dictated by its multiple alignments, only one of them ends up passing the

496 validation screening. Once the multi-mapping uncertainty is eliminated, such read pairs are passed

497 to the downstream analysis as uni-reads. We remark here that standard Hi-C analysis pipelines do

498 not rescue these multi-reads.

499 Duplicate removal.

500 To remove PCR duplicates, mHi-C considers the following two issues. First, due to allowing a

501 maximum number of 2 mismatches in alignment, some multi-reads may have the exact same

502 alignment position and strand direction with uni-reads. If such duplicates arise, uni-reads are

503 granted higher priority and the overlapping multi-reads together with all their potential alignment

504 positions are discarded completely. This ensures that the uni-reads that arise in standard Hi-C

505 analysis pipelines will not be discarded as PCR duplicates in the mHi-C pipeline. Second, if a multi-

506 mapping read alignment is duplicated with another multi-read, the read with smaller alphabetical

507 read query name will be preserved. More often than not, if multi-read A overlaps multi-read B at

508 a position, then it is highly likely that they will overlap at other positions as well. This convention

509 ensures that it is always the read pair A alignments that are being retained (Figure 1–Figure

510 supplement 2).

511 Genome binning.

512 Due to the typically limited sequencing depths of Hi-C experiments, the reference genome is divided

513 into small non-overlapping intervals, i.e., bins, to secure enough number of contact counts across

514 units. The unit can be x-sized genomic intervals or a xed number of consecutive restriction frag-

515 ments. mHi-C can switch between the two unit options with ease. After binning, the interaction unit

516 reduces from alignment position pairs to bin pairs. Remarkably, multi-mapping reads, ends of which

517 are located within the same bin pair, reduce to uni-reads as their potential multi-mapping alignment

11 of 29 Manuscript submitted to eLife

518 position pairs support the same bin pair contact. Therefore, there is no need to distinguish the

519 candidate alignments within the same bin (Figure 1–Figure supplement 2 and Supplementary le 1 -

520 3b).

521 mHi-C generative model and parameter estimation.

522 mHi-C infers genomic origins of multi-reads at the bin pair level (Supplementary le 1). We denoted

523 the whole alignment vector for a given paired-end read i by vector Yi. If the two read ends of read 524 i align to only bin j and bin k, respectively, we set the respective components of the alignment ® ® 525 ® ® vector as: Yi,(j,k) = 1 and Yi,(j ,k ) =0, ≈ j ë j, k ë k. Index of read, i, ranges from 1 to N, where N is 526 total number of valid Hi-C reads, including both uni-reads and multi-reads that pass the essential 527 processing in Figure 1–Figure supplements 1 and 2. Overall, the reference genome is divided into M 528 bins and j represents the bin index of the end, alignment position of which is upstream compared 529 to the other read end position indicated by k. Namely, j takes on a value from 1 to M *1and k runs 530 from j +1to the maximum value M. For uniquely mapping reads, only one alignment is observed, (M*1,M) (M*1,M) 531 i.e., . However, for multi-mapping reads, we have . (j,k) Yi,(j,k) =1 (j,k) Yi,(j,k) > 1 532 We next dened a hidden variable to denote the true genomic origin of read . If read ≥ Zi,(j,k) ≥ i i 533 originates from position bin pairs j and k, we have Zi,(j,k) =1. In addition, a read can only originate (M*1,M) 534 from one alignment position pair on the genome; thus, for both uni- and multi- (j,k) Zi,(j,k) =1 535 reads. We dene O = {(j, k): Z ( ) =1} to represent true genomic origin of read i and S as the set i i, j,k ≥ Oi 536 of location pairs that read pair can align to. Hence, = 1, if ( , ) À . Under the assumption i Yi,(j,k) j k SOi 537 that the true alignment can only originate from those observed alignable positions, Oi must be one 538 of the location pairs in . We further assume that the indicators of true origin for read , = SOi i Zi 539 (Zi,(1,2),Zi,(1,3), ..., Zi,(M*1,M)) are random draws from a Dirichlet - Multinomial distribution. Specically, 540 i.i.d. Zi Ì Multinomial(⇡(1,2),⇡(1,3), 5 ,⇡(j,k), 5 ,⇡(M*1,M)),i=1, 5 ,N, (1)

541 where ⇡(j,k) can be interpreted as contact probability between bin j and k (j

⇡ Ì Dirichlet((1,2),(1,3), 5 ,(j,k), 5 ,(M*1,M)), (2)

where ⇡ =(⇡(1,2),⇡(1,3), 5 ,⇡(M*1,M)) and (j,k) is a function of genomic distance and quanties random contact probability. Specically, we adapt the univariate spline tting approach from Fit-Hi-C (Ay et al., 2014a) for estimating random contact probabilities with respect to genomic distance and

set (j,k) = Spline(j,k)ùN +1. Here, N is the total number of valid reads as dened above and Spline(j,k) denotes the spline estimate of the random contact probability between bins j and k. Therefore, Spline(j,k)ùN is the average random contact counts (i.e., pseudo-counts) between bin j and k. As a result, the probability density function of ⇡ can be written as:

M*1 M M*1 M ( =1 = +1 (j,k)) ( *1) ( )= j k j (j,k) P ⇡ M*1 M ⇡(j,k) ≥ ≥ (( )) j=1 k=j+1 j=1 k=j+1 j,k « «  M*1 M M*1 M ( =1± = +1±(Spline(j,k)ùN + 1)) = j k j Spline(j,k)ùN M*1 M ⇡(j,k) . ≥ ≥ (Spline(j,k)ùN + 1) j=1 k=j+1 j=1 k=j+1 « «

543 We next derive the full± data± joint distribution function.

544 Lemma 1. Given the true genomic origin under the mHi-C setting, the set of location pairs that

545 a read pair can align to will have observed alignments with probability 1.

12 of 29 Manuscript submitted to eLife

Proof.

P (Yi Zi,(j,k) = 1) = P (Yi Oi) M*1 M

=  P (Yi,(j,k) Oi) j=1 k=j+1 « « M*1 M  = [1( =1 ( )À ) + 1( 1 ( )Ã )] Yi,(j,k) , j,k SOi Yi,(j,k) ë , j,k SOi j=1 k=j+1 « « =1. ∏

Based on Lemma 1, we can get the joint distribution P (Y, Z ⇡) as

N  P (Y , Z ⇡)= P⇡ (Yi, Zi) i « N M*1 M Zi,(j,k) = P⇡ (Yi,Zi,(j,k) = 1) i j=1 k=j+1 « « « N M*1 M Zi,(j,k) = [P⇡ (Yi Zi,(j,k) = 1)⇡(j,k)] i j=1 k=j+1 « « « N M*1 M  = Zi,(j,k) ⇡(j,k) . i j=1 k=j+1 « « «

Using the Dirichlet-Multinomial conjugacy, we derive the posterior distribution of ⇡ as

N

P (⇡ Z)◊P (⇡, Z)= P (Zi ⇡)P (⇡) i=1 « M*1 M N  ( Z ( )+( )*1)  ◊ ⇡ i=1 i, j,k j,k (j≥,k) j=1 k=j+1 « « M*1 M N ( Z ( )+Spline(j,k)ùN) = ⇡ i=1 i, j,k . (j≥,k) j=1 k=j+1 « « 546 We next derive an Expectation-Maximization algorithm for tting this model.

E-step.

(t) ⇡( ) (t) = ( )= j,k 1[( )À ] Zi,(j,k) E Zi,(j,k) Yi, ⇡ (t) j,k SOi . ( ® ® )À ⇡ ® ® j ,k SOi (j ,k )  M-step. ≥

N (t) =1 Z ( ) + Spline(j,k)ùN (t+1) = i i, j,k ⇡(j,k) M*1 M ® ® . + ® ® ( )ù N ≥ j =1 k =j+1 Spline j ,k N

547 ≥ ≥ Estimate of the contact probability ⇡(j,k) in the M-step can be viewed as an integration of local N (t) 548 interaction signal, encoded in , and random contact signal due to prior, i.e., Spline( )ù . i=1 Zi,(j,k) j,k N 549 The by-products of the EM algorithm are posterior probabilities, P( =1 , ), which are ≥ Zi,(j,k) Yi ⇡ 550 utilized for assigning each multi-read to the most likely genomic origin. To keep mHi-C output

551 compatible with the input required for the widely used signicant interactions detection methods, 552 we ltered multi-reads with maximum allocation posterior probability less than or equal to 0.5 and 553 assigned the remaining multi-reads to their most likely bin pairs. This ensured the use of at most

554 one bin pair for each multi-read pair. We repeated our computational evaluations by varying this

555 threshold on the posterior probabilities to ensure robustness of the overall conclusions to this

556 threshold.

13 of 29 Manuscript submitted to eLife

557 Assessing false positive rates for signicant interactions and TADs identication

558 under the Uni- and Uni&Multi-settings 559 To quantify false positive rates of the Uni- and Uni&Multi-settings, at the signicant interaction

560 level, we dened true positives and true negatives by leveraging deeply sequenced replicates of

561 the IMR90 dataset (replicates 1-4). Signicant interactions reproducibly identied across all four

562 replicates at 0.1% FDR by both the Uni- and Uni&Multi-settings were labeled as true positives (i.e.,

563 true interactions). True negatives were dened as all the contacts that were not deemed signicant

564 at 25% FDR in any of the four replicates. We then evaluated signicant interactions identied by

565 smaller depth replicates 5 & 6 with ROC and PR curves (Figure 2–Figure supplement 13) by using

566 these sets of true positives and negatives as the gold standard. To quantify false positive rates

567 at the topologically associating domains (TADs) level (Figure 5C, Figure 5–Figure supplement2E),

568 we utilized TADs that are reproducible in more than three replicates of the IMR90 dataset and/or

569 harbor CTCF peaks at the boundaries as true positives. The rest of the TADs are supported neither

570 by multiple replicates nor by CTCF, hence are regarded as false positives.

571 Evaluating reproducibility

572 Reproducibility in contact matrices was evaluated using HiCRep in the default settings. We further

573 assessed the reproducibility in terms of identied contacts by grouping the signicant interactions

574 into three categories: contacts only detected under Uni-setting, contacts unique to Uni&Multi-

575 setting, and contacts that are detected under both settings. The reproducibility is calculated by

576 overlapping signicant interactions between every two replicates and recording the percentage of

577 interactions that are also deemed signicant in another replicate (Figure 4–Figure supplement 13).

578 Chromatin states of novel signicant interactions

579 We annotated the novel signicant interactions with the 15 states ChromHMM segmentations for

580 IMR90 epigenome (ID E017) from the Roadmap Epigenomics project (Roadmap Epigenomics Con-

581 sortium, 2015). All six replicates of IMR90 are merged together in calculating the average enrichment

582 of signicant interactions among the 15 states (Figure 4B and Figure 4–Figure supplement 14A).

583 ChIP-seq analysis

584 ChIP-seq peak sets for IMR90 cells were obtained from ENCODE portal (https://www.encodeproject.

585 org/) and GEO (Barrett et al., 2012). Specically, we utilized H3K4me1 (ENCSR831JSP), H3K4me3

586 (ENCSR087PFU), H3K36me3 (ENCSR437ORF), H3K27ac (ENCSR002YRE), H3K27me3 (ENCSR431UUY)

587 and CTCF (ENCSR000EFI) from the ENCODE project and p65 (GSM1055810), p300 (GSM1055812)

588 and PolII (GSM1055822) from GEO (Barrett et al., 2012). In addition, raw data les in fastq format

589 were processed by Permseq (Zeng et al., 2015) utilizing DNase-seq of IMR90 (ENCODE accession

590 ENCSR477RTP) to incorporate multi-reads and, subsequently, peaks were identied using EN-

591 CODE uniform ChIP-seq data processing pipeline (https://www.encodeproject.org/pages/pipelines/

592 #DNA-binding). CTCF motif quantication for topologically associating domains was carried out

593 with FIMO (Grant et al., 2011) under the default settings using CTCF motif frequency matrix from

594 JASPAR (Khan et al., 2017).

595 Promoters with signicant interactions

596 Signicant interactions across six replicates of the IMR90 study were annotated with GENCODE V19

597 (Harrow et al., 2012) gene annotations and enhancer regions from ChromHMM. Gene expression

598 calculations utilized RNA-seq quantication results from the ENCODE project with accession number

599 ENCSR424FAZ.

600 Marginal Hi-C tracks in Figure 4–Figure supplements 15-17

601 Uni-setting and Uni&Multi-setting Hi-C tracks displayed on the UCSC genome browser gures

602 (Figure 4–Figure supplements 15-17) are obtained by aggregating contact counts of six replicates of

14 of 29 Manuscript submitted to eLife

603 IMR90 for each genomic coordinate along the genome.

604 Visualization of contact matrices and interactions

605 We utilized Juicebox (Durand et al., 2016), HiGlass (Kerpedjiev et al., 2017) and WashU epigenome

606 browser (Zhou et al., 2011) for depicting contact matrices and interactions, respectively, throughout

607 the paper. Normalization of the contact matrices for visualization was carried out by the Knight-Ruiz

608 Matrix Balancing Normalization (Knight and Ruiz, 2013) provided by Juicebox (Durand et al., 2016).

609 Model-free multi-reads allocation strategies

610 The simplied and intuitive strategies depicted in Figure 6A correspond to rescuing multi-reads at

611 dierent essential stages of the Hi-C analysis pipeline. AlignerSelect relies on the base aligner, e.g.,

612 BWA, to determine the primary alignment of each individual end of a multi-read pair. DistanceSelect

613 enables the distance prior to dominate. It selects the read end alignments closest in the genomic

614 distance as the origins of the multi-read pair and defaults to the primary alignment selected by

615 base aligner for inter-chromosomal multi-reads. Finally, SimpleSelect follows the overall mHi-C

616 pipeline closely by making use of the standard Hi-C validation checking and binning procedures.

617 For the reads that align to multiple bins, it selects the bin pair closest in the genomic distance as

618 the allocation of the multi-read pair. Bin-pair allocations for inter-chromosomal multi-reads are set

619 randomly in this strategy.

620 Comparison of mHi-C with model-free multi-reads allocation for their impact on

621 identifying dierential interactions

622 We evaluated the direct biological consequence of heavily biasing read assignment by genomic

623 distance, as employed by SimpleSelect, by comparing the signicant interactions among three life

624 stages of P. falciparum. We reasoned that the better multi-reads allocation strategy would reveal

625 a dierential analysis pattern more consistent with the Uni-setting, whereas a genomic distance

626 biased strategy - SimpleSelect - will underestimate dierences since multi-reads will be more likely

627 to be allocated to candidate bin pairs with shortest genomic distance regardless of other local

628 contact signals (Figure 6–Figure supplement 7A). Figure 6–Figure supplement 7B corroborates this

629 drawback of SimpleSelect and demonstrates that mHi-C dierential patterns agree better with that

630 of the Uni-setting. Moreover, Figure 6–Figure supplement 7B suggests that rings stage is more

631 similar to schizonts, an observation consistent with existing ndings on P. falciparum life stages (Ay

632 et al., 2014b; Bunnik et al., 2018).

633 Trimming procedures

634 We considered two approaches for generating evaluation datasets where we combined the trimmed

635 multi-reads from replicate 2, which has the median sequencing depth among all replicates of the

636 A549 study set, with (i) trimmed reads of replicate 2 that remain uniquely aligned to the reference

637 genome at the same trimmed read length (Figure 6 and 7), and (ii) uni-reads from other replicates,

638 i.e., replicates 1, 3, and 4 in the A549 dataset individually (Figure 6–Figure supplements 3, 5 and 6).

639 The rst setting enables a direct comparison of the set of uni- and multi-reads at trimmed read

640 length compared to uni-reads at the full read length to evaluate accuracy. The numbers of reads

641 are summarized in Figure 6–Figure supplement 1A along with multi-to-uni ratios in Figure 6–Figure

642 supplement 1B. In the second trimming setting (ii), the uni-read sets are of the original sequencing

643 depth and the added multi-reads constitute a smaller proportion compared to observed levels in

644 the data (Figure 6–Figure supplement 1C) due to the chimeric read rescue that was part of full-length

645 datasets (Figure 7–Figure supplement 10). Therefore, for this setting, we leverage the higher overall

646 depth of the datasets and evaluate the multi-read assignment accuracy at dierent resolutions, i.e.,

647 10kb and 40kb.

15 of 29 Manuscript submitted to eLife

648 Simulation procedures

649 We devised a simulation strategy that utilizes parameters learned from the Hi-C data and results in

650 data with a similar signal to noise characteristics as the actual data.

651 1. Construction of the interaction prior based on the uni-reads fragment interaction frequency list

652 of GM12878 dataset (replicate 6). The frequency list from the prior encompasses both the

653 genomic distance eect and local interaction signal strength and forms the basis for simulating

654 restriction fragment interactions.

655 2. Generating the restriction enzyme cutting sites for each simulated fragment pair. After sampling

656 interacting fragments using the frequency list from Step 1, a genomic coordinate within ±

657 500bp of the restriction enzyme cutting site and a strand direction are selected randomly.

658 Reads of dierent lengths (36bp, 50bp, 75bp, 100bp) are generated starting from these cutting

659 sites.

660 3. Mutating the resulting reads. and gap rates are empirically estimated based on the

661 aligned uni-reads of replicate 6. The reads from Step 2 are uniformly mutated with these rates

662 allowing up to 2 and 1 gap.

663 4. Simulate sequence quality scores of the reads. We utilize the empirical estimation of the distribu-

664 tion regarding the sequence base quality scores across individual locations of the read length

665 and simulate for each read its sequence quality scores at the nucleotide level.

666 5. Alignment to the reference genome. The simulated reads are aligned to the reference genome

667 and ltered for validation as we outline in the mHi-C pipeline, resulting in the set of multi-reads

668 that are utilized by mHi-C.

669 We generated numbers of multi-reads comparable to those of replicate 6 (Figure 7–Figure supple-

670 ment 4B) and in the nal step of the simulation studies, we merge the simulated set of multi-reads

671 with rep3&6 uni-reads and ran mHi-C step4 (binning) - step5 (prior already available) - step6 (assign

672 multi-reads posterior probability) independently at resolutions 10kb and 40kb.

673 Software availability

674 mHi-C pipeline is implemented in Python and accelerated by C. The source codes and instructions

675 for running mHi-C are publicly available at https://github.com/keleslab/mHiC. Each step is organized

676 into an independent script with exible user-dened parameters and implementation options.

677 Therefore, analysis can be carried out from any step of the work-ow and easily ts in high-

678 performance computing environments for parallel computations.

679 Acknowledgments

680 This work was supported by NIH HG009744 and NIH HG007019 (S.K.). F.A. is partially supported by

681 Institute Leadership Funds from La Jolla Institute for Allergy and Immunology. We thank Peigen

682 Zhou from the University of Wisconsin–Madison for the insightful discussions on accelerating the

683 pipeline. We also thank the peer reviewers and the Reviewing and Senior eLife Editors of this work

684 for their constructive comments.

685 Author Contributions

686 F.A. and S.K. conceived the project. Y.Z., F.A., and S.K. designed the research. Y.Z. and S.K. developed

687 the method. Y.Z. constructed the pipeline and performed experiments. All authors contributed to

688 the preparation of the manuscript.

689 Competing Interests

690 The authors declare no competing nancial interests.

16 of 29 Manuscript submitted to eLife

A B Hi-C Uni- and Multi-mapping Reads o Singleton

Ligated Uni-read Fragment Multi-read IMR90 GM12878 A549 ESC-2012 ESC-2107 Cortex (36bp) (101bp) (151bp) (36bp) (50bp) (50bp) (40bp)

j2 j3 j1 k1 k2 k3 3 22.66%

Observed contact: 22.31% 2.5 (j1, k1) Potential contacts for Multi-read: True origin: (j2, k2),(j2, k3),(j3, k2),(j3, k3) (j1, k1) True origin: unknown 2 22.33% C mHi-C Pre-processing Rescues Multi-reads 1.5 22.06% Validation Checking Genome Binning Sequencing Depth (*10^9)

d1 22.5% Invalid read 1 23.3% d2 pairs 23.02% 12.18% 12.24% 14.21% 21.52% 13.15% 13.29%

50 bps < d1 + d2 < 800 bps 13.1% 22.17% 18.3% 18.81%

0.5 23.83% 23.77% 11.02%

40Kb 40Kb 11.52% 11.97% 10.69% >25k bps 23.46% 13.35% 17.96% 13.28% 20.98% 12.46% 12.8% 17.11% 19.88%

Valid read pair 24.66% 0

rep1 rep2 rep3 rep4 rep5 rep6 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9rep32rep33 rep1 rep2 rep3 rep4 rep1 rep2 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 E mHi-C Model D Uni-reads

IMR90 GM12878 A549 ESC-2012 ESC-2107 Cortex (36bp) (101bp) (151bp) (36bp) (50bp) (50bp) (40bp) j1 k1 1.25 3.7% 3.23%

Genomic Distance=|k1-j1| Contact Probability Probability Contact

Prior Information Prior Genomic Distance

1

Uni-reads & Multi-reads 2.5% j2 j3 k2 k3

0.75 2.32% Local Bin-pair Contact Counts 3.91% Local Contacts 0.5 0.68% Sequencing Depth (*10^9) 2.6% Multi-reads Posterior probabilities 2.86% 1.19% 1.07% that each potential 0.95% 1.34% contact is the true origin: 0.25 1.49%

3.98% 0.9% 0.79% 5.32% j2 j3 k2 k3 P[(j2, k2)]=0.65 0.83% 4.57% 5.02% 0.78% 1.45% 6.39% 4.48%

0.15 P[(j2, k3)]=0.07 0.71% 6.65% 0.85% 4.65% 1.22% 4.42%

0.07 P[(j3, k2)]=0.15 7.32% 7.27% 0 10.37% 0.13 P[(j3, k3)]=0.13 0.65

Posterior Probability rep1 rep2 rep3 rep4 rep5 rep6 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9rep32rep33 rep1 rep2 rep3 rep4 rep1 rep2 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3

Figure 1. Overview of multi-reads and mHi-C pipeline. (A) Standard Hi-C pipelines utilize uni-reads while discarding multi-mapping reads which give rise to multiple potential contacts. (B) Total number of reads in dierent categories as a result of alignment to reference genome across the study datasets. Percentages of high-quality multi-reads compared to uni-reads are depicted on top of each bar. (C) Multi-mapping reads can be reduced to uni-reads within validation checking and genome binning pre-processing steps. (D) Aligned reads after validation checking and binning. Percentage improvements in sequencing depths due to multi-reads becoming uni-reads are depicted on top of each bar. (E) mHi-C modeling starts from the prior built by only uni-reads to quantify the relationship between random contact probabilities and the genomic distance between the contacts. This prior is updated by leveraging local bin pair contacts including both uni- and multi-reads and results in posterior probabilities that quantify the evidence for each potential contact to be the true genomic origin. Figure 1–source data 1. Detailed summary of study datasets. Figure 1–Figure supplement 1. mHi-C pipeline (Alignment - Read end pairing - Valid fragment ltering). Figure 1–Figure supplement 2. mHi-C pipeline (Duplicate removal - Genome binning - mHi-C). Figure 1–Figure supplement 3. Coverage and cis-to-trans ratios across individual replicates of the study datasets as indicators of data quality. Figure 1–Figure supplement 4. Percentages of mappable and valid reads across study datasets as an indicator of data quality. Figure 1–Figure supplement 5. Categorization of reads after alignment across study datasets. Figure 1–Figure supplement 6. Comparison of the prevalence of multi-reads and chimeric reads, both of which require additional processing.

17 of 29 Manuscript submitted to eLife

A Uni-setting (Raw Counts) Uni&Multi-setting (Raw Counts) B Uni-setting Uni&Multi-setting IMR90 0.9

0.8

0.7

0.6

0.5 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chr6: 25.5-28.5 Mb chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22

0.9 GM12878

30 30 Reproducibility of Contact Matrices 0.8 Uni-setting (Normalized Counts) Uni&Multi-setting (Normalized Counts) 0.7 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chr10 chr11 chr12 chr13 chr14 chr15 chr17 chr18 chr19 chr20 chr21 chr22 C chr16 1

actions 0.75 chr6: 25.5-28.5 Mb 0.5

0.25 Reproducibility of Significant Inte r

30 30 0 IMR90 GM12878 A549 Cortex

Figure 2. Global impact of multi-reads in Hi-C analysis. (A) Contact matrices of GM12878 with combined reads from replicates 2-9 are compared under Uni-setting and Uni&Multi-setting using raw and normalized contact counts for chr6:25.5Mb - 28.5Mb. White gaps of Uni-reads contact matrix, due to lack of reads from repetitive regions, are lled in by multi-reads, hence resulting in a more complete contact matrix. Such gaps remain in the Uni-setting even after normalization. Red squares at the left bottom of the matrices indicate the color scale. (B) Reproducibility of Hi-C contact matrices by HiCRep across all pairwise comparisons between replicates under the Uni- and Uni&Multi-settings (IMR90 and GM12878 are displayed). (C) Reproducibility of the signicant interactions across replicates of the study datasets. Reproducibility is assessed by overlapping interactions detected at FDR of 5% for pairs of replicates within each study dataset. Figure 2–Figure supplement 1. Raw and normalized contact matrices of GM12878 under Uni-setting and Uni&Multi-setting fon . Figure 2–Figure supplement 2. Raw and normalized contact matrices of GM12878 under Uni-setting and Uni&Multi-setting on . Figure 2–Figure supplement 3. Raw and normalized contact matrices of GM12878 under Uni-setting and Uni&Multi-setting on chromosome 3. Figure 2–Figure supplement 4. Raw and normalized contact matrices of GM12878 under Uni-setting and Uni&Multi-setting on chromosome 5. Figure 2–Figure supplement 5. Proportion of bins covered by at least 100 or 1000 raw or normalized contacts under Uni-setting and Uni&Multi- setting for GM12878 with combined reads from replicates 2-9 at 5kb resolution. Figure 2–Figure supplement 6. Bin coverage improvement of raw contact matrices under Uni&Multi-setting compared to Uni-setting for GM12878 with combined reads from replicates 2-9 at 5kb. Figure 2–Figure supplement 7. Bin coverage improvement of raw contact matrices under Uni&Multi-setting compared to Uni-setting for IMR90 at the individual replicate level for two dierent allocation probability thresholds. Figure 2–Figure supplement 8. Bin coverage comparison of normalized contact matrices under Uni&Multi- and Uni-settings for GM12878 with combined reads from replicates 2-9 at 5kb. Figure 2–Figure supplement 9. Reproducibility at the contact matrix level under the Uni- and Uni&Multi-settings across study datasets. Figure 2–Figure supplement 10. Reproducibility at the contact matrix level at resolutions 40kb (low) and 10kb (high) across study datasets. Figure 2–Figure supplement 11. Improvement in reproducibility versus multi-read contribution across chromosomes. Figure 2–Figure supplement 12. Reproducibility at the contact matrix level under the Uni- and Uni&Multi-settings between GM12878 and IMR90 at 40kb resolution. Figure 2–Figure supplement 13. A detailed analysis of reproducibility of signicant interactions for IMR90.

18 of 29 Manuscript submitted to eLife

A B C

IMR90(36bp) IMR90(36bp) 75 Cortex(50bp) 723,910 Cortex(50bp) Cortex(50bp) IMR90(36bp) Cortex(50bp) IMR90(36bp)

20

114,210 50 IMR90(36bp) 33,268 IMR90(36bp) 959,105 15 A549(151bp) % Improvement in the Number of Significant Interactions 25 A549(151bp) A549(151bp) A549(151bp) GM12878(101bp) GM12878(101bp) 349,405

GM12878(101bp) 121,916 GM12878(101bp) 1 2 3 Trans Ratio

x(50bp) t e

um(40bp) % Significant Contacts: A549(151bp) IMR90(36bp) Co r 20 40 60 80 alcipa r ESC-2012(36bp) ESC-2017(50bp) GM12878(101bp)

P. f human mouse 20 40 60 80 P.falciparum D E

20 20 15

10 10

5

0 0

Average Number of Contacts within olII p65 7_Enh 4_Tx 9_Het Average Number of Contacts within p300 P 1_TssA Others 6_EnhG 15_Quies H3K27ac H3K4me1 12_EnhBiv13_ReprPC 10_TssBiv H3K4me3 H3K36me3H3K27me3

19 of 29 Manuscript submitted to eLife

Figure 3. Gain in the numbers of novel signicant interactions by mHi-C and their characterization by chromatin marks. (A) Percentage increase in detected signicant interactions (FDR 5%) by comparing contacts identied in Uni&Multi-setting with those of Uni-setting across study datasets at 40kb resolution. (B) Percentage change in the numbers of signicant interactions (FDR 5%) as a function of the percentage of mHi-C rescued multi-reads in comparison to uni-read and cis-to-trans ratios of individual datasets at 40kb resolution. (C) Recovery of signicant interactions identied at 1% FDR by analysis at 10% FDR, aggregated over the replicates of GM12878 at 40kb resolution. Detailed descriptions of the groups are provided in Figure 3–Figure supplement 8. (D) Average number of contacts falling within the signicant interactions (5% FDR) that overlapped with each chromHMM annotation category across six replicates of IMR90 identied by Uni- and Uni&Multi-settings. (E) Average number of contacts (5% FDR) that overlapped with signicant interactions and dierent types of ChIP-seq peaks associated with dierent genomic functions (IMR90 six replicates). Red/Green labels denote smaller/larger dierences between the two settings compared to the dierences observed in the "Others" category that depict non-peak regions.

Figure 3–source data 1. Percentage of improvement in the number of signicant interactions across 6 studies at resolution 40kb. Figure 3–source data 2. ChIP-seq peaks detected following the standard ChIP-seq data processing pipeline of ENCODE (The ENCODE Project Consortium, 2012) using both uni-reads and multi-reads aligned by Permseq (Zeng et al., 2015) for CTCF, PolII, p65, p300, H3K27ac, H3K27me, H3K36me3, H3K4me1 and H3K4me3. Figure 3–Figure supplement 1. Percentage change in the numbers of signicant interactions under the Uni&Multi-setting compared to Uni-setting at dierent FDR thresholds and resolutions. Figure 3–Figure supplement 2. Comparison of signicant interactions as a function of posterior probabilities of multi-read assignment (IMR90). Figure 3–Figure supplement 3. Heatmap for marginal correlations of percentage increase in the number of identied signicant interactions (FDR 5%) with indicators of data quality. Figure 3–Figure supplement 4. Percentage change in the numbers of signicant interactions with respect to cis-to-trans ratio. Figure 3–Figure supplement 5. Percentage change in the numbers of signicant interactions of GM12878 datasets at dierent resolutions. Figure 3–Figure supplement 6. Percentage change in the numbers of signicant interactions with respect to coverage. Figure 3–Figure supplement 7. Percentage change in the numbers of signicant interactions as a function of the percentage of mHi-C rescued multi-reads in comparison to uni-reads and cis-to-trans ratio. Figure 3–Figure supplement 8. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for each of six replicates of IMR90. Figure 3–Figure supplement 9. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for each of four replicates of GM12878 at 40kb resolution. Figure 3–Figure supplement 10. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for each of four replicates of GM12878 at 10kb resolution. Figure 3–Figure supplement 11. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for each of ten replicates of GM12878 at 5kb resolution. Figure 3–Figure supplement 12. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for GM12878 aggregated across replicates at 5kb, 10kb, and 40kb resolutions. Figure 3–Figure supplement 13. ROC and PR curves for replicates 5 and 6 of IMR90. Figure 3–Figure supplement 14. Quantication of signicant interactions for chromHMM states and ChIP-seq peak regions (IMR90). Figure 3–Figure supplement 15. Marginalized Hi-C signal, ChIP-seq coverage and peaks, and gene expression for chr1 (IMR90). Figure 3–Figure supplement 16. Marginalized Hi-C signal, ChIP-seq coverage and peaks, and gene expression for chr2 (IMR90). Figure 3–Figure supplement 17. Marginalized Hi-C signal, ChIP-seq coverage and peaks, and gene expression for chr9 (IMR90).

20 of 29 Manuscript submitted to eLife

A

5000

4000

3000

2000

1000 Contact Count Contact 0

144500000 144530000 144560000 144590000 144620000 144650000 144680000 144710000 144740000 144770000 144800000 144830000 144860000 144890000 144920000 144950000 144980000 145010000 145040000

Mappability Hi-C Marginal 56969 60186 (Uni&Multi) 50404 Hi-C Marginal 17797 LOC100288142 (Uni) LOC728875 PFN1P2 LOC653513 PDE4DIP NBPF9 NBPF9 PDE4DIP NBPF8 PDE4DIP NBPF8 PDE4DIP NBPF8 PDE4DIP PDE4DIP RefSeq genes PDE4DIP Active TSS Quiescent/Low PDE4DIP ChromHMM PDE4DIP Strong ZNF genes & repeats Enhancers

Uni&Multi- setting

Uni-setting q21.1 chr1

B Promoter-Promoter Promoter-Other Promoters that have Promoters that have Promoter-Enhancer Significant Significant Significant Interactions no Significant Interactions Significant Interactions Interactions Interactions Promoters Promoters do overlapping not overlap with 40 with Enhancers Enhancers

20

0

Average Gene Expression (TPM) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1. Uni-setting 6. Uni-setting (75,100) 10. Uni-setting(12,691) 14. Uni-setting (12,558) 2. Uni&Multi-setting 7. Uni&Multi-setting (64,265) 11. Uni&Multi-setting(13,043) 15. Uni&Multi-setting (12,940) 3. Specific to Uni-setting 8. Uni-setting (38,642) 12. Specific to Uni-setting (10,149) 16. Uni-setting (4,429) 4. Specific to Uni&Multi-setting 9. Uni&Multi-setting (27,564) 13. Specific to Uni&Multi-setting 17. Uni&Multi-setting (4,689) 5. Common to both settings (12,564)

21 of 29 Manuscript submitted to eLife

Figure 4. Novel promoter-enhancer interactions are reproducible and associated with actively expressed genes. (A) mHi-C identies novel signicant promoter-enhancer interactions (green arcs) that are reproducible among at least two replicates in addition to those reproducible under the Uni-setting (purple arcs). Shaded and the boxed regions correspond to the anchor and target bins, respectively. The top track displays the contact counts associated with the anchor bin under Uni- and Uni&Multi-settings. Related chromHMM annotation color labels are added around the track. The complete color labels are consistent with ChromHMM 15-state model at http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html. (B) Average gene expression with standard errors for ve dierent scenarios of interactions that group promoters into six dierent categories. In the rst panel, signicant interactions involving promoters are classied into ve settings and the average gene expressions across genes with the corresponding promoters are depicted. The second panel involves two alignment settings and genes without any promoter interactions at 5% FDR. This panel is further separated into two categories: promoters that overlap with enhancer annotated regions and those that do not. The latter one serves as the baseline for average expression. Genes contributing to the third and fourth panel have promoter-enhancer, promoter-promoter interactions at 5% FDR. The fth panel considers genes promoters of which have signicant interactions with non-enhancer and non-promoter regions. Numbers in the parenthesis correspond to the number of transcripts in each category.

Figure 4–source data 1. The number of signicant promoter-enhancer Hi-C interactions at FDR 5% under Uni-setting and Uni&Multi-setting, respectively, for six replicates of IMR90. Figure 4–source data 2. Signicant promoter-enhancer interactions at FDR 5% under Uni-setting and Uni&Multi-setting for six replicates of IMR90 with the number of contacts. Figure 4–Figure supplement 1. Examples of signicant promoter-enhancer interactions reproducible among 6 replicates under Uni- and Uni&Multi-settings (IMR90) on . Figure 4–Figure supplement 2. Examples of signicant promoter-enhancer interactions reproducible among 6 replicates under Uni- and Uni&Multi-settings (IMR90) on chromosome 17. Figure 4–Figure supplement 3. Signicant promoter-enhancer interactions under Uni- and Uni&Multi-settings across 6 IMR90 replicates (Chromo- some 17). Figure 4–Figure supplement 4. Expression distribution of genes promoters of which have signicant promoter interactions (IMR90).

22 of 29 Manuscript submitted to eLife

A B Uni-setting Uni&Multi-setting

72.55 MB 97.55 MB 72.55 MB 97.55 MB 60

40 72.55 MB AD

% of T 20

0 <=3 >=4 >=5 =6 Reproduced by n replicates Chromosome 10 Chromosome C 5.0 y Rate

e r 4.5 v

97.55 MB 30 30 AD Detection Chromosome 10 Chromosome 10 4.0 alse Disc o F of T D 11.18 9.99 8.88 9 8.59

7.68 Aver v 7.3 6.9 6.5 6.47 6 5.71 (within T 4.17 3.89 3.94 3.52 3.67 2.91 2.97 (within T 3 2.58 2.74 2.33 2.21 2.08 1.94 1.79

(at TAD boundar

e r 0.25 0.33 v 0.23 0.050.170.24

A 0 (at TAD boundar DUP SINE LINE LTR DNA SATE

23 of 29 Manuscript submitted to eLife

Figure 5. mHi-C rescued multi-reads rene detected topologically associating domains. (A) Percentage of topologically associating domains (TADs) that are reproducibly detected under Uni-setting and Uni&Multi-setting. TADs that are not detected in at least 4 of the 6 replicates are considered as non-reproducible. (B) Comparison of the contact matrices with superimposed TADs between Uni- and Uni&Multi-setting for chr10:72,550,000-97,550,000. Red squares at the left bottom of the matrices indicate the color scale. TADs aected by white gaps involving repetitive regions are highlighted in light green. Light green outlined areas correspond to new TAD boundaries. (C) False discovery rate of TADs detected under two settings. TADs that are not reproducible and lack CTCF peaks at the TAD boundaries are labeled as false positives. (D) Average number of repetitive elements at the boundaries of reproducible TADs compared to those within TADs and genomewide intervals of the same size for GM12878 at 5kb resolution.

Figure 5–source data 1. Topologically associating domains detected by DomainCaller (Dixon et al., 2012) under Uni&Multi-setting for six replicates of IMR90. Figure 5–source data 2. Topologically associating domains detected by Arrowhead (Rao et al., 2014) under Uni&Multi-setting for ten replicates of GM12878. Figure 5–Figure supplement 1. The number of topologically associating domains (TADs) detected in each chromosome under Uni-setting and Uni&Multi-setting (IMR90). Figure 5–Figure supplement 2. Comparison of CTCF peaks at the boundaries of topologically associating domains (TADs) under Uni-setting and Uni&Multi-setting across six replicates of IMR90. Figure 5–Figure supplement 3. Novel topologically associating domains (TADs) with CTCF peaks at TAD boundaries (IMR90). Figure 5–Figure supplement 4. Existing topologically associating domains (TADs) with adjusted boundaries supported by CTCF peaks at the new TAD boundaries on chr1 and chr5 (IMR90). Figure 5–Figure supplement 5. Existing topologically associating domains (TADs) with adjusted boundaries supported by CTCF peaks at the new TAD boundaries on chr12 and chr13 (IMR90). Figure 5–Figure supplement 6. False positive topologically associating domains (TADs) detected by the Uni-setting due to the missing reads in low mappability regions on chr2 and chr3 (IMR90). Figure 5–Figure supplement 7. False positive topologically associating domains (TADs) detected by the Uni-setting due to the missing reads in low mappability regions on chr4 and chr16 (IMR90). Figure 5–Figure supplement 8. False positive topologically associating domains (TADs) detected by the Uni-setting due to the missing reads in low mappability regions on chr21 and chrX (IMR90). Figure 5–Figure supplement 9. False discovery rate of TADs detected under two settings. TADs that are not reproducible are labeled as false positives without considering the CTCF peaks at the TAD boundaries. Figure 5–Figure supplement 10. Percentage of TAD boundaries harboring dierent types of repetitive elements under Uni-setting and Uni&Multi- setting for IMR90 at 40kb and GM12878 at 5kb. Figure 5–Figure supplement 11. Average number of repetitive elements at the reproducible TAD boundaries compared to averages within TADs and genomewide intervals of the same size for IMR90 at 40kb resolution.

24 of 29 Manuscript submitted to eLife

A

B D 80 Average mHiC LTR DNA 70

LINE SimpleSelect 75 SINE AlignerSelect SATE

acy (%) acy (%) 50 DUP 70 DistanceSelect Accu r

Accu r Random 30 65

36 50 75 100 125 36 50 75 100 125 Read Length Read Length C Correctly Assigned

3e+05 80 Falsely Assigned 83.68% Not Assigned 2e+05 70 acy (%) # of Reads

1e+05 84.39% 60 51.92% Accu r 78.17% 54.85% 81.42% 60.08% 65.43% 74.2% 70.47% 56.11% 0e+00 60.44% 50 [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) Average Mappability

Figure 6. Assessing accuracy of mHi-C allocation by trimming experiments with the A549 study set of 151bp reads. (A) Intuitive heuristic strategies (AlignerSelect, DistanceSelect, SimpleSelect) for model-free assignment of multi-reads at various stages the of Hi-C analysis pipeline. (B) Accuracy of mHi-C in allocating trimmed multi-reads with respect to trimmed read length, compared with model-free approaches as well as random selection as a baseline. (C) Allocation accuracy with respect to mappability for 75bp reads. Red solid line depicts the overall accuracy trend. "Not assigned" category refers to multi-reads with a maximum posterior probability of assignment f 0.5. (D) mHi-C accuracy among dierent repetitive element classes. Figure 6–Figure supplement 1. Summary of the sequencing depths of the full length and trimmed datasets of the A549 study. Figure 6–Figure supplement 2. Allocation accuracy at the 40kb resolution among dierent mappability regions for trimmed reads of varying lengths. Figure 6–Figure supplement 3. Intra-chromosomal and intra&inter-chromosomal allocation accuracy with respect to trimmed read length for trimming setting (ii). Figure 6–Figure supplement 4. Evaluating accuracy of mHi-C multi-read allocation with simulations. Figure 6–Figure supplement 5. Allocation accuracy across dierent mappability regions for trimmed reads of length 36bp, 50bp, 75bp, 100bp, and 125bp for trimming setting (ii). Figure 6–Figure supplement 6. Allocation accuracy across dierent classes of repetitive elements at 10kb and 40kb resolutions under trimming setting (ii). Figure 6–Figure supplement 7. Comparison of signicant interactions among the three life stages of P. falciparum.

25 of 29 Manuscript submitted to eLife

A B Original Uni-reads Trimmed Uni-reads 36 50 75 100 125 TrimUni TrimUni&Multi

90

80 70 Recovery Rate 60 300 30 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chr6:24.2Mb-31.8Mb chr6:24.2Mb-31.8Mb chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 Trimmed Uni&Multi-reads C D E 1.00 75 0.6 70 0.75 0.4 65

e Rate er (%) 0.50 w

o ositi v P 60 Precision 0.2 ue P 0.25 r T 55 50 0.0 0.00 36 50 75 100 125 0.00 0.03 0.06 0.09 0.00 0.25 0.50 0.75 1.00 30 Read Length False Positive Rate Recall chr6:24.2Mb-31.8Mb

Figure 7. Trimmed uni- and multi-reads to recover the original contact matrix of the longer read dataset A549. (A) mHi-C rescued multi-reads of the trimmed dataset along with trimmed uni-reads lead to contact matrices that are signicantly more similar to original contact matrices compared to only using trimmed uni-reads. (B) TAD detection on chromosome 6 with the original longer uni-reads contact matrix with black TAD boundaries, trimmed uni-reads (36bp) contact matrix with green TAD boundaries, and trimmed uni- and multi-reads (36bp) contact matrix with blue TAD boundaries. (C) The power of recovering top 10,000 signicant interactions of full read length dataset using trimmed reads under FDR10%. (D, E) Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves for trimmed Uni- and Uni&Multi-setting. The ground truth for these curves is based on the signicant interactions identied by the full read length dataset at FDR of 10%. The dashed line is y = x. Figure 7–Figure supplement 1. Reproducibility of trimmed Uni-setting and trimmed Uni&Multi-setting across dierent read lengths at 40kb resolution. Figure 7–Figure supplement 2. Reproducibility comparison between original Uni-setting and trimmed Uni-setting and Uni&Multi-setting across replicates of A549 at 40kb resolution. Figure 7–Figure supplement 3. Reproducibility comparison between original Uni-setting and trimmed Uni-setting and Uni&Multi-setting across chromosomes at 40kb resolution. Figure 7–Figure supplement 4. TAD detection on chromosome 3: 124Mb-127.8Mb of original longer uni-reads contact matrix, trimmed uni-reads (36bp) contact matrix and trimmed uni- and multi-reads (36bp) contact matrix. Figure 7–Figure supplement 5. TAD detection on chromosome 7: 3Mb-10Mb of original longer uni-reads contact matrix with black TAD boundaries, trimmed uni-reads (36bp) contact matrix with green TAD bounaries and trimmed uni- and multi-reads (36bp) contact matrix with blue boundaries. Figure 7–Figure supplement 6. TAD detection on chromosome 7: 147.5Mb-155Mb of original longer uni-reads contact matrix with black TAD boundaries, trimmed uni-reads (36bp) contact matrix with green boundaries and trimmed uni- and multi-reads (36bp) contact matrix with blue boundaries. Figure 7–Figure supplement 7. TAD detection on chromosome 10: 77Mb-92Mb of original longer uni-reads contact matrix, trimmed uni-reads (36bp) contact matrix and trimmed uni- and multi-reads (36bp) contact matrix. Figure 7–Figure supplement 8. Numbers of signicant interactions identied with trimmed reads under Uni- and Uni&Multi-settings at FDR 0.1%, 1%, 5%, 10% . Figure 7–Figure supplement 9. Proportion of top 10,000 signicant interactions of full read length dataset recovered in trimmed reads sets under FDR 10%. Figure 7–Figure supplement 10. Impact of chimeric reads in the A549 datasets. Figure 7–Figure supplement 11. ROC and PR curves for detection of full read length dataset signicant interactions by the analysis of trimmed read datasets.

26 of 29 Manuscript submitted to eLife

691 References 692 Ay F, Bailey TL, Noble WS. Statistical condence estimation for Hi-C data reveals regulatory chromatin contacts. 693 Genome Research. 2014; 24(6):999–1011.

694 Ay F, Bunnik EM, Varoquaux N, Bol SM, Prudhomme J, Vert JP, Noble WS, Le Roch KG. Three-dimensional 695 modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between 696 genome architecture and gene expression. Genome Research. 2014; 24(6):974–988.

697 Ay F, Noble WS. Analysis methods for studying the 3D architecture of the genome. Genome Biology. 2015; 698 16(1):183.

699 Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, 700 Holko M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Research. 2012; 701 41(D1):D991–D995.

702 Belaghzal H, Dekker J, Gibcus JH. Hi-C 2.0: an optimized hi-C procedure for high-resolution genome-wide 703 mapping of chromosome conformation. Methods. 2017; 123:56–65.

704 Bonev B, Cohen NM, Szabo Q, Fritsch L, Papadopoulos GL, Lubling Y, Xu X, Lv X, Hugnot JP, Tanay A, et al. 705 Multiscale 3D genome rewiring during mouse neural development. Cell. 2017; 171(3):557–572.

706 Bunnik EM, Cook KB, Varoquaux N, Batugedara G, Prudhomme J, Shi L, Andolina C, Ross LS, Brady D, Fidock DA, 707 et al. Changes in genome organization of parasite-specic gene families during the Plasmodium transmission 708 stages. bioRxiv. 2018; p. 242123.

709 Casper J, Zweig AS, Villarreal C, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, Lee CM, Lee BT, Karolchik D, et al. 710 The UCSC genome browser database: 2018 update. Nucleic acids research. 2017; 46(D1):D762–D769.

711 Chung D, Kuan PF, Li B, Sanalkumar R, Liang K, Bresnick EH, Dewey C, Kele S. Discovering transcription 712 factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data. PLoS 713 Computational Biology. 2011; 7(7):e1002111.

714 Corradin O, Cohen AJ, Luppino JM, Bayles IM, Schumacher FR, Scacheri PC. Modeling disease risk through 715 analysis of physical interactions between genetic variants within chromatin regulatory circuitry. Nature 716 genetics. 2016; 48(11):1313.

717 Cournac A, Koszul R, Mozziconacci J. The 3D folding of metazoan genomes correlates with the association of 718 similar repetitive elements. Nucleic Acids Research. 2015; 44(1):245–255.

719 Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002; 295(5558):1306– 720 11.

721 Dixon JR, Jung I, Selvaraj S, Shen Y, Antosiewicz-Bourget JE, Lee AY, Ye Z, Kim A, Rajagopal N, Xie W, et al. 722 Chromatin architecture reorganization during stem cell dierentiation. Nature. 2015; 518(7539):331.

723 Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B. Topological domains in mammalian genomes 724 identied by analysis of chromatin interactions. Nature. 2012; 485(7398):376.

725 Dixon JR, Xu J, Dileep V, Zhan Y, Song F, Le VT, Galip G¨rkan Yardımcı AC, Bann DV, Wang Y, Clark R, Zhang 726 L, Yang H, Liu T, Iyyanki S, An L, Pool C, Sasaki T, Rivera-Mulia JC, Özadam H, Lajoie BR, et al. Integrative 727 detection and analysis of structural variation in genomes. Nature Genetics. 2018 September; https: 728 //www.nature.com/articles/s41588-018-0195-8.

729 Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, Lander ES, Aiden EL. Juicebox provides a visualization 730 system for Hi-C contact maps with unlimited zoom. Cell Systems. 2016; 3(1):99–101.

731 Ferrari KJ, Scelfo A, Jammula S, Cuomo A, Barozzi I, Stützer A, Fischle W, Bonaldi T, Pasini D. Polycomb- 732 dependent H3K27me1 and H3K27me2 regulate active transcription and enhancer delity. Molecular Cell. 733 2014; 53(1):49–62.

734 Forcato M, Nicoletti C, Pal K, Livi CM, Ferrari F, Bicciato S. Comparison of computational methods for Hi-C data 735 analysis. Nature Methods. 2017; 14(7):679.

736 Grant CE, Bailey TL, Nobel WS. FIMO: Scanning for occurrences of a given motif. Bioinformatics. 2011; 7:1017. 737 http://www.ncbi.nlm.nih.gov/pubmed/21330290, doi: 10.1093/bioinformatics/btr064.

27 of 29 Manuscript submitted to eLife

738 Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, 739 et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Research. 2012; 740 22(9):1760–1774.

741 Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combina- 742 tions of lineage-determining transcription factors prime cis-regulatory elements required for 743 and B cell identities. Molecular Cell. 2010; 38(4):576–589.

744 Hsu SC, Gilgenast TG, Bartman CR, Edwards CR, Stonestrom A J, Huang P, Emerson DJ, Evans P, Werner MT, Keller 745 CA, et al. The BET protein BRD2 cooperates with CTCF to enforce transcriptional and architectural boundaries. 746 Molecular Cell. 2017; 66(1):102–116.

747 Hwang YC, Lin CF, Valladares O, Malamon J, Kuksa PP, Zheng Q, Gregory BD, Wang LS. HIPPIE: a high-throughput 748 identication pipeline for promoter interacting enhancer elements. Bioinformatics. 2014; 31(8):1290–1292.

749 Imakaev M, Fudenberg G, McCord RP, Naumova N, Goloborodko A, Lajoie BR, Dekker J, Mirny LA. Iterative 750 correction of Hi-C data reveals hallmarks of chromosome organization. Nature Methods. 2012; 9(10):999– 751 1003.

752 Javierre BM, Burren OS, Wilder SP, Kreuzhuber R, Hill SM, Sewitz S, Cairns J, Wingett SW, Várnai C, Thiecke MJ, 753 et al. Lineage-specic genome architecture links enhancers and non-coding disease variants to target gene 754 promoters. Cell. 2016; 167(5):1369–1384.

755 Jin F, Li Y, Dixon JR, Selvaraj S, Ye Z, Lee AY, Yen CA, Schmitt AD, Espinoza C, Ren B. A high-resolution map of 756 three-dimensional chromatin interactome in human cells. Nature. 2013; 503(7475):290.

757 Kerpedjiev P, Abdennur N, Lekschas F, McCallum C, Dinkla K, Strobelt H, Luber JM, Ouellette SB, Ahzir A, Kumar 758 N, et al. HiGlass: Web-based visual comparison and exploration of genome interaction maps. bioRxiv. 2017; 759 p. 121889.

760 Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Chèneby J, Kulkarni SR, 761 Tan G, et al. JASPAR 2018: update of the open-access database of transcription factor binding proles and its 762 web framework. Nucleic Acids Research. 2017; 46(D1):D260–D266.

763 Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis. 2013; 33(3):1029– 764 1047.

765 de Laat W, Duboule D. Topology of mammalian developmental enhancers and their regulatory landscapes. 766 Nature. 2013; 502(7472):499–506.

767 Li B, Dewey CN. RSEM: accurate transcript quantication from RNA-Seq data with or without a reference genome. 768 BMC Bioinformatics. 2011; 12(1):323.

769 Li H, Durbin R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics. 2010; 770 26(5):589–595.

771 Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, 772 Dorschner MO, et al. Comprehensive mapping of long-range interactions reveals folding principles of the 773 human genome. Science. 2009; 326(5950):289–293.

774 Lun AT, Smyth GK. diHic: a Bioconductor package to detect dierential genomic interactions in Hi-C data. BMC 775 Bioinformatics. 2015; 16(1):258.

776 Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet journal. 777 2011; 17(1):pp–10.

778 Mifsud B, Tavares-Cadete F, Young AN, Sugar R, Schoenfelder S, Ferreira L, Wingett SW, Andrews S, Grey W, 779 Ewels PA, et al. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. 780 Nature Genetics. 2015; 47(6):598–606.

781 Ong CT, Corces VG. CTCF: an architectural protein bridging genome topology and function. Nature Reviews 782 Genetics. 2014; 15(4):234.

783 Open R, 4.0 [http://www. repeatmasker. org]. Accessed; 2015.

784 Pombo A, Dillon N. Three-dimensional genome architecture: players and mechanisms. Nature Reviews 785 Molecular Cell Biology. 2015; 16(4):245–257.

28 of 29 Manuscript submitted to eLife

786 Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, 787 Lander ES, Aiden EL. A 3D map of the human genome at kilobase resolution reveals principles of chromatin 788 looping. Cell. 2014; 159(7):1665–1680.

789 Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature. 2015; 790 518(7539):317–330. http://view.ncbi.nlm.nih.gov/pubmed/25693563.

791 Rosa-Garrido M, Chapski DJ, Schmitt AD, Kimball TH, Karbassi E, Monte E, Balderas E, Pellegrini M, Shih TT, 792 Soehalim E, et al. High resolution mapping of chromatin conformation in cardiac myocytes reveals structural 793 remodeling of the epigenome in heart failure. Circulation. 2017; p. Circulation–117.

794 Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H, et al. 795 GeneCards Version 3: the human gene integrator. Database. 2010; 2010.

796 Servant N, Varoquaux N, Lajoie BR, Viara E, Chen CJ, Vert JP, Heard E, Dekker J, Barillot E. HiC-Pro: an optimized 797 and exible pipeline for Hi-C data processing. Genome Biology. 2015; 16(1):259.

798 Spielmann M, Lupiáñez DG, Mundlos S. Structural variation in the 3D genome. Nature Reviews Genetics. 2018; 799 p. 1.

800 Sun JH, Zhou L, Emerson DJ, Phyo SA, Titus KR, Gong W, Gilgenast TG, Beagan JA, Davidson BL, Tassone F, et al. 801 Disease-Associated Short Tandem Repeats Co-localize with Chromatin Domain Boundaries. Cell. 2018; .

802 Tang Z, Luo OJ, Li X, Zheng M, Zhu JJ, Szalaj P, Trzaskoma P, Magalska A, Wlodarczyk J, Ruszczycki B, Michalski P, 803 Piecuch E, Wang P, Wang D, Tian SZ, Penrad-Mobayed M, Sachs LM, Ruan X, Wei CL, Liu ET, et al. CTCF-Mediated 804 Human 3D Genome Architecture Reveals Chromatin Topology for Transcription. Cell. 2015; 163(7):1611–1627.

805 The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 806 2012 09; 489:57–74. https://www.encodeproject.org.

807 Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and 808 solutions. Nature Reviews Genetics. 2012; 13(1):36.

809 Won H, de La Torre-Ubieta L, Stein JL, Parikshak NN, Huang J, Opland CK, Gandal MJ, Sutton GJ, Hormozdiari 810 F, Lu D, et al. Chromosome conformation elucidates regulatory relationships in developing human brain. 811 Nature. 2016; 538(7626):523.

812 Xie M, Hong C, Zhang B, Lowdon RF, Xing X, Li D, Zhou X, Lee HJ, Maire CL, Ligon KL, Gascard P, Sigaroudinia 813 M, Tlsty TD, Kadlecek T, Weiss A, O’Geen H, Farnham PJ, Madden PAF, Mungall AJ, Tam A, et al. DNA 814 hypomethylation within specic transposable element families associates with tissue-specic enhancer 815 landscape. Nature Genetics. 2013; 45(7):836–41.

816 Yang T, Zhang F, Yardimci GG, Song F, Hardison RC, Noble WS, Yue F, Li Q. HiCRep: assessing the reproducibility 817 of Hi-C data using a stratum-adjusted correlation coecient. Genome Research. 2017; p. gr–220640.

818 Yardımcı G, Özadam H, Sauria ME, Ursu O, Yan KK, Yang T, Chakraborty A, Kaul A, Lajoie BR, Song F, et al. 819 Measuring the reproducibility and quality of Hi-C data. bioRxiv. 2017; p. 188755.

820 Yu M, Ren B. The Three-Dimensional Organization of Mammalian Genomes. Annual review of cell and 821 developmental biology. 2017; 33:265–289.

822 Zeng X, Li B, Welch R, Rojo C, Zheng Y, Dewey CN, Kele S. Perm-seq: mapping protein-DNA interactions in 823 segmental duplication and highly repetitive regions of genomes with prior-enhanced read mapping. PLoS 824 Computational Biology. 2015; 11(10):e1004491.

825 Zhang Q, Kele S. CNV-guided multi-read allocation for ChIP-seq. Bioinformatics. 2014; 30(20):2860–2867.

826 Zhang Z, Xing Y. CLIP-seq analysis of multi-mapped reads discovers novel functional RNA regulatory sites in the 827 human transcriptome. Nucleic Acids Research. 2017; 45(16):9260–9271.

828 Zhou X, Maricque B, Xie M, Li D, Sundaram V, Martin EA, Koebbe BC, Nielsen C, Hirst M, Farnham P, et al. The 829 human epigenome browser at Washington University. Nature Methods. 2011; 8(12):989–990.

29 of 29 Manuscript submitted to eLife

Hi-C reads Ligation site End 1 End 2

Alignment Alignment

>25bps Chimeric reads Multi-reads Multi-reads Uni-reads Uni-reads

Read ends pairing

Uniquely mapping read pairs Multi-mapping read pairs

>99

Unmapped reads Singleton reads Low quality Multi-reads

830 Valid fragment filtering d1 d1

d2 50 bps < d1 + d2 < 800 bps d1 + d2 >800 bps d2 <25k bps d1 + d2 < 50 bps >25k bps

Short-range contacts Valid read pairs Invalid alignments

End 1 End 2 End 1

End 2 End 1 End 2 Dangling end Self circle Religation

Figure 1–Figure supplement 1. mHi-C pipeline (Alignment - Read end pairing - Valid fragment ltering). 1. Read ends are aligned to reference genome separately allowing multi-reads and chimeric reads to be rescued. 2. Read ends are paired by their read query names. Multi-reads form more than one read pair with the same read query name. Read ends that fail to align form either unmapped reads or singleton reads and are discarded. Multi-reads with ends aligning to more than 99 positions are regarded as low-quality multi-reads and are excluded from the downstream analysis. 3. Validation checking to lter short-range contacts and alignments far away from restriction enzyme recognition sites. Contacts residing within the same restriction fragment, i.e., dangling end or self-circle, as well as adjacent fragments (religation) are discarded. The above three processing steps are applied to each read independently enabling parallel implementation. Manuscript submitted to eLife

Valid fragment filtering d1

d2 50 bps < d1 + d2 < 800 bps

>25k bps

Valid read pairs

Duplicate removal Uni-reads Multi-reads A

1 mismatch

2 mismatches

Multi-reads Multi-reads B

Genome binning

40Kb 40Kb 40Kb 40Kb 40Kb 40Kb 40Kb

Uni-bin pairs Multi-bin pairs Multi-reads reduced to Uni-bin pairs

831 mHi-C

Prob=0.9 0.1

40Kb 40Kb 40Kb 40Kb 40Kb 40Kb 40Kb

Uni-bin pairs Multi-reads reduced to Multi-bin pairs Uni-binpairs

Contact matrix Bin k Bin j Bin k Bin j 3 contact counts

Figure 1–Figure supplement 2. mHi-C pipeline (Duplicate removal - Genome binning - mHi-C). 4. PCR duplicates are removed to ensure that when a uni-read and a multi-read have the same alignment position and strand direction, the uni-read is kept. In the case of multi-reads that overlap with other multi-reads, the ones with alphabetically larger IDs are removed. 5. Genome is split into x-sized non-overlapping intervals, i.e., bins or a xed number of restriction fragments and, as a result, read alignment position pairs are reduced to bin pairs. Multi-reads, candidate alignment positions of which fall into the same bin, are reduced to uni-bin pairs. 6. mHi-C model estimates an allocation probability for each potential contact and enables ltering of contacts by thresholding this allocation probability. Manuscript submitted to eLife

A human mouse P.falciparum

IMR90(36bp) GM12878(101bp) A549(151bp) Cortex(50bp) P.falciparum(40bp) (36bp) 465.1

400

348.29

300

260.03 age (%)

e r 200 v C o

103.19 100 94.84 64.04 48.69 40.31 28.5928.55 20.0617.09 21.66 22.75 18.1420.0120.6419.32 17.08 19.68 12.8314.22 10.1512.57 10.6 10.94 11.08 7.56 6.74 5.76 3.61 7.76 7.39 0 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9 rep1 rep2 rep3 rep4 rep1 rep2 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep5 rep6 rep32 rep33 B

832 IMR90(36bp) GM12878(101bp) A549(151bp) Cortex(50bp) P.falciparum(40bp) (36bp) 3.8 3.8 3.67 3.67 3.58 3.58 3.5 3.5 3.19 3.19 2.83 2.83 2.71 2.71 2.54 2.54 2.46 2.46 2.38 2.38 2.36 2.36 2.24 2.24 2.19 2.19 2.15 2.15 ans Ratio

r 2.15 2.07 2.07 1.91 1.91 1.75 1.75 Cis/ T 1.45 1.45 1 1 0.92 0.92 0.87 0.87 0.81 0.81 0.76 0.76 0.75 0.75 0.59 0.59 0.53 0.53 rep1 rep2 rep1 rep2 rep3 rep4 rep2 rep3 rep4 rep6 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep5 rep6

Figure 1–Figure supplement 3. Coverage and cis-to-trans ratios across individual replicates of the study datasets as indicators of data quality. (A) Coverage is approximated as the ratio of the sequencing depth to the genome size. (B) Cis-to-trans ratio is dened as the the number of valid intra-chromosomal contacts divided by the number of valid inter-chromosomal contacts. Manuscript submitted to eLife

A

IMR90(36bp) GM12878(101bp) A549(151bp) Cortex(50bp) P.falciparum(40bp) (36bp) 86.52 86.7587.1986.79 85.1885.61 86 86.13 82.87 81.8 80.77 80.22 79.39 80.53 79.35 78.5 79.04 78.8978.68 78.63 78.15 76.9 75.74 75 72.45 71.09 71.71 69.48 69.08 65.41 65.5

52.77 50 49.39

36.14

25

% of mappa b 0 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep1 rep2 rep3 rep4 rep5 rep6 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9 rep1 rep2 rep3 rep4 rep1 rep2 rep32 rep33

833 B

IMR90(36bp) GM12878(101bp) A549(151bp) Cortex(50bp) P.falciparum(40bp) (36bp) 65.19 62.22 61.52 61.29 60 58.99 59.48 57.38 56.01 55.43 53.19 53.6 52.15 51.09 49.46 47.69 46.2 45.6 45.43 43.12 42.47 40.22 40 40 36.14 33.18 32.67 31.52

27.13 27.03 25.72 24.6325.02 23.12 22.86 20 % of v

0 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep5 rep6 rep1 rep2 rep32 rep33

Figure 1–Figure supplement 4. Percentages of (A) mappable and (B) valid reads over the set of all reads for individual replicates of the study datasets as an indicator of data quality. Both the aligned uni-reads and multi-reads are taken into consideration. Manuscript submitted to eLife

Pair Count

A o Singleton human mouse P.falciparum

70

60

50

40

30

Sequencing Depth (%) 20 10 0

IMR90(36bp) GM12878(101bp) A549(151bp) Cortex(50bp) P.falciparum(40bp)

B VPair Count 834 human mouse P.falciparum

80

60

40

Sequencing Depth (%)

20

0

IMR90(36bp) GM12878(101bp) A549(151bp) Cortex(50bp) P.falciparum(40bp)

Figure 1–Figure supplement 5. Categorization of reads after alignment across study datasets. (A) Percentages of mapped reads in dierent alignment categories. (B) Percentages of valid reads with both ends uniquely aligned to reference genome (Uni-reads), at least one end aligning to multiple positions and resulting in only one valid alignment after validation and binning (Multi-reads (Reduce to Uni-reads)), and reads with multiple potential valid alignment positions (Multi-reads (Modeling)). Manuscript submitted to eLife

A Compare Multi reads with Chimeric reads Multi end1 Multi end2 Chimeric end1 Chimeric end2 human mouse P.falciparum

2e+08 # of reads 1e+08

0e+00

IMR90(36bp) GM12878(101bp) A549(151bp) ESC 2012(36bp) ESC 2017(50bp) Cortex(50bp) P.falciparum(40bp)

B Compare Multi reads with Chimeric reads Full end1 Full end2 Chimeric end1 Chimeric end2 835 human mouse P.falciparum

12

10 reads ulti 8 % of m

6

4

IMR90(36bp) GM12878(101bp) A549(151bp) ESC 2012(36bp) ESC 2017(50bp) Cortex(50bp) P.falciparum(40bp)

Figure 1–Figure supplement 6. Comparison of the prevalence of multi-reads and chimeric reads, both of which require additional processing. (A) Numbers of multi-reads compared to the numbers of chimeric reads at each read end level. For Hi-C datasets with shorter read lengths, multi-reads constitute a larger percentage of the usable reads compared to chimeric reads. (B) Proportion of multi-reads among full-length alignable reads compared with that among chimeric reads. As expected, chimeric reads lead to larger percentages of multi-reads. Manuscript submitted to eLife

Uni-setting (Raw Counts) Uni&Multi-setting (Raw Counts) chr1:227.2Mb-228.8Mb

20 20 Uni-setting (Normalized Counts) Uni&Multi-setting (Normalized Counts) 836 chr1:227.2Mb-228.8Mb

20 20

Figure 2–Figure supplement 1. Raw and normalized contact matrices of GM12878 under Uni- setting and Uni&Multi-setting on chromosome 1. Manuscript submitted to eLife

Uni-setting (Raw Counts) Uni&Multi-setting (Raw Counts) chr2:99.1Mb-102.4Mb

10 10 Uni-setting (Normalized Counts) Uni&Multi-setting (Normalized Counts) 837 chr2:99.1Mb-102.4Mb

10 10

Figure 2–Figure supplement 2. Raw and normalized contact matrices of GM12878 under Uni- setting and Uni&Multi-setting on chromosome 2. Manuscript submitted to eLife

Uni-setting (Raw Counts) Uni&Multi-setting (Raw Counts) chr3:184.8Mb-185.6Mb

30 30 Uni-setting (Normalized Counts) Uni&Multi-setting (Normalized Counts) 838 chr3:184.8Mb-185.6Mb

30 30

Figure 2–Figure supplement 3. Raw and normalized contact matrices of GM12878 under Uni- setting and Uni&Multi-setting on chromosome 3. Manuscript submitted to eLife

Uni-setting (Raw Counts) Uni&Multi-setting (Raw Counts) chr5:174Mb-177Mb

30 30 Uni-setting (Normalized Counts) Uni&Multi-setting (Normalized Counts) 839 chr5:174Mb-177Mb

30 30

Figure 2–Figure supplement 4. Raw and normalized contact matrices of GM12878 under Uni- setting and Uni&Multi-setting on chromosome 5. Manuscript submitted to eLife

Raw Contact Matrices Normalized Contact Matrices

840 Proportion of Bins

Uni-setting Uni&Multi-setting Uni-setting Uni&Multi-setting

Figure 2–Figure supplement 5. Proportion of bins that are covered by at least 100 (row 1) or 1000 (row 2) contacts for raw contact matrices (column 1) and normlized contact matrices (column2) under Uni- and Uni&Multi-settings for GM12878 with combined reads from replicates 2-9 at 5kb resolution. Manuscript submitted to eLife

Chromosome 1 Chromosome 2 Chromosome 3

6 7.5 6

4 5.0 4

2 2.5 2

0.0 0 0 0 2 4 6 0 2 4 6 0 2 4 6

Chromosome 4 Chromosome 5 Chromosome 6 10.0

7.5 6 7.5

5.0 4 5.0

2 2.5 2.5

841 0 0.0 0.0 0 2 4 6 0 2 4 6 0 2 4 6

Chromosome 7 Chromosome 8 12

7.5 6 9

4 5.0 6

2 2.5 3

0 0.0 0 0 2 4 6 0 2 4 6 0 3 6 9

1e+001e+02 1e+04 1e+06

Figure 2–Figure supplement 6. Bin coverage improvement of raw contact matrices under Uni&Multi-setting compared to Uni-setting for GM12878 with combined reads from replicates 2-9 at 5kb. Only chromosome 1-9 are shown and the pattern for the rest chromosomes are very similar. The dashed line is y = x. Manuscript submitted to eLife

A 0.5 0.5 0.5 12.5 12 12

10.0 9 9

7.5 6 6 5.0 3 3 2.5

0.0 0 0 0.0 2.5 5.0 7.5 0 2 4 6 8 0.0 2.5 5.0 7.5 0.5 0.5 0.5 12

9 9 9

6 6 6

3 3 3

0 0 0 0.0 2.5 5.0 7.5 0 2 4 6 8 0 2 4 6 8 B 0.9 0.9 0.9 12 12 12

842 9 9 9

6 6 6 3 3 3 0 0 0 0.0 2.5 5.0 7.5 0 2 4 6 8 0.0 2.5 5.0 7.5 0.9 0.9 0.9 12

9 9 9

6 6 6

3 3 3 0 0 0 0.0 2.5 5.0 7.5 0 2 4 6 8 0 2 4 6 8 Figure 2–Figure supplement 7. Bin coverage improvement of raw contact matrices under Uni&Multi-setting compared to Uni-setting for IMR90 at the individual replicate level for two dier- ent allocation probability thresholds. Uni&Multi-setting for (A) includes multi-reads with posterior probability > 0.5, whereas (B) depicts more strict ltering with an allocation probability greater than 0.9. The dashed line is y = x. mHi-C rescues multi-reads from valid ligation fragments, resulting in signicant increase in contact counts. Manuscript submitted to eLife

A Normalized Number of Interactions B Top 0.01% Normalized Reads Larger in Uni-setting than Uni&Multi-setting Chromosome 1 Chromosome 1

3e+07 7.5

2e+07 5.0

1e+07 2.5

0.0 1 10 100 0e+00 0 10 0 2 4 6 Chromosome 2 4e+07 Chromosome 2

3e+07 6

2e+07 4

Frequency 1e+07 2

843 0e+00 0 1 10 100 0 2 4 6 0 10 Chromosome 3 log(Raw Number of Interaction under Uni&Multi-setting) 3e+07 Chromosome 3

6

2e+07

4

1e+07 2

0 0e+00 1 10 100 0 10 0 2 4 6 Diference of Normalized Interaction Number log(Raw Number of Interaction under Uni-setting) Figure 2–Figure supplement 8. Bin coverage comparison of normalized contact matrices under Uni&Multi- and Uni-settings for GM12878 with combined reads from replicates 2-9 at 5kb. (A) Histogram of the bin-level dierences between normalized contact counts of the Uni&Multi- and Uni-settings. Green bars represent increased counts under the Uni&Multi-setting and purple ones indicate no change or decrease. Bins in the purple bar group are low coverage bins under the Uni-setting and have inated normalized contact counts. Only chromosome 1-3 are shown and the pattern for the rest of the chromosomes are very similar. (B) Raw contact count comparison of the top 0.01% bins of Panel A with drastically higher normalized contact counts under the Uni-setting compared to the Uni&Multi-setting (purple bars). The contact counts of these bins get inated by normalization. The dashed lines are y = x. Manuscript submitted to eLife

A. A549 Uni-setting Uni&Multi-setting

0.975

0.950

0.925

0.900 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22

B. ESC-2017

0.9

0.8 844

0.7 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19

C. Cortex

0.95

0.90

0.85

0.80 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19

Figure 2–Figure supplement 9. Reproducibility at the contact matrix level under the Uni- and Uni&Multi-settings in A549, ESC-2017 and Cortex cell lines. Note that the box plots for the ESC-2012 cell line which only has two replicates are not displayed. Manuscript submitted to eLife

Reproducibility of Raw Contact Matrices (40kb) Uni-setting Uni&Multi-setting 1.0

0.9

0.8

0.7 Reproducibility (SCC score) 0.6

0.5

IMR90 GM12878 A549 ESC-2012 ESC-2017 Cortex (36bp) (101bp) (151bp) (36bp) (50bp) (50bp)

845 Reproducibility of Raw Contact Matrices (10kb)

0.8

0.6 Reproducibility (SCC score)

0.4

GM12878 A549 ESC-2017 Cortex (101bp) (151bp) (50bp) (50bp)

Figure 2–Figure supplement 10. Reproducibility at the contact matrix level at resolutions 40kb (low) and 10kb (high) across study datasets. Manuscript submitted to eLife

GM12878

10 chr9

chrX 7.5 chr17 chr2 chr8 chr1 chr9 chr16

5 chr7

846 chr2 chr11 chr10 chr17 chr6 chr8 chr3 chr6 chr1 chr19 2.5 chr5 chr3 chr19 chr7 chr20 chr11 chr12 chrX chr18 chr22 chr16 chr20 chr5 chr10 chr13 chr4 chr14 chr15 chr4 0 chr18 chr12 chr21 chr13 chr14 chr22 chr21 chr15 0.05 0.10 0.15 0.20 0.25

Figure 2–Figure supplement 11. Percent improvement in reproducibility due to the Uni&Multi- setting versus the proportion of number of valid multi-reads compared to the number of the uni-reads in the datasets.

Uni-setting Uni&Multi-setting

GM12878: rep2 GM12878: rep3

0.5

0.4

0.3

0.2

0.1

GM12878: rep4 GM12878: rep6

0.5 Reproducibility 847 0.4

0.3

0.2

0.1

rep1 rep2 rep3 rep4 rep5 rep6 rep1 rep2 rep3 rep4 rep5 rep6 IMR90

Figure 2–Figure supplement 12. Reproducibility at the contact matrix level under the Uni- and Uni&Multi-settings between GM12878 and IMR90 at 40kb resolution. Each box contains repro- ducibility measurements on 23 chromosomes between every two pairs of replicates from GM12878 and IMR90. Within each panel, one GM12878 replicate contact matrix is compared with six IMR90 replicates respectively for Uni- and Uni&Multi-settings. Manuscript submitted to eLife

A IMR90 Significant Contacts Reproducibility at FDR 0.05 IMR90 Significant Contacts Reproducibility at FDR 0.1 rep1 rep2 rep3 rep1 rep2 rep3

0.75 0.75

0.50 0.50

0.25 0.25

0 0 rep2 rep3 rep4 rep5 rep6 rep1 rep3 rep4 rep5 rep6 rep1 rep2 rep4 rep5 rep6 rep2 rep3 rep4 rep5 rep6 rep1 rep3 rep4 rep5 rep6 rep1 rep2 rep4 rep5 rep6 rep4 rep5 rep6 rep4 rep5 rep6

0.75 0.75 Reproducibility Reproducibility

0.50 0.50

0.25 0.25

0 0 rep1 rep2 rep3 rep5 rep6 rep1 rep2 rep3 rep4 rep6 rep1 rep2 rep3 rep4 rep5 rep1 rep2 rep3 rep5 rep6 rep1 rep2 rep3 rep4 rep6 rep1 rep2 rep3 rep4 rep5 B IMR90 Significant Contacts Reproducibility at FDR 0.05 w.r.t Contact Distance 1

0.75

0.50

Reproducibility 0.25

0 848

1M 2M 3M 5M 6M 7M 8M 9M 10M 200K 400K 600K 800K 1.2M 1.4M 1.6M 1.8M 2.2M 2.4M 2.6M 2.8M 3.2M 3.4M 3.6M 3.8M 4.0M 4.2M 4.4M 4.6M 4.8M 5.2M 5.4M 5.6M 5.8M 6.2M 6.4M 6.6M 6.8M 7.2M 7.4M 7.6M 7.8M 8.2M 8.4M 8.6M 8.8M 9.2M 9.4M 9.6M 9.8M Contact Distance ()

IMR90 Significant Contacts Reproducibility at FDR 0.1 w.r.t Contact Distance 1

0.75

0.50

Reproducibility 0.25

0

1M 2M 3M 5M 6M 7M 8M 9M 10M 200K 400K 600K 800K 1.2M 1.4M 1.6M 1.8M 2.2M 2.4M 2.6M 2.8M 3.2M 3.4M 3.6M 3.8M 4.0M 4.2M 4.4M 4.6M 4.8M 5.2M 5.4M 5.6M 5.8M 6.2M 6.4M 6.6M 6.8M 7.2M 7.4M 7.6M 7.8M 8.2M 8.4M 8.6M 8.8M 9.2M 9.4M 9.6M 9.8M Contact Distance (Base Pair) Common compared Common compared to to Uni-setting Uni&Multi-setting

Figure 2–Figure supplement 13. Reproducibility of signicant interactions for IMR90. (A) Sig- nicant interactions are classied into three categories: Uni-setting specic or Uni&Multi-setting specic or common to both. Reproducibility is evaluated by the percentage of signicant interac- tions reproduced in another replicate within the same category. (B) Reproducibility of signicant interactions stratied by genomic distance. Reproducibility is evaluated by the percentage of signi- cant interactions reproduced in another replicate within the same category and genomic distance range. Manuscript submitted to eLife

A human mouse P.falciparum

FDR 0.001 FDR 0.01 FDR 0.05 FDR 0.1

75

50

25 % Improvement in the Number of Significant Interactions (40kb) x(50bp) x(50bp) x(50bp) x(50bp) t e t e t e t e um(40bp) um(40bp) um(40bp) um(40bp) r r r r A549(151bp) A549(151bp) A549(151bp) A549(151bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) Co r Co r Co r Co r alcipa alcipa alcipa alcipa ESC-2012(36bp) ESC-2017(50bp) ESC-2012(36bp) ESC-2017(50bp) ESC-2012(36bp) ESC-2017(50bp) ESC-2012(36bp) ESC-2017(50bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) P. f P. f P. f B P. f 849 FDR 0.001 FDR 0.01 FDR 0.05 FDR 0.1

200

100

% Improvement in the Number of Significant Interactions (10kb)

0 x(50bp) x(50bp) x(50bp) x(50bp) t e t e t e t e um(40bp) um(40bp) um(40bp) um(40bp) r r r r A549(151bp) A549(151bp) A549(151bp) A549(151bp) Co r Co r Co r Co r alcipa alcipa alcipa alcipa ESC-2017(50bp) ESC-2017(50bp) ESC-2017(50bp) ESC-2017(50bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) P. f P. f P. f P. f

Figure 3–Figure supplement 1. Percentage change in the numbers of signicant interactions under the Uni&Multi-setting compared to Uni-setting at 0.1%, 1%, 5% and 10% FDR thresholds and resolutions (A) 40kb and (B) 10Kb. Manuscript submitted to eLife

Rep1 Rep2 Rep3 FDR(0.001) FDR(0.01) FDR(0.05) FDR(0.001) FDR(0.01) FDR(0.05) FDR(0.001) FDR(0.01) FDR(0.05)

30 Gain Loss 30 30

20

20 20

10 10 10

0.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.9 0.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.9 0.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.9 Rep4 Rep5 Rep6 FDR(0.001) FDR(0.01) FDR(0.05) FDR(0.001) FDR(0.01) FDR(0.05) FDR(0.001) FDR(0.01) FDR(0.05)

70 30 70

850 60 60

50 50 20 40 40

30 30

10 20 20

Percentage Change in the Number of Significant Contacts 10 10

0.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.9 0.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.9 0.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.90.5 0.6 0.7 0.8 0.9 Poster

Figure 3–Figure supplement 2. Comparison of signicant interactions as a function of posterior probabilities of multi-read assignment (IMR90 40kb). Percentage change in the numbers of sig- nicant interactions gained (Green) and lost (Purple) by the Uni&Multi-setting compared to the Uni-setting across individual IMR90 replicates for varying FDR and allocation probability thresholds. Manuscript submitted to eLife % Improvement of SignContact % Improvement Sequencing Depth % Multi − reads/Uni reads Reads % Mappable Reads % Valid Coverage Ratio Cis/Trans 1 % Improvement of SignContact ● ● ● ● 0.8

Sequencing Depth ● ● 0.6

0.4 % Multi−reads/Uni−reads ● 851 0.2

% Mappable Reads ● ● ● 0

−0.2 % Valid Reads ● ● ● ● −0.4

Coverage ● ● −0.6

−0.8 Cis/Trans Ratio ● ● −1

Figure 3–Figure supplement 3. Heatmap for marginal correlations of percentage increase in the number of identied signicant interactions (FDR 5%) with indicators of data quality across study datasets excluding P. falciparum at 40kb. The relative percentage of multi-reads added to uni-reads as well as cis-to-trans ratio is leading impact factors (with p-values f 0.005) followed by the percentage of mappable reads and valid reads. Manuscript submitted to eLife

●a human ●a mouse human mouse

0.001 0.01

80

● ● IMR90(36bp) ● ● IMR90(36bp) ● ●

60 IMR90(36bp) ● ● IMR90(36bp) ● ● ●● ● ● ● ● Cortex(50bp) Cortex(50bp) GM12878(101bp) ● Cortex(50bp) ● ● ● 40 Cortex(50bp) ● Cortex(50bp) Cortex(50bp) ● ● ● ● Cortex(50bp) GM12878(101bp) ● ● GM12878(101bp) Cortex(50bp) ● GM12878(101bp) IMR90(36bp) GM12878(101bp) A549(151bp) ● IMR90(36bp) ● GM12878(101bp) ● ● A549(151bp) GM12878(101bp) A549(151bp) ● ● ● IMR90(36bp) ● ● IMR90(36bp) A549(151bp) A549(151bp) ● A549(151bp) IMR90(36bp) ● A549(151bp) ● ● ● ● ● IMR90(36bp) A549(151bp) GM12878(101bp) ● IMR90(36bp) 20 ●IMR90(36bp)

0.05 0.1

852 80 ● ● % Novel Significant Contact % Novel

● ●

60

% Improvement in the Number of Significant Interactions ● ● ● ● ● ● ● GM12878(101bp) GM12878(101bp) ● GM12878(101bp) IMR90(36bp) ● ● ● GM12878(101bp) ● ● IMR90(36bp) Cortex(50bp) Cortex(50bp) IMR90(36bp) 40 ● ● ●● Cortex(50bp) ● Cortex(50bp) IMR90(36bp) ● ● GM12878(101bp) ● Cortex(50bp) Cortex(50bp) ● ●● Cortex(50bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) Cortex(50bp) ● ● ● A549(151bp) ● ● ● ● ● A549(151bp)● A549(151bp) ● A549(151bp) ● A549(151bp) A549(151bp) A549(151bp) ● A549(151bp)

IMR90(36bp) ● ● IMR90(36bp) IMR90(36bp) ● ● 20 IMR90(36bp) ● IMR90(36bp) ●IMR90(36bp) ●IMR90(36bp) ●IMR90(36bp)

1 2 3 1 2 3 Cis/Trans Ratio

Figure 3–Figure supplement 4. Percentage change in the numbers of signicant interactions with respect to cis-to-trans ratio excluding P. falciparum at 40kb. Cis-to-trans ratio is dened as the the number of valid intra-chromosomal contact counts divided by the number of valid inter- chromosomal contact counts. Manuscript submitted to eLife

A GM12878 Gain of Significant Interactions Resolution 5k 10k 40k

FDR 0.001 FDR 0.01 FDR 0.05 FDR 0.1

80

60

40 5k 5k 5k 5k % Improvement in the Number of Significant Interactions 10k 40k 10k 40k 10k 40k 10k 40k

B C Restriction Enzyme MboI DpnII FDR 0.001 FDR 0.01 FDR 0.001 FDR 0.01 84.2 84.28 81.67 81.8 rep9 80.55 80.93 rep6 rep9 rep6 80 76.23 76.82 72.29 80 rep4 rep4 71.39 71.035 70.69 rep32 rep32 64.03 62.95 rep2 59.9 60.4 rep2 60 70 rep8 rep8 49.84 47.99 rep33 43.86 41.45 rep3 40 60 rep33 rep3

rep5 50 20 rep5 853 rep7 rep7 40 0

FDR 0.05 FDR 0.1 FDR 0.05 FDR 0.1 84.15 84.45 rep6 rep9 rep9 82.25 82.43 81.71 81.53 rep6 80 rep4 73.55 80 rep4 70.96 71.79 70.36 62.62 63.83 rep8 59.67 59.88 rep32 rep8 60 70 rep2 53.19 51.04 51.77 50.64 47.31 rep33 rep32 38.05 60 40 rep3 rep2 % Improvement in the Number of Significant Interactions (5kb) % Improvement in the Number of Significant Interactions (5kb) rep7 rep5 rep33 50 rep5 20 rep7

40 0 rep3 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9 2e+08 4e+08 6e+08 2e+08 4e+08 6e+08 rep32 rep33 rep32 rep33 Sequencing Depth

Figure 3–Figure supplement 5. (A) Percentage change in the numbers of signicant interactions of GM12878 datasets at 5kb, 10kb, 40kb resolutions, respectively, across varying FDR thresholds. (B) Percentage change in the numbers of signicant interactions of 8 replicates that are based on MboI as the restriction enzyme and 2 replicates with DpnII as restriction enzyme at 5kb resolution across dierent FDR thresholds. (C) Percentage change in the numbers of signicant interactions with respect to the replicate sequencing depth of all 10 replicates of GM12878 at 5kb resolution across dierent FDR thresholds. Manuscript submitted to eLife

a human a mouse a P.falciparum

FDR 0.001 FDR 0.01

P.falciparum(40bp) P.falciparum(40bp)

P.falciparum(40bp) P.falciparum(40bp)

IMR90(36bp) ESC 2012(36bp) 75 P.falciparum(40bp) P.falciparum(40bp) ESC 2012(36bp) IMR90(36bp) ESC 2012(36bp) ESC 2012(36bp)

IMR90(36bp) IMR90(36bp) ESC 2017(50bp) ESC 2017(50bp) ESC 2017(50bp) ESC 2017(50bp) ESC 2017(50bp) 50 ESC 2017(50bp) Cortex(50bp) Cortex(50bp) ESC 2017(50bp) ESC 2017(50bp) GM12878(101bp) Cortex(50bp) Cortex(50bp) Cortex(50bp) Cortex(50bp) GM12878(101bp) Cortex(50bp) Cortex(50bp) GM12878(101bp) A549(151bp) GM12878(101bp) IMR90(36bp) GM12878(101bp) A549(151bp) IMR90(36bp) A549(151bp) GM12878(101bp) GM12878(101bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) 25 IMR90(36bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) GM12878(101bp) A549(151bp) IMR90(36bp)

FDR 0.05 FDR 0.1 854

P.falciparum(40bp) ESC 2012(36bp) P.falciparum(40bp) ESC 2012(36bp) 75 % Improvement in the Number of Significant Interactions

ESC 2012(36bp) ESC 2012(36bp)

P.falciparum(40bp) P.falciparum(40bp)

ESC 2017(50bp) ESC 2017(50bp) P.falciparum(40bp) P.falciparum(40bp) ESC 2017(50bp) ESC 2017(50bp) ESC 2017(50bp) 50 GM12878(101bp) GM12878(101bp) ESC 2017(50bp) ESC 2017(50bp) IMR90(36bp) IMR90(36bp) GM12878(101bp) IMR90(36bp) Cortex(50bp) IMR90(36bp) Cortex(50bp) GM12878(101bp) Cortex(50bp) ESC 2017(50bp) Cortex(50bp) Cortex(50bp) Cortex(50bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) A549(151bp) Cortex(50bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) 25 IMR90(36bp) IMR90(36bp) Cortex(50bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) 0 100 200 300 400 0 100 200 300 400 Coverage (%)

Figure 3–Figure supplement 6. Percentage change in the numbers of signicant interactions with respect to coverage at 40kb. Coverage is approximated as the ratio of the sequencing depth to the genome size. Manuscript submitted to eLife

a human a mouse 20 40 60 80 20 40 60 80

0.001 0.01

IMR90(36bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) Cortex(50bp) Cortex(50bp) ESC−2012(36bp) ESC−2012(36bp) Cortex(50bp) Cortex(50bp) ESC−2017(50bp) Cortex(50bp) Cortex(50bp) ESC−2017(50bp) ESC−2017(50bp)ESC−2017(50bp) ESC−2017(50bp)ESC−2017(50bp) Cortex(50bp) IMR90(36bp) Cortex(50bp) ESC−2017(50bp) IMR90(36bp) ESC−2017(50bp) IMR90(36bp) IMR90(36bp) 20

ESC−2012(36bp) ESC−2012(36bp) IMR90(36bp) IMR90(36bp)

IMR90(36bp) IMR90(36bp)

15 A549(151bp) A549(151bp)

A549(151bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) GM12878(101bp)

GM12878(101bp) GM12878(101bp)

GM12878(101bp) GM12878(101bp)

0.05 0.1 855 IMR90(36bp) IMR90(36bp) IMR90(36bp) IMR90(36bp) Cortex(50bp) Cortex(50bp) ESC−2012(36bp) ESC−2012(36bp)

Cortex(50bp) Cortex(50bp) ESC−2017(50bp) Cortex(50bp) Cortex(50bp) ESC−2017(50bp) ESC−2017(50bp)ESC−2017(50bp) ESC−2017(50bp)ESC−2017(50bp) IMR90(36bp) Cortex(50bp) IMR90(36bp) Cortex(50bp) ESC−2017(50bp) ESC−2017(50bp) IMR90(36bp) IMR90(36bp) # of Multi − reads/# Uni reads * 100% 20

ESC−2012(36bp) ESC−2012(36bp) IMR90(36bp) IMR90(36bp)

IMR90(36bp) IMR90(36bp)

15 A549(151bp) A549(151bp)

A549(151bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) A549(151bp) GM12878(101bp) GM12878(101bp) GM12878(101bp) GM12878(101bp)

GM12878(101bp) GM12878(101bp)

GM12878(101bp) GM12878(101bp) 1 2 3 1 2 3 Cis/Trans Ratio

Figure 3–Figure supplement 7. Percentage change in the numbers of signicant interactions as a function of the percentage of mHi-C rescued multi-reads in comparison to uni-reads and cis-to-trans ratios at 40kb. Manuscript submitted to eLife

IMR90 rep1 (Threshold 0.5) IMR90 rep2 (Threshold 0.5) Uni-setting 234,654 247,014 (FDR 1%) Uni-setting.Specific (Uni&Multi FDR 1%) 186,392 198,644 Uni-setting.Specific (Uni&Multi FDR 10%) Uni&Multi-setting (FDR 1%) Uni&Multi-setting.Specific (Uni FDR 1%) 63,963 62,867 Uni&Multi-setting.Specific (Uni FDR 10%) 32,389 31,310 15,701 14,497 1,569 1,805

IMR90 rep3 (Threshold 0.5) IMR90 rep4 (Threshold 0.5) 214,053 306,674

173,575 257,983

57,282 69,740 36,004 43,796 16,804 21,049 2,753 3,375

IMR90 rep5 (Threshold 0.5) IMR90 rep6 (Threshold 0.5) 29,961 45,505 856

19,268 29,648

11,695 17,493

11,243 4,559

1,002 60 1,636 242

Figure 3–Figure supplement 8. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for each of six replicates of IMR90 at 40kb. Uni&Multi-setting. Specic (Uni FDR 10%) is the set of signicant interactions identied at 1% FDR by the Uni&Multi-setting but are still unrecoverable by the Uni-setting even with a liberal FDR of 10%. The detailed descriptions of the groups are as follows: Uni-setting (FDR 1%): # of signicant interactions identied by the Uni-setting at 1% FDR. Uni&Multi-setting (FDR 1%): # of signicant interactions identied by the Uni&Multi- setting at 1% FDR. Uni-setting.Specic (Uni&Multi FDR 1%): # of signicant interactions identied by Uni-setting (FDR 1%) but not by Uni&Multi-setting at 1% FDR. Uni-setting.Specic (Uni&Multi FDR 10%): # of signicant interactions identied by Uni-setting (FDR 1%) but not by Uni&Multi-setting at 10% FDR. Uni&Multi-setting.Specic (Uni FDR 1%): # of signicant interactions identied by Uni&Multi-setting (FDR 1%) but not by Uni-setting at 1% FDR. Uni&Multi-setting.Specic (Uni FDR 10%): # of signicant interactions identied by Uni&Multi-setting (FDR 1%) but not by Uni-setting at 10% FDR. Manuscript submitted to eLife

GM12878 rep2 (Resolution:40000) GM12878 rep3 (Resolution:40000)

Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting 345557 345565

267959 263611

114696 122648

37098 38373 40694 40078 12879 14604

GM12878 rep4 (Resolution:40000) GM12878 rep6 (Resolution:40000)

857 Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting 182723 85260

132198 60142

37607 74454

15543 23929 27922 12489 3969 1816

Figure 3–Figure supplement 9. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for each of four replicates of GM12878 at 40kb resolution. Color labels are the same as Figure 8 Manuscript submitted to eLife

GM12878 rep2 (Resolution:10000) GM12878 rep3 (Resolution:10000)

Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting 84495 107536

71574 53878 51221 58082

29347 32454 20604 22120

3096 3868

GM12878 rep4 (Resolution:10000) GM12878 rep6 (Resolution:10000)

858 Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting 29674 17410

13386 21921 11380 17773 16982 10273

10020 6249

2017 1275

Figure 3–Figure supplement 10. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for each of four replicates of GM12878 at 10kb resolution. Color labels are the same as Figure 8 Manuscript submitted to eLife

GM12878 rep2 (Resolution:5000) GM12878 rep3 (Resolution:5000) GM12878 rep4 (Resolution:5000)

Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting 197964 61838 26507

23707 22325 47603

38552 115745 116597 36650 14591

11791 24317

58455

34378 10165

2043 3032

GM12878 rep5 (Resolution:5000) GM12878 rep6 (Resolution:5000) GM12878 rep7 (Resolution:5000)

Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting 101395 17112 42707

15260 14576

29686 67669 60274 26001 9412

7560

35833 14734 12980 26548

7265 9650 1260

GM12878 rep8 (Resolution:5000) GM12878 rep9 (Resolution:5000) GM12878 rep32 (Resolution:5000)

Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting Uni−setting Uni&Multi−setting 54574 28047 10015 859 25537 8986 24215 8683 43980

36894

31973 5664 15219

12709 4635 21379

1281 5343 1867

GM12878 rep33 (Resolution:5000)

Uni−setting Uni&Multi−setting 40244

30459

26504 24697

14912

5077

Figure 3–Figure supplement 11. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for each of ten replicates of GM12878 at 5kb resolution. Color label is the same as Figure 8 Manuscript submitted to eLife

GM12878 (Resolution:5000) − 10 replicates

Uni−setting Uni&Multi−setting 580403

398404

353208

278869

171209

46983

GM12878 (Resolution:10000) − 4 replicates

Uni−setting Uni&Multi−setting 239115

153498 144610

860 90163

58993

10256

GM12878 (Resolution:40000) − 4 replicates

Uni−setting Uni&Multi−setting 959105

723910

349405

114210 121916

33268

Figure 3–Figure supplement 12. Recovery of signicant interactions identied at FDR 1% by analysis at FDR 10% for GM12878 summed across replicates at 5kb, 10kb, and 40kb resolutions. Color label is the same as Figure 8 Manuscript submitted to eLife

A rep5 rep6

1.00

0.75 e Rate 0.50

ositi v ue P r

T 0.25

0.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 False Positive Rate B 861 rep5 rep6

1.00

0.75

0.50 Precision 0.25

0.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Recall

Figure 3–Figure supplement 13. ROC and PR curves for replicates 5 and 6 of IMR90. Sets of "True" interactions and "True" non-interactions are dened by reproducible signicant/insignicant interactions across replicate 1-4 of both Uni-setting and Uni&Multi-setting (See Materials and Methods). signicant interactions of replicates 5 and 6 are utilized to compare ROC (A) and PR (B) curves among the Uni- and Uni&Multi-settings. Manuscript submitted to eLife A 30

20

10

0

4_Tx Average Number of Contacts within Significant Interactions at FDR 5% 7_Enh 6_EnhG 1_TssA 5_TxWk 9_Het 3_TxFlnk12_EnhBiv2_TssAFlnk13_ReprPC 11_BivFlnk10_TssBiv 15_Quies8_ZNF/Rpts B 14_ReprPCWk 20

862 15

10

5

0

Average Number of Contacts within Significant Interactions at FDR 5% olII CTCF p300 P p65 Others H3K27ac H3K4me3 H3K4me1H3K36me3H3K27me3 Uni-setting.Specific Uni&Multi-setting.Specific Common

Figure 3–Figure supplement 14. Quantication of signicant interactions for chromHMM states and ChIP-seq peak regions (IMR90). (A) Grouping the signicant interactions into three groups (Uni- setting specic, Uni&Multi-setting specic, Common to both settings) reveals the largest enrichment dierences in chromHMM annotation categories related to repetitive regions, such as Zinc Finger Genes & Repeats as well as Heterochromatin. (B) Average number of signicant interactions across regions with a variety of ChIP-seq signals. Red/green labels denote smaller/larger dierences between Uni-setting.Specic and Uni&Multi-setting.specic compared to the dierences observed in the "Others" category that depicts non-peak regions. Manuscript submitted to eLife

chr1 (p36.21-p36.13) 1p31.1 1q12 32.1 1q41 q4344 Scale 1 Mb hg19 chr1: 16,500,000 17,000,000 17,500,000 18,000,000 40000 _ HiC_marginal_rep1-6_Uni-setting HiC_Uni 0 _ 40000 _ HiC_marginal_rep1-6_Uni&Multi-setting HiC_uniMulti 0 _ CTCF_uniMulti_peak CTCF_uniMulti_peak 40 _ CTCF_rep1_uniMulti CTCF_rep1_uniMulti 0 _ 40 _ CTCF_rep2_uniMulti CTCF_rep2_uniMulti 0 _ H3K27me3_uniMulti_peak H3K27me3_uniMulti_peak 40 _ H3K27me3_rep1_uniMulti H3K27me3_rep1_uniMulti 0 _ 40 _ H3K27me3_rep2_uniMulti H3K27me3_rep2_uniMulti 0 _ H3K27ac_uniMulti_peak H3K27ac_uniMulti_peak 40 _ H3K27ac_rep1_uniMulti H3K27ac_rep1_uniMulti 0 _ 40 _ H3K27ac_rep2_uniMulti H3K27ac_rep2_uniMulti 0 _ H3K36me3_uniMulti_peak H3K36me3_uniMulti_peak 40 _ H3K36me3_rep1_uniMulti H3K36me3_rep1_uniMulti 0 _ 40 _ H3K36me3_rep2_uniMulti H3K36me3_rep2_uniMulti 0 _ H3K4me1_uniMulti_peak H3K4me1_uniMulti_peak 40 _ H3K4me1_rep1_uniMulti H3K4me1_rep1_uniMulti 0 _ 40 _ H3K4me1_rep2_uniMulti H3K4me1_rep2_uniMulti 0 _ 863 H3K4me3_uniMulti_peak H3K4me3_uniMulti_peak 40 _ H3K4me3_rep1_uniMulti H3K4me3_rep1_uniMulti 0 _ 40 _ H3K4me3_rep2_uniMulti H3K4me3_rep2_uniMulti 0 _ Basic Gene Annotation Set from ENCODE/GENCODE Version 17 PLEKHM2 SPEN HSPB7 ANO7P1 SZRD1 U1 ESPNP U1 SDHB PADI1 RCC2 PLEKHM2 snoU13 EPHA2 SZRD1 NBPF1 MST1L CROCC Y_RNA PADI4 ARHGEF10L AL121992.1 SPEN EPHA2 SZRD1 NBPF1 RNU1-4 RNU1-2 SDHB PADI1 RCC2 ARHGEF10L RP11-288I21.1 ZBTB17 ARHGEF19 SZRD1 NBPF1 RP11-108M9.1 PADI2 PADI4 ARHGEF10L PLEKHM2 ZBTB17 ARHGEF19 SZRD1 NBPF1 RP11-108M9.2 PADI2 PADI1 RCC2 SLC25A34 ZBTB17 ARHGEF19 SZRD1 NBPF1 MIR3675 PADI2 AC004824.2 ARHGEF10L SLC25A34 ZBTB17 C1orf134 SPATA21 CROCCP2 RP11-108M9.3 PADI2 PADI4 ARHGEF10L RP11-169K16.4 ZBTB17 RSG1 NECAP2 RNU1-3 CROCC RP11-380J14.1 PADI6 ARHGEF10L TMEM82 C1orf64 FBXO42 RNU1-1 AL021920.2 MFAP2 PADI1 snoU13 TMEM82 RP11-5P18.5 FBXO42 AL355149.2 RP11-108M9.4 PADI1 ARHGEF10L FBLIM1 HSPB7 SPATA21 AL137798.1 MFAP2 PADI3 ARHGEF10L FBLIM1 HSPB7 SPATA21 AL021920.1 MFAP2 MIR3972 FBLIM1 HSPB7 SPATA21 MFAP2 RP1-20B21.4 FBLIM1 HSPB7 NECAP2 RP1-37C10.3 AC004824.1 FBLIM1 HSPB7 NECAP2 ATP13A2 FBLIM1 CLCNKA NECAP2 ATP13A2 RP11-169K16.9 CLCNKA NECAP2 ATP13A2 CLCNKA NECAP2 ATP13A2 CLCNKA RP4-798A10.2 CLCNKA CROCCP3 CLCNKB RP4-798A10.7 CLCNKB RP4-798A10.4 FAM131C RP5-875O13.1 FAM131C AL355149.1 RP11-276H7.2 RP11-276H7.3 ARHGEF19-AS1 10 _ RNA-seq_minusStrand RNA-seq_minus 0 _ 10 _ RNA-seq_plusStrand RNA-seq_plus 0 _

Figure 3–Figure supplement 15. Marginalized Hi-C signal (contact counts aggregated across the genomic coordinates for six replicates of IMR90), ChIP-seq coverage and peaks and gene expression for chr1:16,000,000-18,000,000. Highlighted in grey is a region with signicantly dierent marginal Hi-C signal between Uni-setting and Uni&Multi-setting. Manuscript submitted to eLife

chr2 (q13-q14.1) 21 14 p12 13 14.3 31.1 34 35 Scale 1 Mb hg19 chr2: 113,500,000 114,000,000 114,500,000 115,000,000 40000 _ HiC_marginal_rep1-6_Uni-setting HiC_Uni 0 _ 40000 _ HiC_marginal_rep1-6_Uni&Multi-setting HiC_uniMulti 0 _ CTCF_uniMulti_peak CTCF_uniMulti_peak 40 _ CTCF_rep1_uniMulti CTCF_rep1_uniMulti 0 _ 40 _ CTCF_rep2_uniMulti CTCF_rep2_uniMulti 0 _ H3K27me3_uniMulti_peak H3K27me3_uniMulti_peak 40 _ H3K27me3_rep1_uniMulti H3K27me3_rep1_uniMulti 0 _ 40 _ H3K27me3_rep2_uniMulti H3K27me3_rep2_uniMulti 0 _ H3K27ac_uniMulti_peak H3K27ac_uniMulti_peak 40 _ H3K27ac_rep1_uniMulti H3K27ac_rep1_uniMulti 0 _ 40 _ H3K27ac_rep2_uniMulti H3K27ac_rep2_uniMulti 0 _ H3K36me3_uniMulti_peak H3K36me3_uniMulti_peak 40 _ H3K36me3_rep1_uniMulti H3K36me3_rep1_uniMulti 0 _ 40 _ H3K36me3_rep2_uniMulti H3K36me3_rep2_uniMulti 0 _ H3K4me1_uniMulti_peak H3K4me1_uniMulti_peak 40 _ H3K4me1_rep1_uniMulti H3K4me1_rep1_uniMulti 0 _ 40 _ H3K4me1_rep2_uniMulti H3K4me1_rep2_uniMulti 0 _ H3K4me3_uniMulti_peak H3K4me3_uniMulti_peak 864 40 _ H3K4me3_rep1_uniMulti H3K4me3_rep1_uniMulti 0 _ 40 _ H3K4me3_rep2_uniMulti H3K4me3_rep2_uniMulti 0 _ Basic Gene Annotation Set from ENCODE/GENCODE Version 17 POLR1B CKAP2L IL37 IL36RN PAX8 CBWD2 U6 snoU13 U3 snoU13 U2 POLR1B CKAP2L IL37 IL36RN PAX8 CBWD2 SLC35F5 AC010982.1 DPP10 POLR1B CKAP2L IL37 IL1F10 PAX8 CBWD2 SLC35F5 ACTR3 DPP10 POLR1B IL1A IL37 IL1F10 PAX8 FOXD4L1 MIR4782 ACTR3 POLR1B IL1B IL36G U6 PAX8 AC016745.1 RABL2A RP11-141B14.1 AC010982.2 POLR1B NT5DC4 IL37 IL1F10 PAX8 RP11-395L14.3 SLC35F5 AC110769.3 Y_RNA IL1B IL36G IL1RN IGKV1OR2-108 WASH2P AC024704.2 CHCHD5 AC079753.1 IL36A PSD4 AC016745.3 RPL23AP7 AC104653.1 CHCHD5 IL36B PSD4 RP11-395L14.4 ACTR3 CHCHD5 IL36B PSD4 FAM138B AC110769.1 AC012442.5 AC016724.8 MIR1302-3 AC012442.6 IL1RN RABL2A AC079922.3 IL1RN RABL2A SLC20A1 IL1RN RABL2A SLC20A1 IL1RN RABL2A NT5DC4 IL1RN RABL2A AC016683.5 RABL2A AC016683.5 AC017074.1 AC016683.6 AC017074.2 AC016683.6 AC016683.6 AC016683.6 AC016683.6 AC016683.6 AC016683.6 AC016683.6 AC016683.6 AC016683.6 RP11-65I12.1 AC016683.6 10 _ RNA-seq_minusStrand RNA-seq_minus 0 _ 10 _ RNA-seq_plusStrand RNA-seq_plus 0 _

Figure 3–Figure supplement 16. Marginalized Hi-C signal (contact counts aggregated across the genomic coordinates for six replicates of IMR90), ChIP-seq coverage and peaks and gene expression for chr2:113460,000-116,000,000. Highlighted in grey is a region with signicantly dierent marginal Hi-C signal between Uni-setting and Uni&Multi-setting. Manuscript submitted to eLife

chr9 (q13) 24.19p23 p21.3 21.1 12 9q12 13 q31.1 3233.1 33.3 Scale 200 kb hg19 chr9: 66,300,000 66,400,000 66,500,000 66,600,000 66,700,000 66,800,000 66,900,000 40000 _ HiC_marginal_rep1-6_Uni-setting HiC_Uni 0 _ 40000 _ HiC_marginal_rep1-6_Uni&Multi-setting HiC_uniMulti 0 _ CTCF_uniMulti_peak CTCF_uniMulti_peak 40 _ CTCF_rep1_uniMulti CTCF_rep1_uniMulti 0 _ 40 _ CTCF_rep2_uniMulti CTCF_rep2_uniMulti 0 _ H3K27me3_uniMulti_peak H3K27me3_uniMulti_peak 40 _ H3K27me3_rep1_uniMulti H3K27me3_rep1_uniMulti 0 _ 40 _ H3K27me3_rep2_uniMulti H3K27me3_rep2_uniMulti 0 _ H3K27ac_uniMulti_peak H3K27ac_uniMulti_peak 40 _ H3K27ac_rep1_uniMulti H3K27ac_rep1_uniMulti 0 _ 40 _ H3K27ac_rep2_uniMulti H3K27ac_rep2_uniMulti 0 _ H3K36me3_uniMulti_peak H3K36me3_uniMulti_peak 40 _ H3K36me3_rep1_uniMulti H3K36me3_rep1_uniMulti

865 0 _ 40 _ H3K36me3_rep2_uniMulti H3K36me3_rep2_uniMulti 0 _ H3K4me1_uniMulti_peak H3K4me1_uniMulti_peak 40 _ H3K4me1_rep1_uniMulti H3K4me1_rep1_uniMulti 0 _ 40 _ H3K4me1_rep2_uniMulti H3K4me1_rep2_uniMulti 0 _ H3K4me3_uniMulti_peak H3K4me3_uniMulti_peak 40 _ H3K4me3_rep1_uniMulti H3K4me3_rep1_uniMulti 0 _ 40 _ H3K4me3_rep2_uniMulti H3K4me3_rep2_uniMulti 0 _ Basic Gene Annotation Set from ENCODE/GENCODE Version 17 RP11-262H14.1 RP11-262H14.4 U6 AL353626.1 RNA5SP283 RP11-318K12.2 AL353626.2 RP11-262H14.7 RP11-262H14.3 10 _ RNA-seq_minusStrand RNA-seq_minus 0 _ 10 _ RNA-seq_plusStrand RNA-seq_plus 0 _

Figure 3–Figure supplement 17. Marginalized Hi-C signal (contact counts aggregated across the genomic coordinates for six replicates of IMR90), ChIP-seq coverage and peaks and gene expression for chr9:66,250,000-66,950,000. Highlighted in grey is a region with signicantly dierent marginal Hi-C signal between Uni-setting and Uni&Multi-setting. Manuscript submitted to eLife

300

200

100

Contact Count Contact 0

75900000 75920000 75940000 75960000 75980000 76000000 76020000 76040000 76060000 76080000 76100000 76120000 76140000 76160000 Mappability Hi-C Marginal 8872 (Uni&Multi) 3864 Hi-C Marginal 7130 1163 (Uni) SRRM 3 HSPB1 YWHAG SRCRB4 D ZP3 DTX2 ZP3 DTX2 DTX2 FDPSP2 UPK3B DTX2 UPK3B RefSeq genes Enhancers Genic Enhancers UPK3B ChromHMM 866 Strong Transcription

Uni&Multi- setting

Uni-setting

q11.23 chr7 Figure 4–Figure supplement 1. Examples of signicant promoter-enhancer interactions repro- ducible among 6 replicates under Uni- and Uni&Multi-settings (IMR90) on chromosome 7. Manuscript submitted to eLife

400

300

200

100

Contact Count Contact 0

15170000 15190000 15210000 15230000 15250000 15270000 15290000 15310000 15330000 15350000 15370000 15390000 15410000 15430000 15450000 15470000 15490000

Mappability Hi-C Marginal 11558 8946 10816 7038 (Uni&Multi) Hi-C Marginal 10373 5937 3448 1716 (Uni) PMP22 TEKT3 TVP23C-CDRT4 CDRT1 PMP22 TVP23C-CDRT4 PMP22 CDRT4 TVP23C PMP22 TVP23C PMP22 CDRT1 PMP22 867 RefSeq genes PMP22 MIR4731 Enhancers Strong Transcription ChromHMM Genic Enhancers Active TSS

Uni&Multi- setting

Uni-setting

p12 chr17 Figure 4–Figure supplement 2. Examples of signicant promoter-enhancer interactions repro- ducible among 6 replicates under Uni- and Uni&Multi-settings (IMR90 ) on chromosome 17. Manuscript submitted to eLife

19008

24mer Mappability ChromHMM GENCODE V19 genes Uni-setting Rep1 Uni-setting Rep2

Uni-setting Rep3

Uni-setting Rep4

Uni-setting Rep5 868 Uni-setting Rep6

Uni&Multi-setting Specifc Rep1

Uni&Multi-setting Specifc Rep2

Uni&Multi-setting Specifc Rep3

Uni&Multi-setting Specifc Rep4

Uni&Multi-setting Specifc Rep5

Uni&Multi-setting Specifc Rep6

Figure 4–Figure supplement 3. Signicant promoter-emhancer interactions under Uni- and Uni$Multi-settings across 6 IMR90 replicates (Chromosome 17). This is the individual replicate level data for Figure 4–Figure supplement 4 in a large genome region. Manuscript submitted to eLife

60 Common

40

869 20 Percentage of Genes

0

0 (0, 1) [1, 5) [5, 10) [10, Inf) Gene Expression (TPM)

Figure 4–Figure supplement 4. Expression distribution of genes promoters of which have signi- cant promoter interactions (IMR90). Genes harboring signicant interactions (at 5% FDR) within their promoters are grouped into dierent gene expression categories. Manuscript submitted to eLife

A

200

150 # of TADs # of

100

50

chr1chr2chr3chr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrX

B

200 870

150 # of TADs # of 100

50

Figure 5–Figure supplement 1. The number of topologically associating domains (TADs) detected in each chromosome under Uni-setting and Uni&Multi-setting (IMR90). (A) Total number of TADs identied across six replicates for each chromosome. (B) Total number of TADs identied across 23 chromosomes for each replicate. Manuscript submitted to eLife

A B C ies

r Convergent Tandem Right Tandem Left Divergent 76 90 80

75

60 89 74 ergent CTCF Motif v 40

73 88

AD with Co n 20 72 ADs with CTCF Peaks at Both Bounda % of CTCF Motif at TAD Boundaries % of T 87 0 % of T

D E

5

ies 79 4

78 3 871 ADs with

77 2 y Rate of TADs Detection e r v o 76 1 ergnet CTCF Motif at Bounda r v % of Adjusted T % of Adjusted Co n

alse Disc 0 75 F rep1 rep2 rep3 rep4 rep5 rep6

Figure 5–Figure supplement 2. Comparison of CTCF peaks at the boundaries of topologically associating domains (TADs) under Uni-setting and Uni&Multi-setting across six replicates of IMR90. (A) Percentages of TADs that have CTCF peaks at boundaries. (B) Percentages of TADs that have both CTCF peaks and convergent CTCF motifs at the boundaries. (C) Percentages of four types of CTCF motifs orientations at TAD boundaries. Convergent motif pairs are those with a forward strand motif upstream of TAD boundaries and a reverse strand motif downstream of TAD boundaries. Tandem Right, similarly, represents forward-forward CTCF motif pairs. Tandem Left refers to reverse-reverse motif pairs. Divergent is reverse-forward motif pairs (D) Some TAD boundaries are adjusted under Uni&Multi-setting. Box plots depict the percentage of adjusted TADs that have convergent CTCF motifs at the boundaries. (E). False discovery rate of TADs detected under the two settings. TADs that are not reproducible and lack CTCF convergent motifs at the TAD boundaries are considered as false positives. Manuscript submitted to eLife

A Uni-setting Uni&Multi-setting

5.35 MB 33.85 MB 5.35 MB 33.85 MB 5.30 MB Chromosome 6 Chromosome

40 40 33.45 MB Chromosome 6 Chromosome 6

B Uni-setting Uni&Multi-setting

872

15.15 MB 43.65 MB 15.15 MB 43.65 MB 15.10 MB Chromosome 9 Chromosome

40 40 43.25 MB Chromosome 9 Chromosome 9

Figure 5–Figure supplement 3. Novel topologically associating domains (TADs) with CTCF peaks at TAD boundaries (IMR90). Gene tracks, 24mer mappability tracks as well as CTCF peaks are displayed above the contact matrices. (A) Example at chr6:5,350,000-33,850,000. Even in the lack of obviously low mappable contact gaps, multi-reads can enhance the existing interaction signal and reveal detectable TAD structures supported by CTCF peaks. (B) Example on chr9:15,150,000-43,650,000. TAD structure, supported by CTCF peaks at the TAD boundaries, becomes detectable as multi-reads ll in the gap in the contact matrix. Red squares at the left bottom of the matrices indicate the color scale. Manuscript submitted to eLife

A Uni-setting Uni&Multi-setting

66.80 MB 95.30 MB 66.80 MB 95.30 MB 66.75 MB Chromosome 1 Chromosome

40 40 94.90 MB Chromosome 1 Chromosome 1 B 873 Uni-setting Uni&Multi-setting

0.00 MB 28.50 MB 0.00 MB 28.50 MB 0.00 MB Chromosome 5 Chromosome

40 40 28.15 MB Chromosome 5 Chromosome 5

Figure 5–Figure supplement 4. Existing topologically associating domains (TADs) with adjusted boundaries supported by CTCF peaks at the new TAD boundaries (IMR90). (A) An example from chr1:66,800,000-95,300,000. (B) An example from chr5:0-28,500,000. Red squares at the left bottom of the matrices indicate the color scale. Manuscript submitted to eLife

A Uni-setting Uni&Multi-setting

0.00 MB 25.00 MB 0.00 MB 25.00 MB 0.00 MB Chromosome 12 Chromosome

40 40 25.00 MB Chromosome 12 Chromosome 12

874 B Uni-setting Uni&Multi-setting

42.80 MB 71.30 MB 42.80 MB 71.30 MB 42.80 MB Chromosome 13 Chromosome

50 50 71.30 MB Chromosome 13 Chromosome 13

Figure 5–Figure supplement 5. Existing topologically associating domains (TADs) with adjusted boundaries supported by CTCF peaks at the new TAD boundaries (IMR90). (A) An example from chr12:0-25,000,000. (B) An example from chr13:42,800,000-71,300,000. Red squares at the left bottom of the matrices indicate the color scale. Manuscript submitted to eLife

A Uni-setting Uni&Multi-setting

105.60 MB 134.10 MB 105.60 MB 134.10 MB 105.50 MB Chromosome 2 Chromosome

50 50 133.65 MB Chromosome 2 Chromosome 2 B 875 Uni-setting Uni&Multi-setting

60.00 MB 88.50 MB 60.00 MB 88.50 MB 60.25 MB Chromosome 3 Chromosome

40 40 88.40 MB Chromosome 3 Chromosome 3

Figure 5–Figure supplement 6. False positive topologically associating domains (TADs) detected by the Uni-setting due to the missing reads in low mappability regions (IMR90). TADs that are split by white gaps are no longer detected once multi-reads are incorporated, indicating that they are highly likely false positives under the Uni-setting. (A) Example on chr2:105,600,000-134,100,000. (B) Example on chr3:60,000,000-88,500,000. Red squares at the left bottom of the matrices indicate the color scale. Manuscript submitted to eLife

A Uni-setting Uni&Multi-setting

0.00 MB 28.50 MB 0.00 MB 28.50 MB 0.00 MB Chromosome 4 Chromosome

50 50 28.15 MB Chromosome 4 Chromosome 4 B

876 Uni-setting Uni&Multi-setting

0.00 MB 28.50 MB 0.00 MB 28.50 MB 0.00 MB Chromosome 16 Chromosome

40 40 28.50 MB Chromosome 16 Chromosome 16

Figure 5–Figure supplement 7. False positive topologically associating domains (TADs) detected by the Uni-setting due to the missing reads in low mappability regions (IMR90). TADs that are split by white gaps are no longer detected once multi-reads are incorporated, indicating that they are highly likely false positives under the Uni-setting. (A) Example on chr4:0-28,500,000. (B) Example on chr16:0-28,500,000. Red squares at the left bottom of the matrices indicate the color scale. Manuscript submitted to eLife

A Uni-setting Uni&Multi-setting

14.35 MB 42.85 MB 14.35 MB 42.85 MB 14.35 MB Chromosome 21 Chromosome

40 40 42.85 MB Chromosome 21 Chromosome 21 B 877 Uni-setting Uni&Multi-setting

60.80 MB 117.80 MB 60.80 MB 117.80 MB 61.40 MB Chromosome X Chromosome

80 80 117.70 MB Chromosome X Chromosome X

Figure 5–Figure supplement 8. False positive topologically associating domains (TADs) detected by the Uni-setting due to the missing reads in low mappability regions (IMR90). TADs that are split by white gaps are no longer detected once multi-reads are incorporated, indicating that they are highly likely false positives under the Uni-setting. (A) Example on chr21:14,350,000-42,850,000. (B) Example on chrX:60,800,000-117,800,000. Red squares at the left bottom of the matrices indicate the color scale. Manuscript submitted to eLife

25

20 22.5

15 AD Detection (*100%) AD Detection (*100%) 20.0 878 10 y Rate of T y Rate of T e r e r v v o 5 o 17.5

alse Disc 0 alse Disc F F rep1 rep2 rep3 rep4 rep5 rep6

Figure 5–Figure supplement 9. False discovery rate of TADs detected under Uni-setting and Uni&Multi-setting (IMR90). TADs that are not reproducible are labeled as false positives without considering the CTCF peaks at the TAD boundaries. Manuscript submitted to eLife

A IMR90 (40kb) Uni setting Uni&Multi setting

99.9 99.92 99.98 99.98 98.94 99.1 99.6 99.62 y r 90

60 AD bounda

30 23.38 23.48 % of T 2.69 2.6 0 DUP SINE LINE LTR DNA SATE

GM12878 (5kb) 879 B Uni setting Uni&Multi setting

99.43 99.44 97.84 97.91 y r 90 83.72 83.72 77.11 76.89

60 AD bounda 30 % of T 6.71 7.61 0.8 0 0.68 DUP SINE LINE LTR DNA SATE

Figure 5–Figure supplement 10. Percentage of TAD boundaries co-localized with dierent types of repetitive elements under Uni-setting and Uni&Multi-setting for IMR90 at 40kb and GM12878 at 5kb.

IMR90 (40kb) Average over hg19 Genome

TAD

70.96 71.13 71.89 71.36 TAD 68.75 63.62 63.66 TADboundary 60 58.43 e Elements

v TADboundary

48.98 48.85

40

30.65 30.6 28.19 880 22.4 22.4 20.21 20.24 20 18.61 16.76 16.63

5.73 age Number of Repetiti 5.62 5.59 5.04 1.97 1.46 e r 1.41 0.36 0.96 0.9

v 0 A DUP SINE LINE LTR DNA SATE

Figure 5–Figure supplement 11. Average number of repetitive elements at the reproducible topologically associating domains detected under Uni-setting and Uni&Multi-setting for IMR90 at 40kb. Such enrichment is compared to those within TADs and genomewide intervals of the same size. Manuscript submitted to eLife

A B 26.19

6e+07 20 18.16

14.21 13.1 13.15 13.29 4e+07

10

7.58

2e+07 3.02 1.22 0

0e+00 rep1 rep2 rep3 rep4 Or 36bp 50bp 75bp 100bp 125bp rrr rr C 308.23M306.49M 3e+08

243.92M 218.09M

881 2e+08

1e+08

43.94M 47.81M 34.67M 38M 9.42M 7.09M 3.4M 1.63M 0.65M 0e+00 r r r r r T T T T T Figure 6–Figure supplement 1. Summary of the sequencing depths of the full length and trimmed datasets of A549. (A) Numbers of uni-reads and multi-reads across trimmed read lengths compared to the sequencing depth of uni-reads of full read length A549 dataset (replicate 2), not including uni-reads rescued from chimeric reads. (B) Percentage of multi-reads over uni-reads in the full- length A549 datasets (4 replicates, excluding uni-reads from chimeric reads) and the trimmed datasets. Multi-to-uni percentages of the read sets in the trimmed datasets cover the range of the percentages observed in the study datasets. (C) Numbers of multi-reads from dierent trimming lengths compared to sequencing depths of the full read length A549 datasets. For the trimming setting (ii) described in Materials and Methods, multi-reads (green bars) added to the uni-read datasets (blue bars) constitute a smaller percentage of the sequencing depth while enabling analysis at the higher resolution of 10kb. Manuscript submitted to eLife

Cis + Trans TrimN: 36 (mHiC > 0.5) Cis + Trans TrimN: 50 (mHiC > 0.5) CorrectlyAssigned FalselyAssigned NotAssigned CorrectlyAssigned FalselyAssigned NotAssigned 2000000

2000000 1500000 81.93% 80.26%

1500000 62.42%

1000000 58.43%

1000000 56.89% # of Reads # of Reads 54.61% 77.75% 79.71% 70.17%

56.95% 500000 71.72% 56.31% 59.38%

500000 73.91% 75.57% 59.92% 62.54% 66.05% 63.9% 67.55% 70.8% 69.49% 69.22% 0 0 68.67% [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) Average Mappability Average Mappability

Cis + Trans TrimN: 100 (mHiC > 0.5) Cis + Trans TrimN: 125 (mHiC > 0.5) CorrectlyAssigned FalselyAssigned NotAssigned CorrectlyAssigned FalselyAssigned NotAssigned 882 1000000

3e+05 750000 82.12%

2e+05

500000 82.53% # of Reads # of Reads

250000 1e+05 87.19% 86.58% 82.26% 84.83% 86.79% 84.51% 50.11% 57.44% 79.96% 71.44% 76.01% 65.71% 48.85% 83.27% 60.24% 79.46% 74.99% 69.49% 38.96% 57.14% 31.9% 0 0e+00 0% [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) Average Mappability Average Mappability

Figure 6–Figure supplement 2. Allocation accuracy at the 40kb resolution among dierent map- pability regions for trimmed reads of varying lengths. mHi-C > 0.5 refers to the fact that only the allocations with posterior probability of assignment greater than 0.5 are evaluated. Manuscript submitted to eLife

Intra&Inter-chromosome

Resolution: 10K Resolution: 40K 90

mHiC

70

SimpleSelect

AlignerSelect acy (%) 50

Accu r DistanceSelect

Random 30

36 50 75 100 125 36 50 75 100 125

Intra-chromosome 883 Resolution: 10K Resolution: 40K

90

70 acy (%)

50 Accu r

30

36 50 75 100 125 36 50 75 100 125

Figure 6–Figure supplement 3. Intra-chromosomal and intra&inter-chromosomal allocation accu- racy with respect to trimmed read length using uni-reads of replicate 1, 3, and 4 combined with multi-reads of replicate 2 (trimming setting (ii)). Manuscript submitted to eLife

A CisTrans 40kb (mHiC > 0.5) CisTrans 10kb (mHiC > 0.5) 100 mHiC 90 90 SimpleSelect 80 DistanceSelect 80

AlignerSelect 70 70 Random

60 acy (%) 60

50 Accu r 50

40 40

30 30

20 20 36 50 75 100 36 50 75 100 Read Length B Read Length:36bp Read Length: 50bp CorrectlyAssigned FalselyAssigned NotAssigned CorrectlyAssigned FalselyAssigned NotAssigned

1200000

750000 78.96% 900000 78.57% 14.3%

500000 884 600000 12.58% # of Reads 18.46% # of Reads 17.85% 75.03% 78.97% 250000 300000 60.95% 25.1% 65.89% 32.71% 68.71% 25.76% 73.33% 41.98% 34.69% 51.09% 45.82% 55.98% 13.82% 10.88% 10.12% 0 0 8.2%

[0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) Average Mappability Average Mappability Read Length: 75bp Read Length: 100bp CorrectlyAssigned FalselyAssigned NotAssigned CorrectlyAssigned FalselyAssigned NotAssigned

6e+05

3e+05 71.14% 76.8%

4e+05

2e+05 # of Reads # of Reads

2e+05

1e+05 83.97% 82.93% 11.52% 76.65% 19.89% 81.21% 77.81% 32.01% 45.68% 81.28% 11.39% 68.4% 58.88% 22.24% 71.88% 35.62% 50.76% 64.59% 9.49% 0% 7.34% 0e+00 3.74% 0e+00

[0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) Average Mappability Average Mappability

Figure 6–Figure supplement 4. Evaluating accuracy of mHi-C allocation with simulations. (A) Allocation accuracy of mHi-C at 40kb and 10kb resolutions. (B) Allocation accuracy for simulated reads of dierent lengths among regions of varying mappability. Manuscript submitted to eLife

Read Length 36bp Resolution: 10K Resolution: 40K 90

6e+06 87.39% 80 Correctly 81.16% 55.42%

80.93% Assigned 4e+06 Falsely acy (%)

62.5% 70

78.84% Assigned # of Reads

88.87% Not Accu r 83.35%

2e+06 84.06% 68.95% 79.05% 72.94%

73.84% Assigned 85.91% 60 72.87% 80.88% 77.89% 81.1% 74.38% 76.56% 48.17% 73.16% 55.92% 60.28% 0e+00 50 [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) Average Mappability

Read Length 50bp Read Length 75bp

10000 40000 10000 40000 6e+06 4e+06 86.01%

87.56% 3e+06 83.14% 4e+06 82.25%

56.57% 2e+06 79.57% # of Reads 64.41% # of Reads

2e+06 74.44% 89.46% 89.5% 83.46% 84.41% 85.55% 1e+06 61.17% 78.29% 71.37% 72.79% 87.21% 86.72% 70.28% 76.3% 70.37% 78.99% 80.42% 67.65% 71.29% 87.35%

885 80.33% 82.98% 81.59% 73.37% 76.76% 75.68% 80.78% 67.9% 85.3% 83.43% 70.15% 75.73% 73.17% 47.11% 73.34% 53.39% 45.18% 70.39% 58.74% 46.61% 0e+00 0e+00 53.98% [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) Average Mappability Average Mappability

Read Length 100bp Read Length 125bp

10000 40000 10000 40000 3e+06

9e+05 82.89% 83.28% 81.5% 2e+06 6e+05 81.84% # of Reads # of Reads 1e+06 3e+05 88.18% 88.12% 85.39% 85.2% 87.54% 87.05% 87.63% 87.63% 82.13% 69.32% 84.25% 80.37% 77% 82.82% 86.84% 85.4% 84.13% 81.84% 66.02% 65.37% 87.29% 75.93% 64.68% 85.22% 61.53% 78.13% 79.64% 83.67% 64.22% 71.22% 74.51% 82.41% 67.84% 75.73% 71.69% 68.3% 51.12% 61.15% 37.59% 49.7% 45.77% 48.51% 19.12% 0e+00 0e+00 30.43% [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) [0, 0.05) [0.05, 0.1) [0.1, 0.15) [0.15, 0.2) [0.2, 0.25) [0.25, 0.3) [0.3, 0.35) [0.35, 0.4) [0.4, 0.45) [0.5, 0.55) [0.55, 0.6) [0.7, 0.75) Average Mappability Average Mappability

Figure 6–Figure supplement 5. Allocation accuracy across dierent mappability regions for trimmed reads of 36bp, 50bp, 75bp, 100bp, and 125bp, using uni-reads of replicate 1, 3, and 4, respectively. Manuscript submitted to eLife

A Resolution: 10K Resolution: 40K

85 Average 80 LTR DNA acy (%) LINE 75

Accu r SINE SATE DUP

70 36 50 75 100 125 36 50 75 100 125 B Correct Incorrect 36 50 75 100 125 73.46% 72.48% 4e+06

76.31% Resolution: 10kb 75.48% 3e+06 886 73.85% 2e+06 69.1% 76.84% 71.91% 80.92%81.8% 1e+06 73.82% 77.43% 81.17%83.2%83.78% 76.71% 81.82% 84.33% 81.49% 83.52% 82.21% 85.15% 82.7% 83.67% 0e+00 73.97% 76.18% 82.62% 84.66% 82.56%85.13% # of Valid Reads 4e+06 78.99% 77.67% Resolution: 40kb

3e+06 77.17%78.13%

2e+06 79.47% 75.2% 78.51% 74.66% 77.24%77.92% 1e+06 74.25% 79.23% 78.41% 78.38% 75.48%77.91%78.61% 78.56% 79.08% 76.51% 78.73% 79.69% 78.2% 79.54% 0e+00 79.61% 79.57% 80.35% 78.1% 79.36%77.94% DUPSINELINE LTR DNASATE DUPSINELINE LTR DNASATE DUPSINELINE LTR DNASATE DUPSINELINE LTR DNASATE DUPSINELINE LTR DNASATE

Figure 6–Figure supplement 6. Allocation accuracy across dierent classes of repetitive elements at 10kb and 40kb resolutions using uni-reads of replicate 1, 3, and 4 combined with multi-reads of replicate 2 (trimming setting (ii)). (A) mHi-C accuracy among dierent types of repetitive element classes with respect to trimmed read length. (B) Allocation accuracy across dierent classes of repetitive elements at 10kb and 40kb resolutions using uni-reads of replicate 1, 3, and 4 combined with multi-reads of replicate 2 (trimming setting (ii)). Manuscript submitted to eLife

A Uni-setting

SimpleSelect mHi C

0.0 2.5 5.0 10 0.0 2.5 5.0 10 0.0e+00 0.0e+00 Genomic Distance (Mb) Genomic Distance (Mb) B Plasmodium Signifcant Contact Diference among Stages (FDR 0.001) PlasmodiumPlasmodium Signi Signficantfcant Contact Contact DifDiferenceerence among among Stages Stages (FDR (FDR 0.01) 0.01) Rings Trophozoites Schizonts Rings Trophozoites Schizonts 100 100

80 80

60 60

40 40 887

Trophozoites Schizonts Rings Schizonts Rings Trophozoites TrophozoitesSchizonts Rings Schizonts Rings Trophozoites Plasmodium Signifcant Contact Diference among Stages (FDR 0.05) Rings Trophozoites Schizonts 100 cant Contacts f cant of Signi Percentage 80

60

40

Trophozoites Schizonts Rings Schizonts Rings Trophozoites

Figure 6–Figure supplement 7. (A) Comparison of genomic distance distributions of signicant interactions between SimpleSelect and mHi-C for IMR90 rep5 (FDR < 0.001). (B) Comparison of signicant interactions among the three life stages of P. falciparum. Y-axis in each panel, namely rings, trophozoites, and schizonts, depicts the percentage of contacts that are signicant only in the panel condition compared to the other two conditions. Under varying FDR thresholds, mHi-C and Uni-setting tend to have similar percentages of dierential interactions among ring - trophozoites - schizonts plasmodium life stages. In contrast, SimpleSelect tends to underestimate dierential interactions due to over-emphasizing contact distance prior. Manuscript submitted to eLife

Trimmed Uni-setting Trimmed Uni&Multi-setting

90

80

888 70 Reproducibility

60

36 50 75 100 125

Figure 7–Figure supplement 1. Reproducibility of trimmed Uni-setting and trimmed Uni&Multi- setting across dierent read lengths at 40kb resolution. Manuscript submitted to eLife

Setting Uni−setting Uni&Multi−setting

rep1 vs rep2 rep2 vs rep3 rep2 vs rep4 1.0

0.8

889 Reproducibility

0.6

trim36 trim50 trim75 trim100 trim125 trim36 trim50 trim75 trim100 trim125 trim36 trim50 trim75 trim100 trim125 original151 original151 original151 TrimN

Figure 7–Figure supplement 2. Reproducibility comparison between original Uni-setting and trimmed Uni-setting and Uni&Multi-setting across replicates at 40kb resolution. Manuscript submitted to eLife

Setting ● Uni−setting Uni&Multi−setting ● original151 ● trim50 ● trim100 TrimN ● trim36 ● trim75 ● trim125

rep1 vs rep2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.9 ● ● ● ● ● ●

0.8 ● 0.7 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

rep2 vs rep3 1.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.9 ● ● ●

0.8 ●

0.7 ● 890 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Reproducibility 0.6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.5 ● ● ● ● ●

rep2 vs rep4

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.9 ● ● ● ● ●

0.8 ● 0.7 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.5 ● ● ● ● ● ● ● ● ● ● ● ● chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 Chromosome

Figure 7–Figure supplement 3. Reproducibility comparison between original Uni-setting and trimmed Uni-setting and Uni&Multi-setting across chromosomes at 40kb resolution. Manuscript submitted to eLife

Original Uni-reads Trimmed Uni-reads

600 60 chr3: 124Mb - 127.8Mb chr3: 124Mb - 127.8Mb Trimmed Uni&Multi-reads

891

60 chr3: 124Mb - 127.8Mb

Figure 7–Figure supplement 4. TAD detection on chromosome 3 of original longer uni-reads contact matrix with black TAD boundaries, trimmed uni-reads (36bp) contact matrix with green TAD boundaries and trimmed uni- and multi-reads (36bp) contact matrix with blue TAD bounaries. Manuscript submitted to eLife

Original Uni-reads Trimmed Uni-reads

500 60 chr7: 3Mb - 10Mb chr7: 3Mb - 10Mb Trimmed Uni&Multi-reads 892

60 chr7: 3Mb - 10Mb Figure 7–Figure supplement 5. TAD detection on chromosome 7 of original longer uni-reads contact matrix, trimmed uni-reads (36bp) contact matrix and trimmed uni- and multi-reads (36bp) contact matrix. Manuscript submitted to eLife

Original Uni-reads Trimmed Uni-reads

500 60 chr7: 147.5Mb - 155Mb chr7: 147.5Mb - 155Mb

Trimmed Uni&Multi-reads 893

60 chr7: 147.5Mb - 155Mb

Figure 7–Figure supplement 6. TAD detection on chromosome 7 of original longer uni-reads contact matrix, trimmed uni-reads (36bp) contact matrix and trimmed uni- and multi-reads (36bp) contact matrix. Manuscript submitted to eLife

Original Uni-reads Trimmed Uni-reads

200 40 chr10:77Mb-92Mb chr10:77Mb-92Mb

Trimmed Uni&Multi-reads 894

40 chr10:77Mb-92Mb

Figure 7–Figure supplement 7. TAD detection on chromosome 10 of original longer uni-reads contact matrix with black TAD boundaries, trimmed uni-reads (36bp) contact matrix with green TAD boundaries and trimmed uni- and multi-reads (36bp) contact matrix with blue TAD bounaries Manuscript submitted to eLife

# of Significant Interactions Uni-setting Uni&Multi-setting

36 50

60000

40000

20000

75 100

60000

895 40000

20000

125 0.001 0.01 0.05 0.1

60000

40000

20000

0.001 0.01 0.05 0.1

Figure 7–Figure supplement 8. Numbers of signicant interactions identied with trimmed reads under Uni- and Uni&Multi-settings at FDR 0.1%, 1%, 5%, 10% . Manuscript submitted to eLife

70 er (%) w

o 896 P

60

36 50 75 100 125 Read Length

Figure 7–Figure supplement 9. Power is computed as the percentage of top 10,000 signicant interactions of the full read length dataset detected by the analysis of trimmed read datasets under FDR of 10%. Manuscript submitted to eLife

P

6e+08

4e+08 897 2e+08

0e+00

rep1 rep2 rep3 rep4

Figure 7–Figure supplement 10. Comparison of read decompositions of "rep2" with "rep2- NonChimericReads" sequencing depths indicates that in the full read length dataset, a large proportion of the uni-reads are due to rescued chimeric reads (i.e., these are Uni- or Multi-reads in rep2 and Singletons in rep2-NonChimericReads). Downstream analysis of trimming experiments beyond evaluating the accuracy of multi-read assignments utilized Uni-reads of rep 2 displayed here for deriving gold standard interaction sets. Manuscript submitted to eLife

Trimming setting: trimN = 50 Trimming setting: trimN = 50

1.00 0.6 0.75

0.4

e Rate 0.50 Precision ositi v 0.2 ue P 0.25 r T 0.0 0.00

0.00 0.03 0.06 0.09 0.00 0.25 0.50 0.75 1.00 False Positive Rate Recall Trimming setting: trimN = 75 Trimming setting: trimN = 75 1.00 0.6 0.75

e Rate 0.4 ositi v 0.50 Precision ue P r

T 0.2 0.25

0.0 0.00

0.00 0.03 0.06 0.09 0.00 0.25 0.50 0.75 1.00 False Positive Rate Recall Trimming setting: trimN = 100 Trimming setting: trimN = 100 1.00 0.6

898 0.75

e Rate

0.4

ositi v 0.50 Precision ue P r

T 0.2 0.25

0.0 0.00

0.00 0.03 0.06 0.09 0.00 0.25 0.50 0.75 1.00 False Positive Rate Recall Trimming setting: trimN = 125 Trimming setting: trimN = 125 1.00 0.6

0.75

e Rate

0.4

ositi v 0.50

Precision ue P

r T 0.2 0.25

0.00 0.0

0.00 0.03 0.06 0.09 0.00 0.25 0.50 0.75 1.00 False Positive Rate Recall

Figure 7–Figure supplement 11. ROC and PR curves for detection of full read length dataset signicant interactions by the analysis of trimmed read datasets under the trimmed Uni- and Uni&Multi-settings at read lengths of 50bp, 75bp, 100bp, and 125bp. In the ROC curves, the false positive rates are displayed up until 10% as the values of both the x- and y-axis rates shoot up to 1 after this value. Ground truth for these curves are based on the signicant interactions identied by the full read length dataset at FDR of 10%.