
bioRxiv preprint doi: https://doi.org/10.1101/530519; this version posted April 6, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

A unified framework for inferring the multi-scale organization of chromatin domains from Hi-C

Ji Hyun Bak (a,b,∗), Min Hyeok Kim (a,∗), Lei Liu (a), and Changbong Hyeon (a,1)

(a) Korea Institute for Advanced Study, Seoul 02455, Korea; (b) Redwood Center for Theoretical Neuroscience, University of California, Berkeley, CA 94720, USA

This manuscript was compiled on April 6, 2020

Abstract: Identifying chromatin domains (CDs) from high-throughput chromosome conformation capture (Hi-C) data is currently a central problem in genome research. Here we present a unified algorithm, Multi-CD, which infers CDs at various genomic scales by leveraging the information from Hi-C. By integrating a model of the chromosome from polymer physics, statistical physics-based clustering analysis, and Bayesian inference, Multi-CD identifies the CDs that best represent the global pattern of correlation manifested in Hi-C. The multi-scale intra-chromosomal structures compared across different cell types allow us to glean the principles of chromatin organization: (i) Sub-TADs, TADs, and meta-TADs constitute a robust hierarchical structure. (ii) The assemblies of compartments and TAD-based domains are governed by different organizational principles. (iii) Sub-TADs are the common building blocks of chromosome architecture. CDs obtained from Multi-CD applied to Hi-C data enable a quantitative and comparative analysis of chromosome organization in different cell types, providing glimpses into the structure-function relationship in the genome.

polymer network | chromosome | clustering analysis | tunable group model | Bayesian inference

Author contributions: J.H.B., M.H.K., L.L., and C.H. designed and performed research, J.H.B., M.H.K., and L.L. analyzed data; J.H.B., M.H.K., and C.H. wrote the paper.
The authors declare no competing interest.
∗J.H.B. and M.H.K. contributed equally to this work.
1 To whom correspondence should be addressed. E-mail: [email protected]
Data sharing: Code for the algorithm presented in this paper is available online (https://github.com/multi-cd).

Chromosome conformation capture (3C) and its derivatives, which are used to identify chromatin contacts through the proximity ligation techniques (1, 2), take center stage in chromosome research (3, 4). Square-block and checkerboard patterns manifested in Hi-C data provide glimpses into the organization of chromatin chains inside cell nuclei. Despite cell-to-cell variations inherent to Hi-C data, which is in practice collected from a heterogeneous cell population, the cell-type specificity of chromosome architecture gleaned from Hi-C is still clear. Furthermore, the change of Hi-C pattern with the transcription activity and the phase of cell cycle underscores the functional roles of chromosome structure in gene regulation (5–14). Given that pathological states of chromatin are also manifested in Hi-C (15, 16), accurate characterization of chromatin domains (CDs) from Hi-C data is of great importance in advancing our quantitative understanding of the genome function.

Inside cell nuclei each chromosome made of $\sim O(10^{2})$ Mb DNA is segregated into its own territory (Fig. 1a) (17). At the scale of $\gtrsim O(10)$ Mb, alternating blocks of active and inactive chromatin are phase-separated into two megabase sized aggregates, called A- and B-compartments (18–21) (Fig. 1b). Topologically associated domains (TADs), detected at $\sim O(10^{-1})$ to $O(1)$ Mb (22–25), are considered the basic functional unit of chromatin organization and gene regulation because of their well-conserved domain boundaries across cell/tissue types (17, 20, 26, 27). It was suggested that the proximal TADs in genomic neighborhood aggregate into a higher-order structural domain termed “meta-TAD” (6). At smaller genomic scale, each TAD is further split into sub-TADs that display more localized contacts (19, 28–31) (Fig. 1c).

Currently there are many algorithms available to identify CDs from Hi-C data, making important contributions to understanding the intra-chromosome architecture. However, CDs identified using different algorithms or parameters display significant variations, and there is no generally accepted definition for the above-mentioned CD at each scale. For example, the average size of a TAD varies from 100 kb to 2 Mb depending on the specific algorithm being used. Furthermore, a unified algorithm to characterize CDs at multiple scales is still lacking. In many of the existing algorithms specialized in finding CDs at particular genomic scales, Hi-C data should first be formatted at a specific resolution (18, 19, 22). Although there are methods (32, 33) developed for identifying hierarchical domain structures of chromosomes, their algorithms rely on local pattern recognition analyses (18, 19, 22, 34), not implementing the physical viewpoint that a chromosome is a three-dimensional object made of a long polymer (6, 14, 35–40).

Here, we interpret Hi-C data as a pairwise contact probability matrix resulting from polymer networks whose inter-monomer distances are harmonically restrained. The cross-correlation matrix derived from Hi-C is used as the sole input for Multi-CD, the algorithm that we have developed to identify CDs at varying genomic scales. Agreement of CD solutions from Multi-CD with the previous knowledge on chromatin organization as well as with information from bio-markers indicates the reliability of Multi-CD. CD solutions based on a single algorithm allow us to assess the multi-scale structure of chromatin from a more unified perspective. We assert that amid the rapidly expanding volume of Hi-C data (10–12), Multi-CD holds good promise for quantitative and principled determination of chromatin organization.

Theory

Transforming Hi-C into a cross-correlation matrix using a polymer network model. A chromosome can be viewed as a polymer chain that is folded to a network structure characterized with multiple cross-links (35, 41, 42). Even in the interphase that displays less activity than the mitotic phase, continuous events of free energy consumption break the detailed balance condition, driving the chromosome out of equilibrium (43–47). However, chromosome dynamics in each


phase is slow (39, 40, 48, 49), such that the system remains in local mechanical equilibrium over an extended period of time, as captured by the stable patterns in the Hi-C data (14). Although a great amount of cell-to-cell variation is expected for a population of cells (39, 49, 50), fluorescence measurement indicates that the spatial distances between pairs of chromatin segments can be well described by the gaussian distribution (21, 51–53) (see Fig. S1). This motivates us to model the chromosome structure using a gaussian polymer network whose configuration fluctuates around its local mechanical equilibrium state (54, 55). See SI Appendix for more justifications for the use of the Gaussian polymer network model.

Fig. 1. The hierarchical organization of interphase chromosome and Hi-C map. (a) Chromosome territories in the cell nucleus, which are manifested as the higher intra-chromosomal counts in the Hi-C map. (b) Alternating blocks of active and inactive chromatins, segregated into A- and B-compartments, give rise to the checkerboard pattern on Hi-C. (c) Sub-megabase to megabase sized chromatin folds into TADs. Adjacent TADs are merged to meta-TAD (6), and individual TAD is further decomposed into sub-TADs (19, 28–31).

For a polymer chain whose long-range pairwise interactions are restrained via harmonic potentials with varying stiffness, the probability distribution of the distance $r_{ij}$ between each pair of segments i and j is written in the following form

$$P(r_{ij}; \gamma_{ij}) = \frac{4}{\sqrt{\pi}}\, \gamma_{ij}^{3/2}\, r_{ij}^{2} \exp\left(-\gamma_{ij} r_{ij}^{2}\right), \tag{1}$$

with $\gamma_{ij}^{-1} = 2(\sigma_{ii} + \sigma_{jj} - 2\sigma_{ij})$, where $\sigma_{ij}$ is the positional covariance determined by the topology of polymer network (14, 42). The contact probability between the two segments, $p_{ij}$, is the probability that their pairwise distance $r_{ij}$ is below a cutoff $r_c$. In other words, it is calculated using $p_{ij} = \int_0^{r_c} dr\, P(r; \gamma_{ij})$, where $P(r; \gamma_{ij})$ is the distance distribution in Eq. 1. Importantly, this model establishes a one-to-one mapping from the contact probability $p_{ij}$ to the parameter $\gamma_{ij}$, which allows one to determine the covariance matrix $\{\sigma_{ij}\}$ and consequently the cross-correlation matrix, $(C)_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ (see Materials and Methods for mathematical details). As a result, the following transformation from $p_{ij}$ to $(C)_{ij}$ is conceived:

$$p_{ij} \longrightarrow \gamma_{ij} \longrightarrow \sigma_{ij} \longrightarrow (C)_{ij}. \tag{2}$$

The resulting cross-correlation matrix C is used as the input data for our domain inference algorithm (see Fig. 2).

We propose a principled interpretation of Hi-C data as a contact probability matrix, which can be derived from a mathematically tractable yet physically meaningful model of gaussian polymer network. As a pre-processing method, our approach can replace the common but arbitrary use of nonlinear (most often logarithmic) scaling of the Hi-C data.

Modeling the correlations with the group model. Clustering a correlation matrix into a finite number of correlated groups is a general problem discussed in diverse disciplines. We adapted a statistical mechanical formalism known as the “group model,” developed for identifying the correlated groups of companies from empirical data of stock market price fluctuations (56–58). Without ambiguity, the formalism can be applied to the clustering of correlated genomic segments in a chromosome.

Let us assume that each genomic segment $i \in \{1, 2, \cdots, N\}$ belongs to a chromatin domain $s_i$. Then the vector $s = (s_1, s_2, \ldots, s_N)$ can be called the domain solution for the N segments. For example, a state s = (1, 1, 1, 2, 2, 3) describes a structure where the 6 genomic segments are clustered into 3 domains. Indexing of the domains is arbitrary. If there are K distinct domains in the solution, we can always index the domains such that $s_i \in \{1, 2, \cdots, K\}$.

We also assume that the cross-correlation matrix C, captured by Hi-C, is essentially described by the correlation of a set of hidden variables $\{x_i\}$ where $x_i$ represents the “genomic state” of the i-th chromatin segment. Without loss of generality we need only consider the case where $x_i$ has zero mean and unit variance; or simply standardize as $(x_i - \langle x_i \rangle)/\sigma_{x_i} \to x_i$. Adapting the formalism in Refs. (56, 57), we assume that each $x_i$ obeys the following stochastic equation

$$x_i = \sqrt{\frac{g_{s_i}}{1 + g_{s_i}}}\; \eta_{s_i} + \frac{1}{\sqrt{1 + g_{s_i}}}\; \epsilon_i \tag{3}$$

where $\eta_{s_i}$ and $\epsilon_i$ are two independent and identically distributed (i.i.d) random variables with $\eta_{s_i}, \epsilon_i \sim \mathcal{N}(0, 1)$, that are linked to the domain ($s_i$) and the individual segment (i) respectively. The parameter $g_{s_i}$ (≥ 0) is associated with each domain $s_i$, such that a larger $g_{s_i}$ indicates a stronger contribution from the domain-dependent variable $\eta_{s_i}$. The cross-correlation between two segments i and j is written as

$$\langle x_i x_j \rangle = \frac{g_{s_i}}{1 + g_{s_i}}\, \delta_{s_i s_j} + \frac{1}{1 + g_{s_i}}\, \delta_{ij}. \tag{4}$$

In light of Eq. 4, the first term of Eq. 3 on the right hand side contributes to intra-CD correlation of the $s_i$-th CD with increasing $g_{s_i}$; the second term of Eq. 3 corresponds to a noise that randomizes the intra-domain correlation.

Regarding $(C)_{ij}$ calculated from Hi-C as the correlation between the stochastic variables $x_i$ and $x_j$, each representing the genomic state of a segment (Eq. 4), namely,

$$(C)_{ij} \Leftrightarrow \langle x_i x_j \rangle, \tag{5}$$

we use C as an input data for the clustering approach prescribed by the group model.

Our goal is to find the domain solution $s = (s_1, s_2, \cdots, s_N)$ that best represents the pattern in the correlation matrix C, with an appropriate set of strength parameters $g = (g_1, g_2, \cdots, g_K)$, where K is the number of distinct domains in the solution. Using the group model described above, we can calculate the likelihood p(C|s, g) that the observed correlation


matrix C was drawn from an underlying set of domains s with strengths g (see SI Appendix for derivation). The log-likelihood is written as a sum over all domains in the solution s:

$$\log p(C|s, g) = -\frac{1}{2} \sum_{k=1}^{K} \left[ (1 + g_k) \left( n_k - \frac{g_k c_k}{1 + g_k n_k} \right) - n_k \log(1 + g_k) + \log(1 + g_k n_k) \right], \tag{6}$$
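As a minimal numerical sketch of Eq. 6 (not the authors' implementation), the block below evaluates the log-likelihood for a given label vector s, with $n_k$ the number of segments in domain k and $c_k$ the sum of the entries of C within domain k. The function names `domain_sums`, `group_log_likelihood` and `block_correlation` are hypothetical, and the idealized input matrix is built directly from Eq. 4 rather than from Hi-C data.

```python
import math

def domain_sums(C, s):
    """Return {k: (n_k, c_k)}: domain size and sum of intra-domain entries of C."""
    members = {}
    for i, k in enumerate(s):
        members.setdefault(k, []).append(i)
    return {
        k: (len(idx), sum(C[i][j] for i in idx for j in idx))
        for k, idx in members.items()
    }

def group_log_likelihood(C, s, g):
    """Log-likelihood of Eq. 6 for labels s and strengths g = {k: g_k}."""
    total = 0.0
    for k, (n_k, c_k) in domain_sums(C, s).items():
        g_k = g[k]
        total += (1 + g_k) * (n_k - g_k * c_k / (1 + g_k * n_k))
        total += -n_k * math.log(1 + g_k) + math.log(1 + g_k * n_k)
    return -0.5 * total

def block_correlation(s, g):
    """Idealized correlation matrix implied by Eq. 4 (hypothetical test input)."""
    N = len(s)
    return [
        [1.0 if i == j else (g[s[i]] / (1 + g[s[i]]) if s[i] == s[j] else 0.0)
         for j in range(N)]
        for i in range(N)
    ]

# Example state from the text: six segments clustered into three domains.
s = [1, 1, 1, 2, 2, 3]
g_true = {1: 2.0, 2: 1.0, 3: 0.0}
C = block_correlation(s, g_true)
ll_true = group_log_likelihood(C, s, g_true)
ll_weaker = group_log_likelihood(C, s, {1: 0.5, 2: 1.0, 3: 0.0})
```

In this toy setting the strengths used to build C score higher than a mismatched set (`ll_true > ll_weaker`), which is the behavior one would expect from a likelihood in g.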

Here, $n_k = \sum_{i=1}^{N} \delta_{s_i,k}$ is the size of domain k, and $c_k = \sum_{i,j=1}^{N} C_{ij}\, \delta_{s_i,k}\, \delta_{s_j,k}$ is the sum of all intra-domain correlation elements. The log-likelihood in Eq. 6 is maximized at $\hat{g}_k = (c_k - n_k)/(n_k^2 - c_k)$ for each k, allowing us to consider the reduced likelihood $p(C|s) \equiv \max_g p(C|s, g) = p(C|s, \hat{g})$.

For convenience, we define an energy-like cost function E(s|C) for a domain solution s given a correlation matrix C, such that the problem of finding the maximum-likelihood solution s is equivalent to finding the s that minimizes E(s|C). Specifically, it is useful to write the likelihood function as p(C|s) ∝ exp(−E(s|C)/T) to resemble a Boltzmann distribution, with

$$E(s|C) = \frac{1}{2} \sum_{k=1}^{K} \left[ \log \frac{c_k}{n_k} + (n_k - 1) \log \frac{n_k^2 - c_k}{n_k^2 - n_k} \right], \tag{7}$$

where the new parameter T > 0 corresponds to the thermodynamic temperature. The best domain solution can be found by using the simulated annealing method where T is slowly decreased.

Inferring the domain solutions at multiple scales: a tunable group model. Besides evaluating how well a domain solution s explains the correlation pattern in the data, which is captured by the likelihood, we also want to impose an additional preference for more parsimonious solutions. This is done by introducing a prior distribution, p(s). Here we use the form p(s) = exp(−λK(s)/T), where K(s) is a function that gives a larger value for a more fragmented solution s (larger number of domains), and λ (≥ 0) is a parameter that scales the overall strength of the prior. Specifically, K(s) is defined such that log K(s) is the entropy of s:

$$K(s) = \exp\left( -\sum_{k=1}^{K} p_k \log p_k \right), \qquad p_k = n_k/N. \tag{8}$$

This quantity measures the effective number of domains; for example, K(s) = K when the domain sizes are uniform.

Taking a Bayesian approach, we combine the likelihood and the prior, to construct the posterior distribution according to the Bayes rule: p(s|C) ∝ p(C|s) p(s). The posterior is written in terms of an effective Hamiltonian H:

$$p(s|C) \propto e^{-H(s|C)/T}, \qquad H(s|C) = E(s|C) + \lambda K(s). \tag{9}$$

Finding the maximum a posteriori (MAP) estimate, $s^* = \mathrm{argmax}_s\, p(s|C)$, is equivalent to solving a global minimization problem for H(s|C). As the s-space is expected to be high-dimensional and is likely characterized with multiple local minima, we use simulated annealing (59) to find the energy-minimizing $s^*$ (Fig. S2) (see Materials and Methods for the details).

Our algorithm, Multi-CD, is a principled method for inferring the chromatin domain (CD) solutions at multiple scales. The group model becomes “tunable” at the level of inference through the parameter λ: as λ is increased, the resulting estimate $s^*$ tends to have fewer domains. In parallel to the statistical physics problem of a grand-canonical ensemble, T is the effective temperature of the system, and λ amounts to the negative chemical potential.

Fig. 2. Schematic of the Multi-CD algorithm. (a) Pre-processing: Hi-C data provide information of the correlation pattern. (b) Inference: for a given correlation matrix C, the Multi-CD algorithm finds chromatin domain (CD) solutions s at multiple scales. The algorithm of Multi-CD is repeatedly applied to C to determine the best domain solutions at different values of λ (outer red box). At each λ, the best domain solution is found through a simulated annealing, in which the effective temperature T is gradually decreased (inner blue box). We use a MCMC sampling method to approximate the posterior distribution p(s|C) ∝ exp(−H(s|C; λ)/T) and to find the domain solution s at each fixed temperature T. The boxes in red and blue represent the two levels of iterations varying λ and T, respectively, in the algorithm.

Results

Discovery of chromatin domains at multiple scales. Given a raw Hi-C matrix, one can use our Multi-CD algorithm for transforming it to a correlation matrix C (Fig. 2a), as well as for identifying a set of CDs for each fixed λ (Fig. 2b).

Here we applied Multi-CD on a sample subset of Hi-C data from a commonly used human lymphocyte cell line, GM12878, at 50-kb resolution. After the transformation of the raw Hi-C (Fig. 3a) into a correlation matrix (Fig. 3b), Multi-CD was employed to infer a family of CD solutions that vary with λ (Fig. 3c). We also applied Multi-CD to four other cell lines, HUVEC, NHEK, K562, and KBM7 (Fig. 3d), and analyzed the CD solutions for all five cell lines. All results shown in the main text consider chromosome 10; see Fig. S3 for similar results from three other chromosomes (chr4, 11, 19).

We observed several general features from the families of CD solutions in these cell lines:

(i) The average domain size $\langle n \rangle$ always increased monotonically with λ (Fig. 3e), as expected from our construction of the prior.

(ii) The domain sizes were relatively homogeneous in the small-domain regime (small λ), but became heterogeneous after a cross-over point (Fig. 3f). To quantify this, we defined the index of dispersion for the domain sizes, which is simply the variance-to-mean ratio $D = \sigma_n^2/\langle n \rangle$. If the domains were generated by randomly selecting the boundaries along the genome, the dispersion would be D = 1; a smaller D < 1 indicates that domain sizes are more homogeneous than random. A larger D > 1 means that the domain sizes are heterogeneous. The crossover points arise at $\lambda_{cr}$ ≈ 30–40 ($\langle n \rangle_{cr}$ ≈ 1.6 Mb) for GM12878, HUVEC, and NHEK; and $\lambda_{cr}$ ≈ 60–70 ($\langle n \rangle_{cr}$ ≈ 2.2 Mb) for K562 and KBM7. We


Fig. 3. Multi-scale chromatin domain solutions for various cell types. (a) A subset of 50-kb resolution Hi-C data, covering a 10-Mb genomic region of chr10 in GM12878. (b) The cross-correlation matrix $C_{ij}$ for the corresponding subset. See Fig. S4 for a full-chromosome view. (c) Multi-CD applied to the correlation matrix in b. Domain solutions determined at 4 different values of λ = 0, 10, 30, 50. (d) Hi-C data from the same chromosome (chr10) in four other cell lines: HUVEC, NHEK, K562, and KBM7. Same subset as in a. (e-g) Characteristics of the domain solutions determined for all five cell lines in a and d: (e) the average domain size, $\langle n \rangle$; (f) the index of dispersion in the domain size, $D\,(= \sigma_n^2/\langle n \rangle)$; (g) the normalized mutual information, nMI. (h-i) Comparison of domain solutions across cell types. (h) Average cell-to-cell similarity of domain solutions, in terms of Pearson correlations, at varying λ. (i) Domain solutions obtained at λ = 10 for 5 different cell types. See Fig. S5 for solutions at λ = 0 and λ = 40. (j) Similarity between domain solutions at different λ’s, shown for GM12878. See Fig. S6 for corresponding results for the other four cell lines. (k) RNA-seq signals from the five cell lines (colored hairy lines), on top of the TAD solutions (filled boxes), in a genomic interval that contains the regulatory elements associated with a gene APBB1IP. APBB1IP is transcriptionally active only in two cell lines, GM12878 and KBM7, where the regulatory elements are fully enclosed in the same TAD. See Fig. S7 for a larger figure.
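The quantities that rank and characterize a domain solution (the energy of Eq. 7, the effective domain number of Eq. 8, the Hamiltonian of Eq. 9, and the index of dispersion plotted in panel f) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the variance is taken as the population variance in units of bins, and single-segment domains are assumed to contribute zero energy; none of these choices is specified in the text above.

```python
import math
from collections import Counter

def domain_sizes(s):
    """Sizes n_k (in bins) of the domains in the label vector s."""
    return list(Counter(s).values())

def index_of_dispersion(s):
    """D = variance / mean of the domain sizes (population variance assumed)."""
    sizes = domain_sizes(s)
    mean = sum(sizes) / len(sizes)
    return sum((n - mean) ** 2 for n in sizes) / len(sizes) / mean

def effective_domain_number(s):
    """K(s) of Eq. 8: exponential of the entropy of p_k = n_k / N."""
    N = len(s)
    return math.exp(-sum(n / N * math.log(n / N) for n in domain_sizes(s)))

def energy(C, s):
    """E(s|C) of Eq. 7; single-segment domains contribute zero (assumed limit)."""
    members = {}
    for i, k in enumerate(s):
        members.setdefault(k, []).append(i)
    total = 0.0
    for idx in members.values():
        n = len(idx)
        if n == 1:
            continue
        c = sum(C[i][j] for i in idx for j in idx)
        total += math.log(c / n) + (n - 1) * math.log((n * n - c) / (n * n - n))
    return 0.5 * total

def hamiltonian(C, s, lam):
    """H(s|C) = E(s|C) + lambda * K(s) of Eq. 9."""
    return energy(C, s) + lam * effective_domain_number(s)

# Toy correlation matrix (hypothetical): two blocks of two segments each.
C_toy = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
fine, coarse = [1, 1, 2, 2], [1, 1, 1, 1]
prefers_fine = hamiltonian(C_toy, fine, 0.0) < hamiltonian(C_toy, coarse, 0.0)
prefers_coarse = hamiltonian(C_toy, fine, 1.0) > hamiltonian(C_toy, coarse, 1.0)
```

In this toy case the two-domain solution has the lower H at λ = 0, while the single merged domain wins at λ = 1, which illustrates how increasing λ favors coarser solutions.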

observed that the onset of heterogeneity was related to the appearance of non-local domains (Fig. 3c).

(iii) We quantified the goodness of each CD solution by comparing its corresponding binary matrix against the Hi-C data in terms of the normalized mutual information (nMI; see Materials and Methods for details). There is a scale, $\lambda^*$, at which the diagonal block pattern manifested in Hi-C data is most accurately captured. In Fig. 3g, the best solution was found at $\lambda^*$ ≈ 30 for GM12878, HUVEC, and NHEK; the $\lambda^*$’s were identified at larger values for K562 and KBM7. As an interesting side note, K562 and KBM7 belong to immortalized leukemia cell lines, whereas the other three cell types are normal cells; the different statistical property of Hi-C patterns manifested in $\lambda^*$ may hint at a link between the pathological state and a coarser organization of the chromosome.

(iv) The CD solutions inferred by Multi-CD, especially the families of local CDs, appeared to be conserved across different cell types (Fig. 3h-i). We quantified the extent of domain conservation, in terms of the Pearson correlation (see Materials and Methods), averaged over all pairs of different cell types. Domain conservation was strong for smaller domains at λ ≤ 30 ($\langle n \rangle \lesssim 1.5$ Mb), with the strongest conservation at λ = 10 (Fig. 3h). The CDs at λ = 10 are shown in Fig. 3i for five different cell types.

(v) Finally, we quantified the similarity between pairs of CD solutions obtained at different scales, again using the similarity measure based on Pearson correlation. In the case of GM12878, the family of CD solutions is divided into two regimes; the smaller-scale CD solutions from a range of 10 ≤ λ ≤ 40 are correlated among themselves, and the larger-scale CD solutions from λ > 40 as well. CD solutions below and above λ ≈ 40 are not correlated with each other (Fig. 3j; also see Fig. S6 for the other cell lines).

The division boundary in (v) is found at a λ value in the similar range with the best-clustering scale $\lambda^*$, and the crossover $\lambda_{cr}$ from local/homogeneous to non-local/heterogeneous CDs (compare Fig. S6 to Fig. 3f-g). Hereafter we will refer to the two regimes as the family of local CDs with homogeneous size distribution ($\lambda \lesssim \lambda^*$), and the family of non-local CDs with heterogeneous size distribution ($\lambda \gtrsim \lambda^*$).

TAD-like organizations in the family of local CDs. We identified at least three important scales in the family of local CDs. First of all, there was a scale at which domain conservation was maximized across different cells (λ = 10). This observation is consistent with the widely accepted notion that TADs are the most well-conserved, common organizational and functional unit of chromosomes, across different cell types (27, 60). Thus, for the example from human chromosome 10, we identify the CDs found at this scale λ = 10 as the TADs. The average domain size at λ = 10 was $\langle n \rangle$ ≈ 0.9 Mb, which agrees with the typical size of TADs as suggested by previous studies (22, 23).

The goodness of clustering, on the other hand, was maximized at a larger scale, λ ≈ 30 ($\langle n \rangle$ ≈ 1.5 Mb) for GM12878. The CDs at this scale turned out to be aggregates of multiple TADs in the genomic neighborhood, from visual inspection (see Fig. 3c), or as quantified in terms of our nestedness score (see Materials and Methods). We therefore identify these CDs as the “meta-TADs”, a higher-order structure of TADs, adopting the term of Ref. (6). In contrast to a previous analysis that extended the range of meta-TADs to the entire chromosome (6), we use the term meta-TAD exclusively for the larger-scale local CDs, distinguishing them from the non-local structures (i.e., compartments, discussed below). We note, however, that the terminologies of TADs and the meta-TADs are still not definitive: a recently proposed algorithm based on structural entropy minimization (61) found that the “best” solutions were found at ∼ 2 Mb domains, which is consistent with our findings, although these domains were called the TADs in Ref. (61).

Finally, a trivial but special scale is λ = 0, where no additional preference for coarser CDs is imposed. The CDs at this scale are supposed to best explain the local correlation pattern that is reflected in the strong Hi-C signals near the diagonal. These smaller CDs are almost completely nested in the TADs and the meta-TADs; we can therefore call them the sub-TADs. We also confirmed that the sub-TAD solutions were not limited by the resolution of the Hi-C data; sub-TADs were robustly reproduced from a finer, 5-kb Hi-C (Fig. S8).

The first three panels in Fig. 3c show three representative TAD-like CD solutions at $\lambda \le \lambda^*$: sub-TADs (λ = 0; smallest CDs), TADs (λ = 10, strongest domain conservation), and meta-TADs ($\lambda = \lambda^* = 30$, largest nMI). The nested structure is reminiscent of the hierarchically crumpled structure of chromatin chains (37, 62).

Chromatin organization and its link to gene expression. The CD solutions from Multi-CD can offer important insights into the link between chromatin organization and gene expression. To demonstrate this, we overlaid the RNA-seq profiles on the TAD solutions, identified for the corresponding subset of the chr10 of five cell lines (GM12878, HUVEC, NHEK, K562, KBM7) (Fig. 3k; also see Fig. S7). At around 26.8 Mb position of this chromosome, we found a gene APBB1IP, which is transcriptionally active in GM12878 and KBM7 but not in HUVEC, NHEK and K562. Consulting the GeneHancer database (63), we identified the regulatory elements for this gene (enhancers and promoters) in the interval between 26.65 and 27.15 Mb. Notably, our Multi-CD solutions show that the interval associated with the regulatory elements is fully enclosed in the same TAD in GM12878 and KBM7, whereas it is split into different TADs in the other three cell lines (Fig. 3k). The observation suggests that, for a gene to be expressed, it is critical that all regulatory elements are within the same TAD; this is consistent with the understanding that TADs define the functional boundaries for genetic interactions (5, 6, 17, 20).

Compartments as the best domain solution that coexists with TAD-like domains. The super-Mb sized domains are generally defined as the compartments in the chromosome organization (27). Compartments are characterized by the checkerboard pattern in the off-diagonal part of the correlation matrix, being highly non-local CDs. Our formulation of the group model has the flexibility for dealing with the non-local CDs naturally. However, a naïve application of Multi-CD by increasing λ did not identify the compartments; some non-local CDs were found (Fig. 3c), but they do not correspond to alternating patterns characteristic of compartments.

Fig. 4. Domain solutions for compartments. (a) Input correlation data for compartment identification. The 2-Mb diagonal band was removed. (b) Lower triangle: CDs obtained at λ = 90, based on the diagonal-band-removed data, which we identify as the compartments. Upper triangle: the $C_{O/E}$ matrix shown for comparison. (c) Same pair of data, after re-ordering to collect the two largest CDs in our solution (lower triangle), with k = 1, 2 as the B- and A-compartments respectively. The $C_{O/E}$ is simultaneously reordered to show a clear separation of correlation patterns (upper triangle). (d) nMI between CDs at varying λ and $C_{O/E}$, showing a plateau in the range 70 ≤ λ ≤ 100. CDs inferred by Multi-CD show consistently higher nMI, compared to sub-compartments (dashed line) and compartments (dotted line) from a previous method (19).

We hypothesize that compartments correspond to a secondary CD solution that coexists with a best solution that was already identified. Assuming a statistical independence between the two solutions and the additivity of cross-correlations (extension of Eq. 3), the inference of the secondary solution is reduced to a standard Multi-CD applied to a modified input data, which is essentially the result of taking out the best CD solution from the original correlation matrix C (see SI Appendix). Here we consider a simplified version of this problem, and remove from C a diagonal band of width 2 Mb, similar to the size of meta-TADs (Fig. 4a). Applied to the modified Hi-C, Multi-CD successfully captures the non-local correlations, and identifies two large compartments with alternating patterns (Fig. 4b). The correspondence is clearer when the


indices of segments are re-ordered (Fig. 4c). Because the larger CD (k = 1) shows a greater number of contacts (Fig. S9), it can be associated with the B-compartment, which is usually more compact; k = 2 is associated with the A-compartment. Further validation of the two compartments will be presented below, through comparisons with epigenetic markers.

To compare the goodness of our compartments with existing methods, we calculated the nMI against the $C_{O/E}$ matrix, the conventional form for compartment identification (see Materials and Methods) (18). We find that Multi-CD outperforms GaussianHMM (19), a widely accepted benchmark, in capturing the large-scale structures in Hi-C (Fig. 4d).

Multi-scale, hierarchical organization of chromatin domains. Now that we identified four classes of CD solutions, namely sub-TADs, TADs, meta-TADs and compartments, we examined their hierarchical relationships. Note that these CDs were obtained independently at the respective λ values, not through a hierarchical merging. Sub-TADs or TADs are almost always nested inside a meta-TAD, and TADs inside a meta-TAD, whereas there are mismatches between the TAD-like domains and the compartments. We quantified this relationship in terms of a nestedness score h, such that h = 0 indicates the chance level and h = 1 a perfect nestedness (see Materials and Methods), along with a visual comparison of each pair of CD solutions (Fig. 5a). This analysis confirms that the hierarchy between any pair of TAD-like domains (sub-TADs, TADs, and meta-TADs) is appreciably strong. On the other hand, the hierarchical links between the TAD-like domains and compartments are much weaker, which is again consistent with the recent reports that TADs and compartments are organized by different mechanisms (64, 65).

Although the nestedness score between sub-TADs and compartments (h = 0.67) is not so large as those among the pairs of TAD-based domains, it is still greater than those between TADs and compartments (h = 0.55) or between meta-TADs and compartments (h = 0.43). Thus, sub-TAD can be considered a common building block of the chromatin architecture (see Fig. 5b).

Fig. 5. Hierarchical organization of CD families. (a) The hierarchical structure of CDs is highlighted with the domain solutions for sub-TADs (red), TADs (green), meta-TADs (blue) and compartments (black). Shown for chr10 of GM12878. Each square panel overlays a pair of CD solutions; number above the panel reports the nestedness score. Inset: a reprint of the nestedness scores in a tetrahedral visualization with the four representative CD solutions. (b) A schematic diagram of inferred hierarchical relations between sub-TADs, TADs, meta-TADs and compartments, based on our calculation of nestedness scores. [Nestedness scores printed in the panels: 0.67, 0.97, 0.95, 0.55, 0.43, 0.98.]

Validation of domain solutions from Multi-CD. The CD solutions from Multi-CD are in good agreement with the results of several existing methods. Specifically, our CDs correspond to the previously proposed sub-TADs (19) at λ = 0, to the TADs (22) at λ ≈ 10, and to the compartments (19) at λ ≈ 90 (see Fig. S10). When assessed in terms of the nMI, Multi-CD outperforms the corresponding alternatives (ArrowHead (19), DomainCaller (22), GaussianHMM (19) for sub-TADs, TADs, meta-TADs) at the respective scales (Fig. 6a).

In order to further validate the biological relevance of the CD solutions from Multi-CD, we compared them with several biomarkers that are known to be correlated with the spatial organization of the genome (66). All results shown here are for chr10 of GM12878.

First, we calculated how much the boundaries of our sub-TAD and TAD solutions are correlated with the CTCF signals, which are known to be linked to TAD boundaries (22, 23) (Fig. 6b). We quantified this in terms of a correlation function, χ(d), where d is the genomic distance between a domain boundary and each CTCF signal (see Materials and Methods). The correlation function shows a strong enrichment of CTCF signals at domain boundaries (high peak of correlation at d ≈ 0), as well as precision (fast decay of correlation as d increases). Multi-CD performs similarly to ArrowHead and DomainCaller in terms of the enrichment at the boundary, and does better in terms of the precision (Fig. 6b). Specifically, when fitted to exponential decays, the correlation lengths are 34 kb (λ = 0) and 143 kb (λ = 10) for Multi-CD, compared to ≳ 900 kb for the two previous methods (Fig. 6b).

Next, we compared our compartment solutions (CDs at λ = 90, shown in Fig. 4b) with the replication timing profiles (Repli-Seq), which are known to correlate differently with the A- and B-compartments (7, 67). Our inferred compartments exhibit the anticipated patterns of replication timing (Fig. 6c); the A-compartment shows an activation of replication signals in the early phases (G1, S1, S2) and a repression in the later phases (S3, S4, G2), whereas the B-compartment shows an opposite trend. There is a clear anti-correlation between the replication patterns in the two compartments along the replication cycle (Fig. 6c), as quantified in terms of the Pearson correlation (Fig. 6d). Comparison to other epigenetic markers, such as the pattern of histone modifications, further confirms the association of our CD solutions with the A/B-compartments (Fig. S11).

Discussion

Multi-CD has many essential advantages that will make it a useful tool for the study of chromatin organization. As a computational algorithm, Multi-CD includes two core steps: the pre-processing of raw Hi-C data into a correlation matrix, and the inference of chromatin domain (CD) solutions from the correlation matrix. The pre-processing is based on a model of gaussian polymer network, allowing a physically justifiable interpretation of the Hi-C data.

The domain identification problem is formulated by combining the group model with a Bayesian inference for the CD solution. The formulation of Multi-CD that optimizes the recognition of the global pattern that appears in Hi-C naturally deals


[Figure 6. Panels: (a) nMI of Multi-CD versus GaussianHMM (compartment), DomainCaller (TAD) and ArrowHead (sub-TAD); (b) correlation function versus distance (Mb) for Multi-CD and for previous methods (ArrowHead, DomainCaller); (c) Repli-Seq signal in phases G1, S1, S2, S3, S4, G2 versus position on chr10 (Mb), for the A- and B-compartments; (d) Pearson correlation of the Repli-Seq signal with the A and B compartments of the Multi-CD solution.]

Fig. 6. Validation of CD solutions from Multi-CD. (a) In terms of the normalized mutual information between the CD solutions and the input data, Multi-CD outperforms ArrowHead, DomainCaller and GaussianHMM at the corresponding scales (sub-TAD, TAD and compartment). (b) The correlation function χ(d) between CTCF signals and the domain boundaries. Shown for sub-TADs and TADs, obtained from Multi-CD (left); from ArrowHead and DomainCaller (right). (c) Genome-wide, locus-dependent replication signal. Top panel shows the A- (blue) and B- (red) compartments inferred by Multi-CD. Bottom panels show the replication signals in six different phases in the cell cycle, shaded in matching colors for the two compartments. (d) Pearson correlation between the replication signals and the two compartments A (filled blue) and B (open red).

with non-local CDs, which differentiates Multi-CD from previous methods that focus on local features in Hi-C, such as CD boundaries or loops enriched with higher contact frequencies. Moreover, Multi-CD can find CDs across a wide range of scales without having to adjust or down-sample the Hi-C data to match the scale of CDs to be identified, which is an important improvement over many existing methods.

An important feature of Multi-CD, as emphasized in the name, is that it provides a unified framework to identify CDs at multiple scales, where the scales of the CDs are tuned by a single parameter λ. The resulting family of CD solutions allows quantitative comparisons between CD solutions at different scales. The analysis revealed special scales at which the CD solutions are particularly interesting: sub-TADs (λ = 0), TADs (λ = 10, where domain conservation was strongest), and meta-TADs (λ = 30, where the correlation pattern was best captured). At larger scales, we found that compartments (λ = 90) emerge as a secondary solution that can be inferred after removing the local signals in the correlation matrix that correspond to the TAD-like solutions. We confirmed that Multi-CD successfully reproduces, or even outperforms, the existing methods to identify CDs at the specific scales. Importantly, Multi-CD achieves this performance through a single unified algorithm, which not only identifies the specific CD solutions accurately, but also allows a comparative analysis of the multi-scale family of solutions.

In particular, we characterized the hierarchical organization of the chromatin by quantifying the similarity and the nestedness between CD solutions at two different scales. We showed that the characteristics of CD solutions shared by the local, TAD-like domains do not precisely hold together in the non-local, compartment-like domains. This finding is consistent with the recent studies which report that compartments and TADs are formed by different mechanisms of motor-driven active loop extrusion and microphase separation, and that they do not necessarily have a hierarchical relationship (65, 68–70). Meanwhile, the sub-TADs are nested in each of the other three solutions, including the compartments (Fig. 5), indicating that sub-TADs are the fundamental building blocks of the higher-order CD organization. In fact, the existence of sub-TADs is robust when a finer-resolution Hi-C is considered. Applying Multi-CD on Hi-C data at 5-kb resolution, we clearly recover the sub-TADs that are consistent with the sub-TADs obtained from the 50-kb Hi-C (see Fig. S8).

While there are methods that report hierarchical CDs (32, 33), Multi-CD makes significant advances both algorithmically and conceptually. Multi-CD can detect non-local domains with better flexibility instead of finding a set of intervals. Multi-CD also avoids the high false-negative rate that is typical of the previous method (e.g., TADtree (32)) that focuses on the nested domain structure (Fig. S12). Further, employing an appropriate prior to explore the solution space effectively, Multi-CD can avoid the problem encountered in Armatus (33), which skips detection of domains in some part of Hi-C data while its single scale parameter is varied (Fig. S12).

Multi-CD is a method of great flexibility that can be readily applied to analyze any dataset that exhibits pairwise correlation patterns. However, two cautionary remarks are in order for more careful interpretation of the results. (i) In general, the relevant values of λ depend on the resolution of the input Hi-C dataset, as well as on the cell type. While λ is a useful parameter that allows comparative analysis, its specific value does not carry any biological significance. Although we referred to a specific CD solution by the corresponding value of λ in the current analysis (Fig. 3), the lesson should not be that TADs, for instance, always correspond to the particular value of λ; instead, TADs should be identified as the most conserved CD solutions across cell types after scanning a range of λ's.


(ii) Multi-CD is agnostic about whether the collected data is homogeneous or heterogeneous. Application of Multi-CD to single-cell Hi-C data, and the subsequent interpretation of the result, would be straightforward; however, if the input Hi-C data were an outcome of a mixture of heterogeneous subpopulations, the solution from Multi-CD would correspond to their superposition. This is a fundamental issue inherent to any Hi-C data analysis method. Despite the presence of cell-to-cell variations, the population-averaged pattern manifest in Hi-C carries a rich set of information that is specific to the cell type. The need for interpretable inference methods that can extract valuable insights into the spatial organization of the genome, including ours, is still high.

To recapitulate, in order to glean genome function from Hi-C data that varies with the genomic state (10–13), a computationally accurate method to identify CD structures is of vital importance. Multi-CD is a physically principled method that identifies multi-scale structures of chromatin domains by solving the global optimization problem. We find the chromatin domains identified from Multi-CD in excellent match with biological data such as CTCF binding sites and replication timing signal. Quantitative analyses of CD structures identified across multiple genomic scales and various cell types offer general physical insight into chromatin organization inside cell nuclei.

Materials and Methods

Interpretation of Hi-C data.

Normalization and contact probability. Here we describe how the Hi-C data can be interpreted as a set of contact probabilities for pairs of genomic segments, $p_{ij}$. Typically, a Hi-C matrix has widely varying row sums; for example, the net count of the i-th segment in the experiment may be much larger than the net count of the j-th segment. To marginalize out this site-wise variation and only focus on the differential strengths of pairwise interactions, the raw Hi-C matrix $M_{\mathrm{raw}}$ is normalized to have uniform row and column sums. This is achieved using the Knight-Ruiz (KR) algorithm (71), which finds a vector $v = (v_1, \cdots, v_N)$ for calculating $(M)_{ij} = v_i v_j (M_{\mathrm{raw}})_{ij}$, such that each row (column) in $M$ sums to 1.

We assume that the normalized Hi-C signal is proportional to the contact probability: $(M)_{ij} \propto p_{ij}$. Note that $p_{ij}$ is the probability that the two segments i and j are within a contact distance, and the rows of the contact probability matrix $(P)_{ij} = p_{ij}$ are not required to sum to 1. Because the proportionality constant is unknown a priori, however, we have a free parameter to choose. We do this by fixing the average nearest-neighbor contact probability, $\bar{p}_1 = \langle p_{i,i+1} \rangle$. We expect $\bar{p}_1$ to be relatively close to 1, assuming that nearest-neighbor contact is likely; but not exactly 1, because there are variations among the nearest-neighbor Hi-C signal. In this work we chose $\bar{p}_1 = 0.9$. The resulting contact probability matrix $P$ is given as $(P)_{ij} = \min(1, \hat{p}_{ij})$, with $\hat{p}_{ij} = (\bar{p}_1/\mu)(M)_{ij}$, where $\mu = \langle M_{i,i+1} \rangle$ is the Hi-C signal averaged over the nearest neighbors. At $\bar{p}_1 = 0.9$, in our case, the fraction of over-saturated elements ($\hat{p}_{ij} > 1$) was sufficiently small.

Building correlation matrix from Hi-C. A chromosome can be regarded as a polymer chain containing N monomers, where the i-th monomer corresponds to the i-th genomic segment and its spatial position is written as $\mathbf{r}_i$. Adapting the random loop model (RLM) (42), we interpret chromosome conformation as described by an ideal polymer network, with cross-links between spatially close segments. In RLM, we describe the spatial positions of the polymer segments using a gaussian distribution with zero mean and a covariance matrix $\Sigma$, with elements $(\Sigma)_{ij} = \sigma_{ij} = \langle \delta\mathbf{r}_i \cdot \delta\mathbf{r}_j \rangle$. It follows that the distance $r_{ij} = |\mathbf{r}_i - \mathbf{r}_j|$ between two monomers i and j can be described in the form of a weighted gaussian function (Eq. 1), where the variance $(2\gamma_{ij})^{-1}$ is associated with the covariance matrix elements as $\gamma_{ij}^{-1} = 2(\sigma_{ii} + \sigma_{jj} - 2\sigma_{ij})$. The contact probabilities can be calculated from the distribution of pairwise distances (Eq. 1), by saying that two segments i and j are in contact when their distance $r_{ij}$ is below a cutoff, $r_c$. In other words, we write

$$p_{ij} = \int_0^{r_c} P(r_{ij}; \gamma_{ij})\, dr_{ij} = \mathrm{erf}\left(\gamma_{ij}^{1/2} r_c\right) - 2 r_c \sqrt{\frac{\gamma_{ij}}{\pi}}\, e^{-\gamma_{ij} r_c^2}, \tag{10}$$

where $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x dt\, e^{-t^2}$. The value of $\gamma_{ij}$ is uniquely determined for each $p_{ij}$; once we have the $\gamma_{ij}$'s, we can reconstruct the covariance matrix $\{\sigma_{ij}\}$. Because the diagonals are underdetermined in Hi-C (self-contacts are not reported), we assume a uniform variance $\sigma_{ii} = \sigma_{jj} = \sigma_c$ along the diagonal. Note that although the value of $\gamma_{ij}$ depends on the choice of $r_c$, its effect is only to scale the $\gamma_{ij}$'s as $\gamma_{ij} \to r_c^2 \gamma_{ij}$, and consequently the $\sigma_{ij}$'s.

Finally, we normalize the covariance matrix to build the correlation matrix $C$:

$$(C)_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} = \frac{\sigma_{ij}}{\sigma_c} = 1 - \frac{1}{4\sigma_c\gamma_{ij}}. \tag{11}$$

It appears that $\sigma_c$ sets the overall intensity of $C$. Here, we chose the value of $\sigma_c$ as the median of $1/(4\gamma_{ij})$, i.e., $\sigma_c = \mathrm{median}(1/(4\gamma_{ij}))$. This cancels out the scaling effect of $r_c$ in $\sigma_{ij}$, so that the choice of $r_c$ does not affect the ultimate construction of the correlation matrix $C$.

The observed/expected (O/E) matrix and its Pearson correlation matrix. The O/E matrix was used to account for the genomic distance-dependent contact number due to random polymer interactions in chromosome (18). Each pair (i, j) in the O/E matrix is calculated by taking the count number $M_{ij}$ (observed number) and dividing it by the average contacts within the same genomic distance $d = |i - j|$ (expected number). Since the expected number could be noisy, one smooths it out by increasing the window size (see refs (18, 19) for further details). In this study, we used the expected number obtained from (19). The Pearson correlation matrix of the O/E matrix ($C_{O/E}$) represents the overall contact pattern through pairwise correlation coefficients between segments.

Determination of optimal CD solutions.

Metropolis-Hastings sampling. Markov chain Monte Carlo (MCMC) sampling was employed to find the minimum value of the total cost function $H$. At each trial move from the current state $s$ to the next state $s'$, the move is accepted with a probability $\min(1, \alpha)$, where $\alpha(s, s') = \exp\left[-(H(s'|C) - H(s|C))/T\right]$.

In sampling the space of CD solutions, a move from a state $s$ to another state $s'$ is defined such that the two CD solutions $(s, s')$ differ only by one genomic segment. More precisely, because a CD solution is invariant upon permutations of the domain indices, the distance between $s$ and $s'$ is uniquely defined as the minimal number of mismatches over all possible domain index permutations.

To ensure that the sampling is properly conducted, we continue the sampling until each chain collects $t_{\mathrm{tot}} \geq 5\tau^*$ samples in the CD solution space. Here $\tau^*$ is the "relaxation time", defined as the number of steps it takes until the autocorrelation function $R(\tau)$ drops significantly: $\tau^* = \mathrm{argmin}_\tau |R(\tau) - 1/e|$. The autocorrelation function is calculated from the value of the total cost function $H$, as

$$R(\tau) = \frac{1}{\sigma^2} \left\langle (H(s_t|C) - \mu)(H(s_{t+\tau}|C) - \mu) \right\rangle_t, \tag{12}$$

where $s_t$ is the t-th sample in the chain, and $\mu$ and $\sigma$ are the mean and standard deviation of $\{H(s_t|C)\}$, respectively. In Eq. 12, the running time average is taken over all the pairs of samples with the time gap of $\tau$.

We note that sampling is the computational bottleneck for our method; our stop condition (at $5\tau^*$) was chosen conservatively for accurate solutions. In practice, the sampling time can be reduced at the cost of an increased batch size (number of different initial configurations, as described below in simulated annealing), which is significantly cheaper if parallel computing is used.
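The stop condition based on the relaxation time can be illustrated with a short sketch. This is not the authors' Matlab implementation: it only assumes that the cost values H(s_t|C) of one chain have been stored in a list, and the function names as well as the `max_lag` search bound are hypothetical choices made for this example.

```python
import math


def autocorrelation(costs, tau):
    """R(tau) as in Eq. 12: normalized autocovariance of the cost trace at lag tau."""
    n = len(costs)
    mu = sum(costs) / n
    var = sum((h - mu) ** 2 for h in costs) / n
    if var == 0.0:
        return 1.0  # constant trace: treat as fully correlated
    pairs = n - tau  # number of sample pairs separated by tau steps
    cov = sum((costs[t] - mu) * (costs[t + tau] - mu) for t in range(pairs)) / pairs
    return cov / var


def relaxation_time(costs, max_lag):
    """tau* = argmin over lags 1..max_lag of |R(tau) - 1/e| (max_lag is an assumed bound)."""
    target = 1.0 / math.e
    return min(range(1, max_lag + 1),
               key=lambda tau: abs(autocorrelation(costs, tau) - target))


def enough_samples(costs, max_lag, factor=5):
    """Stop condition t_tot >= factor * tau*, with factor = 5 as stated in the text."""
    return len(costs) >= factor * relaxation_time(costs, max_lag)
```

In use, one would append the cost after every Metropolis-Hastings step and call `enough_samples` periodically; restricting the search to `max_lag` keeps the check cheap, at the price of assuming that the relaxation time is shorter than that bound.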

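As a numerical illustration of Eqs. 10 and 11 in the pre-processing step, the sketch below inverts the contact probability for γ_ij by bisection and then builds the correlation matrix with σ_c set to the median of 1/(4γ_ij). It is a minimal sketch under stated assumptions rather than the published code: the function names are hypothetical, the input probabilities must lie strictly between 0 and 1 (saturated entries would need separate handling), the median is taken over off-diagonal pairs, and Eq. 11 is applied as written, without clipping.

```python
import math
from statistics import median


def contact_prob(gamma, r_c=1.0):
    """Eq. 10: probability that the pair distance is below the cutoff r_c."""
    x = math.sqrt(gamma) * r_c
    return math.erf(x) - 2.0 * x / math.sqrt(math.pi) * math.exp(-x * x)


def gamma_from_prob(p, r_c=1.0, tol=1e-12):
    """Invert Eq. 10 by bisection on x = sqrt(gamma) * r_c (Eq. 10 is increasing in x)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lo, hi = 0.0, 10.0  # the right-hand side of Eq. 10 equals 1 to double precision at x = 10
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f = math.erf(mid) - 2.0 * mid / math.sqrt(math.pi) * math.exp(-mid * mid)
        if f < p:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return (x / r_c) ** 2


def correlation_matrix(P, r_c=1.0):
    """Eq. 11 with sigma_c = median(1 / (4 * gamma_ij)) over off-diagonal pairs."""
    n = len(P)
    gam = {(i, j): gamma_from_prob(P[i][j], r_c)
           for i in range(n) for j in range(i + 1, n)}
    sigma_c = median(1.0 / (4.0 * g) for g in gam.values())
    C = [[1.0] * n for _ in range(n)]  # diagonal: sigma_ii / sigma_c = 1
    for (i, j), g in gam.items():
        C[i][j] = C[j][i] = 1.0 - 1.0 / (4.0 * sigma_c * g)
    return C
```

Because γ_ij scales as 1/r_c² in this inversion while σ_c scales as r_c², the product σ_c γ_ij, and hence the returned matrix, does not depend on the choice of `r_c`, in line with the remark following Eq. 11.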

Simulated annealing. The simulated annealing process is described below.

Initialization. An initial configuration $s^{(0)}$ is generated in two random steps. First, the total number of CDs, $K$, is drawn randomly from the set of integers $\{1, \cdots, N\}$. Then, each genomic segment $i \in \{1, \cdots, N\}$ is allocated randomly into one of the CDs, $k \in \{1, 2, \cdots, K\}$. The initial temperature $T_0$ is determined such that the acceptance probability for the "worst" move around $s^{(0)}$ is 0.5.

Iteration. At each step $r$, the temperature is fixed at $T_r$. We sample the target distribution $p_r(s|C) \propto \exp(-H(s|C)/T_r)$, using the Metropolis-Hastings sampler described above. For the next step $r + 1$, the temperature is lowered by a constant cooling factor $c_{\mathrm{cool}} \in (0, 1)$, such that the next temperature is $T_{r+1} = c_{\mathrm{cool}} \cdot T_r$. We used $c_{\mathrm{cool}} = 0.95$ in this study.

Final solution. The annealing is repeated until the temperature reaches $T_f$. We used $T_f = 0.03$. Then we quench the system to the closest local minimum by performing gradient descent. Because there is still no guarantee that the global minimum is found, we tried a batch of at least 10 different initial configurations and chose the final state $s^*$ that gives the minimal $H(s^*|C)$.

Analysis on subsets of Hi-C data. Our method allows the user to break down the Hi-C data into subsets, as long as the CDs are localized within the subsets (Fig. S13). This saves the algorithm from the large memory requirement of dealing with the entire intra-chromosomal Hi-C (for example, Hi-C of chromosome 10 has 2711 bins in 50-kb resolution). For the analysis of the 50-kb resolution Hi-C data in this paper, we used subsets of the data that correspond to 40-Mb ranges along the genome, or 800 bins.

Analysis and evaluation of domain solutions.

Similarity of two distinct CD solutions using Pearson correlation. To measure the extent of similarity between two CD solutions $s$ and $s'$, we evaluate the Pearson correlation. The binary matrices $B$ and $B'$ that represent the two CD solutions are defined such that the matrix elements are 1 within the same CD and 0 otherwise, i.e., $(B)_{ij} = B_{ij} = \delta_{s_i s_j}$. The similarity between $B$ and $B'$ is quantified using the Pearson correlation

$$\rho = \frac{\langle \delta B\, \delta B' \rangle}{\sqrt{\langle (\delta B)^2 \rangle}\, \sqrt{\langle (\delta B')^2 \rangle}}, \tag{13}$$

where $\langle \delta B\, \delta B' \rangle = \langle (B_{ij} - \bar{B})(B'_{ij} - \bar{B}') \rangle_{i \neq j}$, and $\langle (\delta B)^2 \rangle = \langle (B_{ij} - \bar{B})^2 \rangle_{i \neq j}$. The average $\langle \cdot \rangle_{i \neq j}$ runs over all distinct pairs.

Normalized mutual information. We use the mutual information to evaluate how well a CD solution $s$ captures the visible patterns in the pairwise correlation data. We consider the binary grouping matrix $(B)_{ij} = B_{ij} = \delta_{s_i, s_j}$ for the CD solution of interest, and compare it to the input data matrix $(A)_{ij} = A_{ij}$. In this study, either $\log_{10} M$ or $C_{O/E}$ was used for $A$. Treating the matrix elements $a \in A$ and $b \in B$ as two random variables, we construct the joint distribution

$$p(a, b) = \langle \delta_{A_{ij}, a}\, \delta_{B_{ij}, b} \rangle_{i \neq j}, \tag{14}$$

where $\langle \cdot \rangle_{i \neq j}$ is an average over all distinct pairs. The Kronecker delta for the continuous variable $a$ is defined in a discretized fashion: that is, $\delta_{A_{ij}, a} = 1$ if $A_{ij} \in [a, a + \Delta a)$ and 0 otherwise, where $\Delta a\ (= [\max\{A_{ij}\} - \min\{A_{ij}\}]/100)$ is used for discretization into 100 bins. Then we can calculate the mutual information,

$$I(A; B) = \sum_{a \in A} \sum_{b \in B} p(a, b) \log\left[\frac{p(a, b)}{p(a)\, p(b)}\right], \tag{15}$$

and the normalized mutual information (nMI),

$$\mathrm{nMI}(A; B) = \frac{I(A; B)}{\sqrt{H(A) \cdot H(B)}}, \tag{16}$$

where $H(X) = -\sum_{x \in X} p(x) \log p(x)$ is the marginal entropy.

Nestedness of CD solutions. Here we define a measure to quantify the nestedness between two CD solutions, $s$ (assumed to have smaller domains on average) and $s'$ (larger domains). The idea is the following: $s$ is perfectly nested in $s'$ if, whenever two sites belong to a same domain in $s$, they also belong to a same domain in $s'$. For each domain $k \in s$, we consider the best overlap of this domain $k$ on the other solution $s'$:

$$h_1(k \to s') = \max_{k' \in s'} v_{k,k'} / n_k, \tag{17}$$

where $v_{k,k'}$ is the number of overlapping sites between two domains $k \in s$ and $k' \in s'$, and $n_k$ is the size of domain $k$. The highest score $h_1(k \to s') = 1$ is obtained when domain $k$ is fully included in one of the domains in $s'$. The null hypothesis corresponds to the case where the domains in $s$ and $s'$ are completely uncorrelated, in which case $h_1$ only reflects the overlap "by chance". The chance level $\bar{h}_1(k \to s')$ is calculated by making $n_k$ random draws from $s'$; we averaged over 100 independent trials. We normalize the score as

$$\hat{h}_1(k \to s') = \frac{h_1 - \bar{h}_1}{1 - \bar{h}_1}, \tag{18}$$

such that $\hat{h}_1(k \to s') = 0$ indicates the chance level, and $\hat{h}_1(k \to s') = 1$ means a perfect nestedness. Finally, we define the nestedness score $h(s \to s')$ for the entire CD solution as a weighted average:

$$h(s \to s') = \sum_{k \in s} \hat{h}_1(k \to s') \cdot n_k / N. \tag{19}$$

Correlation between CTCF signal and domain boundaries. The validity of domain boundaries, determined from various CD-identification methods including Multi-CD, is assessed in terms of their correlation with the CTCF signal. Suppose that the CTCF signal at genomic segment $i$ is given as $\phi_{\mathrm{CTCF}}(i)$. Then, we can consider an overlap function between $\phi_{\mathrm{CTCF}}(i)$ and a CD-boundary indicating function $\psi_{\mathrm{DB}}(i)$, where $\psi_{\mathrm{DB}}(i) = 1$ if the i-th segment is precisely at the domain boundary; $\psi_{\mathrm{DB}}(i) = 0$, otherwise. We evaluated a distance-dependent, normalized overlap function $\chi(d)$, defined as

$$\chi(d) = \frac{\langle \delta\phi_{\mathrm{CTCF}}(i + d)\, \psi_{\mathrm{DB}}(i) \rangle_i}{\langle \psi_{\mathrm{DB}} \rangle}, \tag{20}$$

where $\delta\phi_{\mathrm{CTCF}} = \phi_{\mathrm{CTCF}} - \langle \phi_{\mathrm{CTCF}} \rangle$. If the domain boundaries determined from Multi-CD are well correlated with the TAD-capturing CTCF signal, a sharply peaked and large-amplitude overlap function $\chi(d)$ is expected at $d = 0$.

Correlation between epigenetic marks and compartments. We calculate the correlation of our compartment solutions with the epigenetic marks. Given a compartment solution $s$ with two large domains A and B, we consider two binary vectors $q^{(A)}$ and $q^{(B)}$, where $q^{(A)}_i = +1$ if the i-th segment belongs to compartment A, and $q^{(A)}_i = -1$ otherwise. For a set of epigenetic marks measured across the genome and represented by a vector $h$, whose component $h_i$ denotes the value at the i-th genomic segment, the correlations between the solutions of compartments A and B and the epigenetic marks can be evaluated using the Pearson correlations as:

$$c_A = \frac{q^{(A)} \cdot h}{|q^{(A)}||h|}, \qquad c_B = \frac{q^{(B)} \cdot h}{|q^{(B)}||h|}. \tag{21}$$

Data availability. All data used in the paper were obtained from publicly available repositories. See SI Appendix.

Code availability. The Matlab software package and associated documentation are available online (https://github.com/multi-cd).

ACKNOWLEDGMENTS. We thank Roger Oria Fernandez for critical comment on our script deposited in GitHub. This work was supported in part by a KIAS Individual Grant at Korea Institute for Advanced Study (No. CG035003 to C.H.). We thank the Center for Advanced Computation in KIAS for providing computing resources.

1. J. O. Davies, A. M. Oudelaar, D. R. Higgs & J. R. Hughes, How best to identify chromosomal interactions: a comparison of approaches. Nat. Methods, 14, 125, (2017).
2. J. Dekker, K. Rippe, M. Dekker & N. Kleckner, Capturing chromosome conformation. Science, 295, 1306–1311, (2002).
3. T. Misteli, Beyond the sequence: cellular organization of genome function. Cell, 128, 787–800, (2007).
4. W. A. Bickmore & B. van Steensel, Genome architecture: domain organization of interphase chromosomes. Cell, 152, 1270–1284, (2013).
5. C. Lanctôt, T. Cheutin, M. Cremer, G. Cavalli & T. Cremer, Dynamic genome architecture in the nuclear space: regulation of gene expression in three dimensions. Nat. Rev. Genetics, 8, 104, (2007).
6. J. Fraser, C. Ferrai, A. M. Chiariello, M. Schueler, T. Rito, G. Laudanno, M. Barbieri, B. L. Moore, D. C. Kraemer, S. Aitken et al., Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation. Mol. Sys. Biol., 11, 852, (2015).
7. T. Ryba, I. Hiratani, J. Lu, M. Itoh, M. Kulik, J. Zhang, T. C. Schulz, A. J. Robins, S. Dalton & D. M. Gilbert, Evolutionarily conserved replication timing profiles predict long-range chromatin interactions and distinguish closely related cell types. Genome Res., 20, 761–770, (2010).


8. I. Hiratani, T. Ryba, M. Itoh, T. Yokochi, M. Schwaiger, C.-W. Chang, Y. Lyou, T. M. Townes, D. Schübeler & D. M. Gilbert, Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol., 6, e245, (2008).
9. G. Cavalli & T. Misteli, Functional implications of genome topology. Nat. Struct. Mol. Biol., 20, 290–299, (2013).
10. J. R. Dixon, I. Jung, S. Selvaraj, Y. Shen, J. E. Antosiewicz-Bourget, A. Y. Lee, Z. Ye, A. Kim, N. Rajagopal, W. Xie, Y. Diao, J. Liang, H. Zhao, V. V. Lobanenkov, J. R. Ecker, J. A. Thomson & B. Ren, Chromatin architecture reorganization during stem cell differentiation. Nature, 518, 331, (2015).
11. P. H. L. Krijger & W. De Laat, Regulation of disease-associated gene expression in the 3D genome. Nat. Rev. Mol. Cell Biol., 17, 771, (2016).
12. J. R. Dixon, J. Xu, V. Dileep, Y. Zhan, F. Song, V. T. Le, G. G. Yardımcı, A. Chakraborty, D. V. Bann, Y. Wang, R. Clark, L. Zhang, H. Yang, T. Liu, S. Iyyanki, L. An, C. Pool, T. Sasaki, J. C. Rivera-Mulia, H. Ozadam, B. R. Lajoie, R. Kaul, M. Buckley, K. Lee, M. Diegel, D. Pezic, C. Ernst, S. Hadjur, D. T. Odom, J. A. Stamatoyannopoulos, J. R. Broach, R. C. Hardison, F. Ay, W. S. Noble, J. Dekker, D. M. Gilbert & F. Yue, Integrative detection and analysis of structural variation in cancer genomes. Nat. Genetics, 50, 1388, (2018).
13. Y. Shao, N. Lu, Z. Wu, C. Cai, S. Wang, L.-L. Zhang, F. Zhou, S. Xiao, L. Liu, X. Zeng, H. Zheng, C. Yang, Z. Zhao, G. Zhao, J.-Q. Zhou, X. Xue & Z. Qin, Creating a functional single-chromosome yeast. Nature, 560, 331, (2018).
14. L. Liu, M. H. Kim & C. Hyeon, Heterogeneous Loop Model to Infer 3D Chromosome Structures from Hi-C. Biophys. J., 117, 613–625, (2019).
15. M. Franke, D. M. Ibrahim, G. Andrey, W. Schwarzer, V. Heinrich, R. Schöpflin, K. Kraft, R. Kempfer, I. Jerković, W. L. Chan, M. Spielmann, B. Timmermann, L. Wittler, I. Kurth, P. Cambiaso, O. Zuffardi, G. Houge, L. Lambie, F. Brancati, A. Pombo, M. Vingron, F. Spitz & S. Mundlos, Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature, 538, 265–269, (2016).
16. S. C. Baca, D. Prandi, M. S. Lawrence, J. M. Mosquera, A. Romanel, Y. Drier, K. Park, N. Kitabayashi, T. Y. MacDonald, M. Ghandi et al., Punctuated evolution of prostate cancer genomes. Cell, 153, 666–677, (2013).
17. T. Cremer & M. Cremer, Chromosome territories. Cold Spring Harbor Perspectives in Biology, 2, (2010).
18. E. Lieberman-Aiden, N. L. Van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner et al., Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326, 289–293, (2009).
19. S. S. Rao, M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander & E. L. Aiden, A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159, 1665–1680, (2014).
20. T. Sexton & G. Cavalli, The role of chromosome domains in shaping the functional genome. Cell, 160, 1049–1059, (2015).
21. S. Wang, J.-H. Su, B. J. Beliveau, B. Bintu, J. R. Moffitt, C.-t. Wu & X. Zhuang, Spatial organization of chromatin domains and compartments in single chromosomes. p., aaf8084, (2016).
22. J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu & B. Ren, Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485, 376–380, (2012).
23. E. P. Nora, B. R. Lajoie, E. G. Schulz, L. Giorgetti, I. Okamoto, N. Servant, T. Piolot, N. L. van Berkum, J. Meisig & J. Sedat, Spatial partitioning of the regulatory landscape of the X-inactivation center. Nature, 485, 381, (2012).
24. T. Sexton, E. Yaffe, E. Kenigsberg, F. Bantignies, B. Leblanc, M. Hoichman, H. Parrinello, A. Tanay & G. Cavalli, Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome. Cell, 148, 458–472, (2012).
25. B. D. Pope, T. Ryba, V. Dileep, F. Yue, W. Wu, O. Denas, D. L. Vera, Y. Wang, R. S. Hansen, T. K. Canfield et al., Topologically associating domains are stable units of replication-timing regulation. Nature, 515, 402, (2014).
26. J. Dekker & E. Heard, Structural and functional diversity of topologically associating domains. FEBS Letters, 589, 2877–2884, (2015).
27. J. R. Dixon, D. U. Gorkin & B. Ren, Chromatin domains: the unit of chromosome organization. Mol. Cell, 62, 668–680, (2016).
28. J. E. Phillips-Cremins, M. E. Sauria, A. Sanyal, T. I. Gerasimova, B. R. Lajoie, J. S. Bell, C.-T. Ong, T. A. Hookway, C. Guo, Y. Sun, M. J. Bland, W. Wagstaff, S. Dalton, T. C. McDevitt, R. Sen, J. Dekker, J. Taylor & V. Corces, Architectural protein subclasses shape 3D organization of genomes during lineage commitment. Cell, 153, 1281–1295, (2013).
29. F. Jin, Y. Li, J. R. Dixon, S. Selvaraj, Z. Ye, A. Y. Lee, C.-A. Yen, A. D. Schmitt, C. Espinoza & B. Ren, A high-resolution map of three-dimensional chromatin interactome in human cells. Nature, 503, 290, (2013).
Acad. Sci. U. S. A., 109, 16173–16178, (2012).
37. J. D. Halverson, J. Smrek, K. Kremer & A. Y. Grosberg, From a melt of rings to chromosome territories: the role of topological constraints in genome folding. Rep. Prog. Phys., 77, 022601, (2014).
38. C. A. Brackley, J. M. Brown, D. Waithe, C. Babbs, J. Davies, J. R. Hughes, V. J. Buckle & D. Marenduzzo, Predicting the three-dimensional folding of cis-regulatory regions in mammalian genomes using bioinformatic data and polymer models. Genome Biol., 17, 59, (2016).
39. G. Shi, L. Liu, C. Hyeon & D. Thirumalai, Interphase Human Chromosome Exhibits Out of Equilibrium Glassy Dynamics. Nat. Commun., 9, 3161, (2018).
40. L. Liu, G. Shi, D. Thirumalai & C. Hyeon, Chain organization of human interphase chromosome determines the spatiotemporal dynamics of chromatin loci. PLoS Comp. Biol., 14, e1006617, (2018).
41. J. D. Bryngelson & D. Thirumalai, Internal constraints induce localization in an isolated polymer molecule. Phys. Rev. Lett., 76(3), 542–545, (1996).
42. M. Bohn, D. W. Heermann & R. van Driel, Random loop model for long polymers. Phys. Rev. E., 76, 051805, (2007).
43. R. Bruinsma, A. Y. Grosberg, Y. Rabin & A. Zidovska, Chromatin hydrodynamics. Biophys. J., 106, 1871–1881, (2014).
44. C. Battle, C. P. Broedersz, N. Fakhri, V. F. Geyer, J. Howard, C. F. Schmidt & F. C. MacKintosh, Broken detailed balance at mesoscopic scales in active biological systems. Science, 352, 604–607, (2016).
45. W. Hwang & C. Hyeon, Quantifying the heat dissipation from molecular motor’s transport properties in nonequilibrium steady states. J. Phys. Chem. Lett., 8, 250–256, (2017).
46. W. Hwang & C. Hyeon, Energetic costs, precision, and transport efficiency of molecular motors. J. Phys. Chem. Lett., 9, 513–520, (2018).
47. X. Fang, K. Kruse, T. Lu & J. Wang, Nonequilibrium physics in biology. Rev. Mod. Phys., 91, 045004, (2019).
48. A. Rosa & R. Everaers, Structure and dynamics of interphase chromosomes. PLoS Comput. Biol., 4, e1000153, (2008).
49. H. Kang, Y.-G. Yoon, D. Thirumalai & C. Hyeon, Confinement-induced glassy dynamics in a model for chromosome organization. Phys. Rev. Lett., 115, 198102, (2015).
50. G. Shi & D. Thirumalai, Conformational heterogeneity in human interphase chromosome organization reconciles the FISH and Hi-C paradox. Nature Commun., 10, 1–10, (2019).
51. Q. Szabo, D. Jost, J.-M. Chang, D. I. Cattoni, G. L. Papadopoulos, B. Bonev, T. Sexton, J. Gurgo, C. Jacquier, M. Nollmann, F. Bantignies & G. Cavalli, TADs are 3D structural units of higher-order chromosome organization in Drosophila. Science Advances, 4, (2018).
52. E. H. Finn, G. Pegoraro, S. Shachar & T. Misteli, Comparative analysis of 2d and 3d distance measurements to study spatial genome organization. Methods, 123, 47–55, (2017).
53. L. Giorgetti, R. Galupa, E. P. Nora, T. Piolot, F. Lam, J. Dekker, G. Tiana & E. Heard, Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell, 157, 950–963, (2014).
54. N. Sauerwald, S. Zhang, C. Kingsford & I. Bahar, Chromosomal dynamics predicted by an elastic network model explains genome-wide accessibility and long-range couplings. Nucleic Acids Research, 45, 3663–3673, (2017).
55. S. Zhang, F. Chen & I. Bahar, Differences in the intrinsic spatial dynamics of the chromatin contribute to cell differentiation. Nucleic Acids Research, x, gkz1102, (2019).
56. L. Laloux, P. Cizeau, J.-P. Bouchaud & M. Potters, Noise dressing of financial correlation matrices. Phys. Rev. Lett., 83, 1467, (1999).
57. J. D. Noh, Model for correlations in stock markets. Phys. Rev. E., 61, 5981, (2000).
58. L. Giada & M. Marsili, Data clustering and noise undressing of correlation matrices. Phys. Rev. E., 63, 061101, (2001).
59. S. Kirkpatrick, D. Gelatt & M. P. Vecchi, Optimization by simulated annealing. Science, 220, 671–680, (1983).
60. M. Vietri Rudan, C. Barrington, S. Henderson, C. Ernst, D. Odom, A. Tanay & S. Hadjur, Comparative Hi-C reveals that CTCF underlies evolution of chromosomal domain architecture. Cell Reports, 10, 1297–1309, (2015).
61. A. Li, X. Yin, B. Xu, D. Wang, J. Han, Y. Wei, Y. Deng, Y. Xiong & Z. Zhang, Decoding topologically associating domains with ultra-low resolution Hi-C data by graph structural entropy. Nature Commun., 9, 3265, (2018).
62. A. Grosberg, S. Nechaev & E. Shakhnovich, The role of topological constraints in the kinetics of collapse of macromolecules. J. Phys., 49, 2095–2100, (1988).
63. S. Fishilevich, R. Nudel, N. Rappaport, R. Hadar, I. Plaschkes, T. Iny Stein, N. Rosen, A. Kohn, M. Twik, M. Safran, D. Lancet & D. Cohen, GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database, 2017, 1–17, (2017).
64. E. P. Nora, A. Goloborodko, A.-L. Valton, J. H. Gibcus, A. Uebersohn, N. Abdennur, J. Dekker, L. A. Mirny & B. G.
Bruneau, Targeted degradation of CTCF decouples local insulation of chro- 30. P. P. Rocha, R. Raviram, R. Bonneau & J. A. Skok, Breaking TADs: insights into hierarchical mosome domains from genomic compartmentalization. Cell, 169, 930–944, (2017). genome organization. Epigenomics, 7, 523–526, (2015). 65. W. Schwarzer, N. Abdennur, A. Goloborodko, A. Pekowska, G. Fudenberg, Y. Loe-Mie, N. A. 31. Q. Wang, Q. Sun, D. M. Czajkowsky & Z. Shao, Sub-kb Hi-C in D. melanogaster reveals con- Fonseca, W. Huber, C. H. Haering, L. Mirny et al., Two independent modes of chromatin organi- served characteristics of TADs between insect and mammalian cells. Nature communications, zation revealed by cohesin removal. Nature, 551, 51, (2017). 9, 188, (2018). 66. T. Misteli, Beyond the sequence: cellular organization of genome function. Cell, 128, 787– 32. C. Weinreb & B. J. Raphael, Identification of hierarchical chromatin domains. Bioinformatics, 800, (2007). 32, 1601–1609, (2016). 67. R. S. Hansen, S. Thomas, R. Sandstrom, T. K. Canfield, R. E. Thurman, M. Weaver, M. O. 33. D. Filippova, R. Patro, G. Duggal & C. Kingsford, Identification of alternative topological do- Dorschner, S. M. Gartler & J. A. Stamatoyannopoulos, Sequencing newly replicated DNA remains in chromatin. Algorithms for Molecular Biology, 9, 14, (2014). veals widespread plasticity in human replication timing. Proc. Natl. Acad. Sci. U. S. A., 107, 34. H. Shin, Y. Shi, C. Dai, H. Tjong, K. Gong, F. Alber & X. J. Zhou, Topdom: an efficient and 139–144, (2010). deterministic method for identifying topological domains in genomes. Nucleic Acids Res., 44, 68. G. Fudenberg, M. Imakaev, C. Lu, A. Goloborodko, N. Abdennur & L. A. Mirny, Formation of e70–e70, (2015). chromosomal domains by loop extrusion. Cell Reports, 15, 2038–2049, (2016). 35. J. Mateos-Langerak, M. Bohn, W. de Leeuw, O. Giromus, E. M. Manders, P. J. Verschure, 69. J. Gassler, H. B. Brandão, M. Imakaev, I. M. Flyamer, S. Ladstätter, W. A. Bickmore, J.-M. 
Pe- M. H. Indemans, H. J. Gierman, D. W. Heermann, R. Van Driel & S. Goetze, Spatially confined ters, L. A. Mirny & K. Tachibana, A mechanism of cohesin-dependent loop extrusion organizes folding of chromatin in the interphase nucleus. Proc. Natl. Acad. Sci. U. S. A., 106, 3812–3817, zygotic genome architecture. EMBO J., 36, 3600–3618, (2017). (2009). 70. J. Nuebler, G. Fudenberg, M. Imakaev, N. Abdennur & L. A. Mirny, Chromatin organization by 36. M. Barbieri, M. Chotalia, J. Fraser, L.-M. Lavitas, J. Dostie, A. Pombo & M. Nicodemi, Com- an interplay of loop extrusion and compartmental segregation. Proc. Natl. Acad. Sci. U. S. A., plexity of chromatin folding is captured by the strings and binders switch model. Proc. Natl. 115, E6697–E6706, (2018).


71. P. A. Knight & D. Ruiz, A fast algorithm for matrix balancing. IMA J. Numer. Anal., 33, 1029–1047, (2013).
72. E. P. Consortium et al., An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57, (2012).
73. P. M. Goldbart & A. Zippelius, Amorphous solid state of vulcanized macromolecules: A variational approach. Phys. Rev. Lett., 71(14), 2256–2259, (1993).
74. J. D. Bryngelson & D. Thirumalai, Internal constraints induce localization in an isolated polymer molecule. Phys. Rev. Lett., 76(3), 542–545, (1996).
75. T. Haliloglu, I. Bahar & B. Erman, Gaussian dynamics of folded proteins. Phys. Rev. Lett., 79(16), 3090–3093, (1997).
76. C. Hyeon, G. Morrison & D. Thirumalai, Force-dependent hopping rates of RNA hairpins can be estimated from accurate measurement of the folding landscapes. Proceedings of the National Academy of Sciences, 105, 9604–9609, (2008).

Supporting Information Appendix

A. Data acquisition.

Hi-C data. Hi-C data were obtained through the GEO data repository (GSE63525-celltype-primary) (19), where celltype is replaced by one of the five different cell types that were considered in our analysis (GM12878, HUVEC, NHEK, K562, and KBM7).

Biological markers. The domain solutions from Multi-CD were compared with known biological markers. We obtained these data mostly from the ENCODE project (72). Specifically, we used the enrichment data of the transcriptional repressor CTCF measured in a ChIP-seq assay from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwTfbs/wgEncodeUwTfbsGm12878CtcfStdPkRep1.narrowPeak.gz. We binned the CTCF assay at 50-kb resolution (to match the Hi-C format). If there were multiple signal enrichments in a single bin, we took the average value. Because each CTCF signal has a finite width, there are occasional cases where a signal ranges across two bins; in those cases we divided the signal strength evenly between the two bins. The Repli-seq signals in the six phases G1, S1, S2, S3, S4, and G2 were obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwRepliSeq/, and were averaged over 50-kb windows along the genome to construct the replication timing profiles. The 11 histone mark signals were obtained from http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/signal/jan2011/bigwig/, and underwent the same preprocessing as the Repli-seq signals. The RNA-seq data for the four cell lines GM12878, HUVEC, NHEK and K562 were also obtained from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/. RNA-seq data for the cell line KBM7 were obtained separately from https://opendata.cemm.at/barlowlab/2015_Kornienko_et_al/hg19/AK_KBM7_2_WT_SN.F.bw. Information about the APBB1IP gene was obtained from the GeneCards database https://www.genecards.org/cgi-bin/carddisp.pl?gene=APBB1IP. The information about its regulatory elements was obtained specifically from the GeneHancer (63) database http://hgdownload.cse.ucsc.edu/gbdb/hg19/geneHancer/geneHancerInteractionsAll.hg19.bb, accessible through the UCSC Genome Browser Module.

B. Justifications for the Gaussian polymer network for modeling chromosomes. Here we provide additional justifications for the use of harmonic potentials in the effective Hamiltonian, and consequently, a Gaussian distribution for pairwise distances (Eq. 1).

The concept of an effective Hamiltonian consisting of harmonic potential terms is not new; it has been widely employed to study a variety of systems, including the phase transition of vulcanized macromolecules with increasing numbers of crosslinks (73, 74), and the fluctuation dynamics of native proteins (Gaussian network model, (75)). Furthermore, a slightly modified, but essentially identical, form of Hamiltonian was used to study the dynamics of folding/unfolding transitions of a single RNA molecule under external force (generalized Rouse model, (76)).

Whereas the success of the Gaussian polymer network model does not necessarily guarantee its extension to the modeling of chromosomes, our use of a Gaussian distribution for the pairwise distance between two segments in the polymer is empirically justified. The Gaussian-like pairwise distance distributions reported by fluorescence measurements of the chromosome (Fig. S1), and the agreement of the 3D structural properties inferred by a modeling approach that shares the same philosophy (heterogeneous loop model or HLM, (14)), suggest that a harmonic spring network provides a reasonable approximation of the energy landscape for the mixture of those subpopulations.

As a side note, it is worth highlighting the versatility of the Gaussian polymer network model in representing the complex topology of chromosome conformation. For the conventional Rouse chain whose monomers along the backbone are constrained by an energy Hamiltonian $\mathcal{H} = (k/2)\sum_{i}^{N-1}(r_{i+1} - r_i)^2$ with a uniform spring constant $k$, it is straightforward to show that $\langle r_{ij}^2\rangle \sim |i - j|$. Furthermore, if two monomers are in close proximity to form a contact ($r_{ij} < r_c$), then one can obtain the contact probability between monomers $i$ and $j$ in the chain backbone as $p_{ij} = \int_0^{r_c} dr_{ij}\, P(r_{ij}) \sim |i - j|^{-3/2}$. However, adding just a few non-nearest-neighbor interactions to the Rouse model makes the results highly nontrivial. To illustrate this, we explicitly compared the contact probability map of a linear Gaussian chain (Rouse chain), and those of Gaussian polymer network models with varying numbers of non-nearest-neighbor interactions, which were calculated from the HLM-generated structural ensemble (14) (see Fig. S14). The statistical behavior of the Gaussian polymer network model differs from that of the linear “Gaussian” chain. The mean square distance $\langle r_{ij}^2\rangle$ no longer scales linearly with the separation $s \equiv |i - j|$ (see Fig. S14e), and the contact probability $p_{ij}$ (or $p(s)$) is no longer described with a simple scaling relation (Fig. S14f). The simple modification to the Rouse model, resulting in the Gaussian polymer network model, allows one to explore many different issues of chromosomes. In fact, our recent work based on HLM (14) demonstrated several case studies, substantiating the various experimental measurements on chromosome conformation by solving the inverse problem of inferring chromosome structures from Hi-C data.

C. Derivation of the likelihood function. Here we show how to derive the likelihood function, Eq. 6.

Problem. We want to compute

$$p(x|s, g) = \left\langle \delta_N\big(x - f(\eta, \epsilon)\big) \right\rangle_{\eta,\epsilon} \tag{S1}$$

with the following assumptions:

• $x \in \mathbb{R}^N$ is a sequence of normalized and uncorrelated observations, with zero mean $\langle x\rangle = 0_N$ and unit covariance $\mathrm{Cov}(x) = I_N$.
• $s = (s_1, \cdots, s_N)$ is a clustering map that assigns each site $i \in \{1, \cdots, N\}$ to a cluster index $s_i \in \{1, \cdots, K\}$. Without loss of generality, we can assume that $s_i \le s_j$ whenever $i < j$ (ordered indexing).
• $\eta \sim \mathcal{N}(0_N, \Lambda)$ and $\epsilon \sim \mathcal{N}(0_N, \Sigma)$ are i.i.d. Gaussian random variables, where $\Lambda$ and $\Sigma$ are $N \times N$ covariance matrices. The cluster-dependent covariance is a block diagonal matrix $\Lambda = [\Lambda_s] = [1_{n_s} 1_{n_s}^\top]$, defined element-wise as $(\Lambda)_{ij} = \delta_{s_i,s_j}$. The site-wise variation is assumed to be uncorrelated, with a unit covariance matrix $\Sigma = I_N$, or $(\Sigma)_{ij} = \delta_{ij}$.
• The clustering strength $g = (g_1, \cdots, g_K)$ parameterizes the target function $f$, defined element-wise as

$$f_i(\eta, \epsilon) = \frac{\sqrt{g_{s_i}}\,\eta_i + \epsilon_i}{\sqrt{1 + g_{s_i}}}, \quad i = 1, \cdots, N. \tag{S2}$$

Lemma: Gaussian integral.

$$\int_{\mathbb{R}^N} dz\, \mathcal{N}(z|\mu, M)\, e^{i a^\top z} = \exp\left(i a^\top \mu - \frac{1}{2} a^\top M a\right), \quad a \in \mathbb{R}^N, \tag{S3}$$

which reduces to $\exp(-\frac{1}{2} a^\top M a)$ in the zero-mean case $\mu = 0$ used below.

Lemma: Sherman-Morrison formula.

$$(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}. \tag{S4}$$

Solution. Let us abbreviate the coefficients as $\alpha_s \equiv \sqrt{g_s/(1 + g_s)}$ and $\beta_s \equiv 1/\sqrt{1 + g_s}$, such that $f_i = \alpha_{s_i}\eta_i + \beta_{s_i}\epsilon_i$. Further define $A \equiv \mathrm{diag}(\alpha_{s_i})$ and $B \equiv \mathrm{diag}(\beta_{s_i})$, to write $f = A\eta + B\epsilon$. Taking


the inverse Fourier transform of the Dirac delta function, we can write

$$\delta_N(x - f) = \int \frac{dk}{(2\pi)^N}\, e^{i(x - f)^\top k} = \int \frac{dk}{(2\pi)^N}\, e^{i(x - A\eta - B\epsilon)^\top k},$$

where $\int = \int_{\mathbb{R}^N}$ unless otherwise specified. Now we can rewrite Eq. S1, and evaluate the Gaussian integrals using the lemma (Eq. S3):

$$p(x|s, g) = \int \frac{dk}{(2\pi)^N}\, e^{i x^\top k} \int d\eta\, \mathcal{N}(\eta)\, e^{-i (A\eta)^\top k} \int d\epsilon\, \mathcal{N}(\epsilon)\, e^{-i (B\epsilon)^\top k}$$
$$= \int \frac{dk}{(2\pi)^N} \exp\left( i x^\top k - \frac{1}{2}(Ak)^\top \Lambda (Ak) - \frac{1}{2}(Bk)^\top \Sigma (Bk) \right)$$
$$= \int \frac{dk}{(2\pi)^N} \exp\left( i x^\top k - \frac{1}{2} k^\top Q k \right), \tag{S5}$$

where $Q \equiv (A\Lambda A + B\Sigma B)$. Recognizing that this is another (un-normalized) Gaussian integral with covariance matrix $Q^{-1}$, we use the lemma (Eq. S3) once again:

$$p(x|s, g) = \frac{\sqrt{(2\pi)^N \det Q^{-1}}}{(2\pi)^N} \int dk\, \mathcal{N}(k|0, Q^{-1})\, e^{i x^\top k}$$
$$= (2\pi)^{-N/2} \exp\left( -\frac{1}{2} x^\top Q^{-1} x - \frac{1}{2}\log\det Q \right). \tag{S6}$$

With uncorrelated $\epsilon$, both $Q$ and $Q^{-1}$ are block diagonal matrices, and the exponent is completely separable by clusters:

$$\log p(x|s, g) = -\frac{1}{2}\sum_{s=1}^{K}\left[ x_s^\top Q_s^{-1} x_s + \log\det Q_s \right] - \frac{N}{2}\log 2\pi, \tag{S7}$$

where $x_s$ is the corresponding $n_s$-dimensional subset of $x$, and $Q_s = A_s\Lambda_s A_s + B_s\Sigma_s B_s = \alpha_s^2\Lambda_s + \beta_s^2\Sigma_s$ is the $n_s \times n_s$ block matrix corresponding to cluster index $s$; element-wise, $(Q_s)_{ij} = \alpha_s^2 + \beta_s^2\delta_{ij}$. The last term of Eq. S7 is a constant that depends on neither $s$ nor $g$.

We now simplify the two terms in the summand of Eq. S7, and show that the resulting expression corresponds to Eq. 6. First, the quadratic term can be expanded by using the Sherman-Morrison lemma, Eq. S4:

$$Q_s^{-1} = \left(\beta_s^2 I_{n_s} + (\alpha_s 1_{n_s})(\alpha_s 1_{n_s})^\top\right)^{-1} \tag{S8}$$
$$= \frac{1}{\beta_s^2}\left( I - \frac{(\alpha_s^2/\beta_s^2)\, 1 1^\top}{1 + (\alpha_s^2/\beta_s^2)\, 1^\top 1} \right). \tag{S9}$$

With $1/\beta_s^2 = 1 + g_s$, $\alpha_s^2/\beta_s^2 = g_s$ and $1^\top 1 = n_s$, the quadratic form is

$$x_s^\top Q_s^{-1} x_s = (1 + g_s)\left( n_s - \frac{g_s c_s}{1 + g_s n_s} \right), \tag{S10}$$

where $x_s^\top x_s = \sum_{i=1}^{N} (x_i)^2\,\delta_{s_i,s} \approx \langle x_i^2\rangle \sum_{i=1}^{N}\delta_{s_i,s} = n_s$, and $x_s^\top (1 1^\top) x_s = \sum_{i,j=1}^{N} x_i x_j\,\delta_{s_i,s}\,\delta_{s_j,s} \equiv c_s$.

Second, the log-determinant term can be calculated by considering the eigenvalues of the matrix $Q_s$. Solving $Q_s z = \lambda_s z$ for an arbitrary $n_s$-dimensional vector $z$,

$$\lambda_s z = \alpha_s^2 (1^\top z)\, 1 + \beta_s^2 z; \tag{S11}$$

there are two types of solutions. The first possibility is to have the eigenvector $z \propto 1$, in which case $\lambda_{s,1} = \alpha_s^2 n_s + \beta_s^2 = (1 + g_s n_s)/(1 + g_s)$. The other possibility is to have $1^\top z = 0$, so that $(\lambda_s - \beta_s^2) z$ must vanish, where $\lambda_{s,2} = \cdots = \lambda_{s,n_s} = \beta_s^2 = 1/(1 + g_s)$; the degenerate eigenvectors span the remaining $(n_s - 1)$-dimensional subspace. Therefore

$$\det(Q_s) = (\alpha_s^2 n_s + \beta_s^2)\cdot(\beta_s^2)^{n_s - 1} = \frac{1 + g_s n_s}{(1 + g_s)^{n_s}}, \tag{S12}$$

and

$$\log\det(Q_s) = \log(1 + g_s n_s) - n_s\log(1 + g_s). \tag{S13}$$

Substituting Eq. S10 and Eq. S13 into Eq. S7, we have derived Eq. 6 as shown in the main text.

D. Modification of the correlation matrix for the secondary domain solution. Here we show how the correlation matrix can be modified to solve for a secondary domain solution (for example compartments), by taking out the contribution from the primary domain solution (for example the meta-TADs), which is supposed to be known already.

We extend the group model to consider a bivariate grouping, $i \mapsto (s_i, u_i)$, where each genomic locus $i$ simultaneously belongs to a primary group $s_i$ and a secondary group $u_i$, presumably at different scales. Generalizing Eq. 3, we assume a linear model

$$x_i = \frac{1}{\sqrt{1 + g_{s_i} + h_{u_i}}}\left( \epsilon_i + \sqrt{g_{s_i}}\,\eta_{s_i} + \sqrt{h_{u_i}}\,\xi_{u_i} \right),$$

where $\xi_{u_i}$ and $h_{u_i}$ are respectively the random variable and the grouping strength parameter that correspond to the secondary group $u_i$. If we further assume that $s_i$ and $u_i$ are statistically independent, the pairwise correlation between two loci $i$ and $j$ can be written as

$$\langle x_i x_j\rangle = \frac{1}{1 + g_{s_i} + h_{u_i}}\left( \delta_{ij} + g_{s_i}\delta_{s_i,s_j} + h_{u_i}\delta_{u_i,u_j} \right);$$

the contributions from different groups are additive.

Straightforward algebra rearranges the above model as

$$\frac{1 + g_{s_i} + h_{u_i}}{1 + h_{u_i}}\left( \langle x_i x_j\rangle - \frac{g_{s_i}\delta_{s_i,s_j}}{1 + g_{s_i} + h_{u_i}} \right) = \frac{\delta_{ij} + h_{u_i}\delta_{u_i,u_j}}{1 + h_{u_i}},$$

such that the right-hand side becomes the single-group model. The left-hand side of this expression is a normalized residual of the correlation, which we will call $C^{\mathrm{res}}_{ij}$. If we have already inferred the primary group $s_i$ and the corresponding strength $\tilde{g}$ without considering the secondary group, the correspondence to this two-group model (due to normalization) is given as $g = (1 + h)\tilde{g}$. Substituting this, and replacing the model correlation $\langle x_i x_j\rangle$ with the data $C_{ij}$, the left-hand side of the previous expression is simplified to

$$C^{\mathrm{res}}_{ij} = (1 + \tilde{g}_{s_i})\, C_{ij} - \tilde{g}_{s_i}\,\delta_{s_i,s_j}. \tag{S14}$$

$C^{\mathrm{res}}$ is written in terms of the original data $C$ and the primary solution $s$ (and $\tilde{g}$) only; it is independent of the unknown secondary solution $u$ (and $h$). Now $u$ is the solution of a modified single-group problem, using this residual correlation $C^{\mathrm{res}}$ as the input data.
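A minimal sketch of the residual correlation of Eq. S14, written in Python with NumPy, may make the construction concrete. The function name `residual_correlation` and the array conventions (0-based primary labels `s`, one strength per primary group in `g_tilde`) are assumptions introduced here for illustration; this is not the Multi-CD implementation.

```python
import numpy as np

def residual_correlation(C, s, g_tilde):
    """Residual correlation following Eq. S14 (illustrative sketch).

    C       : (N, N) correlation matrix from the data
    s       : length-N integer array, primary group label of each locus (0-based)
    g_tilde : length-K array, strength inferred for each primary group
    """
    C = np.asarray(C, dtype=float)
    s = np.asarray(s)
    g_site = np.asarray(g_tilde, dtype=float)[s]            # g_tilde[s_i] for each locus i
    same_primary = (s[:, None] == s[None, :]).astype(float)  # delta(s_i, s_j)
    # C_res[i, j] = (1 + g_tilde[s_i]) * C[i, j] - g_tilde[s_i] * delta(s_i, s_j)
    return (1.0 + g_site)[:, None] * C - g_site[:, None] * same_primary


# Toy check under the two-group model with uniform strengths g and h.
s = np.array([0, 0, 1, 1])   # primary grouping
u = np.array([0, 1, 0, 1])   # secondary grouping
g, h = 2.0, 0.5
eye = np.eye(4)
d_s = (s[:, None] == s[None, :]).astype(float)
d_u = (u[:, None] == u[None, :]).astype(float)
C_model = (eye + g * d_s + h * d_u) / (1.0 + g + h)
C_res = residual_correlation(C_model, s, [g / (1.0 + h)] * 2)
```

In this toy case, where the primary strength is passed as g/(1 + h), `C_res` equals the single-group form (δ_ij + h δ_{u_i,u_j})/(1 + h), so the primary grouping no longer appears in the input to the secondary problem.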

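As a numerical sanity check on Section C, the closed-form expressions of Eqs. S10 and S13 can be compared with a direct evaluation of the block matrix Q_s, whose elements are α_s² + β_s² δ_ij. The sketch below is in Python with NumPy; the function names are hypothetical, and the closed form keeps x_s^T x_s exact instead of approximating it by n_s as in Eq. S10, so that both routes agree to floating-point precision.

```python
import numpy as np

def cluster_terms_closed_form(x_s, g):
    """Quadratic form and log-determinant for one cluster via Eqs. S9-S13."""
    n = x_s.size
    c = x_s.sum() ** 2                      # c_s = sum_{i,j} x_i x_j within the cluster
    quad = (1.0 + g) * (x_s @ x_s - g * c / (1.0 + g * n))
    logdet = np.log1p(g * n) - n * np.log1p(g)
    return quad, logdet

def cluster_terms_direct(x_s, g):
    """Same two terms, computed from the explicit matrix Q_s."""
    n = x_s.size
    # (Q_s)_ij = alpha^2 + beta^2 * delta_ij, with alpha^2 = g/(1+g), beta^2 = 1/(1+g)
    Q = (g * np.ones((n, n)) + np.eye(n)) / (1.0 + g)
    quad = x_s @ np.linalg.solve(Q, x_s)
    _, logdet = np.linalg.slogdet(Q)
    return quad, logdet


rng = np.random.default_rng(0)
x_s = rng.standard_normal(6)   # toy data for a cluster of six sites
closed = cluster_terms_closed_form(x_s, 1.7)
direct = cluster_terms_direct(x_s, 1.7)
```

The closed form needs only n_s, c_s and x_s^T x_s per cluster, which avoids building or inverting Q_s when many candidate clusterings have to be scored.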
Fig. S1. Distance distributions of segment pairs are described by Gaussian. (a) Gaussian probability distribution P(r_ij) plotted with different values of γ_ij (Eq. 1). The shaded area in different colors represents the corresponding values of contact probabilities. (b) Distance distributions between one TAD (TAD17) and other TADs on Chr21 in human IMR90 cells measured with FISH. This figure was adapted from Fig. S3 in (21). (c) Distance distributions between three FISH probes on the X chromosome of male Drosophila embryos. The experimental data were digitized from Fig. 3B in (51). Their best fits to Eq. 1 are plotted with solid lines. (d) Distance distributions between five pairs of FISH probes on chr1 in fibroblast cells. The experimental data (histograms) were digitized from Fig. 4B in (52). The fits using Eq. 1 are shown in red. (e) Distance distributions between seven pairs of FISH probes in the Tsix/Xist region on the X chromosome of mouse ESC. The experimental data (black lines) were digitized from Fig. 2F in (53), and their corresponding fits are shown in red.

Fig. S2. Finding the best domain solution through simulated annealing. (a) A subset of Hi-C data (log10 M), covering a 10-Mb genomic region on chr10 of GM12878. (b) Best solutions, obtained from the Hi-C data in (a), at three values of T, for λ = 0. The CD solution at each T was constructed from equilibrated sample trajectories. (c-e) We plot three quantities over varying T, where simulated annealing from high to low T (right to left in the figure) was used as a sampling protocol: (c) the effective energy Hamiltonian H(s|C); (d) the heat capacity C_v = ⟨δH²⟩/T²; (e) the normalized mutual information (nMI) between the domain solution and the Hi-C matrix. (f-i) Same analyses repeated for λ = 10.


Fig. S3. Chromatin domain solutions for chromosomes 4, 10, 11 and 19. (a) Relative sizes of chromosomes considered, aligned at the centromeres. The gray shade in each chromosome indicates the 10-Mb interval for which we show the Hi-C data in the next panels. (b-e) Hi-C data for the corresponding 10-Mb genomic intervals of (b) chr4, (c) chr10, (d) chr11, and (e) chr19, for the five different cell lines respectively. All the panels for chr10 are reprints of Fig. 3 in the main text. (f-i) Statistics of the domain solutions for chr4, chr10, chr11, and chr19. The five cell lines are color coded as indicated at the top of (b). (f) Mean domain size $\langle n\rangle$ as a function of λ. (g) The index of dispersion $D\,(= \sigma_n^2/\langle n\rangle)$ of domain sizes. (h) The goodness of domain solutions, measured in terms of the normalized mutual information with respect to Hi-C data (log10 M). (i) The similarity of domain solutions across the five different cell types, measured by the Pearson correlation between binarized contact matrices. For each chromosome, arrows indicate the likely TAD scale (highest cell-to-cell similarity) and the likely meta-TAD scale (where the nMI is high and the index of dispersion D starts to diverge).
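The statistics in panels (g) and (i) can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions rather than the code used for the figure: every bin is assumed to carry a domain label, the variance is the population variance over domains, and the Pearson correlation is taken over all entries of the binarized contact matrices (diagonal included). The function names are hypothetical.

```python
import numpy as np

def index_of_dispersion(s):
    """D = var(n) / mean(n) over the domain sizes n_k of a solution s."""
    _, n_k = np.unique(s, return_counts=True)
    n_k = n_k.astype(float)
    return n_k.var() / n_k.mean()

def binarized_contact_matrix(s):
    """B[i, j] = 1 if bins i and j belong to the same domain, else 0."""
    s = np.asarray(s)
    return (s[:, None] == s[None, :]).astype(float)

def solution_similarity(s1, s2):
    """Pearson correlation between two binarized contact matrices.

    Undefined (NaN) if either solution places all bins in a single domain,
    because that matrix is then constant.
    """
    b1 = binarized_contact_matrix(s1).ravel()
    b2 = binarized_contact_matrix(s2).ravel()
    return np.corrcoef(b1, b2)[0, 1]


# Two toy domain solutions over six bins (e.g. two cell lines).
s_a = [0, 0, 1, 1, 1, 1]
s_b = [0, 0, 0, 1, 1, 1]
D_a = index_of_dispersion(s_a)
sim_ab = solution_similarity(s_a, s_b)
```

Comparing binarized matrices rather than label vectors makes the similarity independent of how the domains happen to be numbered in each solution.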



Fig. S4. A full-chromosome view of the input data, for chromosome 10 of GM12878. (a) The normalized Hi-C matrix M, and (b) the correlation matrix C.

Fig. S5. Examples of Multi-CD domain solutions at different scales. Shown are the domain solutions obtained from Multi-CD for the five different cell lines (GM12878, HUVEC, NHEK, K562, KBM7), at (a) λ = 0 and (b) λ = 40.

Fig. S6. Scale-to-scale similarity of domain solutions for different cell lines. We calculate the similarity between domain solutions at different λ in terms of Pearson correlation. The calculation was performed for chromosome 10 from five different cell lines.


Fig. S7. Cell-line dependent TAD organization and its link to gene expression. The RNA-seq signals from five different cell lines (colored hairy lines) are shown on top of the TAD solutions obtained by Multi-CD (triangles with matching colors). Shown at the top are the position of a specific gene, APBB1IP (top row), and the regulatory elements associated with this gene (second row), including the enhancers and the promoter (the position of the promoter is marked with a magenta line). APBB1IP is transcriptionally active only in two cell lines, GM12878 and KBM7. In these two cell lines, the regulatory elements are fully enclosed in the same TAD.

Fig. S8. Identification of sub-TAD boundaries at 5-kb resolution. (a) The optimum cluster size, best describing the 5-kb resolution Hi-C map in terms of nMI, is determined at $\langle n\rangle$ = 0.35 Mb, which is consistent with the sub-TAD size determined from 50-kb resolution Hi-C at λ = 0. (b-c) Comparison between Multi-CD solutions at different resolutions of the input Hi-C data, which points to the robustness of sub-TAD boundaries regardless of Hi-C resolution. (b) The best CD solution (corresponding to λ = λ* in panel (a)) for the 5-kb resolution Hi-C data in the 120-124 Mb region of the genome. (c) Solution for the same genomic interval from 50-kb Hi-C, determined at λ = 0. The two CD solutions are effectively identical, which supports our interpretation of the sub-TAD as the unit of hierarchical chromosome organization.



Fig. S9. Intra-domain contact profiles for the two compartment solutions. We plot the genomic distance-dependent contact number P(d) for the two large domains k = 1 and k = 2, from the solution at λ = 90 for chromosome 10 in GM12878. At short genomic distance, the k = 1 domain is characterized by a greater number of contacts than k = 2, which suggests that the k = 1 domain is locally more compact. We therefore associate the first domain k = 1 with the B-compartment, and the second domain k = 2 with the A-compartment.

Fig. S10. Comparison of domain solutions from Multi-CD and other methods at specific scales. Comparison between domain solutions obtained by three popular algorithms (ArrowHead, DomainCaller, GaussianHMM) (right column) and those by Multi-CD (left column), applied to 50-kb resolution Hi-C data for chr10 of GM12878. Three subsets from the same Hi-C data (log10 M), with different magnification (5, 10, and 40 Mb from top to bottom), are given in the middle column. The ArrowHead algorithm (19) was used for identifying the domain structures of sub-TADs, DomainCaller (22) for TADs, and the Gaussian Hidden Markov Model (GaussianHMM) (19) for compartments. Multi-CD uses λ = 0, 10, 90 as the parameter values for identifying sub-TADs, TADs, and compartments, respectively.


[Figure S11 panels: Multi-CD solution (A-compartment, B-compartment), Repli-Seq signals, and histone marks along chr10 (position 0 to 45 Mb); right: Pearson correlation (axis from -0.5 to 0.5).]

Pearson correlation with the A-compartment / B-compartment, as annotated in the figure:
Repli-Seq signal: G1 0.6 / -0.58; S1 0.6 / -0.55; S2 0.2 / -0.21; S3 -0.4 / 0.39; S4 -0.6 / 0.54; G2 -0.4 / 0.32.
Histone marks: H3K36me3 0.4 / -0.38; H3K27me3 -0.06 / 0.073; H3K9me3 -0.04 / -0.074; H4K20me1 0.3 / -0.33; H2AZ 0.5 / -0.49; H3K27ac 0.5 / -0.46; H3K4me1 0.5 / -0.47; H3K4me2 0.6 / -0.53; H3K4me3 0.4 / -0.42; H3K79me2 0.4 / -0.36; H3K9ac 0.5 / -0.47.

Fig. S11. Comparison of histone marks and compartments. This figure extends Fig. 6c-d, which compares the Multi-CD solutions for the A/B-compartments with epigenetic marks. The upper part, showing the Repli-Seq signals, is reproduced from the main text figure. The lower part shows histone marks over the corresponding genomic range. The majority of the histone marks are correlated with the A-compartment. The Pearson correlations of each Repli-Seq signal or histone mark with the A- and B-compartments are given on the right.
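The correlation values reported in such a comparison can be illustrated with a short sketch. The caption does not specify how the compartment assignment is encoded, so the code below assumes a binary indicator of membership in a given domain, correlated against a signal binned at the same resolution. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def compartment_correlation(s, signal, domain):
    """Pearson correlation between a domain-membership indicator and a binned signal.

    s      : length-N array of domain labels (hypothetical input)
    signal : length-N array, e.g. a Repli-Seq or histone-mark track binned like s
    domain : label whose indicator (1 inside, 0 outside) is correlated with signal
    """
    indicator = (np.asarray(s) == domain).astype(float)
    return float(np.corrcoef(indicator, np.asarray(signal, dtype=float))[0, 1])

# Toy data (made-up numbers): the signal is high where the label is 2.
s = np.array([1, 1, 2, 2, 1, 2])
signal = np.array([0.1, 0.2, 0.9, 0.8, 0.3, 0.7])

r_dom1 = compartment_correlation(s, signal, 1)
r_dom2 = compartment_correlation(s, signal, 2)
```

When every bin carries one of exactly two labels, the two indicators are complementary and the two correlations are equal in magnitude with opposite signs. Values that are not exact mirror images would be expected if some bins belong to neither domain.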

[Figure S12 panels a-f; horizontal axes of the domain maps: position on chr10 (Mb).]

Fig. S12. Comparison to existing algorithms for identifying domains at multiple scales. (a,b) Normalized mutual information between the domain solutions at multiple scales, from Multi-CD, Armatus (33), and TADtree (32), and the log10 of the KR-normalized Hi-C matrix for chr10 of the cell line GM12878. The scale of a domain solution $s$ is measured in two ways, in terms of (a) the effective number of clusters, $K(s) = \exp\left[-\sum_{k=1}^{K} (n_k/N) \log(n_k/N)\right]$, where $n_k = \sum_{i=1}^{N} \delta_{s_i,k}$ is the domain size; and (b) the total area of 1's in the corresponding binary contact matrix, $(\mathrm{area}) = \sum_{i,j=1}^{N} B_{ij}$, where $B_{ij} = \delta_{s_i,s_j}$. All domain solutions from TADtree and Armatus were obtained using the respective default parameter settings. (c-f) Visual comparison of domains found by (c,d) TADtree and Multi-CD, and (e,f) Armatus and Multi-CD, at matching scales in terms of the average domain size. Domain solutions are shown in the upper triangle, colored red (intra-domain) and white (extra-domain) for effective visualization. The lower triangle plots the corresponding subset of the Hi-C data (KR-normalized and in log10). Refer to the original papers (32, 33) for the definitions of the respective control parameters α (TADtree) and γ (Armatus).
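The two scale measures defined in this caption can be computed directly from a label vector. The sketch below is illustrative only: the function names are hypothetical, and it assumes that every bin carries a domain label, so that no special treatment of unassigned bins is needed.

```python
import numpy as np

def effective_num_clusters(s):
    """K(s) = exp(-sum_k (n_k/N) log(n_k/N)), with n_k the size of domain k."""
    _, n = np.unique(np.asarray(s), return_counts=True)
    p = n / n.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

def binary_area(s):
    """Total number of 1's in B_ij = delta(s_i, s_j); equals sum_k n_k**2."""
    s = np.asarray(s)
    B = s[:, None] == s[None, :]
    return int(B.sum())

# Toy label vectors (hypothetical): two equal domains vs. two unequal domains.
K_equal = effective_num_clusters([1, 1, 2, 2])
K_unequal = effective_num_clusters([1, 1, 1, 2])
area_equal = binary_area([1, 1, 2, 2])
area_unequal = binary_area([1, 1, 1, 2])
```

For two domains of equal size the effective number of clusters is exactly 2, and it falls below 2 as the sizes become unequal, while the area grows because larger domains contribute quadratically.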


[Figure S13 panels: (a) and (b), each with three Hi-C input maps (top) and the corresponding Multi-CD domain solutions over the 20 to 30 Mb range (bottom; axes in Mb). Correlation coefficients annotated on the panels, in order: CC = 0.95, CC = 0.84, CC = 0.98, CC = 0.90.]

Fig. S13. Robustness of clustering solutions over different subsets of Hi-C data. Here we compare domain solutions obtained from Hi-C inputs of different sizes. Multi-CD is confirmed to be locality-preserving; that is, the domain solutions determined from Hi-C inputs of different sizes remain almost identical to each other. The Hi-C data demarcated by the purple squares in the top panels are the inputs used for the Multi-CD analysis. The three bottom panels, from left to right, show the domain solutions from the 10-Mb, 20-Mb, and 40-Mb Hi-C inputs. (a) For λ = 0, the correlation coefficients of the domain solutions generated from the 20-Mb and 40-Mb Hi-C inputs with respect to the one generated from the 10-Mb input are 0.95 and 0.84, respectively. (b) The same calculations were carried out for λ = 10.
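One way to quantify the agreement between two domain solutions over a common genomic range is sketched below. The caption does not state how the correlation coefficient is defined, so this example assumes a Pearson correlation between the off-diagonal entries of the binary co-membership matrices of the two solutions. The function name and the toy label vectors are hypothetical, and both solutions are assumed to be already cropped to the same bins.

```python
import numpy as np

def solution_correlation(s_ref, s_other):
    """Pearson correlation between the binary co-membership matrices of two solutions.

    Both label vectors must cover the same bins. Only pairs i < j are used,
    because the diagonal is trivially 1 in both matrices.
    """
    a = np.asarray(s_ref)
    b = np.asarray(s_other)
    Ba = a[:, None] == a[None, :]
    Bb = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float(np.corrcoef(Ba[iu].astype(float), Bb[iu].astype(float))[0, 1])

# Toy solutions (hypothetical): a reference, a relabelled copy, and a shifted boundary.
s_ref = [1, 1, 1, 2, 2, 2]
s_relabelled = [5, 5, 5, 7, 7, 7]
s_shifted = [1, 1, 2, 2, 2, 2]

cc_same = solution_correlation(s_ref, s_relabelled)
cc_shifted = solution_correlation(s_ref, s_shifted)
```

Working with co-membership matrices rather than raw labels makes the measure insensitive to how the domains are numbered, which matters when the solutions come from independent runs on inputs of different sizes.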

[Figure S14 panels: (a-d) interaction strength matrices (top row) and contact probability matrices (second row), with monomer index on both axes; (e) mean square distance in units of a² versus s; (f) contact probability p(s) versus s, each shown for models a-d.]

Fig. S14. Heterogeneous loop model (14) used to compare the contact probabilities of Gaussian polymer networks. (a-d) Four examples of polymer models composed of 20 monomers with different interaction strength matrices [kij] (top row), and the corresponding contact probability matrices [pij] (second row) calculated with rc = 1. (e) The mean square distance and (f) the contact probability p(s), calculated as functions of the genomic distance s for the four models (a-d). The scaling results in (e) and (f) show that even the Gaussian polymer network model can produce rich multi-scale structure with domains.
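The mapping from an interaction strength matrix to a contact probability matrix can be sketched for a generic Gaussian network. The conventions here are assumptions, not taken from the caption: the energy is written as the sum over pairs of (kij/2)|ri - rj|² in units of the thermal energy, the network is connected, and a contact is defined as |ri - rj| < rc. Under these assumptions each pair distance follows a three-dimensional Gaussian (Maxwell-type) distribution, whose cumulative form gives pij. The function name and the toy matrices are hypothetical.

```python
import math
import numpy as np

def gaussian_network_contacts(k, rc=1.0):
    """Per-component variance of r_i - r_j and contact probability P(|r_i - r_j| < rc).

    k : (N, N) symmetric, non-negative matrix with zero diagonal (assumed connected)
    """
    k = np.asarray(k, dtype=float)
    N = len(k)
    L = np.diag(k.sum(axis=1)) - k          # Kirchhoff (graph Laplacian) matrix
    # Adding 1/N to every entry removes the zero translation mode; the added
    # uniform term cancels in the pairwise differences taken below.
    C = np.linalg.inv(L + 1.0 / N)
    diag = np.diag(C)
    sigma2 = diag[:, None] + diag[None, :] - 2.0 * C
    p = np.ones((N, N))                      # a monomer is always in contact with itself
    off = ~np.eye(N, dtype=bool)
    x = rc / np.sqrt(sigma2[off])
    erf = np.vectorize(math.erf)
    p[off] = erf(x / math.sqrt(2.0)) - math.sqrt(2.0 / math.pi) * x * np.exp(-0.5 * x**2)
    return sigma2, p

# Toy models (hypothetical): a 20-monomer chain, and the same chain with one loop.
N = 20
k_chain = np.zeros((N, N))
for i in range(N - 1):
    k_chain[i, i + 1] = k_chain[i + 1, i] = 1.0
k_loop = k_chain.copy()
k_loop[4, 14] = k_loop[14, 4] = 1.0

sigma2_chain, p_chain = gaussian_network_contacts(k_chain, rc=1.0)
sigma2_loop, p_loop = gaussian_network_contacts(k_loop, rc=1.0)
msd_chain = 3.0 * sigma2_chain               # mean square distance in three dimensions
```

For the plain chain the variance grows linearly with the separation along the chain, as expected for an ideal polymer, and adding a single long-range spring raises the contact probability of the looped pair. Averaging `p` or `msd_chain` over pairs with fixed |i - j| would give curves analogous to p(s) and the mean square distance as functions of s.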