bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility
Pedro Manuel Mart´ınez-Garc´ıa1*, Miguel Garc´ıa-Torres2, Federico Divina2, Jos´e Terr´on-Bautista1, Irene Delgado-Sainz1, Francisco G´omez-Vela2, Felipe Cort´es-Ledesma1,3*,
1 Centro Andaluz de Biolog´ıaMolecular y Medicina Regenerativa (CABIMER), CSIC-Universidad de Sevilla-Universidad Pablo de Olavide, Seville 41092, Spain 2 Division of Computer Science, Universidad Pablo de Olavide, ES-41013, Seville, Spain 3 Topology and DNA breaks Group, Spanish National Cancer Centre (CNIO), 28029, Madrid, Spain
* [email protected], [email protected]
Abstract
DNA topoisomerase II-β (TOP2B) is fundamental to remove topological problems linked to DNA metabolism and 3D chromatin architecture, but its cut-and-reseal catalytic mechanism can accidentally cause DNA double-strand breaks (DSBs) that can seriously compromise genome integrity. Understanding the factors that determine the genome-wide distribution of TOP2B is therefore not only essential for a complete knowledge of genome dynamics and organization, but also for the implications of TOP2-induced DSBs in the origin of oncogenic translocations and other types of chromosomal rearrangements. Here, we conduct a machine-learning approach for the prediction of TOP2B binding sites using publicly available sequencing data. We achieve highly accurate predictions, with accessible chromatin and architectural factors being the most informative features. Strikingly, TOP2B is sufficiently explained by only three features: DNase I hypersensitivity, CTCF and cohesin binding, for which genome-wide data are widely available. Based on this, we develop a predictive model for TOP2B genome-wide binding that can be used across cell lines and species, and generate virtual probability tracks that accurately mirror experimental ChIP-seq data. Our results deepen our knowledge on how the accessibility and 3D organization of chromatin determine TOP2B function, and constitute a proof of principle regarding the in silico prediction of sequence-independent chromatin-binding factors.
Author summary
Type II DNA topoisomerases (TOP2) are a double-edged sword. They solve topological problems in the form of supercoiling, knots and tangles that inevitably accompany genome metabolism, but they do so at the cost of transiently cleaving DNA, with the risk that this entails for genome integrity, and the serious consequences for human health, such as neurodegeneration, developmental disorders or predisposition to cancer. A comprehensive analysis of TOP2 distribution throughout the genome is therefore essential for a deep understanding of its function and regulation, and how this can affect genome dynamics and stability. Here, we use machine learning to thoroughly explore genome-wide binding of TOP2B, a vertebrate TOP2 paralog that has been linked to
March 18, 2020 1/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
genome organization and cancer-associated translocations. Our analysis shows that TOP2B-DNA binding can be accurately predicted exclusively using information on DNA accessibility and binding of genome-architecture factors. We show that such information is enough to generate virtual maps of TOP2B binding along the genome, which we validate with de novo experimental data. Our results highlight the importance of TOP2B for accessibility and 3D organization of chromatin, and show that computationally predicted TOP2 maps can be accurately obtained using minimal publicly available datasets, opening the door for their use in different organisms, cell types and conditions with experimental and/or clinical relevance.
Introduction 1
Type II DNA topoisomerases are unique in their ability to catalyze duplex DNA 2
passage, and therefore the only enzymes capable of dealing with superhelical DNA 3
structures, as well as unknotting and decatenating DNA molecules [1]. This places them 4
in a privileged position to integrate and coordinate many aspects of genome 5
organization and metabolism, and have indeed been related to virtually all aspects of 6
chromatin dynamics [2, 3]. While the genomes of invertebrates and lower eukaryotic 7
systems code for only one type II topoisomerase (TOP2), vertebrates encode two close 8
paralogs (TOP2A and TOP2B) with similar catalytic and structural properties but 9
with different roles [2, 4]. TOP2A is essential for cellular viability [5] and is mainly 10
expressed in dividing cells [6], where it is required for chromosome segregation [2]. 11
TOP2B, in contrast, is not essential at a cellular level [7], but is ubiquitously expressed 12
and required for organismal viability due to roles in the transcriptional regulation of 13
genes required for proper development of the nervous system [2, 6,8]. Consistent with 14
regulatory functions in transcription, TOP2B significantly associates with promoters 15
and DNase I hypersensitivity sites, as well actively regulated epigenetic modifications 16
and enhancers [9–13]. Furthermore, recent studies have linked TOP2B with 3D genome 17
architecture due to a strong colocalization with architectural factors such as CTCF and 18
the cohesin complex protein RAD21 at the boundaries of topologically associated 19
domains (TADs), which has been interpreted as TOP2B functions in resolving 20
topological problems derived from the active organization of the genome in the context 21
of the loop extrusion model [11, 12]. 22
Despite these fundamental roles in the organization and dynamics of the genome, 23
TOP2 function can also pose a threat to its integrity [4]. Aberrant activity of TOP2, 24
which can occur spontaneously or as a consequence of cancer chemotherapy, results in 25
the formation of DNA double-strand breaks (DSBs). These are highly cytotoxic lesions 26
that, if inefficiently or aberrantly repaired, can seriously compromise cell survival and 27
genome stability. Indeed, unrepaired or misrepaired DSBs can lead to genome 28
rearrangements associated with neurodegeneration, sterility, developmental disorders 29
and predisposition to cancer [14–16]. In particular, transcription-associated TOP2B 30
function at TAD boundaries has been recently linked to oncogenic chromosomal 31
translocations [17, 18]. The specific functions of TOP2B in these processes are, however, 32
still largely unexplored and far from being fully understood. A deep understanding of 33
TOP2B function and regulation in a genome-wide context, and how it can result in 34
chromosome fragility and alterations, will require thorough and unbiased studies, 35
integrating different systems, cellular types and conditions. 36
In this work, we take advantage of the wealth of high-throughput sequencing data to 37
comprehensively study the predictive power of a wide set of chromatin features to 38
identify TOP2B binding sites. We show that TOP2B localization can be accurately 39
predicted using publicly available sequencing data, and identify architectural and open 40
chromatin factors as the most informative features. Moreover, feature selection analysis 41
March 18, 2020 2/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
indicates that TOP2B can be faithfully predicted using only three sequencing datasets. 42
Indeed, models trained with DNase-seq and ChIP-seq of CTCF and RAD21 allow for 43
accurate prediction across different mouse cell types, supporting a generalizing 44
association of TOP2B with these features. Finally, we perform ChIP-seq of TOP2B in 45
mouse thymocytes and human MCF7 and generate predictions on such cell types in 46
order to validate our model. Genome wide predictive tracks accurately mirror 47
experimental TOP2B tracks in both organisms, showing that our model can be used to 48
generate virtual TOP2B signals in potentially any mammalian cell type and condition 49
for which sequencing data of CTCF, RAD21 and DNase I hypersensitivity are available. 50
Materials and methods 51
Processing publicly available data 52
The experimental data used in this study is summarized in S1 Table. When available, 53
we batch-downloaded mm9 and hg19 BAM files from ENCODE. Raw sequencing data 54
was processed by first quality filtering and merging the reads of biological replicates and 55
corresponding input samples (in the case of ChIP-seq experiments). Then we mapped 56
them to the mouse or human genome (mm9 or hg19) using Bowtie 1.2 [19] with option 57
“-m 1” so that reads that mapped only once to the genome were retained. Genome track 58
plots of Fig 1 and S1 Fig were generated using Bioconductor packages Gviz [20] and 59
trackViewer [21]. 60
Identification of TOP2B Binding Sites 61
We used HOMER [22] for the computation of TOP2B peaks and retained those with a 62
fold change over the input sample greater than 50. Then, peaks overlapping > 20% of 63
their width with non-mappable regions or regions of the ENCODE blacklist [23] were 64
discarded. When replicates were available, peaks were called for individual replicates 65
and overlapping peaks were kept. For each training cell type, we generated the same 66
number of random regions in the genome than TOP2B binding sites were identified. 67
Such regions had 300 bp length and were selected so that they had the same distribution 68
across chromosomes as TOP2B peaks. Again, non-mappable regions and regions of the 69
ENCODE blacklist were discarded. To make sure they did not overlap potential TOP2B 70
peaks that were not detected by the peak calling algorithms, regions with high levels of 71
TOP2B/INPUT ChIP-seq signal were also discarded. In order to account for sequence 72
composition biases, an additional set of random regions was generated so that their 73
distribution of sequence G+C content matched that of TOP2B peaks. 74
Quantification of High-Throughput Sequencing Data 75
To score sequencing data around TOP2B binding sites, we based on the scoring 76
approach of [24]. For ChIP-seq experiments, reads aligned to a 300 bp window centered 77
on TOP2B binding center were first considered, normalized to the size of the 78
corresponding sequencing library and scaled to the size of the smaller library (among 79
test and input). Then, they were added 1 as a pseudocount (to avoid undefined values 80
in posterior logarithmic transformations) and the ratio of test versus input signal was 81
calculated. Finally, this value was log transformed. RNA-seq, DNase-seq and 82
MNase-seq were scored as the number of reads aligned within the same 300 bp window 83
divided by the library size. CpG methylation was measured as the total percentage of 84
methylated CG for each CpG dinucleotide as described in [25]. 85
March 18, 2020 3/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
DNA shape and sequence features 86
Three-dimensional DNA shape features at base pair or base pair step resolution were 87
derived from the high-throughput approach implemented in the Bioconductor library 88
DNAshapeR [26]. We used thirteen DNA shape features: helix twist, propeller twist, 89
minor groove width, roll, shift, slide, rise, tilt, shear, stretch, stagger, buckle and 90
opening [27]. For a given DNA sequence of length L, the k-mer feature at each 91 k nucleotide position was encoded as a binary vector of length 4 , as described in [28]. A 92
value of 1 represents the occurrence of a particular k-mer from starting at that position. 93
The k-mer features for a whole DNA sequence were then encoded by the concatenation 94
of the k-mer feature vectors at each position of the sequence. As a result, we obtained a 95 k binary vector of length 4 (L-k+1). 96
Machine learning 97
Training data matrix 98
After scoring the sequencing experiments as well as DNA shape and sequence 99
parameters, we built up a final data matrix with rows representing TOP2B/random 100
sites and columns corresponding to scored features. Since we identified a total of 13, 128 101
and 8, 413 TOP2B peaks in mouse liver and MEFs respectively, we ended up with model 102
matrices of 32, 766 columns and either 26, 256 (mouse liver) or 16, 826 (MEFs) rows. 103
Classifiers 104
The Naive Bayes [29] (NB) classifier applies the Bayes rule to compute the probability 105
of a class label given an instance. For such purpose, the classification model learns, 106
from training data, the conditional probability of each feature given the class label. NB 107
assumes that all features are conditionally independent given the values of the class. 108
Support Vector Machine [30] (SVM) is a classifier based on the idea of finding a 109
hyperplane that optimally separates the data into two categories by solving an 110
optimization problem. Such hyperplane, is the one that maximizes the margin between 111
classes. In case of non-separable data, it optimizes a weighted combination of the 112
misclassification rate and the distance of the decision boundary to any sample vector. 113
After several tests with different kernels (linear, polynomial, radial basis and sigmoid), 114
we use a linear kernel due to its good performance and simplicity. 115
Random Forests [31] is a widely applied tree-based supervised machine-learning 116
algorithm designed as a combination of bagging and a random selection of features that 117
has proven to be an effective tool for the statistical analysis of high-dimensional 118
genomic data [32]. 119
Feature selection algorithms 120
Fast Correlation Based Filter [33] (FCBF) is an efficient heuristic that, based on 121
information theory measures, performs a relevance and redundancy analysis for selecting 122
a good subset of features. Basically, FCBF is a backward search strategy that uses 123
Symmetrical Uncertainty (SU) as goodness function to measure non-linear dependencies 124
between features. In order to identify relevant features, it requires to set a threshold 125
parameter δ so that only values above δ are considered relevant.In this work we set 126
δ = 0 since there is no rule about this parameter tuning and in the datasets under study 127
only a small subset of features have a SU value different to 0. 128
Scatter Search [34,35] (SS) is a population-based metaheuristic that uses 129
intensification and diversification mechanisms to generate new solutions. The strategy 130
starts by generating an initial population of diverse solutions from the solution space. 131
March 18, 2020 4/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
Then, it selects a subset of features, called Reference Set (RefSet), of high quality and 132
dispersed solutions from the initial population. In the next step the method selects all 133
subsets consisting of two solutions and combine them in order to generate new solutions 134
that are improved by applying a local search which will yield to local optima. Finally, a 135
static update of the reference set is carried out selecting solutions according to quality 136
and diversity from the union of the original RefSet and the new local optima found. 137
In this work we use the Correlation Feature Selection [36] (CFS) measure as 138
goodness function. CFS evaluates subsets of features by considering the non-linear 139
correlation Symmetrical Uncertainty function. 140
Genome wide predictions 141
Once we identified the most predictive features, a generalized TOP2B binding model 142
was built by training on DNase-seq, RAD21 and CTCF data from mouse liver, MEFs 143
and activated B cells using Random Forests [31]. We first used 5-fold cross-validation to 144
confirm that the model was able to accurately predict TOP2B in the training cell types 145
(as shown in Results and Discussion; Table 4) and then applied the model to the mouse 146
thymus and the human MCF7 mappable genomes. First, we splitted the genomes into 147
bins of 300 bp with sliding windows of 50 bp. Then we scanned each bin with our model 148
and obtained a TOP2B binding probability vector that was used to build up a bedgraph 149
file. Probability track plots of Fig 6 and S6-7 Figures were generated using UCSC 150
genome browser [37]. 151
ChIP-seq 152
Cells were fixed by adding formaldehyde to a final concentration of 1% and incubated at 153
37ºC for 10 min. Fixation was quenched by adding glycine to a final concentration of 154
125 mM at cell culture plates. After two washes with cold PBS, in the presence of 155
complete protease inhibitor cocktail (Roche) and PMSF, cell pellet was lysed in two 156
steps using 0.5% NP-40 buffer for nucleus isolation and SDS 1% lysis buffer for nuclear 157
lysis. Sonication was performed using Bioruptor (Diagenode, UCD-200) at high 158
intensity and three cycles of 10 min (30” sonication, 30” pause) and chromatin was 159
clarified by centrifugation (17000xg, 10 min, 4 ºC). For IP, 30 µg chromatin and 4 µg 160
antibody (anti-TOP2B, SIGMA) were incubated o/n in IP buffer (0.1% SDS, 1% 161
TX-100, 2 mM EDTA, 20 mM TrisHCl pH8, 150 mM NaCl) at 4ºC, and then with 25 162
µL of pre-blocked (1 mg/ml BSA) Dynabeads protein A and Dynabeads protein G 163
(ThermoFisher) for 4 hours. Beads were then sequentially washed with IP buffer, IP 164
buffer containing 500 mM NaCl and LiCl buffer (0.25 M LiCL, 1% NP40, 1% NaDoc, 20 165
mM TrisHCl pH8 and 1 mM EDTA). ChIPmentation was carried out as previously 166
described (Schmidl 2015) using Tagment DNA Enzyme provided by the Proteomic 167
Service of CABD (Centro Andaluz de Biolog´ıadel Desarrollo). DNA was eluted by 168
incubation at 50ºC in 100 µL elution buffer (1% SDS, 100 mM NaHCO3) for 30 min 169
followed by an incubation with 200 mM NaCl and 10 µg of Proteinase K 170
(ThermoFisher) to revert cross-linking. DNA was then purified using Qiagen PCR 171
Purification columns. Libraries were amplified for N-1 cycles (being N the optimum Cq 172
determined by qPCR reaction) using NEBNext High-Fidelity Polymerase (M0541, New 173
England Biolabs), purified and size-selected using Sera-Mag Select (GE Healthcare) and 174
sequenced using Illumina NextSeq 500 in a single-end configuration. ChIP-seq datasets 175
are available under GEO accession number GSE141528. 176
March 18, 2020 5/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
Results and Discussion 177
Chromatin landscape of TOP2B binding sites 178
We started by comprehensively assessing the chromatin landscape of TOP2B binding 179
sites in two mouse cell types: embryonic fibroblasts (MEFs) and liver. We collected 180
high-throughput sequencing data from ENCODE [38] and several independent studies 181
(S1 Table) and observed the expected colocalization with open chromatin sites, RNA 182
polymerase II (Pol2), architectural components and transcription associated histone 183
modifications; genome tracks in both cell types are provided as an example (Fig 1A; S1 184
Fig). Then, we studied the position specific distribution of each chromatin feature 185
around the TOP2B binding center (Fig 1B). Several marks were found to be highly 186
enriched in both cell types, such as DNase-seq, CTCF and RAD21. They also showed a 187
high colocalization frequency with TOP2B (Fig 1C). For instance, 89% of TOP2B sites 188
in mouse liver colocalized with DNase-seq peaks, while for randomly selected regions 189
this percentage decreased to 11%. Similarly, 87% of TOP2B peaks overlapped with 190
RAD21 in MEFs as compared to 4% of random regions. These observations are in line 191
with previous findings reporting a preferential association of TOP2B with accessible 192
chromatin, actively transcribed promoters and regulators of genome architecture [10,11], 193
and suggest that chromatin features measured by next generation sequencing provide 194
valuable information for the prediction of TOP2B binding. 195
Fig 1. Chromatin landscape of TOP2B. A. Genome browser view for TOP2B ChIP-seq signal and other relevant chromatin features in mouse liver (http://genome.ucsc.edu). B. Average reads enrichment of selected chromatin features within ±4 kb of TOP2B binding center (solid lines) and random regions (dashed lines) in mouse liver (blue) and MEFs (red). Signals were smoothed using nucleR [39]. Left and right sided ordinate axes correspond to mouse liver and MEFs, respectively. C. Overlap frequencies of chromatin features at TOP2B binding sites as compared to random regions in mouse liver and MEFs. D. Top panel shows relative frequency of several polynucleotides and dinucleotides within ±500 bp of TOP2B binding center and random regions. Bottom panel shows average of 3D DNA shape features within ±1 kb of TOP2B binding center and random regions. Profiles corresponding to MEFs and liver are colored as in C.
Current experimental data suggest that TOP2B does not bind to a specific DNA 196
motif [11], which agrees with topoisomerases specifically acting where DNA topological 197
problems arise. However, TOP2B associates with DNA regions that are frequently 198
bound by sequence-specific transcription factors [9–11], so the fact that particular 199
topological structures can be favored by specific properties of DNA sequence is still a 200
possibility. Indeed, a recent study has provided evidence that DNA sequence governs 201
the location of supercoiled DNA [40]. To explore whether sequence composition of 202
TOP2B binding sites can be used in our predictive approach, we analyzed the spatial 203
distribution of DNA sequence dinucleotides within 1kb windows centered on TOP2B 204
peaks. For both murine cell types, TOP2B sites showed an increase of GC and GG 205
dinucleotides while AT and AA were depleted (Fig 1D, top). This result agrees with 206
previously reported GC-rich motifs, such as CTCF and ESR1, at TOP2B occupied 207
regions [10, 11]. 208
Another feature that could potentially help in the prediction of TOP2B binding is 209
DNA shape, which has been shown to affect the binding preferences of a number of 210
DNA associated proteins [28, 41, 42]. In this sense, models trained with information on 211
DNA shape have demonstrated significant improvements in transcription factor binding 212
predictions [28]. To further inspect whether the observed sequence preferences of 213
March 18, 2020 6/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
TOP2B sites are accompanied by specific 3D conformational parameters, we derived 214
high-throughput predictions of DNA shape features from TOP2B peaks. Position 215
specific profiles of such features revealed a specific pattern around the TOP2B binding 216
center, characterized by decreased helix twist, increased minor groove width and 217
propeller twist, and a modest enrichment of roll (Fig 1D, bottom). Such conformation 218
suggests helical unwinding as consequence of a decrease in helix twist, which together 219
with the widening of the minor groove could be providing an energetically favorable 220
scenario for TOP2B binding. 221
Finally, we also interrogated whether CpG methylation is informative for TOP2B 222
binding prediction. CpG methylation is a well-known epigenetic mark that has proven 223
predictive ability for the localization of DNA binding proteins [43–45]. We profiled 224
whole genome bisulfite sequencing (WGBS) data around TOP2B binding sites and 225
found decreased CpG methylation as compared to random regions (S2 Fig), showing 226
that this feature is likely to be informative for TOP2B prediction. 227
A predictive model for TOP2B binding 228
In order to assess whether the above chromatin features could discriminate TOP2B 229
binding sites from the rest of the genome, we applied the computational approach 230
outlined in Fig 2. For both MEFs and liver, we first identified TOP2B peaks, resized 231
them to 300 bp and generated the same number of random genomic regions (see 232
Materials and methods). Then we scored 15 high-throughput sequencing experiments 233
(S1 Table) together with DNA sequence and shape features within such regions. This 234
resulted in a data matrix with rows representing TOP2B/random sites and columns 235
corresponding to scored features. Finally, binary classifiers were trained and tested 236
using 5-fold cross-validation. 237
Fig 2. Machine learning schema for the prediction of TOP2B binding. TOP2B binding sites and random regions were first identified. Then, 15 high-throughput sequencing experiments together with DNA sequence and shape features were scored around such regions, which resulted a data matrix with rows representing TOP2B/random sites and columns representing the scored features. Finally, binary classifiers were trained and tested using 5 fold cross-validation and feature selection was applied to identify the most informative features.
In order to evaluate if DNA sequence is informative for the prediction of TOP2B 238
binding, we represented DNA 1-mers, 2-mers and 3-mers for each nucleotide position in 239
the TOP2B binding sites as described in [28] (see Materials and methods). This 240
produced 1, 200 parameters representing 1-mers, 3, 584 parameters representing 2-mers 241
and 13, 088 parameters representing 3-mers. We also included information on 13 DNA 242
shape features using DNAshape method [46], which added other 7, 695 parameters to 243
the model. This parametrization allowed us to measure the predictive ability of DNA 244
sequence and shape to explain TOP2B binding with an unprecedented resolution. As a 245
result, we ended up with model matrices of 32, 766 columns and either 13, 128 (liver) or 246
8, 413 (MEFs) rows. 247
Chromatin features predict TOP2B 248
Since TOP2B binding regions display increased G+C content as compared to the 249
average of the mouse genome, we trained one more model in which random regions were 250
selected so that their distribution of sequence G+C content matches that of TOP2B 251
peaks. This allowed us to account for potential biases of the predictions due to 252
differences in G+C content. We employed Support Vector Machine (SVM) [30] and 253
March 18, 2020 7/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
Naive Bayes (NB) [29] to build up the predictive models and used 5-fold 254
cross-validation to estimate the prediction accuracy. We started by training on the 255
whole set of features. Accurate predictions were obtained for the two murine cell types 256
using both the GC-corrected and the regular model (Tables 1 and 2; Fig 3; S3 Fig), 257
indicating that chromatin features can faithfully explain TOP2B binding. As a matter 258
of fact, SVM trained on all features performed considerably better than NB. For 259
instance, while regular models trained with NB achieved accuracies of 71% and 77.4% 260
for MEF and liver, respectively, these values increased to 98% and 97.4% when using 261
SVM. Similarly, accuracies obtained training the GC-corrected models with NB were 262
74.1% and 76.2% for MEF and liver, respectively, whereas SVM produced accuracies of 263
96.7% and 93%. Such performance differences are likely to be explained by many 264
features showing dependencies to each other or being poorly informative. This would 265
impair NB performance, which assumes all features are independent. Indeed, removal of 266
not informative features leads to a significant increase of accuracies achieved by the NB 267
algorithm (see next sections). 268
Fig 3. Chromatin features predict TOP2B binding. ROC curves and AUC values for Support Vector Machine models trained on the indicated sets of features (for Naive Bayes models, see S3 Fig)
Table 1. Performance of SVM and NB classifiers in MEF. Cell type Category GC corrected Random NB SVM NB SVM All 74.18 ± 1.08 96.72 ± 0.18 70.94 ± 0.75 98.06 ± 0.28 Sequence 59.15 ± 0.61 55.44 ± 1.10 66.87 ± 0.88 64.41 ± 0.48 3D shape 55.08 ± 0.92 54.32 ± 0.87 66.23 ± 0.54 61.80 ± 0.97 Histone marks 84.82 ± 0.73 83.90 ± 0.66 88.61 ± 0.70 87.57 ± 0.74 MEF Architecture 90.35 ± 0.61 94.98 ± 0.32 93.15 ± 0.58 96.52 ± 0.39 Open chromatin 81.92 ± 0.58 83.96 ± 1.44 90.48 ± 0.61 92.35 ± 0.63 Expression 50.36 ± 0.25 50.48 ± 0.54 50.77 ± 0.22 52.52 ± 1.57 Pol2 60.60 ± 0.62 61.48 ± 1.06 57.71 ± 0.46 56.09 ± 0.92 CpG methylation 69.45 ± 0.88 73.82 ± 0.88 63.98 ± 0.79 68.90 ± 1.09
Table 2. Performance of SVM and NB classifiers in mouse liver. Cell type Category GC corrected Random NB SVM NB SVM All 76.20 ± 0.97 93.00 ± 0.56 77.43 ± 0.20 97.47 ± 0.13 Sequence 62.95 ± 0.70 56.64 ± 1.10 74.98 ± 0.31 72.78 ± 0.94 3D shape 57.97 ± 0.66 54.00 ± 1.42 73.19 ± 0.38 69.71 ± 0.46 Histone marks 78.03 ± 0.32 81.17 ± 0.47 78.97 ± 0.44 85.34 ± 0.41 Liver Architecture 87.45 ± 0.53 91.74 ± 0.37 89.50 ± 0.54 94.26 ± 0.56 Open chromatin 71.68 ± 0.46 85.32 ± 4.22 83.35 ± 0.92 92.01 ± 1.57 Expression 50.00 ± 0.01 50.00 ± 0.01 50.04 ± 0.28 50.02 ± 0.01 Pol2 54.92 ± 0.24 53.97 ± 0.42 52.95 ± 0.73 52.63 ± 0.90 CpG methylation 71.58 ± 0.34 75.31 ± 0.34 66.56 ± 0.36 69.58 ± 0.44
Chromatin architecture and open chromatin are the most 269
informative features 270
To explore whether specific molecular events provide different prediction abilities for 271
TOP2B binding, chromatin features were grouped into 8 categories (Fig 2): open 272
March 18, 2020 8/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
chromatin (DNAse-seq, MNase-seq), histone marks (H3K4me1, H3K4me3, H3K9ac, 273
H3K27ac, H3K27me3, H3K36me3), Pol2 binding (POLR2A ChIP-seq), architectural 274
components (ChIP-seq of CTCF and the cohesin complex components RAD21, STAG1 275
and STAG2), gene expression (RNA-seq), CpG methylation (WGBS), DNA sequence 276
and DNA shape (measured using DNAshapeR [26]). Then, SVM and NB models were 277
constructed for each of such categories. 278
Different predictive performances were observed for the grouped molecular events, 279
with similar results across cell types (Tables 1 and 2; Fig 3; S3 Fig). Models trained 280
with architectural components achieved the highest prediction accuracies, ranging from 281
87.4% to 96.5%, with area under ROC curves (AUC) of 0.96 - 0.99 (Fig 3; S3 Fig). 282
This is in agreement with previous findings of TOP2B associating with the boundaries 283
of TADs [11,12]. In the same line, models built using open chromatin factors reflect 284
TOP2B preference to bind accessible DNA, showing accuracies of 71.6% - 92.35% and 285
AUC of 0.88 - 0.99 (Tables 1 and 2; Fig 3; S3 Fig). As expected, histone-mark models 286
also obtained reasonably high prediction performances, with accuracies ranging from 287
78% to 88.6% and AUC of 0.84 - 0.95 (Tables 1 and 2; Fig 3; S3 Fig). The set of 288
histone modifications used to feed these models include a combination of promoter, 289
enhancer and gene body associated marks, which represents common genomic regions 290
where TOP2B tends to localize. 291
In contrast to the above results, gene expression (measured by RNA-seq) appeared 292
to be the less informative feature for TOP2B binding, with accuracies of around 50% 293
and AUC of 0.5 - 0.54. Interestingly, and despite the observed co-occurrence of TOP2B 294
and Pol2 at gene promoters (Fig effig1; S1 Fig), the model trained with Pol2 binding 295
information also exhibited poor prediction accuracies, ranging from 52.6% to 61.4% and 296
AUC of 0.53 - 0.67. These observations contrast with previous findings of TOP2B being 297
involved in transcriptional regulation [47], and may indicate that other chromatin 298
features, such as histone marks and DNase-seq, are capturing such association. 299
The impact of correcting for GC-bias 300
According to the performances achieved, the effect of training models using 301
GC-corrected background regions is negligible for the categories analyzed. As expected, 302
this is not the case for DNA sequence and DNA shape models, which obtained 303
decreased prediction accuracies when trained with GC-corrected sites as compared to 304
those achieved using pure random regions. While sequence content exhibited a 305
prediction accuracy of 55.4% and 62.9% and AUC of 0.58 - 0.66 for GC-corrected 306
models, these values increased to 64.1% - 74.9% and AUC of 0.71 - 0.84 for random 307
models (Tables 1 and 2; Fig 3; S3 Fig). Similarly, sequence-derived 3D shape models 308
built with GC-corrected regions achieved accuracies from 54% to 57.9% and AUC of 309
0.56 - 0.6, whereas pure random models obtained accuracies of 61.8% - 73.1% and AUC 310
of 0.67 - 0.78. Given the remarkable resolution of our model in measuring DNA 311
sequence and sequence-derived 3D DNA shape, these results suggest that sequence 312
information alone is only poorly informative of TOP2B binding, and strengthen the idea 313
of topoisomerases acting where DNA topological problems arise. 314
The predictive performances of models trained with CpG methylation data were also 315
influenced (although only moderately) by the use of GC-corrected sites as background 316
regions. While pure random models achieved modest accuracies, ranging from 64% to 317
69.5% and AUC of 0.73 - 0.76, GC-corrected models obtained moderately higher 318
accuracies of 69.4% - 75.3% and increased AUC of 0.78 - 0.81 (Tables 1 and 2; Fig 3; 319
S3 Fig). When forcing the G+C content of random regions to match that of TOP2B 320
binding sites, they tend to be located at GC rich sites not occupied by TOP2B. Such 321
regions happen to be hypermethylated as compared with TOP2B sites (S2 Fig), which 322
leads to an increased predictive power of CpG methylation. Given the modest 323
March 18, 2020 9/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
differences found in using random or GC-corrected random regions in model 324
performances, from this point on only GC-corrected random regions were considered. 325
Three chromatin features explain TOP2B binding genome wide 326
With the aim of identifying the most relevant features to explain TOP2B binding, we 327
applied feature selection, which aims at reducing the number of attributes in the data 328
matrix while keeping (or improving) the predictive power of the classifier. We used two 329
feature selection algorithms using 5-fold cross-validation: Fast Correlation Based Filter 330
(FCBF) [33] and Scatter Search (SS) [34, 35] (see Materials and methods). In addition 331
to the accuracy and the AUC values, here we also considered the average number of 332
features selected by each strategy and the stability of each algorithm. The stability (or 333
robustness) evaluates the sensitivity to variations of a feature selection algorithm in the 334
dataset. This issue is especially important in high dimensional domains where strategies 335
may yield to different solutions with similar performance and, therefore, this measure 336
enhances the confidence in the analysis of the results. 337
Both feature selection algorithms succeeded in finding a subset of features that 338
faithfully explained TOP2B binding in both cell types, with accuracies ranging from 339
92.1% to 96.11% (Table 3). As a matter of fact, although in mouse liver the two 340
classification approaches (SVM and NB) achieved the same performance for both 341
feature selection algorithms, this was not the case in MEFs, in which the SVM slightly 342
outperformed NB predictions. Regarding the number of selected features, while FCBF 343
needed averages of 46 and 22 features to optimize the prediction of TOP2B in MEF and 344
liver, respectively, SS only needed 3 features in both cell types (S2 Table). It is worth 345
stressing that SS always finds, in every iteration, the same 3 features, achieving a 346
stability score of 1 (Table 3). On the other hand, features selected by FCBF vary from 347
different iterations, which confers a poor robustness score to this strategy. 348
Table 3. Performance of SVM and NB classifiers using the features selected by FCBF and SS strategies. Cell type Algorithm NB SVM #features MEF FCBF 94.60 ± 0.35 96.11 ± 0.28 46.80 ± 3.90 SS 93.88 ± 0.31 95.51 ± 0.33 3.00 ± 0.00 Liver FCBF 92.62 ± 0.51 92.84 ± 1.91 22.20 ± 4.38 SS 92.90 ± 0.44 92.10 ± 1.97 3.00 ± 0.00
The most selected features for each strategy are shown in Fig 4A. For each 349
histogram and feature, the white bar height indicates the frequency of selection and the 350
black bar height is the Symmetrical Uncertainty (SU) value with respect to the class 351
(TOP2B). SU [48] is a widely used technique that allows to measure the relevance 352
between two variables and can be interpreted as a correlation measure to identify 353
non-linear dependencies between features. Due to the mentioned low stability score 354
obtained with the FCBF approach, only 5 out of the average 46 and 3 out of the 355
average 22 features were selected in every iteration for MEFs and liver, respectively (S2 356
Table). In MEFs, such 5 features were RAD21, DNase-seq, WGBS, MNase and the 357
DNA shape associated translational parameter Shift. In liver, the three selected features 358
corresponded to DNase-seq, STAG2 and WGBS. It is worth mentioning that the only 359
features selected in 4 of the 5 iterations were the DNA shape associated parameters 360
Slide and Shear in MEFs and Helix Twist in liver. The recurrent selection of such DNA 361
shape features in both cell types may indicate a modest contribution of these 362
parameters to TOP2B binding. In any case, most of the predictive power was sustained 363
by open chromatin features (DNAse-seq, MNase-seq), cohesin complex associated 364
factors (RAD21, STAG2) and to a lesser extent CpG methylation (WGBS). 365
March 18, 2020 10/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
Fig 4. Three features accurately predict TOP2B. A. Top features selected by Fast Correlation Based Filter and Scatter Search algorithms. For each histogram and feature, the white bar height indicates the frequency of selection and the black bar height is the Symmetrical Uncertainty (SU) value with respect to the class (TOP2B). Indexes of DNA shape parameters indicate position within the corresponding parameter vector associated to the 300 bp width of modeled TOP2B binding sites (see Materials and methods). B. Summary of the most selected features by both algorithms. Top and bottom aligned dots indicate selection of a given feature by the corresponding selection algorithm in liver and MEFs, respectively. In the middle, the SU of each feature is displayed. Only the top fifteen features according to their SU are shown, which happen to match in both cell types. C. ROC curves and AUC values for Naive Bayes models trained on either MEF, liver or activated B cells and applied to the three cell types (for Support Vector Machine and Random Forests models, see S4 Fig). Only DNase-seq, RAD21 and CTCF binding data were used for training.
A significantly more robust and consistent selection was achieved when using SS (Fig 366
4A), where only three features were selected in every iteration for both cell types. 367
Furthermore, DNase-seq and RAD21 were both selected for MEFs and liver, which is in 368
line with above results and highlights the influence of accessible chromatin and cohesin 369
factors for TOP2B binding across cell types. As a matter of fact, while in MEFs SS 370
selected CTCF at every iteration, this was not the case in liver, where the cohesin 371
complex factor STAG2 was always selected (Fig 4A). 372
A summary of the features selected by FCBS and SS in both cell types is shown in 373
Fig 4B. Top and bottom aligned dots indicate selection of a given feature by the 374
corresponding selection algorithm in liver and MEFs, respectively. In the middle, the 375
relevance (SU) of each feature with respect to TOP2B is displayed. Only the top fifteen 376
features according to their SU are shown, which happen to match in both cell types. 377
With the exception of STAG2, the trends of liver and MEFs are very similar, indicating 378
that the specific contributions of genomic features for TOP2B prediction are conserved 379
across cell types. Furthermore, the results obtained by the SS algorithm strengthen the 380
importance of accessible chromatin and architectural factors for TOP2B binding, and 381
shows that a small number of these features are enough to achieve accurate genome 382
wide predictions. 383
Validation in different mouse cell types 384
We next asked to what extend the three selected features can be used to predict TOP2B 385
binding in cell types that were not used for the training. It is noteworthy to mention 386
that, since data on STAG2 binding is rather scarce, we focused on DNase-seq, RAD21 387
and CTCF. Although the latter was only selected in MEFs, it was among the most 388
informative features in mouse liver (Fig 4B; S2 Table), and there is a wealth of 389
ChIP-seq data available for this protein. Besides the two cell types considered so far, we 390
included TOP2B binding data from mouse activated B cells (Gene Expression Omnibus 391
GSM2635606). To the best of our knowledge, this dataset is the only publicly available 392
mouse dataset on TOP2B binding for which DNase-seq and ChIP-seq of both CTCF and 393
RAD21 are also accessible (together with the already used from mouse liver and MEFs). 394
We assessed model predictions by training on one cell type and testing prediction 395
accuracy on the rest. Besides SVM and NB, this time we also applied Random Forests 396
(RF) [31], which has been widely applied to genomic data with excellent results [32]. 397
The achieved predictions showed that the cross-cell type applications of the classifiers 398
did not decrease their performances, which acquired accurate predictions independently 399
of the training cell type (Fig 4C; S4 Fig; Table 4). For example, when we applied NB 400
March 18, 2020 11/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
models trained on MEF and mouse liver to activated B cells, we obtained accuracies of 401
98.3% and 97.9%, respectively, compared to an average of 97.6% using the model 402
trained on activated B cells itself. These results support the idea that TOP2B 403
association with DNase-seq, CTCF and RAD21 can be generalized across cell types. 404
Table 4. Performance of cross cell type predictions using models trained with DNase-seq, RAD21 and CTCF. Errors are only shown for models trained and tested with the same cell type due to test data partition. Cell type (train/test) RF NB SVM MEF 95.03 ± 0.44 93.62 ± 3.76 95.22 ± 0.28 MEF/Liver 91.97 89.43 92.36 MEF/aB cells 98.31 98.26 98.38 Liver 94.90 ± 0.16 91.60 ± 0.51 76.15 ± 9.76 Liver/MEF 89.33 89.04 93.11 Liver/aB cells 96.38 97.87 98.21 aB cells 98.70 ± 0.22 97.63 ± 0.42 98.28 ± 0.31 aB cells/MEF 93.16 85.00 92.95 aB cells/Liver 89.08 85.49 81.19
To further test the above observations, we used genome wide ChIP-seq TOP2B 405
binding in mouse thymocytes obtained in our laboratory [49] and assessed model 406
prediction accuracies on them. This time we generated models using a combination of 407
data from mouse liver, MEFs and activated B cells, which allowed us to better capture 408
the cell type specificities of TOP2B binding in a single classifier. In order to evaluate 409
the specificity of our model, we performed ChIP-seq using two different antibodies 410
(Novus and Santa Cruz). We called peaks using HOMER [22] and identified 49, 974 and 411
31, 653 TOP2B binding sites for Novus and Santa Cruz, respectively. Then, we applied 412
the model and computed AUC values (Fig 5A). Although modest predictive power was 413
achieved for individual antibodies (AUC of 0.62 and 0.74 for Santa Cruz and Novus, 414
respectively), when applying the model to common peaks of both experiments we 415
obtained quite accurate predictions, with AUC of 0.89. 416
Fig 5. Validation of TOP2B model in mouse thymocytes and human MCF7. A. ROC curves and AUC values for the prediction of TOP2B peaks detected using Novus and Santa Cruz antibodies. B. Venn diagram showing the overlaps between predicted TOP2B peaks and the two sets of experimental peaks. C. Top panel shows average TOP2B ChIP-seq reads enrichment within ± 3 kb of predicted peaks confirmed by HOMER (predicted + ChIP-seq), predicted peaks not detected by HOMER (only predicted) and experimental peaks not predicted by the classifier (Only ChIP-seq). Signals were smoothed using nucleR [39]. Bottom panel shows heatmap representations of TOP2B ChIP-seq reads enrichment within ± 8 kb of predicted peaks not detected by HOMER (only predicted). D. ROC curves and AUC values for the prediction of TOP2B peaks in two replicates performed in MCF7. E. Venn diagram showing the overlaps between predicted TOP2B peaks and the two sets of experimental peaks. F. Top panel shows average TOP2B ChIP-seq reads enrichment within ± 3 kb of predicted peaks confirmed by HOMER (predicted + ChIP-seq), predicted peaks not detected by HOMER (only predicted) and experimental peaks not predicted by the classifier (only ChIP-seq). Bottom panel shows heatmap representations of TOP2B ChIP-seq reads enrichment within ± 8 kb of predicted peaks not detected by HOMER (only predicted).
Next, we generated genome wide predictions by splitting the genome in 300 bp 417
regions and applying the TOP2B classifier (S5 Fig). We identified 46, 836 regions in the 418
thymus genome with a probability of TOP2B binding above 0.95. Such regions included 419
March 18, 2020 12/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
46.6% of experimental ChIP-seq peaks detected using Novus antibody, 26.7% of those 420
detected using Santa Cruz and 76.2% of common peaks (Fig 5B). In addition, our 421
model detected a set of 24, 566 TOP2B peaks that were not identified by HOMER in 422
any of the two ChIP-seq experiments. In order to assess whether we were potentially 423
detecting false positives, we examined ChIP-seq intensity centered at TOP2B peaks 424
(Fig 5C, top). We displayed TOP2B at three types of peaks: predicted peaks confirmed 425
by HOMER (predicted+ChIP-seq), predicted peaks not detected by HOMER (only 426
predicted) and HOMER peaks not predicted by the classifier (only ChIP-seq). 427
Interestingly, the highest enrichment of TOP2B signal was observed for both types of 428
predicted peaks, independently of them being identified by HOMER or not, indicating 429
that the model faithfully identifies TOP2B binding sites supported by a large amount of 430
ChIP-seq reads, including peaks not called by HOMER. To check if the signal 431
enrichment observed at the latter is an effect of averaging reads around predicted peaks, 432
we represented read enrichment around every region using heatmaps (Fig 5C, bottom). 433
An increased ChIP-seq signal was observed centered at the whole set of predicted peaks 434
not detected by HOMER, suggesting that the model is indeed highly sensitive to 435
TOP2B binding. 436
Validation in human cells 437
In order to test whether our TOP2B binding model can be generalized to other 438
organisms, we performed ChIP-seq of TOP2B in human MCF7 cell line. This time we 439
performed two replicates using the same antibody (Sigma), which allowed us to evaluate 440
to what extend our predictions resemble an experimental replicate. We identified 82, 775 441
and 69, 818 TOP2B peaks for each replicate, with 37, 774 common peaks. Our model 442
obtained reasonably accurate predictions on both set of peaks, with AUC of 0.78 (Fig 443
5D). When predicting on the set of common peaks, the model performed considerably 444
better, with AUC of 0.87. 445
Again, we generated genome wide predictions by tiling the MCF7 genome into 300 446
bp regions and applying our model to each of them (S5 Fig). We identified 79, 596 447
regions with a probability of TOP2B binding above 0.95, including 50.7% of 448
experimental peaks from replicate 1, 48.3% of experimental peaks from replicate 2 and 449
66.7% of common peaks, showing that our predictions behave just as another 450
experimental replicate (Fig 5E). Since our predictions revealed an additional set of 451
28, 757 peaks that were not detected by HOMER, we assessed whether these 452
corresponded to potential false positives, as described above. We displayed TOP2B 453
ChIP-seq signal at the same three types of peaks: only predicted, only ChIP-seq and 454
both. As expected, predicted peaks confirmed by HOMER showed a high enrichment of 455
TOP2B (Fig 5F, top), indicating that the model is able to identify well-defined binding 456
sites supported by experimental evidence. On the other hand, ChIP-seq peaks not 457
detected by the classifier showed an only modest accumulation of TOP2B. Interestingly, 458
predicted peaks not detected by HOMER showed the highest enrichment of TOP2B, 459
indicating that our model outperforms HOMER in detecting true TOP2B-binding 460
regions. Again, we represented TOP2B ChIP-seq reads using heatmaps in order to test 461
whether the above enrichment is an effect of averaging reads around predicted peaks 462
(Fig 5F, bottom). We observed ChIP-seq signal centered at the whole set of predicted 463
peaks not detected by HOMER, confirming that the model is also sensitive to predict 464
TOP2B binding in human cells. 465
March 18, 2020 13/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
TOP2B probability mirrors experimental binding intensity 466
genome wide 467
Given the remarkable predictive power of our model, we generated virtual tracks 468
genome wide using TOP2B binding probabilities and represented them in a genome 469
browser [37] together with experimental TOP2B ChIP-seq signals. Predictive tracks 470
faithfully simulate ChIP-seq data at the gene level and accurately capture cell type 471
binding differences both in the training and the test sets (Fig 6). It is worth to note 472
that, even though we modeled TOP2B binding using binary classifiers, the similarities 473
between the experimental signals and our predicted probabilities are quite significant. 474
Furthermore, signal comparison with the three chromatin features used as predictors 475
shows that the model is also able to apply learned associations of DNAse-seq, RAD21 476
and CTCF with TOP2B to precisely generate bona fide predictions (S6 and S7 Fig). 477
These results highlight the power of our model and demonstrates that the described 478
approach can be used as a framework for the prediction of genome wide TOP2B signal 479
potentially in any mammalian cell type for which sequencing data on DNAse I 480
hypersensitivity, RAD21 and CTCF are available. Furthermore, this demonstrates that, 481
besides global TOP2B binding, our model can be used to produce virtual probability 482
tracks with which TOP2B accumulation can be analyzed at specific genomic loci. 483
Fig 6. TOP2B predictive and experimental tracks in all cell types used in this study. A. Selected region of the mouse genome. Liver, MEFs, activated B cells and thymocytes signals are shown in blue, green, red and purple, respectively. B. Selected region of the human genome. MCF7 signals are shown in orange.
Conclusion 484
TOP2B is a crucial enzyme for DNA metabolism, whose activity is required for a wide 485
range of cellular processes including replication, transcription and genome organization. 486
The few genome-wide maps of TOP2B that are currently available show high 487
colocalization with several chromatin features, but no work has been devoted to 488
comprehensively study predictive features that define TOP2B binding. By applying 489
feature selection techniques, we show that TOP2B binding can be accurately predicted 490
using only three high-throughput sequencing datasets, DNase-seq and ChIP-seq of 491
RAD21 and CTCF, highlighting the importance of TOP2B in the regulation of genome 492
architecture. Since the above datasets are largely available in ENCODE [38] and 493
Roadmap [50], our framework allows the scientific community to predict TOP2B 494
binding in a large set of organisms, cell types, tissues and biological conditions. Finally, 495
we believe that this approach and the generation of virtual probability tracks could be 496
applied to other chromatin binding factors with a similar sequence-independent binding 497
behavior to TOP2B. 498
Supporting information 499
S1 Fig. Genome browser view for TOP2B ChIP-seq signal and other 500
relevant chromatin features in MEFs. 501
S2 Fig. Enrichment of CpG methylation around TOP2B binding sites in 502
mouse liver and MEFs. Average of whole genome bisulfite sequencing reads within 503
±1 kb of TOP2B binding center, random regions and GC corrected random regions are 504
displayed. 505
March 18, 2020 14/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
S3 Fig. Chromatin features predict TOP2B binding. ROC curves and AUC 506
values for Naive Bayes models trained on the indicated sets of features. 507
S4 Fig. ROC curves and AUC values for Support Vector Machine and 508
Random Forests models. Models were trained on either MEF, liver or activated B 509
cells and applied to the three cell types. Only DNase-seq and ChIP-seq of RAD21 and 510
CTCF were used for training. 511
S5 Fig. Framework for the validation of TOP2B model in mouse and 512
human. First, the genomes of mouse and human were tiled into bins of 300 bp with 513
sliding windows of 50 bp. Then, DNase-seq and ChIP-seq reads of CTCF and RAD21 514
were scored on those bins (see Materials and methods) and the TOP2B binding model 515
trained on mouse liver, MEFs and activated B cells was applied. Bins having a TOP2B 516
probability higher than 0.95 were classified as TOP2B binding regions. Validation was 517
performed on thymocytes and MCF7 cells for mouse and human, respectively. 518
S6 Fig. TOP2B predictive track in mouse liver, activated B cells and 519
MEFs (training cell types). From top to bottom, genome browser view for TOP2B 520
ChIP-seq, TOP2B predicted track, CTCF ChIP-seq, RAD21 ChIP-seq and DNase-seq. 521
S7 Fig. TOP2B predictive track in human MCF7 and mouse thymocytes 522
(validation cell types). From top to bottom, genome browser view for TOP2B 523
ChIP-seq, TOP2B predicted track, CTCF ChIP-seq, RAD21 ChIP-seq and DNase-seq. 524
S1 Table. Public and custom sequencing data used in this study. 525
S2 Table. Ranking of features selected by scatter search and fast 526
correlation base filter algorithms. 527
Acknowledgments 528
We thank A. Herrero-Ruiz and C. G´omez-Mar´ınfor valuable comments. All the 529
analyses were performed using custom scripts that were run on the High Perfomance 530
Computing cluster provided by the Centro Inform´aticoCient´ıficode Andaluc´ıa(CICA). 531
References
1. Vos SM, Tretter EM, Schmidt BH, Berger JM. All tangled up: How cells direct, manage and exploit topoisomerase function; 2011. 2. Nitiss JL. DNA topoisomerase II and its growing repertoire of biological functions. Nature Reviews Cancer. 2009;9(5):327–337. doi:10.1038/nrc2608. 3. Pommier Y, Sun Y, Huang SyN, Nitiss JL. Roles of eukaryotic topoisomerases in transcription, replication and genomic stability. Nature Reviews Molecular Cell Biology. 2016;17(11):703–721. doi:10.1038/nrm.2016.111. 4. Deweese JE, Osheroff N. The DNA cleavage reaction of topoisomerase II: Wolf in sheep’s clothing. Nucleic Acids Research. 2009;37(3):738–748. doi:10.1093/nar/gkn937.
March 18, 2020 15/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
5. Akimitsu N, Kamura K, Ton´eS, Sakaguchi A, Kikuchi A, Hamamoto H, et al. Induction of apoptosis by depletion of DNA topoisomerase IIα in mammalian cells. Biochemical and Biophysical Research Communications. 2003;307(2):301–307. doi:10.1016/S0006-291X(03)01169-0. 6. Thakurela S, Garding A, Jung J, Schubeler D, Burger L, Tiwari VK. Gene regulation and priming by topoisomerase IIalpha in embryonic stem cells. Nat Commun. 2013;4:2478. doi:10.1038/ncomms3478. 7. Dereuddre S, Delaporte C, Jacquemin-Sablon A. Role of topoisomerase IIβ in the resistance of 9-OH-ellipticine- resistant Chinese hamster fibroblasts to topoisomerase II inhibitors. Cancer Research. 1997;57(19):4301–4308. 8. Capranico G, Tinelli S, Austin CA, Fisher ML, Zunino F. Different patterns of gene expression of topoisomerase II isoforms in differentiated tissues during murine development. BBA - Gene Structure and Expression. 1992;1132(1):43–48. doi:10.1016/0167-4781(92)90050-A. 9. Madabhushi R, Gao F, Pfenning AR, Pan L, Yamakawa S, Seo J, et al. Activity-Induced DNA Breaks Govern the Expression of Neuronal Early-Response Genes. Cell. 2015;161(7):1592–1605. doi:10.1016/j.cell.2015.05.032. 10. Manville CM, Smith K, Sondka Z, Rance H, Cockell S, Cowell IG, et al. Genome-wide ChIP-seq analysis of human TOP2B occupancy in MCF7 breast cancer epithelial cells. Biology Open. 2015;4(11):1436–1447. doi:10.1242/bio.014308. 11. Uusk¨ula-ReimandL, Hou H, Samavarchi-Tehrani P, Rudan MV, Liang M, Medina-Rivera A, et al. Topoisomerase II beta interacts with cohesin and CTCF at topological domain borders. Genome Biol. 2016;17(1):1–22. doi:10.1186/s13059-016-1043-8. 12. Canela A, Maman Y, Jung S, Wong N, Callen E, Day A, et al. Genome Organization Drives Chromosome Fragility. Cell. 2017;170(3):507–521.e18. doi:10.1016/j.cell.2017.06.034. 13. Dellino GI, Palluzzi F, Chiariello AM, Piccioni R, Bianco S, Furia L, et al. Release of paused RNA polymerase II at specific loci favors DNA double-strand-break formation and promotes cancer translocations. Nature Genetics. 2019;doi:10.1038/s41588-019-0421-z. 14. Ciccia A, Elledge SJ. The DNA Damage Response: Making It Safe to Play with Knives; 2010. 15. Polo SE, Jackson SP. Dynamics of DNA damage response proteins at DNA breaks: A focus on protein modifications; 2011. 16. Chapman JR, Taylor MRG, Boulton SJ. Playing the End Game: DNA Double-Strand Break Repair Pathway Choice; 2012. 17. Canela A, Maman Y, Huang SyN, Wutz G, Tang W, Zagnoli-Vieira G, et al. Topoisomerase II-Induced Chromosome Breakage and Translocation Is Determined by Chromosome Architecture and Transcriptional Activity. Molecular Cell. 2019;doi:10.1016/j.molcel.2019.04.030.
March 18, 2020 16/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
18. Gothe HJ, Bouwman BAM, Gusmao EG, Piccinno R, Petrosino G, Sayols S, et al. Spatial Chromosome Folding and Active Transcription Drive DNA Fragility and Formation of Oncogenic MLL Translocations. Molecular Cell. 2019;doi:10.1016/j.molcel.2019.05.015. 19. Langmead B, Trapnell C, Pop M, Salzberg S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology. 2009;10(3):R25. doi:10.1186/gb-2009-10-3-r25. 20. Hahne F, Ivanek R. Visualizing genomic data using Gviz and bioconductor. In: Methods in Molecular Biology; 2016. 21. Ou J, Zhu LJ. trackViewer: a Bioconductor package for interactive and integrative visualization of multi-omics data; 2019. 22. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Molecular Cell. 2010;doi:10.1016/j.molcel.2010.05.004. 23. Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Scientific Reports. 2019;doi:10.1038/s41598-019-45839-z. 24. Comoglio F, Paro R. Combinatorial Modeling of Chromatin Features Quantitatively Predicts DNA Replication Timing in Drosophila. PLoS Computational Biology. 2014;10(1). doi:10.1371/journal.pcbi.1003419. 25. Lister R, Mukamel EA, Nery JR, Urich M, Puddifoot CA, Johnson ND, et al. Global epigenomic reconfiguration during mammalian brain development. Science. 2013;doi:10.1126/science.1237905. 26. Chiu TP, Comoglio F, Zhou T, Yang L, Paro R, Rohs R. DNAshapeR: An R/Bioconductor package for DNA shape prediction and feature encoding. Bioinformatics. 2016;doi:10.1093/bioinformatics/btv735. 27. Li J, Sagendorf JM, Chiu TP, Pasi M, Perez A, Rohs R. Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Research. 2017;doi:10.1093/nar/gkx1145. 28. Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, et al. Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences. 2015;112(15):4654–4659. doi:10.1073/pnas.1422023112. 29. Duda RO, Hart PE. Pattern classification and scene analysis. New York, USA: John Wiley & Sons; 1973. 30. Vapnik VN. Statistical Learning Theory. Wiley-Interscience; 1998. 31. Breiman L. Random Forests. Machine Learning. 2001;45:5–32. 32. Chen X, Ishwaran H. Random forests for genomic data analysis; 2012. 33. Yu L, Liu H. Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research. 2004;5:1205–1224.
March 18, 2020 17/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
34. Garc´ıaL´opez FC, Garc´ıaTorres M, Moreno P´erezJA, Moreno Vega JM. Scatter Search for the Feature Selection Problem. In: Conejo R, Urretavizcaya M, P´erez-de-laCruz JL, editors. Current Topics in Artificial Intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg; 2004. p. 517–525. 35. Garc´ıa-L´opez FC, Garc´ıa-Torres M, Meli´an-BatistaB, P´erezJAM, Moreno-Vega JM. Solving the Feature Selection Problem by a Parallel Scatter Search. European Journal of Operations Research. 2006;169(2):477–489. 36. Hall MA. Correlation-based Feature Subset Selection for Machine Learning. University of Waikato; 1998. 37. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The Human Genome Browser at UCSC. Genome Research. 2002;doi:10.1101/gr.229102. 38. Consortium EP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi:10.1038/nature11247. 39. Flores O, Orozco M. nucleR: A package for non-parametric nucleosome positioning. Bioinformatics. 2011;doi:10.1093/bioinformatics/btr345. 40. Kim SH, Ganji M, Kim E, van der Torre J, Abbondanzieri E, Dekker C. DNA sequence encodes the position of DNA supercoils. eLife. 2018;doi:10.7554/eLife.36557. 41. Mathelier A, Xin B, Chiu TP, Yang L, Rohs R, Wasserman WW. DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Systems. 2016;3(3):278–286.e4. doi:10.1016/j.cels.2016.07.001. 42. Rao S, Chiu TP, Kribelbauer JF, Mann RS, Bussemaker HJ, Rohs R. Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic effects on protein-DNA binding. Epigenetics and Chromatin. 2018;doi:10.1186/s13072-018-0174-4. 43. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Reviews Genetics. 2012;13(7):484–492. doi:10.1038/nrg3230. 44. Vinson C, Chatterjee R. CG methylation. Epigenomics. 2012;4(6):655–663. doi:10.2217/epi.12.55. 45. Rube HT, Lee W, Hejna M, Chen H, Yasui DH, Hess JF, et al. Sequence features accurately predict genome-wide MeCP2 binding in vivo. Nature Communications. 2016;doi:10.1038/ncomms11025. 46. Zhou T, Yang L, Lu Y, Dror I, Dantas Machado AC, Ghane T, et al. DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic acids research. 2013;41(Web Server issue). doi:10.1093/nar/gkt437. 47. Tiwari VK, Burger L, Nikoletopoulou V, Deogracias R, Thakurela S, Wirbelauer C, et al. Target genes of Topoisomerase II regulate neuronal survival and are defined by their chromatin state. Proceedings of the National Academy of Sciences. 2012;109(16):E934–E943. doi:10.1073/pnas.1119798109. 48. Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques; 2016.
March 18, 2020 18/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
49. Alvarez-Quil´onA,´ Terr´on-BautistaJ, Delgado-Sainz I, Serrano-Ben´ıtezA, Romero-Granados R, Mart´ınez-Garc´ıaPM, et al. Endogenous topoisomerase II-mediated DNA breaks drive thymic cancer predisposition linked to ATM deficiency. Nature Communications. 2020;doi:10.1038/s41467-020-14638-w. 50. Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature. 2015;doi:10.1038/nature14248.
March 18, 2020 19/19 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2020.03.23.003277; this version posted March 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.