bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.
1 Recurrent Integration of Human Papillomavirus Genomes at Transcriptional
2 Regulatory Hubs
3 Alix Warburton1, Tovah E. Markowitz2-3, Joshua P. Katz4, James M. Pipas4 and Alison
4 A. McBride1 *
5
6 1Laboratory of Viral Diseases, National Institute of Allergy and Infectious Diseases, 33 North Drive,
7 MSC3209, National Institutes of Health, Bethesda, Maryland 20892, USA
8 2NIAID Collaborative Bioinformatics Resource (NCBR), National Institute of Allergy and Infectious
9 Diseases, National Institutes of Health, Bethesda, MD, USA
10 3Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research,
11 Frederick, MD, USA
12 4Department of Biological Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
13
14 *Corresponding author
15 E-mail: [email protected]
16 Short Title: Integration of HPV genomes in human cancers
23 ABSTRACT
24 Oncogenic human papillomavirus (HPV) genomes are often integrated into host chromosomes in HPV-
25 associated cancers. HPV genomes are integrated either as a single copy, or as tandem repeats of viral
26 DNA interspersed with, or without, host DNA. Integration occurs frequently in common fragile sites
27 susceptible to tandem repeat formation, and the flanking or interspersed host DNA often contains
28 transcriptional enhancer elements. When co-amplified with the viral genome, these enhancers can form
29 super-enhancer-like elements that drive high viral oncogene expression. Here, we compiled highly
30 curated datasets of HPV integration sites in cervical carcinomas (CESC) and head and neck squamous cell
31 carcinomas (HNSCC) and assessed the number of breakpoints, viral transcriptional activity, and host genome
32 copy number at each insertion site. Tumors frequently contained multiple distinct HPV integration sites,
33 but often only one “driver” site that expressed viral RNA. Since common fragile sites and active
34 enhancer elements are cell-type specific, we mapped these regions in cervical cell lines using FANCD2
35 and Brd4/H3K27ac ChIP-seq, respectively. Large enhancer clusters, or super-enhancers, were also
36 defined using the Brd4/H3K27ac ChIP-seq dataset. HPV integration breakpoints were enriched at both
37 FANCD2-associated fragile sites and enhancer-rich regions, and frequently showed adjacent focal DNA
38 amplification in CESC samples. We identified recurrent integration “hotspots” that were enriched for
39 super-enhancers, some of which function as regulatory hubs for cell-identity genes. We propose that
40 during persistent infection, extrachromosomal HPV minichromosomes associate with these
41 transcriptional epicenters, and accidental integration could promote viral oncogene expression and
42 carcinogenesis.
43 INTRODUCTION
44 Persistent infection with high-oncogenic risk HPV types is responsible for almost all cervical and
45 ~70% of oropharyngeal carcinomas 1. One factor that can contribute to oncogenic progression of
46 HPV-positive lesions is integration of the viral genome into host chromatin. Integration is associated with
47 increased genetic instability in high-grade cervical intraepithelial neoplasia (CIN) and cervical and
48 oropharyngeal carcinomas 2-9 due to dysregulated expression of the viral oncoproteins, E6 and E7. Many
49 studies have compared the human genomic regions associated with HPV integration sites to elucidate
50 the mechanisms that might promote integration and carcinogenesis. Here, we have curated datasets
51 from The Cancer Genome Atlas and other published sources to define a rigorous database of integration
52 breakpoints and correlated these with “in-house” datasets of common fragile sites and enhancer
53 elements defined in cervical carcinoma cells.
54 Integration of HPV DNA occurs in all human chromosomes; however, integration sites are often
55 found within or in close proximity to common fragile sites 10-13. Common fragile sites are regions of the
56 genome that have difficulty completing replication and, as such, are susceptible to chromosome
57 breakage in mitosis. They are prone to replication stress that can be due to a shortage of replication
58 origins or clashes between replication and transcriptional processes. Therefore, they vary in fragility
59 depending on cell type and disease state 14. Most previous studies that documented an association of
60 HPV integration sites with common fragile sites used the classical FRA-regions that were defined
61 cytogenetically in lymphocytes 9-13,15. As such, these fragile sites are not cell-type specific and are often
62 large, poorly defined regions that cover a large proportion of the human genome. FANCD2 is required
63 for resolution of these genetically unstable sites and, as such, is a marker of common fragile sites 16-18.
64 HPV integration sites occur frequently in amplified regions of the host genome, and focal
65 amplification of cellular flanking sequences at sites of viral integration is frequently observed in
66 HPV-positive tumors 6,7,9,15. Co-amplification of the viral genome and flanking cellular sequences can result
67 from unlicensed initiation of replication at the viral origin resulting in endoreduplication 19-21.
68 Subsequent recombination can result in amplified tandem repeats. Genome amplification can also occur
69 at common fragile sites by breakage-fusion-bridge cycles 22.
70 HPV integration is also enriched at transcriptionally active regions of the host genome 15,23. We
71 previously identified an HPV16 integration site in the W12 20861 cervical cell line that was adjacent to a
72 cell-type specific enhancer. Co-amplification of this regulatory element and the viral genome to
73 approximately 25 copies resulted in the formation of a super-enhancer-like element to drive high viral
74 oncogene expression 24,25. This ‘enhancer-hijacking’ is a novel mechanism by which HPV integration can
75 promote oncogenesis; however, it is unclear how common this mechanism is in HPV-associated cancers.
76 The aim of this study was to examine the association among HPV integration loci, common
77 fragile sites, and genome amplification to determine whether insertion of HPV genomes adjacent to
78 active cellular enhancers often resulted in viral ‘enhancer-hijacking’, and whether genetic instability
79 could result in co-amplification of viral-cellular regulatory repeats to drive oncogenic progression of
80 HPV-associated cancers. We have extended our previously published work 26 to generate a common
81 fragile site dataset in cervical carcinoma cell lines using higher resolution mapping of FANCD2 binding
82 and have mapped cellular enhancers and super-enhancers in an HPV16-positive cell line derived from a
83 cervical lesion 27 using H3K27ac and Brd4 ChIP-seq (Figure 1). Here we compare HPV integration sites
84 with these cervical cell-type specific enhancer and fragile site datasets.
85 RESULTS
86 CESC and HNSCC tumors frequently contain multiple, clustered HPV integration breakpoints
87 A dataset of HPV integration breakpoints was assembled from various sources 5,6,8,9,28-35, as
88 outlined in Figure 1 and Supplementary Table S1. Integration breakpoints were defined as the junctions
89 between the viral and host chimeric reads within the human reference genome. A total of 1,299
90 integration breakpoints from 333 cervical carcinomas (CESC) and 119 integration breakpoints from 41
91 head and neck squamous cell carcinomas (HNSCC) were included in this study (Supplementary Figure S1
92 and Supplementary Table S2-S3). We found that many tumor samples contained multiple integration
93 breakpoints that could have resulted from either independent integration events at different
94 chromosomal loci or from amplification of a single integration site resulting in a cluster of multiple,
95 closely spaced breakpoints. To distinguish these possibilities, we defined an integration locus as either a single HPV
96 insertion breakpoint or multiple, closely spaced breakpoints (a cluster; Figure 2a). Clustered
97 breakpoints within the same chromosome that had a maximum distance of 3 Mb between the most 5’
98 and 3’ breakpoints were classified as a single integration locus. Based on this classification, the total
99 number of integration loci analyzed in our study was 584 for CESC samples and 58 for HNSCC samples.
100 Multiple integration loci were observed in 28.8% of CESC and 22.0% of HNSCC tumors.
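The locus definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `group_into_loci`, the tuple layout, and the greedy left-to-right grouping are assumptions, and the example coordinates are invented. Only the 3 Mb maximum span between the most 5' and 3' breakpoints comes from the text.

```python
from collections import defaultdict

MAX_SPAN_BP = 3_000_000  # 3 Mb maximum span per locus, as stated in the text


def group_into_loci(breakpoints, max_span=MAX_SPAN_BP):
    """Group one tumor's breakpoints into integration loci.

    breakpoints: iterable of (chromosome, position) tuples (hypothetical layout).
    Returns a list of (chromosome, start, end, n_breakpoints) tuples.
    Assumption: breakpoints are grouped greedily from the 5'-most position.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)

    loci = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        start = end = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - start <= max_span:
                # Still within 3 Mb of the 5'-most breakpoint: same locus.
                end = pos
                count += 1
            else:
                # Too far from the locus start: close it and open a new one.
                loci.append((chrom, start, end, count))
                start = end = pos
                count = 1
        loci.append((chrom, start, end, count))
    return loci


# Invented coordinates for illustration only.
example = [
    ("chr8", 128_200_000),
    ("chr8", 128_750_000),
    ("chr8", 133_000_000),
    ("chr13", 73_500_000),
]
loci = group_into_loci(example)
for locus in loci:
    print(locus)
```

In this toy input, the two chr8 breakpoints that lie 0.55 Mb apart collapse into one clustered locus, while the third chr8 breakpoint, 4.8 Mb from the cluster start, becomes a separate single-breakpoint locus.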
101 Sites of recurrent HPV DNA integration in different tumor samples are termed integration
102 hotspots. We defined integration hotspots (five or more sites located <5Mb apart) in our CESC dataset
103 and compared them to previously defined hotspots from the literature 10,15,28,36-40. We identified a total
104 of 37 hotspots in CESC tumors (Supplementary Table S4), which represented 313/584 (53.6%)
105 integration loci from our CESC dataset (Figure 2b-c). Twenty-three hotspots overlapped previously
106 defined sites of recurrent integration and 14 were novel hotspots, Supplementary Figure S2 and
107 Supplementary Table S4-S5. We were unable to define sites of recurrent integration in the HNSCC
108 dataset because of the low number of integration loci in these tumors.
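One way to read the hotspot criterion (five or more sites located <5 Mb apart) is as a chaining rule: integration loci from different tumors are merged along a chromosome while each consecutive pair is less than 5 Mb apart, and a chain is called a hotspot once it holds at least five loci. The sketch below implements that reading in Python. The chaining interpretation, the function name `find_hotspots`, and the example coordinates are assumptions made for illustration; the text does not specify how the distance criterion was applied.

```python
def find_hotspots(loci, min_sites=5, max_gap=5_000_000):
    """Call hotspots from pooled integration loci.

    loci: iterable of (chromosome, position) tuples, one per integration locus
          (hypothetical layout, e.g. the locus midpoint).
    Assumption: consecutive loci closer than max_gap are chained together, and
    chains with at least min_sites loci are reported as hotspots.
    Returns a list of (chromosome, start, end, n_loci) tuples.
    """
    by_chrom = {}
    for chrom, pos in loci:
        by_chrom.setdefault(chrom, []).append(pos)

    hotspots = []
    for chrom, positions in sorted(by_chrom.items()):
        positions.sort()
        run = [positions[0]]
        for pos in positions[1:]:
            if pos - run[-1] < max_gap:
                run.append(pos)
            else:
                if len(run) >= min_sites:
                    hotspots.append((chrom, run[0], run[-1], len(run)))
                run = [pos]
        # Flush the final run on this chromosome.
        if len(run) >= min_sites:
            hotspots.append((chrom, run[0], run[-1], len(run)))
    return hotspots


# Invented coordinates for illustration only.
example_loci = [
    ("chr8", 127_000_000), ("chr8", 127_900_000), ("chr8", 128_200_000),
    ("chr8", 128_700_000), ("chr8", 129_500_000), ("chr8", 140_000_000),
    ("chr3", 189_000_000), ("chr3", 189_500_000),
]
hotspots = find_hotspots(example_loci)
print(hotspots)
```

Here the five chr8 loci between 127.0 and 129.5 Mb form a single hotspot, whereas the isolated chr8 locus and the two chr3 loci do not reach the five-site threshold.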
109 The distribution of clustered breakpoints at each integration locus and across each chromosome
110 is shown in Figure 2b for CESC samples and Supplementary Figure S3 for HNSCC samples. Most
111 integration sites had 1-2 breakpoints in both the CESC (81.1%) and HNSCC (75.9%) datasets. Sites of
112 recurrent integration are indicated by orange boxes for the CESC samples. The integration loci at these
113 hotspots were more likely to contain clustered breakpoints compared to integration sites elsewhere in
114 the genome for CESC tumors (Figure 2c; p=0.004). Higher numbers of clustered breakpoints at sites of
115 recurrent integration suggest that these regions are susceptible to genomic instability.
116 Most tumors with integrated HPV DNA have a single driver integration
117 Constitutive expression of the viral oncogenes from the integration locus is required for clonal
118 selection and oncogenic progression. Transcriptionally silent HPV integration loci can be considered to
119 be passenger sites 35. To identify driver versus passenger integrations, the transcriptional activity of each
120 integration locus was determined for the subset of samples that had matched RNA sequencing data
121 (CESC, n=144; HNSCC, n=35). The transcription status of each integration locus is indicated in
122 Supplementary Table S2-S3. Integration loci in which no HPV mRNA was detected by RNA-seq analysis
123 were classified as inactive or passenger loci. In CESC, all samples with a single integration locus (n=86)
124 were transcriptionally active (Figure 3a). For samples with multiple integration loci (CESC, n=58; HNSCC,
125 n=13), more than one transcriptionally active integration locus was observed in 35 (60.3%) CESC and
126 four (30.8%) HNSCC tumors. Three HNSCC samples had no driver integrations; one sample (4.5%) had a
127 single integration locus and two samples (15.4%) had multiple integration loci, Figure 3b. Overall, the
128 majority of CESC and HNSCC tumors with integrated HPV genomes had only a single transcriptionally
129 active integration locus. This implies that most tumors with integrated HPV DNA have a single driver
130 integration. Viral oncogene transcription was analyzed at integration hotspots in CESC, which showed
131 that sites of recurrent integration can have both driver and passenger integrations but are more likely to
132 be transcriptionally active (Figure 3c).
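The driver versus passenger tally described above reduces to counting, per tumor, how many integration loci show any HPV mRNA. The following Python sketch illustrates that bookkeeping. The input layout (a read count per locus), the function name `classify_tumors`, and the toy values are assumptions; the only rule taken from the text is that a locus with no detectable HPV mRNA is treated as a passenger.

```python
from collections import Counter


def classify_tumors(locus_reads):
    """Summarize tumors by their number of transcriptionally active loci.

    locus_reads: dict mapping a tumor identifier to a list of HPV RNA read
                 counts, one per integration locus (hypothetical layout).
    A locus counts as a driver if at least one HPV RNA read was detected.
    """
    summary = Counter()
    for tumor, reads in locus_reads.items():
        n_active = sum(1 for r in reads if r > 0)
        if n_active == 0:
            summary["no driver"] += 1
        elif n_active == 1:
            summary["single driver"] += 1
        else:
            summary["multiple drivers"] += 1
    return summary


# Invented read counts for illustration only.
example = {
    "T1": [12],        # one locus, active
    "T2": [0, 30],     # two loci, one active
    "T3": [5, 9, 0],   # three loci, two active
    "T4": [0],         # one locus, silent
}
summary = classify_tumors(example)
print(dict(summary))
```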
133 Clustered integration breakpoints are associated with amplified regions of the host genome in CESC
134 and HNSCC
135 HPV integration loci often have amplification and/or rearrangements of the flanking cellular
136 sequences at the insertion sites 6,7,9,15. We determined the frequency with which integration loci in our
137 datasets were associated with somatic copy number alterations for the subset of samples that had
138 matched host genome copy number data (235 CESC and 22 HNSCC tumors; Supplementary Table S6-
139 S7). In this subset, most integration breakpoints occurred within amplified regions of cellular DNA in
140 both CESC and HNSCC relative to regions with a normal genomic copy number or adjacent to deletions
141 (Figure 4a-b). In addition, the number of breakpoints per integration locus was significantly different at
142 amplified regions of the host genome relative to loci with a normal genomic profile for both CESC and
143 HNSCC samples; higher numbers of breakpoints per cluster occurred at amplified regions (Figure 4a-b).
144 No significant difference was found in the number of clustered breakpoints per integration locus in
145 regions with a normal genomic copy number relative to those with genomic copy number losses in CESC
146 samples. We conclude that integration loci associated with flanking host DNA amplification were more
147 likely to contain clustered breakpoints and to have higher numbers of breakpoints per integration locus.
148 Integration loci with associated focal amplifications were found on all chromosomes in CESC and
149 on 15 chromosomes in HNSCC. The distribution of clustered integration breakpoints relative to host
150 genome amplification and sites of recurrent integration in CESC is shown in Figure 4c-d. There was an
151 enrichment of integration loci with associated host DNA amplifications at integration hotspots in CESC
152 (Figure 4e) and higher numbers of clustered breakpoints were common at these regions (Figure 4c).
153 Genomic instability in these regions most likely results in co-amplification of both viral and host DNA.
154 Identification of common fragile sites in cervical cancer cell lines
155 Genomic instability occurs frequently at common fragile sites. We previously used FANCD2 ChIP-
156 chip to map fragile sites in an aphidicolin-treated cervical carcinoma cell line (C33-A) 26. Here, we have
157 extended the FANCD2 dataset using ChIP-seq analysis of both C33-A and HeLa cervical carcinoma cells
158 (Supplementary Table S8-S9) and combined these results with those previously mapped in C33-A cells
159 by ChIP-chip 26. There was good overlap between the FANCD2 peaks called across the three datasets
160 (p<0.0001). In total we defined 513 FANCD2-enriched regions between the two cervical carcinoma cell
161 lines, and they are listed in Supplementary Table S10.
162 We compared our cervical carcinoma cell line derived FANCD2 mapped fragile sites (genomic
163 coverage 7.9%) to the 77 aphidicolin-induced common fragile sites (FRA regions) defined cytogenetically
164 in lymphocytes and reported in the HGNC (HUGO Gene Nomenclature Committee) database (genomic
165 coverage 48.3%), Figure 5a. A total of 115 (22.4%) FANCD2-enriched regions derived from C33-A and
166 HeLa cells overlapped with 55.8% (n=43/77) of FRA-regions (Figure 5b). FRA-regions that overlapped with
167 FANCD2-associated fragile sites are listed in Supplementary Table S11. Permutation testing was used to
168 determine the significance in overlap between our FANCD2-dataset and traditional FRA regions. The
169 association between these genomic features did not reach significance (p=0.0634), which may reflect
170 differences in replication stress at these regions in different cell types.
171 Recent studies used high-resolution MiDA-seq (next-generation sequencing of EdU
172 incorporation at sites of mitotic DNA synthesis) to map fragile sites in HeLa cells 41,42 and show that they
173 colocalize with FRA-regions and FANCD2 foci in cells treated with aphidicolin. We compared the overlap
174 between our dataset of FANCD2-enriched regions and the mitotic DNA synthesis regions profiled in HeLa
175 cells (total genomic coverage of 4.4%, Figure 5a). A total of 120 (23.4%) FANCD2-enriched regions
176 overlapped with mitotic DNA synthesis regions, which represented 48.3% (n=112/232) of mitotic DNA
177 synthesis regions, Figure 5b. Permutation testing was used to determine the significance in overlap
178 between FANCD2-enriched regions and mitotic DNA synthesis regions (p<0.0001). Mitotic DNA synthesis
179 regions that overlapped with FANCD2-associated fragile sites are indicated in Supplementary Table S12.
180 Collectively, these data show good correlation between our FANCD2-associated fragile sites and regions
181 of the genome susceptible to genetic instability.
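Overlap counts between two interval sets are not symmetric, which is why the number of FANCD2-enriched regions that overlap a second dataset can differ from the number of regions in that dataset that overlap FANCD2 (for example, 120 versus 112 above). A minimal Python sketch of this counting follows. The function names, the half-open interval convention, and the toy intervals are assumptions for illustration and do not reproduce the authors' intersect analysis.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) intervals share at least one base.

    Assumption: intervals are half-open, [start, end).
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_with_overlap(query, subject):
    """Count intervals in `query` that overlap at least one interval in `subject`."""
    return sum(any(overlaps(q, s) for s in subject) for q in query)


# Invented intervals for illustration only.
fancd2_regions = [("chr1", 100, 200), ("chr1", 250, 300), ("chr2", 10, 50)]
midas_regions = [("chr1", 150, 280), ("chr3", 5, 9)]

fancd2_hits = count_with_overlap(fancd2_regions, midas_regions)
midas_hits = count_with_overlap(midas_regions, fancd2_regions)
print(fancd2_hits, len(fancd2_regions))
print(midas_hits, len(midas_regions))
```

In the toy data, one large region in the second set spans two regions in the first, so the two directions of the comparison give different counts and different percentages.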
182 The instability of fragile sites is often due to transcription-replication conflicts that frequently
183 occur at long genes 43. We and others have previously shown that FANCD2-enriched regions overlap
184 with transcriptionally active long genes in C33-A 26 and U2OS cells 44. Here we extended that association
185 to include genes that are >0.3 Mb in length 45. A total of 184/513 (35.9%) FANCD2-enriched regions
186 overlapped with protein-coding genes that were ≥0.3 Mb, which corresponded to 185/782 (23.7%) long
187 genes. A Fisher’s exact test was used to determine the significance in overlap between FANCD2-
188 enriched regions and long genes (two-tailed, p=4.22E-13). Of the long genes that overlapped with
189 FANCD2-enriched regions, 121/185 (65.4%) were expressed in C33-A and/or HeLa cells, Figure 5c,
190 further validating our FANCD2 peaks as sites of genetic instability in cervical carcinoma cells.
191 Supplementary Table S13 lists the long genes used in this analysis and their association with common
192 fragile sites and expression in C33-A and HeLa cells from RNA-seq (ref. 26 and Expression Atlas,
193 https://www.ebi.ac.uk/gxa/home). Example alignments of our FANCD2-associated fragile sites relative
194 to common FRA-regions, mitotic DNA synthesis regions and long genes are shown in Figure 5d.
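For readers unfamiliar with the two-tailed Fisher's exact test used above, the calculation can be sketched with the Python standard library. The full 2x2 contingency table behind the reported p-value is not given in this section, so the example uses a small toy table rather than the study's counts; the function name `fisher_exact_two_sided` is likewise hypothetical. The two-sided p-value is computed in the usual way, by summing the hypergeometric probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Toy table, not data from this study:
#                  overlaps long gene   does not overlap
# FANCD2 region            3                   0
# control region           0                   3
p_value = fisher_exact_two_sided(3, 0, 0, 3)
print(p_value)
```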
195 Integration breakpoints are enriched at FANCD2-associated fragile sites
196 To assess the association of our cervical carcinoma-specific fragile site dataset with HPV
197 integration sites, we calculated the frequency with which an integration breakpoint occurred within 50
198 Kb 9 of the C33-A and HeLa FANCD2-enriched regions (Supplementary Table S10). Approximately 18% of
199 integration breakpoints were associated with FANCD2-enriched regions in both the CESC and HNSCC
200 datasets. The data were permuted 10,000 times to create an expected distribution of the overlap
201 between integration breakpoints and FANCD2-enriched regions. This showed that the
202 FANCD2-associated fragile sites were significantly enriched for both CESC and HNSCC HPV integration sites when
203 each breakpoint was analyzed independently (Table 1).
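The logic of the permutation test above can be illustrated with a short Python sketch. It counts breakpoints that fall within 50 kb of any region, then relocates the breakpoints at random 10,000 times to build the expected distribution and derive an empirical p-value. Several simplifications are assumptions rather than the authors' method: the genome is treated as a single linear coordinate, breakpoints are redistributed uniformly, and the function names, region coordinates, and breakpoint positions are invented.

```python
import random


def near_any(pos, regions, window):
    """True if pos lies within `window` bp of any (start, end) region."""
    return any(start - window <= pos <= end + window for start, end in regions)


def count_near(positions, regions, window):
    """Number of positions within `window` bp of at least one region."""
    return sum(near_any(p, regions, window) for p in positions)


def permutation_p_value(positions, regions, genome_length,
                        window=50_000, n_perm=10_000, seed=0):
    """Empirical one-sided p-value for enrichment of positions near regions.

    Assumption: the null model places each breakpoint uniformly at random
    on a single linear genome of length `genome_length`.
    """
    rng = random.Random(seed)
    observed = count_near(positions, regions, window)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.randrange(genome_length) for _ in positions]
        if count_near(shuffled, regions, window) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Invented toy genome, regions and breakpoints for illustration only.
toy_genome_length = 10_000_000
toy_regions = [(1_000_000, 1_200_000), (5_000_000, 5_100_000)]
toy_breakpoints = [
    1_010_000, 1_230_000, 960_000, 5_050_000, 5_140_000, 4_960_000,
    2_000_000, 3_300_000, 7_700_000, 9_100_000,
]
observed, p_value = permutation_p_value(toy_breakpoints, toy_regions, toy_genome_length)
print(observed, p_value)
```

Because clustered breakpoints at one locus are not independent events, a test of this kind over-weights loci with many breakpoints, which is the bias that condensing each locus to a single site is meant to remove.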
204 To remove bias resulting from overrepresentation of integration loci containing clusters of
205 breakpoints, the integration loci were simplified into defined subsets for significance testing. Integration
206 loci contain either a single breakpoint or a cluster of breakpoints (Figure 2a). In the latter group, the
207 clustered breakpoints were condensed into a single site represented by the most 5’ and 3’ breakpoints.
208 The final category combined the single and condensed categories, in which each integration locus was
209 represented just once. The integration loci in each category were tested independently for their
210 association with FANCD2-enriched regions. For the CESC dataset, sites containing single or clustered
211 breakpoints were significantly associated with FANCD2-enriched regions, Table 1. In contrast, only sites
212 with clustered breakpoints reached significance for the HNSCC dataset (Table 1).
213 Generation of a cervical keratinocyte enhancer dataset using Brd4 and H3K27ac ChIP-seq
214 It has been noted previously that HPV integration occurs frequently at transcriptionally active
215 regions 15,23, and we have demonstrated that HPV integration can capture and amplify cellular enhancers
216 to drive viral oncogene expression 25. Brd4 is a marker of cell lineage-specific enhancers 46-48 and HPV E2
217 tethering sites 26,49. Moreover, we, and others, have shown previously that Brd4 and the HPV E2
218 replication protein bind to transcriptionally active chromatin within the host genome 49,50 that overlaps
219 many FANCD2-associated fragile sites 26. Viral replication factories form adjacent to these sites and we
220 have proposed that tethering of the viral genome to these unstable sites would increase the chances of
221 integration at these regions.
222 To further examine the association of cellular enhancers with HPV integration in CESC and
223 HNSCC, we generated an “in-house” enhancer dataset using Brd4 and H3K27ac ChIP-seq in four
224 different subclones of W12 cervical keratinocytes. We defined 6,935 enhancer consensus peaks in the
225 four W12 subclones (Supplementary Table S14). The resulting H3K27ac and Brd4 ChIP-seq signals were
226 compared to enhancers in the NHEK (Normal Human Epidermal Keratinocytes) ENCODE dataset 51,
227 which showed that 83.5% of W12 Brd4/H3K27ac-enriched peaks (p<0.0001) overlapped ENCODE-defined
228 enhancers (Supplementary Figure S4). Approximately 40% of integration breakpoints and loci from the
229 CESC and HNSCC datasets were significantly associated with our cervical keratinocyte enhancer dataset
230 (Table 2).
231 Integration hotspots are associated with gene loci related to cell development and identity
232 Enhancers that were associated with integration loci were analyzed using the Genomic Regions
233 Enrichment of Annotations Tool (GREAT) 52 to identify common functional significance based on
234 proximity testing. This identified 52 putative enhancer target genes for the CESC dataset, representing
235 28 gene loci of which 60.7% overlapped integration hotspots (Figure 6a). Twelve of the gene loci are
236 previously defined sites of recurrent integration and include the KLF5, KLF12, MYC, TP63, RAD51B and
237 HMGA2 genes, and five of the gene loci are novel integration hotspots and include the CAMK1G, FOXQ1,
238 EXOC2, GRHL2, ID1, COX4I2, HM13 and NFIA genes (Figure 6a). For HNSCC, 19 target genes were
239 identified through GREAT gene ontology analysis, representing eight gene loci of which 50% overlapped
240 integration hotspots profiled in CESC, Supplementary Figure S4. Six target genes (KLF5, KLF12, FAM84B,
241 POU5F1B, TUBD1 and VMP1) were common across CESC and HNSCC. Gene ontology analysis of our
242 enhancer regions identified epithelium development, epithelial cell differentiation, and negative
243 regulation of keratinocyte differentiation to be significantly enriched biological processes associated
244 with CESC integration loci (Supplementary Table S15). Ectoderm development and differentiation,
245 epidermal and epithelial differentiation, tongue morphogenesis and negative regulation of spreading
246 epidermal cells during wound healing were biological processes significantly associated with enhancer
247 enrichment at HNSCC integration loci, Supplementary Table S15. This indicates that sites of recurrent
248 HPV integration are often associated with cellular pathways relevant to host cell development and
249 differentiation.
250 Keratinocyte-specific super-enhancers are enriched at integration loci
251 Large enhancer clusters were characteristic of the integration targets identified through GREAT
252 analysis. We therefore defined super-enhancers in our Brd4-defined W12 enhancer dataset based on
253 relative peak height of the H3K27ac and Brd4 ChIP-seq datasets using the Rank Ordering of Super-
254 Enhancers (ROSE) tool 53,54. We defined 338 super-enhancers in W12 cervical keratinocytes
255 (Supplementary Table S16). Intersect analysis showed that 89/584 (15.2%) CESC integration loci
256 overlapped with super-enhancers profiled in W12 cells, and of these loci 72 (80.9%) were classified as
257 sites of recurrent integration (Figure 6b). A total of 25/37 (67.6%) integration hotspots contained super-
258 enhancers and permutation testing showed that both CESC integration loci and sites of recurrent
259 integration were significantly associated with these regulatory domains (p<0.0001). For HNSCC, 8/57
260 (14%) integration loci were associated with W12 super-enhancers, and 62.5% of these loci overlapped
261 integration hotspots profiled in CESC, including the KLF5/KLF12, MYC, ERBB2 and VMP1 gene loci.
262 Permutation testing showed that HNSCC integration loci were significantly associated with super-
263 enhancers profiled in W12 cells (p<0.0001). Thus, keratinocyte-specific super-enhancers are enriched at
264 integration loci in HPV-associated tumors and are frequently found at integration hotspots.
265 The association of CESC integration loci with super-enhancers and FANCD2-enriched fragile sites
266 at integration hotspots was also addressed, Figure 6c. This showed that the frequency of FANCD2
267 enrichment at integration loci was comparable for sites of recurrent (50/313 loci; 16.0%) and non-
268 recurrent integration (48/271 loci; 17.7%) in CESC, whereas the association of super-enhancers was
269 augmented at integration loci that occurred within hotspots (72/313 loci; 23.0%) relative to non-
270 recurrent sites of integration (17/271 loci; 6.3%). Most integration loci that overlapped super-enhancers
271 were active for viral oncogene expression (driver integrations) and were more frequently observed at
272 integration hotspots, Figure 6d. Furthermore, several cancer driver genes, including ASXL1, CACNA1A,
273 IRF6, KANSL1, KLF5, KRT222, MYC, PPM1D, PTCH1 and PTPDC1 55 were located within 1 Mb of super-
274 enhancers that overlapped with integration hotspots, Supplementary Figure S5 and Table S17.
275 Alignments of FANCD2-associated fragile sites, super-enhancers and associated target genes at
276 integration hotspots are shown in Figure 6e and Supplementary Figure S4-S5. Collectively, these data
277 show that transcriptionally active chromatin and/or regions of genetic instability are common features
278 of HPV integration sites. Moreover, integration hotspots are commonly associated with super-
279 enhancers, several of which regulate cancer driver and/or cell-identity genes.
280 Super-enhancers are frequently amplified at integration hotspots in CESC
281 The association of super-enhancers at integration hotspots was compared with the host somatic
282 copy number alteration in CESC samples. Super-enhancers were more frequently observed at those
283 CESC integration loci with associated host DNA amplifications (53/240; 22.1%) relative to those that had
284 either a normal genomic profile (16/140; 11.4%) or deletions within the host DNA flanking sequences
285 (1/17; 5.9%), Figure 6f. Of the amplified CESC integrations that had associated super-enhancers, 43
286 (81.1%) were sites of recurrent integration and represented 43/143 (30.1%) hotspot and 10/97 (10.3%)
287 non-hotspot loci with associated host genome amplifications, Figure 6f. Super-enhancer overlap was
288 also more frequently observed at integration hotspots (11/62; 17.7%) than non-hotspots (5/78; 6.4%)
289 for loci with a normal genomic profile, representing 68.8% of loci that overlapped super-enhancers for
290 this subgroup. However, for integration loci that had associated host deletions, no super-enhancers
291 were observed at sites of recurrent integration (Figure 6f). These data show that amplification of
292 super-enhancers is frequently observed at integration loci in CESC, particularly at sites of recurrent integration.
293 DISCUSSION
294 Many studies have documented the “landscape” of HPV integration sites with respect to
295 traditional common fragile sites, host genome amplification, and transcription and related regulatory
296 elements 9-13,15,23. Here, we combined and curated DNA and RNA sequencing datasets from HPV-positive
297 CESC and HNSCC tumors and compared them with novel “in-house” datasets of common fragile sites
298 defined by FANCD2 ChIP-seq, and enhancers and super-enhancers defined by Brd4 and H3K27ac ChIP-
299 seq in cervical carcinoma derived cells. We show that viral integration sites in CESC are enriched at
300 FANCD2-associated fragile sites in cervical cells. We also show that cervical cell enhancers are over-
301 represented at HPV integration sites and that HPV integration is often associated with super-enhancers,
302 particularly at integration hotspots enriched for cell-identity genes. Furthermore, we show that the
303 flanking host DNA that is enriched for enhancers and super-enhancers is frequently amplified in CESC
304 tumors.
305 HPV genomes replicate as extrachromosomal nuclear minichromosomes at every stage of the
306 infectious cycle. The virus relies on the host replication and transcriptional machinery and it is thought
307 that the HPV genome localizes to regions of the nucleus that facilitate these processes 56. At different
308 stages of infection, the viral DNA associates with nuclear ND10 bodies, and interphase and mitotic host
309 chromatin, and hijacks the DNA damage repair processes to amplify viral DNA 57,58. Concomitantly, the
310 viral E6 and E7 proteins induce cell proliferation and replication stress, abrogate cell cycle checkpoints,
311 and inhibit the innate immune response 59. E6 and E7 proteins also modify the epigenetic landscape of
312 the host genome by changing the levels of different histone-modifying enzymes 60. Together, these
313 activities could promote the accidental integration of viral DNA that is closely associated with host
314 chromatin.
315 HPV genomes replicate using two different modes: in maintenance replication the genomes
316 replicate bidirectionally at low copy number, but this switches to a unidirectional recombination-
317 directed mechanism in the amplification stage 61,62. The formation of tandem repeats at integration sites
318 could be related to these processes and over-replication of viral and host sequences could result from
319 repeated initiation of replication at the viral replication origin, especially if the HPV E1 and E2 proteins
320 are expressed 21,63. In fact, unscheduled firing of replication origins and increased replication fork stalling
321 have been shown to occur in both viral and host sequences at HPV integration sites in the MYC locus 20,
322 which is frequently amplified in HPV-associated cancers. Tandem repeating units of co-amplified viral
323 and cellular DNA could result from this endoreduplication, replication fork arrest, and homologous
324 recombination. Highly rearranged integrations are also consistent with the breakage-fusion-bridge–type
325 model of genome amplification 19. At fragile sites, perturbed replication dynamics could also generate
326 focal amplifications and/or rearrangements of viral-host sequences.
327 In this study, we generated a dataset of aphidicolin-induced common fragile sites in two cervical
328 carcinoma cell lines, C33-A and HeLa, and found a significant association between these sites and
329 integration breakpoints in CESC, particularly at those loci with clustered breakpoints. Common fragile
330 sites are susceptible to somatic copy number alterations 64,65 likely due to replication stress that arises
331 from perturbed replication dynamics in conflict with transcription of long genes 66. Accordingly, our C33-
332 A and HeLa common fragile sites were overrepresented at long genes expressed in these cells. We did
333 not observe an enrichment of HPV integration sites in HNSCC samples with our FANCD2-associated
334 common fragile sites, although an association was previously noted between traditional FRA-regions and
335 integration sites in oropharyngeal squamous cell carcinomas 10. This difference could reflect the larger
336 genomic coverage of FRA-regions used in the previous study, or the limited number of samples in our
337 HNSCC dataset. Moreover, common fragile sites are likely distinct in cervical and oropharyngeal derived
338 keratinocytes, or alternatively these findings could reflect differences in the biology of HPV infection and
339 mechanisms of oncogenic progression in the different tissue types.
340 HPV integration often occurs in transcriptionally active chromatin within the host genome 15,23,67
341 and we previously described an example of enhancer-hijacking and co-amplification of cellular and viral
342 regulatory sequences at an HPV integration site in cervical lesion derived cells 25. The association of viral
343 integration breakpoints with putative enhancer regions in HPV-associated cancers has been reported 15,
344 but the enhancer regions used were based on ENCODE histone modifications and therefore did not
345 reflect the specific enhancer profiles of HPV-positive cervical cells. Here, we defined keratinocyte
346 enhancers in HPV16-positive W12 cervical keratinocytes by H3K27ac and Brd4 enrichment and show
347 that these specific enhancers are significantly overrepresented at HPV integration loci. In some cases,
348 these loci were associated with focal amplification of host DNA, providing evidence for potential
349 enhancer-capture. A recent study showed enrichment of active histone marks at HPV integration loci in
350 cervical tumors, which correlated with upregulation of local gene expression, and increased gene
351 expression levels at loci with increased breakpoints 68. Kamal et al. also found increased local host gene
352 expression at loci with multiple junction copies (analogous to clustered breakpoints) 36. Furthermore,
353 integration loci with associated somatic copy number alterations have also been shown to have
354 increased gene expression 29. These observations could represent enhancer-hijacking. Integration
355 hotspots often contain genes that drive cancer 55 and accidental integration at these regions could
356 result in clonal expansion and selection due to perturbation of these oncogenic pathways. Thus,
357 enhancer-hijacking can drive expression of both viral and cellular genes.
358 Super-enhancers are large clusters of enhancers, rich in Brd4 binding and H3K27ac modification,
359 that often control cell identity genes and are coopted in tumorigenesis 53,54. We defined super-
360 enhancers in our W12 cervical cell line datasets and showed that they were strongly associated with
361 integration hotspots, including the MYC, KLF5/KLF12 and ERBB2 gene loci, which are important
362 regulators of cell-cycle, proliferation and apoptosis 69-71; TP63, which is a master regulator of epidermal
363 keratinocyte proliferation and differentiation 72; and RAD51B, which is a key regulator of homologous
364 recombination repair 73. We propose that, during persistent infection, extrachromosomal HPV genomes
365 specifically localize at key transcriptional regulatory hubs within the host genome, several of which are
366 important for keratinocyte biology.
367 The cellular Brd4 protein is involved in many of the cellular and viral processes described in this
368 study. Brd4 is a chromatin scaffold protein that modulates transcriptional initiation and elongation and
369 is a major component of super-enhancers 74. Brd4 is also important at multiple stages of the HPV
370 infectious cycle 75, binds to common fragile sites in C33-A cells 26, and is important for tethering HPV
371 genomes to mitotic chromatin 76,77. Brd4 is enriched at the HPV16 integration site/super-enhancer in
372 W12 cells and inhibition of Brd4 binding reduces E6/E7 transcription and cell growth 24. Therefore, Brd4
373 is an example of a factor that is crucial for key cell and viral chromatin-related processes, and the
374 juxtaposition of these processes could promote integration of viral DNA and oncogenic progression.
375 In conclusion, many factors contribute to the integration of a viral genome that eventually
376 drives oncogenesis. Cancer genomes often contain multiple HPV integration sites but usually only one is
377 transcriptionally active (this study, 29,35). In CESC-derived cells, expression of the viral E6 and E7
378 oncoproteins is necessary for cell proliferation, survival, and maintenance of the tumor phenotype 78.
379 Therefore, integration is common, but requires the right genomic location for constitutive viral
380 oncogene expression. For example, the viral oncogenes are usually expressed from a viral-host fusion
381 transcript that requires a splice acceptor and polyadenylation signal in the flanking host DNA 79,80.
382 Therefore, HPV integrants require a combination of events and processes that are dependent on the
383 genetic and/or epigenetic landscape of the flanking host chromatin to drive oncogenesis.
384 METHODS
385 HPV integration datasets
386 A systematic literature review identified genomic datasets from HPV-positive CESC and HNSCC that
387 contained information on HPV type and integration breakpoints within the host and viral genomes,
388 which were identified by sequencing (Supplementary Table S1) 5,6,8,9,28-35. The integration breakpoints
389 used in this study were originally identified from both RNA and DNA sequencing methods, including
390 APOT (amplification of papillomavirus oncogene transcripts), DIPS (detection of integrated
391 papillomavirus sequences) and next-generation sequencing technologies. The use of hybrid-capture
392 technologies for detection of viral integration sites has been reported to give high rates of false positives
393 33,34, and so insertion breakpoints identified by this method were only included if they were validated by
394 other means, such as Sanger sequencing. The methodology used to identify each integration breakpoint
395 is referenced in Supplementary Tables S2 and S3. RNA-based sequencing methods give an approximation
396 of the insertion site based on the closest splice acceptor site within the host genome; therefore,
397 integration breakpoints identified by RNA-seq and/or APOT were only used to determine the viral
398 transcription status of an integration site for samples with matched DNA sequencing data. For TCGA
399 CESC and HNSCC samples, unmapped reads were extracted from RNA-seq, whole genome sequencing
400 (WGS) and whole exome sequencing (WXS) BAM files (https://portal.gdc.cancer.gov/legacy-archive;
401 accessed 01/01/2013) and pre-processed with prinseq-lite.pl version 0.20.2 to remove low quality reads
402 81. Pre-processed reads were mapped with Bowtie 2 using the very-sensitive preset option against the
403 Viral Refseq database 82,83. All unmapped reads were subjected to BLASTN with default parameters
404 against the Viral Refseq database 84. All aligned reads were then subjected to BLASTN against the human
405 hg19 reference assembly. Bowtie 2 and BLASTN reports were passed into SummonChimera using a
406 1,000 bp deletion size for integration detection 85. The SummonChimera reports were manually parsed
407 to remove chimeric junctions with read coverage lower than 20, chimeric junctions with no cross-
408 analysis verification, and ambiguously reported integration predictions. Finally, a unique ID was
409 assigned to each uniquely detected chimeric junction. For analysis of the association of integration
410 breakpoints with different genomic features of interest, only integration breakpoints identified by DNA
411 sequencing methods were used. The characteristics of samples included in this study, categorized by
412 histology type, HPV type, tumor location and sequencing methods are summarized in Supplementary
413 Figure S1. CESC and HNSCC integration breakpoints included in this study are listed in Supplementary
414 Tables S2 and S3.
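The manual parsing step described above (removing junctions with fewer than 20 supporting reads, junctions lacking cross-analysis verification, and ambiguous predictions, then assigning unique IDs) could be expressed programmatically along the following lines. This is a minimal Python sketch, not the procedure used in the study: the `Junction` record, its field names, and the ID format are hypothetical and do not reflect the SummonChimera report format.

```python
from dataclasses import dataclass

MIN_READ_COVERAGE = 20  # junctions supported by fewer reads are removed


@dataclass(frozen=True)
class Junction:
    sample: str
    chrom: str
    position: int
    read_coverage: int
    cross_verified: bool  # hypothetical flag: passed cross-analysis verification
    ambiguous: bool       # hypothetical flag: prediction reported ambiguously


def filter_junctions(junctions):
    """Apply the three exclusion rules and label each distinct junction."""
    labelled = {}
    for j in junctions:
        if j.read_coverage < MIN_READ_COVERAGE or not j.cross_verified or j.ambiguous:
            continue
        key = (j.sample, j.chrom, j.position)
        if key not in labelled:  # duplicates share one ID
            labelled[key] = f"JUNC{len(labelled) + 1:05d}"
    return labelled


example = [
    Junction("S1", "chr8", 128_750_000, 35, True, False),
    Junction("S1", "chr8", 128_750_000, 35, True, False),   # duplicate record
    Junction("S2", "chr3", 189_500_000, 12, True, False),   # coverage too low
    Junction("S3", "chr17", 37_850_000, 50, False, False),  # not cross-verified
]
print(filter_junctions(example))
```

The coordinates in `example` are invented for illustration only.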
415 Integration hotspot dataset
416 Integration loci from CESC tumors that were within 5 Mb of each other were collapsed into a single
417 genomic interval to define integration hotspot boundaries. Exceptions to this size cut-off for collapsing
418 adjacent integration loci were permitted to reflect previously defined hotspots from the literature. Five
419 or more integration loci per hotspot (or three or more integration loci for sites that overlapped
420 previously defined hotspots from the literature) were used to define sites of recurrent integration.
421 Integration hotspots defined from our CESC dataset and previously in the literature are listed in
422 Supplementary Tables S4 and S5, respectively.
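The collapsing rule described above can be sketched as follows. This Python example is an illustration under stated assumptions: loci are represented as (chromosome, start, end) tuples, the 5 Mb distance is measured from the end of the growing interval to the start of the next locus, and the literature-based exceptions to the size cut-off are not modelled. The function names are hypothetical.

```python
from collections import defaultdict

MAX_GAP_BP = 5_000_000  # loci closer than this are collapsed together


def collapse_loci(loci, max_gap=MAX_GAP_BP):
    """Collapse nearby loci into intervals: (chrom, start, end, n_loci)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in loci:
        by_chrom[chrom].append((start, end))
    intervals = []
    for chrom in sorted(by_chrom):
        current = None
        for start, end in sorted(by_chrom[chrom]):
            if current is not None and start - current[2] <= max_gap:
                current[2] = max(current[2], end)  # extend the interval
                current[3] += 1
            else:
                if current is not None:
                    intervals.append(tuple(current))
                current = [chrom, start, end, 1]
        intervals.append(tuple(current))
    return intervals


def is_recurrent(n_loci, overlaps_literature_hotspot=False):
    """Five or more loci, or three or more at a previously defined hotspot."""
    return n_loci >= (3 if overlaps_literature_hotspot else 5)


loci = [
    ("chr8", 128_000_000, 128_010_000),
    ("chr8", 129_500_000, 129_500_100),
    ("chr8", 140_000_000, 140_000_100),
    ("chr3", 189_000_000, 189_000_500),
]
print(collapse_loci(loci))
```

In this invented example the first two chr8 loci are 1.49 Mb apart and are collapsed, whereas the third lies more than 5 Mb away and forms its own interval.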
423 Somatic copy number alteration datasets
424 The amplification status of the cellular sequences flanking integration breakpoints had been assessed in
425 a subset of samples by comparative genomic hybridization (CGH) or SNP array datasets. For CGH array
426 data, we defined an integration locus with a two-fold or greater focal amplification or deletion of the host genome
427 as having an associated somatic copy number alteration 8,28. Matched SNP6 copy number
428 segment data for TCGA CESC and HNSCC tumor samples were downloaded from the Broad Institute on
429 04/07/2020, http://firebrowse.org/ (genome_wide_snp_6-segmented_scna_minus_germline_cnv_
430 hg19) 86,87. Positive and negative mean segment values above 0.3 and below -0.3 represented copy
431 number gains and losses, respectively, and mean segment values between 0.3 and -0.3 were considered
432 noise 86,87. Deletions that occurred within 50 Kb of an integration breakpoint were included in this
433 analysis. Deletions that directly overlapped an integration breakpoint were excluded; these likely
434 reflected the alternative chromosome as sequencing of the chimeric viral-host junction was available for
435 the associated HPV insertion sites. Integration loci defined as having associated somatic copy number
436 alterations in CESC and HNSCC are detailed in Supplementary Tables S6 and S7, respectively.
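The segment-mean thresholds and the deletion rules described above can be summarized in a short Python sketch. It is an illustration only: the function names are hypothetical, values exactly equal to 0.3 or -0.3 are treated as noise (an assumption, since the text does not specify the boundary case), and distances are measured from the breakpoint to the nearest edge of the deleted segment.

```python
GAIN_THRESHOLD = 0.3
LOSS_THRESHOLD = -0.3
DELETION_WINDOW_BP = 50_000


def classify_segment(segment_mean):
    """Classify a SNP6 mean segment value as gain, loss or noise."""
    if segment_mean > GAIN_THRESHOLD:
        return "gain"
    if segment_mean < LOSS_THRESHOLD:
        return "loss"
    return "noise"


def deletion_is_associated(breakpoint, seg_start, seg_end,
                           window=DELETION_WINDOW_BP):
    """True if a deleted segment lies within `window` bp of the breakpoint
    without directly overlapping it."""
    if seg_start <= breakpoint <= seg_end:
        return False  # direct overlap: excluded from the analysis
    if seg_start > breakpoint:
        distance = seg_start - breakpoint
    else:
        distance = breakpoint - seg_end
    return distance <= window


print(classify_segment(0.45), classify_segment(-0.6), classify_segment(0.1))
print(deletion_is_associated(1_000_000, 1_020_000, 1_200_000))
```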
437 Cell culture
438 Subclones (20831, 20861, 20862, 20863) derived from HPV16-positive W12 cervical keratinocytes 79,88
439 were maintained in F-medium (3:1 [vol/vol] F-12–Dulbecco’s modified Eagle’s medium, 5% fetal bovine
440 serum, 0.4 μg/ml hydrocortisone, 5 μg/ml insulin, 8.4 ng/ml cholera toxin, 10 ng/ml epidermal growth
441 factor, 24 μg/ml adenine, 100 U/ml penicillin and 100 μg/ml streptomycin). All cells were grown in the
442 presence of irradiated 3T3-J2 feeder cells. C33-A and HeLa cervical carcinoma derived cell lines were
443 maintained in Dulbecco’s modified Eagle’s medium, supplemented with 10% fetal bovine serum, 100
444 U/ml penicillin and 100 μg/ml streptomycin. To induce replication stress, C33-A and HeLa cells were
445 treated for 24 hours with 0.2 µM aphidicolin (Sigma A0781) prior to harvesting for FANCD2 ChIP-seq
446 experiments, described below.
447 ChIP-seq: FANCD2
448 Aphidicolin treated C33-A and HeLa cells were processed for chromatin immunoprecipitation (ChIP) as
449 previously described 24. Briefly, cells were cross-linked with 1% formaldehyde and chromatin was
450 isolated and sheared to 100–500 bp DNA fragments using a Bioruptor sonicator (Diagenode) on high
451 power settings. Chromatin samples (25 μg per ChIP) were incubated overnight at 4°C with an antibody
452 against FANCD2 (Bethyl, A302-174A, 2.5 μg). Rabbit IgG (Jackson ImmunoRes, 011-000-003) was used to
453 determine non-specific binding to control regions (though not sequenced). Chromatin
454 immunocomplexes were precipitated for 1 hour at 4°C with blocked Dynabeads Protein G (Invitrogen),
455 subjected to multiple wash steps and the chromatin eluted in elution buffer (50 mM Tris-HCl [pH 8.0], 10
456 mM EDTA [pH 8.0], 1% SDS). Chromatin was reverse cross-linked overnight at 65°C in 0.2 M NaCl,
457 followed by RNase A and proteinase K treatment, and the DNA purified using the ChIP DNA Clean &
458 Concentrator kit (Zymo Research). ChIP DNA from two biological replicates was pooled and subjected
459 to 2 x 150 bp paired-end read sequencing on the Illumina HiSeq-4000 platform (Genomics Resource
460 Center, Institute for Genome Sciences, University of Maryland) to a sequencing depth of >14 million
461 reads per sample. FANCD2 ChIP-seq datasets are accessible through GEO Series accession number
462 GSE183048.
463 ChIP-seq: Brd4 and H3K27ac
464 W12 chromatin samples were isolated as described above, and have been described previously 25.
465 However, only the sequence data flanking the HPV16 integration sites was previously analyzed and
466 published. Here, we analyze the same dataset but for the entire human genome. Antibodies used were
467 Brd4 (Bethyl Laboratories A301-985A, 3 μg) or H3K27ac (Millipore 07–360, 3 μl). No-antibody controls
468 were included to monitor non-specific binding. Brd4 and H3K27ac ChIP-seq datasets are accessible
469 through GEO Series accession number GSE183048.
470 ChIP-seq processing and peak calling
471 Reads were trimmed with Cutadapt version 1.18 89. All reads aligning to the ENCODE hg19 v1 blacklist
472 regions 90 were identified by alignment with BWA version 0.7.17 91 and removed with Picard
473 SamToFastq, https://broadinstitute.github.io/picard/. Remaining reads were aligned to an hg19
474 reference genome using BWA. Reads with a mapQ score less than 6 were removed with SAMtools
475 version 1.6 92 and PCR duplicates were removed with Picard MarkDuplicates. Peaks were called by
476 comparing each ChIP sample to its matching input sample. For FANCD2, the mean fragment size was
477 estimated by Phantompeakqualtools version 2.0 93. Peaks were called using SICER version 1.1 94 with the
478 following parameters: redundancy threshold of 100, effective genome fraction of 0.75, window size of
479 25,000 bp, and gap size of 50,000 bp. H3K27ac and Brd4 peaks were called using macsBroad (macs
480 version 2.1.1 from 2016/03/09) 95 with the following parameters: --broad-cutoff 0.01 -f "BAMPE". Data
481 were converted into bigwigs for viewing and normalized by reads per genomic content (RPGC) using
482 deepTools version 3.0.1 96 with the following parameters: --binSize 25 --smoothLength 75 --
483 effectiveGenomeSize 2700000000 --centerReads --normalizeUsing RPGC. RPGC-normalized input values
484 were subtracted from RPGC-normalized ChIP values of matching cell type genome-wide using deepTools
485 with --binSize 25.
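To make the normalization step concrete, the following Python sketch shows the arithmetic behind RPGC (1x) scaling and input subtraction as we understand it from the deepTools documentation: coverage is divided by the expected mean coverage, (mapped reads x fragment length) / effective genome size. The function names, read count and fragment length are hypothetical; this is not the deepTools implementation and ignores smoothing and read centering.

```python
EFFECTIVE_GENOME_SIZE = 2_700_000_000  # value given for --effectiveGenomeSize


def rpgc_scale_factor(mapped_reads, fragment_length,
                      effective_genome_size=EFFECTIVE_GENOME_SIZE):
    """Factor that rescales raw coverage to a genome-wide mean of 1x."""
    mean_coverage = mapped_reads * fragment_length / effective_genome_size
    return 1.0 / mean_coverage


def normalize(bin_coverage, scale_factor):
    """Scale raw per-bin coverage values."""
    return [value * scale_factor for value in bin_coverage]


def subtract_input(chip_bins, input_bins):
    """Subtract normalized input from normalized ChIP, bin by bin."""
    return [chip - control for chip, control in zip(chip_bins, input_bins)]


# Hypothetical library: 27 million mapped reads, 200 bp fragments.
factor = rpgc_scale_factor(27_000_000, 200)
chip = normalize([8, 4, 2], factor)
control = normalize([2, 2, 2], factor)
print(factor, subtract_input(chip, control))
```

For simplicity the same scale factor is applied to both tracks here; in practice each library would have its own factor derived from its own read count.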
486 FANCD2-associated fragile site dataset
487 FANCD2 ChIP-seq peaks were filtered by a -log10 q-value of ten or above to remove low-confidence
488 calls. Filtered C33-A ChIP-seq peaks were combined with previously mapped aphidicolin-induced
489 FANCD2 peaks identified by ChIP-chip in these cells 26. C33-A (Supplementary Table S8) and HeLa
490 FANCD2-enriched regions (Supplementary Table S9) were combined and overlapping peaks merged
491 using bedtools MergeBED 97. Association of ChIP peaks between the three FANCD2 datasets was
492 determined by permutation testing using regioneR 98. Combined FANCD2 peaks from C33-A and HeLa
493 are listed in Supplementary Table S10. FANCD2-enriched regions were compared to aphidicolin-induced
494 common fragile sites characterized in lymphoblast cells (FRA regions) and mitotic DNA synthesis regions
495 characterized in HeLa cells 41,42 that are listed in Supplementary Tables S11 and S12, respectively, using
496 regioneR 98. FRA-regions were downloaded from the HGNC (HUGO Gene Nomenclature Committee)
497 database (https://www.genenames.org/download/custom/) on 08/27/2020 using advanced filtering:
498 gd_locus_type = 'fragile site'.
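The peak filtering and merging steps described in this subsection can be illustrated with a small Python sketch. It is a stand-in for the bedtools-based workflow, not a reproduction of it: peaks are assumed to be (chromosome, start, end, -log10 q-value) tuples in half-open coordinates, and the function names and example values are hypothetical.

```python
MIN_NEG_LOG10_Q = 10.0  # peaks scoring below this are treated as low confidence


def filter_peaks(peaks, min_score=MIN_NEG_LOG10_Q):
    """Keep peaks whose -log10 q-value is at least `min_score`."""
    return [(c, s, e) for c, s, e, score in peaks if score >= min_score]


def merge_overlapping(intervals):
    """Merge overlapping or directly adjacent intervals per chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


# Invented peaks for two cell lines.
c33a = [("chr1", 100, 200, 12.0), ("chr1", 500, 600, 3.0)]
hela = [("chr1", 150, 300, 15.0), ("chr2", 10, 20, 10.0)]
combined = filter_peaks(c33a) + filter_peaks(hela)
print(merge_overlapping(combined))
```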
499 Overlap analysis of FANCD2 enriched regions with long genes
500 The Gencode Release 19 human reference genome (GRCh37) was filtered for protein-coding genes
501 greater than or equal to 0.3 Mb in length, including UTRs (Supplementary Table S13), and used to
502 determine the overlap with FANCD2-enriched regions.
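A filter of this kind could be written as in the Python sketch below. It assumes GENCODE-style GTF input in which gene records carry `gene_type` and `gene_name` attributes and coordinates are 1-based and inclusive; the function name and the example records are synthetic and are not taken from the study.

```python
import re

MIN_GENE_LENGTH_BP = 300_000  # 0.3 Mb


def long_protein_coding_genes(gtf_lines, min_length=MIN_GENE_LENGTH_BP):
    """Return (chrom, start, end, name) for long protein-coding genes."""
    genes = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        # Parse key "value"; pairs from the attribute column.
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("gene_type") != "protein_coding":
            continue
        start, end = int(fields[3]), int(fields[4])
        if end - start + 1 >= min_length:
            genes.append((fields[0], start, end, attrs.get("gene_name", "")))
    return genes


gtf = [
    "##description: synthetic example",
    'chr3\tHAVANA\tgene\t1000\t400999\t.\t-\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "GENE_A";',
    'chr3\tHAVANA\tgene\t500000\t520000\t.\t+\t.\tgene_id "G2"; gene_type "protein_coding"; gene_name "GENE_B";',
    'chr5\tHAVANA\tgene\t1\t900000\t.\t+\t.\tgene_id "G3"; gene_type "lincRNA"; gene_name "GENE_C";',
    'chr3\tHAVANA\texon\t1000\t400999\t.\t-\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "GENE_A";',
]
print(long_protein_coding_genes(gtf))
```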
503 Enhancer dataset
504 Consensus peak sets for H3K27ac were defined as overlapping regions found in at least four out of eight
505 W12 samples using DiffBind 99,100. Enhancers were defined as genomic intervals that overlapped
506 between H3K27ac peaks and Brd4 peaks and are listed in Supplementary Table S14. Proximal and distal
507 enhancer-like cis-Regulatory Elements by ENCODE for NHEK and HeLa cells were downloaded from
508 https://screen.encodeproject.org/# (accessed September 2020) 51. ENCODE GRCh38 enhancer files were
509 converted to hg19 using the UCSC Genome Browser LiftOver tool, http://genome.ucsc.edu/cgi-
510 bin/hgLiftOver. W12 enhancers were compared to ENCODE NHEK enhancers using regioneR 98.
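The two definitions above (a consensus peak set requiring support from at least four of eight samples, and enhancers as regions shared by H3K27ac and Brd4 peaks) can be illustrated with the simplified Python sketch below. It is not DiffBind: overlapping peaks are merged and the number of contributing samples is counted, and enhancers are taken as the intersection of the two peak sets, which is an assumption about how "overlapped" was applied. All names and coordinates are hypothetical.

```python
def consensus_peaks(samples, min_samples):
    """Merge overlapping peaks across samples and keep merged regions
    supported by at least `min_samples` different samples."""
    tagged = sorted(
        (chrom, start, end, idx)
        for idx, peaks in enumerate(samples)
        for chrom, start, end in peaks
    )
    merged = []  # each entry: [chrom, start, end, set of sample indices]
    for chrom, start, end, idx in tagged:
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
            merged[-1][3].add(idx)
        else:
            merged.append([chrom, start, end, {idx}])
    return [(c, s, e) for c, s, e, ids in merged if len(ids) >= min_samples]


def intersect(a, b):
    """Return the overlapping portions of two interval lists."""
    shared = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a == chrom_b:
                start, end = max(start_a, start_b), min(end_a, end_b)
                if start < end:
                    shared.append((chrom_a, start, end))
    return sorted(shared)


h3k27ac_samples = [
    [("chr1", 100, 200)], [("chr1", 120, 220)], [("chr1", 150, 250)],
    [("chr1", 90, 180)], [("chr1", 1000, 1100)], [("chr1", 1010, 1090)],
    [("chr1", 1005, 1100)], [],
]
brd4_peaks = [("chr1", 150, 400)]
h3k27ac_consensus = consensus_peaks(h3k27ac_samples, min_samples=4)
print(intersect(h3k27ac_consensus, brd4_peaks))
```

In this invented example only the first region is supported by four samples, so the second region is dropped before the intersection with the Brd4 peak.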
511 Overlap analysis of integration breakpoints with fragile sites and enhancers
512 The intersect between the genomic coordinates of HPV integration breakpoints (-/+ 50 Kb flank regions,
513 7,9) with enhancers (Supplementary Table S14) or FANCD2-enriched regions (Supplementary Table S10)
514 was analyzed using the Overlapping Pieces of Intervals function in the Galaxy genomics platform
515 (https://usegalaxy.org/). Samples with multiple reported breakpoints within the same chromosome
516 were classified as a single integration locus if the 5’ and 3’ most breakpoints were within 3 Mb of each
517 other. This 3 Mb cut-off was based on manual analysis of the distance between clustered breakpoints
518 identified by WGS and/or hybrid-capture technologies for the CESC and HNSCC datasets. Each
519 integration locus was assigned a unique integration ID so that the number of breakpoints per integration
520 could be categorized as a cluster. Each integration breakpoint was analyzed independently for its
521 association with the genomic feature of interest, as well as with adjacent HPV breakpoints, which were
522 classified as belonging to the same integration locus/cluster. For significance testing, the data were
523 permuted 10,000 times to create an expected distribution of the overlap between integration
524 breakpoints and loci with FANCD2-enriched regions or enhancers using regioneR 98.
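The logic of the overlap and permutation analysis can be illustrated with the simplified Python sketch below. It is not regioneR: breakpoints are re-placed uniformly at random across the supplied chromosomes (an assumption, since the randomization strategy is not stated here), the empirical P value is computed as (permutations with overlap at least as large as observed + 1) / (permutations + 1), and all names, genome sizes and coordinates are hypothetical.

```python
import random

FLANK_BP = 50_000  # -/+ 50 Kb flank around each breakpoint


def count_overlaps(breakpoints, features, flank=FLANK_BP):
    """Count breakpoints whose flanked window overlaps any feature."""
    hits = 0
    for chrom, pos in breakpoints:
        low, high = pos - flank, pos + flank
        if any(start <= high and end >= low
               for start, end in features.get(chrom, [])):
            hits += 1
    return hits


def permutation_test(breakpoints, features, chrom_sizes,
                     n_perm=10_000, seed=0):
    """Compare observed overlap with randomly re-placed breakpoints."""
    rng = random.Random(seed)
    observed = count_overlaps(breakpoints, features)
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] for c in chroms]
    at_least_observed = 0
    for _ in range(n_perm):
        picks = rng.choices(chroms, weights=weights, k=len(breakpoints))
        shuffled = [(c, rng.randrange(chrom_sizes[c])) for c in picks]
        if count_overlaps(shuffled, features) >= observed:
            at_least_observed += 1
    p_value = (at_least_observed + 1) / (n_perm + 1)
    return observed, p_value


# Toy genome with one feature and four breakpoints, three of them nearby.
chrom_sizes = {"chr1": 10_000_000}
features = {"chr1": [(1_000_000, 1_100_000)]}
breakpoints = [("chr1", 1_050_000), ("chr1", 1_120_000),
               ("chr1", 1_090_000), ("chr1", 5_000_000)]
print(permutation_test(breakpoints, features, chrom_sizes, n_perm=1_000))
```

A realistic analysis would also restrict random placement to mappable regions and preserve the clustering of breakpoints within integration loci, neither of which is attempted here.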
525 Super-enhancer dataset
526 Super-enhancers were defined in the Brd4 and H3K27ac W12 ChIP-seq datasets using the Rank Ordering
527 of Super-Enhancers (ROSE) tool, using default parameters 53,54. Enhancers defined in W12 cells
528 (Supplementary Table S14) were used as the input list of enhancers. Super-enhancers were defined by
529 Brd4 and/or H3K27ac consensus peaks that mapped in at least six out of twelve W12 samples for the
530 20831, 20862 and 20863 subclones and are listed in Supplementary Table S16. Super-enhancers
531 mapped in the 20861 subclone were excluded as they were masked by the amplified viral-host derived
532 super-enhancer-like element at the HPV integration site in these cells 25.
533 Gene ontology analysis of W12 enhancers
534 W12 enhancers that overlapped with CESC integration breakpoints (-/+ 50 Kb flanks) were analyzed
535 using the Genomic Regions Enrichment of Annotations Tool (GREAT) using default parameters, accessed
536 July 2021, http://great.stanford.edu/public/html/. All enhancers profiled in W12 cells (Supplementary
537 Table S14) were used as the input list of background regions.
538 DATA AVAILABILITY
539 The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus 101 and
540 are accessible through GEO Series accession number GSE183048
541 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE183048).
542 ACKNOWLEDGEMENTS
543 We thank Justin Lack (NIH/NCBR/NCI/NIAID/FNL) for advice on statistical analyses and Susan Huse
544 (NIH/NCBR/NCI/NIAID/FNL) for preliminary analysis of enhancer-capture at integration sites. The results
545 published here are in part based upon data generated by the TCGA Research Network,
546 https://www.cancer.gov/tcga. This work was funded by the Intramural Research Program of NIAID, NIH
547 grant number ZIA AI001223 LVD, and NIH grant number AI153156 (JMP). The funders had no role in
548 study design, data collection and analysis, decision to publish, or preparation of the manuscript.
549 AUTHOR CONTRIBUTIONS
550 AAM supervised the project. AAM and AW conceived and designed the study. AW compiled datasets
551 and performed ChIP-seq experiments. TEM processed ChIP-seq data. TEM designed and advised on
552 statistical analyses. AW and TEM performed statistical analysis. AAM, AW and TEM analyzed the data. JP
553 and JK identified integration breakpoints from CESC and HNSCC TCGA datasets. AAM and AW wrote the
554 manuscript. All authors discussed, critically revised, and approved the final version of the article for
555 publication.
556 COMPETING INTERESTS
557 The authors declare no conflict of interest.
559 REFERENCES
560 1 Viens, L. J. et al. Human Papillomavirus-Associated Cancers - United States, 2008-2012. MMWR
561 Morb Mortal Wkly Rep 65, 661-666 (2016).
562 2 Alazawi, W. et al. Genomic imbalances in 70 snap-frozen cervical squamous intraepithelial
563 lesions: associations with lesion grade, state of the HPV16 E2 gene and clinical outcome. Br J
564 Cancer 91, 2063-2070, doi:10.1038/sj.bjc.6602237 (2004).
565 3 Peter, M. et al. Frequent genomic structural alterations at HPV insertion sites in cervical
566 carcinoma. J Pathol 221, 320-330, doi:10.1002/path.2713 (2010).
567 4 Thomas, L. K. et al. Chromosomal gains and losses in human papillomavirus-associated neoplasia
568 of the lower genital tract - a systematic review and meta-analysis. Eur J Cancer 50, 85-98,
569 doi:10.1016/j.ejca.2013.08.022 (2014).
570 5 Parfenov, M. et al. Characterization of HPV and host genome interactions in primary head and
571 neck cancers. Proc Natl Acad Sci U S A 111, 15544-15549 (2014).
572 6 Ojesina, A. I. et al. Landscape of genomic alterations in cervical carcinomas. Nature 506, 371-
573 375, doi:10.1038/nature12881 (2014).
574 7 Akagi, K. et al. Genome-wide analysis of HPV integration in human cancers reveals recurrent,
575 focal genomic instability. Genome Res 24, 185-199, doi:10.1101/gr.164806.113 (2014).
576 8 Holmes, A. et al. Mechanistic signatures of HPV insertions in cervical carcinomas. NPJ Genom
577 Med 1, 16004, doi:10.1038/npjgenmed.2016.4 (2016).
578 9 Bodelon, C. et al. Chromosomal copy number alterations and HPV integration in cervical
579 precancer and invasive cancer. Carcinogenesis 37, 188-196, doi:10.1093/carcin/bgv171 (2016).
580 10 Gao, G. et al. Common fragile sites (CFS) and extremely large CFS genes are targets for human
581 papillomavirus integrations and chromosome rearrangements in oropharyngeal squamous cell
582 carcinoma. Genes Chromosomes Cancer 56, 59-74, doi:10.1002/gcc.22415 (2017).
583 11 Thorland, E. C., Myers, S. L., Gostout, B. S. & Smith, D. I. Common fragile sites are preferential
584 targets for HPV16 integrations in cervical tumors. Oncogene 22, 1225-1237,
585 doi:10.1038/sj.onc.1206170 (2003).
586 12 Thorland, E. C. et al. Human papillomavirus type 16 integrations in cervical tumors frequently
587 occur in common fragile sites. Cancer Res 60, 5916-5921 (2000).
588 13 Smith, P. P., Friedman, C. L., Bryant, E. M. & McDougall, J. K. Viral integration and fragile sites in
589 human papillomavirus-immortalized human keratinocyte cell lines. Genes Chromosomes Cancer
590 5, 150-157, doi:10.1002/gcc.2870050209 (1992).
591 14 Le Tallec, B. et al. Updating the mechanisms of common fragile site instability: how to reconcile
592 the different views? Cellular and Molecular Life Sciences 71, 4489-4494 (2014).
593 15 Bodelon, C., Untereiner, M. E., Machiela, M. J., Vinokurova, S. & Wentzensen, N. Genomic
594 characterization of viral integration sites in HPV-related cancers. Int J Cancer 139, 2001-2011,
595 doi:10.1002/ijc.30243 (2016).
596 16 Madireddy, A. et al. FANCD2 Facilitates Replication through Common Fragile Sites. Mol Cell 64,
597 388-404, doi:10.1016/j.molcel.2016.09.017 (2016).
598 17 Pentzold, C. et al. FANCD2 binding identifies conserved fragile sites at large transcribed genes in
599 avian cells. Nucleic Acids Res 46, 1280-1294, doi:10.1093/nar/gkx1260 (2018).
600 18 Chan, K. L., Palmai-Pallag, T., Ying, S. & Hickson, I. D. Replication stress induces sister-chromatid
601 bridging at fragile site loci in mitosis. Nat Cell Biol 11, 753-760, doi:10.1038/ncb1882 (2009).
602 19 Herrick, J. et al. Genomic organization of amplified MYC genes suggests distinct mechanisms of
603 amplification in tumorigenesis. Cancer research 65, 1174-1179 (2005).
20 Conti, C., Herrick, J. & Bensimon, A. Unscheduled DNA replication origin activation at inserted HPV 18 sequences in a HPV-18/MYC amplicon. Genes Chromosomes Cancer 46, 724-734 (2007).
21 Kadaja, M., Isok-Paas, H., Laos, T., Ustav, E. & Ustav, M. Mechanism of genomic instability in cells infected with the high-risk human papillomaviruses. PLoS Pathog 5, e1000397 (2009).
22 Coquelle, A., Pipiras, E., Toledo, F., Buttin, G. & Debatisse, M. Expression of Fragile Sites Triggers Intrachromosomal Mammalian Gene Amplification and Sets Boundaries to Early Amplicons. Cell 89, 215-225, doi:10.1016/s0092-8674(00)80201-9 (1997).
23 Christiansen, I. K., Sandve, G. K., Schmitz, M., Durst, M. & Hovig, E. Transcriptionally active regions are the preferred targets for chromosomal HPV integration in cervical carcinogenesis. PLoS One 10, e0119566, doi:10.1371/journal.pone.0119566 (2015).
24 Dooley, K. E., Warburton, A. & McBride, A. A. Tandemly Integrated HPV16 Can Form a Brd4-Dependent Super-Enhancer-Like Element That Drives Transcription of Viral Oncogenes. MBio 7, doi:10.1128/mBio.01446-16 (2016).
25 Warburton, A. et al. HPV integration hijacks and multimerizes a cellular enhancer to generate a viral-cellular super-enhancer that drives high viral oncogene expression. PLoS Genet 14, e1007179, doi:10.1371/journal.pgen.1007179 (2018).
26 Jang, M. K., Shen, K. & McBride, A. A. Papillomavirus genomes associate with BRD4 to replicate at fragile sites in the host genome. PLoS Pathog 10, e1004117, doi:10.1371/journal.ppat.1004117 (2014).
27 Stanley, M. A., Browne, H. M., Appleby, M. & Minson, A. C. Properties of a non-tumorigenic human cervical keratinocyte cell line. Int J Cancer 43, 672-676 (1989).
28 Hu, Z. et al. Genome-wide profiling of HPV integration in cervical cancer identifies clustered genomic hot spots and a potential microhomology-mediated integration mechanism. Nat Genet 47, 158-163, doi:10.1038/ng.3178 (2015).
29 Cancer Genome Atlas Research Network et al. Integrated genomic and molecular characterization of cervical cancer. Nature 543, 378-384, doi:10.1038/nature21386 (2017).
30 Koneva, L. A. et al. HPV Integration in HNSCC Correlates with Survival Outcomes, Immune Response Signatures, and Candidate Drivers. Mol Cancer Res 16, 90-102, doi:10.1158/1541-7786.MCR-17-0153 (2018).
31 Olthof, N. C. et al. Comprehensive analysis of HPV16 integration in OSCC reveals no significant impact of physical status on viral oncogene and virally disrupted human gene expression. PLoS One 9, e88718, doi:10.1371/journal.pone.0088718 (2014).
32 Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576-582, doi:10.1038/nature14129 (2015).
33 Liu, Y., Lu, Z., Xu, R. & Ke, Y. Comprehensive mapping of the human papillomavirus (HPV) DNA integration sites in cervical carcinomas by HPV capture technology. Oncotarget 7, 5852-5864, doi:10.18632/oncotarget.6809 (2016).
34 Liu, Y. et al. Genome-wide profiling of the human papillomavirus DNA integration in cervical intraepithelial neoplasia and normal cervical epithelium by HPV capture technology. Sci Rep 6, 35427, doi:10.1038/srep35427 (2016).
35 Xu, B. et al. Multiplex Identification of Human Papillomavirus 16 DNA Integration Sites in Cervical Carcinomas. PLoS One 8, e66693, doi:10.1371/journal.pone.0066693 (2013).
36 Kamal, M. et al. Human papilloma virus (HPV) integration signature in Cervical Cancer: identification of MACROD2 gene as HPV hot spot integration site. Br J Cancer, doi:10.1038/s41416-020-01153-4 (2020).
37 Li, W. et al. Characteristic of HPV Integration in the Genome and Transcriptome of Cervical Cancer Tissues. Biomed Res Int 2018, 6242173, doi:10.1155/2018/6242173 (2018).
38 Kraus, I. et al. The majority of viral-cellular fusion transcripts in cervical carcinomas cotranscribe cellular sequences of known or predicted genes. Cancer Res 68, 2514-2522, doi:10.1158/0008-5472.CAN-07-2776 (2008).
39 Schmitz, M., Driesch, C., Jansen, L., Runnebaum, I. B. & Dürst, M. Non-Random Integration of the HPV Genome in Cervical Cancer. PLoS One 7, e39632, doi:10.1371/journal.pone.0039632 (2012).
40 Zhang, R. et al. Dysregulation of host cellular genes targeted by human papillomavirus (HPV) integration contributes to HPV-related cervical carcinogenesis. Int J Cancer 138, 1163-1174, doi:10.1002/ijc.29872 (2016).
41 Ji, F. et al. Genome-wide high-resolution mapping of mitotic DNA synthesis sites and common fragile sites by direct sequencing. Cell Res, doi:10.1038/s41422-020-0357-y (2020).
42 Macheret, M. et al. High-resolution mapping of mitotic DNA synthesis regions and common fragile sites in the human genome through direct sequencing. Cell Res, doi:10.1038/s41422-020-0358-x (2020).
43 Helmrich, A., Ballarino, M. & Tora, L. Collisions between replication and transcription complexes cause common fragile site instability at the longest human genes. Mol Cell 44, 966-977, doi:10.1016/j.molcel.2011.10.013 (2011).
44 Okamoto, Y. et al. Replication stress induces accumulation of FANCD2 at central region of large fragile genes. Nucleic Acids Res 46, 2932-2944, doi:10.1093/nar/gky058 (2018).
45 Brison, O. et al. Transcription-mediated organization of the replication initiation program across large genes sets common fragile sites genome-wide. Nat Commun 10, 5693, doi:10.1038/s41467-019-13674-5 (2019).
46 Lee, J.-E. et al. Brd4 binds to active enhancers to control cell identity gene induction in adipogenesis and myogenesis. Nat Commun 8, doi:10.1038/s41467-017-02403-5 (2017).
47 Brown, J. D. et al. BET bromodomain proteins regulate enhancer function during adipogenesis. Proc Natl Acad Sci U S A 115, 2144-2149, doi:10.1073/pnas.1711155115 (2018).
48 Najafova, Z. et al. BRD4 localization to lineage-specific enhancers is associated with a distinct transcription factor repertoire. Nucleic Acids Res 45, 127-141, doi:10.1093/nar/gkw826 (2017).
49 Jang, M. K., Kwon, D. & McBride, A. A. Papillomavirus E2 proteins and the host BRD4 protein associate with transcriptionally active cellular chromatin. J Virol 83, 2592-2600, doi:10.1128/JVI.02275-08 (2009).
50 Helfer, C. M., Yan, J. & You, J. The cellular bromodomain protein Brd4 has multiple functions in E2-mediated papillomavirus transcription activation. Viruses 6, 3228-3249, doi:10.3390/v6083228 (2014).
51 ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699-710, doi:10.1038/s41586-020-2493-4 (2020).
52 McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 28, 495-501, doi:10.1038/nbt.1630 (2010).
53 Whyte, W. A. et al. Master Transcription Factors and Mediator Establish Super-Enhancers at Key Cell Identity Genes. Cell 153, 307-319, doi:10.1016/j.cell.2013.03.035 (2013).
54 Lovén, J. et al. Selective inhibition of tumor oncogenes by disruption of super-enhancers. Cell 153, 320-334 (2013).
55 Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173, 371-385.e318, doi:10.1016/j.cell.2018.02.060 (2018).
56 Warburton, A., Della Fera, A. N. & McBride, A. A. Dangerous liaisons: long-term replication with an extrachromosomal HPV genome. Viruses submitted (2021).
57 Moody, C. A. Impact of Replication Stress in Human Papillomavirus Pathogenesis. J Virol 93, doi:10.1128/JVI.01012-17 (2019).
58 Anacker, D. C. & Moody, C. A. Modulation of the DNA damage response during the life cycle of human papillomaviruses. Virus Res 231, 41-49, doi:10.1016/j.virusres.2016.11.006 (2017).
59 Schiffman, M. et al. Carcinogenic human papillomavirus infection. Nat Rev Dis Primers 2, 16086, doi:10.1038/nrdp.2016.86 (2016).
60 Mac, M. & Moody, C. A. Epigenetic Regulation of the Human Papillomavirus Life Cycle. Pathogens 9, doi:10.3390/pathogens9060483 (2020).
61 Flores, E. R. & Lambert, P. F. Evidence for a switch in the mode of human papillomavirus type 16 DNA replication during the viral life cycle. J Virol 71, 7167-7179 (1997).
62 Sakakibara, N., Chen, D. & McBride, A. A. Papillomaviruses use recombination-dependent replication to vegetatively amplify their genomes in differentiated cells. PLoS Pathog 9, e1003321, doi:10.1371/journal.ppat.1003321 (2013).
63 Kadaja, M. et al. Genomic instability of the host cell induced by the human papillomavirus replication machinery. EMBO J 26, 2180-2191, doi:10.1038/sj.emboj.7601665 (2007).
64 Myllykangas, S. et al. DNA copy number amplification profiling of human neoplasms. Oncogene 25, 7324-7332, doi:10.1038/sj.onc.1209717 (2006).
65 Hellman, A. et al. A role for common fragile site induction in amplification of human oncogenes. Cancer Cell 1, 89-97, doi:10.1016/s1535-6108(02)00017-x (2002).
66 Wilson, T. E. et al. Large transcription units unify copy number variants and common fragile sites arising under replication stress. Genome Res 25, 189-200, doi:10.1101/gr.177121.114 (2015).
67 Kelley, D. Z. et al. Integrated Analysis of Whole-Genome ChIP-Seq and RNA-Seq Data of Primary Head and Neck Tumor Samples Associates HPV Integration Sites with Open Chromatin Marks. Cancer Res 77, 6538-6550, doi:10.1158/0008-5472.CAN-17-0833 (2017).
68 Gagliardi, A. et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet 52, 800-810, doi:10.1038/s41588-020-0673-7 (2020).
69 Schmidt, E. V. The role of c-myc in cellular growth control. Oncogene 18, 2988-2996, doi:10.1038/sj.onc.1202751 (1999).
70 Holbro, T. The ErbB receptors and their role in cancer progression. Exp Cell Res 284, 99-110, doi:10.1016/s0014-4827(02)00099-x (2003).
71 Dong, J. T. & Chen, C. Essential role of KLF5 transcription factor in cell proliferation and differentiation and its implications for human diseases. Cell Mol Life Sci 66, 2691-2706, doi:10.1007/s00018-009-0045-z (2009).
72 Soares, E. & Zhou, H. Master regulatory role of p63 in epidermal development and disease. Cell Mol Life Sci 75, 1179-1190, doi:10.1007/s00018-017-2701-z (2018).
73 Takata, M. et al. The Rad51 paralog Rad51B promotes homologous recombinational repair. Mol Cell Biol 20, 6476-6482, doi:10.1128/mcb.20.17.6476-6482.2000 (2000).
74 Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell 155, 934-947 (2013).
75 McBride, A. A., Warburton, A. & Khurana, S. Multiple Roles of Brd4 in the Infectious Cycle of Human Papillomaviruses. Front Mol Biosci 8, doi:10.3389/fmolb.2021.725794 (2021).
76 You, J., Croyle, J. L., Nishimura, A., Ozato, K. & Howley, P. M. Interaction of the bovine papillomavirus E2 protein with Brd4 tethers the viral DNA to host mitotic chromosomes. Cell 117, 349-360 (2004).
77 Baxter, M. K., McPhillips, M. G., Ozato, K. & McBride, A. A. The mitotic chromosome binding activity of the papillomavirus E2 protein correlates with interaction with the cellular chromosomal protein, Brd4. J Virol 79, 4806-4818, doi:10.1128/JVI.79.8.4806-4818.2005 (2005).
78 Goodwin, E. C. & DiMaio, D. Repression of human papillomavirus oncogenes in HeLa cervical carcinoma cells causes the orderly reactivation of dormant tumor suppressor pathways. Proc Natl Acad Sci U S A 97, 12513-12518 (2000).
79 Jeon, S. & Lambert, P. F. Integration of human papillomavirus type 16 DNA into the human genome leads to increased stability of E6 and E7 mRNAs: implications for cervical carcinogenesis. Proc Natl Acad Sci U S A 92, 1654-1658, doi:10.1073/pnas.92.5.1654 (1995).
80 Wentzensen, N. et al. Characterization of viral-cellular fusion transcripts in a large series of HPV16 and 18 positive anogenital lesions. Oncogene 21, 419-426, doi:10.1038/sj.onc.1205104 (2002).
81 Schmieder, R. & Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863-864, doi:10.1093/bioinformatics/btr026 (2011).
82 Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-359, doi:10.1038/nmeth.1923 (2012).
83 Langmead, B., Wilks, C., Antonescu, V. & Charles, R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics 35, 421-432, doi:10.1093/bioinformatics/bty648 (2019).
84 McGinnis, S. & Madden, T. L. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32, W20-W25, doi:10.1093/nar/gkh435 (2004).
85 Katz, J. P. & Pipas, J. M. SummonChimera infers integrated viral genomes with nucleotide precision from NGS data. BMC Bioinformatics 15, 348, doi:10.1186/s12859-014-0348-4 (2014).
86 Broad Institute of MIT and Harvard. Broad Institute TCGA Genome Data Analysis Center (2016): SNP6 Copy number analysis (GISTIC2); Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma. doi:10.7908/C16D5SCD (2016).
87 Broad Institute of MIT and Harvard. Broad Institute TCGA Genome Data Analysis Center (2016): SNP6 Copy number analysis (GISTIC2); Head and Neck Squamous Cell Carcinoma. doi:10.7908/C1V987FP (2016).
88 Jeon, S., Allen-Hoffmann, B. L. & Lambert, P. F. Integration of human papillomavirus type 16 into the human genome correlates with a selective growth advantage of cells. J Virol 69, 2989-2997, doi:10.1128/JVI.69.5.2989-2997.1995 (1995).
89 Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, doi:10.14806/ej.17.1.200 (2011).
90 ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74, doi:10.1038/nature11247 (2012).
91 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009).
92 Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079, doi:10.1093/bioinformatics/btp352 (2009).
93 Kharchenko, P. V., Tolstorukov, M. Y. & Park, P. J. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 26, 1351-1359, doi:10.1038/nbt.1508 (2008).
94 Xu, S., Grullon, S., Ge, K. & Peng, W. Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells. Methods Mol Biol 1150, 97-111, doi:10.1007/978-1-4939-0512-6_5 (2014).
95 Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 9, R137, doi:10.1186/gb-2008-9-9-r137 (2008).
96 Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res 44, W160-165, doi:10.1093/nar/gkw257 (2016).
97 Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842, doi:10.1093/bioinformatics/btq033 (2010).
98 Gel, B. et al. regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics 32, 289-291, doi:10.1093/bioinformatics/btv562 (2016).
99 Stark, R. & Brown, G. D. DiffBind: differential binding analysis of ChIP-Seq peak data. (2011).
100 Ross-Innes, C. S. et al. Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature 481, 389-393, doi:10.1038/nature10730 (2012).
101 Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30, 207-210, doi:10.1093/nar/30.1.207 (2002).
807 FIGURE LEGENDS
808 Figure 1. Overview of datasets
809 Schematic representation of datasets used for overlap analysis of CESC and HNSCC integration sites with
810 enhancers mapped in W12 cervical keratinocytes and FANCD2-associated common fragile sites mapped
811 in C33-A and HeLa cervical carcinoma cell lines. APOT, Amplification of papillomavirus oncogene
812 transcripts; WGS, Whole genome sequencing; WXS, Whole exome sequencing.
813 Figure 2. HPV integration loci frequently contain clustered insertional breakpoints
814 (a), Schematic representation of HPV integration breakpoints and loci. Green lines represent integration
815 breakpoints. Integration loci are defined as either a single breakpoint, or multiple, closely spaced
816 breakpoints (a cluster). Within a sample, clustered breakpoints on the same chromosome are classified
817 as a single integration locus if the 5’-most and 3’-most breakpoints are within 3 Mb of each other. (b),
818 Schematic representation of clustered breakpoints at CESC integration loci across the human genome.
819 Lines connecting to each chromosome represent different integration loci. Blue circles represent the
820 indicated number of breakpoints per integration locus; orange boxed regions represent integration
821 hotspots. See Supplementary Figure S3 for the distribution of clustered breakpoints at integration loci in
822 HNSCC tumors. (c), Scatter plot showing the frequency of single and clustered breakpoints per
823 integration locus for CESC tumors grouped according to whether they overlap integration hotspots. The
824 p-value is based on a non-parametric, unpaired t-test (two-tailed; **P<0.01).
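The locus definition given in panel (a) can be illustrated with a short sketch. This is not the authors' pipeline: it assumes breakpoints arrive as (sample, chromosome, position) tuples and applies a greedy left-to-right grouping, closing a locus as soon as the next breakpoint lies more than 3 Mb from the 5'-most breakpoint of the current locus. The function name, input layout and example coordinates are hypothetical.

```python
from collections import defaultdict

MAX_SPAN_BP = 3_000_000  # 3 Mb limit between the 5'-most and 3'-most breakpoints


def group_breakpoints(breakpoints, max_span=MAX_SPAN_BP):
    """Group (sample, chrom, pos) breakpoints into integration loci.

    Returns a list of (sample, chrom, [positions]) tuples, one per locus.
    The greedy grouping rule is an assumption made for this sketch.
    """
    by_key = defaultdict(list)
    for sample, chrom, pos in breakpoints:
        by_key[(sample, chrom)].append(pos)

    loci = []
    for (sample, chrom), positions in sorted(by_key.items()):
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[0] <= max_span:
                # Still within 3 Mb of the 5'-most breakpoint: same locus
                current.append(pos)
            else:
                # Too far away: close the locus and start a new one
                loci.append((sample, chrom, current))
                current = [pos]
        loci.append((sample, chrom, current))
    return loci


# Made-up coordinates, for illustration only
example = [
    ("tumor_A", "chr8", 128_200_000),
    ("tumor_A", "chr8", 128_750_000),
    ("tumor_A", "chr8", 140_000_000),
    ("tumor_B", "chr8", 128_300_000),
]
loci = group_breakpoints(example)
```

Under these assumptions the two nearby breakpoints in tumor_A form one clustered locus, while the distant tumor_A breakpoint and the tumor_B breakpoint each form a single-breakpoint locus.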
825 Figure 3. Transcription status of integrated viral genomes
827 (a-b), Bubble graph showing the percentage of transcriptionally active integration loci per tumor in
828 CESC (a) and HNSCC (b) samples relative to the number of integration loci per tumor; 100% indicates
829 that all integration loci are active in that tumor. Orange, blue and grey circles represent tumors with a
830 single, multiple or no transcriptionally active loci, respectively. Circle size indicates the number of
831 samples per grouping (for CESC, largest, n=86 and smallest, n=1; for HNSCC, largest, n=21 and smallest,
832 n=1). Three HNSCC samples (TCGA-CR-6482, TCGA-CN-5374 and TCGA-CR-7404) were reported as
833 integration negative from RNA-seq 30 but had a single or multiple integration loci detected through WGS
834 5 and were therefore classified as transcriptionally inactive. (c), Bar chart showing the number of CESC
835 integration loci that are transcriptionally active or inactive for viral oncogene expression at integration
836 hotspots. Association between viral oncogene transcription and integration hotspots was based on a
837 Fisher’s exact test (two-tailed; **P<0.01).
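For readers unfamiliar with the test named in panel (c), the following is a minimal pure-Python sketch of a two-tailed Fisher's exact test on a 2x2 table. It is an illustration under stated assumptions, not the software used in the study: the function name is hypothetical, the two-tailed p-value is taken as the sum of all table probabilities no larger than that of the observed table, and the counts in the example are made up and are not the data plotted in the figure.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small tolerance guards against floating-point ties)
    p = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(p, 1.0)


# Hypothetical counts: rows = hotspot / non-hotspot loci,
# columns = transcriptionally active / inactive
p_value = fisher_exact_two_tailed(8, 2, 1, 5)
```

A table with fixed margins is fully determined by its top-left cell, which is why the sketch only iterates over that one value.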
838 Figure 4. Clustered integration breakpoints are associated with amplified regions of the host genome
839 in CESC and HNSCC
840 For the subset of CESC and HNSCC samples that had matched somatic copy number alteration data, HPV
841 integration breakpoints were grouped according to the associated host DNA copy number status.
842 Normal, AMP (amplification) and DEL (deletion) refer to the genomic profile of the host DNA at the
843 integration locus. (a-b), Scatter plots showing the number of breakpoints per locus grouped according to
844 the somatic copy number alteration status of the integration locus for CESC (a) and HNSCC (b) tumors.
845 For CESC, the number of integration loci per grouping was Normal, n=140; AMP, n=240 and DEL, n=17.
846 For HNSCC, the number of integration loci per grouping was Normal, n=7; AMP, n=28 and DEL, n=1. P-
847 values are based on non-parametric, unpaired t-tests (two-tailed; *p<0.05, ****p<0.0001, NS=Non-
848 significant). All statistical tests were performed relative to integration loci with a normal genomic
849 profile. DNA amplification associated with integration loci ranged from 693 bp to 54.2 Mb (average, 1.6
850 Mb; median 47.4 Kb) in CESC and 6.5 Kb to 102.7 Mb (average, 3.3 Mb; median 43.1 Kb) in HNSCC. (c-d),
851 Schematic representation of clustered breakpoints at integration loci that have associated host somatic
852 copy number alterations. Lines connecting to each chromosome represent different integration loci for
853 the CESC (c) and HNSCC (d) datasets. The number of circles represents the number of breakpoints per
854 integration locus. Blue, green and red colored circles respectively represent integration sites that have a
855 normal genomic profile or associated amplifications or deletions. Orange boxed regions represent
856 integration hotspots. (e), Stacked bar chart showing the number of CESC integration loci that overlap
857 integration hotspots grouped according to whether they have associated somatic copy number
858 alterations. Association between somatic copy number alterations and integration hotspots was based
859 on a chi-square test (*p<0.05).
860 Figure 5. Cervical keratinocyte-specific fragile sites mapped by FANCD2 ChIP-seq
861 HPV-negative (C33-A) and HPV18-positive (HeLa) cervical carcinoma cells were treated for 24 hours with
862 0.2 µM aphidicolin and FANCD2-enriched regions were identified by ChIP-seq. All analyses were
863 performed using the combined C33-A and HeLa FANCD2 dataset. (a), Bar graph showing the genomic
864 coverage of FANCD2-enriched regions relative to common fragile sites (FRA regions) and mitotic DNA
865 synthesis (MDS) regions reported in the literature 41,42. (b), Venn diagram showing the regions of overlap
866 between fragile sites identified in the FANCD2, FRA and MDS datasets. (c), Venn diagram showing the
867 overlap of FANCD2-associated fragile sites with protein-coding genes longer than or equal to 0.3 Mb. Red
868 circle represents all long genes; blue circle represents long genes that are expressed in C33-A and/or
869 HeLa cells. (d), Alignment of FANCD2-enriched regions in C33-A (red) and HeLa (yellow) cells with
870 associated genes (blue bars represent genes >0.3 Mb; genes expressed in C33-A and/or HeLa cells are
871 indicated by gene name) and FRA and MDS regions (black bars). Red and yellow bars below the FANCD2
872 ChIP-seq signal tracks represent peaks mapped by SICER analysis in the corresponding cell lines.
873 Figure 6. Integration hotspots are associated with cellular super-enhancers
874 H3K27ac- and Brd4-enriched regions were profiled in HPV16-positive, cervical-derived W12 keratinocyte
875 subclones by ChIP-seq. Enhancer regions were defined as peaks that overlapped in both H3K27ac and
876 Brd4 datasets, and that were identified across the four W12 subclones. (a), GREAT (Genomic Regions
877 Enrichment of Annotations Tool) gene ontology analysis was performed using W12 enhancers that
878 overlapped CESC integration breakpoints (-/+ 50 Kb flanks) as input, and compared against all W12
879 enhancers, to identify putative target genes associated with these cis-regulatory regions based on
880 enhancer frequency. Bars represent putative target genes plotted against their FDR (false discovery rate)
881 adjusted p-values (q-value). Blue and grey bars represent genes that overlap integration hotspots and
882 sites of non-recurrent integration, respectively. Enriched target genes within the same genomic locus
883 were grouped (e.g. KLF5 and KLF12) and plotted using the most significant q-value. (b), Venn diagram
884 showing the regions of overlap between integration loci, integration hotspots and super-enhancers
885 mapped in W12 subclones. (c), Bar chart showing the number of CESC integration loci that were
886 grouped according to whether or not they are integration hotspots and plotted based on their overlap
887 with super-enhancers, FANCD2-associated fragile sites or both genomic features. Numbers above the
888 graph indicate the total number of integration loci within each grouping. (d), Bar chart showing the
889 number of CESC integration loci that are associated with super-enhancers (SE) that were grouped
890 according to whether they are integration hotspots and plotted based on their viral transcription status.
891 Numbers above the graph indicate the total number of integration loci that have associated viral
892 transcription data within each grouping. (e), Alignment of Brd4 (blue) and H3K27ac (red) ChIP-seq
893 signals mapped in W12 cervical keratinocytes at integration hotspots (top black bars; size indicated in
894 Mb) in cervical carcinomas. Grey bars represent amplified (AMP) host DNA in different CESC tumors
895 from The Cancer Genome Atlas. Green, yellow, and black bars below the ChIP-seq signal tracks
896 represent super-enhancers (SE) mapped in W12 subclones, FANCD2-associated fragile sites mapped in
897 C33-A and HeLa cells and CESC integration loci, respectively. Genes identified from GREAT gene ontology
898 analysis 52 and cancer driver genes 55 are indicated by blue bars. Each integration hotspot is
899 characterized in Supplementary Figure S5 and Table S4. (f), Bar chart showing the number of CESC
900 integration loci that are associated with super-enhancers (SE) that were grouped according to whether
901 they are integration hotspots and plotted based on their host somatic copy number alteration status.
902 The number of integration loci per grouping was normal, n=140; amplification (AMP), n=240 and
903 deletion (DEL), n=17.
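The interval logic behind the enhancer definition and the -/+ 50 Kb flank can be sketched as follows. This is a simplified illustration, not the authors' workflow: it assumes peaks on a single chromosome given as half-open (start, end) tuples, takes the shared segment of overlapping H3K27ac and Brd4 peaks as the consensus region, and does not model the requirement that peaks be identified across the four W12 subclones. All function names and coordinates are hypothetical.

```python
FLANK_BP = 50_000  # -/+ 50 Kb flank around each breakpoint


def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def consensus_peaks(h3k27ac, brd4):
    """Return the shared segments of peaks present in both mark lists."""
    shared = []
    for p in h3k27ac:
        for q in brd4:
            if overlaps(p, q):
                shared.append((max(p[0], q[0]), min(p[1], q[1])))
    return sorted(shared)


def enhancers_near_breakpoints(enhancers, breakpoints, flank=FLANK_BP):
    """Return enhancers that overlap the flanked window of any breakpoint."""
    hits = []
    for enh in enhancers:
        for bp in breakpoints:
            window = (max(0, bp - flank), bp + flank)
            if overlaps(enh, window):
                hits.append(enh)
                break  # count each enhancer once
    return hits


# Made-up coordinates on one chromosome, for illustration only
h3k27ac_peaks = [(1_000, 2_000), (500_000, 502_000)]
brd4_peaks = [(1_500, 2_500), (900_000, 901_000)]
enhancers = consensus_peaks(h3k27ac_peaks, brd4_peaks)
near = enhancers_near_breakpoints(enhancers, [40_000])
```

The nested loops are quadratic in the number of peaks; for genome-scale peak sets a sorted sweep or a dedicated interval tool would be the more practical choice.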
904 Figure S1. Characteristics of CESC and HNSCC HPV integration datasets
905 (a), The distribution of cervical carcinomas (CESC, n=333) based on tissue histology. Squamous cell
906 carcinomas, adenocarcinomas and adenosquamous carcinomas accounted for 77.2%, 8.1% and 1.8% of
907 cervical samples, respectively. CESC that were not specified by histology type accounted for 12.9% of the
908 samples. (b), The frequency of HPV types across the different histological subtypes. One CESC sample
909 (TCGA-EA-A43B) had two integration sites of different viral types and was therefore excluded from these
910 counts. HPV16 (67.5%) and HPV18 (17.2%) were the most frequent HPV types detected in CESC. HPV16
911 was the predominant viral type in squamous cell carcinomas (HPV16, 68.0%; HPV18, 13.3%; other,
912 18.8%) and adenocarcinomas (HPV16, 59.3%; HPV18, 37.0%; other, 3.7%), whereas HPV18 was the
913 predominant viral type in adenosquamous carcinomas (HPV18, 83.3%; HPV16, 16.7%; other, none).
914 Head and neck squamous cell carcinomas (n=41) included in this study were predominantly from male
915 subjects (males, 82.9%; females, 17.1%; Supplementary Table S2). (c), Tumors of the oropharynx,
916 including the base of tongue and tonsils, accounted for 70.7% of samples. The remainder of tumors
917 were from the oral cavity (24.4%) and larynx (4.9%). (d), The percentage distribution of HPV type by
918 HNSCC tumor location. HPV16 was the predominant viral type in HNSCC (90.2%) and accounted for
919 96.6%, 70.0% and 100% of tumors of the oropharynx, oral cavity and larynx, respectively. The remainder of
920 HNSCC samples were positive for HPV33 (7.3%) and HPV35 (2.4%) that were isolated from the oral cavity
921 and oropharynx, respectively. (e-h), The distribution of samples and integration breakpoints included in
922 this study that were grouped by sequencing method. (e-f), For the CESC dataset, 50.8% of samples had
923 associated transcription-based data (e) and all integration breakpoints were detected through next-
924 generation sequencing technologies (hybrid-capture/WGS/WXS with probes added to capture the
925 integrated HPV DNA sequence) (f). (g), The HNSCC dataset was limited in size as >50% of samples were
926 identified from RNA sequencing alone and did not have matched DNA sequences. (h), Approximately
927 54% of HNSCC samples were processed from WGS and 46% from DIPS, and all had matched RNA data
928 (RNA-seq and APOT, respectively). *Two samples were also validated by RNA-seq analysis in addition to
929 APOT. APOT, Amplification of papillomavirus oncogene transcripts; DIPS, Detection of integrated
930 papillomavirus sequences; WGS, Whole genome sequencing; WXS, Whole exome sequencing.
931 Figure S2. Integration hotspots defined in cervical tumors
932 Schematic representation of integration hotspots across the human genome defined previously in the
933 literature (red bars) and in our cervical carcinoma, CESC, dataset (blue bars).
934 Figure S3. Distribution of clustered breakpoints at HNSCC integration loci
935 Schematic representation of clustered breakpoints at HNSCC integration loci across the human genome.
936 Lines connecting to each chromosome represent different integration loci. Blue circles represent the
937 indicated number of breakpoints per integration locus.
938 Figure S4. Keratinocyte-specific enhancers mapped by Brd4 and H3K27ac ChIP-seq
939 (a), Venn diagram showing the regions of overlap between Brd4 and H3K27ac ChIP-seq signals profiled
940 in W12 cervical keratinocytes with enhancers defined in NHEK (Normal Human Epidermal Keratinocytes)
941 cells from ENCODE 51. Permutation testing was used to determine the significance of the overlap between
942 W12 Brd4/H3K27ac consensus peaks and ENCODE NHEK enhancers (p<0.0001). (b), Alignment of Brd4
943 (blue) and H3K27ac (red) ChIP-seq signals mapped in W12 cervical keratinocytes with enhancers defined
944 in the HeLa and NHEK ENCODE datasets. Black bars below the ChIP-seq signal tracks represent W12
945 enhancers that were defined by consensus peaks for Brd4 and H3K27ac enrichment. (c), GREAT
946 (Genomic Regions Enrichment of Annotations Tool) gene ontology analysis was performed using W12
947 enhancers that overlapped HNSCC integration breakpoints (-/+ 50 Kb flanks) as input, and compared
948 against all W12 enhancers, to identify putative target genes associated with these cis-regulatory regions.
949 Bars represent putative target genes plotted against their FDR (false discovery rate) adjusted p-values
950 (q-value). Blue and grey bars represent genes that overlap integration hotspots and sites of non-
951 recurrent integration, respectively. Enriched target genes within the same genomic locus were grouped
952 (e.g. KLF5 and KLF12) and plotted using the most significant q-value. (d), Alignment of Brd4 (blue) and
953 H3K27ac (red) ChIP-seq signals mapped in W12 cervical keratinocytes at integration hotspots (black
954 bars; size indicated in Mb) in cervical carcinomas. Grey bars represent amplified (AMP) host DNA in
955 different HNSCC tumors from The Cancer Genome Atlas. Green, yellow, and black bars below the ChIP-
956 seq signal tracks represent super-enhancers (SE) mapped in W12 subclones, FANCD2-associated fragile
957 sites mapped in C33-A and HeLa cells and CESC/HNSCC integration loci, respectively. Genes identified
958 from GREAT gene ontology analysis 52 and cancer driver genes 55 are indicated by blue bars.
959 Figure S5. Genomic landscape of integration hotspots
960 Alignment of Brd4 (blue) and H3K27ac (red) ChIP-seq signals mapped in W12 cervical keratinocytes at
961 integration hotspots (top black bars; size indicated in Mb) in cervical carcinomas. Numbers above each
962 integration hotspot indicate the hotspot ID referenced in Supplementary Table S4. Grey bars represent
963 amplified (AMP) host DNA in different CESC tumors from The Cancer Genome Atlas. Green, yellow, and
964 black bars below the ChIP-seq signal tracks represent super-enhancers (SE) mapped in W12 subclones,
965 FANCD2-associated fragile sites mapped in C33-A and HeLa cells and CESC/HNSCC integration loci,
966 respectively. Long protein-coding genes (>0.3 Mb in length), genes identified from GREAT gene ontology
967 analysis 52 and cancer driver genes 55 are indicated by blue bars. Each integration hotspot is further
968 characterized in Supplementary Table S4.
969 SUPPLEMENTARY TABLES
970 Table S1. CESC and HNSCC datasets
971 Cases represent the number (n) of patient samples with integrated HPV genomes detected by DNA
972 and/or RNA sequencing methods. For the hybrid-capture method, only integration breakpoints that were
973 validated by Sanger sequencing were included in this study. Somatic copy number alteration (SCNA)
974 datasets for CESC and HNSCC TCGA samples were downloaded from the Broad Institute 86,87.
975 Table S2. CESC HPV integration breakpoints (n=1,299)
976 CESC HPV integration sites (Patient ID, chromosome, integration breakpoint, HPV type, HPV breakpoints,
977 tumor type, detection method used for calling integration breakpoints, target gene, (viral oncogene)
978 transcription status, SCNA status, author of study, PubMed ID). Only integration breakpoints detected
979 through DNA sequencing were used for overlap analyses.
980 Table S3. HNSCC HPV integration breakpoints (n=119) included in this study
981 HNSCC HPV integration sites (Patient ID, chromosome, integration breakpoint, HPV type, HPV
982 breakpoints, tumor location, detection method used for calling integration breakpoints, target gene,
983 (viral oncogene) transcription status, SCNA status, gender, smoking status, author of study, PubMed ID).
984 Only integration breakpoints detected through DNA sequencing were used for overlap analyses.
985 Table S4. Integration hotspots defined in CESC
986 Table S5. Integration hotspots defined in the literature
987 Overlapping intervals for hotspots identified from the literature were merged into a single interval using
988 the MergeBED tool (Merged hotspot position) for comparison to hotspots defined in our CESC dataset
989 (Supplementary Figure S2).
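The interval merging described for Table S5 can be illustrated with a minimal sketch. This is not the MergeBED tool itself: the function name `merge_intervals`, the `max_gap` parameter and all coordinates are assumptions made for illustration, and intervals are treated as half-open, BED-style (chromosome, start, end) tuples.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open like BED


def merge_intervals(intervals: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge intervals on the same chromosome that overlap or lie within max_gap bp."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2] + max_gap:
            # Extend the previous interval instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical literature hotspots; the coordinates are made up.
literature_hotspots = [
    ("chr8", 128_000_000, 129_000_000),
    ("chr8", 128_500_000, 130_000_000),
    ("chr13", 73_000_000, 74_000_000),
]
merged_hotspots = merge_intervals(literature_hotspots)
```

With `max_gap=0`, only overlapping or directly adjacent intervals are merged, so each merged hotspot position is the union of the literature intervals that touch one another.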
990 Table S6. CESC integrations with associated SCNA included in this study
991 Samples that had associated host genome amplification data were classified based on the somatic copy
992 number alteration (SCNA) status at the site of integration. Normal, normal genomic profile; AMP,
993 amplification; DEL, deletion; DEL (excluded; overlapping deletion), deletions that overlap integration
994 breakpoints were excluded because they likely reflect the alternative chromosome, given that sequencing
995 of the chimeric viral-host junction was available for the associated HPV insertion sites.
996 Table S7. HNSCC integrations with associated SCNA included in this study
997 Samples that had associated host genome amplification data were classified based on the somatic copy
998 number alteration (SCNA) status at the site of integration. Normal, normal genomic profile; AMP,
999 amplification; DEL, deletion; DEL (excluded; overlapping deletion), deletions that overlap integration
1000 breakpoints were excluded because they likely reflect the alternative chromosome, given that sequencing
1001 of the chimeric viral-host junction was available for the associated HPV insertion sites.
1002 Table S8. FANCD2-associated fragile sites mapped in C33-A cervical carcinoma cells
1003 C33-A FANCD2 peaks identified by ChIP-seq, filtered by -log10 q-value >10, combined with FANCD2
1004 ChIP-chip peaks previously reported in C33-A cells 26.
1005 Table S9. FANCD2-associated fragile sites mapped in HeLa cervical carcinoma cells
1006 HeLa FANCD2 peaks identified by ChIP-seq, filtered by -log10 q-value >10.
1007 Table S10. FANCD2-associated fragile sites mapped in C33-A and HeLa cervical carcinoma cells
1008 C33-A FANCD2 peaks identified by ChIP-chip and ChIP-seq (Supplementary Table S8) were combined
1009 with HeLa ChIP-seq peaks (Supplementary Table S9). Overlapping FANCD2 peaks from the C33-A and
1010 HeLa datasets were merged (merging based on nearby peaks was set to 0 bp).
1011 Table S11. Aphidicolin-induced common fragile sites from the HGNC Database
1012 https://www.genenames.org/download/custom/. Bold font indicates FRA-regions that overlap with
1013 FANCD2-enriched regions listed in Supplementary Table S10. A total of 43/77 (55.8%) FRA-regions
1014 overlap with FANCD2-enriched regions profiled in C33-A and HeLa cells.
1015 Table S12. Mitotic DNA synthesis (MDS) regions that overlap with FANCD2 peaks
1016 MDS regions profiled in HeLa cells 41,42 were compiled and overlapping regions merged (merging based
1017 on nearby peaks was set to 0 bp). A total of 112/232 (48.3%) MDS regions overlapped with FANCD2-
1018 enriched regions profiled in C33-A and HeLa cells.
1019 Table S13. FANCD2-enriched regions that overlap long genes
1020 The GENCODE Release 19 annotation of the human reference genome (GRCh37) was filtered to identify protein-coding
1021 genes longer than or equal to 0.3 Mb, including UTRs. A total of 185/782 (23.7%) long genes overlapped
1022 with FANCD2-enriched regions profiled in C33-A and HeLa cells and are indicated in bold font. Genes
1023 that are transcriptionally active in C33-A and/or HeLa cells were determined from RNA-seq (26 and
1024 Expression Atlas, accessed September 2020, https://www.ebi.ac.uk/gxa/home) and are indicated in blue
1025 font. 121/185 (65.4%) long genes that overlapped with FANCD2-enriched regions are expressed in C33-A
1026 and/or HeLa cells.
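The long-gene filter and the overlap fractions in the Table S13 legend can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the `Gene` record, the gene names and all coordinates are hypothetical, and gene length is taken as end minus start of the annotated gene, UTRs included.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


Region = Tuple[str, int, int]  # (chromosome, start, end)
LONG_GENE_MIN_BP = 300_000     # 0.3 Mb threshold from the legend


def overlaps_any(gene: Gene, regions: List[Region]) -> bool:
    """True if the gene shares at least one base with any region."""
    return any(
        chrom == gene.chrom and start < gene.end and gene.start < end
        for chrom, start, end in regions
    )


# Hypothetical protein-coding genes and FANCD2-enriched regions.
genes = [
    Gene("GENE_A", "chr1", 1_000_000, 1_450_000),
    Gene("GENE_B", "chr1", 2_000_000, 2_100_000),
    Gene("GENE_C", "chr3", 5_000_000, 5_300_000),
]
fancd2_regions = [("chr1", 1_200_000, 1_250_000), ("chr3", 9_000_000, 9_100_000)]

long_genes = [g for g in genes if g.end - g.start >= LONG_GENE_MIN_BP]
long_with_fancd2 = [g for g in long_genes if overlaps_any(g, fancd2_regions)]
percent = 100 * len(long_with_fancd2) / len(long_genes)

# The fractions quoted in the legend follow from the same arithmetic.
legend_overlap_pct = round(100 * 185 / 782, 1)    # long genes overlapping FANCD2 regions
legend_expressed_pct = round(100 * 121 / 185, 1)  # of those, expressed in C33-A and/or HeLa
```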
1027 Table S14. Brd4/H3K27ac consensus enhancer peaks profiled in W12 subclones
1028 Brd4 and H3K27ac peaks were identified through ChIP-seq in HPV16-positive W12 cervical keratinocyte
1029 subclones (20831, 20861, 20862 and 20863). Consensus peaks for H3K27ac were identified across the
1030 four W12 subclones. Enhancers were defined as the overlapping genomic intervals between H3K27ac
1031 consensus peaks and Brd4 peaks.
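The enhancer definition above, the overlapping genomic intervals between H3K27ac consensus peaks and Brd4 peaks, amounts to an interval intersection. The sketch below assumes half-open (chromosome, start, end) tuples; the function name `intersect_intervals` and the peak coordinates are hypothetical, and how consensus peaks are called across subclones is not modeled.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the sub-intervals shared by any interval in a and any interval in b."""
    shared: List[Interval] = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # keep only non-empty overlaps
                shared.append((chrom_a, start, end))
    return sorted(shared)


# Hypothetical peak sets.
h3k27ac_consensus = [("chr1", 100, 500), ("chr1", 900, 1200), ("chr2", 50, 80)]
brd4_peaks = [("chr1", 300, 1000), ("chr2", 200, 300)]

enhancers = intersect_intervals(h3k27ac_consensus, brd4_peaks)
```

The pairwise loop is quadratic and is meant only to make the definition explicit; a sorted sweep or an interval tree would be the usual choice for genome-scale peak sets.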
1032 Table S15. Top 20 biological processes associated with enhancers that overlap integration breakpoints
1033 in CESC and HNSCC
1034 Table S16. Super-enhancers defined in W12 cervical keratinocytes
1035 Table S17. Cancer driver genes within 1 Mb of super-enhancers that overlap integration loci
1036 TABLES
1037 Table 1. Overlap of integration breakpoints with FANCD2-associated fragile sites
Dataset  Integration subgroups                      Breakpoints or loci  Overlap with      %     p-value
                                                    per subgroup (n)     FANCD2 sites (n)
CESC     All breakpoints                            1,299                229               17.6  0.0001
         Single breakpoints                         258                  39                15.1  0.0044
         Clustered breakpoints                      1,041                190               18.3  0.0001
         Condensed clustered breakpoints            326                  59                18.1  0.0001
         Combined single and condensed breakpoints  584                  98                16.8  0.0001
HNSCC*   All breakpoints                            118                  21                17.8  0.0039
         Single breakpoints                         27                   2                 7.4   0.5011
         Clustered breakpoints                      91                   19                20.9  0.0009
         Condensed clustered breakpoints            30                   5                 16.7  0.2293
         Combined single and condensed breakpoints  57                   7                 12.3  0.3700
1039 CESC and HNSCC integration breakpoints (-/+ 50 Kb flank regions) were grouped by the number of
1040 breakpoints per integration locus; single indicates one breakpoint and clustered indicates two or more
1041 breakpoints. Integration breakpoints from each subgroup were intersected with FANCD2-enriched
1042 regions and the frequency of overlap calculated. For the All breakpoints, Single breakpoints and
1043 Clustered breakpoints subgroups, each breakpoint was tested independently for its overlap with
1044 FANCD2-enriched regions, regardless of whether it was part of a cluster. For the Condensed clustered
1045 breakpoints subgroup, the region spanning the most 5' and most 3' breakpoints of an integration locus
1046 was used to test for overlap with FANCD2-enriched regions. For the Combined single and condensed
1047 breakpoints subgroup, the Single breakpoints and Condensed clustered breakpoints subgroups were
1048 combined for overlap analysis. The data were permuted 10,000 times to create an expected distribution
1049 of overlap. *A single integration breakpoint on chromosome Y was excluded from this analysis. Bold font
1050 indicates significant p-values.
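The overlap test described in this legend can be sketched as a simple permutation procedure. The sketch rests on assumptions that the legend does not state: breakpoints are re-placed uniformly at random on their own chromosome, the p-value is estimated as (k + 1) / (N + 1) where k is the number of permutations with at least the observed overlap, and the names `count_overlaps` and `permutation_p_value`, the toy chromosome `chrA` and all coordinates are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end)
Breakpoint = Tuple[str, int]   # (chromosome, position)
FLANK_BP = 50_000              # -/+ 50 Kb flank from the legend


def count_overlaps(breakpoints: List[Breakpoint], regions: List[Region],
                   flank: int = FLANK_BP) -> int:
    """Count breakpoints whose flanked window touches at least one region."""
    hits = 0
    for chrom, pos in breakpoints:
        lo, hi = pos - flank, pos + flank
        if any(c == chrom and s <= hi and lo <= e for c, s, e in regions):
            hits += 1
    return hits


def permutation_p_value(breakpoints: List[Breakpoint], regions: List[Region],
                        chrom_sizes: Dict[str, int], n_perm: int = 10_000,
                        seed: int = 0) -> Tuple[int, float]:
    """Return the observed overlap count and an empirical one-sided p-value."""
    rng = random.Random(seed)
    observed = count_overlaps(breakpoints, regions)
    at_least = 0
    for _ in range(n_perm):
        # Assumption: each breakpoint is moved to a random position on its chromosome.
        shuffled = [(chrom, rng.randrange(chrom_sizes[chrom])) for chrom, _ in breakpoints]
        if count_overlaps(shuffled, regions) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Toy data: five breakpoints placed inside one hypothetical FANCD2-enriched region.
chrom_sizes = {"chrA": 10_000_000}
regions = [("chrA", 1_000_000, 1_100_000)]
breakpoints = [("chrA", 1_010_000 + i * 10_000) for i in range(5)]

observed, p_value = permutation_p_value(breakpoints, regions, chrom_sizes, n_perm=1_000)
```

Under this convention the smallest attainable p-value with 10,000 permutations is approximately 0.0001, which is also the smallest value reported in the table.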
1051
1052 Table 2. Overlap of integration breakpoints with keratinocyte-specific enhancers
Dataset  Integration subgroups                      Breakpoints or loci  Overlap with      %     p-value
                                                    per subgroup (n)     Enhancers (n)
CESC     All breakpoints                            1,299                495               38.1  0.0001
         Single breakpoints                         258                  67                26.0  0.0001
         Clustered breakpoints                      1,041                428               41.1  0.0001
         Condensed clustered breakpoints            326                  162               49.7  0.0001
         Combined single and condensed breakpoints  584                  229               39.2  0.0001
HNSCC*   All breakpoints                            118                  43                36.4  0.0001
         Single breakpoints                         27                   9                 33.3  0.0052
         Clustered breakpoints                      91                   34                37.4  0.0001
         Condensed clustered breakpoints            30                   13                43.3  0.0011
         Combined single and condensed breakpoints  57                   22                38.6  0.0003
1054 CESC and HNSCC integration breakpoints (-/+ 50 Kb flank regions) were grouped by the number of
1055 breakpoints per integration locus; single indicates one breakpoint and clustered indicates two or more
1056 breakpoints. Integration breakpoints from each subgroup were intersected with cellular enhancers and
1057 the frequency of overlap calculated. For the All breakpoints, Single breakpoints and Clustered
1058 breakpoints subgroups, each breakpoint was tested independently for its overlap with enhancer
1059 regions, regardless of whether it was part of a cluster. For the Condensed clustered breakpoints
1060 subgroup, the region spanning the most 5' and most 3' breakpoints of an integration locus was used to
1061 test for overlap with enhancer regions. For the Combined single and condensed breakpoints subgroup,
1062 the Single breakpoints and Condensed clustered breakpoints subgroups were combined for overlap
1063 analysis. The data were permuted 10,000 times to create an expected distribution of overlap.
1064 *A single integration breakpoint on chromosome Y was excluded from this analysis. Bold font
1065 indicates significant p-values.
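The subgroup definitions used in Tables 1 and 2 can be made concrete with a short sketch that groups breakpoints into integration loci and condenses each cluster to the span between its most 5' and most 3' breakpoints. Several points are assumptions: the 3 Mb threshold shown in Figure 2 is applied to the distance between neighboring breakpoints, all breakpoints are taken to come from one tumor, and the function names and coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Breakpoint = Tuple[str, int]         # (chromosome, position)
Locus = Tuple[str, List[int]]        # (chromosome, sorted breakpoint positions)
MAX_GAP_BP = 3_000_000               # <3 Mb: same integration locus


def group_into_loci(breakpoints: List[Breakpoint], max_gap: int = MAX_GAP_BP) -> List[Locus]:
    """Group breakpoints on one chromosome into loci when neighbors are < max_gap apart."""
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)
    loci: List[Locus] = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] < max_gap:
                current.append(pos)
            else:
                loci.append((chrom, current))
                current = [pos]
        loci.append((chrom, current))
    return loci


def condense(loci: List[Locus]) -> List[Tuple[str, int, int, int]]:
    """Reduce each locus to (chromosome, most 5' position, most 3' position, n breakpoints)."""
    return [(chrom, positions[0], positions[-1], len(positions)) for chrom, positions in loci]


# Hypothetical breakpoints from a single tumor.
breakpoints = [
    ("chr8", 128_200_000), ("chr8", 128_900_000),
    ("chr8", 135_000_000), ("chr3", 189_000_000),
]
condensed = condense(group_into_loci(breakpoints))
single_loci = [locus for locus in condensed if locus[3] == 1]
clustered_loci = [locus for locus in condensed if locus[3] >= 2]
```

The row counts in the tables are consistent with this reading: for CESC, 258 single breakpoints plus 326 condensed clusters give the 584 combined loci, and 258 single plus 1,041 clustered breakpoints give the 1,299 total.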
1066
1067
Figure 1
[Image not reproduced. Workflow schematic. Recoverable labels: HPV+ cervical carcinomas (CESC)* and HPV+ head and neck squamous cell carcinomas (HNSCC)*; WGS, WXS, RNA-seq, APOT; HPV integration sites, host DNA copy number, viral transcription; HPV+ cervical carcinoma cell lines, Brd4 and H3K27ac ChIP-seq, enhancers; aphidicolin-treated cervical carcinoma cell lines (C33-A, HeLa), FANCD2 ChIP-seq, common fragile sites; integration breakpoints; overlap analysis. *Data sources: Bodelon, Holmes, Hu, Koneva, Lui, Ojesina, Olthof, Parfenov, The Cancer Genome Atlas, Xu.]

Figure 2
[Image not reproduced. (a) Integration locus with a single breakpoint versus clustered breakpoints; <3 Mb: same integration locus; >3 Mb: separate integration locus. (b) CESC dataset, chromosome ideogram (1-22, X); breakpoints (n): 1, 2, 3-9, >9. (c) Breakpoints per integration locus (n) by CESC hotspot status (No, n=271; Yes, n=313).]

Figure 3
[Image not reproduced. (a) CESC and (b) HNSCC transcriptionally active loci per tumor (%) against integration loci per tumor (n); key: number of transcriptionally active loci per tumor (Single, Multiple, None). (c) CESC integration loci (n), Active versus Inactive, by integration hotspot (No, Yes); **P=0.004.]

Figure 4
[Image not reproduced. (a) CESC breakpoints (n) and (b) HNSCC breakpoints (n) by DNA copy number (Normal, AMP, DEL). (c) CESC dataset and (d) HNSCC dataset chromosome ideograms; breakpoints keyed as Normal, AMP, DEL. (e) CESC integration loci (n) by DNA copy number (Normal, AMP, DEL) and hotspot status (No, Yes); *P=0.014; NS.]

Figure 5
[Image not reproduced. (a-c) Venn diagrams of FANCD2 peaks, MDS regions and FRA regions (CFS), and of long genes, FANCD2 peaks and genes expressed in C33-A/HeLa; genomic coverage (%) by fragile site dataset (FANCD2, FRA, MDS). (d) Tracks for regions of chr1, chr3, chr7 and chr16 showing C33-A and HeLa ChIP-seq reads, MDS, FRA (FRA1B, FRA1L, FRA3B, FRA7J, FRA16D) and long genes (NFIA, FAF1, DAB1, PATJ, ROR1, NEGR1, ERC2, CACNA2D3, FHIT, PTPRG, MAGI1, AUTS2, GALNT17, WWOX, CDH13).]

Figure 6
[Image not reproduced. (a) Venn diagram of super-enhancers, integration loci and hotspots. (b) GREAT gene ontology hits by FDR q-value (-log10); key: Hotspot, Non-hotspot, * Novel hotspot. (c, d, f) Integration loci (n) by integration hotspot (No, Yes), keyed by overlapping genomic feature (Super-enhancer, FANCD2 peak, Both), viral transcription status (Active, Inactive) and host DNA copy number (Normal, AMP, DEL). (e) Tracks for regions of chr3, chr8, chr13 and chr14 showing host AMP, Brd4 and H3K27ac ChIP-seq reads, W12 SE, FANCD2 peaks, integrants, GREAT genes and cancer driver genes.]

Figure S1
[Image not reproduced. (a) CESC samples (n=333) by histology (Adenocarcinoma, Adenosquamous, Squamous cell, Carcinoma (Not specified)). (b) CESC samples by HPV type (%) (HPV16, HPV18, Other) and histology. (c) HNSCC samples (n=41) by tumor location (Larynx, Oral cavity, Oropharynx). (d) HNSCC samples by HPV type (%) (HPV16, HPV33, HPV35) and tumor location. (e) CESC samples (n=333) by sequencing method (Hybrid-capture, WGS, WXS, Hybrid-capture/APOT*, WGS/RNA-seq, WXS/RNA-seq, WGS/WXS/RNA-seq). (f) CESC integration breakpoints (n=1,299) by sequencing method (Hybrid-capture, WGS, WXS). (g) HNSCC samples (n=41) by sequencing method (DIPS/APOT, WGS/RNA-seq). (h) HNSCC integration breakpoints (n=119) by sequencing method (DIPS, WGS).]

Figure S2
[Image not reproduced. Chromosome ideogram (1-22, X, Y); key: Literature hotspots, CESC defined hotspots.]

Figure S3
[Image not reproduced. HNSCC dataset, chromosome ideogram (1-22, X, Y); breakpoints (n): 1, 2, 3-9.]

Figure S4
[Image not reproduced. (a) Venn diagram of ENCODE (NHEK) enhancers, W12 Brd4 peaks and W12 H3K27ac peaks. (b) Tracks for a region of chr1 showing Brd4 and H3K27ac ChIP-seq reads, W12, HeLa and NHEK enhancers, and RefSeq genes. (c) GREAT gene ontology hits by FDR q-value (-log10); key: Hotspot, Non-hotspot. (d) Tracks for regions of chr8, chr13 and chr17 showing host AMP, Brd4 and H3K27ac ChIP-seq reads, W12 SE, FANCD2 peaks, CESC integrants, HNSCC integrants, GREAT genes and cancer driver genes.]

Figure S5
[Image not reproduced. Tracks for integration hotspots 1-37, grouped by chromosome (chr1 to chr15, chr17, chr19, chr20 and chrX), each showing hotspot size (Mb), host AMP, Brd4 and H3K27ac ChIP-seq reads, W12 SE, FANCD2 peaks, CESC integrants, HNSCC integrants, GREAT genes, cancer driver genes and long genes.]
Brd4 ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: USP9X RBM10 STAG2 Cancer driver ARAF PHF6 genes: TMSB4X EIF1AX DMD BCOR KDM6A KDM5C ZCCHC12 SMARCA1 FLNA Long genes: