bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

1 Recurrent Integration of Human Papillomavirus Genomes at Transcriptional

2 Regulatory Hubs

3 Alix Warburton1, Tovah E. Markowitz2-3, Joshua P. Katz4, James M. Pipas4 and Alison

4 A. McBride1 *

5

6 1Laboratory of Viral Diseases, National Institute of Allergy and Infectious Diseases, 33 North Drive,

7 MSC3209, National Institutes of Health, Bethesda, Maryland 20892, USA

8 2NIAID Collaborative Bioinformatics Resource (NCBR), National Institute of Allergy and Infectious

9 Diseases, National Institutes of Health, Bethesda, MD, USA

10 3Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research,

11 Frederick, MD, USA

12 4Department of Biological Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, USA

13

14 *Corresponding author

15 E-mail: [email protected]

16 Short Title: Integration of HPV genomes in human cancers

17

18

19

20

21

22

1

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

23 ABSTRACT

24 Oncogenic human papillomavirus (HPV) genomes are often integrated into host in HPV-

25 associated cancers. HPV genomes are integrated either as a single copy, or as tandem repeats of viral

26 DNA interspersed with, or without, host DNA. Integration occurs frequently in common fragile sites

27 susceptible to tandem repeat formation, and the flanking or interspersed host DNA often contains

28 transcriptional enhancer elements. When co-amplified with the viral genome, these enhancers can form

29 super-enhancer-like elements that drive high viral oncogene expression. Here, we compiled highly

30 curated datasets of HPV integration sites in cervical (CESC) and head and neck squamous cell carcinoma

31 (HNSCC) cancers and assessed the number of breakpoints, viral transcriptional activity, and host genome

32 copy number at each insertion site. Tumors frequently contained multiple distinct HPV integration sites,

33 but often only one “driver” site that expressed viral RNA. Since common fragile sites and active

34 enhancer elements are cell-type specific, we mapped these regions in cervical cell lines using FANCD2

35 and Brd4/H3K27ac ChIP-seq, respectively. Large enhancer clusters, or super-enhancers, were also

36 defined using the Brd4/H3K27ac ChIP-seq dataset. HPV integration breakpoints were enriched at both

37 FANCD2-associated fragile sites, and enhancer-rich regions, and frequently showed adjacent focal DNA

38 amplification in CESC samples. We identified recurrent integration “hotspots” that were enriched for

39 super-enhancers, some of which function as regulatory hubs for cell-identity . We propose that

40 during persistent infection, extrachromosomal HPV minichromosomes associate with these

41 transcriptional epicenters, and accidental integration could promote viral oncogene expression and

42 carcinogenesis.

2

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

43 INTRODUCTION

44 Persistent infection with high-oncogenic risk HPV types is responsible for almost all cervical and

45 ~70% oropharyngeal carcinomas 1. One factor that can contribute to oncogenic progression of HPV-

46 positive lesions is integration of the viral genome into host chromatin. Integration is associated with

47 increased genetic instability in high-grade cervical intraepithelial neoplasia (CIN) and cervical and

48 oropharyngeal carcinomas 2-9 due to dysregulated expression of the viral oncoproteins, E6 and E7. Many

49 studies have compared the human genomic regions associated with HPV integration sites to elucidate

50 the mechanisms that might promote integration and carcinogenesis. Here, we have curated datasets

51 from The Cancer Genome Atlas and other published sources to define a rigorous database of integration

52 breakpoints and correlated these with “in-house” datasets of common fragile sites and enhancer

53 elements defined in cervical carcinoma cells.

54 Integration of HPV DNA occurs in all human chromosomes; however, integration sites are often

55 found within or in close proximity to common fragile sites 10-13. Common fragile sites are regions of the

56 genome that have difficulty completing replication and, as such, are susceptible to

57 breakage in mitosis. They are prone to replication stress that can be due to a shortage of replication

58 origins or clashes between replication and transcriptional processes. Therefore, they vary in fragility

59 depending on cell type and disease state 14. Most previous studies that documented an association of

60 HPV integration sites with common fragile sites used the classical FRA-regions that were defined

61 cytogenetically in lymphocytes 9-13,15. As such, these fragile sites are not cell-type specific and are often

62 large, poorly defined regions that cover a large proportion of the . FANCD2 is required

63 for resolution of these genetically unstable sites and, as such, is a marker of common fragile sites 16-18.

64 HPV integration sites occur frequently in amplified regions of the host genome, and focal

65 amplification of cellular flanking sequences at sites of viral integration are frequently observed in HPV-

66 positive tumors 6,7,9,15. Co-amplification of the viral genome and flanking cellular sequences can result

3

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

67 from unlicensed initiation of replication at the viral origin resulting in endoreduplication 19-21.

68 Subsequent recombination can result in amplified tandem repeats. Genome amplification can also occur

69 at common fragile sites by breakage-fusion-bridge cycles 22.

70 HPV integration is also enriched at transcriptionally active regions of the host genome 15,23. We

71 previously identified an HPV16 integration site in the W12 20861 cervical cell line that was adjacent to a

72 cell-type specific enhancer. Co-amplification of this regulatory element and the viral genome to

73 approximately 25 copies resulted in the formation of a super-enhancer-like element to drive high viral

74 oncogene expression 24,25. This ‘enhancer-hijacking’ is a novel mechanism by which HPV integration can

75 promote oncogenesis; however, it is unclear how common this mechanism is in HPV-associated cancers.

76 The aim of this study was to examine the association among HPV integration loci, common

77 fragile sites, and genome amplification to determine whether insertion of HPV genomes adjacent to

78 active cellular enhancers often resulted in viral ‘enhancer-hijacking’, and whether genetic instability

79 could result in co-amplification of viral-cellular regulatory repeats to drive oncogenic progression of

80 HPV-associated cancers. We have extended our previously published work 26 to generate a common

81 fragile site dataset in cervical carcinoma cell lines using higher resolution mapping of FANCD2 binding

82 and have mapped cellular enhancers and super-enhancers in an HPV16-positive cell line derived from a

83 cervical lesion27 using H3K27ac and Brd4 ChIP-seq (Figure 1). Here we compare HPV integration sites

84 with these cervical cell-type specific enhancer and fragile site datasets.

4

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

85 RESULTS

86 CESC and HNSCC tumors frequently contain multiple, clustered HPV integration breakpoints

87 A dataset of HPV integration breakpoints was assembled from various sources 5,6,8,9,28-35, as

88 outlined in Figure 1 and Supplementary Table S1. Integration breakpoints were defined as the junctions

89 between the viral and host chimeric reads within the human reference genome. A total of 1,299

90 integration breakpoints from 333 cervical carcinomas (CESC) and 119 integration breakpoints from 41

91 head and neck squamous cell carcinomas (HNSCC) were included in this study (Supplementary Figure S1

92 and Supplementary Table S2-S3). We found that many tumor samples contained multiple integration

93 breakpoints that could have resulted from either independent integration events at different

94 chromosomal loci or from amplification of a single integration site resulting in a cluster of multiple,

95 closely spaced breakpoints. To classify this, we defined an integration as either a single HPV

96 insertion breakpoint, or as multiple, closely spaced breakpoints (a cluster), Figure 2a. Clustered

97 breakpoints within the same chromosome that had a maximum distance of 3 Mb between the most 5’

98 and 3’ breakpoints were classified as a single integration locus. Based on this classification, the total

99 number of integration loci analyzed in our study was 584 for CESC samples and 58 for HNSCC samples.

100 Tumors with multiple integration loci were observed in 28.8% of CESC and 22.0% of HNSCC tumors.

101 Sites of recurrent HPV DNA integration in different tumor samples are termed integration

102 hotspots. We defined integration hotspots (five or more sites located <5Mb apart) in our CESC dataset

103 and compared them to previously defined hotspots from the literature 10,15,28,36-40. We identified a total

104 of 37 hotspots in CESC tumors (Supplementary Table S4), which represented 313/584 (53.6%)

105 integration loci from our CESC dataset (Figure 2b-c). Twenty-three hotspots overlapped previously

106 defined sites of recurrent integration and 14 were novel hotspots, Supplementary Figure S2 and

107 Supplementary Table S4-S5. We were unable to define sites of recurrent integration in the HNSCC

108 dataset because of the low number of integration loci in these tumors.

5

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

109 The distribution of clustered breakpoints at each integration locus and across each chromosome

110 is shown in Figure 2b for CESC samples and Supplementary Figure S3 for HNSCC samples. Most

111 integration sites had 1-2 breakpoints in both the CESC (81.1%) and HNSCC (75.9%) datasets. Sites of

112 recurrent integration are indicated by orange boxes for the CESC samples. The integration loci at these

113 hotspots were more likely to contain clustered breakpoints compared to integration sites elsewhere in

114 the genome for CESC tumors (Figure 2c; p=0.004). Higher numbers of clustered breakpoints at sites of

115 recurrent integration suggests that these regions are susceptible to genomic instability.

116 Most tumors with integrated HPV DNA have a single driver integration

117 Constitutive expression of the viral oncogenes from the integration locus is required for clonal

118 selection and oncogenic progression. Transcriptionally silent HPV integration loci can be considered to

119 be passenger sites 35. To identify driver versus passenger integrations, the transcriptional activity of each

120 integration locus was determined for the subset of samples that had matched RNA sequencing data

121 (CESC, n=144; HNSCC, n=35). The transcription status of each integration locus is indicated in

122 Supplementary Table S2-S3. Integration loci in which no HPV mRNA was detected by RNA-seq analysis

123 were classified as inactive or passenger loci. In CESC, all samples with a single integration locus (n=86)

124 were transcriptionally active (Figure 3a). For samples with multiple integration loci (CESC, n=58; HNSCC,

125 n=13), more than one transcriptionally active integration locus was observed in 35 (60.3%) CESC and

126 four (30.8%) HNSCC tumors. Three HNSCC samples had no driver integrations; one sample (4.5%) had a

127 single integration locus and two samples (15.4%) had multiple integration loci, Figure 3b. Overall, the

128 majority of CESC and HNSCC tumors with integrated HPV genomes had only a single transcriptionally

129 active integration locus. This implies that most tumors with integrated HPV DNA have a single driver

130 integration. Viral oncogene transcription was analyzed at integration hotspots in CESC, which showed

131 that sites of recurrent integration can have both driver and passenger integrations but are more likely to

132 be transcriptionally active (Figure 3c).

6

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

133 Clustered integration breakpoints are associated with amplified regions of the host genome in CESC

134 and HNSCC

135 HPV integration loci often have amplification and/or rearrangements of the flanking cellular

136 sequences at the insertion sites 6,7,9,15. We determined the frequency with which integration loci in our

137 datasets were associated with somatic copy number alterations for the subset of samples that had

138 matched host genome copy number data (235 CESC and 22 HNSCC tumors; Supplementary Table S6-

139 S7). In this subset, most integration breakpoints occurred within amplified regions of cellular DNA in

140 both CESC and HNSCC relative to regions with a normal genomic copy number or adjacent to deletions

141 (Figure 4a-b). In addition, the number of breakpoints per integration locus was significantly different at

142 amplified regions of the host genome relative to loci with a normal genomic profile for both CESC and

143 HNSCC samples; higher numbers of breakpoints per cluster occurred at amplified regions (Figure 4a-b).

144 No significant difference was found in the number of clustered breakpoints per integration loci in

145 regions with a normal genomic copy number relative to those with genomic copy number losses in CESC

146 samples. We conclude that integration loci associated with flanking host DNA amplification were more

147 likely to contain clustered breakpoints and to have higher numbers of breakpoints per integration locus.

148 Integration loci with associated focal amplifications were found on all chromosomes in CESC and

149 on 15 chromosomes in HNSCC. The distribution of clustered integration breakpoints relative to host

150 genome amplification and sites of recurrent integration in CESC is shown, Figure 4c-d. There was an

151 enrichment of integration loci with associated host DNA amplifications at integration hotspots in CESC

152 (Figure 4e) and higher numbers of clustered breakpoints were common at these regions (Figure 4c).

153 Genomic instability in these regions most likely results in co-amplification of both viral and host DNA.

154 Identification of common fragile sites in cervical cancer cell lines

155 Genomic instability occurs frequently at common fragile sites. We previously used FANCD2 ChIP-

156 chip to map fragile sites in an aphidicolin-treated cervical carcinoma cell line (C33-A) 26. Here, we have

7

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

157 extended the FANCD2 dataset using ChIP-seq analysis of both C33-A and HeLa cervical carcinoma cells

158 (Supplementary Table S8-S9) and combined these results with those previously mapped in C33-A cells

159 by ChIP-chip 26. There was good overlap between the FANCD2 peaks called across the three datasets

160 (p<0.0001). In total we defined 513 FANCD2-enriched regions between the two cervical carcinoma cell

161 lines, and they are listed in Supplementary Table S10.

162 We compared our cervical carcinoma cell line derived FANCD2 mapped fragile sites (genomic

163 coverage 7.9%) to the 77 aphidicolin-induced common fragile sites (FRA regions) defined cytogenetically

164 in lymphocytes and reported in the HGNC (HUGO Nomenclature Committee) database (genomic

165 coverage 48.3%), Figure 5a. A total of 115 (22.4%) FANCD2-enriched regions derived from C33-A and

166 HeLa cells overlapped with 55.8% (n=43/77) FRA-regions, Figure 5b. FRA-regions that overlapped with

167 FANCD2-associated fragile sites are listed in Supplementary Table S11. Permutation testing was used to

168 determine the significance in overlap between our FANCD2-dataset and traditional FRA regions. The

169 association between these genomic features did not reach significance (p=0.0634), which reflects

170 differences in replication stress at these regions in different cell types.

171 Recent studies used high-resolution MiDA-seq (next-generation sequencing of EdU

172 incorporation at sites of mitotic DNA synthesis) to map fragile sites in HeLa cells 41,42 and show that they

173 colocalize with FRA-regions and FANCD2 foci in cells treated with aphidicolin. We compared the overlap

174 between our dataset of FANCD2-enriched regions and the mitotic DNA synthesis regions profiled in HeLa

175 cells (total genomic coverage of 4.4%, Figure 5a). A total of 120 (23.4%) FANCD2-enriched regions

176 overlapped with mitotic DNA synthesis regions, which represented 48.3% (n=112/232) of mitotic DNA

177 synthesis regions, Figure 5b. Permutation testing was used to determine the significance in overlap

178 between FANCD2-enriched regions and mitotic DNA synthesis regions (p<0.0001). Mitotic DNA synthesis

179 regions that overlapped with FANCD2-associated fragile sites are indicated in Supplementary Table S12.

8

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

180 Collectively, these data show good correlation between our FANCD2-associated fragile sites and regions

181 of the genome susceptible to genetic instability.

182 The instability of fragile sites is often due to transcription-replication conflicts that frequently

183 occur at long genes 43. We and others have previously shown that FANCD2-enriched regions overlap

184 with transcriptionally active long genes in C33-A 26 and U2OS cells 44. Here we extended that association

185 to include genes that are >0.3 Mb in length 45. A total of 184/513 (35.9%) FANCD2-enriched regions

186 overlapped with -coding genes that were ≥0.3 Mb, which corresponded to 185/782 (23.7%) long

187 genes. A Fisher’s exact test was used to determine the significance in overlap between FANCD2-

188 enriched regions and long genes (two-tailed, p=4.22E-13). Of the long genes that overlapped with

189 FANCD2-enriched regions, 121/185 (65.4%) were expressed in C33-A and/or HeLa cells, Figure 5c,

190 further validating our FANCD2 peaks as sites of genetic instability in cervical carcinoma cells.

191 Supplementary Table S13 lists the long genes used in this analysis and their association with common

192 fragile sites and expression in C33-A and HeLa cells from RNA-seq (26 and Expression Atlas,

193 https://www.ebi.ac.uk/gxa/home). Example alignments of our FANCD2-associated fragile sites relative

194 to common FRA-regions, mitotic DNA synthesis regions and long genes are shown in Figure 5d.

195 Integration breakpoints are enriched at FANCD2-associated fragile sites

196 To assess the association of our cervical carcinoma-specific fragile site dataset with HPV

197 integration sites, we calculated the frequency with which an integration breakpoint occurred within 50

198 Kb 9 of the C33-A and HeLa FANCD2-enriched regions (Supplementary Table S10). Approximately 18%

199 integration breakpoints were associated with FANCD2-enriched regions in both the CESC and HNSCC

200 datasets. The data was permutated 10,000 times to create an expected distribution of the overlap

201 between integration breakpoints and FANCD2-enriched regions. This showed that the FANCD2-

202 associated fragile sites were significantly enriched for both CESC and HNSCC HPV integration sites when

203 each breakpoint was analyzed independently from each other, Table 1.

9

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

204 To remove bias resulting from overrepresentation of integration loci containing clusters of

205 breakpoints, the integration loci were simplified into defined subsets for significance testing. Integration

206 loci contain either a single breakpoint or a cluster of breakpoints (Figure 2a). In the latter group, the

207 clustered breakpoints were condensed into a single site represented by the most 5’ and 3’ breakpoints.

208 The final category combined the single and condensed categories, in which each integration locus was

209 represented just once. The integration loci in each category were tested independently for their

210 association with FANCD2-enriched regions. For the CESC dataset, sites containing single or clustered

211 breakpoints were significantly associated with FANCD2-enriched regions, Table 1. In contrast, only sites

212 with clustered breakpoints reached significance for the HNSCC dataset (Table 1).

213 Generation of a cervical keratinocyte enhancer dataset using Brd4 and H3K27ac ChIP-seq

214 It has been noted previously that HPV integration occurs frequently at transcriptionally active

215 regions 15,23, and we have demonstrated that HPV integration can capture and amplify cellular enhancers

216 to drive viral oncogene expression 25. Brd4 is a marker of cell lineage-specific enhancers 46-48 and HPV E2

217 tethering sites 26,49. Moreover, we, and others, have shown previously that Brd4 and the HPV E2

218 replication protein bind to transcriptionally active chromatin within the host genome 49,50 that overlap

219 many FANCD2-associated fragile sites 26. Viral replication factories form adjacent to these sites and we

220 have proposed that tethering of the viral genome to these unstable sites would increase the chances of

221 integration at these regions.

222 To further examine the association of cellular enhancers with HPV integration in CESC and

223 HNSCC, we generated an “in-house” enhancer dataset using Brd4 and H3K27ac ChIP-seq in four

224 different subclones of W12 cervical keratinocytes. We defined 6,935 enhancer consensus peaks in the

225 four W12 subclones (Supplementary Table S14). The resulting H3K27ac and Brd4 ChIP-seq signals were

226 compared to enhancers in the NHEK (Normal Human Epidermal Keratinocytes) ENCODE dataset 51,

227 which showed that 83.5% (p<0.0001) W12 Brd4/H3K27ac enriched peaks overlapped ENCODE defined

10

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

228 enhancers (Supplementary Figure S4). Approximately 40% integration breakpoints and loci from the

229 CESC and HNSC datasets were significantly associated with our cervical keratinocyte enhancer dataset,

230 Table 2.

231 Integration hotspots are associated with gene loci related to cell development and identity

232 Enhancers that were associated with integration loci were analyzed using the Genomic Regions

233 Enrichment of Annotations Tool (GREAT) 52 to identify common functional significance based on

234 proximity testing. This identified 52 putative enhancer target genes for the CESC dataset, representing

235 28 gene loci of which 60.7% overlapped integration hotspots (Figure 6a). Twelve of the gene loci are

236 previously defined sites of recurrent integration and include the KLF5, KLF12, MYC, TP63, RAD51B and

237 HMGA2 genes, and five of the gene loci are novel integration hotspots and include the CAMK1G, FOXQ1,

238 EXOC2, GRHL2, ID1, COX4I2, HM13 and NFIA genes (Figure 6a). For HNSCC, 19 target genes were

239 identified through GREAT analysis, representing eight gene loci of which 50% overlapped

240 integration hotspots profiled in CESC, Supplementary Figure S4. Six target genes (KLF5, KLF12, FAM84B,

241 POU5F1B, TUBD1 and VMP1) were common across CESC and HNSCC. Gene ontology analysis of our

242 enhancer regions identified epithelium development, epithelial cell differentiation, and negative

243 regulation of keratinocyte differentiation to be significantly enriched biological processes associated

244 with CESC integration loci (Supplementary Table S15). Ectoderm development and differentiation,

245 epidermal and epithelial differentiation, tongue morphogenesis and negative regulation of spreading

246 epidermal cells during wound healing were biological processes significantly associated with enhancer

247 enrichment at HNSCC integration loci, Supplementary Table S15. This indicates that sites of recurrent

248 HPV integration are often associated with cellular pathways relevant to host cell development and

249 differentiation.

250 Keratinocyte-specific super-enhancers are enriched at integration loci

11

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

251 Large enhancer clusters were characteristic of the integration targets identified through GREAT

252 analysis. We therefore defined super-enhancers in our Brd4-defined W12 enhancer dataset based on

253 relative peak height of the H3K27ac and Brd4 ChIP-seq datasets using the Rank Ordering of Super-

254 Enhancers (ROSE) tool 53,54. We defined 338 super-enhancers in W12 cervical keratinocytes

255 (Supplementary Table S16). Intersect analysis showed that 89/584 (15.2%) CESC integration loci

256 overlapped with super-enhancers profiled in W12 cells, and of these loci 72 (80.9%) were classified as

257 sites of recurrent integration (Figure 6b). A total of 25/37 (67.6%) integration hotspots contained super-

258 enhancers and permutation testing showed that both CESC integration loci and sites of recurrent

259 integration were significantly associated with these regulatory domains (p<0.0001). For HNSCC, 8/57

260 (14%) integration loci were associated with W12 super-enhancers, and 62.5% of these loci overlapped

261 integration hotspots profiled in CESC, including the KLF5/KLF12, MYC, ERBB2 and VMP1 gene loci.

262 Permutation testing showed that HNSCC integration loci were significantly associated with super-

263 enhancers profiled in W12 cells (p<0.0001). Thus, keratinocyte-specific super-enhancers are enriched at

264 integration loci in HPV-associated tumors and are frequently found at integration hotspots.

265 The association of CESC integration loci with super-enhancers and FANCD2-enriched fragile sites

266 at integration hotspots was also addressed, Figure 6c. This showed that the frequency of FANCD2

267 enrichment at integration loci was comparable for sites of recurrent (50/313 loci; 16.0%) and non-

268 recurrent integration (48/271 loci; 17.7%) in CESC, whereas the association of super-enhancers was

269 augmented at integration loci that occurred within hotspots (72/313 loci; 23.0%) relative to non-

270 recurrent sites of integration (17/271 loci; 6.3%). Most integration loci that overlapped super-enhancers

271 were active for viral oncogene expression (driver integrations) and were more frequently observed at

272 integration hotspots, Figure 6d. Furthermore, several cancer driver genes, including ASXL1, CACNA1A,

273 IRF6, KANSL1, KLF5, KRT222, MYC, PPM1D, PTCH1 and PTPDC1 55 were located within 1 Mb of super-

274 enhancers that overlapped with integration hotspots, Supplementary Figure S5 and Table S17.

12

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

275 Alignment of FANCD2-associated fragile sites, super-enhancers and associated target genes at

276 integration hotspots are shown in Figure 6e and Supplementary Figure S4-S5. Collectively, these data

277 show that transcriptionally active chromatin and/or regions of genetic instability are common features

278 of HPV integration sites. Moreover, integration hotspots are commonly associated with super-

279 enhancers, several of which regulate cancer driver and/or cell-identity genes.

280 Super-enhancers are frequently amplified at integration hotspots in CESC

281 The association of super-enhancers at integration hotspots was compared with the host somatic

282 copy number alteration in CESC samples. Super-enhancers were more frequently observed at those

283 CESC integration loci with associated host DNA amplifications (53/240; 22.1%) relative to those that had

284 either a normal genomic profile (16/140; 11.4%) or deletions within the host DNA flanking sequences

285 (1/17; 5.9%), Figure 6f. Of the amplified CESC integrations that had associated super-enhancers, 43

286 (81.1%) were sites of recurrent integration and represented 43/143 (30.1%) hotspot and 10/97 (10.3%)

287 non-hotspot loci with associated host genome amplifications, Figure 6f. Super-enhancer overlap was

288 also more frequently observed at integration hotspots (11/62; 17.7%) than non-hotspots (5/78; 6.4%)

289 for loci with a normal genomic profile, representing 68.8% of loci that overlapped super-enhancers for

290 this subgroup. However, for integration loci that had associated host deletions, no super-enhancers

291 were observed at sites of recurrent integration (Figure 6f). This data shows that amplification of super-

292 enhancers is frequently observed at integration loci in CESC, particularly at sites of recurrent integration.

293 DISCUSSION

294 Many studies have documented the “landscape” of HPV integration sites with respect to

295 traditional common fragile sites, host genome amplification, and transcription and related regulatory

296 elements 9-13,15,23. Here, we combined and curated DNA and RNA sequencing datasets from HPV-positive

297 CESC and HNSCC tumors and compared them with novel “in-house” datasets of common fragile sites

298 defined by FANCD2 ChIP-seq, and enhancers and super-enhancers defined by Brd4 and H3K27ac ChIP-

13

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

299 seq in cervical carcinoma derived cells. We show that viral integration sites in CESC are enriched at

300 FANCD2-associated fragile sites in cervical cells. We also show that cervical cell enhancers are over-

301 represented at HPV integration sites and that HPV integration is often associated with super-enhancers,

302 particularly at integration hotspots enriched for cell-identity genes. Furthermore, we show that the

303 flanking host DNA that is enriched for enhancers and super-enhancers is frequently amplified in CESC

304 tumors.

305 HPV genomes replicate as extrachromosomal nuclear minichromosomes at every stage of the

306 infectious cycle. The virus relies on the host replication and transcriptional machinery and it is thought

307 that the HPV genome localizes to regions of the nucleus that facilitate these processes 56. At different

308 stages of infection, the viral DNA associates with nuclear ND10 bodies, and interphase and mitotic host

309 chromatin, and highjacks the DNA damage repair processes to amplify viral DNA 57,58. Concomitantly, the

310 viral E6 and E7 induce cell proliferation and replication stress, abrogate cell cycle checkpoints,

311 and inhibit the innate immune response 59. E6 and E7 proteins also modify the epigenetic landscape of

312 the host genome by changing the levels of different histone-modifying enzymes 60. Together, these

313 activities could promote the accidental integration of viral DNA that is closely associated with host

314 chromatin.

315 HPV genomes replicate using two different modes: in maintenance replication the genomes

316 replicate bidirectionally at low copy number, but this switches to a unidirectional recombination-

317 directed mechanism in the amplification stage 61,62. The formation of tandem repeats at integration sites

318 could be related to these processes and over-replication of viral and host sequences could result from

319 repeated initiation of replication at the viral replication origin, especially if the HPV E1 and E2 proteins

320 are expressed 21,63. In fact, unscheduled firing of replication origins and increased replication fork stalling

321 has been shown to occur in both viral and host sequences at HPV integration sites in the MYC locus 20,

322 which is frequently amplified in HPV-associated cancers. Tandem repeating units of co-amplified viral

14

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

323 and cellular DNA could result from this endoreduplication, replication fork arrest, and homologous

324 recombination. Highly rearranged integrations are also consistent with the breakage-fusion-bridge–type

325 model of genome amplification 19. At fragile sites, perturbed replication dynamics could also generate

326 focal amplifications and/or rearrangements of viral-host sequences.

327 In this study, we generated a dataset of aphidicolin-induced common fragile sites in two cervical

328 carcinoma cell lines, C33-A and HeLa, and found a significant association between these sites and

329 integration breakpoints in CESC, particularly at those loci with clustered breakpoints. Common fragile

330 sites are susceptible to somatic copy number alterations 64,65 likely due to replication stress that arises

331 from perturbed replication dynamics in conflict with transcription of long genes 66. Accordingly, our C33-

332 A and HeLa common fragile sites were overrepresented at long genes expressed in these cells. We did

333 not observe an enrichment of HPV integration sites in HNSCC samples with our FANCD2-associated

334 common fragile sites, although an association was previously noted between traditional FRA-regions and

335 integration sites in oropharyngeal squamous cell carcinomas 10. This difference could reflect the larger

336 genomic coverage of FRA-regions used in the previous study, or the limited number of samples in our

337 HNSCC dataset. Moreover, common fragile sites are likely distinct in cervical and oropharyngeal derived

338 keratinocytes, or alternatively these findings could reflect differences in the biology of HPV infection and

339 mechanisms of oncogenic progression in the different tissue types.

340 HPV integration often occurs in transcriptionally active chromatin within the host genome 15,23,67

341 and we previously described an example of enhancer-hijacking and co-amplification of cellular and viral

342 regulatory sequences at an HPV integration site in cervical lesion derived cells 25. The association of viral

343 integration breakpoints with putative enhancer regions in HPV-associated cancers has been reported 15,

344 but the enhancer regions used were based on ENCODE histone modifications and therefore did not

345 reflect the specific enhancer profiles of HPV-positive cervical cells. Here, we defined keratinocyte

346 enhancers in HPV16-positive W12 cervical keratinocytes by H3K27ac and Brd4 enrichment and show

15

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

347 that these specific enhancers are significantly overrepresented at HPV integration loci. In some cases,

348 these loci were associated with focal amplification of host DNA, providing evidence for potential

349 enhancer-capture. A recent study showed enrichment of active histone marks at HPV integration loci in

350 cervical tumors, which correlated with upregulation of local gene expression, and increased gene

351 expression levels at loci with increased breakpoints 68. Kamal et al. also found increased local host gene

352 expression at loci with multiple junction copies (analogous to clustered breakpoints) 36. Furthermore,

353 integration loci with associated somatic copy number alterations have also been shown to have

354 increased gene expression 29. These observations could represent enhancer-hijacking. Integration

355 hotspots often contain genes that drive cancer 55 and accidental integration at these regions could

356 results in clonal expansion and selection due to perturbation of these oncogenic pathways. Thus,

357 enhancer-hijacking can drive expression of both viral and cellular genes.

358 Super-enhancers are large clusters of enhancers, rich in Brd4 binding and H3K27ac modification,

359 that often control cell identity genes and are coopted in tumorigenesis 53,54. We defined super-

360 enhancers in our W12 cervical cell line datasets and showed that they were strongly associated with

361 integration hotspots, including the MYC, KLF5/KLF12 and ERBB2 gene loci, which are important

362 regulators of cell-cycle, proliferation and apoptosis 69-71; TP63, which is a master regulator of epidermal

363 keratinocyte proliferation and differentiation 72; and RAD51B, which is a key regulator of homologous

364 recombination repair 73. We propose that, during persistent infection, extrachromosomal HPV genomes

365 specifically localize at key transcriptional regulatory hubs within the host genome, several of which are

366 important for keratinocyte biology.

367 The cellular Brd4 protein is involved in many of the cellular and viral processes described in this

368 study. Brd4 is a chromatin scaffold protein that modulates transcriptional initiation and elongation and

369 is a major component of super-enhancers 74. Brd4 is also important at multiple stages of the HPV

370 infectious cycle 75, binds to common fragile sites in C33-A cells 26, and is important for tethering HPV

16

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

371 genomes to mitotic chromatin 76,77. Brd4 is enriched at the HPV16 integration site/super-enhancer in

372 W12 cells and inhibition of Brd4 binding reduces E6/E7 transcription and cell growth 24. Therefore, Brd4

373 is an example of a factor that is crucial for key cell and viral chromatin-related processes, and the

374 juxtaposition of these processes could promote integration of viral DNA and oncogenic progression.

375 In conclusion, many factors contribute to the integration of a viral genome that eventually

376 drives oncogenesis. Cancer genomes often contain multiple HPV integration sites but usually only one is

377 transcriptionally active (this study, 29,35). In CESC-derived cells, expression of the viral E6 and E7

378 oncoproteins is necessary for cell proliferation, survival, and maintenance of the tumor phenotype 78.

379 Therefore, integration is common, but requires the right genomic location for constitutive viral

380 oncogene expression. For example, the viral oncogenes are usually expressed from a viral-host fusion

381 transcript that requires a splice acceptor and polyadenylation signal in the flanking host DNA 79,80.

382 Therefore, HPV integrants require a combination of events and processes that are dependent on the

383 genetic and/or epigenetic landscape of the flanking host chromatin to drive oncogenesis.

384 METHODS

385 HPV integration datasets

386 A systematic literature review identified genomic datasets from HPV-positive CESC and HNSCC that

387 contained information on HPV type and integration breakpoints within the host and viral genomes,

388 which were identified by sequencing (Supplementary Table S1) 5,6,8,9,28-35. The integration breakpoints

389 used in this study were originally identified from both RNA and DNA sequencing methods, including

390 APOT (amplification of papillomavirus oncogene transcripts), DIPS (detection of integrated

391 papillomavirus sequences) and next-generation sequencing technologies. The use of hybrid-capture

392 technologies for detection of viral integration sites has been reported to give high rates of false positives

393 33,34, and so insertion breakpoints identified by this method were only included if they were validated by

394 other means, such as Sanger sequencing. The methodology used to identify each integration breakpoint

17

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

395 is referenced in Supplementary Table S2 and S3. RNA-based sequencing methods give an approximation

396 of the insertion site based on the closest splice acceptor site within the host genome; therefore,

397 integration breakpoints identified by RNA-seq and/or APOT were only used to determine the viral

398 transcription status of an integration site for samples with matched DNA sequencing data. For TCGA

399 CESC and HNSCC samples, unmapped reads were extracted from RNA-seq, whole genome sequencing

400 (WGS) and whole exome sequencing (WXS) BAM files (https://portal.gdc.cancer.gov/legacy-archive;

401 accessed 01/01/2013) and pre-processed with prinseq-lite.pl version 0.20.2 to remove low quality reads

402 81. Pre-processed reads were mapped with Bowtie 2 using the very-sensitive preset option against the

403 Viral Refseq database 82,83. All unmapped reads were subjected to BLASTN with default parameters

404 against the Viral Refseq database 84. All aligned reads were then subjected to BLASTN against the human

405 hg19 reference assembly. Bowtie 2 and BLASTN reports were passed into SummonChimera using a

406 1,000 bp deletion size for integration detection 85. The SummonChimera reports were manually parsed

407 to remove chimeric junctions with lower than 20 read coverage, chimeric junctions with no cross-

408 analysis verification, and ambiguously reported integration predictions. Finally, a unique ID was

409 provided to all uniquely detected chimeric junctions. For analysis of the association of integration

410 breakpoints with different genomic features of interest, only integration breakpoints identified by DNA

411 sequencing methods were used. The characteristics of samples included in this study, categorized by

412 histology type, HPV type, tumor location and sequencing methods are summarized in Supplementary

413 Figure S1. CESC and HNSCC integration breakpoints included in this study are listed in Supplementary

414 Table S2-S3.

415 Integration hotspot dataset

416 Integration loci from CESC tumors that were within 5 Mb of each other were collapsed into a single

417 genomic interval to define integration hotspot boundaries. Exceptions to this size cut-off for collapsing

418 adjacent integration loci were permitted to reflect previously defined hotspots from the literature. Five

18

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

419 or more integration loci per hotspot (or three or more integration loci for sites that overlapped

420 previously defined hotspots from the literature) were used to define sites of recurrent integration.

421 Integration hotspots defined from our CESC dataset and previously in the literature are listed in

422 Supplementary Tables S4 and S5, respectively.

423 Somatic copy number alteration datasets

424 The amplification status of the cellular sequences flanking integration breakpoints had been assessed in

425 a subset of samples by comparative genomic hybridization (CGH) or SNP array datasets. For CGH array

426 data, we defined two-fold or more focal amplification or deletion of the host genome at an integration

427 locus as having an associated somatic copy number alterations 8,28. Matched SNP6 copy number

428 segment data for TCGA CESC and HNSCC tumor samples were downloaded from the Broad Institute on

429 04/07/2020, http://firebrowse.org/ (genome_wide_snp_6-segmented_scna_minus_germline_cnv_

430 hg19) 86,87. Positive and negative mean segment values above 0.3 and below -0.3 represented copy

431 number gains and losses, respectively, and mean segment values between 0.3 and -0.3 were considered

432 noise 86,87. Deletions that occurred within 50 Kb of an integration breakpoint were included in this

433 analysis. Deletions that directly overlapped an integration breakpoint were excluded; these likely

434 reflected the alternative chromosome as sequencing of the chimeric viral-host junction was available for

435 the associated HPV insertion sites. Integration loci defined as having associated somatic copy number

436 alterations in CESC and HNSCC are detailed in Supplementary Table S6 and S7, respectively.

437 Cell culture

438 Subclones (20831, 20861, 20862, 20863) derived from HPV16-positive W12 cervical keratinocytes 79,88

439 were maintained in F-medium (3:1 [vol/vol] F-12–Dulbecco’s modified Eagle’s medium, 5% fetal bovine

440 serum, 0.4 μg/ml hydrocortisone, 5 μg/ml insulin, 8.4 ng/ml cholera toxin, 10 ng/ml epidermal growth

441 factor, 24 μg/ml adenine, 100 U/ml penicillin and 100 μg/ml streptomycin). All cells were grown in the

442 presence of irradiated 3T3-J2 feeder cells. C33-A and HeLa cervical carcinoma derived cell lines were

19

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

443 maintained in Dulbecco’s modified Eagle’s medium, supplemented with 10% fetal bovine serum, 100

444 U/ml penicillin and 100 μg/ml streptomycin. To induce replication stress, C33-A and HeLa cells were

445 treated for 24 hours with 0.2 µM aphidicolin (Sigma A0781) prior to harvesting for FANCD2 ChIP-seq

446 experiments, described below.

447 ChIP-seq: FANCD2

448 Aphidicolin treated C33-A and HeLa cells were processed for chromatin immunoprecipitation (ChIP) as

449 previously described 24. Briefly, cells were crossed-linked with 1% formaldehyde and chromatin was

450 isolated and sheared to 100–500 bp DNA fragments using a Bioruptor sonicator (Diagonode) on high

451 power settings. Chromatin samples (25 μg per ChIP) were incubated overnight at 4°C with an antibody

452 against FANCD2 (Bethyl, A302-174A, 2.5 μg). Rabbit IgG (Jackson ImmunoRes, 011-000-003) was used to

453 determine non-specific binding to control regions (though not sequenced). Chromatin

454 immunocomplexes were precipitated for 1 hour at 4°C with blocked Dynabeads Protein G (Invitrogen),

455 subjected to multiple wash steps and the chromatin eluted in elution buffer (50 mM Tris-HCl [pH 8.0], 10

456 mM EDTA [pH 8.0], 1% SDS). Chromatin was reverse cross-linked overnight at 65°C in 0.2 M NaCl,

457 followed by RNase A and proteinase K treatment, and the DNA purified using the ChIP DNA Clean &

458 Concentrator kit (Zymo Research). ChIP DNA from two biological replicates were pooled and subjected

459 to 2 x 150 bp paired-end read sequencing on the Illumina HiSeq-4000 platform (Genomics Resource

460 Center, Institute for Genome Sciences, University of Maryland) to a sequencing depth of >14 million

461 reads per sample. FANCD2 ChIP-seq datasets are accessible through GEO Series accession number

462 GSE183048.

463 ChIP-seq: Brd4 and H3K27ac

464 W12 chromatin samples were isolated as described above, and have been described previously 25.

465 However, only the sequence data flanking the HPV16 integration sites was previously analyzed and

466 published. Here, we analyze the same dataset but for the entire human genome. Antibodies used were

20

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

467 Brd4 (Bethyl Laboratories A301-985A, 3 μg) or H3K27ac (Millipore 07–360, 3 μl). No antibody controls

468 were included to monitor non-specific binding. Brd4 and H3K27ac ChIP-seq datasets are accessible

469 through GEO Series accession number GSE183048.

470 ChIP-seq processing and peak calling

471 Reads were trimmed with Cutadapt version 1.18 89. All reads aligning to the ENCODE hg19 v1 blacklist

472 regions 90 were identified by alignment with BWA version 0.7.17 91 and removed with Picard

473 SamToFastq, https://broadinstitute.github.io/picard/. Remaining reads were aligned to an hg19

474 reference genome using BWA. Reads with a mapQ score less than 6 were removed with SAMtools

475 version 1.6 92 and PCR duplicates were removed with Picard MarkDuplicates. Peaks were called by

476 comparing each ChIP sample to its matching input sample. For FANCD2, the mean fragment size was

477 estimated by Phantompeakqualtools version 2.0 93. Peaks were called using SICER version 1.1 94 with the

478 following parameters: redundancy threshold of 100, effective genome fraction of 0.75, window size of

479 25,000 bp, and gap size of 50,000 bp. H3K27ac and Brd4 peaks were called using macsBroad (macs

480 version 2.1.1 from 2016/03/09) 95 with the following parameters: --broad-cutoff 0.01 -f "BAMPE". Data

481 was converted into bigwigs for viewing and normalized by reads per genomic content (RPGC) using

482 deepTools version 3.0.196 using the following parameters: --binSize 25 --smoothLength 75 --

483 effectiveGenomeSize 2700000000 --centerReads --normalizeUsing RPGC. RPGC-normalized input values

484 were subtracted from RPGC-normalized ChIP values of matching cell type genome-wide using Deeptools

485 with --binSize 25.

486 FANCD2-associated fragile site dataset

487 FANCD2 ChIP-seq peaks were filtered by a -log10 q-value of ten or above to remove low-confidence

488 calls. Filtered C33-A ChIP-seq peaks were combined with previously mapped aphidicolin-induced

489 FANCD2 peaks identified by ChIP-chip in these cells 26. C33-A (Supplementary Table S8) and HeLa

490 FANCD2-enriched regions (Supplementary Table S9) were combined and overlapping peaks merged

21

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

491 using bedtools MergeBED 97. Association of ChIP peaks between the three FANCD2 datasets were

492 determined by permutation testing using regioneR 98. Combined FANCD2 peaks from C33-A and HeLa

493 are listed in Supplementary Table S10. FANCD2-enriched regions were compared to aphidicolin-induced

494 common fragile sites characterized in lymphoblast cells (FRA regions) and mitotic DNA synthesis regions

495 characterized in HeLa cells 41,42 that are listed in Supplementary Table S11 and S12, respectively, using

496 regioneR 98. FRA-regions were downloaded from the HGNC (HUGO Committee)

497 database (https://www.genenames.org/download/custom/) on 08/27/2020 using advanced filtering:

498 gd_locus_type = 'fragile site’.

499 Overlap analysis of FANCD2 enriched regions with long genes

500 The Gencode Release 19 human reference genome (GRCh37) was filtered for protein-coding genes

501 greater than or equal to 0.3 Mb in length, including UTRs (Supplementary Table S13), and used to

502 determine the overlap with FANCD2-enriched regions.

503 Enhancer dataset

504 Consensus peak sets for H3K27ac were defined as overlapping regions found in at least four out of eight

505 W12 samples using DiffBind 99,100. Enhancers were defined as genomic intervals that overlapped

506 between H3K27ac peaks and Brd4 peaks and are listed in Supplementary Table S14. Proximal and distal

507 enhancer-like cis-Regulatory Elements by ENCODE for NHEK and HeLa cells were downloaded from

508 https://screen.encodeproject.org/# (accessed September 2020) 51. ENCODE GRCh38 enhancer files were

509 converted to hg19 using the UCSC Genome Browser LiftOver tool, http://genome.ucsc.edu/cgi-

510 bin/hgLiftOver. W12 enhancers were compared to ENCODE NHEK enhancers using regioneR 98.

511 Overlap analysis of integration breakpoints with fragile sites and enhancers

512 The intersect between the genomic coordinates of HPV integration breakpoints (-/+ 50 Kb flank regions,

513 7,9) with enhancers (Supplementary Table S14) or FANCD2-enriched regions (Supplementary Table S10)

514 was analyzed using the Overlapping Pieces of Intervals function in the Galaxy genomics platform

22

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

515 (https://usegalaxy.org/). Samples with multiple reported breakpoints within the same chromosome

516 were classified as a single integration locus if the 5’ and 3’ most breakpoints were within 3 Mb of each

517 other. This 3 Mb cut-off was based on manual analysis of the distance between clustered breakpoints

518 identified by WGS and/or hybrid-capture technologies for the CESC and HNSCC datasets. Each

519 integration locus was assigned a unique integration ID so that the number of breakpoints per integration

520 could be categorized as a cluster. Each integration breakpoint was analyzed independently for their

521 association with the genomic feature of interest, as well as with adjacent HPV breakpoints, which were

522 classified as belonging to the same integration locus/cluster. For significance testing, the data was

523 permutated 10,000 times to create an expected distribution of the overlap between integration

524 breakpoints and loci with FANCD2-enriched regions or enhancers using regioneR 98.

525 Super-enhancer dataset

526 Super-enhancers were defined in the Brd4 and H3K27ac W12 ChIP-seq datasets using the Rank Ordering

527 of Super-Enhancers (ROSE) tool, using default parameters 53,54. Enhancers defined in W12 cells

528 (Supplementary Table S14) were used as the input list of enhancers. Super-enhancers were defined by

529 Brd4 and/or H3K27ac consensus peaks that mapped in at least six out of twelve W12 samples for the

530 20831, 20862 and 20863 subclones and are listed in Supplementary Table S16. Super-enhancers

531 mapped in the 20861 subclone were excluded as they were masked by the amplified viral-host derived

532 super-enhancer-like element at the HPV integration site in these cells 25.

533 Gene ontology analysis of W12 enhancers

534 W12 enhancers that overlapped with CESC integration breakpoints (-/+ 50 Kb flanks) were analyzed

535 using the Genomic Regions Enrichment of Annotations Tool (GREAT) using default parameters, accessed

536 July 2021, http://great.stanford.edu/public/html/. All enhancers profiled in W12 cells (Supplementary

537 Table S14) were used as the input list of background regions.

23

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

538 DATA AVAILABILITY

539 The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus 101 and

540 are accessible through GEO Series accession number GSE183048

541 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE183048).

542 ACKNOWLEDGEMENTS

543 We thank Justin Lack (NIH/NCBR/NCI/NIAID/FNL) for advice on statistical analyses and Susan Huse

544 (NIH/NCBR/NCI/NIAID/FNL) for preliminary analysis of enhancer-capture at integration sites. The results

545 published here are in part based upon data generated by the TCGA Research Network,

546 https://www.cancer.gov/tcga. This work was funded by the Intramural Research Program of NIAID, NIH

547 grant number ZIA AI001223 LVD, and NIH grant number AI153156 (JMP). The funders had no role in

548 study design, data collection and analysis, decision to publish, or preparation of the manuscript.

549 AUTHOR CONTRIBUTIONS

550 AAM supervised the project. AAM and AW conceived and designed the study. AW compiled datasets

551 and performed ChIP-seq experiments. TEM processed ChIP-seq data. TEM designed and advised on

552 statistical analyses. AW and TEM performed statistical analysis. AAM, AW and TEM analyzed the data. JP

553 and JK identified integration breakpoints from CESC and HNSCC TCGA datasets. AAM and AW wrote the

554 manuscript. All authors discussed, critically revised, and approved the final version of the article for

555 publication.

556 COMPETING INTERESTS

557 The authors declare no conflict of interest.

558

24

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

559 REFERENCES

560 1 Viens, L. J. et al. Human Papillomavirus-Associated Cancers - United States, 2008-2012. MMWR 561 Morb Mortal Wkly Rep 65, 661-666 (2016). 562 2 Alazawi, W. et al. Genomic imbalances in 70 snap-frozen cervical squamous intraepithelial 563 lesions: associations with lesion grade, state of the HPV16 E2 gene and clinical outcome. Br J 564 Cancer 91, 2063-2070, doi:10.1038/sj.bjc.6602237 (2004). 565 3 Peter, M. et al. Frequent genomic structural alterations at HPV insertion sites in cervical 566 carcinoma. J Pathol 221, 320-330, doi:10.1002/path.2713 (2010). 567 4 Thomas, L. K. et al. Chromosomal gains and losses in human papillomavirus-associated neoplasia 568 of the lower genital tract - a systematic review and meta-analysis. Eur J Cancer 50, 85-98, 569 doi:10.1016/j.ejca.2013.08.022 (2014). 570 5 Parfenov, M. et al. Characterization of HPV and host genome interactions in primary head and 571 neck cancers. Proc Natl Acad Sci U S A 111, 15544-15549 (2014). 572 6 Ojesina, A. I. et al. Landscape of genomic alterations in cervical carcinomas. Nature 506, 371- 573 375, doi:10.1038/nature12881 (2014). 574 7 Akagi, K. et al. Genome-wide analysis of HPV integration in human cancers reveals recurrent, 575 focal genomic instability. Genome Res 24, 185-199, doi:10.1101/gr.164806.113 (2014). 576 8 Holmes, A. et al. Mechanistic signatures of HPV insertions in cervical carcinomas. NPJ Genom 577 Med 1, 16004, doi:10.1038/npjgenmed.2016.4 (2016). 578 9 Bodelon, C. et al. Chromosomal copy number alterations and HPV integration in cervical 579 precancer and invasive cancer. Carcinogenesis 37, 188-196, doi:10.1093/carcin/bgv171 (2016). 580 10 Gao, G. et al. Common fragile sites (CFS) and extremely large CFS genes are targets for human 581 papillomavirus integrations and chromosome rearrangements in oropharyngeal squamous cell 582 carcinoma. Genes Chromosomes Cancer 56, 59-74, doi:10.1002/gcc.22415 (2017). 583 11 Thorland, E. C., Myers, S. L., Gostout, B. S. & Smith, D. I. Common fragile sites are preferential 584 targets for HPV16 integrations in cervical tumors. Oncogene 22, 1225-1237, 585 doi:10.1038/sj.onc.1206170 (2003). 586 12 Thorland, E. C. et al. Human papillomavirus type 16 integrations in cervical tumors frequently 587 occur in common fragile sites. Cancer Res 60, 5916-5921 (2000). 588 13 Smith, P. P., Friedman, C. L., Bryant, E. M. & McDougall, J. K. Viral integration and fragile sites in 589 human papillomavirus-immortalized human keratinocyte cell lines. Genes Chromosomes Cancer 590 5, 150-157, doi:10.1002/gcc.2870050209 (1992). 591 14 Le Tallec, B. et al. Updating the mechanisms of common fragile site instability: how to reconcile 592 the different views? Cellular and Molecular Life Sciences 71, 4489-4494 (2014). 593 15 Bodelon, C., Untereiner, M. E., Machiela, M. J., Vinokurova, S. & Wentzensen, N. Genomic 594 characterization of viral integration sites in HPV-related cancers. Int J Cancer 139, 2001-2011, 595 doi:10.1002/ijc.30243 (2016). 596 16 Madireddy, A. et al. FANCD2 Facilitates Replication through Common Fragile Sites. Mol Cell 64, 597 388-404, doi:10.1016/j.molcel.2016.09.017 (2016). 598 17 Pentzold, C. et al. FANCD2 binding identifies conserved fragile sites at large transcribed genes in 599 avian cells. Nucleic Acids Res 46, 1280-1294, doi:10.1093/nar/gkx1260 (2018). 600 18 Chan, K. L., Palmai-Pallag, T., Ying, S. & Hickson, I. D. Replication stress induces sister-chromatid 601 bridging at fragile site loci in mitosis. Nat Cell Biol 11, 753-760, doi:10.1038/ncb1882 (2009). 602 19 Herrick, J. et al. Genomic organization of amplified MYC genes suggests distinct mechanisms of 603 amplification in tumorigenesis. Cancer research 65, 1174-1179 (2005).

25

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

604 20 Conti, C., Herrick, J. & Bensimon, A. Unscheduled DNA replication origin activation at inserted 605 HPV 18 sequences in a HPV‐18/MYC amplicon. Genes, Chromosomes and Cancer 46, 724-734 606 (2007). 607 21 Kadaja, M., Isok-Paas, H., Laos, T., Ustav, E. & Ustav, M. Mechanism of genomic instability in 608 cells infected with the high-risk human papillomaviruses. PLoS Pathog 5, e1000397 (2009). 609 22 Coquelle, A., Pipiras, E., Toledo, F., Buttin, G. & Debatisse, M. Expression of Fragile Sites Triggers 610 Intrachromosomal Mammalian Gene Amplification and Sets Boundaries to Early Amplicons. Cell 611 89, 215-225, doi:10.1016/s0092-8674(00)80201-9 (1997). 612 23 Christiansen, I. K., Sandve, G. K., Schmitz, M., Durst, M. & Hovig, E. Transcriptionally active 613 regions are the preferred targets for chromosomal HPV integration in cervical carcinogenesis. 614 PLoS One 10, e0119566, doi:10.1371/journal.pone.0119566 (2015). 615 24 Dooley, K. E., Warburton, A. & McBride, A. A. Tandemly Integrated HPV16 Can Form a Brd4- 616 Dependent Super-Enhancer-Like Element That Drives Transcription of Viral Oncogenes. MBio 7, 617 doi:10.1128/mBio.01446-16 (2016). 618 25 Warburton, A. et al. HPV integration hijacks and multimerizes a cellular enhancer to generate a 619 viral-cellular super-enhancer that drives high viral oncogene expression. PLoS Genet 14, 620 e1007179, doi:10.1371/journal.pgen.1007179 (2018). 621 26 Jang, M. K., Shen, K. & McBride, A. A. Papillomavirus genomes associate with BRD4 to replicate 622 at fragile sites in the host genome. PLoS Pathog 10, e1004117, 623 doi:10.1371/journal.ppat.1004117 (2014). 624 27 Stanley, M. A., Browne, H. M., Appleby, M. & Minson, A. C. Properties of a non-tumorigenic 625 human cervical keratinocyte cell line. Int.J.Cancer 43, 672-676 (1989). 626 28 Hu, Z. et al. Genome-wide profiling of HPV integration in cervical cancer identifies clustered 627 genomic hot spots and a potential microhomology-mediated integration mechanism. Nat Genet 628 47, 158-163, doi:10.1038/ng.3178 (2015). 629 29 Cancer Genome Atlas Research, N. et al. Integrated genomic and molecular characterization of 630 cervical cancer. Nature 543, 378-384, doi:10.1038/nature21386 (2017). 631 30 Koneva, L. A. et al. HPV Integration in HNSCC Correlates with Survival Outcomes, Immune 632 Response Signatures, and Candidate Drivers. Mol Cancer Res 16, 90-102, doi:10.1158/1541- 633 7786.MCR-17-0153 (2018). 634 31 Olthof, N. C. et al. Comprehensive analysis of HPV16 integration in OSCC reveals no significant 635 impact of physical status on viral oncogene and virally disrupted human gene expression. PLoS 636 One 9, e88718, doi:10.1371/journal.pone.0088718 (2014). 637 32 Cancer Genome Atlas, N. Comprehensive genomic characterization of head and neck squamous 638 cell carcinomas. Nature 517, 576-582, doi:10.1038/nature14129 (2015). 639 33 Liu, Y., Lu, Z., Xu, R. & Ke, Y. Comprehensive mapping of the human papillomavirus (HPV) DNA 640 integration sites in cervical carcinomas by HPV capture technology. Oncotarget 7, 5852-5864, 641 doi:10.18632/oncotarget.6809 (2016). 642 34 Liu, Y. et al. Genome-wide profiling of the human papillomavirus DNA integration in cervical 643 intraepithelial neoplasia and normal cervical epithelium by HPV capture technology. Sci Rep 6, 644 35427, doi:10.1038/srep35427 (2016). 645 35 Xu, B. et al. Multiplex Identification of Human Papillomavirus 16 DNA Integration Sites in 646 Cervical Carcinomas. PLOS ONE 8, e66693, doi:10.1371/journal.pone.0066693 (2013). 647 36 Kamal, M. et al. Human papilloma virus (HPV) integration signature in Cervical Cancer: 648 identification of MACROD2 gene as HPV hot spot integration site. Br J Cancer, 649 doi:10.1038/s41416-020-01153-4 (2020). 650 37 Li, W. et al. Characteristic of HPV Integration in the Genome and Transcriptome of Cervical 651 Cancer Tissues. Biomed Res Int 2018, 6242173-6242173, doi:10.1155/2018/6242173 (2018).

26

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

652 38 Kraus, I. et al. The majority of viral-cellular fusion transcripts in cervical carcinomas cotranscribe 653 cellular sequences of known or predicted genes. Cancer Res 68, 2514-2522, doi:10.1158/0008- 654 5472.CAN-07-2776 (2008). 655 39 Schmitz, M., Driesch, C., Jansen, L., Runnebaum, I. B. & Dürst, M. Non-Random Integration of the 656 HPV Genome in Cervical Cancer. PLOS ONE 7, e39632, doi:10.1371/journal.pone.0039632 657 (2012). 658 40 Zhang, R. et al. Dysregulation of host cellular genes targeted by human papillomavirus (HPV) 659 integration contributes to HPV-related cervical carcinogenesis. Int J Cancer 138, 1163-1174, 660 doi:10.1002/ijc.29872 (2016). 661 41 Ji, F. et al. Genome-wide high-resolution mapping of mitotic DNA synthesis sites and common 662 fragile sites by direct sequencing. Cell Res, doi:10.1038/s41422-020-0357-y (2020). 663 42 Macheret, M. et al. High-resolution mapping of mitotic DNA synthesis regions and common 664 fragile sites in the human genome through direct sequencing. Cell Res, doi:10.1038/s41422-020- 665 0358-x (2020). 666 43 Helmrich, A., Ballarino, M. & Tora, L. Collisions between replication and transcription complexes 667 cause common fragile site instability at the longest human genes. Mol Cell 44, 966-977, 668 doi:10.1016/j.molcel.2011.10.013 (2011). 669 44 Okamoto, Y. et al. Replication stress induces accumulation of FANCD2 at central region of large 670 fragile genes. Nucleic Acids Res 46, 2932-2944, doi:10.1093/nar/gky058 (2018). 671 45 Brison, O. et al. Transcription-mediated organization of the replication initiation program across 672 large genes sets common fragile sites genome-wide. Nat Commun 10, 5693, 673 doi:10.1038/s41467-019-13674-5 (2019). 674 46 Lee, J.-E. et al. Brd4 binds to active enhancers to control cell identity gene induction in 675 adipogenesis and myogenesis. Nature Communications 8, doi:10.1038/s41467-017-02403-5 676 (2017). 677 47 Brown, J. D. et al. BET bromodomain proteins regulate enhancer function during adipogenesis. 678 Proc Natl Acad Sci U S A 115, 2144-2149, doi:10.1073/pnas.1711155115 (2018). 679 48 Najafova, Z. et al. BRD4 localization to lineage-specific enhancers is associated with a distinct 680 transcription factor repertoire. Nucleic Acids Res 45, 127-141, doi:10.1093/nar/gkw826 (2017). 681 49 Jang, M. K., Kwon, D. & McBride, A. A. Papillomavirus E2 proteins and the host BRD4 protein 682 associate with transcriptionally active cellular chromatin. J Virol 83, 2592-2600, 683 doi:10.1128/JVI.02275-08 (2009). 684 50 Helfer, C. M., Yan, J. & You, J. The cellular bromodomain protein Brd4 has multiple functions in 685 E2-mediated papillomavirus transcription activation. Viruses 6, 3228-3249, 686 doi:10.3390/v6083228 (2014). 687 51 Consortium, E. P. et al. Expanded encyclopaedias of DNA elements in the human and mouse 688 genomes. Nature 583, 699-710, doi:10.1038/s41586-020-2493-4 (2020). 689 52 McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat 690 Biotechnol 28, 495-501, doi:10.1038/nbt.1630 (2010). 691 53 Whyte, Warren A. et al. Master Transcription Factors and Mediator Establish Super-Enhancers at 692 Key Cell Identity Genes. Cell 153, 307-319, doi:10.1016/j.cell.2013.03.035 (2013). 693 54 Lovén, J. et al. Selective inhibition of tumor oncogenes by disruption of super-enhancers. Cell 694 153, 320-334 (2013). 695 55 Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 696 173, 371-385.e318, doi:https://doi.org/10.1016/j.cell.2018.02.060 (2018). 697 56 Warburton, A., Della Fera, A. N. & McBride, A. A. Dangerous liaisons: long-term replication with 698 an extrachromosomal HPV genome. Viruses submitted (2021).

27

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

699 57 Moody, C. A. Impact of Replication Stress in Human Papillomavirus Pathogenesis. Journal of 700 virology 93, doi:10.1128/JVI.01012-17 (2019). 701 58 Anacker, D. C. & Moody, C. A. Modulation of the DNA damage response during the life cycle of 702 human papillomaviruses. Virus research 231, 41-49, doi:10.1016/j.virusres.2016.11.006 (2017). 703 59 Schiffman, M. et al. Carcinogenic human papillomavirus infection. Nat Rev Dis Primers 2, 16086, 704 doi:10.1038/nrdp.2016.86 (2016). 705 60 Mac, M. & Moody, C. A. Epigenetic Regulation of the Human Papillomavirus Life Cycle. 706 Pathogens (Basel, Switzerland) 9, doi:10.3390/pathogens9060483 (2020). 707 61 Flores, E. R. & Lambert, P. F. Evidence for a switch in the mode of human papillomavirus type 16 708 DNA replication during the viral life cycle. J.Virol. 71, 7167-7179 (1997). 709 62 Sakakibara, N., Chen, D. & McBride, A. A. Papillomaviruses use recombination-dependent 710 replication to vegetatively amplify their genomes in differentiated cells. PLoS Pathog 9, 711 e1003321, doi:10.1371/journal.ppat.1003321 (2013). 712 63 Kadaja, M. et al. Genomic instability of the host cell induced by the human papillomavirus 713 replication machinery. EMBO J 26, 2180-2191, doi:10.1038/sj.emboj.7601665 (2007). 714 64 Myllykangas, S. et al. DNA copy number amplification profiling of human neoplasms. Oncogene 715 25, 7324-7332, doi:10.1038/sj.onc.1209717 (2006). 716 65 Hellman, A. et al. A role for common fragile site induction in amplification of human oncogenes. 717 Cancer Cell 1, 89-97, doi:10.1016/s1535-6108(02)00017-x (2002). 718 66 Wilson, T. E. et al. Large transcription units unify copy number variants and common fragile sites 719 arising under replication stress. Genome Res 25, 189-200, doi:10.1101/gr.177121.114 (2015). 720 67 Kelley, D. Z. et al. Integrated Analysis of Whole-Genome ChIP-Seq and RNA-Seq Data of Primary 721 Head and Neck Tumor Samples Associates HPV Integration Sites with Open Chromatin Marks. 722 Cancer Res 77, 6538-6550, doi:10.1158/0008-5472.CAN-17-0833 (2017). 723 68 Gagliardi, A. et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus 724 clade-specific epigenome and transcriptome landscapes. Nat Genet 52, 800-810, 725 doi:10.1038/s41588-020-0673-7 (2020). 726 69 Schmidt, E. V. The role of c-myc in cellular growth control. Oncogene 18, 2988-2996, 727 doi:10.1038/sj.onc.1202751 (1999). 728 70 Holbro, T. The ErbB receptors and their role in cancer progression. Experimental Cell Research 729 284, 99-110, doi:10.1016/s0014-4827(02)00099-x (2003). 730 71 Dong, J. T. & Chen, C. Essential role of KLF5 transcription factor in cell proliferation and 731 differentiation and its implications for human diseases. Cell Mol Life Sci 66, 2691-2706, 732 doi:10.1007/s00018-009-0045-z (2009). 733 72 Soares, E. & Zhou, H. Master regulatory role of p63 in epidermal development and disease. Cell 734 Mol Life Sci 75, 1179-1190, doi:10.1007/s00018-017-2701-z (2018). 735 73 Takata, M. et al. The Rad51 paralog Rad51B promotes homologous recombinational repair. Mol 736 Cell Biol 20, 6476-6482, doi:10.1128/mcb.20.17.6476-6482.2000 (2000). 737 74 Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell 155, 934-947 738 (2013). 739 75 McBride, A. A., Warburton, A. & Khurana, S. Multiple Roles of Brd4 in the Infectious Cycle of 740 Human Papillomaviruses. Frontiers in Molecular Biosciences 8, doi:10.3389/fmolb.2021.725794 741 (2021). 742 76 You, J., Croyle, J. L., Nishimura, A., Ozato, K. & Howley, P. M. Interaction of the bovine 743 papillomavirus E2 protein with Brd4 tethers the viral DNA to host mitotic chromosomes. Cell 744 117, 349-360 (2004). 745 77 Baxter, M. K., McPhillips, M. G., Ozato, K. & McBride, A. A. The mitotic chromosome binding 746 activity of the papillomavirus E2 protein correlates with interaction with the cellular

28

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

747 chromosomal protein, Brd4. Journal of virology 79, 4806-4818, doi:10.1128/JVI.79.8.4806- 748 4818.2005 (2005). 749 78 Goodwin, E. C. & DiMaio, D. Repression of human papillomavirus oncogenes in HeLa cervical 750 carcinoma cells causes the orderly reactivation of dormant tumor suppressor pathways. 751 Proc.Natl.Acad.Sci.U.S.A 97, 12513-12518 (2000). 752 79 Jeon, S. & Lambert, P. F. Integration of human papillomavirus type 16 DNA into the human 753 genome leads to increased stability of E6 and E7 mRNAs: implications for cervical 754 carcinogenesis. Proc Natl Acad Sci U S A 92, 1654-1658, doi:10.1073/pnas.92.5.1654 (1995). 755 80 Wentzensen, N. et al. Characterization of viral-cellular fusion transcripts in a large series of 756 HPV16 and 18 positive anogenital lesions. Oncogene 21, 419-426, doi:10.1038/sj.onc.1205104 757 (2002). 758 81 Schmieder, R. E., R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 759 27, 863-864, doi:10.1093/bioinformatics/btr026 (2011). 760 82 Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357- 761 359, doi:10.1038/nmeth.1923 (2012). 762 83 Langmead, B., Wilks, C., Antonescu, V. & Charles, R. Scaling read aligners to hundreds of threads 763 on general-purpose processors. Bioinformatics 35, 421-432, doi:10.1093/bioinformatics/bty648 764 (2019). 765 84 McGinnis, S. & Madden, T. L. BLAST: at the core of a powerful and diverse set of sequence 766 analysis tools. Nucleic acids research 32, W20-W25, doi:10.1093/nar/gkh435 (2004). 767 85 Katz, J. P. & Pipas, J. M. SummonChimera infers integrated viral genomes with nucleotide 768 precision from NGS data. BMC Bioinformatics 15, 348, doi:10.1186/s12859-014-0348-4 (2014). 769 86 Harvard, B. I. o. M. a. Broad Institute TCGA Genome Data Analysis Center (2016): SNP6 Copy 770 number analysis (GISTIC2); Cervical Squamous Cell Carcinoma and Endocervical 771 Adenocarcinoma. doi:doi:10.7908/C16D5SCD (2016). 772 87 Harvard, B. I. o. M. a. Broad Institute TCGA Genome Data Analysis Center (2016): SNP6 Copy 773 number analysis (GISTIC2); Head and Neck Squamous Cell Carcinoma. . 774 doi:doi:10.7908/C1V987FP (2016). 775 88 Jeon, S., Allen-Hoffmann, B. L. & Lambert, P. F. Integration of human papillomavirus type 16 into 776 the human genome correlates with a selective growth advantage of cells. J Virol 69, 2989-2997, 777 doi:10.1128/JVI.69.5.2989-2997.1995 (1995). 778 89 Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. 779 EMBnet.journal 17, doi:10.14806/ej.17.1.200 (2011). 780 90 Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 781 489, 57-74, doi:10.1038/nature11247 (2012). 782 91 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. 783 Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009). 784 92 Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079, 785 doi:10.1093/bioinformatics/btp352 (2009). 786 93 Kharchenko, P. V., Tolstorukov, M. Y. & Park, P. J. Design and analysis of ChIP-seq experiments 787 for DNA-binding proteins. Nat Biotechnol 26, 1351-1359, doi:10.1038/nbt.1508 (2008). 788 94 Xu, S., Grullon, S., Ge, K. & Peng, W. Spatial clustering for identification of ChIP-enriched regions 789 (SICER) to map regions of histone methylation patterns in embryonic stem cells. Methods Mol 790 Biol 1150, 97-111, doi:10.1007/978-1-4939-0512-6_5 (2014). 791 95 Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 9, R137, doi:10.1186/gb- 792 2008-9-9-r137 (2008). 793 96 Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. 794 Nucleic Acids Res 44, W160-165, doi:10.1093/nar/gkw257 (2016).

29

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

795 97 Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. 796 Bioinformatics 26, 841-842, doi:10.1093/bioinformatics/btq033 (2010). 797 98 Gel, B. et al. regioneR: an R/Bioconductor package for the association analysis of genomic 798 regions based on permutation tests. Bioinformatics 32, 289-291, 799 doi:10.1093/bioinformatics/btv562 (2016). 800 99 Stark, R. & Brown, G. D. DiffBind: differential binding analysis of ChIP-Seq peak data. (2011). 801 100 Ross-Innes, C. S. et al. Differential oestrogen receptor binding is associated with clinical outcome 802 in breast cancer. Nature 481, 389-393, doi:10.1038/nature10730 (2012). 803 101 Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and 804 hybridization array data repository. Nucleic Acids Res 30, 207-210, doi:10.1093/nar/30.1.207 805 (2002). 806

807 FIGURE LEGENDS

808 Figure 1. Overview of datasets

809 Schematic representation of datasets used for overlap analysis of CESC and HNSCC integration sites with

810 enhancers mapped in W12 cervical keratinocytes and FANCD2-associated common fragile sites mapped

811 in C33-A and HeLa cervical carcinoma cell lines. APOT, Amplification of papillomavirus oncogene

812 transcripts; WGS, Whole genome sequencing; WXS, Whole exome sequencing.

813 Figure 2. HPV integration loci frequently contain clustered insertional breakpoints

814 (a), Schematic representation of HPV integration breakpoints and loci. Green lines represent integration

815 breakpoints. Integration loci are defined as either a single breakpoint, or multiple, closely spaced

816 breakpoints (a cluster). Samples with clustered breakpoints within the same chromosome are classified

817 as a single integration locus if the 5’ and 3’ most breakpoints are within 3 Mb of each other. (b),

818 Schematic representation of clustered breakpoints at CESC integration loci across the human genome.

819 Lines connecting to each chromosome represent different integration loci. Blue circles represent the

820 indicated number of breakpoints per integration locus; orange boxed regions represent integration

821 hotspots. See Supplementary Figure S3 for the distribution of clustered breakpoints at integration loci in

822 HNSCC tumors. (c), Scatter plot showing the frequency of single and clustered breakpoints per

30

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

823 integration locus for CESC tumors grouped according to whether they overlap integration hotspots. The

824 p-value is based on a non-parametric, unpaired t-test (two-tailed; **P<0.01).

825 Figure 3. Transcription status of integrated viral genomes 826

827 (a-b), Bubble graph showing the percentage of transcriptionally active integration loci per tumor in

828 CESC (a) and HNSCC (b) samples relative to the number of integration loci per tumor; 100% indicates

829 that all integration loci are active in that tumor. Orange, blue and grey circles represent tumors with a

830 single, multiple or no transcriptionally active loci, respectively. Circle size indicates the number of

831 samples per grouping (for CESC, largest, n=86 and smallest, n=1; for HNSCC, largest, n=21 and smallest,

832 n=1). Three HNSCC samples (TCGA-CR-6482, TCGA-CN-5374 and TCGA-CR-7404) were reported as

833 integration negative from RNA-seq 30 but had a single or multiple integration loci detected through WGS

834 5 and were therefore classified as transcriptionally inactive. (c), Bar chart showing the number of CESC

835 integration loci that are transcriptionally active or inactive for viral oncogene expression at integration

836 hotspots. Association between viral oncogene transcription and integration hotspots was based on a

837 Fisher’s exact test (two-tailed; **P<0.01).

838 Figure 4. Clustered integration breakpoints are associated with amplified regions of the host genome

839 in CESC and HNSCC

840 For the subset of CESC and HNSCC samples that had matched somatic copy number alteration data, HPV

841 integration breakpoints were grouped according to the associated host DNA copy number status.

842 Normal, AMP (amplification) and DEL (deletion) refers to the genomic profile of the host DNA at the

843 integration locus. (a-b), Scatter plots showing the number of breakpoints per locus grouped according to

844 the somatic copy number alteration status of the integration locus for CESC (a) and HNSCC (b) tumors.

845 For CESC, the number of integration loci per grouping was Normal, n=140; AMP, n=240 and DEL, n=17.

846 For HNSCC, the number of integration loci per grouping was Normal, n=7; AMP, n=28 and DEL, n=1. P-

31

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

847 values are based on non-parametric, unpaired t-tests (two-tailed; *p<0.05, ****p<0.0001, NS=Non-

848 significant). All statistical tests were performed relative to integration loci with a normal genomic

849 profile. DNA amplification associated with integration loci ranged from 693 bp to 54.2 Mb (average, 1.6

850 Mb; median 47.4 Kb) in CESC and 6.5 Kb to 102.7 Mb (average, 3.3 Mb; median 43.1 Kb) in HNSCC. (c-d),

851 Schematic representation of clustered breakpoints at integration loci that have associated host somatic

852 copy number alterations. Lines connecting to each chromosome represent different integration loci for

853 the CESC (c) and HNSCC (d) datasets. The number of circles represents the number of breakpoints per

854 integration locus. Blue, green and red colored circles respectively represent integration sites that have a

855 normal genomic profile or associated amplifications or deletions. Orange boxed regions represent

856 integration hotspots. (e) Stacked bar chart showing the number of CESC integration loci that overlap

857 integration hotspots grouped according to whether they have associated somatic copy number

858 alterations. Association between somatic copy number alterations and integration hotspots was based

859 on a chi-square test (*p<0.05).

860 Figure 5. Cervical keratinocyte-specific fragile sites mapped by FANCD2 ChIP-seq

861 HPV-negative (C33-A) and HPV18-positive (HeLa) cervical carcinoma cells were treated for 24 hours with

862 0.2 µM aphidicolin and FANCD2-enriched regions were identified by ChIP-seq. All analyses were

863 performed using the combined C33-A and HeLa FANCD2 dataset. (a), Bar graph showing the genomic

864 coverage of FANCD2-enriched regions relative to common fragile sites (FRA regions) and mitotic DNA

865 synthesis (MDS) regions reported in the literature 41,42. (b), Venn diagram showing the regions of overlap

866 between fragile sites identified in the FANCD2, FRA and MDS datasets. (c), Venn diagram showing the

867 overlap of FANCD2-associated fragile sites with protein-coding genes longer or equal to 0.3 Mb. Red

868 circle represents all long genes; blue circle represents long genes that are expressed in C33-A and/or

869 HeLa cells. (d), Alignment of FANCD2-enriched regions in C33-A (red) and HeLa (yellow) cells with

870 associated genes (blue bars represent genes >0.3 Mb; genes expressed in C33-A and/or HeLa cells are

32

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

871 indicated by gene name) and FRA and MDS regions (black bars). Red and yellow bars below the FANCD2

872 ChIP-seq signal tracks represent peaks mapped by SICER analysis in the corresponding cell lines.

873 Figure 6. Integration hotspots are associated with cellular super-enhancers

874 H3K27ac and Brd4 enriched regions were profiled in HPV16-positive cervical derived W12 keratinocyte

875 subclones by ChIP-seq. Enhancer regions were defined as peaks that overlapped in both H3K27ac and

876 Brd4 datasets, and that were identified across the four W12 subclones. (a), GREAT (Genomic Regions

877 Enrichment of Annotations Tool) gene ontology analysis was performed using W12 enhancers that

878 overlapped CESC integration breakpoints (-/+ 50 Kb flanks) as input, and compared against all W12

879 enhancers, to identify putative target genes associated with these cis-regulatory regions based on

880 enhancer frequency. Bars represent putative target genes plotted against their FDR (false discovery rate)

881 adjusted p-values (q-value). Blue and grey bars represent genes that overlap integration hotspots and

882 sites of non-recurrent integration, respectively. Enriched target genes within the same genomic locus

883 were grouped (e.g. KLF5 and KLF12) and plotted using the most significant q-value. (b), Venn diagram

884 showing the regions of overlap between integration loci, integration hotspots and super-enhancers

885 mapped in W12 subclones. (c), Bar chart showing the number of CESC integration loci that were

886 grouped according to whether or not they are integration hotspots and plotted based on their overlap

887 with super-enhancers, FANCD2-associated fragile sites or both genomic features. Numbers above the

888 graph indicate the total number of integration loci within each grouping. (d), Bar chart showing the

889 number of CESC integration loci that are associated with super-enhancers (SE) that were grouped

890 according to whether they are integration hotspots and plotted based on their viral transcription status.

891 Numbers above the graph indicate the total number of integration loci that have associated viral

892 transcription data within each grouping. (e), Alignment of Brd4 (blue) and H3K27ac (red) ChIP-seq

893 signals mapped in W12 cervical keratinocytes at integration hotspots (top black bars; size indicated in

894 Mb) in cervical carcinomas. Grey bars represent amplified (AMP) host DNA in different CESC tumors

33

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

895 from The Cancer Genome Atlas. Green, yellow, and black bars below the ChIP-seq signal tracks

896 represent super-enhancers (SE) mapped in W12 subclones, FANCD2-associated fragile sites mapped in

897 C33-A and HeLa cells and CESC integration loci, respectively. Genes identified from GREAT gene ontology

898 analysis 52 and cancer driver genes 55 are indicated by blue bars. Each integration hotspot is

899 characterized in Supplementary Figure S5 and Table S4. (f), Bar chart showing the number of CESC

900 integration loci that are associated with super-enhancers (SE) that were grouped according to whether

901 they are integration hotspots and plotted based on their host somatic copy number alternation status.

902 The number of integration loci per grouping was normal, n=140; amplification (AMP), n=240 and

903 deletion (DEL), n=17.

904 Figure S1. Characteristics of CESC and HNSCC HPV integration datasets

905 (a), The distribution of cervical carcinomas (CESC, n=333) based on tissue histology. Squamous cell

906 carcinomas, adenocarcinomas and adenosquamous carcinomas accounted for 77.2%, 8.1% and 1.8% of

907 cervical samples, respectively. CESC that were not specified by histology type accounted for 12.9% of the

908 samples. (b), The frequency of HPV types across the different histological subtypes. One CESC sample

909 (TCGA-EA-A43B) had two integration sites of different viral types and was therefore excluded from these

910 counts. HPV16 (67.5%) and HPV18 (17.2%) were the most frequent HPV types detected in CESC. HPV16

911 was the predominant viral type in squamous cell carcinomas (HPV16, 68.0%; HPV18, 13.3%; other,

912 18.8%) and adenocarcinomas (HPV16, 59.3%; HPV18, 37.0%; other, 3.7%), whereas HPV18 was the

913 predominant viral type in adenosquamous carcinomas (HPV18, 83.3%; HPV16, 16.7%; other, none).

914 Head and neck squamous cell carcinomas (n=41) included in this study were predominantly male

915 subjects (males, 82.9%; females, 17.1%), Supplementary Table S2. (c), Tumors of the oropharynx,

916 including the base of tongue and tonsils, accounted for 70.7% of samples. The remainder of tumors

917 were from the oral cavity (24.4%) and larynx (4.9%). (d), The percentage distribution of HPV type by

918 HNSCC tumor location. HPV16 was the most predominant viral type in HNSCC (90.2%) and accounted for

34

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

919 96.6%, 70.0% and 100% tumors of the oropharynx, oral cavity and larynx, respectively. The remainder of

920 HNSCC samples were positive for HPV33 (7.3%) and HPV35 (2.4%) that were isolated from the oral cavity

921 and oropharynx, respectively. (e-h), The distribution of samples and integration breakpoints included in

922 this study that were grouped by sequencing method. (e-f), For the CESC dataset, 50.8% samples had

923 associated transcription-based data (e) and all integration breakpoints were detected through next-

924 generation sequencing technologies (hybrid-capture/WGS/WXS with probes added to capture the

925 integrated HPV DNA sequence) (f). (g), The HNSCC dataset was limited in size as >50% samples were

926 identified from RNA sequencing alone and did not have matched DNA sequences. (h), Approximately

927 54% HNSCC samples were processed from WGS and 46% from DIPS, and all had matched RNA data

928 (RNA-seq and APOT, respectively). *Two samples were also validated by RNA-seq analysis in addition to

929 APOT. APOT, Amplification of papillomavirus oncogene transcripts; DIPS, Detection of integrated

930 papillomavirus sequences; WGS, Whole genome sequencing; WXS, Whole exome sequencing.

931 Figure S2. Integration hotspots defined in cervical tumors

932 Schematic representation of integration hotspots across the human genome defined previously in the

933 literature (red bars) and in our cervical carcinoma, CESC, dataset (blue bars).

934 Figure S3. Distribution of clustered breakpoints at HNSCC integration loci

935 Schematic representation of clustered breakpoints at HNSCC integration loci across the human genome.

936 Lines connecting to each chromosome represent different integration loci. Blue circles represent the

937 indicated number of breakpoints per integration locus.

938 Figure S4. Keratinocyte-specific enhancers mapped by Brd4 and H3K27ac ChIP-seq

939 (a), Venn diagram showing the regions of overlap between Brd4 and H3K27ac ChIP-seq signals profiled

940 in W12 cervical keratinocytes with enhancers defined in NHEK (Normal Human Epidermal Keratinocytes)

941 cells from ENCODE 51. Permutation testing was used to determine the significance in overlap between

942 W12 Brd4/H3K27ac consensus peaks and ENCODE NHEK enhancers (p<0.0001). (b), Alignment of Brd4

35

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

943 (blue) and H3K27ac (red) ChIP-seq signals mapped in W12 cervical keratinocytes with enhancers defined

944 in the HeLa and NHEK ENCODE datasets. Black bars below the ChIP-seq signal tracks represent W12

945 enhancers that were defined by consensus peaks for Brd4 and H3K27ac enrichment. (c), GREAT

946 (Genomic Regions Enrichment of Annotations Tool) gene ontology analysis was performed using W12

947 enhancers that overlapped HNSCC integration breakpoints (-/+ 50 Kb flanks) as input, and compared

948 against all W12 enhancers, to identify putative target genes associated with these cis-regulatory regions.

949 Bars represent putative target genes plotted against their FDR (false discovery rate) adjusted p-values

950 (q-value). Blue and grey bars represent genes that overlap integration hotspots and sites of non-

951 recurrent integration, respectively. Enriched target genes within the same genomic locus were grouped

952 (e.g. KLF5 and KLF12) and plotted using the most significant q-value. (d), Alignment of Brd4 (blue) and

953 H3K27ac (red) ChIP-seq signals mapped in W12 cervical keratinocytes at integration hotspots (black

954 bars; size indicated in Mb) in cervical carcinomas. Grey bars represent amplified (AMP) host DNA in

955 different HNSCC tumors from The Cancer Genome Atlas. Green, yellow, and black bars below the ChIP-

956 seq signal tracks represent super-enhancers (SE) mapped in W12 subclones, FANCD2-associated fragile

957 sites mapped in C33-A and HeLa cells and CESC/HNSCC integration loci, respectively. Genes identified

958 from GREAT gene ontology analysis 52 and cancer driver genes 55 are indicated by blue bars.

959 Figure S5. Genomic landscape of integration hotspots

960 Alignment of Brd4 (blue) and H3K27ac (red) ChIP-seq signals mapped in W12 cervical keratinocytes at

961 integration hotspots (top black bars; size indicated in Mb) in cervical carcinomas. Numbers above each

962 integration hotspot indicates the hotspot ID referenced in Supplementary Table S4. Grey bars represent

963 amplified (AMP) host DNA in different CESC tumors from The Cancer Genome Atlas. Green, yellow, and

964 black bars below the ChIP-seq signal tracks represent super-enhancers (SE) mapped in W12 subclones,

965 FANCD2-associated fragile sites mapped in C33-A and HeLa cells and CESC/HNSCC integration loci,

966 respectively. Long protein-coding genes (>0.3 Mb in length), genes identified from GREAT gene ontology

36

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

967 analysis 52 and cancer driver genes 55 are indicated by blue bars. Each integration hotspot is further

968 characterized in Supplementary Table S4.

969 SUPPLEMENTARY TABLES

970 Table S1. CESC and HNSCC datasets

971 Cases represent the number (n) of patient samples with integrated HPV genomes detected by DNA

972 and/or RNA sequencing methods. For hybrid-capture method, only integration breakpoints that were

973 validated by Sanger sequencing were included in this study. Somatic copy number alteration (SCNA)

974 datasets for CESC and HNSCC TCGA samples were downloaded from the Broad Institute 86,87.

975 Table S2. CESC HPV integration breakpoints (n=1,299)

976 CESC HPV integration sites (Patient ID, chromosome, integration breakpoint, HPV type, HPV breakpoints,

977 tumour type, detection method used for calling integration breakpoints, target gene, (viral oncogene)

978 transcription status, SCNA status, author of study, Pubmed ID). Only integration breakpoints detected

979 through DNA sequencing were used for overlap analyses.

980 Table S3. HNSCC HPV integration breakpoints (n=119) included in this study

981 HNSCC HPV integration sites (Patient ID, chromosome, integration breakpoint, HPV type, HPV

982 breakpoints, tumour location, detection method used for calling integration breakpoints, target gene,

983 (viral oncogene) transcription status, SCNA status, gender, smoking status, author of study, Pubmed ID).

984 Only integration breakpoints detected through DNA sequencing were used for overlap analyses.

985 Table S4. Integration hotspots defined in CESC

986 Table S5. Integration hotspots defined in the literature

987 Overlapping intervals for hotspots identified from the literature were merged into a single interval using

988 the MergeBED tool (Merged hotspot position) for comparison to hotspots defined in our CESC dataset

989 (Supplementary Figure S2).

37

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

990 Table S6. CESC integrations with associated SCNA included in this study

991 Samples that had associated host genome amplification data were classified based on the somatic copy

992 number alteration (SCNA) status at the site of integration. Normal, normal genomic profile; AMP,

993 amplification; DEL, deletion; DEL (excluded; overlapping deletion), deletions that overlap integration

994 breakpoints were excluded as these likely reflect the alternative chromosome as sequencing of the

995 chimeric viral-host junction was available for the associated HPV insertion sites

996 Table S7. HNSCC integrations with associated SCNA included in this study

997 Samples that had associated host genome amplification data were classified based on the somatic copy

998 number alteration (SCNA) status at the site of integration. Normal, normal genomic profile; AMP,

999 amplification; DEL, deletion; DEL (excluded; overlapping deletion), these likely reflect the alternative

1000 chromosome as sequencing of the chimeric viral-host junction was available for the associated HPV

1001 insertion sites

1002 Table S8. FANCD2-associated fragile sites mapped in C33-A cervical carcinoma cells

1003 C33-A FANCD2 peaks identified by ChIP-seq, filtered by -log10 q-value >10, combined with FANCD2

1004 ChIP-chip peaks previously reported in C33-A cells 26.

1005 Table S9. FANCD2-associated fragile sites mapped in HeLa cervical carcinoma cells

1006 HeLa FANCD2 peaks identified by ChIP-seq, filtered by -log10 q-value >10.

1007 Table S10. FANCD2-associated fragile sites mapped in C33-A and HeLa cervical carcinoma cells

1008 C33-A FANCD2 peaks identified by ChIP-chip and ChIP-seq (Supplementary Table S6) were combined

1009 with HeLa ChIP-seq peaks (Supplementary Table S7). Overlapping FANCD2 peaks from the C33-A and

1010 HeLa datasets were merged (merging based on nearby peaks was set to 0 bp).

1011 Table S11. Aphidicolin-induced common fragile sites from the HGNC Database

38

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

1012 https://www.genenames.org/download/custom/. Bold font indicates FRA-regions that overlap with

1013 FANCD2-enriched regions listed in Supplementary Table S8. A total of 43/77 (55.8%) FRA-regions

1014 overlap with FANCD2-enriched regions profiled in C33-A and HeLa cells.

1015 Table S12. Mitotic DNA synthesis (MDS) regions that overlap with FANCD2 peaks

1016 MDS regions profiled in HeLa cells 41,42 were compiled and overlapping regions merged (merging based

1017 on nearby peaks was set to 0 bp). A total of 112/232 (48.3%) MDS regions overlapped with FANCD2-

1018 enriched regions profiled in C33-A and HeLa cells.

1019 Table S13. FANCD2-enriched regions that overlap long genes

1020 The Gencode Release 19 human reference genome (GRCh37) was filtered to identify protein-coding

1021 genes longer than or equal to 0.3 Mb, including UTRs. A total of 185/782 (23.7%) long genes overlapped

1022 with FANCD2-enriched regions profiled in C33-A and HeLa cells and are indicated in bold font. Genes

1023 that are transcriptionally active in C33-A and/or HeLa cells were determined from RNA-seq (26 and

1024 Expression Atlas, accessed September 2020, https://www.ebi.ac.uk/gxa/home) and are indicated in blue

1025 font. 121/185 (65.4%) long genes that overlapped with FANCD2-enriched regions are expressed in C33-A

1026 and/or HeLa cells.

1027 Table S14. Brd4/H3K27ac consensus enhancer peaks profiled in W12 subclones

1028 Brd4 and H3K27ac peaks were identified through ChIP-seq in HPV16-positive W12 cervical keratinocyte

1029 subclones (20831, 20861, 20862 and 20863). Consensus peaks for H3K27ac were identified across the

1030 four W12 subclones. Enhancers were defined as the overlapping genomic intervals between H3K27ac

1031 consensus peaks and Brd4 peaks.

1032 Table S15. Top 20 biological processes associated with enhancers that overlap integration breakpoints

1033 in CESC and HNSCC

1034 Tables S16. Super-enhancers defined in W12 cervical keratinocytes

1035 Table S17. Cancer driver genes within 1 Mb of super-enhancers that overlap integration loci

39

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

1036 TABLES

1037 Table 1. Overlap of integration breakpoints with FANCD2-assocated fragile sites

Dataset Integration subgroups Breakpoints or loci Overlap with % p-value per subgroup (n) FANCD2 sites (n) CESC All breakpoints 1,299 229 17.6 0.0001 Single breakpoints 258 39 15.1 0.0044 Clustered breakpoints 1,041 190 18.3 0.0001 Condensed clustered breakpoints 326 59 18.1 0.0001 Combined single and condensed 584 98 16.8 0.0001 breakpoints HNSCC* All breakpoints 118 21 17.8 0.0039 Single breakpoints 27 2 7.4 0.5011 Clustered breakpoints 91 19 20.9 0.0009 Condensed clustered breakpoints 30 5 16.7 0.2293 Combined single and condensed 57 7 12.3 0.3700 breakpoints 1038

1039 CESC and HNSCC integration breakpoints (-/+ 50 Kb flank regions) were grouped by the number of

1040 breakpoints per integration locus; single indicates one breakpoint and clustered indicates two or more

1041 breakpoints. Integration breakpoints from each subgroup were intersected with FANCD2-enriched

1042 regions and the frequency of overlap calculated. For the All breakpoints, Single breakpoints (not clustered)

1043 and Breakpoints located within a cluster subgroups, each breakpoint was tested independently for its

1044 overlap with FANCD2-enriched regions, regardless of whether it was part of a cluster or not. For the A

1045 single cluster of breakpoints (using 5'-3' endpoints) subgroup, the region spanning the most 5' and 3'

1046 breakpoints of an integration locus was used to test for the overlap with FANCD2-enriched regions. For

1047 the Single and clustered 5’-3’ breakpoints subgroup, the Single breakpoints (not clustered) and A single

1048 cluster of breakpoints (using 5'-3' endpoints) subgroups were combined for overlap analysis. The data was

1049 permutated 10,000 times to create an expected distribution of overlap. *A single integration breakpoint

1050 on chromosome Y was excluded from this analysis. Bold font indicates significant p-values.

1051

40

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

1052 Table 2. Overlap of integration breakpoints with keratinocyte-specific enhancers

Dataset Integration subgroups Breakpoints or loci Overlap with % p-value per subgroup (n) Enhancers (n) CESC All breakpoints 1,299 495 38.1 0.0001 Single breakpoints 258 67 26.0 0.0001 Clustered breakpoints 1,041 428 41.1 0.0001 Condensed clustered breakpoints 326 162 49.7 0.0001 Combined single and condensed 584 229 39.2 0.0001 breakpoints HNSCC* All breakpoints 118 43 36.4 0.0001 Single breakpoints 27 9 33.3 0.0052 Clustered breakpoints 91 34 37.4 0.0001 Condensed clustered breakpoints 30 13 43.3 0.0011 Combined single and condensed 57 22 38.6 0.0003 breakpoints 1053

1054 CESC and HNSCC integration breakpoints (-/+ 50 Kb flank regions) were grouped by the number of

1055 breakpoints per integration locus; single indicates one breakpoint and clustered indicates two or more

1056 breakpoints. Integration breakpoints from each subgroup were intersected with cellular enhancers and

1057 the frequency of overlap calculated. For the All breakpoints, Single breakpoints (not clustered) and

1058 Breakpoints located within a cluster subgroups, each breakpoint was tested independently for its overlap

1059 with enhancer regions, regardless of whether it was part of a cluster or not. For the A single cluster of

1060 breakpoints (using 5'-3' endpoints) subgroup, the region spanning the most 5' and 3' breakpoints of an

1061 integration locus was used to test for the overlap with enhancer regions. For the Single and clustered 5’-

1062 3’ breakpoints subgroup, the Single breakpoints (not clustered) and A single cluster of breakpoints (using

1063 5'-3' endpoints) subgroups were combined for overlap analysis. The data was permutated 10,000 times to

1064 create an expected distribution of overlap. *A single integration breakpoint on chromosome Y was

1065 excluded from this analysis. Bold font indicates significant p-values.

1066

1067

41

Figure 1

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC HPV+ cervical 105 and is also made available for use under a CC0 license. carcinomas (CESC)* WGS WXS HPV integra�on sites Host DNA copy number HPV+ head and RNA-seq Viral transcrip�on neck squamous APOT cell carcinomas (HNSCC)*

Brd4 HPV+ cervical H3K27ac Brd4 Brd4 carcinoma cells H3K27ac lines ChIP-seq H3K27ac C33-A Enhancers C33-A HeLa Aphidicolin FANCD2 treated cervical Integra�on breakpoints carcinoma cells HeLa ChIP-seq PIBF1 KLF5 lines FRA7J KLF12 BORA Common Fragile Sites Overlap Analysis *Data Sources: Bodelon, Holmes, Hu, Koneva, Lui, Ojesina, Olthof, Parfenov, The Cancer Genome Atlas, Xu Figure 2

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (whicha was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC Integra�on Integra�on105 locus: and is also made available for use under a CC0 license. locus: single clustered breakpoint breakpoints

<3 Mb: same integra�on locus >3 Mb: separate integra�on locus

b CESC Dataset Breakpoints (n): 1 2 3-9 >9

11 12 7 8 9 10

3 4 5 6

1 2

13 14 15 16 17 18 19 20 21 22 X

c n=271 n=313 (n) 20 ** 18 l ocus

n 16 �o 14 12 10 per i ntegra

s s 8

oi nt 6 4

breakp 2

0

CESC No Yes Hotspot Figure 3

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC a Numer of transcrip�onally 105 and is also made available for useb under a CC0 license. ac�ve loci per tumor: Single Mul�ple None 120 120 100

100 c � ve loci a ve c � ve loci 80 a 80 60

60 � onally � onally 40

40 tumor ( %) per per tumor ( %) per 20

20 CC transcrip 0 HNS CESC CESC transcrip 0 -20 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 -1 0 1 2 3 4 5 6 7 Integra�on loci per tumor (n) Integra�on loci per tumor (n)

c **P=0.004 120 Ac�ve 100 Inac�ve 80 � on loci (n) 60 40 20 CESC CESC integra 0 No Yes Integra�on hotspot Figure 4 a 20 **** b 10 * e bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright*P=0.014 holder for this preprint (which18 was not certified by peer review) is the author/funder.9 This article is a US Government work. It is not subject to copyright Hotspot:under 17 USC 105 and is also made available for use under a CC0 license. 250 16 8 No 14 7 200 Yes 12 6 10 5 150 8 4 � on loci (n) 6 3 100 CESC CESC breakpoints (n)

HNSCC breakpoints (n) 2 4 NS 50 2 1 CESC CESC integra 0 0 0 Normal AMP DEL Normal AMP DEL Normal AMP DEL DNA copy number DNA copy number DNA copy number c CESC Dataset Breakpoints (n): Normal AMP DEL

7 8 9 10 11 12

3 4 5 6

1 2

13 14 15 16 17 18 19 20 21 22 X d

HNSSC Dataset Breakpoints (n): Normal AMP DEL

3 4 5 7 9 11 13 14 15 16 1817 19 20 21 X

1 2 Figure 5 a bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540b ; this version posted Augustc 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject toLong copyright genes under 17 USC 50 105 and is also made available for use under a CC0 license. FANCD2 peaks 279 40 312 64 30

20 Expressed in 86 FANCD2 peaks 121 C33-A/HeLa 10 81 34 328 318 Genomic coverage Genomic (%) 97 0 FANCD2 FRA MDS 26 15 MDS regions Fragile site dataset FRA regions (CFS) d

chr1: 20 Mb 40 Mb 60 Mb chr3: 5 Mb 10 Mb 15 Mb C33-A C33-A

ChIP-seq HeLa ChIP-seq reads: reads: HeLa

MDS: MDS:

FRA: FRA: FRA1B FRA1L FRA3B Long NFIA Long ERC2 genes: FAF1 DAB1 PATJ ROR1 NEGR1 genes: CACNA2D3 FHIT PTPRG MAGI1

chr7: 5 Mb 10 Mb chr16: 1 Mb 5 Mb C33-A C33-A

ChIP-seq ChIP-seq reads: HeLa reads: HeLa

MDS: MDS:

FRA: FRA: FRA7J FRA16D Long Long genes: AUTS2 GALNT17 genes: WWOX CDH13 Figure 6 a bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted Augustb 31, 2021. The copyright holder for this preprint (which30 was not certifiedHotspot by peer review)Non-hotspot is the author/funder.* Novel hotspot This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license. 25 og10) l 20 Super-enhancers (- 17 296 ue l 15 a v

q- 10 * R 72 D * F 5 * * * Integra�on loci Hotspots 0 254 241 NFIA FLRT3 GRHL2 PIK3R1 MYO1B CLSTN1 MRPL23 CAMK1G IL8/RASSF6 KLF5/KLF12 GLOD4/NXN VMP1/TUBD1 KLF14/MKLN1 RXRA/COL5A1 C9orf3/FANCC FOXQ1/EXOC2 AHDC1/WASF2 YWHAQ/TAF1B ADRB2/SH3TC2 IGF1R/PGPEP1L ITPR1/BHLHE40 HMGA2/MSRB3 RAD51B/ZFP36L1 MYO9A/GRAMD2 ID1/COX4I2/HM13 TP63/LEPREL1/CLDN1 RNF145/EBF1/UBLCP1

MYC/POU5F1B/FAM84B GREAT gene ontology hits c d f (n) n=271 n=313 n=123 n=145 (n) E 40 E S S 60

120 Overlapping genomic Viral transcrip�on Host DNA copy

(n) 35 100 feature: status: 50 number: ci ci l ap er 30 l ap er v Super-enhancer Ac�ve v Normal l o o o 80 FANCD2 peak 25 Inac�ve 40 AMP � on

60 Both that 20 30 DEL ci ci 15

40 l o 20

i ntegra 10 � on SC 20 10 E 5 C 0 0 0 Integra�on loci that No Yes Integra No Yes No Yes Integra�on hotspot Integra�on hotspot Integra�on hotspot e

chr3: 19.1 Mb chr8: 1 Mb

Host Host AMP: AMP:

Brd4 Brd4 ChIP-seq ChIP-seq reads: H3K27ac reads: H3K27ac W12 SE: W12 SE: FANCD2 peaks: FANCD2 peaks: Integrants: Integrants: GREAT genes: TP63 CLDN1 GREAT genes: MYC LEPREL1 FAM84B POU5F1B Cancer driver TBL1XR1 Cancer driver genes: MECOM PIK3CA genes: RAD21 MYC

chr13: 14.6 Mb chr14: 0.7 Mb

Host Host AMP: AMP:

Brd4 Brd4

ChIP-seq H3K27ac ChIP-seq H3K27ac reads: reads: W12 SE: W12 SE: FANCD2 peaks: FANCD2 peaks: Integrants: Integrants: GREAT genes: KLF5 GREAT genes: ZFP36L1 Cancer driver KLF12 Cancer driver RAD51B genes: RB1 CYSLTR2 DACH1 KLF5 genes: MAX ZFP36L1 Figure S1

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certifiedCESC by peersamples review) is the author/funder. This article is a US Governmentb HPV16 work. It is notHPV18 subject to copyrightOther under 17 USC a 105 and is also made available for use under a CC0 license. (n=333) 100

80

Histology: 60 Adenocarcinoma Adenosquamous 40 Squamous cell 20 Carcinoma (Not specified) 0 CESC samples by HPV type (%) qu amous cell Notfi ed speci S Adenosquamous Adenocarcinoma Histology

c HNSCC samples d HPV16 HPV33 HPV35 (n=41) 100 Tumor loca�on: 80 Larynx Oral cavity 60 Oropharynx

40

20 HNSCC samples by HPV type (%) 0 Larynx Oral cavity Oropharynx Tumor loca�on

CESC breakpoints e CESC samples f (n=1,299) (n=333) 800

600 Sequencing method: Hybrid-capture WGS 400

WXS � on breakpoints (n) Hybrid-capture/APOT* WGS/RNA-seq 200 WXS/RNA-seq

WGS/WXS/RNA-seq CESC integra 0 Hybrid- WGS WXS capture Sequencing method

g HNSCC samples h HNSCC breakpoints (n=41) (n=119) 100

80 ak points (n) 60 Sequencing method: � on bre DIPS/APOT 40 WGS/RNA-seq 20

HNSCC integra 0 DIPS WGS Sequencing method bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

Figure S2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y

Literature hotspots CESC defined hotspots Figure S3

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also madeHNSCC available Dataset for use under a CC0 license. Breakpoints (n): 1 2 3-9

11 12 13 7 8 9

3 4 5 6

1 2

14 15 16 17 18 19 20 21 22 X Y bioRxiv preprint doi: https://doi.org/10.1101/2021.08.30.457540; this version posted August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

Figure S4

a b c 15 Hotspot ENCODE (NHEK) Non-hotspot 13,122 CESC GREAT target genes 10 * chr1: 5 Mb 10 Mb Brd4 * 5 * 5 ChIP-seq H3K27ac * 28,134 W12 Brd4 peaks reads: H3K27ac FDR q-value (-log10) 10,847 89 0 W12: 2,142 HeLa: NHEK: ACTL7B KLF5/KLF12 RefSeq ERG/PSMG1 TUBD1/VMP1

W12 H3K27ac peaks RHOD/KDM2A

10,432 genes: PLCXD2/PHLDB2 KRT13/KRT17/EIF1 FAM84B/POU5F1B TMPRSS2/RIPK4/ETS2/ GREAT gene ontology hits d chr8: 1 Mb chr13: 14.6 Mb Host AMP: Host AMP:

Brd4 Brd4 ChIP-seq ChIP-seq H3K27ac reads: H3K27ac reads: W12 SE: W12 SE: FANCD2 peaks: FANCD2 peaks: CESC integrants: CESC integrants: HNSCC integrants: HNSCC integrants: GREAT genes: MYC GREAT genes: KLF12 FAM84B POU5F1B KLF5 Cancer driver Cancer driver MYC DACH1 KLF5 genes: genes:

chr17: 7.5 Mb chr17: 5.1 Mb Host AMP: Host AMP: Brd4 Brd4 ChIP-seq ChIP-seq H3K27ac reads: H3K27ac reads: W12 SE: W12 SE: FANCD2 peaks: FANCD2 peaks: CESC integrants: CESC integrants: HNSCC integrants: HNSCC integrants: GREAT genes: GREAT genes: TUBD1 Cancer driver Cancer driver VMP1 genes: NF1 CDK12 KRT222 ERBB2 BRCA1 KANSL1 SPOP genes: RNF43 PPM1D CD79B AXIN2 Figure S5

1 2 bioRxiv preprintHotspot doi: (chr1): https://doi.org/10.1101/2021.08.30.45754014.6 Mb ; this version posted August 31, 2021.16.9 The Mb copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC Host AMP: 105 and is also made available for use under a CC0 license. Brd4 ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants:

GREAT genes: CLSTN1 WASF2 NFIA Cancer driver MACF1 genes: RPL22 MTOR EPHA2 ARID1A THRAP3 CDKN2C JAK1 FUBP1 RPL5 Long genes:

3 4 Hotspot (chr1): 21.1 Mb 6.1 Mb

Host AMP:

Brd4 ChIP-seq reads: H3K27ac W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants:

GREAT genes: CAMK1G Cancer driver ZBTB7B SPTA1 BTG2 genes: TXNIP RIT1 DHX9 PTPRC ELF3 IRF6 H3F3A Long genes:

5 6 Hotspot (chr2): 7.6 Mb 25.7 Mb

Host AMP: Brd4

ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants:

GREAT genes: MYO1B Cancer driver CASP8 genes: ACVR2A ACVR1 NFE2L2 PMS1 SF3B1 IDH1 ERBB4 RQCD1 CUL3 Long genes:

7 8 Hotspot (chr3): 3.1 Mb 19.1 Mb

Host AMP:

Brd4 ChIP-seq reads: H3K27ac W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: CLDN1 TP63 Cancer driver PIK3CA genes: MECOM TBL1XR1 Long genes: 9 10 11 bioRxiv preprintHotspot doi: (chr4): https://doi.org/10.1101/2021.08.30.45754011.6 Mb ; this18 Mbversion posted August 31, 2021. The copyright0.5 holder Mb for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license. Host AMP: Brd4 ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: IL8 GREAT genes: RASSF6 Cancer driver KIT genes: FGFR3 PDGFRA ALB Long genes:

12 Hotspot (chr5): 5.6 Mb

Host AMP:

Brd4 ChIP-seq reads: H3K27ac W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: Cancer driver NIPBL genes: IL7R Long genes:

13 14 15 Hotspot (chr6): 11.6 Mb 8.3 Mb 6.6 Mb

Host AMP:

Brd4 ChIP-seq reads: H3K27ac W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: FOXQ1 EXOC2 Cancer driver HIST1H1C/E PIM1 genes: FOXQ1 HLA-A/B LEMD2 Long genes:

16 17 Hotspot (chr7): 1.2 Mb 10.8 Mb Host AMP:

Brd4 ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: Cancer driver genes: PIK3CG MET Long genes: 18 19 bioRxiv preprintHotspot doi: (chr8): https://doi.org/10.1101/2021.08.30.457540; this version11 Mbposted August 31, 2021.1 Mb The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license. Host AMP: Brd4 ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: FAM84B MYC GREAT genes: GRHL2 POU5F1B Cancer driver genes: CNBD1 RAD21 MYC Long genes:

20 Hotspot (chr9): 4.4 Mb

Host AMP: Brd4

ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: FANCC GREAT genes: C9orf3 Cancer driver genes: GNAQ PTPDC1 PTCH1 Long genes:

21 Hotspot (chr10): 5.6 Mb

Host AMP:

Brd4 ChIP-seq reads: H3K27ac W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: Cancer driver genes: PTEN SMC3 TCF7L2 Long genes:

22 23 Hotspot (chr11): 4.3 Mb 6.4 Mb Host AMP:

Brd4 ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: Cancer driver INPPL1 genes: CTNND1 SF1 CCD1 PGR ATM KMT2A Long genes: 24 bioRxiv preprintHotspot doi: (chr12): https://doi.org/10.1101/2021.08.30.457540; this version0.2 postedMb August 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license. Host AMP: Brd4 ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: MSRB3 HMGA2 Cancer driver genes: Long genes:

25 Hotspot (chr13): 14.6 Mb

Host AMP:

Brd4 ChIP-seq reads: H3K27ac W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: KLF12 GREAT genes: KLF5 Cancer driver PDS5B CYSLTR2 KLF5 genes: BRCA2 RB1 DACH1 Long genes:

26 27 Hotspot (chr14): 0.7 Mb 2.9 Mb

Host AMP: Brd4

ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: ZFP36L1 GREAT genes: RAD51B Cancer driver genes: MAX ZFP36L1 ATXN3 DICER1 TRAF3 AKT1 Long genes:

28 Hotspot (chr15): 13.6 Mb Host AMP: Brd4

ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: GRAMD2 MYO9A Cancer driver genes: TCF12 RNF111 MAP2K1 SIN3A IDH2 Long genes: 29 30 31 bioRxiv preprintHotspot doi: (chr17): https://doi.org/10.1101/2021.08.30.457540 7.5 Mb ; this version5.1 Mb posted August 31, 2021.6.2 Mb The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license. Host AMP: Brd4

ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: TUBD1 GREAT genes: VMP1 Cancer driver CDK12 SPOP SOX9 SRSF2 genes: NF1 ERBB2 BRCA1 KANSL1 RNF43 PPM1DCD79B PRKAR1A ZNF750 Long genes:

32 33 Hotspot (chr19): 13 Mb 20.8 Mb Host AMP:

Brd4 ChIP-seq reads: H3K27ac W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: ARHGAP35 Cancer driver CACNA1A PIK3R2 KMT2B genes: KEAP1 SMARCA4 JAK3 CEBPA CIC ERCC2 GRIN2D PPP2R1A Long genes:

34 35 Hotspot (chr20): 4.3 Mb 2.2 Mb

Host AMP: Brd4

ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: ID1 FLRT3 HM13 Cancer driver genes: PLCB4 ZNF133 ASXL1 Long genes:

36 37 Hotspot (chrX): 11.2 Mb 5.7 Mb Host AMP:

Brd4 ChIP-seq H3K27ac reads: W12 SE: FANCD2 peaks: CESC integrants: HNSCC integrants: GREAT genes: USP9X RBM10 STAG2 Cancer driver ARAF PHF6 genes: TMSB4X EIF1AX DMD BCOR KDM6A KDM5C ZCCHC12 SMARCA1 FLNA Long genes: