bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra- targets germline alleles in an infant leukemia

1 Recapitulation of human germline coding variation in an ultra-mutated infant leukemia

2 Alexander M Gout1, Rishi S Kotecha2,3,4, Parwinder Kaur5,6, Ana Abad1, Bree Foley7, Kim W 3 Carter8, Catherine H Cole3, Charles Bond9, Ursula R Kees2, Jason Waithman7,*, Mark N 4 Cruickshank1*

5 1Cancer Genomics and Epigenetics Laboratory, Telethon Kids Cancer Centre, Telethon Kids 6 Institute, University of Western Australia, Perth, Australia. 7 2Leukaemia and Cancer Research Laboratory, Telethon Kids Cancer Centre, Telethon Kids 8 Institute, Perth, Australia. 9 3Department of Haematology and Oncology, Princess Margaret Hospital for Children, Perth, 10 Australia. 11 4School of Medicine, University of Western Australia, Perth, Australia. 12 5Personalised Medicine Centre for Children, Telethon Kids Institute, Australia. 13 6Centre for Plant Genetics & Breeding, UWA School of Agriculture & Environment. 14 7Cancer Immunology Unit, Telethon Kids Institute, Perth, Australia. 15 8McCusker Charitable Foundation Bioinformatics Centre, Telethon Kids Institute, Perth, 16 Australia. 17 9School of Molecular Sciences, The University of Western Australia, Perth, Australia. 18 19 * These authors contributed equally to this work. Correspondence and requests for 20 materials should be addressed to J.W. ([email protected]) or to M.N.C. 21 ([email protected]). 22 23 Keywords: infant acute lymphoblastic leukemia, hypermutation, mismatch repair defect, 24 MSH2 25 Running title: Ultra-mutation targets germline alleles in an infant leukemia 26 Word Count: (4608) 27 Total number of Figures/Tables: 5 figures and 2 tables 28 Financial support: Children’s Leukaemia and Cancer Research Foundation (Inc.) Fellowship 29 (MNC); NHMRC Career Development Fellowship 1066869 (JW); Raine Medical Research 30 Foundation Clinician Research Fellowship 2016-2018 (RSK); Children’s Leukaemia and 31 Cancer Research Foundation (Inc.) project grants (URK, MNC, RSK); Ian Potter Foundation 32 (AG); Healy Foundation (AG). Telethon Kids Institute Blue Sky Research Grant (MNC, 33 URK, RSK, JW). Telethon Kids Institute Research Focus Area Precision Medicine Working 34 Group Funding (JW, MNC). 35 Conflicts of interest: The authors declare no conflicts of interest.

1 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

36 Abstract: Mixed lineage leukemia/Histone-lysine N-methyltransferase 2A

37 rearrangements occur in 80% of infant acute lymphoblastic leukemia, but the role of

38 cooperating events is unknown. Infant leukemias typically carry few somatic lesions, however

39 we identified a case with over 100 somatic point per megabase. The patient

40 presented at 82 days of age, among the earliest onset of cancer hypermutation recorded. The

41 transcriptional profile showed global similarities to canonical cases. Coding lesions were

42 predominantly clonal and almost entirely targeting alleles reported in human genetic variation

43 databases with a notable exception in the mismatch repair gene, MSH2. The patient’s

44 diagnostic leukemia transcriptome was depleted of rare and low-frequency germline alleles

45 due to loss-of-heterozygosity, while somatic point mutations targeted low-frequency and

46 common human alleles in proportions that offset this discrepancy. Somatic signatures of ultra-

47 mutations were highly correlated with germline single polymorphic sites indicating

48 a common role for 5-methylcytosine deamination, DNA mismatch repair and DNA adducts.

49 These data suggest similar molecular processes shaping population-scale

50 variation also underlie evolution of an infant ultra-mutated leukemia case.

2 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

51 Introduction:

52 Histone-lysine N-methyltransferase 2A (KMT2A) (also known as Mixed lineage

53 leukemia) gene rearrangements (KMT2A-R) at 11q23 occur in infant, pediatric, adult and

54 therapy-induced acute leukemias and are associated with a poor prognosis. KMT2A-R occurs

55 in around 80% of infant acute lymphoblastic leukemia (ALL) and 35-50% of infant acute

56 myeloid leukemia (AML) involving more than 60 distinct partner 1. KMT2A-R infant

57 ALL (iALL) genomes have an exceedingly low somatic mutation burden, with recurring

58 mutations in RAS-PI3K complex genes. Compared to wild-type KMT2A, the fusion

59 oncoprotein has altered histone-methyltransferase activity, causing epigenetic and

60 transcriptional deregulation. While KMT2A-oncoproteins are strong drivers of

61 leukemogenesis, several lines of evidence suggest that KMT2A-R alone is insufficient for

62 tumor initiation. For example, there is significant variability in latency of disease onset in

63 experimental models of KMT2A-fusions 2, 3 and in patients with KMT2A-R detected in

64 neonatal blood spots 4. Furthermore, KMT2A-R acute leukemias 5 in children are enriched for

65 mutations affecting epigenetic regulatory genes and on average harbor more somatic lesions

66 than infants. These observations have prompted speculation that iALL is a distinct

67 developmental and genetic entity and pre-leukemic transformation occurs in utero, targeting a

68 cell of origin with a genetic/epigenetic permissive predisposition 6.

69 As a step towards defining cooperating events we performed next-generation

70 sequencing on a KMT2A-R iALL patient cohort and integrated the data with a larger set of

71 acute leukemias. We investigated the combined germline and somatic mutation profiles,

72 identifying a highly mutated specimen from a patient diagnosed at 82 days of age. Here we

73 report sequence analysis of this unique genetic entity in relation to its canonical KMT2A-R

74 iALL counterparts, revealing common mechanisms driving leukemia ultra-mutation and

75 population-scale human genome variation.

76

3 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

77 Methods:

78 Patient specimens, external sequencing data and controls

79 The study cohort consisted of ten infants diagnosed at Princess Margaret Hospital for

80 Children (Perth), of which we have previously reported RNA-seq data for five patients 7.

81 Additional RNA-seq data was downloaded for 39 KMT2A-R iALL cases at diagnosis, six

82 cases at relapse, six cases lacking the KMT2A-R (KMT2A-germline) and for pediatric

83 KMT2A-R patients at diagnosis, comprising 10 AML cases, seven ALL cases and one

84 undifferentiated case as reported previously by Andersson et al. 5 (accession

85 EGAS00001000246). RNA-seq data comprising 28 KMT2A-R and 52 KMT2A-germline adult

86 diagnosistic AML cases downloaded from the Short Read Archive (BioProject Umbrella

87 Accession PRJNA2787668; Leucegene: AML sequencing). Normal human bone marrow

88 transcriptomes were downloaded from the Short Read Archive (BioProject PRJNA284950)

89 comprising 20 pooled sorted human hematopoietic progenitor populations 9. The full list of

90 samples analysed in this study including those downloaded from external sources is

91 summarised in Supplementary Table S1.

92 Sample preparation and next-generation sequencing

93 RNA was isolated with TRIzol™ Reagent (Invitrogen Life Science Technologies,

94 Carlsbad, CA, USA) and purified with RNeasy Purification Kits (Qiagen, Hilden, Germany).

95 Genomic DNA was isolated with QIAamp DNA Mini Kit (Qiagen). Nucleic acid integrity

96 and purity was assessed using the Agilent 2100 Bioanalyzer (Agilent Technologies, Santa

97 Clara, CA). RNA-seq was performed at the Australian Genome Research Facility, Melbourne

98 with 1 μg RNA as previously reported 7. Whole exome sequencing (WES) was performed by

99 the Beijing Genome Institute using SureSelect XT exon enrichment (Agilent Technologies)

100 with 200 ng of remission DNA (v4 low input kit) or 1 μg of diagnostic DNA (v5). Diagnostic

101 RNA and remission genomic DNA sequencing was performed on a HiSeq 2000 (Illumina,

102 Inc., San Diego, CA, USA) instrument with 100bp paired end reads; diagnostic genomic

4 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

103 DNA sequencing was performed on a HiSeq 2500 with 150bp paired end reads. Sequencing

104 libraries were constructed using Illumina TruSeq kits (Illumina, Inc.).

105 Bioinformatics

106 For gene-expression analysis, RNA-seq reads were mapped to hg19 with HISAT and

107 gene-counts generated with HTseq 10. Data was normalized with RUV-seq 11 implementing

108 factor analysis of negative control genes identified by fitting a linear model with cell type

109 (hematopoietic sub-set or leukemia sub-type) as a covariate. voom! 12 was used to adjust for

110 library size and generate counts per million (cpm) values. KMT2A-fusion split-reads were

111 identified using RNA-seq data with FusionFinder 13 using default settings.

112 Single nucleotide variants (SNVs) were identified from RNA-seq data using the

113 intersection of two methodologies: (1) the “Genome Analysis Toolkit (GATK) Best Practices

114 workflow for SNP and indel calling on RNAseq data” and (2) SNPiR 14. WES data was

115 processed based on the GATK Best Practices workflow. Variants were characterized using

116 custom scripts that extract the coverage of sequenced bases from VCF and bam files,

117 determining the overlap among samples, and calculating variant allele fractions (VAFs).

118 High confidence somatic mutations were identified using VarScan2 (v2.3.9) 15 setting

119 cut-offs of P-value<0.005; coverage > 50X; VAF > 0.1 or VAF > 0.3 to enrich for clonal

120 mutations. Variants were annotated with ANNOVAR 16 and with metadata from additional

121 sources including seven functional/conservation algorithms (PolyPhen2 17, Sift 18,

122 MutationTaster 19, likelihood ratio test 20, GERP 21, PhyloP 22, and CADD 23) and databases of

123 human genetic variation (ExAC, 1000 genomes, HapMap, dbSNP, COSMIC and ClinVar).

124 Predicted functional alleles were assigned for variants called as damaged/deleterious by at

125 least 4/7 algorithms.

126 Somatic mutational spectra were investigated using the R package, DeconstructSigs

127 (v1.8.0) 24 calculating the fraction of somatic substitutions within the possible 96-trinucleotide

128 contexts relative to the exome background. Copy number analysis was performed on WES

5 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

129 data using VarScan2 to calculate log2 depth ratio and GC-content and Sequenza (v2.1.2) 25 to

130 identify chromosomal segments with copy loss. enrichment was performed

131 using GREAT (v3.0.0) 26 bioinformatics tools. Mutations were inspected by mapping the

132 genomic position to crystal structures in the DataBank27 and analyzed with Pymol

133 (v1.8.6.2).

134 Data and code availability

135 Scripts used to execute variant detection software and analysis are available at

136 https://github.com/cruicks/ultramutated-case under an MIT license and are being developed

137 on Github (https://github.com/cruicks/vcfbamCompare). The sequencing data has been

138 deposited in the Short Read Archive with study ID PRJNA296746.

139

140 Results:

141 Characteristics of KMT2A-R iALL cohort and RNA-seq variant detection pipeline

142 We reported previously KMT2A-R fusion transcripts for 5/10 cases 7, and of the

143 remaining five KMT2A-R iALL we detected KMT2A-MLLT3 (P401, P438), KMT2A-AFF1

144 (P848), KMT2A-MLLT1 (P706) and KMT2A-EPS15 (P809). KMT2A-fusion transcripts were

145 therefore detected in all samples and reciprocal fusion sequences detected in 5/10 samples

146 (Table 1). To determine the validity of our RNA-seq based pipeline for detecting rare and

147 common alleles, we examined the concordance of single nucleotide variants (SNVs) called

148 from RNA-seq and WES data. Across seven samples with matched WES and RNA-seq data,

149 we identified a total of 54,792 point mutations from RNA-seq reads of which 166 were absent

150 from matched WES (coverage>20 reads), representing a discordance rate of 0.3% (Figure 1a;

151 Supplementary Table S2). In one sample, P706 there were two events called with RNA-seq

152 data that were not identified in matched WES but with evidence extracted using the Integrated

153 Genome Browser at previously reported somatic mutation sites in FLT3 (encoding FLT3

6 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

154 p.V491L 28; 200/1,719 alternate RNA-seq reads; 15/214 alternate WES reads) and KRAS

155 (encoding KRAS p.A146T 29; 24/159 alternate RNA-seq reads; 6/51 alternate WES reads).

156 Furthermore, analysis of 117 discordant events with coverage > 50 reads, revealed 76

157 recurrences in two or more samples (64%) at 26 unique sites, however none of these sites

158 have been validated as somatic mutations in cancer to date in the Catalog of Somatic

159 Mutations in Cancer (COSMIC)30 or The Pediatric Cancer Genome Project (PCGP)31

160 databases. These results indicate a high rate of reproducibility of the RNA-seq variant

161 detection pipeline and suggest that RNA-DNA differences mostly comprise artifacts and/or

162 novel RNA-editing events 32 and a minority of putative somatic alterations exclusively

163 detected by RNA-seq.

164 Identification of a highly mutated infant leukemia

165 Expressed somatic mutations were defined as variants detected in both diagnostic

166 leukemia RNA-seq and WES, but absent in remission WES at sites with >20 reads (Figure

167 1b-d). Most patients (5/6) showed the expected “silent genomic landscape” (Supplementary

168 Table S3) except for patient P337 (Figure 1b-d; Supplementary Table S4). Three patients

169 expressed RAS-family hot-spot somatic mutations (P848 and P706: KRAS p.A146T and

170 P438: NRAS.p.G12S). We detected no expressed somatic missense mutations in patients

171 P399 and P401. The pair of monozygotic twins (P809 and P810) lacked private expressed

172 mutations but we detected two shared expressed missense mutations clustered in the PER3

173 gene (encoding PER3.p.M1006R and PER3.p.K1007E). These results are consistent with

174 recurring RAS-pathway mutations and low mutation rate reported previously 5, 33. However,

175 in patient P337 RNA-seq reads we detected 1,420 expressed somatic mutations

176 (Supplementary Table S4). The overlap of both rare and common alleles across the matched

177 datasets verifies the common origin of RNA-seq and diagnostic/remission WES data ruling

178 out sample contamination or mislabeling (Figure 1a-c). We previously reported a complex

179 KMT2A translocation involving 2q37, 19p13.3 and 11q2334 in the highly mutated leukemia

180 sample and established two cell lines from the diagnostic sample 7.

7 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

181 Characterisation and clonal analysis of somatic lesions

182 Somatic lesions were further evaluated in six cases by comparing diagnostic and remission

183 WES. Two different variant allele fraction (VAF)-cut-offs, either VAF>0.1 or VAF>0.3, were

184 chosen to enrich for sub-clonal and clonal events respectively. There were on average ~82.2

185 (range 69-95) sub-clonal and ~3.4 clonal (range 3-4) somatic point mutations in canonical

186 KMT2A-R iALL exomes, confirming previous studies of a low somatic mutation rate in the

187 dominant clone (Table 1). In contrast, there were 198 sub-clonal and 5,054 clonal somatic

188 point mutations in the highly-mutated sample, equating to a rate of 139 mut/Mb, classifying

189 this sample as ultra-mutated. There were also ~40-fold increased number of indels in the

190 dominant clone of the ultra-mutated sample (n=260) compared to typical cases (average =

191 6.6; range: 3-9) including 13 predicted to cause a frame-shift protein-coding truncation (Table

192 1; Supplementary Table S5). There were 2,420 exonic substitutions encoding 1,109 missense

193 alleles and 12 stop-gain alleles (Supplementary Table S5). We also noted that the ultra-

194 mutated case carried an excess of loss-of-heterozygosity (LOH) positions, with 6,555 high

195 confidence sites; vastly more than typical cases which had a mean of 57.2 LOH positions

196 (range 28-100) (Figure 2a-b; Table 1).

197 WES was performed on two independent cell lines derived from the ultra-mutated diagnostic

198 specimen, PER-784A and PER-826A, to characterize clonal somatic events. (Figure 2c-d).

199 Most somatic and LOH-events detected in the primary ultra-mutated sample were also called

200 in both, PER-784A and PER-826A cell lines using P337 remission WES for all comparisons

201 (Figure 2d). These data suggest that the somatic lesions are largely clonal in agreement with

202 the VAF-distributions (Figure 2b, d). In total, 4,754/5,054 (94%) somatic events

203 (Supplementary Table S6) and 6,150/6,575 (93%) LOH-events (Supplementary Table S7)

204 were called in the derived cell lines. Moreover, there were relatively few discordant sites

205 when directly comparing cell lines with the primary diagnostic sample (PER-784A: 34

206 discordant sites; PER-826A: 44 discordant sites; Supplementary Table S6)). We further

207 investigated evidence of clonal somatic lesions in cell lines using an alternative approach, by

8 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

208 directly extracting quality filtered WES reads mapping to coding somatic point mutations

209 (n=2,411; Q>50 in diagnosis WES; Figure 2e) and LOH-sites (n=3,117; Q>50 in remission

210 WES; Figure 2f; Supplementary Table S8). All coding clonal somatic mutations were

211 supported by reads in both cell lines of which 121 (5%) were homozygous (VAF>0.95 in cell

212 line and VAF>0.9 in primary sample). Most LOH-sites were evidenced by reference alleles

213 (VAF<10%; 2,179/3,117) in cell lines, with ~30% of LOH-sites evidenced by alternate alleles

214 (VAF> 90%; 926/3,117) and only 12 sites (0.3%) with both reference and alternate alleles

215 (VAF range 10-90%).

216 We performed copy number analysis of the ultra-mutated sample to map LOH-sites to

217 chromosomal segments with evidence of deletions. Most genomic regions were called as

218 diploid with limited chromosomal copy loss detected (Supplementary Figure S1a-b). There

219 was a total of 382/6,555 (5.8%) LOH-sites mapping to homozygous deletions, suggesting

220 only a minority of the identified LOH-sites likely arose through segment

221 deletion. Infant KMT2A-R ALL typically carry a small number of copy-number alterations

222 which may coincide with LOH-sites. We determined the distance of LOH-sites, somatic

223 mutations and germline variants (called by VarScan2) to the nearest homozygous SNV in

224 each leukemia diagnostic sample finding that in canonical cases LOH-sites tend to be located

225 closer to homozygous sites than either germline or somatic variants (Supplementary Figure

226 S1c; left panel) and show a unique bimodal distribution (Supplementary Figure S1c; right

227 panel). In contrast the average distance of LOH-sites to homozygous SNVs in the ultra-

228 mutated sample was similar to that of germline variants and somatic mutations with each

229 displaying a similar distribution (Supplementary Figure S1c). Thus, most LOH-sites in the

230 ultra-mutated specimen do not show evidence of localising within deleted or copy-neutral

231 loss-of-heterozygosity chromosomal segments.

232 In summary, these data demonstrate that somatic lesions in the primary sample were

233 predominantly clonal and LOH-sites appear to comprise a large proportion of somatic

234 substitutions rather than chromosomal alterations. We also note relatively few coding

9 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

235 sequence changes in derived cell lines compared to the dominant diagnostic leukemic clone,

236 both which are capable of long-term propagation. These results could reflect loss of the

237 hypermutator phenotype during in vitro culture or in diagnostic leukemia cells.

238

239 Signatures of somatic mutations and LOH-sites correlate with profiles of germline alleles

240 We next investigated the sequence features of somatic and LOH substitutions in the

241 ultra-mutated sample (Figure 3). We hypothesised that LOH-sites and somatic mutations may

242 show common feature biases depending on the context of alteration. Somatic events were

243 evaluated separately that induce a heterozygous mutation (n=4,773) or homozygous mutation

244 (n=281), and LOH causing reference→alternative changes (n=2,057) were evaluated

245 separately to alternative→reference changes (n=4,453). Since we noted an overlap in the

246 somatic mutations of the ultra-mutated sample with variants in the ExAC database (Figure 1a-

247 c), we first examined the distribution of these sites with respect to the derived allele frequency

248 in ExAC. We observe that each of these categories of sequence alterations in the ultra-

249 mutated sample were distributed across the range of population allele-frequencies (Figure 3a-

250 b). While the distributions of LOH and somatic sites were similar, somatic heterozygous

251 changes were relatively more frequent at higher frequency alleles as were LOH

252 reference→alternate alleles.

253 Somatic signatures of ultra-mutations were investigated in comparison to

254 heterozygous and homozygous single nucleotide polymorphisms (SNPs) in patient remission

255 samples. Heterozygous positions were defined by VAF<0.9 yielding 51,347 heterozygous and

256 31,237 homozygous SNPs in canonical remission samples and 10,681 heterozygous and 6

257 617 homozygous SNPs in the ultra-mutated remission sample. Analysis of WES remission

258 variants from canonical patients and the ultra-mutated sample revealed similar ExAC allele

259 frequency distributions with distinctions between heterozygous and homozygous alleles

260 (Figure 3c-d). The overall enrichment of each category of somatic substitution and LOH

10 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

261 across the 96 possible trinucleotide contexts resembled patient remission SNPs with elevated

262 proportions of C>T transitions at NpCpG motifs (Figure 3e).

263 Pre-defined signatures were used to infer mutational profiles revealing Signature 1A

264 (associated with spontaneous deamination of 5-methylcytosine), Signature 12 (associated

265 with unknown process with strand-bias for T→C substitutions) and Signature 20 (associated

266 with defective DNA mismatch repair, MMR) as enriched in each of the mutation/variant sets,

267 however with distinctive proportions (Figure 3f). Heterozygous somatic mutations and LOH

268 alternative→reference sites were additionally enriched for Signature 1B (variant of Signature

269 1A) and showed highly correlated profiles (Pearson’s correlation 0.919). LOH

270 reference→alternative sites were additionally enriched for Signature 5 (associated with

271 unknown process with strand-bias for T→C substitutions) and were highly correlated with

272 homozygous somatic mutation profiles (0.912) and to profiles of patient heterozygous

273 positions (P337: 0.935; canonical: 0.926). Homozygous somatic mutations also displayed

274 similarities to heterozygous patient SNPs (P337: 0.981; canonical: 0.985). Overall the somatic

275 mutational profiles indicate highly related patterns of both somatic ultra-mutations and LOH-

276 sites. Notably LOH alternative→reference sites showed unique and overlapping features to

277 somatic heterozygous mutations. We also observe underlying similarities in the profiles of

278 somatic substitutions and germline alleles.

279

280 Ultra-mutation distribution recapitulates common and rare germline variation

281 We noted a similar spectrum of rare and common alleles in the ultra-mutated

282 leukemia transcriptome compared to typical KMT2A-R iALL cases; but there were

283 comparatively fewer rare/low-frequency alleles shared with the remission sample (Figure 1a-

284 c; ExAC<0.5%). We further explored this trend to disentangle the contribution of somatic

285 acquired alleles and LOH-sites. Analysis of the remission and diagnostic samples from this

286 patient in isolation, revealed a similar proportion of rare/low-frequency coding alleles

11 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

287 detected by WES and both were proportionally within the same range as canonical

288 counterparts (Figure 4a). However, when directly comparing the remission and diagnostic

289 WES, we noted a modest depletion of germline rare/low-frequency coding SNV alleles

290 defined by VarScan2 (Figure 4a; 2.9% rare alleles in P337 compared to 3.5±0.2% in

291 canonical cases). This apparent depletion of germline alleles in the ultra-mutated leukemic

292 sample was coincident with an increased proportion of somatic acquired rare/low-frequency

293 coding SNV alleles compared to LOH-sites (Figure 4a). Therefore, when considering the total

294 pool of SNVs in the leukemic coding genome, the somatic gains and losses of rare/low-

295 frequency coding SNV alleles were in proportions that resulted in a similar distribution as

296 canonical cases (Figure 4a). We extended this observation by investigating the rare and

297 common allele abundance of expressed germline-encoded variants and somatic mutations in

298 the ultra-mutated leukemic sample compared with a larger cohort of KMT2A-R iALL

299 transcriptomes (Figure 4b; n=32 St. Jude patients and n=8 Perth patients). Most typical

300 KMT2A-R iALL diagnostic specimens expressed similar numbers of rare and low-frequency

301 expressed SNVs, except for 4/32 cases sequenced by St Jude investigators, which had an

302 elevated number of expressed rare and low-frequency ExAC alleles (Figure 4b). Excluding

303 these samples, typical KMT2A-R iALL patients expressed on average 42.3 ±10.4 (standard

304 deviation) variants that were absent from ExAC and 212 ± 32.5 variants with an allele

305 frequency of <0.5% in ExAC. The total number of expressed variants in the ultra-mutated

306 sample was slightly lower than typical cases (38 absent from ExAC and 174 AF <0.5% in

307 ExAC), and were approximately comprised of half germline (17/38 and 88/174 respectively)

308 and half somatic acquired alleles. Overall, the ultra-mutated transcriptome had the lowest

309 prevalence of germline-encoded rare and low-frequency alleles among the larger sample size

310 (Table 2). Furthermore, the somatic mutation abundance within each allele-frequency bin was

311 proportional to their abundance in prototypical silent cases. Consequently, the low numbers of

312 rare germline alleles in the ultra-mutated transcriptome were off-set by somatic mutations

313 such that the total numbers were within the same range as typical cases (Figure 4b-c). We

314 conclude that despite carrying thousands of somatic mutations, the coding-genome of ultra-

12 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

315 mutated patient P337 shows similarities to canonical KMT2A-R iALL cases, vis-à-vis rare

316 allele prevalence and representation of common human genetic variation.

317 RAS-pathway and DNA-replication and repair gene mutations

318 Gene ontology analysis of the expressed non-synonymous somatic mutations (n=535)

319 revealed over-representation of RAS-pathway genes (7/23 genes; 7-fold enriched; P=0.049;

320 RALGDS, RAC1, CHUK, PIK3R1, NFKB1, PLD1 and PIK3CA) and “MLL signature 1 genes”

321 (33/369 genes; 2-fold enriched; P=0.046). Of the 535 non-synonymous expressed somatic

322 mutations, 45 were predicted deleterious/damaging, including in the MMR gene MSH2

323 (p.E489K). The MSH2 somatic mutation localizes to a hot-spot, within the MMR ATPase

324 domain in a region at the periphery of the DNA binding site, approximately 20 Å from the

325 DNA molecule in the complex structure of full-length MSH2, a fragment of MSH6, ADP and

326 DNA containing a G:T mispair 35. The mutation encodes a substitution of a negatively

327 charged glutamate with a positively charged lysine (Figure 4d). There was a non-synonymous

328 somatic mutation in another MMR gene, MUTYH; however, this site is a common human

329 allele with a derived AF of 0.297 and was not called as damaged/deleterious by our prediction

330 pipeline. There were no somatic mutations, LOH-sites nor rare/deleterious coding germline

331 alleles detected in POLE or POLD1 which are frequently co-mutated with MSH2 in ultra-

332 mutated tumours 36, 37 and in sporadic early-onset colorectal cancer 38. Among the 45

333 expressed and predicted damaging somatic mutations, there were two rare substitutions in

334 DNA polymerase encoding genes: POLG2 (p.D308V; ExAC AF 0.0000165), which encodes

335 a mitochondrial DNA polymerase and POLK (p.R298H; ExAC AF 0.0003), which encodes

336 an error prone replicative polymerase that catalyzes translesion DNA synthesis. The POLK

337 mutation maps to a conserved impB/mucB/samB family domain (Pfam: PF00817) involved in

338 UV-protection. The encoded substitution of arginine to histidine is located approximately 20

339 Å from the DNA molecule in the structure of a 526 amino acid N-terminal Pol K fragment

340 complexed with DNA containing a bulky deoxyguanosine adduct induced by the carcinogen

341 benzo[a]pyrene (Figure 4e) 39. Both wild-type and mutant residues are positively charged,

13 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

342 however arginine is strongly basic (pI 11) while histidine is weakly basic (pI 7) and titratable

343 at physiological pH 40. We additionally observed an expressed and predicted functional

344 germline rare allele in ATR, a master regulator of the DNA damage response (p.L2076V;

345 ExAC AF 0.0002).

346 Global transcriptional similarities of the ultra-mutated and canonical infant leukemias

347 We next examined the transcriptional profile of the highly mutated KMT2A-R iALL

348 specimen in comparison to a collection of patient leukemia samples and purified blood

349 progenitor populations (Figure 5). As expected cell type was the major source of variation

350 distinguishing normal hematopoietic progenitor, lymphoblastic leukemias and myeloid

351 leukemias. We also observe clustering according to KMT2A translocation status and partner

352 genes as reported previously 41. Replicate samples of the highly mutated KMT2A-R iALL

353 specimen cluster with canonical counterparts demonstrating global transcriptional similarities

354 (Figure 5).

14 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

355 Discussion:

356 We report sequence analysis of an ultra-mutated acute lymphoblastic leukemia

357 diagnosed in an infant at 82 days of age which, to our knowledge, represents the earliest onset

358 of a hypermutant tumor recorded. The mutation rate observed was at the upper limit of cases

359 reported previously in childhood cancers 36, 37. By integrating analysis of somatic mutations

360 and germline alleles with a larger cohort of KMT2A-R iALL patients we found that the high

361 somatic mutation rate does not generate large numbers of rare alleles compared to typical

362 cases. Rather, we observe that the ultra-mutated KMT2A-R leukemic coding genome closely

363 resembles the rare allele landscape of canonical cases. This apparent paradox is explained by

364 the net effect of somatic gains and losses, both of which were targeted to rare and common

365 alleles in proportions that off-set the depleted pool of germline alleles observed in the

366 leukemic coding sequences. Our data indicate both the rare allele burden and transcriptional

367 profile of the ultra-mutated leukemia were comparable to its canonical counterparts.

368 Therefore, despite its radically different genetic route to leukemogenesis, the molecular

369 profile of the ultra-mutated specimen seems to have converged towards that of canonical

370 cases.

371 Most ultra-mutated childhood cancers reported previously have a combined MMR

372 and proof reading polymerase deficiency 36, 37 - we detect evidence for the former (MSH2

373 point mutation), but not the latter. However, we identify somatic mutations in two genes

374 encoding DNA polymerases (POLK and POLG2) and an additional MMR gene (MUTYH).

375 Notably, the POLK mutation is a rare allele, predicted functional and encodes pol κ, a

376 replicative error-prone translesion DNA polymerase that plays a role in bypassing unrepaired

377 bases that otherwise block DNA replication 42, 43. For example, pol κ preferentially

378 incorporates dAMP at positions complementary to 7,8-dihydro-8-oxo-2′-deoxyguanosine

379 adducts 44. As a corollary to these observations, POLK mutations have been found in prostate

380 cancer with elevated C→T transition 45. Further studies are required to determine if any of

15 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

381 these mutations cooperate to specify the unique mutational pattern observed in the ultra-

382 mutated KMT2A-R iALL genome.

383 The largest fraction of ultra-mutations was attributed to Signature 1, associated with

384 5-methylcytosine deamination inducing C→T transitions, which accumulates in normal

385 somatic cells with age over the human lifespan 46. This signature is found ubiquitously across

386 tumour sub-types 47, and prominently in hypermutated early onset AML patients with

387 germline MBD4 mutations 48. LOH-sites (reference→alternative) were enriched for Signature

388 5 which is also correlated with human aging. We observe enrichment of Signature 20 in

389 LOH-sites and somatic alterations, consistent with an association with a MMR-defect 47. This

390 signature has also been reported as enriched in tumours with germline-MMR and secondary

391 mutation in the POLD1 proof-reading gene 47. We also find enrichment of Signature 12 which

392 arises from a poorly characterized underlying molecular process; however, the dominant base

393 transitions, A→G and T→C have been observed in association with DNA damage (e.g. by

394 adducts) in normal and cancer tissues 49, 50. These observations suggest common mutagenic

395 and repair deficits driving age-related mutation in humans were operative in the cell of origin

396 during pre-natal evolution of the ultra-mutated infant leukemia genome. Collectively, these

397 results implicate the MSH2 point mutation as the primary MMR-hit, followed by an

398 additional event effecting a proof reading polymerase such as POLK, however, our data imply

399 that POLE and POLD1 mutations are not cooperative in this process.

400 The preponderance of ultra-mutated sites to overlap germline alleles in proportion to

401 their population allele frequencies demonstrates that hypermutation can recapitulate aspects

402 of population scale human genetic variation in a cancer genome. Aproximately half of the

403 somatic substitutions appear to target heterozygous SNPs resulting in loss-of-heterozygosity.

404 We cannot rule out a role for focal chromosomal alterations undetectable by our copy-number

405 and zygosity analyses. However collectively our data suggest that a major fraction of the

406 LOH-sites arose by point mutagenesis of heterozygous positions. They more frequently target

407 the alternate allele inducing reversion to the reference base, with around 30% of events

16 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

408 resulting in a homozygous non-reference allele in the diagnostic sample. Most ultra-mutations

409 are therefore likely passengers coinciding with common human alleles, consistent with

410 findings that hypermutated tumours tend to have relatively few drivers proportional to their

411 mutation burden 51; however, as proposed previously by Valentine and colleagues52, a

412 minority of rare alleles may contribute to infant leukemogenesis. While our results do not

413 resolve the temporal order of the key oncogenic events - namely MSH2 point mutation,

414 genomic hypermutation and KMT2A translocation – due to the clonal nature of these events

415 they appear to have occurred in rapid succession. This could be explained by a sentinel DNA

416 repair defect inducing hypermutation and genome instability followed by positive selection of

417 cooperating lesions conferring a growth advantage resulting in clonal expansion and fixation

418 of passengers 53. An alternative model is also possible in which the MSH2 MMR-defect

419 impaired double-strand-break repair 54, 55 promoting KMT2A-rearrangement, but with neutral

420 selection of somatic point mutations. Such a scenario appears inconsistent with the observed

421 predominance of clonal mutations, but could be reconciled if the hypermutator phenotype was

422 lost rapidly after the KMT2A-rearrangement and prior to clonal expansion. Since pervasive

423 positive selection has been observed in cancer and non-cancerous tissues 51, 56, we favour the

424 former model as being more plausible. However, since we noted loss of hypermutator

425 phenotype in patient-derived cell lines, the later model is not ruled out by our data.

426 Collectively our results illuminate a link between mutagenesis during human aging, germline

427 variation and somatic ultra-rmutation in cancer. Further studies are required to determine if

428 combinations of somatic mutations and/or germline alleles promote KMT2A-R iALL

429 leukemogenesis/hypermutation and if somatic mutated sites significantly overlap human

430 germline alleles in other malignancies.

17 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

431 Financial support: Children’s Leukaemia and Cancer Research Foundation (Inc.) Fellowship

432 (MNC); NHMRC Career Development Fellowship 1066869 (JW); Raine Medical Research

433 Foundation Clinician Research Fellowship 2016-2018 (RSK); Children’s Leukaemia and

434 Cancer Research Foundation (Inc.) project grants (URK, MNC, RSK); Ian Potter Foundation

435 (AG); Healy Foundation (AG). Telethon Kids Institute Blue Sky Research Grant (MNC,

436 URK, RSK, JW). Telethon Kids Institute Research Focus Area Precision Medicine Working

437 Group Funding (JW, MNC).

438 Acknowledgments: We thank Daniel MacArthur and Fengmei Zhao for assisting with exome

439 sequence analysis; and researchers at The Pediatric Cancer Genome Project at St. Jude

440 Children's Research Hospital for provision of data. We thank all the patients, families and

441 hospital staff for providing samples.

442 Authorship contributions: AG wrote computer code, performed bioinformatics, conceived

443 analysis strategy, interpreted and analyzed data and edited the manuscript. AA and BF

444 performed experiments. KWC, PK and CB performed additional data analysis and

445 interpretation. RSK provided patient samples and clinical data, assisted with data

446 interpretation and edited the manuscript. CHC, URK and JW acquired funding support,

447 assisted with data interpretation and edited the manuscript. MNC performed bioinformatics,

448 formulated research, conceived analysis strategy, acquired funding support, interpreted data,

449 designed experiments and drafted the manuscript.

450 Disclosure of Conflicts-of-Interest: The authors declare no conflicts of interest.

18 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

451 References:

452 1. Meyer C, et al. The MLL recombinome of acute leukemias in 2017. Leukemia, 453 (2017).

454 455 2. Bergerson RJ, et al. An insertional mutagenesis screen identifies genes that 456 cooperate with Mll-AF9 in a murine leukemogenesis model. Blood 119, 4512-4523 457 (2012).

458 459 3. Chen W, O'Sullivan MG, Hudson W, Kersey J. Modeling human infant MLL leukemia 460 in mice: leukemia from fetal liver differs from that originating in postnatal marrow. 461 Blood 117, 3474-3475 (2011).

462 463 4. Gale KB, et al. Backtracking leukemia to birth: identification of clonotypic gene 464 fusion sequences in neonatal blood spots. Proc Natl Acad Sci U S A 94, 13950-13954 465 (1997).

466 467 5. Andersson AK, et al. The landscape of somatic mutations in infant MLL-rearranged 468 acute lymphoblastic leukemias. Nat Genet 47, 330-337 (2015).

469 470 6. Sanjuan-Pla A, et al. Revisiting the biology of infant t(4;11)/MLL-AF4+ B-cell acute 471 lymphoblastic leukemia. Blood 126, 2676-2685 (2015).

472 473 7. Cruickshank MN, et al. Systematic chemical and molecular profiling of MLL- 474 rearranged infant acute lymphoblastic leukemia reveals efficacy of romidepsin. 475 Leukemia, (2016).

476 477 8. Lavallee VP, et al. The transcriptomic landscape and directed chemical interrogation 478 of MLL-rearranged acute myeloid leukemias. Nat Genet 47, 1030-1037 (2015).

479 480 9. Casero D, et al. Long non-coding RNA profiling of human lymphoid progenitor cells 481 reveals transcriptional divergence of B cell and T cell lineages. Nat Immunol 16, 482 1282-1291 (2015).

483 484 10. Anders L, et al. Genome-wide localization of small molecules. Nat Biotechnol 32, 92- 485 96 (2014).

486 487 11. Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor 488 analysis of control genes or samples. Nat Biotechnol 32, 896-902 (2014).

489 490 12. Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model 491 analysis tools for RNA-seq read counts. Genome Biol 15, R29 (2014).

19 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

492 493 13. Francis RW, Thompson-Wicking K, Carter KW, Anderson D, Kees UR, Beesley AH. 494 FusionFinder: a software tool to identify expressed gene fusion candidates from 495 RNA-Seq data. PLoS One 7, e39987 (2012).

496 497 14. Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants from RNA- 498 seq data. Am J Hum Genet 93, 641-651 (2013).

499 500 15. Koboldt DC, et al. VarScan 2: somatic mutation and copy number alteration 501 discovery in cancer by exome sequencing. Genome Res 22, 568-576 (2012).

502 503 16. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants 504 from high-throughput sequencing data. Nucleic Acids Res 38, e164 (2010).

505 506 17. Adzhubei I, Jordan DM, Sunyaev SR. Predicting functional effect of human missense 507 mutations using PolyPhen-2. Curr Protoc Hum Genet Chapter 7, Unit7 20 (2013).

508 509 18. Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous 510 variants on protein function using the SIFT algorithm. Nat Protoc 4, 1073-1081 511 (2009).

512 513 19. Schwarz JM, Rodelsperger C, Schuelke M, Seelow D. MutationTaster evaluates 514 disease-causing potential of sequence alterations. Nat Methods 7, 575-576 (2010).

515 516 20. Chun S, Fay JC. Identification of deleterious mutations within three human genomes. 517 Genome Res 19, 1553-1561 (2009).

518 519 21. Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. Distribution and 520 intensity of constraint in mammalian genomic sequence. Genome Res 15, 901-913 521 (2005).

522 523 22. Cooper GM, et al. Single-nucleotide evolutionary constraint scores highlight disease- 524 causing mutations. Nat Methods 7, 250-251 (2010).

525 526 23. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general 527 framework for estimating the relative pathogenicity of human genetic variants. Nat 528 Genet 46, 310-315 (2014).

529 530 24. Rosenthal R, McGranahan N, Herrero J, Taylor BS, Swanton C. DeconstructSigs: 531 delineating mutational processes in single tumors distinguishes DNA repair 532 deficiencies and patterns of carcinoma evolution. Genome Biol 17, 31 (2016).

533

20 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

534 25. Favero F, et al. Sequenza: allele-specific copy number and mutation profiles from 535 tumor sequencing data. Ann Oncol 26, 64-70 (2015).

536 537 26. McLean CY, et al. GREAT improves functional interpretation of cis-regulatory 538 regions. Nat Biotechnol 28, 495-501 (2010).

539 540 27. Berman HM, et al. The . Nucleic Acids Res 28, 235-242 (2000).

541 542 28. He J, et al. Integrated genomic DNA/RNA profiling of hematologic malignancies in 543 the clinical setting. Blood 127, 3004-3014 (2016).

544 545 29. Chang MT, et al. Identifying recurrent mutations in cancer reveals widespread 546 lineage diversity and mutational specificity. Nat Biotechnol 34, 155-163 (2016).

547 548 30. Forbes SA, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids 549 Res 45, D777-D783 (2017).

550 551 31. Downing JR, et al. The Pediatric Cancer Genome Project. Nat Genet 44, 619-622 552 (2012).

553 554 32. Pickrell JK, Gilad Y, Pritchard JK. Comment on "Widespread RNA and DNA sequence 555 differences in the human transcriptome". Science 335, 1302; author reply 1302 556 (2012).

557 558 33. Chang VY, Basso G, Sakamoto KM, Nelson SF. Identification of somatic and germline 559 mutations using whole exome sequencing of congenital acute lymphoblastic 560 leukemia. BMC Cancer 13, 55 (2013).

561 562 34. Henderson MJ, et al. A xenograft model of infant leukaemia reveals a complex MLL 563 translocation. Br J Haematol 140, 716-719 (2008).

564 565 35. Warren JJ, Pohlhaus TJ, Changela A, Iyer RR, Modrich PL, Beese LS. Structure of the 566 human MutSalpha DNA lesion recognition complex. Mol Cell 26, 579-592 (2007).

567 568 36. Campbell BB, et al. Comprehensive Analysis of Hypermutation in Human Cancer. Cell 569 171, 1042-1056 e1010 (2017).

570 571 37. Shlien A, et al. Combined hereditary and somatic mutations of replication error 572 repair genes result in rapid onset of ultra-hypermutated cancers. Nat Genet 47, 257- 573 262 (2015).

574

21 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

575 38. Palles C, et al. Germline mutations affecting the proofreading domains of POLE and 576 POLD1 predispose to colorectal adenomas and carcinomas. Nat Genet 45, 136-144 577 (2013).

578 579 39. Jha V, Bian C, Xing G, Ling H. Structure and mechanism of error-free replication past 580 the major benzo[a]pyrene adduct by human DNA polymerase kappa. Nucleic Acids 581 Res 44, 4957-4967 (2016).

582 583 40. Szpiech ZA, et al. Prominent features of the amino acid mutation landscape in 584 cancer. PLoS One 12, e0183273 (2017).

585 586 41. Stam RW, et al. profiling-based dissection of MLL translocated and 587 MLL germline acute lymphoblastic leukemia in infants. Blood 115, 2835-2844 (2010).

588 589 42. Kanemaru Y, et al. DNA polymerase kappa protects human cells against MMC- 590 induced genotoxicity through error-free translesion DNA synthesis. Genes Environ 591 39, 6 (2017).

592 593 43. Maddukuri L, Ketkar A, Eddy S, Zafar MK, Eoff RL. The Werner syndrome protein 594 limits the error-prone 8-oxo-dG lesion bypass activity of human DNA polymerase 595 kappa. Nucleic Acids Res 42, 12027-12040 (2014).

596 597 44. Irimia A, Eoff RL, Guengerich FP, Egli M. Structural and functional elucidation of the 598 mechanism promoting error-prone synthesis by human DNA polymerase kappa 599 opposite the 7,8-dihydro-8-oxo-2'-deoxyguanosine adduct. J Biol Chem 284, 22467- 600 22480 (2009).

601 602 45. Yadav S, Mukhopadhyay S, Anbalagan M, Makridakis N. Somatic Mutations in 603 Catalytic Core of POLK Reported in Prostate Cancer Alter Translesion DNA Synthesis. 604 Hum Mutat 36, 873-880 (2015).

605 606 46. Alexandrov LB, et al. Clock-like mutational processes in human somatic cells. Nat 607 Genet 47, 1402-1407 (2015).

608 609 47. Alexandrov LB, et al. Signatures of mutational processes in human cancer. Nature 610 500, 415-421 (2013).

611 612 48. Mathijs A Sanders EC, Christoffer Flensburg, Annelieke Zeilemaker, Sarah E Miller, 613 Adil al Hinai, Ashish Bajel, Bram Luiken, Melissa Rijken, Tamara Mclennan, Remco M 614 Hoogenboezem, François G Kavelaars, Marnie E Blewitt, Eric M Bindels, Warren S 615 Alexander, Bob Löwenberg, Andrew W Roberts, Peter J M Valk, Ian Majewski. 616 Germline loss of MBD4 predisposes to leukaemia due to a mutagenic cascade driven 617 by 5mC. bioRxiv, (2018).

22 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

618 619 49. Chen L, Liu P, Evans TC, Jr., Ettwiller LM. DNA damage is a pervasive cause of 620 sequencing errors, directly confounding variant identification. Science 355, 752-756 621 (2017).

622 623 50. Lin J, Shi T. Error-prone DNA polymerase and oxidative stress increase the incidences 624 of A to G mutations in tumors. Oncotarget 8, 45154-45163 (2017).

625 626 51. Martincorena I, et al. Universal Patterns of Selection in Cancer and Somatic Tissues. 627 Cell 171, 1029-1041 e1021 (2017).

628 629 52. Valentine MC, et al. Excess congenital non-synonymous variation in leukemia- 630 associated genes in MLL- infant leukemia: a Children's Oncology Group report. 631 Leukemia, (2013).

632 633 53. Dagogo-Jack I, Shaw AT. Tumour heterogeneity and resistance to cancer therapies. 634 Nat Rev Clin Oncol, (2017).

635 636 54. van Oers JM, et al. The MutSbeta complex is a modulator of p53-driven 637 tumorigenesis through its functions in both DNA double-strand break repair and 638 mismatch repair. Oncogene 33, 3939-3946 (2014).

639 640 55. Burdova K, Mihaljevic B, Sturzenegger A, Chappidi N, Janscak P. The Mismatch- 641 Binding Factor MutSbeta Can Mediate ATR Activation in Response to DNA Double- 642 Strand Breaks. Mol Cell 59, 603-614 (2015).

643 644 56. Martincorena I, et al. Tumor evolution. High burden and pervasive positive selection 645 of somatic mutations in normal human skin. Science 348, 880-886 (2015).

646

647

23 bioRxiv preprint certified bypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmadeavailableunder Gout et al Ultra-mutation targets germline alleles in an infant leukemia doi: 648 Table 1: Characterisation of somatic lesions in KMT2A-R patient samples. https://doi.org/10.1101/248690

Reciproc Point al Diagnostic Mutations Point Indels Point mutations or Patient KMT2A- leukemia Constitutional 0.1 > VAF1 Mutations VAF > LOH2- hotspots with RNA- ID KMT2A-fusion fusion DNA remission DNA < 0.3 VAF > 0.3 0.3 sites SEQ reads Not > 1,000 expressed P337 KMT2A-MLLT1 detected Exome Exome 198 5,054 260 6,555 somatic mutations Not No non-silent somatic ; a this versionpostedFebruary7,2018.

P401 KMT2A- MLLT 3 detected Exome Exome 78 3 9 45 mutations CC-BY-NC 4.0Internationallicense AFF1- No non-silent somatic P399 KMT2A-AFF1 KMT2A Exome Exome 95 3 7 67 mutations Not P438 KMT2A- MLLT3 detected Exome Exome 74 4 9 100 NRAS.p.G12S EPS15- PER3.p.M1006R, P810 KMT2A-EPS15 KMT2A Exome Exome 95 4 5 46 PER3.p.K1007E EPS15- PER3 M1006R, P809 KMT2A-EPS15 KMT2A Exome Exome 69 3 3 28 PER3.p.K1007E AFF1- P848 KMT2A-AFF1 KMT2A no Exome NA NA NA NA KRAS.p.A146T . AFF1- The copyrightholderforthispreprint(whichwasnot P287 KMT2A-AFF1 KMT2A no no NA NA NA NA No COSMIC3 hotspots Not P706 KMT2A- MLLT1 detected no Exome NA NA NA NA KRAS.p.A146T Not P272 KMT2A-AFF1 detected no no NA NA NA NA No COSMIC3 hotspots 649 1VAF: Variant allele fraction; 2LOH: Loss-of-heterozygosity; 3COSMIC: Catalogue of Somatic Mutations in Cancer

24 bioRxiv preprint certified bypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmadeavailableunder Gout et al Ultra-mutation targets germline alleles in an infant leukemia doi: 650 Table 2: Mean and standard deviation of expressed rare (Not in ExAC or <0.5% allele frequency), low-frequency (0.5-1% allele frequency) and common 651 (>1% allele frequency) alleles in 39 KMT2A-R patient samples compared to ultra-mutated (P337) sample and its constituent expressed germline alleles and https://doi.org/10.1101/248690 652 somatic mutations.

Not in ExAC <0.5% AF1 in ExAC 0.5-1% AF in ExAC >1% AF in ExAC KMT2A-R iALL 42.3±10.4 212.1±32.6 76.8±14.7 7722.3±266.0 P337 expressed 38 174 89 7590 P337 Germline 17 88 47 6235 P337 Somatic 21 86 42 1355

1 2 ; a 653 AF: allele frequency; ExAC: Exome aggregation consortium this versionpostedFebruary7,2018. CC-BY-NC 4.0Internationallicense . The copyrightholderforthispreprint(whichwasnot

25 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

654 Figure Legends:

655 Figure 1: Characterization of germline and somatic point mutations identifies an ultra-

656 mutated infant leukemia case. (a) Overlap of variants identified from diagnostic leukemia

657 specimen by RNA-seq and whole exome sequencing (WES). (b) Overlap of variants called in

658 diagnostic RNA-seq and remission bone marrow WES. (c) Overlap of variants called in

659 diagnostic and remission WES. Barplots colored showing overlap binned by derived allele

660 frequency as follows: present with AF<0.5% in ExAC (green), present with AF 0.5-1% in

661 ExAC (yellow), present with AF>1% in ExAC (purple), absent from the ExAC (orange), non-

662 overlapping variants are colored to show sites with coverage <20 reads (grey) or coverage

663 >20 reads (black) in the comparator. (d) Scatterplots of diagnostic leukemia (x-axis) versus

664 remission (y-axis) variant allele fraction (VAF) from WES for coding variants detected in

665 diagnostic leukemia RNA-seq.

666 Figure 2: Clonal analysis of somatic ultra-mutations. Histograms showing the VAF

667 density distribution of somatic mutations (left, x-axis: dark shaded bars) or LOH-sites (right;

668 x-axis: light shaded bars) in diagnostic leukemia WES from canonical cases (a) and the ultra-

669 mutated case (b). (c) Illustration of cell lines derived from ultra-mutated specimen and

670 histograms displaying the VAF density distribution of somatic mutations (left, x-axis: dark

671 shaded bars) or LOH-sites (right; x-axis: light shaded bars) in cell lines (PER-784A: red;

672 PER-826A: green). (d) Venn diagram displaying the overlap in somatic mutations (left) and

673 LOH-sites (right) identified by VarScan2 comparing the primary diagnostic specimen and

674 derived cell lines (PER-784A: red; PER-826A: green) to the primary remission sample. (e)

675 Somatic mutation and (f) LOH coding sites showing total read coverage (left y-axis) and VAF

676 (right y-axis) from primary specimens (left panels) and cell lines (right panels) ordered by

677 decreasing VAF in the diagnostic sample (x-axis).

678

26 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

679 Figure 3: Germline-like somatic signatures of ultra-mutations. (a-d) Histograms showing

680 the number (a-b) or frequency (c-d) of events overlapping the spectrum of ExAC derived

681 allele frequency bins with 5% increments. (a) Somatic mutations called by VarScan2

682 grouping homozygous and heterozygous events separately. (b) Loss-of-heterozygosity sites

683 called by VarScan2 grouping reference→alternative and alterantive→reference sites

684 separately. (c-d) All quality filtered coding variants from each of the canonical patients or

685 P337 ultra-mutated specimen called by GATK grouped by single nucleotide polymorphisms

686 (SNPs) that were homozygous and heterozygous events from (c) remission sample or (d)

687 diagnostic sample. (e) Normalized fraction of variants within unique trinucleotide context

688 (ordered alphabetically on the x-axis) for VarScan2 calls (left panel; as shown for panels a-b),

689 or remission variants (right panel; as shown for panel c). (f) Somatic signatures identified by

690 deconstructSigs for categories of mutations/SNPs (displayed in panel e) showing their relative

691 proportions (bar plot) and pair-wise Pearson’s correlation coefficients (heatmap), ordered by

692 unsupervised hierarchical clustering.

693

694 Figure 4: Ultra-mutations target rare and common human alleles and are associated

695 with somatic point mutations encoding a novel MSH2 allele and a rare POLK allele.

696 (a) Proportion of coding single nucleotide variants in diagnostic (D) and remission (R) whole

697 exome sequencing (WES) data binned by derived allele frequency as follows: present with

698 AF<0.5% in ExAC (green), present with AF 0.5-1% in ExAC (yellow), present with AF>1%

699 in ExAC (purple), absent from the ExAC (orange). Grey shading denotes data from the ultra-

700 mutated specimen. The distribution of variants called as germline, somatic or LOH by

701 VarScan2 are displayed for comparison. (b) Number of variants (y-axis) called from RNA-seq

702 reads from St Jude and Perth cohorts (open circles) are plotted with the total variants detected

703 in the ultra-mutated sample (shaded circles), germline-encoded variants (shaded triangle

704 pointing up) and somatic point mutations (shaded triangle pointing down) binned according to

27 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

705 the allele count in ExAC database (red: not in ExAC; green < 0.5%; yellow: 0.5-1% and blue

706 > 0.5%). (c) Bar plot displaying fraction of variants within rare, low-frequency and common

707 allele frequency bins (as described for panel a) ordered by increasing proportion of common

708 alleles (i.e. derived AF>1%) and excluding outliers (SJINF018, SJINF021, SJINF010 and

709 SJINF008). Proportion of RNA-seq germline and somatic variants identified by overlap with

710 remission/diagnostic WES data are displayed for comparison. (d) Crystal structure of the

711 MutS lesion recognition complex (PDB CODE 2O8D35) MSH2:MSH6 mismatch repair

712 (MSH2: green, MSH6: blue) bound to ADP and G:T mismatch DNA (orange)

713 showing the location of the amino acid change (NM_000251:exon9:c.G1465A:p.E489K). (e)

714 Crystal structure of Pol k (green and blue representing different molecules) in complex with a

715 bulky DNA adduct (orange) (PDB CODE 4U6P39) showing locations of the amino acid

716 change (NM_016218:exon7:c.G893A:p.R298H).

717 Figure 5: Global transcriptional concordance of ultra-mutated infant leukemia. Multi-

718 dimensional scaling of RNA-seq count data from acute leukemias and normal bone marrow

719 progenitor populations displaying translocation partner gene by color and blood/leukemia

720 type by symbol shape according to the legend. Labels show replicate measurements from the

721 ultra-mutated diagnostic specimen, P337.

28 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Gout et al Ultra-mutation targets germline alleles in an infant leukemia

722 Supplementary Fig. S1. Copy-number alterations and loss-of-heterozygosity in ultra-

723 mutated KMT2A-R iALL diagnostic sample. (a) Genomic copy number ratios of the ultra-

724 mutated sample comparing diagnostic and remission WES data. (b) Proportion of reads

725 mapping to different copy-number bins (up to 4). (c) Box and whisker plots (median with

726 upper and lower quartiles) displaying the nearest distance of a variant/mutation to a

727 homozygous variant in the diagnostic WES data for canonical samples (left) compared to the

728 ultra-mutated sample (right) for VarScan2 germline, somatic and LOH SNV calls.

729 Supplementary Table S1: Summary of samples

730 Supplementary Table S2: Read evidence of RNA-seq variants that were not detected in

731 matched WES data.

732 Supplementary Table S3: Comparison of RNA-seq reads from diagnosis sample with WES

733 reads.

734 Supplementary Table S4: Comparison of RNA-seq reads from diagnosis sample with WES

735 reads from diagnosis, remission and derived cell lines.

736 Supplementary Table S5: Annotation of somatic mutations and indels from ultra-mutated

737 patient P337.

738 Supplementary Table S6: Summary of LOH-sites called in primary P337 and derived cell

739 lines.

740 Supplementary Table S7: Summary of somatic SNVs called in primary P337 and derived

741 cell lines.

29 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

FIGURE 1 a b c P810 P810 P810 P809 P809 P809 P438 P438 P438 P399 P399 P399 P401

Diagnosis P401 P401 P337 RNA vs exome

P337 RNA Diagnosis P337 Diagnosis exome Diagnosis vs remission exome vs remission exome 0 0 20 40 0 20 40 50 100 150 200 250 400 600 800 100 200 300 6000 7000 8000 100 200 300 6000 7000 8000 1000 16000180002000022000 Number of variants Number of variants Number of variants Not in ExAC <0.5% AF in ExAC 0.5-1% AF in ExAC >1% AF in ExAC Discordant >20 reads Discordant <20 reads d “Silent genomic landscape” Ultra-mutated P399 P401 P438 P809 P810 P337 1.0 1.0 1.0 1.0 1.0 1.0

0.5 0.5 0.5 0.5 0.5 0.5

Remission exome VAF Remission 0 0 0 0 0 0 0 0.5 1.0 0 0.5 1.0 0 0.5 1.0 0 0.5 1.0 0 0.5 1.0 0 0.5 1.0 Diagnosis exome VAF b e Frequency a FIGURE 2 400 1 000 1 000 Read coverage Frequency 10 15 800 600 400 200 200 400 600 800 0 0 5 P337 diagnosisVAF . . 1.0 0.5 0.1 P399 diagnosisVAF 0

. . 1.0 0.5 0.1 P337 coding somatic variants somatic coding P337

500 bioRxiv preprint certified bypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmadeavailableunder Remission Diagnosis 1000 4 000 150 300 0 0 . 1.0 0.5 0

1500 1.0 0.5 0

2000 doi: Reads Reads VAF VAF https://doi.org/10.1101/248690

10 1.0 0.5 0.0 0.5 1.0 c

P337 diagnosis 0 4 8

n fractio allele ariant V P401 diagnosisVAF . . 1.0 0.5 0.1

1 000 Read coverage 1 000 800 600 400 200 200 400 600 800

0 Cell line read evidence read line Cell 100 150 50 0 500 PER-826A PER-784A . 1.0 0.5 0 PER-784A PER-826A 1000 300

1500 0 ; PER-784A VAF . . 1.0 0.5 0.1 a this versionpostedFebruary7,2018. CC-BY-NC 4.0Internationallicense 10 15 0 5

2000 P438 diagnosisVAF . . 1.0 0.5 0.1 Reads Reads VAF VAF

1.0 0.5 0.0 0.5 1.0 4 000

n fractio allele ariant V 0 f . 1.0 0.5 0 200 400 0 1 000 Read coverage 1 000 800 600 400 200 200 400 600 800 . 1.0 0.5 0

0 P337 coding LOH sites LOH coding P337

500 400

1000 0 . . 1.0 0.5 0.1 PER-826A VAF Remission Diagnosis 0 4 8 P809 diagnosisVAF

1500 1.0 0.5 0.1 . The copyrightholderforthispreprint(whichwasnot 2000 4 000 2500 0 Reads Reads 100 150 VAF VAF . 1.0 0.5 0 3000 50 0

1.0 0.5 0.0 0.5 1.0 . 1.0 0.5 0

n fractio allele ariant V

1 000 Read coverage 1 000 800 600 400 200 200 400 600 800 PER-784A 0 d P337 diagnosis

940 99 Cell line read evidence read line Cell 0 4 8 7 305 177 . . 1.0 0.5 0.1

500 P810 diagnosisVAF Somatic 4 272 300

1000 PER-826A VarScan2overlap PER-784A PER-826A 127 1500 100 2000 PER-784A 0 115 P337 diagnosis . 1.0 0.5 0 2500 405 260 5 485 Reads Reads LOH 405 45 VAF VAF

3000 PER-826A

129 1.0 0.5 0.0 0.5 1.0

n fractio allele ariant V bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

FIGURE 3 a P337 somatic homozygous b P337 LOH Ref→Alt 10 000 10 000 P337 somatic heterozygous P337 LOH Alt→Ref 1 000 1 000

100 100

10 10

1 1 Number of events Number of events 0.1 0.1

0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 0.00 0.05 0.10 0.15 0.20 0.250.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.750.80 0.85 0.90 0.95 1.00 ExAC derived allele frequency ExAC derived allele frequency c d Remission coding Diagnosis coding 1 P337 homozygous Canonical homozygous 1 P337 homozygous Canonical homozygous P337 heterozygous Canonical heterozygous P337 heterozygous Canonical heterozygous 0.1 0.1

0.01 0.01

0.001 0.001

0.0001 0.0001 Relative frequency (fractions ) Relative frequency (fractions ) 0.00 0.050.10 0.15 0.200.25 0.30 0.35 0.400.45 0.50 0.55 0.600.65 0.70 0.750.80 0.85 0.90 0.951.00 0.00 0.050.10 0.15 0.200.25 0.30 0.35 0.400.45 0.50 0.55 0.600.65 0.70 0.750.80 0.85 0.90 0.951.00 ExAC derived allele frequency ExAC derived allele frequency e VarScan2 P337 somatic lesions Remission coding variants P337 somatic heterozygous P337 heterozygous 0.08 N = 4 773 0.08 N = 10 681

0.04 0.04 Fraction Fraction 0 0 P337 somatic homozygous 0.08 P337 homozygous N = 281 N = 6 617 0.04 0.04 Fraction Fraction 0 0 P337 LOH Alt→Ref 0.08 Canonical heterozygous N = 4 453 N = 51 347 0.04 0.04 Fraction Fraction 0 0 0.08 P337 LOH Ref→Alt 0.08 Canonical homozygous N = 2 057 N = 31 237 0.04 0.04 Fraction Fraction 0 0 T[T>A]T T[T>A]T T[T>C]T T[T>C]T T[C>T]T T[C>T]T T[T>A]A T[T>A]A A[T>A]T A[T>A]T T[T>G]T T[T>G]T T[T>C]A T[T>C]A A[T>C]T A[T>C]T C[T>A]T C[T>A]T T[T>A]C T[T>A]C T[C>T]A T[C>T]A A[T>A]A A[T>A]A A[C>T]T A[C>T]T T[C>A]T T[C>A]T T[T>C]C T[T>C]C T[T>G]A T[T>G]A A[T>G]T A[T>G]T C[T>C]T C[T>C]T T[T>A]G A[T>C]A T[T>A]G A[T>C]A G[T>A]T G[T>A]T T[C>T]C T[C>T]C A[T>A]C A[T>A]C C[T>A]A C[T>A]A C[C>T]T C[C>T]T A[C>T]A A[C>T]A T[C>A]A T[C>A]A A[C>A]T A[C>A]T C[T>G]T C[T>G]T T[T>G]C T[T>G]C G[T>C]T T[T>C]G G[T>C]T T[T>C]G A[T>G]A A[T>G]A C[T>C]A C[T>C]A A[T>C]C A[T>C]C T[C>T]G T[C>T]G C[T>A]C C[T>A]C G[T>A]A G[T>A]A G[C>T]T G[C>T]T A[T>A]G A[T>A]G C[C>T]A C[C>T]A T[C>G]T A[C>T]C T[C>G]T A[C>T]C T[C>A]C T[C>A]C C[C>A]T C[C>A]T A[C>A]A A[C>A]A T[T>G]G T[T>G]G G[T>G]T G[T>G]T A[T>G]C A[T>G]C C[T>G]A C[T>G]A C[T>C]C C[T>C]C G[T>C]A G[T>C]A A[T>C]G A[T>C]G C[T>A]G C[T>A]G G[T>A]C G[T>A]C C[C>T]C C[C>T]C G[C>T]A G[C>T]A A[C>T]G A[C>T]G T[C>G]A T[C>G]A T[C>A]G T[C>A]G A[C>G]T A[C>G]T G[C>A]T G[C>A]T A[C>A]C A[C>A]C C[C>A]A C[C>A]A C[T>G]C C[T>G]C G[T>G]A G[T>G]A G[T>C]C G[T>C]C A[T>G]G A[T>G]G C[T>C]G C[T>C]G G[T>A]G G[T>A]G G[C>T]C G[C>T]C C[C>T]G C[C>T]G T[C>G]C T[C>G]C C[C>G]T C[C>G]T A[C>G]A A[C>G]A G[C>A]A G[C>A]A C[C>A]C C[C>A]C A[C>A]G A[C>A]G C[T>G]G C[T>G]G G[T>G]C G[T>G]C G[T>C]G G[T>C]G G[C>T]G G[C>T]G T[C>G]G T[C>G]G G[C>G]T G[C>G]T A[C>G]C A[C>G]C C[C>G]A C[C>G]A G[C>A]C G[C>A]C C[C>A]G C[C>A]G G[T>G]G G[T>G]G C[C>G]C C[C>G]C G[C>G]A G[C>G]A A[C>G]G A[C>G]G G[C>A]G G[C>A]G C[C>G]G C[C>G]G G[C>G]C G[C>G]C G[C>G]G G[C>G]G C>A C>G C>T T>A T>C T>G C>A C>G C>T T>A T>C T>G f Proportion of variants (%) Pearson’s correlation coefficient 0 50 100 0.2 0.4 0.6 0.8 1.0 Unknown P337 LOH Alt→Ref Signature U1 P337 heterozygous somatic Signature R3 P337 homozygous SNP Signature 5 Canonical homozygous SNP Signature 20 P337 LOH Ref→Alt Signature 12 P337 homozygous somatic Signature 1B Canonical heterozygous SNP Signature 1A P337 heterozygous SNP a FIGURE 4 Proportion of variants (%) 100 10 95 0 5 d 0.5-1% Not inExAC

MSH2 p.E489K P399-R WES codingvariants bioRxiv preprint AF inExAC P399-D certified bypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmadeavailableunder P401-R P401-D P438-R P438-D MSH2 p.E489K P809-R P809-D

>1% <0.5% P810-R

P810-D doi:

AF inExAC P337-R 22 AF inExAC

P337-D https://doi.org/10.1101/248690 VarScan2 Å

23 Gerrmline LOH Å Somatic 20 Å b 10 000

Number of variant1 000 s 100 10

St Jude Not inExAC Perth ; a P337 this versionpostedFebruary7,2018. Germline CC-BY-NC 4.0Internationallicense b Somatic RNA-seq codingvariants

St Jude <0.5% AF Perth P337 Germline Somatic

St Jude 0.5-1% AF Perth P337

e Germline Somatic POLK p.R298H .

St Jude The copyrightholderforthispreprint(whichwasnot Perth >1% AF P337 Germline Somatic c Proportion of variants (%) 100 95 10 0 5

SJINF017 SJINF059 SJINF016 20 POLK p.R298H SJINF004

30 SJINF013 Å

POLK p.R298H SJINF014 Å SJINF011

30 SJINF001 SJINF015 Å P287 P810 SJINF020 RNA-seq codingvariants P809 P438 SJINF063 SJINF012 SJINF006 SJINF019 SJINF062 SJINF009 SJINF053 P337 SJINF002 SJINF055 P706 P401 P399 SJINF064 SJINF003 SJINF005 SJINF022 P848 Germline Somatic bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

FIGURE 5 Blood progenitor ● MLL/KMT2A-Germline ● ● ● ● ● MLL/KMT2A-MLLT1 ● ● ● ● ● ●● ● P337_rep1 MLL/KMT2A-AFF1 ● ● ● ● ● ● ●● ● ● ●

1 2 MLL/KMT2A-MLLT3 1 2 ● P337_rep2 ● ● ● ● ●● ● ● ● MLL/KMT2A-MLLT4 ● ● ● ● ● MLL/KMT2A-MLLT10 ● ● ● 0 0 ● ● MLL/KMT2A-EPS15 MLL/KMT2A-USP2 MLL/KMT2A-CASC5 ● MLL/KMT2A-MLLT6 MLL/KMT2A-ELL ● Dimension 2 MLL/KMT2A-ENL

Leading logFC dim 2 MLL/KMT2A-SEPT9 ● MLL/KMT2A-ENAH LMPP MLL/KMT2A-GAS7 BCP Diagnostic infant ALL HSC Relapse infant ALL LCP −4 −3 −2 −1

−4 −3 −2 −1 Childhood ALL Thy Childhood AML Leucegene AML −4 −2 0 2 Dimension 1 Leading logFC dim 1 bioRxiv preprint doi: https://doi.org/10.1101/248690; this version posted February 7, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

SUPPLEMENTARY FIGURE S1 a 2.5 7 Copy number 2.0 6 5 1.5 4 1.0 3 2 Depth ratio 0.5 1 0 0.0 Chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X b c 80 Canonical 7 P337 0.5 6 0.4 Canonical LOH 60 5 0.3 4

40 3 Density 0.2 2 0.1 Percentage (%) Percentage 20 1

to homozygous SNP) 0.0

log10 (shortest distance 0 0 2 4 6 0 Germline Somatic LOH Germline Somatic LOH 0 1 2 3 4 log10 (shortest distance to homozygous SNP) Copy number N = 19 245 Bandwidth = 0.1431