Supplementary Information Supplement to: Large-Scale Identification of Clonal Hematopoiesis and Mutations Recurrent in Blood Cancers

Julie E. Feusier, Sasi Arunachalam, Tsewang Tashi, Monika J. Baker, Chad VanSant-Webb, Amber Ferdig, Bryan E. Welm, Christopher Ours, Juan L. Rodriguez-Flores, Lynn B. Jorde, Josef T. Prchal, Clinton C. Mason

(JEF and SA contributed equally.)

Department of Pediatrics, Division of Pediatric Hematology and Oncology, University of Utah, Salt Lake City, Utah (JEF, SA, MJB, CVS-W, AF, CO, CCM); Department of Human Genetics, University of Utah, Salt Lake City, Utah (JEF, LBJ); Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah (SA, TT, BEW, JTP); Department of Oncological Sciences, University of Utah, Salt Lake City, Utah (SA); Division of Hematology and Hematologic Malignancies, University of Utah, Salt Lake City, Utah (TT, JTP); VA Medical Center, Salt Lake City, Utah (TT, JTP); Department of Surgery, University of Utah, Salt Lake City, Utah (BEW); Department of Genetic Medicine, Weill Cornell Medical College, New York, New York (JLR-F)

Table of Contents: Supp. Methods; Supp. Figure 1; Supp. Figure 2; Supp. Figure 3; Supp. Figure 4; Supp. Figure 5

Supplementary Methods.

A. Hotspot mutation analysis:

Criteria for inclusion of somatic mutation studies and samples

Of the identified somatic mutation studies, only those meeting sample identifiability criteria were included, in an effort to include only mutations from diagnostic samples and to greatly reduce the likelihood of an individual’s mutations being included twice in our extraction process. Studies were required to have assessed somatic mutations with NGS methods in eight or more diagnostic samples of one of seven hematologic malignancies (ALL, AML, CML, CMML, JMML, MDS, and MPN). Some studies assessed relapse or secondary leukemia samples in addition to the initial diagnostic sample. We removed such studies from our analysis unless all mutations from each relapse/secondary leukemia sample could be confidently removed. Many studies followed an approach of assessing a discovery cohort followed by a validation cohort. As some of these studies included a few individuals in both cohorts, we uniformly included the mutations reported for only one of these cohorts to avoid potential artifactual recurrence. This was most frequently the validation cohort, which often had a larger sample size; however, if one cohort was assessed for only a minimal target region, the other cohort having a larger genomic region of assessment was used instead. We were also careful to assess whether studies documented the inclusion of patients from previously reported studies. When this occurred, only mutations identified in the more recent study were included.

A process was developed to automate as much as possible the mining, extraction, conversion, and evaluation of reported mutations in each paper. Each of the studies was assessed for genetic information that could accurately identify the genetic position of the alteration and the effect on its encoded protein. Formats of reported mutations were standardized, with priority for inclusion given to reported genomic positions with reference and alternate alleles provided. In general, studies reporting only amino acid substitutions without a transcript identifier or genomic position were excluded, as such mutations may have occurred in one of multiple locations.

Criteria for inclusion of somatic mutations

For studies meeting the above criteria, additional inclusion criteria were utilized that assessed the utility and reliability of listed mutations. The extracted mutations from each study were evaluated for consistency of provided reference alleles and amino acids. This was necessary to ensure that the correct genetic reference system and build of the reported mutations was identified prior to our conversion and harmonization program. We assessed the reference alleles of all mutations (SNVs and indels grouped separately) reported within a given study and compared these with the possible combinations of reference build, 0-based or 1-based coordinate system, and cDNA or gDNA at those loci. Nearly always the correct scenario was obvious, as near 100% concordance was achieved for one combination, with the other combinations being far from this. When such near-perfect concordance was not achieved for the set of SNVs, the set of indels, or both, the affected sets were excluded in their entirety.
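The reference-allele consistency check can be illustrated with a minimal sketch. It assumes toy in-memory sequences rather than real genome builds, and it omits the cDNA versus gDNA dimension; the names `concordance`, `rank_scenarios`, `buildA`, and `buildB` are hypothetical and are not taken from the original pipeline.

```python
from itertools import product

def concordance(mutations, sequences, one_based):
    """Fraction of reported reference alleles found at the reported position."""
    hits = 0
    for chrom, pos, ref in mutations:
        idx = pos - 1 if one_based else pos  # convert to a 0-based string index
        seq = sequences.get(chrom, "")
        if idx >= 0 and seq[idx:idx + len(ref)] == ref.upper():
            hits += 1
    return hits / len(mutations) if mutations else 0.0

def rank_scenarios(mutations, builds):
    """Score every (build, coordinate system) combination, best first."""
    scores = {}
    for build, one_based in product(builds, (True, False)):
        label = (build, "1-based" if one_based else "0-based")
        scores[label] = concordance(mutations, builds[build], one_based)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (hypothetical): two "builds" of one short chromosome.
builds = {
    "buildA": {"chr1": "ACGTTGCAAGCT"},
    "buildB": {"chr1": "TTACGTTGCAAG"},
}
# Reported (chromosome, position, reference allele), written 1-based on buildA.
mutations = [("chr1", 1, "A"), ("chr1", 2, "C"), ("chr1", 4, "T"),
             ("chr1", 7, "C"), ("chr1", 10, "G")]

ranking = rank_scenarios(mutations, builds)
```

In this toy case one combination matches every reference allele while the others match few or none, which mirrors the near-100% versus far-from-100% separation described above.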

A few studies published all observed variants including those having an extremely low number of supporting reads or allele frequencies – and without providing an explanation to delineate mutations. For such studies we included an additional variant allele fraction (VAF) requirement (VAF in tumor ≥30% if only tumor was assessed, VAF in tumor ≥25% and VAF in germline <5% if both tumor and germline assessed) to increase the likelihood of including bona fide diagnostic mutations over noise and artifacts. When only amino acid substitutions were published with a transcript, a back-transformation was utilized to identify the coordinates and mutation that would produce such a substitution, and that information was then forward translated to obtain the amino acid substitution for the single transcript chosen for each (described below). The LiftOver utility of the UCSC Genome Browser1 was used to convert genomic coordinates of mutations for studies not providing mutations in hg19 coordinates. Any mutations at unresolvable loci were excluded.
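The additional VAF requirement for studies that published all observed variants can be written as a small predicate. This is a sketch of the stated thresholds only; the function name and the convention of passing `None` when no germline sample was assessed are assumptions made for illustration.

```python
def passes_vaf_filter(tumor_vaf, germline_vaf=None):
    """Apply the VAF thresholds stated in the text.

    tumor_vaf and germline_vaf are fractions in [0, 1];
    germline_vaf=None means only the tumor sample was assessed.
    """
    if germline_vaf is None:
        return tumor_vaf >= 0.30
    return tumor_vaf >= 0.25 and germline_vaf < 0.05
```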

Following the initial extraction of diagnostic mutations from the somatic landscape papers and determination of 48 studies with useable data, these studies were reviewed a second time to ensure that extracted mutations met all usability criteria. Twelve studies were arbitrated to have mutations added or removed, resulting in a net decrease of 298 mutations following this additional evaluation. This entire process resulted in 48 studies meeting inclusion criteria which together reported 58,177 mutations that we then assessed for effect on protein and recurrence.

Single transcript determination

Initial assessment of the extraction process identified the use of different transcripts for many genes across different studies to be a common problem and a potential pitfall for an analysis relying only on the amino acid alteration reported in a given study. As a result, mutations were included when a genetic position or transcript was provided in addition to the amino acid alteration, or when this information could be unambiguously resolved. To standardize reporting and recurrence determination, the most common RefSeq2 transcript for genes utilized across nine initial studies was determined (when ties occurred the sample size of the studies was used in the final selection). For genes not reported mutated in these nine somatic mutation landscape studies, the largest RefSeq transcript was utilized as the referent transcript. These single transcripts for each gene were utilized as the sole determinant for the effect of mutations on amino acids by ANNOVAR3 throughout all of the analyses reported in this manuscript, with the exception of a few genes in the general CHIP analysis only, in which we utilized the transcript listed in previous CHIP assessments to more closely match the use of similar methods for comparison.
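The transcript selection rule can be sketched as follows. The sketch assumes that a tie in the number of studies is broken by the summed sample size of the studies using each transcript, which is one possible reading of the tie rule; gene names, transcript labels, and study identifiers below are placeholders.

```python
from collections import defaultdict

def choose_transcripts(usage):
    """usage: iterable of (study_id, n_samples, gene, transcript) tuples.

    Returns {gene: transcript}, preferring the transcript used by the most
    studies and, on ties, the one backed by the larger total sample size
    (an assumed interpretation of the tie-breaking rule).
    """
    n_studies = defaultdict(lambda: defaultdict(int))
    n_samples_total = defaultdict(lambda: defaultdict(int))
    for _study, n_samples, gene, transcript in usage:
        n_studies[gene][transcript] += 1
        n_samples_total[gene][transcript] += n_samples
    chosen = {}
    for gene, per_tx in n_studies.items():
        chosen[gene] = max(
            per_tx, key=lambda tx: (per_tx[tx], n_samples_total[gene][tx])
        )
    return chosen

# Placeholder data: GENE1 has a clear majority, GENE2 is a tie.
usage = [
    ("study1", 50, "GENE1", "TX_A"),
    ("study2", 200, "GENE1", "TX_B"),
    ("study3", 30, "GENE1", "TX_A"),
    ("study1", 50, "GENE2", "TX_C"),
    ("study2", 200, "GENE2", "TX_D"),
]
chosen = choose_transcripts(usage)
```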

Hotspot determination

We sought to identify amino acid and splice-site loci that were reported mutated across the 48 somatic landscape studies more frequently than would be expected by chance, and to designate such loci as mutation hotspots. Our collective cohort consisted of patients assessed for a variety of targeted regions and representing multiple blood cancers, and which for the most part did not have silent or non-exonic mutations reported. As a result, we sought to determine minimalist criteria that would reflect significant deviation from models of the overall expected recurrence frequency of reported protein-altering and splice-site mutations under plausible assumptions that included a greater somatic burden than was reported (increasing the requirement for significant recurrence). Under the null hypothesis of equal contribution for all mutation loci (i.e. no hotspots, or equivalently, no observed recurrence beyond expectation from binomial distributions), we tested various model scenarios for deviance from expected recurrence. We simulated the combined patients of these studies, varying the total theoretical protein-altering or splice-site mutations in each patient from 30 to 10 (lower values yielding very infrequent recurrence) and across 10 to 20 million amino acid or splice-site loci. Each separate simulation provided estimates for expected recurrence distributions, reflecting the expectation from their corresponding binomial distribution models. For the plausible scenario of 15 amino acid altering or splice-site mutations per patient occurring across 15 million possible amino acid or splice-site loci, expectation for the point probability of twice reported recurrence across the 7,444 patients was 412.1 loci (in moderate comparison to our observed 533 loci reported mutated twice only), whereas for thrice reported recurrence this expectation fell to just 1.04 (in stark contrast to our observation of 156 loci reported exactly three times). 
For higher levels of recurrence within this scenario, the cumulative expectation for recurrence between 4 and 7,444 inclusive would be <3 total loci, again in stark contrast to our observation of 278 amino acid loci reported between 4 and 512 times. Comparatively, the expectation for thrice recurrence rose to only 2.45 loci when the average number of mutations per patient was raised from 15 to 20. Other modeling scenarios similarly found the observed number of loci having three times recurrence of protein-altering/splice-site mutations to be a clear change-point from expectation. Based on these multiple assessments, three recurrent protein-altering mutations at the same amino acid locus or splice-site was deemed to be a suitable threshold for identifying hotspot mutations in these data.
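The expected recurrence under the null model can also be approximated in closed form. The sketch below assumes that every mutation lands independently and uniformly on one of the possible loci, so that the number of hits at a locus is binomial with n = patients × mutations per patient and p = 1 / loci; this is a simplification of the simulations described above, and the function names are hypothetical. Under these assumptions the closed form gives values close to the simulated expectations quoted in the text (roughly 412 loci hit exactly twice and roughly 1 locus hit exactly three times for the 15-mutation, 15-million-locus scenario).

```python
import math

def expected_loci_with_k_hits(n_patients, muts_per_patient, n_loci, k):
    """Expected number of loci hit exactly k times under a uniform null model."""
    n = n_patients * muts_per_patient          # total mutations placed
    p = 1.0 / n_loci                           # chance a mutation hits a given locus
    log_pmf = (math.log(math.comb(n, k))
               + k * math.log(p)
               + (n - k) * math.log1p(-p))     # binomial pmf in log space
    return n_loci * math.exp(log_pmf)

def expected_loci_with_at_least(n_patients, muts_per_patient, n_loci, k_min, k_max=30):
    """Expected loci hit k_min..k_max times; later terms are negligible here."""
    return sum(
        expected_loci_with_k_hits(n_patients, muts_per_patient, n_loci, k)
        for k in range(k_min, k_max + 1)
    )

twice = expected_loci_with_k_hits(7444, 15, 15_000_000, 2)
thrice = expected_loci_with_k_hits(7444, 15, 15_000_000, 3)
thrice_20 = expected_loci_with_k_hits(7444, 20, 15_000_000, 3)
four_plus = expected_loci_with_at_least(7444, 15, 15_000_000, 4)
```

Comparing `thrice` with the 156 loci observed exactly three times illustrates why three-fold recurrence stands out as the change-point from expectation.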

Comparison with recurrence in COSMIC

To identify the frequency that our identified hotspots for CHIP might be similarly identified by using the valuable and frequently utilized Catalogue of Somatic Mutations in Cancer (COSMIC, cancer.sanger.ac.uk)4 database as a source of blood cancer mutation recurrence, we downloaded the “Cosmic Mutation Data” (CosmicMutantExport.tsv.gz) and “Non coding variants” (CosmicNCV.tsv.gz) files from the current COSMIC version (version 92, August 2020) for filtering and assessment. We filtered mutations for being of the 7 hematologic cancer types of our focus, being designated as a primary sample, and for the main transcript for each gene (mutations in these datasets are repeated for various transcripts). We further resolved some minor inconsistent formats and removed synonymous mutations. We then tallied recurrence for all other mutations at the amino acid or splice site location being present in either of the two COSMIC files. We were pleased to see very good concordance between our recurrence tallies for all mutations of high frequency. Of the 176 hotspots we identified with 6 times recurrence or more in our 48 studies, 34 (19.3%) were not present in 3 or more primary-labeled samples of the 7 blood cancers in COSMIC. Of the total 434 loci we report in 3 or more persons, 222 (51.2%) were not present in 3 or more primary-labeled samples of the 7 blood cancers in the filtered COSMIC set. Delving into the discordant mutations, we found some to be listed in COSMIC with even extremely high recurrence, yet with none or only one or two being listed in primary samples in the hematologic cancers of our focus. We observed 59.8% of all mutations listed in COSMIC for the 7 blood cancers to be labeled as coming from primary samples. We also tallied the number of samples listed across all mutations (including silent) from a search on the GRCh37 web version of COSMIC (version 92) for each gene and reference amino acid of our 434 hotspots, in order to identify the total number of reported mutations across all cancers and tumor types within COSMIC at these hotspots. WHSC1 was the only gene to be listed under a different gene alias (NSD2).

Determination of loci able to be mapped with high confidence for clonal assessment

As detecting clonal somatic mutations often requires the identification of variants present at low frequencies, distinguishing such low frequency mutations from artefactual mapping errors is crucial (e.g. mismapped germline variants or base differences in highly similar regions of the genome may be mistaken for clonal mutations). As the WES utilized in most studies to date has typically been with 75bp to 125bp read lengths, loci which are highly similar for such short-range stretches with other regions in the genome have a greater potential for such artifacts. While true somatic mutations and even CHIP can occur at such loci, for most NGS data to date, clonal mutations at these genetic locations are less reliable. As a result, we sought to identify such loci and remove any from being used in our determination of CHIP.

To do this, we performed BLAT5 calls for every possible 75bp sequence and 100bp sequence overlapping with the mutation loci (hence 175 unique BLAT queries were performed for each locus). Each query was then assessed for whether there was at least one other region in the genome having either 95% or more identity or being only one or two bases off from identical. Any such occurrence resulted in our marking the mutation locus as being less confident for use in CHIP determination. In addition, the hg19 genome sequence provided by the UCSC Genome Browser1 was assessed for lower case bases, which indicate a base assessed as being in a repeat region; such loci were also designated as being less confident for use in determining CHIP.
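The count of 175 queries per locus follows from enumerating every window of each length that covers a given base: 75 windows of 75bp plus 100 windows of 100bp. A minimal sketch, assuming a single representative base per locus and inclusive 1-based coordinates (both assumptions for illustration; the helper names are hypothetical):

```python
def overlapping_windows(pos, lengths=(75, 100)):
    """Return inclusive (start, end) intervals of every window covering `pos`."""
    return [
        (start, start + length - 1)
        for length in lengths
        for start in range(pos - length + 1, pos + 1)
    ]

def is_soft_masked(base):
    """Lower case bases in the UCSC hg19 sequence mark repeat regions."""
    return base.islower()

windows = overlapping_windows(1000)  # 1000 is an arbitrary example position
```

Each interval would then be extracted from the reference and submitted as one alignment query; that step is not shown here.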

Some genes having high numbers of reported mutations also had many regions that were difficult to map (e.g. PTPN11), with some hotspots occurring at locations highly similar to as many as 11 other regions of length 75 bases across the genome. Hence for clonal mutation assessment, we chose to exclude such hotspots to reduce the potential number of false positives that could be called.

Each amino acid hotspot locus was initially assessed for the presence of germline SNPs present at a frequency >0.2% in the 1000 Genomes Phase 3 data, and subsequently for variants observed in either the gnomAD6 exome or genome ver. 2.1.1 at >0.1% frequency. Three loci were flagged (occurring with 1KG population alternate allele frequencies of 0.22%, 1.04%, and 0.24%) and were classified as less confident loci for determining CHIP (labeled as a SNP locus). We also evaluated each of the recurrent mutation loci for the occurrence of homopolymer repeats (six or more identical bases) adjacent to any base of the amino acid locus, and similarly classified these hotspots as loci having less than high confidence for accurate CHIP detection.
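The homopolymer check can be sketched as a scan for runs of six or more identical bases that overlap or directly abut the three bases of a codon. The sketch assumes a plain sequence string with a 0-based codon start index, and it treats "adjacent" as touching or overlapping; the function name is hypothetical.

```python
from itertools import groupby

def homopolymer_near_codon(seq, codon_start, min_run=6):
    """True if a run of >= min_run identical bases overlaps or abuts the codon.

    seq: nucleotide string; codon_start: 0-based index of the codon's first base.
    """
    codon_end = codon_start + 2
    pos = 0
    for _base, group in groupby(seq.upper()):
        run_len = len(list(group))
        run_start, run_end = pos, pos + run_len - 1
        pos += run_len
        touches = run_start <= codon_end + 1 and run_end >= codon_start - 1
        if run_len >= min_run and touches:
            return True
    return False
```

For example, in the toy sequence `GCTAAAAAAGCTTGCA` the run of six A bases ends immediately before a codon starting at index 9, so that codon would be flagged, whereas a codon starting at index 11 would not.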

B. CHIP analysis:

Non-cancer cohorts for evaluation of CHIP

Three large cohorts with publicly available paired-end whole exome sequencing data were identified for evaluation of clonal mutations: Simons Simplex Collection (SSC), Qatari Genome (QTRG), and the 1000 Genomes Project, Phase 3 (1KG). FASTQ data for all samples of these cohorts were downloaded from the European Nucleotide Archive7 (ENA).

SSC cohort: Data for persons in the SSC cohort were available in ENA as PRJNA167318 and included a total of 863 samples, which included 59 samples with single-end data. To maintain uniformity of analysis with the vast majority of samples across the three cohorts, all non-paired-end sequenced samples were excluded, resulting in 804 persons available for analyses. Persons were identified as children with autism, unaffected siblings, or parents.

1KG cohort: We utilized the 1KG samples deposited by the Wellcome Trust Sanger Institute in ENA as PRJNA262923. Samples were paired-end for all except 1 sample (SRR1598151 in the LWK population) having only single-end data, and to maintain uniformity with CHIP calling in the rest of the cohorts this sample was not utilized.

QTRG cohort: Samples from the Qatari Genome cohort were accessed in ENA under PRJNA290484 and included a mixture of persons designated as having type 2 diabetes and controls. A total of 1,231 persons having paired-end sequencing data were evaluated for clonal mutations in this QTRG cohort.

Previous cohort assessments

To our knowledge, the publicly available 1KG cohort and Qatari cohort data have not been previously assessed for CHIP. The SSC cohort has also not been analyzed for CHIP but has been previously analyzed for de novo mutations8,9 and postzygotic mutations having non-small fractions9. While the thresholds utilized in the initial de novo mutation assessments were set to identify variants present in most cells, the latter postzygotic analysis utilized an initial threshold requiring seven or more variant reads, which could have identified some larger CHIP mutations but would miss those having lower frequencies. Their additional assessments of mutations with three or more variant reads were performed only on their final list of genes having recurrent nonsynonymous postzygotic mutations. We compared the clonal mutations we identified in the children of the SSC cohort with those in this latter assessment. None of our identified CHIP mutations at hotspots had been reported in their analysis; however, for our general analysis of CHIP not exclusive to hotspots, Lim et al.9 did report two of the variants we identified and listed for general CHIP in our Supplementary Table S8: for SRR1301659, DNMT3A p.V665L (having variant reads present on both strands), and for SRR1301605, PPM1D p.W427X (not having variant reads present on both strands).

Data processing

We aligned each sample’s paired-end FASTQ data files to the human genome reference hg19 with bwa-mem (version 0.7.12). Subsequent cleaning, realignment, and recalibration of bam files were performed with the software picard-tools10 (version 1.138), and GATK11 (version 3.4-46), as recommended by GATK and with java version 1.8.0_51. Mpileup files for each sample were created by samtools12 (version 1.2) with the removal of reads having a mapping quality score

Recovery and quality control analysis for CHIP at hotspots

Potential CHIP variants at hotspots having high potential for artifactual calls in a single blood sample assessment were flagged for exclusion, mainly due to high sequence similarity with other regions in the genome. We assessed each variant and identified only 1 variant for recovery at: SRSF2 p.P95 (1 sample). All of the other flagged variants had their exclusion criteria maintained: NF1 p.T676 (13 samples), CALR p.K385 (33 samples), ASXL1 p.G643 (6 samples), ARIH1 p.G79 (81 samples), PTPN11 p.A72 (10 samples), PTPN11 p.E76 (6 samples), PTPN11 p.G503 (3 samples), PTPN11 p.Q510 (1 sample), RAD21 p.A20 (1 sample), KRAS p.A59 (4 samples), KMT2C p.C391 (2005 samples), and ZRSR2 p.S439 (43 samples). Among identified hotspot variants not flagged in the initial QC stage, a variant at TRIM24 p.R910 was also removed as a potential artifact due to possible strand bias observed across all 3 samples called with this variant, otherwise meeting CHIP requirements, and in all but one of 14 additional samples with this variant meeting requirements for relaxed CHIP. CHIP variants were reviewed for VAFs indicative of a potential germline SNP or de novo mutation. However, similar to other WES studies for CHIP, we did not impose an upper-bound VAF filter as such events would be difficult to distinguish from widespread clonal expansion with only a single DNA source. We note that our prevalence of CHIP calls having high VAF was quite similar to that in past WES studies of CHIP.

Sequencing depth analysis

The sequencing depth across the 364 hotspots was determined for each sample using the samtools12 depth command with a mapping quality threshold of mq10 and base quality threshold of q30. A representative base from each of the loci was collated across samples, and the mean and median depth were computed for each locus across samples and for each sample across the 364 loci. The mean depths were significantly different across the three non-cancer cohorts for these 364 loci, thus indicating the potential for variation of CHIP detection rates across cohorts. As specific ages were known only for the QTRG cohort, age differences also possibly existed across cohorts. Since CHIP prevalence has been reported to be significantly associated with age, this lack of information precluded inference otherwise possible through direct adjustment for depth with CHIP prevalence.

Childhood sample verification

Each of the three children with identified CHIP at hotspots was sequenced as part of a quartet in the well-designed SSC study8, which sequenced children and both parents. By assessing SNPs in each member of a trio, it was possible to verify the child-parent relationship of samples, and thereby to verify that clonal mutations reported in children truly did occur in children and were not the result of a mix-up with an adult sample or data. As CHIP mutations reported in healthy persons prior to the current manuscript have been exclusive to adults, this verification is essential. We assessed 4,948 a priori SNPs having a minor allele population frequency between 0.4 and 0.6 in the 1000 Genomes population for having been assessed with at least 15 reads of base quality ≥30 and mapping quality ≥10 and with clear zygosity of SNPs by their VAF satisfying one of three scenarios: VAF<0.1, 0.4<VAF<…, or VAF>0.9. Concordance of parental homozygosity and childhood expected genotype occurred at rates of 99.8%, 99.3%, and 98.6% across 1,084, 938, and 728 respectively qualifying SNPs for the three children's samples assessed, thereby allowing the conclusion that sample mix-ups with parental samples did not occur and that the identified CHIP mutations did occur in children.
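The trio concordance check can be sketched as follows. Two points are assumptions made only for illustration: the upper bound of the heterozygous VAF range (0.6 below, exposed as a parameter), and the definition of a qualifying SNP as one where both parents are clearly homozygous and the child has clear zygosity. The function names are hypothetical.

```python
def call_zygosity(vaf, het_range=(0.4, 0.6)):
    """Return alternate-allele count 0, 1, or 2, or None if zygosity is unclear.

    het_range's upper bound (0.6) is an assumed value for this sketch.
    """
    if vaf < 0.1:
        return 0
    if het_range[0] < vaf < het_range[1]:
        return 1
    if vaf > 0.9:
        return 2
    return None

def trio_concordance(trio_vafs):
    """trio_vafs: iterable of (father_vaf, mother_vaf, child_vaf) per SNP.

    Returns (concordant, qualifying) counts over SNPs where both parents are
    homozygous, so the child's expected genotype is fully determined.
    """
    concordant = qualifying = 0
    for father, mother, child in trio_vafs:
        f, m, c = (call_zygosity(v) for v in (father, mother, child))
        if f in (0, 2) and m in (0, 2) and c is not None:
            qualifying += 1
            expected = (f + m) // 2  # one allele inherited from each parent
            concordant += (c == expected)
    return concordant, qualifying

# Toy values (hypothetical): two concordant SNPs, one discordant, two skipped.
toy = [(0.0, 0.0, 0.02), (1.0, 0.0, 0.5), (1.0, 1.0, 0.5),
       (0.5, 0.0, 0.5), (0.0, 0.0, 0.25)]
result = trio_concordance(toy)
```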

Relaxed CHIP analysis

Low-level variant reads identified at a locus with next generation sequencing can be viewed as a mixture of two distributions: artefactual/incorrect reads and true variant reads. Artefactual reads reflect to some degree the error threshold of base quality reads permitted (e.g. q30 indicates a misread to occur 1 time out of 1,000 reads, or 0.1%, with the total number of artefactual variant reads observed at a locus following a binomial distribution of this error rate). True variant reads will on average be detected at the rate of their clonal presence (e.g. observed in 1% of sequencing reads when occurring in 2% of blood cells for heterozygous mutations), the total number of variant reads following a binomial distribution of this prevalence. High depth sequencing can allow for easier distinguishing of these two distributions by requiring a higher variant read threshold. For lower depth sequencing, an identical threshold will result in only the largest clones being detected. While the use of three variant reads has a greatly increased rate of true clonal detection, for lower depth data, the rate of true clonal detection with two variant reads is still considerably increased. Thus, for the lower sequenced Qatari cohort, we also considered this situation (though we did not include such in our final CHIP designations). We observed that the Qatari cohort had only 5.3 times as many potential CHIP mutations having exactly two variant reads compared to the number of CHIP mutations in that cohort with three or more variant reads. That ratio in the Qatari cohort was significantly less than in the much more deeply sequenced Simons Simplex and 1KG cohorts, having ratios of 16.7 and 13.0, respectively (p<0.05 for both comparisons, Fisher’s Exact test).
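The two-distribution view of low-level variant reads can be made concrete with a binomial tail probability. The sketch below assumes a fixed depth and treats the full q30 error rate as the chance of an artefactual variant read, which is a simplification; the depth of 100 and the read fraction of 1% are hypothetical example values, and the function name is illustrative.

```python
import math

def prob_at_least(k, depth, rate):
    """P(at least k variant reads) for Binomial(depth, rate)."""
    below = sum(
        math.comb(depth, i) * rate**i * (1 - rate) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - below

depth = 100            # hypothetical sequencing depth
error_rate = 0.001     # q30: one misread per 1,000 reads
clone_rate = 0.01      # heterozygous mutation present in 2% of blood cells

# Chance of reaching each read threshold from errors alone vs a true clone.
false_2 = prob_at_least(2, depth, error_rate)
false_3 = prob_at_least(3, depth, error_rate)
true_2 = prob_at_least(2, depth, clone_rate)
true_3 = prob_at_least(3, depth, clone_rate)
```

Lowering the threshold from three to two reads raises both probabilities, which is why calls with exactly two variant reads were examined but not included in the final CHIP designations.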

Determination of CHIP not limited to hematologic hotspot loci

To identify CHIP not restricted to the loci of our hematologic hotspot analysis, we utilized the list of mutations specified in Jaiswal et al.14. We further followed their exclusion criteria. Specifically, we required at least 3 variant reads for SNPs and 6 variant reads for indels; we removed variants located anywhere within 6bp homopolymer runs unless the variant was observed with >8% variant allele frequency; we removed indels having a length of 3 or more bases; we removed variants in the first or last 10% of the open reading frame unless the variant occurred at one of our determined hematologic hotspots; we excluded variants for TET2, CBL, and CBLB which had a cumulative binomial probability >0.001 for following a heterozygous mutation; we excluded variants failing our mappability criteria (that no possible 75bp or 100bp sequence overlapping the variant mapped elsewhere in the genome with >95% identity or differing by no more than 2 bases); we excluded loci identified by UCSC to be difficult to map as indicated by a lower case base; and we excluded variants observed in either the gnomAD6 exome or genome version 2.1.1 at >0.1% frequency.
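Several of the exclusion criteria listed above can be expressed as a single predicate over a variant record. This is a partial sketch: the dictionary keys are hypothetical field names, and the gene-specific binomial test and the mappability and soft-masking checks are omitted.

```python
def passes_general_chip_filters(v):
    """Apply a subset of the stated exclusion criteria to one variant record.

    v is a dict with hypothetical keys: is_indel, variant_reads, indel_length,
    in_homopolymer_6bp, vaf, orf_fraction (0 = ORF start, 1 = ORF end),
    at_hotspot, gnomad_max_af.
    """
    min_reads = 6 if v["is_indel"] else 3
    if v["variant_reads"] < min_reads:
        return False
    if v["in_homopolymer_6bp"] and v["vaf"] <= 0.08:
        return False
    if v["is_indel"] and v["indel_length"] >= 3:
        return False
    in_orf_ends = v["orf_fraction"] < 0.10 or v["orf_fraction"] > 0.90
    if in_orf_ends and not v["at_hotspot"]:
        return False
    if v["gnomad_max_af"] > 0.001:
        return False
    return True

# Example record (all values hypothetical).
example = {
    "is_indel": False, "variant_reads": 4, "indel_length": 0,
    "in_homopolymer_6bp": False, "vaf": 0.05, "orf_fraction": 0.5,
    "at_hotspot": False, "gnomad_max_af": 0.0,
}
```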

These criteria resulted in a total of 2,097 variants across the 4,538 persons of our combined cohorts. Of note, due to the >100-fold number of potential bases which could result in unrestricted CHIP (>160,000 bases for unrestricted CHIP compared to 1,092 bases for CHIP at hotspots), some of these criteria are more stringent than those required for our identification of CHIP at hotspots in order to reduce the number of false positives inherent with greatly increased multiple testing. We further assessed these variants and identified 8 subjects each belonging to one of only 3 of the SSC quartets who exhibited a very high number (ranging from 9 to 26) of single base pair duplications which met the criteria for CHIP at non-hotspots. No other persons from any cohort exhibited more than two insertions at such loci. Investigating potential artifactual causes for these variants, we found these same 8 samples (3 children and 5 parents) to have had the highest mean error as reported in the supplement of the original study by Sanders et al.8, with error rates 2 to 3-fold higher than the next highest sample. This provided an independent reason for their exclusion from all CHIP tallies and calculations. We also removed variants present in >0.1% of the cohorts (5 or more persons). We assessed strand bias across recurrent variants not occurring near the beginning of an exon and removed variants showing a consistent bias across multiple samples. These additional exclusions resulted in a total of 1,377 mutations. Finally, our list of final confident variants was filtered by the minimum variant frequency present in Jaiswal et al.14, 4%, which resulted in our final list of 189 CHIP variants. Approximately 70% of these CHIP mutations exhibited variants on both strands. 
While the remainder reflected the nearly identical rate of similarly biased reference reads after adjusting for total reads, we provide data regarding potential strand bias of these general CHIP mutations for use as an additional indicator of confidence in Supplementary Table 8.

We compared the 755 mutations that were reported at one of the 364 hotspots in the 48 somatic landscape papers with the allowed mutations for CHIP within Jaiswal et al.14. We converted mutations that were listed with different transcripts to confidently identify those mutations that were uniquely present in our list. We used the ProteinPaint15 utility of the St. Jude Cloud PeCan suite (https://pecan.stjude.cloud/proteinpaint) to identify the domain of each of these mutations. We included only a single domain when multiple entries were listed for a mutation locus.

1. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res 2002; 12: 996-1006.

2. The NCBI handbook [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2002 Oct. Chapter 18, The Reference Sequence (RefSeq) Project. Available from http://www.ncbi.nlm.nih.gov/books/NBK21091

3. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010; 38: e164.

4. Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, et al. COSMIC: the Catalogue of Somatic Mutations In Cancer. Nucleic Acids Res 2019; 47: D941-D947.

5. Kent WJ. BLAT – the BLAST-like alignment tool. Genome Res 2002; 12: 656-664.

6. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020; 581: 434-443.

7. European Nucleotide Archive. https://www.ebi.ac.uk/ena

8. Sanders SJ, Murtha MT, Gupta AR, Murdoch JD, Raubeson MJ, Willsey AJ, et al. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 2012; 485: 237-241.

9. Lim ET, Uddin M, De Rubeis S, Chan Y, Kamumbu AS, Zhang X, et al. Rates, distribution and implications of postzygotic mosaic mutations in autism spectrum disorder. Nat Neurosci 2017; 20: 1217-1224.

10. Picard. http://broadinstitute.github.io/picard.

11. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010; 20: 1297-1303.

12. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009; 25: 2078-2079.

13. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012; 22: 568-576.

14. Jaiswal S, Natarajan P, Silver AJ, Gibson CJ, Bick AG, Shvartz E, et al. Clonal hematopoiesis and risk of atherosclerotic cardiovascular disease. N Engl J Med 2017; 377: 111-121.

15. Zhou X, Edmonson MN, Wilkinson MR, Patel A, Wu G, Liu Y, et al. Exploring genomic alteration in pediatric cancer using ProteinPaint. Nat Genet 2016; 48: 4-6.

[Supplementary Figure 1 plots. Panel A: average sequencing depth at the 364 mutation hotspots (vertical axis) by cohort (horizontal axis). Panel B: percent of the 364 hotspots with ≥100x coverage (vertical axis) by cohort (horizontal axis).]

Supplementary Figure 1. Sequencing depths in the non-cancer cohorts. A) Average sequencing depth at the 364 mutation hotspots by subcohort (blue horizontal bars indicate mean depth for each subcohort). B) Percent of 364 hotspots sequenced at a depth ≥100x (maroon horizontal bars indicate mean percentage of loci sequenced at a depth of 100x or more for each subcohort).

[Supplementary Figure 2 plot: times mutation reported at hotspot (vertical axis) by gene (horizontal axis).]

Supplementary Figure 2. Recurrent mutations reported at hotspots in genes with identified CHIP mutations. For the 23 genes (horizontal axis) observed to have CHIP in the persons of the non-cancer cohorts, shown is the frequency of reported diagnoses of leukemia or myeloid neoplasm at hotspots from the 48 somatic landscape studies (vertical axis). Color and size indicate the frequency of CHIP mutation detection in the non-cancer cohorts (red indicates CHIP mutation observed, black indicates not observed, brown indicates difficult to map hotspot locus, size of circles indicates frequency CHIP mutation observed - range 1 to 4).

[Supplementary Figure 3 plot: density (vertical axis) by variant allele fraction (horizontal axis, 0.0 to 1.0). Legend: hotspot with CHIP reported 3 to 5 times; hotspot with CHIP reported 6 to 517 times.]

Supplementary Figure 3. Distribution of CHIP variant allele fractions by hotspot recurrence. For the 80 confident CHIP mutations at hotspots, distribution of their variant allele fractions (VAF) is shown, grouped by the number of times the hotspot was recurrently reported mutated in the hotspot mutation analysis. Median VAF for 3 to 5 times recurrent: 0.053, median VAF for 6 to 517 times recurrent: 0.033.

A) SRR1301885, DNMT3A p.R882S

B) SRR1301448, RUNX1 p.G170 splice site (NM_001754:exon6:c.509-1G>T)

C) SRR1301732, RUNX1 p.S322X

D) SRR1301448, TET2 p.1863S

Supplementary Figure 4. Integrative Genomics Viewer (IGV) plots for the three children with CHIP at hotspot mutations. Shown are the IGV plots of reads with the identified variant or reference allele. Variants were observed on both strands (red and blue colors) and did not occur at the end of reads. A) CHIP at hotspot for SRR1301885, DNMT3A p.R882S. B) CHIP at hotspot for SRR1301448, RUNX1 p.G170 splice site. C) CHIP at hotspot for SRR1301732, RUNX1 p.S322X. D) Secondary CHIP not at hotspot for SRR1301448, TET2 p.1863S.

[Supplementary Figure 5 plot: number of loci (vertical axis) by recurrence of reported mutations at loci (horizontal axis).]

Supplementary Figure 5. Frequency of amino acid or splice site altering mutation loci having mutations reported at various recurrence levels for diagnoses of hematologic malignancies across the 48 studies. Vertical axis indicates the number of amino acid or splice site loci that were reported mutated at the recurrence level shown on the horizontal axis, allowing for assessment of deviation from a binomial process.
