Supplementary Information Supplement To: Large-Scale Identification of Clonal Hematopoiesis and Mutations Recurrent in Blood Cancers
Julie E. Feusier, Sasi Arunachalam, Tsewang Tashi, Monika J. Baker, Chad VanSant-Webb, Amber Ferdig, Bryan E. Welm, Christopher Ours, Juan L. Rodriguez-Flores, Lynn B. Jorde, Josef T. Prchal, Clinton C. Mason
(JEF and SA contributed equally.)

Department of Pediatrics, Division of Pediatric Hematology and Oncology, University of Utah, Salt Lake City, Utah (JEF, SA, MJB, CVS-W, AF, CO, CCM); Department of Human Genetics, University of Utah, Salt Lake City, Utah (JEF, LBJ); Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah (SA, TT, BEW, JTP); Department of Oncological Sciences, University of Utah, Salt Lake City, Utah (SA); Division of Hematology and Hematologic Malignancies, University of Utah, Salt Lake City, Utah (TT, JTP); VA Medical Center, Salt Lake City, Utah (TT, JTP); Department of Surgery, University of Utah, Salt Lake City, Utah (BEW); Department of Genetic Medicine, Weill Cornell Medical College, New York, New York (JLR-F)

Table of Contents
Supp. Methods
Supp. Figure 1
Supp. Figure 2
Supp. Figure 3
Supp. Figure 4
Supp. Figure 5

Supplementary Methods

A. Hotspot mutation analysis

Criteria for inclusion of somatic mutation studies and samples

Of the identified somatic mutation studies, only those meeting sample identifiability criteria were included, in an effort to include only mutations from diagnostic samples and to greatly reduce the likelihood of an individual’s mutations being included twice in our extraction process.
Studies were required to have assessed somatic mutations with NGS methods in eight or more diagnostic samples of one of seven hematologic malignancies (ALL, AML, CML, CMML, JMML, MDS, and MPN). Some studies assessed relapse or secondary leukemia samples in addition to the initial diagnostic sample. We removed such studies from our analysis unless all mutations from each relapse/secondary leukemia sample could be confidently removed. Many studies followed an approach of assessing a discovery cohort followed by a validation cohort. As some of these studies included a few individuals in both cohorts, we uniformly included the mutations reported for only one of these cohorts to avoid potential artifactual recurrence. This was most frequently the validation cohort, which often had a larger sample size; however, if one cohort was assessed for only a minimal target region, the other cohort, having a larger genomic region of assessment, was used instead. We were also careful to assess whether studies documented the inclusion of patients in previously reported studies. When this occurred, only mutations identified in the more recent study were included.

A process was developed to automate, as much as possible, the mining, extraction, conversion, and evaluation of reported mutations in each paper. Each study was assessed for genetic information that could accurately identify the genetic position of the alteration and its effect on the encoded protein. Formats of reported mutations were standardized, with priority for inclusion given to reported genomic positions with reference and alternate alleles provided. In general, studies reporting only amino acid substitutions without a transcript identifier or genomic position were excluded, as such mutations may have occurred in one of multiple locations.
Criteria for inclusion of somatic mutations

For studies meeting the above criteria, additional inclusion criteria were applied that assessed the utility and reliability of listed mutations. The extracted mutations from each study were evaluated for consistency of the provided reference alleles and amino acids. This was necessary to ensure that the correct genetic reference system and build of the reported mutations was identified prior to our conversion and harmonization program. We assessed the reference alleles of all mutations (SNVs and indels grouped separately) reported within a given study and compared these with the possible combinations of human genome reference build, 0-based or 1-based coordinate system, and cDNA or gDNA at those loci. The correct scenario was nearly always obvious, as it achieved near 100% concordance while all other combinations fell far short of this. When such near-perfect concordance was not achieved for the set of SNVs, the set of indels, or both, the affected sets were excluded in their entirety.

A few studies published all observed variants, including those having an extremely low number of supporting reads or very low allele frequencies, without explaining how bona fide mutations were to be delineated. For such studies we added a variant allele fraction (VAF) requirement (VAF in tumor ≥30% if only tumor was assessed; VAF in tumor ≥25% and VAF in germline <5% if both tumor and germline were assessed) to increase the likelihood of including bona fide diagnostic mutations over noise and artifacts.

When only amino acid substitutions were published with a transcript, a back-transformation was used to identify the coordinates and mutation that would produce such a substitution, and that information was then forward translated to obtain the amino acid substitution for the single transcript chosen for each gene (described below).
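The reference-allele consistency check described above can be illustrated with a small sketch. This is not the authors' program: it is a simplified illustration that scores only reference build and coordinate offset (the cDNA/gDNA dimension is omitted), and the build names, toy sequences, and function names are all hypothetical stand-ins for real reference genome lookups.

```python
from itertools import product

# Hypothetical toy "reference builds"; a real check would read sequence
# from the actual reference genome files for each candidate build.
TOY_BUILDS = {
    "buildA": {"chr1": "ACGTACGTAC"},
    "buildB": {"chr1": "TTACGTACGT"},
}

def concordance(mutations, genome, offset):
    """Fraction of mutations whose reported reference allele matches the genome.

    mutations: list of (chrom, pos, ref_allele) as reported by a study.
    offset: 0 if reported positions are assumed 0-based, 1 if 1-based.
    """
    if not mutations:
        return 0.0
    hits = 0
    for chrom, pos, ref in mutations:
        start = pos - offset
        seq = genome.get(chrom, "")
        if start >= 0 and seq[start:start + len(ref)] == ref:
            hits += 1
    return hits / len(mutations)

def score_scenarios(mutations, builds):
    """Concordance for every (build, coordinate offset) combination."""
    return {
        (name, offset): concordance(mutations, genome, offset)
        for (name, genome), offset in product(builds.items(), (0, 1))
    }

# A toy study that reported 1-based positions on buildA.
reported = [("chr1", 1, "A"), ("chr1", 2, "C"), ("chr1", 3, "G"), ("chr1", 4, "T")]
scores = score_scenarios(reported, TOY_BUILDS)
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy case one scenario matches every reported reference allele and the others match none, mirroring the pattern the text describes, in which the correct combination stands out clearly from the rest.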
The LiftOver utility of the UCSC Genome Browser [1] was used to convert genomic coordinates of mutations for studies not providing mutations in hg19 coordinates. Any mutations at unresolvable loci were excluded. Following the initial extraction of diagnostic mutations from the somatic landscape papers and determination of 48 studies with usable data, these studies were reviewed a second time to ensure that extracted mutations met all usability criteria. Twelve studies were arbitrated to have mutations added or removed, resulting in a net decrease of 298 mutations following this additional evaluation. This entire process resulted in 48 studies meeting inclusion criteria, which together reported 58,177 mutations that we then assessed for effect on protein and recurrence.

Single transcript determination

Initial assessment of the extraction process identified the use of different transcripts for many genes across different studies as a common problem and a potential pitfall for any analysis relying only on the amino acid alteration reported in a given study. As a result, mutations were included when a genetic position or transcript was provided in addition to the amino acid alteration, or when this information could be unambiguously resolved. To standardize reporting and recurrence determination, the RefSeq [2] transcript most commonly used for each gene across nine initial studies was determined (when ties occurred, the sample size of the studies was used in the final selection). For genes not reported mutated in these nine somatic mutation landscape studies, the largest RefSeq transcript was utilized as the referent transcript.
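The transcript selection rule above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the text does not say exactly how sample size resolved ties, so the sketch assumes the tie goes to the transcript whose reporting studies have the larger combined sample size. The function name and transcript identifiers are hypothetical.

```python
from collections import defaultdict

def pick_transcript(reports):
    """Choose one transcript for a gene.

    reports: list of (transcript_id, study_sample_size), one entry per
    study that reported the gene. The transcript used by the most studies
    wins; ties are broken by combined sample size (an assumed rule).
    """
    n_studies = defaultdict(int)
    n_samples = defaultdict(int)
    for transcript, sample_size in reports:
        n_studies[transcript] += 1
        n_samples[transcript] += sample_size
    return max(n_studies, key=lambda t: (n_studies[t], n_samples[t]))

# Hypothetical identifiers, for illustration only.
majority = pick_transcript([("TX_1", 50), ("TX_2", 200), ("TX_1", 30)])
tie_break = pick_transcript([("TX_1", 50), ("TX_2", 200)])
print(majority, tie_break)
```

Keeping the study count and the sample size in a single sort key makes the precedence explicit: sample size matters only when the study counts are equal.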
These single transcripts for each gene were utilized as the sole determinant of the effect of mutations on amino acids by ANNOVAR [3] throughout all of the analyses reported in this manuscript. The only exception was a few genes in the general CHIP analysis, for which we utilized the transcript listed in previous CHIP assessments to match the methods of those assessments more closely for comparison.

Hotspot determination

We sought to identify amino acid and splice-site loci that were reported mutated across the 48 somatic landscape studies more frequently than would be expected by chance, and to designate such loci as mutation hotspots. Our collective cohort consisted of patients assessed for a variety of targeted regions and representing multiple blood cancers, for whom, for the most part, silent or non-exonic mutations were not reported. As a result, we sought to determine minimalist criteria that would reflect significant deviation from models of the overall expected recurrence frequency of reported protein-altering and splice-site mutations under plausible assumptions that included a greater somatic burden than was reported (increasing the requirement for significant recurrence). Under the null hypothesis of equal contribution for all mutation loci (i.e., no hotspots, or equivalently, no observed recurrence beyond expectation from binomial distributions), we tested various model scenarios for deviance from expected recurrence. We simulated the combined patients of these studies, varying the total theoretical protein-altering or splice-site mutations in each patient from 30 down to 10 (lower values yielding very infrequent recurrence) and the number of amino acid or splice-site loci from 10 to 20 million. Each separate simulation provided estimates for expected recurrence distributions, reflecting the expectation from their corresponding binomial distribution models.
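The binomial expectation that these simulations approximate can also be written in closed form. The sketch below is not the authors' simulation code; it assumes that every patient's mutations fall independently and uniformly across loci, so that a given locus is hit in any one patient with probability m/L (m mutations per patient, L loci). The function and parameter names are illustrative.

```python
from math import comb

def expected_loci(k, n_patients, muts_per_patient, n_loci):
    """Expected number of loci reported mutated in exactly k patients.

    Under the uniform null model, the number of patients hitting a given
    locus is Binomial(n_patients, p) with p = muts_per_patient / n_loci.
    """
    p = muts_per_patient / n_loci
    pmf = comb(n_patients, k) * p**k * (1 - p) ** (n_patients - k)
    return n_loci * pmf

# Scenario taken up in the text that follows: 15 mutations per patient,
# 15 million loci, 7,444 patients.
twice = expected_loci(2, 7444, 15, 15_000_000)
thrice = expected_loci(3, 7444, 15, 15_000_000)
print(round(twice, 1), round(thrice, 2))
```

Under these assumptions the closed form gives roughly 412 loci mutated exactly twice and roughly 1 locus mutated exactly three times. Small differences from simulated estimates are to be expected, since a simulation carries sampling noise.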
For the plausible scenario of 15 amino acid altering or splice-site mutations per patient occurring across 15 million possible amino acid or splice-site loci, the expected number of loci reported mutated exactly twice across the 7,444 patients was 412.1 (moderately comparable to our observed 533 loci reported mutated exactly twice), whereas for loci reported exactly three times this expectation fell to just 1.04 (in stark contrast to our observation of 156 loci reported exactly three times). For higher levels of recurrence within this scenario, the cumulative expectation