Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

1 Genetic Determinants of Somatic Selection of Mutational

2 Processes in 3,566 Human Cancers

3 Jintao Guo1,2,3*, Ying Zhou1,2,3*, Chaoqun Xu1,2,3, Qinwei Chen1,2,3, Zsófia Sztupinszki4, Judit 4 Börcsök4, Canqiang Xu5, Feng Ye6,7,8, Weiwei Tang6,7,8, Jiapeng Kang6,7,8, Lu Yang6,7,8, Jiaxin 5 Zhong1,2,3, Taoling Zhong1,2,3, Tianhui Hu1, Rongshan Yu5, Zoltan Szallasi4,9, Xianming Deng10, 6 Qiyuan Li1,2,3‡

7 1. National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, 8 361102, China.

9 2. Department of hematology, School of Medicine, Xiamen University, Xiamen, 361102, China.

10 3. Department of Pediatrics, The First Affiliated Hospital of Xiamen University, Xiamen, 361102, 11 China.

12 4. Danish Cancer Society Research Center, Copenhagen, Denmark.

13 5. XMU-Aginome Joint Lab, School of Informatics, Xiamen University, Xiamen, 361005 China.

14 6. Department of Medical Oncology, The First Affiliated Hospital of Xiamen University, Xiamen, 15 China.

16 7. Department of Medical Oncology, The First Affiliated Hospital of Xiamen University, 17 Teaching Hospital of Fujian Medical University, Xiamen, Fujian, China.

18 8. Xiamen Key Laboratory of Antitumor Drug Transformation Research, The First Affiliated 19 Hospital of Xiamen University, Xiamen, China

20 9. Computational Health Informatics Program, Boston Children's Hospital, Boston, 21 Massachusetts.

22 10. State Key Laboratory of Cellular Stress Biology, School of Life Science, Xiamen University, 23 Xiamen, 361102, China.

24 *These authors contributed equally to this work.

25 Running title: Driver and variants of mutational processes in cancers 26 Keywords: mutational processes; somatic selection; driver genes; cancer therapy; tumor immune 27 microenvironment

28 Additional information 29 This work was supported by the Fundamental Research Funds for the Chinese Central 30 Universities [20720190101 to QL]; the National Natural Science Foundation of China [81802823 31 to YZh]; the Natural Science Foundation of Fujian Province of China [2018J01054 to YZh]; 1

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

32 National Natural Science Foundation of China (Grant No. 81702414 to WT); Natural Science 33 Foundation of Fujian Province of China (Grant No. 2020J05306 to WT) and Xiamen Science and 34 Technology Planning Project (Grant No. 3502Z20194003 to WT).

35 ‡Corresponding Author: Qiyuan Li, School of Medicine, Xiamen University, 4221-120 36 South Xiang'an Road, Xiang'an District, Xiamen, Fujian Province, China. E-mail: 37 [email protected]. Phone: +86 592 2185175 38 The authors declare no potential conflicts of interest. 39 The total number of words of the manuscript, including Introduction, Materials and Methods, 40 Results, Discussion, Acknowledgements and Figure Legends: 7359; the number of figures: 6; the 41 number of supplementary figures: 8 and supplementary tables: 4.

2

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

42 Abstract 43 The somatic landscape of the cancer genome results from different mutational processes 44 represented by distinct “mutational signatures”. Although several mutagenic mechanisms are 45 known to cause specific mutational signatures in cell lines, the variation of somatic mutational 46 activities in patients, which is mostly attributed to somatic selection, is still poorly explained. Here 47 we introduce a quantitative trait, mutational propensity (MP), and describe an integrated method 48 to infer genetic determinants of variations in the mutational processes in 3,566 cancers with 49 specific underlying mechanisms. As a result, we report 2,314 candidate determinants with both 50 significant germline and somatic effects on somatic selection of mutational processes, of which 51 485 act via cancer expression and 1,427 act through the tumor immune microenvironment. 52 These data demonstrate that the genetic determinants of MPs provide complementary information 53 to known cancer driver genes, clonal evolution, and clinical biomarkers.

54 Statement of Significance

55 The genetic determinants of the somatic mutational processes in cancer elucidate the biology 56 underlying somatic selection and evolution of cancers and demonstrate complementary predictive 57 power across cancer types.

58 Introduction 59 Cancer acquire malignant phenotypes through various somatic in the genome 60 which result in functional gains or losses contributing to the tumor fitness(1). Somatic driver 61 mutations are critical for cancer initiation and evolution and cause the genetic heterogeneity which 62 determine the clonal architecture in cancers(2). The mutations occur through different mutational 63 processes which are evidenced in distinct mutational signatures represented by the frequencies of 64 mutations within corresponding nucleotide contexts(3). The mutational signatures surrogate for 65 the mutational processes during tumorigenesis which drive the clonal evolution(4); and in turn, 66 impact the complex clinical phenotypes of the disease(5). 67 So far, there are 67 signatures of single base substitution (SBS) identified, of which 49 were 68 considered likely to be of biological origin(6). The mutagenic mechanisms of some mutational 69 signatures have been elucidated in cell lines or mouse models(7,8). External mutagens, such as 70 tobacco smoking and UV, results in specific mutational signatures; then alterations of certain 71 biological functions, such as deficiencies in double-strand breaks (DSB) repair mechanism, 72 APOBEC enzymatic activities give rise to specific mutational signatures(3,6). Finally, the random 73 mutations can change the fitness of the tumor cells hence subject to extrinsic selective pressures 74 such as immune responses, chemotherapy regimen and targeted therapies(9–11). For instance, 75 APOBEC mutational signature is a predictive marker for immunotherapy response in non-small 76 cell cancer.

3

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

77 The activities of the mutational signature vary substantially among individual cancers, 78 suggesting complex biological mechanisms underlying somatic selection of mutational 79 processes(3,12). The mutagenic processes only partially account for the activities of the 80 mutational signatures in cancer patients. For example, APOBEC3B activity contributes to SBS2 81 and SBS13, however APOBEC3B expression only explains 20% to 30% of the total variance of 82 the APOBEC mutational signature (Spearman’s r = 0.3), whereas the causes of the rest of the 83 variation are still unknown(13). Likewise, SBS4 is caused by exposure to tobacco smoking (9). 84 However, no more than 20% of the SBS4 activities is explained by the tobacco exposure level in 85 lung squamous cell carcinomas and lung adenocarcinomas (14). It appears that, although the 86 causal mutagenetic factors of the mutational processes are known, the majority of the inter-tumor 87 variations in the activities of the mutational processes in real patients are still largely unexplained. 88 However, there still lacks reliable methodology for identification of the genetic determinants 89 of somatic selection of mutational processes(15). The major challenges are, first, the measure of 90 the activities of the mutational signatures which are regularly confounded by contamination, 91 sequencing errors and mapping biases among the cancer samples(16). Moreover, the distribution 92 of the signature activities in cancers are usually heavily skewed, to which most of the linear 93 models do not fit and fail to reveal the statistical associations. Therefore, in this study, we 94 introduced the quantitative trait of mutational propensity (MP), which is tethered to the relative 95 activity between a given signature and a reference signature. The MP is quasi-normal distributed 96 and more explainable by linear models. Then, we described a regression model integrating 97 evidences from both germline and somatic levels to identify the genetic determinants of somatic 98 selection of mutational processes. In addition, we inferred the causal biological mechanisms for 99 the candidate determinants of the mutational processes via cancer and tumor 100 immune microenvironment (TIME). 101 Previous studies identify cancer driver genes (or driver mutations) as of which the mutational 102 statuses are significantly associated with specific mutational processes(17,18). However, the 103 genetic determinants which impact the mutational processes are rarely reported until recently(19). 104 We hypothesized that the intro- and inter-tumor variations of the activities of the mutational 105 signatures are largely resulted from the selection processes during somatic evolution, which is 106 caused by the heterogeneity in the genetic background, the somatic landscape, and the 107 microenvironment. Recent studies reported several cancer genes with functional and clinical 108 significance based on the effects on the mutational processes(20,21). 109 Our analyses provide a systematic view of the genetic determination of somatic selection of 110 the major mutational processes in 3,566 cancers from TCGA; and suggest highly relevant 111 biological processes with clinical and therapeutic implications. In addition, the study described an 112 alternative approach to identify cancer genes based on the process of clonal evolution, which 113 further broaden our understanding of the formation of intratumor heterogeneity and inform the 114 future precision medicine for cancer.

4

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

115 Materials and Methods

116 Data Collection

117 We analyzed multi-omics data on 13 cancer types including breast Cancer (BRCA, N = 690), 118 colorectal cancer (COAD, N = 344), esophageal adenocarcinomas (EAC, N = 82), squamous-cell 119 carcinomas (ESCC, N = 31), glioblastoma multiforme (GBM, N = 326), Kidney Renal Clear Cell 120 Carcinoma (KIRC, N = 299), liver hepatocellular carcinoma (LIHC, N = 121), lung 121 adenocarcinoma (LUAD, N = 427), ovarian cancer (OV, N = 345), prostate adenocarcinoma 122 (PRAD, N = 231), stomach adenocarcinoma (STAD, N = 213), thyroid carcinoma (THCA, N = 82) 123 and uterine corpus endometrial carcinoma (UCEC, N = 375; Table S1). 124 The genotype data, germline and somatic variants and mRNA expressions were downloaded 125 from Genomic Data Commons Data Portal. The gene-based somatic copy-number alterations 126 (SCNAs), DNA methylation were downloaded from UCSC Xena TCGA hub 127 (https://tcga.xenahubs.net). The gene fusions were download from Tumor Fusion Gene Data 128 Portal (https://tumorfusions.org). The immune characteristics were downloaded from the 129 published study(22,23), including intratumor heterogeneity (ITH), immune cell fractions, subtypes, 130 clone number (supplementary methods). 131 To avoid biases from the populational background, it is a general practice to control for the 132 ethnicity of the population. To inferred the ancestry, we performed principal component analysis 133 using five populations of the 1000 Genomes Project Phase 3 (1KGP) as references. For a certain 134 population, the genotypes were pre-phased using SHAPEIT2 and imputed using IMPUTE2 with 135 the 1KGP reference panel for further analyses (supplementary methods) (24,25). 136 We calculated the gene-wise genetic burden for each class of germline variants (missense, 137 truncated and structural variant), and encoded the gene-wise statuses of somatic non-synonymous 138 mutations (somatic nsy-mutations), SCNAs, methylation aberrations in the promoter regions 139 (TSS-methylation) and fusions for integrated regression analyses (supplementary methods). 140 The matched multi-level data of 1,631 cancer cell lines were downloaded from the Cancer 141 Cell Line Encyclopedia project (CCLE; https://portals.broadinstitute.org/ccle), including the 142 somatic mutations, SCNAs, DNA methylations, mRNA expressions. The profiling protocols were 143 consistent with the preference rules of TCGA (supplementary methods). The 563 cell lines’ 144 CERES scores were download from The Cancer Dependency Map Project at Broad Institute 145 (DepMap, v19q2; https://depmap.org). CERES score is a computational method to estimate 146 gene-dependency levels from CRISPR-Cas9 essentiality screens while accounting for the copy 147 number-specific effect (26). The drug sensitivity data (IC50) of 305 drugs in 988 cell lines were

148 downloaded from Genomics of Drug Sensitivity in Cancer V8 (GDSC;

149 https://www.cancerrxgene.org).

5

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

150 Deriving the Consensus Mutational Signatures from Multiple Cancer

151 Types

152 The distribution of mutational signatures (MSs) activities is heavily skewed, thus, prevents 153 the development of an optimized model for the discovering of driver genes for these mutagenic 154 processes. To overcome the statistical obstacle, we retrieved the somatic mutational signatures 155 using “pmsignature” based on 5-nucleotide context(27). This algorithm introduced a “background 156 signature” which is designed to capture biases in intrinsic genome sequence composition and is 157 calculated from the composition of consecutive nucleotides of the sequence(27). 158 All the mutational signatures were compared the results to the SBS mutational signatures in the 159 COSMIC (https://cancer.sanger.ac.uk/cosmic) using cosine similarity (CS) measure. In order to

160 account for the confounding effects in the future analysis, we set a threshold of 1×10-3 for the

161 present call of a given signature k. Then we used the background signature as the reference and 162 generated a new statistical term, called, mutational propensity (MP) which is the relative activity 163 of mutational signature as the ratio between the activities of a given signature k and the reference 164 (eq.1).

푀푆푘푖 푀푃푘푖 = ln ( ) #(1.) 푀푆7푖

푡ℎ 푡ℎ 165 Here 푀푆푘푖 is the activity of the 푘 signature of 푖 individual; 푀푆7푖 is the reference 푡ℎ 166 signature (MS7); 푀푃푘푖 is the 푘 mutational propensity. 167 We used Bioconductor package “deconstructSigs” to determine the activities of the 168 conserved mutational signatures in cancer cell lines(28), and the MPs were computed in the same 169 way aforementioned. We compared the mean of MPs between the TCGA and CCLE cohorts 170 based on cosine similarity.

171 Integrated Regression Analyses Suggest Driver Genes of Mutational

172 Processes

173 To estimate the effects of each gene on the mutational processes, we combined seven classes 174 of gene variations: germline variants (missense, truncated and structural variant), somatic 175 nsy-mutations, SCNAs, TSS-methylation and gene fusions. We excluded highly polymorphic 176 genes from the analysis, namely, the human leukocyte antigen genes and olfactory receptor genes. 177 Then, for pan-Cancer and cancer-specific analyses, we excluded the samples with low present call 178 (N < 30) and genes with very low mutation/variation rate (N < 5). We used the MPs to evaluate 179 the effects of the mutational status of each gene using a linear model for both pan-cancer (eq.2) 180 and cancer-specific analysis (eq.3). The regression coefficients of 훽 represent the effect sizes of 181 the variations of genes.

6

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

1 퐺푚푖푠 푖푗 퐺푡푟푢푖푗 퐺푠푡푟푖푗 ′ 푆푛푣 푀푃푘푖 = 휷풊풋 푖푗 + 휀푖푗푘#(2.) 푆푐푛푎푖푗 푀푒푡ℎ 푖푗 퐹푢푠푖푗 ( 퐶푖 ) 1 퐺푚푖푠 푖푗 퐺푡푟푢푛푖푗 퐺푠푡푟 ′ 푖푗 푀푃푘푖 = 휷 + 휀푖푗푘#(3.) 풊풋 푆푛푣푖푗 푆푐푛푎푖푗

푀푒푡ℎ푖푗 ( 퐹푢푠푖푗 )

2 푡ℎ 182 Here 휀푖푗푘~푁(0, 휎 ) is a Gaussian error term; 퐺푚푖푠푖푗 refers to the 푗 gene's missense 푡ℎ 183 genetic burden of 푖 individual; and 퐺푡푟푢푛푖푗, 퐺푠푡푟푖푗, 푆푛푣푖푗, 푆푐푛푎푖푗, 푀푒푡ℎ푖푗 and 퐹푢푠푖푗 are the 184 truncation genetic burden, structural genetic burden, somatic nsy- status, SCNA status, 푡ℎ 185 TSS-methylation levels and fusion status, respectively; 퐶푖 is the cancer type. 푀푃푘푖 is the 푘 186 mutational propensity. We define a driver gene of which the germline variants and somatic 187 variants are both significantly associated with the MPs.

188 Instrumental Variable Regression Analysis

189 We then aimed to find the genes of which the genetic features impact the mutational 190 processes through their expression levels by performing instrumental variable analysis using the 191 Julia (v1.1.1) package of “FixedEffectModels”. Briefly, the dependent variable is the MPs, the 192 independent variable is the expression levels of genes that significantly associated with MPs, the 193 genetic instruments are the genetic features in eq.2 and eq.3 which are previously found 194 significant (eq. 4).

푀푃푘푖 = 훽0 + 훽1 × 푚푅푁퐴푖푗 | 퐺퐹푖푗푘 + 퐶푖 + 휀푖푗푘#(4.) 2 푡ℎ 195 Here, 휀푖푗푘~푁(0, 휎 ) is a Gaussian error term; 푀푃푘푖 is the 푘 mutational propensity of 푡ℎ 푡ℎ 196 the 푖 individual; 푚푅푁퐴푖푗 is the 푗 gene expression; 퐺퐹푖푗푘 are the genetic features; 퐶푖 is 197 the cancer type. 198 Models with instrument variables were estimated using Two-Stage least squares (2SLS; Julia 199 package, FixedEffectModels). To determinate the significance of the independent variables, we 200 calculated FDR based on the P-values of the regression coefficients using Benjamini-Hochberg 201 procedure. To determinate the significance of the instrumental variables, we used the weak 202 instruments test P-values (29) based on the Kleibergen-Paap rank Wald F-statistic and estimated

203 FDRweak. Finally, we chose genes of which the expression levels and genetic instruments were

204 both significant to call E-genes (FDR < 0.1 and FDRweak < 0.1). 7

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

205 Similarly, we performed the IV analysis to find the genes of which the genetic statuses 206 impact the mutational processes through interacting with the TIME. The dependent variable is the 207 MPs, the independent variable is the immune cell fraction(22), the genetic instruments are the 208 genetic features. Finally, we selected a set of genes of which genetic features associated with

209 immune cell fractions, in turn, impacted somatic mutational processes (FDR < 0.1 and FDRweak < 210 0.1), which were called I-genes.

211 Identify the Drug-related Candidate Genes in Cancer Cell Lines

212 To evaluate the effects of the candidate genes on the drug sensitivity/resistance to therapy, 213 we used a linear model to calculated the association between the mRNA expression levels of 214 E-genes and I-genes and the half maximal inhibitory concentration (IC50) of drugs (eq. 5):

퐼퐶50푖푘 = 훽0 + 훽1 × 푚푅푁퐴푖푗 + 퐶푖 + 휀푖푗푘#(5.)

2 푡ℎ 푡ℎ 215 Here, 휀푖푗푘~푁(0, 휎 ) is a Gaussian error term; 퐼퐶50푖푘 is the IC50 of the 푘 drug of 푖 푡ℎ 216 cell line; 푚푅푁퐴푖푗 is the mRNA expression of the 푗 gene; 퐶푖 is the cancer type.

217 Identify the Genes Associated with Anti-PD-1 Therapy Response

218 To identify the genes associated with anti-PD-1 therapy response, we performed a one-sided 219 Student’s t-test to capture genes that were differentially expressed between the non-response 220 groups (progressive/stable disease, PD/SD) and response groups (partial/complete response, 221 PR/CR) in melanoma (N = 55) (30) and metastatic gastric cancer (N = 45) (31). A p-value < 0.05 222 was considered significant.

223 Identify Common Non-Coding Variants Influencing the Mutational

224 Processes

225 In order to examine the association between non-coding germline variants and the mutational 226 processes in cancers, we took the MP as a quantitative trait; and performed a whole genome 227 Quantitative Trait Loci (QTL) mapping based on linear model in both pan-cancer level (eq.6) as 228 well as cancer-specific level (eq.7), which resulted in a set of mpQTLs significantly associated 229 with the MPs (FDR < 0.1).

푀푃푘푖 = 훽0 + 훽1 × 𝑔푖푗 + 퐶푖 + 휀푖푗푘#(6.)

푀푃푘푖 = 훽0 + 훽1 × 𝑔푖푗 + 휀푖푗푘#(7.)

2 푡ℎ 230 Here 휀푖푗푘~푁(0, 휎 ) is a Gaussian error term; 푀푃푘푖 is the 푘 mutational propensity of 푡ℎ 푡ℎ 231 the 푖 individual; 𝑔푖푗 is the 푗 SNP's genotype; 퐶푖 is the cancer type.

8

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

232 Then, we used the significant variants as genetic instruments to examine the association 233 between the nearby genes of variants (< 1 Mb) and the corresponding MPs in each cancer type 234 using Julia (v1.1.1) package FixedEffectModels (eq.8):

푀푃푘푖 = 훽0 + 훽1 × 푚푅푁퐴푖푗푙 | 𝑔푖푗푘 + 퐶푖 + 휀푖푗푘푙#(8.)

2 푡ℎ 235 Here, 휀푖푗푘푙~푁(0, 휎 ) is a Gaussian error term; 푀푃푘푖 is the 푘 mutational propensity of 푡ℎ 푡ℎ 푡ℎ 236 the 푖 individual; 푚푅푁퐴푖푗푙 is the 푙 gene expression nearby the 푗 SNP (< 1 Mb); 𝑔푖푗푘 is 푡ℎ 푡ℎ 237 the genotype of the 푗 SNP which is significantly associated with the 푘 MP; 퐶푖 is the cancer 238 type.

239 Annotation of Biological Processes Determine the Mutational

240 Processes

241 The above E-genes and I-genes were then subjected to gene set enrichment analysis (GSEA) 242 using gene sets including Reactome (MSigDB v6.1) and Hallmark (MSigDB v6.1), KEGG 243 (MSigDB v6.1).

244 Gene-sets Enrichment Test

245 We performed Fisher's exact test to evaluated the enrichment of E-genes and I-genes 246 enriched for known cancer driver gene sets using R packages GeneOverlap.

247 Clinical Analysis

248 To assess the effects of E-genes and I-genes on the treatment outcome, we collected a cohort 249 of 60 LUAD patients from The First Affiliated Hospital of Xiamen University (FHXU). Informed 250 written consent was obtained from each subject or each subject's guardian. The usage of patient 251 data is approved by the ethics committee/institutional review board of FHXU. We used a logistic 252 regression to evaluate the interaction between somatic mutational burden of E-genes or I-genes 253 and the targeted therapies on the binary clinical outcomes (PR and PD/SD), adjusting for clinical 254 covariates including stage and smoking. A p-value < 0.05 was considered significant. 255 The Kaplan-Meier method was utilized to estimate overall survival, and difference between 256 groups were assessed using the log-rank test. A p-value < 0.05 was considered significant.

257 Data Availability Statement

258 The Code Ocean capsule containing the necessary data and codes to replicate the results of 259 this study can be found at https://codeocean.com/capsule/6181333/tree/v1 (DOI: 260 10.24433/CO.2000361.v1).

9

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

261 Results

262 Mutational Propensity in Thirteen Cancer Types 263 We obtained a dataset of 13 cancer types from TCGA with matched germline variants, 264 somatic mutations, SCNAs, DNA methylation and expression of mRNA(32). After filtering for 265 populational background and removal of unmatched individuals, we identified a population of 266 3,566 Utah residents of northern and western European ancestry (CEU; Fig. 1, Fig. S1A-B and 267 Table S1) for the following analysis. 268 In order to derive a quantitative trait for inter-tumor variations in the mutational processes, 269 we first retrieved seven highly conserved somatic mutational signatures (MSs) from the 270 mutational profiles based on a 5-nucleotide context (Fig. 2A and Fig. S2A-C). These signatures 271 are highly comparable to the known SBS signatures in the COSMIC according to the cosine 272 similarity (CS; Fig. S2D)(6). For example, MS1 is highly similar to SBS1 (the deamination of 273 5-methylcytosine, characterized by C>T at NpCpG trinucleotide, CS = 0.97). MS2 associates with 274 SBS2 (CS = 0.79) which is attributed to the activity of the AID/APOBEC family of cytidine 275 deaminases. MS3 correlates with SBS6 (CS = 0.75) which is associated with defective DNA 276 mismatch repair (dMMR). MS4 correlates to SBS4 (CS = 0.87) caused by tobacco smoking. MS5 277 and MS6 correlates with SBS10a (CS = 0.96) and SBS10b (CS = 0.91), respectively, both of 278 which are caused by polymerase epsilon exonuclease (POLE) domain mutations(3,6). Finally, the 279 MS7 universally correlates with the multiple mutational signatures in the COSMIC databases (Fig. 280 S2D). Among all MS’s, MS7 is abundant in all nucleotide contexts, representing the most 281 recurrent and conserved mutations in the genome, and universally present in all cancer types, 282 hence a suitable reference mutational signature in the following analysis. 283 To control for the latent confounders of between-sample variation and improve the skewed 284 distribution of MSs, we introduced mutational propensity (MP) as a quantitative trait for somatic 285 selection pressure over the MSs. MP is defined as the natural log-ratio of the activities between a 286 signature of interest and the reference signature (eq.1). Hence a positive MP indicates that the 287 corresponding mutational process is positively selected in the given cancer. After removal of the 288 extreme values, the MPs in the cancer population follow approximately normal distribution (Fig. 289 2B-C). 290 The MPs retain the important biological properties of the original MSs (Fig. 2D). Of note, in 291 LUAD, COAD and UECE, tumor with high tumor mutation burden (TMB) inclined to specific 292 MPs, suggesting the corresponding mutational processes are dominant in high TMB tumors. 293 Among the 13 cancer types, the effect sizes (Spearman’s rho) of MPs on TMB are strongly 294 correlated with those of the original MSs (R = 0.822, P < 2.20×10-16; Fig. 2E). We calculated the 295 Shannon’ index based on the activities of MS as a measure of diversity of the mutational 296 processes. Our data suggested that tumors with more diverse mutational processes showed 297 significantly increased number of clones (P=7.00×10-4, Fig. S3A-C). And the effect sizes 298 (Spearman’s rho) of MPs on ITH are strongly correlated with those of MSs (R = 0.795, P < 299 2.20×10-16; Fig. 2F). We also noticed that the MPs are significantly associated with tumor 300 immune subtypes (Fig. S4). 10

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

301 To further validate the MPs in cell lines, we retrieved the six MPs in 544 cancer cell lines of 302 CCLE representing 13 cancer types. The MPs are highly consistent between cancer tissues and the 303 cell lines of the same cancer type with an overall correlation of 0.957 (P = 1.00×10-6; Fig. 2G). 304 Within each cancer type, the cosine similarity between TCGA sample and cell lines ranges from 305 0.938 (ESCC) to 0.996 (EAC). The results suggested that the MPs are highly conserved in both 306 cancer tissues and cell lines hence robust surrogate for the mutational processes in cancer(33). 307 In addition, the MPs are significantly associated with IC50 of 116 drugs (FDR < 0.1). For 308 example, MP3 is associated with the IC50 of BRAF inhibitor (dabrafenib, FDR = 0.0781 and 309 PLX-4720, FDR = 0.0240); MP4 is associated with the IC50 of HSP90 inhibitors (CCT-018159; 310 FDR = 0.0568; Fig. S5), which is consistent to the previous reports that treatment influence the 311 mutability of cancer (10,34).

312 The Genetic Determinants of Somatic Selection of Mutational Propensities in

313 Cancers 314 Next, we sought to identify genes which influence somatic selection of MPs. We defined a 315 regression model integrating the effects of the germline statuses, the somatic statuses, and the 316 epigenetic statuses (TSS methylation levels). To determine the germline statuses of the genes, we 317 classified the 524,912 rare variants into three subtypes: 9,584 structural variants (> 50bp), 467,709 318 missense variants and 47,619 truncated variants. Then we computed the burden of each class of 319 variations separately for 17,810 genes. For the somatic mutational statuses, we obtained a total of 320 820,907 single nucleotide variants (SNVs), 163,544 deletions and 54,084 insertions from TCGA 321 for 3,566 cancers. We, then, retrieved 19,105 genes’ somatic mutational statuses based on 797,814 322 non-synonymous variants. The samples with high TMB (> 20) occur mostly in UCEC (17.6%), 323 COAD (17.4%) and STAD (16.0%), which is consistent to prior studies (35). 324 We evaluated 17,027 genes in 3,566 cancers and found 87.5% (N=14,903) genes of which at

325 least one type of genetic status were significantly associated with the MPs (FDR<0.1), suggesting

326 a wide influence of the mutational processes. Among the genes associated with the MPs, 14,373 327 (84.4%) acts at the pan-cancer level; and 13 (0.0763%, ESCC) to 9,596 (56.4%, UCEC) are 328 cancer-specific. As for the effect sizes, the somatic nsy-mutations were significant in 12,836 genes 329 and accounted for the most of variations of MPs, especially for MP5 (1.77% to 54.1%), MP6 330 (0.0864% to 27.0%) and MP3 (0.798% to 29.9%; Fig. 3A); followed by the SCNAs which were 331 significant in 11,721 genes and accounted for 0.764% to 48.8% of the variance of MP1 and MP3, 332 respectively. Of note, there are a small subset of genes of which SCNA statuses accounted more 333 than 40% of the variance of MP1 (adjusted R2 > 0.4), including PIK3CA (FDR = 6.68×10-4, adjust 334 R2= 0.477) and MAP3K1 (FDR = 1.25×10-4, adjust R2 = 0.478) (Fig. 3A). In addition, the 335 TSS-methylation of 8,199 genes accounted for over 10% of the variance of MP2 and 20% of the 336 variance of MP6. Genes in this category include CDH1 (MP2, FDR = 2.75×10-5, adjust R2 = 0.162) 337 and ERCC3 (MP6, FDR = 0.0318, adjust R2 = 0.227) (Fig. 3A). The effect sizes of the germline

11

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

338 variants are much smaller than those of the somatic mutations (Fig. 3A), with the number of 339 significant genes decreases in the order of germline missense mutation (N= 3,304), truncated 340 mutation (N = 737) and structural mutation (N = 107). Gene fusions were significant in only 48 341 genes. Nevertheless, the fusion statuses of 12 genes showed very strong effects on MP1 (R2 > 342 0.4). 343 To further validate the methodology, we chose a set of cancer genes with known germline 344 pathogenic effects for an internal validation, including MBD4, BRCA1, BRCA2 and several 345 dMMR genes (MLH1, MSH2, MSH6 and PMS2). As a result, we found that MP1 (the deamination 346 of 5-methylcytosine) is influenced by the germline truncation of MBD4 (P = 9.17×10-3, adjusted 347 R2 = 0.467) and BRCA2 (P = 3.79×10-5, adjusted R2 = 0.468) at pan-cancer levels, which is 348 consistent to the previous report (19). MP3 (dMMR-related) is influenced by germline missense 349 variants of PMS2 in LUAD (P = 0.0431, adjusted R2 = 0.0122) and MSH6 in BRCA (P = 3.93 350 ×10-4, adjusted R2 = 0.0232). And MP2 (APOBEC-related) is influenced by the missense variant 351 of APOBEC3H (P = 3.61 ×10-3, adjusted R2 = 0.156; Fig. 3A). 352 Our data indicated that many genes are associated with the MPs but such associations do not 353 necessarily imply any causal effects. To ascertain the causal effect of a given gene, we mandated 354 the significance of both germline effect and non-germline effect on the same MP (Methods). Thus, 355 we identified 2,314 candidate driver genes of MPs in cancer based on FDR of 0.1 (Fig. 3B). 356 Among the candidate genes, 1,268 (54.8%) genes influence the MPs at pan-cancer level, 935 357 (40.4%) in specific cancer types, and 111 (4.80%) are significant at both pan-cancer and 358 cancer-specific level, such as BRCA1/2, FAT3 and SETDB1 (Fig. 3C). The majority of the 359 cancer-specific candidate genes of MP are identified in UCEC (38.4%, N = 888), COAD (5.49%, 360 N = 127) and BRCA (1.17%, N = 27). On the other hand, 87.9% of the 2,314 candidate driver 361 genes are associated with unique MPs (Fisher's exact test P < 0.01) suggesting the biological 362 mechanisms underlying somatic selection of mutational processes are highly specific; 10.46% (N 363 = 242) are associated with two and only 1.64% (N=38) are associated with more than two MPs 364 (Fig. 3D). MP2 (N = 661), MP5 (N = 491), MP6 (N = 452), MP1 (N=362) and MP3 (N=352; 365 Table S2A) are the mutational processes with the most genetic determinants (90.3% total), 366 whereas MP4, which is caused by tobacco smoking has the less (N = 310). 367 We also validated the candidate genes in the population of African Ancestry in Southwest US 368 (ASW, N = 564) and Han Chinese in Beijing, China (CHB, N = 363). Among the 2,314 candidate 369 genes, the germline genetic burdens of 9 genes in ASW and 4 in CHB are significantly associated 370 with the same MPs as in CEU; and the somatic statuses of 223 genes in ASW and 18 in CHB are 371 significant (Table S2A). 372 In order to infer the biological processes mediating the effects of the candidate determinant 373 genes and the selection of mutational processes, we performed an instrumental variable (IV) 374 regression for the 2,314 candidate driver genes based on two possible mechanisms of somatic 375 selection: first, the genetic and/or epigenetic statuses of a candidate gene directly alter its 376 expression level in cancers; and influence the fitness of the cell. As a result, we identified 485

12

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

377 unique “expression-associated determinant genes” (E-genes; FDR < 0.1 and FDRweak < 0.1; Fig. 378 3B; Table S2B). In another scenario, the genetic and/or epigenetic statuses influence the selection 379 of mutational process through interacting with the TIME. Thus, we identified another set of 1,427

380 “Immune-interactive-genes” (I-genes; FDR < 0.1 and FDRweak < 0.1; Fig. 3B, Table S2C). 381 We further analyzed candidate genes in detail and found that the 485 E-genes and 1,427 382 I-genes are both significantly enriched for known cancer driver genes, such as the COSMIC 383 cancer gene consensus (CGCs; 2.39-fold, P = 1.30×10-5; 1.74-fold, P = 2.80×10-5), and three other 384 cancer driver gene sets (P < 0.05, OR = 1.78-4.05; Fig. 4A)(15,36,37), suggesting the genetic 385 determinants of somatic selection of mutational processes we found are highly consistent to the 386 known cancer driver genes.

387 Cancer Gene Expression Intermediates the Genetic Determinants of Mutational Propensity 388 Of the 485 “E-genes”, 210 influence the MPs at pan-cancer level, which mainly affect the 389 mutational processes of MP1 (N=73, OR = 15.5, P = 3.41×10-43) and MP3 (N = 23, OR = 5.91, P 390 = 7.55×10-10) (Fig. 4B). There are also 283 “E-genes” act in specific cancer types, most of which 391 are identified in UCEC (N=244), COAD (N=42) and BRCA (N=10, Table S2B). 392 The 485 E-genes are significantly enriched in pathways of cancer cell growth and 393 proliferation pathways, such as metabolism of RNA (FDR = 2.26×10-6), RNA polymerase III 394 transcription (FDR = 3.72×10-3), cell cycle (FDR = 3.86×10-3) and metabolism of lipids (FDR = 395 3.21×10-5; Fig. 4C). Many E-genes are well-known cancer driver genes, such as BRCA1 (MP2, 396 FDR = 7.20×10-3), PIK3R1 (MP6, FDR = 1.89×10-5), while others are newly reported, such as 397 PPIP5K2 (MP1, FDR = 0.00250) and SNX24 (MP6, FDR = 2.51×10-6).

398 Tumor Immune Microenvironment Intermediates the Genetic Determinants of Mutational

399 Propensity 400 Of the 1,427 “I-genes”, 797 significantly influence MPs at pan-cancer level, which are 401 significantly enriched for determinants of the mutational processes corresponds to MP2 (N =360, 402 OR=4.09, P=1.70×10-48), MP3 (N =58, OR = 8.42, P = 4.45×10-10), MP5 (N = 180, OR = 3.98, P 403 = 3.89×10-28) and MP6 (N = 138, OR = 3.82, P = 4.20×10-22) (Fig. 4B). There are another 680 404 “I-genes” acting in specific cancer types, most of which are identified in in UCEC (N=599), 405 COAD (N=54) and BRCA (N=23, Table S2C). 406 The pathways significantly enriched in the I-genes are consistent to the functions in cancer 407 immunity. The I-genes are enriched in cytokine signaling in immune system pathways (FDR = 408 3.16×10-7), innate immune system pathway (FDR = 1.29×10-5, Fig. 4C), hence the effects on the 409 activities of tumor microenvironment. Then the I-genes are also enriched in cancer invasion and 410 metastasis pathways, such as epithelial mesenchymal transition(EMT; FDR = 2.59 ×10-5) and 411 adhesion (FDR = 1.66 ×10-22), suggesting that intercellular communications are involved in 412 somatic selection and influence the clonal evolution(38). Finally, certain metabolism pathways, 413 such as the lipid metabolism (FDR=5.41×10-3) are also enriched in the I-genes.

13

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

414 Of note, 350 out of the 485 E-genes are also I-genes, which suggest the genetic determinants 415 can influence the mutational processes in both ways (Table S2D). Many of these 350 genes are 416 known cancer-genes, such as MSH2, MSH6, BRCA1, ALK, ABL2, MAPK6 and NF1. 417 In addition, the impacts of the somatic statuses on the MPs are related to the activities of 418 specific immune cells. For example, the TSS-methylations are strongly associated with MP2 419 through neutrophil activity (25.2%, N = 360, Fig. 4D). And the somatic nsy-mutations influence 420 MP5 through the activities of T follicular helper cells (19.1%, N = 273) and M1 macrophages 421 (9.95%, N = 142). These findings suggest specific immune responses underlie somatic selection of 422 mutational processes which are activated in response to different types of somatic alterations in 423 I-genes.

424 The Genetic Determinants of Somatic Mutational Propensity Impact

425 Carcinogenesis and Cancer Therapy 426 As the genetic determinants of MPs influence many cancer-related biological processes, we 427 asked how these genes impact carcinogenesis and the consequent biological-clinical 428 characteristics of cancer. 429 Recent advances in gene-editing techniques enable highly specific evaluation of the 430 genetic-dependency of cancer proliferation. We first evaluated the enrichment of genes annotated 431 for cancer-dependency by CRISPR-Cas9 screening in 233 cell lines (26). We categorized 10,343 432 genes of cancer-dependency according to the median CERES score across different cell lines. 433 Then we compared the fold enrichment of benchmark gene sets(15,36,37), such as COSMIC 434 CGCs and susceptibility cancer driver gene sets with those of our E-genes and I-genes. As a result, 435 both E-genes and I-genes are significantly enriched in the oncogenes of which CERES score are 436 less than zero (Fig. 5A). For example, E-genes are significantly enriched in the gene-set (CERES 437 = -1.6 ~ -1.2; P = 0.024, fold enrichment = 9.18) and I-genes are significantly enriched in the 438 gene-set (CERES = -2.2 ~ -2; P = 0.0137, fold enrichment = 17.7). Of note, the tendency of 439 enrichment of the E-genes and I-genes remains stable after removal of known cancer genes. 440 We compared the CERES scores of E-genes and I-genes. Consistent to the corresponding 441 biological basis, the median CERES score of the E-genes is significantly lower than those of the 442 I-genes (P = 0.0075; Fig. S6) suggesting that I-genes overall show weaker cancer dependency 443 than E-genes in vitro. 444 We assessed the therapeutic implications of the E-genes in 300 cancer cell lines treated with 445 320 drugs with specific targets. Among the six benchmark gene sets (15,36,37), E-genes showed 446 the highest fraction of genes which are significantly predictive of the IC50 in all cell lines (70.5%, 447 N = 148, FDR < 0.1); whereas the fraction of predictive I-genes is the lowest (63.9%, Fig. 5B). 448 Each of the 320 cancer drugs targets a specific signaling pathway in cancer. Our results suggest 449 that E-genes extensively influence the efficacies of targeted therapies in cell lines, especially on 450 hormone-related pathway and cytoskeleton pathway (MP2) and chromatin histone methylation 451 pathway (MP3 and MP6; Fig. 5C). Moreover, in a cohort of 60 LUAD which received

14

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

452 chemotherapy and targeted therapies (Table S3), the somatic mutation burdens of 17 E-genes 453 showed significant interactive effect on the clinical benefit from EGFR targeted therapies (PR 454 versus PD/SD; P = 0.0361, OR = 2.26); whereas the mutation burdens of I-genes showed no such 455 effect. 456 We then compared the I-genes with genes associated with responses to immune check point 457 inhibitors (ICI, anti-PD-1) in melanoma (N = 55) (30) and metastatic gastric cancer (N = 45) (31). 458 As a result, I-genes showed the strongest overlapping with genes predictive of anti-PD-1 therapy 459 response in both cohorts (9.91%, N = 79 and 4.52%, N = 36), compared to the other five 460 benchmark gene sets (Fig. 5D-E). Moreover, the 79 I-genes overlapped with the anti-PD-1 461 therapy response genes are enriched in pathways such as fatty acid metabolism (FDR = 0.0113) 462 and Adipogenesis (FDR = 0.0113; Fig. S7). As for the long-term outcomes, the somatic 463 nsy-mutation statuses of 6 I-genes, such as CREBBP (MP3; HR = 0.415, P = 5.66×10-4) and 464 PARP1(MP6; HR = 0.169, P = 0.0123) were significantly associated with better overall survival 465 in 1,661 patients received ICI treatment. (Fig. S8)(39). 466 In summary, our results suggest that E-genes and I-genes influence specific mutational 467 processes, respectively. E-genes represent the proliferation capacity hence the impacts on the 468 age-related mutational process, such as deamination of methylated cytosines, which occur 469 throughout the patient’s life time (9). The I-genes which influence somatic selection of mutational 470 processes via TIME, are more effective in the mutational processes associated with cancer 471 immunity, such as AID/APOBEC process (40).

472 Quantitative Trait Loci of the Mutational Propensities 473 Finally, we assessed the effects of 6,103,818 non-coding common germline variants on six 474 MPs. We found 127 unique non-coding germline variants which mapped to two MPs (mpQTLs; P 475 < 5×10-8, Fig. 6A; Table S4A). Of note, the effect sizes of the non-coding common germline 476 variations on the mutational processes are comparable to those of the coding variants. We found 477 that the MP2 (APOBEC-related) is the most influenced by mpQTLs (20q11.21, 15q21.2 and 478 16q12.2; 77.2%, N=98); followed by MP6 (mut-POLE-related; 10q24.1, 9.45%, N=12) and MP1 479 (NpCpG-related; 5q12.3, 5.51%, N=7; Fig. 6A). Especially, we found that 6 SNPs in 22q13.1 are 480 significantly associated with MP2 (APOBEC), such as rs112045173 (P < 8.21×10-6) and 481 rs17824310 (P < 8.49×10-6), which are consistent to the previous study(19). Among the germline 482 variants associated with the MPs, 99 loci (78.0%) act at the pan-cancer level; while the others are 483 reported mainly in three cancer types, BRCA (N = 14, 0.110%), THCA (N = 7, 5.51%) and KIRC 484 (N = 4, 3.15%; Table S4A). 485 We noticed that all of the unique 127 mpQTLs are known eQTL loci in cancer (41), 486 suggesting the impacts on mutational process are related to cancer gene expression. To reveal the 487 underlying mechanisms through which the mpQTLs exert their functions, we performed IV 488 regression based on gene expression in cis of the mpQTLs (Methods). The results suggested that 489 13 mpQTLs significantly impact the MP2 (APOBEC-related) via acting on mRNA transcript 490 levels in cis (Table S4B). For example, rs6060924 acts through mitochondrial cytochrome c

15

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

491 oxidase subunit 4 isotype 2 (COX4I2) to impact MP2 (Fig. 6B-C); another SNP in linkage 492 disequilibrium is reported as an eQTL of COX4I2 in a variety of cancer (41). COX4I2 is a 493 component of Warburg effect and related to tumor progression and metastasis(42). The other 494 mpQTL (rs35413356) acts through Aldehyde Dehydrogenase 5 Family Member A1 (ALDH5A1) 495 which also impacts MP2 (Fig. 6D-E).

496 Discussion 497 In our study, we used the MPs as a measure of the propensity of mutational processes in 498 cancer. Compared with the raw activities of MSs, the MPs show better normality, robustness to 499 sample purity and heterogeneity and stronger clinical relevance. MPs work as a robust linkage to 500 somatic selection of mutational processes which facilitate the discovery of relevant cancer genes. 501 Future study with larger sample size can improve the statistical power to identify more 502 cancer-evolution related genes. 503 Although the germline variants that directly cause MSs are mainly reported for homologous 504 recombination (HR) (43) and dMMR (44) related genes. Recent studies demonstrated that 505 germline variants other than HR and dMMR also influence the MSs (45). The germline 506 determinants of MPs include both causal mutagenic processes and intrinsic or extrinsic biological 507 processes contributing to somatic selection and clonality of cancer. In the current study, we 508 hypothesized that many more germline variants can influence the “fitness” of cancer. Therefore, 509 the candidate genes in our study included both genes from the causal mutagenic processes and 510 intrinsic or extrinsic biological processes contributing to somatic selection and clonality of cancer. 511 Most of the studies trying to identify “cancer drivers” are based on the significantly high 512 recurrence of somatic mutations in the cancer populations. Such methods have yielded many 513 meaningful driver genes and mutations which are successfully used in cancer diagnosis, prediction, 514 and treatment. On the other hand, the clinical benefits of treatments still vary substantially among 515 patients, which is largely attributed to the clonal evolution of cancer. The current study is designed 516 to identify genes and variants which potentially influence the processes of somatic selection and 517 thereby suggest new candidate for cancer driver genes. 518 ITH is extensively studied for its association to clonal evolution and somatic selection, and a 519 major cause of resistance to treatments(46). However, most of ITH studies are still based on 520 limited sample size whereas the clinical phenotypic heterogeneity of cancer is observed and 521 measured in large population of cancer. Alternatively, our approach is based on the inter-tumor 522 variations of the evolution processes in a reasonably large population, which offers the capability 523 to assess the effects of genetic heterogeneity, especially rare mutational events, on the somatic 524 evolution, which cannot be addressed in small sample. 525 Here, we report two gene sets (E-genes and I-genes) which exhibit additional predictive 526 power in the treatment responses and innate immune status in TIME. Our findings confirm the 527 importance of somatic evolution in the development of cancer and provide an alternative way to

16

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

528 identify genes which further explain the heterogenous clinical phenotypes beyond the known 529 driver genes. 530 Our data suggest that I-genes are enriched in lipid metabolism. For example, HSPH1 (heat 531 shock family H member 1) can induces macrophage differentiation(47). HSPH1 also 532 interacts with BCL6 which specifies and promotes T follicular helper cell program (48). Therefore, 533 we deduce that some of the I-genes modulate T follicular helper cell and M1 macrophage 534 activities via lipid metabolism and thereby impact MP5. Consistently, recent studies show that 535 tumors with elevated lipid metabolism have increased antigen presentation and are associated with 536 response to anti-PD-1 or TIL-based immunotherapy (49). 537 Some of the I-genes can also be functionally pleiotropic in cancer. For example, EEF2 538 (eukaryotic elongation factor 2, CERES score = -2.03), elicits both humoral and cellular immune 539 responses; it also promotes tumor cell proliferation, angiogenesis, metastasis, and invasion (50). 540 TYRO3 is involved in both cell proliferation and survival pathways and immune response 541 regulation (51). CXCR4 promotes cancer cell proliferation (52) and also play a role in the 542 recruitment of immunosuppressive cells such as regulatory T cells (Treg), M2, and N2 neutrophils 543 to limits the effectiveness of immune responses (53,54). Altogether, our results showed that the 544 I-genes and E-genes have multiple roles in both cancer cell survival and immune response. 545 Nevertheless, the current study is limited in certain aspects. Although we consider the effects 546 of both germline and somatic statuses, the variants can cause either gain or loss of the function, 547 which lead to opposite effects on the signature activity; and the current analysis cannot control for 548 the consistency of the effects as the functional consequence of the variants are unknown. Unlike 549 the germline variants, there were no significant difference in selection pressure among these 550 somatic variants (missense, nonsense and splice site) (55). Therefore, we combined all somatic 551 mutations into a binary status in order to take the advantage sample size. Eventually, our 552 validation based on separated mutational statuses indicates that the determinant genes of MPs are 553 highly conserved (Table S2A). 554 Theoretically, the method described is applicable to all MSs. But for many rarer mutational 555 processes which occur only in certain cancer subtypes, analyzing such MSs together with common 556 ones will cause imbalance in the data and introduce unnecessary biases. 557 In addition, there are other important biological mechanisms which influence the somatic 558 evolution, such as age and individual mutation order (56,57). However, the current analysis based 559 on TCGA samples cannot address the effects of such factors due to the lack of information. Other 560 factors, such as irradiation (58) or chemotherapy(10) also influence somatic selection of 561 mutational processes. However, as TCGA sample receive different treatments, the current study 562 cannot directly assess the effects caused by a specific therapy. 563 Our findings can be further validated in different levels. For example, the correlation between 564 the deterministic genes and the MPs can be validated by CRISPR-Cas9 mediated knockout or 565 knock-in in cell lines or mouse models. Moreover, the MP can be verified from the panel 566 sequencing, such as MSK-IMPACT(59) and Praxis Extended RAS Pane l (60), obtained from

17

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

567 clinical specimens and thereby to further validate the correlation with imaging, pathological 568 diagnoses, clinical and treatment outcomes in a broader population level. 569 In summary, we provide a systematic view of the landscape of genetic determination of 570 somatic selection of mutational processes in cancers. Our findings can inform the identification of 571 cancer genes with highly potential clinical and therapeutic values.

572 Acknowledgement 573 We thank Dr. Jiahuai Han, Dr. Bing Xu, Dr. Dan Du and Dr. Tianhai Ji for critical reading of 574 and feedback to our work. The results published here are in whole or part based upon data 575 generated by the TCGA Research Network: https://www.cancer.gov/tcga.

576 Reference 577 1. Riaz N, Havel JJ, Makarov V, Desrichard A, Urba WJ, Sims JS, et al. Tumor and 578 Microenvironment Evolution during Immunotherapy with Nivolumab. Cell. 579 2017;171:934-949.e16. 580 2. Nam AS, Chaligne R, Landau DA. Integrating genetic and non-genetic determinants of 581 cancer evolution by single-cell multi-omics. Nature Reviews Genetics. Nature Publishing 582 Group; 2020;1–16. 583 3. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, et al. 584 Signatures of mutational processes in human cancer. Nature. 2013;500:415–21. 585 4. McGranahan N, Favero F, de Bruin EC, Birkbak NJ, Szallasi Z, Swanton C. Clonal status of 586 actionable driver events and the timing of mutational processes in cancer evolution. Science 587 Translational Medicine. 2015;7:283ra54-283ra54. 588 5. PCAWG Evolution & Heterogeneity Working Group, PCAWG Consortium, Gerstung M, 589 Jolly C, Leshchiner I, Dentro SC, et al. The evolutionary history of 2,658 cancers. Nature. 590 2020;578:122–8. 591 6. PCAWG Mutational Signatures Working Group, PCAWG Consortium, Alexandrov LB, 592 Kim J, Haradhvala NJ, Huang MN, et al. The repertoire of mutational signatures in human 593 cancer. Nature. 2020;578:94–101. 594 7. Huang MN, Yu W, Teoh WW, Ardin M, Jusakul A, Ng AWT, et al. Genome-scale 595 mutational signatures of aflatoxin in cells, mice, and human tumors. Genome Res. 596 2017;27:1475–86. 597 8. Zou X, Owusu M, Harris R, Jackson SP, Loizou JI, Nik-Zainal S. Validating the concept of 598 mutational signatures with isogenic cell models. Nature Communications. 2018;9. 599 9. Helleday T, Eshtad S, Nik-Zainal S. Mechanisms underlying mutational signatures in 600 human cancers. Nature Reviews Genetics. 2014;15:585–98. 601 10. Russo M, Crisafulli G, Sogari A, Reilly NM, Arena S, Lamba S, et al. Adaptive mutability 602 of colorectal cancers in response to targeted therapies. Science. 2019;366:1473–80. 603 11. Wang S, Jia M, He Z, Liu X-S. APOBEC3B and APOBEC mutational signature as potential 604 predictive markers for immunotherapy response in non-small cell lung cancer. Oncogene. 605 2018;1. 18

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

606 12. Turajlic S, Sottoriva A, Graham T, Swanton C. Resolving genetic heterogeneity in cancer. 607 Nature Reviews Genetics. 20:404–16. 608 13. Roberts SA, Lawrence MS, Klimczak LJ, Grimm SA, Fargo D, Stojanov P, et al. An 609 APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nature 610 Genetics. Nature Publishing Group; 2013;45:970–6. 611 14. Alexandrov LB, Ju YS, Haase K, Van Loo P, Martincorena I, Nik-Zainal S, et al. 612 Mutational signatures associated with tobacco smoking in human cancer. Science. 613 2016;354:618–22. 614 15. Dietlein F, Weghorn D, Taylor-Weiner A, Richters A, Reardon B, Liu D, et al. 615 Identification of cancer driver genes based on nucleotide context. Nat Genet. 2020;52:208– 616 18. 617 16. Baez-Ortega A, Gori K. Computational approaches for discovery of mutational signatures in 618 cancer. Briefings in Bioinformatics. 2017;20:77–88. 619 17. Letouzé E, Shinde J, Renault V, Couchy G, Blanc J-F, Tubacher E, et al. Mutational 620 signatures reveal the dynamic interplay of risk factors and cellular processes during liver 621 tumorigenesis. Nature Communications. 2017;8:1315. 622 18. Temko D, Tomlinson IPM, Severini S, Schuster-Böckler B, Graham TA. The effects of 623 mutational processes and selection on driver mutations across cancer types. Nature 624 Communications. Nature Publishing Group; 2018;9:1857. 625 19. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer 626 analysis of whole genomes. Nature. 2020;578:82–93. 627 20. Middlebrooks CD, Banday AR, Matsuda K, Udquim K-I, Onabajo OO, Paquin A, et al. 628 Association of germline variants in the APOBEC3 region with cancer risk and enrichment 629 with APOBEC-signature mutations in tumors. Nature Genetics. 2016;48:1330–8. 630 21. Kim J, Mouw KW, Polak P, Braunstein LZ, Kamburov A, Tiao G, et al. Somatic ERCC2 631 mutations are associated with a distinct genomic signature in urothelial tumors. Nature 632 Genetics. 2016;48:600–6. 633 22. Thorsson V, Gibbs DL, Brown SD, Wolf D, Bortone DS, Yang T-HO, et al. The Immune 634 Landscape of Cancer. Immunity. 2018;48:812-830.e14. 635 23. Raynaud F, Mina M, Tavernari D, Ciriello G. Pan-cancer inference of intra-tumor 636 heterogeneity reveals associations with different forms of genomic instability. Kool M, 637 editor. PLoS Genet. 2018;14:e1007669. 638 24. Delaneau O, Zagury J-F, Marchini J. Improved whole- phasing for disease and 639 population genetic studies. Nature Methods. 2013;10:5–6. 640 25. Howie BN, Donnelly P, Marchini J. A Flexible and Accurate Genotype Imputation Method 641 for the Next Generation of Genome-Wide Association Studies. Schork NJ, editor. PLoS 642 Genet. 2009;5:e1000529. 643 26. Meyers RM, Bryan JG, McFarland JM, Weir BA, Sizemore AE, Xu H, et al. Computational 644 correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens 645 in cancer cells. Nature Genetics. Nature Publishing Group; 2017;49:1779–84. 646 27. Shiraishi Y, Tremmel G, Miyano S, Stephens M. A Simple Model-Based Approach to 647 Inferring and Visualizing Cancer Mutation Signatures. Marchini J, editor. PLOS Genetics. 648 2015;11:1005657. 19

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

649 28. Rosenthal R, McGranahan N, Herrero J, Taylor BS, Swanton C. deconstructSigs: 650 delineating mutational processes in single tumors distinguishes DNA repair deficiencies and 651 patterns of carcinoma evolution. Genome Biology. 2016;17. 652 29. Pflueger CE, Wang S. A Robust Test for Weak Instruments in Stata. The Stata Journal. 653 2015;15:216–25. 654 30. Cui C, Xu C, Yang W, Chi Z, Sheng X, Si L, et al. Ratio of the interferon-γ signature to the 655 immunosuppression signature predicts anti-PD-1 therapy response in melanoma. npj Genom 656 Med. 2021;6:7. 657 31. Kim ST, Cristescu R, Bass AJ, Kim K-M, Odegaard JI, Kim K, et al. Comprehensive 658 molecular characterization of clinical responses to PD-1 inhibition in metastatic gastric 659 cancer. Nat Med. 2018;24:1449–58. 660 32. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The 661 Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics. 2013;45:1113–20. 662 33. Ghandi M, Huang FW, Jané-Valbuena J, Kryukov GV, Lo CC, McDonald ER, et al. 663 Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature. 664 2019;569:503–8. 665 34. Liu K, Guo J, Liu K, Fan P, Zeng Y, Xu C, et al. Integrative analysis reveals distinct 666 subtypes with therapeutic implications in KRAS-mutant lung adenocarcinoma. 667 EBioMedicine. 2018;36:196–208. 668 35. Chalmers ZR, Connelly CF, Fabrizio D, Gay L, Ali SM, Ennis R, et al. Analysis of 100,000 669 human cancer genomes reveals the landscape of tumor mutational burden. Genome 670 Medicine. 2017;9:34. 671 36. Martínez-Jiménez F, Muiños F, Sentís I, Deu-Pons J, Reyes-Salazar I, Arnedo-Pac C, et al. 672 A compendium of mutational cancer driver genes. Nat Rev Cancer. 2020;20:555–72. 673 37. Huang K, Mashl RJ, Wu Y, Ritter DI, Wang J, Oh C, et al. Pathogenic Germline Variants in 674 10,389 Adult Cancers. Cell. 2018;173:355-370.e14. 675 38. Walens A, Lin J, Damrauer JS, McKinney B, Lupo R, Newcomb R, et al. Adaptation and 676 selection shape clonal evolution of tumors during residual disease and recurrence. Nat 677 Commun. 2020;11:5017. 678 39. Hsiehchen D, Hsieh A, Samstein RM, Lu T, Beg MS, Gerber DE, et al. DNA Repair Gene 679 Mutations as Predictors of Immune Checkpoint Inhibitor Response beyond Tumor Mutation 680 Burden. Cell Reports Medicine. 2020;1:100034. 681 40. Driscoll CB. APOBEC3B-mediated corruption of the tumor cell immunopeptidome induces 682 heteroclitic neoepitopes for cancer immunotherapy. Nature communications. 2020;11:790. 683 41. Zheng Z, Huang D, Wang J, Zhao K, Zhou Y, Guo Z, et al. QTLbase: an integrative 684 resource for quantitative trait loci across multiple human molecular phenotypes. Nucleic 685 Acids Research. 2020;48:D983–91. 686 42. Mazzio EA, Boukli N, Rivera N, Soliman KFA. Pericellular pH homeostasis is a primary 687 function of the Warburg effect: inversion of metabolic systems to control lactate steady state 688 in tumor cells. Cancer Sci. 2012;103:422–32. 689 43. Polak P, Kim J, Braunstein LZ, Karlic R, Haradhavala NJ, Tiao G, et al. A mutational 690 signature reveals alterations underlying deficient homologous recombination repair in breast 691 cancer. Nature Genetics. 2017;49:1476–86. 20

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

692 44. Grolleman JE, Díaz-Gay M, Franch-Expósito S, Castellví-Bel S, de Voer RM. Somatic 693 mutational signatures in polyposis and colorectal cancer. Molecular Aspects of Medicine. 694 2019;69:62–72. 695 45. Ramroop JR, Gerber MM, Toland AE. Germline Variants Impact Somatic Events during 696 Tumorigenesis. Trends in Genetics. 2019;35:515–26. 697 46. Ramón y Cajal S, Sesé M, Capdevila C, Aasen T, De Mattos-Arruda L, Diaz-Cano SJ, et al. 698 Clinical implications of intratumor heterogeneity: challenges and opportunities. J Mol Med 699 (Berl). 2020;98:161–77. 700 47. Berthenet K, Boudesco C, Collura A, Svrcek M, Richaud S, Hammann A, et al. 701 Extracellular HSP110 skews macrophage polarization in colorectal cancer. 702 Oncoimmunology. 2016;5:e1170264. 703 48. Liu X, Yan X, Zhong B, Nurieva RI, Wang A, Wang X, et al. Bcl6 expression specifies the 704 T follicular helper cell program in vivo. J Exp Med. 2012;209:1841–52. 705 49. Lim AR, Rathmell WK, Rathmell JC. The tumor microenvironment as a metabolic barrier to 706 effector T cells and immunotherapy. Kroemer G, Settleman J, editors. eLife. eLife Sciences 707 Publications, Ltd; 2020;9:e55185. 708 50. Oji Y, Tatsumi N, Fukuda M, Nakatsuka S-I, Aoyagi S, Hirata E, et al. The translation 709 elongation factor eEF2 is a novel tumor-associated antigen overexpressed in various types 710 of cancers. International Journal of Oncology. 2014;44:1461–9. 711 51. Smart S, Vasileiadi E, Wang X, DeRyckere D, Graham D. The Emerging Role of TYRO3 712 as a Therapeutic Target in Cancer. Cancers. 2018;10:474. 713 52. Bianchi ME, Mezzapelle R. The Chemokine Receptor CXCR4 in Cell Proliferation and 714 Tissue Regeneration. Front Immunol. 2020;11:2109. 715 53. Li Z, Wang Y, Shen Y, Qian C, Oupicky D, Sun M. Targeting pulmonary tumor 716 microenvironment with CXCR4-inhibiting nanocomplex to enhance anti–PD-L1 717 immunotherapy. Sci Adv. 2020;6:eaaz9240. 718 54. Scala S. Molecular Pathways: Targeting the CXCR4–CXCL12 Axis—Untapped Potential in 719 the Tumor Microenvironment. Clin Cancer Res. 2015;21:4278–85. 720 55. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, et al. Patterns of 721 somatic mutation in human cancer genomes. Nature. 2007;446:153–8. 722 56. Rozhok AI, DeGregori J. Toward an evolutionary model of cancer: Considering the 723 mechanisms that govern the fate of somatic mutations. PNAS. National Academy of 724 Sciences; 2015;112:8914–21. 725 57. Kent DG, Green AR. Order Matters: The Order of Somatic Mutations Influences Cancer 726 Evolution. Cold Spring Harb Perspect Med. Cold Spring Harbor Laboratory Press; 727 2017;7:a027060. 728 58. Marusyk A, Casás-Selves M, Henry CJ, Zaberezhnyy V, Klawitter J, Christians U, et al. 729 Irradiation Alters Selection for Oncogenic Mutations in Hematopoietic Progenitors. Cancer 730 Res. American Association for Cancer Research; 2009;69:7262–9. 731 59. Cheng DT, Mitchell TN, Zehir A, Shah RH, Benayed R, Syed A, et al. Memorial Sloan 732 Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT). 733 The Journal of Molecular Diagnostics. 2015;17:251–64.

21

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

734 60. Udar N, Lofton-Day C, Dong J, Vavrek D, Jung AS, Meier K, et al. Clinical validation of 735 the next-generation sequencing-based Extended RAS Panel assay using metastatic 736 colorectal cancer patient samples from the phase 3 PRIME study. J Cancer Res Clin Oncol. 737 2018;144:2001–10.

738

22

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

739 Figure legends 740 Fig. 1 The schematic view of this study. In this study, we introduced the quantitative trait of 741 mutational propensity (MP), which is tethered to the relative activity between a given signature 742 and a reference signature. Then, we described a regression model integrating evidences from both 743 germline and somatic levels to identify the genetic determinants of somatic selection of mutational 744 processes, and also performed mpQTL analysis for MP-associated loci. In addition, we inferred 745 the causal biological mechanisms for the candidate determinants of the mutational processes via 746 cancer gene expression and tumor immune microenvironment (TIME). 747 Fig. 2 The mutational propensities in thirteen cancer types. A) The conserved MSs based 748 on 5-nucleotide context are correlated with COMISC single base substitution (SBS) signatures. B) 749 The quantile-quantile plot shows the abnormal distribution of the MSs at pan-cancer level. C) The 750 quantile-quantile plot shows the quasi-normal distribution of the MPs at pan-cancer level. D) The 751 MPs of each individual in different TCGA cancer types. Each bar represents an individual. The 752 cancer types are ranked from left to right in the order of increased median TMB. In each cancer 753 type, the individuals are ranked from left to the right in the order of decreased activities of the 754 MS7. Different colors represent different MPs. E) The MPs and the MSs activities show 755 consistent effects (spearman correlation) on TMB. The effect sizes are calculated as Spearman’s 756 rho between MP or MS and TMB. F) The MPs and the MSs activities show consistent effects 757 (Spearman correlation) on ITH. The effect sizes are calculated as Spearman’s rho between MP or 758 MS and TMB. G) The MPs are highly consistent between TCGA cancer tissues and CCLE cancer 759 cell lines representing the same cancer types. 760 Fig. 3. Genes’ genetic statuses influence the mutational propensities. A) The fraction of 761 variance (adjusted R2) of 6 MPs explained by different statuses of genes (FDR < 0.1), including 762 three types of germline variations (missense, truncated, structural), four types of somatic 763 alterations (somatic nsy-mutation, SCNA, fusion) and epigenetic stats (TSS-methylation). The 764 benchmark genes with different colors are the known cancer driver genes of which germline or 765 somatic mutations are associated with the certain MSs. B) The Upset plot shows the overlaps 766 among the gene sets, each of which is significantly associated with the MPs by a distinct genetic 767 status. The black bar shows the total 2,314 unique genes which are considered as candidate genes 768 of which germline genetic burden and somatic or epigenetic status are simultaneously 769 significantly associated with the same MP (FDR < 0.1). C) The number of significant genes of 770 which genetic statuses impact the MPs at pan-cancer level or cancer-specific level. D) The number 771 of significant genes of which genetic statuses associate with one or multiple MPs.

772 Fig. 4 The E-genes and I-genes influence the mutational propensities. A) The Upset plot 773 shows the overlap between the E-genes, I-genes, COSMIC CGCs, and another three published 774 benchmark cancer gene sets. B) The E-genes and I-genes are enriched for determinants of 775 different MPs at the pan-cancer level based on the background of 2,314 candidate genes. The dot 776 color represents the -log10(FDR). The dot size represents the odds ratio of enrichment based on

23

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

777 the background of 2,314 candidate genes. C) The pathways enriched in E-genes and I-genes. The 778 dot color represents the -log10(FDR). The dot size represents the odds ratio of enrichment. D) The 779 I-genes impact on the MPs through different immune cell activities in TIME.

780 Fig. 5 The carcinogenesis and cancer therapy of E-genes and I-genes. A) The genetic 781 determinants of the MPs enrich for genes to which cancer cells manifest strong 782 genetic-dependency. B) Comparison fractions of genes significantly associated with drug IC50 in 783 cancer cell lines among different gene sets. C) The E-genes of different MPs are predictive of 784 drug IC50 targeting at different pathways. The dot size represents the fraction of drugs in each 785 category which are predicted. Different colors represent the average absolute effect sizes. D) The 786 fractions of genes significantly associated with anti-PD-1 therapy response in melanoma among 787 different gene sets. E) The fractions of genes significantly associated with anti-PD-1 therapy 788 response in metastatic gastric cancer among different gene sets.

789 Fig. 6 The quantitative loci of the mutational propensities. A) The genomic distribution of 790 the SNP loci. Each box represents a chromosome and each dot a SNP . The different colors 791 represent the different MPs significantly associated. The colored dots represent the mpQTLs 792 reached genome-wide significance (p < 5.0×10-8) and the rest of SNPs are labeled in gray. B-C) 793 The associations among the rs6060924 genotypes with the MP2 and the COX4I2 expression. D-E) 794 The associations among the rs35413356 genotypes with the MP2 and the ALDH5A1 expression. 795 The P values are based on student t-test: *P < 0.05, **P < 0.01, ***P < 1.0×10-3, **** P < 796 1.0×10-4 and ***** P < 1.0×10-5.

24

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

Fig. 1 Gene level Mutational Germline signature(MS) Common SNPs Data Germline Somatic Epigenetic Genotypes PreAnalysis Missense Nsy-mutations TSS Mutational Truncation SCNAs methylation propensity Genotype levels Structural Fusions MPk=ln(MSk/MS7) imputation

Association 1) Integrated regression: 2) mpQTL analysis: analysis MP~Germline+Somatic+Epigenetic MP~Genotype

2,314 candidate genes 127 mpQTLs

Causality Hypothesis 1 Hypothesis 2 Hypothesis inference by 485 E-genes 1,427 I-genes 13 mpQTLs instrumental variable regression Gene Fraction of cis gene MPs MPs MPs expression immune cells expression Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited. Fig. 2

A B C C>A C>G C>T T>A T>C T>G 5.0 0.02 0.01 MS1 1.0 MPi=ln(MSi/MS7) 0.00 MP1 0.02 2.5 0.01 0 MS2 MP2 0.5 0.005 MS3 MP3 0.0 0.000 0.010 0.005 MS4 MP4 -2.5 0 0.0 MS activities MP activities Activities 0.05 MS5 MP5 0.00 -5.0 0.10 0.05 MS6 MP6 0 -0.5 0.002 -7.5 0.001 MS7 0 -4 -2 0 2 4 -4 0-2 2 4 Five nucleotide Sequence Motifs Theoretical Quantiles Theoretical Quantiles D THCA PRAD BRCA KIRC ESCC GBM OV UCEC EAC LIHC STAD COAD LUAD TMB 10 TMB levels ultralow low middle high

0

-10

-20 5.0

2.5

0.0 Mutational propensity activities -2.5

-5.0

E F G 0.75 0.4 rho = 0.822 rho = 0.795 r = 0.955 0 0.50 p < 1×10-16 0.2 p < 1×10-16 p = 1×10-6 Cancer types -1 THCA ESCC STAD 0.25 0.0 PRAD OV COAD -2 0.00 BRCA UCEC LUAD -0.2 KIRC EAC -0.25 -3

MP effect sizes of ITH sizes MP effect GBM LIHC

MP effect sizes of TMB sizes MP effect -0.4 -0.25 0.00 0.25 0.50 -0.3 -0.2 -0.1 0.0 0.1 0.2 -4 -3 -2 0-1 MSDownloaded effect fromsizes cancerres.aacrjournals.org of TMB on September 23,MS 2021. effect © 2021 sizes American of Association ITH for Cancer Mean MP activities in TCGA Mean MP activities in CCLE Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

Fig. 3

BRCA2(panCancer) A 1000 BRCA2(BRCA) 500 MP1

2000 APOBEC1/2/3B/3H(panCancer) 1000 Germline missenses 500 APOBEC3H(LUAD) MP2 Germline truncation

MSH6(BRCA) MSH6(panCancer) Germline structural PMS2(UCEC) MSH2(COAD) Somatic nsy-mutation MSH6(COAD) MLH1(panCancer) SCNAs MP3 500 MSH2/6(UCEC) Gene count TSS methylation Fusion 500 MP4

POLE3(panCancer) POLE2(panCancer) POLE(panCancer) POLE(UCEC) 500 POLE(COAD) MP5

POLD1(panCancer) 1000 POLE(COAD) POLE(UCEC) 500 POLE3(panCancer) MP6

0.0 0.1 0.2 0.3 0.4 0.5 adjusted R2 5000 4687 B C D 3000 2630 2000 1000 1500 2000 1000 500

Gene count 500 1226 0 0 1000 828 783 panCancer Cancer 1 2 3 4 Intersection gene count 728 469 278 specific MP count 362 191 179 111 142 55 82 66 0 Germline missenses 3304 Germline truncation 737 Germline structural 107 12836 Somatic nsy-mutation SCNAs 11721 Fusion 48 TSS methylation 9525 0 5000 10000 Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Gene count Fig. 4Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited. 996 A E−genesI−genes MP1 Ratio of 692 B -log10(P) enrichment 600 MP2 10.0 MP3 >7.5 7.5 MP4 5 310 300 2.5 5.0 223 MP5 176 177 121 MP6 2.5 82 108 Intersection gene count 40 26 23 35 0 Gene set size Candidate genes E-genes 485 2314 I-genes 1427 CGCs 723 Martínez-Jiménez F et al.2020 568 Dietlein F et al.2020 460 Huang K et al.2018 152 -log10(FDR) C >7 E−genes 6 5 I−genes 4 3

Disease Ratio of Cell Cycle enrichment 0.5 snRNP Assembly Hallmark epithelial MetabolismMetabolism of RNA of lipids 0.4

Innate Immune System Go biological adhesion Diseases of metabolism Phospholipid metabolismmesenchymal transition 0.3 Signaling by Rho GTPases 0.2 Signaling by Nuclear Receptors Extracellular matrix organization RNA Polymerase III Transcription RNA Polymerase III TranscriptionRNA Polymerase II Transcription InitiationMolybdenum From Type cofactor 3 Promoter biosynthesis 0.1 ost−translational protein modification D P Cytokine Signaling in Immune system

T follicular MP5 helper cells (316) Somatic (722) Nsy-mutation MP4 (733) (171) M1 macrophages (415) MP3 (263) CD8 T cells SCNA (301) (413) Resting mast cells (212) MP1 (187) Memory B cells(132)

Others(15) Others MP6 (235) (260) TSS methylation Neutrophils MP2 (453)Downloaded from cancerres.aacrjournals.org on September(442) 23, 2021. © 2021 American Association for Cancer (371) Research. Fig. 5 A * E-genes Martínez-Jiménez F et al.2020 7.5 I-genes Dietlein F et al.2020 CGCs Huang K et al.2018

ichment 5.0 r * * old en F 2.5 * * * * * * * * * *

0.0 Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited. -2.2~-2 -2~-1.8 -1.8~-1.6 -1.6~-1.4 -1.4~-1.2 -1.2~-1 -1~-0.8 -0.8~-0.6 -0.6~-0.4 -0.4~-0.2 -0.2~0 0~0.2 0.2~0.4 0.4~0.6 Median CERES scores B D 10.0% E Genes associated Genes associated 4% Genes associated with 70.0% with drug IC50 in with anti-PD-1 therapy anti-PD-1 therapy response all cell lines of CCLE response in melanoma in metastatic gastric cancer 9.8% ( ( Cui C et al.2021) 3% Kim ST et al.2018) 67.5% 9.6% 2% action of genes action of genes action of genes Fr 65.0% 9.4% Fr Fr 1%

9.2% 63.0% 0 E−genesMartínez-Jiménez (pan-Cancer)Huang CGCsK et alF .2018etDietlein al.2020I−genes F et al.2020 (pan-Cancer) I−genesDietlein (pan-Cancer)E−genes F et alCGCs.2020 (pan-Cancer)Martínez-JiménezHuang K et F al .2018 I−genesDietlein (pan-Cancer)Martínez-Jiménez F et alCGCs E−genes F etHuang al (pan-Cancer).2020 K et al

.2020 .2018 et al.2020

C detai MP1 Fraction of drugs associated MP2 with E-genes' expression coss in targeted pathways MP3

a a

s

ene MP4 0.2 0.4 0.6 0.8

g− MP5 Average absolute effect size of E MP6 E-genes on drugs of Hormone-reChromatin lated(2)EGFR histone signaling(5)JNK methylation(3) andApoptosis p38 signaling(7)Chromatin regulation(12)WNT histone signaling(7)Other(44) acetylation(15)p53 pathwMetabolism(9)a ERK MAPKsignaling(15)GenomeCytoskeleton(8) integr Cell cycle(17)Kinases(42)Mitosis(15)PI3K/MTORRTK signaling(28)signaling(36)Protein stabilityChromatin DNAand other(5) degradation(8)replication(10) targeted pathways Drug targeted 0.2000.2250.2500.275 pathways y(4) (drug number) ity(8)

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research. Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

Fig. 6 1 A 22 9 B C 21 4 ** 20 8 ** 10 ***** 2 *** *** 19 7

18 6 0 -log10(pvalue) MP2 17 5 5 3

16 -4 P < 5×10-8 expression COX4I2 MP1 MP4 A/A G/GA/G 0 A/A G/GA/G

15

MP2 MP5 4 rs6060924 rs6060924

14 MP3 MP6 D E 4 P < 1×10-5 * * ** 12 ** 13

5 0 MP2

12 9

6 11 -4 6 -/- G/G-/G ALDH5A1 expression -/- G/G-/G 10 7 rs35413356 rs35413356 Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer 9 Research.8 Author Manuscript Published OnlineFirst on July 2, 2021; DOI: 10.1158/0008-5472.CAN-21-0086 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

Genetic Determinants of Somatic Selection of Mutational Processes in 3,566 Human Cancers

Jintao Guo, Ying Zhou, Chaoqun Xu, et al.

Cancer Res Published OnlineFirst July 2, 2021.

Updated version Access the most recent version of this article at: doi:10.1158/0008-5472.CAN-21-0086

Supplementary Access the most recent supplemental material at: Material http://cancerres.aacrjournals.org/content/suppl/2021/07/03/0008-5472.CAN-21-0086.DC1

Author Author manuscripts have been peer reviewed and accepted for publication but have not yet Manuscript been edited.

E-mail alerts Sign up to receive free email-alerts related to this article or journal.

Reprints and To order reprints of this article or to subscribe to the journal, contact the AACR Publications Subscriptions Department at [email protected].

Permissions To request permission to re-use all or part of this article, use this link http://cancerres.aacrjournals.org/content/early/2021/07/02/0008-5472.CAN-21-0086. Click on "Request Permissions" which will take you to the Copyright Clearance Center's (CCC) Rightslink site.

Downloaded from cancerres.aacrjournals.org on September 23, 2021. © 2021 American Association for Cancer Research.