<<

bioRxiv preprint doi: https://doi.org/10.1101/360065; this version posted July 3, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Mutation Vulnerability Characterizes Human Cancer

Yong Fuga Li 1,*, Fuxiao Xin 2

1. Department of Bioinformatics, Illumina Inc., San Diego, USA;

2. Intuit Inc., San Diego, CA;

* Correspondence: [email protected].

Abstract

Recent studies by Tomasetti et al. revealed that the risk disparity among different types of cancer is mainly determined by inherent patterns in DNA replication errors rather than environmental factors. In this study we reveal that inherent patterns of DNA mutations play a similar role in cancer at the molecular level. Cancer results from stochastic DNA mutations, yet non-random patterns of cancer mutations emerge when we look across hundreds of cancer genomes. Over 500 cancer genes have been identified to date as the hot spot genes of cancer mutations. It is generally believed that these genes are mutated more frequently because they reside in functionally important pathways and are hence selected during the somatic evolution process of tumor progression. This theory, however, does not explain why many genes in the same pathways as cancer genes are not mutated in cancer. In this study, we challenge this view by showing that the inherent patterns of spontaneous mutations of human genes not only distinguish cancer-causing genes from non-cancer genes but also shape the mutation profile of cancer genes at the sub-gene level.

Introduction

Cancer is a group of complex diseases marked by abnormal proliferation of cells as well as somatic mutations of genes. A person has a 40.37% lifetime risk of diagnosis and a 20.84% lifetime risk of dying from cancer (cancer.gov, September 24, 2014). Despite extensive studies, cancer remains a major challenge in medicine. The molecular mechanisms of cancer initiation and progression are not yet fully understood, and for most cancer types, no effective therapy is available [1].

Cancer-causing genes, or cancer genes for short, refer to the class of genes that exhibit causal mutations, either somatic or germline, in cancers [2]. Continued efforts have been devoted to the discovery and study of these cancer genes, with around 15-30 novel ones discovered each year (see supplementary material and Fig. 1). Around 550 cancer genes have been identified as of 2016 [2]. These cancer genes, comprising 2.6% of the human proteome, span a very wide range of cellular functions and molecular pathways [3], including cell proliferation, growth suppression, DNA repair, apoptosis, cellular senescence, angiogenesis, metastasis, immune response, and energy metabolism. Cancer genes are recognized as the key to understanding the biology of cancer, while cancer gene mutations in individual tumors hold the key to cancer precision medicine.

Despite our successes in discovering cancer genes, the fundamental distinctions between cancer genes and non-cancer genes remain undefined. Studies of the cancer genes repeatedly identify a handful of key biological pathways [3], yet many genes in the same pathways lack recurrent mutations despite being functionally related to cancer [1, 3, 4]. Relatedly, many cancer genes mutate only in specific types of cancer, or show drastically different mutation frequencies in different types of cancer. These observations hint that protein function is not the only determinant of cancer genes. Recent studies by Tomasetti et al. revealed that the difference in prevalence among different types of cancer is mainly determined by inherent patterns in DNA replication errors rather than environmental factors [5, 6]. In this study we reveal that inherent patterns of DNA mutations play a similar role in cancer at the molecular level, on cancer genes.
Cancer has been recognized as a process of cellular evolution inside the human body [7, 8]. In light of molecular evolution theory, we propose a gene's mutation vulnerability as an independent factor which, together with the gene's specific functions, distinguishes cancer genes from other genes. Specifically, different genes have different spontaneous nucleotide mutation rates, different vulnerability to amino acid changes upon nucleotide mutations, and different vulnerability to functional changes upon amino acid mutations. The vulnerability of a cancer-related gene to functional impairment sets the stage for somatic selection pressure to play its role.

Based on this theory, we hypothesize that cancer genes are more vulnerable to mutations than other genes. To test the hypothesis, we quantify the mutation vulnerability index (MVI) of a gene based on the coding DNA sequence (CDS), the DNA mutation rate and spectrum, and the genetic code table. We discovered that gene MVI differs significantly between cancer genes and non-cancer genes, and also among different types of cancer genes. Further, we show that the MVI of a cancer gene is predictive of its overall mutation frequency in cancers. Finally, within individual cancer genes, we show that codon-level mutation vulnerability is predictive of the observed codon mutation frequency.

Materials and Methods

Mutation vulnerability index

We use a basic probabilistic model to describe the mutation of genes during somatic evolution. Let $CDS = N_1 N_2 \ldots N_j \ldots N_L$ be a coding sequence of length $L$. We model the probability of nucleotide $N_j$ mutating to $N'_j$ after one round of DNA replication as

$$P(N'_j \mid CDS) = \theta_{N_{j-1}(N_j \to N'_j)N_{j+1}},$$

where $\theta_{N_{j-1}(N_j \to N'_j)N_{j+1}}$ is the neighbor-dependent single nucleotide mutation rate [9, 10]. Notice that

$$\sum_{N'_j \in \mathcal{N} = \{A,C,G,T\}} \theta_{N_{j-1}(N_j \to N'_j)N_{j+1}} = 1$$

for any given $N_j$. In this study, we only model single nucleotide mutations and ignore in-dels and more complex mutation types. Further, we assume the nucleotides mutate independently conditioned on the current CDS, i.e.

$$P(CDS' = N'_1 N'_2 \ldots N'_j \ldots N'_L \mid CDS) = \prod_{j=1}^{L} P(N'_j \mid CDS).$$

As a result, the probability of two point mutations happening on a single codon or within a single gene is negligible.

We then define the mutation vulnerability index $mvi_x$ for a codon $x$ of tri-nucleotide $N_j N_{j+1} N_{j+2}$ as the expected number of nonsynonymous substitutions on this codon after one round of DNA replication:

$$mvi_x = \sum_{x' \in \mathcal{N}^3} I(x'; x) \cdot P(x' \mid CDS) \qquad (1)$$

where $I(x'; x)$ is the indicator function that takes the value 1 if and only if the two codons $x$ and $x'$ are nonsynonymous, while $P(x' \mid CDS)$ is calculated as $P(N'_j \mid CDS) \cdot P(N'_{j+1} \mid CDS) \cdot P(N'_{j+2} \mid CDS)$. Notice that $I(x'; x)$ measures the functional consequence of mutations, while $P(x' \mid CDS)$ measures the mutational biases. In this study, we use a simple indicator function $I(x'; x)$ that treats all nonsynonymous mutations as equally non-neutral regardless of the amino acid types and locations in the protein, and similarly treats all synonymous mutations as neutral. The MVI for a gene is further defined as the expected number of nonsynonymous substitutions on the CDS after one round of DNA replication, and it is calculated as the sum of the codon mvi values:

$$MVI = \sum_{x \in \text{codons in } CDS} mvi_x. \qquad (2)$$
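The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_theta` is a made-up rate function standing in for the neighbor-dependent rate table (the actual values are in Supplementary Table 1 and are not reproduced here), the flanking nucleotide at each end of the CDS is padded with a placeholder `N`, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def toy_theta(left, ref, alt, right):
    """Hypothetical neighbor-dependent substitution rate (NOT the paper's table)."""
    rate = 1e-8
    # Made-up tenfold elevation for C and G inside a CpG dinucleotide.
    if (ref == "C" and right == "G") or (ref == "G" and left == "C"):
        rate *= 10
    return rate


def site_distribution(left, ref, right, theta):
    """P(N'_j | CDS) for all four outcomes; the probabilities sum to 1."""
    probs = {alt: theta(left, ref, alt, right) for alt in BASES if alt != ref}
    probs[ref] = 1.0 - sum(probs.values())  # probability of no change
    return probs


def codon_mvi(cds, start, theta, pad="N"):
    """Equation 1: total probability of all codons x' nonsynonymous with x."""
    padded = pad + cds + pad  # assumption: unknown flanks at the CDS ends
    x = cds[start:start + 3]
    dists = [
        site_distribution(padded[start + k], padded[start + k + 1],
                          padded[start + k + 2], theta)
        for k in range(3)
    ]
    total = 0.0
    for alt in product(BASES, repeat=3):
        if CODON_TABLE["".join(alt)] != CODON_TABLE[x]:
            total += dists[0][alt[0]] * dists[1][alt[1]] * dists[2][alt[2]]
    return total


def gene_mvi(cds, theta):
    """Equation 2: sum of codon mvi over all complete codons of the CDS."""
    usable = len(cds) - len(cds) % 3
    return sum(codon_mvi(cds, s, theta) for s in range(0, usable, 3))


if __name__ == "__main__":
    print(gene_mvi("ATGACGCTGTAA", toy_theta))
```

With a uniform rate r per substitution, a codon such as ATG, whose nine single-nucleotide neighbors are all nonsynonymous, gets an mvi of about 9r, while CTG, with four synonymous neighbors, gets about 5r.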

Predictive models and codon importance

Three types of machine learning methods (logistic regression, random forest, and neural network) are evaluated with 10-fold cross-validation to determine their performance in predicting cancer genes. Random forest and neural network are evaluated here in order to detect the potential contributions of codon interactions to cancer gene status. Specifically, the data were randomly divided into 10 folds, and 10 separate models were each trained with 90% of the data and tested on the remaining 10%. For training of random forest models, the same number of non-cancer genes were randomly sampled to match the number of cancer genes in the training set, and 5000 trees were grown. For each machine learning method, five sets of models were built using different feature variables: Len, only the CDS length is used in building the models; mvi, only the precomputed average codon mvi for each gene is used; Len + mvi, both CDS length and average codon mvi are used; codon, the 64 codon frequencies for a gene are used; Len + codon, both CDS length and the 64 codon frequencies (normalized by length) are used. Two measures were used to quantify the importance of each type of codon (feature) in predicting cancer genes: 1) the coefficients of the codon frequency variables in the generalized linear model with logit link function; 2) the mean decrease of classification accuracy associated with each codon in the random forest model.
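As a rough illustration of the Len + codon feature set and the 10-fold split described above, the following Python sketch builds the 64 length-normalized codon frequencies for one CDS and partitions gene indices into folds. The function names, the choice to measure length in codons, and the round-robin fold assignment are assumptions made for this example, not details taken from the study.

```python
import random
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # 64 codons, fixed order


def codon_features(cds):
    """Return (length in codons, 64 codon frequencies normalized by length)."""
    n_codons = len(cds) // 3
    if n_codons == 0:
        raise ValueError("CDS must contain at least one complete codon")
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, 3 * n_codons, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases
            counts[codon] += 1
    return n_codons, [counts[c] / n_codons for c in CODONS]


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


if __name__ == "__main__":
    length, freqs = codon_features("ATGATGTAA")
    print(length, freqs[CODONS.index("ATG")])
    for test_fold in k_fold_indices(25):
        # Each fold serves once as the test set; the other nine folds train the model.
        print(sorted(test_fold))
```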

Mutation rate and spectrum

Due to the chemical properties of DNA as well as the cell's DNA repair machinery, a nucleotide's mutation rate and spectrum vary depending on the identity of the nucleotide, the neighboring nucleotides, and the local genomic context [10–12]. For cancer genes, both germline and somatic mutations play important roles. The somatic mutation spectrum can differ from the germline spectrum depending on the carcinogen that drives the mutations [13, 14], while the overall mutation rate in detectable cancer tissues is generally higher than in non-cancer cells [15]. However, the somatic mutation rate and spectrum vary depending on the carcinogen, cancer type, and cancer stage, so it is hard to obtain a universal somatic mutation spectrum to model all cancer genes together. Here we use the germline mutation rate and spectrum to model all cancer gene mutations. We believe it is a good model for the germline cancer gene mutations, as well as for the somatic cancer gene mutations at the dormant or initiation stage of cancer. The germline mutation rate has been estimated from mutations of pseudogenes to be around $2.5 \times 10^{-8}$ [16]. The neighbor-dependent mutation rates are then obtained by adjusting the neighbor-dependent mutation spectrum [10] with this mutation rate. The full mutation rate table is available in Supplementary Table 1.

Gene Sequence and Mutation Data

The CDS sequences and cancer gene mutations are obtained from the Catalogue Of Somatic Mutations In Cancer (COSMIC: http://cancer.sanger.ac.uk/cosmic) [2]. For each gene, only one representative CDS, generally the longest CDS for the gene, is selected in COSMIC to avoid redundancy. This also helps to avoid overfitting during machine learning. Such a strategy is not without limitations. For example, some cancer genes have multiple splicing isoforms with distinct functions in cancer, such as the p16INK4a and p14ARF proteins of the CDKN2A gene [17]. Using all 28412 CDS sequences in COSMIC or restricting to one CDS per gene (19123 in total) gives nearly identical results. Throughout this study, only mutations that fall within the coding sequences are considered.

Results

Human cancer genes show higher mutation vulnerability

There are two prerequisites for a gene to be a cancer gene. First, the gene is mutated in cancer cells. Second, specific mutations of the gene confer on the host cells phenotypic changes that allow cancer initiation and progression. To study cancer genes with an emphasis on the mutational aspects in addition to gene functions, we define the mutation vulnerability index (MVI) of a gene as the expected number of nonsynonymous mutations on the gene upon one cycle of DNA replication. We calculated the MVIs of all protein coding genes based on the coding sequences (CDS) and the neighbor-dependent nucleotide mutation rates estimated on human pseudogenes [10]. We then systematically studied the association between a gene's mutation vulnerability and its cancer gene status.

We observed a global increase in mutation vulnerability of known cancer genes compared to other genes (Fig. 3A). The average gene-level MVI of cancer genes is 48.8% higher than that of non-cancer genes (p-value $6.9 \times 10^{-32}$, based on a two-sample t-test on the log-transformed MVIs). We dissected the cancer genes based on whether somatic or germline mutations are observed for them (Fig. 3B-D). Cancer genes with only somatic or only germline mutations do not significantly differ in their MVIs, although they have 45% higher MVI than non-cancer genes. On the other hand, cancer genes with both somatic and germline mutations have higher MVIs compared to those with only somatic mutations (p-value 0.03), only germline mutations (p-value 0.16), or neither (p-value $1.0 \times 10^{-8}$). We further dissected the cancer genes based on the molecular genetics of the mutations. Cancer genes with only dominant or only recessive mutations have 34% or 92% higher MVIs compared to non-cancer genes (p-values $4.8 \times 10^{-21}$ and $1.3 \times 10^{-13}$ respectively), while recessive cancer genes have 58% higher MVIs compared to dominant cancer genes (p-value $1.9 \times 10^{-5}$). Cancer genes with both dominant and recessive mutations have 206% higher MVI compared to the non-cancer genes (p-value 0.0004).

Both protein length and codon mutation vulnerability are predictive of cancer gene status

Coding sequence (CDS) length is a major contributor to mutation vulnerability. Let $mvi_0$ and $mvi_1$ be the lower and upper bounds of the mvi for a single codon. The MVI of a gene is then bounded by $mvi_0 \cdot L$ and $mvi_1 \cdot L$, where $L$ is the length of the protein. The neighbor-dependent nucleotide mutation rate is in the range of $1.2 \times 10^{-9}$ to $6.1 \times 10^{-8}$ (Supp. Table 1), and based on this mutation rate, the codon mvi is estimated to be in the range of $2.4 \times 10^{-8}$ to $1.4 \times 10^{-7}$. For human genes, the length of the coding sequence explains 99.81% of the variance of gene MVI, while the average codon mvi only explains 0.29%. We therefore decompose MVI into gene length and average codon mvi to study their respective contributions to the association between MVI and cancer gene status.

Gene CDS length alone is associated with cancer genes. Cancer genes are 644 amino acids long on average, 46% longer than non-cancer genes, which are 441 amino acids long on average (p-value $4.4 \times 10^{-30}$, Wilcoxon rank sum test). Is this association between cancer genes and CDS length biological? Here is a counter argument. COSMIC generally selects the longest RefSeq cDNA sequence as the representative sequence (personal communications). As a result, genes with more known isoforms will tend to have slightly longer CDSs due to the selection bias. This could lead to artificial associations between CDS length and cancer genes, because cancer genes are generally more extensively studied experimentally than non-cancer genes, and may as a result have more isoforms identified, and hence longer longest CDSs. To clarify this, we systematically analyzed gene isoforms based on human RefSeq sequences. Indeed, a cancer gene has 7.01 isoforms on average, compared to 5.03 for non-cancer genes (p-value $4.1 \times 10^{-21}$, Wilcoxon rank sum test). To eliminate the biases associated with isoform counts, we examined the median length of each gene's isoforms rather than the longest, and found that cancer genes are still 45% longer than non-cancer genes (591 AA versus 407 AA, p-value $2.2 \times 10^{-28}$). Even the shortest isoforms of cancer genes are 35% longer than the shortest isoforms of non-cancer genes (454 AA versus 337 AA, p-value $1.0 \times 10^{-18}$), even though we expect the lengths of the shortest isoforms to be biased downward by intense research activity on cancer genes. Statistical evaluation further confirms the conditional dependence of cancer gene status on CDS length after controlling for the number of isoforms (Jonckheere-Terpstra test on ordinal variables, z = 6.21, p-value = $5.2 \times 10^{-10}$). These observations hence provide strong evidence that the association between CDS length and cancer gene status is biologically real.

Although CDS length is the main contributor to a gene's mutation vulnerability, a gene's average codon mvi, calculated as $MVI/L$, remains significantly associated with cancer genes (p-value 0.0005, Wilcoxon rank sum test). The association remains significant after controlling for CDS length (Jonckheere-Terpstra test, z = 4.20, p-value = $2.7 \times 10^{-5}$). With these, we suggest that both CDS length and codon usage are biologically associated with cancer genes, with CDS length being the main contributor to the association.
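To make the length bound concrete, the worked arithmetic below plugs the codon mvi range quoted in this section into the bound on MVI, using the average cancer gene length (644 codons) and the average non-cancer gene length (441 codons) as illustrative values of $L$. Choosing the group averages as example lengths is our own assumption; the rounded figures are for orientation only.

```latex
\begin{align*}
MVI &= \sum_{x \in \text{codons}} mvi_x = L \cdot \overline{mvi},
  \qquad mvi_0 \cdot L \;\le\; MVI \;\le\; mvi_1 \cdot L, \\[4pt]
L = 644:&\quad 644 \times 2.4 \times 10^{-8} \approx 1.5 \times 10^{-5}
  \;\le\; MVI \;\le\; 644 \times 1.4 \times 10^{-7} \approx 9.0 \times 10^{-5}, \\
L = 441:&\quad 441 \times 2.4 \times 10^{-8} \approx 1.1 \times 10^{-5}
  \;\le\; MVI \;\le\; 441 \times 1.4 \times 10^{-7} \approx 6.2 \times 10^{-5}.
\end{align*}
```

Because $MVI = L \cdot \overline{mvi}$ and $\overline{mvi}$ is confined to a range spanning less than one order of magnitude, differences in $L$ across genes account for most of the spread in MVI, which is consistent with the variance decomposition reported above.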

The predictive power of codon usage on cancer genes

The mutation vulnerability index calculated based on Equation 2 has two limitations. First, it relies on the germline nucleotide mutation spectrum, which was derived based on pseudogenes [10]. Second, it utilizes a naive model for amino acid mutation effects in which all nonsynonymous mutations lead to protein function impairment. One way to overcome these limitations is to use machine learning and directly model the relationship between codon usage and cancer gene status. We refer to the learned probability from such a data-driven model as the learned MVI, as opposed to the precomputed MVI from Equation 2.

Using the normalized codon frequencies together with protein length as predictor variables (Len + codon) in logistic regression, we achieved an AUC (area under the Receiver Operating Characteristic curve) of 0.731 (95% confidence interval 0.710-0.751, estimated from 30 repetitions of 10-fold cross-validation). If only the 64 normalized codon frequencies (model codon) are used, the AUC is 0.727 (95% confidence interval 0.707-0.747, Fig. 4 and Supp. Fig. S2). For comparison, precomputed MVI, protein length (Len), and average codon mvi alone achieved AUCs of 0.646, 0.641, and 0.543 respectively, while an AUC of 0.661 is achieved when protein length and average codon mvi are combined through logistic regression (Len + mvi). The superior predictive power of the learned MVI from models Len + codon and codon compared to the precomputed MVI is consistent with the limitations of the germline mutation spectrum and the naive mutation effect model. Similar predictive performances are observed when random forest or artificial neural network are used instead of logistic regression (data not shown), suggesting that nonlinear combinations of the codon frequencies do not contribute to cancer gene status. Notice that the precomputed MVI considers the impact of the two nucleotides neighboring a codon on the mutation rate of the 1st and 3rd nucleotides in the codon, while the machine learning models do not capture such information.
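For readers less familiar with the AUC values quoted above, the sketch below computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen positive (cancer gene) receives a higher score than a randomly chosen negative (non-cancer gene), with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions; this is not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5, a misordered pair 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 1 = cancer gene, 0 = non-cancer gene; scores are model outputs.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))
```

The pairwise form is quadratic in the number of genes, which is fine for illustration; a sort-based computation gives the same value more efficiently on large data sets.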

Learned codon importance recapitulates precomputed codon mutation vulnerability

Based on the machine learning models, we analyzed the importance of each of the 64 codons in predicting cancer genes. Interestingly, we observed that codons of higher importance in predicting cancer genes also have higher precomputed codon mvi, regardless of the machine learning model (generalized linear model or random forest). In other words, the relative importance of the codons recapitulates the codons' mvi. If the mutation vulnerabilities of codons did not impact genes' cancer gene status, we would not expect to observe this relationship.

For the generalized linear model, we measure the importance of each codon by its coefficient in the model fitted either 1) with all codons together or 2) with each codon separately. For the former, we observed a positive Pearson correlation coefficient of 0.26 (p-value 0.021) for the 61 non-stop codons, and a correlation coefficient of 0.12 (p-value 0.18) for all 64 codons (notice that the stop codons are not very informative since each gene has only 1 stop codon). For the latter, we observed a positive correlation of 0.55 (p-value $2.0 \times 10^{-6}$) for the non-stop codons and a correlation coefficient of 0.42 (p-value 0.00032) for all 64 codons (Fig. 4C). Similar results are obtained for random forest with the mean decrease of classification accuracy as the codon importance measure. The five most important codons are TCG, CGA, CGT, CCG, and ACG. Notice that they all contain a CpG dinucleotide, which is known to be associated with high mutation rates due to methylation. Among these five codons, four have significantly higher codon usage in cancer genes compared to non-cancer genes (Fig. 4D), with 21%, 13%, 15%, and 22% higher frequencies in cancer genes for codons TCG, CGA, CGT, and CCG respectively.

It is worth noting that across human CDSs, there is a global negative association between a codon's mutation vulnerability and its usage (Pearson correlation R = -0.37, p-value = 0.0026, or R = -0.43, p-value = 0.00048 after removing the stop codons, Supp. Fig. 1). This is consistent with the mutational bias theory of codon usage bias, which states that codons that are easily mutated at the nucleotide level to other codons will end up having low frequency.
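The CpG observation above is easy to check by enumeration. The short Python sketch below lists every codon that contains a CG dinucleotide within its three positions and confirms that the five codons reported as most important all belong to that set. It only looks at CpGs inside a codon; CpGs spanning a codon boundary are outside the scope of this illustration.

```python
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

# Codons with a CpG dinucleotide at positions 1-2 or 2-3.
cpg_codons = sorted(c for c in CODONS if "CG" in c)

# The five most important codons reported in the text.
top_five = ["TCG", "CGA", "CGT", "CCG", "ACG"]

if __name__ == "__main__":
    print(cpg_codons)
    print(all(c in cpg_codons for c in top_five))
```

Only 8 of the 64 codons contain an internal CpG, so five of them occupying the top five importance ranks is a strong enrichment.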

The variation of MVI among cancer genes is associated with their mutation frequency

While we observed that gene MVI differs between cancer genes and non-cancer genes (Fig. 3-4), we also notice large variation of MVI among cancer genes themselves (Fig. 3). Are the MVI differences between two cancer genes of any biological significance? We believe MVI together with a gene's function determines the probability that the gene is observed mutated in cancer. There are two predictions from this: first, MVI will impact a gene's cancer gene status; second, MVI will impact the mutation frequency of cancer genes. We have evaluated and confirmed the first prediction. To validate the second prediction, we estimate each cancer gene's mutation frequency in cancer samples based on the mutation data in COSMIC. Notice that not all samples in COSMIC were subjected to full genome sequencing, hence the estimates can be conservative and inaccurate. Despite this, we observed a significant association between cancer genes' MVIs and their mutation frequencies (Fig. 5). There are significant positive associations between precomputed MVIs and mutation frequencies for tumor suppressor genes (Spearman rank correlation 0.687, p-value $1.4 \times 10^{-19}$) as well as for oncogenes (Spearman rank correlation 0.580, p-value $8.7 \times 10^{-37}$). Similarly, the machine-learned MVI is also positively correlated with mutation frequencies for tumor suppressor genes (Spearman rank correlation 0.444, p-value $1.1 \times 10^{-7}$) as well as oncogenes (Spearman rank correlation 0.105, p-value 0.04). We emphasize that the learned MVI is trained to differentiate cancer genes from non-cancer genes without any knowledge of the cancer genes' mutation frequencies, yet it is significantly associated with the mutation frequency of cancer genes in tumor samples. These results further validate the association between mutation vulnerability and cancer genes.
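The Spearman rank correlations reported above can be reproduced in principle with a small amount of code: rank both variables (averaging the ranks of ties) and take the Pearson correlation of the ranks. The Python sketch below shows this under the assumption that gene MVIs and mutation frequencies are available as two equal-length lists; the function names are hypothetical and p-value computation is omitted.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy values only: hypothetical gene MVIs and mutation frequencies.
    mvi = [1.2e-5, 3.4e-5, 2.0e-5, 8.1e-5]
    freq = [0.001, 0.010, 0.004, 0.020]
    print(spearman(mvi, freq))
```

Rank correlation is a natural choice here because both MVI and mutation frequency are heavily skewed, and only the monotonic relationship between them is of interest.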

Intra-gene codon mutation frequency and codon mutation vulnerability are associated

Cancer is a process of somatic evolution. If somatic selection pressure is uniformly positive for all amino acid mutations in cancer proteins, then the spontaneous mutation rate of an amino acid decides its mutation frequency in cancer, and the spontaneous mutation rate of a cancer protein decides the cancer gene's mutation frequency in cancer. We studied the gene-level mutation frequency in the previous section, and here we study the codon-level mutation frequency. For each of the 541 cancer genes with mutation data, we determine whether the observed mutation frequencies of the codons (along the CDS) are correlated with the precomputed codon mutation vulnerability (mvi). Positive linear correlations are observed for 89% (482) of the genes, among which 71% (342) are significant at a p-value cutoff of 0.01, and 52% (253) remain significant after Benjamini-Hochberg correction. By contrast, among the 59 genes with negative correlations between mvi and observed codon mutation frequency, none are significant at a p-value cutoff of 0.01 (Fig. 6A). The correlation between codon mvi and codon mutation frequency is on average stronger for tumor suppressor genes than for oncogenes (p-value $1.0 \times 10^{-8}$, Mann-Whitney test on the ranks of p-values for the codon mutation frequency-mvi correlation). 97% of the tumor suppressor genes have a positive codon mvi-mutation frequency correlation, 81% of which are significant at p-value level 0.01, while only 87% of oncogenes have a positive mvi-mutation frequency correlation, 67% of which are significant.

We visualized RB1 and PHF6 as examples to further understand the relationship between codon mvi and codon mutation frequency. By plotting the predicted codon mvi and observed codon mutation frequency spectra as mirror images (Fig. 6B), we find that mutation vulnerability explains a significant portion of the mutation hot spots. The majority of mutation hot spots have high mutation vulnerability, although many codons with high mvi are not highly mutated, suggesting that high mutation vulnerability is a precondition for high mutation frequency. These results corroborate previous findings and support the important role of mutation vulnerability in cancer gene mutation.
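The Benjamini-Hochberg step mentioned above can be sketched as follows. Given one p-value per gene, the procedure finds the largest rank k such that the k-th smallest p-value is at most (k/m)·alpha and declares the k smallest p-values significant. The function name, the default level `alpha=0.01`, and the toy p-values are assumptions for illustration; the text does not state the false discovery rate level that was used.

```python
def benjamini_hochberg(p_values, alpha=0.01):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank that passes its threshold.
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Toy per-gene p-values for the codon mvi vs. mutation frequency correlation.
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(p_values, alpha=0.05))
```

Because the rule is step-up, a p-value that fails its own threshold can still be declared significant when a larger p-value passes a later threshold.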

Discussion

Cancer is a complex genetic [18, 19] and aging disease [20]. Recently, big-data approaches to cancer genomics have been driving an increasingly complete understanding of the genes and pathways underlying cancer development [21–24], while at the same time revealing the stochastic nature of cancer due to somatic evolution [7, 8]. Over 500 cancer genes have been identified with recurrent (driver) mutations in cancer samples, yet no cancer gene is found to mutate in all cancers, and no two cancers share the same set of cancer gene mutations.

The specific functions of genes, e.g. biological processes, pathways, and protein interactions, are believed to be the reason why some genes are cancer genes while others are not [25]. In this study, we analyze the mutational aspect of cancer genes. We found that biases in the DNA mutation process, captured for each gene as its mutation vulnerability index (MVI), significantly impact the likelihood for a gene to be observed as a cancer gene. Compared to non-cancer genes, cancer genes have on average 48.8% higher mutation vulnerability (Fig. 3), with contributions from both protein length and codon usage. Importantly, among the cancer genes, the ones with higher MVI also have higher mutation frequency in cancer samples (Fig. 5). This suggests that MVI influences a gene's chance of being a cancer gene in a quantitative manner. Recent studies [5, 6] revealed that the cancer risk disparity among different tissue types is mainly determined by inherent properties of DNA replication errors related to cell types rather than environmental factors. In this study we reveal that inherent patterns of DNA mutations may play a similar role at the molecular level. At the codon level, we observe that the majority of highly mutated codons in cancer genes have high mvi (codon mutation vulnerability), while many codons with high predicted mvi are not highly mutated (Fig. 6).
High mutation vulnerability and functional relevance are likely two independent prerequisites for a gene to be a cancer gene and for a codon of a cancer gene to be mutated in cancer. The mutation vulnerability of genes and codons sets the stage for positive selection to play its role in cancer formation [26]. A gene with a cancer-relevant function but low vulnerability may not show a mutation frequency high enough to be statistically identifiable as a cancer gene. We suggest that cancer gene status should not be viewed as a binary concept but rather as a continuum. With more cancer genomes sequenced, we will have more power to detect more "weak" cancer genes.

A machine learning approach for predicting cancer genes will help us to understand the key features differentiating cancer genes from other genes, and it may also guide the discovery of new cancer genes and improve the interpretation of cancer genomes in differentiating driver mutations from non-driver mutations. Existing efforts in cancer gene prediction have focused on utilizing the functional attributes of genes, e.g. molecular functions and signaling pathways, and locations and connectivity in the protein-protein interaction network [25, 27]. Some recent studies also suggested the value of [25, 26, 28]. We reveal here that cancer genes can be predicted with decent performance by precomputed or learned MVIs, both of which are functions of a gene's codon frequencies (Fig. 4). Further improvement of cancer gene prediction will be possible if protein functions and expression levels are combined with the mutation vulnerabilities of genes.

We note that the mutation vulnerability index is inversely related to the concept of mutational robustness in evolutionary biology [29, 30], which describes the property that an organism's fitness remains unchanged upon mutations. The main feature of MVI is that it captures the spontaneous mutation biases of nucleotides as well as the consequences of the nucleotide mutations. MVI, however, does not accurately model whether an amino acid change is functionally neutral to the cell or organism.

There are several limitations in our approach for computing a gene's mutation vulnerability. First, we only model single nucleotide mutations.
In-dels and rearrangements, which are common for oncogenes, are not considered. Explicit modeling of these mutation types could give us a better mutation vulnerability index for genes. Second, we only consider the impact of the two immediate neighbors on the mutation rate of a nucleotide. Including more neighboring nucleotides as well as the local genomic environment around a gene, e.g. chromatin structure and CG%, could improve the accuracy of MVI. Finally, for the precomputed MVI (Equation 2), nonsense and missense mutations are treated equally, and all missense mutations are viewed the same in terms of their functional impact. Ongoing research shows that we can better predict the functional consequences of amino acid changes by incorporating amino acid conservation patterns and neighboring amino acid sequences [31–33]. Using these advanced models instead of the naive mutation effect model could lead to a better MVI, improved cancer gene predictions, and increased understanding of cancer genes.
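The statement that both MVIs are functions of a gene's codon frequencies can be illustrated with a minimal sketch. It assumes that each codon carries one fixed mvi value, so that the average codon mvi (MVI/Len) is a frequency-weighted sum. The function names `codon_frequencies` and `average_codon_mvi` are hypothetical, and the toy weights are the three codon mvi values printed in Fig. 2, not a full precomputed table.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # all 64 codons


def codon_frequencies(cds: str) -> dict:
    """Relative frequency of each of the 64 codons in a coding sequence."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds), 3))
    total = sum(counts.values())
    return {codon: counts.get(codon, 0) / total for codon in CODONS}


def average_codon_mvi(freqs: dict, codon_mvi: dict) -> float:
    """Frequency-weighted mean of per-codon mvi values (assumed: MVI / Len)."""
    return sum(codon_mvi.get(codon, 0.0) * f for codon, f in freqs.items())


# Toy gene from Fig. 2 (ATG CGA TGA) with the codon mvi values shown there.
toy_mvi = {"ATG": 6.17e-8, "CGA": 1.11e-7, "TGA": 5.16e-8}
freqs = codon_frequencies("ATGCGATGA")
avg = average_codon_mvi(freqs, toy_mvi)
gene_mvi = avg * 3  # multiply by the number of codons to recover the gene MVI
```

A learned MVI would replace `toy_mvi` with weights fitted to cancer gene labels. The fitting procedure is not described in this section and is not sketched here.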

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors agreed to the publication of the manuscript.

Availability of data and material

Relevant data are provided as supplementary material.

Competing interests

None

Funding

None

Authors’ contributions

YFL participated in the study design and analysis and wrote the manuscript. FX carried out analysis and revised the manuscript.

Acknowledgements

The authors would like to thank Tara Brock and Janelle Marie Estes for discussions.

References

[1] Vogelstein, B & Kinzler, K. W. (2004) Cancer genes and the pathways they control. Nat Med 10, 789–799.

[2] Futreal, P, Coin, L, Marshall, M, Down, T, Hubbard, T, Wooster, R, Rahman, N, & Stratton, M. (2004) A census of human cancer genes. Nature reviews cancer 4, 177–183.

[3] Hanahan, D & Weinberg, R. A. (2011) Hallmarks of cancer: the next generation. Cell 144, 646–674.

[4] Hanahan, D & Weinberg, R. A. (2000) The hallmarks of cancer. Cell 100, 57–70.

[5] Tomasetti, C, Li, L, & Vogelstein, B. (2017) Stem cell divisions, somatic mutations, cancer etiology, and cancer prevention. Science (New York, N.Y.) 355, 1330–1334.

[6] Tomasetti, C & Vogelstein, B. (2015) Cancer etiology. Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science (New York, N.Y.) 347, 78–81.

[7] Nowell, P. C. (1976) The clonal evolution of tumor cell populations. Science 194, 23–28.

[8] Greaves, M & Maley, C. C. (2012) Clonal evolution in cancer. Nature 481, 306–313.

[9] Rubin, A. F & Green, P. (2009) Mutation patterns in cancer genomes. Proc Natl Acad Sci U S A 106, 21766–21770.

[10] Hess, S. T, Blake, J. D, & Blake, R. D. (1994) Wide variations in neighbor-dependent substitution rates. J Mol Biol 236, 1022–1033.

[11] Vitkup, D, Sander, C, & Church, G. M. (2003) The amino-acid mutational spectrum of human genetic disease. Genome Biol 4, R72.

[12] Zhu, Y. O, Siegal, M. L, Hall, D. W, & Petrov, D. A. (2014) Precise estimates of mutation rate and spectrum in yeast. Proc Natl Acad Sci U S A 111, E2310–E2318.

[13] Pleasance, E. D, Cheetham, R. K, Stephens, P. J, McBride, D. J, Humphray, S. J, Greenman, C. D, Varela, I, Lin, M.-L, Ordóñez, G. R, Bignell, G. R, Ye, K, Alipaz, J, Bauer, M. J, Beare, D, Butler, A, Carter, R. J, Chen, L, Cox, A. J, Edkins, S, Kokko-Gonzales, P. I, Gormley, N. A, Grocock, R. J, Haudenschild, C. D, Hims, M. M, James, T, Jia, M, Kingsbury, Z, Leroy, C, Marshall, J, Menzies, A, Mudie, L. J, Ning, Z, Royce, T, Schulz-Trieglaff, O. B, Spiridou, A, Stebbings, L. A, Szajkowski, L, Teague, J, Williamson, D, Chin, L, Ross, M. T, Campbell, P. J, Bentley, D. R, Futreal, P. A, & Stratton, M. R. (2010) A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463, 191–196.

[14] Ivanov, D, Hamby, S. E, Stenson, P. D, Phillips, A. D, Kehrer-Sawatzki, H, Cooper, D. N, & Chuzhanova, N. (2011) Comparative analysis of germline and somatic microlesion mutational spectra in 17 human tumor suppressor genes. Hum Mutat 32, 620–632.

[15] Araten, D. J, Golde, D. W, Zhang, R. H, Thaler, H. T, Gargiulo, L, Notaro, R, & Luzzatto, L. (2005) A quantitative measurement of the human somatic mutation rate. Cancer Res 65, 8111–8117.

[16] Graur, D & Li, W. (2000) Fundamentals of molecular evolution. (Sinauer Associates Sunderland, Mass).

[17] Stone, S, Jiang, P, Dayananth, P, Tavtigian, S. V, Katcher, H, Parry, D, Peters, G, & Kamb, A. (1995) Complex structure and regulation of the p16 (mts1) locus. Cancer Res 55, 2988–2994.

[18] Balmain, A, Gray, J, & Ponder, B. (2003) The genetics and genomics of cancer. Nat Genet 33 Suppl, 238–244.

[19] Varghese, J. S & Easton, D. F. (2010) Genome-wide association studies in common cancers–what have we learnt? Curr Opin Genet Dev 20, 201–209.

[20] DePinho, R. A. (2000) The age of cancer. Nature 408, 248–254.

[21] Porta-Pardo, E & Godzik, A. (2016) Mutation drivers of immunological responses to cancer. Cancer Immunol Res.

[22] Vogelstein, B, Papadopoulos, N, Velculescu, V. E, Zhou, S, Diaz, Jr, L. A, & Kinzler, K. W. (2013) Cancer genome landscapes. Science 339, 1546–1558.

[23] Vogelstein, B & Kinzler, K. W. (2015) The path to cancer –three strikes and you’re out. N Engl J Med 373, 1895–1898.

[24] Stratton, M. R, Campbell, P. J, & Futreal, P. A. (2009) The cancer genome. Nature 458, 719–724.

[25] Aragues, R, Sander, C, & Oliva, B. (2008) Predicting cancer involvement of genes from heterogeneous data. BMC Bioinformatics 9, 172.

[26] Ostrow, S. L, Barshir, R, DeGregori, J, Yeger-Lotem, E, & Hershberg, R. (2014) Cancer evolution is associated with pervasive positive selection on globally expressed genes. PLoS Genet 10, e1004239.

[27] Li, L, Zhang, K, Lee, J, Cordes, S, Davis, D. P, & Tang, Z. (2009) Discovering cancer genes by integrating network and functional properties. BMC Med Genomics 2, 61.

[28] Li, Y. F, Xin, F, & Altman, R. B. (2016) Separating the causes and consequences in disease transcriptome. Pac Symp Biocomput 21, 381–392.

[29] van Nimwegen, E, Crutchfield, J. P, & Huynen, M. (1999) Neutral evolution of mutational robustness. Proc Natl Acad Sci U S A 96, 9716–9720.

[30] Draghi, J. A, Parsons, T. L, Wagner, G. P, & Plotkin, J. B. (2010) Mutational robustness can facilitate adaptation. Nature 463, 353–355.

[31] Ng, P. C & Henikoff, S. (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31, 3812–3814.

[32] Ng, P. C & Henikoff, S. (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7, 61–80.

[33] Li, B, Krishnan, V. G, Mort, M. E, Xin, F, Kamati, K. K, Cooper, D. N, Mooney, S. D, & Radivojac, P. (2009) Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics 25, 2744–2750.

Tables

N/A

Figure Legends

Figure 1:

Trend of cancer gene discovery, estimated according to the number of PubMed articles reporting novel tumor suppressor genes or oncogenes.

Figure 2:

The mutation vulnerability index calculation for a hypothetical gene of three codons. For each codon, there are nine possible single nucleotide mutations in total, each with a different probability per round of DNA replication. Di-nucleotide and tri-nucleotide mutations are of low probability and are ignored. The nine single nucleotide mutations for the 2nd codon are shown in the figure. Among the nine mutants, five are non-synonymous. Codon mvi values are calculated based on the total probability of these non-synonymous mutants, and the gene MVI is the sum of all codon mvi values of the gene. Red shades highlight the amino acid and nucleotide mutations.
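The arithmetic behind this legend can be checked with a short sketch. It enumerates the nine single-nucleotide mutants of the 2nd codon (CGA) of the example gene, classifies them with the standard genetic code, and sums the probabilities of the non-synonymous ones. The pairing of each probability with a mutant is an assumption inferred from the order in which the values appear in Fig. 2. The helper names are hypothetical, and the codon mvi values for ATG and TGA are taken directly from the figure rather than computed.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_nucleotide_mutants(codon):
    """All nine codons that differ from `codon` at exactly one position."""
    return [codon[:pos] + base + codon[pos + 1:]
            for pos in range(3) for base in BASES if base != codon[pos]]


def non_synonymous(codon):
    """Mutants whose translation differs from the original (stop included)."""
    ref = GENETIC_CODE[codon]
    return [m for m in single_nucleotide_mutants(codon) if GENETIC_CODE[m] != ref]


# Per-replication probabilities read from Fig. 2 (mutant pairing is inferred).
P_CGA = {"AGA": 4.56e-9, "CGT": 2.10e-9, "CGC": 4.09e-9, "CGG": 1.16e-8,
         "TGA": 5.05e-8, "CCA": 1.03e-8, "GGA": 6.31e-9, "CTA": 7.36e-9,
         "CAA": 3.63e-8}

codon_mvi_cga = sum(P_CGA[m] for m in non_synonymous("CGA"))
# Codon mvi for ATG and TGA as printed in the figure.
gene_mvi = 6.17e-8 + codon_mvi_cga + 5.16e-8
```

Under this pairing the five non-synonymous mutants are TGA, GGA, CAA, CTA and CCA, and the sums agree with the values printed in the figure (codon mvi 1.11 × 10^-7, gene MVI 2.24 × 10^-7) to the displayed precision.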

Figure 3:

A gene’s mutation vulnerability is associated with its cancer gene status and type. A. Distributions of the gene-level MVI of cancer genes (CG) versus non-cancer genes (non-CG). Log base 10. B. Bar plots comparing the MVI of different types of cancer genes against non-cancer genes. On the top: dominant, cancer genes with dominant mutations, mostly oncogenes; recessive, cancer genes with recessive mutations, mostly tumor suppressor genes; both, cancer genes with both dominant and recessive mutations. On the bottom: somatic, cancer genes with only somatic mutations; germline, cancer genes with only germline mutations; both, cancer genes with both somatic and germline mutations. Log base 10. C. The increase in mean MVI of different types of cancer genes relative to that of non-cancer genes. D. Significance levels of pairwise differences of mean MVI among different types of cancer genes and non-cancer genes. Pairwise t-tests were used to compare log-transformed gene MVI. P-values are adjusted for multiple testing using the Benjamini–Hochberg method. *, p ≤ 0.05; **, p ≤ 0.01; ***, p ≤ 0.001; ns, not significant.
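The multiple-testing adjustment and star notation named in this legend can be sketched as follows. This is a generic implementation of the Benjamini–Hochberg step-up adjustment, not the authors' code. The function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significance_label(p):
    """Map an adjusted p-value to the notation used in the legend."""
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "ns"


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
adj = benjamini_hochberg(raw)
labels = [significance_label(p) for p in adj]
```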

Figure 4:

The power of a gene’s CDS length (Len), average codon MVI (mvi, i.e. MVI/Len), and codon usage (codon) in differentiating cancer genes from non-cancer genes. (A) Receiver operating characteristic curves for representative 10-fold cross validations for models Len, mvi, Len + mvi, codon, and Len + codon. (B) Performance comparison among the 5 models. The error bars are estimated based on 30 iterations of 10-fold cross validation. (C) Importance of codons in predicting cancer genes is significantly associated with the mutation vulnerability of the codons. Only non-stop codons are shown. Pearson correlation 0.55, p-value 2.0 × 10^-6. (D) The frequencies of the top 5 most important codons in cancer genes versus non-cancer genes. *, p-value < 0.01; **, p-value < 0.001.
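A receiver operating characteristic curve is often summarized by the area under it (AUC). The sketch below assumes that the predictive power in panel (B) is such an AUC, which the legend does not state. It computes the AUC as the probability that a randomly chosen positive (cancer gene) scores higher than a randomly chosen negative. The function name and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores: label 1 marks a cancer gene, 0 a non-cancer gene.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The quadratic pairwise loop is chosen for clarity. For large gene sets a rank-based formulation gives the same value faster.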

Figure 5:

A cancer gene’s MVI is positively associated with its mutation frequency in cancer samples. Scatter plots of the mutation frequencies of cancer genes against precomputed MVI (A) or learned MVI (B). In (B), the learned MVI is trained using codon usage as predictors and the gene’s cancer gene status as target, hence the mutation frequency information is completely independent of the machine learning procedure for cancer gene prediction. Genes’ mutation frequencies in cancer samples are calculated based on the COSMIC gene mutation database. Log base 10 is used.

Figure 6:

Associations between intra-gene codon mutation frequencies and codon mutation vulnerability. (A) Volcano plot for the codon mutation frequency - codon mvi correlation. X-axis is the Pearson correlation coefficient per cancer gene. Y-axis is the p-value associated with the correlation coefficient. The dashed horizontal line marks the p-value cutoff 0.01. Insets: the density plots for correlation coefficients (top) and p-values (bottom) grouped by cancer gene type (purple, tumor suppressor genes; green, oncogenes). Arrows point to the RB1 and PHF6 genes, for which the detailed mutation spectrum and predicted mvi spectrum are shown as mirror images in (B) and (C). The mutation frequency spectra are flipped vertically. Codon mvi is calculated by aggregating the neighbor-dependent nucleotide mutation frequencies with the mutation consequence determined by the standard genetic code table. Codon mutation frequency is taken as the raw count of mutations occurring at a given codon. Log base 10 is used.
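The per-gene statistic on the x-axis of panel (A) is a Pearson correlation between two vectors indexed by codon position: codon mvi and codon mutation count. A minimal sketch of that computation follows. The function name and the toy vectors are hypothetical, and any log transformation or handling of zero counts applied by the authors is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for a four-codon stretch of one gene.
codon_mvi = [6.0e-8, 1.1e-7, 5.0e-8, 9.0e-8]
mutation_counts = [2, 7, 1, 5]
r = pearson(codon_mvi, mutation_counts)
```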


Figures

Fig. 1: [Plot: number of cancer genes (y-axis, 0 to 500) by year (x-axis, 1985 to 2015).]

Fig. 2: [Diagram: a three-codon gene with CDS ATG CGA TGA (M R *) and the nine single-nucleotide mutants of its 2nd codon, each with its probability P. Potential neutral mutants: AGA (R), P = 4.56 × 10^-9; CGT (R), P = 2.10 × 10^-9; CGC (R), P = 4.09 × 10^-9; CGG (R), P = 1.16 × 10^-8. Potential non-neutral mutants: TGA (*), P = 5.05 × 10^-8; CCA (P), P = 1.03 × 10^-8; GGA (G), P = 6.31 × 10^-9; CTA (L), P = 7.36 × 10^-9; CAA (Q), P = 3.63 × 10^-8. Codon mvi: 6.17 × 10^-8, 1.11 × 10^-7, 5.16 × 10^-8. Gene MVI: 2.24 × 10^-7.]

Fig. 3: [Panels (A) to (D). (A) Normalized gene counts versus log(gene MVI) for CG and non-CG. (B) log(gene MVI) by cancer gene type. (C) Increase in mean MVI relative to non-CG: somatic +45%, germline +45%, both +95%; dominant +34%, recessive +92%, both +206%. (D) Pairwise significance levels (***, *, ns).]

Fig. 4: [Panels (A) to (D). (A) True positive rate versus false positive rate for models Len, mvi, Len + mvi, codon, and Len + codon. (B) Predictive power of the five models. (C) Precomputed codon mvi versus codon importance in predicting cancer genes. (D) Frequency (%) of codons ACG, CCG, CGA, CGT, and TCG in cancer genes versus other genes.]

Fig. 5: [Panels (A) and (B). Scatter plots of log(mutation rate) against log(gene MVI) (A) and log(learned MVI) (B), with each point a labeled cancer gene colored as dominant or recessive. (A) Recessive genes: ρ = 0.687, p-value 1.4 × 10^-19; dominant genes: ρ = 0.558, p-value 8.7 × 10^-37. (B) Recessive genes: ρ = 0.444, p-value 1.1 × 10^-7; dominant genes: ρ = 0.105, p-value 0.04.]

Fig. 6: [Panels (A) to (C). (A) Volcano plot of −log(p-value) versus the Pearson correlation between codon MVI and codon mutation counts, one labeled point per cancer gene, with inset density plots of the correlation and of −log(p-value). (B), (C) Mirror plots of log(mvi) and mutation rate against codon location in CDS for RB1 and PHF6.]