In silico learning of tumor evolution through mutational time series

Noam Auslandera, Yuri I. Wolfa, and Eugene V. Koonina,1

aNational Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894

Contributed by Eugene V. Koonin, March 15, 2019 (sent for review January 29, 2019; reviewed by Natalia L. Komarova and Andrei Zinovyev) arises through the accumulation of somatic over potentially revealing the complex dynamics of the tumor evolu- time. Understanding the sequence of occurrence during tion and enabling a variety of context-specific predictions. cancer progression can assist early and accurate diagnosis and To this end, we employed long short-term memory (LSTM) improve clinical decision-making. Here we employ long short-term networks (20), a type of recurrent neural network (RNN) (21) memory (LSTM) networks, a class of recurrent neural network, to capable of learning long-term dependencies in a time series se- learn the evolution of a tumor through an ordered sequence of quence. These networks have achieved major success in time mutations. We demonstrate the capacity of LSTMs to learn series prediction tasks and for learning evolution of recurrent complex dynamics of the mutational time series governing tumor systems (20, 22–24). We demonstrate here the utility of applying progression, allowing accurate prediction of the mutational LSTM to time-ordered mutational data in colon and lung ade- burden and the occurrence of mutations in the sequence. Using nocarcinomas. We define a pseudotemporal ranking, rep- the probabilities learned by the LSTM, we simulate mutational resenting each tumor sample as a binary text, which can be used data and show that the simulation results are statistically in- for training LSTM in a similar manner to that applied in natural distinguishable from the empirical data. We identify passenger language texts classification (25–27). Applied to a discrete ver- mutations that are significantly associated with established cancer sion of the mutational data, ordered into an approximate tem- drivers in the sequence and demonstrate that the carrying poral sequence, these networks can be used to predict the these mutations are substantially enriched in interactions with the mutational load from a limited number of mutations and for corresponding driver genes. Breaking the network into modules stepwise prediction of the occurrence of following mutations in GENETICS consisting of driver genes and their interactors, we show that the series. Using the sequence dynamics learned by the models, these interactions are associated with poor patient prognosis, thus we reconstruct sequences of mutations that are statistically in- likely conferring growth advantage for tumor progression. Thus, distinguishable from the original observations. We find that the application of LSTM provides for prediction of numerous addi- occurrence of distinct subsets of mutations could be predicted tional conditional drivers and reveals hitherto unknown aspects of from the timeline of the major cancer drivers. Investigation of cancer evolution. the driver genes contributing to the prediction of each of these mutations uncovers numerous driver–interactor gene pairs that cancer progression | driver mutations | passenger mutations | machine are highly and specifically enriched with different types of in- learning | neural networks dependently identified interactions. We further derive modules of driver genes and find that their predicted interactions are umorigenesis is a multistep process characterized by accu- associated with poor survival rate, suggesting that the driver and Tmulation of somatic mutations, which contribute to tumor associated passengers jointly promote tumor growth. growth, clinical progression, immune escape, and the develop- ment of drug resistance (1, 2). The somatic mutations found in a Significance tumor cell are accumulated over the lifetime of the cancer pa- tient, so that some mutations are acquired in early steps of tu- Cancer is caused by the effects of somatic mutations known as morigenesis and even in premalignant cells (3). Relatively small drivers. Although a number of major cancer drivers have been subsets of these mutations are established tumor drivers, whereas identified, it is suspected that many more comparatively rare the remainder are thought to be passengers that do not confer – and conditional drivers exist, and the interactions between growth advantage or may even negatively affect tumor fitness (4 different cancer-associated mutations that might be relevant 7). Although molecular and cell biology studies have revealed – for tumor progression are not well understood. We applied an many mechanistic details of tumorigenesis (4, 8 10), our un- advanced neural network approach to learn the sequence of derstanding of the tumor evolution dynamics remains limited, mutations and the mutational burden in colon and lung presumably due to the complexity of the process and the and to identify mutations that are associated with individual abundance of passenger events that could be randomly dis- drivers. A significant ordering of driver mutations is demon- tributed but might exert various effects on tumor fitness and strated, and numerous, previously undetected conditional properties (11, 12). drivers are identified. These findings broaden the existing un- In , tumor development has been explained derstanding of the mechanisms of tumor progression and have by a multistep model of that describes the pro- implications for therapeutic strategies. gression of a benign adenoma to a malignant carcinoma through a series of well-defined histological stages that are linked to a Author contributions: N.A., Y.I.W., and E.V.K. designed research; N.A. performed re- mutational time series, i.e., the temporal sequence of occurrence search; N.A., Y.I.W., and E.V.K. analyzed data; and N.A. and E.V.K. wrote the paper. of driver mutations (13–16). Similar stepwise models have been Reviewers: N.L.K., University of California, Irvine; and A.Z., Institut Curie. developed for other types of adenocarcinomas (17, 18), although The authors declare no conflict of interest. the temporal succession of molecular changes characterizing the This open access article is distributed under Creative Commons Attribution-NonCommercial- progression of these tumors has not been elucidated at the level NoDerivatives License 4.0 (CC BY-NC-ND). of confidence it has for colon cancer (19). Given that the somatic 1To whom correspondence should be addressed. Email: [email protected]. alterations in some tumors can be represented as a multistep This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. sequence of events, we conjectured that time series learning al- 1073/pnas.1901695116/-/DCSupplemental. gorithms could be applied to the sequence of somatic mutations, Published online April 23, 2019.

www.pnas.org/cgi/doi/10.1073/pnas.1901695116 PNAS | May 7, 2019 | vol. 116 | no. 19 | 9501–9510 Downloaded by guest on September 25, 2021 Results gression of several cancers, apparently, via interactions with the Throughout this work, we use a discretized version of the mu- immune system, and is emerging as an important target for tational data from two tumor types, namely, colon cancer, where cancer therapy (37). SYNE1, a cytoskeleton organizer, although the stepwise evolutionary model has been established (13–16), less thoroughly characterized, appears to contribute to DNA and , where such a model has been suggested but not damage response and, thus, to genome instability and tumori- widely accepted as it is in the case of colon cancer (17, 19). The genesis (38). Thus, the genes that contribute to mutation load snapshot mutational datasets are ordered into an approximate prediction in both types of cancer seem to reveal common bi- temporal sequence that is estimated via the training sets for each ological themes. Moreover, the scores assigned by the LSTM tumor type. For each classification task, we trained LSTM net- using only the 20 latest mutations (i.e., the 20 mutations with the works using The Cancer Genome Atlas (TCGA) (28) time- highest order scores) as a sequence are highly correlated with the ordered mutational data for these tumor types and tested the observed mutational load in all test sets (Fig. 1 B–D). Using the network performance using independent datasets (Table 1). time series from the earliest time point, however, results in al- most random prediction performance with similar number of Predicting Tumor Mutational Load from the Mutational Time Series. mutations (SI Appendix, Fig. S2), suggesting that the ultimate We aim to represent tumor mutations as a discrete timeline of mutational burden of a tumor depends primarily on mutations mutational events such that the appearance of a mutation in the that occur late in tumor evolution. Training linear classifiers to timeline would correspond to their estimated place in the tumor predict the mutational load from the same sequences of muta- evolution. We thus calculate a score for every mutation as the tions or randomly selected mutations (Materials and Methods) ratio of its occurrences in the presence and in the absence of resulted in significantly inferior performance compared with the other mutations (see Materials and Methods for details). This LSTM (Fig. 1 E–G), suggesting that the LSTM networks learn form of the score is motivated by the assumption that mutations complex dynamics within the mutational data that could not be that appear late in the tumor evolution are more likely to be captured by conventional classification approaches. fixed when other mutations are present, supported by a pre- The scores assigned to predict the mutational load are also viously established notion that cooccurring events more often associated with clinical phenotypes. For the colon cancer test take place at late stages in tumor evolution (29–32). Sorting the sets, the scores assigned by LSTM trained with the 20 latest mutations by the scores evaluated on our training datasets, we mutations in the sequence are significantly higher in samples of find that this estimated order of mutation appearance is in primary tumors assigned with poor grade vs. moderate grade, in agreement with the established succession of the key drivers in high vs. low microsatellite instability, and in high vs. low CpG colon adenocarcinoma (SI Appendix, Fig. S1A). In particular, the island methylation phenotypes (Fig. 1 H–K). In the lung cancer APC mutation is an early driver event, followed by CTNBB1, test set, higher scores are associated with lower progression-free KRAS, and SMAD4, whereas PIK3CA and TP53 mutations survival (PFS; Fig. 1L). occur later in tumor development (13, 33). Moreover, we find that the scores significantly correlate with the ratio between the Predicting Occurrence of Mutations in the Sequence and Simulated frequency of a mutation in colon adenomas and its frequency in Data Analysis. We then explored the possibility of predicting the colon carcinomas [from COSMIC database (34, 35); SI Appen- occurrence of mutations in the course of tumor evolution from dix, Fig. S1B], further supporting the relevance of this score for other mutations in the time series. To this end, the LSTM net- the inference of the order of mutations. works were trained to predict the occurrence of each mutation We then used the time-ordered mutational data to predict the Mt in the time series, using the time series of mutations from the overall mutational load in the respective tumors and evaluate the latest mutation up to the Mt+1 mutation (thus predicting earlier number of mutations required for an accurate prediction. To this mutations from later ones). The occurrence of most mutations in end, we trained LSTM networks aiming to predict high vs. low the sequence could be predicted with good accuracy, with the mutational load (Materials and Methods) from a time series of median AUC = 0.88, 0.73, and 0.69 for the first and second colon mutations. Given a discrete time series of mutation occurrences, test sets and for the lung test set, respectively (Fig. 2A). Muta- the LSTM network is trained using the series up to a time t and is tions for which occurrence could not be predicted (AUC < 0.5; applied to the left-out test to produce scores that reflect the 2 and 17% of the mutations in colon and lung cancers, re- probability of each sample in the test to have a high mutational spectively) were significantly less common than those that were load. Starting from the final time point (the last mutation in the readily predictable (rank-sum P value = 0.001 for colon and 2.4e- series, likely occurring late in the tumor evolution), we find that 106 for lung; Dataset S1), implying that the occurrence of these prediction of the mutational load saturates with high perfor- low-frequency mutations is not linked to tumor progression. mance (AUC > 0.95) with fewer than 100 mutations for both More unexpected than the poor prediction for low-frequency colon and lung adenocarcinomas (Fig. 1A). The sets of genes mutations, the prediction accuracy for mutations in known can- that contribute to the mutational load prediction significantly cer driver genes was significantly lower than that for mutations in overlap between colon and lung cancers (P value ≈ 0 for the last all other genes, in both tumor types (Fig. 2 B and C). This ob- 100 genes). Notably, this overlap includes genes that encode servation is maintained when including only genes that are fre- some of the longest human that perform various orga- quently mutated (SI Appendix, Fig. S3). These findings indicate nizing roles in either intracellular or intercellular interactions, that the driver mutations that determine the course of tumori- such as (TTN), mucin-16 (MUC16), and nesprin-1 genesis could not be readily predicted from the rest of the mu- (SYNE1). TTN is one of the most commonly mutated early tational landscape of a given tumor type. In contrast, many drivers in colon cancer (36). MUC16 is implicated in the pro- passenger mutations tend to be linked to specific drivers (39) and

Table 1. Datasets used for training and testing, for colon and lung adenocarcinomas Dataset Colon Lung

Training COAD (n = 253) LUAD (n = 506) Test Colorectal adenocarcinoma [DFCI (69), n = 612] Lung adenocarcinoma [Broad (71), n = 183] Colorectal adenocarcinoma [Genentech (70), n = 72]

9502 | www.pnas.org/cgi/doi/10.1073/pnas.1901695116 Auslander et al. Downloaded by guest on September 25, 2021 A 1

0.8

0.6

0.4 COAD test 1 - Genentech 2012, n = 72 AUC COAD test 2 - DFCI cell reports 2016, n = 619 0.2 LUAD test - BROAD Cell 2018, n = 183

0 0 10 20 30 40 50 60 70 80 90 100 Number of mutations for training

COAD test 1 COAD test 2 LUAD test B 10 C 10 D 8

7 9 8 6 8 5 6 7 4 6 4 3 Log Mutation count Log Mutation count 5 Log Mutation count 2 2 4 1 Rho = 0.89, P = 6.69e-27 Rho = 0.89, P = 0 Rho = 0.88, P = 0 3 0 0 0.5 0.52 0.54 0.56 0.58 0.6 0.5 0.52 0.54 0.56 0.58 0.6 0.4 0.5 0.6 0.7 0.8 Assigned score Assigned score Assigned score

EGCOAD test 1 F COAD test 2 LUAD test 1 1 1 Logistic 0.8 0.8 0.8

0.6 0.6 KNN 0.6 0.4 0.4 SVM

0.4 GENETICS Spearman Rho Spearman Rho 0.2 Spearman Rho 0.2

0 0.2 0 LSTM 10 20 30 40 50 10 20 30 40 50 10 20 30 40 50 Number of mutations Number of mutations Number of mutations

COAD test 1 COAD test 2 COAD test 2 COAD test 2 LUAD test HLIJK 120 P=0.0001 P=1e-36 P=2e-21 n=91 P=0.007 0.6 P=1e-08 0.6 n=503 0.58 0.6 n=405 100 n=15 n=95 0.58 0.58 0.58 80 0.56 n=91 n=47 n=92 0.56 0.56 n=438 0.56 60 0.54 score score score score 0.54 0.54 0.54 40 PFS months n=57 0.52 0.52 0.52 0.52 20

0.5 0.5 0.5 0.5 0 MSS MSI Well Poor MSS MSI CIMP CIMP High Low Moderate Moderate low high score score

Fig. 1. Prediction of tumor mutational load from the mutational time series. (A) The test AUCs (y axis) obtained for training LSTMs on different lengths of mutation sequences (x axis) when starting from the latest ordered mutation, for colon and lung test sets. (B–D) Correlation between the score assigned by LSTMs with the 20 latest mutations in the time series (x axis), and the true mutational load (y axis, log transformed). (E–G) Spearman correlation between the scores assigned by different learning models and the observed mutational load (y axis) when using different number of mutations from these that are or- dered latest in the sequence (up to 50 mutations, x axis). The dashed lines show the results for classifiers trained on randomly selected mutations (rather than the ordered sequence of mutations that is shown by solid lines). (H–K) Scores assigned by LSTM using the last 20 mutations for colon cancer patients with different clinical characteristics. (L) PFS of lung cancer patients in the test set, of samples assigned with high vs. low (using the median) scores with the last 20 mutations. COAD, colon adenocarcinoma; LUAD, lung adenocarcinoma.

thus could be predicted with confidence. Notably, however, the clustering analysis did not separate the simulated data from subset of driver mutations for which good prediction accuracy the real data (SI Appendix, Fig. S4 A and B), and principal was achieved included DNA repair genes, such as MLH1, MSH2, component analysis (PCA) showed notable similarity between MSH6, PMS2, and MUTYH for colon cancer and DDX5 and the simulated datasets and the real ones (Fig. 2 D and E), which CHD1L for lung cancer, conceivably due to their effect on other is maintained when considering only frequently mutated genes mutations in the sequence (Dataset S1). (SI Appendix, Fig. S5). This similarity was observed also when Recently, it has been shown that LSTM RNNs can be used to using the tSNE dimensionality reduction method (40) (SI Ap- generate complex sequences by simply predicting one data point pendix, Fig. S4 C and D). Considering the patterns of occurrence at a time (24). We hence reasoned that similar technique could of the frequently mutated cancer drivers (those with the fre- be utilized to generate the mutational time sequence and thus quency of mutation in the top 10%) in the simulated data, we could be employed for mutational data reconstruction. We used identified clear similarities to the observed patterns in the orig- the LSTM scores to predict the occurrence of each mutation in inal data, such as the mutual exclusion of APC, KRAS, and the time series (from the latest to the earliest), to determine the TP53 in colon cancer and of KRAS and MGA in lung cancer occurrence of one mutation at a step in a simulated time series (Fig. 2 F and G and SI Appendix, Fig. S6). (see Materials and Methods for details). In this manner, we To evaluate the effect of the actual order of the mutations in reconstructed 100 mutational samples for colon cancer and the sequence function on the prediction of the preceding mu- 100 samples for lung cancer (Datasets S2–S3). The K-means tations, we repeated this analysis for the 300 last mutations, in

Auslander et al. PNAS | May 7, 2019 | vol. 116 | no. 19 | 9503 Downloaded by guest on September 25, 2021 A BC 1 1 1200 COAD test 1, n = 619 0.9 0.9 COAD test 2, n = 72 0.8 0.8 1000 LUAD test, n = 183 0.7 0.7 800 0.6 0.6 0.5 0.5 600 0.4 0.4 Mutations 400 0.3 Lung AUC 0.3 0.2 Mean colon AUC 0.2 200 0.1 0.1 0 0 P = 1e-05 0 P = 0.01 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Others Drivers OthersDrivers AUC

DEColon cancer Lung cancer training, n = 245 training, n = 506 20 test1, n = 72 20 test, n = 183 test2, n = 619 Simulated, n = 100 15 Simulated, n = 100 15 10 10 5 5 0 PC3 (0.91%) PC3 (0.97%) 0 -5 -10 -5 -15 -10 20 10 10 5 60 25 0 50 0 20 -10 40 -5 15 30 PC2 (1.4%) -20 20 PC2 (1.07%) -10 10 -30 10 -15 5 0 PC1 (9.9%) 0 PC1 (4.1%) FG-40 -10 -20 -5 Lung simulated data Colon simulated data Lung TCGA data Colon TCGA data FN1 FN1 NF1 NF1 RB1 APC WT1 ATM ATM CAD KDR DCC MGA MGA TRIO TP53 TP53 FAT1 BRAF PTEN BRAF ATRX SOX9 KRAS CHD9 KRAS CHD4 EGFR ASPM ITSN1 MMP2 AXIN2 BCOR STK11 BAZ2B SETD2 KEAP1 LPHN2 SVEP1 CDK12 AKAP9 LPHN2 NTRK2 MED12 RBM10 CNOT1 PTPRU BMPR2 SMAD4 AMER1 SMAD4 PIK3R1 PIK3CA PIK3CA TCF7L2 ARID1A ARID1A CEP290 MECOM MAP3K4 CTNNB1 ACVR2A SRGAP3 CREBBP CREBBP DNMT3A DNMT3A SMARCA4

Fig. 2. Prediction and generation of the sequence of mutations. (A) Histogram of mutations count (y axis) for each performance level (AUC; x axis) for mutation prediction in a sequence for colon cancer test sets (pink and purple bars for test sets 1 and 2, respectively) and for lung cancer test set (green bars). (B) Mean AUC of mutation prediction in the sequence for the two colon test sets: comparison of drivers with all other genes. (C) AUC of mutation prediction in the sequence for the lung test set: comparison of drivers with all other genes. (D and E) Scatter plots of PC1–PC3 obtained by PCA applied to the combined mutational data from all datasets used and the simulated samples, for colon and lung cancers, respectively. The percentage of variance explained by each PC is indicated in parentheses. (F and G) Presence–absence patterns for the high-frequency cancer drivers in the reconstructed mutational samples (red) and the TCGA mutational data (blue) for colon and lung cancers, respectively. The samples are ordered by the hierarchical clustering results, with the Euclidean distance metric and average linkage.

both colon and lung cancers, after randomly permuting the order of some mutations (currently classified as passenger) might also be mutations in the sequence up to each predicted mutation (10 ran- predicted from the ordered time sequence of mutations in cancer dom permutations for each prediction task). The comparison of the driver genes which play key roles in tumor development. Mutations results with those obtained with the original, ordered sequence of that might be predicted through their association with major cancer mutations shows a dramatic drop in the prediction accuracy (paired drivers are of obvious interest because they could potentially con- P < rank-sum value 0.06, 0.02, and 9e-6 for colon test set1, colon tribute to different aspects of tumorigenesis. To explore such po- test set 2, and lung test set, respectively; SI Appendix,Fig.S7). This tential associations, we selected well-characterized, major cancer is a strong indication that the actual order of mutations is important drivers in each tumor type studied (n = 42 for colon and n = 26 for for predicting the mutational sequence. lung; Materials and Methods) and utilized the discrete time series of Predicting Associations Between Major Cancer Drivers and Other their occurrences to predict the occurrence of other mutations. This Genes. Given that most mutations are predicted with high accu- analysis yielded 354 genes for colon cancer and 273 genes for lung racy from the time sequence, we investigated the possibility that cancer in which mutations could be predicted robustly with high

9504 | www.pnas.org/cgi/doi/10.1073/pnas.1901695116 Auslander et al. Downloaded by guest on September 25, 2021 AUC from multiple points in the time series of the major drivers cancer network than in the colon cancer network, possibly due to (Materials and Methods and Dataset S4). different TP53 mutants that are observed in these tumors (47) Given that these mutations are accurately predicted using a that have different interacting partners (48). Notably, in both the short mutational sequence that includes only the major drivers, colon and the lung networks, TP53, SMAD4, ATM, and we hypothesized that the respective genes interact with these NF1 belong to the same strongly connected network module, drivers. To further characterize the potential functional con- whereas major tumor-specific cancer drivers such as APC (co- nections between the major drivers and the identified associated lon) and EGFR (lung) are disconnected. genes, we first determined which driver contributed to the pre- Next, we systematically assessed whether the predicted inter- diction of each of the identified driver-associated mutations actors of the major cancer drivers are involved in the same or (Materials and Methods) to generate a list of driver–interactor similar processes with the corresponding drivers. Using GO pairs for colon and lung cancers (Fig. 3A and Dataset S5). A () enrichment (49, 50), we find that for both STRING interactions enrichment analysis (41, 42) (Materials and colon and lung cancers, all major drivers share a significant (with Methods) showed that for 68 and 39 predicted driver–interactor hypergeometric P value < 0.05) overlap of the sets of GO pro- pairs in colon and lung cancers, respectively, the interaction is cesses with their predicted interactors (Fig. 4 A and B). This level validated under the multiple criteria implemented in STRING of enrichment in shared processes is unlikely to be reached by (hypergeometric P value ≈ 0 and 5.2675e-04 for colon and lung, chance (permutation P < 0.001 for both colon and lung). respectively). When the major cancer drivers were analyzed in- For the 13 major drivers that are shared between colon and dividually, we found significant (P value < 0.05) STRING en- lung cancers, we identified 1,119 GO significantly enriched richment between about 22% of both colon and lung cancer processes (enrichment P value < 0.01; Dataset S6). For the major drivers and their predicted interactors, a result that is predicted interactors of these major drivers, 240 GO processes unlikely to be obtained by chance (Fig. 3A; permutation P are highly enriched in the case of colon cancer, and 229 pro- value < 0.001 for both colon and lung). cesses are highly enriched for lung cancer. Of these, 133 GO The node degrees of the major cancer drivers in the network processes are shared between the interactors from colon and of STRING-validated interactions vary substantially within the lung cancers (hypergeometric P value ≈ 0), and among them, 34 networks inferred for each tumor type and differ between the GO processes are shared with the set of processes enriched colon and lung networks for shared drivers, with the mean de- among drivers (hypergeometric P value = 0.02; Fig. 4C and

gree of 3.5 for colon and 3 for lung (Fig. 3 C and D). In both Dataset S6). Among the GO processes that are shared between GENETICS networks, SMAD4, a gene encoding a involved in the these major cancer drivers and their interactors are cell motility TGF-beta signaling pathway (43), is highly connected. Most of pathways, regulation of cell development, differentiation and the genes connected with SMAD4 in these networks are also growth factor stimulation, and mesenchyme development pro- involved in TGF-beta signaling, but the specific subsets of these cesses. It appears plausible that the predicted interactions of genes differ between the colon and lung networks. Several of diverse genes with the major drivers promote tumor growth and these interactions have been reported previously. In particular, aggressiveness through the modulation of those processes. MAP2K4 has been identified as a conditional tumor suppressor in lung adenocarcinomas but not in colon cancer (44), whereas Driver–Interactor Modules Are Associated with Patients’ Survival. We mutations in HIF1A are associated with poor prognosis in colon next investigated the bipartite network of major drivers and their cancer (45); furthermore, HIF1A protein physically interacts predicted interactors. In an attempt to infer the contributions with SMAD4 under hypoxic conditions in colon cancer cell lines of these interactions to tumor fitness through patients’ survival, (46). The degree of TP53 is considerably higher in the lung we first sought to identify modules of major drivers that share

10 STRING 82 9943 306 80 92 126 12811 726 426 100 103 97 301 7736 174 81 14157 100 330 230 388 12563 24 171 16 332 39 147 11830 113 24 120 621 123 979 A 5 PREDICTED 136 317 136 355 294 347 58 79 181 85 280 91 274 355 96 346 261 345 355 205 284 232 39 0

Colon INTERSECT 21153 1422225210 8 66 322 1 -Log(P) 2.7 0.31 0.153.57 1.72 2.56 3.11 4.15 2.12 0.46 0.95 2.6 2.48 4.73 1.89 2.9 4.1 1.59 0.49 0 NF1 FN1 APC WT1 ATM CAD DCC MGA TRIO TP53 PTEN BRAF ATRX SOX9 CHD4 KRAS CHD9 ITSN1 ASPM BCOR AXIN2 LPHN2 CDK12 AKAP9 MED12 PTPRU CNOT1 SMAD4 BMPR2 PIK3R1 AMER1 PIK3CA TCF7L2 ARID1A CEP290 MECOM MAP3K4 CTNNB1 ACVR2A SRGAP3 CREBBP DNMT3A B 10 STRING 32185 121 109 194 593 64625 319 27740 241 12386 216 19 27 28922 131 219 127 435 3908 996 5 PREDICTED 109 195 93 36 16 33 18 109 35 51 29 272 12722 146 97 15 182 0

Lung INTERSECT 3 1 22 6 221 1 121 15 2 1 -Log(P) 1.148 1.613 1.385 3.028 5.19 6.299 2.811 7.019 2.474 -0 2.852 1.652 0.148 0.907 0 1 1 TM AT1 NF1 FN1 RB1 A KDR MGA F TP53 BRAF KRAS EGFR MMP2 STK BAZ2B LPHN2 SETD2 SVEP1 KEAP1 NTRK2 RBM10 SMAD4 PIK3CA ARID1A CREBBP DNMT3A SMARCA4

C Colon cancer drivers D Lung cancer drivers HELQSELEPAX6MED23 Colon drivers interacting mutations POTEF MAX Lung drivers interacting mutations ZFPM1 SLC26A3 FEZF2 MAEA RASA2MMP17 BUB1 MAP2K4 NEK1 CNOT1 RPS6KA6 EHBP1 CREB3L1 CHD4 CREBBP GTF2A1 DAPK1 CNOT1 MAT2B GSK3B DPYSL2 ENG GFAP KEAP1 ATP1B4 CHD9 NF1 SMAD4 MGA GPR68 MAP3K4 APC CSNK1D FGF2 ACTL6B KDM6B TACR3 SMAD4 ARID1A HIF1A THOC5 BRAF NTRK2 MAP3K19 RHOQ AXIN2 ATRX EPN1 PLK2 TAF4 NUP153 KRAS WT1 RASA2 RB1CC1 NF1 TP53 NR3C1 CC2D2A EDNRA RBL1 PHLPP1 ATM TP53 DNMT3A TRIO MED12 LPHN2EGFR ZFYVE9 PAX2 CD4 EOMES SLC17A9 ATM PIK3CA GATA3 CCDC14 CEP290 TUBGCP5 EML4 CRIM1 SMARCA4 MMP2 NTF3 EPAS1 CEP135 RPGRIP1 PMS2 RASAL3 SLC3A2 PDE6C MED15 SUPT16H CEP78 PCM1 VASN PDE4B KIAA1009 UBA1 LTK RFC1 NGFFOXM1

Fig. 3. STRING-validated interactions of major drivers in colon and lung cancers. (A and B) Heatmaps showing, for each major colon and lung cancer driver, respectively, the number of STRING interactions within the mutational data (first row), the number of predicted interactions (second row), the number of interactions in the intersection (third row), and the log-transformed hypergeometric P value (fourth row). (C and D) The networks of STRING-validated in- teractions for colon and lung cancers, respectively.

Auslander et al. PNAS | May 7, 2019 | vol. 116 | no. 19 | 9505 Downloaded by guest on September 25, 2021 Associated interactors with significant GO enrichment Mean GO overlap A 1 0.8 0.6 0.4 Fraction 0.2 0 FN1 NF1 APC WT1 ATM CAD DCC TRIO TP53 MGA PTEN BRAF SOX9 ATRX KRAS CHD4 CHD9 ITSN1 ASPM AXIN2 BCOR LPHN2 CDK12 AKAP9 MED12 PIK3R1 PTPRU CNOT1 AMER1 BMPR2 SMAD4 TCF7L2 PIK3CA ARID1A CEP290 MECOM MAP3K4 CTNNB1 ACVR2A SRGAP3 CREBBP DNMT3A Associated interactors with significant GO enrichment Mean GO overlap B 1 0.8 0.6 0.4 Fraction 0.2 0 NF1 FN1 RB1 ATM KDR MGA FAT1 TP53 BRAF KRAS EGFR MMP2 STK11 BAZ2B KEAP1 LPHN2 SETD2 SVEP1 NTRK2 RBM10 PIK3CA SMAD4 ARID1A DNMT3A CREBBP SMARCA4 C Mutual drivers Colon interactors Lung interactors ANATOMICAL STRUCTURE FORMATION HEMATOPOIETIC CELL DIFFERENTIATION GROWTH FACTOR STIMULUS REGULATION CELL PART MORPHOGENESIS FOREBRAIN DEVELOPMENT CELL MOTILITY MITOTIC CELL CYCLE EMBRYONIC EYE MORPHOGENESIS CEREBRAL CORTEX DEVELOPMENT MULTICELLULAR DEVELOPMENT PALLIUM DEVELOPMENT LOCOMOTION NEUROGENESIS MOVEMENT OF COMPONENT METANEPHRIC MESENCHYME DEVELOPMENT KIDNEY MESENCHYME DEVELOPMENT CELLULAR COMPONENT MORPHOGENESIS ANATOMICAL STRUCTURE MORPHOGENESIS EYE DEVELOPMENT MESENCHYME DEVELOPMENT PATTERN SPECIFICATION PROCESS TELENCEPHALON DEVELOPMENT CAMERA_TYPE_EYE_MORPHOGENESIS REGULATION OF CELL DEVELOPMENT REGULATION CELL DIFFERENTIATION NUCLEAR SEGREGATION REGULATION OF CELL MORPHOGENESIS EYE MORPHOGENESIS DIFFERENTIATION REGULATION CHROMOSOME CENTROMERIC REGION INTRINSIC COMPONENT PLASMA MEMBRANE NUCLEAR MATRIX NUCLEAR PERIPHERY NUCLEIC ACID BINDING TF LIPI FN1 NF1 ATM NGF MGA TP53 TAL1 PLK2 FUT8 NTF3 BBS7 PKD2 BRAF DNA2 KIF27 UPRT KRAS GFAP CHRD OGDH PVRL1 LPAR6 USP33 CPSF2 FOLH1 TRPV2 FOXF2 CYLC2 LPHN2 GAP43 SPG20 ZFPM1 RUFY3 COX10 GATA6 ABCB9 DPEP3 GATA3 POTEF NPHP3 GCNT4 SMAD4 KDM6B WNT8A PIK3CA ZNF880 EOMES ZNF837 ZNF778 ARID1A ATP2B4 CEP135 ZNF33A PRR14L GTF2A1 TFAP2A PHLPP1 GPR126 RCBTB2 ANKRD1 CREBBP DNMT3A C14orf80 C12orf50 CAPRIN1 UGT2B15 FAM122A RPS6KA6 RPS6KA5 KIAA1009 RHOBTB1 NEUROD6

Fig. 4. GO enrichment of the predicted interactions of major cancer drivers. (A and B) The gray bars show the fraction of GO processes associated with each major cancer driver that are significantly shared with its predicted interactors, for colon and lung cancers, respectively. The dot plots show the percentage of overlap of GO processes between each major driver and its predicted interactors (the red bar shows the mean of this distribution). (C) Heatmaps for GO processes enriched with the shared colon and lung major drivers and their interactors (presented are the interactors that are most strongly associated with these GO processes; for full information, see Dataset S6).

interactors. To this end, we performed a heuristic search for shown to correlate with lymph node metastasis and late stage driver modules, aiming to cover the maximum number of drivers and unfavorable prognosis of colorectal cancer (51), and PBX3 in each tumor type. Specifically, we searched for the maximum mutations have been identified in colorectal tumor cells un- partition of each tumor graph into disjoint subgraphs, such that dergoing epithelial–mesenchymal transition and have been each subgraph is complete (more precisely, the relation between shown to be associated with poor prognosis (52). the modules of drivers and interactors is complete; see Materials For the lung adenocarcinoma mutational data, we detected and Methods for details). In colon cancer, we identified 3 mutu- two mutually exclusive modules of drivers (with mutually exclu- ally exclusive modules of drivers (with mutually exclusive pair- sive pairwise interactions) that together cover 10 of the 18 major wise interactions), which together cover 22 of the 23 major colon lung cancer drivers with predicted interactions, with 47 inter- cancer drivers with predicted interactions (all but the DCC actors of these drivers (Fig. 6 A and B). Similarly to the obser- gene), and 47 interactors (Fig. 5 A–C). For each module, we then vations on colon cancer, a high number of mutations in the identified the TCGA colon samples in which the given module is interactors of each driver module is associated with poor survival highly mutated (see Materials and Methods for details). We in samples where the corresponding drivers are mutated (Fig. 6 performed Kaplan–Meier survival analysis comparing the sur- C and E) and with lower PFS (Fig. 6 D and F). For four of the vival curves between samples with high vs. low number of mu- predicted interactors of module 1 (but none for module 2), we tations of the predicted interactors within each module, also find that some of the individual interacting mutations are conditional on the drivers in the respective module being highly significantly associated with lower survival rate when drivers mutated. Strikingly, we found that the high mutation rate of from this module are mutated (Fig. 6G). One of such interactors interactors of each of the three modules is associated with poor is EPN1, an Epsin family member shown to regulate tumor survival when the driver component of the module is mutated progression (53). (Fig. 5 D–F). For some of these shared interactors, we also find that individual mutations are significantly associated with lower Discussion survival rate in the context where their driver module is highly Most epithelial cancers are preceded by premalignant lesions, mutated (Fig. 5 G–I). Among these, SIX4 expression has been which frequently display mutationsincancerdrivergenes(54–57).

9506 | www.pnas.org/cgi/doi/10.1073/pnas.1901695116 Auslander et al. Downloaded by guest on September 25, 2021 COL13A1 ST8SIA4 PGAP1 A ENG B PBX3 CWC22 C OFCC1 ACTL6B TDRD7 SESTD1 FAM78B EML4 KIF3A SLC22A24 SMAD4 TRIO IER2 CYP11A1 MAP10 AXIN2 MECOM PDE6C WT1 COL6A5 KRAS LPHN2 EDNRA

IREB2 CEP290 APC VWCE DNMT3A CNOT1 ARID1A BCOR SIX4 HSPA4L FOLH1 EPSTI1 CDHR1 TRIM35 CREBBP NF1 CHD9 ATRX MAP3K4 N4BP1 NEMF ZFYVE9 MED12 MAP3K19 ATM CHD4 HELQ TP53 RAD51AP2 FOXC1 GALK2 MED23 HEATR5B TMPRSS12 TSC22D1 TRAPPC11 Module I drivers VASN CCDC110 Module II drivers Module III drivers EEA1 RPS6KA6 PAQR6 Module I interactors FCGRT Module II interactors Module III interactors

D1.2 E1.2 F1.2 Module I & high interactors mutation load Module II & high interactors mutation load Module III & high interactors mutation load Module I & low interactors mutation load Module II & low interactors mutation load Module III & low interactors mutation load 1 1 1 Censored Censored Censored Log-rank P = 3.5e-04 Log-rank P = 0.06 Log-rank P = 0.17 0.8 0.8 0.8

0.6 n = 17 0.6 0.6

0.4 0.4 0.4 n = 18 0.2 0.2 0.2 n = 29 Estimated survival functions Estimated survival functions Estimated survival functions n = 37 n = 35 n = 36 0 0 0 0 500 1000 1500 2000 2500 3000 3500 0 500 1000 1500 2000 2500 3000 3500 0 500 1000 1500 2000 2500 3000 3500 Days Days Days

Module II & Module II & G Module I & mutated interactor Module I & not mutated interactor Censored H mutated interactor not mutated interactor Censored I TRAPPC11 SIX4 NEMF VASN 1 1 1 1 1 MAP3K19 1 CYP11A1 P = 0.06 P = 0.05 P = 0.03 P = 0.03 P = 0.03 P = 0.04 ESF ESF ESF

0.5 0.5 0.5 ESF ESF 0.5 0.5 0.5 ESF

0 0 0 0 0 0 0 1000 2000 3000 0 1000 2000 3000 0 1000 2000 3000 0 10002000 3000 0 1000 2000 3000 0 100020003000 Time Days Days Days Days Days 1 EEA1 1 EML4 1 TSC22D1 1 PBX3 1 MAP10 P = 0.05 P = 0.03 P = 0.01 Module III & P = 0.008 P = 0.06 mutated interactor ESF ESF ESF

0.5 0.5 0.5 0.5 ESF 0.5 ESF Module III & not mutated interactor

0 0 0 0 0 Censored GENETICS 0 1000 2000 3000 0 1000 2000 3000 0 1000 2000 3000 0 1000 2000 3000 0 1000 2000 3000 Days Days Days Days Days

Fig. 5. Modules of drivers and interactors in colon cancer. (A–C) The complete networks of interactions between modules of major colon cancer drivers (modules I–III, respectively) and their predicted shared interactors. (D–F) Kaplan–Meier survival curves of TCGA colon cancer samples with high mutation rate of drivers modules I–III, respectively, with high vs. low number of mutation in the interactors of these modules (defined by the median). (G–I) Kaplan–Meier survival curves of TCGA colon cancer samples with vs. without mutations in individual interactors of these modules.

Nevertheless, early and late events are not universally characterized range, thus substantially increasing the size of the dataset re- in most epithelial tumors, with the exception of colorectal cancer, quired for training. In our analysis, discretizing the mutational where tumor progression has been thoroughly dissected in terms of data and maintaining discrete labels (i.e., both input size and stepwise accumulation of somatic mutations. This phenomenon labels are always of size 2) generates a simple enough problem starts from transformation of normal epithelium to an adenoma, that can be approached even with limited amounts of data (Table proceeding to in situ carcinoma and ultimately to invasive and 1). We also found that LSTMs perform much better than linear metastatic tumor, where specific mutations mark each step in this classifiers, such as support vector machine (SVM), for the pre- tumorigenic transformation (17, 19). It is hence reasonable to diction of the mutational load. This is likely to be the case be- surmise that time series prediction approaches could be valuable cause LSTMs learn nonlinear relationships between mutations in when applied to the sequence of these genetic events in tumors. the given sequence, rather than defining a linear separating hy- RNNs, and particularly gated RNN architectures such as perplane, with the underlining assumption that the mutational LSTMs, have recently shown promising results in learning long- load can be predicted using a linear function of mutations. term dependencies of sequences for multiple tasks of classifica- A potentially important contribution of this approach is the tion (58, 59) and for data labeling and synthesis (24, 60). Here we identification of interactors of themajorcancerdrivers.Herewe show that the mutational time series could be utilized via LSTM predict many such interactors and show that they are significantly networks to achieve good performance in otherwise difficult enriched with STRING interactions and show nonrandom overlap prediction tasks. Using the estimated order of mutations ap- of GO processes with the corresponding major drivers. Anecdotally, pearance in tumor evolution, we demonstrate that end point at least some of the better characterized interactors were found to conditions, such as the mutational burden and clinical pheno- be involved in the same pathway with the corresponding driver. In types, could be predicted from a limited number of mutations. contrast, we find that the predicted interactors of each drivers are The nonlinear relations learned by the networks, together with not located on similar chromosomal arms (SI Appendix,Fig.S8), the discrete representation of the data, enable performance that suggesting that these are mostly functional, rather than physical is significantly superior to the previous models built for this task interactions. Furthermore, and most strikingly, we found that mu- (61, 62). The model can learn intricate dynamics of the muta- tations in these predicted interactors, in the presence of the corre- tional sequence in tumor evolution that can subsequently be used sponding driver mutations, are associated with poor survival and, for the reconstruction of mutation sequences. When more data thus, are likely to confer growth advantage at the respective steps of become available and the relevant neural network models are tumor progression. We identified unique modules of major drivers further refined, similar approaches could be applied to re- and their interactors for both colon and lung cancer. The strong construct data on actual DNA sequences and could thus exten- correlation with patients’ survival suggests that these interactions are sively contribute to our understanding of tumor evolution. It is clinically relevant and, if further tested, could potentially be used for worth pointing out that applications of LSTMs generally take patients’ stratification and clinical decision-making. The identified advantage of much larger data for training because in these, the interactors can be regarded as secondary drivers whose oncogenic input alphabet as well as possible labels are from a much larger activity is conditional to and associated with the occurrence of

Auslander et al. PNAS | May 7, 2019 | vol. 116 | no. 19 | 9507 Downloaded by guest on September 25, 2021 DYRK4 RBM12B RPAP1 FAM153B MAB21L2 A GAP43 GPI NCSTN NCF2 RASA2 MID2 OLR1 KCND3 B ATM KIN TRIM10 GPR39 TACR3 DNMT3A SMAD4 OR2D3 C1orf177 CALCRL EPN1 LPHN2 NF1 FGD1 NSUN2

ZNF334 DAOA ARHGAP19 ADORA1 TAF4 STK11 BRAF SMARCA4 FGF14 MED15 TP53 L3MBTL4 NTF3 MMP2 CBFA2T3 NGF TAL1 ABHD2 SAMD4A CD4 SLC13A3 LSP1 Module I drivers Module II drivers LRPPRC CYP4F12 ATP2B4PRSS35 IL15MMP19 Module I interactors Module II interactors MAP3K3

CDE1 1 1 Module I & high interactors mutation load Module I & high interactors mutation load Module II & high interactors mutation load Module I & low interactors mutation load Module I & low interactors mutation load Module II & low interactors mutation load Censored Censored 0.8 0.8 0.8 Censored Training data Log-rank P = 0.08 Test data Log-rank P = 0.04 Training data Log-rank P = 0.005

0.6 n = 3 0.6 0.6

0.4 0.4 0.4

n = 40 0.2 0.2 0.2 Estimated PFS functions Estimated survival functions n = 43 n = 96 Estimated survival functions n = 76 n = 9 0 0 0 0 1000 2000 3000 4000 5000 6000 0102030405060 0 1000 2000 3000 4000 Days Months Days FG 1 1 SAMD4A 1 PRSS35 1 EPN1 Module II & high interactors mutation load P = 0.01 P = 2e-5 P = 0.003 Module II & low interactors mutation load 0.8 0.8 0.8 0.8 Censored 0.6 0.6 0.6 ESF ESF Test data Log-rank P = 0.07 ESF 0.4 0.4 0.4 0.2 0.2 0.2 0.6 0 0 0 0 1000 2000 3000 4000 0 1000 2000 3000 4000 0 1000 2000 3000 4000 Days Days Days 0.4 1 RPAP1 P = 0.02 0.8 0.2 n = 5 Estimated PFS functions n = 36 0.6

ESF 0.4 Module I & mutated interactors 0 0.2 Module I & not mutated interactors 0 102030405060 0 Censored Months 0 1000 2000 3000 4000 Days

Fig. 6. Modules of drivers and interactors in lung cancer. (A and B) The complete networks of interactions between modules of major lung cancer drivers (modules 1 and 2, respectively) and their predicted shared interactors. (C–F) Kaplan–Meier survival curves of TCGA colon cancer samples (overall survival; C and E) and the lung cancer test set samples (PFS; D and F) of samples with high mutation rate of driver modules I and II, respectively, with high vs. low number of mutation in the interactors of these modules (defined by the median). (G) Kaplan–Meier survival curves of TCGA colon cancer samples with high mutation rate of driver module I (defined by the median), with vs. without mutations of individual interactors of module I.

mutations in major drivers for the respective cancer types. Hence, data to advance our understanding of tumor initiation and pro- these conditional drivers could be readily predicted notwith- gression through the dissection of the sequence of evolutionary standing the low accuracy of prediction for most of the major events. drivers themselves. To summarize, in this work, we present the application of LSTM Materials and Methods for learning the stepwise sequence of mutations in tumors. This Mutation Data. Mutational data of colon and lung adenocarcinoma from approach is shown to efficiently tackle several tasks that are not TCGA were used for training throughout this work and were obtained from amenable to standard techniques, such as prediction of the occur- Xena browser (63). Datasets used for testing were obtained from cBioPortal rence of mutations and reconstruction of mutational data. Our (64, 65). The datasets are summarized in Table 1, spanning 1,626 samples findings reveal a unidirectional relation between driver and pas- overall, each derived from a distinct tumor. senger mutations: drivers determine the course of tumorigenesis, Driver Mutation Lists for Lung and Colon Cancers. The union of driver genes and their occurrence is difficult to predict from the rest of the obtained from (66) and (67) was used for lung and colon cancers (Dataset S7). mutational landscape, whereas passengers are often linked to spe- cific drivers and so can be predicted with confidence. Thus, drivers Preprocessing and Sorting Mutational Data. To generate the binary sequence are indeed in the driver’s seat and bring with them a host of as- of mutations, we first discretize the mutational data assigning 1 for each sociated passengers, some of which could be secondary, conditional nonsynonymous mutation. For each cancer type, we then consider all genes drivers. The present results support the notion that long-term de- that are mutated at least once in all datasets used (totals of 12,322 and pendencies between genes involved in tumorigenesis and cancer 12,327 mutated genes for colon and lung cancer, respectively). progression are widespread in tumor evolution and can be learned To sort the mutations of colon and lung adenocarcinoma by their esti- from the mutational sequence using LSTM networks and similar mated temporal order, we evaluate the following function for each TCGA approaches. We show that this notion holds for colon cancer, where training dataset: P the stepwise process of mutation acquisition is established, but also X s = 1& s = 1 = Pall samples s gi gj for lung cancer for which such a stepwise model has been suggested Order ScoreðgiÞ = = , [1] all samples ssgi 1& sgj 0 but remains controversial. Similar strategies could be readily all genes gj

employed for other tumor types and different types of biological where sgi = 1 if sample s has a mutation in gene gi. We calculate the order

9508 | www.pnas.org/cgi/doi/10.1073/pnas.1901695116 Auslander et al. Downloaded by guest on September 25, 2021 score for each considered gene using the TCGA datasets that are used for datasets. To assign the occurrence of every other mutation t to the reconstructed training and use that score to sort the test datasets. Order scores calculated samples, we train LSTMt to predict the occurrence of t from the sequence using the test datasets were found to significantly correlate with that derived starting from the mutation ordered last up to time point t + 1 and apply LSTMt from the training set for both colon and lung cancers (SI Appendix,Fig.S9). to the simulated sequence (synthesized up to time point t + 1),toobtaina vector of scores predicting the occurrence of mutation t in each reconstructed LSTM Machines. Each LSTM network unit defines a time t inasequence(the sample. We then use freqt, the frequency of the t ordered mutation in the subscript t denotes the mutation that is ordered t in the tumor evolution via genuine datasets, and assign 1 to the freqt mutations that were assigned with the order score defined above), and is composed of the following components: highest scores by LSTMt when applied to the reconstructed sequences. PCA was applied to the integration of all datasets (Fig. 2 D and E and SI = σ + + ft gðWf xt Uf ht−1 bf Þ, [2] Appendix, Fig. S5 A–D) or, when inferring the PCA coefficient, without the training sets (SI Appendix, Fig. S5 E and F). it = σgðWi xt + Uiht−1 + bi Þ, [3] Training LSTMs to Identify Mutations That Interact with the Major Cancer o = σ ðW x + U h − + b Þ, [4] t g o t o t 1 o Drivers and Assigning Major Drivers to Interacting Mutations. We select the driver genes in which mutations are observed frequently in our training sets c = f ⊙c − + i ⊙σ ðW x + U h − + b Þ, [5] t t t 1 t c f t f t 1 f (top 0.1 percentile). These genes are defined as major drivers and are used as an ordered sequence of mutations (n = 42 for colon and n = 26 for lung; ht = ft ⊙σcðct Þ, [6] Dataset S4). The sequence of occurrences of these major drivers is used to

where the initial values are c0 = 0 and h0 = 0. ⊙ denotes the Hadamard predict the occurrence of other mutations, excluding those with very low product. frequency (genes that are mutated in three samples or less) as the prediction xt are the input vectors to the LSTM unit (an ordered sequence of mu- of those could be obtained easily by chance. We hence trained 42 LSTMs for tations). ft , it, and ot are the activation vectors for the forget gate, input colon and 26 for lung (using the sequence of major drivers up to each time > gate and output gate, respectively. ht is the output vector of the LSTM point). The genes that could be predicted with AUC 0.85 for the test set unit, and ct is cell state vector. W and U are the weight matrices, and b re- repeatedly, from multiple locations in the sequence of drivers, are selected presents the bias matrices that are learned during training. σ represents as predicted interactors of the major drivers. the nonlinear functions, where σg is the gate activation, sigmoid function We then use the scores produced by the LSTMs where the driver- −x −1 σgðxÞ = ð1 + e Þ , and σc is the state activation, tanh (hyperbolic tangent) interacting gene is well predicted and correlate them with the occurrence = e2x − 1 of each of the major drivers. The major drivers whose occurrence is signifi- function,tanhðxÞ 2x + . e 1 cantly correlated (Spearman P value < 0.05) with the LSTM scores predicting

All LSTM networks used in this work are sequence-to-label LSTMs with five GENETICS – hidden layers and were trained using Adam optimizer (68), where the maximum a given driver-interacting gene are combined with it into driver interactor number of epochs for training was set to 100 for the mutational load prediction pairs. A detailed graphical schema describing the steps of this analysis can be and to 10 for all other prediction tasks. The minibatch size used for each training found in SI Appendix, Fig. S10 iteration was set to 27, with a standard gradient-clipping threshold set to 1. Enrichment with STRING Interactions and GO Analysis. To investigate whether Training LSTMs to Predict the Mutational Load. We train LSTMs for the se- the pairs of major drivers and their predicted interactors are enriched with quence up to each time point when starting once from the mutation ordered established interactions, we performed the following analyses: (i) STRING last, and second from the mutation ordered first in the sequence. The two- enrichment, in which hypergeometric enrichment analysis was performed for each major driver gene, to find if its LSTM-predicted interactors are categorical labels Ls ∈ f0,1g are defining low (0, median). For each time point t,anLSTMt is trained on the training set using the sequence up to t and is then applied to the test set to ment, where for each major driver, we calculated the percentage of its as- predict the mutational load using the time sequence of mutations up to t. sociated interactors that share a significant number of GO processes with it For each time point t, the resulting scores are used to evaluate the perfor- (hypergeometric P value < 0.05) and the mean percentage of overlapping mance via two measures: (i)theAUCt resulting from an ROC curve, pre- GO pathways with its interactors. We then calculate an empirical P value dicting the low vs. high categories in the test, and (ii) the Spearman rank from 1,000 repetitions, with drivers randomly assigned to the identified ρ interacting mutations (maintaining the same degree). correlation coefficient t between the LSTMt scores that are assigned to each test sample and the actual mutational load of the sample. Modules of Major Drivers and Interacting Genes. To cover as many drivers as Training KNN, SVM, and Logistic Regression Classifiers to Predict the possible, we created modules via a heuristic search using 10,000 repetitions of Mutational Load. KNN (with K = 5), SVM (using linear kernel), and logistic the following genetic algorithm. Starting with a randomly selected major regression classifiers were trained on the training sets using (i) sequences of driver, in each round, we randomly selected a major driver that had not yet the latest ordered mutations that were used as input to the LSTMs and (ii) been added to the module and added it to the module if its addition did not sequences of 50 randomly selected mutations. These were trained to predict decrease the number of mutual module-interactors by more than 20%. The low vs. high mutational load as described for the LSTMs and applied to the round ended when 100 random selections were not added to the current test sets, and the resulting classification scores were correlated with the true module or when the number of interactors of a module was less than 3. mutational load via Spearman rank correlation. Finally, we investigated the 10,000 modules and selected those that together cover maximal number of major drivers such that the modules were mutually Training LSTMs to Predict the Occurrence of Succeeding Mutations in the Time exclusive with respect to both drivers and interactors. Sequence. To predict the occurrence of a mutation ordered t in the sequence, we train LSTM for the sequence starting from the mutation ordered last up Survival Analyses. All Kaplan–Meier analyses are performed by comparing to time point t + 1 using the training sets and test their performance for the survival of patients with high scores (module mutational count larger predicting these occurrences in the test sets, where the two-categorial labels than median) to those with low scores, using a one-sided log-rank test. Ls ∈ f0,1g are defining the occurrence of the mutation. Code Availability. All code was implemented in MATLAB_R2018a using Deep Training LSTM Networks to Simulate Mutational Data. To synthesize muta- Learning Toolbox and is publicly available through GitHub: https://github. tional data, we use the full mutational data for each cancer type. We re- com/noamaus/LSTM-Mutational-series. construct 100 simulated mutational samples for each tumor type, one mutation at a step, from the last-ordered mutation to the first. ACKNOWLEDGMENTS. We thank E.V.K. group members for helpful discus- For the last-ordered mutation we randomly assign 1 to reconstructed sions. The authors’ research is supported by the Intramural Research Pro- samples corresponding to the frequency of the mutation in the genuine gram of the National Institutes of Health (National Library of Medicine).

1. Vogelstein B, Kinzler KW (1993) The multistep nature of cancer. Trends Genet 9:138–141. 3. Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome. Nature 458: 2. Farber E (1984) The multistep nature of cancer development. Cancer Res 44: 719–724. 4217–4223. 4. Vogelstein B, et al. (2013) Cancer genome landscapes. Science 339:1546–1558.

Auslander et al. PNAS | May 7, 2019 | vol. 116 | no. 19 | 9509 Downloaded by guest on September 25, 2021 5. Pleasance ED, et al. (2010) A comprehensive catalogue of somatic mutations from a 39. Bozic I, et al. (2010) Accumulation of driver and passenger mutations during tumor human cancer genome. Nature 463:191–196. progression. Proc Natl Acad Sci USA 107:18545–18550. 6. McFarland CD, Korolev KS, Kryukov GV, Sunyaev SR, Mirny LA (2013) Impact of del- 40. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9: eterious passenger mutations on cancer progression. Proc Natl Acad Sci USA 110: 2579–2605. 2910–2915. 41. Szklarczyk D, et al. (2015) STRING v10: Protein-protein interaction networks, in- 7. Persi E, Wolf YI, Leiserson MDM, Koonin EV, Ruppin E (2018) Criticality in tumor tegrated over the tree of life. Nucleic Acids Res 43:D447–D452. evolution and clinical outcome. Proc Natl Acad Sci USA 115:E11101–E11110. 42. Szklarczyk D, et al. (2017) The STRING database in 2017: Quality-controlled protein- 8. Tanaka T (2009) Colorectal carcinogenesis: Review of human and experimental animal protein association networks, made broadly accessible. Nucleic Acids Res 45: studies. J Carcinog 8:5. D362–D368. 9. Loeb LA, Harris CC (2008) Advances in chemical carcinogenesis: A historical review and 43. Zawel L, et al. (1998) Human Smad3 and Smad4 are sequence-specific transcription – prospective. Cancer Res 68:6863 6872. activators. Mol Cell 1:611–617. 10. Knudson AG (2001) Two genetic hits (more or less) to cancer. Nat Rev Cancer 1: 44. Ahn Y-H, et al. (2011) Map2k4 functions as a tumor suppressor in lung adenocarci- – 157 162. noma and inhibits tumor cell invasion by decreasing peroxisome proliferator- 11. Teixeira MR, Heim S (2005) Multiple numerical chromosome aberrations in cancer: activated γ2 expression. Mol Cell Biol 31:4270–4285. What are their causes and what are their consequences? Semin Cancer Biol 15:3–12. 45. Baba Y, et al. (2010) HIF1A overexpression is associated with poor prognosis in a 12. Gillies RJ, Verduzco D, Gatenby RA (2012) Evolutionary dynamics of carcinogenesis cohort of 731 colorectal cancers. Am J Pathol 176:2292–2301. and why targeted therapy does not work. Nat Rev Cancer 12:487–493. 46. Papageorgis P, et al. (2011) Smad4 inactivation promotes malignancy and drug re- 13. Vogelstein B, et al. (1988) Genetic alterations during colorectal-tumor development. sistance of colon cancer. Cancer Res 71:998–1008. N Engl J Med 319:525–532. 47. Harris CC (1996) tumor suppressor gene: At the crossroads of molecular carci- 14. Armaghany T, Wilson JD, Chu Q, Mills G (2012) Genetic alterations in colorectal nogenesis, molecular epidemiology, and cancer risk assessment. Environ Health cancer. Gastrointest Cancer Res 5:19–27. Perspect 104:435–439. 15. Fearon ER (1992) Genetic alterations underlying colorectal tumorigenesis. Cancer Surv 48. Freed-Pastor WA, Prives C (2012) Mutant p53: One name, many proteins. Genes Dev 12:119–136. – 16. Lurje G, Zhang W, Lenz H-J (2007) Molecular prognostic markers in locally advanced 26:1268 1286. colon cancer. Clin Colorectal Cancer 6:683–690. 49. The Gene Ontology Consortium (2017) Expansion of the gene ontology knowledge- – 17. Noguchi M (2010) Stepwise progression of pulmonary adenocarcinoma–Clinical and base and resources. Nucleic Acids Res 45:D331 D338. molecular implications. Cancer Metastasis Rev 29:15–21. 50. Ashburner M, et al.; The Gene Ontology Consortium (2000) Gene ontology: Tool for – 18. Sanada Y, et al. (2006) Histopathologic evaluation of stepwise progression of pan- the unification of biology. Nat Genet 25:25 29. creatic carcinoma with immunohistochemical analysis of gastric epithelial transcrip- 51. Li G, Hu F, Luo X, Hu J, Feng Y (2017) SIX4 promotes metastasis via activation of the tion factor : Comparison of expression patterns between invasive components PI3K-AKT pathway in colorectal cancer. PeerJ 5:e3394. and cancerous or nonneoplastic intraductal components. Pancreas 32:164–170. 52. Lamprecht S, et al. (2018) PBX3 is part of an EMT regulatory network and indicates 19. Yatabe Y, Borczuk AC, Powell CA (2011) Do all lung adenocarcinomas follow a poor outcome in colorectal cancer. Clin Cancer Res 24:1974–1986. stepwise progression? Lung Cancer 74:7–11. 53. Tessneer KL, et al. (2013) Epsin family of endocytic adaptor proteins as oncogenic 20. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9: regulators of cancer progression. J Cancer Res Updates 2:144–150. 1735–1780. 54. Ryan BM, Faupel-Badger JM (2016) The hallmarks of premalignant conditions: A 21. Williams RJ, Zipser D (1989) A learning algorithm for continually running fully re- molecular basis for cancer prevention. Semin Oncol 43:22–35. current neural networks. Neural Comput 1:270–280. 55. Shin DM, et al. (1994) Activation of p53 in premalignant lesions 22. Gers FA, Schmidhuber J, Cummins F (2000) Learning to forget: Continual prediction during head and neck tumorigenesis. Cancer Res 54:321–326. with LSTM. Neural Comput 12:2451–2471. 56. Viola MV, Fromowitz F, Oravez S, Deb S, Schlom J (1985) ras p21 expression 23. Schmidhuber J, Wierstra D, Gomez F (2005) Evolino: Hybrid neuroevolution/optimal is increased in premalignant lesions and high grade bladder carcinoma. J Exp Med linear search for sequence learning. IJCAI International Joint Conference on Artificial 161:1213–1218. Intelligence, (IJCAI, Edinburgh), pp 853–858. 57. Enomoto T, et al. (1991) K-ras activation in premalignant and malignant epithelial 24. Graves A (2013) Generating sequences with recurrent neural networks. arXiv: lesions of the human uterus. Cancer Res 51:5308–5314. 13080850. 58. Lipton ZC, Berkowitz J, Elkan C (2015) A critical review of recurrent neural networks 25. Sundermeyer M, Schlueter R, Ney H (2012) LSTM neural networks for language for sequence learning. arXiv:1506.00019. modeling. INTERSPEECH 2012. 59. Shi X, et al. (2015) Convolutional LSTM network: A machine learning approach for 26. Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text precipitation nowcasting. arXiv:1506.04214. ’ classification. AAAI 15 Proceedings of the Twenty-Ninth AAAI Conference on 60. Graves A, Fernandez S, Gomez F, Schmidhuber J (2006) Connectionist temporal clas- – Artificial Intelligence (AAAI Press, Menlo Park, CA), pp 2267 2273. sification: Labelling unsegmented sequence data with recurrent neural networks. 27. Sutskever I, Martens J, Hinton G (2011) Generating text with recurrent neural net- Proceedings of the 23rd International Conference on Machine Learning (ICML), (ACM, works. Proceedings of the 28th International Conference on Machine Learning New York), pp 369–376. ’ – (ICML 11), (Omnipress, Bellevue, WA), pp 1017 1024. 61. Lyu GY, Yeh YH, Yeh YC, Wang YC (2018) Mutation load estimation model as a 28. Weinstein JN, et al.; Cancer Genome Atlas Research Network (2013) The cancer ge- predictor of the response to cancer immunotherapy. NPJ Genomic Med 3:12. nome Atlas pan-cancer analysis project. Nat Genet 45:1113–1120. 62. Roszik J, et al. (2016) Novel algorithmic approach predicts tumor mutation load and 29. Attolini CS, et al. (2010) A mathematical framework to determine the temporal se- correlates with immunotherapy clinical outcomes using a defined gene mutation set. quence of somatic genetic events in cancer. Proc Natl Acad Sci USA 107:17604–17609. BMC Med 14:168. 30. Cheng YK, et al. (2012) A mathematical methodology for determining the temporal 63. Goldman M, et al. (2018) The UCSC Xena platform for cancer genomics data visuali- order of pathway alterations arising during gliomagenesis. PLOS Comput Biol 8: zation and interpretation. bioRxiv:10.1101/326470. e1002337. 64. Gao J, et al. (2013) Integrative analysis of complex cancer genomics and clinical 31. Desper R, et al. (2000) Distance-based reconstruction of tree models for oncogenesis. profiles using the cBioPortal. Sci Signal 6:pl1. J Comput Biol 7:789–803. 65. Cerami E, et al. (2012) The cBio cancer genomics portal: An open platform for ex- 32. Höglund M, et al. (2001) Multivariate analyses of genomic imbalances in solid tumors – reveal distinct and converging pathways of karyotypic evolution. Genes ploring multidimensional cancer genomics data. Cancer Discov 2:401 404. Cancer 31:156–171. 66. Bailey MH, et al.; MC3 Working Group; Cancer Genome Atlas Research Network 33. Lin SH, et al. (2018) The somatic mutation landscape of premalignant colorectal ad- (2018) Comprehensive characterization of cancer driver genes and mutations. Cell – enoma. Gut 67:1299–1305. 173:371 385.e18. 34. Bamford S, et al. (2004) The COSMIC (catalogue of somatic mutations in cancer) da- 67. Rubio-Perez C, et al. (2015) In silico prescription of anticancer drugs to cohorts of tabase and website. Br J Cancer 91:355–358. 28 tumor types reveals targeting opportunities. Cancer Cell 27:382–396. 35. Tate JG, et al. (2018) COSMIC: The catalogue of somatic mutations in cancer. Nucleic 68. Kingma DP, Ba JL (2015) Adam: A method for stochastic optimization. arXiv: Acids Res 47:D941–D947. 1412.6980. 36. Wolff RK, et al. (2018) Mutation analysis of adenomas and carcinomas of the colon: 69. Giannakis M, et al. (2016) Genomic correlates of immune-cell infiltrates in colorectal Early and late drivers. Genes Chromosomes Cancer 57:366–376. carcinoma. Cell Rep 15:857–865. 37. Aithal A, et al. (2018) MUC16 as a novel target for cancer therapy. Expert Opin Ther 70. Seshagiri S, et al. (2012) Recurrent R-spondin fusions in colon cancer. Nature 488: Targets 22:675–686. 660–664. 38. Sur I, Neumann S, Noegel AA (2014) Nesprin-1 role in DNA damage response. Nucleus 71. Imielinski M, et al. (2012) Mapping the hallmarks of lung adenocarcinoma with 5:173–191. massively parallel sequencing. Cell 150:1107–1120.

9510 | www.pnas.org/cgi/doi/10.1073/pnas.1901695116 Auslander et al. Downloaded by guest on September 25, 2021