bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther Rheinbay1*, Morten Muhlig Nielsen2*, Federico Abascal3*, Grace Tiao1#, Henrik Hornshøj2#, Julian M. Hess1#, Randi Istrup Pedersen2#, Lars Feuerbach4, Radhakrishnan Sabarinathan5,6, Tobias Madsen2, Jaegil Kim1, Loris Mularoni5,6, Shimin Shuai18, Andrés Lanzós8,9, Carl Herrmann10,11, Yosef E. Maruvka1,12, Ciyue Shen13,14, Samirkumar B. Amin15,16, Johanna Bertl2, Priyanka Dhingra17,18, Klev Diamanti19, Abel Gonzalez-Perez5,6, Qianyun Guo20, Nicholas J. Haradhvala1,12, Keren Isaev21,22, Malene Juul2, Jan Komorowski19,23, Sushant Kumar24, Donghoon Lee24, Lucas Lochovsky25, Eric Minwei Liu17,18, Oriol Pich5,6, David Tamborero5,6, Husen M. Umer19, Liis Uusküla-Reimand26,27, Claes Wadelius28, Lina Wadi21, Jing Zhang24, Keith A. Boroevich29, Joana Carlevaro-Fita8,9, Dimple Chakravarty30, Calvin W.Y. Chan10,33, Nuno A. Fonseca31, Mark P. Hamilton32, Chen Hong4,33, Andre Kahles34, Youngwook Kim35, Kjong-Van Lehmann34, Todd A. Johnson29, Abdullah Kahraman36, Keunchil Park35, Gordon Saksena1, Lina Sieverling4,33, Nicholas A. Sinnott-Armstrong37, Peter J. Campbell3,38, Asger Hobolth20, Manolis Kellis1,39, Michael S. Lawrence1,12, Ben Raphael40, Mark A. Rubin41,42,43, Chris Sander13,14, Lincoln Stein7,21,44, Josh Stuart45, Tatsuhiko Tsunoda46,47,48, David A. Wheeler49, Rory Johnson50, Jüri Reimand21,22, Mark B. Gerstein51,52,53, Ekta Khurana17,18,42,43, Núria López-Bigas5,6,54, Iñigo Martincorena3&, Jakob Skou Pedersen2,20&, Gad Getz1,12,55,56& on behalf of the PCAWG Drivers and Functional Interpretation Group and the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Network

* These authors contributed equally # These authors contributed equally & These authors contributed equally

Abstract Discovery of cancer drivers has traditionally focused on the identification of - coding . Here we present a comprehensive analysis of putative cancer driver mutations in both protein-coding and non-coding genomic regions across >2,500 whole cancer genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium. We developed a statistically rigorous strategy for combining significance levels from multiple driver discovery methods and demonstrate that the integrated results overcome limitations of individual methods. We combined this strategy with careful filtering and applied it to protein-coding genes, promoters, untranslated regions (UTRs), distal enhancers and non-coding RNAs. These analyses redefine the

1 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

landscape of non-coding driver mutations in cancer genomes, confirming a few previously reported elements and raising doubts about others, while identifying novel candidate elements across 27 cancer types. Novel recurrent events were found in the promoters or 5’UTRs of TP53, RFTN1, RNF34, and MTG2, in the 3’UTRs of NFKBIZ and TOB1, and in the non-coding RNA RMRP. We provide evidence that the previously reported non-coding RNAs NEAT1 and MALAT1 may be subject to a localized mutational process. Perhaps the most striking finding is the relative paucity of point mutations driving cancer in non-coding genes and regulatory elements. Though we have limited power to discover infrequent non-coding drivers in individual cohorts, combined analysis of promoters of known cancer genes show little excess of mutations beyond TERT.

Introduction Discovery of cancer drivers has traditionally focused on the identification of recurrently mutated protein-coding genes. Large-scale projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have profiled the genomes of a number of different cancer types to date, leading to the discovery of many putative cancer genes. Whole-genome sequencing (WGS) has made it possible to systematically survey non-coding regions for potential driver events. Recent studies of single-nucleotide variants (SNVs) and insertions/deletions (indels) detected from WGS data from smaller cohorts of patients have revealed putative candidates for regulatory driver events1–8. However, to date, only a few events have been functionally validated as regulatory drivers, affecting expression of one or more target genes by changing their regulation, RNA stability or disruption of normal genome topology6,9–12. Non-coding RNAs play diverse regulatory roles and are enzymatically involved in key steps of transcription and protein synthesis. Though functional evidence for non-coding RNAs in cancer is accumulating13,14, only few examples have been shown to be the target of recurrent mutation15,16. In contrast to protein-coding regions, the lower coverage and complexity of DNA sequence in these non- coding regions pose additional challenges for high-quality mutation calling, mutation rate estimation and identification of driver events. Adequate statistical methods that address these issues are needed to identify non-coding drivers from these data.

The ICGC and TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) effort, which has collected and systematically analyzed >2,700 cancer genome sequences from >2,500 patients representing a variety of cancer types17, offers an unprecedented opportunity to perform a comprehensive analysis of putative coding and non-coding driver events. Here,

2 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

we describe results from multiple methods for detecting such drivers based on somatic SNVs and indels and a framework for integrating them. Using this approach, we identify significantly mutated genomic elements of various types in individual cancer types and across cancer types. We then use additional sources of information to remove from the initial list of significant elements those likely to result from mapping issues caused by repetitive regions or by inaccurate estimation of the local background mutation rate, and classify recurrently mutated elements as putative cancer-drivers or as likely false positives. Finally, we estimate our power to discover novel drivers in the PCAWG data set and evaluate the overall density of non-coding regulatory driver mutations around known cancer genes.

Overview of recurrent mutations across cancer types Many protein-coding driver mutations and the two well-studied TERT promoter mutations occur in frequently mutated single base genomic “hotspots”. In order to obtain an initial view of these in the PCAWG data set, we generated a genome-wide list of highly mutated hotspots by ranking all sites in the genome based on the number of cancers that harbor somatic mutations in them. Only 12 genomic sites were recurrently somatically mutated in more than 1% (26) of cancers and 106 in more than 0.5%, while the vast majority (93.19%) of somatic mutations were private to a single patient’s cancer (Methods; Extended Data Fig. 1). Interestingly, although protein-coding regions span only ~1% of the genome, 15 (30%) of the 50 most frequently mutated sites were known amino acid altering hotspots in known cancer genes (in KRAS, BRAF, PIK3CA, TP53, and IDH1) (Fig. 1a). Also among the top hits were the two TERT promoter hotspots. The top 50 list further contained a high proportion of mutations almost exclusively contributed by lymphoid malignancies (Lymph- BNHL and Lymph-CLL) (18 hotspots) and melanomas (6). The melanoma-specific hotspots were located in regions occupied by transcription factors (5/6 in promoters), a phenomenon that has recently been attributed to reduced nucleotide excision repair (NER) of UV-induced DNA damage17–19. Indeed, signature analysis of the melanoma hotspot mutations attributed them to UV radiation (Fig. 1b). The hotspots in lymphoid malignancies reside in the IGH locus on 14, a known phenomenon in B-cell-derived cancers where IGH has undergone somatic hypermutation by the activation-induced cytosine deaminase (AID), an observation confirmed by mutational signature analysis (Fig. 1b). Hotspots in an intron of GPR126 and in the promoter of PLEKHS13,20 were located inside palindromic DNA that may fold into hairpin structures and expose single-stranded DNA loops to APOBEC enzymes3; accordingly, these mutations had high APOBEC signature probabilities. The six remaining non-coding hotspots could be attributed to a combination of mutational processes and presumed technical artifacts (Supplementary Note). In contrast to these non-coding and cancer-specific sites, mutations in protein-coding hotspots were attributed to the aging

3 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

signatures, or to a mixture of signatures. Our observation that, with the exception of the TERT promoter hotspots, all of the top non-coding hotspots are generated by highly localized, cancer-specific mutational processes suggests that these may be passengers.

Next, we sought to systematically identify genomic elements that drive cancer by comprehensively analyzing SNVs and indels in transcribed and regulatory genomic elements, including protein-coding genes, promoters, 5’UTRs, 3’UTRs, splice sites, distal regulatory elements/enhancers, lncRNA genes, short ncRNAs, and miRNAs (in total ~4% of the genome) (Methods; Fig. 1c; Extended Data Fig. 2; Supplementary Table 1). Drivers can be common across many cancer types (such as TP53, PIK3CA, PTEN, and TERT promoter) or highly specific (e.g. CIC and FUBP1 in oligodendroglioma, BCR-ABL in chronic myelogenous leukemia, and FOXA1 in breast and prostate cancers). Analysis of large mixed tumor cohorts increases the power to discover common drivers but dilutes the signal from cancer-specific ones. In order to maximize our ability to detect common and cancer-specific drivers, we performed analyses of individual tumor types, tumors grouped by their tissue of origin or organ site (“meta-cohorts”), as well as a pan-cancer set (Fig. 1d; Methods). This aggregation of tumors by tissue or organ site further allowed us to take advantage of patients from cohorts too small to analyze individually (<20 patients). Overall, we analyzed 2,583 unique patient samples in 27 individual tumor types and 15 meta-cohorts.

Discovery of driver elements In order to identify bona fide drivers in each cohort, we collected and integrated results from multiple driver discovery methods. These methods identify statistically significant elements based on SNVs and indels and evidence from one or more of three criteria: (i) mutational burden, (ii) functional impact of mutation changes, and (iii) clustering of mutation sites in hotspots (Fig. 2a; Methods; Supplementary Table 2). Generally, mutational burden and clustering of events are independent of the type of genomic element that is tested for significance. The ability to use measures of functional impact, however, may vary greatly depending on the element type. In protein-coding genes, functional impact is assessed through the amino acid change introduced by a given mutation. For some non-coding elements, the impact of mutations on transcription factor binding sites, miRNA binding sites, or functional RNA structure may be predicted. However, in general, the exact consequences of somatic mutations are harder to assess in non-coding regions than in protein-coding regions, and may even vary substantially across cell types.

Integration of results from multiple driver discovery methods

4 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Most cancer genomic studies to date have employed a single method to identify putative cancer driver elements. This can introduce biases because the number and type of drivers detected by individual methods depends on their underlying assumptions and statistical model. To reach a comprehensive consensus list of candidate drivers, we collected the p- values calculated by up to 16 driver discovery methods for each genomic element (Supplementary Table 2, 3) and derived a general strategy to integrate them (Fig. 2a). Our strategy accounts for the correlation of p-values among driver discovery methods that use similar approaches (burden, clustering, functional impact) when applied to the same data, while accumulating the independent evidence provided by different methods (Fig. 2b). To integrate potentially correlated statistical results, we used Brown’s method for combining p- values21, an extension of the popular Fisher’s method22 (Fig. 2c). We first used simulated data sets that preserve local mutation rates and signatures but do not contain drivers, to assess the specificity of each method and the correlation structure among different methods when no genomic element is expected to have a significant excess or pattern of mutations (Methods; Extended Data Fig. 3). As anticipated, p-values were correlated among methods that share similar approaches. We then used the resulting correlation structure to integrate p-values from the observed (real) mutation data from different methods into a single p-value for each genomic element. Since simulated data sets are based on assumptions, we tested whether we can directly estimate the correlation structure from p-values calculated on the observed data. As this simpler approach yielded analogous results, it was used throughout this study (Methods, Extended Data Fig. 3). After additional filtering steps (see below), we conservatively controlled for multiple hypothesis testing using the Benjamini-Hochberg procedure (BH)23 across the 27 individual and 15 meta-cohorts, for each element type separately (Fig. 2e). Cohort-element combinations (hits) with q<0.1 (10% false discovery rate [FDR]) were considered significant. Overall, the union of all hits from the individual driver discovery methods contained 3,048 cohort-element pairs involving 1,694 unique elements, whereas the integrated results contained 1,406 significant hits with 635 unique elements (Supplementary Table 4, 5).

Flagging and removal of potential false positive hits Even after careful variant calling, false positive identification of driver loci can arise from residual inaccuracies in background models, residual sequencing and mapping artifacts or from local increases in the burden of mutations generated by as-yet unmodeled mutational processes. To minimize technical issues, we carefully reviewed and filtered all candidate driver elements based on low-confidence mappability regions and site-specific noise in a panel of normal samples from PCAWG (Fig. 2d; Fig. 3a,b; Methods). For example, the lncRNA RN7SK is a small nuclear RNA with many homologous regions in the genome,

5 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

making it prone to read mapping errors (Fig. 3a), while a fraction of LEPROTL1 mutations fell in positions flagged by the site-specific noise filter (Fig. 3b).

In addition, mutational processes not properly captured by current background models can cause local accumulation of mutations. As shown in the hotspot analysis above, events in the promoters of PIM1 and RPL13A are strongly associated with the AID and UV mutational processes, respectively, as previously noted18,19,24 (Fig. 3c). Moreover, DNA palindromes may form hairpin structures that expose loop bases to APOBEC enzymes, as has been proposed for PLEKHS1, GPR126, TBC1D12 and LEPROTL12,3 (Fig. 3d). Non-coding RNAs that form secondary structures often contain palindromic sequences, which could be preferential targets for APOBEC enzymes. However, only a small fraction of the mutations in structural RNAs appear to be APOBEC derived overall (4.8%) (Supplementary Note). In total, 569 out of 1,406 hits (element-cohort combinations with q<0.1) and 383 out of 635 unique elements that were initially significant, were filtered out, resulting in 837 hits (Fig. 3e; Supplementary Tables 4, 5). Finally, multiple hypothesis correction was repeated after assigning a non-significant p-value (P = 1) to the filtered out hits, resulting in further removal of 91 candidate element-cohort combinations.

Performance of the integration method To evaluate the sensitivity and validity of the integration approach, we compared the performance of individual methods and of the integration on protein-coding genes using, as a gold-standard, a list of 603 known recurrently-mutated cancer genes in the Cancer Census25 (CGC v80). These analyses revealed that, typically, the integration of different methods outperformed individual methods, both in terms of sensitivity and specificity, yielding lists of significant genes that were longer and more enriched in high-confidence cancer genes (Extended Data Fig. 4).

Discovery of recurrently mutated elements Our conservative integration and filtering strategy yielded 746 total hits, which include 646 hits in 157 protein-coding genes, 30 hits in seven protein-coding promoters, 26 hits in eight long non-coding RNAs, 18 hits in six non-coding RNA promoters, ten hits in six 3’UTRs, six hits in four 5’UTRs, six hits in one enhancer, three hits in one microRNA and one hit in a small RNA candidate (Fig 4a; Supplementary Table 5). The number of significant elements varied from just one in clear-cell renal cancer to 78 in the Carcinoma meta-cohort (Fig. 4b) and depended strongly on cohort size (r = 0.84; P = 1.9x10-8; also see power analysis below), reflecting that the landscape of driver candidates, particularly in small tumor cohorts, is still incomplete.

6 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Comparison of the p-values obtained from individual tumor types and meta-cohorts revealed that although most candidate drivers gained significance in larger meta-cohorts, the tumor- type specific genes DAXX (Panc-Endocrine), BRAF (Skin-Melanoma), NRAS (Skin- Melanoma), SPOP (Prost-AdenoCa) and hsa-mir-142 (Lymph-BNHL) scored higher in their respective tumor types (Extended Data Fig. 5). These results emphasize the need for careful tradeoff between tumor-type specificity and cohort size when searching for drivers.

Protein-coding mutations. The ability to discover known protein-coding cancer genes from WGS data provided the opportunity to evaluate our strategy and to put non-coding driver hits in context (Extended Data Fig. 6). Overall, candidate coding drivers were concordant with previous results: of the 157 genes significant in at least one patient cohort, 64% are listed in the CGC and 87% are in a more comprehensive list of cancer genes (https://doi.org/10.1101/190330). In contrast to prior studies based on large exome sequencing data sets, the moderate number of patients per cancer type in this data set provided sufficient power to detect only the genes with the strongest signal. Indeed, when we relaxed the significance threshold (q<0.25) to detect hits “near significance”, we found 93 additional hits in 62 unique genes. Half (31) of these genes were already discovered as significant (q<0.1) in at least one cohort in this study, and now became significant in additional cohorts. Of the 31 genes not previously identified by our analyses, 19% were in the CGC and 32% in the more comprehensive list of drivers. These results confirm that many cancer genes were just beyond our significance threshold and would likely have been discovered with larger cohorts (Supplementary Table 4).

Non-coding driver candidates In contrast to protein-coding elements, there was far less agreement among significance methods when analyzing non-coding elements, and most hits were strongly supported by only a few methods (Extended Data Fig. 7). The limited agreement between driver discovery methods in non-coding elements is likely due to a weaker signal of selection and different underlying assumptions that imperfectly model background mutation frequencies in non-coding regions. Thus, in order to nominate a significant element as a candidate driver, we carefully reviewed the supporting evidence from the genomic data and sought additional evidence from PCAWG (chromosomal breakpoints, copy-number, loss-of-heterozygosity and expression data), cancer gene databases and the literature (Supplementary Table 6).

Promoters and 5’UTRs. Due to the overlap between the downstream regions of promoters and 5’UTRs, we reviewed the 11 significant elements from our promoter and 5’UTR

7 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

analyses together, yielding several interesting candidates. Our integrated results confirmed that the TERT promoter is the most significantly mutated promoter in cancer, being significant in 8 individual tumor types and 11 meta-cohorts (Fig. 4c; Supplementary Table 5). As previously reported, we observed significantly higher TERT transcript expression levels in mutated compared to non-mutated cases (Extended Data Fig. 8).

Among the remaining candidates, we found recurrent mutations in the promoters of PAX5 and RFTN1 in lymphoma. PAX5 is a known off-target of AID hypermutation24 and, indeed, the percentage of its mutations attributable to AID activity (46%) was just below our filtering threshold. Consistently, mutations in its promoter were not associated with gene expression changes, suggesting that these mutations may be passengers (Extended Data Fig. 8). In contrast, mutations in the promoter of RFTN1 were weakly associated with an increase in RFTN1 expression levels (P = 0.03; mutant/wild type fold difference (FD) = 1.2; P = 0.02, after excluding 8/21 mutations attributed to the AID signature) (Fig. 5a). The protein encoded by RFTN1, Raftlin, associates with B-cell receptor complexes26 but its function in lymphoma is unclear.

Mutations called in the 5’UTR of RNF34 in Liver-HCC overlap an intronic DNAse I hypersensitive region in multiple cell types27, suggesting that events may affect expression of RNF34 or a neighboring gene through an intragenic enhancer. Indeed, expression of the histone demethylase KDM2B located downstream of RNF34 shows weak correlation with these mutations (P = 0.03; FD = 1.3; Fig 5b) with no effect on RNF34 mRNA (P = 0.41). Mutations in the 5’ region of MTG2 were concentrated in a hotspot and showed marginally significant decreased expression in the Pan-cancer (P = 0.036; FD = 0.8) and Carcinoma (P = 0.029; FD = 0.8) meta-cohorts (Fig. 5c; Extended Data Fig. 8). MTG2 encodes a little- studied GTPase that associates with the mitochondrial ribosome28. HES1 promoter mutations were significant in Carcinoma and Pan-cancer, but showed no association with gene expression (Extended Data Fig. 8). HES1 is a NOTCH signalling target29, and is focally amplified in gastric cancers (Extended Data Fig. 9).

PTDSS1, DTL (and INTS7, which shares a bidirectional promoter), IFI44L and POLR3E showed trends towards increased or decreased expression in mutated tumors, although the small number of samples prevented us from drawing definitive conclusions (Extended Data Fig. 8). None of these genes were present in significantly amplified or deleted focal peaks (Methods). Validation of these hits with additional data in further studies will be needed to evaluate whether they are genuine drivers.

8 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

As previously reported in a number of studies1,3,7, we also found recurrent mutations in the promoter of WDR74, which was significant in multiple cohorts. The mutations concentrate in a small region overlapping an evolutionarily conserved spliceosomal U2 snRNA and therefore potentially affect splice patterns. However, no significant associations with either WDR74 or transcriptome-wide splicing were seen (Supplementary Note; Methods). We did, however, find that the repetitive U2 sequence was frequently affected by recurrent artifacts in SNP databases, raising concerns about potential mapping errors for this repetitive RNA type (Extended Data Fig. 10).

Finally, restricted hypothesis testing of the promoters of CGC genes (n = 603) revealed significant recurrence of mutations in the promoter region of TP53 (11 patients in the pan- cancer cohort; q = 0.044, NBR method). Most of these mutations were substitutions and deletions affecting either the TSS or the donor splice site of the first non-coding exon of TP53. For the 6 samples with expression data, these mutations were associated with a dramatic reduction of expression (Fig. 5d). In 8 of the 11 mutant cases, the mutation occurred in combination with copy-neutral loss of heterozygosity. This is the first report of a relatively infrequent, but impactful, form of TP53 inactivation by non-coding mutations.

Enhancers. Two enhancer regions were found significant in our integration analysis; one at the IGH locus in Lymph-CLL and the other near TP53TG1 in multiple cohorts. However, many of the recurrent mutations in the enhancer overlapping the IGH locus are due to characteristic somatic hypermutation; 48% of mutations in this enhancer element matched the AID signature, which is just below our filtering threshold. Nine of 18 mutations in the enhancer near TP53TG1 were contributed by esophageal cancers. However, 89% of mutations in these esophageal tumors were attributed to a mutational signature common in this cancer type (mostly T>G substitutions in NTT context), raising the concern that these reflect a localized mutational process rather than a driver event30 (PCAWG7 publication). These mutations were concentrated around a conserved region (Fig. 5e) and overlapped sites bound by NFIC and ZBTB7 transcription factors in HepG2 cells27. TP53TG1 is a non- coding RNA suggested as a tumor suppressor involved in p53 response to DNA damage and has been reported to be epigenetically silenced in cancer31.

3’UTRs. Recurrent somatic events were identified in the 3’UTRs of six genes: FOXA1 in prostate cancer, ALB in liver cancer, SFTPB in lung adenocarcinoma, NFKBIZ and TMEM107 in lymphomas, and TOB1 in Carcinoma, most of which contained a large number of indels (Fig. 4c). High rates of indels in the 3’UTR of ALB in liver cancer and SFTPB in lung cancer have been recently reported8 but it is unclear whether they are cancer drivers or

9 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

the result of a poorly-understood mechanism of localized indel hypermutation. To determine whether indels affect function and accumulate specifically in the 3’UTRs, we compared indel rates in 3’UTRs with other regions around these genes. Consistent with local hypermutation, rather than selection, we observed similar indel rates in 3’UTR and 1 kb downstream of the polyadenylation site in ALB, SFTPB and FOXA1 (Fig. 6a). This, together with additional observations presented below, strongly suggest that indel recurrence at these loci is caused by a novel localized hypermutation process.

On the other hand, although ALB is subject to localized hypermutation, protein-coding SNVs in ALB are enriched in missense and truncating events relative to synonymous mutations in liver cancer (P = 1.5x10-7; Fig. 6b). Furthermore, the ALB locus is lost in a fraction of liver cases (Fig. 6b), with copy losses having a tendency to occur in samples without somatic mutations (Fisher’s exact test, P = 0.0099). These findings, suggest that loss of ALB may be a genuine driver event in liver cancer. Similarly, the well-known driver role of FOXA1 in prostate tumors is supported by the functional impact of coding mutations (P = 4.4x10-7) and focal amplifications, raising the possibility that the mutation enrichment observed in the 3'UTR might also be functional.

In contrast to the genes discussed above, the number of indels in the 3’UTRs of NFKBIZ and TOB1 was significantly higher than in other parts of these genes, suggesting that indels in the 3’UTRs could be under functional selection (Fig. 6a). TOB1 encodes an anti-proliferation regulator that associates with ERBB2, and also affects migration and invasion in gastric cancer32. As part of the CCR4-NOT complex, TOB1 regulates other mRNAs through binding to their 3’UTR and promoting deadenylation33. Tumors with 3’UTR mutations in TOB1 showed a trend towards decreased expression (P = 0.053; FD = 0.7). Mutations did not concentrate in known miRNA binding sites, possibly due to incomplete annotation (Fig. 6c). However, the extreme conservation of this region (5th most conserved among all tested 3’UTRs with an average PhyloP score of 5.6) indicates that these events likely disrupt functionally important sites (Fig. 6c). Interestingly, TOB1 and its neighboring gene WFIKKN2 are focally amplified in breast cancer and pan-cancer, suggesting a complex role in cancer (Extended Data Fig. 9). NFKBIZ is a transcription factor that is mutated in recurrent/relapsed DLBCL34 and amplified in primary lymphomas34,35. 3’UTR mutations accumulated in a hotspot proximal to the stop codon and upstream of conserved miRNA binding sites which might be affected by these mutations (Fig. 6d). Only two events in these regions are SNVs, suggesting that mutations in the NFKBIZ 3’UTR are not the consequence of AID off-target activity. Lymphomas with 3’UTR mutations showed a trend towards increased mRNA levels of NFKBIZ (P = 0.035; FD = 3.2; the trend remains after correction

10 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

for copy number, P = 0.03; Fig. 6d). Functional studies will be required to understand the exact function of 3’UTR mutations on transcript regulation of TOB1 and NFKBIZ.

Mutations in the significant 3’UTR of TMEM107 in lymphoma overlap the U8 small RNA SNORD118 (Extended Data Fig. 10). Similar to WDR74, overlap with a repetitive RNA element suggests that these mutations might be caused by mapping artifacts.

Non-coding RNAs. ncRNAs are often part of large gene families with multiple copies in the genome. Sequence similarity with other genomic regions may therefore complicate their analysis. Similarly to the U2 RNA upstream of WDR74, we observed high levels of germline polymorphisms in normal samples in some of the significant ncRNAs and their promoters (Extended Data Fig. 10). Though interesting candidates, caution should thus be exercised in interpreting these hits.

The non-coding RNA RMRP is significantly mutated in multiple cancer types, in both its gene body and promoter (Fig 4c; Fig 6e; Supplementary Table 5). RMRP is the RNA component of the endoribonuclease RNase MRP, an enzymatically active ribonucleoprotein36,37,38. Its catalytic function depends on the RNA secondary and tertiary structure and its interactions with proteins39. Germline mutations in RMRP cause cartilage- hair hypoplasia, and somatic promoter mutations have been reported to be functional9. In addition, the RMRP locus is focally amplified in several tumor types, including epithelial cancer (Extended Data Fig. 9c). SNVs in the gene body (7 in pan-cancer) are significantly biased towards secondary structure impact (P = 0.011, permutation test)40,41. Three of these are individually significant (each with P < 0.1, sample level permutation tests), with two affecting the same position and one located in a tertiary interaction site (Fig. 6e). Of the four gene-body indels, three are located in or near protein-binding sites (P = 0.08), including a deletion that is predicted to affect the secondary structure (Extended Data Fig. 9d). Given RMRP's role in replication of the mitochodrial genome36, we tested whether mutations in this locus were associated with altered mitochondrial genome copy number. Indeed, mutated samples showed a trend towards higher mitochondrial copy number (two-sided rank-sum test, P = 0.1). However, RMRP also appears to be a target of potential artifacts (Extended Data Fig. 10) and so its relevance to cancer requires further scrutiny.

The microRNA precursor (pre-miRNA) for miR-142 was found significant in Lymph-BNHL, Lymphatic system and Hematopoietic system (Fig. 4c; Supplementary Table 5). The locus is a known AID off-target in lymphoma5,42(Extended Data Fig. 10). However, five of seven mutations (71%) in the dominantly expressed mature miRNA mir-142-p3, where the largest

11 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

functional impact is expected, were not assigned to AID, raising the possibility that they may be under selection5.

Eleven additional ncRNA-related elements (RNU6-573P, RPPH1, RNU12, TRAM2-AS1, G025135, G029190,RP11-92C4.6 and promoters of RNU12, MIR663A, RP11-440L14.1, and LINC00963) passed our stringent post-filtering, but had limited supporting evidence after manual inspection. These are discussed in a Supplementary Note. We further checked whether any of our non-coding candidates were associated with known germline cancer risk variants, and found no significant associations, with the exception of TERT (Methods).

NEAT1 and MALAT1, two neighboring lncRNAs previously reported as recurrently mutated in liver43 and breast cancer3, were found significant in multiple cohorts (Fig. 4c), mutated in a large number of patients (369 for NEAT1 and 158 for MALAT1; Fig 7a). In addition, these genes are significantly associated with nearby chromosomal breakpoints (NEAT1: P = 5.3x10-9; MALAT1: P = 0.0012; Methods; Extended Data Fig. 11). The pattern of breakpoints, indels and SNVs scattered throughout their sequence would suggest a possible tumor suppressor role. To evaluate this hypothesis, we tested whether MALAT1 and NEAT1 mutations are associated to loss of heterozygosity (LOH). Neither gene showed biallelic loss, in contrast to many known canonical tumor suppressors (Fig. 7b; Extended Data Fig. 11). Mutations in MALAT1 and NEAT1 also do not exhibit higher cancer allele fractions compared to mutations in flanking regions, a feature of early driver mutations (Fig 7c; Extended Data Fig. 11). Furthermore, NEAT1 and MALAT1 mutations are not associated with altered expression levels, suggesting that they do not affect post-transcriptional stability nor show increased expression like many mutated oncogenes (Fig 7d; Extended Data Fig. 11).

The high enrichment of indels throughout the gene body of NEAT1 and MALAT1, which have very high expression levels across many cancer types, resembles the phenomenon described above for ALB and SFTPB. If these indels are generated by a specific mutational process, we might expect distinct features from indels found elsewhere in the genome. Indeed, we find that indels in NEAT1, MALAT1, ALB and SFTPB are strongly enriched in events longer than 1 bp and particularly in indels of length 2-5 bp, compared to the genomic background (least significant Fisher’s P = 6.8x10-5, for MALAT1; Fig. 7e). The association of these indels with highly expressed genes suggests a transcription-coupled mutagenic mechanism, possibly transcription-replication collisions44–46. A systematic search of genes with increased rates of 2-5 bp indels reveals that this yet-unknown mutational process affects other highly expressed genes (Extended Data Fig. 11), some of which were reported

12 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

in a recent study8, indicating that these indel sizes are a feature of the reported mutational process. Interestingly, SNVs also occur at higher frequencies in these genes, suggesting that they may be generated by the same mechanism, although less frequently (Fig. 7f).

Overall, the discovery of a localized mutational signature and the lack of association of the mutations in MALAT1 and NEAT1 with loss of heterozygosity or higher allele frequencies suggests that indels in these genes are most likely passenger events. The previously reported oncogenic phenotypes associated with both ncRNAs43,47,48 may thus be related to other types of alterations.

Localized lack of power to call mutations Certain genomic regions, especially those with high GC content, are subject to systematic low coverage in next-generation sequencing49. We used power analysis to quantify the number of possibly missed mutations, especially in non-coding regions. These calculations attempt to account for variations in local sequencing depth, purity and overall ploidy of individual tumor samples, background mutation rates across cohorts and elements, and the size of patient cohorts9,50–52. We found that mutation detection sensitivity was high across the element types, (median d.s. = 0.98) but slightly lower for the typically GC-rich promoters (median d.s. = 0.96) and 5’UTR elements (0.96; Fig. 8a; Supplementary Table 4,5). However, for some individual genomic regions the detection sensitivity is dramatically reduced and it is therefore important to evaluate it for elements of interest. For example, only 43% of tested cases have at least 50% average detection sensitivity across the third exon of TCF7L2 and 20% in the 5’UTR of AKT1 (Extended Data Fig. 12a). Positional detection sensitivities for the two canonical TERT promoter hotspot sites were highly variable among patients and cohorts, ranging from 4% of sufficiently powered (≥90%) patients in CNS- PiloAstro to 100% of patients in Thy-AdenoCa (Fig. 8b; Extended Data Fig. 12b). This means that we have nearly no information about possible TERT promoter mutations in CNS- PiloAstro and incomplete information in other cohorts. Using the detection sensitivity and observed mutation count at the TERT hotspot sites, we inferred the expected total number of

events in each tumor type (Fig 8c). This analysis revealed that ~216 (CI95 = [188,245]) TERT hotspot mutations were likely missed in this study due to lack of power. Likewise, about four additional mutations would be expected at the recurrent FOXA1 promoter hotspot shown to be mutated in hormone-receptor positive breast cancers (Extended Data Fig. 12)9.

13 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

These calculations suggest that a considerable number of potentially impactful somatic mutations were not detected due to systematic low sequencing coverage, and that specific positions can be unpowered in elements with overall sufficient coverage.

We further evaluated the discovery power for recurrent events identified in mutational burden tests in the tumor cohorts analyzed in this study9,50. As expected, discovery power was highest in cohorts with many patients and low background mutation densities allowing discovery of typical-sized driver elements mutated in fewer than 1% of patients in the Carcinoma, Adenocarcinoma and pan-cancer meta-cohorts with 90% power (Fig. 8d). In contrast, the small Bladder-TCC cohort (n=23) with relatively high background mutation density (~2.7 mutations/Mb) is only powered to discover drivers that occur in at least 25% of patients (Fig. 8d). Overall, power differences between element types for individual tumor cohorts were small, suggesting that lack of power alone cannot explain the observed paucity of regulatory driver elements.

Relative paucity of non-coding driver point mutations To further analyze the relative paucity of driver mutations in non-coding elements, we sought to estimate the overall number of driver mutations, in coding and non-coding regions of known cancer genes. Given the limited statistical power to detect individually significant elements, we combined the signal from multiple elements, as recently described53. The difference between the total number of mutations observed across a combined set of elements and the number of passenger mutations that is expected by chance, i.e. the excess of mutations, approximates the total number of driver mutations in the set of elements. To estimate the expected number of background mutations, we fit the NBR model to presumed passenger genes, controlling for sequence composition, element size and regional mutation densities (Methods; Supplementary Table 7; Supplementary Note)53. We focused on 142 known cancer genes that include the significantly mutated cancer genes in this study and frequently amplified or deleted cancer genes (Methods). We specifically excluded TERT from this analysis because of its high frequency of promoter driver mutations and the incomplete detection sensitivity reported above.

Overall, this approach predicted an excess of 3,133 driver mutations (CI95% [2,987-3,273]; 2,258 SNVs and 875 indels) in the protein-coding sequences of these genes across the pan- cancer meta-cohort (Fig. 8e). In contrast to coding regions, the observed number of mutations across the combined set of non-coding elements associated with these genes is

very close to the expected number of passenger mutations, , with an excess of 71 (CI95%:22- 133) mutations in promoters, 32 (0-79) in 5’UTRs, and 103 (25-184) in 3’UTRs. These

14 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

results indicate that coding genes contribute the vast majority (>90%) of driver point mutations in these 142 cancer genes. Importantly, these estimates are conservative, since the estimation of the background number of mutations from putative passenger genes may include yet undetected driver mutations. In addition, non-coding mutations in promoters of cancer genes were also not generally associated with LOH nor with altered expression, as one would expect if they were enriched with driver mutations (Supplementary Note). Altogether, our results suggest that mutations in protein-coding regions dominate the landscape of driver point mutations in known cancer genes.

Discussion If we are to fulfill the ambitions of precision medicine, we need a detailed understanding of the genetic changes that drive each person’s cancer, including those in non-coding regions (https://doi.org/10.1101/190330). Most cancer genomic studies have focused on protein- coding genes, leaving non-coding regulatory regions and non-coding RNA genes largely unexplored. The unprecedented availability of high-quality mutation calls from whole- genomes of >2,500 patients across 27 cancer types has enabled us to comprehensively search for functional elements with cancer-driver mutations across the genome. To obtain reliable results and avoid common pitfalls in detecting drivers54, we have benchmarked a large number of methods and developed a novel and rigorous statistical strategy for integrating their results.

Among the most interesting candidate non-coding driver elements identified in this analysis, we have uncovered promoter or 5’UTR mutations in TP53, RFTN1 and MTG2, a putative intragenic regulatory element in RNF34 that possibly regulates KDM2B; 3’UTR mutations in NFKBIZ and TOB1; and recurrent mutations in the non-coding RNA RMRP. We have also found evidence suggesting that a number of previously reported and frequently mutated non-coding elements may not be genuine cancer drivers. Particularly, the non-coding RNAs NEAT1 and MALAT1 are subject to a high density of passenger indels, seemingly due to a transcription-associated mutational process that targets some of the most highly expressed genes.

Our study has yielded an unexpectedly low number of non-coding driver candidates. The results from four analyses - genomic hotspot recurrence, driver element discovery, discovery power and mutational excess - suggest that the regulatory elements studied here contribute a much smaller number of recurrent cancer-driving mutations than protein-coding genes. This contrasts the distribution of germline polymorphisms associated with heritability of complex traits, which are most frequently located outside of protein-coding genes 55.

15 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Technical shortcomings, such as severe localized lack of sequence coverage in GC-rich promoters (as in the case of the TERT promoter), may lead to considerable underestimation of true drivers in certain regions. The yield of driver events will thus benefit from technical improvements, including less variable sequence coverage (e.g. PCR-free library preparation), longer sequencing reads and advanced methods for read mapping and variant calling that can overcome common artifacts. Moreover, studying larger tumor cohorts will provide increased power to discover infrequently mutated driver elements.

Our analyses have focused on currently-annotated non-coding regulatory elements and non- coding genes, comprising 4% of the genome, and exclusively on SNVs and indels. Although our genome-wide hotspot analysis did not detect novel highly recurrent non-coding mutation sites, it is possible that additional non-coding drivers reside outside of the regions tested here. Improvements in the functional annotation of non-coding elements, our understanding of their tissue-specificity, and the impact of mutations in them, will refine current driver discovery models and enable more comprehensive screens.

A challenge in the identification of non-coding drivers is distinguishing real novel driver events from yet-unidentified mutational mechanisms targeting certain genomic regions, such as the recently described hypermutation of transcription factor binding sites in melanoma56,57. Better understanding of mutational processes and their activity along the genome will be critical to improve the sensitivity and specificity of statistical methods for driver discovery.

One potential explanation for the relative paucity of non-coding drivers is the smaller functional territory size of many regulatory elements, and hence smaller chance of being mutated, compared to protein-coding genes (Extended Data Fig.1; Fig. 8d). The presence of TERT promoter mutations at just two sites suggests that non-coding driver mutations may be confined to specific positions, similar to the small number of impactful sites in the oncogenes BRAF, KRAS and IDH1.

SNVs and small indels may not easily alter the function of non-coding regulatory elements. Directly mutating protein-coding sequences or altering expression levels by copy number changes, epigenetic changes or repurposing of distal enhancers through genomic rearrangements58,59 may be more likely to provide large phenotypic effects. While protein- truncating or frameshift mutations can be created by a single nucleotide change, silencing regulatory regions of tumor suppressors may depend on large deletions to achieve a similar

16 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

effect. Although the very high frequency of TERT promoter mutations in cancer demonstrates that some promoter point mutations can be potent driver events, our analyses suggest that the number of regulatory sites where a point mutation can lead to major phenotypic effects may be smaller than anticipated across the genome. This highlights TERT as an unusual example, perhaps because even a modest increase in the expression level of TERT may be sufficient for circumventing normal telomere shortening in cancer cells.

Comprehensive and reliable discovery of non-coding driver mutations in cancer genomes will be an integral part of cancer research and precision medicine in the coming years. We anticipate that the approaches developed here will provide a solid foundation for the incipient era of driver discovery from ever-larger numbers of cancer whole-genome sequences.

Acknowledgements We thank Kirsten Kübler for assistance with meta cohort generation. We are grateful to Matthew Meyerson, Eric S. Lander and the PCAWG steering committee for helpful feedback on the manuscript.

Author contributions This work was carried out by the Discovery of Driver Elements Subgroup of the PCAWG Drivers and Functional Interpretation Group based on data from the ICGC/TCGA Pan- Cancer Analysis of Whole Genomes Network. Contributions spanned a range of categories, as defined in Supplementary Note. The Driver discovery category includes method development, driver significance evaluation and results integration. The Candidate vetting and filtering category covers removal of potential false positive hits. The Case-based analysis category includes driver potential and performed follow-up studies using additional evidence. The Power analysis and driver mutations at known cancer genes category includes power evaluations and estimation of the number of true drivers. The Leadership and organisational work category includes organization and workgroup leadership. E.R., M.M.N., F.A., G.T., H.H., J.M.H., R.I.P., I.M., J.S.P. and G.G. wrote the manuscript. All authors have had access to read and comment on the manuscript.

Author affiliations 1The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02124, USA 2Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark

17 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

3Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambridgeshire, UK 4Division of Applied Bioinformatics, German Cancer Research Center, Im Neuenheimer Feld 280, Heidelberg 69120, Germany 5Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028 Barcelona, Spain. 6Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain 7Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada 8Centre for Genomic Regulation (CRG), Barcelona 08003, Spain 9Universitat Pompeu Fabra (UPF), Barcelona, Spain 10Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany 11Institute of Pharmacy and Molecular Biotechnology, and Bioquant Center, University of Heidelberg, 69120 Heidelberg, Germany 12Massachusetts General Hospital, Center for Cancer Research, Charlestown, Massachusetts 02129, USA 13Department of Cell Biology, Harvard Medical School, Boston, MA, USA 14cBio Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, 15Department of Genomic Medicine, the University of Texas MD Anderson Cancer Center, Houston, Texas, USA 16Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas, USA 17Department of Physiology and Biophysics, Weill Cornell Medicine, 1300 York Avenue, New York, NY, 10065, USA 18Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York 10021 USA 19Computational Biology and Bioinformatics, Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden 20Bioinformatics Research Centre (BiRC), Aarhus University, Aarhus, Denmark. 21Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada 22Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada 23Institute of Computer Science, Polish Academy of Sciences, Warsaw, 01-248, Poland 24Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA 25Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA

18 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

26Genetics and Genome Biology Program, SickKids Research Institute, Toronto, ON, Canada 27Department of Gene Technology, Tallinn University of Technology, Estonia 28Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden 29Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan 30Department of Genitourinary Medical Oncology - Research, Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA 31European Bioinformatics Institute. European Molecular Biology Laboratory. Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK 32Department of Molecular and Cellular Biology, Baylor College of Medicine, 1 Baylor Plaza Houston M822, Houston, Texas 77030, USA 33Faculty of Biosciences, Heidelberg University, Im Neuenheimer Feld 234, 69120 Heidelberg, Germany 34Division of Computational Biology, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA 35Samsung Medical Center, Sungkyunkwan University School of Medicine South Korea 36Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, CH-8057 Zurich, Switzerland 37Department of Genetics, Stanford University School of Medicine, Stanford, California, USA 38Department of Haematology, University of Cambridge, Cambridge CB2 2XY, UK 39MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, USA 40Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA 41Weill Cornell Medicine and New York Presbyterian, Department of Pathology and Laboratory Medicine, New York, NY, USA 42Weill Cornell Medicine and New York Presbyterian, Caryl and Israel Englander Institute for Precision Medicine, New York, NY, USA 43Weill Cornell Medicine and New York Presbyterian, Sandra and Edward Meyer Cancer Center, New York, NY, USA 44Ontario Institute for Cancer Research, Toronto, Ontario M5G 0A3, Canada 45Center for Biomolecular Science and Engineering, University of California at Santa Cruz, Santa Cruz, CA 95064, USA

19 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

46Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan 47Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan 48CREST, Japan Science and Technology Agency, Tokyo, Japan 49Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA 50Department of Medical Oncology, Inselspital, University Hospital and University of Bern, 3010 Bern, Switzerland 51Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA 52Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA 53Department of Computer Science, Yale University, New Haven, Connecticut, USA 54Catalan Institution for Research and Advanced Studies (ICREA), 08010 Barcelona, Spain 55Harvard Medical School, 250 Longwood Avenue, Boston, 02115, MA, USA MA, USA 56Massachusetts General Hospital, Department of Pathology, Boston, Massachusetts, USA

Figure legends Figure 1: Mutational hotspots and overview of functional elements and meta-cohorts. a, Characteristics of top 50 single-site hotspots in PCAWG SNV data. The stacked bar chart (left) shows the total number of patients mutated across the PCAWG cohorts colored by mutation type. Gene names are given when hotspots overlap functional elements (colour- coded). Known somatic driver sites are marked with a hashtag. Amino acid change is indicated for protein-coding genes. The genomic location (chr:position) is colored by overlap with center of palindromes (orange) or immunoglobulin loci (brown). The table (right) shows the distribution of hotspot SNVs in the PCAWG cohorts sorted by number of hotspots. Tumor-type specific hotspots are indicated with a box. This only includes cohorts with at least 20 patients, and at least 10 patients or 10% of patients with a SNV. The Lymph-BNHL and Lymph-CLL cohorts are shown together as Lymphoid malignancies. See Extended Data Fig. 1 for a complete set of individual and meta cohorts. b, Mutational signature attributions for mutations in each hotspot site. c, Schematic describing definition of functional element types from GENCODE and ENCODE annotation resources. Functional elements (black) are defined based on transcript annotations from various databases. Elements arising from multiple transcripts with the same gene ID are collapsed, as seen here for the protein-coding isoforms. Promoter elements are defined as 200 bases upstream and downstream of the transcription start sites of a gene's transcripts (marked with red). Splice

20 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

site elements extend 6 and 20 bases from 3' and 5' exonic ends into intronic regions respectively, as indicated. Regions overlapping protein-coding bases and protein-coding splice sites are subtracted from other regions, as indicated for the lncRNA promoter element in gray. d, Organization of meta-cohorts defined by tissue of origin and organ system (Methods). Pan-cancer contains all cancers excluding Skin-Melanoma and lymphoid malignancies.

Figure 2: Filtering and integration process for nomination of driver candidates. a, Overview of driver discovery method and their lines of evidence to evaluate candidate gene drivers. Methods employing each feature are marked with blue boxes next to the appropriate track. b, Spearman’s correlation of p-values across the different driver discovery algorithms based on simulated (null model) mutational data. Dendrogram illustrates relatedness of method p-values and algorithm approaches marked by colored boxes on dendrogram leaves. c, P-values are combined with Brown’s method based on the correlation structure calculated in (b). Individual method (left) and integrated (right) log-transformed p-values are shown in a heatmap (grey: missing data). d, Post-filtering used several criteria to identify likely suspicious candidates (shaded grey rectangle). e, Significant driver candidates were identified after controlling for multiple hypothesis testing based on an FDR q-value threshold of 0.1 (blue asterisk). Candidates with q-values below 0.25 (blue dash) were also considered of interest.

Figure 3: Technical and biological confounders used in candidate filtering. a, Read mappability based on alignability, uniqueness and blacklisted regions. Example shown for RN7SK, which was removed due to low alignability. b, Site-specific noise evaluated in normal samples from PCAWG. c, Increased local mutation density caused by AID and UV mutational processes in lymphoma and melanoma, respectively. d, Palindromic DNA can expose bases to APOBEC enzymes. f, Number of significant unique hits (left) and unique elements (right) removed by each filter.

Figure 4: Protein-coding and non-coding driver elements identified in PCAWG cohorts. a, Number of total hits (elements significant in a certain patient cohort; left) and number of unique significant elements (right). b, Number of significant elements by type and by cohort. c, Significant non-coding elements identified in this study. Element types are indicated with colors as in Fig. 1c.

Figure 5: Novel non-coding promoter and enhancer driver candidates. a, RFTN1 promoter locus (left) and gene expression for mutant (mut.) and wild-type (wt) (right) in

21 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Lymph-BNHL tumors. Expression values are colored by copy number status; AID mutations are highlighted in yellow. b, A mutational hotspot inside RNF34 overlapping a DNAse I site (left) correlates with gene expression changes of KDM2B in Liver-HCC (right). Coloring of mutations as in a. c, MTG2 promoter locus (left) and associated gene expression changes in Carcinoma tumors (right). Coloring of mutations as in a. d, An enhancer associated with TP53TG1 (Methods) (left) contains mutations mostly attributed to an esophageal cancer- associated signature.

Figure 6: Novel 3’UTR and non-coding RNA driver candidates. a, Quantification of indel rates for functional elements associated with significant 3’UTRs. b, Overview of ALB genomic locus and PCAWG variants illustrates local density of indels. Coloring of enlarged ALB gene elements corresponds to functional elements in barplots in a. Copy number changes from 314 Liver-HCC cases are compared to other samples from PCAWG. c, Genomic locus of TOB1 3’UTR. Note extreme conservation (PhyloP) for the 3’UTR. d, Genomic locus of NFKBIZ 3’UTR (left) and associated gene expression changes in Lymph- BNHL (right). e, Genomic locus of the RMRP transcript and promoter region (left), and its RNA secondary structure, tertiary structure interactions, protein and substrate interactions, and mutations with their predicted structural impact (right; Extended Data Figure 9d); lymphoma and melanoma are excluded.

Figure 7: Characterization of a mutational process in NEAT1 and MALAT1. a, Genomic locus overview showing the distribution of indels and SNVs in NEAT1 and MALAT1. b, Analysis of enrichment in LOH associated with mutation (“double-hit”) for canonical tumor suppressors (TSG), oncogenes (OG), and NEAT1/MALAT1. c, Comparison of cancer allelic fractions within those same genes and within flanking regions (including 2 kbp upstream, 2 kbp downstream and introns). d, Difference in expression between mutated and wild-type alleles of these genes. e, Percentages of different groups of indel sizes for all protein-coding and lncRNA genes, ALB, NEAT1, MALAT1 and the set of genes enriched in 2-5 bp indels. f, SNV and indel rates (total events/bp) in different functional regions of 18 protein-coding genes enriched in 2-5 bps indels (without ALB, which contributed 47% of indels). Red lines indicate background indel and SNV rates estimated from all protein-coding genes.

Figure 8: Missed mutations and paucity of non-coding drivers. a, Cumulative distribution of average detection sensitivity (d.s.) from 60 PCAWG pilot samples in different functional element types. b, Distribution of TERT promoter hotspot (chr5:1295228; hg19) detection sensitivity for each each patient, by cohort. Grey dots indicate values for individual patients inside estimated distribution (areas colored by cohort). Horizontal black bars mark

22 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

the medians. Numbers above distributions indicate the percentage of patients powered (d.s. ≥90%) in each cohort. See Extended Data Fig. 12b for the second TERT hotspot. c, Percentage of patients with observed (blue) and inferred missed (red) mutations at the chr5:1,295,228 and chr5:1,295,250 TERT promoter hotspot sites. Error bars indicate 95% confidence interval. Numbers above bars show the total inferred number of TERT promoter mutations for each site in this cohort. Red numbers indicate the absolute number of inferred missed mutations (due to lack of read coverage). d, Heatmap shows minimal frequency of a driver element in a cohort that is powered (≥90%) to be discovered. Cell numbers indicate the number of patients in a given cohort that would need to harbor a mutation in a given element. For example, the pan-cancer cohort is powered to discover a driver gene (CDS) present in <1% or 18 patients, while the Bladder-TCC cohort is only powered to discover drivers present in at least 27% or 6 patients. Power to discover driver elements is dependent on the background mutation frequency (shown above the heatmap) and element length (shown to the right). e, The ratio between observed numbers of mutations (SNVs and indels) in regulatory and coding regions of 142 protein-coding cancer genes. The absolute number

of driver mutations predicted, with CI95% in brackets, is shown above each bar.

Extended Data Fig. 1: Mutational hotspots in additional tumor types. a, Barplot of positions vs. patients. The stacked barcharts under the barplot show the proportion of protein-coding (dark grey) and non-coding (light grey) positions, respectively, in each of the bars in the barplot. b, Distribution of SNVs in top 50 single-site hotspots across all analyzed individual cohorts and meta-cohorts.

Extended Data Fig. 2: Genomic element statistics. a, Percentage of genomic coverage for each element type. b, Distribution of element lengths for each element type.

Extended Data Fig. 3: Integration details. a Quantile-quantile plots of p-values reported by various driver detection algorithms on the three simulated data sets (Broad, DKFZ, and Sanger; shown for coding regions in the meta-carcinoma cohort) showed no major enrichment of mutations above the background rate. Results generally followed the expected null (uniform) distribution, and the p-values reported on simulated data were subsequently used to assess the covariance of method results. b, Quantile-quantile plots of integrated p- values using the Brown and Fisher methods for combining p-values across the different driver detection algorithm results were generated for a few representative tumor cohorts (shown here for coding regions). Brown combined p-values (light blue) generally followed the null distribution as expected, while Fisher combined p-values were significantly inflated (dark blue), confirming that dependencies existed between the results reported by the various

23 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

driver detection algorithms. To simplify the integration procedure, we calculated covariances using p-values from the observed data instead of simulated data and found that the integrated results based on the observed covariances (first column of plots) were essentially the same as the results obtained using the simulated covariances (second, third, and fourth columns of plots). c, Triangular heatmaps showing the Spearman correlations of p-values among the various driver detection methods in observed versus simulated data (coding regions, meta-carcinoma cohort) are highly similar. Differences in the observed and simulated correlation values (shown in the far-right heatmaps) were minimal, and thus the final integration of p-values across methods was performed using covariances estimated on observed data. d, Integrated p-values based on observed and simulated covariance estimations (shown on the right, top heatmap, for coding regions in glioblastoma) did not differ noticeably. In cases where individual methods reported results that yielded substantially fewer hits than the median across all methods (bottom heatmap, methods in light grey with results in dashed box), removing the methods from the integration did not affect the number of significant genes identified (right column of results in bottom heatmap, shown for coding regions in lung adenocarcinoma).

Extended Data Fig. 4: Sensitivity of driver discovery methods. The performance of different driver discovery methods and of the integration method on protein-coding genes was evaluated using known cancer genes. A list of 603 Cancer Gene Census genes (CGC, v80) was used as a reference gold-standard set of known cancer genes. a-c, For each method, genes with q-value<0.10 were sorted according to their p-value and the fraction of CGC genes shown in the y-axis. The total number of significant genes identified by each method is shown in the x-axis. The integration approach tends to outperform most methods across cohorts, yielding longer lists of significant genes and more strongly enriched in known cancer genes. d, Heatmap depicting the number of known cancer genes identified by each method and by the integration approach in each cohort. The color of each cell reflects the relative sensitivity of each method in each cohort, measured as the number of cancer genes detected by a method (also shown as a number inside each cell) divided by the maximum number detected by any of the methods. Methods are sorted from top to bottom according to their mean relative sensitivity across datasets.

Extended Data Fig. 5: Sensitivity vs. specificity in individual cohorts vs. meta cohorts for candidate drivers. q-values for the most significant individual cohort (x-axis) vs. meta cohort (y-axis) are shown. Driver elements are colored by their element type.

24 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Extended Data Fig. 6: Protein-coding driver elements identified in PCAWG cohorts. Significant protein-coding elements identified in this study. Compare to figure 4.

Extended Data Fig. 7: Lack of concordance for non-coding results. Heatmaps depicting p-values for methods included in integration for Liver-HCC cohorts. a, Most methods agree on the top protein-coding hits. b, c: Agreement between methods on non-coding hits is far lower for Liver-HCC promoters (b) and 3’UTRs (c).

Extended Data Fig. 8: Mutation to expression correlation. Expression is compared between mutated and non-mutated samples. For each element, the z-score of the expression values for mutated and wild type in the significant cohort is plotted. For copy numbers, SCNA amplification indicates SCNA > 10, SCNA gain indicates SCNA ≥ 3, SCNA loss indicates SCNA ≤ 1 and no events indicates SCNA < 3 and SCNA >1. If a patient is mutated with multiple point mutation types, indels are indicated over SNVs.

Extended Data Fig. 9: Focal amplifications including TOB1, HES1 and RMRP in TCGA tumors. a, Copy number profiles of 55 of 441 stomach adenocarcinomas from TCGA show copy number gains around HES1. b, TOB1 and its gene neighbor WIFKKN2 are focally amplified in cancer. 172 of 10844 total samples from 33 cancer types are shown. c, RMRP focal amplifications in TCGA cancers (160 of 10844 total tumors shown). d, RMRP secondary structure (RF00030 in Rfam) labeled with P4 tertiary interactions as well as protein and substrate interactions (reported as ± 3bp windows). Mutations and their predicted structural impact are indicated. Lymphoma and melanoma mutations are excluded.

Extended Data Fig. 10: Density of potential germline polymorphisms and somatic mutations in several candidates. Genomic overviews showing mapping quality masks from the 1,000 Genomes Project (pilot/strict), and densities of germline SNP and somatic SNV calls on PCAWG normal and tumor samples, respectively. Excess of germline SNP calls can be indicative of artifacts, thus an increased number of somatic mutations in such a region raises concerns about true mutational recurrence. a, WDR74 promoter, U2 RNA element. b, TMEM107 3’UTR. c, RNU12 small RNA. d, RMRP promoter and ncRNA transcript. e, RPPH1 ncRNA. f, Genomic overview showing SNVs in mir-142. SNVs attributed to the AID mutational signature are marked in orange.

Extended Data Fig. 11: Supporting information for NEAT1 and MALAT1. a, Mutation density in the NEAT1 and MALAT1 genomic loci. Co-localized peaks of densities in the loci are seen for structural variants, SNVs and indels. b, Levels of expression of the genes

25 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

enriched in 2-5 bp indels in their respective tissues. Background gene expression levels are shown to the left for comparison. c, Heatmap showing the levels of expression across tissues for the genes enriched in 2-5 bp indels. d, The average cancer allelic fraction (CAF) is compared between each genomic element and the corresponding flanking regions (+/- 2Kb and introns; overlapping coding exons were excluded). The size of the points represent the number of mutated samples for each particular element. e, The relative rate of loss of heterozygosity (LOH) is compared between mutated and wild-type samples, coloured by element type and highlighting significant LOH enrichments with an outside black circle. f, Expression comparison between mutated and wild-type samples. For each element, the

median log2 fold expression difference of cohorts found significant in the driver discovery is plotted. If elements were significant in multiple cohorts, the one with the most significant expression difference is shown. Number of mutated samples refers to tumors for which RNA-seq data could be obtained. Element types are colored as indicated in the legend. Elements with a significant expression difference (q < 0.1 or q < 0.01) are indicated by circles. g, Number samples with structural variants (SV) near or in each element is plotted against the number of patients with a point mutation. Number of SVs is the sum of event counts in bins of 50 kb regions that overlap with a given element. Elements with a significantly higher than expected number of samples with SVs are indicated with black circles (q < 0.1).

Extended Data Fig. 12: Lack of detection power in specific elements. a, Cumulative Average d.s. for selected elements across 60 patients. Exon 3 of the colon cancer gene TCF7L2 is 90% powered in only 14/60 (23%) of patients. The TERT promoter and 5’UTR element of AKT1 show lack of detection power in the majority of patients. b, Distribution of TERT promoter hotspot (chr5:1295250; hg19) detection sensitivity for each each patient, by cohort. Grey dots indicate values for individual patients inside estimated distribution (areas colored by cohort). Horizontal black bars mark the medians. Numbers above distributions indicate the percentage of patients powered (d.s. ≥90%) in each cohort. c, Left: distribution (pink) of detection sensitivity across all breast cancer patients (Breast- AdenoCa, Breast-LobularCa, Breast-DCIS; grey dots) at the FOXA1 promoter hotspot site (chr14:38064406; hg19). The number above the distribution plot indicates the percentage of patients powered (d.s. ≥90%). Black horizontal bar indicates the median of the distribution. Right: percentage of patients with observed (blue) and inferred missed (red) mutations. Error bar indicates 95% confidence interval.

Supplementary Tables Supplementary Table 1: Genomic element type summary statistics

26 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Supplementary Table 2: Driver discovery method description Supplementary Table 3: Driver discovery methods included in integration Supplementary Table 4: Summary and annotation of protein-coding driver candidates Supplementary Table 5: Summary and annotation of non-coding driver candidates Supplementary Table 6: Summary of additional evidence for non-coding driver candidates Supplementary Table 7: List of cancer genes used in this study Supplementary Table 8: List of cases used in detection sensitivity analysis Supplementary Table 9: Impact of covariates on obs/exp ratio Supplementary Note

27 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

References

1. Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013). 2. Fredriksson, N. J., Ny, L., Nilsson, J. A. & Larsson, E. Systematic analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types. Nat. Genet. 46, 1258–1263 (2014). 3. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole- genome sequences. Nature 534, 47–54 (2016). 4. Melton, C., Reuter, J. A., Spacek, D. V. & Snyder, M. Recurrent somatic mutations in regulatory regions of human cancer genomes. Nat. Genet. 47, 710–716 (2015). 5. Puente, X. S. et al. Non-coding recurrent mutations in chronic lymphocytic leukaemia. Nature 526, 519–524 (2015). 6. Northcott, P. A. et al. Enhancer hijacking activates GFI1 family oncogenes in medulloblastoma. Nature 511, 428–434 (2014). 7. Weinhold, N., Jacobsen, A., Schultz, N., Sander, C. & Lee, W. Genome-wide analysis of noncoding regulatory mutations in cancer. Nat. Genet. 46, 1160–1165 (2014). 8. Imielinski, M., Guo, G. & Meyerson, M. Insertions and Deletions Target Lineage- Defining Genes in Human Cancers. Cell 168, 460–472.e14 (2017). 9. Rheinbay, E. et al. Recurrent and functional regulatory mutations in breast cancer. Nature (2017). doi:10.1038/nature22992 10. Huang, F. W. et al. Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959 (2013). 11. Horn, S. et al. TERT promoter mutations in familial and sporadic melanoma. Science 339, 959–961 (2013). 12. Flavahan, W. A. et al. Insulator dysfunction and oncogene activation in IDH mutant gliomas. Nature 529, 110–114 (2016). 13. Huarte, M. The emerging role of lncRNAs in cancer. Nat. Med. 21, 1253–1261 (2015). 14. Bhan, A., Soleimani, M. & Mandal, S. S. Long Noncoding RNA and Cancer: A New Paradigm. Cancer Res. 77, 3965–3981 (2017). 15. Lanzós, A. et al. Discovery of Cancer Driver Long Noncoding RNAs across 1112 Tumour Genomes: New Candidates and Distinguishing Features. Sci. Rep. 7, 41544 (2017). 16. Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 2059–2074 (2013). 17. Campbell, P. J. et al. Pan-cancer analysis of whole genomes. (2017). doi:10.1101/162784 18. Perera, D. et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 532, 259–263 (2016). 19. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016). 20. Ellis, M. J. et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature 486, 353–360 (2012). 21. Brown, M. B. 400: A Method for Combining Non-Independent, One-Sided Tests of Significance. Biometrics 31, 987 (1975). 22. Fisher, R. A. Statistical Methods For Research Workers. (Genesis Publishing Pvt Ltd, 1925).

28 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

23. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995). 24. Pasqualucci, L. et al. Hypermutation of multiple proto-oncogenes in B-cell diffuse large- cell lymphomas. Nature 412, 341–346 (2001). 25. Forbes, S. A. et al. COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res. 38, D652–7 (2010). 26. Saeki, K. The B cell-specific major raft protein, Raftlin, is necessary for the integrity of lipid raft and BCR signal transduction. EMBO J. 22, 3015–3026 (2003). 27. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the . Nature 489, 57–74 (2012). 28. Kotani, T., Akabane, S., Takeyasu, K., Ueda, T. & Takeuchi, N. Human G-, ObgH1 and Mtg1, associate with the large mitochondrial ribosome subunit and are involved in translation and assembly of respiratory complexes. Nucleic Acids Res. 41, 3713–3722 (2013). 29. Kopan, R. & Ilagan, M. X. G. The canonical Notch signaling pathway: unfolding the activation mechanism. Cell 137, 216–233 (2009). 30. Dulak, A. M. et al. Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity. Nat. Genet. 45, 478–486 (2013). 31. Diaz-Lagares, A. et al. Epigenetic inactivation of the p53-induced long noncoding RNA TP53 target 1 in human cancer. Proc. Natl. Acad. Sci. U. S. A. 113, E7535–E7544 (2016). 32. Li, B.-S. et al. MicroRNA-25 promotes gastric cancer migration, invasion and proliferation by directly targeting transducer of ERBB2, 1 and correlates with poor survival. Oncogene 34, 2556–2565 (2015). 33. Hosoda, N. et al. Anti-proliferative protein Tob negatively regulates CPEB3 target by recruiting Caf1 deadenylase. EMBO J. 30, 1311–1323 (2011). 34. Morin, R. D. et al. Genetic Landscapes of Relapsed and Refractory Diffuse Large B-Cell Lymphomas. Clin. Cancer Res. 22, 2290–2300 (2016). 35. Chapuy, B. et al. Targetable genetic features of primary testicular and primary central nervous system lymphomas. Blood 127, 869–881 (2016). 36. Shadel, G. S. & Clayton, D. A. Mitochondrial DNA maintenance in vertebrates. Annu. Rev. Biochem. 66, 409–435 (1997). 37. Schmitt, M. E. & Clayton, D. A. Nuclear RNase MRP is required for correct processing of pre-5.8S rRNA in Saccharomyces cerevisiae. Mol. Cell. Biol. 13, 7935–7941 (1993). 38. Gill, T., Cai, T., Aulds, J., Wierzbicki, S. & Schmitt, M. E. RNase MRP cleaves the CLB2 mRNA to promote cell cycle progression: novel method of mRNA degradation. Mol. Cell. Biol. 24, 945–953 (2004). 39. Esakova, O. & Krasilnikov, A. S. Of proteins and RNA: the RNase P/MRP family. RNA 16, 1725–1747 (2010). 40. Sabarinathan R, E. al. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. - PubMed - NCBI. Available at: https://www.ncbi.nlm.nih.gov/pubmed/23315997. (Accessed: 7th July 2017) 41. Mularoni, L., Sabarinathan, R., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations. Genome Biol. 17, 128 (2016).

29 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

42. Robbiani, D. F. et al. AID produces DNA double-strand breaks in non-Ig genes and mature B cell lymphomas with reciprocal chromosome translocations. Mol. Cell 36, 631– 641 (2009). 43. Fujimoto, A. et al. Whole-genome mutational landscape and characterization of noncoding and structural mutations in liver cancer. Nat. Genet. 48, 500–509 (2016). 44. Lippert, M. J. et al. Role for topoisomerase 1 in transcription-associated mutagenesis in yeast. Proc. Natl. Acad. Sci. U. S. A. 108, 698–703 (2011). 45. Sankar, T. S., Wastuwidyaningtyas, B. D., Dong, Y., Lewis, S. A. & Wang, J. D. The nature of mutations induced by replication–transcription collisions. Nature 535, 178–181 (2016). 46. Jinks-Robertson, S. & Bhagwat, A. S. Transcription-associated mutagenesis. Annu. Rev. Genet. 48, 341–359 (2014). 47. Ke, H. et al. NEAT1 is Required for Survival of Breast Cancer Cells Through FUS and miR-548. Gene Regul. Syst. Bio. 10, 11–17 (2016). 48. Han, Y., Liu, Y., Nie, L., Gui, Y. & Cai, Z. Inducing Cell Proliferation Inhibition, Apoptosis, and Motility Reduction by Silencing Long Noncoding Ribonucleic Acid Metastasis-associated Lung Adenocarcinoma Transcript 1 in Urothelial Carcinoma of the Bladder. Urology 81, 209.e1–209.e7 (2013). 49. Dabney, J. & Meyer, M. Length and GC-biases during sequencing library amplification: a comparison of various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques 52, 87–94 (2012). 50. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014). 51. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013). 52. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012). 53. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell (2017). doi:10.1016/j.cell.2017.09.042 54. Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013). 55. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome- wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015). 56. Perera, D. et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 532, 259–263 (2016). 57. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016). 58. Flavahan, W. A. et al. Insulator dysfunction and oncogene activation in IDH mutant gliomas. Nature 529, 110–114 (2016). 59. Hnisz, D. et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458 (2016).

30 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Methods Table of contents: 1. Patient cohorts 2. Mutational hotspot analysis 3. Mutational signatures 4. Genomic element definition 5. Driver discovery methods 6. Simulated data sets 7. Statistical framework for the integration of results from multiple driver discovery methods 8. Post-filtering of candidates 9. Sensitivity and specificity analysis 10. Gene expression analyses 11. Normalization for copy number variation 12. Mutation to expression association 13. Copy number analyses 14. Power analysis 15. Associations between mutation and signatures of selection: loss of heterozygosity and cancer allelic fractions 16. Signals of selection in aggregates of non-coding regions of known cancer genes 17. Mutational process and indel enrichment 18. Mutation association with splicing 19. Structural variation analysis 20. RNA structural analysis 21. Cancer associated germline variant distance to non-coding driver candidates

31 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1. Patient cohorts Generation of high-quality tumor set (Loris Mularoni) We selected a total of 2583 samples to be included in the driver detection analyses. This list contains all the samples that were not flagged as problematic by the PCAWG-TECH group. A single aliquot was assigned to each sample; in cases where multiple aliquots were present, we selected a single aliquot based on the following criteria, in order of importance: - we prioritized primary tumors over metastatic or recurrent tumors - we selected aliquots with an OxoG score higher than 40 - we prioritized aliquots with the highest quality (as indicated by the Stars values) - we prioritized aliquots with RNA-seq data availability - we prioritized aliquots with the lowest contamination (as indicated by the ContEst values) - if a selection could not be made after applying the above filters we selected an aliquot randomly

Selection of tumor cohorts for analysis (Esther Rheinbay) Individual tumor type cohorts from the high-quality tumor set were selected for analysis if they met a minimum size. This size was determined based on the cumulative number of patients, such that no more than 2.5% of total patients were excluded. This led to a minimum cohort size criterion of 20 patients, and removed the Bone-Cart (9 donors), Bone-Epith (11), Bone-Osteoblast (5), Breast-DCIS (3), Breast-LobularCa (13), Cervix-AdenoCa (2), Cervix- SCC (18), CNS-Oligo (18), Lymph-NOS (2), Myeloid-AML (13) and Myeloid-MDS (2) individual cohorts. Samples from these cohorts were still included in meta-cohort analysis (see below).

Tumor meta-cohorts (Esther Rheinbay) Tumor meta-cohorts were assembled for identification of drivers and increase of discovery power across cell lineages and organ systems. The following meta cohorts were used in driver analyses: By cell type of origin: Epithelial: Carcinoma (comprised of tumor cohorts Bladder-TCC, Biliary-AdenoCa, Breast-AdenoCa, Breast-LobularCa, Cervix-AdenoCa, ColoRect-AdenoCa, Eso- AdenoCa, Kidney-ChRCC, Kidney-RCC, Liver-HCC, Lung-AdenoCa, Ovary- AdenoCa, Panc-AdenoCa, Panc-Endocrine, Prost-AdenoCa, Stomach-AdenoCa, Thy-AdenoCa, Uterus-AdenoCa,Head-SCC, Cervix-SCC, Lung-SCC), Adenocarcinoma (Biliary-AdenoCa, Breast-AdenoCa, Breast-LobularCa, Cervix- AdenoCa, ColoRect-AdenoCa, Eso-AdenoCa, Kidney-ChRCC, Kidney-RCC, Liver-

32 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

HCC, Lung-AdenoCa, Ovary-AdenoCa, Panc-AdenoCa, Prost-AdenoCa, Stomach- AdenoCa, Thy-AdenoCa, Uterus-AdenoCa), squamous epithelium (Head-SCC, Cervix-SCC, Lung-SCC) Mesenchymal cells/sarcoma (Bone-Cart, Bone-Epith, Bone-Leiomyo, Bone- Osteosarc) Glioma (CNS-PiloAstro, CNS-Oligo, CNS-GBM) Hematopoietic system (Lymph-BNHL, Lymph-CLL, Lymph-NOS, Myeloid-AML, Myeloid-MDS, Myeloid-MPN) By organ system: digestive tract (Liver-HCC, ColoRect-AdenoCa, Panc-AdenoCa, Eso-AdenoCa, Stomach-AdenoCa, Biliary-AdenoCa), kidney (Kidney-RCC, Kidney-ChRCC), lung (Lung-AdenoCa, Lung-SCC), lymphatic system (Lymph-BNHL, Lymph-CLL, Lymph-NOS), myeloid (Myeloid-AML, Myeloid-MDS, Myeloid-MPN), breast (Breast- AdenoCa, Breast-LobularCa), female_reproductive_system (Breast-AdenoCa, Breast-LobularCa, Cervix-AdenoCa, Cervix-SCC, Ovary-AdenoCa, Uterus- AdenoCa), central nervous system (CNS-PiloAstro, CNS-Oligo, CNS-Medullo, CNS-GBM) Pan-cancer: Two “Pan-cancer” cohorts were created: “Pancan-no-skin-melanoma” containing all tumor types with the exception of Skin-Melanoma to remove issues caused by very high mutation rate tumors; and “Pancan-no-skin-melanoma-lymph” with the additional removal of lymphoid tumors (Lymph-BNHL, Lymph-CLL, Lymph-NOS) that have local somatic hypermutation caused by AID.

33 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

2. Mutational hotspot analysis (Randi Istrup Pedersen) We selected the top 50 single position hotspots based on the number of patients with an SNV mutation. The individual positions marked as problematic by the site-specific noise filter (see below) analysis were excluded.

Each hotspot was defined by its genomic position and annotated by the number of patients with an SNV mutation in the given hotspot. We also annotated each hotspot with whether it falled in one of the genomic element types analyzed in the driver discovery. We further overlapped with loop-regions of palindromes, which are hypothesized to fold into DNA-level hairpins, and with location in immunoglobulin loci. When a hotspot overlapped a protein- coding gene we extracted the corresponding amino acids changes from Oncotator1 (http://portals.broadinstitute.org/oncotator/).

We identified known driver hotspots, by overlap with the somatic driver positions compiled in the Cancer Genome Interpreter repository (https://www.cancergenomeinterpreter.org/mutations), which among others include mutations from ClinVar, DoCM and the literature (ref: https://doi.org/10.1101/140475).

For each hotspot we calculated the proportion of mutations in the defined cohorts and meta- cohorts. Only cohorts with at least 20 patients, and at least 10 patients or 10% of patients with an SNV were included in Fig. 1a (for the distribution in all cohorts and meta-cohorts see Extended Data Fig. 1). Lymph-BNHL and Lymph-CLL were shown together as Lymphoid malignancies.

Based on mutational signature analysis of all the cancer samples, we extracted the posterior probability that each hotspot mutation from a given patient was generated by one of 37 identified mutational signatures. In lymphoid malignancies somatic hyper-mutations generated by AID come in clusters along the genome. Posterior probabilities for the ten signatures relevant for the lymph cohorts were therefore derived from models that consider the correlation of AID mutations along the genome. For each hotspot the collected posterior probabilities were averaged.

34 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

3. Mutational Signatures (Jaegil Kim) We performed a de novo global signature discovery to identify mutational signatures operating in PCAWG WGS cohort (N = 2,583). All single nucleotide variants (SNVs) were stratified into M = 4608 mutation channels according to the combination of six base substitutions at pyrimidine bases (C>A/G/T and T>A/C/G) in penta-nucleotide sequence contexts and a transcription strand direction, transcribed (-), non-transcribed (+), non-coding regions. All insertions and deletions were classified into additional 20 channels depending on the number of inserted or deleted bases and any indels beyond 9 bases were grouped into a single channel. The resulting mutation frequency matrix X (4,628 channels across 2,583 samples) were ingested to the Broad Institute’s signature analysis pipeline, SignatureAnalyzer, to determine the optimal number of signatures (K) and the signature profiles by de-convoluting X into a product of two non-negative matrices as X ~ WH, W (M × K) and H (K × N) being a signature-loading and an activity-loading matrix, respectively. The most distinguished feature of SignatureAnalyzer is that it suggests an optimal number of signatures best explaining X at the balance between the error measure and the model complexity exploiting Bayesian non-negative matrix algorithm (NMF)2,3,4. Our de-novo signature discovery identified 37 signatures including a split of 4 UV-related signatures, 3 APOBEC-related signatures, 5 POLE-related signatures, 3 MSI-related signatures, 2 COSMIC17 signatures, and other 21 singleton signatures (see PCAWG7 paper for details).

Although the matrix H contains the most representative activity across samples it also has spurious activity assignments intrinsic to the global signature discovery (i.e., applying SignatureAnalyzer on samples with different histological subtypes and mutational processes). To minimize this contamination and interference we re-assigned activity separately in each histological subtype using a more refined projection approach. We first identified a subset of ultra-mutant samples with a dominant activity of POLE (n = 8), MSI (n = 18), and alkylating signatures (n = 1) from the original H, which are usually very exclusive to samples with a specific DNA repair or replication defect, or a treatment. To prevent a false- positive assignment of these signatures in remaining samples we introduced a binary matrix Z (37 × N), where the row elements corresponding to POLE, MSI, and alkylating signatures, are set to zero except for samples with a dominant activity, while all other elements are set to one. More specifically, the projection was done by keeping the signature-loading matrix W' (M × 37) frozen (W' represents the normalized signature profiles of 37 global signatures), while allowing SignatureAnalyzer to determine a subset of signatures and infer the activity- loading matrix H' (37 × N') that best approximates the mutation frequency matrix X' (M × N'), where N' represents the number of samples in each histological subtype, as X' ~ W' (Z'¤H'), Here “¤” denotes element-wise multiplications.

35 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

4. Definition of genomic elements (Morten Muhlig Nielsen; Nicholas Sinnott-Armstrong) For coding elements and elements pertaining to protein coding genes (3’UTR, 5’UTR and protein coding promoters) regions were defined based on GENCODE annotations (v.19)5:

Coding elements (CDS): The set of coding bases collapsed across all coding transcripts with a given GENCODE gene ID.

Protein coding splice site elements (pcSS): Intronic regions extending 6 bases from donor splice sites and 20 bases from acceptor splice sites were collected for all coding transcripts. Bases were collapsed across all coding transcripts with a given GENCODE gene ID. The global set of CDS bases were subtracted.

5’UTR elements (5UTR): The set of 5’UTR bases collapsed across all coding transcripts with a given GENCODE gene ID. The global set of CDS and pcSS bases were subtracted.

3’UTR elements (3UTR): The set of 3’UTR bases collapsed across all coding transcripts with a given GENCODE gene ID. The global set of CDS, pcSS and 5UTR bases were subtracted.

Protein coding promoter elements (pcPROM): Regions extending 200 bases in both directions from all protein coding transcripts’ transcription start sites (5’ ends). Bases were collapsed across all coding transcripts with a given GENCODE gene ID. The global set of CDS and pcSS bases were subtracted.

lncRNA elements: lncRNA transcripts were defined based on annotations from GENCODE (v.19) and MiTranscriptome (v.2)6 not overlapping GENCODE. Transcripts were included if fulfilling criteria 1-5 and 6 or 7 below: 1) No sense overlap to protein coding gene regions 2) More than 5kb away from protein coding genes on sense strand. 3) Longer than 200 bases. 4) Not annotated as the following biotypes: immunoglobulin, T-cell receptor, Mt_rRNA, Mt_tRNA, miRNA, misc_RNA, rRNA, scRNA, snRNA, snoRNA, ribozyme, sRNA or scaRNA. 5) Not overlapping genomic regions aligning back to the human genome (self chained regions). 6) More than 20% of bases overlap conserved elements (except if annotated as pseudogene)

36 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

7) Expressed in more than 10% of PCAWG samples with RNAseq data.

Genes corresponding to the selected transcripts were supplemented with a set of known functional lncRNA genes from the literature in addition to GENCODE annotated non-coding snoRNA and miRNA host genes. The elements were made by collapsing bases across transcripts with given gene ID. The global set of CDS, pc_SS, 5UTR, 3UTR, pcPROM and lncRNA_SS bases were subtracted.

lncRNA splice site elements (lncRNASS): Intronic regions extending 6 bases from donor splice sites and 20 bases from acceptor splice sites were collected for all lncRNA transcripts. Bases were collapsed across all lncRNA transcripts with a given gene ID. The global set of CDS, pc_SS, 5UTR, 3UTR and pcPROM bases were subtracted.

lncRNA promoter elements: Regions extending 200 bases in both directions from all lncRNA transcripts’ transcription start sites (5’ends). Bases were collapsed across all lncRNA transcripts with a given gene ID. The global set of CDS, pc_SS, 5UTR, 3UTR, pcPROM and lncRNA_SS bases were subtracted.

Short RNA elements: Short RNA transcripts were defined based on annotations from databases Rfam (v.11)7, tRNAscanSE (v.2.0)8 and snoRNAdb (v.3)9 in addition to GENCODE transcripts with biotype annotations mt_rRNA, mt_tRNA, misc_RNA, rRNA and snoRNA. Bases were collapsed across all smallRNA transcripts with a given gene ID. The global set of CDS, pc_SS, 5UTR, 3UTR and pcPROM bases were subtracted.

microRNA elements: Mature miRNAs were defined based on mirBase (v.20)10 and a set of potential novel miRNAs11

Enhancers: Contiguous 15-state ChromHMM called enhancers correlated between H3K4me1 and RNA-seq across 57 human tissues were downloaded from Roadmap Epigenomics Consortium extended data12. Associated links, defined by co-occurring activity in a given cell type, were merged across cell types at FDR = 0.1. HoneyBadger213 p10 calls for all DNase I sites were filtered to peaks with signal strength 0.8 or greater and intersected with enhancer elements. The union of all DNase I peaks which overlapped with a given element, with all CDS regions filtered out, were used as the input to driver detection.

37 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

5. Candidate driver identification methods A summary of approaches used by each method is listed in Supplementary Table 2. ActiveDriverWGS (Juri Reimand) Driver analysis with ActiveDriverWGS (ref) was performed after discarding hypermutated samples (>90,000 mutations) from the PCAWG cancer cohort.To avoid leakage of signals from known cancer drivers, we removed missense mutations in analyses of non-coding regions. ActiveDriverWGS is a local mutation enrichment method for genome-wide discovery of cancer driver mutations with increased mutation burden of single nucleotide variants (SNVs) and indels. ActiveDriverWGS performs a model-based test whether a given genomic element is significantly more mutated than adjacent background genomic sequence (+/- 10kb and introns). Statistical significance of mutations is computed with a Poisson-linked generalised linear regression model. The null model treats all SNVs with trinucleotide context as cofactor, while indels are modelled with a separate cofactor for all nucleotides. Mutation counts per nucleotide are presented as the response variable The alternative model tests whether the element has different mutation burden than the background sequence. The null and alternative models are compared with chi-square tests and confidence intervals of expected mutations were derived from the null model using resampling. If the confidence intervals indicated significant excess of mutation in the background and depletion in the element of interest, we inverted corresponding small p- values (p=1-p if p<0.5). Elements with no mutations were automatically assigned p=1.

CompositeDriver0.2 (Eric Minwei Liu, Ekta Khurana) We have developed CompositeDriver (ref) – a computational method that combines signals of mutation recurrence and the functional impact score derived from FunSeq2 scheme14 to identify coding and non-coding elements under positive selection. CompositeDriver assigns a score to each region of interest (i.e., CDS, promoter, UTR, enhancer or ncRNA) through summation of positional mutation recurrence multiplied by the functional impact score for all mutations within the region. A null CompositeDriver score distribution is built to calculate the p-value for a region of interest. Mutations in the same element type but outside the region of interest are defined as background mutations. To build the null distribution, the same numbers of mutated positions are repeatedly drawn (default is 105 times) from background mutations with similar replication timing and similar mutation context15. By drawing random mutations from the same element type, CompositeDriver incorporates DNase I hypersensitive sites and histone modification marks as covariates into the null model16. Finally, the Benjamini-Hochberg method is used for multiple hypothesis correction17.

dNdScv (Inigo Martincorena)

38 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

dNdScv is a maximum-likelihood algorithm designed to test for positive or negative selection in cancer genomes or other sparse resequencing studies. dNdScv models somatic mutations in a given gene as a Poisson process, accounting for sequence composition and mutational signatures using 192 trinucleotide substitution rates. Mutation rates are also known to vary across genes, often co-varying with functional features of the human genome, such as replication time and chromatin state. This information is exploited by dNdScv to refine the estimates of the background mutation rate of each gene, using a negative binomial regression. This regression removes known sources of variation of the mutation rates and models the remaining unexplained variation of the mutation rate across genes as being Gamma distributed, which protects the method against overconfidence in the estimated background mutation rate for a gene. Overall, the local mutation rate for a gene is estimated accounting for mutational signatures in the samples analysed, the sequence composition of a gene in a trinucleotide context, 20 epigenomic covariates and the local number of synonymous mutations in the gene. Inferences on selection are carried out separately for missense substitutions, truncating substitutions (nonsense and essential splice site mutations) and indels, and then combined into a global P-value per gene. dNdScv has been described in much greater detail elsewhere18.

DriverPower (Shimin Shuai) DriverPower is a combined burden and functional impact test for coding and non-coding cancer driver elements. In the DriverPower framework, randomized non-coding genome elements are used as training set. In total 1373 reference features covering nucleotide compositions, conservation, replication timing, expression levels, epigenomic marks and compartments are collected for downstream modelling. For the modelling, a feature selection step by randomized Lasso is performed at first. Then the expected background mutation rate is estimated with selected highly important features by binomial generalized linear model. The predicted mutation rate is further calibrated with functional impact scores measured by CADD and Eigen scores. Finally, a p-value is generated for each test element by binomial test with the alternative hypothesis that the observed mutation rate is higher than the adjusted mutation rate.

ExInAtor (Andres Lanzos; Rory Johnson) ExInAtor was specifically created for prediction of cancer driver lncRNAs, but is agnostic to gene type and can also be used for protein-coding genes. The exons of each gene are identified and collapsed across transcript isoforms. For each gene, the trinucleotide content of the exonic region is calculated. The remaining intronic regions, along with 10 kb of sequence upstream and downstream, are defined as the background region. From this

39 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

background, a new background region is created by randomly sampling the maximum number of nucleotides, such that the trinucleotide content exactly matches that of the exonic region. Next, the number of mutations in the exonic and sampled background regions are compared by hypergeometric test. Genes with elevated exonic mutational density are considered candidate driver genes. ExInAtor was used with randomisation seed of 256. Otherwise ExInAtor was run exactly as described in Lanzós et al, 201719.

LARVA (Jing Zhang; Lucas Lochovsky) LARVA20, or Large-scale Analysis of Recurrent Variants in noncoding Annotations, is a computational method that detects significantly elevated somatic mutation burdens in genomic elements—both coding and non-coding—to identify putative cancer-driving elements. Given a cancer cohort variant call set, and a list of genomic elements, LARVA models the expected background somatic mutation rate by fitting a beta-binomial distribution to the elements' variant counts. This model properly accounts for the high mutation rate variability seen throughout the genome, which improves over some previous models' assumption of a constant mutation rate. LARVA's model also incorporates the influence of mutation rate covariates, such as DNA replication timing. LARVA's output lists each genomic element from the input, along with a p-value based on the deviation of the element's observed variant count from the expected variant count under LARVA's model.

MutSig (Julian Hess, Esther Rheinbay) The MutSig suite21 classifies whether genomics features, both coding and non-coding, are highly mutated relative to a predicted background mutation rate (BMR), which varies on a macroscopic-level across patients (patient-specific mutation rates can span orders of magnitude across pan-cancer cohorts) and genes (known covariates such as replication timing are strongly correlated with mutation rate) and on a microscopic level across sequence contexts (since mutational signatures are heterogeneous across a cohort and highly context-dependent). MutSig accounts for all three of these to compute the joint BMR distribution across genes/patients/contexts, and then convolves across the latter two dimensions to estimate the expected distribution of total background burden for a given gene across a whole cohort. Genes are then scored by how their total non-background burden exceeds this null distribution.

MutSig estimates a gene’s BMR by its synonymous mutation rate for coding genes, and by its mutation rate at nonconserved positions for non-coding genes. If the number of background mutations in a given gene is insufficient to provide a confident estimate of its

40 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

BMR, MutSig will incorporate the background counts from other genes with similar covariate profiles into its estimator.

MutSig (MutSig2CV) was originally designed for coding regions only21. Modifications to this version of the algorithm to run on non-coding regions are novel to PCAWG. Coding MutSig also incorporates per-gene functional impact and clustering tests, which were not run on non-coding regions.

NBR (Inigo Martincorena) NBR is a method that test for evidence of higher mutation density than expected by chance in a given region of the genome, while accounting for trinucleotide mutational signatures, sequence composition and the local density of mutations around each element. This method has been described in detail in a previous publication22, where it was used to identify candidate driver noncoding elements across 560 breast cancer whole-genomes.

Based on some of the features of dNdScv, NBR involves two main steps. First, all mutations across all elements tested are used to obtain maximum-likelihood estimates for the 192 rate

parameters (rj) describing each of the possible trinucleotide substitutions in a strand-specific

manner. rj = nj/Lj , where nj is the total number of mutations observed across samples of a

given trinucleotide class (j) and Lj is the number of available sites for each trinucleotide. These rates are used to estimate the total number of mutations across samples expected under neutrality in each element considering the mutational signatures active in the cohort

and the sequence of the elements (Eh = Sj rjLj,h). This estimate assumes no variation of the mutation rate across elements in the genome. Second, a negative binomial regression is used to refine this estimate of the background mutation rate of an element, using covariates

and Eh as an offset. In this study, the local density of somatic mutations (normalized by sequence composition) was used as a covariate, using a window around the element of a variable size across cohorts to ensure sufficient numbers of mutations in each window around each element and excluding coding sequences and previously identified candidate noncoding driver regions. Replication time and average gene expression level for 100 kb genomic bins were also used as covariates. The negative binomial regression models mutation counts as Poisson-distributed within an element with mutation rates varying across elements according to a Gamma distribution. As in dNdScv, this provides a refined estimate

of the background mutation rate for each element (Eh*) as well as a data-driven measure of uncertainty around this estimate (-q- the overdispersion parameter of the negative binomial regression). P-values for each element are calculated using a cumulative negative binomial

41 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

distribution with the mean (Eh*) and dispersion (q) parameters estimated by the negative binomial regression.

To protect against neutral indel hotspots or indel artefacts, unique indel sites rather than total indels per element were used. To protect against misannotation of a mutation clusters as sets of independent events, a maximum of two mutations per region and per sample were considered in the analysis.

ncdDetect (Malene Juul) ncdDetect23 is a driver detection method tailored for the non-coding part of the genome. It uses a burden based approach, in which the frequency of mutations is considered, to reveal signs of recurrent positive selection across cancer genomes. For each candidate region, the observed mutation frequency is compared to a sample- and position-specific background mutation rate. A scoring scheme is applied to further account for functional impact in the significance evaluation of a candidate cancer driver element. In the present application, the scoring scheme is defined as log-likelihoods, i.e. minus the natural logarithm of the sample- and position-specific probabilities of mutation.

The position- and sample-specific probabilities of mutation used by ncdDetect are obtained by a statistical null model, inferred from somatic mutation calls of a collection of cancer samples (https://doi.org/10.1101/122879). The model includes a set of genomic annotations, known to correlate with the mutation rate in cancer. These are replication timing, trinucleotides (the nucleotide under consideration and its left and right flanking bases), genomic segment (a variable segmenting the genome into regulatory element types) and a position-specific measure of the local mutation rate (a weighted average of the mutation rate, calculated across samples in a 40 kb window flanking each specific position plus/minus 10 kb).

ncDriver (Henrik Hornshøj) The ncDriver method (doi: https://doi.org/10.1101/182642) provides separate evaluations of the significance for two functional mutation properties, the level of conservation and the level of cancer type specificity. In the ncDriverConservation test, the conservation level of mutated positions were evaluated locally for being surprisingly high, given the distribution of conservation within the element. The p-value of the mean mutation phyloP conservation score for an element was obtained by Monte Carlo simulation of 100,000 mean phyloP scores based on the same number of mutations. Each mutated element was also evaluated globally by looking up the rank of the element mean phyloP conservation score among all

42 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

elements annotated as the same type. This provided p-values for both local and global mutation conservation level, which were combined into a single conservation p-value using Fisher’s method. In the ncDriverCancerType test, the distribution of observed mutation counts of an element across the cancer types were evaluated for being surprising compared to expected counts estimated from a background null model (as described for the ncdDetect method) that accounts for cancer type specific mutation signatures and other co-variates. A Goodness-of-fit test with Monte Carlo simulation was used to determine whether the distribution of observed mutation counts across cancer types within the element is surprising given the expected mutation counts based on cancer types, mutation contexts and element type. For indels, the expected mutation counts were estimated solely from the mutation rates calculated from the mutation context, cancer type and element type.

OncodriveFML (Loris Mularoni) OncodriveFML24 is a method designed to estimate the accumulated functional impact bias of tumor somatic mutations in genomic regions of interest, both coding and non-coding, based on a local simulation of the mutational process affecting it. The rationale behind OncodriveFML is that the observation of somatic mutations on a genomic element across tumors, whose average impact score is significantly greater than expected for said element constitutes a signal that these mutations have undergone positive selection during tumorigenesis. This, in turn is considered as a direct indication that this element drives tumorigenesis.

OncodriveFML first computes the average functional impact score of the observed mutations in the element of interest. The functional impact scores of mutations have been calculated using both CADD25 (coding and non-coding regions) and VEST326 (only coding regions). Then the method randomly samples the same number of observed mutations following the probability of mutation of different tri-nucleotides, computed from the mutations observed in each cohort. The randomization step is repeated many times (1,000,000 in these analyses) and each time an average functional impact score is calculated. Finally, OncodriveFML derives an empirical p-value for each element by comparing the average functional impact score observed in the element to its local expected average functional impact score resulting from the random sampling. The empirical p-values are then corrected for false discovery rate and genomic elements that after the correction are still significant are considered candidate drivers.

regDriver (Husen M. Umer)

43 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

regDriver assesses the significance of mutations affecting transcription factor motifs using tissue-specific functional annotations (https://doi.org/10.1002/humu.23014). For each tumor cohort, functional annotations from the cell lines most similar to the respective tumor type are gathered. A functionality score is computed for each mutation based on its overlapping functional annotations. regDriver, collects highly-scored mutations in each of the defined elements and assesses the elements’ significance by comparing its accumulative score to a background score distribution obtained from the simulated sets. Therefore, only candidate regulatory mutations are considered in evaluating mutation enrichment per element.

44 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

6. Simulated datasets

Broad simulations (Yosef Maruvka, Gad Getz) Due to their differing context characteristics, we simulated SNVs and indels with different approaches. For SNVs, we divided the genome into 50kb regions. For each region, we counted the number of mutations in it across all the PCAWG patients and divided it by the total number of mutations. Every mutation was randomly assigned into a new region based on the region's rate. The position inside the region was chosen to maintain the trinucleotide context of each mutation (the 5’ and 3’ nearest neighbors and the mutated position itself) and the alternate allele. In addition, for every base we counted how many times it was covered sufficiently, in 401 tumor and normal WGS pairs, in order to enable calling of a mutation27. The fraction of patients with enough coverage at a given site was used as the position’s probability for being mutated inside the new current region.

For indels, a new, randomized position was chosen in a region of 50,000 bases around the indel. The position of the new indel was chosen to match the indel 5’ and 3’ neighboring reference bases. For insertions, the inserted motif was the same as the original insertion, but for deletions only the length of the indels was kept but not the exact sequence.

DKFZ simulations (Carl Hermann, Calvin Chan) This simulation utilizes the SNV calls to perform a localised randomisation. The original SNV entries which do not map to chromosome 1-22, X, Y are first filtered and excluded from randomization. All SNVs located in the protein coding regions (CDS) corresponding to GENCODE19 definition are erased before performing randomisation. The trinucleotide centered at each SNV position is determined and an identical trinucleotide is randomly sampled within the 50kb window. In case of insertion, instead of the mutated trinucleotide, the neighboring nucleotide of the insertion site is scanned within the randomisation window. For deletion and multi-nucleotides variants, the altered sequence is scanned within the randomization window with a ranked probability assigned for each position. The randomised sample is then selected from the top 100 matched positions with scaled probability.

Sanger simulations (Inigo Martincorena) This simulation aimed to generate datasets of neutral somatic mutations that retain key sources of variation in mutation rates known to exist in cancer genomes, including mutational signatures, and variable mutation rates across the genome and among individuals and cancer types. To do so while minimizing the number of assumptions in the simulation, we used a simple local randomization approach. First, all coding mutations as

45 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

well as mutations in the TERT promoter, MALAT1 or NEAT1 were excluded. Second, each mutation in each patient was randomly moved to an identical trinucleotide within a 50 kb window, while retaining the patient ID. Third, mutations falling within 50 bp of their original position were filtered out. This simple randomization retains the variation of the mutation rate and mutational signatures across large regions of the genome, across individuals and across cancer types.

46 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

7. Statistical framework for the integration of results from multiple driver discovery methods (Grace Tiao, Gad Getz) The classical approach for combining p-values obtained from independent tests of a given null hypothesis was described by R. A. Fisher in 1948. He noted that for a set of k p-values, the sum X of the transformed p-values, where

k X = -2 Σ i=1 ln(pi)

and pi is the p-value for the ith test, follows a chi-square distribution with 2k degrees of freedom28. Thus, to obtain a single combined p-value for a set of independent tests, the new test statistic X is computed from the p-values obtained from the tests and scored against a chi-square distribution with 2k degrees of freedom. Fisher’s test is asymptotically optimal among all methods of combining independent tests29; however, in cases where tests exhibit dependence, the Fisher combined p-value is generally too small (anti-conservative). In this study, we combine p-values from several driver detection methods, many of which share similar approaches and whose results are therefore not independent. To address this issue, we used an extension of the Fisher method developed by Morten Brown for cases in which there is dependence among a set of tests29. Using the same test statistic, renamed Ψ to indicate the difference in the independence assumption, Brown observed that if Ψ were assumed to have a scaled chi-square distribution – i.e.,

2 ψ ~ c X 2f

then

f = E[ψ]2/var(ψ) and c = var(ψ)/ 2E[ψ]

Note that E[Ψ] = 2k irrespective of the independence requirement, and that

var(ψ) = 4k + 2 Σi

Thus when the pi are independent, var(ψ) = 4k, which gives f = k and c = 1, and the test statistic follows the chi-square distribution with 2k degrees of freedom described by Fisher. However, when the independence condition is relaxed, var(ψ) ≠ 4k, and the test statistic generally follows a different, scaled, chi-square distribution whose scaling parameter c and

degrees of freedom 2f are determined by the covariances of the pi’s. The covariances can

47 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

be computed via numerical integration over the joint distributions of all pi and pj pairs, but this requires knowledge of the joint distribution; and even in cases where the joint distribution is known, the integration may not be computationally feasible for large and complex datasets30. In this study, following the example of Poole et. al31, we computed the empirical

covariance of pi and pj, using the samples wi and wj , where wi is the set of all reported p- values for method i, and used the empirical covariance to approximate the Brown scaled chi- square distribution. The advantage to this approach is that the empirical covariance

estimation is non-parametric – it does not assume an underlying joint distribution of pi and pj – and is thus applicable to complex and interrelated biological datasets where data is noisy and not regularly Gaussian. Poole et. al showed that the empirical covariance estimation approach is accurate, robust, and efficient for such datasets.

Implementing and evaluating the integration method on simulated and observed data To evaluate the efficacy of the empirical Brown’s method of dependent p-value integration, we generated three sets of simulated mutation data (see above) and ran the driver detection algorithms on each of the simulated datasets. We checked that the p-value results from the various driver detection algorithms followed the expected null (uniform) distribution (Extended Data Fig. 3a). Then, for each simulated data set, we calculated the empirical covariance for each pair of driver algorithm results. We then used these covariance values over simulated datasets to compute the combined Brown p-values on observed data: for each gene in the observed PCAWG somatic mutation dataset, we computed the Brown test statistic from the set of p-values reported by the various driver detection algorithms. The Brown test statistic was then evaluated against the appropriate chi-square distribution, whose scale and degree parameters were approximated by the covariance values calculated on the simulated data (see above). We ran this procedure, as well as the Fisher method, for six representative tumor- type cohorts (Breast-AdenoCa, CNS-GBM, ColoRect-AdenoCa, Lung-AdenoCa, Uterus- AdenoCa, and meta-Carcinoma) and found that the Brown combined p-values generally followed the null distribution as expected (Extended Data Fig. 3b). The Fisher combined p- values were significantly inflated (Extended Data Fig. 3b), confirming that dependencies existed between the results reported by the various driver detection algorithms. We next explored whether it was possible to reduce the number of algorithm runs required to complete these calculations for all tumor-type cohorts by computing the covariance values on observed data instead of simulated data. In each of the six representative tumor-type cohorts, we calculated the empirical covariances on the observed

48 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

data only and then computed the integrated Brown p-values on the observed data using the observed covariances. Significant genes identified using only observed covariances remained mostly unchanged from the significant genes identified using the simulated covariances (Extended Data Fig. 3d), and examination of the differences in the covariance values between the simulated estimations and the observed estimations revealed only minor differences in values (Extended Data Fig. 3c). The significant drivers presented in this study were identified using this final approach – e.g., by computing integrated Brown p-values using estimations of covariance on observed data only. Integration of p-values from observed data was performed for 42 tumor-type cohorts and 13 target element types. Methods were selected for each given data set (see “Selecting methods to include in the integration of observed p-values”, below) and raw p-values smaller than 10-16 were trimmed to that value before proceeding with the integration. Methods with missing data for a given element (i.e., ones that failed to report a p-value for a given element) were excluded from the calculation for that element, and therefore in some cases the integrated Brown p-value was computed from p-values reported by only a subset of all the driver detection algorithms contributing results for that data set.

Selecting methods to include in the integration of observed p-values In some cases, individual driver detection algorithms reported p-values for a given data set that deviated strongly from the expected uniform null distribution. These were methods for which the quantile-quantile (QQ) plots demonstrated considerable inflation. We removed results that reported an unusual number of significant hits by calculating, for each set of results, the number of significant elements found by each individual method using the Benjamini-Hochberg FDR with q<0.1 as the significance threshold. Any single method that reported four times the median number of significant elements identified by individual methods was discarded from the integration. In a separate analysis, we found that removing methods that yielded fewer hits than the median (i.e., methods with deflated QQ-plots) did not affect the number of significant genes identified through the integration of the reported p- values (Extended Data Fig. 3d); hence we did not remove such methods.

8. Post-filtering of candidates (Esther Rheinbay, Morten Muhlig Nielsen, Lars Feuerbach, Henrik Tobias Madsen) Post-filtering of significant hits was performed to remove those with accumulation of mutations caused by sequencing problems or mutational processes. In particular, we applied the following: (i) at least three mutations are present in the element, (ii) mutations are present in at least three patients of the tested cohort, (iii) less than 50% of mutations are located in palindromic DNA sequence22, (iv) more than 50% of mutations are located in

49 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

mappable genomic regions (CRG alignability, DAC blacklisted regions and DUKE uniqueness32; (v), less than 50% of mutations occur near indels, (vi) a site-specific noise filter (see below); and (vii) manual review of sequence evidence for novel drivers. For lymphomas, which contain regions of somatic hypermutation caused by AID enzyme activity, we (viii) further required less than 50% of mutations contributed by this process; and for Skin-melanoma, we (ix) excluded mutations occurring in the extended (CTTCCG) context that contributes to promoter hotspot mutations in this tumor type33,34,35. Even after this motif- based filter, a large number of promoter and 5’UTR regions remained in the list of candidates. We, therefore, marked these as likely due to failed repair in TF occupied sites since it is unclear how many of them are true or false drivers. Additionally, elements that failed manual mutation review were filtered out at this stage.

Site-specific noise filter Genomic positions of mutations in each significant hit were analyzed in all normal control samples to assess the position-specific noise-level. Therefore, for each of the three non- reference nucleotides a and cohort c ϵ C the relative frequency of normal samples which had at least two reads supporting the alternative allele a were calculated as p(a,c) ϵ [0;100]. The

noise score was then computed as !∈! ���!"(�(�, �) + 1) . Mutations at positions with a score > 20 for at least one of the non-reference nucleotides were flagged. Elements for which the number of mutations at flagged positions exceeded 20% were removed.

DNA palindromes We define a palindrome as a sequence of DNA followed by its complementary reverse with a sequence of variable length in between (Fig. 3d). It is hypothesized that these palindromes can temporarily form DNA hairpins36. While in the hairpin state the loop region is single- stranded and open to attack by APOBEC enzymes. Based on observations in breast cancer whole genome sequences37, we decided to consider palindromes with a minimal repeat length of 6 bp and an intervening sequence (loop) length of 4-8 bp. We call these regions genome-wide using the algorithm described in Ye et al.38 however using our own implementation (https://github.com/TobiasMadsen/detectIR). In total, we find 7.3 M palindrome regions covering a total of 135.2 MB of which 33.6 MB are loop sequence.

Computing the false discovery rate We controlled the false discovery rate (FDR) within each of the sets of tested genomic elements by concatenating all integrated Brown p-values from across all tumor-type cohorts and applying the Benjamini-Hochberg procedure17 to the integrated Brown p-values. A q- value threshold of 0.1 was chosen to designate cohort-element combinations as significant

50 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

hits. In addition, we defined cohort-element combinations in the range 0.1≤ q<0.25 as “near significance”. We next applied several additional, mutation-based filtering criteria to each significant or near-significant candidate and assigned p-values of 1 to candidates that failed these filtering criteria. Final Benjamini-Hochberg FDR values were then re-calculated on the adjusted sets of integrated Brown p-values to arrive at a list of candidate driver cohort- element combinations.

51 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

9. Sensitivity and precision analysis of driver predictions (Iñigo Martincorena) All methods employed in this study were shown to have a low rate of false positives when run on a series of neutral simulated datasets without driver mutations. To evaluate the sensitivity and precision of different methods and particularly of our approach for p-value integration, we compared their relative performance in detecting known cancer genes when applied to protein-coding genes. As a reference gold-standard set of known cancer genes we used a list of 603 genes from the manually-curated Cancer Gene Census v80 database. Results are shown in Extended Data Fig. 4.

52 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

10. Gene expression analyses (Samir B. Amin, Morten M. Nielsen, Andre Kahles, Nuno Fonseca, Lehmann Kjong, members of the PCAWG Transcriptome Working Group and Jakob Skou Pedersen)

To extend the RNAseq-based expression profiling of GENCODE annotations provided by the PCAWG Transcriptome Working Group39, we profiled an extended set of gene annotations, including a comprehensive set of non-coding RNAs (described above and at Synapse:syn5325435).

The profiling used the docker-based workflow described in 39 for 1,180 RNA-seq donor libraries, matched to WGS data across 27 different cancer types39. In brief, raw sequence reads from donor libraries were uniformly evaluated for QC using FastQC tool, and subsequent alignment was performed on QC-passed libraries using two methods: STAR (v2.4.0i)40 and TopHat2 (v2.0.12)41. Resulting QC-passed bam files were independently used to quantify extended RNA-seq annotations at the gene-level counts using htseq-count method with following parameters: -m intersection-nonempty --stranded=no --idattr gene_id. This step resulted in two sets of gene-level counts files per donor library which were independently normalized using FPKM normalization and upper quartile normalization (FPKM-UQ). The final expression values were provided as a gene-centric table (rows as genes, columns as samples) with each value representing an average of the TopHat2 and STAR-based alignments FPKM values. Gene-centric tables based on both, GENCODE and extended RNA-seq annotations are available at Synapse data portal: syn2364711. Docker- based workflow for quantifying extended RNA-seq annotations is at https://github.com/dyndna/pcawg14_htseq

53 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

11. Normalization for copy number variation (Henrik Tobias Madsen, Morten Muhlig Nielsen, and Jakob Skou Pedersen) To account for the effects of somatic copy number alterations (SCNAs) on expression, we used two different approaches to create two additional versions of the expression profiles: First a conservative approach where we remove all samples not having the regular bi-allelic copy number for the gene in question. Second a less conservative approach where we build a regression model of expression data based oncopy number (CN) data, and then tested for an effect of somatic mutations on the residual (i.e. the expression that is not explained by copy number).

Generally the higher the copy number of a particular gene the higher expression. The relationship between copy number and gene expression is not strictly linear, as various feedback mechanisms in the cell try to compensate for the mostly deleterious effects of SCNAs. This is known as dosage compensation and has been studied extensively in the context of mammalian sex , but also in evolution of yeast and in diseases caused by aneuploidy42–45. We therefore fit a linear regression model between the logarithm of expression and the logarithm of CN. This effectively amounts to a power-regression model.

A number of factors makes it hard to learn the regression parameters for each gene and cancer type in isolation. (i) for some cancer types we have only a limited number of samples; (ii) for some genes there is not much variation in CN; and (iii) the variation in expression between samples is generally high. We overcome these problems by employing a mixed model strategy, that allows sharing of information between genes, effectively regularizing the parameter estimates for gene cancer-type combinations that carry little information on their own.

Let and denote the expression and SCNA measurement respectively for gene, , in cancer-type, , and sample respectively. We then define:

,

where and are fixed effects, whereas and are random effects, with

and . Finally the residual is . Using this model we infer a global SCNA to expression regression, , but allow some regularized gene-specific and gene/cancer type-specific variation: and .

54 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Thus we exploit the similarity across genes and similarity within genes across cancer types.

Since the variance increases with the absolute value of the explanatory variable associated with a random slope, this kind of mixed model display heteroskedasticity. Furthermore the model is not invariant under scaling of the explanatory variable, in this case SCNA. We centralise SCNA, such that normal diploid regions have the least variance.

55 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

12. Mutation to expression association (Morten Muhlig Nielsen, Henrik Tobias Madsen, and Jakob Skou Pedersen)

Mutation to expression association was calculated using non-parametric rank sum based statistics on z-score normalized expression values. This equalizes expression mean and variance for each cancer type and is a way to allow for comparison across cancer types. For comparisons of expression in mutated vs non-mutated (wild-type) cases within the same cancer type, the test reduces to the Wilcoxon rank sum test. For comparisons involving samples in multiple cancer types, such as meta-cohorts and pan-cancer cohorts, the statistic and associated p-value is still the non-parametric Wilcoxon rank sum test, and thus no assumptions regarding distribution of z-score expressions are made. Tied expression values were broken by adding a small random rank robust value. Association estimates were performed based on original expression values as well as the two copy-number normalized expression sets mentioned above. Fold difference values were calculated per mutation as

the log2 ratio of the expression of the mutated tumor to the median of all wild-type tumors of the same cancer type. Reported Fold Difference values for an element with multiple mutations represent the median fold difference of all mutations in that element.

56 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

13. Copy number analyses (Esther Rheinbay) We surveyed significant focal copy number alterations for candidate driver genes as orthogonal evidence for their “driverness”. Significant copy number alterations were obtained from the TCGA Copy Number Portal (http://portals.broadinstitute.org/tcga/home), analysis 2015-06-01-stddata-2015_04_02 regular peel-off, a database of recurrent copy number alterations calculated by the GISTIC2 algorithm46 across >10,000 samples and 33 tumor types from TCGA. GISTIC2 results were included for candidate drivers if a gene was significant (residual q<0.1) and was located within a peak with ≤ 10 genes. Visualization was performed with the Integrative Genomics Viewer (IGV)47.

57 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

14. Power Calculations (Esther Rheinbay) Calculation of mutation detection sensitivity. Average detection sensitivity, the power to detect a true somatic variant, was calculated using a binomial model across all exon-like regions for different genomic element types as previously published27,48. Detection sensitivity was based on sequencing coverage and estimated clonal variant allele fraction from 60 representative PCAWG alignments from the pilot cases (https://doi.org/10.1101/161562; Supplementary Table 7). Clonal allele fraction was estimated based on PCAWG consensus purity and ploidy estimates (doi: https://doi.org/10.1101/161562) as purity/(average ploidy)48. Element-wise averages were calculated as average across all exons for a given element.

Estimation of total number of promoter hotspot mutations. Detection sensitivity for all patients was calculated for the two most recurrent TERT promoter hotspot sites (chr5:1295228, chr5:1295250; hg19) using total read depth at these positions, sample purity and average ploidy. For each cohort, the number and percentage of powered (≥90%) patients was obtained. The number of total expected mutations was then inferred as number of observed (called) mutations divided by the fraction of patients powered. The number of “missed” mutations is the difference between the total expected and observed mutations. Percentages of these numbers were calculated relative to the size of individual patient cohorts. Confidence intervals (95%) on the total percentage of patients with a TERT hotspot mutation were calculated using the beta distribution. Poisson confidence intervals were calculated for the number of missed mutations in the PCAWG cohort. Note that the inference of TERT mutations assumes exactly one mutation per patient. Estimates for the FOXA1 promoter hotspot mutation (chr14:38064406; hg19) were conducted using the same procedure.

Calculation of the minimum powered mutation frequency in a population. Power to discover driver elements mutated at a certain frequency in the population were conducted as described before21,49, but solving for the lowest frequency for a driver element in the patient population that is powered (≥90%) for discovery. The calculation of this lowest frequency takes into account the average background mutation frequencies for each cohort/element combination, the median length and average detection sensitivity for each element type, patient cohort size, and a global desired false positive rate of 10%.

58 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

15. Associations between mutation and signatures of selection: loss of heterozygosity and cancer allelic fractions (Federico Abascal, Iñigo Martincorena) For protein-coding sequences mutation recurrence can be analysed in the context of the functional impact of mutations (e.g. missense, truncating) to better distinguish the signal of selection. In contrast, estimating the functional impact of mutations in non-coding elements of the genome is a difficult, yet unsolved problem. To overcome this limitation and be able to compare selection signatures for both coding and non-coding elements under a similar framework, we developed two measures of selection which are agnostic to the functional impact of mutations.

Association between mutation and loss of heterozygosity When a tumour carries a driver mutation in one allele of a given gene, it may be the case that a second hit on the other allele confers a growth advantage and is positively selected. When one of the events involves the loss of one of the alleles the process is referred to as loss of heterozygosity (LOH). This kind of biallelic losses are typical of, but not exclusive to, tumour suppressor genes (TSGs).

For each gene, we build a 2x2 contingency table indicating the number of cases in which the gene was mutated or not and the number of cases in which the gene was subject to LOH or not. We applied a Fisher’s exact test of proportions to identify which genes showed an excess of LOH associated to mutation. P-values were corrected with the FDR method to account for multiple hypotheses testing. This analysis was applied to each tumour type and cohort separately and proved very successful in identifying TSG as well as some oncogenes (OGs) as well.

Association between mutation and cancer allelic fractions Driver mutations that provide an advantage for tumour cells are expected to show higher allelic fractions based on different interacting processes, including: early selection; amplification of the locus carrying the driver mutation; loss of the non-mutated locus (LOH). Comparing cancer allelic fractions (CAF) can be informative to detect signatures of selection, both for TSGs and OGs.

CAFs are defined here as the proportion of reads coming from the tumour and carrying the mutation. To transform observed fractions (VAFs) into CAFs, tumour purity and local ploidy needs to be taken into account according to the following formula:

CAF = VAF * (Lp * Pt + 2 * (1 - Pt)) / (Lp * Pt)

59 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Where Lp corresponds to the local ploidy for the mutated locus, and Pt denotes the tumour purity. Ploidy and tumour purity predictions were obtained from Dentro et al (in preparation).

To determine whether CAFs for a given gene or element were higher than expected we compared them to the CAFs observed in flanking regions. To define flanking regions we took 2 kb at each side of the gene/element, excluding any eventually overlapping coding exon, and also included introns (if present). The two sets of CAFs associated to each gene/element, i.e. those CAFs lying within the gene and those flanking it, were compared with a t-test to detect significant deviations. P-values were corrected with the FDR method. This approach was able to identify most known TSGs and OGs.

60 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

16. Signals of selection in aggregates of non-coding regions of known cancer genes (Federico Abascal, Iñigo Martincorena) We conducted a series of analyses on regions combined across genes to determine whether the paucity of driver mutations found in non-coding regions was related to lack of statistical power in single-gene analyses. This analysis also aimed to estimate how many driver mutations present in cancer genes were missed in this study. For protein-coding sequences, the number of driver mutations was estimated using dN/dS ratios as described in (50). For non-coding, regulatory regions of protein-coding genes (Promoter and UTRs), we relied on a modified version of the NBR negative binomial regression model described above (Section 5) to quantify the overall excess of driver mutations. We applied a second approach to determine whether there was an enrichment of LOH associated to mutations in the different types of non-coding regions associated to protein-coding genes.

Observed vs. expected numbers of mutations based on the NBR mutation model NBR was used to estimate the background mutation rate expected across cancer genes, using a conservative list of 19,082 putative passenger genes as background. The resulting model is used to predict the numbers of expected SNVs and indels per element type per gene, and aggregate sums across genes. For this analysis we selected a reduced but diverse set of 142 cancer genes, encompassing 27 focally amplified and 22 deleted genes found by TCGA and present in the Cancer Gene Census51, and 112 genes with a significant signal of selection in this study (q<0.1) and present in the Cancer Gene Census. The list of 142 genes can be found in Supplementary. Table 7. To be as accurate as possible, we used a diverse set of covariates in the NBR model, including: local mutation rate (estimated on neutral regions +/- 100 kb around each gene), detection sensitivity (defined as the element-average proportion of callable samples per site according to MuTect), gene expression covariates (first 8 principal components of the matrix of average gene expression values in each tumour type, as well as two binary variables marking the 500 genes with highest expression values in any tumour and 1,229 genes with a maximum FPKM lower than 0.1 across tumour types), and averaged copy-number calls for each gene across all samples (see Supplementary Note for more details). For each element type, the sum of observed mutations across the 142 cancer genes were compared to the sum of the expected rates to estimate the excess of mutations in regulatory and coding regions of cancer genes. An excess of observed mutations provides an estimate of the number of driver events18. Confidence intervals were calculated using the equation for the ratio of two Poisson observations, which are the number of mutations in the list of known cancer genes and in the list of passenger genes. It is important to know that these confidence intervals do not capture uncertainty in the background model and should be interpreted with caution. For this reason,

61 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

we systematically evaluated the impact of a diverse array of covariates on our estimates (see Supplementary Note). We also note that this test can underestimate the number of non-coding drivers since some driver mutations can be present in the list of putative passenger genes, although this effect is expected to be quantitatively small if the density of driver mutations in regulatory regions of known cancer genes is higher than in those of putative passenger genes.

Mutation-LOH association for aggregates of genes For this analysis we combined data across known cancer genes, including 603 genes in the Cancer Gene Census v80 and 154 additional significantly mutated genes found by exome studies18,21. To estimate whether there was an excess of LOH associated to mutation in regulatory and coding regions of cancer genes, we calculated the fold change in LOH for the aggregate of cancer genes and normalized it dividing by the fold change observed in passenger genes. Confidence intervals were estimated using parametric bootstrapping (100,000 pseudoreplicates) for both cancer and passenger genes.

62 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

17. Mutational process and indel enrichment (Federico Abascal, Iñigo Martincorena) After noticing the skewed distribution of indel lengths in genes like ALB, SFTBP, NEAT1 and MALAT1, we carried a search for other genes showing the same pattern, which may be subject to the same mutational process. For every gene we record the proportion of indels of length 2-5 bp out of the total number of indels and compared this proportion with the background proportion using a binomial test. The background proportion was calculated using all protein-coding and lncRNAs genes. For every gene we also calculated the indel rate and compared it to the background indel rate using a binomial test. Both sets of p- values were independently corrected with the FDR method. The analysis was done for each tumour type separately. Genes with a q-value < 0.1 both for enrichment in 2-5 bp indels and for higher indel rates were further analyzed as candidates to be under the process of localized indel hypermutation described in this study. The levels of expression of these genes were analysed across all tumour types.

63 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

18. Mutations association to splicing (Andre Kahles) For the assessment of the relationship between mutations in the U2 locus upstream of WDR74 and local changes in , we analysed changes in the percent- spliced-in (PSI) value52 of alternative splicing events located in WDR74 relative to the given genotype. The analysis was based on four different alternative splicing event types extracted and quantified by the PCAWG Transcriptome Working Group (exon skip, intron retention, alternative 3’ splice site, alternative 5’ splice site)39. In total we analysed 116, 103, 27 and 98 events, respectively, for the above event types. We then filtered the events for a minimum of expression evidence (non-NaN PSI) in at least 10 samples, presence of the alternate allele together with expression evidence in at least 5 samples, and a minimum absolute distance of mean PSI values of alternate and reference group of 0.05. Based on the selected events, we retained 3, 3, 1 and 0 events for analysis. Using an ordinary least squares model with the genotypes as factors and the the PSI values as responses, we computed p-values and the variance explained by presence of mutations. Multiple testing correction followed the Bonferroni method.

To assess the relationship between mutations in the U2 locus upstream of WDR74 and global changes in splicing, we used two global splicing statistics. We measured the amount of splicing as the number of edges in the splicing graph of a gene in a given samples. All splicing graphs were taken from the analyses of the PCAWG Transcriptome Working Group39. For each sample, we computed the mean number of splicing edges over all genes, resulting in the extent of splicing per sample. We measured the extent of splicing as how far each event is outlying from the mean over all event in the same cancer type (encoded by project code x histotype). The splicing outlier values per sample and gene were taken from the PCAWG Transcriptome Working Group analyses39. The mean over all genes of a sample was taken as a statistic for the extent of splicing.

For both statistics, we again used an ordinary least squares model as described above to model the relationship between genotype and splicing. As splicing is highly tissue dependent, we included the project codes as additional factors, accounting for both tissue specificity as well as possible underlying batch effects. To extract the amount of additional variance explained through the presence of mutations, we computed one model on project codes and presence of mutations and one model on project codes alone. For the model, we only included samples of project codes, where we had at least one sample with a mutation present, resulting in a total of 618 samples over 14 project codes.

64 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

19. Structural variation analysis (Morten Muhlig Nielsen, Lars Feuerbach) Structural variant data was provided by the PCAWG Structural Variation Working Group. The data provide p-values for the observed breakpoint counts in 50kb bins along the genome. Candidate elements were overlapped with the bins, and fisher’s method was used to calculate a single p-value for each element. The set of element p-values were corrected with the FDR method.

65 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

20. RNA structural analysis (Radhakrishnan Sabarinathan, Ciyue Shen, Chris Sander, Jakob Skou Pedersen) In order to test if the observed mutations (SNVs) in the RMRP gene are biased towards high RNA secondary structure impact, we performed a permutation test by following the steps used in oncodriveFML24 together with the predicted structural impact scores from RNAsnp53. At first, the RNAsnp was run with the options -m 1 -w 300 and other default parameters to obtain the minimum correlation coefficient (r_min) score for each possible mutations in the RMRP gene. The r_min scores were then transformed, 1-((r_min+1)/2), to range between 0 and 1, where 1 indicates high structural impact score. Further, we followed the steps of oncodriveFML (see section ‘oncodriveFML’ for more details) with 1,000,000 randomizations and by using per sample mutational signatures (i.e., the probability of observing a mutation in a particular tri-nucleotide context in a given sample) to compute the p-value at the cohort and sample level.

Furthermore, the RNA secondary structure impact scores (r_min) of indels (insertions/deletion) were computed by using a modified version of RNAsnp (since the current version of RNAsnp is limited to substitutions only). Briefly, we first computed the probability matrices of wild-type and mutant sequences (by taking into account the insertion or deletion) and then adjusted the size of matrices to be equal (by introducing additional rows and columns with zeros in one of the matrices with respect to insertion or deletion). Further, by following the steps of RNAsnp, we computed the r_min score. The structure shown in Fig. 6d is based on the conserved secondary structure annotation obtained from Rfam (RF00030)54.

Tertiary structure contacts in RMRP were predicted using evolutionary couplings co-variation analysis (EC analysis 55) of the multiple sequence alignment of 933 eukaryotic RMRP sequences from Rfam (RF00030). The EC analysis (software available at https://github.com/debbiemarkslab/plmc) was run with the options -le 20.0, -lh 0.01, -t 0.2, -m 100 and the top 100 interactions were chosen as predicted contacts, either in secondary or tertiary structure, depending on local context. As no experimental 3D structure or cross- linking experiments of the mammalian RMRP are available, interaction sites were inferred by homology to the partially known yeast RMRP crystal structure. We (1) aligned the human RMRP sequence with the Saccharomyces cerevisiae RMRP sequence using the sequence family covariance model from Rfam and (2) mapped the locations of RNA-protein interactions within 4Å56 from the crystal structure and the experimentally determined RNA- protein crosslinking sites57, and RNA substrate crosslinking sites58 from the yeast sequence to the human RMRP sequence. For the crosslinking sites, a ±3 nucleotide window is

66 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

reported as the interaction site. In order to test if the locations of the observed indels are biased towards tertiary structure, protein- or substrate-interaction sites, 1,000,000 randomizations of five indels were performed assuming uniform distribution of indels across the RMRP gene body, and an empirical p-value was calculated (P = .08), showing a potential for functional impact.

Two different overlapping deletion calls in the RMRP gene body were observed in the same thyroid cancer patient. After manual inspection of the tumor and normal bam files, it was found that these calls were based on the same mutational event.

67 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

21. Cancer associated germline variant distance to non-coding driver candidates. (Morten Muhlig Nielsen) We used a set of genome-wide significant cancer associated germline SNPs (n=650) from the NHGRI-EBI GWAS catalog59 as collected by Sud et al60. We evaluated the genomic distance from candidate non-coding drivers to the closest germline variant. All distances were above 50 kb with the exception of the TERT promoter which was 1 kb away from a coding variant (rs2736098) in the TERT gene.

68 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

References

1. Ramos, A. H. et al. Oncotator: cancer variant annotation tool. Hum. Mutat. 36, E2423–9 (2015). 2. Kasar, S. et al. Whole-genome sequencing reveals activation-induced cytidine deaminase signatures during indolent chronic lymphocytic leukaemia evolution. Nat. Commun. 6, 8866 (2015). 3. Kim, J. et al. Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors. Nat. Genet. 48, 600–606 (2016). 4. Polak, P. et al. A mutational signature reveals alterations underlying deficient homologous recombination repair in breast cancer. Nat. Genet. 49, 1476–1486 (2017). 5. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012). 6. Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208 (2015). 7. Burge, S. W. et al. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, D226– 32 (2013). 8. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997). 9. Lestrade, L. & Weber, M. J. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 34, D158–62 (2006). 10. Kozomara, A. & Griffiths-Jones, S. miRBase: annotating high confidence miRNAs using deep sequencing data. Nucleic Acids Res. 42, D68–D73 (2014). 11. Londin, E. et al. Analysis of 13 cell types reveals evidence for the expression of numerous novel primate- and tissue-specific microRNAs. Proc. Natl. Acad. Sci. U. S. A. 112, E1106–15 (2015). 12. Fingerman, I. M. et al. NCBI Epigenomics: a new public resource for exploring epigenomic data sets. Nucleic Acids Res. 39, D908–12 (2011). 13. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). 14. Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014). 15. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013). 16. Polak, P. et al. Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature 518, 360–364 (2015). 17. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995). 18. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell (2017). doi:10.1016/j.cell.2017.09.042 19. Lanzos, A. et al. Discovery of Cancer Driver Long Noncoding RNAs across 1112 Tumour Genomes: New Candidates and Distinguishing Features. (2016). doi:10.1101/065805 20. Lochovsky, L., Zhang, J., Fu, Y., Khurana, E. & Gerstein, M. LARVA: an integrative framework for large-scale analysis of recurrent variants in noncoding annotations. Nucleic Acids Res. 43, 8123–8134 (2015). 21. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014). 22. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole- genome sequences. Nature 534, 47–54 (2016). 23. Juul, M. et al. Non-coding cancer driver candidates identified with a sample- and position-specific model of the somatic mutation rate. Elife 6, (2017). 24. Mularoni, L., Sabarinathan, R., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. OncodriveFML: a general framework to identify coding and non-coding regions with

69 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

cancer driver mutations. Genome Biol. 17, 128 (2016). 25. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014). 26. Douville, C. et al. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel). Hum. Mutat. 37, 28–35 (2016). 27. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013). 28. Anscombe, F. J. & Fisher, R. A. Statistical Methods for Research Workers. J. R. Stat. Soc. Ser. A 118, 486 (1955). 29. Brown, M. B. 400: A Method for Combining Non-Independent, One-Sided Tests of Significance. Biometrics 31, 987 (1975). 30. Poole, W., Gibbs, D. L., Shmulevich, I., Bernard, B. & Knijnenburg, T. A. Combining dependentP-values with an empirical adaptation of Brown’s method. Bioinformatics 32, i430–i436 (2016). 31. Poole, W., Gibbs, D. L., Shmulevich, I., Bernard, B. & Knijnenburg, T. Combining Dependent P-values with an Empirical Adaptation of Brown’s Method. (2015). doi:10.1101/029637 32. Derrien, T. et al. Fast computation and applications of genome mappability. PLoS One 7, e30377 (2012). 33. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016). 34. Perera, D. et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 532, 259–263 (2016). 35. Fredriksson, J., Eliot, K., Filges, S., Stahlberg, A. & Larsson, E. Recurrent promoter mutations in melanoma are defined by an extended context-specific mutational signature. (2016). doi:10.1101/069351 36. Pearson, C. E., Zorbas, H., Price, G. B. & Zannis-Hadjopoulos, M. Inverted repeats, stem-loops, and cruciforms: significance for initiation of DNA replication. J. Cell. Biochem. 63, 1–22 (1996). 37. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole- genome sequences. Nature 534, 47–54 (2016). 38. Ye, C., Ji, G., Li, L. & Liang, C. detectIR: a novel program for detecting perfect and imperfect inverted repeats using complex numbers and vector calculation. PLoS One 9, e113349 (2014). 39. Fonseca, N. A. et al. Pan-cancer study of heterogeneous RNA aberrations. bioRxiv 183889 (2017). doi:10.1101/183889 40. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). 41. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013). 42. Hose, J. et al. Dosage compensation can buffer copy-number variation in wild yeast. Elife 4, (2015). 43. Lockstone, H. E. et al. Gene expression profiling in the adult Down syndrome brain. Genomics 90, 647–660 (2007). 44. Birchler, J. A. Reflections on studies of gene expression in aneuploids. Biochem. J 426, 119–123 (2010). 45. Heard, E. & Disteche, C. M. Dosage compensation in mammals: fine-tuning the expression of the X chromosome. Genes Dev. 20, 1848–1867 (2006). 46. Mermel, C. H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 12, R41 (2011). 47. Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).

70 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

48. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012). 49. Rheinbay, E. et al. Recurrent and functional regulatory mutations in breast cancer. Nature 547, 55–60 (2017). 50. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell (2017). doi:10.1016/j.cell.2017.09.042 51. Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nat. Genet. 45, 1134–1140 (2013). 52. Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010). 53. Sabarinathan, R. et al. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. Hum. Mutat. 34, 546–556 (2013). 54. Nawrocki, E. P. et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–7 (2015). 55. Weinreb, C. et al. 3D RNA and Functional Interactions from Evolutionary Couplings. Cell 165, 963–975 (2016). 56. Perederina, A., Esakova, O., Quan, C., Khanova, E. & Krasilnikov, A. S. Eukaryotic ribonucleases P/MRP: the crystal structure of the P3 domain. EMBO J. 29, 761–769 (2010). 57. Khanova, E., Esakova, O., Perederina, A., Berezin, I. & Krasilnikov, A. S. Structural organizations of yeast RNase P and RNase MRP holoenzymes as revealed by UV- crosslinking studies of RNA–protein interactions. RNA 18, 720–728 (2012). 58. Esakova, O., Perederina, A., Berezin, I. & Krasilnikov, A. S. Conserved regions of ribonucleoprotein ribonuclease MRP are involved in interactions with its substrate. Nucleic Acids Res. 41, 7084–7091 (2013). 59. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–6 (2014). 60. Sud, A., Kinnersley, B. & Houlston, R. S. Genome-wide association studies of cancer: current insights and future perspectives. Nat. Rev. Cancer 17, 692–704 (2017).

71 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Supplementary Note

1. Overview of non-coding hotspots in top 50 not categorized in main text 2. Discussion of additional significant non-coding elements 3. Evaluation of splicing association of U2 mutations upstream of WDR74 4. Enrichment of protein-coding drivers 5. Impact of covariates on the estimation of driver mutations in functional regions of cancer genes 6. Author contributions by category 7. References

1. Overview of non-coding hotspots in top 50 not categorized in main text Six of the 35 non-coding hotspots in top 50 are not assigned to the mutational processes described in the main text and highlighted in Fig. 1b. Below we describe these in detail.

Four of the six remaining non-coding hotspots were found on the X chromosome (X:116579329, X:7791111, X:83966025, and X:83967552). The mutations in these hotspots were mainly contributed by males (61-94% of mutations per hotspot), suggesting the possibility of uncaught noise. One of these hotspots is located in palindromic DNA (X:7791111). The DNA sequence around this position is composed of two mononucleotide repeats of pairing bases (ATTTTTAAAAAAAAAT), which fits our definition of palindromic structures (Methods). The sequence context around this hotspot do not match the sequence recognized by APOBEC enzymes1 and the hotspot had a low APOBEC signature probability (Fig. 1b). It is therefore unlikely to be caused by the same mutational processes as the other palindromic hotspots in the top 50 hotspots, but rather likely represents noise associated with homopolymer runs (PCAWG variants paper).

A hotspot (3:164903710) contained mutations in multiple cancer types with most mutations contributed by Liver-HCC (6/17). It is located about 1 kb downstream of SLITRK3, two base pairs from an annotated CTCF transcription factor binding site. CTCF sites and the base pairs immediately downstream are known to be highly mutated in several cancer types including Liver-HCC. This position is lowly conserved, suggesting that it would not have a functional impact2.

The last non-coding hotspot (1:103599442) had a high proportion of mutations attributed to the COSMIC5 signature. It overlaps a repetitive LINE region, suggesting potential mapping

72 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

issues. Moreover, the position is flanked by two other positions, two base pairs away in both directions, which also have mutations in the same samples and reads, however these positions were called problematic in the Panel-of-Normals (PoN) filter, further suggesting mapping issues.

2. Discussion of additional significant non-coding elements WDR74 promoter The WDR74 promoter has already been suggested as a potential driver in several studies3–6. However, we found that mutations are concentrated inside a U2 RNA where the density of putative polymorphisms is abnormally high. This outstanding level of diversity could in principle be due to higher mutability in the germline. However, given the repetitive nature of the U2 element (there are hundreds of copies in the genome) and the extreme levels of putative diversity, it is more likely that this region is a source of mapping artefacts. This observation raise concerns about the validity of WDR74 promoter and, hence, on its potential as a cancer driver (Extended Data Fig. 10).

TMEM107 3’UTR The case of the TMEM107 3’UTR is very similar to that of the WDR74 promoter. There is a small RNA (SNORD118) embedded in the 3’UTR, and is there where most mutations concentrate. In PCAWG normals data there is a remarkable excess of putative SNP diversity in this element. (Extended Data Fig. 10).

Additional non-coding RNAs hits Eleven additional ncRNAs or ncRNA promoter hits, not described in detail in the main text, passed the flagging of potential false positive hits (RNU6-573P, RPPH1, RNU12, TRAM2- AS1, G025135, G029190,RP11-92C4.6 and promoters of RNU12, MIR663A, RP11- 440L14.1, and LINC00963). The driver role of these were generally not supported by additional lines of evidence and lacked functional evidence or appeared to be affected by technical artefacts. They are described individually below.

RNU6-573P The small RNA RNU6-573P is detected in Endocrine pancreas (q = 7.3x10-4) with three mutations. However, it appears to be a nonfunctional pseudogene recently inserted in the human lineage. Not only the locus, but the wider region is subject to increased mutational burden, further supporting that mutational mechanisms or technical issues rather than selection underlies the mutational recurrence.

73 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

RNU12 RNU12 is a spliceosomal RNA, which was found significant in Lymph-BNHL (q = 5.0x10-2) and in the Hematopoietic system meta cohort (q = 7.0x10-2). The mutation rate is 1.6 times higher inside the gene in pan-cancer compared to the flanking regions. However, the mutation rate is 4 times higher in gnomAD, which renders it as a likely problematic region. Significance in the promoter of RNU12 was also identified in Lymphomas (q = 3.5x10-2) and the Hematopoietic system meta cohort (q = 1.8x10-6). But this region is overlapping the promoter of POLDIP3 (Polymerase delta-interacting protein 3), which makes interpretation difficult.

MIR663A The promoter of MIR663A was recurrently mutated in the carcinoma meta cohort (q = 7.2x10-4). It is the primary transcript of MiR-663, which has tumorigenic functions in gastric cancer and nasopharyngeal carcinoma7. The mutation rate is 1.2 times higher inside the element compared to the flanking regions, but the mutation rate is 2.4 times higher in gnomAD. The mutations were also not correlated with expression.

RPPH1 RPPH1 forms the RNA component of the RNase P ribonucleoprotein, which matures precursor-tRNAs by cleaving their 5’ end8. It is transcribed from a divergent promoter together with the protein-coding gene PARP2, however, no association with expression was observed for either gene. The region has a high level of germline polymorphisms in normal samples indicative of a problematic region to map.

TRAM2-AS1 The promoter element of the antisense non-coding RNA TRAM2-AS1 is recurrently mutated in the Female reproductive system meta cohort (q = 0.10). The promoter is shared with the divergently transcribed protein-coding gene TRAM2. Interestingly, the promoter has 8 mutations in other cohorts, which collectively associate with the expression of TRAM2 (P = 0.03; carcinoma) but not TRAM2-AS1 (P = 0.9). It is uncertain which gene is controlled by this significant promoter element.

G025135 G025135 is a ncRNA element from MiTranscriptome9 identified here as a significant driver candidate in Lymph-CLL (q = 0.01). There are patients with many mutations, indicating that an unknown mutation process is at play leading to false driver prediction.

74 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

G029190 The G029190 ncRNA is also from MiTranscriptome with significance in Kidney-RCC (q = 0.04) and in the Kidney meta cohort (q = 0.05). It is located downstream of RAB11FIP3 and upstream of CAPN15. A possible function of this gene has not been established.

LINC00963 LINC00963 was found significant in the Kidney meta cohort (q = 0.04). It has many alternative transcripts with different TSSs and thus a large promoter. LINC00963 is known to be involved in the prostate cancer transition from androgen-dependent to androgen- independent and metastasis via the EGFR signaling pathway10. The SNV mutation rate is generally high (0.05 mutations per position), although lower than the flanking regions (0.08).

RP11-92C4.6 RP11-92C4.6 is a predicted antisense ncRNA, which was identified as a significant driver candidate in the Breast meta cohort (q = 0.08). It is a short region with low conservation containing four SNVs. It is overlapping the COL15A1 protein-coding gene with the upstream promoter region. COL15A1 has previously been linked to ovarian cancer. It is unclear whether this ncRNA annotation is valid and whether these reflect true functional mutations.

RP11-440L14.1 The promoter of RP11-440L14.1 was identified as a candidate driver in the Carcinoma meta cohort with 14 mutations (q = 5.4x10-3). It has a hotspot position with four mutations overlapping two different deletions. The lncRNA is located between CPLX1 and PCGF3. The validity of the annotation and possible function for this lncRNA has not been established.

3. Evaluation of splicing association of U2 mutations upstream of WDR74 We hypothesized that the mutations observed in the evolutionarily conserved spliceosomal U2 RNA upstream of WDR74 may affect splicing. As prior sequencing evidence, summarised in the GENCODE version 19 gene annotations11, shows that the U2 is co- transcribed as an alternative 5’ exon in some cases, it might play a role in splicing and regulation of the WDR74 transcript. We therefore both evaluated (A) whether the mutations associate with alternative splicing of the WDR74 gene; and (B) whether they associate with transcriptome-wide changes in splicing.

A) We modelled the association of the the somatic mutations with the amount of alternative

75 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

splicing, as measured by percent-spliced-in values12, using ordinary least squares regression, which did not reveal any significant associations. For none of the seven tested alternative splicing events (three exon skips, three intron retentions, one alternative 3-prime splice site) the genotype was able to explain a substantial fraction of the observed variation. The largest R-squared value was 0.01 and no p-value reached nominal significance.

B) A similar result was obtained for the relationship between the presence of U2 mutations and global changes in alternative splicing. Modeling the amount of alternative splicing on the ICGC project codes alone, which reflect cancer types and contributing institutions, we reached an R-squared value of 0.612, reflecting the strong relationship between splicing and tissue identity. Including the presence of somatic mutations in the model did not change this value. We obtained an analogous result for the extent of alternative splicing, reaching an R- squared of 0.414 on the project codes alone and the same value when including the genotype into the model.

In conclusion, we found no evidence for an association between mutations in the U2 upstream of WDR74 and alternative splicing of WDR74 transcripts or global changes in splicing.

4. Enrichment of protein-coding drivers Collectively, mutations occurring in the promoter region of the 757 cancer genes did not have a significantly different association with expression than synonymous mutations (Supplementary fig. 1a). Similarly, promoter and UTR mutations in cancer genes are not significantly enriched in LOH with respect to mutations in putative passenger genes (Supplementary fig. 1b). This is consistent with the prediction above that only a very small fraction of the mutations observed in the promoters and UTRs of known cancer genes are genuine driver events.

76 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Supplementary figure 1: a, Expression associated with mutations in coding and promoter regions of cancer genes. Z-score expressions associated with non-sense mutations deviate significantly from silent mutations, likely through nonsense mediated decay, whereas expressions associated with promoter mutations do not differ from that of silent mutations. Only mutations in diploid positions were used. b, Excess of LOH associated to mutation in regulatory and coding regions of cancer genes. The y-axes shows the ratio of fold changes in cancer vs. passenger genes, with fold changes representing the excess or depletion of LOH associated with mutation.

5. Impact of covariates on the estimation of driver mutations in functional regions of cancer genes As described in Methods, we estimated the abundance of driver mutations in coding and regulatory regions (promoter and UTRs) of 142 known cancer genes using the NBR background model fitted on putative passenger genes.

Different covariates were included to improve the fit of the model. The local mutation rate, calculated on neutral regions within +/-100 kb around each element, was included to account for regional variation of mutation rates. Detection sensitivity (d.s.) was averaged for each element using the MuTect estimates of callable sites from each sample. For genes in chromosomes X and Y, we did not have MuTect estimates and d.s. was imputed using a linear regression model of d.s. as a function of GC content. Detection sensitivity accounts for elements where the mutation rate is lower than expected just because of poor sequencing

77 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

coverage. A third type of covariate was included in the model to account for associations between gene expression levels and mutation rates. Starting with a matrix of mean FPKM expression values for each gene and tumour type, we log-transformed and scaled the expression matrix using pseudocounts and applied Principal Component Analysis to reduce the dimensionality. We selected the first 8 components as covariates, which together explained 95.5% of the variance. In addition, we added two additional covariates to account for non-linearity between expression and mutation rate in the tails of the expression spectrum. To accomplish this, we created two binary variables, one marking the 500 genes with highest maximum expression values across tumour types, the other marking 1,229 genes whose expression did not exceed FPKM values of 0.1 in any tumour type. Finally, since tumours are rich in amplifications and deletions and these events may result in seemingly increased or decreased mutation rates, we included a copy-number covariate, calculated as the average copy number of each gene across all PCAWG samples.

We intentionally selected a small set of cancer genes to estimate the abundance of driver mutations because with larger numbers of genes the signal of selection becomes weaker while any systematic bias may become more prominent. We noticed these biases becoming increasingly apparent when analyzing larger sets of genes. In an attempt to capture a diverse but relatively small group of cancer genes, we included Cancer Gene Census genes recurrently mutated by coding mutations in PCAWG and genes frequently altered by copy number gains or losses, as described in Methods.

Supplementary Table 9 shows the impact of using different covariates on the 142 genes selected for this analysis. Reassuringly, this shows that the estimates are broadly consistent across models with different covariates, with variations typically within the confidence intervals of alternative models. This confirms that the overall conclusions are largely unaffected by the use of different models. Supplementary Figure 2 shows the results for the 142 genes and for a larger set of 603 genes from the Cancer Gene Census.

To evaluate the performance of the NBR model, we compared the number of driver substitutions predicted by NBR in the CDS regions of the 142 cancer genes to the number predicted by dN/dS (calculated by dNdScv). dN/dS offers an independent estimate of the number of driver substitutions in a group of genes using the local density of synonymous mutations to estimate the neutral expectation13, instead of predicting the background mutation rate by extrapolation from putative passenger genes using a regression model.

Reassuringly, in these 142 genes, NBR predicts 2,258 (CI95%: 2,127-2,382) driver

substitutions and dN/dS predicts 2,384 (CI95%: 2,043-2,759).

78 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Supplementary figure 2: Estimation of the excess substitutions (left) and indels (right) in regulatory and protein-coding regions of cancer genes and for the 142 (left) genes and a larger set of genes from the Cancer Gene Census (right). Ratios of observed vs. expected number of mutations (top), the percentage of mutations predicted to be drivers (middle), and the total number of predicted drivers in all cancers and in each patient (bottom) are shown.

6. Author contributions by category

Driver discovery A.L., C.H., C.W., D.A.W., E.K., E.M.L., E.R., G.G., G.T., H.M.U., I.M., J.K., J.R., J.S.P., K.A.B., K.D., K.I., L.M., L.U.-R., M.M.N., M.P.H., N.A.S., P.D., P.J.C., R.J., S.B.A., T.A.J., T.T. and , Y.F. contributed and curated genomic annotations. C.H., C.W.Y.C., I.M., S.B.A. and Y.E.M. contributed randomized mutational data sets for driver discovery. A.L., A.G.-P., A.H., D.L., D.T., E.K., E.M.L., E.R., H.H., H.M., I.M., I.R., J.B., J.C.-F., J.D., J.F., J.M.H.,

79 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

J.R., J.Z., K.C., K.D., K.I., L.L., L.M., L.S., L.U.-R., L.W., M.B.G., M.J., N.L.-B., O.P., P.D., Q.G., R.S., S.K. and S.S. contributed driver methods and results. E.R., G.G. and G.T. implemented results integration. A.K., C.v.M., C.V., G.T., H.H., I.M., J.R., L.F. and M.M.N. contributed driver results integration. C.H., C.W.Y.C., E.K., E.R., G.G., J.K., J.M.H., J.S.P., M.M.N. and R.I.P. contributed single site recurrence analysis.

Candidate vetting and filtering E.R., F.A., H.H., H.T.M., J.K., L.F. and M.M.N. contributed individual candidate filters. A.L., C.H., C.W., E.K., E.M.L., E.R., F.A., G.G., G.T., H.H., H.M., H.M.U., J.K., J.M.H., J.S.P., K.D., L.F., L.S., M.M.N., M.S.L., N.A.S. and R.J. performed candidate vetting.

Case-based analysis E.R., F.A., G.G., H.H., I.M., J.M.H., J.R., J.S.P., K.P., M.M.N. and M.P.H. contributed case- based analysis. A.G.-P., A.H., A.L., C.H., D.C., D.T., E.K., E.R., F.A., G.G., G.T., H.H., H.K., I.M., J.C.-F, J.R., J.S.P., K.I., K.P., L.M., L.S., L.U.-R., L.W., M.A.R., M.B.G., M.M.N., M.S.L., N.A.S., N.L.-B., O.P., R.I.P., R.S., S.K. and Y.K. contributed results interpretation. A.K., J.S.P., K.A.B., K.-V.L., M.M.N., N.A.F., S.B.A., T.A.J. and T.T. contributed expression profiling (extended GENCODE set). A.K., C.V., D.C., H.M., H.T.M., J.R., J.S.P., K.I., L.W., M.A.R., M.M.N., M.S.L. and S.B.A. contributed mutation-to-expression correlation analysis. A.K., C.V., D.C., J.R., L.W., M.A.R., N.A.S. and Z.Z. contributed network or pathway analysis. R.S, Ciyue S., Chris S., and J.S.P. contributed structural RNA analysis.

Power analysis and driver mutations at known cancer genes E.R. analysed SNV detection and driver discovery power. I.M. evaluated excess number of mutations at known drivers. F.A. and M.M.N. integrated additional evidence.

Leadership and organizational work E.R., G.G. and J.S.P. contributed working group leadership. A.G.-P., D.A.W., E.K., E.R., G.G., G.T., I.M., J.R., J.S.P., L.F., L.M., M.B.G., N.L.-B., O.P., P.J.C., R.S. and S.K. contributed organization.

80 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

7. References

1. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013). 2. Katainen, R. et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nat. Genet. 47, 818–821 (2015). 3. Weinhold, N., Jacobsen, A., Schultz, N., Sander, C. & Lee, W. Genome-wide analysis of noncoding regulatory mutations in cancer. Nat. Genet. 46, 1160–1165 (2014). 4. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole- genome sequences. Nature 534, 47–54 (2016). 5. Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013). 6. Fujimoto, A. et al. Whole-genome mutational landscape and characterization of noncoding and structural mutations in liver cancer. Nat. Genet. 48, 500–509 (2016). 7. Yi, C. et al. MiR-663, a microRNA targeting p21(WAF1/CIP1), promotes the proliferation and tumorigenesis of nasopharyngeal carcinoma. Oncogene 31, 4421– 4433 (2012). 8. Baer M, E. al. Structure and transcription of a human gene for H1 RNA, the RNA component of human RNase P. - PubMed - NCBI. Available at: https://www.ncbi.nlm.nih.gov/pubmed/2308839. (Accessed: 7th July 2017) 9. Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208 (2015). 10. Wang, L. et al. Linc00963: a novel, long non-coding RNA involved in the transition of prostate cancer from androgen-dependence to androgen-independence. Int. J. Oncol. 44, 2041–2049 (2014). 11. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012). 12. Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010). 13. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell 171, 1029–1041.e21 (2017).

81 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Figure 1 a b er−HCC y−AdenoCa v anc−AdenoCa h ymphoid malignancies Protein change − 0% − − − − − 100% Skin−Melanoma Bladder−TCC Genomic position ColoRect−AdenoCa Li L P Gene Breast−AdenoCa Prost−AdenoCa Head−SCC Lung−AdenoCa CNS−GBM T − 200 (7.7 %) − 150 (5.8 %) − 100 (3.9 %) − 50 (1.9 %) − 0 (0.0 %) 166 KRAS # G12D/A/V 12:25398284 0 0 1 20 60 0 0 1 2 14 0 0 97 TERT # 5:1295228 1 0 15 2 0 4 48 0 6 3 23 52 56 KRAS # G12S/R/C 12:25398285 0 0 0 8 21 0 0 0 0 6 0 0 46 BRAF # V600E 7:140453136 1 0 29 2 0 0 0 0 0 0 21 0 41 PIK3CA # H1047L/R 3:178952085 0 13 0 2 1 1 0 1 2 0 0 0 39 TP53 # R273L/H 17:7577120 2 1 0 4 4 1 0 1 2 0 0 3 38 PIK3CA # E545Q/K 3:178936091 0 5 1 6 0 1 5 0 15 3 0 0 35 TERT # 5:1295250 0 0 22 0 0 0 14 0 4 0 0 16 33 14:106326944 17 0 0 0 0 0 0 0 0 0 0 0 28 TP53 # R175H 17:7578406 0 2 1 6 4 0 0 0 6 0 0 0 28 RPL13A 19:49990694 0 0 27 0 0 0 0 0 0 0 0 0 27 TP53 # R248P/Q 17:7577538 1 3 0 0 2 0 5 0 0 3 0 6 24 14:106326972 13 0 0 0 0 0 0 0 0 0 0 0 24 TP53 # R282W 17:7577094 0 2 0 0 5 1 0 1 2 0 0 0 22 14:106326713 12 0 0 0 0 0 0 0 0 0 0 0 22 TP53 # R213* 17:7578212 0 2 1 8 2 1 0 0 0 0 0 0 21 TP53 # Y220C 17:7578190 2 1 0 0 3 1 5 0 0 3 0 0 20 14:106328942 11 0 0 0 0 0 0 0 0 0 0 0 20 TP53 # R248G/W 17:7577539 0 1 0 2 2 1 5 2 4 0 0 0 20 IDH1 # R132H 2:209113112 1 0 0 0 0 0 0 1 0 0 0 8 19 GPR126 6:142706206 0 2 0 0 0 0 40 0 0 0 5 0 19 RPS20 8:56987141 0 0 18 0 0 0 0 0 0 0 0 0 19 X:116579329 0 0 0 0 0 6 0 0 0 0 0 0 18 14:106329192 10 0 0 0 0 0 0 0 0 0 0 0 18 X:7791111 1 1 0 2 2 0 5 0 0 0 0 0 17 14:106326887 9 0 0 0 0 0 0 0 0 0 0 0 17 14:106326619 9 0 0 0 0 0 0 0 0 0 0 0 17 14:106326618 9 0 0 0 0 0 0 0 0 0 0 0 17 14:106327115 9 0 0 0 0 0 0 0 0 0 0 0 17 14:106327559 9 0 0 0 0 0 0 0 0 0 0 0 17 TP53 # R273S/C 17:7577121 1 1 0 0 3 1 0 0 0 0 0 0 17 14:106240243 9 0 0 0 0 0 0 0 0 0 0 0 17 3:164903710 0 0 0 2 0 2 0 1 0 0 0 0 17 X:83966025 0 2 0 0 0 0 0 6 0 0 0 0 17 KDM5A 12:498776 0 0 16 0 0 0 0 0 0 0 0 0 16 14:106329196 9 0 0 0 0 0 0 0 0 0 0 0 16 14:106329852 8 0 1 0 0 0 0 0 0 0 0 0 16 14:106329550 9 0 0 0 0 0 0 0 0 0 0 0 16 PLEKHS1 10:115511590 0 2 0 0 0 0 22 0 0 9 5 0 16 PLEKHS1 10:115511593 0 3 0 0 1 0 27 0 0 0 5 0 16 19:2151793 0 0 15 2 0 0 0 0 0 0 0 0 16 NRAS # Q61K 1:115256530 0 0 11 2 0 1 0 0 0 0 3 0 16 RP5-1125A11.1 20:32580927 0 0 15 0 0 0 0 0 0 0 0 0 16 X:83967552 0 0 0 0 1 0 0 8 0 0 0 0 15 14:106326877 8 0 0 0 0 0 0 0 0 0 0 0 15 14:106329236 8 0 0 0 0 0 0 0 0 0 0 0 15 14:106329350 8 0 0 0 0 0 0 0 0 0 0 0 15 14:106327417 8 0 0 0 0 0 0 0 0 0 0 0 15 IQGAP1 15:90931383 0 0 15 0 0 0 0 0 0 0 0 0 15 1:103599442 0 0 0 0 0 1 0 6 0 0 0 0 Mutation types: Gene annotation: Positional annotation: Tumor type specific Mutational processes: C > A Protein-coding IGH locus (all or all but one mutation in tumor type) Age APOBEC T > A Promoter (pc) Center of palindrome AID COSMIC5 C > G Promoter (lncRNA) 0 10 20 30 40 50 60 UV Other Percentage of cohort mutated T > C Intron C > T # Known driver hotspot T > G c s

n Protein-coding (pc) gene with two isoforms lncRNA gene

Input Enhancer miRNA snoRNA annotatio

y Promoter (pc) r 5'UTR nts Protein-coding e Splice sites (pc) 3'UTR Promoter (lncRNA) r discove d elem e Splice sites (lncRNA) e lncRNA Small RNA

Deriv miRNA for driv Enhancer d

2500 Cell type of origin Organ system

2000

1500

1000

Number of patients 500

0 y e ract ract CNS Lungyeloid Glioma ve t ve t Breast Kidn ymphoid M an-cancer Sarcoma L P Carcinoma Squamous Hematopoietic Digesti Adenocarcinoma

emale reproducti F bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 2 a

ActiveDriverWGSCompositeDriverDriverPowerdNdScvExInAtorLARVAPlusMutSigNBR ncdDetectsuitencDriverOncodriveFMLregDriver Mutational burden

Clustering

Functional impact Distal TF binding site Promoter Individual driver discovery methods b

DriverPower MutSig dNdScv oncodriveFML_cadd

oncodriveFML_vest3 ncDriverConservation_snv ncDriverConservation_indel ncDriverCancerType_snv ncDriverCancerType_indel ncdDetect Correlation estimation

ExInAtor Burden NBR Clustering ActiveDriverWGS Functional impact

Correlation -1 0 1 c

TP53 EGFR PTEN IDH1 RB1 PIK3CA ATRX SLC30A3 EEF1A1 ERCC4 CompositeDriverncdDetectoncodriveFML_vest3ExInAtorActiveDriverWGSMutSigLARVADriverPoweroncodriveFML_caddNBRdNdScvncDriverConservation_indelncDriverConservation_snvIntegration Integration of methods

−log10 P 0 5 10 15 d

TP53 EGFR TP53 PTEN EGFR IDH1 PTEN RB1 IDH1 PIK3CA RB1 ATRX # mutations PIK3CA SLC30A3 # patients ATRX EEF1A1 Sequence properties ERCC4 ERCC4 Signatures Post-filtering of candidates e

Cohorts lncRNAs Promoters * TP53 Elements Coding genes * EGFR * PTEN * IDH1 − RB1 PIK3CA ATRX ERCC4 Significant False discovery rate correction * Lists of − Near-significant significant elements bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 3

a RNS7K Alignment artifact SNVs Indels (mappability) high Alignability low Region of low alignability

b Flagged mutation Pass mutation Tumor reads Alternative allele Technical

Patient 1 Site-specific noise Patient 2 ......

Patient n Normal reads from all patients

c AID somatic hypermutation UV + Lack of NER

Mutational process PIM1 RPL13A Mutational signature AID UV Other DNAse signal

d 100 ion s Other f mu tat Biological *

APOBEC C T G T % o % G Sequence 0 T T T A PLEKHS1 promoter GCTTTTTTGCAATTGAACAATTGCAAAATTGG T T T G C A A A A A A C G T T A A A palindrome * C * G T * G T * Bladder-TCC * * Breast-AdenoCa * Breast-LobularCa * * Cervix-SCC * * Lung-AdenoCa * Lung-SCC * Panc-AdenoCa * * Panc-Endocrine * Thy-AdenoCa * * * *APOBEC mutation * *

e Element−cohort hits Unique elements Mappability

Site−specific noise

Protein−coding Mutational process Promoter (protein−coding) 5'UTR 3'UTR Sequence palindrome Enhancer lncRNA lncRNA promoter Number of mutations miRNA and patients Small RNA 0 20 40 60 80 100 0 20 40 60 80 100 % removed by filter % removed by filter bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which Figurewas not4 certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. a Protein-coding Promoter (protein-coding) 5’UTR 3’UTR Enhancer lncRNA lncRNA promoter miRNA Small RNA 0 10 20 30 600 700 0 2 4 6 8 10 150 Number of element-cohort hits Number of unique elements b 80 50 Small RNA miRNA lncRNA promoter 40 lncRNA 60 Enhancer 3’UTR 30 5’UTR Promoter 40 Protein-coding 20

Number of hits 20 10

0 0 o y rine act act my eloid CNS Lung Breast My Glioma ve_tr Kidneve_tr er−HCC ancancer Sarcoma CarcinomaP Squamous y−AdenoCa CNS−GBM Head−SCC Liv Lung−SCCLymph−CLLeloid−MPNy−AdenoCa r Bladder−TCC Kidney−RCC us−AdenoCa Bone−Leio CNS−MedulloCNS−PiloAstroEso−AdenoCa Lymph−BNHLMy anc−AdenoCa Thy−AdenoCa Digesti Kidney−ChRCCLung−AdenoCa P anc−EndocProst−AdenoCaSkin−Melanoma Bilia Bone−OsteosarcBreast−AdenoCa Ovar P Uter Adenocarcinoma Lymphatic_system ColoRect−AdenoCa Stomach−AdenoCa Hematopoietic_system emale_reproducti F

c 350 2000

0 1 2 3 4 5 −log10 q-value # patients SNV Indel 0 0 TERT WDR74 HES1 RFTN1 POLR3E

Promoter PAX5 IFI44L (protein-coding) 0 20 60 100 140 0 20 60 100 140 MTG2 RNF34 PTDSS1 5’UTR DTL 0 5 10 15 0 5 10 15 20 NFKBIZ TOB1 ALB TMEM107 3’UTR SFTPB FOXA1 0 5 10 20 30 0 5 10 20 30

Enhancer TP53TG1 IGH locus 0 10 20 30 40 50 0 20 40 60 80 NEAT1 MALAT1 RPPH1 RMRP RNU12

lncRNA G029190 RP11−92C4.6 G025135 0 50 150 250 350 0 100 300 500 RMRP RNU12 TRAM2−AS1 RP11−440L14.1 MIR663A (lncRNA) Promoter LINC00963 0 20 40 60 80 0 20 40 60 80 miRNA hsa−mir−142 0 5 10 15 0 5 10 15 small RNA RNU6−573P y o act ine 0 1 2 3 4 0 1 2 3 4 r my CNS eloid Lung # patients with # mutations in all patients Breast Kidne e_t r Glioma

M y mutation in element er−HCC Sarcoma ancancer P Squamous Carcinoma Li v eloid−MPN Lung−SCC CNS−GBM Head−SCC y−AdenoCa y−AdenoCa Lymph−CLL r Kidney−RCC us−AdenoCa Bladder−TCC CNS−Medullo Lymph−BNHL M y anc−AdenoCa Thy−AdenoCa Eso−AdenoCa Digesti v Bone−Leio CNS−PiloAstro anc−Endoc Lung−AdenoCa P Kidney−ChRCC Prost−AdenoCa Skin−Melanoma P O var Bone−Osteosarc Adenocarcinoma Bilia Breast−AdenoCa Ute r Lymphatic_system Stomach−AdenoCa ColoRect−AdenoCa Hematopoietic_system Female_reproductibve_tract bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 5 Mutation type P = 0.03 4 SNV a Indel Promoter No event AID mutation RFTN1 2 Conservation (PhyloP) AID Non−AID H3K4me3 (HepG2) CNA

SNVs xpression z−score 0 ymph−BNHL Amplification L Indels Gain Loss Lymph-BNHL events RFTN1 e −2 No event mut. wt b P = 0.034 Intronic regulator 2 RNF34 KDM2B Conservation (PhyloP) er-HCC

xpression z-score 0

DNase Clusters Liv SNVs 2 Indels KDM2B e wtmut. Liver-HCC events RNF34 c P = 0.029 SS18L1 3’UTR Promoter 5 MTG2 Conservation (PhyloP) 2.5 H3K4me3 (HepG2) xpression z-score SNVs Carcinoma 0

Indels MTG2 e wtmut. d

100 P = 0.0002 TP53 Conservation (PhyloP) H3K4me3 (GM12878)

Indels level (FPKM)

SNVs Expression 0 TP53 5’UTR 1 2 3 4 5 6 1 Indels 2 5 SNVs 3 6 4 Lung-SCC

Kidney-ChRCCPanc-AdenoCa Ovary-AdenoCa Breast-AdenoCa Biliary-AdenoCa e

Enhancer regions TP53TG1

H3K4me3 (HepG2) H3K4me1 (HepG2)

Conservation (PhyloP)

H3K4me3 (HepG2) DNAse clusters TEAD4 NR2F2 CTCF IRF1 HNF4A SMC3 RAD21 Eso-AdenoCa events Transcription factor ChIP ZNF143 RCOR1 (ENCODE) EP300 Other digestive tract CEBPB ATF3 cancer events JUND SNVs

Indels Other COSMIC17 0 100 % Eso-AdenoCa mutations bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 6

ALB SFTPB FOXA1 NFKBIZ TOB1 a 1 kb downstream 41 3 9 0 0 3’UTR 12 7 10 8 9

Introns 425 4 4 31 3

Exons 36 0 9 2 2

0 30 0 15 0 8 0 4 0 10 Indel rate (mutations/Mb/patient)

b chr4: 74,250,000 74,300,000 74,350,000 ALB AFP AFM SNVs P = 1.72 x10-14 100

Indels 75

50

25 ALB Gain SNVs of samples Percentage Loss No event 0 Indels Liver-HCC Other (n = 314) (n = 2,200) c P = 0.053 4 Mutation type TOB1 3’UTR SNV Indel No event Conservation (PhyloP) 2 CNA miRNA sites (TargetScan)

Carcinoma Amplification miRNA sites (Ago-Clip) 0 Gain SNVs Loss Indels z-score expression TOB1 2 No event mut. wt

P = 0.035 d NFKBIZ 3’UTR 6

Conservation (PhyloP) 4 miRNA sites (TargetScan) SNVs 2

Indels z-score expression Lymph-BNHL Lymph-BNHL events 0

NFKBIZ wtmut. e

RMRP transcript RMRP promoter 240 C A C Panc-AdenoCA: C>U C A Lung-AdenoCA: C>A A C G 240 230 A

A C A C G U A C A A A C 150 G Panc-AdenoCA: C>U A C G G C 160 A U 230 Lung-AdenoCA: C>A G G C A C C C G C C G C G C C C G C G C Conservation C C A Panc-Endocrine: G>C G A G C U C Panc-Endocrine: G>C U C G G G U G U G G C A G C U G C C 200 U C G G C C U C 220 A U G 170 C U 250 G U A U A G C G 210 C C G U C A G G ins. G>AG U A G C A G U C (PhyloP) A C Liver-HCC A A G 140 U G G C C C G C U U C A U G G C C U A U U C C>U G G>U: Liver-HCC C G 220 A U A U Stomach- G C G 180 C 250 A>G: C C 20 AdenoCA 10 U A Panc-AdenoCA G U C C G 190 C G G A G C C G C C G G C U C A U G G C 260 G U C C C C U G A 130 C A G G U U A G C 120 A U A G SNVs C G U U G C G G U U G C U G U C C G A U 110 C A C 1 ins. G>ACGTGG: Liver-HCC 268 C G C C G A C C G 100 C U G U A U A G C U G G A C G 30 U G G G A G A A C>U G A>U: Liver-HCC U C A 70 G A C U A U U A G G C C G C C U C C del. G: A G C C A A G C C G C 90 Breast-AdenoCa C A U Stomach- G G G 80 A C C 20 U AdenoCA C 10 U A 60 G 40 U A U C G U C G C C U Indels A C C C G C U U 50 C del. UCCUC U U Thy-AdenoCA G G C 260

U G

G C

C G

G G U U G C U G U

1 ins. G>ACGTGG: Liver-HCC 268 P4 tertiary interaction sites Low High Mutation RNA-protein interaction sites Mutation with predicted RNA-substrate interaction sites structural impactP ( < 0.1) Predicted structural impact of mutations e b a Figure 7

LOH association fold difference (log2) Indels − SNVs NEAT1 2 0 2 4 6 bioRxiv preprint was notcertifiedbypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmade Indels SNVs

**

Percentage of indels 100 chr11 (23,509 genes) 50 TP53

Genome 0

**

n=1027341 VHL NEAT1 n.s. KRAS Oncogene Tumor suppressor doi: ALB n=485 BRAF s n.. .s.n .s.n .s.n https://doi.org/10.1101/237313 NEAT1 NEAT1 n=139

MALAT1 MALAT1

2-5 bp enriched n=31

(18 genes) c 0.0 Cancer allele fractions 1.0 n=563 available undera P3VHL TP53 Indel length in * ** * 1 bp 2− 6−9 bp 10+ bp out 5 bp

in ; this versionpostedDecember23,2017. out CC-BY-NC-ND 4.0Internationallicense f KRAS in out BRAF

1kb upstream in **** Mutation rate 10 0 2 4 6 8 out MALAT1

5’UTR MALAT1 NEAT1 Indels

in n.s. n.s. Exons out 1 kb downstreamIntrons MALAT1 3’UTR in out d The copyrightholderforthispreprint(which

1kb upstream . 10 12 0 2 4 6 8 Expression fold difference (log2) − − − − 4 3 2 1 0 1 2 3 5’UTR SNVs TP53

Exons ** ** ** **

VHL 1 kb downstreamIntrons 10 kb

KRAS 3’UTR

BRAF

NEAT1 ..n.s. n.s.

MALAT1 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Figure 8 a b % of patients in cohort powered (≥90%) 32 70 94 83 52 77 34 4 21 60 59 16 20 7 78 74 27 68 65 91 66 93 60 61 49 100 48 1.0 1.0 CDS y t

Promoters i v i

5'UTRs t i e s v 3'UTRs n

Enhancers e s ulati

lncRNAs n m o i t probability Cu c e t e at chr5:1,295,228 D 0.0 0.0 c a o a o o a a a L L a a e a a a a a C C C C C C N M r ll y L

0.5 1.0 n tr C C C C C H C C C C C C i a P m C C C C Average detection sensitivity B s r o o o o o C o o o o o o s m o N T S S du RC RC H - c M G A - n n n n n n n n n n n - - o - - n - - i e B r h o r o e e e e e e e e e e e d eo - a d e e S t ph l i il C d d M d d d d d d d d d ey - L s - e ve o N P ea i - ung l ph m A A A A A A A A A A A - dd - - S ------End - - O y L dn e L M C t t t ey H - i - a m y S y c - s y o h l ye L N n s c s r r c e y K s n c n u h N o i B e o a n r C dn n M L a i ung r T E k ea va C B e ili o R r L t Pa K P S m o Pa O B B l B U o t o S C

c chr5:1,295,228 chr5:1,295,250 Inferred mutations e 2 4 5 9 3 2 7 159 0 0 0 10 0 0 1 1 13 Missed 100 3 15 25 14 4 5 9 171 1 1 1 26 11 3 7 3 36 Total 3.0 3,133 [2,987-3,273] 80 TERT TERT 2.5 60

40 d ratio 2.0

% of patients e 20 with mutation Missed

promoter mutation 1.5 0 Called % patients with a a o a a L a a a C C M 71 [22-133] 32 [0-79] C C C C M ll 37 [0-88] L 103 [25-184] m C C C C C C C B m C C C B o o o o C o o T o S T S du G RC H - - G - n - n n n n n - - - - n r - e r r

d 1.0 a e e e e e d a S e l S e ph l d M d d d d ey e Point mutations - e N ve ea N ea i m A A A A A dd dd - S - - - - M C y L dn M H C t a H - i a y y - y l l L N c r r n K n h i B i B e a C observed/expect k ung T 0.5 k va ili R L S S o O B l o C d 0.0 5’UTRCoding 3’UTR3’UTR 10 Promoter ns/Mb 8 downstream o i

t 6 a

t 4

u 2 Median

M 0 element length CDS 18 17 16 14 10 9 8 8 8 8 7 8 8 8 7 8 6 9 8 12 8 8 6 5 4 9 6 8 7 11 5 8 7 5 5 6 5 7 6 5 4 6 Enhancer 12 11 11 10 8 8 6 6 6 7 5 6 7 6 6 6 5 7 6 9 6 7 5 5 4 7 5 6 5 8 4 6 6 4 5 5 4 6 5 5 4 6 lncRNA 16 16 15 14 9 9 7 7 7 8 6 7 7 7 7 7 5 8 7 11 7 8 5 5 4 8 5 7 6 11 5 7 6 5 5 6 4 6 6 5 4 6 3’UTR 16 16 15 14 9 9 7 7 8 8 7 7 8 7 7 7 5 8 7 11 7 8 6 5 4 8 5 8 7 11 5 7 6 5 5 6 4 6 6 5 4 6 5’UTR 12 12 11 10 8 8 6 7 7 7 6 7 7 7 6 6 5 7 6 9 7 7 6 5 4 7 5 7 6 8 4 7 6 4 5 5 4 6 5 5 4 5 Promoter 15 14 14 12 9 9 7 7 7 8 7 7 8 7 7 7 6 8 7 10 7 7 6 6 4 8 6 7 7 9 5 7 7 5 5 6 4 7 6 5 4 6 0 0 0 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 60 120 90 89 84 81 68 56 52 48 47 44 43 41 39 38 37 34 34 23 23 97 95 797 382 314 287 235 232 208 199 197 195 186 146 143 141 121 110 107 105 ======2278 1856 1631 n n n n n n n = n n = n n n n = = n = n = = n = = n = = = = = n n = = = = n n ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( = = = n n n n n n n n n n n n n n n n n n

( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( n n n c o e a a a

a

d a

a

a a N i ( M ( ( r t t t yo s n

a a a a o a a L LL tr S C C C C C C C i c o c s a P m m m CC CC CC B l ll ey u s r ung a a o o o o o o C C C C C o H s a a m N e e m m o CC T S S RCC c - o L M ph t t G A n n - n n n o n o o o n - - ea o c m m tr tr o o - N - ye C i du h H o r RCC s s dn r r o e e e e e e n n n n li e i d eo m - - o m o _ _ n d B y e y e S e t ph il i a r C M B d d d d d e d e e e d y - a n n K a s G s - s l L i i l o N P S d d d M d ea ung ey l - ve ve - m A A A A A A A _ _ - c c i i dd - e ve - - End - - - - - O t t y L i a r r e ph C A A A A t c c H ey - - a S s y y i i h - - - S - o s c ye l L L dn a a n M c t t c r t t Squ e i m m c u h c y s - e N c N o e B s e s a n r a r dn C M n y o i a K n n i T ung g E du o i o C e i B C R o ili L n o t L r o K n k m ea va ph Pa o r a D p B B r l Pa P l e U o S p o O t m o e B d t e y S a r C A m L - _ m n e i l e k a H s - m 0 5 10 15 20 25 30 o e n F

- Minimum % powered (≥90%) n a c n Pa bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Extended Data Figure 1 10 a 10

9 10

8 10

7 10

6 10

5 10

4 10 Number of positions

3 10

100

10

RPL13A, TP53 RPL13A, IGH locus, TP53 locus, IGH

TP53

KRAS

KRAS

TERT

BRAF

TP53 PIK3CA

PIK3CA TERT

TP53 TP53 IGH locus IGH IGH locus IGH 0 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 160 165 100 % positions Fraction of 0 % Protein-coding Non-coding Number of patients b y−RCC y−ChRCC y−AdenoCa y−AdenoCa us−AdenoCa eloid eloid−MPN er−HCC y−AdenoCa anc−AdenoCa anc−Endoc rine ve system emale reproducti ve ymph−BNHL ymph−CLL Protein change − 200 (7.7 %) − 150 (5.8 %) − 100 (3.9 %) − 50 (1.9 %) − 0 (0.0 %) Gene CNS−Medullo Adenocarcinoma tract Digestive Breast Glioma Kidney CNS−GBM Kidne Hematopoietic system Lung Genomic position L L Breast−AdenoCa Skin−Melanoma ColoRect−AdenoCa Eso−AdenoCa P Liv Stomach−AdenoCa Bladder−TCC Prost−AdenoCa Uter Head−SCC Lung−SCC Ovar Bilia r Lung−AdenoCa P CNS−PiloAstro Kidne system Lymphatic Squamous CNS M y T h M y Carcinoma F 166 KRAS # G12D/A/V 12:25398284 0 0 0 1 20 3 60 0 0 0 1 5 2 3 1 6 14 0 0 0 0 0 0 0 0 9 0 10 0 20 2 3 0 8 0 0 0 0 97 TERT # 5:1295228 0 2 0 15 2 0 0 4 0 48 0 0 6 0 1 3 3 23 52 2 4 0 0 0 0 3 1 2 1 2 1 4 0 2 13 21 2 0 56 KRAS # G12S/R/C 12:25398285 0 0 0 0 8 3 21 0 2 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 4 0 4 0 7 0 0 0 3 0 0 0 0 46 BRAF # V600E 7:140453136 1 0 0 29 2 0 0 0 0 0 0 0 0 0 0 0 0 21 0 0 0 0 4 0 0 1 1 1 1 1 0 0 0 0 2 3 0 0 41 PIK3CA # H1047L/R 3:178952085 0 0 13 0 2 0 1 1 3 0 1 3 2 3 0 0 0 0 0 0 2 0 0 0 0 2 0 3 0 1 8 2 14 2 1 0 0 0 39 TP53 # R273L/H 17:7577120 3 0 1 0 4 6 4 1 2 0 1 7 2 0 9 6 0 0 3 0 0 0 0 0 0 2 2 2 2 3 4 1 1 0 1 1 0 0 38 PIK3CA # E545Q/K 3:178936091 0 0 5 1 6 5 0 1 2 5 0 5 15 7 0 0 3 0 0 1 0 0 0 0 0 2 0 2 0 2 4 11 5 5 0 0 1 0 35 TERT # 5:1295250 0 0 0 22 0 0 0 0 0 14 0 0 4 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 3 5 0 0 33 14:106326944 29 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 17 0 0 0 0 0 0 0 0 0 28 TP53 # R175H 17:7578406 0 0 2 1 6 5 4 0 3 0 0 3 6 3 2 3 0 0 0 0 0 0 0 0 0 2 0 2 0 3 2 4 2 2 0 0 0 0 28 RPL13A 19:49990694 0 0 0 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 TP53 # R248P/Q 17:7577538 1 0 3 0 0 5 2 0 3 5 0 5 0 5 3 0 3 0 6 0 0 2 0 0 0 2 1 2 1 2 3 2 2 4 1 2 0 0 24 14:106326972 19 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 0 13 0 0 0 0 0 0 0 0 0 24 TP53 # R282W 17:7577094 0 0 2 0 0 4 5 1 2 0 1 3 2 0 2 0 0 0 0 0 0 0 0 0 0 2 0 2 0 2 2 1 2 0 1 1 0 0 22 14:106326713 17 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 12 0 0 0 0 0 0 0 0 0 22 TP53 # R213* 17:7578212 0 0 2 1 8 4 2 1 3 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 3 1 1 2 2 1 1 0 0 21 TP53 # Y220C 17:7578190 0 3 1 0 0 2 3 1 0 5 0 5 0 3 4 0 3 0 0 0 0 0 0 0 0 2 1 2 2 2 2 1 1 3 0 0 0 0 20 14:106328942 17 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 11 0 0 0 0 0 0 0 0 0 20 TP53 # R248G/W 17:7577539 0 0 1 0 2 3 2 1 0 5 2 5 4 0 2 0 0 0 0 0 0 0 0 0 0 2 0 1 0 2 2 2 1 0 0 0 0 0 20 IDH1 # R132H 2:209113112 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 8 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 7 13 0 0 19 GPR126 6:142706206 0 0 2 0 0 4 0 0 0 40 0 0 0 3 0 0 0 5 0 0 0 0 0 0 0 2 0 1 0 1 2 1 2 2 0 0 0 0 19 RPS20 8:56987141 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 X:116579329 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 2 0 2 0 3 0 0 0 0 0 0 0 0 18 14:106329192 12 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 10 0 0 0 0 0 0 0 0 0 18 X:7791111 1 0 1 0 2 7 2 0 6 5 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 2 1 0 1 0 0 0 1 0 17 14:106326887 12 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 9 0 0 0 0 0 0 0 0 0 17 14:106326619 13 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 9 0 0 0 0 0 0 0 0 0 17 14:106326618 11 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 9 0 0 0 0 0 0 0 0 0 17 14:106327115 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 9 0 0 0 0 0 0 0 0 0 17 14:106327559 14 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 9 0 0 0 0 0 0 0 0 0 17 TP53 # R273S/C 17:7577121 1 0 1 0 0 3 3 1 2 0 0 5 0 0 0 3 0 0 0 0 0 0 0 0 0 1 1 1 1 2 1 0 1 0 1 1 0 0 17 14:106240243 15 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 9 0 0 0 0 0 0 0 0 0 17 3:164903710 0 0 0 0 2 5 0 2 8 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 3 0 0 0 0 0 0 0 0 17 X:83966025 0 0 2 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 2 1 2 0 0 0 0 0 17 KDM5A 12:498776 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 14:106329196 12 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 9 0 0 0 0 0 0 0 0 0 16 14:106329852 12 4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 8 0 0 0 0 0 0 0 0 0 16 14:106329550 11 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 9 0 0 0 0 0 0 0 0 0 16 PLEKHS1 10:115511590 0 0 2 0 0 0 0 0 0 22 0 0 0 3 0 0 9 5 0 0 0 2 0 0 0 1 0 1 0 0 2 1 2 5 0 0 0 0 16 PLEKHS1 10:115511593 0 0 3 0 0 0 1 0 0 27 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 1 0 1 0 1 2 1 3 0 0 0 0 0 16 19:2151793 0 0 0 15 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 16 NRAS # Q61K 1:115256530 0 0 0 11 2 0 0 1 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 3 0 1 1 1 0 1 0 0 0 0 0 0 1 3 16 RP5-1125A11.1 20:32580927 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 X:83967552 0 0 0 0 0 0 1 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 15 14:106326877 12 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 8 0 0 0 0 0 0 0 0 0 15 14:106329236 11 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 8 0 0 0 0 0 0 0 0 0 15 14:106329350 13 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 8 0 0 0 0 0 0 0 0 0 15 14:106327417 14 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 8 0 0 0 0 0 0 0 0 0 15 IQGAP1 15:90931383 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 1:103599442 0 0 0 0 0 0 0 1 2 0 6 0 0 0 0 3 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 Mutation types: Gene annotation: Positional annotation: C > A T > A Protein-coding Promoter (lncRNA) Center of palindrome 0 5 10 15 20 25 30 35 40 45 50 55 60 65 C > G T > C Promoter (pc) Intron IGH locus Percentage of cohort mutated C > T T > G # Known driver hotspot bioRxiv preprint was notcertifiedbypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmade b a Extended DataFigure2

element length (bases)

10 10 10 10 10 genomic coverage (%) 0.8 0.0 0.4 1.2 doi: 1 1 2 3 4 5 https://doi.org/10.1101/237313 number Length distribution of tested genomic elements oftestedgenomic Length distribution

enhancers Coverage of tested genomic elements oftestedgenomic Coverage -

3'utr - protein-coding 5'utr -

genes available undera protein-coding

promoters - - ;

lncRNAs this versionpostedDecember23,2017. - CC-BY-NC-ND 4.0Internationallicense promoterslncRNA - 3735 0855 46102 58102 88191 96391 61803 miRNAs 488872 6448 - short RNAs - The copyrightholderforthispreprint(which . bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Extended Data Figure 3 a Broad simulation Sanger simulation DKFZ simulation

ncdDetect ncdDetect

4 ncdDetect oncodriveFML_vest3 oncodriveFML_vest3 oncodriveFML_vest3 ExInAtor ExInAtor ActiveDriverWGS ExInAtor ActiveDriverWGS DriverPower ActiveDriverWGS DriverPower oncodriveFML_cadd DriverPower dNdScv oncodriveFML_cadd 3 oncodriveFML_cadd dNdScv NBR MutSig NBR NBR ncDriverConservation_indel dNdScv MutSig ncDriverConservation_snv ncDriverConservation_indel ncDriverConservation_snv ncDriverCancerType_snv

alue (−log10) ncDriverConservation_snv

2 ncDriverCancerType_indel ncDriverCancerType_snv MutSig ed p− v Observ 1 0 1 2 3 4 5 0 0 1 2 3 4 5

0 1 2 3 4 0 1 2 3 4 0 1 2 3 4

Expected p−value (−log10) Expected p−value (−log10) Expected p−value (−log10)

Color Key b c Observed Broad simulation DKFZ simulation Sanger simulation Observed Broad simulation Difference alue (−log10) ed p− v ActiveDriverWGSDriverPower ActiveDriverWGSDriverPower ActiveDriverWGSDriverPowerExInAtor ExInAtor ExInAtor MutSig 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 MutSig MutSig NBR Observ NBR NBR 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 Colorectal Adenocarcinoma Colorectal dNdScv dNdScv - dNdScv =

ncdDetect ncdDetect ncdDetect alue (−log10)

ncDriverCancerType_snv ncDriverCancerType_snv

ed p− v ncDriverCancerType_snv ncDriverConservation_snv 0 5 10 15 20 25 0 5 10 15 20 0 5 10 15 20 25 0 5 10 15 20 25 ncDriverConservation_snv ncDriverConservation_snv oncodriveFML_cadd oncodriveFML_cadd oncodriveFML_cadd Observ

Uterus Adenocarcinoma Uterus 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 oncodriveFML_vest3 oncodriveFML_vest3 oncodriveFML_vest3

alue (−log10) Observed Sanger simulation Difference ed p− v 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 Observ Lung Adenocarcinoma Lung 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4

Expected p−value (−log10) Expected p−value (−log10) Expected p−value (−log10) Expected p−value (−log10) DriverPowerExInAtor DriverPowerExInAtor DriverPowerExInAtor ActiveDriverWGS MutSig MutSig ActiveDriverWGS MutSig NBR ActiveDriverWGS NBR NBR Fisher combined p-values Brown combined p-values dNdScv - dNdScv = dNdScv d

ncdDetect ncdDetect TP53 ncdDetect ************ * ncDriverCancerType_indel ncDriverCancerType_indel ncDriverCancerType_indel ************ * EGFR ncDriverCancerType_snv ncDriverCancerType_snv ncDriverCancerType_snv ********* * PTEN * **** * IDH1 ncDriverConservation_indel ncDriverConservation_indel ncDriverConservation_indel ncDriverConservation_snv ncDriverConservation_snv * − − RB1 oncodriveFML_cadd ncDriverConservation_snvoncodriveFML_cadd oncodriveFML_cadd PIK3CA oncodriveFML_vest3 oncodriveFML_vest3 oncodriveFML_vest3 ** ATRX SLC30A3 MYSM1 EEF1A1

CNS-GBM BAG1 ERCC4 Observed DKFZ simulation Difference

NBR MutSigLARVA ExInAtor dNdScv ncdDetect DriverPower compositeDriver ActiveDriverWGS oncodriveFML_vest3 oncodriveFML_cadd Integrated (observed) Integrated (simulations)

DriverPower DriverPower ActiveDriverWGSDriverPowerExInAtor ActiveDriverWGSExInAtor ActiveDriverWGSExInAtor MutSig MutSig MutSig NBR NBR NBR ********** TP53 * TP53 KRAS KRAS ********** * dNdScv dNdScv dNdScv ******* KEAP1 * KEAP1 = − − RBM10 RBM10 - *** * * − ** STK11 * STK11 * * CTNNB1 CTNNB1 * TBL1XR1 TBL1XR1 ncdDetect ncdDetect ncdDetect * DDRGK1 DDRGK1 EGFR EGFR OR4E2 OR4E2 ncDriverConservation_indel ncDriverConservation_indel − − ATP5B ATP5B ncDriverConservation_indel ncDriverConservation_snv ncDriverConservation_snv ncDriverConservation_snv Lung-AdenoCa * SPARC SPARC oncodriveFML_cadd oncodriveFML_cadd oncodriveFML_cadd − TTC24 TTC24 oncodriveFML_vest3 oncodriveFML_vest3 oncodriveFML_vest3

NBR MutSigLARVA ExInAtor dNdScv ncdDetect Integrated Integrated DriverPower compositeDriver ActiveDriverWGS oncodriveFML_vest3 oncodriveFML_cadd after method removal -1-0.5 0 0.5 1 ncDriverConservation_snv ncDriverConservation_indel Spearman Significant gene correlation * −log10 P − Near-significant gene 0 5 10 15 filtered method d a Extended DataFigure4 Individual tumortypes 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 1 2 2 3 1 5 4 4 Percentage of well-known cancer genes

Biliary−AdenoCa 100 20 40 60 80 4 0 0 1 3 1 2 2 2 3 6 3 5 Bladder−TCC 0 2 2 0 0 0 0 0 1 1 2 2 2 2 12 2 2 2 bioRxiv preprint 0 Bone−Leiomyo was notcertifiedbypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmade 4 1 4 1 1 1 1 Bone−Osteosarc 1 NBR (n=31/47) oncodriv Driv MutSig (n=42/90) Activ ExInAtor (n=23/38) oncodriv ncdDetect (n=9/11) 14 11 10 4 5 8 5 8 8 Breast−AdenoCa 9 erPo eDriv 2 3 3 2 2 3 3 4 3 4 4 4 9 6 eFML_cadd (n=35/54) eFML_v w CNS−GBM 20 er2 (n=32/48) er (n=36/62) 10 Genes r 3 3 0 6 2 7 5 9 6 CNS−Medullo doi: est3 (n=34/53) 5 4 4 1 0 0 0 3 4 4 3 4 2 CNS−PiloAstro https://doi.org/10.1101/237313 13 11 11 2 2 5 5 3 4 3 3 3 8 7 Car ColoRect−AdenoCa anked by p−v 06 80 60 40 8 5 5 5 5 5 5 9 0 6 8 4

Eso−AdenoCa cinoma 5 3 5 2 2 2 5 Head−SCC 5 1 0 1 0 0 0 1 10 6 0 1 1 1 Kidney−ChRCC Brown_observ ncDriv ncDriv ncDriv ncDriv dNdScv (n=43/79) available undera 5 3 2 3 3 4 2 5 6 5 6 alue Kidney−RCC erCancerType_snv (n=11/16) erCancerType_indel (n=7/9) erConserv erConserv 13 10 11 14 11 13 11 3 7 6 6 Liver−HCC 3 3 ed (n=47/74) 5 2 2 2 1 4 1 2 2 4 4 3 5 5 5 5 ation_indel (n=0/0) ation_snv (n=2/5) Lung−AdenoCa ; this versionpostedDecember23,2017. 7 1 3 2 4 Lung−SCC 5 CC-BY-NC-ND 4.0Internationallicense 33 000 0 10 26 38 28 83 28 33 23 38 26 6 5 Lymph−BNHL 7 6 2 1 1 4 1 1 6 0 6 7 Lymph−CLL 4 1 4 2 2 0 0 0 2 1 1 Myeloid−MPN 0 0 100 b 1 1 1 1 1 11 1 1 12 2 20 40 60 80

Ovary−AdenoCa 0 14 10 12 12 3 3 6 5 5 Panc−AdenoCa 9 0 5 2 2 2 4 3 3 3 2 5 5 5 4 4 3 Panc−Endocrine 5 5 oncodriv Driv LAR MutSig (n=7/9) Activ ExInAtor (n=6/10) oncodriv ncdDetect (n=4/5) compositeDriv 4 3 0 1 4 4 4 Prost−AdenoCa 5 erPo V 2 eDriv A (n=3/9) 7 1 4 6 5 4 6 Skin−Melanoma 6 eFML_cadd (n=3/4) eFML_v w er2 (n=6/8) er (n=6/8) 5 4 3 2 2 2 3 3 5 5 5 1 4 The copyrightholderforthispreprint(which Stomach−AdenoCa Genes r er (n=4/4) 10 8 6 4 . est3 (n=4/6) 2 6 0 0 0 4 1 4 1 2 1 1 Thy−AdenoCa 1 2 Breast−AdenoCa 11 01 58 16 50 9 5 2 4 5 6 8 Uterus−AdenoCa 46 6 anked by p−v Meta-cohorts 00 10 34 37 22 33 413 34 09 50 39 7 Adenocarcinoma 9 14 14 1 6 45 6 9 5 8

Breast Brown_observ ncDriv ncDriv NBR (n=8/11) dNdScv (n=9/12) 11 43 28 38 01 19 10 40 43 55 54 8 2 Carcinoma 10 9 alue erConserv erConserv 21 11 14 15 17 22 2 0 0 0 0 0 0 0 0 0 0 0 6 7 CNS 9 30 21 20 12 20 26 30 26 5 7 7 ed (n=8/11)

Digestive_tract ation_snv (n=0/0) ation_indel (n=0/0) 19 69 16 13 16 18 18 0 0 0 0 0 0 0 0 0 0 8 9 9 5 Female_reproductive_tract 12 339 13 11 41 11 335 13 5 3 3 7 40 8 Glioma 8 10 32 35 31 27 0 0 0 0 0 9 Hematopoietic_system 14 7 3 3 5 83 38 3 3 7 6 6 6 8 5 Kidney 9 6 100 9 0 9 6 7 4 6 8 c

Lung 20 40 60 80 0 37 29 533 3 35 03 30 633 3 26 42 03 30 Lymphatic_system 0 9 0 2 0 Myeloid 3 1 11 4 2 1 2 4 NBR (n=4/6) oncodriv Sarcoma Driv LAR MutSig (n=7/14) Activ ExInAtor (n=4/4) oncodriv ncdDetect (n=3/3) compositeDriv 12 11 0 50 6 8 6 5 858 58 8 7 7 erPo V eDriv

Squamous A (n=2/2) 20 05 79 58 80 53 42 16 954 49 75 eFML_cadd (n=5/8) eFML_v w 4 2 2 2 2 er2 (n=7/10) Pancancer er (n=9/15) Genes r 14 44 47 29 11 er (n=4/7) 3 5

Pancancer (no lymphoma) est3 (n=3/6) ColoRect−AdenoCa 20 52 53 47 810 18 252 52 76 Pancancer (no melanoma) 4 14 62 42 51 32 43 anked by p−v Pancancer (no melanoma/lymphoma) 2 MutSig dNdScv Brown_obse ncDriv ncDriv ncDriv ncDriv compositeD LAR oncod oncod ExInAtor ncdDetect Activ NBR Driv 10 erPo V eDriv Brown_observ ncDriv ncDriv dNdScv (n=6/9) A riv riv erConse erConse erCancerType_indel erCancerType_sn eFML_cadd eFML_v w alue erConserv erConserv er2 er riv rv ed (n=9/16) er ed rv rv ation_snv (n=0/0) ation_indel (n=0/0) 15 est3 ation_indel ation_sn v v bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Extended Data Figure 5

PTEN VHL PIK3CA CTNNB1 TP53 KRAS CDKN2A SMAD4

TERT PBRM1 APC ARID1A RMRP

q-value) NFE2L2 10 IDH1 CREBBP MEN1 − log B2M RB1 SETD2 NOTCH1 TNFRSF14 EGFR FBXW7 MAP2K4 MAP3K1 DDX3X CDH1 WDR74EZH2 CBFB AXIN1 NEAT1 BAP1 PIK3R1 PTCH1 CCND3 KMT2D GATA3 HIST1H1C ARID2 RNF43 ALB BRAF ACVR2A KDM6A MYD88 MALAT1 TMSB4X CASP8GNAI2 ARHGAP35 STAT6 SPOP CIC RPL22 CDKN1AKLHL6BTG1 TP53TG1 HIST1H2BCAKT1CARD11ZFP36L2FBXO11ATRX GNA13TGFBR2 FOXA1 NRAS RMRP ATM hsa−mir−142 TBL1XR1KEAP1FGFR1 H3F3B RHOARRAGC RPPH1 NF1 SOX9 PTPN11MAP2K7NFKBIE HLARBM10−B SMO HRASKMT2CSTAT3TMEM30AGRB2MEF2BCTDNEP1 CDKN1B RPS6KA3PPP2R1ASF3B1 MIR663AH3F3A SMARCA4PRKCDBRD7PRKAR1A HES1PAX5 AJUBA Most significant meta cohort ( MTG2SRSF7 MTORCTCFCDK12ACVR1BTET2 RP11EPHA2ELF3−FAT1440L14.1HIST1H1BKBTBD4 DAXX NOTCH2 NFKBIZALBEBF1 LDB1 HIST1H2AGFUBP1MCL1TOB1BAX RPL5RFTN1 NXF1 LINC00963SMARCB1NCOR1SFTPBU2AF1ITPKBBCORPHF6TSC1TBX3PAX5G029190G085970STK11DNMT3A CTCRP11TRAM2HIST1H4DPTDSS1−IDH2512J12.6−92C4.6−AS1KDM5CRNU12SETDB1

TMEM107HNF4ARNF34BRPF1FOXA1RIPK4POLR3ETBR1DTLSMAD2PA2G4CAMK1G025135ZNF292IRF4GNASPLK1EEF1A1IFI44LRELAFGFR3PCBP1RAC1LATS1APOB 4FCT 2L7FCT A1KRYD 1I1CNYD RFTN1

0 5 10IGH 15 locus 20 25

0 5 10 15 20 25

Most significant individual cohort (−log10 q-value) bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Extended Data Figure 6 300 2000

0 1 2 3 4 5 −log10 q-value # patients SNV Indel 0 0

TP53 PTEN PIK3CA ARID1A KRAS SMAD4 RB1 CDKN2A BRAF KMT2D NRAS NFE2L2 MAP2K4 KDM6A FBXW7 CTNNB1 B2M SETD2 RBM10 NF1 KMT2C BAP1 ATM ACVR2A ZFP36L2 SF3B1 PIK3R1 PBRM1 NOTCH1 MAP3K1 KEAP1 GATA3 DDX3X CDKN1A CDH1 CBFB APC VHL TGFBR2 SMARCA4 RPL22 RNF43 MAP2K7 FOXA1 CASP8 AXIN1 ARID2 ARHGAP35 ALB AKT1 ACVR1B TBL1XR1 SPOP SOX9 RPS6KA3 NFKBIE MYD88 IDH1 HRAS HLA−A EZH2 EGFR CTCF CREBBP BRD7 TNFRSF14 TMSB4X TMEM30A STK11 STAT6 STAT3 SRSF7 RRAGC RHOA RFTN1 PTCH1 PRKCD NXF1 MEN1 MEF2B KLHL6 HLA−B HIST1H2BC HIST1H1C HIST1H1B GRB2 GNAI2 GNA13 FGFR1 FBXO11 EPHA2 EBF1 CDKN1B CCND3 CARD11 BTG1 ATRX TET2 SMO SETDB1 RPL5 PTPN11 PRKAR1A PPP2R1A PAX5 MTOR MCL1 LDB1 KDM5C KBTBD4 ITPKB HIST1H2AG FUBP1 ELF3 DNMT3A DAXX CTDNEP1 CTC−512J12.6 CIC CDK12 AJUBA ZNF292 U2AF1 TSC1 TCF7L2 TCF4 TBX3 TBR1 SMARCB1 SMAD2 RIPK4 RELA RAC1 PLK1 PHF6 PCBP1 PA2G4 NOTCH2 NCOR1 LATS1 IRF4 IDH2 HNF4A HIST1H4D H3F3B H3F3A GNAS FGFR3 FAT1 EEF1A1 DYRK1A DYNC1I1 CAMK1 BRPF1 BCOR BAX APOB

m 0 200 400 600 800 0 200 400 600 800 CNS Lung Breast Kidney Glioma

y−RCC # patients with # mutations in all patients Myeloid er−HCC

Sarcoma mutation in element y−ChRCC Squamous Pancancer Carcinoma Li v eloid−MPN Lung−SCC CNS−GBM Head−SCC y−AdenoCa y−AdenoCa y−AdenoCa Lymph−CLL Kidne us−AdenoCa Bladder−TCC CNS−Medullo Lymph−BNHL M y T h Eso−AdenoCa Digestive_tract Bone−Leio CNS−PiloAstro Panc−Enocrine Lung−AdenoCa Panc−AdenoCa Kidne Prost−AdenoCa Skin−Melanoma O var Bone−Osteosarc Adenocarcinoma Biliar Breast−AdenoCa Uter ymphatic_system L Stomach−AdenoCa ColoRect−AdenoCa Hematopoietic_system Female_reproductive_tract bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Extended Data Figure 7 a b c ncdDetect ActiveDriverWGS MutSig LARVA DriverPower oncodriveFML_cadd ncDriverConservation_snv ncDriverConservation_indel Integration compositeDriver ncdDetect oncodriveFML_vest3 ExInAtor ActiveDriverWGS Integration compositeDriver ncdDetect ActiveDriverWGS MutSig LARVA DriverPower MutSig LARVA DriverPower oncodriveFML_cadd NBR dNdScv ncDriverConservation_snv ncDriverConservation_inde oncodriveFML_cadd NBR ncDriverConservation_snv ncDriverConservation_indel Integration *********** * CTNNB1 ****** * TERT ** * ALB *********** * TP53 ***** * WDR74 − SPHK2 * − ******** * AXIN1 *** * * IFI44L − − USP34 ********* * ARID2 * − − IER3 − − RFC4 ******** * ARID1A HES1 − DBP * *** − * * ALB TTR TMEM107 ** − ****** * NFE2L2 HIST1H4A − AFF4 − ****** * RPS6KA3 AC008914.1 − AHSA2 ****** * BAP1 HIST1H2BK HSP90AB1 − − RB1 ****** * KRT222 * FGA ****** − * * BRD7 PCF11 CALM2 * ****** * KEAP1 AL590714.1 − SPAG5 − − CDKN1A **** * TMEM183A ARPC2 − PTEN ***** * CASC5 − GLYR1 − − KRTAP5−11 * ** ** * XRCC2 − CTD−2021H9.3 − APOB * **** * GLE1 MEA1 RPL22 Liver−HCC

***** * Liver−HCC CD14 − GPRC5A − * − * * EEF1A1 ERP27 VAV3 * − − * * * SETDB1 − ZSCAN5B TRIML2 − − − * CAMK1 OGDH GLI2 ** ** * DYRK1A DTWD2 PRR19 * * ACVR2A PSMB7 − * RPL5 BPTF ANKRD39 − ** − * HNF4A NR4A2 KRTAP4−3 * − * − * C11orf57 HIST1H2BG ** − CRIP3 * ALB LEMD3

Liver−HCC * − G6PC * * − IDH1 Included methods: 10 Included methods: 8 * − HIST1H2BD * − TSC1 * − − VAV3 * CHPT1 * C13orf45 PRKACA * − VIM Protein-coding TSC2 0 5 10 15 * USP15 − TYW5 Promoter FDR < 0.1 CDKN2A * SDHB 0 5 10 15 - FDR < 0.25 − − SEC23A − TNRC6B 3’UTR No p-value calculated TRAF6 0 5 10 * − ALDOB UBBP4 −log10 p-value Included methods: 13 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which Extendedwas notData certified Figure by peer review) 8 is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

P=5.3e−04 P=7.3e−08 P=0.025

100 100 15

80 80 score score Mutation type Mutation type score Mutation type − − SNV SNV − SNV 10 no event no event no event 60 60 Copy number event Copy number event Copy number event AdenoCA

CNA gain − CNA gain CNA amplification no event no event CNA gain CNS_tumors 40 40 expression z Thy T T expression z 5 T expression z CNA loss Carcinoma_tumors R R R no event TE TE 20 TE 20

0 0 0 promoter mut. wt promoter mut. wt promoter mut. wt

8 10 4 P=0.27 P=0.45 P=0.42

8 3 6 Mutation type score score score indel Mutation type 6 Mutation type − − − 2 SNV SNV SNV 4 no event no event no event 4 1 Copy number event Copy number event Copy number event CNA gain CNA amplification CNA amplification 2 pancan

ymph_tumors CNA loss CNA gain CNA gain expression z expression z expression z

L 2 0 no event CNA loss CNA loss Carcinoma_tumors no event no event PAX5 HES1 0 HES1 0 −1

−2 −2 −2 promoter mut. wt promoter mut. wt promoter mut. wt

5 P=0.15 P=0.81 P=0.16 4 3 4 3 score score − 3 Mutation type Mutation type Mutation type

score 2 −

SNV − SNV SNV no event 2 no event no event 2 Copy number event Copy number event 1 Copy number event AdenoCA AdenoCA − CNA gain − CNA gain CNA gain 1 1 expression z CNA loss CNA loss CNA loss expression z expression z Lung no event Lung no event 0 no event 0 Digestive_tract_tumors DTL

0 INTS7 PTDSS1 −1 −1

−1 5'UTR mut. wt 5'UTR mut. wt 5'UTR mut. wt

P=0.25 P=0.53 5

4 4 score − score 3 Mutation type 3 Mutation type − SNV SNV no event no event

HCC 2 2 −

Copy number event AdenoCA Copy number event

CNA gain − CNA gain Liver CNA loss expression z 1 CNA loss expression z 1 no event Ovary no event 0 IFI44L 0 POLR3E −1

promoter mut. wt promoter mut. wt bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made Extended Data Figure 9available under aCC-BY-NC-ND 4.0 International license.

192,400 kb 48,600 kb

TCGA-AC-A23H-01A-11D-A160-01TCGA-AO-A0JM-01A-21D-A059-01 TCGA-D8-A27N-01A-11D-A16C-01TCGA-A2-A3XZ-01A-42D-A238-01 TCGA-A2-A1FW-01A-11D-A13J-01TCGA-C8-A12M-01A-11D-A111-01 TCGA-A8-A076-01A-21D-A011-01TCGA-C8-A137-01A-11D-A111-01 TCGA-DD-A73D-01A-12D-A32F-01TCGA-GM-A2DA-01A-11D-A18N-01 a chr3 193,000 kb TCGA-C8-A12T-01A-11D-A111-01TCGA-L5-A8NG-01A-11D-A37B-01194,000 kb TCGA-EY-A1GJ-01A-12D-A141-01TCGA-21-A5DI-01A-31D-A26L-01 TCGA-BH-A0DG-01A-21D-A12N-01TCGA-AR-A24U-01A-11D-A166-01 TCGA-D8-A1XZ-01A-11D-A14J-01TCGA-EW-A6SC-01A-12D-A32H-01 TCGA-AN-A0FW-01A-11D-A036-01TCGA-A2-A25B-01A-11D-A166-01 TCGA-NA-A4R1-01A-11D-A28Q-01TCGA-ZF-A9R3-01A-11D-A38F-01 TCGA-E2-A1LB-01A-11D-A141-01TCGA-C8-A12U-01A-11D-A111-01 TCGA-DD-AACJ-01A-11D-A40Q-01TCGA-A8-A07I-01A-11D-A011-01 TCGA-29-2429-01A-01D-1043-01TCGA-ND-A4WA-01A-12D-A28Q-01 TCGA Samples MB21D2 HRASLS ATP13A4 OPA1 LOC647323 HES1 TCGA-D5-6537-01A-11D-1717-01TCGA-BR-4369-01A-01D-1155-01LINC00887 GP5 LINC00884 TMEM44 FAM43A TCGA-W3-AA21-06A-11D-A38F-01TCGA-BH-A0DD-01A-31D-A12N-01 TCGA-AO-A0JG-01A-31D-A087-01TCGA-AN-A0XW-01A-11D-A107-01 TCGA-VQ-A922-01A-11D-A40Z-01 TCGA-09-1665-01B-01D-0533-01TCGA-BH-A0BP-01A-11D-A111-01 TCGA-BH-A0HW-01A-11D-A036-01TCGA-IF-A3RQ-01A-11D-A227-01 TCGA-BR-7723-01A-11D-2052-01 TCGA-D1-A0ZP-01A-21D-A10L-01TCGA-C8-A1HI-01A-11D-A134-01 TCGA-56-A4BY-01A-11D-A24C-01TCGA-E9-A1RF-01A-11D-A160-01 TCGA-AO-A0J3-01A-11D-A036-01TCGA-AR-A0TY-01A-12D-A111-01 TCGA-BR-4357-01A-01D-1155-01 TCGA-EO-A2CH-01A-11D-A17U-01TCGA-BR-4357-01A-01D-1155-01 TCGA-DX-AB3A-01A-11D-A416-01TCGA-BH-A0HB-01A-11D-A059-01 TCGA-CD-8527-01A-11D-2338-01 TCGA-BL-A0C8-01B-04D-A273-01TCGA-A8-A08J-01A-11D-A011-01 TCGA-10-0931-01A-01D-0399-01TCGA-D3-A8GP-06A-11D-A371-01 TCGA-AP-A053-01A-21D-A00X-01TCGA-S3-AA14-01A-11D-A41E-01 TCGA-BR-8360-01A-11D-2338-01 TCGA-B9-A5W9-01A-11D-A28F-01TCGA-N7-A59B-01A-11D-A28Q-01 TCGA-NF-A4WX-01A-11D-A28Q-01TCGA-2Y-A9H0-01A-11D-A381-01 TCGA-VQ-A8PJ-01A-11D-A40Z-01 TCGA-04-1356-01A-01D-0452-01TCGA-A2-A0T3-01A-21D-A111-01 TCGA-BH-A18R-01A-11D-A12A-01TCGA-AQ-A0Y5-01A-11D-A14J-01 TCGA-EW-A1OZ-01A-11D-A141-01TCGA-A2-A0D1-01A-11D-A036-01 TCGA-VQ-A91Z-01A-11D-A40Z-01 TCGA-EW-A2FV-01A-11D-A17C-01TCGA-UD-AAC4-01A-11D-A39Q-01 TCGA-G3-A25S-01A-11D-A16U-01TCGA-A8-A09W-01A-11D-A011-01 TCGA-B7-5818-01A-11D-1599-01 TCGA-A7-A26H-01A-11D-A166-01TCGA-E9-A5UP-01A-11D-A28A-01 TCGA-50-5944-01A-11D-1752-01TCGA-EA-A5ZF-01A-11D-A28A-01 TCGA-VQ-A91Q-01A-12D-A40Z-01 TCGA-3C-AALK-01A-11D-A41E-01TCGA-E2-A14T-01A-11D-A111-01 TCGA-UT-A88E-01B-11D-A39Q-01TCGA-BH-A0EE-01A-11D-A036-01 TCGA-AG-3609-01A-02D-0824-01TCGA-12-0818-01A-01D-0384-01 TCGA-CD-A487-01A-21D-A24C-01 TCGA-AP-A5FX-01A-11D-A27O-01TCGA-D8-A27V-01A-12D-A17C-01 TCGA-B3-4104-01A-02D-1348-01TCGA-E9-A1NA-01A-11D-A141-01 TCGA-BR-4191-01A-02D-1130-01 TCGA-FJ-A3ZE-01A-11D-A23L-01TCGA-QQ-A5VC-01A-31D-A32H-01 TCGA-E9-A2JS-01A-11D-A17U-01TCGA-AN-A0AJ-01A-11D-A011-01 TCGA-A8-A07P-01A-11D-A011-01TCGA-D8-A140-01A-11D-A111-01 TCGA-D7-A6F0-01A-11D-A31K-01 TCGA-WB-A81J-01A-11D-A35H-01TCGA-H8-A6C1-01A-11D-A32M-01 TCGA-OR-A5LT-01A-11D-A29H-01TCGA-LN-A9FQ-01A-31D-A386-01 TCGA-IN-A6RN-01A-12D-A33S-01 TCGA-A8-A08F-01A-11D-A011-01TCGA-VQ-AA6D-01A-11D-A40Z-01 TCGA-DI-A2QY-01A-12D-A19X-01TCGA-13-1487-01A-01D-0472-01 TCGA-AN-A04A-01A-21D-A036-01TCGA-BR-8289-01A-11D-2338-01 TCGA-FP-8211-01A-11D-2338-01 TCGA-AA-3556-01A-01D-0819-01TCGA-52-7810-01A-11D-2121-01 TCGA-XF-AAMH-01A-11D-A42D-01TCGA-55-6712-01A-11D-1854-01 TCGA-EQ-5647-01A-01D-1599-01 TCGA-KV-A74V-01A-11D-A33P-01TCGA-95-7043-01A-11D-1943-01 TCGA-A7-A0CD-01A-11D-A011-01TCGA-D8-A27G-01A-11D-A16C-01 TCGA-IB-7891-01A-11D-2200-01TCGA-3A-A9IV-01A-11D-A40V-01 TCGA-CG-4437-01A-01D-1799-01 TCGA-AR-A0TW-01A-11D-A087-01TCGA-C8-A1HM-01A-12D-A134-01 TCGA-BQ-5891-01A-11D-1588-01TCGA-G3-AAUZ-01A-11D-A381-01 TCGA-RD-A7BS-01A-11D-A32M-01 TCGA-HS-A5N8-01A-11D-A26F-01TCGA-BH-A0HX-01A-21D-A059-01 TCGA-G7-6790-01A-11D-1959-01TCGA-55-1595-01A-01D-0944-01 TCGA-CD-5800-01A-11D-1599-01 TCGA-63-7023-01A-11D-1943-01TCGA-E9-A22A-01A-11D-A160-01 TCGA-GL-A59T-01A-21D-A28F-01TCGA-JU-AAVI-01A-11D-A402-01 TCGA-EW-A1OY-01A-11D-A141-01TCGA-BH-A0DK-01A-21D-A059-01 TCGA-CG-4472-01A-01D-1155-01 TCGA-29-1693-01A-01D-0559-01TCGA-G7-6789-01A-11D-1959-01 TCGA-2L-AAQM-01A-11D-A396-01TCGA-UY-A8OD-01A-11D-A363-01 TCGA-BR-8060-01A-11D-2338-01 TCGA-E9-A54Y-01A-11D-A25N-01TCGA-E2-A1IG-01A-11D-A141-01 TCGA-13-0921-01A-01D-0399-01TCGA-E2-A1B1-01A-21D-A12N-01 TCGA-BC-A10Y-01A-11D-A12Y-01TCGA-41-2573-01A-01D-0911-01 TCGA-CD-A48C-01A-11D-A24C-01 TCGA-58-A46N-01A-11D-A24C-01TCGA-EW-A6SA-01A-21D-A32H-01 TCGA-FR-A3YN-06A-11D-A238-01TCGA-BH-A0BC-01A-22D-A087-01 TCGA-B7-A5TN-01A-21D-A31K-01 TCGA-P3-A6T6-01A-11D-A34I-01TCGA-B6-A0IG-01A-11D-A036-01 TCGA-G7-6795-01A-11D-1959-01TCGA-NK-A7XE-01A-12D-A400-01 TCGA-DD-A3A7-01A-11D-A22E-01TCGA-G7-A8LD-01A-11D-A35Y-01 TCGA-B7-5816-01A-21D-1599-01 TCGA-10-0936-01A-01D-0399-01TCGA-AR-A2LL-01A-11D-A17U-01 TCGA-FB-A545-01A-11D-A26H-01TCGA-VR-AA4D-01A-11D-A37B-01 TCGA-FP-7735-01A-11D-2052-01 TCGA-AN-A04C-01A-21D-A036-01TCGA-AH-6897-01A-11D-1923-01 TCGA-61-1722-01A-01D-0577-01TCGA-N6-A4V9-01A-11D-A28Q-01 TCGA-GV-A3JZ-01A-11D-A219-01TCGA-85-8070-01A-11D-2243-01 TCGA-VQ-A925-01A-11D-A40Z-01 TCGA-B4-5835-01A-11D-1668-01TCGA-B9-4117-01A-02D-1348-01 TCGA-N5-A4RD-01A-11D-A28Q-01TCGA-EY-A2OQ-01A-11D-A19X-01 TCGA-CG-5725-01A-11D-1599-01 TCGA-BH-A0HY-01A-11D-A059-01TCGA-AA-A00L-01A-01D-A003-01 TCGA-BR-6801-01A-11D-1881-01 TCGA-BR-8484-01A-11D-2390-01 TCGA-BR-7958-01A-21D-2338-01 TCGA-IN-A7NT-01A-21D-A34T-01 TCGA-HU-A4H5-01A-21D-A253-01 TCGA-HU-A4G9-01A-11D-A24C-01 TCGA-VQ-A91S-01A-11D-A40Z-01 TCGA-VQ-A8P5-01A-11D-A396-01 TCGA-D7-6519-01A-11D-1799-01 TCGA-R5-A7ZE-01B-11D-A34T-01 TCGA-VQ-AA68-01A-11D-A40Z-01 TCGA-HU-A4H4-01A-21D-A253-01 TCGA-BR-4366-01A-01D-1155-01 TCGA-D7-A6EV-01A-11D-A31K-01 TCGA-HU-A4H3-01A-21D-A253-01 TCGA-BR-A4PE-01A-31D-A253-01 TCGA-3M-AB46-01A-11D-A40Z-01 TCGA-CG-4305-01A-01D-1155-01 TCGA-IN-A6RR-01A-12D-A32M-01

TCGA-VQ-A8E0-01A-11D-A40Z-01 p24.1 p22.3 p21.3 p21.1 p13.2 p11.2 q12 q13 q21.13 q21.32 q22.2 q22.33 q31.2 q32 q33.2 q34.11 q34.3 TCGA-CG-4449-01A-01D-1155-01 TCGA-VQ-A923-01A-11D-A40Z-01 TCGA-VQ-A8PP-01A-21D-A40Z-01 TCGA-D7-A6EZ-01A-11D-A31K-01 TCGA-HU-A4GF-01A-11D-A24C-01 TCGA-BR-4267-01A-01D-1130-01 TCGA-B7-A5TI-01A-11D-A31K-01 TCGA-IN-A6RL-01A-11D-A32M-01 TCGA-VQ-AA6A-01A-11D-A40Z-01

b chr3 48,700 kb 49,000 kb

TCGA Samples CACNA1G ABCC3 ANKRD40 LUC7L3 LINC00483 WFIKKN2 TOB1 SPAG9 TCGA-BH-A0AW-01A-11D-A059-01 TCGA-A8-A0A7-01A-11D-A011-01 TCGA-AR-A250-01A-31D-A166-01 TCGA-EW-A1OX-01A-11D-A141-01 TCGA-BH-A1F2-01A-31D-A13J-01 TCGA-D8-A1JA-01A-11D-A13J-01 TCGA-EW-A6SD-01A-12D-A33D-01 TCGA-3C-AALI-01A-11D-A41E-01 TCGA-C8-A275-01A-21D-A16C-01 TCGA-BH-A0DZ-01A-11D-A011-01 TCGA-C8-A1HO-01A-11D-A13J-01 TCGA-A2-A04X-01A-21D-A036-01 TCGA-E2-A109-01A-11D-A10L-01 TCGA-C8-A1HK-01A-21D-A13J-01 TCGA-AO-A0JM-01A-21D-A059-01 TCGA-AC-A23H-01A-11D-A160-01 TCGA-A2-A3XZ-01A-42D-A238-01 TCGA-D8-A27N-01A-11D-A16C-01 TCGA-C8-A12M-01A-11D-A111-01 TCGA-A2-A1FW-01A-11D-A13J-01 TCGA-C8-A137-01A-11D-A111-01 TCGA-A8-A076-01A-21D-A011-01 TCGA-GM-A2DA-01A-11D-A18N-01 TCGA-DD-A73D-01A-12D-A32F-01 TCGA-L5-A8NG-01A-11D-A37B-01 TCGA-C8-A12T-01A-11D-A111-01chr9 35,600 kb 35,700 kb c TCGA-21-A5DI-01A-31D-A26L-01 TCGA-EY-A1GJ-01A-12D-A141-01 TCGA-AR-A24U-01A-11D-A166-01 TCGA-BH-A0DG-01A-21D-A12N-01 TCGA-EW-A6SC-01A-12D-A32H-01 TCGA-D8-A1XZ-01A-11D-A14J-01 TCGA-A2-A25B-01A-11D-A166-01 TCGA-AN-A0FW-01A-11D-A036-01 TESK1 SIT1 RMRP CA9 TPM2 TLN1 CREB3 TCGA-ZF-A9R3-01A-11D-A38F-01 TCGA-NA-A4R1-01A-11D-A28Q-01 TCGA-C8-A12U-01A-11D-A111-01 TCGA-E2-A1LB-01A-11D-A141-01 TCGA-A8-A07I-01A-11D-A011-01 TCGA-DD-AACJ-01A-11D-A40Q-01 TCGA-ND-A4WA-01A-12D-A28Q-01 TCGA-29-2429-01A-01D-1043-01 TCGA-BR-4369-01A-01D-1155-01 TCGA-D5-6537-01A-11D-1717-01 TCGA-BH-A0DD-01A-31D-A12N-01 TCGA-W3-AA21-06A-11D-A38F-01 TCGA-AN-A0XW-01A-11D-A107-01 TCGA-AO-A0JG-01A-31D-A087-01 TCGA-BH-A0BP-01A-11D-A111-01 TCGA-09-1665-01B-01D-0533-01 TCGA-IF-A3RQ-01A-11D-A227-01 TCGA-BH-A0HW-01A-11D-A036-01 TCGA-C8-A1HI-01A-11D-A134-01 TCGA-D1-A0ZP-01A-21D-A10L-01 TCGA-E9-A1RF-01A-11D-A160-01 TCGA-56-A4BY-01A-11D-A24C-01 TCGA-AR-A0TY-01A-12D-A111-01 TCGA-AO-A0J3-01A-11D-A036-01 TCGA-BR-4357-01A-01D-1155-01 TCGA-EO-A2CH-01A-11D-A17U-01 TCGA-BH-A0HB-01A-11D-A059-01 TCGA-DX-AB3A-01A-11D-A416-01 TCGA-A8-A08J-01A-11D-A011-01 TCGA-BL-A0C8-01B-04D-A273-01 TCGA-D3-A8GP-06A-11D-A371-01 TCGA-10-0931-01A-01D-0399-01 TCGA-S3-AA14-01A-11D-A41E-01 TCGA-AP-A053-01A-21D-A00X-01 TCGA-N7-A59B-01A-11D-A28Q-01 TCGA-B9-A5W9-01A-11D-A28F-01 TCGA-2Y-A9H0-01A-11D-A381-01 TCGA-NF-A4WX-01A-11D-A28Q-01 TCGA-A2-A0T3-01A-21D-A111-01 TCGA-04-1356-01A-01D-0452-01 TCGA-AQ-A0Y5-01A-11D-A14J-01 TCGA-BH-A18R-01A-11D-A12A-01 TCGA-A2-A0D1-01A-11D-A036-01 TCGA-EW-A1OZ-01A-11D-A141-01 TCGA-UD-AAC4-01A-11D-A39Q-01 TCGA-EW-A2FV-01A-11D-A17C-01 TCGA-A8-A09W-01A-11D-A011-01 Samples TCGA 160 TCGA-G3-A25S-01A-11D-A16U-01 TCGA-E9-A5UP-01A-11D-A28A-01 TCGA-A7-A26H-01A-11D-A166-01 TCGA-EA-A5ZF-01A-11D-A28A-01 TCGA-50-5944-01A-11D-1752-01 TCGA-E2-A14T-01A-11D-A111-01 TCGA-3C-AALK-01A-11D-A41E-01 TCGA-BH-A0EE-01A-11D-A036-01 TCGA-UT-A88E-01B-11D-A39Q-01 TCGA-12-0818-01A-01D-0384-01 TCGA-AG-3609-01A-02D-0824-01 TCGA-D8-A27V-01A-12D-A17C-01 TCGA-AP-A5FX-01A-11D-A27O-01 TCGA-E9-A1NA-01A-11D-A141-01 TCGA-B3-4104-01A-02D-1348-01 TCGA-QQ-A5VC-01A-31D-A32H-01 TCGA-FJ-A3ZE-01A-11D-A23L-01 TCGA-AN-A0AJ-01A-11D-A011-01 TCGA-E9-A2JS-01A-11D-A17U-01 TCGA-D8-A140-01A-11D-A111-01 TCGA-A8-A07P-01A-11D-A011-01 TCGA-H8-A6C1-01A-11D-A32M-01 TCGA-WB-A81J-01A-11D-A35H-01 TCGA-LN-A9FQ-01A-31D-A386-01 TCGA-OR-A5LT-01A-11D-A29H-01 TCGA-VQ-AA6D-01A-11D-A40Z-01 TCGA-A8-A08F-01A-11D-A011-01 TCGA-13-1487-01A-01D-0472-01 TCGA-DI-A2QY-01A-12D-A19X-01 TCGA-BR-8289-01A-11D-2338-01 TCGA-AN-A04A-01A-21D-A036-01 TCGA-52-7810-01A-11D-2121-01 TCGA-AA-3556-01A-01D-0819-01 TCGA-55-6712-01A-11D-1854-01 TCGA-XF-AAMH-01A-11D-A42D-01 TCGA-95-7043-01A-11D-1943-01 TCGA-KV-A74V-01A-11D-A33P-01 TCGA-D8-A27G-01A-11D-A16C-01 TCGA-A7-A0CD-01A-11D-A011-01 TCGA-3A-A9IV-01A-11D-A40V-01 TCGA-IB-7891-01A-11D-2200-01 240 d TCGA-C8-A1HM-01A-12D-A134-01 TCGA-AR-A0TW-01A-11D-A087-01 TCGA-G3-AAUZ-01A-11D-A381-01 A C A C G U A TCGA-BQ-5891-01A-11D-1588-01 A A A Panc-AdenoCA: C>U C TCGA-BH-A0HX-01A-21D-A059-01150 C G A C TCGA-HS-A5N8-01A-11D-A26F-01 G 160 A U 230 TCGA-55-1595-01A-01D-0944-01 Lung-AdenoCA: C>A A G TCGA-G7-6790-01A-11D-1959-01 C G TCGA-E9-A22A-01A-11D-A160-01 C C G C TCGA-63-7023-01A-11D-1943-01 C C C G C G TCGA-JU-AAVI-01A-11D-A402-01 C C A Panc-Endocrine: G>C G TCGA-GL-A59T-01A-21D-A28F-01 C U C TCGA-BH-A0DK-01A-21D-A059-01 U TCGA-EW-A1OY-01A-11D-A141-01G G G U G TCGA-G7-6789-01A-11D-1959-01 G G C TCGA-29-1693-01A-01D-0559-01C A G U TCGA-UY-A8OD-01A-11D-A363-01 G C TCGA-2L-AAQM-01A-11D-A396-01U C 200 TCGA-E2-A1IG-01A-11D-A141-01G C C U C 220 A U C U 250 TCGA-E9-A54Y-01A-11D-A25N-01 G 170 C A TCGA-E2-A1B1-01A-21D-A12N-01A U C G 210 C C TCGA-13-0921-01A-01D-0399-01G U A TCGA-41-2573-01A-01D-0911-01 G G TCGA-BC-A10Y-01A-11D-A12Y-01 A ins. G>AG U A TCGA-EW-A6SA-01A-21D-A32H-01C G A G A TCGA-58-A46N-01A-11D-A24C-01 C Liver-HCC A U 140 TCGA-BH-A0BC-01A-22D-A087-01 G C C C G CTCGA-FR-A3YN-06A-11D-A238-01 U U TCGA-B6-A0IG-01A-11D-A036-01 G C A C TCGA-P3-A6T6-01A-11D-A34I-01 U G C U A U TCGA-NK-A7XE-01A-12D-A400-01U C C>U TCGA-G7-6795-01A-11D-1959-01 G G>U: Liver-HCC C G TCGA-G7-A8LD-01A-11D-A35Y-01 A U Stomach- G C TCGA-DD-A3A7-01A-11D-A22E-01 C TCGA-AR-A2LL-01A-11D-A17U-01 G 180 A>G: C C 20 AdenoCA TCGA-10-0936-01A-01D-0399-01 10 U A Panc-AdenoCATCGA-VR-AA4D-01A-11D-A37B-01 G U TCGA-FB-A545-01A-11D-A26H-01 C C G 190 TCGA-AH-6897-01A-11D-1923-01 G G A TCGA-AN-A04C-01A-21D-A036-01 G C C G TCGA-N6-A4V9-01A-11D-A28Q-01 C G G TCGA-61-1722-01A-01D-0577-01 C U C A TCGA-85-8070-01A-11D-2243-01 U G G C 260 U C TCGA-GV-A3JZ-01A-11D-A219-01 C TCGA-B9-4117-01A-02D-1348-01 C TCGA-B4-5835-01A-11D-1668-01 C U G TCGA-EY-A2OQ-01A-11D-A19X-01 130 C A G TCGA-N5-A4RD-01A-11D-A28Q-01 U TCGA-AA-A00L-01A-01D-A003-01 U A G C TCGA-BH-A0HY-01A-11D-A059-01 120 A U C G U G G G U U G C U G U C A 110 C A C 1 ins. G>ACGTGG: Liver-HCC 268 G C C C G 100 C U G U A G C G G A C G 30 U G G G A G A A G A>U: Liver-HCC U C A 70 G A C U A U U A G C C G C C U C C del. G: A G C C A A G C C 90 Breast-AdenoCa C A G G G 80 A C C U C 60 G 40 U A U C G U C G Low High C C U Mutation P4 tertiary interaction sites C A C C U U Mutation with predicted RNA-protein interaction sites 50 C del. UCCUC Predicted structural impact U U structural impact (P < 0.1) RNA-substrate interaction sites Thy-AdenoCA G of mutations bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Extended Data Figure 10

a Scale 5 kb hg19 chr11: 62,601,000 62,602,000 62,603,000 62,604,000 62,605,000 62,606,000 62,607,000 62,608,000 62,609,000 62,610,000 WDR74 4.88 - 100 Vert. Cons 0 - -4.5 _ 1000G Accs Pilot 1000G Accs Strict PCAWG germline

PCAWG somatic

RP11-727F15.9 WDR74 RNU2-2P WDR74 WDR74 WDR74 WDR74 RP11-727F15.9 WDR74

b Scale 2 kb hg19 chr17: 8,076,000 8,076,500 8,077,000 8,077,500 8,078,000 8,078,500 8,079,000 8,079,500 8,080,000 8,080,500 TMEM107 4.88 - 100 Vert. Cons 0 - -4.5 _ 1000G Accs Pilot 1000G Accs Strict PCAWG germline

PCAWG somatic TMEM107 SNORD118 TMEM107 TMEM107 TMEM107 TMEM107 TMEM107 RP11-599B13.7

Scale 500 bases hg19 c chr22: 43,011,000 43,011,500 43,012,000 RNU12 4.88 - 100 Vert. Cons 0 - -4.5 _ 1000G Accs Pilot 1000G Accs Strict PCAWG germline PCAWG somatic

POLDIP3 RNU12 POLDIP3 POLDIP3

Scale 1 kb hg19 d chr9: 35,656,500 35,657,000 35,657,500 35,658,000 35,658,500 35,659,000 35,659,500 35,660,000 RMRP 4.88 - 100 Vert. Cons 0 - -4.5 _ 1000G Accs Pilot 1000G Accs Strict PCAWG germline PCAWG somatic

RMRP CCDC107 CCDC107 CCDC107 CCDC107 CCDC107 CCDC107 ARHGEF39 ARHGEF39 ARHGEF39

e Scale 2 kb hg19 chr14: 20,809,500 20,810,000 20,810,500 20,811,000 20,811,500 20,812,000 20,812,500 20,813,000 20,813,500 RPPH1 4.88 - 100 Vert. Cons 0 - -4.5 _ 1000G Accs Pilot 1000G Accs Strict PCAWG germline PCAWG somatic RPPH1 PARP2 PARP2 PARP2 RP11-203M5.2 mir-142 f chr 7 100 bp primary miRNA pre-miRNA mature miRNA SNVs

primary miRNA pre-miRNA Mutational signature mature miRNA mir-142-p3 mir-142-p5 Non-AID SNVs AID 6500000 Extended Data d Expression fold difference (log2) Mean a f CAF gene/element -2 -1

0 1 2 3 Density Indels (n SNVs 0.4 0.5 0.6 0.7 0.8 0.9 Struc 0 t Number ofmutate

u ( n = r ● alvariants (n=

0 = 454)

4324) 6510000 BRAF TERT TE bioRxiv preprint ● KRAS was notcertifiedbypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmade R 0. NE ● MEN1 T PBRM1 ● ● 4 CDKN2A SMAD4 CDKN2A ● ● ● A ● ● 0 ● ● T1 ● 3 ● ● 79) ● ● MAL ● ● VHL ● RB1 ● ● ● ● 6520000 ● ● ● ● 0. MALA NE ● ● Mean ● A ● ● 5 ● T1 A 5 T1 ● 0 doi: F 0 T ● 1 MAL i VHL NEA d g 0. C https://doi.org/10.1101/237313 TP53 6530000 sampl A 6 ure 11 AF T1 APC T 1

flanking 0 0.7 ● ● ● ● ● ● ● ● ● e 100 q < 3'UTR Background miRNA Enhancers Small R lncRNA lncRNA Protein 5'UTR Protein s withRNAseqdata 6540000

0.1 KRAS 0. − − N p available undera coding coding prom 0 8 A romoter 65500000 0.9 300 o ; ter b this versionpostedDecember23,2017. − − TP53 A B L T B ackgro 2 1 G 0 1 2 3 4 5 ( (Liv

T CC-BY-NC-ND 4.0Internationallicense difference (log2) FG hy- e u LOH fold SFTP r-HCCn e A A d FG (Liver-HCden −4 −2 0 2 4 6 8 B B o ) (Lung-Ad(Liver-H C

TRAM2 A a ) HRAS LD PRK RP11 C TNFR 0 APOO ) RPL22 H3F3B A POLR3E HIS B RP11 − RNU6 FUB ● RNU12

− CC R1A LINC00963 MTG2 HES1 Enhance RNU12 AS1 e DTL

4 (Liv RNF34

RMRP noCa T S ST 100 150 Number of structural events 40L14.1 A B

● ) 1H2 F14 P DH g − ● ● RPPH1 K IF 1 (Liver-He − 50 RR 9 A ● IRF4 11

573P r- PLK1 ● I 2C4.6 DH1 A PTD ● ● ● 44L 4 ) CBFB

● ● H TG G A r ● 0 mir A ● ● :TP ( CC GC ● SF L CVR1B ● ● P P F S F ● ● ● B i -

AX5 C

A2G4 v BT BR2 142 5 O G029190 S1 ) T ● FG T D U2AF1 er-HC ALB C

NXF1 PS 3TG1 ( PB XA1 ● OB1 L NFK G C AXX F

F i

1 1 v B2M ) O R1 TSC1

● ● er-H XA1 MAP2 0 H (L B C C

RFTN1 P IZ ● YP i ) R v C

TMEM107 er-HC ● Number ofmutate KE RMRP 2C C K C (L CAS 4 MEN1 ●

CDH1 ) BA YP i

A 8 v

Enhance er-HCC P1 P R (Liver- C P 3

1 N 8 P A ) AX

5 N

● 11- EAT5 G025135 EAT EGFR 0

I ( RB1 N1 1 L

MIR663A H ) r i ● 151 1 v :IGH 1 er C

● (Liv SF3 ( C

RB1 E - NF1 CC B )

VHL HCC 1 so-

B e

CIC 4 r-H 1 D PTEN ARID2 . 10 C153 A

SMARC ) ( d C BRAF P L e C SP 2 i noCa FAT1 NL

WDR74 v )

0 er-HCC VHL (Liv Number of

A R PBRM1 I A TRX P ● N GNAS APC NF1 C ) 4 L PB ( (L e The copyrightholderforthispreprint(which IP P r N anc- -HCC 1 iv ) O F ARID1A CDKN2A er-HC

TCH1 (

( P . St a Ad )

A omach-Aden TM CREB enoCa 100

PT c- SMAD4 C

TERT A ) d MALAT1 E enoC BRAF B d N ) P SE 200 sampl T n a D2 oCa) F

A T1 mutate CDKN2A ) CTNNB1 c KMT2C APC

e Lymph.NOS

KRAS Head.SCC s Lymph.BNHL TE d Cervix.AdenoCA 150 Bladder.TCC samples MAL

30 Prost.AdenoCA ● R Ovary.AdenoCA ARID1A T ● Breast.LobularCa

0 Bone.Leiomyo

A Skin.Melanoma Cervix.SCC T1 CNS.Oligo

AP CNS.GBM

O Lymph.CLL NEAT1 PI B Uterus.AdenoCA KMT2D K ColoRect.AdenoCA

3CA Kidney.ChRCC Eso.AdenoCa Breast.AdenoCa 400 200 Lung.SCC Thy.AdenoCA Lung.AdenoCA

NE Kidney.RCC KRAS 400 Stomach.AdenoCA Panc.AdenoCA ● Biliary.AdenoCA A Liver.HCC Expression level 600 T1 AP ADH4 CYP2C8 RP11-1151B14.3 HPR FGA FGB ALB ADH1B CPS1 ALDOB CYP3 PNLIP CPB1 LIPF TG CCDC152 SPRN SF NEAT1 TP53 T O TP53 PB B A 800 5 1000 bioRxiv preprint doi: https://doi.org/10.1101/237313; this version posted December 23, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Extended Data Figure 12 a c Inferred mutations 4 Missed 31% 6 Total 100 AKT1 5’UTR 1.0 TCF7L2 (exon 3) 6

4

2 % patients % patie nts Detection sensitivity 0 Missed 0.0 0 Called 0.0 1.0 0.0 1.0 All breast cancers Average detection sensitivity FOXA1 promoter (chr14:38,064,406) b % of patients in cohort powered (≥90%) 35 91 94 85 48 79 35 4 23 61 61 19 22 8 81 77 30 69 61 91 66 94 61 64 51 100 48 1.0 at chr5:1,295,250 Detection sensitivity 0.0 tr o du ll o RC C RC C eo sarc ph -CLL il oAs ve r-HCC ea d-SCC ung -SCC ph -BNHL dd er-TCC End ocrine Li dn ey - L CNS-GBM H ye loid-MPN Lym Ki Bla CNS-Me dn ey -Ch M Lym ung -AdenoCa Thy-AdenoCa Eso-AdenoCa ea st-AdenoCa va ry-AdenoCa CNS-P Bone-Leiomyo ili ary-AdenoCa L Pa nc-AdenoCa Ki Prost-AdenoCa Skin-Melanoma Pa nc- O B Bone-Ost Br Uterus-AdenoCa Stomach-AdenoCa ColoRect-AdenoCa