© 2015 Nature America, Inc. All rights reserved. 2015; 2015; Received 17 June; accepted 20 November; published online 21 December M.P.S. ( should Correspondence be addressed to C.L.A. ( California, USA. California, USA. California, USA. 1 alterations noncoding characterizing or mutations coding of distribution the little attention positional analyzing on systematically mutated -coding frequently identifying by alterations somatic of landscape the of ing understand our expanded substantially have (ICGC) Consortium Cancer Genome Atlas (TCGA) and the International Cancer Genome the in exons (for example, in BRAF Val600; ref. vating mutations diverse mapping to single residues (for of example, hyperacti include elements drivers melanoma example, For functional size. and nature alter mutations driver cancer, In advantage the interactions. biophysical types, distinct signaling. domains demonstrate regions including alterations significantly clustering and sensitive An driver Cancer J Greenleaf William L Araya Carlos alterations molecular of functional a landscape rich highlights types cancer mutated regions across of significantly Identification Nature Ge Nature Department Department of Genetics, Stanford University School of Medicine, Stanford,

improved

varied inclusive doi:10.1038/ng.347

[email protected] as genes TERT

sequencing mutated protein

and to exemplified

transcription

Mutation

methods

mechanisms n of across

unknown simulations

etics

promoter mutated by

The

spatial 4 3 2 at functionally method of Department Department of Chemistry, Stanford University, Stanford, These These authors contributed equally to this work. Department of Applied Physics, Stanford University, Stanford,

the regions interfaces,

noncoding

in functional 1

1 a ), inactivating mutations ), along tumor-suppressor inactivating

,

up

4 ADVANCE ONLINE PUBLICATION ONLINE ADVANCE accumulation

in studies spectrum frequencies clustering , Can Cenik , Can

PTEN

by

regions distributions would

2 1 to 2

factor 1 of ). Cancer genomics projects such as The The as such projects genomics Cancer ). are 1

suggest a

, ~ tumor

3

) ) or W.J.G. ( agnostic linker oncogenic

1 1

often

have

) ) and regulatory mutations (for example, differentially drivers. 5% diversity 3–

be

(SMRs). of 5 binding

. However, these studies have focused of

annotation that

of types in

coding region primarily

with alterations

of

specific SMRs 1 of driver

[email protected] We ,

4 mutations

protein-altering

of misregulation

, , Jason A Reuter

functions

SMRs associated to sites [email protected]

employed of SMRs and

detect demonstrate

mutated

identification.

PIK3CA

identified

tumor and independent,

reveal noncoding in

underscores

molecular affect within untranslated variably

changes

).

types.

density-based

across

in recurrent

and

cancer

mutations. which

regulatory

that

elements, ), ),

the sized SMRs

1 tumor

in , Gert Kiss , Gert 6

both .

- -

by by BRAF in observed or recurrently mutated amino acids called mutation hotspots mutation called acids amino mutated recurrently body or the across rates mutation synonymous-to-synonymous alterations that complement gene- and pathway-centric analyses. pathway-centric and gene- complement that alterations somatic by targeted recurrently elements noncoding and coding of a spectrum rich SMRs identified Overall, regions. in mutations these ing insight into of somatic consequences the accumulating molecular thereby provid profiles, and signaling and transcriptional interfaces, molecular structures, protein elements, noncoding with associated were Moreover, SMRs elements. functional and genes new in as well (SMRs). as drivers regions cancer well-established mutated numerous in SMRs Weidentified significantly term we which mutation, by somatic altered recurrently regions genomic sized of tion variably identifica unbiased the gene- permitted approach This types. in 21 cancer and mutation recurrent of type– regions identify to mutation models mutation specific cancer-, using techniques clustering cancer. of drivers noncoding significant be may features regulatory specificity stage and tissue developmental substantial in have animals identified probabilities pathways and genes driver of level the at types cancer the among disregard heterogeneity or established disruption of targets relevant the presume either types mutation UTRs—or and promoters signals, splicing sites, ETS-binding as regions—such or predefined (ii) data sequencing whole-genome pan-cancer (i) either primarily examined have genomes cancer in variation regulatory noncoding to characterize Efforts may sequencing. be captured by whole-exome exons in even or to proximal located are exon- elements of regulatory embedded importance the ignoring and sequence protein influence exclusively mutations driver that assuming mutations, synonymous of clusters non or identified length of have fixed windows employed where Moreover, examined as been has interfaces. genes such within or clustering mutation elements, subunits protein coding affecting functional those of scale intermediate vast 2 Algorithms to identify cancer driver genes often examine non examine often genes driver cancer identify to Algorithms To address these diverse limitations, we employed density-based density-based employed we limitations, diverse these Toaddress genome the in elements regulatory of proportion significant A POLE , , Vijay S Pande ) 9 26– . Yet, these analyses ignore recurrent alterations in the the in alterations recurrent ignore analyses these Yet, . 3 , 2 4 8 . . Yet, systematic analyses of genomic regulatory activity , suggesting that mutations in cancer type–specific type–specific cancer in mutations that suggesting , 7 , IDH1 (ref. (ref. IDH1 , 5 , 24 2 , Michael P Snyder , Michael , 2 5 as well as in nucleotide-specific mutation mutation nucleotide-specific in as well as 13– 8 ) and DNA polymerase DNApolymerase and ) 1 8 . 15 , 1 20– 9 , suggesting that many many that suggesting , 1 2 3 & . These approaches approaches These . ysis i s ly a n a

10– 1 ε 2 (encoded (encoded , analyses analyses , 5 , as as ,  - - - ­

© 2015 Nature America, Inc. All rights reserved. Gene-specific mutation probability models accounted for sequence sequence for accounted models probability mutation Gene-specific Methods). (Online score density for as cluster final the each estimate ( correlated ( probability mutation of density for and each models cluster genome-wide using gene-specific region in each WeMethods). cancer (Online type mutation evaluated observing of as the Fisher’s combined ( between SMRs discovered by employing background models derived from whole-genome sequencing (WGS based) or whole-exome sequencing (WES based). frequency ( genes. noncoding and exons to coding refers label “Exon*” The thresholds. (5%) FDR score density maximum and median minimum, the indicate lines Dashed in insets. shown are (bottom) types region of SMR distribution the and (top) regions evaluated in scores of density distribution The gene). by associated labeled and by type (color-coded discovered SMRs the for scores density and frequency ( positions mutated ( densities mutation higher with if subclusters parameter distance The n 1 Figure noncoding features can be within embedded coding regions important functionally Webecause mutationssynonymous included proximal domains of the ( clusters variably sized 198,247 of somatic mutationstify exon- within annotation-independent density-based clustering technique mutations driver cancer of analysis the in considered previously not were thus and splicing their or sequences ( ( regions regulatory adjacent and transcripts sequences, protein-coding on impact their recording types, cancer 21 of tumors 4,735 from (SNVs) variants We examined ~3 million previously identified Multiscale RESULTS s i s ly a n a  1). equals mutations/samples where is, (that samples distinct from derive mutations where of regions number the ( g h b a n = 3,078,482 somatic SNV calls. ( calls. SNV somatic = 3,078,482 ) Categories with significant fold change in mutation type representation between SMR-associated and input mutations are denoted (* denoted are mutations input and SMR-associated between representation type in mutation change fold significant with ) Categories ) Distribution of the number of mutations per sample in SMRs (blue) and 58 recurrently altered noncoding regions noncoding altered 58 recurrently and (blue) in SMRs sample per of mutations number of the ) Distribution

log (count) = 2,431,360) of these somatic mutations do not alter protein-coding Mutation density scores within each cluster identified were derived (Mutation shuffling) 10 To discover both coding and noncoding cancer drivers, we applied an (observed) Domains 0 1 2 3 4 5 6 SNPs Mutation (6) Filtersignificantlymutatedregions(SMRs) (5) DetermineFDRs (4) Scoremutationregions(binomialtest) (3) Refinemutationregions(DBSCAN,binomialtest) (2) Discovermutationregions(DBSCAN) (1) Defineexon-proximaldomains regions

Intron

Identification of SMRs in 21 cancer types across a broad spectrum of functional elements. ( elements. of functional spectrum a broad across types in 21 cancer of SMRs Identification Intergenic (≥5 kb) ≥

2% per cancer type. Gray bars represent SMRs with FDR FDR with SMRs represent bars Gray type. cancer 2% per Nonsynonymous

Supplementary Fig. 3a Supplementary

Synonymous detection k Exo Noncoding exon P � max( P or more mutations for each mutation type within the the within type mutation each for mutations more or � ≥ mutation frequency FDR = = Upstream ( 5 kb) density frequency 2 mutations ≤ n c d 1 Downstream (≤5 kb) p P p c Mutation type / ≤ / –log

c Stop gain Bayesian d

p 5% and = s

Regions s , 10

) within the cluster of size of size cluster the ) within 3 UTR , 10

<0.05 ′ 10

Splice-site region 1 Fig. Supplementary

ε ≤ ( ≤ ,

P 5′ UTR

Exo is dynamically defined as the average distance of mutated positions ( positions of mutated distance average as the defined is dynamically

of � P � densit P

≤ Splice site acceptor global ≤ 500bp value of the individual binomial probabilities 500bp n Start gain 2 Supplementary Fig. 2 Fig. Supplementary SMRs y ≥ ) ) Splice site donor 2% Intragenic n

=191,669 Start lost

Exo Stop lost (Monte Carlo)

b Synonymous stop Simulated n ) Exons and exon-proximal domains ( domains exon-proximal and ) Exons regions 3 Nonsynonymous start Rare amino acid the more conservative ), selecting Unknown Fig. 1 5 c e P (

SMRs 5 < 0.05, binomial test) were found in a second-pass analysis with with analysis in a second-pass found were test) binomial < 0.05, –log (P )

c 10 density somatic single-nucleotide Fig. 1 Fig. 150 250 100 200 ). We note that 79.0% 79.0% that note We ). s 50 10 20 50 b (see the Online Methods for details on density scoring and FDR calculation). ( calculation). FDR and scoring on density details for Methods Online the (see 0 and Online Methods). 0 U2AF1 POL 0 AKT EGFR ), which were well well were which ), Median =17bp a FLT3 1 E ). DNMT3A 100 TBC1D12 IDH2 1 2 0 MYD88, SF3B1 CDKN2 APC FBXW7 200 IDH1 Size (bp) MIR142 MALAT1 TMEM208 Y HIST1H3I CEP192 RPS2 NDUFB9 PCDHB8 TP53 PPP2R1A 3 AE1D1 13– FBXO24 KIAA0907 0 PIK3CA to iden A 7 1 Mutation frequency(% 300 8 ≤ . 5% but mutation frequency <2%. ( <2%. frequency mutation 5% but 3 0 BCL NRAS CTNNB1 400 2 ± 9 - 2

1,000 bp) were scanned for clusters of somatic mutations (orange; DBSCAN). DBSCAN). (orange; mutations of somatic clusters for scanned were bp) 1,000

KRA PTEN ≥ (for example, TP53.1). Note that some SMRs ( SMRs some that Note TP53.1). example, (for ( characterization further spanned 735 genomic for regions, which were types assigned unique SMR codes cancer 20 in (SMRs; regions mutated 4 Fig. olds controlling the false-discovery rate (FDR) to known cancer genes but also nominate potentially new drivers. level analysis quintile with the top density scores were not found previously in a 3b Fig. gene- (CGC; Census Gene Cancer the by determined as enrichments (up to 120×) for somatic SNV–drivenhigh mutation cancer density, genes increasing( density scores correlated with stronger Methods). (Online sequencing whole-genome and data given probabilities mutation sequencing whole-genome cancer in probabilities mutation intronic gene-specific derive to framework Bayesian a applied we exons, on pressure selection to due estimates correlate to shown rate mutation somatic been with have which timing, replication and sion gene expres in (GC content) local as composition well as differences more than one cancer type. cancer one than more 500 S Exon 4 0

Regions (×1,000) We applied Monte Carlo simulations to density score select thresh Although many known cancer-related genes did not display signals of 10 ) ( f 3 0 4 6 2 8 n * base WGS , 2 = 872) Upstream Splic Exo and and 5 0 0 KRAS –log , while controlling for differences in sensitivity in whole-exome c d n Intron e ) * 31 10 5 0 10 (densityscore) Supplementary Table 1 Supplementary , 3 93.2% 5 2 Splic 15 or in the CGC. Thus, high density scores are enriched for . Moreover, ~10% of genes associated with SMRs in the the in Moreover,SMRs . with associated genes of ~10% Upstrea 5 Other Downstream 3 5 Other ′ 3 Downstream d UT 20 ′ ′ UT UT ′ e a UT p BRA ) in the domain size ( size ) domain in the R DVANCE ONLINE PUBLICATION ONLINE DVANCE 25 R R R 0 m ( F n 30 based d WES = 714) a ) The number of SMRs with FDR FDR with of SMRs ) number The ) Pan-cancer distribution of mutation types for types of mutation distribution ) Pan-cancer e Fig. Fig. ) SMR size distribution. ( distribution. ) size SMR h g d SMRs log2 (SMR/input mutations) SMRs 4 100 500 . To avoid skewed mutation probability probability mutation skewed To avoid . 100 200 300 400 500 600 700 1 20 –4 –2 –6 5 1 0 2 0 c ε ) that were altered in in altered were ) that defined as the average distance of distance average as the defined 8 7 6 5 4 3 2 1 *

2 Stop lost UCEC * 0 Stop gain MELA significantly 872 We ). identified (green). Horizontal lines indicate indicate lines Horizontal (green). d LUAD s Nonsynonymous start ). Clusters (green) are divided divided are (green) ). Clusters * LUSC Mutations/samples (ratio) 5′ UTR Nonsynonymous coding * COLR

* BRCA Splice site donor

* MUMY Splice site acceptor ESOP

c Start lost HNSC * ) Per-cancer mutation mutation ) Per-cancer Synonymous coding GLBM ≤ n

f Start gain DLBC 5% ( = 120) appeared in in appeared 120) = ) Concordance ) Concordance n

Nature Ge Nature Splice site region KIRC = 663(76.03%) CLLX ≤ n Synonymous stop * P i. 1 Fig. Supplementary Supplementary 5% and mutation mutation 5% and

= 5(8.62%) OVAR Supplementary 3′ UTR ≥ < 0.01). < 0.01). * BLCA Exon (noncoding) of 2% patients

* LAML Upstream

* PRAD Intron RHAB * n d

Downstream CARC = * ). SMRs SMRs ). n MEDU 1 n Intergenic = 158),

0 3 6 9 12 15

etics

) al. et (Weinhold Regions - -

© 2015 Nature America, Inc. All rights reserved. are are not by driven samples that contribute large numbers of mutations 5 ( texts con mutation by unaccounted are not driven Methods), and Online 1–2,041 bp), are robust to distinct mutation background models ( to attributable not artifacts. mapping is SMRs in density mutation observed the that (in SMR breakpoints published recently 6,179 of analysis an and genome, the of regions (100-bp) mappable within contained ( criteria alternate stringent these satisfied applied be could tests these where types cer can the in SMRs high-confidence of 95.0% Fully types. cancer tiple (FDR threshold a false-discovery passed ‘robust’ that 205 SMRs identified procedure This Methods). (Online non-functional are clusters these assump that tion the under clusters mutation intronic from scores density advantage in cancer, we and leveraged single-nucleotide trinucleotide selective no with mutations of clusters in result could that processes unaccounted for To control sets. these in genes cancer SNV–driven 10 × 2.6 (63.3×; high correspondingly samples (Online Methods mutator and from contribution and scores density their of basis the on 40. lysine at acetylation K40, acetyl Ser363; and Thr359 at phosphorylation S363, pT359, respectively. range, interquartile × the 1.5 within points n ( underlined. is sequence 5 the in SMR cancer bladder the * Firefly/ performed. were replicates three experiment, each For 2). and (1 preps DNA plasmid independent with cells (HEK) HEK293T and (A375) melanoma in performed ( highlighted. are SMRs the overlap that sites binding ( scales in SMRs melanoma at frequencies mutation somatic and occurrences, motif and sites binding (TF) factor transcription Factorbook 100-way), phastCons (Cons.; conservation vertebrate I signals, DNase and (ChIP-seq) sequencing and ( (†). roles developmental have or control cycle cell with associated be to or (*) cancer with associated be to known are factors transcription 23 the of Eighteen shown. are types cancer all across 2 Figure Nature Ge Nature per region ( b e a P = 14) bladder tumors. The central band, box boundaries and whiskers correspond to the median, the interquartile range and the highest and lowest lowest and highest the and range interquartile the median, the to correspond whiskers and boundaries box band, central The tumors. bladder = 14) ′ UTR and splice-site mutations ( mutations splice-site and UTR RHAB MEL DLBC –log (q value) < 0.1, two-sided two-sided < 0.1, SMRs display a wide range of sizes ( sets low-confidence and medium- high-, into SMRs Weclassified Signal (firefly/renilla) 10 0.8 0.4 1.2 10 25 15 20 0 0 5 A Supplementary Fig. 6 Fig. Supplementary ± #1 **

WT INSM1 A375 1,000, 1,000, −10 YAE1D1 Noncoding SMRs recurrently alter promoters and 5 and promoters alter recurrently SMRs Noncoding Nr2e3 Factors implicatedincancer:18/23(78%) Erg*

PTEN MT Myb FLI1* #2 EHF * *

) and low (5.0×; (5.0×; low and ) WT n

Fig. 1 Nr5a2 FEV* MT EBF1 NFATC2* etics HEK #1 WT Zfp423 ELK4* * Rfx1

± Arnt::Ahr SPIB ) within 50 bp of a resolved breakpoint, suggesting suggesting breakpoint, resolved a of bp 50 within ) MT ≤ 75 and and 75 ORA_1R ELK1* renilla h

#1 FLI1 5%) in these secondary tests or tests were found in mul secondary 5%) in these WT GABPA

A375 GABPA ). ). This is in contrast to recently proposed regions of

Supplementary Fig. 5 Fig. Supplementary t PRDM1*

test. Error bars, s.d. ( s.d. bars, Error test. FEV

MT KIAA0907 Erg ELF1* #2 ADVANCE ONLINE PUBLICATION ONLINE ADVANCE WT ELK4 Ets1* HINFP

luciferase signals are shown and were normalized by the mean signal with wild-type promoter for each experiment. ** experiment. each for promoter wild-type with signal mean the by normalized were and shown are signals luciferase MT EHF* HEK g HOXA5 ± #1 FOXO3 FOXF2* ) Relative protein and post-translational modification signals of wild-type ( wild-type of signals modification post-translational and protein ) Relative WT 7 bp). Mutation frequencies within each SMR (red) and at each position (purple bars) are shown. Motifs of ETS family family ETS of Motifs shown. are bars) (purple position each at and (red) SMR each within frequencies Mutation 7 bp). MT Sox3 STAT2, STAT1* 0.5 1.0 1.5 2.0 POU2F2 E2F6 FOXO3* 0

P STAT2/1 Stat4 FOXF2 E2F6† = 5.0 × 10 × 5.0 = ), and are enriched in protein-coding, protein-coding, in enriched are and ),

′ Stat4 Supplementary TableSupplementary 2 ** #1

UTR of of UTR Sox3*

P WT Ets1 A375 ELF1 FOXP2 = 2.5 × 10 × 2.5 = MT TERT PRDM1 STAT1* #2 WT ** SPIB POU2F2 NFATC2 P MT ELK1 STAT3* HEK < 0.01; 0.01; < #1 ** CREB1 Lhx3† ≤ ≥

Fig. 1 WT b 0 12

TBC1D12

) Cancer-specific motif enrichment analysis. ( analysis. enrichment motif ) Cancer-specific MT score f z −4 ) Gene structure, ENCODE CTCF and DNase I signals, and vertebrate conservation (phastCons 100-way) at 100-way) (phastCons conservation vertebrate and I signals, DNase and CTCF ENCODE structure, ) Gene Bits –0. –1. –1. –2. ). Over 87% of SMRs were were SMRs of 87% Over ). c ) enrichments for somatic somatic for enrichments ) e 0 5 0 5 0 MELA TF site + motif Cons. Signal −46 Chr. 1(+) ; ; median = 17 bp, range = ELF1, ELK1 –75 bp –1,000 bp Fig. 1 Fig. 9.3% (C>T100%) (MA0076.2) ELK4 moti 9 8 7 6 5 4 3 2 1 GABPA < ), medium (6.2×; (6.2×; medium ), are shown on multiple scales. The position of the start codon is highlighted in green, and the Kozak Kozak the and green, in highlighted is codon start the of position The scales. multiple on shown are ELK4 T G C T G G C C T T C T C C G A C A < ELF e 3 KIAA0907 ) Luciferase reporter signal from wild-type (WT) and mutant (MT) promoters in three experiments experiments three in promoters (MT) mutant and (WT) wild-type from signal reporter ) Luciferase g 3 f ). Notably, SMRs Notably, SMRs ). ELK1 yielded a single single a yielded ELK4 ELF1

SPI1 ). ). We observed NR2C2 Motif position ′ UTRs. ( UTRs. < 155,904,250 4 bp 4 bp 10.2%10.2 % 155,904,253 GABPA 0.9% (C>T100%) Fig. 1 a ) Transcription factors with enriched ( enriched with factors ) Transcription P DNase +1,000 bp = +75 bp 10 - - - - f

I 11 bp ing the breast cancer–associated antigen and putative transcription transcription putative and antigen cancer–associated breast the ing approaches. gene-level ments Note Supplementary ( data these within recurrence mutation dense of regions harbor not do genes cancer known most that ever, ered by analysis gene-level 26 known genes cancer in implicates 31 gene not × uncov associations type cancer SMRs of presence The alterations. protein on acting Fig. 7 ing SMRs are by mutations ( driven nonsynonymous PIM1 analysis gene-level a in 17 include genes undetected also CGC genes previously associated example, (for suppressors tumor and example, (for genes a onco of total canonical affecting test), 91 including known drivers, (Lawrence alterations cancer-associated somatic with genes known are in and enriched genes 8.35-fold 610 on impacts diverse have to predicted are SMRs SMRs genes. cancer-associated and elements functional to relevance their characterize to we sought and alterations, somatic by recurrent Thus, we a set of SMRs have diverse targeted sized variably identified ( samples by distinct sively alteration recurrent –0. –1. –1. –2. 0 5 0 5 0 d We discovered SMRs in multiple new cancer driver genes, includ genes, driver cancer new multiple in WeSMRs discovered KIAA0907 Chr. 7(+) –1,000 bp –75 bp 9 8 7 6 5 4 3 2 1 (MA0028.1) PAX5 ELF1 moti and the cancer-associated noncoding gene MA ), ), demonstrating that SMRs capture primarily positive selection

T T G T T G A G G A A G G C C A G G C enrich E2F1 Z 0.9% (G>A) c NRF , GABPA f d ELF1 1 ) Gene structure, ENCODE immunoprecipitation immunoprecipitation chromatin ENCODE structure, ) Gene ( EGR1 Motif position c

for

) and ) and 39,605,966 9.3% 5 bp 5 bp

known < SP1, SP2,SP4

39,605,970 < 2 YAE1D1 n 5 0 2.5% (G>A) 5.9% (G>A) BRAF , such as established oncogenes like like oncogenes established as such , where as few as five regions were driven exclu were driven as as few regions where five = 78) and mutant (TBC1D12.1 SMR altered; altered; SMR (TBC1D12.1 mutant and = 78) ), suggesting that SMR identification comple identification SMR that suggesting ), et et al. UAK3 EL ELF1 GABPA F e YAE1D1

+1,000 bp I 5

5 cancer P +75 bp 5 ( q , , ( or CGC, = 6.0 × 10 = 6.0 < 0.01) motifs in small SMRs ( SMRs small in motifs < 0.01) Supplementary TableSupplementary 3 10 d KRAS

bp ) promoter regions are shown on multiple multiple on shown are regions ) promoter

Bits MELA motif + site TF Cons. Signal

genes f −0.2 g , , 0.2 0.6 Chr. 10(+) –1,000 bp NRAS –75 bp P PTEN = 8.1 × 10 −45 Dnase G G T G G T A G A C C C C C A C C q Mutant CTCF (pT359, S363)

= 0.0005 Supplementary Fig. 8 Fig. Supplementary and , Wilcoxon rank-sum test). test). rank-sum , Wilcoxon 96,162,368 p90RS 8.1% (C>T , , I PIK3CA , , 15.2%

TP53 Wild type implicate K ) MALAT1 −49 3 bp 3 bp ysis i s ly a n a M and and −2 −1 < VGPEDADACSGRNPKLLPVPAPD , hypergeometric , hypergeometric 0 2 1 ). ). We note, how Supplementary Supplementary 96,162,370 < 10.1% (G>A) and and Mutant q APC (acetyl, K40)

= 4.3×10 new α < . . Most cod ≤ TBC1D12 +1,000 bp -tubulin CTNNB1 < BCL2 25 bp) bp) 25 P +75 bp < 0.05, < 0.05, ). SMR- ). Wild type –5

ones bp

and and and and

Cons. BLCA Signal

P RPPA signal (normalized) signal RPPA  ------)

© 2015 Nature America, Inc. All rights reserved. specific transcription factor motif enrichments within SMRs SMRs within ( enrichments motif rhabdosarcoma and melanoma lymphoma, cell B large diffuse factor from transcription specific ( and winged-helix repressor (3.2×; (7.4×; family oncogene ETS for sequences binding in Across all cancer types, small ( mutated in a pan-cancer analysis of whole-genome sequencing data significantly deemed regions with coincided SMRs promoter Three 5 chro open matin within lay SMRs 130 total, In data. sequencing exome in whole- variation noncoding pathological to discover potential the highlighting sequences, protein affect to predicted not are SMRs of (31.2%; proportion significant A SMRs section. next the in discussed as alterations, noncoding by driven (~81%) primarily are drivers cancer new tive puta these differences, methodological of basis the on expected As ( genes driver cancer new with associated in patients with melanoma harbored somatic protein-altering mutations and whole-exome ten data of from sets sequencing 17 whole-genome of six melanomas for cutaneous 17 data sequencing whole-genome in these validated in were Mutations SMRs SMRs. three of more or one within mutations factor BRAF.1 and BRAF.2 in are type shown per cancer mutations missense ( SMR interface in the TRIM33 ( B large lymphoma) cell and (WT) Lys111Glu ( protein mutant. as in shown the wild-type interface, ( (bottom). angles dihedral side-chain SMR in the and PIK3CA.3 in the PIK3CA.2 residues mutated display Insets surfaces. solvent-accessible (PDB), Bank Data (Protein ( in SMRs PIK3CA. represent bars Gray samples. cancer blue) domains of PIK3CA frequencies of alteration per-residue and comparison types cancer across SMRs per PFAM ( is (left). frequency shown per of per cancer,domain genes The domain number protein per residue. ( alterations. recurrent by targeted interfaces molecular and cancers among regions altered differentially identifies complexes and proteins onto SMRs of 3 Figure s i s ly a n a  a Fig. 2 Fig. i. 2 Fig. a Pep_M12B_propep ′

DNA_pol_B_exo1 ) Nonsynonymous mutation mutation ) Nonsynonymous UTR (6.0×; (6.0×; UTR ANKRD30A Pfam_B_2247 Pfam_B_1681 Pfam_B_1095 Pfam_B_2234 Pfam_B_3697 GF_recep_I Pkinase_Tyr Cadherin_ PI3K_p85B STAT_bind 1 Furin_like Cadherin 2

Genes implicate b Histon 8 Iso_dh a Serpin MAT PI3K OATP ANKRD30A DSP

ITAM and were enriched in promoter (4.9×; (4.9×; promoter in enriched were and Actin and and MH2 BT BH4 SE VHL and and Po P5 Ra Structural mapping mapping Structural G H C V T 9 u 0 2 e a 3 c s 7 lo Supplementary Table 7 Supplementary g

upeetr Tbe 6 Table Supplementary DLBC 10

q PRAD . Overall, of the 185 high-confidence SMRs, 16 were were 16 SMRs, high-confidence 185 the of Overall, . (mutationsper100samples) CLLX = 4.4 × 10 × 4.4 = 0

NEUB diverse CARC MEDU

3 KIRC

4 BRCA , in which ~21% of melanomas harbored harbored melanomas of ~21% which in ,

e LAML

) and SMRs (multiple myeloma) ( myeloma) ) and (multiple SMRs OVAR MUMY 2RD GLBM 3

noncoding

, ESOP 2

BLCA 0 −10 i

O HNSC ≥ . Within the entire gene body, 27 of 118 118 of body, 27 gene entire the Within . ). Structural alignments and molecular visualizations were prepared with PyMOL (Schrödinger). The relative proportions of proportions The relative (Schrödinger). PyMOL with prepared were visualizations and molecular alignments ). Structural ≤ 2.5 MELA , ,

25-bp) noncoding SMRs were enriched COLR ) features ( features ) 2IU

LUSC LUAD

q UCEC d P G e = 2.0 × 10 ) Mutations within the PIK3CA.2, PIK3CA.3 SMR PIK3CA.3 the PIK3CA.2, within ) Mutations < 2.2 × 10 × 2.2 < and b ).

Alteration frequency (%) Alteration frequency (%) regulatory 0.1 0.1 16.0 Residu 2.0 . e lo eetd cancer- detected also We ). (594–601 V600 0 5 5 3HI PIM1.1 ABD Supplementary Table 5 Table Supplementary Supplementary Table 4 Table Supplementary e PIK3CA.1 Z −4 ). Residues within endometrial cancer SMRs in PIK3CA (orange) and PIK3R1 (red) are rendered as are rendered (red) and PIK3R1 (orange) in SMRs PIK3CA cancer endometrial within ). Residues 20 ) q RBD =1.16×10 −16 ) ) transcription factors 4 0 Missense mutations f PIK3CA.2 q 20 40 60 0 = 4.0 × 10 × 4.0 = , proportions test) test) proportions ,

MELA C2 features 00 PIK3CA.3 COLR f q Helical –16 ), a DNA (green) interface SMR ( ), interface a DNA (green) 60 MUMY V60 = 2.6 × 10 × 2.6 = 8 0 c PIK3CA.4 0 ) Co-crystal structure of the PIK3CA (p110 of the PIK3CA structure ) Co-crystal

P GLBM BRAF.2 (466–469) BRAF.1 (594–601) =2.3×10 00 LUAD Kinase PIK3CA.5 −9 P 1,000 ) and and ) =7.2×10

f PIK3CA.6 . The PDB codes . for The PDB codes −6 2 –6 UCEC LUSC LUAD HNSC GLBM ESO COLR BRCA BLCA

0 ). ). ). ).

e UCEC - - BRCA ) .

– P i ) Molecular structures are shown of spatially clustered alterations (diffuse (diffuse alterations clustered of are shown spatially structures ) Molecular PIK3CA.2 andPIK3CA.3 ≤ c –10 5 of of target sites 5 in SMRs 32 melanocytes untransformed to comparison in levels protein YAE1D1 increased showed samples melanoma of cancer colorectal crypt-like lower in observed ( YAE1D1 mutant the with expression gene this reporter in changes at detectable no However, observed we . transcription enhance may alterations SMR promoter that cancers colorectal and lung in expression increased preferences. SNORA42 sequence harbor gene this in primary sequences intronic ETS on KIAA0907 effects (ENCODE)), Elements varying DNA of with (Encyclopedia sites binding within tor sequences recognition muta core SMRs, altered both tions In analysis alterations. noncoding pan-cancer in a specificity cancer in significance reach not and ( for melanomas data sequencing whole-genome ( 9.3%, respectively, of melanomas analyzed by whole-exome sequencing the Recurrent mutations were positioned near the start codon (Kozak (Kozak codon start the near positioned were mutations Recurrent 28 g Mutations (% s 0°s Fig. 2 Fig.

Fig. 2c Fig. ° 320

240 P ° s =0.0145 In addition to SMRs that influence promoter regions, we observed we promoter observed regions, to In SMRs that influence addition We 4-bp discovered and 5-bp SMRs within open chromatin sites of s TBC1D12 ° KIAA0907 200 0 ° n s

RUNX1.

s °

= 2 for for 2 = 0 RUNX1.

e 16

40

), RNA-level overexpression of of overexpression RNA-level ), ° s s

, °

1 ) 80 120 promoter mutations reduced reporter expression gene reporter reduced mutations promoter ° d s α 00 ). Somatic mutations in these SMRs were confirmed in in confirmed were SMRs these in mutations Somatic ). ), an H/ACA class small nucleolar RNA (snoRNA) with with (snoRNA) RNA nucleolar small class H/ACA an ), 3 2 -helix interfere with Arg79-binding contacts at the PIK3R1 contacts Arg79-binding with interfere -helix encodes a putative RNA-binding protein. However, However, protein. RNA-binding putative a encodes 9 1 e g . . Most strikingly, we discovered a 3-bp SMR in the 5 that was mutated in ~15% of bladder cancers ( cancers bladder of ~15% in mutated was that – ), reciprocal protein interface SMRs ( SMRs interface protein ), reciprocal ′ i and 3 and and are YAE1D1 a DVANCE ONLINE PUBLICATION ONLINE DVANCE 3CX 4 YAE1D1 6 ′ UTRs, including putative microRNA (miRNA) (miRNA) microRNA putative including UTRs, in endometrial (UCEC; orange) and breast (BRCA; (BRCA; and breast orange) (UCEC; in endometrial W h (subunit A) SMAD2.1 α , , of of SMAD α 1UW helix (yellow; top) and their corresponding corresponding and top) their (yellow; helix ; blue) and PIK3R1 (p85 ; and PIK3R1 blue) PIK3CA.4 promoters that were altered in 10.2% and n 2 KIAA0907 = 17 cases) 17 = b H ) Alteration frequency matrix of PIK3CA matrix frequency ) Alteration , , PIK3CA.5 1H9 SMAD4.1 PIK3R1.3 D PIK3R1.1 , , PIK3CA.6 1U7 promoter ( promoter YAE1D1 (subunit B) SMAD SNORA80E SMAD4.2 3 , 2 V 0 3 and d 2 8 . Yet, these regions did did regions these Yet, .

. Frame counts Binding mode 1 i R108 E1222 h 3 WT K111E

7 ) and a ) H3.1 and a histone has previously been been previously has α –10 PIK3CA-PIK3R1 bindin 3U5 and a small cohort cohort small a and n E8 ; gray) interaction ; interaction gray) WT Nature Ge Nature enthalpy (kcal/mol) = 1 for R79 Fig. 2 Fig. 1 n vivo in E1215 K111 –9 2 (also known as as known (also 35 N 29% 62% 0 , highlighting highlighting , TRIM33 , , respectively. , 62% 3 –8 6 HIST1H3I.1 e , suggesting suggesting , 71% E1222 ). Whereas Whereas ). KIAA0907 R108 –7 T fac ETS K111E E8 71% 38% K111 1 Fig. 2 Fig. R79 n E1215 –5 E

′ g etics UTR

2 10 20 30 40 69 13 20 27 mode Binding f ). ). 1 2 3 4 8 7 6 - -

© 2015 Nature America, Inc. All rights reserved. (ABD) and the linker region between the ABD and Ras-binding Ras-binding and ABD the Up of (RBD) PIK3CA. domain to endometrial corpus 14% of uterine between region linker the and (ABD) an affecting PIK3CA.3) and (PIK3CA.2 SMRs type–specific cancer observed we to the helical (PIK3CA.5) and kinase (PIK3CA.6) domains. In contrast, ( types cancers (p110 PIK3CA subunit 3-kinase subunit of the catalytic within the phosphoinositide 8 Table ( mutated types across cancer that were differentially ( lymphoma cell B large diffuse in SET and carcinoma cell clear kidney in VHL by exemplified as alteration, of in burdens tein domains can specificity show type cancer remarkable pro cancers, multiple in alteration somatic of burdens high showed subgenic and cancer type resolution. Although many protein domains with mutation frequencies of mitted a analysis differential systematic of per SMRs types across multiple cancer The regions. identification protein-coding within lay SMRs exome-derived most expected, As SMRs cancer. in alterations noncoding and regions noncoding functional in potentially pinpointing method our mutated SMR identification recurrently identifying for sequencing data whole-exome of usefulness the establish results These assays (RPPA) array protein phase ( proliferation cycle cell increased of ( phosphorylation (p90RSK) RPS6KA1 altered displayed SMR this in mutations with cancers breast 172 of three and adenomas lung of two 40 cancers, of bladder two 20 including types, cancer of seven Mutations in this SMR were validated in the whole-genome sequences region positions −1 and −3), suggesting a role in translational control. a FGFR2 FBXW7 FGFR2 EGFR NFE2L2 PIK3R1 HRAS PPP2R1A HIST1H2BK HIST1H3I RUNX1 PIK3CA SMAD4 DNMT3A SMAD4 TP53 HIST1H2BK SMAD2 SPOP VHL VHL Protein (i) Table 1 Nature Ge Nature SMRs, intron-separated these in alterations harbored carcinomas Whether theSMR-harboringprotein(i)correspondstoaknownornewcancerdrivergene. P = 4.3 × 10 Among genes ( genes Among

Fig. Fig. 3 permit

42 ). A striking example of this differential targeting occurred occurred targeting differential this of example striking A ). Recurrently Recurrently altered protein interfaces uncovered by , 4 3 n . We detected six SMRs in in SMRs six detected We . −5 b etics α ), with multiple cancer types displaying SMRs mapping ), displaying with types multiple cancer

, -helical region between the adaptor-binding domain domain adaptor-binding the between region -helical high-resolution α t test, Benjamini-Hochberg), as determined by reverse- ), a key oncoprotein implicated in a range of human a in range implicated oncoprotein a ), key FGF2 SKP1 FGF8 EGF KEAP1 PIK3CA RASA1 PPP2R5C HIST1H4 TRIM33 DNA PIK3R1 SMAD3 DNMT3L SMAD2 TP53BP1 DNA SMAD4 H2AFY TCEB2 TCEB1 Partner n P

= 94) with multiple SMRs, we detected 48 SMRs SMRs 48 detected we SMRs, multiple with 94) = (j) = 0.0005, 0.0005, = ADVANCE ONLINE PUBLICATION ONLINE ADVANCE

b 1EV 2OV 2FD 3NJ 2FL 3HI 1WQ 2NP 2CV 3U5 1H9 3HH 1U7 2QR 1U7 1KZ 2CV 1U7 3HQ 1LQ 3ZU PDB t test, Benjamini-Hochberg), a signal signal a Benjamini-Hochberg), test,

analysis Z U P 2 5 5 R B Y B N V P N D F V V H M 1 4 1 4 ( Chain 0 Fig. 2 Fig. , and altered altered and , (i) H B R B B R D D D B B B D P C A E A A C I PIK3CA

of

Fig. 3 Fig.

g coding Chain and Online Methods). Methods). Online and (j) M M H H G B D A C A A E A C C C A X F F J 3 across eight cancer cancer eight across a , 2 ). 0

. Bladder tumors tumors Bladder .

α

Supplementary Supplementary alterations -tubulin levels levels -tubulin Chr. 10:123,279,674–12,327,9677 Chr. 4:153,249,384–153,249,385 Chr. 10:123,279,674–123,279,677 Chr. 7:55,233,035–55,233,043 Chr. 2:178,098,799–178,098,815 Chr. 5:67,589,138–67,589,149 Chr. 11:534,283–534,291 Chr. 19:52,716,323–52,716,329 Chr. 6:27,114,446–27,114,519 Chr. 6:27,839,651–27,840,062 Chr. 21:36,231,782–36,231,792 Chr. 3:178,936,070–178,936,099 Chr. 18:48,604,665–48,604,797 Chr. 2:25,463,271–25,463,308 Chr. 18:48,604,665–48,604,797 Chr. 17:7,578,369–7,578,556 Chr. 6:27,114,446–27,114,519 Chr. 18:45,374,881–45,374,945 Chr. 17:47,696,421–47,696,467 Chr. 3:10,191,469–10,191,513 Chr. 3:10,191,469–10,191,513 S MRs b Multiple componentpartnerproteinsidentified. Region - -

with multiple SMRs displayed significant SMR three-dimensional three-dimensional SMR significant displayed SMRs multiple with ( adenomas P-loop alterations were more common in multiple myeloma and lung BRAF whereas cancers, colorectal and melanoma in frequent more mechanisms regions( whereSMRs alterations P-loop have BRAF been and shown to V600 function throughBRAF distinct between clustering and Online Methods). This approach also identified three-dimensional ( proto-oncoprotein kinase serine/threonine SMR-associated ( alterations missense of clustering three-dimensional with proteins 46 detected structural We genes. leveraged cancer known and SMR-associated we 428 from information structures, protein three-dimensional to PIK3CA. in alteration oncogenic of mechanism unrecognized ously Fig. 9 Supplementary ( PIK3CA interactions wild-type binding to in comparison in kcal/mol 1.8 of loss a in result may which Arg79, at patterns salt-bridge intermolecular alter can substitutions (p.Gly118Asp) (p.Lys111Glu) and that indicate PIK3CA.3 PIK3CA.2 binding PIK3CA-PIK3R1 of simulations dynamics molecular scale ( interface molecular a to changes suggesting test), directionally orientated to one side of the stood under poorly are region linker ABD-RBD the described, affecting mutations previously been have PIK3CA of domains (PIK3CA.6) the ABD (PIK3CA.1), C2 (PIK3CA.4), helical (PIK3CA.5) and kinase region. this to localized be part, in could, cancers breast and ferences sequences ( differences these ( cancers frequencies breast and alteration endometrial in PIK3CA.2 in differences ( test) significant proportions observed we example, For cancers. other in altered recurrently highly not were regions these although To systematically characterize the location of alterations with respect of mutations mapping recurrent effects the to oncogenic Although 45– 4 4 4 8 in total 3 . Interestingly, missense alterations within this . region this were within Interestingly, alterations missense , 2 P 0 Supplementary Table 9 Table Supplementary 4 < 0.01, proportions test). In total, seven of 16 proteins proteins 16 of seven total, In test). proportions 0.01, < . These findings indicate that previously described dif described previously that indicate findings These . 9 . . Moreover, we found that BRAF V600 were alterations PIK3CA distance (Å) P Average = 0.02, proportions test) in whole-genome whole-genome in test) proportions 0.02, = 11.685 10.288 13.680 11.480 11.883 10.112 11.878 13.253 ). ). Taken a suggest previ results together, these 9.352 8.763 6.157 6.713 5.302 7.313 8.957 9.028 9.730 9.231 7.962 9.867 7.259 mutation frequency between endometrial endometrial between mutation frequency

Distance 0.406 0.346 0.413 0.386 0.566 0.567 0.350 0.247 0.664 0.610 0.351 0.335 0.700 0.380 0.694 0.556 0.520 0.460 0.462 0.367 0.395 ratio Fig. 3 Fig. ), as exemplified by PIM1, an an PIM1, by exemplified as ), Fig. 3 Fig. α

helix ( b d ) and further validated validated further and ) 0.037 0.036 0.036 0.019 0.009 0.008 0.007 0.007 0.002 0.001 0.001 2.56 ×10 1.94 ×10 5.13 ×10 5.13 ×10 5.13 ×10 3.27 ×10 5.61 ×10 3.72 ×10 7.62 ×10 7.62 ×10 , Online Methods and and Methods Online , q value P ysis i s ly a n a = 0.0145, Rayleigh q 12 10 × 1.2 = Fig. 3 Fig. −6 −6 −7 −7 −7 −7 −8 −8 −10 −10 c ). Large- ). Fig. 3 Fig. Status Known Known Known Known Known Known Known Known New New Known Known Known Known Known Known New Known Known Known Known Fig. 3 Fig. −16 a f ), ),  e - - - ,

© 2015 Nature America, Inc. All rights reserved. (i) SMR alterations associate with distinct molecular signatures or sur whether ask to data RPPA clinical and (RNA-seq), sequencing RNA We leveraged signatures. molecular with association their by ations We sought to determine the impact potential functional of SMR alter Molecular epigenetic regulation and histone alterations in tumorigenesis altered between associations recent extend and underscore findings DNA-protein interface of histone H2B ( histone( H3.1 In addition, SMRs pinpoint ( recurrent interfaces alterationsmolecular at alterations at driver the interface between validated detecting in SMRs of robustness the highlight results these ( cancer endometrial in interface PIK3R1 described recently ( cancer colorectal in heterotrimer SMAD4 detected reciprocal SMRs at all interfaces electrostatic of the SMAD2- cleft of SPOP substrate-binding the as such associations, cancer established with molecular interfaces of protein-protein and DNA-protein interactions molecular interfaces ( alter likely that SMRs 17 identified DNAand or proteins interacting and residues SMR between distances intermolecular examined We mutation driver cancer of mechanism understudied yet nized recog a DNA-protein and protein-protein of interactions, interfaces alterations. pathogenic for coherence spatial frequent ( clustering differences, intragenic significant with markers the and markers 36 for signal in SMRs six in mutations by (PDB, ( ( (bottom). enzymes reductase aldo-keto of expression increased highly exhibit mutations NFE2L2.2 with Samples (middle). comparisons three for summarized is similarity of ratios odds of distribution The ratio. odds test exact Fisher’s by by sorted are genes expressed Differentially (top). shown is (HNSC) carcinoma neck and head and (BLCA) cancer pair. ( SMR each in mutations mutated the with samples for signals represent lines Red with associated SMRs 4 Figure s i s ly a n a  molecular similar with correlate SMR (ii) alterations outcomes, vival GO:001661 b a

0

We next sought to identify SMRs that might affect the molecular molecular the affect might that SMRs identify to sought Wenext SMRs (n = 30) log UCEC

BRCA 10 3WN

LAML (genes)

LUAD SMRs are associated with distinct molecular signatures. ( signatures. molecular distinct with associated are SMRs GLBM

7 HNSC

). A sector of recurrent alterations in KEAP1 (teal) did not pass our 2% frequency cutoff. ( cutoff. frequency 2% our pass not did (teal) KEAP1 in alterations recurrent of A sector ). 6

signatures LUSC upeetr Tbe 10 Table Supplementary 5

) in specific cancer types ( types cancer specific ) in BLCA 3.8 3 Fig. 3 Fig. and DNA-binding on interfaces RUNX1 ( d PPP2R1A.1 DNMT3A.1 CTNNB1.1

5 Frequency NFE2L2.2 NFE2L2.1 PIK3CA.6 PIK3CA.5 MIR142.1 RUNX1.1 RUNX1.2 10 12 4 HRAS.1 KRAS.3 CHD4.1 PTEN.2 2 4 8 0 6 AKT1.1 TP53.2 TP53.3 TP53.4 TP53.5 FLT3.1 i IDH1.1 IDH2.1 , , and SMRs PIK3CA-atreciprocal regulatory the ) and TRIM33, an E3 ubiquitinandan E3 at and ) ligase, TRIM33, the ≥ −3 Table RAB25 (RPPAsignal) 10 differentially expressed genes (FDR <5%). ( <5%). (FDR genes expressed differentially 10 P

=2.5×10 −2 highlight TP53.2 −1

PTEN.2 SNX19.1 PIK3CA

TP53.3 ( 1 n e 1 0 q Different cancer type

CTNNB1.1 = 2) and Online Methods). These included 15 –27 ) The overlap between differentially expressed genes associated with alteration of the NFE2L2.2 SMR in bladder bladder in SMR NFE2L2.2 the of alteration with associated genes expressed differentially between overlap ) The value (Kruskal-Wallis test) of differential expression between SMRs from from SMRs between expression differential of test) (Kruskal-Wallis value AKT1.1 DNMT3A.1 CHD4.1 Same cancer type +

IDH1.1 2 c

KRAS.3 , ,

impact HRAS.1 Frequency AKT1 10 20 FLT3.1 15 PPP2R1A.1 0 5 –10 RAB25

S TP53.4 Supplementary Fig. 11 P

upplementary Table 13 upplementary PIK3CA.6 = 0.02 ), which is consistent with with consistent is which ), −5

and and PIK3CA.5

RUNX1.1 (RNA-seq signal)

IDH2.1 SNX19.1 of 5 0 Fig. 3 Fig. ( Supplementary Fig. 10 Fig. Supplementary RUNX1.2 n

q MIR142.1 = 3) TP53

< 0.05). i. 3 Fig.

SMR TP53.5 NFE2L2.2 NFE2L2.1 + 10 ≥ 1.5 0 b

. Normalized RPPA-based expression data were obtained from the TCPA the from obtained were data expression RPPA-based . Normalized

10 (odds ratio) (odds ). Taken together, together, Taken ). –log

alterations h ), as have been been have as ), SNX19 e Gene SMRs Bins Fraction of altered

× 100 20 60 20 40

Fig. Fig. 3 genes (HNSC) 0 0.2 0.4 0.6 0.8 1.0 0 –6 SMR. ( SMR. ). ( ). 5 ). These 0 a 5 g 1.0 ) Matched RNA-seq data for nine cancers showed that mutations in 30 distinct distinct 30 in mutations that showed cancers nine for data RNA-seq ) Matched . Overlap, log g Fraction ofaltered 50– ). ). We 0.2 ) Structure of the NFE2L2.2 SMR (orange) in the KEAP1-binding domain domain KEAP1-binding the in (orange) SMR NFE2L2.2 the of ) Structure genes (BLCA) b log 5 d , NFE2L2.2 2 ). ). c 0.4 ) Similarity between differentially expressed gene sets associated with with associated sets gene expressed differentially between ) Similarity - - - 1.5 . 2 ) Normalized RPPA ( ) Normalized

(foldchange) 0.6 10 can elicit similar molecular profiles in distinct cancers. For instance, For instance, cancers. distinct in profiles molecular similar elicit can analysis indicated that mutations in the same SMR in different cancers such as a between relationship relationships, many of which were supported by RPPA measurements genes ( 17 from SMRs 23 for relationships, functional potential suggesting sites chromatin open NDUFA13 both In determined. be to remains expression RAB25 mechanism by which synonymous mutations in but the trafficking, in are intracellular and SNX19 implicated RAB25 ( in differences expression progression RNA with consistent are cancer increases breast and ovarian promotes that t ( RAB25 of levels expression protein the in increases SNX19 in SMR cancer bladder a in mutations point synonymous example, ( glioblastoma in alterations pathway (EGF) factor growth epidermal and EIF4 mTORas well as recurrent and cancer endometrial in alterations pathway GSK3 recurrent light high which signatures, molecular and mutations somatic recurrent between connections unappreciated previously identified analyses ( survival patient and pathways signaling RNA in expression, changes signatures. molecular different or similar with associate profiles in and cancers, distinct (iii) SMR alterations in the same gene BLCA:LUS test; test; Fig. 4 P

2.0 (odds ratio) = 0.02, Wilcoxon rank-sum test; test; rank-sum Wilcoxon 0.02, = We identified concordant changes in gene expression for SMR pairs, diverse with associated were SMRs in mutations that found We 0.8 AKR1C3, AKR1C AKR1C1, AKR1C2, HNSC:LUS C Fig. 4 Fig. a 2.5 1.0 Fig. Fig. 4 (encoding sorting nexin 19) were associated with significant significant with associated were 19) nexin sorting (encoding , , Online Methods and BLCA:HNS f C 0 0.5 1.0 1.5 2.0 2.5 3.0 , SMRs with clusters of synonymous mutation overlapped overlapped mutation of synonymous clusters with , SMRs ) Relative enrichment for oxidoreductase activity activity oxidoreductase for enrichment ) Relative +6

3.0 4

b

d

and and 10 (odds ratio) (odds ). ). These included multiple well-established mechanistic log

b C a Overlap, ) and RNA-seq ( RNA-seq ) and DVANCE ONLINE PUBLICATION ONLINE DVANCE Supplementary Table 12 Table Supplementary f g NFE2L2. KEAP1 2

8 UCEC , suggesting potential regulatory effects. regulatory potential suggesting , h 1 ) Patients with breast cancer were grouped grouped were cancer breast with ) Patients

TP53 UCEC NFE2L2. HNSC NFE2L2.2 Supplementary Supplementary Tables 11

BLCA 2 or or PIK3CA c

) signals for RAB25 are plotted. plotted. are RAB25 for ) signals LUSC P PIK3CA

value, and similarity is quantified quantified is similarity and value, 0 1 2 3 4 5

2 (enrichment) g lo Supplementary Table 15 Supplementary Caspase.7_cleavedD198 h and i. 4 Fig. MAPK_pT202_Y204 MEK1_pS217_S221 are plotted (red highlights highlights (red plotted are JNK_pT183_Y185 Tuberin_pT1462 X4E.BP1_pT70 PDK1_pS241 X14.3.3_zet AMPK_alpha Chk1_pS345 AKT1 4 Y Chk2_pT68 p27_pT198 ), a RAS family GTPase GTPase family RAS a ), AP_pS127 Caveolin. 1 Cyclin_E Cyclin_E Cyclin_B ER.alpha X4E.BP1 . The median RPPA median . The INPP4 c RBM1 GATA Notch FoxM1 ASN CDK c.Myc ). Intriguingly, both both Intriguingly, ). JNK2 Chk Bcl.2 SNX19 DJ.1

p53 Akt PR AR S6

B S a 1 1 2 1 2 1 1 5 3

. Furthermore, this . Furthermore,

Nature Ge Nature

TP53.4 TP53.5

P TP53.2

= 2.5 × 10 × 2.5 = correlate with – PIK3CA.5

14

57 SNX19 PIK3CA.6 , AKT1.1 5 )

5

8

A 6 PIK3C . These These . n RAB25 . . These

etics TP53 ). For ). and and

+2.1 –2.1 ≤ ≥ 0.5 1.5 −27

4

RPPA signal RPPA 10

value) q ( 1 –log - , ,

© 2015 Nature America, Inc. All rights reserved. focused on identifying predefined functional units with recurrent recurrent with alterations. This approach not only assumes accurate annotations but units functional predefined have identifying variation on focused disease-associated of studies exceptions, few With DISCUSSION mutations. driver of landscape the assessing in sizes sample cancer increasing of value the highlight findings These of structure the protein-altering mutations in breast of cancer remains ( unseen proportion large a >850, numbers sample with (Spearman’s bers of Gini were coefficients dispersion well correlated with sample num ( type cancer in each residue per substitutions of acid amino coefficient Gini the measuring by here analyzed somatic mutations the coding of distribution the in structure assess to metric native We alter an regions. sought cancer-associated identify to mutations in of the SMR structure distribution leverages somatic analysis driver The suppressors tumor gene and and oncogenes are in with pleiotropy consistent established of alterations with SMRs in the same associated signatures molecular ( SMRs MAPK and MEK1 phosphorylation among samples with altered and levels ASNS in differences Yet,SMR-specific levels. we observed within signatures. For example, in breast cancer, alterations in SMRs distinct molecular of regions to a different associated gene given respect with 4 Fig. in mutations with ing partner, recapitulated the expression changes in observed patients Hochberg; NAD with of donors group CH-OH on the acting for oxidoreductases enriched in mutations with patients in changes types cancer to altered androgen and metabolism have been implicated in multiple reductases aldo-keto in alterations with samples cancer endometrial among expression gene in increases highest the with genes four The cancers; neck and head and carcinoma, cell squamous lung endometrial, (bladder, types cancer distinct four in changes tomic NFE2L2 factor transcription oncogenic the in alterations SMR that found we cancer. per residues mutated most 100 the on exclusively computed were in types cancer between comparisons For blue. dark in shown is fit exponential of line The size. sample (bootstrapped) of a function as cancer breast in frequency mutation nonsynonymous of ( shown. are right) (top numbers sample tumor with coefficients these of correlation the and (bottom) ( proteins. ~19,000 across residue, per contained mutations nonsynonymous of fraction the as calculated were dispersion of coefficients Gini uncharacterized. largely remains mutations 5 Figure Nature Ge Nature elements of functional spectrum uncharacterized largely the ignores a ) Lorenz curves (top left), Gini coefficients coefficients Gini left), (top curves ) Lorenz The identified SMRs also permitted interrogation of mutations in in mutations of interrogation permitted also SMRs identified The

structure g and and TP53 q

(ref. (ref. Structure in the distribution of cancer cancer of distribution the in Structure < 0.01; 0.01; < + Fig. Fig. 4 n or NADP as acceptors (4.9–39.0×; (4.9–39.0×; or NADP as acceptors Supplementary Fig. 12 Fig. Supplementary 60– were associated with highly similar changes in protein protein in changes similar highly with associated were etics 5 a

9 6 of , the Gini coefficients coefficients Gini , the 2 ) were associated with large, concordant transcrip concordant large, with associated were ) f . Across all four cancer types, genes with expression expression with genes . types, Across four all cancer Fig. 4 Fig.

). ). Mutations in cancer ρ

63 NFE2L2 = 0.74). Subsampling demonstrated that, even even that, demonstrated Subsampling 0.74). = ADVANCE ONLINE PUBLICATION ONLINE ADVANCE , b 6 AKR1C1 4 ) Gini coefficients coefficients ) Gini . h ). These results establish differences in the the in differences establish results These ).

mutations SMRs ( SMRs – AKR1C4 KEAP1 ).

P

< 0.01, Benjamini-Hochberg; Benjamini-Hochberg; 0.01, <

remains

, encoding an , bind NFE2L2 encoding

NFE2L2 ( Fig. 4 Fig.

P

largely e

≤ SMRs were highly highly were SMRs NFE2L2.1 ), which contribute contribute which ), a 0.001, Benjamini- 0.001, Cumulative fraction

of mutations (%) Gini coefficient 100 0.1 0.2 0.3 0.4 0.5 40 20 60 80

unseen 1 0 were the the were Cumulative fraction 20

Fig. Fig. 5 BRCA Fig. 4 Fig. Fig. 5 Fig.

LAML of residues(%

COLR 40 TP53 LUAD 60 b

a OVAR e ). ). ). ). ). ). HNSC - - - -

UCEC 80 MUMY 100 tially novel roles in cancer development. cancer in roles novel tially poten with genes multiple as well as analyses, gene-level by missed as such genes, cancer-related known sized SMRs ( for strategies cancer based drivers by variably discovering identifying and pathway- gene-level existing and complements limitations these Our approach avoids variants. that may of be the targets pathological out out the genome. We from stemming for controlled effects mutational through rates mutation somatic in variation account into take that models type–specific Here cancer we have applied models. mutation SMRs. of value prognostic the of may assessment permit cohorts patient larger much in data clinical with conjunction in mutations SMR examine to efforts future and genes, and cancers distinct in targeting functional subgenic for support robust provide by exemplified as signatures, molecular distinct with associated be can gene a single Consistent with alterations pleiotropic to dependencies, SMRs within evidence biochemical recent with concordant are findings These by protein. way PIK3R1 with the of regulatory interactions weakened PIK3CA catalytic of activity signaling basal elevated in result SMRs both in mutations that indicate simulations biophysical Moreover, PIK3CA.2 and PIK3CA.3 SMRsthe in function mutations through that similar suggest mechanisms.helix PIK3CA the in alterations of ity uniform directional and proximity geometric close The functions. pleiotropic from may arise that a property cancers, across mutations somatic of distribution the in substructure demonstrate PIK3CA in within frequencies mutation SMR in types cancer among Differences tures. specific analysis of somatic mutations and associated molecular signa sequencing. whole-exome in variation noncoding discovering in cancer drivers for noncoding 5 in harbor alterations cancer bladder with patients of ~15% addition, In 15.2, = ratio (odds sites binding of sequences recognition core the overlap SMRs 5 and promoter of 39 17 Strikingly, regions. coding non lay within SMRs altered frequently of most Notably,the several tion misregula and the advantage approaches. of detection agnostic functionally oncogenic of mechanisms varied the both underscores diversity functional This among others. sites, binding factor cription exons and miRNAs, protein domains, 5 elements in the genome, including single amino acids, complete coding

GLBM ) MELA SMR detection would benefit from further improvements of somatic type– cancer subgenic, a provides SMRs of identification The functional of spectrum diverse a target SMRs Cancer-associated

CLLX � 1.6 = 0.74

DLBC log LUSC 10 2.0

BRAF ESOP BLCA (samples) KIRC PRAD 2.4 Supplementary Table 16 MEDU , , TBC1D12 2.8 EGFR NEUB TP53 CARC RHAB 0 0.2 0.4 0.6

and a mechanistically uncharacterized uncharacterized mechanistically a and SMRs in breast cancers. Together, these results results Together, these cancers. breast in SMRs

(top 100 residues) 100 (top . Together, these results extend the support support the extend results these Together, . coefficient Gini b Gini coefficient (mutated residues) 0.05 0.10 0.15 0.20 0.25 20 0 P , 23 = 1.5 × 10 × 1.5 = 0 , 6 5 ). SMR-associated genes include and establish the potential for potential the and establish PIM1 ′ UTRs, splice sites and splice trans UTRs, 200 Breast cancer(BRCA) and and −11 Samples 400 , Fisher’s exact test). test). Fisher’sexact , MIR142 in vivo in ysis i s ly a n a ′ UTR melanoma melanoma UTR ( n ) 600 ETS family family ETS , that were were that , α 800 ′ UTR UTR helix helix 4 8  ------­ .

© 2015 Nature America, Inc. All rights reserved. through through URLs. cancer and facilitate driving anddiagnostics therapeutics development. mechanisms molecular the of understanding further allow approaches high-throughput developed recently Applying insights. enhances mechanistic often information, phenotypic or structural biophysical, genomic, additional with combined when genome, the in recurrence cal. We demonstrate that identifying the spatial distribution of mutationbiochemical and cellular consequences of individual mutations is criti further the identification of new cancer driver genes sizes. sample increased and refined tissue of normal matched cohorts at may sequencing require scales context chromatin and expression type–specific cell for account that models Mutation rates. mutation somatic to contribute defects repair biases mutation trinucleotide and and control mutation processes that may result in mutation clustering mutation frequencies to limit effects from purifying selection on exons specificity strand for account mutation probabilities capture our tion, nucleotide-specific models expression gene and timing replication in differences s i s ly a n a  M ucsc.edu tcga-d manuscript. with from supervision V.S.P. C.L.A., C.C., J.A.R., G.K., M.P.S. and W.J.G. wrote the Markov state model–based decompositions and computed binding enthalpies luciferase assays. G.K. outcarried biophysical simulations, performed hidden RNA-seq, RPPA andoutcome survival analyses. J.A.R. anddesigned performed analyses. C.C. constructed Bayesian mutation probability models and performed and(coding), frequency structural whole-genome sequencing recurrence estimation (simulations),false-discovery as well as regulatory (noncoding), probability models and clustering,performed density-based scoring and empirical of SMRs. C.L.A. constructed uniform annotations and non-Bayesian mutation and C.L.A. methods. and C.C. developed methods for the detection and analysis C.L.A. and W.J.G. conceived of the project, and all authors experiments designed by the Rita Allen Foundation. Supercomputing Center under grant PSCA13072P. This work was supported through (P41GM103712-S1) Anton-1 resources provided by the Pittsburgh provided by the National Center for Multiscale Modeling of Biological Systems and OCI-0725070 ACI-1238993 and the state of Illinois. Further support was supported by the Blue Waters project via US National FoundationScience awards Analysis and atDesign Stanford University. Biophysical simulations were US NIH Simbios Program and(U54GM072970) the Center for Molecular 01. G.K. acknowledges support from the Lawrence Scholars Program, the Damon Runyon Cancer Research Foundation and US NIH award 1U01HG007919- Translational AwardScience grant J.A.R.UL1TR000093. was supported by the the Lucile Packard Foundation for Children’s Health and US NIH and Clinical and 1P50HG00773501. C.C. was supported by the HealthChild Research Institute, was supported by US National Institutes of Health (NIH) grants 3U54DK10255602 and D.E. Webster for reading critical and suggestions to the manuscript. C.L.A. regarding statistical analyses. We thank M.M. Winslow, D.M. Fowler, S. Fields data sets available to the community.scientific We thank H. Tang for discussions We thank the TCGA, ICGC and TCPA for making large-scale these cancer online version of the pape Note: Any Supplementary Information and Source Data files are available in the the in v available are references associated any and Methods AU A ersion of the pape the of ersion

ck et Although the sequencing of additional cancer genomes will likely likely will genomes cancer additional of sequencing the Although T H no h O ata.nci.nih.gov/tcg Data from Lawrence Lawrence from Data ods wl R R C / ht . ed ONT tp://www.tumorportal 71– gm 3 7 , 66 3 RIBU o iety nergt vrain ihn Ms may SMRs within variation interrogate directly to , ent 68 , r 6 . T 9 r s . and cell type–specific chromatin context chromatin and type–specific cell I ON a ; UCSC Cancer Browser, S et al. et 6 7 , leverage whole-genome sequencing sequencing whole-genome leverage , 5 .org were obtained from TumorPortal from obtained were 3 . However, tumor-specific DNA tumor-specific However, . / . TCGA Data Portal, Portal, Data TCGA . http://genome-cance 5 , characterizing the 4 , 6 6 . In addi In .

https://

online online 7 0

also also

3 r. - - , 11. 10. 9. 8. 7. 6. 5. 4. 3. 2. 1. reprints/ at online available is information permissions and Reprints version of t The authors declare competing interests:financial details are available in the 32. 31. 30. 29. 28. 27. 26. 25. 24. 23. 22. 21. 20. 19. 18. 17. 16. 15. 14. 13. 12. C O

MP Kane, D.P. & Shcherbakova, P.V. A common cancer-associated DNA polymerase DNA cancer-associated common P.V.A Shcherbakova, D.P. & Kane, D.W. Parsons, H. Davies, Ding, L., Wendl, M.C., McMichael, J.F. & Raphael, B.J. Expanding the computational M.S. Lawrence, M.S. Lawrence, Alexandrov,L.B. F.W.Huang, E. Hodis, uain ass n xetoal srn mttr hntp, niaig fidelity indicating phenotype, proofreading. of loss mutator from distinct defects strong exceptionally an causes mutation genomes. cancer mining for toolbox types. tumour genes. cancer-associated 500 Science (2012). multiforme. (2002). 949–954 Tamborero, D., Gonzalez-Perez, A. & Lopez-Bigas, N. OncodriveCLUST: exploiting OncodriveCLUST: N. Lopez-Bigas, & A. Gonzalez-Perez, D., Tamborero, N.D. Dees, atru, . Sily J, rwr D, tatn MR & opr CS A ess of census A C.S. Cooper, & M.R. Stratton, D., Brewer, J., Shipley, T., Santarius, P.A.Futreal, Martin, E., Kriegel, H.P., Jörg, S. & Xiaowei, X. A density-based algorithm for discovering P. Cingolani, human reference 111 of analysis Integrative Consortium. Epigenomics Roadmap A.B. Stergachis, C.L. Araya, M.D.M. Leiserson, Hofree, M., Shen, J.P., Carter, H., Gross, A. & Ideker, T. Network-based stratification mutations somatic Recurrent M. Snyder, D.V.& Spacek, J.A., Reuter, C., Melton, Synonymous B. Lehner, & T. Gabaldón, J., Valcárcel, B., Miñana, F., Supek, of analysis Systematic E. Larsson, & J.A. Nilsson, L., Ny, N.J., Fredriksson, Weinhold, N., Jacobsen, A., Schultz, N., Sander, C. & Lee, W. Genome-wide analysis the in elements DNA of encyclopedia integrated An Consortium. Project ENCODE proteins. RNA-binding human of Tuschl,census T.& Hafner,A M. Gerstberger,S., H.Y.Xiong, Wolfe, A.L. A.B. Stergachis, C. Cenik, microRNA Conserved B. Berger, & N. Perrimon, Y., Zhao, M., Schnall-Levin, regions protein identify to method novel a e-Driver: A. Godzik, & E. Porta-Pardo, (2010). 29 the positional clustering of somatic mutations to identify cancer genes. Res. Genome amplified and overexpressed human cancer genes. cancer human overexpressed and amplified (2004). clusterslargespatialindatabases noise.with strain nucleotide polymorphisms, SnpEff: SNPs in the genome of epigenomes. evolution. regulatory resolution. (2015). 106–114 complexes. protein and pathways across mutations somatic rare mutations. tumor of genomes. cancer human (2015). of regions regulatory in cancers. human in mutations driver (2014). as 1324–1335 act frequently mutations tumor 14 across alterations types. expression gene and mutations somatic noncoding cancer. in mutations (2014). regulatory noncoding of genome. human Genet. Rev. Nat. disease. of determinants genetic the cancer. in evolution. protein affects genes. mitochondrial and (2011). secretory e1001366 for export mRNA nuclear USA Sci. Acad. Natl. in targeting cancer. driving , 2238–2244 (2013). 2238–2244 , ET , 415–421 (2013). 415–421 , index.htm w I Nat. Genet. Nat. he pape

N 1118 339 G G FI et al. et t al. et Nature Nature et al. et al. et et al. et t al. et t al. et ; Science et al. et , 957–959 (2013). 957–959 , Drosophila et al. et

Nature t al. et iso-2 22 Nature N l A landscape of driver mutations in melanoma. in mutations driver of landscape A . r Bioinformatics t al. et a eoe nlss eel itrly ewe 5 between interplay reveals analysis Genome et al. et . RNA G-quadruplexes cause eIF4A-dependent oncogene translation et al. et A et al. et Regulatory analysis of the of analysis Regulatory

et al. et RNA splicing. The human splicing code reveals new insights into insights new reveals code splicing human The splicing. RNA , 1589–1598 (2012). 1589–1598 , Nature uain o the of Mutations

MuSiC: identifying mutational significance in cancer genomes. cancer in significance mutational identifying MuSiC: Highly recurrent Highly 15 t al. et DVANCE ONLINE PUBLICATION ONLINE DVANCE

A census of human cancer genes. cancer human of census A N

513 ; 512 porm o anttn ad rdcig h efcs f single of effects the predicting and annotating for program A 46 t al. et

iso-3 CIAL CIAL I , 829–845 (2014). 829–845 , 518 Nat. Methods Nat. 321 Nature

Discovery and saturation analysis of cancer genes across 21 across genes cancer of analysis saturation and Discovery

Mutational heterogeneity in cancer and the search for new for search the and cancer in heterogeneity Mutational Signatures of mutational processes in human cancer.human in processes mutational of Signatures n nertd eoi aayi o hmn glioblastoma human of analysis genomic integrated An Exonic transcription factor binding directs codon choice and choice codon directs binding factor transcription Exonic , 1258–1263 (2014). 1258–1263 , 505 107 , 65–70 (2014). 65–70 , , 400–405 (2014). 400–405 , s s ieped n oig ein a i 3 in as regions coding in widespread as is osrain of Conservation

489 , 317–330 (2015). 317–330 , , 1807–1812 (2008). 1807–1812 , Pan-cancer network analysis identifies combinations of combinations identifies analysis network Pan-cancer . Nature Science , 495–501 (2014). 495–501 , , 15751–15756 (2010). 15751–15756 , Fly (Austin) Fly NTE

515 , 57–74 (2012). 57–74 ,

30

, 365–370 (2014). 365–370 , R 499 , 3109–3114 (2014). 3109–3114 ,

E 342 TERT 10 S Nat. Rev. Genet. Rev. Nat. , 214–218 (2013). 214–218 , BRAF T Science , 1108–1115 (2013). 1108–1115 ,

, 1367–1372 (2013). 1367–1372 , S 6 promoter mutations in human melanoma. human in mutations promoter , 80–92 (2012). 80–92 , trans ee n ua cancer. human in gene Cancer Res. Cancer C. elegans C. KDD

atn cruty uig mammalian during circuitry -acting 347

96 , 1254806 (2015). 1254806 , a. Genet. Nat. a. e. Cancer Rev. Nat. Nat. Rev.Nat. Cancer , 226–231(1996).,

genome with spatiotemporal with genome a. Genet. Nat. 15

74

, 556–570 (2014). 556–570 , http://www. Drosophila melanogaster , 1895–1901 (2014). 1895–1901 , Nature Ge Nature Cell

′ T itos and UTR 46 LS Genet. PLoS a. Genet. Nat.

150 1160–1165 , 47 Bioinformatics ′ UTRs.

Nature nature.com/

4 710–716 , 10 , 251–263 , , 177–183 , Cell n 59–64 , etics online Nature

Proc. 156 417

47 7 ε , , , ,

© 2015 Nature America, Inc. All rights reserved. 53. 52. 51. 50. 49. 48. 47. 46. 45. 44. 43. 42. 41. 40. 39. 38. 37. 36. 35. 34. 33. Nature Ge Nature

abei C.E. Barbieri, F. Cheng, genes important functionally of discovery Interaction-based M. Singh, & D. Ghersi, network: interaction protein-protein cancer Human Kar,O. Gursoy,Keskin, G., & A. J.R. Haling, Oncogenic R.L. Williams, & O. Vadas, G.R., Masson, O., Perisic, J.E., Burke, P.Gkeka, C.-H. Huang, Miled, N. of characterization genomic Integrated Network. Research Atlas Genome Cancer Thorpe,L.M., Yuzugullu, Zhao,&H. J.J. PI3K cancer:in divergent roles isoforms,of Y. Samuels, common members: family J. RSK Li, p90 The O.E. Pardo, & M.J. Seckl, R., Lara, microRNA of prediction comprehensive MiRmap: E.M. Zdobnov, & C.E. Vejnar, M. Uhlén, E. Budinska, Y. Okugawa, Mei, Y.-P. D. Jäger, A. Malhotra, MED12 (2014). 2156–2169 interactome. cancer the in perturbations mutational somatic cancers. in perspective. structural a signaling. MAPK in BRAF for roleindependent of activation natural the in (2012). 15259–15264 events p110 dynamic 3-kinase enhance phosphoinositide and mimic mutations mutant. oncogenic H1047R and PI3K oncogenic of effects subunit. catalytic 3-kinase carcinoma. endometrial therapeutictargeting.activationand of modes cancers. 10 specificity. isoform and functions strength. repression target 347 cancer. colorectal in heterogeneity 2015-30935 cancer. colorectal in biomarker prognostic Oncogene (2001). 2055–2061 library. cancer breast a of screening serological by tissue breast Res. mechanisms. homology-independent by spawned rearrangements complex , 1046–1047 (2013). 1046–1047 , , 1260419 (2015). 1260419 ,

23 et al. et mutations in prostate cancer. prostate in mutations , 762–776 (2013). 762–776 , Science et al.

et al. n t al. et et al. et 31 TCPA: a resource for cancer functional proteomics data. proteomics functional cancer for resource a TCPA: t al. et Nucleic Acids Res. Acids Nucleic etics t al. et t al. et t al. et 9 , 2794–2804 (2012). 2794–2804 , t al. et et al. et t al. et Mechanism of two classes of cancer mutations in the phosphoinositide t al. et (15 October 2015). October (15 Small nucleolar RNA 42 acts as an oncogene in lung tumorigenesis. t al. et Investigating the structure and dynamics of the of dynamics and structure the Investigating Identification of a tissue-specific putative transcription factor in transcription putative atissue-specific of Identification rtois Tsu-ae mp f h hmn proteome. human the of map Tissue-based Proteomics.

304 tutr o te RFMK ope rvas kns activity kinase a reveals complex BRAF-MEK the of Structure ih rqec o mttos f the of mutations of frequency High The structure of a human p110 human a of structure The Breakpoint profiling of 64 cancer genomes reveals numerous reveals genomes 64 cancer of profiling Breakpoint

tdig uoieei truh ewr eouin and evolution network through tumorigenesis Studying Clinical significance of significance Clinical ee xrsin atrs nel nw ee o molecular of level new a unveil patterns expression Gene Exome sequencing identifies recurrent identifies sequencing Exome ADVANCE ONLINE PUBLICATION ONLINE ADVANCE , 554 (2004). 554 , Nature PLoS Comput. Biol. Comput. PLoS α Nucleic Acids Res. Acids Nucleic Science mutations.

42 497 α PLoS Comput. Biol. Comput. PLoS Cancer Res. Cancer ( , e18 (2014). e18 ,

J. Pathol. J. , 67–73 (2013). 67–73 , 317 PIK3CA Nat. Genet. Nat. Science , 239–242 (2007). 239–242 , ).

Gut 231 5

Nat. Rev.CancerNat. SNORA42 73 40 rc Nt. cd Si USA Sci. Acad. Natl. Proc. , e1000601 (2009). e1000601 , Cancer Cell Cancer

318

, 5301–5308 (2013). 5301–5308 , , 11673–11683 (2012). 11673–11683 , , 63–76 (2013). 63–76 , http://dx.d

44 α /p85 , 1744–1748 (2007). 1744–1748 ,

, 685–689 (2012). 685–689 , 10 s n noee n a and oncogene an as , e1003895 (2014). e1003895 , α PIK3CA complex elucidates the elucidates complex

26 oi.org/10.1136/gutjn

o. il Evol. Biol. Mol. , 402–413 (2014). 402–413 ,

SPOP 15 PIK3CA ee n human in gene Cancer Res. Cancer , 7–24 (2015).7–24 , , Nat. Methods Nat. FOXA1 wild-type Genome Science

and 109

61 31 l- , , ,

66. 65. 64. 63. 62. 61. 60. 59. 58. 57. 56. 55. 54. 73. 72. 71. 70. 69. 68. 67.

Supek, F. & Lehner, B. Differential DNA mismatch repair underlies mutation rate mutation underlies repair mismatch DNA Differential B. Lehner, F.& Supek, X.S. Puente, X. Wu, p110 in mutations domain kinase and domain Helical P.K. Vogt, & L. Zhao, and AKR1C1 T.M. Penning, & J. Šinkovec, R., Rupreht, T., Šmuc, T.L., Rižner, M. Stanbrough, Q. Ji, G.M. DeNicola, Zhang, J. K.W.Cheng, P.V.Hornbeck, cancer. to path variant a mutations: H3.3 Histone Knoepfler, P.S. & B.T.K.Yuen, N.I. Fleming, Guenther,U.-P. J.D. Buenrostro, C.L. Araya, Polak, P. S.A. Roberts, therapy. cancer and response damage DNA The A. Ashworth, & C.J. Lord, M.A.M. Reijns, variation across the human genome. human the across variation leukaemia. mutations. mechanisms. different USA Sci. Acad. by Natl. Proc. function of gain induce 3-kinase phosphatidylinositol Endocrinol. cancer. Cell. endometrial Mol. in ratios estrogen and progesterone determine may AKR1C3 cancer. prostate androgen-independent (2006). 2815–2825 in testosterone to signaling. (2004). progesterone on effect potential their tumorigenesis. and detoxification 3 Akt/GSK- of activation and transition epithelial-mesenchymal of induction through post-translational cancers. breast and determined experimentally of function mouse. and man in modifications and structure the Cell Cancer Res. Cancer protein. landscapes. evolutionary and biophysical Biotechnol. reveals array parallel massively USA function. protein of measurements large-scale from solely cancer. of cancers. human in widespread 481 genome. the of β /Snail signaling. /Snail , 287–294 (2012). 287–294 ,

109 t al. et Nature t al. et et al. , 16858–16863 (2012). 16858–16863 , et al. Nature

Nat. Commun. Nat.

et al. et Nature

32 24 73 et al. et eetv ls o ARC ad K12 n rat acr and cancer breast in AKR1C2 and AKR1C1 of loss Selective Cell-of-origin chromatin organization shapes the mutational landscape ciain f ies sgaln ptwy b oncogenic by pathways signalling diverse of Activation et al. et Overexpression of Rab25 contributes to metastasis of bladder cancer , 562–568 (2014). 562–568 , t al. et , 567–574 (2013). 567–574 ,

et al. et t al. et , 725–735 (2013). 725–735 , et al. et 502 Nature et al. et et al. et A fundamental protein property,protein fundamental A stability,thermodynamic revealed

t al. et t al. et The RAB25 small GTPase determines aggressiveness of ovarian of aggressiveness determines GTPase small RAB25 The

518 526

Carcinogenesis , 385–388 (2013). 385–388 , SMAD2 PhosphoSitePlus: a comprehensive resource for investigating for resource comprehensive a PhosphoSitePlus: Nat. Med. Nat. o-oig eurn mttos n hoi lymphocytic chronic in mutations recurrent Non-coding Hidden specificity in an apparently nonspecific RNA-binding nonspecific apparently an in specificity Hidden n PBC yiie emns mtgnss atr is pattern mutagenesis deaminase cytidine APOBEC An Lagging-strand replication shapes the mutational landscape mutational the shapes replication Lagging-strand Increased expression of genes converting adrenal androgens adrenal converting genes of expression Increased

, 360–364 (2015). 360–364 ,

248 , 519–524 (2015). 519–524 , 518 uniaie nlss f N-rti itrcin o a on interactions RNA-protein of analysis Quantitative noeeidcd r2 rncito pooe ROS promotes transcription Nrf2 Oncogene-induced

5 , 126–135 (2006). 126–135 ,

, 502–506 (2015). 502–506 , , 105 , 4961 (2014). 4961 , SMAD3

Nat. Genet. Nat. 10 , 2652–2657 (2008). 2652–2657 , Nature

, 1251–1256 (2004). 1251–1256 , Nucleic Acids Res. Acids Nucleic 34 and Nature , 2401–2408 (2013). 2401–2408 ,

475 SMAD4

45 521 , 106–109 (2011). 106–109 , , 970–976 (2013). 970–976 , , 81–84 (2015). 81–84 , mutations in colorectal cancer. colorectal in mutations acr Res. Cancer

40 , D261–D270 (2012). D261–D270 , ysis i s ly a n a rc Nt. cd Sci. Acad. Natl. Proc.

acr Res. Cancer 64 7610–7617 , PIK3CA Nature

α Nat. 66 of  ,

© 2015 Nature America, Inc. All rights reserved. for the analyzed gene. In this framework, the posterior distribution is also also is distribution posterior the framework, this In gene. analyzed the for this cancer-specific matched set to calculate the posterior mutation probability mutations in sequencing above. Wewhole-genome intronic observed all used ( genes of set the We size. sample with used inversely to scale distribution of prior the variance the enables parameterization This type. cancer each in samples sequencing with probability in parameterized the whole-exome sequencing data and was distribution prior parameters The type. mutation each for probability mutation the on distribution beta We prior a placed distribution. types. cancer a as mutation a binomial of observing analyzed the likelihood we modeled Specifically, the of each in gene each per for transversion probabilities and mutation transition posterior derive to framework Bayesian a Weemployed data. sequencing whole-exome cancer-specific with junction sure on sequencing exons, whole-genome we pan-cancer used probabilities. mutation ‘matched’ a to of derive set ‘exonic’the transition/transversion per probability mutation score distance normalized a with genes similar most the we selected gene, For each neighborhood. rank 500-gene the within pairs gene between distances feature absolute correlation-weighted, of sum Wegene. the for query each measured then was selected genes closest 500 the on of the basis of the sequentially gene feature and weights, the neighborhood We then converted gene features into their percentile ranks. Genes were sorted type. tumor each in probabilities mutation exonic observed the and features gene between as correlation the rank defined weights feature-specific derived compiled previously used We features). most (gene-level content GC genes and time of replication set expression, in the similar identified we type, tumor each in and gene each For information. this using gene each for models probability mutation our refine in of the major we genome, mutation probability covariates to somatic sought tumor- data. of sequencing cohort whole-exome specific the in sample, per cytosine) example, (for base specific a to mutated somatically were that reference gene each in exonic adenines) example, (for (100-bp), bases mappable of fraction these the indicate Specifically, data. probabilities sequencing whole-exome using assembly genome gene to derive ‘exonic’ mutation probabilities for each gene in the hg19 human each of regions exonic mappable, the within transversions frequency and transitions of the calculated we First, probabilities. mutation distinct multiple models. probability Mutation also types cancer 23 from calls was procedure This exons). annotate to annotated applied from kb (>5 calls intergenic (“Unknown”, genes any to unassigned calls ant “—” ( ( “?” to assignments gene removed and scales multiple at ments assign name gene standardized procedures These assignments. name gene standardize and downstream) kb 5 and upstream kb 5 (transcribed regions (coding regions 5 introns, transcribed exons, noncoding plus regions, protein-coding in impact mutation snpEff applied We URL. n downloaded indicated were types the cancer 21 from from calls variant somatic sequencing annotation. variant Uniform below). described (as packages source open using R in performed were analyses survival and RNA-seq RPPA, scripts. custom BioPython with performed were implemented with SciKit Learn NumPy and SciPy with performed performed were operations interval PANDAS with genomic and structure Python Data within ments. performed was computing Scientific ONLINE Nature Ge Nature another beta distribution. We then assigned the expected value of the posterior = 3,078,482 (96.6%) SNV calls from 4,735 tumors, recording (GRCh37.66) (GRCh37.66) recording tumors, 4,735 from calls SNV (96.6%) 3,078,482 = To avoid skewed mutation probabilities due to increased selection pres selection increased to due probabilities mutation skewed avoid To be to shown been have timing replication and levels expression Because n = 130,728) in the original file. In addition, this procedure reduced vari

MET α n = = etics 7 6 µ H and Pybedtools and ≤ × × 200) that were matched to the analyzed gene as described described as gene analyzed the to matched were that 200) ODS n ν = 11,461,951 whole-genome sequencing somatic SNV SNV somatic sequencing whole-genome 11,461,951 = and and 3 β , 2 = (1 − − (1 = 0 . 3,185,590 uniformly processed uniformly 3,185,590 4 7 For each tumor type and gene, we calculated calculated we gene, and type tumor each For expression and replication timing data and and data timing replication and expression 9 . Structural . and Structural sequence alignment analyses ′ 7 UTRs and 3 and UTRs 7 8 , respectively. Statistical computing was was computing Statistical respectively. , 0 µ 7 , PyMOL (Schrödinger) modules and and modules (Schrödinger) PyMOL , 8 ) × × ) , and machine learning methods were were methods learning machine and , ν , where where , n ′ UTRs) and gene-associated gene-associated and UTRs) = 1,239,475) to to 1,239,475) = 2 µ 9 to uniformly annotate annotate uniformly to is the per-base mutation mutation per-base the is ν is the number of exome ≤ 1. Lastly, we averaged averaged we Lastly, 1. 74 , 7 5 n R environ R and 3 5 , whole-exome whole-exome 2 0 data in con n n = 64) and and 64) = = 899,731 899,731 = ≤ 200 200 - - - - -

sented in in sented pre for were used the comparison probabilities mutation These sequencing–based) densities. (whole-exome ‘matched’ and mutation sequencing–based) intronic (whole-genome ‘Bayesian’ sequencing whole-genome observed example, adenines) within specific trinucleotide contexts (for example, example, (for contexts trinucleotide specific within adenines) example, (for bases reference exonic (100-bp), mappable of fraction the as calculated were probabilities mutation ‘trinucleotide’ Specifically, type. tumor each for models probability mutation ‘trinucleotide’ computed we types, cancer and ( correlated better were probabilities tion ( genes among related 2b Fig. mutation were probabilities well ( correlated Complementary such of probabilities mutation of background treatment importance and gene-specific cancer- the highlighting types, cancer individual within genes ( types cancer probabilities between strongly mutation varied (‘Bayesian’) sequencing–derived whole-genome well as as ‘global’) and (‘exonic’,‘matched’ of sequencing–derived distributions whole-exome observed The as type. tumor genes each in all probabilities across mutation ‘exonic’ in transversions and transitions of probability ‘exonic’rate. mutation ‘Bayesian’median mutation was probability to equal the mean cancer-specific rates such that the transition/transversion by the cancer-specific probabilities ( transition/transversion each for probability mutation the of estimate the as distribution probability exonic domains (above), evaluating density reachability within within reachability density evaluating (above), domains exonic ing of to clusters of with applications noise detect (DBSCAN) identification. cluster Mutation as selected s.d. 2 were exceeding those samples (outlier) mutator consistency, for threshold a As (log of mutations distribution on the detection outlier (MAD) deviation absolute median using were detected type tumor in each mutations Mutator sample identification. Project ENCODE by the defined In addition, we removed ‘blacklisted’ regions of the human genome previously which for the ‘expanded’ set amounted ranges (10 to 1,005,774,400,023 stop), (start, ranges genomic possible of number the computed we domains, For Browser). Genome UCSC set each of reads (ENCODE, 100-bp single-end ‘expanded’ domains in which over We below). (see the evaluated identified n to define regions bp merged and 0 bp 1,000 and definition. domain Mutation data. sequencing whole-exome tumor-specific of cohort the in cyto sample, per example, CAG>CCG) (for sine, base specific a to mutated somatically were that CAG) conservative conservative of the ‘Bayesian’ and ‘global’ density scores, max( Asscore density tion probabilities. ( the primary the previously described ‘exonic’, ‘matched’, ‘Bayesian’ and within the region. For‘global’ each region, we computed the above somatic density scores with muta ( number observed the sampling of probability binomial Fisher’scombined as determined was cluster each in Mutation cluster scoring. existed. kilobase) per samples tumor (mutated ( higher significantly with SNVs of subclusters where refined were clusters mutation Detected clusters. to and tolerates noise in numbers spatial density, or wherebysizes distal mutationscluster are not predefined assigned evaluating to or confined not is approaches DBSCAN window sliding to contrast In domain the of size base-pair the and pairs) with parameter reachability The type. cancer each in = 191,669 ‘expanded’= 191,669 (E) domains in mutation genomic were clusters which Lastly, to account for trinucleotide biases trinucleotide for account to Lastly, We computed a as tumor the average ‘global’type per probability mutation ε = ). The ‘Bayesian’ and ‘matched’ mutation probabilities were well cor well were probabilities mutation ‘matched’ and ‘Bayesian’ The ). d Figure 1 Figure p / d s where where f . d p and and Supplementary Fig. 2c Fig. Supplementary n The of significance the mutationobserved densities = mutation Finally,12). the posterior we calibrated d s refer to the number of mutated positions (base (base positions of mutated number to the refer We extended Ensembl (75) exonic regions by regions exonic (75) Ensembl We extended Samples harboring aberrantly high burdens of k 8 P We deployed density-based spatial cluster spatial density-based We deployed ) or more mutations for each mutation type mutation type or more) mutations for each 1 ≥ < 0.05, binomial test) mutation densities densities mutation test) binomial 0.05, < . 90% 90% of positions were fully mappable with Supplementary Fig. 2a Fig. Supplementary n = 279,979 ‘concise’= 279,979 and Supplementary Fig. 2d Fig. Supplementary d 3 , thresholded to 10 10 to thresholded , , 4 in diverse mutation processes processes mutation diverse in ), although ‘Bayesian’ muta ‘Bayesian’ although ), P n k density = 305,145 ‘concise’ and = (C) 305,145 ε -means spatial clustering, clustering, spatial -means was dynamically defined defined dynamically was ), we selected the most ), we selected doi: P 10.1038/ng.3471 Supplementary Supplementary ≥ Bayesian 2 2 SNVs within n ) and among among and ) ) per sample. sample. ) per ≤ ε n

ε base pairs pairs base = 175,228 ) with the the with )

≤ , , 500 bp. 500 P 12.0025 global 3 ≥ , 4 ). ). ). ). 2 ------.

© 2015 Nature America, Inc. All rights reserved. above) removal were labeled as mutator driven SMRs. Among SMRs robust robust SMRs Among SMRs. driven mutator as labeled were removal above) defined (as sample mutator following threshold frequency mutation 2% the as sets First, follows. SMRs and in fell low-confidence alterations which below classification. cluster Mutation intergenic. and downstream upstream, 3 (exon), gene noncoding syn stop, start, onymous synonymous coding, synonymous gain, start branch, splice-site region, splice-site stop, nonsynonymous start, nonsynonymous U12, branch splice acceptor, splice-site site coding, nonsynonymous gain, stop splice lost, stop lost, donor, start acid, site amino rare severity, of order in included, 3 5 upstream, splice, intron, gene), noncoding and region (coding exon included classes Region affected. region of class the and gene the on impacts this On of mutation types the recording SMR gene, to each a SMR. single we assigned basis, the the within mutations selected of we number largest genes, the by multiple affected to gene assignment resolve to insufficient was impact mutation Where mutation. of type severe most the by affected gene Lawrence by defined (as genes driver cancer known (overlap previously (i) to SMRs genes assigned multiple preferentially we annotations), with ping associated SMRs For “Uniform annotation”). (see variant regions gene-associated and transcribed coding, on impact annotation. Mutation cluster types. cancer distinct 20 in regions, genomic unique 735 from SMRs, 872 in resulted procedure This removed. were classes gene repetitive other and receptors olfactory dogenes, in ( scores density with Mutation cluster filtering. ( scores density primary the of as SMRs on were the of basis regions identified Fully 93.2% 714 these regions. score, density nate, conservative ( clusters FDR–thresholded 5% the in index) (Jaccard overlap 99.2% a confirmed and 90× to simulations FDR tumorthresholds per 5% are type provided in domain E type. tumor each in clusters E/C in increase 1.7× the by tainty of whole-exome sequencing coverage, we adjusted FDRs from C domains to E domains, FDRs where mutations control Tocannot be randomizedthreshold. owing FDR to the decreased cer tumor-specific final the as FDR of (average) value expectation as these regions would not represent false discoveries. For each tumor type, the false-discovery set if the clusters were associated with CGC genes ( the from scores density outlier with clusters We excluded respectively. tions, number of clusters from domain simulated and (randomized) muta observed FDR guarantees ( score density the computed we simulation, each For above. as repeated were procedures scoring and refinement detection, cluster probabilities per transitionmutation ‘global’ and transversionobserved the ( retain to identity base reference maintaining the observed mutations in each domain and tumor type were randomized while of positions Inthe simulation, mutationseach types. cancer across 30,784,820 omizing mutations within C domains in each tumor simulating type, a total of domains. bp–expanded 1,000 the within clusters in increase 1.7× a indicating respectively, domains, C the query within fell domains in E query identified of clusters mutation the in domains query “Mutation(described domain definition”). 117,148/198,718 (E) ‘expanded’ ‘concise’ and of (C) sets two in clusters mutation evaluate and thresholding. cluster Mutation ‘trinucleotide’the using region each mutation somatic probabilities. ( score density mutation trinucleotide a computed Finally, we doi: at significant scores density FDR-corrected with those removal, mutator to ′ UTR, downstream and other (intergenic). Mutation impacts (from snpEff) snpEff) (from Mutation impacts and (intergenic). other downstream UTR, ≥ We using an mutation reiterated and cluster FDR alter filtering estimation of number the expanded we cutoffs, FDR the of robustness the Toassess rand by performed simulations ten from calculated were FDRs Empirical 10.1038/ng.3471 2% of samples in each cancer type. Lastly, clusters associated with pseu with associated clusters Lastly, type. cancer each in samples of 2% ≤ 5%, whereby false and true discoveries are computed as the the computedas are discoveries true and false whereby 5%, P density As a final step in calling SMRs, we clustersselected ) at the 5% FDR threshold and that were mutated mutated were that and threshold FDR 5% the at ) Supplementary Fig. 4e Fig. Supplementary SMRs were annotated on SMRs were of annotated the basis mutation P density We applied the procedures above to detect detect to above procedures the Weapplied P SMRs were classified into high-, medium- medium- high-, into classified were SMRs alternate ≤ ). 5% simulation thresholds was defined defined was thresholds simulation 5% = max( Supplementary TableSupplementary 1 ′ n UTR, 5 UTR, = 12). In each iteration, mutation et al. et P matched – or the CGC) or (ii) the the (ii) or CGC) the or ′ g UTR, miRNA, intron, intron, miRNA, UTR, ). P density , , P global ) threshold that that threshold ) P trinucleotide ), resulting in ), resulting n ≤ = 522) . 5% in the the in 5% ′ UTR, UTR, 31 ) for ) , 3 2 ------,

ing annotations are provided in in provided are annotations ing and SMR correspond reads. coordinates 75- 50-, and with single-end 100-bp alignability and uniqueness 35-bp their to respect with SMRs annotated we Mutation cluster labeling. cluster Mutation ‘robust’. as classified FDR intron-based with in discovered multiple and cancer types non-mutator-driven SMRs compliant ( scores density on trinucleotide approach this applied we type, cancer each in frequencies mutation trinucleotide in biases for account ( probabilities (GC) single-nucleotide sequence and composition–controlled timing– replication expression-, our on approach this two First, metrics. to account for unaccounted mutation clustering, we applied We not on shown). to FDRs (data control approach this in melanoma applied clusters mutation intronic subsampling by determined as thresholds FDR of n distribution of mutation density scores ( types with sufficient intronic mutation clusters to permit KD estimates of their estimates that limit the FDR to Gaussian kernel density estimation (KDE) to derive with modeled was clusters mutation intronic from scores of density tribution and For treated intronic clusters as the each discoveries. type, cancer dis false mutations passenger of composed primarily are mutations intronic that tion of mutations with no advantageselective in cancer, we introduced the assump confidence. medium deemed were criteria low-confidence or high-confidence the low meet not as did that SMRs classified confidence. were SMRs Mutator-driven confidence. high as classified adjusted tion are available at available are tion visualiza overlap enrichment transcriptome and molecular for scripts alone herein arethe described analyses available from the authors by request. Stand- availability. Code and analysis site target miRNA the see (x) please assays, luciferase (xi) analysis, survival (ix) analysis, ment enrich functional (viii) RPPA analysis, (vii) analysis, RNA-seq (vi) binding, PIK3CA-PIK3R1 of dynamics molecular (v) angles, dihedral mutation (iv) clustering, spatial mutation (iii) mapping, structure protein (ii) enrichments, signatures. mutation APOBEC expected (uncorrected) Raw correction. Holmes-Bonferroni following signatures mutation APOBEC in ( signatures APOBEC with consistent are that of mutations fraction the SMR each within calculated we have pleteness, com For sequences. SMR of ~7.9% only up made frequencies trinucleotide unaccounted types, cancer Across SMRs. of 79% encompassed types cancer These mutations. SMR of numbers low with types cancer from noise prevent ( types cancer to restricted were analyses These signatures. mutation trinucleotide by not driven unaccounted types ( cancer diverse in contexts trinucleotide in change fold the and biases contexts in the SMR mutations as compared to the input mutations. single-nucleotide frequencies and (ii) fold change in frequencies of trinucleotide as the deviation in the observed trinucleotide mutation frequencies examine ontwo importantthe metrics: (i) basisunaccounted oftrinucleotide biases measured to analyses these extended Moreover, we signature. mutation this for control ( signatures are identifiable in the data, our SMRs are depleted for such signatures texts owing to APOBEC mutational processes con single-nucleotide other from mutation in frequencies significantly differ to shown previously have been (TCW) ofthese subset a as contexts, sequence analysis. Mutation trinucleotide APOBEC mutation signatures Supplementary Fig. 6a Supplementary Fig. 6b Fig. Supplementary

≥ To control that for mutation result could in processes clusters unaccounted For additional methods describing (i) transcription factor motif motif factor transcription (i) describing methods additional For trinucleotide unaccounted the between correlation low a observed We Supplementary Figure 6d Figure Supplementary 100 intronic mutation clusters was determined on of the 100 basis mutation intronic was clusters the stability determined P < 0.05 following Bonferroni correction ( correction Bonferroni < following 0.05 P values would indicate that 12% of SMRs have higher than than higher have SMRs of 12% that indicate would values The Python and R scripts to process the data and conduct conduct and data the process to scripts R and Python The http: ), suggesting that the background models conservatively ≤ //www.github.com/claraya/SMR ), further supporting the conclusion that SMRs are are SMRs that conclusion the supporting further ), 5% thresholds (both (both thresholds 5% SMRs with higher than expected prevalence for for prevalence expected than higher with SMRs 6 9 , only 4% of SMRs had higher than expected expected than higher had SMRs of 4% only , ≤ were labeled ( 5%. 5%. This approach is limited to the ten cancer Supplementary Table 2 Supplementary n We of the evaluated frequency trinucleotide = 6) that had had that 6) = Supplementary Note Supplementary Supplementary Supplementary Fig. 5 Supplementary Fig. 6c Fig. Supplementary Supplementary Fig. 6d 6 9 . . Although APOBEC mutation P ≥ density 250 SMR mutation sites to to sites mutation SMR 250 P- P density value and and and Nature Ge Nature . x P .

P ≤ . density trinucleotide P 5.2 × 10 trinucleotide ). ). A threshold of q -value (FDR) ). Second, to to Second, ). ). As shown shown As ). −17 ). Finally, n ). SMRs SMRs ). ) were were ) etics ) ) were ------

© 2015 Nature America, Inc. All rights reserved. 77. 76. 75. 74. Nature Ge Nature

Dale, R.K., Pedersen, B.S. & Quinlan, A.R. Pybedtools: a flexible Python library Python flexible a Pybedtools: A.R. Quinlan, & B.S. Pedersen, R.K., Dale, McKinney,W.in engineers. and scientists for Python M. Aivazis, & K.J. Millman, computing. scientific for Python T.E. Oliphant, (2011). for manipulating genomic datasets and annotations. 978-1-4583-4619-3. ISBN-13: (2010). 51–56 13 (2007). , 9–12 (2011). 9–12 , n etics Proc. 9th Python Sci. Conf. Sci. Python 9th Proc. (eds. van der Walt, S. & Millman, J.) Millman, & S. Walt, der van (eds. opt Si Eng. Sci. Comput. Bioinformatics Comput. Sci. Eng. Sci. Comput.

27 , 3423–3424

9 10–20 ,

81. 80. 79. 78.

Boyle, A.P.Boyle, P.J.A. Cock, F.Pedregosa, for structure a Array: NumPy The G. Varoquaux, & S.C. Colbert, S., Walt, der Van distant species. distant bioinformatics. and (2009). biology molecular 12 computation. numerical efficient , 2825–2830 (2011). 2825–2830 , et al. et t al. et et al. et Comparative analysis of regulatory information and circuits across circuits and information regulatory of analysis Comparative Nature Scikit-learn: machine learning in Python. in learning machine Scikit-learn: ipto: rey vial Pto tos o computational for tools Python available freely Biopython:

512 , 453–456 (2014). 453–456 , Comput. Sci. Eng. Sci. Comput. Bioinformatics

13 , 22–30 (2011). 22–30 , doi:

J. Mach. Learn. Res. Learn. Mach. J. 25 10.1038/ng.3471 1422–1423 ,