The spectra of somatic across many tumor types

Mike Lawrence Broad Institute of Harvard and MIT

1st Annual TCGA Scientific Symposium November 17, 2011 rates across cancer

Hematologic Carcinogens Childhood

?? ?? HPV & HPV mutation type C → T C → A C → G A → G A → T A → C

OV mutation type C → T C → A C → G A → G A → T A → C

mutation rate (per million sites) OV mutation type C → T C → A C → G A → G A → T A → C

mutation rate (per million sites) GBM mutation type C → T C → A C → G A → G A → T A → C

LUSC lung squamous mutation type C → T C → A C → G A → G A → T A → C

LUAD lung adeno mutation type C → T C → A C → G A → G A → T A → C

Melanoma mutation type C → T C → A C → G A → G A → T A → C

cervical mutation type C → T C → A C → G A → G A → T A → C

bladder total rate 100/Mb 10/Mb 1/Mb 0.1/Mb total rate

type of spectrum Head&Neck HPV GBM HPV Bladder viral?

Kidney

Esophageal Colorectal Gastric

Lung

Melanoma GBM

Kidney

H&N

Bladder

Lung Gastric Colorectal Esophageal Melanoma finding significantly mutated patients tally significance MutSig scoring algorithm genes

* patients tally significance MutSig scoring algorithm version 0

assume background mutation rate is:

genes · uniform across sequence contexts · uniform across patients · uniform across genes * patients tally significance MutSig scoring algorithm version 1

assume background mutation rate is:

genes · variable across sequence contexts · uniform across patients · uniform across genes *

C→T (UV-induced) A→T patients tally significance MutSig scoring algorithm version 1

assume background mutation rate is:

genes · variable across sequence contexts · uniform across patients · uniform across genes *

melanoma patients C→T (UV-induced) A→T

W

gene X gene Y gene Z patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

patient 1 patient 2 low mutation rate high mutation rate patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

patient 1 patient 2 low mutation rate high mutation rate

gene A gene B gene C gene D patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

Problem: mutation rate is heterogeneous across genes gene J gene K gene L gene M patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

Problem: mutation rate is heterogeneous across genes

average = 3 / Mb

uniform across genes patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

Problem: mutation rate is heterogeneous across genes

average = 3 / Mb

uniform across genes

q<0.01

→ hits patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

Problem: mutation rate is heterogeneous across genes

average = 3 / Mb average = 3 / Mb

uniform across genes

q<0.01

→ hits patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

Problem: mutation rate is heterogeneous across genes

average = 3 / Mb average = 3 / Mb

uniform across genes 25% genes: rate = 6 / Mb

q<0.01

→ hits patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

Problem: mutation rate is heterogeneous across genes

average = 3 / Mb average = 3 / Mb 75% genes: rate = 2 / Mb uniform across genes 25% genes: rate = 6 / Mb

q<0.01

→ hits patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

Problem: mutation rate is heterogeneous across genes

average = 3 / Mb average = 3 / Mb 75% genes: rate = 2 / Mb uniform across genes 25% genes: rate = 6 / Mb

q<0.01

→ hits → false positives * known lung cancer genes #1 * TP53 457 patients #2 KRAS 180 lung squamous cell carcinoma * 277 lung adenocarcinoma #7 OR4A15 average mutation rate = 10 / Mb #13 * KEAP1 #14 OR8H2 version 0 MutSig results #15 * STK11 (assuming uniform #17 OR2T4 background mutation rate #25 OR2T3 across genes) #31 OR2T6 #48 CSMD3 #49 OR5D16 all of these genes #55 RYR2 are extremely significant #100 CSMD1 (q<10-7) #139 * PIK3CA #158 RYR3 #159 MUC16 #161 OR2T33 total of #169 * NFE2L2 843 genes #172 OR10G8 #180 OR2L8 Bryan Hernandez significantly mutated Peter Hammerman (q<0.01) #198 MUC17 Marcin Imielinski #217 TTN Matthew Meyerson Lung cancer * known lung cancer genes #1 * TP53 457 patients #2 KRAS 180 lung squamous cell carcinoma * 277 lung adenocarcinoma #7 OR4A15 "fishy" genes average mutation rate = 10 / Mb #13 KEAP1 * #14 OR8H2 olfactory receptors version 0 MutSig results #15 * STK11 (146 with q<0.01) (assuming uniform #17 OR2T4 background mutation rate #25 OR2T3 "cub and sushi" across genes) #31 OR2T6 reported to be tumor suppressors but significantly mutated in #48 CSMD3 almost every tumor type #49 OR5D16 all of these genes (including TCGA ovarian) #55 RYR2 are extremely significant #100 CSMD1 ryanodine receptors (q<10-7) #139 * PIK3CA cardiac calcium channels #158 RYR3 #159 MUC16 mucins #161 OR2T33 gel-forming proteins total of #169 * NFE2L2 843 genes #172 OR10G8 significantly mutated #180 OR2L8 largest human #198 MUC17 100x bigger than p53 (q<0.01) 34,350 amino acids #217 TTN 100 Kb coding sequence patients tally significance MutSig scoring algorithm version 2

assume background mutation rate is:

genes · variable across sequence contexts · variable across patients · uniform across genes *

Problem: mutation rate is actually heterogeneous across genes

Challenge: predict gene-specific background mutation rates

We eventually want to learn the background mutation rate of every gene (and all possible mutations at all basepairs!)

As we sequence more and more samples, we get closer to this goal. Highly expressed genes have lower mutation rates

3

` 2

1 mutation rate rate (/Mb) mutation 0 Low →`` High expression

Chapman et al. Nature (2011) mutations / Mb from from TCGA lung cancer dataset noncodingrate mutation shown: genome the across more or -fold ten varies rate mutation background chr10

position (Mb) position

mutations / Mb from from TCGA lung cancer dataset noncodingrate mutation shown: genome the across more or -fold ten varies rate mutation background Early- chr10

replicating geneshavelowermutation rates

position (Mb) position correlated highly

Chen et al. (2010) shown: replicationmeasurements time from shown: Stamatoyannopoulos et al. (2009) Sunyaev (Harvard/BWH) Lab genome the across also varies greatly varies also Genome Research time replication Nat. Gen. 20:447 early late

replication Late replication explains most olfactory receptors

16 ORs All Genes late

Early Late replication mutations /Mb

early Olfactory Receptors chr1 (Mb) mutations / Mb mutations / Mb chr8 position (Mb) position extrapolate even later evenextrapolate replication times replication CSMD3

early early late late

replication replication initial model assumed a flat mutational landscape

mutation rate

gene A

gene B landscape is actually not flat

mutation gene A rate

gene B improve estimate by binning together similar genes...

mutation gene A rate

gene B ...or by local regression

mutation gene A rate

average outward until neighborhood becomes too different from starting point

gene B * known lung cancer genes Lung "fishy" genes #1 * TP53 cancer #2 * KRAS #7 OR4A15 #13 * KEAP1 #14 OR8H2 MutSig v0 #15 STK11 assuming uniform * bkgd mutation rate #17 OR2T4 across all genes #25 OR2T3 #31 OR2T6 #48 CSMD3 #49 OR5D16 q<10-7 #55 RYR2 #100 CSMD1 #139 * PIK3CA #158 RYR3 #159 MUC16 #161 OR2T33 843 genes #169 * NFE2L2 significantly mutated #172 OR10G8 (q<0.01) #180 OR2L8 #198 MUC17 #217 TTN * known lung cancer genes Lung "fishy" genes #1 * TP53 cancer #2 * KRAS improved MutSig #7 OR4A15 using gene-specific background mutation rates #13 * KEAP1 #14 OR8H2 STK11 #1 MutSig v0 #15 STK11 * * NFE2L2 #4 assuming uniform #17 OR2T4 * bkgd mutation rate TP53 #7 #25 OR2T3 * q<10-5 across all genes KRAS #8 #31 OR2T6 * KEAP1 #11 52 genes #48 CSMD3 * significantly mutated PIK3CA #12 #49 OR5D16 * (q<0.01) q<10-7 #55 RYR2 ...... #100 CSMD1 * OR8H2 #181 #139 PIK3CA OR5T2 #276 * q~0.2 #158 RYR3 OR10J3 #334 #159 MUC16 CSMD3 #388 #161 OR2T33 MUC17 #2614 843 genes #169 * NFE2L2 RYR2 #2898 significantly mutated #172 OR10G8 CSMD1 #4482 (q<0.01) #180 OR2L8 TTN #4825 q=1 #198 MUC17 MUC16 #5650 *most significant #217 TTN RYR3 #11496 Correcting for variation in mutation rate

Before After Lung Squamous 261 (50 OR) 18 (0 OR)

Lung Adeno 511 (93 OR) 33 (1 OR)

Melanoma 177 (7 OR) 61 (0 OR)

Prostate 3 (0 OR) 3 (0 OR)

DLBCL 32 (1 OR) 15 (0 OR)

Ultimate solution: Learn the background rate putting it all together patient 1 patient 2

µ · F µ w v o p,s,c = o · Fp · Fg Σ( p,k · k,c) = µp,s,c k

relative mutation rate contribution of patient p relative of factor k mutation rate in sum across to mutation gene g k "factors" category c (i.e. mutational patient p processes) category c mutation rate in relative weight of gene g overall mutation rate mutation rate factor k patient p across entire dataset of gene g in patient p category c significantly mutated genes across tumor types tumor types tumor

IDH1 KRAS PIK3CA TNFRSF14 EGFR PTEN NFE2L2 TP53 BRAF NRAS SF3B1 MYD88 ATM CREBBP Acknowledgements MutSig Team NHGRI NCI Petar Stojanov Broad Institute Sequencing Bryan Hernandez Program and Platform Marcin Imielinski Tim Fennell Peter Hammerman Sara Chauvin Gregory Kryukov Lauren Ambrogio Gaddy Getz Sheila Fisher Eran Hodis Matthew Meyerson Chip Stewart Stacey Gabriel Joshua Levin Xian Adiconis Analysis Team Levi Garraway Eric Lander Kristian Cibulskis Andreas Gnirke Andrey Sivachenko Lynda Chin Todd Golub David Jaffe Scott Carter (cont.) Toby Bloom Gordon Saksena Mike Berger Broad Institute of Harvard and MIT Chad Nusbaum Yotam Drier Mike Chapman Dana Farber Cancer Institute Broad Institute Biological Alex Ramos Jens Lohr Genetic Analysis Platform Aaron McKenna Derek Chiang Shamil Sunyaev Adam Bass Robb Onofrio Rui Jing Craig Mermel Paz Polak Austin Dulak Brendan Blumenstiel Lihua Zou Nicolas Stransky Leonid Mirny Project Management Huy Nguyen David DeLuca Roel Verhaak Rameen Beroukhim Carrie Sougnez Mellisa Parkin Elena Helman Barbara Weir Steve Schumacher Erica Shefler Wendy Winckler Jaegil Kim Shantanu Banerji Barbara Tabak Elizabeth Nickerson Cheng-Zhong Zhang Firehose Daniel Auclair Broad Institute Biological Sylvan Baca Doug Voet Cathy Wu Marisa Cortes Samples Platform Trevor Pugh Mike Noble Jennifer Brown Kristin Thompson Kristin Ardlie Nam Pho Pei Lin Lili Wang Andrew Cherniak Youzhong Wan Jim Robinson Dan DiCara Mark DePristo Alex Kostic Dan-Avi Landau Helga Thorvaldsdottir Lee Lichtenstein Eric Banks Peter Chen Aviv Regev Marc-Danie Nazaire Robert Zupko Kiran Garimella Nikhil Wagle Peter Carr Nathalie Pochet Jill Mesirov