The Spectra of Somatic Mutations Across Many Tumor Types
Mike Lawrence
Broad Institute of Harvard and MIT
1st Annual TCGA Scientific Symposium, November 17, 2011

Mutation rates across cancer
[Figure: total mutation rate by tumor type, log scale from 0.1/Mb to 100/Mb. Group labels: Hematologic, Childhood, Carcinogens, HPV, and two groups marked "??".]

Mutation spectra by tumor type
[Figure: mutation rate (per million sites) by mutation type (C → T, C → A, C → G, A → G, A → T, A → C). Panels: OV, GBM, LUSC (lung squamous), LUAD (lung adeno), Melanoma, cervical, bladder.]

Total rate vs. type of spectrum
[Figure: total rate (0.1/Mb to 100/Mb) against type of spectrum. Labeled tumor types: Head & Neck, GBM, Bladder, Kidney, Esophageal, Colorectal, Gastric, Lung, Melanoma. Annotations: "HPV", "viral?".]

Finding significantly mutated genes
[Diagram labels: patients, genes, tally, significance (*); MutSig scoring algorithm.]

MutSig scoring algorithm, version 0
Assume the background mutation rate is:
· uniform across sequence contexts
· uniform across patients
· uniform across genes

MutSig scoring algorithm, version 1
Assume the background mutation rate is:
· variable across sequence contexts
· uniform across patients
· uniform across genes
[Figure labels: melanoma patients; C→T (UV-induced), A→T; gene W, gene X, gene Y, gene Z.]

MutSig scoring algorithm, version 2
Assume the background mutation rate is:
· variable across sequence contexts
· variable across patients
· uniform across genes
[Figure labels: patient 1, patient 2; low mutation rate, high mutation rate; gene A, gene B, gene C, gene D.]

Problem: mutation rate is heterogeneous across genes
[Figure labels: gene J, gene K, gene L, gene M.]
· assumed: uniform across genes, average = 3 / Mb; q<0.01 → hits
· same average = 3 / Mb, but 25% genes: rate = 6 / Mb
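To make the effect of the uniform-across-genes assumption concrete, here is a minimal sketch. It is not MutSig's actual test: it assumes mutation counts are Poisson distributed, and the gene size, cohort size, and function names are hypothetical. It scores a gene whose true background rate is 6 / Mb against the cohort-wide average of 3 / Mb, and then against its own rate.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    term = math.exp(-lam)  # pmf at 0
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # pmf at i + 1
    return max(0.0, 1.0 - cdf)


def gene_p_value(observed, assumed_rate_per_mb, covered_mb):
    """One-sided p-value for seeing at least `observed` mutations in a gene."""
    return poisson_sf(observed, assumed_rate_per_mb * covered_mb)


# Hypothetical gene: 10 kb of coding sequence x 500 patients = 5 Mb covered.
covered_mb = 5.0
# A gene with a true background of 6 / Mb accumulates about 6 * 5 = 30 mutations.
observed = 30

p_uniform = gene_p_value(observed, 3.0, covered_mb)  # scored against the average
p_matched = gene_p_value(observed, 6.0, covered_mb)  # scored against its own rate

print(f"assuming 3 / Mb: p = {p_uniform:.2e}")
print(f"assuming 6 / Mb: p = {p_matched:.2f}")
```

Under the 3 / Mb assumption the gene looks strongly significant even though it carries nothing beyond its own background, while scoring it against 6 / Mb gives an unremarkable p-value. The larger the covered territory (long genes, many patients), the more extreme this inflation becomes.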
With a heterogeneous background (75% genes: rate = 2 / Mb; 25% genes: rate = 6 / Mb; average = 3 / Mb), q<0.01 hits scored under the uniform-across-genes assumption → false positives.

Lung cancer
457 patients: 180 lung squamous cell carcinoma, 277 lung adenocarcinoma
average mutation rate = 10 / Mb
version 0 MutSig results (assuming uniform background mutation rate across genes)
total of 843 genes significantly mutated (q<0.01); all of the genes listed below are extremely significant (q<10^-7)
Bryan Hernandez, Peter Hammerman, Marcin Imielinski, Matthew Meyerson

* = known lung cancer genes
#1 * TP53
#2 * KRAS
#7 OR4A15
#13 * KEAP1
#14 OR8H2
#15 * STK11
#17 OR2T4
#25 OR2T3
#31 OR2T6
#48 CSMD3
#49 OR5D16
#55 RYR2
#100 CSMD1
#139 * PIK3CA
#158 RYR3
#159 MUC16
#161 OR2T33
#169 * NFE2L2
#172 OR10G8
#180 OR2L8
#198 MUC17
#217 TTN

"Fishy" genes in this list
· olfactory receptors (OR genes): 146 with q<0.01
· "cub and sushi" proteins (CSMD3, CSMD1): reported to be tumor suppressors but significantly mutated in almost every tumor type (including TCGA ovarian)
· ryanodine receptors (RYR2, RYR3): cardiac calcium channels
· mucins (MUC16, MUC17): gel-forming proteins
· titin (TTN): largest human protein, 100x bigger than p53; 34,350 amino acids; 100 Kb coding sequence

Problem: mutation rate is actually heterogeneous across genes
Challenge: predict gene-specific background mutation rates
We eventually want to learn the background mutation rate of every gene (and all possible mutations at all basepairs!). As we sequence more and more samples, we get closer to this goal.

Highly expressed genes have lower mutation rates
[Figure: mutation rate (/Mb), axis 0 to 3, against expression (Low → High). Chapman et al. Nature (2011).]

Background mutation rate varies ten-fold or more across the genome
[Figure: chr10, mutations /Mb against position (Mb). Shown: noncoding mutation rate from TCGA lung cancer dataset.]

Early-replicating genes have lower mutation rates
[Figure: chr10, mutations /Mb and replication time (early to late) against position (Mb).]
· background mutation rate varies ten-fold or more across the genome
· replication time also varies greatly across the genome
· the two are highly correlated
Shown: noncoding mutation rate from TCGA lung cancer dataset; replication time measurements from Chen et al. (2010) Genome Research 20:447.
Sunyaev Lab (Harvard/BWH); Stamatoyannopoulos et al. (2009) Nat. Gen.

Late replication explains most olfactory receptors
[Figure: replication time (Early → Late) for All Genes vs. Olfactory Receptors; label "16 ORs". Panels: chr1 (Mb) and chr8, each showing mutations /Mb and replication time (early to late) against position (Mb); CSMD3 marked; annotation "extrapolate even later replication times".]

Initial model assumed a flat mutational landscape
[Figure labels: mutation rate; gene A, gene B.]
Landscape is actually not flat
Improve estimate by binning together similar genes...
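The binning idea can be sketched as follows. This is only an illustration under stated assumptions, not the talk's implementation: genes are grouped by a single covariate (replication time), the per-gene background is taken as the pooled noncoding mutation rate of the gene's bin, and all gene names, field names, and numbers are hypothetical.

```python
from collections import defaultdict


def binned_background_rates(genes, n_bins=2):
    """Estimate a per-gene background rate (mutations / Mb) by pooling
    noncoding mutations over genes with similar replication time."""
    ranked = sorted(genes, key=lambda g: g["rep_time"])
    pooled = defaultdict(lambda: [0, 0.0])  # bin -> [mutations, covered Mb]
    bin_of = {}
    for i, gene in enumerate(ranked):
        b = i * n_bins // len(ranked)  # equal-sized bins by rank
        bin_of[gene["name"]] = b
        pooled[b][0] += gene["noncoding_muts"]
        pooled[b][1] += gene["covered_mb"]
    return {name: pooled[b][0] / pooled[b][1] for name, b in bin_of.items()}


# Hypothetical genes; rep_time is in arbitrary units (larger = later).
genes = [
    {"name": "geneA", "rep_time": 0.1, "noncoding_muts": 4, "covered_mb": 2.0},
    {"name": "geneB", "rep_time": 0.2, "noncoding_muts": 2, "covered_mb": 1.0},
    {"name": "geneC", "rep_time": 0.8, "noncoding_muts": 18, "covered_mb": 2.0},
    {"name": "geneD", "rep_time": 0.9, "noncoding_muts": 12, "covered_mb": 1.0},
]

rates = binned_background_rates(genes)
for name in sorted(rates):
    print(name, rates[name])
```

Pooling within a bin trades resolution for stability: each gene borrows counts from genes that resemble it, so small genes get a usable estimate, at the cost of blurring real differences inside the bin.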
...or by local regression: average outward until the neighborhood becomes too different from the starting point.
[Figure labels: mutation rate; gene A, gene B.]

Lung cancer: MutSig v0 vs. improved MutSig
* = known lung cancer genes; the other genes listed are the "fishy" genes
MutSig v0: assuming uniform bkgd mutation rate across all genes; 843 genes significantly mutated (q<0.01); all v0 ranks listed have q<10^-7
Improved MutSig: using gene-specific background mutation rates; 52 genes significantly mutated (q<0.01)

Gene        MutSig v0 rank   Improved MutSig rank
* STK11     #15              #1
* NFE2L2    #169             #4
* TP53      #1               #7
* KRAS      #2               #8
* KEAP1     #13              #11
* PIK3CA    #139             #12
OR8H2       #14              #181
OR5T2       not listed       #276
OR10J3      not listed       #334
CSMD3       #48              #388
MUC17       #198             #2614
RYR2        #55              #2898
CSMD1       #100             #4482
TTN         #217             #4825
MUC16       #159             #5650
RYR3        #158             #11496
OR4A15      #7               not listed
OR2T4       #17              not listed
OR2T3       #25              not listed
OR2T6       #31              not listed
OR5D16      #49              not listed
OR2T33      #161             not listed
OR10G8      #172             not listed
OR2L8       #180             not listed

Annotations on the improved list: q<10^-5 beside the known lung cancer genes at the top; q~0.2 beside the olfactory receptor ranks; q=1 beside the lowest ranks. OR8H2 (#181) is the most significant olfactory receptor.

Correcting for variation in mutation rate (OR = olfactory receptors)

Tumor type      Before         After
Lung Squamous   261 (50 OR)    18 (0 OR)
Lung Adeno      511 (93 OR)    33 (1 OR)
Melanoma        177 (7 OR)     61 (0 OR)
Prostate        3 (0 OR)       3 (0 OR)
DLBCL           32 (1 OR)      15 (0 OR)

Ultimate solution: Learn the background rate

Putting it all together
[Figure labels: patient 1, patient 2.]
$\mu_{p,s,c} = \mu_o \cdot F_p \cdot F_g$
$\sum_k (w_{p,k} \cdot v_{k,c}) = \mu_{p,s,c}$
Labels: relative mutation rate of patient p; relative mutation rate in; sum across; contribution of factor k to mutation