CRAVAT and MuPIT Interactive: Web Tools for Cancer Mutation Analysis
Rachel Karchin, Ph.D.
Department of Biomedical Engineering
Institute for Computational Medicine
Johns Hopkins University
The Cancer Genome Atlas' 2nd Annual Scientific Symposium
November 27-28, 2012

Need for computational tools to analyze large-scale cancer mutation data

Goal is to provide an end-to-end mutation analysis workflow
List of mutations from tumor sequencing
• Map to transcripts
• Identify type of change (missense, nonsense, silent)
• Identify known variants and mutations
Analysis
• Predict driver vs. random mutations
• Predict functional impact of mutations
• Visualize mutations on tertiary structure
• Find significantly mutated genes and pathways
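
The "identify type of change" step in the workflow can be illustrated with a minimal sketch that compares the reference and alternate codons under the standard genetic code. The function name `classify_change` and the "other" label for changes at a reference stop codon are illustrative assumptions, not part of CRAVAT.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_change(ref_codon: str, alt_codon: str) -> str:
    """Label a single-codon substitution (hypothetical helper, not CRAVAT's API)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"      # same amino acid (or stop) before and after
    if ref_aa == "*":
        return "other"       # assumption: loss of a stop codon is grouped as "other"
    if alt_aa == "*":
        return "nonsense"    # amino acid replaced by a stop codon
    return "missense"        # one amino acid replaced by another


print(classify_change("GGT", "GTT"))  # glycine to valine
```

This only covers single-codon substitutions; splice-site changes and insertions or deletions would need separate handling.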

The majority of somatic mutations in tumor exomes are missense
[Figure: proportions of somatic mutation types (Missense, Nonsense, Silent, Splice, Other) across tumor exome studies, including colorectal and breast. Studies shown: Sjoblom et al 2006; Hunter et al 2005; Davies et al 2005; Stephens et al 2005; Greenman et al 2007; Jones et al 2008; Parsons et al 2008; McLendon et al 2008; Ding et al 2008; Parsons et al 2011; Jones et al 2010; TCGA 2010; Stransky et al 2011; Li et al 2011]

http://www.cravat.us
Tools for evaluating missense mutations
– CHASM: cancer driver analysis
– VEST: Functional effect analysis
– Annotations (1000g, ESP6500, COSMIC, GeneCards, PubMed)

http://mupit.icm.jhu.edu
Interactive visualization on 3D protein structure
– Automatic mapping onto available structures
– Simple interactive interface
– UniProtKB feature table annotations provided
– Publication quality figures

Outline
• Introduction to CRAVAT
• Introduction to MuPIT
• Future plans
• Your input

CRAVAT Web Server
http://www.cravat.us
[Figure: the end-to-end mutation analysis workflow diagram, repeated]

Input: List of mutations from tumor sequencing
• Paste into text box or upload text file
• Genomic Coordinates (RECOMMENDED)
– Unique id for variant
– Chromosome
– 0-based start position
– 1-based end position
– Strand on which bases are reported
– Reference base
– Alternate base
– sample ID (optional)
• Transcript Coordinates
– Unique id for variant
– Transcript
– Amino Acid Substitution
– sample ID (optional)
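
As a rough illustration of the genomic-coordinate input, the sketch below parses one input line into the fields listed above, in the order the slide gives them. The whitespace delimiter, the `+`/`-` strand symbols, the `GenomicVariant` type, and the example line are all assumptions made for illustration; the server's actual file format should be checked before relying on this.

```python
from typing import NamedTuple, Optional


class GenomicVariant(NamedTuple):
    uid: str                  # unique id for the variant
    chrom: str                # chromosome
    start0: int               # 0-based start position
    end1: int                 # 1-based end position
    strand: str               # strand on which bases are reported
    ref: str                  # reference base
    alt: str                  # alternate base
    sample_id: Optional[str]  # optional sample ID


def parse_genomic_line(line: str) -> GenomicVariant:
    """Parse one line; assumes whitespace-delimited fields in slide order."""
    fields = line.split()
    if len(fields) not in (7, 8):
        raise ValueError(f"expected 7 or 8 fields, got {len(fields)}")
    uid, chrom, start, end, strand, ref, alt = fields[:7]
    sample_id = fields[7] if len(fields) == 8 else None
    start0, end1 = int(start), int(end)
    # With a 0-based start and a 1-based end, one base spans end1 - start0 == 1.
    if end1 <= start0:
        raise ValueError("end position must be greater than start position")
    if strand not in ("+", "-"):  # assumption: strands are written as + or -
        raise ValueError(f"unexpected strand: {strand!r}")
    return GenomicVariant(uid, chrom, start0, end1, strand, ref, alt, sample_id)


# Hypothetical example line, not taken from real data.
variant = parse_genomic_line("var1 chr1 100 101 + A G sampleA")
print(variant.chrom, variant.start0, variant.end1)
```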

Map to transcripts
• Transcripts are scored by coverage of coding bases and agreement between RefSeq and Ensembl transcript definitions.
• A greedy algorithm selects the “best transcript” for each mutated position
[Figure: example transcripts at a locus with their scores (0.557, 0.378, 0.731, 0.685)]
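
One simple reading of "best transcript per mutated position" is sketched below: among the transcripts that cover a position, pick the one with the highest precomputed score. This is an illustrative assumption, not the published CRAVAT algorithm. The `Transcript` type, the span-only notion of coverage (exons are ignored), the tie-breaking rule, and the example coordinates are hypothetical; only the score values are borrowed from the figure.

```python
from typing import List, NamedTuple, Optional


class Transcript(NamedTuple):
    name: str
    score: float  # precomputed transcript score
    start: int    # 0-based, inclusive
    end: int      # exclusive


def best_transcript(position: int, transcripts: List[Transcript]) -> Optional[Transcript]:
    """Return the highest-scoring transcript covering `position`, or None."""
    covering = [t for t in transcripts if t.start <= position < t.end]
    if not covering:
        return None
    # max() keeps the first transcript encountered when scores tie.
    return max(covering, key=lambda t: t.score)


# Hypothetical transcripts; names and coordinates are made up.
transcripts = [
    Transcript("T1", 0.557, 0, 100),
    Transcript("T2", 0.731, 50, 200),
    Transcript("T3", 0.685, 50, 300),
]
for pos in (10, 75, 250, 400):
    chosen = best_transcript(pos, transcripts)
    print(pos, chosen.name if chosen else None)
```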
Douville et al. CRAVAT: Cancer-Related Analysis of VAriants Toolkit. 2012 (submitted)

Identify known variants and mutations
HUGO symbol | Transcript | Amino acid position | Amino acid change | dbSNP | 1000 Genomes allele frequency | ESP6500 allele frequency (European American) | ESP6500 allele frequency (African American) | Occurrences in COSMIC [exact nucleotide change] | Occurrences in COSMIC by primary sites [exact nucleotide change]
RALYL | NM_173848.5 | 270 | M270I | | 0 | 0 | 0 | 1 | upper_aerodigestive_tract(1)
TRIM29 | NM_012101.3 | 216 | P216H | | 0 | 0 | 0 | 1 | breast(1)
ZNF317 | NM_001190791.1 | 280 | G280V | | 0 | 0 | 0 | 1 | large_intestine(1)
CD248 | NM_020404.2 | 115 | T115N | rs140616335 | 0 | 0 | 0.0007 | 0 |
SRPX2 | NM_014467.2 | 393 | R393Q | | 0 | 0 | 0.0007 | 0 |
KDR | NM_002253.2 | 539 | G539R | rs55716939 | 0 | 0.0022 | 0.0005 | 0 |
LGI2 | NM_018176.3 | 322 | R322H | rs149130054 | 0.0005 | 0.0001 | 0 | 0 |
LILRB1 | NM_006669.3 | 189 | P189L | | 0 | 0.0001 | 0 | 0 |
Annotation sources: dbSNP, 1K Genomes, ESP6500, COSMIC

Analysis: Predict driver vs. random mutations
Supervised machine learning!
[Figure: a training set and an unlabeled mutation (?). Image source: http://euvolution.com/futurist-transhuman-news-blog/]
Carter et al. Cancer-specific high-throughput annotation of somatic mutations. Cancer Research. 2009

Where do the decision rules come from?
• amino acid substitution properties
• human polymorphism properties
• predicted local protein structure
• UniProtKB features
• regional amino acid composition
• vertebrate ortholog conservation (genome alignments)
• superfamily homolog conservation (deep protein alignments)
• amino acid compatibility with orthologs and homologs
86 features pre-computed exome-wide in SNVBox
Wong et al. CHASM and SNVBox toolkit. Bioinformatics. 2011

Where does the training set come from?
Driver missense mutations:
• Gene harbors at least 5 mutations
• Oncogene: >15% of nonsynonymous (NS) mutations occur at the same position
• Tumor suppressor: >15% of NS mutations are inactivating
• All missense mutations in these genes are included
Sources: Bozic et al PNAS 2010; Jones et al Science 2010; Vogelstein et al. submitted

Random passenger missense mutations:
• Generated in silico with a generative model
• Designed to match tumor tissue di-nucleotide mutation spectrum
• In CHASM feature space, they do not look like high allele frequency SNPs
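
The driver-gene selection rules can be written out as a small sketch. The thresholds (at least 5 mutations, more than 15%) come from the slide; everything else is an assumption made for illustration: the `(position, type)` input format, the set of mutation types counted as inactivating, the requirement that a hotspot position be hit at least twice, and checking the oncogene rule before the tumor suppressor rule.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Assumption: which mutation types count as "inactivating" is not stated on the slide.
INACTIVATING = {"nonsense", "frameshift", "splice"}


def classify_gene(
    mutations: List[Tuple[int, str]],
    min_mutations: int = 5,
    threshold: float = 0.15,
) -> Optional[str]:
    """Classify one gene from its nonsynonymous mutations.

    `mutations` holds (amino acid position, mutation type) pairs.
    Returns "oncogene", "tumor suppressor", or None.
    """
    n = len(mutations)
    if n < min_mutations:
        return None
    position_counts = Counter(pos for pos, _ in mutations)
    top = max(position_counts.values())
    # Assumption: a hotspot must be hit at least twice, otherwise any gene
    # with 5 or 6 mutations would pass the 15% rule trivially.
    if top >= 2 and top / n > threshold:
        return "oncogene"
    inactivating = sum(1 for _, kind in mutations if kind in INACTIVATING)
    if inactivating / n > threshold:
        return "tumor suppressor"
    return None


# Hypothetical gene: 3 of 10 mutations fall on position 12.
hotspot_gene = [(12, "missense")] * 3 + [(p, "missense") for p in range(100, 107)]
print(classify_gene(hotspot_gene))
```

Under the slide's last rule, all missense mutations in genes that pass either test would then enter the driver training set.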

Tissue-specific classifiers in CRAVAT
• Select from pull down list
• Pre-constructed classifiers for tumor types under analysis by TCGA and ICGC
• ‘Other’ offers a generic classifier if tissue under study is not yet supported

Analysis: Predict functional impact of mutations
• Variant Effect Scoring Tool (VEST)
– Identify functional mutations
– Same supervised learning algorithm and features as CHASM
– Different training set
  • 47000 disease missense mutations from HGMD [1] 2012v2
  • 45000 neutral missense variants from the ESP6500 [2] (AF > 1%)

[1] Stenson P, Mort M, Ball E, Howells K, Phillips A, Thomas N, Cooper D, et al.: The human gene mutation database: 2008 update. Genome Med 2009, 1:13.
[2] Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA [http://evs.gs.washington.edu/EVS/].

Why VEST?
[Figure: diagram relating All somatic mutations, Functional somatic mutations, and Driver somatic mutations; mutations are labeled Functional (Damaging) or Benign]
Driver vs. Passenger and Damaging vs. Benign classifiers provide complementary benefits for mutation analysis.

Analysis: Find significantly mutated genes and pathways
• Gene annotations
– PubMed
– GeneCards
• Gene scores
– Combined P-values of VEST or CHASM scores (Stouffer’s)
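
For reference, the unweighted form of Stouffer's method converts each one-sided P-value to a Z-score, sums the Z-scores, divides by the square root of their count, and converts back to a P-value. The sketch below shows that textbook form only. Whether CRAVAT applies weights, and exactly which per-mutation P-values it combines for a gene, is not stated here, so the function name and the clamping constant `eps` are assumptions.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def stouffer_combined_p(p_values: Sequence[float], eps: float = 1e-12) -> float:
    """Combine one-sided P-values with unweighted Stouffer's Z method."""
    if not p_values:
        raise ValueError("need at least one P-value")
    z_scores = []
    for p in p_values:
        # Clamp away from 0 and 1, where the inverse normal CDF is undefined.
        p = min(max(p, eps), 1.0 - eps)
        z_scores.append(_STD_NORMAL.inv_cdf(1.0 - p))
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - _STD_NORMAL.cdf(combined_z)


# Two mutations in one gene, each with P = 0.05, combine to roughly P = 0.01.
print(round(stouffer_combined_p([0.05, 0.05]), 3))
```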

Submit

Job completed: Email Notification

CRAVAT Results
If MuPIT Interactive input file is requested
• File with mutations formatted for input to MuPIT Interactive

MuPIT Interactive
mupit.icm.jhu.edu
Input genomic coordinates of mutations of interest

Table of protein structures onto which your mutations can be mapped
A subset of mutations did not map onto protein structures
Position-specific annotations available for each structure can be used to select the most interesting structures.

Selecting a structure brings you to a details page for interactive viewing
Applet supports rotation and zoom
Your mutations
Annotations from UniProt feature table

Easy-to-use interface for customized figures
Firehose breast cancer mutations
Five mutations cluster together on a structure of RUNX1 (in complex with CBFbeta)

Zoom-in for view of the mutations (green) and a functionally annotated region (red)
Displayed
Controls

Functionally annotated region with text display

Future Functionality
• Integration of CRAVAT and MuPIT into a single end-to-end service
• GANT - a new statistic to find significantly mutated genes and pathways
• Exhaustive mapping of genomic coordinates onto all protein coding transcripts defined at a locus

External integrations
Coming in 2013

Thank you for listening!
Please visit Poster #75 to talk with us about CRAVAT and MuPIT
Please visit Poster #80 to learn about the Intogen-SM analysis pipeline from Lopez-Bigas Lab (Spain)

Acknowledgments
NIH R21 CA135866
NIH R21 CA152432
NIH 3U24CA143858--2S1
NSF CAREER DBI 0845275