Cancer Focus on Computer Resources Research

CRAVAT 4: Cancer-Related Analysis of Variants Toolkit David L. Masica1,2, Christopher Douville1,2, Collin Tokheim1,2, Rohit Bhattacharya2,3, RyangGuk Kim4, Kyle Moad4, Michael C. Ryan4, and Rachel Karchin1,2,5

Abstract

Cancer sequencing studies are increasingly comprehensive and level interpretation, including joint prioritization of all nonsilent well powered, returning long lists of somatic mutations that can be mutation consequence types, and structural and mechanistic visu- difficult to sort and interpret. Diligent analysis and quality control alization. Results from CRAVAT submissions are explored in an can require multiple computational tools of distinct utility and interactive, user-friendly web environment with dynamic filtering producing disparate output, creating additional challenges for the and sorting designed to highlight the most informative mutations, investigator. The Cancer-Related Analysis of Variants Toolkit (CRA- even in the context of very large studies. CRAVAT can be run on a VAT) is an evolving suite of informatics tools for mutation inter- public web portal, in the cloud, or downloaded for local use, and is pretation that includes mutation mapping and quality control, easily integrated with other methods for cancer omics analysis. impact prediction and extensive annotation, - and mutation- Cancer Res; 77(21); e35–38. Ó2017 AACR.

Background quickly returning mutation interpretations in an interactive and user-friendly web environment for sorting, visualizing, and An investigator's work is far from over when results are returned inferring mechanism. CRAVAT (see Supplementary Video) is from the sequencing center. Depending on the service, genetic suitable for large studies (e.g., full-exome and large cohorts) mutation calls can require additional mapping to include all and small studies (e.g., gene panel or single patient), performs relevant RNA transcripts or correct sequences. Assignment all mapping and assigns sequence ontology, predicts mutation of mutation consequence type or sequence ontology (e.g., mis- impact using multiple bioinformatics classifiers normalized to sense, splice, or indel) might be incomplete. Once mapping is provide comparable output, allows for joint prioritization of all complete and sequence ontology assigned, the task of interpreting nonsilent mutation types, organizes annotation from many mutation impact remains. This interpretation can entail annota- sources on graphical displays of protein sequence and 3D tion from multiple sources, employing one or more bioinformat- structure, and facilitates dynamic filtering and sorting of results ics classifiers, testing for functional or pathway enrichment, and so that the most important mutations can be quickly retrieved referencing each mutation against databases of known cancer from large submissions. In addition to running on our web- drivers or common polymorphisms. Making full use of these server, CRAVAT is available as a Docker (https://www.docker. diverse resources can be a daunting task. The individual utilities com/) image to run in a self-contained environment locally or often assess a limited number of consequence types, require in the cloud and is easily integrated with other software different formatting of input data, and return disparate output packages. The following describes the details of the CRAVAT not amenable to direct comparison. Once the investigator has interactive results explorer; machine-readable text and Excel managed to assemble a suitable ad hoc pipeline to process and spreadsheets are also provided with each job submission. interpret each mutation, the task of effectively sorting and filtering fi large lists of mutations may be dif cult. CRAVAT 4.x The Cancer-Related Analysis of Variants Toolkit (CRAVAT; ref. 1) is designed to streamline the many steps outlined above, The CRAVAT webserver (cravat.us) prompts the user to submit sequencing data, select impact prediction methods, and provide an email address for communicating job status. Once sequencing 1Department of Biomedical Engineering, The Johns Hopkins University, Balti- data are submitted, CRAVAT performs quality control and maps more, Maryland. 2The Institute for Computational Medicine, The Johns Hopkins all mutations from genome to transcript, mapping each mutation 3 University, Baltimore, Maryland. Department of Computer Science, The Johns to all relevant transcripts, and then to protein sequence; addi- 4 Hopkins University, Baltimore, Maryland. In Silico Solutions, Falls Church, tionally, mutations are mapped to available protein structures Virginia. 5Department of Oncology, The Johns Hopkins University School of Medicine, Baltimore, Maryland. and homology models. Next, all mutation consequence types (sequence ontologies) are assigned, including missense, stop Note: Supplementary data for this article are available at Cancer Research gains and losses, in-frame and frame-shifting insertions and Online (http://cancerres.aacrjournals.org/). deletions (indel), splice, and silent mutations. Corresponding Author: Rachel Karchin, Johns Hopkins University, 316 Hacker- man Hall, 3400 North Charles St., Baltimore, MD 21218. Phone: 410-516-5578; Mutation impact prediction Fax: 410-516-5294; E-mail: [email protected] CRAVAT employs two approaches for predicting mutation doi: 10.1158/0008-5472.CAN-17-0338 impact, namely Cancer-Specific High-Throughput Annotation of Ó2017 American Association for Cancer Research. Somatic Mutations (CHASM; ref. 2) and Variant Effect Scoring

www.aacrjournals.org e35

Downloaded from cancerres.aacrjournals.org on September 30, 2021. © 2017 American Association for Cancer Research. Masica et al.

Tool (VEST; refs. 3, 4). CHASM uses a random Forest classifier that mutation-specific information accessible through the tooltip is trained from a positive class of cancer driver mutations from (mouse-over); the user can optionally choose to plot tissue- COSMIC (5) and a putative set of passenger mutations; thus, specific mutations from TCGA data. CHASM classifies mutations as cancer drivers or passengers (ran- The CRAVAT interactive Variant tab serves information that will dom Forests are ensembles of decision trees derived via machine be familiar to users after exploring the Gene tab, but here, the user learning). VEST uses a random Forest classifier that is trained from can drill down into the details even further (see for instance, the a positive class of disease-associated germline variants from the second spreadsheet in Fig. 1B). This tab includes P values and Gene Mutation Database (6) and a negative class of FDRs for each mutation in the context of all relevant transcripts. common variants from the exome sequencing map (7); thus, Frequencies from population databases (1000 Genomes, ref. 10; VEST classifies mutations as pathogenic or benign. CHASM can ESP6500, ref. 7; and ExAC, ref. 11) are graphically displayed and score missense mutations. VEST scores all nonsilent mutation can be further partitioned by ethnicity. Figure 1B shows an consequence types and facilitates joint prioritization (i.e., scores example lollipop diagram, where the submitted R230I missense are directly comparable among all mutations, regardless of substitution (top side of lollipop diagram) is compared with sequence ontology). Both CHASM and VEST provide composite TCGA kidney cancer data (bottom of lollipop diagram), revealing gene-level P values and false discovery rates (FDR), as well as the nearby D228N substitution in the ACVR1C gene. mutation-specific P values and FDRs. MuPIT interactive Results summary Figure 1C shows selected displays from Mutation Position Figure 1A shows selected displays from the CRAVAT Summary Imaging Toolbox (MuPIT) Interactive (12). MuPIT maps muta- tab. This summary includes aggregate results for all mutations tion positions to annotated, interactive 3D protein structures, and from a single submission, at the patient and cohort level. Distri- links directly from the gene and mutation entries in the above- bution of submitted mutations by gene function and sequence mentioned widgets and tables (see for instance, the red pill– ontology are displayed as pie charts (Fig. 1A); a pie chart showing shaped button in the bottom right-hand corner of Fig. 1B). MuPIT distribution by Cancer Genome Landscape definitions (8) is also achieves high coverage by including experimental structures from provided. -specific distribution of nonsilent, mis- the and homology models filtered using sense, and inactivating mutations is displayed on a circos plot standard measures for model quality. MuPIT was designed for (Fig. 1A). The results summary also shows the most frequently investigators without expertise in protein structure/function and mutated (corrected for gene length) as a bar chart, the top 10 can be used to develop mechanistic hypotheses regarding muta- most significant genes as determined by VEST or CHASM com- tions of interest. posite P values, and sample-level consequence-type distributions For each structure, the MuPIT Protein tab allows manipula- displayed as a series of stacked histograms (Fig. 1A). Gene set tion of graphical parameters, such as color, outline, opacity, enrichment of collections of genes with significant composite and style (cartoon, stick, space fill, etc.), as well as options for VEST or CHASM P values is embedded as biological networks displaying specific domains, chains, and ligands; additionally, from the Network Data Exchange (NDEx; ref. 9); here, the user can publication-ready images can be exported from this tab. The toggle through each network detected for their submission and Mutations tab allows the user to toggle through submitted view significant genes in the context of larger biological networks mutations alongside statistically significant 3D mutation clus- (Fig. 1A). ters from TCGA data; statistically significant mutation clusters are precomputed for each of 31 TCGA cancer subtypes using the Gene- and mutation-level analysis HotMAPS algorithm (13). In Fig. 1C, MuPIT is located at the Figure 1B shows selected displays from the CRAVAT Gene and Annotations tab, and user-submitted mutations (shown in Variant tabs. The Gene tab displays an interactive spreadsheet, green) are spatially proximal to known TGFBR1 binding and including every gene that harbored a mutation from the submis- active sites (cyan and blue), suggesting a potential mechanistic sion (truncated example shown at top of Fig. 1B). This table role for these mutations. includes gene-specific mutation frequency, the most severe con- sequence type among submitted mutations for each gene, pres- CRAVAT 5 ence of statistically significant mutation clustering in the related The next update of CRAVAT is scheduled for release in 2017 protein [inferred from The Cancer Genome Atlas (TCGA) data, see and will include annotations for noncoding mutations, detec- below], composite VEST and CHASM P values and FDRs, classi- tion of statistically significant sequence hotspots for mutations, fication as a tumor suppressor or oncogene if applicable, associ- updated VEST and CHASM classifiers, inclusion of top-per- ated drugs, COSMIC occurrences, disease status from ClinVar (a forming impact prediction algorithms from other groups, database of clinically annotated variants), and migration to the current reference (GRCh38), terms. The table can be sorted by any category and filtered by and integration with the Broad Integrated Genomics Viewer. CHASM and VEST significance (see filtering widget, Fig. 1B); any The corresponding MuPIT update will include pathogenic instance of this table can be exported in Excel format. Clicking any germline mutations, and rare and common germline mutations row (i.e., gene) in this table retrieves a drill-down table and a from healthy populations for optional mapping and display lollipop chart. The drill-down table displays information for each onto protein structure. mutation from the selected gene, including chromosome posi- tion, sequence ontology, protein change, mutation-specific VEST and CHASM P values, and the reference and alternative DNA base In Context with Other Omics Analysis (s). The lollipop diagram plots all submitted mutations for the CRAVAT is developed for easy integration with other software, selected gene onto the protein sequence, with domain labels and including programmatic interfaces for web services. This flexibility

e36 Cancer Res; 77(21) November 1, 2017 Cancer Research

Downloaded from cancerres.aacrjournals.org on September 30, 2021. © 2017 American Association for Cancer Research. Interpreting and Prioritizing Somatic Mutations in Cancer

A Acyl-CoA de.. Circos plot ATP binding ACTN 4 Gene function ATPase acti.. Nonsilent Electron car.. Missense ACTN 1 Inactivating Fatty-acyl-C.. ACTA 1 64 Flavin adeni.. Oxioreducta.. ACTN 2 24 Protein bind.. Protein hom.. Transporter a.. Sequence ontologies Mutation count by ontology Frameshift del Frameshift ins P Missense Pathway Genes 224 RAC1 signaling 0.0001 ABI1,ABI2,ACTR2,ACTR3 585 Stopgain

Mutation count FA kinase sign. Synonymous 0.0002 ACTA1,ACTN1,ACN2,ACTN4 PDGFRb sign. 0.0007 ABI1,ABL1,ACTN4,ACTR2, Splice site Sample B Filter Has a CHASM P value: VEST Number mutation CHASM cancer CHASM VEST 0.05 Most severe pathogenicity HUGO of in a driver composite cancer driver pathogenicity Driver CGC TARGET Occurrences ClinVar sequence composite P value VEST P value: symbol variants TCGA P value FDR FDR (non-silent) genes Role in Disease ontology (non-silent) COSMIC 0.05 Mutation (missense) (missense) Cluster Columns ABL1 10 Splice site 0 0.05 0.034 0.1 Oncogene Imatinib,.. 591 Gene Info ACADL 1 Missense 0.3098 0.55 0.034 0.1 77 HUGO symbol ACVR1C 8 Missense 0 0.05 0.0191 0.05 163

Number of variants ACTR2 11 Missense 0.0656 0.2 0.0356 0.1 56 Variant Impact ACSL4 1 Missense 0.176 0.45 0.0384 0.1 124 Mental r... VEST CHASM VEST Disease Association CHASM cancer pathogenicity Reference Alternate ID Chrom Position Sequence ontology Protein_change driver P value P value Driver genes base(s) base(s) (missense) (non-silent) CGC Role var989 chr2 158406760 Missense R230I 0.063 0.0191 C A

TARGET var994 chr2 158406837 Synonymous G204G A C Occurrences in COSMIC Ocurrences in COSMIC Variant info Population stats by primary site UID: var989 Pathogenicity Cancer driver PubMed articles Gene: ACVR1C 0.000 0.000 0.000 impact impact PubMed search term 1000 Genomes ESP 6500 ExAC ndex VEST CHASM ClinVar disease Chromosome: chr2 Ref AA: D ClinVar XRef Position: 158406760 KIRC Alternate AA: N CGC Inheritance Ref base(s): C AA position: 228 MuPIT CGC Tumor Types Somatic Alt base(s): A My mutations Ontology: Missense P = 0.019 P = 0.063 CGC Tumor Types Germline dbSNP: None Tissue: KIRC Structure Sample: s_10 Study # samples:7 TCGA Mutations

Gene: TGFBR1 C Crystal structure of the cytoplasmic domain of the type Protein Mutations Annotations TGFb receptor in complex with FKB12 (structure ide: 1b6c)

Figure 1. Selected widgets and displays from the CRAVAT Summary tab (A), the Gene and Variant tabs (B), and from MuPIT (C).

www.aacrjournals.org Cancer Res; 77(21) November 1, 2017 e37

Downloaded from cancerres.aacrjournals.org on September 30, 2021. © 2017 American Association for Cancer Research. Masica et al.

affords the combination of CRAVAT's mutation interpretation Acquisition of data (provided animals, acquired and managed patients, capabilities with software considering other omics datatypes (e.g., provided facilities, etc.): C. Tokheim copy number alterations, methylation, expression). For instance, Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): C. Tokheim, R. Bhattacharya, R. Kim, K. Moad, other methods currently integrating CRAVAT include Trinity, M.C. Ryan which assembles Illumina RNA-seq data (14); UCSC Xena, which Writing, review, and/or revision of the manuscript: D.L. Masica, C. Douville, can combine many omics and clinical datatypes (15); and the C. Tokheim, R. Karchin NDEx, an online commons for biological network data (9). Administrative, technical, or material support (i.e., reporting or organizing CRAVAT can also be used with Galaxy tools. data, constructing databases): C. Douville, K. Moad Study supervision: R. Karchin Disclosure of Potential Conflicts of Interest No potential conflicts of interest were disclosed. Grant Support This work was supported by NIH, NCI grant U24CA204817-01 to R. Karchin. Authors' Contributions Conception and design: C. Douville, M.C. Ryan, R. Karchin Received February 2, 2017; revised March 18, 2017; accepted June 14, 2017; Development of methodology: C. Douville, M.C. Ryan, R. Karchin published online November 1, 2017.

References 1. Douville C, Carter H, Kim R, Niknafs N, Diekhans M, Stenson PD, et al. 9. Pratt D, Chen J, Welker D, Rivas R, Pillich R, Rynkov V, et al. NDEx, the CRAVAT: cancer-related analysis of variants toolkit. Bioinformatics network data exchange. Syst 2015;1:302–5. 2013;29:647–8. 10. 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, 2. Carter H, Chen S, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, et al. DePristo MA, Durbin RM, et al. An integrated map of genetic variation from Cancer-specific high-throughput annotation of somatic mutations: 1,092 human genomes. Nature 2012;491:56–65. computational prediction of driver missense mutations. Cancer Res 11. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. 2009;69:6660–7. Analysis of protein-coding genetic variation in 60,706 . Nature 3. Carter H, Douville C, Stenson PD, Cooper DN, Karchin R. Identifying 2016;536:285–91. Mendelian disease genes with the variant effect scoring tool. BMC Geno- 12. Niknafs N, Kim D, Kim R, Diekhans M, Ryan M, Stenson PD, et al. MuPIT mics 2013;14:S3. interactive: webserver for mapping variant positions to annotated, inter- 4. Douville C, Masica DL, Stenson PD, Cooper DN, Gygax DM, Kim R, et al. active 3D structures. Hum Genet 2013;132:1235–43. Assessing the pathogenicity of insertion and deletion variants with the 13. Tokheim C, Bhattacharya R, Niknafs N, Gygax DM, Kim R, Ryan M, et al. variant effect scoring tool (VEST-Indel). Hum Mutat 2016;37:28–35. Exome-scale discovery of hotspot mutation regions in human cancer using 5. Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, Boutselakis H, et al. 3D protein structure. Cancer Res 2016;76:3719–31. COSMIC: exploring the world's knowledge of somatic mutations in human 14. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, cancer. Nucleic Acids Res 2014;43:D805–11. et al. De novo transcript sequence reconstruction from RNA-seq using the 6. Stenson PD, Mort M, Ball EV, Howells K, Phillips AD, Thomas NS, et al. The Trinity platform for reference generation and analysis. Nat Protoc Human Gene Mutation Database: 2008 update. Genome Med 2009;1:13. 2013;8:1494–512. 7. Fu W, O'Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, et al. Analysis of 15. Goldman M, Craft B, Zhu J, Swatloski T, Cline M, Haussler D. Abstract 6,515 exomes reveals the recent origin of most human protein-coding 5270: The UCSC Xena system for integrating and visualizing functional variants. Nature 2013;493:216–20. genomics. In: Proceedings of the 107th Annual Meeting of the American 8. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Association for Cancer Research; 2016 Apr 16–20; New Orleans, LA. Cancer genome landscapes. Science 2013;339:1546–58. Philadelphia, PA: AACR; 2016. Abstract nr 5270.

e38 Cancer Res; 77(21) November 1, 2017 Cancer Research

Downloaded from cancerres.aacrjournals.org on September 30, 2021. © 2017 American Association for Cancer Research. CRAVAT 4: Cancer-Related Analysis of Variants Toolkit

David L. Masica, Christopher Douville, Collin Tokheim, et al.

Cancer Res 2017;77:e35-e38.

Updated version Access the most recent version of this article at: http://cancerres.aacrjournals.org/content/77/21/e35

Cited articles This article cites 14 articles, 3 of which you can access for free at: http://cancerres.aacrjournals.org/content/77/21/e35.full#ref-list-1

Citing articles This article has been cited by 10 HighWire-hosted articles. Access the articles at: http://cancerres.aacrjournals.org/content/77/21/e35.full#related-urls

E-mail alerts Sign up to receive free email-alerts related to this article or journal.

Reprints and To order reprints of this article or to subscribe to the journal, contact the AACR Publications Department at Subscriptions [email protected].

Permissions To request permission to re-use all or part of this article, use this link http://cancerres.aacrjournals.org/content/77/21/e35. Click on "Request Permissions" which will take you to the Copyright Clearance Center's (CCC) Rightslink site.

Downloaded from cancerres.aacrjournals.org on September 30, 2021. © 2017 American Association for Cancer Research.