Download Functionalities Provided by Each of the PCAWG Data Resources
UC Santa Cruz Previously Published Works
Title: A user guide for the online exploration and visualization of PCAWG data.
Permalink: https://escholarship.org/uc/item/1pb4z60d
Journal: Nature Communications, 11(1)
ISSN: 2041-1723
Authors: Goldman, Mary J; Zhang, Junjun; Fonseca, Nuno A; et al.
Publication Date: 2020-07-07
DOI: 10.1038/s41467-020-16785-6
Peer reviewed

ARTICLE
https://doi.org/10.1038/s41467-020-16785-6
OPEN

A user guide for the online exploration and visualization of PCAWG data

Mary J. Goldman 1,15, Junjun Zhang 2,15, Nuno A. Fonseca 3,15, Isidro Cortés-Ciriano 4,5,14,15, Qian Xiang 2, Brian Craft 1, Elena Piñeiro-Yáñez 6, Brian D. O’Connor 7, Wojciech Bazant 3, Elisabet Barrera 3, Alfonso Muñoz-Pomer 3, Robert Petryszak 3, Anja Füllgrabe 3, Fatima Al-Shahrour 6, Maria Keays 3, David Haussler 1, John N. Weinstein 8, Wolfgang Huber 9, Alfonso Valencia 10,11, Peter J. Park 4, Irene Papatheodorou 3, Jingchun Zhu 1, Vincent Ferretti 12 & Miguel Vazquez 9,13

The Pan-Cancer Analysis of Whole Genomes (PCAWG) project generated a vast amount of whole-genome cancer sequencing resource data. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we provide a user’s guide to the five publicly available online data exploration and visualization tools introduced in the PCAWG marker paper. These tools are ICGC Data Portal, UCSC Xena, Chromothripsis Explorer, Expression Atlas, and PCAWG-Scout. We detail use cases and analyses for each tool, show how they incorporate outside resources from the larger genomics ecosystem, and demonstrate how the tools can be used together to understand the biology of cancers more deeply. Together, the tools enable researchers to query the complex genomic PCAWG data dynamically and integrate external information, enabling and enhancing interpretation.

1 UC Santa Cruz Genomics Institute, Santa Cruz, CA 95064, USA.
2 Ontario Institute for Cancer Research, Toronto, ON M5G 0A3, Canada.
3 European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK.
4 Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
5 Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK.
6 Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain.
7 Data Sciences Platform, Broad Institute, Cambridge, MA, USA.
8 Department of Bioinformatics and Computational Biology, UT MD Anderson Cancer Center, Houston, TX 77030, USA.
9 European Molecular Biology Laboratory, 69117 Heidelberg, Germany.
10 Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain.
11 ICREA, 08010 Barcelona, Spain.
12 CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada.
13 Norwegian University of Science and Technology, Trondheim, Norway.
14 Present address: European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK.
15 These authors contributed equally: Mary J. Goldman, Junjun Zhang, Nuno A. Fonseca, Isidro Cortés-Ciriano.
✉ email: [email protected]

NATURE COMMUNICATIONS | (2020) 11:3400 | https://doi.org/10.1038/s41467-020-16785-6 | www.nature.com/naturecommunications

The Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium aggregated whole-genome sequencing (WGS) data from 2658 cancers across 38 tumor types generated by the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) projects. These sequencing data were re-analyzed with standardized, high-accuracy pipelines to align to the human genome (reference build hs37d5) and identify germline variants and somatically acquired mutations, as described in the PCAWG marker paper (ref. 1). Here we provide a user guide to five tools introduced in the PCAWG marker paper: the ICGC Data Portal, UCSC Xena, Chromothripsis Explorer, Expression Atlas, and PCAWG-Scout. Each of them was created or extended to explore PCAWG data resources (ref. 1). All of the tools aim to streamline analysis and visualization by pre-loading the PCAWG data, so that users do not need to locate, curate, or manage the data, and by making the tools accessible through a web interface. Each of these five tools also integrates other genomics datasets and tools that provide context and insight for interpreting patterns in the PCAWG data, helping this resource fully realize its potential. The datasets and tools integrated include the UCSC Genome Browser (ref. 2), Ensembl (ref. 3), drug target compendia (ref. 4), COSMIC (ref. 5), and even large and complementary sequencing efforts such as GTEx (ref. 6). Intuitive access to these additional tools and datasets is provided either by showing their data side by side or by providing context-dependent URL links.

The five resources in this paper each provide a different perspective on, and focus within, the PCAWG data (Table 1). The ICGC Data Portal serves as the main entry point for accessing all PCAWG data and also enables exploration of PCAWG consensus simple somatic mutations, including point mutations and small indels, each by their frequencies, patterns of co-occurrence, mutual exclusivity, and functional associations. UCSC Xena integrates diverse types of genomic and phenotypic/clinical information at the sample level across the large number of samples, enabling rapid examination of patterns within and across data types. The Chromothripsis Explorer visualizes genome-wide mutational patterns, with a focus on complex genomic events, e.g., chromothripsis and kataegis. This is achieved through interactive Circos plots for each tumor, with different tracks that correspond to allele-specific copy number variants, somatic structural variations, simple somatic mutations, indels, and clinical information. The Expression Atlas focuses on RNA-seq data, supporting queries in either a baseline context (e.g., finding genes that are expressed in prostate adenocarcinoma samples) or in a differential context (e.g., finding genes that are under- or over-expressed in prostate adenocarcinomas compared to adjacent normal prostate samples). PCAWG-Scout allows users to run their own analyses on demand, including prediction of cancer-driver genes, differential gene expression, recurrent structural variations, survival, pathway enrichment, mutations as visualized on a protein structure, mutational signatures, and possible recommended therapies (based on the in-house PanDrugs resource; Supplementary Fig. 1). Each of the five tools offers different visualizations and analyses of the PCAWG data resource, each with its own strengths, and each enabling different insights into the data. When employed together, they provide the user with a deeper understanding of the cancer’s biology (Fig. 1). More information about the tools can be found at the PCAWG Landing Page (http://docs.icgc.org/pcawg).

Results
ICGC Data Portal and a use case. As a main entry point, the ICGC Data Portal (https://dcc.icgc.org; Zhang, ref. 7) provides an intuitive graphical interface for browsing, searching, and

Table 1 Search, visualization, analysis/integration, and download functionalities provided by each of the PCAWG data resources.
Tools compared: ICGC Data Portal, UCSC Xena, Chromothripsis Explorer, Expression Atlas, PCAWG-Scout

Search
  Search by demographic data, specimen phenotype, molecular subtype: Y Y Y
  Search by genes and/or variants: Y Y Y Y
  Search by genomic coordinates: Y Y

Visualize
  Visualize multiple types of data together: Y Y Y Y
  Visualize coding variants: Y Y Y Y
  Visualize non-coding variants: Y Y Y
  Visualize structural variants: Y Y Y
  Visualize mutational signatures and predicted drivers: Y
  Visualize genome-wide profiles, including LOH, in Circos plots: Y
  Visualize tissue expression on a human figure: Y
  Visualize gene co-expression: Y Y
  Visualize pathways, therapeutic associations: Y Y
  Visualize summary of BAMs/VCFs: Y

Analysis
  Kaplan–Meier analysis with statistics: Y Y Y
  Gene set/pathway enrichment analysis: Y Y Y
  View non-identifiable analysis results of protected data: Y Y Y Y
  Discover differentially or co-expressed genes, mutually exclusive genomic events: Y
  Annotations from other resources: Y Y Y Y

Download
  Programmatic data download: Y Y Y Y
  Download BAMs, VCFs, primary files: Y
  Download secondary processed data: Y Y Y Y

[Figure: Online resources for PCAWG data visualization. Panel a: ICGC data repository, search and download > 70,000 primary BAMs and VCFs. Panel b: UCSC Xena visual spreadsheet, integrate RNAseq and WGS data; columns: Donor, ERG expression (RNAseq), ERG fusion (RNAseq), ERG structural variant (WGS), TMPRSS2 SV (WGS), SLC45A3 SV (WGS), Histology (Prost-AdenoCA). Panel c: Chromothripsis Explorer, donor DO36283, Circos plot with ERG and TMPRSS2 labeled. Panel d: Expression Atlas, compare normal vs. tumor expression. Additional label: PCAWG AWGs primary results.]
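The UCSC Xena panel of the figure pairs each donor's ERG expression with an ERG fusion call, and the pattern it is meant to convey can be checked with a few lines of Python. The sketch below is illustrative only. It assumes that, in the text recovered from the panel, each donor ID is followed by its ERG expression value and then a 0/1 fusion call; the expression units are not stated in this copy. The names `PANEL_ROWS` and `mean_expression_by_fusion` are hypothetical and are not part of any Xena API.

```python
from statistics import mean

# (donor ID, ERG expression, ERG fusion call) as read from the Xena panel text.
# Assumption: the number after each donor ID is its expression value and the
# following 0/1 is the fusion call; units are not given in this copy.
PANEL_ROWS = [
    ("DO36283", 5.17, 1), ("DO36345", 4.91, 1), ("DO36356", 4.40, 0),
    ("DO36273", 4.40, 1), ("DO36372", 3.99, 1), ("DO36359", 3.00, 1),
    ("DO36265", 2.91, 1), ("DO36335", 2.29, 1), ("DO36285", 1.57, 0),
    ("DO36254", 1.16, 0), ("DO36360", 1.13, 0), ("DO36262", 0.808, 0),
    ("DO36239", 0.705, 0), ("DO46813", 0.392, 0), ("DO49815", 0.164, 0),
    ("DO36248", 0.0809, 0), ("DO36253", -0.561, 0), ("DO36226", -0.998, 0),
]


def mean_expression_by_fusion(rows):
    """Return the mean ERG expression for fusion-negative (0) and
    fusion-positive (1) donors."""
    groups = {0: [], 1: []}
    for _donor, expression, fusion in rows:
        groups[fusion].append(expression)
    return {status: mean(values) for status, values in groups.items()}


summary = mean_expression_by_fusion(PANEL_ROWS)
for status in (1, 0):
    label = "fusion detected" if status == 1 else "no fusion detected"
    print(f"{label}: mean ERG expression = {summary[status]:.2f}")
```

Under these assumptions, donors with a fusion call have a higher mean ERG expression than those without. This is the kind of cross-data-type pattern the visual spreadsheet is designed to make visible at a glance.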