An RNA-Seq Visualization Tool for Cancer Genomics Aaron R
Total Page:16
File Type:pdf, Size:1020Kb
Shifman et al. BMC Genomics (2016) 17:75 DOI 10.1186/s12864-016-2389-8 SOFTWARE Open Access Cascade: an RNA-seq visualization tool for cancer genomics Aaron R. Shifman, Radia M. Johnson and Brian T. Wilhelm* Abstract Background: Cancer genomics projects are producing ever-increasing amounts of rich and diverse data from patient samples. The ability to easily visualize this data in an integrated an intuitive way is currently limited by the current software available. As a result, users typically must use several different tools to view the different data types for their cohort, making it difficult to have a simple unified view of their data. Results: Here we present Cascade, a novel web based tool for the intuitive 3D visualization of RNA-seq data from cancer genomics experiments. The Cascade viewer allows multiple data types (e.g. mutation, gene expression, alternative splicing frequency) to be simultaneously displayed, allowing a simplified view of the data in a way that is tuneable based on user specified parameters. The main webpage of Cascade provides a primary view of user data which is overlaid onto known biological pathways that are either predefined or added by users. A space-saving menu for data selection and parameter adjustment allows users to access an underlying MySQL database and customize the features presented in the main view. Conclusions: There is currently a pressing need for new software tools to allow researchers to easily explore large cancer genomics datasets and generate hypotheses. Cascade represents a simple yet intuitive interface for data visualization that is both scalable and customizable. Keywords: RNA-seq, Visualization, Cancer genomics, Dimensionality reduction Background functionally equivalent impacts on conserved biological The growth of next-generation sequencing (NGS) has pathways. Software tools that allow researchers to revolutionized the study of cancer through the analysis explore their data in an unbiased fashion, to generate of data from whole-exome [1] or genome [2, 3] sequen- potential hypotheses and that are intuitive enough for cing, the profiling of epigenetic modifications [4] and non-bioinformatians (e.g. clinicians) to apply their spe- transcriptome sequencing (RNA-seq) [5, 6]. RNA-seq is cialized knowledge, are therefore critical. of particular interest, since it is possible to survey gene The difficulties associated with the visualization and expression, splicing and mutations in a single experi- exploration of multidimensional cancer genomics data ment. Despite these powerful techniques, a growing list have been previously discussed [9], along with efforts to of cancer genomics (CG) studies have demonstrated that try and address these issues. Ideally, the process of data defining the precise genetics of the disease remains chal- exploration would also involve representations that lenging [7, 8]. Since the molecular mechanisms that can utilize an intuitive type of “dimensional reduction” (e.g. act as “driver events” may differ between patients with with diverse data types being represented by shapes and the same type of cancer, it is critical for researchers to colours) in order to display a single coherent summary be able to easily integrate the results of RNA-seq ana- of the information. Moreover, such approaches should lysis, along with other NGS approaches, to identify also be capable of leveraging the vast amount of verified biological information stored online in resource data- * Correspondence: [email protected] bases such as the Molecular Signatures Database Laboratory for high throughput genomics, Institute for Research in (MSigDB) [10], Pathway Interaction Database (PID) [11], Immunology and Cancer, University of Montreal, Montreal, QC, Canada Reactome [12], and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases [13]. © 2016 Shifman et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Shifman et al. BMC Genomics (2016) 17:75 Page 2 of 11 Here we present Cascade, a new data visualization tool edges/lines (interactions) with concentric rings as an ar- to display and explore NGS datasets in a rapid and intui- bitrary guide for pathway depth (which can be toggled tive way by allowing multiple data attributes to be shown on/off). Gene names are shown above each node and simultaneously. Cascade allows the analysis of RNA-seq gene expression values are represented as green vertical data, or whole-exome or genome sequencing, to be eas- lines. The functionality of Three.JS is completely generic ily mapped onto known or user defined biological path- and therefore although the planar view is a default rep- ways. The program uses a variety of tunable parameters resentation (since many established pathways have a se- to highlight specific attributes of genes/features that are quential order of interactions), the Cascade source code of interest to the researcher. Cascade can also easily in- could be modified to also allow the representation of tegrate other data sources and, through user configur- typical “hair-ball” network diagrams. Cascade graphics able gene lists and networks, highlight relevant features are also entirely customizable by end-users and some of the data. By allowing researchers and clinicians with modifications (e.g. colour selection, font size) can be specialized information to browse and explore their data made directly through a simple file of parameters. in the context of existing biological knowledge, Cascade In order to optimize the use of screen space, a collaps- fills a gap which currently exists between tools used ini- ible side-menu (Fig. 2) to view the data selection options tially to process the data for highly specific tasks (read- is displayed on the left side of the page. The menu al- mapping, variant detection) and tools that allow reanaly- lows users to alter 4 aspects of the rendered view: A) the sis of published data. data types to include in the representation, B) the bio- logical pathway to overlay data onto, C) the source of Implementation the RNA-seq data, and D) the gene list selection to re- Cascade is a web-based user interface that allows re- strict frequency colouring for the data types selected. searchers to interactively explore their RNA-seq data The choice of data types to represent is dependent only while allowing a wide variety of data types to be dis- on what is available in the database or uploaded by the played. Cascade consists of the main web page, an user, allowing individuals to tailor the view to their underlying relational database (MySQL) containing all of needs. Although Cascade could easily be altered to dis- the information from RNA-Seq experiments (along with play any type of data (gene lengths, %GC, etc.), has been information defined in biological pathways, gene lists, designed for cancer genomics data. Therefore the default etc.) and a collection of PHP scripts allowing the user to data types available to be shown (if available) are copy submit requests to the database to be displayed in the number variations (CNVs), single nucleotide variants browser (Fig. 1). (SNVs; either novel or known damaging variants), gene The main Cascade webpage uses the scripts from the expression (either mean/median levels of gene transcrip- Java library Three.JS [14] to generate interactive 3D tion), alternate splicing, and insertion/deletions (indels). forms without a requirement for specific browser plu- For instance, activating the “alternative splicing” option gins. The main page is sub-divided into a plotting canvas will add coloured rings around nodes based on whether where the genes (nodes) and interactions (edges) are or not that gene has been flagged by the user as exhibit- rendered, a space-saving, collapsible, menu on the left- ing alternative splicing. User supplied values are required hand side of the application [15] and a fixed information for this which could represent a binary observation of al- ribbon at the top of the page (Fig. 2). The rendering can- ternative splicing or a skewed ratio of isoforms since in- vas depicts genes as spheres on a single plane linked by dividual isoforms cannot be represented. Likewise, the Fig. 1 Overview of Cascade organization. A cartoon view of Cascade is shown with the three principal components coloured. The relational database (blue) holds all user defined data (e.g. from RNA-seq experiments) along with pre-defined data for biological pathways, networks, disease associated genes, etc. A collection of Java/PHP scripts (pink) act as bridge between the database and the main webpage (grey) used for user interaction and data visualization Shifman et al. BMC Genomics (2016) 17:75 Page 3 of 11 Fig. 2 Screenshot of data rendering on main webpage. Cascade uses a space-saving menu on the left-hand side of the screen to store functions to: (a) select features of the RNA-seq data to be displayed, (b) select biological pathways to overlay data onto, (c) select datasets to use for visualization of features selected (using a) and (d) restrict the colouring thresholds for features based on custom or predefined disease gene lists. The “modify ranges” button (top centre) allows users to alter the cohort frequency thresholds (a) required for node colour changes (mutations)or ring appearance (splicing). Additional buttons (top) toggle display of guide rings, generate (.jpg) screen images or open tool documentation presence of known/predicted deleterious SNVs must be specific list of genes (independent of which pathway is assigned by the user when the data is loaded into the being examined).