A Cloud-Based Platform for Interactive Visualization and Integrative Analysis of Genomics and Clinical Data
GenePool: A Cloud-Based Platform for Interactive Visualization and Integrative Analysis of Genomics and Clinical Data

Hua Fan-Minogue, MD, PhD1*, Marina Sirota, PhD1*, Sandeep Sanga, PhD2*, Dexter Hadley, MD, PhD1, Atul J. Butte, MD, PhD1, Tod Klingler, PhD2

1Division of System Medicine, Department of Pediatrics, Stanford School of Medicine, Stanford, CA; 2Station X Inc, San Francisco, CA

Abstract

Advances in genomic technology harnessed by large-scale efforts from interdisciplinary consortia have generated more comprehensive genomic data and provided unprecedented opportunities for understanding human health and disease. The Cancer Genome Atlas (TCGA) in particular contains comprehensive clinical, genomics and transcriptomics measurements for more than 16,000 patient samples across more than 30 tumor types. However, the vast volume and complexity of these data have also made it challenging for the biomedical research community to translate them into insights about disease quickly and reliably. Here we present a cloud-based platform, GenePool, which allows rapid interrogation of multi-genome data and their associated metadata from public or private datasets in a secure and interactive way. We demonstrate the functionality of GenePool with two use cases highlighting interactive sample browsing and selection, comparative analysis between disease subtypes, and integrated analysis of genomic, proteomic and phenotypic data. These examples also show the promise of GenePool to accelerate the identification of genetic contributions to human disease and the translation of these findings into clinically actionable results.

Introduction

Since the completion of the Human Genome Project, tremendous technological progress has enabled the generation of genomic data in a remarkably high-throughput and cost-effective manner, as strikingly illustrated by next-generation sequencing technologies (1).
The resulting unprecedented level of sequence data has facilitated the identification of genes associated with disease or drug response, some of which have led to improved therapies (2). Large interdisciplinary consortia have played a major role in harnessing these advances and building comprehensive catalogues of genomic data. For example, The Cancer Genome Atlas (TCGA) characterizes cancer transcriptomics and genomics data along with phenotypic data (3), the 1000 Genomes Project focuses in depth on human genetic variation (www.1000genomes.org), and the Genotype-Tissue Expression (GTEx) program aims to identify correlations between genetic variation and gene expression in multiple tissues (commonfund.nih.gov/GTEx). Various types of data, including clinical, genetic variation, mRNA expression, methylation, copy number variation, and others, have been generated and released as resources to the community.

The rapid growth of such large-scale genomics efforts brings new challenges to biomedical research. The raw and processed data often reside in each consortium's own repository and must be downloaded manually through that consortium's data portal. Users must also secure compatible computing power and data storage capacity themselves. Furthermore, analysis of these data requires specialized bioinformatics knowledge of the computing algorithms and statistical methods appropriate to genomic data, which is not readily available in most biomedical research labs (4). Therefore, data management, query and analysis have become new challenges for efficiently translating the abundance of data into knowledge about disease.

Some solutions are available that enable easy access to the data. For example, the data portal for the International Cancer Genome Consortium (ICGC) allows quick browsing and querying of genomics data from cancer projects, including TCGA (5).
COSMIC (Catalogue of Somatic Mutations in Cancer) (6) and cBioPortal (7) provide easy access to curated, comprehensive information on cancer genomics data. However, these data portals have little or no capability for real-time data analysis. The Broad Institute (www.broadinstitute.org) offers a suite of online analysis tools for large genome-related datasets. Although these tools can be run against the Broad Institute's existing datasets, applying them to other datasets requires downloading the chosen tools locally. Cloud computing technology has been increasingly used to meet the need for easy access to analysis tools, as well as for storage and retrieval of large datasets. One example is CloudMap (8); however, it applies to only a single type of data and does not support integrative analysis. Therefore, a user-friendly solution for accessing, processing, analyzing and integrating genomic, clinical and other types of data is currently lacking.

* These individuals contributed equally
¶ Corresponding author

Here we present a cloud-based platform, GenePool™ (https://stationx.mygenepool.com), which aims to provide a one-stop solution for genomic data analysis and integration. It offers secure storage, genomics workflows, interactive visualization and interpretation of genomics data obtained from various genomic technologies, as well as associated metadata. It currently holds over 10 public datasets across several modalities, including TCGA and GTEx, and multiple biological annotation engines, such as Gene Ontology (GO) and Disease Ontology (DO). GenePool allows import of external genomic data for customized use. It also supports reporting and secure sharing of results, with multiple options for annotation, sorting and data export. To demonstrate its function and usability, we applied GenePool to the TCGA datasets of two subtypes of lung cancer, Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC).
For each subtype, we first performed interactive browsing and querying, then carried out differential expression analysis. We identified genes that were unique to each subtype as well as a common lung cancer signature. In the second use case, we performed a deeper analysis of a specific subgroup of LUAD patients, those with a history of smoking, across multiple modalities including gene expression, somatic mutation and protein expression data.

Methods

System architecture

GenePool runs on Amazon Web Services (http://aws.amazon.com/) and takes advantage of EC2 compute, S3 storage, and other AWS services to maintain a secure computing and storage environment (Figure 1a). Its backend code is primarily implemented in Java and interacts with a statistical engine powered by R to run select computations. Its front end primarily uses JavaScript and HTML5. It leverages a hybrid database layer to manage and store genomics and annotation data. This hybrid data layer consists of a PostgreSQL database and a flat-file, NoSQL-like component. The supported data types include expression data (RNAseq and miRNAseq) in BAM format or as tab-delimited files, and sequence variation data (whole genome and exome, copy number and DNA methylation) as mutation calls in VCF format. Users can import their own genomics data, work with the GenePool reference library, or both. The platform has canned workflows, but also allows customized workflows developed with its APIs. It has a fully integrated multi-genome browser and provides an easy way to store and select genomics datasets for analysis according to sample-associated metadata, as well as an automated system for performing routine, well-characterized genomics workflows through an intuitive interface.

Figure 1. Overview of the GenePool platform. a. System architecture. b. Functionality.
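To make the idea of metadata-driven sample selection concrete, the following is a minimal sketch in Python. It is not GenePool's API or implementation: the function names, the column names (`sample_id`, `tumor_type`, `smoking_history`), the gene labels and all values are hypothetical and chosen only for illustration. It assumes sample metadata and an expression matrix are both available as tab-delimited text, selects samples matching metadata criteria, and restricts the expression matrix to those samples.

```python
import csv
import io

# Hypothetical sample metadata; column names and values are illustrative only.
METADATA_TSV = (
    "sample_id\ttumor_type\tsmoking_history\n"
    "S1\tLUAD\tsmoker\n"
    "S2\tLUAD\tnever\n"
    "S3\tLUSC\tsmoker\n"
    "S4\tLUAD\tsmoker\n"
)

# Hypothetical expression matrix: one row per gene, one column per sample.
EXPRESSION_TSV = (
    "gene\tS1\tS2\tS3\tS4\n"
    "GENE_A\t5.0\t1.0\t2.0\t7.0\n"
    "GENE_B\t0.5\t0.7\t3.1\t0.9\n"
)


def select_samples(metadata_tsv, **criteria):
    """Return IDs of samples whose metadata match every key=value criterion."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    return [
        row["sample_id"]
        for row in reader
        if all(row.get(key) == value for key, value in criteria.items())
    ]


def subset_expression(expression_tsv, sample_ids):
    """Return {gene: [values]} restricted to the given samples, in that order."""
    reader = csv.DictReader(io.StringIO(expression_tsv), delimiter="\t")
    return {
        row["gene"]: [float(row[sample_id]) for sample_id in sample_ids]
        for row in reader
    }


# Select LUAD samples from patients with a smoking history, then pull
# their expression values for downstream analysis.
luad_smokers = select_samples(
    METADATA_TSV, tumor_type="LUAD", smoking_history="smoker"
)
luad_smoker_expression = subset_expression(EXPRESSION_TSV, luad_smokers)
print(luad_smokers)
print(luad_smoker_expression)
```

Keeping selection (metadata) separate from extraction (expression values) mirrors the workflow described in the text, where a cohort is first defined from sample annotations and then passed to an analysis step.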
Analysis tools

GenePool has well-characterized workflows for various types of genomic data analysis, such as expression profiling and comparisons, sequence variation profiling and comparisons, time-to-endpoint analysis, paired and trio analysis, and gene enrichment analysis. A variety of statistical methods are available as options for the user to choose from, such as the t-test or likelihood ratio test for expression comparisons; the chi-squared test for variation comparisons; Fisher’s exact test for gene enrichment analysis; the univariate Cox proportional hazards test for time-to-endpoint analysis; and q-values for estimating the false discovery rate (FDR). The results can be visualized as heat maps, co-expression networks, Kaplan-Meier survival curves, scatter plots, histograms and bar charts (Figure 1b).

Scalability

The combination of Amazon cloud infrastructure and GenePool’s own platform architecture grants it elastic scalability, so that capacity can be scaled up or down depending on demand. Using load balancing and server and application metrics, GenePool can sense periods of high demand and add compute to the middle application tier as required. This additional compute is used to meet the demands of customer workloads, particularly during high concurrency and during periods when large analyses are being performed. This capability allows GenePool to meet the ever-changing and unpredictable demands of our users’ workloads. Elastic scalability also benefits the data import process. As data import jobs can vary in size and computational requirements, each job is assessed for its computational needs and a unique server is instantiated for that specific job. This allows GenePool to scale vertically, applying large multiprocessor servers to large jobs so that data is imported quickly and efficiently. Depending on the type of data being imported, GenePool can