Ovaska et al. Genome Medicine 2010, 2:65
http://genomemedicine.com/content/2/9/65

RESEARCH Open Access

Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme

Kristian Ovaska1, Marko Laakso1†, Saija Haapa-Paananen2†, Riku Louhimo1, Ping Chen1, Viljami Aittomäki1, Erkka Valo1, Javier Núñez-Fontarnau1, Ville Rantanen1, Sirkku Karinen1, Kari Nousiainen1, Anna-Maria Lahesmaa-Korpinen1, Minna Miettinen1, Lilli Saarinen1, Pekka Kohonen2, Jianmin Wu1, Jukka Westermarck3,4, Sampsa Hautaniemi1*

Abstract

Background: Coordinated efforts to collect large-scale data sets provide a basis for systems level understanding of complex diseases. In order to translate these fragmented and heterogeneous data sets into knowledge and medical benefits, advanced computational methods for data analysis, integration and visualization are needed.

Methods: We introduce a novel data integration framework, Anduril, for translating fragmented large-scale data into testable predictions. The Anduril framework allows rapid integration of heterogeneous data with state-of-the-art computational methods and existing knowledge in bio-databases. Anduril automatically generates thorough summary reports and a website that shows the most relevant features of each gene at a glance, allows sorting of data based on different parameters, and provides direct links to more detailed data on genes, transcripts or genomic regions. Anduril is open-source; all methods and documentation are freely available.

Results: We have integrated multidimensional molecular and clinical data from 338 subjects having glioblastoma multiforme, one of the deadliest and most poorly understood cancers, using Anduril. The central objective of our approach is to identify genetic loci and genes that have significant survival effect.
Our results suggest several novel genetic alterations linked to glioblastoma multiforme progression and, more specifically, reveal Moesin as a novel glioblastoma multiforme-associated gene that has a strong survival effect and whose depletion in vitro significantly inhibited cell proliferation. All analysis results are available as a comprehensive website.

Conclusions: Our results demonstrate that integrated analysis and visualization of multidimensional and heterogeneous data by Anduril enables drawing conclusions on functional consequences of large-scale molecular data. Many of the identified genetic loci and genes having significant survival effect have not been reported earlier in the context of glioblastoma multiforme. Thus, in addition to generally applicable novel methodology, our results provide several glioblastoma multiforme candidate genes for further studies.

Anduril is available at http://csbi.ltdk.helsinki.fi/anduril/
The glioblastoma multiforme analysis results are available at http://csbi.ltdk.helsinki.fi/anduril/tcga-gbm/

* Correspondence: [email protected]
† Contributed equally
1Computational Systems Biology Laboratory, Institute of Biomedicine and Genome-Scale Biology Research Program, University of Helsinki, Haartmaninkatu 8, Helsinki, FIN-00014, Finland
Full list of author information is available at the end of the article

© 2010 Ovaska et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Comprehensive characterization of complex diseases calls for coordinated efforts to collect and share genome-scale data from large patient cohorts. A prime example of such a coordinated effort is The Cancer Genome Atlas (TCGA), which currently provides more than five billion data points on glioblastoma multiforme (GBM) with the aim of improving diagnosis, treatment and prevention of GBM [1].

Translating genome-scale data into knowledge and further to effective diagnosis, treatment and prevention strategies requires computational tools that are designed for large-scale data analysis as well as for the integration of multidimensional data with clinical parameters and knowledge available in bio-databases. In addition, it is evident that until data integration tools are developed to the level that experimental scientists can independently interpret the vast amounts of data generated by genome-scale technologies, most of the potential of the generated data will be severely underexploited. In order to address these challenges, we have developed a data analysis and integration framework, Anduril, which facilitates the integration of various data formats, bio-databases and analysis techniques. Anduril manages and automates analysis workflows from importing raw data to reporting and visualizing the results. In order to facilitate interpretation of the large-scale data analysis results, Anduril generates a website that shows the most relevant features of each gene at a glance, allows sorting of data based on different parameters, and provides direct links to more detailed views of genes, transcripts, genomic regions, protein-protein interactions and pathways.

We demonstrate the utility of the Anduril framework by analyzing heterogeneous and multidimensional data from 338 GBM patients [1]. GBM is an aggressive brain cancer having a median survival of one year and is remarkably resistant to all current anti-cancer therapeutic regimens [2]. In order to understand the complex molecular mechanisms behind GBM, earlier efforts have analyzed data from one or two platforms, such as mutations, copy number and gene expression profiles and methylation patterns [3-7]. In contrast, we have analyzed all TCGA provided GBM data sets and collected the results into a comprehensive website that facilitates the interpretation of the data and allows an advanced view of genes and genomic regions crucial to GBM progression. Most importantly, Anduril can be applied to data from any accessible source.

Materials and methods

Documentation for algorithms, their parameters and usage in the analysis together with all results are available in Additional file 1.

Glioblastoma multiforme data set

The glioblastoma data set was originally released in 2008 [1] and has been updated online since then. An updated revision was used in the present work: comparative genomic hybridization array (aCGH), single nucleotide polymorphism (SNP), exon, gene expression and microRNA (miRNA) data were accessed May to August 2009, while methylation and clinical data were accessed October to November 2009. The data set consists of 338 primary glioblastoma patients with clinical annotations. Data were analyzed from the following microarray platforms: Affymetrix HU133A (269 GBM samples, 10 control samples), Affymetrix Human Exon 1.0 (298 GBM samples, 10 control samples), Agilent 244 k aCGH (238 GBM samples), Affymetrix SNP Array 6.0 (214 GBM blood samples), Illumina GoldenGate methylation array (243 GBM samples) and Agilent miRNA array (251 GBM samples, 10 control samples). Pre-normalized data (level 2) were used for gene, exon and miRNA expression and methylation arrays. Raw data (level 1) were used for aCGH and SNP platforms. Clinical annotations were used to compute the duration of patient survival in months from the initial diagnosis to death or to the last follow-up. The publicly available results in the present work do not reveal protected patient information.

Gene expression analyses

The gene and exon expression platforms include ten control samples from brain tissue extracted from non-cancer patients in addition to the glioblastoma samples. Transcript level expressions are calculated from the exon level expression data by considering the problem of transforming the exon-level data to transcripts as a least squares problem. For the ith gene having m exons and n transcripts in Ensembl (v.58) we define a vector e_i of length m that denotes the measured exon expressions, and an m times n matrix A_i, where the values in each column denote if the exon belongs to the transcript (1) or not (0). Transcript expression values t_i are solved from the equation A_i t_i = e_i using the QR decomposition to ensure numerical stability. The gene level expression values for the exon array platform were computed by taking a median of the intensity of all the exons linked with the gene in Ensembl.

Differential expression is determined by computing fold changes and applying a t-test between glioblastoma and control groups, followed by multiple hypotheses correction [8]. Fold changes are computed by dividing the mean of glioblastoma expression values by the mean of control expression values.

Transcriptome survival analysis

Differentially expressed splice variants were selected as the basis of expression survival analysis. There were 8,887 splice variants (out of a total 75,083) that were differentially expressed having absolute fold change >2 and a multiple hypothesis corrected P-value < 0.05. For these splice variants we computed sample-specific fold changes by dividing the sample expression value by the mean of control expression values. These fold changes

MiRBase::Targets version 4 was used to match the annotations used in constructing the Agilent human miRNA array (G4470A).

DNA methylation arrays

Illumina DNA Methylation Cancer
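To illustrate the transcript-level step described under "Gene expression analyses", the following Python sketch solves A_i t_i = e_i in the least squares sense via a QR decomposition. It is a minimal illustration under stated assumptions, not the Anduril implementation: the function name solve_transcript_expression is hypothetical, the QR factorization is computed with modified Gram-Schmidt (the text does not say which QR algorithm was used), and A_i is assumed to have full column rank.

```python
def solve_transcript_expression(A, e):
    """Least squares solution t of A t = e using a QR decomposition.

    A: m x n list of lists; A[i][j] is 1 if exon i belongs to transcript j, else 0.
    e: list of m measured exon expression values.
    Assumption (not from the paper): A has full column rank; otherwise
    a ValueError is raised instead of returning a solution.
    """
    m, n = len(A), len(A[0])
    # Columns of A as float vectors.
    cols = [[float(A[i][j]) for i in range(m)] for j in range(n)]
    Q = []                              # orthonormal columns of Q
    R = [[0.0] * n for _ in range(n)]   # upper triangular factor
    for j in range(n):
        v = cols[j][:]
        # Modified Gram-Schmidt: remove components along earlier Q columns.
        for k in range(j):
            R[k][j] = sum(Q[k][i] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(m)]
        norm = sum(x * x for x in v) ** 0.5
        if norm < 1e-12:
            raise ValueError("A is rank deficient; transcripts are not identifiable")
        R[j][j] = norm
        Q.append([x / norm for x in v])
    # Solve R t = Q^T e by back substitution.
    qte = [sum(Q[j][i] * e[i] for i in range(m)) for j in range(n)]
    t = [0.0] * n
    for j in reversed(range(n)):
        s = qte[j] - sum(R[j][k] * t[k] for k in range(j + 1, n))
        t[j] = s / R[j][j]
    return t


# Toy gene with 3 exons and 2 transcripts (made-up values):
# transcript 1 = exons 1 and 2, transcript 2 = exons 2 and 3.
A_example = [[1, 0],
             [1, 1],
             [0, 1]]
e_example = [2.0, 5.0, 3.0]   # consistent with transcript expressions 2 and 3
t_example = solve_transcript_expression(A_example, e_example)
print([round(x, 6) for x in t_example])   # approximately [2.0, 3.0]
```

In practice a library routine such as numpy.linalg.lstsq or numpy.linalg.qr would normally replace the hand-written factorization. The pure Python version is used here only to keep the sketch self-contained.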