Pattern discovery and cancer gene identification in integrated cancer genomic data

Qianxing Mo (a,b), Sijian Wang (c), Venkatraman E. Seshan (a), Adam B. Olshen (d), Nikolaus Schultz (e), Chris Sander (e), R. Scott Powers (f), Marc Ladanyi (g), and Ronglai Shen (a,1)

(a) Department of Epidemiology and Biostatistics, (e) Computational Biology Program, and (g) Department of Pathology and Human Oncology and Pathogenesis Program, Memorial Sloan–Kettering Cancer Center, New York, NY 10065; (b) Department of Medicine and Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, TX 77030; (c) Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53792; (d) Department of Epidemiology and Biostatistics, University of California, San Francisco, CA 94107; and (f) Cancer Genome Center, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11797

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved December 19, 2012 (received for review May 27, 2012)

PNAS | March 12, 2013 | vol. 110 | no. 11 | 4245–4250 | www.pnas.org/cgi/doi/10.1073/pnas.1208949110
STATISTICS | SYSTEMS BIOLOGY

Large-scale integrated cancer genome characterization efforts including The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia have created unprecedented opportunities to study cancer biology in the context of knowing the entire catalog of genetic alterations. A clinically important challenge is to discover cancer subtypes and their molecular drivers in a comprehensive genetic context. Curtis et al. [Nature (2012) 486(7403):346–352] have recently shown that integrative clustering of copy number and gene expression in 2,000 breast tumors reveals novel subgroups beyond the classic expression subtypes that show distinct clinical outcomes. To extend the scope of integrative analysis for the inclusion of somatic mutation data by massively parallel sequencing, we propose a framework for joint modeling of discrete and continuous variables that arise from integrated genomic, epigenomic, and transcriptomic profiling. The core idea is motivated by the hypothesis that diverse molecular phenotypes can be predicted by a set of orthogonal latent variables that represent distinct molecular drivers, and thus can reveal tumor subgroups of biological and clinical importance. Using the Cancer Cell Line Encyclopedia dataset, we demonstrate our method can accurately group cell lines by their cell-of-origin for several cancer types, and precisely pinpoint their known and potential cancer driver genes. Our integrative analysis also demonstrates the power for revealing subgroups that are not lineage-dependent, but consist of different cancer types driven by a common genetic alteration. Application of The Cancer Genome Atlas colorectal cancer data reveals distinct integrated tumor subtypes, suggesting different genetic pathways in colon cancer progression.

multivariate generalized linear model | multidimensional data | penalized regression

Author contributions: Q.M., S.W., V.E.S., A.B.O., N.S., C.S., R.S.P., M.L., and R.S. designed research; Q.M., S.W., V.E.S., and R.S. performed research; Q.M., N.S., and R.S. analyzed data; and Q.M., S.W., V.E.S., A.B.O., N.S., C.S., R.S.P., M.L., and R.S. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
1 To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1208949110/-/DCSupplemental.

A major goal of many cancer genome projects is to characterize key genetic alterations in cancer and discover therapeutic targets through comprehensive genomic profiling of the cancer genome. The Cancer Genome Atlas (TCGA) studies have unveiled the genetic landscape of several cancer types by whole-genome and whole-exome sequencing, DNA copy number profiling, promoter methylation profiling, and mRNA expression profiling in a large number of tumors (1–5). Complementary to the tumor project, the Cancer Cell Line Encyclopedia (CCLE) (6) and the Sanger cell line project (7) have cataloged a compilation of genetic and molecular data in almost 1,000 human cancer cell lines, coupled with pharmacological profiles for a large panel of anticancer drugs. These large-scale integrative genomic efforts have been geared toward comprehensively cataloging individual genomic alterations, analogous to a reverse-engineering process where thousands of individual cancer genomes are taken apart to shed light on common biological principles. Unfortunately, cancer genomes exhibit considerable heterogeneity with abnormalities occurring in different genes among different individuals, posing a great challenge to identify those genes with functional importance and therapeutic implications. Thus, there is a corresponding need for a forward-engineering process that synthesizes and integrates the information to extract biological principles from the massive amount of data to provide useful insights for advancing diagnostic, prognostic, and therapeutic strategies.

In a previous publication (8), we proposed an integrative clustering framework called iCluster. The method was recently used in a landmark study to predict novel breast cancer subtypes with distinct clinical outcomes (9), and it was found that the joint clustering of copy number and gene expression profiles resolved the considerable heterogeneity of the expression-only subgroups. Other approaches on data integration that have emerged in recent years include generalized data decomposition methods (10, 11) and nonparametric Bayesian models (12). However, two major challenges have not yet been fully addressed. First, the existing methods are not designed to include both discrete (e.g., somatic mutation) and continuous variables, thus limiting the ability to harness the full potential of large-scale integrated genomic datasets. In fact, most of the previous methods have focused on integrating only copy number and gene expression. A second challenge that has not been fully addressed lies in systematically distinguishing cancer genes that are reliable and constant features of a subtype from those that are less reliable.

To address these challenges, we present a significant enhancement of the iCluster method, which we call iCluster+. The enhanced method can perform pattern discovery that integrates diverse data types: binary (somatic mutation), categorical (copy number gain, normal, loss), and continuous (gene expression) values. In this paper, we demonstrate the power of this method for integrating the full spectrum of cancer genomic data using the CCLE and TCGA colorectal cancer datasets. A key aspect of the method is to use generalized linear regression for the formulation of a joint model, with respect to a common set of latent variables that we propose represents distinct driving factors (molecular etiology and genetic pathways). Geometrically, these latent variables form a set of "principal" coordinates that span a lower dimensional integrated subspace, and collectively capture the major biological variations observed across cancer genomes. As a result, the latent variable approach enables rigorous analysis of the integrated genomic data, which, as we show in this report, can reveal common themes that sort the tumors into distinct subgroups of biological and clinical importance. To identify genomic features that contribute most to the biological variation and thus have direct relevance for characterizing the molecular subgroups, we apply a penalized likelihood approach (13) with lasso penalty terms (14, 15) to induce sparsity. The lasso regression allows us to pinpoint the subset of genomic features that have significant weights on the latent variables, which leads to enhanced interpretability and a more stable estimation of the latent variables.

Results

iCluster+ Framework. iCluster+ integrates a diverse range of data types (Fig. 1). First, we introduce some notations. Let $x_{ijt}$ denote the genomic variable associated with the $j$th ($j \in \{1, \dots, p_t\}$) genomic feature in the $i$th ($i \in \{1, \dots, n\}$) sample of the $t$th ($t \in \{1, \dots, m\}$) data type. A genomic feature can be either a protein-coding gene or non–gene-centric elements of interest (genomic region, CpG sites, microRNA, etc.), depending on the data type. Let $z_i = (z_{i1}, \dots, z_{ik})'$ be a column vector consisting of $k$ unobserved latent variables.

The core idea is the following. We use a set of latent variables to represent $k$ distinct driving factors (molecular drivers), which predict the values of the original $p = \sum_t p_t$ genomic variables, and collectively [...]

$$\log \frac{P(x_{ijt} = 1 \mid z_i)}{1 - P(x_{ijt} = 1 \mid z_i)} = \alpha_{jt} + \beta_{jt} z_i,$$

where $P(x_{ijt} = 1 \mid z_i)$ is the probability of gene $j$ mutated in patient $i$ given the value of the latent factor $z_i$; $\alpha_{jt}$ is an intercept term; and $\beta_{jt}$ is a length-$k$ row vector of coefficients that determine the weights genomic variable $j$ contributes to the latent variables.

If $x_{ijt}$ is a multicategory variable (e.g., copy number states: loss/normal/gain), we consider the following multilogit regression:

$$P(x_{ijt} = c \mid z_i) = \frac{\exp(\alpha_{jct} + \beta_{jct} z_i)}{\sum_{\ell=1}^{C} \exp(\alpha_{j\ell t} + \beta_{j\ell t} z_i)}, \quad c = 1, \dots, C,$$

where $\{P(x_{ijt} = 1 \mid z_i), \dots, P(x_{ijt} = C \mid z_i)\}$ denote the probability of the states of the categorical variable (e.g., copy number loss, normal, gain) given the value of $z_i$; $\alpha_{jct}$ is the intercept term; $\beta_{jct}$ is a length-$k$ row vector of regression coefficients for category $c$; and $C$ is the [...]
