Integrated Network Analysis and Logistic Regression Modeling Identify Stage-Specific Genes in Oral Squamous Cell Carcinoma Vinay Randhawa1,2 and Vishal Acharya1,2*
Total Page:16
File Type:pdf, Size:1020Kb
Randhawa and Acharya BMC Medical Genomics (2015) 8:39 DOI 10.1186/s12920-015-0114-0 RESEARCH ARTICLE Open Access Integrated network analysis and logistic regression modeling identify stage-specific genes in Oral Squamous Cell Carcinoma Vinay Randhawa1,2 and Vishal Acharya1,2* Abstract Background: Oral squamous cell carcinoma (OSCC) is associated with substantial mortality and morbidity but, OSCC can be difficult to detect at its earliest stage due to its molecular complexity and clinical behavior. Therefore, identification of key gene signatures at an early stage will be highly helpful. Methods: The aim of this study was to identify key genes associated with progression of OSCC stages. Gene expression profiles were classified into cancer stage-related modules, i.e., groups of genes that are significantly related to a clinical stage. For prioritizing the candidate genes, analysis was further restricted to genes with high connectivity and a significant association with a stage. To assess predictive power of these genes, a classification model was also developed and tested by 5-fold cross validation and on an independent dataset. Results: The identified genes were enriched for significant processes and functional pathways, and various genes were found to be directly implicated in OSCC. Forward and stepwise, multivariate logistic regression analyses identified 13 key genes whose expression discriminated early- and late-stage OSCC with predictive accuracy (area under curve; AUC) of ~0.81 in a 5-fold cross-validation strategy. Conclusions: The proposed network-driven integrative analytical approach can identify multiple genes significantly related to an OSCC stage; the classification model that is developed with these genes may help to distinguish cancer stages. The proposed genes and model hold promise for monitoring of OSCC stage progression, and our findings may facilitate cancer detection at an earlier stage, resulting in improved treatment outcomes. Keywords: Coexpression network analysis, Gene module, Hub gene, Microarray, Oral squamous cell carcinoma, Logistic regression modeling Background early-stage OSCC [3]. Unfortunately, only 35 % of cases Oral cancer is a common cancer worldwide with squa- of oral cancer are detected at the earliest stage (without mous cell carcinoma being the most prevalent subtype, producing symptoms) [4]because of its molecular com- which accounts for 96 % of oral-cavity cancers [1]. There plexity and clinical behavior. Therefore, there is an ur- are ~260,000 new cases of oral squamous cell carcinoma gent need for identifying molecular predictors that may (OSCC) and 124,000 deaths worldwide annually [2]. enable cancer detection at early stages. Despite considerable advances in treatments, the overall Remarkable advances in technologies for assaying gene five-year survival rate of patients at the advanced stage is expression and the availability of high-throughput data only 30 %, but is greater than 90 % among patients with have opened up new avenues of cancer research that may allow researchers to generate hypotheses regarding improved disease classification. Global gene expression * Correspondence: [email protected] profiles in OSCC have been studied using traditional 1Functional Genomics and Complex Systems Laboratory, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Council of approaches which have helped to identify some candidate Scientific and Industrial Research, Palampur, Himachal Pradesh, India 2Academy of Scientific and Innovative Research (AcSIR), New Delhi, India © 2015 Randhawa and Acharya. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Randhawa and Acharya BMC Medical Genomics (2015) 8:39 Page 2 of 20 gene biomarkers on the basis of comparison between can- to discriminate the two most common OSCC groups: cerous and non-cancerous cases [5–7]. In addition, several early stages (I, II) versus late stages (III, IV). For this pur- computational methods have been developed to identify pose, first, a set of highly correlating genes was obtained biomarkers for oral cancer prognosis and diagnosis from a gene coexpression network. For prioritizing the [8–10]. Comparative analysis of expression profiles be- candidate genes, the analysis were then restricted to genes tween early and late stages has uncovered genes with with high connectivity and a significant association with stage-dependent alterations in expression of various can- cancer stage. To advance the understanding of these cers [11, 12]; however, to our knowledge, not much efforts relations, using expression profiles of the putative gene has been devoted to analysis of OSCC stage progression candidates, we then develop a classification model to dis- in relation to aggressiveness of the cancer. criminate cancer stages; this model helped to classify Genes and proteins function cooperatively and thus OSCC efficiently. The methodology presented herein regulate common biological processes by co-regulating seems to be the first implementation of key hub genes (de- each other [13]; however, genes identified via the classical pending on topological and phenotypic importance) to approaches are usually not functionally related and there- identify an OSCC stage and may predict clinical aggres- fore may not reveal key biological processes. Because of siveness of this cancer. these limitations, the traditional approaches are not very useful for identification of specific genes that contribute to or are affected by complex diseases. Fortunately, rapid ad- Methods vances in network biology have effectively provided valu- To identify the key genes whose expression may discrim- able frameworks for analysis of multidimensional biological inate between early- and late-stage OSCC samples, we data and have important applications to clinical practice. adopted the following major steps: (1) merging of multiple Instead of analyzing tens of thousands of gene com- microarray datasets to identify differentially expressed parisons, the network-based analysis offers a meaning- genes (DEGs) in tumor samples compared to normal ful data reduction scheme that limits the analysis to (healthy) controls, (2) analysis of the gene coexpression only hundreds [14–16] or even tens [17–20] of rele- network to identify stage associated modules and their key vant genes. Altered gene coexpression networks have hub genes, and (3) development of a hub gene-based clas- been proven to be the major cause of dysregulated sifier model to distinguish OSCC stages. Schematic repre- expression during cancer progression [21, 22]. Dis- sentation of the overall strategy is shown in Fig. 1. eases can therefore be studied as networks by system- atically exploring topological associations between Acquisition and proecessing of gene expression data contributing genes. Gene coexpression networks have Gene expression profiles of OSCC were obtained from the been utilized even to identify key tumorigenic genes National Center for Biotechnology Information (NCBI) with the aim to find biomarkers or to gain insights Gene Expression Omnibus (GEO) database [29] via quer- into probable disease mechanisms. Nevertheless, most ies with the search terms “oral squamous cell carcinoma” of these studies remain limited to physically interact- and “head and neck squamous cell carcinoma” (August, ing genes and do not take into account their associa- 2014) to specifically retrieve most widely used Affymetrix tions with the disease phenotype. On the other hand, HGU-133a and HGU-133plus2 array datasets (Table 1). disruption in connections within disease modules give Studies that had well-defined phenotypic description of rise to particular disease phenotypes [23]. Thus, now cancer stage were preferred. Other criteria included: (i) it seems to be more important to consider the pheno- samples comprising only human tissue (not derived from typic association in order to characterize the mecha- cell lines) and without any history of specific treatment, nisms of disease progression [24]. (ii) studies comprising both case samples and healthy con- A complex alteration of global gene expression profiles trol samples (to identify disease-specific signals), and (iii) among genes is a determinant of progression of cancer studies that were conducted on similar platforms (we stages and grades [25–27].Because genes that are highly wanted to obtain a high proportion of overlapping genes). connected within a gene set are thought to drive other Individual datasets were imported into the R 3.0.2 statis- genes [28], it was hypothesized that examining the altered tical environment (www.r-project.org) by means of the gene expression profiles of key genes might help to discrim- GEOquery tool of the Bioconductor software package inate between early and late stage cancers. Although cancer (version 2.22.0) [30] and were processed using affy pack- stage is an effective prognostic factor, to the best of our age. At initial levels of quality control (QC), all sam- knowledge, systematic studies characterizing gene expres- ples were pre-processed together using the standard sion data in relation to OSCC stage progression