A 44-Gene Set Constructed for Predicting the Prognosis of Clear Cell Renal Cell Carcinoma
Total Page:16
File Type:pdf, Size:1020Kb
INTERNATIONAL JOURNAL OF MOLECULAR MEDICINE 42: 3105-3114, 2018 A 44-gene set constructed for predicting the prognosis of clear cell renal cell carcinoma YONGGANG WANG1, YAO WANG1 and FENG LIU2 Departments of 1Urology and 2Nephrology, China‑Japan Union Hospital of Jilin University, Changchun, Jilin 130033, P.R. China Received May 1, 2018; Accepted September 7, 2018 DOI: 10.3892/ijmm.2018.3899 Abstract. Clear cell renal cell carcinoma (ccRCC) is the most STC2, CXCL10, SPAG4, CD27, MMP9 and RUNX3 may be frequent type of renal cell carcinoma (RCC). The present utilized for predicting the prognosis of patients with ccRCC. study aimed to examine prognostic markers and construct a prognostic prediction system for ccRCC. The mRNA Introduction sequencing data of ccRCC was downloaded from The Cancer Genome Atlas (TCGA) database, and the GSE40435 dataset Renal cell carcinoma (RCC), which is derived from the lining was obtained from the Gene Expression Omnibus database. of the proximal convoluted tubule, represents 90‑95% of Using the Limma package, the differentially expressed cases of kidney cancer (1). The early symptoms of RCC are genes (DEGs) in the TCGA dataset and GSE40435 dataset usually undetected, which leads to advanced disease stages were obtained, respectively, and the overlapped DEGs were in patients newly diagnosed with RCC (2). With the progres- selected. Subsequently, Cox regression analysis was applied for sion of RCC, tumor cells may metastasize to other organs, screening prognosis-associated genes. Following visualization including the liver, lymph nodes, lungs, brain, adrenal glands of the co-expression network using Cytoscape software, the and bones (3). RCC has a relatively higher incidence in men network modules were examined using the GraphWeb tool. than women, particularly in those >65 years old (4). Clear cell Functional annotation for genes in the network was performed RCC (ccRCC), characterized by the clear cytoplasm in cells, is using the clusterProfiler package. Finally, a prognostic predic- the most frequent type of RCC (5). Therefore, examining the tion system was constructed through Bayes discriminant pathogenesis of ccRCC is necessary and of significance. analysis and confirmed with the GSE29609 validation dataset. There are several studies reporting the molecular mecha- The results revealed a total of 263 overlapped DEGs and nisms of ccRCC. For example, metastasis-associated lung 161 prognosis‑associated genes. Following construction of the adenocarcinoma transcript 1 (MALAT1) is upregulated in co‑expression network, 16 functional terms and three pathways patients with ccRCC, which may be utilized to predict overall were obtained for genes in the network. In addition, red, yellow survival (OS) and serve as a promising therapeutic target for (Involving chemokine ligand 10 (CXCL10), CD27 molecule the disease (6,7). Hypoxia‑inducible factor‑1α, which is involved (CD27) and runt-related transcription factor 3 (RUNX3)], in tumoral adaptation to hypoxic conditions, can be a critical green (Involving angiopoietin-like 4 (ANGPTL4), stannio- prognostic factor in metastatic ccRCC (8). The overexpres- calcin 2 (STC2), and sperm associated antigen 4 (SPAG4)], sion of cannabinoid receptor 2 is important in the cell cycle, and cyan modules were extracted from the co-expression cellular proliferation and migration of RCC cells, and predicts network. Additionally, the prognostic prediction system poor outcomes for patients with RCC (9). Previous studies have involving 44 signature genes, including ANGPTL4, STC2, demonstrated that X-linked inhibitor of apoptosis protein is a CXCL10, SPAG4, CD27, matrix metalloproteinase (MMP9) potential prognostic marker in RCC, and that its downregulation and RUNX3, was identified and confirmed. In conclusion, the may weaken immune resistance (10,11). Carbonic anhydrase 9 44-gene prognostic prediction system involving ANGPTL4, belongs to the carbonic anhydrase family, and its low expres- sion is correlated with the poor prognosis of patients with ccRCC (12,13). However, these studies are insufficient and the prognostic mechanisms of ccRCC remain to be fully elucidated. Correspondence to: Dr Feng Liu, Department of Urology, In addition, the majority of the aforementioned results obtained China‑Japan Union Hospital of Jilin University, 126 Xiantai Street, from these microarray data were not validated by other datasets. Erdao, Changchun, Jilin 130033, P.R. China Bioinformatics analysis has been increasingly applied for E-mail: [email protected] revealing the genetic changes in the high-throughput data of tumors (14,15). In the present study, comprehensive bioin- Key words: clear cell renal cell carcinoma, co-expression network, formatics analyses for gene expression data were performed module analysis, signature genes, prognosis to construct a prognostic prediction system for ccRCC with specific signature genes. Additionally, these predicted genes were validated by another dataset. The aim of the present study 3106 WANG et al: 44-GENE PROGNOSTIC GENE SET FOR CCRCC was to provide additional, and reliable, prognostic markers for genes, following which the correlation coefficients between patients with ccRCC, and shed light on the molecular mecha- the expression values of two genes were calculated using nisms of disease progression. the COR function (23) in R. P<0.05 and a correlation coef- ficient |r|≥0.6 were used for selecting gene co‑expression Materials and methods interactions. Subsequently, the co-expression network was visualized using Cytoscape software 3.4.0 (http://www.cyto- Data source and data preprocessing. From The Cancer scape.org/) (24). Genome Atlas (TCGA; https://cancergenome.nih.gov/) Using Fisher's exact test method in the R package database, the mRNA sequencing data of ccRCC, which was clusterProfiler (http://bioconductor.org/packages/release/ sequenced on the Illumina HiSeq 2000 RNA Sequencing bioc/html/clusterProfiler.html) (25), the significant Gene platform, were downloaded with relevant clinical informa- Ontology (GO) terms and Kyoto Encyclopedia of Genes and tion on December 18th, 2016. There were a total of 606 Genomes (KEGG) pathways were enriched for the gene nodes samples, and 605 of these had information on sample in the co-expression network. In addition, the transcription source (normal or tumor tissues), including 533 ccRCC and factors (TFs) significantly associated with the nodes in the 72 normal samples. The expression profile dataset GSE40435 co-expression network were searched using The Database for (GPL10558 platform) in the Gene Expression Omnibus Annotation, Visualization and Integrated Discovery (DAVID; (GEO; http://www.ncbi.nlm.nih.gov/geo/) database was https://david.ncifcrf.gov/) (26). Additionally, sub modules in also obtained, which included 101 ccRCC and 101 normal the co-expression network were examined using the GraphWeb samples. According to the annotation platform, probes tool (http://biit.cs.ut.ee/graphweb/) (27). in the raw data of GSE40435 were converted into gene symbols. If multiple probes matched one gene, the average Construction of the prognostic prediction system. The 525 expression value of the gene was obtained. Subsequently, ccRCC samples with survival information in the TCGA the expression data underwent logarithmic transformation dataset were considered as the training dataset for the using the R package Limma (http://www.bioconductor. prognostic prediction system. First, the above samples were org/packages/release/bioc/html/limma.html) (16), to reach classified into two groups based on their survival status: an approximately normal distribution from the skewed distri- Alive and deceased groups. Subsequently, combining bution. Thereafter, the median normalization method (17) survival status with survival times, the samples were was utilized to normalize the data. further divided into good prognosis (alive and survival time ≥15 months) and bad prognosis (succumbed to mortality Analysis of differentially expressed genes (DEGs). The TCGA and survival time <15 months) groups. As the median data was combined with the 19,004 protein coding gene OS time was ~15 months for all the ccRCC samples, the annotation information in the HUGO Gene Nomenclature cut-off value for grouping was set as 15 months. The prior Committee (HGNC) database (18), and mRNAs in the TCGA probability based on the Bayesian approach was then dataset were identified. The differential expression analysis determined. The nodes of the co-expression network were of genes or mRNAs was then performed between ccRCC and ranked (-logRank P-values from largest to smallest). Using normal samples in the TCGA dataset and GSE40435 dataset the discriminant Bayes function in the e1071 package separately, using the R package Limma (16). Subsequently, of R (https://cran.r-project.org/web/packages/e1071/index. the false discovery rate (FDR) values were calculated by html) (28). Bayes discriminant analysis was performed for the R package multtest (http://www.bioconductor.org/pack- the network nodes through adding genes one by one and ages/release/bioc/html/multtest.html) (19). An FDR <0.05 and removing the genes that influenced prediction accuracy. |log2fold change (FC)|>0.585 were the thresholds for selection The prognostic score was defined as the discriminant coef- of the DEGs. In addition, the overlapped DEGs in the TCGA ficient of each sample when the prediction accuracy was the and GSE40435 datasets were identified for the following highest, and the gene set was considered