LASSO‑based Cox‑PH model identifies an 11‑lncRNA signature for prognosis prediction in gastric cancer
MOLECULAR MEDICINE REPORTS 18: 5579-5593, 2018

YONGHONG ZHANG1*, HUAMIN LI2*, WENYONG ZHANG1, YA CHE3, WEIBING BAI4 and GUANGLIN HUANG4

1Department of General Surgery, Shangluo Central Hospital, Shangluo, Shaanxi 726000; 2Department of Pathology, Weinan Central Hospital, Weinan, Shaanxi 714000; 3Department of Medical Oncology, Shangluo Central Hospital, Shangluo, Shaanxi 726000; 4Department of General Surgery, Yulin Xingyuan Hospital, Yulin, Shaanxi 719000, P.R. China

Received December 23, 2017; Accepted September 13, 2018

DOI: 10.3892/mmr.2018.9567

Correspondence to: Dr Weibing Bai or Dr Guanglin Huang, Department of General Surgery, Yulin Xingyuan Hospital, 33 Middle Section of West Renmin Road, Yuyang, Yulin, Shaanxi 719000, P.R. China

*Contributed equally

Key words: network, mRNA, pathway, gene ontology, differentially expressed RNAs

Abstract. The present study aimed to identify a long non-coding (lnc) RNA-based signature for prognosis assessment in gastric cancer (GC) patients. By integrating gene expression data of GC and normal samples from the National Center for Biotechnology Information Gene Expression Omnibus, the EBI ArrayExpress and The Cancer Genome Atlas (TCGA) repositories, the common RNAs in Genomic Spatial Event (GSE) 65801, GSE29998, E-MTAB-1338 and the TCGA set were screened and used to construct a weighted correlation network analysis (WGCNA) network for mining GC-related modules. Consensus differentially expressed RNAs (DERs) between GC and normal samples in the four datasets were screened using the MetaDE method. From the overlapping lncRNAs shared by the preserved WGCNA modules and the consensus DERs, an lncRNA signature was obtained using an L1-penalized (lasso) Cox-proportional hazard (PH) model. LncRNA-mRNA networks were constructed for these signature lncRNAs, followed by functional annotation. A total of 14,824 common mRNAs and 2,869 common lncRNAs were identified in the 4 sets, and 5 GC‑associated WGCNA modules were preserved across all sets. The MetaDE method identified 1,121 consensus DERs. A total of 50 lncRNAs were shared by the preserved WGCNA modules and the consensus DERs. Subsequently, an 11-lncRNA signature was identified by the LASSO-based Cox-PH model. The lncRNA signature-based risk score could divide patients into 2 risk groups with significantly different overall survival and recurrence-free survival times. The predictive capability of this signature was verified in an independent set. These signature lncRNAs were implicated in several biological processes and pathways associated with the immune response, the inflammatory response and cell cycle control. The present study identified an 11‑lncRNA signature that could predict the survival rate for GC.

Introduction

Gastric cancer (GC) is the fifth leading cause of malignancy worldwide, with a 5-year survival rate of <10% (1,2). In China, it is the second most commonly diagnosed cancer in men and the third most commonly diagnosed cancer in women (3). The poor prognosis is primarily attributable to patients frequently being diagnosed at an advanced stage, when the disease is difficult to cure (4). Early detection is key to improving the survival rate of GC patients. Therefore, the discovery of valuable molecular biomarkers is important for facilitating early diagnosis and effective prediction of prognosis, thereby contributing to improved outcomes in GC patients.

Long noncoding RNAs (lncRNAs) are defined as a group of non-protein-coding transcripts of greater than 200 nucleotides in length, which are characterized by tissue-specific expression patterns (5,6). With the number of lncRNAs being triple the number of protein-coding genes, lncRNAs are predicted to exhibit a more important role in basic, translational and clinical oncology than protein-coding genes (7). Several lncRNAs have been implicated in GC, including H19 (8-10), HOTAIR (11,12) and ANRIL (13). However, the association of lncRNAs with GC prognosis has not been fully elucidated. Although a recent study by Miao et al (14) reported a 4-lncRNA signature of prognostic value for GC patients, that signature was derived from bioinformatics analysis of The Cancer Genome Atlas (TCGA) data only. A comprehensive analysis of gene expression data of GC patients from more databases is required for acquiring a more convincing prognostic lncRNA signature.

In contrast with the study of Miao et al (14), the present study performed an integrated analysis on GC gene expression data mined in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO), EBI ArrayExpress and TCGA repositories. The present study was mainly focused on revealing the critical lncRNAs involved in GC pathogenesis and the roles of the critical lncRNAs in the molecular mechanisms of GC. An 11-lncRNA signature was identified for prognostic risk assessment of GC patients using a weighted correlation network analysis (WGCNA) network, the MetaDE method and a LASSO-based Cox-proportional hazard (PH) model. In addition, the prognostic significance of this signature was validated in an independent set. In order to reveal the molecular mechanisms of these critical lncRNAs, the lncRNA-mRNA interaction network was constructed for functional and pathway enrichment analysis. The results revealed that these critical lncRNAs can regulate the associated mRNAs to influence the immune response, inflammatory response and cell cycle in the pathogenesis of GC.

Materials and methods

Data resource and preprocessing. Gene expression profiles for GC were searched in the publicly accessible GEO at the NCBI (http://www.ncbi.nlm.nih.gov/geo/) and EBI ArrayExpress (https://www.ebi.ac.uk/arrayexpress/). Inclusion criteria were: Human gene expression data; gastric cancer specimens and paired normal specimens; total count of specimens ≥50. Finally, Genomic Spatial Event (GSE) (15) 65801 and GSE29998 downloaded from NCBI GEO and E-MTAB-1338 from EBI ArrayExpress were selected in the present study (Table I).

Raw data (TXT) in GSE65801, GSE29998 and E-MTAB-1338 were subject to log2 transformation by limma (version 3.34.0) software (16) (https://bioconductor.org/packages/release/bioc/html/limma.html). Subsequently, the data were transformed from a skewed distribution to normal distribution, followed by median normalization. Based on the platform annotation files (Table I), probe sets that were assigned with a RefSeq transcript ID and/or Ensembl gene ID were obtained, of which the probe sets labeled as ‘NR’ (non-coding RNA in the RefSeq database) were selected. In addition, platform sequencing data was aligned with the human genome (GRCh38) (17,18) using Clustal 2 (http://www.clustal.org/clustal2/) (19). The resulting lncRNAs and the above-mentioned lncRNAs annotated in the RefSeq database were combined and used in further analysis.

The present study also acquired mRNA-seq data of 384 GC samples and 26 normal controls from the TCGA portal (https://gdc-portal.nci.nih.gov/), which did not require preprocessing. Common RNAs of the GSE65801, GSE29998, E-MTAB-1338 and TCGA sets were used for further analysis.

WGCNA network analysis. WGCNA (20) is a bioinformatics tool used to build gene co-expression networks to mine network modules closely associated with diseases. Based on the common RNAs identified, WGCNA package (21) (version 1.61) in R 3.4.1 [...] scale-free topology criterion. Following the removal of RNAs with coefficients of variation <0.1, the weighted adjacency matrix was then developed. A dynamic tree cut algorithm was used to mine modules with a module size ≥30 and a minimum cut height of 0.95. In addition, preservation of modules in all 4 datasets was examined using the module preservation function of the WGCNA package. Functional annotation of the modules identified was investigated using the userListEnrichment function of the WGCNA package.

Identification of consensus differentially expressed RNAs. Consensus differentially expressed RNAs (DERs) between GC specimens and normal control specimens across the 4 datasets (GSE65801, GSE29998, E-MTAB-1338 and TCGA) were identified with the MetaDE package (22,23) (https://cran.r-project.org/web/packages/MetaDE/) in R language version 3.4.1. The cutoff was set at tau2=0, Qpval>0.05, P<0.05 and false discovery rate (FDR)<0.05. tau2 denotes the amount of heterogeneity while Qpval denotes heterogeneity of a dataset. The common lncRNAs shared by the list of consensus DERs and the RNAs in the preserved WGCNA modules were selected for further analysis.

Development of a prognostic risk scoring system for GC. L1-penalized (lasso) regression, characterized by simultaneous variable selection and shrinkage, is a useful method for determining interpretable prediction rules in high-dimensional data (24). In order to determine an lncRNA signature for prognosis, the penalized package (24) in R language (version 3.4.1) was applied to fit a lasso Cox‑PH model (25) to the overlapping lncRNAs. Based on the optimal lambda value that was selected through 1,000 cross-validations, a panel of prognostic lncRNAs was determined. An equation for calculating risk score was generated based on the expression levels of these prognostic lncRNAs and their regression coefficients from the Cox‑PH model as follows:

\[
\text{Risk score} = \beta_{\text{lncRNA}1} \times \text{expr}_{\text{lncRNA}1} + \beta_{\text{lncRNA}2} \times \text{expr}_{\text{lncRNA}2} + \cdots + \beta_{\text{lncRNA}n} \times \text{expr}_{\text{lncRNA}n}
\]

Risk score was calculated and assigned to each patient in the training set (TCGA set, Table II). With the median risk score as cutoff, all patients in the training set were split into a high-risk group and a low-risk group. Overall survival (OS) time and recurrence-free survival (RFS) time of the two risk groups were analyzed and compared by Kaplan-Meier survival analysis and the logrank test.

The robustness of the risk scoring system was validated in an independent dataset (GSE62254) (26) downloaded from NCBI GEO (platform: GPL570, Affymetrix Human Genome U133 Plus 2.0 Array). GSE62254 included the gene expression data of 300 GC tissue samples (Table II).