Three Survival-Related Genes of Esophageal Squamous Cell Carcinoma Identified by Weighted Gene Coexpression Network Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Hindawi Complexity Volume 2021, Article ID 9997783, 11 pages https://doi.org/10.1155/2021/9997783 Research Article Three Survival-Related Genes of Esophageal Squamous Cell Carcinoma Identified by Weighted Gene Coexpression Network Analysis Di Lu ,1 He Wang ,2 Xuanzhen Wu ,1 Jianxue Zhai ,1 Xiguang Liu ,1 Xiaoying Dong ,1 Siyang Feng ,1 Xiaoshun Shi ,1 Jianjun Jiang ,1 Zhizhi Wang ,1 Zhiming Chen ,1 Shuhua Zhao ,3 Jinhua Zhong ,3 Gang Xiong ,1 Hua Wu ,1 Haofei Wang ,1 and Kaican Cai 1 1Department of oracic Surgery, Nanfang Hospital, Southern Medical University, Guangzhou 510515, China 2Department of oracic Surgery, Peking University Shenzhen Hospital, Shenzhen 518000, China 3Department of Biological Information Research, HaploX Biotechnology, Shenzhen 518000, China Correspondence should be addressed to Kaican Cai; [email protected] Received 23 March 2021; Accepted 24 July 2021; Published 9 August 2021 Academic Editor: Chao Huang Copyright © 2021 Di Lu et al. *is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Background. *e aim of this study was to identify novel biomarkers associated with esophageal squamous cell carcinoma (ESCC) prognosis. Methods. 81 ESCC samples collected from *e Cancer Genome Atlas (TCGA) were used as the training set, and 179 ESCC samples collected from the Gene Expression Omnibus database (GEO) were used as the validation set. *e protein-coding genes of 25 samples from patients who completed the follow-up in TCGA were analyzed to construct a coexpression network by weighted gene coexpression network analysis (WGCNA). Gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways analyses were performed for the selected genes. *e least absolute shrinkage and selection operator (LASSO) Cox regression model was constructed to analyze survival-related genes, and an optimal prognostic model was de- veloped as well as evaluated by Kaplan–Meier and ROC curves. Results. In this study, a module containing 43 protein-coding genes and strongly related to overall survival (OS) was identified through WGCNA. *ese genes were significantly enriched in retina homeostasis, antimicrobial humoral response, and epithelial cell differentiation. Besides, through the LASSO regression model, 3 genes (PDLIM2, DNASE1L3, and KRT81) significantly related to ESCC survival were screened and an optimal prognostic 3-gene risk prediction model was constructed. ESCC patients with low and high OS in both sets could be successfully discriminated by calculating a risk score with the linear combination of the expression level of each gene multiplied by the LASSO coefficient. Conclusions. Our study identified three novel biomarkers that have potential in the prognosis prediction of ESCC. 1. Introduction ESCC in recent years, the overall 5-year survival rate for ESCC patients is still less than 25%, and the prognosis is still poor, Esophageal cancer (EC) is a highly aggressive malignancy and with metastasis and recurrence frequently occurring [4]. In one of the leading causes of cancer-related death worldwide addition, due to late diagnosis and the lack of effective tar- [1]. *ere are two histologic subtypes of EC: esophageal geted therapy, most patients are diagnosed at an advanced squamous cell carcinoma (ESCC) and esophageal adeno- stage [5]. *erefore, it is urgent to explore new therapeutic carcinoma (EAC). Among them, ESCC is the main type of methods and new therapeutic targets. EC, accounting for about 90% [2]. Currently, the main Although many genes and mechanisms have been treatments for ESCC include chemotherapy, radiation, and shown to be closely related to the occurrence and devel- surgery [3]. Despite significant advances in the treatment of opment of ESCC, the overall genes and regulation of ESCC 2 Complexity are still unclear. In recent years, with the development of 2.2. WGCNA Network Construction and Module Detection. high-throughput technology, bioinformatics has been in- A weighted gene coexpression network was constructed creasingly applied to explore diseases, molecular mecha- using the WGCNA package in R. *e adjacency coefficient nisms, and find biomarkers for diseases [6]. *e Cancer (aij) was calculated by the absolute value of Pearson’s cor- Genome Atlas (TCGA) contains genomic and clinical in- relation coefficient of genes i and j to the power of β. β formation on many types of cancers, which is publicly aij � jcor(xi; xj)| , where xi is the series of expression values available [7]. For example, Shergalis et al. performed for gene i. *e lowest power β is chosen when the scale-free bioinformatics analysis using TCGA and found that 20 topology fit index curve flattens out upon reaching a high genes were overexpressed and associated with poor survival value. In addition to considering the connection between outcomes in GBM patients [8]. Weighted gene coex- two correlated genes, WGCNA also takes into account as- pression network analysis (WGCNA) is one of the im- sociated genes, and the topological overlaps (Tij) are cal- portant methods to understand gene function and gene culated from aij as follows, to compose a topological overlap association from whole genome expression. It can be used matrix (TOM), as a similarity evaluation reflecting relevancy to detect coexpression modules of highly related genes and and overlap between genes: modules of interest related to clinical features, providing 8> lij + aij good insights for predicting the function of coexpressed > ; i j; <> ≠ genes and discovering genes that play a key role in human min�ki; kj � + 1 − aij Tij �> diseases [9]. *e least absolute shrinkage and selection > :> operator (LASSO) is a penalty regression method that can 1; i � j; be used to analyze gene expression profiles. In addition, due (1) to the high dimension and collinearity of the Lasso Cox lij � X aiuauj; regression model, it can be combined with WGCNA for u≠i;j biomarker identification [10]. *erefore, a better under- standing and application of the above methods are more ki � X aiu; conducive to the identification of ESCC-related genes and u≠i to explore their potential clinical roles and molecular where u represents the common genes linking genes i and j mechanisms to understand the occurrence and develop- together and T takes account of the overlap between ment of ESCC. It has been applied to gene-network con- ij neighboring genes of genes i and j. TOM is subtracted structions and survival model identification for many kinds from one and converted into a topological overlap dis- of cancer like lung adenocarcinoma [11]. In this study, similarity matrix referred to as the corresponding dis- RNA-Seq data from TCGA and microarray data from the similarity of TOM (dissTOM). A hierarchical clustering Gene Expression Omnibus (GEO) database of ESCC were tree (dendrogram) of genes is then created based on downloaded. *en, we conducted a WGCNA-based anal- dissTOM. Finally, modules of highly correlated and ysis in order to identify key modules and hub genes as- coexpressed genes are created via a Dynamic Tree Cut sociated with ESCC pathogenesis. Additionally, the algorithm. potential functions of genes in this identified module were analyzed by gene ontology (GO) and pathway-enrichment analyses. More importantly, a 3-gene risk prediction model 2.3. Relating Modules to External Clinical Traits and Identi- was constructed using Cox and LASSO regression models, fying Hub Genes. Correlations between modules and clinical which could help us better predict ESCC prognosis. traits, including pathologic stage and survival time, were estimated by Spearman’s correlation tests. Significantly 2. Methods correlated modules were visualized using Cytoscape 3.7.1. Genes with multiple links were defined as hub genes. 2.1. Gene Expression Data and Clinical Data. Gene expres- sion data and clinical data for patients with ESCC were 2.4. Gene Ontology and Pathway-Enrichment Analysis. obtained from TCGA (https://cancergenome.nih.gov/) and *e potential biological functions and signaling pathways of the GEO data repository (http://www.ncbi.nlm.nih.gov/geo/) the genes in the selected modules were analyzed by enrich- on September 20, 2018, and included 173 samples and 358 ment GO and Kyoto Encyclopedia of Genes and Genomes samples, respectively. Gene expression levels from TCGA (KEGG) pathways in Metscape (http://metascape.org). were measured by RNA sequencing, denoted by fragments per kilobase of transcript per million mapped reads (FPKM) : FPKM � 109 × number of reads mapped to the gene/(number 2.5. LASSO Cox Regression Model Construction. LASSO Cox of reads mapped to all protein-coding genes × length of the regression models were constructed using the glmnet gene in base pairs). Gene expression profiles from GEO were package in R. By utilizing several hub genes from the selected assessed by the Agilent-038314 CBC Homo sapiens micro- modules, the function returned a series of values of λ and array V2.0 and were analyzed using the GeneSpring software models. *e coefficients of most original genes were pe- V11.5 (Agilent). Clinical information, including pathologic nalized to zero in line with increasing values of the tuning TNM stage and follow-up information, was also collected. parameter λ. λ was chosen when the partial likelihood de- *is research flow chart is shown in Figure 1. viance reached its lowest. A suitable model was then chosen Complexity 3 e cancer genome atlas Gene expression omnibus database (81 ESCC samples) (179 ESCC samples) Construct a coexpression network WGCNA analysis Selected modules Function enrichment analysis Identification and validation of hub genes LASSO Cox regression model Figure 1: *is flow chart of this study. by 10-fold cross-validation using the function cv.glmnet. 3. Results Using the function lambda.min, the remaining genes with nonzero LASSO coefficients were obtained. *e risk score 3.1. Data Preprocessing. Samples from TCGA were regarded for each ESCC patient was calculated by a linear combi- as a training set and used for network and LASSO Cox nation of the FPKM of each gene (Gk) multiplied by the regression model construction.