Identification of EMT-Related High-Risk Stage II Colorectal Cancer And
Total Page:16
File Type:pdf, Size:1020Kb
www.nature.com/bjc ARTICLE Molecular Diagnostics Identification of EMT-related high-risk stage II colorectal cancer and characterisation of metastasis-related genes Kai Wang1, Kai Song1, Zhigang Ma2, Yang Yao2, Chao Liu2, Jing Yang1, Huiting Xiao1, Jiashuai Zhang1, Yanqiao Zhang2 and Wenyuan Zhao1 BACKGROUND: Our laboratory previously reported an individual-level prognostic signature for patients with stage II colorectal cancer (CRC). However, this signature was not applicable for RNA-sequencing datasets. In this study, we constructed a robust epithelial-to-mesenchymal transition (EMT)- related gene pair prognostic signature. METHODS: Based on EMT-related genes, metastasis-associated gene pairs were identified between metastatic and non-metastatic samples. Then, we selected prognosis-associated gene pairs, which were significantly correlated with disease-free survival of stage II CRC using multivariate Cox regression model, as the EMT-related prognosis signature. RESULTS: An EMT-related signature composed of fifty-one gene pairs (51-GPS) for prediction-relapse risk of patients with stage II CRC was developed, whose prognostic efficiency was validated in independent datasets. Moreover, 51-GPS achieved better predictive performance than other reported signatures, including a commercial signature Oncotype Dx colon cancer and an immune-related gene pair signature. Besides, EMT-related functional gene sets achieved high enrichment scores in high-risk samples. Especially, loss-of-function antisense approach showed that DEGs between the predicted two clusters were metastasis- related. CONCLUSIONS: The EMT-related gene pair signature can identify the high relapse-risk patients with stage II CRC, which can facilitate individualised management of patients. British Journal of Cancer (2020) 123:410–417; https://doi.org/10.1038/s41416-020-0902-y BACKGROUND greatly affected by the sampling locations10,11 and RNA degrada- Colorectal cancer (CRC) ranks third in terms of incidence, but tion problem during sample preparation12 of tumour tissues. second in terms of cancer-related death worldwide.1 Treatment Although a quantitative signature Oncotype Dx colon cancer has decision and prognosis assessment mainly depend on the been used commercially, some patients are categorised as pathological stage of the tumour.2 However, about 20% patients “intermediated risk” cluster, which complicates clinical decision- of stage II CRC will relapse after curative surgery.3 Therefore, some making. To tackle the above-mentioned problems, qualitative other factors were proposed for therapy decisions. For example, methods, such as TSP13 and k-TSP,14 have been proposed, which stage II patients with high-risk factors, such as T4 stage and high are relatively robust to these factors. Using this method, several tumour grade, have a greater chance of relapse and should be qualitative signatures have been developed for prognosis/ treated with chemotherapy after surgery.4 But these clinicopatho- prediction of tumours. Especially for CRC patients, based on the logical risk factors do not adequately distinguish between patients within-sample relative expression orderings (REOs) of genes, our who have high or low risk of relapse, and lead to over- or under- laboratory previously reported an individual-level prognostic diagnosis.5 signature consisting of three gene pairs for predicting the post- Several studies have developed quantitative signatures based surgery relapse risk of stage II CRC.15 This signature was developed on gene expression for survival stratification with stage II CRC,6,7 by training on microarray expression data, and validated using which were developed from genome-wide, prognosis-related and independent microarray datasets. However, this signature was immune-related genes. Unfortunately, the clinical practice is not assessed in the RNA-seq platform. Wu et al.16 have also limited owing to issues such as overfitting on small discovery constructed REO-based individualised prognostic signatures. datasets and lack of sufficient validation. Besides, this type of Nevertheless, the signature was developed without considering prognostic signature calculated by the sum of the weighted the specificity of stage. expression values of the characteristic genes is difficult to be The epithelial-to-mesenchymal transition (EMT) is a centrally reproducible due to experimental batch effects and platform important mechanism for the metastasis of carcinomas,17 which differences.8,9 In addition, gene expression measurements are was typically characterised by loss of cell–cell adhesion and apical- 1Department of Systems Biology, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150086, China and 2Department of Gastrointestinal Medical Oncology, Harbin Medical University Cancer Hospital, No. 150, Haping Road, Nangang District, Harbin 150001, China Correspondence: Yanqiao Zhang ([email protected]) or Wenyuan Zhao ([email protected]) These authors contributed equally: Kai Wang, Kai Song, Zhigang Ma Received: 17 March 2020 Revised: 25 April 2020 Accepted: 1 May 2020 Published online: 21 May 2020 © The Author(s) 2020 Published by Springer Nature on behalf of Cancer Research UK Identification of EMT-related high-risk stage II colorectal cancer and. K Wang et al. 411 based cell polarity, as well as the increased invasion of cells.18 the appropriate threshold of the signature. A sample was classified Furthermore, induction of EMT has been reported to lead to as a high-risk cluster if at least k gene pairs voted for high-risk, patients at an early-stage CRC more prone to metastasis.19,20 otherwise, low-risk cluster. The prognosis-associated gene pairs Herein, we aim to construct a gene pair signature to figure with the appropriate threshold were defined as the gene pair outpatients at risk of relapse using EMT-related genes in stage signature (GPS), which could be used directly in the validation II CRC. datasets. Survival analysis METHODS DFS was defined as the time from surgery to relapse or the final Data acquisition and pre-processing documented data (censored). Survival curves of DFS between Twelve CRC gene expression datasets were collected from the different clusters were estimated using the Kaplan–Meier (K–M) public database, including ten microarray datasets and a RNA-seq method, the differences between the survival curves were dataset from the Gene Expression Omnibus (GEO, https://www.ncbi. compared using the log-rank test30 and 95% confidence intervals nlm.nih.gov/geo/),21 and one RNA-seq dataset from The Cancer (CIs) were calculated using a univariate Cox regression model.29 Genome Atlas (TCGA, https://www.cancer.gov/about-nci/ The independent prognostic value of the signature was assessed organization/ccg/research/structural-genomics/tcga),22 as described by multivariate Cox regression model after adjustment for clinical briefly in Supplementary Table S1a. Specific clinicopathological factors. The predictive accuracy of the signature was assessed features are described in Supplementary Table S1b. For the datasets using the receiver-operating characteristic curve (ROC, “pROC” from GEO, we downloaded the raw data (.CEL files) and used the package, version 1.14.0). All statistical analyses were performed robust multi-array average (RMA) method23 for background using R software version 3.5.2 (https://www.r-project.org/). adjustment without quantile normalisation. Each probe ID was mapped to Entrez gene ID with the corresponding platform files. If Functional enrichment analysis a probe was mapped to multiple or zero genes, the data of this We performed gene set enrichment analysis using GSEA software probe were discarded. If multiple probes were mapped to the same (http://software.broadinstitute.org/gsea/index.jsp) with 1000 per- gene, the expression level of this gene was summarised as the mutations. For the RNA-seq data from TCGA, FPKM values were arithmetic mean of the values of multiple probes. For TCGA used to calculate the log2 fold change between high- and low-risk transcriptional data derived from Illumina high-throughput sequen- clusters. The hallmark gene sets were used for the target gene sets 1234567890();,: cing (HiSeq) platform, the raw count and fragments per kilobase of for GSEA. The gene sets satisfying p < 0.05 were considered transcript per million fragments mapped (FPKM) values were statistically significant. Besides, we assessed consensus molecular extracted. For mutation data derived from the Illumina Genome subtype (CMS) classification between high- and low-risk clusters Analyzer DNA Sequencing GAIIx platform, only the nonsynonymous using the “CMSclassifier” package, version 1.0.0. mutations remained. Data of copy number variations (CNVs) were processed with the GISTIC algorithm.24 Samples from GSE39582 Cell lines and transfection and TCGA were used as training cohorts due to their relatively high- HCT116 cells were purchased from the American Type Culture quality clinical records and long-term follow-up. Datasets with Collection (Manassas, VA, USA) and cultured at 37 °C in a humified sample size more than 50 were used as independent validation atmosphere of 95% O2 and 5% CO2. HCT116 cells were grown in cohorts. However, as for small-sample datasets with sample size less high-glucose DMEM medium (Thermo Fisher Scientific, Waltham, than 50, we combined them according to the individual platforms, MA, USA) with 10% foetal calf serum (Thermo Fisher Scientific, and defined Com_570 and Com_96, since different datasets could