Weighted Gene Expression Profiles Identify Diagnostic and Prognostic
Total Page:16
File Type:pdf, Size:1020Kb
Validation Study Journal of International Medical Research 48(3) 1–12 Weighted gene expression ! The Author(s) 2019 Article reuse guidelines: profiles identify diagnostic sagepub.com/journals-permissions DOI: 10.1177/0300060519893837 and prognostic genes for journals.sagepub.com/home/imr lung adenocarcinoma and squamous cell carcinoma Xing Wu1,*, Linlin Wang2,*, Fan Feng3 and Suyan Tian4 Abstract Objective: To construct a diagnostic signature to distinguish lung adenocarcinoma from lung squamous cell carcinoma and a prognostic signature to predict the risk of death for patients with nonsmall-cell lung cancer, with satisfactory predictive performances, good stabilities, small sizes and meaningful biological implications. Methods: Pathway-based feature selection methods utilize pathway information as a priori to provide insightful clues on potential biomarkers from the biological perspective, and such incor- poration may be realized by adding weights to test statistics or gene expression values. In this study, weighted gene expression profiles were generated using the GeneRank method and then the LASSO method was used to identify discriminative and prognostic genes. Results: The five-gene diagnostic signature including keratin 5 (KRT5), mucin 1 (MUC1), trigger- ing receptor expressed on myeloid cells 1 (TREM1), complement C3 (C3) and transmembrane serine protease 2 (TMPRSS2) achieved a predictive error of 12.8% and a Generalized Brier Score of 0.108, while the five-gene prognostic signature including alcohol dehydrogenase 1C (class I), gamma polypeptide (ADH1C), alpha-2-glycoprotein 1, zinc-binding (AZGP1), clusterin (CLU), cyclin dependent kinase 1 (CDK1) and paternally expressed 10 (PEG10) obtained a log-rank P-value of 0.03 and a C-index of 0.622 on the test set. 1Department of Teaching, The First Hospital of Jilin University, Changchun, Jilin Province, China 2Department of Ultrasound, China-Japan Union Hospital *These authors contributed equally to this work. of Jilin University, Changchun, Jilin Province, China Corresponding author: 3School of Mathematics, Jilin University, Changchun, Jilin Suyan Tian, Division of Clinical Research, The First Province, China Hospital of Jilin University, 71 Xinmin Street, Changchun, 4Division of Clinical Research, The First Hospital of Jilin Jilin Province 130021, China. University, Changchun, Jilin Province, China E-mail: [email protected] Creative Commons Non Commercial CC BY-NC: This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage). 2 Journal of International Medical Research Conclusions: Besides good predictive capacity, model parsimony and stability, the identified diagnostic and prognostic genes were highly relevant to lung cancer. A large-sized prospective study to explore the utilization of these genes in a clinical setting is warranted. Keywords Lung adenocarcinoma, squamous cell carcinoma, GeneRank, prognosis, classification, pathway information Date received: 8 August 2019; accepted: 19 November 2019 Introduction gene signature might provide valuable clues on biomarkers that distinguish the Nonsmall-cell lung cancer (NSCLC) is any patients with a specific disease from the type of epithelial lung cancer (LC) other healthy controls; or different diseases with than small-cell lung carcinoma (SCLC) and phenotypically similar medical condi- it accounts for approximately 85% of all LC tions.10 In contrast, a prognostic signature cases.1 Compared with SCLC, patients with may offer insights into the course of a dis- NSCLC are relatively less sensitive to chemo- 2–4 ease, the prediction of survival rates and the therapy. In contrast to NSCLC, SCLC 10,11 has a shorter doubling time, higher growth response to a specific treatment. As far as NSCLC is concerned, many diagnostic sig- fraction and earlier development of metasta- 12–15 16–20 ses.4,5 Furthermore, NSCLC can be classified natures and prognostic signatures into three major histological subtypes: lung have been identified. For example, a adenocarcinoma (AC), lung squamous cell recent study used 183 AC and 80 SCC carcinoma (SCC) and large cell lung cancer patients as a training set and obtained a (LCLC);1 accounting for approximately 4/5 42-gene signature to discriminate these 12 of all LC cases when combined together.6 two subtypes. Furthermore, a 72-gene Since the choice of chemotherapy and tar- prognostic signature was shown to predict geted therapies depends on histological sub- the risk of recurrence for early-stage 21 types, the discrimination and separation of NSCLC patients. NSCLC subtypes are of essential importance Identification of a gene signature is usu- in the clinical setting.7 Likewise, the success- ally accomplished with the aid of a feature ful prediction of which patients with NSCLC selection process. Feature selection has the have a high risk for recurrence and death is advantages of simplifying the final models, of primary importance with regard to the shortening the training time, alleviating the provision of more individualized and precise over-fitting problem and thus improving medical interventions. generalization and having a better biologi- A gene signature is a list of genes with a cal interpretation. Different from the con- unique pattern of gene expression that ventional feature selection methods that results from an altered biological process ignore biological pathway information or and/or a medical condition.8,9 According the underlying correlation or intrinsic to the type of outcomes, a gene signature grouping structure among genes, there can be classified into either a diagnostic sig- exist many feature selection methods that nature or a prognostic one. A diagnostic incorporate such information to guide Wu et al. 3 which genes should be selected. These meth- expression values of genes to obtain weight- ods branch out as a novel type of feature ed gene expression profiles and then the selection, dubbed as the pathway-based fea- LASSO method31 was used to select rele- ture/gene selection.22,23 vant genes, with the objective of identifying Previous studies have demonstrated a diagnostic gene signature for subtype that taking informative pathway knowledge segmentation and a prognostic gene signa- into account, the pathway-based feature ture to predict overall survival rate. The selection algorithms outperform the classic resulting weighted gene expression profiles methods.22,24–26 Consequently, such meth- (i.e. the GeneRanks) bypassed the step of ods tend to replace the classic methods as estimating pathway-based weights and the first choice of statistical methods in thus provided an alternative strategy of many real-world applications. So far, a weighting. majority of identified gene signatures for Notably, even though the GeneRank NSCLC were obtained by utilizing classic method was used in both this current feature selection methods (e.g. a Cox study and our previous study,32 the pur- model or a logistic model plus a LASSO poses of its implementation (to provide penalty) that consider no pathway informa- the rankings for genes versus to generate tion.27,28 To develop more pathway-based weighted expression profiles) and the feature selection methods and then use feature selection methods used (the Cox- them to construct better-performed gene filter method versus the LASSO method) signatures for NSCLC is highly desirable. in both studies differ dramatically. Lastly, The GeneRank method29 modified the objectives of these two studies were dif- Google’s PageRank algorithm30 to specifi- ferent. While our previous study focused on cally handle biological data. This method identifying subtype-specific prognostic calculates a rank (i.e. GeneRank) for each genes, the current study aimed at a diagnos- gene and balances between the mean expres- tic gene signature to separate AC from SCC sion value or the fold change among pheno- and a prognostic gene signature that works types and the connectivity level of a gene well for both subtypes. within a network. GeneRank prioritizes a gene that is highly connected to other genes within the network. Previously, the Materials and methods GeneRank method was utilized to rearrange Experimental data genes accordingly to their GeneRanks and restrict the search space (i.e. the number The data from three microarray experi- of genes under consideration) to those top- ments were used in this study. The accession ranked genes. Then the corresponding numbers of the three microarray experi- P-values of Cox-filter models were adjusted ments in the Gene Expression Omnibus according to a specific gene’s correlation (GEO; https://www.ncbi.nlm.nih.gov/geo/) magnitudes with other genes involved in repository are GSE30219, GSE37745 and this search space, in order to eliminate GSE50081. The RNA-Seq data of TCGA redundant genes as much as possible. LUAD (for the AC subtype) and LUSC The weighting methods are one major (for the SCC subtype) cohorts were used category of pathway-based feature selection as an independent set to validate the perfor- methods,13,22 but they have been underutil- mance of the resulting gene signatures. ized since weight estimation is always Since the three microarray datasets and subject to bias. In this current study, the the RNA-Seq