Non‑Small‑Cell Lung Cancer Pathological Subtype‑Related Gene
Total Page:16
File Type:pdf, Size:1020Kb
356 MOLECULAR AND CLINICAL ONCOLOGY 8: 356-361, 2018 Non‑small‑cell lung cancer pathological subtype‑related gene selection and bioinformatics analysis based on gene expression profiles JIANGPENG CHEN1, XIAOQI DONG2, XUN LEI1, YINYIN XIA1, QING ZENG1, PING QUE1, XIAOYAN WEN1, SHAN HU1 and BIN PENG1 1School of Public Health and Management, Chongqing Medical University, Chongqing 400016; 2Department of Respiratory Diseases, The First Affiliated Hospital of Zhejiang University, Hangzhou, Zhejiang 310003, P.R. China Received March 21, 2016; Accepted November 21, 2017 DOI: 10.3892/mco.2017.1516 Abstract. Lung cancer is one of the most common malignant Introduction diseases and a major threat to public health on a global scale. Non-small-cell lung cancer (NSCLC) has a higher degree of Lung cancer is one of the most common malignant diseases malignancy and a lower 5-year survival rate compared with that and a major threat to public health on a global scale. The main of small-cell lung cancer. NSCLC may be mainly divided into types of lung cancer are small-cell lung cancer (SCLC) and non- two pathological subtypes, adenocarcinoma and squamous cell small-cell lung cancer (NSCLC). NSCLC has a higher degree carcinoma. The aim of the present study was to identify disease of malignancy and a lower 5-year survival rate compared with genes based on the gene expression profile and the shortest path SCLC, and may be divided into two major histopathological analysis of weighted functional protein association networks subtypes, namely adenocarcinoma (ADC) and squamous cell with the existing protein-protein interaction data from the Search carcinoma (SCC). As NSCLC is a complex disease, the iden- Tool for the Retrieval of Interacting Genes. The gene expression tification of disease genes has been among the main goals of profile (GSE10245) was downloaded from the National Center biologists and clinicians. Although our understanding of lung for Biotechnology Information Gene Expression Omnibus data- cancer, particularly NSCLC, has improved significantly in base, including 40 lung adenocarcinoma and 18 lung squamous recent years, several questions require further elucidation. cell carcinoma tissues. A total of 8 disease genes were identi- With the completion of the Human Genome Project, scien- fied using Naïve Bayesian Classifier based on the Maximum tific research has entered the post-genome era. Furthermore, Relevance Minimum Redundancy feature selection method with the advances in biological research and the development following preprocessing. An additional 21 candidate genes of high throughput biotechnologies, the quantities of candidate were selected using the shortest path analysis with Dijkstra's genes and numerous genomic loci have been reported to play a algorithm. The AURKA and SLC7A2 genes were selected three vital role in the pathogenesis of NSCLC. However, the mecha- and two times in the shortest path analysis, respectively. All nism underlying this disease has not yet been fully elucidated. those genes participate in a number of important pathways, As is well-known, different analytical methods may result in such as oocyte meiosis, cell cycle and cancer pathways with different consequences due to the ̔small N large P̓ problem in Kyoto Encyclopedia of Genes and Genomes pathway enrich- microarray data, i.e., the gene expression data usually come with ment analysis. The present findings may provide novel insights only dozens of tissue samples, but with up to tens of thousands into the pathogenesis of NSCLC and enable the development of gene features. This extreme sparseness is considered to of novel therapeutic strategies. However, further investigation is significantly compromise the performance of a classifier (1). As required to confirm these findings. a result, the ability to extract a subset of informative genes while removing irrelevant or redundant genes is crucial for accurate classification (2). Hence, the methodology research is of great importance and the reuse of microarray data is feasible and extremely valuable. It is meaningful to identify disease-specific genes to help provide a novel therapeutic target for tumor treat- ment and elucidate the mechanism of the disease. Correspondence to: Dr Bin Peng, School of Public Health and Management, Chongqing Medical University, 1 Yixueyuan Road, The aim of the present study was to identify the NSCLC Chongqing 400016, P.R. China subtype-related genes based on the classification method E-mail: [email protected] and the shortest path analysis based on the protein-protein interaction (PPI) network, and to investigate their underlying Key words: non-small-cell lung cancer, microarray, bioinformatics functions by Gene Ontology (GO) and Kyoto Encyclopedia of analysis, feature selection, pathway Genes and Genomes (KEGG) pathway enrichment analysis with the Database for Annotation, Visualization and Integrated CHEN et al: NON-SMALL CELL LUNG CANCER RELATED GENE SELECTION AND BIOINFORMATICS ANALYSIS 357 Discovery (DAVID). This workflow may be of value for Let S denote the subset of features we are seeking. The identifying disease genes and may provide insight into the maximum relevance condition is pathogenesis of malignant tumors. To the best of our knowl- edge, this investigation is the first of its type in lung cancer (2) studies, and it may be useful for further lung cancer candidate gene verification, thus enabling a better understanding of the molecular mechanisms of NSCLC and the development more The minimum redundancy condition is effective diagnostic and intervention methods. (3) Materials and methods Data source. The gene expression dataset (GDS3627, acces- sion no. GSE10245) was downloaded from the National Center where S is used to represent the number of features in S, for Biotechnology Information Gene Expression Omnibus and c represents the targeted classes. (GEO) from a research conducted by Kuner et al (3) on lung The mutual information quotient criterion may be cancer. The data were processed with the Affymatrix GPL570 described as follows. platform and generated from 58 tissue samples (40 ADC and 18 SCC). All the tissue samples underwent careful analysis of maxФ (D,R), Ф = D/R (4) the histopathology, estimation of tumor and stromal content and subsequent macro-dissection of the tumor regions prior to This optimization may be computed efficiently by heuristic enrollment in the microarray experiment. No ethics committee algorithm approval is required to obtain these data, since they are publicly available. In addition, only expression data are presented, and (5) no personal patient information is revealed. Preprocessing of microarray data. Data preprocessing, including where xj ∈ XF -Sm-1, XF represents the original feature set. quantile normalization and log2 transformation was completed by the GEOquery package (4). In order to minimize false posi- Naïve Bayes Classifier.The Naïve Bayes Classifier is a simple tives, the GeneSelector package in R programme was used to probabilistic classifier based on applying Bayes' theorem with rank gene lists. Briefly, selection of differential expressed genes strong independence assumptions between features. For a may be highly affected by the statistical methods used, and the sample s with m gene expression levels {g1, g2,....., gm} for the overlap of genes selected using the same significance value may m features, the posterior probability that s belongs to class ck is be quite low. GeneSelector ranks genes on the basis of how well they perform in a selected number of statistical tests and mini- mize the number of false positives (5). Ranked lists based on (6) ordinary Student's t-test, significance analysis of microarrays, and fold-change were first generated for each of the criteria. The Markov chain model within the GeneSelector package was used where p(gt|ck) are conditional tables estimated from training to aggregate the ranked lists of all criteria into a list of genes that examples. are significant, independently of the method used. To reduce The Naïve Bayes Classifier was used after all the features noise and due to the fact that prediction methods such as Naïve ranked by the mRMR. A series of classifiers were then Bayes Classifier prefer categorical data, the gene expression data constructed when the features were added into the classifier were discretized using the respective µ (mean) and σ (standard one by one. For example, the first classifier was constructed by deviation): Any data larger than µ + σ/2 were transformed to the first feature, the second classifier by the first two features, state 1; any data between µ - σ/2 and µ + σ/2 were transformed and the rest in the same manner. to state 0; and any data smaller than µ ‑ σ/2 were transformed Using incremental feature selection (IFS), the number N to state -1. These three states correspond to the overexpression, may be determined. Its idea is to compare prediction accuracy baseline, and underexpression of genes, respectively. According defined in the following selection among different Ns, and to the previous studies, discretization of gene expression data select the one with the highest accuracy. generally leads to better prediction accuracy. The prediction accuracy was formulated by Maximum relevance minimum redundancy (mRMR) feature (7) selection method. The mRMR feature selection method was proposed by Ding et al (6). For discrete variables, the mutual information I of two variables x and y is defined based on where TP represents the number of true positives, TN