Gene Selection in Cancer Classification Using GPSO/SVM and GA/SVM Hybrid Algorithms Enrique Alba, José Garcia-Nieto, Laetitia Jourdan, El-Ghazali Talbi
Total Page:16
File Type:pdf, Size:1020Kb
Gene Selection in Cancer Classification using GPSO/SVM and GA/SVM Hybrid Algorithms Enrique Alba, José Garcia-Nieto, Laetitia Jourdan, El-Ghazali Talbi To cite this version: Enrique Alba, José Garcia-Nieto, Laetitia Jourdan, El-Ghazali Talbi. Gene Selection in Cancer Clas- sification using GPSO/SVM and GA/SVM Hybrid Algorithms. Congress on Evolutionary Computa- tion, Sep 2007, Singapor, Singapore. inria-00269967 HAL Id: inria-00269967 https://hal.inria.fr/inria-00269967 Submitted on 3 Apr 2008 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Gene Selection in Cancer Classification using PSO/SVM and GA/SVM Hybrid Algorithms Enrique Alba, JoseGarc´ ´ıa-Nieto, Laetitia Jourdan and El-Ghazali Talbi Abstract— In this work we compare the use of a Particle a filter method, is in definition, independent of the learning Swarm Optimization (PSO) and a Genetic Algorithm (GA) algorithm used after it. The second one, the wrapper method, (both augmented with Support Vector Machines SVM) for which carries out the feature subset selection and classifi- the classification of high dimensional Microarray Data. Both algorithms are used for finding small samples of informative cation in the same process, engages a learning algorithm genes amongst thousands of them. A SVM classifier with 10- to measure the classification accuracy. From a conceptual fold cross-validation is applied in order to validate and evaluate point of view, wrapper approaches are clearly advantageous, the provided solutions. A first contribution is to prove that since the features are selected by optimizing the discriminate PSO SV M is able to find interesting genes and to provide clas- power of the finally used induction algorithm. sification competitive performance. Specifically, a new version of PSO, called Geometric PSO, is empirically evaluated for Feature selection for gene expression analysis in cancer the first time in this work using a binary representation in prediction often uses wrapper classification methods [7] to Hamming space. In this sense, a comparison of this approach discriminate a type of tumor, to reduce the number of genes with a new GASV M and also with other existing methods of to investigate in case of a new patient, and also to assist literature is provided. A second important contribution consists in drug discovery and early diagnosis. Several classification in the actual discovery of new and challenging results on six public datasets identifying significant in the development of a algorithms could be used for wrapper methods, such as K- variety of cancers (leukemia, breast, colon, ovarian, prostate, Nearest Neighbor (K-NN) [8] or Support Vector Machines and lung). (SVM) [9]. By creating clusters a big reduction of the number of considered genes and an improvement of the classification I. INTRODUCTION accuracy can be finally achieved. Microarray technology (DNA microarray) [1] allows to The definition of the feature selection problem is this: simultaneously analyze thousands of genes and thus can give given a set of features F = {f1, ..., fi, ..., fn}, find a subset important insights about cell’s function, since changes in F ⊆ F that maximizes a scoring function Θ:Γ→ G such the physiology of an organism are generally associated with that changes in gene expression patterns. Several gene expres- F = argmaxG⊂Γ{Θ(G)}, (1) sion profiles obtained from tumors such as Leukemia [2], Colon [3] and Breast [4] have been studied and compared to where Γ is the space of all possible feature subsets of expression profiles of normal tissues. However, expression F and G a subset of Γ. The optimal feature selection data are highly redundant and noisy, and most genes are problem has been shown to be NP-hard [10]. Therefore, believed to be uninformative with respect to studied classes, only heuristics approaches are able to deal with large size as only a fraction of genes may present distinct profiles for problems. Recently, such advanced structured methods have different classes of samples. So, tools to deal with these been used to explore the huge space of feature subsets, like issues are critically important. These tools should learn to for example metaheuristics as Evolutionary Algorithms and, robustly identify a subset of informative genes embedded specifically, Genetic Algorithms (GAs) [11], [5], [12]. out of a large dataset which is contaminated with high- In this work, we are interested in gene selection and dimensional noise [5]. classification of DNA Microarray data in order to distin- In this context, feature selection is often considered as guish tumor samples from normal ones. For this purpose, a necessary preprocess step to analyze these data, as this we propose two hybrid models that use metaheuristics and method can reduce the dimensionality of the datasets and classification techniques. The first one consists of a Particle often conducts to better analyses [6]. Swarm Optimization (PSO) [13] combined with a SVM Two models of feature selection exist depending on approach. PSO is a population based metaheuristic inspired whether the selection is coupled with a learning scheme or by the social behavior of bird flocking or fish schooling. not. The first one, filter model, which carries out the feature Specifically, a recent version called Geometric PSO [14] subset selection and the classification in two separate phases, (explained in Section II) has been used in this work. The uses a measure that is simple and fast to compute. Hence, second model is based on the popular GA using a spe- E. Alba and J. Garc´ıa-Nieto are with the Department of Lengua- cialized Size-Oriented Common Feature Crossover Operator jes y Ciencias de la Computacion,´ Universidad de Malaga,´ (email: (SSOCF) [15], which keeps useful informative blocks and {eat,jnieto}@lcc.uma.es, url http://neo.lcc.uma.es) produces offsprings which have the same distribution than L. Jourdan and E-G. Talbi are with the LIFL/INRIA Futurs-Universite´ de Lille 1, Batˆ M3-Cite´ Scientifique, (email: {jourdan,talbi}@lifl.fr, url the parents. This model will be also combined with SVM in http://www2.lifl.fr/OPAC/) our approach. Both proposed approaches are experimentally assessed on B. The SVM Classifier six well-known cancer datasets (Leukemia, Colon, Breast, Support Vector Machines, a technique derived from statis- Ovarian [16], Prostate [17] and Lung [18]), discovering new tical learning theory, is used to classify points by assigning and challenging results and identifying specific genes that them to one of two disjoint half spaces [9]. So, SVM our work suggests as significants. Performances of proposed performs mainly a (binary) 2-class classification. For linearly PSO and GA algorithms solving the gene extraction problem separable data, SVM obtains the hyperplane which maxi- (using SVM) are compared in this paper. Specifically, we fo- mizes the margin (distance) between the training samples cused in the capacity of the PSOSV M combination in order and the class boundary. For non linearly separable cases, to provide considerable performance in this matter. In this samples are mapped to a high dimensional space where sense, comparisons with several state of art methods show such a separating hyperplane can be found. The assignment competitive results according to the conventional criteria. is carried out by means of a mechanism called the kernel The outline of this work as follows. We review the PSO function. and the SVM techniques in order to introduce our PSOSV M SVM is widely used in the domain of cancer studies, hybrid model in Section II. In Section III, the six microarray protein identification and specially in Microarray data [6], datasets used in this study are described. Experimental results [12], [19]. Unfortunately, in many bioinformatics problems are presented in Section IV, including biological descriptions the number of features is significantly larger than the number of several obtained genes. Finally, we summarize our work of samples. For this reason, tools for decreasing the number and present some conclusions and possible future work in of features in order to improve the classification or to help Section V. to identify interesting features (genes) in noisy environments are necessary. In addition, SVM can treat data with a large II. GENE SELECTION AND CLASSIFICATION BY number of genes, but it has been shown that its perfor- PSOSV M mance is increased by reducing the number of genes [20]. In this section, we describe the hybrid PSOSV M approach The hybrid PSO and hybrid GA approaches next proposed for gene selection and classification of Microarray data. The contribute notably in this sense. PSO algorithm is designed for obtaining gene subsets as solutions in order to reduce the high number of genes to C. The Hybrid PSOSV M Approach be later classified. The SVM classifier is used whenever the In order to offer a basic idea of the operation of our fitness evaluation of a tentative gene subset is required. PSOSV M approach, in Figure 1, we can observe a sim- ple scheme of how features are extracted from the initial A. Particle Swarm Optimization microarray dataset and how the resulted subset is evaluated. Particle Swarm Optimization was first proposed by In a first phase, the metaheuristic algorithm involved, PSO Kennedy and Eberhart in 1995 [13]. PSO is a population in this case, provides a binary encoded particle1 where each based evolutionary algorithm inspired in the social behavior bit2 represents a gene. If a bit is 1, it means that this gene of bird flocking or fish schooling.