
RESEARCH ARTICLE Open Access

Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification

Yong Liang1*, Cheng Liu1, Xin-Ze Luan1, Kwong-Sak Leung2, Tak-Ming Chan2, Zong-Ben Xu3 and Hai Zhang3

Abstract

Background: Microarray technology is widely used in cancer diagnosis. Successfully identifying gene biomarkers will significantly help to classify different cancer types and improve the prediction accuracy. The regularization approach is one of the effective methods for gene selection in microarray data, which generally contain a large number of genes and a small number of samples. In recent years, various approaches have been developed for gene selection of microarray data. Generally, they are divided into three categories: filter, wrapper and embedded methods. Regularization methods are an important embedded technique and perform continuous shrinkage and automatic gene selection simultaneously. Recently, there has been growing interest in applying regularization techniques to gene selection. The most popular regularization technique is Lasso (L1), and many L1-type regularization terms have been proposed in recent years. Theoretically, the Lq-type regularization with a lower value of q leads to sparser solutions. Moreover, the L1/2 regularization can be taken as a representative of the Lq (0 < q < 1) regularizations and has been shown to have many attractive properties.

Results: In this work, we investigate sparse logistic regression with the L1/2 penalty for gene selection in cancer classification problems, and propose a coordinate descent algorithm with a new univariate half thresholding operator to solve the L1/2 penalized logistic regression. Experimental results on artificial and microarray data demonstrate the effectiveness of the proposed approach compared with other regularization methods. In particular, on 4 publicly available datasets, the L1/2 regularization method achieved its success using only about 2 to 14 predictors (genes), compared with about 6 to 38 genes for the ordinary L1 and elastic net regularization approaches.

Conclusions: From our evaluations, it is clear that sparse logistic regression with the L1/2 penalty achieves higher classification accuracy than the ordinary L1 and elastic net regularization approaches, while selecting fewer but more informative genes. This is an important consideration for screening and diagnostic applications, where the goal is often to develop an accurate test using as few features as possible in order to control cost. Therefore, sparse logistic regression with the L1/2 penalty is an effective technique for gene selection in real classification problems.

Keywords: Gene selection, Sparse logistic regression, Cancer classification

Background

With the development of DNA microarray technology, biology researchers can analyze the expression levels of thousands of genes simultaneously. Many studies have demonstrated that microarray data are useful for the classification of many cancers. However, from the biological perspective, only a small subset of genes is strongly indicative of a targeted disease, and most genes are irrelevant to cancer classification. The irrelevant genes may introduce noise and decrease classification accuracy. Moreover, from the machine learning perspective, too many genes may lead to overfitting and can negatively influence the classification performance. Due to the significance of these problems, effective gene selection methods are desirable to help classify different cancer types and improve prediction accuracy.

In recent years, various approaches have been developed for gene selection from microarray data. Generally, they are divided into three categories: filter, wrapper and embedded methods. Filter methods evaluate a gene based on its discriminative power without considering its correlations with other genes [1-4].


The drawback of filter methods is that they examine each gene independently, ignoring the possibility that groups of genes may have a combined effect which is not necessarily reflected by the individual performance of the genes in the group. This is a common issue with statistical methods such as the t-test, which examine each gene in isolation.

Wrapper methods utilize a particular learning method as the feature evaluation measurement to select gene subsets in terms of the estimated classification errors and to build the final classifier. Wrapper approaches can obtain a small subset of relevant genes and can significantly improve classification accuracy [5,6]. For example, Guyon et al. [7] proposed a gene selection approach utilizing support vector machines (SVM) based on recursive feature elimination. However, wrapper methods require extensive computational time.

The third group of gene selection procedures is embedded methods, which perform the variable selection as part of the statistical learning procedure. They are much more efficient computationally than wrapper methods with similar performance, and they are less prone to overfitting [8]. Embedded methods have drawn much attention recently in the literature. Regularization methods are an important embedded technique and perform continuous shrinkage and automatic gene selection simultaneously. Recently, there has been growing interest in applying regularization techniques to logistic regression models. Logistic regression is a powerful discriminative method with a direct probabilistic interpretation: it yields class membership probabilities in addition to the class label. In order to extract key features in classification problems, a series of regularized logistic regression methods have been proposed. For example, Shevade and Keerthi [9] proposed sparse logistic regression based on the Lasso regularization [10] and Gauss-Seidel methods. Glmnet is a general approach for L1-type regularized (including Lasso and elastic net) linear models using a coordinate descent algorithm [11,12]. Similarly, Cawley and Talbot [13] investigated sparse logistic regression with Bayesian regularization. Inspired by the aforementioned methods, we investigate the sparse logistic regression model with a L1/2 penalty, in particular for gene selection in cancer classification. The L1/2 penalty can be taken as a representative of the Lq (0 < q < 1) penalties and has demonstrated many attractive properties, such as unbiasedness, sparsity and oracle properties [14].

Methods

Sparse logistic regression with the L1/2 penalty

In this paper, we focus on a general binary classification problem. Suppose we have n samples, D = {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where X_i = (x_{i1}, x_{i2}, ..., x_{ip}) is the i-th input pattern with dimensionality p and y_i is the corresponding response, which takes the value 0 or 1; y_i = 0 indicates that the i-th sample is in Class 1 and y_i = 1 indicates that the i-th sample is in Class 2. The vector X_i contains the p features (for all p genes) of the i-th sample, and x_{ij} denotes the value of gene j for the i-th sample. Define a classifier f(x) = e^x / (1 + e^x) such that, for any input x with class label y, f(x) predicts y correctly. The logistic regression is expressed as:

P(y_i = 1 | X_i) = f(X_i'β) = exp(X_i'β) / (1 + exp(X_i'β))    (1)

where β = (β_0, β_1, ..., β_p) are the coefficients to be estimated; note that β_0 is the intercept. The negative log-likelihood is:

l(β|D) = −∑_{i=1}^{n} { y_i log f(X_i'β) + (1 − y_i) log[1 − f(X_i'β)] }    (2)

We can obtain β by minimizing (2). In high-dimensional applications with p >> n, directly solving the logistic model (2) is ill-posed and may lead to overfitting, so regularization approaches are applied to address the overfitting problem. Adding a regularization term to (2), the sparse logistic regression is modelled as:

β̂ = arg min_β { l(β|D) + λ ∑_{j=1}^{p} P(β_j) }    (3)

where λ > 0 is a tuning parameter and P(β) is the regularization term. The most popular regularization technique is Lasso (L1) [10], which uses the regularization term P(β) = ∑_j |β_j|. Many L1-type regularization terms have been proposed in recent years, such as SCAD [15], elastic net [16] and MC+ [17]. Theoretically, the Lq-type regularization P(β) = ∑_j |β_j|^q with a lower value of q leads to sparser solutions, but when q is very close to zero, difficulties with convergence arise. Therefore, Xu et al. [14] further explored the properties of the Lq (0 < q < 1) regularizations and showed that the L1/2 regularization can serve as their representative.
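For concreteness, the penalized objective in (1)-(3) can be written down directly. The following minimal Python sketch, with illustrative function names of our own, evaluates the negative log-likelihood (2) and the penalized objective (3) for a generic Lq penalty; q = 0.5 gives the L1/2 penalty studied in this paper.

```python
import numpy as np

def neg_log_likelihood(beta0, beta, X, y):
    """Negative log-likelihood (2) of the logistic model (1)."""
    eta = beta0 + X @ beta                      # linear predictor X_i' beta
    p = 1.0 / (1.0 + np.exp(-eta))              # f(X_i' beta)
    eps = 1e-12                                 # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def penalized_objective(beta0, beta, X, y, lam, q=0.5):
    """Objective (3) with a generic Lq penalty; the intercept is not penalized."""
    return neg_log_likelihood(beta0, beta, X, y) + lam * np.sum(np.abs(beta) ** q)
```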

In this paper, we apply the L1/2 penalty to the logistic regression model. The sparse logistic regression model based on the L1/2 penalty has the form:

β̂ = arg min_β { l(β|D) + λ ∑_{j=1}^{p} |β_j|^{1/2} }    (4)

The L1/2 regularization has been shown to have many attractive properties, such as unbiasedness, sparsity and oracle properties. Theoretical and experimental analyses show that the L1/2 regularization is a competitive approach, and our work in this paper also reveals its effectiveness for solving nonlinear logistic regression problems with a small number of predictive features (genes).

A coordinate descent algorithm for the L1/2 penalized logistic regression

The coordinate descent algorithm [11,12] is a "one-at-a-time" approach, and its basic procedure can be described as follows: for each coefficient β_j (j = 1, 2, ..., p), partially optimize the target function with respect to β_j with the remaining elements of β fixed at their most recently updated values. Before introducing the coordinate descent algorithm for the nonlinear logistic regularization, we first consider the linear regularization case. Suppose the dataset D has n samples, D = {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where X_i = (x_{i1}, x_{i2}, ..., x_{ip}) is the i-th input vector with dimensionality p and y_i is the corresponding response variable. The variables are standardized so that ∑_{i=1}^{n} x_{ij}^2 = 1 and ∑_{i=1}^{n} y_i = 0. The linear regression with a regularization term can be expressed as:

R(β) = arg min_β { (1/2n) ∑_{i=1}^{n} (y_i − X_i'β)^2 + λ ∑_{j=1}^{p} P(β_j) }    (5)

where P(β) is the regularization term. The coordinate descent algorithm solves for β_j while the other β_k (k ≠ j, the parameters that remain after the j-th element is removed) are held fixed. Equation (5) can be rewritten as:

R(β) = arg min_{β_j} { (1/2n) ∑_{i=1}^{n} (y_i − ∑_{k≠j} x_{ik}β_k − x_{ij}β_j)^2 + λ ∑_{k≠j} P(β_k) + λ P(β_j) }    (6)

The first-order derivative at β_j can be estimated as:

∂R/∂β_j = −∑_{i=1}^{n} x_{ij}(y_i − ∑_{k≠j} x_{ik}β_k − x_{ij}β_j) + λ P'(β_j) = 0    (7)

Define ỹ_i^{(j)} = ∑_{k≠j} x_{ik}β_k as the partial residual for fitting β_j and ω_j = ∑_{i=1}^{n} x_{ij}(y_i − ỹ_i^{(j)}). The univariate soft thresholding operator of the coordinate descent algorithm [11] for the L1 regularization (Lasso) can then be defined as:

β_j = Soft(ω_j, λ) =
    ω_j + λ,   if ω_j < −λ
    ω_j − λ,   if ω_j > λ
    0,         if |ω_j| ≤ λ        (8)

Similarly, for the L0 regularization, the thresholding operator of the coordinate descent algorithm can be defined as:

β_j = Hard(ω_j, λ) = ω_j I(|ω_j| > λ)    (9)

where I is the indicator function. This formula is equivalent to the hard thresholding operator [17]. Equations (8) and (9) show that different penalties are associated with different thresholding operators. Xu et al. [18] proposed a half thresholding operator to solve the L1/2 regularization for the linear regression model; it is an iterative algorithm and can be seen as a multivariate half thresholding approach. In this paper, we propose the univariate half thresholding operator of the coordinate descent algorithm for the L1/2 regularization. Based on equation (7), the gradient equation of the L1/2 regularization at β_j can be expressed as:

∂R/∂β_j = β_j − ω_j + (λ/4) · sign(β_j)/√|β_j| = 0    (10)

First, we consider the case β_j > 0 and let μ = √β_j, so that β_j = μ^2. When β_j > 0, equation (10) can be rewritten as:

μ^3 − ω_j μ + λ/4 = 0    (11)

There are three cases: ω_j < 0, 0 < ω_j ≤ (3/4) λ^{2/3} and ω_j > (3/4) λ^{2/3}.

(i) If ω_j < 0, the three roots of equation (11) can be expressed as μ_1 = −2r sin(φ/3), μ_2 = r sin(φ/3) + i√3 r cos(φ/3) and μ_3 = r sin(φ/3) − i√3 r cos(φ/3), where r = √(|ω_j|/3) and φ = arccos(λ/(8r^3)). When r > 0, none of the roots satisfies μ > 0. Thus, there is no solution to equation (11) when ω_j < 0.

(ii) If 0 < ω_j ≤ (3/4) λ^{2/3}, the three roots of equation (11) are μ_1 = −2r cos(φ/3), μ_2 = r cos(φ/3) + i√3 r sin(φ/3) and μ_3 = r cos(φ/3) − i√3 r sin(φ/3).

There is still no solution to equation (11) in this case.

(iii) If ω_j > (3/4) λ^{2/3}, the three roots of equation (11) are given by μ_1 = −2r cos(φ/3), μ_2 = 2r cos((π − φ)/3) and μ_3 = 2r cos((π + φ)/3). In this case, μ_2 is the unique solution of equation (10). Thus, equation (11) has a non-zero root only when ω_j > (3/4) λ^{2/3}, and the unique solution of equation (10) is:

β_j = (μ_2)^2 = (2/3) ω_j (1 + cos(2(π − φ_λ(ω_j))/3))

On the other hand, for the case β_j < 0, we denote μ = √(−β_j), so that β_j = −μ^2. Equation (10) can then be transformed into:

μ^3 − ω_j μ − λ/4 = 0    (12)

Equation (12) also has a unique solution when ω_j < −(3/4) λ^{2/3}, namely μ_2 = 2r cos((π − φ)/3) and

β_j = −(μ_2)^2 = −(2/3) |ω_j| (1 + cos(2(π − φ_λ(ω_j))/3))

In conclusion, the univariate half thresholding operator can be expressed as:

β_j = Half(ω_j, λ) =
    (2/3) ω_j (1 + cos(2(π − φ_λ(ω_j))/3)),   if |ω_j| > (3/4) λ^{2/3}
    0,                                         otherwise        (13)

where φ_λ(ω) satisfies:

cos(φ_λ(ω)) = (λ/8) (|ω|/3)^{−3/2}
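To make the three update rules concrete, the following Python sketch implements the soft (8), hard (9) and half (13) thresholding operators exactly as written above; the function names are ours, and the L1/2 operator uses the threshold (3/4)λ^{2/3} and the φ_λ expression printed in the paper.

```python
import numpy as np

def soft_threshold(omega, lam):
    """Univariate soft thresholding operator (8) for the L1 (Lasso) penalty."""
    return np.sign(omega) * max(abs(omega) - lam, 0.0)

def hard_threshold(omega, lam):
    """Hard thresholding operator (9) for the L0 penalty."""
    return omega if abs(omega) > lam else 0.0

def half_threshold(omega, lam):
    """Univariate half thresholding operator (13) for the L1/2 penalty."""
    if abs(omega) <= 0.75 * lam ** (2.0 / 3.0):
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))   # phi_lambda(omega)
    return (2.0 / 3.0) * omega * (1.0 + np.cos(2.0 * (np.pi - phi) / 3.0))
```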

The coordinate descent algorithm for the L1/2 regularization makes repeated use of the univariate half thresholding operator; the details of the algorithm are given below. This coordinate descent algorithm can be extended to the sparse logistic regression model. Based on the objective function (3) of the sparse logistic regression, a one-term Taylor series expansion of l(β|D) gives the approximation:

L(β, λ) ≈ (1/2n) ∑_{i=1}^{n} W_i (Z_i − X_i'β)^2 + λ ∑_{j=1}^{p} P(β_j)    (14)

where Z_i = X_i β̃ + (y_i − f(X_i β̃)) / [f(X_i β̃)(1 − f(X_i β̃))] is the estimated (working) response, W_i = f(X_i β̃)(1 − f(X_i β̃)) is a weight, and f(X_i β̃) = exp(X_i β̃)/(1 + exp(X_i β̃)) is the value evaluated at the current parameters β̃. Redefining the partial residual for fitting the current β̃_j as Z̃_i^{(j)} = ∑_{k≠j} x_{ik} β̃_k and ω_j = ∑_{i=1}^{n} W_i x_{ij}(Z_i − Z̃_i^{(j)}), we can directly apply the coordinate descent algorithm with the L1/2 penalty to sparse logistic regression. The details are given as follows:

Algorithm: The coordinate descent algorithm for sparse logistic regression with the L1/2 penalty
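The pseudocode box from the original article is not reproduced here. The sketch below is one possible reading of the procedure just described, combining the iteratively reweighted working response of (14) with the univariate half thresholding operator (13); the function names, the unpenalized intercept update, the per-coordinate weight normalization and the convergence test are our assumptions rather than details stated in the paper.

```python
import numpy as np

def half_threshold(omega, lam):
    """Univariate half thresholding operator (13)."""
    if abs(omega) <= 0.75 * lam ** (2.0 / 3.0):
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * omega * (1.0 + np.cos(2.0 * (np.pi - phi) / 3.0))

def cd_l12_logistic(X, y, lam, max_iter=100, tol=1e-6):
    """Coordinate descent for L1/2-penalized logistic regression (sketch).

    X is assumed standardized (each column with unit sum of squares), as in
    the linear case described in the text."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(max_iter):
        beta_old = beta.copy()
        eta = beta0 + X @ beta
        f = 1.0 / (1.0 + np.exp(-eta))               # f(X_i beta~), current fit
        w = np.clip(f * (1.0 - f), 1e-5, None)       # weights W_i in (14)
        z = eta + (y - f) / w                        # working responses Z_i in (14)
        beta0 = np.average(z - X @ beta, weights=w)  # unpenalized intercept update
        for j in range(p):
            r = z - beta0 - X @ beta + X[:, j] * beta[j]          # fit without gene j
            omega_j = np.sum(w * X[:, j] * r) / np.sum(w * X[:, j] ** 2)
            beta[j] = half_threshold(omega_j, lam)   # update via operator (13)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta0, beta
```

In practice the tuning parameter λ is chosen by the ten-fold cross-validation described in the Results section.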

The coordinate descent algorithm for the L1/2 penalized logistic regression works well in sparsity problems, because the procedure does not need to change many irrelevant parameters or recalculate partial residuals at each update step.

Results

Analyses of simulated data

In this section, we evaluate the performance of the sparse logistic regression with the L1/2 penalty in a simulation study. We generate high-dimensional, low sample size data which contain many irrelevant features. Two methods are compared with our proposed approach: sparse logistic regression with the elastic net penalty (LEN) and sparse logistic regression with the Lasso penalty (L1).

We generated the vectors γ_{i0}, γ_{i1}, ..., γ_{ip} (i = 1, ..., n) independently from the standard normal distribution, and the predictors are generated by x_{ij} = γ_{ij} √(1−ρ) + γ_{i0} √ρ (i = 1, ..., n; j = 1, ..., p), where ρ is the correlation coefficient of the predictor vectors [19]. The simulated data set is generated from the logistic model:

log( Y_i / (1 − Y_i) ) = β_0 + ∑_{j=1}^{p} x_{ij} β_j + σ·ε    (15)

where ε is an independent random error generated from N(0,1) and σ is the parameter which controls the signal-to-noise ratio. In every simulation, the dimension p of the predictor vector is 1000, and the first five true coefficients are nonzero: β_1 = 1, β_2 = 1, β_3 = −1, β_4 = −1, β_5 = 1 and β_j = 0 (6 ≤ j ≤ 1000).

The estimation of the optimal tuning parameter λ in the sparse logistic regression models can be done in many ways and is often done by k-fold cross-validation (CV). Note that the choice of k will depend on the size of the training set. In our experiments, we use 10-fold cross-validation (k = 10). The elastic net method has two tuning parameters, so we need to cross-validate on a two-dimensional surface [16].

We consider the cases with training sample sizes n = 50, 80, 100, correlation coefficients ρ = 0.1, 0.4 and noise control parameters σ = 0.2, 0.6. Each classifier was evaluated on a test data set of 100 samples. The experiments were repeated 30 times and the average test errors are reported in Table 1.

Table 1 The average errors (%) for the test data sets obtained by the sparse logistic regressions with the L1/2, LEN and L1 penalties in 30 runs

Setting            Sample size   L1/2   LEN    L1
ρ = 0.1, σ = 0.2   n = 50        28.2   31.8   31.2
                   n = 80        10.7   23.1   22.2
                   n = 100        8.1   16.9   15.7
ρ = 0.1, σ = 0.6   n = 50        31.4   33.1   33.3
                   n = 80        18.4   27.1   26.6
                   n = 100       14.2   22.4   21.3
ρ = 0.4, σ = 0.2   n = 50        30.1   32.6   33.0
                   n = 80        11.1   23.3   22.9
                   n = 100        9.1   19.0   16.4
ρ = 0.4, σ = 0.6   n = 50        35.1   35.5   36.3
                   n = 80        20.5   27.2   26.9
                   n = 100       15.1   22.7   22.9

As shown in Table 1, when the sample size n increases, the prediction performance of all three methods improves. For example, when ρ = 0.1 and σ = 0.2, the average test errors of the L1/2 method are 28.2%, 10.7% and 8.1% with sample sizes n = 50, 80 and 100 respectively. When the correlation parameter ρ and the noise parameter σ increase, the prediction performance of all three methods decreases. For example, when ρ = 0.4 and n = 100, the average test error of the L1/2 method increased from 9.1% to 15.1% as σ increased from 0.2 to 0.6; when σ = 0.6 and n = 80, the average test error of the L1/2 method increased from 18.4% to 20.5% as ρ increased from 0.1 to 0.4. Moreover, in our simulation, the influence of the noise appears to be larger than that of the variable correlation for the prediction performance of all three methods. On the other hand, for the same parameter settings, the prediction performance of the L1/2 method is consistently better than the results of the LEN and L1 methods. For example, when ρ = 0.1, σ = 0.2 and n = 100, the predictive error of the L1/2 method is 8.1%, much better than the 16.9% and 15.7% obtained by the LEN and L1 methods respectively.

Table 2 The average number of variables selected by the sparse logistic regressions with the L1/2, LEN and L1 penalties in 30 runs

Setting            Sample size   L1/2   LEN    L1
ρ = 0.1, σ = 0.2   n = 50         7.5   31.6   27.1
                   n = 80         8.8   43.1   40.3
                   n = 100        8.9   49.7   45.7
ρ = 0.1, σ = 0.6   n = 50         8.3   33.6   29.2
                   n = 80        10.6   45.7   41.9
                   n = 100       10.8   54.4   50.1
ρ = 0.4, σ = 0.2   n = 50         7.8   33.5   28.3
                   n = 80         8.9   44.5   41.8
                   n = 100        9.0   51.2   46.6
ρ = 0.4, σ = 0.6   n = 50         8.6   41.3   29.9
                   n = 80        10.7   45.9   44.1
                   n = 100       11.2   56.4   53.4
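A minimal sketch of the simulated-data generator described above is given below. How the noisy logit in (15) is converted into 0/1 class labels is not spelled out in the text, so drawing Bernoulli labels from the noisy probability (with β_0 = 0) is our assumption; the function name is illustrative.

```python
import numpy as np

def simulate_data(n, p=1000, rho=0.1, sigma=0.2, seed=0):
    """Generate one simulated dataset: correlated Gaussian predictors and a
    logistic response whose first five coefficients are (1, 1, -1, -1, 1)."""
    rng = np.random.default_rng(seed)
    gamma0 = rng.standard_normal((n, 1))                    # shared component gamma_i0
    gamma = rng.standard_normal((n, p))                     # gamma_ij
    X = gamma * np.sqrt(1 - rho) + gamma0 * np.sqrt(rho)    # pairwise correlation rho
    beta = np.zeros(p)
    beta[:5] = [1, 1, -1, -1, 1]                            # true nonzero coefficients
    logit = X @ beta + sigma * rng.standard_normal(n)       # model (15), beta_0 = 0 assumed
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))       # Bernoulli labels (our assumption)
    return X, y

# Example: one of the settings appearing in Table 1
X_train, y_train = simulate_data(n=100, rho=0.1, sigma=0.2)
```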

Table 2 shows the average number of variables selected in the 30 runs for each method. Since the simulated datasets have five relevant features (x1-x5), the idealized average number of variables selected by each method is 5. In Table 2, the results obtained by the L1/2 penalized method are clearly close to 5 and are 3-10 times smaller than those of the LEN and L1 penalties at the same parameter settings. For example, when ρ = 0.1, σ = 0.2 and n = 100, the average numbers from the LEN and L1 methods are 49.7 and 45.7 respectively, while the result of the L1/2 method is 8.9. Moreover, when the sample size n, the correlation parameter ρ and the noise parameter σ increase, the average numbers from all three methods increase, but the values of the LEN and L1 methods increase faster than those of the L1/2 method. This means that the L1/2 penalized method consistently outperforms the other two methods in terms of variable selection.

To further evaluate the performance of the L1/2 penalized method, we report in Table 3 the frequency with which each relevant variable was selected among the 30 runs for each method. When the sample size is small (n = 50), the L1/2 penalty selects the relevant variables slightly less frequently than the other two methods, and all three methods select the true nonzero coefficients with difficulty, especially when ρ and σ are relatively large. For example, when ρ = 0.4, σ = 0.6 and n = 50, the selection frequencies of β_5 for the L1/2, LEN and L1 methods are 12, 14 and 13 respectively in 30 runs. As n increases, all three methods tend to select the true nonzero coefficients more accurately, and the L1/2 penalty method performs slightly better, in terms of selection frequencies, than the other two methods under the different parameter settings of ρ and σ. To sum up, Tables 1, 2 and 3 clearly show that the L1/2 method is the winner among the competitors in terms of both prediction accuracy and variable selection across the different variable correlation and noise situations.

Table 3 The frequencies of the relevant variables obtained by the sparse logistic regressions with the L1/2, LEN and L1 penalties in 30 runs

Setting            Sample size   Method   β1   β2   β3   β4   β5
ρ = 0.1, σ = 0.2   n = 50        L1/2     21   22   19   15   15
                                 LEN      24   25   21   17   17
                                 L1       22   24   20   15   17
                   n = 80        L1/2     30   30   30   30   30
                                 LEN      30   29   30   30   30
                                 L1       30   29   30   30   30
                   n = 100       L1/2     30   30   30   30   30
                                 LEN      30   30   30   30   30
                                 L1       30   30   30   30   30
ρ = 0.1, σ = 0.6   n = 50        L1/2     17   17   17   14   14
                                 LEN      18   19   17   16   14
                                 L1       18   18   18   16   15
                   n = 80        L1/2     30   29   30   28   28
                                 LEN      30   28   30   28   27
                                 L1       30   28   30   27   26
                   n = 100       L1/2     30   30   30   30   30
                                 LEN      30   30   30   30   30
                                 L1       30   30   30   28   30
ρ = 0.4, σ = 0.2   n = 50        L1/2     19   18   18   16   15
                                 LEN      21   22   21   17   17
                                 L1       18   21   19   16   17
                   n = 80        L1/2     30   30   30   30   30
                                 LEN      30   28   30   29   29
                                 L1       30   27   30   29   29
                   n = 100       L1/2     30   30   30   30   30
                                 LEN      30   30   30   30   30
                                 L1       30   30   30   29   29
ρ = 0.4, σ = 0.6   n = 50        L1/2     14   16   15   12   12
                                 LEN      17   17   17   12   14
                                 L1       17   15   14    9   13
                   n = 80        L1/2     29   25   26   28   29
                                 LEN      28   24   24   27   24
                                 L1       27   24   24   23   23
                   n = 100       L1/2     30   29   30   30   30
                                 LEN      30   27   28   28   30
                                 L1       29   27   27   28   30

Analyses on microarray data

In this section, we compare our proposed L1/2 penalized method with the LEN and L1 methods on 4 publicly available gene expression datasets: Leukaemia, Prostate, Colon and DLBCL. A brief description of these datasets is given below and summarized in Table 4.

Leukaemia dataset

The original dataset was provided by Golub et al. [7] and contains the expression profiles of 7,129 genes for 47 patients of acute lymphoblastic leukaemia (ALL) and 25 patients of acute myeloid leukaemia (AML). For data preprocessing, we followed the protocol detailed in the supplementary information to Dudoit et al. [1]. After thresholding, filtering, applying a logarithmic transformation and standardizing each expression profile to zero mean and unit variance, a dataset comprising 3,571 genes remained.
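The last two preprocessing steps (logarithmic transformation and per-profile standardization) can be sketched as follows. The thresholding and filtering rules of Dudoit et al. [1] are not reproduced here, and the clipping constant and the choice of standardizing each sample's profile (rather than each gene) are our assumptions.

```python
import numpy as np

def log_and_standardize(X):
    """Log-transform and standardize each expression profile (row of X)
    to zero mean and unit variance."""
    X = np.log10(np.clip(X, 1.0, None))   # logarithmic transformation, guarding small values
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
```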

Table 4 Four publicly available gene expression datasets used in the experiments

Dataset     No. of genes   No. of samples   Classes
Leukaemia   3571           72               ALL/AML
Prostate    5966           102              Normal/Tumor
Colon       2000           62               Normal/Tumor
DLBCL       6285           77               DLBCL/FL

Prostate dataset

The original dataset contains the expression profiles of 12,600 genes for 50 normal tissues and 52 prostate tumor tissues. For data preprocessing, we adopt the pretreatment method [20] to obtain a dataset with 102 samples, each containing 5,966 genes.

Colon dataset

The colon microarray data set in Alon et al. [21] has 2,000 genes per sample and 62 samples, which consist of 22 normal tissues and 40 cancer tissues. The Colon dataset is available at http://microarray.princeton.edu/oncology.

DLBCL dataset

This dataset contains 77 microarray gene expression profiles of the 2 most prevalent adult lymphoid malignancies: 58 samples of diffuse large B-cell lymphoma (DLBCL) and 19 observations of follicular lymphoma (FL). Each sample contains 7,129 gene expression values. More information on these data can be found in Shipp MA et al. [22]. For data preprocessing, we followed the protocol detailed in the supplementary information to Dudoit et al. [1], and a dataset comprising 6,285 genes remained.

We evaluate the prediction accuracy of the three penalized logistic regression models using random partition. This means that we divide each dataset at random such that approximately 70-80% of the dataset becomes training samples and the other 20-30% test samples. More information on these partitions is given in Table 5. For selecting the tuning parameter λ, we employ the ten-fold cross validation scheme on the training set. We repeat this procedure 30 times and the averaged misclassification errors are reported in Table 6. Here the denominators of the ten-fold cross validation errors and the test errors describe the sample sizes of the training and test sets respectively; the reported fractions and numbers of selected genes are the approximated integers of the corresponding averages over the 30 runs.

Table 5 The detailed information of the 4 microarray datasets used in the experiments

Dataset     Training (class 1/class 2)   Testing (class 1/class 2)
Leukaemia   50 (32 ALL/18 AML)           22 (15 ALL/7 AML)
Prostate    71 (35 Normal/36 Tumor)      31 (15 Normal/16 Tumor)
Colon       42 (14 Normal/28 Tumor)      20 (8 Normal/12 Tumor)
DLBCL       60 (45 DLBCL/15 FL)          17 (13 DLBCL/4 FL)

Table 6 The classification performance of the different methods on the 4 gene expression datasets

Dataset     Method   Cross-validation error   Test error   No. of selected genes
Leukaemia   L1/2     2/50                     1/22         2
            LEN      1/50                     1/22         9
            L1       1/50                     1/22         6
Prostate    L1/2     5/71                     3/31         5
            LEN      5/71                     4/31         34
            L1       5/71                     3/31         25
Colon       L1/2     4/42                     3/20         5
            LEN      5/42                     4/20         13
            L1       5/42                     4/20         7
DLBCL       L1/2     3/60                     2/17         14
            LEN      2/60                     1/17         38
            L1       3/60                     3/17         23

As shown in Table 6, for the Leukaemia dataset, the classifier with the L1/2 penalty gives an average ten-fold cross validation error of 2/50 and an average test error of 1/22 with about 2 genes selected. The classifiers with the LEN and L1 methods give average ten-fold cross validation errors of 1/50 and average test errors of 1/22 with about 9 and 6 genes selected respectively. This means that all three methods can be successfully applied to high-dimensional classification problems and classify the Leukaemia dataset with the same accuracy. Note that the L1/2 method achieved its success using only about 2 predictors (genes), compared to about 9 and 6 for the LEN and L1 methods. For the Prostate and Colon datasets, the L1/2 method achieves the best classification performance, with the highest accuracy rates and much fewer genes than the LEN and L1 methods. For the DLBCL dataset, the L1/2 logistic regression achieves better classification performance than the L1 method and worse than the LEN method. However, as for the other three datasets, the L1/2 method achieved its success using many fewer predictors (about 14 genes), compared to about 38 and 23 for the LEN and L1 methods. This is an important consideration for screening and diagnostic applications, where the goal is often to develop an accurate test using as few features as possible in order to control cost.

Figure 1 The results of the sparse logistic regression with the L1/2 penalty on the Prostate dataset. The solution paths and the gene selection results of the sparse logistic L1/2 penalty method for the Prostate dataset in one sample run.

Figures 1, 2 and 3 display the solution paths and the gene selection results of the three methods for the Prostate dataset in one sample run. Here the x-axis displays the number of running steps, the y-axis in the left sub-figure is the coefficient value measuring gene importance, and the y-axis in the right sub-figure is the misclassification error based on ten-fold cross validation. The optimal results of the three methods are shown as vertical dotted lines. Figure 1 indicates that the number of nonzero coefficients (selected genes) of the optimal result obtained by the L1/2 method is 5. In contrast, Figures 2 and 3 indicate that the numbers of nonzero coefficients (selected genes) of the optimal results obtained by the LEN and L1 methods are 37 and 26 respectively. Generally speaking, the penalized logistic regression methods can be successfully applied to cancer classification problems with high-dimensional, low-sample microarray data, and our proposed L1/2 method achieves better performance, especially in gene selection.

Brief biological analyses of the selected genes

The summaries of the 10 top-ranked informative genes found by the three sparse logistic regression methods for the 4 gene expression datasets are shown in Tables 7, 8, 9 and 10 respectively. The genes marked with a star (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6, and the common genes obtained by each classifier are emphasized in bold. Biological experimental results have shown that some genes included in the frequently selected gene sets that produce high classification accuracy are mostly functionally related to carcinogenesis or tumor histogenesis. For example, in Table 7, the most frequently selected gene set of each sparse logistic method for leukaemia classification, including the cystatin C (CST3) and myeloperoxidase (MPO) genes that achieve high classification accuracy with the L1/2 method, has been experimentally proved to be correlated with leukaemia of the ALL or AML type. The cystatin C gene is located in the extracellular region of the cell and has a role in the invasiveness of human glioblastoma cells; a decrease of cystatin C in the CSF might contribute to the process of metastasis and spread of the cancer cells in the leptomeningeal tissues [23]. The myeloperoxidase gene plays a role in the anti-apoptosis process by which cancer cells kill themselves [24]. For the colon dataset (Table 9), the most frequently selected gene set of each sparse logistic method includes genes such as guanylate cyclase activator 2B (GUCA2B), myosin light chain 6, alkali, smooth muscle and non-muscle (MYL6) and desmin (DES). These genes are the top 3 significant informative genes ranked by our proposed L1/2 method and were also selected by Ben-Dor et al. [25], Yang and Song [26] and Li et al. [27].

Figure 2 The results of the sparse logistic regression with the LEN penalty on the Prostate dataset. The solution paths and the gene selection results of the sparse logistic elastic net penalty method for the Prostate dataset in one sample run.

Figure 3 The results of the sparse logistic regression with the L1 penalty on the Prostate dataset. The solution paths and the gene selection results of the sparse logistic L1 penalty method for the Prostate dataset in one sample run.

At the top of these gene lists is the guanylate cyclase activator 2B (GUCA2B) gene. Notterman et al. [28] showed that a reduction of uroguanylin might be an indication of colon tumors, and Shailubhai et al. [29] reported that treatment with uroguanylin has a positive therapeutic significance for the reduction of pre-cancerous colon polyps.

In Tables 7, 8, 9 and 10, some genes are only frequently selected by the L1/2 method and are not discovered by the LEN and L1 methods. Evidence from the literature shows that they are cancer related genes. For example, for the colon dataset, the genes cholinergic receptor, nicotinic, delta polypeptide (CHRND) and platelet/endothelial cell adhesion molecule-1 (PECAM1) were also selected by Maglietta R. et al. [30], Wiese A.H. et al. [31], Wang S.L. et al. [32], and Dai J.H. and Xu Q. [33]. These genes can significantly discriminate between non-dissected tumors and micro-dissected invasive tumor cells. It is also remarkable that, to our knowledge, some of the discovered genes have not been reported in any past studies.

On the other hand, from Tables 7, 8, 9 and 10, we found that the most frequently selected genes and their ranking orders for the LEN and L1 methods are much more similar to each other than to those of the L1/2 method. The main reasons are that the classification hypothesis need not be unique, as the samples in gene expression data lie in a high-dimensional space, and that both the LEN and L1 methods are based on the L1 type penalty.

Table 7 The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Leukaemia dataset

L1/2:
 1. CST3 cystatin C *
 2. MPO myeloperoxidase *
 3. IL8 interleukin 8
 4. GYPB glycophorin B (MNS blood group)
 5. IGL immunoglobulin lambda
 6. DNTT deoxynucleotidyltransferase, terminal
 7. LOC100437488 interleukin-8-like
 8. LTB lymphotoxin beta (TNF superfamily, member 3)
 9. TCRB T cell receptor beta cluster
 10. S100A9 S100 calcium binding protein A9

LEN:
 1. CFD complement factor D (adipsin) *
 2. CST3 cystatin C *
 3. MPO myeloperoxidase *
 4. DNTT deoxynucleotidyltransferase, terminal *
 5. TCL1A T-cell leukemia/lymphoma 1A *
 6. IGL immunoglobulin lambda locus *
 7. IL8 interleukin 8 *
 8. ZYX zyxin *
 9. LTB lymphotoxin beta (TNF superfamily, member 3) *
 10. CD79A CD79a molecule, immunoglobulin-associated alpha

L1:
 1. CST3 cystatin C *
 2. CFD complement factor D (adipsin) *
 3. MPO myeloperoxidase *
 4. IL8 interleukin 8 *
 5. DNTT deoxynucleotidyltransferase, terminal *
 6. TCL1A T-cell leukemia/lymphoma 1A *
 7. IGL immunoglobulin lambda locus
 8. LTB lymphotoxin beta (TNF superfamily, member 3)
 9. CD79A CD79a molecule, immunoglobulin-associated alpha
 10. HBB hemoglobin, beta

The genes marked with a star (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6, and the common genes obtained by the L1/2, LEN and L1 classifiers are emphasized in bold.

Table 8 The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Prostate dataset

L1/2:
 1. SLC43A3 solute carrier family 43, member 3 *
 2. CD22 CD22 molecule *
 3. KHDRBS1 KH domain containing, RNA binding, signal transduction associated 1 *
 4. ZNF787 zinc finger protein 787 *
 5. GMPR guanosine monophosphate reductase *
 6. AMOTL2 angiomotin like 2
 7. EIF4EBP2 eukaryotic translation initiation factor 4E binding protein 2
 8. USP2 ubiquitin specific peptidase 2
 9. USP4 ubiquitin specific peptidase 4 (proto-oncogene)
 10. ACTN4 actinin, alpha 4

LEN:
 1. AMOTL2 angiomotin like 2 *
 2. USP4 ubiquitin specific peptidase 4 (proto-oncogene) *
 3. EIF4EBP2 eukaryotic translation initiation factor 4E binding protein 2 *
 4. PRAF2 PRA1 domain family, member 2 *
 5. CACYBP calcyclin binding protein *
 6. Gene symbol: AA683055, probe set: 34711_at *
 7. VSNL1 visinin-like 1 *
 8. SLC43A3 solute carrier family 43, member 3 *
 9. CD22 CD22 molecule *
 10. TMCO1 transmembrane and coiled-coil domains 1 *

L1:
 1. USP4 ubiquitin specific peptidase 4 (proto-oncogene) *
 2. CD22 CD22 molecule *
 3. EIF4EBP2 eukaryotic translation initiation factor 4E binding protein 2 *
 4. Gene symbol: AA683055, probe set: 34711_at *
 5. AMOTL2 angiomotin like 2 *
 6. VSNL1 visinin-like 1 *
 7. FLNC filamin C, gamma *
 8. PRAF2 PRA1 domain family, member 2 *
 9. CACYBP calcyclin binding protein *
 10. SLC43A3 solute carrier family 43, member 3 *

The genes marked with a star (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6, and the common genes obtained by the L1/2, LEN and L1 classifiers are emphasized in bold.

Construct KNN classifiers with the most frequently selected relevant genes

In this section, to further evaluate the performance and prediction generality of the sparse logistic regression with the L1/2 penalty, we constructed KNN (k = 3, 5) classifiers using the relevant genes which were most frequently selected by the L1/2 penalized logistic regression method. In this experiment, we use random leave-one-out cross validation (LOOCV) to evaluate the predictive ability and repeat 50 runs.
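A minimal sketch of this validation is given below: a k-nearest-neighbour classifier (k = 3 or 5) restricted to the columns of the most frequently selected genes, scored by leave-one-out cross-validation. The Euclidean metric and the majority vote are our assumptions; the paper does not state them.

```python
import numpy as np

def knn_loocv_accuracy(X, y, selected_genes, k=3):
    """Leave-one-out cross-validation accuracy of a KNN classifier restricted
    to the most frequently selected genes (given as column indices)."""
    Z = X[:, selected_genes]
    n = len(y)
    correct = 0
    for i in range(n):
        dists = np.linalg.norm(Z - Z[i], axis=1)    # Euclidean distances to sample i
        dists[i] = np.inf                           # leave sample i out
        neighbours = np.argsort(dists)[:k]
        vote = np.bincount(y[neighbours], minlength=2).argmax()   # majority vote
        correct += int(vote == y[i])
    return correct / n
```

For example, `knn_loocv_accuracy(X, y, selected, k=3)` returns the fraction of correctly classified samples for a given index list `selected`.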

Table 9 The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Colon dataset

L1/2:
 1. GUCA2B guanylate cyclase activator 2B (uroguanylin) *
 2. MYL6 myosin, light chain 6, alkali, smooth muscle and non-muscle *
 3. DES desmin *
 4. CHRND cholinergic receptor, nicotinic, delta polypeptide *
 5. PECAM1 platelet/endothelial cell adhesion molecule-1 *
 6. ATPsyn-Cf6 ATP synthase-coupling factor 6, mitochondrial
 7. ATF7 activating transcription factor 7
 8. PROBABLE NUCLEAR ANTIGEN (Pseudorabies virus) [accession number: T86444]
 9. MYH9 myosin, heavy chain 9, non-muscle
 10. MYH10 myosin, heavy chain 10, non-muscle

LEN:
 1. GUCA2B guanylate cyclase activator 2B (uroguanylin) *
 2. MYH9 myosin, heavy chain 9, non-muscle *
 3. DES desmin *
 4. MYL6 myosin, light chain 6, alkali, smooth muscle and non-muscle *
 5. GSN gelsolin *
 6. COL11A2 collagen, type XI, alpha 2 *
 7. ATPsyn-Cf6 ATP synthase-coupling factor 6, mitochondrial *
 8. ssb single-strand binding protein *
 9. Sept2 septin 2 *
 10. MXI1 MAX interactor 1, dimerization protein *

L1:
 1. GUCA2B guanylate cyclase activator 2B (uroguanylin) *
 2. ATPsyn-Cf6 ATP synthase-coupling factor 6, mitochondrial *
 3. MYH9 myosin, heavy chain 9, non-muscle *
 4. GSN gelsolin *
 5. MYL6 myosin, light chain 6, alkali, smooth muscle and non-muscle *
 6. COL11A2 collagen, type XI, alpha 2 *
 7. MXI1 MAX interactor 1, dimerization protein *
 8. UQCRC1 ubiquinol-cytochrome c reductase core protein I *
 9. DES desmin *
 10. ZEB1 zinc finger E-box binding homeobox 1 *

The genes marked with a star (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6, and the common genes obtained by the L1/2, LEN and L1 classifiers are emphasized in bold.

Table 10 The 10 top-ranked informative genes found by the three sparse logistic regression methods from the DLBCL dataset

L1/2:
 1. CCL21 chemokine (C-C motif) ligand 21 *
 2. HLA-DQB1 major histocompatibility complex, class II, DQ beta 1 *
 3. MT2A metallothionein 2A *
 4. THRSP thyroid hormone responsive *
 5. IGJ immunoglobulin joining chain *
 6. TCL1A T-cell leukemia/lymphoma 1A *
 7. GOT2 glutamic-oxaloacetic transaminase 2, mitochondrial (aspartate aminotransferase 2) *
 8. Plod procollagen lysyl hydroxylase *
 9. STXBP2 syntaxin binding protein 2 *
 10. SFTPA1 surfactant protein A1 *

LEN:
 1. MTH1 metallothionein 1H *
 2. MT2A metallothionein 2A *
 3. SFTPA1 surfactant protein A1 *
 4. TCL1A T-cell leukemia/lymphoma 1A *
 5. ZFP36L2 ZFP36 ring finger protein-like 2 *
 6. FCGR1A Fc fragment of IgG, high affinity Ia, receptor (CD64) *
 7. IGJ immunoglobulin joining chain *
 8. TRB2 Homeodomain-like/winged-helix DNA-binding family protein *
 9. MELK maternal embryonic leucine zipper kinase *
 10. CCL21 chemokine (C-C motif) ligand 21 *

L1:
 1. MTH1 metallothionein 1H *
 2. MT2A metallothionein 2A *
 3. CCL21 chemokine (C-C motif) ligand 21 *
 4. SFTPA1 surfactant protein A1 *
 5. POLD2 polymerase (DNA directed), delta 2, accessory subunit *
 6. IGJ immunoglobulin joining chain *
 7. MELK maternal embryonic leucine zipper kinase *
 8. CKS2 CDC28 protein kinase regulatory subunit 2 *
 9. EIF2A eukaryotic translation initiation factor 2A, 65kDa *
 10. AQP4 aquaporin 4 *

The genes marked with a star (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6, and the common genes obtained by the L1/2, LEN and L1 classifiers are emphasized in bold.

Table 11 summarizes the classification accuracies on the four datasets for KNN classifiers built with the genes selected by our proposed method. From Table 11, we can see that all the classification accuracies are higher than 90%; in particular, the classification accuracy on the Leukaemia dataset is 98.3%. The KNN classifiers built with the relevant genes selected by the sparse logistic regression with the L1/2 penalty can achieve high classification accuracy. The results indicate that the sparse logistic regression with the L1/2 penalty can select genes with strong discriminative power.

Table 11 Summary of the results of KNN classifiers using the most frequently selected genes by our proposed L1/2 penalized logistic regression method

Dataset     K-NN (k=3)   K-NN (k=5)
Leukaemia   98.3%        94.4%
Prostate    95.1%        94.2%
Colon       95.1%        90.6%
DLBCL       94.8%        91.2%

Conclusions

In cancer classification applications based on microarray data, only a small subset of genes is strongly indicative of a targeted disease. Thus, feature selection methods play an important role in cancer classification. In this paper, we propose sparse logistic regression with the L1/2 penalty and develop the corresponding coordinate descent algorithm as a novel gene selection approach. The proposed method utilizes a novel univariate half thresholding operator to update the estimated coefficients.

Both the simulation and the microarray data studies show that sparse logistic regression with the L1/2 penalty achieves higher classification accuracy than the ordinary L1 and elastic net regularization approaches, while fewer but more informative genes are selected. Therefore, sparse logistic regression with the L1/2 penalty is an effective technique for gene selection in real classification problems.

In this paper, we use the proposed method to solve binary cancer classification problems. However, many cancer classification problems involve multi-category microarray data. We plan to extend our proposed method to multinomial penalized logistic regression for multiclass cancer classification in our future work.

Competing interests

All authors declare that they have no competing interests.

Authors' contributions

YL, CL and XZL developed the gene selection methodology, designed and carried out the comparative study, wrote the code, and drafted the manuscript. KSL, TMC, ZBX and HZ brought up the biological problem that prompted the methodological development, verified and provided discussion on the methodology, and co-authored the manuscript. All authors read and approved the manuscript.

Acknowledgements

This research was supported by the Macau Science and Technology Development Funds (Grant No. 017/2010/A2) of Macau SAR of China and the National Natural Science Foundations of China (Grant No. 2013CB329404, 11131006, 61075054, and 11171272).

Author details

1Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Macau, China. 2Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China. 3Faculty of Science, Xi'an Jiaotong University, Xi'an, China.

Received: 4 July 2012  Accepted: 30 May 2013  Published: 19 June 2013

References
1. Dudoit S, Fridlyand S, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97(457):77-87.
2. Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004, 20:2429-2437.
3. Lee JW, Lee JB, Park M, Song SH: An extensive evaluation of recent classification tools applied to microarray data. Comput Stat Data Anal 2005, 48:869-885.
4. Ding C, Peng H: Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 2005, 3(2):185-205.
5. Monari G, Dreyfus G: Withdrawing an example from the training set: an analytic estimation of its effect on a nonlinear parameterized model. Neurocomputing Letters 2000, 35:195-201.
6. Rivals I, Personnaz L: MLPs (mono-layer polynomials and multi-layer perceptrons) for nonlinear modeling. J Mach Learn Res 2003, 3:1383-1398.
7. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C, Lander E: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537.
8. Guyon I, Elisseeff A: An introduction to variable and feature selection. J Mach Learn Res 2003, 3:1157-1182.
9. Shevade SK, Keerthi SS: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 2003, 19:2246-2253.
10. Tibshirani R: Regression shrinkage and selection via the lasso. J R Statist Soc B 1996, 58:267-288.
11. Friedman J, Hastie T, Hofling H, Tibshirani R: Pathwise coordinate optimization. Ann Appl Statist 2007, 1:302-332.
12. Friedman J, Hastie T, Hofling H, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Statist Softw 2010, 33:1-22.
13. Cawley GC, Talbot NLC: Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics 2006, 22:2348-2355.
14. Xu ZB, Zhang H, Wang Y, Chang XY, Liang Y: L1/2 regularization. Sci China Series F 2010, 40(3):1-11.
15. Fan J, Li R: Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Statist Assoc 2001, 96:1348-1361.
16. Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Stat Soc Series B 2005, 67(2):301-320.
17. Zhang CH: Nearly unbiased variable selection under minimax concave penalty. Ann Statist 2010, 38:894-942.
18. Xu ZB, Chang XY, Xu FM, Zhang H: L1/2 regularization: a thresholding representation theory and a fast solver. IEEE Trans Neural Netw Learn Syst 2012, 23(7):1013-1027.
19. Sohn I, Kim J, Jung SH, Park C: Gradient lasso for Cox proportional hazards model. Bioinformatics 2009, 25(14):1775-1781.
20. Yang K, Cai ZP, Li JZ, Lin GH: A stable gene selection in microarray data analysis. BMC Bioinformatics 2006, 7:228.
21. Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96(12):6745-6750.
22. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat Med 2002, 8:68-74.
23. Nagai A, Terashima M, Harada T, Shimode K, Takeuchi H, Murakawa Y, et al: Cathepsin B and H activities and cystatin C concentrations in cerebrospinal fluid from patients with leptomeningeal metastasis. Clin Chim Acta 2003, 329:53-60.
24. Moroz C, Traub L, Maymon R, Zahalka MA: A novel human ferritin subunit from placenta with immunosuppressive activity. J Biol Chem 2002, 277:12901-12905.
25. Ben-Dor A, et al: Tissue classification with gene expression profiles. J Comput Biol 2000, 7:559-583.
26. Yang AJ, Song XY: Bayesian variable selection for disease classification using gene expression data. Bioinformatics 2010, 26:215-222.
27. Li HD, Xu QS, Liang YZ: Random frog: an efficient reversible jump Markov chain Monte Carlo-like approach for variable selection with applications to gene selection and disease classification. Anal Chim Acta 2012, 740:20-26.
28. Notterman DA, Alon U, Sierk AJ, Levine AJ: Minimax probability machine. Advances in neural processing systems. Cancer Res 2001, 61:3124-3130.
29. Shailubhai K, Yu H, Karunanandaa K, Wang J, Eber S, Wang Y, Joo N, Kim H, Miedema B, Abbas S, Boddupalli S, Currie M, Forte L: Uroguanylin treatment suppresses polyp formation in the Apc(Min/+) mouse and induces apoptosis in human colon adenocarcinoma cells via cyclic GMP. Cancer Res 2000, 60:5151-5157.
30. Maglietta R, Addabbo A, Piepoli A, Perri F, Liuni S, Pesole G, Ancona N: Selection of relevant genes in cancer diagnosis based on their prediction accuracy. Artif Intell Med 2007, 40:29-44.
31. Wiese AH, Lassmann S, Nahrig J, Rosenberg R, Hofler H, Ruger R, Werner M: Identification of gene signatures for invasive colorectal tumor cells. Cancer Detect Prev 2007, 31:282-295.
32. Wang SL, Li XL, Zhang SW, Gui J, Huang DS: Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction. Comput Biol Med 2010, 40:179-189.
33. Dai JH, Xu Q: Attribute selection based on information gain ratio in fuzzy rough set theory with application to tumor classification. Appl Soft Comput 2013, 13:211-221.

doi:10.1186/1471-2105-14-198
Cite this article as: Liang et al.: Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics 2013, 14:198.