Multi-Layer Perceptrons with Embedded Feature Selection with Application in Cancer Classification∗

Multi-layer Perceptrons with Embedded Feature Selection with Application in Cancer Classification∗ Liefeng Bo, Ling Wang and Licheng Jiao Institute of Intelligent Information Processing, Xidian University, Shannxi 710071, China Abstract — This paper proposed a novel neural network network feature selector (NNFS) for feature selection. NNFS model, named multi-layer perceptrons with embedded feature encourages the small weights to converge to zeros by adding selection (MLPs-EFS), where feature selection is incorporated a penalty term to the error function. In 2002, Hsu et al. [4] into the training procedure. Compared with the classical MLPs, proposed a wrapper method, named ANNIGMA-wrapper for MLPs-EFS add a preprocessing step where each feature of the fast feature selection. In 2004, Sindhwani et al. [5] presented samples is multiplied by the corresponding scaling factor. By applying a truncated Laplace prior to the scaling factors, a maximum output information (MOI-MLP) algorithm for feature selection is integrated as a part of MLPs-EFS. Moreover, feature selection. Due to the vast and extensive literatures in a variant of MLPs-EFS, named EFS+MLPs is also given, which feature selection for MLPs, we only mentioned a small perform feature selection more flexibly. Application in cancer fraction of them and the reader can refer to [6] for more classification validates the effectiveness of the proposed information. algorithms. In this paper, we present a neural networks model, named MLPs-EFS that incorporates feature selection into the Key words — Multi-layer perceptrons (MLPs), neural networks, training procedure. Compared with the classical MLPs, Laplace prior, feature selection, cancer classification. MLPs-EFS add a preprocessing step where each feature of the samples is multiplied by the corresponding scaling factor. I. Introduction In order to achieve feature selection, a truncated Laplace With the wide application of information technology, one prior is applied to the scaling factors. Some researchers [7] has a growing chance to confront high dimensional data sets have shown that the Laplace prior promotes sparsity and with hundreds or thousands of features. Typical examples leads to a natural feature selection. The sparsity-promoting include text processing of internet documents, gene nature of Laplace prior is discussed in the literature [8]. expression array analysis, and combinatorial chemistry. Moreover, we also derive a variant of MLPs-EFS, named These data sets often contain many irrelevant and redundant EFS+MLPs that can perform feature selection more flexibly. features, hence an effective feature selection scheme becomes Application in cancer classification validates the a key factor in determining the performance of a classifier. effectiveness of the proposed algorithms The purpose of feature selection is to find the smallest subset The rest of this paper is organized as follows. Section II of features that result in satisfactory generalization presents the MLPs-EFS and EFS+MLPs algorithms. Section performance. The potential benefits of feature selection are at III reports the experimental results of proposed algorithms least three-fold [1-2]: and compares them with the results obtained by several ① Improving the generalization performance of existing algorithms. Section IV discusses the contributions of learning algorithm by eliminating the irrelevant and this paper redundant features; ② Enhancing the comprehensibility of learning II. Multi-layer Perceptrons with Embedded algorithm by identifying the most relevant feature subset; Feature Selection ③ Reducing the cost of future data collection and Consider the three-layer perceptrons with embedded boosting the test speed of learning algorithm by only utilizing feature selection shown in Fig. 1. The error measure we a small fraction of all features. optimize is the sum of the squared difference between the In MLPs, the input of hidden unit is the linear combination desired output and the actual output, of all the components of samples. This mechanism is highly 1 c T effective for feature extraction, but not for feature selection E wv,,α =−yHvyHv()mmmmφφ () () − () , (1) ()∑( oo()) ( ()) since the weight of each component is often non-zero. lc m=1 Therefore, various new methods have been proposed to select where we have the following: a good feature subset for MLPs. In 1997, Setiono and Liu [3] z l is the number of training samples. developed a backward elimination method, named neural ∗ This research was supported by the National Natural Science Foundation of China under Grant No. 60372050. z c is the number of output units that is equal to the number of category. z y()m is the output vector of the m-th class using the ()m “one-vs-all” encoding scheme, which satisfies yi = 1 ()m if the i-th sample belongs to the m-th class, and yi = 0 otherwise. z v()m is the weight vector from hidden units to the m-th output unit. 2 Fig. 2. Prior over the scaling factor αi z Hij is the output of the i-th sample at the j-th hidden unit, If we do not restrict the weights from input layer to hidden T layer, it may become very large. To avoid this case, we adopt Hw=⊗+φ ()jijα2 x() b() . (2) ij h (( ) ( ) ) a Gaussian prior over those weights h 2 z ⊗ denotes the elementwise multiplication. ⎛⎞λ2 ()j P2 ()ww∝−exp⎜⎟, (4) (i) ∑ 2 z x is the i-th input vector. ⎝⎠nh× j=1 2 z α is the scaling factors of input vector. Note that 2 n 2 ()jj() 2 where h is the number of hidden units, w = wi α denotes αα⊗ . 2 ∑() i=1 z w()j is the weight vector from the input layer to denotes the 2-norm and λ2 controls the intensity of the prior. the j-th hidden unit. Considering the error measure along with the prior, we ()j z b is the bias term of the j-th hidden unit. obtain a penalized maximum likelihood estimate of the z φo and φh (i) are the activation function that often scaling factors and the weights by minimizing the following is the logistic function, hyperbolic tangent function, expression or linear function. fE(wv,,α) =+(wv ,,αα) log( PP12( )) + log( (w)) nhn2 . (5) 2 λλ122 ()j α1 =++Ewwv,,α α x1 ()∑∑∑ii() nnhiji===111× 2 a2 x2 Like the original MLPs, the optimization problem that a2 MLPs-EFS confront with is unconstrained. Hence, all the x 3 3 optimization algorithms applied to MLPs are also suitable for a2 MLPs-EFS. In order to speedup the training procedure, x n n instead of the standard backpropagation (BP) algorithm, a variant of the conjugate gradient, named scaled conjugate Fig. 1. Three-layer perceptrons with embedded feature gradient (SCG) [9], is used to find a local minimum of the selection objective function f (wv,,α) . SCG is fully automated Compared with the classical MLPs, MLPs-EFS add a including no user dependent parameters and avoiding a time preprocessing step where each feature of the samples is consuming line-search. Experimental study performed by 2 multiplied by the corresponding scaling factor αi . After Moller in 1993 showed that SCG yields a speedup of at least MLPs-EFS are trained, the importance of features can be an order of magnitude relative to BP. Demuth’s test report identified by the magnitude of the scaling factors. That is, the [10] also showed that SCG performs well over a wide variety irrelevant and the redundant features are associated with the of problems, particularly for networks with a large number of small scaling factors and the relevant features with the large weights. scaling factors. Since we hope that the irrelevant and the MLPs-EFS add the scaling factors; hence, the number of redundant features affect learning procedure as little as its free parameters is slightly larger than that of MLPs. possible, it is necessary to make the small scaling factors However the training time of MLPs-EFS is comparable with close to zeros. To achieve this goal, we adopt a truncated that of MLPs because the number of free parameters added Laplace prior as shown in Fig. 2. over the scaling factors by MLPs-EFS is far smaller than the total number of the free parameters of MLPs. For example, for a network with ten ⎧ ⎛⎞λ1 22 2 ⎪exp⎜⎟−≥αα if 0 hidden units and three output units, the number of free P α ∝ n 1 , (3) 1 ()⎨ ⎝⎠ parameters of MLPs-EFS is only 1.08 times that of MLPs if ⎪ 2 ⎩0if0α < the feature dimensionality of the samples is 10. Similar where n is the feature dimensionality of the samples, conclusion also holds true for the test time. n After MLPs-EFS are trained, there exist some scaling α22= α denotes the 1-norm and λ controls the factors that are very close to but not exactly zeros. To remove 1 ∑ i 1 i=1 the features with small scaling factors from network, we need intensity of the prior. a variant of MLPs-EFS, named EFS+MLPs that can be The points with y = 1 lie in the first or third quadrant and described as the following: those with y = −1 in the second or fourth quadrant. The 1. Train MLPs-EFS with the full features; optimal decision function of this problem is non-linear and 2. Rank the importance of features according to the the highest recognition rate of linear classifiers is only magnitude of the scaling factors; 66.67%. 3. Reserve the d most important features; XOR1: XOR1 is a variant of XOR problem (7) and 4. Train MLPs only with the reserved d features. derived by adding 18 redundant features which are copies of the first two features and 20 irrelevant features which are III. Comparisons with Existing Algorithms drawn from the uniform distribution between -1 and 1, In order to know how well MLPs-EFS and EFS+MLPs simultaneously.

Load more