Feature selection using feature dissimilarity measure and density-based clustering: Application to biological data

DEBARKA SENGUPTA¹, INDRANIL AICH² and SANGHAMITRA BANDYOPADHYAY³*

¹Genome Institute of Singapore, Singapore 138 672, Singapore
²HTL Co. India Pvt. Ltd., New Delhi 110 092, India
³Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700 108, India

*Corresponding author (Email, [email protected])

Reduction of dimensionality has emerged as a routine process in modelling complex biological systems. A large number of feature selection techniques have been reported in the literature to improve model performance in terms of accuracy and speed. In the present article an unsupervised feature selection technique is proposed, using maximum information compression index as the dissimilarity measure and the well-known density-based cluster identification technique DBSCAN for identifying the largest natural group of dissimilar features. The algorithm is fast and less sensitive to the user-supplied parameters. Moreover, the method automatically determines the required number of features and identifies them. We used the proposed method for reducing dimensionality of a number of benchmark data sets of varying sizes. Its performance was also extensively compared with some other well-known feature selection methods.

[Sengupta D, Aich I and Bandyopadhyay S 2015 Feature selection using feature dissimilarity measure and density-based clustering: Application to biological data. J. Biosci. 40 721–730] DOI 10.1007/s12038-015-9556-y

Keywords. Clustering; dissimilarity; eigenvalue; feature selection

1. Introduction

Bioinformaticians frequently face the challenge of reducing the number of attributes of high-dimensional biological data for improving the models involved in sequence analysis, microarray analysis, spectral analysis, literature mining, etc. Feature selection is useful for multiple reasons. The main objectives of feature selection are as follows: (a) accelerating the model creation task; (b) avoiding model over-fitting or under-fitting; (c) identifying the salient features, which are decisive of the target categories. Feature selection is widely used in classification, clustering, regression, etc. A typical feature selection process consists of four basic steps for finding the optimal set of features: subset generation, subset evaluation, stopping criterion and result validation. Feature selection methods can be categorized as either filter or wrapper. A third category, called hybrid, can be introduced to encompass the rest of the methods. Filter methods pick up relevant features by observing their intrinsic properties. These methods generally assign some score to each of the features while evaluating them in isolation. They scale up well due to their simplicity. The major disadvantage of such methods is that they ignore the relationship of features with the existing classes. GINI index, F-score, Relief-F and the Markov Blanket filter are some popular filter methods. Unlike filter methods, wrapper methods learn from the natural grouping of data. These methods produce a subset of features that can efficiently differentiate between classes or clusters. Wrapper methods can be supervised as well as unsupervised in nature. Supervised wrapper approaches utilize class label information to evaluate feature subsets. Such methods are often computationally expensive as they tend to do a rigorous search in the respective feature space.


Genetic algorithm-based supervised feature selection approaches (Pal and Wang 1996; Tan et al. 2006; Mukhopadhyay et al. 2009) are popular as wrapper methods. Unsupervised wrapper methods are becoming increasingly popular because of their lower time requirement compared to the supervised ones. Principal Component Analysis (PCA) and MICI (maximum information compression index) (Mitra et al. 2002) are popular among these. An interesting hybrid filter-wrapper approach is introduced in Ruiz et al. (2006), crossing a univariately pre-ordered feature ranking with an incrementally augmenting wrapper method.

Biological data sets usually contain hundreds to thousands of features. For example, microarray data sets contain many thousands of genes (about 5,000–30,000). These data sets are frequently used for molecular classification of various life-threatening diseases like cancers (Golub et al. 1999). Reducing the dimensionality of biological data sets is essential to avoid model over-fitting. Filter methods are preferably used to reduce the dimensionality of such data sets. Some frequently used filter methods are F-score, χ², Euclidean distance, t-test, Information Gain, etc. Filter methods work well for simple cases, where the distinction between different classes is quite obvious and apparent. However, in many complex cases these methods fail to give much insight into the molecular differentiation. PCA is a reasonably fast, unsupervised wrapper method, which is commonly used in such cases. However, PCA fails to discriminate classes when the classes overlap along the direction of maximum variance of the instances. In practice, models based on PCA often suffer from the problem of under-fitting. Unsupervised methods are indispensable when labelled instances are not available. For example, single-cell RNA-sequencing data analysis, which reveals tissue heterogeneity, requires single cells to be mapped to known cell types (Jaitin et al. 2014). Only filter or unsupervised wrapper techniques can be used to reduce dimensionality in such studies.

A disadvantage of the MICI (maximum information compression index) based approach lies with the use of a k-NN-based clustering algorithm for finding clusters of features based on their similarity (Mitra et al. 2002). It is common knowledge that any method based on the k-NN principle is somewhat sensitive to the choice of k. Moreover, their approach of selecting a representative feature from each of the clusters of similar features tends to discard important (dissimilar) features when the clusters are large in size. In contrast, if clusters of dissimilar features are identified, the loss of important features can be minimized by selecting the largest among the clusters. Additionally, it is always preferred that the number of features is determined automatically by the feature selection technique itself. To address these issues we propose a feature selection method that works by discovering natural groups of dissimilar features. We show that the larger eigenvalue of the covariance matrix derived from a pair of features is inversely proportional to their dissimilarity, thereby making it an appropriate distance for obtaining groups of dissimilar features. It is to be noted here that the present method is inspired by the work in Mitra et al. (2002), with notable differences. Moreover, a comparison with Mitra et al. (2002) is also provided in the Results section.

An extensive comparison of the proposed method with several state-of-the-art techniques, viz. MICI (maximum information compression index) (Mitra et al. 2002), mRMR (Max-Dependency, Max-Relevance and Min-Redundancy) (Peng et al. 2005), SFFS, SFBS (Pudil et al. 1994), SBS, SFS, and Branch and Bound (Devijver and Kittler 1982), demonstrates its significance and effectiveness. For ease of reference the proposed feature selection technique is named Feature Selection using Information Compression Index, or FSICI.

The rest of this paper is organized as follows: In section 2 we explain how the larger eigenvalue corresponding to the covariance matrix derived from a pair of features can be used to find clusters of dissimilar features. In this section we also describe the various components as well as the computational complexity of the algorithm. In section 3 we compare the performance of FSICI with multiple established feature selection techniques. In this section we also provide statistical evidence for the superior performance of FSICI. A brief analysis is also done to evaluate the sensitivity of the approach to the required parameters. We conclude the paper in section 4.

2. Method

The proposed feature selection technique involves three steps: First, the dissimilarities between all possible pairs of features are measured using the principle of linear projections (described later in this section). Natural groups of dissimilar features are then identified using DBSCAN, the density-based clustering method (Ester et al. 1996). Finally, the cluster containing the maximum number of features is selected. The reasons for using DBSCAN for clustering are as follows: (1) DBSCAN is capable of determining […]; (2) it does not require the possible number of clusters as an input; and (3) it can discover arbitrarily shaped clusters. The steps involved in the proposed feature selection method are illustrated below:


2.1 Steps of FSICI

2.1.1 Measuring feature dissimilarity: Although the correlation coefficient is one of the obvious choices for tracking linear dependency between two variables x and y, it has the following shortcomings: the measure is invariant to scaling and translation of the variables and is sensitive to rotation of the scatter diagram in the (x,y) plane. Least square regression, on the other hand, is not symmetric and is also sensitive to rotation. In Mitra et al. (2002), the authors used the smaller eigenvalue of the covariance matrix of random variables x and y in order to quantify information loss between a pair of features. In the following derivation we show how the larger eigenvalue can be used as a measure of feature dissimilarity.

Let x and y be two random variables and let the covariance matrix of x and y be Σ:

$$\Sigma = \begin{pmatrix} \operatorname{var}(x) & \operatorname{cov}(x,y) \\ \operatorname{cov}(x,y) & \operatorname{var}(y) \end{pmatrix}.$$

We have to find the eigenvalues λ1 and λ2 of this matrix.

$$\begin{aligned}
|\Sigma - \lambda I| &= \begin{vmatrix} \operatorname{var}(x)-\lambda & \operatorname{cov}(x,y) \\ \operatorname{cov}(x,y) & \operatorname{var}(y)-\lambda \end{vmatrix} = 0 \\
\Rightarrow\ & \lambda^{2} - \lambda\big(\operatorname{var}(x)+\operatorname{var}(y)\big) + \operatorname{var}(x)\operatorname{var}(y) - \operatorname{cov}(x,y)^{2} = 0 \\
\Rightarrow\ & \lambda = \tfrac{1}{2}\Big[\big(\operatorname{var}(x)+\operatorname{var}(y)\big) \pm \sqrt{\big(\operatorname{var}(x)+\operatorname{var}(y)\big)^{2} - 4\big(\operatorname{var}(x)\operatorname{var}(y) - \operatorname{cov}(x,y)^{2}\big)}\Big] \\
\Rightarrow\ & \lambda(x,y) = \tfrac{1}{2}\Big[\big(\operatorname{var}(x)+\operatorname{var}(y)\big) \pm \sqrt{\big(\operatorname{var}(x)+\operatorname{var}(y)\big)^{2} - 4\operatorname{var}(x)\operatorname{var}(y)\big(1-\rho(x,y)^{2}\big)}\Big],
\end{aligned} \qquad (1)$$

where I denotes an identity matrix and ρ(x,y) denotes the Pearson's correlation coefficient between x and y. For simplicity the smaller and larger eigenvalues are referred to as λ2 and λ1 respectively.
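As a quick numerical check of Equation (1), the following minimal sketch (not part of the original paper) computes λ1 and λ2 for a feature pair with numpy and compares them against the closed form; the helper name pair_eigenvalues and the simulated pair are our own.

```python
import numpy as np

def pair_eigenvalues(x, y):
    """Eigenvalues of the 2x2 covariance matrix of a feature pair,
    together with the closed form of Equation (1)."""
    var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    rho = cov_xy / np.sqrt(var_x * var_y)

    # Direct eigen-decomposition (eigvalsh returns ascending order: lambda_2, lambda_1).
    lam2, lam1 = np.linalg.eigvalsh([[var_x, cov_xy], [cov_xy, var_y]])

    # Closed form of Equation (1).
    s = var_x + var_y
    root = np.sqrt(s ** 2 - 4.0 * var_x * var_y * (1.0 - rho ** 2))
    return (lam1, lam2), (0.5 * (s + root), 0.5 * (s - root))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.9 * x + 0.1 * rng.normal(size=200)   # strongly correlated pair (illustrative)
print(pair_eigenvalues(x, y))              # lambda_2 near 0, lambda_1 large
```

For a strongly correlated pair the two computations agree, λ2 collapses towards 0 and λ1 absorbs almost all of the variance, which is the behaviour exploited below.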

It is apparent from Equation (1) that the value of λ2 tends to 0 as the absolute value of ρ(x,y) increases. Note that λ1 is directly proportional to the dependency between x and y, i.e. λ1 increases as the amount of dependency increases. Hence, if we use λ1 as the similarity measure, we obtain dissimilar features grouped together while employing any clustering technique. Maximum information compression is achieved if the bivariate (or multivariate) data is projected along its principal component direction. The corresponding loss of information in reconstruction of the pattern is equal to the eigenvalue along the direction normal to the principal component. Hence, a higher value of λ1 indicates a smaller amount of information loss, or a higher amount of information compression, or increased similarity. Therefore, clustering features using λ1 as the distance measure would produce clusters of dissimilar features.

2.1.2 Clustering of features: Density-based clustering techniques identify regions of high density that are separated from one another by regions of low density. The density threshold is usually defined by the presence of a minimum number of data points (MinPts in Algorithm 1) within a specific radius (Eps in Algorithm 1). DBSCAN, the most popular density-based technique, discriminates the core, border and noise data points in terms of their surrounding density configuration (Ester et al. 1996). A core point is a point that satisfies the density requirement, a border point is one that does not satisfy the density requirement but falls in the neighbourhood of a core point, and a noise point is one which is neither a core point nor a border point. The points are connected based on density reachability. A point p is directly density-reachable from another point q if p belongs to the neighbourhood of q and q is a core point. Given Eps and MinPts, a point p is said to be density-reachable from q if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi. Note that Eps, in this case, is nothing but a fixed upper limit on λ1 (see the result tables). The pseudo-code of DBSCAN


(Ester et al. 1996) using λ1 as similarity measure is shown in Algorithm 1.
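The authors' Algorithm 1 is not reproduced here; purely as a sketch of the three FSICI steps, the following assumes scikit-learn's DBSCAN run on a precomputed matrix of pairwise λ1 values. The names lambda1 and fsici_select are ours, and the naive O(n²·l) construction of the full matrix does not use the indexing structure behind the O(l·n·log n) complexity figure given in section 2.2.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def lambda1(f_i, f_j):
    """Larger eigenvalue of the covariance matrix of a feature pair,
    used here as the pairwise 'distance' between features."""
    return np.linalg.eigvalsh(np.cov(f_i, f_j, ddof=1))[-1]

def fsici_select(X, eps, min_pts):
    """Sketch of FSICI: pairwise lambda_1 matrix, DBSCAN over the
    features, and selection of the largest cluster.
    X is an (l samples x n features) data matrix."""
    n = X.shape[1]
    d = np.zeros((n, n))
    for i in range(n):                               # naive O(n^2 * l) pass
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = lambda1(X[:, i], X[:, j])

    labels = DBSCAN(eps=eps, min_samples=min_pts,
                    metric="precomputed").fit_predict(d)

    clusters = [np.flatnonzero(labels == k) for k in set(labels) if k != -1]
    if not clusters:                                 # no dense group at this Eps
        return np.arange(n)                          # fall back to all features
    return max(clusters, key=len)                    # indices of the selected features

# Usage on a hypothetical data matrix X:
# X_reduced = X[:, fsici_select(X, eps=1.97e-2, min_pts=2)]
```

Here Eps plays the role of the fixed upper limit on λ1 mentioned above, and MinPts the density requirement.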

2.1.3 Selection of a set of important features: The largest cluster obtained from the density-based clustering of the features is selected, and the features included in this cluster are considered to be the reduced set of features. There could be alternative approaches for finding the most comprehensive cluster of features.

2.2 Computational complexity of FSICI

Most of the existing supervised or unsupervised feature selection indices, like entropy, class separability and k-NN classification accuracy, have at least quadratic time complexity, whereas the proposed linear dependency measure has a complexity of O(l) (l is the number of samples). On the other hand, the clustering algorithm DBSCAN has an overall runtime complexity of O(n log n) (n is the number of features). Note that the time complexity is mostly governed by the number of getNeighbors() calls, resulting in a complexity of O(l·n). DBSCAN executes exactly one such call for each feature, and an indexing structure is used that executes such a neighbourhood query in O(log n). Therefore, the overall runtime complexity of FSICI is computed as O(l·n·log n). The computational complexity of MICI is O(n²l). If the desired dimension is denoted by d, the complexity of SFFS and SFBS is in O(d). However, this is massively increased at the time of supervision. The time complexity of Relief-F is O(m·l·n), where m is a constant that determines the number of iterations required to estimate the weights of the features.


Table 1. Experimental data sets

| Data set        | Dim. | Samples | Class | Ref.                 |
|-----------------|------|---------|-------|----------------------|
| Breast-cancer   | 9    | 683     | 2     | UCI-MLR              |
| Statlog (Heart) | 13   | 270     | 2     | UCI-MLR              |
| Parkinson's     | 22   | 195     | 2     | UCI-MLR              |
| WDBC            | 30   | 569     | 2     | UCI-MLR              |
| Dermatology     | 34   | 358     | 6     | UCI-MLR              |
| Arrhythmia      | 259  | 452     | 13    | UCI-MLR              |
| Colon Cancer    | 2000 | 62      | 2     | Alon et al. 1999     |
| Lymphoma        | 4026 | 96      | 8     | Alizadeh et al. 2000 |
| Leukaemia       | 7129 | 72      | 2     | Golub et al. 1999    |

UCI-MLR: The data sets are available in the UCI Machine Learning Repository (Blake and Merz 1998).

3. Results

We used various benchmark data sets to evaluate the performance of the proposed algorithm and compare it with the performance of other existing techniques. Data sets with varying dimensions were selected for comprehensive analysis. In total, 9 data sets and 8 existing algorithms were used for comparison purposes. The credibility of a set of selected features was determined by the classification accuracies obtained using it. The Wilcoxon signed-ranks test was used to evaluate the statistical significance of the performance observed for our algorithm. Finally, we also examined the dependency of the proposed method on the user-definable parameters.

3.1 Description of various benchmark data sets

The real-life public domain data sets used in our experiments can be categorized into three types: low-dimensional (dim<30), medium-dimensional (30≤dim<100) and high-dimensional (dim>100). The details of the experimental data sets are furnished in table 1. From table 1 it is clear that the low-dimensional data sets include the breast cancer, Statlog (heart) and Parkinson's data sets, the medium-dimensional data sets include the WDBC and dermatology data sets, and the high-dimensional data sets include the arrhythmia, colon cancer, lymphoma and leukaemia data sets.

3.2 Experimental set-up and brief analysis of results

We demonstrate the effectiveness of our proposed feature selection technique on biological data sets of varying dimensionality (described at the beginning of this section). The classification accuracies obtained using different classifiers trained with the FSICI-suggested features are compared to those obtained using the features suggested by some well-known feature selection methods. Note that all accuracies reported in this study were obtained from 10-fold cross-validation. Seven popular feature selection techniques, namely, MICI (maximum information compression index) (Mitra et al. 2002), mRMR (Max-Dependency, Max-Relevance and Min-Redundancy) (Peng et al. 2005), SFFS, SFBS (Pudil et al. 1994), SBS, SFS, and Branch and Bound (Devijver and Kittler 1982), were considered in our experiments. Brief descriptions of some of the feature selection methods used are given below:

MICI: The maximum information compression index is an unsupervised wrapper method for feature selection that uses a novel nearest-neighbour-based unsupervised clustering to find groups of similar features. Finally, it uses some representative features obtained from each of the clusters.

Table 2. Classification results of the data sets without feature selection

| Full data set   | NB Mean | NB SD | k-NN Mean | k-NN SD | AdaBoost Mean | AdaBoost SD | SVM Mean | SVM SD |
|-----------------|---------|-------|-----------|---------|---------------|-------------|----------|--------|
| Breast-cancer   | 95.77 | ±1.27 | 95.64 | ±1.03 | 95.07 | ±0.88 | 94.27 | ±0.77 |
| Statlog (Heart) | 79.13 | ±3.35 | 74.19 | ±4.26 | 72.22 | ±6.41 | 54.86 | ±0.99 |
| Parkinson's     | 70.51 | ±3.61 | 78.40 | ±4.56 | 78.67 | ±4.63 | 74.68 | ±0.38 |
| WDBC            | 93.12 | ±0.75 | 92.83 | ±1.43 | 92.01 | ±2.15 | 62.72 | ±0.70 |
| Dermatology     | 86.52 | ±4.00 | 88.32 | ±3.45 | 45.93 | ±3.24 | 57.14 | ±5.17 |
| Arrhythmia      | 54.98 | ±2.97 | 48.97 | ±3.71 | 49.29 | ±6.66 | 54.37 | ±0.73 |
| Colon Cancer    | 56.07 | ±9.42 | 58.57 | ±8.41 | 48.75 | ±9.41 | 57.14 | ±10.3 |
| Lymphoma        | 87.07 | ±2.51 | 96.57 | ±4.32 | 96.33 | ±4.22 | 97.61 | ±1.8  |
| Leukaemia       | 96.13 | ±1.18 | 82.7  | ±2.41 | 96.14 | ±4.23 | 97.82 | ±1.03 |

NB: Naive Bayes classifier, k-NN: k-Nearest Neighbour classifier.


mRMR: mRMR (Max-Dependency, Max-Relevance and Min-Redundancy) uses a mutual-information-based method for selecting non-redundant features.

Sequential methods: Sequential forward floating selection (SFFS) starts with a null feature set and, at each step, the best feature that satisfies some criterion function is added to the current feature set. This is basically nothing but one step of sequential forward selection (SFS). The algorithm also verifies the possibility of improving the criterion if some feature is excluded. In that case, the worst feature is eliminated from the set, that is, one step of sequential backward selection (SBS) is performed. Therefore, SFFS proceeds by dynamically increasing and decreasing the number of features until the desired d is reached.

Branch and Bound: Branch and bound is an exact method, which employs backtracking. It is an optimal method for a monotonic feature set.

Relief-F: It is a supervised filter method that computes a weight for each feature based on some probabilistic supervision.

Naive Bayes, k-NN (k=1), AdaBoost (number of iterations = 10, weight threshold = 100) and SVM (RBF kernel, cost = 1 and gamma = 0) are used for the classification tasks (WEKA libraries are used with default parameters [Hall et al. 2009]).
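The experiments above used WEKA implementations; as a rough, non-authoritative equivalent of the 10-fold cross-validation protocol, a scikit-learn sketch could look as follows (the parameter mapping, e.g. the library's default RBF gamma, only approximates the WEKA defaults, and the function name evaluate is ours).

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

def evaluate(X_reduced, y):
    """Mean and standard deviation of 10-fold cross-validated accuracy
    for the four classifier families used in the comparison."""
    classifiers = {
        "NB": GaussianNB(),
        "k-NN": KNeighborsClassifier(n_neighbors=1),      # k = 1
        "AdaBoost": AdaBoostClassifier(n_estimators=10),   # 10 boosting iterations
        "SVM": SVC(kernel="rbf", C=1.0),                    # RBF kernel, cost = 1
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X_reduced, y, cv=10, scoring="accuracy")
        print(f"{name}: {100 * scores.mean():.2f} +/- {100 * scores.std():.2f}")

# Usage on hypothetical arrays: evaluate(X[:, selected], y)
```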

Table 3. Comparison results of different feature selection algorithms for small data sets

Breast cancer (D=9, d=4, Minpts=2, Eps=1.97e-02)

| Classifier | Criterion | MICI | SFS | SFFS | SBS | SFBS | BB | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|---|---|---|
| NB | Mean | 95.11 | 95.51 | 95.20 | 95.46 | 94.15 | 95.87 | 95.06 | 96.00 | 95.90 |
| NB | SD | ±0.53 | ±1.32 | ±0.94 | ±1.17 | ±0.93 | ±0.48 | ±1.86 | ±0.526 | ±1 |
| k-NN | Mean | 93.45 | 95.56 | 94.80 | 94.94 | 93.07 | 94.33 | 95.76 | 95.54 | 95.66 |
| k-NN | SD | ±1.37 | ±1.05 | ±0.97 | ±0.55 | ±2.21 | ±1.34 | ±0.666 | ±1.26 | ±0.884 |
| AdaBoost | Mean | 94.46 | 95.40 | 93.72 | 95.32 | 92.85 | 93.98 | 94.73 | 94.99 | 95.27 |
| AdaBoost | SD | ±0.511 | ±1.75 | ±1.52 | ±0.557 | ±1.87 | ±2.03 | ±1.35 | ±1.44 | ±0.783 |
| SVM | Mean | 94.67 | 96.07 | 95.45 | 95.77 | 94.26 | 95.76 | 94.62 | 95.30 | 96.05 |
| SVM | SD | ±0.366 | ±0.873 | ±0.707 | ±0.398 | ±0.715 | ±0.643 | ±0.824 | ±0.82 | ±0.767 |
| CPU TIME | - | 0.054 | 1.250 | 1.750 | 1.219 | 1.906 | 4.734 | 1.180 | 1.27 | 0.041 |

Statlog heart (D=13, d=5, Minpts=2, Eps=1.87e-02)

| Classifier | Criterion | MICI | SFS | SFFS | SBS | SFBS | BB | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|---|---|---|
| NB | Mean | 72.47 | 76.30 | 74.44 | 78.52 | 75.68 | 80.58 | 77.00 | 68.69 | 79.38 |
| NB | SD | ±2.7 | ±4.85 | ±2.56 | ±6.80 | ±2.36 | ±2.01 | ±2.02 | ±2.70 | ±2.27 |
| k-NN | Mean | 69.59 | 73.62 | 73.00 | 77.08 | 70.62 | 73.70 | 74.20 | 62.23 | 76.05 |
| k-NN | SD | ±3.80 | ±2.28 | ±3.38 | ±3.83 | ±5.45 | ±4.07 | ±3.77 | ±5.90 | ±1.69 |
| AdaBoost | Mean | 69.38 | 74.44 | 74.77 | 76.30 | 74.28 | 75.93 | 74.57 | 66.23 | 76.71 |
| AdaBoost | SD | ±5.13 | ±2.11 | ±2.21 | ±2.32 | ±6.01 | ±4.7 | ±1.9 | ±4.98 | ±2.78 |
| SVM | Mean | 53.70 | 73.05 | 72.92 | 77.28 | 74.77 | 78.77 | 75.97 | 51.88 | 74.17 |
| SVM | SD | ±3.92 | ±5.59 | ±3.56 | ±2.05 | ±4.13 | ±2.88 | ±1.9 | ±4.66 | ±2.29 |
| CPU TIME | - | 0.031 | 1.625 | 2.500 | 5.707 | 6.202 | 7.782 | 5.920 | 1.450 | 0.020 |

Parkinson's (D=22, d=11, Minpts=2, Eps=5.01e-05)

| Classifier | Criterion | MICI | SFS | SFFS | SBS | SFBS | BB | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|---|---|---|
| NB | Mean | 73.20 | 73.37 | 71.31 | 72.23 | 69.26 | 65.66 | 73.43 | 71.77 | 76.60 |
| NB | SD | ±3.54 | ±2.12 | ±2.28 | ±3.84 | ±2.82 | ±4.79 | ±3.33 | ±3.87 | ±3.04 |
| k-NN | Mean | 80.11 | 74.11 | 77.26 | 74.40 | 73.14 | 72.28 | 77.20 | 78.22 | 82.69 |
| k-NN | SD | ±4.57 | ±2.57 | ±4.36 | ±3.06 | ±2.07 | ±3.26 | ±2.62 | ±4.96 | ±4.19 |
| AdaBoost | Mean | 79.31 | 78.11 | 78.40 | 76.80 | 72.06 | 73.85 | 75.14 | 79.54 | 80.23 |
| AdaBoost | SD | ±3.16 | ±4.59 | ±4.94 | ±2.04 | ±3.56 | ±7.19 | ±3.83 | ±5.17 | ±3.43 |
| SVM | Mean | 74.91 | 76.51 | 76.57 | 75.09 | 74.34 | 74.57 | 73.94 | 74.17 | 74.40 |
| SVM | SD | ±0.684 | ±4.33 | ±4.41 | ±0.552 | ±0.50 | ±0.774 | ±1.38 | ±1.66 | ±0.59 |
| CPU TIME | - | 0.063 | 2.360 | 1.120 | 3.875 | 4.235 | 295.219 | 1.080 | 2.640 | 0.051 |

MICI: Feature selection using MICI, SFS: Sequential Forward Search, SFFS: Sequential Forward Floating Selection, SBS: Sequential Backward Search, SFBS: Sequential Backward Floating Selection, BB: Branch and Bound, NB: Naive Bayes classifier, mRMR: Max-Dependency, Max-Relevance and Min-Redundancy, k-NN: k-Nearest Neighbour classifier, D: Number of original features, d: Number of selected features using FSICI. The boldfaced values indicate the top two among the achieved average accuracies.


Table 4. Comparison results of different feature selection algorithms for medium data sets

WDBC (D=30, d=17, Minpts=2, Eps=0.375e+01)

| Classifier | Criterion | MICI | SFS | SFFS | SBS | SFBS | BB | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|---|---|---|
| NB | Mean | 91.60 | 86.95 | 90.21 | 89.90 | 72.17 | 89.75 | 87.93 | 92.07 | 90.66 |
| NB | SD | ±0.719 | ±2.19 | ±1.36 | ±0.803 | ±2.64 | ±0.414 | ±1.08 | ±1.73 | ±1.76 |
| k-NN | Mean | 90.45 | 86.82 | 87.68 | 87.52 | 77.32 | 87.81 | 89.28 | 90.64 | 92.50 |
| k-NN | SD | ±2.12 | ±2.35 | ±2.43 | ±2.45 | ±2.6 | ±2.44 | ±1.81 | ±1.81 | ±1.55 |
| AdaBoost | Mean | 90.57 | 89.77 | 90.33 | 88.73 | 72.91 | 89.28 | 89.65 | 92.32 | 92.17 |
| AdaBoost | SD | ±1.80 | ±1.27 | ±2.42 | ±2.91 | ±4.80 | ±2.01 | ±1.66 | ±1.29 | ±1.90 |
| SVM | Mean | 62.77 | 62.44 | 62.91 | 62.56 | 62.36 | 62.66 | 71.50 | 63.49 | 90.23 |
| SVM | SD | ±0.926 | ±0.634 | ±0.654 | ±0.452 | ±0.808 | ±0.503 | ±1.74 | ±1.42 | ±0.939 |
| CPU TIME | - | 0.219 | 3.125 | 2.047 | 6.172 | 10.375 | 18.922 | 7.660 | 1.120 | 0.129 |

Dermatology (D=34, d=17, Minpts=3, Eps=5.72e-03)

| Classifier | Criterion | MICI | SFS | SFFS | SBS | SFBS | BB | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|---|---|---|
| NB | Mean | 76.02 | 80.99 | 87.33 | 80.84 | 74.72 | 69.63 | 88.60 | 86.73 | 88.79 |
| NB | SD | ±5.00 | ±3.48 | ±3.76 | ±5.16 | ±5.21 | ±4.15 | ±3.53 | ±2.96 | ±2.2 |
| k-NN | Mean | 75.25 | 82.30 | 86.09 | 81.93 | 72.39 | 65.45 | 87.64 | 86.89 | 79.66 |
| k-NN | SD | ±3.64 | ±2.66 | ±3.34 | ±2.35 | ±3.84 | ±3.09 | ±2.71 | ±3.12 | ±2.77 |
| AdaBoost | Mean | 46.43 | 45.12 | 46.46 | 45.68 | 41.02 | 37.42 | 45.96 | 44.53 | 48.51 |
| AdaBoost | SD | ±3.10 | ±2.22 | ±3.76 | ±6.15 | ±10.3 | ±8.02 | ±4.69 | ±5.02 | ±2.44 |
| SVM | Mean | 72.45 | 53.88 | 49.29 | 54.57 | 39.25 | 65.93 | 84.44 | 79.25 | 80.56 |
| SVM | SD | ±7.48 | ±5.3 | ±7.58 | ±6.49 | ±9.63 | ±5.25 | ±2.79 | ±5.30 | ±6.33 |
| CPU TIME | - | 0.156 | 9.710 | 16.656 | 37.500 | 94.641 | 281.235 | 12.050 | 2.99 | 0.114 |

Classification accuracies are first measured without performing any feature selection. In the case of the Parkinson's, WDBC, dermatology and colon cancer data, FSICI helps the classifiers obtain the best accuracies, whereas for the rest of the data sets, FSICI leads to at least the second best results. FSICI performs well consistently in conjunction with different classifiers, which is really promising. The 10-fold cross-validation accuracies are furnished in tables 2–5. For the cancer and Statlog heart data, SFS and Branch and Bound (respectively) perform nominally better than FSICI. Across all the data sets it is evident that the FSICI-selected features produce either the best or the second best of all the reported accuracies. The other important observation is that, for data sets of varying sizes, FSICI takes the least time for selecting the features. It is also observed that FSICI extracts the highest accuracies for the multi-class classification problems, i.e. the arrhythmia and dermatology data.

We first tested the performance of the various classifiers on the complete data sets, without employing any feature selection. The obtained classification accuracies are furnished in table 2. Among the small-dimensional data sets, only for Parkinson's was a negligible improvement experienced while working with the unpruned data set. For the rest of the small-dimensional data sets, all four classifiers performed strictly better over the feature sets reduced by FSICI. For the WDBC data set, SVM produced poor accuracies using the features suggested by all feature selection methods, whereas FSICI made SVM achieve 90% accuracy with unaltered parameters. For the medium-dimensional dermatology data, the classification accuracy reported by k-NN over the original data was 88%, whereas using FSICI an accuracy of only 80% was achieved. In contrast, Naive Bayes (89%) and SVM (80%) perform well with the FSICI-pruned data (the accuracies obtained from the original data being 87% and 57% respectively). Among the large-dimensional data sets, for arrhythmia, Naive Bayes and SVM brought slight improvements when executed on the original data set. The FSICI-pruned feature set performs well in conjunction with k-NN and AdaBoost for the arrhythmia data set. For the large-dimensional cancer data, the FSICI-pruned data performed strictly well as opposed to the unpruned data sets, in conjunction with all the classifiers. Finally, it can be concluded that FSICI promises improved classification accuracies with solid consistency.

3.2.1 Analysis of statistical significance of the results: The McNemar test (Gillick and Cox 1989) on the cross-validation results summarized in tables 3–5 identifies no significant superiority of any feature selection method on any of the used data sets. The accuracies are too close to each other. From tables 3–5 we see that FSICI sometimes achieves the highest accuracy and sometimes the second highest accuracy. The cross-validation accuracies of the classification models were not sufficient to conclude about the superiority of FSICI. Therefore, we performed a one-sided Wilcoxon signed-rank test (Kanji 1999) to test if the proposed methodology is superior.

The test accounts for the ranked lists of the feature selection methods based on the cross-validation accuracies and examines the amount of deviation of the ranking from the null hypothesis H0. For H0 we assumed that all feature selection methods perform equally well. Table 6 reports the p-values corresponding to each pair of feature selection methods. The entry in the i-th row and j-th column specifies the p-value related to testing the superiority of the i-th method over the j-th method. From the tabular representation of p-values it is clear that the performance of FSICI is significantly better than that of the other methods.

The Wilcoxon signed-ranks test is a non-parametric alternative to the paired t-test. It ranks the differences in the performances of two concerned classifiers for each data set while ignoring the signs. After that, it compares the ranks for the positive and the negative differences. Let di be the difference between the accuracy scores of the two classifiers on the i-th out of N data sets. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let R+ be the sum of ranks for the data sets on which the second algorithm outperformed the first one, and R− the sum of ranks for the converse. Ranks of di = 0 are split evenly among the sums; if there is an odd number of them, one is ignored:

$$R^{+} = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2}\sum_{d_i = 0} \operatorname{rank}(d_i), \qquad R^{-} = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2}\sum_{d_i = 0} \operatorname{rank}(d_i). \qquad (2)$$

Now let T be the smaller of the sums, i.e. T = min(R+, R−). For a large number of data sets, the following statistic z is approximately normally distributed:

$$z = \frac{T - \frac{1}{4}N(N+1)}{\sqrt{\frac{1}{24}\,N(N+1)(2N+1)}}. \qquad (3)$$
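Purely as an illustration of Equations (2) and (3) (not the exact code used to produce table 6), the statistic can be computed as follows; the rule of dropping one zero difference when their count is odd is omitted for brevity, and the helper name wilcoxon_z is ours.

```python
import numpy as np
from scipy.stats import rankdata, norm

def wilcoxon_z(acc_a, acc_b):
    """Signed-rank sums of Equation (2) and the z statistic of Equation (3)
    for the accuracies of two methods over the same N data sets."""
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    n = d.size
    r = rankdata(np.abs(d))                     # average ranks of |d_i|, ties shared

    r_plus = r[d > 0].sum() + 0.5 * r[d == 0].sum()
    r_minus = r[d < 0].sum() + 0.5 * r[d == 0].sum()

    t = min(r_plus, r_minus)
    z = (t - 0.25 * n * (n + 1)) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return z, norm.cdf(z)                       # one-sided p-value from the normal tail

# Usage on hypothetical accuracy lists: z, p = wilcoxon_z(fsici_acc, mici_acc)
```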

Table 5. Comparison results of different feature selection algorithms for large data sets

Arrhythmia (D=259, d=98, Minpts=5, Eps=30.97e-02)

| Classifier | Criterion | MICI | SFS | SFFS | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|
| NB | Mean | 52.604 | 49.877 | 52.998 | 50.885 | 53.58 | 53.661 |
| NB | SD | ±4.73 | ±4.19 | ±5.23 | ±5.26 | ±3.66 | ±2.35 |
| k-NN | Mean | 49.435 | 46.683 | 52.776 | 46.020 | 52.482 | 51.499 |
| k-NN | SD | ±4.23 | ±4.98 | ±4.13 | ±5.52 | ±3.72 | ±2.97 |
| AdaBoost | Mean | 50.000 | 53.342 | 54.496 | 55.160 | 54.422 | 54.496 |
| AdaBoost | SD | ±5.64 | ±4.44 | ±4.48 | ±2.48 | ±3.41 | ±2.14 |
| SVM | Mean | 54.054 | 53.096 | 53.784 | 54.275 | 54.103 | 53.243 |
| SVM | SD | ±0.714 | ±0.915 | ±0.972 | ±0.425 | ±0.903 | ±0.705 |
| CPU TIME | - | 6.703 | 101.406 | 120.453 | 68.521 | 308.301 | 3.537 |

Colon cancer (D=2000, d=471, Minpts=3, Eps=17.027e-02)

| Classifier | Criterion | MICI | SFS | SFFS | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|
| NB | Mean | 59.107 | 60.357 | 60.893 | 61.429 | 62.857 | 65.535 |
| NB | SD | ±11.2 | ±6.78 | ±5.01 | ±10.7 | ±10.9 | ±2.24 |
| k-NN | Mean | 60.357 | 61.250 | 61.429 | 60.893 | 65.178 | 64.821 |
| k-NN | SD | ±9.02 | ±6.94 | ±2.55 | ±7.78 | ±8.00 | ±4.30 |
| AdaBoost | Mean | 55.893 | 55.000 | 54.821 | 58.214 | 59.107 | 61.250 |
| AdaBoost | SD | ±8.03 | ±11.4 | ±9.11 | ±9.53 | ±10.90 | ±3.67 |
| SVM | Mean | 56.607 | 61.786 | 62.500 | 56.250 | 60.714 | 63.928 |
| SVM | SD | ±12.9 | ±6.08 | ±1.46 | ±10.9 | ±10.90 | ±2.03 |
| CPU TIME | - | 238.562 | 580.344 | 579.906 | 366.170 | 561.390 | 227.906 |

Lymphoma (D=4026, d=189, Minpts=2, Eps=72.8e-02)

| Classifier | Criterion | MICI | SFS | SFFS | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|
| NB | Mean | 86.36 | 85.51 | 84.71 | 84.53 | 85.2 | 86.12 |
| NB | SD | ±1.73 | ±2.14 | ±1.79 | ±2.07 | ±2.33 | ±2.55 |
| k-NN | Mean | 93.15 | 95.1 | 92.71 | 97.95 | 95.06 | 98.31 |
| k-NN | SD | ±3.5 | ±1.48 | ±2.91 | ±4.02 | ±3.44 | ±4.19 |
| AdaBoost | Mean | 95.4 | 96.27 | 89.15 | 92.2 | 93.17 | 96.82 |
| AdaBoost | SD | ±5.12 | ±5.16 | ±4.03 | ±3.49 | ±4.12 | ±1.81 |
| SVM | Mean | 95.2 | 94.7 | 95.16 | 94.92 | 92.99 | 95.11 |
| SVM | SD | ±2.14 | ±1.79 | ±2.11 | ±7.21 | ±4.06 | ±3.72 |
| CPU TIME | - | 1071.81 | 1502.31 | 1611.48 | 1281.08 | 1579.11 | 891.43 |

Leukaemia (D=7129, d=233, Minpts=4, Eps=11.83e-01)

| Classifier | Criterion | MICI | SFS | SFFS | Relief-F | mRMR | FSICI |
|---|---|---|---|---|---|---|---|
| NB | Mean | 89.41 | − | − | − | − | 89.23 |
| NB | SD | ±1.46 | − | − | − | − | ±1.41 |
| k-NN | Mean | 84.35 | − | − | − | − | 86.92 |
| k-NN | SD | ±3.14 | − | − | − | − | ±1.28 |
| AdaBoost | Mean | 95.42 | − | − | − | − | 98.34 |
| AdaBoost | SD | ±4.01 | − | − | − | − | ±3.81 |
| SVM | Mean | 98.12 | − | − | − | − | 98.4 |
| SVM | SD | ±1.75 | − | − | − | − | ±1.89 |
| CPU TIME | - | 1862.31 | − | − | − | − | 1271.39 |

J. Biosci. 40(4), October 2015 Feature selection in biological data 729

Table 6. Wilcoxon signed-rank test p-values

|          | SFS      | SFFS     | Relief-F | mRMR     | FSICI    |
|----------|----------|----------|----------|----------|----------|
| MICI     | ↓0.30772 | ↓0.28014 | ↓0.13362 | ↓0.18352 | ↓0       |
| SFS      | -        | ↓0.41794 | ↓0.06576 | ↓0.34212 | ↓0       |
| SFFS     | -        | -        | ↓0.242   | ↓0.9162  | ↓0.00022 |
| Relief-F | -        | -        | -        | ↑0.74896 | ↓0.00072 |
| mRMR     | -        | -        | -        | -        | ↓0.00068 |

Up arrow indicates that the comparison of rank sums is favorable for the method along the corresponding row. Down arrow indicates the opposite.

3.3 Selection of DBSCAN parameters

From the earlier discussion it is apparent that the working of DBSCAN is dependent on two parameters, namely Eps and MinPts. Selection of Eps is important because a larger Eps makes clusters grow larger when MinPts is not grown proportionately. A high value of MinPts makes the density requirement stricter. If the density of the data points varies significantly from region to region, the algorithm can miss natural clusters in sparse areas while merging clusters in regions of higher density. Instead of setting a high Eps, one should set a slightly higher MinPts so that the algorithm identifies truly dense cluster regions. In the present feature selection method, the choice of Eps must be commensurate with the range of the eigenvalues. Figure 1 shows the varying number of features in the largest cluster and the number of clusters over different MinPts selections for the WDBC data, which has a total of 30 features. The user can simply vary the Eps value while sticking to a standard and strict MinPts, until a convincing number of features is found in the largest cluster. Notice in figure 1 that, for the WDBC data, the MinPts selection barely had any impact on the number of features selected, with varying Eps.

Figure 1. Number of features varying with Eps and MinPts parameter selections, for the WDBC data.


As per our formulation of dissimilarity, the choice of Eps imposes a threshold on the larger eigenvalue, i.e. λ1. It is important to observe whether the performance of the classifier is highly sensitive to the change of Eps. To test this, a data set was simulated containing 500 features, 300 instances and 3 classes. Each instance is sampled from a multivariate Gaussian distribution N(μ, Σ), where μ = [E(f1), E(f2), …, E(f500)] denotes the vector of means of the 500 features and Σ denotes the specified covariance matrix. The R packages rockchalk and MASS were used to accomplish this. Note that for the different classes some controlled variations are introduced in the vector of means. The proposed method of feature selection was used before classification while varying Eps from 0.5 to 10.5. The Naive Bayes classifier is used for this purpose. The other parameter, MinPts, was kept fixed at 2. Also, every time the classification performance is tracked for the top 3 clusters of features (figure 2). It was observed that the performance of the classifier changes fairly smoothly with respect to Eps and settles down after a while. Also, it was found that the different clusters perform equally well.

Figure 2. Classifier performance with varying Eps.
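The simulation itself was carried out in R with the rockchalk and MASS packages; the following is only an illustrative Python analogue, in which the covariance structure, the size of the class-dependent mean shifts and the Eps grid are our assumptions, and fsici_select/evaluate refer to the sketches given earlier.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_per_class, n_classes = 500, 100, 3   # 300 instances, 3 classes

# One shared covariance matrix and a class-dependent shift of the mean vector.
a = rng.normal(size=(n_features, n_features))
sigma = a @ a.T / n_features                       # positive semi-definite covariance
base_mu = rng.normal(size=n_features)

X_parts, y_parts = [], []
for c in range(n_classes):
    mu = base_mu + 0.5 * c * rng.normal(size=n_features)   # controlled variation per class
    X_parts.append(rng.multivariate_normal(mu, sigma, size=n_per_class))
    y_parts.append(np.full(n_per_class, c))
X, y = np.vstack(X_parts), np.concatenate(y_parts)

# Sweep Eps with MinPts fixed at 2 and track the classifier per setting.
for eps in np.arange(0.5, 10.6, 1.0):
    # selected = fsici_select(X, eps=eps, min_pts=2)   # helper sketched in section 2
    # evaluate(X[:, selected], y)                      # e.g. the Naive Bayes accuracy
    print("Eps =", round(float(eps), 1))
```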
4. Conclusion

In this article we proposed a fast, unsupervised feature selection technique called FSICI, based on the principles of information loss. The method is found to perform significantly better than many existing feature selection techniques when applied to a wide range of biological data sets with diverse dimensions. Among the important features of the method are its low time complexity and its manageable sensitivity to the parameters. Theoretically our work has some analogy with the method proposed by Mitra et al. (2002), but in practice we found that the two eigenvectors derived from the covariance matrix of a pair of features have no consistent linear dependency.

References

Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, et al. 2000 Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 503–511
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D and Levine AJ 1999 Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96 6745–6750
Blake C and Merz CJ 1998 UCI repository of machine learning databases
Devijver PA and Kittler J 1982 Pattern recognition: a statistical approach (Englewood Cliffs: Prentice/Hall International)
Ester M, Kriegel H-P, Sander J and Xu X 1996 A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96 226–231
Gillick L and Cox SJ 1989 Some statistical issues in the comparison of speech recognition algorithms; in Acoustics, Speech, and Signal Processing, 1989 (ICASSP-89), International Conference on, pp 532–535. IEEE
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, et al. 1999 Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 531–537
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P and Witten IH 2009 The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11 10–18
Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, et al. 2014 Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343 776–779
Kanji GK 1999 100 statistical tests (Sage)
Mitra P, Murthy C and Pal SK 2002 Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 24 301–312
Mukhopadhyay A, Maulik U and Bandyopadhyay S 2009 Multiobjective genetic algorithm-based fuzzy clustering of categorical attributes. IEEE Trans. Evol. Comput. 13 991–1005
Pal SK and Wang PP 1996 Genetic algorithms for pattern recognition (CRC Press)
Peng H, Long F and Ding C 2005 Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27 1226–1238
Pudil P, Novovičová J and Kittler J 1994 Floating search methods in feature selection. Pattern Recogn. Lett. 15 1119–1125
Ruiz R, Riquelme JC and Aguilar-Ruiz JS 2006 Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recogn. 39 2383–2392
Tan F, Fu X, Zhang Y and Bourgeois AG 2006 Improving feature subset selection using a genetic algorithm for microarray gene expression data; in Evolutionary Computation, 2006 (CEC 2006), IEEE Congress on, pp 2529–2534. IEEE

