
Neurocomputing 173 (2016) 855–863

Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

A robust feature selection system with Colin's CCA network

Md Zakir Hossain a, Md. Monirul Kabir b, Md. Shahjahan a,*

a Department of Electrical and Electronic Engineering, Khulna University of Engineering and Technology (KUET), Khulna 9203, Bangladesh
b Department of Electrical and Electronic Engineering, Dhaka University of Engineering and Technology (DUET), Gazipur, Bangladesh

* Corresponding author. Tel.: +88 1912251993; fax: +88 41 774403. E-mail addresses: [email protected], [email protected] (Md. Shahjahan).

Article history: Received 23 August 2014; Received in revised form 3 August 2015; Accepted 19 August 2015; Available online 2 September 2015. Communicated by Zhi Yong Liu.

Keywords: Canonical Correlation Analysis (CCA); Neural network (NN); Feature selection (FS); Embedded approach; Pruning

Abstract

Biomedical data such as microarray data are typified by high dimensionality and small sample size. Feature selection (FS) is a predominant technique for finding informative features, which can serve as biomarkers, in this huge amount of data. However, many early studies report that acquisition noise leads to unreliable assumptions when selecting features, and methods that rely on computing mutual information between features lead to poor generalization. To the best of our knowledge, very few feature selection methods address these two problems together. This paper presents a novel FS method based on a Network of Canonical Correlation Analysis (NCCA), which is robust to acquisition noise and avoids mutual information computation. Two strategies that distinguish NCCA from other methods are adopted, namely 'training with noisy features' and 'maximum correlation and minimum redundancy', to address the above two problems. As a result, the method converges to an informative feature subset. NCCA has been applied to different types of very high-dimensional biomedical datasets, such as microarray, gene expression and voice-signal data. To obtain reliable results, NCCA is evaluated with two classifiers: a neural network (NN) and a support vector machine (SVM). NCCA proves very robust in terms of elapsed time, accuracy and mean square error (MSE) compared with mutual-information-based methods such as information gain (IG). The computational complexity of NCCA is shown, both theoretically and experimentally, to be lower than that of IG; NCCA is about 2–19 times faster than IG, depending on the size of the dataset. NCCA is further compared with other standard methods in the literature and is found to outperform them.

© 2015 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2015.08.040

1. Introduction

Neither a huge number of features nor, conversely, a single feature represents a problem domain completely, owing to the presence of redundant and insufficient information. To obtain representative and informative features from a problem domain, feature selection becomes an important step in building fast and robust learning models with better generalization ability. Recent approaches to feature selection include wrapper [1,2], filter [3], embedded [4] and hybrid methods [5,6]. Wrapper methods involve the learning models and evaluate features directly according to the learning outcome; they tend to obtain better learning outcomes at the cost of computational effort. Filter methods, on the other hand, measure the goodness of a feature based on the intrinsic characteristics of the data, without any consideration of the learning models; they are computationally fast, but they may select features that are not suitable for the learning models. In embedded methods, feature selection and the learning model are optimized simultaneously. Hybrid methods take advantage of both filter and wrapper methods in particular hybridization frameworks. The focus of this paper is on selecting features using a two-step, special embedded scheme.

The proposed scheme is built on canonical correlation analysis (CCA). In CCA, the number of canonical variate pairs is no more than the smallest dimensionality of the two variable sets. Standard statistical CCA is an extension of multiple regression, where the second set contains multiple response variables. The variables in each set are linearly combined in a low-dimensional space such that the linear combinations have maximal correlation. A major target of feature selection is to improve the generalization ability of a classifier. A classifier cannot exploit the information perfectly if an input pattern becomes excessively long, as is generally observed in biomedical datasets; as a result, the generalization performance of the classifier degrades significantly. In this case, reducing a long pattern into smaller subsets can significantly improve the performance of the classifier; in particular, the weakness of one subset can be reduced by the strength of others [20]. The specialty of Colin's CCA network is that it allows more than two subsets.

There are two major problems encountered in the feature selection paradigm: one arises from the data side and the other from the methodology side. Concerning the data, acquisition noise is an unavoidable, predominant phenomenon that is automatically introduced by any data collection scheme. Biological data often contain noise and artifacts because they are collected with a data acquisition system. Data artifacts typically affect more than 10% of the records in a biological dataset [7]. Many sources of noise and artifacts have been observed, such as annotation errors, attribute errors, sequence violations, contaminated sequences, nonspecific attributes and invalid values. Although applying such data to a learning model yields unreliable implications for selecting features, the inclusion of noise does not mean that the affected features are unimportant.

From the methodology side, mutual-information-based methods such as information gain (IG) are popularly used as attribute selection criteria in decision trees [8]. IG measures the amount of information, in bits, that a feature provides about the class prediction, as the expected reduction in entropy [9] given the feature and its corresponding class distribution. The relevance of a feature can be determined statistically by the highest value of IG. Nevertheless, IG requires discretization of numeric data prior to FS [10].
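The expected-reduction-in-entropy computation behind IG can be sketched in a few lines (an illustrative sketch, not the authors' implementation; the function names and toy data are invented for demonstration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a class-label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for an already-discretized feature X."""
    n = len(labels)
    h_y_given_x = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        h_y_given_x += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_y_given_x

# A feature that perfectly predicts the class has IG = H(Y) = 1 bit here;
# an uninformative feature has IG = 0 bits.
labels  = [0, 0, 1, 1]
perfect = ['a', 'a', 'b', 'b']
useless = ['a', 'b', 'a', 'b']
print(information_gain(perfect, labels))  # 1.0
print(information_gain(useless, labels))  # 0.0
```

Ranking all features by this score and keeping the top ones is the usual filter-style use of IG; note the discretization requirement mentioned above applies before this computation.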
Moreover, IG selects features in a greedy manner [11] by measuring the mutual dependence of two variables. It can sometimes overfit the training data, resulting in large trees, and it ignores simultaneous optimization across all features. Methods based on mutual information, such as that in [12], "do not always guarantee to decrease the misclassification probability" [13], which may also lead to worse classification performance. This may be due to noise and artifacts being considered jointly in the FS process. Recursive elimination of features (Relief-F) [14] was proposed to identify important variants according to strong interactions within datasets; however, it is sensitive to the presence of noise in the attributes.

Among other approaches, χ2-statistics, t-statistics, minimum redundancy maximum relevance (MRMR) and effective range based gene selection (ERGS) methods are discussed briefly. With χ2-statistics, features are selected based on the sorted values of the χ2-statistic for all features. Similar to IG, each numeric attribute is discretized before computing the chi-square statistic. The values of the t-statistic are sorted in descending order to select the important features. Ding and Peng [15] proposed a minimum redundancy maximum relevance (MRMR) technique that selects features by minimizing the redundancy among them while maximizing relevance. MRMR uses a mutual information criterion [16,17] as the measure of relevance for discrete datasets, whereas the F-statistic between the genes and the class variable is taken as the maximum-relevance score for continuous variables. This approach requires two search algorithms to meet the redundancy and relevance conditions. Effective range based gene selection (ERGS) proposes the principle that a feature should be selected if the decision boundaries among classes are

The basic definition of selecting a feature subset is that "strongly relevant and weakly relevant but non-redundant features form an optimal subset of features" [21]. In the NCCA system, this strategy prevails by taking strongly and weakly correlated, but relevant, features and maximizing the correlation among them. The non-redundancy of the weakly correlated set is evident, since features with heavily correlated weights from the weak set of NCCA are selected after a maximization process. The weakly correlated subset is considered together with the highly correlated subset on the basis of the following background. Firstly, noise and artifacts significantly degrade the correlation among relevant features; therefore, a significant number of weakly correlated features were separated out and neglected in previous approaches, although they seemed to be important. Secondly, it is evident that there must be nonlinear relationships among features, which are difficult to isolate using a linear CCA network. The overall correlation also degrades due to the presence of nonlinear correlation. The features corresponding to these nonlinear correlations may play a significant role in the classification task. There is no doubt that a large number of them are ignored by many previous methods, although they can play a very significant role in improving the
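The first point, that acquisition noise degrades the measured correlation between genuinely related features, can be illustrated with a toy simulation (the feature construction, noise level and seed here are assumptions for demonstration, not taken from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two strongly related "features": f2 is a noiseless linear copy of f1.
f1 = rng.standard_normal(n)
f2 = 0.9 * f1

clean_corr = np.corrcoef(f1, f2)[0, 1]  # essentially 1.0 for a linear copy

# Simulated acquisition noise on f2 pushes the same, genuinely relevant
# pair toward the "weakly correlated" regime that NCCA aims to retain.
noisy_corr = np.corrcoef(f1, f2 + 2.0 * rng.standard_normal(n))[0, 1]

print(clean_corr, noisy_corr)  # noisy_corr is much smaller in magnitude
```

A filter that discards low-correlation features would drop f2 in the noisy case even though it carries the same underlying signal, which is the motivation for keeping the weakly correlated subset alongside the strongly correlated one.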