Neurocomputing 173 (2016) 855–863


A robust feature selection system with Colin's CCA network

Md Zakir Hossain a, Md. Monirul Kabir b, Md. Shahjahan a,*

a Department of Electrical and Electronic Engineering, Khulna University of Engineering and Technology (KUET), Khulna 9203, Bangladesh
b Department of Electrical and Electronic Engineering, Dhaka University of Engineering and Technology (DUET), Gazipur, Bangladesh
* Corresponding author: Md. Shahjahan.

Article history: Received 23 August 2014; received in revised form 3 August 2015; accepted 19 August 2015; available online 2 September 2015.

Keywords: Canonical Correlation Analysis (CCA); Neural network (NN); Feature selection (FS); Embedded approach; Pruning

Abstract

Biomedical data such as microarray data are typified by high dimensionality and small sample size. Feature selection (FS) is a predominant technique for finding informative features, as a means of biomarker discovery, from this huge amount of data. However, many early studies report acquisition noise that leads to unreliable suppositions when selecting features, and methods that rely on computing mutual information between features can lead to poor generalization. To the best of our knowledge, very few feature selection methods address these two problems together. This paper presents a novel FS method based on a network of Canonical Correlation Analysis, NCCA, which is robust to acquisition noise and avoids mutual information computation. Two strategies that distinguish NCCA from other methods are adopted, namely 'training with noisy features' and 'maximum correlation with minimum redundancies', to address the above two problems; as a result, an informative feature subset is converged upon. NCCA has been applied to different types of very high dimensional biomedical data, such as microarray gene expression and voice signal datasets. In order to get reliable results, NCCA is evaluated with two classifiers, a neural network (NN) and a support vector machine (SVM). NCCA is very robust in terms of elapsed time, accuracy and Mean Square Error (MSE) with respect to mutual information based methods such as the information gain (IG) method. The computational complexity of NCCA is shown to be less than that of IG both theoretically and experimentally; NCCA is observed to be about 2–19 times faster than IG, depending on the size of the dataset. NCCA is further compared with other standard methods in the literature and is found to be better than these techniques.

1. Introduction

A huge number of features, or on the contrary a single feature, does not represent a problem domain completely, owing to the presence of redundant and insufficient information. In order to obtain representative and informative features from a problem domain, feature selection becomes an important step for building fast and robust learning models with good generalization ability. Recent approaches to feature selection include wrapper [1,2], filter [3], embedded [4] and hybrid methods [5,6]. Wrapper methods involve the learning models and evaluate features directly according to the learning outcome; they tend to obtain a better learning outcome at the cost of computational effort. Filter methods, on the other hand, measure the goodness of a feature based on the intrinsic characteristics of the data, without any consideration of the learning models. They are computationally fast, but they may select features that are not suitable for the learning models. In embedded methods, feature selection and the learning model are optimized simultaneously. Hybrid methods take advantage of both filter and wrapper methods in particular hybridization frameworks. The focus of this paper is on selecting features using a two-step special embedded scheme.

There are two major problems encountered in the feature selection paradigm: one comes from the data side and the other from the methodology side. Concerning the data, acquisition noise is an unavoidable, predominant phenomenon that automatically enters a data collection scheme. Biological data often contain noise and artifacts since they are collected with a data acquisition system; data artifacts typically affect more than 10% of the records in a biological dataset [7]. Many sources of noise and artifacts are observed, such as annotation errors, attribute errors, sequence violations, contaminated sequences, nonspecific attributes and invalid values. Although applying such data to a learning model yields unreliable implications for selecting features, the inclusion of noise does not mean that the affected features are unimportant.

From the methodology side, mutual information based methods such as Information Gain (IG) are popularly used for selecting the splitting criterion of an attribute in a decision tree [8]. IG measures the amount of information, in bits, about the class prediction as the expected reduction in entropy [9] when a feature is available together with its corresponding class distribution. The relevance of a feature can then be determined statistically by the highest value of IG. Nevertheless, IG requires discretization of numeric data prior to FS [10]. Moreover, features are selected in a greedy manner [11] by measuring the mutual dependence of two variables; this can sometimes overfit the training data, resulting in large trees, and it ignores simultaneous optimization across all features. Methods based on mutual information such as [12] "do not always guarantee to decrease the misclassification probability" [13], which may also lead to worse classification performance. This may be due to the joint consideration of noise and artifacts in the FS process. Recursive elimination of features (Relief-F) [14] was proposed to identify important variants according to strong interactions in the data; however, it is sensitive to the presence of noise in the attributes.

Among other approaches, the χ2-statistic, t-statistic, minimum redundancy-maximum relevance (MRMR) and effective range based gene selection (ERGS) methods are discussed briefly. In the χ2 method, features are selected based on the sorted values of the χ2-statistic over all features; similar to IG, each numeric attribute is discretized before computing the chi-square statistic. With the t-statistic, values are sorted in descending order to select the important features. Ding and Peng [15] proposed a minimum redundancy maximum relevance (MRMR) technique that selects features by minimizing the redundancy among them while keeping maximal relevance. MRMR uses a mutual information criterion [16,17] as the measure of relevance for discrete datasets, whereas the F-statistic between the genes and the class variable is taken as the score of maximum relevance for continuous variables. This approach requires two search algorithms to meet the redundancy and relevance conditions. Effective range based gene selection (ERGS) proposes the principle that a feature should be selected if the decision boundaries among classes are wide. Although ERGS does not require any such search, it depends completely on a statistically defined effective range of the decision boundaries.
In this paper, a novel FS system, NCCA, is proposed based on Colin Fyfe's network of CCA [18] to address the challenge of selecting reliable features using training with noisy features and maximum correlation with minimum redundancies. The system maximizes the correlation among several data groups and separates highly and lightly correlated groups in two consecutive training steps to choose a reliable final subset. This approach protects against noise by training simultaneously with noisy features, while it avoids the computation of mutual information. The approach differs from previous methods in the following aspects. (i) A number of feature groups are used instead of the traditional use of individual features. (ii) No additional search algorithms are required apart from the CCA network. (iii) It is faster than other approaches due to the use of a smaller number of instructions and the absence of a heavy computational burden in the program. (iv) The CCA network searches for global correlation among feature groups instead of the traditional computation of mutual correlation between two variables. (v) NCCA can cope with nonlinear correlation to some extent due to the presence of NN training.

2. The proposed method

2.1. Background

Canonical Correlation Analysis (CCA) finds two bases, one for each variable, and the corresponding correlation; it is a way of measuring the linear relationship between two multidimensional variables [19]. The dimensionality of these new bases is equal to or less than the smallest dimensionality of the two variables. Standard statistical CCA is an extension of multiple regression, where the second set contains multiple response variables. The variables in each set are linearly combined in a low dimensional space such that the linear combinations have maximal correlation. A major target of feature selection is to improve the generalization ability of a classifier. A classifier cannot absorb information perfectly if an input pattern becomes very long, which is generally the case in a biomedical dataset; as a result, the generalization performance of the classifier degrades significantly. In this case, reducing a long pattern into smaller subsets can significantly improve the performance of the classifier, since the weakness of one subset can be compensated by the strength of the others [20]. The specialty of Colin's CCA network is that it can allow more than two subsets.

The basic definition of selecting a feature subset is that "strongly relevant and weakly relevant but non-redundant features form an optimal subset of features" [21]. In the NCCA system, this strategy prevails by taking strongly and weakly, but relevant, features and maximizing the correlation among them. The non-redundancy of the weakly correlated set is evident, since features with heavily correlated weights from the weak set of NCCA are selected after a maximization process. The weakly correlated subset is considered together with the highly correlated subset on the basis of the following background. Firstly, noise and artifacts significantly degrade the correlation among the relevant features; therefore a significant number of weakly correlated features are separated and neglected in previous approaches, although they may be important. Secondly, it is evident that there must be nonlinear relationships among features, which are difficult to isolate using a linear CCA network; the overall correlation also degrades due to the presence of nonlinear correlation. The features corresponding to these nonlinear correlations may play a significant role in the classification task, and there is no doubt that many of them are ignored by previous methods although they can improve the generalization ability of a classifier. Considering the above facts, the proposed method attempts to consider these missing weakly correlated features together with the highly correlated features to make a complete feature subset.

It is well established that there is no information where there is certainty [22]. In this sense, it is clear that information is lost when only the highly or only the lightly correlated feature subset is selected, because of their nearly unity and nearly zero correlations respectively. In order to get an informative and complete subset, features from both subsets should be selected, not only from one subset; selecting from only one of them certainly indicates the presence of redundant features. Therefore, an informative and complete subset may consist of highly as well as lightly correlated features. For this reason, the CCA network is applied again in a second stage to search for the most informative features from the highly and lightly correlated feature subsets. Following this training, 50% of the features from each subset are pruned, corresponding to the smaller weights of the CCA network; the remaining features from both subsets are the members of the final subset.

The process of selecting a feature subset in this method is empirically linked with the mutual information methods in one sense or another. It is evident that the cross correlations between features of the highly correlated subset are approximately unity, meaning that those features are mutually dependent; the opposite is true for the lightly correlated subset, whose features are mutually independent [23]. Therefore, the mutual information within either subset alone is very small. In order to get features with high information, it is essential to combine the highly and lightly correlated subsets, so that the features of the final subset are maximally correlated with minimum redundancies. For this reason, a further training with the CCA network is attempted to find the maximally correlated features. In this way the information content is increased, since the CCA network now maximizes the correlation between two complementary subsets (of nearly unity and zero correlations), and a further pruning is attempted to reduce redundancy to some extent. In fact, the final subset is expected to have a correlation approximately around (0 + 1)/2 = 0.5, an uncertain state, meaning that it contains high information.
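As a side note (our own illustrative restatement, not an equation from the paper), this "maximum information at an intermediate state" argument mirrors the behavior of the binary Shannon entropy,

$H(p) = -p\log_2 p - (1 - p)\log_2(1 - p),$

which is zero at p = 0 and p = 1 and maximal at p = 0.5. By the same intuition, a subset whose internal correlations are all close to unity or all close to zero is the least informative, while the mixed subset targeted by NCCA sits near the most uncertain, and hence most informative, middle point.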

2.2. The NCCA system

This system mainly consists of two phases. In the first phase, the entire feature set of a particular problem is divided into several groups of features to train the CCA network. Highly and lightly correlated feature subsets are then obtained by pruning a large number of features corresponding to the lightly and heavily weighted connections of the CCA network, respectively. After a training session, the magnitudes of the weights of the CCA network indicate the correlation levels as a whole. Since some information is lost due to the pruning in the first phase, the highly and lightly correlated subsets again undergo CCA network training in the second phase. In this way, informative features with maximum correlation and minimum redundancies are found. The joint use of NN training and statistics strengthens the FS process and yields robust results. Table 1 describes the meanings of the parameters used in the NCCA system. For clear understanding, the entire NCCA system for FS is illustrated in Fig. 1 and explained in the following stepwise description.

Fig. 1. General overview of the NCCA system.

Step 1. Consider the entire dataset of size S × F, where S is the number of samples and F is the number of features.
Step 2. Split the entire feature set into M subsets after distributing it randomly. For simplicity's sake we consider M = 3.

Hence x1, x2 and x3 are three groups covering the entire feature set, with dimensions of, say, S × k, S × m and S × n respectively, generated from the S × F dataset, where F = k + m + n. The input to the CCA network is now ready for the first phase of training, as depicted in Fig. 1.

Step 3. Initialize the Lagrange multipliers λ1, λ2, λ3 and the learning constants η0 and η. Generate random weights w1, w2 and w3 according to the dimensions of x1, x2 and x3.
Step 4. Optimize the correlations y1, y2 and y3 using separate constrained objective functions as shown in Eq. (1), where E(·) denotes the expectation taken over the joint distribution and i denotes the index of the objective function:

$D_i = E(y_i y_{i+1}) + \frac{\lambda_i}{2}\left(1 - y_i^2\right)$    (1)

Step 5. Update the weights and Lagrange multipliers according to the joint learning rules of Eqs. (3) and (4); a small illustrative code sketch of these updates is given after this step list. Eq. (2) is the weighted sum over all features of a sample, where j is the index over features:

$y_i = w_i x_i = \sum_j w_{ij} x_{ij}$    (2)

The update rules of Eqs. (3) and (4) can be deduced from Eq. (1) and can be found elsewhere [18].

$\Delta w_{ij} = \eta\, x_{ij}\left(y_{i+1} - \lambda_i y_i\right)$    (3)

$\Delta\lambda_i = \eta_0\left(1 - y_i^2\right)$    (4)

Table 1. Clarification of used parameters.

Notation         Description                                 Dimension
i                index of a pattern
j                index of a feature
k, m and n       dimensions of each partition
F = k + m + n    total number of features
S                total number of patterns
M                number of feature groups
D_i              ith optimization function
y_i              ith output correlation vector               k, m, n
x_ij             ith pattern of the jth feature              S × k, S × m, S × n
w_ij             ith weight of the jth feature               S × k, S × m, S × n
λ_i              Lagrange multiplier of the ith objective    M
η0 and η         learning rates                              constant

Step 6. Select two S × p matrices (xH and xL) from the S × k, S × m and S × n matrices by discarding (k + m + n) − p features according to the lightly and highly correlated features, where p ≪ (k + m + n). The highly and lightly correlated features, corresponding to the heavily and lightly trained weights of the CCA network, are separated by discarding the ((k + m + n) − p) lightest and ((k + m + n) − p) heaviest weights respectively. Finally, two subsets xH and xL of the same size S × p are left.
Step 7. These two feature subsets xH and xL are applied to the NCCA system again, with newly initialized weight subsets wH and wL. The training is repeated with the new weights to find the maximizing correlation vectors yH and yL. This is the second phase of training, as observed in Fig. 2.
Step 8. The expected final number of features is selected from xH and xL following this training. Here, xH/2 and xL/2 features are separated by discarding the features corresponding to the lowest wL/2 and highest wH/2 weights respectively.

How the final feature subset is selected in a top-down manner is explained numerically in Section 4.1. The NCCA system is applied to different datasets. Although it starts with different initial random weights for every run, this does not produce any significant difference in the weight maximization or the subset selection. The weights were normalized in order to find a representative result. The structure of the CCA network can be seen in Fig. 2.
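To make Steps 1–8 concrete, the sketch below is a minimal, illustrative Python rendering under our own reading of the method: it assumes NumPy arrays, a cyclic pairing of the outputs y_i with y_{i+1} in Eqs. (1)–(4), selection of the p most and p least heavily weighted features in the first phase, and retention of the more heavily weighted half of each subset in the second phase (following the 50% pruning described in Section 2.1). The helper names ncca_phase, split_by_weight and ncca_select are hypothetical, not taken from the paper.

```python
import numpy as np

def ncca_phase(groups, eta=0.001, eta0=0.5, lambdas=None, seed=0):
    """One CCA-network training pass (Eqs. (2)-(4)) over all S samples.

    groups: list of arrays of shape (S, d_i) -- the feature groups x_i.
    Returns one trained weight vector per group.
    """
    rng = np.random.default_rng(seed)
    S, M = groups[0].shape[0], len(groups)
    w = [rng.normal(scale=0.01, size=g.shape[1]) for g in groups]     # small random weights
    lam = list(lambdas) if lambdas is not None else [0.015 + 0.005 * i for i in range(M)]
    for s in range(S):                                                # 'S' times weight update
        y = [w[i] @ groups[i][s] for i in range(M)]                   # Eq. (2): y_i = sum_j w_ij x_ij
        for i in range(M):
            y_next = y[(i + 1) % M]                                   # pair y_i with y_{i+1} (cyclic)
            w[i] = w[i] + eta * groups[i][s] * (y_next - lam[i] * y[i])   # Eq. (3)
            lam[i] = lam[i] + eta0 * (1.0 - y[i] ** 2)                    # Eq. (4)
    return w

def split_by_weight(groups, weights, p):
    """Step 6: the p most and p least heavily weighted features -> (x_H, x_L, their indices)."""
    X = np.hstack(groups)
    magnitude = np.abs(np.concatenate(weights))
    order = np.argsort(magnitude)                                     # ascending |w|
    light, heavy = order[:p], order[-p:]
    return X[:, heavy], X[:, light], heavy, light

def ncca_select(X, n_groups=3, p=30, seed=0):
    """Steps 1-8: return the column indices of the final p-feature subset."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])                                # Step 2: random grouping
    parts = np.array_split(perm, n_groups)
    groups = [X[:, idx] for idx in parts]

    w_first = ncca_phase(groups)                                      # first phase (Steps 3-5)
    x_H, x_L, heavy, light = split_by_weight(groups, w_first, p)      # Step 6

    w_H, w_L = ncca_phase([x_H, x_L])                                 # second phase (Step 7)
    keep_H = np.argsort(np.abs(w_H))[-(p // 2):]                      # heavier half of x_H (Step 8)
    keep_L = np.argsort(np.abs(w_L))[-(p - p // 2):]                  # heavier half of x_L (Step 8)
    concat_idx = np.concatenate(parts)                                # column order used by np.hstack
    return np.concatenate([concat_idx[heavy[keep_H]], concat_idx[light[keep_L]]])
```

A hypothetical call such as ncca_select(X, n_groups=3, p=30) on a 62 × 2002 colon cancer matrix (2000 genes plus the class columns, as in Section 4.1) would return 30 column indices; in practice the weights may need normalizing across groups before their magnitudes are compared, as noted above.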

Fig. 2. Structure of the two-phase NCCA system.

Table 2. Characteristics and partitions of datasets.

Dataset                Features   Classes   Patterns   Training   Validation   Testing
Colon cancer             2000        2         62         40          6          16
Lymphoma                 4026        2         47         30          5          12
Leukemia                 7129        2         72         47          7          18
Global Cancer          16,063       14        144         94         14          36
Voice Rehabilitation      309        2        126         82         13          31

3. Characteristics of biomedical datasets

We investigate five different types of datasets: colon cancer [24], lymphoma [25], leukemia [26], Voice Rehabilitation [27] and Global Cancer [28]. These datasets have been the subject of many studies. Their characteristics and partitions are listed in Table 2, and these data are applied to NCCA.

3.1. Colon cancer

Colon cancer is a tumor arising from the colonic mucosal epithelium due to the impact of various carcinogenic factors such as environment and genetics, which produces cancerous growths in the tissues of the colon. It often occurs in the rectum and at the junction of the rectum and sigmoid colon, and its incidence rate is second only to gastric and esophageal cancer. The colon dataset contains 62 samples: 40 tumor biopsies from tumors and 22 normal biopsies from healthy parts of the colons of the same patients. There are a total of 2000 genes in the dataset.

the spongy center of bones where our blood cells are formed. They Average Correlation are malignant neoplasms of hematopoietic stem cells. The disease 0 develops when blood cells produced in the bone marrow grow out 10 20 30 50 100 of control. The leukemia consists of 25 and 47 samples of acute Number of features myeloblastic leukemia (AML) and acute lymphoblastic leukemia (ALL) respectively. The samples are taken from 63 bone marrow Fig. 3. Average correlations of selected features for lymphoma data. samples and 9 peripheral blood samples. The total number of samples for x of leukemia dataset, random weight for w also has samples to be tested is 72 and number of genes to be tested is 1 1 k¼ 7129, which are all acute leukemia patients. 2337 values. Initially Lagrange and learning parameters are set as λ123===0. 015,λλ 0. 020, 0. 025 and ηη0 ==0. 5, 0. 001.It fi 3.3. Lymphoma also did not show any signi cant variation on the results with the variation of learning constant and Lagrange multipliers. The same It is a malignancy of an immune system which can be devel- procedure is applicable for every datasets. oped in any part of body. It may appear as a solid tumor in the organs rich in lymphatic tissues and tends to encroach on lymph 4.2. Results of the NCCA system nodes, tonsil, spleen and bone marrow. For its insidious symptoms, it can be easily ignored. But if symptoms are noted in early stage 4.2.1. 1st Phase and taken treatments in time, the survival rate can highly be In this phase, the correlation is maximized among three sub- improved. Here, the lymphoma dataset contains 47 samples of divisions of data using Eqs. (2)–(4). Firstly, a feature with maximum 4026 genes. The samples are categorized into germinal center correlation is searched and it is deleted. Accordingly features with B-like lymphoma and activated B-like lymphoma. maximum correlation are deleted until the arrival of a expected number. In this way, we find our expected lightly correlated feature 3.4. LSVT Voice Rehabilitation subsets as 10, 20, 30, 50 and 100 features for every datasets. In a similar fashion, we find highly correlated feature subsets with same Voice rehabilitation dataset was taken from speech signals of size by discarding minimum correlated respective attributes. Parkinson's disease (PD) subjects. Lee Silverman Voice Treatment (LSVT) program was developed to allow PD subjects to investigate 4.2.2. 2nd Phase the potential of using sustained vowel /a/ phonations through a In this phase, the above two selected feature subsets are fed to rehabilitative treatment session. This dataset contained 126 sam- CCA network again by using Eqs. (2)–(4) for finding a feature subset ples having 309 features each. Each sample is to categorize into which carry both lightly and highly correlated features. These two two cohorts as acceptable and unacceptable. subsets are trained again with NCCA. This is because maximally correlated and minimally redundant features are then converged. 3.5. Global Cancer For a feature set of 10 highly correlated features as well as a feature set of 10 lightly correlated features, we search a final feature subset This dataset is collected from tumor gene expression signatures. with 10 features taking 50% of lightly and 50% of highly correlated Here, important features are sought from 144 samples of 16,063 feature subsets. In this way, other feature subsets are created. genes where samples were classified into 14 common tumor types. Fig. 
Fig. 3 shows the average correlation for different numbers of selected features for the lymphoma dataset. Though highly and lightly correlated features show almost opposite behavior, an informative and complete subset is created; for that reason, we apply the CCA network again on these two subsets. It is interesting to observe that the 30- and 50-feature subsets carry more information, because their average correlations are 0.6120 and 0.5249 respectively. On the other hand, for 10, 20 and 100 features the information content is smaller in the correlation sense explained earlier. This suggests that the optimal subset size may be around 30–50 features. Similar observations are found for the other problems. Informative features are definitely selected, as the correlation is not near either of the two extremities, unity and zero; the NCCA system is therefore maximizing uncertainty (information). Thus a good subset is generated which preserves the maximum correlation and minimum redundancy of the data.

Fig. 3. Average correlations of selected features for lymphoma data.

It is expected that there will be some common features in the final feature subsets due to the correlation maximization process. For the Global Cancer problem, attribute ID 7158 was commonly found in all subsets, while ID 19 and ID 11414 were found in more than two subsets. These features must have some biological background and they are useful biomarkers for cancer diagnosis. Similar outcomes were found for all other problems.
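Under our own reading, the average correlation reported in Fig. 3 can be checked as the mean absolute pairwise Pearson correlation over the selected columns; the small helper below (a hypothetical function name, not code from the paper) shows one way to compute it.

```python
import numpy as np

def average_correlation(X, selected):
    """Mean absolute pairwise Pearson correlation among the selected feature columns."""
    C = np.corrcoef(X[:, selected], rowvar=False)   # correlation matrix of the chosen columns
    upper = C[np.triu_indices_from(C, k=1)]         # off-diagonal (upper-triangle) entries only
    return float(np.mean(np.abs(upper)))
```

Values near 1 or near 0 would indicate a redundant or an uninformative subset respectively; the 0.6120 and 0.5249 reported above for the 30- and 50-feature lymphoma subsets sit in the informative middle range discussed in Section 2.1.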

Table 3. Comparison with information gain in terms of classification accuracy (%).

                                NN classifier                          SVM classifier
Dataset               Method    10     20     30     50     100       10     20     30     50     100
Colon cancer          NCCA      80.6   87.1   93.5   93.5   90.3      74.19  77.42  83.87  83.87  87.10
                      IG        88.7   88.7   90.3   91.9   88.7      80.65  80.65  80.65  83.87  83.87
Leukemia              NCCA      100    100    100    100    100       100    100    100    100    98.86
                      IG        98.6   98.6   97.2   98.6   98.6      94.29  94.29  94.29  94.29  97.14
Lymphoma              NCCA      100    100    100    100    100       100    100    100    100    97.83
                      IG        100    100    100    100    100       95.65  100    100    100    97.83
Global Cancer         NCCA      91.5   95.7   95.7   97.9   97.9      88.9   93.8   94.4   94.4   94.4
                      IG        89.4   91.5   93.6   95.7   95.7      87.5   88.9   90.3   91.0   93.1
Voice Rehabilitation  NCCA      100    100    100    100    100       100    100    100    100    100
                      IG        88.1   89.7   91.3   92.1   93.7      85.71  88.10  90.48  90.48  90.48

4.3. Evaluation of the NCCA system

We analyze the performance of the selected features using the NN classifier. The scaled conjugate gradient back-propagation algorithm is utilized for training the network. In order to assess the efficacy of the NCCA system, the classification accuracy is compared with that of the IG method. The comparative evaluation of the two methods is also analyzed on the basis of the MSE during training and testing. The MSE is the average squared difference between outputs and targets; the features obtained with the NCCA system show better performance, as their MSE is comparatively low. It is calculated with Eq. (5), where P, O, t and a indicate the number of patterns, the number of outputs, the target output and the actual output respectively:

$\mathrm{MSE} = \frac{1}{P\,O}\sum_{i=1}^{P}\sum_{j=1}^{O}\left(t_{i,j} - a_{i,j}\right)^{2}$    (5)

The validation samples are used for stopping the training automatically: training stops when generalization stops improving, as indicated by an increase in the MSE of the validation samples. The testing samples are independent of the other samples, so they give an unbiased measure. In this experiment, we use almost 65% of the samples for training, 10% for validation and the rest for testing. In all cases 5 hidden nodes are utilized.

4.4. Comparison between the NCCA system and IG

The features obtained with the NCCA system are applied to the NN and SVM classifiers to evaluate the accuracy. It is seen from Table 3 that the NCCA system shows better accuracy and lower MSE than IG. A slightly lower accuracy is found for the subsets of 10 and 20 features in the case of the colon cancer problem; this may be due to the presence of insufficient information in a relatively lower dimensional microarray dataset. However, NCCA achieves higher accuracy than IG for all other numbers of features. The Mean Square Error (MSE) of the NN classifier is computed for closer observation, as shown in Table 4. It is seen that the MSE of NCCA is almost 1000 times less than that of IG. For this reason, we may conclude that NCCA shows better accuracy as well as smaller MSE than IG.

4.5. Comparison with other recent methods

The NCCA system is also evaluated by measuring the classification accuracy with an SVM classifier [29]. There is a large number of FS approaches, and it is difficult, and sometimes impractical, to compare with all of them. Chandra and Gupta illustrated direct comparisons in [30] among ERGS, Relief-F, two schemes of the MRMR algorithm, namely MRMR-FDM (F-test Distance Multiplicative) and MRMR-FSQ (F-test Similarity Quotient), the t-statistic and the χ2-statistic, for subsets of 10, 20 and 100 selected features of the colon cancer, leukemia and lymphoma datasets. NCCA is compared with these methods as shown in Table 5. The classification accuracy is measured with a 5-fold cross validation technique.

It is observed from Table 5 that the CCA network achieves higher accuracy than the other feature selection approaches, except for 10 and 20 features of the colon cancer data. This is because some information may be lost due to the selection of a smaller number of features from a relatively smaller dimensional microarray dataset. The accuracy of the NCCA system is found to be 87.1%, whereas ERGS, Relief-F, MRMR-FDM, MRMR-FSQ, t-statistic and χ2-statistic show 83.87%, 75.81%, 67.74%, 67.74%, 80.65% and 79.03% respectively for 100 features of the colon cancer dataset, as shown in Table 5. NCCA exhibits better accuracy for the other numbers of selected features as well.

4.6. Feature subset validation using t-test

The t-test assesses whether the means of two groups are statistically different from each other; this analysis is appropriate for comparing the means of two groups. Here, two independent samples, the selected features from IG and from NCCA, are utilized for the t-test. These data are normally distributed. One-tailed T values and the corresponding P values are calculated from the two samples using the Student t-test calculator for two independent means [31], according to Eq. (6): when one sample of a 10-feature set is selected from NCCA, the corresponding sample of the same feature-set size is selected for IG to calculate the T value, and this procedure is applied for every case. Here, M̄1 and M̄2 indicate the mean values of a sample from the NCCA and IG based selected features, and N1, N2, d1, d2 denote the corresponding numbers of features and standard deviations respectively:

$t = \dfrac{\bar{M}_1 - \bar{M}_2}{\sqrt{\left(\dfrac{(N_1 - 1)d_1^2 + (N_2 - 1)d_2^2}{N_1 + N_2 - 2}\right)\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}$    (6)

A one-tailed t-test with a significance level of 0.05 is carried out in this work and the results are listed in Table 6; the confidence level was 90% for every test. It is seen that the two groups are statistically different for each feature set. From the T table, the critical value is 1.734 for a one-tailed test at the 0.05 significance level and 90% confidence level with 18 degrees of freedom, which is the degrees of freedom for a 10-feature set. It is observed from Table 6 that the computed T value of every dataset for the 10-feature set is higher than this table value, and indeed for every feature subset the computed T value is higher than the corresponding value from the T table. So we may say that the features selected by IG and by the CCA network are statistically different.
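As a concreteness aid, the following is a minimal, self-contained Python sketch of Eqs. (5) and (6); the function names (mse, pooled_t) and the use of plain lists are our own illustrative choices, not code from the paper.

```python
import math

def mse(targets, outputs):
    """Eq. (5): mean squared error over P patterns and O outputs."""
    P, O = len(targets), len(targets[0])
    return sum((targets[i][j] - outputs[i][j]) ** 2
               for i in range(P) for j in range(O)) / (P * O)

def pooled_t(sample1, sample2):
    """Eq. (6): two-sample t statistic with pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)   # sample variance d1^2
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)   # sample variance d2^2
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
```

For two samples of 10 values each, the degrees of freedom N1 + N2 − 2 equal 18, so the statistic is compared against the one-tailed critical value 1.734 quoted above and listed in Table 6.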

Table 4. Mean Square Error for all datasets with NN.

Dataset               Method    10        20        30        50        100
Colon cancer          NCCA      2.15E-1   2.38E-2   1.64E-2   9.44E-2   1.27E-2
                      IG        1.79E-1   9.68E-2   1.20E-1   1.16E-1   2.20E-1
Leukemia              NCCA      2.68E-7   6.98E-6   1.21E-6   2.44E-7   1.45E-6
                      IG        5.49E-2   5.29E-2   8.08E-2   5.06E-2   5.41E-2
Lymphoma              NCCA      1.47E-7   3.29E-7   4.20E-5   2.28E-6   3.04E-7
                      IG        3.53E-3   1.06E-3   3.16E-3   2.45E-3   2.42E-4
Global Cancer         NCCA      1.17E-2   1.30E-2   7.49E-2   8.1E-2    1.17E-2
                      IG        2.12E-1   1.29E-1   1.27E-1   1.19E-1   1.33E-1
Voice Rehabilitation  NCCA      1.79E-7   1.30E-7   1.74E-7   1.37E-7   1.47E-7
                      IG        1.31E-1   1.17E-1   9.39E-2   1.03E-1   1.83E-1

Table 5. Comparison with other recent methods in terms of classification accuracy (%) using the SVM classifier.

Dataset        Method         10      20      100
Colon cancer   NCCA           74.19   77.42   87.10
               ERGS           82.26   80.65   83.87
               Relief-F       69.35   75.81   75.81
               MRMR-FDM       66.13   70.97   67.74
               MRMR-FSQ       62.90   70.97   67.74
               t-statistic    79.03   77.42   80.65
               χ2-statistic   79.03   79.03   79.03
Leukemia       NCCA           100     100     98.86
               ERGS           93.06   97.22   98.61
               Relief-F       81.94   90.28   93.06
               MRMR-FDM       58.33   61.11   81.94
               MRMR-FSQ       48.61   59.72   80.56
               t-statistic    91.67   97.22   97.22
               χ2-statistic   91.67   95.83   97.22
Lymphoma       NCCA           100     100     97.83
               ERGS           92.71   93.75   95.83
               Relief-F       91.67   89.58   92.71
               MRMR-FDM       91.67   90.63   93.75
               MRMR-FSQ       82.29   89.58   94.79
               t-statistic    96.88   95.83   95.83
               χ2-statistic   96.88   95.83   95.83

Table 6. t values for different selected feature sets.

Feature #              10         20         30         50         100
tα                     1.734      1.684      1.671      1.660      1.646
Colon cancer           2.091154   2.141034   2.122539   2.180469   2.391252
Leukemia               1.764697   1.914172   2.021953   5.007544   7.106670
Lymphoma               2.261929   3.923671   2.824479   3.539583   2.522234
Global Cancer          2.336076   1.993103   2.409329   1.966882   1.994468
Voice Rehabilitation   5.154239   2.890115   3.954526   2.480576   2.964862

Fig. 4. Comparison between the NCCA system and IG in terms of execution time to select 20 features for all problems; VR, CC, Lym, Leu and GC indicate Voice Rehabilitation, colon cancer, lymphoma, leukemia and Global Cancer respectively.

5. Computational complexity

(i) IG: If the number of samples is denoted as S, then the cost of measuring IG is O(S × F), where F is the total number of features of a given dataset; a further process of cost O(F × a log(a)) is also involved. The total complexity of selecting features using IG therefore becomes O(S × F) + O(F × a log(a)).

(ii) NCCA: In this paper, the CCA network is used in two phases, where it takes advantage of different update rules for finding feature subsets; for that reason it requires less computational time than the statistical methods above. The CCA network is used twice for selecting feature subsets: firstly, the highly and lightly correlated subsets are created, and secondly, the final subset is generated from those two subsets. If the total number of features of a given dataset is F = m + n + k, then in the first phase the cost of measuring the correlation is O(m) + O(n) + O(k), where m, n and k are the numbers of features in the three subdivisions. In addition, O(m log(m)) + O(n log(n)) + O(k log(k)) is required for the pruning process, in which m, n and k are reduced by one in each course of pruning; in this case O(m) + O(n) + O(k) ≫ O(m log(m)) + O(n log(n)) + O(k log(k)). In the second phase, the correlation is maximized between the lightly and highly correlated subsets; here the number of features is p ≪ (m + n + k), the computational cost is 2 × O(p), and a further 2 × O(p log(p)) is needed for pruning. So the total computational complexity of NCCA is found to be O(m) + O(n) + O(k) + 2 × O(p) ≈ O(m) + O(n) + O(k), which is less than that of IG.
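As a rough back-of-the-envelope illustration (our own, ignoring constant factors and using only quantities reported in Section 4.1 for the colon cancer data: S = 62, F ≈ 2002, m = n = 667, k = 668, p = 100):

$\mathrm{IG}: \; S \times F = 62 \times 2002 \approx 1.2 \times 10^{5}$, before the $O(F \times a \log a)$ term is even added;

$\mathrm{NCCA}: \; m + n + k + 2p = 667 + 667 + 668 + 200 \approx 2.2 \times 10^{3}$.

This order-of-magnitude gap is consistent with the execution-time differences reported in Fig. 4.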

Global Cancer has 16,063 attributes, whereas the leukemia, lymphoma, colon cancer and Voice Rehabilitation datasets have 7129, 4026, 2000 and 309 attributes respectively. Accordingly, the highest execution time is required to select features from the Global Cancer data: about 3480 s for IG compared with only about 183 s for NCCA, as shown in Fig. 4. About 266 s is spent by IG and 16.50 s by NCCA for the leukemia data. The lymphoma, colon cancer and Voice Rehabilitation datasets require about 48, 8 and 0.37 s with IG, while NCCA needs about 2.5, 1.25 and 0.20 s respectively. The selected feature subsets have different numbers of attributes, from 10 to 100; since each subset is selected from the same number of original attributes, the elapsed time is almost the same for a specific dataset. For this reason, we can say that NCCA is computationally much less expensive than IG.

6. Conclusion

We propose a special embedded FS technique on a framework in which CCA network training and optimization are performed simultaneously. Two subsets, highly and lightly correlated, are separated from several feature groups by correlation maximization and subsequent pruning, and the final subset is created from the two selected subsets using the same strategy. The final features are applied to two standard classifier models, NN and SVM, to evaluate the performance of NCCA reliably. It is observed that NCCA achieves better accuracy than the IG, MRMR, ERGS, Relief-F, t-statistic and χ2-statistic methods. The feature subset selected by the NCCA system is informative, as explained in Section 4.2. NCCA is also computationally much less expensive than IG, as explained in Section 5; the proposed system is about 2–19 times faster than IG. The NCCA system exhibits distinguishable and robust performance, as justified with different classifiers, and a cross-validation technique was applied for unbiased classification results. The features obtained with the proposed system are statistically different from those of IG, as validated with the t-test results. Therefore NCCA is quite capable of selecting economical yet informative features with respect to other state-of-the-art methods, and it can be applied to other problem domains.

Acknowledgments

We thank the anonymous reviewers for their constructive comments, which helped much to improve the manuscript.

References

[1] C. Hsu, H. Huang, D. Schuschel, The ANNIGMA-wrapper approach to fast feature selection for neural nets, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 32 (2002) 207–212.
[2] Md. Monirul Kabir, Md. Monirul Islam, Kazuyuki Murase, A new wrapper feature selection approach using neural network, Neurocomputing 73 (2010) 3273–3283.
[3] J.-B. Yang, C.-J. Ong, An effective feature selection method via mutual information estimation, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 42 (2012) 1550–1559.
[4] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[5] J. Huang, Y. Cai, X. Xu, A hybrid genetic algorithm for feature selection wrapper based on mutual information, Pattern Recognit. Lett. 28 (2007) 1825–1844.
[6] Md. Monirul Kabir, Md. Shahjahan, Kazuyuki Murase, A new hybrid ant colony optimization algorithm for feature selection, Expert Syst. Appl. 39 (2012) 3747–3763.
[7] Judice Lie Yong Koh, Correlation-Based Methods for Biological Data Cleaning, Ph.D. Thesis, Department of Computer Science, School of Computing, National University of Singapore, 2003.
[8] J. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106.
[9] T.M. Mitchell, Machine Learning, McGraw-Hill, USA, 1997.
[10] Z. Zhu, Y.S. Ong, M. Dash, Markov blanket-embedded genetic algorithm for gene selection, Pattern Recognit. 49 (2007) 3236–3248.
[11] B.A. Draper, K. Baek, M.S. Bartlett, J.R. Beveridge, Recognizing faces with PCA and ICA, Comput. Vision Image Underst. 91 (2003) 115–137.
[12] B. Frénay, G. Doquire, M. Verleysen, Estimating mutual information for feature selection in the presence of label noise, Comput. Stat. Data Anal. 71 (2014) 832–848.
[13] B. Frénay, G. Doquire, M. Verleysen, Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification, Neurocomputing 112 (2013) 64–78.
[14] B. Draper, C. Kaito, J. Bins, Iterative Relief, in: Workshop on Learning in Computer Vision and Pattern Recognition, Madison, 2003.
[15] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, J. Bioinform. Comput. Biol. 3 (2005) 185–205.
[16] H. Peng, et al., Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1226–1238.
[17] Y. Zhang, et al., Gene selection algorithm by combining reliefF and mRMR, BMC Genom. 9 (Suppl. 2) (2007) S27.
[18] P.L. Lai, C. Fyfe, A neural implementation of canonical correlation analysis, Neural Netw. 12 (1999) 1391–1397.
[19] Wolfgang Härdle, Léopold Simar, Canonical correlation analysis, in: Applied Multivariate Statistical Analysis, 2007, pp. 321–330.
[20] Weifeng Liu, Dacheng Tao, Multiview Hessian regularization for image annotation, IEEE Trans. Image Process. 22 (7) (2013) 2676–2687.
[21] Mourad Elouni, Albert Y. Zomaya, Biological Knowledge Discovery Handbook, 2013, p. 387.
[22] Mischa Schwartz, Information Transmission, Modulation and Noise, McGraw-Hill, Inc., USA, 1980.
[23] L. Wentian, Mutual information functions versus correlation functions, J. Stat. Phys. 60 (5) (1990) 823–837.
[24] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. USA 96 (12) (1999) 6745–6750.
[25] A. Alizadeh, et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403 (2000) 503–511.
[26] T. Golub, et al., Molecular classification of cancer: class discovery and class prediction by gene expression, Science 286 (1999) 531–537.
[27] A. Tsanas, M.A. Little, C. Fox, L.O. Ramig, Objective automatic assessment of rehabilitative speech treatment in Parkinson's disease, IEEE Trans. Neural Syst. Rehabil. Eng. 22 (2014) 181–190.
[28] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, T.R. Golub, Multiclass cancer diagnosis using tumor gene expression signatures, Proc. Natl. Acad. Sci. USA 98 (26) (2001) 15149–15154.
[29] Steve Gunn, Support Vector Machines for Classification and Regression, ISIS Technical Report ISIS-1-98, Image Speech & Intelligent Systems Research Group, University of Southampton, 1998.
[30] B. Chandra, Manish Gupta, An efficient statistical feature selection approach for classification of gene expression data, J. Biomed. Inform. 44 (2011) 529–539.
[31] Student t-test calculator for two independent means, 〈http://www.socscistatistics.com/tests/studentttest/Default2.aspx〉.

Md Zakir Hossain obtained his B.E. in 2011 and M.E. in 2014 from the Department of Electrical and Electronic Engineering (EEE), Khulna University of Engineering and Technology (KUET), Khulna, Bangladesh, and started his PhD at the Australian National University, Australia, in March 2015; he is currently on leave from KUET. He served as a lecturer in the Department of EEE, University of Information Technology & Sciences, from January 2012 and in the Department of EEE, KUET, from June 2012, and has been an Assistant Professor since July 2014. His research interests include information retrieval and biomedical signal processing. He has published a number of international conference and journal papers.

Md. Monirul Kabir received the B.E. degree in Electrical and Electronic Engineering from Bangladesh Institute of Technology (BIT), Khulna, now Khulna University of Engineering and Technology (KUET), Bangladesh, in 1999. He received a master of engineering degree from the Department of Human and Artificial Intelligent Systems, University of Fukui, Japan, in 2008, and a doctor of engineering degree in System Design Engineering from the University of Fukui in March 2011. He was an assistant programmer from 2002 to 2005 at Dhaka University of Engineering and Technology (DUET), Bangladesh, where he is now an Associate Professor in the EEE department. His research interests include data mining, artificial neural networks, evolutionary approaches, swarm intelligence, and mobile ad hoc networks.

Md. Shahjahan is a full Professor in the Department of Electrical and Electronic Engineering, Khulna University of Engineering and Technology (KUET), Khulna, Bangladesh. He received his B.E. from Bangladesh Institute of Technology (BIT) in January 1996, his M.E. in Information Science from the University of Fukui, Japan, in 2003, and his D.E. from the Department of System Design Engineering, University of Fukui, in 2006. He joined the Department of Electrical and Electronic Engineering, KUET, as a Lecturer in September 1996, became an assistant professor in 2006 and a professor in June 2012. He received the best student award from IEICE, Hokuriku part, Japan, in 2003. His research includes machine learning, data mining and biomedical signal processing. He is a member of the Institute of Engineers Bangladesh (IEB). He has published more than 80 international conference and reputed journal papers.