Feature Selection with Decision Tree Criterion

Krzysztof Grąbczewski and Norbert Jankowski
Department of Computer Methods, Nicolaus Copernicus University, Toruń, Poland
kgrabcze,[email protected]

Abstract

Classification techniques applicable to real life data are more and more often complex hybrid systems comprising feature selection. To augment their efficiency we propose two feature selection algorithms, which take advantage of a decision tree criterion. Large computational experiments have been performed to test the capabilities of the two methods and to compare their results with other similar techniques.

1. Introduction

Effective and versatile classification cannot be achieved by single classification algorithms; it requires complex hybrid models comprising a feature selection stage. Similarly to other data analysis tasks, in feature selection it is also very beneficial to take advantage of different kinds of algorithms.

One of the great features of decision tree algorithms is that they inherently estimate the suitability of features for the separation of objects representing different classes. This facility can be directly exploited for the purpose of feature selection.

After presenting the Separability of Split Value criterion, we derive two algorithms for ranking features. The methods are applied to a number of publicly available datasets and their results are compared to those obtained with some of the most commonly used feature selection techniques. At the end we present a number of conclusions and future perspectives.

2. SSV based algorithms

One of the most efficient heuristics used for decision tree construction is the Separability of Split Value (SSV) criterion [3]. Its basic advantage is that it can be applied to both continuous and discrete features in such a manner that the estimates of separability can be compared regardless of the substantial difference in their types.

The split value is defined differently for continuous and symbolic features. For continuous features it is a real number and for symbolic ones it is a subset of the set of alternative values of the feature. The left side (LS) and right side (RS) of a split value s of feature f for a given dataset D are defined as

    LS_{s,f}(D) = \{x \in D : f(x) < s\}   if f is continuous,
    LS_{s,f}(D) = \{x \in D : f(x) \in s\} if f is symbolic,
    RS_{s,f}(D) = D - LS_{s,f}(D),

where f(x) is the value of feature f for the data vector x. The definition of the separability of split value s is

    SSV(s, f, D) = 2 \cdot \sum_{c \in C} |LS_{s,f}(D_c)| \cdot |RS_{s,f}(D - D_c)| - \sum_{c \in C} \min(|LS_{s,f}(D_c)|, |RS_{s,f}(D_c)|),

where C is the set of classes and D_c is the set of data vectors from D assigned to class c \in C.

Originally the criterion was used to construct decision trees with recursive splits. In such an approach, the current split is selected to correspond to the best split value among all possible splits of all the features (to avoid combinatorial explosion in the case of discrete features with numerous symbols, the analysis is confined to a relatively small number of subsets). The subsequent splits are performed as long as any of the tree nodes can be sensibly split. Then the trees are pruned to provide possibly the highest generalization (an inner cross validation is used to estimate the best pruning parameters).
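To make the criterion concrete, the following minimal Python sketch computes the SSV value of a single candidate split directly from the class counts in the formula above. It is our illustration, not code from the paper, and all names are assumptions.

    from collections import Counter

    def ssv(left_mask, labels):
        """SSV(s, f, D) computed from a boolean mask: left_mask[i] is True
        when vector i falls into LS_{s,f}(D), False when it falls into RS_{s,f}(D)."""
        left = Counter(l for l, m in zip(labels, left_mask) if m)
        right = Counter(l for l, m in zip(labels, left_mask) if not m)
        classes = set(labels)
        n_right = sum(right.values())
        # pairs of vectors from different classes that the split separates
        separated = sum(left[c] * (n_right - right[c]) for c in classes)
        # for each class, the smaller of its two side counts (penalizes splitting a class)
        not_separated = sum(min(left[c], right[c]) for c in classes)
        return 2 * separated - not_separated

    def ssv_continuous(values, labels, threshold):
        """SSV of the split value `threshold` for a continuous feature."""
        return ssv([v < threshold for v in values], labels)

    # A perfectly separating split of a small two-class sample gets the highest score:
    # ssv_continuous([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"], 2.5) == 8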
2.1. SSV criterion for feature selection

The SSV criterion has been successfully used not only for building classification trees. It proved efficient in data type conversion (from continuous to discrete [2] and in the opposite direction [4]) and as the discretization part of feature selection methods, which finally rank the features according to indices such as those of [1].

Decision tree algorithms are known for their ability to detect the features that are important for classification. Feature selection is inherent there, so it is not so necessary at the data preparation phase. Inversely: the trees' capabilities can be used for feature selection.

Feature selection based on the SSV criterion can be designed in different ways. From the computational point of view, the most efficient one is to create a feature ranking on the basis of the maximum SSV criterion values calculated for each of the features, for the whole training dataset. The cost is the same as when creating decision stubs (single-split decision trees). However, when the classification task is a multiclass one, separabilities of single splits cannot reflect the actual virtue of the features. Sometimes two or three splits may be necessary to prove that the feature can accurately separate different classes. Thus it is reasonable to reward such features, but some penalty must be introduced when multiple splits are analyzed. These ideas brought the following algorithm:

Algorithm 1 (Separate feature analysis)

- Input: Training data (a sample X of input patterns and their labels Y).
- Output: Features in decreasing order of importance.

1. For each feature f:
   (a) T ← the SSV decision tree built for one-dimensional data (feature f only).
   (b) n ← the number of leaves in T.
   (c) For i = n - 1 downto 1:
       i. s_i ← SSV(T) / log(2 + i), where SSV(T) is the sum of the SSV values for all the splits in T.
       ii. Prune T by deleting a node N of minimum error reduction.
   (d) Define the rank R(f) of feature f as the maximum of the s_i values.
2. Return the list of features in decreasing order of R(f).

Another way of using SSV for feature ranking is to create a single decision tree and read the feature importance from it. The filter we have used here is Algorithm 2.

Algorithm 2 (Feature selection from single SSV tree)

- Input: Training data X, Y.
- Output: Features in decreasing order of importance.

1. T ← the SSV decision tree built for X, Y.
2. For each non-leaf node N of T, G(N) ← the classification error reduction of node N.
3. F ← the set of all the features of the input space.
4. i ← 0.
5. While F ≠ ∅ do:
   (a) For each feature f ∈ F not used by T define its rank R(f) ← i. Remove these features from F.
   (b) Prune T by deleting all the final splits of nodes N for which G(N) is minimal.
   (c) Prune T by deleting all the final splits of nodes N for which G(N) = 0.
   (d) i ← i + 1.
6. Return the list of features in decreasing order of R(f).

This implements a "full-featured" filter: the decision tree building algorithm selects the splits locally, i.e. with respect to the splits selected in earlier stages, so that the features occurring in the tree are complementary. It is important to see that despite this, the computational complexity of the algorithm is very attractive (contrary to wrapper methods of feature selection).

The ranking generated by Algorithm 2 can be used for feature selection either by dropping all the features of rank 0 or by picking a given number of top-ranked features. In some cases the full classification trees use only a small part of the features. This does not allow selecting an arbitrary number of features: the maximum is the number of features used by the tree, because the algorithm gives no information about the ranking of the rest of the features.
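The ranking loop of Algorithm 2 can be sketched as follows, assuming a toy tree representation (the Node class and all names are ours, not the original implementation; we read step (b) as taking the minimum of G over the current final splits).

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        """Node of an already built SSV tree; feature is None for leaves."""
        feature: Optional[int] = None
        children: List["Node"] = field(default_factory=list)
        g: float = 0.0            # G(N), the classification error reduction of the node

    def internal_nodes(root: Node) -> List[Node]:
        stack, out = [root], []
        while stack:
            n = stack.pop()
            if n.feature is not None:
                out.append(n)
                stack.extend(n.children)
        return out

    def final_splits(root: Node) -> List[Node]:
        # internal nodes whose children are all leaves
        return [n for n in internal_nodes(root) if all(c.feature is None for c in n.children)]

    def ssv2_ranking(root: Node, n_features: int) -> List[int]:
        """Features in decreasing order of importance, following steps 3-6 of Algorithm 2."""
        rank = [0] * n_features
        remaining = set(range(n_features))
        i = 0
        while remaining:
            used = {n.feature for n in internal_nodes(root)}
            for f in list(remaining - used):          # step (a)
                rank[f] = i
                remaining.discard(f)
            finals = final_splits(root)
            if finals:                                # step (b): delete final splits with minimal G(N)
                g_min = min(n.g for n in finals)
                for n in finals:
                    if n.g == g_min:
                        n.feature, n.children = None, []
            for n in final_splits(root):              # step (c): delete final splits with G(N) = 0
                if n.g == 0:
                    n.feature, n.children = None, []
            i += 1                                    # step (d)
        return sorted(range(n_features), key=lambda f: -rank[f])

Features that never occur in the tree keep rank 0, which matches the remark above that the ranking is only informative for the features actually used by the tree.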

3. Algorithms used for the comparison

Numerous entropy based methods and many other feature relevance indices are designed to deal with discrete data. Their results strongly depend on the discretization method applied before the calculation of feature ranks [1]. This makes reliable comparisons of ranking algorithms very difficult, so here we compare the results obtained with methods which do not suffer from such limitations. We have chosen two simple but very successful methods based on the Pearson's correlation coefficient and a Fisher-like criterion. Another two methods being the subject of the comparisons belong to the Relief family.

The Pearson's correlation coefficient detects linear correlation. Thus, used for feature selection, it may provide poor results in the case of nonlinear dependencies. Nevertheless, in many applications it seems optimal because of very good results and computational simplicity.

The Fisher-like criterion is the F-score defined as

    \frac{m_0 - m_1}{s_0 + s_1},    (1)

where m_i is the mean value of the feature calculated for the elements of the i-th class and s_i is the corresponding standard deviation. Such a criterion can be used only in the case of binary classification. For multiclass data our estimate of feature relevance was the maximum of the "one class vs the rest" ranks.
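For illustration, a minimal sketch of these two indices (our code, not from the paper): class labels are assumed to be coded numerically for the correlation coefficient, the F-score takes the absolute value of the numerator so it can serve directly as a non-negative relevance score, and features are assumed not to be constant within the compared groups.

    from statistics import mean, pstdev

    def pearson_cc(values, targets):
        """Pearson's correlation coefficient between one feature and numerically coded labels."""
        mx, mt = mean(values), mean(targets)
        cov = sum((x - mx) * (t - mt) for x, t in zip(values, targets))
        sx = sum((x - mx) ** 2 for x in values) ** 0.5
        st = sum((t - mt) ** 2 for t in targets) ** 0.5
        return cov / (sx * st)

    def f_score(values, labels, positive):
        """Formula (1) for one 'positive class vs the rest' split of the data."""
        pos = [v for v, l in zip(values, labels) if l == positive]
        rest = [v for v, l in zip(values, labels) if l != positive]
        return abs(mean(pos) - mean(rest)) / (pstdev(pos) + pstdev(rest))

    def f_score_multiclass(values, labels):
        """Multiclass relevance: maximum of the one-class-vs-the-rest F-scores."""
        return max(f_score(values, labels, c) for c in set(labels))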

The two representatives of the Relief family of algorithms are the basic Relief method [6] and its extension by Kononenko known as ReliefF [7]. The basic algorithm randomly selects m vectors and for each of them (vector V) finds the nearest hit H and the nearest miss M (neighbors belonging to the same and to a different class, respectively). Then the weight of each feature f is changed according to

    W_f \leftarrow W_f - \frac{1}{m} \mathrm{diff}_f(V, H) + \frac{1}{m} \mathrm{diff}_f(V, M),    (2)

where diff_f, for a numerical feature f, is the normalized difference between the values of f, and for a nominal f it is the truth value of the equality of the given values.

The ReliefF algorithm differs from the basic version in that it takes the k nearest hits and k nearest misses of each class in a generalization of formula (2). All our calculations used k = 10.

The m parameter was selected to be less than or equal to 500. For smaller datasets we used each training vector once instead of m randomly selected vectors.
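A minimal sketch of the basic Relief update of formula (2), under a few assumptions of ours: numeric features only, Euclidean distance for finding the nearest hit and miss, diff normalized by the feature's value range, and every class represented by at least two vectors. All names are hypothetical.

    import random

    def diff(a, b, value_range):
        """diff_f for a numerical feature: normalized difference of the two values."""
        return abs(a - b) / value_range if value_range else 0.0

    def relief_weights(X, y, m, ranges, rng=random.Random(0)):
        """Basic Relief weights, formula (2); X is a list of numeric feature vectors."""
        n_feat = len(X[0])
        w = [0.0] * n_feat

        def dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

        for _ in range(m):
            i = rng.randrange(len(X))
            V, c = X[i], y[i]
            # nearest hit (same class) and nearest miss (different class)
            hit = min((j for j in range(len(X)) if j != i and y[j] == c),
                      key=lambda j: dist(X[j], V))
            miss = min((j for j in range(len(X)) if y[j] != c),
                       key=lambda j: dist(X[j], V))
            H, M = X[hit], X[miss]
            for f in range(n_feat):
                w[f] += (-diff(V[f], H[f], ranges[f]) + diff(V[f], M[f], ranges[f])) / m
        return w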

Most of the calculation time was spent on running the Relief methods, which are the most complex of the six methods being analyzed. All the complexities are given in Table 1.

    Method    Complexity
    CC        fn
    F-score   fn
    Relief    m(n + f)
    ReliefF   m(n log k + fk)
    SSV1      fn log n
    SSV2      fn log n

    Table 1. Feature selection methods complexities (n - vectors count, f - features count)

4. Testing methodology

The algorithms resulting in a feature ranking can be compared only in the context of particular classifiers. Different kinds of classification learning methods may broaden the scope of the analysis, so we have applied the Naive Bayesian Classifier, the k Nearest Neighbors (kNN) method with Euclidean and Manhattan distance metrics, and Support Vector Machines (SVM) with Gaussian and linear kernels (all our SVMs were trained with parameters C=1 and bias=0.1).

All datasets were standardized before the tests. It must be pointed out, however, that none of the feature ranking methods tested here is sensitive to data scaling, so the standardization affects only the kNN and SVM classification methods. Moreover, Naive Bayes and the SSV based selections directly deal with symbolic data, so in such cases the standardization does not concern them at all.

Drawing reliable conclusions requires an appropriate testing strategy and an appropriate analysis of the results. The major group of our tests were 10-fold cross-validation (CV) tests repeated 10 times to obtain a good approximation of the average result. In each fold of the CV tests we ran six feature ranking algorithms: CC based ranking, F-score based ranking, basic Relief, ReliefF, and our two methods presented as Algorithms 1 (SSV1) and 2 (SSV2). The six rankings were analyzed to determine all the feature subsets generated by selecting a number of top-ranked features. Then, for each such subset, the classification algorithm was applied and the results collected (these are the results averaged over the 10 repetitions of 10-fold CV). Such a strategy may require an enormous amount of calculations. For example, when we tested SVM models on features selected from a 60-feature dataset, we needed to solve about 320 classification tasks at each CV pass (there are 6*60 = 360 possible selections, but some feature subsets are always repeated). If we deal with a 3-class task we need to build 3 SVM models for each CV pass ("one class vs the rest" technique), which implies 3*320 = 960 SVM runs. We repeat the 10-fold CV 10 times, so finally we need to build almost 100 000 SVM models; such calculations took more than 100 hours of (one half of) a 3.6 GHz Pentium 4 HT.

To determine the statistical significance of the differences in the results we used the t-test (or paired t-test) methodology with p = 0.05.
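The protocol can be summarized by the schematic sketch below. It is not the original software: rankers, make_classifier and folds are hypothetical stand-ins, numpy-style arrays and scikit-learn-style classifiers are assumed, and the significance check at the end uses a paired t-test as described above.

    from statistics import mean
    from scipy import stats   # paired t-test for the p = 0.05 significance checks

    def evaluate(X, y, rankers, make_classifier, folds):
        """rankers: dict name -> function(X, y) returning features ordered by importance;
        folds: (train_idx, test_idx) pairs of the 10 x 10-fold CV."""
        accs = {}                                    # (ranker name, k) -> list of accuracies
        for train_idx, test_idx in folds:
            X_tr, y_tr = X[train_idx], y[train_idx]
            X_te, y_te = X[test_idx], y[test_idx]
            for name, rank in rankers.items():
                order = rank(X_tr, y_tr)             # ranking computed inside the fold
                for k in range(1, len(order) + 1):
                    sel = order[:k]                  # top-k feature subset
                    clf = make_classifier().fit(X_tr[:, sel], y_tr)
                    acc = (clf.predict(X_te[:, sel]) == y_te).mean()
                    accs.setdefault((name, k), []).append(acc)
        return {key: mean(a) for key, a in accs.items()}, accs

    # Whether configuration a differs significantly from configuration b (p = 0.05):
    # stats.ttest_rel(accs[a], accs[b]).pvalue < 0.05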

5. Results

To make a thorough analysis possible, feature selection methods must be applied to datasets defined in a space of not too large dimensionality, but on the other hand too small a dimensionality would make feature selection tests baseless. Thus we decided to use the higher-dimensional classification tasks available in the UCI repository [8]. Table 2 presents the datasets we have used for the tests; because of space limits we discuss just a representative part of the results. The figures for all the datasets can be seen at http://www.phys.uni.torun.pl/~kgrabcze/papers/05-sel-figures.pdf.

    dataset              features  vectors  classes
    chess                36        3196     2
    glass                9         214      6
    image segmentation   19        2310     7
    ionosphere           34        351      2
    mushroom             22        8124     2
    promoters            57        106      2
    satellite image      36        6435     6
    splice               60        3190     3
    thyroid              21        7200     3

    Table 2. Datasets summary

The datasets distributed in separate training and test files had been merged and the CV tests performed for the whole data.

The figures present average classification accuracies (10 repetitions of 10-fold CV) obtained with different classifiers trained on data after a given number of top-ranked features were selected (the rankings were generated for each of the CV folds independently).

Ionosphere data (Figure 1). The results obtained with the Naive Bayesian classifier for the ionosphere data are unique. The SSV2 algorithm, selecting all the features used by SSV decision trees, leads to much higher accuracy (89.87%) than obtainable with the other methods.

Please notice that the SSV2 algorithm can sensibly rank only the features which occur in the tree, thus we cannot reliably select more features than there are in the tree; the figures reflect this fact by constant values of the SSV2 series for numbers of features exceeding the sensible maximum.

5NN with the Euclidean metric gives the maximum accuracy (89.95%) when 4 features are selected with CC or F-Score, but a very high result (88.32%) can be obtained for just 2 features selected with SSV2 (the accuracy is significantly better than for the whole dataset: 84.65%). 5NN with the Manhattan distance yields the maximum 91.11% accuracy for 10 features selected with SSV2.

It is interesting that the SSV1 selection with the SVM classifier works poorly for small numbers of features, but is one of the fastest to reach the accuracy level of the SVM trained on the whole data. Although feature selection cannot significantly improve the accuracy here, it may be very precious because of the dimensionality reduction it offers.

Figure 1. Results for the ionosphere data. (Panels: Naive Bayes, 5NN (Manhattan), SVM with Gaussian kernels; accuracy vs. number of selected features; series: CC, F-Score, Relief, ReliefF, SSV1, SSV2.)

Promoters data (Figure 2). This is an example of a dataset for which almost each selection method improves the accuracy of each classifier and the improvement is very significant. The exceptions are the combinations of the Naive Bayes classifier with the Relief and SSV2 feature selection methods. Please notice that the maximum accuracy is usually obtained for a number of features between 3 and 5, so the selection is crucial here.

Figure 2. Results for the promoters data. (Panels: Naive Bayes, 5NN (Manhattan), SVM with Gaussian kernels; series: CC, F-Score, Relief, ReliefF, SSV1, SSV2.)

Splice data (Figure 3). The conclusions for this dataset are very similar to those for the promoters data. Feature selection provides a very significant improvement of the 5NN and SVM classifiers. For Naive Bayes we get a slight but also statistically significant improvement: the best results are within the range of 30-40 features and are obtained with ReliefF (96.2% accuracy with a standard deviation of 0.14%) and SSV1 (96.11% ± 0.14%), while with no selection we achieve 95.66% ± 0.08%. The results show less variance than in the case of promoters (this is due to the larger training data samples), hence even small differences can be statistically significant here.

Figure 3. Results for the splice data. (Panels: Naive Bayes, 5NN (Euclidean), SVM with Gaussian kernels; series: CC, F-Score, Relief, ReliefF, SSV1, SSV2.)

Other data sets (Figure 4). Image segmentation and thyroid data are examples of classification tasks where decision trees show the best adaptation abilities. It is consistent with the fact that for these datasets the SSV2 selector is most useful also for other classification algorithms.

5NN applied to image segmentation achieves its best result (96.43% ± 0.22%) after selecting 11 features with SSV2, which is a significantly better result than with any other feature selection method we have tested. 3 features are enough to get 96.28% ± 0.13%.

The plot showing the results of 5NN with the Euclidean measure applied to the thyroid data clearly shows the advantage of the SSV2 selector (its ability to detect dependencies between features): selecting 3 features yields a high accuracy of very high stability (98.54% ± 0.04%), specific for decision trees in the case of this dataset.

The remaining two plots of Figure 4 demonstrate some interesting negative results. For the satellite image data the CC based selection loses all the comparisons, though it proves very valuable in many other tests. It is also an exceptional dataset, where ReliefF performs worse than Relief. For the chess data, the SSV1 selection method is not in the same league with the others. Also the F-Score and CC methods cannot improve the accuracy of SVM, while SSV2, Relief and ReliefF offer significant improvements.

Figure 4. Other interesting results. (Panels: image segmentation data - 5NN (Euclidean), thyroid data - 5NN (Euclidean), satellite image data - 5NN (Manhattan), chess data - SVM with Gaussian kernels; series: CC, F-Score, Relief, ReliefF, SSV1, SSV2.)

6. Conclusions

We have presented two feature selection methods based on the SSV criterion and experimentally confirmed that they are a very good alternative to the most popular methods. On our way we have shown several examples of spectacular improvement obtained by means of feature selection.

Feature selection algorithms cannot be estimated without the context of the final models. We must realize that different classifiers may require different feature selectors to do their best; however, in general we observe that for a particular dataset, feature selectors performing well (or badly) for some classifier perform well (or badly, respectively) with other classifiers. In accordance with the no-free-lunch theorem, there is no single feature selection method performing very well for every dataset. Thus effective feature selection requires a validation stage to confirm a method's usability in a given context. Such validation can be expensive, so we must be careful. To reduce the complexity of the search for optimal feature sets we may find some regularities and use them as heuristics (for example, we see that SSV2 is usually a very good feature selector for the datasets where decision tree algorithms provide high accuracies).

Further research. The comparison presented here treats each of the feature selection methods separately. It is a natural step forward to think about efficient algorithms to combine the results of different feature selection methods. Some simple methods have already been used [5], but a deeper meta-level analysis should bring many interesting conclusions and more versatile feature selection methods.

Acknowledgements. The research is supported by the Polish Ministry of Science (a grant for years 2005-2007).

References

[1] W. Duch, J. Biesiada, T. Winiarski, K. Grudziński, and K. Grąbczewski. Feature selection based on information theory filters and feature elimination wrapper methods. In Proceedings of the Int. Conf. on Neural Networks and Soft Computing (ICNNSC 2002), Advances in Soft Computing, pages 173-176, Zakopane, 2002. Physica-Verlag (Springer).
[2] K. Grąbczewski. SSV criterion based discretization for Naive Bayes classifiers. In Proceedings of the 7th International Conference on Artificial Intelligence and Soft Computing, Zakopane, Poland, June 2004.
[3] K. Grąbczewski and W. Duch. A general purpose separability criterion for classification systems. In Proceedings of the 4th Conference on Neural Networks and Their Applications, pages 203-208, Zakopane, Poland, June 1999.
[4] K. Grąbczewski and N. Jankowski. Transformations of symbolic data for continuous data oriented models. In Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003, pages 359-366. Springer, 2003.
[5] K. Grąbczewski and N. Jankowski. Mining for complex models comprising feature selection and classification. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction, Foundations and Applications. Springer, 2005.
[6] K. Kira and L. A. Rendell. A practical approach to feature selection. In ML92: Proceedings of the Ninth International Workshop on Machine Learning, pages 249-256, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
[7] I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning, pages 171-182, 1994.
[8] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.