Feature Selection for Regression Problems
M. Karagiannopoulos, D. Anyfantis, S. B. Kotsiantis, P. E. Pintelas
Educational Software Development Laboratory, Department of Mathematics, University of Patras, Greece, {mariosk, dany, sotos, pintelas}@math.upatras.gr

Abstract-- Feature subset selection is the process of identifying and removing from a training data set as many irrelevant and redundant features as possible. This reduces the dimensionality of the data and may enable regression algorithms to operate faster and more effectively. In some cases the correlation coefficient can be improved; in others, the result is a more compact, easily interpreted representation of the target concept. This paper compares five well-known wrapper feature selection methods. Experimental results are reported using four well-known representative regression algorithms.

Index terms: supervised machine learning, feature selection, regression models

I. INTRODUCTION

In this paper we consider the following regression setting. Data is generated from an unknown distribution P on some domain X and labeled according to an unknown function g. A learning algorithm receives a sample S = {(x1, g(x1)), ..., (xm, g(xm))} and attempts to return a function f close to g on the domain X. Many regression problems involve an investigation of relationships between attributes in heterogeneous databases, where different prediction models can be more appropriate for different regions.

Many factors affect the success of machine learning on a given task. The representation and quality of the instance data is first and foremost [13]. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. In real-world data, the representation often uses too many features, but only a few of them may be related to the target concept. There may be redundancy, where certain features are correlated and it is therefore not necessary to include all of them in modelling; and interdependence, where two or more features together convey important information that is obscured if any of them is considered on its own.

Generally, features are characterized [2] as:
1. Relevant: features which have an influence on the output and whose role cannot be assumed by the rest.
2. Irrelevant: features that do not have any influence on the output, and whose values are generated at random for each example.
3. Redundant: a redundancy exists whenever a feature can take the role of another (perhaps the simplest way to model redundancy).

Feature selection algorithms in general have two components: a selection algorithm that generates proposed subsets of features and attempts to find an optimal subset, and an evaluation algorithm that determines how "good" a proposed feature subset is, returning some measure of goodness to the selection algorithm. However, without a suitable stopping criterion the feature selection process may run exhaustively, or forever, through the space of subsets. Stopping criteria can be: (i) whether the addition (or deletion) of any feature fails to produce a better subset; and (ii) whether an optimal subset according to some evaluation function has been obtained.

Ideally, feature selection methods search through the subsets of features and try to find the best one among all the competing candidate subsets according to some evaluation function. However, this procedure is exhaustive, as it tries to find only the best subset; it may be too costly and practically prohibitive, even for a medium-sized feature set. Other methods, based on heuristic or random search, attempt to reduce the computational complexity by compromising performance.

In [10], feature selection methods are grouped into two broad categories (filter and wrapper), based on their dependence on the inductive algorithm that will finally use the selected subset. Filter methods are independent of the inductive algorithm, whereas wrapper methods use the inductive algorithm as the evaluation function. Wrapper methods wrap the feature selection around the induction algorithm to be used, using cross-validation to predict the benefits of adding or removing a feature from the feature subset.

A strong argument for wrapper methods is that the estimated correlation coefficient of the learning algorithm is the best available heuristic for measuring the value of features. Different learning algorithms may perform better with different feature sets, even if they are using the same training set [4]. Wrapper selection methods are able to improve the performance of a given regression model, but they are expensive in terms of computational effort. Existing filter algorithms are computationally cheaper, but they fail to identify and remove all redundant features; in addition, there is a danger that the features selected by a filter method can decrease the correlation coefficient of a learning algorithm.

The next section presents the most well-known wrapper selection methods. In sections III-VI, we compare five well-known wrapper feature selection methods. Experimental results are reported using four well-known representative regression algorithms. The final section concludes this work.
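As an informal illustration of the wrapper scheme described above, and not of the authors' own implementation, the listing below scores a candidate feature subset by cross-validating a regression learner on only the selected columns. The helper name evaluate_subset, the use of scikit-learn and the choice of the R-squared scorer in place of the paper's correlation coefficient are assumptions made for the example.

    # Minimal sketch of the wrapper evaluation step: a candidate feature
    # subset is scored by cross-validating the induction algorithm on the
    # columns it selects.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    def evaluate_subset(X, y, subset, model=None, folds=5):
        """Mean cross-validated score of `model` restricted to the feature
        indices listed in `subset`."""
        if not subset:                       # an empty subset carries no information
            return -np.inf
        if model is None:
            model = DecisionTreeRegressor()  # stand-in for a regression-tree learner
        scores = cross_val_score(model, X[:, subset], y, cv=folds, scoring="r2")
        return float(scores.mean())

Any selection algorithm from the next section can then be driven by this single scoring routine, which is exactly the division of labour between selection and evaluation components described above.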
II. WRAPPER FEATURE SELECTION METHODS

Theoretically, having more features should result in more discriminating power. However, practical experience with machine learning algorithms has shown that this is not always the case: current machine learning toolkits are insufficiently equipped to deal with contemporary datasets, and many algorithms exhibit poor complexity with respect to the number of features.

There are several wrapper selection algorithms that evaluate different subsets of the features on the learning algorithm and keep the subsets that perform best. The simplest method is forward selection (FS). It starts with the empty set and greedily adds attributes one at a time. At each step FS adds the attribute that, when added to the current set, yields the learned model that generalizes best. Once an attribute is added, FS cannot later remove it. Backward stepwise selection (BS) starts with all attributes in the attribute set and greedily removes them one at a time. Like forward selection, backward selection removes at each step the attribute whose removal yields the set of attributes that generalizes best. Also like FS, once BS removes an attribute, it cannot later add it back to the set [17].

A problem with forward selection is that it may fail to include attributes that are interdependent, as it adds variables one at a time. However, it may locate small effective subsets quite rapidly, as the early evaluations, involving relatively few variables, are fast. In contrast, backward selection handles interdependencies well, but its early evaluations are relatively expensive [11]. In [5] the authors describe forward selection, backward selection and some variations with classification algorithms, and conclude that any wrapper selection method is better than no selection method.
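The following sketch, assuming a scoring function such as the evaluate_subset helper shown earlier, illustrates the greedy forward selection loop just described; backward selection is the mirror image, starting from the full attribute set and greedily removing one attribute per step. The function name forward_selection is illustrative and not taken from the paper.

    # Greedy forward selection (FS): start from the empty set and, at each
    # step, add the single attribute whose inclusion gives the best score;
    # stop as soon as no addition improves on the current subset.
    def forward_selection(X, y, score_fn):
        n_features = X.shape[1]
        selected, best_score = [], float("-inf")
        while True:
            remaining = [f for f in range(n_features) if f not in selected]
            if not remaining:
                break
            # score every one-feature extension of the current subset
            scored = [(score_fn(X, y, selected + [f]), f) for f in remaining]
            step_score, step_feature = max(scored)
            if step_score <= best_score:   # stopping criterion (i) from the introduction
                break
            selected.append(step_feature)
            best_score = step_score
        return selected, best_score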
Sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS) are characterized by a changing number of features included or eliminated at the different stages of the procedure [15]. A similar approach is followed by Best First search. Best First search starts with an empty set of features and generates all possible single-feature expansions [8]. The subset with the highest evaluation is chosen and is expanded in the same manner by adding single features. If expanding a subset results in no improvement, the search drops back to the next best unexpanded subset and continues from there. Given enough time, a Best First search will explore the entire search space, so it is common to limit the number of expanded subsets that yield no improvement. The best subset found is returned when the search terminates. Best First search can be combined with forward or backward selection.

Another way is to start the search from a randomly selected subset (i.e. Random Generation) and add or delete a feature at random. A more informed random feature selection is carried out with the use of genetic algorithms [18]. A solution is typically a fixed-length binary string representing a feature subset: the value of each position in the string represents the presence or absence of a particular feature. Genetic algorithms can cope with feature interactions (interdependencies between bits in the string), and thus should be well-suited for this problem domain. However, genetic algorithms typically require a large number of evaluations to reach a minimum.

III. EXPERIMENTS WITH REPRESENTATIVE LEARNING TECHNIQUES

In this study, we used for our experiments forward selection (FS), backward selection (BS), Best First forward selection (BFFS), Best First backward selection (BFBS) and genetic search selection (GS), in combination with four common machine learning techniques. For our experiments, we used a representative algorithm from each machine learning technique, namely: Regression Trees [17], Regression Rules [17], Instance-Based Learning Algorithms [1] and Support Vector Machines [16].

For the purposes of the present study, we used 12 well-known datasets from the UCI repository [3]. Table 1 gives a brief description of these data sets.

The most well-known measure of the degree of fit of a regression model to a dataset is the correlation coefficient, computed between the actual target values a1, a2, ..., an and the predicted target values.
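As a reference for this goodness-of-fit measure, the sketch below computes the sample (Pearson) correlation coefficient between the actual targets and a model's predictions; it is a generic illustration rather than the paper's own code, and the helper name correlation_coefficient is assumed.

    # Sample (Pearson) correlation coefficient between actual targets
    # a_1..a_n and predictions p_1..p_n: the covariance of the two series
    # divided by the product of their standard deviations.
    import numpy as np

    def correlation_coefficient(actual, predicted):
        a = np.asarray(actual, dtype=float)
        p = np.asarray(predicted, dtype=float)
        a_dev, p_dev = a - a.mean(), p - p.mean()
        return float((a_dev * p_dev).sum()
                     / np.sqrt((a_dev ** 2).sum() * (p_dev ** 2).sum()))

A value close to 1 indicates that the predictions track the actual targets closely, which is the sense in which the wrapper methods above judge one feature subset better than another.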