
Feature Selection

Suhang Wang¹, Jiliang Tang², and Huan Liu¹
¹Arizona State University, Tempe, AZ, USA
²Michigan State University, East Lansing, MI, USA

Abstract

Data dimensionality is growing rapidly, which poses challenges to the vast majority of existing mining and learning algorithms, such as the curse of dimensionality, large storage requirements, and high computational cost. Feature selection has been proven to be an effective and efficient way to prepare high-dimensional data for machine learning and data mining. The recent emergence of novel techniques and new types of data and features not only advances existing feature selection research but also evolves feature selection continually, making it applicable to a broader range of applications. In this entry, we aim to provide a basic introduction to feature selection, including basic concepts, classifications of existing systems, recent developments, and applications.

Synonyms

Attribute selection; Feature subset selection; Feature weighting

Definition (or Synopsis)

Feature selection, as a dimensionality reduction technique, aims to choose a small subset of relevant features from the original ones by removing irrelevant, redundant, or noisy features. Feature selection usually leads to better learning performance, i.e., higher learning accuracy, lower computational cost, and better model interpretability.

Generally speaking, irrelevant features are features that cannot help discriminate samples from different classes (supervised) or clusters (unsupervised). Removing irrelevant features will not affect learning performance. In fact, the removal of irrelevant features may help learn a better model, as irrelevant features may confuse the learning system and cause memory and computation inefficiency. For example, in Fig. 1a, f1 is a relevant feature because it can discriminate class1 and class2. In Fig. 1b, f2 is an irrelevant feature because it cannot distinguish points from class1 and class2; removal of f2 does not affect the ability of f1 to distinguish samples from class1 and class2.

A redundant feature is a feature that implies the copresence of another feature. Individually, each redundant feature is relevant, but removal of one of them will not affect the learning performance. For example, in Fig. 1c, f1 and f6 are strongly correlated. f6 is a relevant feature by itself; however, when f1 is selected first, the later appearance of f6 does not provide additional information. Instead, it only adds more memory and computational requirements to learn the classification model.

A noisy feature is a type of relevant feature. However, due to the noise introduced during the data collection process or because of the nature of the feature, a noisy feature may not be so relevant to the learning or mining task. As shown in Fig. 1d, f4 is a noisy feature: it can discriminate a part of the points from the two classes and may confuse the learning model for the overlapping points. (Noisy features are very subtle. One feature may be a noisy feature by itself; however, in some cases, when two or more noisy features can complement each other to distinguish samples from different classes, they may be selected together to benefit the learning model.)


Feature Selection, Fig. 1 A toy example to illustrate the concept of irrelevant, redundant, and noisy features (scatter plots of f1 against f2, f6, and f4 for class1 and class2). f1 is a relevant feature and can discriminate class1 and class2. f2 is an irrelevant feature; removal of f2 will not affect the learning performance. f4 is a noisy feature; the presence of noisy features may degenerate the learning performance. f6 is a redundant feature when f1 is present; if f1 is selected, removal of f6 will not affect the learning performance. (a) Relevant feature. (b) Irrelevant feature. (c) Redundant feature. (d) Noisy feature
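The distinction between these four feature types can be checked numerically on synthetic data. The following minimal sketch (the feature names f1, f2, f4, f6 mirror Fig. 1, but the distributions, thresholds, and statistics are illustrative assumptions rather than the entry's own data) builds such features with NumPy and reports a simple class-separation score and the correlation with f1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)  # two classes

# Features mimicking Fig. 1 (illustrative, not the entry's actual data)
f1 = np.where(y == 0, rng.normal(1.5, 0.3, n), rng.normal(3.0, 0.3, n))  # relevant
f2 = rng.normal(0.0, 0.5, n)                                             # irrelevant
f6 = 1.5 * f1 + rng.normal(0.0, 0.05, n)                                 # redundant (correlated with f1)
f4 = np.where(y == 0, rng.normal(0.0, 1.0, n), rng.normal(1.0, 1.0, n))  # noisy (heavy class overlap)

def separation(f, y):
    """Absolute difference of class means scaled by pooled std (larger = more discriminative)."""
    a, b = f[y == 0], f[y == 1]
    return abs(a.mean() - b.mean()) / np.sqrt(0.5 * (a.var() + b.var()))

for name, f in [("f1", f1), ("f2", f2), ("f4", f4), ("f6", f6)]:
    print(name,
          "separation: %.2f" % separation(f, y),
          "corr with f1: %.2f" % np.corrcoef(f, f1)[0, 1])
```

On this data, f1 and f6 both score high on separation but are nearly perfectly correlated, so one is redundant given the other; f2 scores near zero, and the noisy f4 sits in between.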

Motivation and Background

In many real-world applications, such as data mining, machine learning, and bioinformatics, we need to deal with high-dimensional data. In the past 30 years, the dimensionality of the data involved in these areas has increased explosively. The growth of the number of attributes in the UCI machine learning repository is shown in Fig. 2a. In addition, the number of samples has also increased explosively; the growth of the number of samples in the UCI machine learning repository is shown in Fig. 2b. This huge amount of high-dimensional data presents serious challenges to existing learning methods. First, due to the large number of features and the relatively small number of training samples, a learning model tends to overfit, and its learning performance degenerates. Data with high dimensionality not only degenerates many algorithms' performance, due to the curse of dimensionality and the existence of irrelevant, redundant, and noisy dimensions, but also significantly increases the time and memory requirements of the algorithms. Second, storing and processing such amounts of high-dimensional data becomes a challenge.

Feature Selection, Fig. 2 Growth of the number of features and the number of samples in the UCI ML repository, 1985-2010 (both on a log scale). (a) UCI ML repository number of attribute growth. (b) UCI ML repository number of sample growth

Dimensionality reduction is one of the most popular techniques for dealing with high-dimensional data and can be categorized into feature extraction and feature selection. Both feature extraction and feature selection are capable of improving learning performance, lowering computational complexity, building better generalization models, and decreasing required storage. Feature extraction maps the original feature space to a new feature space of lower dimensionality by combining the original features. Further analysis of the new features is therefore problematic, since the transformed features obtained from feature extraction have no physical meaning. In contrast, feature selection selects a subset of features from the original feature set, so it keeps the actual meaning of each selected feature, which makes it superior in terms of feature readability and interpretability.

Structure of the Learning System

From the perspective of label availability, feature selection methods can be broadly classified into supervised, unsupervised, and semi-supervised methods. In terms of selection strategies, feature selection can be categorized into filter, wrapper, and embedded models. Figure 3 shows the classification of feature selection methods.

Supervised feature selection is usually used for classification tasks. The availability of class labels allows supervised feature selection algorithms to effectively select discriminative features that distinguish samples from different classes. A general framework of supervised feature selection is shown in Fig. 4a. Features are first generated from training data. Instead of using all the data to train the model, supervised feature selection first selects a subset of features and then passes the data with the selected features to the learning model. The feature selection phase uses the label information and the characteristics of the data, such as information gain or the Gini index, to select relevant features. The final selected features, together with the label information, are used to train a classifier, which can then be used for prediction.
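As a concrete illustration of this select-then-train workflow, the sketch below (assuming scikit-learn is available; mutual information is used as the relevance criterion in place of information gain, and the synthetic dataset and the value of k are arbitrary choices) ranks features on the training data, keeps the top k, and trains a classifier only on those.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 100 features, only a handful of which are informative
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature selection phase: score features on the training data only
selector = SelectKBest(score_func=mutual_info_classif, k=15)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Model learning phase: train the classifier on the selected features
clf = LogisticRegression(max_iter=1000).fit(X_train_sel, y_train)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
print("test accuracy with 15 of 100 features: %.3f" % clf.score(X_test_sel, y_test))
```

Scores are computed on the training split only, so the held-out split gives an unbiased look at how well the selected features generalize.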

Feature Selection, Fig. 3 Feature selection categories

Feature Selection, Fig. 4 General frameworks of supervised and unsupervised feature selection. (a) A general framework of supervised feature selection. (b) A general framework of unsupervised feature selection

Unsupervised feature selection is usually used for clustering tasks. A general framework of unsupervised feature selection is described in Fig. 4b, which is very similar to supervised feature selection, except that no label information is involved in the feature selection phase or the model learning phase. Without label information to define feature relevance, unsupervised feature selection relies on alternative criteria during the feature selection phase. One commonly used criterion chooses features that can best preserve the manifold structure of the original data. Another frequently used method is to seek cluster indicators through clustering algorithms and then transform the unsupervised feature selection problem into a supervised framework. There are two different ways to use this method. One way is to seek cluster indicators and simultaneously perform supervised feature selection within one unified framework. The other way is to first seek cluster indicators, then perform feature selection to remove or select certain features, and finally repeat these two steps iteratively until a certain criterion is met. In addition, certain supervised feature selection criteria can still be used with some modification.
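The manifold-preservation criterion can be made concrete with a Laplacian-score-style computation: features that vary smoothly over a similarity graph of the samples are preferred. The sketch below (plain NumPy; the dense RBF similarity, its bandwidth, and the toy data are simplifying assumptions rather than a reproduction of any specific published algorithm) scores each feature this way, with smaller scores being better.

```python
import numpy as np

def laplacian_style_scores(X, sigma=2.0):
    """Score features by how well they preserve local sample similarity (smaller = better)."""
    n, d = X.shape
    # Dense RBF similarity between samples (a k-NN graph is more common in practice)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2 * sigma ** 2))
    D = np.diag(S.sum(axis=1))          # degree matrix
    L = D - S                           # graph Laplacian
    ones = np.ones(n)
    scores = np.empty(d)
    for r in range(d):
        f = X[:, r]
        # Center the feature with respect to the degree-weighted mean
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X[:, 0] += np.repeat([0.0, 4.0], 50)    # feature 0 follows the two-cluster structure
scores = laplacian_style_scores(X)
print("feature ranking (best first):", np.argsort(scores))
```

On this toy data, feature 0, which follows the two-cluster structure, typically receives the smallest score and is ranked first.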
Semi-supervised feature selection is usually used when a small portion of the data is labeled. When such data is given, neither supervised nor unsupervised feature selection may be the best choice. Supervised feature selection might not be able to select relevant features, because the labeled data is insufficient to represent the distribution of the features. Unsupervised feature selection does not use the label information, even though label information can give discriminative information for selecting relevant features. Semi-supervised feature selection, which takes advantage of both labeled and unlabeled data, is a better choice for handling partially labeled data. The general framework of semi-supervised feature selection is the same as that of supervised feature selection, except that the data is only partially labeled. Most existing semi-supervised feature selection algorithms rely on the construction of a similarity matrix and select features that best fit this similarity matrix. Both the label information and the similarity measure of the labeled and unlabeled data are used to construct the similarity matrix, so that label information provides discriminative information for selecting relevant features, while unlabeled data provide complementary information.

Filter Models For filter models, features are selected based on the characteristics of the data, without utilizing learning algorithms. This approach is very efficient. However, it does not consider the bias and heuristics of the learning algorithms; thus, it may miss features that are relevant for the target learning algorithm. A filter algorithm usually consists of two steps. In the first step, features are ranked based on a certain criterion. In the second step, the features with the highest rankings are chosen. Many ranking criteria, which measure different characteristics of the features, have been proposed: the ability to effectively separate samples from different classes by considering between-class and within-class variance, the dependence between the feature and the class label, the correlation between feature and class and between feature and feature, the ability to preserve the manifold structure, the mutual information between the features, and so on.

Wrapper Models The major disadvantage of the filter approach is that it totally ignores the effects of the selected feature subset on the performance of the clustering or classification algorithm. The optimal feature subset should depend on the specific biases and heuristics of the learning algorithm. Based on this assumption, wrapper models use a specific learning algorithm to evaluate the quality of the selected features. Given a predefined learning algorithm, a general framework of the wrapper model is shown in Fig. 5. The feature search component produces a set of features based on a certain search strategy. The feature evaluation component then uses the predefined learning algorithm to evaluate the performance, which is returned to the feature search component for the next iteration of feature subset selection. The feature set with the best performance is chosen as the final set. The search space for m features is O(2^m). To avoid an exhaustive search, a wide range of search strategies can be used, including hill climbing, best-first search, branch-and-bound, and genetic algorithms.

Feature Selection, Fig. 5 A general framework of wrapper models
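A minimal wrapper sketch is shown below (assuming scikit-learn is available; the greedy forward search, the k-nearest-neighbor evaluator, and the stopping size are illustrative choices, one of many possible search strategies and learning algorithms). The search component proposes candidate features one at a time, and the evaluation component scores each candidate subset with the cross-validated accuracy of the wrapped classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
clf = KNeighborsClassifier()          # the wrapped learning algorithm

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < 5:              # greedy forward search (one possible search strategy)
    best_feat, best_score = None, -np.inf
    for f in remaining:               # evaluate each candidate subset with the classifier
        subset = selected + [f]
        score = cross_val_score(clf, X[:, subset], y, cv=5).mean()
        if score > best_score:
            best_feat, best_score = f, score
    selected.append(best_feat)
    remaining.remove(best_feat)
    print("added feature %d, CV accuracy %.3f" % (best_feat, best_score))

print("wrapper-selected features:", selected)
```

Because the wrapped classifier is retrained for every candidate subset, the cost grows quickly with the number of features, which is why filter and embedded models are often preferred at scale.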

Feature Selection, Fig. 6 Classification of recent developments in feature selection from the feature perspective and the data perspective

Embedded Models Filter models are computationally efficient but totally ignore the biases of the learning algorithm. Compared with filter models, wrapper models obtain better predictive accuracy estimates, since they take the biases of the learning algorithm into account. However, wrapper models are very computationally expensive. Embedded models are a trade-off between the two: they embed feature selection into model construction. Thus, embedded models take advantage of both filter models and wrapper models: (1) they are far less computationally intensive than wrapper methods, since they do not need to run the learning model many times to evaluate the features, and (2) they include the interaction with the learning model. The biggest difference between wrapper models and embedded models is that wrapper models first train learning models using the candidate features and then perform feature selection by evaluating the features with the learning model, whereas embedded models select features during the process of model construction, without further evaluation of the features.
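A common embedded approach is to fit a sparsity-inducing model and keep the features with nonzero coefficients, so that selection happens as a by-product of training. The sketch below (assuming scikit-learn; L1-regularized logistic regression and the chosen regularization strength are illustrative) shows this pattern.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# The L1 penalty drives many coefficients exactly to zero while the model is trained
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

selected = np.flatnonzero(clf.coef_[0])   # features with nonzero weights are "selected"
print("selected %d of %d features: %s" % (len(selected), X.shape[1], selected))
```

Here selection is a by-product of fitting the model once, which is the trade-off described above: cheaper than a wrapper, yet aware of the learning algorithm's bias.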
ously, a single document has a sparse vector over Feature selection algorithms for the structured the set of all terms. The performance of clustering features usually use the recently developed sparse algorithms degrades dramatically due to high learning techniques such as group and tree- dimensionality and data sparseness. Therefore, in guided lasso. practice, feature selection is a very important step From the data perspective, data can be cate- to reduce the feature space in text clustering. gorized as streaming data and static data as shown Genomic Microarray Data Microarray data is in Fig. 6b. Static data can be further categorized usually short and fat data Ð high dimensionality as independent identically distributed (i.i.d.) data with a small sample size, which poses a great and heterogeneous data. The recent development challenge for computational techniques. Their of feature selection from the data perspective is dimensionality can be up to tens of thousands mainly concentrated on streaming and heteroge- of genes, while their sample sizes can only be neous data. several hundreds. Furthermore, additional exper- Similar to streaming features, streaming data imental complications like noise and variability comes sequentially. Online streaming feature render the analysis of microarray data an exciting selection is proposed to deal with streaming domain. Because of these issues, various feature data. When new data instances come, an online selection algorithms are adopted to reduce the feature selection algorithm needs to determine dimensionality and remove noise in microarray (1) whether adding the newly generated features data analysis. from the coming data to the currently selected Hyperspectral Image Classification Hyper- features and (2) whether removing features spectral sensors record the reflectance from from the set of currently selected features the Earth’s surface over the full range of solar ID. Traditional data is usually assumed to wavelengths with high spectral resolution, which be i.i.d. data, such as text and gene data. results in high-dimensional data that contains However, heterogeneous data, such as linked rich information for a wide range of applications. data, apparently contradicts this assumption. However, this high-dimensional data contains For example, linked data is inherently not i.i.d., many irrelevant, noisy, and redundant features since instances are linked and correlated. New that are not important, useful, or desirable for types of data cultivate new types of feature specific tasks. Feature selection is a critical selection algorithms correspondingly, such as preprocessing step to reduce computational cost feature selection for linked data and multi-view for hyperspectral data classification by selecting and multisource feature selection. relevant features. 8 Feature Selection

Applications

High-dimensional data is ubiquitous in the real world, which makes feature selection a very popular and practical preprocessing technique for various real-world applications, such as text categorization, remote sensing, image retrieval, microarray analysis, mass spectrum analysis, sequence analysis, and so on.

Text Clustering The task of text clustering is to group similar documents together. In text clustering, a text or document is usually represented as a bag of words, which leads to a high-dimensional feature space and a sparse representation; a single document has a sparse vector over the set of all terms. The performance of clustering algorithms degrades dramatically due to this high dimensionality and data sparseness. Therefore, in practice, feature selection is a very important step to reduce the feature space in text clustering.

Genomic Microarray Data Microarray data is usually "short and fat": high dimensionality with a small sample size, which poses a great challenge for computational techniques. The dimensionality can be up to tens of thousands of genes, while the sample size may only be several hundred. Furthermore, additional experimental complications, such as noise and variability, render the analysis of microarray data an exciting domain. Because of these issues, various feature selection algorithms are adopted to reduce the dimensionality and remove noise in microarray data analysis.

Hyperspectral Image Classification Hyperspectral sensors record the reflectance from the Earth's surface over the full range of solar wavelengths with high spectral resolution, which results in high-dimensional data that contains rich information for a wide range of applications. However, this high-dimensional data contains many irrelevant, noisy, and redundant features that are not important, useful, or desirable for specific tasks. Feature selection is a critical preprocessing step that reduces the computational cost of hyperspectral data classification by selecting relevant features.

Sequence Analysis In bioinformatics, sequence analysis is a very important process for understanding a sequence's features, functions, structure, or evolution. In addition to basic features that represent the nucleotide or amino acid at each position in a sequence, many other features, such as k-mer patterns, can be derived. By varying the pattern length k, the number of features grows exponentially. However, many of these features are irrelevant or redundant; thus, feature selection techniques are applied to select a relevant feature subset, which is essential for sequence analysis.

Open Problems

Scalability With the rapid growth of dataset sizes, the scalability of current feature selection algorithms may be a big issue, especially for online classifiers. Large data cannot be loaded into memory, and a single scan may be all that is affordable; however, some feature selection methods must scan the full-dimensional data, and they usually require a sufficient number of samples to obtain statistically significant results. It is also very difficult to estimate feature relevance scores without considering the density around each sample. Therefore, scalability is a big issue.

Stability Feature selection algorithms are often evaluated through classification accuracy or clustering accuracy. However, the stability of the algorithms is also an important consideration when developing feature selection methods. For example, when feature selection is applied to gene data, domain experts would like to see the same, or at least similar, sets of genes selected each time they obtain new samples with a small amount of perturbation; otherwise, they will not trust the algorithm. However, well-known feature selection methods, especially unsupervised feature selection algorithms, can select features with low stability after perturbation is introduced to the training data. Developing feature selection algorithms with both high accuracy and high stability is still an open problem.
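Stability can be quantified by comparing the feature subsets selected on perturbed versions of the training data, for example with the average pairwise Jaccard similarity. The sketch below (plain NumPy; the bootstrap perturbation, the simple correlation-based selector, and the subset size are illustrative assumptions) does exactly that.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, :5].sum(axis=1) + rng.normal(0, 1, 200) > 0).astype(int)

def select_top_k(X, y, k=10):
    """Toy selector: top-k features by absolute correlation with the label."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return set(np.argsort(scores)[-k:])

# Select features on several bootstrap perturbations of the training data
subsets = []
for _ in range(10):
    idx = rng.integers(0, X.shape[0], X.shape[0])     # bootstrap resample
    subsets.append(select_top_k(X[idx], y[idx]))

jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print("average pairwise Jaccard stability: %.2f" % np.mean(jaccard))
```

A value near 1 means the selector returns nearly the same features under perturbation; values far below 1 signal the stability problem described above.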
Parameter Selection In feature selection, we usually need to specify the number of features to select. However, the optimal number of features for a dataset is unknown. If the number of selected features is too small, performance degrades, since some relevant features are eliminated. If the number of selected features is too large, performance may also suffer, since some noisy, irrelevant, or redundant features are selected and confuse the learning model. In practice, we would grid search the number of features over a range and pick the value that gives relatively better performance for the learning model, which is computationally expensive. In particular, for supervised feature selection, cross-validation can be used to search for the number of features to select. How to automatically determine the best number of selected features remains an open problem.

For many unsupervised feature selection methods, in addition to choosing the optimal number of features, we also need to specify the number of clusters. Since there is no label information and we have limited knowledge about each domain, the actual number of clusters in the data is usually unknown and not well defined. Different numbers of clusters specified by the user lead the unsupervised feature selection algorithm to select different feature subsets. How to choose the number of clusters for unsupervised feature selection is an open problem.
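For the supervised case, the grid search over the number of selected features described above can be written as a cross-validated pipeline; the sketch below (assuming scikit-learn; the candidate values of k, the ANOVA F-score criterion, and the classifier are illustrative choices) picks the k with the best cross-validation score.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=60, n_informative=8,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),   # feature selection step
    ("clf", LogisticRegression(max_iter=1000)),      # learning model
])

# Grid search the number of selected features with cross-validation
grid = GridSearchCV(pipe, {"select__k": [5, 10, 20, 40, 60]}, cv=5)
grid.fit(X, y)
print("best number of features:", grid.best_params_["select__k"])
print("best cross-validation accuracy: %.3f" % grid.best_score_)
```

The search is still a brute-force sweep over candidate values of k; choosing k automatically, as noted above, remains open.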

Cross-References

Classification
Clustering
Dimensionality Reduction
Feature Extraction

Recommended Reading

Alelyani S, Tang J, Liu H (2013) Feature selection for clustering: a review. In: Aggarwal CC (ed) Data clustering: algorithms and applications, vol 29. CRC Press, Hoboken
Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Jain A, Zongker D (1997) Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell 19(2):153–158
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1):273–324
Koller D, Sahami M (1996) Toward optimal feature selection. Technical report, Stanford InfoLab
Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2016) Feature selection: a data perspective. arXiv preprint arXiv:1601.07996
Liu H, Motoda H (2007) Computational methods of feature selection. CRC Press, New York
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
Liu H, Motoda H, Setiono R, Zhao Z (2010) Feature selection: an ever evolving frontier in data mining. In: FSDM, Hyderabad, pp 4–13
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Tang J, Liu H (2012) Feature selection with linked data in social media. In: SDM, Anaheim. SIAM, pp 118–128
Tang J, Alelyani S, Liu H (2014) Feature selection for classification: a review. In: Aggarwal CC (ed) Data classification: algorithms and applications. Chapman & Hall/CRC, Boca Raton, p 37
Wu X, Yu K, Ding W, Wang H, Zhu X (2013) Online feature selection with streaming features. IEEE Trans Pattern Anal Mach Intell 35(5):1178–1192
Zhao ZA, Liu H (2011) Spectral feature selection for data mining. Chapman & Hall/CRC, Boca Raton
Zhao Z, Morstatter F, Sharma S, Alelyani S, Anand A, Liu H (2010) Advancing feature selection research. ASU feature selection repository, 1–28