
Feature Selection

Suhang Wang¹, Jiliang Tang², and Huan Liu¹
¹Arizona State University, Tempe, AZ, USA
²Michigan State University, East Lansing, MI, USA

Abstract

Data dimensionality is growing rapidly, which poses challenges to the vast majority of existing mining and learning algorithms, such as the curse of dimensionality, large storage requirements, and high computational cost. Feature selection has been proven to be an effective and efficient way to prepare high-dimensional data for machine learning and data mining. The recent emergence of novel techniques and new types of data and features not only advances existing feature selection research but also evolves feature selection continually, making it applicable to a broader range of applications. In this entry, we aim to provide a basic introduction to feature selection, including basic concepts, classifications of existing systems, recent developments, and applications.

Synonyms

Attribute selection; Feature subset selection; Feature weighting

Definition (or Synopsis)

Feature selection, as a dimensionality reduction technique, aims to choose a small subset of relevant features from the original ones by removing irrelevant, redundant, or noisy features. Feature selection usually leads to better learning performance, i.e., higher learning accuracy, lower computational cost, and better model interpretability.

Generally speaking, irrelevant features are features that cannot help discriminate samples from different classes (supervised) or clusters (unsupervised). Removing irrelevant features will not affect learning performance. In fact, the removal of irrelevant features may help learn a better model, as irrelevant features may confuse the learning system and cause memory and computation inefficiency. For example, in Fig. 1a, f1 is a relevant feature because it can discriminate class1 and class2. In Fig. 1b, f2 is an irrelevant feature because it cannot distinguish points from class1 and class2; removal of f2 does not affect the ability of f1 to distinguish samples from class1 and class2.

A redundant feature is a feature that implies the copresence of another feature. Individually, each redundant feature is relevant, but removal of one of them will not affect the learning performance. For example, in Fig. 1c, f1 and f6 are strongly correlated. f6 is a relevant feature by itself; however, when f1 is selected first, the later appearance of f6 does not provide additional information. Instead, it only adds more memory and computational requirements to learn the classification model.

A noisy feature is a type of relevant feature. However, due to the noise introduced during the data collection process or because of the nature of the feature, a noisy feature may not be so relevant to the learning or mining task. As shown in Fig. 1d, f4 is a noisy feature: it can discriminate a part of the points from the two classes and may confuse the learning model for the overlapping points. (Noisy features are very subtle. One feature may be a noisy feature by itself; however, in some cases, when two or more noisy features can complement each other to distinguish samples from different classes, they may be selected together to benefit the learning model.)


Feature Selection, Fig. 1 A toy example to illustrate the concept of irrelevant, redundant, and noisy features (scatter plots of f1 against f2, f6, and f4 for class1 and class2). f1 is a relevant feature and can discriminate class1 and class2. f2 is an irrelevant feature; removal of f2 will not affect the learning performance. f4 is a noisy feature; the presence of noisy features may degenerate the learning performance. f6 is a redundant feature when f1 is present; if f1 is selected, removal of f6 will not affect the learning performance. (a) Relevant feature. (b) Irrelevant feature. (c) Redundant feature. (d) Noisy feature
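The distinction between these four feature types can be checked numerically on synthetic data. The following minimal sketch (the feature names f1, f2, f4, f6 mirror Fig. 1, but the distributions, thresholds, and statistics are illustrative assumptions rather than the entry's own data) builds such features with NumPy and reports a simple class-separation score and the correlation with f1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)  # two classes

# Features mimicking Fig. 1 (illustrative, not the entry's actual data)
f1 = np.where(y == 0, rng.normal(1.5, 0.3, n), rng.normal(3.0, 0.3, n))  # relevant
f2 = rng.normal(0.0, 0.5, n)                                             # irrelevant
f6 = 1.5 * f1 + rng.normal(0.0, 0.05, n)                                 # redundant (correlated with f1)
f4 = np.where(y == 0, rng.normal(0.0, 1.0, n), rng.normal(1.0, 1.0, n))  # noisy (heavy class overlap)

def separation(f, y):
    """Absolute difference of class means scaled by pooled std (larger = more discriminative)."""
    a, b = f[y == 0], f[y == 1]
    return abs(a.mean() - b.mean()) / np.sqrt(0.5 * (a.var() + b.var()))

for name, f in [("f1", f1), ("f2", f2), ("f4", f4), ("f6", f6)]:
    print(name,
          "separation: %.2f" % separation(f, y),
          "corr with f1: %.2f" % np.corrcoef(f, f1)[0, 1])
```

On this data, f1 and f6 both score high on separation but are nearly perfectly correlated, so one is redundant given the other; f2 scores near zero, and the noisy f4 sits in between.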

Motivation and Background

In many real-world applications, such as data mining, machine learning, and bioinformatics, we need to deal with high-dimensional data. In the past 30 years, the dimensionality of the data involved in these areas has increased explosively. The growth of the number of attributes in the UCI machine learning repository is shown in Fig. 2a. In addition, the number of samples has also increased explosively; the growth of the number of samples in the UCI machine learning repository is shown in Fig. 2b. This huge amount of high-dimensional data presents serious challenges to existing learning methods. First, due to the large number of features and the relatively small number of training samples, a learning model tends to overfit, and its learning performance degenerates. Data with high dimensionality not only degenerates many algorithms' performance, due to the curse of dimensionality and the existence of irrelevant, redundant, and noisy dimensions, but also significantly increases the time and memory requirements of the algorithms. Second, storing and processing such amounts of high-dimensional data becomes a challenge.

Feature Selection, Fig. 2 Growth of the number of features and the number of samples in the UCI ML repository, 1985-2010 (both on a log scale). (a) UCI ML repository number of attribute growth. (b) UCI ML repository number of sample growth

Dimensionality reduction is one of the most popular techniques for dealing with high-dimensional data and can be categorized into feature extraction and feature selection. Both feature extraction and feature selection are capable of improving learning performance, lowering computational complexity, building better generalization models, and decreasing required storage. Feature extraction maps the original feature space to a new feature space of lower dimensionality by combining the original features. Further analysis of the new features is therefore problematic, since the transformed features obtained from feature extraction have no physical meaning. In contrast, feature selection selects a subset of features from the original feature set, so it keeps the actual meaning of each selected feature, which makes it superior in terms of feature readability and interpretability.

Structure of the Learning System

From the perspective of label availability, feature selection methods can be broadly classified into supervised, unsupervised, and semi-supervised methods. In terms of selection strategies, feature selection can be categorized into filter, wrapper, and embedded models. Figure 3 shows the classification of feature selection methods.

Supervised feature selection is usually used for classification tasks. The availability of class labels allows supervised feature selection algorithms to effectively select discriminative features that distinguish samples from different classes. A general framework of supervised feature selection is shown in Fig. 4a. Features are first generated from training data. Instead of using all the data to train the model, supervised feature selection first selects a subset of features and then passes the data with the selected features to the learning model. The feature selection phase uses the label information and the characteristics of the data, such as information gain or the Gini index, to select relevant features. The final selected features, together with the label information, are used to train a classifier, which can then be used for prediction.
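As a concrete illustration of this select-then-train workflow, the sketch below (assuming scikit-learn is available; mutual information is used as the relevance criterion in place of information gain, and the synthetic dataset and the value of k are arbitrary choices) ranks features on the training data, keeps the top k, and trains a classifier only on those.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 100 features, only a handful of which are informative
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature selection phase: score features on the training data only
selector = SelectKBest(score_func=mutual_info_classif, k=15)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Model learning phase: train the classifier on the selected features
clf = LogisticRegression(max_iter=1000).fit(X_train_sel, y_train)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
print("test accuracy with 15 of 100 features: %.3f" % clf.score(X_test_sel, y_test))
```

Scores are computed on the training split only, so the held-out split gives an unbiased look at how well the selected features generalize.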

Feature Selection, Fig. 3 Feature selection categories

Feature Selection, Fig. 4 General frameworks of supervised and unsupervised feature selection. (a) A general framework of supervised feature selection. (b) A general framework of unsupervised feature selection

Unsupervised feature selection is usually used for clustering tasks. A general framework of unsupervised feature selection is described in Fig. 4b, which is very similar to supervised feature selection, except that no label information is involved in the feature selection phase or the model learning phase. Without label information to define feature relevance, unsupervised feature selection relies on alternative criteria during the feature selection phase. One commonly used criterion chooses features that can best preserve the manifold structure of the original data. Another frequently used method is to seek cluster indicators through clustering algorithms and then transform the unsupervised feature selection problem into a supervised framework. There are two different ways to use this method. One way is to seek cluster indicators and simultaneously perform supervised feature selection within one unified framework. The other way is to first seek cluster indicators, then perform feature selection to remove or select certain features, and finally repeat these two steps iteratively until a certain criterion is met. In addition, certain supervised feature selection criteria can still be used with some modification.
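The manifold-preservation criterion can be made concrete with a Laplacian-score-style computation: features that vary smoothly over a similarity graph of the samples are preferred. The sketch below (plain NumPy; the dense RBF similarity, its bandwidth, and the toy data are simplifying assumptions rather than a reproduction of any specific published algorithm) scores each feature this way, with smaller scores being better.

```python
import numpy as np

def laplacian_style_scores(X, sigma=2.0):
    """Score features by how well they preserve local sample similarity (smaller = better)."""
    n, d = X.shape
    # Dense RBF similarity between samples (a k-NN graph is more common in practice)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2 * sigma ** 2))
    D = np.diag(S.sum(axis=1))          # degree matrix
    L = D - S                           # graph Laplacian
    ones = np.ones(n)
    scores = np.empty(d)
    for r in range(d):
        f = X[:, r]
        # Center the feature with respect to the degree-weighted mean
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X[:, 0] += np.repeat([0.0, 4.0], 50)    # feature 0 follows the two-cluster structure
scores = laplacian_style_scores(X)
print("feature ranking (best first):", np.argsort(scores))
```

On this toy data, feature 0, which follows the two-cluster structure, typically receives the smallest score and is ranked first.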
Semi-supervised feature selection is usually used when a small portion of the data is labeled. When such data is given, neither supervised nor unsupervised feature selection may be the best choice. Supervised feature selection might not be able to select relevant features, because the labeled data is insufficient to represent the distribution of the features. Unsupervised feature selection does not use the label information, even though label information can give discriminative information for selecting relevant features. Semi-supervised feature selection, which takes advantage of both labeled and unlabeled data, is a better choice for handling partially labeled data. The general framework of semi-supervised feature selection is the same as that of supervised feature selection, except that the data is only partially labeled. Most existing semi-supervised feature selection algorithms rely on the construction of a similarity matrix and select features that best fit this similarity matrix. Both the label information and the similarity measure of the labeled and unlabeled data are used to construct the similarity matrix, so that label information provides discriminative information for selecting relevant features, while unlabeled data provide complementary information.

Filter Models For filter models, features are selected based on the characteristics of the data, without utilizing learning algorithms. This approach is very efficient. However, it does not consider the bias and heuristics of the learning algorithms; thus, it may miss features that are relevant for the target learning algorithm. A filter algorithm usually consists of two steps. In the first step, features are ranked based on a certain criterion. In the second step, the features with the highest rankings are chosen. Many ranking criteria, which measure different characteristics of the features, have been proposed: the ability to effectively separate samples from different classes by considering between-class and within-class variance, the dependence between the feature and the class label, the correlation between feature and class and between feature and feature, the ability to preserve the manifold structure, the mutual information between the features, and so on.

Wrapper Models The major disadvantage of the filter approach is that it totally ignores the effects of the selected feature subset on the performance of the clustering or classification algorithm. The optimal feature subset should depend on the specific biases and heuristics of the learning algorithm. Based on this assumption, wrapper models use a specific learning algorithm to evaluate the quality of the selected features. Given a predefined learning algorithm, a general framework of the wrapper model is shown in Fig. 5. The feature search component produces a set of features based on a certain search strategy. The feature evaluation component then uses the predefined learning algorithm to evaluate the performance, which is returned to the feature search component for the next iteration of feature subset selection. The feature set with the best performance is chosen as the final set. The search space for m features is O(2^m). To avoid an exhaustive search, a wide range of search strategies can be used, including hill climbing, best-first search, branch-and-bound, and genetic algorithms.

Feature Selection, Fig. 5 A general framework of wrapper models
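A minimal wrapper sketch is shown below (assuming scikit-learn is available; the greedy forward search, the k-nearest-neighbor evaluator, and the stopping size are illustrative choices, one of many possible search strategies and learning algorithms). The search component proposes candidate features one at a time, and the evaluation component scores each candidate subset with the cross-validated accuracy of the wrapped classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
clf = KNeighborsClassifier()          # the wrapped learning algorithm

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < 5:              # greedy forward search (one possible search strategy)
    best_feat, best_score = None, -np.inf
    for f in remaining:               # evaluate each candidate subset with the classifier
        subset = selected + [f]
        score = cross_val_score(clf, X[:, subset], y, cv=5).mean()
        if score > best_score:
            best_feat, best_score = f, score
    selected.append(best_feat)
    remaining.remove(best_feat)
    print("added feature %d, CV accuracy %.3f" % (best_feat, best_score))

print("wrapper-selected features:", selected)
```

Because the wrapped classifier is retrained for every candidate subset, the cost grows quickly with the number of features, which is why filter and embedded models are often preferred at scale.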

Feature Selection, Fig. 6 Classification of recent developments in feature selection from the feature perspective and the data perspective

Embedded Models Filter models are computationally efficient but totally ignore the biases of the learning algorithm. Compared with filter models, wrapper models obtain better predictive accuracy estimates, since they take the biases of the learning algorithm into account. However, wrapper models are very computationally expensive. Embedded models are a trade-off between the two: they embed feature selection into model construction. Thus, embedded models take advantage of both filter models and wrapper models: (1) they are far less computationally intensive than wrapper methods, since they do not need to run the learning model many times to evaluate the features, and (2) they include the interaction with the learning model. The biggest difference between wrapper models and embedded models is that wrapper models first train learning models using the candidate features and then perform feature selection by evaluating the features with the learning model, whereas embedded models select features during the process of model construction, without further evaluation of the features.
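A common embedded approach is to fit a sparsity-inducing model and keep the features with nonzero coefficients, so that selection happens as a by-product of training. The sketch below (assuming scikit-learn; L1-regularized logistic regression and the chosen regularization strength are illustrative) shows this pattern.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# The L1 penalty drives many coefficients exactly to zero while the model is trained
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

selected = np.flatnonzero(clf.coef_[0])   # features with nonzero weights are "selected"
print("selected %d of %d features: %s" % (len(selected), X.shape[1], selected))
```

Here selection is a by-product of fitting the model once, which is the trade-off described above: cheaper than a wrapper, yet aware of the learning algorithm's bias.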
ously, a single document has a sparse vector over Feature selection algorithms for the structured the set of all terms. The performance of clustering features usually use the recently developed sparse algorithms degrades dramatically due to high learning techniques such as group and tree- dimensionality and data sparseness. Therefore, in guided lasso. practice, feature selection is a very important step From the data perspective, data can be cate- to reduce the feature space in text clustering. gorized as streaming data and static data as shown Genomic Microarray Data Microarray data is in Fig. 6b. Static data can be further categorized usually short and fat data Ð high dimensionality as independent identically distributed (i.i.d.) data with a small sample size, which poses a great and heterogeneous data. The recent development challenge for computational techniques. Their of feature selection from the data perspective is dimensionality can be up to tens of thousands mainly concentrated on streaming and heteroge- of genes, while their sample sizes can only be neous data. several hundreds. Furthermore, additional exper- Similar to streaming features, streaming data imental complications like noise and variability comes sequentially. Online streaming feature render the analysis of microarray data an exciting selection is proposed to deal with streaming domain. Because of these issues, various feature data. When new data instances come, an online selection algorithms are adopted to reduce the feature selection algorithm needs to determine dimensionality and remove noise in microarray (1) whether adding the newly generated features data analysis. from the coming data to the currently selected Hyperspectral Image Classification Hyper- features and (2) whether removing features spectral sensors record the reflectance from from the set of currently selected features the Earth’s surface over the full range of solar ID. Traditional data is usually assumed to wavelengths with high spectral resolution, which be i.i.d. data, such as text and gene data. results in high-dimensional data that contains However, heterogeneous data, such as linked rich information for a wide range of applications. data, apparently contradicts this assumption. However, this high-dimensional data contains For example, linked data is inherently not i.i.d., many irrelevant, noisy, and redundant features since instances are linked and correlated. New that are not important, useful, or desirable for types of data cultivate new types of feature specific tasks. Feature selection is a critical selection algorithms correspondingly, such as preprocessing step to reduce computational cost feature selection for linked data and multi-view for hyperspectral data classification by selecting and multisource feature selection. relevant features. 8 Feature Selection

Applications

High-dimensional data is ubiquitous in the real world, which makes feature selection a very popular and practical preprocessing technique for various real-world applications, such as text categorization, remote sensing, image retrieval, microarray analysis, mass spectrum analysis, sequence analysis, and so on.

Text Clustering The task of text clustering is to group similar documents together. In text clustering, a text or document is usually represented as a bag of words, which leads to a high-dimensional feature space and a sparse representation; a single document has a sparse vector over the set of all terms. The performance of clustering algorithms degrades dramatically due to this high dimensionality and data sparseness. Therefore, in practice, feature selection is a very important step to reduce the feature space in text clustering.

Genomic Microarray Data Microarray data is usually "short and fat": high dimensionality with a small sample size, which poses a great challenge for computational techniques. The dimensionality can be up to tens of thousands of genes, while the sample size may only be several hundred. Furthermore, additional experimental complications, such as noise and variability, render the analysis of microarray data an exciting domain. Because of these issues, various feature selection algorithms are adopted to reduce the dimensionality and remove noise in microarray data analysis.

Hyperspectral Image Classification Hyperspectral sensors record the reflectance from the Earth's surface over the full range of solar wavelengths with high spectral resolution, which results in high-dimensional data that contains rich information for a wide range of applications. However, this high-dimensional data contains many irrelevant, noisy, and redundant features that are not important, useful, or desirable for specific tasks. Feature selection is a critical preprocessing step that reduces the computational cost of hyperspectral data classification by selecting relevant features.

Sequence Analysis In bioinformatics, sequence analysis is a very important process for understanding a sequence's features, functions, structure, or evolution. In addition to basic features that represent the nucleotide or amino acid at each position in a sequence, many other features, such as k-mer patterns, can be derived. By varying the pattern length k, the number of features grows exponentially. However, many of these features are irrelevant or redundant; thus, feature selection techniques are applied to select a relevant feature subset, which is essential for sequence analysis.

Open Problems

Scalability With the rapid growth of dataset sizes, the scalability of current feature selection algorithms may be a big issue, especially for online classifiers. Large data cannot be loaded into memory, and a single scan may be all that is affordable; however, some feature selection methods must scan the full-dimensional data, and they usually require a sufficient number of samples to obtain statistically significant results. It is also very difficult to estimate feature relevance scores without considering the density around each sample. Therefore, scalability is a big issue.

Stability Feature selection algorithms are often evaluated through classification accuracy or clustering accuracy. However, the stability of the algorithms is also an important consideration when developing feature selection methods. For example, when feature selection is applied to gene data, domain experts would like to see the same, or at least similar, sets of genes selected each time they obtain new samples with a small amount of perturbation; otherwise, they will not trust the algorithm. However, well-known feature selection methods, especially unsupervised feature selection algorithms, can select features with low stability after perturbation is introduced to the training data. Developing feature selection algorithms with both high accuracy and high stability is still an open problem.
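Stability can be quantified by comparing the feature subsets selected on perturbed versions of the training data, for example with the average pairwise Jaccard similarity. The sketch below (plain NumPy; the bootstrap perturbation, the simple correlation-based selector, and the subset size are illustrative assumptions) does exactly that.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, :5].sum(axis=1) + rng.normal(0, 1, 200) > 0).astype(int)

def select_top_k(X, y, k=10):
    """Toy selector: top-k features by absolute correlation with the label."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return set(np.argsort(scores)[-k:])

# Select features on several bootstrap perturbations of the training data
subsets = []
for _ in range(10):
    idx = rng.integers(0, X.shape[0], X.shape[0])     # bootstrap resample
    subsets.append(select_top_k(X[idx], y[idx]))

jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print("average pairwise Jaccard stability: %.2f" % np.mean(jaccard))
```

A value near 1 means the selector returns nearly the same features under perturbation; values far below 1 signal the stability problem described above.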
Parameter Selection In feature selection, we usually need to specify the number of features to select. However, the optimal number of features for a dataset is unknown. If the number of selected features is too small, performance degrades, since some relevant features are eliminated. If the number of selected features is too large, performance may also suffer, since some noisy, irrelevant, or redundant features are selected and confuse the learning model. In practice, we would grid search the number of features over a range and pick the value that gives relatively better performance for the learning model, which is computationally expensive. In particular, for supervised feature selection, cross-validation can be used to search for the number of features to select. How to automatically determine the best number of selected features remains an open problem.

For many unsupervised feature selection methods, in addition to choosing the optimal number of features, we also need to specify the number of clusters. Since there is no label information and we have limited knowledge about each domain, the actual number of clusters in the data is usually unknown and not well defined. Different numbers of clusters specified by the user lead the unsupervised feature selection algorithm to select different feature subsets. How to choose the number of clusters for unsupervised feature selection is an open problem.
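For the supervised case, the grid search over the number of selected features described above can be written as a cross-validated pipeline; the sketch below (assuming scikit-learn; the candidate values of k, the ANOVA F-score criterion, and the classifier are illustrative choices) picks the k with the best cross-validation score.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=60, n_informative=8,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),   # feature selection step
    ("clf", LogisticRegression(max_iter=1000)),      # learning model
])

# Grid search the number of selected features with cross-validation
grid = GridSearchCV(pipe, {"select__k": [5, 10, 20, 40, 60]}, cv=5)
grid.fit(X, y)
print("best number of features:", grid.best_params_["select__k"])
print("best cross-validation accuracy: %.3f" % grid.best_score_)
```

The search is still a brute-force sweep over candidate values of k; choosing k automatically, as noted above, remains open.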

Cross-References

Classification
Clustering
Dimensionality Reduction
Feature Extraction

Recommended Reading

Alelyani S, Tang J, Liu H (2013) Feature selection for clustering: a review. In: Aggarwal CC (ed) Data clustering: algorithms and applications, vol 29. CRC Press, Hoboken
Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Jain A, Zongker D (1997) Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell 19(2):153–158
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1):273–324
Koller D, Sahami M (1996) Toward optimal feature selection. Technical report, Stanford InfoLab
Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2016) Feature selection: a data perspective. arXiv preprint arXiv:1601.07996
Liu H, Motoda H (2007) Computational methods of feature selection. CRC Press, New York
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
Liu H, Motoda H, Setiono R, Zhao Z (2010) Feature selection: an ever evolving frontier in data mining. In: FSDM, Hyderabad, pp 4–13
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Tang J, Liu H (2012) Feature selection with linked data in social media. In: SDM, Anaheim. SIAM, pp 118–128
Tang J, Alelyani S, Liu H (2014) Feature selection for classification: a review. In: Aggarwal CC (ed) Data classification: algorithms and applications. Chapman & Hall/CRC, Boca Raton, p 37
Wu X, Yu K, Ding W, Wang H, Zhu X (2013) Online feature selection with streaming features. IEEE Trans Pattern Anal Mach Intell 35(5):1178–1192
Zhao ZA, Liu H (2011) Spectral feature selection for data mining. Chapman & Hall/CRC, Boca Raton
Zhao Z, Morstatter F, Sharma S, Alelyani S, Anand A, Liu H (2010) Advancing feature selection research. ASU feature selection repository, 1–28