
Gradient Boosted Feature Selection

Zhixiang (Eddie) Xu*
Washington University in St. Louis
One Brookings Dr., St. Louis, USA
[email protected]

Gao Huang
Tsinghua University
30 Shuangqing Rd., Beijing, China
[email protected]

Kilian Q. Weinberger
Washington University in St. Louis
One Brookings Dr., St. Louis, USA
[email protected]

Alice X. Zheng*
GraphLab
936 N. 34th St. Ste 208, Seattle, USA
[email protected]

* Work done while at Microsoft Research

ABSTRACT

A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straight-forward to implement as it is based on a modification of Gradient Boosted Trees. We evaluate GBFS on several real world data sets and show that it matches or outperforms other state of the art feature selection algorithms. Yet it scales to larger data set sizes and naturally allows for side information.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: Miscellaneous; I.5.2 [Pattern Recognition]: Design Methodology - Feature evaluation and selection

General Terms

Learning

Keywords

Feature selection; Large-scale; Gradient boosting

1. INTRODUCTION

Feature selection (FS) [8] is an important problem in machine learning. In many applications, e.g., bio-informatics [21] or neuroscience [12], researchers hope to gain insight by analyzing how a classifier can predict a label and what features it uses. Moreover, effective feature selection leads to parsimonious classifiers that require less memory [25] and are faster to train and test [5]. It can also reduce feature extraction costs [29, 30] and lead to better generalization [9].

Linear feature selection algorithms such as LARS [7] are highly effective at discovering linear dependencies between features and labels. However, they fail when features interact in nonlinear ways. Nonlinear feature selection algorithms, such as [9] or the recently introduced domain-specific kernel methods [32, 23], can cope with nonlinear interactions, but their computational and memory complexity typically grow super-linearly with the training set size. As data sets grow in size, this is increasingly problematic. Balancing the twin goals of scalability and nonlinear feature selection is still an open problem.

In this paper, we focus on the scenario where data sets contain a large number of samples. Specifically, we aim to perform efficient feature selection when the number of data points is much larger than the number of features (n ≫ d). We start with the (NP-hard) feature selection problem that also motivated LARS [7] and LASSO [26]. But instead of using a linear classifier and approximating the feature selection cost with an l1-norm, we follow [31] and use gradient boosted regression trees [7], for which greedy approximations exist [2].

The resulting algorithm is surprisingly simple yet very effective. We refer to it as Gradient Boosted Feature Selection (GBFS). Following the gradient boosting framework, trees are built with the greedy CART algorithm [2]. Features are selected sparsely following an important change in the impurity function: splitting on new features is penalized by a cost λ > 0, whereas re-use of previously selected features incurs no additional penalty.

GBFS has several compelling properties. 1. As it learns an ensemble of regression trees, it can naturally discover nonlinear interactions between features. 2. In contrast to, e.g., FS with Random Forests, it unifies feature selection and classification into a single optimization. 3. In contrast to existing nonlinear FS algorithms, its time and memory complexity scales as O(dn), where d denotes the feature dimensionality and n the number of data points (in fact, if the storage of the input data is not counted, the memory complexity of GBFS scales as O(n)), and it is very fast in practice. 4. GBFS can naturally incorporate pre-specified feature cost structures or side-information, e.g., select bags of features or focus on regions of interest, similar to the generalized lasso in linear FS [19].

We evaluate this algorithm on several real-world data sets of varying difficulty and size, and we demonstrate that GBFS tends to match or outperform the accuracy and feature selection trade-off of Random Forest Feature Selection, the current state of the art in nonlinear feature selection. We also showcase the ability of GBFS to naturally incorporate side-information about inter-feature dependencies on a real world biological classification task [1]. Here, features are grouped into nine pre-specified bags with biological meaning. GBFS can easily adapt to this setting and select entire feature bags. The resulting classifier matches the best accuracy of competing methods (trained on many features) with only a single bag of features.

2. RELATED WORK

One of the most widely used feature selection algorithms is Lasso [26]. It minimizes the squared loss with l1 regularization on the coefficient vector, which encourages sparse solutions. Although scalable to very large data sets, Lasso models only linear correlations between features and labels and cannot discover non-linear feature dependencies.

[17] propose the Minimum Redundancy Maximum Relevance (mRMR) algorithm, which selects a subset of the most responsive features that have high mutual information with the labels. Their objective function also penalizes selecting redundant features. Though elegant, computing mutual information when the number of instances is large is intractable, and thus the algorithm does not scale. HSIC Lasso [32], on the other hand, introduces non-linearity by combining multiple kernel functions, each of which uses a single feature. The resulting convex optimization problem aligns this kernel with a "perfect" label kernel. The algorithm requires constructing kernel matrices for all features, thus its time and memory complexity scale quadratically with the input data set size. Moreover, both algorithms separate feature selection from classification, and require additional time and computation for training classifiers on the selected features.

Several other works avoid expensive kernel computation while maintaining non-linearity. Grafting [18] combines l1 and l0 regularization with a non-linear classifier based on a non-convex variant of the multi-layer perceptron. Feature Selection for Ranking using Boosted Trees [15] selects the top features with the highest relative importance scores. [27] and [9] use Random Forests. Finally, while not a feature selection method, [31] employ Gradient Boosted Trees to learn cascades of classifiers that reduce test-time cost by incorporating feature extraction budgets into the classifier optimization.

3. BACKGROUND

Throughout this paper we type vectors in bold (x_i), scalars in regular math type (k or C), sets in cursive (S) and matrices in capital bold (F) font. Specific entries in vectors or matrices are scalars and follow the corresponding convention.

The data set consists of input vectors x_1, ..., x_n ∈ R^d with corresponding labels y_1, ..., y_n ∈ Y drawn from an unknown distribution. The labels can be binary, categorical (multi-class) or real-valued (regression). For the sake of clarity, we focus on binary classification, Y = {−1, +1}, although the algorithm can be extended to the multi-class and regression settings as well.

3.1 Feature selection with the l1 norm

Lasso [26] combines linear classification and l1 regularization,

$$\min_{\mathbf{w}} \sum_{(\mathbf{x}_i, y_i)} \ell(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda |\mathbf{w}|_1. \qquad (1)$$

In its original formulation, ℓ(·) is defined to be the squared loss, ℓ(x_i, y_i, w) = (w^T x_i − y_i)^2. However, for the sake of feature selection, other loss functions are possible. In the binary classification setting, where y_i ∈ {−1, +1}, we use the better suited log-loss, ℓ(x_i, y_i, w) = log(1 + exp(−y_i w^T x_i)) [11].

3.2 The capped l1 norm

l1 regularization serves two purposes: it regularizes the classifier against overfitting, and it induces sparsity for feature selection. Unfortunately, these two effects of the l1-norm are inherently tied, and there is no way to regulate the impact of either one.

[33] introduce the capped l1 norm, defined by the element-wise operation

$$q(w_i) = \min(|w_i|, \varepsilon). \qquad (2)$$

Its advantage over the standard l1 norm is that once a feature is extracted, its use is not penalized further, i.e., it penalizes using many features but does not reward small weights. This is a much better approximation of the l0 norm, which only penalizes feature use without interfering with the magnitude of the weights. When ε is small enough, i.e., ε ≤ min_i |w_i|, we can compute the exact number of extracted features as |q(w)|_1 / ε. In other words, penalizing q(w) is a close proxy for penalizing the number of extracted features. However, the capped l1 norm is not convex and therefore not easy to optimize.

The capped l1 norm can be combined with a regular l1 (or l2) norm, where one can control the trade-off between feature extraction and regularization by adjusting the corresponding regularization parameters µ, λ ≥ 0:

$$\min_{\mathbf{w}} \sum_{(\mathbf{x}_i, y_i)} \ell(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda |\mathbf{w}|_1 + \mu |q(\mathbf{w})|_1. \qquad (3)$$

Here q(w) denotes the vector [q(w_1), ..., q(w_d)].
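To make the objectives in Eqs. (1)-(3) concrete, here is a minimal NumPy sketch (our own illustration; the function names are not from the paper) that evaluates the log-loss with the plain and capped l1 penalties and counts the selected features.

```python
import numpy as np

def capped_l1_objective(w, X, y, lam, mu, eps):
    """Evaluate Eq. (3): log-loss + lam * |w|_1 + mu * |q(w)|_1.

    X is an (n, d) data matrix, y holds labels in {-1, +1}, w is a (d,) weight
    vector, and q(w_i) = min(|w_i|, eps) is the capped l1 norm of Eq. (2).
    """
    margins = y * (X @ w)
    log_loss = np.sum(np.log1p(np.exp(-margins)))   # log-loss from Sec. 3.1
    l1_penalty = lam * np.sum(np.abs(w))            # plain l1 norm
    q = np.minimum(np.abs(w), eps)                  # capped l1 norm, Eq. (2)
    return log_loss + l1_penalty + mu * np.sum(q)

def num_selected_features(w, eps):
    # When eps <= min_i |w_i| over the nonzero entries, |q(w)|_1 / eps equals
    # the number of features with nonzero weight.
    q = np.minimum(np.abs(w), eps)
    return int(round(np.sum(q) / eps))
```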
4. GRADIENT BOOSTED FEATURE SELECTION

The classifier in Eq. (3) is better suited for feature selection than plain l1 regularization. However, it is still linear, which limits the flexibility of the classifier. Standard approaches for incorporating non-linearity include kernel learning [22] and boosting [3]. HSIC Lasso [32] uses kernel learning to discover non-linear feature interactions, at the price of quadratic memory and time complexity. Our method uses boosting, which is much more scalable.

Boosting assumes that one can pre-process the data with limited-depth regression trees. Let H be the set of all possible regression trees. Taking into account limited precision, and counting trees that obtain identical values on the entire training set as one and the same tree, one can assume |H| to be finite (albeit possibly large). Assuming that inputs are mapped into R^|H| through φ(x) = [h_1(x), ..., h_|H|(x)]^T, we propose to learn a linear classifier in this transformed space. Eq. (3) becomes

$$\min_{\boldsymbol{\beta}} \sum_{(\phi(\mathbf{x}_i), y_i)} \ell(\phi(\mathbf{x}_i), y_i, \boldsymbol{\beta}) + \lambda |\boldsymbol{\beta}|_1 + \mu |q(\boldsymbol{\beta})|_1. \qquad (4)$$

Here, β is a sparse linear vector that selects trees. Although it is extremely high dimensional, the optimization in Eq. (4) is tractable because β is extremely sparse. Assuming, without loss of generality, that the trees in H are sorted so that the first T entries of β are non-zero, we obtain the final classifier

$$H(\mathbf{x}) = \sum_{t=1}^{T} \beta_t h_t(\mathbf{x}). \qquad (5)$$
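The final classifier of Eq. (5) is simply a weighted sum of regression-tree outputs. A minimal sketch of evaluating such a sparse ensemble (our own naming; the choice of scikit-learn trees is an assumption, not something the paper prescribes):

```python
import numpy as np

def ensemble_predict(trees, betas, X):
    """Evaluate H(x) = sum_t beta_t * h_t(x) from Eq. (5) for every row of X.

    `trees` can be any fitted regressors exposing .predict (e.g. scikit-learn
    DecisionTreeRegressor instances); `betas` are the coefficients beta_t.
    """
    H = np.zeros(X.shape[0])
    for beta_t, h_t in zip(betas, trees):
        H += beta_t * h_t.predict(X)
    return H
```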

Feature selection.

Eq. (4) has two penalty terms: the plain l1 norm and the capped l1 norm. The first penalty term reduces overfitting while the second selects features. However, in its current form, the capped l1 norm selects trees rather than features. We therefore have to modify our setup to explicitly penalize the extraction of features.

To model the total number of features extracted by an ensemble of trees, we define a binary matrix F ∈ {0, 1}^{d×T}, where an entry F_ft = 1 if and only if tree h_t uses feature f. With this notation, we can express the total weight assigned to trees that extract feature f as

$$\sum_{t=1}^{T} |F_{ft} \beta_t|. \qquad (6)$$

We modify q(β) to instead penalize the actual weight assigned to features. The final optimization becomes

$$\min_{\boldsymbol{\beta}} \; \ell(\boldsymbol{\beta}) + \lambda |\boldsymbol{\beta}|_1 + \mu \sum_{f=1}^{d} q\!\left(\sum_{t=1}^{T} |F_{ft} \beta_t|\right). \qquad (7)$$

As before, if ε is sufficiently small (ε ≤ min_f |Σ_{t=1}^T F_{ft} β_t|), we can set µ = 1/ε and the feature selection penalty coincides exactly with the number of features used.
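A small sketch (our own illustration, not code from the paper) of the feature-usage matrix F and the penalty term of Eq. (7), assuming the trees are fitted scikit-learn regression trees:

```python
import numpy as np

def feature_usage_matrix(trees, d):
    """F[f, t] = 1 iff tree t splits on feature f (the matrix used in Eq. (6))."""
    F = np.zeros((d, len(trees)), dtype=int)
    for t, tree in enumerate(trees):
        node_features = tree.tree_.feature      # negative values mark leaf nodes
        for f in node_features[node_features >= 0]:
            F[f, t] = 1
    return F

def feature_selection_penalty(F, betas, mu, eps):
    """mu * sum_f q( sum_t |F_ft * beta_t| ), the last term of Eq. (7)."""
    per_feature_weight = np.abs(F * np.asarray(betas)).sum(axis=1)
    return mu * np.minimum(per_feature_weight, eps).sum()
```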

4.1 Optimization

The optimization problem in Eq. (7) is non-convex and non-differentiable. Nevertheless, we can minimize it effectively (up to a local fixed point) with gradient boosting [7]. Let L(β) denote the loss function to be minimized and ∇L(β)_t the gradient w.r.t. β_t. Gradient boosting can be viewed as coordinate descent where we update the dimension with the steepest gradient at every step. We can assume that the set of all regression trees H is negation closed, i.e., for each h ∈ H we also have −h ∈ H. This allows us to only follow negative gradients and always increase β; thus there is always a non-negative optimal β. The search for the dimension t* with the steepest negative gradient can be formalized as

$$t^* = \operatorname*{argmin}_{t} \nabla L(\boldsymbol{\beta})_t. \qquad (8)$$

In the remainder of this section we discuss approximate minimization strategies that do not require iterating over all possible trees.

l1-regularization.

Since each step of the optimization increases a single dimension of β by a fixed step-size α > 0, the l1 norm of β can be written in closed form as |β|_1 = αT after T iterations. This means that penalizing the l1 norm of β is equivalent to early stopping after T iterations [7]. We therefore drop the λ|β|_1 term and instead introduce T as an equivalent hyper-parameter.

Gradient decomposition.

To find the steepest descent direction at iteration T + 1, we decompose the (sub-)gradient into two parts, one for the loss ℓ(·) and one for the capped l1 norm penalty,

$$\nabla L(\boldsymbol{\beta})_t = \frac{\partial \ell}{\partial \beta_t} + \mu \sum_{f=1}^{d} \nabla q\!\left(\sum_{t=1}^{T} F_{ft} \beta_t\right). \qquad (9)$$

(Hereafter we drop the absolute value around F_ft β_t, since both F_ft and β_t are non-negative.) The gradient of q(Σ_t F_ft β_t) is not well-defined at the cusp where Σ_t F_ft β_t = ε, but we can take the right-hand limit, since β_t never decreases:

$$\nabla q\!\left(\sum_{t=1}^{T} F_{ft} \beta_t\right) = \begin{cases} F_{ft}, & \text{if } \sum_{t} F_{ft} \beta_t < \varepsilon, \\ 0, & \text{if } \sum_{t} F_{ft} \beta_t \geq \varepsilon. \end{cases} \qquad (10)$$

If we set ε = α, where α > 0 is the step size, then Σ_t F_ft β_t ≥ ε if and only if feature f has already been used in a tree from a previous iteration. Let φ_f = 1 indicate that feature f is still unused, and φ_f = 0 otherwise. With this notation we can combine the two cases and replace ∇q(Σ_{t=1}^T F_ft β_t) with φ_f F_ft. We obtain

$$\nabla L(\boldsymbol{\beta})_t = \frac{\partial \ell}{\partial \beta_t} + \mu \sum_{f=1}^{d} \phi_f F_{ft}. \qquad (11)$$

Note that φ_f F_ft = 1 if and only if feature f is extracted for the first time in tree t. In other words, the second term effectively penalizes trees that use many new (previously not selected) features.
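The per-tree penalty in Eq. (11) only charges for features that a candidate tree would use for the first time. A hedged sketch of that bookkeeping (names are ours):

```python
def new_feature_cost(candidate_features, selected, mu):
    """mu * sum_f phi_f * F_ft for a single candidate tree t.

    candidate_features: iterable of feature indices the candidate tree splits on.
    selected: set of features already used in earlier iterations (phi_f = 0).
    Re-used features are free; each previously unused feature costs mu.
    """
    return mu * len(set(candidate_features) - selected)
```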

Greedy tree construction.

With Eq. (11) we can compute the gradient with respect to any tree, but finding the optimal t* would still require searching over all trees. In the remainder of this section, we transform the search for t* from a search over all possible dimensions t into a search for the best tree h_t minimizing a pre-specified loss function. The new search can be approximated with the CART algorithm [2].

To this end, we apply the chain rule and decompose ∂ℓ/∂β_t into the derivative of the loss ℓ w.r.t. the current prediction, evaluated at each input H(x_i), and the partial derivative ∂H(x_i)/∂β_t:

$$\nabla L(\boldsymbol{\beta})_t = \sum_{i=1}^{n} \frac{\partial \ell}{\partial H(\mathbf{x}_i)} \frac{\partial H(\mathbf{x}_i)}{\partial \beta_t} + \mu \sum_{f=1}^{d} \phi_f F_{ft}. \qquad (12)$$

Note that H(x_i) = β^T h(x_i) is just a linear sum of all the h_t(x_i), the predictions over the training data. Thus ∂H(x_i)/∂β_t = h_t(x_i). If we let g_i denote the negative gradient g_i = −∂ℓ/∂H(x_i), we can reformulate Eq. (12) as

$$h_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \sum_{i=1}^{n} -g_i h_t(\mathbf{x}_i) + \mu \sum_{f=1}^{d} \phi_f F_{ft}. \qquad (13)$$

Similar to [3], we restrict H to only normalized trees (i.e., Σ_i h_t^2(x_i) = 1). We can then add the two constant terms (1/2)Σ_i h_t^2(x_i) and (1/2)Σ_i g_i^2 to Eq. (13) and complete the binomial:

$$h_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \; \frac{1}{2} \sum_{i=1}^{n} \big(g_i - h_t(\mathbf{x}_i)\big)^2 + \mu \sum_{f=1}^{d} \phi_f F_{ft}. \qquad (14)$$

This is now a penalized squared loss (an impurity function), and a good solution can be found efficiently via the greedy CART algorithm for learning regression trees [7]. The first term in Eq. (14) encourages feature splits to best match the negative gradient of the loss function, and the second term rewards splitting on features which have already been used in previous iterations. Algorithm 1 summarizes the overall algorithm in pseudo-code.

Algorithm 1 GBFS in pseudo-code.
1: Input: data {x_i, y_i}, learning rate ε, iterations T.
2: Initialize predictions H = 0 and selected feature set Ω = ∅.
3: for t = 1 to T do
4:   Learn h_t using greedy CART to minimize the impurity function in Eq. (14).
5:   Update H = H + h_t.
6:   For each feature f used in h_t, set φ_f = 0 and Ω = Ω ∪ {f}.
7: end for
8: Return H and Ω.
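To make Algorithm 1 concrete, here is a deliberately simplified sketch restricted to depth-1 regression trees (stumps), written for readability rather than speed. It is our own illustration under those assumptions, not the authors' implementation; it omits the tree normalization used in the paper and folds the step size into the learning rate `lr`.

```python
import numpy as np

def gbfs_stumps(X, y, T=100, mu=0.5, lr=0.1):
    """Simplified GBFS (Algorithm 1) with depth-1 trees and binary log-loss.

    X: (n, d) array, y: labels in {-1, +1}.
    Each iteration minimizes the penalized squared loss of Eq. (14): the best
    squared-error split for each feature, plus a cost mu if that feature has
    not been used by any earlier tree.
    """
    n, d = X.shape
    H = np.zeros(n)                        # current ensemble predictions
    selected = set()                       # Omega: features used so far
    stumps = []
    for _ in range(T):
        g = y / (1.0 + np.exp(y * H))      # negative gradient of the log-loss
        best = None
        for f in range(d):
            order = np.argsort(X[:, f])
            xs, gs = X[order, f], g[order]
            prefix = np.cumsum(gs)
            total = prefix[-1]
            for i in range(1, n):          # candidate split between i-1 and i
                if xs[i] == xs[i - 1]:
                    continue
                left_mean = prefix[i - 1] / i
                right_mean = (total - prefix[i - 1]) / (n - i)
                # squared error of the stump, up to the constant sum(g^2)
                fit = -0.5 * (i * left_mean**2 + (n - i) * right_mean**2)
                cost = fit + (mu if f not in selected else 0.0)
                if best is None or cost < best[0]:
                    thr = 0.5 * (xs[i] + xs[i - 1])
                    best = (cost, f, thr, left_mean, right_mean)
        _, f, thr, lmean, rmean = best
        selected.add(f)                    # phi_f = 0 from now on
        stumps.append((f, thr, lmean, rmean))
        H += lr * np.where(X[:, f] <= thr, lmean, rmean)
    return H, selected, stumps
```

A full implementation would replace the inner loops with a depth-limited CART learner whose impurity directly includes the µ penalty of Eq. (14).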

Figure 1: Feature selection and classification performance on a simulated data set. GBFS clearly out-performs the l1-regularized logistic regression as it successfully captures the nonlinear relations between labels and features. [Left panel, L1-LR feature selection: selected features: y; ignored features: x, z; test error: 54.05%. Right panel, GBFS feature selection (10 iterations): selected features: x, y; ignored features: z; test error: 0%.]

4.2 Structured Feature Selection

In many feature selection applications, one may have additional qualitative knowledge about acceptable sparsity patterns. Sometimes features can be grouped into bags and the goal is to select as few bags as possible. Prior work on handling structured sparsity includes the group lasso [10, 20] and the generalized lasso [19]. Our framework can easily handle structured sparsity via the feature cost indicator function φ_f. For example, we can define φ_f = 1 if and only if no feature from the same bag as f has been used in the past, and φ_f = 0 otherwise. The moment a feature from a particular bag is used in a tree, all other features in the same bag become "free" and the classifier is encouraged to use features from this bag exclusively until it starts to see diminishing returns.

In the most general setting, we can define φ_f : Ω → R_0^+ as a function that maps the set of previously extracted features to a cost. For example, one could imagine settings where feature extraction appears in stages: extracting feature f makes feature g cheaper, but not free. One such application might be classifying medical images (e.g., MRI scans) where the features are raw pixels and feature groups are local regions of interest. In this case, φ_f(Ω) may reduce the "price" of pixels surrounding those in Ω to encourage feature selection with local focus.
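A minimal sketch (our own, with hypothetical names) of the bag-structured cost described above; it plugs directly into the per-tree penalty of Eq. (14) in place of the plain indicator φ_f:

```python
def bag_cost(feature, selected, bag_of):
    """phi_f for bag-structured feature selection.

    bag_of: dict mapping a feature index to its bag id.
    selected: set of features extracted by previous trees (Omega).
    Returns 1 if no feature from f's bag has been used yet, else 0.
    """
    opened_bags = {bag_of[g] for g in selected}
    return 0 if bag_of[feature] in opened_bags else 1
```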

5. RESULTS

In this section, we evaluate GBFS against other state-of-the-art feature selection methods on synthetic as well as real-world benchmark data sets. We also examine its capacity for dealing with known sparsity patterns in a bioinformatics application. All experiments were conducted on a desktop with dual 6-core Intel i7 CPUs at 2.66 GHz, 96 GB RAM, and Linux version 2.6.32.x86_64.

5.1 Synthetic data

Figure 1 illustrates a synthetic binary classification data set with three features. The data is not linearly separable in either two or three dimensions. However, a good nonlinear classifier can easily separate the data using x and y. The z feature is simply a linear combination of x and y and thus redundant. We randomly select 90% of the instances for training and the rest for testing.

Figure 1 (left panel) illustrates results from l1-regularized logistic regression (L1-LR) [11, 16]. The regularization parameter is tuned on a hold-out set. Although L1-LR successfully detects and ignores the redundant feature z, it also assigns zero weight to x and only selects the single feature y. Consequently, it has a poor classification error rate on the test set (54.05%). In contrast, GBFS (Figure 1, right panel) not only identifies the redundant feature z, but also detects that the labels are related to a nonlinear combination of x and y. It selects both x and y and successfully separates the data, achieving 0% classification error.
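The paper does not specify the exact generating process for this data set, but one with the stated properties (a nonlinear decision boundary in x and y, plus a redundant feature z that is a linear combination of the two) can be sketched as follows; all constants here are arbitrary choices of ours.

```python
import numpy as np

def make_synthetic_data(n=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10, 60, n)
    y = rng.uniform(-20, 30, n)
    # XOR-style labels: neither x nor y alone separates the two classes,
    # so a linear model restricted to one of them cannot do well.
    labels = np.where((x > 25) ^ (y > 5), 1, -1)
    z = 0.7 * x + 0.3 * y            # redundant: a linear combination of x and y
    return np.column_stack([x, y, z]), labels
```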

5.2 Structured feature selection

In many applications there may be prior constraints on the sparsity patterns. Since GBFS can naturally incorporate pre-specified feature structures, we use it to perform structured feature selection on the Colon data set (available through the Princeton University gene expression project, http://microarray.princeton.edu/oncology/). In this data set, 40 tumor and 22 normal colon tissues are measured over 6500 human genes using Affymetrix gene chips. [1] select 2000 genes that have the highest minimal intensity across the samples. [13] further analyze these genes and cluster them into 9 clusters/bags by their biological meaning. The task is to classify whether a tissue is normal or tumor. We randomly split the 62 tissues into 80/20 training and testing sets, repeated over 10 random splits. We use the feature-bag cost function φ_f described in Section 4.2 to incorporate this side-information (setting the cost of all features in a bag to zero once the first feature from that bag is extracted). Feature selection that ignores this bag information not only performs and generalizes poorly, but is also difficult to interpret and justify.

Figure 2 shows the selected features from one random split and classification results averaged over 10 splits. Selected features are colored in green, and unselected ones are in blue. A bag is highlighted with a red/white box if at least one of its features is selected by the classifier. We compare against l1-regularized logistic regression (L1-LR) [11, 16], Random Forest feature selection (RF-FS) [9], HSIC Lasso [32] and Group Lasso [14].

As shown in Figure 2, because GBFS can incorporate the bag structures, it focusses on selecting features in one specific bag. Throughout training, it only selects features from bag 8 (highlighted with a red/white box). This conveniently reveals the association between diseases and gene clusters/bags. Similar to GBFS, Group Lasso with logistic regression can also deal with structured features. However, its l2 regularization has side effects on the feature weights, and it thus results in a much higher classification error rate of 36.15%. In contrast, l1-regularized logistic regression, Random Forest and HSIC Lasso do not take the bag information into consideration. They select scattered features from different bags, making the results difficult to interpret. In terms of classification accuracy, GBFS and Random Forest have the lowest test set error rate (15.38%), whereas l1-regularized logistic regression (L1-LR) and HSIC Lasso achieve error rates of 17.69% and 21.85%, respectively.

There are two reasons why GBFS can be accurate with features from only a single bag. First, it is indeed the case that the genes in bag 8 are very predictive of whether the tissue is malignant or benign (a result that may be of high biological value). Second, GBFS does not penalize further feature extraction inside bag 8 while the other methods do; since bag 8 features are the most predictive, penalizing against them leads to a worse classifier.

Figure 2: Feature selection on the structured feature data set (bags 1-9). Selected features are colored in green, and unselected ones in blue. A bag is highlighted with a red/white box if at least one of its features is selected by the classifier. (Some bags may require zooming in to make the selected features visible.) [Per-method summary: L1-LR: 10 features, 17.69% test error; Random Forest: 10 features, 15.38%; HSIC Lasso: 13 features, 21.85%; Group Lasso: 23 features, 36.15%; GBFS: 4 features, 15.38%.]

5.3 Benchmark data sets

Data sets.

We evaluate GBFS on real-world benchmark data sets of varying sizes, domains and levels of difficulty. Table 1 lists the data set statistics, ordered by increasing numbers of training instances. We focus on data sets with a large number of training examples (n ≫ d). All tasks are binary classification, though GBFS naturally extends to the regression setting, so long as the loss function is differentiable and continuous. Multi-class classification problems can be reduced to binary ones, either by selecting the two classes that are most easily confused or (if those are not known) by grouping the labels into two sets.

data set    pcmac  uspst  spam  isolet  mnist3vs8  adult  kddcup99
#training    1555   1606  3681    6238      11982  32562   4898431
#testing      388    401   920    1559       1984  16282    311029
#features    3289    256    57     617        784    123       122

Table 1: Data set statistics. Data sets are ordered by the number of training instances.

Baselines.

The first baseline is l1-regularized logistic regression (L1-LR) [11, 16]. We vary the regularization parameter to select different numbers of features and examine the test error rates under each setting.

We also compare against Random Forest feature selection (RF-FS) [9], a non-linear classification and feature selection algorithm. The learner builds many full decision trees by bagging training instances over random subsets of features. Feature selection is done by ranking features based on their impurity improvement score, aggregated over all trees and all splits. Features with larger impurity improvements are considered more important. For each data set, we train a Random Forest with 2000 trees and a maximum of 20 elements per leaf node. After training all 2000 trees, we rank all features. Starting from the top of the list, we re-train a random forest with only the top-k features and evaluate its accuracy on the test set. We increase the number of selected features until all features are included.

Figure 3: Classification error rates (in %) vs. feature selection performance for different algorithms on small to medium sized data sets. [Six panels (pcmac, uspst, spam, isolet, mnist 3vs8, adult), each plotting the error rate (in %) against the number of selected features for L1-LR (Lee et al., 2006), RF-FS (Hastie et al., 2009), HSIC Lasso (Yamada et al., 2012), mRMR (Peng et al., 2005) and GBFS.]

The next state-of-the-art baseline is Minimum Redundancy Maximum Relevance (mRMR) [17], a non-linear feature selection algorithm that ranks features based on their mutual information with the labels. Again, we increase the selected feature set starting from the top of the list. At each stopping point, we train an RBF kernel SVM using only the features selected so far. The hyper-parameters are tuned on 5 different random 80/20 splits of the training data. The final reported test error rates are based on the SVM trained on the full training set with the best hyper-parameter setting.

Finally, we compare against HSIC Lasso [32], a convex extension to Greedy HSIC [23]. HSIC Lasso builds a kernel matrix for each feature and combines them to best match an ideal kernel generated from the labels. It encourages feature sparsity via an l1 penalty on the linear combination coefficients. Similar to l1-regularized logistic regression, we evaluate a wide range of l1 regularization parameters to sweep out the entire feature selection curve. Since HSIC Lasso is a two-step algorithm, we train a kernel SVM with the selected features to perform classification. Similar to the mRMR experiment, we use cross-validation to select hyper-parameters and average over 5 runs.

To evaluate GBFS, we perform 5 random 80/20 training/validation splits. We use the validation set to choose the depth of the regression trees and the number of iterations (the maximum number of iterations is set to 2000). The learning rate is set to 0.1 for all data sets. In order to trace the whole error rate curve, we evaluate the algorithm at 10 values of the feature selection trade-off parameter µ in Eq. (7), i.e., µ ∈ {2^−3, 2^−2, 2^−1, 2^0, 2^1, 2^2, 2^3, 2^5, 2^7, 2^9}; a brief sketch of this sweep follows at the end of this subsection.

Error rates.

Figure 3 shows the feature selection and classification performance of the different algorithms on the small and medium sized data sets. We select up to 100 features, except for spam (d = 57) and pcmac (d = 3289). In general, l1-regularized logistic regression (L1-LR), Random Forest (RF-FS) and GBFS easily scale to all data sets. RF-FS and GBFS both clearly outperform L1-LR in accuracy on all data sets, due to their ability to capture nonlinear feature-label relationships. HSIC Lasso is very sensitive to the data size (both the number of training instances and the number of features) and only scales to the two small data sets (uspst, spam). mRMR is even more restrictive (more sensitive to the number of training instances) and thus only works for uspst. Both of them run out of memory on pcmac, which has the largest number of features. In terms of accuracy, GBFS clearly outperforms HSIC Lasso on spam but performs slightly worse on uspst. On all small and medium data sets, GBFS either outperforms RF-FS or matches its classification performance. However, very different from RF-FS, GBFS is a one-step approach that selects features and learns a classifier at the same time, whereas RF-FS requires re-training a classifier after feature selection. This means that GBFS is much faster to train than RF-FS.
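A hedged sketch of the µ sweep described above (the split logic and function names are ours; `fit_predict` stands for any GBFS training routine, e.g. a wrapper around the stump sketch of Section 4.1):

```python
import numpy as np

MUS = [2.0**p for p in (-3, -2, -1, 0, 1, 2, 3, 5, 7, 9)]

def sweep_mu(fit_predict, X_tr, y_tr, X_te, y_te, mus=MUS):
    """Trace the error-vs-#features curve by sweeping the trade-off mu.

    fit_predict(X_tr, y_tr, X_te, mu) is assumed to return (scores on X_te,
    set of selected features).
    """
    curve = []
    for mu in mus:
        scores, selected = fit_predict(X_tr, y_tr, X_te, mu)
        error = np.mean(np.sign(scores) != y_te)
        curve.append((len(selected), 100.0 * error))   # (#features, error in %)
    return sorted(curve)
```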

Figure 4: Feature selection and classification error rates (in %) for different algorithms on the large kddcup99 data set. [Error rate (in %) vs. number of selected features (log scale) for L1-LR (Lee et al., 2006), RF-FS (Hastie et al., 2009), and GBFS with µ = 2^−32, 2^−8, 2^−2, 2^2, 2^8, 2^32.]

Figure 6: Classification error rates (in %) vs. feature selection performance for different algorithms on a high dimensional (d ≫ n) data set. [SMK-CAN-187: error rate (in %) vs. number of selected features for L1-LR (Lee et al., 2006), RF-FS (Hastie et al., 2009), HSIC Lasso (Yamada et al., 2012) and GBFS.]

Large data set.

The last data set in Table 1 (kddcup99) contains close to 5 million training instances. Training on such large data sets can be very time-consuming. We limit GBFS to T = 500 trees with the default hyper-parameters of tree depth 4 and learning rate 0.1. Training a Random Forest with default hyper-parameters would take more than a week. Therefore, we limit the number of trees to 100 and the maximum number of instances per leaf node to 500. Feature selection and classification results are shown in Figure 4. For GBFS, instead of plotting the best performing result for each µ, we plot the whole feature selection iteration curve for multiple values of µ. GBFS obtains lower error rates than Random Forest (RF-FS) and l1-regularized logistic regression (L1-LR) when few features are selected. (Note that due to the extremely large data set size, even improvements of < 1% are considered significant.)

Quality.

To evaluate the quality of the selected features, we separate the contribution of the feature selection from the effect of using different classifiers. We apply all algorithms on the smallest data set (uspst) to select a subset of the features and then train an SVM with RBF kernel on the respective feature subset. Figure 5 shows the error rates as a function of the number of selected features. GBFS obtains the lowest error rates in the (most interesting) regions where only few features are selected. As more features are selected, all FS algorithms eventually converge to similar values. It is worth noting that the linear classifier (L1-LR) slightly outperforms most nonlinear methods when given enough features, which suggests that the uspst digits data requires a nonlinear classifier for prediction but not for feature discovery.

Figure 5: Error rates (in %) of SVM-RBF trained on various feature subsets obtained with different feature selection algorithms. [Panel title: feature quality. uspst: error rate (in %) vs. number of selected features for L1-LR, RF-FS, HSIC Lasso, mRMR and GBFS.]

d ≫ n scenario.

While GBFS focusses on the scenario where the number of training data points is much larger than the number of features (n ≫ d), we also evaluate GBFS on a traditional feature selection benchmark data set, SMK-CAN-187 [24], which is publicly available from [34]. This binary classification data set contains 187 data points and 19,993 features. We average results over 5 random 80/20 train-test splits. Figure 6 compares the results. GBFS out-performs l1-regularized logistic regression (L1-LR), HSIC-Lasso and Random Forest feature selection (RF-FS), though by a small margin in some regions.

Computation time and complexity.

Not surprisingly, the linear method (L1-LR) is the fastest by far. Both mRMR and HSIC Lasso take significantly more time than Random Forest and GBFS because they involve either mutual information or kernel matrix computation, which scales as O(d²) or O(n²). Random Forest builds full trees, requiring a time complexity of O(√d n log n) per tree. The dependency on √d is slightly misleading, as the number of trees required for Random Forests is also dependent on the number of features and itself scales as O(√d). In contrast, GBFS only builds limited-depth (depth 4 or 5) trees, and its computation time complexity is O(dn). The number of iterations T is independent of the number of input features d; it is only a function of the number of desired features. Empirically, we observe that the two algorithms are comparable in speed, but GBFS is significantly faster on data sets with many instances (large n). The training time ranged from several seconds to minutes on the small data sets to about one hour on the large data set kddcup99 (when the Random Forest is trained with only 500 trees and large leaf sizes). Admittedly, the empirical comparison of training time is slightly problematic because our Random Forest implementation is based on highly optimized C++ code, whereas GBFS is implemented in Matlab. We expect that GBFS could be made significantly faster if implemented in a faster programming language (e.g., C++) with the incorporation of known parallelization techniques for limited-depth trees [28].
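Making the implied totals explicit (a back-of-the-envelope reading of the complexities quoted above, not an analysis given in the paper; constants and the dependence on tree depth are ignored):

```latex
\underbrace{O(\sqrt{d})}_{\#\text{trees}} \cdot \underbrace{O(\sqrt{d}\, n \log n)}_{\text{per tree (Random Forest)}} = O(d\, n \log n),
\qquad
\underbrace{T}_{\#\text{iterations}} \cdot \underbrace{O(d\, n)}_{\text{per tree (GBFS)}} = O(T\, d\, n).
```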
6. DISCUSSION

This paper introduces GBFS, a novel algorithm for nonlinear feature selection. The algorithm quickly trains very accurate classifiers while selecting high quality features. In contrast to most prior work, GBFS is based on a variation of gradient boosting of limited depth trees [7]. This approach has several advantages. It scales naturally to large data sets, and it combines learning a powerful classifier and performing feature selection into a single step. It can easily incorporate known feature dependencies, a common setting in biomedical applications [1], medical imaging [6] and computer vision [4]. This has the potential to unlock interesting new discoveries in a variety of application domains. From a practitioner's perspective, it is now worth investigating whether a data set has inter-feature dependencies that could be provided as additional side-information to the algorithm.

One bottleneck of GBFS is that it explores features using the CART algorithm, which has a complexity of O(dn). This may become a problem in cases with millions of features. Although this paper primarily focusses on the n ≫ d scenario, as future work we plan to consider improving the scalability with respect to d. One promising approach is to restrict the search to random subsets of new features, akin to Random Forests. However, in contrast to Random Forests, the iterative nature of GBFS allows us to bias the sampling probability of a feature by its splitting value from previous iterations, thus avoiding unnecessary selection of unimportant features.

We are excited by the promising results of GBFS and believe that the use of gradient boosted trees for feature selection will lead to many interesting follow-up results. This will hopefully spark new algorithmic developments and improved feature discovery across application domains.

7. ACKNOWLEDGMENTS

KQW and ZEX were supported by NSF grants IIS-1149882 and IIS-1137211. Part of this work was done while ZEX was an intern at Microsoft Research, Redmond.
8. REFERENCES

[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745-6750, 1999.
[2] L. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.
[3] O. Chapelle, P. Shivaswamy, S. Vadrevu, K. Weinberger, Y. Zhang, and B. Tseng. Boosted multi-task learning. Machine Learning, 85(1):149-173, 2011.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886-893. IEEE, 2005.
[5] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272-279. ACM, 2008.
[6] J. A. Etzel, V. Gazzola, and C. Keysers. An introduction to anatomical ROI-based fMRI classification analysis. Brain Research, 1282:114-125, 2009.
[7] J. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, pages 1189-1232, 2001.
[8] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157-1182, 2003.
[9] T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learning. Springer, 2009.
[10] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. The Journal of Machine Learning Research, 12:3371-3412, 2011.
[11] S. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient l1 regularized logistic regression. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 401. AAAI Press / MIT Press, 2006.
[12] Y. Liu, M. Sharma, C. Gaona, J. Breshears, J. Roland, Z. Freudenburg, E. Leuthardt, and K. Q. Weinberger. Decoding ipsilateral finger movements from ECoG signals in humans. In Advances in Neural Information Processing Systems, pages 1468-1476, 2010.
[13] S. Ma, X. Song, and J. Huang. Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics, 8(1):60, 2007.
[14] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53-71, 2008.
[15] F. Pan, T. Converse, D. Ahn, F. Salvetti, and G. Donato. Feature selection for ranking using boosted trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 2025-2028. ACM, 2009.
[16] M. Y. Park and T. Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):659-677, 2007.
[17] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.
[18] S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. The Journal of Machine Learning Research, 3:1333-1356, 2003.
[19] V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16-28, 2004.
[20] V. Roth and B. Fischer. The group-lasso for generalized linear models: Uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning, pages 848-855. ACM, 2008.
[21] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507-2517, 2007.
[22] B. Schölkopf and A. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2001.
[23] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. The Journal of Machine Learning Research, 13:1393-1434, 2012.
[24] A. Spira, J. E. Beane, V. Shah, K. Steiling, G. Liu, F. Schembri, S. Gilman, Y.-M. Dumas, P. Calner, P. Sebastiani, et al. Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nature Medicine, 13(3):361-366, 2007.
[25] S. Sra. Fast projections onto l1,q-norm balls for grouped feature selection. In Machine Learning and Knowledge Discovery in Databases, pages 305-317. Springer, 2011.
[26] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), pages 267-288, 1996.
[27] E. Tuv, A. Borisov, G. Runger, and K. Torkkola. Feature selection with ensembles, artificial variables, and redundancy elimination. The Journal of Machine Learning Research, 10:1341-1366, 2009.
[28] S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In WWW, pages 387-396. ACM, 2011.
[29] Z. Xu, M. Kusner, M. Chen, and K. Q. Weinberger. Cost-sensitive tree of classifiers. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 133-141. JMLR Workshop and Conference Proceedings, 2013.
[30] Z. Xu, M. Kusner, G. Huang, and K. Q. Weinberger. Anytime representation learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1076-1084, 2013.
[31] Z. Xu, K. Weinberger, and O. Chapelle. The greedy miser: Learning under test-time budgets. In ICML, pages 1175-1182, 2012.
[32] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, and M. Sugiyama. High-dimensional feature selection by feature-wise non-linear lasso. arXiv preprint arXiv:1202.0515, 2012.
[33] T. Zhang. Multi-stage convex relaxation for learning with sparse regularization. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1929-1936. 2008.
[34] Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. Advancing feature selection research. ASU Feature Selection Repository, 2010.
APPENDIX

A. SUPPLEMENTARY RESULTS

We further extend our experimental results by incorporating more one-vs-one pairs from the MNIST data set. We randomly pick 6 one-vs-one pairs from MNIST and evaluate GBFS and other feature selection algorithms. The first baseline is l1-regularized logistic regression (L1-LR). We vary the regularization parameter to select different numbers of features and examine the error rates under these different settings. We also compare against Random Forest feature selection [9]. Following the procedure described in Section 5, we run Random Forest with 2000 trees and a maximum of 20 elements per leaf node. After training all 2000 trees, we rank all features. Starting from the most important feature, we re-train a random forest with only the selected features and evaluate it on the testing set. We gradually include less important features until all features are included. The other two baselines (mRMR and HSIC-Lasso) do not scale to the MNIST data set.

Figure 7 indicates that GBFS consistently matches Random Forest FS, and clearly out-performs l1-regularized logistic regression.
Figure 7: Classification error rates vs. feature selection performance for different algorithms on 6 randomly chosen pairs of binary classification tasks from the MNIST data set. [Six panels (MNIST 1vs7, 6vs3, 1vs8, 6vs4, 1vs9, 6vs5), each plotting the error rate (in %) against the number of selected features for L1-LR (Lee et al., 2006), RF-FS (Hastie et al., 2009) and GBFS.]