Learning Feature Engineering for Classification

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) Learning Feature Engineering for Classification Fatemeh Nargesian1, Horst Samulowitz2, Udayan Khurana2 Elias B. Khalil3, Deepak Turaga2 1University of Toronto, 2IBM Research, 3Georgia Institute of Technology [email protected], samulowitz, ukhurana @us.ibm.com, { } [email protected], [email protected] Abstract search in feature space using heuristic feature quality mea- sures (such as information gain) and other surrogate mea- Feature engineering is the task of improving pre- sures of performance [Markovitch and Rosenstein, 2002; Fan dictive modelling performance on a dataset by et al., 2010]. Others perform greedy feature construction and transforming its feature space. Existing ap- selection based on model evaluation [Dor and Reich, 2012; proaches to automate this process rely on ei- Khurana et al., 2016]. Kanter et al. proposed the Data Sci- ther transformed feature space exploration through ence Machine (DSM) which considers feature engineering evaluation-guided search, or explicit expansion of problem as feature selection on the space of novel features. datasets with all transformed features followed by DSM relies on exhaustively enumerating all possible features feature selection. Such approaches incur high com- that can be constructed from a dataset, given sequences gen- putational costs in runtime and/or memory. We erated from a set of transformations, then performing feature present a novel technique, called Learning Feature selection on the augmented dataset [Kanter and Veeramacha- Engineering (LFE), for automating feature engi- neni, 2015]. Evaluation-based and exhaustive feature enu- neering in classification tasks. LFE is based on meration and selection approaches result in high time and learning the effectiveness of applying a transfor- memory cost and may lead to overfitting due to brute-force mation (e.g., arithmetic or aggregate operators) on generation of features. Moreover, although deep neural net- numerical features, from past feature engineering works (DNN) allow for useful meta-features to be learned au- experiences. Given a new dataset, LFE recom- tomatically [Bengio et al., 2013], the learned features are not mends a set of useful transformations to be applied always interpretable and DNNs are not effective learners in on features without relying on model evaluation or various application domains. explicit feature expansion and selection. Using a In this paper, we propose LFE (Learning Feature Engi- collection of datasets, we train a set of neural net- neering), a novel meta-learning approach to automatically works, which aim at predicting the transformation perform interpretable feature engineering for classification, that impacts classification performance positively. based on learning from past feature engineering experiences. Our empirical results show that LFE outperforms By generalizing the impact of different transformations on the other feature engineering approaches for an over- performance of a large number of datasets, LFE learns useful whelming majority (89%) of the datasets from var- patterns between features, transforms and target that improve ious sources while incurring a substantially lower learning accuracy. We show that generalizing such patterns computational cost. across thousands of features from hundreds of datasets can be used to successfully predict suitable transformations for 1 Introduction features in new datasets without actually applying the trans- Feature engineering is a central task in data preparation for formations, performing model building and validation tasks, machine learning. It is the practice of constructing suitable that are time consuming. LFE takes as input a dataset and features from given features that lead to improved predictive recommends a set of paradigms for constructing new useful performance. Feature engineering involves the application features. Each paradigm consists of a transformation and an of transformation functions such as arithmetic and aggregate ordered list of features on which the transformation is suit- operators on given features to generate new ones. Transfor- able. mations help scale a feature or convert a non-linear relation At the core of LFE, there is a set of Multi-Layer Percep- between a feature and a target class into a linear relation, tron (MLP) classifiers, each corresponding to a transforma- which is easier to learn. tion. Given a set of features and class labels, the classifier Feature engineering is usually conducted by a data scien- predicts whether the transformation can derive a more useful tist relying on her domain expertise and iterative trial and feature than the input features. LFE considers the notion of error and model evaluation. To perform automated fea- feature and class relevance in the context of a transformation ture engineering, some existing approaches adopt guided- as the measure of the usefulness of a pattern of feature value 2529 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) and class label distributions, and transformation. feature selection over the augmented dataset. Cognito [Khu- Different datasets contain different feature sizes and differ- rana et al., 2016] recommends a series of transformations ent value ranges. One key challenge in generalizing across based on a greedy, hierarchical heuristic search. Cognito and different datasets is to convert feature values and their class DSM focus on sequences of feature transformations, which labels to a fixed size feature vector representation that can is outside the scope of this paper. In contrast to these ap- be fed into LFE classifiers. To characterize datasets, hand- proaches, we do not require expensive classifier evaluations crafted meta-features, fixed-size stratified sampling, neural to measure the impact of a transformation. ExploreKit [Katz networks and hashing methods have been used for different et al., 2016] also generates all possible candidate features, tasks [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015; but uses a learned ranking function to sort them. ExploreKit Weinberger et al., 2009]. However, these representations requires generating all possible candidate features when en- do not directly capture the correlation between feature val- gineering features for a new (test) dataset and as such, results ues and class labels. To capture such correlations, LFE con- reported for ExploreKit are with a time limit of three days structs a stack of fixed-size representations of feature values per dataset. In contrast, LFE can generate effective features per target class. We use Quantile Data Sketch to represent within seconds, on average. feature values of each class. Quantile has been used as a Several machine learning methods perform feature extrac- fixed-size space representation and achieves reasonably ac- tion or learning indirectly. While they do not explicitly work curate approximation to the distribution function induced by on input features and transformations, they generate new fea- the data being sketched [Greenwald and Khanna, 2001]. tures as means to solving another problem [Storcheus et al., LFE presents a computationally efficient and effective al- 2015]. Methods of that type include dimensionality reduc- ternative to other automated feature engineering approaches tion, kernel methods and deep learning. Kernel algorithms by recommending suitable transformations for features in a such as SVM [Shawe-Taylor and Cristianini, 2004] can be dataset. To showcase the capabilities of LFE, we trained LFE seen as embedded methods, where the learning and the (im- on 85K features, extracted from 900 datasets, for 10 unary plicit) feature generation are performed jointly. This is in transformations and 122K feature pairs for 4 binary trans- contrast to our setting, where feature engineering is a pre- formations, for two models: Random Forest and Logistic processing step. Regression. The transformations are listed in Table 1. We Deep neural networks learn useful features automati- empirically compare LFE with a suite of feature engineering cally [Bengio et al., 2013] and have shown remarkable suc- approaches proposed in the literature or applied in practice cesses on video, image and speech data. However, in some (such as the Data Science Machine, evaluation-based, ran- domains feature engineering is still required. Moreover, fea- dom selection of transformations and always applying the tures derived by neural networks are often not interpretable most popular transformation in the training data) on a sub- which is an important factor in certain application domains set of 50 datasets from UCI repository [Lichman, 2013], such as healthcare [Che et al., 2015]. OpenML [Vanschoren et al., 2014] and other sources. Our experiments show that, of the datasets that demonstrated any 3 Automated Feature Engineering Problem improvement through feature engineering, LFE was the most = f ,...,f effective in 89% of the cases. As shown in Figure 2, simi- Consider a dataset, , with features, 1 n , and a target class, , a setD of transformations,F {= T ,...,T} , lar results were observed for the LFE trained with Logistic T { 1 m} Regression. Moreover, LFE runs in significantly lesser time and a classification task, . The feature engineering problem q L compared to the other approaches. This also enables interac- is to find best paradigms for constructing new features such tions with a practitioner since it

Load more