Journal of Machine Learning Research 7 (2006) 1861-1885 Submitted 8/05; Revised 6/06; Published 9/06
Streamwise Feature Selection
Jing Zhou [email protected]
Electrical and Systems Engineering
University of Pennsylvania
Philadelphia, PA 19104, USA

Dean P. Foster [email protected]
Robert A. Stine [email protected]
Statistics Department
University of Pennsylvania
Philadelphia, PA 19104, USA

Lyle H. Ungar [email protected]
Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104, USA
Editor: Isabelle Guyon
Abstract

In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression, in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection evaluates each feature only once, when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features.

Keywords: classification, stepwise regression, multiple regression, feature selection, false discovery rate
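The α-investing idea summarized above can be sketched in a few lines: each arriving feature is tested once against an adaptive threshold drawn from a "wealth" of allowable false positives. The sketch below is illustrative only; the initial wealth `w0`, the payout `alpha_delta`, and the exact wealth-update rule are simplifying assumptions, not the paper's precise specification.

```python
def alpha_investing(p_values, w0=0.5, alpha_delta=0.5):
    """Streamwise feature selection via a simplified alpha-investing rule.

    p_values[i] is the p-value for the i-th candidate feature's
    contribution to the model (e.g., a t-test on its coefficient),
    computed in the order the features arrive. Each feature is
    examined exactly once. Returns indices of selected features.
    """
    wealth = w0
    selected = []
    for i, p in enumerate(p_values, start=1):
        if wealth <= 0:
            break  # no alpha-wealth left to spend on further tests
        alpha_i = wealth / (2 * i)  # bid a shrinking fraction of wealth
        if p <= alpha_i:
            selected.append(i - 1)
            wealth += alpha_delta - alpha_i  # a discovery earns wealth back
        else:
            wealth -= alpha_i  # a failed test costs its bid
    return selected

# Two strong features among noise: both are picked up in one pass.
print(alpha_investing([0.001, 0.9, 0.9, 0.0001]))
```

Because a discovery replenishes wealth, the procedure can keep testing indefinitely as long as useful features keep appearing, while a long run of uninformative features drives the threshold toward zero and controls overfitting.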
1. Introduction
In many predictive modeling tasks, one has a fixed set of observations from which a vast, or even infinite, set of potentially predictive features can be computed. Of these features, often only a small number are expected to be useful in a predictive model. Pairwise interactions and data transformations of an original set of features are frequently important in obtaining superior statistical models,