Adaptive Random Subspace Learning (RSSL) Algorithm for Prediction
Mohamed Elshrif  [email protected]
Rochester Institute of Technology (RIT), 102 Lomb Memorial Dr., Rochester, NY 14623 USA

Ernest Fokoué  [email protected]
Rochester Institute of Technology (RIT), 102 Lomb Memorial Dr., Rochester, NY 14623 USA

arXiv:1502.02599v1 [cs.LG] 9 Feb 2015

Abstract

We present a novel adaptive random subspace learning (RSSL) algorithm for prediction purposes. The new framework is flexible in that it can be paired with any learning technique. In this paper, we test the algorithm on both regression and classification problems. In addition, we provide a variety of weighting schemes to increase the robustness of the developed algorithm. These different weighting flavors were evaluated on simulated as well as real-world data sets, covering both the case where the ratio of features (attributes) to instances (samples) is large and the reverse. The framework of the new algorithm consists of several stages: first, calculate the weights of all features in the data set using statistical measures, namely the correlation coefficient and the F-statistic. Second, randomly draw n samples with replacement from the data set. Third, perform regular bootstrap sampling (bagging). Fourth, draw the indices of the candidate variables without replacement, according to the heuristic subspacing scheme. Fifth, call the base learners and build the model. Sixth, use the model for prediction on the test set. The results show the advantage of the adaptive RSSL algorithm in most cases compared with conventional machine learning algorithms.

1. Introduction

Given a dataset D = {z_i = (x_i^⊤, y_i)^⊤, i = 1, ..., n}, where x_i = (x_{i1}, ..., x_{ip})^⊤ ∈ 𝒳 ⊂ ℝ^p and y_i ∈ 𝒴 are realizations of two random variables X and Y respectively, we seek to use the data D to build estimators f̂ of the underlying function f for predicting the response Y given the vector X of explanatory variables. In keeping with the standard in statistical learning theory, we measure the predictive performance of any given function f using the theoretical risk functional given by

R(f) = \mathbb{E}[\ell(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(x)) \, dP(x, y),        (1)

with the ideal scenario corresponding to the universally best function defined by

f^* = \mathop{\arg\inf}_{f} \{R(f)\} = \mathop{\arg\inf}_{f} \{\mathbb{E}[\ell(Y, f(X))]\}.        (2)

For classification tasks, the most commonly used loss function is the zero-one loss ℓ(Y, f(X)) = 1_{Y ≠ f(X)}, for which the theoretical universal best defined in (2) is the Bayes classifier, given by f*(x) = argmax_{y ∈ 𝒴} Pr[Y = y | x]. For regression tasks, the squared loss ℓ(Y, f(X)) = (Y − f(X))^2 is by far the most commonly used, mainly because of the wide variety of statistical, mathematical and computational benefits it offers. For regression under the squared loss, the universal best defined in (2) is known theoretically to be the conditional expectation of Y given X, specifically f*(x) = E[Y | X = x]. Unfortunately, these theoretical expressions of the best estimators cannot be realized in practice, because the distribution function P(x, y) of (X, Y) on 𝒳 × 𝒴 is unknown. To circumvent this learning challenge, one essentially has to do two foundational things, namely: (a) choose a certain function class ℱ (approximation) from which to search for the estimator f̂ of the true but unknown underlying f, and (b) specify the empirical version of (1) based on the given sample D, and use that empirical risk as the practical objective function.
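To make the empirical counterparts of these losses concrete, the following minimal Python/NumPy sketch (ours, not taken from the paper) computes the sample averages of the zero-one and squared losses, i.e. the misclassification rate and the mean squared error that an empirical risk would use as its practical objective; the function names are our own.

    import numpy as np

    def zero_one_loss(y, y_hat):
        # Empirical average of the zero-one loss 1{Y != f(X)}: the misclassification rate.
        return float(np.mean(np.asarray(y) != np.asarray(y_hat)))

    def squared_loss(y, y_hat):
        # Empirical average of the squared loss (Y - f(X))^2: the mean squared error.
        return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))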
However, in this paper, we do not construct our estimating classification functions directly from the empirical risk. Instead, we build the estimators using other optimality criteria, and then compare their predictive performances using the average test error AVTE(·), namely

\mathrm{AVTE}(\hat{f}) = \frac{1}{R} \sum_{r=1}^{R} \left\{ \frac{1}{m} \sum_{t=1}^{m} \ell\Big( y_{i_t}^{(r)}, \hat{f}_r\big(x_{i_t}^{(r)}\big) \Big) \right\},        (3)

where f̂_r(·) is the r-th realization of the estimator f̂(·) built using the training portion of the split of D into training set and test set, and (x_{i_t}^{(r)}, y_{i_t}^{(r)}) is the t-th observation of the test set at the r-th random replication of the split of D. In this paper, we consider both multiclass classification tasks with response space 𝒴 = {1, 2, ..., G} and regression tasks with 𝒴 = ℝ, and we focus on learning machines from a function class ℱ whose members are ensemble learners in the sense of Definition 1.

In machine learning, in order to improve the accuracy of a regression or classification function, scholars tend to combine multiple estimators, because it has been proven both theoretically and empirically (Tumer & Ghosh, 1995; Tumer & Oza, 1999) that an appropriate combination of good base learners leads to a reduction in prediction error. This technique is known as ensemble learning (aggregation). Regardless of the underlying algorithm used, the ensemble learning technique most of the time (on average) outperforms the single learning technique, especially for prediction purposes (van Wezel & Potharst, 2007). There are many approaches to performing ensemble learning. Among these, two ensemble learning techniques are particularly popular: bagging (Breiman, 1996) and boosting (Freund, 1995). Many variants of these two techniques have been studied previously, such as random forest (Breiman, 2001) and AdaBoost (Freund & Schapire, 1997), and applied to prediction problems. Our proposed method belongs to the subclass of ensemble learning methods known as random subspace learning.

Definition 1 Given an ensemble H = {h_1, h_2, ..., h_L} of base learners h_l : 𝒳 → 𝒴, with relative weights α_l > 0 (usually α_l ∈ (0, 1) for convex aggregation), the ensemble representation f^{(L)} of the underlying function f is given by the aggregation (weighted sum)

f^{(L)}(\cdot) = \sum_{l=1}^{L} \alpha_l h_l(\cdot).        (4)
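As an illustration of how (3) and (4) translate into code, the sketch below (our own, assuming scikit-learn style base learners with fit/predict methods) aggregates an ensemble by the weighted sum of Definition 1 and estimates the average test error over R random train/test splits; the function names, the split proportion, and the loss signature loss(y_true, y_pred) are our choices rather than the paper's.

    import numpy as np

    def ensemble_predict(learners, alphas, X):
        # Weighted-sum aggregation of equation (4): f^(L)(x) = sum_l alpha_l * h_l(x).
        return sum(a * h.predict(X) for h, a in zip(learners, alphas))

    def avte(make_estimator, X, y, loss, R=100, test_frac=0.3, seed=0):
        # Average test error of equation (3): mean test-set loss of the fitted
        # estimator over R random train/test splits of the data set D.
        rng = np.random.default_rng(seed)
        n = len(y)
        m = int(round(test_frac * n))
        errors = []
        for _ in range(R):
            perm = rng.permutation(n)
            test_idx, train_idx = perm[:m], perm[m:]
            f_r = make_estimator().fit(X[train_idx], y[train_idx])  # r-th realization of the estimator
            errors.append(loss(y[test_idx], f_r.predict(X[test_idx])))
        return float(np.mean(errors))

For instance, avte(lambda: LinearRegression(), X, y, squared_loss), with LinearRegression imported from scikit-learn, would correspond to the regression setting under the squared loss.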
A question naturally arises as to how the ensemble H is chosen, and how the weights are determined. Bootstrap aggregating, also known as bagging (Breiman, 1996), boosting (Freund & Schapire, 1996), random forests (Breiman, 2001), and bagging with subspaces (Panov & Dzeroski, 2007) are all predictive learning methods based on the ensemble learning principle, for which the ensemble is built from the provided data set D and the weights are typically taken to be equal. In this paper, we focus on learning tasks involving high dimension low sample size (HDLSS) data, and we further zero in on those data sets for which the number of explanatory variables p is substantially larger than the sample size n. As our main contribution in this paper, we introduce, develop and apply a new adaptation of the theme of random subspace learning (Ho, 1998), using the traditional multiple linear regression (MLR) model as our base learner in regression and the generalized linear model (GLM) as the base learner in classification.

Some applications by nature possess few instances (small n) and a large number of features (p ≫ n), such as fMRI (Kuncheva et al., 2010) and DNA microarray (Bertoni et al., 2005) data sets. It is hard for a traditional (conventional) algorithm to build a regression model, or to classify the data set, when it possesses a very small instances-to-features ratio. The prediction problem becomes even more difficult when this huge number of features is highly correlated, or irrelevant to the task of building such a model, as we will show later in this paper. Therefore, we harness the power of our proposed adaptive subspace learning technique to guide the choice/selection of good candidate features from the data set, and thereby select the best base learners, and ultimately the ensemble yielding the lowest possible prediction error.

In most typical random subspace learning algorithms, the features are selected according to an equally likely scheme. The question then arises as to whether one can devise a better scheme for choosing the candidate features more efficiently, with some predictive benefit. On the other hand, it is also interesting to assess the accuracy of our proposed algorithm under different levels of correlation among the features. The answer to this question constitutes one of the central aspects of our proposed method, in the sense that we explore a variety of weighting schemes for choosing the features, most of them (the schemes) based on statistical measures of the relationship between the response variable and each explanatory variable. As the computational section will reveal, the weighting schemes proposed here lead to a substantial improvement in the predictive performance of our method over random forest on all but one data set, arguably because our method leverages the accuracy of the learning algorithm by selecting many good models (since the weighting scheme allows good variables to be selected more often, and therefore leads to near-optimal base learners).

2. Related Work

Traditionally, in a prediction problem, a single model is built based on the training set, and the prediction is decided based solely on this single fitted model. In bagging, by contrast, bootstrap samples are taken from the data set, a model is fitted to each bootstrap sample, and the prediction is finally made based on the average of all the bagged models. Mathematically, the prediction accuracy of the model constructed using bagging outperforms that of the traditional model, and in the worst case it has the same performance; it must be said, however, that this depends on the stability of the modeling procedure. It turns out that bagging reduces the variance without affecting the bias, thereby leading to an overall reduction in prediction error, and hence its great appeal.
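To contrast plain bagging with the weighted random subspace idea outlined in the introduction, the following sketch (ours, not the authors' reference implementation of adaptive RSSL) bootstraps the rows of the data set, draws a subspace of columns either equally likely or according to supplied feature weights, fits a linear base learner on each replicate, and averages the predictions; the helper names, the use of scikit-learn's LinearRegression, and the default parameters are assumptions made purely for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def bagged_subspace_regressor(X, y, B=100, d=None, feature_weights=None, seed=0):
        # Fit B base learners, each on a bootstrap sample of the rows and a random
        # subspace of d columns drawn with probabilities proportional to the given
        # feature weights (plain bagging corresponds to d=None, i.e. all columns).
        rng = np.random.default_rng(seed)
        n, p = X.shape
        d = p if d is None else d
        probs = (np.full(p, 1.0 / p) if feature_weights is None
                 else np.asarray(feature_weights, dtype=float) / np.sum(feature_weights))
        models = []
        for _ in range(B):
            rows = rng.choice(n, size=n, replace=True)              # bootstrap sample (bagging)
            cols = rng.choice(p, size=d, replace=False, p=probs)    # (weighted) random subspace
            h = LinearRegression().fit(X[np.ix_(rows, cols)], y[rows])
            models.append((h, cols))
        def predict(X_new):
            # Equal-weight aggregation: average the B base-learner predictions.
            return np.mean([h.predict(X_new[:, cols]) for h, cols in models], axis=0)
        return predict

Weighting the column draw by, for example, the absolute correlation of each feature with the response is one simple way to realize a non-uniform scheme of the kind discussed above, as opposed to the equally likely scheme of standard random subspace learning.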