Muddling Labels for Regularization, a Novel Approach to Generalization
Karim Lounici *1   Katia Meziani *2   Benjamin Riu *1 2 3

arXiv:2102.08769v1 [stat.ML] 17 Feb 2021

Abstract

Generalization is a central problem in Machine Learning. Indeed, most prediction methods require careful calibration of hyperparameters, usually carried out on a hold-out validation dataset, to achieve generalization. The main goal of this paper is to introduce a novel approach to achieve generalization without any data splitting, based on a new risk measure which directly quantifies a model's tendency to overfit. To fully understand the intuition and advantages of this new approach, we illustrate it in the simple linear regression model ($Y = X\beta + \xi$), where we develop a new criterion. We highlight how this criterion is a good proxy for the true generalization risk. Next, we derive different procedures which tackle several structures simultaneously (correlation, sparsity, ...). Noticeably, these procedures concomitantly train the model and calibrate the hyperparameters. In addition, these procedures can be implemented via classical gradient descent methods when the criterion is differentiable w.r.t. the hyperparameters. Our numerical experiments reveal that our procedures are computationally feasible and compare favorably to the popular approach (Ridge, LASSO and Elastic-Net combined with grid-search cross-validation) in terms of generalization. They also outperform the baseline on two additional tasks: estimation and support recovery of $\beta$. Moreover, our procedures do not require any expertise for the calibration of the initial parameters, which remain the same for all the datasets we experimented on.

Introduction

Generalization is a central problem in machine learning. Regularized or constrained Empirical Risk Minimization (ERM) is a popular approach to achieve generalization [Kukačka et al., 2017]. Ridge [Hoerl and Kennard, 1970], LASSO [Tibshirani, 1996] and Elastic-net [Zou and Hastie, 2005] belong to this category. The regularization term or the constraint is added in order to achieve generalization and to enforce some specific structure on the constructed model (sparsity, low rank, coefficient positiveness, ...). This usually involves introducing hyperparameters which require calibration. The most common approach is data-splitting: the available data is partitioned into a training-set and a validation-set, and the validation-set is used to evaluate the generalization error of a model built using only the training-set.
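To fix ideas, here is a minimal sketch of this standard data-splitting workflow on simulated data, with Ridge as the estimator family and a grid of candidate regularization strengths; the grid, the 80/20 split and the data-generating parameters are illustrative assumptions, not settings taken from this paper.

```python
# Minimal sketch of hyperparameter calibration by data-splitting:
# fit Ridge on the training-set for each candidate lambda and keep the
# value with the smallest error on the hold-out validation-set.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 100, 80
beta_star = np.zeros(p)
beta_star[:10] = 1.0                              # assumed sparse ground truth
X = rng.standard_normal((n, p))
y = X @ beta_star + rng.normal(scale=10.0, size=n)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

lambdas = np.logspace(-3, 3, 13)                  # illustrative grid of candidates
val_errors = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_tr, y_tr)
    # validation criterion: ||Y_val - X_val beta(lambda, X_train, Y_train)||_{n_val}
    val_errors.append(np.sqrt(np.mean((y_val - model.predict(X_val)) ** 2)))

best_lambda = lambdas[int(np.argmin(val_errors))]
print(f"selected lambda: {best_lambda:.3g}")
```

Note that every candidate value costs one full fit and that the validation observations are never used for training; avoiding this data splitting is precisely the motivation for the approach developed below.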
Several hyperparameter tuning strategies have been designed to perform hyperparameter calibration: Grid-search, Random search [Bergstra and Bengio, 2012] or more advanced hyperparameter optimization techniques [Bergstra et al., 2011, Bengio, 2000, Schmidhuber, 1987]. For instance, BlackBox optimization [Brochu et al., 2010] is used when the evaluation function is not available [Lacoste et al., 2014]. It includes in particular Bayesian hyperparameter optimization such as Thompson sampling [Močkus, 1975, Snoek et al., 2012, Thompson, 1933]. These techniques either scale exponentially with the dimension of the hyperparameter space or require a smooth convex optimization space [Shahriari et al., 2015]. Highly non-convex optimization problems in a high-dimensional space can be tackled by population-based methods (Genetic Algorithms [Chen et al., 2018, Real et al., 2017, Olson et al., 2016], Particle Swarm [Lorenzo et al., 2017, Lin et al., 2008]), but at a high computational cost. Another family of advanced methods, gradient-based techniques, takes advantage of gradient optimization [Domke, 2012], as our method does. They fall into two categories, gradient iteration and gradient approximation. Gradient iteration directly computes the gradient w.r.t. the hyperparameters on the training/evaluation graph. This means differentiating a potentially lengthy optimization process, which is known to be a major bottleneck [Pedregosa, 2016]. Gradient approximation circumvents this difficulty through implicit differentiation [Larsen et al., 1996, Bertrand et al., 2020]. However, all these advanced methods require data-splitting to evaluate the trained model on a hold-out validation-set, unlike our approach.

Another approach is based on unbiased estimation of the generalization error of a model on the training-set (SURE [Stein, 1981], AIC [Akaike, 1974], Cp-Mallows [Mallows, 2000]). Meanwhile, other methods improve generalization during the training phase without using a hold-out validation-set. For instance, Stochastic Gradient Descent and the related batch learning techniques [Bottou, 1998] achieve generalization by splitting the training data into a large number of subsets and computing the Empirical Risk (ER) on a different subset at each step of the gradient descent. This strategy converges to a good estimate of the generalization risk provided a large number of observations is available. Bear in mind that this method and the availability of massive datasets played a crucial role in the success of deep neural networks. Although batch size has a positive impact on generalization [He et al., 2019], it cannot maximize generalization on its own.

Model aggregation is another popular approach to achieve generalization. It covers for instance Random Forest [Ho, 1995, Breiman, 2001], MARS [Friedman, 1991] and Boosting [Freund and Schapire, 1995]. This approach aggregates weak learners previously built using bootstrapped subsets of the training-set. The training time of these models is considerably lengthened when a large number of weak learners is considered, which is a requirement for improved generalization. Recall that XGBOOST [Chen and Guestrin, 2016] combines a version of batch learning and model aggregation to train weak learners.

MARS, Random Forest, XGBOOST and deep learning have obtained excellent results in Kaggle competitions and other machine learning benchmarks [Fernández-Delgado et al., 2014, Escalera and Herbrich, 2018]. However, these methods still require regularization and/or constraints in order to generalize. This implies the introduction of numerous hyperparameters which require calibration on a hold-out validation-set, for instance via Grid-search. Tuning these hyperparameters requires expensive human expertise and/or computational resources.

We approach generalization from a different point of view. The underlying intuition is the following: we no longer see generalization as the ability of a model to perform well on unseen data, but rather as the ability to avoid finding patterns where none exist. Using this approach, we derive a novel criterion and several procedures which do not require data splitting to achieve generalization.

This paper is intended to be an introduction to this novel approach. Therefore, for the sake of clarity, we consider here the linear regression setting, but our approach can be extended to more general settings such as deep learning.^1 Let us consider the linear regression model

$$Y = X\beta^* + \xi, \qquad (1)$$

where $X = (X_1, \dots, X_n)^\top$ is the $n \times p$ design matrix and the $n$-dimensional vectors $Y = (Y_1, \dots, Y_n)^\top$ and $\xi = (\xi_1, \dots, \xi_n)^\top$ are respectively the response and the noise variables. Throughout this paper, the noise level $\sigma > 0$ is unknown. Set $\|v\|_n = \big(\frac{1}{n}\sum_{i=1}^{n} v_i^2\big)^{1/2}$ for any $v = (v_1, \dots, v_n)^\top \in \mathbb{R}^n$.

^1 In another project, we applied this approach to deep neural networks on tabular data and achieved good generalization performance. We obtained results which are equivalent or superior to Random Forest and XGBOOST. We also successfully extended our approach to classification on tabular data. This project is in the final writing phase and will be posted on arXiv soon.

In practice, the correlation between $X_i$ and $Y_i$ is unknown and may actually be very weak. In this case, $X_i$ provides very little information about $Y_i$, and we expect a good procedure to avoid building a spurious connection between $X_i$ and $Y_i$. Therefore, understanding generalization as "do not fit the data in non-informative cases", we suggest creating an artificial dataset which preserves the marginal distributions while the link between $X_i$ and $Y_i$ has been completely removed. A simple way to do so is to construct an artificial set $\widetilde{D} = (X, \widetilde{Y}) = (X, \pi(Y))$ by applying a permutation $\pi \in S_n$ (the set of permutations of $n$ points) to the components of $Y$ of the initial dataset $D$, where for any $y \in \mathbb{R}^n$ we set $\pi(y) = (y_{\pi(1)}, \dots, y_{\pi(n)})^\top$.
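Here is a minimal sketch of this label-muddling construction; drawing the permutations uniformly at random and producing several permuted copies (T = 5 here) are assumptions made for illustration, since the text above only requires $\pi$ to be an element of $S_n$.

```python
# Minimal sketch of the artificial "muddled" datasets D~ = (X, pi(Y)):
# each permutation reorders the components of Y while X is left unchanged,
# so marginal distributions are preserved but the pairing (X_i, Y_i) carries
# no signal anymore.
import numpy as np

def muddle_labels(y: np.ndarray, T: int, rng: np.random.Generator) -> np.ndarray:
    """Return a (T, n) array whose t-th row is a permuted copy pi^t(y)."""
    return np.stack([rng.permutation(y) for _ in range(T)])

rng = np.random.default_rng(1)
y = rng.normal(size=20)                    # stand-in response vector
y_tilde = muddle_labels(y, T=5, rng=rng)   # five muddled copies of y; X is untouched
```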
The rest of the paper is organized as follows. In Section 1 we introduce our novel criterion and highlight its generalization performance. In Section 2, this new approach is applied to several specific data structures in order to design adapted procedures which are compatible with gradient-based optimization methods. We also point out several advantages of this new framework in an extensive numerical study. Finally, we discuss possible directions for future work in Section 3.

1. Label muddling criterion

In model (1), we want to recover $\beta^*$ from $D = (X, Y)$. Most often, the $n$ observations are partitioned into two parts of respective sizes $n_{train}$ and $n_{val}$, which we denote the train-set $(X_{train}, Y_{train})$ and the validation-set $(X_{val}, Y_{val})$ respectively. The train-set is used to build a family of estimators $\{\beta(\theta, X_{train}, Y_{train})\}_\theta$ which depends on a hyperparameter $\theta$. Next, in order to achieve generalization, we use the validation-set to calibrate $\theta$. This is carried out by minimizing the following empirical criterion w.r.t. $\theta$:

$$\|Y_{val} - X_{val}\,\beta(\theta, X_{train}, Y_{train})\|_{n_{val}}.$$

In our approach, we use the complete dataset $D$ to build the family of estimators and to calibrate the hyperparameter $\theta$.

Definition 1. Fix $T \in \mathbb{N}^*$. Let $\{\pi^t\}_{t=1}^{T}$ be $T$ permutations in $S_n$. Let $\{\beta(\theta, \cdot, \cdot)\}_\theta$ be a family of estimators. The MLR criterion is defined as

$$\mathrm{MLR}_\beta(\theta) = \|Y - X\beta(\theta, X, Y)\|_n - \frac{1}{T}\sum_{t=1}^{T} \big\|\pi^t(Y) - X\beta\big(\theta, X, \pi^t(Y)\big)\big\|_n.$$

Synthetic data. For $p = 80$, we generate observations $(X, Y) \in \mathbb{R}^p \times \mathbb{R}$ s.t. $Y = X^\top \beta^* + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma)$, $\sigma = 10$ or $50$.
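To make Definition 1 concrete, the following sketch evaluates the MLR criterion on the full dataset, with no validation split, over a grid of ridge regularization strengths and keeps the minimizer. The ridge family $\beta(\lambda, X, Y) = (X^\top X + \lambda I_p)^{-1} X^\top Y$, the grid for $\lambda$, and the grid search itself (in place of the gradient-based optimization mentioned earlier) are illustrative assumptions, not the paper's experimental protocol.

```python
# Minimal sketch of the MLR criterion of Definition 1 for a ridge family,
# evaluated over a grid of lambda values on the complete dataset.
# Data-generating choices loosely mirror the synthetic setting (p = 80,
# sigma = 10) but are otherwise illustrative assumptions.
import numpy as np

def norm_n(v: np.ndarray) -> float:
    """||v||_n = (mean of squared components)^{1/2}."""
    return float(np.sqrt(np.mean(v ** 2)))

def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """beta(lambda, X, y) = (X^T X + lambda I_p)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def mlr_criterion(X: np.ndarray, y: np.ndarray, lam: float, perms: np.ndarray) -> float:
    """MLR(lambda) = ||Y - X beta(lambda, X, Y)||_n
                     - (1/T) sum_t ||pi^t(Y) - X beta(lambda, X, pi^t(Y))||_n."""
    fit_true = norm_n(y - X @ ridge_fit(X, y, lam))
    fit_perm = np.mean([norm_n(yp - X @ ridge_fit(X, yp, lam)) for yp in perms])
    return fit_true - fit_perm

rng = np.random.default_rng(0)
n, p, sigma, T = 100, 80, 10.0, 5
beta_star = np.zeros(p); beta_star[:10] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta_star + sigma * rng.standard_normal(n)
perms = np.stack([rng.permutation(y) for _ in range(T)])   # pi^t(Y), t = 1..T

lambdas = np.logspace(-2, 4, 25)
scores = [mlr_criterion(X, y, lam, perms) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]                 # no data splitting involved
print(f"lambda minimizing the MLR criterion: {best_lam:.3g}")
```

The first term alone would always favor the least regularized fit; subtracting the average fit on the muddled labels penalizes hyperparameter values that also fit pure noise, which is the tendency to overfit the criterion is meant to quantify.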