Muddling Labels for Regularization, a novel approach to generalization

Karim Lounici * 1 Katia Meziani * 2 Benjamin Riu * 1 2 3

Abstract

Generalization is a central problem in Machine Learning. Indeed, most prediction methods require careful calibration of hyperparameters, usually carried out on a hold-out validation dataset, to achieve generalization. The main goal of this paper is to introduce a novel approach to achieve generalization without any data splitting, based on a new risk measure which directly quantifies a model's tendency to overfit. To fully understand the intuition and advantages of this new approach, we illustrate it in the simple linear regression model (Y = Xβ + ξ), where we develop a new criterion. We highlight how this criterion is a good proxy for the true generalization risk. Next, we derive different procedures which tackle several structures simultaneously (correlation, sparsity,...). Noticeably, these procedures concomitantly train the model and calibrate the hyperparameters. In addition, these procedures can be implemented via classical gradient descent methods when the criterion is differentiable w.r.t. the hyperparameters. Our numerical experiments reveal that our procedures are computationally feasible and compare favorably to the popular approach (Ridge, LASSO and Elastic-net combined with grid-search cross-validation) in terms of generalization. They also outperform the baseline on two additional tasks: estimation and support recovery of β. Moreover, our procedures do not require any expertise for the calibration of the initial parameters, which remain the same for all the datasets we experimented on.

Introduction

Generalization is a central problem in machine learning. Regularized or constrained Empirical Risk Minimization (ERM) is a popular approach to achieve generalization [Kukacka et al., 2017]. Ridge [Hoerl and Kennard, 1970], LASSO [Tibshirani, 1996] and Elastic-net [Zou and Hastie, 2005] belong to this category. The regularization term or the constraint is added in order to achieve generalization and to enforce some specific structures on the constructed model (sparsity, low-rank, coefficient positiveness,...). This usually involves introducing hyperparameters which require calibration. The most common approach is data-splitting: the available data is partitioned into a training-set and a validation-set, and the validation-set is used to evaluate the generalization error of a model built using only the training-set.

Several hyperparameter tuning strategies were designed to perform hyperparameter calibration: Grid-search, Random search [Bergstra and Bengio, 2012] or more advanced hyperparameter optimization techniques [Bergstra et al., 2011, Bengio, 2000, Schmidhuber, 1987]. For instance, BlackBox optimization [Brochu et al., 2010] is used when the evaluation function is not available [Lacoste et al., 2014]. It includes in particular Bayesian hyperparametric optimization such as Thompson sampling [Mockus, 1975, Snoek et al., 2012, Thompson, 1933]. These techniques either scale exponentially with the dimension of the hyperparameter space, or require a smooth convex optimization space [Shahriari et al., 2015]. Highly non-convex optimization problems on a high-dimensional space can be tackled by population-based methods (Genetic Algorithms [Chen et al., 2018, Real et al., 2017, Olson et al., 2016], Particle Swarm [Lorenzo et al., 2017, Lin et al., 2008]), but at a high computational cost. Another family of advanced methods, called gradient-based techniques, takes advantage of gradient optimization techniques [Domke, 2012], like our method. They fall into two categories: Gradient Iteration and Gradient Approximation. Gradient Iteration directly computes the gradient w.r.t. the hyperparameters on the training/evaluation graph. This means differentiating a potentially lengthy optimization process, which is known to be a major bottleneck [Pedregosa, 2016]. Gradient approximation is used to circumvent this difficulty, through implicit differentiation [Larsen et al., 1996, Bertrand et al., 2020]. However, all these advanced methods require data-splitting to evaluate the trained model on a hold-out validation-set, unlike our approach.

Another approach is based on unbiased estimation of the generalization error of a model (SURE [Stein, 1981], AIC [Akaike, 1974], Cp-Mallows [Mallows, 2000]) on the training-set.

Meanwhile, other methods improve generalization during the training phase without using a hold-out validation-set. For instance, Stochastic Gradient Descent and the related batch learning techniques [Bottou, 1998] achieve generalization by splitting the training data into a large number of subsets and computing the Empirical Risk (ER) on a different subset at each step of the gradient descent. This strategy converges to a good estimation of the generalization risk provided a large number of observations is available. Bear in mind that this method and the availability of massive datasets played a crucial role in the success of deep neural networks. Although batch size has a positive impact on generalization [He et al., 2019], it cannot maximize generalization on its own.

Model aggregation is another popular approach to achieve generalization. It concerns for instance Random Forest [Ho, 1995, Breiman, 2001], MARS [Friedman, 1991] and Boosting [Freund and Schapire, 1995]. This approach aggregates weak learners previously built using bootstrapped subsets of the training-set. The training time of these models is considerably lengthened when a large number of weak learners is considered, which is a requirement for improved generalization. Recall that XGBOOST [Chen and Guestrin, 2016] combines a version of batch learning and model aggregation to train weak learners.

MARS, Random Forest, XGBOOST and deep learning have obtained excellent results in Kaggle competitions and other benchmarks [Fernandez-Delgado et al., 2014, Escalera and Herbrich, 2018]. However, these methods still require regularization and/or constraints in order to generalize. This implies the introduction of numerous hyperparameters which require calibration on a hold-out validation-set, for instance via Grid-search. Tuning these hyperparameters requires expensive human expertise and/or computational resources.

We approach generalization from a different point of view. The underlying intuition is the following: we no longer see generalization as the ability of a model to perform well on unseen data, but rather as the ability to avoid finding patterns where none exist. Using this approach, we derive a novel criterion and several procedures which do not require data splitting to achieve generalization.

This paper is intended to be an introduction to this novel approach. Therefore, for the sake of clarity, we consider here the linear regression setting, but our approach can be extended to more general settings like deep learning¹. Let us consider the linear regression model:

Y = Xβ* + ξ,   (1)

where X = (X_1, ..., X_n)^T is the n × p design matrix and the n-dimensional vectors Y = (Y_1, ..., Y_n)^T and ξ = (ξ_1, ..., ξ_n)^T are respectively the response and the noise variables. Throughout this paper, the noise level σ > 0 is unknown. Set ||v||_n = ((1/n) Σ_{i=1}^n v_i^2)^{1/2} for any v = (v_1, ..., v_n)^T ∈ R^n.

¹ In another project, we applied this approach to deep neural networks on tabular data and achieved good generalization performance. We obtained results which are equivalent or superior to Random Forest and XGBOOST. We also successfully extended our approach to classification on tabular data. This project is in the final writing phase and will be posted on arXiv soon.

In practice, the correlation between X_i and Y_i is unknown and may actually be very weak. In this case, X_i provides very little information about Y_i and we expect a good procedure to avoid building a spurious connection between X_i and Y_i. Therefore, by understanding generalization as "do not fit the data in non-informative cases", we suggest creating an artificial dataset which preserves the marginal distributions while the link between X_i and Y_i has been completely removed. A simple way to do so is to construct an artificial set D̃ = (X, Ỹ) = (X, π(Y)) by applying a permutation π ∈ S_n (the set of permutations of n points) to the components of Y of the initial dataset D, where for any y ∈ R^n we set π(y) = (y_{π(1)}, ..., y_{π(n)})^T.

The rest of the paper is organized as follows. In Section 1 we introduce our novel criterion and highlight its generalization performance. In Section 2, this new approach is applied to several specific data structures in order to design adapted procedures which are compatible with gradient-based optimization methods. We also point out several advantageous properties of this new framework in an extensive numerical study. Finally, we discuss possible directions for future work in Section 3.

1. Label muddling criterion

In model (1), we want to recover β* from D = (X, Y). Most often, the n observations are partitioned into two parts of respective sizes n_train and n_val, which we denote the train-set (X_train, Y_train) and the validation-set (X_val, Y_val) respectively. The train-set is used to build a family of estimators {β(θ, X_train, Y_train)}_θ which depends on a hyperparameter θ. Next, in order to achieve generalization, we use the validation-set to calibrate θ. This is carried out by minimizing the following empirical criterion w.r.t. θ:

||Y_val − X_val β(θ, X_train, Y_train)||_{n_val}.

In our approach, we use the complete dataset D to build the family of estimators and to calibrate the hyperparameter θ.
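The label-muddling construction above only requires drawing permutations of the indices of Y while leaving the design X untouched. Below is a minimal PyTorch sketch of this step (the paper's procedures are implemented in PyTorch); the toy data generator, the helper name `muddle_labels` and the seed are illustrative assumptions, not part of the released package.

```python
import torch

def muddle_labels(Y: torch.Tensor, T: int, generator=None):
    """Return a (T, n) tensor whose t-th row is pi^t(Y): the response vector
    with its components shuffled by a random permutation pi^t."""
    n = Y.shape[0]
    perms = torch.stack([torch.randperm(n, generator=generator) for _ in range(T)])
    # each row keeps the marginal distribution of Y, but the link between
    # X_i and Y_i is destroyed
    return Y[perms]

# toy usage: only the labels are permuted, the design X is left as-is
n, p, T = 100, 80, 30
X = torch.randn(n, p)
Y = X @ torch.randn(p) + 10.0 * torch.randn(n)
Y_tilde = muddle_labels(Y, T, generator=torch.Generator().manual_seed(0))
```

In the procedures of Section 2, the T permutations are drawn once and then kept fixed while the hyperparameter θ is optimized.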

Definition 1. Fix T ∈ N*. Let {π^t}_{t=1}^T be T permutations in S_n. Let {β(θ, ·, ·)}_θ be a family of estimators. The MLR criterion² is defined as:

MLR_β(θ) = ||Y − X β(θ, X, Y)||_n − (1/T) Σ_{t=1}^T ||π^t(Y) − X β(θ, X, π^t(Y))||_n.   (2)

² Muddling Labels for Regularization.

The MLR criterion performs a trade-off between two antagonistic terms. The first term fits the data while the second term prevents overfitting. Since the MLR criterion is evaluated directly on the whole sample, without any hold-out validation-set, this approach is particularly useful for small sample sizes, where data-splitting approaches can produce strongly biased performance estimates [Vabalas et al., 2019, Varoquaux, 2018].
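For a family with a closed-form estimator, criterion (2) can be evaluated directly. The sketch below computes MLR_β(λ) for the Ridge family used in this section and selects λ on a grid by minimizing it; it reuses the tensors X, Y and Y_tilde from the previous sketch, and the helper names (`ridge`, `mlr_criterion`) and the grid are our own illustrative choices, not the released implementation.

```python
import torch

# X, Y and Y_tilde are assumed to be the tensors built in the previous sketch.

def ridge(X, Y, lam):
    """Closed-form Ridge estimator beta^R(lambda, X, Y) = (X^T X + lambda I_p)^{-1} X^T Y."""
    p = X.shape[1]
    return torch.linalg.solve(X.T @ X + lam * torch.eye(p), X.T @ Y)

def mlr_criterion(X, Y, Y_tilde, lam):
    """Criterion (2) for the Ridge family: fit on the true labels minus the
    average fit obtained on the T permuted ("muddled") label vectors."""
    n = X.shape[0]
    fit_true = torch.norm(Y - X @ ridge(X, Y, lam)) / n ** 0.5
    fit_perm = torch.stack([torch.norm(Yt - X @ ridge(X, Yt, lam)) / n ** 0.5
                            for Yt in Y_tilde])
    return fit_true - fit_perm.mean()

# grid-search calibration of lambda with the MLR criterion, as in this section
lambdas = torch.logspace(-4, 4, steps=50)
scores = torch.stack([mlr_criterion(X, Y, Y_tilde, lam) for lam in lambdas])
lam_hat = lambdas[scores.argmin()]
```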

We highlight in the following numerical experiment the remarkable generalization performance of the MLR criterion.

NUMERICAL EXPERIMENTS.

We consider two families

F(θ) = {β(θ, X_train, Y_train)}_θ

of estimators constructed on a train-dataset and indexed by θ: Ridge and LASSO. We are interested in the problem of calibration of the hyperparameter θ on a grid Θ. We compare two criteria C(θ) to calibrate θ: the MLR criterion and cross-validation (implemented as RidgeCV and LassoCV in Scikit-learn [Pedregosa et al., 2011]). For each criterion C(θ), the final estimator β̂_train = β(θ̂, X_train, Y_train) is such that

θ̂ = argmin_{θ∈Θ} C(θ).

R²-score. For each family, the generalization performance of each criterion is evaluated using the hold-out test-dataset D_test = (X_test, Y_test) by computing the following R²-score:

R²(β̂_train) = 1 − ||Y_test − X_test β̂_train||²_2 / ||Y_test − Ȳ_test 1_n||²_2   (≤ 1),   (3)

where Ȳ_test is the empirical mean of Y_test. For the sake of simplicity, we set R²(θ̂) = R²(β̂_train). The oracle we aim to match is

θ* = argmax_{θ∈Θ} R²(θ).
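The R²-score in (3) is the usual coefficient of determination on the held-out test set. A minimal sketch, assuming the test data are available as tensors (the helper name `r2_score` is ours):

```python
import torch

def r2_score(X_test, Y_test, beta_hat):
    """R^2-score of (3): one minus the ratio of residual to total sum of squares
    on the hold-out test set (values <= 1, higher is better)."""
    residual = torch.sum((Y_test - X_test @ beta_hat) ** 2)
    total = torch.sum((Y_test - Y_test.mean()) ** 2)
    return 1.0 - residual / total

# e.g. evaluate the lambda selected by the MLR criterion in the previous sketch:
# r2 = r2_score(X_test, Y_test, ridge(X_train, Y_train, lam_hat))
```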

Our first numerical experiments concern synthetic data³.

³ Our Python code is released as an open source package for replication in the github repository ICML2021SupplementaryMaterial.

Synthetic data. For p = 80, we generate observations (X, Y) ∈ R^p × R such that Y = X^T β* + ε, with ε ∼ N(0, σ), σ = 10 or 50. We consider three different scenarii. Scenario A (correlated features) corresponds to the case where the LASSO is prone to fail and Ridge should perform better. Scenario B (sparse setting) corresponds to a case known as favorable to LASSO. Scenario C combines sparsity and correlated features. For each scenario we sample a train-dataset of size n_train = 100 and a test-dataset of size n_test = 1000. For each scenario, we perform M = 100 repetitions of the data generation process to produce M pairs of train/test datasets. Details on the data generation process can be found in the Appendix.

Performances evaluation. For each family F(θ) and each criterion C(θ), we construct on every train-dataset the corresponding model β̂_train. Next, using the corresponding hold-out test-dataset D_test = (X_test, Y_test), we compute their R²-scores.

Impact of T. In Figure 1 we study the impact of the number of permutations T on the generalization performance of the criterion, measured via the R²-score in (3). The most striking finding is the sharp increase in the generalization performance from the first added permutation in Scenarios A and C. Adding more permutations does not impact the generalization but actually improves the running time and stability of the novel procedures which we will introduce in the next section. In a pure sparsity setting (Scenario B with LASSO), adding permutations marginally increases the generalization.

Figure 1. Variation of the R²-score w.r.t. the number of permutations (panels: Ridge, Scenario A; LASSO, Scenario B; LASSO, Scenario C).

Comparison of generalization performance. For Scenario A with the Ridge family, we compute the difference R²(θ̂^{C(θ)}) − R²(θ*) for the two criteria C(θ) (MLR and CV). For Scenarios B and C, we consider the LASSO family and compute the same difference. Boxplots in Figure 2 summarize our findings over the 100 repetitions of the synthetic data. The empirical mean is depicted by a green triangle on each boxplot. Moreover, to check for a statistically significant margin in R²-scores between different procedures, we use the Mann-Whitney test (as detailed in [Kruskal, 1957] and implemented in scipy [Virtanen et al., 2020]). The boxplots highlighted in yellow correspond to the best procedures according to the Mann-Whitney (MW) test. As we observed, the MLR criterion performs better than CV for the calibration of the Ridge and LASSO hyperparameters in Scenarios A and C, in the presence of correlation in the design matrix.

Figure 2. Generalization performance (R²-score difference with the oracle) for MLR and CV, for Ridge and LASSO hyperparameter calibration.

Plot of the generalization performance. In order to plot the different criteria together, we use the following rescaling. Let F : Θ → R. We set F_min = min_{θ∈Θ} F(θ) and F_max = max_{θ∈Θ} F(θ), and for any θ ∈ Θ we define

Λ_F(θ) = (F(θ) − F_min) / (F_max − F_min).

Figure 3 contains the rescaled versions of the R²-score on the test set and of the MLR and CV criteria computed on the train-dataset. The vertical lines correspond to the values of the hyperparameter selected in the grid Θ by the MLR and CV criteria, as well as the optimal hyperparameter θ* for the R²-score on the test set. The MLR criterion is a good proxy for the generalization performance (R²-score on the test-dataset) on the whole grid Θ for the Ridge in Scenario A. In Scenario C with the LASSO, the MLR criterion is a smooth function with steep variations in a neighborhood of its global optimum. This is an ideal configuration for the implementation of gradient descent schemes.

Figure 3. Criterion landscape for grid-search calibration with MLR (up) and CV (down), as a function of the hyperparameter value.

In summary, the MLR criterion performs better than CV on correlated data for grid-search calibration of the LASSO and Ridge hyperparameters. However, CV works better in the pure sparsity scenario. This motivated the introduction of novel procedures based on the MLR principle which can handle the sparse setting better.

2. Novel procedures

In Section 1, we used (2) to perform grid-search calibration of the hyperparameter. However, if {β(θ)}_θ is a family of models differentiable w.r.t. θ, we can minimize (2) w.r.t. θ via standard gradient-based methods. This motivated the introduction of new procedures based on the MLR criterion.

Definition 2. Consider {β(θ)}_θ a family of models differentiable w.r.t. θ. Let {π^t}_{t=1}^T be T derangements in S_n. The β-MLR procedure is

β̂ = β(θ̂) with θ̂ = argmin_θ MLR_β(θ),   (4)

where MLR_β is defined in (2).

Moreover, using our approach, we can also enforce several additional structures simultaneously (sparsity, correlation, group sparsity, low-rank,...) by constructing appropriate families of models. In this regard, let us consider the three following procedures, which do not require a hold-out validation-set.

R-MLR procedure for correlated designs. The Ridge family of estimators {β^R(λ, X, Y)}_{λ>0} is defined as follows:

β^R(λ, X, Y) = (X^T X + λ I_p)^{-1} X^T Y,  λ > 0.   (5)

Applying Definition 2 with the Ridge family, we obtain the R-MLR procedure β̂^R = β^R(λ̂) where λ̂ = argmin_{λ>0} MLR_{β^R}(λ). This new optimisation problem can be solved by gradient descent, contrarily to the previous section where we performed a grid-search calibration of λ.
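Because β^R(λ) in (5) is differentiable w.r.t. λ, criterion (2) can be minimized by plain gradient descent. The sketch below uses ADAM with the learning rate, betas and initial λ reported later in Table 1; the log-parametrization of λ, the fixed iteration budget (instead of the tolerance-based stop used in the paper) and the function names are our own illustrative choices.

```python
import math
import torch

def r_mlr(X, Y, Y_tilde, n_iter=1000, lr=0.5, lam_init=1e3):
    """R-MLR sketch: minimize MLR_{beta^R}(lambda) from (2) w.r.t. lambda with ADAM.
    lambda is parametrized as exp(log_lam) so that it stays positive."""
    n, p = X.shape
    log_lam = torch.tensor(math.log(lam_init), requires_grad=True)
    opt = torch.optim.Adam([log_lam], lr=lr, betas=(0.5, 0.9))  # defaults of Table 1

    def ridge(labels, lam):
        # closed-form Ridge estimator (5); differentiable w.r.t. lam
        return torch.linalg.solve(X.T @ X + lam * torch.eye(p), X.T @ labels)

    for _ in range(n_iter):
        opt.zero_grad()
        lam = log_lam.exp()
        fit_true = torch.norm(Y - X @ ridge(Y, lam)) / n ** 0.5
        fit_perm = torch.stack([torch.norm(Yt - X @ ridge(Yt, lam)) / n ** 0.5
                                for Yt in Y_tilde]).mean()
        loss = fit_true - fit_perm        # MLR criterion (2)
        loss.backward()
        opt.step()

    lam_hat = log_lam.exp().detach()
    return ridge(Y, lam_hat), lam_hat
```

The same loop carries over to the families defined next: one simply adds their extra regularization parameters to the list handed to the optimizer.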

S-MLR for sparse models. We design in Definition 3 below a novel differentiable family of models to enforce sparsity in the trained model. Applying Definition 2 to this family, we can derive the S-MLR procedure: β̂^S = β^S(θ̂) where θ̂ = argmin_θ MLR_{β^S}(θ).

Definition 3. Let {β^S(λ, κ, γ, X, Y)}_{(λ,κ,γ) ∈ R*_+ × R*_+ × R^p} be a family of closed-form estimators defined as follows:

β^S(λ, κ, γ, X, Y) = S(κ, γ) β^R(λ, X S(κ, γ), Y),   (6)

where β^R is defined in (5) and the quasi-sparsifying function S : R*_+ × R^p → ]0, 1[^{p×p} is such that

S(κ, γ) = diag(S_1(κ, γ), ..., S_p(κ, γ)),

where for any j = 1, ..., p,

S_j : R*_+ × R^p → ]0, 1[,
(κ, γ) ↦ S_j(κ, γ) = (1 + e^{−κ (σ_γ^2 + 10^{-2})^{-1} (γ_j − γ̄)})^{-1},

with γ̄ = (1/p) Σ_{i=1}^p γ_i and σ_γ^2 = Σ_{i=1}^p (γ_i − γ̄)^2.
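The quasi-sparsifying function is a smooth per-feature gate applied both to the columns of X and to the Ridge coefficients, so the whole family stays differentiable in (λ, κ, γ). A minimal sketch follows; the helper names `quasi_sparsify` and `beta_S` are ours, and the exact scaling constant reflects our reading of the (partly garbled) display in Definition 3.

```python
import torch

def quasi_sparsify(kappa, gamma):
    """Diagonal of S(kappa, gamma): a sigmoid gate in ]0,1[ per feature, centered
    at the mean of gamma and scaled by its dispersion (Definition 3)."""
    g_bar = gamma.mean()
    sigma2 = ((gamma - g_bar) ** 2).sum()
    return torch.sigmoid(kappa * (gamma - g_bar) / (sigma2 + 1e-2))

def beta_S(X, Y, lam, kappa, gamma):
    """beta^S of (6): Ridge regression on the gated design X * s, gated again on output."""
    s = quasi_sparsify(kappa, gamma)   # shape (p,), acts as diag(S(kappa, gamma))
    Xs = X * s                         # X S(kappa, gamma)
    p = X.shape[1]
    return s * torch.linalg.solve(Xs.T @ Xs + lam * torch.eye(p), Xs.T @ Y)
```

Plugging `beta_S` into the gradient loop of the previous sketch, with (log λ, κ, γ) optimized jointly, gives a sketch of the S-MLR procedure.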

The new family (6) enforces sparsity on the regression vector but also directly onto the design matrix. Hence it can be seen as a combination of data-preprocessing (performing feature selection) and model training (using the Ridge estimator).

Noticeably, the "quasi-sparsifying" trick transforms feature selection (a discrete optimization problem) into a continuous optimization problem which is solvable via classical gradient-based methods. The function S produces diagonal matrices with diagonal coefficients in ]0, 1[. Although the sigmoid function S_j cannot take the values 0 or 1, for very small or large values of γ_j the corresponding diagonal coefficient of S(κ, γ) is extremely close to 0 or 1. In those cases, the resulting model is weakly sparse in our numerical experiments. Thresholding can then be used to perform feature selection.

A-MLR for correlated designs and sparsity. Aggregation is a statistical technique which combines several estimators in order to attain higher generalization performance [Tsybakov, 2003]. We propose in Definition 4 a new aggregation procedure to combine the estimators (5) and (6). This essentially consists in an interpolation between the β^R and β^S models, where the coefficient of interpolation is quantified via the introduction of a new regularization parameter µ ∈ R.

Definition 4. We consider the family of models {β^A(λ, κ, γ, µ, X, Y)}_{(λ,κ,γ,µ) ∈ R*_+ × R*_+ × R^p × R} with

β^A(θ, X, Y) = S(µ) × β^R(λ, X, Y) + (1 − S(µ)) × β^S(λ, κ, γ, X, Y),   (7)

where β^R(λ, X, Y) and β^S(λ, κ, γ, X, Y) are defined in (5) and (6) respectively, and S is the sigmoid⁴.

⁴ For µ ∈ R, S(µ) takes values in (0, 1) and is actually observed in practice to be close to 0 or 1.

Applying Definition 2 to this family, we can derive the A-MLR procedure: β̂^A = β^A(θ̂) where θ̂ = argmin_θ MLR_{β^A}(θ). This procedure is designed to handle both correlation and sparsity.

Figure 4. Distribution of the value of S(µ̂) over 100 repetitions on synthetic data (Scenarios A, B and C).

Figure 4 shows that β̂^A behaves almost as a selector which picks the most appropriate family of models between β^R and β^S. Indeed, in all scenarios, S(µ̂) only takes values close either to 0 or to 1 in order to adapt to the structure of the model. Moreover, in Scenario B (sparsity), β̂^A always selects the sparse model: we always have S(µ̂) ≤ 0.002 over the 100 repetitions.

Algorithmic complexity. Using the MLR criterion, we develop fully automatic procedures to tune regularization parameters while simultaneously training the model in a single run of the gradient descent algorithm, without a hold-out validation set. The computational complexity of our methods is O(n(p + r)K), where n, p, r, K denote respectively the number of observations, features, regularization parameters and iterations of the gradient descent algorithm. The computational complexity of our method grows only arithmetically w.r.t. the number of regularization parameters.

NUMERICAL EXPERIMENTS. We performed numerical experiments on the synthetic data from Section 1 and also on real datasets described below.

Real data. We test our methods on several commonly used real datasets (from the UCI [Asuncion and Newman, 2007] and Svmlib [Chang and Lin, 2011] repositories); see the Appendix for more details. Each selected UCI dataset is split randomly into an 80% train-dataset and a 20% test-dataset. We repeat this operation M = 100 times to produce M pairs of (train, test)-datasets. In order to test our procedures in the setting n ≤ p, we selected, from Svmlib, the news20 dataset, which contains a train and a test dataset.
We fixed the number of features p and sampled six new 20news train-datasets of different sizes n from the initial news20 train-dataset. For each size n, we perform M = 100 repetitions of the sampling process to produce M train-datasets. We kept the initial test-set for the evaluation of the generalization performances.

Number of iterations. We choose to solve (4) using ADAM, but other gradient-descent methods could be used. Figure 5 contains the boxplots of the number of ADAM iterations for the MLR procedures on the synthetic and real datasets over the M = 100 repetitions. Although MLR_{β^S} and MLR_{β^A} are highly non-convex, the number of iterations required for convergence is always of the order of a few dozen in our experiments. This was already observed in other non-convex settings [Kingma and Ba, 2014].

Figure 5. Synthetic, UCI and 20news data: number of iterations.

Running time. Our procedures were coded in PyTorch to underline how they can be parallelized on a GPU. A comparison of the running time with the benchmark procedures is not pertinent as they are implemented on CPU by Scikit-learn. The main point of our experiments was rather to show how the MLR procedures can be successfully parallelized. This opens promising prospects for the MLR approach in deep learning frameworks.

From a computational point of view, the matrix inversion in (5) is not expensive in our setting as long as the covariance matrix can fully fit on the GPU⁵. Figures 6 and 7 confirm the running time is linear in n and p for the MLR procedures. This confirms the MLR procedures are scalable. Consequently, our procedures run in reasonable time, as illustrated in Figures 8 and 9.

⁵ Inversion of a p × p matrix has a p² complexity on CPU, but parallelization schemes provide linear complexity on GPU when some memory constraints are met [Murugesan et al., 2018, Nath et al., 2010, Chrzeszczyk and Chrzeszczyk, 2013].

Figure 6. UCI data: running time as a function of n. Figure 7. 20news data: running time as a function of p.

Figure 8. Synthetic data: running times in seconds. Figure 9. UCI data (left) and 20news data (right): running times.

Initial parameters. Strikingly, the initial values of the parameters (see Table 1) used to implement our MLR procedures could remain the same for all the datasets we considered while still yielding consistently good prediction performances. These initial values were calibrated only once, in the standard setting (n ≥ p), on the Boston dataset [Harrison Jr and Rubinfeld, 1978, Belsley et al., 2005], which we did not include in our benchmark when we evaluated the performance of our procedures. We emphasize again that we used these values without any modification on all the synthetic and real datasets. The synthetic and UCI datasets fall into the standard setting. Meanwhile, the 20news datasets correspond to the high-dimensional setting (p ≫ n). As such, it might be possible to improve the generalization performance by using a different set of initial parameters better adapted to the high-dimensional setting. This will be investigated in future work.

However, in this paper, we did not intend to improve the generalization performance by trying to tune the initial parameters for each specific dataset. This was not the point of this project. We rather wanted to highlight that our gradient-based methods compare favorably in terms of generalization with benchmark procedures just by using the default initial values in Table 1.

Table 1. Parameters for the MLR procedures.

  Optimization parameters          Parameter initialization
  Tolerance        10^-4           T        30
  Max. iter.       10^3            λ        10^3
  Learning rate    0.5             γ        0_p
  Adam β1          0.5             κ        0.1
  Adam β2          0.9             µ        0

We also studied the impact of the parameter T on the performance of the MLR procedures on the synthetic data. In Figure 10, the generalization performance (R²-score) increases significantly from the first added permutation (T = 1). Starting from T ≈ 10, the R²-score has converged to its maximum value. An even more striking phenomenon is the gain observed in the running time when we add T permutations (for T in the range from 1 to approximately 100), when compared with the usual empirical risk (T = 0). Larger values of T are neither judicious nor needed in this approach. In addition, the number of iterations needed for ADAM to converge is divided by 3 starting from the first added permutation. Furthermore, this number of iterations remained stable (below 20) starting from T = 1. Based on these observations, the hyperparameter T does not require calibration. We fixed T = 30 in our experiments even though T = 10 might have been sufficient.

Figure 10. Synthetic data: impact of T on the procedures (running time, R²-score and number of iterations).

Performance comparisons. We compare our MLR procedures against cross-validated Ridge, LASSO and Elastic-net (implemented as RidgeCV, LassoCV and ElasticnetCV in Scikit-learn [Pedregosa et al., 2011]) on simulated and real datasets. Our procedures are implemented in PyTorch [Paszke et al., 2019] on the centered and rescaled response Y. Complete details and results can be found in the Appendix. In our approach, θ can always be tuned directly on the train set, whereas for benchmark procedures like LASSO, Ridge and Elastic-net, θ is typically calibrated on a hold-out validation-set, for instance using grid-search CV.

Generalisation performance. Figures 11 and 12 show the MLR procedures consistently attain the highest R²-scores for the synthetic and UCI data according to the Mann-Whitney test over the M = 100 repetitions. Regarding the 20news datasets, the MLR procedures are always within 0.05 of the best (E-net).

Figure 11. Synthetic data: R²-score. Figure 12. UCI data (left) and 20news data (right): R²-score.

Estimation of β* and support recovery accuracy. For the synthetic data, we also consider the estimation of the regression vector β*. We use the l2-norm estimation error ||β̂ − β*||_2 to compare the procedures. As we can see in Figure 13, the MLR procedures perform better than the benchmark procedures.

Figure 13. β* estimation (best in yellow according to MW).

We finally study the support recovery accuracy in the sparse setting (Scenario B). We want to recover the support J(β*) = {j : β*_j ≠ 0}. For our procedures, we build the estimator Ĵ(β̂) = {j : |β̂_j| > τ̂}, where the threshold τ̂ corresponds to the first sharp decline of the coefficients |β̂_j|. Denote by #J the cardinality of a set J. The support recovery accuracy is measured as follows:

Acc(β̂) := ( #{J(β*) ∩ Ĵ(β̂)} + #{J^c(β*) ∩ Ĵ^c(β̂)} ) / p.

Our simulations confirm that β̂^S is a quasi-sparse vector. Indeed, we observe in Figure 14 a sharp decline of the coefficients |β̂^S_j|. Thus we set the threshold τ̂ at 10^{-3}.

Figure 14. Coefficient values in decreasing order (blue, log scale) and pruning threshold (red), with p = 80.

Overall, β̂^S and β̂^A perform better for support recovery than the benchmark procedures. Moreover, in Scenario B, favorable to LASSO, our procedures perform far better (Figure 15).

Figure 15. Support recovery performance analysis in Scenario B (best in yellow according to MW).

Figure 16. Support recovery performance analysis in Scenario C (best in yellow according to MW).

3. Conclusion and future work

In this paper, we introduced, in the linear regression setting, the new MLR approach based on a different understanding of generalization. Exploiting this idea, we derived a novel criterion and new procedures which can be implemented directly on the train-set without any hold-out validation-set. Within MLR, additional structures can be taken into account without any significant increase in the computational complexity.

We highlighted several additional advantageous properties of the MLR approach in our numerical experiments. The MLR approach is computationally feasible while yielding statistical performances equivalent to or better than the cross-validated benchmarks. We provided numerical evidence of the MLR criterion's ability to generalize from the first added permutation. Besides, the strength of our MLR procedures stems from their compatibility with gradient-based optimization methods. As such, these procedures can fully benefit from automatic graph-differentiation libraries (such as PyTorch [Paszke et al., 2017] and TensorFlow [Abadi et al., 2015]).

In our numerical experiments, adding more permutations improves the convergence of the ADAM optimizer while preserving generalisation. As a matter of fact, T does not require any fine-tuning; in that regard, T is not a hyperparameter. Likewise, the other hyperparameters require no tedious initialization in this framework. The same fixed hyperparameters for ADAM and initialization values of the regularization parameters (see Table 1) were used for all the considered datasets. Noticeably, these experiments were run using high values for the learning rate and the convergence threshold. Consequently, only a very small number of iterations were needed, even for the non-convex criteria (MLR_{β^S} and MLR_{β^A}).

The MLR approach offers promising perspectives to address an impediment to the broader use of deep learning. Currently, fine-tuning the numerous hyper-parameters of DNNs often involves heavy computational resources and manual supervision from field experts [Smith, 2018]. Nonetheless, it is widely accepted that deep neural networks produce state-of-the-art results on most machine learning benchmarks based on large and structured datasets [Escalera and Herbrich, 2018, He et al., 2016, Klambauer et al., 2017, Krizhevsky et al., 2012, Silver et al., 2016, Simonyan and Zisserman, 2014, Szegedy et al., 2015]. By contrast, it is not yet the case for small unstructured datasets (e.g. tabular datasets with less than 1000 observations), where Random Forest, XGBOOST, MARS, etc. are usually acknowledged as the state of the art [Shavitt and Segal, 2018].

These concerns are all the more relevant during the ongoing global health crisis. Reacting early and appropriately to new streams of information became a daily challenge. Specifically, relying on the minimum amount of data to produce informed decisions on a massive scale has become the crux of the matter. In this unprecedented situation, transfer learning and domain knowledge might not be relied on to address these concerns. In that regard, the minimal need for calibration and the reliable convergence behavior of the MLR approach are a key milestone in the search for fast, reliable regularization methods for deep neural networks, especially in the small sample regime.

Beyond the results provided in this paper, we successfully extended the MLR approach to deep neural networks. Neural networks trained with the MLR criterion can reach state-of-the-art results on benchmarks usually dominated by Random Forest and Gradient Boosting techniques. Moreover, these results were obtained while preserving the fast, smooth and reliable convergence behavior displayed in this paper. We also successfully extended our approach to classification on tabular data. All these results are the topics of a future paper which will be posted on arXiv soon. In an ongoing project, we are also adapting our approach to tackle few-shot learning and adversarial resilience for structured data (images, texts, graphs). We believe we have just touched upon the many potential applications of the MLR approach in the fields of Machine Learning, Statistics and Econometrics.
References

Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

Arthur Asuncion and David Newman. UCI machine learning repository, 2007.

David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, volume 571. John Wiley & Sons, 2005.

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.

Quentin Bertrand, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, and Joseph Salmon. Implicit differentiation of lasso-type models for hyperparameter optimization. arXiv preprint arXiv:2002.08943, 2020.

Léon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Eric Brochu, Vlad M. Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.

Boyuan Chen, Harvey Wu, Warren Mo, Ishanu Chattopadhyay, and Hod Lipson. Autostacker: A compositional evolutionary learning system. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 402–409, 2018.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.

Andrzej Chrzeszczyk and Jakub Chrzeszczyk. Matrix computations on the GPU, CUBLAS and MAGMA by example. developer.nvidia.com, 2013.

Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.

Sergio Escalera and Ralf Herbrich. The NeurIPS'18 competition, 2018.

Manuel Fernandez-Delgado, Eva Cernadas, Senen Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181, 2014.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.

Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, pages 1–67, 1991.

David Harrison Jr and Daniel L. Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.

Fengxiang He, Tongliang Liu, and Dacheng Tao. Control batch size and learning rate to generalize well: Theoretical and empirical evidence. In Advances in Neural Information Processing Systems, pages 1141–1150, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Tin Kam Ho. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE, 1995.

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), arXiv preprint arXiv:1412.6980, 2014.

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971–980, 2017.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

William H. Kruskal. Historical notes on the Wilcoxon unpaired two-sample test. Journal of the American Statistical Association, 52(279):356–360, 1957.

Jan Kukačka, Vladimir Golkov, and Daniel Cremers. Regularization for deep learning: A taxonomy. arXiv preprint arXiv:1710.10686, 2017.

Alexandre Lacoste, Hugo Larochelle, Mario Marchand, and François Laviolette. Sequential model-based ensemble optimization. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 440–448, 2014.

Jan Larsen, Lars Kai Hansen, Claus Svarer, and M. Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop, pages 62–71. IEEE, 1996.

Shih-Wei Lin, Kuo-Ching Ying, Shih-Chieh Chen, and Zne-Jung Lee. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications, 35(4):1817–1824, 2008.

Pablo Ribalta Lorenzo, Jakub Nalepa, Michal Kawulok, Luciano Sanchez Ramos, and José Ranilla Pastor. Particle swarm optimization for hyper-parameter selection in deep neural networks. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 481–488, 2017.

Colin L. Mallows. Some comments on Cp. Technometrics, 42(1):87–94, 2000.

Jonas Mockus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.

Varalakshmi Murugesan, Amit Kesarkar, and Daphne Lopez. Embarrassingly parallel GPU based matrix inversion algorithm for big climate data assimilation. International Journal of Grid and High Performance Computing, 10:71–92, 2018. doi: 10.4018/IJGHPC.2018010105.

Rajib Nath, Stanimire Tomov, and Jack Dongarra. Accelerating GPU kernels for dense linear algebra. In Proceedings of the 2009 International Meeting on High Performance Computing for Computational Science, VECPAR'10, Berkeley, CA, June 22–25 2010. Springer.

Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, Jason H. Moore, et al. Automating biomedical data science through tree-based pipeline optimization. In European Conference on the Applications of Evolutionary Computation, pages 123–137. Springer, 2016.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. NIPS Workshop, 2017.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2902–2911. JMLR.org, 2017.

Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.

Ira Shavitt and Eran Segal. Regularization learning networks: Deep learning for tabular datasets. NeurIPS, 2018.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820, 2018.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

Charles M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135–1151, 1981.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Alexandre Tsybakov. Optimal rates of aggregation. Lecture Notes in Artificial Intelligence, 2777:303–313, 2003. doi: 10.1007/978-3-540-45167-9_23.

Andrius Vabalas, Emma Gowen, Ellen Poliakoff, and Alexander J. Casson. Machine learning algorithm validation with a limited sample size. PLOS ONE, 14(11):1–20, 2019. doi: 10.1371/journal.pone.0224365. URL https://doi.org/10.1371/journal.pone.0224365.

Gaël Varoquaux. Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage, 180:68–77, 2018. doi: 10.1016/j.neuroimage.2017.06.061. URL http://www.sciencedirect.com/science/article/pii/S1053811917305311.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261–272, 2020.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.