The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Improved Penalty Method via Doubly Stochastic Gradients for Bilevel Hyperparameter Optimization

Wanli Shi¹, Bin Gu¹,²,³*
¹Nanjing University of Information Science & Technology, P.R. China
²MBZUAI, United Arab Emirates
³JD Finance America Corporation, Mountain View, CA, USA
[email protected], [email protected]
*Corresponding author

Abstract

Hyperparameter optimization (HO) is an important problem in machine learning which is normally formulated as a bilevel optimization problem. Gradient-based methods are dominant in bilevel optimization due to their high scalability to the number of hyperparameters, especially in deep learning problems. However, traditional gradient-based bilevel optimization methods need intermediate steps to obtain the exact or approximate gradient of the hyperparameters, namely the hypergradient, for the upper-level objective, whose complexity is high, especially for high dimensional datasets. Recently, a penalty method has been proposed to avoid the computation of the hypergradient, which speeds up gradient-based bilevel hyperparameter optimization (BHO) methods. However, the penalty method may result in a very large number of constraints, which greatly limits the efficiency of this method, especially for high dimensional data problems. To address this limitation, in this paper, we propose a doubly stochastic gradient descent method for penalty hyperparameter optimization (DSGPHO) to improve the efficiency of the penalty method. Importantly, we not only prove that the proposed method converges to the KKT conditions of the original problem in the convex setting, but also provide the convergence rate of DSGPHO, which, as far as we know, is the first such result in the literature on gradient-based bilevel optimization. We compare our method with three state-of-the-art gradient-based methods in three tasks, i.e., data denoising, few-shot learning, and training data poisoning, using several large-scale benchmark datasets. All the results demonstrate that our method outperforms or is comparable to the existing methods in terms of accuracy and efficiency.

Introduction

Hyperparameter optimization (HO) problems are at the center of several important machine learning problems, which are usually formulated as bilevel optimization problems (Feurer and Hutter 2019). Machine learning algorithms crucially depend on the choice of their hyperparameters. However, manually tuning hyperparameters is often time-consuming and depends heavily on human prior knowledge, which makes it difficult to find the optimal hyperparameters. Therefore, choosing hyperparameters automatically has attracted great attention, and a large number of HO methods have emerged, such as grid search, random search (Bergstra et al. 2011; Bergstra and Bengio 2012), solution path (Gu and Sheng 2017; Gu, Liu, and Huang 2017; Bao, Gu, and Huang 2019; Gu and Ling 2015), and several Bayesian methods (Thornton et al. 2013; Brochu, Cora, and De Freitas 2010; Swersky, Snoek, and Adams 2014; Wu et al. 2019).

In real-world applications, the number of hyperparameters increases sharply, especially in deep learning problems. In these problems, gradient-based methods are dominant for bilevel problems with continuous hyperparameter sets due to their high scalability to the number of hyperparameters. The main idea of the gradient-based methods is to first approximate the solution on the training set and then compute the descent direction for the hyperparameters, namely the hypergradient. However, calculating the exact hypergradient needs the computation of the inverse Hessian of the lower-level problem, which makes it impractical for high dimensional data problems. To solve this problem, many methods have been proposed to approximate the hypergradient by using some intermediate steps, such as solving a linear system (the approximate methods (Domke 2012; Pedregosa 2016; Rajeswaran et al. 2019)) or using reverse/forward mode differentiation (e.g., (Maclaurin, Duvenaud, and Adams 2015; Domke 2012; Franceschi et al. 2017; Swersky, Snoek, and Adams 2014)). Although these methods avoid the computation of the inverse Hessian, the intermediate steps usually need extra loops, which affects the efficiency of these gradient-based methods.

Recently, (Mehra and Hamm 2019) pointed out that the intermediate steps are not necessary.
They formulate the bilevel problem as a single-level constrained problem by replacing the lower-level objective with its first-order necessary condition for optimality. Then the penalty framework can be used to solve this constrained problem. By updating the model parameters and hyperparameters alternately, the penalty method can finally obtain the exact hypergradient. It speeds up the gradient-based methods but brings a new problem in the meanwhile. Since the constraints are derived from the first-order necessary condition of the lower-level objective, this may lead to an optimization problem with a large number of constraints, especially for high dimensional datasets or deep models. Calculating the gradient of the penalty term in such a heavily constrained problem will greatly limit its efficiency.

Method | Problem | Update of v | Intermediate steps | Time | Space | Convergence rate
FMD | Bilevel | v ← v − η∇_v g | P ← P(I − η∇²_vv g) − η∇²_uv g | O(dmT) | O(dm) | −
RMD | Bilevel | v ← v − η∇_v g | p ← p − η∇²_uv g · q, q ← (I − η∇²_vv g)q | O(dT) | O(dT + m) | −
Approx | Bilevel | v ← v − η∇_v g | solve min_q ‖∇²_vv g · q − ∇_v f‖²₂ | O(dT) | O(d + m) | −
Penalty | Single level | v ← v − η(∇_v f + µ∇²_vv g · ∇_v g) | not required | O(dT) | O(d + m) | −
DSGPHO | Single level | v ← v − η∇̃_v L | not required | O(T) | O(d + m) | O(1/√K)

Table 1: The complexity of various gradient methods for updating the hyperparameters once. Here d is the size of the model parameters, m denotes the size of the hyperparameters, and T is the number of model parameter updates per hypergradient computation. ∇̃_v L = ∇̃_v f + [µc_j + z_j]∇_v c_j, where z_j is the j-th element of the dual parameter, µ > 0 is the penalty parameter, and c_j is the j-th element of ∇_v g. K denotes the total number of hyperparameter updates.

To address this limitation, in this paper, we propose a doubly stochastic gradient descent method to improve the efficiency of the penalty method (Mehra and Hamm 2019). Similar to (Mehra and Hamm 2019), we first transform the bilevel problem into a single-level constrained problem and give its corresponding augmented Lagrangian function. Within each iteration, we randomly sample a validation data instance to calculate the stochastic gradient of the upper-level objective. Then, we randomly sample a constraint and calculate its corresponding gradient. By combining these two terms, we obtain the stochastic gradient of the augmented Lagrangian function, and we use it to update the model parameters and hyperparameters alternately. Since our method has two sources of randomness, i.e., the validation set and the constraint set, we call it a doubly stochastic gradient descent method for penalty hyperparameter optimization (DSGPHO). We prove that our method achieves an unbiased estimate of the exact hypergradient asymptotically and that it finally converges to the KKT point of the original problem. Importantly, we also provide the convergence rate of DSGPHO, which, as far as we know, is the first such result in the literature on gradient-based bilevel optimization. We compare our method with three state-of-the-art gradient-based methods in three tasks, i.e., data denoising, few-shot learning, and training data poisoning, using several large-scale benchmark datasets. All the results demonstrate that our method outperforms or is comparable to the existing methods in terms of accuracy and efficiency.

Contributions

We summarize the contributions as follows.
1. We propose a doubly stochastic gradient descent method to improve the efficiency of the penalty method. Within each iteration, we randomly sample a validation instance and a constraint to calculate the stochastic gradient used to update the model parameters and hyperparameters alternately.
2. We prove that DSGPHO converges to the KKT point of the original problem at a rate of O(1/√K). In addition, we also prove that our method obtains an unbiased estimate of the hypergradient asymptotically.
3. We apply our method to various problems. The experimental results outperform other methods in terms of accuracy and training time, which demonstrates the superiority of our method.

Related Works

In this section, we give a brief review of gradient-based hyperparameter optimization methods.

As mentioned in the previous section, many gradient methods use some intermediate steps to approximate the hypergradient. Specifically, (Franceschi et al. 2017; Maclaurin, Duvenaud, and Adams 2015; Shaban et al. 2019) use forward/reverse mode differentiation (FMD/RMD) to approximate the hypergradient. Both view the training procedure as a dynamical system with state v_{t+1} = Φ(v_t, u) := v_t − η∇_v g(u, v_t) for t = 1, ..., T, where v denotes the model parameters, u denotes the hyperparameters, η is the learning rate, and g is the training objective on the training set. According to (Feurer and Hutter 2019), RMD needs another T iterations to calculate p = p − η∇²_uv g · q and q = (I − η∇²_vv g)q and then outputs the hypergradient p, where I is the identity matrix. Each p update step needs O(d) time and space to calculate the Jacobian-vector product, where d is the size of the model parameters. In total, RMD needs O(dT) time and O(m + Td) space to get the hypergradient, where m is the size of the hyperparameters. FMD needs to calculate P = P(I − η∇²_vv g) − η∇²_uv g and update v at the same time; the hypergradient can then be computed using P. (Feurer and Hutter 2019) pointed out that it needs O(dm) time to get each P. Hence, the time complexity for calculating one hypergradient is O(dmT), and the space complexity is O(dm) since P is a d × m matrix. (Pedregosa 2016) solves the linear problem min_q ‖∇²_vv g · q − ∇_v f‖²₂ and uses the solution to approximate the hypergradient. When using a gradient method to solve this problem, we need O(d) time to obtain the Hessian-vector product in each iteration according to (Pearlmutter 1994). Assuming the total number of gradient iterations is T, we need O(dT) time to calculate the hypergradient once; the space complexity is obviously O(d + m). (Mehra and Hamm 2019) propose a penalty method that does not need the intermediate steps to compute the hypergradient. Thus, the computational cost per hyperparameter update mainly comes from updating v via v ← v − η(∇_v f + µ∇²_vv g · ∇_v g). Since a Hessian-vector product costs O(d), it needs O(dT) time and O(d + m) space to update v for T iterations. We summarize all these complexities in Table 1.
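To make the O(d) Hessian-vector product cost concrete, the sketch below approximates ∇²_vv g · q with two gradient evaluations. This is a finite-difference variant in the spirit of (Pearlmutter 1994), whose exact method uses forward-over-reverse differentiation instead; the quadratic objective, the function names, and all constants here are our illustrative assumptions, not from the paper.

```python
import numpy as np

def grad_g(v, A, b):
    # Gradient of a toy lower-level objective g(v) = 0.5 * v.T @ A @ v - b.T @ v.
    return A @ v - b

def hvp(grad_fn, v, q, eps=1e-5):
    # Hessian-vector product from two gradient calls: O(d) extra work,
    # and the d x d Hessian is never formed or stored.
    return (grad_fn(v + eps * q) - grad_fn(v - eps * q)) / (2.0 * eps)

rng = np.random.default_rng(0)
d = 100
M = rng.standard_normal((d, d))
A = M @ M.T / d + np.eye(d)   # symmetric positive definite, so g is convex in v
b = rng.standard_normal(d)
v, q = rng.standard_normal(d), rng.standard_normal(d)

approx = hvp(lambda x: grad_g(x, A, b), v, q)
print(np.max(np.abs(approx - A @ q)))  # Hessian here is A; error is tiny since g is quadratic
```

Each call to hvp costs two gradient evaluations, which is the O(d) per-iteration cost that the Approx and Penalty rows in Table 1 rely on.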

According to this table, if d is very large, these methods become impractical.

Preliminaries

In this section, we give a brief introduction to bilevel hyperparameter optimization.

Bilevel Hyperparameter Optimization

Hyperparameter optimization problems are at the center of several important machine learning problems, where the first party (the hyperparameters) makes its choice first, which affects the optimal choice of the second party (the model parameters). Hyperparameter optimization problems are commonly formulated as the following bilevel optimization problem:

min_{u∈U} f(u, v), s.t. v = arg min_{v∈V} g(u, v),    (1)

where u denotes the hyperparameters, v denotes the model parameters, and U and V are convex sets in R^m and R^d, respectively. In this paper, we assume that the upper- and lower-level objectives f and g are twice continuously differentiable in both u and v. In addition, we assume that the lower-level cost g is convex in v for each given u.

Obviously, in the bilevel optimization problem, the upper-level problem min_u f(u, v) is a usual minimization problem (minimizing the loss on the validation set), where v is constrained to be the solution of the lower-level problem min_v g(u, v) depending on u (usually minimizing the loss on the training set). Thus, we can replace the lower-level problem by its first-order necessary condition for optimality, resulting in the following problem:

min_{u,v} f(u, v), s.t. c(u, v) = ∇_v g(u, v) = 0,    (2)

where each element of the equality constraint c(u, v) is c_j(u, v), j = 1, ..., d. According to (Mehra and Hamm 2019), solving the single-level problem (2) is equivalent to solving the bilevel optimization problem (1).

Penalty Method

In this subsection, we give an introduction to the penalty method, one of the most common methods for solving constrained problems. The main idea of penalty methods is to optimize the original cost function plus a penalty term on the constraints multiplied by a positive parameter. By making the penalty parameter larger, we penalize constraint violations more severely, thereby forcing the minimizer of the penalty function closer to the feasible region of the constrained problem.

Thus, based on the penalty method, the constrained problem (2) can be reformulated as L(u, v; µ) = f(u, v) + (µ/(2d)) Σ_{j=1}^d c_j²(u, v), where µ > 0 is the penalty parameter. It makes good intuitive sense to consider a sequence of values {µ_k} with µ_k ↑ ∞ as k → ∞, and to seek the approximate minimizers (u^k, v^k) = arg min_{u,v} f(u, v) + (µ_k/(2d)) Σ_{j=1}^d c_j²(u, v) for each given µ_k. According to (Bard 2013), for each given µ_k, the sequence {u^k, v^k} has limit points, any one of which is a solution of the original problem (1).

Recently, (Mehra and Hamm 2019) proposed a gradient method based on this penalty framework. For each given µ_k, they update the model parameters and hyperparameters alternately by using the full gradient of the penalty function until the norm of the gradient falls below a small tolerance. However, in real-world applications, the number of constraints d is usually very large. Specifically, when using a linear model on high dimensional data (e.g., more than 10,000 features), the number of constraints equals the data dimension. What is worse, when using a deep model, the number of constraints becomes even larger, usually more than 100,000. It would be computationally expensive if, at every update, we inquired about the values and gradients of all constraints. In addition, if the validation set is large, it also takes a lot of time to calculate the gradient of the upper-level objective. Meanwhile, to solve the quadratic penalty function, the penalty parameter µ_k needs to be large enough, which makes it impossible to get the solution in a limited time.

Proposed Method

In this section, to address the above limitations, we propose our stochastic method based on the penalty framework.

To get the solution in a limited time, instead of using the quadratic penalty function, we add Lagrangian multipliers and obtain the following augmented Lagrangian function (Bertsekas 1976):

L(u, v, z; µ) = f(u, v) + Ψ_µ(u, v, z),    (3)

where z denotes the Lagrangian multipliers, µ > 0 is the penalty parameter, Ψ_µ(u, v, z) = (1/d) Σ_{j=1}^d ψ_µ(c_j(u, v), z_j), and ψ_µ(c_j(u, v), z_j) = z_j c_j(u, v) + (µ/2) c_j²(u, v).

Then, we solve the augmented Lagrangian function in a stochastic manner. First, we give the update rule of the model parameters. In the t-th update iteration, we randomly sample a constraint c_j(u, v) and calculate its gradient ∇_v c_j(u, v) w.r.t. v. Let h_v^t = [µ_k c_j(u, v^t) + z_j^t] ∇_v c_j(u, v^t). Since the index j is sampled uniformly from {1, ..., d}, h_v^t is an unbiased estimate of the gradient of Ψ_µ, i.e., ∇_v Ψ_µ = E[h_v^t]. In addition, since the upper-level cost f is usually formulated as an expectation over the validation set, we can also obtain a stochastic gradient ∇̃_v f(u, v^t) of f(u, v^t) at v^t by randomly sampling a validation instance, and we have ∇_v f(u, v^t) = E[∇̃_v f(u, v^t)]. Combining the above two terms, we obtain the stochastic gradient of the augmented Lagrangian function (3), ∇̃_v L = h_v^t + ∇̃_v f(u, v^t), with E[∇̃_v L] = ∇_v L, where ∇_v L denotes the full gradient of L w.r.t. v. Then we apply projected stochastic gradient descent to update the model parameters, v^{t+1} = Proj_V(v^t − η_v ∇̃_v L), where Proj_V(·) denotes the projection onto the set V. Since V is the whole space of v, the projection operation can be omitted.

By the same approach, we obtain the stochastic gradient of function (3) with respect to u as ∇̃_u L = h_u^k + ∇̃_u f(u^k, v), where h_u^k = [µ_k c_j(u^k, v) + z_j^k] ∇_u c_j(u^k, v), and we have E[∇̃_u L] = ∇_u L.
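As an illustration of how ∇̃_v L is assembled, here is a minimal NumPy sketch for a toy weighted ridge-regression lower level. The problem instance and all names (c, grad_v_cj, dsg_v) and constants are our assumptions for illustration only, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_val, d, lam = 50, 20, 10, 0.1
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)                # training set
Xval, yval = rng.standard_normal((n_val, d)), rng.standard_normal(n_val)  # validation set

def c(u, v):
    # Constraint vector c(u, v) = grad_v g(u, v) for the lower level
    # g(u, v) = (1/n) * sum_i u_i * (x_i @ v - y_i)**2 + lam * ||v||^2.
    return (2.0 / n) * X.T @ (u * (X @ v - y)) + 2.0 * lam * v

def grad_v_cj(u, v, j):
    # Gradient of the j-th constraint w.r.t. v: the j-th row of the
    # lower-level Hessian (2/n) X^T diag(u) X + 2*lam*I.
    row = (2.0 / n) * (u * X[:, j]) @ X
    row[j] += 2.0 * lam
    return row

def dsg_v(u, v, z, mu):
    # Doubly stochastic gradient of L w.r.t. v: one random constraint
    # plus one random validation instance.
    j = rng.integers(d)
    cj = c(u, v)[j]            # a real implementation would evaluate entry j only
    h_v = (mu * cj + z[j]) * grad_v_cj(u, v, j)
    i = rng.integers(n_val)
    grad_f = 2.0 * (Xval[i] @ v - yval[i]) * Xval[i]  # stochastic gradient of f
    return h_v + grad_f
```

A single model-parameter step is then v -= eta_v * dsg_v(u, v, z, mu), matching v^{t+1} = Proj_V(v^t − η_v ∇̃_v L) with the projection omitted.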

Figure 1: Illustration of a single step to update the hyperparameters in Algorithm 1.

Similarly, we can perform the projected stochastic gradient step for the hyperparameters, u^{k+1} = Proj_U(u^k − η_u ∇̃_u L), where the projection operation can also be omitted.

The overall algorithm is shown in Algorithm 1, and Figure 1 illustrates a single step of updating the hyperparameters. Note that instead of updating u and v simultaneously, we first use the stochastic gradient ∇̃_v L to update v for T iterations and then use the stochastic gradient ∇̃_u L to update u a single time. Since the method contains two sources of randomness, i.e., the random validation instance and the random constraint, we call it doubly stochastic gradient descent for penalty hyperparameter optimization, namely DSGPHO. For each given µ_k, we find an ε_k-optimal solution (u^k, v^k) of problem (3). Once the tolerance ε_k is reached, we update the Lagrangian multipliers by z^{k+1} = z^k + µ_k c(u^k, v^{t_k}), then enlarge the penalty parameter µ_k and reduce the tolerance ε_k. We show that this procedure finally converges to a KKT point in Theorem 1.

Theorem 1. Suppose the tolerance ε_k is a positive convergent (ε_k → 0) sequence and µ_k is a positive nondecreasing sequence. Let {u^k, v^k, z^k} be the sequence of approximate solutions to the augmented Lagrangian function with penalty parameter µ_k and tolerance ‖∇_v L‖²₂ + ‖∇_u L‖²₂ < ε_k for all k = 1, 2, .... Then any limit point of {u^k, v^k, z^k} satisfies the KKT conditions of the constrained problem (2).

In our method, we do not need any intermediate steps to approximate the hypergradient. This is because, if we find the minimizer v* of the augmented Lagrangian function, our method obtains an unbiased estimate of the hypergradient asymptotically when Algorithm 1 converges. We give this result in Lemma 1.

Lemma 1. Given u, let v* be the optimal solution v* := arg min_v L(u, v, z; µ). Then we have E[∇̃_u L(u, v*, z; µ)] = (∂f/∂u)(u, v*), where (∂f/∂u)(u, v*) denotes the hypergradient.

Furthermore, instead of sampling one constraint function at a time, we can sample a small batch of b constraint functions. Then the stochastic gradient of Ψ_µ is calculated by h = (1/b) Σ_{j=1}^b [µ_k c_j(u, v) + z_j] ∇_v c_j(u, v), where the b constraints are sampled uniformly at random. In addition, we can also sample a small batch of validation instances to calculate the stochastic gradient of the upper-level objective.

Obviously, in our method, the main computational cost comes from updating the model parameters. Assume that we sample only one data instance and one constraint. Then, in each inner iteration, we need only O(1) computation to calculate the stochastic gradient ∇̃_v L. Thus the total complexity of a single step to update the hyperparameters is O(T). In addition, we only need to store the model parameters and hyperparameters, with space complexity O(d + m). Compared with the complexities summarized in Table 1, our method clearly has lower computational complexity and does not need intermediate steps to obtain the hypergradient.

Convergence Analysis

In this section, we give the convergence analysis of our proposed method in the convex case. Our analysis follows (Xu 2020). Let w := (u, v) refer to the pair of hyperparameters and model parameters. In the analysis, we consider the following update rule, which updates u and v simultaneously instead of using the exact rules in Algorithm 1: w^{k+1} = Proj_W(w^k − D_k^{-1} ∇̃_w L(w^k, z^k; µ)), where W denotes the feasible region of w, D_k^{-1} ≻ 0 is a diagonal matrix at each iteration, and ∇̃_w L(w^k, z^k; µ) = (1/b) Σ_{j=1}^b [µ c_j(w^k) + z_j^k] ∇_w c_j(w^k) + ∇̃_w f(w^k), where ∇̃_w f(w^k) denotes the stochastic gradient of f(w) w.r.t. w. Note that this update rule can be viewed as an approximation of the rules in Algorithm 1 when T = 1. In addition, we assume that µ is large enough so that we can simplify the analysis without considering the increase of µ.

Algorithm 1 DSGPHO
Input: K, T, η_v, η_u, µ_0, λ_0, ε_0, c_µ > 1, 0 < c_ε < 1.
Output: u, v.
1:  for k = 1, ..., K do
2:    for t = 0, ..., T do
3:      Randomly sample a validation data instance.
4:      Randomly sample a constraint.
5:      Calculate the doubly stochastic gradient ∇̃_v L.
6:      Update the model parameters v^{t+1} = Proj_V(v^t − η_v ∇̃_v L).
7:    end for
8:    Randomly sample a validation data instance.
9:    Randomly sample a constraint.
10:   Calculate the doubly stochastic gradient ∇̃_u L.
11:   Update the hyperparameters u^{k+1} = Proj_U(u^k − η_u ∇̃_u L).
12:   if ‖∇_u L‖²₂ + ‖∇_v L‖²₂ ≤ ε_k then
13:     µ_{k+1} = c_µ µ_k.
14:     ε_{k+1} = c_ε ε_k.
15:     z^{k+1} = z^k + µ_k c(u^k, v^{t_k}).
16:   end if
17: end for

Then, we give the following assumptions, which are widely used in theoretical analysis.

Assumption 1. The stochastic gradient ∇̃_w f(w) is unbiased and bounded, i.e., there is a constant σ such that E[∇̃_w f(w)] = ∇_w f(w) and E[‖∇̃_w f‖²] ≤ σ². In addition, there exist constants F and G such that |c(w)| ≤ F and ‖∇_w c_j(w)‖₂ ≤ G.

Assumption 2. All the constraints c_j(w) and the upper-level objective f(w) are convex functions.

The following lemma is used to build the inequality in our analysis for running one iteration of the hyperparameter update.

Lemma 2. For any deterministic or stochastic z, it holds that

−Ψ(w^k, z^k) + (1/d) Σ_{j=1}^d z_j c_j(w^k) + (1/2) E[‖z^{k+1} − z‖²₂]
= E[⟨z^k − z, d e_{j_k} [∇_z Ψ(w^k, z^k)]_{j_k} − ∇_z Ψ(w^k, z^k)⟩] + (1/2)‖z^k − z‖²₂ − (1/2)(µ − 1) E[‖z^{k+1} − z^k‖²₂],

where e_{j_k} denotes the j_k-th standard basis vector and j_k is the constraint index sampled at iteration k.

By the previous lemma, we establish the key result for running one iteration of the w update and then use it to show the convergence rate.

Theorem 2. Under Assumptions 1 and 2, assume that D_k ⪰ I/α_k for all k, for a positive number sequence {α_k}_{k≥1}, where I is the identity matrix. Let (w, z) be any deterministic or stochastic vector. Then

E[f(w) + (1/d) Σ_{j=1}^d z_j c_j(w)] + (1/2) E[‖w^{k+1} − w‖²_{D_k}] + (1/2) E[‖z^{k+1} − z‖²₂]
≤ E[f(w) + Ψ(w, z^k)] + (1/2) E[‖w^k − w‖²_{D_k}] + (1/2) E[‖z^k − z‖²₂]
+ α²(σ² + 2µ²F²G² + (2G²/d) E[‖z^k‖²₂])
− (1/2)(µ − 1) E[‖z^{k+1} − z^k‖²₂]
− E[⟨w^k − w, ∇_w f^k − ∇̃_w f^k⟩]
− E[⟨w^k − w, h^k − ∇_w Ψ(w^k, z^k)⟩]
+ E[⟨z^k − z, d e_{j_k} [∇_z Ψ(w^k, z^k)]_{j_k} − ∇_z Ψ(w^k, z^k)⟩].

Then, based on the above results, we directly give the convergence result with a constant step size D_k. The detailed proof is given in our appendix.

Theorem 3. Under Assumptions 1 and 2, let {(w^k, z^k)} be the sequence generated by our algorithm. For a total iteration number K, let w̄ = (1/K) Σ_{k=1}^K w^k and z̄ = (1/K) Σ_{k=1}^K z^k, and define

φ₁(w) = (1/(2α))‖w¹ − w‖²₂ + (α/(1 − 3/d))(σ² + 3µ²F²G² + 3G²C₁/d) + F²/(8αG²) + 1/2.

Then we have

E[f(w̄) − f(w*)] ≤ (1/√K)(2φ₁(w*) + (9(α + 1)/(2α))‖z*‖²₂),
E[‖c(w̄)‖₂] ≤ (1/√K)(φ₁(w*) + ((α + 1)/(2α))‖1 + z*‖²₂),

where C₁ = (2/α)‖w¹ − w*‖²₂ + 4‖z*‖²₂ + 4α(σ² + 2µ²F²G²).

Remark 1. Theorem 3 shows that our method has an approximate convergence rate of O(1/√K), where K is the total number of hyperparameter updates.
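To connect Algorithm 1 with the pieces above, below is a self-contained, minimal sketch of the whole loop on a toy importance-weighting problem (a weighted ridge lower level, in the spirit of the data denoising task later). All names and constants are our illustrative assumptions; in particular, we use ‖c(u, v)‖² as a cheap proxy for the gradient-norm tolerance test in line 12, which the paper does not do.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_val, d, lam = 50, 20, 10, 0.1
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
Xval, yval = rng.standard_normal((n_val, d)), rng.standard_normal(n_val)

def c(u, v):                       # c(u, v) = grad_v g(u, v): one entry per model weight
    return (2.0 / n) * X.T @ (u * (X @ v - y)) + 2.0 * lam * v

def grad_v_cj(u, v, j):            # j-th row of the lower-level Hessian
    row = (2.0 / n) * (u * X[:, j]) @ X
    row[j] += 2.0 * lam
    return row

def grad_u_cj(u, v, j):            # gradient of c_j w.r.t. the instance weights u
    return (2.0 / n) * X[:, j] * (X @ v - y)

def doubly_stochastic_grad(u, v, z, mu, wrt):
    j, i = rng.integers(d), rng.integers(n_val)   # random constraint + validation point
    cj = c(u, v)[j]
    if wrt == "v":
        grad_f = 2.0 * (Xval[i] @ v - yval[i]) * Xval[i]
        return (mu * cj + z[j]) * grad_v_cj(u, v, j) + grad_f
    # in this toy problem f does not depend on u directly, so only the penalty term contributes
    return (mu * cj + z[j]) * grad_u_cj(u, v, j)

u, v, z = np.ones(n), np.zeros(d), np.zeros(d)
mu, eps, c_mu, c_eps = 1.0, 1.0, 2.0, 0.5         # untuned illustrative constants
eta_v, eta_u, K, T = 1e-2, 1e-2, 200, 20

for k in range(K):
    for t in range(T):             # inner loop: lines 2-7 of Algorithm 1
        v -= eta_v * doubly_stochastic_grad(u, v, z, mu, "v")
    u -= eta_u * doubly_stochastic_grad(u, v, z, mu, "u")
    u = np.clip(u, 0.0, 1.0)       # Proj_U onto [0, 1]^n for instance weights (our choice)
    cv = c(u, v)
    if cv @ cv <= eps:             # proxy for the tolerance test in line 12
        z += mu * cv               # multiplier step: z <- z + mu_k * c(u, v)
        mu, eps = c_mu * mu, c_eps * eps

print("||c(u, v)||_2:", np.linalg.norm(c(u, v)))
print("validation loss:", np.mean((Xval @ v - yval) ** 2))
```

Note how each inner step touches only one constraint row and one validation point, which is exactly where the O(T) per-hyperparameter-update cost in Table 1 comes from.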

Figure 2: Test accuracy vs. training time (×10⁴) of Penalty, Approx, RMD, and our proposed method on the data denoising task: (a) MNIST, (b) CIFAR10, (c) SVHN.

Experiments

In this section, we compare our method with several state-of-the-art gradient-based methods to show its superiority in terms of accuracy and efficiency.

Baselines

We summarize the methods used in our experiments as follows.
1. Penalty. The method proposed in (Mehra and Hamm 2019). It formulates the bilevel optimization problem as a one-level problem with equality constraints, and then a gradient method is used to solve the new problem.
2. Approx. The method proposed in (Pedregosa 2016). It solves an additional linear problem to approximate the hypergradient used to update the hyperparameters.
3. RMD. The method proposed in (Franceschi et al. 2017). An additional loop is used to calculate the hypergradient.

Applications

We evaluate the performance of the proposed method on the following machine learning tasks.

Data denoising by importance learning. In many real-world applications, we need to learn models from datasets with corrupted labels. In such cases, we need to reweight the importance of each training instance to reduce the impact of noise (Liu and Tao 2015; Yu et al. 2017; Ren et al. 2018). Finding the correct weights of the training instances can be formulated as the following bilevel hyperparameter optimization problem:

min_u E_{D_val} l(h(x_i; v*, u), y_i), s.t. v* = arg min_v E_{D_tr} u_i l(h(x_i; v, u), y_i),

where h denotes the classifier parameterized by v, l denotes the loss, u_i denotes the weight of each data instance, and D_val and D_tr denote the validation set and training set, respectively.

In this task, we conduct experiments on the datasets MNIST, SVHN, and CIFAR10. For each dataset, we split it into three subsets, i.e., a training set, a test set, and a validation set, and we introduce 25% label noise into the training set. The validation sets of the three datasets contain 1000, 10000, and 1000 points, respectively. For each hyperparameter update step, we update the model parameters 50 times. For our method, we use 500 validation instances and 2048 constraints to calculate the doubly stochastic gradient. We fix the step size of the hyperparameters at 0.01 and tune the step size of the model parameters in {0.001, 0.0001, 0.00001} for all methods. In addition, for the architectures of the convolutional neural networks, we follow the setting in (Mehra and Hamm 2019).

Dataset | Noise | Approx | Penalty | RMD | Ours
CIFAR10 | 25% | 68.7 ± 0.6 | 74.4 ± 1.2 | 68.9 ± 1.4 | 75.6 ± 1.7
MNIST | 25% | 96.7 ± 0.2 | 97.2 ± 1.4 | 98.4 ± 0.1 | 98.4 ± 0.1
SVHN | 25% | 77.4 ± 1.5 | 86.3 ± 0.2 | 50.9 ± 1.2 | 88.9 ± 0.1

Table 2: Test accuracy (%) for the data denoising task with 25% label noise (mean ± std). The best value is shown in bold type.

We show the test accuracy over 5 runs in Table 2 and the time-accuracy curves in Figure 2. As shown in Table 2, our method has the best test accuracy on this task. Figure 2 demonstrates that our method converges faster than the other state-of-the-art methods. In addition, our method uses the least time to train the model when we fix the number of u updates at 5000. This is because, compared with Approx and RMD, our method does not need additional steps to approximate or calculate the hypergradient. As training continues, DSGPHO finally obtains an unbiased estimate of the hypergradient. Although Penalty has the same characteristic, it may involve a large number of constraints, meaning that in each model parameter update step, Penalty needs to evaluate the values and gradients of all constraints, which greatly limits its efficiency. In contrast, our method only needs the information of a subset of the constraints and a subset of the validation instances, which greatly reduces the time complexity.
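For concreteness, the denoising setup above injects 25% label noise into the training split. The paper does not specify the exact noise model, so the sketch below assumes the common choice of flipping each selected label uniformly to a different class; the function name and defaults are ours.

```python
import numpy as np

def inject_label_noise(y, noise_rate=0.25, num_classes=10, seed=0):
    # Flip a noise_rate fraction of labels to a uniformly random *different*
    # class; one common way to realize the 25% noise used above.
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < noise_rate
    offsets = rng.integers(1, num_classes, size=int(flip.sum()))
    y_noisy[flip] = (y_noisy[flip] + offsets) % num_classes
    return y_noisy

y = np.arange(20) % 10                  # toy labels over 10 classes
print(inject_label_noise(y))            # roughly 25% of entries are changed
```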

9626 0.8 1 Poisoned Approx Penalty RMD Ours

0.6 10 81.5  0.5 82.9  0.7 86.1  0.7 82.7  0.7 30 76.4  0.3 74.9  0.1 84.5  0.1 74.7  0.6 0.4 0.5 Approx Approx 0.2 Penalty Penalty Table 4: Test accuracy (%) for untargeted poisoned at- Accuracy Ours Accuracy Ours tack.(The lower accuracy is better) . 0 0 0 1 2 3 0 2 4 Time 10 4 Time 10 4 (a) Mini-ImageNet (b) Omniglot -0.5 -0.6 Figure 3: The training time of Penalty, Approx and DSG- -0.6 -0.8 PHO used in few-shot learning task -0.7 Approx -1 Approx RMD RMD -0.8 Penalty -1.2 Penalty Validation loss Ours Validation loss Ours Dataset Problem Approx Penalty Ours -0.9 -1.4 0 500 1000 1500 0 500 1000 1500 MiniImageNet 5-way 5-shot 68.3  1.8 67.2  0.5 67.3  0.5 Time Time Omiglot 5-way 5-shot 99.3  0.1 99.4  0.1 99.3  0.1 (a) 10 poisoned points (b) 30 poisoned points

Table 3: Accuracy (%) for few-shot task.(Meanstd). The Figure 4: The validation loss vs. time of Penalty, Approx, best value is shown in bold type. RMD and DSGPHO in untargeted poisoned task straints in each iteration. We show the results in Table 3 and Figure 3. From Table 3, we can find that our method has the we have the best result when we use 30 poisoned points. We comparable result to Penalty and Approx. From Figure 3, we show the validation loss and train time in Figure 4. From can find that our method can converge to the solution faster this figure, we can find that our method can obtain the lower than Penalty and Approx. This is because that, in each v up- validation loss in a less time. It is reasonable that the per- date step, our method only samples a batch of constraints formance in terms of training-time are not as good as that in and does not need to approximate the hypergradient. This the previous tasks. This is because that in this task, we use leads to a lower computational complexity in our method. a linear model, which makes the constraints are not as much as those in the previous tasks. Training data poisoning Finally, we compare these meth- Based on all above results, we can conclude that our ods on the training-data poisoning task (Munoz-Gonz˜ alez´ method is superior to the state-of-the-art gradient-based et al. 2017; Shafahi et al. 2018; Mei and Zhu 2015; Koh and methods in terms of efficiency and accuracy. Liang 2017). In this task, we modify the training data such that we make the model learned from these data performs poorly than that learned from the original data. The prob- Conclusion lem of finding the poisoned data, of which the labels are not correct, can be formulated as follows, In this paper, we proposed a doubly stochastic gradient method to improve the performance of penalty method for 1 X ∗ min − l(h(xi, u, v ), yi) bilevel hyperparameter optimization. Within each update u Nvl step, we randomly sample a validation point and a single Dvl constraint to calculate the stochastic gradient. We also prove ∗ 1 X s.t. v = arg min l(h(xi, u, w), yi) that our method can obtain the hypergradient and converge v N √ Dor ∪P to saddle point at the rate of O(1/ K). The experimental results also demonstrate the superiority our method in terms where Dor denotes the original dataset and P denotes the set of poisoned instances. of training time and accuracy. Our method needs f and g to In our experiments, we follow the setting in (Munoz-˜ be twice continuously differentiable in both u and v, which Gonzalez´ et al. 2017) and test the untargeted MNIST us- limits the usage of our method. In the feature, we will ex- ing data augmentation technique. We use logistic regres- plore the bilevel problem with the nonsmooth lower-level sion to train the classifiers. Our goal is to make the perfor- problems or the discrete hyperparameters. mance lower on the test set. We split 1000 instances into training set, 1000 into validation set and 8000 into test- Acknowledgments ing set. We randomly sample 10 and 30 instances from the training set and give them random incorrect labels. For our B. Gu was partially supported by National Natural Sci- method, we randomly sample 2048 constraints. We fix the ence Foundation of China (No: 62076138), the Qing Lan learning rate of u at 0.1 and choose learning rate of v form Project (No.R2020Q04), the National Natural Science Foun- {0.01, 0.001, 0.0001}. 
We set the total number of update u dation of China (No.62076138), the Six talent peaks project at 5000. From Table 4, we can find that results are compara- (No.XYDXX-042) and the 333 Project (No. BRA2017455) ble to other methods when we have 10 poisoned points and in Jiangsu Province.

References

Bao, R.; Gu, B.; and Huang, H. 2019. Efficient Approximate Solution Path Algorithm for Order Weight L1-Norm with Accuracy Guarantee. In 2019 IEEE International Conference on Data Mining (ICDM), 958–963. IEEE.

Bard, J. F. 2013. Practical Bilevel Optimization: Algorithms and Applications, volume 30. Springer Science & Business Media.

Bergstra, J.; and Bengio, Y. 2012. Random search for hyper-parameter optimization. The Journal of Machine Learning Research 13(1): 281–305.

Bergstra, J. S.; Bardenet, R.; Bengio, Y.; and Kégl, B. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, 2546–2554.

Bertsekas, D. P. 1976. On penalty and multiplier methods for constrained minimization. SIAM Journal on Control and Optimization 14(2): 216–235.

Brochu, E.; Cora, V. M.; and De Freitas, N. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

Domke, J. 2012. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, 318–326.

Feurer, M.; and Hutter, F. 2019. Hyperparameter optimization. In Automated Machine Learning, 3–33. Springer, Cham.

Franceschi, L.; Donini, M.; Frasconi, P.; and Pontil, M. 2017. Forward and reverse gradient-based hyperparameter optimization. arXiv preprint arXiv:1703.01785.

Gu, B.; and Ling, C. 2015. A new generalized error path algorithm for model selection. In International Conference on Machine Learning, 2549–2558.

Gu, B.; Liu, G.; and Huang, H. 2017. Groups-keeping solution path algorithm for sparse regression with automatic feature grouping. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 185–193.

Gu, B.; and Sheng, V. S. 2017. A solution path algorithm for general parametric quadratic programming problem. IEEE Transactions on Neural Networks and Learning Systems 29(9): 4462–4472.

Koh, P. W.; and Liang, P. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730.

Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266): 1332–1338.

Liu, T.; and Tao, D. 2015. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(3): 447–461.

Maclaurin, D.; Duvenaud, D.; and Adams, R. 2015. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, 2113–2122.

Mehra, A.; and Hamm, J. 2019. Penalty Method for Inversion-Free Deep Bilevel Optimization. arXiv preprint arXiv:1911.03432.

Mei, S.; and Zhu, X. 2015. Using Machine Teaching to Identify Optimal Training-Set Attacks on Machine Learners. In AAAI, 2871–2877.

Muñoz-González, L.; Biggio, B.; Demontis, A.; Paudice, A.; Wongrassamee, V.; Lupu, E. C.; and Roli, F. 2017. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 27–38.

Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Computation 6(1): 147–160.

Pedregosa, F. 2016. Hyperparameter optimization with approximate gradient. arXiv preprint arXiv:1602.02355.

Rajeswaran, A.; Finn, C.; Kakade, S. M.; and Levine, S. 2019. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, 113–124.

Ren, M.; Zeng, W.; Yang, B.; and Urtasun, R. 2018. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050.

Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065.

Shaban, A.; Cheng, C.-A.; Hatch, N.; and Boots, B. 2019. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, 1723–1732.

Shafahi, A.; Huang, W. R.; Najibi, M.; Suciu, O.; Studer, C.; Dumitras, T.; and Goldstein, T. 2018. Poison frogs! Targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems, 6103–6113.

Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 4077–4087.

Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1199–1208.

Swersky, K.; Snoek, J.; and Adams, R. P. 2014. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896.

Thornton, C.; Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847–855.

Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 3630–3638.

Wu, J.; Chen, X.-Y.; Zhang, H.; Xiong, L.-D.; Lei, H.; and Deng, S.-H. 2019. Hyperparameter optimization for machine learning models based on Bayesian optimization. Journal of Electronic Science and Technology 17(1): 26–40.

Xu, Y. 2020. Primal-dual stochastic gradient method for convex programs with many functional constraints. SIAM Journal on Optimization 30(2): 1664–1692.

Yu, X.; Liu, T.; Gong, M.; Zhang, K.; Batmanghelich, K.; and Tao, D. 2017. Transfer learning with label noise. arXiv preprint arXiv:1707.09724.
