
Optimizing Millions of Hyperparameters by Implicit Differentiation

Jonathan Lorraine, Paul Vicol, David Duvenaud
University of Toronto, Vector Institute
{lorraine, pvicol, duvenaud}@cs.toronto.edu

Abstract

We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results on the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters. We learn a data-augmentation network—where every weight is a hyperparameter tuned for validation performance—that outputs augmented training examples; we learn a distilled dataset where each feature in each datapoint is a hyperparameter; and we tune millions of regularization hyperparameters. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.

1 Introduction

The generalization of neural networks (NNs) depends crucially on the choice of hyperparameters. Hyperparameter optimization (HO) has a rich history [1, 2], and achieved recent success in scaling due to gradient-based optimizers [3–11]. There are dozens of regularization techniques to combine in deep learning, and each may have multiple hyperparameters [12]. If we can scale HO to have as many—or more—hyperparameters as parameters, there are various exciting regularization strategies to investigate. For example, we could learn a distilled dataset with a hyperparameter for every feature of each input [4, 13], weights on each loss term [14–16], or augmentation on each input [17, 18].

When the hyperparameters are low-dimensional—e.g., 1–5 dimensions—simple methods, like random search, work; however, these break down for medium-dimensional HO—e.g., 5–100 dimensions. We may use more scalable algorithms like Bayesian optimization [19–21], but this often breaks down for high-dimensional HO—e.g., >100 dimensions. We can solve high-dimensional HO problems locally with gradient-based optimizers, but this is difficult because we must differentiate through the optimized weights as a function of the hyperparameters. In other words, we must approximate the Jacobian of the best-response function of the parameters to the hyperparameters.

We leverage the Implicit Function Theorem (IFT) to compute the optimized validation loss gradient with respect to the hyperparameters—hereafter denoted the hypergradient. The IFT requires inverting the training Hessian with respect to the NN weights, which is infeasible for modern, deep networks. Thus, we propose an approximate inverse, motivated by a link to unrolled differentiation [3], that scales to Hessians of large NNs, is more stable than conjugate gradient [22, 7], and only requires a constant amount of memory.

Finally, when fitting many parameters, the amount of data can limit generalization. There are ad hoc rules for partitioning data into training and validation sets—e.g., using 10% for validation. Often, practitioners re-train their models from scratch on the combined training and validation partitions with optimized hyperparameters, which can provide marginal test-time performance increases. We verify empirically that standard partitioning and re-training procedures perform well when fitting few hyperparameters, but break down when fitting many. When fitting many hyperparameters, we need a large validation partition, which makes re-training our model with optimized hyperparameters vital for strong test performance.

Contributions

• We propose a stable inverse Hessian approximation with constant memory cost.

• We show that the IFT is the limit of differentiating through optimization.

• We scale IFT-based hyperparameter optimization to modern, large neural architectures, including AlexNet and LSTM-based language models.

• We demonstrate several uses for fitting hyperparameters almost as easily as weights, including per-parameter regularization, data distillation, and learned-from-scratch methods.

• We explore how training–validation splits should change when tuning many hyperparameters.

Figure 1: Overview of gradient-based hyperparameter optimization (HO). Left: a training loss manifold; Right: a validation loss manifold. The implicit function $w^*(\lambda)$ is the best-response of the weights to the hyperparameters, shown in blue projected onto the $(\lambda, w)$-plane. We get our desired objective function $\mathcal{L}^*_V(\lambda)$ when the best-response is put into the validation loss, shown projected on the hyperparameter axis in red. The validation loss does not depend directly on the hyperparameters, as is typical in hyperparameter optimization. Instead, the hyperparameters only affect the validation loss by changing the weights' response. We show the best-response Jacobian in blue, and the hypergradient in red.
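As a concrete one-dimensional illustration of these objects (our toy example, not one from the paper), take a single weight $w$, a single L2-regularization hyperparameter $\lambda$, training loss $\mathcal{L}_T(\lambda, w) = \tfrac{1}{2}(w-a)^2 + \tfrac{\lambda}{2}w^2$, and validation loss $\mathcal{L}_V(w) = \tfrac{1}{2}(w-b)^2$ for constants $a, b$. The best-response and hypergradient are available in closed form:
\[
w^*(\lambda) = \arg\min_w \mathcal{L}_T(\lambda, w) = \frac{a}{1+\lambda},
\qquad
\frac{\partial w^*}{\partial \lambda} = -\frac{a}{(1+\lambda)^2},
\]
\[
\frac{\partial \mathcal{L}^*_V}{\partial \lambda}
= \frac{\partial \mathcal{L}_V}{\partial w}\bigg|_{w^*(\lambda)} \cdot \frac{\partial w^*}{\partial \lambda}
= \left(\frac{a}{1+\lambda} - b\right)\left(-\frac{a}{(1+\lambda)^2}\right).
\]
Here the validation loss has no direct dependence on $\lambda$, so the entire hypergradient flows through the best-response, exactly the situation the figure depicts. The IFT expression of Section 2 recovers the same Jacobian from $\partial^2 \mathcal{L}_T/\partial w^2 = 1+\lambda$ and $\partial^2 \mathcal{L}_T/\partial w\,\partial\lambda = w$ evaluated at $w^*(\lambda)$: $-(1+\lambda)^{-1}\,w^*(\lambda) = -a/(1+\lambda)^2$.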

2 Overview of Proposed Algorithm

There are four essential components to understanding our proposed algorithm. Further background is provided in Appendix A, and notation is shown in Table 5.

1. HO is nested optimization: Let $\mathcal{L}_T$ and $\mathcal{L}_V$ denote the training and validation losses, $w$ the NN weights, and $\lambda$ the hyperparameters. We aim to find optimal hyperparameters $\lambda^*$ such that the NN minimizes the validation loss after training:
\[
\lambda^* := \arg\min_{\lambda} \mathcal{L}^*_V(\lambda) \quad \text{where} \tag{1}
\]
\[
\mathcal{L}^*_V(\lambda) := \mathcal{L}_V(\lambda, w^*(\lambda)) \quad \text{and} \quad w^*(\lambda) := \arg\min_{w} \mathcal{L}_T(\lambda, w) \tag{2}
\]
Our implicit function is $w^*(\lambda)$, which is the best-response of the weights to the hyperparameters. We assume unique solutions to $\arg\min$ for simplicity.

2. Hypergradients have two terms: For gradient-based HO we want the hypergradient $\frac{\partial \mathcal{L}^*_V(\lambda)}{\partial \lambda}$, which decomposes into:
\[
\frac{\partial \mathcal{L}^*_V(\lambda)}{\partial \lambda}
= \left[ \frac{\partial \mathcal{L}_V}{\partial \lambda} + \frac{\partial \mathcal{L}_V}{\partial w}\,\frac{\partial w^*}{\partial \lambda} \right]_{\lambda, w^*(\lambda)}
= \underbrace{\frac{\partial \mathcal{L}_V(\lambda, w^*(\lambda))}{\partial \lambda}}_{\text{hyperparam. direct grad.}}
+ \underbrace{\underbrace{\frac{\partial \mathcal{L}_V(\lambda, w^*(\lambda))}{\partial w^*(\lambda)}}_{\text{parameter direct grad.}}
\times
\underbrace{\frac{\partial w^*(\lambda)}{\partial \lambda}}_{\text{best-response Jacobian}}}_{\text{hyperparam. indirect grad.}}
\tag{3}
\]
The direct gradient is easy to compute, but the indirect gradient is difficult to compute because we must account for how the optimal weights change with respect to the hyperparameters (i.e., $\frac{\partial w^*(\lambda)}{\partial \lambda}$). In HO the direct gradient is often identically 0, necessitating an approximation of the indirect gradient to make any progress (visualized in Fig. 1).

3. We can estimate the implicit best-response with the IFT: We approximate the best-response Jacobian—how the optimal weights change with respect to the hyperparameters—using the IFT (Thm. 1). We present the complete statement in Appendix C, but highlight the key assumptions and results here.

Theorem 1 (Cauchy, Implicit Function Theorem). If for some $(\lambda_0, w_0)$, $\frac{\partial \mathcal{L}_T}{\partial w}\big|_{\lambda_0, w_0} = 0$ and regularity conditions are satisfied, then surrounding $(\lambda_0, w_0)$ there is a function $w^*(\lambda)$ s.t. $\frac{\partial \mathcal{L}_T}{\partial w}\big|_{\lambda, w^*(\lambda)} = 0$, and we have:
\[
\frac{\partial w^*}{\partial \lambda}\bigg|_{\lambda_0}
= - \underbrace{\left[\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}}\right]^{-1}}_{\text{training Hessian}}
\times
\underbrace{\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}}}_{\text{training mixed partials}}
\Bigg|_{\lambda_0,\, w^*(\lambda_0)}
\tag{IFT}
\]
The condition $\frac{\partial \mathcal{L}_T}{\partial w}\big|_{\lambda_0, w_0} = 0$ is equivalent to $(\lambda_0, w_0)$ being a fixed point of the training gradient field. Since $w^*(\lambda_0)$ is a fixed point of the training gradient field, we can leverage the IFT to evaluate the best-response Jacobian locally. We only have access to an approximation of the true best-response—denoted $\widehat{w}^*$—which we can find with gradient descent.

4. Tractable inverse Hessian approximations: To exactly invert a general $m \times m$ Hessian, we often require $\mathcal{O}(m^3)$ operations, which is intractable for the Hessian in Eq. IFT for modern NNs. We can efficiently approximate the inverse with the Neumann series:
\[
\left[\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}}\right]^{-1}
= \lim_{i \to \infty} \sum_{j=0}^{i} \left[ I - \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \right]^{j}
\tag{4}
\]
In Section 4 we show that unrolling differentiation for $i$ steps around locally optimal weights $w^*$ is equivalent to approximating the inverse with the first $i$ terms in the Neumann series. We then show how to use this approximation without instantiating any matrices by using efficient vector-Jacobian products.

Figure 2: Hypergradient computation. The entire computation can be performed efficiently using vector-Jacobian products, provided a cheap approximation to the inverse-Hessian-vector product is available.

2.1 Proposed Algorithms

We outline our method in Algs. 1, 2, and 3, where α denotes the learning rate. Alg. 3 is also shown in [22]. We visualize the hypergradient computation in Fig. 2.

Algorithm 1 Gradient-based HO for λ∗, w∗(λ∗)
1: Initialize hyperparameters λ_0 and weights w_0
2: while not converged do
3:   for k = 1 ... N do
4:     w_0 -= α · ∂L_T/∂w |_{λ_0, w_0}
5:   λ_0 -= hypergradient(L_V, L_T, λ_0, w_0)
6: return λ_0, w_0   ▷ λ∗, w∗(λ∗) from Eq. 1

Algorithm 2 hypergradient(L_V, L_T, λ_0, w_0)
1: v_1 = ∂L_V/∂w |_{λ_0, w_0}
2: v_2 = approxInverseHVP(v_1, ∂L_T/∂w)
3: v_3 = grad(∂L_T/∂λ, w, grad_outputs = v_2)
4: return ∂L_V/∂λ |_{λ_0, w_0} − v_3   ▷ Return to Alg. 1

Algorithm 3 approxInverseHVP(v, f): Neumann approximation of the inverse-Hessian-vector product v[∂f/∂w]⁻¹
1: Initialize sum p = v
2: for j = 1 ... i do
3:   v -= α · grad(f, w, grad_outputs = v)
4:   p += v
5: return p   ▷ Return to Alg. 2
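To make Algs. 2 and 3 concrete, the following is a minimal PyTorch-style sketch (ours, for illustration; it is not the authors' released implementation). It assumes val_loss and train_loss are scalar tensors computed from lists of tensors params and hparams, that every hyperparameter enters the training loss, and it folds the Neumann step size alpha back into the returned sum; the function and variable names, the fixed number of Neumann terms, and the default alpha are our assumptions.

import torch

def approx_inverse_hvp(v, f, params, alpha=0.1, num_terms=20):
    # Alg. 3 sketch: Neumann approximation of v [d f / d params]^{-1},
    # where f = dL_T/dparams was built with create_graph=True.
    v = [x.detach().clone() for x in v]
    p = [x.clone() for x in v]
    for _ in range(num_terms):
        # Matrix-free Hessian-vector product via a vector-Jacobian product on f.
        hvp = torch.autograd.grad(f, params, grad_outputs=v, retain_graph=True)
        v = [x - alpha * y for x, y in zip(v, hvp)]
        p = [x + y for x, y in zip(p, v)]
    # alpha * sum_j (I - alpha*H)^j v -> H^{-1} v as num_terms grows,
    # provided alpha is small enough that the series converges.
    return [alpha * x for x in p]

def hypergradient(val_loss, train_loss, hparams, params):
    # Alg. 2 sketch: hypergradient = direct gradient - v2^T (mixed partials of L_T).
    v1 = torch.autograd.grad(val_loss, params, retain_graph=True)
    f = torch.autograd.grad(train_loss, params, create_graph=True)
    v2 = approx_inverse_hvp(v1, f, params)
    # v2-weighted mixed partials d^2 L_T / (dw dlambda^T), shaped like hparams;
    # torch.autograd.grad takes (outputs=dL_T/dw, inputs=hparams) for this product.
    v3 = torch.autograd.grad(f, hparams, grad_outputs=v2, retain_graph=True)
    direct = torch.autograd.grad(val_loss, hparams, allow_unused=True)
    return [-g3 if d is None else d - g3 for d, g3 in zip(direct, v3)]

A full loop would alternate several SGD steps on params with one hyperparameter update of the form hparams[k] -= hyper_lr * hg[k] inside torch.no_grad(), mirroring Alg. 1.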
3 Related Work

Implicit Function Theorem. The IFT has been used for optimization in nested optimization problems [27, 28], backpropagating through arbitrarily long RNNs [22], or even efficient k-fold cross-validation [29]. Early work applied the IFT to regularization by explicitly computing the Hessian (or Gauss-Newton) inverse [23, 2]. In [24], the identity matrix is used to approximate the inverse Hessian in the IFT. HOAG [30] uses conjugate gradient (CG) to invert the Hessian approximately and provides convergence results given tolerances on the optimal parameter and inverse. In iMAML [9], a center to the weights is fit to perform well on multiple tasks—contrasted with our use of validation loss. In DEQ [31], implicit differentiation is used to add differentiable fixed-point methods into NN architectures. We use a Neumann approximation for the inverse Hessian, instead of CG [30, 9] or the identity.

Approximate inversion algorithms. CG is difficult to scale to modern, deep NNs. We use the Neumann inverse approximation, which was observed to be a stable alternative to CG in NNs [22, 7]. The stability is motivated by connections between the Neumann approximation and unrolled differentiation [7]. Alternatively, we could use prior knowledge about the NN structure to aid in the inversion—e.g., by using KFAC [32]. It is possible to approximate the Hessian with the Gauss-Newton matrix or Fisher information matrix [23]. Various works use an identity approximation to the inverse, which is equivalent to 1-step unrolled differentiation [24, 14, 33, 10, 8, 34, 35].

Unrolled differentiation for HO. A key difficulty in nested optimization is approximating how the optimized inner parameters (i.e., NN weights) change with respect to the outer parameters (i.e., hyperparameters). We often optimize the inner parameters with gradient descent, so we can simply differentiate through this optimization. Differentiation through optimization has been applied to nested optimization problems by [3], was scaled to HO for NNs by [4], and has been applied to various applications like learning optimizers [36]. [6] provides convergence results for this class of algorithms, while [5] discusses forward- and reverse-mode variants.

As the number of gradient steps we backpropagate through increases, so does the memory and computational cost. Often, gradient descent does not exactly minimize our objective after a finite number of steps—it only approaches a local minimum. Thus, to see how the hyperparameters affect the local minima, we may have to unroll the optimization infeasibly far. Unrolling a small number of steps can be crucial for performance but may induce bias [37]. [7] discusses connections between unrolling and the IFT, and proposes to unroll only the last L steps. DrMAD [38] proposes an interpolation scheme to save memory.

We compare hypergradient approximations in Table 1, and memory costs of gradient-based HO methods in Table 2. We survey gradient-free HO in Appendix B.

Table 1: Hypergradient approximations.

Method | Steps | Eval. | Hypergradient Approximation
Exact IFT | $\infty$ | $w^*(\lambda)$ | $\frac{\partial \mathcal{L}_V}{\partial \lambda} - \frac{\partial \mathcal{L}_V}{\partial w} \times \left[\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}}\right]^{-1} \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \Big|_{w^*(\lambda)}$
Unrolled Diff. [4] | $i$ | $w_0$ | $\frac{\partial \mathcal{L}_V}{\partial \lambda} - \frac{\partial \mathcal{L}_V}{\partial w} \times \sum_{j \le i} \left[\prod_{k<j} \left( I - \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}}\right)\right] \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \Big|_{\lambda,\, w_j(\lambda)}$

We need continuous hyperparameters to use gradient-based optimization, but many discrete hyperparameters have continuous relaxations. We also assume that $\mathcal{L}_T : \Lambda \times \mathcal{W} \to \mathbb{R}$ is twice differentiable with an invertible Hessian at $w^*(\lambda)$, and that $w^* : \Lambda \to \mathcal{W}$ is differentiable.

Lemma. Given the recurrence from unrolling SGD optimization in Eq. 5, we have:
\[
\frac{\partial w_{i+1}}{\partial \lambda}
= - \sum_{j \le i} \left[ \prod_{k < j} \left( I - \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \bigg|_{\lambda,\, w^*(\lambda)} \right) \right]
\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \bigg|_{\lambda,\, w^*(\lambda)}
\]
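To spell out the step this lemma enables (our summary, relying only on the invertibility assumption above): when unrolling is performed around the locally optimal weights, every factor in the product is the same matrix, so the product collapses to a matrix power and the sum becomes a truncated Neumann series,
\[
\frac{\partial w_{i+1}}{\partial \lambda}
= - \sum_{j \le i} \left[ I - \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \right]^{j}
\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \Bigg|_{\lambda,\, w^*(\lambda)}.
\]
If the series converges (for instance when $\lVert I - \partial^2 \mathcal{L}_T / \partial w\,\partial w^{\top} \rVert < 1$), this is exactly the first $i{+}1$ Neumann terms of Eq. 4 applied to the mixed partials, and taking $i \to \infty$ yields the exact inverse Hessian.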

In the limit of unrolling, this sum recovers the inverse training Hessian:
\[
\lim_{i \to \infty} \frac{\partial w_{i+1}}{\partial \lambda}
= - \left[ \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \right]^{-1}
\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \Bigg|_{w^*(\lambda)}
\tag{8}
\]
This result is also shown in [7], but they use a different approximation for computing the hypergradient—see Table 1. Instead, we use the following best-response Jacobian approximation, where $i$ controls the trade-off between computation and error bounds:
\[
\frac{\partial w^*}{\partial \lambda}
\approx - \sum_{j < i} \left[ I - \alpha \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \right]^{j}
\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}}
\tag{9}
\]

… for an LSTM language model. HO algorithms that are not based on implicit differentiation or differentiation through optimization—such as [44–48, 20]—do not scale to the high-dimensional hyperparameters we use. Thus, we cannot sensibly compare to them for high-dimensional problems.

5.1 Approximate Inversion Algorithms

In Fig. 3 we investigate how close various approximations are to the true inverse. We calculate the distance between the approximate hypergradient and the true hypergradient.
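What follows is a small, self-contained NumPy check (our illustration, not an experiment from the paper) that the truncated series of Eq. 9 approaches the exact IFT hypergradient on a quadratic problem where both are available in closed form. The problem sizes, the choice of alpha, and the rescaling of the accumulated sum by alpha are our assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_w, n_h = 10, 3                       # number of weights / hyperparameters

# Quadratic training loss L_T(lam, w) = 0.5 w^T A w + w^T B lam, with A > 0,
# so the training Hessian is A and the mixed partials are B.
M = rng.standard_normal((n_w, n_w))
A = M @ M.T + n_w * np.eye(n_w)
B = rng.standard_normal((n_w, n_h))
lam = rng.standard_normal(n_h)
w_star = -np.linalg.solve(A, B @ lam)  # fixed point: dL_T/dw = A w + B lam = 0

# Validation loss L_V(w) = 0.5 ||w - t||^2 has no direct dependence on lam.
t = rng.standard_normal(n_w)
dLV_dw = w_star - t

exact = -dLV_dw @ np.linalg.inv(A) @ B   # exact IFT hypergradient (indirect term)

alpha = 1.0 / np.linalg.norm(A, 2)       # makes I - alpha*A a contraction
v = dLV_dw.copy()
p = dLV_dw.copy()
for _ in range(200):                     # number of Neumann terms, i in Eq. 9
    v = v - alpha * (A @ v)              # v <- v (I - alpha*A); A is symmetric
    p = p + v
approx = -(alpha * p) @ B                # alpha rescaling puts it on the scale of exact

print(np.linalg.norm(exact - approx))    # shrinks toward 0 as the term count grows

The distance shrinks geometrically with the number of terms, which is the trade-off that i controls.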