Optimizing Millions of Hyperparameters by Implicit Differentiation
Jonathan Lorraine Paul Vicol David Duvenaud University of Toronto, Vector Institute {lorraine, pvicol, duvenaud}@cs.toronto.edu
Abstract

We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results on the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters. We learn a data-augmentation network—where every weight is a hyperparameter tuned for validation performance—that outputs augmented training examples; we learn a distilled dataset where each feature in each datapoint is a hyperparameter; and we tune millions of regularization hyperparameters. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.

1 Introduction

The generalization of neural networks (NNs) depends crucially on the choice of hyperparameters. Hyperparameter optimization (HO) has a rich history [1, 2], and achieved recent success in scaling due to gradient-based optimizers [3–11]. There are dozens of regularization techniques to combine in deep learning, and each may have multiple hyperparameters [12]. If we can scale HO to have as many—or more—hyperparameters as parameters, there are various exciting regularization strategies to investigate. For example, we could learn a distilled dataset with a hyperparameter for every feature of each input [4, 13], weights on each loss term [14–16], or augmentation on each input [17, 18].

When the hyperparameters are low-dimensional—e.g., 1–5 dimensions—simple methods, like random search, work; however, these break down for medium-dimensional HO—e.g., 5–100 dimensions. We may use more scalable algorithms like Bayesian Optimization [19–21], but this often breaks down for high-dimensional HO—e.g., >100 dimensions. We can solve high-dimensional HO problems locally with gradient-based optimizers, but this is difficult because we must differentiate through the optimized weights as a function of the hyperparameters. In other words, we must approximate the Jacobian of the best-response function of the parameters to the hyperparameters.

We leverage the Implicit Function Theorem (IFT) to compute the optimized validation loss gradient with respect to the hyperparameters—hereafter denoted the hypergradient. The IFT requires inverting the training Hessian with respect to the NN weights, which is infeasible for modern, deep networks. Thus, we propose an approximate inverse, motivated by a link to unrolled differentiation [3], that scales to Hessians of large NNs, is more stable than conjugate gradient [22, 7], and only requires a constant amount of memory.

Finally, when fitting many parameters, the amount of data can limit generalization. There are ad hoc rules for partitioning data into training and validation sets—e.g., using 10 % for validation. Often, practitioners re-train their models from scratch on the combined training and validation partitions with optimized hyperparameters, which can provide marginal test-time performance increases. We verify empirically that standard partitioning and re-training procedures perform well when fitting few hyperparameters, but break down when fitting many. When fitting many hyperparameters, we need a large validation partition, which makes re-training our model with optimized hyperparameters vital for strong test performance.

Contributions

• We propose a stable inverse Hessian approximation with constant memory cost.
• We show that the IFT is the limit of differentiating through optimization.
• We scale IFT-based hyperparameter optimization to modern, large neural architectures, including AlexNet and LSTM-based language models.
• We demonstrate several uses for fitting hyperparameters almost as easily as weights, including per-parameter regularization, data distillation, and learned-from-scratch data augmentation methods.
• We explore how training-validation splits should change when tuning many hyperparameters.
Figure 1: Overview of gradient-based hyperparameter optimization (HO). Left: a training loss manifold; Right: a validation loss manifold. The implicit function w*(λ) is the best-response of the weights to the hyperparameters and is shown in blue, projected onto the (λ, w)-plane. We get our desired objective function L*_V(λ) when the best-response is put into the validation loss, shown projected on the hyperparameter axis in red. The validation loss does not depend directly on the hyperparameters, as is typical in hyperparameter optimization. Instead, the hyperparameters only affect the validation loss by changing the weights' response. We show the best-response Jacobian in blue, and the hypergradient in red.
2 Overview of Proposed Algorithm

There are four essential components to understanding our proposed algorithm. Further background is provided in Appendix A, and notation is shown in Table 5.

1. HO is nested optimization: Let L_T and L_V denote the training and validation losses, w the NN weights, and λ the hyperparameters. We aim to find optimal hyperparameters λ* such that the NN minimizes the validation loss after training:

λ* := arg min_λ L*_V(λ),   where   (1)
L*_V(λ) := L_V(λ, w*(λ))   and   w*(λ) := arg min_w L_T(λ, w)   (2)

Our implicit function is w*(λ), which is the best-response of the weights to the hyperparameters. We assume unique solutions to arg min for simplicity.

2. Hypergradients have two terms: For gradient-based HO we want the hypergradient ∂L*_V(λ)/∂λ, which decomposes into:

∂L*_V(λ)/∂λ = ∂L_V/∂λ + ∂L_V/∂w · ∂w*/∂λ |_{λ, w*(λ)}
            = ∂L_V(λ, w*(λ))/∂λ  [hyperparam direct grad.]  +  ∂L_V(λ, w*(λ))/∂w*(λ) × ∂w*(λ)/∂λ  [indirect grad. = parameter direct grad. × best-response Jacobian]   (3)

The direct gradient is easy to compute, but the indirect gradient is difficult to compute because we must account for how the optimal weights change with respect to the hyperparameters (i.e., ∂w*(λ)/∂λ). In HO the direct gradient is often identically 0, necessitating an approximation of the indirect gradient to make any progress (visualized in Fig. 1).

3. We can estimate the implicit best-response with the IFT: We approximate the best-response Jacobian—how the optimal weights change with respect to the hyperparameters—using the IFT (Thm. 1). We present the complete statement in Appendix C, but highlight the key assumptions and results here.

Theorem 1 (Cauchy, Implicit Function Theorem). If for some (λ₀, w₀), ∂L_T/∂w |_{λ₀, w₀} = 0 and regularity conditions are satisfied, then surrounding (λ₀, w₀) there is a function w*(λ) s.t. ∂L_T/∂w |_{λ, w*(λ)} = 0, and we have:

∂w*/∂λ |_{λ₀} = − [∂²L_T/∂w∂wᵀ]⁻¹ × ∂²L_T/∂w∂λᵀ |_{λ₀, w*(λ₀)}   (IFT)

where the first factor is the inverse training Hessian and the second factor contains the training mixed partials.

The condition ∂L_T/∂w |_{λ₀, w₀} = 0 is equivalent to (λ₀, w₀) being a fixed point of the training gradient field. Since w*(λ₀) is a fixed point of the training gradient field, we can leverage the IFT to evaluate the best-response Jacobian locally. We only have access to an approximation of the true best-response—denoted ŵ*—which we can find with gradient descent.
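As a small worked illustration of Thm. 1 (our addition, not from the paper), consider a training loss with a single weight-decay hyperparameter; the quantities in the IFT then take a familiar form.

```latex
% Illustration only (not from the paper): Thm. 1 applied to one weight-decay hyperparameter.
\[
  \mathcal{L}_\mathrm{T}(\lambda, w) = \mathcal{L}(w) + \tfrac{\lambda}{2}\|w\|^2
  \;\Rightarrow\;
  \frac{\partial^2 \mathcal{L}_\mathrm{T}}{\partial w\,\partial w^\top} = H + \lambda I,
  \qquad
  \frac{\partial^2 \mathcal{L}_\mathrm{T}}{\partial w\,\partial \lambda^\top} = w,
\]
\[
  \text{so the IFT gives}\qquad
  \frac{\partial w^*}{\partial \lambda}
  = -\,[\,H + \lambda I\,]^{-1}\, w^*(\lambda),
  \qquad
  H = \frac{\partial^2 \mathcal{L}}{\partial w\,\partial w^\top}\Big|_{w^*(\lambda)} .
\]
```

Directions of high curvature in H respond little to changes in λ, while flat directions are shrunk strongly; the bottleneck is the inverse of the training Hessian, exactly as in the general case.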
4. Tractable inverse Hessian approximations: To exactly invert a general m × m Hessian, we often require O(m³) operations, which is intractable for the matrix in Eq. (IFT) in modern NNs. We can efficiently approximate the inverse with the Neumann series:

[∂²L_T/∂w∂wᵀ]⁻¹ = lim_{i→∞} Σ_{j=0}^{i} [I − ∂²L_T/∂w∂wᵀ]^j   (4)

In Section 4 we show that unrolling differentiation for i steps around locally optimal weights w* is equivalent to approximating the inverse with the first i terms in the Neumann series. We then show how to use this approximation without instantiating any matrices by using efficient vector-Jacobian products.

2.1 Proposed Algorithms

We outline our method in Algs. 1, 2, and 3, where α denotes the learning rate. Alg. 3 is also shown in [22]. We visualize the hypergradient computation in Fig. 2.

Figure 2: Hypergradient computation. The entire computation can be performed efficiently using vector-Jacobian products, provided a cheap approximation to the inverse-Hessian-vector product is available.

Algorithm 1 Gradient-based HO for λ*, w*(λ*)
1: Initialize hyperparameters λ₀ and weights w₀
2: while not converged do
3:   for k = 1 ... N do
4:     w₀ −= α · ∂L_T/∂w |_{λ₀, w₀}
5:   λ₀ −= hypergradient(L_V, L_T, λ₀, w₀)
6: return λ₀, w₀   ▷ λ*, w*(λ*) from Eq. 1

Algorithm 2 hypergradient(L_V, L_T, λ₀, w₀)
1: v1 = ∂L_V/∂w |_{λ₀, w₀}
2: v2 = approxInverseHVP(v1, ∂L_T/∂w)
3: v3 = grad(∂L_T/∂λ, w, grad_outputs = v2)
4: return ∂L_V/∂λ |_{λ₀, w₀} − v3   ▷ Return to Alg. 1

Algorithm 3 approxInverseHVP(v, f): Neumann approximation of the inverse-Hessian-vector product v [∂f/∂w]⁻¹
1: Initialize sum p = v
2: for j = 1 ... i do
3:   v −= α · grad(f, w, grad_outputs = v)
4:   p += v
5: return p   ▷ Return to Alg. 2
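The following is a rough PyTorch sketch of Algs. 2 and 3 (our illustration under stated assumptions; the function names, argument layout, and the exact torch.autograd.grad calls are ours, not the authors' released code). It assumes `weights` and `hypers` are lists of tensors with requires_grad=True and that the training loss depends on both.

```python
import torch
from torch.autograd import grad

def approx_inverse_hvp(v, f, weights, alpha=0.1, num_terms=5):
    """Neumann approximation of v [df/dw]^{-1} (Alg. 3).

    v: list of tensors shaped like `weights`; f: the training gradient dL_T/dw,
    built with create_graph=True; alpha: inner learning rate; num_terms: i in Alg. 3.
    """
    p = [v_.clone() for v_ in v]
    cur = [v_.clone() for v_ in v]
    for _ in range(num_terms):
        # Hessian-vector product via a vector-Jacobian product on the training gradient.
        hvp = grad(f, weights, grad_outputs=cur, retain_graph=True)
        cur = [c - alpha * h for c, h in zip(cur, hvp)]   # cur <- (I - alpha*H) cur
        p = [p_ + c for p_, c in zip(p, cur)]             # accumulate the Neumann series
    return p

def hypergradient(val_loss, train_loss, hypers, weights, alpha=0.1, num_terms=5):
    """Approximate dL_V(lambda, w*(lambda)) / dlambda at the current (lambda, w) (Alg. 2)."""
    v1 = grad(val_loss, weights, retain_graph=True)
    d_train_d_w = grad(train_loss, weights, create_graph=True)
    v2 = approx_inverse_hvp(v1, d_train_d_w, weights, alpha, num_terms)
    # Mixed-partial term v2 x d^2 L_T / (dw dlambda), again as a vector-Jacobian product.
    v3 = grad(d_train_d_w, hypers, grad_outputs=v2, retain_graph=True)
    direct = grad(val_loss, hypers, retain_graph=True, allow_unused=True)  # often identically zero
    return [(-g if d is None else d - g) for d, g in zip(direct, v3)]
```

A typical loop then alternates a few optimizer steps on train_loss for the weights with one step on the hyperparameters using the returned hypergradient, as in Alg. 1.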
3 Related Work

Implicit Function Theorem. The IFT has been used for optimization in nested optimization problems [27, 28], backpropagating through arbitrarily long RNNs [22], or even efficient k-fold cross-validation [29]. Early work applied the IFT to regularization by explicitly computing the Hessian (or Gauss-Newton) inverse [23, 2]. In [24], the identity matrix is used to approximate the inverse Hessian in the IFT. HOAG [30] uses conjugate gradient (CG) to invert the Hessian approximately and provides convergence results given tolerances on the optimal parameter and inverse. In iMAML [9], a center to the weights is fit to perform well on multiple tasks—contrasted with our use of validation loss. In DEQ [31], implicit differentiation is used to add differentiable fixed-point methods into NN architectures. We use a Neumann approximation for the inverse-Hessian, instead of CG [30, 9] or the identity.

Approximate inversion algorithms. CG is difficult to scale to modern, deep NNs. We use the Neumann inverse approximation, which was observed to be a stable alternative to CG in NNs [22, 7]. The stability is motivated by connections between the Neumann approximation and unrolled differentiation [7]. Alternatively, we could use prior knowledge about the NN structure to aid in the inversion—e.g., by using KFAC [32]. It is possible to approximate the Hessian with the Gauss-Newton matrix or Fisher Information matrix [23]. Various works use an identity approximation to the inverse, which is equivalent to 1-step unrolled differentiation [24, 14, 33, 10, 8, 34, 35].

Unrolled differentiation for HO. A key difficulty in nested optimization is approximating how the optimized inner parameters (i.e., NN weights) change with respect to the outer parameters (i.e., hyperparameters). We often optimize the inner parameters with gradient descent, so we can simply differentiate through this optimization. Differentiation through optimization has been applied to nested optimization problems by [3], was scaled to HO for NNs by [4], and has been applied to various applications like learning optimizers [36]. [6] provides convergence results for this class of algorithms, while [5] discusses forward- and reverse-mode variants.

As the number of gradient steps we backpropagate through increases, so does the memory and computational cost. Often, gradient descent does not exactly minimize our objective after a finite number of steps—it only approaches a local minimum. Thus, to see how the hyperparameters affect the local minima, we may have to unroll the optimization infeasibly far. Unrolling a small number of steps can be crucial for performance but may induce bias [37]. [7] discusses connections between unrolling and the IFT, and proposes to unroll only the last L steps. DrMAD [38] proposes an interpolation scheme to save memory.

We compare hypergradient approximations in Table 1, and memory costs of gradient-based HO methods in Table 2. We survey gradient-free HO in Appendix B.
Table 1: Hypergradient approximations.

Method              Steps   Eval.    Hypergradient Approximation
Exact IFT           ∞       w*(λ)    ∂L_V/∂λ − ∂L_V/∂w × [∂²L_T/∂w∂wᵀ]⁻¹ ∂²L_T/∂w∂λᵀ |_{w*(λ)}
Unrolled Diff. [4]  i       w₀       ∂L_V/∂λ − ∂L_V/∂w × Σ_{j≤i} Π_{k<j} [I − ∂²L_T/∂w∂wᵀ] ∂²L_T/∂w∂λᵀ

Lemma. Given the recurrence from unrolling SGD optimization in Eq. 5, we have:

∂w_{i+1}/∂λ = − Σ_{j≤i} [ Π_{k<j} ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_{i−k}(λ)} ) ] ∂²L_T/∂w∂λᵀ |_{λ, w_{i−j}(λ)}

Taking the limit around locally optimal weights w*(λ) gives:

lim_{i→∞} ∂w_{i+1}/∂λ = − [∂²L_T/∂w∂wᵀ]⁻¹ ∂²L_T/∂w∂λᵀ |_{w*(λ)}   (8)

This result is also shown in [7], but they use a different approximation for computing the hypergradient—see Table 1. Instead, we use the following best-response Jacobian approximation, where i controls the trade-off between computation and error bounds:

∂w*/∂λ ≈ − Σ_{j<i} [ I − α ∂²L_T/∂w∂wᵀ ]^j α ∂²L_T/∂w∂λᵀ   (9)

4.3 Scope and Limitations

The assumptions necessary to apply the IFT are as follows: (1) L_V : Λ × W → R is differentiable, (2) L_T : Λ × W → R is twice differentiable with an invertible Hessian at w*(λ), and (3) w* : Λ → W is differentiable. We need continuous hyperparameters to use gradient-based optimization, but many discrete hyperparameters have continuous relaxations [42, 43].

5 Experiments

Our experiments cover approximate inversion algorithms, overfitting small validation sets, dataset distillation, learned data augmentation, and tuning regularization hyperparameters for an LSTM language model. HO algorithms that are not based on implicit differentiation or differentiation through optimization—such as [44–48, 20]—do not scale to the high-dimensional hyperparameters we use. Thus, we cannot sensibly compare to them for high-dimensional problems.

5.1 Approximate Inversion Algorithms

In Fig. 3 we investigate how close various approximations are to the true inverse. We calculate the distance between the approximate hypergradient and the true hypergradient for a 1-layer NN on the Boston housing dataset. The true inverse Hessian has a dominant diagonal, motivating identity approximations, while using more Neumann terms yields structure closer to the true inverse.

Figure 3 (cosine similarity and ℓ2 distance vs. optimization iteration and number of inversion steps): Comparing approximate hypergradients for inverse Hessian approximations to true hypergradients. The Neumann scheme often has greater cosine similarity than CG, but larger ℓ2 distance for equal steps.

Figure 4: Inverse Hessian approximations preprocessed by applying tanh for a 1-layer, fully-connected NN on the Boston housing dataset as in [50].
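For intuition about Eq. 9, the following small self-contained check (ours, not from the paper) compares the truncated Neumann series against the exact inverse on a random positive-definite stand-in "Hessian"; the step size α is chosen so that I − αH is contractive.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
A = rng.standard_normal((m, m))
H = A @ A.T / m + 0.1 * np.eye(m)        # symmetric positive-definite stand-in Hessian
alpha = 1.0 / np.linalg.norm(H, 2)       # ensures the spectral radius of (I - alpha*H) is < 1

v = rng.standard_normal(m)
exact = np.linalg.solve(H, v)            # H^{-1} v

# Truncated Neumann series: H^{-1} v  ~=  alpha * sum_{j<i} (I - alpha*H)^j v
p, cur = v.copy(), v.copy()
for i in range(1, 201):
    cur = cur - alpha * (H @ cur)        # cur = (I - alpha*H)^i v, using only matrix-vector products
    p = p + cur
    if i in (1, 5, 20, 200):
        err = np.linalg.norm(alpha * p - exact) / np.linalg.norm(exact)
        print(f"{i:4d} Neumann terms: relative error {err:.3e}")
```

The relative error shrinks as more terms are added, mirroring how more inversion steps move the approximate hypergradient toward the exact IFT hypergradient.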
5.2 Overfitting a Small Validation Set

In Fig. 5, we check the capacity of our HO algorithm to overfit the validation dataset. We use the same restricted dataset as in [5, 6] of 50 training and validation examples, which allows us to assess HO performance easily. We tune a separate weight decay hyperparameter for each NN parameter as in [33, 4]. We show the performance with a linear classifier, AlexNet [51], and ResNet44 [52]. For AlexNet, this yields more than 50 000 000 hyperparameters, so we can perfectly classify our validation data by optimizing the hyperparameters.

Algorithm 1 achieves 100 % accuracy on the training and validation sets with significantly lower accuracy on the test set (Appendix E, Fig. 10), showing that we have a powerful HO algorithm. The same optimizer is used for weights and hyperparameters in all cases.

Figure 5 (validation error vs. iteration for Linear, AlexNet, and ResNet44): Algorithm 1 can overfit a small validation set on CIFAR-10. It optimizes for loss and achieves 100 % validation accuracy for standard, large models.

5.3 Dataset Distillation

Dataset distillation [4, 13] aims to learn a small, synthetic training dataset from scratch, that condenses the knowledge contained in the original full-sized training set. The goal is that a model trained on the synthetic data generalizes to the original validation and test sets. Distillation is an interesting benchmark for HO as it allows us to introduce tens of thousands of hyperparameters, and visually inspect what is learned: here, every pixel value in each synthetic training example is a hyperparameter. We distill MNIST and CIFAR-10/100 [53], yielding 28×28×10 = 7 840, 32×32×3×10 = 30 720, and 32×32×3×100 = 307 200 hyperparameters, respectively. For these experiments, all labeled data are in our validation set, while our distilled data are in the training set. We visualize the distilled images for each class in Fig. 6, recovering recognizable digits for MNIST and reasonable color averages for CIFAR-10/100.

Figure 6: Distilled datasets for CIFAR-10, MNIST, and CIFAR-100. For CIFAR-100, we show the first 20 classes—the rest are in Appendix Fig. 11. We learn one distilled image per class, so after training a logistic regression classifier on the distillation, it generalizes to the rest of the data.

5.4 Learned Data Augmentation

Data augmentation is a simple way to introduce invariances to a model—such as scale or contrast invariance—that improve generalization [17, 18]. Taking advantage of the ability to optimize many hyperparameters, we learn data augmentation from scratch (Fig. 7).

Specifically, we learn a data augmentation network x̃ = f_λ(x, ε) that takes a training example x and noise ε ∼ N(0, I), and outputs an augmented example x̃. The noise allows us to learn stochastic augmentations. We parameterize f as a U-net [54] with a residual connection from the input to the output, to make it easy to learn the identity mapping. The parameters of the U-net, λ, are hyperparameters tuned for the validation loss—thus, we have 6659 hyperparameters. We trained a ResNet18 [52] on CIFAR-10 with augmented examples produced by the U-net (that is simultaneously trained on the validation set).

Results for the identity and Neumann inverse approximations are shown in Table 3. We omit CG because it performed no better than the identity. We found that using the data augmentation network improves validation and test accuracy by 2–3 %, and yields smaller variance between multiple random restarts. In [55], a different augmentation network architecture is learned with adversarial training.

Inverse Approx.      Validation      Test
0                    92.5 ±0.021     92.6 ±0.017
3 Neumann            95.1 ±0.002     94.6 ±0.001
3 Unrolled Diff.     95.0 ±0.002     94.7 ±0.001
I                    94.6 ±0.002     94.1 ±0.002

Table 3: Accuracy of different inverse approximations. Using 0 means that no HO occurs, and the augmentation is initially the identity. The Neumann approach performs similarly to unrolled differentiation [4, 7] with equal steps and less memory. Using more terms does better than the identity, and the identity performed better than CG (not shown), which was unstable.

Figure 7: Learned data augmentations. The original image is on the left, followed by two augmented samples and the standard deviation of the pixel intensities from the augmentation distribution.
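To make the augmentation setup concrete, here is a minimal sketch (ours; the tiny convolutional augmenter stands in for the paper's U-net, and all names are illustrative). The key point is that the augmenter's parameters play the role of λ and enter only the training loss, while the validation loss sees clean examples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAugmenter(nn.Module):
    """Stand-in for the paper's U-net: image plus a noise channel in, residual image out."""
    def __init__(self, channels=3):
        super().__init__()
        self.conv1 = nn.Conv2d(channels + 1, 16, 3, padding=1)   # +1 channel for noise
        self.conv2 = nn.Conv2d(16, channels, 3, padding=1)

    def forward(self, x):
        eps = torch.randn(x.shape[0], 1, *x.shape[2:], device=x.device)  # stochastic augmentations
        h = F.relu(self.conv1(torch.cat([x, eps], dim=1)))
        return x + self.conv2(h)                                  # residual => easy identity mapping

# A small classifier stands in for ResNet18; its parameters are the inner weights w.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
augmenter = TinyAugmenter()            # its parameters play the role of the hyperparameters lambda

def training_loss(x, y):
    return F.cross_entropy(model(augmenter(x)), y)   # hyperparameters enter only here

def validation_loss(x, y):
    return F.cross_entropy(model(x), y)              # clean examples; no direct lambda dependence
```

The weights are updated on training_loss as usual, while the augmenter's parameters receive the hypergradient of validation_loss, e.g., with the hypergradient sketch shown earlier.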
5.5 RNN Hyperparameter Optimization

We also used our proposed algorithm to tune regularization hyperparameters for an LSTM [56] trained on the Penn TreeBank (PTB) corpus [57]. As in [58], we used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings. Additional details are provided in Appendix E.4.

Overfitting Validation Data. We first verify that our algorithm can overfit the validation set in a small-data setting with 10 training and 10 validation sequences (Fig. 8). The LSTM architecture we use has 13 280 400 weights, and we tune a separate weight decay hyperparameter per weight. We overfit the validation set, reaching nearly 0 validation loss.

Figure 8 (training and validation loss vs. iteration): Alg. 1 can overfit a small validation set with an LSTM on PTB.

Large-Scale HO. There are various forms of regularization used for training RNNs, including variational dropout [59] on the input, hidden state, and output; embedding dropout that sets rows of the embedding matrix to 0, removing tokens from all sequences in a mini-batch; DropConnect [60] on the hidden-to-hidden weights; and activation and temporal activation regularization. We tune these 7 hyperparameters simultaneously. Additionally, we experiment with tuning a separate dropout/DropConnect rate for each activation/weight, giving 1 691 951 total hyperparameters. To allow for gradient-based optimization of dropout rates, we use concrete dropout [61].

Instead of using the small dropout initialization as in [26], we use a larger initialization of 0.5, which prevents early learning rate decay for our method. The results for our new initialization with no HO, our method tuning the same hyperparameters as [26] ("Ours"), and our method tuning many more hyperparameters ("Ours, Many") are shown in Table 4. We are able to tune hyperparameters more quickly and achieve better perplexities than the alternatives.
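The per-activation/per-weight rates above are made differentiable with a concrete relaxation. The sketch below is ours, following the general form of concrete dropout [61]; the exact parameterization, temperature, and initialization value are our assumptions for illustration.

```python
import torch

def concrete_dropout_mask(logit_p, shape, temperature=0.1):
    """Relaxed Bernoulli 'keep' mask whose drop rate is differentiable.

    logit_p: per-unit (or per-weight) drop-rate logits -- these are the hyperparameters lambda.
    """
    p = torch.sigmoid(logit_p)                                  # drop probability in (0, 1)
    u = torch.rand(shape, device=logit_p.device).clamp(1e-6, 1 - 1e-6)
    drop = torch.sigmoid(
        (torch.log(p) - torch.log1p(-p) + torch.log(u) - torch.log1p(-u)) / temperature
    )
    return (1.0 - drop) / (1.0 - p)                             # rescaled keep mask, differentiable w.r.t. logit_p

# Per-weight DropConnect on a hidden-to-hidden matrix: one drop-rate logit per weight.
W_hh = torch.randn(650, 650, requires_grad=True)                # an LSTM-style weight (inner parameter)
logit_p = torch.zeros_like(W_hh, requires_grad=True)            # sigmoid(0) = 0.5 initial drop rate (hyperparameters)
W_used = W_hh * concrete_dropout_mask(logit_p, W_hh.shape)      # used in the training loss only
```

Because the mask is a smooth function of logit_p, the hypergradient can flow from the validation loss back to millions of individual drop rates.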
Method          Validation   Test    Time (s)
Grid Search     97.32        94.58   100k
Random Search   84.81        81.46   100k
Bayesian Opt.   72.13        69.29   100k
STN             70.30        67.68   25k
No HO           75.72        71.91   18.5k
Ours            69.22        66.40   18.5k
Ours, Many      68.18        66.14   18.5k

Table 4: Comparing HO methods for LSTM training on PTB (validation/test perplexity; lower is better). We tune millions of hyperparameters faster and with comparable memory to competitors tuning a handful. Our method competitively optimizes the same 7 hyperparameters as baselines from [26] (first four rows). We show a performance boost by tuning millions of hyperparameters, introduced with per-unit/weight dropout and DropConnect. "No HO" shows how the hyperparameter initialization affects training.

5.6 Effects of Many Hyperparameters

Given the ability to tune high-dimensional hyperparameters and the potential risk of overfitting to the validation set, should we reconsider how our training and validation splits are structured? Do the same heuristics apply as for low-dimensional hyperparameters (e.g., use ∼10 % of the data for validation)?

In Fig. 9 we see how splitting our data into training and validation sets of different ratios affects test performance. We show the results of jointly optimizing the NN weights and hyperparameters, as well as the results of fixing the final optimized hyperparameters and re-training the NN weights from scratch, which is a common technique for boosting performance [62].

We evaluate a high-dimensional regime with a separate weight decay hyperparameter per NN parameter, and a low-dimensional regime with a single, global weight decay. We observe that: (1) for few hyperparameters, the optimal combination of validation data and hyperparameters has similar test performance with and without re-training, because the optimal amount of validation data is small; and (2) for many hyperparameters, the optimal combination of validation data and hyperparameters is significantly affected by re-training, because the optimal amount of validation data needs to be large to fit our hyperparameters effectively.

For few hyperparameters, our results agree with the standard practice of using 10 % of the data for validation and the other 90 % for training. For many hyperparameters, our results show that we should use larger validation partitions for HO. If we use a large validation partition to fit the hyperparameters, it is critical to re-train our model with all of the data.

Figure 9 (left: MNIST with logistic regression, without re-training; right: with re-training): Test accuracy of logistic regression on MNIST, with different size validation splits. Solid lines correspond to a single global weight decay (1 hyperparameter), while dotted lines correspond to a separate weight decay per weight (many hyperparameters). The best validation proportion for test performance is different after re-training for many hyperparameters, but similar for few hyperparameters.

6 Conclusion

We present a gradient-based hyperparameter optimization algorithm that scales to high-dimensional hyperparameters for modern, deep NNs. We use the implicit function theorem to formulate the hypergradient as a matrix equation, whose bottleneck is inverting the Hessian of the training loss with respect to the NN parameters. We scale the hypergradient computation to large NNs by approximately inverting the Hessian, leveraging a relationship with unrolled differentiation.

We believe algorithms of this nature provide a path for practical nested optimization, where we have Hessians with known structure. Examples of this include GANs [39], and other multi-agent games [63, 64].

Acknowledgements

We thank Chris Pal for recommending we investigate re-training with all the data, Haoping Xu for discussing related experiments on inverse-approximation variants, Roger Grosse for guidance, and Cem Anil & Chris Cremer for their feedback on the paper. Paul Vicol was supported by a JP Morgan AI Fellowship. We also thank everyone else at Vector for helpful discussions and feedback.

References
[1] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
[2] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.
[3] Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.
[4] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
[5] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pages 1165–1173, 2017.
[6] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1563–1572, 2018.
[7] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In International Conference on Artificial Intelligence and Statistics, pages 1723–1732, 2019.
[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
[9] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630, 2019.
[10] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[11] Edward Grefenstette, Brandon Amos, Denis Yarats, Phu Mon Htut, Artem Molchanov, Franziska Meier, Douwe Kiela, Kyunghyun Cho, and Soumith Chintala. Generalized inner loop meta-learning. arXiv preprint arXiv:1910.01727, 2019.
[12] Jan Kukačka, Vladimir Golkov, and Daniel Cremers. Regularization for deep learning: A taxonomy. arXiv preprint arXiv:1710.10686, 2017.
[13] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
[14] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4331–4340, 2018.
[15] Tae-Hoon Kim and Jonghyun Choi. ScreenerNet: Learning self-paced curriculum for deep neural networks. arXiv preprint arXiv:1801.00904, 2018.
[16] Jiong Zhang, Hsiang-fu Yu, and Inderjit S Dhillon. AutoAssist: A framework to accelerate training of deep neural networks. arXiv preprint arXiv:1905.03381, 2019.
[17] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
[18] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
[19] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.
[20] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
[21] Kirthevasan Kandasamy, Karun Raju Vysyaraju, Willie Neiswanger, Biswajit Paria, Christopher R Collins, Jeff Schneider, Barnabas Poczos, and Eric P Xing. Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly. arXiv preprint arXiv:1903.06694, 2019.
[22] Renjie Liao, Yuwen Xiong, Ethan Fetaya, Lisa Zhang, KiJung Yoon, Xaq Pitkow, Raquel Urtasun, and Richard Zemel. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pages 3088–3097, 2018.
[23] Jan Larsen, Lars Kai Hansen, Claus Svarer, and M Ohlsson. Design and regularization of neural networks: The optimal use of a validation set. In Neural Networks for Signal Processing VI: Proceedings of the 1996 IEEE Signal Processing Society Workshop, pages 62–71, 1996.
[24] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In International Conference on Machine Learning, pages 2952–2960, 2016.
[25] Jonathan Lorraine and David Duvenaud. Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419, 2018.
[26] Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. In International Conference on Learning Representations, 2019.
[27] Peter Ochs, René Ranftl, Thomas Brox, and Thomas Pock. Bilevel optimization with nonsmooth lower level problems. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 654–665, 2015.
[28] Anonymous. On solving minimax optimization locally: A follow-the-ridge approach. International Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=Hkx7_1rKwS.
[29] Ahmad Beirami, Meisam Razaviyayn, Shahin Shahrampour, and Vahid Tarokh. On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems, pages 3455–3465, 2017.
[30] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
[31] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. arXiv preprint arXiv:1909.01377, 2019.
[32] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.
[33] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems, pages 998–1008, 2018.
[34] Ira Shavitt and Eran Segal. Regularization learning networks: Deep learning for tabular datasets. In Advances in Neural Information Processing Systems, pages 1379–1389, 2018.
[35] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
[36] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
[37] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021, 2018.
[38] Jie Fu, Hongyin Luo, Jiashi Feng, Kian Hsiang Low, and Tat-Seng Chua. DrMAD: Distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. In International Joint Conference on Artificial Intelligence, pages 1469–1475, 2016.
[39] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[40] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pages 363–372, 2018.
[41] Stefan Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3:133–181, 1922.
[42] C Maddison, A Mnih, and Y Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
[43] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
[44] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
[45] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pages 240–248, 2016.
[46] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
[47] Manoj Kumar, George E Dahl, Vijay Vasudevan, and Mohammad Norouzi. Parallel architecture and hyperparameter search via successive halving and classification. arXiv preprint arXiv:1805.10255, 2018.
[48] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
[49] David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.
[50] Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. In International Conference on Machine Learning, pages 5847–5856, 2018.
[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[52] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[53] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[54] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
[55] Saypraseuth Mounsaveng, David Vazquez, Ismail Ben Ayed, and Marco Pedersoli. Adversarial learning of general transformations for data augmentation. International Conference on Learning Representations, 2019.
[56] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[57] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[58] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1027–1035, 2016.
[59] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
[60] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
[61] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
[62] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[63] Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In International Conference on Autonomous Agents and MultiAgent Systems, pages 122–130, 2018.
[64] Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. arXiv preprint arXiv:1811.08469, 2018.
[65] Andrew Brock, Theodore Lim, James Millar Ritchie, and Nicholas J Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.
[66] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[67] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
[68] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent. 2012.
[69] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[70] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. International Conference on Learning Representations, 2018.

Appendix

A Extended Background

In this section we provide an outline of our notation (Table 5) and the proposed algorithm. Here, we assume we have access to a finite dataset D = {(x_i, y_i) | i = 1 ... n}, with n examples drawn from the distribution p(x, y) with support P. We denote the input and target domains by X and Y, respectively. Assume y : X → Y is a function and we wish to learn ŷ : X × W → Y with a NN parameterized by w ∈ W, s.t. ŷ is close to y. We measure how close a predicted value is to a target with the prediction loss L : Y × Y → R.
Our goal is to minimize the expected prediction loss, or population risk: arg min_w E_{x∼p(x)}[L(ŷ(x, w), y(x))]. Since we only have access to a finite number of samples, we minimize the empirical risk: arg min_w (1/n) Σ_{x,y∈D} L(ŷ(x, w), y(x)).

Due to a limited size dataset D, there may be a significant difference between the minimizer of the empirical risk and the population risk. We can estimate this difference by partitioning our dataset into training and validation datasets—D_train, D_valid. We find the minimizer over the training dataset D_train, and estimate its performance on the population risk by evaluating the empirical risk over the validation dataset D_valid. We introduce modifications to the empirical training risk to decrease our population risk, parameterized by λ ∈ Λ. These parameters for generalization are called the hyperparameters. We call the modified empirical training risk our training loss for simplicity and denote it L_T(λ, w). Our validation empirical risk is called the validation loss for simplicity and denoted by L_V(λ, w). Often the validation loss does not directly depend on the hyperparameters, and we just have L_V(w).

The population risk is estimated by plugging the training loss minimizer w*(λ) = arg min_w L_T(λ, w) into the validation loss for the estimated population risk L*_V(λ) = L_V(λ, w*(λ)). We want our hyperparameters to minimize the estimated population risk: λ* = arg min_λ L*_V(λ). We can create a third partition of our dataset, D_test, to assess if we have overfit the validation dataset D_valid with our hyperparameters λ.

B Extended Related Work

Independent HO: A simple class of HO algorithms involves making a number of independent hyperparameter selections, and training the model to completion on them. Popular examples include grid search and random search [46]. Since each hyperparameter selection is independent, these algorithms are trivial to parallelize.

Global HO: Some HO algorithms attempt to find a globally optimal hyperparameter setting, which can be important if the loss is non-convex. A simple example is random search, while a more sophisticated example is Bayesian optimization [19–21]. These HO algorithms often involve re-initializing the hyperparameters and weights on each optimization iteration. This allows global optimization, at the cost of expensive re-training of weights or hyperparameters.

Local HO: Other HO algorithms only attempt to find a locally optimal hyperparameter setting. Often these algorithms will maintain a current estimate of the best combination of hyperparameters and weights. On each optimization iteration, the hyperparameters are adjusted by a small amount, which allows us to avoid excessive re-training of the weights on each update. This is because the new optimal weights are near the old optimal weights due to a small change in the hyperparameters.

Learned proxy function based HO: Many HO algorithms attempt to learn a proxy function for optimization. The proxy function is used to estimate the loss for a hyperparameter selection. We could learn a proxy function for global or local HO. We can learn a useful proxy function over any node in our computational graph, including the optimized weights. For example, we could learn how the optimized weights change w.r.t. the hyperparameters [25], how the optimized predictions change w.r.t. the hyperparameters [26], or how the optimized validation loss changes w.r.t. the hyperparameters as in Bayesian optimization. It is possible to do gradient descent on the proxy function to find new hyperparameters to query, as in Bayesian optimization. Alternatively, we could use a non-differentiable proxy function to get cheap estimates of the validation loss, like SMASH [65] for architecture choices.
Table 5: Notation

HO — Hyperparameter optimization
NN — Neural network
IFT — Implicit Function Theorem
HVP / JVP — Hessian/Jacobian-vector product
λ, w — Hyperparameters and NN parameters/weights
n, m — Hyperparameter and NN parameter dimensionality
Λ ⊆ Rⁿ, W ⊆ Rᵐ — Hyperparameter and NN parameter domains
λ₀, w₀ — Arbitrary, fixed hyperparameters and weights
L_T(λ, w), L_V(λ, w) — Training loss and validation loss
w*(λ) — Best-response of the weights to the hyperparameters
ŵ*(λ) — An approximate best-response of the weights to the hyperparameters
L*_V(λ) = L_V(λ, w*(λ)) — The validation loss with best-responding weights
Red — (Approximations to) the validation loss with best-responding weights
W* = w*(Λ) — The domain of best-responding weights
λ* — The optimal hyperparameters
x, y — An input and its associated target
X, Y — The input and target domains, respectively
D — A data matrix consisting of tuples of inputs and targets
ŷ(x, w) — A predicted target for an input x and weights w
∂L_V/∂λ, ∂L_V/∂w — The (validation loss hyperparameter / parameter) direct gradient
Green — (Approximations to) the validation loss direct gradient
∂w*/∂λ — The best-response Jacobian
Blue — (Approximations to) the (Jacobian of the) best-response of the weights to the hyperparameters
∂L_V/∂w · ∂w*/∂λ — The indirect gradient
∂L*_V/∂λ — A hypergradient: sum of the validation loss's direct and indirect gradients
[∂²L_T/∂w∂wᵀ]⁻¹ — The training Hessian inverse
Magenta — (Approximations to) the training Hessian inverse
∂L_V/∂w [∂²L_T/∂w∂wᵀ]⁻¹ — The vector–inverse-Hessian product
Orange — (Approximations to) the vector–inverse-Hessian product
∂²L_T/∂w∂λᵀ — The training mixed partial derivatives
I — The identity matrix

C Implicit Function Theorem

Theorem (Augustin-Louis Cauchy, Implicit Function Theorem). Let ∂L_T/∂w (λ, w) : Λ × W → W be a continuously differentiable function. Fix a point (λ₀, w₀) with ∂L_T/∂w (λ₀, w₀) = 0. If the Jacobian J_w^{∂L_T/∂w}(λ₀, w₀) is invertible, there exists an open set U ⊆ Λ containing λ₀ s.t. there exists a continuously differentiable function w* : U → W s.t.:

w*(λ₀) = w₀   and   ∀λ ∈ U, ∂L_T/∂w (λ, w*(λ)) = 0

Moreover, the partial derivatives of w* in U are given by the matrix product:

∂w*/∂λ (λ) = − [ J_w^{∂L_T/∂w}(λ, w*(λ)) ]⁻¹ J_λ^{∂L_T/∂w}(λ, w*(λ))

Typically the IFT is presented with ∂L_T/∂w = f, w* = g, Λ = Rⁿ, W = Rᵐ, λ = x, w = y, λ₀ = a, w₀ = b.

D Proofs

Lemma (1). If the recurrence given by unrolling SGD optimization in Eq. 5 has a fixed point w_∞ (i.e., 0 = ∂L_T/∂w |_{λ, w_∞(λ)}), then:

∂w_∞/∂λ = − [∂²L_T/∂w∂wᵀ]⁻¹ ∂²L_T/∂w∂λᵀ |_{w_∞(λ)}

Proof.
0 = ∂L_T/∂w |_{λ, w_∞(λ)}   (given)
⇒ ∂/∂λ ( ∂L_T/∂w |_{λ, w_∞(λ)} ) = 0
⇒ ∂²L_T/∂w∂λᵀ |_{λ, w_∞(λ)} + ∂²L_T/∂w∂wᵀ |_{λ, w_∞(λ)} · ∂w_∞/∂λ = 0   (chain rule through w_∞(λ))
⇒ ∂²L_T/∂w∂wᵀ |_{λ, w_∞(λ)} · ∂w_∞/∂λ = − ∂²L_T/∂w∂λᵀ |_{λ, w_∞(λ)}   (re-arrange terms)
⇒ ∂w_∞/∂λ = − [∂²L_T/∂w∂wᵀ]⁻¹ ∂²L_T/∂w∂λᵀ |_{λ, w_∞(λ)}   (left-multiply by the inverse Hessian) ∎
Lemma (2). Given the recurrence from unrolling SGD optimization in Eq. 5, we have:

∂w_{i+1}/∂λ = − Σ_{j≤i} [ Π_{k<j} ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_{i−k}(λ)} ) ] ∂²L_T/∂w∂λᵀ |_{λ, w_{i−j}(λ)}

Proof.
∂w_{i+1}/∂λ = ∂/∂λ [ w_i(λ) − ∂L_T/∂w |_{λ, w_i(λ)} ]   (take derivative w.r.t. λ)
 = ∂w_i/∂λ − ∂/∂λ ( ∂L_T/∂w |_{λ, w_i(λ)} )   (chain rule)
 = ∂w_i/∂λ − [ ∂²L_T/∂w∂λᵀ + ∂²L_T/∂w∂wᵀ · ∂w_i/∂λ ] |_{λ, w_i(λ)}   (chain rule through w_i(λ))
 = − ∂²L_T/∂w∂λᵀ |_{λ, w_i(λ)} + ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_i(λ)} ) ∂w_i/∂λ   (re-arrange terms)
 = − ∂²L_T/∂w∂λᵀ |_{λ, w_i(λ)} + ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_i(λ)} ) · [ − ∂²L_T/∂w∂λᵀ |_{λ, w_{i−1}(λ)} + ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_{i−1}(λ)} ) ∂w_{i−1}/∂λ ]   (expand ∂w_i/∂λ)
 = − ∂²L_T/∂w∂λᵀ |_{λ, w_i(λ)} − ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_i(λ)} ) ∂²L_T/∂w∂λᵀ |_{λ, w_{i−1}(λ)} + Π_{k<2} ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_{i−k}(λ)} ) ∂w_{i−1}/∂λ   (re-arrange terms)
 = ···
So, ∂w_{i+1}/∂λ = − Σ_{j≤i} [ Π_{k<j} ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_{i−k}(λ)} ) ] ∂²L_T/∂w∂λᵀ |_{λ, w_{i−j}(λ)}   (telescope the recurrence) ∎

Theorem (Neumann-SGD). Given the recurrence from unrolling SGD optimization in Eq. 5, if w₀ = w*(λ):

∂w_{i+1}/∂λ = − [ Σ_{j≤i} ( I − ∂²L_T/∂w∂wᵀ )^j ] ∂²L_T/∂w∂λᵀ |_{w*(λ)}

and if I − ∂²L_T/∂w∂wᵀ is contractive:

lim_{i→∞} ∂w_{i+1}/∂λ = − [∂²L_T/∂w∂wᵀ]⁻¹ ∂²L_T/∂w∂λᵀ |_{w*(λ)}

Proof.
lim_{i→∞} ∂w_{i+1}/∂λ   (take lim_{i→∞})
 = lim_{i→∞} − Σ_{j≤i} [ Π_{k<j} ( I − ∂²L_T/∂w∂wᵀ |_{λ, w_{i−k}(λ)} ) ] ∂²L_T/∂w∂λᵀ |_{λ, w_{i−j}(λ)}   (by Lemma 2)
 = − lim_{i→∞} Σ_{j≤i} ( I − ∂²L_T/∂w∂wᵀ )^j ∂²L_T/∂w∂λᵀ |_{λ, w*(λ)}   (simplify, since w_i = w*(λ) for all i)
 = − [ I − ( I − ∂²L_T/∂w∂wᵀ ) ]⁻¹ ∂²L_T/∂w∂λᵀ |_{λ, w*(λ)}   (contractive & Neumann series)
 = − [∂²L_T/∂w∂wᵀ]⁻¹ ∂²L_T/∂w∂λᵀ |_{λ, w*(λ)}   (simplify) ∎

E Experiments

We use PyTorch [66] as our computational framework. All experiments were performed on NVIDIA TITAN Xp GPUs.

For all CNN experiments we use the following optimization setup: for the NN weights we use Adam [67] with a learning rate of 1e-4. For the hyperparameters we use RMSprop [68] with a learning rate of 1e-2.

E.1 Overfitting a Small Validation Set

We see our algorithm's ability to overfit the validation data (see Fig. 10). We use 50 training inputs and 50 validation inputs with the standard testing partition for both MNIST and CIFAR-10. We check performance with logistic regression (Linear), a 1-layer fully-connected NN with as many hidden units as input size (e.g., 28 × 28 = 784, or 32 × 32 × 3 = 3072), LeNet [69], AlexNet [51], and ResNet44 [52]. In all examples we can achieve 100 % training and validation accuracy, while the testing accuracy is significantly lower.
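The CNN optimization setup above maps directly to two separate optimizers; the sketch below is ours, with placeholder parameter groups standing in for an actual model and hyperparameter set.

```python
import torch

# Hypothetical parameter groups: `model_params` holds the weights w, `hyper_params` the hyperparameters lambda.
model_params = [torch.nn.Parameter(torch.zeros(10, 10))]
hyper_params = [torch.nn.Parameter(torch.zeros(10, 10))]

weight_opt = torch.optim.Adam(model_params, lr=1e-4)      # weights: Adam, lr 1e-4 (Appendix E)
hyper_opt = torch.optim.RMSprop(hyper_params, lr=1e-2)    # hyperparameters: RMSprop, lr 1e-2
```

The weight optimizer steps on the training loss every iteration, while the hyperparameter optimizer steps on the approximate hypergradient as in Alg. 1.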
Figure 10: Overfitting validation data. Algorithm 1 can overfit the validation dataset. We use 50 training inputs and 50 validation inputs with the standard testing partition for both MNIST and CIFAR-10. We check the performance with logistic regression (Linear), a 1-layer fully-connected NN with as many hidden units as input size (e.g., 28 × 28 = 784, or 32 × 32 × 3 = 3072), LeNet [69], AlexNet [51], and ResNet44 [52]. Separate lines are plotted for the training, validation, and testing error. In all examples we achieve 100 % training and validation accuracy, while the testing accuracy is significantly lower.

E.2 Dataset Distillation

With MNIST we use the entire dataset in validation, while for CIFAR we use 300 validation data points.

E.3 Learned Data Augmentation

Augmentation Network Details: Data augmentation can be framed as an image-to-image transformation problem; inspired by this, we use a U-Net [54] as the data augmentation network. To allow for stochastic transformations, we feed in random noise by concatenating a noise channel to the input image, so the resulting input has 4 channels.

E.4 RNN Hyperparameter Optimization

We base our implementation on the AWD-LSTM codebase https://github.com/salesforce/awd-lstm-lm. Similar to [58], we used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings.

Overfitting Validation Data: We used a subset of 10 training sequences and 10 validation sequences, and tuned separate weight decays per parameter. The LSTM architecture we use has 13 280 400 weights, and thus an equal number of weight decay hyperparameters.

Optimization Details: For the large-scale experiments, we follow the training setup proposed in [70]: for the NN weights, we use SGD with learning rate 30 and gradient clipping to magnitude 0.25. The learning rate was decayed by a factor of 4 based on the non-monotonic criterion introduced by [70] (i.e., when the validation loss fails to decrease for 5 epochs). To optimize the hyperparameters, we used Adam with learning rate 0.001. We trained on sequences of length 70 in mini-batches of size 40.

Figure 11: The complete dataset distillation for CIFAR-100. Referenced in Fig. 6.