
Optimizing Millions of Hyperparameters by Implicit Differentiation

Jonathan Lorraine, Paul Vicol, David Duvenaud
University of Toronto, Vector Institute
{lorraine, pvicol, duvenaud}@cs.toronto.edu

Abstract

We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results on the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters. We learn a data-augmentation network—where every weight is a hyperparameter tuned for validation performance—that outputs augmented training examples; we learn a distilled dataset where each feature in each datapoint is a hyperparameter; and we tune millions of regularization hyperparameters. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.

1 Introduction

The generalization of neural networks (NNs) depends crucially on the choice of hyperparameters. Hyperparameter optimization (HO) has a rich history [1, 2], and achieved recent success in scaling due to gradient-based optimizers [3–11]. There are dozens of regularization techniques to combine in deep learning, and each may have multiple hyperparameters [12]. If we can scale HO to have as many—or more—hyperparameters as parameters, there are various exciting regularization strategies to investigate. For example, we could learn a distilled dataset with a hyperparameter for every feature of each input [4, 13], weights on each loss term [14–16], or augmentation on each input [17, 18].

When the hyperparameters are low-dimensional—e.g., 1–5 dimensions—simple methods, like random search, work; however, these break down for medium-dimensional HO—e.g., 5–100 dimensions. We may use more scalable algorithms like Bayesian optimization [19–21], but this often breaks down for high-dimensional HO—e.g., >100 dimensions. We can solve high-dimensional HO problems locally with gradient-based optimizers, but this is difficult because we must differentiate through the optimized weights as a function of the hyperparameters. In other words, we must approximate the Jacobian of the best-response function of the parameters to the hyperparameters.

We leverage the Implicit Function Theorem (IFT) to compute the optimized validation loss gradient with respect to the hyperparameters—hereafter denoted the hypergradient. The IFT requires inverting the training Hessian with respect to the NN weights, which is infeasible for modern, deep networks. Thus, we propose an approximate inverse, motivated by a link to unrolled differentiation [3], that scales to Hessians of large NNs, is more stable than conjugate gradient [22, 7], and only requires a constant amount of memory.

Finally, when fitting many parameters, the amount of data can limit generalization. There are ad hoc rules for partitioning data into training and validation sets—e.g., using 10% for validation. Often, practitioners re-train their models from scratch on the combined training and validation partitions with optimized hyperparameters, which can provide marginal test-time performance increases. We verify empirically that standard partitioning and re-training procedures perform well when fitting few hyperparameters, but break down when fitting many. When fitting many hyperparameters, we need a large validation partition, which makes re-training our model with optimized hyperparameters vital for strong test performance.

Contributions

• We propose a stable inverse Hessian approximation with constant memory cost.

• We show that the IFT is the limit of differentiating through optimization.

• We scale IFT-based hyperparameter optimization to modern, large neural architectures, including AlexNet and LSTM-based language models.

• We demonstrate several uses for fitting hyperparameters almost as easily as weights, including per-parameter regularization, data distillation, and learned-from-scratch methods.

• We explore how training–validation splits should change when tuning many hyperparameters.

Figure 1: Overview of gradient-based hyperparameter optimization (HO). Left: a training loss manifold; Right: a validation loss manifold. The implicit function $w^*(\lambda)$ is the best-response of the weights to the hyperparameters, shown in blue projected onto the $(\lambda, w)$-plane. We get our desired objective function $\mathcal{L}^*_V(\lambda)$ when the best-response is put into the validation loss, shown projected on the hyperparameter axis in red. The validation loss does not depend directly on the hyperparameters, as is typical in hyperparameter optimization. Instead, the hyperparameters only affect the validation loss by changing the weights' response. We show the best-response Jacobian in blue, and the hypergradient in red.
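As a concrete one-dimensional illustration of these objects (our toy example, not one from the paper), take a single weight $w$, a single L2-regularization hyperparameter $\lambda$, training loss $\mathcal{L}_T(\lambda, w) = \tfrac{1}{2}(w-a)^2 + \tfrac{\lambda}{2}w^2$, and validation loss $\mathcal{L}_V(w) = \tfrac{1}{2}(w-b)^2$ for constants $a, b$. The best-response and hypergradient are available in closed form:
\[
w^*(\lambda) = \arg\min_w \mathcal{L}_T(\lambda, w) = \frac{a}{1+\lambda},
\qquad
\frac{\partial w^*}{\partial \lambda} = -\frac{a}{(1+\lambda)^2},
\]
\[
\frac{\partial \mathcal{L}^*_V}{\partial \lambda}
= \frac{\partial \mathcal{L}_V}{\partial w}\bigg|_{w^*(\lambda)} \cdot \frac{\partial w^*}{\partial \lambda}
= \left(\frac{a}{1+\lambda} - b\right)\left(-\frac{a}{(1+\lambda)^2}\right).
\]
Here the validation loss has no direct dependence on $\lambda$, so the entire hypergradient flows through the best-response, exactly the situation the figure depicts. The IFT expression of Section 2 recovers the same Jacobian from $\partial^2 \mathcal{L}_T/\partial w^2 = 1+\lambda$ and $\partial^2 \mathcal{L}_T/\partial w\,\partial\lambda = w$ evaluated at $w^*(\lambda)$: $-(1+\lambda)^{-1}\,w^*(\lambda) = -a/(1+\lambda)^2$.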

2 Overview of Proposed Algorithm

There are four essential components to understanding our proposed algorithm. Further background is provided in Appendix A, and notation is shown in Table 5.

1. HO is nested optimization: Let $\mathcal{L}_T$ and $\mathcal{L}_V$ denote the training and validation losses, $w$ the NN weights, and $\lambda$ the hyperparameters. We aim to find optimal hyperparameters $\lambda^*$ such that the NN minimizes the validation loss after training:
\[
\lambda^* := \arg\min_{\lambda} \mathcal{L}^*_V(\lambda) \quad \text{where} \tag{1}
\]
\[
\mathcal{L}^*_V(\lambda) := \mathcal{L}_V(\lambda, w^*(\lambda)) \quad \text{and} \quad w^*(\lambda) := \arg\min_{w} \mathcal{L}_T(\lambda, w) \tag{2}
\]
Our implicit function is $w^*(\lambda)$, which is the best-response of the weights to the hyperparameters. We assume unique solutions to $\arg\min$ for simplicity.

2. Hypergradients have two terms: For gradient-based HO we want the hypergradient $\frac{\partial \mathcal{L}^*_V(\lambda)}{\partial \lambda}$, which decomposes into:
\[
\frac{\partial \mathcal{L}^*_V(\lambda)}{\partial \lambda}
= \left[ \frac{\partial \mathcal{L}_V}{\partial \lambda} + \frac{\partial \mathcal{L}_V}{\partial w}\,\frac{\partial w^*}{\partial \lambda} \right]_{\lambda, w^*(\lambda)}
= \underbrace{\frac{\partial \mathcal{L}_V(\lambda, w^*(\lambda))}{\partial \lambda}}_{\text{hyperparam. direct grad.}}
+ \underbrace{\underbrace{\frac{\partial \mathcal{L}_V(\lambda, w^*(\lambda))}{\partial w^*(\lambda)}}_{\text{parameter direct grad.}}
\times
\underbrace{\frac{\partial w^*(\lambda)}{\partial \lambda}}_{\text{best-response Jacobian}}}_{\text{hyperparam. indirect grad.}}
\tag{3}
\]
The direct gradient is easy to compute, but the indirect gradient is difficult to compute because we must account for how the optimal weights change with respect to the hyperparameters (i.e., $\frac{\partial w^*(\lambda)}{\partial \lambda}$). In HO the direct gradient is often identically 0, necessitating an approximation of the indirect gradient to make any progress (visualized in Fig. 1).

3. We can estimate the implicit best-response with the IFT: We approximate the best-response Jacobian—how the optimal weights change with respect to the hyperparameters—using the IFT (Thm. 1). We present the complete statement in Appendix C, but highlight the key assumptions and results here.

Theorem 1 (Cauchy, Implicit Function Theorem). If for some $(\lambda_0, w_0)$, $\frac{\partial \mathcal{L}_T}{\partial w}\big|_{\lambda_0, w_0} = 0$ and regularity conditions are satisfied, then surrounding $(\lambda_0, w_0)$ there is a function $w^*(\lambda)$ s.t. $\frac{\partial \mathcal{L}_T}{\partial w}\big|_{\lambda, w^*(\lambda)} = 0$, and we have:
\[
\frac{\partial w^*}{\partial \lambda}\bigg|_{\lambda_0}
= - \underbrace{\left[\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}}\right]^{-1}}_{\text{training Hessian}}
\times
\underbrace{\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}}}_{\text{training mixed partials}}
\Bigg|_{\lambda_0,\, w^*(\lambda_0)}
\tag{IFT}
\]
The condition $\frac{\partial \mathcal{L}_T}{\partial w}\big|_{\lambda_0, w_0} = 0$ is equivalent to $(\lambda_0, w_0)$ being a fixed point of the training gradient field. Since $w^*(\lambda_0)$ is a fixed point of the training gradient field, we can leverage the IFT to evaluate the best-response Jacobian locally. We only have access to an approximation of the true best-response—denoted $\widehat{w}^*$—which we can find with gradient descent.

4. Tractable inverse Hessian approximations: To exactly invert a general $m \times m$ Hessian, we often require $\mathcal{O}(m^3)$ operations, which is intractable for the Hessian in Eq. IFT for modern NNs. We can efficiently approximate the inverse with the Neumann series:
\[
\left[\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}}\right]^{-1}
= \lim_{i \to \infty} \sum_{j=0}^{i} \left[ I - \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \right]^{j}
\tag{4}
\]
In Section 4 we show that unrolling differentiation for $i$ steps around locally optimal weights $w^*$ is equivalent to approximating the inverse with the first $i$ terms in the Neumann series. We then show how to use this approximation without instantiating any matrices by using efficient vector-Jacobian products.

Figure 2: Hypergradient computation. The entire computation can be performed efficiently using vector-Jacobian products, provided a cheap approximation to the inverse-Hessian-vector product is available.

2.1 Proposed Algorithms

We outline our method in Algs. 1, 2, and 3, where α denotes the learning rate. Alg. 3 is also shown in [22]. We visualize the hypergradient computation in Fig. 2.

Algorithm 1 Gradient-based HO for λ∗, w∗(λ∗)
1: Initialize hyperparameters λ_0 and weights w_0
2: while not converged do
3:   for k = 1 ... N do
4:     w_0 -= α · ∂L_T/∂w |_{λ_0, w_0}
5:   λ_0 -= hypergradient(L_V, L_T, λ_0, w_0)
6: return λ_0, w_0   ▷ λ∗, w∗(λ∗) from Eq. 1

Algorithm 2 hypergradient(L_V, L_T, λ_0, w_0)
1: v_1 = ∂L_V/∂w |_{λ_0, w_0}
2: v_2 = approxInverseHVP(v_1, ∂L_T/∂w)
3: v_3 = grad(∂L_T/∂λ, w, grad_outputs = v_2)
4: return ∂L_V/∂λ |_{λ_0, w_0} − v_3   ▷ Return to Alg. 1

Algorithm 3 approxInverseHVP(v, f): Neumann approximation of the inverse-Hessian-vector product v[∂f/∂w]⁻¹
1: Initialize sum p = v
2: for j = 1 ... i do
3:   v -= α · grad(f, w, grad_outputs = v)
4:   p += v
5: return p   ▷ Return to Alg. 2
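To make Algs. 2 and 3 concrete, the following is a minimal PyTorch-style sketch (ours, for illustration; it is not the authors' released implementation). It assumes val_loss and train_loss are scalar tensors computed from lists of tensors params and hparams, that every hyperparameter enters the training loss, and it folds the Neumann step size alpha back into the returned sum; the function and variable names, the fixed number of Neumann terms, and the default alpha are our assumptions.

import torch

def approx_inverse_hvp(v, f, params, alpha=0.1, num_terms=20):
    # Alg. 3 sketch: Neumann approximation of v [d f / d params]^{-1},
    # where f = dL_T/dparams was built with create_graph=True.
    v = [x.detach().clone() for x in v]
    p = [x.clone() for x in v]
    for _ in range(num_terms):
        # Matrix-free Hessian-vector product via a vector-Jacobian product on f.
        hvp = torch.autograd.grad(f, params, grad_outputs=v, retain_graph=True)
        v = [x - alpha * y for x, y in zip(v, hvp)]
        p = [x + y for x, y in zip(p, v)]
    # alpha * sum_j (I - alpha*H)^j v -> H^{-1} v as num_terms grows,
    # provided alpha is small enough that the series converges.
    return [alpha * x for x in p]

def hypergradient(val_loss, train_loss, hparams, params):
    # Alg. 2 sketch: hypergradient = direct gradient - v2^T (mixed partials of L_T).
    v1 = torch.autograd.grad(val_loss, params, retain_graph=True)
    f = torch.autograd.grad(train_loss, params, create_graph=True)
    v2 = approx_inverse_hvp(v1, f, params)
    # v2-weighted mixed partials d^2 L_T / (dw dlambda^T), shaped like hparams;
    # torch.autograd.grad takes (outputs=dL_T/dw, inputs=hparams) for this product.
    v3 = torch.autograd.grad(f, hparams, grad_outputs=v2, retain_graph=True)
    direct = torch.autograd.grad(val_loss, hparams, allow_unused=True)
    return [-g3 if d is None else d - g3 for d, g3 in zip(direct, v3)]

A full loop would alternate several SGD steps on params with one hyperparameter update of the form hparams[k] -= hyper_lr * hg[k] inside torch.no_grad(), mirroring Alg. 1.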
3 Related Work

Implicit Function Theorem. The IFT has been used for optimization in nested optimization problems [27, 28], backpropagating through arbitrarily long RNNs [22], or even efficient k-fold cross-validation [29]. Early work applied the IFT to regularization by explicitly computing the Hessian (or Gauss-Newton) inverse [23, 2]. In [24], the identity matrix is used to approximate the inverse Hessian in the IFT. HOAG [30] uses conjugate gradient (CG) to invert the Hessian approximately and provides convergence results given tolerances on the optimal parameter and inverse. In iMAML [9], a center to the weights is fit to perform well on multiple tasks—contrasted with our use of validation loss. In DEQ [31], implicit differentiation is used to add differentiable fixed-point methods into NN architectures. We use a Neumann approximation for the inverse Hessian, instead of CG [30, 9] or the identity.

Approximate inversion algorithms. CG is difficult to scale to modern, deep NNs. We use the Neumann inverse approximation, which was observed to be a stable alternative to CG in NNs [22, 7]. The stability is motivated by connections between the Neumann approximation and unrolled differentiation [7]. Alternatively, we could use prior knowledge about the NN structure to aid in the inversion—e.g., by using KFAC [32]. It is possible to approximate the Hessian with the Gauss-Newton matrix or Fisher information matrix [23]. Various works use an identity approximation to the inverse, which is equivalent to 1-step unrolled differentiation [24, 14, 33, 10, 8, 34, 35].

Unrolled differentiation for HO. A key difficulty in nested optimization is approximating how the optimized inner parameters (i.e., NN weights) change with respect to the outer parameters (i.e., hyperparameters). We often optimize the inner parameters with gradient descent, so we can simply differentiate through this optimization. Differentiation through optimization has been applied to nested optimization problems by [3], was scaled to HO for NNs by [4], and has been applied to various applications like learning optimizers [36]. [6] provides convergence results for this class of algorithms, while [5] discusses forward- and reverse-mode variants.

As the number of gradient steps we backpropagate through increases, so does the memory and computational cost. Often, gradient descent does not exactly minimize our objective after a finite number of steps—it only approaches a local minimum. Thus, to see how the hyperparameters affect the local minima, we may have to unroll the optimization infeasibly far. Unrolling a small number of steps can be crucial for performance but may induce bias [37]. [7] discusses connections between unrolling and the IFT, and proposes to unroll only the last L steps. DrMAD [38] proposes an interpolation scheme to save memory.

We compare hypergradient approximations in Table 1, and memory costs of gradient-based HO methods in Table 2. We survey gradient-free HO in Appendix B.

Table 1: Hypergradient approximations.

Method | Steps | Eval. | Hypergradient Approximation
Exact IFT | $\infty$ | $w^*(\lambda)$ | $\frac{\partial \mathcal{L}_V}{\partial \lambda} - \frac{\partial \mathcal{L}_V}{\partial w} \times \left[\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}}\right]^{-1} \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \Big|_{w^*(\lambda)}$
Unrolled Diff. [4] | $i$ | $w_0$ | $\frac{\partial \mathcal{L}_V}{\partial \lambda} - \frac{\partial \mathcal{L}_V}{\partial w} \times \sum_{j \le i} \left[\prod_{k<j} \left( I - \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}}\right)\right] \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \Big|_{\lambda,\, w_j(\lambda)}$

We need continuous hyperparameters to use gradient-based optimization, but many discrete hyperparameters have continuous relaxations. We also assume that $\mathcal{L}_T : \Lambda \times \mathcal{W} \to \mathbb{R}$ is twice differentiable with an invertible Hessian at $w^*(\lambda)$, and that $w^* : \Lambda \to \mathcal{W}$ is differentiable.

Lemma. Given the recurrence from unrolling SGD optimization in Eq. 5, we have:
\[
\frac{\partial w_{i+1}}{\partial \lambda}
= - \sum_{j \le i} \left[ \prod_{k < j} \left( I - \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \bigg|_{\lambda,\, w^*(\lambda)} \right) \right]
\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \bigg|_{\lambda,\, w^*(\lambda)}
\]
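To spell out the step this lemma enables (our summary, relying only on the invertibility assumption above): when unrolling is performed around the locally optimal weights, every factor in the product is the same matrix, so the product collapses to a matrix power and the sum becomes a truncated Neumann series,
\[
\frac{\partial w_{i+1}}{\partial \lambda}
= - \sum_{j \le i} \left[ I - \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \right]^{j}
\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \Bigg|_{\lambda,\, w^*(\lambda)}.
\]
If the series converges (for instance when $\lVert I - \partial^2 \mathcal{L}_T / \partial w\,\partial w^{\top} \rVert < 1$), this is exactly the first $i{+}1$ Neumann terms of Eq. 4 applied to the mixed partials, and taking $i \to \infty$ yields the exact inverse Hessian.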

In the limit of unrolling, this sum recovers the inverse training Hessian:
\[
\lim_{i \to \infty} \frac{\partial w_{i+1}}{\partial \lambda}
= - \left[ \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \right]^{-1}
\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}} \Bigg|_{w^*(\lambda)}
\tag{8}
\]
This result is also shown in [7], but they use a different approximation for computing the hypergradient—see Table 1. Instead, we use the following best-response Jacobian approximation, where $i$ controls the trade-off between computation and error bounds:
\[
\frac{\partial w^*}{\partial \lambda}
\approx - \sum_{j < i} \left[ I - \alpha \frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial w^{\top}} \right]^{j}
\frac{\partial^2 \mathcal{L}_T}{\partial w\,\partial \lambda^{\top}}
\tag{9}
\]

… for an LSTM language model. HO algorithms that are not based on implicit differentiation or differentiation through optimization—such as [44–48, 20]—do not scale to the high-dimensional hyperparameters we use. Thus, we cannot sensibly compare to them for high-dimensional problems.

5.1 Approximate Inversion Algorithms

In Fig. 3 we investigate how close various approximations are to the true inverse. We calculate the distance between the approximate hypergradient and the true hypergradient.
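What follows is a small, self-contained NumPy check (our illustration, not an experiment from the paper) that the truncated series of Eq. 9 approaches the exact IFT hypergradient on a quadratic problem where both are available in closed form. The problem sizes, the choice of alpha, and the rescaling of the accumulated sum by alpha are our assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_w, n_h = 10, 3                       # number of weights / hyperparameters

# Quadratic training loss L_T(lam, w) = 0.5 w^T A w + w^T B lam, with A > 0,
# so the training Hessian is A and the mixed partials are B.
M = rng.standard_normal((n_w, n_w))
A = M @ M.T + n_w * np.eye(n_w)
B = rng.standard_normal((n_w, n_h))
lam = rng.standard_normal(n_h)
w_star = -np.linalg.solve(A, B @ lam)  # fixed point: dL_T/dw = A w + B lam = 0

# Validation loss L_V(w) = 0.5 ||w - t||^2 has no direct dependence on lam.
t = rng.standard_normal(n_w)
dLV_dw = w_star - t

exact = -dLV_dw @ np.linalg.inv(A) @ B   # exact IFT hypergradient (indirect term)

alpha = 1.0 / np.linalg.norm(A, 2)       # makes I - alpha*A a contraction
v = dLV_dw.copy()
p = dLV_dw.copy()
for _ in range(200):                     # number of Neumann terms, i in Eq. 9
    v = v - alpha * (A @ v)              # v <- v (I - alpha*A); A is symmetric
    p = p + v
approx = -(alpha * p) @ B                # alpha rescaling puts it on the scale of exact

print(np.linalg.norm(exact - approx))    # shrinks toward 0 as the term count grows

The distance shrinks geometrically with the number of terms, which is the trade-off that i controls.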