arXiv:1902.01813v3 [cs.LG] 28 Feb 2020

Modular Block-diagonal Curvature Approximations for Feedforward Architectures

Felix Dangel, University of Tübingen, [email protected]
Stefan Harmeling, Heinrich Heine University Düsseldorf, [email protected]
Philipp Hennig, University of Tübingen and MPI for Intelligent Systems, Tübingen, [email protected]

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s). Code available at github.com/f-dangel/hbp.

Abstract

We propose a modular extension of backpropagation for the computation of block-diagonal approximations to various curvature matrices of the training objective (in particular, the Hessian, generalized Gauss-Newton, and positive-curvature Hessian). The approach reduces the otherwise tedious manual derivation of these matrices into local modules, and is easy to integrate into existing machine learning libraries. Moreover, we develop a compact notation derived from matrix differential calculus. We outline different strategies applicable to our method. They subsume recently-proposed block-diagonal approximations as special cases, and are extended to convolutional neural networks in this work.

1 Introduction

Gradient backpropagation is the central computational operation of contemporary deep learning. Its modular structure allows easy extension across network architectures, and thus automatic computation of gradients given the computational graph of the forward pass (for a review, see Baydin et al., 2018). But optimization using only the first-order information of the objective's gradient can be unstable and slow, due to "vanishing" or "exploding" behaviour of the gradient. Incorporating curvature, second-order methods can avoid such scaling issues and converge in fewer iterations. Such methods locally approximate the objective function $E$ by a quadratic $E(x) + \delta x^\top (x_* - x) + \frac{1}{2} (x_* - x)^\top C (x_* - x)$ around the current location $x$, using the gradient $\delta x = \partial E / \partial x$ and a positive semi-definite (PSD) curvature matrix $C$, which is the Hessian of $E$ or an approximation thereof. The quadratic is minimized by

$$x_* = x + \Delta x \quad \text{with} \quad \Delta x = -C^{-1} \delta x. \tag{1}$$

Computing the update step requires that the linear system $C \Delta x = -\delta x$ be solved. To accomplish this task, providing a matrix-vector multiplication with the curvature matrix $C$ is sufficient.

Approaches to second-order optimization: For some curvature matrices, exact multiplication can be performed at the cost of one backward pass by automatic differentiation (Pearlmutter, 1994; Schraudolph, 2002). This matrix-free formulation can then be leveraged to solve (1) using iterative solvers such as the method of conjugate gradients (CG) (Martens, 2010). However, since this linear solver can still require multiple iterations, the increased per-iteration progress of the resulting optimizer might be compensated by increased computational cost. Recently, a parallel version of Hessian-free optimization was proposed in (Zhang et al., 2017), which only considers the content of Hessian sub-blocks along the diagonal. Reducing the Hessian to a block diagonal allows for parallelization, tends to lower the required number of CG iterations, and seems to improve the optimizer's performance.

There have also been attempts to compute parts of the Hessian in an iterative fashion (Mizutani and Dreyfus, 2008). Storing these constituents efficiently often requires an involved manual analysis of the Hessian's structure, leveraging its outer-product form in many scenarios (Naumov, 2017; Bakker et al., 2018). Recent works developed different block-diagonal approximations (BDA) of curvature matrices that provide fast multiplication (Martens and Grosse, 2015; Grosse and Martens, 2016; Botev et al., 2017; Chen et al., 2018). These works have repeatedly shown that, empirically, second-order information can improve the training of deep learning problems.
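The matrix-free update of Equation (1) can be illustrated with a minimal sketch, under stated assumptions: the toy quadratic objective, the 2x2 matrix `C`, the vector `b`, and the helper names (`dot`, `conjugate_gradient`, `curvature_matvec`, `gradient`) are all hypothetical and are not taken from the paper's code. In a network, the product $v \mapsto Cv$ would come from automatic differentiation; here an explicit matrix stands in for it. The CG routine only ever calls `matvec`, which is the point of the matrix-free formulation.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=100):
    """Solve C x = b for a PSD matrix C, given only the product v -> C v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - C x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Cp = matvec(p)
        alpha = rs_old / dot(p, Cp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ci for ri, ci in zip(r, Cp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical toy objective E(x) = 0.5 x^T C x - b^T x with PSD curvature C.
C = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def curvature_matvec(v):
    """Stand-in for a curvature-vector product obtained by autodiff."""
    return [dot(row, v) for row in C]


def gradient(x):
    """Gradient of the toy objective: delta_x = C x - b."""
    return [ci - bi for ci, bi in zip(curvature_matvec(x), b)]


x = [2.0, 1.0]                        # current location
delta_x = gradient(x)
# Solve C * step = -delta_x, i.e. step = -C^{-1} delta_x as in Equation (1).
step = conjugate_gradient(curvature_matvec, [-g for g in delta_x])
x_star = [xi + si for xi, si in zip(x, step)]
print("x_star =", x_star)
print("gradient at x_star =", gradient(x_star))
```

Because the toy objective is exactly quadratic, a single such step lands on the minimizer; for a neural network objective the quadratic is only a local model and the step would be repeated.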
Perhaps the most important practical hurdle to the adoption of second-order optimizers is that they tend to be tedious to integrate in existing machine learning frameworks because they require manual implementations. As efficient automated implementations have arguably been more important for the wide-spread use of deep learning than many conceptual advances, we aim to develop a framework that makes computation of Hessian approximations about as easy and automated as gradient backpropagation.

Figure 1: Standard feedforward network architecture, i.e. the repetition of affine transformations parameterized by $\theta^{(i)} = (W^{(i)}, b^{(i)})$ followed by elementwise activations. Arrows from left to right and vice versa indicate the data flow during forward pass and gradient backpropagation, respectively.

Contribution: This paper introduces a modular formalism for the computation of block-diagonal approximations of Hessian and curvature matrices, to various block resolutions, for feedforward neural networks. The framework unifies previous approaches in a form that, similar to gradient backpropagation, reduces implementation and analysis to local modules. Following the design pattern of gradient backprop also has the advantage that this formalism can readily be integrated into existing machine learning libraries, and flexibly modified for different block groupings and approximations.

The framework consists of three principal parts:

1. a modular formulation for exact computation of Hessian block diagonals of feedforward neural nets. We achieve a clear presentation by leveraging the notation of matrix differential calculus (Magnus and Neudecker, 1999).

2. projections onto the positive semi-definite cone by eliminating sources of concavity.

3. backpropagation strategies to obtain (i) exact curvature matrix-vector products (with previously inaccessible BDAs of the Hessian) and (ii) further approximated multiplication routines that save computations by evaluating the matrix representations of intermediate quantities once, at the cost of additional memory consumption.

The first two contributions can be understood as an explicit formulation of well-known tricks for fast multiplication by curvature matrices using automatic differentiation (Pearlmutter, 1994; Schraudolph, 2002). However, we also address a new class of curvature matrices, the positive-curvature Hessian (PCH) introduced in Chen et al. (2018). Our solutions to the latter two points are generalizations of previous works (Botev et al., 2017; Chen et al., 2018) to the fully modular case, which become accessible due to the first contribution. They represent additional modifications to make the scheme computationally tractable and obtain curvature approximations with desirable properties for optimization.

2 Notation

We consider feedforward neural networks composed of $\ell$ modules $f^{(i)}$, $i = 1, \dots, \ell$, which can be represented as a computational graph mapping the input $z^{(0)} = x$ to the output $z^{(\ell)}$ (Figure 1). A module $f^{(i)}$ receives the parental output $z^{(i-1)}$, applies an operation involving the network parameters $\theta^{(i)}$, and sends the output $z^{(i)}$ to its child. Thus, $f^{(i)}$ is of the form $z^{(i)} = f^{(i)}(z^{(i-1)}, \theta^{(i)})$. Typical choices include elementwise nonlinear activation without any parameters and affine transformations $z^{(i)} = W^{(i)} z^{(i-1)} + b^{(i)}$ with parameters given by the weights $W^{(i)}$ and the bias $b^{(i)}$. Affine and activation modules are usually considered as a single conceptual unit, one layer of the network. However, for backpropagation of derivatives it is simpler to consider them separately as two modules.

Given the network output $z^{(\ell)}(x, \theta^{(1,\dots,\ell)})$ of a datum $x$ with label $y$, the goal is to minimize the expected risk of the loss function $E(z^{(\ell)}, y)$. Under the framework of empirical risk minimization, the parameters are tuned to optimize the loss on the training set $Q = \{(x, y)_i\}_{i=1}^{N}$,

$$\min_{\theta^{(1,\dots,\ell)}} \frac{1}{|Q|} \sum_{(x,y) \in Q} E\left(z^{(\ell)}(x), y\right). \tag{2}$$

In practice, the objective is typically further approximated stochastically by drawing a mini-batch $B \subset Q$ from the training set. We will treat both scenarios without further distinction, since the structure relevant to our purposes is that Equation (2) is an average of terms depending on individual data points. Quantities for optimization, be it gradients or second derivatives of the loss with respect to the network parameters, can be processed in parallel, then averaged.

3 Main contribution

First-order auto-differentiation for a custom module requires the definition of only two local operations, forward and backward, whose outputs are propagated along the computation graph. This modularity facilitates the extension of gradient backpropagation by new operations, which can then be used to build networks by composition. To illustrate the principle, we consider a single module from the network of Figure 1, depicted in Figure 2, in this section.

Figure 2: Forward pass, gradient backpropagation, and Hessian backpropagation for a single module. Arrows from left to right indicate the data flow in the forward pass $z = f(x, \theta)$, while the opposite orientation indicates the gradient backpropagation by Equation (4). We suggest to extend this by the backpropagation of the Hessian as indicated by Equation (7).

[...] symmetry of both $x$ and $\theta$ acting as input to the module. Implementing gradient backpropagation thus requires multiplications by (transposed) Jacobians.

We can apply the chain rule a second time to obtain expressions for second-order partial derivatives of the loss function $E$ with respect to elements of $x$ or $\theta$,

$$\frac{\partial^2 E(x)}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_j} \left( \sum_k \frac{\partial z_k}{\partial x_i} \delta z_k \right) = \sum_{k,l} \frac{\partial z_k}{\partial x_i} \frac{\partial^2 E(z)}{\partial z_k \partial z_l} \frac{\partial z_l}{\partial x_j} + \sum_k \frac{\partial^2 z_k}{\partial x_i \partial x_j} \delta z_k, \tag{5}$$

by means of $\partial / \partial x_j = \sum_l (\partial z_l / \partial x_j) \, \partial / \partial z_l$ and the product rule. The first term of Equation (5) propagates curvature information of the output further back, while the second term introduces second-order effects of the [...]
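Equation (5) can be checked numerically on a small example. The following is a minimal sketch, not the authors' implementation: the module $z_k = \tanh(\sum_j W_{kj} x_j)$, the squared-error loss, the weights `W`, the target `y`, and all function names are assumptions chosen for illustration. It assembles the input Hessian from the two terms of Equation (5) and compares the result with central finite differences of the loss.

```python
import math

# Hypothetical module z = tanh(W x) with 2 inputs and 3 outputs.
W = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
y = [0.2, -0.4, 0.1]   # hypothetical target used by the loss


def forward(x):
    """Forward pass z = f(x) of the assumed module."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W]


def loss(x):
    """Assumed loss E(z) = 0.5 * ||z - y||^2, evaluated through the module."""
    return 0.5 * sum((zk - yk) ** 2 for zk, yk in zip(forward(x), y))


def hessian_eq5(x):
    """Hessian of E with respect to x, assembled term by term as in Equation (5)."""
    z = forward(x)
    n, m = len(x), len(z)
    dz = [zk - yk for zk, yk in zip(z, y)]            # delta z = dE/dz
    Hz = [[1.0 if k == l else 0.0 for l in range(m)]  # d^2E/dz^2 (identity here)
          for k in range(m)]
    # Jacobian dz_k/dx_i = (1 - z_k^2) W_ki
    J = [[(1 - z[k] ** 2) * W[k][i] for i in range(n)] for k in range(m)]
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # First term: output curvature propagated back through the Jacobians.
            first = sum(J[k][i] * Hz[k][l] * J[l][j]
                        for k in range(m) for l in range(m))
            # Second term: d^2 z_k/dx_i dx_j = -2 z_k (1 - z_k^2) W_ki W_kj,
            # weighted by the backpropagated gradient delta z_k.
            second = sum(-2 * z[k] * (1 - z[k] ** 2) * W[k][i] * W[k][j] * dz[k]
                         for k in range(m))
            H[i][j] = first + second
    return H


def hessian_fd(x, h=1e-4):
    """Reference Hessian from central finite differences of the loss."""
    n = len(x)

    def shifted(i, si, j, sj):
        v = list(x)
        v[i] += si * h
        v[j] += sj * h
        return loss(v)

    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(n)] for i in range(n)]


x0 = [0.3, -0.2]
H_chain = hessian_eq5(x0)
H_ref = hessian_fd(x0)
max_err = max(abs(H_chain[i][j] - H_ref[i][j]) for i in range(2) for j in range(2))
print("max abs deviation from finite differences:", max_err)
```

The explicit loops mirror the index notation of Equation (5) rather than aiming for efficiency. Because the assumed activation is nonlinear, the second term is nonzero here, so the comparison exercises both parts of the formula.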
