<<

Modular Block-diagonal Curvature Approximations for Feedforward Architectures

Felix Dangel Stefan Harmeling Philipp Hennig University of Tübingen Heinrich Heine University University of Tübingen and [email protected] Düsseldorf MPI for Intelligent Systems, Tübingen [email protected] [email protected]

Abstract δx = ∂E/∂x and a positive semi-definite (PSD) cur- vature C — the Hessian of E or approximations thereof. The quadratic is minimized by We propose a modular extension of backprop- 1 agation for the computation of block-diagonal x = x + ∆x with ∆x = C− δx . (1) approximations to various curvature matri- ∗ − ces of the training objective (in particular, Computing the update step requires that the C∆x = δx linear system be solved. To accomplish this task, the Hessian, generalized Gauss-Newton, and − positive-curvature Hessian). The approach re- providing a matrix-vector multiplication with the cur- duces the otherwise tedious manual derivation vature matrix C is sufficient. of these matrices into local modules, and is easy to integrate into existing machine learn- Approaches to second-order optimization: For ing libraries. Moreover, we develop a compact some curvature matrices, exact multiplication can be notation derived from matrix differential cal- performed at the cost of one backward pass by auto- culus. We outline different strategies applica- matic differentiation (Pearlmutter, 1994; Schraudolph, ble to our method. They subsume recently- 2002). This matrix-free formulation can then be lever- proposed block-diagonal approximations as aged to solve (1) using iterative solvers such as the special cases, and are extended to convolu- method of conjugate (CG) (Martens, 2010). tional neural networks in this work. However, since this linear solver can still require multi- ple iterations, the increased per-iteration progress of the resulting optimizer might be compensated by in- 1 Introduction creased computational cost. Recently, a parallel version of Hessian-free optimization was proposed in (Zhang et al., 2017), which only considers the content of Hes- backpropagation is the central computational sian sub-blocks along the diagonal. Reducing the Hes- operation of contemporary deep learning. Its modular sian to a block diagonal allows for parallelization, tends structure allows easy extension across network archi- to lower the required number of CG iterations, and arXiv:1902.01813v3 [cs.LG] 28 Feb 2020 tectures, and thus automatic computation of gradients seems to improve the optimizer’s performance. given the computational graph of the forward pass (for a review, see Baydin et al., 2018). But optimization There have also been attempts to compute parts of using only the first-order information of the objective’s the Hessian in an iterative fashion (Mizutani and Drey- gradient can be unstable and slow, due to “vanishing” fus, 2008). Storing these constituents efficiently often or “exploding” behaviour of the gradient. Incorporat- requires an involved manual analysis of the Hessian’s ing curvature, second-order methods can avoid such structure, leveraging its outer-product form in many scaling issues and converge in fewer iterations. Such scenarios (Naumov, 2017; Bakker et al., 2018). Recent methods locally approximate the objective E works developed different block-diagonal approxima- 1 by a quadratic E(x)+δx>(x x)+ (x x)>C(x x) tions (BDA) of curvature matrices that provide fast ∗− 2 ∗− ∗− around the current location x, using the gradient multiplication (Martens and Grosse, 2015; Grosse and Martens, 2016; Botev et al., 2017; Chen et al., 2018).

Proceedings of the 23rdInternational Conference on Artificial These works have repeatedly shown that, empirically, Intelligence and (AISTATS) 2020, Palermo, Italy. second-order information can improve the training of PMLR: Volume 108. Copyright 2020 by the author(s). Code available at github.com/f-dangel/hbp. Modular Block-diagonal Curvature Approximations for Feedforward Architectures

θ(1) θ(3) θ(`) δθ(1) δθ(3) δθ(`)

z(0) z(1) z(2) z(3) z(`) E (1) (2) (3) (4) ... (`) f δz(1) f δz(2) f δz(3) f f δz(`) E

Figure 1: Standard feedforward network architecture, i.e. the repetition of affine transformations parameterized by θ(i) = (W (i), b(i)) followed by elementwise activations. Arrows from left to right and vice versa indicate the data flow during forward pass and gradient backpropagation, respectively. deep learning problems. Perhaps the most important tiplication by curvature matrices using automatic dif- practical hurdle to the adoption of second-order opti- ferentiation (Pearlmutter, 1994; Schraudolph, 2002). mizers is that they tend to be tedious to integrate in However, we also address a new class of curvature existing machine learning frameworks because they re- matrices, the positive-curvature Hessian (PCH) intro- quire manual implementations. As efficient automated duced in Chen et al.(2018). Our solutions to the latter implementations have arguably been more important two points are generalizations of previous works (Botev for the wide-spread use of deep learning than many con- et al., 2017; Chen et al., 2018) to the fully modular ceptual advances, we aim to develop a framework that case, which become accessible due to the first contri- makes computation of Hessian approximations about bution. They represent additional modifications to as easy and automated as gradient backpropagation. make the scheme computationally tractable and obtain curvature approximations with desirable properties for Contribution: This paper introduces a modular for- optimization. malism for the computation of block-diagonal approxi- mations of Hessian and curvature matrices, to various 2 Notation block resolutions, for feedforward neural networks. The framework unifies previous approaches in a form that, We consider feedforward neural networks composed similar to gradient backpropagation, reduces implemen- of ` modules f (i), i = 1, . . . , `, which can be repre- tation and analysis to local modules. Following the sented as a computational graph mapping the input design pattern of gradient backprop also has the advan- z(0) = x to the output z(`) (Figure1). A module f (i) tage that this formalism can readily be integrated into (i 1) receives the parental output z − , applies an opera- existing machine learning libraries, and flexibly modi- tion involving the network parameters θ(i), and sends fied for different block groupings and approximations. the output z(i) to its child. Thus, f (i) is of the form (i) (i) (i 1) (i) The framework consists of three principal parts: z = f (z − , θ ). Typical choices include elemen- twise nonlinear activation without any parameters and (i) (i) (i 1) (i) 1. a modular formulation for exact computation of affine transformations z = W z − + b with pa- (i) (i) Hessian block diagonals of feedforward neural nets. rameters given by the weights W and the bias b . We achieve a clear presentation by leveraging the Affine and activation modules are usually considered as notation of matrix differential (Magnus a single conceptual unit, one layer of the network. How- and Neudecker, 1999). ever, for backpropagation of it is simpler to consider them separately as two modules. 2. projections onto the positive semi-definite cone by (`) (1,...,`) eliminating sources of concavity. Given the network output z (x, θ ) of a datum x with label y, the goal is to minimize the expected risk 3. backpropagation strategies to obtain (i) exact cur- of the E(z(`), y). Under the framework of vature matrix-vector products (with previously empirical risk minimization, the parameters are tuned N inaccessible BDAs of the Hessian) and (ii) further to optimize the loss on the training set Q = (x, y)i=1 , approximated multiplication routines that save 1  computations by evaluating the matrix representa- min E(z(`)(x), y) . (2) tions of intermediate quantities once, at the cost θ(1,...,`) Q | | (x,y) Q of additional memory consumption. X∈ In practice, the objective is typically further approxi- The first two contributions can be understood as an mated stochastically by drawing a mini-batch B Q ⊂ explicit formulation of well-known tricks for fast mul- from the training set. We will treat both scenarios Felix Dangel, Stefan Harmeling, Philipp Hennig

Figure 2: Forward pass, gradient backpropagation, and θ Hessian backpropagation for a single module. Arrows δθ from left to right indicate the data flow in the forward pass θ H z = f(x, θ), while the opposite orientation indicates the x z gradient backpropagation by Equation (4). We suggest δx f δz to extend this by the backpropagation of the Hessian as x z H H indicated by Equation (7).

without further distinction, since the structure relevant symmetry of both x and θ acting as input to the module. to our purposes is that Equation (2) is an average of Implementing gradient backpropagation thus requires terms depending on individual data points. Quantities multiplications by (transposed) Jacobians. for optimization, be it gradients or second derivatives We can apply the a second time to obtain of the loss with respect to the network parameters, can expressions for second-order partial derivatives of the be processed in parallel, then averaged. loss function E with respect to elements of x or θ,

2 3 Main contribution ∂ E(x) ∂ ∂zk = δzk ∂xi∂xj ∂xj ∂xi ! First-order auto-differentiation for a custom module k X2 2 (5) requires the definition of only two local operations, ∂zk ∂ E(z) ∂zl ∂ zk = + δzk , forward and backward, whose outputs are propagated ∂xi ∂zk∂zl ∂xj ∂xi∂xj k,l k along the computation graph. This modularity facili- X X tates the extension of gradient backpropagation by new ∂ ∂zl ∂ by means of /∂xj = l( /∂xj ) /∂zl and the product operations, which can then be used to build networks rule. The first term of Equation (5) propagates cur- P by composition. To illustrate the principle, we consider vature information of the output further back, while a single module from the network of Figure1, depicted the second term introduces second-order effects of the in Figure2, in this section. The forward pass f(x, θ) module itself. Using the Hessian matrix HE(x) = 2 maps the input x to the output z by means of the ∂ E(x)/(∂x>∂x) of a scalar function with respect to a module parameters θ (to simplify notation, we drop vector-shaped quantity x, the Hessian of the loss func- layer indices). All quantities are assumed to be vector- tion will be abbreviated by shaped (-valued quantities can be vectorized, see SectionA of the Supplements). Optimization requires ∂2E( ) HE( ) = = · , (6) the gradient of the loss function with respect to the · H· ∂ vec( )>∂ vec( ) parameters, ∂E(θ)/∂θ = δθ. We will use the shorthand · · which results in the matrix version of Equation (5), ∂E( ) δ = · . (3) · ∂ vec( ) x = [Dz(x)]> z [Dz(x)] + [Hz (x)] δz . (7) · H H k k During gradient backpropagation the module receives k X the loss gradient with respect to its output, δz, from its child. The backward operation computes gradients Note that the second-order effect introduced by the with respect to the module parameters and input, δθ module itself via Hzk(x) vanishes if fk(x, θ) is linear and δx from δz. Backpropagation continues by send- in x. Because the layer parameters θ can be regarded ing the gradient with respect to the module’s input as inputs to the layer, they are treated in exactly the to its parent, which proceeds in the same way (see same way, replacing x by θ in the above expression. Figure1). By the chain rule, gradients with respect Equation (7) is the central functional expression herein, to an element of the module’s input can be computed and will be referred to as the Hessian backpropagation ∂zj as δxi = j( /∂xi)δzj. The vectorized version is (HBP) equation. Our suggested extension of gradient compactly written in terms of the Jacobian matrix backpropagation is to also send the Hessian z back P > H Dz(x) = ∂z(x)/∂x , which contains all partial deriva- through the graph. To do so, existing modules have to tives of z with respect to x. The arrangement of partial be extended by the HBP equation: Given the Hessian ∂z (x) derivatives is such that [Dz(x)]j,i = j /∂xi, i.e. z of the loss with respect to all module outputs, an H extended module has to extract the Hessians θ, x δx = [Dz(x)]> δz . (4) H H by means of Equation (7), and forward the Hessian Analogously, the parameter gradients are given by δθi = with respect to its input x to the parent module which ∂zj H δz δθ = [Dz(θ)]> δz j ∂θi j, i.e. , which reflects the proceeds likewise. In this way, backprop of gradients P Modular Block-diagonal Curvature Approximations for Feedforward Architectures

θ(1) θ(3) θ(`) δθ(1) δθ(3) δθ(`) θ(1) θ(3) θ(`) H H H z(0) z(1) z(2) z(3) z(`) E (1) (3) (`) f δz(1) f (2) δz(2) f δz(3) f (4) ... f δz(`) E z(1) z(2) z(3) z(`) H H H H Figure 3: Extension of backprop to Hessians. It yields diagonal blocks of the full parameter Hessian. can be extended to compute curvature information in layers and in case of batched input data, the di- modules. This corresponds to BDAs of the Hessian that mension of these quantities exceeds computational ignore second-order partial derivatives of parameters budgets. In line with previous schemes (Botev in different modules. Figure3 shows the data flow. et al., 2017; Chen et al., 2018), we introduce addi- The computations required in Equation (7) depend tional approximations for batch learning in Sub- only on local quantities that are, mostly, already being section 3.2. A connection to existing schemes is computed during gradient backpropagation.1 drawn in the Supplements B.4. Before we proceed, we highlight the following aspects: HBP can easily be integrated into current machine learning libraries, so that BDAs of curvature informa- The BDA of the Hessian need not be PSD. But • tion can be provided automatically for novel or exist- our scheme can be modified to provide PSD cur- ing second-order optimization methods. Such methods vature matrices by projection onto the positive have repeatedly been shown to be competitive with semi-definite cone (see Subsection 3.1). first-order methods (Martens and Grosse, 2015; Grosse Instead of evaluating all matrices during back- and Martens, 2016; Botev et al., 2017; Zhang et al., • propagation, we can define matrix-vector products 2017; Chen et al., 2018). recursively. This yields exact curvature matrix products with the block diagonals of the Hessian, Relationship to matrix differential calculus: the generalized Gauss-Newton (GGN) matrix and To some extent, this paper is a re-formulation of earlier the PCH. Products with the first two matrices can results (Martens and Grosse, 2015; Botev et al., 2017; also be obtained by use of automatic differentiation Chen et al., 2018) in the framework of matrix differen- (Pearlmutter, 1994; Schraudolph, 2002). We also tial calculus (Magnus and Neudecker, 1999), leveraged get access to the PCH which, in contrast to the to achieve a new level of modularity. Matrix differen- GGN, considers curvature information introduced tial calculus is a set of notational rules that allow a by the network (see Subsection 3.1).2 For stan- concise construction of derivatives without the heavy dard neural networks, only second derivatives of use of indices. Equation (7) is a special case of the nonlinear activations have to be stored compared matrix chain rule of that framework. A more detailed to gradient backpropagation. discussion of this connection can be found in SectionA of the Supplements, which also reviews definitions gen- There are approaches (Botev et al., 2017; Chen • eralizing the concepts of Jacobian and Hessian in a way et al., 2018) that propagate matrix representations that preserves the chain rule. The elementary building back through the graph in order to save repeated block of our procedure is a module as shown in Figure2. computations in the curvature matrix-vector prod- Like for gradient backprop, the operations required for uct. The size of the matrices z(i) passed between H HBP can be tabulated. Table1 provides a selection of layer i+1 and i scales quadratically in the number common modules. The derivations, which again lever- of output features of layer i. For convolutional age the matrix differential calculus framework, can be 1 By Faà di Bruno’s formula (Johnson, 2002) higher- found in SupplementsB,C, andD. order derivatives of function compositions are expressed re- cursively in terms of the composites’ lower-order derivatives. 3.1 Obtaining different curvature matrices Recycling these quantities can give significant speedup com- pared to repeatedly applying first-order auto-differentiation, The HBP equation yields exact diagonal blocks which represents one key aspect of our work. θ(1),..., θ(`) of the full parameter Hessian. They 2Implementations of HBP for exact matrix-vector prod- H H ucts can reuse multiplication by the (transposed) Jacobian can be of interest in their own right for analysis of the provided by many machine learning libraries. The second loss function, but are not generally suitable for second- term of (7) needs special treatment though. order optimization in the sense of (1), as they need Felix Dangel, Stefan Harmeling, Philipp Hennig

Table 1: Hessian backpropagation for common modules used in feedforward networks. I denotes the . We assign matrices to upper-case (W, X, . . . ) and to upper-case sans serif symbols (W, X,... ).

OPERATION FORWARD HBP (Equation (7))

Matrix-vector z(x, W ) = W x x = W >( z)W , H H multiplication W = x x> z H ⊗ ⊗ H Matrix-matrix Z(X,W ) = WX X = (I W )> Z(I W ) , H ⊗ H ⊗ multiplication W = (X> I)> Z(X> I) H ⊗ H ⊗ Addition z(x, b) = x + b x = b = z H H H Elementwise z(x) = φ(x) , x = diag[φ0(x)] z diag[φ0(x)] + diag[φ00(x) δz] H H activation zi(x) = φ(xi)

Skip-connection z(x, θ) = x + y(x, θ) x = [I + Dy(x)]> z[I + Dy(x)] + [Hy (x)]δz , H H k k k θ = [Dy(θ)]> z[Dy(θ)] + [Hyk(θ)]δzk H H k P Reshape/view Z(X) = reshape(X) Z = X H H P Index select/map π z(x) = Πx , Π = 1 , x = Π>( z)Π j,π(j) H H Convolution Z(X, W) = X ? W , X = (I W ) Z(I W ) H ⊗ H ⊗ Z(W, X ) = W X , W = ( X > I)> Z( X > I) H ⊗ H ⊗ Square loss E(x, y) = (y x)>(y x) xJ =K 2I − − H Softmax cross-entropy E(x, yJ) =K y> logJ K [p(x)] x = diagJ K [p(x)] p(x)pJ(xK)> − H −

neither be PSD nor invertible. For application in opti- GGN are PSD if the loss function is convex (and thus mization, HBP can be modified to yield semi-definite HE(z(`)) is PSD). This is because blocks are recursively BDAs of the Hessian. Equation (7) again provides the left- and right-multiplied with Jacobians, which does foundation for this adaptation, which is closely related not alter the definiteness. Hessians of the loss func- to the concepts of KFRA (Botev et al., 2017), BDA- tions listed in Table1 are PSD. The resulting recursive PCH (Chen et al., 2018), and, under certain conditions, scheme has been used by Botev et al.(2017) under the KFAC (Martens and Grosse, 2015). We draw their acronym KFRA to optimize convex loss functions of connections by briefly reviewing them here. fully-connected neural networks with piecewise linear activation functions. Generalized Gauss-Newton matrix: The GGN emerges as the curvature matrix in the quadratic ex- Positive-curvature Hessian: Another concept of pansion of the loss function E(z(`)) in terms of the positive semi-definite BDAs of the Hessian (that ad- network output z(`). It is also obtained by linearizing ditionally considers second-order module effects) was the network output z(`)(θ, x) in θ before computing the studied in Chen et al.(2018) and named the PCH. loss Hessian (Martens, 2014), and reads It is obtained by modification of terms in the second 1 summand of Equation (7) that can introduce concavity G(θ) = Dz(`)(θ) > HE(z(`)) Dz(`)(θ) . Q during HBP. This ensures positive semi-definiteness (x,y) Q | | X∈ h i h i since the first summand is semi-definite, assuming the loss Hessian HE(z(`)) with respect to the network out- To obtain diagonal blocks G(θ(i)), the Jacobian can put to be positive semi-definite. Chen et al.(2018) be unrolled by means of the chain rule for Jaco- suggest to eliminate negative curvature of a matrix by bians (Supplements, Theorem A.1) as Dz(`)(θ(i)) = computing the eigenvalue decomposition and either dis- (`) (` 1) (` 1) (i) Dz (z − ) Dz − (θ ) ... Continued expansion card negative eigenvalues or cast them to their absolute HE(z(`)) shows that the  Hessian  of the loss function value. This allows the construction of PSD curvature with respect to the network output is propagated back matrices even for non-convex loss functions. In the through a layer by multiplication from left and right setting of Chen et al.(2018), the PCH can empirically with its Jacobian. This is accomplished in HBP by ig- outperform optimization using the GGN. In usual feed- noring second-order effects introduced by modules, that forward neural networks, the concavity is introduced is by setting the Hessian of the module function to by nonlinear elementwise activations, and corresponds zero, therefore neglecting the second term in Equa- to a (Table1). Thus, convexity can tion (7). In fact, if all activations in the network are be maintained during HBP by either clipping negative piecewise linear (e.g. ReLUs), the GGN and Hessian values to zero (PCH-clip), or taking their magnitude blocks are equivalent. Moreover, diagonal blocks of the in the diagonal concave term (PCH-abs). Modular Block-diagonal Curvature Approximations for Feedforward Architectures

Fisher information matrix: If the network defines single backward pass and then kept in memory for ap- a conditional probability density r(y z(`)) on the labels, plication of the matrix-vector product. We can embed | maximum likelihood learning for the parameterized den- these explicit schemes into our modular approach. To sity pθ(y x) will correspond to choosing a negative log- do so, we denote averages over a batch B by a bar, for | (`) (`) likelihood loss function, i.e. E(z , y) = log r(y z ). instance 1/ B HE(θ) = HE(θ). The modified − | | | (x,y) B Common loss functions like square and cross-entropy backward pass of curvature∈ information during HBP P loss can be interpreted in this way. Natural gradient for a module receives a batch average of the Hessian descent (Amari, 1998) uses the ma- with respect to the output, z, which is used to formu- H d log pθ (y x) dθ d log pθ (y x) dθ> trix F(θ) = Epθ (y x) [( | / )( | / )] late the matrix-vector product with the batch-averaged | as a PSD curvature matrix approximating the Hes- parameter Hessian θ. An average of the Hessian with H sian. It can be expressed as the log-predictive den- respect to the module input, x, is passed back. Exist- (`) H sity’s expected Hessian under r itself: Fr(z ) = ing work (Botev et al., 2017; Chen et al., 2018) differs (`) Er(y z(`)) H log r(y z ) . Assuming truly i.i.d. sam- primarily in the specifics of how this batch average − | | ples x, the log-likelihood of multiple data decomposes is computed. In HBP, these approximations can be   and results in the approximation formulated compactly within Equation (7). Relations to the cited works are discussed in more detail in the 1 (`) > (`) (`) Supplements B.4. The approximations amounting to F(θ) Dz (θ) Fr(z ) Dz (θ) . ≈ Q relations used by Botev et al.(2017) read (x,y) Q | | X∈ h i h i x [Dz(x)]> z [Dz(x)] + [Hz (x)] δz , (8) In this form, the computational scheme for BDAs of H ≈ H k k the Fisher resembles the HBP of the GGN. However, in- Xk stead of propagating back the loss Hessian with respect and likewise for θ. In case of a linear layer z(x) = to the network, the expected Hessian of the negative W x+b, this approximation implies the relations W = log-likelihood under the model’s predictive distribution H x x z, b = z, and x = W >( z)W . Mul- is used. Martens and Grosse(2015) use Monte-Carlo ⊗ > ⊗ H H H H H tiplication by this weight Hessian approximation with sampling to estimate this matrix in their KFAC op- a vector v is achieved by storing x x , z and per- timizer. Relations between the Fisher and GGN are ⊗ > H forming the required contractions v (x x z)v. discussed in (Pascanu and Bengio, 2013; Martens, 2014); 7→ ⊗ > ⊗ H Note that this approach is not restricted to curvature for square and cross-entropy loss, they are equivalent. matrix-vector multiplication routines only. Kronecker structure in the approximation gives rise to optimiza- 3.2 Batch learning approximations tion methods relying on direct inversion. In our HBP framework, exact multiplication by the A cheaper approximation, used in Chen et al.(2018), block of the curvature matrix of parameter θ in a mod- ule comes at the cost of one gradient backpropagation x [Dz(x)]> z [Dz(x)] + [Hz (x)] δz , (9) H ≈ H k k to this layer. The multiplication is recursively defined in Xk terms of multiplication by the layer output Hessian z. H leads to the modified relation W = x x> z for If it were possible to have an explicit representation H ⊗ ⊗ H of this matrix in memory, the recursive computations a linear layer. As this approximation is of the same hidden in z could be saved during the solution of rank as z, which is typically small, CG requires only H H the linear system implied by Equation (1). Unfortu- a few iterations during optimization. It avoids large nately, the size of the backpropagated exact matrices memory requirements for layers with numerous inputs, 3 since it requires x be stored instead of x x . scales quadratically in both the batch size and the ⊗ > number of layer’s output features. However, instead of Transformations that are linear in the module parame- propagating back the exact Hessian, a batch-averaged ters (e.g. linear and convolutional layers), possess con- version can be used instead to circumvent the batch stant Jacobians with respect to the module input for size scaling (originating from Botev et al.(2017)). In each sample (see Table1). Hence, in a network consist- combination with structural information about the pa- ing of only these layers, both Equation (8) and (9) yield rameter Hessian, this strategy is used in Botev et al. the same backpropagated Hessians x. This still leaves H (2017); Chen et al.(2018) to further approximate cur- the degree of freedom for choosing the approximation vature multiplications, using quantities computed in a scheme in the analogous equations for θ.

3 If samples are processed independently in every module, these matrices have block structure and scale linearly in Remark: Both strategies for obtaining curvature ma- batch size. Quadratic scaling is caused by transformations trix BDAs (implicit exact matrix-vector multiplications across different samples, like . and explicit propagation of approximated curvature) Felix Dangel, Stefan Harmeling, Philipp Hennig

are compatible. Regarding the connection to cited SGD works, we note that the maximally modular structure 2 CG, 1 block of our framework changes the nature of these approxi- CG, 2 blocks mations and allows a more flexible formulation. CG, 4 blocks 1 CG, 16 blocks train loss 4 Experiments & Implementation 0 20 40 60 80 We illustrate the usefulness of incorporating curvature epoch information with the two outlined strategies by experi- ments with a fully-connected and a convolutional neural 0.4 network (CNN) on the CIFAR-10 dataset (Krizhevsky, SGD 2009). Following the guidelines of Schneider et al. 0.3 CG, 1 block

(2019), the training loss is estimated on a random sub- test acc CG, 2 blocks 0.2 set of the training set of equal size as the test set. Each CG, 4 blocks experiment is performed for 10 different random seeds CG, 16 blocks 0.1 and we show the mean values with shaded intervals of 0 20 40 60 80 one standard deviation. For the loss function we use epoch cross-entropy. Details on the model architectures and hyperparameters are given in SupplementsE. Figure 4: SGD and different Newton-style optimizers based on the PCH-abs with batch approximations. The Training procedure and update rule: In com- same fully-connected neural network of (Chen et al., parison to a first-order optimization procedure, the 2018) was used to generate the solid baseline results. training loop with HBP has to be extended by a single Our modular approach allows further splitting the pa- backward pass to backpropagate the batch-averaged rameter blocks into sub-blocks that can independently or exact loss Hessian. This yields matrix-vector prod- be optimized in parallel (dashed lines). ucts with a curvature estimate C(i) for each parameter block θ(i) of the network. Parameter updates ∆θ(i) are obtained by applying CG to solve the linear system4 parameter Hessian computation by (7). Consequently, we split weights and bias terms row-wise into a specified (i) (i) (i) number of sub-blocks. Performance curves are shown αI + (1 α)C ∆θ = δθ , (10) − − in Figure4. In the initial phase, the BDA can be split h i into a larger number of sub-blocks without suffering where α acts as a step size limitation to improve ro- from a loss in performance. The reduced curvature in- bustness against noise. The CG routine terminates if formation is still sufficient to escape the initial plateau. the ratio of the residual norm and the gradient norm However, larger blocks have to be considered in later falls below a certain threshold or the maximum number stages to further reduce the loss efficiently. of iterations has been reached. The solution returned by CG is scaled by a γ, and parameters The fact that this switch in modularity is necessary are updated by the relation θ(i) θ(i) + γ∆θ(i). is an argument in favor of the flexible form of HBP, ← which allows to efficiently realize such switches: For Fully-connected network, batch approxima- this experiment, the splitting for each block was artifi- tions, and sub-blocking: The flexibility of HBP is cially chosen to illustrate this flexibility. In principle, illustrated by extending the results in Chen et al.(2018). the splitting could be decided individually for each Investigations are performed on a fully-connected net- parameter block, and even changed at run time. work with sigmoid activations. Solid lines in Figure4 show the performance of the Newton-style optimizer Convolutional neural network, matrix-free ex- and momentum SGD in terms of the training loss and act curvature multiplication: For convolutions, test accuracy. The second-order method is capable to the large number of hidden features prohibits back- escape the initial plateau in fewer iterations. propagating a curvature matrix batch average. Instead, we use exact curvature matrix-vector products provided The modularity of HBP allows for additional paral- within HBP. The CNN possesses sigmoid activations lelism by splitting the linear system (10) into smaller and cannot be trained by SGD (cf. Figure5a). For sub-blocks, which then also need fewer iterations of comparison with another second-order method, we ex- CG. Doing so only requires a modification of the periment with a public KFAC implementation (Martens 4We use the same update rule as Chen et al.(2018) since and Grosse, 2015; Grosse and Martens, 2016, see Sup- we extend some of the results shown within this work. plementsE for details). Modular Block-diagonal Curvature Approximations for Feedforward Architectures

(a) (b) SGD Adam 2 2 Adam KFAC KFAC GGN, α1 PCH-abs 1 PCH-clip 1 train loss train loss GGN, α1 GGN, α2 0 5 10 15 0 50 100 150 200 250 300 350 epoch wall [s] SGD 0.6 0.6 Adam KFAC 0.4 PCH-abs 0.4 test acc PCH-clip test acc Adam GGN, α1 KFAC 0.2 0.2 GGN, α2 GGN, α1 0 5 10 15 0 50 100 150 200 250 300 350 epoch wall [s] Figure 5: (a) Comparison of SGD, Adam, KFAC, and Newton-style methods with different exact curvature matrix-vector products (HBP) on a CNN with sigmoid activations (see SupplementsE). SGD cannot train the net. (b) Wall-clock time comparison (on an RTX 2080 Ti GPU; same colors realize different random seeds).

The matrix-free second-order methods progress fast in 5 Conclusion the initial stage of the optimization. However, progress in later phases stagnates. This may be caused by the We have outlined a procedure to compute block- limited sophistication of the update rule (10): If a diagonal approximations of different curvature matri- small value for α is chosen, the optimizer will perform ces for feedforward neural networks by a scheme that well in the beginning (GGN, α1). As the gradients can be realized on top of gradient backpropagation. In become smaller, and hence more noisy, the step size contrast to other recently proposed methods, our imple- limitation is too optimistic, which leads to a slow-down mentation is aligned with the design of current machine in optimization progress. A more conservative step learning frameworks and can flexibly compute Hessian size limitation improves the overall performance at the sub-blocks to different levels of refinement. Its modular cost of fewer initial progress (GGN, α2). In the train- formulation facilitates closed-form analysis of Hessian ing phase where damping is “effective”, our illustrative diagonal blocks, and unifies previous approaches (Botev methods, and KFAC, exhibit better progress per itera- et al., 2017; Chen et al., 2018). tion on the objective than the first-order competitor Within our framework we presented two strategies: (i) Adam, underlining the usefulness of curvature even if Obtaining exact curvature matrix-vector products that only computed block-wise. have not been accessible before by auto-differentiation For an impression on performance in terms of run (PCH), and (ii) backpropagation of further approxi- time, Figure5b compares the wall-clock time of one mated matrix representations to save computations matrix-free method and the baselines. The HBP-based during training. As for gradient backpropagation, the optimizer can compete with existing methods and offers Hessian backpropagation for different operations can potential for further improvements, like sub-blocking be derived independently of the underlying graph. The and parallelized CG. Despite the more adaptive na- extended modules can then be used as a drop-in re- ture of second-order methods, their full power seems placement for existing modules to construct deep neural to still require adaptive damping, to account for the networks. Internally, backprop is extended by an ad- quality of the local quadratic approximation and re- ditional Hessian backward pass through the graph to strict the update if necessary. The importance of these compute curvature information. It can be performed techniques to properly adapt the Newton direction has in parallel to, and reuse the quantities computed in, been emphasized in previous works (Martens, 2010; gradient backpropagation. Martens and Grosse, 2015; Botev et al., 2017) that aim to develop fully fletched second-order optimizers. Such adaptation, however, is beyond the scope of this text. Felix Dangel, Stefan Harmeling, Philipp Hennig

Acknowledgments Conference on Machine Learning, volume 27, pages 735–742. The authors would like to thank Frederik Kunstner, Martens, J. (2014). New insights and perspectives on Matthias Werner, Frank Schneider, and Agustinus Kris- the natural . CoRR, abs/1412.1193. tiadi for their constructive feedback on the manuscript, and gratefully acknowledge financial support by the Martens, J. and Grosse, R. (2015). Optimizing neu- European Research Council through ERC StG Action ral networks with Kronecker-factored approximate 757275 / PANAMA; the DFG Cluster of Excellence curvature. In Proceedings of the 32nd International “Machine Learning - New Perspectives for Science”, Conference on Machine Learning, volume 37, pages EXC 2064/1, project number 390727645; the German 2408–2417. JMLR. Federal Ministry of Education and Research (BMBF) Mizutani, E. and Dreyfus, S. E. (2008). Second-order through the Tübingen AI Center (FKZ: 01IS18039A); stagewise backpropagation for Hessian-matrix analy- and funds from the Ministry of Science, Research and ses and investigation of negative curvature. Neural Arts of the State of Baden-Württemberg. F. D. is grate- Networks, 21(2-3):193–203. ful to the International Max Planck Research School Naumov, M. (2017). Feedforward and recurrent neu- for Intelligent Systems (IMPRS-IS) for support. ral networks backward propagation and Hessian in matrix form. CoRR, abs/1709.06080. References Pascanu, R. and Bengio, Y. (2013). Revisiting natural Amari, S.-I. (1998). Natural gradient works efficiently gradient for deep networks. arXiv:1301.3584. in learning. Neural computation, 10(2):251–276. Pearlmutter, B. A. (1994). Fast exact multiplication Bakker, C., Henry, M. J., and Hodas, N. O. (2018). The by the Hessian. Neural Computation, 6:147–160. outer product structure of neural network derivatives. Schneider, F., Balles, L., and Hennig, P. (2019). Deep- CoRR, abs/1810.03798. OBS: A deep learning optimizer benchmark suite. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and In International Conference on Learning Represen- Siskind, J. M. (2018). Automatic differentiation in tations. machine learning: A survey. Journal of Marchine Schraudolph, N. N. (2002). Fast curvature matrix- Learning Research, 18:1–43. vector products for second-order . Botev, A., Ritter, H., and Barber, D. (2017). Practical Neural Computation, 14:1723–1738. Gauss-Newton optimisation for deep learning. In Zhang, H., Xiong, C., Bradbury, J., and Socher, R. Precup, D. and Teh, Y. W., editors, Proceedings (2017). Block-diagonal Hessian-free optimization for of the 34th International Conference on Machine training neural networks. CoRR, abs/1712.07296. Learning, volume 70, pages 557–565. PMLR. Chen, S.-W., Chou, C.-N., and Chang, E. (2018). BDA- PCH: Block-diagonal approximation of positive- curvature Hessian for training neural networks. CoRR, abs/1802.06502. Grosse, R. and Martens, J. (2016). A Kronecker- factored approximate Fisher matrix for convolution layers. In Balcan, M. F. and Weinberger, K. Q., edi- tors, Proceedings of The 33rd International Confer- ence on Machine Learning, volume 48, pages 573–582. PMLR. Johnson, W. P. (2002). The curious history of Faà di Bruno’s formula. The American mathematical monthly, 109(3):217–234. Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report. Magnus, J. R. and Neudecker, H. (1999). Matrix Dif- ferential Calculus with Applications in Statistics and Econometrics. Probabilistics and Statistics. Wiley. Martens, J. (2010). Deep learning via Hessian-free optimization. In Proceedings of the 27th International Modular Block-diagonal Curvature Approximations for Feedforward Architectures (Supplementary Material)

Felix Dangel Stefan Harmeling Philipp Hennig University of Tübingen Heinrich Heine University University of Tübingen and [email protected] Düsseldorf MPI for Intelligent Systems, Tübingen [email protected] [email protected]

This document provides additional information and derivations to clarify the relations of Hessian backpropagation (HBP). SectionA contains a clean definition of Jacobian and Hessian matrices for multi-variate functions that are typically involved in the construction of neural networks. We introduce conventions for notation and show how the HBP Equation (7) results naturally from the chain rule for matrix derivatives (Magnus and Neudecker, 1999). With matrix derivatives, the HBP equation for a variety of module functions can be derived elegantly. SectionsB, C, andD contain the HBP derivations for all operations in Table1. We split the considered operations into different categories to achieve a cleaner structure. SectionB contains details on operations used for the construction of fully-connected neural networks (FCNNs) and skip-connections. Subsection B.4 illustrates the analytic composition of multiple modules by combining the backward passes of a nonlinear elementwise activation function and an affine transformation. This yields the recursive schemes of Botev et al.(2017) and Chen et al.(2018), the latter of which has been used in the experiment of Section4. The analysis of the Hessian for common loss functions is provided in SectionC. Operations occurring in convolutional neural networks (CNNs) are subject of SectionD. SectionE provides details on model architectures, training procedures used in the experiments of Section4, and an additional experiment on a modified test problem of the DeepOBS benchmark library (Schneider et al., 2019).

A Matrix derivatives

Index notation for higher-order derivatives of multi-variate matrix functions can become heavy (Mizutani and Dreyfus, 2008; Naumov, 2017; Chen et al., 2018; Bakker et al., 2018). We tackle this issue by embedding the presented approach in the notation of matrix differential calculus, which

1. yields notation consistent with established literature on matrix derivatives (Magnus and Neudecker, 1999) and clarifies the origin of the symbols D and H that are used extensively in the main text.

2. allows for using a multi-dimensional generalization of the chain rule.

3. lets us extract first- and second-order derivatives from differentials using the identification rules of Magnus and Neudecker(1999) without bothering to deal with index notation.

With these techniques it is easy to see how structure, like Kronecker products, appears in the derivatives.

Preliminaries & notation: The following definitions and theorems represent a collection of results from the book of Magnus and Neudecker(1999). They generalize the concept of first- and second-order derivatives to multi-variate matrix functions in terms of the Jacobian and Hessian matrix. While there exist multiple ways to arrange the partial derivatives, the presented definitions allows for a multivariate generalization of the chain rule. We denote matrix, vector, and scalar functions by F , f, and φ, respectively. Matrix (vector) arguments are written as X (x). Vectorization (vec) applies column-stacking, such that for matrices A, B, C of appropriate size

vec(ABC) = C> A vec(B) . (S.1) ⊗  Modular Block-diagonal Curvature (Supplements)

Whenever possible, we assign vectors to lower-case (for instance x, θ), matrices to upper-case (W, X, . . . ), and tensors to upper-case sans serif symbols (W, X,... ). denotes elementwise multiplication (Hadamard product).

Remark on vectorization: The generalization of Jacobians and Hessians provided by Magnus and Neudecker (1999) relies on vectorization of matrices. Convolutional neural networks usually act on tensors and we incorporate these by assuming them to be flattened such that the first index varies fastest. For a matrix (tensor of order n n n two), this is consistent with column-stacking. For instance, the vectorized version of the tensor A R 1× 2× 3 ∈ with n1, n2, n3 N is vec A = (A1,1,1, A2,1,1,..., An1,1,1, A1,2,1,..., An1,n2,n3 )>. To formulate the generalized ∈ Jacobian or Hessian for tensor operations, its action on a vector or matrix view of the original tensor is considered. Consequently, all operations can be reduced to vector-valued functions, which we consider in the following. The vectorization scheme is not unique. Most of the linear algebra literature assumes column-stacking. However, when it comes to implementations, a lot of programming languages store tensors in row-major order, corresponding to row-stacking vectorization (last index varies fastest). Thus, special attention has to be paid in implementations.

A.1 Definitions

n q m p Definition A.1 (Jacobian matrix). Let F : R × R × ,X F (X) be a differentiable function mapping → 7→ between two matrix-sized quantitites. The Jacobian DF (X) of F with respect to X is an (mp nq) matrix ×

∂ vec F (X) ∂ [vec F (X)]i DF (X) = , such that [DF (X)]i,j = (S.2) ∂(vec X)> ∂ [(vec X)>]j

(Magnus and Neudecker, 1999, Chapter 9.4).

In the context of FCNNs, the most common occurrences of Definition A.1 involve vector-to-vector functions f : Rn Rm, x f(x) with → 7→ ∂f(x) Df(x) = . ∂x> For instance, x can be considered the input or bias vector of a layer applying an affine transformation. Other n q m cases involve matrix-to-vector mappings f : R × R ,X f(X) with → 7→ ∂f(X) Df(X) = , ∂(vec X)>

m q where X might correspond to the R × weight matrix of an affine transformation. Proper arrangement of the partial derivatives leads to a generalized formulation of the chain rule. n q m p m p r s Theorem A.1 (Chain rule for Jacobians). Let F : R × R × and G : R × R × be differentiable → n q → r s matrix-to-matrix mappings and their composition H be given by H = G F : R × R × , then ◦ → DH(X) = [DG(F )] DF (X) (S.3)

(restricted from Magnus and Neudecker, 1999, Chapter 5.15).

Theorem A.1 is used to unroll the Jacobians in the composite structure of the feedforward neural network’s loss (`) (` 1) (1) function E f f − . . . f , compare Subsection 3.1. ◦ ◦ ◦ n q m p Definition A.2 (Hessian). Let F : R × R × be a twice differentiable matrix function. The Hessian HF (X) → is an (mnpq nq) matrix defined by ×

∂ ∂ vec F (X) > HF (X) = D [DF (X)]> = vec (S.4) ∂(vec X) ∂(vec X) > ( >  )

(Magnus and Neudecker, 1999, Chapter 10.2). Felix Dangel, Stefan Harmeling, Philipp Hennig

Forms of Equation (S.4) that are most likely to emerge in neural networks include

∂2φ(x) ∂ ∂φ(X) Hφ(x) = and Hφ(X) = . ∂x>∂x ∂(vec X)> ∂ vec X The scalar function φ can be considered as the loss function E. Analogously, the Hessian matrix of a vector-in- vector-out function f reads

Hf1(x) . Hf(x) =  .  . (S.5) Hf (x)  m    Arranging the partial second derivatives in the fashion of Definition A.2 yields a direct generalization of the chain rule for second derivatives. We now provide this rule for a composition of vector-to-vector functions. Theorem A.2 (Chain rule for Hessian matrices). Let f : Rn Rm and g : Rm Rp be twice differentiable and → → h = g f : Rn Rp. The relation between the Hessian of h and the Jacobians and Hessians of the constituents f ◦ → and g is given by

Hh(x) = [I Df(x)]> [Hg(f)] Df(x) + [Dg(f) I ]Hf(x) (S.6) p ⊗ ⊗ n (restricted from Magnus and Neudecker, 1999, Chapter 6.10).

A.2 Relation to the modular approach

(`) (` 1) (1) Theorem A.2 can directly be applied to the graph E f f − f of the feedforward net under ◦ ◦ ◦ · · · ◦ investigation. Note that, for any module function f (i), the loss can be expressed as a composition of two functions (i 1) (1) by squashing preceding modules in the graph into a single function f − f , and likewise composing the ◦ · · · ◦ module itself and all subsequent functions, i.e. E f (`) f (i). ◦ ◦ · · · ◦ The analysis can therefore be reduced to the module shown in Figure2 receiving an input x Rn that is used to ∈ compute the output z Rm. The scalar loss is then expressed as a mapping E(z(x), y): Rn Rp with p = 1. ∈ → Suppressing the label y , Equation (S.6) implies

HE(x) = [Ip Dz(x)]> [HE(z)] Dz(x) + [DE(z) In]Hz(x) ⊗ ⊗ (S.7) = [Dz(x)]> [HE(z)] Dz(x) + [DE(z) I ]Hz(x) . ⊗ n The HBP Equation (7) is obtained by substituting (S.5) into (S.7).

B HBP for fully-connected neural networks

B.1 Linear layer (matrix-vector multiplication, matrix-, addition)

Consider the function f of a module applying an affine transformation to a vector. Apart from the input x, additional parameters of the module are given by the weight matrix W and the bias term b,

n m n m m f : R R × R R × × → (x, W, b) z = W x + b . 7→ To compute the Jacobians with respect to each variable, we use the differentials

dz(x) = W dx , dz(b) = db ,

dz(W ) = (dW )x = d vec z(W ) = x> I vec(dW ) , ⇒ ⊗ m using Property (S.1) to establish the implication in the last line. With the first identification tables provided in Magnus and Neudecker(1999, Chapter 9.6), the Jacobians can be read off from the differentials as Dz(x) = Modular Block-diagonal Curvature (Supplements)

W, Dz(b) = I , Dz(W ) = x> I . All second module derivatives Hz(x), Hz(W ), and Hz(b) vanish since f is m ⊗ m linear in all inputs. Inserting the Jacobians into Equation (7) results in

x = W > zW , (S.8a) H H b = z , (S.8b) H H W = x> I > z x> I = xx> z = x x> z . (S.8c) H ⊗ m H ⊗ m ⊗ H ⊗ ⊗ H The HBP relations for matrix-vector multiplication and addition listed in Table1 are special cases of Equation (S.8). HBP for matrix-matrix multiplication is derived in a completely analogous fashion.

B.2 Elementwise activation

Next, consider the elementwise application of a nonlinear function,

m m φ : R R → x z = φ(x) such that φ (x) = φ(x ) , 7→ k k For the matrix differential with respect to x, this implies

dφ(x) = φ0(x) dx = diag [φ0(x)] dx ,

and consequently, the Jacobian is given by Dφ(x) = diag [φ0(x)] . For the Hessian, note that the function value m φk(x) only depends on xk and thus Hφk(x) = φ00(xk)eke> , with the one-hot unit vector ek R in coordinate k ∈ direction k. Inserting all quantities into the Relation (7) results in

x = diag [φ0(x)] z diag [φ0(x)] + φ00(x )e e>δz H H k k k k k (S.9) X = diag [φ0(x)] z diag [φ0(x)] + diag [φ00(x) δz] . H

B.3 Skip-connection

Residual learning He et al.(2016) uses skip-connection units to facilitate the training of deep neural networks. In its simplest form, the mapping f : Rm Rm reads → z(x, θ) = x + y(x, θ) , with a potentially nonlinear operation (x, θ) y. The Jacobians with respect to the input and parameters are 7→ given by Dz(x) = Im + Dy(x) and Dz(θ) = Dy(θ). Using (7), one finds

x = [I + Dy(x)]> z [I + Dy(x)] + [Hy (x)] δz , H m H m k k Xk θ = [Dy(θ)]> z [Dy(θ)] + [Hy (θ)] δz . H H k k Xk

B.4 Relation to recursive schemes in previous work

The modular decomposition of curvature backpropagation facilitates the analysis of modules composed of multiple operations. Now, we analyze the composition of two modules. This yields the recursive schemes presented by Botev et al.(2017) and Chen et al.(2018), referred to as KFRA and BDA-PCH, respectively.

B.4.1 Analytic composition of multiple modules

Consider the module g = f φ, x y = φ(x), y z = f(y(x)). We assume φ to act elementwise on the input, ◦ 7→ 7→ followed by a linear layer f : z(y) = W y + b (Figure S.1a). Analytic elimination of the intermediate backward pass yields a single module composed of two operations (Figure S.1b). Felix Dangel, Stefan Harmeling, Philipp Hennig

(a) (b)

W b W b δW δb δW δb W b W b H H H H x y z x z

δx φ δy z = W y +b δz δx φ z = W y +b δz x y z x z H H H H H Figure S.1: Composition of elementwise activation φ and linear layer. (a) Both operations are analyzed separately to derive the HBP. (b) Backpropagation of z is expressed in terms of x without intermediate message. H H

The first Hessian backward pass through the linear module f (Equations (S.8)) implies

y = W > zW , H H W = y y> z = φ(x) φ(x)> z , (S.10a) H ⊗ ⊗ H ⊗ ⊗ H b = z . (S.10b) H H Further backpropagation through φ by means of Equation (S.9) results in

x = diag [φ0(x)] y diag [φ0(x)] + diag [φ00(x) δy] H H = diag [φ0(x)] W > zW diag [φ0(x)] + diag φ00(x) W >δz (S.10c) H = W diag [φ0(x)] > z W diag [φ0(x)] + diag φ00(x) W >δz . { } H { }   We use the invariance of a diagonal matrix under transposition and δy = W >δz for the backpropagated gradient to establish the last equality. The Jacobian Dg(x) of the module shown in Figure S.1b is Dg(x) = W diag [φ(x)] =

W > φ0(x) >. In summary, the HBP relation for the composite layer z(x) = W φ(x) + b is given by (S.10).

  B.4.2 Obtaining the relations of KFRA and BDA-PCH

The derivations for the composite module given above are closely related to the recursive schemes of Botev et al. (2017); Chen et al.(2018). Their relations are obtained from a straightforward conversion of the HBP rules (S.10). Consider a sequence of a linear layer f (1) and multiple composite modules f (2), . . . , f (`) as shown in Figure S.2. According to Equation (S.10b) both the linear layer and the composite f (i) identify the gradient (Hessian) with respect to their outputs, δz(i) ( z(i)), as the gradient (Hessian) with respect to their bias term, δb(i) ( b(i)). H H Introducing layer indices for all quantities, one finds the recursion

b(i) = z(i) , (S.11a) H H (i) (i 1) (i 1) (i) W = φ(z − ) φ(z − )> b , (S.11b) H ⊗ ⊗ H

for i = ` 1,..., 1, and −

(i 1) (i) (i 1) > (i) (i) (i 1) (i 1) (i) (i) z − = W diag φ0(z − ) b W diag φ0(z − ) + diag φ00(z − ) W >δb H H n h io n h io h i (S.11c) (i) (i 1) (i) (i) (i 1) > (i 1) (i) (i) = W > φ0(z − ) b W > φ0(z − ) + diag φ00(z − ) W >δb H n o n o h i for i = ` 1,..., 2. It is initialized with the gradient δz(`) and Hessian z(`) of the loss function. − H Equations (S.11) are equivalent to the expressions provided in Botev et al.(2017); Chen et al.(2018). Their emergence from compositions of HBP equations of simple operations represents one key insight of this paper. Both works use the batch average strategy presented in Subsection 3.2 to obtain curvature estimates. Modular Block-diagonal Curvature (Supplements)

θ(1) θ(2) θ(`) δθ(1) δθ(2) δθ(`) θ(1) θ(2) θ(`) H H H z(0) z(1) z(2) z(`) E (1) (2) (`) f δz(1) φ f δz(2) ... φ f δz(`) E z(1) z(2) z(`) H H H Figure S.2: Grouping scheme for the recursive Hessian computation proposed by KFRA and BDA-PCH. The backward messages between the linear layer and the preceding nonlinear activation are analytically absorbed.

C HBP for loss functions

C.1 Square loss

Square loss of the model prediction x Rm and the true label y Rm is computed by ∈ ∈

E(x, y) = (y x)>(y x) . − −

Differentiating dE(x) = (dx)> (y x) (y x)> dx = 2 (y x)> dx once more yields − − − − − − 2 d E(x) = 2(dx)>dx = 2(dx)>Imdx .

The Hessian is extracted with the second identification tables from Magnus and Neudecker(1999, Chapter 10.4) and reproduces the expected result

HE(x) = x = 2I . (S.12) H m

C.2 Softmax cross-entropy loss

The computation of cross-entropy from logits is composed of two operations. First, the neural network outputs are transformed into log-probabilities by the softmax function. Then, the cross-entropy with the label is computed.

Log-softmax: The output’s elements x Rm of a neural network are assigned to log-probabilities z(x) = m ∈ exp(x) log [p(x)] R by means of the softmax function p(x) = / i exp(xi). Consequently, ∈ P z(x) = x log exp(x ) , − i " i # X m and the Jacobian reads Dz(x) = Im p(x)1> with 1> R denoting a vector of ones. Moreover, the − m m ∈ log-probability Hessians with respect to x are given by Hz (x) = diag [p(x)] + p(x)p(x)>. k − Cross-entropy: The negative log-probabilities are used to compute the cross-entropy with the probability m distribution of the label y R with yk = 1 (usually a one-hot vector), ∈ k

P E(z, y) = y>z . − Since E is linear in the log-probabilities, that is HE(z) = 0, the HBP is given by ∂E(z) x = [Dz(x)]> HE(z) [Dz(x)] + Hzk(x) H ∂zk Xk = diag [p(x)] + p(x)p(x)> ( y ) − − k k  X = diag [p(x)] p(x)p(x)> . − Felix Dangel, Stefan Harmeling, Philipp Hennig

D HBP for convolutional neural networks

The recursive approaches presented in Botev et al.(2017); Chen et al.(2018) tackle the computation of curvature blocks for FCNNs. To the best of our knowledge, an extension to CNNs has not been achieved so far. One reason this might be is that convolutions come with heavy notation that is difficult to deal with in index notation. Martens and Grosse(2015) provide a procedure for computing a Kronecker-factored approximation of the Fisher for convolutions (KFC). This scheme relies on the property of the Fisher to describe the covariance of the log-likelihood’s gradients under the model’s distribution. Curvature information is thus encoded in the expectation value, and not by backpropagation of second-order derivatives. To derive the HBP for convolutions, we use the fact that an efficient implementation of the forward pass is decomposed into multiple operations (see Figure S.4), which can be considered independently by means of our modular approach (see Figure S.5 for details). Our analysis starts by considering the backpropagation of curvature through operations that occur frequently in CNNs. This includes the reshape (Subsection D.1) and extraction operation (Subsection D.2). In Subsection D.3 we outline the modular decomposition and curvature backpropagation of convolution for the two-dimensional case. The approach carries over to other dimensions. All operations under consideration in this Section are linear. Hence the second terms in Equation (7) vanish. Again, we use the framework of matrix differential calculus (Magnus and Neudecker, 1999) to avoid index notation.

D.1 Reshape/view

n n m m The reshape operation reinterprets a tensor X R 1×···× x as another tensor Z Z 1×···× z ∈ ∈ Z(X) = reshape(X) ,

which possesses the same number of elements, that is i ni = i mi. One example is given by the vec operation. It corresponds to a reshape into a tensor of order one. As the arrangement of elements remains unaffected, Q Q vec Z = vec X, and reshaping corresponds to the identity map on the vectorized input. Consequently, one finds (remember that X is a shorthand notation for vec X) H H X = Z . H H D.2 Index select/map

Whenever elements of a tensor are selected by an operation, it can be described as a matrix-vector multiplication of a binary matrix Π and the vectorized tensor. The mapping is described by an index map π. Element j of the m n output z R is selected as element π(j) from the input x R . Only elements Πj,π(j) in the selection matrix m n∈ ∈ Π R × are one, while all other entries vanish. Consequently, index selection can be expressed as ∈ z = x z(x) = Πx with Π = 1 . j π(j) ⇔ j,π(j) The HBP is equivalent to the linear layer discussed in Subsection B.1,

z = Π>( x)Π . H H Applications include max-pooling and the im2col/unfold operation (see Subsection D.3). Average-pooling represents a weighted sum of index selections and can be treated analogously.

D.3 Convolution

The convolution operation acts on local patches of a multi-channel input of sequences, images, or volumes. In the following, we restrict the discussion to two-dimensional convolution. Figure S.3a illustrates the setting. A collection of filter maps, the W, is slid over the spatial coordinates of the input tensor X. In each step, the kernel is contracted with the current area of overlap (the patch). Both the sliding process as well as the structure of the patch area can be controlled by hyperparameters of the operation (kernel size, stride, dilation). Moreover, it is common practice to extend the input tensor, for instance by zero-padding (for an introduction to the arithmetics of convolutions, see Dumoulin and Visin, 2016). The approach presented here is not limited to a certain choice of convolution hyperparameters. Modular Block-diagonal Curvature (Supplements)

(a) 8-837-45 3 2 11 7 64 1 1 -4-58-99-9 1332 -67-1 1 8 85 2 09 8 3 ? = 8 0 7-9-49 -571 -21-29 1 3 8 9 8 2 6 5 2 4 5-3-82 2

(b) 8 3 1 4 7 -4 -9 1 5 9 -8 -4 2 6 5 -5 9 13 -67 3 5 2 7 1 8 -9 -57 -21 8 2 8 8 -9 32 1 8 -1 2 4 -8 1 -29 0 -4 1 6 1 5 2 8 0 3 7 9 3 9 8 -3 2

(c) 8 8 7 9  4− 4  −  9 8   − −  3 1 1 5 2 7 8 2 1 6 8 0  8 0  13 32  −  1 4 5 9 7 1 2 8 6 1 0 3  4 4  67 1    − −  =  − −  1 5 2 6 8 2 1 8 8 0 3 9  5 5  57 1  −  −  5 9 6 5 2 8 8 2 0 3 9 8   9 2   21 29       − −     3 7       5 9     8 3   −   9 2   −   

Figure S.3: Two-dimensional convolution Y = X ? W without bias term. (a) The input X consists of Cin = 3 channels (different colors) of (3 3) images. Filter maps of size (2 2) are provided by the kernel W for the × × generation of Cout = 2 output channels. Patch and kernel volumes that are contracted in the first step of the convolution are highlighted. Assuming no padding and a stride of one results in four patches. New features Y consist of C = 2 channels of (2 2) images. (b) Detailed view. All tensors are unrolled along the first axis. out × (c) Convolution as matrix multiplication. From left to right, the matrices X >,W >, and Y > are shown.

J K Felix Dangel, Stefan Harmeling, Philipp Hennig

Table S.1: Important quantities for the convolution operation. kernel reshape The number of patches equals the product of the outputs’ spatial dimensions, i.e. P = Y1Y2. im2col / input matmul unfold Tensor Dimensionality Description

col2im / X (Cin,X1,X2) Input bias copy reshape W (Cout,Cin,K1,K2) Kernel Y (Cout,Y1,Y2) Output X (C K K ,P ) output in 1 2 Expanded input W (Cout,CinK1K2) Matricized kernel J YK (Cout,P ) Matricized output Figure S.4: Decomposition of the convolution b Cout Bias vector operation’s forward pass. B (Cout,P ) Bias matrix

D.3.1 Forward pass and notation

We now introduce the quantities involved in the process along with their dimensions. For a summary, see Table S.1. A forward pass of convolution proceeds as follows (cf. Figure S.3b for an example):

The input X, a tensor of order three, stores a collection of two-dimensional data. Its components X • cin,x1,x2 are referenced by indices for the channel cin and the spatial location (x1, x2). Cin,X1,X2 denote the input channels, width, and height of the image, respectively. The kernel W is a tensor of order four with dimensions (C ,C ,K ,K ). Kernel width K and height K • out in 1 2 1 2 determine the patch size P = K1K2 for each channel. New features are obtained by contracting the patch and kernel. This is repeated for a collection of Cout output channels stored along the first axis of W. Each output channel c is shifted by a bias b , stored in the C -dimensional vector b. • out cout out The output Y = X ? W with components Y is of the same structure as the input. We denote the • cout,y1,y2 spatial dimensions of Y by Y1,Y2, respectively. Hence Y is of dimension (Cout,Y1,Y2).

Example (index notation): A special case where input and output have the same spatial dimensions (Grosse and Martens, 2016) uses a stride of one, kernel widths K1 = K2 = 2K + 1, (K N), and padding K. Elements of ∈ the filter W are addressed with the index set K,..., 0,...,K K,..., 0,...,K . In this setting, cout,cin,:,: {− } × {− } K K

Ycout,y1,y2 = Xcin,x1+k1,x2+k2 Wcout,cin,k1,k2 + bcout . (S.13) k = K k = K 1X− 2X− Elements of X addressed out of bounds evaluate to zero. Arbitrary convolutions come with even heavier notation.

Convolution by matrix multiplication: Evaluating convolutions by sums of the form (S.13) leads to poor memory locality (Grosse and Martens, 2016). For improved performance, the computation is mapped to a matrix multiplication (Chellapilla et al., 2006). To do so, patches of the input X are extracted and flattened into columns of a matrix. The patch extraction is indicated by the operator and the resulting matrix X is of dimension · (C K K P ) (cf. left part of Figure S.3c showing X >). In other words, elements contracted with the kernel are in 1 2 × stored along the first axis of X . is also referred to as im2col or unfold operation1, and accounts for padding. · J K J K J K The kernel tensor W is reshaped into a (Cout CinK1K2) matrix W , and elements of the bias vector b are J K J K × copied columnwise into a (C P ) matrix B = b1> , where 1 is a P -dimensional vector of ones. Patchwise out × P P contractions are carried out by matrix multiplication and yield a matrix Y of shape (Cout,P ) with P = Y1Y2, Y = W X + B (S.14)

(Figure S.3c shows the quantities W >, X >, and Y from left to right). Reshaping Y into a (Cout,Y1,Y2) tensor, referred to as col2im, yields Y. Figure S.4 summarizes theJ K outlined decomposition of the forward pass. 1Our definition of the unfold operatorJ slightlyK differs from Grosse and Martens(2016), where flattened patches are stacked rowwise. This lets us achieve an analogous form to a linear layer. Conversion is achieved by transposition. Modular Block-diagonal Curvature (Supplements)

W b

W b H H X X Z Y Y unfold matmul add col2im J K X X Z Y Y H H H H H Figure S.5: DecompositionJ K of convolution with notation for the study of curvature backpropagation.

D.4 HBP for convolution

We now compose the HBP for convolution, proceeding from right to left with the operations depicted in Figure S.5, by analyzing the backpropagation of curvature for each module, adopting the figure’s notation.

C Y Y Col2im/reshape: The col2im operation takes a matrix Y R out× 1 2 and reshapes it into the tensor C Y Y ∈ Y R out× 1× 2 . According to Subsection D.1, Y = Y. ∈ H H Bias Hessian: Forward pass Y = Z + B and Equation (S.8b) imply Y = B = Z. To obtain the Hessian H H H with respect to b from B, consider the columnwise copy operation B(b) = b 1>, whose matrix differential is H P dB(b) = (db) 1>. Vectorization yields d vec B(b) = (1 I )db. Hence, the Jacobian is DB(b) = 1 I , P P ⊗ Cout P ⊗ Cout and insertion into Equation (7) results in

b = (1 I )> B (1 I ) . H P ⊗ Cout H P ⊗ Cout This performs a linewise and columnwise summation over B, summing entities that correspond to copies of H the same entry of b in the matrix B. It can also be regarded as a reshape of B into a (C ,P,C ,P ) tensor, H out out which is then contracted over the second and fourth axis.

Weight Hessian: For the matrix-matrix multiplication Z(W, X ) = W X , the HBP was discussed in Sub- section B.1. The Jacobians are given by DZ( X ) = I W and DZ(W ) = X > I with the patch size P ⊗ ⊗ S S = CinK1K2. Hence, the HBP for the unfolded input and the weightJ K matrixJ K are J K J K X = (I W )> Z (I W ) , H P ⊗ H P ⊗ W = X > IS > Z X > IS . HJ K ⊗ H ⊗ From what has been said about the reshape operation in Subsection D.1, it follows that W = W . J K J K H H Im2col/unfold: The patch extraction operation copies all elements in a patch into the columns of a matrix · and thus represents a selection of elements by an index map, which is hard to express in notation. Numerically, it can be obtained by calling im2col on a (Cin,X1,X2J)Kindex tensor whose entries correspond to the indices. The resulting tensor contains all information about the index map. HBP follows the relation of Subsection D.2.

Discussion: Although convolution can be understood as a matrix multiplication, the parameter Hessian is not identical to that of the linear layer discussed in Subsection B.1. The difference is due to the parameter sharing of the convolution. In the case of a linear layer z = W x + b, the Hessian of the weight matrix for a single sample possesses Kronecker structure (Botev et al., 2017; Chen et al., 2018; Bakker et al., 2018), i.e. W = x x> z. H ⊗ ⊗ H For convolution layers, however, it has been argued by Bakker et al.(2018) that block diagonals of the Hessian do not inherit the same structure. Rephrasing the forward pass (S.14) in terms of vectorized quantities, we find vec Y = (I W ) vec X + vec B. P ⊗ In this perspective, convolution corresponds to a fully-connected linear layer, with the additional constraints that the weight and bias matrix be circular. Defining Wˆ = IJ K W , one then finds the Hessian Wˆ to possess P ⊗ H Kronecker structure. Parameterization with a kernel tensor encodes the circularity constraint in weight sharing. Felix Dangel, Stefan Harmeling, Philipp Hennig

Table S.2: Model architectures under consideration. We use the patterns Linear(in_features, out_features), Conv2d(in_channels, out_channels, kernel_size, padding), MaxPool2d(kernel_size, stride), and Ze- roPad2d(padding_left, padding_right, padding_top, padding_bottom) to describe module hyperparameters. Convolution strides are always chosen to be one. (a) FCNN used to extend the experiment in Chen et al.(2018) (3 846 810 parameters). (b) CNN architecture (1 099 226 parameters). (c) DeepOBS c3d3 test problem with three convolutional and three dense layers (895 210 parameters). ReLU activation functions are replaced by sigmoids.

(a) FCNN (Figure4) (b) CNN (Figure5) (c) DeepOBS c3d3 (Figure S.6)

# Module # Module # Module 1 Flatten 1 Conv2d(3, 16, 3, 1) 1 Conv2d(3, 64, 5, 0) 2 Linear(3072, 1024) 2 Sigmoid 2 Sigmoid 3 Sigmoid 3 Conv2d(16, 16, 3, 1) 3 ZeroPad2d(0, 1, 0, 1) 4 Linear(1024, 512) 4 Sigmoid 4 MaxPool2d(3, 2) 5 Sigmoid 5 MaxPool2d(2, 2) 5 Conv2d(64, 96, 3, 0) 6 Linear(512, 256) 6 Conv2d(16, 32, 3, 1) 6 Sigmoid 7 Sigmoid 7 Sigmoid 7 ZeroPad2d(0, 1, 0, 1) 8 Linear(256, 128) 8 Conv2d(32, 32, 3, 1) 8 MaxPool2d(3, 2) 9 Sigmoid 9 Sigmoid 9 ZeroPad2d(1, 1, 1, 1) 10 Linear(128, 64) 10 MaxPool2d(2, 2) 10 Conv2d(96, 128, 3, 0) 11 Sigmoid 11 Flatten 11 Sigmoid 12 Linear(64, 32) 12 Linear(2048, 512) 12 ZeroPad2d(0, 1, 0, 1) 13 Sigmoid 13 Sigmoid 13 MaxPool2d(3, 2) 14 Linear(32, 16) 14 Linear(512, 64) 14 Flatten 15 Sigmoid 15 Sigmoid 15 Linear(1152, 512) 16 Linear(16, 10) 16 Linear(64, 10) 16 Sigmoid 17 Linear(512, 256) 18 Sigmoid 19 Linear(256, 10)

For the Hessian W of the kernel to possess Kronecker structure, the output Hessian Z has to be assumed H H to factorize into a Kronecker product of S S and C C matrices. These assumptions are somewhat in × out × out parallel with the additional approximations introduced in Grosse and Martens(2016) to obtain KFC.

E Experimental details

Fully-connected neural network: The same model as in Chen et al.(2018) (see Table S.2a) is used to extend the experiment performed therein. The weights of each linear layer are initialized with the Xavier method of Glorot and Bengio(2010). Bias terms are intialized to zero. Backpropagation of the Hessian uses approximation (9) of (S.11) to compute the curvature blocks of the weights and bias, W (i) and b(i). H H Hyperparameters are chosen as follows to obtain consistent results with the original work: All runs shown in Figure4 use a batch size of B = 500. For SGD, the learning rate is assigned to γ = 0.1 with momentum | | v = 0.9. Block-splitting experiments with the second-order method use the PCH-abs. All runs were performed with a learning rate γ = 0.1 and a regularization strength of α = 0.02. For the convergence criterion of CG, the maximum number of iterations is restricted to nCG = 50; convergence is reached at a relative tolerance CG = 0.1.

Convolutional neural net: The training procedure of the CNN architecture shown in Table S.2b is evaluated on a hyperparameter grid. Runs with smallest final training loss are selected to rerun on different random seeds. The curves shown in Figure5b represent mean values and standard deviations for ten different realizations over the random seed. All layer parameters were initialized with the default method in PyTorch. For the first-order optimizers (SGD, Adam), we considered batch sizes B 100, 200, 500 . In the case of SGD, ∈ { } momentum v was tuned over the set 0, 0.45, 0.9 . Although we varied the learning rate over a large range of values 3 2 { } γ 10− , 10− , 0.1, 1, 10 , losses kept plateauing and did not decrease. In particular, the loss function even ∈  Modular Block-diagonal Curvature (Supplements)

SGD 2 Adam 0.6 PCH-abs SGD PCH-clip Adam 1 GGN 0.4 test acc

train loss PCH-abs PCH-clip 0.2 GGN 5 10 15 0 5 10 15 epoch epoch Figure S.6: Comparison of SGD, Adam, and Newton-style methods with different exact curvature matrix-vector products (HBP) on the DeepOBS c3d3 network (Schneider et al., 2019) with sigmoid activations (Table S.2c).

4 3 2 increased for the large learning rates. For Adam, we only vary the learning rate γ 10− , 10− , 10− , 0.1, 1, 10 . ∈ As second-order methods scale better to large batch sizes, we considered B 200 , 500, 1000 for them. The ∈ { } convergence parameters for CG were fixed to nCG = 200 and CG = 0.1. For all curvature matrices, we varied the 4 3 2 learning rates over the grid γ 0.05, 0.1, 0.2 and α 10− , 10− , 10− . ∈ { } ∈ In order to compare with another second-order method, we experimented with a public PyTorch implementation2 of the KFAC optimizer (Martens and Grosse, 2015; Grosse and Martens, 2016). All hyperparameters were kept 4 3 2 at their default setting. The learning rate was varied over γ 10− , 10− , 10− , 0.1, 1, 10 . ∈ The hyperparameters of results shown in Figure5 read as follows:

3 SGD ( B = 100, γ = 10− , v = 0.9). The particular choice of these hyperparameters is artificial. This run is • | | representative for SGD, which does not achieve any progress at all.

3 Adam ( B = 100, γ = 10− ) • | | KFAC ( B = 500, γ = 0.1) • | | 3 4 PCH-abs ( B = 1000, γ = 0.2, α = 10− ), PCH-clip ( B = 1000, γ = 0.1, α = 10− ) • | | | | 4 GGN, α ( B = 1000, γ = 0.1, α = 10− ). This run does not yield the minimum training loss on the grid, • 1 | | and is shown to illustrate that the second-order method is capable to escape the flat regions in early stages.

3 GGN, α ( B = 1000, γ = 0.1, α = 10− ). In comparison with α , the second-order method requires more • 2 | | 1 iterations to escape the initial plateau, caused by the larger regularization strength. However, this leads to improved robustness against noise in later stages of the training procedure.

Additional experiment: Another experiment conducted with HBP considers the c3d3 architecture (Table S.2c) of DeepOBS (Schneider et al., 2019) on CIFAR-10 . ReLU activations are replaced by sigmoids to make the problem more challenging. The hyperparameter grid is chosen identically to the CNN experiment above, and results are summarized in Figure S.6. In particular, the hyperparameter settings for each competitor are:

SGD ( B = 100, γ = 1, v = 0) • | | 3 Adam ( B = 100, γ = 10− ) • | | 3 2 PCH-abs ( B = 500, γ = 0.1, α = 10− ), PCH-clip ( B = 500, γ = 0.1, α = 10− ) • | | | | 3 GGN ( B = 500, γ = 0.05, α = 10− ) • | |

2https://github.com/alecwangcq/KFAC-Pytorch Felix Dangel, Stefan Harmeling, Philipp Hennig

References Bakker, C., Henry, M. J., and Hodas, N. O. (2018). The outer product structure of neural network derivatives. CoRR, abs/1810.03798. Botev, A., Ritter, H., and Barber, D. (2017). Practical Gauss-Newton optimisation for deep learning. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 557–565. PMLR. Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. In Lorette, G., editor, Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de Rennes 1, Suvisoft. Chen, S.-W., Chou, C.-N., and Chang, E. (2018). BDA-PCH: Block-diagonal approximation of positive-curvature Hessian for training neural networks. CoRR, abs/1802.06502. Dumoulin, V. and Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 249–256. Grosse, R. and Martens, J. (2016). A Kronecker-factored approximate Fisher matrix for convolution layers. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 573–582. PMLR. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on and Pattern Recognition, pages 770–778. Magnus, J. R. and Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. Probabilistics and Statistics. Wiley. Martens, J. and Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2408–2417. JMLR. Mizutani, E. and Dreyfus, S. E. (2008). Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature. Neural Networks, 21(2-3):193–203. Naumov, M. (2017). Feedforward and recurrent neural networks backward propagation and Hessian in matrix form. CoRR, abs/1709.06080. Schneider, F., Balles, L., and Hennig, P. (2019). DeepOBS: A deep learning optimizer benchmark suite. In International Conference on Learning Representations.