
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

EA-CG: An Approximate Second-Order Method for Training Fully-Connected Neural Networks

Sheng-Wei Chen, Chun-Nan Chou, Edward Y. Chang
HTC Research & Healthcare
sw [email protected], jason.cn [email protected], edward [email protected]

Abstract

For training fully-connected neural networks (FCNNs), we propose a practical approximate second-order method including: 1) an approximation of the Hessian matrix and 2) a conjugate gradient (CG) based method. Our proposed approximate Hessian matrix is memory-efficient and can be applied to any FCNN whose activation and criterion functions are twice differentiable. We devise a CG-based method incorporating a rank-one approximation to derive Newton directions for training FCNNs, which significantly reduces both space and time complexity. This CG-based method can be employed to solve any linear equation whose coefficient matrix is Kronecker-factored, symmetric and positive definite. Empirical studies show the efficacy and efficiency of our proposed method.

Introduction

Neural networks have been applied to solving problems in several application domains such as computer vision (He et al. 2016), natural language processing (Hochreiter and Schmidhuber 1997), and disease diagnosis (Chang et al. 2017). Training a neural network requires tuning its model parameters using Backpropagation. Stochastic gradient descent (SGD), Broyden-Fletcher-Goldfarb-Shanno and one-step secant are representative algorithms that have been employed for training in Backpropagation.

To date, SGD is widely used due to its low computational demand. SGD minimizes a function using the function's first derivative, and has been proven to be effective for training large models. However, stochasticity in the gradient slows down convergence for any gradient method, such that none of them can be asymptotically faster than simple SGD with Polyak averaging (Polyak and Juditsky 1992). Unlike first-order methods, second-order methods utilize the curvature information of a loss function within the neighborhood of a given point to guide the update direction. Since each update becomes more precise, such methods can converge faster than first-order methods in terms of update iterations.

For solving a convex optimization problem, a second-order method can always converge to the global minimum in much fewer steps than SGD. However, the problem of neural-network training can be non-convex, thereby suffering from the issue of negative curvature. To avoid this issue, the common practice is to use the Gauss-Newton matrix with a convex criterion (Schraudolph 2002) or the Fisher matrix to measure the curvature, since both are guaranteed to be positive semi-definite (PSD).

Although these two kinds of matrices can alleviate the issue of negative curvature, computing either the exact Gauss-Newton matrix or the exact Fisher matrix even for a modestly-sized fully-connected neural network (FCNN) is intractable. Intuitively, the analytic expression for the second derivative requires O(N²) computations if O(N) complexity is required to compute the first derivative. Thus, several pioneering works (LeCun et al. 1998; Amari, Park, and Fukumizu 2000; Schraudolph 2002) have used different methods to approximate either matrix. However, none of these methods has been shown to be computationally feasible and fundamentally more effective than first-order methods, as reported in (Martens 2010). Thus, there has been a growing trend towards conceiving more computationally feasible second-order methods for training FCNNs.

We outline several notable works in chronological order herein. Martens proposed a truncated-Newton method for training deep auto-encoders. In this work, Martens used an R-operator (Pearlmutter 1994) to compute full Gauss-Newton matrix-vector products and made good progress within each update. Martens and Grosse developed a block-diagonal approximation to the Fisher matrix for FCNNs, called Kronecker-Factored Approximate Curvature (KFAC). They derived the update directions by exploiting the inverse property of Kronecker products. A key feature of KFAC is that the cost of storing and inverting the devised approximation does not depend on the amount of data used to estimate the Fisher matrix. The idea was further extended to convolutional nets (Grosse and Martens 2016). Recently, Botev, Ritter, and Barber presented a block-diagonal approximation to the Gauss-Newton matrix for FCNNs, referred to as Kronecker-Factored Recursive Approximation (KFRA). They also utilized the inverse property of Kronecker products to derive the update directions. Similarly, Zhang et al. introduced a block-diagonal approximation of the Gauss-Newton matrix and used the conjugate gradient (CG) method to derive the update directions.

However, these prior works either impose constraints on their applicability or require considerable computation in terms of memory or running time.

On the one hand, several of these notable methods relying on the Gauss-Newton matrix face the essential limit that they cannot handle non-convex criterion functions, which play an important part in some problems. Take robust estimation in computer vision (Stewart 1999) as an example: some non-convex criterion functions such as Tukey's biweight function are robust to outliers and perform better than convex criterion functions (Belagiannis et al. 2015). On the other hand, in order to derive the update directions, some of the methods utilizing the conventional CG method may take too much time, and the others exploiting the inverse property of Kronecker products may require excessive memory space.

To remedy the aforementioned issues, we propose a block-diagonal approximation of the positive-curvature Hessian (PCH) matrix, which is memory-efficient. Our proposed PCH matrix can be applied to any FCNN whose activation and criterion functions are twice differentiable. Particularly, our proposed PCH matrix can handle non-convex criterion functions, which the Gauss-Newton methods cannot. Besides, we incorporate expectation approximation into the CG-based method, dubbed EA-CG, to derive update directions for training FCNNs in the mini-batch setting. EA-CG significantly reduces the space and time complexity of the conventional CG method. Our experimental results show the efficacy and efficiency of our proposed method.

In this work, we focus on deriving a second-order method for training FCNNs, since the shared weights of convolutional layers make it difficult to factorize their Hessians. We defer tackling convolutional layers to our future work. We also focus on the classification problem and hence do not consider auto-encoders. Our strategy is that once a simpler isolated problem can be handled effectively, we can then extend our method to address more challenging issues. In summary, the contributions of this paper are as follows:

1. For curvature information, we propose the PCH matrix, which improves on the Gauss-Newton matrix for training FCNNs with convex criterion functions and also covers the non-convex scenario.

2. To derive the update directions, we devise the EA-CG method, which converges faster in terms of wall clock time and enjoys better testing accuracy than competing methods. Specifically, the performance of EA-CG is competitive with SGD.

Truncated-Newton Method on Non-Convex Problems

Newton's method is one of the second-order minimization methods and is generally composed of two steps: 1) computing the Hessian matrix and 2) solving the system of linear equations for update directions. The truncated-Newton method applies the CG method with restricted iterations to the second step of Newton's method. In this section, we first introduce the truncated-Newton method in the context of convex problems. Afterwards, we discuss the non-convex scenario of the truncated-Newton method and provide an important property that lays the foundation of our proposed PCH matrix.

Suppose we have a minimization problem

min_θ f(θ),   (1)

where f is a convex and twice-differentiable function. Since the global minimum is at the point where the first derivative is zero, the solution θ* can be derived from the equation

∇f(θ*) = 0.   (2)

We can utilize a quadratic polynomial to approximate Problem 1 by conducting a Taylor expansion at a given point θ^j. Then, the problem becomes

min_d f(θ^j + d) ≈ f(θ^j) + ∇f(θ^j)^T d + (1/2) d^T ∇²f(θ^j) d,

where ∇²f(θ^j) is the Hessian matrix of f at θ^j. After applying the aforementioned approximation, we rewrite Eq. (2) as the linear equation

∇f(θ^j) + ∇²f(θ^j) d = 0.   (3)

Therefore, the Newton direction is obtained via

d^j = −∇²f(θ^j)^{−1} ∇f(θ^j),

and we can acquire θ* by iteratively applying the update rule

θ^{j+1} = θ^j + η d^j,

where η is the step size.

For non-convex problems, the solution to Eq. (2) reflects one of three possibilities: a local minimum θ_min, a local maximum θ_max or a saddle point θ_saddle. Some previous works such as (Dauphin et al. 2014; Goldfarb et al. 2017; Mizutani and Dreyfus 2008) utilize negative curvature information to converge to a local minimum. Before illustrating how these previous works tackle the issue of negative curvature, we introduce a crucial concept: the curvature information of f at a given point θ can be understood by analyzing the Hessian matrix ∇²f(θ). On the one hand, the Hessian matrix of f at any θ_min is positive semi-definite. On the other hand, the Hessian matrices of f at any θ_max and θ_saddle are negative semi-definite and indefinite, respectively. With this concept established, we can use the following property to understand how to utilize negative curvature information to resolve the issue of negative curvature.

Property 1. Let f be a non-convex and twice-differentiable function. With a given point θ^j, we suppose that there exist some negative eigenvalues {λ_1, ..., λ_s} of ∇²f(θ^j). Moreover, we take V = span({v_1, ..., v_s}), which is the eigenspace corresponding to {λ_1, ..., λ_s}. If we take

g(k) = f(θ^j) + ∇f(θ^j)^T v + (1/2) v^T ∇²f(θ^j) v,

where k ∈ R^s and v = k_1 v_1 + ... + k_s v_s, then g(k) is a concave function.

According to Property 1, Eq. (3) may lead us to a local maximum or a saddle point if ∇²f(θ^j) has some negative eigenvalues. In order to converge to a local minimum, we substitute Pos-Eig(∇²f(θ^j)) for ∇²f(θ^j), where
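To make the two steps above concrete, the following is a minimal PyTorch sketch of one truncated-Newton step that obtains Hessian-vector products by double backpropagation and runs a restricted number of CG iterations. This is our own illustration under simplifying assumptions (a small damping term is added for numerical stability, and the function names `hessian_vector_product` and `truncated_newton_direction` are ours), not the paper's implementation.

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Compute (d^2 loss / d theta^2) @ vec via double backprop; vec is a flat tensor."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat_grad @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv]).detach()

def truncated_newton_direction(loss, params, cg_iters=10, damping=1e-2):
    """Approximately solve (H + damping*I) d = -g with a restricted number of CG iterations."""
    g = torch.cat([gr.reshape(-1) for gr in
                   torch.autograd.grad(loss, params, retain_graph=True)]).detach()
    d = torch.zeros_like(g)
    r = -g.clone()                 # residual of (H d + g) = 0 at d = 0
    p = r.clone()
    rs_old = r @ r
    for _ in range(cg_iters):
        Hp = hessian_vector_product(loss, params, p) + damping * p
        alpha = rs_old / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new.sqrt() < 1e-6:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d                       # Newton direction d^j; apply theta^{j+1} = theta^j + eta * d^j
```

Here `params` is a list of the model parameters from which `loss` is built; restricting `cg_iters` is exactly the truncation discussed above, and in the convex case the returned vector approximates the Newton direction of Eq. (3).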

Pos-Eig(A) is conceptually defined as replacing the negative eigenvalues of A with non-negative ones. That is,

Pos-Eig(A) = Q^T diag(γλ_1, ..., γλ_s, λ_{s+1}, ..., λ_n) Q,

where γ is a given scalar that is less than or equal to zero, and {λ_1, ..., λ_s} and {λ_{s+1}, ..., λ_n} are the negative and non-negative eigenvalues of A, respectively. This refinement implies that the point θ^{j+1} escapes from either local maxima or saddle points if γ < 0. In the case of γ = 0, this refinement means that the eigenspace of the negative eigenvalues is ignored. As a result, we do not converge to any saddle point or local maximum. In addition, every real symmetric matrix can be diagonalized according to the spectral theorem. Under our assumptions, ∇²f(θ^j) is a real symmetric matrix. Thus, ∇²f(θ^j) can be decomposed, and the function "Pos-Eig" can be realized easily.

When the number of variables in f is large, the Hessian matrix becomes intractable with respect to space complexity. Alternatively, we can utilize the CG method to solve Eq. (3). This alternative only requires calculating Hessian-vector products rather than storing the whole Hessian matrix. Moreover, to save computation costs, it is desirable to restrict the iteration number of the CG method.

Computing the Hessian Matrix

For second-order methods we must compute the curvature information, and we utilize the Hessian matrix to capture the curvature information in our work. However, the Hessian matrix for training FCNNs is intrinsically complicated and intractable. Recently, Botev, Ritter, and Barber presented the idea of block Hessian recursion, in which the diagonal blocks of the Hessian matrix can be computed in a layer-wise manner. As the basis of our proposed PCH matrix, we first establish some notation for training FCNNs and reformulate the block Hessian recursion with our notation. Then, we present the steps to integrate the approximation concept proposed by (Martens and Grosse 2015) into our reformulation.

Fully-Connected Neural Networks

An FCNN with k layers takes an input vector h_i^0 = x_i, where x_i is the i-th instance in the training set. For the i-th instance, the activation values in the other layers can be recursively derived from h_i^t = σ(W^t h_i^{t−1} + b^t), t = 1, ..., k−1, where σ is the activation function and can be any twice differentiable function, and W^t and b^t are the weights and biases in the t-th layer, respectively. We further denote n_t as the number of neurons in the t-th layer, where t = 0, ..., k, and collect all the model parameters, including all the weights and biases in each layer, as θ = (Vec(W^1), b^1, ..., Vec(W^k), b^k), where Vec(A) = [[A_{·1}]^T [A_{·2}]^T ··· [A_{·n}]^T]^T. By following the notation mentioned above, we denote the output of an FCNN with k layers as h_i^k = F(θ|x_i) = W^k h_i^{k−1} + b^k.

To train this FCNN, we must decide a loss function ξ that can be any twice differentiable function. Training this FCNN can therefore be interpreted as solving the following minimization problem:

min_θ Σ_{i=1}^{l} ξ(h_i^k | y_i) ≡ min_θ Σ_{i=1}^{l} C(ŷ_i | y_i),

where l is the number of instances in the training set, y_i is the label of the i-th instance, ŷ_i is softmax(h_i^k), and C is the criterion function.

Layer-wise Equations for the Hessian Matrix

For a lucid exposition of the block Hessian recursion, we start by reformulating the equations of Backpropagation according to the notation defined in the previous subsection. Please note that we separate the bias terms (b^t) from the weight terms (W^t) and treat each of them individually during backward propagation of gradients. The gradients of ξ with respect to the bias and weight terms can be derived from our reformulated equations in a layer-wise manner, similar to the original Backpropagation method. For the i-th instance, our reformulated equations are as follows:

∇_{b^k} ξ_i = ∇_{h_i^k} ξ_i
∇_{b^{t−1}} ξ_i = diag(h_i^{(t−1)′}) W^{tT} ∇_{b^t} ξ_i
∇_{W^t} ξ_i = ∇_{b^t} ξ_i ⊗ h_i^{(t−1)T},

where ξ_i = ξ(h_i^k | y_i), ⊗ is the Kronecker product, and h_i^{(t−1)′} = ∇_z σ(z)|_{z = W^{t−1} h_i^{t−2} + b^{t−1}}. Likewise, we strive to propagate the Hessian matrix of ξ with respect to the bias and weight terms backward in a layer-wise manner. This can be achieved by utilizing the Kronecker product and following a similar fashion as above. The resultant equations for the i-th instance are as follows:

∇²_{b^k} ξ_i = ∇²_{h_i^k} ξ_i   (4a)
∇²_{b^{t−1}} ξ_i = diag(h_i^{(t−1)′}) W^{tT} ∇²_{b^t} ξ_i W^t diag(h_i^{(t−1)′}) + diag(h_i^{(t−1)″} ⊙ (W^{tT} ∇_{b^t} ξ_i))   (4b)
∇²_{W^t} ξ_i = (h_i^{t−1} ⊗ h_i^{(t−1)T}) ⊗ ∇²_{b^t} ξ_i,   (4c)

where ⊙ is the element-wise product, h_i^{(t−1)″} = [∇²_z σ(z)]_{ss} |_{z = W^{t−1} h_i^{t−2} + b^{t−1}}, and the derivative order of ∇²_{W^t} ξ_i follows the column-wise traversal of W^t. Moreover, it is worth noting that the original block Hessian recursion unified the bias and weight terms, which is distinct from our separate treatment of these terms.

Expectation Approximation

Martens and Grosse propose an approximation concept that is referred to as expectation approximation in (Botev, Ritter, and Barber 2017).
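As a concrete illustration of how "Pos-Eig" can be realized through an eigendecomposition, the following is a minimal PyTorch sketch (our own illustrative code, not the authors' implementation); `gamma` corresponds to the scalar γ defined above.

```python
import torch

def pos_eig(A: torch.Tensor, gamma: float = 0.0) -> torch.Tensor:
    """Replace each negative eigenvalue lambda of a symmetric matrix A by gamma * lambda.

    gamma <= 0 as in the text: gamma < 0 flips negative eigenvalues (e.g. gamma = -1
    takes their absolute values), while gamma = 0 discards that eigenspace entirely.
    """
    # eigh: A = Q diag(w) Q^T with orthonormal eigenvectors and ascending eigenvalues.
    w, Q = torch.linalg.eigh(A)
    w_fixed = torch.where(w < 0, gamma * w, w)
    return Q @ torch.diag(w_fixed) @ Q.T
```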

The idea behind expectation approximation is that the covariance between [h_i^{t−1} ⊗ h_i^{(t−1)T}]_{uv} and [∇_{b^t} ξ_i ⊗ ∇_{b^t} ξ_i^T]_{µν}, for given indices (u, v) and (µ, ν), is shown to be tiny and is thus ignored for computational efficiency, i.e.,

E_i[[h_i^{t−1} ⊗ h_i^{(t−1)T}]_{uv} · [∇_{b^t} ξ_i ⊗ ∇_{b^t} ξ_i^T]_{µν}]
≈ E_i[[h_i^{t−1} ⊗ h_i^{(t−1)T}]_{uv}] · E_i[[∇_{b^t} ξ_i ⊗ ∇_{b^t} ξ_i^T]_{µν}].

To explain this concept in our formulation, we define cov-t as Ele-Cov((h_i^{t−1} ⊗ h_i^{(t−1)T}) ⊗ 1_{n_t,n_t}, 1_{n_{t−1},n_{t−1}} ⊗ ∇²_{b^t} ξ_i), where "Ele-Cov" denotes the element-wise covariance and 1_{u,v} is the matrix in R^{u×v} whose elements are all 1, t = 1, ..., k. With the definition of cov-t and our devised equations in the previous subsection, the approximation can be interpreted as follows:

E_i[∇²_{W^t} ξ_i] = E_{hhᵀ} ⊗ E_i[∇²_{b^t} ξ_i] + cov-t ≈ E_{hhᵀ} ⊗ E_i[∇²_{b^t} ξ_i],   (5)

where E_{hhᵀ} = E_i[h_i^{t−1} ⊗ h_i^{(t−1)T}].

Botev, Ritter, and Barber also adopted expectation approximation in their proposed method. Similarly, we integrate this approximation into our proposed layer-wise Hessian matrix equations, thereby resulting in the following approximation equation:

E_i[∇²_{b^{t−1}} ξ_i]
≈ E_i[diag(h_i^{(t−1)′}) W^{tT} E_i[∇²_{b^t} ξ_i] W^t diag(h_i^{(t−1)′}) + diag(h_i^{(t−1)″} ⊙ (W^{tT} ∇_{b^t} ξ_i))]
= (W^{tT} E_i[∇²_{b^t} ξ_i] W^t) ⊙ E_{h′h′ᵀ} + E_i[diag(h_i^{(t−1)″} ⊙ (W^{tT} ∇_{b^t} ξ_i))],   (6)

where E_{h′h′ᵀ} = E_i[h_i^{(t−1)′} ⊗ h_i^{(t−1)′T}]. The difference between the original and the approximate Hessian matrices in Eq. (6) is bounded by

‖Ele-Cov(W^{tT} ∇²_{b^t} ξ_i W^t, h_i^{(t−1)′} ⊗ h_i^{(t−1)′T})‖_F ≤ L⁴ Σ_{µ,ν} Var([W^{tT} ∇²_{b^t} ξ_i W^t]_{µν}),   (7)

where L is the Lipschitz constant of the activation function. For example, L_ReLU and L_sigmoid are 1 and 0.25, respectively.

Deriving the Newton Direction

Now we present a computationally feasible method to train FCNNs with Newton directions. First, we explain how to construct a PCH matrix. Based on PCH matrices, we propose an efficient CG-based method incorporating expectation approximation to derive Newton directions for multiple training instances, which we call EA-CG. Finally, we provide an analysis of the space and time complexity of EA-CG.

PCH Matrix

Based on our layer-wise equations in the previous section and the integration of expectation approximation, we can construct block matrices that vary in size and are located on the diagonal of the Hessian matrix. We denote this block-diagonal matrix E_i[∇²_θ ξ_i] as diag(E_i[∇²_{W^1} ξ_i], E_i[∇²_{b^1} ξ_i], ..., E_i[∇²_{W^k} ξ_i], E_i[∇²_{b^k} ξ_i]). Please note that E_i[∇²_θ ξ_i] is a block-diagonal Hessian matrix and not the complete Hessian matrix. According to the explanation of the three possibilities of update directions in the aforementioned section, E_i[∇²_θ ξ_i] is required to be modified. Thus, we replace E_i[∇²_θ ξ_i] with diag(E_i[∇̂²_{W^1} ξ_i], E_i[∇̂²_{b^1} ξ_i], ..., E_i[∇̂²_{W^k} ξ_i], E_i[∇̂²_{b^k} ξ_i]) and denote the modified result as E_i[∇̂²_θ ξ_i], where

E_i[∇̂²_{b^k} ξ_i] = Pos-Eig(E_i[∇²_{h_i^k} ξ_i])   (8a)
E_i[∇̂²_{b^{t−1}} ξ_i] = (W^{tT} E_i[∇̂²_{b^t} ξ_i] W^t) ⊙ E_{h′h′ᵀ} + Pos-Eig(diag(E_i[h_i^{(t−1)″} ⊙ (W^{tT} ∇_{b^t} ξ_i)]))   (8b)
E_i[∇̂²_{W^t} ξ_i] = E_{hhᵀ} ⊗ E_i[∇̂²_{b^t} ξ_i].   (8c)

We call E_i[∇̂²_θ ξ_i] the block-diagonal approximation of the positive-curvature Hessian (PCH) matrix. Any PCH matrix is guaranteed to be PSD, which we explain in the following. In order to show that E_i[∇̂²_θ ξ_i] is PSD, we have to show that for any t, both E_i[∇̂²_{b^t} ξ_i] and E_i[∇̂²_{W^t} ξ_i] are PSD. First, we consider the block matrix E_i[∇²_{h_i^k} ξ_i], which is an n_k by n_k square matrix in Eq. (8a). If the criterion function C(ŷ_i | y_i) is convex, E_i[∇²_{h_i^k} ξ_i] is a PSD matrix. Otherwise, we decompose the matrix and replace the negative eigenvalues. Fortunately, n_k is usually not very large, so E_i[∇²_{h_i^k} ξ_i] can be decomposed quickly¹ and modified to a PSD matrix E_i[∇̂²_{h_i^k} ξ_i]. Second, suppose that E_i[∇̂²_{b^t} ξ_i] is a PSD matrix; then (W^{tT} E_i[∇̂²_{b^t} ξ_i] W^t) ⊙ E_{h′h′ᵀ} is PSD. Consequently, the negative eigenvalues of E_i[∇̂²_{b^{t−1}} ξ_i] stem from the diagonal part diag(E_i[h_i^{(t−1)″} ⊙ (W^{tT} ∇_{b^t} ξ_i)]), so we apply the Pos-Eig function to this diagonal part in Eq. (8b). Third, because the Kronecker product of two PSD matrices is PSD, E_i[∇̂²_{W^t} ξ_i] is PSD.

Solving Linear Equation via EA-CG

After obtaining a PCH matrix E_i[∇̂²_θ ξ_i], we derive the update direction by solving the linear equation

((1 − α) E_i[∇̂²_θ ξ_i] + αI) d_θ = −E_i[∇_θ ξ_i],   (9)

where 0 < α < 1 and d_θ = [d_{W^1}^T d_{b^1}^T ··· d_{W^k}^T d_{b^k}^T]^T. Here, we use the weighted average of E_i[∇̂²_θ ξ_i] and an identity matrix I because this average turns the coefficient matrix of Eq. (9) from PSD into positive definite and thus makes the solutions more stable.

¹ In our experience, the decomposition of a 1000×1000 matrix can be done within a few seconds in PyTorch.
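To make the recursion of Eq. (8) concrete, the sketch below shows how the bias blocks E_i[∇̂²_{b^t} ξ_i] could be propagated backward once the batch-averaged quantities have been collected. The list layout, argument names, and the assumption that the averaged outer products and diagonal correction terms are supplied explicitly are ours; this is a simplified illustration, not the authors' released code.

```python
import torch

def pch_bias_blocks(Ws, Ehh_prime, diag_terms, B_k, gamma=-1.0):
    """Backward recursion of Eq. (8): returns the blocks for b^1, ..., b^k.

    Ws         : [W^1, ..., W^k]            (0-based list, so Ws[j] is W^{j+1})
    Ehh_prime  : [E[h^(1)' h^(1)'^T], ..., E[h^(k-1)' h^(k-1)'^T]]
    diag_terms : [E[h^(1)'' * (W^2T grad_{b^2} xi)], ..., E[h^(k-1)'' * (W^kT grad_{b^k} xi)]]
    B_k        : Pos-Eig(E[grad^2_{h^k} xi]), the last-layer block of Eq. (8a)
    gamma      : the scalar of Pos-Eig (gamma <= 0)
    """
    k = len(Ws)
    blocks = [None] * k
    blocks[-1] = B_k
    for t in range(k - 1, 0, -1):
        W = Ws[t]                                              # W^{t+1}, shape (n_{t+1}, n_t)
        first = (W.T @ blocks[t] @ W) * Ehh_prime[t - 1]       # (W^T B W) elementwise E[h'h'^T]
        v = diag_terms[t - 1]
        second = torch.diag(torch.where(v < 0, gamma * v, v))  # Pos-Eig of the diagonal part
        blocks[t - 1] = first + second                         # Eq. (8b) for layer t
    return blocks
```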

Table 1: The comparisons of second-order methods.

                  (Martens and Grosse 2015)                 (Botev, Ritter, and Barber 2017)          Ours
Criterion Func.   Non-Convex                                Convex                                    Non-Convex
Curvature Info.   Fisher                                    Gauss-Newton                              PCH
Time Required     O(|Batch| × Σ_{t=1}^k n_t²)               O(Σ_{t=2}^k n_{t−1} n_t²)                 O(Σ_{t=2}^k n_{t−1} n_t²)
Solving Scheme    KFI                                       KFI                                       EA-CG
Space Required    O(Σ_{t=0}^{k−1} n_t² + Σ_{t=1}^k n_t²)    O(Σ_{t=0}^{k−1} n_t² + Σ_{t=1}^k n_t²)    O(Σ_{t=0}^{k−1} n_t + Σ_{t=1}^k n_t²)
Time Required     O(Σ_{t=1}^k [n_t³ + n_{t−1}³              O(Σ_{t=1}^k [n_t³ + n_{t−1}³              O(|CG| × Σ_{t=1}^k [n_t² + 2 n_t n_{t−1}])
                    + n_{t−1}² n_t + n_{t−1} n_t²])           + n_{t−1}² n_t + n_{t−1} n_t²])

Due to the essence of the diagonal blocks, Eq. (9) can be decomposed as

((1 − α) E_i[∇̂²_{b^t} ξ_i] + αI) d_{b^t} = −E_i[∇_{b^t} ξ_i]   (10a)
((1 − α) E_i[∇̂²_{W^t} ξ_i] + αI) d_{W^t} = −Vec(E_i[∇_{W^t} ξ_i]),   (10b)

for t = 1, ..., k. To solve Eq. (10a), we attain the solutions by using the CG method directly. For Eq. (10b), storing E_i[∇̂²_{W^t} ξ_i] is not efficient, so we apply the identity (C^T ⊗ A) Vec(B) = Vec(ABC) together with Eq. (5) to obtain the Hessian-vector product with a given vector Vec(P):

E_i[∇̂²_{W^t} ξ_i] Vec(P) = Vec(E_i[∇̂²_{b^t} ξ_i] · P · E_i[h_i^{t−1} ⊗ h_i^{(t−1)T}])   (11a)
≈ Vec(E_i[∇̂²_{b^t} ξ_i] · P · E_i[h_i^{t−1}] ⊗ E_i[h_i^{(t−1)T}]).   (11b)

Based on Eq. (11a), we derive the Hessian-vector products of E_i[∇̂²_{W^t} ξ_i] via E_i[∇̂²_{b^t} ξ_i], thereby reducing the space complexity of storing the curvature information. Furthermore, applying the CG method becomes much more efficient with Eq. (11b), which we call EA-CG. The details are elaborated in the following subsections.

Analysis of Space Complexity

Considering the method devised in the aforementioned subsection, we analyze the space required to store the distinct types of curvature information. The original Newton's method requires space to store E_i[∇²_θ ξ_i], so its space complexity is O([Σ_{t=1}^k n_{t−1}(n_t + 1)]²). If we consider the PCH matrix E_i[∇̂²_θ ξ_i], the space complexity turns out to be O(Σ_{t=1}^k [(n_{t−1} n_t)² + n_t²]). According to Eq. (11a), it is not necessary to store E_i[∇̂²_{W^t} ξ_i] anymore because E_i[∇̂²_{b^t} ξ_i] is sufficient to derive the solution to Eq. (9) with the CG method. Thus, the space complexity is reduced to O(Σ_{t=1}^k n_t²).

Analysis of Additional Time Complexity

In this subsection, we elaborate on the additional time complexity introduced by our proposed method. In contrast to SGD, our method contributes more computation to the training process of FCNNs. The extra computation mainly originates from two portions of our method: the propagation of the curvature information and the computation of the EA-CG method. To propagate the curvature information, we are required to perform matrix-to-matrix products twice, which is more computationally expensive than propagating gradients. The time complexity of propagating the curvature information with Eq. (6) can be estimated as O(Σ_{t=1}^k [n_{t−1} n_t (n_{t−1} + n_t)]). We also have the extra cost involved in applying the EA-CG method, whose complexity mainly stems from the Hessian-vector products of Eq. (11b). Regarding Eq. (11b), we must conduct matrix-vector products twice and a vector-vector outer product once in order to acquire the Hessian-vector product. Thus, the time complexity pertaining to the truncated-Newton method is O(|CG| × Σ_{t=1}^k [n_t(2n_{t−1} + n_t)]), where |CG| is the number of iterations of the EA-CG method.

Related Works

In this section, we elaborate on the differences between our work and two closely related works, using the notation established in the previous sections.

Martens and Grosse developed KFAC by considering FCNNs with convex criterion and non-convex activation functions. KFAC utilizes (E_i[F̃_i] + αI), where F̃_i is the Fisher matrix (for any training instance x_i), to measure the curvature and uses the Khatri-Rao product to rewrite F̃_i, which yields the following equation:

[F̃_i]_{µν} = (h_i^{µ−1} ⊗ h_i^{(µ−1)T}) ⊗ (∇_{b^ν} ξ_i ⊗ ∇_{b^ν} ξ_i^T).

Since it is difficult to find the inverse of (E_i[F̃_i] + αI), KFAC substitutes a block-diagonal matrix F̂_i for F̃_i and thus has the formulation:

E_i[F̂_i^t] = E_i[(h_i^{t−1} ⊗ h_i^{(t−1)T}) ⊗ (∇_{b^t} ξ_i ⊗ ∇_{b^t} ξ_i^T)]
≈ E_i[h_i^{t−1} ⊗ h_i^{(t−1)T}] ⊗ E_i[∇_{b^t} ξ_i ⊗ ∇_{b^t} ξ_i^T],

where t = 1, ..., k. To derive (E_i[F̂_i^t] + αI)^{−1} efficiently, KFAC comes up with the approximation

(E_i[F̂_i^t] + αI)^{−1} ≈ (H^t)^{−1} ⊗ (G^t)^{−1},   (12)

where (H^t)^{−1} = (E_i[h_i^{t−1} ⊗ h_i^{(t−1)T}] + π_t √α I)^{−1} and (G^t)^{−1} = (E_i[∇_{b^t} ξ_i ⊗ ∇_{b^t} ξ_i^T] + (√α/π_t) I)^{−1}.
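As an illustration of our solving scheme, the snippet below sketches how the weight-block system of Eq. (10b) could be solved with CG while only materializing the Kronecker factors, using the approximate Hessian-vector product of Eq. (11b). It is a simplified, single-layer sketch with argument names of our own choosing (B_t, mean_h, grad_W), not the authors' implementation.

```python
import torch

def ea_cg_weight_direction(B_t, mean_h, grad_W, alpha=0.05, cg_iters=10):
    """Sketch of solving Eq. (10b) with CG, using the approximate product of Eq. (11b).

    B_t    : (n_t, n_t)      PCH bias block E[hat-H_{b^t}]
    mean_h : (n_{t-1},)      E[h^{t-1}]
    grad_W : (n_t, n_{t-1})  E[grad_{W^t} xi], laid out like W^t
    Returns D with vec(D) approximating the update direction d_{W^t}.
    """
    def hvp(P):
        # Eq. (11b): two matrix-vector products and one outer product per call, plus damping.
        return (1 - alpha) * torch.outer(B_t @ (P @ mean_h), mean_h) + alpha * P

    D = torch.zeros_like(grad_W)
    R = -grad_W.clone()               # residual at D = 0
    P = R.clone()
    rs_old = (R * R).sum()
    for _ in range(cg_iters):
        HP = hvp(P)
        a = rs_old / (P * HP).sum()
        D = D + a * P
        R = R - a * HP
        rs_new = (R * R).sum()
        if rs_new.sqrt() < 1e-8:
            break
        P = R + (rs_new / rs_old) * P
        rs_old = rs_new
    return D
```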

Figure 1: Comparison of different curvature information and solving methods for the convex criterion function “cross-entropy” on “Cifar-10”.

Figure 2: Comparison of different curvature information and solving methods for the convex criterion function “cross-entropy” on “ImageNet-10”.

Therefore, the update directions of KFAC can be acquired via the following equation:

d̂_{W^t} = −((H^t)^{−1} ⊗ (G^t)^{−1}) · Vec(E_i[∇_{W^t} ξ_i]) = −Vec((G^t)^{−1} E_i[∇_{W^t} ξ_i] (H^t)^{−1}),

and we refer to this type of inverse method as the Kronecker-Factored Inverse (KFI) method in this paper. In contrast, we derive the update directions by applying the EA-CG method. Moreover, KFAC uses the Fisher matrix, while we use the PCH matrix.

Based on the discussion of FCNNs with convex criterion functions and piecewise linear activation functions, Botev, Ritter, and Barber developed KFRA. As a result, the diagonal term disappears in Eq. (4b), and the Gauss-Newton matrix becomes no different from the Hessian matrix. KFRA also employs the KFI method, and hence the update direction of KFRA can be derived from

d̃_{W^t} = −Vec((G̃^t)^{−1} E_i[∇_{W^t} ξ_i] (H^t)^{−1}),

where (G̃^t)^{−1} = (E_i[GN(∇²_{b^t} ξ_i)] + (√α/π_t) I)^{−1}, and GN stands for the Gauss-Newton matrix. The method of deriving the update directions is again different between KFRA and our work. In addition, KFRA still works with non-convex activation functions, but the difference between the Gauss-Newton matrix and the Hessian matrix remains.

Table 1 highlights the three main components of these two related works and our method. As shown in Table 1, KFRA cannot handle non-convex criterion functions due to the curvature information it utilizes: the Gauss-Newton matrix becomes indefinite if the criterion function is non-convex. Note that the difference between the original and the approximated matrices always exists in Eq. (12) if E_i[F̂_i^t] is not diagonal. Moreover, the time complexity of the KFI method is similar to that of the EA-CG method.
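For contrast with EA-CG, the KFI update built on Eq. (12), as used by KFAC and KFRA, can be sketched as follows. This is our illustrative code under the notation above (H_t, G_t and pi_t are passed in explicitly), not the original implementations; it solves against the two small damped Kronecker factors instead of running CG.

```python
import torch

def kfi_weight_direction(H_t, G_t, grad_W, alpha=0.05, pi_t=1.0):
    """KFI update: d_W = -(G^t + (sqrt(a)/pi) I)^{-1} E[grad_W xi] (H^t + pi sqrt(a) I)^{-1}.

    H_t    : (n_{t-1}, n_{t-1})  E[h^{t-1} h^{(t-1)T}]
    G_t    : (n_t, n_t)          curvature factor (Fisher block for KFAC, Gauss-Newton for KFRA)
    grad_W : (n_t, n_{t-1})      E[grad_{W^t} xi]
    """
    sqrt_a = alpha ** 0.5
    H_damped = H_t + pi_t * sqrt_a * torch.eye(H_t.shape[0])
    G_damped = G_t + (sqrt_a / pi_t) * torch.eye(G_t.shape[0])
    # Solve the two small systems instead of forming explicit inverses.
    right = torch.linalg.solve(H_damped, grad_W.T).T   # grad_W @ H_damped^{-1}
    return -torch.linalg.solve(G_damped, right)        # -G_damped^{-1} @ grad_W @ H_damped^{-1}
```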

Figure 3: Comparison of different curvature information and solving methods for the convex criterion function “cross-entropy” on “Cifar-10”.


Figure 4: Comparison of different curvature information and solving methods for the convex criterion function “cross-entropy” on “ImageNet-10”.

Experimental Evaluation

Our empirical studies aim to examine not only the three different types of curvature information, including the Fisher, Gauss-Newton and PCH matrices, but also the two solving methods, i.e., the KFI and EA-CG methods, in terms of training loss and testing accuracy. We encompassed SGD with momentum as the baseline and fixed its momentum to 0.9. For better comparison, we considered FCNNs with either convex or non-convex criterion functions and conducted our experiments² on the image datasets Cifar-10 (Krizhevsky and Hinton 2009) and ImageNet-10³. The network structures are “3072-1024-512-256-128-64-32-16-10” and “150528-1024-512-256-128-64-32-16-10” for Cifar-10 and ImageNet-10, respectively. For the PCH matrices, we explored two possible scenarios of Eq. (8b): 1) taking the absolute values of the diagonal part, i.e., γ = −1, and 2) applying the max(x, 0) function to the diagonal part, i.e., γ = 0. The first and second scenarios of the PCH matrices are separately dubbed PCH-1 and PCH-2. Since PCH-1 and PCH-2 produced similar results in training loss and testing accuracy, we only report PCH-1 in the corresponding figures below for the sake of simplicity. In all of our experiments, we used sigmoid as the non-convex activation function of the FCNNs and trained the networks for 200 epochs, i.e., seeing the entire training set 200 times. Furthermore, we utilized the “Xavier” initialization method (Glorot and Bengio 2010) and performed a grid search over the learning rate = [0.05, 0.1, 0.2], |Batch| = [100, 500, 1000] and α = [0.01, 0.02, 0.05, 0.1]. Regarding the EA-CG method, we have to determine two hyper-parameters that control its stopping conditions: one is the maximal iteration number, max|CG|; the other is the constant of the related error bound, ε_CG. Thus, we also performed a grid search over max|CG| = [5, 10, 20, 50] and ε_CG = [10⁻¹⁰, 10⁻⁵, 10⁻², 10⁻¹]. Finally, to further investigate the different types of curvature information, we calculate the errors caused by approximating the true Hessian in a layer-wise fashion.

² Experiments were implemented by using PyTorch libraries and run on a GTX-1080Ti GPU.
³ We randomly choose ten classes from the ImageNet (Deng et al. 2009) dataset.

Convex Criterion

In the first type of experiments, we examined the performance of different second-order methods. According to the prior section, the differences between these second-order methods originate from two parts. The first part is the curvature information, e.g., the Fisher matrix F̂_i in KFAC, and the other is the solving method for the linear equations. In these experiments, we used cross-entropy as the convex criterion function.

It is noteworthy that the original KFI method runs out of GPU memory for ImageNet-10 because the dimension of the input features (n_0) in ImageNet-10 is too high, which conforms with our analysis in Table 1. Thus, we utilized

H¹ ≈ E_i[h_i^0] ⊗ E_i[h_i^{0T}] + π_1 √α I   (13)

and the Sherman-Morrison formula to derive the update directions. Albeit this approximation worked in our ImageNet-10 experiments, we observed that it is not stable for other datasets and models that we have explored.
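For reference, the FCNN configuration used for Cifar-10 can be written down directly in PyTorch. The snippet below is a minimal sketch of that architecture with sigmoid activations and Xavier initialization (hyper-parameters, data loading and the training loop omitted; the zero bias initialization is our own assumption), not the authors' experiment code.

```python
import torch.nn as nn

def build_fcnn(layer_sizes=(3072, 1024, 512, 256, 128, 64, 32, 16, 10)):
    """Build the "3072-1024-512-256-128-64-32-16-10" FCNN with sigmoid activations."""
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        linear = nn.Linear(n_in, n_out)
        nn.init.xavier_uniform_(linear.weight)   # "Xavier" initialization (Glorot and Bengio 2010)
        nn.init.zeros_(linear.bias)
        layers.append(linear)
        layers.append(nn.Sigmoid())
    layers.pop()                                  # the last layer outputs h^k directly, no activation
    return nn.Sequential(*layers)

model = build_fcnn()
criterion = nn.CrossEntropyLoss()                 # cross-entropy, the convex criterion used here
```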


Figure 5: Comparison of different curvature information and solving methods for the non-convex criterion function “Eq. (14)” on “Cifar-10”. The Gauss-Newton matrix is removed from this figure since it is not PSD in this type of experiments.


Figure 6: Comparison of different curvature information and solving methods for the non-convex criterion function “Eq. (14)” on “ImageNet-10”. The Gauss-Newton matrix is removed from this figure since it is not PSD in this type of experiments.

Wall Clock Time

As shown in Figure 1, the EA-CG method converges faster with respect to wall clock time and has better testing accuracy than the KFI method for Cifar-10. The runtime behavior adheres to Table 1. We also notice that the different types of curvature information with the EA-CG method have similar performance. In contrast, the KFI method using the Fisher matrix converges faster than using the other two types of matrices. For ImageNet-10, Figure 2 exhibits that the EA-CG and KFI methods take almost the same amount of time for 200 epochs. Besides, as shown in Figure 2, the KFI method has lower training loss but may suffer from overfitting. Please note that we applied Eq. (13) to the first layer of the neural network for the KFI method; otherwise, the original KFI method ran out of GPU memory for ImageNet-10. However, applying this approximation reduces the memory usage of the KFI method and expedites it accordingly.

It is worth noting that the second-order methods performed comparably to SGD in Figure 2, but the phenomenon is completely different in Figure 1. This is because the image sizes of Cifar-10 and ImageNet-10 differ. The image size of ImageNet-10 is 224 × 224 × 3, and hence propagating the gradients of the weights backward is much more expensive than in Cifar-10, whose image size is 32 × 32 × 3. Based on this premise, the extra cost of propagating our proposed PCH, where only the Hessian of the bias terms needs to be propagated, is comparatively low. Similarly, the cost of solving the update directions is comparatively low.

Epoch

As shown in Figure 3, the KFI method provides a precise descent direction at each epoch, so the training loss decreases faster in terms of epochs. In contrast, Figure 1 shows that the KFI method converges slower in terms of wall clock time, which varies with the implementation. We argue that we have already made our best effort to implement the KFI method in PyTorch. As shown in either Figure 1 or Figure 3, the EA-CG method has better testing accuracy for Cifar-10. For ImageNet-10, the results of Figure 4 are consistent with the results of Figure 2. Thus, we do not repeat the same narrative here.

Non-Convex Criterion

In order to demonstrate our capability of handling non-convex criterion functions, we designed the second type of experiments. In these experiments, we considered the following criterion function:

C(ŷ_i | y_i) = 1 / (1 + e^{δ(y_i^T ŷ_i − ε)}),   (14)

where we fix δ = 5 and ε = 0.2.
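A minimal PyTorch sketch of this bounded, non-convex criterion of Eq. (14) is given below; `delta` and `eps` correspond to δ and ε, the labels are assumed to be one-hot, and the mini-batch averaging is our own convention. It is an illustration, not the original experiment code.

```python
import torch

def bounded_criterion(logits, one_hot_labels, delta=5.0, eps=0.2):
    """Non-convex criterion of Eq. (14): C(y_hat | y) = 1 / (1 + exp(delta * (y^T y_hat - eps)))."""
    y_hat = torch.softmax(logits, dim=1)               # y_hat_i = softmax(h_i^k)
    true_prob = (one_hot_labels * y_hat).sum(dim=1)    # y_i^T y_hat_i, the true-class probability
    per_instance = 1.0 / (1.0 + torch.exp(delta * (true_prob - eps)))
    return per_instance.mean()                         # average over the mini-batch
```

Because the sigmoid-shaped expression is bounded above, confidently wrong predictions cannot blow up the loss, which is the robustness property the experiment is designed to probe.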


Figure 7: Comparison of different curvature information and solving methods for the non-convex criterion function “Eq. (14)” on “Cifar-10”. The Gauss-Newton matrix is removed from this figure since it is not PSD in this type of experiments.


Figure 8: Comparison of different curvature information and solving methods for the non-convex criterion function “Eq. (14)” on “ImageNet-10”. The Gauss-Newton matrix is removed from this figure since it is not PSD in this type of experiments.

This criterion function implies that an upper bound of the loss function exists. Apart from the criterion function, we followed the same settings as those used in the first type of experiments. Note that the Gauss-Newton matrix is not PSD if the criterion function of training FCNNs is non-convex. Thus, we excluded the Gauss-Newton matrix from this comparison and focused on comparing the other two types of matrices with either the EA-CG or KFI methods.

Wall Clock Time

For Cifar-10, Figure 5 shows that the EA-CG method outperforms the KFI method in terms of both training loss and testing accuracy, and the Fisher matrix with the EA-CG method has the best performance. For ImageNet-10, Figure 6 shows that the EA-CG method surpasses the KFI method regarding training loss, but the performance of the EA-CG and KFI methods is hard to distinguish by testing accuracy.

Epoch

As shown in Figure 7, for Cifar-10, the performance of the EA-CG and KFI methods is hard to distinguish by the convergence speed of training loss in terms of epochs, but the Fisher matrix with the EA-CG method has the best training loss in Figure 7. Regarding testing accuracy, the EA-CG method performs better than the KFI method in this type of experiments. For ImageNet-10, Figure 8 shows that the EA-CG method surpasses the KFI method regarding training loss, which is consistent with Figure 6. However, the performance of the EA-CG and KFI methods is hard to distinguish by testing accuracy, and the PCH matrix using the KFI method has the best testing accuracy.

Comparison with the True Hessian

Table 2: Layer-wise errors between each of the approximate Hessian matrices and the true Hessian for the convex criterion function “cross-entropy” on “Cifar-10”.

          Fisher   GN       PCH-1    PCH-2
Layer-1   0.0071   0.0057   0.0036   0.0043
Layer-2   0.0470   0.0212   0.0237   0.0228
Layer-3   0.1140   0.0220   0.0238   0.0165
Layer-4   0.0726   0.0207   0.0119   0.0085
Layer-5   0.0397   0.0130   0.0086   0.0066
Layer-6   0.0219   0.0107   0.0093   0.0071
Layer-7   0.0185   0.0176   0.0132   0.0106
Layer-8   0.0251   0.0000   0.0000   0.0000
Total     0.1535   0.0446   0.0402   0.0330

In addition to the comparison of training loss and testing accuracy on different criterion functions and datasets, we further examine the difference between the true Hessian and the approximate Hessian matrices such as the Fisher matrix. Here, we measure the difference in terms of the following error. Given the t-th layer and the model parameters θ, the error is defined as

‖E_i[∇̃²_{b^t} ξ_i] − |E_i[∇²_{b^t} ξ_i]|‖_F,

where ∇̃²_{b^t} ξ_i stands for the approximate Hessian matrix, and |E_i[∇²_{b^t} ξ_i]| takes the absolute values of the eigenvalues of E_i[∇²_{b^t} ξ_i], following (Dauphin et al. 2014).

The results are shown in Table 2, where each value is derived from averaging the errors of the initial parameters θ⁰ to the parameters θ^{s−1} that are updated s times. Table 2 reflects that the errors do not accumulate with layers for any approximate Hessian matrix. This observation is important for KFRA and our method, since both derive the curvature information by approximating the Hessian layer-by-layer recursively. The “Total” row of Table 2 also indicates that our proposed PCH-1 and PCH-2 are closer to the true Hessian than the Fisher and Gauss-Newton matrices.

Concluding Remarks

To achieve more computationally feasible second-order methods for training FCNNs, we developed a practical approach, including our proposed PCH matrix and the devised EA-CG method. Our proposed PCH matrix overcomes the problem of training FCNNs with non-convex criterion functions. Besides, EA-CG provides another alternative to efficiently derive update directions in this context. Our empirical studies show that our proposed PCH matrix can compete with the state-of-the-art curvature approximations, and EA-CG converges faster and enjoys better testing accuracy than the KFI method. Specifically, the performance of our proposed approach is competitive with SGD. As future work, we will extend the idea to work with convolutional nets.

References

Amari, S.-I.; Park, H.; and Fukumizu, K. 2000. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation 12(6):1399–1409.

Belagiannis, V.; Rupprecht, C.; Carneiro, G.; and Navab, N. 2015. Robust optimization for deep regression. In Proceedings of the IEEE International Conference on Computer Vision, 2830–2838.

Botev, A.; Ritter, H.; and Barber, D. 2017. Practical Gauss-Newton optimisation for deep learning. In International Conference on Machine Learning, 557–565.

Chang, E. Y.; Wu, M.-H.; Tang, K.-F. T.; Kao, H.-C.; and Chou, C.-N. 2017. Artificial intelligence in XPRIZE DeepQ Tricorder. In Proceedings of the 2nd International Workshop on Multimedia for Personal Health and Health Care, MMHealth '17, 11–18. New York, NY, USA: ACM.

Dauphin, Y. N.; Pascanu, R.; Gulcehre, C.; Cho, K.; Ganguli, S.; and Bengio, Y. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, 2933–2941.

Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255.

Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Teh, Y. W., and Titterington, M., eds., Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, 249–256. Chia Laguna Resort, Sardinia, Italy: PMLR.

Goldfarb, D.; Mu, C.; Wright, J.; and Zhou, C. 2017. Using negative curvature in solving nonlinear programs. Computational Optimization and Applications 68(3):479–502.

Grosse, R., and Martens, J. 2016. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, 573–582.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

LeCun, Y.; Bottou, L.; Orr, G. B.; and Müller, K.-R. 1998. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer. 9–50.

Martens, J., and Grosse, R. 2015. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2408–2417.

Martens, J. 2010. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 735–742.

Mizutani, E., and Dreyfus, S. E. 2008. Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature. Neural Networks 21(2):193–203.

Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Computation 6(1):147–160.

Polyak, B. T., and Juditsky, A. B. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30(4):838–855.

Schraudolph, N. N. 2002. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation 14(7):1723–1738.

Stewart, C. V. 1999. Robust parameter estimation in computer vision. SIAM Review 41(3):513–537.

Zhang, H.; Xiong, C.; Bradbury, J.; and Socher, R. 2017. Block-diagonal Hessian-free optimization for training neural networks. arXiv preprint arXiv:1712.07296.
