arXiv:1802.06502v3 [cs.LG] 6 Dec 2018

EA-CG: An Approximate Second-Order Method for Training Fully-Connected Neural Networks

Sheng-Wei Chen, Chun-Nan Chou, Edward Y. Chang
HTC Research & Healthcare
sw [email protected], jason.cn [email protected], edward [email protected]

Abstract

For training fully-connected neural networks (FCNNs), we propose a practical approximate second-order method including: 1) an approximation of the Hessian matrix and 2) a conjugate gradient (CG) based method. Our proposed approximate Hessian matrix is memory-efficient and can be applied to any FCNNs where the activation and criterion functions are twice differentiable. We devise a CG-based method incorporating one-rank approximation to derive Newton directions for training FCNNs, which significantly reduces both space and time complexity. This CG-based method can be employed to solve any linear equation where the coefficient matrix is Kronecker-factored, symmetric and positive definite. Empirical studies show the efficacy and efficiency of our proposed method.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Neural networks have been applied to solving problems in several application domains such as computer vision (He et al. 2016), natural language processing (Hochreiter and Schmidhuber 1997), and disease diagnosis (Chang et al. 2017). Training a neural network requires tuning its model parameters using Backpropagation. Stochastic gradient descent (SGD), Broyden-Fletcher-Goldfarb-Shanno and one-step secant are representative algorithms that have been employed for training in Backpropagation.

To date, SGD is widely used due to its low computational demand. SGD minimizes a function using the function's first derivative, and has been proven to be effective for training large models. However, stochasticity in the gradient will slow down convergence for any gradient method such that none of them can be asymptotically faster than simple SGD with Polyak averaging (Polyak and Juditsky 1992). Besides gradients, second-order methods utilize the curvature information of a loss function within the neighborhood of a given point to guide the update direction. Since each update becomes more precise, such methods can converge faster than first-order methods in terms of update iterations.

For solving a convex optimization problem, a second-order method can always converge to the global minimum in much fewer steps than SGD. However, the problem of neural-network training can be non-convex, thereby suffering from the issue of negative curvature. To avoid this issue, the common practice is to use the Gauss-Newton matrix with a convex criterion function (Schraudolph 2002) or the Fisher matrix to measure curvature since both are guaranteed to be positive semi-definite (PSD).

Although these two kinds of matrices can alleviate the issue of negative curvature, computing either the exact Gauss-Newton matrix or Fisher matrix even for a modestly-sized fully-connected neural network (FCNN) is intractable. Intuitively, the analytic expression for the second derivative requires $O(N^2)$ computations if $O(N)$ complexity is required to compute the first derivative. Thus, several pioneer works (LeCun et al. 1998; Amari, Park, and Fukumizu 2000; Schraudolph 2002) have used different methods to approximate either matrix. However, none of these methods have been shown to be computationally feasible and fundamentally more effective than first-order methods as reported in (Martens 2010). Thus, there has been a growing trend towards conceiving more computationally feasible second-order methods for training FCNNs.

We outline several notable works in chronological order herein. Martens proposed a truncated-Newton method for training deep auto-encoders. In this work, Martens used an R-operator (Pearlmutter 1994) to compute full Gauss-Newton matrix-vector products and made good progress within each update. Martens and Grosse developed a block-diagonal approximation to the Fisher matrix for FCNNs, called Kronecker-Factored Approximate Curvature (KFAC). They derived the update directions by exploiting the inverse property of Kronecker products. A notable feature of KFAC is that its cost in storing and inverting the devised approximation does not depend on the amount of data used to estimate the Fisher matrix. The idea is further extended to convolutional nets (Grosse and Martens 2016). Recently, Botev, Ritter, and Barber presented a block-diagonal approximation to the Gauss-Newton matrix for FCNNs, referred to as Kronecker-Factored Recursive Approximation (KFRA). They also utilized the inverse property of Kronecker products to derive the update directions. Similarly, Zhang et al. introduced a block-diagonal approximation of the Gauss-Newton matrix and used the conjugate gradient (CG) method to derive the update directions.

However, these prior works either impose constraints on their applicability or require considerable computation in terms of memory or running time. On the one hand, several of these notable methods relying on the Gauss-Newton matrix face the essential limit that they cannot handle non-convex criterion functions, which play an important part in some problems. Take robust estimation in computer vision (Stewart 1999) as an example. Some non-convex criterion functions such as Tukey's biweight function are robust to outliers and perform better than convex criterion functions (Belagiannis et al. 2015). On the other hand, in order to derive the update directions, some of the methods utilizing the conventional CG method may take too much time, and the others exploiting the inverse property of Kronecker products may require excessive memory space.

To remedy the aforementioned issues, we propose a block-diagonal approximation of the positive-curvature Hessian (PCH) matrix, which is memory-efficient. Our proposed PCH matrix can be applied to any FCNN where the activation and criterion functions are twice differentiable. Particularly, our proposed PCH matrix can handle non-convex criterion functions, which the Gauss-Newton methods cannot. Besides, we incorporate expectation approximation into the CG-based method, which is dubbed EA-CG, to derive update directions for training FCNNs in the mini-batch setting. EA-CG significantly reduces the space and time complexity of the conventional CG method. Our experimental results show the efficacy and efficiency of our proposed method.

In this work, we focus on deriving a second-order method for training FCNNs since the shared weights of convolutional layers lead to difficulties in factorizing their Hessian. We defer tackling convolutional layers to our future work. We also focus on the classification problem and hence do not consider auto-encoders. Our strategy is that once a simpler isolated problem can be effectively handled, we can then extend our method to address more challenging issues.

In summary, the contributions of this paper are as follows:

1. For curvature information, we propose the PCH matrix to improve the Gauss-Newton matrix for training FCNNs with convex criterion functions and overcome the non-convex scenario.
2. To derive the update directions, we devise the effective EA-CG method, which converges faster in terms of wall clock time and enjoys better testing accuracy than competing methods. In particular, the performance of EA-CG is competitive with SGD.

Truncated-Newton Method on Non-Convex Problems

Newton's method is one of the second-order minimization methods, and is generally composed of two steps: 1) computing the Hessian matrix and 2) solving the system of linear equations for update directions. The truncated-Newton method applies the CG method with restricted iterations to the second step of Newton's method. In this section, we first introduce the truncated-Newton method in the context of convex problems. Afterwards, we discuss the non-convex scenario of the truncated-Newton method and provide an important property that lays the foundation of our proposed PCH matrix.

Suppose we have a minimization problem

$$\min_{\theta} f(\theta), \tag{1}$$

where $f$ is a convex and twice-differentiable function. Since the global minimum is at the point where the first derivative is zero, the solution $\theta^*$ can be derived from the equation

$$\nabla f(\theta^*) = 0. \tag{2}$$

We can utilize a quadratic polynomial to approximate Problem 1 by conducting a Taylor expansion with a given point $\theta^j$. Then, the problem turns out to be

$$\min_{d} f(\theta^j + d) \approx f(\theta^j) + \nabla f(\theta^j)^T d + \frac{1}{2} d^T \nabla^2 f(\theta^j) d,$$

where $\nabla^2 f(\theta^j)$ is the Hessian matrix of $f$ at $\theta^j$. After applying the aforementioned approximation, we rewrite Eq. (2) as the linear equation

$$\nabla f(\theta^j) + \nabla^2 f(\theta^j) d^j = 0. \tag{3}$$

Therefore, the Newton direction is obtained via

$$d^j = -\nabla^2 f(\theta^j)^{-1} \nabla f(\theta^j),$$

and we can acquire $\theta^*$ by iteratively applying the update rule

$$\theta^{j+1} = \theta^j + \eta d^j,$$

where $\eta$ is the step size.

For non-convex problems, the solution to Eq. (2) reflects one of three possibilities: a local minimum $\theta_{min}$, a local maximum $\theta_{max}$ or a saddle point $\theta_{saddle}$. Some previous works such as (Dauphin et al. 2014; Goldfarb et al. 2017; Mizutani and Dreyfus 2008) utilize negative curvature information to converge to a local minimum. Before illustrating how these previous works tackle the issue of negative curvature, we have to introduce a crucial concept: we can know the curvature information of $f$ at a given point $\theta$ by analyzing the Hessian matrix $\nabla^2 f(\theta)$. On the one hand, the Hessian matrix of $f$ at any $\theta_{min}$ is positive semi-definite. On the other hand, the Hessian matrices of $f$ at any $\theta_{max}$ and $\theta_{saddle}$ are negative semi-definite and indefinite, respectively. After establishing the concept, we can use the following property to understand how to utilize the negative curvature information to resolve the issue of negative curvature.

Property 1. Let $f$ be a non-convex and twice-differentiable function. With a given point $\theta^j$, we suppose that there exist some negative eigenvalues $\{\lambda_1, \ldots, \lambda_s\}$ for $\nabla^2 f(\theta^j)$. Moreover, we take $V = \mathrm{span}(\{v_1, \ldots, v_s\})$, which is the eigenspace corresponding to $\{\lambda_1, \ldots, \lambda_s\}$. If we take

$$g(k) = f(\theta^j) + \nabla f(\theta^j)^T v + \frac{1}{2} v^T \nabla^2 f(\theta^j) v,$$

where $k \in \mathbb{R}^s$ and $v = k_1 v_1 + \ldots + k_s v_s$, then $g(k)$ is a concave function.

According to Property 1, Eq. (3) may lead us to a local maximum or a saddle point if $\nabla^2 f(\theta^j)$ has some negative eigenvalues. In order to converge to a local minimum, we substitute $\text{Pos-Eig}(\nabla^2 f(\theta^j))$ for $\nabla^2 f(\theta^j)$, where $\text{Pos-Eig}(A)$ is conceptually defined as replacing the negative eigenvalues of $A$ with non-negative ones. That is,

$$\text{Pos-Eig}(A) = Q^T \, \mathrm{diag}(\gamma\lambda_1, \ldots, \gamma\lambda_s, \lambda_{s+1}, \ldots, \lambda_n) \, Q,$$

where $\gamma$ is a given scalar that is less than or equal to zero, and $\{\lambda_1, \ldots, \lambda_s\}$ and $\{\lambda_{s+1}, \ldots, \lambda_n\}$ are the negative and non-negative eigenvalues of $A$, respectively. This refinement implies that the point $\theta^{j+1}$ escapes from either local maxima or saddle points if $\gamma < 0$. In case of $\gamma = 0$, this refinement means that the eigenspace of the negative eigenvalues is ignored. As a result, we do not converge to any saddle point or local maximum. In addition, every real symmetric matrix can be diagonalized according to the spectral theorem. Under our assumptions, $\nabla^2 f(\theta^j)$ is a real symmetric matrix. Thus, $\nabla^2 f(\theta^j)$ can be decomposed, and the function "Pos-Eig" can be realized easily.

When the number of variables in $f$ is large, the Hessian matrix becomes intractable with respect to space complexity. Alternatively, we can utilize the CG method to solve Eq. (3). This alternative only requires calculating the Hessian-vector products rather than storing the whole Hessian matrix. Moreover, to save computation costs, it is desirable to restrict the iteration number of the CG method.

Computing the Hessian Matrix

For second-order methods we must compute the curvature information, and we utilize the Hessian matrix to capture the curvature information in our work. However, the Hessian matrix for training FCNNs is intrinsically complicated and intractable. Recently, Botev, Ritter, and Barber presented the idea of block Hessian recursion, in which the diagonal blocks of the Hessian matrix can be computed in a layer-wise manner. As the basis of our proposed PCH matrix, we first establish some notation for training FCNNs and reformulate the block Hessian recursion with our notation. Then, we present the steps to integrate the approximation concept proposed by (Martens and Grosse 2015) into our reformulation.

Fully-Connected Neural Networks

An FCNN with $k$ layers takes an input vector $h_i^0 = x_i$, where $x_i$ is the $i$-th instance in the training set. For the $i$-th instance, the activation values in the other layers can be recursively derived from $h_i^t = \sigma(W^t h_i^{t-1} + b^t)$, $t = 1, \ldots, k-1$, where $\sigma$ is the activation function and can be any twice differentiable function, and $W^t$ and $b^t$ are the weights and biases in the $t$-th layer, respectively. We further denote $n_t$ as the number of the neurons in the $t$-th layer, where $t = 0, \ldots, k$, and collect all the model parameters including all the weights and biases in each layer as $\theta = (\mathrm{Vec}(W^1), b^1, \ldots, \mathrm{Vec}(W^k), b^k)$, where $\mathrm{Vec}(A) = [[A_{\cdot 1}]^T \; [A_{\cdot 2}]^T \; \cdots \; [A_{\cdot n}]^T]^T$. By following the notation mentioned above, we denote an FCNN output with $k$ layers as $h_i^k = F(\theta \mid x_i) = W^k h_i^{k-1} + b^k$.

To train this FCNN, we must decide a loss function $\xi$ that can be any twice differentiable function. Training this FCNN can therefore be interpreted as solving the following minimization problem:

$$\min_{\theta} \sum_{i=1}^{l} \xi(h_i^k \mid y_i) \equiv \min_{\theta} \sum_{i=1}^{l} C(\hat{y}_i \mid y_i),$$

where $l$ is the number of the instances in the training set, $y_i$ is the label of the $i$-th instance, $\hat{y}_i$ is $\mathrm{softmax}(h_i^k)$, and $C$ is the criterion function.

Layer-wise Equations for the Hessian Matrix

For lucid exposition of the block Hessian recursion, we start by reformulating the equations of Backpropagation according to the notation defined in the previous subsection. Please note that we separate the bias terms ($b^t$) from the weight terms ($W^t$) and treat each of them individually during backward propagation of gradients. The gradients of $\xi$ with respect to the bias and weight terms can be derived from our reformulated equations in a layer-wise manner, similar to the original Backpropagation method. For the $i$-th instance, our reformulated equations are as follows:

$$\nabla_{b^k} \xi_i = \nabla_{h_i^k} \xi_i$$
$$\nabla_{b^{t-1}} \xi_i = \mathrm{diag}(h_i^{(t-1)'}) W^{tT} \nabla_{b^t} \xi_i$$
$$\nabla_{W^t} \xi_i = \nabla_{b^t} \xi_i \otimes h_i^{(t-1)T}$$

where $\xi_i = \xi(h_i^k \mid y_i)$, $\otimes$ is the Kronecker product, and $h_i^{(t-1)'} = \nabla_z \sigma(z)|_{z = W^{t-1} h_i^{t-2} + b^{t-1}}$. Likewise, we strive to propagate the Hessian matrix of $\xi$ with respect to the bias and weight terms backward in a layer-wise manner. This can be achieved by utilizing the Kronecker product and following a similar fashion. The resultant equations for the $i$-th instance are as follows:

$$\nabla^2_{b^k} \xi_i = \nabla^2_{h_i^k} \xi_i \tag{4a}$$
$$\nabla^2_{b^{t-1}} \xi_i = \mathrm{diag}(h_i^{(t-1)'}) W^{tT} \nabla^2_{b^t} \xi_i W^t \mathrm{diag}(h_i^{(t-1)'}) + \mathrm{diag}(h_i^{(t-1)''} \odot (W^{tT} \nabla_{b^t} \xi_i)) \tag{4b}$$
$$\nabla^2_{W^t} \xi_i = (h_i^{t-1} \otimes h_i^{(t-1)T}) \otimes \nabla^2_{b^t} \xi_i, \tag{4c}$$

where $\odot$ is the element-wise product, $[h_i^{(t-1)''}]_s = [\nabla^2_z \sigma(z)|_{z = W^{t-1} h_i^{t-2} + b^{t-1}}]_{ss}$, and the derivative order of $\nabla^2_{W^t} \xi_i$ is column-wise traversal of $W^t$. The derivations are in the Supplementary Materials B and C. Moreover, it is worth noting that the original block Hessian recursion unified the bias and weight terms, which is distinct from our separate treatment of these terms.

Expectation Approximation

Martens and Grosse propose one approximation concept that is referred to as expectation approximation in (Botev, Ritter, and Barber 2017). The idea behind expectation approximation is that the covariance between $[h_i^{t-1} \otimes h_i^{(t-1)T}]_{uv}$ and $[\nabla_{b^t} \xi_i \otimes \nabla_{b^t} \xi_i^T]_{\mu\nu}$ with given indices $(u, v)$ and $(\mu, \nu)$ is shown to be tiny and thus ignored for computational efficiency, i.e.,

$$\mathbb{E}_i[[h_i^{t-1} \otimes h_i^{(t-1)T}]_{uv} \cdot [\nabla_{b^t} \xi_i \otimes \nabla_{b^t} \xi_i^T]_{\mu\nu}] \approx \mathbb{E}_i[[h_i^{t-1} \otimes h_i^{(t-1)T}]_{uv}] \cdot \mathbb{E}_i[[\nabla_{b^t} \xi_i \otimes \nabla_{b^t} \xi_i^T]_{\mu\nu}].$$

To explain this concept on our formulations, we define cov-$t$ as $\text{Ele-Cov}((h_i^{t-1} \otimes h_i^{(t-1)T}) \otimes 1_{n_t, n_t}, \; 1_{n_{t-1}, n_{t-1}} \otimes \nabla^2_{b^t} \xi_i)$, where "Ele-Cov" denotes element-wise covariance, and $1_{u,v}$ is the matrix whose elements are 1 in $\mathbb{R}^{u \times v}$, $t = 1, \ldots, k$. With the definition of cov-$t$ and our devised equations in the previous subsection, the approximation can be interpreted as follows:

$$\mathbb{E}_i[\nabla^2_{W^t} \xi_i] = \mathbb{E}^{t-1}_{hh^T} \otimes \mathbb{E}_i[\nabla^2_{b^t} \xi_i] + \text{cov-}t \approx \mathbb{E}^{t-1}_{hh^T} \otimes \mathbb{E}_i[\nabla^2_{b^t} \xi_i], \tag{5}$$

where $\mathbb{E}^{t-1}_{hh^T} = \mathbb{E}_i[h_i^{t-1} \otimes h_i^{(t-1)T}]$.

Botev, Ritter, and Barber also adopted expectation approximation in their proposed method. Similarly, we integrate this approximation into our proposed layer-wise Hessian matrix equations, thereby resulting in the following approximation equation:

$$\mathbb{E}_i[\nabla^2_{b^{t-1}} \xi_i] \approx \mathbb{E}_i[\mathrm{diag}(h_i^{(t-1)'}) W^{tT} \mathbb{E}_i[\nabla^2_{b^t} \xi_i] W^t \mathrm{diag}(h_i^{(t-1)'}) + \mathrm{diag}(h_i^{(t-1)''} \odot (W^{tT} \nabla_{b^t} \xi_i))]$$
$$= (W^{tT} \mathbb{E}_i[\nabla^2_{b^t} \xi_i] W^t) \odot \mathbb{E}^{(t-1)'}_{hh^T} + \mathbb{E}_i[\mathrm{diag}(h_i^{(t-1)''} \odot (W^{tT} \nabla_{b^t} \xi_i))], \tag{6}$$

where $\mathbb{E}^{(t-1)'}_{hh^T} = \mathbb{E}_i[h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]$. The difference between the original and the approximate Hessian matrices in Eq. (6) is bounded by

$$\| \text{Ele-Cov}(W^{tT} \nabla^2_{b^t} \xi_i W^t, \; h_i^{(t-1)'} \otimes h_i^{(t-1)'T}) \|_F^2 \le L^4 \sum_{\mu,\nu} \mathrm{Var}([W^{tT} \nabla^2_{b^t} \xi_i W^t]_{\mu\nu}), \tag{7}$$

where $L$ is the Lipschitz constant of activation functions. For example, $L_{ReLU}$ and $L_{sigmoid}$ are 1 and 0.25, respectively. The details are in the Supplementary Materials D.

Deriving the Newton Direction

Now we present a computationally feasible method to train FCNNs with Newton directions. First, we explain the ways to construct a PCH matrix. Based on PCH matrices, we propose an efficient CG-based method incorporating the expectation approximation to derive Newton directions for multiple training instances, which we call EA-CG. Finally, we provide an analysis of the space and time complexity for EA-CG.

PCH Matrix

Based on our layer-wise equations in the previous section and the integration of expectation approximation, we can construct block matrices that vary in size and are located at the diagonal of the Hessian matrix. We denote this block-diagonal matrix $\mathrm{diag}(\mathbb{E}_i[\nabla^2_{W^1} \xi_i], \mathbb{E}_i[\nabla^2_{b^1} \xi_i], \ldots, \mathbb{E}_i[\nabla^2_{W^k} \xi_i], \mathbb{E}_i[\nabla^2_{b^k} \xi_i])$ as $\mathbb{E}_i[\nabla^2_{\theta} \xi_i]$. Please note that $\mathbb{E}_i[\nabla^2_{\theta} \xi_i]$ is a block-diagonal Hessian matrix and not the complete Hessian matrix. According to the explanation of the three possibilities of update directions in the aforementioned section, $\mathbb{E}_i[\nabla^2_{\theta} \xi_i]$ is required to be modified. Thus, we replace $\mathbb{E}_i[\nabla^2_{\theta} \xi_i]$ with $\mathrm{diag}(\mathbb{E}_i[\hat{\nabla}^2_{W^1} \xi_i], \mathbb{E}_i[\hat{\nabla}^2_{b^1} \xi_i], \ldots, \mathbb{E}_i[\hat{\nabla}^2_{W^k} \xi_i], \mathbb{E}_i[\hat{\nabla}^2_{b^k} \xi_i])$ and denote the modified result as $\mathbb{E}_i[\hat{\nabla}^2_{\theta} \xi_i]$, where

$$\mathbb{E}_i[\hat{\nabla}^2_{b^k} \xi_i] = \text{Pos-Eig}(\mathbb{E}_i[\nabla^2_{h_i^k} \xi_i]) \tag{8a}$$
$$\mathbb{E}_i[\hat{\nabla}^2_{b^{t-1}} \xi_i] = (W^{tT} \mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i] W^t) \odot \mathbb{E}^{(t-1)'}_{hh^T} + \text{Pos-Eig}(\mathrm{diag}(\mathbb{E}_i[h_i^{(t-1)''} \odot (W^{tT} \nabla_{b^t} \xi_i)])) \tag{8b}$$
$$\mathbb{E}_i[\hat{\nabla}^2_{W^t} \xi_i] = \mathbb{E}^{t-1}_{hh^T} \otimes \mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i]. \tag{8c}$$

We call $\mathbb{E}_i[\hat{\nabla}^2_{\theta} \xi_i]$ the block-diagonal approximation of positive curvature Hessian (PCH) matrix. Any PCH matrix can be guaranteed to be PSD, which we explain in the following.

In order to show $\mathbb{E}_i[\hat{\nabla}^2_{\theta} \xi_i]$ is PSD, we have to show that for any $t$, both $\mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i]$ and $\mathbb{E}_i[\hat{\nabla}^2_{W^t} \xi_i]$ are PSD. First, we consider the block matrix $\mathbb{E}_i[\nabla^2_{h_i^k} \xi_i]$ that is an $n_k$ by $n_k$ square matrix in Eq. (8a). If the criterion function $C(\hat{y}_i \mid y_i)$ is convex, $\mathbb{E}_i[\nabla^2_{h_i^k} \xi_i]$ is a PSD matrix. Otherwise, we decompose the matrix and replace the negative eigenvalues. Fortunately $n_k$ is usually not very large, so $\mathbb{E}_i[\nabla^2_{h_i^k} \xi_i]$ can be decomposed quickly¹ and modified to a PSD matrix $\mathbb{E}_i[\hat{\nabla}^2_{b^k} \xi_i]$. Second, suppose that $\mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i]$ is a PSD matrix, then $(W^{tT} \mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i] W^t) \odot \mathbb{E}^{(t-1)'}_{hh^T}$ is PSD. Consequently, the negative eigenvalues of $\mathbb{E}_i[\hat{\nabla}^2_{b^{t-1}} \xi_i]$ stem from the diagonal part $\mathrm{diag}(\mathbb{E}_i[h_i^{(t-1)''} \odot (W^{tT} \nabla_{b^t} \xi_i)])$, so we take the Pos-Eig function for this diagonal part in Eq. (8b). Third, because the Kronecker product of two PSD matrices is PSD, it implies $\mathbb{E}_i[\hat{\nabla}^2_{W^t} \xi_i]$ is PSD.

¹ In our experience, the decomposition of a 1000×1000 matrix can be done within a few seconds in PyTorch.

Solving Linear Equation via EA-CG

After obtaining a PCH matrix $\mathbb{E}_i[\hat{\nabla}^2_{\theta} \xi_i]$, we derive the update direction by solving the linear equation

$$((1 - \alpha) \mathbb{E}_i[\hat{\nabla}^2_{\theta} \xi_i] + \alpha I) d_{\theta} = -\mathbb{E}_i[\nabla_{\theta} \xi_i], \tag{9}$$

where $0 < \alpha < 1$ and $d_{\theta} = [d_{W^1}^T \; d_{b^1}^T \; \cdots \; d_{W^k}^T \; d_{b^k}^T]^T$. Here, we use the weighted average of $\mathbb{E}_i[\hat{\nabla}^2_{\theta} \xi_i]$ and an identity matrix $I$ because this average turns the coefficient matrix of Eq. (9) from PSD to positive definite and thus makes the solutions more stable. Due to the essence of diagonal blocks, Eq. (9) can be decomposed as

$$((1 - \alpha) \mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i] + \alpha I) d_{b^t} = -\mathbb{E}_i[\nabla_{b^t} \xi_i] \tag{10a}$$
$$((1 - \alpha) \mathbb{E}_i[\hat{\nabla}^2_{W^t} \xi_i] + \alpha I) d_{W^t} = -\mathrm{Vec}(\mathbb{E}_i[\nabla_{W^t} \xi_i]), \tag{10b}$$

for $t = 1, \ldots, k$. To solve Eq. (10a), we attain the solutions by using the CG method directly. For Eq. (10b), storing $\mathbb{E}_i[\hat{\nabla}^2_{W^t} \xi_i]$ is not efficient, so we apply the equation $(C^T \otimes A)\mathrm{Vec}(B) = \mathrm{Vec}(ABC)$ and Eq. (5) to obtain the Hessian-vector product with a given vector $\mathrm{Vec}(P)$:

$$\mathbb{E}_i[\hat{\nabla}^2_{W^t} \xi_i] \mathrm{Vec}(P) = \mathrm{Vec}(\mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i] \cdot P \cdot \mathbb{E}_i[h_i^{t-1} \otimes h_i^{(t-1)T}]) \tag{11a}$$
$$\approx \mathrm{Vec}(\mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i] \cdot P \cdot \mathbb{E}_i[h_i^{t-1}] \otimes \mathbb{E}_i[h_i^{(t-1)T}]). \tag{11b}$$

Based on Eq. (11a), we derive the Hessian-vector products of $\mathbb{E}_i[\hat{\nabla}^2_{W^t} \xi_i]$ via $\mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i]$, thereby reducing the space complexity of storing the curvature information. Furthermore, applying the CG method becomes much more efficient with Eq. (11b), which we call EA-CG. The details are elaborated in the following subsections.

Analysis of Space Complexity

By considering the devised method in the aforementioned subsection, we analyze the required space complexity to store the distinct types of the curvature information. The original Newton's method requires space to store $\mathbb{E}_i[\nabla^2_{\theta} \xi_i]$, so that its space complexity is $O([\sum_{t=1}^{k} n_{t-1}(n_t + 1)]^2)$. If we consider the PCH matrix $\mathbb{E}_i[\hat{\nabla}^2_{\theta} \xi_i]$, the space complexity turns out to be $O(\sum_{t=1}^{k} [(n_{t-1} n_t)^2 + n_t^2])$. According to Eq. (11a), it is not necessary to store $\mathbb{E}_i[\hat{\nabla}^2_{W^t} \xi_i]$ anymore because $\mathbb{E}_i[\hat{\nabla}^2_{b^t} \xi_i]$ is sufficient to derive the solution to Eq. (9) with the CG method. Thus, the space complexity is reduced to $O(\sum_{t=1}^{k} n_t^2)$.

Analysis of Additional Time Complexity

In this subsection, we elaborate on the additional time complexity introduced in our proposed method. In contrast to SGD, our method contributes more computation to the training process of FCNNs. The extra computation mainly originates from two portions of our method: the propagation of the curvature information and the computation of the EA-CG method. To propagate the curvature information, we are required to perform the matrix-to-matrix products twice, which is more computationally expensive than propagating gradients. The time complexity of propagating the curvature information with Eq. (6) can be estimated as $O(\sum_{t=1}^{k} [n_{t-1} n_t (n_{t-1} + n_t)])$. We also have the extra cost involved in applying the EA-CG method. The complexity of the EA-CG method mainly stems from the Hessian-vector products in Eq. (11b). Regarding Eq. (11b), we must conduct the matrix-vector products twice and the vector-vector outer product once in order to acquire the Hessian-vector product. Thus, the time complexity pertaining to the truncated-Newton method is $O(|CG| \times \sum_{t=1}^{k} [n_t(2n_{t-1} + n_t)])$, where $|CG|$ is the number of iterations of the EA-CG method.

Table 1: The comparisons of second-order methods.

| | (Martens and Grosse 2015) | (Botev, Ritter, and Barber 2017) | Ours |
|---|---|---|---|
| Criterion Func. | Non-Convex | Convex | Non-Convex |
| Curvature Info. | Fisher | Gauss-Newton | PCH |
| Time Required (curvature) | $O(\lvert Batch \rvert \times \sum_{t=1}^{k} n_t^2)$ | $O(\sum_{t=2}^{k} n_{t-1}^2 n_t)$ † | $O(\sum_{t=2}^{k} n_{t-1}^2 n_t)$ † |
| Solving Scheme | KFI | KFI | EA-CG |
| Space Required | $O(\sum_{t=0}^{k-1} n_t^2 + \sum_{t=1}^{k} n_t^2)$ | $O(\sum_{t=0}^{k-1} n_t^2 + \sum_{t=1}^{k} n_t^2)$ | $O(\sum_{t=0}^{k-1} n_t + \sum_{t=1}^{k} n_t^2)$ |
| Time Required (solving) | $O(\sum_{t=1}^{k} [n_t^3 + n_{t-1}^3 + n_{t-1}^2 n_t + n_{t-1} n_t^2])$ | $O(\sum_{t=1}^{k} [n_t^3 + n_{t-1}^3 + n_{t-1}^2 n_t + n_{t-1} n_t^2])$ | $O(\lvert CG \rvert \times \sum_{t=1}^{k} [n_t^2 + 2 n_t n_{t-1}])$ |

† One factor of this entry is squared; which of the two carries the exponent is not legible in the source.

Related Works

In this section, we elaborate on the differences between our work and two closely related works using the notation established in the previous sections.

Martens and Grosse developed KFAC by considering the FCNNs with the convex criterion and non-convex activation functions. KFAC utilized $(\mathbb{E}_i[\tilde{F}_i] + \alpha I)$, where $\tilde{F}_i$ is the Fisher matrix (for any training instance $x_i$), to measure the curvature and used the Khatri-Rao product to rewrite $\tilde{F}_i$, which yields the following equation:

$$[\tilde{F}_i]_{\mu\nu} = (h_i^{\mu-1} \otimes h_i^{(\mu-1)T}) \otimes (\nabla_{b^{\nu}} \xi_i \otimes \nabla_{b^{\nu}} \xi_i^T).$$

Since it is difficult to find the inverse of $(\mathbb{E}_i[\tilde{F}_i] + \alpha I)$, KFAC substitutes a block-diagonal matrix $\hat{F}_i$ for $\tilde{F}_i$ and thus has the formulation:

$$\mathbb{E}_i[\hat{F}_i^t] = \mathbb{E}_i[(h_i^{t-1} \otimes h_i^{(t-1)T}) \otimes (\nabla_{b^t} \xi_i \otimes \nabla_{b^t} \xi_i^T)] \approx \mathbb{E}_i[h_i^{t-1} \otimes h_i^{(t-1)T}] \otimes \mathbb{E}_i[\nabla_{b^t} \xi_i \otimes \nabla_{b^t} \xi_i^T],$$

where $t = 1, \ldots, k$. To derive $(\mathbb{E}_i[\hat{F}_i^t] + \alpha I)^{-1}$ efficiently, KFAC comes up with the approximation

$$(\mathbb{E}_i[\hat{F}_i^t] + \alpha I)^{-1} \approx (H^t)^{-1} \otimes (G^t)^{-1}, \tag{12}$$

where $(H^t)^{-1} = (\mathbb{E}_i[h_i^{t-1} \otimes h_i^{(t-1)T}] + \pi_t \sqrt{\alpha} I)^{-1}$ and $(G^t)^{-1} = (\mathbb{E}_i[\nabla_{b^t} \xi_i \otimes \nabla_{b^t} \xi_i^T] + (\sqrt{\alpha}/\pi_t) I)^{-1}$. Therefore,

[Plot panels: training loss; test accuracy (%). Curves: Fisher-KFI, Fisher-CG, GN-KFI, GN-CG, PCH-KFI, PCH-CG, SGD.]

[x-axis of both panels: time (sec)]

Figure 1: Comparison of different curvature information and solving methods for the convex criterion function “cross-entropy” on “Cifar-10”.
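
The Hessian-vector products of Eq. (11a) and Eq. (11b) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the helper names (`matmul`, `vec`, `kron`, `outer`, `hvp_11a`, `hvp_11b`) and the toy values are assumptions made for illustration, and plain Python lists stand in for tensors. The dense Kronecker product is built only to check the identity $(C^T \otimes A)\mathrm{Vec}(B) = \mathrm{Vec}(ABC)$ on a tiny layer.

```python
# Sketch of the Hessian-vector products in Eq. (11a) and Eq. (11b).
# All names and values are illustrative assumptions, not the paper's code.

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def vec(M):
    """Column-wise traversal of a matrix, matching Vec(.) in the text."""
    return [M[r][c] for c in range(len(M[0])) for r in range(len(M))]

def kron(A, B):
    """Dense Kronecker product, used here only to verify the cheap versions."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def outer(u, v):
    """Outer product u v^T."""
    return [[ui * vj for vj in v] for ui in u]

def hvp_11a(H_b, P, E_hh):
    """Eq. (11a): Vec(H_b . P . E[h h^T]) without forming the Kronecker product."""
    return vec(matmul(matmul(H_b, P), E_hh))

def hvp_11b(H_b, P, h_mean):
    """Eq. (11b): E[h h^T] replaced by E[h] E[h]^T.
    Needs two matrix-vector products and one outer product."""
    Ph = [sum(p * h for p, h in zip(row, h_mean)) for row in P]   # P . E[h]
    u = [sum(a * x for a, x in zip(row, Ph)) for row in H_b]      # H_b . (P . E[h])
    return vec(outer(u, h_mean))

# Toy layer with n_t = 2 outputs and n_{t-1} = 3 inputs; integers keep checks exact.
H_b = [[2, 1], [1, 3]]          # stands in for the PSD bias-Hessian block
P = [[1, 2, 3], [4, 5, 6]]      # a direction with the shape of W^t
h_mean = [1, 0, 2]              # stands in for E_i[h_i^{t-1}]
E_hh = outer(h_mean, h_mean)    # rank-one case, where (11a) and (11b) coincide

# Reference result from the explicit (E_hh kron H_b) matrix.
dense = [sum(k * x for k, x in zip(row, vec(P))) for row in kron(E_hh, H_b)]
```

In this toy case `dense`, `hvp_11a(H_b, P, E_hh)` and `hvp_11b(H_b, P, h_mean)` agree; with a general $\mathbb{E}_i[h h^T]$ only the first two would, and Eq. (11b) becomes an approximation.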

[Plot panels: training loss and test accuracy (%) versus time (sec). Curves: Fisher-KFI, Fisher-CG, GN-KFI, GN-CG, PCH-KFI, PCH-CG, SGD.]

Figure 2: Comparison of different curvature information and solving methods for the convex criterion function “cross-entropy” on “ImageNet-10”.

the update directions of KFAC can be acquired via the following equation:

$$\hat{d}_{W^t} = -((H^t)^{-1} \otimes (G^t)^{-1}) \cdot \mathrm{Vec}(\mathbb{E}_i[\nabla_{W^t} \xi_i]) = -\mathrm{Vec}((G^t)^{-1} \mathbb{E}_i[\nabla_{W^t} \xi_i] (H^t)^{-1}),$$

and we refer to this type of inverse methods as Kronecker-Factored Inverse (KFI) methods in this paper. In contrast, we derive the update directions by applying the EA-CG method. Moreover, KFAC uses the Fisher matrix, while we use the PCH matrix.

Based on the discussion of the FCNNs with convex criterion functions and piecewise linear activation functions, Botev, Ritter, and Barber developed KFRA. As a result, the diagonal term disappears in Eq. (4b). Thus, the Gauss-Newton matrix becomes no different from the Hessian matrix. KFRA also employs the KFI method, and hence the update direction of KFRA can be derived from

$$\tilde{d}_{W^t} = -\mathrm{Vec}((\tilde{G}^t)^{-1} \mathbb{E}_i[\nabla_{W^t} \xi_i] (H^t)^{-1}),$$

where $(\tilde{G}^t)^{-1} = (\mathbb{E}_i[GN(\nabla^2_{b^t} \xi_i)] + (\sqrt{\alpha}/\pi_t) I)^{-1}$, and GN stands for the Gauss-Newton matrix. The method of deriving the update directions is again different between KFRA and our work. In addition, KFRA still works under non-convex activation functions, but there is then a difference between the Gauss-Newton matrix and the Hessian matrix.

Table 1 highlights the three main components of these two related works and our method. As shown in Table 1, KFRA cannot handle non-convex criterion functions due to the curvature information it utilizes. The Gauss-Newton matrix becomes indefinite if the criterion function is non-convex. Note that the difference between the original and approximated matrices always exists in Eq. (12) if $\mathbb{E}_i[\hat{F}_i^t]$ is not diagonal. Moreover, the time complexity of the KFI method is similar to that of the EA-CG method.

Experimental Evaluation

Our empirical studies aim to examine not only three different types of curvature information including the Fisher, Gauss-Newton and PCH matrices, but also the two solving methods, i.e., KFI and EA-CG methods, in terms of training loss and testing accuracy. We used SGD with momentum as the baseline and fixed its momentum to 0.9. For better comparison, we considered FCNNs with either convex or non-convex criterion functions and conducted our experiments² on the image datasets that comprise Cifar-10

² Experiments were implemented by using PyTorch libraries and run on a GTX-1080Ti GPU.

[Plot panels: training loss; test accuracy (%). Curves: Fisher-KFI, Fisher-CG, GN-KFI, GN-CG, PCH-KFI, PCH-CG, SGD.]

[x-axis of both panels: epoch]

Figure 3: Comparison of different curvature information and solving methods for the convex criterion function “cross-entropy” on “Cifar-10”.
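
The KFI update direction $-\mathrm{Vec}((G^t)^{-1} \mathbb{E}_i[\nabla_{W^t} \xi_i] (H^t)^{-1})$ relies on the inverse property of Kronecker products. The following sketch checks that property on a toy example. It is an illustration under stated assumptions, not KFAC or KFRA code: the helper names are hypothetical, the factors `G` and `H` stand for already damped, symmetric 2×2 factors, and exact `Fraction` arithmetic is used so the check needs no tolerance.

```python
from fractions import Fraction

# Sketch of a Kronecker-Factored Inverse (KFI) step on a toy 2x2 layer.
# Helper names and values are illustrative assumptions.

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def vec(M):
    """Column-wise traversal of a matrix, matching Vec(.) in the text."""
    return [M[r][c] for c in range(len(M[0])) for r in range(len(M))]

def kron(A, B):
    """Dense Kronecker product, built here only for verification."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def inv2(M):
    """Exact inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def kfi_direction(G, H, grad_W):
    """Return the matrix D = -(G^{-1} grad_W H^{-1}); Vec(D) is the update."""
    D = matmul(matmul(inv2(G), grad_W), inv2(H))
    return [[-x for x in row] for row in D]

G = [[3, 1], [1, 2]]            # stands in for the damped gradient factor G^t
H = [[2, 1], [1, 4]]            # stands in for the damped activation factor H^t
grad_W = [[1, -2], [0, 5]]      # stands in for E_i[grad of xi w.r.t. W^t]

D = kfi_direction(G, H, grad_W)

# Multiplying by the full Kronecker matrix must give back -Vec(grad_W).
lhs = [sum(k * x for k, x in zip(row, vec(D))) for row in kron(H, G)]
```

The two small inverses replace the inverse of an $n_{t-1} n_t \times n_{t-1} n_t$ matrix, but both factors still have to be stored and inverted, which is where the cubic terms in Table 1 come from.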

[Plot panels: training loss and test accuracy (%) versus epoch. Curves: Fisher-KFI, Fisher-CG, GN-KFI, GN-CG, PCH-KFI, PCH-CG, SGD.]

Figure 4: Comparison of different curvature information and solving methods for the convex criterion function “cross-entropy” on “ImageNet-10”.
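
The damped linear system of Eq. (9) is solved with a CG method whose iteration count is restricted. A minimal sketch of such a truncated CG solver is given below. It is a generic textbook CG, not the authors' EA-CG code: the function names, the relative-residual stopping rule and the toy values are assumptions, and the `matvec` callback is where a Hessian-vector product such as Eq. (11b) would be plugged in.

```python
# Sketch of truncated CG on ((1 - alpha) * H + alpha * I) d = -grad, cf. Eq. (9).
# The stopping rule (relative residual <= tol) is an assumed, common choice.

def conjugate_gradient(matvec, b, max_iter, tol):
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for x = 0
    p = list(r)                      # first search direction
    rs = sum(v * v for v in r)
    b_norm = rs ** 0.5
    for _ in range(max_iter):        # restricted number of iterations
        if rs ** 0.5 <= tol * b_norm:
            break
        Ap = matvec(p)
        step = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def damped_matvec(H, alpha):
    """Return v -> ((1 - alpha) * H + alpha * I) v without forming the sum."""
    def apply(v):
        Hv = [sum(h * x for h, x in zip(row, v)) for row in H]
        return [(1 - alpha) * hv + alpha * x for hv, x in zip(Hv, v)]
    return apply

H = [[2.0, 1.0], [1.0, 3.0]]         # stands in for a PSD curvature block
grad = [1.0, -2.0]                   # stands in for the averaged gradient
A = damped_matvec(H, alpha=0.1)
d = conjugate_gradient(A, [-g for g in grad], max_iter=20, tol=1e-10)

# Residual of the damped system: A d + grad should be close to zero.
residual = [a + g for a, g in zip(A(d), grad)]
```

Because the solver only calls `matvec`, the coefficient matrix never has to be stored, which is the property the space-complexity analysis relies on.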

(Krizhevsky and Hinton 2009) and ImageNet-10³. The network structures are “3072-1024-512-256-128-64-32-16-10” and “150528-1024-512-256-128-64-32-16-10” for Cifar-10 and ImageNet-10, respectively. For the PCH matrices, we explored two possible scenarios of Eq. (8b): 1) taking the absolute values of the diagonal part, i.e., $\gamma = -1$ and 2) applying the $\max(x, 0)$ function to the diagonal part, i.e., $\gamma = 0$. The first and second scenarios of the PCH matrices are dubbed PCH-1 and PCH-2, respectively. Since PCH-1 and PCH-2 produced similar results of training loss and testing accuracy⁴, we only reported PCH-1 in the corresponding figures for the sake of simplicity. In all of our experiments, we used sigmoid as the non-convex activation function of FCNNs and trained the networks for 200 epochs, i.e., seeing the entire training samples 200 times. Furthermore, we utilized the “Xavier” initialization method (Glorot and Bengio 2010) and performed the grid search on learning rate = [0.05, 0.1, 0.2], |Batch| = [100, 500, 1000] and $\alpha$ = [0.01, 0.02, 0.05, 0.1]. Regarding the EA-CG method, we have to determine two hyper-parameters that control its stopping conditions. One is the maximal iteration number, $\max|CG|$; the other is the constant of the related error bound, $\epsilon_{CG}$. Thus, we also performed the grid search on $\max|CG|$ = [5, 10, 20, 50] and $\epsilon_{CG}$ = [$10^{-10}$, $10^{-5}$, $10^{-2}$, $10^{-1}$]. Finally, to further investigate the different types of curvature information, we calculate the errors caused by approximating the true Hessian in a layer-wise fashion.

³ We randomly choose ten classes from the ImageNet (Deng et al. 2009) dataset.
⁴ Detailed results are in Supplementary Material A.

Convex Criterion

In the first type of experiments, we examined the performance of different second-order methods. According to the prior section, the differences between these second-order methods originate from two parts. The first part is the curvature information, e.g., the Fisher matrix $\hat{F}_i$ in KFAC, and the other is the solving method for the linear equations. In these experiments, we used cross-entropy as the convex criterion function.

It is noteworthy that the original KFI method runs out of GPU memory for ImageNet-10 because the dimension of the features ($n_0$) in ImageNet-10 is too high, which conforms with our analysis in Table 1. Thus, we utilized

$$H^1 \approx \mathbb{E}_i[h_i^0] \otimes \mathbb{E}_i[h_i^{0T}] + \pi_1 \sqrt{\alpha} I \tag{13}$$

and the Sherman-Morrison formula to derive the update directions. Albeit this approximation worked in our ImageNet-10

[Plot panels: training loss and test accuracy (%) versus time (sec). Curves: Fisher-KFI, Fisher-CG, PCH-KFI, PCH-CG, SGD.]

Figure 5: Comparison of different curvature information and solving methods for the non-convex criterion function “Eq. (14)” on “Cifar-10”. The Gauss-Newton matrix is removed from this figure since it is not PSD in this type of experiments.
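
Eq. (13) replaces the first-layer factor by a rank-one matrix plus a scaled identity, which is the setting in which the Sherman-Morrison formula applies. The sketch below shows the standard formula for that structure; the text does not spell out how the authors combined it with the KFI update, so this is only an illustration. The function name, the scalar `c` (standing for $\pi_1\sqrt{\alpha}$) and the toy vectors are assumptions, and `Fraction` is used so the check is exact.

```python
from fractions import Fraction

# Sketch: apply (h h^T + c I)^{-1} to a vector with the Sherman-Morrison formula,
# (c I + h h^T)^{-1} v = (v - h * (h.v) / (c + h.h)) / c,
# so the n_0 x n_0 matrix of Eq. (13) is never formed.
# Names and values are illustrative assumptions.

def solve_rank_one_plus_identity(h_mean, c, v):
    """Return x such that (h_mean h_mean^T + c I) x = v, using O(n) memory."""
    hv = sum(h * x for h, x in zip(h_mean, v))      # h . v
    hh = sum(h * h for h in h_mean)                 # h . h
    scale = hv / (c + hh)
    return [(x - scale * h) / c for x, h in zip(v, h_mean)]

h_mean = [Fraction(1), Fraction(2), Fraction(2)]    # stands in for E_i[h_i^0]
c = Fraction(1, 2)                                  # stands in for pi_1 * sqrt(alpha)
v = [Fraction(3), Fraction(0), Fraction(1)]         # an arbitrary right-hand side

x = solve_rank_one_plus_identity(h_mean, c, v)

# Multiply back by (h h^T + c I) to confirm that v is recovered.
hx = sum(h * xi for h, xi in zip(h_mean, x))
check = [h * hx + c * xi for h, xi in zip(h_mean, x)]
```

Only the mean activation vector is stored, which is why this route avoids the memory cost of an $n_0 \times n_0$ factor when $n_0$ is as large as in ImageNet-10.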

Fisher-KFI 27.5 0.60 Fisher-CG PCH-KFI 25.0 PCH-CG 22.5 0.55 SGD

20.0

0.50 17.5 training loss Fisher-KFI test accuracy (%) 15.0 0.45 Fisher-CG PCH-KFI 12.5 PCH-CG 0.40 10.0 SGD

0 1000 2000 3000 4000 0 1000 2000 3000 4000 time(sec) time(sec)

Figure 6: Comparison of different curvature information and solving methods for the non-convex criterion function “Eq. (14)” on “ImageNet-10”. The Gauss-Newton matrix is removed from this figure since it is not PSD in this type of experiments. experiments, we observed that this approximation is not sta- gradients of the weights backward is much more expensive ble for other datasets and models that we have explored. than it in Cifar-10 whose image size is 32 × 32 × 3. Based on this premise, the extra cost of propagating our proposed Wall Clock Time As shown in Figure 1, EA-CG method PCH where only the Hessian of the bias terms is required to converges faster with respect to wall clock time and has bet- propagate is comparatively low. Similarly, the cost of solv- ter performance of testing accuracy than KFI method for ing the update directions is comparatively low. Cifar-10. The runtime behavior adheres to Table 1. We also notice that the different types of curvature information with Epoch As shown in Figure 3, KFI method provides a pre- EA-CG method have the similar performance. In contrast, cise descent direction of each epoch, so the training loss KFI method using the Fisher matrix converges faster than decreases faster in terms of epochs. In contrast, Figure 1 using the other two types of matrices. For ImageNet-10, shows KFI method converges slower in terms of wall clock Figure 2 exhibits that EA-CG and KFI methods take al- time, which varied with different implementation. We argue most the same amount of time for 200 epochs. Besides, as that we have already done our best effort to implement KFI shown in Figure 2, KFI method has lower training loss but method by using the inverse function in PyTorch. As shown may suffer from the overfitting issue. Please note that we in either Figure 1 or Figure 3, EA-CG method has the better applied Eq (13) to the first layer of the neural network for results of testing accuracy for Cifar-10. For ImageNet-10, KFI method. 
Otherwise, the original KFI method ran out of the results of Figure 4 are consistent with the results of Fig- GPU memory for ImageNet-10. However, applying this ap- ure 2. Thus, we do not repeat the same narrative here. proximation reduces the memory usage of KFI method but expedites KFI method accordingly. It is worth noting that second-order methods performed Non-Convex Criterion comparable to SGD in Figure 2, but the phenomenon is com- pletely different in Figure 1. This is because the image sizes In order to demonstrate our capability of handling non- of Cifar-10 and ImageNet-10 are varied. The image size of convex criterion functions, we designed the second type of ImageNet-10 is 224 × 224 × 3, and hence propagating the experiments. In these experiments, we considered the fol- 30.0 Fisher-KFI 0.60 Fisher-CG 27.5 PCH-KFI PCH-CG 25.0 0.55 SGD 22.5

0.50 20.0

17.5 training loss 0.45 Fisher-KFI test accuracy (%) 15.0 Fisher-CG PCH-KFI 0.40 12.5 PCH-CG 10.0 SGD

0 25 50 75 100 125 150 175 200 0 25 50 75 100 125 150 175 200 epoch epoch

Figure 7: Comparison of different curvature information and solving methods for the non-convex criterion function “Eq. (14)” on “Cifar-10”. The Gauss-Newton matrix is removed from this ﬁgure since it is not PSD in this type of experiments.

Fisher-KFI 27.5 0.60 Fisher-CG PCH-KFI 25.0 PCH-CG 22.5 0.55 SGD

20.0

0.50 17.5 training loss Fisher-KFI test accuracy (%) 15.0 0.45 Fisher-CG PCH-KFI 12.5 PCH-CG 0.40 10.0 SGD

0 25 50 75 100 125 150 175 200 0 25 50 75 100 125 150 175 200 epoch epoch

Figure 8: Comparison of different curvature information and solving methods for the non-convex criterion function “Eq. (14)” on “ImageNet-10”. The Gauss-Newton matrix is removed from this figure since it is not PSD in this type of experiments. lowing criterion function: Table 2: Layer-wise errors between each of the approximate 1 C(yˆ | y ) = , (14) Hessian matrices and the true Hessian for the convex crite- i i δ(yT yˆ −) 1 + e i i rion function “cross-entropy” on “Cifar-10”. where we fix δ = 5 and = 0.2. This criterion function im- Fisher GN PCH-1 PCH-2 plies that the upper bound of the loss function exists. Apart Layer-1 0.0071 0.0057 0.0036 0.0043 from the criterion function, we followed the same settings Layer-2 0.0470 0.0212 0.0237 0.0228 as those used in the first type of experiments. Note that the Layer-3 0.1140 0.0220 0.0238 0.0165 Gauss-Newton matrix is not PSD if the criterion function Layer-4 0.0726 0.0207 0.0119 0.0085 of training FCNNs is non-convex. Thus, we excluded the Layer-5 0.0397 0.0130 0.0086 0.0066 Gauss-Newton matrix from this comparison and focused on Layer-6 0.0219 0.0107 0.0093 0.0071 comparing the other two types of matrices with either EA- Layer-7 0.0185 0.0176 0.0132 0.0106 CG or KFI methods. Layer-8 0.0251 0.0000 0.0000 0.0000 Total 0.1535 0.0446 0.0402 0.0330 Wall Clock Time For Cifar-10, Figure 5 shows that EA- CG method outperforms KFI method in terms of both training loss and testing accuracy. But the Fisher matrix with EA-CG method has the best performance. For ImageNet-10, training loss in Figure 7. Regarding testing accuracy, EA- Figure 6 shows that EA-CG method surpasses KFI method CG method has the better performance than KFI method for regarding training loss. But the performance of EA-CG and this type of experiments. For ImageNet-10, Figure 8 shows KFI methods is hard to distinguish by the testing accuracy. 
that EA-CG method surpasses KFI method regarding train- Epoch As shown in Figure 7, for Cifar-10, the perfor- ing loss, which is consistent with Figure 6. However, the mance of EA-CG and KFI methods is hard to distinguish by performance of EA-CG and KFI methods is hard to distin- the convergence speed of training loss in terms of epochs. guish by the testing accuracy, and the PCH matrix using KFI But the Fisher matrix with EA-CG method has the best method has the best testing accuracy. Comparison with the True Hessian in xprize deepq tricorder. In Proceedings of the 2Nd Inter- In addition to the comparison of training loss and testing ac- national Workshop on Multimedia for Personal Health and curacy on different criterion functions and datasets, we fur- Health Care, MMHealth ’17, 11–18. New York, NY, USA: ther examine the difference between the true Hessian and ACM. the approximate Hessian matrices such as the Fisher matrix. [Dauphin et al. 2014] Dauphin, Y. N.; Pascanu, R.; Gulcehre, Here, we measure the difference in the errors that is elabo- C.; Cho, K.; Ganguli, S.; and Bengio, Y. 2014. Identifying rated as follows. Given the tth layer and the model parame- and attacking the saddle point problem in high-dimensional ters θ, the error is defined as non-convex optimization. In Advances in neural information processing systems, 2933–2941. 2 2 Ei[∇gt ξi] − |Ei[∇bt ξi]| , b F [Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchi- 2 where ∇gbt ξi stands for the approximate Hessian matrix, and cal image database. In 2009 IEEE Conference on Computer 2 |Ei[∇bt ξi]| is to take the absolute values of the eigenvalues Vision and Pattern Recognition, 248–255. 2 of Ei[∇bt ξi], which we follow (Dauphin et al. 2014). [Glorot and Bengio 2010] Glorot, X., and Bengio, Y. 2010. 
The results are shown in Table 2 where each value is de- 0 Understanding the difficulty of training deep feedforward rived from averaging the errors of the initial parameters θ neural networks. In Teh, Y. W., and Titterington, M., eds., s−1 to the parameters θ that are updated s times. Table 2 re- Proceedings of the Thirteenth International Conference on flects that the errors are not accumulated with layers for any Artificial Intelligence and Statistics, volume 9 of Proceed- approximate Hessian matrix. This observation is important ings of Machine Learning Research, 249–256. Chia Laguna for KFRA and ours since both methods derive the curvature Resort, Sardinia, Italy: PMLR. information by approximating the Hessian layer-by-layer re- [Goldfarb et al. 2017] Goldfarb, D.; Mu, C.; Wright, J.; and cursively. The “Total” row of Table 2 also indicates that our Zhou, C. 2017. Using negative curvature in solving non- proposed PCH-1 and PCH-2 are closer to the true Hessian linear programs. Computational Optimization and Applica- than the Fisher and Gauss-Newton matrices. The results of tions 68(3):479–502. the other experiments echo our conclusions here and are pro- vided in the Supplementary Materials A. [Grosse and Martens 2016] Grosse, R., and Martens, J. 2016. A kronecker-factored approximate fisher matrix for Concluding Remarks convolution layers. In International Conference on Machine Learning, 573–582. To achieve more computationally feasible second-order methods for training FCNNs, we developed a practical ap- [He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. proach, including our proposed PCH matrix and devised Deep residual learning for image recognition. In Proceed- EA-CG method. Our proposed PCH matrix overcomes the ings of the IEEE conference on computer vision and pattern problem of training FCNNs with non-convex criterion func- recognition, 770–778. tions. 
Besides, EA-CG provides another alternative to effi- [Hochreiter and Schmidhuber 1997] Hochreiter, S., and ciently derive update directions in this context. Our empiri- Schmidhuber, J. 1997. Long short-term memory. Neural cal studies show that our proposed PCH matrix can compete computation 9(8):1735–1780. with the state-of-the-art curvature approximation, and EA- [Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, CG does converge faster and enjoys better testing accuracy G. 2009. Learning multiple layers of features from tiny than KFI method. Specially, the performance of our pro- images. Technical report, Citeseer. posed approach is competitive with SGD. As future work, we will extend the idea to work with convolutional nets. [LeCun et al. 1998] LeCun, Y.; Bottou, L.; Orr, G. B.; and Muller,¨ K.-R. 1998. Efficient backprop. In Neural networks: References Tricks of the trade. Springer. 9–50. [Amari, Park, and Fukumizu 2000] Amari, S.-I.; Park, H.; [Martens and Grosse 2015] Martens, J., and Grosse, R. and Fukumizu, K. 2000. Adaptive method of realizing 2015. Optimizing neural networks with kronecker-factored natural gradient learning for multilayer perceptrons. Neu- approximate curvature. In International Conference on Ma- ral Computation 12(6):1399–1409. chine Learning, 2408–2417. [Belagiannis et al. 2015] Belagiannis, V.; Rupprecht, C.; [Martens 2010] Martens, J. 2010. Deep learning via hessian- Carneiro, G.; and Navab, N. 2015. Robust optimization for free optimization. In Proceedings of the 27th International deep regression. In Proceedings of the IEEE International Conference on Machine Learning (ICML-10), 735–742. Conference on Computer Vision, 2830–2838. [Mizutani and Dreyfus 2008] Mizutani, E., and Dreyfus, [Botev, Ritter, and Barber 2017] Botev, A.; Ritter, H.; and S. E. 2008. Second-order stagewise backpropagation for Barber, D. 2017. Practical gauss-newton optimisation for hessian-matrix analyses and investigation of negative curva- deep learning. 
In International Conference on Machine ture. Neural Networks 21(2):193–203. Learning, 557–565. [Pearlmutter 1994] Pearlmutter, B. A. 1994. Fast exact mul- [Chang et al. 2017] Chang, E. Y.; Wu, M.-H.; Tang, K.-F. T.; tiplication by the hessian. Neural computation 6(1):147– Kao, H.-C.; and Chou, C.-N. 2017. Artificial intelligence 160. [Polyak and Juditsky 1992] Polyak, B. T., and Juditsky, A. B. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30(4):838– 855. [Schraudolph 2002] Schraudolph, N. N. 2002. Fast curvature matrix-vector products for second-order gradient descent. Neural computation 14(7):1723–1738. [Stewart 1999] Stewart, C. V. 1999. Robust parameter estimation in computer vision. SIAM review 41(3):513–537. [Zhang et al. 2017] Zhang, H.; Xiong, C.; Bradbury, J.; and Socher, R. 2017. Block-diagonal hessian-free optimization for training neural networks. arXiv preprint arXiv:1712.07296. Supplementary Materials A Other True Hessian Comparison The other experiments of comparing the approximate Hessian matrices and the true Hessian matrix are reported in the Table 3, Table 4, and Table 5. All the three tables along with Table 2 reflect that the errors are not accumulated with layers for any approximate Hessian matrix. Because KFRA and our proposed method derive the curvature information by approximating the Hessian layer-by-layer recursively, we have to assert that the errors are not accumulated with layers. Otherwise, the error accumulation will cause a severe problem in lower layers of FCNNs. The consequence of these experiments provides an evidence to clear up this concern. Moreover, the “Total” rows of these three tables also indicate that our proposed PCH-1 and PCH-2 are closer to the true Hessian than the Fisher and Gauss-Newton matrices.

Table 3: Layer-wise errors between each of the approximate Hessian matrices and the true Hessian for the convex criterion function “cross-entropy” on “ImageNet-10”.

            Fisher    GN      PCH-1   PCH-2
Layer-1     0.0012  0.0011  0.0003  0.0009
Layer-2     0.0240  0.0151  0.0146  0.0132
Layer-3     0.0409  0.0186  0.0198  0.0174
Layer-4     0.0390  0.0149  0.0137  0.0160
Layer-5     0.0327  0.0132  0.0115  0.0112
Layer-6     0.0279  0.0159  0.0142  0.0128
Layer-7     0.0290  0.0260  0.0259  0.0183
Layer-8     0.0427  0.0000  0.0000  0.0000
Total       0.0910  0.0436  0.0424  0.0369

Table 4: Layer-wise errors between each of the approximate Hessian matrices and the true Hessian for the non-convex criterion function “Eq. (14)” on “Cifar-10”. The Gauss-Newton matrix is removed from this table since it is not PSD in this type of experiment.

            Fisher    GN      PCH-1   PCH-2
Layer-1     0.0044    -     0.0035  0.0049
Layer-2     0.0192    -     0.0179  0.0456
Layer-3     0.0528    -     0.0185  0.0502
Layer-4     0.0449    -     0.0129  0.0344
Layer-5     0.0330    -     0.0092  0.0261
Layer-6     0.0294    -     0.0097  0.0198
Layer-7     0.0331    -     0.0140  0.0198
Layer-8     0.0781    -     0.0000  0.0000
Total       0.1198    -     0.0349  0.0853

Table 5: Layer-wise errors between each of the approximate Hessian matrices and the true Hessian for the non-convex criterion function “Eq. (14)” on “ImageNet-10”. The Gauss-Newton matrix is removed from this table since it is not PSD in this type of experiment.

            Fisher    GN      PCH-1   PCH-2
Layer-1     0.0003    -     0.0002  0.0003
Layer-2     0.0116    -     0.0226  0.0253
Layer-3     0.0258    -     0.0264  0.0300
Layer-4     0.0248    -     0.0285  0.0326
Layer-5     0.0157    -     0.0284  0.0308
Layer-6     0.0147    -     0.0214  0.0227
Layer-7     0.0219    -     0.0172  0.0171
Layer-8     0.0734    -     0.0000  0.0000
Total       0.0880    -     0.0598  0.0660
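The layer-wise error reported in these tables compares an approximate Hessian with the true Hessian after the eigenvalues of the latter are replaced by their absolute values, measured in the Frobenius norm. The following is a minimal NumPy sketch of that metric, not the authors' code: the function names `abs_eig` and `layer_error` and the 2 × 2 toy matrices are hypothetical, and any averaging over parameter updates used for the tables is omitted.

```python
import numpy as np

def abs_eig(hessian):
    """Return |H|: the symmetric matrix H with each eigenvalue replaced by its absolute value."""
    sym = 0.5 * (hessian + hessian.T)           # guard against tiny asymmetries
    eigvals, eigvecs = np.linalg.eigh(sym)      # symmetric eigendecomposition
    return (eigvecs * np.abs(eigvals)) @ eigvecs.T

def layer_error(approx_hessian, true_hessian):
    """Frobenius-norm distance between an approximate Hessian and |true Hessian|."""
    return np.linalg.norm(approx_hessian - abs_eig(true_hessian), ord="fro")

# Toy example (hypothetical values): an indefinite "true" Hessian for one layer.
true_h = np.array([[2.0, 0.0],
                   [0.0, -1.0]])
psd_approx = np.array([[2.0, 0.0],
                       [0.0, 1.0]])             # equals |true_h|
identity_approx = np.eye(2)

err_psd = layer_error(psd_approx, true_h)
err_identity = layer_error(identity_approx, true_h)
print(err_psd, err_identity)
```

In this toy case the first approximation coincides with $|H|$, so its error is zero up to rounding, while the identity differs from $|H|$ only in the top-left entry.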

Comparison of different types of PCH matrices

We conducted all the experiments with two kinds of PCH matrices. However, their results are very similar across all the experiments. For the sake of simplicity, we show only one of them in the figures of our experimental evaluation section and put their comparison here.

Figure 9: Comparison of PCH-1 and PCH-2 for the convex criterion function “cross-entropy” on “Cifar-10”. [Panels: training loss and test accuracy (%) versus time (sec); curves: PCH-1-KFI, PCH-1-CG, PCH-2-KFI, PCH-2-CG.]

Figure 10: Comparison of PCH-1 and PCH-2 for the convex criterion function “cross-entropy” on “ImageNet-10”. [Panels: training loss and test accuracy (%) versus time (sec); curves: PCH-1-KFI, PCH-1-CG, PCH-2-KFI, PCH-2-CG.]

Figure 11: Comparison of PCH-1 and PCH-2 for the non-convex criterion function “Eq. (14)” on “Cifar-10”. [Panels: training loss and test accuracy (%) versus time (sec); curves: PCH-1-KFI, PCH-1-CG, PCH-2-KFI, PCH-2-CG.]

Figure 12: Comparison of PCH-1 and PCH-2 for the non-convex criterion function “Eq. (14)” on “ImageNet-10”. [Panels: training loss and test accuracy (%) versus time (sec); curves: PCH-1-KFI, PCH-1-CG, PCH-2-KFI, PCH-2-CG.]

Supplementary Materials B

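(i) $\nabla_{b^k}\xi_i = \nabla_{h_i^k}\xi_i$

$$
\begin{aligned}
\frac{\partial \xi_i}{\partial [b^k]_r}
&= \sum_s \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [b^k]_r}
 = \sum_s \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial\big([W^k]_{s\cdot}h_i^{k-1} + [b^k]_s\big)}{\partial [b^k]_r} \\
&= \frac{\partial \xi_i}{\partial [h_i^k]_r}\frac{\partial\big([W^k]_{r\cdot}h_i^{k-1} + [b^k]_r\big)}{\partial [b^k]_r}
 = \frac{\partial \xi_i}{\partial [h_i^k]_r}.
\end{aligned}
$$

Therefore, we have $\nabla_{b^k}\xi_i = \nabla_{h_i^k}\xi_i$.

(ii) $\nabla_{b^{t-1}}\xi_i = \mathrm{diag}(h_i^{(t-1)'})\,W^{tT}\nabla_{b^t}\xi_i$, $t = k, \ldots, 2$

Case 1: $t = k$.

$$
\begin{aligned}
\frac{\partial \xi_i}{\partial [b^{k-1}]_r}
&= \sum_s \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [b^{k-1}]_r}
 = \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^k]_s}\sum_u \frac{\partial [h_i^k]_s}{\partial [h_i^{k-1}]_u}\frac{\partial [h_i^{k-1}]_u}{\partial [b^{k-1}]_r} \right\} \\
&= \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^k]_s}\sum_u \frac{\partial [h_i^k]_s}{\partial [h_i^{k-1}]_u}\frac{\partial \sigma\big([W^{k-1}]_{u\cdot}h_i^{k-2} + [b^{k-1}]_u\big)}{\partial [b^{k-1}]_r} \right\} \\
&= \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [h_i^{k-1}]_r}\frac{\partial \sigma\big([W^{k-1}]_{r\cdot}h_i^{k-2} + [b^{k-1}]_r\big)}{\partial [b^{k-1}]_r} \right\} \\
&= \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [h_i^{k-1}]_r}\,\sigma'\big([W^{k-1}]_{r\cdot}h_i^{k-2} + [b^{k-1}]_r\big) \right\} \\
&= \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [h_i^{k-1}]_r}\,[h_i^{(k-1)'}]_r \right\}
 = \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial\big([W^k]_{s\cdot}h_i^{k-1} + [b^k]_s\big)}{\partial [h_i^{k-1}]_r}\,[h_i^{(k-1)'}]_r \right\} \\
&= \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^k]_s}\,[W^k]_{sr}\,[h_i^{(k-1)'}]_r \right\}
 = [h_i^{(k-1)'}]_r\,\nabla_{h_i^k}\xi_i^{T}\,[W^k]_{\cdot r} \\
&= [h_i^{(k-1)'}]_r\,\nabla_{b^k}\xi_i^{T}\,[W^k]_{\cdot r}
 = [h_i^{(k-1)'}]_r\,[W^{kT}]_{r\cdot}\,\nabla_{b^k}\xi_i.
\end{aligned}
$$

Therefore, we have $\nabla_{b^{k-1}}\xi_i = \mathrm{diag}(h_i^{(k-1)'})\,W^{kT}\nabla_{b^k}\xi_i$.

Case 2: $t < k$. Similarly,

$$
\begin{aligned}
\frac{\partial \xi_i}{\partial [b^{t-1}]_r}
&= \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^t]_s}\frac{\partial [h_i^t]_s}{\partial [h_i^{t-1}]_r} \right\}[h_i^{(t-1)'}]_r
 = \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^t]_s}\frac{\partial \sigma\big([W^t]_{s\cdot}h_i^{t-1} + [b^t]_s\big)}{\partial [h_i^{t-1}]_r} \right\}[h_i^{(t-1)'}]_r \\
&= \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^t]_s}\,\sigma'\big([W^t]_{s\cdot}h_i^{t-1} + [b^t]_s\big)\,[W^t]_{sr} \right\}[h_i^{(t-1)'}]_r \\
&= \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^t]_s}\frac{\partial \sigma\big([W^t]_{s\cdot}h_i^{t-1} + [b^t]_s\big)}{\partial [b^t]_s}\,[W^t]_{sr} \right\}[h_i^{(t-1)'}]_r
 = \sum_s \left\{ \frac{\partial \xi_i}{\partial [h_i^t]_s}\frac{\partial [h_i^t]_s}{\partial [b^t]_s}\,[W^t]_{sr} \right\}[h_i^{(t-1)'}]_r \\
&= \sum_s \left\{ \left( \sum_u \frac{\partial \xi_i}{\partial [h_i^t]_u}\frac{\partial [h_i^t]_u}{\partial [b^t]_s} \right)[W^t]_{sr} \right\}[h_i^{(t-1)'}]_r
 = \sum_s \left\{ \frac{\partial \xi_i}{\partial [b^t]_s}\,[W^t]_{sr} \right\}[h_i^{(t-1)'}]_r \\
&= [h_i^{(t-1)'}]_r\,[W^{tT}]_{r\cdot}\,\nabla_{b^t}\xi_i.
\end{aligned}
$$

Therefore, we have $\nabla_{b^{t-1}}\xi_i = \mathrm{diag}(h_i^{(t-1)'})\,W^{tT}\nabla_{b^t}\xi_i$.

(iii) $\nabla_{W^t}\xi_i = \nabla_{b^t}\xi_i \otimes h_i^{(t-1)T}$, $t = k, \ldots, 1$

Case 1: $t = k$.

$$
\begin{aligned}
\frac{\partial \xi_i}{\partial [W^k]_{\alpha\beta}}
&= \sum_s \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [W^k]_{\alpha\beta}}
 = \sum_s \frac{\partial \xi_i}{\partial [h_i^k]_s}\frac{\partial\big([W^k]_{s\cdot}h_i^{k-1} + [b^k]_s\big)}{\partial [W^k]_{\alpha\beta}} \\
&= \frac{\partial \xi_i}{\partial [h_i^k]_\alpha}\frac{\partial\big([W^k]_{\alpha\cdot}h_i^{k-1} + [b^k]_\alpha\big)}{\partial [W^k]_{\alpha\beta}}
 = \frac{\partial \xi_i}{\partial [h_i^k]_\alpha}\,[h_i^{k-1}]_\beta
 = \frac{\partial \xi_i}{\partial [b^k]_\alpha}\,[h_i^{k-1}]_\beta.
\end{aligned}
$$

Therefore, we have $\nabla_{W^k}\xi_i = \nabla_{b^k}\xi_i \otimes h_i^{(k-1)T}$.

Case 2: $t < k$. Similarly,

$$
\begin{aligned}
\frac{\partial \xi_i}{\partial [W^t]_{\alpha\beta}}
&= \sum_s \frac{\partial \xi_i}{\partial [h_i^t]_s}\frac{\partial \sigma\big([W^t]_{s\cdot}h_i^{t-1} + [b^t]_s\big)}{\partial [W^t]_{\alpha\beta}}
 = \frac{\partial \xi_i}{\partial [h_i^t]_\alpha}\,\sigma'\big([W^t]_{\alpha\cdot}h_i^{t-1} + [b^t]_\alpha\big)\,[h_i^{t-1}]_\beta \\
&= \frac{\partial \xi_i}{\partial [h_i^t]_\alpha}\frac{\partial \sigma\big([W^t]_{\alpha\cdot}h_i^{t-1} + [b^t]_\alpha\big)}{\partial [b^t]_\alpha}\,[h_i^{t-1}]_\beta
 = \frac{\partial \xi_i}{\partial [h_i^t]_\alpha}\frac{\partial [h_i^t]_\alpha}{\partial [b^t]_\alpha}\,[h_i^{t-1}]_\beta \\
&= \left( \sum_s \frac{\partial \xi_i}{\partial [h_i^t]_s}\frac{\partial [h_i^t]_s}{\partial [b^t]_\alpha} \right)[h_i^{t-1}]_\beta
 = \frac{\partial \xi_i}{\partial [b^t]_\alpha}\,[h_i^{t-1}]_\beta
 = [\nabla_{b^t}\xi_i]_\alpha\,[h_i^{t-1}]_\beta.
\end{aligned}
$$

Therefore, we have $\nabla_{W^t}\xi_i = \nabla_{b^t}\xi_i \otimes h_i^{(t-1)T}$.

Supplementary Materials C

(i) $\nabla^2_{b^k}\xi_i = \nabla^2_{h_i^k}\xi_i$

$$
\begin{aligned}
\frac{\partial^2 \xi_i}{\partial [b^k]_r\,\partial b^k}
&= \frac{\partial \nabla_{b^k}\xi_i}{\partial [b^k]_r}
 = \frac{\partial \nabla_{h_i^k}\xi_i}{\partial [b^k]_r}
 = \sum_s \frac{\partial \nabla_{h_i^k}\xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [b^k]_r} \\
&= \sum_s \frac{\partial \nabla_{h_i^k}\xi_i}{\partial [h_i^k]_s}\frac{\partial\big([W^k]_{s\cdot}h_i^{k-1} + [b^k]_s\big)}{\partial [b^k]_r}
 = \frac{\partial \nabla_{h_i^k}\xi_i}{\partial [h_i^k]_r}.
\end{aligned}
$$

Therefore, we have $\nabla^2_{b^k}\xi_i = \nabla^2_{h_i^k}\xi_i$.

(ii) $\nabla^2_{b^{t-1}}\xi_i = \mathrm{diag}(h_i^{(t-1)'})\,W^{tT}\,\nabla^2_{b^t}\xi_i\,W^t\,\mathrm{diag}(h_i^{(t-1)'}) + \mathrm{diag}\big(h_i^{(t-1)''} \odot (W^{tT}\nabla_{b^t}\xi_i)\big)$, $t = k, \ldots, 2$

Case 1: $t = k$.

$$
\begin{aligned}
\frac{\partial^2 \xi_i}{\partial [b^{k-1}]_r\,\partial b^{k-1}}
&= \frac{\partial \nabla_{b^{k-1}}\xi_i}{\partial [b^{k-1}]_r}
 = \frac{\partial\,\mathrm{diag}(h_i^{(k-1)'})\,W^{kT}\nabla_{b^k}\xi_i}{\partial [b^{k-1}]_r} \\
&= \underbrace{\frac{\partial\,\mathrm{diag}(h_i^{(k-1)'})\,W^{kT}}{\partial [b^{k-1}]_r}\,\nabla_{b^k}\xi_i}_{\text{Part 1}}
 + \underbrace{\mathrm{diag}(h_i^{(k-1)'})\,W^{kT}\,\frac{\partial \nabla_{b^k}\xi_i}{\partial [b^{k-1}]_r}}_{\text{Part 2}}.
\end{aligned}
$$

Part 1 is the vector whose only nonzero entry is the $r$th one:

$$
\text{Part 1} = \big([h_i^{(k-1)''}]_r\,[W^{kT}]_{r\cdot}\,\nabla_{b^k}\xi_i\big)\,e_r .
$$

Next, we derive

$$
\begin{aligned}
\frac{\partial \nabla_{b^k}\xi_i}{\partial [b^{k-1}]_r}
&= \sum_s \frac{\partial \nabla_{b^k}\xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [b^{k-1}]_r}
 = \sum_s \frac{\partial \nabla_{b^k}\xi_i}{\partial [h_i^k]_s}\sum_u \frac{\partial [h_i^k]_s}{\partial [h_i^{k-1}]_u}\frac{\partial [h_i^{k-1}]_u}{\partial [b^{k-1}]_r} \\
&= \sum_s \frac{\partial \nabla_{b^k}\xi_i}{\partial [h_i^k]_s}\frac{\partial [h_i^k]_s}{\partial [h_i^{k-1}]_r}\,[h_i^{(k-1)'}]_r
 = \sum_s \frac{\partial \nabla_{b^k}\xi_i}{\partial [h_i^k]_s}\frac{\partial\big([W^k]_{s\cdot}h_i^{k-1} + [b^k]_s\big)}{\partial [h_i^{k-1}]_r}\,[h_i^{(k-1)'}]_r \\
&= \sum_s \frac{\partial \nabla_{b^k}\xi_i}{\partial [h_i^k]_s}\,[W^k]_{sr}\,[h_i^{(k-1)'}]_r
 = \sum_s \frac{\partial \nabla_{b^k}\xi_i}{\partial [b^k]_s}\,[W^k]_{sr}\,[h_i^{(k-1)'}]_r \\
&= \sum_s [\nabla^2_{b^k}\xi_i]_{\cdot s}\,[W^k]_{sr}\,[h_i^{(k-1)'}]_r
 = \nabla^2_{b^k}\xi_i\,[W^k]_{\cdot r}\,[h_i^{(k-1)'}]_r .
\end{aligned}
$$

Therefore, we have Part 2 $= \mathrm{diag}(h_i^{(k-1)'})\,W^{kT}\,\nabla^2_{b^k}\xi_i\,\big[W^k\,\mathrm{diag}(h_i^{(k-1)'})\big]_{\cdot r}$, and then

$$
\nabla^2_{b^{k-1}}\xi_i = \mathrm{diag}(h_i^{(k-1)'})\,W^{kT}\,\nabla^2_{b^k}\xi_i\,W^k\,\mathrm{diag}(h_i^{(k-1)'}) + \mathrm{diag}\big(h_i^{(k-1)''} \odot (W^{kT}\nabla_{b^k}\xi_i)\big).
$$

Case 2: $t < k$. Similarly,

$$
\frac{\partial^2 \xi_i}{\partial [b^{t-1}]_r\,\partial b^{t-1}}
= \underbrace{\frac{\partial\,\mathrm{diag}(h_i^{(t-1)'})\,W^{tT}}{\partial [b^{t-1}]_r}\,\nabla_{b^t}\xi_i}_{\text{Part 1}}
+ \underbrace{\mathrm{diag}(h_i^{(t-1)'})\,W^{tT}\,\frac{\partial \nabla_{b^t}\xi_i}{\partial [b^{t-1}]_r}}_{\text{Part 2}},
$$

$$
\text{Part 1} = \big([h_i^{(t-1)''}]_r\,[W^{tT}]_{r\cdot}\,\nabla_{b^t}\xi_i\big)\,e_r ,
$$

and

$$
\begin{aligned}
\frac{\partial \nabla_{b^t}\xi_i}{\partial [b^{t-1}]_r}
&= \sum_s \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_s}\frac{\partial \sigma\big([W^t]_{s\cdot}h_i^{t-1} + [b^t]_s\big)}{\partial [h_i^{t-1}]_r}\,[h_i^{(t-1)'}]_r
 = \sum_s \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_s}\,[h_i^{(t)'}]_s\,[W^t]_{sr}\,[h_i^{(t-1)'}]_r \\
&= \sum_s \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_s}\frac{\partial [h_i^t]_s}{\partial [b^t]_s}\,[W^t]_{sr}\,[h_i^{(t-1)'}]_r
 = \sum_s \left( \sum_u \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_u}\frac{\partial [h_i^t]_u}{\partial [b^t]_s} \right)[W^t]_{sr}\,[h_i^{(t-1)'}]_r \\
&= \sum_s \frac{\partial \nabla_{b^t}\xi_i}{\partial [b^t]_s}\,[W^t]_{sr}\,[h_i^{(t-1)'}]_r
 = \sum_s [\nabla^2_{b^t}\xi_i]_{\cdot s}\,[W^t]_{sr}\,[h_i^{(t-1)'}]_r
 = \nabla^2_{b^t}\xi_i\,[W^t]_{\cdot r}\,[h_i^{(t-1)'}]_r .
\end{aligned}
$$

By combining Cases 1 and 2, we have

$$
\nabla^2_{b^{t-1}}\xi_i = \mathrm{diag}(h_i^{(t-1)'})\,W^{tT}\,\nabla^2_{b^t}\xi_i\,W^t\,\mathrm{diag}(h_i^{(t-1)'}) + \mathrm{diag}\big(h_i^{(t-1)''} \odot (W^{tT}\nabla_{b^t}\xi_i)\big),
$$

where $t = k, \ldots, 2$.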
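The bias recursions for the gradient (Supplementary Materials B, item (ii)) and for the Hessian (Supplementary Materials C, item (ii)) can be checked numerically on a tiny network. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses a hypothetical two-layer network ($k = 2$) with a sigmoid hidden layer, a linear output layer, and a softmax cross-entropy loss, and it compares the closed-form expressions with central finite differences. All names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2 = 4, 3, 5                        # hypothetical layer widths
W1, b1 = rng.normal(size=(n1, n0)), rng.normal(size=n1)
W2, b2 = rng.normal(size=(n2, n1)), rng.normal(size=n2)
h0 = rng.normal(size=n0)                    # one input sample
y = np.eye(n2)[1]                           # one-hot target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(b1_):
    h1 = sigmoid(W1 @ h0 + b1_)             # hidden layer with activation
    h2 = W2 @ h1 + b2                       # linear output layer
    return h1, h2

def loss(b1_):
    _, h2 = forward(b1_)
    return np.log(np.sum(np.exp(h2))) - y @ h2   # softmax cross-entropy

def bias_grad_and_hess(b1_):
    """Closed-form gradient and Hessian w.r.t. b^1 via the recursions (ii)."""
    h1, h2 = forward(b1_)
    p = np.exp(h2 - h2.max())
    p /= p.sum()
    g2 = p - y                              # gradient w.r.t. b^2 (equals gradient w.r.t. h^2)
    H2 = np.diag(p) - np.outer(p, p)        # Hessian w.r.t. b^2 (equals Hessian w.r.t. h^2)
    d1 = h1 * (1.0 - h1)                    # sigma'  at the hidden pre-activations
    d2 = d1 * (1.0 - 2.0 * h1)              # sigma'' at the hidden pre-activations
    back = W2.T @ g2
    g1 = d1 * back                          # diag(h') W^T grad
    H1 = d1[:, None] * (W2.T @ H2 @ W2) * d1[None, :] + np.diag(d2 * back)
    return g1, H1

def central_diff(f, x, eps=1e-5):
    """Central finite differences of f along each coordinate of x."""
    cols = []
    for r in range(x.size):
        e = np.zeros_like(x)
        e[r] = eps
        cols.append((f(x + e) - f(x - e)) / (2.0 * eps))
    return np.stack(cols, axis=-1)

g1, H1 = bias_grad_and_hess(b1)
g1_fd = central_diff(loss, b1)
H1_fd = central_diff(lambda b: bias_grad_and_hess(b)[0], b1)
grad_gap = np.max(np.abs(g1 - g1_fd))
hess_gap = np.max(np.abs(H1 - H1_fd))
print(grad_gap, hess_gap)
```

Both gaps should be at the level of finite-difference noise. The Hessian check differentiates the closed-form gradient, so it is only meaningful once the gradient check has passed.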

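(iii) $\nabla^2_{W^t}\xi_i = (h_i^{t-1} \otimes h_i^{(t-1)T}) \otimes \nabla^2_{b^t}\xi_i$, $t = k, \ldots, 1$

First, we have to fix the order of the derivatives for the Hessian matrix because the gradient with respect to $W^t$ is a matrix. Thus, we define

$$
\nabla^2_{W^t}\xi_i = \frac{\partial\,\mathrm{Vec}(\nabla_{W^t}\xi_i)}{\partial\,\mathrm{Vec}(W^t)} .
$$

Next, we derive one block of this Hessian matrix.

Case 1: $t = k$.

$$
\begin{aligned}
\frac{\partial^2 \xi_i}{\partial [W^k]_{\cdot s}\,\partial [W^k]_{\cdot r}}
&= \frac{\partial\big([\nabla_{W^k}\xi_i]_{\cdot r}\big)}{\partial [W^k]_{\cdot s}}
 = \frac{\partial\big([\nabla_{b^k}\xi_i \otimes h_i^{(k-1)T}]_{\cdot r}\big)}{\partial [W^k]_{\cdot s}}
 = \frac{\partial\big([h_i^{k-1}]_r\,\nabla_{b^k}\xi_i\big)}{\partial [W^k]_{\cdot s}} \\
&= [h_i^{k-1}]_r\,\frac{\partial \nabla_{b^k}\xi_i}{\partial [W^k]_{\cdot s}}
 = [h_i^{k-1}]_r \left[ \frac{\partial \nabla_{b^k}\xi_i}{\partial [W^k]_{1s}} \;\cdots\; \frac{\partial \nabla_{b^k}\xi_i}{\partial [W^k]_{n_k s}} \right].
\end{aligned}
$$

Now we focus on the term $\partial \nabla_{b^k}\xi_i / \partial [W^k]_{qs}$ for a given index $q$:

$$
\begin{aligned}
\frac{\partial \nabla_{b^k}\xi_i}{\partial [W^k]_{qs}}
&= \sum_u \frac{\partial \nabla_{b^k}\xi_i}{\partial [h_i^k]_u}\frac{\partial [h_i^k]_u}{\partial [W^k]_{qs}}
 = \sum_u \frac{\partial \nabla_{h_i^k}\xi_i}{\partial [h_i^k]_u}\frac{\partial [h_i^k]_u}{\partial [W^k]_{qs}}
 = \sum_u \frac{\partial \nabla_{h_i^k}\xi_i}{\partial [h_i^k]_u}\frac{\partial\big([W^k]_{u\cdot}h_i^{k-1} + [b^k]_u\big)}{\partial [W^k]_{qs}} \\
&= \frac{\partial \nabla_{h_i^k}\xi_i}{\partial [h_i^k]_q}\frac{\partial\big([W^k]_{q\cdot}h_i^{k-1} + [b^k]_q\big)}{\partial [W^k]_{qs}}
 = [\nabla^2_{h_i^k}\xi_i]_{q\cdot}\,[h_i^{k-1}]_s
 = [h_i^{k-1}]_s\,[\nabla^2_{h_i^k}\xi_i]_{\cdot q} .
\end{aligned}
$$

Combining the aforementioned equations, we obtain

$$
\frac{\partial^2 \xi_i}{\partial [W^k]_{\cdot s}\,\partial [W^k]_{\cdot r}}
= [h_i^{k-1}]_r \left[ \frac{\partial \nabla_{b^k}\xi_i}{\partial [W^k]_{1s}} \;\cdots\; \frac{\partial \nabla_{b^k}\xi_i}{\partial [W^k]_{n_k s}} \right]
= [h_i^{k-1}]_r\,[h_i^{k-1}]_s\,\nabla^2_{h_i^k}\xi_i
= [h_i^{k-1}]_r\,[h_i^{k-1}]_s\,\nabla^2_{b^k}\xi_i .
$$

Therefore, we have

$$
\nabla^2_{W^k}\xi_i = \frac{\partial^2 \xi_i}{\partial\,\mathrm{Vec}(W^k)\,\partial\,\mathrm{Vec}(W^k)} = (h_i^{k-1} \otimes h_i^{(k-1)T}) \otimes \nabla^2_{b^k}\xi_i .
$$

Case 2: $t < k$. Similarly,

$$
\frac{\partial^2 \xi_i}{\partial [W^t]_{\cdot s}\,\partial [W^t]_{\cdot r}}
= [h_i^{t-1}]_r \left[ \frac{\partial \nabla_{b^t}\xi_i}{\partial [W^t]_{1s}} \;\cdots\; \frac{\partial \nabla_{b^t}\xi_i}{\partial [W^t]_{n_t s}} \right].
$$

Then, we focus on the term $\partial \nabla_{b^t}\xi_i / \partial [W^t]_{qs}$ for a given index $q$:

$$
\begin{aligned}
\frac{\partial \nabla_{b^t}\xi_i}{\partial [W^t]_{qs}}
&= \sum_u \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_u}\frac{\partial [h_i^t]_u}{\partial [W^t]_{qs}}
 = \sum_u \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_u}\frac{\partial \sigma\big([W^t]_{u\cdot}h_i^{t-1} + [b^t]_u\big)}{\partial [W^t]_{qs}} \\
&= \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_q}\frac{\partial \sigma\big([W^t]_{q\cdot}h_i^{t-1} + [b^t]_q\big)}{\partial [W^t]_{qs}}
 = \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_q}\,\sigma'\big([W^t]_{q\cdot}h_i^{t-1} + [b^t]_q\big)\,[h_i^{t-1}]_s \\
&= [h_i^{t-1}]_s\,\frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_q}\frac{\partial [h_i^t]_q}{\partial [b^t]_q}
 = [h_i^{t-1}]_s \sum_u \frac{\partial \nabla_{b^t}\xi_i}{\partial [h_i^t]_u}\frac{\partial [h_i^t]_u}{\partial [b^t]_q}
 = [h_i^{t-1}]_s\,\frac{\partial \nabla_{b^t}\xi_i}{\partial [b^t]_q} \\
&= [h_i^{t-1}]_s\,[\nabla^2_{b^t}\xi_i]_{q\cdot}
 = [h_i^{t-1}]_s\,[\nabla^2_{b^t}\xi_i]_{\cdot q} .
\end{aligned}
$$

Combining the aforementioned equations, we obtain

$$
\frac{\partial^2 \xi_i}{\partial [W^t]_{\cdot s}\,\partial [W^t]_{\cdot r}}
= [h_i^{t-1}]_r \left[ \frac{\partial \nabla_{b^t}\xi_i}{\partial [W^t]_{1s}} \;\cdots\; \frac{\partial \nabla_{b^t}\xi_i}{\partial [W^t]_{n_t s}} \right]
= [h_i^{t-1}]_r\,[h_i^{t-1}]_s\,\nabla^2_{b^t}\xi_i .
$$

Therefore, we have

$$
\nabla^2_{W^t}\xi_i = \frac{\partial^2 \xi_i}{\partial\,\mathrm{Vec}(W^t)\,\partial\,\mathrm{Vec}(W^t)} = (h_i^{t-1} \otimes h_i^{(t-1)T}) \otimes \nabla^2_{b^t}\xi_i .
$$

Supplementary Materials D

The difference between the original Hessian and the expectation-approximate Hessian in the $(t-1)$th layer is

$$
\begin{aligned}
&\mathbb{E}_i\Big[\mathrm{diag}(h_i^{(t-1)'})\,W^{tT}\,\nabla^2_{b^t}\xi_i\,W^t\,\mathrm{diag}(h_i^{(t-1)'}) + \mathrm{diag}\big(h_i^{(t-1)''} \odot (W^{tT}\nabla_{b^t}\xi_i)\big)\Big] \\
&\quad - \mathbb{E}_i\Big[\mathrm{diag}(h_i^{(t-1)'})\,W^{tT}\,\mathbb{E}_i[\nabla^2_{b^t}\xi_i]\,W^t\,\mathrm{diag}(h_i^{(t-1)'}) + \mathrm{diag}\big(h_i^{(t-1)''} \odot (W^{tT}\nabla_{b^t}\xi_i)\big)\Big] \\
&= \mathbb{E}_i\Big[(W^{tT}\,\nabla^2_{b^t}\xi_i\,W^t) \odot (h_i^{(t-1)'} \otimes h_i^{(t-1)'T})\Big] - \big(W^{tT}\,\mathbb{E}_i[\nabla^2_{b^t}\xi_i]\,W^t\big) \odot E^{(t-1)'}_{hh^T} \\
&= \text{Ele-Cov}\big(W^{tT}\,\nabla^2_{b^t}\xi_i\,W^t,\; h_i^{(t-1)'} \otimes h_i^{(t-1)'T}\big). \qquad (15)
\end{aligned}
$$

To bound Eq. (15), we use the covariance inequality $\mathrm{Cov}(X, Y)^2 \le \mathrm{Var}(X)\,\mathrm{Var}(Y)$, which is implied by the Cauchy-Schwarz inequality. Thus, given indices $\mu$ and $\nu$, we have

$$
\mathrm{Cov}\big([W^{tT}\nabla^2_{b^t}\xi_i W^t]_{\mu\nu},\,[h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]_{\mu\nu}\big)^2
\le \mathrm{Var}\big([W^{tT}\nabla^2_{b^t}\xi_i W^t]_{\mu\nu}\big)\,\mathrm{Var}\big([h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]_{\mu\nu}\big).
$$

Therefore,

$$
\begin{aligned}
\Big\| \text{Ele-Cov}\big(W^{tT}\nabla^2_{b^t}\xi_i W^t,\; h_i^{(t-1)'} \otimes h_i^{(t-1)'T}\big) \Big\|_F^2
&= \sum_{\mu,\nu} \mathrm{Cov}\big([W^{tT}\nabla^2_{b^t}\xi_i W^t]_{\mu\nu},\,[h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]_{\mu\nu}\big)^2 \\
&\le \sum_{\mu,\nu} \mathrm{Var}\big([W^{tT}\nabla^2_{b^t}\xi_i W^t]_{\mu\nu}\big)\,\mathrm{Var}\big([h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]_{\mu\nu}\big). \qquad (16)
\end{aligned}
$$

Clearly, $\mathrm{Var}([h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]_{\mu\nu})$ can be controlled more easily than $\mathrm{Var}([W^{tT}\nabla^2_{b^t}\xi_i W^t]_{\mu\nu})$. To begin, we suppose the activation function $\sigma$ is $L$-Lipschitz continuous, i.e., $|\sigma(x) - \sigma(y)| \le L|x - y|$, which implies $|\sigma'(x)| \le L$. Thus,

$$
\begin{aligned}
\mathrm{Var}\big([h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]_{\mu\nu}\big)
&= \mathbb{E}_i\big[[h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]_{\mu\nu}^2\big] - \mathbb{E}_i\big[[h_i^{(t-1)'} \otimes h_i^{(t-1)'T}]_{\mu\nu}\big]^2 \\
&= \mathbb{E}_i\big[[h_i^{(t-1)'}]_\mu^2\,[h_i^{(t-1)'}]_\nu^2\big] - \mathbb{E}_i\big[[h_i^{(t-1)'}]_\mu\,[h_i^{(t-1)'}]_\nu\big]^2 \\
&\le \mathbb{E}_i\big[[h_i^{(t-1)'}]_\mu^2\,[h_i^{(t-1)'}]_\nu^2\big]
 \le \mathbb{E}_i[L^2 L^2] = L^4. \qquad (17)
\end{aligned}
$$

Finally, we combine Eq. (16) and Eq. (17) to obtain

$$
\Big\| \text{Ele-Cov}\big(W^{tT}\nabla^2_{b^t}\xi_i W^t,\; h_i^{(t-1)'} \otimes h_i^{(t-1)'T}\big) \Big\|_F^2
\le L^4 \sum_{\mu,\nu} \mathrm{Var}\big([W^{tT}\nabla^2_{b^t}\xi_i W^t]_{\mu\nu}\big),
$$

and the proof is done. As an example, the sigmoid function is 0.25-Lipschitz, that is, $L_{\text{sigmoid}} = 0.25$. Hence, when we use the sigmoid as the activation function in FCNNs, we have the bound

$$
0.00390625 \cdot \sum_{\mu,\nu} \mathrm{Var}\big([W^{tT}\nabla^2_{b^t}\xi_i W^t]_{\mu\nu}\big)
$$

for the errors from the backpropagation of the second derivatives in the $(t-1)$th layer.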
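The constant in this bound can be reproduced with a few lines of NumPy. The sketch below is a sanity check under assumptions, not part of the original proof: it evaluates the sigmoid derivative on a grid to confirm that its maximum slope is 0.25, computes $L^4$, and checks on hypothetical random pre-activations that the entry-wise variance of $h' \otimes h'^{T}$ stays below $L^4$, as Eq. (17) requires. The sample size, width, and scale are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Lipschitz constant of the sigmoid: the maximum of its derivative, attained at 0.
L = sigmoid_prime(0.0)
grid = np.linspace(-20.0, 20.0, 100001)
max_slope = sigmoid_prime(grid).max()

# Constant multiplying the variance sum in the final bound.
bound_constant = L ** 4

# Hypothetical pre-activations: 10000 samples of a 6-unit layer.
rng = np.random.default_rng(0)
z = rng.normal(scale=3.0, size=(10000, 6))
d = sigmoid_prime(z)                         # per-sample h'
outer = d[:, :, None] * d[:, None, :]        # per-sample h' (outer product) h'^T
max_entry_var = outer.var(axis=0).max()      # largest entry-wise variance over samples

print(L, bound_constant, max_entry_var)
```

Because every entry of $h' \otimes h'^{T}$ lies in $[0, L^2]$, its variance is in practice far below $L^4$, so Eq. (17) is a loose but simple bound.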