

THOR: Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation

Mengyun Chen1*, Kai-Xin Gao2*, Xiao-Lei Liu2*, Zidong Wang1*, Ningxi Ni1*, Qian Zhang3*, Lei Chen4†, Chao Ding5, Zheng-Hai Huang2, Min Wang1, Shuangling Wang1, Fan Yu1, Xinyuan Zhao3, Dachuan Xu3

1Huawei Technologies Co. Ltd, 2Tianjin University, 3Beijing University of Technology, 4Hong Kong University of Science and Technology, 5Chinese Academy of Sciences

1chenmengyun1, wang1, niningxi, wangmin106, wangshuangling1, [email protected]; 2gaokaixin, liuxiaolei, [email protected]; 3zhangqian, xyzhao, [email protected]; [email protected]; [email protected]

* Equal contribution. † Corresponding author.

Abstract

It is well-known that second-order optimizers can accelerate the training of deep neural networks; however, the huge computation cost of second-order optimization makes it impractical to apply in real practice. In order to reduce the cost, many methods have been proposed to approximate a second-order matrix. Inspired by KFAC, we propose a novel Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation method, called THOR, to make second-order optimization applicable in real application models. Specifically, we gradually increase the update interval and use the matrix trace to determine which blocks of the Fisher Information Matrix (FIM) need to be updated. Moreover, by resorting to the power of hardware, we have designed a Hardware-driven approximation method for computing the FIM to achieve better performance. To demonstrate the effectiveness of THOR, we have conducted extensive experiments. The results show that training ResNet-50 on ImageNet with THOR only takes 66.7 minutes to achieve a top-1 accuracy of 75.9% under an 8 Ascend 910 environment with MindSpore, a new deep learning computing framework. Moreover, with more computational resources, THOR takes only 2.7 minutes to reach 75.9% with 256 Ascend 910.

1 Introduction

Recently, deep learning has made significant progress in various computer vision and natural language processing applications. However, with the increase of the complexity of models, tons of parameters need to be trained. For example, according to [Devlin et al. 2018] and [He et al. 2016], training BERT (over 340 million parameters) and ResNet-50 (over 23 million trainable parameters) takes around 3 days on 16 TPUv3 and 29 hours on 8 Tesla P100 GPUs, respectively.
Therefore, many efforts have been made to propose optimization solutions that reduce the training time. Among all the proposed optimization techniques, the most popular and promising one is Stochastic Gradient Descent (SGD) [Robbins and Monro 1951], which is a first-order optimization algorithm. Specifically, SGD tries to minimize an objective J(θ) with respect to the parameters θ, i.e., θ is updated as θ ← θ − α∇_θ J(θ), where ∇_θ J(θ) is the gradient and α represents the learning rate. Using SGD to optimize the parameters faces two challenges: 1) it is difficult to choose a proper learning rate, and 2) it is hard to escape saddle points. Therefore, many variants of SGD such as Momentum [Qian 1999], AdaGrad [Zeiler 2012], Adam [Kingma and Ba 2014], etc. have been introduced in the past two decades. Though choosing the learning rate becomes easier in these variants, they still cannot escape saddle points when the objective function is non-convex, which is often the case in real application models.

To address the challenging issues encountered by SGD, it is natural to think of using a second-order optimizer, since it can avoid saddle points and, most importantly, it can accelerate convergence by using curvature information. Specifically, the parameters θ are updated by θ ← θ − αG^{-1}∇_θ J(θ), where G^{-1} is the inverse of the second-order information matrix G. The definitions of G in different second-order optimization algorithms are not the same. Common second-order optimization algorithms include Newton's method and natural gradient descent, where the second-order information matrix G is the Hessian matrix (HM) and the Fisher information matrix (FIM), respectively. The biggest challenge in using a second-order optimizer is that its computation cost increases cubically and its space cost increases quadratically compared to SGD. Therefore, it is quite impractical to compute the inverse of the second-order information matrix directly.

To reduce the computation cost of the second-order optimizer, quite a few approximation approaches have been proposed. For instance, for Newton's method, Quasi-Newton methods [Nocedal and Wright 2006] can be used to approximate the inverse of the HM. One of the advantages of these methods over the classical Newton method is that the HM does not need to be inverted explicitly. In particular, the Limited-memory BFGS (L-BFGS) method [Zhu et al. 1997] has been implemented and used to speed up the training process in Deep Neural Networks (DNN) (e.g., [Le et al. 2011]). Other structured stochastic Quasi-Newton methods have also been developed and studied recently in [Keskar and Berahas 2016, Berahas, Jahani, and Takáč 2019]. Another class of Newton-type second-order methods is the Hessian-Free optimization method [Martens 2010, Kiros 2013, Pan, Innanen, and Liao 2017], in which matrix-free conjugate-gradient (CG) algorithms are used to approximate the true Hessian matrix. However, these CG algorithms usually require many iterations to reach the desired accuracy, in particular for ill-conditioned cases.

Unlike the Newton-type methods, Kronecker-factored Approximate Curvature (KFAC) [Martens and Grosse 2015, Grosse and Martens 2016, Martens, Ba, and Johnson 2018] is a second-order method based on the natural gradient method. More precisely, in KFAC, one computes the inverse of the FIM by computationally tractable approximations such as the block-diagonal approximation and the block-tridiagonal approximation. [George et al. 2018] have introduced an Eigenvalue-corrected Kronecker Factorization (EKFAC) which can approximate the FIM much better than KFAC does. [Osawa et al. 2019, 2020] have demonstrated that KFAC is efficient in large-scale distributed computing for deep neural networks. Overall, among all these methods, the approximation scheme for the inverse of the FIM is crucial for improving the efficiency of the second-order optimizer, since the current exact strategies still require significant computing power in practice.
To address the issue of inefficiently computing the FIM, in this paper we propose an efficient second-order optimization method based on the natural gradient, named Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation (THOR), to compute the FIM. Firstly, we observe from experiments that the FIM for each layer usually changes rapidly in the first few iterations and then tends to be stable. Therefore, it is reasonable to increase the update interval of the inverse of the FIM in a proper manner without loss of convergence rate. Secondly, we make a further decision on which blocks of the FIM need to be updated. Thirdly, we introduce a new approximation scheme that uses a Hardware-driven matrix splitting scheme to approximate the FIM, which can be regarded as finding an optimal tradeoff point between the computational efficiency and the information loss of the FIM. Overall, the contributions of our work can be summarized as follows:

• Under the assumption that the FIM converges to a stationary distribution, we gradually increase the update interval of the inverse of the FIM to save overall computational time.

• Instead of using the Frobenius-norm-based updating rule proposed in [Osawa et al. 2019], we introduce a more computationally tractable trace-based updating rule for the FIM of each layer.

• We approximate the block-diagonal matrix of KFAC by a smaller matrix obtained by splitting the matrix dimensions, which trades a loss of FIM information for efficient computation.

• Last but not least, with THOR we are able to train ResNet-50 on ImageNet in 66.7/4.1/2.7 minutes with a top-1 accuracy of 75.9% using 8/128/256 Ascend 910 on MindSpore.1

1 MindSpore: https://www.mindspore.cn/.

2 Background and Notations

The purpose of deep neural network training is to find a set of model parameters θ ∈ R^n to minimize the loss function J(θ). Given the cross-entropy loss function:

    J(θ) = E[− log p(y|x, θ)],    (1)

where x, y are the training input and label, and p(y|x, θ) represents the density function of a predictive distribution P_{y|x}.

2.1 The Natural Gradient

Our algorithm is based on the natural gradient proposed by [Amari 1998]. The natural gradient gives the steepest descent direction of the target function when the parameter space has a Riemannian metric structure. In other words, it gives the largest change of the loss function per unit change of the model, where the distance between the distributions P_θ and P_{θ+δθ} is measured with the K-L divergence. More recent discussions of the natural gradient can be found in [Martens 2014, Ollivier et al. 2017]. The natural gradient is typically defined as F^{-1}∇_θ J(θ), where F ∈ R^{n×n} is the FIM. With the predictive distribution defined as P_{y|x}, the FIM is formulated as

    F = E[∇_θ log p(y|x, θ) ∇_θ log p(y|x, θ)^T].    (2)

It is impractical to compute the inverse of the FIM directly in a deep neural network, since it has over millions of parameters.

2.2 KFAC

KFAC is an efficient method for approximating the natural gradient, which approximates the FIM by block-diagonal or block-tridiagonal matrices. Based on a well-founded motivation and a rigorous mathematical derivation, it has elegantly settled the problem of the complex computation required for inverting the second-order information matrix. [Osawa et al. 2019] have shown that block-diagonal KFAC gives good results on large-scale DNNs and that block-diagonal KFAC computes more efficiently than the block-tridiagonal variant. Thus, we focus on block-diagonal KFAC to approximate the FIM in this work.

KFAC is a two-step approximation method. In the first step, KFAC decomposes the FIM into block matrices according to the layers of the neural network, by assuming that the parameters of different layers are independent. The calculation of the inverse of the FIM is then simplified to the inverse of these small blocks. In the second step, these block matrices are further approximated by the Kronecker product of two much smaller matrices, which we call Kronecker factors. Since the inverse of the Kronecker product of two matrices equals the Kronecker product of their inverses, and these two smaller matrices are easier to compute and invert than the entire block matrix, KFAC greatly simplifies the FIM calculation.

Consider a deep neural network with l layers. Denote the outputs of the i-th layer as s_i, the inputs of the i-th layer as a_{i−1}, which are the activations of the previous layer, and let θ_i be the weight vector of the i-th layer.

In the first step, KFAC approximates the FIM as a block-diagonal matrix:

    F ≈ diag(F_1, F_2, ..., F_l) = diag(E[D_{θ_1} D_{θ_1}^T], E[D_{θ_2} D_{θ_2}^T], ..., E[D_{θ_l} D_{θ_l}^T]),    (3)

where D_θ = −d log p(y|x, θ)/dθ.
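To make Eqs. (2)–(3) concrete, the sketch below estimates the empirical FIM of a tiny two-layer softmax classifier from per-sample gradients of log p(y|x, θ) and compares it with its layer-wise block-diagonal approximation. This is an illustrative NumPy toy: the model, sizes, and the sampling of y from the model's own predictive distribution are our assumptions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, n_samples = 4, 3, 2, 512

# Tiny 2-layer network: logits = W2 @ tanh(W1 @ x); parameters θ = [vec(W1), vec(W2)].
W1 = rng.normal(scale=0.5, size=(d_hid, d_in))
W2 = rng.normal(scale=0.5, size=(d_out, d_hid))

def predict(x):
    h = np.tanh(W1 @ x)
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

def grad_log_p(x, y):
    """Gradient of log p(y|x, θ) for a softmax output, concatenated over layers."""
    h, p = predict(x)
    e = np.eye(d_out)[y] - p                 # d log p / d logits
    gW2 = np.outer(e, h)                     # layer-2 gradient
    gW1 = np.outer((W2.T @ e) * (1 - h**2), x)  # layer-1 gradient (tanh backprop)
    return np.concatenate([gW1.ravel(), gW2.ravel()])

# Empirical FIM, Eq. (2): expectation of the outer product of the score,
# with y drawn from the model's own predictive distribution.
grads = []
for _ in range(n_samples):
    x = rng.normal(size=d_in)
    _, p = predict(x)
    y = rng.choice(d_out, p=p)
    grads.append(grad_log_p(x, y))
G = np.stack(grads)
F = G.T @ G / n_samples                      # full FIM estimate

# Block-diagonal approximation, Eq. (3): keep only the per-layer blocks.
n1 = d_hid * d_in
F_block = np.zeros_like(F)
F_block[:n1, :n1] = F[:n1, :n1]
F_block[n1:, n1:] = F[n1:, n1:]

rel_err = np.linalg.norm(F - F_block) / np.linalg.norm(F)
print(f"relative weight of the discarded cross-layer blocks: {rel_err:.3f}")
```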
In the second step, each block of the FIM can be rewritten as

    F_i = E[D_{θ_i} D_{θ_i}^T] = E[a_{i−1} a_{i−1}^T ⊗ g_i g_i^T] ≈ E[a_{i−1} a_{i−1}^T] ⊗ E[g_i g_i^T] = A_{i−1} ⊗ G_i,    (4)

where ⊗ denotes the Kronecker product, g_i = D_{s_i}, A_{i−1} = E[a_{i−1} a_{i−1}^T] and G_i = E[g_i g_i^T]. Since (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1} for any invertible matrices A and B, we can compute the block-diagonal FIM easily as

    F_i^{-1} = (A_{i−1} ⊗ G_i)^{-1} = A_{i−1}^{-1} ⊗ G_i^{-1}.    (5)

Furthermore, KFAC uses the damping technique of [Martens and Grosse 2015] by adding λI to the Kronecker factors. Finally, the weight vector θ_i of the i-th layer can be updated as follows:

    θ_i^{(k+1)} ← θ_i^{(k)} − α((A_{i−1}^{(k)} + λI)^{-1} ⊗ (G_i^{(k)} + λI)^{-1}) ∇_{θ_i} J^{(k)},    (6)

where α represents the learning rate.

3 THOR

As mentioned in the Introduction, although KFAC can accelerate convergence, it still has no advantage in overall training time compared with first-order optimizers, due to the high computation cost of the Kronecker product. To address this problem, we propose a novel algorithm called Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation (THOR). In THOR, we first use a gradually increasing update interval for updating the inverse of the FIM. Second, instead of updating the whole inverse of the FIM, we further determine which matrix blocks to update, guided by trace-based rules. Finally, by utilizing the power of MindSpore and Ascend 910, we trade a little loss of the FIM for efficiently approximating the matrix blocks. The detailed steps of the THOR optimizer are given in Algorithm 1.

Algorithm 1 THOR Optimizer
Require: T_FIM, T_INV: the update intervals of the FIM and its inverse matrix
Require: ω_1, ω_2: two positive threshold parameters used in Eq. (9)
Require: size: the split dimension of the FIM
Require: α: the learning rate
Require: λ: the damping parameter
k ← 0
while convergence is not reached do
    for i = 1 to l do
        if k ≡ 0 (mod T_FIM) then
            Update the factors A_{i−1} and G_i
        end if
        if k ≡ 0 (mod T_INV) then
            Compute ∆^k using Eq. (8)
            if F_i^k is to be updated according to Eq. (9) then
                Use size to split the factors A_{i−1} and G_i according to Eq. (10)
                Update the inverses of the split matrices Â_{i−1}^{-1} and Ĝ_i^{-1}
            end if
        end if
        θ_i^{(k+1)} ← θ_i^{(k)} − α((Â_{i−1}^{(k)} + λI)^{-1} ⊗ (Ĝ_i^{(k)} + λI)^{-1}) ∇_{θ_i} J^{(k)}
    end for
    k ← k + 1
end while
return θ
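The key practical point in Eqs. (5)–(6) and in the update step of Algorithm 1 is that the Kronecker-factored preconditioner never has to be materialized: applying (A + λI)^{-1} ⊗ (G + λI)^{-1} to the flattened gradient is equivalent to two small linear solves against the gradient reshaped as a matrix. The following NumPy sketch checks this identity numerically; the shapes and the row-major vec convention are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
dim_a, dim_g, lam, alpha = 6, 4, 0.03, 0.1

# Symmetric positive semi-definite Kronecker factors, as in Eq. (4).
A = rng.normal(size=(dim_a, dim_a)); A = A @ A.T / dim_a   # A_{i-1} = E[a a^T]
G = rng.normal(size=(dim_g, dim_g)); G = G @ G.T / dim_g   # G_i     = E[g g^T]
grad_W = rng.normal(size=(dim_a, dim_g))                   # layer gradient, reshaped

A_damp = A + lam * np.eye(dim_a)
G_damp = G + lam * np.eye(dim_g)

# Naive route: build the full Kronecker-factored preconditioner explicitly.
precond = np.kron(np.linalg.inv(A_damp), np.linalg.inv(G_damp))
step_naive = (precond @ grad_W.ravel()).reshape(dim_a, dim_g)

# Factored route: (A^{-1} ⊗ G^{-1}) vec(V) = vec(A^{-1} V G^{-T}) for row-major vec.
step_factored = np.linalg.solve(A_damp, grad_W) @ np.linalg.inv(G_damp).T

assert np.allclose(step_naive, step_factored)

# One damped update in the spirit of Eq. (6): W ← W − α * preconditioned gradient.
W = rng.normal(size=(dim_a, dim_g))
W_new = W - alpha * step_factored
print("max |naive - factored| =", np.abs(step_naive - step_factored).max())
```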

3.1 Update with trace

In order to reduce the computation and achieve faster training, KFAC and its variants all reduce the frequency of computing the FIM and its inverse matrix [Martens and Grosse 2015, Grosse and Martens 2016, George et al. 2018, Zhang et al. 2018, Osawa et al. 2019]; they update the FIM and its inverse every few iterations. In particular, [Osawa et al. 2019] discussed the change rate of the FIM on the ResNet-50 network for ImageNet classification and adopted a heuristic scheme that further reduces the update rate after 500 iterations to accelerate training. However, this scheme is not the optimal choice: a fixed update interval is not very profitable in the later stage of training. In other words, the number of updates in later training is still very large, which costs many computing resources but cannot greatly improve the training effect. Therefore, we propose a new updating scheme in this subsection.

Figure 1: Changes of the Frobenius norm of the FIM. We choose the fully-connected layer and three different convolution layers when training the CIFAR-10 dataset on ResNet-18 using KFAC. We record data every 20 iterations.

Figure 1 illustrates the changes of the Frobenius norm of the FIM at each layer. We can clearly observe that the FIM of each layer changes rapidly in the first few iterations and then tends to be stable. Based on existing research [Martens and Grosse 2015, Grosse and Martens 2016, Osawa et al. 2019] and our experiments, it is reasonable to assume that {F^k}_{k=1}^n is a Markov process converging to a stationary distribution π, where F^k represents the FIM updated at the k-th iteration. Under this assumption, we can gradually increase the update interval of the FIM and its inverse matrix during training. However, as shown in Figure 1, some layers tend to stabilize faster than others, so it is too rough to set the same update interval for all blocks of the FIM. Therefore, it is more reasonable to select which blocks of the FIM need to be updated. Moreover, we can stop updating the FIM and its inverse matrix for a layer once its FIM becomes stable. For example, if we stop updating the FIM after the k-th iteration for the i-th layer, then the parameters are computed by θ_i^{(k+t)} = θ_i^{(k+t−1)} − α(F_i^{(k)})^{-1} ∇_{θ_i} J^{(k+t−1)}, t = 1, 2, ⋯.

To determine whether to update or stop updating, we introduce an adaptive trace-based updating rule. In [Osawa et al. 2019], the Frobenius norm ‖·‖_F is used to estimate the changes of the FIM for each layer, which does not have good scalability and may not be suitable for large-scale tasks. However, it is well known that for any matrix X, the relationship between its Frobenius norm ‖X‖_F and its nuclear norm ‖X‖_* can be expressed as follows:

    ‖X‖_F ≤ ‖X‖_* ≤ √r ‖X‖_F,    (7)

where r = rank(X) and ‖·‖_* is the nuclear norm of a matrix, i.e., the sum of its singular values [Recht, Fazel, and Parrilo 2010, Srebro, Rennie, and Jaakkola 2005]. It is also well known that for any matrix X, the absolute value of the trace |tr(X)| is smaller than or equal to the nuclear norm ‖X‖_*, with equality if X is positive semidefinite. Therefore, |tr(X)| can also be used to estimate the changes of the FIM for each layer. More importantly, the computational cost of |tr(X)| is linear, which means it has much better scalability. Therefore, in THOR, for the i-th layer we define the following relative change rate:

    ∆^k = | ( |tr(F_i^k + λI)| − |tr(F_i^{k−1} + λI)| ) / |tr(F_i^{k−1} + λI)| |.    (8)

Then, we adopt the following trace-based updating scheme for the FIM and its inverse of each layer, based on the relative change rate ∆^k:

    update F_i^k,                                         if ∆^k ∈ (ω_1, +∞),
    do not update F_i^k and set F_i^k = F_i^{k−1},        if ∆^k ∈ [ω_2, ω_1],
    stop updating F_i and set F_i^{k+t} ≡ F_i^{k−1}
        for all t = 1, 2, ...,                            if ∆^k ∈ [0, ω_2),    (9)

where ω_1 and ω_2 are two given positive threshold parameters.

Figure 2: Change rate vs. iterations on ResNet-18. We choose three different convolution layers and the fully-connected layer when training the CIFAR-10 dataset using KFAC. We record data every 20 iterations.

Table 1: The example of trade-off.

split dimension              1       16      32      64      128     256     512     1024    2048
matrix number (L < 1%)       4       4       5       9       13      23      32      46      48
performance (µs)             35      59      83      121     212     534     1418    3738    9824
normalized loss of matrix    0.0741  0.0741  0.0926  0.1667  0.2407  0.4259  0.5926  0.8519  0.8889
normalized performance       1       0.6014  0.4276  0.2939  0.1669  0.0664  0.0250  0.0095  0.0036

In Figure 2 and Figure 3, we demonstrate the changes of ∆^k for some layers of two different networks. It can be seen clearly that ∆^k is relatively large at the beginning and then fluctuates around a relatively fixed small value after a few iterations. For most layers, ∆^k lies in the interval (0.001, 0.01), and for some layers it fluctuates around 0.001. Therefore, we recommend the choices ω_1 = 0.01 and ω_2 = 0.001, which have performed well for training, as confirmed by the experiments in Section Experiments. We believe that it is reasonable to increase the update interval of F_i if ∆^k ∈ [0.001, 0.01], and to stop updating F_i if ∆^k ∈ [0, 0.001).
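A minimal sketch of the trace-based decision in Eqs. (8)–(9), using the recommended thresholds ω_1 = 0.01 and ω_2 = 0.001. The per-layer bookkeeping (storing the previous trace and a "stopped" flag) is our own illustrative structure, not code from the paper.

```python
import numpy as np

OMEGA1, OMEGA2 = 0.01, 0.001   # recommended thresholds from Section 3.1

class TraceRule:
    """Decide per layer whether to refresh F_i (and its inverse) at iteration k."""
    def __init__(self, damping):
        self.damping = damping
        self.prev_trace = None     # |tr(F_i^{k-1} + λI)|
        self.stopped = False       # set once Δ^k falls below ω_2

    def should_update(self, F_i):
        if self.stopped:
            return False
        trace = abs(np.trace(F_i) + self.damping * F_i.shape[0])  # tr(F_i + λI)
        if self.prev_trace is None:          # first call: always update
            self.prev_trace = trace
            return True
        delta = abs(trace - self.prev_trace) / self.prev_trace    # Eq. (8)
        if delta > OMEGA1:                   # still changing a lot: update F_i
            self.prev_trace = trace
            return True
        if delta < OMEGA2:                   # essentially stable: stop updating
            self.stopped = True
        return False                         # otherwise keep the previous F_i^{k-1}

# Usage on a toy sequence of factors whose changes shrink over time.
rng = np.random.default_rng(2)
rule = TraceRule(damping=0.03)
F = np.eye(8)
for k in range(10):
    F = F + rng.normal(scale=0.5 / (k + 1) ** 2, size=(8, 8))
    print(k, rule.should_update(F))
```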
fluctuates around a relatively fixed small value after a few it- In order to compute FIM more efficiently, in our algorithm, erations. For most layers, ∆k lies in the interval (0.001, 0.01) we further split the input of the i-th layer’s into j group- and fluctuates around 0.001 for some layers. Therefore, we s vectors, i.e., a(i−1)1, a(i−1)2, ..., a(i−1)j and assume that provide a recommendation of the choices ω1 = 0.01 and different groups a(i−1)s and a(i−1)t are independent. As a ω2 = 0.001, which have performed well for training, con- consequence, the outputs of the i-th layer’s block split, de- firmed by experiments in Section Experiments. We believe noted as si1, si2, . . ., sij, are also independent. Under the that it is reasonable to increase the update interval of Fi if independent assumption, we can approximate the Kronecker k k ∆ ∈ [0.001, 0.01], and stop updating Fi if ∆ ∈ [0, 0.001). factors Ai−1 and Gi for computing the i-th FIM block Fi by Table 2: The computational result of ResNet-18 on CIFAR-10

SGD KFAC THOR THOR stop THOR NT Best Test Acc 94.31% 94.42% 95.00% 95.09% 94.40% Time Per Epoch 13.29s 65.01s 17.64s 17.16s 18.09s Time(93%) 809.51s 1704.261s 656.84s 622.92s 670.95s Time(94%) 889.154s 4032.43s 1139.11s 1092.24s 1155.28s Time(95%) NaN NaN 1555.54s 1350.72s NaN

tion loss L is measured by the spectral norm, which is defined conv-1 conv-2 as follows: s λ (AˆAˆT ) change of trace change of trace max (11) L = 1 − T . λmax(AA ) iteration iteration where λ (·) is the largest eigenvalue of the matrix, A is (a) (b) max the original matrix, and Aˆ is the split matrix. Second, we count the number of the matrices whose infor- conv-3 fc mation loss L is below 1% in each predefined split dimension. Finally, these counts are normalized by dividing the total

change of trace change of trace number of matrices. As for the computation efficiency (performance), we first

iteration iteration measure the time it costs to invert the matrix of each shape (c) (d) in the predefined split dimensions on the Ascend 910/GPU. Then, the normalized performance of a specific split dimen- sion equals to the performance data of the matrix with split Figure 3: Change rate vs iterations on ResNet-50. We choose three on the first dimension divided by the performance data of different convolution layers and the fully-connected layer when matrix with the specific split dimension, which is defined as training the ImageNet dataset using K-FAC. We record data every 200 iterations. follows:

p1 normalizedn = . (12) the following block diagonal matrices: pn  ˆ T T where normalizedn is the normalized performance of a Ai−1 ≈ diag E[a(i−1)1a(i−1)1], E[a(i−1)2a(i−1)2], specific split dimension n, p1 is the performance data of the T  matrix with split on the fist dimension. ..., E[a(i−1)ja(i−1)j] , For example, on ResNet-50 with Ascend 910, we set split ˆ  T T T  dimension list as [1, 16, 32, 64, 128, 256, 512, 1024, 2048] Gi ≈ diag E[Dsi1 Ds ], E[Dsi2 Ds ],..., E[Dsij Ds ] i1 i2 ij and the total number of Kronecker factors A is 54. The rel- T T T  ≈ diag E[gi1gi1], E[gi2gi2],..., E[gijgij] . evant data are reported in Table1. Finally, Figure5 shows (10) the normalized data in Table1. We can find the intersection point is (106, 0.21), which represents the trade-off between In Figure4, we compare the difference between the K-FAC the computation efficiency and the loss of the matrix. Thus, block-diagonal approximation F˜ (Figure4(a)) and the pro- we choose the split dimension as 128 which is the closest posed splitting approximation Fˆ (Figure4(b)). We calculate point to the intersection point in the split dimension list. the errors of two approximations F˜ and Fˆ, which are around 5% after 10 iterations. Interestingly, the relative difference 4 Experiments between two approximations F˜ and Fˆ reduces to 1% after To test the performance, we apply THOR to train ResNet-18 50 iterations. One possible reason is that the independence for CIFAR-10 and ResNet-50 for ImageNet. In these experi- assumption is more likely to be satisfied when the proportion ments, we implement our method in three variants: THOR, of elements on the diagonal increases. Obviously, the smaller THOR stop with and THOR NT without trace- the split dimension, the less time cost on computation (bet- based updating rule. We have compared THOR, THOR stop ter efficiency), but the larger information loss compared to and THOR NT with KFAC and SGD with momentum on original matrix. Therefore, the group number j is a trade-off CIFAR-10. However, we only compared THOR, THOR stop between the information loss and computation efficiency. and THOR NT with SGD on ImageNet, since KFAC needs to The process of calculating the information loss(loss of calculate the inverse of FIM on large model,such as ResNet- matrix) is given as follows. First, we set the tolerable infor- 50 which cannot finish the training in a reasonable time. For mation loss to 1% which means the split matrix contains 99% example, KFAC takes 2s to calculate the FIM inversion while information of the original Kronecker factors. The informa- THOR only takes 200ms on Tesla V100. Please note that we Table 3: Hyper-parameters of our methods on ImageNet
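To make the splitting scheme of Section 3.2 concrete, the sketch below splits a Kronecker factor into diagonal blocks of a given split dimension (in the spirit of Eq. (10)), measures the spectral-norm information loss of Eq. (11), and inverts the damped blocks independently. The matrix sizes and the random test factor are our own assumptions for illustration.

```python
import numpy as np

def split_blocks(A, split_dim):
    """Keep only diagonal blocks of size `split_dim`, as in Eq. (10)."""
    n = A.shape[0]
    A_hat = np.zeros_like(A)
    for start in range(0, n, split_dim):
        end = min(start + split_dim, n)
        A_hat[start:end, start:end] = A[start:end, start:end]
    return A_hat

def information_loss(A, A_hat):
    """Spectral-norm loss of Eq. (11): 1 - sqrt(λmax(Â Âᵀ) / λmax(A Aᵀ))."""
    lmax = lambda M: np.linalg.eigvalsh(M @ M.T).max()
    return 1.0 - np.sqrt(lmax(A_hat) / lmax(A))

def inverse_by_blocks(A_hat, split_dim, damping):
    """Invert the damped factor block by block instead of as one big matrix."""
    n = A_hat.shape[0]
    inv = np.zeros_like(A_hat)
    for start in range(0, n, split_dim):
        end = min(start + split_dim, n)
        block = A_hat[start:end, start:end] + damping * np.eye(end - start)
        inv[start:end, start:end] = np.linalg.inv(block)
    return inv

# Toy Kronecker factor A = E[a aᵀ] with correlated features (an assumption).
rng = np.random.default_rng(3)
a = rng.normal(size=(4096, 256)) @ rng.normal(size=(256, 256))
A = a.T @ a / a.shape[0]

for split_dim in [16, 64, 128, 256]:
    A_hat = split_blocks(A, split_dim)
    print(f"split {split_dim:4d}: loss L = {information_loss(A, A_hat):.4f}")

A_inv_hat = inverse_by_blocks(split_blocks(A, 128), 128, damping=0.03)
```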

Table 3: Hyper-parameters of our methods on ImageNet.

                          learning rate                                     damping
                     α_warmup   α_target   e_warmup   e_end   p_decay      λ_0    ρ_decay
BS=256   THOR        -          0.045      -          70      6            0.03   0.87
BS=256   THOR_stop   -          0.050      -          70      6            0.03   0.87
BS=256   THOR_NT     -          0.045      -          72      6            0.03   0.87
BS=4096  THOR        0.005      0.45       5          55      6            0.3    0.2
BS=8192  THOR        0.01       0.8        5          48      6            0.6    0.3


Figure 4: A comparison between the KFAC block-diagonal F̃ and the Hardware-driven split matrix F̂. We use a deep neural network to train MNIST for 10 iterations. The network architecture is 768-20-20-20-10, in which the middle three layers are trained with the FIM. The dashed lines indicate the separation by layers. (a) is the figure of F̃, (b) is the figure of F̂ whose split dimension is 10, and (c) is the absolute error between (a) and (b).

Figure 5: The trade-off between loss of matrix and performance on Ascend 910. In this experiment, the matrix is the Kronecker factor A from ResNet-50 and the split dimension list is [1, 16, 32, 64, 128, 256, 512, 1024, 2048]. Performance 1 represents the best performance while 0 represents the worst one. We use the spectral norm to estimate the loss of matrices, where 1 represents the largest loss of the original matrix and 0 represents the least one. The marked point in the figure is the trade-off point; the nearest value in the split dimension list is 128.

4 Experiments

To test the performance, we apply THOR to train ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet. In these experiments, we implement our method in three variants: THOR, THOR_stop with the stopping rule enabled, and THOR_NT without the trace-based updating rule. We have compared THOR, THOR_stop and THOR_NT with KFAC and SGD with momentum on CIFAR-10. However, on ImageNet we only compared THOR, THOR_stop and THOR_NT with SGD, since KFAC needs to calculate the inverse of the FIM on a large model such as ResNet-50 and cannot finish training in a reasonable time. For example, KFAC takes 2s to calculate the FIM inversion while THOR only takes 200ms on Tesla V100. Please note that we did not compare to Adam because Adam fails to obtain the accuracy of SGD with momentum for ResNet-50; the highest accuracy achieved by Adam is 73.48% [You et al. 2019]. For all our experiments, we average the results of 5 runs and we use a normal distribution to generate the starting points.

4.1 CIFAR-10

Setup. In this experiment, we use 1 Tesla V100 GPU and train ResNet-18 on CIFAR-10 with batch size 128. The split dimension, learning rate, damping and update interval can be found in Figure 6. The weight decay for SGD, KFAC, THOR, THOR_stop and THOR_NT is set to 0.0005. The trace thresholds are set to (ω_1, ω_2) = (0.01, 0) for THOR, (ω_1, ω_2) = (0.01, 0.001) for THOR_stop and (ω_1, ω_2) = (0, 0) for THOR_NT. The update interval for KFAC is set to 20.

Results. In Figure 7, we compare our methods with KFAC and SGD with momentum on CIFAR-10 in terms of loss, test accuracy and wall-clock time. Figure 7(a) shows that THOR, THOR_stop, THOR_NT and KFAC converge faster than SGD in the first 30 epochs, and all of them are able to reach high training accuracy. It can be seen from Figure 7(c) that THOR, THOR_stop, THOR_NT and KFAC are faster than SGD in the first 30 epochs, and the second-order algorithms are able to achieve higher test accuracy than SGD. In particular, THOR can reach 95% test accuracy in this experiment. Figure 7(b) shows that our methods outperform KFAC but have no advantage over SGD in terms of the training loss. However, for test accuracy, THOR is 152.67s faster, THOR_NT is 138.56s faster and THOR_stop is 186.59s faster than SGD at 93% test accuracy; a summary of the computational results can be seen in Table 2. Note that in this experiment, for the second-order methods, we use the same learning rate α as that of SGD. After adjusting the parameters, we can get better results. For instance, THOR_stop is 435s faster than SGD when reaching 93% test accuracy by tuning the learning rate, and THOR_stop is also 3809s faster than EKFAC at 93% test accuracy.

Furthermore, we did an ablation study to see how the frequency updating strategy, the trace-based updating rule and the matrix split each affect the results on ResNet-18 + CIFAR-10, and we named the corresponding improved algorithms THOR_tr, THOR_fre and THOR_sp, respectively. Our study showed that THOR_tr accelerated training by 65% compared to the original KFAC algorithm over 90 epochs, while THOR_fre and THOR_sp respectively accelerated it by 48.5% and 40.5% compared to the original KFAC algorithm over 90 epochs. THOR_sp gained lower acceleration since ResNet-18's Fisher information matrix is much smaller than those of other models.

Table 4: The computational results of ResNet-50 on ImageNet.

                    SGD        THOR       THOR_stop   THOR_NT
Best Test Acc       76.04%     75.92%     75.92%      76.00%
Time Per Epoch      90.00s     102.15s    100.05s     103.65s
Time (74.9%)        6569.86s   3674.88s   3405.26s    3747.74s
Time (75.9%)        7020.98s   4083.20s   4004.47s    4148.03s

4.2 ImageNet

Figure 6: The hyper-parameters for training ResNet-18 on CIFAR-10. (a) The split dimension list is [1, 9, 18, 36, 72, 144, 288, 576, 1152, 2304, 4608]; we set the split dimension to 72. (b) The same learning rate is used for SGD, KFAC, THOR, THOR_stop and THOR_NT. (c) The same damping is used for KFAC, THOR, THOR_stop and THOR_NT. (d) The same update interval of the FIM is used for THOR, THOR_stop and THOR_NT.

Setup. In this experiment, we implement the proposed THOR on MindSpore with 8 Ascend 910 and train ResNet-50 on ImageNet with batch size 256. The weight decay for these methods is set to 0.0005, the label smoothing is set to 0.1, and the split dimension is set to 128 as in Figure 8. The trace thresholds are set to (ω_1, ω_2) = (0.01, 0) for THOR, (ω_1, ω_2) = (0.01, 0.001) for THOR_stop and (ω_1, ω_2) = (0, 0) for THOR_NT. The split dimension, learning rate, damping and update interval can be found in Figure 8. The learning rate α for epoch e is determined as follows:

    α^(e) = α_target · (1 − e / e_end)^{p_decay},    (13)

where α_target is the target learning rate, e_end is the end of the decay epoch, and p_decay is the decay rate. Figure 9 shows the impact of the target learning rate and decay rate on the test accuracy reached after 40 epochs with batch size 256. For larger batch sizes, a warmup strategy improves the training result. The specific strategy is given as follows:

    α^(e) = α_warmup + ((α_target − α_warmup) / e_warmup) · e,       e ≤ e_warmup,
    α^(e) = α_target · (1 − (e − e_warmup) / e_end)^{p_decay},       e > e_warmup.    (14)
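A small sketch of the schedules in Eqs. (13)–(14), using the BS=8192 row of Table 3 (α_warmup = 0.01, α_target = 0.8, e_warmup = 5, e_end = 48, p_decay = 6) purely as example values:

```python
def lr_schedule(e, alpha_target, p_decay, e_end,
                alpha_warmup=None, e_warmup=0):
    """Learning rate at epoch e following Eq. (13), or Eq. (14) when warmup is used."""
    if alpha_warmup is not None and e <= e_warmup:
        # Linear warmup from alpha_warmup to alpha_target over e_warmup epochs.
        return alpha_warmup + (alpha_target - alpha_warmup) / e_warmup * e
    offset = e - e_warmup if alpha_warmup is not None else e
    return alpha_target * (1.0 - offset / e_end) ** p_decay

# Example values taken from the BS=8192 row of Table 3.
for e in [0, 2, 5, 10, 20, 40]:
    print(e, round(lr_schedule(e, alpha_target=0.8, p_decay=6, e_end=48,
                               alpha_warmup=0.01, e_warmup=5), 5))
```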

The damping λ adopts the following decreasing rule:

    λ^(e) = λ_(0) · ρ_decay^{e/10},    (15)

where λ_(0) is the initial damping and ρ_decay is the decay rate of the damping. The hyper-parameters for our methods are shown in Table 3.

Results. In Figure 10, we compare our methods with SGD with momentum on ImageNet in terms of their training loss and test accuracy with respect to epoch and wall-clock time. Figures 10(a)(c) show that the convergence speed of THOR, THOR_NT and THOR_stop is faster than that of SGD: SGD needs 78 epochs to converge while THOR, THOR_NT and THOR_stop only need 40 epochs. In Figures 10(b)(d), our methods take less time than SGD; specifically, THOR needs 68.1 min, THOR_NT needs 69.1 min and THOR_stop only takes 66.7 min to converge, while SGD needs 117 min. A summary of the computational results can be seen in Table 4. From these results, our proposed algorithm has outstanding performance on large-scale tasks such as ResNet-50. THOR is also competitive in terms of end-to-end training time with various batch sizes across different hardware platforms: it takes 4.1 min/2.7 min to reach a test accuracy of 75.9% with batch size 4096/8192, respectively, as reported in Table 5.

Figure 7: ResNet-18 on CIFAR-10. (a) The training loss with epoch. (b) The training loss with wall-clock time. (c) The test accuracy with epoch. (d) The test accuracy with wall-clock time.

Table 5: The results for large batch sizes of ResNet-50 on ImageNet.

                                     Hardware           Software     Batch size   Optimizer   Time        Accuracy
[He et al. 2016]                     Tesla P100 × 8     Caffe        256          SGD         29 hr       75.3%
[Goyal et al. 2017]                  Tesla P100 × 256   Caffe2       8192         SGD         60 min      76.3%
Google 0.7-2 [https://mlperf.org]    NVIDIA V100 × 8    TensorFlow   2496         LARS        88.56 min   75.9%
[Osawa et al. 2020]                  Tesla V100 × 128   Chainer      4096         SP-NGD      32.5 min    74.8%
[Osawa et al. 2020]                  Tesla V100 × 256   Chainer      8192         SP-NGD      16.9 min    75.3%
our work                             Ascend 910 × 8     MindSpore    256          THOR        66.7 min    75.9%
our work                             Ascend 910 × 128   MindSpore    4096         THOR        4.1 min     75.9%
our work                             Ascend 910 × 256   MindSpore    8192         THOR        2.7 min     75.9%

Figure 8: The hyper-parameters for training ResNet-50 on ImageNet. (a) The split dimension list is [1, 16, 32, 64, 128, 256, 512, 1024, 2048]; we set the split dimension to 128. (b) The learning rate used in training ResNet-50 on ImageNet for SGD, THOR, THOR_stop and THOR_NT. (c) The damping used in training ResNet-50 on ImageNet for THOR, THOR_stop and THOR_NT. (d) The update interval of the FIM and its inverse matrix for THOR, THOR_stop and THOR_NT.

Figure 9: The learning rate of ResNet-50 on ImageNet.

Figure 10: ResNet-50 on ImageNet. (a) The training loss with epoch. (b) The training loss with wall-clock time. (c) The test accuracy with epoch. (d) The test accuracy with wall-clock time.

5 Related work

A second-order optimizer can accelerate convergence and easily avoid saddle points, but the computational complexity of inverting the FIM is O(n^3) (where n is the dimension of the FIM). Therefore, various approximations of the second-order information matrix have been proposed in recent years. KFAC [Martens and Grosse 2015, Grosse and Martens 2016, Martens, Ba, and Johnson 2018] uses natural gradient descent in deep network training by approximating the FIM with two much smaller matrices based on the network structure and Kronecker products. However, KFAC still requires a lot of computing power and does not have the ideal scalability that is crucial for large-scale tasks. EKFAC [George et al. 2018] tried to solve this problem by using more accurate eigenvalues to reduce the approximation error compared with KFAC, but it still takes too much time to train ResNet-18 on CIFAR-10 in our experiments. More recently, [Osawa et al. 2019, 2020] implemented an improved KFAC on ResNet-50 for ImageNet with powerful computational resources (1024 Tesla V100). In terms of wall-clock time, the result is quite promising (it takes 5.5 min to achieve a top-1 accuracy of 75.4% on ResNet-50 for ImageNet). In our work, the proposed method is more efficient: we train ResNet-50 on ImageNet to 75.9% in 2.7 minutes with 256 Ascend 910. Moreover, we are able to achieve a top-1 accuracy of 75.9% in 66.7 minutes with much less computational resources (8 Ascend 910) than [Osawa et al. 2019, 2020].

6 Conclusion

In this paper, we propose THOR to speed up second-order optimizers by reducing the computation cost of inverting the FIM. The algorithm assumes that the FIM converges to a stationary distribution, uses the trace of each matrix block to increase the update interval of the matrix blocks, and makes a more radical approximation to the matrix blocks. The experiments on CIFAR-10 and ImageNet clearly demonstrate that THOR converges much faster than EKFAC, KFAC and SGD. Especially on ImageNet, THOR's overall time is much less than that of SGD: THOR only uses 66.7 minutes to converge with 8 Ascend 910, which is only half the time of SGD. In the future, we will apply THOR to other deep learning models to speed up their training, such as BERT [Devlin et al. 2018] and GPT-2 [Radford et al. 2019].

References

Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2): 251–276.

Berahas, A. S.; Jahani, M.; and Takáč, M. 2019. Quasi-Newton methods for deep learning: Forget the past, just sample. arXiv preprint arXiv:1901.09997.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

George, T.; Laurent, C.; Bouthillier, X.; Ballas, N.; and Vincent, P. 2018. Fast approximate natural gradient descent in a Kronecker factored eigenbasis. In Advances in Neural Information Processing Systems, 9550–9560.

Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

Grosse, R.; and Martens, J. 2016. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, 573–582.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Keskar, N. S.; and Berahas, A. S. 2016. adaQN: An adaptive quasi-Newton algorithm for training RNNs. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 1–16. Springer.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kiros, R. 2013. Training neural networks with stochastic Hessian-free optimization. arXiv preprint arXiv:1301.3641.

Le, Q. V.; Ngiam, J.; Coates, A.; Lahiri, A.; Prochnow, B.; and Ng, A. Y. 2011. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, 265–272.

Martens, J. 2010. Deep learning via Hessian-free optimization. In ICML, volume 27, 735–742.

Martens, J. 2014. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.

Martens, J.; Ba, J.; and Johnson, M. 2018. Kronecker-factored curvature approximations for recurrent neural networks.

Martens, J.; and Grosse, R. 2015. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2408–2417.

Nocedal, J.; and Wright, S. 2006. Numerical Optimization. Springer Science & Business Media.

Ollivier, Y.; Arnold, L.; Auger, A.; and Hansen, N. 2017. Information-geometric optimization algorithms: A unifying picture via invariance principles. The Journal of Machine Learning Research 18(1): 564–628.

Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Foo, C.-S.; and Yokota, R. 2020. Scalable and practical natural gradient for large-scale deep learning. arXiv preprint arXiv:2002.06015.

Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Yokota, R.; and Matsuoka, S. 2019. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 12359–12367.

Pan, W.; Innanen, K. A.; and Liao, W. 2017. Accelerating Hessian-free Gauss-Newton full-waveform inversion via l-BFGS preconditioned conjugate-gradient algorithm. Geophysics 82(2): R49–R64.

Qian, N. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks 12(1): 145–151.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8): 9.

Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3): 471–501.

Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics 400–407.

Roux, N. L.; Manzagol, P.-A.; and Bengio, Y. 2008. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, 849–856.

Srebro, N.; Rennie, J.; and Jaakkola, T. S. 2005. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 1329–1336.

You, Y.; Li, J.; Reddi, S.; Hseu, J.; Kumar, S.; Bhojanapalli, S.; Song, X.; Demmel, J.; Keutzer, K.; and Hsieh, C.-J. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.

Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Zhang, G.; Sun, S.; Duvenaud, D.; and Grosse, R. 2018. Noisy natural gradient as variational inference. In International Conference on Machine Learning, 5847–5856.

Zhu, C.; Byrd, R. H.; Lu, P.; and Nocedal, J. 1997. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS) 23(4): 550–560.