

THOR: Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation

Mengyun Chen1*, Kai-Xin Gao2*, Xiao-Lei Liu2*, Zidong Wang1*, Ningxi Ni1*, Qian Zhang3*, Lei Chen4†, Chao Ding5, Zheng-Hai Huang2, Min Wang1, Shuangling Wang1, Fan Yu1, Xinyuan Zhao3, Dachuan Xu3

1Huawei Technologies Co. Ltd, 2Tianjin University, 3Beijing University of Technology, 4Hong Kong University of Science and Technology, 5Chinese Academy of Sciences

1chenmengyun1, wang1, niningxi, wangmin106, wangshuangling1, [email protected]; 2gaokaixin, liuxiaolei, [email protected]; 3zhangqian, xyzhao, [email protected]; [email protected]; [email protected]

* Equal contribution. † Corresponding author.

Abstract

It is well-known that second-order optimizers can accelerate the training of deep neural networks; however, the huge computation cost of second-order optimization makes it impractical to apply in real practice. In order to reduce the cost, many methods have been proposed to approximate a second-order matrix. Inspired by KFAC, we propose a novel Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation method, called THOR, to make second-order optimization applicable in real application models. Specifically, we gradually increase the update interval and use the matrix trace to determine which blocks of the Fisher Information Matrix (FIM) need to be updated. Moreover, by resorting to the power of hardware, we have designed a Hardware-driven approximation method for computing the FIM to achieve better performance. To demonstrate the effectiveness of THOR, we have conducted extensive experiments. The results show that training ResNet-50 on ImageNet with THOR only takes 66.7 minutes to achieve a top-1 accuracy of 75.9% under an 8 Ascend 910 environment with MindSpore, a new deep learning computing framework. Moreover, with more computational resources, THOR takes only 2.7 minutes to reach 75.9% with 256 Ascend 910.

1 Introduction

Recently, deep learning has made significant progress in various computer vision and natural language processing applications. However, with the increase of the complexity of models, tons of parameters need to be trained. For example, according to [Devlin et al. 2018] and [He et al. 2016], training BERT (over 340 million parameters) and ResNet-50 (over 23 million trainable parameters) takes around 3 days on 16 TPUv3 and 29 hours on 8 Tesla P100 GPUs, respectively.
Therefore, many efforts have been made to propose optimization solutions that reduce the training time. Among all the proposed optimization techniques, the most popular and promising one is Stochastic Gradient Descent (SGD) [Robbins and Monro 1951], which is a first-order optimization algorithm. Specifically, SGD tries to minimize an objective J(θ) with respect to the parameters θ, i.e., θ is updated as θ ← θ − α∇_θ J(θ), where ∇_θ J(θ) is the gradient and α represents the learning rate. Using SGD to optimize the parameters faces two challenges: 1) it is difficult to choose a proper learning rate, and 2) it is hard to escape saddle points. Therefore, many variants of SGD such as Momentum [Qian 1999], AdaGrad [Zeiler 2012], Adam [Kingma and Ba 2014], etc. have been introduced in the past two decades. Though choosing the learning rate becomes easier in these variants, they still cannot escape saddle points when the objective function is non-convex, which is often the case in real application models.

To address the challenging issues encountered by SGD, it is natural to think of using a second-order optimizer, since it can avoid saddle points and, most importantly, it can accelerate convergence by using curvature information. Specifically, the parameters θ are updated by θ ← θ − αG^{-1}∇_θ J(θ), where G^{-1} is the inverse of the second-order information matrix G. The definitions of G in different second-order optimization algorithms are not the same. Common second-order optimization algorithms include Newton's method and natural gradient descent, where the second-order information matrix G is the Hessian matrix (HM) and the Fisher information matrix (FIM), respectively. The biggest challenge in using a second-order optimizer is that its computation cost increases cubically and its space cost increases quadratically compared to SGD. Therefore, it is quite impractical to compute the inverse of the second-order information matrix directly.

To reduce the computation cost of the second-order optimizer, quite a few approximation approaches have been proposed. For instance, for Newton's method, Quasi-Newton methods [Nocedal and Wright 2006] can be used to approximate the inverse of the HM. One of the advantages of these methods over the classical Newton method is that the HM does not need to be inverted explicitly. In particular, the Limited-memory BFGS (L-BFGS) method [Zhu et al. 1997] has been implemented and used to speed up the training process in Deep Neural Networks (DNN) (e.g., [Le et al. 2011]). Other structured stochastic Quasi-Newton methods have also been developed and studied recently in [Keskar and Berahas 2016, Berahas, Jahani, and Takáč 2019]. Another class of Newton-type second-order methods is the Hessian-Free optimization method [Martens 2010, Kiros 2013, Pan, Innanen, and Liao 2017], in which matrix-free conjugate-gradient (CG) algorithms are used to approximate the true Hessian matrix. However, these CG algorithms usually require many iterations to reach the desired accuracy, in particular for ill-conditioned cases.

Unlike the Newton-type methods, Kronecker-factored Approximate Curvature (KFAC) [Martens and Grosse 2015, Grosse and Martens 2016, Martens, Ba, and Johnson 2018] is a second-order method based on the natural gradient method. More precisely, in KFAC, one computes the inverse of the FIM by computationally tractable approximations such as the block-diagonal approximation and the block-tridiagonal approximation. [George et al. 2018] have introduced an Eigenvalue-corrected Kronecker Factorization (EKFAC) which can approximate the FIM much better than KFAC does. [Osawa et al. 2019, 2020] have demonstrated that KFAC is efficient in large-scale distributed computing for deep neural networks. Overall, among all these methods, the approximation scheme for the inverse of the FIM is crucial for improving the efficiency of the second-order optimizer, since the current exact strategies still require significant computing power in practice.
To address the issue of inefficiently computing the FIM, in this paper we propose an efficient second-order optimization method based on the natural gradient, named Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation (THOR), to compute the FIM. Firstly, we observe from experiments that the FIM for each layer usually changes rapidly in the first few iterations and then tends to be stable. Therefore, it is reasonable to increase the update interval of the inverse of the FIM in a proper manner without loss of convergence rate. Secondly, we make a further decision on which blocks of the FIM need to be updated. Thirdly, we introduce a new approximation scheme that uses a Hardware-driven matrix splitting scheme to approximate the FIM, which can be regarded as finding an optimal tradeoff point between the computational efficiency and the information loss of the FIM. Overall, the contributions of our work can be summarized as follows:

• Under the assumption that the FIM converges to a stationary distribution, we gradually increase the update interval of the inverse of the FIM to save overall computational time.

• Instead of using the Frobenius-norm-based updating rule proposed in [Osawa et al. 2019], we introduce a more computationally tractable trace-based updating rule for the FIM of each layer.

• We approximate the block-diagonal matrix of KFAC by a smaller matrix obtained by splitting the matrix dimensions, which trades a loss of FIM information for efficient computation.

• Last but not least, with THOR we are able to train ResNet-50 on ImageNet in 66.7/4.1/2.7 minutes with a top-1 accuracy of 75.9% using 8/128/256 Ascend 910 on MindSpore.1

1 MindSpore: https://www.mindspore.cn/.

2 Background and Notations

The purpose of deep neural network training is to find a set of model parameters θ ∈ R^n to minimize the loss function J(θ). Given the cross-entropy loss function:

    J(θ) = E[− log p(y|x, θ)],    (1)

where x, y are the training input and label, and p(y|x, θ) represents the density function of a predictive distribution P_{y|x}.

2.1 The Natural Gradient

Our algorithm is based on the natural gradient proposed by [Amari 1998]. The natural gradient gives the steepest descent direction of the target function when the parameter space has a Riemannian metric structure. In other words, it gives the largest change of the loss function per unit change of the model, where the distance between the distributions P_θ and P_{θ+δθ} is measured with the K-L divergence. More recent discussions of the natural gradient can be found in [Martens 2014, Ollivier et al. 2017]. The natural gradient is typically defined as F^{-1}∇_θ J(θ), where F ∈ R^{n×n} is the FIM. With the predictive distribution defined as P_{y|x}, the FIM is formulated as

    F = E[∇_θ log p(y|x, θ) ∇_θ log p(y|x, θ)^T].    (2)

It is impractical to compute the inverse of the FIM directly in a deep neural network, since it has over millions of parameters.

2.2 KFAC

KFAC is an efficient method for approximating the natural gradient, which approximates the FIM by block-diagonal or block-tridiagonal matrices. Based on a well-founded motivation and a rigorous mathematical derivation, it has elegantly settled the problem of the complex computation required for inverting the second-order information matrix. [Osawa et al. 2019] have shown that block-diagonal KFAC gives good results on large-scale DNNs and that block-diagonal KFAC computes more efficiently than the block-tridiagonal variant. Thus, we focus on block-diagonal KFAC to approximate the FIM in this work.

KFAC is a two-step approximation method. In the first step, KFAC decomposes the FIM into block matrices according to the layers of the neural network, by assuming that the parameters of different layers are independent. The calculation of the inverse of the FIM is then simplified to the inverse of these small blocks. In the second step, these block matrices are further approximated by the Kronecker product of two much smaller matrices, which we call Kronecker factors. Since the inverse of the Kronecker product of two matrices equals the Kronecker product of their inverses, and these two smaller matrices are easier to compute and invert than the entire block matrix, KFAC greatly simplifies the FIM calculation.

Consider a deep neural network with l layers. Denote the outputs of the i-th layer as s_i, the inputs of the i-th layer as a_{i−1}, which are the activations of the previous layer, and let θ_i be the weight vector of the i-th layer.

In the first step, KFAC approximates the FIM as a block-diagonal matrix:

    F ≈ diag(F_1, F_2, ..., F_l) = diag(E[D_{θ_1} D_{θ_1}^T], E[D_{θ_2} D_{θ_2}^T], ..., E[D_{θ_l} D_{θ_l}^T]),    (3)

where D_θ = −d log p(y|x, θ)/dθ.
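To make Eqs. (2)–(3) concrete, the sketch below estimates the empirical FIM of a tiny two-layer softmax classifier from per-sample gradients of log p(y|x, θ) and compares it with its layer-wise block-diagonal approximation. This is an illustrative NumPy toy: the model, sizes, and the sampling of y from the model's own predictive distribution are our assumptions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, n_samples = 4, 3, 2, 512

# Tiny 2-layer network: logits = W2 @ tanh(W1 @ x); parameters θ = [vec(W1), vec(W2)].
W1 = rng.normal(scale=0.5, size=(d_hid, d_in))
W2 = rng.normal(scale=0.5, size=(d_out, d_hid))

def predict(x):
    h = np.tanh(W1 @ x)
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

def grad_log_p(x, y):
    """Gradient of log p(y|x, θ) for a softmax output, concatenated over layers."""
    h, p = predict(x)
    e = np.eye(d_out)[y] - p                 # d log p / d logits
    gW2 = np.outer(e, h)                     # layer-2 gradient
    gW1 = np.outer((W2.T @ e) * (1 - h**2), x)  # layer-1 gradient (tanh backprop)
    return np.concatenate([gW1.ravel(), gW2.ravel()])

# Empirical FIM, Eq. (2): expectation of the outer product of the score,
# with y drawn from the model's own predictive distribution.
grads = []
for _ in range(n_samples):
    x = rng.normal(size=d_in)
    _, p = predict(x)
    y = rng.choice(d_out, p=p)
    grads.append(grad_log_p(x, y))
G = np.stack(grads)
F = G.T @ G / n_samples                      # full FIM estimate

# Block-diagonal approximation, Eq. (3): keep only the per-layer blocks.
n1 = d_hid * d_in
F_block = np.zeros_like(F)
F_block[:n1, :n1] = F[:n1, :n1]
F_block[n1:, n1:] = F[n1:, n1:]

rel_err = np.linalg.norm(F - F_block) / np.linalg.norm(F)
print(f"relative weight of the discarded cross-layer blocks: {rel_err:.3f}")
```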
In the second step, each block of the FIM can be rewritten as

    F_i = E[D_{θ_i} D_{θ_i}^T] = E[a_{i−1} a_{i−1}^T ⊗ g_i g_i^T] ≈ E[a_{i−1} a_{i−1}^T] ⊗ E[g_i g_i^T] = A_{i−1} ⊗ G_i,    (4)

where ⊗ denotes the Kronecker product, g_i = D_{s_i}, A_{i−1} = E[a_{i−1} a_{i−1}^T] and G_i = E[g_i g_i^T]. Since (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1} for any invertible matrices A and B, we can compute the block-diagonal FIM easily as

    F_i^{-1} = (A_{i−1} ⊗ G_i)^{-1} = A_{i−1}^{-1} ⊗ G_i^{-1}.    (5)

Furthermore, KFAC uses the damping technique of [Martens and Grosse 2015] by adding λI to the Kronecker factors. Finally, the weight vector θ_i of the i-th layer can be updated as follows:

    θ_i^{(k+1)} ← θ_i^{(k)} − α((A_{i−1}^{(k)} + λI)^{-1} ⊗ (G_i^{(k)} + λI)^{-1}) ∇_{θ_i} J^{(k)},    (6)

where α represents the learning rate.

3 THOR

As mentioned in the Introduction, although KFAC can accelerate convergence, it still has no advantage in overall training time compared with first-order optimizers, due to the high computation cost of the Kronecker product. To address this problem, we propose a novel algorithm called Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation (THOR). In THOR, we first use a gradually increasing update interval for updating the inverse of the FIM. Second, instead of updating the whole inverse of the FIM, we further determine which matrix blocks to update, guided by trace-based rules. Finally, by utilizing the power of MindSpore and Ascend 910, we trade a little loss of the FIM for efficiently approximating the matrix blocks. The detailed steps of the THOR optimizer are given in Algorithm 1.

Algorithm 1 THOR Optimizer
Require: T_FIM, T_INV: the update intervals of the FIM and its inverse matrix
Require: ω_1, ω_2: two positive threshold parameters used in Eq. (9)
Require: size: the split dimension of the FIM
Require: α: the learning rate
Require: λ: the damping parameter
k ← 0
while convergence is not reached do
    for i = 1 to l do
        if k ≡ 0 (mod T_FIM) then
            Update the factors A_{i−1} and G_i
        end if
        if k ≡ 0 (mod T_INV) then
            Compute ∆^k using Eq. (8)
            if F_i^k is to be updated according to Eq. (9) then
                Use size to split the factors A_{i−1} and G_i according to Eq. (10)
                Update the inverses of the split matrices Â_{i−1}^{-1} and Ĝ_i^{-1}
            end if
        end if
        θ_i^{(k+1)} ← θ_i^{(k)} − α((Â_{i−1}^{(k)} + λI)^{-1} ⊗ (Ĝ_i^{(k)} + λI)^{-1}) ∇_{θ_i} J^{(k)}
    end for
    k ← k + 1
end while
return θ
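The key practical point in Eqs. (5)–(6) and in the update step of Algorithm 1 is that the Kronecker-factored preconditioner never has to be materialized: applying (A + λI)^{-1} ⊗ (G + λI)^{-1} to the flattened gradient is equivalent to two small linear solves against the gradient reshaped as a matrix. The following NumPy sketch checks this identity numerically; the shapes and the row-major vec convention are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
dim_a, dim_g, lam, alpha = 6, 4, 0.03, 0.1

# Symmetric positive semi-definite Kronecker factors, as in Eq. (4).
A = rng.normal(size=(dim_a, dim_a)); A = A @ A.T / dim_a   # A_{i-1} = E[a a^T]
G = rng.normal(size=(dim_g, dim_g)); G = G @ G.T / dim_g   # G_i     = E[g g^T]
grad_W = rng.normal(size=(dim_a, dim_g))                   # layer gradient, reshaped

A_damp = A + lam * np.eye(dim_a)
G_damp = G + lam * np.eye(dim_g)

# Naive route: build the full Kronecker-factored preconditioner explicitly.
precond = np.kron(np.linalg.inv(A_damp), np.linalg.inv(G_damp))
step_naive = (precond @ grad_W.ravel()).reshape(dim_a, dim_g)

# Factored route: (A^{-1} ⊗ G^{-1}) vec(V) = vec(A^{-1} V G^{-T}) for row-major vec.
step_factored = np.linalg.solve(A_damp, grad_W) @ np.linalg.inv(G_damp).T

assert np.allclose(step_naive, step_factored)

# One damped update in the spirit of Eq. (6): W ← W − α * preconditioned gradient.
W = rng.normal(size=(dim_a, dim_g))
W_new = W - alpha * step_factored
print("max |naive - factored| =", np.abs(step_naive - step_factored).max())
```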

3.1 Update with trace

In order to reduce the computation and achieve faster training, KFAC and its variants all reduce the frequency of computing the FIM and its inverse matrix [Martens and Grosse 2015, Grosse and Martens 2016, George et al. 2018, Zhang et al. 2018, Osawa et al. 2019]; they update the FIM and its inverse every few iterations. In particular, [Osawa et al. 2019] discussed the change rate of the FIM on the ResNet-50 network for ImageNet classification and adopted a heuristic scheme that further reduces the update rate after 500 iterations to accelerate training. However, this scheme is not the optimal choice: a fixed update interval is not very profitable in the later stage of training. In other words, the number of updates in later training is still very large, which costs many computing resources but cannot greatly improve the training effect. Therefore, we propose a new updating scheme in this subsection.

Figure 1: Changes of the Frobenius norm of the FIM. We choose the fully-connected layer and three different convolution layers when training the CIFAR-10 dataset on ResNet-18 using KFAC. We record data every 20 iterations.

Figure 1 illustrates the changes of the Frobenius norm of the FIM at each layer. We can clearly observe that the FIM of each layer changes rapidly in the first few iterations and then tends to be stable. Based on existing research [Martens and Grosse 2015, Grosse and Martens 2016, Osawa et al. 2019] and our experiments, it is reasonable to assume that {F^k}_{k=1}^n is a Markov process converging to a stationary distribution π, where F^k represents the FIM updated at the k-th iteration. Under this assumption, we can gradually increase the update interval of the FIM and its inverse matrix during training. However, as shown in Figure 1, some layers tend to stabilize faster than others, so it is too rough to set the same update interval for all blocks of the FIM. Therefore, it is more reasonable to select which blocks of the FIM need to be updated. Moreover, we can stop updating the FIM and its inverse matrix for a layer once its FIM becomes stable. For example, if we stop updating the FIM after the k-th iteration for the i-th layer, then the parameters are computed by θ_i^{(k+t)} = θ_i^{(k+t−1)} − α(F_i^{(k)})^{-1} ∇_{θ_i} J^{(k+t−1)}, t = 1, 2, ⋯.

To determine whether to update or stop updating, we introduce an adaptive trace-based updating rule. In [Osawa et al. 2019], the Frobenius norm ‖·‖_F is used to estimate the changes of the FIM for each layer, which does not have good scalability and may not be suitable for large-scale tasks. However, it is well known that for any matrix X, the relationship between its Frobenius norm ‖X‖_F and its nuclear norm ‖X‖_* can be expressed as follows:

    ‖X‖_F ≤ ‖X‖_* ≤ √r ‖X‖_F,    (7)

where r = rank(X) and ‖·‖_* is the nuclear norm of a matrix, i.e., the sum of its singular values [Recht, Fazel, and Parrilo 2010, Srebro, Rennie, and Jaakkola 2005]. It is also well known that for any matrix X, the absolute value of the trace |tr(X)| is smaller than or equal to the nuclear norm ‖X‖_*, with equality if X is positive semidefinite. Therefore, |tr(X)| can also be used to estimate the changes of the FIM for each layer. More importantly, the computational cost of |tr(X)| is linear, which means it has much better scalability. Therefore, in THOR, for the i-th layer we define the following relative change rate:

    ∆^k = | ( |tr(F_i^k + λI)| − |tr(F_i^{k−1} + λI)| ) / |tr(F_i^{k−1} + λI)| |.    (8)

Then, we adopt the following trace-based updating scheme for the FIM and its inverse of each layer, based on the relative change rate ∆^k:

    update F_i^k,                                         if ∆^k ∈ (ω_1, +∞),
    do not update F_i^k and set F_i^k = F_i^{k−1},        if ∆^k ∈ [ω_2, ω_1],
    stop updating F_i and set F_i^{k+t} ≡ F_i^{k−1}
        for all t = 1, 2, ...,                            if ∆^k ∈ [0, ω_2),    (9)

where ω_1 and ω_2 are two given positive threshold parameters.

Figure 2: Change rate vs. iterations on ResNet-18. We choose three different convolution layers and the fully-connected layer when training the CIFAR-10 dataset using KFAC. We record data every 20 iterations.

Table 1: The example of trade-off.

split dimension              1       16      32      64      128     256     512     1024    2048
matrix number (L < 1%)       4       4       5       9       13      23      32      46      48
performance (µs)             35      59      83      121     212     534     1418    3738    9824
normalized loss of matrix    0.0741  0.0741  0.0926  0.1667  0.2407  0.4259  0.5926  0.8519  0.8889
normalized performance       1       0.6014  0.4276  0.2939  0.1669  0.0664  0.0250  0.0095  0.0036

In Figure 2 and Figure 3, we demonstrate the changes of ∆^k for some layers of two different networks. It can be seen clearly that ∆^k is relatively large at the beginning and then fluctuates around a relatively fixed small value after a few iterations. For most layers, ∆^k lies in the interval (0.001, 0.01), and for some layers it fluctuates around 0.001. Therefore, we recommend the choices ω_1 = 0.01 and ω_2 = 0.001, which have performed well for training, as confirmed by the experiments in Section Experiments. We believe that it is reasonable to increase the update interval of F_i if ∆^k ∈ [0.001, 0.01], and to stop updating F_i if ∆^k ∈ [0, 0.001).
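A minimal sketch of the trace-based decision in Eqs. (8)–(9), using the recommended thresholds ω_1 = 0.01 and ω_2 = 0.001. The per-layer bookkeeping (storing the previous trace and a "stopped" flag) is our own illustrative structure, not code from the paper.

```python
import numpy as np

OMEGA1, OMEGA2 = 0.01, 0.001   # recommended thresholds from Section 3.1

class TraceRule:
    """Decide per layer whether to refresh F_i (and its inverse) at iteration k."""
    def __init__(self, damping):
        self.damping = damping
        self.prev_trace = None     # |tr(F_i^{k-1} + λI)|
        self.stopped = False       # set once Δ^k falls below ω_2

    def should_update(self, F_i):
        if self.stopped:
            return False
        trace = abs(np.trace(F_i) + self.damping * F_i.shape[0])  # tr(F_i + λI)
        if self.prev_trace is None:          # first call: always update
            self.prev_trace = trace
            return True
        delta = abs(trace - self.prev_trace) / self.prev_trace    # Eq. (8)
        if delta > OMEGA1:                   # still changing a lot: update F_i
            self.prev_trace = trace
            return True
        if delta < OMEGA2:                   # essentially stable: stop updating
            self.stopped = True
        return False                         # otherwise keep the previous F_i^{k-1}

# Usage on a toy sequence of factors whose changes shrink over time.
rng = np.random.default_rng(2)
rule = TraceRule(damping=0.03)
F = np.eye(8)
for k in range(10):
    F = F + rng.normal(scale=0.5 / (k + 1) ** 2, size=(8, 8))
    print(k, rule.should_update(F))
```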
fluctuates around a relatively fixed small value after a few it- In order to compute FIM more efficiently, in our algorithm, erations. For most layers, ∆k lies in the interval (0.001, 0.01) we further split the input of the i-th layer’s into j group- and fluctuates around 0.001 for some layers. Therefore, we s vectors, i.e., a(i−1)1, a(i−1)2, ..., a(i−1)j and assume that provide a recommendation of the choices ω1 = 0.01 and different groups a(i−1)s and a(i−1)t are independent. As a ω2 = 0.001, which have performed well for training, con- consequence, the outputs of the i-th layer’s block split, de- firmed by experiments in Section Experiments. We believe noted as si1, si2, . . ., sij, are also independent. Under the that it is reasonable to increase the update interval of Fi if independent assumption, we can approximate the Kronecker k k ∆ ∈ [0.001, 0.01], and stop updating Fi if ∆ ∈ [0, 0.001). factors Ai−1 and Gi for computing the i-th FIM block Fi by Table 2: The computational result of ResNet-18 on CIFAR-10

SGD KFAC THOR THOR stop THOR NT Best Test Acc 94.31% 94.42% 95.00% 95.09% 94.40% Time Per Epoch 13.29s 65.01s 17.64s 17.16s 18.09s Time(93%) 809.51s 1704.261s 656.84s 622.92s 670.95s Time(94%) 889.154s 4032.43s 1139.11s 1092.24s 1155.28s Time(95%) NaN NaN 1555.54s 1350.72s NaN

tion loss L is measured by the spectral norm, which is defined conv-1 conv-2 as follows: s λ (AˆAˆT ) change of trace change of trace max (11) L = 1 − T . λmax(AA ) iteration iteration where λ (·) is the largest eigenvalue of the matrix, A is (a) (b) max the original matrix, and Aˆ is the split matrix. Second, we count the number of the matrices whose infor- conv-3 fc mation loss L is below 1% in each predefined split dimension. Finally, these counts are normalized by dividing the total

change of trace change of trace number of matrices. As for the computation efficiency (performance), we first

iteration iteration measure the time it costs to invert the matrix of each shape (c) (d) in the predefined split dimensions on the Ascend 910/GPU. Then, the normalized performance of a specific split dimen- sion equals to the performance data of the matrix with split Figure 3: Change rate vs iterations on ResNet-50. We choose three on the first dimension divided by the performance data of different convolution layers and the fully-connected layer when matrix with the specific split dimension, which is defined as training the ImageNet dataset using K-FAC. We record data every 200 iterations. follows:

p1 normalizedn = . (12) the following block diagonal matrices: pn  ˆ T T where normalizedn is the normalized performance of a Ai−1 ≈ diag E[a(i−1)1a(i−1)1], E[a(i−1)2a(i−1)2], specific split dimension n, p1 is the performance data of the T  matrix with split on the fist dimension. ..., E[a(i−1)ja(i−1)j] , For example, on ResNet-50 with Ascend 910, we set split ˆ  T T T  dimension list as [1, 16, 32, 64, 128, 256, 512, 1024, 2048] Gi ≈ diag E[Dsi1 Ds ], E[Dsi2 Ds ],..., E[Dsij Ds ] i1 i2 ij and the total number of Kronecker factors A is 54. The rel- T T T  ≈ diag E[gi1gi1], E[gi2gi2],..., E[gijgij] . evant data are reported in Table1. Finally, Figure5 shows (10) the normalized data in Table1. We can find the intersection point is (106, 0.21), which represents the trade-off between In Figure4, we compare the difference between the K-FAC the computation efficiency and the loss of the matrix. Thus, block-diagonal approximation F˜ (Figure4(a)) and the pro- we choose the split dimension as 128 which is the closest posed splitting approximation Fˆ (Figure4(b)). We calculate point to the intersection point in the split dimension list. the errors of two approximations F˜ and Fˆ, which are around 5% after 10 iterations. Interestingly, the relative difference 4 Experiments between two approximations F˜ and Fˆ reduces to 1% after To test the performance, we apply THOR to train ResNet-18 50 iterations. One possible reason is that the independence for CIFAR-10 and ResNet-50 for ImageNet. In these experi- assumption is more likely to be satisfied when the proportion ments, we implement our method in three variants: THOR, of elements on the diagonal increases. Obviously, the smaller THOR stop with and THOR NT without trace- the split dimension, the less time cost on computation (bet- based updating rule. We have compared THOR, THOR stop ter efficiency), but the larger information loss compared to and THOR NT with KFAC and SGD with momentum on original matrix. Therefore, the group number j is a trade-off CIFAR-10. However, we only compared THOR, THOR stop between the information loss and computation efficiency. and THOR NT with SGD on ImageNet, since KFAC needs to The process of calculating the information loss(loss of calculate the inverse of FIM on large model,such as ResNet- matrix) is given as follows. First, we set the tolerable infor- 50 which cannot finish the training in a reasonable time. For mation loss to 1% which means the split matrix contains 99% example, KFAC takes 2s to calculate the FIM inversion while information of the original Kronecker factors. The informa- THOR only takes 200ms on Tesla V100. Please note that we Table 3: Hyper-parameters of our methods on ImageNet
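To make the splitting scheme of Section 3.2 concrete, the sketch below splits a Kronecker factor into diagonal blocks of a given split dimension (in the spirit of Eq. (10)), measures the spectral-norm information loss of Eq. (11), and inverts the damped blocks independently. The matrix sizes and the random test factor are our own assumptions for illustration.

```python
import numpy as np

def split_blocks(A, split_dim):
    """Keep only diagonal blocks of size `split_dim`, as in Eq. (10)."""
    n = A.shape[0]
    A_hat = np.zeros_like(A)
    for start in range(0, n, split_dim):
        end = min(start + split_dim, n)
        A_hat[start:end, start:end] = A[start:end, start:end]
    return A_hat

def information_loss(A, A_hat):
    """Spectral-norm loss of Eq. (11): 1 - sqrt(λmax(Â Âᵀ) / λmax(A Aᵀ))."""
    lmax = lambda M: np.linalg.eigvalsh(M @ M.T).max()
    return 1.0 - np.sqrt(lmax(A_hat) / lmax(A))

def inverse_by_blocks(A_hat, split_dim, damping):
    """Invert the damped factor block by block instead of as one big matrix."""
    n = A_hat.shape[0]
    inv = np.zeros_like(A_hat)
    for start in range(0, n, split_dim):
        end = min(start + split_dim, n)
        block = A_hat[start:end, start:end] + damping * np.eye(end - start)
        inv[start:end, start:end] = np.linalg.inv(block)
    return inv

# Toy Kronecker factor A = E[a aᵀ] with correlated features (an assumption).
rng = np.random.default_rng(3)
a = rng.normal(size=(4096, 256)) @ rng.normal(size=(256, 256))
A = a.T @ a / a.shape[0]

for split_dim in [16, 64, 128, 256]:
    A_hat = split_blocks(A, split_dim)
    print(f"split {split_dim:4d}: loss L = {information_loss(A, A_hat):.4f}")

A_inv_hat = inverse_by_blocks(split_blocks(A, 128), 128, damping=0.03)
```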

Table 3: Hyper-parameters of our methods on ImageNet.

                          learning rate                                     damping
                     α_warmup   α_target   e_warmup   e_end   p_decay      λ_0    ρ_decay
BS=256   THOR        -          0.045      -          70      6            0.03   0.87
BS=256   THOR_stop   -          0.050      -          70      6            0.03   0.87
BS=256   THOR_NT     -          0.045      -          72      6            0.03   0.87
BS=4096  THOR        0.005      0.45       5          55      6            0.3    0.2
BS=8192  THOR        0.01       0.8        5          48      6            0.6    0.3


Figure 4: A comparison between the KFAC block-diagonal F̃ and the Hardware-driven split matrix F̂. We use a deep neural network to train MNIST for 10 iterations. The network architecture is 768-20-20-20-10, in which the middle three layers are trained with the FIM. The dashed lines indicate the separation by layers. (a) is the figure of F̃, (b) is the figure of F̂ whose split dimension is 10, and (c) is the absolute error between (a) and (b).

Figure 5: The trade-off between loss of matrix and performance on Ascend 910. In this experiment, the matrix is the Kronecker factor A from ResNet-50 and the split dimension list is [1, 16, 32, 64, 128, 256, 512, 1024, 2048]. Performance 1 represents the best performance while 0 represents the worst one. We use the spectral norm to estimate the loss of matrices, where 1 represents the largest loss of the original matrix and 0 represents the least one. The marked point in the figure is the trade-off point; the nearest value in the split dimension list is 128.

4 Experiments

To test the performance, we apply THOR to train ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet. In these experiments, we implement our method in three variants: THOR, THOR_stop with the stopping rule enabled, and THOR_NT without the trace-based updating rule. We have compared THOR, THOR_stop and THOR_NT with KFAC and SGD with momentum on CIFAR-10. However, on ImageNet we only compared THOR, THOR_stop and THOR_NT with SGD, since KFAC needs to calculate the inverse of the FIM on a large model such as ResNet-50 and cannot finish training in a reasonable time. For example, KFAC takes 2s to calculate the FIM inversion while THOR only takes 200ms on Tesla V100. Please note that we did not compare to Adam because Adam fails to obtain the accuracy of SGD with momentum for ResNet-50; the highest accuracy achieved by Adam is 73.48% [You et al. 2019]. For all our experiments, we average the results of 5 runs and we use a normal distribution to generate the starting points.

4.1 CIFAR-10

Setup. In this experiment, we use 1 Tesla V100 GPU and train ResNet-18 on CIFAR-10 with batch size 128. The split dimension, learning rate, damping and update interval can be found in Figure 6. The weight decay for SGD, KFAC, THOR, THOR_stop and THOR_NT is set to 0.0005. The trace thresholds are set to (ω_1, ω_2) = (0.01, 0) for THOR, (ω_1, ω_2) = (0.01, 0.001) for THOR_stop and (ω_1, ω_2) = (0, 0) for THOR_NT. The update interval for KFAC is set to 20.

Results. In Figure 7, we compare our methods with KFAC and SGD with momentum on CIFAR-10 in terms of loss, test accuracy and wall-clock time. Figure 7(a) shows that THOR, THOR_stop, THOR_NT and KFAC converge faster than SGD in the first 30 epochs, and all of them are able to reach high training accuracy. It can be seen from Figure 7(c) that THOR, THOR_stop, THOR_NT and KFAC are faster than SGD in the first 30 epochs, and the second-order algorithms are able to achieve higher test accuracy than SGD. In particular, THOR can reach 95% test accuracy in this experiment. Figure 7(b) shows that our methods outperform KFAC but have no advantage over SGD in terms of the training loss. However, for test accuracy, THOR is 152.67s faster, THOR_NT is 138.56s faster and THOR_stop is 186.59s faster than SGD at 93% test accuracy; a summary of the computational results can be seen in Table 2. Note that in this experiment, for the second-order methods, we use the same learning rate α as that of SGD. After adjusting the parameters, we can get better results. For instance, THOR_stop is 435s faster than SGD when reaching 93% test accuracy by tuning the learning rate, and THOR_stop is also 3809s faster than EKFAC at 93% test accuracy.

Furthermore, we did an ablation study to see how the frequency updating strategy, the trace-based updating rule and the matrix split each affect the results on ResNet-18 + CIFAR-10, and we named the corresponding improved algorithms THOR_tr, THOR_fre and THOR_sp, respectively. Our study showed that THOR_tr accelerated training by 65% compared to the original KFAC algorithm over 90 epochs, while THOR_fre and THOR_sp respectively accelerated it by 48.5% and 40.5% compared to the original KFAC algorithm over 90 epochs. THOR_sp gained lower acceleration since ResNet-18's Fisher information matrix is much smaller than those of other models.

Table 4: The computational results of ResNet-50 on ImageNet.

                    SGD        THOR       THOR_stop   THOR_NT
Best Test Acc       76.04%     75.92%     75.92%      76.00%
Time Per Epoch      90.00s     102.15s    100.05s     103.65s
Time (74.9%)        6569.86s   3674.88s   3405.26s    3747.74s
Time (75.9%)        7020.98s   4083.20s   4004.47s    4148.03s

4.2 ImageNet

Figure 6: The hyper-parameters for training ResNet-18 on CIFAR-10. (a) The split dimension list is [1, 9, 18, 36, 72, 144, 288, 576, 1152, 2304, 4608]; we set the split dimension to 72. (b) The same learning rate is used for SGD, KFAC, THOR, THOR_stop and THOR_NT. (c) The same damping is used for KFAC, THOR, THOR_stop and THOR_NT. (d) The same update interval of the FIM is used for THOR, THOR_stop and THOR_NT.

Setup. In this experiment, we implement the proposed THOR on MindSpore with 8 Ascend 910 and train ResNet-50 on ImageNet with batch size 256. The weight decay for these methods is set to 0.0005, the label smoothing is set to 0.1, and the split dimension is set to 128 as in Figure 8. The trace thresholds are set to (ω_1, ω_2) = (0.01, 0) for THOR, (ω_1, ω_2) = (0.01, 0.001) for THOR_stop and (ω_1, ω_2) = (0, 0) for THOR_NT. The split dimension, learning rate, damping and update interval can be found in Figure 8. The learning rate α for epoch e is determined as follows:

    α^(e) = α_target · (1 − e / e_end)^{p_decay},    (13)

where α_target is the target learning rate, e_end is the end of the decay epoch, and p_decay is the decay rate. Figure 9 shows the impact of the target learning rate and decay rate on the test accuracy reached after 40 epochs with batch size 256. For larger batch sizes, a warmup strategy improves the training result. The specific strategy is given as follows:

    α^(e) = α_warmup + ((α_target − α_warmup) / e_warmup) · e,       e ≤ e_warmup,
    α^(e) = α_target · (1 − (e − e_warmup) / e_end)^{p_decay},       e > e_warmup.    (14)
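A small sketch of the schedules in Eqs. (13)–(14), using the BS=8192 row of Table 3 (α_warmup = 0.01, α_target = 0.8, e_warmup = 5, e_end = 48, p_decay = 6) purely as example values:

```python
def lr_schedule(e, alpha_target, p_decay, e_end,
                alpha_warmup=None, e_warmup=0):
    """Learning rate at epoch e following Eq. (13), or Eq. (14) when warmup is used."""
    if alpha_warmup is not None and e <= e_warmup:
        # Linear warmup from alpha_warmup to alpha_target over e_warmup epochs.
        return alpha_warmup + (alpha_target - alpha_warmup) / e_warmup * e
    offset = e - e_warmup if alpha_warmup is not None else e
    return alpha_target * (1.0 - offset / e_end) ** p_decay

# Example values taken from the BS=8192 row of Table 3.
for e in [0, 2, 5, 10, 20, 40]:
    print(e, round(lr_schedule(e, alpha_target=0.8, p_decay=6, e_end=48,
                               alpha_warmup=0.01, e_warmup=5), 5))
```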

The damping λ adopts the following decreasing rule:

    λ^(e) = λ_(0) · ρ_decay^{e/10},    (15)

where λ_(0) is the initial damping and ρ_decay is the decay rate of the damping. The hyper-parameters for our methods are shown in Table 3.

Results. In Figure 10, we compare our methods with SGD with momentum on ImageNet in terms of their training loss and test accuracy with respect to epoch and wall-clock time. Figures 10(a)(c) show that the convergence speed of THOR, THOR_NT and THOR_stop is faster than that of SGD: SGD needs 78 epochs to converge while THOR, THOR_NT and THOR_stop only need 40 epochs. In Figures 10(b)(d), our methods take less time than SGD; specifically, THOR needs 68.1 min, THOR_NT needs 69.1 min and THOR_stop only takes 66.7 min to converge, while SGD needs 117 min. A summary of the computational results can be seen in Table 4. From these results, our proposed algorithm has outstanding performance on large-scale tasks such as ResNet-50. THOR is also competitive in terms of end-to-end training time with various batch sizes across different hardware platforms: it takes 4.1 min/2.7 min to reach a test accuracy of 75.9% with batch size 4096/8192, respectively, as reported in Table 5.

Figure 7: ResNet-18 on CIFAR-10. (a) The training loss with epoch. (b) The training loss with wall-clock time. (c) The test accuracy with epoch. (d) The test accuracy with wall-clock time.

Table 5: The results for large batch sizes of ResNet-50 on ImageNet.

                                     Hardware           Software     Batch size   Optimizer   Time        Accuracy
[He et al. 2016]                     Tesla P100 × 8     Caffe        256          SGD         29 hr       75.3%
[Goyal et al. 2017]                  Tesla P100 × 256   Caffe2       8192         SGD         60 min      76.3%
Google 0.7-2 [https://mlperf.org]    NVIDIA V100 × 8    TensorFlow   2496         LARS        88.56 min   75.9%
[Osawa et al. 2020]                  Tesla V100 × 128   Chainer      4096         SP-NGD      32.5 min    74.8%
[Osawa et al. 2020]                  Tesla V100 × 256   Chainer      8192         SP-NGD      16.9 min    75.3%
our work                             Ascend 910 × 8     MindSpore    256          THOR        66.7 min    75.9%
our work                             Ascend 910 × 128   MindSpore    4096         THOR        4.1 min     75.9%
our work                             Ascend 910 × 256   MindSpore    8192         THOR        2.7 min     75.9%

Figure 8: The hyper-parameters for training ResNet-50 on ImageNet. (a) The split dimension list is [1, 16, 32, 64, 128, 256, 512, 1024, 2048]; we set the split dimension to 128. (b) The learning rate used in training ResNet-50 on ImageNet for SGD, THOR, THOR_stop and THOR_NT. (c) The damping used in training ResNet-50 on ImageNet for THOR, THOR_stop and THOR_NT. (d) The update interval of the FIM and its inverse matrix for THOR, THOR_stop and THOR_NT.

Figure 9: The learning rate of ResNet-50 on ImageNet.

Figure 10: ResNet-50 on ImageNet. (a) The training loss with epoch. (b) The training loss with wall-clock time. (c) The test accuracy with epoch. (d) The test accuracy with wall-clock time.

5 Related work

A second-order optimizer can accelerate convergence and easily avoid saddle points, but the computational complexity of inverting the FIM is O(n^3) (where n is the dimension of the FIM). Therefore, various approximations of the second-order information matrix have been proposed in recent years. KFAC [Martens and Grosse 2015, Grosse and Martens 2016, Martens, Ba, and Johnson 2018] uses natural gradient descent in deep network training by approximating the FIM with two much smaller matrices based on the network structure and Kronecker products. However, KFAC still requires a lot of computing power and does not have the ideal scalability that is crucial for large-scale tasks. EKFAC [George et al. 2018] tried to solve this problem by using more accurate eigenvalues to reduce the approximation error compared with KFAC, but it still takes too much time to train ResNet-18 on CIFAR-10 in our experiments. More recently, [Osawa et al. 2019, 2020] implemented an improved KFAC on ResNet-50 for ImageNet with powerful computational resources (1024 Tesla V100). In terms of wall-clock time, the result is quite promising (it takes 5.5 min to achieve a top-1 accuracy of 75.4% on ResNet-50 for ImageNet). In our work, the proposed method is more efficient: we train ResNet-50 on ImageNet to 75.9% in 2.7 minutes with 256 Ascend 910. Moreover, we are able to achieve a top-1 accuracy of 75.9% in 66.7 minutes with much less computational resources (8 Ascend 910) than [Osawa et al. 2019, 2020].

6 Conclusion

In this paper, we propose THOR to speed up second-order optimizers by reducing the computation cost of inverting the FIM. The algorithm assumes that the FIM converges to a stationary distribution, uses the trace of each matrix block to increase the update interval of the matrix blocks, and makes a more radical approximation to the matrix blocks. The experiments on CIFAR-10 and ImageNet clearly demonstrate that THOR converges much faster than EKFAC, KFAC and SGD. Especially on ImageNet, THOR's overall time is much less than that of SGD: THOR only uses 66.7 minutes to converge with 8 Ascend 910, which is only half the time of SGD. In the future, we will apply THOR to other deep learning models to speed up their training, such as BERT [Devlin et al. 2018] and GPT-2 [Radford et al. 2019].

References

Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2): 251–276.

Berahas, A. S.; Jahani, M.; and Takáč, M. 2019. Quasi-Newton methods for deep learning: Forget the past, just sample. arXiv preprint arXiv:1901.09997.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

George, T.; Laurent, C.; Bouthillier, X.; Ballas, N.; and Vincent, P. 2018. Fast approximate natural gradient descent in a Kronecker factored eigenbasis. In Advances in Neural Information Processing Systems, 9550–9560.

Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

Grosse, R.; and Martens, J. 2016. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, 573–582.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Keskar, N. S.; and Berahas, A. S. 2016. adaQN: An adaptive quasi-Newton algorithm for training RNNs. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 1–16. Springer.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kiros, R. 2013. Training neural networks with stochastic Hessian-free optimization. arXiv preprint arXiv:1301.3641.

Le, Q. V.; Ngiam, J.; Coates, A.; Lahiri, A.; Prochnow, B.; and Ng, A. Y. 2011. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, 265–272.

Martens, J. 2010. Deep learning via Hessian-free optimization. In ICML, volume 27, 735–742.

Martens, J. 2014. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.

Martens, J.; Ba, J.; and Johnson, M. 2018. Kronecker-factored curvature approximations for recurrent neural networks.

Martens, J.; and Grosse, R. 2015. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2408–2417.

Nocedal, J.; and Wright, S. 2006. Numerical Optimization. Springer Science & Business Media.

Ollivier, Y.; Arnold, L.; Auger, A.; and Hansen, N. 2017. Information-geometric optimization algorithms: A unifying picture via invariance principles. The Journal of Machine Learning Research 18(1): 564–628.

Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Foo, C.-S.; and Yokota, R. 2020. Scalable and practical natural gradient for large-scale deep learning. arXiv preprint arXiv:2002.06015.

Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Yokota, R.; and Matsuoka, S. 2019. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 12359–12367.

Pan, W.; Innanen, K. A.; and Liao, W. 2017. Accelerating Hessian-free Gauss-Newton full-waveform inversion via l-BFGS preconditioned conjugate-gradient algorithm. Geophysics 82(2): R49–R64.

Qian, N. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks 12(1): 145–151.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8): 9.

Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3): 471–501.

Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics 400–407.

Roux, N. L.; Manzagol, P.-A.; and Bengio, Y. 2008. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, 849–856.

Srebro, N.; Rennie, J.; and Jaakkola, T. S. 2005. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 1329–1336.

You, Y.; Li, J.; Reddi, S.; Hseu, J.; Kumar, S.; Bhojanapalli, S.; Song, X.; Demmel, J.; Keutzer, K.; and Hsieh, C.-J. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.

Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Zhang, G.; Sun, S.; Duvenaud, D.; and Grosse, R. 2018. Noisy natural gradient as variational inference. In International Conference on Machine Learning, 5847–5856.

Zhu, C.; Byrd, R. H.; Lu, P.; and Nocedal, J. 1997. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS) 23(4): 550–560.