EA-CG: an Approximate Second-Order Method for Training Fully-Connected Neural Networks

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) EA-CG: An Approximate Second-Order Method for Training Fully-Connected Neural Networks Sheng-Wei Chen Chun-Nan Chou Edward Y. Chang HTC Research & Healthcare HTC Research & Healthcare HTC Research & Healthcare sw [email protected] jason.cn [email protected] edward [email protected] Abstract the common practice is to use the Gauss-Newton matrix with a convex criterion function (Schraudolph 2002) or the Fisher For training fully-connected neural networks (FCNNs), we propose a practical approximate second-order method includ- matrix to measure curvature since both are guaranteed to be ing: 1) an approximation of the Hessian matrix and 2) a conju- positive semi-definite (PSD). gate gradient (CG) based method. Our proposed approximate Although these two kinds of matrices can alleviate the is- Hessian matrix is memory-efficient and can be applied to any sue of negative curvature, computing either the exact Gauss- FCNNs where the activation and criterion functions are twice Newton matrix or Fisher matrix even for a modestly-sized differentiable. We devise a CG-based method incorporating fully-connected neural network (FCNN) is intractable. In- one-rank approximation to derive Newton directions for train- tuitively, the analytic expression for the second derivative ing FCNNs, which significantly reduces both space and time requires O(N 2) computations if O(N) complexity is re- complexity. This CG-based method can be employed to solve any linear equation where the coefficient matrix is Kronecker- quired to compute the first derivative. Thus, several pioneer factored, symmetric and positive definite. Empirical studies works (LeCun et al. 1998; Amari, Park, and Fukumizu 2000; show the efficacy and efficiency of our proposed method. Schraudolph 2002) have used different methods to approximate either matrix. However, none of these methods have been shown to be computationally feasible and fundamen- Introduction tally more effective than first-order methods as reported Neural networks have been applied to solving problems in in (Martens 2010). Thus, there has been a growing trend several application domains such as computer vision (He towards conceiving more computationally feasible second- et al. 2016), natural language processing (Hochreiter and order methods for training FCNNs. Schmidhuber 1997), and disease diagnosis (Chang et al. We outline several notable works in chronological or- 2017). Training a neural network requires tuning its model der herein. Martens proposed a truncated-Newton method parameters using Backpropagation. Stochastic gradient de- for training deep auto-encoders. In this work, Martens used scent (SGD), Broyden-Fletcher-Goldfarb-Shanno and one- an R-operator (Pearlmutter 1994) to compute full Gauss- step secant are representative algorithms that have been em- Newton matrix-vector products and made good progress ployed for training in Backpropagation. within each update. Martens and Grosse developed a To date, SGD is widely used due to its low computational block-diagonal approximation to the Fisher matrix for FC- demand. SGD minimizes a function using the function’s first NNs, called Kronecker-factored Approximation Curvature derivative, and has been proven to be effective for train- (KFAC). They derived the update directions by exploiting ing large models. However, stochasticity in the gradient will the inverse property of Kronecker products. KFAC features slow down convergence for any gradient method such that that its cost in storing and inverting the devised approxima- none of them can be asymptotically faster than simple SGD tion does not depend on the amount of data used to estimate with Polyak averaging (Polyak and Juditsky 1992). Besides the Fisher matrix. The idea is further extended to convo- gradients, second-order methods utilize the curvature infor- lutional nets (Grosse and Martens 2016). Recently, Botev, mation of a loss function within the neighborhood of a given Ritter, and Barber presented a block-diagonal approxima- point to guide the update direction. Since each update be- tion to the Gauss-Newton matrix for FCNNs, referred to comes more precise, such methods can converge faster than as Kronecker-Factored Recursive Approximation (KFRA). first-order methods in terms of update iterations. They also utilized the inverse property of Kronecker prod- For solving a convex optimization problem, a second- ucts to derive the update directions. Similarly, Zhang et al. order method can always converge to the global minimum introduced a block-diagonal approximation of the Gauss- in much fewer steps than SGD. However, the problem of Newton matrix and used conjugate gradient (CG) method neural-network training can be non-convex, thereby suffer- to derive the update directions. ing from the issue of negative curvature. To avoid this issue, However, these prior works either impose constraints on Copyright c 2019, Association for the Advancement of Artificial their applicability or require considerable computation in Intelligence (www.aaai.org). All rights reserved. terms of memory or running time. On the one hand, sev- 3337 eral of these notable methods relying on the Gauss-Newton Suppose we have a minimization problem matrix face the essential limit that they cannot handle non- min f(θ); (1) convex criterion functions playing an important part in some θ problems. Use robust estimation in computer vision (Stew- art 1999) as an example. Some non-convex criterion func- where f is a convex and twice-differentiable function. Since the global minimum is at the point that the first derivative is tions such as Tukey’s biweight function are robustness to ∗ outliers and perform better than convex criterion functions zero, the solution θ can be derived from the equation (Belagiannis et al. 2015). On the other hand, in order to de- rf(θ∗) = 0: (2) rive the update directions, some of the methods utilizing the conventional CG method may take too much time, and the We can utilize a quadratic polynomial to approximate Prob- others exploiting the inverse property of Kronecker products lem 1 by conducting a Taylor expansion with a given point may require excessive memory space. θj. Then, the problem turns out to be To remedy the aforementioned issues, we propose a 1 block-diagonal approximation of the positive-curvature Hes- min f(θj + d) ≈ f(θj) + rf(θj)T d + dT r2f(θj)d; d 2 sian (PCH) matrix, which is memory-efficient. Our proposed PCH matrix can be applied to any FCNN where the acti- where r2f(θj) is the Hessian matrix of f at θj. Af- vation and criterion functions are twice differentiable. Par- ter applying the aforementioned approximation, we rewrite ticularly, our proposed PCH matrix can handle non-convex Eq. (2) as the linear equation criterion functions, which the Gauss-Newton methods can- j 2 j j not. Besides, we incorporate expectation approximation into rf(θ ) + r f(θ )d = 0: (3) the CG-based method, which is dubbed EA-CG, to derive Therefore, the Newton direction is obtained via update directions for training FCNNs in mini-batch setting. EA-CG significantly reduces the space and time complexity dj = −∇2f(θj)−1rf(θj); of the conventional CG method. Our experimental results ∗ show the efficacy and efficiency of our proposed method. and we can acquire θ by iteratively applying the update rule In this work, we focus on deriving a second-order method θj+1 = θj + ηdj; for training FCNNs since the shared weights of convolutional layers lead to the difficulties in factorizing its Hes- where η is the step size. sian. We defer tackling convolutional layers to our future For non-convex problems, the solution to Eq. (2) reflects work. We also focus on the classification problem and hence one of three possibilities: a local minimum θmin, a local do not consider auto-encoders. Our strategy is that once a maximum θmax or a saddle point θsaddle. Some previous simpler isolated problem can be effectively handled, we can works such as (Dauphin et al. 2014; Goldfarb et al. 2017; then extend our method to address more challenging issues. Mizutani and Dreyfus 2008) utilize negative curvature infor- In summary, the contributions of this paper are as follows: mation to converge to a local minimum. Before illustrating how these previous works tackle the issue of negative cur- 1. For curvature information, we propose the PCH matrix vature, we have to introduce a crucial concept that we can to improve the Gauss-Newton matrix for training FCNNs know the curvature information of f at a given point θ by with convex criterion functions and overcome the non- analyzing the Hessian matrix r2f(θ). On the one hand, the convex scenario. Hessian matrix of f at any θmin is positive semi-definite. 2. To derive the update directions, we devise effective EA- On the other hand, the Hessian matrices of f at any θmax CG method, which does converge faster in terms of wall and θsaddle are negative semi-definite and indefinite, respec- clock time and enjoys better testing accuracy than com- tively. After establishing the concept, we can use the follow- peting methods. Specially, the performance of EA-CG is ing property to understand how to utilize the negative curva- competitive with SGD. ture information to resolve the issue of negative curvature. Property 1. Let f be a non-convex and twice-differentiable Truncated-Newton Method on Non-Convex function. With a given point θj, we suppose that there ex- 2 j Problems ist some negative eigenvalues fλ1; : : : ; λsg for r f(θ ). Moreover, we take V = span(fv1;:::; vsg), which is the Newton’s method is one of the second-order minimization eigenspace corresponds to fλ1; : : : ; λsg. If we take methods, and is generally composed of two steps: 1) com- T 1 puting the Hessian matrix and 2) solving the system of lin- g(k) = f(θj) + rf(θj) v + vT r2f(θj)v; ear equations for update directions.

Load more