Preconditioning for Hessian-Free Optimization

Computational Science and Engineering (Int. Master’s Program) Technische Universitat¨ Munchen¨ Master’s thesis in Computational Science and Engineering Preconditioning for Hessian-Free Optimization Robert Seidl Computational Science and Engineering (Int. Master’s Program) Technische Universitat¨ Munchen¨ Master’s thesis in Computational Science and Engineering Preconditioning for Hessian-Free Optimization Author: Robert Seidl 1st examiner: Prof. Dr. Thomas Huckle 2nd examiner: Prof. Dr. Michael Bader Thesis handed in on: April 2, 2012 I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references. April 2, 2012 Robert Seidl i Abstract Recently Martens adapted the Hessian-free optimization method for the training of deep neural networks. One key aspect of this approach is that the Hessian is never computed explicitly, instead the Conjugate Gradient(CG) Algorithm is used to compute the new search direction by applying only matrix-vector products of the Hessian with arbitrary vectors. This can be done efficiently using a variant of the backpropagation algorithm. Recent algorithms use diagonal preconditioners to reduce the needed iterations of the CG algorithm. They are used because of their easy calculation and application. Unfortunately in later stages of the optimization these diagonal preconditioners are not as well suited for the inner iteration as they are for the optimization in the earlier stages. This is mostly due to an increased number of elements of the dense Hessian having the same order of magnitude near an optimum. We construct a sparse approximate inverse preconditioner (SPAI) that is used to accel- erate the inner iteration especially in the later stages of the optimization. The quality of our preconditioner depends on a predefined sparsity pattern. We apply the knowledge of the pattern of the Gauss-Newton approximation of the Hessian to efficiently construct the needed pattern for our preconditioner which can then be computed efficiently fully in parallel using GPUs. This preconditioner is then applied to a deep auto-encoder test case using different update strategies. ii Contents Abstract i Outline of the Thesis iv I. Introduction 1 II. Theory 5 1. Iterative Solution of Linear Equation Systems 6 1.1. Stationary Iterative Methods . .6 1.1.1. Jacobi Method . .7 1.1.2. Gauss-Seidel Method . .7 1.2. Nonstationary Iterative Methods . .7 1.2.1. Conjugate Gradient Method . .8 1.2.2. GMRES Method . .9 2. Preconditioning 11 2.1. Implicit preconditioners . 11 2.2. Explicit preconditioners . 11 3. Hessian-free Optimization 14 3.1. Truncated Newton Methods . 14 3.2. Levenberg-Marquardt Method . 15 3.3. Implemented algorithm . 16 3.4. Hessian vs Gauss-Newton approximation . 16 3.5. Inner iteration of the HF method . 17 3.6. Preconditioning of inner iteration . 19 3.7. Peculiarities of Martens implementation . 20 4. Neural networks 21 4.1. Feed-forward Networks . 21 4.2. Deep auto-encoder . 23 4.3. Network Training . 24 4.4. Error backpropagation . 25 4.5. Fast multiplication by the Hessian . 27 Contents iii 4.6. Hessian and Gauss-Newton Hessian for neural networks . 29 III. Construction of the SPAI Preconditioner 31 5. Efficient pattern finding 33 5.1. Gauss-Newton approximation . 34 5.2. Building the sparsity pattern . 34 5.3. Summary of the pattern finding method . 35 IV. Case study 37 6. Training of a deep auto-encoder 38 6.1. Overview . 38 6.2. Pathological Curvature . 39 6.3. Analysis . 40 6.4. Results for different sparsity pattern . 44 6.5. Results for reduced models . 46 6.6. Results using CG for the inner iteration . 48 V. Conclusions and Future Work 51 7. Conclusions 52 8. Future Work 53 8.1. Parallelization of the method . 53 8.2. Termination of CG . 53 8.3. Application of a better approximation of the diagonal of the Hessian . 53 8.4. Using more information for the preconditioner . 54 8.5. Efficient calculation of particular entries of the Hessian . 54 Appendix 57 A. Pattern finding idea 57 B. Sparsity Patterns 58 C. Results for different Hessians 60 D. CG results 62 Bibliography 63 Outline of the Thesis Part I: Introduction This chapter presents an overview of the thesis and its purpose. Part II: Theory CHAPTER 1: ITERATIVE SOLUTION OF LINEAR EQUATION SYSTEMS This chapter presents the basic theory for the iterative solution of linear equation systems. The first part deals with stationary methods like the Jacobi and Gauss-Seidel method. The second part deals with Krylov subspace methods which are nonstationary methods that form the basis for the inner iteration of truncated Newton methods. Examples are the Conjugate Gradient method and GMRES. CHAPTER 2: PRECONDITIONING How fast iterative methods like CG find a solution depends on spectral properties of the underlying matrix. It is often convenient to transform the equation system into a system that has the same solution but has more favorable spectral properties. This is achieved by finding a suitable preconditioner. We will focus on a sparse preconditioner that is an approximation of the inverse of A. The main task is to find an appropriate sparsity pattern of A−1 without explicit knowledge. CHAPTER 3: HESSIAN-FREE OPTIMIZATION This chapter summarizes the Hessian-Free or Truncated Newton method used for the optimization of nonlinear functions. It presents basic ideas and discusses peculiarities in Martens implementation used for training of deep neural networks. A particular focus lies on the stopping criteria of the inner iteration of the optimization method. CHAPTER 4: NEURAL NETWORKS This chapter presents the basics of neural networks that are needed to understand the theory behind the case study. It deals with training of neural networks, the backpropagation algorithm used to compute gradient information needed for training of the net and a variant known as the “Pearlmutter trick” used to efficiently compute the matrix vector product of the Hessian Hv with an arbitrary vector. Outline of the Thesis v Part III: Construction of the SPAI Preconditioner This part describes the straightforward approach of Benzi to compute a feasible a priori sparsity pattern for a sparse approximate inverse preconditioner. CHAPTER 5: PATTERN FINDING Because of the sparsification costs of the above method it is not economic to apply this method to find a suitable pattern for our preconditioner. This chapter presents an approach to come up with a suitable pattern by exploiting the known pattern of the Gauss-Newton Hessian that is usually used as a substitute of the Hessian in the context of the Hessian-Free optimization method. Part IV: Case study CHAPTER 6: TRAINING OF A DEEP AUTO-ENCODER In this case study Martens Matlab implementation of the Hessian-Free optimization method is used to train a deep auto-encoder. This task is known to be hard for algorithms based only on gradient information like the backpropagation algorithm, especially without pretraining. Then the changing structure of the Gauss-Newton Hessian in the different stages of the optimization is investigated and results of applying different preconditioners on repre- sentatives of the different stages are presented. Furthermore the final results of the application of the SPAI preconditioner, using different update strategies in HF using GMRES or CG for the inner iteration with different stopping criteria based on the value of the quadratic approximation, are discussed . Part V: Conclusions and Future Work CHAPTER 7: CONCLUSIONS This chapter summarizes the thesis. CHAPTER 8: FUTURE WORK This chapter presents some possible topics for future work like studies dealing with speedup of the computation of SPAI preconditioners on a graphics card or a possibly more efficient way to compute the needed entries of the Hessian for the computation of our preconditioner. 1 Part I. Introduction 2 Introduction Recently there are a lot of algorithms trying to incorporate second order information into the training of deep neural networks. The task of training these networks is known to be very hard using common training algorithms like backpropagation. This is mostly due to two reasons. The first one is the well known vanishing gradi- ents problem and the second one is pathological curvature. The latter is making it very tough for algorithms based only on gradient information. It often appears that they get stuck in a local minimum because of pathological curvature. Following this reasoning, Martens (2010) [25] successfully applied a variant of the Hessian- free optimization algorithm to the task of learning for deep auto-encoders and more recently also to recurrent neural networks [26]. The Hessian-free optimization method, better known as truncated-newton method, uses the idea of Newton’s method and some globalisation ideas to find the minimum of a nonlinear function. Other than Newton’s method it settles for solving the Newton equations only approximately using some iterative Krylov subspace method like the Conjugate Gradient method. Thus the explicit knowledge of the Hessian is not needed. Instead there is only the need to efficiently compute matrix-vector products of the Hes- sian with arbitrary vectors . This is possible through the well-known Pearlmuttertrick [30], which uses a variant of the backpropagation algorithm for feed forward neural networks. There is also an approach called Krylov subspace decent [34] that explicitly computes a basis of the Krylov subspace for a fixed k < N and numerically optimizes the parameter change within this subspace, using BFGS to minimize the original nonlinear objective function. Martens uses a handmade diagonal preconditioner to speed up the inner CG iteration based on some approximate values of the diagonal entries of the Hessian. A further improvement of the HF method of Martens is the use of a better diagonal preconditioner using the exact values of the diagonal of the Hessian as supposed in [9].

Preconditioning for Hessian-Free Optimization

CNC/IUGG: 2019 Quadrennial Report

The Method of Gauss-Newton to Compute Power Series Solutions of Polynomial Homotopies∗

Approximate Newton's Method for Tomographic Image Reconstruction

On the Use of Neural Networks to Solve Differential Equations Alberto García Molina

Petsc's Software Strategy for the Design Space of Composable

Sequential Quadratic Programming Methods∗

Arxiv:2011.07410V2 [Cs.CE] 3 Jul 2021

Trust Region Newton Method for Large-Scale Logistic Regression

Data-Scalable Hessian Preconditioning for Distributed Parameter PDE-Constrained Inverse Problems

International Conference on Differential, Difference Equations and Their Differential, Difference Applications

CREWES in 2016

Full Waveform Inversion and the Truncated Newton Method∗