Training Neural Networks Without Gradients: A Scalable ADMM Approach
Gavin Taylor 1 [email protected]
Ryan Burmeister 1
Zheng Xu 2 [email protected]
Bharat Singh 2 [email protected]
Ankit Patel 3 [email protected]
Tom Goldstein 2 [email protected]

1 United States Naval Academy, Annapolis, MD USA
2 University of Maryland, College Park, MD USA
3 Rice University, Houston, TX USA

Abstract

With the growing importance of large network models and enormous training datasets, GPUs have become increasingly necessary to train neural networks. This is largely because conventional optimization algorithms rely on stochastic gradient methods that don't scale well to large numbers of cores in a cluster setting. Furthermore, the convergence of all gradient methods, including batch methods, suffers from common problems like saturation effects, poor conditioning, and saddle points. This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps. The proposed method reduces the network training problem to a sequence of minimization sub-steps that can each be solved globally in closed form. The proposed method is advantageous because it avoids many of the caveats that make gradient methods slow on highly non-convex problems. The method exhibits strong scaling in the distributed setting, yielding linear speedups even when split over thousands of cores.

1. Introduction

As hardware and algorithms advance, neural network performance is constantly improving for many machine learning tasks. This is particularly true in applications where extremely large datasets are available to train models with many parameters. Because big datasets provide results that (often dramatically) outperform the prior state-of-the-art in many machine learning tasks, researchers are willing to purchase specialized hardware such as GPUs, and commit large amounts of time to training models and tuning hyper-parameters.

Gradient-based training methods have several properties that contribute to this need for specialized hardware. First, while large amounts of data can be shared amongst many cores, existing optimization methods suffer when parallelized. Second, training neural nets requires optimizing highly non-convex objectives that exhibit saddle points, poor conditioning, and vanishing gradients, all of which slow down gradient-based methods such as stochastic gradient descent, conjugate gradients, and BFGS. Several mitigating approaches to this issue have been introduced, including rectified linear units (ReLU) (Nair & Hinton, 2010), Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997), RPROP (Riedmiller & Braun, 1993), and others, but the fundamental problem remains.

In this paper, we introduce a new method for training the parameters of neural nets using the Alternating Direction Method of Multipliers (ADMM) and Bregman iteration. This approach addresses several problems facing classical gradient methods; the proposed method exhibits linear scaling when data is parallelized across cores, and is robust to gradient saturation and poor conditioning. The method decomposes network training into a sequence of sub-steps that are each solved to global optimality. The scalability of the proposed method, combined with the ability to avoid local minima by globally solving each substep, can lead to dramatic speedups.
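The specific ADMM splitting and Bregman (multiplier) updates used for network training are developed in Sections 3 and 4. As generic background only, the minimal numpy sketch below runs scaled-form ADMM on a toy consensus problem, min over x, z of 0.5||x - a||^2 + 0.5||z - b||^2 subject to x = z, where every sub-step has a closed-form solution; the toy objective, the variable names, and the penalty parameter rho are illustrative assumptions and are not taken from the paper.

import numpy as np

# Toy consensus problem: min_{x,z} 0.5*||x - a||^2 + 0.5*||z - b||^2  subject to  x = z.
# Scaled-form ADMM: each sub-step below is an exact, closed-form minimization.
a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])
rho = 1.0                                  # penalty parameter (illustrative value)
x, z, u = np.zeros(3), np.zeros(3), np.zeros(3)

for _ in range(50):
    x = (a + rho * (z - u)) / (1.0 + rho)  # x-update: argmin_x 0.5||x-a||^2 + (rho/2)||x - z + u||^2
    z = (b + rho * (x + u)) / (1.0 + rho)  # z-update: argmin_z 0.5||z-b||^2 + (rho/2)||x - z + u||^2
    u = u + x - z                          # scaled dual (multiplier) update

print(x, z)                                # both converge to (a + b) / 2

Even in this toy setting, the appeal is visible: each update is a small sub-problem solved exactly, and the only coupling between the updates is the multiplier u, which is the property the abstract's "sequence of minimization sub-steps solved globally in closed form" relies on.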
We begin in Section 2 by describing the mathematical notation and context, and providing a discussion of several weaknesses of gradient-based methods that we hope to address. Sections 3 and 4 introduce and describe the optimization approach, and Sections 5 and 6 describe in detail the distributed implementation. Section 7 provides an experimental comparison of the new approach with standard implementations of several gradient-based methods on two problems of differing size and difficulty. Finally, Section 8 contains a closing discussion of the paper's contributions and the future work needed.

2. Background and notation

Though there are many variations, a typical neural network consists of L layers, each of which is defined by a linear operator W_l and a non-linear neural activation function h_l. Given a (column) vector of input activations a_{l-1}, a single layer computes and outputs the non-linear function a_l = h_l(W_l a_{l-1}). A network is formed by layering these units together in a nested fashion to compute a composite function; in the 3-layer case, for example, this would be

    f(a_0; W) = W_3 (h_2(W_2 h_1(W_1 a_0)))        (1)

where W = {W_l} denotes the ensemble of weight matrices, and a_0 contains input activations for every training sample (one sample per column). The function h_3 is absent as it is common for the last layer to not have an activation function.

Training the network is the task of tuning the weight matrices W to match the output activations a_L to the targets y, given the inputs a_0. Using a loss function ℓ, the training problem can be posed as

    min_W ℓ(f(a_0; W), y)        (2)

Note that in our notation, we have included all input activations for all training data in the matrix/tensor a_0. This notation benefits our discussion of the proposed algorithm, which operates on all training data simultaneously as a batch.

Also, in our formulation the tensor W contains linear operators, but not necessarily dense matrices. These linear operators can be convolutions with an ensemble of filters, in which case (1) represents a convolutional net.

Finally, the formulation used here assumes a feed-forward architecture. However, our proposed methods can handle more complex network topologies (such as recurrent networks) with little modification.
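To make the notation of Eqs. (1) and (2) concrete, the following minimal numpy sketch evaluates the 3-layer forward pass and the training objective, with one training sample per column of a_0 as in the notation above; the ReLU activation, the squared-error loss, and all dimensions are illustrative assumptions rather than choices fixed by the paper.

import numpy as np

def h(z):
    # One common choice for the activations h_1, h_2: the ReLU non-linearity.
    return np.maximum(z, 0.0)

def f(W, a0):
    # Eq. (1): f(a_0; W) = W_3 (h_2(W_2 h_1(W_1 a_0))); no activation on the last layer.
    W1, W2, W3 = W
    return W3 @ h(W2 @ h(W1 @ a0))

def training_loss(W, a0, y):
    # Eq. (2) with an illustrative squared-error loss l(z, y) = ||z - y||^2.
    return np.sum((f(W, a0) - y) ** 2)

# Illustrative sizes: 10 input features, two hidden layers of width 20, 2 outputs,
# and 5 training samples stored as the columns of a_0 (one sample per column).
rng = np.random.default_rng(0)
W = [rng.standard_normal((20, 10)), rng.standard_normal((20, 20)), rng.standard_normal((2, 20))]
a0 = rng.standard_normal((10, 5))
y = rng.standard_normal((2, 5))
print(training_loss(W, a0, y))

Here training_loss is Eq. (2) evaluated for one particular choice of ℓ; the paper itself leaves the loss generic.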
2.1. What's wrong with backprop?

Most networks are trained using stochastic gradient descent (SGD, i.e., backpropagation), in which the gradient of the network loss function is approximated using a small number of training samples, and then a descent step is taken using this approximate gradient. Stochastic gradient methods work extremely well in the serial setting, but lack scalability. Recent attempts to scale SGD include Downpour, which runs SGD simultaneously on multiple cores. This model averages parameters across cores using multiple communication nodes that store copies of the model. A conceptually similar approach is elastic averaging (Zhang et al., 2015), in which different processors simultaneously run SGD using a quadratic penalty term that prevents different processes from drifting too far from the central average. These methods have found success with modest numbers of processors, but fail to maintain strong scaling for large numbers of cores. For example, for several experiments reported in (Dean et al., 2012), the Downpour distributed SGD method runs slower with 1500 cores than with 500 cores.

The scalability of SGD is limited because it relies on a large number of inexpensive minimization steps that each use a small amount of data. Forming a noisy gradient from a small mini-batch requires very little computation. The low cost of this step is an asset in the serial setting, where it enables the algorithm to move quickly, but a disadvantage in the parallel setting, where each step is too inexpensive to be split over multiple processors. For this reason, SGD is ideally suited for computation on GPUs, where multiple cores can simultaneously work on a small batch of data using a shared memory space with virtually no communication overhead.

When parallelizing over CPUs, it is preferable to have methods that use a small number of expensive minimization steps, preferably involving a large amount of data. The work required on each minimization step can then be split across many worker nodes, and the latency of communication is amortized over a large amount of computation. This approach has been suggested by numerous authors who propose batch computation methods (Ngiam et al., 2011), which compute exact gradients on each iteration using the entire dataset, including conjugate gradients (Towsey et al., 1995; Møller, 1993), BFGS, and Hessian-free (Martens & Sutskever, 2011; Sainath et al., 2013) methods.

Unfortunately, all gradient-based approaches, whether batched or stochastic, also suffer from several other critical drawbacks. First, gradient-based methods suffer from vanishing gradients. During backpropagation, the derivatives of shallow layers in a network are formed using products of weight matrices and derivatives of non-linearities from downstream layers. When the eigenvalues of the weight matrices are small and the derivatives of the non-linearities are nearly zero (as they often are for sigmoid and ReLU non-linearities), multiplication by these terms annihilates information (a short numerical sketch of this effect appears at the end of this section). The resulting gradients in shallow layers contain little information about the error (Bengio et al., 1994; Riedmiller & Braun, 1993; Hochreiter & Schmidhuber, 1997).

Second, backprop has the potential to get stuck at local minima and saddle points. While recent results suggest that local minimizers of SGD are close to global minima (Choromanska et al., 2014), in practice SGD often lingers near saddle points where gradients are small (Dauphin et al., 2014).

Finally, backprop does not easily parallelize over layers, a significant bottleneck when considering deep architectures. However, recent work on SGD has successfully used model parallelism by using multiple replicas of the entire network.

employs least-squares parameter updates that are conceptually similar to (but different from) the Parallel Weight Update proposed here (see Section 5). However, there is currently no implementation nor any training results to compare against.

Note that our work is the first to consider alternating least squares as a method to distribute computation across a cluster, although the authors of (Carreira-Perpinán & Wang, 2012) do consider implementations that are "distributed" in the sense of using multiple threads on a single machine via the Matlab matrix toolbox.
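The following numpy snippet is the numerical sketch of vanishing gradients referenced in Section 2.1: it pushes an error signal backward through a deep stack of sigmoid layers, multiplying at each layer by the backpropagation factor W_l^T diag(h'(z_l)). The depth, width, and weight scale are arbitrary illustrative choices, not values from the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
depth, width = 20, 50
Ws = [0.1 * rng.standard_normal((width, width)) for _ in range(depth)]  # small-magnitude weights (illustrative)

# Forward pass, storing pre-activations z_l so the backward pass can evaluate h'(z_l).
a, zs = rng.standard_normal(width), []
for W in Ws:
    z = W @ a
    zs.append(z)
    a = sigmoid(z)

# Backward pass: g_{l-1} = W_l^T (h'(z_l) * g_l), where h'(z) = sigmoid(z)(1 - sigmoid(z)) <= 0.25.
g = np.ones(width)                              # error signal entering the deepest layer
for W, z in zip(reversed(Ws), reversed(zs)):
    g = W.T @ (sigmoid(z) * (1.0 - sigmoid(z)) * g)
    print(np.linalg.norm(g))                    # the norm shrinks at every layer it passes through

Each factor contracts the error signal, so by the time it reaches the shallow layers it carries almost no information about the loss, which is exactly the failure mode described in Section 2.1.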