
Training Neural Networks Without Gradients: A Scalable ADMM Approach

Gavin Taylor (1)  [email protected]
Ryan Burmeister (1)
Zheng Xu (2)  [email protected]
Bharat Singh (2)  [email protected]
Ankit Patel (3)  [email protected]
Tom Goldstein (2)  [email protected]

(1) United States Naval Academy, Annapolis, MD USA
(2) University of Maryland, College Park, MD USA
(3) Rice University, Houston, TX USA

Abstract

With the growing importance of large network models and enormous training datasets, GPUs have become increasingly necessary to train neural networks. This is largely because conventional optimization algorithms rely on stochastic gradient methods that don't scale well to large numbers of cores in a cluster setting. Furthermore, the convergence of all gradient methods, including batch methods, suffers from common problems like saturation effects, poor conditioning, and saddle points. This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps. The proposed method reduces the network training problem to a sequence of minimization sub-steps that can each be solved globally in closed form. The proposed method is advantageous because it avoids many of the caveats that make gradient methods slow on highly non-convex problems. The method exhibits strong scaling in the distributed setting, yielding linear speedups even when split over thousands of cores.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

1. Introduction

As hardware and algorithms advance, neural network performance is constantly improving for many machine learning tasks. This is particularly true in applications where extremely large datasets are available to train models with many parameters. Because big datasets provide results that (often dramatically) outperform the prior state-of-the-art in many tasks, researchers are willing to purchase specialized hardware such as GPUs, and commit large amounts of time to training models and tuning hyper-parameters.

Gradient-based training methods have several properties that contribute to this need for specialized hardware. First, while large amounts of data can be shared amongst many cores, existing optimization methods suffer when parallelized. Second, training neural nets requires optimizing highly non-convex objectives that exhibit saddle points, poor conditioning, and vanishing gradients, all of which slow down gradient-based methods such as stochastic gradient descent, conjugate gradients, and BFGS. Several approaches to mitigating these issues have been introduced, including rectified linear units (ReLU) (Nair & Hinton, 2010), Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997), RPROP (Riedmiller & Braun, 1993), and others, but the fundamental problem remains.

In this paper, we introduce a new method for training the parameters of neural nets using the Alternating Direction Method of Multipliers (ADMM) and Bregman iteration. This approach addresses several problems facing classical gradient methods: the proposed method exhibits linear scaling when data is parallelized across cores, and is robust to gradient saturation and poor conditioning. The method decomposes network training into a sequence of sub-steps that are each solved to global optimality. The scalability of the proposed method, combined with the ability to avoid local minima by globally solving each sub-step, can lead to dramatic speedups.

We begin in Section 2 by describing the mathematical notation and context, and providing a discussion of several weaknesses of gradient-based methods that we hope to address. Sections 3 and 4 introduce and describe the optimization approach, and Sections 5 and 6 describe the distributed implementation in detail. Section 7 provides an experimental comparison of the new approach with standard implementations of several gradient-based methods on two problems of differing size and difficulty. Finally, Section 8 contains a closing discussion of the paper's contributions and the future work needed.

2. Background and notation

Though there are many variations, a typical neural network consists of L layers, each of which is defined by a linear operator W_l and a non-linear neural activation function h_l. Given a (column) vector of input activations a_{l-1}, a single layer computes and outputs the non-linear function a_l = h_l(W_l a_{l-1}). A network is formed by layering these units together in a nested fashion to compute a composite function; in the 3-layer case, for example, this would be

    f(a_0; W) = W_3 \big( h_2(W_2 h_1(W_1 a_0)) \big)    (1)

where W = {W_l} denotes the ensemble of weight matrices, and a_0 contains input activations for every training sample (one sample per column). The function h_3 is absent because it is common for the last layer to not have an activation function.

Training the network is the task of tuning the weight matrices W to match the output activations a_L to the targets y, given the inputs a_0. Using a loss function ℓ, the training problem can be posed as

    \min_W \; \ell(f(a_0; W), y).    (2)

Note that in our notation, we have included all input activations for all training data in the matrix/tensor a_0. This notation benefits our discussion of the proposed algorithm, which operates on all training data simultaneously as a batch.

Also, in our formulation the tensor W contains linear operators, but not necessarily dense matrices. These linear operators can be convolutions with an ensemble of filters, in which case (1) represents a convolutional net.

Finally, the formulation used here assumes a feed-forward architecture. However, our proposed methods can handle more complex network topologies (such as recurrent networks) with little modification.
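To make the notation concrete, here is a short NumPy sketch (our own illustration, not code from the paper; the squared-error loss, shapes, and names are assumptions) that evaluates the composite function (1) and the training objective (2) for a 3-layer ReLU network:

```python
import numpy as np

def relu(x):
    # Entry-wise ReLU activation, h(x) = max(x, 0).
    return np.maximum(x, 0)

def forward(a0, W):
    # Composite function (1): f(a0; W) = W3 h2(W2 h1(W1 a0)).
    # a0 holds one training sample per column; W is a list [W1, W2, W3].
    a = a0
    for Wl in W[:-1]:
        a = relu(Wl @ a)        # a_l = h_l(W_l a_{l-1})
    return W[-1] @ a            # last layer has no activation

def training_loss(a0, W, y):
    # Training objective (2), with a squared-error loss standing in for ell.
    return 0.5 * np.linalg.norm(forward(a0, W) - y) ** 2

# Illustrative shapes: 20 input features, two hidden layers, scalar output, 100 samples.
rng = np.random.default_rng(0)
a0 = rng.standard_normal((20, 100))
y = rng.standard_normal((1, 100))
W = [rng.standard_normal(s) for s in [(32, 20), (16, 32), (1, 16)]]
print(training_loss(a0, W, y))
```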
2.1. What's wrong with backprop?

Most networks are trained using stochastic gradient descent (SGD, i.e. backpropagation), in which the gradient of the network loss function is approximated using a small number of training samples, and then a descent step is taken using this approximate gradient. Stochastic gradient methods work extremely well in the serial setting, but lack scalability. Recent attempts to scale SGD include Downpour, which runs SGD simultaneously on multiple cores. This model averages parameters across cores using multiple communication nodes that store copies of the model. A conceptually similar approach is elastic averaging (Zhang et al., 2015), in which different processors simultaneously run SGD using a quadratic penalty term that prevents different processes from drifting too far from the central average. These methods have found success with modest numbers of processors, but fail to maintain strong scaling for large numbers of cores. For example, for several experiments reported in (Dean et al., 2012), the Downpour distributed SGD method runs slower with 1500 cores than with 500 cores.

The scalability of SGD is limited because it relies on a large number of inexpensive minimization steps that each use a small amount of data. Forming a noisy gradient from a small mini-batch requires very little computation. The low cost of this step is an asset in the serial setting, where it enables the algorithm to move quickly, but disadvantageous in the parallel setting, where each step is too inexpensive to be split over multiple processors. For this reason, SGD is ideally suited for computation on GPUs, where multiple cores can simultaneously work on a small batch of data using a shared memory space with virtually no communication overhead.

When parallelizing over CPUs, it is preferable to have methods that use a small number of expensive minimization steps, preferably involving a large amount of data. The work required on each minimization step can then be split across many worker nodes, and the latency of communication is amortized over a large amount of computation. This approach has been suggested by numerous authors who propose batch computation methods (Ngiam et al., 2011), which compute exact gradients on each iteration using the entire dataset, including conjugate gradients (Towsey et al., 1995; Møller, 1993), BFGS, and Hessian-free (Martens & Sutskever, 2011; Sainath et al., 2013) methods.

Unfortunately, all gradient-based approaches, whether batched or stochastic, also suffer from several other critical drawbacks. First, gradient-based methods suffer from the vanishing gradient problem. During backpropagation, the gradients of shallow layers in a network are formed using products of weight matrices and derivatives of non-linearities from downstream layers. When the eigenvalues of the weight matrices are small and the derivatives of non-linearities are nearly zero (as they often are for sigmoid and ReLU non-linearities), multiplication by these terms annihilates information. The resulting gradients in shallow layers contain little information about the error (Bengio et al., 1994; Riedmiller & Braun, 1993; Hochreiter & Schmidhuber, 1997).

Second, backprop has the potential to get stuck at local minima and saddle points. While recent results suggest that local minimizers of SGD are close to global minima (Choromanska et al., 2014), in practice SGD often lingers near saddle points where gradients are small (Dauphin et al., 2014).

Finally, backprop does not easily parallelize over layers, a significant bottleneck when considering deep architectures. However, recent work on SGD has successfully used model parallelism by using multiple replicas of the entire network (Dean et al., 2012).

We propose a solution that helps alleviate these problems by separating the objective function at each layer of a neural network into two terms: one term measuring the relation between the weights and the input activations, and the other term containing the nonlinear activation function. We then apply an alternating direction method that addresses each term separately. The first term allows the weights to be updated without the effects of vanishing gradients. In the second step, we have a non-convex minimization problem that can be solved globally in closed form. Also, the form of the objective allows the weights of every layer to be updated independently, enabling parallelization over layers.

This approach does not require any gradient steps at all. Rather, the problem of training network parameters is reduced to a series of minimization sub-problems using the alternating direction method of multipliers. These minimization sub-problems are solved globally in closed form.

2.2. Related work

Other works have applied least-squares based methods to neural networks. One notable example is the method of auxiliary coordinates (MAC) (Carreira-Perpiñán & Wang, 2012), which uses quadratic penalties to approximately enforce equality constraints. Unlike our method, MAC requires iterative solvers for sub-problems, whereas the method proposed here is designed so that all sub-problems have closed-form solutions. Also unlike MAC, the method proposed here uses Lagrange multipliers to exactly enforce equality constraints, which we have found to be necessary for training deeper networks.

Another related approach is the expectation-maximization (EM) algorithm of (Patel et al., 2015), which is derived from the Deep Rendering Model (DRM), a hierarchical generative model for natural images. They show that feed-forward propagation in a deep convolutional net corresponds to inference on their proposed DRM. They derive a new EM learning algorithm for their proposed DRM that employs least-squares parameter updates that are conceptually similar to (but different from) the Parallel Weight Update proposed here (see Section 5). However, there is currently no implementation nor any training results to compare against.

Note that our work is the first to consider alternating least squares as a method to distribute computation across a cluster, although the authors of (Carreira-Perpiñán & Wang, 2012) do consider implementations that are "distributed" in the sense of using multiple threads on a single machine via the Matlab matrix toolbox.

3. Alternating minimization for neural networks

The idea behind our method is to decouple the weights from the nonlinear link functions using a splitting technique. Rather than feeding the output of the linear operator W_l directly into the activation function h_l, we store the output of layer l in a new variable z_l = W_l a_{l-1}. We also represent the output of the link function as a vector of activations a_l = h_l(z_l). We then wish to solve the following problem

    \min_{\{W_l\},\{a_l\},\{z_l\}} \; \ell(z_L, y)
    \text{subject to} \;\; z_l = W_l a_{l-1}, \;\; \text{for } l = 1, 2, \dots, L    (3)
                      \;\; a_l = h_l(z_l), \;\; \text{for } l = 1, 2, \dots, L-1.

Observe that solving (3) is equivalent to solving (2). Rather than try to solve (3) directly, we relax the constraints by adding an ℓ2 penalty function to the objective and attack the unconstrained problem

    \min_{\{W_l\},\{a_l\},\{z_l\}} \; \ell(z_L, y) + \beta_L \|z_L - W_L a_{L-1}\|^2
      + \sum_{l=1}^{L-1} \Big( \gamma_l \|a_l - h_l(z_l)\|^2 + \beta_l \|z_l - W_l a_{l-1}\|^2 \Big)    (4)

where {γ_l} and {β_l} are constants that control the weight of each constraint. The formulation (4) only approximately enforces the constraints in (3). To obtain exact enforcement of the constraints, we add a Lagrange multiplier term to (4), which yields

    \min_{\{W_l\},\{a_l\},\{z_l\}} \; \ell(z_L, y) + \langle z_L, \lambda \rangle + \beta_L \|z_L - W_L a_{L-1}\|^2
      + \sum_{l=1}^{L-1} \Big( \gamma_l \|a_l - h_l(z_l)\|^2 + \beta_l \|z_l - W_l a_{l-1}\|^2 \Big)    (5)

where λ is a vector of Lagrange multipliers with the same dimensions as z_L. Note that in a classical ADMM formulation, a Lagrange multiplier would be added for each constraint in (3). The formulation above corresponds more closely to Bregman iteration, which only requires a Lagrange correction to be added to the objective term (and not the constraint terms), rather than classical ADMM. We have found the Bregman formulation to be far more stable than a full-scale ADMM formulation. This issue will be discussed in detail in Section 4.
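The role of the auxiliary variables can be made concrete with a few lines of NumPy. The sketch below (again our own illustration, with a squared-error loss standing in for ℓ) evaluates the penalized objective (4) given explicit copies of the per-layer variables z_l and a_l; the Lagrange-corrected objective (5) would simply add the term ⟨z_L, λ⟩:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def split_objective(W, a, z, y, beta, gamma):
    # Penalized objective (4): the loss couples only to z_L, while each layer
    # contributes a ||z_l - W_l a_{l-1}||^2 term and a ||a_l - h_l(z_l)||^2 term.
    # Conventions: W = [W_1..W_L], z = [z_1..z_L], a = [a_0..a_{L-1}] (a[0] is the input data),
    # beta has L entries, gamma has L-1 entries.
    L = len(W)
    obj = 0.5 * np.linalg.norm(z[-1] - y) ** 2                       # ell(z_L, y); squared error stands in for the loss
    for i in range(L):                                               # beta_l ||z_l - W_l a_{l-1}||^2
        obj += beta[i] * np.linalg.norm(z[i] - W[i] @ a[i]) ** 2
    for i in range(L - 1):                                           # gamma_l ||a_l - h_l(z_l)||^2
        obj += gamma[i] * np.linalg.norm(a[i + 1] - relu(z[i])) ** 2
    return obj
```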

The split formulation (4) is carefully designed to be easily minimized using an alternating direction method in which each sub-step has a simple closed-form solution. The alternating direction scheme proceeds by updating one set of variables at a time – either {W_l}, {a_l}, or {z_l} – while holding the others constant. The simplicity of the proposed scheme comes from the following observation: the minimization of (4) with respect to both {W_l} and {a_l} is a simple linear least-squares problem. Only the minimization of (4) with respect to {z_l} is nonlinear. However, there is no coupling between the entries of {z_l}, and so the problem of minimizing for {z_l} decomposes into solving a large number of one-dimensional problems, one for each entry in {z_l}. Because each sub-problem has a simple form and only one variable, these problems can be solved globally in closed form.

The full alternating direction method is listed in Algorithm 1. We discuss the details below.

Algorithm 1  ADMM for Neural Nets
    Input: training features a_0 and labels y
    Initialize: allocate {a_l}_{l=1}^{L-1}, {z_l}_{l=1}^{L}, and λ
    repeat
        for l = 1, 2, ..., L-1 do
            W_l ← z_l a_{l-1}^†
            a_l ← (β_{l+1} W_{l+1}^T W_{l+1} + γ_l I)^{-1} (β_{l+1} W_{l+1}^T z_{l+1} + γ_l h_l(z_l))
            z_l ← arg min_z  γ_l ||a_l − h_l(z)||^2 + β_l ||z − W_l a_{l-1}||^2
        end for
        W_L ← z_L a_{L-1}^†
        z_L ← arg min_z  ℓ(z, y) + ⟨z, λ⟩ + β_L ||z − W_L a_{L-1}||^2
        λ ← λ + β_L (z_L − W_L a_{L-1})
    until converged

3.1. Minimization sub-steps

In this section, we consider the updates for each variable in (5). The algorithm proceeds by minimizing for W_l, a_l, and z_l, and then updating the Lagrange multipliers λ.

Weight update  We first consider the minimization of (4) with respect to {W_l}. For each layer l, the optimal solution minimizes \|z_l - W_l a_{l-1}\|^2. This is simply a least squares problem, and the solution is given by W_l ← z_l a_{l-1}^†, where a_{l-1}^† represents the pseudoinverse of the (rectangular) activation matrix a_{l-1}.

Activations update  Minimization for a_l is a simple least-squares problem similar to the weight update. However, in this case the matrix a_l appears in two penalty terms in (4), and so we must minimize \beta_{l+1} \|z_{l+1} - W_{l+1} a_l\|^2 + \gamma_l \|a_l - h_l(z_l)\|^2 for a_l, holding all other variables fixed. The new value of a_l is given by

    a_l \leftarrow (\beta_{l+1} W_{l+1}^T W_{l+1} + \gamma_l I)^{-1} (\beta_{l+1} W_{l+1}^T z_{l+1} + \gamma_l h_l(z_l))    (6)

where W_{l+1}^T is the adjoint (transpose) of W_{l+1}.

Outputs update  The update for z_l requires minimizing

    \min_z \; \gamma_l \|a_l - h_l(z)\|^2 + \beta_l \|z - W_l a_{l-1}\|^2.    (7)

This problem is non-convex and non-quadratic (because of the non-linear term h). Fortunately, because the non-linearity h works entry-wise on its argument, the entries in z_l are de-coupled. Solving (7) is particularly easy when h is piecewise linear, as it can be solved in closed form; common piecewise linear choices for h include rectified linear units (ReLUs) and non-differentiable sigmoid functions, given by

    h_relu(x) = { x, if x > 0;  0, otherwise },
    h_sig(x)  = { 1, if x ≥ 1;  x, if 0 < x < 1;  0, otherwise }.

For such choices of h, the minimizer of (7) is easily computed using simple if-then logic. For more sophisticated choices of h, including smooth sigmoid curves, the problem can be solved quickly with a lookup table of pre-computed solutions, because each 1-dimensional problem only depends on two inputs.

Lagrange multiplier update  After minimizing for {W_l}, {a_l}, and {z_l}, the Lagrange multiplier update is given simply by

    \lambda \leftarrow \lambda + \beta_L (z_L - W_L a_{L-1}).    (8)

We discuss this update further in Section 4.
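For concreteness, the following NumPy sketch implements one outer pass of Algorithm 1 for ReLU activations, with a squared-error loss on z_L standing in for ℓ (the loss choice, list conventions, and names are our own assumptions, not the authors' released code). The helper prox_relu solves the one-dimensional problems in (7) by comparing the clipped minimizers of the two linear branches:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def prox_relu(a, m, gamma, beta):
    # Entry-wise closed-form solution of (7):
    #   min_z  gamma*(a - relu(z))**2 + beta*(z - m)**2,  with m = W_l a_{l-1}.
    # Candidate on branch z >= 0 (relu(z) = z): unconstrained quadratic minimum, clipped at 0.
    z_pos = np.maximum((gamma * a + beta * m) / (gamma + beta), 0.0)
    # Candidate on branch z <= 0 (relu(z) = 0): minimum of beta*(z - m)**2, clipped at 0.
    z_neg = np.minimum(m, 0.0)
    f = lambda z: gamma * (a - relu(z)) ** 2 + beta * (z - m) ** 2
    return np.where(f(z_pos) <= f(z_neg), z_pos, z_neg)

def admm_iteration(y, W, a, z, lam, beta, gamma):
    # One outer iteration of Algorithm 1 (0-based lists):
    #   a[i] holds a_i for i = 0..L-1 (a[0] is the input data),
    #   z[i] holds z_{i+1},  W[i] holds W_{i+1},
    #   beta[i] = beta_{i+1} (L entries),  gamma[i] = gamma_{i+1} (L-1 entries).
    L = len(W)
    for i in range(L - 1):                                  # layers l = 1, ..., L-1
        # Weight update: W_l <- z_l a_{l-1}^† (least squares).
        W[i] = z[i] @ np.linalg.pinv(a[i])
        # Activations update (6).
        Wn = W[i + 1]                                       # W_{l+1}
        lhs = beta[i + 1] * Wn.T @ Wn + gamma[i] * np.eye(Wn.shape[1])
        rhs = beta[i + 1] * Wn.T @ z[i + 1] + gamma[i] * relu(z[i])
        a[i + 1] = np.linalg.solve(lhs, rhs)
        # Outputs update (7), solved entry-wise in closed form.
        z[i] = prox_relu(a[i + 1], W[i] @ a[i], gamma[i], beta[i])
    # Last layer: weight update, then the z_L update with the loss and <z, lam> terms.
    W[-1] = z[-1] @ np.linalg.pinv(a[-1])
    m = W[-1] @ a[-1]
    # With a squared-error loss 0.5*||z - y||^2, the z_L sub-problem
    #   min_z 0.5*||z - y||^2 + <z, lam> + beta_L*||z - m||^2
    # is quadratic, with the closed-form minimizer below.
    z[-1] = (y - lam + 2 * beta[-1] * m) / (1 + 2 * beta[-1])
    # Bregman / Lagrange multiplier update (8).
    lam = lam + beta[-1] * (z[-1] - m)
    return W, a, z, lam
```

In a complete implementation, a few such passes would first be run without the λ update, as in the warm start described in Section 6.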

4. Lagrange multiplier updates via method of multipliers and Bregman iteration

The proposed method can be viewed as solving the constrained problem (3) using Bregman iteration, which is closely related to ADMM. The convergence of Bregman iteration is fairly well understood in the presence of linear constraints (Yin et al., 2008). The convergence of ADMM is fairly well understood for convex problems involving only two separate variable blocks (He & Yuan, 2015). Convergence results also guarantee that a local minimum is obtained for two-block non-convex objectives under certain smoothness assumptions (Nocedal & Wright, 2006).

Because the proposed scheme involves more than two coupled variable blocks and a non-smooth penalty function, it lies outside the scope of known convergence results for ADMM. In fact, when ADMM is applied to (3) in a conventional way using separate Lagrange multiplier vectors for each constraint, the method is highly unstable because of the de-stabilizing effect of a large number of coupled, non-smooth, non-convex terms.

Fortunately, we will see below that the Bregman Lagrange update method (8) does not involve any non-smooth constraint terms, and the resulting method seems to be extremely stable.

4.1. Bregman interpretation

Bregman iteration (also known as the method of multipliers) is a general framework for solving constrained optimization problems. Methods of this type have been used extensively in the sparse optimization literature (Yin et al., 2008). Consider the general problem of minimizing

    \min_u J(u) \quad \text{subject to} \quad Au = b    (9)

for some convex function J and linear operator A. Bregman iteration repeatedly solves

    u^{k+1} \leftarrow \arg\min_u \; D_J(u, u^k) + \tfrac{1}{2}\|Au - b\|^2    (10)

where p \in \partial J(u^k) is a (sub-)gradient of J at u^k, and D_J(u, u^k) = J(u) - J(u^k) - \langle u - u^k, p \rangle is the so-called Bregman distance. The iterative process (10) can be viewed as minimizing the objective J subject to an inexact penalty that approximately obtains Au ≈ b, and then adding a linear term to the objective to weaken it so that the quadratic penalty becomes more influential on the next iteration.

The Lagrange update described in Section 3 can be interpreted as performing Bregman iteration to solve the problem (3), where J(u) = ℓ(z_L, y) and A contains all the constraints in (3). On each iteration, the outputs z_l are updated immediately before the Lagrange step is taken, and so z_L satisfies the optimality condition

    0 \in \partial_z \ell(z_L, y) + \beta_L (z_L - W_L a_{L-1}) + \lambda.

It follows that

    \lambda + \beta_L (z_L - W_L a_{L-1}) \in -\partial_z \ell(z_L, y).

For this reason, the Lagrange update (8) can be interpreted as updating the sub-gradient in the Bregman iterative method for solving (3). The combination of the Bregman iterative update with an alternating minimization strategy makes the proposed algorithm an instance of the split Bregman method (Goldstein & Osher, 2009).

4.2. Interpretation as method of multipliers

In addition to the Bregman interpretation, the proposed method can also be viewed as an approximation to the method of multipliers, which solves constrained problems of the form

    \min_u J(u) \quad \text{subject to} \quad A(u) = b    (11)

for some function J and (possibly non-linear) operator A(·). In its most general form (which does not assume linear constraints), the method proceeds using the iterative updates

    u^{k+1} \leftarrow \arg\min_u \; J(u) + \langle \lambda^k, A(u) - b \rangle + \tfrac{\beta}{2}\|A(u) - b\|^2
    \lambda^{k+1} \leftarrow \lambda^k + \partial_u \big\{ \tfrac{\beta}{2}\|A(u) - b\|^2 \big\}

where λ^k is a vector of Lagrange multipliers that is generally initialized to zero, and (β/2)‖A(u) − b‖² is a quadratic penalty term. After each minimization sub-problem, the gradient of the penalty term is added to the Lagrange multipliers. When the operator A is linear, this update takes the form λ^{k+1} ← λ^k + β A^T(Au − b), which is the most common form of the method of multipliers.

Just as in the Bregman case, we now let J(u) = ℓ(z_L, y), and let A contain the constraints in (3). After a minimization pass, we must update the Lagrange multiplier vector. Assuming a good minimizer has been achieved, the derivative of (5) should be nearly zero. All variables except z_L appear only in the quadratic penalty, and so these derivatives should be negligibly small. The only major contributor to the gradient of the penalty term is z_L, which appears in both the loss function and the quadratic penalty. The gradient of the penalty term with respect to z_L is β_L(z_L − W_L a_{L−1}), which is exactly the proposed multiplier update.

When the objective is approximately minimized by alternately updating separate blocks of variables (as in the proposed method), this becomes an instance of the ADMM (Boyd et al., 2011).
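As a self-contained toy illustration of this update pair (not taken from the paper), the sketch below applies the scheme to minimizing J(u) = ½‖u − c‖² subject to a linear constraint Au = b. Following the convention of (5) and (8), the multiplier is kept in the primal space (the Lagrangian term is ⟨λ, u⟩) and is updated with the gradient of the quadratic penalty; A, b, c, and β are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, beta = 3, 8, 5.0
A = rng.standard_normal((m, n))          # linear constraint operator, Au = b
b = rng.standard_normal(m)
c = rng.standard_normal(n)               # J(u) = 0.5*||u - c||^2

u = np.zeros(n)
lam = np.zeros(n)                        # multiplier lives in u-space, mirroring <z_L, lambda> in (5)
for k in range(200):
    # u-step: minimize 0.5*||u - c||^2 + <lam, u> + (beta/2)*||Au - b||^2 (closed form linear solve).
    u = np.linalg.solve(np.eye(n) + beta * A.T @ A, c - lam + beta * A.T @ b)
    # Multiplier step: add the gradient of the quadratic penalty, lam += beta * A^T (Au - b).
    lam = lam + beta * A.T @ (A @ u - b)

print("constraint violation:", np.linalg.norm(A @ u - b))   # tends to zero as the iteration proceeds
```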
5. Distributed implementation using data parallelism

The main advantage of the proposed alternating minimization method is its high degree of scalability. In this section, we explain how the method is distributed.

Consider distributing the algorithm across N worker nodes. The ADMM method is scaled using a data parallelization strategy, in which different nodes store activations and outputs corresponding to different subsets of the training data. For each layer, the activation matrix is broken into column subsets as a_l = (a_l^1, a_l^2, ..., a_l^N). The output matrix z_l and the Lagrange multipliers λ decompose similarly.

The optimization sub-steps for updating {a_l} and {z_l} do not require any communication and parallelize trivially. The weight matrix update requires the computation of pseudo-inverses and products involving the matrices {a_l} and {z_l}. This can be done effectively using transpose reduction strategies (Goldstein et al., 2016) that reduce the dimensionality of matrices before they are transmitted to a central node.

Parallel weight update  The weight update has the form W_l ← z_l a_{l-1}^†, where a_{l-1}^† represents the pseudoinverse of the activation matrix a_{l-1}. This pseudoinverse can be written a_{l-1}^† = a_{l-1}^T (a_{l-1} a_{l-1}^T)^{-1}. Using this expansion, the W_l update decomposes across nodes as

    W_l \leftarrow \Big( \sum_{n=1}^{N} z_l^n (a_{l-1}^n)^T \Big) \Big( \sum_{n=1}^{N} a_{l-1}^n (a_{l-1}^n)^T \Big)^{-1}.

The individual products z_l^n (a_{l-1}^n)^T and a_{l-1}^n (a_{l-1}^n)^T are computed separately on each node, and then summed across nodes using a single reduce operation. Note that the width of a_{l-1}^n equals the number of training vectors that are stored on node n, which is potentially very large for big data sets. When the number of features (the number of rows in a_{l-1}^n) is less than the number of training data (columns of a_{l-1}^n), we can exploit transpose reduction when forming these products – the product a_{l-1}^n (a_{l-1}^n)^T is much smaller than the matrix a_{l-1}^n alone. This dramatically reduces the quantity of data transmitted during the reduce operation.

Once these products have been formed and reduced onto a central server, the central node computes the inverse of a_{l-1} a_{l-1}^T, updates W_l, and then broadcasts the result to the worker nodes.

Parallel activations update  The update (6) trivially decomposes across workers, with each worker computing

    a_l^n \leftarrow (\beta_{l+1} W_{l+1}^T W_{l+1} + \gamma_l I)^{-1} (\beta_{l+1} W_{l+1}^T z_{l+1}^n + \gamma_l h_l(z_l^n)).

Each server maintains a full representation of the entire weight matrix, and can formulate its own local copy of the matrix inverse (\beta_{l+1} W_{l+1}^T W_{l+1} + \gamma_l I)^{-1}.

Parallel outputs update  Like the activations update, the update for z_l trivially parallelizes, and each worker node solves

    \min_{z_l^n} \; \gamma_l \|a_l^n - h_l(z_l^n)\|^2 + \beta_l \|z_l^n - W_l a_{l-1}^n\|^2.    (12)

Each worker node simply computes W_l a_{l-1}^n using local data, and then updates each of the (decoupled) entries in z_l^n by solving a 1-dimensional problem in closed form.

Parallel Lagrange multiplier update  The Lagrange multiplier update also trivially splits across nodes, with worker n computing

    \lambda^n \leftarrow \lambda^n + \beta_L (z_L^n - W_L a_{L-1}^n)    (13)

using only local data.
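A minimal sketch of the parallel weight update with transpose reduction is given below (our own illustration assuming mpi4py and NumPy, not the authors' implementation). Each worker forms the small products z_l^n (a_{l-1}^n)^T and a_{l-1}^n (a_{l-1}^n)^T from its local columns, a single reduction sums them, and the weight matrix is then recovered by a linear solve. For brevity the sketch uses an allreduce, whereas the text above describes a reduce onto a central node followed by a broadcast:

```python
import numpy as np
from mpi4py import MPI

def parallel_weight_update(z_local, a_prev_local, comm=MPI.COMM_WORLD):
    # z_local:      local block of z_l       (output_features x local_samples)
    # a_prev_local: local block of a_{l-1}   (input_features  x local_samples)
    # Transpose reduction: only the small (features x features) products are communicated,
    # never the full activation matrices.
    za_t = z_local @ a_prev_local.T              # z_l^n (a_{l-1}^n)^T
    aa_t = a_prev_local @ a_prev_local.T         # a_{l-1}^n (a_{l-1}^n)^T
    za_t_sum = np.empty_like(za_t)
    aa_t_sum = np.empty_like(aa_t)
    comm.Allreduce(za_t, za_t_sum, op=MPI.SUM)
    comm.Allreduce(aa_t, aa_t_sum, op=MPI.SUM)
    # W_l <- (sum_n z a^T)(sum_n a a^T)^{-1}; solve the transposed system rather than inverting.
    return np.linalg.solve(aa_t_sum.T, za_t_sum.T).T
```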
6. Implementation details

Like many training methods for neural networks, the ADMM approach requires several tips and tricks to achieve maximum performance. The convergence theory for the method of multipliers requires a good minimizer to be computed before updating the Lagrange multipliers. When the method is initialized with random starting values, the initial iterates are generally far from optimal. For this reason, we frequently "warm start" the ADMM method by running several iterations without Lagrange multiplier updates.

The method potentially requires the user to choose a large number of parameters {γ_i} and {β_i}. We choose γ_i = 10 and β_i = 1 for all trial runs reported here, and we have found that this choice works reliably for a wide range of problems and network architectures. Note that in the classical ADMM method, convergence is guaranteed for any choice of the quadratic penalty parameters.

We use training data with binary class labels, in which each output entry a_L is either 1 or 0. We use a separable loss function with a hinge penalty of the form

    \ell(z, a) = \begin{cases} \max\{1 - z, 0\}, & \text{when } a = 1, \\ \max\{z, 0\}, & \text{when } a = 0. \end{cases}

This loss function works well in practice, and yields minimization sub-problems that are easily solved in closed form.

Finally, our implementation simply initializes the activation matrices {a_l} and output matrices {z_l} using i.i.d. Gaussian random variables. Because our method updates the weights before anything else, the weight matrices do not require any initialization. The results presented here use Gaussian random variables with unit variance, and the results seem to be fairly insensitive to the variance of this distribution. This insensitivity appears to be because the output updates are solved to global optimality on each iteration.
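Under this hinge penalty the z_L sub-problem in Algorithm 1 remains separable and piecewise quadratic, so each entry can still be minimized exactly. The sketch below (our own illustration, not the paper's code) solves min_z ℓ(z, a) + λz + β_L(z − m)² entry-wise, where m = W_L a_{L−1}, by comparing the clipped minimizers of the two quadratic branches of each hinge:

```python
import numpy as np

def hinge_zL_update(labels, m, lam, beta):
    # Entry-wise solution of  min_z  ell(z, a) + lam*z + beta*(z - m)**2,
    # where m = W_L a_{L-1}, ell is the hinge penalty above, and labels is 0/1.
    def full_obj(z):
        hinge = np.where(labels == 1, np.maximum(1 - z, 0), np.maximum(z, 0))
        return hinge + lam * z + beta * (z - m) ** 2

    # Candidate minimizers of the two linear-plus-quadratic branches of each hinge,
    # each clipped back onto its branch's region (the clipped point is the kink).
    cand1 = np.where(labels == 1,
                     np.maximum(m - lam / (2 * beta), 1.0),          # branch z >= 1 (label 1)
                     np.maximum(m - (1 + lam) / (2 * beta), 0.0))    # branch z >= 0 (label 0)
    cand2 = np.where(labels == 1,
                     np.minimum(m + (1 - lam) / (2 * beta), 1.0),    # branch z <= 1 (label 1)
                     np.minimum(m - lam / (2 * beta), 0.0))          # branch z <= 0 (label 0)
    return np.where(full_obj(cand1) <= full_obj(cand2), cand1, cand2)
```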

7. Experiments

In this section, we present experimental results that compare the performance of the ADMM method to other approaches, including SGD, conjugate gradients, and L-BFGS, on benchmark classification tasks. Comparisons are made across multiple axes. First, we illustrate the scaling of the approach by varying the number of cores available and clocking the compute time necessary to meet an accuracy threshold on the test set of the problem. Second, we show test set classification accuracy as a function of time to compare the rate of convergence of the optimization methods. Finally, we show these comparisons on two different data sets, one small and relatively easy, and one large and difficult.

The new ADMM approach was implemented in Python on a Cray XC30 supercomputer with Ivy Bridge processors, with communication between cores performed via MPI. SGD, conjugate gradients, and L-BFGS were run as implemented in the Torch optim package on NVIDIA Tesla K40 GPUs. These methods underwent a thorough hyperparameter grid search to identify the algorithm parameters that produced the best results. In all cases, timings indicate only the time spent optimizing, excluding time spent loading data and setting up the network.

Experiments were run on two datasets. The first is a subset of the Street View House Numbers (SVHN) dataset (Netzer et al., 2011). Neural nets were constructed to classify pictures of 0s from 2s using histogram of gradient (HOG) features of the original dataset. Using the "extra" dataset to train, this meant 120,290 training datapoints of 648 features each. The testing set contained 5,893 data points.

The second dataset is the far more difficult Higgs dataset (Baldi et al., 2014), consisting of a training set of 10,500,000 datapoints of 28 features each, with each datapoint labelled as either a signal process producing a Higgs boson or a background process which does not. The testing set consists of 500,000 datapoints.

Figure 1. Street View House Numbers (subsection 7.1). (a) Time required for ADMM to reach 95% test accuracy vs. number of cores. This problem was not large enough to support parallelization over many cores, yet the advantages of scaling are still apparent (note the x-axis has log scale). In comparison, on the GPU, L-BFGS reached this threshold in 3.2 seconds, CG in 9.3 seconds, and SGD in 8.2 seconds. (b) Test set predictive accuracy as a function of time in seconds for ADMM on 2,496 cores (blue), in addition to GPU implementations of conjugate gradients (green), SGD (red), and L-BFGS (cyan).

7.1. SVHN

First, we focus on the problem posed by the SVHN dataset. For this dataset, we optimized a net with two hidden layers of 100 and 50 nodes and ReLU activation functions. This is an easy problem (test accuracy rises quickly) that does not require a large volume of data and is easily handled by gradient-based methods on a GPU. However, Figure 1a demonstrates that ADMM exhibits linear scaling with cores. Even though the implementations of the gradient-based methods enjoy communication via shared memory on the GPU while ADMM required CPU-to-CPU communication, strong scaling allows ADMM on CPU cores to compete with the gradient-based methods on a GPU.

This is illustrated clearly in Figure 1b, which shows each method's performance on the test set as a function of time. With 1,024 compute cores, on an average of 10 runs, ADMM was able to meet the 95% test set accuracy threshold in 13.3 seconds. After an extensive hyperparameter search to find the settings which resulted in the fastest convergence, SGD converged on average in 28.3 seconds, L-BFGS in 3.3 seconds, and conjugate gradients in 10.1 seconds. Though the small dataset kept ADMM from taking full advantage of its scalability, it was nonetheless sufficient to allow it to be competitive with GPU implementations.

7.2. Higgs

For the much larger and more difficult Higgs dataset, we optimized a simple network with ReLU activation functions and a hidden layer of 300 nodes, as suggested in (Baldi et al., 2014).

Figure 2. Higgs (subsection 7.2). (a) Time required for ADMM to reach 64% test accuracy when parallelized over varying numbers of cores. L-BFGS on a GPU required 181 seconds, and conjugate gradients required 44 minutes. SGD never reached 64% accuracy. (b) Test set predictive accuracy as a function of time for ADMM on 7200 cores (blue), conjugate gradients (green), and SGD (red). Note the x-axis is scaled logarithmically.

The graph illustrates the amount of time required to optimize the network to a test set prediction accuracy of 64%; this threshold was chosen because all batch methods being tested reliably hit this accuracy benchmark over numerous trials. As is clear from Figure 2a, parallelizing over additional cores decreases the time required dramatically, and again exhibits linear scaling.

In this much larger problem, the advantageous scaling allowed ADMM to reach the 64% benchmark much faster than the other approaches. Figure 2b illustrates this clearly, with ADMM running on 7200 cores reaching this benchmark in 7.8 seconds. In comparison, L-BFGS required 181 seconds, and conjugate gradients required 44 minutes.¹ In seven hours of training, SGD never reached 64% accuracy on the test set. These results suggest that, for large and difficult problems, the strong linear scaling of ADMM enables it to leverage large numbers of cores to (dramatically) out-perform GPU implementations.

8. Discussion & Conclusion

We present a method for training neural networks without using gradient steps. In addition to avoiding many difficulties of gradient methods (like saturation and the choice of learning rates), the performance of the proposed method scales linearly up to thousands of cores. This strong scaling enables the proposed approach to out-perform other methods on problems involving extremely large datasets.

8.1. Looking forward

The experiments shown here represent a fairly narrow range of classification problems and are not meant to demonstrate the absolute superiority of ADMM as a training method. Rather, this study is meant to be a proof of concept demonstrating that the caveats of gradient-based methods can be avoided using alternative minimization schemes. Future work will explore the behavior of alternating direction methods in broader contexts.

We are particularly interested in focusing future work on recurrent nets and convolutional nets. Recurrent nets, which complicate standard gradient methods (Jaeger, 2002; Lukoševičius, 2012), pose no difficulty for ADMM schemes whatsoever because they decouple layers using auxiliary variables. Convolutional networks are also of interest because ADMM can, in principle, handle them very efficiently. When the linear operators {W_l} represent convolutions rather than dense weight matrices, the least-squares problems that arise in the updates for {W_l} and {a_l} can be solved efficiently using fast Fourier transforms (see the sketch below).

Finally, there are avenues to explore to potentially improve convergence speed. These include adding momentum terms to the weight updates and studying different initialization schemes, both of which are known to be important for gradient-based schemes (Sutskever et al., 2013).
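To illustrate the remark about convolutional operators (a sketch of our own, not part of the paper), consider a layer whose linear operator is a 1-D circular convolution with a single filter w. The least-squares filter update min_w Σ_n ||z_n − w ⊛ a_n||² diagonalizes in the Fourier domain, so it can be solved with a few FFTs:

```python
import numpy as np

def conv_filter_update(A, Z, eps=1e-8):
    # A, Z: (num_samples, signal_length) arrays of input activations a_n and targets z_n.
    # Solves  min_w  sum_n || z_n - circular_conv(w, a_n) ||^2  per Fourier coefficient:
    #   w_hat[k] = (sum_n conj(a_hat_n[k]) * z_hat_n[k]) / (sum_n |a_hat_n[k]|^2).
    A_hat = np.fft.fft(A, axis=1)
    Z_hat = np.fft.fft(Z, axis=1)
    num = np.sum(np.conj(A_hat) * Z_hat, axis=0)
    den = np.sum(np.abs(A_hat) ** 2, axis=0) + eps   # eps guards against empty frequencies
    return np.real(np.fft.ifft(num / den))

# Quick check: recover a known filter from noiseless circular convolutions.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(16)
A = rng.standard_normal((50, 16))
Z = np.real(np.fft.ifft(np.fft.fft(A, axis=1) * np.fft.fft(w_true), axis=1))
print(np.allclose(conv_filter_update(A, Z), w_true, atol=1e-6))   # -> True
```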

¹ It is worth noting that though L-BFGS required substantially more time to reach 64% than did ADMM, it was the only method to produce a superior classifier, doing as well as 75% accuracy on the test set.

Acknowledgements

This work was supported by the National Science Foundation (#1535902), the Office of Naval Research (#N00014-15-1-2676 and #N0001415WX01341), and the DoD High Performance Computing Center.

References

Baldi, Pierre, Sadowski, Peter, and Whiteson, Daniel. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.

Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Carreira-Perpiñán, Miguel A and Wang, Weiran. Distributed optimization of deeply nested systems. arXiv preprint arXiv:1212.5921, 2012.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. arXiv preprint arXiv:1412.0233, 2014.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.

Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Ranzato, Marc'Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V., and Ng, Andrew Y. Large scale distributed deep networks. In Pereira, F., Burges, C.J.C., Bottou, L., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1223–1231. Curran Associates, Inc., 2012.

Goldstein, Tom and Osher, Stanley. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.

Goldstein, Tom, Taylor, Gavin, Barabin, Kawika, and Sayre, Kent. Unwrapping ADMM: Efficient distributed computing via transpose reduction. In 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

He, Bingsheng and Yuan, Xiaoming. On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik, 130(3):567–577, 2015.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Jaeger, Herbert. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. GMD-Forschungszentrum Informationstechnik, 2002.

Lukoševičius, Mantas. A practical guide to applying echo state networks. In Neural Networks: Tricks of the Trade, pp. 659–686. Springer, 2012.

Martens, James and Sutskever, Ilya. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1033–1040, 2011.

Møller, Martin Fodslette. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525–533, 1993.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.

Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 4. Granada, Spain, 2011.

Ngiam, Jiquan, Coates, Adam, Lahiri, Ahbik, Prochnow, Bobby, Le, Quoc V, and Ng, Andrew Y. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 265–272, 2011.

Nocedal, Jorge and Wright, Stephen. Numerical Optimization. Springer Science & Business Media, 2006.

Patel, Ankit B, Nguyen, Tan, and Baraniuk, Richard. A probabilistic theory of deep learning. arXiv preprint arXiv:1504.00641, 2015.

Riedmiller, Martin and Braun, Heinrich. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Neural Networks, 1993., IEEE International Conference on, pp. 586–591. IEEE, 1993.

Sainath, Tara N, Horesh, Lior, Kingsbury, Brian, Aravkin, Aleksandr Y, and Ramabhadran, Bhuvana. Accelerating Hessian-free optimization for deep neural networks by implicit preconditioning and sampling. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pp. 303–308. IEEE, 2013.

Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147, 2013.

Towsey, Michael, Alpsan, Dogan, and Sztriha, Laszlo. Training a neural network with conjugate gradient methods. In Neural Networks, 1995. Proceedings., IEEE International Conference on, volume 1, pp. 373–378. IEEE, 1995.

Yin, Wotao, Osher, Stanley, Goldfarb, Donald, and Darbon, Jerome. Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143–168, 2008.

Zhang, Sixin, Choromanska, Anna E, and LeCun, Yann. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pp. 685–693, 2015.