meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting

Xu Sun 1,2   Xuancheng Ren 1,2   Shuming Ma 1,2   Houfeng Wang 1,2

1 School of Electronics Engineering and Computer Science, Peking University, China.  2 MOE Key Laboratory of Computational Linguistics, Peking University, China. Correspondence to: Xu Sun.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction (k divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we can update only 1–4% of the weights at each back propagation pass, and this does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given.

1. Introduction

Neural network learning is typically slow, and back propagation usually dominates the computational cost during the learning process. Back propagation is expensive because it needs to compute full gradients and update all model parameters in each learning step, and it is not uncommon for a neural network to have a massive number of model parameters.

In this study, we propose a minimal effort back propagation method, which we call meProp, for neural network learning. The idea is to compute only a very small but critical portion of the gradient information, and to update only the corresponding minimal portion of the parameters in each learning step. This leads to sparsified gradients, such that only highly relevant parameters are updated and the other parameters stay untouched. The sparsified back propagation leads to a linear reduction in the computational cost.

To realize our approach, we need to answer two questions. The first question is how to find the highly relevant subset of the parameters for the current sample in stochastic learning. We propose a top-k search method to find the most important parameters. Interestingly, experimental results demonstrate that we can update only 1–4% of the weights at each back propagation pass without requiring a larger number of training iterations. The proposed method is general-purpose and independent of specific models and specific optimizers (e.g., Adam and AdaGrad).

The second question is whether or not this minimal effort back propagation strategy hurts the accuracy of the trained models. We show that our strategy does not degrade the accuracy of the trained model, even when a very small portion of the parameters is updated. More interestingly, our experimental results reveal that our strategy actually improves the model accuracy in most cases. Based on our experiments, this is probably because the minimal effort update does not modify weakly relevant parameters, which makes overfitting less likely, similar to the dropout effect.

The contributions of this work are as follows:

• We propose a sparsified back propagation technique for neural network learning, in which only a small subset of the full gradient is computed to update the model parameters. Experimental results demonstrate that we can update only 1–4% of the weights at each back propagation pass, without a larger number of training iterations.
• Surprisingly, our experimental results reveal that the accuracy of the resulting models is actually improved, rather than degraded. We demonstrate this effect by conducting experiments on different deep learning models (LSTM and MLP), various optimization methods (Adam and AdaGrad), and diverse tasks (natural language processing and image recognition).

Figure 1. An illustration of meProp.

2. Proposed Method

We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. During back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are "quantized" so that only the top-k components in terms of magnitude are kept. We first present the proposed method and then describe the implementation details.

2.1. meProp

Forward propagation of neural network models, including feedforward neural networks, RNN, and LSTM, consists of linear transformations and non-linear transformations. For simplicity, we take a computation unit with one linear transformation and one non-linear transformation as an example:

$$y = Wx \tag{1}$$

$$z = \sigma(y) \tag{2}$$

where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $z \in \mathbb{R}^n$, $m$ is the dimension of the input vector, $n$ is the dimension of the output vector, and $\sigma$ is a non-linear function (e.g., relu, tanh, and sigmoid). During back propagation, we need to compute the gradient of the parameter matrix $W$ and the input vector $x$:

$$\frac{\partial z}{\partial W_{ij}} = \sigma'_i x_j^T \quad (1 \le i \le n,\ 1 \le j \le m) \tag{3}$$

$$\frac{\partial z}{\partial x_i} = \sum_j W_{ij}^T \sigma'_j \quad (1 \le j \le n,\ 1 \le i \le m) \tag{4}$$

where $\sigma'_i \in \mathbb{R}^n$ means $\frac{\partial z_i}{\partial y_i}$. We can see that the computational cost of back propagation is directly proportional to the dimension of the output vector $n$.

The proposed meProp uses approximate gradients by keeping only the top-k elements based on the magnitude values. That is, only the top-k elements with the largest absolute values are kept. For example, suppose a vector $v = \langle 1, 2, 3, -4 \rangle$; then $\mathrm{top}_2(v) = \langle 0, 0, 3, -4 \rangle$. We denote the indices of vector $\sigma'(y)$'s top-k values as $\{t_1, t_2, \dots, t_k\}$ $(1 \le k \le n)$, and the approximate gradient of the parameter matrix $W$ and the input vector $x$ is:

$$\frac{\partial z}{\partial W_{ij}} \leftarrow \sigma'_i x_j^T \ \text{ if } i \in \{t_1, t_2, \dots, t_k\} \text{ else } 0 \tag{5}$$

$$\frac{\partial z}{\partial x_i} \leftarrow \sum_j W_{ij}^T \sigma'_j \ \text{ if } j \in \{t_1, t_2, \dots, t_k\} \text{ else } 0 \tag{6}$$

As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction (k divided by the vector dimension) in the computational cost.

Figure 1 is an illustration of meProp for a single computation unit of neural models. The original back propagation uses the full gradient of the output vectors to compute the gradient of the parameters. The proposed method selects the top-k values of the gradient of the output vector, and back-propagates the loss through the corresponding subset of the total model parameters.
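To make the top-k operator concrete, the following is a minimal PyTorch sketch (ours, not the authors' code). The helper topk_mask reproduces the top2(⟨1, 2, 3, −4⟩) = ⟨0, 0, 3, −4⟩ example above, and the last lines mimic Eqs. (5)–(6) for a single unit; the tensor names (sigma_prime, W, x) and the random values are illustrative assumptions.

```python
import torch

def topk_mask(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest (the top_k operator)."""
    _, idx = torch.topk(v.abs(), k)
    out = torch.zeros_like(v)
    out[idx] = v[idx]
    return out

# The paper's example: top2(<1, 2, 3, -4>) = <0, 0, 3, -4>
v = torch.tensor([1., 2., 3., -4.])
print(topk_mask(v, 2))              # tensor([ 0.,  0.,  3., -4.])

# Sparsified gradients of one unit y = Wx, z = sigma(y), cf. Eqs. (5)-(6)
n, m, k = 6, 4, 2
W = torch.randn(n, m)
x = torch.randn(m)
sigma_prime = torch.rand(n)         # stands in for sigma'(y)
sp = topk_mask(sigma_prime, k)      # zero all but the top-k entries
grad_W = torch.outer(sp, x)         # only k rows of dz/dW are non-zero
grad_x = W.t() @ sp                 # only the k selected columns of W^T contribute
```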

As for a complete neural network framework with a loss $L$, the original back propagation computes the gradient of the parameter matrix $W$ as:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial W} \tag{7}$$

while the gradient of the input vector $x$ is:

$$\frac{\partial L}{\partial x} = \frac{\partial y}{\partial x} \cdot \frac{\partial L}{\partial y} \tag{8}$$

The proposed meProp selects the top-k elements of the gradient $\frac{\partial L}{\partial y}$ to approximate the original gradient, and passes them through the gradient computation graph according to the chain rule. Hence, the gradient of $W$ goes to:

$$\frac{\partial L}{\partial W} \leftarrow \mathrm{top}_k\left(\frac{\partial L}{\partial y}\right) \cdot \frac{\partial y}{\partial W} \tag{9}$$

while the gradient of the vector $x$ is:

$$\frac{\partial L}{\partial x} \leftarrow \frac{\partial y}{\partial x} \cdot \mathrm{top}_k\left(\frac{\partial L}{\partial y}\right) \tag{10}$$
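A hedged sketch of how Eqs. (9)–(10) can be realized as a custom backward pass in PyTorch is given below. This is our illustration, not the authors' released implementation; the class name MePropLinear and the assumption of a 2D (batch, features) input are ours.

```python
import torch

class MePropLinear(torch.autograd.Function):
    """Sketch of Eqs. (9)-(10): a linear map y = x W^T whose backward pass
    keeps only the top-k entries (by magnitude) of dL/dy per example.
    Illustrative only; assumes x has shape (batch, m) and W has shape (n, m)."""

    @staticmethod
    def forward(ctx, x, W, k):
        ctx.save_for_backward(x, W)
        ctx.k = k
        return x.matmul(W.t())                      # forward propagation is unchanged

    @staticmethod
    def backward(ctx, grad_out):
        x, W = ctx.saved_tensors
        k = ctx.k
        # top_k(dL/dy): zero all but the k largest-magnitude entries of each row
        _, idx = grad_out.abs().topk(k, dim=-1)
        sparse = torch.zeros_like(grad_out).scatter_(-1, idx, grad_out.gather(-1, idx))
        grad_x = sparse.matmul(W)                   # Eq. (10): dL/dx
        grad_W = sparse.t().matmul(x)               # Eq. (9): only k rows of W are touched
        return grad_x, grad_W, None                 # no gradient w.r.t. k

# usage inside a module's forward: y = MePropLinear.apply(x, W, 30)
```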

Figure 2. An illustration of the computational flow of meProp.

Figure 2 shows an illustration of the computational flow of meProp. The forward propagation is the same as traditional forward propagation, which computes the output vector via a matrix multiplication operation between two input tensors. The original back propagation computes the full gradient for the input vector and the weight matrix. For meProp, back propagation computes an approximate gradient by keeping the top-k values of the backward-flowed gradient and masking the remaining values to 0.

Figure 3 further shows the computational flow of meProp for the mini-batch case.

Figure 3. An illustration of the computational flow of meProp in a mini-batch learning setting.

2.2. Implementation

We have coded two neural network models, including an LSTM model for part-of-speech (POS) tagging, and a feedforward NN model (MLP) for transition-based dependency parsing and MNIST image recognition. We use optimizers with automatically adaptive learning rates, including Adam (Kingma & Ba, 2014) and AdaGrad (Duchi et al., 2011). In our implementation, we make no modification to the optimizers, although there are many zero elements in the gradients.

Most of the experiments on CPU are conducted with a framework coded in C# on our own. This framework builds a dynamic computation graph of the model for each sample, making it suitable for data of variable lengths. A typical training procedure contains four parts: building the computation graph, forward propagation, back propagation, and parameter update. We also have an implementation based on the PyTorch framework for GPU-based experiments.

2.2.1. WHERE TO APPLY MEPROP

The proposed method aims to reduce the complexity of back propagation by reducing the number of elements in the computationally intensive operations. In our preliminary observations, matrix-matrix or matrix-vector multiplication consumed more than 90% of the time of back propagation. In our implementation, we apply meProp only to the back propagation from the output of the multiplication to its inputs. For other element-wise operations (e.g., activation functions), the original back propagation procedure is kept, because those operations are already fast enough compared with matrix-matrix or matrix-vector multiplication operations.

If there are multiple hidden layers, the top-k sparsification needs to be applied to every hidden layer, because the sparsified gradient will again be dense from one layer to another. That is, in meProp the gradients are sparsified with a top-k operation at the output of every hidden layer.

While we apply meProp to all hidden layers using the same k of top-k, usually the k for the output layer could be different from the k for the hidden layers, because the output layer typically has a very different dimension compared with the hidden layers. For example, there are 10 classes in the MNIST task, so the dimension of the output layer is 10, while we use an MLP with a hidden dimension of 500. Thus, the best k for the output layer could be different from that of the hidden layers.

2.2.2. CHOICE OF TOP-k ALGORITHMS

Instead of sorting the entire vector, we use the well-known min-heap based top-k selection method, which is slightly changed to focus on memory reuse. The algorithm has a time complexity of O(n log k) and a space complexity of O(k).
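For illustration, a minimal Python version of such a min-heap top-k selection is sketched below. The paper's actual implementation is in C# and is tuned for memory reuse; this sketch only shows the O(n log k) time / O(k) space idea, and the function name is ours.

```python
import heapq

def topk_indices(values, k):
    """Return indices of the k largest-magnitude entries using a size-k min-heap:
    O(n log k) time and O(k) extra space (cf. Section 2.2.2)."""
    heap = []                                    # min-heap of (magnitude, index)
    for i, v in enumerate(values):
        if len(heap) < k:
            heapq.heappush(heap, (abs(v), i))
        elif abs(v) > heap[0][0]:
            heapq.heapreplace(heap, (abs(v), i))
    return [i for _, i in heap]

print(sorted(topk_indices([1, 2, 3, -4], 2)))    # [2, 3]
```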

3. Related Work

Riedmiller and Braun (1993) proposed a direct adaptive method for fast learning, which performs a local adaptation of the weight update according to the behavior of the error function. Tollenaere (1990) also proposed an adaptive acceleration strategy for back propagation. Dropout (Srivastava et al., 2014) is proposed to improve training speed and reduce the risk of overfitting. Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently (Olshausen & Field, 1996). Ranzato et al. (2006) proposed a sparse model for learning sparse over-complete features. The proposed method is quite different from those prior studies on back propagation, dropout, and sparse coding.

The sampled-output-loss methods (Jean et al., 2015) are limited to the softmax layer (output layer) and are only based on random sampling, while our method does not have those limitations. The sparsely-gated mixture-of-experts (Shazeer et al., 2017) only sparsifies the mixture-of-experts gated layer and is limited to the specific setting of mixture-of-experts, while our method does not have those limitations. There are also prior studies focusing on reducing the communication cost in distributed systems (Seide et al., 2014; Dryden et al., 2016), by quantizing each value of the gradient from 32-bit float to only 1 bit. Those settings are also different from ours.

Table 1. Results based on LSTM/MLP models and AdaGrad/Adam optimizers. Time means averaged time per iteration. Iter means the number of iterations to reach the optimal score on development data. The model of this iteration is then used to obtain the test score.

POS-Tag (AdaGrad)   Iter   Backprop time (s)   Dev Acc (%)     Test Acc (%)
LSTM (h=500)        4      17,534.4            96.89           96.93
meProp (k=5)        4      253.2 (69.2x)       97.18 (+0.29)   97.25 (+0.32)
Parsing (AdaGrad)   Iter   Backprop time (s)   Dev UAS (%)     Test UAS (%)
MLP (h=500)         11     8,899.7             89.07           88.92
meProp (k=20)       8      492.3 (18.1x)       89.17 (+0.10)   88.95 (+0.03)
MNIST (AdaGrad)     Iter   Backprop time (s)   Dev Acc (%)     Test Acc (%)
MLP (h=500)         8      171.0               98.20           97.52
meProp (k=10)       16     4.1 (41.7x)         98.20 (+0.00)   98.00 (+0.48)
POS-Tag (Adam)      Iter   Backprop time (s)   Dev Acc (%)     Test Acc (%)
LSTM (h=500)        2      16,167.3            97.18           97.07
meProp (k=5)        5      247.2 (65.4x)       97.14 (-0.04)   97.12 (+0.05)
Parsing (Adam)      Iter   Backprop time (s)   Dev UAS (%)     Test UAS (%)
MLP (h=500)         14     9,077.7             90.12           89.84
meProp (k=20)       6      488.7 (18.6x)       90.02 (-0.10)   90.01 (+0.17)
MNIST (Adam)        Iter   Backprop time (s)   Dev Acc (%)     Test Acc (%)
MLP (h=500)         17     169.5               98.32           97.82
meProp (k=20)       15     7.9 (21.4x)         98.22 (-0.10)   98.01 (+0.19)

4. Experiments

To demonstrate that the proposed method is general-purpose, we perform experiments on different models (LSTM/MLP), various training methods (Adam/AdaGrad), and diverse tasks.

Part-of-Speech Tagging (POS-Tag): We use the standard benchmark dataset in prior work (Collins, 2002), which is derived from the Penn Treebank corpus, and use sections 0-18 of the Wall Street Journal (WSJ) for training (38,219 examples) and sections 22-24 for testing (5,462 examples). The evaluation metric is per-word accuracy. A popular model for this task is the LSTM model (Hochreiter & Schmidhuber, 1997),1 which is used as our baseline.

Transition-based Dependency Parsing (Parsing): Following prior work, we use the English Penn TreeBank (PTB) (Marcus et al., 1993) for evaluation. We follow the standard split of the corpus and use sections 2-21 as the training set (39,832 sentences, 1,900,056 transition examples),2 section 22 as the development set (1,700 sentences, 80,234 transition examples), and section 23 as the final test set (2,416 sentences, 113,368 transition examples). The evaluation metric is unlabeled attachment score (UAS). We implement a parser using an MLP following Chen and Manning (2014), which is used as our baseline.

MNIST Image Recognition (MNIST): We use the MNIST handwritten digit dataset (LeCun et al., 1998) for evaluation. MNIST consists of 60,000 28×28 pixel training images and an additional 10,000 test examples. Each image contains a single numerical digit (0-9). We select the first 5,000 images of the training images as the development set and the rest as the training set. The evaluation metric is per-image accuracy. We use the MLP model as the baseline.

1 In this work, we use the bi-directional LSTM (Bi-LSTM) as the implementation of LSTM.
2 A transition example consists of a parsing context and its optimal transition action.

Table 2. Overall forward propagation time vs. overall back propagation time. Time means averaged time per iteration. FP means forward propagation. BP means back propagation. Ov. time means overall training time (FP + BP).

POS-Tag (Adam)   Ov. FP time   Ov. BP time     Ov. time
LSTM (h=500)     7,334s        16,522s         23,856s
meProp (k=5)     7,362s        540s (30.5x)    7,903s
Parsing (Adam)   Ov. FP time   Ov. BP time     Ov. time
MLP (h=500)      3,906s        9,114s          13,020s
meProp (k=20)    4,002s        513s (17.7x)    4,516s
MNIST (Adam)     Ov. FP time   Ov. BP time     Ov. time
MLP (h=500)      69s           171s            240s
meProp (k=20)    68s           9s (18.4x)      77s

4.1. Experimental Settings

We set the dimension of the hidden layers to 500 for all the tasks. For POS-Tag, the input dimension is 1 (word) × 50 (dim per word) + 7 (features) × 20 (dim per feature) = 190, and the output dimension is 45. For Parsing, the input dimension is 48 (features) × 50 (dim per feature) = 2400, and the output dimension is 25. For MNIST, the input dimension is 28 (pixels per row) × 28 (pixels per column) × 1 (dim per pixel) = 784, and the output dimension is 10.

As discussed in Section 2, the optimal k of top-k for the output layer could be different from that of the hidden layers, because their dimensions could be very different. For Parsing and MNIST, we find using the same k for the output and the hidden layers works well, and we simply do so. For another task, POS-Tag, we find that the output layer should use a different k from the hidden layers. For simplicity, we do not apply meProp to the output layer for POS-Tag, because in this task we find the computational cost of the output layer is almost negligible compared with other layers.
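For concreteness, the MNIST baseline under these settings could be written as the PyTorch sketch below (784 inputs, hidden dimension 500, 10 outputs, two hidden layers as in the main experiments); the ReLU activations are our assumption and are not specified in this section.

```python
import torch.nn as nn

# Hypothetical MNIST baseline under the stated sizes (784 -> 500 -> 500 -> 10).
# Two hidden layers match the main experiments; the ReLU activations are an
# illustrative assumption.
mnist_mlp = nn.Sequential(
    nn.Linear(28 * 28, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, 10),
)
```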

[Figure 4 appears here: three panels titled "MNIST: Reduce Overfitting", "MNIST: Topk vs Random", and "MNIST: Change h/k"; the y-axes show Accuracy (%) and the x-axes show Backprop Ratio (%) or Backprop/Hidden Ratio (%).]

Figure 4. Accuracy vs. meProp's backprop ratio (left). Results of top-k meProp vs. random meProp (middle). Results of top-k meProp vs. baseline with the hidden dimension h (right).

The hyper-parameters are tuned based on the development data. For the Adam optimization method, we find the default hyper-parameters work well on the development sets, which are as follows: the learning rate α = 0.001, β1 = 0.9, β2 = 0.999, and ε = 1×10^-8. For the AdaGrad learner, the learning rate is set to α = 0.01, 0.01, 0.1 for POS-Tag, Parsing, and MNIST, respectively, and ε = 1×10^-6. The experiments on CPU are conducted on a computer with an Intel Xeon 3.0GHz CPU. The experiments on GPU are conducted on an NVIDIA GeForce GTX 1080.

4.2. Experimental Results

In this experiment, the LSTM is based on one hidden layer and the MLP is based on two hidden layers (experiments on more hidden layers will be presented later). We conduct experiments on different optimization methods, including AdaGrad and Adam. Since meProp is applied to the linear transformations (which entail the major computational cost), we report the linear-transformation-related backprop time as Backprop Time. It does not include non-linear activations, which usually account for less than 2% of the computational cost. The total time of back propagation, including non-linear activations, is reported as Overall Backprop Time.

Based on the development set and prior work, we set the mini-batch size to 1 (sentence), 10,000 (transition examples), and 10 (images) for POS-Tag, Parsing, and MNIST, respectively. Using 10,000 transition examples for Parsing follows Chen and Manning (2014).

Table 1 shows the results based on different models and different optimization methods. In the table, meProp means applying meProp to the corresponding baseline model, h = 500 means that the hidden layer dimension is 500, and k = 20 means that meProp uses the top-20 elements (among 500 in total) for back propagation. Note that, for fair comparisons, all experiments are first conducted on the development data and the test data is not observable. Then, the optimal number of iterations is decided based on the optimal score on the development data, and the model of this iteration is used on the test data to obtain the test scores.

As we can see, applying meProp can substantially speed up the back propagation. It provides a linear reduction in the computational cost. Surprisingly, the results demonstrate that we can update only 1–4% of the weights at each back propagation pass, and this does not result in a larger number of training iterations. More surprisingly, the accuracy of the resulting models is actually improved rather than decreased. The main reason could be that the minimal effort update does not modify weakly relevant parameters, which makes overfitting less likely, similar to the dropout effect.

Table 2 shows the overall forward propagation time, the overall back propagation time, and the training time obtained by summing up forward and backward propagation time. As we can see, back propagation has the major computational cost in training the LSTM/MLP.

The results are consistent between AdaGrad and Adam, demonstrating that meProp is independent of specific optimization methods. For simplicity, in the following experiments the optimizer is based on Adam.

4.3. Varying Backprop Ratio

In Figure 4 (left), we vary the k of top-k meProp to compare the test accuracy on different ratios of meProp backprop. For example, k=5 means that the backprop ratio is 5/500=1%. The optimizer is Adam. As we can see, meProp achieves consistently better accuracy than the baseline. The best test accuracy of meProp, 98.15% (+0.33), is actually better than the one reported in Table 1.

Table 3. Results based on the same k and h.

POS-Tag (Adam)   Iter   Test Acc (%)
LSTM (h=5)       7      96.40
meProp (k=5)     5      97.12 (+0.72)
Parsing (Adam)   Iter   Test UAS (%)
MLP (h=20)       18     88.37
meProp (k=20)    6      90.01 (+1.64)
MNIST (Adam)     Iter   Test Acc (%)
MLP (h=20)       15     95.77
meProp (k=20)    17     98.01 (+2.24)

Table 4. Adding the dropout technique.

POS-Tag (Adam)   Dropout   Test Acc (%)
LSTM (h=500)     0.5       97.20
meProp (k=20)    0.5       97.31 (+0.11)
Parsing (Adam)   Dropout   Test UAS (%)
MLP (h=500)      0.5       91.53
meProp (k=40)    0.5       91.99 (+0.46)
MNIST (Adam)     Dropout   Test Acc (%)
MLP (h=500)      0.2       98.09
meProp (k=25)    0.2       98.32 (+0.23)

Table 5. Varying the number of hidden layers on the MNIST task. The optimizer is Adam. Layers: the number of hidden layers.

Layers   Method          Test Acc (%)
2        MLP (h=500)     98.10
         meProp (k=25)   98.20 (+0.10)
3        MLP (h=500)     98.21
         meProp (k=25)   98.37 (+0.16)
4        MLP (h=500)     98.10
         meProp (k=25)   98.15 (+0.05)
5        MLP (h=500)     98.05
         meProp (k=25)   98.21 (+0.16)

Table 6. Results of simple unified top-k meProp based on a whole mini-batch (i.e., unified sparse patterns). The optimizer is Adam. Mini-batch size is 50.

Layers   Method          Test Acc (%)
2        MLP (h=500)     97.97
         meProp (k=30)   98.08 (+0.11)
5        MLP (h=500)     98.09
         meProp (k=50)   98.36 (+0.27)

4.4. Top-k vs. Random

It is interesting to check the role of the top-k elements. Figure 4 (middle) shows the results of top-k meProp vs. random meProp. Random meProp means that random elements (instead of top-k ones) are selected for back propagation. As we can see, the top-k version works better than the random version. This suggests that the top-k elements contain the most important information of the gradients.

4.5. Varying Hidden Dimension

We still have a question: does top-k meProp work well simply because the original model does not require that large a hidden dimension? For example, meProp (top-k=5) might work simply because the LSTM works well with a hidden dimension of 5, so there is no need to use a hidden dimension of 500. To examine this, we perform experiments using the same hidden dimension as k, and the results are shown in Table 3. As we can see, however, the results of the small hidden dimensions are much worse than those of meProp.

In addition, Figure 4 (right) shows more detailed curves obtained by varying the value of k. In the figure, different k gives a different backprop ratio for meProp and a different hidden dimension ratio for LSTM/MLP. As we can see, the answer to the question is negative: meProp does not rely on redundant hidden layer elements.

4.6. Adding Dropout

Since we have observed that meProp can reduce overfitting of deep learning, a natural question is whether meProp reduces the same type of overfitting risk as dropout. Thus, we use development data to find a proper value of the dropout rate on those tasks, and then further add meProp to check if further improvement is possible.

Table 4 shows the results. As we can see, meProp can achieve further improvement over dropout. In particular, meProp has an improvement of 0.46 UAS on Parsing. The results suggest that the type of overfitting that meProp reduces is probably different from that of dropout. Thus, a model should be able to take advantage of both meProp and dropout to reduce overfitting.

4.7. Adding More Hidden Layers

Another question is whether or not meProp relies on shallow models with only a few hidden layers. To answer this question, we also perform experiments on more hidden layers, from 2 hidden layers to 5 hidden layers. We find that setting the dropout rate to 0.1 works well for most cases of different numbers of layers. For simplicity of comparison, we set the same dropout rate to 0.1 in this experiment. Table 5 shows that adding more hidden layers does not hurt the performance of meProp.

4.8. Speedup on GPU

For implementing meProp on GPU, the simplest solution is to treat the entire mini-batch as a "big training example", where the top-k operation is based on the averaged values of all examples in the mini-batch. In this way, the big sparse matrix of the mini-batch will have consistent sparse patterns among examples, and this consistent sparse matrix can be transformed into a small dense matrix by removing the zero values. We call this implementation simple unified top-k. This experiment is based on PyTorch.
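A sketch of this unified top-k idea for one linear layer is given below (our illustration, not the authors' PyTorch code); here the shared index set is chosen from the mini-batch-averaged gradient magnitudes, which is one plausible reading of "averaged values", and the function name is ours.

```python
import torch

def unified_topk_backward(grad_out, x, W, k):
    """Sketch of 'simple unified top-k': one index set shared by the whole
    mini-batch, chosen from the averaged gradient magnitudes, so the sparse
    gradient can be handled as a small dense matrix.
    grad_out: (batch, n), x: (batch, m), W: (n, m)."""
    scores = grad_out.abs().mean(dim=0)      # average magnitude per output unit
    idx = scores.topk(k).indices             # shared top-k indices for the batch
    g_small = grad_out[:, idx]               # (batch, k) dense slice
    grad_W = torch.zeros_like(W)
    grad_W[idx] = g_small.t().matmul(x)      # only k rows of W receive a gradient
    grad_x = g_small.matmul(W[idx])          # uses only the k selected rows of W
    return grad_x, grad_W
```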

Despite its simplicity, Table 6 shows the good performance of this implementation, which is based on a mini-batch size of 50. We also find that the speedup on GPU is less significant when the hidden dimension is low. The reason is that our GPU's computational power is not fully consumed by the baseline (with small hidden layers), so the normal back propagation is already fast enough, making it hard for meProp to achieve substantial speedup. For example, supposing a GPU can finish 1,000 operations in one cycle, there could be no speed difference between a method with 100 operations and a method with 10 operations. Indeed, we find that MLP (h=64) and MLP (h=512) have almost the same GPU speed even on forward propagation (i.e., without meProp), while theoretically there should be an 8x difference. With GPU, the forward propagation time of MLP (h=64) and MLP (h=512) is 572ms and 644ms, respectively. This provides evidence for our hypothesis that our GPU is not fully consumed with the small hidden dimensions.

Thus, the speedup test on GPU is more meaningful for heavy models, such that the baseline can at least fully consume the GPU's computational power. To check this, we test the GPU speedup on synthetic matrix multiplication data with a larger hidden dimension. Indeed, Table 7 shows that meProp achieves much higher speed than traditional backprop with the large hidden dimension. Furthermore, we test the GPU speedup on an MLP with the large hidden dimension (Dryden et al., 2016). Table 8 shows that meProp also has substantial GPU speedup on MNIST with the large hidden dimension. In this experiment, the speedup is based on Overall Backprop Time (see the prior definition). Those results demonstrate that meProp can achieve good speedup on GPU when it is applied to heavy models.

Table 7. Acceleration results on the matrix multiplication synthetic data using GPU. The batch size is 1024.

Method              Backprop time (ms)
Baseline (h=8192)   308.00
meProp (k=8)        8.37 (36.8x)
meProp (k=16)       9.16 (33.6x)
meProp (k=32)       11.20 (27.5x)
meProp (k=64)       14.38 (21.4x)
meProp (k=128)      21.28 (14.5x)
meProp (k=256)      38.57 (8.0x)
meProp (k=512)      69.95 (4.4x)

Table 8. Acceleration results on MNIST using GPU.

Method              Overall backprop time (ms)
MLP (h=8192)        17,696.2
meProp (k=8)        1,501.5 (11.8x)
meProp (k=16)       1,542.8 (11.5x)
meProp (k=32)       1,656.9 (10.7x)
meProp (k=64)       1,828.3 (9.7x)
meProp (k=128)      2,200.0 (8.0x)
meProp (k=256)      3,149.6 (5.6x)
meProp (k=512)      4,874.1 (3.6x)

Finally, there are potentially other implementation choices of meProp on GPU. For example, another natural solution is to use a big sparse matrix to represent the sparsified gradient of the output of a mini-batch. Then, a sparse matrix multiplication library can be used to accelerate the computation. This could be an interesting direction of future work.

4.9. Related Systems on the Tasks

The POS tagging task is a well-known benchmark task, with accuracy reports ranging from 97.2% to 97.4% (Toutanova et al., 2003; Sun, 2014; Shen et al., 2007; Tsuruoka et al., 2011; Collobert et al., 2011; Huang et al., 2015). Our method achieves 97.31% (Table 4).

For the transition-based dependency parsing task, existing approaches typically achieve UAS scores from 91.4 to 91.5 (Zhang & Clark, 2008; Nivre et al., 2007; Huang & Sagae, 2010). As one of the most popular transition-based parsers, MaltParser (Nivre et al., 2007) has 91.5 UAS. Chen and Manning (2014) achieve 92.0 UAS using neural networks. Our method achieves 91.99 UAS (Table 4).

For MNIST, MLP-based approaches can achieve 98–99% accuracy, often around 98.3% (LeCun et al., 1998; Simard et al., 2003; Ciresan et al., 2010). Our method achieves 98.37% (Table 5). With the help of convolutional layers and other techniques, the accuracy can be improved to over 99% (Jarrett et al., 2009; Ciresan et al., 2012). Our method can also be improved with those additional techniques, which, however, are not the focus of this paper.

5. Conclusions

Back propagation in deep learning tries to modify all parameters in each stochastic update, which is inefficient and may even lead to overfitting due to unnecessary modification of many weakly relevant parameters. We propose a minimal effort back propagation method (meProp), in which we compute only a very small but critical portion of the gradient, and modify only the corresponding small portion of the parameters in each update. This leads to very sparse gradients that modify only highly relevant parameters for the given training sample. The proposed meProp is independent of the optimization method. Experiments show that meProp can reduce the computational cost of back propagation by one to two orders of magnitude by updating only 1–4% of the parameters, and yet improve the model accuracy in most cases.

Acknowledgements

The authors would like to thank the anonymous reviewers for insightful comments and suggestions on this paper. This work was supported in part by the National Natural Science Foundation of China (No. 61673028), the National High Technology Research and Development Program of China (863 Program, No. 2015AA015404), and an Okawa Research Grant (2016).

References

Chen, Danqi and Manning, Christopher D. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP'14, pp. 740–750, 2014.

Ciresan, Dan C., Meier, Ueli, Gambardella, Luca Maria, and Schmidhuber, Jürgen. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.

Ciresan, Dan C., Meier, Ueli, and Schmidhuber, Jürgen. Multi-column deep neural networks for image classification. In Proceedings of CVPR'12, pp. 3642–3649, 2012.

Collins, Michael. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP'02, pp. 1–8, 2002.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.

Dryden, Nikoli, Moon, Tim, Jacobs, Sam Ade, and Essen, Brian Van. Communication quantization for data-parallel training of deep neural networks. In Proceedings of the 2nd Workshop on Machine Learning in HPC Environments, pp. 1–8, 2016.

Duchi, John C., Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Huang, Liang and Sagae, Kenji. Dynamic programming for linear-time incremental parsing. In Proceedings of ACL'10, pp. 1077–1086, 2010.

Huang, Zhiheng, Xu, Wei, and Yu, Kai. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991, 2015.

Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In Proceedings of ICCV'09, pp. 2146–2153, 2009.

Jean, Sébastien, Cho, KyungHyun, Memisevic, Roland, and Bengio, Yoshua. On using very large target vocabulary for neural machine translation. In Proceedings of ACL/IJCNLP'15, pp. 1–10, 2015.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Marcus, Mitchell P., Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Nivre, Joakim, Hall, Johan, Nilsson, Jens, Chanev, Atanas, Eryigit, Gülsen, Kübler, Sandra, Marinov, Svetoslav, and Marsi, Erwin. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135, 2007.

Olshausen, Bruno A. and Field, David J. Natural image statistics and efficient coding. Network: Computation in Neural Systems, 7(2):333–339, 1996.

Ranzato, Marc'Aurelio, Poultney, Christopher S., Chopra, Sumit, and LeCun, Yann. Efficient learning of sparse representations with an energy-based model. In NIPS'06, pp. 1137–1144, 2006.

Riedmiller, Martin and Braun, Heinrich. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of IEEE International Conference on Neural Networks 1993, pp. 586–591, 1993.

Seide, Frank, Fu, Hao, Droppo, Jasha, Li, Gang, and Yu, Dong. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proceedings of INTERSPEECH'14, pp. 1058–1062, 2014.

Shazeer, Noam, Mirhoseini, Azalia, Maziarz, Krzysztof, Davis, Andy, Le, Quoc V., Hinton, Geoffrey E., and Dean, Jeff. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017.

Shen, Libin, Satta, Giorgio, and Joshi, Aravind K. Guided learning for bidirectional sequence classification. In Proceedings of ACL'07, pp. 760–767, 2007.

Simard, Patrice Y., Steinkraus, Dave, and Platt, John C. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of ICDAR'03, pp. 958–962, 2003.

Srivastava, Nitish, Hinton, Geoffrey E., Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Sun, Xu. Structure regularization for structured prediction. In NIPS'14, pp. 2402–2410, 2014.

Tollenaere, Tom. SuperSAB: fast adaptive back propagation with good scaling properties. Neural Networks, 3(5):561–573, 1990.

Toutanova, Kristina, Klein, Dan, Manning, Christopher D., and Singer, Yoram. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL'03, pp. 173–180, 2003.

Tsuruoka, Yoshimasa, Miyao, Yusuke, and Kazama, Jun'ichi. Learning with lookahead: Can history-based models rival globally optimized models? In Proceedings of CoNLL'11, pp. 238–246, 2011.

Zhang, Yue and Clark, Stephen. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of EMNLP'08, pp. 562–571, 2008.