
TRUE GRADIENT-BASED TRAINING OF DEEP BINARY ACTIVATED NEURAL NETWORKS VIA CONTINUOUS BINARIZATION

Charbel Sakr⋆, Jungwook Choi†, Zhuo Wang†‡, Kailash Gopalakrishnan†, Naresh Shanbhag⋆

⋆Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
†IBM T. J. Watson Research Center

This work was supported in part by Systems On Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA. This work is supported in part by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizons Network.
‡Zhuo Wang is now an employee at Google.

ABSTRACT

With the ever growing popularity of deep learning, the tremendous complexity of deep neural networks is becoming problematic when one considers inference on resource constrained platforms. Binary networks have emerged as a potential solution; however, they exhibit a fundamental limitation in realizing gradient-based learning as their activations are non-differentiable. Current work has so far relied on approximating gradients in order to use the back-propagation algorithm via the straight through estimator (STE). Such approximations harm the quality of the training procedure, causing a noticeable gap in accuracy between binary neural networks and their full precision baselines. We present a novel method to train binary activated neural networks using true gradient-based learning. Our idea is motivated by the similarities between clipping and binary activation functions. We show that our method has minimal accuracy degradation with respect to the full precision baseline. Finally, we test our method on three benchmarking datasets: MNIST, CIFAR-10, and SVHN. For each benchmark, we show that continuous binarization using true gradient-based learning achieves an accuracy within 1.5% of the floating-point baseline, as compared to accuracy drops as high as 6% when training the same binary activated network using the STE.

Index Terms— deep learning, binary neural networks, activation functions

1. INTRODUCTION

Deep neural networks are becoming the de facto predictive models used in many machine learning tasks. Their popularity is the result of numerous victories deep learning has enjoyed in the past decade. Most notably, with AlexNet [1] winning the 2012 ImageNet Large Scale Visual Recognition challenge, extensive research efforts in the area followed and culminated with machines outperforming humans in recognition tasks [2]. However, this outstanding representational power comes at the price of very high complexity. Some of these networks require around 1 billion multiply-accumulates (MACs) [3]. It is in fact not uncommon to find networks with over 100 million parameters [4] and over 1000 layers [5]. Such high complexity makes these models hard to train, but most importantly, their deployment on resource-constrained platforms, such as ASICs, FPGAs, and microcontrollers, becomes problematic.

1.1. Related work

The importance of reducing the complexity of deep learning systems is well appreciated today. One approach is to optimize the structure of a neural network itself, such as by pruning [6], where weak connections are deleted and the reduced network is retrained. Dominant layers can also be decomposed into a cascade of smaller layers with an overall lesser complexity [7]. Taking advantage of sparsity also enables efficient implementation via zero-skipping [3].

An orthogonal approach is to consider reduced precision neural networks. This can be done in one of two ways: quantizing pre-trained networks, or directly training in low precision. The first option is justified by the inherent robustness of neural networks, suggesting that moderate quantization should not be catastrophic. This has led to interesting analytical investigations, such as determining the correspondence between precision and signal-to-quantization-noise ratio [8]. A better understanding of accuracy in the presence of fixed-point quantization was presented in [9]. This study led to the discovery of an interesting trade-off between weight and activation precisions.

Directly training in low precision has also seen many advances. It was shown that training 16-bit fixed-point networks is possible using stochastic rounding [10]. It was later realized that training binary weighted neural networks, such as BinaryConnect [11], is possible provided a high precision representation of the weights is maintained during training [12].

A natural extension is to consider activation binarization, such as BinaryNet [13], XNOR-Net [14], and DoReFa-Net [15]. A close investigation of these works' reported performances reveals failure in achieving state-of-the-art accuracy. For instance, the accuracy gap between BinaryNet and BinaryConnect is slightly over 3% on the CIFAR-10 dataset, despite using numerous optimization tricks such as stochastic rounding, shift-based ADAMAX, and early model selection. These shortcomings may arguably be attributed to the inability to use gradient-based learning because of the use of the non-differentiable binary activation function. Instead, these works have relied on gradient approximations via the straight through estimator (STE) [16] to enable back-propagation.
The STE estimates the gradient of the binary activation function by replacing it with the identity function, as shown in Fig. 1. Effectively, the back-propagation procedure "sees through" the binary activation function. The STE is justified in the context of stochastic gradient descent (SGD) because the computed gradient is an approximation to the true gradient with respect to the loss function being minimized. However, in the case of SGD, almost half a century of well-established theory has led to performance guarantees [17]. In contrast, the STE method still lacks a complete analytical basis.

Fig. 1: A feedforward binary activation function (BAF) and its straight through estimator (STE). Conventional training of binary activated neural networks uses the non-differentiable binary activation during the feedforward computations but replaces it with the identity function during back-propagation.

1.2. Contributions

We propose a new method to train binary activated networks. Instead of relying on gradient approximation, our method leverages the true SGD algorithm. We analytically demonstrate that our method guarantees minimal accuracy degradation with respect to the full precision baseline. We present numerical experiments that show successful training of binary activated networks on the MNIST [18], CIFAR-10 [19], and SVHN [20] datasets with accuracies within 1.5% of the full precision baseline. We also train the same binary activated networks using the STE and observe accuracy drops as high as 6%, justifying the superiority of our proposed method. Thus, the complexity benefits of binary activated networks are obtained with minimal loss in accuracy over full precision networks.

The rest of this paper is organized as follows. Section 2 describes our proposed continuous binarization method and includes our analytical justification. Numerical results are reported in Section 3. We conclude our paper in Section 4.

2. PROPOSED METHOD: CONTINUOUS BINARIZATION

This section describes our proposed technique and its analytical justification.

2.1. Principle

Our main motivation is to employ true gradient-based learning for binary activated networks. Consequently, the use of the STE is not suitable. Instead, we replace the binary activation by a continuous, piecewise differentiable function. More specifically, we consider a scaled binary activation function (SBAF) of the form

σ(x) = α · 1_{x>0}

where α is a scale parameter and 1_{x>0} is the indicator function. We also consider a parametrized clipping function (PCF) of the form

σ(x) = clip(x/m + α/2, 0, α) = min(max(x/m + α/2, 0), α)    (1)

where m and α are the slope and scale parameters, respectively. Observe that, for a small value of m, the PCF has a steeper slope and approaches the SBAF. This observation is the key insight our method relies on, as shown in Fig. 2. We propose to train a neural network using the PCF as the activation function while constraining m to be small. This is done by regularizing the slope parameter m. When m is small enough, we replace the PCF by the SBAF. Thus, our procedure uses true SGD during training but generates a final binary activated network for inference.

Fig. 2: A scaled binary activation function (SBAF) and some parametrized clipping functions (PCFs). The smaller the considered value of m, the closer the PCF is to the SBAF.
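To make the two activation functions concrete, here is a minimal NumPy sketch of the SBAF and of the PCF in Eq. (1). The function names and the printed comparison are ours, not code from the paper; the defaults m = 0.5 and α = 2 simply mirror the initialization reported later in Section 3.

```python
import numpy as np

def sbaf(x, alpha=2.0):
    """Scaled binary activation function: sigma(x) = alpha * 1{x > 0}."""
    return alpha * (x > 0).astype(float)

def pcf(x, m=0.5, alpha=2.0):
    """Parametrized clipping function of Eq. (1): clip(x/m + alpha/2, 0, alpha)."""
    return np.clip(x / m + alpha / 2.0, 0.0, alpha)

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
for m in (1.0, 0.1, 0.01):
    print(f"m = {m}:", pcf(x, m))   # tends to [0, 0, 1, 2, 2] as m -> 0
print("SBAF:   ", sbaf(x))          # [0, 0, 0, 2, 2]
```

As m shrinks, the printed PCF outputs converge to the SBAF outputs everywhere except at x = 0, where the PCF returns α/2.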

2.2. Procedure

The proposed continuous binarization procedure is provided in Algorithm 1 below. An important caveat is that training the slope parameters of all layers simultaneously might lead to a bottleneck effect. This is because the gradients flowing through the PCF in the backward pass become sparse as the slope becomes steep, causing convergence to slow down or even stop. Hence, we learn the slope parameters in stages, one layer at a time: at stage l, all layers up to layer l − 1 are binary activated and have frozen weights, and only the slope parameter m_l at layer l is learned. As the parameter m needs to be regularized, e.g. via L1 or L2 regularization schemes, the usual caveats apply to the choice of the regularization coefficient λ. We identify three types of regularization in a network: 1) activations preceding fully connected layers, with coefficient λ1; 2) activations preceding convolutional layers, with coefficient λ2; and 3) activations preceding pooling layers, with coefficient λ3. A good strategy is to choose λ1 < λ2 < λ3.
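As an illustration of the slope regularization discussed above, the following sketch adds an L2 penalty on the PCF slope of the layer currently being binarized, with the coefficient keyed to the layer type the activation precedes. The helper name and dictionary layout are hypothetical; the numeric values are the CIFAR-10 coefficients reported in Section 3, and an L1 term could be added analogously.

```python
import torch

# lambda1 < lambda2 < lambda3, keyed by the layer type the activation precedes
# (values are assumptions taken from the CIFAR-10 setup in Section 3)
SLOPE_LAMBDAS = {"fc": 1e-3, "conv": 1e-2, "pool": 1e-1}

def slope_penalty(m: torch.Tensor, next_layer_kind: str) -> torch.Tensor:
    """L2 penalty lambda * m^2 on the PCF slope m of the layer being binarized."""
    return SLOPE_LAMBDAS[next_layer_kind] * m.pow(2)

# hypothetical use inside a training step at stage l:
# loss = cross_entropy(logits, labels) + slope_penalty(m_l, "conv")
```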

Algorithm 1 Continuous Binarization procedure: L is the number of layers, I(l) is the number of iterations for which the PCF parameters of layer l are trained, µ is the learning rate, and λ(l) is the slope regularization coefficient at layer l.

for l = 1 to L do
    for i = 1 to I(l) do
        (inputs, labels) ← getMiniBatch()
        hidden ← inputs
        ▷ Feedforward pass:
        for j = 1 to L do
            W ← getLayerWeights(j)
            preActivation ← computePreActivation(hidden, W, j)
            if j < l then
                hidden ← SBAF(preActivation, j)
            end if
            if l ≤ j < L then
                hidden ← PCF(preActivation, j)
            end if
            if j == L then
                hidden ← outputActivation(preActivation)    ▷ Generally a softmax
            end if
        end for
        predicted ← argmax(hidden)
        C ← costFunction(predicted, labels)    ▷ Generally a cross entropy
        ▷ Backward pass and updates:
        for j = l to L do
            W ← getLayerWeights(j)
            G ← computeGradients(C, W)
            W ← W − µG
        end for
        m ← getPCFSlope(l)
        α ← getPCFScale(l)
        g_m ← computeGradient(C + regularizer(m, λ(l)), m)
        g_α ← computeGradient(C, α)
        m ← m − µ g_m
        α ← α − µ g_α
    end for
end for

2.3. Analytical Justification

Given a network, let us denote by N^(l) the network obtained using the SBAF for all layers up to and including the l-th one and the PCF for all layers past the l-th one. Consequently, N^(0) is the original full precision network using the PCF at all layers and N^(L−1) is the final binary activated network. Note that at the start of stage l (l ≥ 2) of the continuous binarization procedure, we replace N^(l−1) by N^(l) after having regularized the parameter m at layer l.

We first show that the mean squared error (MSE) of approximating the PCF by the SBAF decreases linearly in m. Note that from (1), the intercepts of the clipping function with the lines y = 0 and y = α are at x = −mα/2 and x = mα/2, respectively. We assume the pessimistic scenario where the input x is uniformly distributed in [−mα/2, mα/2]. Then, the MSE between PCF and SBAF activations is computed as follows:

MSE = k ∫_{−mα/2}^{mα/2} (x/m + α/2 − α · 1_{x>0})² dx = c · m

where k is a normalization constant and c = kα³/12 is a constant (by symmetry, the integral evaluates to 2 ∫_0^{mα/2} (x/m − α/2)² dx = mα³/12). Hence, we establish that as m decreases, the perturbations due to switching activation functions at layer l decrease linearly in the mean squared sense.

Next, for a given input, let a_o and a_p be the feature vectors at layer l of N^(l−1) and N^(l), respectively. We will show that there is no mismatch between the predicted labels ŷ_o and ŷ_p of N^(l−1) and N^(l), respectively, provided the perturbation vector at layer l, q_a = a_p − a_o, has a bounded magnitude. To do so, we use an output re-ordering argument similar to that in [9]. There is a mismatch when ŷ_o = i, ŷ_p = j, and i ≠ j, where i, j are predicted classes. But this event occurs only if f_i(a_o) > f_j(a_o) and f_i(a_p) < f_j(a_p), where f_i, f_j are the soft outputs at indexes i, j, respectively. For small perturbations, we can use a first order Taylor approximation as in [9] to show:

f_i(a_p) < f_j(a_p)  ⇒  f_i(a_o + q_a) < f_j(a_o + q_a)  ⇒  f_i(a_o) − f_j(a_o) < ‖q_a‖ ‖∇_{a_o} f_j(a_o) − ∇_{a_o} f_i(a_o)‖

where we used the Cauchy-Schwarz inequality. Therefore, by the contrapositive, we have no mismatch if

‖q_a‖ < (f_i(a_o) − f_j(a_o)) / ‖∇_{a_o} f_j(a_o) − ∇_{a_o} f_i(a_o)‖    (2)

For an M-class classification problem, we may extend the condition in (2) as follows:

‖q_a‖ < min_{j=1,...,M, j≠i} (f_i(a_o) − f_j(a_o)) / ‖∇_{a_o} f_j(a_o) − ∇_{a_o} f_i(a_o)‖

Thus, we establish an upper bound on the perturbation vector magnitude that avoids a mismatch. By the MSE result above, the perturbation magnitude decreases linearly as a function of m in the mean square sense. Hence, we argue that replacing the PCF by the SBAF at layer l has minimal effect on accuracy for small enough values of m.
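To complement Algorithm 1, the following minimal PyTorch-style sketch illustrates how the activation is selected per layer in the feedforward pass at stage l (SBAF for already-binarized layers, PCF for the rest, softmax at the output). All function and container names are hypothetical, and weight freezing, the backward pass, and the slope regularizer are omitted.

```python
import torch.nn.functional as F

def sbaf(x, alpha):
    """Scaled binary activation: alpha * 1{x > 0}."""
    return alpha * (x > 0).float()

def pcf(x, m, alpha):
    """Parametrized clipping function of Eq. (1), written with ReLUs so that
    gradients with respect to m and alpha flow during training."""
    y = x / m + alpha / 2
    return F.relu(y) - F.relu(y - alpha)

def forward_stage(inputs, layers, pcf_params, l):
    """Feedforward pass at stage l of Algorithm 1 (sketch only).
    layers is a list of L callable layers; pcf_params[j - 1] = (m_j, alpha_j).
    Layers 1..l-1 use the SBAF (already binarized; their weights are frozen
    elsewhere), layers l..L-1 use the PCF, and layer L applies a softmax."""
    L = len(layers)
    hidden = inputs
    for j, layer in enumerate(layers, start=1):
        pre = layer(hidden)                      # computePreActivation
        if j < l:
            m_j, alpha_j = pcf_params[j - 1]
            hidden = sbaf(pre, alpha_j)
        elif j < L:
            m_j, alpha_j = pcf_params[j - 1]
            hidden = pcf(pre, m_j, alpha_j)
        else:
            hidden = F.softmax(pre, dim=-1)      # outputActivation
    return hidden
```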
3. NUMERICAL RESULTS

In order to evaluate our proposed continuous binarization method, we utilize three datasets: MNIST [18], CIFAR-10 [19], and SVHN [20]. For each, we consider a unique network inspired by BinaryNet [13], defined below (a code sketch of the CIFAR-10 case follows the list):

• MNIST: A multi-layer perceptron with architecture 784 − 2048 − 2048 − 2048 − 10.
• CIFAR-10: A convolutional neural network with architecture 128C3 − 128C3 − MP2 − 256C3 − 256C3 − MP2 − 512C3 − 512C3 − 1024FC − 1024FC − 10.
• SVHN: A convolutional neural network with architecture 64C3 − 64C3 − MP2 − 128C3 − 128C3 − MP2 − 256C3 − 256C3 − 1024FC − 1024FC − 10.
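To make the layer notation concrete (XC3 denotes X 3×3 convolution filters, MP2 a 2×2 max pooling, and XFC a fully connected layer with X units), here is a minimal PyTorch sketch of the CIFAR-10 network. The activation is passed in as a factory; the padding choice, the absence of batch normalization, and the resulting fully connected input size are assumptions, not details given in the text.

```python
import torch.nn as nn

def cifar10_net(act):
    """128C3-128C3-MP2-256C3-256C3-MP2-512C3-512C3-1024FC-1024FC-10 (sketch).
    `act` is a zero-argument factory returning an activation module, e.g. the
    PCF during training or the SBAF for the final binary activated network."""
    def conv(cin, cout):
        # 'same' padding is an assumption; with it, 32x32 inputs become 8x8 after two MP2
        return [nn.Conv2d(cin, cout, kernel_size=3, padding=1), act()]
    return nn.Sequential(
        *conv(3, 128), *conv(128, 128), nn.MaxPool2d(2),
        *conv(128, 256), *conv(256, 256), nn.MaxPool2d(2),
        *conv(256, 512), *conv(512, 512),
        nn.Flatten(),
        nn.Linear(512 * 8 * 8, 1024), act(),
        nn.Linear(1024, 1024), act(),
        nn.Linear(1024, 10),
    )

# e.g., a fixed clipping nonlinearity clamp(x, 0, 2) as a simple stand-in activation:
net = cifar10_net(lambda: nn.Hardtanh(0.0, 2.0))
```

Passing the activation as a factory mirrors the procedure of Section 2.2: the same topology is trained with the PCF and later evaluated with the SBAF.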

The network corresponding to each dataset is trained in three ways: 1) full-precision baseline using the clipping activation function, 2) binarization via the STE [16], and 3) the proposed continuous binarization. All results are summarized in Table 1.

                          MNIST    CIFAR-10    SVHN
Full-precision Baseline   1.45%    9.04%       2.53%
Binarization via STE      1.54%    14.80%      4.05%
Continuous Binarization   1.27%    10.41%      3.20%

Table 1: Summary of test errors for the three networks trained for each dataset. Note the small gap in accuracy between the baseline and the binary activated network for each dataset when using the proposed continuous binarization method. In contrast, binarization using the STE is not as successful.

Figure 3 illustrates the outcome of the continuous binarization procedure for each dataset. Plotted is the test error as a function of training epoch for N^(l) and N^(L−1). The networks are pre-trained with m = 0.5 and α = 2 for all layers, which is why the test error for N^(l) has good initial conditions. Recall that N^(l) uses the SBAF for all layers up to the l-th one and the PCF for all other layers, whereas N^(L−1) uses only the SBAF. As l is progressively incremented in stages, the number of binary activated layers in N^(l) is also incremented until the whole network is binary activated and N^(l) is identical to N^(L−1), making both curves meet.

[Figure 3: three panels of test error (%) versus training epoch for (a) MNIST, (b) CIFAR-10, and (c) SVHN.]
Fig. 3: Illustration of the continuous binarization procedure for: (a) MNIST, (b) CIFAR-10, and (c) SVHN. Both the blue and orange curves are for the same network. The blue curve is obtained by observing the test error for N^(l). The orange curve is obtained by observing the test error for N^(L−1).

For the MNIST dataset, the full precision baseline is obtained using SGD and achieves 1.45% test error after 500 epochs. The STE implementation is obtained by training with vanilla Adam [21] (we observed significant improvements over SGD); all network configurations are otherwise identical to the baseline. The test error obtained is 1.54% after 500 epochs. For continuous binarization, only L2 regularization was used on m, with λ = 1 for all layers. The first layer is trained for 200 epochs, and all other layers are trained for 100 epochs each. The test error at the end of the last training epoch is 1.27%.

For the CIFAR-10 dataset, the baseline network is trained using SGD and achieves a test error of 9.04% using clipping after 500 epochs. The STE implementation is trained using vanilla Adam, with the same configurations as the baseline otherwise. The test error obtained is 14.80% after 500 epochs. During continuous binarization, vanilla Adam is used instead of SGD; this slightly negated the warm-up effect due to the change in optimization technique. Both L2 and L1 regularizations were used, with coefficients λ1 = 0.001, λ2 = 0.01, λ3 = 0.1. Each layer is trained for 50 epochs and the final test error obtained is 10.41%.

For the SVHN dataset, the baseline network is trained using SGD and achieves a test error of 2.53% using clipping after 200 epochs. The STE implementation is trained using vanilla Adam, with the same configurations as our baseline otherwise. The test error obtained is 4.05% after 200 epochs. During continuous binarization, the same procedure as that for CIFAR-10 is followed with a few modifications: each layer is trained for 20 epochs instead of 50 because this dataset is much larger, and only L2 regularization is used for all layers past the fifth one. We obtain a final test error of 3.20%.

In summary, for each of the three datasets, the test error using continuous binarization is within 1.5% of the full precision baseline. However, for binarization using the STE, the accuracy drop reaches up to 6%. This demonstrates the superiority of our proposed training procedure for binary activated networks.
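These summary figures can be read off Table 1 directly; the largest gaps with respect to the baseline occur on CIFAR-10:

```latex
\begin{align*}
\text{Continuous binarization:}\quad & 10.41\% - 9.04\% = 1.37\% \;\le\; 1.5\% \\
\text{STE:}\quad & 14.80\% - 9.04\% = 5.76\% \;\approx\; 6\%
\end{align*}
```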
4. CONCLUSION

We have presented a novel method for binarizing the activations of deep neural networks, together with a theoretical justification. To demonstrate the validity of our approach, we have tested it on three deep learning benchmarks.

Future work includes further experimentation on larger models and datasets, combining the proposed activation binarization with weight binarization, and extending the method to multi-bit activations.

5. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[3] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.

[4] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.

[5] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger, "Deep networks with stochastic depth," in European Conference on Computer Vision. Springer, 2016, pp. 646–661.

[6] Song Han, Jeff Pool, John Tran, and William Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.

[7] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[8] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy, "Fixed point quantization of deep convolutional networks," in Proceedings of The 33rd International Conference on Machine Learning, 2016, pp. 2849–2858.

[9] Charbel Sakr, Yongjune Kim, and Naresh Shanbhag, "Analytical guarantees on numerical precision of deep neural networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3007–3016.

[10] S. Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan, "Deep learning with limited numerical precision," in Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 1737–1746.

[11] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.

[12] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein, "Training quantized nets: A deeper understanding," in Advances in Neural Information Processing Systems (NIPS), 2017.

[13] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.

[14] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.

[15] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.

[16] Yoshua Bengio, Nicholas Léonard, and Aaron Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.

[17] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010, pp. 177–186. Springer, 2010.

[18] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges, "The MNIST database of handwritten digits," 1998.

[19] Alex Krizhevsky and Geoffrey Hinton, "Learning multiple layers of features from tiny images," 2009.

[20] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011, vol. 2011, p. 5.

[21] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.