Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning

Item Type Conference Paper

Authors Gajjala, Rishikesh R.; Banchhor, Shashwat; Abdelmoniem, Ahmed M.; Dutta, Aritra; Canini, Marco; Kalnis, Panos

Citation Gajjala, R. R., Banchhor, S., Abdelmoniem, A. M., Dutta, A., Canini, M., & Kalnis, P. (2020). Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning. Proceedings of the 1st Workshop on Distributed Machine Learning. doi:10.1145/3426745.3431334

Eprint version Post-print

DOI 10.1145/3426745.3431334

Publisher Association for Computing Machinery (ACM)

Rights Archived with thanks to ACM


Link to Item http://hdl.handle.net/10754/666175

Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning

Rishikesh R. Gajjala¹, Shashwat Banchhor¹, Ahmed M. Abdelmoniem¹*, Aritra Dutta, Marco Canini, Panos Kalnis
KAUST

ABSTRACT
Distributed stochastic algorithms, equipped with gradient compression techniques such as codebook quantization, are becoming increasingly popular and are considered state-of-the-art in training large deep neural network (DNN) models. However, communicating the quantized gradients over a network requires efficient encoding techniques. For this, practitioners generally use Elias encoding-based techniques without considering their computational overhead or data-volume. In this paper, based on Huffman coding, we propose several lossless encoding techniques that exploit different characteristics of the quantized gradients during distributed DNN training. We then show their effectiveness on 5 different DNN models across three different datasets, and compare them with classic state-of-the-art Elias-based encoding techniques. Our results show that the proposed Huffman-based encoders (i.e., RLH, SH, and SHS) can reduce the encoded data-volume by up to 5.1×, 4.32×, and 3.8×, respectively, compared to the Elias-based encoders.

KEYWORDS
Distributed training, Gradient compression, Quantization, Huffman coding, Elias and Run-length Encoding.

ACM Reference Format:
Rishikesh R. Gajjala, Shashwat Banchhor, Ahmed M. Abdelmoniem, Aritra Dutta, Marco Canini, and Panos Kalnis. 2020. Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning. In 1st Workshop on Distributed Machine Learning (DistributedML '20), December 1, 2020, Barcelona, Spain. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3426745.3431334

© 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-8182-6/20/12. https://doi.org/10.1145/3426745.3431334

* Corresponding author: Ahmed M. Abdelmoniem ([email protected]).
¹ Equal contribution. Rishikesh and Shashwat were with the Indian Institute of Technology, Delhi. A significant part of this work was done while they were interns at KAUST.

1 INTRODUCTION
As DNN models become more complex, one of the fundamental challenges in training them is their increasing size. To efficiently train these models, practitioners generally employ distributed parallel training over multiple computing nodes/workers [5, 11]. In this work, we focus on the data-parallel paradigm, as it is the most widely adopted in practice. In this setting, at each iteration, each worker maintains a local copy of the same DNN model, accesses one of the non-intersecting partitions of the data, and calculates its local gradient. The gradient information is exchanged synchronously through the network for aggregation, and the aggregated global gradient is sent back to the workers. The workers jointly update the model parameters by using this global gradient. However, the network latency during gradient transmission creates a communication bottleneck and, as a result, training becomes slow. To remedy this, gradient compression techniques such as quantization [2, 21, 49, 51], sparsification [1, 13, 42, 45], hybrid compressors [4, 43], and low-rank methods [47, 48] are used. In this paper, we focus on gradient quantization techniques.

We are interested in quantization operators, Q(·) : R^d → R^d, that produce a lower-precision quantized vector Q(x) from the original vector x. In general, depending on the quantization technique, the quantized gradient Q(g_t^i) resulting from the i-th worker is further encoded by using one of the existing encoding techniques. For instance, random dithering [2] and ternary quantization [49] use Elias encoding, Sattler et al. [37] use Golomb encoding [15], and CNat [21] uses a fixed-length 8-bit encoding. In distributed settings, the quality of the trained model can only be impacted by the compression itself; if the encoding is lossless, it can help reduce the communicated volume without further impact on model quality. To elaborate, let the total communication time T be the sum of the time taken for compression, transmission, and decompression. The main goal of encoding techniques is to encode the compressed gradients that result from quantization in a compact form, which can further reduce T.

To reduce T, in this work we focus on the lossless encoding component used in gradient quantization and evaluate its effectiveness across a wide range of compression methods. That is, we decouple quantization from the encoding part and explore possible combinations of quantization techniques and lossless encoding.¹ If the effective transfer speed S (including any reduction of transfer speed as a result of compression overhead), which accounts for network transmission and the computation necessary for compression and encoding, is constant for homogeneous training environments such as data centers or private clusters, then the time T translates primarily to the communicated data-volume V as T = V/S. Moreover, existing work on quantization exploits this assumption and employs arbitrary encoding techniques without carefully considering their complexity and the resulting data-volume. To this end, we try to fill this gap and make the following contributions:

(i) Based on the classic Huffman coding, we propose three encoding techniques—(a) run-length Huffman (RLH) encoding, (b) sample Huffman (SH) encoding, and (c) sample Huffman with sparsity (SHS)—to encode the quantized gradients generated from codebook quantization. For each of these encoders, we calculate their entropy, average code-length, and computational complexity, and compare against the state-of-the-art encoding techniques used in gradient compression—Elias and run-length encoding (RLE).

(ii) We analyze the performance of our proposed encoders on a wide spectrum of quantization techniques [2, 21, 49], on a variety of DNN models (ResNet-20, VGG-16, ResNet-50, GoogLeNet, LSTM) performing different tasks, across diverse datasets (CIFAR-10, ImageNet, Penn Tree Bank (PTB)), and report our findings. The results show that RLH, SHS and SH can reduce the data-volume by up to 5.1×, 4.32× and 3.8× over the Elias-based encoders, respectively. In addition, the sampling-based techniques (i.e., SH and SHS) achieve encoding times up to 2.5× faster than the Elias-based encoders.

To the best of our knowledge, this is the first work that theoretically and empirically dissects the efficiency of different encoders with respect to the communicated data-volume and complexity. Our proposed encoders can also be used to encode the composition of sparsification and quantization, as in Qsparse-local-SGD [4]. Since compression is worker-independent and involves no inter-worker communication, our encoding methods also apply to compression in asynchronous settings [34].

Notations. We denote the i-th component of a vector x by x[i]. By x_t^i, we denote a vector arising from the t-th iteration at the i-th node. A string of p consecutive zeros is denoted by 0^p.

¹ As long as the encoding is lossless, the convergence of a distributed optimizer (e.g., SGD and its variants) with a quantization strategy Q(·) remains unaffected.

Algorithm 1: Abstraction for distributed quantized SGD, from the perspective of the i-th worker.
    Input: local data D_i, model parameter x_t, learning rate η_t > 0
    Output: trained model x ∈ R^d
    1: for t = 1, 2, ... do
    2:     On Worker: compute a local stochastic gradient g_t^i at the t-th iteration;
    3:         quantize g_t^i to Q(g_t^i);
    4:         encode Q(g_t^i) to C(Q(g_t^i));
    5:     On Master: decode C(Q(g_t^i)) to Q(g_t^i);
    6:         do the all-to-all reduction g̃_t = (1/n) Σ_{i=1}^{n} Q(g_t^i);
    7:         encode g̃_t to C(g̃_t) and send it back to the workers;
    8:     On Worker: decode C(g̃_t);
    9:         update the model parameters locally via x_{t+1} = x_t − η_t g̃_t;
    10: end for
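To make the control flow of Algorithm 1 concrete, the following minimal Python sketch mirrors the worker-side steps. It is an illustration only: grad_fn, quantize, encode, decode, and exchange are hypothetical placeholders for the local gradient computation, the codebook quantizer Q(·), the lossless encoder/decoder C(·), and the master-side aggregation round-trip; this is not the paper's implementation.

```python
def worker_step(x_t, lr, grad_fn, quantize, encode, decode, exchange):
    """One iteration of Algorithm 1 seen from the i-th worker.

    All callables are placeholders supplied by the caller; any concrete
    quantizer/encoder pair from Sections 2-3 can be plugged in.
    """
    g_local = grad_fn(x_t)               # line 2: local stochastic gradient g_t^i
    payload = encode(quantize(g_local))  # lines 3-4: C(Q(g_t^i))
    # Lines 5-7 run on the master: decode every worker's payload, average the
    # quantized gradients, re-encode the aggregate, and send it back.
    g_avg = decode(exchange(payload))    # line 8: decode C(g~_t)
    return [x - lr * g for x, g in zip(x_t, g_avg)]  # line 9: x_{t+1} = x_t - eta_t * g~_t

# Toy usage with identity "compression", just to exercise the control flow.
print(worker_step([0.0, 0.0], lr=0.1,
                  grad_fn=lambda x: [2.0, -1.0],
                  quantize=lambda g: g, encode=lambda q: q,
                  decode=lambda p: p, exchange=lambda p: p))   # [-0.2, 0.1]
```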

2 BACKGROUND
Distributed training. In distributed training, during the backward pass, each node i produces a stochastic gradient g_t^i, which is aggregated from the n workers via all-reduce, all-gather, or broadcast communication in Horovod [40], or via a push-pull parameter-server architecture like BytePS [34]. The aggregated gradient (1/n) Σ_{i=1}^{n} g_t^i is used for updating the model parameters in each iteration. To reduce the communicated data-volume, many proposals use sparsification techniques, where only a subset of the gradient components is communicated [1, 30, 42]. Although this reduces the data volume, in most cases it requires modifications and imposes constraints on the communication techniques (e.g., some sparsifiers only work with all-gather rather than the faster all-reduce collective). In contrast, quantization reduces the bit-width needed for representing the gradient values [1, 30, 42], and to cultivate its benefits an efficient encoding is required. That is, the workers produce a quantized stochastic gradient Q(g_t^i), and an encoding technique C(·) is required to produce C(Q(g_t^i)) for efficient communication. The next step, which depends on the APIs of the deep-learning toolkits such as TensorFlow or PyTorch, is to produce a globally aggregated quantized gradient g̃_t = (1/n) Σ_{i=1}^{n} Q(g_t^i) by collecting C(Q(g_t^i)) from each node and decoding them to Q(g_t^i). Finally, g̃_t is encoded again to C(g̃_t) and sent back to each worker node. Each node decodes C(g̃_t) back to g̃_t and updates the model parameters locally via the optimizer used in DNN training. This process continues until convergence. We formalize this abstraction (from the perspective of the i-th worker) with SGD as the optimizer in Algorithm 1.

Codebook quantization. Codebook quantization is one of the most popular gradient compression techniques. In some cases, it uses only binary bits, e.g., the sign compression in [6]. However, in more sophisticated cases, the gradient components are projected onto a vector of fixed length (set by the user), such as in random dithering. The user can change the codebook length by varying the quantization states s, and the inclusion probabilities of each component will vary. For a bit-width b and a scaling factor δ > 0, Yu et al. [51] quantize g[i] ∈ [ε, ε + δ] with ε ∈ dom(δ, b) := {−2^{b−1}δ, ..., −δ, 0, δ, ..., (2^{b−1} − 1)δ} as:

    g̃[i] = ε        with probability p_i = (ε + δ − g[i]) / δ,
           ε + δ    with probability 1 − p_i.                      (1)

Different forms of (1) are adopted in [2, 21, 49] by changing the quantization states s, which is the cardinality of the set dom(δ, b). For further details on state-of-the-art codebook quantization techniques, we refer to the survey by Xu et al. [50].

2.1 Encoding technique
For communicating the quantized gradients resulting from different codebook quantizations, practitioners use different encodings. In the following, we define commonly used technical terms, the entropy and the average (expected) code-length, that characterize the efficiency of an encoding technique.
Entropy and code-length. For k characters with probability distribution P = {p_1, p_2, ..., p_k}, the entropy is H(P) = −Σ_{i=1}^{k} p_i log(p_i). The average code-length for these k characters with corresponding codeword lengths {l_1, l_2, ..., l_k} and probability distribution P is given by Σ_{i=1}^{k} p_i l_i. The purpose of an encoding technique is to minimize the average code-length. A uniquely decodable code for a given character set is optimal if there exists no other uniquely decodable code with a smaller average code-length.
Run-length (RL) and Elias (ELI) encoding. Run-length encoding is a classic lossless encoding technique in which data values (or characters) are encoded as tuples containing the particular data value and its frequency of consecutive occurrences. Elias is a universal coding technique with predetermined codewords: Elias(0) and Elias(1) are defined as 1 and 0, respectively; otherwise, the binary code of Elias(x) is a 0, prepended by the binary representation of x, prepended by Elias(number of bits taken by x in binary − 1).
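For reference, the sketch below implements run-length encoding exactly as defined above, together with the Elias gamma code, the simplest member of the Elias family of universal integer codes (the recursive definition above applies the same length-prefix idea recursively). It is an illustrative sketch, not the encoder used in the paper's experiments.

```python
def elias_gamma(x: int) -> str:
    """Elias gamma code for a positive integer x.

    Emit (bit-length of x minus 1) zeros, then x in binary; the leading zeros
    tell the decoder how many payload bits to read, so the code is prefix-free
    and needs no codebook, which is why Elias-style codes are a common default
    for quantized gradients (e.g., in QSGD [2]).
    """
    if x < 1:
        raise ValueError("Elias gamma is defined for positive integers")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def run_length_encode(symbols):
    """Classic RLE: emit (consecutive-count, value) tuples, as defined above."""
    out, prev, count = [], object(), 0
    for s in symbols:
        if s == prev:
            count += 1
        else:
            if count:
                out.append((count, prev))
            prev, count = s, 1
    if count:
        out.append((count, prev))
    return out


print([elias_gamma(x) for x in (1, 2, 3, 4, 9)])  # ['1', '010', '011', '00100', '0001001']
print(run_length_encode("0001002000100"))         # [(3,'0'), (1,'1'), (2,'0'), (1,'2'), (3,'0'), (1,'1'), (2,'0')]
```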

[Figure 1: Huffman coding of the string 0001002000100 and of its RLE (3,0)(1,1)(2,0)(1,2)(3,0)(1,1)(2,0). (a) Huffman encoding (16 bits), (b) RLH (14 bits).]

2.2 Huffman coding
Huffman coding [22] is a lossless encoding which gives prefix codes with optimal average code-length; it uses a greedy algorithm to construct these Huffman codes and stores them as a tree. See Figure 1 (a) for an example of Huffman coding of the string "0001002000100". We refer to [9] for a detailed algorithm. The following lemma characterizes the properties of Huffman coding.

Lemma 1 ([10]). We have H(P) ≤ CL(P) ≤ H(P) + 1, where H(P) is the entropy and CL(P) is the average code-length of Huffman codes for a frequency distribution P on characters.

Decoding. To decode the compressed bit stream, we need to have the Huffman tree, which can either be constructed or be sent. We traverse the tree bit by bit to decode. The decoding speed can be improved by traversing multiple bits (called blocks) at a time [32], by using look-up tables [18], and by efficiently using the memory hierarchies [3].
Runtime. Once we have k characters and their frequencies, the run-time for Huffman coding is O(k log(k)). This can be done in O(k) time if the frequencies are given in sorted order. However, in practice, we need to obtain the frequencies, for which we need to take a pass over the entire data.
We realized that codebook quantization comes with an interesting attribute—the quantized components reside in different quantization states. If one considers these quantization states as the characters, then the number of items in each state represents their frequencies, rendering an ideal scenario for using Huffman coding to communicate those characters. But the gradient dimension poses a major challenge even for small numbers of quantization states, as one needs to calculate the frequencies. Therefore, it is unrealistic to build a Huffman tree by following the classical Huffman coding.
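For reference, the classical greedy construction referred to above (and bounded by Lemma 1) can be sketched in a few lines; the encoders proposed in Section 3 reuse exactly this construction, only on much smaller character sets. This is a minimal sketch, not the paper's implementation.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Greedy Huffman construction over a {character: frequency} map.

    Repeatedly merges the two least-frequent subtrees (the O(k log k)
    procedure from the Runtime paragraph); by Lemma 1 the resulting average
    code-length is within 1 bit of the entropy H(P).
    """
    if len(freqs) == 1:                          # degenerate one-character alphabet
        return {ch: "0" for ch in freqs}
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, [left, right]))  # internal node
        tiebreak += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, list):               # internal node: branch with 0/1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

# The string from Figure 1(a): characters '0', '1', '2' with frequencies 10, 2, 1.
codes = huffman_codes(Counter("0001002000100"))
print(codes)
print(sum(len(codes[c]) for c in "0001002000100"), "bits")  # 16 bits, as in Figure 1(a)
```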
Huffman codes for compression. Schmidhuber et al. [39] used Huffman coding to compress text files from a neural predictive network. Han et al. [17] used a three-stage compression method consisting of pruning, quantization, and finally Huffman coding to encode the quantized weights (also see [8]). Ge et al. [16] proposed a hybrid model compression algorithm with Huffman coding to encode the sparse structure of the pruned weights. Huffman codes are optimal among all variable-length prefix codings. However, Elias [14] and Golomb encoding [15] can take advantage of some interesting properties, such as the repeated appearance of certain sequences, and achieve better average code-lengths. To the best of our knowledge, in gradient compression, only SKCompress [23] uses Huffman coding to encode sparse gradient indices. One of the reasons for the slow adoption of Huffman encoding is the cost of building the Huffman tree, which for modern DNNs can pose a humongous computation overhead. Therefore, one may ask: Can we achieve a better average code-length based on the classic Huffman encoding and by exploiting certain properties of the quantized gradients?

3 HUFFMAN BASED STRATEGIES
To answer the above question, we propose novel encoding techniques based on the classic Huffman coding.

3.1 Run-length Huffman (RLH) encoding
RLH encoding is a hybrid of two classic encoding techniques—RLE and Huffman coding—in which Huffman coding is used on the string obtained by RLE. E.g., to encode the string "0001002000100", we use RLE to get the following run-length document: (3,0), (1,1), (2,0), (1,2), (3,0), (1,1), (2,0); then, treating each tuple as a single character, we apply the Huffman algorithm on this character set with the corresponding frequencies and obtain the final encodings (see Figure 1 (b)). This encoding technique is also known as Modified Huffman coding, which is standardized by TIFF and commonly used in fax compression [25]. To the best of our knowledge, we are the first to propose a hybrid encoder for encoding compressed gradients in distributed DNN training.
Entropy and code-length. In RLH, the distribution of the characters is obtained by RLE. Therefore, the new distribution of characters after RLE, P_RLH, is a function of the original distribution P. As a result, the entropy of RLH satisfies H(P_RLH) ≤ H(P) [10]. This, together with Lemma 1, gives: CL(P_RLH) ≤ H(P_RLH) + 1 ≤ H(P) + 1 ≤ CL(P) + 1. In practice, the code-length is much lower than this bound. The average code-length of RLH is better than that of Elias. Golomb encoding is also a special case of RLH encoding for only two characters following Assumption 1.
Theoretically, all the encoding techniques take at least O(d) time for Q(g) ∈ R^d. However, RLH requires another iteration over Q(g) before assigning the codes. In the following, we propose a sampling technique to mitigate this additional overhead of RLH, at the cost of a slight increase in the average code-length, as shown in Tables 3 and 4. Similar to RLH, by using a run-length compression (RLC), Chen et al. [7] obtained a 1.2×–1.9× compression on the convolution layers of AlexNet [27].
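The run-length tokenization that RLH performs before the Huffman stage is straightforward; the sketch below (an illustration, not the paper's code) produces the (run-length, value) characters of Figure 1(b), whose frequencies are then fed to a standard Huffman construction such as the one sketched in Section 2.2.

```python
from collections import Counter
from itertools import groupby

def rle_characters(quantized):
    """Turn a sequence of quantization states into RLH 'characters'.

    Each character is a (run_length, value) tuple—exactly the tuples shown in
    Figure 1(b); Huffman coding is then applied over these tuples.
    """
    return [(len(list(group)), value) for value, group in groupby(quantized)]

runs = rle_characters([0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0])
print(runs)           # [(3, 0), (1, 1), (2, 0), (1, 2), (3, 0), (1, 1), (2, 0)]
print(Counter(runs))  # the frequency table consumed by the Huffman stage
```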

3.2 Sample Huffman (SH) encoding
As the gradient dimension d is large, iterating through all the quantized components and calculating their frequencies is not cost-effective. To remedy this, we propose a uniform sampling technique. Sampling techniques are quite popular in gradient compression—e.g., to perform an efficient Top-k selection, [30] used a sampling technique where r% of the gradient components are chosen first, and then a Top-k selection is made on that sampled set. We sample a set of indices S from [d], where each S_i is sampled uniformly from the discrete distribution on [1, d], i.e., S_i ∼ U[1, d]. We pick the gradient components corresponding to those indices S_i, i ∈ [|S|], from the quantized gradient vector. The frequencies of the quantized gradient components are taken to be their frequencies in those |S| samples. We construct the Huffman codes using these frequencies. For the implementation, we add 1 to the frequency of all the quantization states to prevent any of them from being zero. We call this sample Huffman (SH) encoding.
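A minimal sketch of the sampling step is shown below. The function name, the `states` argument, and the default sample size (10^4, the value used in the experiments of Section 4) are illustrative choices, not the paper's implementation.

```python
import random
from collections import Counter

def sampled_frequencies(quantized_grad, states, sample_size=10_000, seed=0):
    """Estimate per-state frequencies from a uniform sample of indices.

    Instead of a full pass over the d-dimensional quantized gradient, draw
    `sample_size` indices uniformly from [0, d) and count the quantization
    states observed there. Every state starts at 1 so that no frequency is
    zero, as described above; the counts then feed the Huffman construction.
    """
    d = len(quantized_grad)
    counts = Counter({state: 1 for state in states})   # the +1 smoothing of SH
    rng = random.Random(seed)
    for _ in range(sample_size):
        counts[quantized_grad[rng.randrange(d)]] += 1
    return counts

# Hypothetical ternary-quantized gradient with states {-1, 0, +1}.
grad = [0] * 9_000 + [1] * 600 + [-1] * 400
print(sampled_frequencies(grad, states=(-1, 0, 1), sample_size=1_000))
```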

Entropy and code-length. Let the probability distribution of the original quantized gradient be P = {p_1, p_2, ..., p_s} and, within the sample S of size |S|, be P′ = {p′_1, p′_2, ..., p′_s}. Then the entropy and average code-length of SH are H(P′) = −Σ_{i=1}^{s} p′_i log(p′_i) and CL(P′) = Σ_{i=1}^{s} p′_i l_i, respectively, where {l_1, l_2, ..., l_s} are the codeword lengths of the characters with distribution P′. As we use Huffman coding on P′, by Lemma 1 we have CL(P′) ≤ H(P′) + 1. As |S| → d and p′_i → p_i, we have H(P′) → H(P) and, therefore, CL(P′) → CL(P), the average code-length of Huffman.
However, for sparse vectors, Huffman coding is not effective compared to RLE, as it encodes each character. We need to use the sparsity to design an efficient encoding technique.

3.3 Sample Huffman with Sparsity (SHS)
We define the sparsity fraction, p, of a vector as the fraction of its components that are 0. Our next assumption is on the independence of the quantized gradient components.

Assumption 1. Quantized gradient components are mutually independent.

The above is a mild assumption. That is, if we know a particular quantized gradient component to be 0, then the probability that the next quantized gradient component is 0 remains p.
With this assumption, we propose a new encoding technique that gives efficient codes for sparse vectors. Let Q(g) be a quantized gradient with sparsity fraction p and s distinct quantization states. The sparsity fraction is estimated by using the sampling technique described in Section 3.2. The probability of encountering n′ zeros followed by a non-zero quantization state is p^{n′}(1 − p), by Assumption 1. For a given length cap γ, we treat each of these sequences of n′ zeros as a character A_{n′} for 1 ≤ n′ < γ. All the sequences of zeros with length greater than or equal to γ have probability of occurrence p^γ and can be represented by using multiple A_i's. By using these (s + γ − 1) characters and their frequencies, we construct the Huffman tree and define the encoding as sample Huffman with sparsity (SHS). Note that a length cap was also used for run-length limited encoding in [36]. Note also that our work is valid for any number of quantization states, giving the optimal compression, and the compression strategy in [37] is a special case of our SHS.
An example. Consider 4 quantization states {0, 1, 2, 3} with frequencies {0.7, 0.05, 0.05, 0.2} and γ = 5. The probability of the string 0^5 is 0.7^5, and the zero strings of length i (1 ≤ i ≤ 4) have probability 0.3(0.7)^i. The characters {0, 1, 2, 3, 00, 000, 0000, 00000} are given probabilities {0.21, 0.05, 0.05, 0.2, 0.147, 0.1029, 0.072, 0.168}, and the encodings will be {01, 11110, 11111, 00, 101, 100, 1110, 110}.
Entropy and optimal code. For a set of s + 1 characters with probability distribution P = {p, p_1, p_2, ..., p_s}, where p is the probability of the occurrence of the character 0, the probability distribution of the s + γ characters of SHS with length cap γ is P_γ := {p_1, p_2, ..., p_s} ∪ {p^i(1 − p) | i ∈ [1, γ − 1]} ∪ {p^γ}. Therefore, the entropy is
    H(P_γ) = −p^γ log(p^γ) − Σ_{i=1}^{s} p_i log(p_i) − Σ_{i=1}^{γ−1} (1 − p) p^i log(p^i (1 − p)).
As γ → ∞, we have
    H(P_∞) = −Σ_{i=1}^{s} p_i log(p_i) − p log(1 − p) − p log(p) / (1 − p).
The optimal codes for SHS are achieved when γ → ∞. By using the above definitions we get the following lemma.

Lemma 2. For SHS with length cap γ > 0, we have H(P_γ) = H(P_∞) + (p^{γ+1} log p)/(1 − p) + p^γ log(1 − p).

Our next lemma guarantees that the average code-length of SHS obtained by introducing a length cap is within ∆ + 1 bits of CL(P_∞); its proof follows from Lemmas 1 and 2.

Lemma 3. For SHS with length cap γ > 0, we have CL(P_γ) ≤ H(P_γ) + 1 = H(P_∞) + ∆ + 1 ≤ CL(P_∞) + ∆ + 1, where CL(P_γ) and H(P_γ) denote the average Huffman code-length and the entropy of the character set formed with length cap γ, respectively, and ∆ = p^γ log(1 − p) + (p^{γ+1} log p)/(1 − p).
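To make the SHS character set concrete, the sketch below (an illustration under Assumption 1, not the paper's code) tokenizes a quantized vector into the characters described above: zero runs shorter than the length cap γ, blocks of exactly γ zeros for longer runs, and the non-zero states themselves. The resulting frequencies are what the SHS Huffman tree is built from.

```python
from collections import Counter

def shs_characters(quantized, gamma):
    """Tokenize a quantized gradient into SHS characters with length cap gamma."""
    chars, run = [], 0
    def flush(run):
        while run >= gamma:          # runs of length >= gamma: emit 0^gamma blocks
            chars.append("0" * gamma)
            run -= gamma
        if run:                      # remaining short zero run, if any
            chars.append("0" * run)
    for v in quantized:
        if v == 0:
            run += 1
        else:
            flush(run)
            run = 0
            chars.append(str(v))     # non-zero states are characters of their own
    flush(run)
    return chars

# The four states {0, 1, 2, 3} of the example above, with gamma = 5.
toks = shs_characters([0]*7 + [3, 0, 0, 1] + [0]*3 + [2], gamma=5)
print(toks)            # ['00000', '00', '3', '00', '1', '000', '2']
print(Counter(toks))   # frequencies used to build the SHS Huffman tree
```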
3.4 Time complexity analysis
Suppose Q(g) ∈ R^d and has s quantization states. Although all techniques take O(d) time for a single pass over the gradient, the time to assign the code-words is different for each of them. For Huffman coding, assigning code-words takes O(s log(s)) time, and hence the total time is O(d + s log(s)). Similarly, when there are k distinct characters formed after RLE, RLH takes O(d + k log(k)) time. Note that k varies between s and √d. In SH without sparsity, we sample d′ gradient components out of the d gradient components and then determine the frequencies of the s quantization states by using the sampled d′ components. As a result, it takes O(d + s log(s)) time. For SHS, with a given length cap γ, the time complexity is O(d + (s + γ) log(s + γ)). In practice, d ≫ s, k, γ, so the time complexities remain O(d).

Table 1: Sample of the encoding time (s) of the quantized gradient (RD quantizer)
Model      Huffman  Elias  RLE    SH     SHS    RLH
ResNet-20  0.20     0.26   0.26   0.11   0.13   0.85
VGG-16     13.95    18.95  18.83  6.60   7.07   26.02
GoogLeNet  6.84     9.03   9.68   4.10   4.67   10.83
ResNet-50  20.80    18.46  18.24  9.04   10.40  32.26
PTB-LSTM   85.55    53.64  80.68  22.17  32.40  151.94

We present a sample of the encoding times of the various encoders applied to the quantized gradient vectors produced by the RD quantizer in Table 1.² The results show that the sampling-based approaches result in the fastest encoding times and RLH is the slowest encoder. However, as shown later, RLH yields the best compression ratio, and the proposed Huffman encoders are, in many cases, the second best. Therefore, a trade-off exists between the proposed Huffman encoders that compress more and the ones with lower time complexity. Specifically, the speeds of the SH and SHS encoders are comparable, but they can be up to 7× faster than RLH. In contrast, RLH, which is significantly slower, can achieve compression ratios that are higher by up to 3× on average compared to SH and SHS.
Although all the encoding techniques take O(d) time, Elias encoding needs only one pass over the data, while classic and run-length Huffman need two passes over the data, that is, 2d time. In the case of sampling, let |S| = d′ ≪ d. If d′ gradient components are selected, the total time taken is d + d′ (which is less than Huffman's 2d time), and this behaves as well as Huffman coding. SHS also performs as well as RLH in d + d′ time, when the sparsity assumption holds. All the proposed encoding techniques are lossless, as they apply lossless Huffman coding on different character sets [22].

² Encoding times may vary on different systems. The results are based on a non-optimized implementation run on a personal computer equipped with a low-end CPU.

Parallel implementation. The encoders in this study are parallelizable [28, 38]. Given n workers, we can partition the gradient into n subvectors for RLH and allow each worker to find the frequencies of its corresponding subvector. We handle the run-length sequences that are split across two different subvectors by decreasing the frequency of the length of each partial sequence and increasing the frequency of their combined length, both by one. That is, we can perform a merger at the end of each part in case a run-length sequence is partitioned. Let l_i^e be the length of the run-length sequence of the character c_i^e at the end of the i-th partition for all i ∈ [0, n − 1]. Further, let l_{i+1}^b be the length of the run-length sequence of the character c_{i+1}^b at the beginning of the (i + 1)-th partition for all i ∈ [1, n]. Then, if and only if c_i^e = c_{i+1}^b, we decrease the frequencies of l_i^e and l_{i+1}^b by 1 and increase the frequency of the sequence of length l_i^e + l_{i+1}^b by 1 for the character c_i^e. The run time to find the frequencies is then O(d/n). For SH and SHS, sampling can be done in n partitions taking O(d′/n) time. The Huffman tree for s characters can also be constructed in log(s) time in parallel [28]. With this, the tree construction takes O(log k) for RLH, O(log s) for SH, and O(log(s + γ)) for SHS. The parallel versions of these algorithms can leverage highly parallel accelerators such as GPUs and FPGAs [29, 33, 44].
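A sequential sketch of this boundary merger is shown below (in an actual deployment each chunk would be processed by a different worker or thread, and this simplified illustration ignores runs that span more than two partitions). It is a sketch under the description above, not the paper's parallel implementation.

```python
from collections import Counter
from itertools import groupby

def partitioned_run_frequencies(quantized, n):
    """Per-partition RLE frequencies plus the boundary merger described above."""
    d = len(quantized)
    bounds = [d * i // n for i in range(n + 1)]
    chunks = [quantized[bounds[i]:bounds[i + 1]] for i in range(n)]
    # Each worker would compute its own chunk's (run_length, value) characters.
    runs_per_chunk = [[(len(list(g)), v) for v, g in groupby(c)] for c in chunks]
    freq = Counter(r for runs in runs_per_chunk for r in runs)
    for left, right in zip(runs_per_chunk, runs_per_chunk[1:]):
        if not left or not right:
            continue
        (l_len, l_val), (r_len, r_val) = left[-1], right[0]
        if l_val == r_val:                      # a run was split at this boundary
            freq[(l_len, l_val)] -= 1
            freq[(r_len, r_val)] -= 1
            freq[(l_len + r_len, l_val)] += 1
    return freq

# Matches the frequencies of a single-pass RLE over the whole vector.
print(partitioned_run_frequencies([0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0], n=2))
```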
4 EVALUATION
In this section, we evaluate the various encoding schemes covered by our study. The results show that RLH excels over all other encoding techniques in most of the benchmarks. In particular, RLH, SHS and SH can improve the compressed volume by up to 5.1×, 4.32× and 3.8× compared to the commonly used Elias encoding, respectively.
Compression techniques. In this study, we use the following compression techniques: random dithering (RD) [2], ternary quantization (TQ) [49], and natural compression (CNat) [21].
Benchmarks. We implement distributed training using PyTorch [35] as the ML framework and Horovod [40] as the communication library. The compression techniques are implemented using high-level PyTorch APIs. The benchmarks, models, and datasets used in this work are listed in Table 2.

Table 2: Summary of the benchmarks used
Task                  Architecture  Model           Dataset              No. of Parameters
Image Classification  CNN           ResNet-20 [24]  CIFAR-10 [26]        269,722
Image Classification  CNN           VGG-16 [41]     CIFAR-10 [26]        14,728,266
Image Classification  CNN           GoogLeNet [46]  ImageNet [12]        6,610,344
Image Classification  CNN           ResNet-50 [24]  ImageNet [12]        25,557,032
Language Modelling    RNN           LSTM [20]       Penn Tree Bank [31]  66,034,000

Hyperparameters. We set the local mini-batch size to 160 images per batch for the CNN benchmarks on the CIFAR-10 and ImageNet datasets and to 20 tokens per batch for the RNN benchmark on the PTB dataset. We set the learning rate to 0.5 for CIFAR-10, 0.05 for ImageNet, and 22.0 for the PTB benchmarks. The learning rate is warmed up and grows exponentially during the first 5 epochs until it reaches the chosen learning rate. We run the CIFAR-10 experiments for 70 epochs and the PTB experiments for 30 epochs. Due to the length of the ImageNet experiments, we only run them for 15 epochs. Finally, we use a sample size (d′) of 10^4 in all our experiments.³ TQ is the only method that requires careful consideration. First, it does not converge without Error-Feedback (see [42, 43]). Secondly, the learning rate for TQ has to be carefully tuned for each benchmark. We perform trial-and-error tuning for a learning rate that allows some level of convergence; the resulting learning rate for TQ is 0.005 for CIFAR-10, 0.01 for ImageNet, and 22 for the PTB dataset.
Evaluation metrics. We evaluate all encoding techniques using two metrics—(i) the quality of the trained model and (ii) the communicated volume needed to achieve such quality. We report the actual code-length in bits of the communicated gradient (along with the decoding information, such as the code-book) in each iteration for the different quantization techniques. We also report the training accuracy and the compression ratio, which is the code-length of the non-compressed gradient normalized by the code-length of the encoded compressed gradient. Considering 32-bit as the default floating-point precision in modern DNN models, we define the compression ratio as:
    Compression ratio := (Bits to represent the 32-bit float gradient) / (Bits to represent the encoded gradient).
We present the best and the second best scores in red and blue colors, respectively. Note that the code-book length (i.e., the decoding information) in most cases is less than 0.1% of the total volume.

Table 3: Test perplexity and compression ratios for the PTB benchmark (baseline perplexity: 103.398)
                                      Compression Ratio (×)
Quantizer  Quantized Perplexity (↓)   Huffman  Elias  RLE   SH  SHS  RLH
RD         125.955                    31       348    80    31  203  491
TQ         970.335                    31       10814  3746  31  855  16432
CNat       103.2                      24       6      3     24  27   32

Main observations. The baseline is the no-compression run using 32-bit floating-point precision. Since the quantizations are lossy, the accuracy of training with quantization is lower than that of the baseline (except for TQ in the VGG-16 benchmark). In terms of accuracy, CNat and RD with s = 256 are typically able to achieve model qualities close to the baseline. On the contrary, TQ, even with Error-Feedback [42], struggles to achieve acceptable model qualities. This is because the number of quantization states used to represent the gradient components in CNat and RD (s = 256) is significantly larger than in TQ, which quantizes only the sign of the gradient components. However, the higher quality implies less compression: CNat and RD (s = 256) have the smallest compression ratios (large volume), while TQ, with 3 states to encode, has the largest compression ratios (small volume).
RNN-LSTM results. In Figure 2a, the code-lengths of most encoders are comparable for CNat, except for RLE and Elias, which result in larger code-lengths. For both RD and TQ, the encoders achieve different code-lengths, and RLH is the smallest among them. Table 3 shows that only RD and CNat achieve test perplexity close to the no-compression baseline, while TQ is not able to converge to the optimal point and has a significant gap with the baseline. In terms of volume, the encoded compressed gradient of TQ achieves the largest compression ratios. Moreover, RLH is able to reduce the volume over Elias by 81% (or 5.1×).
ResNet-50 results. Figure 2b shows the code-lengths of the various encoders. In this benchmark, RLH can significantly reduce the length of the code over the other encoders. SHS also performs better than Elias for CNat and RD. Table 4 shows that only CNat achieves accuracy close to the no-compression baseline, while RD and TQ lose about 10 and 25 accuracy points, respectively. This shows the sensitivity of this model, which is based on Residual Networks [19], to compression. In terms of volume, similar to the former results, the encoded compressed gradient of TQ achieves the best compression ratios, and RLH can reduce the encoded volume over RLE by 88% (or 8.3×) and over Elias by 74% (or 3.76×).

³ We also experiment with smaller sample sizes of 10^2 and 10^3 and find that the code-lengths of both SH and SHS are larger by up to 1% of the Huffman code-length. We leave a detailed sensitivity analysis of the sample size to future work.

[Figure 2: The actual data volume (bits) of the different methods for different benchmarks. (a) PTB – LSTM, baseline: 2113 Mbits; (b) ImageNet – ResNet50, baseline: 817 Mbits; (c) ImageNet – GoogleNet, baseline: 211 Mbits.]

[Figure 3: The actual data volume (bits) for the CIFAR-10 benchmark. (a) ResNet20, baseline: 8 Mbits; (b) VGG16, baseline: 471 Mbits.]

Table 4: Accuracy and compression ratios for the CNN models
                                                                 Compression Ratio (×)
Model      Baseline Accuracy  Quantizer  Quantized Accuracy (↑)  Huffman  Elias  RLE  SH  SHS  RLH
ResNet-50  53.317             RD         43.256                  31       184    41   31  156  253
                              TQ         25.34                   32       947    266  32  470  1353
                              CNat       53.00                   23       9      4    23  32   35
ResNet-20  94.286             RD         93.333                  21       17     2    21  19   21
                              TQ         86.786                  31       82     18   31  87   121
                              CNat       93.375                  10       4      1    10  9    10
VGG-16     96.75              RD         94.141                  30       68     14   30  71   98
                              TQ         97.375                  32       265    70   32  231  391
                              CNat       95.125                  16       7      2    16  15   16

GoogLeNet results. Figure 2c shows the code-lengths of the encoders. The code-length pattern is different from all the other benchmarks: the code-lengths of the Huffman and SH encoders are the same for all quantizers. Unlike the other benchmarks, the encoders are more space-efficient on the quantized vectors of CNat than on those of RD. Yet, TQ still achieves the least volume and RLH achieves the least code-length.
ResNet-20 results. In Figure 3a, we present the code-lengths from the various encoders. The code-lengths of most encoders are comparable for CNat and RD. For TQ, RLH can significantly reduce the length of the code over the other encoders. Table 4 shows that only RD and CNat achieve accuracy close to the no-compression baseline, while TQ loses about 7.5 accuracy points. However, in terms of volume, the encoded compressed gradient of TQ achieves the best compression ratios. Moreover, RLH can reduce the volume over Elias by up to 32% (or 1.5×).
VGG-16 results. Figure 3b shows a similar pattern for CNat and TQ but a slightly different pattern for RD. RLH achieves the smallest code-length among all the encoders. Also, SHS achieves better results than Elias. Table 4 shows that TQ achieves in this case the best model quality, excelling over the baseline, while RD and CNat achieve lower but close accuracy to the no-compression baseline, due to the dense nature of the VGG-16 DNN, which makes it resilient to compression. In terms of volume, the encoded compressed gradient of TQ still has the highest compression ratios, followed by RD and then CNat. Moreover, RLH can reduce the communicated volume over Elias by up to 55% (or 2.2×).

4.1 Limitations of encoding techniques
The main drawback of quantization techniques is the computational complexity of the associated encoding mechanisms, which prohibits their wide adoption. Our prototype, which uses high-level APIs of the ML framework, is not computationally efficient and is not optimized with respect to run-time. For this reason, this work only provides empirical guidance on the efficiency of different encoders with respect to the communicated volume. However, quantization and encoding can be generally recommended only if their implementations can leverage some degree of parallelism on modern hardware accelerators. Recently, commercial and research compression prototypes have been leveraging specialized modern hardware (e.g., FPGAs [29]).

5 CONCLUSION
In this work, we explore the efficiency of the encoding techniques necessary for communicating quantized gradient vectors during distributed DNN training. We highlight the limitations of different encoders commonly used by many recent quantizers and propose variants of Huffman coding to mitigate them. We analyze the complexities of the encoding techniques and empirically evaluate their performance on popular benchmarks. Our evaluation shows that, while the commonly used Elias encoder achieves acceptable compressed volumes, the proposed Huffman-based encoders (RLH, SHS and SH) can excel over Elias-based encoders and provide further reductions by up to 5.1×, 4.32× and 3.8×, respectively. As part of our future work, we aim to implement our schemes on modern accelerators and conduct an extensive empirical analysis.

REFERENCES
[1] A. F. Aji and K. Heafield. 2017. Sparse Communication for Distributed Gradient Descent. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 440–445.
[2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NeurIPS. 1709–1720.
[3] Shashwat Banchhor, Rishikesh Gajjala, Yogish Sabharwal, and Sandeep Sen. 2020. Decode efficient prefix codes. CoRR abs/2010.05005 (2020). https://arxiv.org/abs/2010.05005
[4] D. Basu, D. Data, C. Karakus, and S. Diggavi. 2019. Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations. In NeurIPS.
[5] R. Bekkerman, M. Bilenko, and J. Langford. 2011. Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press.
[6] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. 2018. SIGNSGD: Compressed Optimisation for Non-Convex Problems. In International Conference on Machine Learning (ICML). 559–568.
[7] Y. Chen, T. Krishna, J. S. Emer, and V. Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
[8] Y. Choi, M. El-Khamy, and J. Lee. 2020. Universal Deep Neural Network Compression. IEEE Journal of Selected Topics in Signal Processing (2020), 1–1.
[9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (third ed.). The MIT Press.
[10] Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory (2nd ed.). Wiley.
[11] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. 2012. Large Scale Distributed Deep Networks. In NeurIPS. 1223–1231.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
[13] A. Dutta, E. H. Bergou, A. M. Abdelmoniem, C.-Y. Ho, A. N. Sahu, M. Canini, and P. Kalnis. 2020. On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). 3817–3824.
[14] P. Elias. 1975. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21, 2 (1975), 194–203.
[15] R. Gallager and D. van Voorhis. 1975. Optimal source codes for geometrically distributed integer alphabets (Corresp.). IEEE Transactions on Information Theory 21, 2 (March 1975), 228–230. https://doi.org/10.1109/TIT.1975.1055357
[16] Shiming Ge, Zhao Luo, Shengwei Zhao, Xin Jin, and Xiao-Yu Zhang. 2017. Compressing deep neural networks for efficient visual inference. In IEEE International Conference on Multimedia and Expo (ICME). IEEE, 667–672.
[17] Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[18] R. Hashemian. 1995. Memory efficient and high-speed search Huffman coding. IEEE Transactions on Communications 43, 10 (1995), 2576–2581.
[19] K. He, X. Zhang, S. Ren, and J. Sun. 2015. Deep Residual Learning for Image Recognition. In CVPR.
[20] S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997).
[21] Samuel Horváth, Chen-Yu Ho, Ludovit Horvath, Atal Narayan Sahu, Marco Canini, and Peter Richtarik. 2019. Natural Compression for Distributed Deep Learning. arXiv preprint arXiv:1905.10988 (2019).
[22] D. A. Huffman. 1952. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE 40, 9 (1952), 1098–1101.
[23] Jiawei Jiang, Fangcheng Fu, Tong Yang, Yingxia Shao, and Bin Cui. 2020. SKCompress: compressing sparse and nonuniform gradient in distributed machine learning. The VLDB Journal (2020), 1–28.
[24] H. Kaiming, Z. Xiangyu, R. Shaoqing, and S. Jian. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[25] Michael Kohn. 2005. Huffman/CCITT Compression In TIFF. (2005). https://www.mikekohn.net/file_formats/tiff.php
[26] A. Krizhevsky and G. Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report, University of Toronto 1, 4 (2009).
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NeurIPS. 1097–1105.
[28] Lawrence L. Larmore and Teresa M. Przytycka. 1995. Constructing Huffman Trees in Parallel. SIAM J. Comput. 24, 6 (1995), 1163–1169.
[29] Y. Li, J. Park, M. Alian, Y. Yuan, Z. Qu, P. Pan, R. Wang, A. Schwing, H. Esmaeilzadeh, and N. S. Kim. 2018. A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks. In IEEE/ACM International Symposium on Microarchitecture (MICRO). 175–188.
[30] Y. Lin, S. Han, H. Mao, Y. Wang, and W. Dally. 2018. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In International Conference on Learning Representations (ICLR).
[31] M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor. 1999. Treebank-3. (1999). https://catalog.ldc.upenn.edu/LDC99T42
[32] A. Moffat and A. Turpin. 1997. On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications 45, 10 (1997), 1200–1207.
[33] R. A. Patel, Y. Zhang, J. Mak, A. Davidson, and J. D. Owens. 2012. Parallel lossless data compression on the GPU. In Innovative Parallel Computing (InPar). 1–9.
[34] Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A Generic Communication Scheduler for Distributed DNN Training Acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP).
[35] Pytorch.org. 2019. PyTorch. (2019). https://pytorch.org/
[36] A. H. Robinson and C. Cherry. 1967. Results of a prototype television bandwidth compression scheme. Proc. IEEE 55, 3 (1967), 356–364.
[37] F. Sattler, Simon Wiedemann, K.-R. Müller, and W. Samek. 2019. Sparse Binary Compression: Towards Distributed Deep Learning with Minimal Communication. In International Joint Conference on Neural Networks (IJCNN). 1–8.
[38] Benjamin Schlegel, Rainer Gemulla, and Wolfgang Lehner. 2010. Fast Integer Compression Using SIMD Instructions. In Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN).
[39] Jürgen Schmidhuber and Stefan Heil. 1995. Predictive coding with neural nets: Application to text compression. In NeurIPS. 1047–1054.
[40] Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
[41] K. Simonyan and A. Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR).
[42] S. U. Stich, J. B. Cordonnier, and M. Jaggi. 2018. Sparsified SGD with memory. In NeurIPS. 4447–4458.
[43] N. Strom. 2015. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH. 1488–1492.
[44] B. Sukhwani, B. Abali, B. Brezzo, and S. Asaad. 2011. High-Throughput, Lossless Data Compression on FPGAs. In IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. 113–116.
[45] H. Sun, Y. Shao, J. Jiang, B. Cui, K. Lei, Y. Xu, and J. Wang. 2019. Sparse Gradient Compression for Distributed SGD. In Database Systems for Advanced Applications. 139–155.
[46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going Deeper with Convolutions. In Computer Vision and Pattern Recognition (CVPR). 1–9.
[47] T. Vogels, S. P. Karimireddy, and M. Jaggi. 2019. PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization. In NeurIPS.
[48] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. 2018. ATOMO: Communication-efficient Learning via Atomic Sparsification. In NeurIPS. 9850–9861.
[49] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. 2017. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In NeurIPS. 1508–1518.
[50] H. Xu, C.-Y. Ho, A. M. Abdelmoniem, A. Dutta, E. H. Bergou, K. Karatsenidis, M. Canini, and P. Kalnis. 2020. Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation. Technical Report. KAUST. http://hdl.handle.net/10754/631179
[51] Yue Yu, Jiaxiang Wu, and Junzhou Huang. 2019. Exploring Fast and Communication-Efficient Algorithms in Large-Scale Distributed Networks. In AISTATS.