Bandwidth-Efficient Deep Learning: from Compression to Acceleration
Song Han, Assistant Professor, EECS, Massachusetts Institute of Technology
1 AI is Changing Our Lives
Self-Driving Cars, Machine Translation, AlphaGo, Smart Robots
2 Models are Getting Larger
Important Property of Neural Networks: results get better with
more data + bigger models + more computation
(Better algorithms, new insights and improved techniques always help, too!)

IMAGE RECOGNITION (16X model):
2012, AlexNet: 8 layers, 1.4 GFLOP, ~16% error
2015, ResNet: 152 layers, 22.6 GFLOP, ~3.5% error

SPEECH RECOGNITION (10X training ops):
2014, Deep Speech 1: 80 GFLOP, 7,000 hrs of data, ~8% error
2015, Deep Speech 2: 465 GFLOP, 12,000 hrs of data, ~5% error
Dally, NIPS’2016 workshop on Efficient Methods for Deep Neural Networks
3 The First Challenge: Model Size
Hard to distribute large models through over-the-air updates
4 The Second Challenge: Speed
Model       Error rate   Training time
ResNet18    10.76%       2.5 days
ResNet50     7.02%       5 days
ResNet101    6.21%       1 week
ResNet152    6.16%       1.5 weeks
Such long training times limit ML researchers' productivity
Training time benchmarked with fb.resnet.torch using four M40 GPUs
5 The Third Challenge: Energy Efficiency
AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game
on mobile: drains battery; in the data center: increases TCO
6 Where is the Energy Consumed?
larger model => more memory reference => more energy
7 Where is the Energy Consumed?
larger model => more memory reference => more energy
Operation              Energy [pJ]   Relative Energy Cost
32 bit int ADD             0.1            1
32 bit float ADD           0.9            9
32 bit Register File       1              10
32 bit int MULT            3.1            31
32 bit float MULT          3.7            37
32 bit SRAM Cache          5              50
32 bit DRAM Memory       640            6400

Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude (1000x) more energy expensive than simple arithmetic.
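A rough back-of-the-envelope sketch of what the table implies, using the per-operation energies from Figure 1; the model sizes and the assumption that every weight is fetched exactly once are illustrative, not measurements:

```python
# Back-of-the-envelope energy estimate using the per-operation costs in
# Figure 1 (45nm CMOS). Model sizes below are illustrative, not measured.
PJ = 1e-12  # one picojoule, in joules

ENERGY_PJ = {
    "int_add": 0.1, "float_add": 0.9, "register": 1.0,
    "int_mult": 3.1, "float_mult": 3.7, "sram": 5.0, "dram": 640.0,
}

def weight_fetch_energy(num_weights, source):
    """Energy (joules) to read every 32-bit weight once from the given memory."""
    return num_weights * ENERGY_PJ[source] * PJ

dense_weights  = 60_000_000   # ~60M parameters (AlexNet-scale), too big for on-chip SRAM
pruned_weights =  6_000_000   # ~10x pruned, small enough to sit in SRAM

print("DRAM fetch, dense model:  %.4f J" % weight_fetch_energy(dense_weights, "dram"))
print("SRAM fetch, pruned model: %.6f J" % weight_fetch_energy(pruned_weights, "sram"))
```

The point of the sketch is the ratio: moving a compressed model from DRAM into SRAM saves orders of magnitude of energy per weight access.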
To achieve this goal, we present a method to prune network connections in a manner that preserves the original accuracy. After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks — learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights - much as in the mammalian brain [8][9], where synapses are created in the first few months of a child's development, followed by gradual pruning of little-used connections, falling to typical adult values.
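A minimal sketch of the magnitude-based pruning step described above; the threshold, layer size, and function name are illustrative choices, not the paper's reference code:

```python
import numpy as np

def prune_by_threshold(weights, threshold):
    """Zero out connections whose magnitude is below the threshold and return
    both the sparse weights and the binary mask (the learned topology)."""
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

# Toy fully-connected layer: small-magnitude weights are removed; the mask is
# kept so that retraining only updates the surviving connections.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(256, 256))
W_sparse, mask = prune_by_threshold(W, threshold=0.1)
print("sparsity: %.1f%%" % (100.0 * (mask == 0).mean()))
```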
2 Related Work
Neural networks are typically over-parameterized, and there is significant redundancy for deep learning models [10]. This results in a waste of both computation and memory. There have been various proposals to remove the redundancy: Vanhoucke et al. [11] explored a fixed-point implementation with 8-bit integer (vs 32-bit floating point) activations. Denton et al. [12] exploited the linear structure of the neural network by finding an appropriate low-rank approximation of the parameters and keeping the accuracy within 1% of the original model. With similar accuracy loss, Gong et al. [13] compressed deep convnets using vector quantization. These approximation and quantization techniques are orthogonal to network pruning, and they can be used together to obtain further gains [14].

There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling. The Network in Network architecture [15] and GoogLeNet [16] achieve state-of-the-art results on several benchmarks by adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them to new tasks by only fine-tuning the fully connected layers, is more difficult with this approach. This problem is noted by Szegedy et al. [16] and motivates them to add a linear layer on the top of their networks to enable transfer learning.

Network pruning has been used both to reduce network complexity and to reduce over-fitting. An early approach to pruning was biased weight decay [17]. Optimal Brain Damage [18] and Optimal Brain Surgeon [19] prune networks to reduce the number of connections based on the Hessian of the loss function and suggest that such pruning is more accurate than magnitude-based pruning such as weight decay. However, the second-order derivative needs additional computation.

HashedNets [20] is a recent technique to reduce model sizes by using a hash function to randomly group connection weights into hash buckets, so that all connections within the same hash bucket share a single parameter value. This technique may benefit from pruning. As pointed out in Shi et al. [21] and Weinberger et al. [22], sparsity will minimize hash collision, making feature hashing even more effective. HashedNets may be used together with pruning to give even better parameter savings.

Where is the Energy Consumed?
larger model => more memory reference => more energy
how to make deep learning more efficient?
Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
10 Application as a Black Box
Algorithm Spec 2006
Hardware
CPU
11 Open the Box before Hardware Design ? Algorithm
Hardware
?PU
Breaks the boundary between algorithm and hardware
12 Proposed Paradigm
How to compress a deep learning model?
Compression Acceleration Regularization 14 Learning both Weights and Connections for Efficient Neural Networks
Han et al. NIPS 2015
Compression Acceleration Regularization 15 Pruning Neural Networks
Pruning Trained Quantization Huffman Coding 16 Pruning Happens in Human Brain
Newborn: 50 Trillion Synapses -> 1 year old: 1000 Trillion Synapses -> Adolescent: 500 Trillion Synapses
Christopher A Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.
Pruning Trained Quantization Huffman Coding 17 [Han et al. NIPS’15] Pruning AlexNet
CONV layers: pruned 3x; FC layers: pruned 10x
Pruning Trained Quantization Huffman Coding 18 [Han et al. NIPS’15] Pruning Neural Networks
-0.01x^2 + x + 1
60 Million -> 6 Million: 10x fewer connections
Pruning Trained Quantization Huffman Coding 19 [Han et al. NIPS’15] Pruning Neural Networks
[Plot: accuracy loss (+0.5% to -4.5%) vs. parameters pruned away (40%-100%)]
Pruning Trained Quantization Huffman Coding 20 [Han et al. NIPS’15] Pruning Neural Networks
[Plot: accuracy loss vs. parameters pruned away, pruning only]
Pruning Trained Quantization Huffman Coding 21 [Han et al. NIPS’15] Retrain to Recover Accuracy
[Plot: accuracy loss vs. parameters pruned away: pruning vs. pruning+retraining]
Pruning Trained Quantization Huffman Coding 22 [Han et al. NIPS’15] Iteratively Retrain to Recover Accuracy
[Plot: accuracy loss vs. parameters pruned away: pruning, pruning+retraining, and iterative pruning and retraining]
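A sketch of the iterative prune-and-retrain loop behind the curve above; the round schedule is illustrative and the retrain() helper is only a stand-in (real retraining would run SGD while keeping pruned positions at zero):

```python
import numpy as np

def retrain(weights, mask, steps):
    """Stand-in for retraining: a real implementation would run SGD and
    re-apply the mask after each update so pruned weights stay zero."""
    return weights * mask  # placeholder only

def iterative_prune(weights, target_sparsity, num_rounds=5, steps_per_round=1000):
    """Raise the pruning threshold gradually over several prune/retrain rounds;
    the curve above suggests this preserves accuracy better than one-shot pruning."""
    mask = np.ones_like(weights)
    for r in range(1, num_rounds + 1):
        sparsity = target_sparsity * r / num_rounds   # e.g. 18%, 36%, ..., 90%
        threshold = np.quantile(np.abs(weights), sparsity)
        mask = (np.abs(weights) >= threshold).astype(weights.dtype)
        weights = retrain(weights, mask, steps_per_round)
    return weights, mask

rng = np.random.default_rng(0)
W, M = iterative_prune(rng.normal(size=(512, 512)), target_sparsity=0.9)
print("final sparsity: %.1f%%" % (100.0 * (M == 0).mean()))
```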
Pruning Trained Quantization Huffman Coding 23 [Han et al. NIPS’15] Pruning RNN and LSTM
Image Captioning
Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick
*Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions" (slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016)
Pruning Trained Quantization Huffman Coding 24 [Han et al. NIPS’15] Pruning RNN and LSTM
• Original: a basketball player in a white uniform is playing with a ball
• Pruned 90%: a basketball player in a white uniform is playing with a basketball

• Original: a brown dog is running through a grassy field
• Pruned 90%: a brown dog is running through a grassy area

• Original: a man is riding a surfboard on a wave
• Pruned 90%: a man in a wetsuit is riding a wave on a beach

• Original: a soccer player in red is running in the field
• Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Pruning Trained Quantization Huffman Coding 25 Exploring the Granularity of Sparsity that is Hardware-friendly
4 types of pruning granularity
fully-dense model => irregular sparsity => regular sparsity => more regular sparsity
[Han et al, NIPS’15] [Molchanov et al, ICLR’17]
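A toy sketch of the two ends of this granularity spectrum; the 8x8 matrix, the pruning ratios, and the reading of rows as output channels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))   # toy weight matrix: rows ~ output channels

# Fine-grained (irregular) sparsity: prune individual weights by magnitude.
fine_mask = (np.abs(W) >= np.quantile(np.abs(W), 0.75)).astype(W.dtype)

# Coarse-grained (regular) sparsity: prune whole rows (e.g. entire channels)
# by their L2 norm -- less index bookkeeping, friendlier to hardware.
row_norms = np.linalg.norm(W, axis=1)
row_mask = np.repeat((row_norms >= np.median(row_norms)).astype(W.dtype)[:, None], 8, axis=1)

print("irregular sparsity: %.0f%%" % (100 * (fine_mask == 0).mean()))
print("regular (row) sparsity: %.0f%%" % (100 * (row_mask == 0).mean()))
```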
26 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
Han et al. ICLR 2016 Best Paper
Pruning Trained Quantization Huffman Coding 28 [Han et al. ICLR’16] Trained Quantization
2.09, 2.12, 1.92, 1.87  ->  2.0
Pruning Trained Quantization Huffman Coding 29 [Han et al. ICLR'16] Trained Quantization
Figure 2: Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.
weights (32 bit float):        cluster index (2 bit uint):    centroids:
 2.09  -0.98   1.48   0.09      3  0  2  1                     3:  2.00
 0.05  -0.14  -1.08   2.12      1  1  0  3                     2:  1.50
-0.91   1.92   0.00  -1.03      0  3  1  0                     1:  0.00
 1.87   0.00   1.53   1.49      3  1  2  2                     0: -1.00

gradient:                      group by cluster index:               reduce:
-0.03  -0.01   0.03   0.02      cluster 3: -0.03  0.12  0.02 -0.07    ->  0.04
-0.01   0.01  -0.02   0.12      cluster 2:  0.03  0.01 -0.02          ->  0.02
-0.01   0.02   0.04   0.01      cluster 1:  0.02 -0.01  0.01  0.04 -0.02 -> 0.04
-0.07  -0.02   0.01  -0.02      cluster 0: -0.01 -0.02 -0.01  0.01    -> -0.03

fine-tuned centroids = centroids - lr x (reduced gradients)

Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom). Pruning Trained Quantization Huffman Coding 30
We store the sparse structure that results from pruning using compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number of non-zero elements and n is the number of rows or columns. To compress further, we store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger than the bound, we use the zero padding solution shown in Figure 2: when the difference exceeds 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
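A small sketch of the relative-index trick just described; the list-of-tuples output and the helper name are illustrative, not the actual storage layout used in the paper:

```python
def encode_relative(nonzeros, index_bits=3):
    """Encode (position, value) pairs as (index_diff, value) pairs, as in Figure 2.
    3 bits is used purely as an example; the paper uses 8 bits for conv layers
    and 5 bits for fc layers. Gaps larger than the bound insert filler zeros."""
    bound = 1 << index_bits            # diffs up to 8 are representable here
    out, prev = [], 0
    for pos, val in nonzeros:
        diff = pos - prev
        while diff > bound:            # bridge a long gap with a filler zero
            out.append((bound, 0.0))
            diff -= bound
        out.append((diff, val))
        prev = pos
    return out

# Non-zeros at positions 1, 4, 15 -> the 4->15 jump (diff 11) needs one filler zero.
print(encode_relative([(1, 0.9), (4, -1.2), (15, 3.4)]))
```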
3 TRAINED QUANTIZATION AND WEIGHT SHARING
Network quantization and weight sharing further compresses the pruned network by reducing the number of bits required to represent each weight. We limit the number of effective weights we need to store by having multiple connections share the same weight, and then fine-tune those shared weights. Weight sharing is illustrated in Figure 3. Suppose we have a layer that has 4 input neurons and 4 output neurons; the weight is a 4x4 matrix. On the top left is the 4x4 weight matrix, and on the bottom left is the 4x4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors); all the weights in the same bin share the same value, thus for each weight we then need to store only a small index into a table of shared weights. During update, all the gradients are grouped by the color and summed together, multiplied by the learning rate and subtracted from the shared centroids from the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each CONV layer, and 5 bits (32 shared weights) for each FC layer without any loss of accuracy.
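A minimal sketch of this weight-sharing step: 1-D k-means picks the k shared centroids, each weight then stores only its small cluster index, and gradients are grouped and reduced per cluster as in the bottom of Figure 3. The helper names and the tiny k-means are mine, not the paper's code; the sample weights are the ones from Figure 3:

```python
import numpy as np

def kmeans_1d(weights, k, iters=20):
    """Tiny 1-D k-means used to pick k shared centroid values (linear init)."""
    centroids = np.linspace(weights.min(), weights.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = weights[labels == c].mean()
    return centroids, labels

def shared_weight_update(centroids, labels, grads, lr):
    """Group gradients by cluster index, reduce, and update the shared centroids."""
    for c in range(len(centroids)):
        if np.any(labels == c):
            centroids[c] -= lr * grads[labels == c].sum()
    return centroids

W = np.array([2.09, -0.98, 1.48, 0.09, 0.05, -0.14, -1.08, 2.12,
              -0.91, 1.92, 0.0, -1.03, 1.87, 0.0, 1.53, 1.49])
centroids, labels = kmeans_1d(W, k=4)
print("centroids:", np.round(centroids, 2), "indices:", labels)
```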
To calculate the compression rate, given k clusters, we only need log2(k) bits to encode the index. In general, for a network with n connections and each connection is represented with b bits, constraining the connections to have only k shared weights will result in a compression rate of:
r = \frac{nb}{n\log_2(k) + kb} \tag{1}
For example, Figure 3 shows the weights of a single-layer neural network with four input units and four output units. There are 4 x 4 = 16 weights originally, but there are only 4 shared weights: similar weights are grouped together to share the same value. Originally we need to store 16 weights, each requiring 32 bits.
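Plugging the Figure 3 example into Equation (1) as a sanity check (n = 16 weights, b = 32 bits per weight, k = 4 shared weights); the arithmetic below is just Equation (1) evaluated:

```latex
% Worked instance of Equation (1) for the 4x4 example in Figure 3:
r = \frac{nb}{n\log_2(k) + kb}
  = \frac{16 \times 32}{16 \times \log_2 4 + 4 \times 32}
  = \frac{512}{32 + 128}
  = 3.2
```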
[Han et al. ICLR'16] After Trained Quantization: Discrete Weights
[Histogram: weight count vs. weight value]
Pruning Trained Quantization Huffman Coding 31 [Han et al. ICLR’16] After Trained Quantization: Discrete Weight after Training Count
[Histogram: weight count vs. weight value]
Pruning Trained Quantization Huffman Coding 32 [Han et al. ICLR’16] How Many Bits do We Need?
Pruning Trained Quantization Huffman Coding 33
Table 4.9: Comparison of uniform quantization and non-uniform quantization (this work) with different update methods. -c: update centroids only; -c+l: update both centroids and labels. Baseline ResNet-50 accuracy: 76.15% (top-1), 92.87% (top-5). All results are after retraining.
Quantization Method         1 bit     2 bit     4 bit     6 bit     8 bit
Uniform (Top-1)               -       59.33%    74.52%    75.49%    76.15%
Uniform (Top-5)               -       82.39%    91.97%    92.60%    92.91%
Non-uniform -c (Top-1)      24.08%    68.41%    76.16%    76.13%    76.20%
Non-uniform -c (Top-5)      48.57%    88.49%    92.85%    92.91%    92.88%
Non-uniform -c+l (Top-1)    24.71%    69.36%    76.17%    76.21%    76.19%
Non-uniform -c+l (Top-5)    49.84%    89.03%    92.87%    92.89%    92.90%

[Han et al. ICLR'16] How Many Bits do We Need?
[Plot (Figure 4.10): top-1 accuracy (73%-77%) vs. number of bits per weight (2, 4, 6, 8 bits) for ResNet-50 after fine-tuning, comparing non-uniform quantization and uniform quantization; full-precision top-1 accuracy: 76.15%]
Figure 4.10: Non-uniform quantization performs better than uniform quantization. Pruning Trained Quantization Huffman Coding 34
With non-uniform quantization (this work), all the layers of the baseline ResNet-50 can be compressed to 4 bits without losing accuracy. For uniform quantization, however, all the layers of the baseline ResNet-50 can only be compressed to 8 bits without losing accuracy (at 4 bits, there is about a 1.6% top-1 accuracy loss when using uniform quantization). The advantage of non-uniform quantization is that it can better capture the non-uniform distribution of the weights: where the probability density is higher, the centroids are closer together. Uniform quantization cannot achieve this.
Table 4.9 compares the performance of two non-uniform quantization strategies. During fine-tuning, one strategy is to update only the centroids; the other is to update both the centroids and the labels (the label indicates which centroid a weight belongs to). Intuitively, the latter has more degrees of freedom in the learning process and should give better performance. However, experiments show that the improvement is not significant, as shown in the third and fourth rows of Table 4.9.
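A toy illustration of why adaptive (k-means) centroids beat evenly spaced levels on bell-shaped weights; this is a synthetic example with made-up data, not the ResNet-50 experiment above:

```python
import numpy as np

def uniform_quantize(w, bits):
    """Evenly spaced levels between min and max (uniform quantization)."""
    levels = np.linspace(w.min(), w.max(), 2 ** bits)
    return levels[np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)]

def nonuniform_quantize(w, bits, iters=25):
    """k-means centroids adapt to the weight distribution (non-uniform)."""
    c = np.linspace(w.min(), w.max(), 2 ** bits)
    for _ in range(iters):
        lab = np.argmin(np.abs(w[:, None] - c[None, :]), axis=1)
        c = np.array([w[lab == i].mean() if np.any(lab == i) else c[i]
                      for i in range(len(c))])
    return c[np.argmin(np.abs(w[:, None] - c[None, :]), axis=1)]

# Bell-shaped weights: most mass near zero, so adaptive centroids pack tighter
# there and give a lower quantization error than evenly spaced levels.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=20_000)
for bits in (2, 4):
    eu = np.mean((w - uniform_quantize(w, bits)) ** 2)
    en = np.mean((w - nonuniform_quantize(w, bits)) ** 2)
    print(f"{bits}-bit  uniform MSE {eu:.2e}   non-uniform MSE {en:.2e}")
```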
More Aggressive Compression: Trained Ternary Quantization
(Excerpt from a paper under review as a conference paper at ICLR 2017.)

…Equation 2. We use scaled gradients for 32-bit weights:

\frac{\partial L}{\partial \tilde{w}_l} =
\begin{cases}
W_l^p \times \dfrac{\partial L}{\partial w_l^t}, & \tilde{w}_l > \Delta_l \\
1 \times \dfrac{\partial L}{\partial w_l^t}, & |\tilde{w}_l| \le \Delta_l \\
W_l^n \times \dfrac{\partial L}{\partial w_l^t}, & \tilde{w}_l < -\Delta_l
\end{cases}
\tag{8}

Note we use the scalar 1 as the factor for gradients of zero weights. The overall quantization process is illustrated in Figure 1. The evolution of the ternary weights from different layers during training is shown in Figure 2. We observe that as training proceeds, different layers behave differently: for the first quantized conv layer, the absolute values of W_l^p and W_l^n get smaller and sparsity gets lower, while for the last conv layer and fully connected layer, the absolute values of W_l^p and W_l^n get larger and sparsity gets higher. We learn the ternary assignments (index to the codebook) by updating the latent full-resolution weights during training. This may cause the assignments to change between iterations. Note that the thresholds are not constants, as the maximal absolute values change over time. Once an updated weight crosses the threshold, the ternary assignment is changed. The benefits of using trained quantization factors are: i) the asymmetry of W_l^p ≠ W_l^n enables neural networks to have more model capacity; ii) quantized weights play the role of "learning rate multipliers" during back propagation.

3.2 QUANTIZATION HEURISTIC

In previous work on ternary weight networks, Li & Liu (2016) proposed Ternary Weight Networks (TWN) using ±∆_l as thresholds to reduce 32-bit weights to ternary values, where ∆_l is defined as in Equation 5. They optimized the value of ±∆_l by minimizing the expectation of the L2 distance between full precision weights and ternary weights. Instead of using a strictly optimized threshold, we adopt different heuristics: 1) use the maximum absolute value of the weights as a reference to the layer's threshold and maintain a constant factor t for all layers:

\Delta_l = t \times \max(|\tilde{w}|) \tag{9}

and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on the CIFAR-10 and ImageNet datasets, and use the second one to explore a wider range of sparsities in section 5.1.1.

[Figure 1: Trained ternary quantization pipeline: Full Precision Weight -> Normalize -> Normalized Full Precision Weight (-1 to 1) -> Quantize with threshold ±t -> Intermediate Ternary Weight {-1, 0, 1} -> Scale by Wn, Wp -> Final Ternary Weight {-Wn, 0, Wp}. Diagram labels: gradient1, gradient2, Loss, Feed Forward, Back Propagate, Inference Time.]
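A minimal sketch of the forward quantization step just described (normalize, threshold at ±t per Equation 9, scale positives and negatives by trained factors). The backward pass of Equation 8, which also learns Wp and Wn, is omitted; the values of t, Wp, and Wn below are illustrative:

```python
import numpy as np

def trained_ternary_quantize(w_latent, t=0.05, Wp=1.0, Wn=1.0):
    """Sketch of the TTQ forward quantization: normalize the latent full-precision
    weights, threshold at +/- t * max|w| (Equation 9), and scale the positive and
    negative sides by the factors Wp and Wn. In the real method Wp and Wn are
    trained; here they are plain arguments."""
    w = w_latent / np.max(np.abs(w_latent))   # normalize to [-1, 1]
    delta = t * np.max(np.abs(w))             # layer threshold (Equation 9)
    ternary = np.zeros_like(w)
    ternary[w > delta] = Wp
    ternary[w < -delta] = -Wn
    return ternary

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1000)
q = trained_ternary_quantize(w, t=0.05, Wp=0.8, Wn=1.2)
print("zeros: %.1f%%  positives: %.1f%%" %
      (100 * (q == 0).mean(), 100 * (q > 0).mean()))
```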
[Figure 2 plots over training epochs (0-150), for layers res1.0/conv1, res3.2/conv2, and linear: ternary weight values Wn/Wp (top) and the percentage of negatives, zeros, and positives (bottom).]
Figure 2: Ternary weight values (above) and distribution (below) with iterations for different layers of ResNet-20 on CIFAR-10.