Efficient Methods and Hardware for Deep Learning
Song Han, Stanford University, May 25, 2017
Intro
Song Han, PhD Candidate, Stanford; Bill Dally, Chief Scientist, NVIDIA and Professor, Stanford
2 Deep Learning is Changing Our Lives
Self-Driving Car Machine Translation
3 AlphaGo Smart Robots
3 Models are Getting Larger
IMAGE RECOGNITION: 16X larger models / SPEECH RECOGNITION: 10X more training ops
• Image recognition: AlexNet (2012), 8 layers, 1.4 GFLOP, ~16% error → ResNet (Microsoft, 2015), 152 layers, 22.6 GFLOP, ~3.5% error
• Speech recognition: Deep Speech 1 (2014), 80 GFLOP, 7,000 hrs of data, ~8% error → Deep Speech 2 (Baidu, 2015), 465 GFLOP, 12,000 hrs of data, ~5% error
Important property of neural networks: results get better with more data + bigger models + more computation (better algorithms, new insights and improved techniques always help, too!)
Dally, NIPS’2016 workshop on Efficient Methods for Deep Neural Networks
4 The First Challenge: Model Size
Hard to distribute large models through over-the-air updates
5 The Second Challenge: Speed
Error rate / Training time:
ResNet18: 10.76%, 2.5 days
ResNet50: 7.02%, 5 days
ResNet101: 6.21%, 1 week
ResNet152: 6.16%, 1.5 weeks
Such long training times limit ML researchers' productivity
Training time benchmarked with fb.resnet.torch using four M40 GPUs
6 The Third Challenge: Energy Efficiency
AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game
On mobile: drains battery. In the data center: increases TCO.
Where is the Energy Consumed?
Larger model => more memory references => more energy
Operation / Energy [pJ] / Relative Energy Cost:
32 bit int ADD: 0.1 pJ (1x)
32 bit float ADD: 0.9 pJ (9x)
32 bit Register File: 1 pJ (10x)
32 bit int MULT: 3.1 pJ (31x)
32 bit float MULT: 3.7 pJ (37x)
32 bit SRAM Cache: 5 pJ (50x)
32 bit DRAM Memory: 640 pJ (6400x)
Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.
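To make the table concrete, here is a small back-of-the-envelope sketch (my own illustration, not from the talk) that uses these 45nm numbers to compare fetching weights from DRAM versus on-chip SRAM; the layer size and the assumption of one 32-bit access per weight are illustrative only.

```python
# Rough energy model built on the 45nm numbers above: 640 pJ per 32-bit DRAM
# access, 5 pJ per SRAM access, 3.7 pJ per 32-bit float multiply, 0.9 pJ per add.

DRAM_PJ, SRAM_PJ = 640.0, 5.0
FP_MULT_PJ, FP_ADD_PJ = 3.7, 0.9

def layer_energy_uj(num_weights, num_macs, weights_on_chip):
    """Estimate the energy (in microjoules) of one forward pass of a layer,
    assuming one 32-bit access per weight and one float multiply-add per MAC."""
    mem = num_weights * (SRAM_PJ if weights_on_chip else DRAM_PJ)
    compute = num_macs * (FP_MULT_PJ + FP_ADD_PJ)
    return (mem + compute) / 1e6

# Hypothetical fully-connected layer with 25M weights (one MAC per weight):
print(layer_energy_uj(25_000_000, 25_000_000, weights_on_chip=False))  # 16115.0
print(layer_energy_uj(25_000_000, 25_000_000, weights_on_chip=True))   # 240.0
```

The point matches the slide: once weights spill to DRAM, memory access rather than arithmetic dominates the energy.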
To achieve this goal, we present a method to prune network connections in a manner that preserves the original accuracy. After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks — learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights - much as in the mammalian brain [8][9], where synapses are created in the first few months of a child's development, followed by gradual pruning of little-used connections, falling to typical adult values.
2 Related Work
Neural networks are typically over-parameterized, and there is significant redundancy for deep learning models [10]. This results in a waste of both computation and memory. There have been various proposals to remove the redundancy: Vanhoucke et al. [11] explored a fixed-point implementation with 8-bit integer (vs 32-bit floating point) activations. Denton et al. [12] exploited the linear structure of the neural network by finding an appropriate low-rank approximation of the parameters and keeping the accuracy within 1% of the original model. With similar accuracy loss, Gong et al. [13] compressed deep convnets using vector quantization. These approximation and quantization techniques are orthogonal to network pruning, and they can be used together to obtain further gains [14].
There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling. The Network in Network architecture [15] and GoogLeNet [16] achieve state-of-the-art results on several benchmarks by adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them to new tasks by only fine-tuning the fully connected layers, is more difficult with this approach. This problem is noted by Szegedy et al. [16] and motivates them to add a linear layer on top of their networks to enable transfer learning.
Network pruning has been used both to reduce network complexity and to reduce over-fitting. An early approach to pruning was biased weight decay [17]. Optimal Brain Damage [18] and Optimal Brain Surgeon [19] prune networks to reduce the number of connections based on the Hessian of the loss function and suggest that such pruning is more accurate than magnitude-based pruning such as weight decay.
However, the second order derivative needs additional computation. HashedNets [20] is a recent technique to reduce model sizes by using a hash function to randomly group connection weights into hash buckets, so that all connections within the same hash bucket share a single parameter value. This technique may benefit from pruning. As pointed out in Shi et al. [21] and Weinberger et al. [22], sparsity will minimize hash collision, making feature hashing even more effective. HashedNets may be used together with pruning to give even better parameter savings.
how to make deep learning more efficient?
Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
11 Application as a Black Box
Spec 2006 Algorithm
Hardware
CPU
12 Open the Box before Hardware Design
? Algorithm
Hardware
?PU
Breaks the boundary between algorithm and hardware
Agenda
• Algorithms for Efficient Inference
• Algorithms for Efficient Training
• Hardware for Efficient Inference
• Hardware for Efficient Training
Hardware 101: the Family
Hardware
General Purpose* (CPU: latency oriented; GPU: throughput oriented)
Specialized HW (FPGA: programmable logic; ASIC: fixed logic)
* including GPGPU
Hardware 101: Number Representation
(-1)^S x (1.M) x 2^E
Format (bit fields) / Range / Accuracy:
FP32 (S:1, E:8, M:23): 10^-38 – 10^38, .000006%
FP16 (S:1, E:5, M:10): 6x10^-5 – 6x10^4, .05%
Int32 (S:1, M:31): 0 – 2x10^9, ½
Int16 (S:1, M:15): 0 – 6x10^4, ½
Int8 (S:1, M:7): 0 – 127, ½
Fixed point (S, I, F): radix point sits between the integer (I) and fraction (F) bits
Dally, High Performance Hardware for Machine Learning, NIPS'2015
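As a small illustration of the fixed-point (S, I, F) row above, the sketch below quantizes floats onto a signed fixed-point grid for a chosen radix point position; the bit widths and function names are my own example, not from the talk.

```python
import numpy as np

def to_fixed(x, int_bits=3, frac_bits=4):
    """Quantize floats to a signed fixed-point grid with `frac_bits` bits
    after the radix point (total width = 1 sign + int_bits + frac_bits)."""
    scale = 2.0 ** frac_bits
    qmax = 2 ** (int_bits + frac_bits) - 1        # largest positive code
    q = np.clip(np.round(x * scale), -qmax - 1, qmax)
    return q.astype(np.int32)

def from_fixed(q, frac_bits=4):
    """Dequantize back to float."""
    return q.astype(np.float32) / (2.0 ** frac_bits)

w = np.array([0.8125, -1.3, 2.07, 5.9])
q = to_fixed(w)                    # e.g. 0.8125 -> 13, since 13 / 16 = 0.8125
print(q, from_fixed(q))            # values outside the representable range saturate
```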
Hardware 101: Number Representation
Operation / Energy (pJ) / Area (µm²):
8b Add: 0.03 pJ, 36 µm²
16b Add: 0.05 pJ, 67 µm²
32b Add: 0.1 pJ, 137 µm²
16b FP Add: 0.4 pJ, 1360 µm²
32b FP Add: 0.9 pJ, 4184 µm²
8b Mult: 0.2 pJ, 282 µm²
32b Mult: 3.1 pJ, 3495 µm²
16b FP Mult: 1.1 pJ, 1640 µm²
32b FP Mult: 3.7 pJ, 7700 µm²
32b SRAM Read (8KB): 5 pJ, N/A
32b DRAM Read: 640 pJ, N/A
Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node. FP units used the DesignWare Library.
Agenda
• Algorithms for Efficient Inference
• Algorithms for Efficient Training
• Hardware for Efficient Inference
• Hardware for Efficient Training
Part 1: Algorithms for Efficient Inference
• 1. Pruning
• 2. Weight Sharing
• 3. Quantization
• 4. Low Rank Approximation
• 5. Binary / Ternary Net
• 6. Winograd Transformation
Pruning Neural Networks
[Lecun et al. NIPS’89] [Han et al. NIPS’15]
Pruning Trained Quantization Huffman Coding 24 [Han et al. NIPS’15] Pruning Neural Networks
-0.01x² + x + 1
60 Million → 6M: 10x fewer connections
Pruning Trained Quantization Huffman Coding 25 [Han et al. NIPS’15] Pruning Neural Networks
[Plot: accuracy loss vs. parameters pruned away (40%–100%)]
Pruning Trained Quantization Huffman Coding 26 [Han et al. NIPS’15] Pruning Neural Networks
[Plot: accuracy loss vs. parameters pruned away, curve: Pruning]
Pruning Trained Quantization Huffman Coding 27 [Han et al. NIPS’15] Retrain to Recover Accuracy
[Plot: accuracy loss vs. parameters pruned away, curves: Pruning, Pruning+Retraining]
Pruning Trained Quantization Huffman Coding 28 [Han et al. NIPS’15] Iteratively Retrain to Recover Accuracy
[Plot: accuracy loss vs. parameters pruned away, curves: Pruning, Pruning+Retraining, Iterative Pruning and Retraining]
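A minimal sketch of magnitude-based pruning with an iterative prune-and-retrain loop, in the spirit of [Han et al. NIPS'15]; the layer size, sparsity schedule, and the placeholder retraining step are illustrative assumptions, not the authors' code.

```python
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero out the smallest-magnitude fraction of weights; return pruned weights and mask."""
    threshold = np.quantile(np.abs(w), sparsity)     # e.g. sparsity=0.9 removes the smallest 90%
    mask = (np.abs(w) > threshold).astype(w.dtype)
    return w * mask, mask

w = np.random.randn(256, 256).astype(np.float32)     # stand-in for a trained layer
for sparsity in (0.5, 0.7, 0.9):                     # prune gradually, not in one shot
    w, mask = prune_by_magnitude(w, sparsity)
    # Retraining step (placeholder): update only the surviving weights, e.g.
    #   w -= lr * grad * mask
    # so the remaining connections compensate for the removed ones.
print(f"{(w == 0).mean():.0%} of weights are zero")
```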
Pruning Trained Quantization Huffman Coding 29 [Han et al. NIPS’15] Pruning RNN and LSTM
Image Captioning
Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick
*Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions", 2015. Figure copyright IEEE, 2015; reproduced for educational purposes.
Pruning Trained Quantization Huffman Coding 30 [Han et al. NIPS’15] Pruning RNN and LSTM
• Original: a basketball player in a white uniform is playing with a ball
  Pruned 90%: a basketball player in a white uniform is playing with a basketball
• Original: a brown dog is running through a grassy field
  Pruned 90%: a brown dog is running through a grassy area
• Original: a man is riding a surfboard on a wave
  Pruned 90%: a man in a wetsuit is riding a wave on a beach
• Original: a soccer player in red is running in the field
  Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Pruning Trained Quantization Huffman Coding 31 Pruning Happens in Human Brain
50 Trillion Synapses (Newborn) → 1000 Trillion Synapses (1 year old) → 500 Trillion Synapses (Adolescent)
Christopher A Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.
Pruning Trained Quantization Huffman Coding 32 [Han et al. NIPS’15] Pruning Changes Weight Distribution
Before Pruning After Pruning After Retraining
Conv5 layer of AlexNet. Representative of other network layers as well.
Pruning Trained Quantization Huffman Coding 33 Part 1: Algorithms for Efficient Inference
• 1. Pruning
• 2. Weight Sharing
• 3. Quantization
• 4. Low Rank Approximation
• 5. Binary / Ternary Net
• 6. Winograd Transformation [Han et al. ICLR’16] Trained Quantization
2.09, 2.12, 1.92, 1.87 → 2.0
Pruning Trained Quantization Huffman Coding 35
[Han et al. ICLR'16] Trained Quantization
Pruning: fewer weights → Quantization: fewer bits per weight → Huffman Encoding
Pruning (Train Connectivity → Prune Connections → Train Weights): 9x-13x size reduction, same accuracy as the original network
Quantization (Cluster the Weights → Generate Code Book → Quantize the Weights with Code Book → Retrain Code Book): 27x-31x reduction, same accuracy
Huffman Encoding (Encode Weights → Encode Index): 35x-49x reduction, same accuracy
Figure 1: The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate: between 27x and 31x. Huffman coding gives more compression: between 35x and 49x. The compression rate already includes the meta-data for sparse representation. The compression scheme doesn't incur any accuracy loss.
32 bit → 4 bit: 8x less memory footprint
Pruning Trained Quantization Huffman Coding 36
...features such as better privacy, less network bandwidth and real time processing, the large storage overhead prevents deep neural networks from being incorporated into mobile apps. The second issue is energy consumption. Running large neural networks requires a lot of memory bandwidth to fetch the weights and a lot of computation to do dot products, which in turn consumes considerable energy. Mobile devices are battery constrained, making power hungry applications such as deep neural networks hard to deploy. Energy consumption is dominated by memory access. Under 45nm CMOS technology, a 32 bit floating point add consumes 0.9pJ, a 32 bit SRAM cache access takes 5pJ, while a 32 bit DRAM memory access takes 640pJ, which is 3 orders of magnitude of an add operation. Large networks do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1 billion connection neural network, for example, at 20fps would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM access - well beyond the power envelope of a typical mobile device. Our goal is to reduce the storage and energy required to run inference on such large networks so they can be deployed on mobile devices. To achieve this goal, we present "deep compression": a three-stage pipeline (Figure 1) to reduce the storage required by neural networks in a manner that preserves the original accuracy. First, we prune the network by removing the redundant connections, keeping only the most informative connections. Next, the weights are quantized so that multiple connections share the same weight, thus only the codebook (effective weights) and the indices need to be stored. Finally, we apply Huffman coding to take advantage of the biased distribution of effective weights. Our main insight is that pruning and trained quantization are able to compress the network without interfering with each other, thus leading to a surprisingly high compression rate. It makes the required storage so small (a few megabytes) that all weights can be cached on chip instead of going to off-chip DRAM, which is energy consuming. Based on "deep compression", the EIE hardware accelerator (Han et al., 2016) was later proposed that works on the compressed model, achieving significant speedup and energy efficiency improvement.
2 NETWORK PRUNING
Network pruning has been widely studied to compress CNN models. In early work, network pruning proved to be a valid way to reduce the network complexity and over-fitting (LeCun et al., 1989; Hanson & Pratt, 1989; Hassibi et al., 1993; Ström, 1997). Recently Han et al. (2015) pruned state-of-the-art CNN models with no loss of accuracy. We build on top of that approach. As shown on the left side of Figure 1, we start by learning the connectivity via normal network training. Next, we prune the small-weight connections: all connections with weights below a threshold are removed from the network. Finally, we retrain the network to learn the final weights for the remaining sparse connections. Pruning reduced the number of parameters by 9x and 13x for the AlexNet and VGG-16 models.
[Han et al. ICLR'16] Trained Quantization
Figure 2: Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.
weights (32 bit float):        cluster index (2 bit uint):    centroids → fine-tuned centroids:
 2.09  -0.98   1.48   0.09      3 0 2 1                        3:  2.00 →  1.96
 0.05  -0.14  -1.08   2.12      1 1 0 3                        2:  1.50 →  1.48
-0.91   1.92   0     -1.03      0 3 1 0                        1:  0.00 → -0.04
 1.87   0      1.53   1.49      3 1 2 2                        0: -1.00 → -0.97

gradient:
-0.03  -0.01   0.03   0.02
-0.01   0.01  -0.02   0.12
-0.01   0.02   0.04   0.01
-0.07  -0.02   0.01  -0.02
(the gradients are grouped by cluster index, reduced to one sum per cluster, multiplied by the learning rate lr, and subtracted from the centroids to give the fine-tuned centroids)
Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom). Pruning Trained Quantization Huffman Coding 37
We store the sparse structure that results from pruning using compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number of non-zero elements and n is the number of rows or columns. To compress further, we store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger than the bound, we use the zero padding solution shown in Figure 2: when the difference exceeds 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
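One plausible reading of this relative-index-plus-filler scheme, as a sketch only (the exact bit-level layout in the paper may differ; index_bits=3 follows the 3-bit example above):

```python
def encode_relative(positions, values, index_bits=3):
    """Encode absolute positions of non-zeros as index differences.
    If a difference exceeds the largest representable gap (2**index_bits),
    emit a filler zero entry and keep counting from there."""
    max_diff = 2 ** index_bits            # 8 in the 3-bit example
    diffs, out_vals, prev = [], [], 0
    for pos, val in zip(positions, values):
        gap = pos - prev
        while gap > max_diff:             # insert filler zeros for long gaps
            diffs.append(max_diff)
            out_vals.append(0.0)
            gap -= max_diff
        diffs.append(gap)
        out_vals.append(val)
        prev = pos
    return diffs, out_vals

# Non-zeros at positions 1, 4, 15 of a sparse row:
print(encode_relative([1, 4, 15], [3.4, 0.9, 1.7]))
# -> ([1, 3, 8, 3], [3.4, 0.9, 0.0, 1.7])
```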
3 TRAINED QUANTIZATION AND WEIGHT SHARING
Network quantization and weight sharing further compress the pruned network by reducing the number of bits required to represent each weight. We limit the number of effective weights we need to store by having multiple connections share the same weight, and then fine-tune those shared weights. Weight sharing is illustrated in Figure 3. Suppose we have a layer that has 4 input neurons and 4 output neurons; the weight is a 4 × 4 matrix. On the top left is the 4 × 4 weight matrix, and on the bottom left is the 4 × 4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors); all the weights in the same bin share the same value, thus for each weight we then need to store only a small index into a table of shared weights. During update, all the gradients are grouped by the color and summed together, multiplied by the learning rate and subtracted from the shared centroids from the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each CONV layer, and 5 bits (32 shared weights) for each FC layer without any loss of accuracy.
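A minimal sketch of weight sharing as described above: a tiny 1-D k-means builds the codebook, and the centroid update groups and sums the gradients per cluster. The linear centroid initialization, iteration count, and learning rate are illustrative choices rather than the paper's exact settings; the 4 × 4 matrix is the one from Figure 3.

```python
import numpy as np

def share_weights(w, k=4, iters=20):
    """Cluster a weight matrix into k shared values (simple 1-D k-means).
    Returns the codebook (k centroids) and a per-weight cluster index."""
    flat = w.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)   # linear initialization
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    return centroids, idx.reshape(w.shape)

def centroid_gradient_step(centroids, idx, grad, lr=1.0):
    """Fine-tune the shared values: group gradients by cluster index, sum them,
    scale by the learning rate, and subtract from the centroids (Figure 3, bottom)."""
    for c in range(len(centroids)):
        centroids[c] -= lr * grad[idx == c].sum()
    return centroids

w = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])
codebook, idx = share_weights(w, k=4)
w_quantized = codebook[idx]        # every weight replaced by its shared value
print(codebook, idx)
```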
To calculate the compression rate: given k clusters, we only need log2(k) bits to encode the index. In general, for a network with n connections, each connection represented with b bits, constraining the connections to have only k shared weights results in a compression rate of:
r = nb / (n log2(k) + kb)    (1)
For example, Figure 3 shows the weights of a single layer neural network with four input units and four output units. There are 4 × 4 = 16 weights originally, but there are only 4 shared weights: similar weights are grouped together to share the same value. Originally we need to store 16 weights, each represented with 32 bits; after weight sharing we only store the 4 shared 32-bit weights plus 16 2-bit cluster indices.
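Plugging the numbers of this example into Equation (1), with n = 16 weights, b = 32 bits, and k = 4 clusters (2-bit indices):

$$ r = \frac{16 \times 32}{16 \times \log_2 4 + 4 \times 32} = \frac{512}{160} = 3.2 $$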
[Han et al. ICLR'16] Before Trained Quantization: Continuous Weight Count
[Histogram: weight count vs. weight value]
Pruning Trained Quantization Huffman Coding 44 [Han et al. ICLR’16] After Trained Quantization: Discrete Weight Count
[Histogram: weight count vs. weight value]
Pruning Trained Quantization Huffman Coding 45 [Han et al. ICLR’16] After Trained Quantization: Discrete Weight after Training Count
[Histogram: weight count vs. weight value]
Pruning Trained Quantization Huffman Coding 46 [Han et al. ICLR’16] How Many Bits do We Need?
Pruning Trained Quantization Huffman Coding 47 [Han et al. ICLR’16] How Many Bits do We Need?
Pruning Trained Quantization Huffman Coding 48 [Han et al. ICLR’16] Pruning + Trained Quantization Work Together
Pruning Trained Quantization Huffman Coding 49 [Han et al. ICLR’16] Pruning + Trained Quantization Work Together
AlexNet on ImageNet
Pruning Trained Quantization Huffman Coding 50 [Han et al. ICLR’16]
Huffman Coding
• Infrequent weights: use more bits to represent
• Frequent weights: use fewer bits to represent
Pruning Trained Quantization Huffman Coding 51
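A minimal sketch of how such a variable-length code can be built over the quantized weight indices using Python's standard heapq; this is generic Huffman coding, not the paper's implementation, and the example index stream is made up.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from a stream of symbols,
    e.g. the quantized cluster indices of a layer's weights."""
    freq = Counter(symbols)
    if len(freq) == 1:                          # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # Each heap entry: (frequency, tie_breaker, {symbol: code_so_far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Frequent indices get short codes, infrequent indices get long codes:
indices = [3, 3, 3, 3, 3, 2, 2, 2, 1, 0]
code = huffman_code(indices)
bits = sum(len(code[s]) for s in indices)
print(code, bits, "bits vs", 2 * len(indices), "bits with fixed 2-bit indices")
```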
[Han et al. ICLR'16] Summary of Deep Compression
Pruning: fewer weights (9x-13x reduction) → Trained Quantization: fewer bits per weight (27x-31x reduction) → Huffman Encoding (35x-49x total size reduction), with no loss of accuracy.
Pruning Trained Quantization Huffman Coding 52
[Han et al. ICLR'16] Results: Compression Ratio
Network / Original Size / Compressed Size / Compression Ratio / Original Accuracy / Compressed Accuracy
LeNet-300 1070KB 27KB 40x 98.36% 98.42%
LeNet-5 1720KB 44KB 39x 99.20% 99.26%
AlexNet 240MB 6.9MB 35x 80.27% 80.30%
VGGNet 550MB 11.3MB 49x 88.68% 89.09%
GoogleNet 28MB 2.8MB 10x 88.90% 88.92%
ResNet-18 44.6MB 4.0MB 11x 89.24% 89.28%
Can we make compact models to begin with?
Compression Acceleration Regularization 53 SqueezeNet
Fire module: Input (64 channels) → Squeeze: 1x1 Conv (16) → Expand: 1x1 Conv (64) + 3x3 Conv (64) → Concat/Eltwise → Output (128 channels)
Vanilla Fire module
Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016
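A sketch of a vanilla Fire module with the channel counts shown above (64 in, squeeze 16, expand 64 + 64, concatenated to 128), written in PyTorch for concreteness; this is an illustration, not the authors' reference code.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style Fire module: squeeze with 1x1 convs, then expand with
    parallel 1x1 and 3x3 convs and concatenate along the channel dimension."""
    def __init__(self, in_ch=64, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

x = torch.randn(1, 64, 32, 32)
print(Fire()(x).shape)    # torch.Size([1, 128, 32, 32])
```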
Compression Acceleration Regularization 54 Compressing SqueezeNet
Network / Approach / Size / Ratio / Top-1 Accuracy / Top-5 Accuracy
AlexNet (baseline): 240MB, 1x, 57.2% / 80.3%
AlexNet, SVD: 48MB, 5x, 56.0% / 79.4%
AlexNet, Deep Compression: 6.9MB, 35x, 57.2% / 80.3%
SqueezeNet (baseline): 4.8MB, 50x, 57.5% / 80.3%
SqueezeNet, Deep Compression: 0.47MB, 510x, 57.5% / 80.3%
Iandola et al, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, arXiv 2016
Compression Acceleration Regularization 55 Results: Speedup
[Chart: average speedup on CPU, GPU, and mobile GPU]
Compression Acceleration Regularization 56 Results: Energy Efficiency
[Chart: average energy efficiency improvement on CPU, GPU, and mobile GPU]
Compression Acceleration Regularization 57 Deep Compression Applied to Industry
Deep Compression
F8
Compression Acceleration Regularization 58 Part 1: Algorithms for Efficient Inference
• 1. Pruning
• 2. Weight Sharing
• 3. Quantization
• 4. Low Rank Approximation
• 5. Binary / Ternary Net
• 6. Winograd Transformation Quantizing the Weight and Activation
• Train with float
• Quantizing the weight and activation:
  • Gather the statistics for weight and activation
  • Choose proper radix point position
• Fine-tune in float format
• Convert to fixed-point format
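A minimal sketch of the flow above (gather statistics, choose the radix point, convert to fixed point); the 8-bit width, the rule for picking the fractional bits, and the function names are my own illustrative assumptions, not the Qiu et al. implementation.

```python
import numpy as np

def choose_frac_bits(x, total_bits=8):
    """Pick the radix point so the largest observed magnitude still fits."""
    max_abs = np.abs(x).max()                         # statistics gathered offline
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))))
    return total_bits - 1 - int_bits                  # 1 bit reserved for the sign

def quantize(x, frac_bits, total_bits=8):
    """Convert floats to fixed point with the chosen radix point position."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), qmin, qmax).astype(np.int8)

weights = np.random.randn(1000).astype(np.float32) * 0.5   # trained in float
f = choose_frac_bits(weights)        # choose the radix point from the statistics
qw = quantize(weights, f)            # convert to fixed point for deployment
# (fine-tuning would still happen in float, before this final conversion)
print(f, qw[:5], qw[:5] / 2.0 ** f)
```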
Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA’16 Quantization Result