Bandwidth-Efficient Deep Learning: from Compression to Acceleration

Song Han, Assistant Professor, EECS, Massachusetts Institute of Technology

1 AI is Changing Our Lives

Self-Driving Car Machine Translation

AlphaGo Smart Robots

2 Models are Getting Larger

Important property of neural networks: results get better with more data + bigger models + more computation. (Better algorithms, new insights, and improved techniques always help, too!)

IMAGE RECOGNITION:  AlexNet (2012): 8 layers, 1.4 GFLOP, ~16% error  ->  ResNet (2015): 152 layers, 22.6 GFLOP, ~3.5% error  (16X)
SPEECH RECOGNITION: Deep Speech 1 (2014): 80 GFLOP, 7,000 hrs of data, ~8% error  ->  Deep Speech 2 (2015): 465 GFLOP, 12,000 hrs of data, ~5% error  (10X)

Dally, NIPS’2016 workshop on Efficient Methods for Deep Neural Networks

3 The First Challenge: Model Size

Hard to distribute large models through over-the-air updates

4 The Second Challenge: Speed

            Error rate   Training time
ResNet18:   10.76%       2.5 days
ResNet50:    7.02%       5 days
ResNet101:   6.21%       1 week
ResNet152:   6.16%       1.5 weeks

Such long training times limit ML researchers' productivity

Training time benchmarked with fb.resnet.torch using four M40 GPUs

5 The Third Challenge: Energy Efficiency

AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game

on mobile: drains the battery
in the data center: increases TCO

6 Where is the Energy Consumed?

larger model => more memory references => more energy

7 Where is the Energy Consumed?

larger model => more memory references => more energy

Operation               Energy [pJ]   Relative Energy Cost
32 bit int ADD          0.1           1
32 bit float ADD        0.9           9
32 bit Register File    1             10
32 bit int MULT         3.1           31
32 bit float MULT       3.7           37
32 bit SRAM Cache       5             50
32 bit DRAM Memory      640           6400

Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.
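As a quick back-of-the-envelope check of why memory dominates (a sketch in plain Python; the 1G-connection network at 20 Hz is the example quoted later in the deck from the EIE paper, and the 640 pJ figure comes from the table above):

```python
# Rough energy estimate for fetching weights from DRAM, using the 45nm numbers above.
DRAM_ACCESS_PJ = 640          # energy per 32-bit DRAM access [pJ]
connections    = 1e9          # example: a 1G-connection network
frames_per_s   = 20           # example: 20 Hz inference

# One weight fetch per connection per frame, all served from DRAM (no data reuse).
energy_per_frame_j = connections * DRAM_ACCESS_PJ * 1e-12   # pJ -> J
power_w = energy_per_frame_j * frames_per_s

print(f"DRAM power: {power_w:.1f} W")   # ~12.8 W, well beyond a mobile power envelope
```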

[Slide shows a screenshot of the first page of Han et al., NIPS'15, highlighting Figure 1, the energy table above.]

8 Where is the Energy Consumed?

larger model => more memory references => more energy

[The energy table of Figure 1 is repeated, with the question:]

how to make deep learning more efficient?

Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

10 Application as a Black Box

Algorithm Spec 2006

Hardware

CPU

11 Open the Box before Hardware Design

? Algorithm

Hardware

?PU

Breaks the boundary between algorithm and hardware

12 Proposed Paradigm

How to compress a deep learning model?

Compression Acceleration Regularization 14 Learning both Weights and Connections for Efficient Neural Networks

Han et al. NIPS 2015

Compression Acceleration Regularization 15 Pruning Neural Networks

Pruning Trained Quantization Huffman Coding 16 Pruning Happens in Human Brain

Newborn: 50 Trillion Synapses  ->  1 year old: 1000 Trillion Synapses  ->  Adolescent: 500 Trillion Synapses

Christopher A Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.

Pruning Trained Quantization Huffman Coding 17 [Han et al. NIPS’15] Pruning AlexNet

CONV Layer: 3x FC Layer: 10x

Pruning Trained Quantization Huffman Coding 18 [Han et al. NIPS’15] Pruning Neural Networks

-0.01x^2 + x + 1

60 million connections -> 6 million connections: 10x fewer connections

Pruning Trained Quantization Huffman Coding 19 [Han et al. NIPS’15] Pruning Neural Networks

[Chart axes: accuracy loss (0.5% to -4.5%) vs. parameters pruned away (40% to 100%).]

Pruning Trained Quantization Huffman Coding 20 [Han et al. NIPS’15] Pruning Neural Networks

[Chart: the "Pruning" curve on the same axes (accuracy loss vs. parameters pruned away).]

Pruning Trained Quantization Huffman Coding 21 [Han et al. NIPS’15] Retrain to Recover Accuracy

[Chart: "Pruning" vs. "Pruning+Retraining" curves (accuracy loss vs. parameters pruned away).]

Pruning Trained Quantization Huffman Coding 22 [Han et al. NIPS’15] Iteratively Retrain to Recover Accuracy

[Chart: "Pruning", "Pruning+Retraining", and "Iterative Pruning and Retraining" curves (accuracy loss vs. parameters pruned away).]
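A minimal sketch of the prune-retrain loop behind these curves (plain numpy; the quantile-based threshold schedule and the train_fn callback are illustrative stand-ins, not the paper's exact per-layer thresholding):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def iterative_prune(weights, train_fn, final_sparsity=0.9, steps=5, epochs_per_step=2):
    """Prune -> retrain -> prune ..., raising the sparsity gradually, as in the iterative curve."""
    mask = np.ones_like(weights)
    for step in range(1, steps + 1):
        sparsity = final_sparsity * step / steps
        weights, mask = prune_by_magnitude(weights, sparsity)
        for _ in range(epochs_per_step):
            weights = train_fn(weights) * mask   # retrain; removed connections stay at zero
    return weights, mask
```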

Pruning Trained Quantization Huffman Coding 23 [Han et al. NIPS’15] Pruning RNN and LSTM

Image Captioning

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

*Image from Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions"

Pruning Trained Quantization Huffman Coding 24 [Han et al. NIPS’15] Pruning RNN and LSTM

• Original:   a basketball player in a white uniform is playing with a ball
  Pruned 90%: a basketball player in a white uniform is playing with a basketball

• Original:   a brown dog is running through a grassy field
  Pruned 90%: a brown dog is running through a grassy area

• Original:   a man is riding a surfboard on a wave
  Pruned 90%: a man in a wetsuit is riding a wave on a beach

• Original:   a soccer player in red is running in the field
  Pruned 95%: a man in a red shirt and black and white black shirt is running through a field

Pruning Trained Quantization Huffman Coding 25 Exploring the Granularity of Sparsity that is Hardware-friendly

4 types of pruning granularity

fully-dense model => irregular sparsity => regular sparsity => more regular sparsity

[Han et al, NIPS’15] [Molchanov et al, ICLR’17]
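A small illustration of how the granularity changes the mask structure (plain numpy; the mapping of fine/vector/kernel/channel granularity onto the slide's "irregular / regular / more regular" labels and the 50% sparsity target are my assumptions for the sketch):

```python
import numpy as np

def mask_at_granularity(w, granularity, sparsity=0.5):
    """Zero the lowest-magnitude groups of a conv tensor (O, I, kh, kw) at a given granularity."""
    O, I, kh, kw = w.shape
    if granularity == "fine":        # irregular: score individual weights
        scores, shape = np.abs(w), w.shape
    elif granularity == "vector":    # regular: score rows of each kernel
        scores, shape = np.abs(w).mean(axis=3), (O, I, kh, 1)
    elif granularity == "kernel":    # more regular: score whole kh x kw kernels
        scores, shape = np.abs(w).mean(axis=(2, 3)), (O, I, 1, 1)
    elif granularity == "channel":   # most regular: score whole output channels
        scores, shape = np.abs(w).mean(axis=(1, 2, 3)), (O, 1, 1, 1)
    keep = scores > np.quantile(scores, sparsity)
    return np.broadcast_to(keep.reshape(shape), w.shape).astype(w.dtype)

w = np.random.randn(64, 32, 3, 3)
for g in ["fine", "vector", "kernel", "channel"]:
    print(g, "fraction kept:", mask_at_granularity(w, g).mean())
```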

26 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Han et al. ICLR 2016 Best Paper

Pruning Trained Quantization Huffman Coding 28 [Han et al. ICLR’16] Trained Quantization

2.09, 2.12, 1.92, 1.87  =>  2.0

Pruning Trained Quantization Huffman Coding 29 [Han et al. ICLR'16] Trained Quantization

Figure 2: Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.

weights (32-bit float):             cluster index (2-bit uint):    centroids:
  2.09  -0.98   1.48   0.09           3  0  2  1                    3:  2.00
  0.05  -0.14  -1.08   2.12           1  1  0  3                    2:  1.50
 -0.91   1.92   0.00  -1.03           0  3  1  0                    1:  0.00
  1.87   0.00   1.53   1.49           3  1  2  2                    0: -1.00

gradient:
 -0.03  -0.01   0.03   0.02
 -0.01   0.01  -0.02   0.12
 -0.01   0.02   0.04   0.01
 -0.07  -0.02   0.01  -0.02

The gradients are grouped by cluster index, reduced (summed) per cluster, multiplied by the learning rate (lr), and subtracted from the centroids of the previous iteration to produce the fine-tuned centroids.

Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom). Pruning Trained Quantization Huffman Coding 30

We store the sparse structure that results from pruning using compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number of non-zero elements and n is the number of rows or columns. To compress further, we store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger than the bound, we use the zero-padding solution shown in Figure 2: when the difference exceeds 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
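A sketch of the relative-index encoding with filler zeros described above (plain Python; the 4-bit index width and the example row are made up, and the exact offset convention may differ slightly from the paper's figure):

```python
def encode_relative(weights, index_bits=4):
    """Store (index_diff, value) pairs; insert filler zeros when the gap overflows."""
    max_gap = (1 << index_bits) - 1          # largest representable index difference
    encoded, last = [], -1
    for i, w in enumerate(weights):
        if w == 0:
            continue
        gap = i - last
        while gap > max_gap:                 # pad with a filler zero to keep gaps representable
            encoded.append((max_gap, 0.0))
            gap -= max_gap
        encoded.append((gap, w))
        last = i
    return encoded

# Sparse row with a long run of zeros between the two non-zeros.
row = [0.0] * 40
row[3], row[20] = 1.7, -0.5
print(encode_relative(row))   # [(4, 1.7), (15, 0.0), (2, -0.5)]
```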

3 TRAINED QUANTIZATION AND WEIGHT SHARING

Network quantization and weight sharing further compress the pruned network by reducing the number of bits required to represent each weight. We limit the number of effective weights we need to store by having multiple connections share the same weight, and then fine-tune those shared weights. Weight sharing is illustrated in Figure 3. Suppose we have a layer with 4 input neurons and 4 output neurons; the weight matrix is 4 x 4. On the top left is the 4 x 4 weight matrix, and on the bottom left is the 4 x 4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors); all the weights in the same bin share the same value, so for each weight we only need to store a small index into a table of shared weights. During the update, all the gradients are grouped by color and summed together, multiplied by the learning rate, and subtracted from the shared centroids of the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each CONV layer, and 5 bits (32 shared weights) for each FC layer, without any loss of accuracy.
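A minimal sketch of weight sharing by k-means and of the grouped-gradient centroid update described above (plain numpy on the 4x4 example of Figure 3; a real implementation would hook into the training framework's backward pass):

```python
import numpy as np

def kmeans_quantize(weights, k=4, iters=50):
    """Cluster weights into k shared values; return (codebook, index per weight)."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)   # linear initialization over [min, max]
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    return centroids, idx.reshape(weights.shape)

def update_centroids(centroids, idx, grad, lr=0.01):
    """Fine-tune shared weights: group gradients by cluster, sum, apply one SGD step."""
    for c in range(len(centroids)):
        if np.any(idx == c):
            centroids[c] -= lr * grad[idx == c].sum()
    return centroids

w = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])
codebook, idx = kmeans_quantize(w, k=4)
w_quant = codebook[idx]                       # layer now needs only 2-bit indices + 4 centroids
codebook = update_centroids(codebook, idx, grad=np.random.randn(*w.shape) * 0.01)
```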

To calculate the compression rate, given k clusters, we only need log2(k) bits to encode the index. In general, for a network with n connections and each connection is represented with b bits, constraining the connections to have only k shared weights will result in a compression rate of:

    r = nb / (n * log2(k) + k * b)        (1)

For example, Figure 3 shows the weights of a single-layer neural network with four input units and four output units. There are 4 x 4 = 16 weights originally, but there are only 4 shared weights: similar weights are grouped together to share the same value. Originally we need to store 16 weights of 32 bits each; after quantization we store only 4 shared 32-bit values plus 16 2-bit indices.
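Plugging the figure's numbers into Equation 1 as a quick check (the 4096x4096 layer in the second line is just an illustrative FC-layer size, not a result from the paper):

```python
from math import log2

def compression_rate(n, b, k):
    """Equation 1: n weights of b bits each, quantized to k shared values."""
    return (n * b) / (n * log2(k) + k * b)

print(compression_rate(n=16, b=32, k=4))              # 3.2x for the 4x4 example in Figure 3
print(compression_rate(n=4096 * 4096, b=32, k=32))    # ~6.4x for a 4096x4096 layer at 5 bits
```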

[Han et al. ICLR'16] After Trained Quantization: Discrete Weights

[Histogram: count vs. weight value; after quantization the weights take only a small set of discrete values.]

Pruning Trained Quantization Huffman Coding 31 [Han et al. ICLR'16] After Trained Quantization: Discrete Weights after Training

[Histogram: count vs. weight value after fine-tuning the shared weights.]

Pruning Trained Quantization Huffman Coding 32 [Han et al. ICLR’16] How Many Bits do We Need?

Pruning Trained Quantization Huffman Coding 33

Table 4.9: Comparison of uniform quantization and non-uniform quantization (this work) with different update methods. -c: update centroids only; -c+l: update both centroids and labels. Baseline ResNet-50 accuracy: 76.15% (top-1), 92.87% (top-5). All results are after retraining.

Quantization Method          1 bit     2 bit     4 bit     6 bit     8 bit
Uniform (Top-1)              -         59.33%    74.52%    75.49%    76.15%
Uniform (Top-5)              -         82.39%    91.97%    92.60%    92.91%
Non-uniform -c (Top-1)       24.08%    68.41%    76.16%    76.13%    76.20%
Non-uniform -c (Top-5)       48.57%    88.49%    92.85%    92.91%    92.88%
Non-uniform -c+l (Top-1)     24.71%    69.36%    76.17%    76.21%    76.19%
Non-uniform -c+l (Top-5)     49.84%    89.03%    92.87%    92.89%    92.90%

[Han et al. ICLR'16] How Many Bits do We Need?

[Chart: top-1 accuracy (73%-77%) vs. number of bits per weight (2, 4, 6, 8 bits) for ResNet-50 after fine-tuning, comparing non-uniform and uniform quantization; the full-precision top-1 accuracy is 76.15%.]

Figure 4.10: Non-uniform quantization performs better than uniform quantization. Pruning Trained Quantization Huffman Coding 34

With non-uniform quantization (this work), all the layers of the baseline ResNet-50 can be compressed to 4 bits without losing accuracy. For uniform quantization, however, all the layers of the baseline ResNet-50 can only be compressed to 8 bits without losing accuracy (at 4 bits, there is about a 1.6% top-1 accuracy loss when using uniform quantization). The advantage of non-uniform quantization is that it can better capture the non-uniform distribution of the weights: where the probability density is higher, the centroids are closer together, which uniform quantization cannot achieve. Table 4.9 compares the performance of the two non-uniform quantization strategies. During fine-tuning, one strategy is to update only the centroids; the other is to update both the centroids and the labels (the label indicates which centroid a weight belongs to). Intuitively, the latter has more degrees of freedom in the learning process and should give better performance. However, experiments show that the improvement is not significant, as shown in the third and fourth rows of Table 4.9.
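A toy comparison of the two schemes (illustrative only; the synthetic Gaussian weights and the bit-widths are not from the thesis): uniform quantization spreads the 2^b levels evenly over the weight range, while non-uniform quantization places k-means centroids where the weight density is high, which typically gives a lower quantization error.

```python
import numpy as np

def uniform_levels(w, bits):
    return np.linspace(w.min(), w.max(), 2 ** bits)

def kmeans_levels(w, bits, iters=100):
    levels = uniform_levels(w, bits).copy()
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
        for c in range(len(levels)):
            if np.any(idx == c):
                levels[c] = w[idx == c].mean()
    return levels

def quantize_mse(w, levels):
    q = levels[np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)]
    return np.mean((w - q) ** 2)

w = np.random.randn(100_000) * 0.05          # weights concentrated near zero
for bits in (2, 4):
    print(bits, "uniform:", quantize_mse(w, uniform_levels(w, bits)),
                "non-uniform:", quantize_mse(w, kmeans_levels(w, bits)))
```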

We use scaled gradients for the 32-bit latent weights:

    ∂L/∂w̃_l = W_l^p × ∂L/∂w_l^t    if  w̃_l  >  ∆_l
    ∂L/∂w̃_l = 1     × ∂L/∂w_l^t    if |w̃_l| <= ∆_l        (8)
    ∂L/∂w̃_l = W_l^n × ∂L/∂w_l^t    if  w̃_l  < -∆_l

Note that we use the scalar 1 as the gradient factor for zero weights. The overall quantization process is illustrated in Figure 1. The evolution of the ternary weights from different layers during training is shown in Figure 2. We observe that as training proceeds, different layers behave differently: for the first quantized conv layer, the absolute values of W_l^p and W_l^n get smaller and sparsity gets lower, while for the last conv layer and the fully connected layer, the absolute values of W_l^p and W_l^n get larger and sparsity gets higher. We learn the ternary assignments (indices into the codebook) by updating the latent full-resolution weights during training. This may cause the assignments to change between iterations. Note that the thresholds are not constants, as the maximal absolute values change over time. Once an updated weight crosses the threshold, its ternary assignment is changed. The benefits of using trained quantization factors are: i) the asymmetry W_l^p ≠ W_l^n gives the network more model capacity; ii) quantized weights play the role of "learning rate multipliers" during back propagation.

More Aggressive Compression: Trained Ternary Quantization

3.2 QUANTIZATION HEURISTIC

In previous work on ternary weight networks, Li & Liu (2016) proposed Ternary Weight Networks (TWN) using ±∆_l as thresholds to reduce 32-bit weights to ternary values, where ∆_l is defined as in Equation 5. They optimized the value of ±∆_l by minimizing the expected L2 distance between the full-precision weights and the ternary weights. Instead of using a strictly optimized threshold, we adopt different heuristics: 1) use the maximum absolute value of the weights as a reference for the layer's threshold and maintain a constant factor t for all layers:

    ∆_l = t × max(|w̃|)        (9)

and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on CIFAR-10 and ImageNet, and use the second one to explore a wider range of sparsities in Section 5.1.1.

[Figure: the full-precision weights are normalized, quantized to {-1, 0, +1} with the threshold ±t, and then scaled by the trained factors Wn and Wp to give the final ternary weights {-Wn, 0, Wp}; gradients flow back to both the latent weights and the scaling factors, and only the ternary weights are used at inference time.]
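A sketch of the ternary forward quantization with the Equation 9 threshold and t = 0.05 (plain numpy; here Wp and Wn are set to the means of the weights above/below threshold purely for illustration, whereas TTQ learns them by back propagation):

```python
import numpy as np

def ternarize(w_latent, t=0.05):
    """Quantize latent full-precision weights to {-Wn, 0, +Wp} with a per-layer threshold."""
    delta = t * np.max(np.abs(w_latent))        # Equation 9: threshold from the max |w|
    pos, neg = w_latent > delta, w_latent < -delta
    # Illustrative scaling factors; TTQ learns Wp and Wn with their own gradients instead.
    Wp = w_latent[pos].mean() if pos.any() else 0.0
    Wn = -w_latent[neg].mean() if neg.any() else 0.0
    w_ternary = np.zeros_like(w_latent)
    w_ternary[pos], w_ternary[neg] = Wp, -Wn
    sparsity = 1.0 - (pos | neg).mean()         # fraction of weights quantized to zero
    return w_ternary, sparsity

w = np.random.randn(256, 256) * 0.1
wq, s = ternarize(w)
print("sparsity:", round(s, 3))
```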

[Figure 2 panels: ternary weight values (Wn, Wp) and the percentage of negatives, zeros, and positives over 150 training epochs for layers res1.0/conv1, res3.2/conv2, and linear of ResNet-20.]

Figure 2: Ternary weights value (above) and distribution (below) with iterations for different layers of ResNet-20 on CIFAR-10.


[Han et al. ICLR'16] Results: Compression Ratio

Network        Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy

LeNet-300      1070 KB         27 KB             40x                 98.36%              98.42%
LeNet-5        1720 KB         44 KB             39x                 99.20%              99.26%
AlexNet        240 MB          6.9 MB            35x                 80.27%              80.30%
VGGNet         550 MB          11.3 MB           49x                 88.68%              89.09%
Inception-V3   91 MB           4.2 MB            22x                 93.56%              93.67%
ResNet-50      97 MB           5.8 MB            17x                 92.87%              93.04%

Can we make compact models to begin with?

Compression Acceleration Regularization 36 SqueezeNet

Input (64 channels)
  Squeeze: 16 1x1 Conv
  Expand: 64 1x1 Conv + 64 3x3 Conv
  Concat/Eltwise -> Output (128 channels)

Vanilla Fire module

Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016
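A sketch of the vanilla Fire module drawn above, in PyTorch (the 64 -> squeeze 16 -> expand 64+64 channel counts follow the slide; the layer names are mine):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze with 1x1 convs, then expand with parallel 1x1 and 3x3 convs, then concat."""
    def __init__(self, in_ch=64, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)   # 64 + 64 = 128 channels

x = torch.randn(1, 64, 56, 56)
print(Fire()(x).shape)   # torch.Size([1, 128, 56, 56])
```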

Compression Acceleration Regularization 37 Compressing SqueezeNet

Network      Approach           Size      Ratio   Top-1 Accuracy   Top-5 Accuracy

AlexNet      -                  240 MB    1x      57.2%            80.3%
AlexNet      SVD                48 MB     5x      56.0%            79.4%
AlexNet      Deep Compression   6.9 MB    35x     57.2%            80.3%
SqueezeNet   -                  4.8 MB    50x     57.5%            80.3%
SqueezeNet   Deep Compression   0.47 MB   510x    57.5%            80.3%

Compression Acceleration Regularization 38 Results: Speedup

Baseline: mAP = 59.47 / 28.48 / 45.43, FLOP = 17.5G, # Parameters = 6.0M
Pruned:   mAP = 59.30 / 28.33 / 47.72, FLOP = 8.9G,  # Parameters = 2.5M

[Chart: frames per second at batch sizes 1, 8, and 32; the pruned model runs about 1.6x faster than the baseline.]

39 Deep Compression Applied to Industry

Deep Compression

Compression Acceleration Regularization 40 A Facebook AR prototype backed by Deep Compression

- 8x model size reduction by Deep Compression

- Run on mobile

- Source: Facebook F8’17

41 EIE: Efficient Inference Engine on Compressed Deep Neural Network

Han et al. ISCA 2016

42 Deep Learning Accelerators

• First Wave: Compute (Neu Flow)

• Second Wave: Memory (Diannao family)

• Third Wave: Algorithm / Hardware Co-Design (EIE)

Google TPU: “This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs”

43 [Han et al. ISCA’16] EIE: the First DNN Accelerator for Sparse, Compressed Model

0 * A = 0    W * 0 = 0    2.09, 1.92 => 2.0

Sparse Weight: 90% static sparsity -> 10x less computation, 5x less memory footprint
Sparse Activation: 70% dynamic sparsity -> 3x less computation
Weight Sharing: 4-bit weights -> 8x less memory footprint

Compression Acceleration Regularization 44 [Han et al. ISCA’16] EIE: Parallelization on Sparsity

a = [0, a1, 0, a3]   (sparse input activations)

W (an 8x4 sparse weight matrix; rows are interleaved over PE0-PE3):
PE0:  w0,0  w0,1  0     w0,3
PE1:  0     0     w1,2  0
PE2:  0     w2,1  0     w2,3
PE3:  0     0     0     0
PE0:  0     0     w4,2  w4,3
PE1:  w5,0  0     0     0
PE2:  0     0     0     w6,3
PE3:  0     w7,1  0     0

b = W × a = [b0, b1, -b2, b3, -b4, b5, b6, -b7]  =>  after ReLU: [b0, b1, 0, b3, 0, b5, b6, 0]

Compression Acceleration Regularization 45

[Han et al. ISCA’16] EIE: Parallelization on Sparsity

[The same sparse matrix-vector example, now mapped onto an array of processing elements (PEs) coordinated by a Central Control unit; each PE holds the rows assigned to it.]

Compression Acceleration Regularization 46

[Han et al. ISCA’16] Dataflow

[The same sparse matrix-vector example, annotated with the dataflow: non-zero activations are broadcast, and each PE accumulates into its own slice of the output.]

rule of thumb: 0 * A = 0 W * 0 = 0
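A software sketch of this rule of thumb (plain numpy, CSC-style storage with a shared-weight codebook; it only illustrates the skipping logic, while the real EIE additionally interleaves rows across 64 PEs and uses relative indices):

```python
import numpy as np

def eie_matvec(col_ptr, row_idx, w_idx, codebook, a, n_rows):
    """y = ReLU(W @ a) for a CSC-stored W whose values are codebook indices."""
    y = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0:                                    # skip zero activations: W * 0 = 0
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):    # only stored non-zeros: 0 * A = 0
            y[row_idx[p]] += codebook[w_idx[p]] * aj   # decode shared weight, accumulate
    return np.maximum(y, 0)                            # ReLU keeps the next activations sparse

codebook = np.array([-1.0, 0.0, 1.5, 2.0])   # 4 shared weights (2-bit indices here; EIE uses 4-bit)
col_ptr  = [0, 1, 2, 3, 5]                   # CSC column pointers
row_idx  = [0, 2, 1, 0, 3]                   # row of each stored non-zero
w_idx    = [3, 2, 0, 2, 3]                   # codebook index of each stored non-zero
a        = np.array([0.0, 2.0, 0.0, 1.0])    # sparse input activations
print(eie_matvec(col_ptr, row_idx, w_idx, codebook, a, n_rows=4))   # [1.5 0. 3. 2.]
```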

Compression Acceleration Regularization 47

EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, William J. Dally
Stanford University / NVIDIA

[Han et al. ISCA’16] EIE Architecture

[Slide shows the first page of the EIE paper. Figure 1: efficient inference engine that works on the compressed deep neural network model; encoded weights and relative indices are decoded through a weight look-up table and index accumulation before the ALU. Going from DRAM to SRAM gives EIE 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression, and 24,000x and 3,400x more energy efficient, respectively.]

rule of thumb: 0 * A = 0    W * 0 = 0    2.09, 1.92 => 2.0

Compression Acceleration Regularization 48

[Han et al. ISCA’16] Post Layout Result of EIE
In 1998, Lecun et lation is quite suitable for customized hardware [10]–[15]. al. classified handwritten digits with less than 1M parame- However, it has been found that fully-connected (FC) layers, ters [5], while in 2012, Krizhevsky et al. won the ImageNet widely used in RNN and LSTMs, are bandwidth limited competition with 60M parameters [1]. Deepface classified on large networks [14]. Unlike CONV layers, there is no human faces with 120M parameters [6]. Neural Talk [7] parameter reuse in FC layers. Data batching has become automatically converts image to natural language with 130M an efficient solution when training networks on CPUs or CNN parameters and 100M RNN parameters. Coates et GPUs, however, it is unsuitable for real-time applications al. scaled up a network to 10 billion parameters on HPC with latency requirements. systems [8]. Network compression via pruning and weight sharing Large DNN models are very powerful but consume large [16] makes it possible to fit modern networks such as amounts of energy because the model must be stored in AlexNet (60M parameters, 240MB), and VGG-16 (130M external DRAM, and fetched every time for each image, parameters, 520MB) in on-chip SRAM. Processing these [Han et al. ISCA’16] Post Layout Result of EIE

Technology 40 nm

# PEs 64

on-chip SRAM 8 MB

Max Model Size 84 million parameters

Static Sparsity 10x

Dynamic Sparsity 3x

Quantization 4-bit

ALU Width 16-bit

Area 40.8 mm^2

MxV Throughput 81,967 layers/s

Power 586 mW

1. Post layout result 2. Throughput measured on AlexNet FC-7

Compression Acceleration Regularization 49 [Han et al. ISCA’16] Speedup on EIE

[Slide shows Figure 6 from the EIE paper: speedups of GPU, mobile GPU, and EIE compared with CPU running the uncompressed DNN model (no batching) on Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, NT-LSTM, and their geometric mean; EIE's geometric-mean speedup over the dense CPU baseline is 189x.]

Compression Acceleration Regularization 50

[Han et al. ISCA’16] Energy Efficiency on EIE

[Slide shows Figure 7 from the EIE paper: energy efficiency of GPU, mobile GPU, and EIE compared with CPU on the same nine benchmarks (no batching); EIE is roughly 24,000x more energy efficient than CPU and 3,400x more than GPU.]

Compression Acceleration Regularization 51
FLOP% Descriptiontivation read and write access a local register and activation We compare the performance on two sets of models: bypassing is employed to avoid a pipeline hazard. Using switching activity interchange format (SAIF), and estimated 9216, uncompressed DNN model and the compressed DNN model. 64 PEs running at 800MHz yields a performance of 102 Alex-6 9% 35.1% 3% the power using Prime-Time PX. 4096 Compressed 4096, AlexNet [1] for Alex-7 9% 35.3% 3% Comparison Baseline. We compare EIE with three dif- 4096 large scale image ferent off-the-shelf computing units: CPU, GPU and mobile 4096, classification Alex-8 25% 37.5% 10% GPU. 1000 25088, VGG-6 4% 18.3% 1% Compressed 1) CPU. We use Intel Core i-7 5930k CPU, a Haswell-E 4096 VGG-16 [3] for 4096, class processor, that has been used in NVIDIA Digits Deep VGG-7 4% 37.5% 2% large scale image 4096 Learning Dev Box as a CPU baseline. To run the benchmark classification and 4096, VGG-8 23% 41.1% 9% object detection on CPU, we used MKL CBLAS GEMV to implement the 1000 original dense model and MKL SPBLAS CSRMV for the 4096, Compressed NT-We 10% 100% 10% compressed sparse model. CPU socket and DRAM power 600 NeuralTalk [7] 600, with RNN and NT-Wd 11% 100% 11% are as reported by the pcm-power utility provided by Intel. 8791 LSTM for 1201, automatic 2) GPU. We use NVIDIA GeForce GTX Titan X GPU, NTLSTM 10% 100% 11% a state-of-the-art GPU for deep learning as our baseline 2400 image captioning using nvidia-smi utility to report the power. To run the benchmark, we used cuBLAS GEMV to implement in [16], [23]. The benchmark networks have 9 layers in total the original dense layer. For the compressed sparse layer, obtained from AlexNet, VGGNet, and NeuralTalk. We use we stored the sparse matrix in in CSR format, and used the Image-Net dataset [29] and the Caffe [28] deep learning cuSPARSE CSRMV kernel, which is optimized for sparse framework as golden model to verify the correctness of the matrix-vector multiplication on GPUs. hardware design. 3) Mobile GPU. We use NVIDIA Tegra K1 that has 192 CUDA cores as our mobile GPU baseline. We used VI. EXPERIMENTAL RESULTS cuBLAS GEMV for the original dense model and cuS- Figure 5 shows the layout (after place-and-route) of PARSE CSRMV for the compressed sparse model. Tegra K1 an EIE processing element. The power/area breakdown is doesn’t have software interface to report power consumption, shown in Table II. We brought the critical path delay down so we measured the total power consumption with a power- to 1.15ns by introducing 4 pipeline stages to update one meter, then assumed 15% AC to DC conversion loss, 85% activation: codebook lookup and address accumulation (in regulator efficiency and 15% power consumed by peripheral parallel), output activation read and input activation multiply components [26], [27] to report the AP+DRAM power for (in parallel), shift and add, and output activation write. Ac- Tegra K1. tivation read and write access a local register and activation Benchmarks. We compare the performance on two sets bypassing is employed to avoid a pipeline hazard. Using of models: uncompressed DNN model and the compressed 64 PEs running at 800MHz yields a performance of 102 DNN model. The uncompressed DNN model is obtained GOP/s. Considering 10× weight sparsity and 3× activation from Caffe model zoo [28] and NeuralTalk model zoo [7]; sparsity, this requires a dense DNN accelerator 3TOP/s to The compressed DNN model is produced as described have equivalent application throughput. [Han et al. 
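To make the sparse-matrix read and arithmetic units above concrete, here is a minimal NumPy functional model of one PE's share of the compressed sparse matrix-vector product. It is a sketch of the dataflow only; the function and variable names are mine, and it ignores the 4-bit packing, the 64-bit SRAM fetch width and the pipelining described in the paper.

```python
import numpy as np

def eie_pe_spmv(columns, codebook, activations, out_len):
    """Functional model of one EIE PE's slice of b = W @ a.

    columns[j] holds this PE's slice of column j as (weight_index, row) pairs,
    standing in for the 4-bit encoded (v, x) entries between pointers p_j and
    p_j+1. codebook maps a weight index back to its shared real value.
    """
    b = np.zeros(out_len)                         # destination activation registers
    for j, a_j in enumerate(activations):
        if a_j == 0.0:                            # LNZD: only non-zero activations are broadcast
            continue
        for weight_idx, row in columns[j]:        # sparse-matrix read unit
            b[row] += codebook[weight_idx] * a_j  # arithmetic unit: b_x += v * a_j
    return b

# Tiny usage example with a pruned, codebook-quantized matrix (values are illustrative).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6)) * (rng.random((8, 6)) < 0.3)   # roughly 70% of weights pruned
codebook = np.unique(np.round(W[W != 0], 1))               # stand-in for shared weights
nearest = lambda v: int(np.abs(codebook - v).argmin())
columns = [[(nearest(W[i, j]), i) for i in range(8) if W[i, j] != 0] for j in range(6)]
a = rng.normal(size=6) * (rng.random(6) < 0.7)             # sparse input activations
b = eie_pe_spmv(columns, codebook, a, out_len=8)           # approximates W @ a
```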
[Han et al. ISCA’16] Comparison: Throughput

Bar chart: Throughput (Layers/s, log scale from 1E+00 to 1E+06) across platforms: Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE with 64 PEs (ASIC, 45nm), and EIE with 256 PEs (ASIC, 28nm).

Compression Acceleration Regularization 52 [Han et al. ISCA’16] Comparison: Energy Efficiency

Bar chart: Energy Efficiency (Layers/J, log scale from 1E+00 to 1E+06) across the same platforms: Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE with 64 PEs (ASIC, 45nm), and EIE with 256 PEs (ASIC, 28nm).
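The two comparison charts report throughput in layers per second and energy efficiency in layers per joule. As a minimal illustration of how those axis values follow from a measured per-layer latency and average power (the numbers below are hypothetical, not taken from the charts):

```python
def layers_per_second(latency_s: float) -> float:
    """Throughput for one FC layer, given its measured latency in seconds."""
    return 1.0 / latency_s

def layers_per_joule(latency_s: float, avg_power_w: float) -> float:
    """Energy efficiency: layers processed per joule (energy = power * time)."""
    return 1.0 / (latency_s * avg_power_w)

# Hypothetical measurements for illustration only:
# an accelerator running a layer in 1 ms at 0.6 W vs. a CPU taking 4 ms at 100 W.
print(layers_per_second(1e-3), layers_per_joule(1e-3, 0.6))    # 1000 layers/s, ~1667 layers/J
print(layers_per_second(4e-3), layers_per_joule(4e-3, 100.0))  # 250 layers/s, 2.5 layers/J
```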

Compression Acceleration Regularization 53 Feedforward => ?

Compression Acceleration Regularization 54

[Han et al. FPGA’17] ESE Architecture

Block diagram (panels (a) and (b)). Overall system: a software program runs on the CPU and exchanges data with external memory; the FPGA contains a PCIE controller, a MEM controller and the ESE accelerator, in which the ESE controller manages channels (Channel 0, Channel 1, ...), each holding multiple processing elements (PE 0 ... PE N) on a shared data bus. Channel-level logic includes an adder tree, element-wise multipliers (ElemMul), a sigmoid/tanh unit and an element-wise stage that combines the Wx_t / Wy_t-1 and Wc / c_t-1 terms to produce the LSTM cell state c_t and output h_t (Ht buffer). Each processing element contains an activation queue (ActQueue) fed by FIFOs carrying x_t / y_t-1 and m_t, pointer-read (PtrRead) and sparse-matrix-read (SpmatRead) units with pointer and weight buffers, a sparse matrix-vector (SpMV) unit with an accumulator (Accu), input/output/activation buffers, and an assemble stage that gathers per-PE results (y_t) into memory.
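As a rough software analogue of this datapath, the sketch below computes one LSTM step with pruned weights stored as SciPy CSR matrices and notes which ESE block each operation corresponds to. The gate stacking, names and shapes are my assumptions for illustration, not ESE's on-chip format.

```python
import numpy as np
from scipy import sparse

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_sparse(W_x, W_h, b, x_t, h_prev, c_prev):
    """One LSTM step with pruned weights stored as CSR matrices.

    W_x: (4H, X) sparse input weights, W_h: (4H, H) sparse recurrent weights,
    with gates stacked in (i, f, g, o) order (an assumption of this sketch).
    """
    H = h_prev.shape[0]
    # SpMV units compute the sparse matrix-vector products; their partial
    # sums are combined as by the channel's adder tree.
    z = W_x @ x_t + W_h @ h_prev + b
    i, f, g, o = z[0:H], z[H:2*H], z[2*H:3*H], z[3*H:4*H]
    # Sigmoid/Tanh unit plus the element-wise multipliers (ElemMul / Elt-wise).
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t

# Tiny example with 90% of the weights pruned away.
X, H = 16, 8
W_x = sparse.random(4 * H, X, density=0.1, format="csr", random_state=0)
W_h = sparse.random(4 * H, H, density=0.1, format="csr", random_state=1)
rng = np.random.default_rng(2)
h_t, c_t = lstm_step_sparse(W_x, W_h, np.zeros(4 * H), rng.normal(size=X),
                            np.zeros(H), np.zeros(H))
```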

Compression Acceleration Regularization 55

ESE is now on AWS Marketplace with Xilinx VU9P FPGA

https://aws.amazon.com/marketplace/pp/B079N2J42R

58 Thank you! [email protected]

Smart, Fast, Efficient

Summary: Pruning and Quantization for model compression (Han et al. NIPS’15; Han et al. ICLR’16, best paper award), accelerated inference with EIE (Han et al. ISCA’16) and ESE (Han et al. FPGA’17, best paper award), and Dense-Sparse-Dense training as a regularizer (Pruning, sparsity constraint, then Re-Dense to increase model capacity; Han et al. ICLR’17).

[Han et al. ICLR’17, under review as a conference paper at ICLR 2017] DSD: Dense-Sparse-Dense Training

Figure 1: Dense-Sparse-Dense Training Flow. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting.

Algorithm 1: Workflow of DSD training
Initialization: W^(0) with W^(0) ~ N(0, Σ). Output: W^(t).
Initial dense phase: while not converged do W^(t) = W^(t-1) − η^(t) ∇f(W^(t-1); x^(t-1)); t = t + 1; end
Sparse phase: initialize the mask by sorting and keeping the top-k weights: S = sort(|W^(t-1)|); λ = S_k; Mask = 1(|W^(t-1)| > λ);
while not converged do W^(t) = W^(t-1) − η^(t) ∇f(W^(t-1); x^(t-1)); W^(t) = W^(t) · Mask; t = t + 1; end
Final dense phase: while not converged do W^(t) = W^(t-1) − η^(t) ∇f(W^(t-1); x^(t-1)); t = t + 1; end
goto the sparse phase for iterative DSD.

Excerpt from the DSD paper shown on this slide:

In contrast, simply reducing the model capacity would lead to the other extreme, causing the machine learning system to miss the relevant relationships between features and target outputs, leading to under-fitting and high bias. Bias and variance are hard to optimize at the same time.

Model compression methods (Han et al. (2016; 2015); Guo et al. (2016)) can reduce the model size by 35x-49x or more without hurting prediction accuracy. Compression without losing accuracy means there is significant redundancy in the trained model. Since the compressed model can achieve the same accuracy as the redundant uncompressed model, one hypothesis is that a model of the original size should have the capacity to achieve higher accuracy; this shows the inadequacy of current training methods, which fail to find the existing better solutions.

To find this expected higher accuracy, we propose dense-sparse-dense training (DSD), a novel training strategy that starts from a dense model obtained by conventional training, then regularizes the model with sparsity-constrained optimization, and finally increases the model capacity by restoring and retraining the pruned weights. At test time, the final model produced by DSD has the same architecture and dimensions as the original dense model, so DSD training incurs no inference overhead. We experimented with DSD training on 7 mainstream CNNs / RNNs / LSTMs and found consistent performance gains over the comparable counterparts for image classification, image captioning and speech recognition.

2 DSD Training Flow. Our DSD training employs a three-step process: dense, sparse, dense. Each step is illustrated in Figure 1 and Algorithm 1. The progression of the weight distribution is plotted in Figure 2.

59
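Algorithm 1 is straightforward to prototype. Below is a minimal NumPy sketch of the dense-sparse-dense schedule on a toy least-squares problem; the model, sparsity level, learning rates and step counts are arbitrary choices for illustration, whereas the paper applies the same schedule to deep networks trained with SGD.

```python
import numpy as np

def dsd_train(X, y, sparsity=0.5, lr=0.01, steps=200, seed=0):
    """Minimal sketch of the dense-sparse-dense schedule (Algorithm 1)
    on a toy least-squares model y ~ X @ w. Illustrative only."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])        # W^(0) ~ N(0, sigma)

    def grad(w):                                      # gradient of 0.5 * ||X w - y||^2 / n
        return X.T @ (X @ w - y) / len(y)

    # Initial dense phase: ordinary gradient descent on all weights.
    for _ in range(steps):
        w -= lr * grad(w)

    # Sparse phase: keep the top-k weights by magnitude and retrain under the mask.
    k = max(1, int((1 - sparsity) * w.size))
    lam = np.sort(np.abs(w))[-k]                      # lambda = k-th largest |w|
    mask = (np.abs(w) >= lam).astype(w.dtype)
    for _ in range(steps):
        w -= lr * grad(w)
        w *= mask                                     # enforce W = W * Mask

    # Final dense phase: lift the constraint (pruned weights restart from zero)
    # and retrain, here with a smaller learning rate.
    for _ in range(steps):
        w -= (lr * 0.1) * grad(w)
    return w

# Toy usage: recover a 10-dimensional linear model from 100 noisy samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.01 * rng.normal(size=100)
w_dsd = dsd_train(X, y)
```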
