
Efficient Methods and Hardware for Deep Learning

Song Han, Stanford University, May 25, 2017

Intro

Song Han, PhD Candidate, Stanford · Bill Dally, Chief Scientist and Stanford Professor

2 Deep Learning is Changing Our Lives

Self-Driving Car Machine Translation


3 AlphaGo Smart Robots

3 Models are Getting Larger

Important Property of Neural Networks: results get better with more data + bigger models + more computation. (Better algorithms, new insights and improved techniques always help, too!)

IMAGE RECOGNITION: AlexNet (2012): 8 layers, 1.4 GFLOP, ~16% error -> ResNet (Microsoft, 2015): 152 layers, 22.6 GFLOP, ~3.5% error. 16X more training ops.

SPEECH RECOGNITION: Deep Speech 1 (Baidu, 2014): 80 GFLOP, 7,000 hrs of data, ~8% error -> Deep Speech 2 (Baidu, 2015): 465 GFLOP, 12,000 hrs of data, ~5% error. 10X more training ops.

Dally, NIPS’2016 workshop on Efficient Methods for Deep Neural Networks

4 The First Challenge: Model Size

Hard to distribute large models through over-the-air update


5 The Second Challenge: Speed

Network      Error rate   Training time
ResNet-18    10.76%       2.5 days
ResNet-50    7.02%        5 days
ResNet-101   6.21%        1 week
ResNet-152   6.16%        1.5 weeks

Such long training times limit ML researchers' productivity

Training time benchmarked with fb.resnet.torch using four M40 GPUs

6 The Third Challenge: Energy Efficiency

AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game

On mobile: drains battery. In the data center: increases TCO.

7 Where is the Energy Consumed?

larger model => more memory references => more energy

8 Where is the Energy Consumed?

larger model => more memory references => more energy

Operation              Energy [pJ]   Relative Energy Cost
32 bit int ADD         0.1           1
32 bit float ADD       0.9           9
32 bit Register File   1             10
32 bit int MULT        3.1           31
32 bit float MULT      3.7           37
32 bit SRAM Cache      5             50
32 bit DRAM Memory     640           6400

Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.
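To make the scale concrete, here is the back-of-the-envelope number used later in this deck (and in the deep compression paper): fetching the weights of a 1-billion-connection network from DRAM at 20 fps costs about 12.8 W, well beyond a mobile power budget. A minimal check in Python, using the 640 pJ DRAM figure from the table above:

```python
# Energy for fetching every weight of a 1-billion-connection network from DRAM,
# 20 times per second, at 640 pJ per 32-bit DRAM access (table above).
DRAM_ACCESS_J = 640e-12     # joules per 32-bit DRAM access
connections = 1e9           # one weight fetch per connection
frame_rate = 20             # inferences per second

power_watts = frame_rate * connections * DRAM_ACCESS_J
print(f"DRAM power for weight fetches: {power_watts:.1f} W")   # -> 12.8 W
```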


Where is the Energy Consumed?

larger model => more memory references => more energy

[Same energy table and caption as the previous slide.]

How to make deep learning more efficient?

Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

11 Application as a Black Box

Spec 2006 Algorithm


Hardware

CPU

12 Open the Box before Hardware Design

? Algorithm


Hardware

?PU

Breaks the boundary between algorithm and hardware

13 Agenda

              Inference                             Training
Algorithm     Algorithms for Efficient Inference    Algorithms for Efficient Training
Hardware      Hardware for Efficient Inference      Hardware for Efficient Training

Hardware 101: the Family

Hardware
  General Purpose*: CPU (latency oriented), GPU (throughput oriented)
  Specialized HW: FPGA (programmable logic), ASIC (fixed logic)

* including GPGPU

Hardware 101: Number Representation

(-1)^S x (1.M) x 2^E

Format        Bit fields         Range              Accuracy
FP32          S(1) E(8) M(23)    10^-38 - 10^38     .000006%
FP16          S(1) E(5) M(10)    6x10^-5 - 6x10^4   .05%
Int32         S(1) M(31)         0 - 2x10^9         ½
Int16         S(1) M(15)         0 - 6x10^4         ½
Int8          S(1) M(7)          0 - 127            ½
Fixed point   S I F (radix point between the integer and fraction fields)

Dally, High Performance Hardware for Machine Learning, NIPS'2015
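As a concrete illustration of the fixed-point format above (sign, integer and fraction fields split by a radix point), the sketch below rounds floats onto a signed 8-bit fixed-point grid with a chosen number of fraction bits. The bit width and rounding mode here are illustrative assumptions, not something the slide prescribes:

```python
import numpy as np

def float_to_fixed(x, frac_bits=4, total_bits=8):
    """Quantize float(s) to signed fixed-point with `frac_bits` fraction bits."""
    scale = 2 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(np.asarray(x) * scale), qmin, qmax).astype(np.int8)

def fixed_to_float(q, frac_bits=4):
    """Recover the (approximate) float value represented by the fixed-point code."""
    return q.astype(np.float32) / (2 ** frac_bits)

w = np.array([0.30, -1.27, 2.05], dtype=np.float32)
q = float_to_fixed(w)                 # e.g. [  5, -20,  33 ] with a 4-bit fraction
print(q, fixed_to_float(q))           # -> [ 0.3125 -1.25    2.0625 ]
```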

Hardware 101: Number Representation

Operation             Energy (pJ)   Area (µm²)
8b Add                0.03          36
16b Add               0.05          67
32b Add               0.1           137
16b FP Add            0.4           1360
32b FP Add            0.9           4184
8b Mult               0.2           282
32b Mult              3.1           3495
16b FP Mult           1.1           1640
32b FP Mult           3.7           7700
32b SRAM Read (8KB)   5             N/A
32b DRAM Read         640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare Library.

Agenda

              Inference                             Training
Algorithm     Algorithms for Efficient Inference    Algorithms for Efficient Training
Hardware      Hardware for Efficient Inference      Hardware for Efficient Training

Part 1: Algorithms for Efficient Inference

• 1. Pruning

• 2. Weight Sharing

• 3. Quantization

• 4. Low Rank Approximation

• 5. Binary / Ternary Net

• 6. Winograd Transformation

Pruning Neural Networks

[LeCun et al. NIPS'89] [Han et al. NIPS'15]

Pruning Trained Quantization Huffman Coding 24 [Han et al. NIPS’15] Pruning Neural Networks

-0.01x^2 + x + 1

60 million connections -> 6 million connections: 10x fewer connections

Pruning Trained Quantization Huffman Coding 25 [Han et al. NIPS’15] Pruning Neural Networks

[Chart: accuracy loss (+0.5% to -4.5%) vs. parameters pruned away (40%-100%).]

Pruning Trained Quantization Huffman Coding 26 [Han et al. NIPS’15] Pruning Neural Networks

[Chart: accuracy loss vs. parameters pruned away, with pruning only.]

Pruning Trained Quantization Huffman Coding 27 [Han et al. NIPS’15] Retrain to Recover Accuracy

[Chart: accuracy loss vs. parameters pruned away, comparing pruning with pruning + retraining.]

Pruning Trained Quantization Huffman Coding 28 [Han et al. NIPS’15] Iteratively Retrain to Recover Accuracy

[Chart: accuracy loss vs. parameters pruned away, comparing pruning, pruning + retraining, and iterative pruning and retraining.]
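A minimal NumPy sketch of the iterative prune-and-retrain loop shown in the charts above. The layer shape, sparsity schedule, and the stand-in `retrain` step (ordinary SGD with the pruning mask held fixed) are illustrative assumptions, not the exact recipe from Han et al.:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights so that `sparsity` fraction are removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def retrain(weights, mask, steps=100):
    """Placeholder fine-tuning: masked SGD steps so surviving weights can compensate."""
    for _ in range(steps):
        grad = np.random.randn(*weights.shape).astype(weights.dtype)  # stand-in gradient
        weights = (weights - 0.01 * grad) * mask                      # keep pruned weights at zero
    return weights

w = np.random.randn(256, 256).astype(np.float32)
mask = np.ones_like(w)
for target_sparsity in (0.5, 0.7, 0.9):        # prune a little, retrain, repeat
    w, mask = magnitude_prune(w, target_sparsity)
    w = retrain(w, mask)
print("final density:", mask.mean())           # ~0.1 of the weights remain
```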

Pruning Trained Quantization Huffman Coding 29 [Han et al. NIPS’15] Pruning RNN and LSTM

Image Captioning

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

*Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions", 2015. Figure copyright IEEE, 2015; reproduced for educational purposes. Slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016.

Pruning Trained Quantization Huffman Coding 30 [Han et al. NIPS’15] Pruning RNN and LSTM

• Original: a basketball player in a white uniform is playing with a ball
  Pruned 90%: a basketball player in a white uniform is playing with a basketball

• Original: a brown dog is running through a grassy field
  Pruned 90%: a brown dog is running through a grassy area

• Original: a man is riding a surfboard on a wave
  Pruned 90%: a man in a wetsuit is riding a wave on a beach

• Original: a soccer player in red is running in the field
  Pruned 95%: a man in a red shirt and black and white black shirt is running through a field

Pruning Trained Quantization Huffman Coding 31 Pruning Happens in Human Brain

Newborn: 50 Trillion Synapses -> 1 year old: 1000 Trillion Synapses -> Adolescent: 500 Trillion Synapses

Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.

Pruning Trained Quantization Huffman Coding 32 [Han et al. NIPS’15] Pruning Changes Weight Distribution

Before Pruning After Pruning After Retraining

Conv5 layer of AlexNet. Representative of other network layers as well.

Pruning Trained Quantization Huffman Coding 33 Part 1: Algorithms for Efficient Inference

• 1. Pruning

• 2. Weight Sharing

• 3. Quantization

• 4. Low Rank Approximation

• 5. Binary / Ternary Net

• 6. Winograd Transformation [Han et al. ICLR’16] Trained Quantization

2.09, 2.12, 1.92, 1.87  ->  2.0

Pruning Trained Quantization Huffman Coding 35

[Han et al. ICLR'16] Trained Quantization

Pruning (less number of weights): Train Connectivity -> Prune Connections -> Train Weights. 9x-13x size reduction, same accuracy as the original network.
Quantization (less bits per weight): Cluster the Weights -> Generate Code Book -> Quantize the Weights with Code Book -> Retrain Code Book. 27x-31x reduction, same accuracy.
Huffman Encoding: Encode Weights -> Encode Index. 35x-49x reduction, same accuracy.

Figure 1: The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate: between 27x and 31x. Huffman coding gives more compression: between 35x and 49x. The compression rate already includes the meta-data for the sparse representation. The compression scheme doesn't incur any accuracy loss.

32 bit weights -> 4 bit cluster indices: 8x less memory footprint.

Pruning Trained Quantization Huffman Coding 36

[From the paper:] ...features such as better privacy, less network bandwidth and real time processing, the large storage overhead prevents deep neural networks from being incorporated into mobile apps. The second issue is energy consumption. Running large neural networks requires a lot of memory bandwidth to fetch the weights and a lot of computation to do dot products, which in turn consumes considerable energy. Mobile devices are battery constrained, making power hungry applications such as deep neural networks hard to deploy. Energy consumption is dominated by memory access. Under 45nm CMOS technology, a 32 bit floating point add consumes 0.9pJ, a 32bit SRAM cache access takes 5pJ, while a 32bit DRAM memory access takes 640pJ, which is 3 orders of magnitude more than an add operation. Large networks do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1 billion connection neural network, for example, at 20fps would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM access - well beyond the power envelope of a typical mobile device. Our goal is to reduce the storage and energy required to run inference on such large networks so they can be deployed on mobile devices. To achieve this goal, we present "deep compression": a three-stage pipeline (Figure 1) to reduce the storage required by neural networks in a manner that preserves the original accuracy. First, we prune the network by removing the redundant connections, keeping only the most informative connections. Next, the weights are quantized so that multiple connections share the same weight, thus only the codebook (effective weights) and the indices need to be stored. Finally, we apply Huffman coding to take advantage of the biased distribution of effective weights. Our main insight is that pruning and trained quantization are able to compress the network without interfering with each other, thus leading to a surprisingly high compression rate. It makes the required storage so small (a few megabytes) that all weights can be cached on chip instead of going to off-chip DRAM, which is energy consuming. Based on "deep compression", the EIE hardware accelerator (Han et al., 2016) was later proposed that works on the compressed model, achieving significant speedup and energy efficiency improvement.

2 NETWORK PRUNING

Network pruning has been widely studied to compress CNN models. In early work, network pruning proved to be a valid way to reduce the network complexity and over-fitting (LeCun et al., 1989; Hanson & Pratt, 1989; Hassibi et al., 1993; Ström, 1997). Recently Han et al. (2015) pruned state-of-the-art CNN models with no loss of accuracy. We build on top of that approach. As shown on the left side of Figure 1, we start by learning the connectivity via normal network training. Next, we prune the small-weight connections: all connections with weights below a threshold are removed from the network. Finally, we retrain the network to learn the final weights for the remaining sparse connections. Pruning reduced the number of parameters by 9x and 13x for the AlexNet and VGG-16 models.


[Han et al. ICLR'16] Trained Quantization

Figure 2: Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.

weights (32 bit float):        cluster index (2 bit uint):    centroids:    fine-tuned centroids:
 2.09  -0.98   1.48   0.09      3 0 2 1                        3:  2.00       1.96
 0.05  -0.14  -1.08   2.12      1 1 0 3                        2:  1.50       1.48
-0.91   1.92   0.00  -1.03      0 3 1 0                        1:  0.00      -0.04
 1.87   0.00   1.53   1.49      3 1 2 2                        0: -1.00      -0.97

gradient:
-0.03  -0.01   0.03   0.02
-0.01   0.01  -0.02   0.12
-0.01   0.02   0.04   0.01
-0.07  -0.02   0.01  -0.02

The gradients are grouped by cluster index and reduced (summed), scaled by the learning rate (lr), and subtracted from the centroids to give the fine-tuned centroids.

Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom).

Pruning Trained Quantization Huffman Coding 37

We store the sparse structure that results from pruning using compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number of non-zero elements and n is the number of rows or columns. To compress further, we store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger than the bound, we use the zero padding solution shown in Figure 2: in the case when the difference exceeds 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
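A small sketch of the relative-index idea from the paragraph above: store the gap between consecutive non-zeros and insert a filler zero whenever the gap exceeds what the chosen bit width can express. The function and variable names are mine; the 3-bit bound mirrors the example in Figure 2:

```python
def relative_index_encode(dense, index_bits=3):
    """Encode non-zeros as (gap, value) pairs; gaps wider than the bound
    are bridged with filler zeros, as in Figure 2 of the paper."""
    max_gap = 2 ** index_bits        # paper convention: 8 is the largest 3-bit gap
    encoded, last = [], -1
    for i, v in enumerate(dense):
        if v == 0:
            continue
        gap = i - last
        while gap > max_gap:         # bridge the long run with a filler zero entry
            encoded.append((max_gap, 0))
            gap -= max_gap
        encoded.append((gap, v))
        last = i
    return encoded

sparse_row = [0, 0, 0, 0, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.0]
print(relative_index_encode(sparse_row))
# -> [(5, 1.5), (8, 0), (3, 2.0)]   the (8, 0) entry is the filler zero
```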

3 TRAINED QUANTIZATION AND WEIGHT SHARING

Network quantization and weight sharing further compress the pruned network by reducing the number of bits required to represent each weight. We limit the number of effective weights we need to store by having multiple connections share the same weight, and then fine-tune those shared weights. Weight sharing is illustrated in Figure 3. Suppose we have a layer that has 4 input neurons and 4 output neurons; the weight is a 4x4 matrix. On the top left is the 4x4 weight matrix, and on the bottom left is the 4x4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors); all the weights in the same bin share the same value, thus for each weight we then need to store only a small index into a table of shared weights. During update, all the gradients are grouped by the color and summed together, multiplied by the learning rate and subtracted from the shared centroids from the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each CONV layer, and 5 bits (32 shared weights) for each FC layer without any loss of accuracy.
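A minimal NumPy sketch of this weight-sharing scheme: cluster a layer's weights with 1-D k-means, replace each weight by a 2-bit index into the codebook, then fine-tune the centroids by grouping the gradients by cluster index and summing them. The linear centroid initialization follows the paper; the stand-in gradient and learning rate are illustrative assumptions:

```python
import numpy as np

def kmeans_1d(w, k=4, iters=50):
    """Plain 1-D k-means over the weight values; returns (centroids, assignments)."""
    centroids = np.linspace(w.min(), w.max(), k)          # linear initialization
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = w[assign == c].mean()
    return centroids, assign

weights = np.array([2.09, -0.98, 1.48, 0.09,
                    0.05, -0.14, -1.08, 2.12,
                    -0.91, 1.92, 0.00, -1.03,
                    1.87, 0.00, 1.53, 1.49])               # the 4x4 example from Figure 3
codebook, idx = kmeans_1d(weights, k=4)                    # one 2-bit index per weight

# Centroid fine-tuning: group the weight gradients by cluster index, sum them,
# and take a gradient step on the shared weights (the codebook), not on each weight.
grad = np.random.randn(weights.size) * 0.05                # stand-in gradient
lr = 1.0
for c in range(codebook.size):
    codebook[c] -= lr * grad[idx == c].sum()

quantized = codebook[idx]                                  # decoded layer after sharing
```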

To calculate the compression rate, given k clusters, we only need log2(k) bits to encode the index. In general, for a network with n connections and each connection is represented with b bits, constraining the connections to have only k shared weights will result in a compression rate of:

r = nb / (n log2(k) + kb)    (1)

For example, Figure 3 shows the weights of a single layer neural network with four input units and four output units. There are 4x4 = 16 weights originally, but there are only 4 shared weights: similar weights are grouped together to share the same value. Originally we need to store 16 weights, each with 32 bits; after weight sharing we only need to store the 4 effective weights (32 bits each) plus 16 2-bit cluster indices.
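Plugging the Figure 3 example into Equation (1): n = 16 weights, b = 32 bits, k = 4 clusters, so r = (16 x 32) / (16 x log2(4) + 4 x 32) = 512 / 160 = 3.2x. A tiny check in Python:

```python
from math import log2

def compression_rate(n, b, k):
    """Equation (1): original bits over (index bits + codebook bits)."""
    return (n * b) / (n * log2(k) + k * b)

print(compression_rate(n=16, b=32, k=4))   # -> 3.2
```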


[Han et al. ICLR'16] Before Trained Quantization: Continuous Weight

Weight Value

Pruning Trained Quantization Huffman Coding 44 [Han et al. ICLR’16] After Trained Quantization: Discrete Weight Count

Weight Value

Pruning Trained Quantization Huffman Coding 45 [Han et al. ICLR’16] After Trained Quantization: Discrete Weight after Training Count

Weight Value

Pruning Trained Quantization Huffman Coding 46 [Han et al. ICLR’16] How Many Bits do We Need?

Pruning Trained Quantization Huffman Coding 47 [Han et al. ICLR’16] How Many Bits do We Need?

Pruning Trained Quantization Huffman Coding 48 [Han et al. ICLR’16] Pruning + Trained Quantization Work Together

Pruning Trained Quantization Huffman Coding 49 [Han et al. ICLR’16] Pruning + Trained Quantization Work Together

AlexNet on ImageNet

Pruning Trained Quantization Huffman Coding 50 [Han et al. ICLR’16]

Huffman Coding

[Same three-stage pipeline as Figure 1: Pruning -> Quantization -> Huffman Encoding.]

• In-frequent weights: use more bits to represent
• Frequent weights: use less bits to represent

Pruning Trained Quantization Huffman Coding 51
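A minimal sketch of the Huffman step: given how often each quantized weight index occurs, build a prefix code so that frequent indices get short codes and infrequent ones get longer codes, exactly the idea in the two bullets above. Only the standard library is used; the index counts are made up for illustration:

```python
import heapq
from collections import Counter

def huffman_code(frequencies):
    """Build a Huffman code {symbol: bitstring} from a {symbol: count} map."""
    # Heap entries: [total_count, tie_breaker, {symbol: code_so_far}]
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)                 # least frequent subtree -> prefix "0"
        hi = heapq.heappop(heap)                 # next least frequent    -> prefix "1"
        for sym in lo[2]:
            lo[2][sym] = "0" + lo[2][sym]
        for sym in hi[2]:
            hi[2][sym] = "1" + hi[2][sym]
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], {**lo[2], **hi[2]}])
    return heap[0][2]

# Quantized weight indices for one layer (counts are invented for illustration):
indices = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
code = huffman_code(Counter(indices))
bits = sum(len(code[i]) for i in indices)
print(code, f"{bits} bits vs {2 * len(indices)} bits with fixed 2-bit indices")
```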

[Han et al. ICLR'16] Summary of Deep Compression

Quantization: less bits per weight Pruning: less number of weights Huffman Encoding

Cluster the Weights

Train Connectivity Encode Weights original same same same network accuracy Generate Code Book accuracy accuracy Prune Connections original 9x-13x Quantize the Weights 27x-31x Encode Index 35x-49x size reduction with Code Book reduction reduction Train Weights

Retrain Code Book

Figure 1: The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10 , while quantization further improves the compression rate: between 27 and 31 . Huffman coding⇥ gives more compression: between 35 and 49 . The compression⇥ rate already⇥ included the meta-data for sparse representation. The compression⇥ ⇥ scheme doesn’t incur any accuracy loss. Pruning Trained Quantization Huffman Coding 52 features such as better privacy, less network bandwidth and real time processing, the large storage overhead prevents deep neural networks from being incorporated into mobile apps. The second issue is energy consumption. Running large neural networks require a lot of memory bandwidth to fetch the weights and a lot of computation to do dot products— which in turn consumes considerable energy. Mobile devices are battery constrained, making power hungry applications such as deep neural networks hard to deploy. Energy consumption is dominated by memory access. Under 45nm CMOS technology, a 32 bit floating point add consumes 0.9pJ, a 32bit SRAM cache access takes 5pJ, while a 32bit DRAM memory access takes 640pJ, which is 3 orders of magnitude of an add operation. Large networks do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1 billion connection neural network, for example, at 20fps would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM access - well beyond the power envelope of a typical mobile device. Our goal is to reduce the storage and energy required to run inference on such large networks so they can be deployed on mobile devices. To achieve this goal, we present “deep compression”: a three- stage pipeline (Figure 1) to reduce the storage required by neural network in a manner that preserves the original accuracy. First, we prune the networking by removing the redundant connections, keeping only the most informative connections. Next, the weights are quantized so that multiple connections share the same weight, thus only the codebook (effective weights) and the indices need to be stored. Finally, we apply Huffman coding to take advantage of the biased distribution of effective weights. Our main insight is that, pruning and trained quantization are able to compress the network without interfering each other, thus lead to surprisingly high compression rate. It makes the required storage so small (a few megabytes) that all weights can be cached on chip instead of going to off-chip DRAM which is energy consuming. Based on “deep compression”, the EIE hardware accelerator Han et al. (2016) was later proposed that works on the compressed model, achieving significant speedup and energy efficiency improvement.

2 NETWORK PRUNING

Network pruning has been widely studied to compress CNN models. In early work, network pruning proved to be a valid way to reduce the network complexity and over-fitting (LeCun et al., 1989; Hanson & Pratt, 1989; Hassibi et al., 1993; Ström, 1997). Recently Han et al. (2015) pruned state-of-the-art CNN models with no loss of accuracy. We build on top of that approach. As shown on the left side of Figure 1, we start by learning the connectivity via normal network training. Next, we prune the small-weight connections: all connections with weights below a threshold are removed from the network. Finally, we retrain the network to learn the final weights for the remaining sparse connections. Pruning reduced the number of parameters by 9x and 13x for the AlexNet and VGG-16 models.
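As an illustration of the prune-then-retrain loop just described, here is a minimal numpy sketch (not the released Deep Compression code; the threshold, shapes, and stand-in gradient are made-up example values):

    import numpy as np

    def prune_by_threshold(weights, threshold):
        """Magnitude pruning: zero out weights below the threshold, keep a mask."""
        mask = (np.abs(weights) > threshold).astype(weights.dtype)
        return weights * mask, mask

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4)).astype(np.float32)
    W_pruned, mask = prune_by_threshold(W, threshold=0.5)

    # Retraining step: removed connections stay at zero by masking the update.
    grad = rng.normal(size=W.shape).astype(np.float32)  # stand-in for dL/dW
    lr = 0.01
    W_pruned -= lr * grad * mask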

[Han et al. ICLR'16] Results: Compression Ratio

Network     Original Size  Compressed Size  Compression Ratio  Original Accuracy  Compressed Accuracy
LeNet-300   1070KB         27KB             40x                98.36%             98.42%
LeNet-5     1720KB         44KB             39x                99.20%             99.26%
AlexNet     240MB          6.9MB            35x                80.27%             80.30%
VGGNet      550MB          11.3MB           49x                88.68%             89.09%
GoogleNet   28MB           2.8MB            10x                88.90%             88.92%
ResNet-18   44.6MB         4.0MB            11x                89.24%             89.28%

Can we make compact models to begin with?

SqueezeNet

[Figure: vanilla Fire module. A 64-channel input passes through a 1x1 conv squeeze layer (16 channels), then through parallel 1x1 conv and 3x3 conv expand layers (64 channels each), whose outputs are concatenated (Concat/Eltwise) into a 128-channel output.]

Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016
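Since the Fire-module diagram did not survive extraction, here is a minimal PyTorch sketch of the vanilla Fire module it depicts. The channel sizes (64 in, squeeze 16, expand 64+64) are taken from the figure labels; this is an illustration, not the reference SqueezeNet code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Fire(nn.Module):
        """Vanilla Fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand."""
        def __init__(self, in_ch=64, squeeze_ch=16, expand_ch=64):
            super().__init__()
            self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
            self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
            self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)

        def forward(self, x):
            s = F.relu(self.squeeze(x))
            # Concatenate the two expand branches along the channel dimension.
            return torch.cat([F.relu(self.expand1x1(s)), F.relu(self.expand3x3(s))], dim=1)

    x = torch.randn(1, 64, 56, 56)
    print(Fire()(x).shape)   # torch.Size([1, 128, 56, 56])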

Compressing SqueezeNet

Network      Approach          Size    Ratio  Top-1 Accuracy  Top-5 Accuracy
AlexNet      -                 240MB   1x     57.2%           80.3%
AlexNet      SVD               48MB    5x     56.0%           79.4%
AlexNet      Deep Compression  6.9MB   35x    57.2%           80.3%
SqueezeNet   -                 4.8MB   50x    57.5%           80.3%
SqueezeNet   Deep Compression  0.47MB  510x   57.5%           80.3%

Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016

Results: Speedup

[Figure: average speedup of the compressed model on CPU, GPU, and mobile GPU.]

Results: Energy Efficiency

[Figure: average energy efficiency improvement of the compressed model on CPU, GPU, and mobile GPU.]

Deep Compression Applied to Industry

Deep Compression

F8

Part 1: Algorithms for Efficient Inference

• 1. Pruning

• 2. Weight Sharing

• 3. Quantization

• 4. Low Rank Approximation

• 5. Binary / Ternary Net

• 6. Winograd Transformation

Quantizing the Weight and Activation

• Train with float
• Quantizing the weight and activation:
  • Gather the statistics for weight and activation
  • Choose proper radix point position (see the sketch below)
• Fine-tune in float format
• Convert to fixed-point format
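A minimal sketch of the statistics-gathering and radix-point choice listed above, assuming a signed symmetric fixed-point format in which the radix point is placed so the largest observed magnitude fits. This is an illustration of the idea, not the Qiu et al. FPGA implementation; the sample data and bit width are example values.

    import numpy as np

    def choose_frac_bits(samples, total_bits=8):
        """Pick fractional bits so the max |value| fits in the fixed-point format."""
        max_val = np.max(np.abs(samples))
        int_bits = max(0, int(np.ceil(np.log2(max_val + 1e-12))))  # bits left of the radix point
        return total_bits - 1 - int_bits                            # 1 bit reserved for sign

    def to_fixed_point(x, frac_bits, total_bits=8):
        """Quantize to signed fixed point with the chosen radix point position."""
        scale = 2.0 ** frac_bits
        qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
        q = np.clip(np.round(x * scale), qmin, qmax)
        return q / scale   # dequantized value actually used at inference

    weights = np.random.randn(1000) * 0.2        # stand-in for the gathered statistics
    fb = choose_frac_bits(weights, total_bits=8)
    w_q = to_fixed_point(weights, fb)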

Qiu et al., Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA'16

Quantization Result

Qiu et al., Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA'16

Part 1: Algorithms for Efficient Inference

• 1. Pruning

• 2. Weight Sharing

• 3. Quantization

• 4. Low Rank Approximation

• 5. Binary / Ternary Net

• 6. Winograd Transformation

Low Rank Approximation for Conv

Zhang et al., Efficient and Accurate Approximations of Nonlinear Convolutional Networks, CVPR'15

Low Rank Approximation for Conv

Zhang et al., Efficient and Accurate Approximations of Nonlinear Convolutional Networks, CVPR'15

Low Rank Approximation for FC
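The figure for this slide did not survive extraction. As a stand-in, here is a minimal numpy sketch of the simplest low-rank idea for an FC layer: truncated SVD, replacing one large matrix by two thin ones. Note this is plain SVD, not the tensor-train factorization of Novikov et al. cited below; shapes and rank are example values.

    import numpy as np

    def low_rank_fc(W, rank):
        """Replace y = W x by y = U_r (V_r x), with U_r (m, r) and V_r (r, n)."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        U_r = U[:, :rank] * s[:rank]      # absorb singular values into U
        V_r = Vt[:rank, :]
        return U_r, V_r                   # m*r + r*n parameters instead of m*n

    W = np.random.randn(1024, 4096).astype(np.float32)
    U_r, V_r = low_rank_fc(W, rank=128)
    x = np.random.randn(4096).astype(np.float32)
    y_full = W @ x
    y_approx = U_r @ (V_r @ x)            # two small matmuls approximate one big one
    print(np.linalg.norm(y_full - y_approx) / np.linalg.norm(y_full))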

Novikov et al., Tensorizing Neural Networks, NIPS'15

Part 1: Algorithms for Efficient Inference

• 1. Pruning

• 2. Weight Sharing

• 3. Quantization

• 4. Low Rank Approximation

• 5. Binary / Ternary Net

• 6. Winograd Transformation

Binary / Ternary Net: Motivation

[Figure: DSD training flow: Train on Dense (D) -> Pruning the Network -> Train on Sparse (S) -> Recover Zero Weights -> Train on Dense (D).]

Figure 2: Weight distribution of the original GoogLeNet (a), pruned GoogLeNet (b), after retraining the sparsity-constrained GoogLeNet (c), ignoring the sparsity constraint and recovering the zero weights (d), and after retraining the dense network (e).

Initial Dense Training: The first D step learns the connection weights and importance via normal network training on the dense network. Unlike conventional training, however, the goal of this D step is not only to learn the values of the weights; we are also learning which connections are important. We use the simple heuristic of quantifying the importance of the weights by their absolute value.

Sparse Training: The S step prunes the low-weight connections and trains a sparse network. We applied the same sparsity to all the layers, so there is a single hyper-parameter: the sparsity, the percentage of weights that are pruned to 0. For each layer W with N parameters, we sorted the parameters, picked the k-th largest one, S_k, as the threshold, where k = N * (1 - sparsity), and generated a binary mask to remove all the weights smaller than S_k. Details are shown in Algorithm 1.

The reason behind removing small weights is partially due to the Taylor expansion of the loss function, shown in Equations (1) and (2). We want to minimize the increase in Loss when applying a hard threshold in pruning, so we need to minimize the first and second terms in Equation (2). Since we are zeroing out parameters, ΔW_i is actually W_i - 0 = W_i. At a local minimum with ∂Loss/∂W_i ≈ 0 and ∂²Loss/∂W_i² > 0, only the second-order term matters. Since the second-order gradient ∂²Loss/∂W_i² is expensive to calculate and W_i appears with a power of 2, we use |W_i| as the metric for pruning: smaller |W_i| means a smaller increase in the loss function.

Loss = f(x, W_1, W_2, W_3, ...)                                              (1)

ΔLoss = (∂Loss/∂W_i) ΔW_i + (1/2) (∂²Loss/∂W_i²) ΔW_i² + ...                 (2)

By retraining while enforcing the binary mask in each iteration, we converted a dense network into a sparse network that has a known sparsity support and can fully recover or even increase the original accuracy of the initial dense model under the sparsity constraint. The sparsity can be tuned using validation, and we found values between 25% and 50% generally work well in our experiments.

Final Dense Training: The final D step recovers the pruned connections, making the network dense again. These previously-pruned connections are initialized to zero and the entire network is retrained with 1/10 the original learning rate (since the sparse network is already at a good local minimum). Hyper-parameters like dropout ratios and weight decay remained unchanged. By restoring the pruned connections, the final D step increases the model capacity of the network and makes it possible to arrive at a better local minimum compared with the sparse model from the S step.

To visualize the DSD training flow, we plotted the progression of the weight distribution in Figure 2. The figure is plotted using the GoogLeNet inception_5b3x3 layer, and we found that this progression of the weight distribution is very representative of VGGNet and ResNet as well. The original distribution of weights is centered on zero with tails dropping off quickly. Pruning is based on absolute value, so after pruning the large center region is truncated away. The un-truncated network parameters adjust themselves during the retraining phase, so in (c) the boundary becomes soft and forms a bimodal distribution. In (d), at the beginning of the re-dense training step, all the pruned weights come back and are reinitialized to zero. Finally, in (e), the previously-pruned weights are retrained together with the surviving weights. In this step, we kept the same learning hyper-parameters (weight decay, learning rate, etc.) for the reborn weights and the old weights. Comparing (d) and (e), the old weights' distribution remains almost the same, while the new weights become more spread around zero. The overall mean absolute value of the weight distribution is much smaller.
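A minimal numpy sketch of the S-step masking rule described above (pick the k-th largest |W| as the layer threshold, where k = N * (1 - sparsity)). It is an illustration, not the paper's Algorithm 1 verbatim; the layer shape and sparsity are example values.

    import numpy as np

    def sparsity_mask(W, sparsity):
        """Keep the largest (1 - sparsity) fraction of |W|; zero the rest."""
        flat = np.abs(W).flatten()
        k = int(flat.size * (1.0 - sparsity))        # number of weights to keep
        if k == 0:
            return np.zeros_like(W)
        threshold = np.sort(flat)[-k]                # k-th largest magnitude
        return (np.abs(W) >= threshold).astype(W.dtype)

    W = np.random.randn(300, 100).astype(np.float32)
    mask = sparsity_mask(W, sparsity=0.3)            # DSD found 25%-50% works well
    W_sparse = W * mask                              # enforced at every training iteration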

Trained Ternary Quantization

[Figure: Trained Ternary Quantization pipeline. The full precision weights are normalized to [-1, 1], quantized to an intermediate ternary weight in {-1, 0, 1} using the threshold ±t, and scaled by the trained coefficients Wp and Wn to give the final ternary weights {-Wn, 0, Wp}. During back propagation, two gradient paths (gradient1, gradient2) update the latent full precision weights and the trained scaling coefficients; at inference time only the ternary weights are used.]

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR'17

Equation 2. We use scaled gradients for the 32-bit weights:

∂L/∂w̃_l = W_l^p × ∂L/∂w_l^t   if w̃_l > Δ_l
∂L/∂w̃_l = 1 × ∂L/∂w_l^t       if |w̃_l| ≤ Δ_l                    (8)
∂L/∂w̃_l = W_l^n × ∂L/∂w_l^t   if w̃_l < -Δ_l

Note we use the scalar 1 as the gradient factor for zero weights. The overall quantization process is illustrated in Figure 1. The evolution of the ternary weights from different layers during training is shown in Figure 2. We observe that as training proceeds, different layers behave differently: for the first quantized conv layer, the absolute values of W_l^p and W_l^n get smaller and sparsity gets lower, while for the last conv layer and the fully connected layer, the absolute values of W_l^p and W_l^n get larger and sparsity gets higher.

We learn the ternary assignments (the index into the codebook) by updating the latent full-resolution weights during training. This may cause the assignments to change between iterations. Note that the thresholds are not constants, as the maximal absolute values change over time. Once an updated weight crosses the threshold, the ternary assignment is changed.

The benefits of using trained quantization factors are: i) the asymmetry of W_l^p ≠ W_l^n enables neural networks to have more model capacity; ii) quantized weights play the role of "learning rate multipliers" during back propagation.

3.2 QUANTIZATION HEURISTIC

In previous work on ternary weight networks, Li & Liu (2016) proposed Ternary Weight Networks (TWN) using ±Δ_l as thresholds to reduce 32-bit weights to ternary values, where ±Δ_l is defined as in Equation 5. They optimized the value of ±Δ_l by minimizing the expectation of the L2 distance between the full precision weights and the ternary weights. Instead of using a strictly optimized threshold, we adopt different heuristics: 1) use the maximum absolute value of the weights as a reference for the layer's threshold and maintain a constant factor t for all layers:

Δ_l = t × max(|w̃|)                                               (9)

and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on the CIFAR-10 and ImageNet datasets, and use the second one to explore a wider range of sparsities in Section 5.1.1.
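A minimal sketch of the forward ternarization defined by Equation 9 (illustration only; normalization is omitted, and Wp, Wn are trained scaling coefficients in TTQ, here just example floats):

    import numpy as np

    def ttq_quantize(w_latent, Wp, Wn, t=0.05):
        """Ternarize latent full-precision weights to {+Wp, 0, -Wn}."""
        delta = t * np.max(np.abs(w_latent))        # per-layer threshold, Eq. (9)
        w_t = np.zeros_like(w_latent)
        w_t[w_latent > delta] = Wp                  # positive codebook value
        w_t[w_latent < -delta] = -Wn                # negative codebook value
        return w_t, delta

    w = np.random.randn(64, 64).astype(np.float32)   # latent weights kept in float
    w_ternary, delta = ttq_quantize(w, Wp=1.2, Wn=0.8)
    # During backprop, gradients to w are scaled by Wp, 1, or Wn depending on the
    # region (Equation 8), and Wp/Wn receive their own gradients.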

Weight Evolution during Training

[Figure: ternary weight values Wp and -Wn over training epochs (top) and the percentage of negative, zero, and positive weights (bottom) for the layers res1.0/conv1, res3.2/conv2, and linear.]

Figure 2: Ternary weights value (above) and distribution (below) with iterations for different layers of ResNet-20 on CIFAR-10.

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR'17

Visualization of the TTQ Kernels

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR'17

Error Rate on ImageNet

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR'17

Part 1: Algorithms for Efficient Inference

• 1. Pruning

• 2. Weight Sharing

• 3. Quantization

• 4. Low Rank Approximation

• 5. Binary / Ternary Net

• 6. Winograd Transformation

3x3 DIRECT Convolutions: Compute Bound

[Figure: direct 3x3 convolution of an image tensor with a filter. 9xC FMAs per output: math intensive. 9xK FMAs per input: good data reuse.]

Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

3x3 WINOGRAD Convolutions: Transform Data to Reduce Math Intensity

[Figure: Winograd convolution pipeline. A 4x4 input tile goes through the data transform and the filter through the filter transform; the transformed tiles are multiplied point-wise and summed over C; an output transform produces the output tile.]

Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs
Winograd convolution: we need 16xC FMAs for 4 outputs: 2.25x fewer FMAs
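To see where the 16 multiplies per 4 outputs come from, here is a minimal numpy sketch of a single-channel F(2x2, 3x3) Winograd tile using the standard transform matrices from Lavin & Gray; a sketch for one tile, not a tuned kernel.

    import numpy as np

    # Transform matrices for F(2x2, 3x3) (Lavin & Gray).
    B_T = np.array([[1,  0, -1,  0],
                    [0,  1,  1,  0],
                    [0, -1,  1,  0],
                    [0,  1,  0, -1]], dtype=np.float32)
    G = np.array([[1,    0,   0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0,    0,   1]], dtype=np.float32)
    A_T = np.array([[1, 1,  1,  0],
                    [0, 1, -1, -1]], dtype=np.float32)

    def winograd_2x2_3x3(d, g):
        """2x2 output tile of a 3x3 convolution on a 4x4 input tile."""
        U = G @ g @ G.T              # transformed filter (4x4)
        V = B_T @ d @ B_T.T          # transformed data (4x4)
        M = U * V                    # 16 element-wise multiplies
        return A_T @ M @ A_T.T       # inverse transform -> 2x2 outputs

    d = np.random.randn(4, 4).astype(np.float32)   # input tile
    g = np.random.randn(3, 3).astype(np.float32)   # 3x3 filter

    # Reference: direct (valid) correlation producing the same 2x2 outputs.
    ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
    print(np.allclose(winograd_2x2_3x3(d, g), ref, atol=1e-4))   # True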

See A. Lavin & S. Gray, "Fast Algorithms for Convolutional Neural Networks"
Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

Speedup of Winograd Convolution

VGG16, Batch Size 1 – Relative Performance

[Figure: relative performance of cuDNN 3 vs. cuDNN 5 on the VGG16 conv layers (conv 1.1 through conv 5.0), batch size 1; y-axis from 0.00 to 2.00.]

Measured on Maxwell TITAN X

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

Agenda

Agenda quadrants (Algorithm / Hardware x Inference / Training):
• Algorithms for Efficient Inference
• Algorithms for Efficient Training
• Hardware for Efficient Inference
• Hardware for Efficient Training

Hardware for Efficient Inference. A common goal: minimize memory access

[Slide: existing DNN accelerators and how they minimize memory access. Eyeriss (MIT): RS dataflow. DaDianNao (CAS): eDRAM. TPU (Google): 8-bit integer. EIE (Stanford): compression and sparsity.]

On the TPU: "This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs."

[Embedded page from Han et al., ISCA'16, Section IV: the sparse matrix read unit, arithmetic unit, activation read/write unit, distributed leading non-zero detection, and central control unit of one EIE PE, with Table II giving the power/area breakdown of one PE under the TSMC 45nm process.]

TPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

Google TPU

● The Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
● 700 MHz
● Peak: 92T operations/second
  ○ 65,536 * 2 * 700M
● >25X as many MACs vs GPU
● >100X as many MACs vs CPU
● 4 MiB of on-chip Accumulator memory
● 24 MiB of on-chip Unified Buffer (activation memory)
● 3.5X as much on-chip memory vs GPU
● Two 2133MHz DDR3 DRAM channels
● 8 GiB of off-chip weight DRAM memory

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

Google TPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

Inference Datacenter Workload

Name   LOC   Layers (FC / Conv / Vector / Pool / Total)  Nonlinear function  Weights  TPU Ops / Weight Byte  TPU Batch Size  % Deployed
MLP0   0.1k  5 / - / - / - / 5                           ReLU                20M      200                    200             61%
MLP1   1k    4 / - / - / - / 4                           ReLU                5M       168                    168
LSTM0  1k    24 / - / 34 / - / 58                        sigmoid, tanh       52M      64                     64              29%
LSTM1  1.5k  37 / - / 19 / - / 56                        sigmoid, tanh       34M      96                     96
CNN0   1k    - / 16 / - / - / 16                         ReLU                8M       2888                   8               5%
CNN1   1k    4 / 72 / - / 13 / 89                        ReLU                100M     1750                   32

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

Roofline Model: Identify Performance Bottleneck

Samuel Williams, Andrew Waterman, and David Patterson. "Roofline: an insightful visual performance model for multicore architectures." Communications of the ACM 52.4 (2009): 65-76.
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
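The roofline model is just the minimum of a compute ceiling and a bandwidth-times-intensity ceiling. A minimal sketch; the peak is the 92 TOPS figure from the TPU slide above, while the bandwidth number is only an assumed example derived from the two DDR3-2133 channels listed there, not a measured figure.

    def roofline(ops_per_byte, peak_ops_per_s, bandwidth_bytes_per_s):
        """Attainable throughput = min(compute ceiling, bandwidth * operational intensity)."""
        return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)

    peak = 92e12                 # 92 T ops/s (from the TPU slide)
    bandwidth = 2 * 2133e6 * 8   # ~34 GB/s from two DDR3-2133 channels (assumed example)
    for oi in [1, 10, 100, 1000, 10000]:          # ops per byte fetched from DRAM
        print(oi, roofline(oi, peak, bandwidth) / 1e12, "T ops/s")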

TPU Roofline

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

Log Rooflines for CPU, GPU, TPU

Star = TPU Triangle = GPU Circle = CPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

Linear Rooflines for CPU, GPU, TPU

Star = TPU Triangle = GPU Circle = CPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

Why so far below Rooflines?

Low latency requirement => Can’t batch more => low ops/byte

How to solve this? Less memory footprint => need to compress the model

Challenge: Hardware that can infer on compressed model

[Han et al. ISCA'16] EIE: the First DNN Accelerator for Sparse, Compressed Model

[Han et al. ISCA'16] EIE: the First DNN Accelerator for Sparse, Compressed Model

0 * A = 0        W * 0 = 0        2.09, 1.92 => 2

Sparse Weight: 90% static sparsity -> 10x less computation, 5x less memory footprint
Sparse Activation: 70% dynamic sparsity -> 3x less computation
Weight Sharing: 4-bit weights -> 8x less memory footprint

EIE: Reduce Memory Access by Compression

logically:

~a = (0, a1, 0, a3)

        | w0,0  w0,1  0     w0,3 |    PE0
        | 0     0     w1,2  0    |    PE1
        | 0     w2,1  0     w2,3 |    PE2
  W  =  | 0     0     0     0    |    PE3
        | 0     0     w4,2  w4,3 |    PE0
        | w5,0  0     0     0    |    PE1
        | 0     0     0     w6,3 |    PE2
        | 0     w7,1  0     0    |    PE3

~b = ReLU(W ~a^T) = ReLU( (b0, b1, b2, b3, b4, b5, b6, b7)^T ) = (b0, b1, 0, b3, 0, b5, b6, 0)^T

Rows are interleaved across the four PEs (row i is assigned to PE i mod 4).

physically (PE0's slice, rows 0 and 4, stored in compressed form):

Virtual Weight:   W0,0  W0,1  W4,2  W0,3  W4,3
Relative Index:   0     1     2     0     0
Column Pointer:   0     1     2     3

Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016, Hotchips 2016
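A minimal sketch of the computation this compressed format supports: for each non-zero input activation, walk that column's non-zero weights via the column pointers and accumulate into the output. To keep the sketch simple it uses a standard compressed-sparse-column layout with absolute row indices rather than EIE's 4-bit relative encoding and per-PE slices; the weight values below are placeholders, and only the sparsity pattern matches the example matrix above.

    import numpy as np

    def csc_spmv_skip_zero_acts(values, row_index, col_ptr, activations, n_rows):
        """y = W @ a with W in CSC form; zero activations are skipped entirely."""
        y = np.zeros(n_rows, dtype=np.float32)
        for j, a in enumerate(activations):
            if a == 0.0:                               # W * 0 = 0: skip the whole column
                continue
            for p in range(col_ptr[j], col_ptr[j + 1]):
                y[row_index[p]] += values[p] * a       # multiply-accumulate on non-zeros only
        return y

    # Sparsity pattern of the 8x4 example matrix, stored column by column.
    row_index = np.array([0, 5,   0, 2, 7,   1, 4,   0, 2, 4, 6])
    col_ptr   = np.array([0, 2, 5, 7, 11])
    values    = np.arange(1, 12, dtype=np.float32) * 0.1   # placeholder magnitudes

    a = np.array([0.0, 1.0, 0.0, 2.0], dtype=np.float32)   # sparse activations (a1, a3 non-zero)
    print(csc_spmv_skip_zero_acts(values, row_index, col_ptr, a, n_rows=8))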

[Han et al. ISCA'16] Dataflow

rule of thumb: 0 * A = 0    W * 0 = 0

[Animation: the sparse matrix-vector product above is processed step by step; each non-zero activation is broadcast, and each PE multiplies it by the non-zero weights in its slice of the corresponding column and accumulates into the destination activations, skipping zero weights and zero activations.]

EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han*, Xingyu Liu*, Huizi Mao*, Jing Pu*, Ardavan Pedram*

Mark A. Horowitz*, William J. Dally*†
*Stanford University, †NVIDIA
{songhan,xyl,huizi,jingpu,perdavan,horowitz,dally}@stanford.edu

[Han et al. ISCA'16] EIE Architecture

[Figure: EIE weight decode and address accumulation datapath. A 4-bit encoded (virtual) weight from the compressed DNN model is looked up to produce a 16-bit real weight for the ALU; in parallel, the 4-bit relative index is accumulated into a 16-bit absolute index used to address memory.]

Figure 1. Efficient inference engine that works on the compressed deep neural network model for machine learning applications.

Abstract: State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed "Deep Compression" makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.

Keywords: Deep Learning; Model Compression; Hardware Acceleration; Algorithm-Hardware co-Design; ASIC

I. INTRODUCTION

Neural networks have become ubiquitous in applications including [1]–[3], speech recognition [4], and natural language processing [4]. In 1998, Lecun et al. classified handwritten digits with less than 1M parameters [5], while in 2012, Krizhevsky et al. won the ImageNet competition with 60M parameters [1]. Deepface classified human faces with 120M parameters [6]. Neural Talk [7] automatically converts image to natural language with 130M CNN parameters and 100M RNN parameters. Coates et al. scaled up a network to 10 billion parameters on HPC systems [8].

Large DNN models are very powerful but consume large amounts of energy because the model must be stored in external DRAM, and fetched every time for each image, word, or speech sample. For embedded mobile applications, these resource demands become prohibitive. Table I shows the energy cost of basic arithmetic and memory operations in a 45nm CMOS process [9]. It shows that the total energy is dominated by the required memory access if there is no data reuse. The energy cost per fetch ranges from 5pJ for 32b coefficients in on-chip SRAM to 640pJ for 32b coefficients in off-chip LPDDR2 DRAM. Large networks do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1G connection neural network, for example, at 20Hz would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM accesses, which is well beyond the power envelope of a typical mobile device.

Previous work has used specialized hardware to accelerate DNNs [10]–[12]. However, these efforts focus on accelerating dense, uncompressed models, limiting their utility to small models or to cases where the high energy cost of external DRAM access can be tolerated. Without model compression, it is only possible to fit very small neural networks, such as Lenet-5, in on-chip SRAM [12].

Efficient implementation of convolutional layers in CNN has been intensively studied, as its data reuse and manipulation is quite suitable for customized hardware [10]–[15]. However, it has been found that fully-connected (FC) layers, widely used in RNN and LSTMs, are bandwidth limited on large networks [14]. Unlike CONV layers, there is no parameter reuse in FC layers. Data batching has become an efficient solution when training networks on CPUs or GPUs; however, it is unsuitable for real-time applications with latency requirements.

Network compression via pruning and weight sharing [16] makes it possible to fit modern networks such as AlexNet (60M parameters, 240MB) and VGG-16 (130M parameters, 520MB) in on-chip SRAM. Processing these

[Han et al. ISCA'16] Micro Architecture for each PE

[Figure: microarchitecture of one EIE PE. An activation queue feeds leading non-zero detection; pointer read uses even/odd pointer SRAM banks; sparse matrix access reads the encoded weight and relative index from the sparse matrix SRAM bank; the arithmetic unit decodes the weight, accumulates the relative index into an absolute address, and multiply-accumulates; activation read/write uses source/destination activation registers with ReLU and bypass. Blocks are built from SRAM, registers, and combinational logic.]

[Han et al. ISCA'16] Speedup on EIE

[Figure 6 (Han et al., ISCA'16): speedups of CPU (dense and compressed), GPU, mobile GPU, and EIE relative to the dense CPU baseline on Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, and NT-LSTM, with no batching; EIE's geometric-mean speedup over the dense CPU baseline is 189x.]

Table III: benchmarks from state-of-the-art DNN models.

Layer    Size        Weight%  Act%   FLOP%  Description
Alex-6   9216x4096   9%       35.1%  3%     Compressed AlexNet [1] for large scale image classification
Alex-7   4096x4096   9%       35.3%  3%
Alex-8   4096x1000   25%      37.5%  10%
VGG-6    25088x4096  4%       18.3%  1%     Compressed VGG-16 [3] for large scale image classification and object detection
VGG-7    4096x4096   4%       37.5%  2%
VGG-8    4096x1000   23%      41.1%  9%
NT-We    4096x600    10%      100%   10%    Compressed NeuralTalk [7] with RNN and LSTM for automatic image captioning
NT-Wd    600x8791    11%      100%   11%
NT-LSTM  1201x2400   10%      100%   11%

[Embedded pages from Han et al., ISCA'16, Sections IV-VI: PE implementation results (Table II power/area breakdown), evaluation methodology (cycle-accurate C++ simulator; Verilog RTL synthesized with Synopsys DC in the TSMC 45nm GP library and placed-and-routed with IC Compiler, Cacti for SRAM; baselines: Core i7-5930k CPU with MKL, GTX Titan X GPU with cuBLAS/cuSPARSE, Tegra K1 mobile GPU), and experimental results (1.15ns critical path with 4 pipeline stages; 64 PEs at 800MHz give 102 GOP/s on the compressed model, equivalent to roughly 3 TOP/s on an uncompressed model given 10x weight and 3x activation sparsity).]

0.1x CPU Dense (Baseline) CPU Compressed GPU Dense GPU Compressed mGPU Dense mGPU Compressed EIE 1018x 1000x 507x 618x 248x 210x 189x 94x 115x 135x 92x 98x 56x 63x 60x [Han et al. ISCA’16] Alex-6 Alex-7 Alex-8 VGG-6 VGG-7 VGG-8 NT-We NT-Wd NT100x -LSTM Geo Mean 34x 33x 48x 25x 21x 24x 22x 25x 14x 14x 16x 15x 15x 9x 8x 10x 9x 10x 9x 9x 10x 5x 5x 2x 3x 2x 3x 3x 2x 3x 3x 2x 1x 1x 1x 1.1x 1x 1x 1x 1x 1.0x 1x 1.0x 1x 1x 1x 1x 1x 1x 1x 0.6x 0.5x 0.5x 0.5x 0.6x Speedup 1x Energy Efficiency0.3x 0.5xon EIE 0.1x Figure 6. Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is noAlex-6 batchingAlex-7 Alex-8 inVGG-6 allVGG-7 cases.VGG-8 NT-We NT-Wd NT-LSTM Geo Mean Figure 6. Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases. Sparse Matrix Read Unit. The sparse-matrix read unit uses pointers pj and pj+1 to read the non-zero elements (if SpMat CPU Dense (Baseline) CPU Compressed GPU Dense GPU Compressed mGPU Dense mGPU Compressed EIE any) of this PE’s slice of column I from the sparse-matrix 119,797x j 61,533x 76,784x 100000x 34,522x SRAM. Each entry in the SRAM is 8-bits in length and 14,826x 24,207x Act_0 Act_1 11,828x 9,485x 10,904x 8,053x contains one 4-bit element of v and one 4-bit element of x. 10000x For efficiency (see Section VI) the PE’s slice of encoded Ptr_Even Arithm Ptr_Odd 1000x sparse matrix I is stored in a 64-bit-wide SRAM. Thus eight 78x 101x 102x 59x 61x 39x entries are fetched on each SRAM read. The high 13 bits 100x 26x 37x 37x 25x 25x 36x 18x 17x 20x 15x 20x 23x 9x12x 10x 10x 10x 14x 14x p 7x 7x 8x 6x 6x 6x 6x 7x CPU Dense (Baseline) CPU Compressed GPU Dense GPU Compressed of themGPU current pointer selects an SRAM Dense row, and the low SpMatmGPU10x 5x Compressed EIE5x 6x 4x 5x 3x 10x 15x 13x 14x 2x 3-bits select one of the eight entries in that row. A single 1x 1x 1x 7x 1x 1x 1x 5x 1x 8x 1x 7x 1x 7x 1x 9x (v, x) entry is provided to the arithmetic unit each cycle. 1x Energy Energy Efficiency 119,797x Arithmetic Unit. The arithmetic unit receives a (v, x) Figure 5. Layout of one PE in EIE under TSMC 45nm process. Alex-6 Alex-7 Alex-8 VGG-6 VGG-7 VGG-8 NT-We NT-Wd NT-LSTM Geo Mean 61,533x 76,784x entry from the sparse matrix read unit and performs the multiply-accumulate operation b = b + v a . Index Table II Figure 7. Energy efficiency of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases. 100000x 34,522x x x ⇥ j THE IMPLEMENTATION RESULTS OF ONE PE IN EIE AND THE x is used to index an accumulator array (the destination BREAKDOWN BY COMPONENT TYPE (LINE 3-7), BY MODULE (LINE 24,207x 14,826x activation registers) while v is multiplied by the activation 8-13). THE CRITICAL PATH OF EIE IS 1.15 NS Table III Power Area 11,828x value at the head of the activation queue. Because v is stored 10,904x(%)corner. We(%) placed and routed the PE using the Synopsys IC 2 BENCHMARK FROM STATE-OF-THE-ART DNN MODELS in 4-bit encoded form,9,485x it is first expanded to a 16-bit fixed- (mW) (µm ) 8,053x Total 9.157compiler 638,024 (ICC). We used Cacti [25] to get SRAM area and 10000x point number via a table look up. A bypass path is provided memory 5.416 (59.15%) 594,786 (93.22%) Layer Size Weight% Act% FLOP% Description to route the output of the adder to its input if the same clock network 1.874 (20.46%)energy 866 numbers. (0.14%) We annotated the toggle rate from the RTL 9216, accumulator is selected on two adjacent cycles. 
[Han et al. ISCA’16] EIE: Architecture Details, Evaluation Methodology, and Results

Activation Read/Write. The Activation Read/Write Unit contains two activation register files that hold the source and destination activation values, respectively, during a single round of FC-layer computation. The source and destination register files exchange roles for the next layer, so no additional data transfer is needed to support multi-layer feed-forward computation. Each activation register file holds 64 16-bit activations, which is sufficient for a 4K activation vector across 64 PEs. Longer activation vectors are accommodated with the 2KB activation SRAM: when the activation vector is longer than 4K, the M x V is completed in several batches, each of length 4K or less. All local reduction is done in the register file; the SRAM is read only at the beginning and written only at the end of a batch.

Distributed Leading Non-Zero Detection. Input activations are hierarchically distributed to each PE. To exploit input-vector sparsity, leading non-zero detection logic selects the first non-zero result. Each group of 4 PEs does a local leading non-zero detection on its input activations and sends the result to a Leading Non-zero Detection (LNZD) node (Figure 4). Each LNZD node finds the next non-zero activation across its four children and sends the result up the quadtree, which is arranged so that wire lengths remain constant as PEs are added. At the root LNZD node, the selected non-zero activation is broadcast back to all PEs via a separate wire placed in an H-tree.

Central Control Unit. The Central Control Unit (CCU) is the root LNZD node. It communicates with the master, for example a CPU, and monitors the state of every PE by setting the control registers. The CCU has two modes. In I/O mode, all PEs are idle while the activations and weights in every PE are accessed by a DMA connected to the Central Unit; this is a one-time cost. In Computing mode, the CCU repeatedly collects a non-zero value from the LNZD quadtree and broadcasts it to all PEs until the input length is exceeded. By setting the input length and the starting address of the pointer array, EIE is instructed to execute different layers.

V. EVALUATION METHODOLOGY

Simulator, RTL and Layout. We implemented a custom cycle-accurate C++ simulator for the accelerator that models the RTL behavior of synchronous circuits. Each hardware module is abstracted as an object that implements two abstract methods, propagate and update, corresponding to combinational logic and flip-flops in RTL. The simulator is used for design-space exploration and also serves as a checker for RTL verification. To measure area, power and critical-path delay, we implemented the RTL of EIE in Verilog, verified it against the cycle-accurate simulator, and synthesized EIE using the Synopsys Design Compiler (DC) under the TSMC 45nm GP standard VT library at the worst-case PVT corner. We placed and routed the PE using the Synopsys IC Compiler (ICC), used Cacti [25] to get SRAM area and energy numbers, annotated the toggle rate from the RTL simulation onto the gate-level netlist (dumped to the switching activity interchange format, SAIF), and estimated the power using Prime-Time PX.

Comparison Baseline. We compare EIE with three different off-the-shelf computing units: CPU, GPU and mobile GPU.
1) CPU. We use an Intel Core i7-5930K CPU, a Haswell-E class processor that has been used in the NVIDIA Digits Deep Learning Dev Box, as a CPU baseline. To run the benchmark on CPU, we used MKL CBLAS GEMV for the original dense model and MKL SPBLAS CSRMV for the compressed sparse model. CPU socket and DRAM power are as reported by the pcm-power utility provided by Intel.
2) GPU. We use an NVIDIA GeForce GTX Titan X GPU, a state-of-the-art GPU for deep learning, as our baseline, using the nvidia-smi utility to report power. To run the benchmark, we used cuBLAS GEMV for the original dense layer. For the compressed sparse layer, we stored the sparse matrix in CSR format and used the cuSPARSE CSRMV kernel, which is optimized for sparse matrix-vector multiplication on GPUs.
3) Mobile GPU. We use an NVIDIA Tegra K1, which has 192 CUDA cores, as our mobile GPU baseline. We used cuBLAS GEMV for the original dense model and cuSPARSE CSRMV for the compressed sparse model. Tegra K1 has no software interface to report power consumption, so we measured the total power consumption with a power meter, then assumed 15% AC-to-DC conversion loss, 85% regulator efficiency and 15% power consumed by peripheral components [26], [27] to report the AP+DRAM power for Tegra K1.

Benchmarks. We compare performance on two sets of models: the uncompressed DNN models and the compressed DNN models. The uncompressed DNN models are obtained from the Caffe model zoo [28] and the NeuralTalk model zoo [7]; the compressed DNN models are produced as described in [16], [23]. The benchmark networks have 9 layers in total, obtained from AlexNet, VGGNet, and NeuralTalk. We use the ImageNet dataset [29] and the Caffe [28] deep learning framework as the golden model to verify the correctness of the hardware design.

Table III: Benchmark layers from state-of-the-art DNN models
Layer     Size           Weight%   Act%     FLOP%   Description
Alex-6    9216 x 4096    9%        35.1%    3%      Compressed AlexNet [1] for large-scale image classification
Alex-7    4096 x 4096    9%        35.3%    3%
Alex-8    4096 x 1000    25%       37.5%    10%
VGG-6     25088 x 4096   4%        18.3%    1%      Compressed VGG-16 [3] for large-scale image classification and object detection
VGG-7     4096 x 4096    4%        37.5%    2%
VGG-8     4096 x 1000    23%       41.1%    9%
NT-We     4096 x 600     10%       100%     10%     Compressed NeuralTalk [7] with RNN and LSTM for automatic image captioning
NT-Wd     600 x 8791     11%       100%     11%
NT-LSTM   1201 x 2400    10%       100%     11%

VI. EXPERIMENTAL RESULTS

Figure 5 shows the layout (after place-and-route) of an EIE processing element. The power/area breakdown is shown in Table II: SpmatRead dominates (4.955 mW, 54.11% of power; 469,412 um^2, 73.57% of area), followed by PtrRead (1.807 mW, 19.73%; 121,849 um^2, 19.10%), ArithmUnit (1.162 mW, 12.68%; 3,110 um^2, 0.49%), ActRW (1.122 mW, 12.25%; 18,934 um^2, 2.97%) and the Act queue (0.112 mW, 1.23%; 758 um^2, 0.12%); by circuit type, registers account for 1.026 mW (11.20%) and 9,465 um^2 (1.48%), combinational logic for 0.841 mW (9.18%) and 8,946 um^2 (1.40%), and filler cells for 23,961 um^2 (3.76%).

We brought the critical path delay down to 1.15ns by introducing 4 pipeline stages to update one activation: codebook lookup and address accumulation (in parallel); output activation read and input activation multiply (in parallel); shift and add; and output activation write. Activation read and write access a local register, and activation bypassing is employed to avoid a pipeline hazard. Using 64 PEs running at 800MHz yields a performance of 102 GOP/s. Considering 10x weight sparsity and 3x activation sparsity, a dense DNN accelerator would require 3 TOP/s to have equivalent application throughput.

Figure 7. Energy efficiency of GPU, mobile GPU and EIE compared with CPU running the uncompressed DNN model.

Compression Acceleration Regularization 103
[Han et al. ISCA’16] Comparison: Throughput

Figure: Throughput (Layers/s in log scale, 1E+00 to 1E+06) of Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE (45nm ASIC, 64 PEs), and EIE (28nm ASIC, 256 PEs).

Compression Acceleration Regularization 104 [Han et al. ISCA’16] Comparison: Energy Efficiency

Figure: Energy efficiency (Layers/J in log scale, 1E+00 to 1E+06) of Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE (45nm ASIC, 64 PEs), and EIE (28nm ASIC, 256 PEs).

Compression Acceleration Regularization 105 Agenda

• Algorithms for Efficient Inference
• Algorithms for Efficient Training
• Hardware for Efficient Inference
• Hardware for Efficient Training

Part 3: Efficient Training — Algorithms

• 1. Parallelization

• 2. Mixed Precision with FP16 and FP32

• 3. Model Distillation

• 4. DSD: Dense-Sparse-Dense Training Part 3: Efficient Training — Algorithms

• 1. Parallelization

• 2. Mixed Precision with FP16 and FP32

• 3. Model Distillation

• 4. DSD: Dense-Sparse-Dense Training

Moore’s law made CPUs 300x faster than in 1990, but it’s over…

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Data Parallel – Run multiple inputs in parallel

Dally, High Performance Hardware for Machine Learning, NIPS’2015 Data Parallel – Run multiple inputs in parallel

• Doesn’t affect latency for one input
• Requires P-fold larger batch size
• For training, requires a coordinated weight update (sketched below)
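To make the coordinated weight update concrete, here is a minimal, framework-free sketch of synchronous data-parallel SGD: each of P workers computes gradients on its own shard of the batch, the gradients are averaged (standing in for the all-reduce a real system would use), and every worker applies the same update. The least-squares model, learning rate, and worker count are illustrative assumptions, not from the slides.

```python
import numpy as np

def grad(W, X, y):
    """Gradient of a least-squares loss 0.5*||XW - y||^2 (illustrative model)."""
    return X.T @ (X @ W - y) / len(X)

def data_parallel_sgd_step(W, shards, lr=0.1):
    """One synchronous data-parallel step across P workers.

    Each worker holds a full copy of W and one data shard; averaging the
    gradients plays the role of the all-reduce in a real system.
    """
    grads = [grad(W, X_p, y_p) for (X_p, y_p) in shards]  # computed in parallel in practice
    g_avg = sum(grads) / len(grads)                       # coordinated weight update
    return W - lr * g_avg                                 # identical on every worker

# Toy usage: 4 workers, effective batch = 4 x 32 (the P-fold larger batch size).
rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 1))
W = np.zeros((8, 1))
for _ in range(100):
    shards = []
    for _ in range(4):
        X = rng.normal(size=(32, 8))
        shards.append((X, X @ W_true))
    W = data_parallel_sgd_step(W, shards)
print(float(np.linalg.norm(W - W_true)))  # should be close to 0
```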

Dally, High Performance Hardware for Machine Learning, NIPS’2015

Parameter Update: One method to achieve scale is parallelization

Parameter Server: p’ = p + ∆p

Model workers compute ∆p on their data shards and push it to the parameter server; the server applies p’ = p + ∆p, and workers pull back the updated parameters p’.

Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012
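For contrast with the synchronous sketch above, here is a minimal single-process sketch of the parameter-server pattern in the diagram: workers pull p, compute ∆p on their data shard, and push it back, and the server applies p’ = p + ∆p. The ParameterServer class, toy model, and sequential loop are illustrative assumptions; in Dean et al.'s system the pushes and pulls are asynchronous RPCs from many machines.

```python
import numpy as np

class ParameterServer:
    """Holds the master copy of the parameters p and applies p' = p + delta_p."""
    def __init__(self, p):
        self.p = p.copy()

    def push(self, delta_p):          # a worker sends its update
        self.p += delta_p

    def pull(self):                   # a worker fetches the latest parameters
        return self.p.copy()

def worker_step(server, X_shard, y_shard, lr=0.1):
    """One worker: pull p, compute a gradient on its data shard, push delta_p."""
    p = server.pull()
    g = X_shard.T @ (X_shard @ p - y_shard) / len(X_shard)  # illustrative loss
    server.push(-lr * g)              # delta_p = -lr * gradient

rng = np.random.default_rng(0)
p_true = rng.normal(size=(8, 1))
server = ParameterServer(np.zeros((8, 1)))
shards = []
for _ in range(4):                    # 4 data shards / model workers
    X = rng.normal(size=(64, 8))
    shards.append((X, X @ p_true))
for _ in range(50):
    for X_s, y_s in shards:           # sequential here; asynchronous in DistBelief
        worker_step(server, X_s, y_s)
print(float(np.linalg.norm(server.pull() - p_true)))  # should approach 0
```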

Dally, High Performance Hardware for Machine Learning, NIPS’2015 Model-Parallel Convolution – by output region (x,y)

Multiple 3D kernels K_uvkj map input maps A_xyk to output maps B_xyj.

6D Loop:
  forall output map j
    for each input map k
      for each pixel x, y
        for each kernel element u, v
          B_xyj += A_(x-u)(y-v)k * K_uvkj
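The sketch below spells out the 6D loop and the split by output region: each worker computes one (x, y) tile of the output maps while the kernels are replicated on every worker. The direct-convolution function, tile boundaries (valid region only), and array shapes are illustrative assumptions, not code from the talk.

```python
import numpy as np

def conv_tile(A, K, x0, x1, y0, y1):
    """Direct convolution (the 6D loop) restricted to output tile [x0:x1, y0:y1].

    A: input maps,  shape (X, Y, Kin)
    K: kernels,     shape (U, V, Kin, Jout)
    returns B tile, shape (x1-x0, y1-y0, Jout)
    """
    U, V, Kin, Jout = K.shape
    B = np.zeros((x1 - x0, y1 - y0, Jout))
    for j in range(Jout):                      # forall output map j
        for k in range(Kin):                   # for each input map k
            for x in range(x0, x1):            # for each pixel x, y in this worker's region
                for y in range(y0, y1):
                    for u in range(U):         # for each kernel element u, v
                        for v in range(V):
                            B[x - x0, y - y0, j] += A[x - u, y - v, k] * K[u, v, k, j]
    return B

rng = np.random.default_rng(0)
A = rng.normal(size=(12, 12, 3))
K = rng.normal(size=(3, 3, 3, 4))
# Model parallel by output region: 2 workers split the valid output rows 2..11.
tile_a = conv_tile(A, K, 2, 7, 2, 12)          # worker 0
tile_b = conv_tile(A, K, 7, 12, 2, 12)         # worker 1
B = np.concatenate([tile_a, tile_b], axis=0)   # stitch the regions back together
print(B.shape)                                 # (10, 10, 4)
```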

Dally, High Performance Hardware for Machine Learning, NIPS’2015 Model-Parallel Convolution – By output map j (filter)

Multiple 3D kernels K_uvkj map input maps A_xyk to output maps B_xyj; in this split, each worker owns a subset of the output maps j and the corresponding kernels (filters).

6D Loop:
  forall output map j
    for each input map k
      for each pixel x, y
        for each kernel element u, v
          B_xyj += A_(x-u)(y-v)k * K_uvkj
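Under the same assumptions as the previous sketch, the split by output map j instead gives each worker a slice of the kernels K[..., j0:j1]; the worker produces only the matching output maps, so the model (the kernels) is partitioned while the full input maps are needed on every worker.

```python
import numpy as np

def conv_map_slice(A, K_slice):
    """The same 6D loop, but this worker computes only its own output maps."""
    U, V, Kin, J = K_slice.shape
    X, Y, _ = A.shape
    B = np.zeros((X, Y, J))
    for j in range(J):
        for k in range(Kin):
            for x in range(U - 1, X):          # valid region only
                for y in range(V - 1, Y):
                    for u in range(U):
                        for v in range(V):
                            B[x, y, j] += A[x - u, y - v, k] * K_slice[u, v, k, j]
    return B

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 10, 3))
K = rng.normal(size=(3, 3, 3, 4))
B0 = conv_map_slice(A, K[..., 0:2])     # worker 0 holds kernels for output maps 0..1
B1 = conv_map_slice(A, K[..., 2:4])     # worker 1 holds kernels for output maps 2..3
B = np.concatenate([B0, B1], axis=2)    # full set of output maps
print(B.shape)                          # (10, 10, 4)
```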

Dally, High Performance Hardware for Machine Learning, NIPS’2015 Model Parallel Fully-Connected Layer (M x V)

b_i = Σ_j W_ij x a_j   (output activations b, input weight matrix W, input activations a)

Dally, High Performance Hardware for Machine Learning, NIPS’2015 Model Parallel Fully-Connected Layer (M x V)

b_i = Σ_j W_ij x a_j, with the rows of the weight matrix W (and the corresponding output activations b_i) partitioned across processors; every processor needs the full input activation vector a (a sketch follows below).
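A minimal sketch of the M x V split by output activation, assuming each of P workers stores only its block of rows of W and computes the matching slice of b; concatenating the slices (a gather in a real system) reconstitutes the full output vector. The shapes and worker count are illustrative.

```python
import numpy as np

def fc_slice(W_rows, a):
    """One worker: multiply its rows of the weight matrix by the full input vector."""
    return W_rows @ a                                        # b_i = sum_j W_ij * a_j for this worker's i

rng = np.random.default_rng(0)
M, N, P = 4096, 1024, 4                                      # output size, input size, number of workers
W = rng.normal(size=(M, N)).astype(np.float32)
a = rng.normal(size=N).astype(np.float32)

row_blocks = np.array_split(np.arange(M), P)                 # partition the output activations
b_slices = [fc_slice(W[rows], a) for rows in row_blocks]     # run on P workers in practice
b = np.concatenate(b_slices)                                 # gather the output activations

assert np.allclose(b, W @ a)
print(b.shape)                                               # (4096,)
```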

Dally, High Performance Hardware for Machine Learning, NIPS’2015 Hyper-Parameter Parallel Try many alternative networks in parallel

Dally, High Performance Hardware for Machine Learning, NIPS’2015

Summary of Parallelism

• Lots of parallelism in DNNs
  • 16M independent multiplies in one FC layer (a 4096 x 4096 layer alone has 4096 x 4096 ≈ 16.8M weights)
  • Limited by overhead to exploit a fraction of this

• Data parallel
  • Run multiple training examples in parallel
  • Limited by batch size

• Model parallel
  • Split model over multiple processors
  • By layer
  • Conv layers by map region
  • Fully connected layers by output activation

• Easy to get 16-64 GPUs training one model in parallel

Dally, High Performance Hardware for Machine Learning, NIPS’2015

Part 3: Efficient Training — Algorithms

• 1. Parallelization

• 2. Mixed Precision with FP16 and FP32

• 3. Model Distillation

• 4. DSD: Dense-Sparse-Dense Training Mixed Precision

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/ Mixed Precision Training

VOLTA TRAINING METHOD: the forward and backward passes run in FP16, while an FP32 master copy of the weights accumulates the updates.

• FWD: W (F16) and Actv (F16) produce Actv (F16)
• BWD-A: W (F16) and Actv Grad (F16) produce Actv Grad (F16)
• BWD-W: Actv (F16) and Actv Grad (F16) produce W Grad (F16)
• Weight Update: Master-W (F32) plus W Grad gives Updated Master-W (F32); the F16 weights used in FWD/BWD-A/BWD-W are obtained from the F32 master weights
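A minimal numpy sketch of this scheme on a toy linear model, assuming FP16 forward/backward arithmetic, an FP32 master copy of the weights for the update, and a constant loss-scale factor to keep small FP16 gradients representable (the AlexNet results below show why that scaling matters). The model, learning rate, and loss-scale value are illustrative assumptions, not from the talk.

```python
import numpy as np

def mixed_precision_step(master_w, x, y, lr=0.1, loss_scale=1000.0):
    """One step: FP16 forward/backward, FP32 master-weight update, scaled loss."""
    w16 = master_w.astype(np.float16)                    # W (F16) cast from Master-W (F32)
    x16 = x.astype(np.float16)

    pred = x16 @ w16                                     # FWD: Actv (F16)
    err = pred - y.astype(np.float16)

    # BWD-W on the scaled loss; scaling keeps small FP16 gradients from flushing to zero
    scaled_err = np.float16(loss_scale / len(x16)) * err
    w_grad16 = x16.T @ scaled_err                        # W Grad (F16)

    w_grad32 = w_grad16.astype(np.float32) / loss_scale  # unscale the gradient in FP32
    return master_w - lr * w_grad32                      # Updated Master-W (F32)

rng = np.random.default_rng(0)
w_true = rng.normal(size=(16, 1)).astype(np.float32)
master_w = np.zeros((16, 1), dtype=np.float32)
for _ in range(200):
    x = rng.normal(size=(64, 16)).astype(np.float32)
    y = x @ w_true
    master_w = mixed_precision_step(master_w, x, y)
print(float(np.linalg.norm(master_w - w_true)))          # small; limited by FP16 precision
```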

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017

Inception V1

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017

ResNet-50

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017

ALEXNET: COMPARISON OF RESULTS

AlexNet results (Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=1024, no augmentation, 1 crop, 1 model):
Mode                                      Top1 accuracy, %   Top5 accuracy, %
Fp32                                      58.62              81.25
Mixed precision training                  58.12              80.71
FP16 training, no scaling of loss         54.89              78.12
FP16 training, loss scale = 1000          57.76              80.76

RESNET RESULTS: ResNet-50 (Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=512, no augmentation, 1 crop, 1 model):
Mode                                      Top1 accuracy, %   Top5 accuracy, %
Fp32                                      71.75              90.52
Mixed precision training                  71.17              90.10
FP16 training, scale loss function 100x   71.17              90.33
FP16 training, loss scale = 1,
  FP16 master weight storage              70.53              90.14

INCEPTION-V3 RESULTS (Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=512, no augmentation, 1 crop, 1 model):
Mode                                      Top1 accuracy, %   Top5 accuracy, %
Fp32                                      73.85              91.44
Mixed precision training                  73.6               91.11
FP16 training                             71.36              90.84
FP16 training, loss scale = 100           74.13              91.51
FP16 training, loss scale = 100,
  FP16 master weight storage              73.52              91.08

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017

Part 3: Efficient Training Algorithm

• 1. Parallelization

• 2. Mixed Precision with FP16 and FP32

• 3. Model Distillation

• 4. DSD: Dense-Sparse-Dense Training Model Distillation

Teacher model 1 (GoogleNet), Teacher model 2 (VGGNet), Teacher model 3 (ResNet)
  each transfers its knowledge to a single student model

student model: has much smaller model size

Softened outputs reveal the dark knowledge

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network Softened outputs reveal the dark knowledge

• Method: divide the logits by a “temperature” T before the softmax to get a much softer distribution over classes; the student is trained to match these softened teacher outputs (a sketch follows below).

• Result: starting from a trained model that classifies 58.9% of the test frames correctly, the new (distilled) model converges to 57.0% correct even when it is trained on only 3% of the data.
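A minimal sketch of the softened-softmax idea, assuming plain numpy arrays of teacher and student logits: dividing the logits by a temperature T before the softmax exposes the relative scores of the wrong classes (the dark knowledge), and the distillation loss here is simply the cross-entropy between the two softened distributions. Combining it with the usual hard-label loss, as Hinton et al. do, is a straightforward extension; the logit values and T are illustrative.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; a larger T gives a softer distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)      # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax_with_temperature(teacher_logits, T)
    p_student = softmax_with_temperature(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()

teacher_logits = np.array([[10.0, 5.0, 1.0]])            # a confident teacher
student_logits = np.array([[4.0, 3.0, 0.5]])
print(softmax_with_temperature(teacher_logits, T=1.0))   # nearly one-hot
print(softmax_with_temperature(teacher_logits, T=4.0))   # dark knowledge: relative scores visible
print(distillation_loss(student_logits, teacher_logits))
```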

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network Part 3: Efficient Training Algorithm

• 1. Parallelization

• 2. Mixed Precision with FP16 and FP32

• 3. Model Distillation

• 4. DSD: Dense-Sparse-Dense Training DSD: Dense Sparse Dense Training


Dense → Sparse → Dense: Pruning (sparsity constraint), then Re-Dense (increase model capacity)

Figure 1: Dense-Sparse-Dense Training Flow. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting. DSD produces the same model architecture but can find a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks on CNNs / RNNs / LSTMs.

Algorithm 1: Workflow of DSD training
  Initialization: W(0) with W(0) ~ N(0, Σ)
  Output: W(t)
  ----- Initial Dense Phase -----
  while not converged do
      W(t) = W(t-1) - η ∇f(W(t-1); x(t-1));
      t = t + 1;
  end
  ----- Sparse Phase -----
  // initialize the mask by sorting and keeping the Top-k weights
  S = sort(|W(t-1)|);  λ = S_ki;  Mask = 1(|W(t-1)| > λ);
  while not converged do
      W(t) = W(t-1) - η ∇f(W(t-1); x(t-1));
      W(t) = W(t) · Mask;
      t = t + 1;
  end
  ----- Final Dense Phase -----
  while not converged do
      W(t) = W(t-1) - η ∇f(W(t-1); x(t-1));
      t = t + 1;
  end
  goto Sparse Phase for iterative DSD;

Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017
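A minimal sketch of the three DSD phases on a toy model, assuming plain SGD, a fixed number of steps per phase instead of a convergence test, and a least-squares objective; the Top-k magnitude mask and the masked update mirror Algorithm 1 above, while everything else is an illustrative assumption.

```python
import numpy as np

def sgd_phase(W, data_stream, steps, lr=0.05, mask=None):
    """Run SGD; if a mask is given, keep the pruned weights at zero (Sparse Phase)."""
    for _ in range(steps):
        X, y = next(data_stream)
        g = X.T @ (X @ W - y) / len(X)           # illustrative least-squares gradient
        W = W - lr * g
        if mask is not None:
            W = W * mask                          # W(t) = W(t) . Mask
    return W

def top_k_mask(W, sparsity=0.5):
    """Mask that keeps the largest-magnitude weights (Top-k) and prunes the rest."""
    k = int(W.size * (1.0 - sparsity))
    threshold = np.sort(np.abs(W), axis=None)[-k]
    return (np.abs(W) >= threshold).astype(W.dtype)

rng = np.random.default_rng(0)
W_true = rng.normal(size=(32, 1))
def stream():
    while True:
        X = rng.normal(size=(64, 32))
        yield X, X @ W_true
data = stream()

W = rng.normal(size=(32, 1)) * 0.01               # Initialization: W(0) ~ N(0, sigma)
W = sgd_phase(W, data, steps=300)                 # Initial Dense Phase
mask = top_k_mask(W, sparsity=0.5)                # prune 50% of the weights
W = sgd_phase(W * mask, data, steps=300, mask=mask)  # Sparse Phase: retrain the survivors
W = sgd_phase(W, data, steps=300)                 # Final Dense Phase: restore and retrain pruned weights
print(float(np.linalg.norm(W - W_true)))
```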

In contrast, simply reducing the model capacity would lead to the other extreme, causing a machine learning system to miss the relevant relationships between features and target outputs, leading to under-fitting and a high bias. Bias and variance are hard to optimize at the same time. Model compression methods (Han et al. (2016; 2015); Guo et al. (2016)) can reduce the model size by 35x-49x or more without hurting prediction accuracy. Compression without losing accuracy means there is significant redundancy in the trained model. Since the compressed model can achieve the same accuracy as the redundant uncompressed model, one hypothesis is that the model of the original size should have the capacity to achieve higher accuracy. This points to the inadequacy of current training methods, since they fail to find the better solutions that already exist. In order to find the expected higher accuracy, we propose a dense-sparse-dense training flow (DSD), a novel training strategy that starts from a dense model obtained by conventional training, then regularizes the model with sparsity-constrained optimization, and finally increases the model capacity by restoring and retraining the pruned weights. At testing time, the final model produced by DSD still has the same architecture and dimensions as the original dense model, and DSD training doesn't incur any inference overhead. We experimented with DSD training on 7 mainstream CNNs / RNNs / LSTMs and found consistent performance gains over the comparable counterparts for image classification, image captioning and speech recognition.

2 DSD TRAINING FLOW

Our DSD training employs a three-step process: dense, sparse, dense. Each step is illustrated in Figure 1 and Algorithm 1. The progression of weight distribution is plotted in Figure 2.

DSD: Intuition

learn the trunk first then learn the leaves

Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017 [Han et al. ICLR 2017] DSD is General Purpose: Vision, Speech, Natural Language

Network      Domain   Dataset    Type   Baseline   DSD   Abs. Imp.   Rel. Imp.

GoogleNet Vision ImageNet CNN 31.1% 30.0% 1.1% 3.6%

VGG-16 Vision ImageNet CNN 31.5% 27.2% 4.3% 13.7%

ResNet-18 Vision ImageNet CNN 30.4% 29.3% 1.1% 3.7%

ResNet-50 Vision ImageNet CNN 24.0% 23.2% 0.9% 3.5%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from Caffe Model Zoo. ResNet18, ResNet50 are from fb.resnet.torch.

Compression Acceleration Regularization 133 [Han et al. ICLR 2017] DSD is General Purpose: Vision, Speech, Natural Language

Network      Domain   Dataset    Type   Baseline   DSD   Abs. Imp.   Rel. Imp.

GoogleNet Vision ImageNet CNN 31.1% 30.0% 1.1% 3.6%

VGG-16 Vision ImageNet CNN 31.5% 27.2% 4.3% 13.7%

ResNet-18 Vision ImageNet CNN 30.4% 29.3% 1.1% 3.7%

ResNet-50 Vision ImageNet CNN 24.0% 23.2% 0.9% 3.5%

NeuralTalk Caption Flickr-8K LSTM 16.8 18.5 1.7 10.1%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from Caffe Model Zoo. ResNet18, ResNet50 are from fb.resnet.torch.

Compression Acceleration Regularization 134 [Han et al. ICLR 2017] DSD is General Purpose: Vision, Speech, Natural Language

Network      Domain   Dataset    Type   Baseline   DSD   Abs. Imp.   Rel. Imp.

GoogleNet Vision ImageNet CNN 31.1% 30.0% 1.1% 3.6%

VGG-16 Vision ImageNet CNN 31.5% 27.2% 4.3% 13.7%

ResNet-18 Vision ImageNet CNN 30.4% 29.3% 1.1% 3.7%

ResNet-50 Vision ImageNet CNN 24.0% 23.2% 0.9% 3.5%

NeuralTalk Caption Flickr-8K LSTM 16.8 18.5 1.7 10.1%

DeepSpeech Speech WSJ’93 RNN 33.6% 31.6% 2.0% 5.8%

DeepSpeech-2 Speech WSJ’93 RNN 14.5% 13.4% 1.1% 7.4%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from Caffe Model Zoo. ResNet18, ResNet50 are from fb.resnet.torch.

Compression Acceleration Regularization 135 https://songhan.github.io/DSD DSD on Caption Generation

• Baseline: a boy in a red shirt is climbing a rock wall. / Sparse: a young girl is jumping off a tree. / DSD: a young girl in a pink shirt is swinging on a swing.
• Baseline: a basketball player in a red uniform is playing with a ball. / Sparse: a basketball player in a blue uniform is jumping over the goal. / DSD: a basketball player in a white uniform is trying to make a shot.
• Baseline: two dogs are playing together in a field. / Sparse: two dogs are playing in a field. / DSD: two dogs are playing in the grass.
• Baseline: a man and a woman are sitting on a bench. / Sparse: a man is sitting on a bench with his hands in the air. / DSD: a man is sitting on a bench with his arms folded.
• Baseline: a person in a red jacket is riding a bike through the woods. / Sparse: a car drives through a mud puddle. / DSD: a car drives through a forest.

Figure 3: Visualization of DSD training improves the performance of image captioning.

Baseline model: Andrej Karpathy, NeuralTalk model zoo. Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017

… the forest from the background. The good performance of DSD training generalizes beyond these examples; more image caption results generated by DSD training are provided in the supplementary material. 137

Table 7: DSD results on NeuralTalk
NeuralTalk          BLEU-1   BLEU-2   BLEU-3   BLEU-4   Sparsity
Baseline            57.2     38.6     25.4     16.8     0%
Sparse              58.4     39.7     26.3     17.5     80%
DSD                 59.2     40.7     27.4     18.5     0%
Improvement (abs)   2.0      2.1      2.0      1.7      -
Improvement (rel)   3.5%     5.4%     7.9%     10.1%    -

3.7 DeepSpeech

We explore DSD training on speech data using the DeepSpeech network [16, 3]. DSD training experiments are performed on the 5-layer DeepSpeech network with 1 recurrent layer (DS1), which contains approximately 8 million parameters. The DS1 model is described in Table 8. The training data set used is Wall Street Journal (WSJ), which contains approximately 37,000 training utterances (81 hours of speech). We benchmark DSD training on two test sets from the WSJ corpus of read articles. The Word Error Rate (WER) reported on the test sets for the baseline model differs from those in Deep Speech 2 [3] for two reasons. First, the Deep Speech 2 models were trained using much larger data sets containing approximately 12,000 hours of multi-speaker speech data. Second, in Deep Speech 2, WER was evaluated with beam search and a language model; here the network output is obtained using only max decoding, to show the improvement in neural network accuracy alone.

Table 8: Deep Speech 1 Architecture
Layer ID   Type                      #Params
layer 0    Convolution               1814528
layer 1    FullyConnected            1049600
layer 2    FullyConnected            1049600
layer 3    Bidirectional Recurrent   3146752
layer 4    FullyConnected            1049600
layer 5    CTCCost                   29725

The baseline DS1 model is trained for 50 epochs on the WSJ training data. The weights from this model are pruned for the sparse iteration of DSD training. Weights are pruned in the FullyConnected layers and the Bidirectional Recurrent layer only. Each layer is pruned to achieve 50% sparsity, which results in an overall sparsity of 32.2% across the entire network. This sparse model is retrained for 50 epochs of WSJ data. For the final dense training, the pruned weights are initialized to zero and trained again for 50 epochs of WSJ training data. This step completes one iteration of DSD training. We use Nesterov SGD to train the model, reduce the learning rate with each retraining, and keep all other hyperparameters unchanged.

A. Supplementary Material: More Examples of DSD Training Improves the Captions Generated by NeuralTalk Auto-Caption System (Images from Flickr-8K Test Set)

DSD on Caption Generation

• Baseline: a boy is swimming in a pool. / Sparse: a small black dog is jumping into a pool. / DSD: a black and white dog is swimming in a pool.
• Baseline: a group of people are standing in front of a building. / Sparse: a group of people are standing in front of a building. / DSD: a group of people are walking in a park.
• Baseline: two girls in bathing suits are playing in the water. / Sparse: two children are playing in the sand. / DSD: two children are playing in the sand.
• Baseline: a man in a red shirt and jeans is riding a bicycle down a street. / Sparse: a man in a red shirt and a woman in a wheelchair. / DSD: a man and a woman are riding on a street.

• Baseline: a group of people sit on a bench in front of a building. / Sparse: a group of people are standing in front of a building. / DSD: a group of people are standing in a fountain.
• Baseline: a man in a black jacket and a black jacket is smiling. / Sparse: a man and a woman are standing in front of a mountain. / DSD: a man in a black jacket is standing next to a man in a black shirt.
• Baseline: a group of football players in red uniforms. / Sparse: a group of football players in a field. / DSD: a group of football players in red and white uniforms.
• Baseline: a dog runs through the grass. / Sparse: a dog runs through the grass. / DSD: a white and brown dog is running through the grass.

Baseline model: Andrej Karpathy, Neural Talk model zoo.

• Baseline: a man in a red shirt is standing on a rock. / Sparse: a man in a red jacket is standing on a mountaintop. / DSD: a man is standing on a rock overlooking the mountains.
• Baseline: a group of people are sitting in a subway station. / Sparse: a man and a woman are sitting on a couch. / DSD: a group of people are sitting at a table in a room.
• Baseline: a man in a red jacket is standing in front of a white building. / Sparse: a man in a black jacket is standing in front of a brick wall. / DSD: a man in a black jacket is standing in front of a white building.
• Baseline: a young girl in a red dress is holding a camera. / Sparse: a little girl in a pink dress is standing in front of a tree. / DSD: a little girl in a red dress is holding a red and white flowers.

• Baseline: a soccer player in a red and white uniform is playing with a soccer ball. / Sparse: two boys playing soccer. / DSD: two boys playing soccer.
• Baseline: a girl in a white dress is standing on a sidewalk. / Sparse: a girl in a pink shirt is standing in front of a white building. / DSD: a girl in a pink dress is walking on a sidewalk.
• Baseline: a young girl in a swimming pool. / Sparse: a young boy in a swimming pool. / DSD: a girl in a pink bathing suit jumps into a pool.
• Baseline: a soccer player in a red and white uniform is running on the field. / Sparse: a soccer player in a red uniform is tackling another player in a white uniform. / DSD: a soccer player in a red uniform kicks a soccer ball.

Agenda

• Algorithms for Efficient Inference
• Algorithms for Efficient Training
• Hardware for Efficient Inference
• Hardware for Efficient Training

CPUs for Training
CPUs Are Targeting Deep Learning

Intel Knights Landing (2016)

• 7 TFLOPS FP32

• 16GB MCDRAM– 400 GB/s

• 245W TDP

• 29 GFLOPS/W (FP32)

• 14nm process

Knights Mill: next gen Xeon Phi “optimized for deep learning”. Intel announced the addition of new vector instructions for deep learning (AVX512-4VNNIW and AVX512-4FMAPS), October 2016.

Slide Source: Sze et al., Survey of DNN Hardware, MICRO’16 Tutorial. Image Source: Intel. Data Source: Next Platform.

GPUs for Training
GPUs Are Targeting Deep Learning

Nvidia PASCAL GP100 (2016)

• 10/20 TFLOPS FP32/FP16
• 16GB HBM – 750 GB/s
• 300W TDP
• 67 GFLOPS/W (FP16)
• 16nm process
• 160GB/s NV Link

Slide Source: Sze et al Survey of DNN Hardware, MICRO’16 Tutorial. Data Source: NVIDIA

Source: Nvidia

GPUs for Training

Nvidia Volta GV100 (2017)

• 15 FP32 TFLOPS
• 120 Tensor TFLOPS
• 16GB HBM2 @ 900GB/s
• 300W TDP
• 12nm process
• 21B Transistors
• die size: 815 mm²
• 300GB/s NVLink

Data Source: NVIDIA

What’s new in Volta: Tensor Core
TENSOR CORE 4X4X4 MATRIX-MULTIPLY ACC

A new instruction that performs a 4x4x4 mixed-precision fused multiply-add (FMA) operation per clock: D = A x B + C, with FP16 multiplies and FP32 accumulation. This gives a 12X increase in throughput for the Volta V100 compared to the Pascal P100 (a sketch of the semantics follows below).
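A numpy sketch of the arithmetic a single Tensor Core performs per clock, assuming the common configuration of FP16 inputs with FP32 accumulation: D = A x B + C on 4x4 tiles. This models only the math, not the hardware scheduling or the CUDA WMMA API.

```python
import numpy as np

def tensor_core_mma(A, B, C):
    """One 4x4x4 mixed-precision FMA: D = A x B + C.

    A, B are 4x4 FP16 tiles; the products are accumulated in FP32 and added
    to the 4x4 FP32 accumulator C (64 multiply-adds per invocation).
    """
    assert A.shape == (4, 4) and B.shape == (4, 4) and C.shape == (4, 4)
    A32 = A.astype(np.float32)          # multiply FP16 inputs, accumulate in FP32
    B32 = B.astype(np.float32)
    return A32 @ B32 + C

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)).astype(np.float16)
B = rng.normal(size=(4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)
D = tensor_core_mma(A, B, C)
print(D.dtype, D.shape)                 # float32 (4, 4)
```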


https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/ Pascal v.s. Volta

Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/ Pascal v.s. Volta

Left: Tesla V100 trains the ResNet-50 deep neural network 2.4x faster than Tesla P100. Right: Given a target latency per image of 7ms, Tesla V100 is able to perform inference using the ResNet-50 deep neural network 3.7x faster than Tesla P100.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/ The GV100 SM is partitioned into four processing blocks, each with:

• 8 FP64 Cores
• 16 FP32 Cores
• 16 INT32 Cores
• two of the new mixed-precision Tensor Cores for deep learning
• a new L0 instruction cache
• one warp scheduler
• one dispatch unit
• a 64 KB register file

https://devblogs.nvidia.com/parallelforall/ cuda-9-features-revealed/ Tesla Product Tesla K40 Tesla M40 Tesla P100 Tesla V100 GPU GK110 (Kepler) GM200 (Maxwell) GP100 (Pascal) GV100 (Volta)

GPU Boost Clock 810/875 MHz 1114 MHz 1480 MHz 1455 MHz Peak FP32 TFLOP/s* 5.04 6.8 10.6 15 Peak Tensor Core - - - 120 TFLOP/s* Memory Interface 384-bit GDDR5 384-bit GDDR5 4096-bit HBM2 4096-bit HBM2

Memory Size Up to 12 GB Up to 24 GB 16 GB 16 GB

TDP 235 Watts 250 Watts 300 Watts 300 Watts Transistors 7.1 billion 8 billion 15.3 billion 21.1 billion GPU Die Size 551 mm² 601 mm² 610 mm² 815 mm² Manufacturing 28 nm 28 nm 16 nm FinFET+ 12 nm FFN Process https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/ GPU / TPU

https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/

Google Cloud TPU

Cloud TPU delivers up to 180 teraflops to train and run machine learning models.

source: Google Blog

Our new Cloud TPU delivers up to 180 teraflops to train and run machine learning models. 149

Each of these new TPU devices delivers up to 180 teraflops of floating-point performance. As powerful as these TPUs are on their own, though, we designed them to work even better together. Each TPU includes a custom high-speed network that allows us to build machine learning supercomputers we call “TPU pods.” A TPU pod contains 64 second-generation TPUs and provides up to 11.5 petaflops to accelerate the training of a single large machine learning model. That’s a lot of computation!
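As a quick check on those figures: 64 TPUs x 180 teraflops each = 11,520 teraflops, i.e. roughly 11.5 petaflops per pod, which matches the number quoted above.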

Using these TPU pods, we've already seen dramatic improvements in training times. One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy

https://blog.google/topics/google-cloud/google-cloud-offer-tpus-machine-learning/

in an afternoon using just one eighth of a TPU pod. Google Cloud TPU

A “TPU pod” built with 64 second-generation TPUs delivers up to 11.5 petaflops of machine learning acceleration.

“One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod.” — Google Blog

Introducing Cloud TPUs 150

We’re bringing our new TPUs to Google Compute Engine as Cloud TPUs, where you can connect them to virtual machines of all shapes and sizes and mix and match them with other types of hardware, including Skylake CPUs and NVIDIA GPUs. You can program these TPUs with TensorFlow, the most popular open-source machine learning framework on GitHub, and we’re introducing high-level APIs, which will make it easier to train machine learning models on CPUs, GPUs or Cloud TPUs with only minimal code changes.

With Cloud TPUs, you have the opportunity to integrate state-of-

https://blog.google/topics/google-cloud/google-cloud-offer-tpus-machine-learning/

Wrap-Up

• Algorithms for Efficient Inference
• Algorithms for Efficient Training
• Hardware for Efficient Inference
• Hardware for Efficient Training

Future

Smart Low Latency Privacy Mobility Energy-Efficient

152 Outlook: the Focus for Computation

PC Era Mobile-First Era AI-First Era

Mobile Computing, Cognitive Computing, Brain-Inspired Computing

Sundar Pichai, Google IO, 2016

153 Thank you!

stanford.edu/~songhan