Bandwidth-Efficient Deep Learning: from Compression to Acceleration

Song Han, Assistant Professor, EECS, Massachusetts Institute of Technology

1 AI is Changing Our Lives

Self-Driving Car Machine Translation

AlphaGo Smart Robots

2 Models are Getting Larger

Important property of neural networks: results get better with more data + bigger models + more computation. (Better algorithms, new insights, and improved techniques always help, too!)

IMAGE RECOGNITION:  AlexNet (2012): 8 layers, 1.4 GFLOP, ~16% error  ->  ResNet (2015): 152 layers, 22.6 GFLOP, ~3.5% error  (16X)
SPEECH RECOGNITION: Deep Speech 1 (2014): 80 GFLOP, 7,000 hrs of data, ~8% error  ->  Deep Speech 2 (2015): 465 GFLOP, 12,000 hrs of data, ~5% error  (10X)

Dally, NIPS’2016 workshop on Efficient Methods for Deep Neural Networks

3 The First Challenge: Model Size

Hard to distribute large models through over-the-air updates

4 The Second Challenge: Speed

            Error rate   Training time
ResNet18:   10.76%       2.5 days
ResNet50:    7.02%       5 days
ResNet101:   6.21%       1 week
ResNet152:   6.16%       1.5 weeks

Such long training times limit ML researchers' productivity

Training time benchmarked with fb.resnet.torch using four M40 GPUs

5 The Third Challenge: Energy Efficiency

AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game

on mobile: drains the battery
in the data center: increases TCO

6 Where is the Energy Consumed?

larger model => more memory references => more energy

7 Where is the Energy Consumed?

larger model => more memory references => more energy

Operation               Energy [pJ]   Relative Energy Cost
32 bit int ADD          0.1           1
32 bit float ADD        0.9           9
32 bit Register File    1             10
32 bit int MULT         3.1           31
32 bit float MULT       3.7           37
32 bit SRAM Cache       5             50
32 bit DRAM Memory      640           6400

Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.
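As a quick back-of-the-envelope check of why memory dominates (a sketch in plain Python; the 1G-connection network at 20 Hz is the example quoted later in the deck from the EIE paper, and the 640 pJ figure comes from the table above):

```python
# Rough energy estimate for fetching weights from DRAM, using the 45nm numbers above.
DRAM_ACCESS_PJ = 640          # energy per 32-bit DRAM access [pJ]
connections    = 1e9          # example: a 1G-connection network
frames_per_s   = 20           # example: 20 Hz inference

# One weight fetch per connection per frame, all served from DRAM (no data reuse).
energy_per_frame_j = connections * DRAM_ACCESS_PJ * 1e-12   # pJ -> J
power_w = energy_per_frame_j * frames_per_s

print(f"DRAM power: {power_w:.1f} W")   # ~12.8 W, well beyond a mobile power envelope
```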

[Slide shows a screenshot of the first page of Han et al., NIPS'15, highlighting Figure 1, the energy table above.]

8 Where is the Energy Consumed?

larger model => more memory references => more energy

[The energy table of Figure 1 is repeated, with the question:]

how to make deep learning more efficient?

Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

10 Application as a Black Box

Algorithm Spec 2006

Hardware

CPU

11 Open the Box before Hardware Design

? Algorithm

Hardware

?PU

Breaks the boundary between algorithm and hardware

12 Proposed Paradigm

How to compress a deep learning model?

Compression Acceleration Regularization 14 Learning both Weights and Connections for Efficient Neural Networks

Han et al. NIPS 2015

Compression Acceleration Regularization 15 Pruning Neural Networks

Pruning Trained Quantization Huffman Coding 16 Pruning Happens in Human Brain

Newborn: 50 Trillion Synapses  ->  1 year old: 1000 Trillion Synapses  ->  Adolescent: 500 Trillion Synapses

Christopher A Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.

Pruning Trained Quantization Huffman Coding 17 [Han et al. NIPS’15] Pruning AlexNet

CONV Layer: 3x FC Layer: 10x

Pruning Trained Quantization Huffman Coding 18 [Han et al. NIPS’15] Pruning Neural Networks

-0.01x^2 + x + 1

60 million connections -> 6 million connections: 10x fewer connections

Pruning Trained Quantization Huffman Coding 19 [Han et al. NIPS’15] Pruning Neural Networks

[Chart axes: accuracy loss (0.5% to -4.5%) vs. parameters pruned away (40% to 100%).]

Pruning Trained Quantization Huffman Coding 20 [Han et al. NIPS’15] Pruning Neural Networks

[Chart: the "Pruning" curve on the same axes (accuracy loss vs. parameters pruned away).]

Pruning Trained Quantization Huffman Coding 21 [Han et al. NIPS’15] Retrain to Recover Accuracy

[Chart: "Pruning" vs. "Pruning+Retraining" curves (accuracy loss vs. parameters pruned away).]

Pruning Trained Quantization Huffman Coding 22 [Han et al. NIPS’15] Iteratively Retrain to Recover Accuracy

[Chart: "Pruning", "Pruning+Retraining", and "Iterative Pruning and Retraining" curves (accuracy loss vs. parameters pruned away).]
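A minimal sketch of the prune-retrain loop behind these curves (plain numpy; the quantile-based threshold schedule and the train_fn callback are illustrative stand-ins, not the paper's exact per-layer thresholding):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def iterative_prune(weights, train_fn, final_sparsity=0.9, steps=5, epochs_per_step=2):
    """Prune -> retrain -> prune ..., raising the sparsity gradually, as in the iterative curve."""
    mask = np.ones_like(weights)
    for step in range(1, steps + 1):
        sparsity = final_sparsity * step / steps
        weights, mask = prune_by_magnitude(weights, sparsity)
        for _ in range(epochs_per_step):
            weights = train_fn(weights) * mask   # retrain; removed connections stay at zero
    return weights, mask
```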

Pruning Trained Quantization Huffman Coding 23 [Han et al. NIPS’15] Pruning RNN and LSTM

Image Captioning

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

*Image from Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions"

Pruning Trained Quantization Huffman Coding 24 [Han et al. NIPS’15] Pruning RNN and LSTM

• Original:   a basketball player in a white uniform is playing with a ball
  Pruned 90%: a basketball player in a white uniform is playing with a basketball

• Original:   a brown dog is running through a grassy field
  Pruned 90%: a brown dog is running through a grassy area

• Original:   a man is riding a surfboard on a wave
  Pruned 90%: a man in a wetsuit is riding a wave on a beach

• Original:   a soccer player in red is running in the field
  Pruned 95%: a man in a red shirt and black and white black shirt is running through a field

Pruning Trained Quantization Huffman Coding 25 Exploring the Granularity of Sparsity that is Hardware-friendly

4 types of pruning granularity

fully-dense model => irregular sparsity => regular sparsity => more regular sparsity

[Han et al, NIPS’15] [Molchanov et al, ICLR’17]
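A small illustration of how the granularity changes the mask structure (plain numpy; the mapping of fine/vector/kernel/channel granularity onto the slide's "irregular / regular / more regular" labels and the 50% sparsity target are my assumptions for the sketch):

```python
import numpy as np

def mask_at_granularity(w, granularity, sparsity=0.5):
    """Zero the lowest-magnitude groups of a conv tensor (O, I, kh, kw) at a given granularity."""
    O, I, kh, kw = w.shape
    if granularity == "fine":        # irregular: score individual weights
        scores, shape = np.abs(w), w.shape
    elif granularity == "vector":    # regular: score rows of each kernel
        scores, shape = np.abs(w).mean(axis=3), (O, I, kh, 1)
    elif granularity == "kernel":    # more regular: score whole kh x kw kernels
        scores, shape = np.abs(w).mean(axis=(2, 3)), (O, I, 1, 1)
    elif granularity == "channel":   # most regular: score whole output channels
        scores, shape = np.abs(w).mean(axis=(1, 2, 3)), (O, 1, 1, 1)
    keep = scores > np.quantile(scores, sparsity)
    return np.broadcast_to(keep.reshape(shape), w.shape).astype(w.dtype)

w = np.random.randn(64, 32, 3, 3)
for g in ["fine", "vector", "kernel", "channel"]:
    print(g, "fraction kept:", mask_at_granularity(w, g).mean())
```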

26 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Han et al. ICLR 2016 Best Paper

Pruning Trained Quantization Huffman Coding 28 [Han et al. ICLR’16] Trained Quantization

2.09, 2.12, 1.92, 1.87  =>  2.0

Pruning Trained Quantization Huffman Coding 29 [Han et al. ICLR'16] Trained Quantization

Figure 2: Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.

weights (32-bit float):             cluster index (2-bit uint):    centroids:
  2.09  -0.98   1.48   0.09           3  0  2  1                    3:  2.00
  0.05  -0.14  -1.08   2.12           1  1  0  3                    2:  1.50
 -0.91   1.92   0.00  -1.03           0  3  1  0                    1:  0.00
  1.87   0.00   1.53   1.49           3  1  2  2                    0: -1.00

gradient:
 -0.03  -0.01   0.03   0.02
 -0.01   0.01  -0.02   0.12
 -0.01   0.02   0.04   0.01
 -0.07  -0.02   0.01  -0.02

The gradients are grouped by cluster index, reduced (summed) per cluster, multiplied by the learning rate (lr), and subtracted from the centroids of the previous iteration to produce the fine-tuned centroids.

Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom). Pruning Trained Quantization Huffman Coding 30

We store the sparse structure that results from pruning using compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number of non-zero elements and n is the number of rows or columns. To compress further, we store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger than the bound, we use the zero-padding solution shown in Figure 2: when the difference exceeds 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
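A sketch of the relative-index encoding with filler zeros described above (plain Python; the 4-bit index width and the example row are made up, and the exact offset convention may differ slightly from the paper's figure):

```python
def encode_relative(weights, index_bits=4):
    """Store (index_diff, value) pairs; insert filler zeros when the gap overflows."""
    max_gap = (1 << index_bits) - 1          # largest representable index difference
    encoded, last = [], -1
    for i, w in enumerate(weights):
        if w == 0:
            continue
        gap = i - last
        while gap > max_gap:                 # pad with a filler zero to keep gaps representable
            encoded.append((max_gap, 0.0))
            gap -= max_gap
        encoded.append((gap, w))
        last = i
    return encoded

# Sparse row with a long run of zeros between the two non-zeros.
row = [0.0] * 40
row[3], row[20] = 1.7, -0.5
print(encode_relative(row))   # [(4, 1.7), (15, 0.0), (2, -0.5)]
```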

3 TRAINED QUANTIZATION AND WEIGHT SHARING

Network quantization and weight sharing further compress the pruned network by reducing the number of bits required to represent each weight. We limit the number of effective weights we need to store by having multiple connections share the same weight, and then fine-tune those shared weights. Weight sharing is illustrated in Figure 3. Suppose we have a layer with 4 input neurons and 4 output neurons; the weight matrix is 4 x 4. On the top left is the 4 x 4 weight matrix, and on the bottom left is the 4 x 4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors); all the weights in the same bin share the same value, so for each weight we only need to store a small index into a table of shared weights. During the update, all the gradients are grouped by color and summed together, multiplied by the learning rate, and subtracted from the shared centroids of the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each CONV layer, and 5 bits (32 shared weights) for each FC layer, without any loss of accuracy.
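A minimal sketch of weight sharing by k-means and of the grouped-gradient centroid update described above (plain numpy on the 4x4 example of Figure 3; a real implementation would hook into the training framework's backward pass):

```python
import numpy as np

def kmeans_quantize(weights, k=4, iters=50):
    """Cluster weights into k shared values; return (codebook, index per weight)."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)   # linear initialization over [min, max]
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    return centroids, idx.reshape(weights.shape)

def update_centroids(centroids, idx, grad, lr=0.01):
    """Fine-tune shared weights: group gradients by cluster, sum, apply one SGD step."""
    for c in range(len(centroids)):
        if np.any(idx == c):
            centroids[c] -= lr * grad[idx == c].sum()
    return centroids

w = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])
codebook, idx = kmeans_quantize(w, k=4)
w_quant = codebook[idx]                       # layer now needs only 2-bit indices + 4 centroids
codebook = update_centroids(codebook, idx, grad=np.random.randn(*w.shape) * 0.01)
```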

To calculate the compression rate, given k clusters, we only need log2(k) bits to encode the index. In general, for a network with n connections and each connection is represented with b bits, constraining the connections to have only k shared weights will result in a compression rate of:

    r = nb / (n * log2(k) + k * b)        (1)

For example, Figure 3 shows the weights of a single-layer neural network with four input units and four output units. There are 4 x 4 = 16 weights originally, but there are only 4 shared weights: similar weights are grouped together to share the same value. Originally we need to store 16 weights of 32 bits each; after quantization we store only 4 shared 32-bit values plus 16 2-bit indices.
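Plugging the figure's numbers into Equation 1 as a quick check (the 4096x4096 layer in the second line is just an illustrative FC-layer size, not a result from the paper):

```python
from math import log2

def compression_rate(n, b, k):
    """Equation 1: n weights of b bits each, quantized to k shared values."""
    return (n * b) / (n * log2(k) + k * b)

print(compression_rate(n=16, b=32, k=4))              # 3.2x for the 4x4 example in Figure 3
print(compression_rate(n=4096 * 4096, b=32, k=32))    # ~6.4x for a 4096x4096 layer at 5 bits
```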

[Han et al. ICLR'16] After Trained Quantization: Discrete Weights

[Histogram: count vs. weight value; after quantization the weights take only a small set of discrete values.]

Pruning Trained Quantization Huffman Coding 31 [Han et al. ICLR'16] After Trained Quantization: Discrete Weights after Training

[Histogram: count vs. weight value after fine-tuning the shared weights.]

Pruning Trained Quantization Huffman Coding 32 [Han et al. ICLR’16] How Many Bits do We Need?

Pruning Trained Quantization Huffman Coding 33

Table 4.9: Comparison of uniform quantization and non-uniform quantization (this work) with different update methods. -c: update centroids only; -c+l: update both centroids and labels. Baseline ResNet-50 accuracy: 76.15% (top-1), 92.87% (top-5). All results are after retraining.

Quantization Method          1 bit     2 bit     4 bit     6 bit     8 bit
Uniform (Top-1)              -         59.33%    74.52%    75.49%    76.15%
Uniform (Top-5)              -         82.39%    91.97%    92.60%    92.91%
Non-uniform -c (Top-1)       24.08%    68.41%    76.16%    76.13%    76.20%
Non-uniform -c (Top-5)       48.57%    88.49%    92.85%    92.91%    92.88%
Non-uniform -c+l (Top-1)     24.71%    69.36%    76.17%    76.21%    76.19%
Non-uniform -c+l (Top-5)     49.84%    89.03%    92.87%    92.89%    92.90%

[Han et al. ICLR'16] How Many Bits do We Need?

[Chart: top-1 accuracy (73%-77%) vs. number of bits per weight (2, 4, 6, 8 bits) for ResNet-50 after fine-tuning, comparing non-uniform and uniform quantization; the full-precision top-1 accuracy is 76.15%.]

Figure 4.10: Non-uniform quantization performs better than uniform quantization. Pruning Trained Quantization Huffman Coding 34

With non-uniform quantization (this work), all the layers of the baseline ResNet-50 can be compressed to 4 bits without losing accuracy. For uniform quantization, however, all the layers of the baseline ResNet-50 can only be compressed to 8 bits without losing accuracy (at 4 bits, there is about a 1.6% top-1 accuracy loss when using uniform quantization). The advantage of non-uniform quantization is that it can better capture the non-uniform distribution of the weights: where the probability density is higher, the centroids are closer together, which uniform quantization cannot achieve. Table 4.9 compares the performance of the two non-uniform quantization strategies. During fine-tuning, one strategy is to update only the centroids; the other is to update both the centroids and the labels (the label indicates which centroid a weight belongs to). Intuitively, the latter has more degrees of freedom in the learning process and should give better performance. However, experiments show that the improvement is not significant, as shown in the third and fourth rows of Table 4.9.
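A toy comparison of the two schemes (illustrative only; the synthetic Gaussian weights and the bit-widths are not from the thesis): uniform quantization spreads the 2^b levels evenly over the weight range, while non-uniform quantization places k-means centroids where the weight density is high, which typically gives a lower quantization error.

```python
import numpy as np

def uniform_levels(w, bits):
    return np.linspace(w.min(), w.max(), 2 ** bits)

def kmeans_levels(w, bits, iters=100):
    levels = uniform_levels(w, bits).copy()
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
        for c in range(len(levels)):
            if np.any(idx == c):
                levels[c] = w[idx == c].mean()
    return levels

def quantize_mse(w, levels):
    q = levels[np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)]
    return np.mean((w - q) ** 2)

w = np.random.randn(100_000) * 0.05          # weights concentrated near zero
for bits in (2, 4):
    print(bits, "uniform:", quantize_mse(w, uniform_levels(w, bits)),
                "non-uniform:", quantize_mse(w, kmeans_levels(w, bits)))
```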

We use scaled gradients for the 32-bit latent weights:

    ∂L/∂w̃_l = W_l^p × ∂L/∂w_l^t    if  w̃_l  >  ∆_l
    ∂L/∂w̃_l = 1     × ∂L/∂w_l^t    if |w̃_l| <= ∆_l        (8)
    ∂L/∂w̃_l = W_l^n × ∂L/∂w_l^t    if  w̃_l  < -∆_l

Note that we use the scalar 1 as the gradient factor for zero weights. The overall quantization process is illustrated in Figure 1. The evolution of the ternary weights from different layers during training is shown in Figure 2. We observe that as training proceeds, different layers behave differently: for the first quantized conv layer, the absolute values of W_l^p and W_l^n get smaller and sparsity gets lower, while for the last conv layer and the fully connected layer, the absolute values of W_l^p and W_l^n get larger and sparsity gets higher. We learn the ternary assignments (indices into the codebook) by updating the latent full-resolution weights during training. This may cause the assignments to change between iterations. Note that the thresholds are not constants, as the maximal absolute values change over time. Once an updated weight crosses the threshold, its ternary assignment is changed. The benefits of using trained quantization factors are: i) the asymmetry W_l^p ≠ W_l^n gives the network more model capacity; ii) quantized weights play the role of "learning rate multipliers" during back propagation.

More Aggressive Compression: Trained Ternary Quantization

3.2 QUANTIZATION HEURISTIC

In previous work on ternary weight networks, Li & Liu (2016) proposed Ternary Weight Networks (TWN) using ±∆_l as thresholds to reduce 32-bit weights to ternary values, where ∆_l is defined as in Equation 5. They optimized the value of ±∆_l by minimizing the expected L2 distance between the full-precision weights and the ternary weights. Instead of using a strictly optimized threshold, we adopt different heuristics: 1) use the maximum absolute value of the weights as a reference for the layer's threshold and maintain a constant factor t for all layers:

    ∆_l = t × max(|w̃|)        (9)

and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on CIFAR-10 and ImageNet, and use the second one to explore a wider range of sparsities in Section 5.1.1.

[Figure: the full-precision weights are normalized, quantized to {-1, 0, +1} with the threshold ±t, and then scaled by the trained factors Wn and Wp to give the final ternary weights {-Wn, 0, Wp}; gradients flow back to both the latent weights and the scaling factors, and only the ternary weights are used at inference time.]
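A sketch of the ternary forward quantization with the Equation 9 threshold and t = 0.05 (plain numpy; here Wp and Wn are set to the means of the weights above/below threshold purely for illustration, whereas TTQ learns them by back propagation):

```python
import numpy as np

def ternarize(w_latent, t=0.05):
    """Quantize latent full-precision weights to {-Wn, 0, +Wp} with a per-layer threshold."""
    delta = t * np.max(np.abs(w_latent))        # Equation 9: threshold from the max |w|
    pos, neg = w_latent > delta, w_latent < -delta
    # Illustrative scaling factors; TTQ learns Wp and Wn with their own gradients instead.
    Wp = w_latent[pos].mean() if pos.any() else 0.0
    Wn = -w_latent[neg].mean() if neg.any() else 0.0
    w_ternary = np.zeros_like(w_latent)
    w_ternary[pos], w_ternary[neg] = Wp, -Wn
    sparsity = 1.0 - (pos | neg).mean()         # fraction of weights quantized to zero
    return w_ternary, sparsity

w = np.random.randn(256, 256) * 0.1
wq, s = ternarize(w)
print("sparsity:", round(s, 3))
```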

[Figure 2 panels: ternary weight values (Wn, Wp) and the percentage of negatives, zeros, and positives over 150 training epochs for layers res1.0/conv1, res3.2/conv2, and linear of ResNet-20.]

Figure 2: Ternary weights value (above) and distribution (below) with iterations for different layers of ResNet-20 on CIFAR-10.


[Han et al. ICLR'16] Results: Compression Ratio

Network        Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy

LeNet-300      1070 KB         27 KB             40x                 98.36%              98.42%
LeNet-5        1720 KB         44 KB             39x                 99.20%              99.26%
AlexNet        240 MB          6.9 MB            35x                 80.27%              80.30%
VGGNet         550 MB          11.3 MB           49x                 88.68%              89.09%
Inception-V3   91 MB           4.2 MB            22x                 93.56%              93.67%
ResNet-50      97 MB           5.8 MB            17x                 92.87%              93.04%

Can we make compact models to begin with?

Compression Acceleration Regularization 36 SqueezeNet

Input (64 channels)
  Squeeze: 16 1x1 Conv
  Expand: 64 1x1 Conv + 64 3x3 Conv
  Concat/Eltwise -> Output (128 channels)

Vanilla Fire module

Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016
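A sketch of the vanilla Fire module drawn above, in PyTorch (the 64 -> squeeze 16 -> expand 64+64 channel counts follow the slide; the layer names are mine):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze with 1x1 convs, then expand with parallel 1x1 and 3x3 convs, then concat."""
    def __init__(self, in_ch=64, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)   # 64 + 64 = 128 channels

x = torch.randn(1, 64, 56, 56)
print(Fire()(x).shape)   # torch.Size([1, 128, 56, 56])
```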

Compression Acceleration Regularization 37 Compressing SqueezeNet

Network      Approach           Size      Ratio   Top-1 Accuracy   Top-5 Accuracy

AlexNet      -                  240 MB    1x      57.2%            80.3%
AlexNet      SVD                48 MB     5x      56.0%            79.4%
AlexNet      Deep Compression   6.9 MB    35x     57.2%            80.3%
SqueezeNet   -                  4.8 MB    50x     57.5%            80.3%
SqueezeNet   Deep Compression   0.47 MB   510x    57.5%            80.3%

Compression Acceleration Regularization 38 Results: Speedup

Baseline: mAP = 59.47 / 28.48 / 45.43, FLOP = 17.5G, # Parameters = 6.0M
Pruned:   mAP = 59.30 / 28.33 / 47.72, FLOP = 8.9G,  # Parameters = 2.5M

[Chart: frames per second at batch sizes 1, 8, and 32; the pruned model runs about 1.6x faster than the baseline.]

39 Deep Compression Applied to Industry

Deep Compression

Compression Acceleration Regularization 40 A Facebook AR prototype backed by Deep Compression

- 8x model size reduction by Deep Compression

- Run on mobile

- Source: Facebook F8’17

41 EIE: Efficient Inference Engine on Compressed Deep Neural Network

Han et al. ISCA 2016

42 Deep Learning Accelerators

• First Wave: Compute (Neu Flow)

• Second Wave: Memory (Diannao family)

• Third Wave: Algorithm / Hardware Co-Design (EIE)

Google TPU: “This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs”

43 [Han et al. ISCA’16] EIE: the First DNN Accelerator for Sparse, Compressed Model

0 * A = 0    W * 0 = 0    2.09, 1.92 => 2.0

Sparse Weight: 90% static sparsity -> 10x less computation, 5x less memory footprint
Sparse Activation: 70% dynamic sparsity -> 3x less computation
Weight Sharing: 4-bit weights -> 8x less memory footprint

Compression Acceleration Regularization 44 [Han et al. ISCA’16] EIE: Parallelization on Sparsity

a = [0, a1, 0, a3]   (sparse input activations)

W (an 8x4 sparse weight matrix; rows are interleaved over PE0-PE3):
PE0:  w0,0  w0,1  0     w0,3
PE1:  0     0     w1,2  0
PE2:  0     w2,1  0     w2,3
PE3:  0     0     0     0
PE0:  0     0     w4,2  w4,3
PE1:  w5,0  0     0     0
PE2:  0     0     0     w6,3
PE3:  0     w7,1  0     0

b = W × a = [b0, b1, -b2, b3, -b4, b5, b6, -b7]  =>  after ReLU: [b0, b1, 0, b3, 0, b5, b6, 0]

Compression Acceleration Regularization 45

[Han et al. ISCA’16] EIE: Parallelization on Sparsity

[The same sparse matrix-vector example, now mapped onto an array of processing elements (PEs) coordinated by a Central Control unit; each PE holds the rows assigned to it.]

Compression Acceleration Regularization 46

[Han et al. ISCA’16] Dataflow

[The same sparse matrix-vector example, annotated with the dataflow: non-zero activations are broadcast, and each PE accumulates into its own slice of the output.]

rule of thumb: 0 * A = 0 W * 0 = 0
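A software sketch of this rule of thumb (plain numpy, CSC-style storage with a shared-weight codebook; it only illustrates the skipping logic, while the real EIE additionally interleaves rows across 64 PEs and uses relative indices):

```python
import numpy as np

def eie_matvec(col_ptr, row_idx, w_idx, codebook, a, n_rows):
    """y = ReLU(W @ a) for a CSC-stored W whose values are codebook indices."""
    y = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0:                                    # skip zero activations: W * 0 = 0
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):    # only stored non-zeros: 0 * A = 0
            y[row_idx[p]] += codebook[w_idx[p]] * aj   # decode shared weight, accumulate
    return np.maximum(y, 0)                            # ReLU keeps the next activations sparse

codebook = np.array([-1.0, 0.0, 1.5, 2.0])   # 4 shared weights (2-bit indices here; EIE uses 4-bit)
col_ptr  = [0, 1, 2, 3, 5]                   # CSC column pointers
row_idx  = [0, 2, 1, 0, 3]                   # row of each stored non-zero
w_idx    = [3, 2, 0, 2, 3]                   # codebook index of each stored non-zero
a        = np.array([0.0, 2.0, 0.0, 1.0])    # sparse input activations
print(eie_matvec(col_ptr, row_idx, w_idx, codebook, a, n_rows=4))   # [1.5 0. 3. 2.]
```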

Compression Acceleration Regularization 47

EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, William J. Dally
Stanford University / NVIDIA

[Han et al. ISCA’16] EIE Architecture

[Slide shows the first page of the EIE paper. Figure 1: efficient inference engine that works on the compressed deep neural network model; encoded weights and relative indices are decoded through a weight look-up table and index accumulation before the ALU. Going from DRAM to SRAM gives EIE 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression, and 24,000x and 3,400x more energy efficient, respectively.]

rule of thumb: 0 * A = 0    W * 0 = 0    2.09, 1.92 => 2.0

Compression Acceleration Regularization 48

[Han et al. ISCA’16] Post Layout Result of EIE
In 1998, Lecun et lation is quite suitable for customized hardware [10]–[15]. al. classified handwritten digits with less than 1M parame- However, it has been found that fully-connected (FC) layers, ters [5], while in 2012, Krizhevsky et al. won the ImageNet widely used in RNN and LSTMs, are bandwidth limited competition with 60M parameters [1]. Deepface classified on large networks [14]. Unlike CONV layers, there is no human faces with 120M parameters [6]. Neural Talk [7] parameter reuse in FC layers. Data batching has become automatically converts image to natural language with 130M an efficient solution when training networks on CPUs or CNN parameters and 100M RNN parameters. Coates et GPUs, however, it is unsuitable for real-time applications al. scaled up a network to 10 billion parameters on HPC with latency requirements. systems [8]. Network compression via pruning and weight sharing Large DNN models are very powerful but consume large [16] makes it possible to fit modern networks such as amounts of energy because the model must be stored in AlexNet (60M parameters, 240MB), and VGG-16 (130M external DRAM, and fetched every time for each image, parameters, 520MB) in on-chip SRAM. Processing these [Han et al. ISCA’16] Post Layout Result of EIE

Technology 40 nm

# PEs 64

on-chip SRAM 8 MB

Max Model Size 84 million parameters

Static Sparsity 10x

Dynamic Sparsity 3x

Quantization 4-bit

ALU Width 16-bit

Area 40.8 mm^2

MxV Throughput 81,967 layers/s

Power 586 mW

1. Post layout result 2. Throughput measured on AlexNet FC-7

Compression Acceleration Regularization 49 [Han et al. ISCA’16] Speedup on EIE

[Slide shows Figure 6 from the EIE paper: speedups of GPU, mobile GPU, and EIE compared with CPU running the uncompressed DNN model (no batching) on Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, NT-LSTM, and their geometric mean; EIE's geometric-mean speedup over the dense CPU baseline is 189x.]

Compression Acceleration Regularization 50

[Han et al. ISCA’16] Energy Efficiency on EIE

[Slide shows Figure 7 from the EIE paper: energy efficiency of GPU, mobile GPU, and EIE compared with CPU on the same nine benchmarks (no batching); EIE is roughly 24,000x more energy efficient than CPU and 3,400x more than GPU.]

Compression Acceleration Regularization 51
FLOP% Descriptiontivation read and write access a local register and activation We compare the performance on two sets of models: bypassing is employed to avoid a pipeline hazard. Using switching activity interchange format (SAIF), and estimated 9216, uncompressed DNN model and the compressed DNN model. 64 PEs running at 800MHz yields a performance of 102 Alex-6 9% 35.1% 3% the power using Prime-Time PX. 4096 Compressed 4096, AlexNet [1] for Alex-7 9% 35.3% 3% Comparison Baseline. We compare EIE with three dif- 4096 large scale image ferent off-the-shelf computing units: CPU, GPU and mobile 4096, classification Alex-8 25% 37.5% 10% GPU. 1000 25088, VGG-6 4% 18.3% 1% Compressed 1) CPU. We use Intel Core i-7 5930k CPU, a Haswell-E 4096 VGG-16 [3] for 4096, class processor, that has been used in NVIDIA Digits Deep VGG-7 4% 37.5% 2% large scale image 4096 Learning Dev Box as a CPU baseline. To run the benchmark classification and 4096, VGG-8 23% 41.1% 9% object detection on CPU, we used MKL CBLAS GEMV to implement the 1000 original dense model and MKL SPBLAS CSRMV for the 4096, Compressed NT-We 10% 100% 10% compressed sparse model. CPU socket and DRAM power 600 NeuralTalk [7] 600, with RNN and NT-Wd 11% 100% 11% are as reported by the pcm-power utility provided by Intel. 8791 LSTM for 1201, automatic 2) GPU. We use NVIDIA GeForce GTX Titan X GPU, NTLSTM 10% 100% 11% a state-of-the-art GPU for deep learning as our baseline 2400 image captioning using nvidia-smi utility to report the power. To run the benchmark, we used cuBLAS GEMV to implement in [16], [23]. The benchmark networks have 9 layers in total the original dense layer. For the compressed sparse layer, obtained from AlexNet, VGGNet, and NeuralTalk. We use we stored the sparse matrix in in CSR format, and used the Image-Net dataset [29] and the Caffe [28] deep learning cuSPARSE CSRMV kernel, which is optimized for sparse framework as golden model to verify the correctness of the matrix-vector multiplication on GPUs. hardware design. 3) Mobile GPU. We use NVIDIA Tegra K1 that has 192 CUDA cores as our mobile GPU baseline. We used VI. EXPERIMENTAL RESULTS cuBLAS GEMV for the original dense model and cuS- Figure 5 shows the layout (after place-and-route) of PARSE CSRMV for the compressed sparse model. Tegra K1 an EIE processing element. The power/area breakdown is doesn’t have software interface to report power consumption, shown in Table II. We brought the critical path delay down so we measured the total power consumption with a power- to 1.15ns by introducing 4 pipeline stages to update one meter, then assumed 15% AC to DC conversion loss, 85% activation: codebook lookup and address accumulation (in regulator efficiency and 15% power consumed by peripheral parallel), output activation read and input activation multiply components [26], [27] to report the AP+DRAM power for (in parallel), shift and add, and output activation write. Ac- Tegra K1. tivation read and write access a local register and activation Benchmarks. We compare the performance on two sets bypassing is employed to avoid a pipeline hazard. Using of models: uncompressed DNN model and the compressed 64 PEs running at 800MHz yields a performance of 102 DNN model. The uncompressed DNN model is obtained GOP/s. Considering 10× weight sparsity and 3× activation from Caffe model zoo [28] and NeuralTalk model zoo [7]; sparsity, this requires a dense DNN accelerator 3TOP/s to The compressed DNN model is produced as described have equivalent application throughput. [Han et al. 
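To make the sparse-matrix read and arithmetic units above concrete, here is a minimal NumPy functional model of one PE's share of the compressed sparse matrix-vector product. It is a sketch of the dataflow only; the function and variable names are mine, and it ignores the 4-bit packing, the 64-bit SRAM fetch width and the pipelining described in the paper.

```python
import numpy as np

def eie_pe_spmv(columns, codebook, activations, out_len):
    """Functional model of one EIE PE's slice of b = W @ a.

    columns[j] holds this PE's slice of column j as (weight_index, row) pairs,
    standing in for the 4-bit encoded (v, x) entries between pointers p_j and
    p_j+1. codebook maps a weight index back to its shared real value.
    """
    b = np.zeros(out_len)                         # destination activation registers
    for j, a_j in enumerate(activations):
        if a_j == 0.0:                            # LNZD: only non-zero activations are broadcast
            continue
        for weight_idx, row in columns[j]:        # sparse-matrix read unit
            b[row] += codebook[weight_idx] * a_j  # arithmetic unit: b_x += v * a_j
    return b

# Tiny usage example with a pruned, codebook-quantized matrix (values are illustrative).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6)) * (rng.random((8, 6)) < 0.3)   # roughly 70% of weights pruned
codebook = np.unique(np.round(W[W != 0], 1))               # stand-in for shared weights
nearest = lambda v: int(np.abs(codebook - v).argmin())
columns = [[(nearest(W[i, j]), i) for i in range(8) if W[i, j] != 0] for j in range(6)]
a = rng.normal(size=6) * (rng.random(6) < 0.7)             # sparse input activations
b = eie_pe_spmv(columns, codebook, a, out_len=8)           # approximates W @ a
```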
[Han et al. ISCA’16] Comparison: Throughput

Bar chart: Throughput (Layers/s, log scale from 1E+00 to 1E+06) across platforms: Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE with 64 PEs (ASIC, 45nm), and EIE with 256 PEs (ASIC, 28nm).

Compression Acceleration Regularization 52 [Han et al. ISCA’16] Comparison: Energy Efficiency

Bar chart: Energy Efficiency (Layers/J, log scale from 1E+00 to 1E+06) across the same platforms: Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE with 64 PEs (ASIC, 45nm), and EIE with 256 PEs (ASIC, 28nm).
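The two comparison charts report throughput in layers per second and energy efficiency in layers per joule. As a minimal illustration of how those axis values follow from a measured per-layer latency and average power (the numbers below are hypothetical, not taken from the charts):

```python
def layers_per_second(latency_s: float) -> float:
    """Throughput for one FC layer, given its measured latency in seconds."""
    return 1.0 / latency_s

def layers_per_joule(latency_s: float, avg_power_w: float) -> float:
    """Energy efficiency: layers processed per joule (energy = power * time)."""
    return 1.0 / (latency_s * avg_power_w)

# Hypothetical measurements for illustration only:
# an accelerator running a layer in 1 ms at 0.6 W vs. a CPU taking 4 ms at 100 W.
print(layers_per_second(1e-3), layers_per_joule(1e-3, 0.6))    # 1000 layers/s, ~1667 layers/J
print(layers_per_second(4e-3), layers_per_joule(4e-3, 100.0))  # 250 layers/s, 2.5 layers/J
```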

Compression Acceleration Regularization 53 Feedforward => ?

Compression Acceleration Regularization 54

[Han et al. FPGA’17] ESE Architecture

Block diagram (panels (a) and (b)). Overall system: a software program runs on the CPU and exchanges data with external memory; the FPGA contains a PCIE controller, a MEM controller and the ESE accelerator, in which the ESE controller manages channels (Channel 0, Channel 1, ...), each holding multiple processing elements (PE 0 ... PE N) on a shared data bus. Channel-level logic includes an adder tree, element-wise multipliers (ElemMul), a sigmoid/tanh unit and an element-wise stage that combines the Wx_t / Wy_t-1 and Wc / c_t-1 terms to produce the LSTM cell state c_t and output h_t (Ht buffer). Each processing element contains an activation queue (ActQueue) fed by FIFOs carrying x_t / y_t-1 and m_t, pointer-read (PtrRead) and sparse-matrix-read (SpmatRead) units with pointer and weight buffers, a sparse matrix-vector (SpMV) unit with an accumulator (Accu), input/output/activation buffers, and an assemble stage that gathers per-PE results (y_t) into memory.
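As a rough software analogue of this datapath, the sketch below computes one LSTM step with pruned weights stored as SciPy CSR matrices and notes which ESE block each operation corresponds to. The gate stacking, names and shapes are my assumptions for illustration, not ESE's on-chip format.

```python
import numpy as np
from scipy import sparse

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_sparse(W_x, W_h, b, x_t, h_prev, c_prev):
    """One LSTM step with pruned weights stored as CSR matrices.

    W_x: (4H, X) sparse input weights, W_h: (4H, H) sparse recurrent weights,
    with gates stacked in (i, f, g, o) order (an assumption of this sketch).
    """
    H = h_prev.shape[0]
    # SpMV units compute the sparse matrix-vector products; their partial
    # sums are combined as by the channel's adder tree.
    z = W_x @ x_t + W_h @ h_prev + b
    i, f, g, o = z[0:H], z[H:2*H], z[2*H:3*H], z[3*H:4*H]
    # Sigmoid/Tanh unit plus the element-wise multipliers (ElemMul / Elt-wise).
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t

# Tiny example with 90% of the weights pruned away.
X, H = 16, 8
W_x = sparse.random(4 * H, X, density=0.1, format="csr", random_state=0)
W_h = sparse.random(4 * H, H, density=0.1, format="csr", random_state=1)
rng = np.random.default_rng(2)
h_t, c_t = lstm_step_sparse(W_x, W_h, np.zeros(4 * H), rng.normal(size=X),
                            np.zeros(H), np.zeros(H))
```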

Compression Acceleration Regularization 55

ESE is now on AWS Marketplace with Xilinx VU9P FPGA

https://aws.amazon.com/marketplace/pp/B079N2J42R

58 Thank you! [email protected]

Smart, Fast, Efficient

Summary: Pruning and Quantization for model compression (Han et al. NIPS’15; Han et al. ICLR’16, best paper award), accelerated inference with EIE (Han et al. ISCA’16) and ESE (Han et al. FPGA’17, best paper award), and Dense-Sparse-Dense training as a regularizer (Pruning, sparsity constraint, then Re-Dense to increase model capacity; Han et al. ICLR’17).

[Han et al. ICLR’17, under review as a conference paper at ICLR 2017] DSD: Dense-Sparse-Dense Training

Figure 1: Dense-Sparse-Dense Training Flow. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting.

Algorithm 1: Workflow of DSD training
Initialization: W^(0) with W^(0) ~ N(0, Σ). Output: W^(t).
Initial dense phase: while not converged do W^(t) = W^(t-1) − η^(t) ∇f(W^(t-1); x^(t-1)); t = t + 1; end
Sparse phase: initialize the mask by sorting and keeping the top-k weights: S = sort(|W^(t-1)|); λ = S_k; Mask = 1(|W^(t-1)| > λ);
while not converged do W^(t) = W^(t-1) − η^(t) ∇f(W^(t-1); x^(t-1)); W^(t) = W^(t) · Mask; t = t + 1; end
Final dense phase: while not converged do W^(t) = W^(t-1) − η^(t) ∇f(W^(t-1); x^(t-1)); t = t + 1; end
goto the sparse phase for iterative DSD.

Excerpt from the DSD paper shown on this slide:

In contrast, simply reducing the model capacity would lead to the other extreme, causing the machine learning system to miss the relevant relationships between features and target outputs, leading to under-fitting and high bias. Bias and variance are hard to optimize at the same time.

Model compression methods (Han et al. (2016; 2015); Guo et al. (2016)) can reduce the model size by 35x-49x or more without hurting prediction accuracy. Compression without losing accuracy means there is significant redundancy in the trained model. Since the compressed model can achieve the same accuracy as the redundant uncompressed model, one hypothesis is that a model of the original size should have the capacity to achieve higher accuracy; this shows the inadequacy of current training methods, which fail to find the existing better solutions.

To find this expected higher accuracy, we propose dense-sparse-dense training (DSD), a novel training strategy that starts from a dense model obtained by conventional training, then regularizes the model with sparsity-constrained optimization, and finally increases the model capacity by restoring and retraining the pruned weights. At test time, the final model produced by DSD has the same architecture and dimensions as the original dense model, so DSD training incurs no inference overhead. We experimented with DSD training on 7 mainstream CNNs / RNNs / LSTMs and found consistent performance gains over the comparable counterparts for image classification, image captioning and speech recognition.

2 DSD Training Flow. Our DSD training employs a three-step process: dense, sparse, dense. Each step is illustrated in Figure 1 and Algorithm 1. The progression of the weight distribution is plotted in Figure 2.

59
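Algorithm 1 is straightforward to prototype. Below is a minimal NumPy sketch of the dense-sparse-dense schedule on a toy least-squares problem; the model, sparsity level, learning rates and step counts are arbitrary choices for illustration, whereas the paper applies the same schedule to deep networks trained with SGD.

```python
import numpy as np

def dsd_train(X, y, sparsity=0.5, lr=0.01, steps=200, seed=0):
    """Minimal sketch of the dense-sparse-dense schedule (Algorithm 1)
    on a toy least-squares model y ~ X @ w. Illustrative only."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])        # W^(0) ~ N(0, sigma)

    def grad(w):                                      # gradient of 0.5 * ||X w - y||^2 / n
        return X.T @ (X @ w - y) / len(y)

    # Initial dense phase: ordinary gradient descent on all weights.
    for _ in range(steps):
        w -= lr * grad(w)

    # Sparse phase: keep the top-k weights by magnitude and retrain under the mask.
    k = max(1, int((1 - sparsity) * w.size))
    lam = np.sort(np.abs(w))[-k]                      # lambda = k-th largest |w|
    mask = (np.abs(w) >= lam).astype(w.dtype)
    for _ in range(steps):
        w -= lr * grad(w)
        w *= mask                                     # enforce W = W * Mask

    # Final dense phase: lift the constraint (pruned weights restart from zero)
    # and retrain, here with a smaller learning rate.
    for _ in range(steps):
        w -= (lr * 0.1) * grad(w)
    return w

# Toy usage: recover a 10-dimensional linear model from 100 noisy samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.01 * rng.normal(size=100)
w_dsd = dsd_train(X, y)
```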
