Mixed Precision Neural Architecture Search for Energy Efficient Deep Learning
Chengyue Gong*¹, Zixuan Jiang*², Dilin Wang¹, Yibo Lin², Qiang Liu¹, and David Z. Pan²
¹CS Department, ²ECE Department, The University of Texas at Austin
* indicates equal contribution

1 Contents
- Introduction
- Our algorithm
- Experimental results
- Conclusion
2 Success of Machine Learning
[Figure: example applications, image classification ("cat") and Chinese-to-English machine translation]
3 Energy Efficient Computation
- Energy consumption, latency, security, etc. are critical metrics for edge inference.
  [Diagram: training in the cloud; inference at the edge]
- Tradeoff between accuracy and complexity of models.
- Efficient computation aims at higher accuracy with less energy:
  › Neural architecture
  › Quantization
4 Neural Architecture Design
- The mechanism of neural networks is not well interpreted.
- Designing neural architectures is challenging.
- Can we advance AI/ML using artificial intelligence instead of human intelligence?
[Figure: ImageNet results, layers, speed (ms), top-1 and top-5 error for AlexNet, VGG-16, VGG-19, Inception-V1, ResNet-18/34/50/101/152/200]
5 Neural Architecture Search
[Diagram: the controller samples networks α from the search space; each sample goes through training and evaluation plus hardware simulation (the environment, treated as a black box), and the results update the controller.]
6 Neural Architecture Search
- Black-box optimization
  › Find the optimal network configuration to maximize performance
  › Huge search space
- Available methods
  › Reinforcement learning
  › Evolutionary algorithms
  › Differentiable architecture search
[Diagram: sample from the search space following the policy; update the policy with feedback from the black box.]

H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," ICLR 2019.
7 Quantization
- Weights and activations can be quantized due to the inherent redundancy in representations.
- Mixed precision for different layers
  › HAQ
[Figure: MobileNet-V1 on ImageNet; top-1/top-5 accuracy and energy (mJ) for fix8, fix6, fix4, and HAQ mixed precision]

K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, "HAQ: Hardware-aware automated quantization with mixed precision," CVPR 2019.
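To make the per-layer bitwidth choice concrete, here is a minimal sketch of symmetric uniform (fixed-point style) weight quantization in Python; the function name and the max-abs scale calibration are illustrative assumptions, not HAQ's exact scheme:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of a nonzero weight tensor to `bits` bits.

    A minimal sketch: real pipelines also quantize activations and
    calibrate or learn the scale rather than using max-abs.
    """
    levels = 2 ** (bits - 1) - 1           # e.g. 127 representable steps for 8-bit signed
    scale = np.abs(w).max() / levels
    q = np.round(w / scale).clip(-levels, levels)
    return q * scale                        # dequantized ("fake-quant") values

w = np.array([0.9, -0.31, 0.02, -0.77])
w4 = quantize_uniform(w, 4)                 # coarse: only 7 positive levels
w8 = quantize_uniform(w, 8)                 # much closer to the original values
```

Lower bitwidths shrink storage and energy but raise rounding error, which is why mixing precisions across layers pays off.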
8 Our Work
[Venn diagram: our work lies at the intersection of mixed precision quantization, neural architecture search, and energy efficient computation.]
9 Contents
- Introduction
- Our algorithm
- Experimental results
- Conclusion
10 Search Space
Basis: MobileNetV2 Block (MB)
- expand ratio e ∈ {1, 3, 6}
- kernel size k ∈ {3, 5, 7}
- network connectivity s ∈ {0, 1}
- layer-wise bitwidths b_w, b_a ∈ {2, 4, 6, 8}

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. C., "MobileNetV2: Inverted residuals and linear bottlenecks," CVPR 2018.
11 Search Space

Input Shape    | Block Type        | Bitwidth | #Channels | Stride | #Blocks
224 × 224 × 3  | Conv 3 × 3        | 8        | 32        | 2      | 1
112 × 112 × 32 | Search-space      | b_w, b_a | 16        | 2      | 1
56 × 56 × 16   | block             |          | 24        | 1      | 2
56 × 56 × 24   | (e, k, s,         |          | 32        | 2      | 4
28 × 28 × 32   |  b_w, b_a)        |          | 64        | 2      | 4
14 × 14 × 64   |                   |          | 128       | 1      | 4
14 × 14 × 128  |                   |          | 160       | 2      | 5
7 × 7 × 160    |                   |          | 256       | 1      | 2
7 × 7 × 256    | Conv 1 × 1        | 8        | 1280      | 1      | 1
7 × 7 × 1280   | Pooling and FC    | 8        | -         | 1      | 1

- Neural architecture settings: e, k, s
- Quantization settings: b_w, b_a
- The joint space over all searchable blocks is enormous.
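The per-block choices above multiply out quickly. A small sketch of the count (the block count `num_blocks` is an illustrative assumption):

```python
# Per-block choices from the search space above.
expand_ratios = [1, 3, 6]      # e
kernel_sizes  = [3, 5, 7]      # k
connectivity  = [0, 1]         # s: skip or keep the block
bitwidths     = [2, 4, 6, 8]   # candidates for both b_w and b_a

per_block = (len(expand_ratios) * len(kernel_sizes) * len(connectivity)
             * len(bitwidths) * len(bitwidths))
print(per_block)               # 288 joint architecture + precision choices

num_blocks = 21                # assumed number of searchable MB blocks
print(f"{per_block ** num_blocks:.2e}")  # on the order of 10**51 candidate networks
```

Exhaustive enumeration is hopeless at this scale, which is why a learned controller over the space is needed.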
12 Our Framework
[Diagram: the controller π_θ samples network configurations α. Each sample is trained and evaluated to obtain its loss, and deployed on the hardware model to measure its energy. Loss and energy are combined into a weighted reward that updates the controller.]
13 Problem Formulation
- Discover neural architectures that minimize the task-related loss while satisfying the energy constraint.

    min_θ  E_{α∼π_θ}[ L_validate(α; w*) ]             (expectation of the loss)
    s.t.   w* = argmin_w E_{α∼π_θ}[ L_train(α; w) ]   (training of the NN)
           E_{α∼π_θ}[ E(α) ] < E_0                    (energy constraint)

π_θ is the policy with parameter θ; α represents a neural network with weights w.
14 Problem Formulation
- Discover neural architectures that minimize the task-related loss while satisfying the energy constraint.

    min_θ  E_{α∼π_θ}[ L_validate(α; w*) ]
    s.t.   w* = argmin_w E_{α∼π_θ}[ L_train(α; w) ]
           E_{α∼π_θ}[ E(α) ] < E_0

- Relaxation

    min_θ  E_{α∼π_θ}[ L_validate(α; w*) ] + λ · E_{α∼π_θ}[ max(E(α) − E_0, 0) ]
    s.t.   w* = argmin_w E_{α∼π_θ}[ L_train(α; w) ]
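The relaxation replaces the hard constraint with a hinge penalty on the energy. A one-line sketch (`penalized_loss` and its argument names are illustrative):

```python
def penalized_loss(task_loss, energy, budget, lam):
    """Relaxed objective: task loss plus lam * max(energy - budget, 0).

    The hinge term is zero whenever the sampled network meets the
    energy budget, so feasible networks are judged on loss alone.
    """
    return task_loss + lam * max(energy - budget, 0.0)

print(penalized_loss(2.0, energy=0.8, budget=1.0, lam=0.1))  # 2.0 (under budget)
print(penalized_loss(2.0, energy=1.5, budget=1.0, lam=0.1))  # ~2.05 (over budget)
```

λ trades accuracy against energy: a larger λ pushes the search toward cheaper networks, as the searched-results slide later shows.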
15 Hardware Environment
[Diagram: the framework of slide 12, with the hardware path highlighted: deploying the sampled network yields its energy measurement, which enters the weighted reward.]
16 REINFORCE Algorithm
- Policy gradient theorem: for any differentiable policy π_θ and any policy objective function J, the policy gradient is

    ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(α) · R(α) ]

- Non-differentiable energy measures:

    ∇_θ E_{α∼π_θ}[ E(α) ] = E_{α∼π_θ}[ E(α) ∇_θ log π_θ(α) ]
                          ≈ (1/N) Σ_{i=1}^{N} E(α_i) ∇_θ log π_θ(α_i)
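A minimal sketch of the Monte Carlo estimator above for a single categorical choice: a softmax policy over three candidate ops whose energies come from a black-box lookup (the toy energy values and function names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(theta, energy_fn, n_samples=1000):
    """Score-function estimate of d/d_theta E_{a~pi_theta}[energy(a)]
    for a categorical policy pi_theta = softmax(theta).

    energy_fn may be a non-differentiable black box (e.g. a hardware
    simulator): only samples and log-prob gradients are needed, i.e.
    (1/N) * sum_i energy(a_i) * grad_theta log pi_theta(a_i).
    """
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(len(theta), p=probs)
        score = -probs               # grad of log softmax: onehot(a) - probs
        score[a] += 1.0
        grad += energy_fn(a) * score
    return grad / n_samples

energies = np.array([5.0, 2.0, 1.0])     # toy "hardware" energy per candidate op
g = reinforce_grad(np.zeros(3), lambda a: energies[a])
```

Descending along this gradient shifts probability mass toward low-energy candidates even though the energy function itself provides no gradients.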
17 Software Environment
[Diagram: the framework of slide 12, with the software path highlighted: training and evaluating the sampled network yields its loss, which enters the weighted reward.]
18 Non-Differentiability
- Relax the discrete mask variable m to a continuous random variable computed by the Gumbel-Softmax function

    m_i = exp((g_i + log p_i)/τ) / Σ_j exp((g_j + log p_j)/τ)

  where p_i is the logit, g_i ∼ Gumbel(0, 1), and τ is the temperature.
- One-hot [0, 1, 0] vs. continuous [.3, .5, .2]

E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," ICLR 2017.
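A minimal sketch of the Gumbel-Softmax sample above (names are illustrative; `logits` here plays the role of log p_i):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau):
    """Continuous relaxation of a categorical sample (Jang et al., 2017).

    m_i = exp((g_i + logit_i)/tau) / sum_j exp((g_j + logit_j)/tau),
    with g_i ~ Gumbel(0, 1).  Small tau -> nearly one-hot; large tau ->
    nearly uniform, and the mask stays differentiable w.r.t. the logits.
    """
    g = rng.gumbel(size=len(logits))   # g_i ~ Gumbel(0, 1)
    z = (logits + g) / tau
    e = np.exp(z - z.max())            # numerically stabilized softmax
    return e / e.sum()

logits = np.log(np.array([0.2, 0.5, 0.3]))   # log p_i
soft = gumbel_softmax(logits, tau=1.0)       # continuous mask, e.g. [.3, .5, .2]-like
hard = gumbel_softmax(logits, tau=0.05)      # close to one-hot
```

Annealing τ toward zero over training gradually turns the soft mixture into a discrete architecture choice.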
19 Bilevel Optimization
- Whenever the policy parameter θ changes, the network weights w(α) need to be retrained.
- Motivated by differentiable architecture search (DARTS), we propose the following algorithm:
  1. Sample a minibatch of network configurations α from the controller.
  2. Update the network models w(α) by minimizing the training loss.
  3. Update the controller parameters θ.

H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," ICLR 2019.
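To illustrate the alternation on something runnable, here is a toy sketch: two candidate "architectures", each with one trainable weight and a fixed energy cost. The targets, energies, budget, learning rates, and the short weight warm-up are all illustrative assumptions, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

targets  = np.array([1.0, 3.0])   # per-arch regression target (toy "task")
energies = np.array([2.0, 0.5])   # arch 0 exceeds the energy budget of 1.0
weights  = np.zeros(2)            # one trainable weight per architecture
theta    = np.zeros(2)            # controller parameters (logits)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Warm up the weights so early rewards are meaningful.
for a in (0, 1):
    for _ in range(40):
        weights[a] -= 0.1 * 2.0 * (weights[a] - targets[a])

for step in range(300):
    # 1) sample a configuration alpha from the controller pi_theta
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    # 2) update the sampled network's weights on the training loss
    weights[a] -= 0.1 * 2.0 * (weights[a] - targets[a])
    # 3) REINFORCE step on the energy-penalized loss (the relaxation)
    penalized = (weights[a] - targets[a]) ** 2 + 0.5 * max(energies[a] - 1.0, 0.0)
    score = -probs                 # grad of log pi_theta at a: onehot(a) - probs
    score[a] += 1.0
    theta -= 0.2 * penalized * score
# The controller ends up preferring arch 1, the one inside the budget.
```

In the real framework the inner update trains a sampled quantized network on minibatches and the reward mixes validation loss with simulated energy; the alternation structure is the same.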
20 Contents
- Introduction
- Our algorithm
- Experimental results
- Conclusion
21 Experimental Settings
- Hardware simulator of Bit Fusion [1, 2]
- First, search architectures and mixed precision for each layer on a proxy task, Tiny ImageNet
  › Trained for a fixed 60 epochs
  › 5 days on 1 NVIDIA Tesla P100
- Next, train the discovered architectures on CIFAR-100 and ImageNet from scratch.

[1] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. ISCA, June 2018, pp. 764–775.
[2] https://github.com/hsharma35/bitfusion
22 Searched Results
[Figure: the searched architectures, Ours-small (λ = 0.1) and Ours-base (λ = 0.01)]
23 Results on ImageNet
[Bar chart: ImageNet comparison of HAQ-small, Ours-small, HAQ-base, and Ours-base on top-5 error, model size (MB), energy (mJ), and latency (ms); the searched models improve on the HAQ baselines at both scales.]

24 Joint NAS and Mixed Precision Quantization
[Scatter plot: error (%) vs. energy (mJ); the jointly searched models beat sequential NAS followed by quantization at comparable energy and model size: Ours-small ~22.2% vs. NAS + quantization (small) ~22.4%, and Ours-base ~21.2% vs. NAS + quantization (base) ~21.4%.]

25 Adaptive Mixed Precision Quantization
- Pareto front for error rate, latency, and energy
26 Contents
- Introduction
- Our algorithm
- Experimental results
- Conclusion
27 Conclusion
- We propose a new methodology to jointly optimize NAS and mixed precision quantization in the extended search space.
- Hardware performance is incorporated into the objective function.
- Our methodology facilitates an end-to-end design automation flow for neural network design and deployment, especially for edge inference.
28 Thank you!
29 Backup Framework
- Pipelines of hardware-centric design automation for efficient neural networks: neural architecture search, model pruning, and model quantization.
- Limited research considers each stage of the pipeline collaboratively.
- Our proposed framework, mixed precision NAS, addresses these stages jointly.
30 Backup Result
[Bar chart: ImageNet comparison of VGG-16 (FXP8), ResNet-50 (FXP8), MobileNetV2 (FXP8), FBNet-B (FXP8), FBNet-B (FXP3), HAQ-small (mixed), Ours-small (mixed), HAQ-base (mixed), and Ours-base (mixed) on top-1 error, top-5 error, model size (MB), energy (mJ), and latency (ms).]