Mixed Precision Neural Architecture Search for Energy Efficient Deep Learning

Chengyue Gong*1, Zixuan Jiang*2, Dilin Wang1, Yibo Lin2, Qiang Liu1, and David Z. Pan2

1 CS Department, 2 ECE Department, The University of Texas at Austin

∗ indicates equal contributions

1 Contents
• Introduction
• Our algorithm
• Experimental results
• Conclusion

2 Success of Deep Learning
[Figure: example applications, such as image classification (a cat photo) and Chinese-to-English machine translation]

3 Energy Efficient Computation
• Energy consumption, latency, security, etc. are critical metrics of edge inference.
• There is a tradeoff between the accuracy and the complexity of models.
• Efficient computation:
  › Neural architecture
  › Quantization
[Figure: training in the cloud, inference at the edge; goal: higher accuracy with less energy]

4 Neural Architecture Design
• The mechanism of neural networks is not well interpreted.
• Designing neural architectures is challenging.
• Can we advance AI/ML using machine intelligence instead of human intelligence?
[Chart: ImageNet results — layers, speed (ms), and top-1/top-5 error rates for AlexNet, VGG-16, VGG-19, ResNet-18/34/50/101/152/200, and Inception-V1]

5 Neural Architecture Search
[Diagram: a controller π_θ samples networks from the search space; the environment (training and evaluation plus hardware simulation) acts as a black box whose feedback updates the controller]

6 Neural Architecture Search
• Black-box optimization
  › Find the optimal network configuration to maximize performance
  › Huge search space
• Available methods
  › Reinforcement learning (sample following the policy, then update the policy)
  › Evolutionary algorithms
  › Differentiable architecture search
H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," ICLR 2019.

7 Quantization
• Weights and activations can be quantized due to the inherent redundancy in their representations (see the sketch below).
• Mixed precision for different layers
  › HAQ
[Chart: MobileNet-V1 on ImageNet — top-1/top-5 accuracy and energy (mJ) for fix8, fix6, fix4, and HAQ mixed-precision quantization]
K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, "HAQ: Hardware-aware automated quantization with mixed precision," CVPR 2019.
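To make the first bullet concrete, here is a minimal uniform fixed-point quantizer in Python (an illustrative sketch, not the deck's code; HAQ and our method search the bitwidth, while the quantizer itself is the standard uniform one):

```python
import numpy as np

def quantize(x, bits):
    """Uniformly quantize x to 2**bits levels spanning its own value range."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

w = np.random.default_rng(0).normal(size=1000)  # stand-in for a weight tensor
for b in (8, 6, 4, 2):
    err = np.abs(w - quantize(w, b)).max()
    print(f"{b}-bit: max quantization error = {err:.4f}")  # error grows as b shrinks
```

Lower bitwidths shrink memory and energy at the cost of quantization error, which is why per-layer (mixed) precision is attractive.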

8 Our Work
[Venn diagram: our work lies at the intersection of mixed precision quantization, neural architecture search, and energy efficient computation]

9 Contents
• Introduction
• Our algorithm
• Experimental results
• Conclusion

10 Search Space
Basis: MobileNetV2 Block (MB)
• Expand ratio e ∈ {1, 3, 6}
• Kernel size k ∈ {3, 5, 7}
• Network connectivity c ∈ {0, 1}
• Layer-wise bitwidths b_w, b_a ∈ {2, 4, 6, 8}
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. C., "MobileNetV2: Inverted residuals and linear bottlenecks," CVPR 2018.

11 Search Space

Input Shape    | Block Type               | Bitwidth | #Channels | Stride | #Blocks
224 × 224 × 3  | Conv 3 × 3               | 8        | 32        | 2      | 1
112 × 112 × 32 | block(e, k, c, b_w, b_a) | searched | 16        | 2      | 1
56 × 56 × 16   | block(e, k, c, b_w, b_a) | searched | 24        | 1      | 2
56 × 56 × 24   | block(e, k, c, b_w, b_a) | searched | 32        | 2      | 4
28 × 28 × 32   | block(e, k, c, b_w, b_a) | searched | 64        | 2      | 4
14 × 14 × 64   | block(e, k, c, b_w, b_a) | searched | 128       | 1      | 4
14 × 14 × 128  | block(e, k, c, b_w, b_a) | searched | 160       | 2      | 5
7 × 7 × 160    | block(e, k, c, b_w, b_a) | searched | 256       | 1      | 2
7 × 7 × 256    | Conv 1 × 1               | 8        | 1280      | 1      | 1
7 × 7 × 1280   | Pooling and FC           | 8        | -         | 1      | 1

Each searched block chooses a neural architecture (e, k, c) and quantization bitwidths (b_w, b_a), so the search space holds 288^22 ≈ 1.3 × 10^54 configurations (see the sketch below).
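The search-space size quoted above follows directly from the table; a quick Python sanity check (the arithmetic is from slides 10 and 11, the script itself is ours):

```python
# Choices per searched block: expand ratio e (3), kernel size k (3),
# connectivity c (2), and bitwidths b_w, b_a (4 each) -> 3*3*2*4*4 = 288.
choices_per_block = 3 * 3 * 2 * 4 * 4
# The searched rows of the table contribute 1 + 2 + 4 + 4 + 4 + 5 + 2 = 22 blocks.
num_blocks = 1 + 2 + 4 + 4 + 4 + 5 + 2
total = choices_per_block ** num_blocks
print(f"{choices_per_block}^{num_blocks} = {total:.2e} configurations")  # ~1.28e+54
```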

12 Our Framework
[Diagram: a controller π_θ samples architectures; the environment trains and evaluates each network (loss) and deploys it to measure energy; a weighted reward combining loss and energy updates the controller]

13 Problem Formulation
• Discover neural architectures that minimize the task-related loss while satisfying the energy constraint.

$\min_{\theta}\ \mathbb{E}_{\alpha \sim \pi_\theta}\left[\mathcal{L}_{\mathrm{validation}}(\alpha;\ w^*)\right]$  (expectation of the loss)
$\text{s.t.}\ \ w^* = \arg\min_{w}\ \mathbb{E}_{\alpha \sim \pi_\theta}\left[\mathcal{L}_{\mathrm{train}}(\alpha;\ w)\right]$  (training of the NN)
$\qquad\ \mathbb{E}_{\alpha \sim \pi_\theta}\left[E(\alpha)\right] < E_0$  (energy constraint)

$\pi_\theta$ is the policy with parameter $\theta$; $\alpha$ represents a neural network with weights $w$.

14 Problem Formulation
• Discover neural architectures that minimize the task-related loss while satisfying the energy constraint.

$\min_{\theta}\ \mathbb{E}_{\alpha \sim \pi_\theta}\left[\mathcal{L}_{\mathrm{validation}}(\alpha;\ w^*)\right]$
$\text{s.t.}\ \ w^* = \arg\min_{w}\ \mathbb{E}_{\alpha \sim \pi_\theta}\left[\mathcal{L}_{\mathrm{train}}(\alpha;\ w)\right],\quad \mathbb{E}_{\alpha \sim \pi_\theta}\left[E(\alpha)\right] < E_0$

• Relaxation: move the energy constraint into the objective as a hinge penalty weighted by $\lambda$ (see the sketch below):

$\min_{\theta}\ \mathbb{E}_{\alpha \sim \pi_\theta}\left[\mathcal{L}_{\mathrm{validation}}(\alpha;\ w^*)\right] + \lambda \max\left(\mathbb{E}_{\alpha \sim \pi_\theta}\left[E(\alpha)\right] - E_0,\ 0\right)$
$\text{s.t.}\ \ w^* = \arg\min_{w}\ \mathbb{E}_{\alpha \sim \pi_\theta}\left[\mathcal{L}_{\mathrm{train}}(\alpha;\ w)\right]$
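A minimal Python sketch of the relaxed objective (illustrative; `val_loss` and `energy` stand in for sampled estimates of the two expectations):

```python
def relaxed_objective(val_loss, energy, e0, lam):
    """Validation loss plus a hinge penalty on the energy-budget violation."""
    return val_loss + lam * max(energy - e0, 0.0)

# A network under the budget e0 pays no penalty ...
print(relaxed_objective(val_loss=1.8, energy=9.5, e0=10.0, lam=0.1))   # -> 1.8
# ... while one over the budget is penalized in proportion to the violation.
print(relaxed_objective(val_loss=1.5, energy=12.0, e0=10.0, lam=0.1))  # -> 1.7
```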

15 Hardware Environment
[Diagram: same framework as slide 12, with the hardware path highlighted — deployment of the sampled network yields its energy, which feeds the weighted reward]

16 REINFORCE Algorithm
• Policy gradient theorem: for any differentiable policy $\pi_\theta$ and any policy objective function $J$, the policy gradient is
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(\alpha)\, R(\alpha)\right]$
• Non-differentiable energy measures can therefore still be optimized through sampling (see the sketch below):
$\nabla_\theta\, \mathbb{E}_{\alpha \sim \pi_\theta}\left[E(\alpha)\right] = \mathbb{E}_{\alpha \sim \pi_\theta}\left[E(\alpha)\, \nabla_\theta \log \pi_\theta(\alpha)\right] \approx \frac{1}{N} \sum_{i=1}^{N} E(\alpha_i)\, \nabla_\theta \log \pi_\theta(\alpha_i)$

17 Software Environment
[Diagram: same framework as slide 12, with the software path highlighted — training and evaluation of the sampled network yield the loss term of the weighted reward]

18 Non-Differentiability
• Relax the discrete mask variable $m$ to a continuous random variable computed by the Gumbel-Softmax function (see the sketch below):

$m_i = \frac{\exp\left((g_i + \log p_i)/\tau\right)}{\sum_j \exp\left((g_j + \log p_j)/\tau\right)}$

where $p_i$ is the logit, $g_i \sim \mathrm{Gumbel}(0, 1)$, and $\tau$ is the temperature.
• One-hot [0, 1, 0] → continuous [.3, .5, .2]

E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," ICLR 2017.
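A direct transcription of this relaxation in Python (a sketch under the slide's notation; the logits p_i are taken to be probabilities here):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_mask(p, tau):
    """m_i = exp((g_i + log p_i)/tau) / sum_j exp((g_j + log p_j)/tau)."""
    g = rng.gumbel(size=p.shape)   # g_i ~ Gumbel(0, 1)
    y = (g + np.log(p)) / tau
    y = np.exp(y - y.max())        # numerically stable softmax
    return y / y.sum()

p = np.array([0.2, 0.5, 0.3])
print(gumbel_softmax_mask(p, tau=5.0))  # high temperature: smooth, spread-out mask
print(gumbel_softmax_mask(p, tau=0.1))  # low temperature: nearly one-hot mask
```

As τ → 0 the soft mask approaches a one-hot categorical sample while staying differentiable in the logits, which is what lets gradients flow through the discrete choice.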

19 Bilevel Optimization
• Whenever the policy parameter $\theta$ changes, the weights $w(\alpha)$ of the sampled networks need to be retrained.
• Motivated by differentiable architecture search (DARTS), we propose the following alternating algorithm (see the sketch after the citation):
  1. Sample a minibatch of network configurations $\alpha$ from the controller.
  2. Update the network models $w(\alpha)$ by minimizing the training loss.
  3. Update the controller parameters $\theta$.

H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," ICLR 2019.
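A runnable toy version of this alternating loop (entirely illustrative: two candidate "architectures" with scalar weights and quadratic losses stand in for real networks and datasets):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # controller logits (pi_theta)
w = np.array([5.0, 5.0])            # per-architecture weights w(alpha)
optima = np.array([1.0, 3.0])       # each architecture's best weight value
floors = np.array([0.5, 0.1])       # architecture 1 has a lower achievable loss

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(500):
    a = rng.choice(2, p=softmax(theta))          # 1. sample a configuration
    w[a] -= 0.1 * 2.0 * (w[a] - optima[a])       # 2. SGD step on the training loss
    val_loss = (w[a] - optima[a]) ** 2 + floors[a]
    grad_log_pi = np.eye(2)[a] - softmax(theta)  # 3. REINFORCE step on theta
    theta -= 0.05 * val_loss * grad_log_pi
print(softmax(theta))  # probability mass concentrates on the better architecture
```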

20 Contents
• Introduction
• Our algorithm
• Experimental results
• Conclusion

21 Experimental Settings
• Hardware simulator of Bit Fusion [1, 2]
• First, search architectures and per-layer mixed precision on a proxy task, Tiny ImageNet
  › Trained for a fixed 60 epochs
  › 5 days on 1 NVIDIA Tesla P100
• Next, train the discovered architectures on CIFAR-100 and ImageNet from scratch
[1] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. ISCA, June 2018, pp. 764–775.
[2] https://github.com/hsharma35/bitfusion

22 Searched Results
[Figure: searched architectures and per-layer bitwidths for Ours-small (λ = 0.1) and Ours-base (λ = 0.01)]

23 Results on ImageNet

Model      | Top-5 Error (%) | Model Size (MB) | Energy (mJ) | Latency (ms)
HAQ-small  | 40.2            | 16.3            | 10.9        | 2.12
Ours-small | 32.1            | 12.9            | 10.1        | 2.06
HAQ-base   | 24.7            | 12.7            | 9.94        | 1.70
Ours-base  | 21.2            | 11.6            | 8.91        | 1.44

24 Joint NAS and Mixed Precision Quantization

[Scatter plot: error (%) versus energy (mJ); Ours-small and Ours-base lie below and to the left of the corresponding sequential "NAS + quantization" baselines of comparable model size]

25 Adaptive Mixed Precision Quantization
• Pareto front for error rate, latency, and energy

26 Contents
• Introduction
• Our algorithm
• Experimental results
• Conclusion

27 Conclusion
• We propose a new methodology that jointly optimizes NAS and mixed precision quantization in the extended search space.
• Hardware performance is incorporated into the objective function.
• Our methodology facilitates an end-to-end design automation flow for neural network design and deployment, especially for edge inference.

28 Thank you!

29 Backup: Framework
• Pipeline of hardware-centric design automation for efficient neural networks: neural architecture search, model pruning, and model quantization.
• Limited research considers the stages of this pipeline collaboratively.
• Our proposed framework, mixed precision NAS, couples neural architecture search with mixed precision quantization.

30 Backup: Results on ImageNet
[Bar chart: top-1 error, top-5 error, model size (MB), energy (mJ), and latency (ms) for VGG-16 (FXP8), ResNet-50 (FXP8), MobileNetV2 (FXP8), FBNet-B (FXP8), FBNet-B (FXP3), and the mixed-precision HAQ-small, Ours-small, HAQ-base, and Ours-base; the HAQ/Ours numbers match slide 23]