April 6, 2021 @QCOMResearch

Intelligence at scale through AI model efficiency

Qualcomm Technologies, Inc.

Agenda

• Why efficient machine learning is necessary for AI to proliferate

• Our latest research to make AI models more efficient

• Our open-source projects to scale efficient AI

AI is being used all around us: smartphones, smart homes, video conferencing, autonomous vehicles, smart factories, extended reality, smart cities, and video monitoring. It is increasing productivity, enhancing collaboration, and transforming industries.

AI video analysis is on the rise, with a trend toward more cameras, higher resolution, and increased use across devices.

Deep neural networks are energy hungry and growing fast
AI is being powered by the explosive growth of deep neural networks

Weight parameter count over time (approximate): 1943, first neural network (~10); 1988, NetTalk (~20K); 2009, Hinton's Deep Belief Net (~10M); 2013, Google/Y! (~1B); 2017, very large neural networks (137B); 2021, extremely large neural networks (1.6T); 2025 projection, N = 100T (10^14).

Will we have reached the capacity of the human brain by 2025? The energy efficiency of a brain is 100x better than current hardware.

Source: Welling

The challenge of AI workloads in a constrained mobile environment

AI workloads are very compute intensive: large, complicated neural network models; complex concurrencies; real-time and always-on operation.

The mobile environment is constrained: devices must be thermally efficient for sleek, ultra-light designs, require long battery life for all-day use, and face storage/memory bandwidth limitations.

Power and thermal efficiency are essential for on-device AI.

Holistic model efficiency research
Multiple axes to shrink AI models and efficiently run them on hardware

• Quantization: learning to reduce bit-precision while keeping desired accuracy
• Compilation: learning to compile AI models for efficient hardware execution
• Neural architecture search: learning to design neural networks that are on par with or outperform hand-designed architectures on real hardware
• Compression: learning to prune the model to make it smaller while keeping desired accuracy

Leading research to efficiently quantize AI models
Inference at lower precision: significant improvements through quantization

Models are trained at high precision (32-bit floating point). Quantization is the automated reduction in precision of weights and activations while maintaining accuracy. Compared with FP32, savings in memory and compute1 yield up to a 4x increase in performance per watt with 16-bit integer, up to 16x with 8-bit integer, and up to 64x with 4-bit integer.

Promising results show that low-precision integer inference can become widespread, with virtually the same accuracy between an FP32 and a quantized AI model, through:
• Automated, data-free, post-training methods
• An automated training-based mixed-precision method

1: FP32 model compared to quantized model
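To make the precision reduction above concrete, here is a minimal sketch of simulated ("fake") integer quantization of a weight tensor. The helper name and the per-tensor symmetric scheme are illustrative assumptions, not the specific method used in the research described here.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization of a tensor (illustrative only).

    Values are mapped onto a signed integer grid with 2**num_bits levels and
    back, so the returned tensor is FP32 but only takes quantized values.
    """
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for INT8, 7 for INT4
    scale = x.abs().max() / qmax                   # one scale for the whole tensor
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_int * scale                           # dequantize back to float

w = torch.randn(64, 128)
for bits in (16, 8, 4):
    err = (w - fake_quantize(w, bits)).abs().mean()
    print(f"INT{bits}: mean absolute quantization error = {err:.5f}")
```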

Pushing the limits of what's possible with quantization

Data-free quantization: How can we make quantization as simple as possible? We created an automated method that addresses bias and imbalance in weight ranges: no training, data free. State-of-the-art 8-bit results, making 8-bit weight quantization ubiquitous. Accuracy drop for MobileNet V2 against the FP32 model: <1%.

AdaRound: Is rounding to the nearest value the best approach for quantization? We created an automated method for finding the best rounding choice: no training, minimal unlabeled data. State-of-the-art 4-bit weight results, making 4-bit weight quantization ubiquitous. Accuracy drop for MobileNet V2 against the FP32 model: <2.5%.

Bayesian Bits: Can we quantize layers to different bit widths based on precision sensitivity? We created a novel method to learn mixed-precision quantization: training and training data required; jointly learns bit-width precision and pruning. State-of-the-art mixed-precision results, automating mixed-precision quantization and enabling the tradeoff between accuracy and kernel bit-width. Accuracy drop for MobileNet V2 against the FP32 model, for mixed precision with computational complexity equivalent to a 4-bit weight model: <1%.

Data-Free Quantization Through Weight Equalization and Bias Correction (Nagel, van Baalen, et al., ICCV 2019)
Up or Down? Adaptive Rounding for Post-Training Quantization (Nagel, Amjad, et al., ICML 2020)
Bayesian Bits: Unifying Quantization and Pruning (van Baalen, Louizos, et al., NeurIPS 2020)
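As a rough illustration of the weight-range equalization idea behind Data-Free Quantization, the sketch below rescales the per-channel weight ranges of two consecutive layers without changing the composed function (valid when a piecewise-linear activation such as ReLU sits between them). It is a conceptual reconstruction from the paper, not the production implementation; the tensor layouts and helper name are assumptions.

```python
import torch

def equalize_pair(w1: torch.Tensor, b1: torch.Tensor, w2: torch.Tensor):
    """Cross-layer equalization of two consecutive layers (conceptual sketch).

    w1: [out1, in1] weights of layer 1, b1: [out1] bias, w2: [out2, out1] weights
    of layer 2. Channel i of layer 1 is divided by s_i and the matching input
    channel of layer 2 is multiplied by s_i, so the composed function is
    unchanged while per-channel weight ranges become comparable for quantization.
    """
    r1 = w1.abs().amax(dim=1)              # per-output-channel range of layer 1
    r2 = w2.abs().amax(dim=0)              # per-input-channel range of layer 2
    s = torch.sqrt(r1 / (r2 + 1e-8))       # equalizing scale per channel
    return w1 / s.unsqueeze(1), b1 / s, w2 * s.unsqueeze(0)

# Sanity check: the layer pair computes the same function before and after.
w1, b1, w2 = torch.randn(32, 16), torch.randn(32), torch.randn(8, 32)
x = torch.randn(4, 16)
w1e, b1e, w2e = equalize_pair(w1, b1, w2)
before = torch.relu(x @ w1.t() + b1) @ w2.t()
after = torch.relu(x @ w1e.t() + b1e) @ w2e.t()
print(torch.allclose(before, after, atol=1e-4))   # True: function preserved
```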

Optimizing and deploying state-of-the-art AI models for diverse scenarios at scale is challenging

• Neural network complexity: many state-of-the-art neural network solutions are large, complex, and do not run efficiently on target hardware
• Neural network diversity: for different tasks and use cases, many different neural networks are required
• Device diversity: deploying neural networks to many different devices with different configurations and changing software is required
• Cost: compute and engineering resources for training plus evaluation are too costly and time consuming

Neural architecture search (NAS)
An automated way to learn a network topology that can achieve the best performance on a certain task

• Search space: the set of operations and how they can be connected to form valid network architectures
• Search algorithm: a method for sampling a population of good network architecture candidates
• Evaluation strategy: a method to estimate the performance of sampled network architectures
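The three components above can be read as the interface of any NAS system. Below is a minimal, toy skeleton under that reading; every name, the random search algorithm, and the proxy score are illustrative assumptions, not part of the presentation.

```python
import random

def sample(search_space):
    """Search algorithm (random sampling here): draw one operation per position."""
    return tuple(random.choice(ops) for ops in search_space)

def evaluate(candidate):
    """Evaluation strategy: stand-in for training, a proxy metric, or a predictor."""
    return -sum(1 for op in candidate if op == "skip")   # toy score only

# Search space: the allowed operations at each of 8 positions in the network.
search_space = [["conv3x3", "conv5x5", "skip"]] * 8
population = [sample(search_space) for _ in range(50)]
best = max(population, key=evaluate)
```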

Existing NAS solutions do not address all the challenges

• Lack diverse search: hard to search in diverse spaces with different block types, attention, and activations; repeated training phase for every new scenario
• High cost: brute-force search is expensive (>40,000 epochs per platform)
• Do not scale: repeated training phase for every new device (>40,000 epochs per platform)
• Unreliable hardware models: require differentiable cost functions; repeated training phase for every new device

Introducing new AI research: DONNA
Distilling Optimal Neural Network Architectures: efficient NAS with hardware-aware optimization

A scalable method that finds Pareto-optimal network architectures in terms of accuracy and latency for any hardware platform at low cost. DONNA starts from an oversized pretrained reference architecture and produces a set of Pareto-optimal network architectures.

• Diverse search to find the best models: supports diverse spaces with different cell types, attention, and activation functions (ReLU, Swish, etc.)
• Low cost: low start-up cost of 1,000-4,000 epochs, equivalent to training 2-10 networks from scratch
• Scalable: scales to many hardware devices at minimal cost
• Reliable hardware measurements: uses direct hardware measurements instead of a potentially inaccurate hardware model

Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces (Moons, Bert, et al., arXiv 2020)

DONNA 4-step process
Objective: build the accuracy model of the search space once, then deploy to many scenarios

Step A: define the reference architecture and search space once

Define the backbone: fixed channels, head, and stem, with the network split into blocks 1-5.

Varying parameters per block: kernel size, expansion factors, network depth, network width, attention/activation, and different efficient layer types.

Define the reference architecture and search space once
A diverse search space is essential for finding optimal architectures with higher accuracy

• Select a reference architecture: the largest model in the search space
• Chop the neural network into blocks: fix the stem, head, number of blocks, strides, and number of channels at block edges (e.g. blocks 1-5 with channel widths 32, 64, 96, 128, 196, 256)
• Choose a diverse, factorized, hierarchical search space, including variable kernel size, expansion rate, depth, number of channels, cell type, activation, and attention: kernel 3, 5, 7; expansion 2, 3, 4, 6; depth 1, 2, 3, 4; width scale 0.5x, 1.0x; activation ReLU/Swish; cell type grouped, depthwise, etc.; attention SE or no SE
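A factorized search space like the one above can be written down directly as a per-block configuration. The sketch below encodes the choice values listed on this slide; the dictionary layout and block count are illustrative assumptions.

```python
# Illustrative encoding of a DONNA-style factorized, hierarchical search space.
# The choice values come from the slide; the structure itself is an assumption.
BLOCK_CHOICES = {
    "kernel_size": [3, 5, 7],
    "expansion": [2, 3, 4, 6],
    "depth": [1, 2, 3, 4],
    "width_scale": [0.5, 1.0],
    "activation": ["relu", "swish"],
    "cell_type": ["grouped", "depthwise"],
    "attention": ["se", "none"],
}
NUM_BLOCKS = 5  # searchable blocks; the stem and head stay fixed

def num_architectures() -> int:
    """Size of the search space: choices per block raised to the number of blocks."""
    per_block = 1
    for options in BLOCK_CHOICES.values():
        per_block *= len(options)
    return per_block ** NUM_BLOCKS

print(num_architectures())  # on the order of 1e14 candidates in this toy encoding
```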

Ch: channel; SE: Squeeze-and-Excitation

DONNA 4-step process
Objective: build the accuracy model of the search space once, then deploy to many scenarios

Step B: build the accuracy model via knowledge distillation (KD) once

• Approximate ideal projections of the reference model through KD, matching each candidate block's output to the corresponding reference block's output (MSE per block)
• Use the quality of the blockwise approximations to build the accuracy model

Build the accuracy predictor via BKD once
A low-cost, hardware-agnostic training phase

• Block library: pretrain all blocks in the search space through blockwise knowledge distillation, producing block pretrained weights and block quality metrics. Fast, trivially parallelized block training over a broad search space.
• Architecture library: quickly finetune a representative set of architectures. Fast network training; only 20-30 networks required.
• Accuracy predictor: fit a linear regression model (regularized ridge regression) on the block quality metrics, giving accurate predictions.
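The sketch below gives one possible reading of this phase, assuming PyTorch and scikit-learn: distill each candidate block against the matching reference block with an MSE loss, then fit a ridge regression from per-block quality metrics to the accuracies of the finetuned architectures. The function names and exact features are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F
from sklearn.linear_model import Ridge

def distill_block(candidate_block, reference_block, loader, epochs=1, lr=1e-3):
    """Blockwise KD: train one candidate block to mimic the reference block's mapping."""
    opt = torch.optim.Adam(candidate_block.parameters(), lr=lr)
    last_mse = None
    for _ in range(epochs):
        for x_in in loader:                      # x_in: reference input to this block
            with torch.no_grad():
                target = reference_block(x_in)   # "ideal projection" from the reference
            loss = F.mse_loss(candidate_block(x_in), target)
            opt.zero_grad(); loss.backward(); opt.step()
            last_mse = loss.item()
    return last_mse                              # final MSE = block quality metric

def fit_accuracy_predictor(quality_metrics, measured_accuracies):
    """Map per-block quality metrics (one row per finetuned net) to full-network accuracy."""
    predictor = Ridge(alpha=1.0)                 # regularized ridge regression
    predictor.fit(quality_metrics, measured_accuracies)
    return predictor
```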

BKD: blockwise knowledge distillation

DONNA 4-step process
Objective: build the accuracy model of the search space once, then deploy to many scenarios

Step C: evolutionary search in 24 hours

• Scenario-specific search: target scenarios can differ by device, compiler version, or image size
• Candidates are compared by predicted accuracy against hardware latency

Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.

Evolutionary search with real hardware measurements
Scenario-specific search allows users to select optimal architectures for real-life deployments

The NSGA-II sampling algorithm proposes end-to-end models; the task accuracy predictor scores them, and latency is measured directly on the target hardware. Candidates are compared by predicted task accuracy versus measured on-device latency.

• Quick turnaround time: results in roughly one day using one measurement device
• Accurate scenario-specific search: captures all intricacies of the hardware platform and software, e.g. run-time version or devices
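Below is a minimal sketch of the search loop described above, combining predicted accuracy with measured latency. Simple non-dominated (Pareto) selection stands in for the full NSGA-II algorithm, and every function name and the mutation scheme are illustrative assumptions.

```python
import random

def pareto_front(population, scores):
    """Keep candidates not dominated in (accuracy up, latency down) by any other."""
    front = []
    for i, (acc_i, lat_i) in enumerate(scores):
        dominated = any(
            acc_j >= acc_i and lat_j <= lat_i and (acc_j, lat_j) != (acc_i, lat_i)
            for acc_j, lat_j in scores
        )
        if not dominated:
            front.append(population[i])
    return front

def search(predict_accuracy, measure_latency, sample, mutate, generations=20, size=32):
    """Evolve a population toward the accuracy/latency Pareto front for one scenario."""
    population = [sample() for _ in range(size)]
    for _ in range(generations):
        scores = [(predict_accuracy(a), measure_latency(a)) for a in population]  # latency from device
        parents = pareto_front(population, scores)
        children = [mutate(random.choice(parents)) for _ in range(size - len(parents))]
        population = parents + children
    final_scores = [(predict_accuracy(a), measure_latency(a)) for a in population]
    return pareto_front(population, final_scores)
```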

NSGA: Non-dominated Sorting Genetic Algorithm

DONNA 4-step process
Objective: build the accuracy model of the search space once, then deploy to many scenarios

Step D: sample and finetune

• Use KD-initialized blocks from step B to finetune any network in the search space in 15-50 epochs instead of 450

Quickly finetune predicted Pareto-optimal architectures
Finetune to reach full accuracy and complete hardware-aware optimization for on-device AI deployments

• Initialize the selected networks with the block pretrained weights from step B
• Train with a cross-entropy (CE) loss on ground-truth labels plus soft distillation on the logits of the BKD reference (teacher) network
• The result turns the predicted accuracy and hardware latency of each architecture into confirmed accuracy and hardware latency
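The finetuning objective described above combines hard-label cross-entropy with soft distillation on the teacher's logits. A minimal PyTorch sketch of that combined loss follows; the temperature, weighting, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def finetune_loss(student_logits, teacher_logits, labels, alpha=0.5, T=1.0):
    """Hard CE on ground-truth labels + soft CE against the teacher's logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * hard + alpha * soft

# Example: one step of finetuning a sampled architecture initialized from BKD weights.
logits_s = torch.randn(8, 1000, requires_grad=True)   # student outputs
logits_t = torch.randn(8, 1000)                        # BKD-reference (teacher) outputs
labels = torch.randint(0, 1000, (8,))
loss = finetune_loss(logits_s, logits_t, labels)
loss.backward()
```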

DONNA finds state-of-the-art networks for on-device scenarios
Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter

Plots of ImageNet top-1 validation accuracy [%] against number of parameters [M], FLOPS, desktop GPU latency (FPS), mobile SoC latency¹ (FPS, Adreno 660 GPU), and mobile SoC latency² (FPS, Hexagon 780 Processor), at 224x224 input resolution (672x672 for the Hexagon benchmark), show DONNA networks that are roughly 20% faster at similar accuracy.

1: Qualcomm® Adreno™ 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21. 2: Qualcomm® Hexagon™ 780 Processor in the Snapdragon 888 running on the Samsung Galaxy S21. Qualcomm Adreno and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

DONNA efficiently finds optimal models over diverse scenarios
Cost of training is a handful of architectures*

Method  | Granularity | Macro-diversity | Search cost, 1 scenario [epochs] | Cost/scenario, 4 scenarios [epochs] | Cost/scenario, ∞ scenarios [epochs]
OFA     | Layer-level | Fixed           | 1200 + 10×[25-75]                | 550-1050                            | 250-750
DNA     | Layer-level | Fixed           | 770 + 10×450                     | 4700                                | 4500
MnasNet | Block-level | Variable        | 40000 + 10×450                   | 44500                               | 44500
DONNA   | Block-level | Variable        | 4000 + 10×50                     | 1500                                | 500

*Training 1 model from scratch = 450 epochs

DONNA provides MnasNet-level diversity at 100x lower cost

DONNA finds state-of-the-art networks for on-device scenarios
Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter

Results for object detection (predicted validation mAP versus number of multiply-accumulate operations [FLOPS]) and for vision transformers (top-1 accuracy versus FLOPS, compared against DeiT-B, ViT-B, ResNet-50, and mobile models) show that DONNA applies directly to downstream tasks and non-CNN neural architectures without conceptual code changes.

User perspective for DONNA
Build the accuracy model of the search space once, then deploy to many scenarios

IN: an oversized pretrained reference architecture

• A: Define a search space of smaller, faster network architectures
• B: Create an accuracy predictor for all network architectures (roughly equal to training 2-10 nets from scratch)
• C: Execute predictor-driven evolutionary search on-device (about 1 day / 50 GPU-hrs per use case)
• D: Finetune the best searched models, 9-30x faster than regular training per net

OUT: a set of Pareto-optimal network architectures; deploy the best models at 3 ms, 6 ms, 9 ms, etc. on different chips

DONNA conclusions

DONNA shrinks big networks in a hardware-efficient way

DONNA can be rerun for any new device or setting within a day

DONNA works on many different tasks out of the box

DONNA enables scalability and allows models to be easily updated after small changes rather than starting from scratch

Leading AI research and fast commercialization
Driving the industry towards integer inference and power-efficient AI

Quantization research: Relaxed Quantization (ICLR 2019), Data-Free Quantization (ICCV 2019), AdaRound (ICML 2020), Bayesian Bits (NeurIPS 2020)

Quantization open-sourcing: AI Model Efficiency Toolkit (AIMET) and AIMET Model Zoo

AIMET and AIMET Model Zoo are products of Qualcomm Innovation Center, Inc.

AIMET & AIMET Model Zoo
Open-source projects to scale model-efficient AI to the masses

AIMET makes AI models small
Open-sourced GitHub project that includes state-of-the-art quantization and compression techniques from Qualcomm AI Research

Workflow: a trained AI model (TensorFlow or PyTorch) goes through the AI Model Efficiency Toolkit (AIMET) for compression and quantization, producing an optimized AI model that is then deployed.

If interested, please join the AIMET GitHub project: https://github.com/quic/aimet

Features: state-of-the-art network compression tools, state-of-the-art quantization tools, support for both TensorFlow and PyTorch, benchmarks and tests for many models, developed by professional software developers

AIMET: providing advanced model efficiency features and benefits

Benefits: lower power, lower memory bandwidth, lower storage, higher performance, maintained model accuracy, simple ease of use

Features:
• Quantization: state-of-the-art INT8 and INT4 performance; quantization-aware training; quantization simulation; post-training quantization methods, including Data-Free Quantization and Adaptive Rounding (AdaRound, coming soon)
• Compression: efficient tensor decomposition and removal of redundant channels in convolution layers; spatial singular value decomposition (SVD); channel pruning
• Visualization: analysis tools for drawing insights for quantization and compression; weight ranges; per-layer compression sensitivity

AIMET features and APIs are easy to use
Designed to fit naturally in the AI model development workflow for researchers, developers, and ISVs

AIMET is a model optimization library (techniques to compress and quantize models) that sits beneath framework-specific and algorithm-specific APIs, with extensions for other frameworks. It supports the TensorFlow and PyTorch frameworks, and its APIs are invoked directly from the pipeline, either through the framework-specific APIs or the direct algorithm API.

User-friendly APIs, for example:
compress_model(model, eval_callback=obj_det_eval, compress_scheme=Scheme.spatial_svd, ...)
equalize_model(model, ...)
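To show how calls like these sit in a PyTorch pipeline, here is a hedged sketch of post-training optimization with AIMET. The module paths and signatures follow the public aimet_torch documentation around the time of this presentation and may differ between AIMET releases; the torchvision model and the calibration callback are placeholders, not part of the slides.

```python
import torch
from torchvision.models import mobilenet_v2
from aimet_torch.cross_layer_equalization import equalize_model
from aimet_torch.quantsim import QuantizationSimModel

model = mobilenet_v2().eval()        # in practice, a trained FP32 model
input_shape = (1, 3, 224, 224)

# Data-Free Quantization step: cross-layer equalization, no labels or training needed.
equalize_model(model, input_shape)

# Simulate INT8 inference to estimate on-target accuracy before deployment.
# (Constructor arguments vary across AIMET versions; check the current docs.)
sim = QuantizationSimModel(model, dummy_input=torch.randn(*input_shape))

def calibration_fn(sim_model, _):
    # Run a few representative batches so AIMET can pick quantization ranges.
    with torch.no_grad():
        for _ in range(8):
            sim_model(torch.randn(*input_shape))

sim.compute_encodings(forward_pass_callback=calibration_fn, forward_pass_callback_args=None)
# sim.model can now be evaluated with the user's normal accuracy loop.
```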

Data-Free Quantization results in AIMET
Post-training technique enabling INT8 inference with very minimal loss in accuracy

Pipeline: cross-layer equalization (equalize weight ranges), bias absorption (avoid high activation ranges), weight quantization, bias correction (measure and correct shift in layer outputs), and activation range estimation (estimate activation ranges for quantization).

DFQ example results: <1% reduction in accuracy between FP32 and INT8 for MobileNet-v2 (top-1 accuracy), ResNet-50 (top-1 accuracy), and DeepLabv3 (mean intersection over union).

DFQ: data-free quantization

AdaRound is coming soon to AIMET
Post-training technique that makes INT8 quantization more accurate and INT4 quantization possible

Bitwidth                    | Mean AP (mAP)
FP32                        | 82.20
INT8, baseline quantization | 49.85
INT8, AdaRound quantization | 81.21

<1% reduction in accuracy between FP32 and INT8 AdaRound quantization

AP: Average Precision
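For intuition on what AdaRound optimizes, here is a conceptual sketch of learned rounding based on the paper (Nagel et al., ICML 2020): a continuous per-weight variable decides whether to round down or up so that the layer's output is preserved. It is not the AIMET implementation, and the hyperparameters and helper name are assumptions.

```python
import torch

def adaround_layer(w, x, scale, iters=1000, lam=0.01):
    """Learn, per weight, whether to round down or up on a fixed quantization grid.

    w: [out, in] FP32 weights, x: [batch, in] calibration activations, scale: quant step.
    """
    w_floor = torch.floor(w / scale)
    v = torch.zeros_like(w, requires_grad=True)               # continuous rounding variable
    opt = torch.optim.Adam([v], lr=1e-2)
    for _ in range(iters):
        h = torch.clamp(torch.sigmoid(v) * 1.2 - 0.1, 0, 1)   # rectified sigmoid from the paper
        w_q = (w_floor + h) * scale                           # soft-quantized weights
        recon = ((x @ w.t() - x @ w_q.t()) ** 2).mean()       # match the layer's output
        reg = (1 - (2 * h - 1).abs() ** 3).sum()              # push h toward 0 or 1
        loss = recon + lam * reg
        opt.zero_grad(); loss.backward(); opt.step()
    return (w_floor + (h > 0.5).float()) * scale              # final hard rounding choice

w, x = torch.randn(16, 8), torch.randn(32, 8)
scale = w.abs().max() / 7          # e.g. a 4-bit symmetric grid
w_quantized = adaround_layer(w, x, scale)
```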

AIMET Model Zoo
Accurate pre-trained 8-bit quantized models for image classification, semantic segmentation, super resolution, object detection, pose estimation, and speech recognition

AIMET Model Zoo includes popular quantized AI models
Accuracy is maintained for INT8 models: less than 1% loss*

Model                   | Metric          | FP32   | INT8
ResNet-50 (v1)          | Top-1 accuracy* | 75.21% | 74.96%
MobileNet-v2-1.4        | Top-1 accuracy* | 75%    | 74.21%
EfficientNet Lite       | Top-1 accuracy* | 74.93% | 74.99%
SSD MobileNet-v2        | mAP*            | 0.2469 | 0.2456
MobileNetV2             | Top-1 accuracy* | 71.67% | 71.14%
EfficientNet-lite0      | Top-1 accuracy* | 75.42% | 74.44%
DeepLabV3+ MobileNet-v2 | mIoU*           | 72.62% | 72.22%
MobileNetV2-SSD-Lite    | mAP*            | 68.7%  | 68.6%
RetinaNet               | mAP*            | 0.35   | 0.349
Pose estimation         | mAP*            | 0.383  | 0.379
SRGAN                   | PSNR*           | 25.45  | 24.78
Pose estimation         | mAP*            | 0.364  | 0.359
SRGAN                   | PSNR            | 25.51  | 25.5
DeepSpeech2             | WER*            | 9.92%  | 10.22%

*: Comparison between FP32 model and INT8 model quantized with AIMET. For further details, check out: https://github.com/quic/aimet-model-zoo/

AIMET Model Zoo models preserve accuracy
Visual difference in model accuracy is telling between AIMET and baseline quantization methods

For DeepLabv3+ semantic segmentation, AIMET quantization maintains accuracy (accurate segmentation), while the baseline quantization method is inaccurate. The image panels compare FP32, INT8 (AIMET quantization), and INT8 (baseline quantization).

Baseline quantization: post-training quantization using a min-max based quantization grid. AIMET quantization: model fine-tuned using Quantization Aware Training in AIMET.

Join our open-source projects

• AIMET, state-of-the-art quantization and compression techniques: github.com/quic/aimet
• AIMET Model Zoo, accurate pre-trained 8-bit quantized models: github.com/quic/aimet-model-zoo

37 AI model efficiency is crucial for making AI ubiquitous, leading to smarter devices and enhanced lives

We are conducting leading research and development in AI model efficiency while maintaining accuracy

Our open-source projects, based on this leading research, are making it possible for the industry to adopt efficient AI models at scale

38 www.qualcomm.com/ai

www.qualcomm.com/news/onq

Questions? @QCOMResearch

Connect with us:

https://www.youtube.com/qualcomm?

http://www.slideshare.net/qualcommwirelessevolution

Thank you

For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog

Nothing in these materials is an offer to sell any of the components or devices referenced herein.

©2018-2021 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved. Qualcomm, Snapdragon, Adreno, and Hexagon are trademarks or registered trademarks of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners.

References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business.