AI On the Edge

Cyrus M. Vahid, Principal Deep Learning Solutions Architect, AWS Deep Learning
[email protected]
Oct 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Motivation: Training vs. Inference

• Training is performed in the cloud.

• Inference is performed everywhere.

• Efficiency of inference is indispensable to address:
  • Latency
  • Connectivity
  • Cost
  • Privacy/Security

Motivation

Large DNNs require huge amounts of memory, e.g. the AlexNet Caffe model is over 200 MB and the VGG-16 Caffe model is over 500 MB.

Complex computation makes apps power-hungry.

Edge devices have low power and small memory capacity.
∴ To run models on the edge, we need to compress them significantly.

Motivating Examples From Customers

• Industrial IoT (Out-of-Distribution / Anomaly Detection)

• Real-Time Filtering (Neural Style Transfer)

• Building a Better Hearing Aid (Recurrent Acoustic Models)

• Security Robots (Object Detection and Recognition)
• Autonomous Vehicles

Model Compression: Computational Efficiency

• The goal is to reduce floating-point operations and the number of parameters.

Tensor Contraction Layer

A large dense weight matrix W = (w_{i,j}) ∈ R^{m×n} is replaced by a contraction of much smaller factors,

    W = W_1 ⊗ … ⊗ W_k,

so far fewer parameters need to be stored and multiplied.
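To make the parameter savings concrete, here is a minimal numpy sketch (my own illustration, not code from the slides or from arXiv:1706.00439) of a two-factor Kronecker-factorized weight matrix: the full matrix is materialized only to verify the result, while the linear map itself is applied through the two small factors.

```python
# Minimal numpy sketch (illustration only): a weight matrix factorized as
# W = W1 (x) W2 stores ~2K parameters instead of ~1M, and W @ x can be
# computed from the factors alone.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((32, 32))     # 1,024 parameters
W2 = rng.standard_normal((32, 32))     # 1,024 parameters
x = rng.standard_normal(32 * 32)       # input vector of length 1,024

# Dense equivalent (materialized here only to check correctness):
W_full = np.kron(W1, W2)               # 1,048,576 parameters
y_dense = W_full @ x

# Factorized form: with row-major flattening, (W1 (x) W2) x equals
# W1 @ X @ W2.T flattened, where X is x reshaped to 32 x 32.
X = x.reshape(32, 32)
y_factored = (W1 @ X @ W2.T).reshape(-1)

assert np.allclose(y_dense, y_factored)
print("dense params:", W_full.size, "factored params:", W1.size + W2.size)
```

Tensor contraction layers generalize this idea to higher-order activation tensors; the sketch only shows the matrix case.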

Fast Fourier Transform: O(n²) → O(n log n); most effective for larger kernels.
Separable Kernels: a k×k kernel becomes two 1-D kernels, k² → 2k multiplies per output; very effective on CPU.
Winograd: F(2×2, 3×3) needs 2.25× fewer multiplications than direct convolution.
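As a concrete example of the separable-kernel saving, here is a small numpy/scipy sketch (my own, not from the slides): a rank-1 k×k kernel factors into a column kernel and a row kernel, so one 2-D convolution can be replaced by two cheaper 1-D passes.

```python
# Minimal numpy/scipy sketch (illustration only): a separable 3x3 kernel is the
# outer product of a 3x1 and a 1x3 kernel, so convolving with the two 1-D
# kernels (2k = 6 multiplies per output) matches the full 2-D convolution
# (k^2 = 9 multiplies per output).
import numpy as np
from scipy.signal import convolve2d

k_col = np.array([[1.0], [2.0], [1.0]])   # 3x1 vertical kernel
k_row = np.array([[1.0, 0.0, -1.0]])      # 1x3 horizontal kernel
kernel_2d = k_col @ k_row                 # rank-1 3x3 (Sobel-like) kernel

image = np.random.rand(64, 64)

full = convolve2d(image, kernel_2d, mode="valid")
separable = convolve2d(convolve2d(image, k_col, mode="valid"),
                       k_row, mode="valid")

assert np.allclose(full, separable)
```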

Model Compression: Pruning-Quantization-Encoding [arXiv:1510.00149v5]

Model Compression: Pruning

• Pruning removes connections that contribute little to the computation of a network.
• After training, all weights smaller than a certain threshold are removed and the model is retrained.
• A 9x-13x reduction in the number of parameters without loss of accuracy has been shown. [arXiv:1510.00149v5]
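A minimal numpy sketch of magnitude pruning, assuming a simple global threshold (my own illustration, not the code from Han et al.):

```python
# Minimal numpy sketch of magnitude pruning (illustration only): weights below
# a threshold are zeroed, and the mask is reused during retraining so pruned
# connections stay at zero.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256))

threshold = np.percentile(np.abs(weights), 90)   # keep only the largest 10%
mask = np.abs(weights) >= threshold
weights *= mask

print("sparsity:", 1.0 - mask.mean())            # ~0.9

# During retraining, gradients of pruned weights are masked as well, e.g.
#   weights -= learning_rate * grad * mask
```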

Model Compression: Quantization

• Quantization is about using fewer bits to express the same information.
• Weight sharing is one method of quantization, using centroids as shared weights (see the sketch below).

Good for taking advantage of low-precision hardware acceleration.
[Figure, arXiv:1510.00149v5: weight sharing through scalar quantization]
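A minimal numpy sketch of weight sharing via scalar k-means quantization (my own illustration; the paper's clustering and fine-tuning details differ): each weight is replaced by a 4-bit index into a codebook of 16 shared centroids.

```python
# Minimal numpy sketch of weight sharing by scalar k-means (illustration only):
# weights are stored as log2(k)-bit codebook indices plus k float centroids.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096)

k = 16                                                     # 16 centroids -> 4-bit codes
centroids = np.linspace(weights.min(), weights.max(), k)   # linear initialization

for _ in range(20):                                        # plain Lloyd iterations
    codes = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
    for c in range(k):
        if np.any(codes == c):
            centroids[c] = weights[codes == c].mean()

shared = centroids[codes]                                  # weights used at inference
print("max abs quantization error:", np.abs(weights - shared).max())
```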

Model Compression: Huffman Coding

• A Huffman code is an optimal prefix code commonly used for lossless data compression.
• It uses variable-length code words to encode source symbols; more common symbols are represented with fewer bits.
• In the last fully connected layer of AlexNet, most of the quantized weights are distributed around two peaks, and the sparse-matrix index differences are rarely above 20. [arXiv:1510.00149v5]
• Experiments show that Huffman coding these non-uniformly distributed values saves 20%-30% of network storage.
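A minimal Python sketch of Huffman coding the quantized codebook indices (my own illustration; Deep Compression's storage format differs): frequent indices receive shorter bit strings, so a skewed distribution compresses well.

```python
# Minimal Huffman coding sketch (illustration only) using the standard
# heap-based construction from Python's standard library.
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return {symbol: bitstring} for an iterable of symbols."""
    freq = Counter(symbols)
    # Heap entries are (frequency, tie-breaker, tree); a tree is a symbol or a pair.
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
        return codes
    return walk(heap[0][2])

# Skewed index distribution, as for quantized weights clustered around peaks.
indices = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
codes = huffman_code(indices)
bits = sum(len(codes[i]) for i in indices)
print(codes, "-> total bits:", bits, "vs fixed 2-bit:", 2 * len(indices))
```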

BMXNet: collaborators in the MXNet community have taken this further to binary weights. https://github.com/hpi-xnor/BMXNet

Reduced Architecture

SqueezeNet: AlexNet Accuracy with 50x Fewer Parameters

Good for devices with low RAM that cannot hold all the weights of larger models in memory at once.

Student/Teacher Training

Comparing Techniques

                               Winograd      Separable     Quantization  Tensor        Sparsity      Weight
                               Convolutions  Convolutions                Contractions  Exploitation  Sharing
CPU Acceleration               +             ++            =             ++            +             +
GPU Acceleration               +             +             +             +             =             +
Model Size                     =             =             -             -             -             -
Model Accuracy                 =             -             -             -             -             -
Specialized HW Acceleration    +             +             ++            +             +             +

Edge Compute Models – AWS IoT

Key Functions
• Data Ingest
• Compressed Inference
• Full Inference / Trained Model Query
• Model Training

Deployment Models
• Cloud <-> Edge
• Cloud <-> Hub <-> Edge

Edge Analytics Trends: Reduce Latency, Reduce Transfer Costs

AWS Infrastructure Tools

• Deep Learning AMI: preconfigured for deep learning (MXNet, TensorFlow, …)
• P2 Instances: up to 40K CUDA cores
• CloudFormation Template: Deep Learning Cluster

Apache MXNet

• Most open: accepted into the Apache Incubator
• Best on AWS: optimized for deep learning on AWS

Amazon AI: Scaling With MXNet
[Chart: multi-GPU scaling efficiency of 88% of ideal for Inception v3, ResNet, and AlexNet across 1-256 GPUs]

Manage and Monitor Models on the Fly

[Diagram: local learning loop. Captured data is uploaded as tagged data; a model is deployed and managed on AWS; poorly classified data is escalated to an AI service or to a custom model on P2 instances; the model is fine-tuned with accurate classifications and the updated model is pushed back to the edge.]

References

• arXiv:1510.00149v5: Deep Compression; Han, Mao, and Dally
• arXiv:1509.09308v2: Fast Algorithms for Convolutional Neural Networks; Lavin and Gray
• arXiv:1706.00439v1: Tensor Contraction Layers; Anandkumar et al.
• arXiv:1606.09274v1: Compression of NMT via Pruning; See, Luong, and Manning
• http://cs231n.stanford.edu/reports/2016/pdfs/117_Report.pdf: Pruning Winograd and FFT Based Algorithms; Liu and Turakhia
• https://colfaxresearch.com/falcon-library/
• https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/
• https://en.wikipedia.org/wiki/Fast_Fourier_transform
• https://arxiv.org/pdf/1611.06321.pdf: Learning the Number of Neurons in Deep Networks
• https://aclweb.org/anthology/D16-1139: Sequence Level Knowledge Distillation; Kim and Rush

Thank you!

Cyrus M. Vahid [email protected]