AI On the Edge

Cyrus M. Vahid, Principal Deep Learning Solutions Architect, AWS Deep Learning
[email protected]
Oct 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Motivation: Training vs. Inference

• Training is performed in the cloud.

• Inference is performed everywhere.

• Efficiency of inference is indispensable to address:
  • Latency
  • Connectivity
  • Cost
  • Privacy/Security

Motivation

Large DNNs require huge amounts of memory, e.g. the AlexNet Caffe model is over 200 MB and the VGG-16 Caffe model is over 500 MB.

Complex computation makes apps power-hungry.

Edge devices have low power and small memory capacity.
∴ To run models on the edge, we need to compress them significantly.

Motivating Examples From Customers

• Industrial IoT (Out-of-Distribution / Anomaly Detection)

• Real-Time Filtering (Neural Style Transfer)

• Building a Better Hearing Aid (Recurrent Acoustic Models)

• Security Robots (Object Detection and Recognition)
• Autonomous Vehicles

Model Compression: Computational Efficiency

• The goal is to reduce floating-point operations and the number of parameters.

Tensor Contraction Layer

A large dense weight matrix W = (w_{i,j}) ∈ R^{m×n} is replaced by a contraction of much smaller factors,

    W = W_1 ⊗ … ⊗ W_k,

so far fewer parameters need to be stored and multiplied.
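To make the parameter savings concrete, here is a minimal numpy sketch (my own illustration, not code from the slides or from arXiv:1706.00439) of a two-factor Kronecker-factorized weight matrix: the full matrix is materialized only to verify the result, while the linear map itself is applied through the two small factors.

```python
# Minimal numpy sketch (illustration only): a weight matrix factorized as
# W = W1 (x) W2 stores ~2K parameters instead of ~1M, and W @ x can be
# computed from the factors alone.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((32, 32))     # 1,024 parameters
W2 = rng.standard_normal((32, 32))     # 1,024 parameters
x = rng.standard_normal(32 * 32)       # input vector of length 1,024

# Dense equivalent (materialized here only to check correctness):
W_full = np.kron(W1, W2)               # 1,048,576 parameters
y_dense = W_full @ x

# Factorized form: with row-major flattening, (W1 (x) W2) x equals
# W1 @ X @ W2.T flattened, where X is x reshaped to 32 x 32.
X = x.reshape(32, 32)
y_factored = (W1 @ X @ W2.T).reshape(-1)

assert np.allclose(y_dense, y_factored)
print("dense params:", W_full.size, "factored params:", W1.size + W2.size)
```

Tensor contraction layers generalize this idea to higher-order activation tensors; the sketch only shows the matrix case.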

Fast Fourier Transform: O(n²) → O(n log n); most effective for larger kernels.
Separable Kernels: a k×k kernel becomes two 1-D kernels, k² → 2k multiplies per output; very effective on CPU.
Winograd: F(2×2, 3×3) needs 2.25× fewer multiplications than direct convolution.
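As a concrete example of the separable-kernel saving, here is a small numpy/scipy sketch (my own, not from the slides): a rank-1 k×k kernel factors into a column kernel and a row kernel, so one 2-D convolution can be replaced by two cheaper 1-D passes.

```python
# Minimal numpy/scipy sketch (illustration only): a separable 3x3 kernel is the
# outer product of a 3x1 and a 1x3 kernel, so convolving with the two 1-D
# kernels (2k = 6 multiplies per output) matches the full 2-D convolution
# (k^2 = 9 multiplies per output).
import numpy as np
from scipy.signal import convolve2d

k_col = np.array([[1.0], [2.0], [1.0]])   # 3x1 vertical kernel
k_row = np.array([[1.0, 0.0, -1.0]])      # 1x3 horizontal kernel
kernel_2d = k_col @ k_row                 # rank-1 3x3 (Sobel-like) kernel

image = np.random.rand(64, 64)

full = convolve2d(image, kernel_2d, mode="valid")
separable = convolve2d(convolve2d(image, k_col, mode="valid"),
                       k_row, mode="valid")

assert np.allclose(full, separable)
```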

Model Compression: Pruning-Quantization-Encoding [arXiv:1510.00149v5]

Model Compression: Pruning

• Pruning removes connections that contribute little to the computation of a network.
• After training, all weights smaller than a certain threshold are removed and the model is retrained.
• A 9x-13x reduction in the number of parameters without loss of accuracy has been shown. [arXiv:1510.00149v5]
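A minimal numpy sketch of magnitude pruning, assuming a simple global threshold (my own illustration, not the code from Han et al.):

```python
# Minimal numpy sketch of magnitude pruning (illustration only): weights below
# a threshold are zeroed, and the mask is reused during retraining so pruned
# connections stay at zero.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256))

threshold = np.percentile(np.abs(weights), 90)   # keep only the largest 10%
mask = np.abs(weights) >= threshold
weights *= mask

print("sparsity:", 1.0 - mask.mean())            # ~0.9

# During retraining, gradients of pruned weights are masked as well, e.g.
#   weights -= learning_rate * grad * mask
```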

Model Compression: Quantization

• Quantization is about using fewer bits to express the same information.
• Weight sharing is one method of quantization, using centroids as shared weights (see the sketch below).

Good for taking advantage of low-precision hardware acceleration.
[Figure, arXiv:1510.00149v5: weight sharing through scalar quantization]
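A minimal numpy sketch of weight sharing via scalar k-means quantization (my own illustration; the paper's clustering and fine-tuning details differ): each weight is replaced by a 4-bit index into a codebook of 16 shared centroids.

```python
# Minimal numpy sketch of weight sharing by scalar k-means (illustration only):
# weights are stored as log2(k)-bit codebook indices plus k float centroids.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096)

k = 16                                                     # 16 centroids -> 4-bit codes
centroids = np.linspace(weights.min(), weights.max(), k)   # linear initialization

for _ in range(20):                                        # plain Lloyd iterations
    codes = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
    for c in range(k):
        if np.any(codes == c):
            centroids[c] = weights[codes == c].mean()

shared = centroids[codes]                                  # weights used at inference
print("max abs quantization error:", np.abs(weights - shared).max())
```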

Model Compression: Huffman Coding

• A Huffman code is an optimal prefix code commonly used for lossless data compression.
• It uses variable-length code words to encode source symbols; more common symbols are represented with fewer bits.
• In the last fully connected layer of AlexNet, most of the quantized weights are distributed around two peaks, and the sparse-matrix index differences are rarely above 20. [arXiv:1510.00149v5]
• Experiments show that Huffman coding these non-uniformly distributed values saves 20%-30% of network storage.
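A minimal Python sketch of Huffman coding the quantized codebook indices (my own illustration; Deep Compression's storage format differs): frequent indices receive shorter bit strings, so a skewed distribution compresses well.

```python
# Minimal Huffman coding sketch (illustration only) using the standard
# heap-based construction from Python's standard library.
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return {symbol: bitstring} for an iterable of symbols."""
    freq = Counter(symbols)
    # Heap entries are (frequency, tie-breaker, tree); a tree is a symbol or a pair.
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
        return codes
    return walk(heap[0][2])

# Skewed index distribution, as for quantized weights clustered around peaks.
indices = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
codes = huffman_code(indices)
bits = sum(len(codes[i]) for i in indices)
print(codes, "-> total bits:", bits, "vs fixed 2-bit:", 2 * len(indices))
```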

BMXNet: collaborators in the MXNet community have taken this further to binary weights. https://github.com/hpi-xnor/BMXNet

Reduced Architecture

SqueezeNet: AlexNet Accuracy with 50x Fewer Parameters

Good for devices with low RAM that cannot hold all the weights of larger models in memory at once.

Student/Teacher Training

Comparing Techniques

                               Winograd      Separable     Quantization  Tensor        Sparsity      Weight
                               Convolutions  Convolutions                Contractions  Exploitation  Sharing
CPU Acceleration               +             ++            =             ++            +             +
GPU Acceleration               +             +             +             +             =             +
Model Size                     =             =             -             -             -             -
Model Accuracy                 =             -             -             -             -             -
Specialized HW Acceleration    +             +             ++            +             +             +

Edge Compute Models – AWS IoT

Key Functions
• Data Ingest
• Compressed Inference
• Full Inference / Trained Model Query
• Model Training

Deployment Models
• Cloud <-> Edge
• Cloud <-> Hub <-> Edge

Edge Analytics Trends: Reduce Latency, Reduce Transfer Costs

AWS Infrastructure Tools

• Deep Learning AMI: preconfigured for deep learning (MXNet, TensorFlow, …)
• P2 Instances: up to 40K CUDA cores
• CloudFormation Template: Deep Learning Cluster

Apache MXNet

• Most open: accepted into the Apache Incubator
• Best on AWS: optimized for deep learning on AWS

Amazon AI: Scaling With MXNet
[Chart: multi-GPU scaling efficiency of 88% of ideal for Inception v3, ResNet, and AlexNet across 1-256 GPUs]

Manage and Monitor Models on the Fly

[Diagram: local learning loop. Captured data is uploaded as tagged data; a model is deployed and managed on AWS; poorly classified data is escalated to an AI service or to a custom model on P2 instances; the model is fine-tuned with accurate classifications and the updated model is pushed back to the edge.]

References

• arXiv:1510.00149v5: Deep Compression; Han, Mao, and Dally
• arXiv:1509.09308v2: Fast Algorithms for Convolutional Neural Networks; Lavin and Gray
• arXiv:1706.00439v1: Tensor Contraction Layers; Anandkumar et al.
• arXiv:1606.09274v1: Compression of NMT via Pruning; See, Luong, and Manning
• http://cs231n.stanford.edu/reports/2016/pdfs/117_Report.pdf: Pruning Winograd and FFT Based Algorithms; Liu and Turakhia
• https://colfaxresearch.com/falcon-library/
• https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/
• https://en.wikipedia.org/wiki/Fast_Fourier_transform
• https://arxiv.org/pdf/1611.06321.pdf: Learning the Number of Neurons in Deep Networks
• https://aclweb.org/anthology/D16-1139: Sequence Level Knowledge Distillation; Kim and Rush

Thank you!

Cyrus M. Vahid [email protected]