Inferencing at the Edge and Fragmentation Challenges
Mark Charlebois, Director, Engineering, Qualcomm Technologies, Inc.
04/04/2019 • [email protected] • Bangkok, Thailand

Agenda
• Overview of Inferencing at the Edge
  ◦ Deployment Scenarios
  ◦ Application Development Options
• Runtimes
• Graph Compilers
  ◦ Four Levels of IR
• Leveraging Runtimes in Compiler Frameworks
• Conclusions

Inferencing at the Edge
• Models are developed on servers (TensorFlow, mxnet, ONNX, Caffe2) and deployed to edge devices
• Devices can range from MCUs to edge servers; simple networks can run on an MCU
• Huge diversity of HW at the edge for inferencing acceleration
• Growing number of runtimes and graph compilers
[Diagram: framework models pass through converters / graph optimizers and graph parsers, then into runtimes or AOT compilers]

Deployment Scenarios
• Known models, unknown devices: a developer publishes through an app store to whatever phones users own
• Known models, known device: an OEM deploys directly to a specific phone
• Known models, known devices: an OEM uses cloud-managed deployment to a known set of devices (Device A, Device B)

Application Development Options
• App using vendor SDKs: the app bundles a vendor runtime that drives a vendor backend on the device
• App using an OS / platform framework API (e.g., NNAPI): the framework dispatches to per-device HALs (HAL A, HAL B)
• App using a vendor-compiled model: a vendor compiler produces a device-specific compiled model that the app loads directly

Runtimes

App Examples
• Qualcomm® Neural Processing SDK: a trained TensorFlow, ONNX, or other model is converted to a DLC file; the app runs it on the Neural Network Core, which dispatches CPU ops (Qualcomm® Math Libraries, NEON), GPU ops (OpenCL kernels), and DSP ops (Qualcomm® Hexagon™ NN, HVX)
• TensorFlow Lite: a trained TensorFlow model is converted by the TFLite converter to a TFLite model (.tflite); TFLite dispatches CPU ops (NEON kernels) and GPU ops (OpenGL), or hands off through NNAPI to a vendor NN HAL
Qualcomm Neural Processing SDK is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
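The "app using an OS / platform framework API" option can be sketched in a few lines. This is a toy illustration, not the NNAPI interface: all class, function, and accelerator names here are hypothetical. The app codes against one framework API, vendors register HAL-like backends, and the framework routes each model to a backend, falling back to reference CPU kernels.

```python
# Toy sketch of framework-API dispatch (hypothetical names, not real NNAPI).
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Model:
    name: str
    preferred_accel: str              # e.g. "dsp", "gpu", "cpu"
    ops: List[str] = field(default_factory=list)

class FrameworkAPI:
    """Stand-in for a platform API like NNAPI: apps call run(); vendors register HALs."""
    def __init__(self):
        self._hals: Dict[str, Callable[[Model], str]] = {}

    def register_hal(self, accel: str, hal: Callable[[Model], str]) -> None:
        self._hals[accel] = hal

    def run(self, model: Model) -> str:
        # Route to the HAL for the preferred accelerator, else fall back to CPU.
        hal = self._hals.get(model.preferred_accel, self._hals["cpu"])
        return hal(model)

# Vendor HALs (illustrative): each would wrap a vendor runtime for one accelerator.
def cpu_hal(model: Model) -> str:
    return f"{model.name}: ran {len(model.ops)} ops on reference CPU kernels"

def dsp_hal(model: Model) -> str:
    return f"{model.name}: ran {len(model.ops)} ops on vendor DSP runtime"

api = FrameworkAPI()
api.register_hal("cpu", cpu_hal)
api.register_hal("dsp", dsp_hal)
```

The point of the indirection is the deployment story above: the app never links a vendor runtime directly, so the same app binary works on devices whose vendors ship different HALs.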
Android App Example (NNAPI)
• A TFLite app running on a Qualcomm® device using NNAPI
• Trained TensorFlow model → TFLite converter → TFLite model (.tflite)
• The Android app runs the model through TFLite and NNAPI; the Neural Network Core dispatches CPU ops (Qualcomm® Math Libraries, NEON), GPU ops (OpenCL kernels), and DSP ops (Qualcomm® Hexagon™ NN, HVX)

ARM NN Runtime Example
• Trained TensorFlow, ONNX, and other models run in the app via the ARM NN runtime
• ARM NN backends: CPU ops and Mali GPU via the ARM Compute Library (ACL), plus other backends

CPU Runtime Fragmentation
• Several runtimes each carry their own CPU backend implementing largely the same ops:
  ◦ Qualcomm® Neural Processing SDK NN Core: CPU ops on the Qualcomm® Math Libraries
  ◦ Android NN HAL: CPU ops on custom kernels
  ◦ ARM NN backends: CPU ops on ACL
  ◦ TFLite delegate backend: CPU ops on custom kernels
Qualcomm Neural Processing SDK and Qualcomm Math Libraries are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Graph Compilers
• Examples: ONNX Runtime, TVM, GLOW, nGraph, ONNC, XLA
• A framework (e.g., TensorFlow) feeds a graph compiler, which emits CPU, GPU, and DSP binaries

Framework Layer Cake – Example: nGraph
• Frontend bridges: TensorFlow, ONNX, mxnet, Caffe2, PyTorch, …
• Core: nGraph
• Backends: CPU (MKL-DNN), Intel GPU (PlaidML), NVIDIA GPU, …

Four Levels of IR
• Developer APIs sit at the top; vendor tools own the lower levels, from shaders down to the hardware instruction set architecture
• High-level IR (“Graph IR”): parse the graph, first-pass optimization and partitioning; graph-segment optimization; memory reuse (non-backend-specific); layout transformation
• Function-level IR (“Operation IR”): HW-specific subgraph optimizations; HW optimizations such as explicit memory layout and parallelization patterns
• Shader/AST-level IR: generate shaders, DSL code, or bitcode; integrate hand-optimized (proprietary) kernels
• Assembly-level IR: generate assembly and schedule for the target ISA
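A typical high-level "Graph IR" pass from the list above is operator fusion before any HW-specific lowering. The sketch below is a toy, not any real compiler's IR: it rewrites conv2d → relu pairs in a linear, topologically ordered node list into a single fused node, the kind of backend-independent graph-segment optimization done at this level.

```python
# Toy Graph-IR fusion pass (illustrative only, not a real compiler's IR).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    op: str
    input: Optional[int]   # index of producer node; None for a graph input

def fuse_conv_relu(graph: List[Node]) -> List[Node]:
    """Rewrite conv2d followed by a relu that consumes it into one fused node."""
    out: List[Node] = []
    remap = {}   # old node index -> new node index, to fix up consumers
    i = 0
    while i < len(graph):
        n = graph[i]
        nxt = graph[i + 1] if i + 1 < len(graph) else None
        if n.op == "conv2d" and nxt and nxt.op == "relu" and nxt.input == i:
            # Both old nodes now map to the single fused node.
            out.append(Node("fused_conv2d_relu", remap.get(n.input, n.input)))
            remap[i] = remap[i + 1] = len(out) - 1
            i += 2
        else:
            out.append(Node(n.op, remap.get(n.input, n.input)))
            remap[i] = len(out) - 1
            i += 1
    return out
```

For example, a four-node chain conv2d → relu → conv2d → relu collapses to two fused nodes with their producer indices rewritten, which is why the pass carries the `remap` table.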
Credit: Enrico Ros, Qualcomm Technologies, Inc.

Projects Mapped to Four Levels of IR
• Frontends: TensorFlow, ONNX, mxnet, Caffe2, PyTorch, …
• High-level (Graph) IR: NNVM/Relay, nGraph, XLA frontend, GLOW graph IR
• Operator-level (Op) IR: TVM, PlaidML, cuDNN, XLA backend, GLOW op IR
• Shader/AST-level IR: LLVM, OpenCL, CUDA
• Assembly-level IR: ARMv8 assembly, Hexagon assembly, PTX

Leveraging Runtimes in Compiler Frameworks

Frameworks, Platforms, HW, All Growing
• Runtimes / graph compilers: TensorFlow, Caffe, Caffe2, TFLite, MxNet, Pytorch, Qualcomm Neural Processing SDK, ARM NN, Relay / TVM / LLVM, Glow / LLVM, …
• OSes: Open Embedded, Android, Ubuntu / Fedora / etc., Windows, RTOS, …
• HW / compilers: Hexagon SDK, OpenCL, CUDA, ARM M4 / A53 / A76 / etc., …
• Framework × OS × HW multiplies into a combinatorial explosion of targets

HW Diversity Is a Scalability Issue
• Every HW target needs its own toolchain + HW description plugged into the graph compiler

Tradeoff Some Performance for Scale
• Instead of generating code for each HW target directly, a graph compiler could target a common runtime
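One way a graph compiler can hand work to a common runtime backend API, rather than carrying a toolchain per HW target, is capability-based partitioning: each backend advertises the ops it supports, and the compiler greedily assigns maximal contiguous runs of ops to an accelerator backend, with CPU fallback for the rest. The backend names and op lists below are illustrative, not from any real runtime.

```python
# Hedged sketch of capability-based graph partitioning (hypothetical backends).
from typing import Dict, List, Set, Tuple

BACKENDS: Dict[str, Set[str]] = {
    "accel": {"conv2d", "relu", "maxpool"},   # e.g. a DSP/NPU runtime backend
    "cpu":   set(),                           # fallback: handles any op
}

def supports(backend: str, op: str) -> bool:
    # The CPU fallback supports everything; others only their advertised ops.
    return backend == "cpu" or op in BACKENDS[backend]

def partition(ops: List[str]) -> List[Tuple[str, List[str]]]:
    """Split a linear op sequence into (backend, subgraph) segments."""
    segments: List[Tuple[str, List[str]]] = []
    for op in ops:
        backend = "accel" if supports("accel", op) else "cpu"
        if segments and segments[-1][0] == backend:
            segments[-1][1].append(op)        # extend the current segment
        else:
            segments.append((backend, [op]))  # start a new segment
    return segments
```

Each accelerator segment can then be handed whole to the runtime's executor API, which is the "tradeoff some performance for scale" point: fewer, larger delegated subgraphs in exchange for not porting a code generator to every device.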
• For example, ACL could serve as that common runtime target

Conclusions

Runtime Components Map to Op-Level IR
• Runtimes already expose executor interfaces at the operator level, beneath the framework model containers:
  ◦ Qualcomm Neural Processing SDK
  ◦ ARM NN: graph compiler plus the ARM NN backend API
  ◦ Hexagon NN
  ◦ TEngine: the TEngine executor API
• These executor APIs sit at the op-level IR

Using Full Runtime as a HW Backend
• Today: mxnet → Relay → TVM, with LLVM for CPU and ACL for CPU/GPU backends, executed by the TVM runtime on CPU, GPU, …
• Instead of porting each device individually, a full runtime (e.g., ONNX Runtime with its CPU and GPU backends) could itself serve as a HW backend

ARM Inferencing Edge Server
• Frontend bridges (TensorFlow, ONNX, mxnet, Caffe2, PyTorch, …) feed nGraph, with backends for CPU (ACL), Mali (ACL), and an accelerator HW compiler

Addressing Graph Compiler IR Fragmentation
• Each compiler framework makes different tradeoffs, but they may eventually share components, e.g., code generation for OpenCL, LLVM, etc.
• There are too many frameworks, compilers, and formats to address them all.
• There is too much disparate HW for every framework to support.
• The TensorFlow/TFLite and ONNX formats can provide the most scale for edge-device inferencing runtimes.
• If compiler frameworks supported a common runtime backend API (like the ARM NN backend API) bound at the operator IR, graph compilers could support more edge devices with optimized backends, and a new backend (e.g., Hexagon NN) would need one implementation of a common API instead of an individual port to each project.

Addressing CPU Runtime Fragmentation
• Make ACL the “best of breed” CPU runtime
• Consolidate the TFLite CPU runtime, the Android NN CPU runtime, and the ARM NN CPU backend
• Pave the way for others to follow: TEngine, MACE, Qualcomm® Neural Processing SDK, …

Thank you!
Follow us on:
For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog

Nothing in these materials is an offer to sell any of the components or devices referenced herein. References in this presentation to “Qualcomm” may mean Qualcomm
Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.

©2019 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved. Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Qualcomm Hexagon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries. Other products and brand names may be trademarks or registered trademarks of their respective owners.