nGraph + PlaidML

Unlocking Next-Generation Performance with Deep Learning Compilers

Jayaram Bobba and Tim Zerrell

Legal Notices & Disclaimers

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon, Movidius and others are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2018 Intel Corporation.

The Path to nGraph+PlaidML

Simplified Deep Learning Stack

(Diagram: a simplified deep learning stack: frontend, language, graph IR, graph compiler, kernel library, driver, hardware.)

Current State of DL Framework Acceleration: Framework Optimization

(Diagram: today each DL framework does its own graph optimization and kernel integration, wiring directly into kernel libraries such as cuDNN, clDNN, Intel MKL-DNN, NNP, Movidius, or your own DL kernel library, for each hardware target including your own DL hardware. This approach does not scale.)

Current State of DL Framework Acceleration: Kernel Libraries

(Diagram: the kernel-library combinatorial explosion. Chip designs (GPUs, CPUs, FPGAs, accelerators) x data types (FP16, FP32, INT4, INT8) x ops (Convolution, MatMul, Pool, Normalize, Scale, Grouped) x parameters (layouts NCHW/NHWC, 2D/3D/4D shapes, padding Const/Reflect/Edge/Same/Valid/Standard, batch sizes BS1/BS16/BS32).)

#ChipDesigns × #DTypes × #Ops × ∏(#Params) = #Kernels
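For illustration only (the counts here are hypothetical): 4 chip designs × 4 data types × 50 ops × 12 parameter combinations per op already means 4 × 4 × 50 × 12 = 9,600 distinct kernels to write, tune, and maintain by hand.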

Our Solution: nGraph + PlaidML

(Diagram: DL frameworks now target a single nGraph layer, which integrates the kernel libraries (cuDNN, clDNN, Intel MKL-DNN, NNP, Movidius, your DL kernel library) and your DL hardware.)

Graph-level optimizations + kernel library integration

Our Solution: nGraph + PlaidML

(Diagram: the same stack, now with a tensor compiler added alongside the kernel libraries.)

Graph-level optimizations + kernel library integration + tensor compiler

nGraph + PlaidML: A Multi-Platform Stack with a Tensor Compiler

(Diagram: frontends and accelerated (XL) frontends feed the graph compiler, which targets kernel libraries, drivers, and a tensor compiler on the way down to hardware.)

nGraph: A Deep Dive

The Whole Stack: Bridges, Core, Backends

(Diagram:
● Framework Bridges
● nGraph Core: Core Frontend API (Graph Construction API, Execution Interface), Generic Graph Optimizers (Graph Rewriting API), Core Backend API
● Hardware Backends: HW-specific optimizer, executor)

Framework Bridges

TensorFlow Bridge (https://github.com/NervanaSystems/ngraph-tf)
Option 1 (pre-built binaries):
1) pip install tensorflow
2) pip install ngraph-tensorflow-bridge
3) import ngraph_bridge
Option 2 (from source):
1) Download tensorflow v1.12.0
2) bazel build --config=opt --config=ngraph //tensorflow/tools/pip_package:build_pip_package

MXNet Bridge (https://github.com/NervanaSystems/ngraph-mxnet)
Option 1 (pre-built binaries):
1) pip install ngraph-mxnet
Option 2 (from source):
1) Download ngraph-mxnet
2) make USE_NGRAPH=1

Framework Bridge: Translation Flow

(Diagram: the original framework graph, the graph after clustering, and the graph after translation; each cluster becomes an nGraph Function.)

● Each cluster is a "unit of work" for nGraph.
● Anything not clustered stays on the framework's native engine.
● The backend has freedom to rewrite nGraph Functions:
○ …for optimization
○ …for easy integration with kernel libraries
○ etc.

Framework Bridges -> nGraph

(Stack diagram repeated from above.)

Graph Construction API: From Framework Graphs to nGraph IR

(Diagram: a framework graph is lowered to an nGraph IR graph: Convolution (stride {1,1}) plus a Broadcast of the bias to shape {8,16,220,220}, feeding Add and Relu.)

Framework graph:
- Rich op sets (TF: ~1K ops¹)
- Usually dynamically typed
- "Non-DL" ops

nGraph IR:
- Small set of simple ops
- Statically typed
- Focused on DL primitives

¹ https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/ops.pbtxt

Constructing Graphs

(Graph: Parameter f32 {8,3,224,224}, Parameter f32 {16,3,5,5}, and Parameter f32 {16} feed Convolution (stride {1,1}) and Broadcast (shape {8,16,220,220}, bc_axes {0,2,3}), then Add, Relu, and Result, all inside a Function.)

auto data_in = make_shared<op::Parameter>(element::f32, Shape{8,3,224,224});
auto w_in    = make_shared<op::Parameter>(element::f32, Shape{16,3,5,5});
auto b_in    = make_shared<op::Parameter>(element::f32, Shape{16});

auto conv      = make_shared<op::Convolution>(data_in, w_in, Strides{1,1});
auto bias_bc   = make_shared<op::Broadcast>(b_in, Shape{8,16,220,220}, AxisSet{0,2,3});
auto conv_bias = make_shared<op::Add>(conv, bias_bc);
auto conv_bias_relu = make_shared<op::Relu>(conv_bias);

auto f = make_shared<Function>(conv_bias_relu, ParameterVector{data_in, w_in, b_in});

nGraph Code

automatic graph differentiation

Python, ONNX, ONNXIFI frontends

nGraph Core Ops

Execution Interface: Run Graphs

(Stack diagram as above.)

● The execution API is a simple five-method interface (sketched below):
○ create_tensor()
○ write()
○ read()
○ compile()
○ call()
● These functions are implemented by each backend.
● NB: write() and read() can be avoided for host-resident tensors.
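A minimal sketch of how these five calls fit together on the CPU backend, assuming the 2018-era nGraph C++ API (exact signatures vary by release); f is the Function built on the Constructing Graphs slide, and the host vectors are placeholders:

#include "ngraph/ngraph.hpp"
#include <memory>
#include <vector>
using namespace ngraph;

void run_once(std::shared_ptr<Function> f)
{
    auto backend = runtime::Backend::create("CPU");

    // create_tensor(): allocate backend tensors for the parameters and the result
    auto data   = backend->create_tensor(element::f32, Shape{8, 3, 224, 224});
    auto w      = backend->create_tensor(element::f32, Shape{16, 3, 5, 5});
    auto b      = backend->create_tensor(element::f32, Shape{16});
    auto result = backend->create_tensor(element::f32, Shape{8, 16, 220, 220});

    // write(): copy host data in (skippable for host-resident tensors);
    // the weights and bias would be written the same way.
    std::vector<float> host_in(8 * 3 * 224 * 224, 1.0f);
    data->write(host_in.data(), 0, host_in.size() * sizeof(float));

    // compile() once, then call() as many times as needed
    backend->compile(f);
    backend->call(f, {result}, {data, w, b});

    // read(): copy the result back to the host
    std::vector<float> host_out(8 * 16 * 220 * 220);
    result->read(host_out.data(), 0, host_out.size() * sizeof(float));
}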

The Whole Stack: Bridges, Core, Backends

(Stack diagram repeated from above.)

Hardware Backends

More in external repos for new hardware and new usage models.

Example: Intel CPU Backend

(Diagram: the CPU (IA) backend sits under the Backend API, alongside the generic graph optimizers and the execution interface. It adds IA-specific passes and executes via codegen or direct execution, on top of deep learning / linear algebra performance libraries and JIT engines (Intel MKL-DNN, Eigen, Halide, ...), which in turn build on foundation libraries for parallelism and concurrency: OpenMP and Intel Threading Building Blocks.)

Generic Graph Optimizers: Optimization Passes

● nGraph Core includes a large library of HW-agnostic passes:
○ Algebraic Simplification
○ Common Subexpression Elimination
○ Constant Folding
○ Core Fusion
○ Reshape/Transpose Elimination
○ Reshape/Transpose Sinking
○ Zero-Element Tensor Elimination
● The pass manager makes it easy to reuse and mix generic optimization passes and your own device-specific optimizations (see the sketch below).
● Same unified interface and APIs for both.
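A minimal sketch of registering a few of the generic passes above with the pass manager (class names and header paths follow the 2018-era nGraph sources and may differ by release):

#include "ngraph/pass/manager.hpp"
#include "ngraph/pass/algebraic_simplification.hpp"
#include "ngraph/pass/constant_folding.hpp"
#include "ngraph/pass/core_fusion.hpp"
#include "ngraph/pass/cse.hpp"
#include "ngraph/pass/reshape_elimination.hpp"
#include <memory>
using namespace ngraph;

void run_generic_passes(std::shared_ptr<Function> f)
{
    pass::Manager pass_manager;

    // Generic, HW-agnostic passes from nGraph Core; a device backend would
    // interleave its own passes through the same register_pass<>() interface.
    pass_manager.register_pass<pass::AlgebraicSimplification>();
    pass_manager.register_pass<pass::CommonSubexpressionElimination>();
    pass_manager.register_pass<pass::ConstantFolding>();
    pass_manager.register_pass<pass::CoreFusion>();
    pass_manager.register_pass<pass::ReshapeElimination>();

    pass_manager.run_passes(f);
}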

Optimization Passes: Algebraic Simplification

(Example 1: a tensor Foo of shape {15} is sliced into pieces [0:2], [2:4], and [4:15] and immediately reassembled with Concat; the whole Slice/Concat subgraph simplifies to Foo.)

(Example 2: a tensor Foo of shape (2x3x5x7) is "padded" with a Pad whose padding width is zero all around; the Pad simplifies away.)

Optimization Passes: Reshape/Transpose Elimination

(Example 1: Foo (10x20) and Bar (30x10) each pass through Transpose perm=[1,0] before a MatMul. Since AᵀBᵀ = (BA)ᵀ, the pair of input transposes is folded into a single Transpose perm=[1,0] on the MatMul output.)

(Example 2: Baz (64x3x224x224) is wrapped in Transpose perm=[0,2,3,1] before a Convolution and Transpose perm=[0,3,1,2] after it; the transposes cancel out and are eliminated.)

Pattern Matching & Graph Rewriting

(Diagram: the Convolution + Broadcast + Add + Relu subgraph from the construction example is matched and rewritten into a single CPUConvBias node with stride={1,1}, with_relu=true.)

Step 1: Describe the pattern.
Step 2: Request a pattern match.
Step 3: Rewrite the match.
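A schematic sketch of those three steps as a graph-rewrite pass, loosely following the 2018-era nGraph pattern API (pattern::op::Label, pattern::Matcher, GraphRewrite::add_matcher); the CPUConvBias construction is illustrative only, since the real fused op and its constructor live in the CPU backend sources:

// Step 1: describe the pattern; Labels act as wildcards for the inputs.
auto data    = std::make_shared<pattern::op::Label>(element::f32, Shape{8, 3, 224, 224});
auto weights = std::make_shared<pattern::op::Label>(element::f32, Shape{16, 3, 5, 5});
auto bias    = std::make_shared<pattern::op::Label>(element::f32, Shape{8, 16, 220, 220});
auto conv    = std::make_shared<op::Convolution>(data, weights, Strides{1, 1});
auto add     = std::make_shared<op::Add>(conv, bias);
auto relu    = std::make_shared<op::Relu>(add);

// Step 3: the callback rewrites each match; the CPUConvBias arguments are illustrative.
pattern::graph_rewrite_callback callback = [=](pattern::Matcher& m) {
    auto nodes = m.get_pattern_map();
    auto fused = std::make_shared<op::CPUConvBias>(
        nodes[data], nodes[weights], nodes[bias], Strides{1, 1}, /*with_relu=*/true);
    ngraph::replace_node(m.get_match_root(), fused);
    return true;
};

// Step 2: register the matcher with a GraphRewrite pass (e.g. inside the CPU fusion pass).
auto matcher = std::make_shared<pattern::Matcher>(relu, callback);
this->add_matcher(matcher);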

Backend-Specific Optimization: Group Convolution Fusion

(Before: the images and filters are sliced, two Slice ops per channel group, and one Convolution is run per channel group, with the results concatenated. After: the entire subgraph is replaced by a single CPUGroupConv op.)

Example: MobileNet after Group Convolution Fusion

(Graph visualization omitted; the rectangles are actually way too wide to fit on the slide…)

Backend-Specific Optimization: RNN Fusion

(Example: a 2-layer, 3-timestep LSTM model before and after fusion; the recurrent matcher captures RNNs with an arbitrary number of timesteps.)

Backend-Specific Optimization: Layout Assignment

(Diagram: the CPUConvBias graph from earlier, with CvtLayout nodes inserted: nchw->… and oihw->… on the inputs and …->nchw on the output.)

● Logically, nGraph always uses "NCHW/OIHW" format.
● Physically, the backend has control of layout.
● The CPU backend selects among layouts supported by Intel MKL-DNN:
○ Oihw
○ OIhw4i16o4i_s8s8
○ Many, many others
● Good choices here are critical to performance.

Registering and Running Optimization Passes

pass_manager.register_pass<...>();
pass_manager.register_pass<...>();
pass_manager.register_pass<...>();
...
auto optimized_graph = pass_manager.run_passes(original_graph);

● The pass manager makes it easy to reuse and mix generic optimization passes and your own device-specific optimizations.
● The example above is from the Intel CPU backend (abbreviated; the pass template arguments are elided).


nGraph Hands-On: NASNet through TensorFlow and nGraph

Setup

Intel® Xeon® Scalable Processor, Ubuntu 16.04

● Install TensorFlow and ngraph-tensorflow-bridge

● Clone tf_cnn_benchmarks

Run NASNet (stock TF)

Run NASNet (nGraph TF)

● Import ngraph_bridge into the model

● Run nasnet

Performance Profiling

● Compile(): NGRAPH_PROFILE_PASS_ENABLE=1

● Call(): NGRAPH_CPU_TRACING=1

Visualize Graphs

● NGRAPH_ENABLE_SERIALIZE_TRACING=1

○ Serialized graphs that can be subsequently loaded into standalone nGraph tools like nbench

● NGRAPH_ENABLE_VISUALIZE_TRACING=1

○ Dumps graphs after each of the passes (Ref)

PlaidML
● https://github.com/plaidml/plaidml
● Explicitly models hardware
● Cost-based JIT schedule generation
● Differentiable DSL
● Data type & layout agnostic


(Architecture diagram, the PlaidML stack:
● Frontends: ONNX, nGraph
● Op Library: C/C++, Python
● Tile DSL
● IR (flat contractions)
● Optimizer: config, cost model, IR
● HAL: OpenCL, LLVM, CUDA)

PlaidML Philosophy & High-Level Architecture

"Optimal kernels can be produced from hardware descriptions given sufficient constraints"

(Diagram: v0 and v1 compilation pipelines.

V0 (current):
● Tile DSL: differentiation, defractionalization, flattening
● Flat Contractions: vectorization, cost-model based tiling, memory blocking (caching), deconflict reads and writes, edge handling, elementwise fusion
● HAL: driver interface, API interactions

V1:
● Tile: differentiation, defractionalization
● Stripe: tensorization, affine fusion, elementwise fusion, memory blocking (caching), paging, padding, edge handling, bank splitting, memory layout ops, cost-based optimization, detailed estimation
● HAL: driver interface, API interactions)

PlaidML: Tile DSL

Tensor DSLs

Matrix multiplication in each compiler's native DSL:

Compiler              | Matrix multiplication in native DSL
PlaidML (Tile)        | C[i, j: I, J] = +(A[i, k] * B[k, j]);
taco                  | c(i, j) = a(i,k) * b(k,j)
TVM                   | tvm.sum(a[i, k] * b[j, k], axis=k)
Tensor Comprehensions | C(i, j) +=! A(i, k) * B(k, j)
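For reference, a sketch in plain C++ of the loop nest the Tile contraction in the first row describes (the sizes are illustrative; the point is that every write to the same C[i][j] is merged by the '+' aggregation):

constexpr int I = 4, J = 5, K = 6;      // illustrative sizes

// C[i, j : I, J] = +(A[i, k] * B[k, j]);
// k never appears on the left-hand side, so it is a reduction index whose
// range is inferred from the shapes of A and B; '+' merges repeated writes.
void matmul_reference(const float A[I][K], const float B[K][J], float C[I][J])
{
    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j) {
            C[i][j] = 0.0f;
            for (int k = 0; k < K; ++k)
                C[i][j] += A[i][k] * B[k][j];
        }
}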

Polyhedral Model

▪ Represent the index space of a tensor operation by specifying a bounding polyhedron
▪ Alternative to nested for loops
▪ Often a more natural representation of a tensor operation
▪ Constrains the problem space to that which can be bounded by a polyhedron, making subsequent optimizations simpler (vs. e.g. Halide)

for (y = 0; y < 4; ++y) {
    for (x = y; x < 8; ++x) {
        // Do stuff
    }
}

Tile: Contractions

▪ Written directly in polyhedral form; no nested for loops until writing optimized kernels
▪ For every valid index, compute the right-hand side; multiple writes to the same output are merged using the aggregation operation (see the loop-nest sketch below)
▪ A special, simple case of the polyhedral model: no complex data dependencies

function (I[N, X, CI], F[W, CI, CO]) -> (O) {
    O[n, x, c: N, (X+1)/2, CO] = +(I[n, 2*x + i, d] * F[i, d, c]);
}
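A sketch of the loop nest this contraction denotes, in plain C++ (sizes are illustrative): i and d are reduction indices because they only appear on the right-hand side, and the if-check plays the role of the polyhedral constraint that keeps reads inside I:

constexpr int N = 2, X = 8, CI = 3, W = 3, CO = 4;   // illustrative sizes
constexpr int XO = (X + 1) / 2;                      // output width from the spec

// O[n, x, c : N, (X+1)/2, CO] = +(I[n, 2*x + i, d] * F[i, d, c]);
void strided_conv_reference(const float I[N][X][CI], const float F[W][CI][CO],
                            float O[N][XO][CO])
{
    for (int n = 0; n < N; ++n)
        for (int x = 0; x < XO; ++x)
            for (int c = 0; c < CO; ++c) {
                O[n][x][c] = 0.0f;
                for (int i = 0; i < W; ++i)        // reduction over filter taps
                    for (int d = 0; d < CI; ++d)   // reduction over input channels
                        if (2 * x + i < X)         // constraint: stay inside I
                            O[n][x][c] += I[n][2 * x + i][d] * F[i][d][c];
            }
}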

Tile: Automatic Differentiation

… start with a dilated & strided convolution:

function (I[N, H, W, CI], K[KH, KW, CI, CO]) -> (O) {
    O[n, y, x, co: N, H/3, W/3, CO] = +(I[n, 3*y + 2*j, 3*x + 2*i, ci] * K[j, i, ci, co]);
}

… DI/DO is obtained by swapping the input I and the output O:

function (DO[N, OH, OW, CO], K[KH, KW, CI, CO]) -> (DI) {
    DI[n, 3*y + 2*j, 3*x + 2*i, ci: N, 3*OH, 3*OW, CI] = +(DO[n, y, x, co] * K[j, i, ci, co]);
}

PlaidML v0

PlaidML v0: Code Generation

function matmul(A[M, L], B[L, N]) -> (C) {
    C[i, j: M, N] = +(A[i, k] * B[k, j]);
}

function maxpool(I[M, N]) -> (O) {
    O[i, j: M/2, N/2] = >(I[2*i + k, 2*j + l]), k < 2, l < 2;
}

Optimizer

Idx | Range | O       | D       | K
ci  | 64    | 0       | 1       | 1
co  | 64    | 1       | 0       | 64
i   | 3     | 0       | 14336   | 12288
j   | 3     | 0       | 64      | 4096
n   | 32    | 3211264 | 3211264 | 0
x   | 224   | 14336   | 14336   | 0
y   | 224   | 64      | 64      | 0
off |       | 0       | -14400  | 0

"settings": {
    "threads": 256,
    "vec_size": 1,
    "mem_width": 128,
    "max_mem": 32768,
    "max_regs": 16384,
    "goal_groups": 16,
    "goal_flops_per_byte": 50
}

PlaidML v0: Optimization

Fixed passes, locally optimal, config-driven

Vectorize • Find a stride-1 dimension such that v = N^2 : v < vec_size , constrain tiling to multiples of v

Tile • For each index hill climb and use cost model to maximize reuse while fitting in cache & registers

Load • Create a loading pattern designed to minimize bank conflicts for any number of parallel readers

Loop • Order loops using a topological ordering to maximize cache reuse

Thread • Rollup as many inner loops into hardware threads as possible

PlaidML v0: Runtime / HAL

__kernel void test_kernel(__global float4* out, __global const float* in1, __global const float4* in2) { ssize_t tid = get_local_id(0); float4 agg[8] = {((float4) (sum_base_float)), ((float4) (sum_base_float)), ((float4) (sum_base_float)), ((float4) (sum_base_float)), ((float4) (sum_base_float)), ((float4) (sum_base_float)), ((float4) (sum_base_float)), ((float4) (sum_base_float)), }; __local float in1_shared[4160]; __local float4 in2_shared[520]; ssize_t v1_gid = (get_group_id(1)*8); ssize_t v0_gid = (get_group_id(0)*64); for(ssize_t v2_gid = 0; v2_gid < 256; v2_gid += 64) { { ssize_t gbase = ((0+(v2_gid*1))+(v0_gid*256)); ssize_t v2_tid = ((tid/1)%16); ssize_t v0_tid = ((tid/16)%4); for(ssize_t v2_lid = 0; v2_lid < 1; v2_lid += 1) { ssize_t v2 = ((64*v2_lid)+v2_tid); for(ssize_t v0_lid = 0; v0_lid < 16; v0_lid += 1) { ssize_t v0 = ((4*v0_lid)+v0_tid); ssize_t lidx = ((0+(4*v2))+(65*v0)); ssize_t gidx = ((gbase+(4*v2))+(256*v0)); float4 val = vload4(gidx, in1); vstore4(val, lidx, in1_shared); } } } { ssize_t gbase = ((0+(v1_gid*1))+(v2_gid*64)); ssize_t v1_tid = ((tid/1)%8); ssize_t v2_tid = ((tid/8)%8); for(ssize_t v1_lid = 0; v1_lid < 1; v1_lid += 1) { ssize_t v1 = ((8*v1_lid)+v1_tid); for(ssize_t v2_lid = 0; v2_lid < 8; v2_lid += 1 ….

PlaidML v0: Summary

• Supports training & inference
• Supports most frameworks (except training via PyTorch)
• Performance-portable across major GPU architectures
• Fixed optimization passes
• Minimal hardware config
• Not well suited for deep learning accelerators or other architectures that benefit from micro-kernels
  • Volta, Mali, Myriad, etc.

PlaidML v1: Stripe

Extending PlaidML to encompass the modern accelerator landscape

PlaidML v1: Evolution

• v0's fixed pass architecture can't extend past typical GPU architectures in a performance-portable manner
• v0's fixed pass architecture is fundamentally brittle and tightly coupled
• v1's primary challenge was to invent an abstraction capable of modelling v0 as a config-driven subset of v1

(At right: the v0 pass list from the previous slide: Vectorize, Tile, Load, Loop, Thread.)

PlaidML v1 / Stripe: Polyhedral IR

PlaidML v1 introduces Stripe: a polyhedral IR that is highly amenable to optimization.

• Stripe enables distinct passes that process Stripe and emit more Stripe.
• Stripe fundamentally represents operations over a polyhedral tensor space.

(Diagram: Stripe IR feeding a config-driven Refine pass, which emits Stripe IR again.)

PlaidML v1 / Stripe

• Stripe enables:
  • Arbitrary tensorization
  • Affine vertical fusion
  • Arbitrarily complex memory hierarchies
  • Heterogeneous compute topologies
  • Detailed performance / cost estimates
  • Software / hardware co-design

PlaidML v1 / Stripe: Pathfinding Optimizer

• Add a computation node
• Compute the min cost for each potential optimization branch for the subgraph so far
• Add nodes and explore according to A*

(Diagram: an A* search tree over tiling choices such as k0, T 4x4, T 4x1, k1, and k2, with estimated costs in brackets.)

PlaidML v1 / Stripe: Mapping to PlaidML v0

Pass               | Branches                   | Strategy                 | Comment
Tensorize          | [8x1],[4x1],[2x1],[1x1]    | Top 1                    | Pick best applicable vectorization
Tile               | prod(range(idxs))          | Hill-climb pow(2), top 1 | Increase size by powers of 2 until memory is exceeded, pick best tiling
L1 Cache           | -                          | -                        | Load memory into shared L1, avoid bank conflicts
Thread             | [16, 32, 64, 128, 256]     | Hill-climb pow(2), top 1 | Find the most threads that can be used without exceeding problem domain
Elementwise Fusion | -                          | -                        | Fuse this kernel with the next if it is an elementwise kernel
Flatten            | -                          | -                        | Flatten and order loops to minimize cost

Stripe in Depth

Stripe Conceptual Model

(Diagram: a tensor T1 <8,8,12> carved into nested blocks by indexes such as i:2, i:4, k:3, k:4.)

• Describes nested and repeated computational BLOCKS; each BLOCK represents a set of parallelizable computations
• BLOCKS are described by INDEXES and CONSTRAINTS that create polyhedral bounds over views of tensors called REFINEMENTS
• Nested BLOCKS have their own INDEXES
• Nested BLOCKS can create polyhedral sub-regions of REFINEMENTS in the parent block by creating more REFINEMENTS, which are automatically offset
• The interior of a BLOCK nest contains code that is executed for every valid value of every INDEX of every containing BLOCK

Stripe IR Explained: Stripe Top (HW Independent)

0: #program block [] ( // layer_test7
       none new@0x00000000<[0]> I[0, 0, 0] i8(1024:32768, 1024:32, 32:1)
       none new@0x00000000<[0]> K1[0, 0, 0, 0] i8(3:6144, 3:2048, 32:64, 64:1)
       …
       none new@0x00000000<[0]> O3[0, 0, 0] i8(1024:65536, 1024:64, 64:1)
   ) {
     0: #main block [] ( // main
            in<[0]> I[0, 0, 0] i8(1024:32768, 1024:32, 32:1)
            in<[0]> K1[0, 0, 0, 0] i8(3:6144, 3:2048, 32:64, 64:1)
            out<[0]> O1[0, 0, 0]:assign i8(1024:65536, 1024:64, 64:1)
            none new@0x00000000<[0]> O1[0, 0, 0] i8(1024:65536, 1024:64, 64:1)
        ) {
          0: #agg_op_add #comb_op_mul #contraction #kernel block [ci:32, co:64, kx:3, ky:3, x:1024, y:1024] (
                 // O1[x, y, co : X, Y, C2] = +(I[-1 + kx + x, -1 + ky + y, ci] * K1[kx, ky, ci, co])
                 -1 + kx + x >= 0
                 1024 - kx - x >= 0
                 -1 + ky + y >= 0
                 1024 - ky - y >= 0
                 out<[0]> O1[x, y, co]:add i8(1:65536, 1:64, 1:1)
                 in<[0]> I[-1 + kx + x, -1 + ky + y, ci] i8(1:32768, 1:32, 1:1)
                 in<[0]> K1[kx, ky, ci, co] i8(1:6144, 1:2048, 1:64, 1:1)
             ) {
               0: $I = load(I)
               1: $K1 = load(K1)
               2: $O1 = mul($I, $K1)
               3: O1 = store($O1)
             }
          1: …
        }
   }

(Slide callouts label the parts of the IR: tags, allocations, refinements, indexes, constraints, nested blocks, tile code, aggregators, SSA IL.)

Stripe: Hardware Model

NOC "clock_mhz": {{ CLOCK_MHZ }}, DSP "mem_units": { "DRAM": { "count": 1, "size_KiB": 1048576 }, SRAM "SRAM": { "count": {{ NUM_SRAM }}, "size_KiB": {{ SRAM_SIZE_KIB }} }, CONV }, "exec_units": { ”DSP": { "count": {{NUM_DSP}}, "ops_per_cycle": 64 }, DSP "CONV": { "count": {{NUM_CONV}}, "ops_per_cycle": 512, “pipeline_depth”: 2 } SRAM }, "tx_units": { CONV "DMA": { "count": 1 }, ...

"NOC": { "count": 1 }, DRAM }, "buses": [ ... { "sources": ["DRAM[0]"], "sinks": ["DMA[0]"], "bytes_per_cycle": 64 }, DSP { "sources": ["DMA[0]"], "sinks": ["DRAM[0]"], "bytes_per_cycle": 64 }, { SRAM "sources": ["DMA[0]"], CONV "sinks": [{% for i in range(NUM_SRAM) %} "SRAM[{{i}}]"{endfor %}], "bytes_per_cycle": 64 }, { "sources": ["NOC[0]"], "sinks": [{% for i in range(NUM_SRAM) %} "SRAM[{{i}}]"{% endfor %}], "bytes_per_cycle": 512 },

. . . 60 Stripe: Optimizer Config

{ "name": "fuse_CONV_add", "fusion": { "a_reqs": ["CONV"], "b_reqs": ["eltwise_add"], "fused_set": ["CONV"] } }, { "name": "fuse_CONV_zelu", "fusion": { "a_reqs": ["CONV"], "b_reqs": ["eltwise_zelu"], "fused_set": ["CONV"] } }, { "name": "fuse_CONV", "fusion": { "parent_reqs": ["CONV"], "fused_set": [”CONV_inner"] } }, { "name": "localize_main", "localize": { "reqs": ["main"] } }, { "name": "scalarize_main", "scalarize": { "reqs": ["main"] } }, { "name": "loc_CONV", "locate_block": { "reqs": [”CONV"], "loc": { "name": ”CONV" } } }, { "name": "loc_pool", "locate_block": { "reqs": ["agg_op_max"], "loc": { "name": "DSP" } } }, { "name": "loc_eltwise", "locate_block": { "reqs": ["eltwise"], "loc": { "name": "DSP" } } }, … … … { "name": "deps_main", "compute_deps": { "reqs": ["main"] } }, { "name": "schedule_main", "schedule": { "reqs": ["main"], "mem_loc": { "name": ”SRAM" }, "mem_KiB": {{ SRAM_SIZE_KIB / NUM_SRAM }}, "alignment": 16, "xfer_loc": { "name": "DMA" }, "allow_out_of_range_accesses": true, "num_banks": {{ NUM_SRAM }} } }, { "name": "place_program", "memory_placement": { "reqs": ["program"], "locs": [{ "name": "DRAM" }], "alignment": 4 } }

Stripe: Enabling Hardware / Software Co-Design

"clock_mhz": {{ CLOCK_MHZ }}, "mem_units": { "DRAM": { "count": 1, "size_KiB": 1048576 }, "SRAM": { "count": {{ NUM_SRAM }}, "size_KiB": {{ SRAM_SIZE_KIB }} }, }, "exec_units": { ”DSP": { "count": {{NUM_DSP}}, "ops_per_cycle": 64 }, "CONV": { "count": {{NUM_CONV}}, "ops_per_cycle": 512 } Hardware }, Model

Specialized Target Networks Design Ideas Codegen (ONNX, nGraph)

{ "name": "fuse_CONV", "fusion": { "parent_reqs": ["CONV"], "fused_set": [”CONV_inner"] } }, Inference Latency { "name": "localize_main", "localize": { "reqs": ["main"] } }, { "name": "scalarize_main", "scalarize": { "reqs": ["main"] } }, { "name": "loc_CONV", "locate_block": { "reqs": [”CONV"], "loc": { "name": ”CONV" } } }, Per-Kernel Runtimes { "name": "loc_pool", "locate_block": { "reqs": ["agg_op_max"], "loc": { "name": "DSP" } } }, { "name": "loc_eltwise", "locate_block": { "reqs": ["eltwise"], "loc": { "name": "DSP" } } }, … Power Requirements … Per-Unit Utilization … Measurement Stripe: Tensorization

"tensorize": {
    "reqs": [ "agg_op_add", "comb_op_mul" ],
    "outer_set": [ "CONV" ],
    "inner_set": [ "CONV_inner" ],
    "stencils": [
        { "idxs": [ { "name": "i1", "size": 32, "outs": [-1], "ins": [-1,  0] },
                    { "name": "c",  "size": -1, "outs": [ 0], "ins": [-1, -1] } ] },
        { "idxs": [ { "name": "i1", "size":  4, "outs": [-1], "ins": [-1,  0] },
                    { "name": "i2", "size":  4, "outs": [-1], "ins": [ 0, -1] },
                    … ] },
    ]
},

(Diagram: tensorization retiles the [x:1024, y:1024, kx:3, ky:3, ci:32, co:64] convolution so that a CONV outer block of [x:256, y:256] wraps a CONV_inner block of [x:4, y:4, ci:32, co:64, kx:3, ky:3].)

Stripe: Tensorization

"tensorize": { … same configuration as above … }

BEFORE:
0: #agg_op_add #comb_op_mul #contraction #kernel block [ci:32, co:64, kx:3, ky:3, x:1024, y:1024] ( // kernel_0
       // O1[x, y, co : X, Y, C2] = +(I[-1 + kx + x, -1 + ky + y, ci] * K1[kx, ky, ci, co])
       out<[0]> O1[x, y, co]:add i8(1:65536, 1:64, 1:1)
       in<[0]> I[-1 + kx + x, -1 + ky + y, ci] i8(1:32768, 1:32, 1:1)
       in<[0]> K1[kx, ky, ci, co] i8(1:6144, 1:2048, 1:64, 1:1)
   ) {
     0: $I = load(I);  1: $K1 = load(K1);  2: $O1 = mul($I, $K1);  3: O1 = store($O1)
   }

AFTER:
0: #agg_op_add #comb_op_mul #contraction #CONV #kernel block [ci:1, co:1, kx:1, ky:1, x:256, y:256] ( // kernel_0
       // O1[x, y, co : X, Y, C2] = +(I[-1 + kx + x, -1 + ky + y, ci] * K1[kx, ky, ci, co])
       out O1[4*x, 4*y, 16*co]:add i8(4:65536, 4:64, 16:1)
       in I[kx + 4*x, ky + 4*y, 32*ci] i8(4:32768, 4:32, 32:1)
       in K1[kx, ky, 32*ci, 16*co] i8(1:6144, 1:2048, 32:1, 16:32)
   ) {
     0: #CONV_inner block [ci:32, co:64, kx:3, ky:3, x:4, y:4] ( // kernel_0
            out O1[x, y, co]:add i8(1:65536, 1:64, 1:1)
            in I[-1 + kx + x, -1 + ky + y, ci] i8(1:32768, 1:32, 1:1)
            in K1[kx, ky, ci, co] i8(1:6144, 1:2048, 1:1, 1:32)
        ) {
          0: $I = load(I);  1: $K1 = load(K1);  2: $O1 = mul($I, $K1);  3: O1 = store($O1)
        }
   }

Stripe: Auto-Tile

"autotile": {
    "reqs" : ["conv"],
    "outer_set" : ["conv"],
    "inner_set" : ["conv_inner"],
    "only_po2" : true,
    "memory" : "SRAM"
    // "pipeline_depth" : 2
}

kx | ky | ci | co | x | y | cost
1  | 1  | 32 | 4  | 8 | 8 | 120
1  | 1  | 16 | 8  | 8 | 8 | 140
1  | 1  | 32 | 5  | 4 | 4 | 270
3  | 3  | 32 | 1  | 6 | 6 | 310
3  | 3  | 16 | 1  | 9 | 9 | 340

(Diagram: the [x:256, y:256] CONV outer block is re-tiled into an [x:32, y:32] block wrapping an [x:8, y:8, ci:32, co:4] inner block.)

Stripe: Auto-Tile

"autotile": { "reqs" : ["conv"], "outer_set" : ["conv"], "inner_set" : ["conv_inner"], "only_po2" : true, “memory” : “SRAM” // ”pipeline_depth” : 2 }

BEFORE: 0: #conv block [ci:32, co:64, kx:3, ky:3, x:256, y:256] ( out O1[4*x, 4*y, 16*co]:add i8(4:65536, 4:64, 16:1) in I[kx + 4*x, ky + 4*y, 32*ci] i8(4:32768, 4:32, 32:1) in K1[kx, ky, 32*ci, 16*co] i8(1:6144, 1:2048, 32:1, 16:32) ) { 0: $I = load(I); 1: $K1 = load(K1); 2: $O1 = mul($I, $K1); 3: O1 = store($O1) }

AFTER: 0: #conv block [ci:1, co:16, kx:3, ky:3, x:32, y:32] ( // kernel_0 out O1[16*x, 16*y, 64*co]:add i8(16:65536, 16:64, 64:1) in I[kx + 16*x, ky + 16*y, 32*ci] i8(16:32768, 16:32, 32:1) in K1[kx, ky, 32*ci, 64*co] i8(1:6144, 1:2048, 32:1, 64:32) ) { 0: 1: #conv_inner block [ci:32, co:4, kx:1, ky:1, x:8, y:8] ( // No halos as the tiling makes lots of 1x1 convolutions out O1[x, y, co]:add i8(1:65536, 1:64, 1:1) in I[-1 + kx + x, -1 + ky + y, ci] i8(1:32768, 1:32, 1:1) in K1[kx, ky, ci, co] i8(1:6144, 1:2048, 1:1, 1:32) ) { 0: $I = load(I); 1: $K1 = load(K1); 2: $O1 = mul($I, $K1); 3: O1 = store($O1) } }

Stripe: Fusing Contractions

"fusion": {"a_reqs": ["CONV"], "b_reqs": ["CONV"], "fused_set": ["CONV", "fused"] }

 i:3 

  i:3   x:100 

 

 co

j

:3 :3

co

j

:3 :3

:128 :128

co

y:100 y:100

 co

:128 :128

  x:100  :128 :128

 :128 :128

 x:100    x:100  

 i:3  y:100  x:100 

  i:3 

j 

:3 :3 

j 

:3 :3 

y:100

y:100

 y:100

 i:3 

  i:3 

 

j

:3 :3

j

:3 :3

 

67 Stripe: Fusing Contractions

"fusion": {"a_reqs": ["CONV"], "b_reqs": ["CONV"], "fused_set": ["CONV", "fused"] }

BEFORE:
0: #agg_op_add #comb_op_mul #CONV #contraction #kernel block [ci:64, co:128, i:3, j:3, x:100, y:100] ( // kernel_0
       // O1[x, y, co : X, Y, CO1] = +(In[-1 + i + x, -1 + j + y, ci] * K1[i, j, ci, co])
   ) {
     0: $In = load(In);  1: $K1 = load(K1);  2: $O1 = mul($In, $K1);  3: O1 = store($O1)
   }
1: #agg_op_add #comb_op_mul #CONV #contraction #kernel block [ci:128, co:128, x:100, y:100] ( // kernel_1
       // O2[x, y, co : X, Y, CO2] = +(O1[i + x, j + y, ci] * K2[i, j, ci, co])
   ) {
     0: $O1 = load(O1);  1: $K2 = load(K2);  2: $O2 = mul($O1, $K2);  3: O2 = store($O2)
   }

AFTER:
0: #fused block [co:8, x:100, y:100] ( // kernel_0+kernel_1
       …
   ) {
     0: block [ci:64, co:16, i:3, j:3, x:1, y:1] ( …
            out O1[x, y, co]:add fp32(1:16, 1:16, 1:16, 1:1)
            in<[0]> In[-1 + i + x, -1 + j + y, ci] fp32(1:640000, 1:6400, 1:64, 1:1)
            in<[0]> K1[i, j, ci, co] fp32(1:24576, 1:8192, 1:128, 1:1)
        ) {
          0: $In = load(In);  1: $K1 = load(K1);  2: $O1 = mul($In, $K1);  3: O1 = store($O1)
        }
     1: block [ci:64, co:16, x:1, y:1] ( …
            out<[0]> O2[x, y, co]:add fp32(1:1280000, 1:12800, 1:128, 1:1)
            in O1[x, y, ci] fp32(1:16, 1:16, 1:16, 1:1)
            in<[0]> K2[0, 0, ci, co] fp32(1:16384, 1:16384, 1:128, 1:1)
        ) {
          0: $O1 = load(O1);  1: $K2 = load(K2);  2: $O2 = mul($O1, $K2);  3: O2 = store($O2)
        }
   }

PlaidML v1 / Stripe

• Stripe enables:
  • Arbitrary tensorization
  • Affine vertical fusion
  • Arbitrarily complex memory hierarchies
  • Heterogeneous compute topologies
  • Detailed performance / cost estimates
  • Software / hardware co-design

PlaidML v1.x / Stripe: Status

• Initial code upstreamed to public as of 0.5
• Configurations for GPUs, CPUs & porting v0 to Stripe in progress
• Extensions for conditionals, loops, and indirection (scatter / gather) coming in v1
• Paper coming out early next year
• Specification available on request to: [email protected]

Demo: nGraph + PlaidML

Accelerated Neural Style Transfer on a MacBook

Tengplocl: TensorFlow + nGraph + PlaidML + OpenCL

(Diagram: TensorFlow with direct nGraph integration, PlaidML, and OpenCL targeting an AMD Vega or Intel integrated GPU.)

nGraph on Iris & Radeon vs Coffee Lake i7

Conclusion

Call to Action

● Try nGraph out now!
○ The nGraph beta works out of the box with TensorFlow, MXNet, and ONNX.
○ nGraph is open source. Clone the repo and get started today!

https://ngra.ph/repo

Some Further Reading

Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning. Scott Cyphers et al. SysML 2018. https://arxiv.org/abs/1801.08058

nGraph-HE: A Graph Compiler for Deep Learning on Homomorphically Encrypted Data. Fabian Boemer, Yixing Lao, and Casimir Wierzynski. https://arxiv.org/abs/1810.10121
