Choose the Best Accelerated Technology
Intel Performance Optimizations for Deep Learning
Laurent Duhem – HPC/AI Solutions Architect ([email protected])
Shailen Sobhee – AI Technical Consultant ([email protected])
January 2021

All information provided in this deck is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Agenda
▪ Quick recap of oneAPI
▪ Overview of oneDNN
▪ Training:
  • Overview of performance-optimized DL frameworks
▪ Inferencing:
  • Intro to Intel Distribution of OpenVINO
  • Intel Low Precision Optimization Tool
▪ Hands-on demos

Intel's oneAPI Ecosystem
Built on Intel's Rich Heritage of CPU Tools, Expanded to XPUs

oneAPI: A cross-architecture language based on C++ and SYCL standards

Powerful libraries designed for acceleration of domain-specific functions

A complete set of advanced compilers, libraries, and porting, analysis and debugger tools

Powered by oneAPI: Frameworks and middleware that are built using one or more of the oneAPI industry specification elements, the DPC++ language, and libraries listed on oneapi.com.

Available Now

Visit software.intel.com/oneapi for more details. Some capabilities may differ per architecture and custom tuning will still be required. Other accelerators to be supported in the future.

A core set of high-performance tools for building C++, Data Parallel C++ applications & oneAPI library-based applications

Intel® oneAPI Tools for HPC: Deliver fast Fortran, OpenMP & MPI applications that scale

Intel® oneAPI Tools for IoT: Build efficient, reliable solutions that run at the network's edge

Intel® oneAPI Rendering Toolkit: Create performant, high-fidelity visualization applications

Intel® AI Analytics Toolkit: Accelerate machine learning & data science pipelines with optimized DL frameworks & high-performing Python libraries

Intel® Distribution of OpenVINO™ Toolkit: Deploy high performance DL inference & applications from edge to cloud

Latest version is 2021.1

Intel® AI Analytics Toolkit

Accelerate end-to-end AI and data analytics pipelines with libraries optimized for Intel® architectures

Data scientists, AI researchers, ML and DL developers, AI application developers

▪ Deep learning performance for training and inference with Intel optimized DL frameworks and tools
▪ Drop-in acceleration for data analytics and machine learning workflows with compute-intensive Python packages

Learn More: software.intel.com/oneapi/ai-kit

Back to Domain-specific Toolkits for Specialized Workloads

Intel Deep Learning Software Stack
Deep Learning ecosystem: enabling the Intel Deep Learning deployment toolkit, major open source deep learning networks, and customer applications

Intel Deep Neural Network Library (Intel® oneDNN) is an open source, IA-optimized deep learning library for scalable, high-velocity integration with ML/DL frameworks.
• Includes open source implementations of new DNN functionality
• Delivers new algorithms ahead of MKL releases
• Open for community contributions

Intel Math Kernel Library (Intel® oneMKL) is a commercial performance library to extract maximum Intel hardware performance and provide a common interface to all Intel processors and accelerators.

Intel Core/Xeon CPU: Intel libraries as the path to bring optimized ML/DL frameworks to Intel hardware

*Other names and brands may be claimed as property of others.

Develop Fast Neural Networks on Intel® CPUs & GPUs with Performance-optimized Building Blocks
Intel® oneAPI Deep Neural Network Library (oneDNN)

▪ An open source performance library for deep learning applications
  • Helps developers create high performance deep learning frameworks
  • Abstracts out instruction set and other complexities of performance optimizations
  • Same API for both Intel CPUs and GPUs, use the best technology for the job
  • Supports Linux, Windows
  • Open source for community contributions

▪ Get it here: software.intel.com/oneapi/dnnl
▪ Distribution: github.com/intel/mkl-dnn

Intel® oneAPI Deep Neural Network Library

Basic Information: Intel® oneDNN

Features:
• Convolution: 2D/3D direct convolution/deconvolution, depthwise separable convolution, 2D Winograd convolution
• Inner Product: 2D/3D inner product
• Pooling: 2D/3D maximum, 2D/3D average (include/exclude padding)
• Normalization: 2D/3D LRN across/within channel, 2D/3D batch normalization
• Eltwise (loss/activation): ReLU (bounded/soft), ELU, Tanh; Softmax, Logistic, linear; square, sqrt, abs, exp, gelu, swish
• Data manipulation: Reorder, sum, concat, View
• RNN cell: RNN cell, LSTM cell, GRU cell
• Fused primitives: Conv+ReLU+sum, BatchNorm+ReLU
• Data types: f32, bfloat16, s8, u8

Support matrix:
• API: C, C++, SYCL(4)
• Training: float32, bfloat16(1)
• Inference: float32, bfloat16(1), float16(1), and int8(1)
• Networks: MLPs, CNNs (1D, 2D and 3D), RNNs (plain, LSTM, GRU)
• Compilers: Intel, GCC, CLANG, MSVC(2), DPC++(4)
• OS: Linux, Windows(2), macOS(2,3)
• CPU hardware: Intel® Atom, Intel® Core™, Intel® Xeon™; runtimes: OpenMP, TBB, DPC++(4)
• GPU hardware: Intel HD Graphics, Intel® Iris® Plus Graphics; runtimes: OpenCL(2), DPC++(4)

(1) Low precision data types are supported only for platforms where hardware acceleration is available
(2) Not available in the oneAPI Beta binary distribution
(3) GPU support, OpenCL runtime and DPC++ GPU target are not available on macOS
(4) Targeting oneAPI Beta, not available in DNNL v1.x

Iris is a trademark of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

Overview of Intel Optimizations for TensorFlow*
Intel® TensorFlow* optimizations

1. Operator optimizations: replace default (Eigen) kernels by highly-optimized kernels (using Intel® oneDNN)
2. Graph optimizations: fusion, layout propagation
3. System optimizations: threading model

Setting up the TensorFlow environments

▪ We create two environments for our benchmarks
▪ One with "stock" TensorFlow
▪ The other with Intel-optimized TensorFlow (a quick way to check which build is active follows below)
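To confirm which build is active in each environment, one can query TensorFlow directly. A minimal sketch, assuming a TensorFlow 2.x build where the internal helper _pywrap_util_port.IsMklEnabled() is present (this is an internal API and may move between releases):

import tensorflow as tf
from tensorflow.python import _pywrap_util_port   # internal helper; location may vary by TF version

print("TensorFlow", tf.__version__)
print("oneDNN/MKL optimizations enabled:", _pywrap_util_port.IsMklEnabled())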

Operator optimizations

Example: in TensorFlow, the computation graph is a data-flow graph, e.g. Weights → MatMul → Add (Bias) → ReLU.

Operator optimizations

▪ Replace default (Eigen) kernels by highly-optimized kernels (using Intel® oneDNN)
▪ Intel® oneDNN has optimized a set of TensorFlow operations.
▪ The library is open source (https://github.com/intel/mkl-dnn) and downloaded automatically when building TensorFlow.

Forward ops: Conv2D; ReLU, TanH, ELU; MaxPooling; AvgPooling; BatchNorm; LRN; MatMul, Concat
Backward ops: Conv2DGrad; ReLUGrad, TanHGrad, ELUGrad; MaxPoolingGrad; AvgPoolingGrad; BatchNormGrad; LRNGrad

Fusing computations

Example: two convolution branches feeding a Sum followed by a ReLU are fused so that one branch stays a plain Conv and the other becomes a single Conv+Sum+ReLU primitive.

▪ On Intel processors a high % of time is typically spent in BW-limited ops
  • ~40% of ResNet-50, even higher for inference
▪ The solution is to fuse BW-limited ops with convolutions or with one another to reduce the # of memory accesses
  • Conv+ReLU+Sum, BatchNorm+ReLU, etc.
  • Done for inference, WIP for training
▪ The frameworks are expected to be able to detect fusion opportunities
  • IntelCaffe already supports this
▪ Major impact on implementation
  • All the implementations must be made aware of the fusion to get max performance
  • The Intel oneDNN team is looking for scalable solutions to this problem

Graph optimizations: fusion

Before merge: Input + Filter → Conv2D; Conv2D output + Bias → BiasAdd.
After merge: Input + Filter + Bias → Conv2DWithBias.

Graph optimizations: layout propagation

Initial graph: Input/Filter → Conv2D → ReLU → Shape.
After layout conversions: a Convert is inserted before and after each Mkl* op (Input/Filter → Convert → MklConv2D → Convert → MklReLU → Convert → Shape).
After layout propagation: redundant Converts between consecutive Mkl* ops are removed, so data stays in the optimized layout (Input/Filter → Convert → MklConv2D → MklReLU → Convert → Shape).
All oneDNN operators use highly-optimized layouts for TensorFlow tensors.

More on memory channels: Memory layouts

▪ The most popular memory layouts for image recognition are nhwc and nchw
  • Challenging for Intel processors either for vectorization or for memory accesses (cache thrashing)
▪ Intel oneDNN convolutions use blocked layouts
  • Example: nchw with channels blocked by 16 – nChw16c (a small illustration follows below)
  • Convolutions define which layouts are to be used by other primitives
  • Optimized frameworks track memory layouts and perform reorders only when necessary
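As an illustration of the blocked layout idea (plain NumPy, not oneDNN code), the sketch below reorders an NCHW tensor into nChw16c, assuming the channel count is a multiple of 16; shapes are illustrative.

import numpy as np

N, C, H, W, B = 1, 32, 4, 4, 16                      # B = channel block size
x_nchw = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)
# Split channels into C/16 blocks and move the block of 16 channels innermost:
x_blocked = x_nchw.reshape(N, C // B, B, H, W).transpose(0, 1, 3, 4, 2)
print(x_blocked.shape)                                # (1, 2, 4, 4, 16): 16 channels are contiguous per pixel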

Data Layout has a BIG Impact

• Continuous access to avoid gather/scatter
• Have iterations in the inner-most loop to ensure high vector utilization
• Maximize data reuse; e.g. weights in a convolution layer
• Overhead of layout conversion is sometimes negligible, compared with operating on an unoptimized layout

[Figure] Channel-based (NCHW) vs. pixel-based (NHWC) storage of the same image data (illustrative pixel values omitted).

for i = 1 to N    # batch size
  for j = 1 to C  # number of channels, image RGB = 3 channels
    for k = 1 to H  # height
      for l = 1 to W  # width
        dot_product( … )

System optimizations: load balancing

▪ TensorFlow graphs offer opportunities for parallel execution.
▪ Threading model:
  1. inter_op_parallelism_threads = max number of operators that can be executed in parallel
  2. intra_op_parallelism_threads = max number of threads to use for executing an operator
  3. OMP_NUM_THREADS = MKL-DNN equivalent of intra_op_parallelism_threads

Performance Guide

tf.ConfigProto is used to set the inter_op_parallelism_threads and intra_op_parallelism_threads configurations of the Session object.

>>> config = tf.ConfigProto()
>>> config.intra_op_parallelism_threads = 56
>>> config.inter_op_parallelism_threads = 2
>>> tf.Session(config=config)
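tf.ConfigProto and tf.Session belong to the TensorFlow 1.x API. In TensorFlow 2.x the same knobs are exposed through tf.config.threading; the thread counts below are illustrative:

import tensorflow as tf
tf.config.threading.set_intra_op_parallelism_threads(56)
tf.config.threading.set_inter_op_parallelism_threads(2)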

https://www.tensorflow.org/performance/performance_guide#tensorflow_with_intel_mkl_dnn

Performance Guide
▪ Maximize TensorFlow* Performance on CPU: Considerations and Recommendations for Inference Workloads: https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference

Example: setting MKL/OpenMP variables with Python os.environ:

os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
os.environ["KMP_SETTINGS"] = "0"
os.environ["OMP_NUM_THREADS"] = "56"

The Intel TensorFlow* install guide is available at https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide

Demo: Review TF Benchmark results

PyTorch Performance Benefit with Intel® AVX-512 VDPBF16PS Instruction: Inference Performance

[Chart] Inference throughput (samples/s) and speedup ratio, FP32 vs. bfloat16:
DLRM: 71061 (FP32) vs. 99321 (bfloat16), speedup 1.4x
ResNet-50: 243 (FP32) vs. 399 (bfloat16), speedup 1.64x
ResNeXt-101 32x4d: 120 (FP32) vs. 193 (bfloat16), speedup 1.6x

* Detailed info can be found at https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intel-facebook-boost-bfloat16.

Intel Optimizations for PyTorch

• Accelerated operators
• Graph optimization
• Accelerated communications

Motivation for Intel Extension for PyTorch (IPEX)

• Provide customers with the most up-to-date Intel software and hardware features
• Streamline the work of enabling Intel accelerated libraries

Operator Optimization

➢ Auto dispatch the operators optimized by the extension backend

➢ Auto operator fusion via PyTorch graph mode

Mixed Precision

➢ Accelerate PyTorch operator by bfloat16

➢ Automatic mixed precision

How to get IPEX
1. oneAPI AI Analytics Toolkit
2. Install from source

IPEX from the oneAPI AI Analytics Toolkit

Intel Optimizations for PyTorch

Intel-Optimized PyTorch
• PyTorch back-end optimizations
• Up-streamed to regular PyTorch
• Same front-end code as regular PyTorch

Intel Extension for PyTorch (IPEX)
• Additional optimizations and Mixed Precision support
• Different front-end

Torch-CCL
• For distributed learning
• PyTorch bindings for oneCCL

Installing IPEX from source

https://github.com/intel/intel-extension-for-pytorch
License: Apache 2.0

Build and install:
1. Install PyTorch from source (get the PyTorch v1.5.0-rc3 source)
2. Download and install the Intel Extension for PyTorch source
3. Add the new backend for Intel Extension for PyTorch

Note: Binary installation is coming soon

Data types

https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf?source=techstories.org

• Benefits of bfloat16 (see the sketch below):
  • Up to 2x performance
  • Comparable accuracy loss relative to fp32
  • No loss scaling needed, compared to fp16

* bfloat16 intrinsic support starts from 3rd Generation Intel® Xeon® Scalable Processors
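A small PyTorch sketch of why bfloat16 avoids loss scaling: it keeps fp32's 8-bit exponent (range) and gives up mantissa bits (precision), whereas fp16 overflows above roughly 65504:

import torch

x = torch.tensor([3.141592653589793, 1.0e-3, 65504.0, 1.0e38])
print(x.to(torch.bfloat16))   # large values survive; only ~3 significant decimal digits remain
print(x.to(torch.float16))    # 1e38 overflows to inf in fp16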

IAGS Intel Architecture, Graphics, and Software 3535 Automatic Mixed Precision Feature (FP32 + BF16) 1. import ipex 2. Enable Auto-Mix-Precision by API * Subject to change 3. Convert the input tensors to the extension device

4. Convert the model to the extension device
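A minimal sketch of these four steps, assuming an early IPEX 1.x release where the extension exposes the intel_pytorch_extension module, the ipex.DEVICE extension device and enable_auto_mixed_precision(); the exact names are subject to change, as noted above. The model and data are toy placeholders.

import torch
import intel_pytorch_extension as ipex                         # step 1: import ipex

ipex.enable_auto_mixed_precision(mixed_dtype=torch.bfloat16)   # step 2: enable auto mixed precision

model = torch.nn.Linear(64, 10)                                # toy model for illustration
data = torch.randn(8, 64)

data = data.to(ipex.DEVICE)                                    # step 3: move input tensors to the extension device
model = model.to(ipex.DEVICE)                                  # step 4: move the model to the extension device
output = model(data)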

PyTorch Demo

Product Overview
Intel® Distribution of OpenVINO™ toolkit

2019 Developer Tool of the Year, awarded by the Edge AI and Vision Alliance

Write once, deploy & scale diversely: CPU, GPU, FPGA, edge

Model Optimizer → Inference Engine

*Other names and brands may be claimed as the property of others.

From a bird's eye view
Advanced capabilities to streamline deep learning deployments

1. Build 2. Optimize 3. Deploy

1. Build: start from a trained model. The Open Model Zoo provides 100+ open sourced and optimized pre-trained models and 80+ supported public models.
2. Optimize: the Model Optimizer converts and optimizes the trained model from a supported framework into Intermediate Representation (IR) data (.xml, .bin); the Post-Training Optimization Tool and the Deep Learning Workbench refine it further.
3. Deploy: the Inference Engine (read, load, infer) is a common API that abstracts low-level programming for each hardware target through plugins: CPU Plugin, GPU Plugin, GNA Plugin, Myriad Plugin (for Intel® NCS2 & NCS), HDDL Plugin and FPGA Plugin. Deployment is supported by the Deep Learning Streamer, OpenCV, OpenCL™, the Deployment Manager, and Code Samples & Demos (e.g. Benchmark app, Accuracy Checker, Model Downloader).

Get Started
Typical workflow from development to deployment

Find a trained model (or train a model) → run the Model Optimizer → Intermediate Representation (.bin, .xml) → deploy using the Inference Engine

Supported Frameworks
Breadth of supported frameworks to enable developers with flexibility

Trained Model

(and other tools via ONNX* conversion)

Supported Frameworks and Formats: https://docs.openvinotoolkit.org/latest/_docs_IE_DG_Introduction.html#SupportedFW
Configure the Model Optimizer for your Framework: https://docs.openvinotoolkit.org/latest/_docs_MO_DG_prepare_model_Config_Model_Optimizer.html

Core Components
Model optimization to deployment

Model Optimizer
▪ A Python-based tool to import trained models and convert them to Intermediate Representation
▪ Optimizes for performance or space with conservative topology transformations
▪ Hardware-agnostic optimizations

Development Guide: https://docs.openvinotoolkit.org/latest/_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html

Inference Engine
▪ High-level C, C++ and Python inference runtime API
▪ Interface is implemented as dynamically loaded plugins for each hardware type
▪ Delivers best performance for each type without requiring users to implement and maintain multiple code pathways

Development Guide: https://docs.openvinotoolkit.org/latest/_docs_IE_DG_Deep_Learning_Inference_Engine_DevGuide.html
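For reference, a minimal sketch of the Inference Engine Python API as it looks in OpenVINO 2021.x; the model file names and the zero-filled input are illustrative:

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")   # read the IR produced by the Model Optimizer
exec_net = ie.load_network(network=net, device_name="CPU")      # load onto a device plugin
input_name = next(iter(net.input_info))
dummy = np.zeros(net.input_info[input_name].input_data.shape, dtype=np.float32)
result = exec_net.infer({input_name: dummy})                    # run inference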

Model Optimization
Breadth of supported frameworks to enable developers with flexibility

Model Optimizer loads a model into memory, reads it, builds the internal representation of the model, optimizes it, and produces the Intermediate Representation (.xml, .bin).

Optimization techniques available are:
— Linear operation fusing
— Stride optimizations
— Group convolutions fusing

.xml – describes the network topology
.bin – contains the weights and biases binary data

Note: Except for ONNX (.onnx) models, all models have to be converted to an IR format to be used as input to the Inference Engine.

Inference Engine
Common high-level inference runtime for cross-platform flexibility

Applications sit on top of the Inference Engine runtime and the Inference Engine (Common API), with an optional (but recommended) multi-device plugin for full system utilization.

Plugin architecture:
• CPU plugin (mkl-dnn & oneDNN) → intrinsics
• GPU plugin (clDNN & oneDNN) → OpenCL™
• GNA plugin → GNA API
• Myriad & HDDL plugins → Movidius API
• FPGA plugin → DLA

Post-Training Optimization Tool
Conversion technique that reduces model size into low-precision without re-training

Inputs: a model trained using one of the supported frameworks, plus a dataset and annotations (JSON).
Reduces model size while also improving latency, with little degradation in model accuracy and without model re-training.
Flow: the Model Optimizer converts and optimizes the trained model into a full-precision IR; the Post-Training Optimization Tool, together with the Accuracy Checker, performs statistics, accuracy and performance checks and produces an optimized IR for the Inference Engine.
Different optimization approaches are supported: quantization algorithms, sparsity, etc.
It is a conversion technique to quantize models to low precision for high performance, taking the environment (hardware) specifications into account.

Deep Learning Workbench
Web-based UI extension tool for model analyses and graphical measurements

▪ Visualizes performance data for topologies and layers to aid in model analysis

▪ Automates analysis for optimal performance configuration (streams, batches, latency)

▪ Experiment with INT8 or Winograd calibration for optimal tuning using the Post Training Optimization Tool

▪ Provides accuracy information through the Accuracy Checker

▪ Direct access to models from public set of Open Model Zoo

▪ Enables remote profiling, allowing the collection of performance data from multiple different machines without any additional set-up.

Compounding Effect of Hardware and Software
Use Intel® Xe Graphics + CPU combined for maximum inferencing

Tiger Lake + Intel® Distribution of OpenVINO™ toolkit vs Coffee Lake CPU

[Chart] Relative inference FPS on 11th Gen Intel® Core™ (Tiger Lake) Core i5-1145G7 vs. Coffee Lake Core i5-8500, for deeplabv3-TF, mobilenet-ssd-CF, resnet-50-TF, ssd300-CF and squeezenet1.1-CF, measured with CPU only, GPU only, and GPU+CPU using the Multi-Device plugin. Speedups range from roughly 1.6X to 3.9X, with the GPU+CPU combination highest.
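A sketch of enabling GPU and CPU together through the Multi-Device plugin (OpenVINO 2021.x Python API; devices are listed in priority order, model file names are illustrative):

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="MULTI:GPU,CPU")  # both devices share the inference load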

The above is preliminary performance data based on pre-production components. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. See backup for configuration details.

Pre-Trained Models and Public Models
Open-sourced repository of pre-trained models and support for public models

100+ Pre-trained Models (common AI tasks): Object Detection, Object Recognition, Reidentification, Semantic Segmentation, Instance Segmentation, Human Pose Estimation, Image Processing, Text Detection, Text Recognition, Text Spotting, Action Recognition, Image Retrieval, Compressed Models, Question Answering

100+ Public Models (pre-optimized external models): Classification, Segmentation, Object Detection, Human Pose Estimation, Monocular Depth Estimation, Image Inpainting, Style Transfer, Action Recognition, Colorization

Use the free pre-trained models to speed up development and deployment. Take advantage of the Model Downloader and other automation tools to quickly get started. Iterate with the Accuracy Checker to validate the accuracy of your models.

IAGS Intel Architecture, Graphics, and Software 51 DEMO

Demos and Reference Implementations
Quickly get started with example demo applications and reference implementations

Take advantage of pre-built, open-sourced example implementations with step-by-step guidance and a list of required components:

Face Access Control - C++
Intruder Detector - C++
Machine Operator Monitor - C++
Machine Operator Monitor - Go
Motor Defect Detector - Python
Object Flaw Detector - C++
Object Size Detector - C++
Object Size Detector - Go
Parking Lot Counter - C++
Parking Lot Counter - Go
People Counter - C++
Restricted Zone Notifier - Go
Shopper Gaze Monitor - C++
Shopper Mood Monitor - Go
Store Aisle Monitor - C++
Store Traffic Monitor - C++
Store Traffic Monitor - Python

Intel Low Precision Optimization Tool Tutorial
The motivation for low precision

▪ Lower power
▪ Lower memory bandwidth
▪ Lower storage
▪ Higher performance
Important: acceptable accuracy loss
Image credits: see backup

The key term:
▪ Quantization

Quantization in a nutshell

[Figure] Floating point vs. integer: a 32-bit float such as 96.1924 is stored in four bytes, while its 8-bit integer counterpart, 96, needs only one.
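A minimal NumPy sketch of the idea (symmetric, per-tensor scaling; real tools such as LPOT choose scales per channel and calibrate them on data):

import numpy as np

def quantize(x, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for int8
    scale = np.abs(x).max() / qmax                 # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([96.1924, -3.5, 0.25], dtype=np.float32)
q, s = quantize(x)
print(q, dequantize(q, s))                          # values come back only approximately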

Challenge & Solution of Low Precision Optimization Tool (for Inferencing in Deep Learning)

▪ Low precision inference can speed up performance by reducing the compute, memory and storage requirements of the AI model.
▪ Intel provides solutions to address the associated challenges:

Challenge: Hardware support
Intel Solution: Intel® Deep Learning Boost, supported by the Second-Generation Intel® Xeon® Scalable Processors and later
How: VNNI intrinsics supporting INT8 MulAdd

Challenge: Complex to convert the FP32 model to an INT8/BF16 model
Intel Solution: Intel® Low Precision Optimization Tool (LPOT)
How: Unified quantization API

Challenge: Accuracy loss in converting to an INT8 model
Intel Solution: Intel® Low Precision Optimization Tool (LPOT)
How: Auto tuning

Product Definition
▪ Convert the FP32 model to an INT8/BF16 model and optimize the model at the same time.
▪ Support multiple Intel optimized DL frameworks (TensorFlow, PyTorch, MXNet) on both CPU and GPU.
▪ Support automatic accuracy-driven tuning, along with additional custom objectives like performance, model size, or memory footprint.
▪ Provide easy extension capability for new backends (e.g., PDPD, ONNX RT) and new tuning strategies/metrics (e.g., HAWQ from UCB).

IAGS Intel Architecture, Graphics, and Software 66 Tuning Zoo The followings are the models supported by Intel® Low Precision Optimization Tool for auto tuning. PyTorch Model Category

TensorFlow Model Category BERT-Large RTE Language Translation

ResNet50 V1 Image Recognition BERT-Large QNLI Language Translation

ResNet50 V1.5 Image Recognition BERT-Large CoLA Language Translation MxNet Model Category BERT-Base SST-2 Language Translation ResNet101 Image Recognition ResNet50 V1 Image Recognition BERT-Base RTE Language Translation Inception V1 Image Recognition BERT-Base STS-B Language Translation MobileNet V1 Image Recognition Inception V2 Image Recognition BERT-Base CoLA Language Translation Inception V3 Image Recognition MobileNet V2 Image Recognition BERT-Base MRPC Language Translation Inception V4 Image Recognition DLRM Recommendation SSD-ResNet50 Object Detection ResNetV2_50 Image Recognition BERT-Large MRPC Language Translation SqueezeNet V1 Image Recognition ResNetV2_101 Image Recognition ResNext101_32x8d Image Recognition

ResNetV2_152 Image Recognition BERT-Large SQUAD Language Translation ResNet18 Image Recognition ResNet50 V1.5 Image Recognition Inception ResNet V2 Image Recognition Inception V3 Image Recognition ResNet18 Image Recognition SSD ResNet50 V1 Object Detection Inception V3 Image Recognition Wide & Deep Recommendation YOLO V3 Object Detection VGG16 Image Recognition Peleenet Image Recognition VGG19 Image Recognition ResNest50 Image Recognition

Style_transfer Style Transfer SE_ResNext50_32x4d Image Recognition 67 IAGS Intel Architecture, Graphics, and Software ResNet50 V1.5 QAT Image Recognition ResNet18 QAT Image Recognition Auto-tuning Flow Optimal Solution

[Diagram] FP32 Model → Quantizer (with tunable quantization configurations) → low-precision model (which can be inspected) → Evaluator (accuracy metrics, performance, etc.) → Tuning Strategy picks the next config and the loop repeats until an optimal solution is found.

System Requirements

▪ Hardware

Intel® Low Precision Optimization Tool supports systems based on Intel 64 architecture or compatible processors.

The quantized model can be accelerated by Intel® Deep Learning Boost when running on Second-Generation Intel® Xeon® Scalable Processors and later:

Verified:

- Cascade Lake & Cooper Lake, with Intel DL Boost VNNI
- Skylake, with AVX-512 INT8

▪ OS: Linux
Verified: CentOS 7.3 & Ubuntu 18.04

▪ Software

The Intel® Low Precision Optimization Tool requires installing the Intel-optimized framework version of TensorFlow, PyTorch, or MXNet.

Verified Release | Installation Example
Intel Optimization for TensorFlow: v1.15 (up1), v2.1, v2.2, v2.3 | pip install intel-tensorflow==2.3.0
PyTorch: v1.5 | pip install torch==1.5.0+cpu******
MXNet: v1.6, v1.7 | pip install mxnet-mkl==1.6.0

Installation

▪ Install from the Intel AI Analytics Toolkit (Recommended)

source /opt/intel/oneapi/setvars.sh

conda activate tensorflow

cd /opt/intel/oneapi/iLiT/latest

sudo ./install_iLiT.sh

▪ Install from source

git clone https://github.com/intel/lpot.git
cd lpot

python setup.py install

▪ Install from binary

# install from pip

pip install lpot

# install from conda

conda install lpot -c intel -c conda-forge

For more detailed installation info, please refer to https://github.com/intel/lpot

Usage: Simple Python API + YAML config

LPOT is designed to reduce the user's workload while keeping flexibility.

Python API
• A simple API that is easy to integrate in the original training/inference script.

YAML
• Common functions are integrated and controlled by parameters;
• Templates are easy to refer to;
• Lots of advanced parameters provide powerful tuning capability.

Flow: FP32 model + launcher code based on the API (added to the training/inference script) + YAML config file (template-based) + dataset and callback functions → INT8/BF16 model.
Coding-free (~80%): template-based YAML configs. Coding-needed (~20%): user-provided callback functions.

Python API

▪ Core user-facing API:
❑ Quantization(): follows a specified tuning strategy to tune a low-precision model through QAT or PTQ so that it meets a pre-defined accuracy goal and objective. (A usage sketch follows below.)
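A usage sketch of this API, assuming LPOT v1.x where lpot.Quantization is constructed from a YAML file and called with the FP32 model plus calibration/evaluation dataloaders; the model and dataloader names are illustrative placeholders.

from lpot import Quantization

quantizer = Quantization("./conf.yaml")          # tuning strategy, accuracy goal and objective come from the YAML
quantized_model = quantizer(
    fp32_model,                                   # e.g. a tf.Graph or torch.nn.Module (placeholder)
    q_dataloader=calib_dataloader,                # calibration data (placeholder)
    eval_dataloader=eval_dataloader)              # used to evaluate accuracy during tuning (placeholder)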

Intel LPOT YAML Configuration

Intel LPOT YAML config consists of 6 building blocks:

❑ model

❑ device

❑ quantization

❑ evaluation

❑ tuning

IAGS Intel Architecture, Graphics, and Software 73 Intermediate: TensorFlow HelloWorld model: name: hello_world framework: tensorflow inputs: input outputs: output

quantization:
  calibration:
    sampling_size: 5, 10
  model_wise:
    activation:
      algorithm: minmax

evaluation:
  accuracy:
    metric:
      topk: 1

tuning:
  accuracy_criterion:
    relative: 0.05
  exit_policy:
    timeout: 0
  random_seed: 100

Note: there are no dataloader-related settings here; the dataloaders are implemented in code.
❑ This example shows how to create the LPOT calibration and evaluation dataloaders in code and pass them to LPOT for tuning.
Full example: https://github.com/intel/lpot/tree/master/examples/helloworld

IAGS Intel Architecture, Graphics, and Software 74 DEMO

Which Toolkit should I use

Key Value Proposition

Intel® AI Analytics Toolkit:
• Provides performance and easy integration across the end-to-end data science pipeline for efficient AI model development
• Maximum compatibility with open source frameworks and libraries, with drop-in acceleration that requires minimal to no code changes
• Audience: Data Scientists; AI Researchers; DL/ML Developers

OpenVINO™ Toolkit:
• Provides leading performance and efficiency for DL inference solutions to deploy across any Intel HW (cloud to edge)
• Optimized package size for deployment based on memory requirements
• Audience: AI Application Developers; Media and Vision Developers

Use Cases

Intel® AI Analytics Toolkit:
• Data ingestion, data pre-processing, ETL operations
• Model training and inference
• Scaling to multi-core / multi-node / clusters

OpenVINO™ Toolkit:
• Inference apps for vision, speech, text, NLP
• Media streaming / encode, decode
• Scale across HW architectures: edge, cloud, datacenter, device

HW Support

Intel® AI Analytics Toolkit:
• CPUs - Datacenter and server segments: Xeons, workstations
• GPU - ATS and PVC (in future)

OpenVINO™ Toolkit:
• CPU - Xeons, client CPUs and Atom processors
• GPU - Gen Graphics; DG1 (current), ATS, PVC (in future)
• VPU - NCS & Vision Accelerator Design Products
• FPGA
• GNA

Low Precision Support

Use the Intel® Low Precision Optimization Tool when using the AI Analytics Toolkit:
• Supports BF16 for training and FP16, Int8 and BF16 for inference
• Seamlessly integrates with Intel optimized frameworks
• Available in the AI toolkit and independently

Use the Post-Training Optimization Tool when using OpenVINO:
• Supports FP16, Int8 and BF16 for inference
• Directly works with the Intermediate Representation format
• Available in the Intel Distribution of OpenVINO toolkit
• Provides a training extension via NNCF for PyTorch with FP16, Int8

Exception: If a model is not supported by OpenVINO™ toolkit for Inference deployment, build custom layers for OV or fall back to the AI Analytics Toolkit and use optimized DL frameworks for inference.

AI Development Workflow
(Data Scientists, AI Researchers, ML/DL Developers; native code and framework developers; AI, media and computer vision application developers)

• Data ingestion, analytics and pre-processing: use the AI Kit (Modin, OmniSci, Pandas, NumPy, SciPy)
• Classical ML training and prediction: use the AI Kit (Scikit-learn + daal4py, XGBoost)
• Optimized primitives for DL frameworks: use the Base Kit (oneDNN, oneCCL)
• Deep learning: determine the use case, then pick a pre-trained model (Open Model Zoo, or the Model Zoo for IA) or train/re-train a model on custom data with the AI Kit (Intel-optimized TF, PyTorch) on Intel CPU or dGPU; run DL inference with the AI Kit, optionally further optimized with the Low Precision Optimization Tool; or convert the trained model to IR format and deploy with the OpenVINO™ Toolkit (Model Optimizer, Inference Engine, Intel® Deployment Manager, etc.) on CPU, GPU, VPU, FPGA or GNA platforms.

Alternately, there are options to directly download any of the Intel optimized FWKs, ML libraries & tools independently. Our recommendation is to get them through the toolkit for seamless interoperability and a good out-of-box experience.
Public models trained with any FWK – TF, Caffe, ONNX, MXNet, etc. – are supported.

AI Model Deployment Workflow
(Data Scientists, AI Researchers, ML/DL Developers use the AI Kit; AI, media and computer vision application developers use the OpenVINO™ Toolkit)

0. Plan: determine the model type and framework needed for your usage.
1. Build: find a model (e.g. from the Open Model Zoo), bring a DL model, or train one (1a).
3. Optimize: run the Model Optimizer; if the model does not convert, modify the model or use custom layers (3a).
4. Test: integrate the model with the Inference Engine and benchmark; if speed or accuracy is not acceptable, apply advanced tuning (4a): quantize with POT (or use the DL Workbench GUI), change model attributes, use the multi-device plugin, re-train the model, or use the DevCloud to find the right hardware.
5. Package: use the Deployment Manager once the model is fast and accurate enough.
6. Integrate: integrate the model with the application.
7. Deploy: deploy the app to the target hardware.

Legend: MO = Model Optimizer; OMZ = Open Model Zoo; DL = Deep Learning; IR = Intermediate Representation; IE = Inference Engine; HW = Hardware; POT = Post-Training Optimization Tool.

A comprehensive workflow to optimize your DL model for the Intel Hardware that will be used for running inference

IAGS Intel Architecture, Graphics, and Software 86 ▪ 1) We run the demo on DC • TF demo • PyTorch demo • future: (ATS demo) ▪ Slide on how to access DevCloud ▪ 2) What’s behind the DC

Intel DevCloud: Getting started with oneAPI
Objectives of the External DevCloud Strategy

1. Demonstrate the promise of oneAPI.
2. Provide developers easy access to the oneAPI hardware & software environment.
3. Get high value feedback on oneAPI tools, libraries, language.
4. Seed research, papers, curriculum, lighthouse apps (the metrics output).
5. Support two tracks with a web front end for a consistent experience:
  • oneAPI production hardware/software
  • NDA SDP hardware/software

Network Arch

DevCloud Architecture (oneAPI; new additions under NDA)

• Access: ssh / Jupyter / browser; user storage disks (TB1 to TB5)
• AI DevCloud: Xeon SKL/CLX cluster with a login node, CLX/SKL nodes, FPGA/AEP nodes and Arria 10 nodes
• GPU nodes: Gen9, DG1, ATS and PVC
• oneAPI programming (direct programming): DPC++, MKL, TBB, Media SDK, MKL-DNN, Parallel STL, DAAL, MLSL, toolkits
• Intel-optimized framework containers: TensorFlow, Caffe + OpenVINO/DLDT, MXNet, Python/PyTorch
(Some components are available today, others TBD.)

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Optimization Notice 1 Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804.

2 Software and workloads used in performance tests may have been optimized for performance only on microprocessors from Intel. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. Consult other information and performance tests while evaluating potential purchases, including performance when combined with other products. For more information, see Performance Benchmark Test Disclosure. Source: Intel measurements, as of June 2017.

Notices and Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at www.intel.com.
Intel, the Intel logo, Xeon™, Arria™ and Movidius™ are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others. © Intel Corporation.

Slide Reference: 1 | 2 | 3
System Board: Intel® Server S2600 (dual socket) | Supermicro / X11SPL-F | Supermicro / X11SPL-F
Product: Xeon Silver 4216 | Intel(R) Xeon(R) Silver 4112 | Intel(R) Xeon(R) Silver 4112
CPU sockets: 2 | - | 1
Physical cores: 2 x 16 | 4 | 4
Processor Base Frequency: 2.10 GHz | 2.60 GHz | 2.60 GHz
HyperThreading: enabled | - | enabled
Turbo: On | - | On
Power-Performance Mode: Performance Mode | - | -
Total System Memory size: 12 x 64 GB | 16384 | 16384
Memory speed: 2400 MHz | 2400 MHz | 2400 MHz
Software OS: Ubuntu 18.04 | Ubuntu 16.04.3 LTS | Ubuntu 16.04.6 LTS
Software Kernel: 4.15.0-66-generic x86_64 | 4.13.0-36-generic | 4.15.0-29-generic
Test Date: 27 September 2019 | 25 May 2018 | 18 April 2019
Precision: Int8 (Throughput Mode) | FP32 (Throughput Mode) | Int8 (Throughput Mode)
Power (TDP): 200W | 85W | 85W
Price (link on 30 Sep 2019; prices may vary): $2,024 | $483 | $483
Network: Mobilenet SSD | Mobilenet SSD | Mobilenet SSD

Notices and Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at www.intel.com.
Intel, the Intel logo, Xeon™, Arria™ and Movidius™ are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others. © Intel Corporation.

System Board: Intel prototype, TGL U DDR4 SODIMM RVP | ASUSTeK COMPUTER INC. / PRIME Z370-A
CPU: 11th Gen Intel® Core™ i5-1145G7E @ 2.6 GHz | 8th Gen Intel® Core™ i5-8500T @ 3.0 GHz
Sockets / Physical cores: 1 / 4 | 1 / 6
HyperThreading / Turbo Setting: Enabled / On | NA / On
Memory: 2 x 8198 MB 3200 MT/s DDR4 | 2 x 16384 MB 2667 MT/s DDR4
OS: Ubuntu* 18.04 LTS | Ubuntu* 18.04 LTS
Kernel: 5.8.0-050800-generic | 5.3.0-24-generic
Software: Intel® Distribution of OpenVINO™ toolkit 2021.1.075 | Intel® Distribution of OpenVINO™ toolkit 2021.1.075
BIOS: Intel TGLIFUI1.R00.3243.A04.2006302148 | AMI, version 2401
BIOS release date: 06/30/2021 | 7/12/2019
BIOS Setting: Load default settings | Load default settings, set XMP to 2667
Test Date: 9/9/2021 | 9/9/2021
Precision and Batch Size: CPU: INT8, GPU: FP16-INT8, batch size: 1 | CPU: INT8, GPU: FP16-INT8, batch size: 1
Number of Inference Requests: 4 | 6
Number of Execution Streams: 4 | 6
Power (TDP Link): 28 W | 35 W
Price (USD, link on Sep 22, 2021; prices may vary): $309 | $192

1): Memory is installed such that all primary memory slots are populated.
2): Testing by Intel as of September 9, 2021.
