Xilinx ML Suite Overview
Jim Heaton, Sr. FAE

Deep Learning is Re-defining Many Applications
Deep learning is the study of algorithms that can learn from, and make predictions on, data. Applications include cloud acceleration, ecommerce, security, financial, social, industrial, medical, autonomous vehicles, surveillance, IoT and bioinformatics.

Accelerating AI Inference into Your Cloud Applications
• Deep learning applications featuring the most powerful FPGA in the cloud: Virtex® UltraScale+™ VU9P
• Also targets Zynq® UltraScale+™ MPSoC
• Deployment in the cloud, on premises and at the edge

Want to try out the Xilinx ML Suite?
https://github.com/Xilinx/ml-suite

Xilinx ML Suite on the AWS Marketplace
https://aws.amazon.com/marketplace/pp/B077FM2JNS?qid=1544477354556&sr=0-2&ref_=srh_res_product_title
• Supported frameworks: Caffe, MxNet, TensorFlow, Keras, Darknet (with Python support)
• xfDNN tools: Compiler, Quantizer
• Jupyter notebooks available:
‒ Image Classification with Caffe
‒ Using the xfDNN Compiler with a Caffe Model
‒ Using the xfDNN Quantizer with a Caffe Model
• Pre-trained models:
‒ Caffe, 8/16-bit: GoogLeNet v1, ResNet50, Flowers102, Places365
‒ Python, 8/16-bit: YOLOv2
‒ MxNet, 8/16-bit: GoogLeNet v1

Unified, Simple User Experience from Cloud to Alveo
• Develop: user guides, tutorials and examples
• Publish: xfDNN apps and xDNN binaries
• Deploy: launch a cloud instance, or download to the platform of your choice

Deep Learning Models
• Multi-Layer Perceptron: classification, universal function approximation, autoencoders
• Convolutional Neural Network: feature extraction, object detection, image segmentation
• Recurrent Neural Network: sequence and temporal data, speech-to-text, language translation

Overlay Architecture: Custom Processors Exploiting Xilinx FPGA Flexibility
Customized overlays with an ISA-based architecture allow optimized implementations with easy plug-and-play into the software stack (a toy illustration of the idea appears below):
• MLP Engine: configurable, scalable sparse and dense implementation
• xDNN: CNN engine for large 16 nm Xilinx devices, available on AWS
• DeePhi ESE: LSTM speech-to-text engine
• DeePhi DPU: flexible CNN engine with an embedded focus
• Random Forest: configurable RF classification implementation

Rapid Feature and Performance Improvement
• xDNN-v1 (Q4CY17)
‒ 500 MHz
‒ URAM for feature maps, without caching
‒ Array of accumulators
‒ 16-bit (batch 1) and 8-bit (batch 2) precision
‒ Instructions: Convolution, ReLU, MaxPool, AveragePool, Elementwise
‒ Flexible (square) kernel sizes and strides
‒ Programmable scaling
• xDNN-v2 (Q2CY18)
‒ 500 MHz; all xDNN-v1 features
‒ DDR caching for larger images and CNN networks
‒ New instructions: depth-wise convolution, deconvolution, convolution transpose, upsampling
‒ Rectangular kernels
‒ Non-blocking caching and pooling
• xDNN-v3 (Q4CY18)
‒ 700 MHz; feature-compatible with xDNN-v2
‒ New systolic array implementation: 50% higher FMAX and 2.2x lower latency
‒ Batch of 1 for the 8-bit implementation

Break Through on Peak Performance
• GPU approach: introduce new architectures and silicon
• Xilinx approach: adapt breakthroughs from emerging domain knowledge
[Charts: "GPU Deep Learning Peak Power Efficiency" and "FPGA Deep Learning Peak Power Efficiency", GOPS/W by year. GPU, 2012-2021: Kepler, Maxwell, Pascal (native INT8 operation), Volta (mixed FP16 for TensorCore). FPGA, 2014-2021: UltraScale (INT8 optimization), UltraScale+ (adaptable precision), ACAP architectures.]
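The overlay concept above pairs fixed FPGA logic with a compiler-generated, tensor-level instruction stream. The toy interpreter below is a minimal sketch of that idea in plain Python/NumPy; the instruction names echo the xfDNN microcode shown later, but the operand layout and semantics here are invented for illustration and are not the real xDNN ISA.

```python
import numpy as np

def xn_conv(x, w, stride):
    """Naive VALID-padded 2D convolution over a (C, H, W) tensor."""
    co, ci, kh, kw = w.shape
    c, h, wd = x.shape
    oh, ow = (h - kh) // stride + 1, (wd - kw) // stride + 1
    y = np.zeros((co, oh, ow), dtype=x.dtype)
    for o in range(co):
        for i in range(oh):
            for j in range(ow):
                patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
                y[o, i, j] = np.sum(patch * w[o])
    return y

def xn_maxpool(x, k, stride):
    """Max pooling over a (C, H, W) tensor."""
    c, h, wd = x.shape
    oh, ow = (h - k) // stride + 1, (wd - k) // stride + 1
    y = np.zeros((c, oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            y[:, i, j] = x[:, i*stride:i*stride+k, j*stride:j*stride+k].max(axis=(1, 2))
    return y

# The "microcode": op names plus arguments, executed in order by the
# overlay, while the hardware kernels themselves never change.
program = [
    ("XNConv",    {"w": np.random.randn(8, 3, 3, 3).astype(np.float32), "stride": 2}),
    ("XNMaxPool", {"k": 2, "stride": 2}),
]
ops = {"XNConv": xn_conv, "XNMaxPool": xn_maxpool}

x = np.random.randn(3, 32, 32).astype(np.float32)
for name, args in program:
    x = ops[name](x, **args)
print(x.shape)  # (8, 7, 7)
```

Retargeting the same "hardware" (the two kernels) to a different network only requires emitting a different instruction list, which is what makes an overlay plug-and-play with the software stack.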
Seamless Deployment with Open Source Software
• From the community: the deep learning frameworks
• From Xilinx: xfDNN middleware, tools and runtime, and the xDNN processing engine

xfDNN Flow (https://github.com/Xilinx/ml-suite)
• Front end: CNTK, Caffe2, PyTorch, TensorFlow, MxNet and Caffe framework tensor graphs are converted via ONNX to a Xilinx tensor graph
• xfDNN tensor graph optimization
• xfDNN Compiler and xfDNN Compression take the model weights and a calibration set
• xfDNN Runtime (Python API) executes the result, dispatching CPU layers and FPGA layers

xfDNN Inference Toolbox
• Graph Compiler: Python tools to quickly compile networks from common frameworks (Caffe, MxNet and TensorFlow)
• Network Optimization: automatic network optimizations for lower latency by fusing layers and buffering in on-chip memory
• xfDNN Quantizer: quickly reduces the precision of trained models for deployment, maintaining 32-bit accuracy at 8 bits to within 2%

xfDNN Graph Compiler
Pass in a network and microcode for xDNN is produced:

```
0  XNConv conv1 7 2 16 26 2 1 1 0x1c0000 224 3 0x0 112 64
2  XNMaxPool pool1 3 2 0 0x0 112 64 0x1c0000 56
3  XNConv res2a_branch1 1 1 16 26 2 0 1 0x1c0000 56 64 0x0 56 256
4  XNConv res2a_branch2a 1 1 16 26 2 1 1 0x1c0000 56 64 0x230000 56 64
6  XNConv res2a_branch2b 3 1 16 26 2 1 1 0x230000 56 64 0x1c0000 56 64
8  XNConv res2a_branch2c 1 1 16 26 2 0 1 0x1c0000 56 64 0x230000 56 256
9  XNEltWise 1 0 1 0x0 0x230000 56 256 0x0
9  XNUpload 0x0 56 256
11 XNConv res2b_branch2a 1 1 16 26 2 1 1 0x0 56 256 0x1c0000 56 64
13 XNConv res2b_branch2b 3 1 16 26 2 1 1 0x1c0000 56 64 0x3f0000 56 64
15 XNConv res2b_branch2c 1 1 16 26 2 0 1 0x3f0000 56 64 0x1c0000 56 256
16 XNEltWise 1 0 1 0x0 0x1c0000 56 256 0x0
16 XNUpload 0x0 56 256
18 XNConv res2c_branch2a 1 1 16 26 2 1 1 0x0 56 256 0x3f0000 56 64
20 XNConv res2c_branch2b 3 1 16 26 2 1 1 0x3f0000 56 64 0x460000 56 64
22 XNConv res2c_branch2c 1 1 16 26 2 0 1 0x460000 56 64 0x1c0000 56 256
23 XNEltWise 1 0 1 0x0 0x1c0000 56 256 0x0
23 XNUpload 0x0 56 256
25 XNConv res3a_branch1 1 2 16 26 2 0 1 0x0 56 256 0x1c0000 28 512
26 XNConv res3a_branch2a 1 2 16 26 2 1 1 0x0 56 256 0x3f0000 28 128
28 XNConv res3a_branch2b 3 1 16 26 2 1 1 0x3f0000 28 128 0x428000 28 128
30 XNConv res3a_branch2c 1 1 16 26 2 0 1 0x428000 28 128 0x2a0000 28 512
31 XNEltWise 1 0 1 0x1c0000 0x2a0000 28 512 0x1c0000
31 XNUpload 0x1c0000 28 512
```

xfDNN Network Optimization
[Diagram: an unoptimized model whose separate ReLU, Bias and Conv layers are buffered through DDR, beside the same network after xfDNN has intelligently fused them into [ReLU+Bias+Conv] nodes, streaming-optimized and buffered in on-chip URAM]

xfDNN Network Deployment
• Fused layer optimizations
‒ The compiler can merge nodes: (Conv or EltWise)+ReLU, and Conv+BatchNorm (see the folding sketch below)
‒ The compiler can split nodes: Conv 1x1 stride 2 becomes MaxPool + Conv 1x1 stride 1
• On-chip buffering reduces latency and increases throughput
‒ xfDNN analyzes the network's memory needs and optimizes the scheduler for fused and "One Shot" deployment
• "One Shot" deploys the entire network to the FPGA
‒ Optimized for fast, low-latency inference
‒ The entire network, schedule and weights are loaded to the FPGA only once
[Diagram: each fused [ReLU+Bias+Conv] stage and pool executed as a hardware in-line pipeline from input to output]
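Of the node merges listed above, Conv+BatchNorm is easy to make concrete: at inference time the batch-norm statistics are constants, so they can be folded into the convolution's weights and bias and the BatchNorm node disappears. The NumPy sketch below shows the folding algebra; it illustrates the general technique, not xfDNN's actual implementation.

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return (w', b') with conv(x, w') + b' == batchnorm(conv(x, w) + b)."""
    scale = gamma / np.sqrt(var + eps)        # one factor per output channel
    w_fused = w * scale[:, None, None, None]  # rescale each output filter
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

# Numeric check on a 1x1 convolution (a per-pixel matrix multiply).
co, ci = 4, 3
w = np.random.randn(co, ci, 1, 1); b = np.random.randn(co)
gamma, beta = np.random.randn(co), np.random.randn(co)
mean, var = np.random.randn(co), np.random.rand(co) + 0.5

x = np.random.randn(ci)                       # one pixel, ci input channels
conv_out = w[:, :, 0, 0] @ x + b
bn_out = gamma * (conv_out - mean) / np.sqrt(var + 1e-5) + beta

w_f, b_f = fold_batchnorm(w, b, gamma, beta, mean, var)
assert np.allclose(w_f[:, :, 0, 0] @ x + b_f, bn_out)
```

The (Conv or EltWise)+ReLU merge is simpler still: the activation can be applied to the producer's output before write-back, so no separate pass over the feature map is needed.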
xfDNN Quantizer: FP to Fixed-Point Quantization
• Problem: nearly all trained models use 32-bit floating point, and the available Caffe and TensorFlow quantization tools take hours and produce inefficient models.
• Introducing the xfDNN Quantizer: a customer-friendly toolkit that automatically analyzes floating-point ranges layer by layer and produces the fixed-point encoding that loses the least information (a simplified sketch of the idea appears at the end of this section):
‒ Quantizes GoogLeNet in under a minute
‒ Quantized 8-bit fixed-point networks stay within 1-3% of the accuracy of the 32-bit floating-point networks
‒ An extensible toolkit for maximizing performance by searching for minimum viable bitwidths and pruning sparse networks
[Chart: "xfDNN Quantized Model Accuracy", comparing fp32 and 8-bit top-1 and top-5 accuracy for resnet-50, resnet-101, resnet-152 and googlenet-v1]

xfDNN Quantizer: Fast and Easy
1) Provide the FP32 network and model, e.g. prototxt and caffemodel
2) Provide a small sample set, 16 to 512 images; no labels required
3) Specify the desired precision: quantizes to 8 bits or fewer to match the Xilinx DSPs

Xilinx ML Processing Engine (xDNN)
Programmable feature set with tensor-level instructions, 700+ MHz DSP frequency (VU9P) and custom network acceleration. Supported operations (expanded set):
• Convolution / Deconvolution / Convolution Transpose: kernel sizes W 1-15, H 1-15; strides W 1,2,4,8, H 1,2,4,8; padding same or valid; dilation factors 1, 2, 4
• Activation: ReLU
• Bias: value per channel
• Scaling: scale and shift value per channel
• Max Pooling: kernel sizes W 1-15, H 1-15; strides W 1,2,4,8, H 1,2,4,8; padding same or valid
• Avg Pooling: kernel sizes W 1-15, H 1-15; strides W 1,2,4,8, H 1,2,4,8; padding same or valid
• Element-wise Add: width and height must match; depth can mismatch
• Memory support: on-chip buffering, DDR caching
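To make the quantizer's three-step flow concrete, here is a minimal sketch assuming a symmetric, power-of-two fixed-point encoding: run a few unlabeled calibration images through the network, record each layer's activation range, and pick the number of fractional bits that covers that range in 8 signed bits. The real xfDNN Quantizer's algorithm and API are not shown here; every name below is illustrative.

```python
import numpy as np

def calibrate_scale(activations, bits=8):
    """Choose fractional bits so the observed range fits in `bits` signed ints."""
    max_abs = max(np.abs(a).max() for a in activations)
    int_bits = int(np.ceil(np.log2(max_abs + 1e-12))) + 1  # +1 for the sign bit
    return bits - int_bits                                 # remaining fractional bits

def quantize(x, frac_bits, bits=8):
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x * 2.0 ** frac_bits), lo, hi).astype(np.int8)

def dequantize(q, frac_bits):
    return q.astype(np.float32) / 2.0 ** frac_bits

# In practice 16-512 calibration images are fed through the network;
# here we just mock one layer's activations.
calib = [np.random.randn(64, 56, 56).astype(np.float32) for _ in range(16)]
fb = calibrate_scale(calib)
x = calib[0]
err = np.abs(dequantize(quantize(x, fb), fb) - x).max()
print(f"fractional bits: {fb}, max abs error: {err:.4f}")
```

Because only activation ranges are measured, no labels are required, which is why a small unlabeled sample set is enough for calibration.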