Real-Time AI Systems (Industry Perspective)
Fast Machine Learning Workshop 2019
Fermi National Accelerator Laboratory
Real-time AI Systems (Industry Perspective)
Fast Machine Learning Workshop 2019, Fermi National Accelerator Laboratory
Jason Vidmar ([email protected]), System Architect, Xilinx Aerospace & Defense Team
Sep 11, 2019
© Copyright 2019 Xilinx
Image credit: Via Satellite

The Technology Conundrum... and the Need for a New Compute Paradigm
• Processing architectures are no longer scaling. Per Amdahl's Law, an application in which 25% of the work cannot be parallelized is limited to a maximum speedup of 4x, no matter how many processors are added.
[Figure: "40 Years of Processor Performance" (performance relative to the VAX-11/780): roughly 2x every 3.5 years in the CISC era, 2x every 1.5 years in the RISC era, slowing to 2x every 3.5 years after the end of Dennard scaling and 2x every 6 years under Amdahl's Law. Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018.]

The Rise of Domain-Specific Architectures
[Figure: spectrum of processing architectures (CISC and RISC CPUs, multi-core, GPU/ASSP/ASIC fixed hardware accelerators for e.g. video and ML, domain-specific architectures such as TPU and DLA, and FPGA/MPSoC/ACAP devices) mapped against workload classes: safety processing, latency-critical workloads, irregular data types and instruction sets, whole-application parallelism, sensor fusion, pre-processing, data aggregation, complex algorithms, and full Linux "services".]
• Complex applications comprise multiple processing domains; a single architecture cannot serve them all.
• The speed of innovation is outpacing silicon design cycles.

Disruptive Innovation Needed: Xilinx Versal
• Versal is a new class of 7nm device for new challenges: the Adaptive Compute Acceleration Platform (ACAP).
• It extends the FPGA, SoC, MPSoC, and RFSoC device categories with greater software programmability and arrays of vector cores with local memory.
• Optimized for AI inference and advanced signal processing workloads.
• See WP505, "Versal: The First Adaptive Compute Acceleration Platform," and WP506, "Xilinx AI Engines and Their Applications."

Real-time Machine Learning Considerations

Demand for High-throughput... but also Low-latency
• The AI/ML inference TAM ($B) is projected to grow through 2023 across training, data center inference, and edge inference (Barclays Research, Company Reports, May 2018).
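The Amdahl's Law bound quoted on the earlier slide (a 4x ceiling when 25% of the application cannot be parallelized) is easy to check numerically. A minimal sketch; the function name and the sampled processor counts are illustrative, not from the talk:

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Amdahl's Law: overall speedup when only the parallel
    portion of a program benefits from extra workers."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# With 25% of the application serial, the speedup saturates
# near 4x no matter how many processors are added.
for n in (2, 8, 64, 1_000_000):
    print(n, round(amdahl_speedup(0.25, n), 2))
```

This is why the slide's processor-performance curve flattens: once per-core gains stall, the serial fraction dominates, motivating domain-specific architectures rather than more identical cores.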
• Cloud compute and networking: breakthrough AI inference and multi-terabit throughput.
• 5G wireless: compute for massive MIMO.
• Edge compute: AI inference at low power.
• Example applications: satellite communications for world-wide internet access; phased-array radar for enhanced situational awareness.

Real-time ML: Data Types
• Images: satellite/UAV imagery, thermal/IR, SAR imagery. Multi-spectral, high-resolution data (32k x 32k or 16k x 16k), up to 16-bit input precision.
• Signals: cognitive radio, cognitive radar, LiDAR. Audio, RF, and LiDAR data; single- or 2-channel sensor data.
• Packets: cyber security, deep packet inspection, network intrusion detection, network anomaly detection. Network-layer protocol data.

Real-time ML: Data/Sensor Processing Requirements
[Figure: example sensor modalities: hyper-spectral, panchromatic, and multi-spectral imagery; synthetic aperture radar (SAR); LiDAR-enhanced imagery; RF IQ "images".]

"Seeing" with Sensors – Enhancing Situational Awareness
• The world is wider than red-green-blue: RGB sensors yield 3-channel images, while EO/IR multi-spectral sensors frequently yield up to 8 channels plus a panchromatic (PAN) band, composited into an image for human interpretation. (Source: ArcGIS.)

Real-time ML Use Case: Remote Sensing via Imaging
• Synthetic Aperture Radar (SAR): optical imaging vs. SAR imaging of a volcano in Kamchatka, Russia, Oct 5, 1994. (Image credit: Michigan Tech Volcanology. Source: NASA ARSET, "Basics of Synthetic Aperture Radar (SAR)," Session 1/4.)
Machine Learning Acceleration Solutions + Versal ACAP

Range of ML Inference Solutions on Xilinx Devices
• Xilinx fabric-based accelerators: DPU, xDNN.
• 3rd-party fabric-based accelerators: Mipsology Zebra.
• Direct-to-fabric, synthesis-based flow: HLS4ML, portable to a wide range of Xilinx programmable logic (PL).
• Versal ACAP (7nm) AI Engine 2D array: VLIW and SIMD architecture.

Versal Adaptive Compute Acceleration Platform
• Versal (7nm) combines a multi-core processing system, programmable logic, and vector- and fabric-based DSP behind three engine types: Scalar Engines, Adaptable Engines, and Intelligent Engines.
• Platform: development tools, HW/SW libraries, and a run-time stack; adapts to diverse workloads in milliseconds; SW-programmable silicon infrastructure, future-proof for new algorithms.
• Device series: AI Core, AI Edge, AI RF, HBM, Premium, and Prime.
• Enables data scientists, SW developers, and HW developers.

ACAP: Memory Hierarchy That Scales With Compute
• AI Engine local data memory (~1,000 Tb/s): the array scales from 10s to 100s of tiles, and the distributed low-latency memory resources scale accordingly.
• Fabric and cache memory (~100 Tb/s): embedded configurable SRAM in the programmable logic (LUTRAM, Block RAM, UltraRAM) alongside the Arm Cortex-A72 caches and Cortex-R5 TCM/OCM.
• Accelerator RAM (new in the AI Edge series, targeting the most SWaP-constrained designs): 4 MB sharable across engines.
• HBM (~10 Tb/s): in-package DRAM (HBM series).
• External memory (~1 Tb/s): DDR4-3200 and LPDDR4-4266 via DDR controllers, plus PCIe, CCIX, SerDes, and Direct RF (AI RF series) interfaces.
• Descending the hierarchy trades density for bandwidth: bandwidth increases and density decreases as memory moves closer to the engines.
AI Engine: Terminology
• AI Engine array: a 2D array of AI Engine tiles on the Versal ACAP; each tile pairs an AI Engine with local memory and interconnect.
• AI Engine: a 1 GHz+ VLIW/SIMD vector processor. It contains a scalar unit (scalar ALU, non-linear functions, scalar register file), a vector unit (fixed-point and floating-point vector datapaths, vector register file), AI vector extensions and 5G vector extensions to the ISA, address generation units (AGUs), an instruction fetch and decode unit, two load units and a store unit, a data mover, and memory and stream interfaces.

AI Engine: Scalar Unit, Vector Unit, Load Units and Memory
• Scalar unit: 32-bit scalar RISC processor.
• Vector unit: vector processor with a 512-bit SIMD datapath.
• Memory: 32 KB local, shareable memory per tile; 128 KB addressable.
• Instruction parallelism (VLIW): 7+ operations per clock cycle (2 vector loads, 1 multiply, 1 store, 2 scalar ops, and stream access).
• Data parallelism (SIMD): multiple vector lanes over 8/16/32-bit and single-precision floating-point (SPFP) operands.
• Up to 128 MACs per clock cycle per core (INT8); up to 8 MACs per clock cycle (32-bit SPFP).

AI Inference Mapping on Versal ACAP
• Program directly from high-level ML frameworks.
• With A = activations and W = weights, convolution and fully-connected layers reduce to tiled matrix multiplication:

    [A00 A01]   [W00 W01]   [A00*W00 + A01*W10  ...]
    [A10 A11] x [W10 W11] = [A10*W00 + A11*W10  ...]

[Figure: layers mapped onto the AI Engine array in tile groupings (4x8, 4x4, 8x4) with cascade connections and max-pool/ReLU stages; weight and activation buffers held in URAM; activations and weights streamed to the tiles.]
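The 2x2 activation-weight product shown on this slide is ordinary matrix multiplication. A plain-Python sketch of the arithmetic only; it illustrates the mapping, and makes no attempt to model the actual AI Engine ISA or tiling:

```python
def matmul(A, W):
    """Naive matrix multiply: out[i][j] = sum over k of A[i][k] * W[k][j].
    On the AI Engine array, each multiply-accumulate in the inner loop
    corresponds to one lane of a SIMD MAC, with A and W tiles streamed
    from the activation and weight buffers."""
    rows, inner, cols = len(A), len(W), len(W[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += A[i][k] * W[k][j]
    return out

A = [[1, 2],
     [3, 4]]
W = [[5, 6],
     [7, 8]]
# First output element is A00*W00 + A01*W10, as on the slide.
print(matmul(A, W))  # -> [[19, 22], [43, 50]]
```

In hardware the same loop nest is blocked into tiles so each block fits in a tile's local memory, which is what makes the "read once" weight/activation strategy on the next slide possible.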
• The dual-core Arm Cortex-A72 and dual-core Cortex-R5 processors, programmable logic (PL), I/O, and external memory (e.g., DDR) connect to the array over the network-on-chip.
• Custom memory hierarchy: buffering on-chip vs. off-chip reduces latency and power.
• Stream multi-cast on the AI Engine interconnect: weights and activations are read once, reducing external memory bandwidth.
• AI-optimized vector instructions (128 INT8 multiplies per cycle).

AI Engine Delivers High Compute Efficiency
• Adaptable, non-blocking interconnect: a flexible data-movement architecture that avoids interconnect bottlenecks.
• Adaptable memory hierarchy: local, distributed, shareable memory delivers extreme bandwidth with no cache misses or data replication, and extends into PL memory (BRAM, URAM).
• Data transfer overlaps with AI Engine compute, so communication and computation proceed in parallel.
• Kernel efficiency of 80-98% of peak theoretical vector-processor performance across three kernels: block-based (32x64) x (64x32) matrix multiplication (ML convolutions), 1024-point FFT/iFFT, and Volterra-based forward-path DPD.

Summary
• Process gains from node to node have tapered off; heterogeneous computing and domain-specific architectures are required.
• Real-time machine learning often requires substantial sensor processing beforehand, and power and security considerations are increasingly prominent.
• The Versal ACAP is a platform for developing SW-defined DSAs for real-time machine learning: the AI Engine array delivers up to 8x silicon compute density at ~40% lower power vs. Xilinx's prior generations, while HLS4ML and fabric-based accelerators remain supported in programmable logic for maximum flexibility.
• The Xilinx VC1902, with 400 AI Engines, first shipped in June 2019.
• Visit https://www.xilinx.com/products/silicon-devices/acap/versal.html for datasheets, whitepapers, and product tables.

Adaptable. Intelligent.
THANK YOU!
Contact Info: Jason Vidmar,
[email protected]

Appendix

AI Engine: Multi-Precision Math Support
• Supports both real and complex data types; MACs per cycle (per core) vary with operand precision and type (the table lists figures of 128, 64, 16, and 8 MACs/cycle).
• Optimized for linear algebra (matrix-matrix and matrix-vector multiplication), convolution, and FIR filtering.
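As a rough illustration of what MACs-per-cycle figures mean at the device level, the sketch below converts them to peak throughput. The 1 GHz clock (the floor quoted for the AI Engine earlier) and the 400-engine count (the VC1902 figure from the summary) are the assumptions; the helper name and the 2-ops-per-MAC convention are mine, and the results are peak, not sustained:

```python
# Assumptions for illustration: 1.0 GHz clock, 400 AI Engines
# (as in the VC1902), and 2 ops (multiply + add) per MAC.
CLOCK_HZ = 1.0e9
ENGINES = 400
OPS_PER_MAC = 2

def peak_tops(macs_per_cycle_per_core):
    """Peak tera-ops/s for the whole array at the assumed clock."""
    return macs_per_cycle_per_core * CLOCK_HZ * ENGINES * OPS_PER_MAC / 1e12

# 128 INT8 MACs/cycle/core from the table above:
print(round(peak_tops(128), 1))  # -> 102.4

# A complex MAC costs roughly 4 real multiplies plus additions,
# which is one reason the complex-data MACs/cycle figures are a
# fraction of the real-data ones at the same precision.
```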