Real-time AI Systems (Industry Perspective)
Fast Machine Learning Workshop 2019, Fermi National Accelerator Laboratory

Jason Vidmar [email protected] System Architect Xilinx Aerospace & Defense Team Sep 11, 2019

Image credit: Via Satellite

The Technology Conundrum ... and the Need for a New Compute Paradigm

˃ Amdahl's Law: maximum 4X scaling when only 25% of the application cannot be parallelized
˃ Processing architectures are not scaling
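For reference, the 4X ceiling is just Amdahl's Law evaluated at a 25% serial fraction; a short worked form (standard formula, added here for clarity):

```latex
% Amdahl's Law: S is the overall speedup when a fraction p of the work can be
% parallelized over N processors and the remaining (1 - p) must stay serial.
S(N) = \frac{1}{(1 - p) + p/N},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
% With 25% of the application serial (1 - p = 0.25), the speedup can never
% exceed 1 / 0.25 = 4x, no matter how many cores are added.
```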

[Chart: 40 Years of Processor Performance, relative to the VAX-11/780 (log scale, 1980–2015) – roughly 2X every 3.5 years in the CISC era and 2X every 1.5 years in the RISC era, slowing to roughly 2X every 6 years with the end of Dennard scaling and Amdahl's Law, and an uncertain trajectory ahead.]

Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018

The Rise of Domain-Specific Architectures

[Diagram: spectrum of processing architectures and the workloads they serve – safety processing and latency-critical workloads; irregular data types, instruction sets, and data operations]
• CPU (CISC → RISC → Multi-Core): complex algorithms, full Linux "services"
• Domain-Specific Architectures (DSA: TPU, DLA; FPGA, MPSoC, ACAP): whole-application parallelism; sensor fusion, pre-processing, data aggregation
• Fixed HW Accelerators (GPU, ASSP, ASIC): domain-specific parallelism (e.g., video, ML)

˃ Complex applications = multiple processing domains; a single architecture can't do it alone
˃ Speed of innovation is outpacing silicon cycles

See WP505, "Versal: The First Adaptive Compute Acceleration Platform," and WP506, "Xilinx AI Engines and Their Applications."

Disruptive Innovation Needed: Xilinx Versal
New classes of devices for new challenges (7nm)

[Diagram: device-category progression – FPGA → SoC → MPSoC → RFSoC → ACAP – with increasing software programmability. The ACAP adds an array of vector cores with local memories, optimized for AI inference and advanced signal-processing workloads.]

Real-time Machine Learning Considerations

Demand for High-throughput ... but also Low-latency

[Chart: AI/ML Inference – Projected Growth. TAM ($B) by year, 2016–2023, split into Training, Data Center Inference, and Edge Inference. Source: Barclays Research, Company Reports, May 2018.]

• Cloud Compute – Breakthrough AI Inference
• Networking – Multi-terabit Throughput
• 5G Wireless – Compute for Massive MIMO
• Edge Compute – AI Inference at Low Power

Example drivers: Satellite Communications, Phased-Array Radar, World-wide Internet Access, Enhanced Situational Awareness

Real-time ML: Data Types

IMAGES
• Satellite/UAV Imagery
• Thermal / IR
• SAR Imagery
Multi-spectral; hi-resolution (32k×32k or 16k×16k)

SIGNALS
• Cognitive Radio
• Cognitive Radar
• LiDAR
Audio, RF, LiDAR data; single- or 2-channel sensor data; up to 16-bit input precision

PACKETS
• Cyber Security
• Deep Packet Inspection
• Network Intrusion Detection
• Network Anomaly Detection
Network-layer protocol data

Real-time ML: Data/Sensor Processing Requirements

[Hyper-spectral] [Panchromatic] [Multi-spectral]

[Synthetic Aperture Radar (SAR)]

[LiDAR-enhanced Imagery] [RF IQ "images"]

"Seeing" with Sensors – Enhancing Situational Awareness
The World is Wider Than Red-Green-Blue

[EO/IR Sensors]

[RGB Sensors]

RGB image (3 channels): composite image for human interpretation. Multi-spectral image (frequently up to 8 channels) + PAN. Source: ArcGIS.

Real-time ML Use Case: Remote Sensing via Imaging Synthetic Aperture Radar (SAR)

Optical Imaging vs. SAR Imaging. Volcano in Kamchatka, Russia, Oct 5, 1994. Image credit: Michigan Tech Volcanology. Source: NASA ARSET, Basics of Synthetic Aperture Radar (SAR), Session 1/4. Link

Machine Learning Acceleration Solutions + Versal ACAP

Range of ML Inference Solutions on Xilinx Devices

• Xilinx accelerators (fabric-based): DPU, xDNN
• 3rd-party accelerators (fabric-based): Mipsology Zebra
• Direct-to-fabric, synthesis-based: HLS4ML – portable to a wide range of Xilinx programmable logic (PL); see the illustrative sketch below
• Versal ACAP (7nm): AI Engine 2D array, VLIW and SIMD architecture
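To make the "direct to fabric, synthesis-based" path concrete, here is a minimal HLS-style C++ sketch of a tiny fully-connected layer with ReLU. This is illustrative only, not actual HLS4ML output; the layer sizes, fixed-point format, and pragmas are assumptions.

```cpp
// Illustrative HLS-style C++ for a tiny fully-connected layer + ReLU, the kind
// of kernel that synthesis-based flows such as HLS4ML map onto programmable
// logic. Sizes, the fixed-point format, and the pragmas are assumptions.
#include <ap_fixed.h>

typedef ap_fixed<16, 6> data_t;          // 16-bit fixed point, 6 integer bits

const int N_IN  = 16;
const int N_OUT = 8;

void dense_relu(const data_t in[N_IN],
                const data_t weights[N_OUT][N_IN],
                const data_t bias[N_OUT],
                data_t out[N_OUT]) {
#pragma HLS ARRAY_PARTITION variable=weights complete dim=2
#pragma HLS ARRAY_PARTITION variable=in complete
    for (int o = 0; o < N_OUT; ++o) {
#pragma HLS PIPELINE II=1
        data_t acc = bias[o];
        for (int i = 0; i < N_IN; ++i) {
#pragma HLS UNROLL
            acc += in[i] * weights[o][i];               // MACs map onto DSP slices
        }
        out[o] = (acc > data_t(0)) ? acc : data_t(0);   // ReLU
    }
}
```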

Versal Adaptive Compute Acceleration Platform

• COMPUTE ACCELERATION: Scalar Engines (multi-core processing system), Adaptable Engines (programmable logic), Intelligent Engines (DSP, vector-based & fabric-based)
• ADAPTIVE: diverse workloads in milliseconds; future-proof for new algorithms
• PLATFORM: development tools, HW/SW libraries, run-time stack; SW-programmable silicon infrastructure
• Versal series: AI Core, AI Edge, AI RF, HBM, Premium, Prime

Enabling Data Scientists, SW Developers, HW Developers

ACAP: Memory That Scales With Compute

˃ The AI Engine array scales from 10s to 100s of tiles; fabric and memory resources scale accordingly
˃ Memory hierarchy (increasing bandwidth, decreasing density toward the compute):
• Local data memory in AI Engines / LUTRAM: distributed, low-latency (~1,000 Tb/s)
• Block RAM & UltraRAM: embedded configurable SRAM (~100 Tb/s)
• Accelerator RAM: 4 MB sharable across engines, for the most SWaP-constrained designs (AI Edge Series, new)
• HBM: in-package DRAM (~10 Tb/s)
• DDR external memory: DDR4-3200, LPDDR4-4266 (~1 Tb/s)
˃ Scalar Engines: Arm Cortex-A72 (cache), Arm Cortex-R5 (TCM, OCM); I/O: PCIe & CCIX, network cores, SerDes, DDR, Direct RF (AI RF Series)

AI Engine: Terminology

• AI Engine Array: the 2D array of AI Engine tiles in a Versal ACAP
• AI Engine Tile: one AI Engine plus its local data memory and interconnect
• AI Engine: a 1 GHz+ VLIW / SIMD, ISA-based vector processor

[Block diagram: each AI Engine contains a Scalar Unit (scalar ALU, scalar register file, non-linear functions), a Vector Unit (fixed-point vector unit, floating-point vector unit, vector register file) with AI and 5G vector extensions, an instruction fetch & decode unit, three AGUs with Load Unit A, Load Unit B and a Store Unit, a data mover, local memory, and memory and stream interfaces.]

AI Engine: Scalar Unit, Vector Unit, Load Units and Memory

• Scalar Unit: 32-bit scalar RISC processor (scalar ALU, scalar register file, non-linear functions)
• Vector Unit: 512-bit SIMD vector processor (fixed-point and floating-point vector units, vector register file)
• Load/Store: 3 AGUs, Load Unit A, Load Unit B, Store Unit; instruction fetch & decode unit
• Local, shareable memory: 32 KB local, 128 KB addressable; memory and stream interfaces

Highly parallel:
• Instruction parallelism (VLIW): 7+ operations per clock cycle (2 vector loads, 1 vector multiply, 1 store, 2 scalar ops, stream access)
• Data parallelism (SIMD): multiple vector lanes; vector datapath with 8/16/32-bit and SPFP operands
• Up to 128 MACs per clock cycle per core (INT8); up to 8 FLOPs per clock cycle (SPFP)

AI Inference Mapping on Versal™ ACAP
Program directly from high-level ML frameworks

A = activations, W = weights:
\[
\begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix}
\begin{pmatrix} W_{00} & W_{01} \\ W_{10} & W_{11} \end{pmatrix}
=
\begin{pmatrix} A_{00}W_{00} + A_{01}W_{10} & \cdots \\ A_{10}W_{00} + A_{11}W_{10} & \cdots \end{pmatrix}
\]
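As a plain-C++ reference for the activation-weight product above (scalar code, purely illustrative): the inner multiply-accumulate loop below is what the 512-bit vector datapath executes up to 128-wide per cycle for INT8 operands.

```cpp
#include <cstdint>

// Scalar reference for C = A x W with INT8 activations/weights and 32-bit
// accumulators -- the same MAC pattern the AI Engine vector unit performs
// 128 at a time per clock cycle. Dimensions are illustrative.
constexpr int M = 2, K = 2, N = 2;

void matmul_int8(const int8_t A[M][K], const int8_t W[K][N], int32_t C[M][N]) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0;                                 // wide accumulator
            for (int k = 0; k < K; ++k) {
                acc += int32_t(A[m][k]) * int32_t(W[k][n]);  // one MAC
            }
            C[m][n] = acc;                                   // e.g. C00 = A00*W00 + A01*W10
        }
    }
}
```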

[Diagram: inference mapped across the ACAP. Scalar Engines: dual-core Arm® Cortex™-A72, dual-core Arm Cortex-R5. Adaptable Engines (PL): weight buffer and activation buffer in URAM. Intelligent Engines: AI Engine array running convolution layers (4x8 and 8x4 tile groups with cascade), fully connected layers (4x4), max pool and ReLU, fed by streams. All connected over the network-on-chip to I/O and external memory (e.g., DDR).]

˃ Custom memory hierarchy
  Buffer on-chip vs. off-chip; reduce latency and power
˃ Stream multi-cast on AI interconnect
  Weights and activations read once: reduce memory bandwidth (see the sketch below)
˃ AI-optimized vector instructions (128 INT8 multiplies/cycle)
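A small C++ sketch of the "read once" idea, with a hypothetical local buffer type and illustrative layer sizes: the weight tile is loaded into on-chip memory a single time and reused for every activation vector streamed past it, so external-memory bandwidth is paid only once per tile.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// "Read once" weight reuse: the weight tile lives in an on-chip buffer (a plain
// struct standing in for URAM / AI Engine local memory) and is reused for every
// streamed activation vector. Sizes and names are illustrative.
constexpr int N_IN = 64, N_OUT = 32;

struct LocalWeightBuffer {                       // stand-in for an URAM buffer
    int8_t w[N_OUT][N_IN];
};

void fc_layer_stream(const LocalWeightBuffer& wb,                        // loaded once
                     const std::vector<std::array<int8_t, N_IN>>& acts,  // streamed in
                     std::vector<std::array<int32_t, N_OUT>>& outs) {
    outs.resize(acts.size());
    for (std::size_t t = 0; t < acts.size(); ++t) {    // activations stream past
        for (int o = 0; o < N_OUT; ++o) {
            int32_t acc = 0;
            for (int i = 0; i < N_IN; ++i)
                acc += int32_t(wb.w[o][i]) * int32_t(acts[t][i]);
            outs[t][o] = acc > 0 ? acc : 0;             // fused ReLU
        }
    }
}
```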

AI Engine Delivers High Compute Efficiency

˃ Adaptable, non-blocking interconnect
  Flexible data-movement architecture; avoids interconnect "bottlenecks"
˃ Adaptable memory hierarchy
  Local, distributed, shareable = extreme bandwidth; no cache misses or data replication; extends to PL memory (BRAM, URAM)
˃ Transfer data while the AI Engine computes
  Overlap compute and communication (see the ping-pong sketch below)

[Chart: vector processor efficiency vs. peak theoretical kernel performance – ML convolutions (block-based (32×64) × (64×32) matrix multiplication): 98%; 1024-pt FFT/iFFT: 95%; Volterra-based forward-path DPD: 80%.]
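The overlap of compute and communication is classic double buffering; a minimal ping-pong sketch in C++, where std::async stands in for the DMA/data-mover hardware (names and sizes are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Ping-pong (double) buffering: while the kernel computes on one buffer, the
// next block of input is copied into the other, so communication overlaps
// computation. std::async simulates the data mover; sizes are illustrative.
constexpr std::size_t BLK = 256;

std::future<void> start_transfer(int16_t* dst, const int16_t* src) {
    return std::async(std::launch::async,
                      [=] { std::copy(src, src + BLK, dst); });   // simulated DMA
}

int64_t compute_block(const int16_t* src) {                       // stand-in kernel
    int64_t acc = 0;
    for (std::size_t i = 0; i < BLK; ++i) acc += src[i];
    return acc;
}

int64_t process_stream(const std::vector<int16_t>& input) {
    if (input.size() < BLK) return 0;
    static int16_t buf[2][BLK];
    const std::size_t blocks = input.size() / BLK;
    int64_t total = 0;
    auto pending = start_transfer(buf[0], input.data());          // prime buffer 0
    for (std::size_t b = 0; b < blocks; ++b) {
        const int cur = b & 1;
        pending.wait();                                           // block b is ready
        if (b + 1 < blocks)                                       // prefetch block b+1
            pending = start_transfer(buf[cur ^ 1], input.data() + (b + 1) * BLK);
        total += compute_block(buf[cur]);                         // overlaps the copy
    }
    return total;
}
```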

Summary

˃ Silicon performance gains from node to node have tapered off
  Heterogeneous computing and domain-specific architectures are required
˃ Real-time machine learning often requires substantial sensor processing beforehand
  Power and security considerations are more prominent
˃ The Versal ACAP is a platform on which to develop SW-defined DSAs for real-time machine learning
  The AI Engine array delivers up to 8x silicon compute density at ~40% lower power vs. Xilinx's prior generations
  HLS4ML and fabric-based accelerators are supported in programmable logic for maximum flexibility
  Xilinx VC1902 with 400 AI Engines; first shipment June 2019

Visit https://www.xilinx.com/products/silicon-devices/acap/versal.html for datasheets, whitepapers, and product tables.

Adaptable. Intelligent.

THANK YOU!

Contact Info: Jason Vidmar, [email protected]

Appendix

AI Engine: Multi-Precision Math Support

Real data types – MACs / cycle (per core): 8x8 Real: 128; 16x8 Real: 64; 16x16 Real: 32; 32x16 Real: 16; 32x32 Real: 8; SPFP: 8

Complex data types – MACs / cycle (per core): 16 Complex x 16 Real: 16; 16x16 Complex: 8; 32x16 Complex: 4; 32x32 Complex: 2

Optimized for:
• Linear algebra: matrix-matrix mult, matrix-vector mult
• Convolution: FIR filters, 2-D filters
• Transforms: FFTs/iFFTs, DCT, etc.
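Back-of-the-envelope peak throughput implied by these rates (my arithmetic, not a figure from the deck), using the 400-AI-Engine VC1902 mentioned in the summary and assuming a nominal 1 GHz clock with 2 ops per MAC:

```latex
% Peak INT8 throughput: tiles x (MACs/cycle) x (ops/MAC) x clock
400 \times 128 \,\tfrac{\text{MACs}}{\text{cycle}} \times 2 \,\tfrac{\text{ops}}{\text{MAC}}
\times 1\,\text{GHz} \approx 1.0 \times 10^{14} \,\tfrac{\text{ops}}{\text{s}}
\approx 102\ \text{INT8 TOPS (peak, before efficiency losses)}
```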

Breakthrough Performance for Cloud, Network, and Edge

[Bar charts, Versal device vs. prior platform in each category:]
• Cloud Compute (Breakthrough AI Inference): >8X GoogleNet V1 img/sec (batch=1) vs. a high-end GPU
• Networking (Multi-terabit Throughput): 4X encrypted traffic (Gb/s) per chip vs. an UltraScale+ FPGA
• 5G Wireless (Compute for Massive MIMO): 5X Int16x16 DSP compute (TeraMACs/sec) vs. an UltraScale+ RFSoC
• Edge Compute (AI Inference at Low Power): 15X ResNet50 img/sec (<2 ms) vs. an UltraScale+ MPSoC