FPGA accelerated processing in the cloud
Cathal McCabe, Xilinx Ireland
New concepts in ultra fast data acquisition workshop

© Copyright 2018 Xilinx Overview

• Computing after Moore’s law
• Big data
• The rise of AI
• From cloud to edge and back

The legal speak

Moore’s Law – Number of transistors doubles every 1/1.5/2 years
Moore’s Second Law (Rock’s law) – Cost of fabs increases exponentially
Moore’s Law’s Law – Number of people predicting the end of Moore’s law doubles every year!
Dennard’s Law (Scaling) – As transistors get smaller their power density remains constant

*“Cramming More Components onto Integrated Circuits,” Electronics, pp. 114–117, April 19, 1965.

Transistors, Clock Speed, Power

Computing performance increase

Slowing to 3% per year

*John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018

Data scaling

The world around us hasn’t stopped scaling

*Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2015–2020 White Paper

Scientific data growth

*Analyzing and Querying big scientific data; Thomas Heinis, https://www.slideshare.net/eXascaleInfolab/braintalk-cuso-nm

Data centres

>7500 data centres worldwide, +21% per year in 2018

By 2020 1/3 of all data will pass through the cloud

Hohhot Data Center in China: 7,750,015 square feet

40% of total operating costs is energy

3% of global electricity production

http://worldstopdatacenters.com/china-mobile-hohhot-data-center/
http://www.ciena.com/insights/articles/Twelve-Mind-blowing-Data-Center-Facts-You-Need-to-Know.html

AI and

Dark data

Less than 1% of all data that is produced every day is mined for valuable information

“We cannot solve our problems with the same thinking we used when we created them.” – Albert Einstein

http://research.ibm.com/articles/datacentricdesign/

Google Translate: Significant Quality Upgrade through Machine Learning

Machine Learning

“Machine learning can solve fundamental business problems that are really hard to create hardwired solutions to” Danny Lange

“Alexa will drive $10 billion in sales by 2020” Mark Mahaney (RBC Capital)

AI helped Bing double market share ($4 billion annually)

“Twitter taught ’s AI chatbot to be a racist … in less than a day”

“Twitter is going to use artificial intelligence and machine learning to categorize every single tweet” Jack Dorsey (Twitter CEO)

“AI saves $1 Billion on content every year”

AI in big science

“It took us several years to convince people that this is not just some magic, hocus-pocus, black box stuff,” Boaz Klima, (Fermilab)

Inter-experimental LHC Machine Learning (IML) Working Group @ CERN

Luengo I, et al. “SuRVoS: Super-Region Volume Segmentation workbench.” Journal of Structural Biology (2017). DOI:10.1016/j.jsb.2017.02.007
A Perspective on Deep Imaging, Ge Wang

Latest Research: Increasing Accuracy of Reduced Precision CNNs & BNNs

BNNs are improving rapidly
Near consensus that inference will be very low precision
– Image / CNN: 2-bit (binary)
– Speech / RNN: 3-bit (ternary)

[Figure: Top-5 Error (ImageNet), 0 to 60, plotted against publication date from 2009 to 2017, for BNN, CNN and Reduced Precision networks]
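As a rough illustration of why very low precision suits hardware, the sketch below binarizes and ternarizes a weight vector and then computes a binary dot product using only XNOR and a bit count. It is not taken from the slides or from any Xilinx library; the function names, the threshold value and the sample numbers are all assumptions made for this example.

```python
# Illustrative sketch only: names and sample values are hypothetical.

def binarize(values):
    """Map each real value to +1 or -1 (zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]

def ternarize(values, threshold):
    """Map each real value to -1, 0 or +1; values with |v| <= threshold become 0."""
    return [0 if abs(v) <= threshold else (1 if v > 0 else -1) for v in values]

def pack_bits(signs):
    """Pack a +1/-1 vector into one integer, one bit per element (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Each matching bit contributes +1 and each differing bit contributes -1,
    so the result is 2 * (number of matches) - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n

weights = [0.7, -0.2, 0.1, -0.9]
activations = [0.3, 0.4, -0.5, -0.1]

w_signs = binarize(weights)
a_signs = binarize(activations)

# Reference result using ordinary multiply-accumulate on the +1/-1 values.
reference = sum(w * a for w, a in zip(w_signs, a_signs))

# Same result using only bitwise operations on packed words.
fast = xnor_popcount_dot(pack_bits(w_signs), pack_bits(a_signs), len(weights))

print(reference, fast)
print(ternarize(weights, 0.25))
```

The bitwise version needs no multipliers, which is the kind of custom data width the later slides describe as an FPGA strength.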

Xilinx technology

FPGA Architecture Overview

Adaptable Architecture Advantage

Algorithm to implement
– Control / Dataflow graph
[Diagram blocks: Ethernet, Video decode, DNN, Video encode, HDMI]

CPU implementation
– Sequential (Von Neumann) execution (SIMD for GPU)
– Memory access bottleneck
– Fixed data size (e.g. 64 bits)
– Poor decision handling (breaks ALU pipeline)
[Diagram: CPU with a memory holding instructions & data and an ALU; steps read, decode, execute, write]

FPGA implementation
– Custom dataflow / pipeline / decision handling / widths
– Custom memory hierarchy
– Energy efficient computation
[Diagram: optimized hardware on the FPGA with custom memory; blocks Ethernet, Video decode, DNN, Video encode, HDMI]
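The contrast between sequential execution and a custom dataflow pipeline can be sketched with a simple timing model. This is an idealised illustration, not a measurement: the stage names follow the diagram, but the per-stage latencies are made-up numbers, and the model assumes identical items, free buffering between stages and no memory stalls.

```python
# Idealised timing model; stage latencies are hypothetical values in arbitrary time units.
STAGE_LATENCY = {"video decode": 3, "DNN": 5, "video encode": 4}

def sequential_time(latencies, n_items):
    """One item is processed through every stage before the next item starts."""
    return n_items * sum(latencies)

def pipelined_time(latencies, n_items):
    """All stages work concurrently on different items.

    The first item takes the full pipeline latency; after that, one item
    completes every max(latencies) time units (the slowest stage sets the pace).
    """
    if n_items == 0:
        return 0
    return sum(latencies) + (n_items - 1) * max(latencies)

latencies = list(STAGE_LATENCY.values())
frames = 100

seq = sequential_time(latencies, frames)
pipe = pipelined_time(latencies, frames)
print(f"sequential: {seq}, pipelined: {pipe}, speed-up: {seq / pipe:.2f}x")
```

In this model the speed-up approaches the ratio of total latency to the slowest stage, which is why balancing stage widths and latencies matters in a custom pipeline.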

FPGA evolution

1984: XC2064, 2.5 micron, 85,000 transistors
2018: Everest, 7nm, 50,000,000,000+ transistors

Maximum Density of Xilinx FPGA by node
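As a quick check of how the two devices on this slide relate to the doubling periods quoted for Moore’s law, the sketch below computes the implied doubling period from the transistor counts. The function name is hypothetical, and because the 2018 figure is given as a lower bound (50,000,000,000+), the result is an upper bound on the period.

```python
import math

def doubling_period_years(count_start, count_end, years):
    """Years per doubling implied by growth from count_start to count_end."""
    doublings = math.log2(count_end / count_start)
    return years / doublings

# XC2064 (1984, 85,000 transistors) to Everest (2018, 50,000,000,000+ transistors).
period = doubling_period_years(85_000, 50_000_000_000, 2018 - 1984)
print(f"about {period:.2f} years per doubling")
```

The result falls between the 1.5-year and 2-year variants of Moore’s law listed earlier in the talk.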

FPGA Scaling

[Charts: Maximum Xilinx SerDes Rate per pin; Aggregate Xilinx SerDes Bandwidth; High End PC per pin DDR rate; Aggregate Xilinx FPGA Memory BW per node]

The FPGA, an engine for innovation in silicon and packaging technology; Liam Madden, FPL 2014

Xilinx Product Families: A Broad Portfolio

Page 22 © Copyright 2018 Xilinx UltraScale+™ Capabilities New at 16nm New at 16nm FinFET Perf/Watt SmartConnect Optimized Process & Addressing IP & Fabric Voltage Scaling Interconnect bottlenecks

Enhanced at 16nm Enhanced at 16nm New at 16nm (58G) Security & Reliability Transceivers Decrypt/Auth/Anti-Tamper, 16G, 28G, 32G, & 58G Improved SEU Performance Fractional PLL

Enhanced at 16nm New at 16nm (HBM) Enhanced at 16nm 3rd Gen 3D IC & HBM Networking IP 100G EMAC w/RS-FEC Greater Inter-Die FMAX HBM for 20X bandwidth 150G Interlaken w/300GLL

Enhanced at 16nm Enhanced at 16nm DSP External Memory 2400 Mb/s (20nm) Floating/Fixed Pt Enhanced 2,666 Mb/s (16nm) 2.5X Bandwidth (vs. 28nm)

Enhanced at 16nm Enhanced at 16nm Block RAM PCI Express® Hardened Cascading Gen3 x16 Power-Optimized Silicon Gen4 x8

New at 16nm New at 16nm New at 16nm High Density I/O Packaging UltraRAM Power-Optimized I/O For Signal Integrity, Massive Capacity MIPI D-PHY Support PCB Area, Thermal SRAM Replacement

Page 23 © Copyright 2018 Xilinx UltraScale Re-Architects the Core Highest Utilization at Maximum Performance

Page 24 © Copyright 2018 Xilinx SSI Harnesses Proven Technology in a Unique Way

[Diagram labels: four 28nm FPGA Slices, Microbumps, Silicon Interposer, Through-Silicon Vias, C4 Bumps, Package Substrate, BGA Balls]

Side-by-Side Die Layout
– Minimal heat flux issues
– Minimal design tool flow impact

Passive Silicon Interposer

Microbumps
– Access to power / ground / IOs
– Access to logic regions
– Leverages ubiquitous image sensor micro-bump technology

Through-silicon Vias (TSV)
– Bridge power / ground / IOs to C4 bumps
– Coarse pitch, low density aids manufacturability
– Etch process (not laser drilled)

Introducing Virtex UltraScale+ HBM Devices
20X more bandwidth than a DDR4 DIMM

• Dedicated hardened interface to the HBM for maximized bandwidth
• Built on the proven Virtex UltraScale+ FPGA platform
• DRAM stacks integrated using SSI Technology
• Hardened Cache Coherent Interconnect (CCIX) Ports
• Memory Controller uses AXI interface for easy integration using Vivado IPI
• HBM Gen2 represents the highest DRAM bandwidth available

Covering the Full Spectrum of Memory Solutions

Memory                     Capacity             Use                                            Bandwidth   Density
Block RAM                  10s of megabits      Shallow buffering                              11.7GB/s    36Kb per block
UltraRAM                   100s of megabits     Deep video & packet buffering                  14.9GB/s    288Kb per block
High Bandwidth Memory      Multi-gigabyte       Deeper buffering at highest performance/watt   460GB/s     8GB on package
External Memory 2666-DDR4  Multi-gigabyte       Large data storage                             21.3GB/s    For 16GB DIMM

Diagram label: Gap in Memory Hierarchy
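The DDR4 and HBM bandwidth figures above can be cross-checked with a little arithmetic. The sketch below derives the peak bandwidth of a DDR4-2666 DIMM and compares it with the 460GB/s quoted for HBM. The function name is hypothetical, and the 64-bit data bus width is an assumption about a standard DDR4 DIMM rather than something stated on the slide.

```python
def ddr_peak_gb_per_s(mega_transfers_per_s, bus_width_bits=64):
    """Peak bandwidth in GB/s: transfers per second times bytes per transfer.

    bus_width_bits=64 is an assumed standard DIMM data width.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

ddr4_2666 = ddr_peak_gb_per_s(2666)   # the slide rounds this to 21.3GB/s
hbm_quoted = 460.0                    # GB/s, as quoted for the on-package HBM

ratio = hbm_quoted / ddr4_2666
print(f"DDR4-2666 DIMM: {ddr4_2666:.1f} GB/s, HBM / DDR4 ratio: {ratio:.1f}x")
```

The ratio of a little over 21 is consistent with the "20X more bandwidth than a DDR4 DIMM" headline on the HBM slide.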

Next generation Adaptive Compute Acceleration Platform

New Device Category for Adaptive Workload-Specific Acceleration
– HW/SW programmable engines
– IP subsystems and a network-on-chip
– Highly integrated programmable I/O

[Diagram blocks: Application Processors, Real-Time Processors, HW/SW Programmable Engines, Next Gen Programmable Logic, HBM, SerDes, RF DAC/ADC, GPIO & Mem I/F]

Software tools and libraries

Up-leveling the Programming Model

Development Productivity
– SW Programmability: SDAccel, SDSoC
– 15x productivity: HLS, SysGen, Model Composer, IPI, SDK
– Traditional HW design: Vivado, HDL

[Diagram: Ethernet, Video decode, Video process, Video encode and HDMI blocks, implemented as C++ or IP]

Development Stacks for SoC & FPGA

Accelerated Open Frameworks

Accelerated Libraries: Machine learning, Database Analytics

Development Environment

Development Boards

Advantages of Xilinx Devices

Reduced System Power Costs
• Highly efficient compute across a range of workloads/applications
• Single device for a range of processing & interfacing needs

Reduced System Hardware Costs
• The FPGA’s massive compute replaces CPU
• Future-proofed hardware thanks to massive flexibility

Platform solutions
• Massive flexibility allows the processing needs of a range of applications to be met
• Massive IO flexibility ensures FPGA/SoC can hook into system easily
• Large range of devices to fit into system power envelope

Ultimate Flexibility/Future-Proofed HW
• Xilinx’s flexibility & ability to handle diverse processing requirements
• GPUs need high data locality, massive parallelism & specific data types + limited IO support
• Tomorrow’s algorithms will run efficiently on Xilinx devices

FPGAs in the cloud

FPGA as a Service Expanding Worldwide

Enterprise

SaaS

Accelerated SW / API

FaaS

F1 Instances

• EC2 Instance Type with up to 8 VU9P (16nmFF+) FPGAs
• Intel Xeon Processors with up to 16 cores
• Connected through PCIe gen3x16
• 4 local DDR4 (12 GBps) per FPGA

Model         #FPGA   Mem      SSD Storage   FPGA DDR4      Price / hour
f1.2xlarge    1       122 GB   470 GB        4x16 GB        $1.65
f1.16xlarge   8       976 GB   8 x 470 GB    8 x 4x16 GB    $13.20
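The pricing in the F1 table scales linearly with the number of FPGAs, which a short calculation makes explicit. This is only a sketch for budgeting: the helper names are hypothetical, the prices are the 2018 figures from the table and may no longer apply, and storage or data transfer charges are ignored.

```python
# FPGA counts and hourly prices copied from the F1 table (2018 figures).
F1_INSTANCES = {
    "f1.2xlarge": {"fpgas": 1, "usd_per_hour": 1.65},
    "f1.16xlarge": {"fpgas": 8, "usd_per_hour": 13.20},
}

def usd_per_fpga_hour(model):
    """Hourly price divided by the number of FPGAs in the instance."""
    inst = F1_INSTANCES[model]
    return inst["usd_per_hour"] / inst["fpgas"]

def job_cost_usd(model, hours):
    """Instance cost of running one instance of the given model for `hours`."""
    return F1_INSTANCES[model]["usd_per_hour"] * hours

for model in F1_INSTANCES:
    print(f"{model}: ${usd_per_fpga_hour(model):.2f} per FPGA-hour")

print(f"10 hours on f1.2xlarge: ${job_cost_usd('f1.2xlarge', 10):.2f}")
```

Both instance types work out to the same price per FPGA-hour, so the choice between them depends on how many FPGAs a workload can use at once rather than on unit cost.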

F1 Users and Tools

End user: “I wasn’t aware the service I am using involved F1 and FPGAs.”
• User’s front-end application leverages F1 transparently

F1 developer #1: “I need to accelerate an application. I don’t know RTL and hardware design.”
• SDAccel
  – Host: Xilinx OpenCL runtime
  – Kernel: C, C++ or OpenCL

F1 developer #2: “I want to create or reuse RTL kernels while using standard APIs whenever possible.”
• SDAccel
  – Host: Xilinx OpenCL runtime
  – Kernel: RTL leveraging Vivado

F1 developer #3: “I want to create or reuse my RTL designs while designing HW and SW middleware.”
• AWS HDK/SDK
  – Host: Custom API
  – Kernel: RTL or HLx

AWS EC2 F1 Instance SDAccel Flow

Xilinx tools in the cloud

FPGA developer AMI
– Includes all Xilinx tools required for F1 development
– Run on standard AWS compute instances
– Spin up as many instances as you need on-demand
  • No Xilinx software/license maintenance
  • Reduces IT infrastructure requirements

Amazon F1 Development Flow

Hardware Development Kit: AWS Hardware Development Kit provides access to necessary tools, scripts and files (aws.amazon.com)

Development in the AWS Cloud with SDAccel, or development on premise with SDAccel

Accelerated Application: Execute your own accelerated application or publish it on the AWS marketplace

Benefits of the AWS F1 Cloud Compute Platform

• Accelerated computation
• Makes leading-edge FPGA acceleration available to a large community of developers, and to millions of potential users
• Provides dedicated and large amounts of leading-edge FPGA logic with elasticity to scale to multiple instances
• Simplifies the development process by providing cloud-based tools for FPGA development
• Ideal platform for collaborative research and development – build a prototype and share instantly with partners

Workloads

Machine Learning, Video, Data analytics, Genomics & more
10x, 90x, 100x, 40x

Saving Babies at Cloud Scale

20min genome analysis (100x faster)

in 25min on AWS

FireSim - Cycle-accurate, FPGA-accelerated data center simulation project based on RISC-V

Uses public-cloud F1
– No upfront cost to purchase and deploy FPGA hardware
– Distribute pre-built images
  • Easy to reproduce experiments*
  • Automates FPGA simulation
– Scale out experiments by spinning up additional EC2 instances
– “Saves hundreds of thousands of dollars on large FPGA clusters”

*Try for yourself:

From cloud to edge and back

“Python Productivity for Zynq”

Computing Landscape: … A Linear Spectrum from IoT to Cloud

IoTT .. IIoT .. Embedded .. Mobile .. Desktop .. Server .. Data Center .. Cloud

• IIoT: Industrial Internet of Things

• IoTT: Internet of Tiny Things (aka motes, ultra low-power/energy)

Computing Landscape: … From Linear Spectrum to Continuum

Desktop Mobile Server

Embedded Data Center …… IIoT IoTT Cloud

Real-time Edge Transactional Computing

PYNQ enables hardware, software and analytics

Data Scientists Embedded Engineers Hardware Engineers

Analytics Software Hardware

Abstract O.S. sensors sensors sensors

Summary

• Delivering more than Moore – UltraScale+, Everest
• Adaptable computing for AI
• Cloud solutions for big data problems – AWS EC2 F1
• From cloud to edge and back with PYNQ

https://www.xilinx.com/products/design-tools/acceleration-zone/aws.html
