FPGA accelerated processing in the cloud Cathal McCabe, Xilinx Ireland New concepts in ultra fast data acquisition workshop
© Copyright 2018 Xilinx Overview
Computing after Moore’s law Big data The rise of AI From cloud to edge and back
Page 2 © Copyright 2018 Xilinx The legal speak
Moore’s Law – Number of transistors doubles every 1/1.5/2 years Moore’s Second Law (Rock’s law) – Cost of fabs increases exponentially Moore’s Law’s Law – Number of people predicting the end of Moore’s law doubles every year! Dennard’s Law (Scaling) – As transistors get smaller their power *“Cramming More Components onto Integrated Circuits,” density remains constant Electronics, pp. 114–117, April 19, 1965.
© Copyright 2018 Xilinx Transistors, Clock Speed, Power
© Copyright 2018 Xilinx Computing performance increase
Slowing to 3% per year
*John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018
Page 5 © Copyright 2018 Xilinx Data scaling
Page 6 © Copyright 2018 Xilinx The world around us hasn’t stopped scaling
*Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2015–2020 White Paper © Copyright 2018 Xilinx Scientific data growth
Page 8 *Analyzing and Querying big scientific data; Thomas Heinis © Copyright 2018 Xilinx https://www.slideshare.net/eXascaleInfolab/braintalk-cuso-nm Data centres
>7500 data centres worldwide ++21% per year in 2018
By 2020 1/3 of all data will pass through the cloud
Hohhot Data Center in China 7,750,015 square feet 40% of total operating costs is Energy
3% of global electricity production
http://worldstopdatacenters.com/china-mobile-hohhot-data-center/ Page 9 http://www.ciena.com/insights/articles/Twelve-Mind-blowing-Data-Center-Facts-You-Need-to-Know.html © Copyright 2018 Xilinx AI and Machine Learning
Page 10 © Copyright 2018 Xilinx Dark data
Less than 1% of all data that is produced every day is mined for valuable information
“We cannot solve our problems with the same thinking we used when we created them.” – Albert Einstein
Page 11 http://research.ibm.com/articles/datacentricdesign/ © Copyright 2018 Xilinx Page 12 © Copyright 2018 Xilinx Google Translate Significant Quality Upgrade through Machine Learning
Page 13 © Copyright 2018 Xilinx Machine Learning
"Machine learning can solve fundamental business problems that are really hard to create hardwired solutions to" Danny Lange (Uber) “Alexa will drive $10 billion in sales by 2020” Mark Mahaney (RBC Capital)
AI helped Bing double market share ($4 billion annually) “Twitter taught Microsoft’s AI chatbot to be a “Twitterracist is … going in less to use than artificial a day” intelligence and AI saves $1 Billion on machine learning to categorize every single tweet” content every year” Jack Dorsey (Twitter CEO)
Page 14 © Copyright 2018 Xilinx AI in big science
“It took us several years to convince people that this is not just some magic, hocus-pocus, black box stuff,” Boaz Klima, (Fermilab)
Inter-experimental LHC Machine Learning (IML) Working Group @ CERN
Luengo I, et al. SuRVoS: Super-Region Volume Segmentation workbench." Journal of Page 15 Structural Biology(2017). DOI:10.1016/j.jsb.2017.02.007 © Copyright 2018 Xilinx A Perspective on Deep Imaging Ge Wang Latest Research: Increasing Accuracy of Reduced Precision CNNs & BNNs
Top-5 Error (ImageNet) 60.00 BNNs are improving rapidly 50.00 Near consensus that inference
40.00 will be very low precision – Image / CNN: 2-bit (binary) 30.00 – Speech / RNN: 3-bit (ternary) 20.00
10.00
0.00 06.07.2009 18.11.2010 01.04.2012 14.08.2013 27.12.2014 10.05.2016 22.09.2017
BNN CNN Reduced Precision
Page 16 © Copyright 2018 Xilinx Xilinx technology
Page 17 © Copyright 2018 Xilinx FPGA Architecture Overview
© Copyright 2018 Xilinx Adaptable Architecture Advantage
Algorithm to implement Video Video Ethernet DNN HDMI decode encode – Control / Dataflow graph
CPU implementation CPU – Sequential (Van Neumann) execution (SIMD for GPU) Memory read decode – Memory access bottleneck ALU instructions execute – Fixed data size (e.g. 64 bits) write & data – Poor decision handling (breaks ALU pipeline)
Optimized hardware FPGA Custom memory FPGA implementation – Custom dataflow / pipeline / decision handling / widths Video Video Ethernet DNN HDMI – Custom memory hierarchy decode encode – Energy efficient computation
Page 19 © Copyright 2018 Xilinx FPGA evolution 2.5 micron XC2064 1984 7nm Everest 2018 85,000 transistors 50,000,000,000+ transistors
Maximum Density of Xilinx FPGA by node
© Copyright 2018 Xilinx FPGA Scaling
Maximum Xilinx SerDes Rate per pin Aggregate Xilinx SerDes Bandwidth
High End PC per pin DDR rate Aggregate Xilinx FPGA Memory BW per node
FPL 2014: The FPGA, an engine for innovation in silicon and packaging technology; Liam Madden, FPL 2014 © Copyright 2018 Xilinx Xilinx Product Families: A Broad Portfolio
Page 22 © Copyright 2018 Xilinx UltraScale+™ Capabilities New at 16nm New at 16nm FinFET Perf/Watt SmartConnect Optimized Process & Addressing IP & Fabric Voltage Scaling Interconnect bottlenecks
Enhanced at 16nm Enhanced at 16nm New at 16nm (58G) Security & Reliability Transceivers Decrypt/Auth/Anti-Tamper, 16G, 28G, 32G, & 58G Improved SEU Performance Fractional PLL
Enhanced at 16nm New at 16nm (HBM) Enhanced at 16nm 3rd Gen 3D IC & HBM Networking IP 100G EMAC w/RS-FEC Greater Inter-Die FMAX HBM for 20X bandwidth 150G Interlaken w/300GLL
Enhanced at 16nm Enhanced at 16nm DSP External Memory 2400 Mb/s (20nm) Floating/Fixed Pt Enhanced 2,666 Mb/s (16nm) 2.5X Bandwidth (vs. 28nm)
Enhanced at 16nm Enhanced at 16nm Block RAM PCI Express® Hardened Cascading Gen3 x16 Power-Optimized Silicon Gen4 x8
New at 16nm New at 16nm New at 16nm High Density I/O Packaging UltraRAM Power-Optimized I/O For Signal Integrity, Massive Capacity MIPI D-PHY Support PCB Area, Thermal SRAM Replacement
Page 23 © Copyright 2018 Xilinx UltraScale Re-Architects the Core Highest Utilization at Maximum Performance
Page 24 © Copyright 2018 Xilinx SSI Harnesses Proven Technology in a Unique Way
28nm FPGA Slice 28nm FPGA Slice 28nm FPGA Slice 28nm FPGA Slice Microbumps Silicon Interposer Through-Silicon Vias Package Substrate C4 Bumps Side-by-Side Die Layout BGA Balls Minimal heat flux issues Passive Silicon Minimal design tool flow Interposer Microbumps impact Access to power / ground / IOs Access to logic regions Leverages ubiquitous image sensor micro-bump technology
Through-silicon Vias (TSV) Bridge power / ground / IOs to C4 bumps Coarse pitch, low density aids manufacturability Etch process (not laser drilled)
Page 25 © Copyright 2018 Xilinx Introducing Virtex UltraScale+ HBM Devices 20X more bandwidth than a DDR4 DIMM
Dedicated hardened Built on the proven Virtex interface to the HBM for UltraScale+ FPGA platform maximized bandwidth
DRAM stacks integrated using SSI Technology Hardened Cache Coherent Interconnect (CCIX) Ports
Memory Controller uses AXI interface for easy HBM Gen2 represents the integration using Vivado IPI highest DRAM bandwidth available
© Copyright 2018 Xilinx Covering the Full Spectrum of Memory Solutions
External Memory 2666-DDR4 High Bandwidth (Multi-Gigabyte) Memory (Multi-Gigabyte) UltraRAM (100s of Megabits)
Block RAM Gap in (10s of megabits) Memory Hierarchy
Shallow Deep Video Deeper Buffering at Large Buffering & Packet Buffering Highest Performance/Watt Data Storage
11.7GB/s 14.9GB/s 460GB/s 21.3GB/s 36Kb Density 288Kb Density 8GB Density For 16GB DIMM per Block Per Block on Package
Page 27 © Copyright 2018 Xilinx Next generation Adaptive Compute Acceleration Platform
New Device Category for Adaptive HW/SW Programmable Engines Workload-Specific Acceleration
Application Processors HW/SW programmable engines Next Gen Programmable Logic IP subsystems and a network-on-chip Real-Time Processors
RF GPIO & Highly integrated programmable I/O HBM SerDes DAC/ADC Mem I/F
© Copyright 2018 Xilinx Software tools and libraries
Page 29 © Copyright 2018 Xilinx Up-leveling the Programming Model
Development Productivity
SW Programmability: SDAccel, SDSoC
Video Video Video decode process encode Ethernet HDMI C++ C++ IP C++ IP
15x productivity: HLS, SysGen, Model Composer, IPI, SDK
video HDMI enc. video proc.
Traditional HW design: Vivado, HDL
Page 30 © Copyright 2018 Xilinx Development Stacks for SoC & FPGA
Accelerated Open Frameworks
Accelerated Libraries Machine Database learning Analytics
Development Environment
Development Boards
Page 31 © Copyright 2018 Xilinx Advantages of Xilinx Devices
Reduced System Power Costs • Highly efficient compute across range of workloads/applications • Single device for range of processing & interfacing needs
Reduce System Hardware Costs • FPGAs massive compute replaces CPU • Future proofed hardware thanks to massive flexibility
Platform solutions • Massive flexibility allows processing need for range of applications be met
• Massive IO flexibility ensures FPGA/SoC can hook into system easily • Large range of devices to fit into system power envelop
Ultimate Flexibility/Future Proofed HW • Xilinx’s flexibility & ability to handle diverse processing requirement • GPUs need high data locality, massive parallelism & specific data type + limited IO support • Tomorrows algorithms will run efficiently on Xilinx devices
Page 32 © Copyright 2018 Xilinx FPGAs in the cloud
Page 33 © Copyright 2018 Xilinx FPGA as a Service Expanding Worldwide
Enterprise
SaaS
Accelerated SW / API
FaaS
Page 34 © Copyright 2018 Xilinx Amazon F1 Instances
EC2 Instance Type with up to 8 VU9P (16nmFF+) FPGA Intel Xeon Processors with up to 16 cores Connected through PCIe gen3x16 4 local DDR4 (12 GBps) per FPGA
Model #FPGA Mem SSD Storage FPGA DDR4 Price / hour
f1.2xlarge 1 122 GB 470 GB 4x16 GB $1.65 f1.16xlarge 8 976 GB 8 x 470 GB 8 x 4x16 GB $13.20
Page 35 © Copyright 2018 Xilinx F1 Users Tools “I wasn’t aware the service I am using involved F1 and FPGAs.” • User’s front-end application leverages F1 transparently end user
“I need to accelerate an application. I don’t • SDAccel know RTL and hardware design.” Host: Xilinx OpenCL runtime Kernel: C, C++ or OpenCL F1 developer #1
“I want to create or reuse RTL kernels while • SDAccel using standard APIs whenever possible.” Host: Xilinx OpenCL runtime Kernel: RTL leveraging Vivado F1 developer #2
“I want to create or reuse my RTL designs while designing HW and SW middleware.” • AWS HDK/SDK Host: Custom API F1 developer #3 Kernel: RTL or HLx
Page 36 © Copyright 2018 Xilinx AWS EC2 F1 Instance SDAccel Flow
Page 37 © Copyright 2018 Xilinx Xilinx tools in the cloud
FPGA developer AMI – Includes all Xilinx tools required for F1 development – Run on standard AWS compute instances – Spin up as many instances as you need on-demand • No Xilinx software/license maintenance • Reduces IT infrastructure requirements
Page 38 © Copyright 2018 Xilinx Amazon F1 Development Flow
AWS Hardware Development Kit Hardware provides access to necessary tools, Development Kit scripts and files aws.amazon.com
Development in Development the AWS Cloud on premise with SDAccel with SDAccel
Execute your own accelerated Accelerated application or publish it on the Application AWS marketplace
Page 39 © Copyright 2018 Xilinx Benefits of the AWS F1 Cloud Compute Platform
Accelerated computation Makes leading-edge FPGA acceleration available to a large community of developers, and to millions of potential users Provides dedicated and large amounts of leading-edge FPGA logic with elasticity to scale to multiple instances Simplifies the development process by providing cloud-based tools for FPGA development Ideal platform for collaborative research and development – build a prototype and share instantly with partners
Page 40 © Copyright 2018 Xilinx Workloads
Machine Video Data analytics Genomics & more Learning 10x 90x 100x 40x
Page 41 © Copyright 2018 Xilinx Saving Babies at Cloud Scale
20min genome analysis (100x faster)
in 25min on AWS
Page 42 © Copyright 2018 Xilinx FireSim - Cycle-accurate, FPGA-accelerated data center simulation project based on RISC-V
Uses public-cloud F1 – No upfront cost to purchase and deploy FPGA hardware – Distribute pre-built images • Easy to reproduce experiments* • Automates FPGA simulation
– Scale out experiments by spinning up additional EC2 instances – “Saves hundreds of thousands of dollars on large FPGA clusters”
*Try for yourself:
Page 43 © Copyright 2018 Xilinx From cloud to edge and back
Page 44 © Copyright 2018 Xilinx “Python Productivity for Zynq"
Page 45 © Copyright 2018 Xilinx Computing Landscape: … A Linear Spectrum from IOT to Cloud
IoTT .. IIoT .. Embedded .. Mobile .. Desktop .. Server .. Data Center .. Cloud
• IIoT: Industrial Internet of Things
• IoTT: Internet of Tiny Things (aka motes, ultra low-power/energy)
© Copyright 2018 Xilinx Computing Landscape: … From Linear Spectrum to Continuum
Desktop Mobile Server
Embedded Data Center …… IIoT IoTT Cloud
Real-time Edge Transactional Computing
© Copyright 2018 Xilinx PYNQ enables hardware, software and analytics
Data Scientists Embedded Engineers Hardware Engineers
Analytics Software Hardware
Abstract O.S. sensors sensors sensors
Page 48 © Copyright 2018 Xilinx Summary
Delivering more than Moore – Ultrascale+, Everest Adaptable computing for AI Cloud solutions for big data problems – AWS EC2 F1 From cloud to edge and back https://www.xilinx.com/products/design-tools/acceleration-zone/aws.html with PYNQ
Page 49 © Copyright 2018 Xilinx Page 50 © Copyright 2018 Xilinx