Real-life Benchmarks vs CPUs and GPUs
Nick Ni, Director of Product Marketing, AI – Software - Ecosystem
© Copyright 2020 Xilinx Era of Domain-Specific Architectures (DSAs)
Cache
CPU Fixed HW Accelerators DSA CISC → RISC → Multi-Core GPU, ASSP, ASIC TPU, DLA FPGA, Zynq, ACAP 4 © Copyright 2020 Xilinx SpeedSpeed ofof InnovationInnovation OutpacedOutpaced SiliconSilicon DesignDesign CyclesCycles
Silicon Design Cycle 80
Innovation Cycle
70
1 Accuracy 1 (1%) Accuracy
- Top 60
50 AlexNet BN-NIN GoogLeNet VGG-16 ResNet-34 ResNet-101 Inception v3 DenseNet-264 ShuffleNet 2x MobileNet v2 BN-AlexNet ENet ResNet-18 VGG-19 ResNet-50 ResNet-153 Inception-v4 ResNeXt-101 SENet-154 2012 2018 Pace of AI/ML Innovation
© Copyright 2020 Xilinx What is a Domain-Specific Architecture?
55 Input 11 27 dense dense 13 13 13 dense Output 5 3 3 3 11 13 224 27 13 3 13 3 Input 5 3 “Dog” Image 384 384 256 (RGB) 1000 55 256 Max Max Max Pooling Pooling 4096 4096 224 Stride 96 Pooling of 4
2) Custom Precision 1) Custom Data Path 5 © Copyright 2019 Xilinx What is a Domain-Specific Architecture?
Off-Chip DDR
3) Custom Memory Hierarchy
On-Chip Memory
2) Custom Precision 1) Custom Data Path 6 © Copyright 2019 Xilinx Xilinx Achieves the Highest Efficiency in Computation On a standard benchmark in MLPerf v0.7
FPS/TOPS (ResNet50v1.5)
nVidia Jetson
nVidia T4
Intel NNP-I
nVidia A100
nVidia V100
Xilinx U250
0 20 40 60 80 100 120 140
>> 6 © Copyright 2020 Xilinx Agenda
AI
Database
HPC
>> 7 © Copyright 2020 Xilinx AI Super Resolution Sentiment Analysis Visual Search Point Cloud LIDAR
>> 8 © Copyright 2020 Xilinx Super Resolution (Image upscaling with EDSR)
Super Resolution (unit = FPS) Model: EDSR, Pytorch paper 7
6 UHD 4K
FHD 2K 5
4 2x
3
2
3x performance 1 4x lower COGS 0 nVidia RTX2080Ti nVidia V100 Alveo U250 Alveo U250 with 8x lower TCO pruning
© Copyright 2020 Xilinx LSTM - Sentiment Analysis LSTM (Sentences/s) 14000
IMDB Sentiment Analysis - link 12000
Open Information Exchange - link 10000
8000
6000
4000
2000
0 Intel 48-core CPU nVidia T4 Xilinx Versal VC1902
IMDB OIE
CPU: 48 cores, Intel® Xeon(R) Gold 6136 CPU 3.00GHz | float32 model GPU-T4: | float32 model Xilinx Versal VC1902: 320Cores Design on Versal | INT8/INT16 model >> 10 © Copyright 2020 Xilinx Visual Search
1st stage: CNN model for feature extraction Visual Search 6000 2nd stage: Highly sparse Fully Connected models for large database look up 5000
4000
3000
2000
[ MOFFETT 3D Tensor SPARSE VECTOR ] 1000 Image Data XILINX (TB) FPGA 64GB DDR NVIDIA AWS F1 TESLA T4 0 16GB DDR Throughput (Query/sec or QPS) QPS/$ AWS G4 nVidia T4 (AWS G4) Xilinx AWS F1
© Copyright 2020 Xilinx LIDAR Perception based on Point Cloud input
PointPillars 14.0 Model: PointPillars with Pytorch (paper)
12.0
10.0
8.0
6.0
4.0
2.0
0.0 nVidia TX2 nVidia Xavier NX Xilinx ZU5
FPS/w FPS/$100
>> 12 © Copyright 2020 Xilinx Computer Vision Satellite imaging Depth estimation
>> 13 © Copyright 2020 Xilinx Satellite Imaging
Orthorectification + Yolov3 of 500 Mpix image 1400 1st stage: Orthorectification
2nd stage: Yolov3 object detection 1200
1000
800 TIME (S)TIME 600 Lower better
400
200
0 CPU + nVidia T4 CPU + Alveo U250
Credit: https://www.mdpi.com/2072-4292/12/1/44/htm © Copyright 2020 Xilinx Depth Estimation with CV and Deep Learning
Depth Estimation (AI-based) 100 st 1 stage: Encoder-Decoder depth network 90
2nd stage: Fast bilinear filter 80
70
60
50 FPS
40
30
20
10 No data 0 Intel i7 @ 2.6GHz Edge TPU Xilinx ZU9
320*240 640*480
>> 15 © Copyright 2020 Xilinx Database Graph Database Big Data Queries Transaction Processing (Decision Tree) Regular Expression
>> 17 © Copyright 2020 Xilinx Graph Database: Recommendation, Fraud Detection
Product Recommendation Fraud Detection (time in ms) (time in ms) 18000 2000 Graph DB: faster deeper and wider 16900 insights on connected data 16000 1800
1600 App 1: Recommendation with 14000 1400 Cosine Similarity for 1.5 million 12000 patients 1200 10000 Lower 1000 Lower better App 2: Fraud Detection with 8000 better Louvain Modularity for 50M nodes 800 6000
RESIDES 600 Payment ES Customer Location 2 AK M 4000 Payment 400
D O A E T S S C P A I C H E H S P TE C
D R
U P 2000 200 Order PURCHASED
Order Product
SH IP S F RO 40 85 S M E FI TI O 0 0 N
Supplier Location 1 2x Intel Xeon E5-2640 Alveo U50 2x Intel Xeon E5-2640 Alveo U50
>> 18 © Copyright 2020 Xilinx Big Data Queries
Apache Spark – TPC-DS Benchmarks
Using Spark on AWS R5.2xl and F1.2xl cluster with S3 storage
250SF JSON Gzip compressed data (~130GB)
TPC-DS Benchmarks 2500000 Lower better 2000000
1500000
1000000 TIME (MS) TIME
500000
0 3 52 43 55 2 34 96 98 73 8 36 89 7 48 68 70 76 31 QUERY #
CPU R5.2xlarge F1.2xlarge >> 19 © Copyright 2020 Xilinx Transaction Processing
Transaction Processing 200 Non-Deterministic finite automaton 180
160 NDFA Decision NDFA interpretor on 140 Rules FPGA 120
100
80
60
40
20
0 AWS CPU C5 2x8 cores AWS GPU P3 2xlarge AWS F1 2x
transactions/sec transactions/$
Picture credit: https://www.ttgasia.com/2020/06/04/amadeus-troovo-unite-to-automate- b2b-travel-payments/ >> 20 © Copyright 2020 Xilinx GDPR Compliance Check – Regular Expression
Regular Expression Processing 10
• Regular expression with Capture Group 9 (identify kind: SSN, DOB, CC and Email, 8 and offsets of match) 7
6 Operator Description Operator Description 5 . Any character x|y x or y, prefer x 4 * 0 or more, greedy […] “one of” character class
+ 1 or more, greedy 3 [^…] “none of” character class ? 0 or 1, greedy 2 *? 0 or more, lazy (…) capture group 1 +? 1 or more, lazy (?:…) non capture group ?? 0 or 1, lazy \ escape special char 0 File size processed (GB/s) Perf/100W Perf/$10000
2x Intel Xeon 4116 2x Intel eon 8268 Alveo U200
>> 21 © Copyright 2020 Xilinx HPC Seismic Imaging (RTM) Genomic Sequencing
>> 22 © Copyright 2020 Xilinx Seismic Imaging Reverse Time Migration (RTM) 1800
1600 1613.99 Reverse Time Migration (RTM) 1400 Benchmarked with Pluto velocity model 1200
1000
Time (sec) 800
600 Lower better
400
200
53.5 40 0 CPU FP32 - 70 threads nVidia V100 FP32 Alveo U280 FP32 Original input FPGA 1-shot >> 23 © Copyright 2020 Xilinx Genomic Sequencing
A whole genome analysis (minuts) Reduced a whole genome sequencing time 2000
from 30 hours to 25 min 1800 1800 Framework: BWA-GATK 1600
1400
1200
1000
800 Lower better 600
400
200
25 0 CPU AWS F1
>> 24 © Copyright 2020 Xilinx Conclusion
Workload-optimized Domain Specific Architecture (DSA) is critical to achieve high performance in today’s demanding compute needs
Innovation is outpacing silicon pace in many domains such as AI, big data and 5G
Xilinx platform allows DSA to be developed in months on an existing silicon, not years to develop a new chip
Wide range of benchmarks prove major advantages over CPU and GPUs
>> 25 © Copyright 2020 Xilinx Thank You
© Copyright 2020 Xilinx