Real-life Benchmarks vs CPUs and GPUs

Nick Ni, Director of Product Marketing, AI – Software - Ecosystem

[Figure: compute architecture categories. Labels: CPU (CISC → RISC → Multi-Core); Fixed HW Accelerators (GPU, ASSP, ASIC); DSA (TPU, DLA; FPGA, Zynq, ACAP); Cache.]

© Copyright 2020 Xilinx

Speed of Innovation Outpaced Silicon Design Cycles

[Figure: Pace of AI/ML Innovation, 2012 to 2018. Top-1 accuracy (%) of successive models, contrasting the innovation cycle with the silicon design cycle. Models shown: AlexNet, BN-NIN, GoogLeNet, VGG-16, ResNet-34, ResNet-101, Inception v3, DenseNet-264, ShuffleNet 2x, MobileNet v2, BN-AlexNet, ENet, ResNet-18, VGG-19, ResNet-50, ResNet-152, Inception-v4, ResNeXt-101, SENet-154.]

[Figure: AlexNet-style convolutional network classifying a 224 x 224 RGB input image as “Dog”: an 11 x 11 convolution with a stride of 4, further convolution and max pooling stages, two 4096-wide dense layers and a 1000-way output.]

3) Custom Memory Hierarchy

[Figure: custom memory hierarchy, with labels On-Chip Memory and Off-Chip DDR.]

FPS/TOPS (ResNet50v1.5)

[Figure: bar chart of FPS per TOPS on ResNet50 v1.5 for nVidia Jetson, nVidia T4, Intel NNP-I, nVidia A100, nVidia V100 and Xilinx U250; axis from 0 to 140.]
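FPS/TOPS divides measured throughput by peak compute, so it indicates how much of a device's headline TOPS turns into delivered frames. A minimal sketch of that normalization follows; the device names and numbers are hypothetical placeholders, not values read from the chart.

```python
def fps_per_tops(fps: float, peak_tops: float) -> float:
    """Delivered frames per second per peak TOPS (higher is better)."""
    if peak_tops <= 0:
        raise ValueError("peak_tops must be positive")
    return fps / peak_tops


# Hypothetical devices: (measured FPS, peak TOPS). Not taken from the slide.
devices = {
    "device_a": (4000.0, 100.0),
    "device_b": (3000.0, 30.0),
}

# Rank devices by efficiency rather than by raw FPS.
ranking = sorted(devices, key=lambda name: fps_per_tops(*devices[name]), reverse=True)
```

With these placeholder numbers, device_b ranks first on FPS/TOPS even though device_a has the higher raw FPS, which is the kind of distinction the metric is meant to expose.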

AI

Computer Vision

Database

HPC

Super Resolution (unit = FPS)

Model: EDSR, PyTorch (paper)

[Figure: bar chart of FPS at FHD 2K and UHD 4K for nVidia RTX2080Ti, nVidia V100, Alveo U250 and Alveo U250 with pruning; axis from 0 to 7. Callouts: 2x, 3x performance, 4x lower COGS, 8x lower TCO.]

IMDB Sentiment Analysis and Open Information Exchange (OIE)

[Figure: bar chart comparing Intel 48-core CPU, nVidia T4 and Xilinx Versal VC1902 on IMDB and OIE; axis from 0 to 12000.]

CPU: 48 cores, Intel® Xeon® Gold 6136 CPU 3.00GHz | float32 model

GPU-T4: float32 model

Xilinx Versal VC1902: 320-core design on Versal | INT8/INT16 model

Visual Search

1st stage: CNN model for feature extraction

2nd stage: Highly sparse Fully Connected models for large database look up

[Figure: image data (TB) searched with Moffett 3D tensor sparse vectors on a Xilinx FPGA with 64GB DDR (AWS F1) and on an NVIDIA Tesla T4 with 16GB DDR (AWS G4); bar chart of throughput (Query/sec or QPS) and QPS/$ for nVidia T4 (AWS G4) and Xilinx AWS F1; axis from 0 to 6000.]

PointPillars

Model: PointPillars with PyTorch (paper)

[Figure: bar chart of FPS/W and FPS/$100 for nVidia TX2, nVidia Xavier NX and Xilinx ZU5; axis from 0.0 to 14.0.]

Orthorectification + Yolov3 of 500 Mpix image

1st stage: Orthorectification

2nd stage: Yolov3 object detection

[Figure: bar chart of time (s), lower is better, for CPU + nVidia T4 and CPU + Alveo U250; axis from 0 to 1400.]

Depth Estimation (AI-based)

1st stage: Encoder-Decoder depth network

2nd stage: Fast bilinear filter

[Figure: bar chart of FPS at 320*240 and 640*480 for Intel i7 @ 2.6GHz, Edge TPU and Xilinx ZU9; axis from 0 to 100. One entry is marked “No data”.]

Product Recommendation and Fraud Detection

Graph DB: faster, deeper and wider insights on connected data

App 1: Recommendation with Cosine Similarity for 1.5 million patients

App 2: Fraud Detection with Louvain Modularity for 50M nodes

[Figure: example graph schema with Customer, Order, Product, Payment, Supplier and Location nodes and relationships such as RESIDES and PURCHASED; two bar charts of time in ms, lower is better, each comparing 2x Intel Xeon E5-2640 with Alveo U50: Product Recommendation (axis from 0 to 18000) and Fraud Detection (axis from 0 to 2000). Data labels recovered: 16900, 40 and 85.]
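App 1 ranks records by cosine similarity. A minimal pure-Python sketch of that measure is given here for orientation only; the feature vectors and record names are hypothetical, and this is not the Alveo U50 implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, records, k):
    """Names of the k records most similar to the query vector."""
    ranked = sorted(
        records,
        key=lambda name: cosine_similarity(query, records[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical feature vectors, for illustration only.
records = {"r1": [1.0, 0.0, 1.0], "r2": [0.0, 1.0, 0.0], "r3": [2.0, 0.1, 2.0]}
nearest = top_k([1.0, 0.0, 1.0], records, k=2)
```

Each query is a dot product against every stored vector, so the work grows linearly with the number of records and is independent per record, which is what makes it a natural candidate for parallel hardware.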

Apache Spark – TPC-DS Benchmarks

Using Spark on AWS R5.2xl and F1.2xl cluster with S3 storage

250SF JSON Gzip compressed data (~130GB)

[Figure: TPC-DS Benchmarks, bar chart of time (ms), lower is better, per query #: 3, 52, 43, 55, 2, 34, 96, 98, 73, 8, 36, 89, 7, 48, 68, 70, 76, 31; axis from 0 to 2500000.]

Transaction Processing

Non-deterministic finite automaton (NDFA)

[Figure: NDFA decision rules executed by an NDFA interpreter on FPGA; bar chart of transactions/sec and transactions/$ for AWS CPU C5 2x8 cores, AWS GPU P3 2xlarge and AWS F1 2x; axis from 0 to 200.]
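A non-deterministic finite automaton can be simulated by tracking the set of states that are active after each input symbol. The sketch below shows that idea in software only; the rule, state names and event names are hypothetical, epsilon transitions are omitted, and it does not represent the FPGA interpreter.

```python
def nfa_accepts(transitions, start, accepting, symbols):
    """Return True if the NFA can end in an accepting state.

    transitions maps (state, symbol) to the set of possible next states.
    """
    current = {start}
    for symbol in symbols:
        # Follow every possible transition from every active state.
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, symbol), ())
        }
        if not current:
            return False  # no active states left, the input is rejected
    return bool(current & accepting)


# Hypothetical rule: fire when the event stream ends with "login", "transfer".
transitions = {
    ("q0", "login"): {"q0", "q1"},
    ("q0", "transfer"): {"q0"},
    ("q1", "transfer"): {"q2"},
}
fires = nfa_accepts(transitions, "q0", {"q2"}, ["login", "login", "transfer"])
```

The per-symbol update touches each active state independently, so the active set can in principle be advanced in parallel rather than one state at a time.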

Regular Expression Processing

• Regular expression with Capture Group (identify kind: SSN, DOB, CC and Email, and offsets of match)

Operator   Description
.          Any character
*          0 or more, greedy
+          1 or more, greedy
?          0 or 1, greedy
*?         0 or more, lazy
+?         1 or more, lazy
??         0 or 1, lazy
x|y        x or y, prefer x
[…]        “one of” character class
[^…]       “none of” character class
(…)        capture group
(?:…)      non capture group
\          escape special char

[Figure: bar chart of file size processed (GB/s), Perf/100W and Perf/$10000 for 2x Intel Xeon 4116, 2x Intel Xeon 8268 and Alveo U200; axis from 0 to 10.]
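The capture-group task described above (report the kind of each match and its offsets) can be sketched with Python's standard `re` module. The patterns below are deliberately simplified assumptions for illustration, not the expressions used in the benchmark, and named groups are used so that the matching alternative identifies the kind.

```python
import re

# Simplified, hypothetical patterns; real-world formats vary more than this.
PATTERN = re.compile(
    r"(?P<SSN>\b\d{3}-\d{2}-\d{4}\b)"
    r"|(?P<DOB>\b\d{2}/\d{2}/\d{4}\b)"
    r"|(?P<CC>\b\d{4}(?:[ -]\d{4}){3}\b)"
    r"|(?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
)


def find_sensitive(text):
    """Return (kind, start, end) for every match, in order of appearance."""
    return [(m.lastgroup, m.start(), m.end()) for m in PATTERN.finditer(text)]


sample = "ssn 123-45-6789, mail a@b.com, card 1111 2222 3333 4444"
matches = find_sensitive(sample)
```

Each tuple gives the kind and the half-open offset range, so `sample[start:end]` recovers the matched text.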

Reverse Time Migration (RTM)

Benchmarked with Pluto velocity model

[Figure: bar chart of time (sec), lower is better; axis from 0 to 1600, with one bar labelled 1613.99.]

A whole genome analysis (minutes)

Reduced a whole genome sequencing time from 30 hours to 25 min

Framework: BWA-GATK

[Figure: bar chart of time in minutes, lower is better: CPU 1800, AWS F1 25; axis from 0 to 2000.]
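The speedup implied by the 30 hours and 25 min figures can be checked with one division once both are in minutes. A small sketch, where the helper name `speedup` is only illustrative:

```python
def speedup(baseline, accelerated):
    """Ratio of baseline time to accelerated time, in the same unit."""
    if accelerated <= 0:
        raise ValueError("accelerated time must be positive")
    return baseline / accelerated


cpu_minutes = 30 * 60  # 30 hours, as stated on the slide
f1_minutes = 25        # 25 minutes on AWS F1, as stated on the slide
genome_speedup = speedup(cpu_minutes, f1_minutes)  # 72.0
```

That is, 1800 minutes against 25 minutes corresponds to a 72x reduction in run time.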

A workload-optimized Domain Specific Architecture (DSA) is critical to achieving high performance for today’s demanding compute needs

Innovation is outpacing silicon design cycles in many domains such as AI, big data and 5G

The Xilinx platform allows a DSA to be developed in months on existing silicon, rather than the years needed to develop a new chip

A wide range of benchmarks shows major advantages over CPUs and GPUs