<<

Real-life Benchmarks vs CPUs and GPUs

Nick Ni, Director of Product Marketing, AI – - Ecosystem

© Copyright 2020 Xilinx Era of Domain-Specific Architectures (DSAs)

Cache

CPU Fixed HW Accelerators DSA CISC → RISC → Multi-Core GPU, ASSP, ASIC TPU, DLA FPGA, Zynq, ACAP 4 © Copyright 2020 Xilinx SpeedSpeed ofof InnovationInnovation OutpacedOutpaced SiliconSilicon DesignDesign CyclesCycles

Silicon Design Cycle 80

Innovation Cycle

70

1 Accuracy 1 (1%) Accuracy

- Top 60

50 AlexNet BN-NIN GoogLeNet VGG-16 ResNet-34 ResNet-101 Inception v3 DenseNet-264 ShuffleNet 2x MobileNet v2 BN-AlexNet ENet ResNet-18 VGG-19 ResNet-50 ResNet-153 Inception-v4 ResNeXt-101 SENet-154 2012 2018 Pace of AI/ML Innovation

© Copyright 2020 Xilinx What is a Domain-Specific Architecture?

55 Input 11 27 dense dense 13 13 13 dense Output 5 3 3 3 11 13 224 27 13 3 13 3 Input 5 3 “Dog” Image 384 384 256 (RGB) 1000 55 256 Max Max Max Pooling Pooling 4096 4096 224 Stride 96 Pooling of 4

2) Custom Precision 1) Custom Data Path 5 © Copyright 2019 Xilinx What is a Domain-Specific Architecture?

Off-Chip DDR

3) Custom Memory Hierarchy

On-Chip Memory

2) Custom Precision 1) Custom Data Path 6 © Copyright 2019 Xilinx Xilinx Achieves the Highest Efficiency in Computation On a standard benchmark in MLPerf v0.7

FPS/TOPS (ResNet50v1.5)

Jetson

nVidia T4

Intel NNP-I

nVidia A100

nVidia V100

Xilinx U250

0 20 40 60 80 100 120 140

>> 6 © Copyright 2020 Xilinx Agenda

AI

Database

HPC

>> 7 © Copyright 2020 Xilinx AI Super Resolution Sentiment Analysis Visual Search Point Cloud LIDAR

>> 8 © Copyright 2020 Xilinx Super Resolution (Image upscaling with EDSR)

Super Resolution (unit = FPS) Model: EDSR, Pytorch paper 7

6 UHD 4K

FHD 2K 5

4 2x

3

2

3x performance 1 4x lower COGS 0 nVidia RTX2080Ti nVidia V100 Alveo U250 Alveo U250 with 8x lower TCO pruning

© Copyright 2020 Xilinx LSTM - Sentiment Analysis LSTM (Sentences/s) 14000

IMDB Sentiment Analysis - link 12000

Open Information Exchange - link 10000

8000

6000

4000

2000

0 48-core CPU nVidia T4 Xilinx Versal VC1902

IMDB OIE

CPU: 48 cores, Intel® Xeon(R) Gold 6136 CPU 3.00GHz | float32 model GPU-T4: | float32 model Xilinx Versal VC1902: 320Cores Design on Versal | INT8/INT16 model >> 10 © Copyright 2020 Xilinx Visual Search

1st stage: CNN model for feature extraction Visual Search 6000 2nd stage: Highly sparse Fully Connected models for large database look up 5000

4000

3000

2000

[ MOFFETT 3D Tensor SPARSE VECTOR ] 1000 Image Data XILINX (TB) FPGA 64GB DDR NVIDIA AWS F1 TESLA T4 0 16GB DDR Throughput (Query/sec or QPS) QPS/$ AWS G4 nVidia T4 (AWS G4) Xilinx AWS F1

© Copyright 2020 Xilinx LIDAR Perception based on Point Cloud input

PointPillars 14.0 Model: PointPillars with Pytorch (paper)

12.0

10.0

8.0

6.0

4.0

2.0

0.0 nVidia TX2 nVidia Xavier NX Xilinx ZU5

FPS/w FPS/$100

>> 12 © Copyright 2020 Xilinx Computer Vision Satellite imaging Depth estimation

>> 13 © Copyright 2020 Xilinx Satellite Imaging

Orthorectification + Yolov3 of 500 Mpix image 1400 1st stage: Orthorectification

2nd stage: Yolov3 object detection 1200

1000

800 TIME (S)TIME 600 Lower better

400

200

0 CPU + nVidia T4 CPU + Alveo U250

Credit: https://www.mdpi.com/2072-4292/12/1/44/htm © Copyright 2020 Xilinx Depth Estimation with CV and Deep Learning

Depth Estimation (AI-based) 100 st 1 stage: Encoder-Decoder depth network 90

2nd stage: Fast bilinear filter 80

70

60

50 FPS

40

30

20

10 No data 0 Intel i7 @ 2.6GHz Edge TPU Xilinx ZU9

320*240 640*480

>> 15 © Copyright 2020 Xilinx Database Graph Database Queries Transaction Processing (Decision Tree) Regular Expression

>> 17 © Copyright 2020 Xilinx Graph Database: Recommendation, Fraud Detection

Product Recommendation Fraud Detection (time in ms) (time in ms) 18000 2000 Graph DB: faster deeper and wider 16900 insights on connected data 16000 1800

1600 App 1: Recommendation with 14000 1400 Cosine Similarity for 1.5 million 12000 patients 1200 10000 Lower 1000 Lower better App 2: Fraud Detection with 8000 better Louvain Modularity for 50M nodes 800 6000

RESIDES 600 Payment ES Customer Location 2 AK M 4000 Payment 400

D O A E T S S C P A I C H E H S P TE C

D R

U P 2000 200 Order PURCHASED

Order Product

SH IP S F RO 40 85 S M E FI TI O 0 0 N

Supplier Location 1 2x Intel Xeon E5-2640 Alveo U50 2x Intel Xeon E5-2640 Alveo U50

>> 18 © Copyright 2020 Xilinx Big Data Queries

Apache Spark – TPC-DS Benchmarks

Using Spark on AWS R5.2xl and F1.2xl cluster with S3 storage

250SF JSON Gzip compressed data (~130GB)

TPC-DS Benchmarks 2500000 Lower better 2000000

1500000

1000000 TIME (MS) TIME

500000

0 3 52 43 55 2 34 96 98 73 8 36 89 7 48 68 70 76 31 QUERY #

CPU R5.2xlarge F1.2xlarge >> 19 © Copyright 2020 Xilinx Transaction Processing

Transaction Processing 200 Non-Deterministic finite automaton 180

160 NDFA Decision NDFA interpretor on 140 Rules FPGA 120

100

80

60

40

20

0 AWS CPU C5 2x8 cores AWS GPU P3 2xlarge AWS F1 2x

transactions/sec transactions/$

Picture credit: https://www.ttgasia.com/2020/06/04/amadeus-troovo-unite-to-automate- b2b-travel-payments/ >> 20 © Copyright 2020 Xilinx GDPR Compliance Check – Regular Expression

Regular Expression Processing 10

• Regular expression with Capture Group 9 (identify kind: SSN, DOB, CC and Email, 8 and offsets of match) 7

6 Operator Description Operator Description 5 . Any character x|y x or y, prefer x 4 * 0 or more, greedy […] “one of” character class

+ 1 or more, greedy 3 [^…] “none of” character class ? 0 or 1, greedy 2 *? 0 or more, lazy (…) capture group 1 +? 1 or more, lazy (?:…) non capture group ?? 0 or 1, lazy \ escape special char 0 File size processed (GB/s) Perf/100W Perf/$10000

2x Intel Xeon 4116 2x Intel eon 8268 Alveo U200

>> 21 © Copyright 2020 Xilinx HPC Seismic Imaging (RTM) Genomic Sequencing

>> 22 © Copyright 2020 Xilinx Seismic Imaging Reverse Time Migration (RTM) 1800

1600 1613.99 Reverse Time Migration (RTM) 1400 Benchmarked with Pluto velocity model 1200

1000

Time (sec) 800

600 Lower better

400

200

53.5 40 0 CPU FP32 - 70 threads nVidia V100 FP32 Alveo U280 FP32 Original input FPGA 1-shot >> 23 © Copyright 2020 Xilinx Genomic Sequencing

A whole genome analysis (minuts) Reduced a whole genome sequencing time 2000

from 30 hours to 25 min 1800 1800 Framework: BWA-GATK 1600

1400

1200

1000

800 Lower better 600

400

200

25 0 CPU AWS F1

>> 24 © Copyright 2020 Xilinx Conclusion

Workload-optimized Domain Specific Architecture (DSA) is critical to achieve high performance in today’s demanding compute needs

Innovation is outpacing pace in many domains such as AI, big data and

Xilinx platform allows DSA to be developed in months on an existing silicon, not years to develop a new chip

Wide range of benchmarks prove major advantages over CPU and GPUs

>> 25 © Copyright 2020 Xilinx Thank You

© Copyright 2020 Xilinx