INFINIBAND IN-NETWORK COMPUTING MUG, August 2020 THE NEW SCIENTIFIC COMPUTING WORLD

SUPERCOMPUTER

EDGE INFINIBAND APPLIANCE NETWORK

STORAGE

2 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING

INFINIBAND Data Processing Unit DPU GPU

Software Defined Hardware-Accelerated In-network computing

Pre-configured DPUs Programmable DPUs

SUPERCOMPUTER

CPU

3 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING

INFINIBAND DPU cores, MMU Data Processing Unit DPU Programmable Data pre-processing User-defined algorithms Software Defined Hardware-Accelerated In-network computing

Pre-configured DPUs SHARP (data reductions) Programmable DPUs MPI Tag-Matching Pre-Configured SHIELD (self healing resiliency) NVMe over fabric Data security and tenant isolations

200G end-to-end, extremely low latency RDMA, GPU Direct RDMA, GPU Direct storage Speed of Light Enhanced Adaptive Routing and Congestion Control SUPERCOMPUTER Smart topologies

4 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING

INFINIBAND Data Processing Unit DPU

Software Defined Hardware-Accelerated In-network computing OpenSNAPI Pre-configured DPUs Programmable DPUs Applications and Services Applications / Services Switching Analytics, Elastic QoS, Isolation, Acceleration Routing Telemetry Storage Protection

Hardware Acceleration Communication Computation Storage Security SDK Libraries

SUPERCOMPUTER

5 INFINIBAND TECHNOLOGY FUNDAMENTALS

GPU

DPU

CPU

Smart End-Point Architected to Scale Centralized Management Standard

6 HDR 200G INFINIBAND ACCELERATES NEXT GENERATION HPC AND AI (EXAMPLES)

9 PetaFLOPS 7.5 PetaFLOPS 3K HDR Nodes 8K HDR Nodes 35.5 PetaFLOPS 3K HDR Nodes 2K HDR Nodes 16 PetaFLOPS Dragonfly+ Topology 2K HDR Nodes Dragonfly+ Topology Dragonfly+ Topology Dragonfly+ Topology Fat-Tree Topology

23 PetaFLOPS 27.6 PetaFLOPS HPC/AI Cloud HDR 23.5 PetaFLOPS 5.6K HDR Nodes 3K HDR Nodes HDR InfiniBand Supercomputers 8K HDR Nodes Dragonfly+ Topology Fat-Tree Topology Fat-Tree Topology

7 SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)

8 SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)

In-network Tree based aggregation mechanism

Multiple simultaneous outstanding operations Switch For HPC (MPI / SHMEM) and Distributed applications

Scalable High Performance Collective Offload Aggregated Aggregated Data Result Barrier, Reduce, All-Reduce, Broadcast and more Switch Switch Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND

Data Aggregated Integer and Floating-Point, 16/32/64 bits Result

Host Host Host Host Host

9 SHARP ALLREDUCE PERFORMANCE ADVANTAGES Providing Flat Latency, 7X Higher Performance

10 INFINIBAND SHARP AI PERFORMANCE ADVANTAGE 2.5X Higher Performance

11 MPI TAG MATCHING HARDWARE ENGINE

12 INFINIBAND MPI TAG MATCHING HARDWARE ENGINE

Wait for Posting buffers software with Tags Tag Matching (expected messages)

Scatter to local buffer

Software

No Hardware

Unexpected Message Matching List Expected Message Rendezvous ?

Yes

Arriving Gather New Messages remote data

13 HARDWARE TAG MATCHING PERFORMANCE ADVANTAGES 1.8X Higher MPI_Iscatterv Performance on TACC Frontera

14 HARDWARE TAG MATCHING PERFORMANCE ADVANTAGES Nearly 100% Compute – Communication Overlap

15 GPUDIRECT

16 EFFICIENT COMMUNICATION FOR ACCELERATED TRAINING 10X Better Latency & Bandwidth, 3X Faster

GPUDirect™ RDMA

17 QUALITY OF SERVICE

18 NETWORK CONGESTION TYPES

In-network Congestion In-cast Congestion

Network Network

Solution: Adaptive Routing Solutions: Congestion Control

19 IN-NETWORK CONGESTION: ADAPTIVE ROUTING

mpiGraph: Static vs. Adaptive Routing

Static Routing Adaptive Routing

20 INFINIBAND CONGESTION CONTROL

Congestion – Throughput loss No congestion – highest throughput! Congestion Congestion Factor

With Congestion Control Without Congestion Control

21 INFINIBAND ACCELERATED SUPERCOMPUTING

Speed of Light SHARP AI Technology SHIELD AI Technology UFM Cyber AI 200Gb/s Data Throughput AI Acceleration Engines Self Healing Network Data Center Cyber Intelligence RDMA and GPUDirect RDMA 2.5X Higher AI Performance 1000X Faster Recovery Time and Analytics 3X Better (Lower) Latency

22