INFINIBAND IN-NETWORK COMPUTING
MUG, August 2020

THE NEW SCIENTIFIC COMPUTING WORLD
[Diagram: InfiniBand connects the supercomputer, edge appliances, the network, and storage]
2 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING
[Diagram: the InfiniBand Data Processing Unit (DPU) connects the supercomputer's GPUs and CPUs, delivering software-defined, hardware-accelerated in-network computing through pre-configured and programmable DPUs]
3 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING
The InfiniBand Data Processing Unit (DPU) delivers software-defined, hardware-accelerated in-network computing:
- Programmable DPUs: DPU cores and MMU, data pre-processing, user-defined algorithms
- Pre-configured DPUs: SHARP (data reductions), MPI Tag Matching, SHIELD (self-healing resiliency), NVMe over Fabrics, data security and tenant isolation
- Speed of light: 200G end-to-end, extremely low latency, RDMA, GPUDirect RDMA, GPUDirect Storage, enhanced adaptive routing and congestion control, smart topologies
4 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING
[Diagram: the OpenSNAPI software stack on the DPU. Applications and services (switching, routing, telemetry, storage, protection, analytics, elastic QoS, isolation, acceleration) sit on an SDK and libraries, on top of hardware acceleration engines for communication, computation, storage, and security]
5 INFINIBAND TECHNOLOGY FUNDAMENTALS
[Diagram: GPU, DPU, and CPU end-points]
- Smart end-points
- Architected to scale
- Centralized management
- Standard
6 HDR 200G INFINIBAND ACCELERATES NEXT GENERATION HPC AND AI SUPERCOMPUTERS (EXAMPLES)
9 PetaFLOPS 7.5 PetaFLOPS 3K HDR Nodes 8K HDR Nodes 35.5 PetaFLOPS 3K HDR Nodes 2K HDR Nodes 16 PetaFLOPS Dragonfly+ Topology 2K HDR Nodes Dragonfly+ Topology Dragonfly+ Topology Dragonfly+ Topology Fat-Tree Topology
23 PetaFLOPS 27.6 PetaFLOPS HPC/AI Cloud HDR 23.5 PetaFLOPS 5.6K HDR Nodes 3K HDR Nodes HDR InfiniBand Supercomputers 8K HDR Nodes Dragonfly+ Topology Fat-Tree Topology Fat-Tree Topology
7 SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)
In-network, tree-based aggregation mechanism:
- Multiple simultaneous outstanding operations
- For HPC (MPI / SHMEM) and distributed machine learning applications
- Scalable, high-performance collective offload
- Barrier, Reduce, All-Reduce, Broadcast, and more
- Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
- Integer and floating-point, 16/32/64 bits
[Diagram: hosts send data into a switch tree; each switch aggregates its inputs and the aggregated result is distributed back to the hosts]
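The tree aggregation above can be sketched as a toy model (illustrative only, not the actual SHARP protocol or its wire format): each "switch" level combines the partial results of its children, so the fabric, rather than the hosts, performs the reduction in a logarithmic number of hops.

```python
# Sketch of SHARP-style in-network tree reduction (hypothetical model):
# each switch aggregates the partial results of up to `radix` children,
# so hosts receive the final value after O(log n) tree levels instead
# of exchanging O(n) messages among themselves.

def tree_reduce(values, radix=2, op=sum):
    """Reduce per-host values up a switch tree of the given radix."""
    level = list(values)
    while len(level) > 1:
        # Each "switch" combines up to `radix` children into one partial result.
        level = [op(level[i:i + radix]) for i in range(0, len(level), radix)]
    return level[0]

hosts = [1, 2, 3, 4, 5]        # per-host partial results
print(tree_reduce(hosts))      # 15, same result as a host-side sum allreduce
```

The same tree supports the other listed operations by swapping the aggregation function, e.g. `tree_reduce(hosts, op=max)` for a Max reduction.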
9 SHARP ALLREDUCE PERFORMANCE ADVANTAGES
Providing flat latency, 7X higher performance
10 INFINIBAND SHARP AI PERFORMANCE ADVANTAGE
2.5X higher performance
11 INFINIBAND MPI TAG MATCHING HARDWARE ENGINE
[Flow diagram: software posts receive buffers with tags to the expected-message tag-matching list. The hardware matches each arriving message against that list: an expected message is scattered directly to the local buffer (gathering the remote data first if it is a rendezvous message), while an unexpected message is held in the unexpected-message list for matching in software.]
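The matching flow can be modeled with a small sketch (a simplified illustration, not the hardware engine or the MPI library's internal data structures): posted receives form the expected list, and arrivals either match a posted tag or wait on the unexpected list.

```python
# Simplified model of MPI tag matching (illustrative only): posted
# receives form the "expected" list; an arriving message that matches a
# posted tag is delivered immediately (the hardware scatters it to the
# local buffer), while a non-matching arrival is queued on the
# "unexpected" list until a matching receive is posted.

from collections import deque

class TagMatcher:
    def __init__(self):
        self.expected = deque()    # tags of posted receives, not yet matched
        self.unexpected = deque()  # (tag, payload) arrivals with no posted receive

    def post_recv(self, tag):
        # First check already-arrived (unexpected) messages for this tag.
        for msg in self.unexpected:
            if msg[0] == tag:
                self.unexpected.remove(msg)
                return msg[1]          # delivered from the unexpected queue
        self.expected.append(tag)      # otherwise the receive stays posted
        return None

    def arrive(self, tag, payload):
        if tag in self.expected:       # expected: deliver to the local buffer
            self.expected.remove(tag)
            return payload
        self.unexpected.append((tag, payload))  # unexpected: queue for later
        return None

m = TagMatcher()
m.arrive(7, "early")       # no receive posted yet -> unexpected list
print(m.post_recv(7))      # matches the queued message -> early
m.post_recv(9)             # posted receive, waits
print(m.arrive(9, "late")) # matches the posted receive -> late
```

The point of offloading this loop to hardware is that matching and delivery proceed while the CPU computes, which is what enables the overlap numbers on the following slides.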
13 HARDWARE TAG MATCHING PERFORMANCE ADVANTAGES
1.8X higher MPI_Iscatterv performance on TACC Frontera
14 HARDWARE TAG MATCHING PERFORMANCE ADVANTAGES
Nearly 100% compute-communication overlap
15 GPUDIRECT
16 EFFICIENT COMMUNICATION FOR ACCELERATED TRAINING
GPUDirect™ RDMA: 10X better latency and bandwidth, 3X faster deep learning
17 QUALITY OF SERVICE
18 NETWORK CONGESTION TYPES
In-network Congestion In-cast Congestion
Network Network
Solution: Adaptive Routing Solutions: Congestion Control
19 IN-NETWORK CONGESTION: ADAPTIVE ROUTING
[Figure: mpiGraph results with static routing vs. adaptive routing]
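The difference the mpiGraph comparison illustrates can be sketched with a toy model (illustrative only, not the switch's actual routing algorithm): static routing pins each flow to one output port, so colliding flows congest it, while adaptive routing picks the least-loaded port per packet.

```python
# Toy model of static vs. adaptive routing (illustrative): static routing
# hashes a flow to a fixed output port, so two flows can collide on one
# port; adaptive routing selects the least-loaded port per packet,
# spreading traffic across equal-cost paths.

def static_route(flow_id, num_ports):
    return flow_id % num_ports          # same port for every packet of a flow

def adaptive_route(queue_depths):
    # Pick the output port with the shallowest queue.
    return min(range(len(queue_depths)), key=queue_depths.__getitem__)

queues = [0, 0]                         # per-port queue occupancy
for pkt in range(6):                    # two flows, three packets each
    flow = (pkt % 2) * 2                # flow ids 0 and 2 both hash to port 0
    queues[static_route(flow, len(queues))] += 1
print(queues)                           # [6, 0]: one port congested, one idle

queues = [0, 0]
for pkt in range(6):
    queues[adaptive_route(queues)] += 1
print(queues)                           # [3, 3]: load balanced across ports
```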
20 INFINIBAND CONGESTION CONTROL
[Plot: throughput vs. congestion factor. Without congestion control, congestion causes throughput loss; with congestion control, there is no congestion and throughput stays highest.]
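The sender-side behavior can be sketched with a minimal AIMD-style model (illustrative only; InfiniBand's actual mechanism marks packets at congested switches and throttles the sender's injection rate rather than running this exact loop): the injection rate drops sharply on a congestion notification and recovers gradually afterwards.

```python
# Minimal AIMD-style congestion control sketch (hypothetical parameters):
# on a congestion notification the sender cuts its injection rate
# multiplicatively; otherwise it recovers additively up to the link rate.

def adjust_rate(rate, congested, max_rate=100.0, decrease=0.5, increase=5.0):
    if congested:
        return rate * decrease                # multiplicative decrease
    return min(max_rate, rate + increase)     # additive recovery, capped

rate = 100.0
trace = []
for congested in [True, True, False, False, False, False]:
    rate = adjust_rate(rate, congested)
    trace.append(rate)
print(trace)   # [50.0, 25.0, 30.0, 35.0, 40.0, 45.0]
```

Backing off before switch buffers overflow is what keeps throughput near the top of the plot instead of collapsing under in-cast load.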
21 INFINIBAND ACCELERATED SUPERCOMPUTING
- Speed of light: 200Gb/s data throughput, RDMA and GPUDirect RDMA, 3X better (lower) latency
- SHARP AI technology: AI acceleration engines, 2.5X higher AI performance
- SHIELD AI technology: self-healing network, 1000X faster recovery time
- UFM Cyber AI: data center cyber intelligence and analytics