INFINIBAND IN-NETWORK COMPUTING
MUG, August 2020

THE NEW SCIENTIFIC COMPUTING WORLD
[Diagram: InfiniBand connects the supercomputer, edge appliances, the network, and storage]
2 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING
[Diagram: the InfiniBand Data Processing Unit (DPU) connects the supercomputer's GPUs and CPUs, delivering software-defined, hardware-accelerated in-network computing through pre-configured and programmable DPUs]
3 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING
The InfiniBand Data Processing Unit (DPU) delivers software-defined, hardware-accelerated in-network computing:
- Programmable DPUs: DPU cores and MMU, data pre-processing, user-defined algorithms
- Pre-configured DPUs: SHARP (data reductions), MPI Tag Matching, SHIELD (self-healing resiliency), NVMe over Fabrics, data security and tenant isolation
- Speed of light: 200G end-to-end, extremely low latency, RDMA, GPUDirect RDMA, GPUDirect Storage, enhanced adaptive routing and congestion control, smart topologies
4 IN-NETWORK COMPUTING ACCELERATED SUPERCOMPUTING
[Diagram: the OpenSNAPI software stack on the DPU. Applications and services (switching, routing, telemetry, storage, protection, analytics, elastic QoS, isolation, acceleration) sit on an SDK and libraries, on top of hardware acceleration engines for communication, computation, storage, and security]
5 INFINIBAND TECHNOLOGY FUNDAMENTALS
[Diagram: GPU, DPU, and CPU end-points]
- Smart end-points
- Architected to scale
- Centralized management
- Standard
6 HDR 200G INFINIBAND ACCELERATES NEXT GENERATION HPC AND AI SUPERCOMPUTERS (EXAMPLES)
9 PetaFLOPS 7.5 PetaFLOPS 3K HDR Nodes 8K HDR Nodes 35.5 PetaFLOPS 3K HDR Nodes 2K HDR Nodes 16 PetaFLOPS Dragonfly+ Topology 2K HDR Nodes Dragonfly+ Topology Dragonfly+ Topology Dragonfly+ Topology Fat-Tree Topology
23 PetaFLOPS 27.6 PetaFLOPS HPC/AI Cloud HDR 23.5 PetaFLOPS 5.6K HDR Nodes 3K HDR Nodes HDR InfiniBand Supercomputers 8K HDR Nodes Dragonfly+ Topology Fat-Tree Topology Fat-Tree Topology
7 SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)
In-network, tree-based aggregation mechanism:
- Multiple simultaneous outstanding operations
- For HPC (MPI / SHMEM) and distributed machine learning applications
- Scalable, high-performance collective offload
- Barrier, Reduce, All-Reduce, Broadcast, and more
- Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
- Integer and floating-point, 16/32/64 bits
[Diagram: hosts send data into a switch tree; each switch aggregates its inputs and the aggregated result is distributed back to the hosts]
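The tree aggregation above can be sketched as a toy model (illustrative only, not the actual SHARP protocol or its wire format): each "switch" level combines the partial results of its children, so the fabric, rather than the hosts, performs the reduction in a logarithmic number of hops.

```python
# Sketch of SHARP-style in-network tree reduction (hypothetical model):
# each switch aggregates the partial results of up to `radix` children,
# so hosts receive the final value after O(log n) tree levels instead
# of exchanging O(n) messages among themselves.

def tree_reduce(values, radix=2, op=sum):
    """Reduce per-host values up a switch tree of the given radix."""
    level = list(values)
    while len(level) > 1:
        # Each "switch" combines up to `radix` children into one partial result.
        level = [op(level[i:i + radix]) for i in range(0, len(level), radix)]
    return level[0]

hosts = [1, 2, 3, 4, 5]        # per-host partial results
print(tree_reduce(hosts))      # 15, same result as a host-side sum allreduce
```

The same tree supports the other listed operations by swapping the aggregation function, e.g. `tree_reduce(hosts, op=max)` for a Max reduction.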
9 SHARP ALLREDUCE PERFORMANCE ADVANTAGES
Providing flat latency, 7X higher performance
10 INFINIBAND SHARP AI PERFORMANCE ADVANTAGE
2.5X higher performance
11 INFINIBAND MPI TAG MATCHING HARDWARE ENGINE
[Flow diagram: software posts receive buffers with tags to the expected-message tag-matching list. The hardware matches each arriving message against that list: an expected message is scattered directly to the local buffer (gathering the remote data first if it is a rendezvous message), while an unexpected message is held in the unexpected-message list for matching in software.]
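The matching flow can be modeled with a small sketch (a simplified illustration, not the hardware engine or the MPI library's internal data structures): posted receives form the expected list, and arrivals either match a posted tag or wait on the unexpected list.

```python
# Simplified model of MPI tag matching (illustrative only): posted
# receives form the "expected" list; an arriving message that matches a
# posted tag is delivered immediately (the hardware scatters it to the
# local buffer), while a non-matching arrival is queued on the
# "unexpected" list until a matching receive is posted.

from collections import deque

class TagMatcher:
    def __init__(self):
        self.expected = deque()    # tags of posted receives, not yet matched
        self.unexpected = deque()  # (tag, payload) arrivals with no posted receive

    def post_recv(self, tag):
        # First check already-arrived (unexpected) messages for this tag.
        for msg in self.unexpected:
            if msg[0] == tag:
                self.unexpected.remove(msg)
                return msg[1]          # delivered from the unexpected queue
        self.expected.append(tag)      # otherwise the receive stays posted
        return None

    def arrive(self, tag, payload):
        if tag in self.expected:       # expected: deliver to the local buffer
            self.expected.remove(tag)
            return payload
        self.unexpected.append((tag, payload))  # unexpected: queue for later
        return None

m = TagMatcher()
m.arrive(7, "early")       # no receive posted yet -> unexpected list
print(m.post_recv(7))      # matches the queued message -> early
m.post_recv(9)             # posted receive, waits
print(m.arrive(9, "late")) # matches the posted receive -> late
```

The point of offloading this loop to hardware is that matching and delivery proceed while the CPU computes, which is what enables the overlap numbers on the following slides.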
13 HARDWARE TAG MATCHING PERFORMANCE ADVANTAGES
1.8X higher MPI_Iscatterv performance on TACC Frontera
14 HARDWARE TAG MATCHING PERFORMANCE ADVANTAGES
Nearly 100% compute-communication overlap
15 GPUDIRECT
16 EFFICIENT COMMUNICATION FOR ACCELERATED TRAINING
GPUDirect™ RDMA: 10X better latency and bandwidth, 3X faster deep learning
17 QUALITY OF SERVICE
18 NETWORK CONGESTION TYPES
In-network Congestion In-cast Congestion
Network Network
Solution: Adaptive Routing Solutions: Congestion Control
19 IN-NETWORK CONGESTION: ADAPTIVE ROUTING
[Figure: mpiGraph results with static routing vs. adaptive routing]
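The difference the mpiGraph comparison illustrates can be sketched with a toy model (illustrative only, not the switch's actual routing algorithm): static routing pins each flow to one output port, so colliding flows congest it, while adaptive routing picks the least-loaded port per packet.

```python
# Toy model of static vs. adaptive routing (illustrative): static routing
# hashes a flow to a fixed output port, so two flows can collide on one
# port; adaptive routing selects the least-loaded port per packet,
# spreading traffic across equal-cost paths.

def static_route(flow_id, num_ports):
    return flow_id % num_ports          # same port for every packet of a flow

def adaptive_route(queue_depths):
    # Pick the output port with the shallowest queue.
    return min(range(len(queue_depths)), key=queue_depths.__getitem__)

queues = [0, 0]                         # per-port queue occupancy
for pkt in range(6):                    # two flows, three packets each
    flow = (pkt % 2) * 2                # flow ids 0 and 2 both hash to port 0
    queues[static_route(flow, len(queues))] += 1
print(queues)                           # [6, 0]: one port congested, one idle

queues = [0, 0]
for pkt in range(6):
    queues[adaptive_route(queues)] += 1
print(queues)                           # [3, 3]: load balanced across ports
```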
20 INFINIBAND CONGESTION CONTROL
[Plot: throughput vs. congestion factor. Without congestion control, congestion causes throughput loss; with congestion control, there is no congestion and throughput stays highest.]
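The sender-side behavior can be sketched with a minimal AIMD-style model (illustrative only; InfiniBand's actual mechanism marks packets at congested switches and throttles the sender's injection rate rather than running this exact loop): the injection rate drops sharply on a congestion notification and recovers gradually afterwards.

```python
# Minimal AIMD-style congestion control sketch (hypothetical parameters):
# on a congestion notification the sender cuts its injection rate
# multiplicatively; otherwise it recovers additively up to the link rate.

def adjust_rate(rate, congested, max_rate=100.0, decrease=0.5, increase=5.0):
    if congested:
        return rate * decrease                # multiplicative decrease
    return min(max_rate, rate + increase)     # additive recovery, capped

rate = 100.0
trace = []
for congested in [True, True, False, False, False, False]:
    rate = adjust_rate(rate, congested)
    trace.append(rate)
print(trace)   # [50.0, 25.0, 30.0, 35.0, 40.0, 45.0]
```

Backing off before switch buffers overflow is what keeps throughput near the top of the plot instead of collapsing under in-cast load.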
21 INFINIBAND ACCELERATED SUPERCOMPUTING
- Speed of light: 200Gb/s data throughput, RDMA and GPUDirect RDMA, 3X better (lower) latency
- SHARP AI technology: AI acceleration engines, 2.5X higher AI performance
- SHIELD AI technology: self-healing network, 1000X faster recovery time
- UFM Cyber AI: data center cyber intelligence and analytics