RDMA-based Networking Technologies and Middleware for Next-Generation Clusters and Data Centers

Keynote Talk at KBNet ‘18 by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

High-End Computing (HEC): Towards Exascale

100 PFlops in 2016; 122 PFlops in 2018; 1 EFlops in 2020-2021?

Expected to have an ExaFlop system in 2020-2021!

Big Data – How Much Data Is Generated Every Minute on the Internet?

The global Internet population grew 7.5% from 2016 and now represents 3.7 billion people.
Courtesy: https://www.domo.com/blog/data-never-sleeps-5/

Resurgence of AI/Machine Learning/Deep Learning

Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/

Data Management and Processing on Modern Datacenters

• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (Offline): HDFS, MapReduce, Spark

Communication and Computation Requirements

(Figure: multi-tier data center — clients reach web/proxy servers (Apache) over the WAN, which forward to application servers (PHP) and database servers (MySQL) backed by storage; computation and communication requirements grow toward the back-end tiers.)

• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
• If not cached, the request is forwarded to the next tier: the application server
• The application server performs the business logic (CGI, Java servlets, etc.)
  – Retrieves appropriate data from the database to process the requests

Increasing Usage of HPC, Big Data and Deep Learning on Modern Datacenters

HPC (MPI, RDMA, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning!

Increasing need to run these applications on the Cloud!!

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

(Figure: a shared pool of physical compute resources progressively hosting Hadoop, Spark, and Deep Learning jobs side by side.)

Trends in Network Speed Acceleration

• Ethernet (1979 -): 10 Mbit/sec
• Fast Ethernet (1993 -): 100 Mbit/sec
• Gigabit Ethernet (1995 -): 1000 Mbit/sec
• ATM (1995 -): 155/622/1024 Mbit/sec
• Myrinet (1993 -): 1 Gbit/sec
• Fibre Channel (1994 -): 1 Gbit/sec
• InfiniBand (2001 -): 2 Gbit/sec (1X SDR)
• 10-Gigabit Ethernet (2001 -): 10 Gbit/sec
• InfiniBand (2003 -): 8 Gbit/sec (4X SDR)
• InfiniBand (2005 -): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
• InfiniBand (2007 -): 32 Gbit/sec (4X QDR)
• 40-Gigabit Ethernet (2010 -): 40 Gbit/sec
• InfiniBand (2011 -): 54.6 Gbit/sec (4X FDR)
• InfiniBand (2012 -): 2 x 54.6 Gbit/sec (4X Dual-FDR)
• 25-/50-Gigabit Ethernet (2014 -): 25/50 Gbit/sec
• 100-Gigabit Ethernet (2015 -): 100 Gbit/sec
• Omni-Path (2015 -): 100 Gbit/sec
• InfiniBand (2015 -): 100 Gbit/sec (4X EDR)
• InfiniBand (2016 -): 200 Gbit/sec (4X HDR)

A 100-fold increase in the last 17 years.

Available Interconnects and Protocols for Data Centers

(Figure: protocol and interconnect options — kernel-space TCP/IP sockets over Ethernet (with or without TOE) and over InfiniBand via IPoIB; user-space sockets-like paths with RSockets and SDP over InfiniBand; RDMA through verbs over iWARP (Ethernet), RoCE (Ethernet), and native InfiniBand; and OFI over Omni-Path — each with its own adapter and switch, at 1/10/25/40/50/100 Gb/s.)

Ethernet InfiniBand Ethernet InfiniBand InfiniBand Ethernet Ethernet InfiniBand Omni-Path Switch Switch Switch Switch Switch Switch Switch Switch Switch 1/10/25/40/ 1/10/25/40/ IPoIB RSockets RoCE IB Native 100 Gb/s 50/100 GigE- SDP iWARP 50/100 GigE TOE Network Based Computing Laboratory KBNet ‘18 13 Open Standard InfiniBand Networking Technology • Introduced in Oct 2000 • High Performance Data Transfer – Interprocessor communication and I/O – Low latency (<1.0 microsec), High bandwidth (up to 25 GigaBytes/sec -> 200Gbps), and low CPU utilization (5-10%) • Flexibility for LAN and WAN communication • Multiple Transport Services – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram – Provides flexibility to develop upper layers • Multiple Operations – Send/Recv – RDMA Read/Write – Atomic Operations (very unique) • high performance and scalable implementations of distributed locks, semaphores, collective communication operations • Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, …. Network Based Computing Laboratory KBNet ‘18 14 Communication in the Memory Semantics (RDMA Model)

Communication in the Memory Semantics (RDMA Model)

(Figure: initiator and target processors, each with registered memory segments, queue pairs (QPs), and send/receive completion queues (CQs); the target HCA generates the hardware ACK.)

• The initiator processor is involved only to: (1) post the send WQE and (2) pull the completed CQE from the send CQ
• No involvement from the target processor
• The send WQE carries information about the send buffer (multiple segments) and the receive buffer (single segment)

The two initiator-side steps are sketched below at the verbs level.
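The sketch below is illustrative only: it assumes an already-connected RC queue pair, a registered local buffer, and a remote virtual address/rkey obtained during connection setup. Real middleware adds error handling, selective signaling, and completion batching.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Write 'len' bytes from a registered local buffer into remote memory. */
int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *send_cq,
                    void *local_buf, uint32_t len, uint32_t lkey,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;        /* ask for a CQE */
    wr.wr.rdma.remote_addr = remote_addr;     /* registered by the peer */
    wr.wr.rdma.rkey        = rkey;

    /* Step 1: post the send WQE. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Step 2: pull the completed CQE from the send CQ.
     * The target CPU is never involved; its HCA places the data
     * and generates the hardware ACK. */
    struct ibv_wc wc;
    while (ibv_poll_cq(send_cq, 1, &wc) == 0)
        ;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}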

Large-scale InfiniBand Installations

• 139 IB clusters (27.8%) in the Jun ’18 Top500 list (http://www.top500.org); the #2 system (Sunway TaihuLight) also uses InfiniBand
• Installations in the Top 50 (19 systems):
  – 2,282,544 cores (Summit) at ORNL (1st)
  – 1,572,480 cores (Sierra) at LLNL (3rd)
  – 391,680 cores (ABCI) at AIST/Japan (5th)
  – 253,600 cores (HPC4) in Italy (13th)
  – 114,480 cores (Juwels Module 1) at FZJ/Germany (23rd)
  – 241,108 cores (Pleiades) at NASA/Ames (24th)
  – 220,800 cores (Pangea) in France (30th)
  – 144,900 cores (Cheyenne) at NCAR/USA (31st)
  – 72,000 cores (ITO – Subsystem A) in Japan (32nd)
  – 79,488 cores (JOLIOT-CURIE SKL) at CEA/France (34th)
  – 155,150 cores (JURECA) at FZJ/Germany (38th)
  – 72,800 cores Cray CS-Storm in US (40th)
  – 72,800 cores Cray CS-Storm in US (41st)
  – 78,336 cores (Electra) at NASA/Ames (43rd)
  – 124,200 cores (Topaz) at ERDC DSRC/USA (44th)
  – 60,512 cores NVIDIA DGX-1 at Facebook/USA (45th)
  – 60,512 cores (DGX Saturn V) at NVIDIA/USA (46th)
  – 113,832 cores (Damson) at AWE/UK (47th)
  – 72,000 cores (HPC2) in Italy (49th)
  – and many more!

High-speed Ethernet Consortium (10GE/25GE/40GE/50GE/100GE)

• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step
• Goal: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet (http://www.ethernetalliance.org)
• 40-Gbps (servers) and 100-Gbps Ethernet (backbones, switches, routers): IEEE 802.3 WG
• 25-Gbps Ethernet Consortium targeting 25/50 Gbps (July 2014) – http://25gethernet.org
• Energy-efficient and power-conscious protocols
  – On-the-fly link speed reduction for under-utilized links
• Ethernet Alliance Technology Forum looking forward to 2026
  – http://insidehpc.com/2016/08/at-ethernet-alliance-technology-forum/

TOE and iWARP Accelerators

• TCP Offload Engines (TOE)
  – Hardware acceleration for the entire TCP/IP stack
  – Initially patented by Tehuti Networks
  – Strictly speaking, refers to the IC on the network adapter that implements TCP/IP; in practice, usually refers to the entire network adapter
• Internet Wide-Area RDMA Protocol (iWARP)
  – Standardized by IETF and the RDMA Consortium
  – Supports acceleration features (like IB) for Ethernet
  – http://www.ietf.org and http://www.rdmaconsortium.org

RDMA over Converged Enhanced Ethernet (RoCE)

(Figure: network stack comparison — IB verbs and IB transport run over the IB network and InfiniBand link layer (native IB), over the Ethernet link layer (RoCE), or over UDP/IP and the Ethernet link layer (RoCE v2); the packet header comparison shows RoCE carrying the IB GRH and BTH+ behind an Ethernet L2 header, while RoCE v2 replaces the GRH with IP and UDP headers.)

• Takes advantage of IB and Ethernet
  – Software written with IB verbs
  – Link layer is Converged (Enhanced) Ethernet (CE)
  – 100 Gb/s support from the latest EDR and ConnectX-3 Pro adapters
• Pros (IB vs. RoCE):
  – Works natively in Ethernet environments; the entire Ethernet management ecosystem is available
  – Has all the benefits of IB verbs
  – Link layer is very similar to the link layer of native IB, so there are no missing features
• RoCE v2: additional benefits over RoCE
  – Traditional network management tools apply
  – ACLs (metering, accounting, firewalling)
  – IGMP snooping for optimized multicast
  – Network monitoring tools

Courtesy: OFED, Mellanox
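Because RoCE keeps the verbs interface, the same discovery code runs unchanged whether the adapter is InfiniBand, RoCE, or iWARP; only the reported link layer differs. The standalone sketch below uses standard libibverbs calls and is not taken from any particular middleware.

#include <infiniband/verbs.h>
#include <stdio.h>

/* List RDMA-capable devices and report the link layer of port 1. */
int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs)
        return 1;

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0) {
            const char *ll =
                (port.link_layer == IBV_LINK_LAYER_ETHERNET)   ? "Ethernet (RoCE/iWARP)" :
                (port.link_layer == IBV_LINK_LAYER_INFINIBAND) ? "InfiniBand" : "unspecified";
            printf("%-16s port 1: link layer = %s, state = %d\n",
                   ibv_get_device_name(devs[i]), ll, port.state);
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}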

HSE Scientific Computing Installations

• 171 HSE compute systems with ranking in the Jun ’18 Top500 list
  – 38,400-core installation in China (#95) – new
  – 38,400-core installation in China (#96) – new
  – 38,400-core installation in China (#97) – new
  – 39,680-core installation in China (#99)
  – 66,560-core installation in China (#157)
  – 66,280-core installation in China (#159)
  – 64,000-core installation in China (#160)
  – 64,000-core installation in China (#161)
  – 72,000-core installation in China (#164)
  – 64,320-core installation in China (#185) – new
  – 78,000-core installation in China (#187)
  – 75,776-core installation in China (#188) – new
  – 59,520-core installation in China (#192)
  – 59,520-core installation in China (#193)
  – 28,800-core installation in China (#195) – new
  – 62,400-core installation in China (#197) – new
  – 64,800-core installation in China (#198)
  – 66,000-core installation in China (#209) – new
  – and many more!

Omni-Path Fabric Overview

• Derived from InfiniBand
• Layer 1.5: Link Transfer Protocol
  – Features: Traffic Flow Optimization, Packet Integrity Protection, Dynamic Lane Switching
  – Error detection/replay occurs in Link Transfer Packet units
  – Retransmit request via NULL LTP; carries replay command flit
• Layer 2: Link Layer
  – Supports 24 fabric addresses
  – Allows 10KB of L4 payload; 10,368-byte max packet size
  – Congestion management: Adaptive/Dispersive Routing, Explicit Congestion Notification
  – QoS support: Traffic Class, Service Level, Service Channel and Virtual Lane
• Layer 3: Data Link Layer
  – Fabric addressing, switching, resource allocation and partitioning support

Courtesy: Intel Corporation

Large-scale Omni-Path Installations

• 39 Omni-Path clusters (7.8%) in the Jun ’18 Top500 list (http://www.top500.org)
  – 570,020 cores (Nurion) at KISTI/South Korea (11th)
  – 556,104 cores (Oakforest-PACS) at JCAHPC in Japan (12th)
  – 367,024 cores (Stampede2) at TACC in USA (15th)
  – 312,936 cores (Marconi XeonPhi) at CINECA in Italy (18th)
  – 135,828 cores (Tsubame 3.0) at TiTech in Japan (19th)
  – 153,216 cores (MareNostrum) at BSC in Spain (22nd)
  – 127,520 cores (Cobra) in Germany (28th)
  – 55,296 cores (Mustang) at AFRL/USA (48th)
  – 95,472 cores (Quartz) at LLNL in USA (63rd)
  – 95,472 cores (Jade) at LLNL in USA (64th)
  – 53,300 cores (Makman-3) at Saudi Aramco/Saudi Arabia (78th)
  – 34,560 cores (Gaffney) at Navy DSRC/USA (85th)
  – 34,560 cores (Koehr) at Navy DSRC/USA (86th)
  – 49,432 cores (Mogon II) in Germany (87th)
  – 38,553 cores (Molecular Simulator) in Japan (93rd)
  – 35,280 cores (Quriosity) at BASF in Germany (94th)
  – 54,432 cores (Marconi Xeon) at CINECA in Italy (98th)
  – 46,464 cores (Peta4) at Cambridge/UK (101st)
  – 53,352 cores (Grizzly) at LANL in USA (136th)
  – and many more!

IB, Omni-Path, and HSE: Feature Comparison

Features IB iWARP/HSE RoCE RoCE v2 Omni-Path

Hardware Acceleration Yes Yes Yes Yes Yes

RDMA Yes Yes Yes Yes Yes

Congestion Control Yes Optional Yes Yes Yes

Multipathing Yes Yes Yes Yes Yes

Atomic Operations Yes No Yes Yes Yes

Multicast Optional No Optional Optional Optional

Data Placement Ordered Out-of-order Ordered Ordered Ordered

Prioritization Optional Optional Yes Yes Yes

Fixed BW QoS (ETS) No Optional Yes Yes Yes

Ethernet Compatibility No Yes Yes Yes Yes

TCP/IP Compatibility Yes (using IPoIB) Yes Yes Yes Yes (using IPoIB)

Designing RDMA-based Communication and I/O Libraries for Clusters and Middleware: Challenges

(Figure: applications and cluster/data center middleware — MPI, PGAS, Memcached, HDFS, MapReduce, HBase, and gRPC/TensorFlow — sit on programming models (sockets; changes needed for RDMA?); beneath them, an RDMA-based communication and I/O library must provide the communication substrate, threaded models and synchronization, virtualization (SR-IOV), I/O and file systems, QoS and fault tolerance, and performance tuning, on top of commodity multi-/many-core systems and accelerators, networking technologies (InfiniBand, 1/10/40/100 GigE, and intelligent NICs), and storage technologies (HDD, SSD, NVM, and NVMe-SSD).)

Designing RDMA-based Middleware for Clusters and Datacenters

• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out – Caffe and TensorFlow
• Virtualization Support with SR-IOV and Containers

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

(Figure: application kernels/applications run over programming models — MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc. — supported by a communication library or runtime providing point-to-point communication, collective communication, energy awareness, synchronization and locks, I/O and file systems, and fault tolerance, on top of networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (GPU and FPGA). Co-design opportunities and challenges across these layers determine performance, scalability, and resilience.)

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,925 organizations in 86 countries
  – More than 487,000 (> 0.48 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Jul ’18 ranking):
    • 2nd-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 12th, 556,104 cores (Oakforest-PACS) in Japan
    • 15th, 367,024 cores (Stampede2) at TACC
    • 24th, 241,108 cores (Pleiades) at NASA, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade

Architecture of MVAPICH2 Software Family

(Figure: MVAPICH2 architecture — high-performance parallel programming models (Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)) sit on a high-performance and scalable communication runtime offering diverse APIs and mechanisms: point-to-point primitives, collective algorithms, job startup, remote memory access, energy awareness, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis. The runtime supports modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL), NVIDIA GPGPU), with transport protocols RC, XRC, UD, and DC, transport mechanisms such as SR-IOV, multi-rail, shared memory, CMA, IVSHMEM, and XPMEM*, and modern features like UMR, ODP, NVLink*, and CAPI* (* upcoming).)

One-way Latency: MPI over IB with MVAPICH2

(Charts: small- and large-message one-way MPI latency for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-5-EDR, and Omni-Path; small-message latencies range from 0.98 us to 1.19 us across the five fabrics.)

Test beds: TrueScale-QDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path on 3.1 GHz deca-core (Haswell) Intel nodes with PCI Gen3 and the corresponding IB or Omni-Path switch; ConnectX-3-FDR on 2.8 GHz deca-core (IvyBridge) Intel nodes with PCI Gen3 and an IB switch.

Bandwidth: MPI over IB with MVAPICH2

(Charts: unidirectional and bidirectional MPI bandwidth on the same test beds; peak unidirectional bandwidth reaches 12,590 MBytes/sec and peak bidirectional bandwidth reaches 24,136 MBytes/sec, with the other fabrics peaking at 12,366, 12,358, 6,356, and 3,373 MBytes/sec unidirectional and 22,564, 21,983, 12,161, and 6,228 MBytes/sec bidirectional.)

Hardware Multicast-aware MPI_Bcast on Stampede

(Charts: MPI_Bcast latency with the default and the hardware-multicast-aware designs at 102,400 cores for small and large messages, plus scaling with the number of nodes for 16-byte and 32 KByte messages; the multicast-aware design consistently lowers latency. Test bed: ConnectX-3-FDR (54 Gbps), 2.7 GHz dual octa-core (SandyBridge) Intel nodes, PCI Gen3, Mellanox IB FDR switch.)

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

At Sender:   MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);

The GPU-to-GPU data movement happens inside MVAPICH2: high performance and high productivity.
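Putting the two calls above into a complete, hedged example: with a CUDA-aware MPI such as MVAPICH2-GDR, the device pointer returned by cudaMalloc can be passed straight to MPI, and the library stages or GPUDirect-transfers the data internally. This is a minimal two-rank sketch, not taken from the MVAPICH2 sources.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 /* 1M floats */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0) {
        cudaMemset(devbuf, 0, n * sizeof(float));
        /* Device pointer passed directly; no explicit cudaMemcpy to host. */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}

Launched as usual (e.g., mpirun -np 2 ./a.out); with MVAPICH2-GDR the GPU path typically also needs MV2_USE_CUDA=1 set in the environment.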

Optimized MVAPICH2-GDR Design

(Charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth for MV2 without GDR vs. MV2-GDR 2.3a — small-message latency drops to 1.88 us, with roughly 10-11x gains in latency and bi-bandwidth and about 9x in bandwidth over the non-GDR path.)

Platform: MVAPICH2-GDR 2.3a; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct RDMA.

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

(Charts: normalized execution time with the default, callback-based, and event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs).)

• 2X improvement on 32 GPUs
• 30% improvement on 96 GPUs (8 GPUs/node)
• Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
• On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16

Designing RDMA-based Middleware for Clusters and Datacenters

• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out – Caffe and TensorFlow
• Virtualization Support with SR-IOV and Containers

Data-Center Service Primitives

• Common services needed by data centers
  – Better resource management
  – Higher performance provided to higher layers
• Service primitives
  – Soft Shared State
  – Distributed Lock Management
  – Global Memory Aggregator
• Network-based designs
  – RDMA, remote atomic operations

Soft Shared State

(Figure: multiple data-center applications coordinate through a soft shared state substrate using get and put operations.)

Active Caching

• Dynamic data caching – challenging!
• Cache consistency and coherence become more important than in the static case

(Figure: proxy nodes cache responses for user requests while back-end nodes apply updates.)

RDMA-based Client Polling Design

(Figure: the front-end node serves a request by first performing an RDMA read of a version value on the back-end node; if the version matches the cached entry, the cached response is returned (cache hit), otherwise the request goes to the back end and a cache-miss response is generated. A sketch of the version check follows below.)
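A hedged sketch of the version check at the heart of this design: the front end issues a one-sided RDMA Read of a version counter that the back end bumps on every update, then compares it with the version stored alongside the cached response. The helper below assumes a connected QP and a pre-exchanged remote address/rkey; it is illustrative, not code from the cited system.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Read the back end's 64-bit version counter with a one-sided RDMA Read. */
int read_remote_version(struct ibv_qp *qp, struct ibv_cq *send_cq,
                        struct ibv_mr *local_mr,          /* 8-byte registered buffer */
                        uint64_t remote_addr, uint32_t rkey,
                        uint64_t *version)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_mr->addr,
        .length = sizeof(uint64_t),
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_READ;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(send_cq, 1, &wc) == 0)  /* back-end CPU stays uninvolved */
        ;
    if (wc.status != IBV_WC_SUCCESS)
        return -1;

    *version = *(uint64_t *)local_mr->addr;
    return 0;
}

The cached response is served only when the value returned here still matches the version recorded when the entry was cached; otherwise the request falls through to the back end.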

Active Caching – Performance Benefits

(Charts: data-center throughput for traces with increasing update rate and throughput under increasing load (compute threads), comparing no caching, invalidate-all, and dependency-list schemes.)

• Higher overall performance – up to an order of magnitude
• Performance is sustained under loaded conditions

S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda, Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand, CCGrid 2005

Resource Monitoring Services

• Traditional approaches
  – Coarse-grained in nature
  – Assume resource usage is consistent throughout the monitoring granularity (on the order of seconds)
• This assumption is no longer valid
  – Resource usage is becoming increasingly divergent
• Fine-grained monitoring is desired but has additional costs
  – High overheads, less accurate, slow in response
• Can we design a fine-grained resource monitoring scheme with low overhead and accurate resource usage?

Synchronous Resource Monitoring using RDMA (RDMA-Sync)

(Figure: RDMA-Sync design — on the back-end node, kernel-space resource-usage data structures (exposed via /proc) are made visible in registered memory; the monitoring process on the front-end node reads them directly over RDMA, so resource usage can be sampled at fine granularity without involving the back-end CPU or its application threads.)

Designing RDMA-based Middleware for Clusters and Datacenters

• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out – Caffe and TensorFlow
• Virtualization Support with SR-IOV and Containers

Architecture Overview of Memcached

• Three-layer architecture of Web 2.0
  – Web servers, Memcached servers, and database servers

• Memcached is a core component of the Web 2.0 architecture
• Distributed caching layer
  – Allows spare memory from multiple nodes to be aggregated
  – General purpose; typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive

Memcached-RDMA Design

(Figure: a master thread accepts incoming sockets and verbs clients and hands each off to a worker thread; sockets workers and RDMA/verbs workers share the same Memcached data structures — hash table, slabs, and items.)

• Server and client perform a negotiation protocol
  – The master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• The Memcached server can serve both sockets and verbs clients simultaneously
• Memcached applications need not be modified; the client uses the verbs interface if available (see the sketch below)
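Because the design keeps the standard Memcached semantics, an application written against the libmemcached API does not change; the RDMA-capable client library decides underneath whether to talk sockets or verbs. The snippet below is ordinary libmemcached usage, shown only to illustrate that point — the hostname, port, and key/value are placeholders.

#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);

    /* "mc-server" / 11211 are placeholders for the Memcached server. */
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "mc-server", 11211, &rc);
    memcached_server_push(memc, servers);
    memcached_server_list_free(servers);

    const char *key = "user:42";
    const char *val = "cached-profile-blob";
    rc = memcached_set(memc, key, strlen(key), val, strlen(val),
                       (time_t)0, (uint32_t)0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t len = 0;
    uint32_t flags = 0;
    char *fetched = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("got %zu bytes: %.*s\n", len, (int)len, fetched);
        free(fetched);
    }

    memcached_free(memc);
    return 0;
}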

Memcached Performance (FDR Interconnect)

(Charts: Memcached GET latency vs. message size and throughput vs. number of clients, OSU-IB (FDR) vs. IPoIB (FDR), measured on TACC Stampede (Intel SandyBridge cluster, IB FDR).)

• Memcached GET latency
  – 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
  – 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
  – Latency reduced by nearly 20X
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec
  – Nearly 2X improvement in throughput

Designing RDMA-based Middleware for Clusters and Datacenters

• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out – Caffe and TensorFlow
• Virtualization Support with SR-IOV and Containers

The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Support for Singularity and Docker
• http://hibd.cse.ohio-state.edu
• Users base: 290 organizations from 34 countries
• More than 27,300 downloads from the project site

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen on OSU-RI2 (EDR)

(Charts: execution time of RandomWriter and TeraGen, OSU-IB (EDR) vs. IPoIB (EDR), on a cluster of 8 nodes with a total of 64 maps.)

• RandomWriter: 3x improvement over IPoIB for 80-160 GB file sizes
• TeraGen: 4x improvement over IPoIB for 80-240 GB file sizes

Performance Evaluation of RDMA-Spark on SDSC Comet – HiBench PageRank

(Charts: HiBench PageRank total time for Huge, BigData, and Gigantic data sizes, RDMA vs. IPoIB, on 32 worker nodes (768 cores) and 64 worker nodes (1536 cores).)

• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M, 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes / 768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes / 1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

Designing RDMA-based Middleware for Clusters and Datacenters

• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out – Caffe and TensorFlow
• Virtualization Support with SR-IOV and Containers

Deep Learning: New Challenges for Communication Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• Existing state of the art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance, but for small and medium message sizes only
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

(Figure: scale-up performance (cuDNN, cuBLAS, NCCL) vs. scale-out performance (MPI, gRPC, Hadoop), with the proposed co-designs targeting both axes.)

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’17)

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs

(Charts: allreduce latency vs. message size on 16 GPUs — MVAPICH2-GDR is roughly 1.2X to 3X better than NCCL2 depending on message size.)

*Will be available with the upcoming MVAPICH2-GDR 2.3b

Platform: Intel Xeon (Broadwell) nodes, each with a dual-socket CPU, 1 K-80 GPU, and an EDR InfiniBand interconnect.
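The operation being compared here is an ordinary MPI_Allreduce issued on GPU buffers; with a CUDA-aware, GDR-enabled MPI the reduction below runs on device memory directly. A minimal, hedged sketch (not the benchmark code behind the charts):

#include <mpi.h>
#include <cuda_runtime.h>

/* Allreduce a gradient-sized buffer that lives in GPU memory. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int n = 8 * 1024 * 1024;               /* 8M floats, DL-sized message */
    float *d_grad, *d_sum;
    cudaMalloc((void **)&d_grad, n * sizeof(float));
    cudaMalloc((void **)&d_sum,  n * sizeof(float));
    cudaMemset(d_grad, 0, n * sizeof(float));    /* stand-in for local gradients */

    /* Device pointers go straight into MPI; the library reduces them
     * using GPUDirect RDMA / CUDA kernels as appropriate. */
    MPI_Allreduce(d_grad, d_sum, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_grad);
    cudaFree(d_sum);
    MPI_Finalize();
    return 0;
}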

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

• 16 GPUs (4 nodes): MVAPICH2-GDR(*) vs. Baidu-Allreduce and OpenMPI 3.0

(Charts: allreduce latency across small, medium, and large message ranges — annotated gains show MVAPICH2 up to ~4X, ~10X, and ~30X better than OpenMPI, MVAPICH2 ~2X better than Baidu for large messages, and OpenMPI ~5X slower than Baidu.)

*Available with MVAPICH2-GDR 2.3a

OSU-Caffe: Scalable Deep Learning

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset

(Chart: GoogLeNet (ImageNet) training time on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); configurations marked "invalid use case" cannot be run with default Caffe.)

OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/

RDMA-TensorFlow Distribution

• High-performance design of TensorFlow over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand support at the verbs level for gRPC and TensorFlow
  – RDMA-based data communication
  – Adaptive communication protocols
  – Dynamic message chunking and accumulation
  – Support for RDMA device selection
  – Easily configurable for different protocols (native InfiniBand and IPoIB)
• Current release: 0.9.1
  – Based on Google TensorFlow 1.3.0
  – Tested with Mellanox InfiniBand adapters (e.g., EDR), NVIDIA GPGPU K80, CUDA 8.0, and cuDNN 5.0
  – http://hidl.cse.ohio-state.edu

Performance Benefit for TensorFlow (Inception3)

(Charts: Inception3 images/second vs. batch size per GPU on 4, 8, and 12 nodes, comparing default gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps).)

• TensorFlow Inception3 performance evaluation on an IB EDR cluster
  – Up to 47% performance speedup over default gRPC (IPoIB) for 4 nodes
  – Up to 116% performance speedup over default gRPC (IPoIB) for 8 nodes
  – Up to 153% performance speedup over default gRPC (IPoIB) for 12 nodes

Concluding Remarks

• Next-generation clusters and data centers need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
• Presented an overview of networking technology trends exploiting RDMA
• Presented some of the RDMA-based approaches and results along these directions
• These approaches enable the HPC, Big Data, Deep Learning, and Cloud communities to take advantage of modern RDMA-based networking technologies
• Many other open issues still need to be solved

Funding Acknowledgments

Funding Support by (sponsor logos)

Equipment Support by

Personnel Acknowledgments

Current Students (Graduate) Current Students (Undergraduate) Current Research Scientists Current Research Specialist – A. Awan (Ph.D.) – J. Hashmi (Ph.D.) – N. Sarkauskas (B.S.) – X. Lu – J. Smith – M. Bayatpour (Ph.D.) – H. Javed (Ph.D.) – V. Gangal (B.S.) – H. Subramoni – M. Arnold – S. Chakraborthy (Ph.D.) – P. Kousha (Ph.D.) – C.-H. Chu (Ph.D.) – D. Shankar (Ph.D.) Current Post-doc – S. Guganani (Ph.D.) – H. Shi (Ph.D.) – A. Ruhela – K. Manian Past Students Past Research Scientist – W. Huang (Ph.D.) – J. Liu (Ph.D.) – R. Rajachandrasekar (Ph.D.) – A. Augustine (M.S.) – K. Hamidouche – P. Balaji (Ph.D.) – W. Jiang (M.S.) – M. Luo (Ph.D.) – G. Santhanaraman (Ph.D.) – S. Sur – R. Biswas (M.S.) – J. Jose (Ph.D.) – A. Mamidala (Ph.D.) – A. Singh (Ph.D.) – S. Bhagvat (M.S.) – S. Kini (M.S.) – G. Marsh (M.S.) – J. Sridhar (M.S.) Past Programmers – S. Sur (Ph.D.) – A. Bhat (M.S.) – M. Koop (Ph.D.) – V. Meshram (M.S.) – D. Bureddy – H. Subramoni (Ph.D.) – D. Buntinas (Ph.D.) – K. Kulkarni (M.S.) – A. Moody (M.S.) – J. Perkins – K. Vaidyanathan (Ph.D.) – L. Chai (Ph.D.) – R. Kumar (M.S.) – S. Naravula (Ph.D.) – A. Vishnu (Ph.D.) – B. Chandrasekharan (M.S.) – S. Krishnamoorthy (M.S.) – R. Noronha (Ph.D.) – N. Dandapanthula (M.S.) – K. Kandalla (Ph.D.) – X. Ouyang (Ph.D.) – J. Wu (Ph.D.) – V. Dhanraj (M.S.) – M. Li (Ph.D.) – S. Pai (M.S.) – W. Yu (Ph.D.) – T. Gangadharappa (M.S.) – P. Lai (M.S.) – S. Potluri (Ph.D.) – J. Zhang (Ph.D.) – K. Gopalakrishnan (M.S.) Past Post-Docs – D. Banerjee – J. Lin – S. Marcarelli – X. Besseron – M. Luo – J. Vienne – H.-W. Jin – E. Mancini – H. Wang Network Based Computing Laboratory KBNet ‘18 60 Multiple Positions Available in My Group

• Looking for bright and enthusiastic personnel to join as
  – Post-doctoral researchers
  – PhD students
  – MPI programmer/software engineer
  – Hadoop/Spark/Big Data programmer/software engineer
  – Deep Learning programmer/software engineer
• If interested, please contact me at this conference and/or send an e-mail to [email protected]

Thank You!
[email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
