[Figure: NVIDIA GPUDirect RDMA (GDR): pipelining large messages through host memory over InfiniBand.]
Efficient and Scalable Communication Middleware for Emerging Dense-GPU Clusters
Ching-Hsiang Chu
Advisor: Dhabaleswar K. Panda
Network Based Computing Lab, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH

Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Broader Impact on the HPC Community
• Expected Contributions

Trends in Modern HPC Architecture: Heterogeneous
• Multi-core/many-core processor technologies
• Accelerators/coprocessors: high compute density, high performance per watt
• High-performance interconnects (InfiniBand, Omni-Path, EFA): <1 usec latency, 100 Gbps+ bandwidth
• High-performance storage and compute devices: SSD, NVMe-SSD, NVRAM (node-local storage)
• Variety of programming models (MPI, PGAS, MPI+X)
• Representative dense-GPU systems: #1 Summit (27,648 GPUs), #2 Sierra (17,280 GPUs), #8 ABCI (4,352 GPUs), #10 Lassen (2,664 GPUs), #22 DGX SuperPOD (1,536 GPUs)

Trends in Modern Large-scale Dense-GPU Systems
• Scale-up interconnects (up to 150 GB/s)
– PCIe, NVLink/NVSwitch
– Infinity Fabric, Gen-Z, CXL
• Scale-out interconnects (up to 25 GB/s)
– InfiniBand, Omni-Path, Ethernet
– Cray Slingshot

GPU-enabled HPC Applications
• Various scientific applications have been ported to GPUs
– Significant reported speedups compared to the CPU versions
– High-resolution/high-precision results
• Examples:
– Lattice quantum chromodynamics: 23x faster than CPU [image: Derek Leinweber, CSSM, University of Adelaide]
– Weather simulation: 2.8x faster than CPU [Fuhrer O., Osuna C., Lapillonne X., Gysi T., Bianco M., Schulthess T., "Towards GPU-accelerated operational weather forecasting," GTC 2013]
– Wave propagation simulation: 25x faster than CPU [SPECFEM3D_GLOBE, https://geodynamics.org/cig/software/specfem3d_globe/]

GPU-enabled Emerging Deep Learning Applications
• Easy-to-use and high-performance frameworks
• 999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs)
• Wide range of applications
– Image classification
– Speech recognition
– Self-driving cars
– Healthcare
– Climate analytics
[Kurth T., Treichler S., Romero J., Mudigonda M., Luehr N., Phillips E., Mahesh A., Matheson M., Deslippe J., Fatica M., Houston M., "Exascale deep learning for climate analytics," SC 2018 (Gordon Bell Prize)]

GPU-Aware (CUDA-Aware) Communication Middleware
• MPI-based generic communication middleware (inside MVAPICH2)
– Supports and optimizes various communication patterns
– Overlaps data movement from the GPU with RDMA transfers
– (a minimal usage sketch follows after the Broad Challenge slide)
• DL-specific communication middleware
– Ring-based collective operations
– Optimized for DL workloads on GPU systems

Broad Challenge
Can we design a generic GPU-enabled communication middleware to fully exploit GPU resources and interconnects for traditional HPC and emerging ML/DL applications?
[Radar chart: naive, advanced, and ideal user profiles compared along the resource utilization, overlap, productivity, and performance axes; farther from the center is better.]
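As background for the GPU-aware middleware described above, the following is a minimal sketch of how a CUDA-aware MPI library (such as MVAPICH2 or MVAPICH2-GDR) is typically used from an application: device pointers are passed directly to MPI calls, and the middleware moves the data internally via GPUDirect RDMA or host staging. The ping-pong pattern, buffer size, and two-rank layout are illustrative assumptions, not taken from the slides.

```c
/* Minimal sketch of GPU-aware (CUDA-aware) MPI usage: device pointers are
 * passed directly to MPI calls.  Requires a CUDA-aware MPI build and at
 * least two ranks; the message size is arbitrary. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                    /* 1 Mi floats (4 MB) */
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);  /* device ptr */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* device ptr */

    /* Collectives work the same way: broadcast a device buffer directly. */
    MPI_Bcast(d_buf, n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

With a non-CUDA-aware MPI build, passing device pointers to MPI calls in this way is not valid; the application would have to stage data through host buffers explicitly.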
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Broader Impact on the HPC Community
• Expected Contributions

Problem Statements
• What kind of hardware capabilities can be leveraged to fully exploit the modern interconnects deployed in GPU clusters?
• What is the scope of communication patterns that can benefit from the proposed GPU-enabled communication middleware?
• How can GPU resources such as high-bandwidth memory (HBM) and the massive number of streaming multiprocessors (SMs) be leveraged to accelerate communication?
• What are the design considerations for a GPU-enabled communication middleware to efficiently utilize these hardware features?
• What performance benefits can be expected from the proposed GPU-enabled communication middleware?
• How can traditional HPC and DL applications take advantage of the proposed communication middleware without application-level modifications?

Research Framework
• Benchmarks and applications: OMB and Streaming benchmarks; traditional HPC applications (MILC, COSMO, AWP-ODC, HOOMD-blue); machine/deep learning applications (TensorFlow, CNTK)
• Programming models: Message Passing Interface (MPI); compute models: CUDA, OpenACC
• GPU-enabled communication middleware (this work): scalable streaming broadcast; efficient non-contiguous data processing (scheduling and transfer); GPU-enabled cooperative and link-efficient reduction operations
• Hardware capabilities exploited: GPUDirect RDMA, GDRCOPY, hardware multicast, inter-GPU direct load/store
• Modern HPC hardware: multi-core processors (Xeon, OpenPOWER), accelerators (GPU), interconnects (InfiniBand, PCIe, NVLink/NVSwitch)

Outline
• Introduction
• Problem Statement
• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations
• Broader Impact on the HPC Community
• Expected Contributions

Motivated Example #1 – Need for Scalable Broadcast
• Streaming and deep learning applications require
– Large-scale broadcast operations
– High computation-communication overlap
– No application-level modification
[Figure: a data source feeds online/offline streaming data to a data distributor / parameter server, which performs data-streaming-like broadcast operations to the worker CPUs and GPUs of GPU nodes 1-4.]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, March 2019.

Proposed Efficient and Scalable Solution
• Streaming data through the host
– Fine-tuned chunked data
– Three-stage pipeline (a simplified sketch follows below)
• Leverages IB scatter-gather and GDR features
– Frees up PCIe resources for applications
[Figure: the source calls MPI_Bcast(d_in, ...); data flows through (1) data preparation into a host intermediate buffer im_buf, (2) IB gather, and (3) IB hardware multicast with IB scatter (GDR write) through the destination HCAs directly into d_out on the destination GPUs, which call MPI_Bcast(d_out, ...).]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, March 2019.
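To illustrate the host-staged, chunked pipelining idea, the sketch below approximates it with pinned host staging buffers and plain MPI_Bcast calls. It is not the proposed design: the actual middleware uses IB hardware multicast, scatter-gather lists, and GDR writes inside MVAPICH2. The helper name pipelined_bcast, the chunk size, and the double-buffering scheme are illustrative assumptions; on non-root ranks the host-to-device copy of one chunk overlaps with the network broadcast of the next chunk.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stddef.h>

#define CHUNK (512 * 1024)   /* illustrative chunk size, not a tuned value */

/* Pipelined broadcast of a large device buffer, staged through pinned host
 * memory with two alternating staging buffers. */
static void pipelined_bcast(void *d_buf, size_t bytes, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    void *h_stage[2];
    cudaEvent_t free_evt[2];
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int b = 0; b < 2; b++) {
        cudaMallocHost(&h_stage[b], CHUNK);      /* pinned staging buffer */
        cudaEventCreate(&free_evt[b]);
        cudaEventRecord(free_evt[b], stream);    /* initially marked free */
    }

    size_t nchunks = (bytes + CHUNK - 1) / CHUNK;
    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * CHUNK;
        size_t len = (off + CHUNK <= bytes) ? CHUNK : bytes - off;
        int    b   = (int)(i & 1);

        cudaEventSynchronize(free_evt[b]);       /* buffer no longer in use */
        if (rank == root) {                      /* stage 1: device -> host */
            cudaMemcpyAsync(h_stage[b], (char *)d_buf + off, len,
                            cudaMemcpyDeviceToHost, stream);
            cudaStreamSynchronize(stream);
        }
        MPI_Bcast(h_stage[b], (int)len, MPI_BYTE, root, comm);  /* stage 2 */
        if (rank != root) {                      /* stage 3: host -> device */
            cudaMemcpyAsync((char *)d_buf + off, h_stage[b], len,
                            cudaMemcpyHostToDevice, stream);
            cudaEventRecord(free_evt[b], stream);
        }
    }
    cudaStreamSynchronize(stream);
    for (int b = 0; b < 2; b++) {
        cudaFreeHost(h_stage[b]);
        cudaEventDestroy(free_evt[b]);
    }
    cudaStreamDestroy(stream);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    const size_t bytes = 64u * 1024 * 1024;      /* 64 MB payload */
    void *d_buf;
    cudaMalloc(&d_buf, bytes);
    pipelined_bcast(d_buf, bytes, 0, MPI_COMM_WORLD);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```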
Performance Evaluation
• Evaluated on RI2 GPU nodes
[Figures: streaming-workload throughput (GB/s) at 4, 8, and 16 GPU nodes comparing Knomial-GDR, Ring-GDR-Pipeline, and the proposed TA-Zcpy-MCAST-GDR-Pipeline, which delivers up to 6.9x higher throughput; CA-CNTK image-classification speedup for AlexNet, VGG, and ResNet-50, up to 1.2x.]
*D. S. Banerjee, K. Hamidouche, and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," CloudCom 2016.
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, March 2019.

Performance Model Validation and Prediction
• Based on the architecture of the RI2 cluster
[Figures: an analytical latency model instantiated with RI2 system parameters; model-based estimates match experiments within 10% error for K-nomial-based, Ring-based, and MCAST-GDR-Opt broadcast, and the model predicts latency as the number of broadcast sources scales from 2 up to 2048.]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, March 2019.

Outline
• Introduction
• Problem Statement
• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions

Motivated Example #2 – Non-contiguous Data Transfer
• MPI derived datatypes are widely used for non-contiguous data transfer (a minimal sketch follows at the end of this section)
– Requires low-latency, high-overlap datatype processing
• Examples: quantum chromodynamics (MILC with QUDA); weather simulation (COSMO model)
Mike Clark, "GPU Computing with QUDA," Developer Technology Group, https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Clark_M_LQCD.pdf
M. Martinasso, G. Kwasniewski, S. R. Alam, T. C. Schulthess, and T. Hoefler, "A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers," SC 2016.

Existing GPU-enabled MPI Datatype Processing
• Common scenario: wastes computing resources
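The following is a minimal sketch of the non-contiguous transfer pattern behind Motivated Example #2: an MPI derived datatype (here MPI_Type_vector) describes a strided column of a 2D field that lives in GPU memory, and a CUDA-aware MPI library is assumed to pack, transfer, and unpack it. The field dimensions, the halo-exchange framing, and the rank pairing are illustrative assumptions, not taken from MILC or COSMO.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = rank ^ 1;                   /* pair ranks 0<->1, 2<->3, ...
                                              run with an even rank count */

    const int nx = 1024, ny = 1024;        /* local 2D field, row-major */
    double *d_field;
    cudaMalloc(&d_field, (size_t)nx * ny * sizeof(double));
    cudaMemset(d_field, 0, (size_t)nx * ny * sizeof(double));

    /* One column of the field: nx blocks of 1 double with stride ny. */
    MPI_Datatype column;
    MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* Exchange a boundary column with the paired rank; a GPU-aware MPI
     * library handles the packing of the strided device data. */
    MPI_Sendrecv(d_field + (ny - 2), 1, column, peer, 0,   /* send inner col */
                 d_field + (ny - 1), 1, column, peer, 0,   /* recv halo col  */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_field);
    MPI_Finalize();
    return 0;
}
```

A naive implementation of such a transfer launches packing kernels and serializes them with the copy and network stages, which is the resource waste that the existing GPU-enabled datatype processing discussion above refers to.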