Scalable and Distributed DNN Training on Modern HPC Systems

Talk at HPC-AI Competition (SCAsia ’19) by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Increasing Usage of HPC, Big Data and Deep Learning

• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the cloud!!

Drivers of Modern HPC Cluster Architectures

• High Performance Interconnects: InfiniBand (<1 µs latency, 200 Gbps bandwidth)
• Multi-core Processors
• Accelerators / Coprocessors: high compute density, high performance/watt (>1 TFlop DP on a chip)
• SSD, NVMe-SSD, NVRAM

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Azure, etc.

[Photos: Summit, Sierra, Sunway TaihuLight, K Computer]

Scale-up and Scale-out
• Scale-up: Intra-node Communication
  – Many improvements, e.g.:
    • NVIDIA cuDNN, cuBLAS, NCCL, etc.
    • CUDA 9 Co-operative Groups
• Scale-out: Inter-node Communication
  – DL Frameworks – most are optimized for single-node use only
  – Distributed (Parallel) Training is an emerging trend
    • OSU-Caffe – MPI-based
    • Microsoft CNTK – MPI/NCCL2
    • Google TensorFlow – gRPC-based/MPI/NCCL2
    • Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)
[Figure: scale-up performance vs. scale-out performance, plotting cuDNN, NCCL1, NCCL2, MPI, MKL-DNN, gRPC, and Hadoop relative to the desired point]

Deep Learning over Big Data (DLoBD)
• Deep Learning over Big Data (DLoBD) is one of the most efficient analytics paradigms
• More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
  – Easily build a powerful data analytics pipeline
    • E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
    • (1) Prepare Datasets @Scale → (2) Deep Learning @Scale → (3) Non-deep learning analytics @Scale → (4) Apply ML model @Scale
  – Better data locality
  – Efficient resource sharing and cost effectiveness

Holistic Evaluation is Important!!
• “My framework is faster than your framework!”
• This needs to be understood in a holistic way.
• Performance depends on the entire execution environment (the full stack).
• An isolated view of performance is not helpful.
[Figure: the full DL stack – DL Applications (Image Recognition, Speech Processing, etc.); DL Frameworks (Caffe, TensorFlow, etc.); Generic, MKL-optimized, and cuDNN-optimized Convolution Layers; BLAS libraries (ATLAS, OpenBLAS, MKL 2017, cuDNN/cuBLAS, other BLAS libraries); Hardware (multi-/many-core Xeon and Xeon Phi, many-core GPUs such as Pascal P100, other processors)]

A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda. “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures”, In Proceedings of the Machine Learning on HPC Environments (MLHPC'17). ACM, New York, NY, USA, Article 8.

Research Challenges to Exploit HPC Technologies

1. What are the fundamental issues in designing DL frameworks?
   – Memory Requirements
   – Computation Requirements
   – Communication Overhead
2. Why do we need to support distributed training?
   – To overcome the limits of single-node training
   – To better utilize hundreds of existing HPC Clusters
[Figure: the DL stack – Deep Learning and Machine Learning Frameworks (Caffe/OSU-Caffe, CNTK, Caffe2, TensorFlow, MXNet); major computation and communication phases in DL frameworks (Forward, Backward, Gradient Aggregation, Model Propagation); communication runtimes to support distributed training; HPC platforms (CPU, GPU, InfiniBand)]

Research Challenges to Exploit HPC Technologies (Cont’d)

3. What are the new design challenges brought forward by DL frameworks for communication runtimes?
   – Large-message collective communication and reductions
   – GPU buffers (CUDA-Awareness)
4. Can a co-design approach help in achieving scale-up and scale-out efficiently?
   – Co-design the support at the runtime level and exploit it at the DL framework level
   – What performance benefits can be observed?
   – What needs to be fixed at the communication runtime layer?
[Figure: co-design opportunities – communication runtimes (MPI/NCCL/Gloo/MLSL) providing point-to-point operations, CUDA-awareness, and large-message collectives beneath the DL frameworks and on top of the HPC platforms (CPU, GPU, InfiniBand)]

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN training
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data

Data Parallel Deep Learning and MPI Collectives
• Major MPI collectives involved in designing distributed frameworks:
  – MPI_Bcast – required for DNN parameter exchange
  – MPI_Reduce – needed for gradient accumulation from multiple solvers
  – MPI_Allreduce – use just one Allreduce instead of MPI_Reduce followed by a broadcast (see the mpi4py sketch below)
[Figure: data-parallel training loop – (1) Data Propagation: MPI_Bcast of packed_comm_buff from GPU 0 to all GPUs; (2) Forward/Backward Pass on each GPU over layers L1…Ln; (3) Gradient Aggregation: MPI_Reduce of packed_reduce_buff to GPU 0, followed by ApplyUpdates]
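The following is a hedged illustration of the three phases above, written with mpi4py rather than the OSU-Caffe code; compute_gradients is a placeholder for a framework's forward/backward pass over the local data shard.

    # Minimal data-parallel SGD sketch with mpi4py (hypothetical, not OSU-Caffe).
    # Run with: mpirun -np <num_solvers> python train_dp.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    num_params = 1_000_000
    params = np.zeros(num_params, dtype=np.float32)   # "packed_comm_buff" equivalent

    def compute_gradients(params, rank, step):
        # Placeholder for the forward/backward pass on this solver's data shard.
        rng = np.random.default_rng(seed=rank * 10_000 + step)
        return rng.standard_normal(num_params).astype(np.float32)

    # (1) Data propagation: rank 0 broadcasts the initial parameters.
    comm.Bcast(params, root=0)

    for step in range(10):
        # (2) Forward/backward pass on the local shard.
        local_grad = compute_gradients(params, rank, step)

        # (3) Gradient aggregation: one Allreduce instead of Reduce + Bcast.
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

        # Identical updates on every rank keep the replicas in sync.
        params -= 0.01 * (global_grad / size)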

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, “S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters,” in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’17).

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,950 organizations in 86 countries
  – More than 527,000 (> 0.5 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov ’18 ranking)
    • 3rd-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 14th, 556,104 cores (Oakforest-PACS) in Japan
    • 17th, 367,024 cores (Stampede2) at TACC
    • 27th, 241,108 cores (Pleiades) at NASA, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Partner in the upcoming TACC Frontera system
• Empowering Top500 systems for over a decade

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models:
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime – Diverse APIs and Mechanisms:
• Point-to-point Primitives, Collectives Algorithms, Remote Memory Access, Job Startup, Energy-Awareness, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower, Xeon Phi, ARM, NVIDIA GPGPU):
• Networking – Transport Protocols: RC, XRC, UD, DC, UMR, ODP; Modern Features: SR-IOV, Multi-Rail
• Multi-/Many-core – Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM*; Modern Features: MCDRAM*, NVLink*, CAPI*
(* Upcoming)

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(s_devbuf and r_devbuf are GPU device buffers; the GPU-aware data movement is handled inside MVAPICH2)

High Performance and High Productivity
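The same productivity benefit can be sketched in Python with mpi4py and CuPy: when mpi4py runs on top of a CUDA-aware MPI such as MVAPICH2-GDR, GPU arrays can be passed straight to the communication calls. This is a hedged sketch under that assumption, not MVAPICH2-specific code.

    # Sketch: sending GPU-resident buffers directly through a CUDA-aware MPI.
    # Assumes mpi4py is built over a CUDA-aware MPI (e.g., MVAPICH2-GDR).
    import cupy as cp
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()  # one GPU per rank

    count = 1024 * 1024  # 4 MB of float32
    if rank == 0:
        s_devbuf = cp.arange(count, dtype=cp.float32)   # device buffer at the sender
        cp.cuda.get_current_stream().synchronize()       # data ready before MPI touches it
        comm.Send(s_devbuf, dest=1, tag=77)              # no explicit host staging copy
    elif rank == 1:
        r_devbuf = cp.empty(count, dtype=cp.float32)     # device buffer at the receiver
        comm.Recv(r_devbuf, source=0, tag=77)
        print("checksum on GPU:", float(r_devbuf.sum()))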

Optimized MVAPICH2-GDR Design
[Figures: GPU-GPU inter-node latency (down to 1.85 µs, ~10x better), bandwidth (~9x better), and bi-directional bandwidth (~11x better) vs. message size, comparing MV2-(NO-GDR) and MV2-GDR-2.3]
• Platform: MVAPICH2-GDR 2.3; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct RDMA

Distributed Training using TensorFlow (TF)

• TensorFlow is the most popular DL framework
• gRPC is the official distributed training runtime (a baseline sketch follows below)
  – Many problems for HPC use cases
• Community efforts – Baidu and Uber’s Horovod have added MPI support to TF across nodes
• Need to understand the several options currently available
[Figure: taxonomy of distributed TensorFlow designs – gRPC (Accelerated gRPC), gRPC+X (gRPC+MPI, gRPC+Verbs, gRPC+GDR), and No-gRPC (Baidu-MPI, Horovod over MPI/NCCL)]

Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ’19 (to be presented). https://arxiv.org/abs/1810.11112
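For reference, the stock gRPC path corresponds roughly to TensorFlow 1.x's distributed API; the host:port values and the tiny model below are placeholders, and each task (the ps and every worker) would run its own copy of this script.

    # Sketch of TensorFlow 1.x's stock gRPC-based distributed runtime,
    # i.e., the baseline that the RDMA/MPI designs above replace or augment.
    import tensorflow as tf  # TensorFlow 1.x API

    cluster = tf.train.ClusterSpec({
        "ps":     ["node0:2222"],                  # placeholder hosts
        "worker": ["node1:2222", "node2:2222"],
    })
    # Every process starts a gRPC server for its own task.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        # Variables land on the parameter server; compute ops run on the worker.
        w = tf.get_variable("w", shape=[1024, 1024])
        loss = tf.reduce_sum(tf.square(w))
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    # All parameter traffic between ps and workers goes over gRPC channels.
    with tf.train.MonitoredTrainingSession(master=server.target) as sess:
        for _ in range(100):
            sess.run(train_op)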

Scalable TensorFlow using Horovod, MPI, and NCCL
• Efficient Allreduce is crucial for Horovod’s overall training performance (see the sketch below)
  – Both MPI and NCCL designs are available
• We have evaluated Horovod extensively and compared it against a wide range of designs using gRPC and gRPC extensions
• MVAPICH2-GDR achieved up to 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs
[Figures: images/second (higher is better) vs. number of GPUs (1–16) for Horovod-MPI, Horovod-NCCL2, Horovod-MPI-Opt (Proposed), and Ideal; and images/second vs. number of nodes/GPUs (1–64) for Horovod-NCCL2, Horovod-MPI-Opt, and Ideal]

Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ’19 (to be presented). https://arxiv.org/abs/1810.11112
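The Horovod usage model evaluated here can be illustrated with the following hedged sketch (recent TensorFlow/Keras API, random placeholder data rather than ImageNet); whether the underlying Allreduce goes through MPI (e.g., MVAPICH2-GDR) or NCCL is decided when Horovod is built and launched, not in the script.

    # Minimal Horovod data-parallel training sketch (TensorFlow/Keras).
    # Run with: mpirun -np <num_gpus> python train_hvd.py
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each process to a single GPU.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = tf.keras.applications.ResNet50(weights=None, classes=1000)

    # Scale the learning rate by the number of workers and wrap the optimizer
    # so gradients are averaged with Allreduce across all ranks.
    opt = hvd.DistributedOptimizer(
        tf.keras.optimizers.SGD(0.01 * hvd.size(), momentum=0.9))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    # Broadcast initial parameters from rank 0 so every replica starts identically.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    # Placeholder random batch; a real run would shard ImageNet across ranks.
    x = tf.random.uniform([32, 224, 224, 3])
    y = tf.random.uniform([32], maxval=1000, dtype=tf.int32)
    model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)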

MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI
• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0

[Figures: Allreduce latency (µs) vs. message size (4 B–512 MB) for MVAPICH2, BAIDU, and OPENMPI; annotations: ~10X better, ~30X better, ~4X better, MV2 is ~2X better than Baidu, OpenMPI is ~5X slower than Baidu]
*Available since MVAPICH2-GDR 2.3a

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
[Figures: Allreduce latency (µs) vs. message size for MVAPICH2-GDR and NCCL2; annotations: ~1.2X better, ~3X better]
Platform: Intel Xeon (Broadwell) nodes, each with a dual-socket CPU, one K-80 GPU, and EDR InfiniBand interconnect

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)

• Optimized designs in the upcoming MVAPICH2-GDR offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)
[Figures: Allreduce latency (µs) vs. message size for MVAPICH2-GDR-Next and NCCL-2.3; annotations: ~2.5X better, ~1.7X better]

Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2

OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based Parallel Training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
• OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
[Figure: GoogLeNet (ImageNet) training time (seconds) on 8–128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); “Invalid use case” marks configurations default Caffe cannot support]

Training Large (Out-of-core) Models
• Large DNNs cannot be trained on GPUs due to memory limitations!
  – ResNet-50 for image recognition, but current frameworks can only go up to a small batch size of 45
  – Next-generation models like Neural Machine Translation (NMT) are extremely large, consist of billions of parameters, and require even more memory
  – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations “will be” slow! (see the sketch below)
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows the potential of managed-memory designs that can provide performance with negligible/no overhead
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
[Figures: trainability (memory requirements) – maximum batch size for AlexNet, GoogLeNet, and VGG against the P100 GPU memory limit (16 GB), with out-of-core training going beyond it; images/second (higher is better) for caffe-cpu, intel-caffe, intel-caffe-opt, caffe-gpu, oc-caffe-naïve, and oc-caffe-opt – oc-caffe-opt is 80% better than intel-caffe, caffe-gpu cannot run, intel-caffe-opt N/A]

A. Awan et al., “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training,” HiPC ’18.

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN training
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data

Performance Benefits for RDMA-gRPC with Micro-Benchmark

[Figures: RDMA-gRPC RPC latency (µs) vs. payload size – Default gRPC vs. OSU RDMA gRPC for small (2 B–2 KB), medium (8 KB–512 KB), and large (1 MB–8 MB) payloads]

• gRPC-RDMA latency on SDSC-Comet-FDR
  – Up to 2.7x speedup over IPoIB for small messages
  – Up to 2.8x speedup over IPoIB for medium messages
  – Up to 2.5x speedup over IPoIB for large messages

R. Biswas, X. Lu, and D. K. Panda, “Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand,” HiPC ’18.

Performance Benefit for RDMA-TensorFlow (Inception3)

[Figures: Inception3 images/second vs. batch size (16, 32, 64) on 4 nodes (8 GPUs), 8 nodes (16 GPUs), and 12 nodes (24 GPUs) for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]
• TensorFlow Inception3 performance evaluation on an IB EDR cluster
  – Up to 20% speedup over default gRPC (IPoIB) for 8 GPUs
  – Up to 34% speedup over default gRPC (IPoIB) for 16 GPUs
  – Up to 37% speedup over default gRPC (IPoIB) for 24 GPUs

R. Biswas, X. Lu, and D. K. Panda, “Accelerating TensorFlow with Adaptive RDMA-based gRPC,” HiPC ’18.

RDMA-TensorFlow Distribution

• High-Performance Design of TensorFlow over RDMA-enabled Interconnects
  – High-performance RDMA-enhanced design with native InfiniBand support at the verbs level for gRPC and TensorFlow
  – RDMA-based data communication
  – Adaptive communication protocols
  – Dynamic message chunking and accumulation
  – Support for RDMA device selection
  – Easily configurable for different protocols (native InfiniBand and IPoIB)
• Current release: 0.9.1
  – Based on Google TensorFlow 1.3.0
  – Tested with
    • Mellanox InfiniBand adapters (e.g., EDR)
    • NVIDIA GPGPU K80
    • CUDA 8.0 and cuDNN 5.0
  – http://hidl.cse.ohio-state.edu

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN Training
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data

The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER; support for Singularity and Docker
• http://hibd.cse.ohio-state.edu
• Users base: 300 organizations from 35 countries
• More than 29,300 downloads from the project site

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen on OSU-RI2 (EDR)
[Figures: execution time (s) vs. data size (GB) for IPoIB (EDR) and OSU-IB (EDR) – RandomWriter (80–160 GB, reduced by 3x) and TeraGen (80–240 GB, reduced by 4x); cluster with 8 nodes and a total of 64 maps]
• RandomWriter: 3x improvement over IPoIB for 80–160 GB file size
• TeraGen: 4x improvement over IPoIB for 80–240 GB file size

Performance Evaluation on SDSC Comet – HiBench PageRank

[Figures: PageRank total time (sec) vs. data size (Huge, BigData, Gigantic) for IPoIB and RDMA – 32 worker nodes/768 cores (37% reduction) and 64 worker nodes/1536 cores (43% reduction)]
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores, (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Challenges of Deep Learning over Big Data (DLoBD)
  ▪ Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  ▪ What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization on DLoBD Stacks

  ▪ CaffeOnSpark, TensorFlowOnSpark, and BigDL
  ▪ IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
  ▪ Performance, accuracy, scalability, and resource utilization
  ▪ RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
[Figure: epoch time (secs) and accuracy (%) vs. epoch number (1–18) for IPoIB and RDMA, showing the 2.6X speedup]

X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, “Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks,” HotI 2017.

Conclusions

• Scalable distributed training is becoming increasingly important
• It requires high-performance middleware designs that exploit modern interconnects
• Presented a set of solutions to achieve scalable distributed training
  – CUDA-aware MPI with optimized collectives
  – TensorFlow-gRPC with RDMA support
  – Efficient DL support over Big Data
• Will continue to enable the DL community to achieve scalability and high performance for their distributed training

Thank You!
[email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
