Scalable and Distributed DNN Training on Modern HPC Systems: Challenges and Solutions
Keynote Talk at SDAS '19 by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Understanding the Deep Learning Resurgence
• Deep Learning is a subset of Machine Learning
– But it is perhaps the most radical and revolutionary subset
– Automatic feature extraction vs. hand-crafted features
• Deep Learning – a renewed interest and a lot of hype!
– Key success: Deep Neural Networks (DNNs)
– Everything was there since the late 80s except the "computability of DNNs"
Courtesy: http://www.deeplearningbook.org/contents/intro.html

Deep Learning Use Cases and Growth Trends
(Figure: deep learning use cases and market growth trends.)
Courtesy: https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/

Increasing Usage of HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• Convergence of HPC, Big Data, and Deep Learning!
• Increasing need to run these applications on the Cloud!!

Newer Workflows – Deep Learning over Big Data (DLoBD)
• Deep Learning over Big Data (DLoBD) is one of the most efficient analytics paradigms
• More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
– Easily build a powerful data analytics pipeline: (1) Prepare Datasets @Scale, (2) Deep Learning @Scale, (3) Non-deep learning analytics @Scale, (4) Apply ML model @Scale
• E.g., Flickr DL/ML Pipeline, "How Deep Learning Powers Flickr", http://bit.ly/1KIDfof
– Better data locality
– Efficient resource sharing and cost effective

Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies (>1 TFlop DP on a chip)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) – high-performance interconnects with <1 usec latency and 200 Gbps bandwidth
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) – high compute density, high performance/watt
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
Examples: Summit, Sierra, Sunway TaihuLight, K Computer

Key Phases of Deep Learning
• Deep Learning has two major tasks
1. Training of the Deep Neural Network
2. Inference (or deployment) that uses a trained DNN
• DNN Training
– Training is a compute/communication-intensive process – it can take days to weeks
– Faster training is necessary!
• Faster training can be achieved by
– Using newer and faster hardware – but there is a limit!
– Can we use more GPUs or nodes?
• Hence the need for Parallel and Distributed Training

Scale-up and Scale-out
• Scale-up: Intra-node Communication
– Many improvements, such as:
• NVIDIA cuDNN, cuBLAS, NCCL, etc.
• CUDA 9 Co-operative Groups
• Scale-out: Inter-node Communication
– DL Frameworks – most are optimized for single-node use only
– Distributed (Parallel) Training is an emerging trend (a hierarchical sketch follows below)
• OSU-Caffe – MPI-based
• Microsoft CNTK – MPI/NCCL2
• Google TensorFlow – gRPC-based/MPI/NCCL2
• Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)
(Figure: Scale-up Performance vs. Scale-out Performance, placing cuDNN, MKL-DNN, NCCL2, MPI, gRPC, and Hadoop relative to the "Desired" region.)
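One common way to combine the two paths is a hierarchical reduction: aggregate gradients inside each node first (scale-up), then exchange only one contribution per node across the network (scale-out). The sketch below is illustrative only, not code from the talk or from any of the frameworks listed above; it uses mpi4py, and the buffer size, the "leader" naming, and the final averaging are placeholder assumptions.

```python
# Hypothetical sketch: hierarchical gradient averaging with mpi4py.
# Launch with an MPI launcher, e.g.:  mpirun -np <ranks> python hier_allreduce.py
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD

# Scale-up path: a communicator of the ranks that share a node (MPI-3 split).
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED)

# Scale-out path: one "leader" rank per node communicates across nodes.
is_leader = (node_comm.Get_rank() == 0)
leader_comm = world.Split(0 if is_leader else MPI.UNDEFINED, world.Get_rank())

grads = np.random.rand(1 << 20).astype(np.float32)  # placeholder gradient buffer

# 1) Reduce gradients inside each node (intra-node, scale-up).
node_sum = np.empty_like(grads)
node_comm.Reduce(grads, node_sum, op=MPI.SUM, root=0)

# 2) Allreduce the per-node partial sums across node leaders (inter-node, scale-out).
if is_leader:
    leader_comm.Allreduce(MPI.IN_PLACE, node_sum, op=MPI.SUM)

# 3) Broadcast the global sum back inside each node and average.
node_comm.Bcast(node_sum, root=0)
grads[:] = node_sum / world.Get_size()
```

The design point this is meant to show: intra-node traffic stays on the fast local path, and only one message per node crosses the network, which is the general idea behind scale-up/scale-out aware runtimes.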
Holistic Evaluation is Important!!
• "My framework is faster than your framework!"
• This claim needs to be understood in a holistic way
• Performance depends on the entire execution environment (the full stack, including libraries such as MKL/MKL-DNN)
• An isolated view of performance is not helpful
A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures", in Proceedings of the Machine Learning on HPC Environments (MLHPC'17), ACM, New York, NY, USA, Article 8.

Broad Challenge: Exploiting HPC for Deep Learning
How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?

Research Challenges to Exploit HPC Technologies
1. What are the fundamental issues in designing DL frameworks?
– Memory requirements
– Computation requirements
– Communication overhead
2. Why do we need to support distributed training?
– To overcome the limits of single-node training
– To better utilize hundreds of existing HPC clusters
(Figure: layered stack – DL/ML frameworks (Caffe/OSU-Caffe, CNTK, Caffe2, TensorFlow, MXNet); major computation and communication phases in DL frameworks (Forward Propagation, Backward Propagation, Gradient Aggregation, Model Propagation); communication runtimes to support distributed training; HPC platforms (CPU, InfiniBand, GPU).)

Research Challenges to Exploit HPC Technologies (Cont'd)
3. What are the new design challenges brought forward by DL frameworks for communication runtimes?
– Large-message collective communication and reductions
– GPU buffers (CUDA-Awareness)
4. Can a co-design approach help in achieving scale-up and scale-out efficiently?
– Co-design the support at the runtime level and exploit it at the DL framework level
– What performance benefits can be observed?
– What needs to be fixed at the communication runtime layer?
(Figure: co-design opportunities between the DL frameworks and the communication runtimes (MPI/NCCL/Gloo/MLSL), covering point-to-point operations, CUDA-Awareness, and large-message collectives, on top of the HPC platforms (CPU, InfiniBand, GPU).)

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
– CPU-based Deep Learning
– GPU-based Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN training
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data

Data Parallel Deep Learning and MPI Collectives
• Major MPI collectives involved in designing distributed frameworks:
• MPI_Bcast – required for DNN parameter exchange
• MPI_Reduce – needed for gradient accumulation from multiple solvers
• MPI_Allreduce – use just one Allreduce instead of a Reduce followed by a Broadcast (sketched below)
(Figure: training loop across GPUs 0–3 – 1. Data propagation: MPI_Bcast of packed_comm_buff from GPU 0; 2. Forward/backward pass through layers L1…Ln on each GPU; 3. Gradient aggregation: reduction of the per-GPU packed_reduce_buff buffers, then ApplyUpdates.)
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters", in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
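The three steps in the figure map almost directly onto MPI calls. Below is a minimal, illustrative mpi4py sketch (not S-Caffe code): the packed parameter buffer is broadcast from rank 0, each rank runs its local forward/backward pass, and a single Allreduce on the packed gradient buffer replaces the Reduce + Bcast pair. The forward_backward function, the buffer size, and the plain-SGD update are placeholder assumptions.

```python
# Illustrative data-parallel training loop (one MPI rank per GPU/solver).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

num_params = 1 << 20                                # placeholder model size
params = np.zeros(num_params, dtype=np.float32)     # packed_comm_buff analogue
if rank == 0:
    params[:] = np.random.rand(num_params)          # rank 0 holds the initial model

# 1. Parameter propagation: MPI_Bcast from rank 0 to all solvers.
comm.Bcast(params, root=0)

def forward_backward(params, batch):
    """Placeholder for the per-solver forward and backward pass."""
    return np.random.rand(params.size).astype(np.float32)  # fake gradients

for step in range(100):
    batch = None                                    # placeholder for the local data shard
    # 2. Local forward/backward pass fills the packed gradient buffer.
    grads = forward_backward(params, batch)         # packed_reduce_buff analogue
    # 3. Gradient aggregation: one Allreduce instead of Reduce + Bcast.
    comm.Allreduce(MPI.IN_PLACE, grads, op=MPI.SUM)
    grads /= size
    params -= 0.01 * grads                          # ApplyUpdates analogue (plain SGD)
```

Using Allreduce here keeps every rank's model copy identical after each step, which is why the slide recommends it over an explicit Reduce-then-Broadcast sequence.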
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,000 organizations in 88 countries
– More than 549,000 (> 0.5 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (June '19 ranking):
• 3rd-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 16th, 556,104 cores (Oakforest-PACS) in Japan
• 19th, 367,024 cores (Stampede2) at TACC
• 31st, 241,108 cores (Pleiades) at NASA, and many others
– Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Partner in the TACC Frontera system
• Empowering Top500 systems for over a decade

Architecture of MVAPICH2 Software Family
• High-performance parallel programming models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid (MPI + X: MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel-Xeon, OpenPower, Xeon-Phi, ARM, NVIDIA GPGPU)
(Figure: the lower layers of the architecture diagram list Transport Protocols (RC, …), Transport Mechanisms (Shared …), and Modern Features (SR-…, Multi-…).)
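The CUDA-Awareness called out in the research challenges, and provided by GPU-enabled MPI libraries such as MVAPICH2-GDR, lets an application hand GPU buffers directly to MPI calls instead of staging them through host memory. A minimal sketch, assuming a CUDA-aware MPI build underneath mpi4py (version 3.1 or newer, which recognizes CuPy arrays through the __cuda_array_interface__ protocol); the buffer contents and size are placeholder assumptions.

```python
# Sketch of a CUDA-aware Allreduce: the device buffer is handed to MPI directly.
# Requires an MPI library built with CUDA support (e.g., MVAPICH2-GDR)
# and mpi4py >= 3.1, which understands CuPy's __cuda_array_interface__.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD

grads = cp.random.rand(1 << 20, dtype=cp.float32)   # gradients resident on the GPU

# Make sure the kernels producing grads have finished before MPI reads the buffer.
cp.cuda.get_current_stream().synchronize()

# No explicit copy to the host: the device pointer goes straight to the runtime,
# which can use fast GPU-to-GPU paths on supported hardware.
comm.Allreduce(MPI.IN_PLACE, grads, op=MPI.SUM)
grads /= comm.Get_size()
```

Without CUDA-aware support, the same exchange would require copying grads into a host array before and after the call; removing exactly that staging step is the point of the GPU-buffer support discussed in the research-challenges slides.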