
Submitted to IEEE IPDPS 2019 (Main Track) for Peer Review

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

Ammar Ahmad Awan, Ching-Hsiang Chu, Jeroen Bédorf, Hari Subramoni, and Dhabaleswar K. Panda
Department of Computer Science and Engineering, The Ohio State University (Awan, Chu, Subramoni, Panda): {awan.10, chu.368, subramoni.1, panda.2}@osu.edu
Minds.ai, Santa Cruz, the United States, and Leiden Observatory, Leiden University, Leiden, the Netherlands (Bédorf): [email protected]

Abstract—The current wave of advances in Machine Learning (ML) and Deep Learning (DL) has been triggered by the availability of large-scale datasets, efficient CPU and GPU hardware, and the development of easy-to-use software frameworks like TensorFlow (TF), Caffe, and Torch. TensorFlow has been, by far, the most widely adopted ML/DL framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) gRPC, 2) gRPC+'X': X = (InfiniBand Verbs, Message Passing Interface (MPI), and GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters including the Piz Daint system (#6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) Application-level scalability of DNN training, 2) Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used for no-gRPC approaches, and 4) Type and size of DNN architectures (e.g., ResNet vs. MobileNet). Based on these experiments, we present two key insights: 1) Overall, No-gRPC designs achieve better performance compared to gRPC-based approaches for most configurations, and 2) The performance of No-gRPC is heavily influenced by the gradient aggregation using the Allreduce communication pattern. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits 1) CUDA kernels to perform large reductions on the GPU and 2) a pointer cache to avoid overheads involved in queries to the CUDA driver. Our proposed designs have been implemented in MVAPICH2-GDR and offer 5-17× better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages on 16 GPUs (nodes). The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8× and 3.2× higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.

I. INTRODUCTION

Deep Learning (DL) has been a significant contributor to the recent achievements in the Artificial Intelligence (AI) realm. Novel approaches like back-propagation in Deep Neural Networks (DNNs) were investigated around the 1980 time frame [36]. However, the potential of these approaches was marred by slow hardware and lack of sufficient training data. To this end, the current resurgence and a renewed interest in DL and DNN-based solutions to classical as well as new Machine Learning (ML) problems can be attributed to the widespread availability of 1) versatile and large-scale datasets like ImageNet [1], and 2) efficient computation capabilities in modern hardware architectures like Graphics Processing Units (GPUs) and multi-/many-core CPUs. These two trends have positively influenced the development of several high-level DL toolkits like Caffe [9], [25], Microsoft Cognitive Toolkit (CNTK), Facebook PyTorch [5], and Google TensorFlow [11]. Implementing DNN and back-propagation techniques in an efficient manner has been a challenging problem. However, these toolkits have played a crucial role in making ML and DL more accessible for both academic as well as industry-based researchers in various fields.

In the context of DL frameworks, it is pertinent to mention that TensorFlow is the most popular DL framework and has seen widespread adoption. Today, more than 1,600 people have contributed to the TensorFlow GitHub repository [6] and several hundred research papers have utilized TensorFlow for both research and commercial applications. However, TensorFlow in its early days was criticized for its slower performance [39] as well as lack of support for efficient distributed training and limited support for High Performance Computing (HPC) systems. To this end, recent efforts by the TF developers as well as the open source community are commendable, and performance has significantly improved for both single-GPU/single-node as well as multi-GPU/multi-node (distributed) training. The gRPC [19] library, which is the official distributed training infrastructure for TF, has been optimized for tensor transfers (fewer memory operations), but still uses the relatively slow standard Ethernet networks. However, gRPC can take advantage of InfiniBand (IB) using the IP over IB (IPoIB) protocol which offers significantly better performance. At the same time, the community has been actively exploring Message Passing Interface (MPI) – a de facto standard for the HPC community – based designs to improve distributed training on HPC clusters. However, the active interest and contributions from the community have led to several disjoint efforts and fragmentation in the way users can take advantage of the advanced distributed training designs in TF. The two broad challenges that we investigate in this paper are: 1) What is the most efficient tensor communication (for gradient aggregation) framework for TensorFlow? and 2) How can we improve the performance of this communication using CUDA-Aware MPI? Several detailed questions follow this broad challenge. It is pertinent to note that little in existing literature can offer insights into the following key questions, which are of significant interest if large-scale distributed training is employed.

• What are the available choices for distributed training using TensorFlow for a given execution platform?
• What are the key features and performance characteristics of the various distributed training approaches for TensorFlow?
• How can we optimize the performance of distributed training using TensorFlow on modern HPC systems?

A. Contributions

To the best of our knowledge, this is the first paper that offers a comprehensive landscape that highlights, evaluates, and optimizes a diverse set of approaches that deal with distributed DNN training using TensorFlow at scale. In this paper, we make the following key contributions:

• We provide an in-depth categorization, design analysis, and performance characterization of distributed training approaches for TensorFlow using state-of-the-art DNNs like ResNet-50, MobileNet, and NASNet.
• We propose a truly CUDA-Aware MPI Allreduce design that exploits 1) CUDA kernels to perform large reductions on the GPU and 2) a pointer cache to avoid overheads involved in queries to the CUDA driver.
• We illustrate benefits of the proposed MPI Allreduce optimizations using micro-benchmarks as well as application workloads (tf_cnn_benchmarks) using TensorFlow and Horovod.
• We present a comprehensive and large-scale performance evaluation (up to 128 GPUs) for all the available distributed training approaches on three modern HPC systems.

II. OVERVIEW OF COMMUNICATION LIBRARIES FOR DISTRIBUTED TRAINING: PAST, PRESENT, AND FUTURE

We first provide a brief historical perspective on various DL frameworks and how they utilize communication libraries for distributed training. Next, we describe communication libraries for distributed training using TensorFlow. We conclude this section with a discussion on how TensorFlow is moving towards a different system to handle community contributions and how it may affect the distributed training approaches we discuss in Section III.

A. DL Frameworks and Communication Libraries

Most ML/DL frameworks started with single-node/single-GPU designs. Caffe, for example, had no support for multi-node distributed training and only external efforts [8], [12], [24] provide such support. However, the exponential growth in the size of DNN architectures and an ever-increasing need for speed has forced the ML/DL framework designers to rethink their strategy and to start utilizing existing communication schemes or design their own libraries from scratch. Microsoft Cognitive Toolkit (CNTK) [29] is based on an MPI design whereas Caffe2 [9] uses the Gloo [10] collective communication library developed by Facebook, which is similar to NVIDIA's NCCL [32] library. Gloo exploits the InfiniBand verbs interface and Remote Direct Memory Access (RDMA) technology to offer various reduction algorithms. Apart from the communication libraries that come integrated with the frameworks, there are external (commercial) efforts to enable efficient distributed training. For example, IBM has developed the PowerAI Distributed Deep-learning Library (DDL), which uses a multi-ring reduction algorithm. The library is built on top of IBM's Spectrum MPI (a derivative of OpenMPI) and as such supports all the network interfaces that are supported by the OpenMPI library. According to IBM, the library can be integrated with TensorFlow, Caffe, and Torch. Intel has developed the Intel Machine Learning Scaling Library (MLSL) [40]. This library is built on top of Intel MPI and therefore supports various interconnects such as InfiniBand, Omni-Path, and Ethernet. The library offers a set of communication primitives that neural network frameworks can take advantage of when performing distributed communication. According to the research paper it is integrated with Intel-Caffe, TensorFlow, and Intel's own neural-net compiler called nGraph. To bring support to TensorFlow, the library modifies the Horovod [37] library by inserting the MLSL communication primitives.

B. Communication Libraries for TensorFlow

Our focus in this paper is on distributed training using TensorFlow, which, by default, can be realized using the official Google RPC (gRPC) library. gRPC is a generic point-to-point communication library and has no collective communication support. It works in a simple client/server style fashion and has been tightly integrated with TF. However, TF has been extended by contributors to take advantage of additional communication libraries like MPI, RDMA Verbs, and NCCL. We discuss the libraries that TF uses in more detail in the following sections.

gRPC is a point-to-point RPC library which exchanges data using a pre-defined message format described in the protobuf language [4]. Each node launches a server process responsible for receiving messages from the other nodes. At the launch of TF, the user has to specify the IP and port number of the listening server for all the other nodes in the training run. gRPC offers no CUDA-Aware operations, and as such, all data is first staged on the host before being sent over the network. Recent optimizations have however reduced the number of copy operations required to convert GPU buffers into protobuf messages. All the communication operations are handled by a group of threads which allow overlapping data transfers for optimal performance. gRPC uses the HTTP/2 protocol on top of TCP/IP based networks. The generality of the library causes it to also be widely used outside of the deep learning application area.

Message Passing Interface (MPI) is a de facto standard for expressing distributed-memory programs. Several implementations of the MPI standard like MPICH [2], MVAPICH [30], OpenMPI [42], and CrayMPI [35] have been developed and optimized over the period of several years for various processor architectures and high-performance interconnects like High-speed Ethernet (HSE) and InfiniBand. In recent years, accelerators like NVIDIA GPUs have been adopted by most HPC systems [28]. As a result, MPI extensions have been proposed to support efficient communication between GPUs. Initially, without the capability of direct access of GPU memory, MPI applications required explicit copying of GPU data to a staging buffer in host memory in order to push the data to the network. Similarly, data movement from CPU to GPU was needed after receiving the data on the host through an MPI_Recv operation. This significantly impacts performance as well as productivity. Thus, several MPI libraries including OpenMPI and MVAPICH2-GDR [30] provide CUDA-Aware MPI primitives to transparently perform such copy operations. These CUDA-Aware MPI libraries significantly improve performance and productivity for MPI+CUDA applications.
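To make the contrast between staged and CUDA-Aware transfers concrete, the following sketch passes a GPU-resident buffer directly to Allreduce and, for comparison, performs the explicit host staging that a non-CUDA-Aware library would require. This listing is illustrative only (it is not part of the evaluated frameworks) and assumes mpi4py built against a CUDA-Aware MPI such as MVAPICH2-GDR, plus CuPy for the GPU arrays.

    # Minimal sketch: Allreduce on a GPU-resident buffer with a CUDA-Aware MPI.
    # Launch with, e.g.:  mpirun -np 4 python cuda_aware_allreduce.py
    from mpi4py import MPI
    import cupy as cp
    import numpy as np

    comm = MPI.COMM_WORLD

    # Gradient-like data living in GPU memory (e.g., one fused tensor).
    grad_gpu = cp.ones(4 * 1024 * 1024, dtype=cp.float32) * comm.Get_rank()
    result_gpu = cp.empty_like(grad_gpu)
    cp.cuda.get_current_stream().synchronize()   # make sure the data is ready

    # CUDA-Aware path: device pointers are handed directly to MPI; the library
    # (not the application) decides whether to use GPUDirect RDMA or staging.
    comm.Allreduce(grad_gpu, result_gpu, op=MPI.SUM)

    # Non-CUDA-Aware path, shown for contrast: the application itself must
    # stage the data through host memory before and after communication.
    grad_host = cp.asnumpy(grad_gpu)             # device-to-host copy
    result_host = np.empty_like(grad_host)
    comm.Allreduce(grad_host, result_host, op=MPI.SUM)
    result_gpu_staged = cp.asarray(result_host)  # host-to-device copy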

NVIDIA NCCL offers multi-GPU communication primitives. NCCL (pronounced "Nickel") has been developed by NVIDIA to efficiently tackle communication for DL workloads. NCCL's API closely resembles the MPI interface and provides communication primitives for broadcast, all-gather, reduce, reduce-scatter, and all-reduce. Precisely, NCCL's goal is to provide fast communication of messages between GPUs in dense multi-GPU machines like the DGX-1 and DGX-2 [31]. NCCL 1.x was, in fact, a single-node library. However, since its introduction, NCCL has significantly evolved and now offers multi-node collective operations using IB verbs. To launch NCCL2 applications, users still need to rely on an out-of-band mechanism for connection management. MPI launchers like "mpirun" are used to set up connections and assign ranks while NCCL primitives are used for the actual communication.

C. Future TensorFlow Developments

Google is currently redesigning how contributions are included in the TensorFlow source tree. When TensorFlow 2.0 is released, a number of options that we have analyzed in this paper might not be available anymore. However, Google understands the importance of supporting multiple networking options and is actively working on creating a special interest group (SIG) related to networking. At the moment of writing, the APIs to add additional communication methods to TF have not been formalized. However, there are plans to re-enable the current contributions and introduce support for new communication APIs such as UCX [7].

III. ANALYSIS OF DESIGN AND PERFORMANCE: SEVERAL DISTRIBUTED TRAINING APPROACHES FOR TENSORFLOW

We now dive into the various distributed training designs for TensorFlow and group them into categories based on how the communication libraries are exploited. Next, we provide the performance analysis for all of these approaches. A high-level hierarchy of all these TF approaches is illustrated in Figure 1.

Fig. 1. Various Approaches for Distributed Training using TensorFlow. (Distributed TensorFlow is categorized into gRPC, including Accelerated gRPC; gRPC+X with X = MPI, Verbs, or GDR; and No-gRPC with Baidu-MPI or Horovod over MPI/NCCL.)

A. gRPC-based Designs

The standard distributed training method for TensorFlow is the so-called Parameter Server (PS) model [26]. In this model, there are worker processes that perform the heavy compute work and separate parameter server processes responsible for combining the worker processes' results. In other words, the PSs are responsible for storing and updating the parameters (e.g., weights) of the neural network. TensorFlow programs can specify the number of workers and number of PS tasks that should be used. The PS tasks generally do not require much computing power and can run on nodes without accelerators. Depending on the nodes used it is possible to run both a worker process and a PS process on the same machine. TensorFlow comes with a set of support classes that are built on top of the worker and PS instances to enable monitoring and restarting the training process. This can be used for checkpointing (saving) the training state or for fault tolerance in case a worker node crashes due to hardware failure. TensorFlow's PS model utilizes the gRPC library for communication between processes. It allows programs (optionally written in different programming languages) to communicate via the TCP/IP protocol by using RPCs and uniform data buffers. However, the user is responsible for configuring the end-points for each of the launched processes. This can be a labor-intensive task and requires knowledge about the cluster and used nodes.

TensorFlow uses the following steps to exchange tensor buffers between the processes:

The producer process:
1) After a tensor, for which it has been determined that it has to be sent to another process, has been computed it is placed on a table.
2) If there is no outstanding request for the tensor then it stays in the table until a request is received.
3) If there is a request for this tensor then it is served immediately and removed from the waiting table.

The consumer process:
1) Send a tensor request to the producer.
2) Wait until the producer returns the requested data.
3) Once the data has been received continue processing of the tensor graph.

This is an example of a pull model where the data is pulled from producer to consumer. The gRPC library will send the tensor request message and the return message that contains the tensor data. This data is encoded via the protobuffer library [4]. If the tensor was computed on the GPU, then the tensor data first has to be completely copied to the host memory before it can be encoded in the protobuffer format and transferred. If the tensor's destination is in GPU memory, then similar steps will have to be taken, where the decoding takes place in the host memory before the decoded data is copied to the GPU.

Challenges in Extending gRPC-based Designs: The tight integration of TF with gRPC makes it non-trivial to add support for additional network protocols. However, TensorFlow's design allows for offloading just the tensor transfers, which are the most data-intensive communication operations during the processing of neural networks. With the high-performance GPUs (and other accelerators) currently available it is critical that the data transfers happen as fast as possible, otherwise the GPUs would sit idle waiting for data. The more administrative communication operations such as setting up the network and controlling the execution are less time critical and will always be performed via the gRPC stack. Hence, no matter if the tensor communication is offloaded to another method, the requirement for setting up communication endpoints stays in place. This offloading functionality is what enables the RDMA over Converged Ethernet (RoCE) support that Google's internal TensorFlow version supports. It also enables (external) contributors to add support for other network stacks, see the next section. Because the changes happen on the deeper levels of TF, the user does not have to make any changes to their designed neural net. The only change required is the specification of the communication protocol in the execution script. However, to take advantage of these contributed packages the user will have to build TF from source and point the build process to the correct libraries. gRPC-based communication can also transparently take advantage of the IPoIB [17] interface if IB is the underlying interconnect.
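The endpoint configuration burden of the gRPC-based PS model described above can be illustrated with the TF 1.x API provided by the TensorFlow version evaluated in this paper (1.10). The host:port strings, the job split, and the tiny stand-in model below are hypothetical; a real run would populate them per process, for example from a launcher script or a workload manager.

    # Sketch of the TF 1.x parameter-server (gRPC) setup from Section III-A.
    import tensorflow as tf

    # Every process needs the full cluster map plus its own identity.
    cluster = tf.train.ClusterSpec({
        "ps":     ["node0:2222"],                  # parameter server task(s)
        "worker": ["node1:2222", "node2:2222"],    # worker tasks
    })
    job_name, task_index = "worker", 0             # set per process

    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
        server.join()                              # PS processes only serve variables
    else:
        # Variables are placed on the PS; computation stays on the local worker.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
            w = tf.get_variable("w", shape=[1000, 1000])   # stand-in for a real model
            loss = tf.reduce_sum(tf.square(w))
            train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

        with tf.train.MonitoredTrainingSession(master=server.target,
                                               is_chief=(task_index == 0)) as sess:
            sess.run(train_op)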
B. gRPC+'X' Designs

We broadly discuss three gRPC extensions: 1) gRPC+MPI, 2) gRPC+Verbs, and 3) gRPC+GDR.

1) MPI: The gRPC+MPI method adds support for sending tensors using MPI primitives (see Section II-B). Depending on the support that the MPI library offers, this allows TF to make use of advanced network interconnects such as IB and Omni-Path. The current implementation of the gRPC+MPI method uses a single thread for all the MPI related operations. On one side, this increases compatibility with MPI frameworks that are not thread-safe, but on the other side, it can hamper performance. Especially when many small data tensors are exchanged there would be enough bandwidth available for parallel transfers which could increase overall performance. MPI processes are uniquely identified using their rank, which is incompatible with the TCP/IP connection strings that gRPC uses to identify processes. Therefore a mapping is created from the gRPC identifiers to the MPI process ranks during the initialization of TF.

2) Verbs: The gRPC+Verbs protocol uses the RDMA-verbs API. By directly using the verbs APIs, there is no need for setting up a global execution environment as required by MPI. However, many features that are available in MPI have to be manually implemented. Examples include features like pinned memory buffers to improve the performance of data-transfers, and the use of GPUDirect RDMA. The pinned memory functionality is implemented in the gRPC+Verbs [3] protocol and can be utilized via the tf.contrib package. Another approach to exploit IB Verbs for gRPC in a more transparent fashion has been presented in [14]. We refer to this as Accelerated gRPC in Figure 1.

3) GDR: The last contributed protocol is the gRPC+GDR protocol. This protocol adds GPUDirect RDMA (GDR) functionality for NVIDIA GPUs. This allows tensors to be directly read/written from/to the GPU memory to/from the network adapter without having to first transfer data to the host memory. This saves a full memory copy and reduces latency. In order to take advantage of this protocol, it is required that the system used has the correct drivers and hardware installed to support this operation. Currently, it is only supported for Intel and AMD CPUs combined with an NVIDIA GPU and a Mellanox IB Host Controller Adapter (HCA). The details of this extension are available in [43]. It is pertinent to mention that GDR support has been exploited by MPI runtimes like MVAPICH2 and OpenMPI and can be utilized transparently on systems that support this feature and have the required GPU/HCA hardware available. Unfortunately, gRPC+GDR designs did not run properly on any of our clusters, so no results or discussion is provided for this design.

C. No-gRPC Designs

Besides the parameter server method to combine (reduce) the results of worker processes, there is the option to use collective operators to perform the reduction. When using these collective operators, there are no separate processes responsible for gathering and updating the network parameters. Instead, the task is distributed over the worker processes. The collective operator required for these operations is called Allreduce; a communication primitive widely used in the HPC/MPI community. There are two 'No-gRPC' designs available as contributed packages to TF: 1) Baidu and 2) Horovod. Both introduce the so-called reduction operators for TF [21] and exploit the Allreduce primitive. Because the reduction based approach is not available in the standard TF version, it requires a bit more work to make use of it than the protocol changes described above. The way these methods are integrated is by overloading a subset of the default TF operators. During execution, these packages get invoked as part of the network (model) configuration phase and detect which of the parameters used in the execution graph need to be combined. Next, additional reduction operators are inserted into the execution graph. During the execution of the graph, these operators are executed, which invokes the corresponding implementation of the Allreduce operator. A major advantage for the users of these reduction based methods is that the steps required to setup the execution of the neural network are much simpler. It is no longer required to launch and configure the additional parameter servers. Furthermore, because both these methods are based on the MPI execution model, the user does not need to configure the endpoints explicitly. Instead, the MPI library/launcher can handle the discovery and connection management.

1) Baidu: Baidu's design is part of the tf.contrib package in the TF codebase. It is called tf.contrib.mpi_collectives and can be enabled during the compilation and configuration steps when building TF from source. This extension is not currently available in the TensorFlow wheels (binary packages) available from Google's website. Baidu's design contains a custom ring-reduce implementation of Allreduce built on top of MPI_Send and MPI_Irecv primitives. This gives access to the same MPI advantages as mentioned above, namely extensive network support and support for CUDA-Aware MPI operators which reduces the number of required memory operations.

2) Horovod: The second method, introduced by Uber, is available as an external Python package called Horovod and can be installed using the pip package manager. The design of Horovod is based on the mpi_collectives implementation and as such the user side modifications are similar for these packages. However, the Horovod design offers some advantages over Baidu's design. Horovod can take advantage of two different reduction implementations. The first is based on the standard MPI_Allreduce operator. The reduction method that this operator uses depends on the underlying MPI implementation. For example, MVAPICH2-GDR now supports efficient CUDA-Aware MPI_Allreduce designs as discussed in Section V while default MPICH [2] and OpenMPI only provide naive implementations of MPI_Allreduce for GPU buffers. More specifically, the actual reduction implementation in an MPI runtime can differ based on the supercomputer or cluster that is used. Many system vendors add custom communication methods that are designed and tuned specifically for the specific hardware and network topology. The second Horovod implementation exploits the NVIDIA NCCL library, which is based on InfiniBand verbs and CUDA kernels to handle communication across and within nodes, respectively. NCCL2 currently cannot be used on the Piz Daint system because it has the proprietary Aries interconnect, which does not support IB verbs. The MPI variant of Horovod, on the other hand, is portable and can be used on the Piz Daint system via Cray's CUDA-Aware MPI implementation. Horovod also provides a unique feature called "Tensor Fusion". When using this method, several small tensors are combined in a single reduction operation. The idea is that by performing several large reductions instead of many small ones the latency is reduced and bandwidth is used more efficiently. The tensor fusion feature is controlled via a runtime threshold parameter, and we experimentally determine the best threshold for a given platform.
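For contrast with the parameter-server listing earlier, the sketch below shows the user-side changes required by the reduction-based Horovod design of Section III-C: there are no parameter servers or endpoint lists, and the process identity comes from the MPI launcher. The stand-in model is hypothetical; the hvd.* calls and the fusion-threshold environment variable are the standard Horovod interface for TF 1.x.

    # Sketch of the Horovod (No-gRPC) approach with the TF 1.x API.
    # Launched as, e.g.:  mpirun -np 4 python train_hvd.py
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()                                        # rank/size come from MPI

    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())  # one GPU per rank

    w = tf.get_variable("w", shape=[1000, 1000])      # stand-in for a real model
    loss = tf.reduce_sum(tf.square(w))

    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)               # inserts Allreduce on the gradients
    train_op = opt.minimize(loss)

    hooks = [hvd.BroadcastGlobalVariablesHook(0)]     # rank 0 broadcasts initial weights

    with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
        sess.run(train_op)

    # Tensor Fusion is tuned via a runtime threshold (bytes), e.g.:
    #   HOROVOD_FUSION_THRESHOLD=67108864 mpirun -np 4 python train_hvd.py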
D. Performance Analysis

We now present a comprehensive performance comparison of all the aforementioned approaches to perform distributed training using TensorFlow. In order to understand the trends for distributed training, the most fundamental question that often gets missed is: What is the best batch size one should use for distributed training? In this context, the general intuition is that a bigger batch size will lead to better performance. However, we perform single-GPU experiments to understand this behavior in a more pragmatic fashion. Figure 2 shows the impact of the batch size for three GPU generations: 1) Kepler (K80), 2) Pascal (P100), and 3) Volta (V100). The key insight here is: faster GPUs offer better performance for larger batch sizes up to the point of diminishing returns. For ResNet-50, the sweet spot seems to be 64 for all three GPUs. Thus, we choose 64 as the single-GPU batch-size baseline and utilize this for all distributed training experiments performed for ResNet-50.

Fig. 2. Effect of Batch Size on Performance for Different GPU Architectures (Images/second, higher is better, vs. batch size for K80, P100, and V100).

The performance numbers for all six approaches (all except gRPC+GDR) are presented in Figure 3. The main insight we gain from these experiments is: gRPC and gRPC+'X' designs are, in general, slower than Horovod designs. Secondary observations include: 1) Baidu's design, despite using the ring-allreduce, lags behind Horovod-NCCL and gRPC+'X' designs, and 2) Horovod-MV2 is always slower than Horovod-NCCL2. We have used TensorFlow version 1.10.0 for all our experiments.

Fig. 3. Performance Analysis of Six TensorFlow Approaches for Distributed Training of ResNet-50 on the RI2 Cluster (see Section VI-B): Baidu-MPI, Horovod-MPI, Horovod-NCCL2, gRPC+verbs, gRPC+MPI, and gRPC (IPoIB); Images/second (higher is better) vs. number of nodes (GPUs). 1) gRPC was executed using IP addresses of the InfiniBand interface to make sure that tensor communication happens using the efficient IPoIB channel. 2) MPI refers to the MVAPICH2 [30] MPI library for all cases. 3) The latest version of NCCL2 (2.3.4) was used for all the experiments.

Performance of Allreduce: The observation about the different performance we get for NCCL2 compared to MVAPICH2 for Horovod designs merits further investigation. This is because the only difference in the Horovod-NCCL and Horovod-MPI designs is the utilization of the corresponding Allreduce primitive. Thus, we perform additional experiments using MPI/NCCL benchmarks to better understand this behavior. Indeed, the performance of NCCL Allreduce is better than MPI Allreduce in MVAPICH2, as shown in Figure 4. We performed a thorough analysis of Allreduce designs in the MVAPICH2 library and found a clear opportunity for performance optimizations to deal with DL workloads (large message sizes). Based on this, we propose optimizations for Allreduce and highlight the performance benefits we observed for these enhancements in Section V.

Fig. 4. Micro-benchmark Performance for MPI Allreduce and NCCL Allreduce on the RI2 Cluster (see Section VI-B): latency (µs) vs. message size (bytes). 1) MPI refers to the existing MVAPICH2 MPI Library and 2) NCCL2 refers to NCCL 2.3.4.

IV. OPTIMIZING tf_cnn_benchmarks FOR HPC SYSTEMS

The performance analysis presented in Section III was made possible via the official TensorFlow benchmarking scripts called tf_cnn_benchmarks [20]. To evaluate various platforms and distributed TF implementations, we modified these application level benchmarks and added support for running Baidu's design. We also modified the scripts to enable gRPC and gRPC+X runs using SLURM [44]. During the start-up phase, we pull in the SLURM environment variables in order to determine the total number of launched benchmark instances and their unique IDs (rank), as sketched below. This is required to setup the configurations that do not use MPI, but also require that each process is uniquely identified. The unique ID is consequently used to determine the type of process (worker or parameter server) and to setup the underlying network connections. Because this is based on the SLURM environment variables it is trivial to adapt this to other workload managers, by using the variables that are specific to that workload manager or particular mpirun version. Once this setup is complete, and the various processes are connected to each other, the neural network will be initialized. This modification to tf_cnn_benchmarks allowed us to test all of the several distributed training configurations with exactly the same neural network design and data processing routines. This ensures that any performance differences between the various networking options can be directly attributed to the reduction algorithm and the networking library.

The benchmark suite allows users to test the performance of various convolutional neural networks using TensorFlow. For our tests, we use three different image classification networks: MobileNet [23], ResNet-50 [22], and NASNet-large [45]. To prevent our results from being influenced by file I/O (disk) performance, we only use synthetic input data. The training phase for all the various approaches will remain common and contains exactly the same operations as the real neural network training, with the difference being the usage of synthetic data instead of real images that need to be read from a storage subsystem. After a number of warm-up iterations, a set of ten iterations will be performed to determine the image throughput rate. Note that because synthetic data is used we purely measure the GPU and network performance for the multi-node cases but only the GPU performance for a single process case. The CPU (used for decoding image data) and storage media (used for loading data) are not part of our tests since we focus only on the scaling characteristics of distributed training. However, when doing real training runs these components will influence the final performance; especially the storage media will be important as a large number of GPUs will have to be fed with fresh training data during each iteration.
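The start-up logic described above can be summarized as follows. The SLURM variable names are standard, but the role-assignment policy (the last task acts as the parameter server) is a simplified, hypothetical stand-in for the logic in our modified scripts.

    # Sketch: derive a unique ID and role for each benchmark instance from SLURM.
    import os

    rank = int(os.environ["SLURM_PROCID"])        # unique ID of this instance
    world = int(os.environ["SLURM_NTASKS"])       # total number of launched instances
    nodelist = os.environ["SLURM_JOB_NODELIST"]   # used to build host:port endpoints

    num_ps = 1                                    # hypothetical split: one PS task
    if rank >= world - num_ps:
        job_name, task_index = "ps", rank - (world - num_ps)
    else:
        job_name, task_index = "worker", rank

    print("rank %d/%d -> %s task %d (nodes: %s)"
          % (rank, world, job_name, task_index, nodelist))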

V. OPTIMIZING CUDA-AWARE MPI PRIMITIVES

Based on the analysis in Section III, we observed that the MPI_Allreduce primitive is a significant performance bottleneck for the Horovod-MPI approach. Thus, the primary challenge we have to deal with is a better and more efficient MPI_Allreduce design to tackle DL workloads at scale. To address this, we propose two major design optimizations: A) a truly CUDA-Aware reduction method that exploits CUDA based computations in addition to GPUDirect-based communication, and B) an advanced caching scheme for GPU-based (device) pointers to avoid query overhead in the critical path.

A. CUDA Kernel-enabled Allreduce for large messages

Although offloading reduction operations to GPU kernels has been discussed before [16], [27], [33], for large messages it has not fully leveraged the GPU computing power for Allreduce algorithms. To perform large-message Allreduce operations efficiently, a variety of "reduce-scatter followed by allgather (RSA)" algorithms are widely used. Two popular algorithms are: 1) Ring-based RSA [34]: first a virtual ring is constructed between processes. Next, each process sends a chunk of data to its left neighbor and performs reduction upon the chunk received from its right neighbor for (p − 1) steps, where p denotes the number of processes. At the end of this reduce-scatter phase, each process owns a partial final result vector. Then a similar ring-based allgather is performed to gather the complete final result (notable implementations: NCCL and Baidu Allreduce). 2) Recursive vector halving and doubling RSA [41]: each process pair exchanges half of the vector/message and reduces it. Next, the message size is halved, and a different pair of processes is selected for log p steps. Allgather is performed in reverse and doubles the message size at each of the log p steps (notable implementations: MPICH and MVAPICH2). The first method is popular because of its bandwidth-optimized feature, while the second method is expected to be more efficient at scale. However, the existing MVAPICH2 implementation of recursive vector halving and doubling RSA relies on the CPU to perform reduction operations, which is a waste of GPU compute power. In this paper, we propose and implement a CUDA-Aware MPI_Allreduce primitive using recursive vector halving and doubling RSA, with GPU kernel-enabled reduction operations. There are three major benefits: 1) it significantly reduces the reduction latency for large messages by fully utilizing the massive parallelism and high-bandwidth memory of GPUs, 2) it avoids expensive data movement from GPU to host memory, and 3) it leverages advanced hardware and software features such as GPUDirect RDMA to improve intra-node and inter-node communication.

B. Optimized Pointer Cache for Device Buffers

The CUDA unified addressing feature allows NVIDIA GPU devices to share a unified address space with the host. This means that there is no distinction between a device pointer and a host pointer, i.e., the same pointer value could represent host memory or device memory at different times. MPI libraries like MVAPICH2 have implemented various algorithms to optimize MPI primitives for host and device buffers. Therefore, the MPI runtime needs to identify the buffer type before selecting the most performant algorithm. The CUDA driver provides a low-level API to perform this identification. However, each MPI call may need to query the buffer type multiple times even when the pointer value remains unchanged. This incurs a significant delay in the MPI function as the call accesses multiple driver modules as illustrated in Figure 5 (red dashed arrow). To mitigate this overhead, we designed a Pointer Cache, which stores the pointer information, as depicted in Figure 5 (dark-red solid box and arrows). There are two ways to maintain the cache: 1) one-time driver lookup at MPI-level, and 2) interception of the device allocation APIs at application-level. In the first approach, the MPI runtime caches (i.e., inserts) the buffer type of the given pointer when it is seen for the first time. However, the runtime is not able to invalidate a cache entry when the buffer gets de-allocated by the application without notifying the MPI runtime. This leads to the second approach. We let the MPI runtime intercept CUDA memory management APIs, e.g., cuMalloc and cuFree, and update the pointer cache accordingly. This way, the MPI runtime only has to query the cache but not maintain it. This optimization improves the performance of all CUDA-aware MPI primitives, including the ones primarily used by the deep learning frameworks.

Fig. 5. Pointer Cache Design (dark-red box and arrows) to avoid expensive queries to the CUDA driver (red dashed arrow). The cache is queried, inserted into, and cleaned from the MPI interface in user space, which sits between the deep learning application (TensorFlow) and the CUDA runtime/driver interfaces above the OS kernel.

C. Performance Benefits

To evaluate the proposed optimizations, we conducted experiments across 16 NVIDIA Tesla K80 GPU nodes at a local cluster (RI2). We used micro-benchmarks to compare the Allreduce performance between default MVAPICH2 (MPI), NVIDIA NCCL (NCCL2), and the new MVAPICH2 implementation (MPI-Opt). Figure 6 shows that the new method yields 4.1× faster Allreduce compared to the default MVAPICH2 solution for small and medium messages (i.e., smaller than 128KB) due to the pointer cache. Yet, MPI-Opt is 17× faster than NCCL2 for an Allreduce operation of 8-byte data. For larger message sizes, the optimized GPU kernel-enabled reduction scheme provides up to 8× (1.4×) performance improvement compared to the default MVAPICH2 (NCCL2) library. With the significant benchmark-level improvements, we next present the application level performance of the enhanced MPI primitives.

Fig. 6. Benefits of the Allreduce optimizations: latency (µs) vs. message size (bytes) for MPI, NCCL2, and MPI-Opt. 1) MPI refers to the existing MVAPICH2 MPI Library. 2) MPI-Opt refers to the optimizations made available in MVAPICH2-GDR 2.3rc1.
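The recursive vector halving and doubling RSA scheme of Section V-A can be expressed compactly. The listing below is not the MVAPICH2-GDR implementation; it is an illustrative host-side version using mpi4py and NumPy that assumes a power-of-two process count and a vector length divisible by it, and it performs the reductions with NumPy where the proposed MPI-Opt design launches CUDA kernels on GPU-resident halves.

    # Illustrative recursive vector halving and doubling RSA Allreduce.
    # Run with, e.g.:  mpirun -np 4 python rsa_allreduce.py
    from mpi4py import MPI
    import numpy as np

    def rsa_allreduce(buf, comm):
        rank, p = comm.Get_rank(), comm.Get_size()
        lo, hi = 0, buf.size
        dist, steps = p // 2, []

        # Reduce-scatter: at each step the group splits in half; every process
        # keeps one half of its current window and gives the other half away.
        while dist >= 1:
            mid = (lo + hi) // 2
            if (rank // dist) % 2 == 0:            # lower half of the group
                partner, keep, give = rank + dist, (lo, mid), (mid, hi)
            else:                                  # upper half of the group
                partner, keep, give = rank - dist, (mid, hi), (lo, mid)
            recv = np.empty(keep[1] - keep[0], dtype=buf.dtype)
            comm.Sendrecv(buf[give[0]:give[1]], dest=partner,
                          recvbuf=recv, source=partner)
            buf[keep[0]:keep[1]] += recv           # reduction (a GPU kernel in MPI-Opt)
            steps.append((partner, keep, give))
            lo, hi = keep
            dist //= 2

        # Allgather: replay the steps in reverse, exchanging owned segments so
        # that every process ends up with the complete reduced vector.
        for partner, keep, give in reversed(steps):
            comm.Sendrecv(buf[keep[0]:keep[1]], dest=partner,
                          recvbuf=buf[give[0]:give[1]], source=partner)

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        data = np.full(1 << 20, comm.Get_rank(), dtype=np.float32)
        rsa_allreduce(data, comm)
        assert np.allclose(data, sum(range(comm.Get_size())))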

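The pointer cache of Section V-B follows the same idea as a small memoization table keyed by the buffer address. The sketch below illustrates the interception-based variant only; the real design lives inside the MPI runtime and intercepts the CUDA driver allocation and free calls (e.g., cuMemAlloc/cuMemFree), which are represented here by hypothetical wrapper functions, with the expensive driver query reduced to a stub.

    # Illustration of the interception-based pointer cache (Section V-B).
    # The alloc/free wrappers and the driver-query stub are hypothetical stand-ins.

    pointer_cache = {}            # buffer address -> True (device) / False (host)

    def driver_query_is_device(addr):
        # Stand-in for the expensive CUDA driver query on the critical path
        # (cuPointerGetAttribute in the real runtime); here we simply pretend
        # the answer is "host memory".
        return False

    def intercepted_alloc(nbytes, real_alloc):
        addr = real_alloc(nbytes)          # e.g., the intercepted device allocation
        pointer_cache[addr] = True         # insert: this address is a device buffer
        return addr

    def intercepted_free(addr, real_free):
        pointer_cache.pop(addr, None)      # remove: entries can never go stale
        real_free(addr)

    def is_device_buffer(addr):
        # Fast path used by every CUDA-aware MPI primitive: a table lookup
        # instead of a driver call when the pointer has been seen before.
        if addr in pointer_cache:
            return pointer_cache[addr]
        result = driver_query_is_device(addr)  # slow path: ask the driver once
        pointer_cache[addr] = result
        return result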
VI.COMPREHENSIVE PERFORMANCE COMPARISON OF C. Owens EXISTINGAND PROPOSED DESIGNS Owens is a 23,392-core Dell Intel Xeon E5-2680 v4 machine We now present an in-depth performance comparison of all with 160 GPU nodes equipped with NVIDIA Pascal P100 the distributed training approaches with existing communica- GPUs. Each node is equipped with a dual-socket Intel Xeon tion libraries as well as with our proposed Allreduce design processor (28 cores) and an IB EDR HCA. We have used schemes. We perform various experiments on three different the Owens cluster to extend our finding from RI2 to a larger clusters: 1) The RI2 cluster at The Ohio State University, 2) scale as well as a newer GPU generation. The cluster is heavily The Owens cluster at Ohio Supercomputing Center (OSC), and used by HPC researchers across the US, so we limit our per- 3) The Piz Daint cluster at Swiss National Supercomputing formance evaluation to only the best performing TF variants Centre. from RI2. Figure 8 provides the performance comparison of Horovod-NCCL2 and Horovod-MPI-Opt for the distributed A. Performance of CPU vs. GPU based Training training of ResNet-50 on up to 64 Pascal GPUs. Clearly, we We performed several GPU and CPU based training exper- can observe that our proposed Allreduce optimizations have iments. However, there are many additional vectors that need enabled Horovod-MPI-Opt to achieve near-ideal scaling that to be considered to provide a sound performance evaluation. is better/comparable to Horovod-NCCL designs. It is pertinent The default CPU installation of TensorFlow offers much to mention that the Horovod designs have been the best for slower performance compared to GPU-based training. We also all configurations on RI2 as well as the Owens cluster. explored the Intel-optimized TensorFlow version. However, we chose to only focus on GPU results in this paper as the 16384 discussion and comparison of default CPU, optimized CPU, 4096 and GPU based training is beyond the scope of this paper. 1024 B. RI2 256 The RI2 cluster at The Ohio State University consists of 64 20 nodes connected via Mellanox SB7790 and SB7800 In- 16 finiBand switches. Each node is equipped with two 14-core 4

Intel (Broadwell) Xeon E5-2680 v4 2.4 GHz processors, 128 better) is (Higher Images/second 1 1 2 4 8 16 32 64 GB DDR3 Memory, one NVIDIA Tesla K80 GPU, and one No. of Nodes (GPUs) single port InfiniBand EDR HCA. Based on the proposed Horovod-NCCL2 Horovod-MPI-Opt Ideal optimizations for Allreduce discussed in Section V, we now Fig. 8. Performance comparison for ResNet-50: Training performed using provide a comparison of the best performing Horovod-NCCL two Horovod designs on the Owens Cluster (up to 64 GPUs). 1) NCCL 2.3.4 approach (shown earlier in Figure 3), Horovod-MPI, and the was used for NCCL experiments. 2) Horovod-MPI-Opt refers to the design that takes advantage of the new Allreduce implementation made available in proposed Horovod-MPI-Opt approach that takes advantage of the MVAPICH2-GDR 2.3rc1 library. the optimized MPI Allreduce design. As we can clearly see in Figure 7, Horovod-MPI-Opt performs much better than D. Piz Daint Horovod-MPI as well as offering better/comparable perfor- We now provide the scaling results up to an even larger scale mance as Horovod-NCCL2 for most cases. The “ideal” bar (up to 128 Pascal GPUs) on the Piz Daint cluster. We also for all the graphs in this section has been calculated us- extend our evaluation for Piz Daint with two more DNN ing a linear speedup formula: Ideal = Images/second architectures: 1) MobileNet [23] and 2) NASNet [45]. Each (for 1 GPU) × #GPUs node of the Piz Daint cluster is equipped with a single P100 GPU and a 12-core Intel Xeon E5-2690 v3 (Haswell) CPU. 1000 900 The nodes are connected using the Cray Aries communication 800 technology based on the Dragonfly topology. The machine has 700 thousands of compute nodes so the actual placement (distance) 600 500 between the processes is random and can influence the actual 400 execution time. The software stack is designed by Cray and 300 optimized for the machine. This means that we have limited 200 100 control over the used (MPI) libraries and just use what the Images/second (Higher better) is (Higher Images/second 0 system has available. Because of this software limitation, we 1 2 4 8 16 No. of GPUs were not able to test out the Horovod-NCCL implementation Horovod-MPI Horovod-NCCL2 Horovod-MPI-Opt (Proposed) Ideal on this system as there is no support for IB verbs, which NCCL Fig. 7. Performance comparison for ResNet-50 training using three Horovod uses for inter-node communication. For Horovod-MPI, we versions on the RI2 Cluster (up to 16 GPUs). 1) NCCL 2.3.4 was used for have used the default Cray-MPICH (v7.6) library. The scaling NCCL experiments. 2) MPI refers to the MVAPICH2 [30] library. 3) MPI-Opt results for Piz Daint are presented in Figure 9 for four different refers to the new Allreduce designs made available as part of the MVAPICH2- GDR 2.3rc1 library. approaches (Horovod-MPI, gRPC, gRPC+MPI, and Horovod- MPI). It is clear that gRPC+MPI approach shows the worst —Submitted to IEEE IPDPS 2019 (Main Track) for Peer Review— 9 scaling. This is because of its single-threaded implementation. MVAPICH2 and have been made publicly available since the The bottleneck is especially visible for NASNet-large model MVAPICH2-GDR 2.3rc1 release. This method shows up to as a large number of model parameters need to be transferred 1.4× faster Allreduce operations compared to the NVIDIA over the network. Baidu-MPI and Horovod-MPI perform very NCCL2 communication library in micro-benchmarks. Our similar, with Horovod-MPI showing slightly better results. 
The performance evaluation shows that the customized reduc- gRPC results fall slightly behind the Baidu-MPI and Horovod- tion based methods, i.e., Horovod-MPI and Horovod-NCCL, MPI solutions. Given that all these implementations use the outperform the parameter server methods, i.e., gRPC and same underlying communication network the performance gRPC+X, for all three evaluated neural nets and all three difference can be attributed to the way the parameter server systems. The experimental results show that the proposed opti- solution is implemented. While training approaches have an mizations help Horovod-MPI achieve approximately 98% and impact on performance, the scaling behavior is also directly 90% scaling efficiency for training on 16 and 64 GPU nodes related to the size of the DNN. Mobilenet shows worst scaling on the RI2 and Owens clusters, respectively. Finally, Horovod- compared to the ideal (16% efficiency for Horovod-MPI), MPI yields 1.8× and 3.2× higher throughput than the native as communication of gradients cannot be overlapped (hidden gRPC method for ResNet-50 and MobileNet, respectively, on behind) the relatively smaller computation. NASNet-large, the the Piz Daint cluster across 128 nodes. largest network we have tested, shows near-ideal scaling (92% efficiency for Horovod-MPI as the computation is large enough ACKNOWLEDGMENT to be overlapped with gradient communication. Resnet-50 falls in between the two extremes with 71% efficiency for Horovod- This research is supported in part by National Science MPI. Foundation grants #CNS-1513120, #ACI-1450440 and #CCF- 1565414, and a grant from the Swiss National Supercomputing VII.RELATED WORK Centre (CSCS) under project ID s716. The authors would like There are a significant number of research efforts from to thank Jonathan Perkins and Dr. Khaled Hamidouche for academia and industry toward understanding and developing extending invaluable help in implementing the pointer cache scalable deep learning systems. Ben-Nun and Hoefler present design for MVAPICH2-GDR. an in-depth concurrency analysis of various DNN architectures REFERENCES with a focus on distributed training in [13]. Cui et al. propose a “GeePS” parameter server to support scalable data-parallel [1] “ImageNet,” http://image-net.org/, [Online; accessed October 29, 2018]. training across GPU nodes by mitigating data movement [2] “MPICH: High-Performance Portable MPI,” https://www.mpich.org/, [Online; accessed October 29, 2018]. overhead caused by GPU stalling in [18]. Chilimbi et al. [3] “Open Sourcing TensorFlowOnSpark: Dis- propose a distributed system called Adam to achieve scalable tributed Deep Learning on Big-Data Clusters,” DL training for large DNN models, and a 120-machine cluster http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing- tensorflowonspark-distributed-deep, [Online; accessed October 29, was used for a demonstration [15]. Shi et al. evaluated the per- 2018]. formance of distributed DL frameworks such as TensorFlow, [4] “,” https://developers.google.com/protocol-buffers/, CNTK, Caffe-MPI and MXNet over single-GPU, multi-GPU, [Online; accessed October 29, 2018]. [5] “PyTorch,” https://pytorch.org/, [Online; accessed October 29, 2018]. and multi-node (4 nodes) scenarios [38], [39]. They identify [6] “Tensorflow: An Open Source Machine Learning Framework for Every- several performance bottlenecks such as I/O access and data one,” https://github.com/tensorflow/tensorflow, [Online; accessed Octo- communication among GPUs. 
However, the scalability of ber 29, 2018]. [7] “UCX-Unified Communication X,” http://www.openucx.org/, [Online; these DL frameworks is not evaluated. The authors of [40] accessed October 29, 2018]. have presented an extension of the Horovod [37] design that [8] “Intel/caffe,” https://github.com/intel/caffe, 2016, [Online; accessed Oc- uses the Intel Machine Learning Scaling (MLSL) library for tober 29, 2018]. distributed training with TensorFlow. However, the study is [9] “Caffe2 Framework,” https://caffe2.ai, 2017. [10] “Gloo: Collective Communication Library with Various Primitives focused on scaling CPU-based training only and no GPU for Multi-machine Training.” https://github.com/facebookincubator/gloo, results are presented. 2017, [Online; accessed October 29, 2018]. [11] M. Abadi, P. Barham, J. Chen et al., “Tensorflow: A system for large- VIII.CONCLUSION scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, In this work, we evaluated a wide range of communication 2016, pp. 265–283. methods that can be used with the TensorFlow deep learning [12] A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, “S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on framework. Our focus has been on the scalability, and usability Modern GPU Clusters,” in Proceedings of the 22nd ACM SIGPLAN of the various techniques. We compared MPI based reduction Symposium on Principles and Practice of Parallel Programming, ser. methods, custom written reduction schemes, and TensorFlow’s PPoPP ’17. New York, NY, USA: ACM, 2017. [13] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed native parameter server model. For our evaluation, we used Deep Learning: An In-Depth Concurrency Analysis,” CoRR, vol. three different image classification networks, each with a abs/1802.09941, Feb. 2018. significantly different number of parameters to be exchanged. [14] . Biswas, X. Lu, and D. K. Panda, “Accelerating TensorFlow with Adaptive RDMA-based gRPC,” in 25th IEEE International Conference We further introduced an improved reduction method for GPU on High Performance Computing, Data, and Analytics (HiPC’18), based data. The optimized designs have been integrated into December 2018. —Submitted to IEEE IPDPS 2019 (Main Track) for Peer Review— 10

(a) NASNet-large (b) ResNet-50 (c) Mobilenet
Fig. 9. Performance Comparison for Various Distributed Training Approaches using 1) NASNet-large, 2) ResNet-50, and 3) Mobilenet on the Piz Daint Cluster (up to 128 GPUs). MPI refers to the Cray-MPICH (v7.6) MPI Library. Each panel plots Images/second (higher is better) against the number of GPUs for Baidu-MPI, Horovod-MPI, gRPC+MPI, gRPC, and ideal scaling.

[15] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project [29] Microsoft, “CNTK,” http://www.cntk.ai/, 2017, [Online; accessed April- Adam: Building an Efficient and Scalable Deep Learning Training 2017]. System,” in 11th USENIX Symposium on Operating Systems Design and [30] MVAPICH2: High Performance MPI over InfiniBand and iWARP, Implementation (OSDI 14). Broomfield, CO: USENIX Association, http://mvapich.cse.ohio-state.edu/, [Online; accessed October 29, 2018]. 2014, pp. 571–582. [31] NVIDIA, “AI Research & Developement — NVIDIA DGX Sys- [16] C. Chu, K. Hamidouche, A. Venkatesh, A. A. Awan, and D. K. Panda, tems,” https://www.nvidia.com/en-us/data-center/dgx-systems/, [Online; “CUDA Kernel Based Collective Reduction Operations on Large-scale accessed October 29, 2018]. GPU Clusters,” in 2016 16th IEEE/ACM International Symposium on [32] NVIDIA, “NCCL,” https://github.com/NVIDIA/nccl, 2017, [Online; ac- Cluster, Cloud and Grid Computing (CCGrid), May 2016, pp. 726–735. cessed October 29, 2018]. [17] J. Chu and V. Kashyap, “Transmission of IP over InfiniBand (IPoIB),” [33] L. Oden, B. Klenk, and H. Frning, “Energy-Efficient Collective Re- Requests for Comments, RFC Editor, RFC 4391, Apr 2006. [Online]. duce and Allreduce Operations on Distributed GPUs,” in 2014 14th Available: https://tools.ietf.org/html/rfc4391 IEEE/ACM International Symposium on Cluster, Cloud and Grid Com- [18] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, “GeePS: puting, May 2014, pp. 483–492. Scalable Deep Learning on Distributed GPUs with a GPU-specialized [34] P. Patarasuk and X. Yuan, “Bandwidth Optimal All-reduce Algorithms Parameter Server,” in Proceedings of the Eleventh European Conference for Clusters of Workstations,” J. Parallel Distrib. Comput., vol. 69, no. 2, on Computer Systems, ser. EuroSys ’16. New York, NY, USA: ACM, pp. 117–124, Feb. 2009. 2016, pp. 4:1–4:16. [35] H. Pritchard, D. Gilmore, M. ten Bruggencate, D. Knaak, and M. Pagel, [19] Google, “Google’s Remote Procedure Call Library (gRPC),” http://www. “Message passing toolkit (MPT) software on XT3,” in Cray User Group grpc.io. 2006 Proceedings, 2006. [20] ——, “TensorFlow Benchmarks,” https://github.com/tensorflow/benchmarks,[36] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Parallel Distributed 2018, [Online; accessed Oct-2018]. Processing: Explorations in the Microstructure of Cognition, Vol. 1,” [21] ——, “TensorFlow Custom Operators: Adding an Op,” D. E. Rumelhart, J. L. McClelland, and C. PDP Research Group, https://www.tensorflow.org/extend/adding an op, 2018, [Online; Eds. Cambridge, MA, USA: MIT Press, 1986, ch. Learning Internal accessed Oct-2018]. Representations by Error Propagation, pp. 318–362. [22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for [37] A. Sergeev and M. D. Balso, “Horovod: Fast and Easy Distributed Deep Image Recognition,” ArXiv e-prints, Dec. 2015. Learning in TensorFlow,” CoRR, vol. abs/1802.05799, 2018. [38] S. Shi and X. Chu, “Performance Modeling and Evaluation of [23] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, Distributed Deep Learning Frameworks on GPUs,” CoRR, vol. M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural abs/1711.05979, 2017. networks for mobile vision applications,” CoRR, vol. abs/1704.04861, [39] S. Shi, Q. Wang, P. Xu, and X. Chu, “Benchmarking State-of-the-Art 2017. Deep Learning Software Tools,” CoRR, vol. abs/1608.07249, 2016. [24] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. 
Keutzer, “FireCaffe: [40] S. Sridharan, K. Vaidyanathan, D. Kalamkar, D. Das, M. E. Smorkalov, Near-linear Acceleration of Deep Neural Network Training on Compute M. Shiryaev, D. Mudigere, N. Mellempudi, S. Avancha, B. Kaul, and Clusters,” in Proceedings of the IEEE Conference on Computer Vision P. Dubey, “On Scale-out Deep Learning Training for Cloud and HPC,” and Pattern Recognition, 2016, pp. 2592–2600. ArXiv e-prints, Jan. 2018. [25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, [41] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of Collective S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for Communication Operations in MPICH,” Int. J. High Perform. Comput. arXiv preprint arXiv:1408.5093 Fast Feature Embedding,” , 2014. Appl., vol. 19, no. 1, pp. 49–66, Feb. 2005. [26] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, [42] The Open MPI Development Team, “Open MPI : Open Source High V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed Performance Computing,” http://www.open-mpi.org. machine learning with the parameter server,” in Proceedings of the 11th [43] B. Yi, J. Xia, L. Chen, and K. Chen, “Towards Zero Copy Dataflows USENIX Conference on Operating Systems Design and Implementation, Using RDMA,” in Proceedings of the SIGCOMM Posters and Demos, ser. OSDI’14. Berkeley, CA, USA: USENIX Association, 2014, ser. SIGCOMM Posters and Demos ’17. New York, NY, USA: ACM, pp. 583–598. [Online]. Available: http://dl.acm.org/citation.cfm?id= 2017, pp. 28–30. 2685048.2685095 [44] A. B. Yoo, M. A. Jette, and M. Grondona, “SLURM: Simple Linux [27] X. Luo, W. Wu, G. Bosilca, T. Patinyasakdikul, L. Wang, and J. Don- Utility for Resource Management,” in JSSPP 2003. Springer, 2003, garra, “ADAPT: An Event-based Adaptive Collective Communication pp. 44–60. Framework,” in Proceedings of the 27th International Symposium on [45] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning Trans- High-Performance Parallel and Distributed Computing, ser. HPDC ’18. ferable Architectures for Scalable Image Recognition,” CoRR, vol. New York, NY, USA: ACM, 2018, pp. 118–130. abs/1707.07012, 2017. [28] H. Meuer, E. Strohmaier, J. Dongarra, and H. Simon, “TOP 500 Supercomputer Sites,” http://www.top500.org.