
Performance, Power, and Scalability Analysis of the Horovod Implementation of the CANDLE NT3 Benchmark on the Cray XC40 Theta

Xingfu Wu (1), Valerie Taylor (1), Justin M. Wozniak (2), Rick Stevens (3), Thomas Brettin (4), and Fangfang Xia (3)
(1) Mathematics and Computer Science Division, Argonne National Laboratory and University of Chicago
(2) Data Science and Learning Division, Argonne National Laboratory and University of Chicago
(3) Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory and University of Chicago
(4) Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory
Email: xingfu.wu, vtaylor, woz, stevens, brettin, [email protected]

Abstract—Training scientific deep learning models requires a large amount of computing power provided by HPC systems. In this paper, we use the distributed deep learning framework Horovod to parallelize NT3, a Python benchmark from the exploratory research project CANDLE (Cancer Distributed Learning Environment). We analyze NT3's scalability, performance, and power characteristics with different batch sizes and learning rates under two memory modes, cache and flat, on the DOE pre-exascale production system Cray XC40 Theta at Argonne National Laboratory. Our experimental results indicate that the power profiles for the node, CPU, and memory are useful in showing how the Horovod NT3 benchmark behaves on the underlying system. Using the communication timeline of this benchmark, we found that the Horovod communication overhead in NT3 increases significantly with the number of nodes, although Horovod has the ability to scale up. The benchmark achieves a smaller runtime and lower power consumption for the node and CPU under the cache mode than under the flat mode. Furthermore, increasing the batch size leads to a runtime decrease and slightly impacts the power. Increasing the learning rate results in a slight decrease in runtime and node power and an increase in accuracy. Several issues raised by the Horovod NT3 benchmark results are discussed, and suggestions are proposed for further work.

1. Introduction

Training modern deep learning models requires a large amount of computing power provided by high-performance computing (HPC) systems. TensorFlow [2] [22] is one of the most widely used open source frameworks for deep learning; it supports a wide variety of deep learning uses, from conducting exploratory research to deploying models in production on cloud servers, mobile apps, and even self-driving vehicles [20]. Horovod [11] [20], developed by Uber, is a distributed training framework for TensorFlow and Keras [12]. In this work, we use Horovod to parallelize NT3, a Python-based benchmark [6] from the exploratory research project CANDLE (Cancer Distributed Learning Environment) [4]. We then analyze the Horovod implementation of NT3 in terms of performance, power, and scalability on the DOE pre-exascale production system Cray XC40 Theta [9] at Argonne National Laboratory.

The CANDLE project [4] [25] focuses on building a single scalable deep neural network code that can address three cancer challenge problems: the RAS pathway problem, understanding the molecular basis of key protein interactions in the RAS/RAF pathway present in 30% of cancers; the drug response problem, developing predictive models for drug response to optimize preclinical drug screening and drive precision-medicine-based treatments for cancer patients; and the treatment strategy problem, automating the analysis and extraction of information from millions of cancer patient records to determine optimal cancer treatment strategies. CANDLE benchmark codes [5] implement deep learning architectures that are relevant to these three cancer problems. The NT3 benchmark [6] is one of the Pilot1 benchmarks [5], which are formed from problems and data at the cellular level. The goal behind these Pilot1 benchmarks is to predict drug response based on molecular features of tumor cells and drug descriptors.
The NT3 benchmark, like the other CANDLE benchmarks, is implemented in Python using the Keras framework. Python allows for rapid development of the application. It also enables code reuse across the CANDLE benchmarks, since each benchmark uses common Python-based CANDLE utilities and implements a common interface used by higher-level Python-based driver systems, such as the CANDLE/Supervisor framework for hyperparameter optimization [25]. These benchmarks, which are intended to run on exascale systems as they emerge, are currently being tested on pre-exascale systems such as Theta. These pre-exascale systems feature new hardware at ever greater scale, requiring new analysis of performance and power to determine how best to use them. Deep learning is expected to play a greater role in scientific computing on systems such as Summit [21]. Thus, it is critical to study the performance and power usage of the whole application stack, including the scripting level, numerics, and communication.

Speeding up TensorFlow applications on large-scale supercomputers such as Theta requires a distributed TensorFlow environment. Currently, TensorFlow has a native method for parallelism across nodes that uses its socket-based gRPC layer [1] [10], but this method is difficult to use and optimize [15] [20]. The performance and usability issues with distributed TensorFlow can be addressed, however, by adopting an MPI communication model. Although TensorFlow has an MPI option, it replaces only the point-to-point operations in gRPC with MPI and does not use MPI collective operations. Horovod adopts the MPI communication model by adding an allreduce between the gradient computation and the model update, replacing the native optimizer with a new one called the Distributed Optimizer. No modification to TensorFlow itself is required; the Python training scripts are modified instead. The Cray programming environment machine learning plugin (CPE ML Plugin) [15], like Horovod, does not require modification to TensorFlow, but it is designed for Cray systems and is not available to the public. Therefore, we chose Horovod for this investigation.
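To make this concrete, the following is a minimal sketch of the changes typically made to a Keras training script for Horovod, following Horovod's standard conventions; it is an illustration rather than the NT3 benchmark's actual code, and the small dense model, random data, and hyperparameter values are placeholders.

import numpy as np
import keras
import horovod.keras as hvd

# Initialize Horovod (one worker per MPI rank).
hvd.init()

# Placeholder model and data; the real NT3 benchmark defines its own
# Keras model and loads its own data.
x = np.random.rand(1024, 32).astype("float32")
y = keras.utils.to_categorical(np.random.randint(2, size=1024), 2)
model = keras.models.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    keras.layers.Dense(2, activation="softmax"),
])

# Scale the learning rate by the number of workers (linear scaling rule) and
# wrap the native optimizer with Horovod's Distributed Optimizer, which
# averages gradients with an allreduce before each model update.
opt = keras.optimizers.SGD(lr=0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

# Broadcast the initial weights from rank 0 so all workers start from the same
# state; write checkpoints only on rank 0 to avoid file collisions.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint("checkpoint.h5"))

model.fit(x, y, batch_size=20, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

Launching such a script with mpirun (or horovodrun) then starts one of these workers per rank, and the Distributed Optimizer handles the gradient allreduce among them.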
Related work with Horovod and TensorFlow has been reported in the literature. A. Sergeev and M. Del Balso [20] designed and developed Horovod, and they used the TensorFlow benchmarks [23], such as Inception V3 and ResNet-101, to compare the performance (images per second) of the Horovod implementations with standard distributed TensorFlow on different numbers of NVIDIA Pascal GPUs. They observed large improvements in Horovod's ability to scale, and training with the Horovod implementation was about twice as fast as with standard distributed TensorFlow. P. Mendygral et al. [15] discussed the Horovod-like Cray CPE ML Plugin, and they used TensorFlow benchmarks such as Inception V3 and ResNet50 to compare the performance (samples per second) of the CPE ML Plugin implementations with standard distributed TensorFlow with gRPC and with Horovod on a Cray XC40 system. They observed that the CPE ML Plugin outperformed both Horovod and standard distributed TensorFlow. They also discussed convergence considerations at scale in deep learning and presented square and linear learning rate scaling rules. They found

In this paper, we use Horovod to parallelize the CANDLE NT3 benchmark, and we analyze its scalability, performance, and power characteristics with different batch sizes and learning rates under two memory modes, cache and flat (the high-bandwidth on-package Multi-Channel DRAM can be configured as a shared L3 cache (cache mode) or as a distinct NUMA node memory (flat mode)), on the Cray XC40 Theta. Our experimental results indicate that power profiling for the node, CPU, and memory is useful for showing how the Horovod NT3 benchmark behaves on the system. Using the communication timeline of this benchmark, we find that the Horovod communication overhead in NT3 increases significantly with the number of nodes. The benchmark achieves a smaller runtime and lower power consumption for the node and CPU under cache mode than under flat mode. Furthermore, increasing the batch size leads to a runtime decrease and slightly impacts the power, and increasing the learning rate results in a slight decrease in runtime and node power and an increase in accuracy.

This work makes the following contributions.

• We use Horovod to parallelize the CANDLE NT3 benchmark. This parallelization method can be applied to other CANDLE benchmarks, such as the Pilot1 and Pilot3 benchmarks, in a similar way.
• We analyze the scalability of the Horovod implementation of the NT3 benchmark with weak scaling, and we discuss the Horovod overhead.
• We investigate the performance and power characteristics of the Horovod implementation of the NT3 benchmark with strong scaling, and we use power profiling to analyze how parameters such as the learning rate and batch size affect the performance and power.

The remainder of this paper is organized as follows. Section 2 briefly describes the CANDLE NT3 benchmark and Horovod and then discusses the Horovod implementation. Section 3 describes the system platform, the Cray XC40 Theta. Section 4 analyzes the scalability of the Horovod implementation of the NT3 benchmark with increasing numbers of nodes. Section 5 uses the experimental results to analyze the performance and power characteristics of the NT3 benchmark. Section 6 summarizes our conclusions and discusses future work.

2. CANDLE NT3 Benchmark and Its Horovod Implementation

In this section, we briefly describe the CANDLE NT3 benchmark and the distributed deep learning framework Horovod. We then discuss the Horovod implementation of the benchmark in detail.
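As a preview of that discussion, the sketch below illustrates one script-level difference between the strong-scaling and weak-scaling experiments mentioned above: under strong scaling, a fixed global data set is divided among the ranks, whereas under weak scaling, each rank keeps a fixed amount of work and the global workload grows with the number of workers. This is an illustrative sketch only; the arrays and sizes are placeholders, not the NT3 benchmark's actual data pipeline.

import numpy as np
import horovod.keras as hvd

hvd.init()

# Placeholder data set standing in for the benchmark's real training data.
n_samples, n_features = 60000, 64
x_train = np.random.rand(n_samples, n_features).astype("float32")
y_train = np.random.randint(2, size=n_samples)

# Strong scaling: the fixed global data set is split evenly across ranks, so
# each worker trains on roughly n_samples / hvd.size() samples.
local_idx = np.array_split(np.arange(n_samples), hvd.size())[hvd.rank()]
x_local, y_local = x_train[local_idx], y_train[local_idx]

# Weak scaling instead keeps the per-rank sample count fixed, so the total
# amount of work grows with hvd.size() and no sharding of a fixed set is needed.
print("rank %d of %d: %d local samples" % (hvd.rank(), hvd.size(), len(x_local)))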