Performance, Power, and Scalability Analysis of the Horovod Implementation of the CANDLE NT3 Benchmark on the Cray XC40 Theta

Xingfu Wu1, Valerie Taylor1, Justin M. Wozniak2, Rick Stevens3, Thomas Brettin4 and Fangfang Xia3

1Mathematics and Computer Science Division, Argonne National Laboratory and University of Chicago

2Data Science and Learning Division, Argonne National Laboratory and University of Chicago

3Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory and University of Chicago

4Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory

Email: {xingfu.wu, vtaylor, woz, stevens, brettin, fangfang}@anl.gov

Abstract—Training scientific models requires the large amount of computing power provided by HPC systems. In this paper, we use the distributed deep learning framework Horovod to parallelize NT3, a Python benchmark from the exploratory research project CANDLE (Cancer Distributed Learning Environment). We analyze NT3's scalability, performance, and power characteristics with different batch sizes and learning rates under two memory modes, cache and flat, on the DOE pre-exascale production system Cray XC40 Theta at Argonne National Laboratory. Our experimental results indicate that the power profiles for the node, CPU, and memory are useful in showing how the Horovod NT3 benchmark behaves on the underlying system. Using the communication timeline of this benchmark, we found that the Horovod communication overhead in NT3 increases significantly with the number of nodes, although Horovod has the ability to scale up. The benchmark achieves smaller runtime and lower node and CPU power consumption under the cache mode than under the flat mode. Furthermore, increasing the batch size leads to a runtime decrease and slightly impacts the power, and increasing the learning rate results in a slight decrease in runtime and node power and an increase in accuracy. Several issues raised by the Horovod NT3 benchmark results are discussed, and suggestions are proposed for further work.

1. Introduction

Training modern deep learning models requires the large amount of computing power provided by high-performance computing (HPC) systems. TensorFlow [2] [22] is one of the most widely used open source frameworks for deep learning; it supports a wide variety of deep learning uses, from conducting exploratory research to deploying models in production on cloud servers, mobile apps, and even self-driving vehicles [20]. Horovod [11] [20], developed by Uber, is a distributed training framework for TensorFlow and Keras [12]. In this work, we use Horovod to parallelize NT3, a Python-based benchmark [6] from the exploratory research project CANDLE (Cancer Distributed Learning Environment) [4]. We then analyze the Horovod implementation of NT3 in terms of performance, power, and scalability on the DOE pre-exascale production system Cray XC40 Theta [9] at Argonne National Laboratory.

The CANDLE project [4] [25] focuses on building a single scalable deep neural network code that can address three cancer challenge problems: the RAS pathway problem, understanding the molecular basis of key protein interactions in the RAS/RAF pathway presented in 30% of cancers; the drug response problem, developing predictive models for drug response to optimize preclinical drug screening and drive precision-medicine-based treatments for cancer patients; and the treatment strategy problem, automating the analysis and extraction of information from millions of cancer patient records to determine optimal cancer treatment strategies. CANDLE benchmark codes [5] implement deep learning architectures that are relevant to these three cancer problems. The NT3 benchmark [6] is one of the Pilot1 benchmarks [5], which are formed from problems and data at the cellular level. The goal behind the Pilot1 benchmarks is to predict drug response based on molecular features of tumor cells and drug descriptors.

The NT3 benchmark, like the other CANDLE benchmarks, is implemented in Python by using the Keras framework. Python allows for the rapid development of the application.
It also enables code reuse across the CANDLE benchmarks, since each benchmark uses common Python-based CANDLE utilities and implements a common interface used by higher-level Python-based driver systems, such as the CANDLE/Supervisor framework for hyperparameter optimization [25]. These benchmarks, which are intended to run on exascale systems as they emerge, are currently being tested on pre-exascale systems such as Theta. These pre-exascale systems feature new hardware at ever greater scale, requiring new analysis of performance and power to determine how best to use them. Deep learning is expected to play a greater role in scientific computing on systems such as Summit [21]. Thus, it is critical to study the performance and power usage of the whole application stack, including the scripting level, numerics, and communication.

Speeding up TensorFlow applications by utilizing large-scale supercomputers such as Theta requires a distributed TensorFlow environment. Currently, TensorFlow has a native method for parallelism across nodes using the gRPC layer in TensorFlow based on sockets [1] [10], but this method is difficult to use and optimize [15] [20]. The performance and usability issues with distributed TensorFlow can be addressed, however, by adopting an MPI communication model. Although TensorFlow has an MPI option, it replaces only the point-to-point operations in gRPC with MPI and does not use MPI collective operations. Horovod adapts the MPI communication model by adding an allreduce between the gradient computation and the model update, replacing the native optimizer with a new one called the Distributed Optimizer. No modification to TensorFlow itself is required; only the Python training scripts are modified.
The Cray programming environment machine learning plugin (CPE ML Plugin) [15], like Horovod, does not require modification to TensorFlow, but it is designed for Cray systems and is not available to the public. Therefore, we chose Horovod for this investigation.

Related work with Horovod and TensorFlow has been reported in the literature. A. Sergeev and M. Del Balso [20] designed and developed Horovod, and they used TensorFlow benchmarks [23] such as Inception V3 and ResNet-101 to compare the performance (images per second) of the Horovod implementations with standard distributed TensorFlow on different numbers of NVIDIA Pascal GPUs. They observed large improvements in Horovod's ability to scale; the training in the Horovod implementation was about twice as fast as standard distributed TensorFlow. P. Mendygral et al. [15] discussed the Horovod-like Cray CPE ML Plugin, and they used TensorFlow benchmarks such as Inception V3 and ResNet50 to compare the performance (samples per second) of the CPE ML Plugin implementations with standard distributed TensorFlow with gRPC and with Horovod on a Cray XC40 system. They observed that the CPE ML Plugin outperformed both Horovod and standard distributed TensorFlow. They also discussed convergence considerations at scale in deep learning and presented square-root and linear learning rate scaling rules. They found that the scaling rules are more attractive (when they work) because they do not require many additional iterations to reach the same accuracy.

In this paper, we analyze the Horovod parallel implementation of the NT3 benchmark, focusing on its scalability, performance, and power characteristics with different batch sizes and learning rates under two memory modes, cache and flat (the high-bandwidth on-package Multi-Channel DRAM can be configured as a shared L3 cache (cache mode) or as a distinct NUMA node memory (flat mode)), on the Cray XC40 Theta. Our experimental results indicate that power profiling for the node, CPU, and memory is useful for showing how the Horovod NT3 benchmark behaves on the system. Using the communication timeline of this benchmark, we find that the Horovod communication overhead in NT3 increases significantly with the number of nodes. The benchmark achieves smaller runtime and lower node and CPU power consumption under cache mode than under flat mode. Furthermore, increasing the batch size leads to a runtime decrease and slightly impacts the power, and increasing the learning rate results in a slight decrease in runtime and node power and an increase in accuracy.

This work makes the following contributions.

• We use Horovod to parallelize the CANDLE NT3 benchmark. This parallelization method can be applied to other CANDLE benchmarks, such as the Pilot1 and Pilot3 benchmarks, in a similar way.

• We analyze the scalability of the Horovod implementation of the NT3 benchmark with weak scaling, and we discuss the Horovod overhead.

• We investigate the performance and power characteristics of the Horovod implementation of the NT3 benchmark with strong scaling, and we use power profiling to analyze how parameters such as the learning rate and batch size affect performance and power.

The remainder of this paper is organized as follows. Section 2 briefly describes the CANDLE NT3 benchmark and Horovod and then discusses the Horovod implementation. Section 3 describes the system platform, the Cray XC40 Theta. Section 4 analyzes the scalability of the Horovod implementation of the NT3 benchmark with increasing numbers of nodes. Section 5 uses the experimental results to analyze the performance and power characteristics of the NT3 benchmark. Section 6 summarizes our conclusions and discusses future work.

2. CANDLE NT3 Benchmark and Its Horovod Implementation

In this section, we briefly describe the CANDLE NT3 benchmark and the distributed deep learning framework Horovod. We then discuss the Horovod implementation of the benchmark in detail.

2.1. CANDLE NT3 Benchmark

The CANDLE NT3 benchmark [6] is written in Python and Keras, which is a high-level neural network API written in Python and capable of running on top of TensorFlow, CNTK [16], or Theano [24]. This benchmark is a 1D convolutional network for classifying RNA-seq gene expression
global ’ftp://ftp.mcs.anl.gov/pub/candle/public/ benchmarks/Pilot1/normal-tumor/’ main following = the data_url uses It bench- evaluation data. NT3 and test prediction The basic on and 1. preprocessing, cross-validation, loading, to and data training add phases: four which entails Tumor mark and Normal for softmax dense-dropout ( of the pairs by of controlled number layers various a of into flows the pairs signal in numbers of the by number list. specified variable counts unit a layer with parameters by by controlled processed layers, -pooling first is data tissues. of tumor transformation and and normal difference between the representation studying latent for useful is rmRAsqFK-Qvle 6 htmpt column integer a to the map contains that [6] that values ex- FPKM-UQ transformed of RNA-seq columns synthetically dataset float from for full 60,483 check contains The features control profiles. Com- pression quality expression Data a gene Genomic as generated NCI acts the and from mons available expression trained gene pairs is normal-tumor model profile matched This 700 balanced layers. pooling the dense on with final by interleaved followed layers layers convolutional with 1D models convolutional multiple of network architecture This classic categories. the tissue follows tumor or normal into profiles dropout h ieo h riigdt file data training the of size The input The 1. Figure in depicted is architecture NT3 The relu (floats) (floats) RNA-seq iue1 ceai fN3nua ewr architecture. network neural NT3 of Schematic 1. Figure RNA-seq → → ucini sdt banteotu,adscores and output, the obtain to used is function 0|1 0|1 sue steatvto ucin hn the Then, function. activation the as used is a eoitdfrn rpus.Fnly a Finally, dropouts). no for omitted be may

relu Input: Conv1D Poooolliinng:: 11DD dense 1]wt T ormtl read remotely to FTP with [18] 0 | 1 ec h T benchmark NT3 the Hence . relu Hiidden:: DDeennssee itand list Drropoutt:: 00..11 nt_train2.csv softmax dropout
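To make the architecture concrete, the following is a minimal Keras sketch of an NT3-like model assembled from the global parameters above. The layer ordering and the helper name build_nt3_like_model() are illustrative assumptions; the actual model-building code in the CANDLE repository [6] may differ in detail.

# Minimal sketch of an NT3-like model built from the global parameters
# above (conv=[128, 20, 1, 128, 10, 1], pool=[1, 10], dense=[200, 20],
# drop=0.1, classes=2). Layer ordering is illustrative; see the NT3
# source in the CANDLE repository [6] for the real model.
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

def build_nt3_like_model(input_dim=60483):
    model = Sequential()
    # Two (Conv1D, MaxPooling1D) pairs: (filters, kernel_size, strides)
    # taken from conv=[128, 20, 1, 128, 10, 1] and pool=[1, 10].
    model.add(Conv1D(filters=128, kernel_size=20, strides=1,
                     activation='relu', input_shape=(input_dim, 1)))
    model.add(MaxPooling1D(pool_size=1))
    model.add(Conv1D(filters=128, kernel_size=10, strides=1,
                     activation='relu'))
    model.add(MaxPooling1D(pool_size=10))
    model.add(Flatten())
    # Dense-dropout pairs from dense=[200, 20] and drop=0.1.
    model.add(Dense(200, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(20, activation='relu'))
    model.add(Dropout(0.1))
    # Softmax output over the two classes (0 = Normal, 1 = Tumor).
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='sgd',
                  metrics=['accuracy'])
    return model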

2.2. Horovod

Horovod [11] [20] is a distributed training framework for TensorFlow and Keras developed by Uber. It is a stand-alone Python package, and its core goal is to make distributed deep learning fast and easy to use. The core principles of Horovod are based on MPI concepts such as size, rank, local rank, allreduce, allgather, and broadcast, and it is implemented by using MPI subroutines. A unique feature of Horovod is its ability to interleave communication and computation. Moreover, it is able to batch small allreduce operations by combining all the tensors that are ready to be reduced at a given moment into one reduction operation, an action that results in improved performance. The Horovod source code is based on the Baidu tensorflow-allreduce repository [3]. In its MPI-based data parallelism examples [11], Horovod provides parallelization at the epoch level (keras_mnist.py) and at the batch step level (keras_mnist_advanced.py).

2.3. Using Horovod to Parallelize the NT3 Benchmark

As described in [11], we made the following additions to the NT3 benchmark to utilize CPUs; a sketch combining these steps follows the list.

• Add import horovod.tensorflow as hvd to import the Horovod package.

• Add hvd.init() to initialize Horovod.

• Obtain the number of epochs based on the number of CPU nodes (hvd.size()) and the rank (hvd.rank()), where comp_epochs() calculates the number of epochs used for each node. For load balancing, we properly adjust the number of epochs to ensure that the number of epochs is the same on each node:

def comp_epochs(n, myrank=0, nprocs=1):
    j = int(n // nprocs)
    k = n % nprocs
    if myrank < nprocs - 1:
        i = j
    else:
        i = j + k
    return i

nprocs = hvd.size()
myrank = hvd.rank()
epochs = comp_epochs(gParameters['epochs'],
                     myrank, nprocs)

• Scale the learning rate by the number of workers. We scaled the learning rate to learning_rate × hvd.size() and increased the batch size as well.

• Wrap the original optimizer in the Horovod distributed optimizer using optimizer = hvd.DistributedOptimizer(optimizer). The distributed optimizer delegates the gradient computation to the original optimizer, averages the gradients using MPI_Allreduce(), and then applies those averaged gradients.

• Add hvd.BroadcastGlobalVariablesHook(0) to the callbacks to broadcast initial variable states from rank 0 to all other processes. This step ensures consistent initialization of all workers when training is started with random weights.
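The sketch below combines these additions in one place, following Horovod's Keras examples [11]. It is illustrative rather than the actual NT3 code: it uses the horovod.keras API, whose BroadcastGlobalVariablesCallback(0) is the Keras counterpart of the BroadcastGlobalVariablesHook(0) named above, and the random placeholder data stands in for the NT3 loading and preprocessing phases.

# Illustrative sketch, not the actual NT3 code: the Horovod additions
# applied to a small Keras training script. comp_epochs() and
# build_nt3_like_model() are the sketches defined above; the random
# arrays stand in for the preprocessed NT3 data.
import numpy as np
import keras
import horovod.keras as hvd

hvd.init()                                   # initialize Horovod (MPI)
nprocs, myrank = hvd.size(), hvd.rank()
gParameters = {'epochs': 400, 'learning_rate': 0.001, 'batch_size': 20}
epochs = comp_epochs(gParameters['epochs'], myrank, nprocs)

# Placeholder data: 100 samples, 60,483 features, 2 classes.
x = np.random.rand(100, 60483, 1)
y = keras.utils.to_categorical(np.random.randint(2, size=100), 2)

# Scale the learning rate by the number of workers, then wrap the
# original optimizer in the Horovod distributed optimizer, which
# averages the gradients across nodes with allreduce before applying them.
opt = keras.optimizers.SGD(lr=gParameters['learning_rate'] * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Re-compile the model with the distributed optimizer.
model = build_nt3_like_model()
model.compile(loss='categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

# Broadcast initial variable states from rank 0 to all other processes.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=gParameters['batch_size'],
          epochs=epochs, callbacks=callbacks)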

Figure 2. Cache mode on Cray XC40 Theta

3. System Platform: Cray XC40 Theta

We conducted our experiments on the Cray XC40 Theta [9], a pre-exascale production system at Argonne National Laboratory. Each Cray XC40 node has 64 compute cores (one Intel Xeon Phi Knights Landing (KNL) 7230 with a thermal design power (TDP) of 215 W), a shared L2 cache of 32 MB (1 MB of L2 cache shared by two cores), 16 GB of high-bandwidth in-package memory, 192 GB of DDR4 RAM, and a 128 GB SSD. The Cray XC40 system uses the Cray Aries dragonfly network, with user access to a Lustre parallel file system with 10 PB of capacity and 210 GB/s bandwidth. The Cray XC40 [8] [14] provides power management to operate more efficiently by monitoring, profiling, and limiting power usage. In this work, we use a simplified PoLiMEr [13], which utilizes Cray's CapMC [14], to measure power consumption for the node, CPU, and memory at the node level on Theta. The power sampling rate used is approximately two samples per second (the default). In the Python code, we import ctypes to export the CDLL for loading the shared PoLiMEr library in order to measure the power for the code.

Each XC40 node has one Intel KNL, which brings in a new memory technology: an on-package memory called Multi-Channel DRAM (MCDRAM) in addition to the traditional DDR4 RAM. MCDRAM is a high-bandwidth (around 4 times that of DDR4 RAM), low-capacity (16 GB) memory. MCDRAM can be configured as a shared L3 cache (cache mode), shown in Figure 2, or as a distinct NUMA node memory (flat mode), shown in Figure 3. For the flat mode, the default memory allocation preference is DDR4 first, then MCDRAM. With the different memory modes by which the system can be booted, understanding the best mode for an application becomes challenging for the user.

Figure 3. Flat mode on Cray XC40 Theta
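For reference, a minimal sketch of the ctypes-based power measurement described above follows. The shared-library name libpolimer.so and the function names poli_init(), poli_get_power(), and poli_finalize() are hypothetical placeholders, since the actual PoLiMEr interface [13] is not listed in this paper.

# Hedged sketch of loading a shared power-measurement library via ctypes.
# The library and function names below are hypothetical placeholders; the
# real PoLiMEr API [13] may differ.
import ctypes

polimer = ctypes.CDLL('libpolimer.so')   # load the shared library (CDLL)
polimer.poli_init()                      # hypothetical: start monitoring

# ... run the NT3 training here ...

polimer.poli_get_power()                 # hypothetical: read power counters
polimer.poli_finalize()                  # hypothetical: stop monitoring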
4. Scalability Analysis

In this section, we investigate the performance and power characteristics of the original NT3 benchmark under different memory modes. We then analyze the scalability of the Horovod NT3 benchmark in a weak-scaling study.

4.1. Original NT3 Benchmark under Different Memory Modes

The number of epochs is 400 for the NT3 benchmark. For simplicity, in this section we use just one epoch to conduct the experiments, under a learning rate of 0.001 and a batch size of 20.

Figure 4 shows power over time on Theta, with cache mode on one node and epochs = 1. The runtime is 1567s, and the average power is 159.21 W for the node, 105.26 W for the CPU, and 12.08 W for the memory. We observe that NT3 takes around 800s to do the data loading and preprocessing because of the FTP remote data access used by pandas.read_csv() in the benchmark.

Figure 5 shows power over time on Theta, with flat mode on one node and epochs = 1. The runtime is 1608s, and the average power is 164.37 W for the node, 110.37 W for the CPU, and 17.0 W for the memory. Comparing the cache mode with the flat mode, we observe that the NT3 benchmark under cache mode results in better performance and lower power consumption in the node and CPU. The memory power consumption is lower because the MCDRAM configured as the L3 cache is large enough to hold both data files.

In the experiments, we used an initial batch size of 20 as the default. We then changed the batch size to determine its effect on the performance and power of NT3. Table 1 shows the runtime, power, and energy of the benchmark with different batch sizes under cache mode.

TABLE 1. RUNTIME (S), POWER (W), AND ENERGY (J) OF THE SINGLE-NODE NT3 BENCHMARK WITH DIFFERENT BATCH SIZES UNDER CACHE MODE.

Figure 4. Power over time on Theta (with cache mode on one node and epochs = 1)

Figure 5. Power over time on Theta (with flat mode on one node and epochs = 1)

We ran the same experiment several times to ensure that the results are consistent, finding that the difference in runtime is very small (less than 0.1%). Thus, we simply used the run with the smallest runtime to report the time and power. Increasing the batch size results in a decrease in the runtime and impacts the power slightly because it requires fewer iterations to converge to the same validation accuracy for models trained at larger batch sizes. The benchmark with a batch size of 200 achieves the lowest energy. We find, however, that the benchmark fails if the batch size is 300 or larger. Thus, the batch size can be adjusted only to a limited extent.

4.2. Horovod NT3 Benchmark

In this section, we still use one epoch and a batch size of 20 to conduct the experiments under the cache mode on Theta. However, we scale up the number of nodes, with one epoch per node, for our weak-scaling study. We measure the performance of the Horovod version of NT3 and use Python's cProfile [19] to profile the performance and analyze NT3's scalability on Theta.
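As a reference for how such per-part times can be collected, the following is a minimal cProfile [19] sketch; the profiled entry point main() and the script name in the comment are assumptions for illustration.

# Minimal cProfile sketch: profile the run and report cumulative times,
# which is how per-part times such as TF_Run() and the pandas TextReader
# operations can be extracted. The profiled entry point is an assumption.
import cProfile
import pstats

cProfile.run('main()', 'nt3.prof')       # profile the training entry point
stats = pstats.Stats('nt3.prof')
stats.sort_stats('cumulative').print_stats(20)   # top 20 by cumulative time

# Equivalently, from the command line (script name is illustrative):
#   python -m cProfile -o nt3.prof nt3_horovod.py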

Table 2 shows the time for different parts of the benchmark with the increased learning rate (0.001 * hvd.size()). The table headers are as follows: TensorFlow, the time for the method _pywrap_tensorflow_internal.TF_Run(); Method read, the time for the operations in pandas._libs.parsers.TextReader; Keras callbacks, the time for the Keras callbacks in callbacks.py; Horovod callbacks, the time for the Horovod callbacks.py; Distributed Optimizer, the time for the Horovod Distributed Optimizer; Broadcast, the time for the Horovod broadcast itself; Allreduce, the time for the Horovod allreduce; and model.fit(), the time spent in the model training and validation.

TABLE 2. PERFORMANCE (IN SECONDS) WITH THE INCREASED LEARNING RATE (WEAK SCALING) ON THETA.

TABLE 3. PERFORMANCE (IN SECONDS) WITH THE SAME LEARNING RATE (WEAK SCALING) ON THETA.

The Horovod Distributed Optimizer has small overhead, around 1.4s even with increasing numbers of nodes, because this optimizer delegates the gradient computation to the original optimizer, averages the gradients using allreduce, and then applies those averaged gradients. The Horovod callbacks introduce some overhead, but it is relatively small. We note that the Horovod broadcast overhead is around 0.22s and the Horovod allreduce overhead is around 0.24s. The dominant parts are the data loading (Method read) and TensorFlow. model.fit() is the main part of TensorFlow and takes most of the TensorFlow execution time. Because the Python cProfile provides the time only for TensorFlow as a whole, we add inline timing around the function model.fit() to measure its runtime.

We further tested the communication overhead introduced by Horovod with the same learning rate (0.001), shown in Table 3. Overall, the total runtime results are slightly larger than those shown in Table 2. The Horovod overheads for the callbacks, optimizer, broadcast, and allreduce are similar in both cases, and compared with the total runtime, the Horovod overhead is relatively small. We note from Tables 2 and 3, however, that the time spent in TensorFlow and model.fit() increases significantly, by more than 39%, as the number of nodes increases from 2 to 512. Ideally, if the batch size, learning rate, and other parameters are fixed, the model training time is expected to be the same. Although Horovod has the ability to scale up, it causes large communication overhead within TensorFlow, and this overhead is not reflected in the Horovod function times provided by cProfile. In the following section, we explain this overhead using the Horovod timeline [11].
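A minimal sketch of the inline timing mentioned above follows; the wall-clock difference around model.fit() is the quantity reported in the tables, and the surrounding variables are assumed to be defined as in the Section 2.3 sketch.

# Sketch of inline timing around model.fit(), used because cProfile
# reports TensorFlow only as a whole; the training arguments are assumed
# to be defined as in Section 2.3.
import time

t0 = time.time()
model.fit(x, y, batch_size=gParameters['batch_size'],
          epochs=epochs, callbacks=callbacks)
print('model.fit() time (s): %.1f' % (time.time() - t0))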

5. Performance and Power Analysis of the Horovod NT3 Benchmark

In this section, we use the problem size from the original NT3 benchmark for our strong-scaling study, where we fix the number of epochs at 400, the learning rate at 0.001, and the batch size at 20. We conduct our experiments with different numbers of nodes under different memory modes to analyze the performance and power behavior of the Horovod NT3 benchmark. We focus on three metrics: the time spent in model.fit(), the loss, and the accuracy.

5.1. Cache Mode

Table 4 shows the performance for the original dataset with epochs = 400 and the increased learning rate on Theta. In this table, loss is the training loss; acc is the training accuracy; val_loss is the validation loss; and val_acc is the validation accuracy. Time per epoch is the total time for model.fit() divided by the number of epochs executed. On each node, model.fit() executes a number of epochs, which is 400 divided by the number of nodes used. Because we did not shard the training data, each epoch makes a complete pass through the data. Thus, the epoch time apparently increases as the node count increases, but the increase is entirely due to communication overhead, allowing us to capture this cleanly.

TABLE 4. TIME (IN SECONDS), LOSS, AND ACCURACY FOR THE ORIGINAL DATASET WITH THE INCREASED LEARNING RATE (STRONG SCALING) ON THETA UNDER THE CACHE MODE.

TABLE 5. TIME (IN SECONDS), LOSS, AND ACCURACY FOR THE ORIGINAL DATASET WITH THE SAME LEARNING RATE (STRONG SCALING) ON THETA.

On 400 nodes, model.fit() executes one epoch per node, the total runtime is 1,042s, and the time per epoch is 1,042s. On 100 nodes, model.fit() executes 4 epochs per node, takes 3,593s, and the time per epoch is around 898s. On 25 nodes, model.fit() executes 16 epochs per node, takes 13,001s, and the time per epoch is around 813s. As the number of nodes increases from 25 to 400, the accuracy decreases and the loss increases because model.fit() executes fewer epochs. With the increased learning rate, using 25 nodes results in a training accuracy of 1.0 and a validation accuracy of 0.9786. These results indicate that properly increasing the learning rate leads to better accuracy and that the model training requires the proper number of epochs (16 here) to achieve high accuracy. We note that the time per epoch increases with increasing numbers of nodes because the Horovod communication overhead increases with the number of nodes.

Figure 6 shows power over time for the Horovod NT3 benchmark on 400 nodes. We found that loading the clib for the power measurement using ctypes takes around 11s, based on the profile data from the Python cProfile; therefore, the data loading itself takes around 781s. What happens after the data-loading phase, compared with Figure 4? To explain the power behavior in Figure 6, we use the Horovod timeline feature [11] to record the communication activities, which are viewed in the Chrome browser through chrome://tracing [7].

Figures 7 and 8 show the timeline for the communication of the benchmark on 400 nodes, with the broadcast and allreduce from Figure 6 highlighted. This timeline starts at the broadcast communication, not at the beginning of the benchmark. It consists of six communication types: negotiate broadcast, broadcast, mpi broadcast, allreduce, mpi allreduce, and negotiate allreduce, where broadcast is implemented based on mpi broadcast, shown in Figure 7, and allreduce is based on the Baidu ring-allreduce algorithm [3] and MPI_Allreduce(), shown in Figure 8. MEMCPY_IN_FUSION_BUFFER and MEMCPY_OUT_FUSION_BUFFER copy data into and out of the fusion buffer. Each tensor broadcast/reduction in the Horovod NT3 benchmark involves two major phases.

• The negotiation phase (negotiate broadcast, negotiate allreduce): all workers send a signal to rank 0 that they are ready to broadcast/reduce the given tensor. Each worker is represented by a tick under the negotiate broadcast/negotiate allreduce bar. Immediately after negotiation, rank 0 sends a signal to the other workers to start broadcasting/reducing the tensor.

• The processing phase: here the communication operation actually happens. These communications in Figures 7 and 8 indicate the time taken to do the actual operation on the CPU and highlight that the operation was performed by using pure MPI collectives such as MPI_Broadcast() and MPI_Allreduce().

Based on the communication activities in Figures 7 and 8, we can explain the power behavior in Figure 6. After data loading and preprocessing, the negotiate broadcast takes place, as shown in Figure 7. During the broadcast, the node power and CPU power decrease because of the dynamic power management on the Cray XC40. Then the gradients are computed, so the node power and CPU power increase. Both allreduce and MPI_Allreduce() are used to average the gradients, and the averaged gradients are applied; this process takes around 131s, as the first allreduce and MPI_Allreduce() shown in Figure 8, and takes place between 800s and 1000s in Figure 6. The model training batch steps then start. During the training, negotiate allreduce, allreduce, and MPI_Allreduce() take place periodically between two consecutive training batch steps, as shown in Figure 8. Compared with Figure 4, this Horovod overhead enlarges the gaps between two consecutive training batch steps, and the power consumption is smaller than that in Figure 4 because internode communication is used for the gradient averaging. This explains the communication overhead caused by Horovod within the TensorFlow run. Note that the learning rate is 0.001 in Figure 4, whereas the increased learning rate is 0.001 × 400 = 0.4 in Figure 6, while the other input parameter values are kept the same.

Table 5 shows the performance for the original dataset with epochs = 400 and the same learning rate on Theta. Compared with Table 4, the execution time increases slightly. For 100 nodes or more, the accuracy using the same learning rate is much higher than that using the increased learning rate. We note, however, that for 50 and 25 nodes, the accuracy using the same learning rate is lower than that using the increased learning rate. As the number of nodes increases from 25 to 400, the accuracy decreases and the loss increases because model.fit() executes fewer epochs. This indicates that properly increasing the learning rate results in better accuracy and that the model training requires the proper number of epochs to achieve high accuracy. Notice that the time per epoch increases with increasing numbers of nodes because of the increased Horovod communication overhead.

Figure 6. Power over time of the Horovod NT3 with the increased learning rate under the cache mode on 400 nodes.
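For completeness, the Horovod timeline used above is enabled through the HOROVOD_TIMELINE environment variable [11], which writes a JSON trace that chrome://tracing can load; the output file name below is an example, and the variable is typically exported in the job script before launching.

# Enable Horovod timeline recording before Horovod initializes; the JSON
# trace can then be opened in the Chrome browser via chrome://tracing [7].
# The output path is an example.
import os
os.environ['HOROVOD_TIMELINE'] = 'nt3_timeline.json'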

Figure 7. Timeline for the communication with the highlight of broadcast of the Horovod NT3 on 400 nodes (cache mode).

Figure 8. Timeline for the communication with the highlight of allreduce of the Horovod NT3 on 400 nodes (cache mode).

5.2. Flat Mode

Table 6 shows the performance for the original dataset with epochs = 400 and the increased learning rate using the flat mode on Theta. The results also indicate that properly increasing the learning rate results in better accuracy and that the model training requires a sufficient number of epochs (8 here) to achieve high accuracy. When we compare these results with Table 4 using the cache mode, we see that the benchmark benefits more from the cache mode than from the flat mode. Notice that the time per epoch also increases with increasing numbers of nodes because of the increased Horovod overhead.

TABLE 6. TIME (IN SECONDS), LOSS, AND ACCURACY FOR THE ORIGINAL DATASET WITH THE INCREASED LEARNING RATE (STRONG SCALING) ON THETA.

Figure 9 shows power behavior similar to that in Figure 6. For this benchmark, using the cache mode results in smaller runtime and lower power consumption for the node and CPU. Compared with Figure 5, the Horovod overhead enlarges the gaps between two consecutive training batch steps because of the allreduce operations discussed in the preceding section, and similarly the power consumption is smaller than that in Figure 5.

Figure 9. Power over time of the Horovod NT3 with the increased learning rate under the flat mode on 400 nodes.
6. Conclusions

In this paper, we used Horovod to parallelize the Python NT3 benchmark from the exploratory research project CANDLE. We then analyzed the performance, power, and scalability of the Horovod implementation with weak scaling and strong scaling on the Cray XC40 Theta with different batch sizes and learning rates under two memory modes, cache and flat. Our experimental results indicate that power profiling for the node, CPU, and memory is useful for showing how the Horovod NT3 benchmark behaves on the underlying system. The communication timeline of this benchmark showed that the Horovod communication overhead increased significantly with the number of nodes, although Horovod has the ability to scale up. The benchmark under the cache mode resulted in smaller runtime and lower power consumption for the node and CPU as compared with the results under the flat mode. Increasing the batch size led to a runtime decrease and slightly impacted the power, and increasing the learning rate resulted in a slight decrease in runtime and node power and an increase in accuracy; the model training in NT3 requires the proper number of epochs to achieve high accuracy.

We plan to address several issues raised by the NT3 results. First, the data-loading time becomes the bottleneck after speeding up the model-training process. We have to consider how to speed up the input dataset operations, perhaps by partitioning them. Second, we plan to perform additional measurements with sharded and shuffled data in Horovod; these will improve the loss and accuracy resulting from our training runs, but they do not affect the systems-related results presented here. Third, the performance profiling using the Python cProfile includes only the total time for the whole TensorFlow run, as shown in Tables 2 and 3; it does not give any details about how TensorFlow behaves. To further improve the performance of the TensorFlow run, we may need a fine-grained performance profiling tool such as NVProf [17]. Further, we developed the Horovod version of the NT3 benchmark to support both CPUs and GPUs, and we plan to test the benchmark on heterogeneous systems with CPUs and GPUs, such as Summit [21]. We also plan to add checkpoint/restart features to the Horovod benchmark for fault tolerance. Lastly, we plan to use our performance and power modeling work [26] to model and optimize the CANDLE benchmarks. Python codes, like other scripting languages, do not have compiler optimization support and instead rely on library, resource, and environment settings for improved performance; we can utilize our previous work to identify better resource and environment settings for performance improvement.

Because the NT3 benchmark, like the other CANDLE benchmarks, is implemented in Python by using the Keras framework, the parallelization method using Horovod in this paper can be applied to other TensorFlow-based CANDLE benchmarks, such as the Pilot1 and Pilot3 benchmarks, in a similar way. We also plan to utilize OpenMP parallelism to further develop hybrid Horovod/OpenMP/GPU versions of the CANDLE benchmarks.

Acknowledgments

This work was supported in part by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under contract DE-AC02-06CH11357, and in part by NSF grant CCF-1801856. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. We would like to acknowledge the Argonne Leadership Computing Facility for use of the Cray XC40 Theta under the DOE INCITE project PEACES and ALCF project EE-ECP, and Gail W. Pieper and Emily Dietrich for their edits of this paper.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous distributed systems, arXiv:1603.04467, 2016.

[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: A system for large-scale machine learning, arXiv:1605.08695, 2016.

[3] Baidu-allreduce, https://github.com/baidu-research/baidu-allreduce, https://github.com/baidu-research/tensorflow-allreduce.

[4] CANDLE: Cancer Distributed Learning Environment, http://candle.cels.anl.gov.

[5] CANDLE Benchmarks, https://github.com/ECP-CANDLE/Benchmarks.

[6] CANDLE NT3 Benchmark, https://github.com/ECP-CANDLE/Benchmarks/blob/frameworks/Pilot1/NT3, https://github.com/ECP-CANDLE/Benchmarks/tree/master/Pilot1/NT3.

[7] Chrome trace event profiling tool, https://www.chromium.org/developers/how-tos/trace-event-profiling-tool.

[8] Cray, Monitoring and Managing Power Consumption on the Cray XC System, Tech Report S-0043-7204.

[9] Cray XC40 Theta, Argonne National Laboratory, https://www.alcf.anl.gov/theta.

[10] Distributed TensorFlow, https://www.tensorflow.org/deploy/distributed, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/README.md.

[11] Horovod: A Distributed Training Framework for TensorFlow, https://github.com/uber/horovod.

[12] Keras: The Python Deep Learning Library, https://keras.io/#keras-the-python-deep-learning-library.

[13] I. Marincic, V. Vishwanath, and H. Hoffmann, PoLiMEr: An Energy Monitoring and Power Limiting Interface for HPC Applications, SC17 Workshop on Energy Efficient Supercomputing, Nov. 13, 2017.

[14] S. Martin, D. Rush, M. Kappel, M. Sandstedt, and J. Williams, Cray XC40 Power Monitoring and Control for Knights Landing, Proceedings of the Cray User Group (CUG), 2016.

[15] P. Mendygral, Scaling Deep Learning, ALCF SDL (Simulation, Data, and Learning) Workshop, March 2018.

[16] Microsoft Cognitive Toolkit, https://github.com/Microsoft/cntk.

[17] NVProf, https://docs.nvidia.com/cuda/profiler-users-guide/index.html.

[18] Pandas, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.

[19] Python Profilers, https://docs.python.org/2/library/profile.html.

[20] A. Sergeev and M. Del Balso, Horovod: Fast and Easy Distributed Deep Learning in TensorFlow, arXiv:1802.05799v3, Feb. 21, 2018.

[21] Summit, https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/.

[22] TensorFlow, https://www.tensorflow.org.

[23] TensorFlow Benchmarks, https://www.tensorflow.org/performance/benchmarks.

[24] Theano, https://github.com/Theano/Theano.

[25] J. M. Wozniak, R. Jain, P. Balaprakash, J. Ozik, N. Collier, J. Bauer, F. Xia, T. Brettin, R. Stevens, J. Mohd-Yusof, C. G. Cardona, B. Van Essen, and M. Baughman, CANDLE/Supervisor: A Workflow Framework for Machine Learning Applied to Cancer Research, SC17 Workshop on Computational Approaches for Cancer, November 2017.

[26] X. Wu, V. Taylor, J. Cook, and P. Mucci, Using Performance-Power Modeling to Improve Energy Efficiency of HPC Applications, IEEE Computer, Vol. 49, No. 10, pp. 20-29, Oct. 2016.