Scaling DL Training Workloads with the Cray PE Plugin
Luiz DeRose, Sr. Principal Engineer, Programming Environments Director
NERSC, June 14, 2018. © 2018 Cray Inc.

Autonomous Driving (Level 5: Full Autonomous Driving)
[Figure: two-part pipeline. Backend (training): labeled training data feeds supervised CNN training with no driver in the loop; the frontend requests updated results. Frontend (inference): real-time sensor data drives inference for mission and trajectory planning, flowing from sensors through sensor fusing, perception, and cognition to action.]

Deep Learning in Science - NERSC
● Opportunities to apply DL widely in support of classic HPC simulation and modelling

The Deep Learning Part in Autonomous Driving
● Model training is the most crucial and challenging aspect
● Many tasks to train for in autonomous driving:
  ● Stereo
  ● Optical flow
  ● Visual odometry
  ● Structure-from-motion
  ● Object detection
  ● Recognition and tracking
  ● 3D scene understanding
● Many NNs to train for each task
● Retraining (or transfer learning) happens when new data is available
● A trained neural network can also be a powerful tool for:
  ● Pattern recognition
  ● Classification
  ● Clustering
  ● Others…

The Deep Learning Software Landscape
GitHub stars per deep learning toolkit, as of May 2017:

  Toolkit          GitHub stars
  TensorFlow       56… (truncated in source)
  Caffe            17824
  CNTK             10681
  MXNet             9635
  Torch             6847
  DeepLearning4j    6640
  Theano            6271
  PaddlePaddle      4845
  Caffe2            4628
  BigDL             1731

● TensorFlow is nearly as popular as all other frameworks combined

Scaling Deep Learning - Motivation
● Scaling Deep Learning training is a tool for:
  ● Models that take a very long time to train and have a very large training dataset
  ● Increasing the frequency at which models can be retrained with new or improved data
    ● It is critical to be able to update the models (when new data arrives) in a matter of minutes
  ● Hyper-parameter optimization, for problems and datasets where baseline accuracy is not known:
    ● Learning rate schedule
    ● Momentum
    ● Batch size
  ● Evolving topologies when a good architecture is unknown (common with novel datasets / mappings):
    ● Layer types, width, number of filters
    ● Activation functions, drop-out rates

HPC Attributes
● DL training is a classic high-performance computing problem which demands:
  ● Large compute capacity in terms of FLOPs, memory capacity, and bandwidth
  ● A performant interconnect for fast communication of gradients and model parameters
  ● Parallel I/O and storage with sufficient bandwidth to keep the compute fed at scale
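To put a rough number on the interconnect demand above: in data-parallel training, every process contributes a gradient the size of the full model on every step. The sketch below is a back-of-the-envelope illustration, not from the slides; the ResNet-50 parameter count is a commonly cited figure, and the step rate is a hypothetical assumption.

    # Rough estimate of per-process allreduce payload in data-parallel SGD.
    PARAMS = 25_600_000        # ~ResNet-50 parameter count (commonly cited)
    BYTES_PER_GRADIENT = 4     # fp32 gradients
    STEPS_PER_SECOND = 5       # hypothetical per-process step rate

    grad_megabytes = PARAMS * BYTES_PER_GRADIENT / 1e6
    payload_per_sec = grad_megabytes * STEPS_PER_SECOND
    print(f"gradient size: {grad_megabytes:.0f} MB per step")          # ~102 MB
    print(f"allreduce payload: {payload_per_sec:.0f} MB/s per process")

At a few steps per second this is hundreds of megabytes per second of reduction traffic per process, before any I/O for training samples, which is why the slides call out the interconnect and parallel I/O explicitly.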
The Cray PE DL Scalability Plugin: Project Overview
● Cray's primary goals were to:
  ● Design a solution for scaling TensorFlow (specifically synchronous SGD) to significantly larger node counts than existing methods allowed
  ● Require minimal changes to user training scripts and provide a friendlier user experience
  ● Achieve the best possible TensorFlow performance on Cray systems
  ● Maintain accuracy for a given number of steps and hyper-parameter setup, allowing significantly reduced time-to-accuracy through scaling
  ● Ideally, provide a portable solution that works with other deep learning frameworks

Data Parallelism - Collective-based Synchronous SGD
● Data-parallel training divides a global mini-batch of examples across processes
● Each process computes gradients from its local mini-batch (compute intensive)
● Gradients are averaged across processes (communication intensive)
● All processes update their local model with the averaged gradients, so all processes have the same model (typically not much compute)
● Not shown is the I/O activity of reading training samples (and possible augmentation)

Data Parallel Synchronous SGD
● Operations in a "step" of SGD:
  1. Read/prepare the local mini-batch
  2. Compute gradients
  3. Reduce gradients across all processes (local mini-batches)
  4. Update weights and biases with the gradients
● The local operations (1, 2, 4) are the parallel work; step 3 is communication and maps to an allreduce
● Non-parallel work is the remainder
● For a fixed local mini-batch size per process, the fraction of time in parallel work is constant as more processes are added
● A runnable sketch of one such step appears below, after the Training Script Modifications slide

Distributed TensorFlow
● TensorFlow has a native method for parallelism across nodes: the ClusterSpec API
● Uses the gRPC layer in TensorFlow, which is based on sockets
● Can be difficult to use and optimize; the user must specify:
  ● Hostnames and ports for all worker processes
  ● Hostnames and ports for all parameter server processes (see next slide)
  ● The number of workers
  ● The number of parameter server processes
  ● The chief process among the workers
● A minimal example of this boilerplate appears at the end of this section

Distributed TensorFlow (continued)
[Figure: the standard parameter-server architecture, with devices 1..n each holding a model replica and input pipeline; clients send gradient updates (ΔP) to a parameter device holding the parameters P. The Cray method replaces the parameter device with a scalable global add: no parameter devices are needed, and all resources are dedicated to gradient calculation.]
● A parameter device is a separate process, possibly requiring additional nodes
● The number of parameter device processes to use is not clear
● The communication is an allreduce, but it is not implemented that way
● Several gradient "aggregation" methods are possible in TensorFlow
● All implement the allreduce as a set of gather/bcast operations with point-to-point communication over sockets

Training Script Modifications
● The Cray PE DL Plugin requires the following modifications to a serial training script:
  1. Import the Python module
  2. Initialize the module
     ● Possibly configure the thread team(s) for specific uses
  3. Broadcast the initial model parameters
  4. Incorporate gradient aggregation between gradient computation and the model update
  5. Finalize the Python module
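The slides list these five steps but do not show the plugin's actual Python API. As a minimal runnable analogue, the sketch below implements the same five steps with mpi4py and NumPy; mpi4py here is a stand-in for the plugin's communication calls, not the Cray API itself, and the model and gradient computation are toy placeholders.

    # Data-parallel synchronous SGD skeleton; mpi4py stands in for the
    # Cray PE DL Plugin's broadcast/aggregation calls.
    import numpy as np
    from mpi4py import MPI                      # step 1: import the module

    comm = MPI.COMM_WORLD                       # step 2: initialization
    rank = comm.Get_rank()                      # (mpi4py initializes MPI
    size = comm.Get_size()                      #  on import)

    weights = np.random.randn(1000).astype(np.float32)
    comm.Bcast(weights, root=0)                 # step 3: broadcast initial model

    lr = 0.01
    for step in range(100):
        batch = np.random.randn(32, 1000).astype(np.float32)  # local mini-batch
        grads = batch.mean(axis=0)              # placeholder gradient computation

        avg = np.empty_like(grads)              # step 4: aggregate gradients
        comm.Allreduce(grads, avg, op=MPI.SUM)  # between compute and update
        avg /= size

        weights -= lr * avg                     # identical update on every rank

    # step 5: finalize (mpi4py finalizes MPI automatically at interpreter exit)

Launched as, e.g., `srun -n 64 python train.py` (a hypothetical script name), every rank applies the same averaged gradient, so all model replicas stay in lockstep; that is exactly the synchronous-SGD property the preceding slides describe.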
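For contrast, here is the kind of setup the native ClusterSpec method described above requires from the user (TensorFlow 1.x API, current when this talk was given; the hostnames, ports, and task index are hypothetical placeholders):

    # Native distributed TensorFlow: each process must be told the full
    # cluster layout (workers, parameter servers) and its own role in it.
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "worker": ["node001:2222", "node002:2222"],  # hypothetical hosts/ports
        "ps":     ["node003:2222"],                  # parameter server process
    })

    # Each process starts a gRPC server for its job name and task index;
    # this instance acts as worker 0, the chief.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        # Variables land on the ps task(s); compute ops stay on the worker.
        w = tf.get_variable("w", [1000], initializer=tf.zeros_initializer())

Every one of these decisions (how many parameter server tasks, which hosts and ports, which worker is chief) is the manual configuration the Distributed TensorFlow slides call difficult to use and optimize, and what the plugin's allreduce-based approach removes.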
Cray PE DL Scalability Plugin Performance
[Charts: samples/sec vs. node count for the CPE DL Plugin against gRPC and ideal scaling. Left: Inception v3 throughput on CSCS Piz Daint (P100 GPUs, 8 to 512 nodes), batch size 64 per process. Right: ResNet50 throughput on NERSC Cori (KNL, 8 to 1024 nodes), batch size 32 per process. Ideal is derived from single-node throughput with no communication software.]
● 91% efficient at 512 nodes (Inception v3, Piz Daint)
● 89% efficient at 1024 nodes (ResNet50, Cori)
● 1.8X faster than gRPC at 128 nodes
● See Peter Mendygral's talk on Thursday at 1:30

Horovod / CPE DL Plugin – Throughput Scaling
[Charts: aggregate samples/sec vs. nodes. Left: ResNet50 on XC40 (Cori KNL at NERSC), Horovod and CPE DL Plugin, with local mini-batch sizes 32 and 128 for the plugin, 32 for Horovod, and 64 for gRPC. Right: Inception v3 on XC50 (Piz Daint at CSCS), CPE DL Plugin only, with local mini-batch sizes 4, 16, 32, and 64.]
● CPE DL Plugin is 1.8X faster than gRPC at 128 nodes
● 1.4X faster than Horovod at 128 nodes, and 3.2X at 1024 nodes

Cray PE DL Scalability Plugin
● Users can achieve near-ideal scaling performance across DL frameworks that use stochastic gradient descent:
  1. Load a module
  2. Plug a few simple lines into your serial Python or C-based training script
  3. Scale your training workload up to hundreds of nodes or more on Cray systems
● Delivers high performance across a variety of Cray node architectures
● Cray customers can leverage their existing compute nodes to scale DL training

Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary, based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing, and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA and YARCDATA. The following are trademarks of Cray Inc.: CHAPEL, CLUSTER CONNECT, CLUSTERSTOR, CRAYDOC, CRAYPAT, CRAYPORT, DATAWARP, ECOPHLEX, LIBSCI, NODEKARE, REVEAL.
The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used on this website are the property of their respective owners.