Follow us on

https://twitter.com/mvapich

Boosting Performance of Deep Learning, Machine Learning, and Dask with MVAPICH2

Tutorial at MUG ’21 by

Arpan Jain, Aamir Shafi, and Quentin Anthony (The Ohio State University)
[email protected] | http://u.osu.edu/jain.575
[email protected] | https://cse.osu.edu/people/shafi.16
[email protected] | https://www.linkedin.com/in/quentin-anthony

Outline • Introduction • Deep Neural Network Training and Essential Concepts • Parallelization Strategies for Distributed DNN Training • Machine Learning • Data Science using Dask • Conclusion

Network Based Computing Laboratory MUG ‘21 2 What is Machine Learning and Deep Learning?

• Machine Learning (ML)
  – "the study of computer algorithms that improve automatically through experience and the use of data"
• Deep Learning (DL)
  – a subset of ML
  – Uses Deep Neural Networks (DNNs)
  – Perhaps the most revolutionary subset!
  – Based on learning data representations
  – DNN examples: Convolutional Neural Networks, Recurrent Neural Networks, Hybrid Networks
• Data Scientist or Developer perspective for using DNNs
  1. Identify DL as the solution to a problem
  2. Determine the data set
  3. Select the Deep Learning algorithm to use
  4. Use a large data set to train the algorithm

Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning-and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning, https://en.wikipedia.org/wiki/Machine_learning

Network Based Computing Laboratory MUG ‘21 3 History: Milestones in the Development of ML/DL

[Timeline figure, 1805-2017: ML milestones (Linear Regression - A. Legendre, J. Gauss; PCA - K. Pearson; Turing Machine - A. Turing; K-Means; KNN; Decision Trees; Bayesian Network - J. Pearl; SVM - V. Vapnik, C. Cortes; Random Forest; Evolutionary Algorithms; XGBoost; CatBoost; Deep Forest) and DL milestones (Electronic Brain - S. McCulloch, W. Pitts; Perceptron - F. Rosenblatt; ADALINE - B. Widrow, M. Hoff; XOR Problem and the "AI Winter" Dark Age - M. Minsky, S. Papert; Multi-layered Perceptron with Backpropagation - D. Rumelhart, G. Hinton, R. Williams; DBN; the Golden Age with AlexNet (A. Krizhevsky), ResNet, WGAN, and Transformers; pioneers also include A. Ng, Y. LeCun, Y. Bengio)]

Network Based Computing Laboratory MUG ‘21 4 Deep Learning meets Super Computers

• Computation requirement is increasing exponentially
  – CPUs still dominate the HPC arena and can be used for Deep Learning

[Chart: Accelerator/CP Family Performance Share]

www.top500.org

Courtesy: https://openai.com/blog/ai-and-compute/ Network Based Computing Laboratory MUG ‘21 5 Outline • Introduction • Deep Neural Network Training and Essential Concepts • Parallelization Strategies for Distributed DNN Training • Machine Learning • Data Science using Dask • Conclusion

Network Based Computing Laboratory MUG ‘21 6 So what is a Deep Neural Network?

• Example of a 3-layer Deep Neural Network (DNN) – (input layer is not counted)

Courtesy: http://cs231n.github.io/neural-networks-1/

Network Based Computing Laboratory MUG ‘21 7 DNN Training: Forward Pass

[Figure, built up over slides 8-12: a network with an Input layer, two Hidden layers, and an Output layer. The input X is propagated forward through the weights (W1-W8) to produce a prediction Pred, and Error = Loss(Pred, Output).]

DNN Training: Backward Pass

[Figure, built up over slides 13-15: the same network traversed in reverse. The error is propagated backward through the layers (E1-E8) to obtain the gradient for every weight.]

Network Based Computing Laboratory MUG ‘21 15 DNN Training

Network Based Computing Laboratory MUG ‘21 16 Essential Concepts: Activation function and Back-propagation • Back-propagation involves complicated mathematics. – Luckily, most DL Frameworks give you a one line implementation -- model.backward()

• What are Activation functions?
  – ReLU (a max function) is the most common activation function
  – Sigmoid, tanh, etc. are also used

Courtesy: https://www.jeremyjordan.me/neural-networks-training/
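The following minimal PyTorch sketch (added here for illustration; it is not part of the tutorial material) shows a small network with ReLU and tanh activations, a forward pass, and the one-line backward pass provided by the framework:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),   # ReLU is the most common activation function
    nn.Linear(16, 16), nn.Tanh(),  # sigmoid/tanh are also used
    nn.Linear(16, 1),
)
x, y = torch.randn(32, 8), torch.randn(32, 1)

pred = model(x)                    # forward pass
loss = nn.MSELoss()(pred, y)       # Error = Loss(Pred, Output)
loss.backward()                    # backward pass: gradients for every weight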

Network Based Computing Laboratory MUG ‘21 17 Essential Concepts: Learning Rate (α)

Courtesy: https://www.jeremyjordan.me/nn-learning-rate/

Network Based Computing Laboratory MUG ‘21 18 Essential Concepts: Batch Size

• Batched Gradient Descent
  – Batch Size = N (the full dataset)
• Stochastic Gradient Descent
  – Batch Size = 1
• Mini-batch Gradient Descent
  – Somewhere in the middle
  – Common: Batch Size = 64, 128, 256, etc.
• Finding the optimal batch size will yield the fastest learning
• One full pass over the N samples is called an epoch of training

Courtesy: https://www.jeremyjordan.me/gradient-descent/
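As an illustration of the mini-batch idea, the sketch below runs mini-batch SGD on a simple least-squares model (illustrative NumPy code, not taken from the tutorial; the data X, y and weights w are placeholders supplied by the caller):

import numpy as np

def minibatch_sgd(X, y, w, lr=0.01, batch_size=64, epochs=1):
    """Mini-batch gradient descent on a linear least-squares model."""
    n = X.shape[0]
    for _ in range(epochs):                               # one full pass over the N samples = one epoch
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]             # the current mini-batch
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w = w - lr * grad                             # parameter update
    return w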

Network Based Computing Laboratory MUG ‘21 19 Key Phases of Deep Learning • Training is compute intensive – Many passes over data – Can take days to weeks – Model adjustment is done • Inference – Single pass over the data – Should take seconds

– No model adjustment
• Challenge: How to make "Training" faster?
  – Need Parallel and Distributed Training…

Courtesy: https://devblogs.nvidia.com/

Network Based Computing Laboratory MUG ‘21 20 Outline • Introduction • Deep Neural Network Training and Essential Concepts • Parallelization Strategies for Distributed DNN Training • Machine Learning • Data Science using Dask • Conclusion

Network Based Computing Laboratory MUG ‘21 21 Parallelization Strategies • Some parallelization strategies: – Data Parallelism – Model Parallelism – Hybrid Parallelism

[Figure: Data Parallelism, Model Parallelism, and Hybrid (Model and Data) Parallelism]

Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks

Network Based Computing Laboratory MUG ‘21 22 Outline • Introduction • Deep Neural Network Training and Essential Concepts • Parallelization Strategies for Distributed DNN Training – Data Parallelism – Model and Hybrid Parallelism • Machine Learning • Data Science using Dask • Conclusion

Network Based Computing Laboratory MUG ‘21 23 Need for Data Parallelism Let’s revisit Mini-Batch Gradient Descent

Drawback: If the dataset has 1 million images, it will take forever to train the model on such a big dataset on a single machine

Solution: Can we use multiple machines to parallelize the training of Deep Learning models? (i.e., utilize multiple machines to parallelize the work)

Network Based Computing Laboratory MUG ‘21 24 Need for Communication in Data Parallelism

[Figure: the dataset is partitioned across Machines 1-5, each making its own per-sample predictions (Y/N)]

Problem: Train a single model on the whole dataset, not 5 models on different subsets of the dataset

Network Based Computing Laboratory MUG ‘21 25 Data Parallelism

[Figure: Machines 1-5 each compute local gradients on their data partition; MPI_Allreduce produces the same reduced gradients on every machine]

Network Based Computing Laboratory MUG ‘21 26 Allreduce Collective Communication Pattern

• Element-wise sum of data from all processes, with the result delivered to all processes

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op operation, MPI_Comm comm)

Input-only Parameters
  sendbuf   - Starting address of send buffer
  recvbuf   - Starting address of recv buffer
  type      - Data type of buffer elements
  count     - Number of elements in the buffers
  operation - Reduction operation to be performed (e.g., sum)
  comm      - Communicator handle

Input/Output Parameters
  recvbuf   - Starting address of receive buffer

Example with four processes (T1-T4), each contributing [1, 2, 3, 4]:
  Sendbuf (before): T1 = T2 = T3 = T4 = [1, 2, 3, 4]
  Recvbuf (after):  T1 = T2 = T3 = T4 = [4, 8, 12, 16]
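A minimal mpi4py sketch of the same pattern (illustrative code, assuming mpi4py is built against an MPI library such as MVAPICH2):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
sendbuf = np.array([1, 2, 3, 4], dtype=np.float64)   # every rank contributes [1, 2, 3, 4]
recvbuf = np.empty_like(sendbuf)
comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)         # with 4 ranks, recvbuf == [4, 8, 12, 16] on all ranks
print(f"rank {comm.Get_rank()}: {recvbuf}")

Launched with four MPI processes, every rank prints the same reduced values shown in the table above.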

Network Based Computing Laboratory MUG ‘21 27 Data Parallelism and MPI Collectives

• Step1: Data Propagation – Distribute the Data among GPUs • Step2: Forward Backward Pass – Perform forward pass and calculate the prediction – Calculate Error by comparing prediction with actual output – Perform backward pass and calculate gradients • Step3: Gradient Aggregation – Call MPI_Allreduce to reduce the local gradients – Update parameters locally using global gradients
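A hedged sketch of Steps 2 and 3 using PyTorch with mpi4py (illustrative only, not the tutorial's actual code; in practice a framework such as Horovod or torch.distributed performs this Allreduce for you):

import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD

def gradient_aggregation(model):
    """Step 3: sum the local gradients across all ranks and average them."""
    for p in model.parameters():
        if p.grad is None:
            continue
        buf = p.grad.detach().cpu().numpy().copy()
        comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)           # reduce the local gradients globally
        p.grad.copy_(torch.from_numpy(buf / comm.Get_size()))   # update local gradients with the global average

# Step 2 on each rank: forward pass, loss, and backward pass on the local mini-batch
model = torch.nn.Linear(16, 1)
x, y = torch.randn(64, 16), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
gradient_aggregation(model)   # after Step 3, every rank holds identical averaged gradients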

Network Based Computing Laboratory MUG ‘21 28 MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training

[Figure: two MVAPICH2-driven software stacks for ML/DL training]
  • ML/DL Applications -> TensorFlow / PyTorch / MXNet -> Horovod -> MVAPICH2 or MVAPICH2-X for CPU training, MVAPICH2-GDR for GPU training
  • ML/DL Applications -> PyTorch -> torch.distributed / DeepSpeed -> MVAPICH2 or MVAPICH2-X for CPU training, MVAPICH2-GDR for GPU training

More details available from: http://hidl.cse.ohio-state.edu Network Based Computing Laboratory MUG ‘21 29 MVAPICH2 2.3.6

• Released on 05/11/2021 • Major Features and Enhancements – Support collective offload using Mellanox's SHARP for Reduce and Bcast • Enhanced tuning framework for Reduce and Bcast using SHARP – Enhanced performance for UD-Hybrid code – Add multi-rail support for UD-Hybrid code – Enhanced performance for shared-memory collectives – Enhanced job-startup performance for flux job launcher – Add support in mpirun_rsh to use srun daemons to launch jobs – Add support in mpirun_rsh to specify processes per node using '-ppn' option – Use PMI2 by default when SLURM is selected as manager – Add support to use aligned memory allocations for multi-threaded applications – Architecture detection and enhanced point-to-point tuning for Oracle BM.HPC2 cloud shape – Enhanced collective tuning for Frontera@TACC and Expanse@SDSC – Add support for GCC compiler v11 – Add support for Intel IFX compiler – Update hwloc v1 code to v1.11.14 & hwloc v2 code to v2.4.2

Network Based Computing Laboratory MUG ‘21 30 MVAPICH2-GDR 2.3.6

• Released on 08/12/2021
• Major Features and Enhancements
  – Based on MVAPICH2 2.3.6
  – Added support for 'on-the-fly' compression of point-to-point messages used for GPU-to-GPU communication
    • Applicable to NVIDIA GPUs
  – Added NCCL communication substrate for various MPI collectives
    • Support for hybrid communication protocols using NCCL-based, CUDA-based, and IB verbs-based primitives
    • MPI_Allreduce, MPI_Reduce, MPI_Allgather, MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv, MPI_Gather, MPI_Gatherv, and MPI_Bcast
  – Full support for NVIDIA DGX, NVIDIA DGX-2 V-100, and NVIDIA DGX-2 A-100 systems
    • Enhanced architecture detection, process placement and HCA selection
    • Enhanced intra-node and inter-node point-to-point tuning
    • Enhanced collective tuning
  – Introduced architecture detection, point-to-point tuning and collective tuning for ThetaGPU @ANL
  – Enhanced point-to-point and collective tuning for NVIDIA GPUs on Frontera @TACC, Lassen @LLNL, and Sierra @LLNL
  – Enhanced point-to-point and collective tuning for Mi50 and Mi60 AMD GPUs on Corona @LLNL
  – Added several new MPI_T PVARs
  – Added support for CUDA 11.3
  – Added support for ROCm 4.1
  – Enhanced output for runtime variable MV2_SHOW_ENV_INFO
  – Tested with Horovod and common DL Frameworks
    • TensorFlow, PyTorch, and MXNet
  – Tested with MPI4Dask 0.2
    • MPI4Dask is a custom Dask Distributed package with MPI support
  – Tested with MPI4cuML 0.1
    • MPI4cuML is a custom cuML package with MPI support

Network Based Computing Laboratory MUG ‘21 31 Install Horovod with MVAPICH2-X and MVAPICH2-GDR

Command to install Horovod with MVAPICH2-X

$ HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod==0.22.1

Command to install Horovod with MVAPICH2-GDR

$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_CUDA_HOME=/opt//10.1 HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod
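Once Horovod is installed, a PyTorch training script typically initializes it as in the sketch below (standard Horovod API; the model, optimizer, and learning rate are placeholders, not code from this tutorial):

import torch
import horovod.torch as hvd

hvd.init()                                              # one process per GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())             # pin each rank to its local GPU

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())   # scale LR with the number of ranks
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)   # start all ranks from identical weights
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
# in the training loop, optimizer.step() now allreduces gradients across ranks under the hood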

Network Based Computing Laboratory MUG ‘21 32 Run PyTorch on a single GPU

+ python pytorch_synthetic_benchmark.py --batch-size 64 --num-iters=5 . . Model: resnet50 V100 Batch size: 64 Number of GPUs: 1 Running warmup... Running benchmark... Iter #0: 333.9 img/sec per GPU Iter #1: 334.2 img/sec per GPU Iter #2: 333.9 img/sec per GPU Iter #3: 333.8 img/sec per GPU Iter #4: 333.9 img/sec per GPU Img/sec per GPU: 334.0 +-0.2 ------Total img/sec on 1 GPU(s): 334.0 +-0.2 ------

Network Based Computing Laboratory MUG ‘21 33 Run PyTorch on two nodes with 1 GPU/node (using MVAPICH2- GDR) + mpirun_rsh -np 2 gpu11 gpu12 MV2_USE_CUDA=1 MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=spread MV2_USE_RDMA_CM=0 MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so LD_PRELOAD=mv2-gdr/lib/libmpi.so python pytorch_synthetic_benchmark.py --batch-size 64 --num-iters=5 . . Model: resnet50 V100 Batch size: 64 Number of GPUs: 2 Running warmup... Running benchmark... Iter #0: 317.0 img/sec per GPU Iter #1: 314.9 img/sec per GPU Iter #2: 315.4 img/sec per GPU Iter #3: 318.0 img/sec per GPU Iter #4: 316.7 img/sec per GPU Img/sec per GPU: 316.4 +-2.2 ------Total img/sec on 2 GPU(s): 632.8 +-4.3 ------~1.89X on 2 GPUs

Network Based Computing Laboratory MUG ‘21 34 Scalable TensorFlow using Horovod and MVAPICH2-GDR

• ResNet-50 Training using TensorFlow benchmark on 1 DGX-2 node (16 Volta GPUs)

Platform: NVIDIA DGX-2 system, CUDA 9.2

[Charts: Images per second and Scaling Efficiency (%) vs. number of GPUs (1-16), comparing NCCL-2.6 with MVAPICH2-GDR-2.3.3; MVAPICH2-GDR delivers up to 9% higher throughput]

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, " ICS-2020. Network Based Computing Laboratory MUG ‘21 35 Distributed TensorFlow on ORNL Summit (1,536 GPUs)

• ResNet-50 training using the TensorFlow benchmark on SUMMIT -- 1,536 Volta GPUs!
• ImageNet-1k has 1,281,167 (1.2 million) images
• MVAPICH2-GDR reaches ~0.42 million images per second for ImageNet-1k
• Time/epoch = 3 seconds
• Total Time (90 epochs) = 3 x 90 = 270 seconds = 4.5 minutes!

[Chart: images per second (in thousands) vs. number of GPUs (1-1,536), comparing NCCL-2.6 and MVAPICH2-GDR 2.3.4. *We observed issues for NCCL2 beyond 384 GPUs]

Platform: The Summit (#2 on Top500.org) - 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1

Network Based Computing Laboratory MUG ‘21 36 Distributed TensorFlow on TACC Frontera (2048 CPU nodes)

• Scaled TensorFlow to 2048 nodes on Frontera using MVAPICH2 and IntelMPI
• MVAPICH2 and IntelMPI give similar performance for DNN training
• Report a peak of 260,000 images/sec on 2048 nodes
• On 2048 nodes, ResNet-50 can be trained in 7 minutes!

[Chart: images per second vs. number of nodes (1-2048), MVAPICH2-X vs. ideal scaling]

A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).

Network Based Computing Laboratory MUG ‘21 37 Outline • Introduction • Deep Neural Network Training and Essential Concepts • Parallelization Strategies for Distributed DNN Training – Data Parallelism – Model and Hybrid Parallelism • Machine Learning • Data Science using Dask • Conclusion

Network Based Computing Laboratory MUG ‘21 38 HyPar-Flow: Hybrid Parallelism for TensorFlow

• Why Hybrid parallelism?
  – Data-Parallel training has limits!
• We propose HyPar-Flow
  – An easy-to-use Hybrid parallel training framework
    • Hybrid = Data + Model
  – Supports Keras models and exploits TF 2.0 Eager Execution
  – Exploits MPI for Point-to-point and Collectives

[Chart: extrapolated memory consumption (GB) vs. input image size (width x height) for ResNet-1k and ResNet-5k, against the memory of a Pascal GPU, Volta GPU, CPU Broadwell (128 GB), and CPU Skylake (192 GB); large images are only possible with Model Parallelism!]

Benchmarking large models leads to better insights and the ability to develop new approaches!

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ’20, https://arxiv.org/pdf/1911.05146.pdf

Network Based Computing Laboratory MUG ‘21 39 Model/Hybrid Parallelism and MPI Collectives • HyPar-Flow is practical (easy-to-use) and high-performance (uses MPI) – Based on Keras models and exploits TF 2.0 Eager Execution – Leverages MPI Pt-to-pt. and Collectives for communication

[Figure: a DNN with identity (residual) mappings (Conv1-Conv4, FC, Max Pooling) split across Partition-0, Partition-1, and Partition-2, and replicated as Replica-0 and Replica-1. Activations (D) and gradients (∇) are exchanged between partitions with send/recv, while Allreduce runs across replicas on each partition's communicator (comm).]

HyPar-Flow API: hf.fit(model, num_partitions, num_replicas, strategy), e.g. hf.fit(model, 3, 2, hybrid)
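HyPar-Flow itself exchanges activations and gradients between partitions with MPI send/recv; the simplified single-process PyTorch sketch below (illustrative only, not HyPar-Flow code) shows the basic model-partitioning idea of placing consecutive layers on different devices:

import torch
import torch.nn as nn

dev0 = torch.device("cuda:0" if torch.cuda.device_count() > 0 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

partition0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to(dev0)   # first model partition
partition1 = nn.Sequential(nn.Linear(64, 10)).to(dev1)              # second model partition

x = torch.randn(16, 32, device=dev0)
h = partition0(x)                    # forward pass on partition 0
out = partition1(h.to(dev1))         # activations move to partition 1 (send/recv in the distributed case)
out.sum().backward()                 # gradients flow back across the partition boundary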

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ’20, https://arxiv.org/pdf/1911.05146.pdf Network Based Computing Laboratory MUG ‘21 40 HyPar-Flow at Scale (512 nodes on TACC Frontera)

• ResNet-1001 with variable batch size
• Approach:
  – 48 model-partitions for 56 cores
  – 512 model-replicas for 512 nodes
  – Total cores: 48 x 512 = 24,576
• Speedup
  – 253X on 256 nodes
  – 481X on 512 nodes
• Scaling Efficiency
  – 98% up to 256 nodes

[Chart: images/sec (higher is better) vs. number of nodes (= replicas), 1-512, with 56 partitions per node, against ideal scaling; 481x speedup on 512 Intel Xeon Cascade Lake nodes (TACC Frontera)]

– 93.9% for 512 nodes A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20, https://arxiv.org/pdf/1911.05146.pdf

Network Based Computing Laboratory MUG ‘21 41 GEMS: GPU Enabled Memory Aware Model Parallelism Systems

Why do we need memory-aware designs?
– Data- and Model-Parallel training have limitations!
– The maximum batch size depends on the available memory
– Basic Model Parallelism suffers from under-utilization of memory and compute
→ The memory requirement increases with the image size!

A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani, “GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN”, SC ’20.

Network Based Computing Laboratory MUG ‘21 42 Exploiting Model Parallelism in AI-Driven Digital Pathology

• Pathology whole slide image (WSI)
  – Each WSI = 100,000 x 100,000 pixels
  – Cannot fit in a single GPU's memory
  – Tiles are extracted to make training possible
• Two main problems with tiles
  – Restricted tile size because of the GPU memory limitation
  – Smaller tiles lose structural information
• Can we use Model Parallelism to train on larger tiles to get better accuracy and diagnosis?
• Reduced training time significantly
  – 7.25 hours (1 node, 4 GPUs) -> 27 mins (32 nodes, 128 GPUs)

Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/

A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani, “GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN”, SC ’20.

Network Based Computing Laboratory MUG ‘21 43 SUPER: SUb-Graph Parallelism for TransformERs

Sub-Graph Parallelism
– Exploits inherent parallelism in modern DNN architectures
– Improves the performance of multi-branch DNN architectures
– Can be used to accelerate the training of state-of-the-art Transformer models
– Provides better performance than Data Parallelism for in-core models

[Figures: a simple example of a multi-branch DNN architecture; 4-way Sub-Graph Parallelism combined with Data Parallelism (D&SP)]

A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, D. Panda, B. Essen, “SUPER: SUb-Graph Parallelism for TransformERs”, IPDPS ’21

Network Based Computing Laboratory MUG ‘21 44 Accelerating Transformers using SUPER

• We propose sub-graph parallelism integrated with data parallelism to accelerate the training of Transformers
• Approach
  – Data and Sub-Graph Parallelism (D&SP)
    • #-way D&SP (#: number of sub-graphs)
• Setup
  – T5-Large-Mod on the WMT dataset
  – 1024 NVIDIA V100 GPUs
• Speedup
  – Up to 3.05X over Data Parallelism (DP)

[Chart: time per mini-batch (secs) vs. #GPUs (4-1024) for DP, 2-way, 4-way, and 8-way D&SP; up to 3.05X speedup over Data Parallel designs (LLNL Lassen)]

A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, D. Panda, B. Essen, “SUPER: SUb-Graph Parallelism for TransformERs”, IPDPS ’21

Network Based Computing Laboratory MUG ‘21 45 Outline • Introduction • Deep Neural Network Training and Essential Concepts • Parallelization Strategies for Distributed DNN Training • Machine Learning • Data Science using Dask • Conclusion

Network Based Computing Laboratory MUG ‘21 46 Parallelizing K-Means Clustering

[Figure: data objects divided among processes Proc 0, Proc 1, ..., Proc P-1]

Domain Decomposition – Divide data objects amongst P processors; recompute centroids after MPI_Allreduce
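A hedged mpi4py sketch of one such parallel K-Means iteration (illustrative code following the domain decomposition above, not taken from the linked implementations):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def kmeans_step(local_X, centroids):
    """One K-Means iteration: assign local points, then recompute centroids with Allreduce."""
    k, dim = centroids.shape
    dists = np.linalg.norm(local_X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                      # nearest centroid for each local point
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for j in range(k):
        members = local_X[labels == j]
        sums[j] = members.sum(axis=0)
        counts[j] = members.shape[0]
    comm.Allreduce(MPI.IN_PLACE, sums, op=MPI.SUM)     # global per-cluster sums
    comm.Allreduce(MPI.IN_PLACE, counts, op=MPI.SUM)   # global per-cluster counts
    return sums / np.maximum(counts, 1)[:, None]       # recomputed global centroids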

Courtesy: https://github.com/tmscarla/k-means-parallel http://users.eecs.northwestern.edu/~wkliao/Kmeans/ Network Based Computing Laboratory MUG ‘21 47 Support for Parallel and Distributed Execution

• Scikit-learn: – Supports execution via Joblib (https://joblib.readthedocs.io/en/latest/) – Joblib supports multi-threaded and multi-process execution (on multiple nodes) • XGBoost: – Multiple ways to run on cluster of nodes: • Dask (http://dask.org) • Ray (https://ray.io/) • AWS YARN • Apache Spark (https://spark.apache.org/) using XGBoost4J-Spark • cuML: – Execution is supported on multiple nodes using Dask (http://dask.org) and NVIDIA’s NCCL Network Based Computing Laboratory MUG ‘21 48 The cuML : Accelerating ML on GPUs

• The NVIDIA RAPIDS project aims to build end-to-end data science analytic pipelines on GPUs • An important component is the cuML library: – GPU-accelerated ML library – GPU-counterpart of Scikit-learn – Supports the execution of ML workloads on Multi-Node Multi-GPUs (MNMG) systems • Most existing ML libraries, including Scikit-learn and Apache Spark’s MLlib, only support CPU execution of ML algorithms – Conventional wisdom has been that only DNNs are a good match for GPUs because of high computational requirements Network Based Computing Laboratory MUG ‘21 49 Main components of the cuML library Python • Main components Scikit-learn-like interfaces – Python layer Algorithms Linear Logistic • Provides a Scikit-learn like interface regression regression K-nearest Random K-means • Hides the complexities of the CUDA/C/C++ layer neighbors forest

– Primitives and cuML algorithms built on top of CUDA Primitives Element-wise Matrix operations multiplication • ML Algorithms Eigen standard distance/matrix Decomposition deviation calculations • Primitives

– Reusable building blocks for building machine learning Dask Python algorithms Cython – Common for different machine learning algorithms CUDA/C/C++ – Used to build different machine learning algorithms UCX cuML Algorithms – Communication Support in cuML: cuML Primitives • Point-to-point communication: Dask NCCL CUDA Libraries • Collective communication: NVIDIA Collective Communications Library (NCCL) CUDA

Network Based Computing Laboratory MUG ‘21 50 Parallel and Distributed Training in cuML

• Distributed training – Using data parallelism, the fit() function is executed on all workers containing a partition of the training dataset

• Trained Model: – The trained parameters are brought to a single worker using MPI_Reduce()

• Inference Stage: – The trained parameters are broadcasted to all workers with prediction partitions
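The communication pattern described above can be sketched with mpi4py as follows (illustrative code; the array standing in for the fitted parameters and the averaging step are assumptions, not cuML's actual implementation):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Each worker fits its model on its partition of the training data (stand-in values here)
local_params = np.random.rand(16)

# Trained model: bring the per-worker parameters to a single worker (rank 0)
global_params = np.zeros_like(local_params)
comm.Reduce(local_params, global_params, op=MPI.SUM, root=0)
if comm.Get_rank() == 0:
    global_params /= comm.Get_size()       # assumed averaging of the reduced parameters

# Inference stage: broadcast the trained parameters to all workers holding prediction partitions
comm.Bcast(global_params, root=0)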

Courtesy: https://arxiv.org/pdf/2002.04803.pdf Network Based Computing Laboratory MUG ‘21 51 Accelerating cuML with MVAPICH2-GDR

• Utilize MVAPICH2-GDR (with mpi4py) as the communication backend during the training phase (the fit() function) in the Multi-Node Multi-GPU (MNMG) setting over a cluster of GPUs
• Communication primitives:
  – Allreduce
  – Reduce
  – Broadcast
• Exploit optimized collectives

[Figure: cuML stack with MVAPICH2-GDR - Python (Dask), Cython (UCX-Py, mpi4py), CUDA/C/C++ (cuML Algorithms, cuML Primitives), UCX, NCCL, MVAPICH2-GDR, CUDA Libraries, CUDA]

Network Based Computing Laboratory MUG ‘21 52 MPI4cuML Release

• MPI4cuML 0.1 was released in Feb ‘21 adding support for MPI to cuML: – Can be downloaded from: http://hidl.cse.ohio-state.edu • Features: – Based on cuML 0.15 – MVAPICH2 support for C++ and Python • Included use of cuML C++ CUDA-Aware MPI example for KMeans clustering • Enabled cuML handles to use MVAPICH2-GDR backend for Python cuML applications – KMeans, PCA, tSVD, RF, LinearModels • Added switch between available communication backends (MVAPICH2 and NCCL) – Built on top of mpi4py over the MVAPICH2-GDR library – Tested with • Mellanox InfiniBand adapters (FDR and HDR) • Various x86-based multi-core platforms (AMD and Intel) • NVIDIA V100 and P100 GPUs

Network Based Computing Laboratory MUG ‘21 53 MPI4cuML Installation

• MPI4cuML is available to download from: http://hidl.cse.ohio-state.edu/ – The userguide is available at: http://hidl.cse.ohio-state.edu/download/hidl/mpi4cuml/mpi4cuml-userguide.pdf

• Setup Instructions – Installation Pre-requisites: • Install the MVAPICH2-GDR Library • Install the mpi4py Library – Install MPI4cuML

• Running GPU-based cuML Applications – Writing the host file – KMeans on real and synthetic data

Network Based Computing Laboratory MUG ‘21 54 Setup Instructions: Pre-requisites

Install the MVAPICH2-GDR Library

Install the mpi4py Library
$ git clone https://github.com/mpi4py/mpi4py.git
$ cd mpi4py
$ edit mpi.cfg file
    # MVAPICH2
    # --------
    [MVAPICH2]
    mpi_dir = /path/to/MVAPICH2-GDR/install/directory
    mpicc = %(mpi_dir)s/bin/mpicc
    mpicxx = %(mpi_dir)s/bin/mpicxx
    include_dirs = %(mpi_dir)s/include
    library_dirs = %(mpi_dir)s/lib64
    runtime_library_dirs = %(library_dirs)s

$ python setup.py build --mpi=MVAPICH2-GDR
$ pip install .

Network Based Computing Laboratory MUG ‘21 55 Setup Instructions: Install MPI4cuML

Install MPI4cuML

$ wget http://hidl.cse.ohio-state.edu/download/hidl/cuml/mpi4cuml-0.1.tar.gz $ tar -xzvf mpi4cuml-0.1.tar.gz $ cd mpi4cuml-0.1/ $ conda env update -n mpi4cuml --file=conda/environments/cuml_dev_cuda10.2.yml $ export LIBRARY_PATH=/path/to/miniconda3/envs/mpi4cuml/lib:$LIBRARY_PATH $ export LD_LIBRARY_PATH=/path/to/miniconda3/envs/mpi4cuml/lib:$LIBRARY_PATH $ ./build.sh cpp-mgtests

$ conda list $ conda list | grep cuml

Build KMeans

$ cmake .. -DCUML_LIBRARY_DIR=/path/to/directory/with/libcuml.so -DCUML_INCLUDE_DIR=/path/to/directory/with/kmeans/kmeans_c.h $ make $ LD_PRELOAD=$MPILIB/lib64/libmpi.so:$CUML_HOME/cpp/build/libcuml++.so

Network Based Computing Laboratory MUG ‘21 56 Running MPI4cuML Applications Writing the host file

$ cd mpi4cuml/cpp/examples/mg_kmeans $ vim hosts .. write name of the compute nodes ..

Run KMeans with synthetic data

$ MPILIB/bin/mpirun_rsh -export-all -np 4 -hostfile hosts MV2_USE_CUDA=1 MV2_USE_GDRCOPY=1 MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so ./kmeans_mg_example

Run KMeans with real dataset .. Download data from https://www.kaggle.com/c/homesite-quote-conversion/data .. $ unzip all.zip $ ./prepare_input.py [train_file=train.csv] [test_file=test.csv] [output=output.txt] $ LD_PRELOAD=$MV2_HOME/lib/libmpi.so:$CUML_HOME/cpp/build/libcuml++.so $ MV2_HOME/bin/mpirun_rsh --export-all -np 4 -hostfile=hosts MV2_USE_CUDA=1 MV2_USE_GDRCOPY=1 MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so ./kmeans_mg_example -num_rows 260753 -num_cols 298 -input output.txt

Network Based Computing Laboratory MUG ‘21 57 Run K-Means at C++ Layer on one node with 1 GPU/node (using MVAPICH2-GDR)

$ sbatch -N 1 run_cuml_kmeans_single.sh + time /opt/tutorials/dl-tutorial-21/mv2-gdr/bin/mpirun_rsh -np 1 gpu04 MV2_USE_CUDA=1 MV2_SUPPORT_DL=1 /opt/tutorials/dl-tutorial- 21/cuML/mpi4cuml-0.1/cpp/examples/mg_kmeans/kmeans_mg_example -num_rows 260753 -num_cols 298 -input data.txt -k 100 - max_iterations=3000 Running on 1 GPU(s) Reading input with 260753 rows and 298 columns from ../data.txt. Run KMeans with k=100, max_iterations=3000 num_pts inertia 0 2466 1.096572e+11 1 2638 1.245608e+11 2 2739 1.362806e+11 3 2659 1.267454e+11 … 98 2343 1.048909e+11 99 2687 1.310442e+11 Global inertia = 1.246384e+13 real 0m31.455s user 0m0.019s sys 0m0.008s

Network Based Computing Laboratory MUG ‘21 58 Run K-Means at C++ Layer on two nodes with 1 GPU/node (using MVAPICH2-GDR)

$ sbatch -N 1 run_cuml_kmeans_single.sh + time /opt/tutorials/dl-tutorial-21/mv2-gdr/bin/mpirun_rsh -np 2 gpu05 gpu04 MV2_USE_CUDA=1 MV2_SUPPORT_DL=1 /opt/tutorials/dl-tutorial- 21/cuML/mpi4cuml-0.1/cpp/examples/mg_kmeans/kmeans_mg_example -num_rows 260753 -num_cols 298 -input data.txt -k 100 - max_iterations=3000 Running on 2 GPU(s) Reading input with 260753 rows and 298 columns from ../data.txt. Run KMeans with k=100, max_iterations=6000 num_pts inertia 0 2466 1.096572e+11 1 2638 1.245608e+11 2 2739 1.362806e+11 3 2659 1.267454e+11 … 98 2343 1.048909e+11 99 2687 1.310442e+11 Global inertia = 1.246384e+13 real 0m33.078s user 0m0.035s sys 0m0.017s

Network Based Computing Laboratory MUG ‘21 59 K-Means 2000 2.0 2000 Linear Regression 1.6 NCCL NCCL 1.8 1.4 MVAPICH2-GDR 1.6 MVAPICH2-GDR 1500 1500 1.2 Speedup 1.4 Speedup 1.2 1.0

1000 1.0 1000 0.8 Speedup

0.8 Speedup Training Training Time (s) Training Training Time (s) 0.6 0.6 500 500 0.4 0.4 0.2 0.2 0 0.0 0 0.0 1 2 4 8 16 32 1 2 4 8 16 32 Number of GPUs Number of GPUs Nearest Neighbors Truncated SVD 2500 2.0 NCCL 1500 1.6 1.8 NCCL MVAPICH2-GDR 1.4 2000 1.6 MVAPICH2-GDR Speedup 1.4 Speedup 1.2 1000 1500 1.2 1.0 1.0 0.8

1000 0.8 Speedup Speedup Training(s)Time 0.6 500 0.6 Training(s)Time 0.4 500 0.4 0.2 0.2 0 0.0 0 0.0 1 2 4 8 16 32 1 2 4 8 16 32 Number of GPUs Number of GPUs M. Ghazimirsaeed , Q. Anthony , A. Shafi , H. Subramoni , and D. K. Panda, Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR, MLHPC Workshop, Nov 2020 Network Based Computing Laboratory MUG ‘21 60 Outline • Introduction • Deep Neural Network Training and Essential Concepts • Parallelization Strategies for Distributed DNN Training • Machine Learning • Data Science using Dask • Conclusion

Network Based Computing Laboratory MUG ‘21 61 Introduction to Dask

• Dask is a popular task-based framework: – Scales Python applications from laptops to high-end systems – Builds a task-graph that is executed lazily on parallel hardware – Natively extends popular data processing libraries like numPy, Pandas • Dask Distributed library supports parallel and distributed execution: – Built using the asyncio package that allows execution of asynchronous/non-blocking/concurrent operations called coroutines: • These are defined using async and invoked using await – Dask Distributed library currently has two communication backends: • TCP: Tornado-based • UCX: Built using a Cython wrapper called UCX-Py • MPI4Dask: MVAPICH2-based MPI communication device • Other Data Science frameworks include Apache Spark and Ray Network Based Computing Laboratory MUG ‘21 62 Dask Distributed Execution Model

[Figure: Dask Distributed execution model - a Client submits work to the Scheduler, which dispatches tasks to the Workers in the Cluster]
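A minimal sketch of this client/scheduler/worker model using a local cluster (illustrative code; LocalCluster stands in for any Dask cluster, e.g. one bootstrapped with Dask-MPI):

from dask.distributed import Client, LocalCluster
import dask.array as da

cluster = LocalCluster(n_workers=3, threads_per_worker=1)   # scheduler plus 3 workers
client = Client(cluster)                                    # the client submits task graphs

x = da.random.random((8000, 8000), chunks=(1000, 1000))     # lazy, chunked array
result = (x + x.T).sum().compute()                          # graph executed by the workers
print(result)

client.close()
cluster.close()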

Network Based Computing Laboratory MUG ‘21 63 MPI4Dask: MPI-based Communication Backend for Dask • Dask currently has two communication backends: • TCP: Tornado-based • UCX: Built using a GPU-aware Cython wrapper called UCX-Py • MPI4Dask is an MVAPICH2-based communication backend for Dask: – First MPI-based communication device for Dask – Optimizes CPU and GPU communication in the Dask framework on modern HPC clusters

Network Based Computing Laboratory MUG ‘21 64 Dask Architecture

[Figure: the Dask architecture - Dask collections (Dask Bag, Dask Array, Dask DataFrame, Delayed, Future) build a Task Graph; deployment packages (Dask-MPI, Dask-CUDA, Dask-Jobqueue) start the Distributed Scheduler, Workers, and Client; the communication layer (tcp.py, ucx.py, MPI4Dask) maps onto TCP, UCX (via UCX-Py), or MVAPICH2-GDR (via mpi4py Cython wrappers), running on anything from laptops/desktops to high-performance computing hardware]

Network Based Computing Laboratory MUG ‘21 65 MPI4Dask: Bootstrapping and Dynamic Connectivity

• Several ways to start Dask programs:
  – Manual
  – Utility classes: LocalCUDACluster, SLURMCluster, SGECluster, PBSCluster, and others
• MPI4Dask uses Dask-MPI to bootstrap execution of Dask programs
• Dynamic connectivity is established using the asyncio package in MPI4Dask:
  – The scheduler and workers listen for incoming connections by calling asyncio.start_server()
  – Workers and the client connect using asyncio.open_connection()

[Figure: within MPI.COMM_WORLD, rank 0 runs the Scheduler, rank 1 the Client, and ranks 2-4 the Workers; arrows show the dynamically established connections between them]

Network Based Computing Laboratory MUG ‘21 66 MPI4Dask: Point-to-point Communication Coroutines • Implements communication coroutines for point-to-point MPI functions: – Using mpi4py (Cython wrappers) and MVAPICH2-GDR • mpi4py provides two flavors of point-to-point communication functions: – Send()/Recv() – for exchanging data in buffers (faster and used in MPI4Dask) – send()/recv() – for communicating Python objects (pickle/unpickle) – GPU buffers implement the __cuda_array_interface__ • Implemented chunking mechanism for large messages • The send and receive communication coroutines are as follows:

Network Based Computing Laboratory MUG ‘21 67 MPI4Dask Release

• MPI4Dask 0.2 was released in Mar ‘21 adding support for MPI to Dask: – Can be downloaded from: http://hibd.cse.ohio-state.edu • Features: – Based on Dask Distributed 2021.01.0 – Compliant with user-level Dask APIs and packages – Support for MPI-based communication in Dask for cluster of GPUs – Implements point-to-point communication co-routines – Efficient chunking mechanism implemented for large messages – (NEW) Built on top of mpi4py over the MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR libraries – (NEW) Support for MPI-based communication for CPU-based Dask applications – Supports starting execution of Dask programs using Dask-MPI – Tested with • (NEW) CPU-based Dask applications using numPy and Pandas data frames • (NEW) GPU-based Dask applications using cuPy and cuDF • Mellanox InfiniBand adapters (FDR and EDR) • Various multi-core platforms • NVIDIA V100 and Quadro RTX 5000 GPUs Network Based Computing Laboratory MUG ‘21 68 MPI4Dask Installation • MPI4Dask is available to download from: http://hibd.cse.ohio-state.edu/ – The userguide is available at: http://hibd.cse.ohio-state.edu/static/media/hibd/dask/mpi4dask-0.2-userguide.pdf • Section 3: Setup Instructions – 3.1 Installation Pre-requisites: • 3.1.1 Install Miniconda • 3.1.2 Modules/Libraries for GPU-based Dask Applications: • 3.1.3 Modules/Libraries for CPU-based Dask Applications: • 3.1.4 Install the MVAPICH2 Library (MVAPICH2-X, MVAPICH2, or MVAPICH2-GDR) • 3.1.5 Install the mpi4py Library • 3.1.6 Install Dask-MPI package – 3.2 Install MPI4Dask • Section 4. Running GPU-based Dask Applications – 4.1 Writing the host file – 4.2 Sum of cuPy Array and its Transpose – 4.3 cuDF Merge • Section 5. Running GPU-based Dask Applications – 5.1 Writing the host file – 5.2 Sum of numPy Array and its Transpose – 5.3 Sum of Pandas Data Frame – 5.4 SVD Network Based Computing Laboratory MUG ‘21 69 Installation Pre-requisites Modules/Libraries for GPU-based Dask Applications: $ conda install -c conda-forge -c rapidsai -c nvidia automake make libtool pkg-config libhwloc psutil python=3.8 setuptools cython cudatoolkit=10.2 cupy dask-cudf dask==2021.1.1 distributed numpy rmm

Modules/Libraries for CPU-based Dask Applications: $ conda install -c conda-forge -c rapidsai -c nvidia automake make libtool pkg-config libhwloc psutil python=3.8 setuptools cython dask==2021.1.1 distributed=2021.1.1 numpy

Install the MVAPICH2 Library (MVAPICH2-X, MVAPICH2, or MVAPICH2-GDR)

Install the mpi4py Library $ git clone https://github.com/mpi4py/mpi4py.git $ edit mpi.cfg file [MVAPICH2] mpi_dir = /path/to/MVAPICH2-GDR/install/directory mpicc = %(mpi_dir)s/bin/mpicc mpicxx = %(mpi_dir)s/bin/mpicxx include_dirs = %(mpi_dir)s/include library_dirs = %(mpi_dir)s/lib64 runtime_library_dirs = %(library_dirs)s

$ python setup.py build --mpi=MVAPICH2-GDR; $ pip install .

Network Based Computing Laboratory MUG ‘21 70 Install Dask-MPI and MPI4Dask

Install Dask-MPI package

$ git clone https://github.com/dask/dask-mpi.git $ cd dask-mpi $ python setup.py build $ pip install .

Install MPI4Dask

$ wget http://hibd.cse.ohio-state.edu/download/hibd/dask/mpi4dask-0.2.tar.gz $ tar -xzvf mpi4dask-0.2.tar.gz $ cd mpi4dask-0.2/distributed $ python setup.py build $ pip install .

$ conda list $ conda list | grep distributed $ conda list |grep dask

Network Based Computing Laboratory MUG ‘21 71 Running GPU and CPU Dask Applications

Sum of cuPy Array and its Transpose [GPU] $ cd mpi4dask-0.2/dask-apps/gpu $ LD_PRELOAD=$MPILIB/lib64/libmpi.so $MPILIB/bin/mpirun_rsh -export-all -np 4 -hostfile hosts MV2_USE_CUDA=1 MV2_USE_GDRCOPY=1 MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so MV2_CPU_BINDING_LEVEL=SOCKET MV2_CPU_BINDING_POLICY=SCATTER python cupy_sum_mpi.py

cuDF Merge [GPU] $ cd mpi4dask-0.2/dask-apps/gpu $ LD_PRELOAD=$MPILIB/lib/libmpi.so $MPILIB/bin/mpirun_rsh -export-all -np 4 -hostfile hosts MV2_USE_CUDA=1 MV2_USE_GDRCOPY=1 MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so MV2_CPU_BINDING_LEVEL=SOCKET MV2_CPU_BINDING_POLICY=SCATTER python cudf_merge_mpi.py --type gpu --protocol mpi --runs 20 --chunk-size 1_000_000_00

Sum of numPy Array and its Transpose [CPU] $ cd mpi4dask-0.2/dask-apps/cpu $ LD_PRELOAD=$MPILIB/lib/libmpi.so $MPILIB/bin/mpirun_rsh -export-all -np 4 -hostfile hosts MV2_CPU_BINDING_LEVEL=SOCKET MV2_CPU_BINDING_POLICY=SCATTER python numpy_sum_mpi.py

Sum of Pandas DataFrame [CPU] $ cd mpi4dask-0.2/dask-apps/cpu $ LD_PRELOAD=$MPILIB/lib/libmpi.so $MPILIB/bin/mpirun_rsh -export-all -np 4 -hostfile hosts MV2_CPU_BINDING_LEVEL=SOCKET MV2_CPU_BINDING_POLICY=SCATTER python dask-cudf_sum_mpi.py Network Based Computing Laboratory MUG ‘21 72 Sum of cuPy Array and its Transpose (with TCP device)

$ srun -N 4 --reservation=dltutorial run_cupy_sum_tcp.sh

Sample Output: + /opt/tutorials/dl-tutorial-21/dask/mvapich2/install/bin/mpirun_rsh -np 4 -hostfile /tmp/shafi.16/hosts LD_PRELOAD=/opt/tutorials/dl-tutorial- 21/dask/mvapich2/install/lib/libmpi.so MV2_USE_CUDA=1 MV2_CPU_BINDING_LEVEL=SOCKET MV2_CPU_BINDING_POLICY=SCATTER MV2_USE_GDRCOPY=1 MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1 /opt/tutorials/dl-tutorial- 21/dask/miniconda3/envs/mpi4dask_tutorial/bin/python /opt/tutorials/dl-tutorial- 21/labs/lab1/cupy_sum_tcp.py Time for iteration 0 : 15.613967895507812 Time for iteration 1 : 12.97205138206482 Time for iteration 2 : 13.211132526397705 Time for iteration 3 : 12.941233396530151 Time for iteration 4 : 13.074704647064209 Median Wall-clock Time: 13.07 s + set +x

Network Based Computing Laboratory MUG ‘21 73 Sum of cuPy Array and its Transpose (with MPI device)

$ srun -N 4 --reservation=dltutorial run_cupy_sum_mpi.sh

Sample Output: + /opt/tutorials/dl-tutorial-21/dask/mvapich2/install/bin/mpirun_rsh -np 4 -hostfile /tmp/shafi.16/hosts LD_PRELOAD=/opt/tutorials/dl-tutorial- 21/dask/mvapich2/install/lib/libmpi.so MV2_USE_CUDA=1 MV2_CPU_BINDING_LEVEL=SOCKET MV2_CPU_BINDING_POLICY=SCATTER MV2_USE_GDRCOPY=1 MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1 /opt/tutorials/dl-tutorial- 21/dask/miniconda3/envs/mpi4dask_tutorial/bin/python /opt/tutorials/dl-tutorial- 21/labs/lab1/cupy_sum_mpi.py Time for iteration 0 : 9.499887466430664 Time for iteration 1 : 5.288544416427612 Configurable options in the script (cupy_sum_mpi.py) Time for iteration 2 : 4.123088598251343 Time for iteration 3 : 4.088623523712158 DASK_PROTOCOL = 'mpi’ # Options are ['mpi', 'tcp'] Time for iteration 4 : 4.115798234939575 GPUS_PER_NODE = 1 # number of GPUs in the system Median Wall-clock Time: 4.12 s RUNS = 5 # repetitions for the benchmark + set +x DASK_INTERFACE = 'ib0' # interface to use for communication THREADS_PER_NODE = 28 # number of threads per node. MVAPICH2-GDR: 3.17x faster

Network Based Computing Laboratory MUG ‘21 74 Benchmark #1: Sum of cuPy Array and its Transpose

(RI2 cluster)

[Charts: total execution time (s) and communication time (s) vs. number of Dask workers (2-6) for IPoIB, UCX, and MPI4Dask; MPI4Dask is 3.47x and 6.92x better on average]

A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, “Efficient MPI-based Communication for GPU-Accelerated Dask Applications”, CCGrid ’21, https://arxiv.org/abs/2101.08878; MPI4Dask 0.2 release (http://hibd.cse.ohio-state.edu)

(TACC Frontera GPU Subsystem) 1.71x better on average

A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, “Efficient MPI-based Communication for GPU-Accelerated Dask Applications”, CCGrid ’21, https://arxiv.org/abs/2101.08878; MPI4Dask 0.2 release (http://hibd.cse.ohio-state.edu)

Network Based Computing Laboratory MUG ‘21 76 Benchmark #2: cuDF Merge (TACC Frontera GPU Subsystem)

[Charts: MPI4Dask is 2.91x and 2.90x better on average]

A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, “Efficient MPI-based Communication for GPU-Accelerated Dask Applications”, CCGrid ’21, https://arxiv.org/abs/2101.08878; MPI4Dask 0.2 release (http://hibd.cse.ohio-state.edu)

Network Based Computing Laboratory MUG ‘21 77 Outline • Introduction • Deep Neural Network Training and Essential Concepts • Parallelization Strategies for Distributed DNN Training • Machine Learning • Data Science using Dask • Conclusion

Network Based Computing Laboratory MUG ‘21 78 Convergence of ML/DL, Data Science, and HPC

• Is Machine Learning/Deep Learning and Data Science an HPC Problem? – Distributed Model/DNN Training is definitely an HPC problem – Inference – not yet an HPC problem – Support for Data Science frameworks on HPC systems is improving (yet lagging) • Why HPC can help? – Decades of research for communication models and performance optimizations – MPI, PGAS, and other communication runtimes can help for “data-parallel” training • Some of the needs for ML/DL and Data Science frameworks are an exact match – Compute intensive problem • Some needs are new for distributed/parallel communication runtimes – Large Message Communication – CUDA-Aware Communication

Network Based Computing Laboratory MUG ‘21 79 Exploiting GPUs and Unified Communication Layer for Data Science Frameworks • The support for GPUs has been added to Dask as part of the NVIDIA RAPIDS project • The RAPIDS Accelerator for Apache Spark is currently adding support for GPU processing for the Apache Spark Framework – SQL Plugin for executing SQL queries entirely on GPUs – Shuffle Plugin based on the UCX communication library (GPU-to-GPU) – https://nvidia.github.io/spark-rapids/ • Opportunity to utilize GPU-to-GPU communication provided by MPI libraries like MVAPICH2-GDR • Is it possible to have an MPI-based communication layer for Data Science frameworks?

Network Based Computing Laboratory MUG ‘21 80 Conclusion • Exponential growth in Machine Learning/Deep Learning/Data Science frameworks • Provided an overview of issues, challenges, and opportunities for designing efficient communication runtimes – Efficient, scalable, and hierarchical designs are crucial for ML/DL/Data Science frameworks – Co-design of communication runtimes and ML/DL/Data Science frameworks will be essential • Presented use-cases to demonstrate the complex interaction between DL/ML/Dask middleware with the underling HPC technologies and middleware • Need collaborative efforts to achieve the full potential

Network Based Computing Laboratory MUG ‘21 81 Funding Acknowledgments Funding Support by

Equipment Support by

Network Based Computing Laboratory MUG ‘21 82 Acknowledgments to all the Heroes (Past/Current Students and Staffs)

Current Students (Graduate) Current Research Scientists Current Software Engineers – K. S. Khorassani (Ph.D.) – A. H. Tu (Ph.D.) – N. Alnaasan (Ph.D.) – A. Shafi – B. Seeds – P. Kousha (Ph.D.) – S. Xu (Ph.D.) – Q. Anthony (Ph.D.) – H. Subramoni – N. Shineman – C.-C. Chun (Ph.D.) – B. Michalowicz (Ph.D.) – Q. Zhou (Ph.D.) Current Students (Undergrads) – N. Contini (Ph.D.) – B. Ramesh (Ph.D.) – K. Al Attar (M.S.) – M. Lieber – A. Jain (Ph.D.) – K. K. Suresh (Ph.D.) – N. Sarkauskas (Ph.D.) – L. Xu Past Students – A. Awan (Ph.D.) – T. Gangadharappa (M.S.) – P. Lai (M.S.) – D. Shankar (Ph.D.) – A. Augustine (M.S.) – K. Gopalakrishnan (M.S.) – J. Liu (Ph.D.) – G. Santhanaraman (Ph.D.) Past Research Scientists – P. Balaji (Ph.D.) – J. Hashmi (Ph.D.) – M. Luo (Ph.D.) – N. Sarkauskas (B.S.) – K. Hamidouche – M. Bayatpour (Ph.D.) – W. Huang (Ph.D.) – A. Mamidala (Ph.D.) – N. Senthil Kumar (M.S.) – S. Sur – R. Biswas (M.S.) – W. Jiang (M.S.) – G. Marsh (M.S.) – A. Singh (Ph.D.) – X. Lu – S. Bhagvat (M.S.) – J. Sridhar (M.S.) – J. Jose (Ph.D.) – V. Meshram (M.S.) Past Senior Research Associate – A. Bhat (M.S.) – M. Kedia (M.S.) – A. Moody (M.S.) – S. Srivastava (M.S.) – J. Hashmi – D. Buntinas (Ph.D.) – S. Kini (M.S.) – S. Naravula (Ph.D.) – S. Sur (Ph.D.) Past Programmers – L. Chai (Ph.D.) – M. Koop (Ph.D.) – R. Noronha (Ph.D.) – H. Subramoni (Ph.D.) – B. Chandrasekharan (M.S.) – K. Kulkarni (M.S.) – X. Ouyang (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Reifsteck – S. Chakraborthy (Ph.D.) – R. Kumar (M.S.) – S. Pai (M.S.) – A. Vishnu (Ph.D.) – D. Bureddy – N. Dandapanthula (M.S.) – S. Krishnamoorthy (M.S.) – S. Potluri (Ph.D.) – J. Wu (Ph.D.) – J. Perkins – V. Dhanraj (M.S.) – K. Kandalla (Ph.D.) – K. Raj (M.S.) – W. Yu (Ph.D.) Past Research Specialist – C.-H. Chu (Ph.D.) – M. Li (Ph.D.) – R. Rajachandrasekar (Ph.D.) – J. Zhang (Ph.D.) Past Post-Docs – M. Arnold – J. Smith – D. Banerjee – H.-W. Jin – E. Mancini – A. Ruhela – X. Besseron – J. Lin – K. Manian – J. Vienne – M. S. Ghazimeersaeed – M. Luo – S. Marcarelli – H. Wang Network Based Computing Laboratory MUG ‘21 83 Thank You! {jain.575,shafi.16, anthony.301}@osu.edu

Follow us on https://twitter.com/mvapich

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project The High-Performance Deep Learning Project http://mvapich.cse.ohio-state.edu/ http://hidl.cse.ohio-state.edu/

Network Based Computing Laboratory MUG ‘21 84