DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters
Xiaoyi Lu, Haiyang Shi, Rajarshi Biswas, M. Haseeb Javed, and Dhabaleswar K. Panda

1 What is Big Data?

● "Big data" is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. (Source: Wikipedia) ● What does “Big” mean? ○ According to Forbes, we generate 2.5 quintillion bytes of data each day. ● Properties of Big Data ○ Variety ○ Velocity ○ Volume

2 How is Big Data different from large data stored in relational databases?


3 Uses of Big Data

● Predictive analytics: utilizing a variety of statistical techniques
  ○ Data mining (discovering patterns in large data sets),
  ○ Predictive modeling (uses statistics to predict outcomes), and
  ○ Machine learning
● User Behaviour Analytics (UBA): analyze human behavior patterns

4 Big Data Technology stack

Source: https://www.slideshare.net/Khalid-Imran/big-data-technology-stack-nutshell

5 Big Data Frameworks

6 Hadoop vs Spark

● Spark can be up to 100x faster than Hadoop MapReduce. Why?
  ○ In-memory computation
● Spark does not provide its own distributed storage.
● Hence, Big Data projects often involve installing Spark on top of Hadoop.
  ○ Spark applications can use the Hadoop Distributed File System (HDFS), as in the sketch below.
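A minimal sketch of a Spark application reading its input from HDFS and caching it in memory; the HDFS path and file name are placeholders, not from the slides:

```scala
import org.apache.spark.sql.SparkSession

object HdfsLineCount {
  def main(args: Array[String]): Unit = {
    // The HDFS path below is purely illustrative.
    val spark = SparkSession.builder().appName("HdfsLineCount").getOrCreate()
    val lines = spark.sparkContext.textFile("hdfs:///data/input/sample.txt")

    // cache() keeps the RDD in memory, so repeated actions avoid re-reading
    // from HDFS -- the in-memory computation behind Spark's speedup.
    val cached = lines.cache()
    println(s"lines = ${cached.count()}, non-empty = ${cached.filter(_.nonEmpty).count()}")

    spark.stop()
  }
}
```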

7 Batch Processing

● Processing happens on blocks of data that have already been stored over a period of time.
● Example - calculating monthly payroll summaries (sketched below).
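A minimal batch-processing sketch in Spark: summarize a month of already-stored payroll records in one job. The file path, column names, and flat hourly rate are illustrative assumptions, not from the slides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object MonthlyPayroll {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MonthlyPayroll").getOrCreate()

    // Records accumulated over the month, processed together in one batch job.
    val payroll = spark.read.option("header", "true")
      .csv("hdfs:///payroll/2024-01/*.csv")

    payroll
      .groupBy("employee_id")
      .agg((sum(col("hours_worked").cast("double")) * 25.0).as("monthly_pay")) // hypothetical flat rate
      .show()

    spark.stop()
  }
}
```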

8 Stream Processing

source - https://medium.com/@gowthamy/big-data-battle-batch-processing-vs-stream-processing-5d94600d8103

● Process data in real time as they arrive.
● Example - determining if a bank transaction is fraudulent or not (sketched below).
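A minimal stream-processing sketch with Spark Structured Streaming: incoming transactions are checked as they arrive. The socket source, the line format, and the simple amount-based rule are illustrative assumptions only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FraudFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FraudFilter").getOrCreate()
    import spark.implicits._

    // Each incoming line is assumed to look like "<transactionId>,<amount>".
    val lines = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()

    val flagged = lines.as[String]
      .map { line => val p = line.split(","); (p(0), p(1).toDouble) }
      .toDF("txnId", "amount")
      .filter(col("amount") > 10000.0)   // naive stand-in for a real fraud model

    // Results are emitted continuously as new transactions arrive.
    flagged.writeStream.format("console").start().awaitTermination()
  }
}
```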

9 MapReduce

source - https://medium.com/@gowthamy/big-data-battle-batch-processing-vs-stream-processing-5d94600d8103

● Processing technique and a program model for distributed computing.
● Key idea - take the input data and divide it into many parts. Each part is then sent to a different machine to be processed and finally aggregated.
● Based upon horizontal scaling.
● Under the MapReduce model, the data processing primitives are called mappers and reducers (see the word-count sketch below).
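A word-count sketch of the mapper/reducer split, written against Spark's RDD API rather than Hadoop's Java MapReduce API; the input path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///data/books/*.txt")
      .flatMap(_.split("\\s+"))   // "mapper": split each input part into words
      .map(word => (word, 1))     //           and emit (word, 1) pairs
      .reduceByKey(_ + _)         // "reducer": aggregate the per-word counts across machines

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```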

10 MapReduce

source - https://towardsdatascience.com/big-data-analysis-spark-and-hadoop-a11ba591c057

11 DLoBD

12 Introduction

➢ What is DLoBD?
  ○ Running Deep Learning libraries over Big Data stacks, like Hadoop and Spark.
➢ Advantages - enable distributed Deep Learning on Big Data analytics clusters.
➢ Contributions of the paper:
  ○ Extensive performance evaluations of representative DLoBD stacks.
  ○ Characterization based on performance, scalability, accuracy, and resource utilization.

13 Why DL over Big Data?

1. Better data locality.

2. Efficient resource sharing & cost effective.

3. Easy integration of DL on Big Data processing components.

source: https://www.kdnuggets.com/2016/02/yahoo-caffe-spark-distributed-deep-learning.html

14
source: https://www.slideshare.net/JenAman/caffeonspark-deep-learning-on-spark-cluster

15 Examples of DLoBD stacks

● CaffeOnSpark
● SparkNet
● TensorFlowOnSpark
● TensorFrames
● DeepLearning4J
● BigDL
● MMLSpark (or CNTKOnSpark)

source: https://www.slideshare.net/databricks/dlobd-an-emerging-paradigm-of-deep-learning-over-big-data-stacks-with-dhabaleswar-k-panda-and-xiaoyi-lu

16 Convergence of DL, Big Data, and HPC

● How much performance benefit can we achieve for end deep learning applications?
● What is the impact of these advanced hardware technologies and the associated efficient building blocks on various Deep Learning aspects?
● How much performance overhead is brought by the heavy layers of DLoBD stacks?

17 CaffeOnSpark Overview

➢ Designed by Yahoo!
➢ Inherits features from Caffe, like computing on:
  ○ CPU
  ○ GPU
  ○ GPU + cuDNN
➢ Enables Deep Learning training and testing with Caffe embedded inside Spark applications.
➢ Major benefit?
  ○ Eliminates unnecessary data movement

18 Background

● Resilient Distributed Datasets (RDDs)
  ○ Fundamental data structure of Spark
  ○ Read-only
  ○ Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster (see the sketch after this list).
● Hadoop YARN (Yet Another Resource Negotiator)
  ○ Job scheduling and cluster resource management
  ○ Two components:
    ■ ResourceManager: manages the global assignment of compute resources to applications, e.g., memory, CPU, disk, network, etc.
    ■ NodeManager: tracks its own local resources and communicates its resource configuration to the ResourceManager.
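A minimal sketch of the RDD abstraction: a collection split into logical partitions that can be computed on different nodes, with transformations producing new (read-only) RDDs. Values and partition count are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

object RddPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddPartitionsDemo").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000, numSlices = 4) // 4 logical partitions
    println(s"partitions = ${rdd.getNumPartitions}")

    // Each partition is processed independently, potentially on a different node.
    rdd.mapPartitionsWithIndex { (idx, values) => Iterator((idx, values.sum)) }
      .collect()
      .foreach { case (idx, s) => println(s"partition $idx sum = $s") }

    // RDDs are read-only: transformations return new RDDs instead of mutating in place.
    val doubled = rdd.map(_ * 2)
    println(s"total of doubled values = ${doubled.sum()}")

    spark.stop()
  }
}
```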

19 Background - Spark Cluster Mode

● Master/slave architecture
● The driver converts the user program into tasks and then schedules the tasks on the executors.
● Executors are worker processes in charge of running individual tasks in a given Spark job. Once they have run a task, they send the results to the driver.

20 CaffeOnSpark Architecture

21 CaffeOnSpark

Advantages:
➢ Integration of Deep Learning, Big Data, and HPC.
➢ Scaling out of Caffe with help from the Hadoop YARN and Spark frameworks.
➢ By using HDFS:
  ○ Data-sharing
  ○ Fault tolerance

Disadvantages:
➢ Parameter exchange phase is implemented in an out-of-band fashion.
  ○ Cannot take advantage of optimizations in default Spark.
➢ Dedicated communication channels (either RDMA or TCP/IP based) in model synchronizers.
  ○ Need extra effort to maintain them.

22 TensorFlowOnSpark

● Again designed by Yahoo!
● Allows running DL jobs using TensorFlow over Big Data clusters.
● Parameter Server approach
  ○ Storing and updating the model's parameters.
  ○ Async training - all workers independently train over the input data and update variables asynchronously.

23 MMLSpark (or CNTKOnSpark)

● Similar architecture to the previous DLoBD stacks.
● Out-of-band communication via MPI.
  ○ Users can specify the channel (TCP/IP or RDMA) based on the MPI library chosen.
  ○ In this paper, MMLSpark is compiled with the OpenMPI library.

24 BigDL

➢ Proposed by Intel.
➢ Allows importing pre-trained Caffe and Torch models into Spark.
➢ Written using Intel's Math Kernel Library (MKL).
  ○ Optimizes math functions like linear algebra and FFTs.
➢ Advantage:
  ○ No separate communication channel required.

25 Characterization Methodology

26 Data Sets

27 Selected Deep Learning Models and Algorithms

28 Performance Evaluation

29 Experimental Setup

1. OSU RI2 Cluster (Cluster A):
   a. 20 nodes
      i. 2 Intel Broadwell 14-core processors
      ii. 128 GB RAM
      iii. 128 GB local HDD
      iv. NVIDIA Tesla K80 GPU
   b. Connected with a Mellanox single-port InfiniBand EDR (100 Gbps) HCA.
2. SDSC Comet Cluster (Cluster B):
   a. 1,984 nodes (used only 17 nodes)
      i. Dual Intel Haswell 12-core processors
      ii. 128 GB RAM
      iii. 320 GB local HDD
   b. Connected with InfiniBand FDR (56 Gbps).
3. Default: # of nodes x batch size = 128.

30 Evaluation on CPU vs. GPU (Cluster A)

31 Evaluation on IPoIB vs. RDMA (Cluster A)

32 Evaluation on Performance & Accuracy

33 Epoch-level evaluation

VGG model using BigDL on default Spark (CPU cores: 192, batch size: 768)

34 Scalability Evaluation

VGG model on Cluster B using BigDL. The figure shows the cumulative time taken to finish the 18th epoch.

2.7x speedup

35 Evaluation on Resource Utilization

Cifar-10 Quick model on Cifar-10 dataset with CaffeOnSpark (Cluster A)

Utilization results generated from average resource consumption sampled every 60 seconds.

(During training)

36 TensorFlowOnSpark

37 Performance Overhead on TensorFlowOnSpark

● TensorFlowOnSpark does not involve Spark drivers in tensor communication.
● Experimental Setup:
  ○ SoftMax Regression model
  ○ MNIST dataset
  ○ 4-node parameter server
  ○ Batch size - 128
  ○ CPU training
● Roughly 15-19% of execution time is spent in each of YARN and Spark.
● This overhead can be amortized for long-running DL jobs.

38 Impact of RDMA on TensorFlow

❖ Experimental Setup:
  ➢ ResNet50 model
  ➢ TF CNN benchmarks
  ➢ Batch size - 64
  ➢ GPU-based training
  ➢ 2 GPUs/node
  ➢ Parameter server mode (1 PS and rest workers)
  ➢ MPI libraries:
    ➢ MVAPICH2-2.3b
    ➢ Intel-MPI-2018

39 Summary

● Detailed architectural overview of 4 representative DLoBD stacks:
  ○ CaffeOnSpark
  ○ TensorFlowOnSpark
  ○ MMLSpark
  ○ BigDL
● The RDMA scheme benefits DL workloads.
  ○ 2.7x performance speedup with RDMA vs. IPoIB.
● Mostly, GPU-based DL designs outperform CPU-based ones.

40 BigDL: A Distributed Deep Learning Framework for Big Data

41 Introduction

❏ Distributed deep learning framework for Big Data platforms and workflows.
❏ Implemented on top of Apache Spark.
❏ Allows writing DL applications as standard Spark programs.

source: https://www.slideshare.net/SparkSummit/bigdl-a-distributed-deep-learning-library-on-spark-spark-summit-east-talk-by-yiheng-wang

42 Why BigDL?

source: https://www.slideshare.net/SparkSummit/bigdl-a-distributed-deep-learning-library-on-spark-spark-summit-east-talk-by-yiheng-wang

43 Why BigDL?

❏ Rich Deep Learning support
  ❏ Neural network operations
  ❏ Layers
  ❏ Losses
  ❏ Optimizers
❏ Directly run existing models defined in other frameworks (such as TensorFlow, Keras, Caffe and Torch).

44 Programming Model

Transform text to list of words
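The original slide shows the code for this step. Here is a minimal Scala sketch (spark-shell style) of the same idea, assuming input lines of the form `label<TAB>text` stored on HDFS; the path and format are illustrative, not from the slides:

```scala
import org.apache.spark.sql.SparkSession

// A real BigDL job would typically build its SparkConf via Engine.createSparkConf().
val spark = SparkSession.builder().appName("BigDLTextClassifier").getOrCreate()
val sc = spark.sparkContext

// Each input line is assumed to be "<label>\t<document text>".
val raw = sc.textFile("hdfs:///text_data/*.txt")
  .map(_.split("\t", 2))
  .filter(_.length == 2)

// Transform each document into its list of lower-cased words.
val labeledWords = raw.map { case Array(label, text) =>
  (label.toFloat, text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)
}
```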

45 Programming Model

Data transformation
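Continuing the sketch: turn each word list into a fixed-size bag-of-words vector wrapped in a BigDL `Sample`. This is a simplification of the slide's pipeline, and the BigDL class names and constructors are given to the best of my recollection of the 0.x Scala API:

```scala
import com.intel.analytics.bigdl.dataset.Sample
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.tensor.Tensor

// Build a vocabulary on the driver (fine for a small corpus; illustrative only).
val vocab = labeledWords.flatMap(_._2).distinct().collect().zipWithIndex.toMap
val vocabSize = vocab.size

// One Sample per document: feature = word-count vector, label = 1-based class id.
val sampleRdd = labeledWords.map { case (label, words) =>
  val counts = new Array[Float](vocabSize)
  words.foreach(w => vocab.get(w).foreach(i => counts(i) += 1f))
  Sample(Tensor(counts, Array(vocabSize)), Tensor(Array(label), Array(1)))
}
```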

46 Programming Model

Model construction
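A simple multi-layer perceptron over the bag-of-words features, built with BigDL's `Sequential` container. The layer sizes and class count are arbitrary, and this is not necessarily the network shown on the slide:

```scala
import com.intel.analytics.bigdl.nn.{Sequential, Linear, ReLU, LogSoftMax}
import com.intel.analytics.bigdl.numeric.NumericFloat

val nClasses = 20 // e.g., a 20-class corpus; adjust to the data set

val model = Sequential()
  .add(Linear(vocabSize, 128)) // bag-of-words -> 128 hidden units
  .add(ReLU())
  .add(Linear(128, nClasses))
  .add(LogSoftMax())           // log-probabilities, paired with ClassNLLCriterion during training
```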

47 Programming Model

Model Training
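Distributed training is then a standard Spark driver program that hands the model, the Sample RDD, and a loss function to BigDL's `Optimizer`; the optimization method and stopping condition below are arbitrary choices:

```scala
import com.intel.analytics.bigdl.nn.ClassNLLCriterion
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.optim.{Optimizer, SGD, Trigger}
import com.intel.analytics.bigdl.utils.Engine

Engine.init // initialize BigDL's execution environment on the cluster

val optimizer = Optimizer(
  model = model,
  sampleRDD = sampleRdd,
  criterion = ClassNLLCriterion(),
  batchSize = 128 // typically a multiple of (# executors x cores per executor)
)

val trainedModel = optimizer
  .setOptimMethod(new SGD(learningRate = 0.01))
  .setEndWhen(Trigger.maxEpoch(10))
  .optimize() // runs data-parallel, synchronous mini-batch SGD on Spark
```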

48 Programming Model

Model Prediction
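Finally, inference runs as another distributed Spark job over an RDD of Samples (here reusing the training data purely for illustration):

```scala
// Distributed inference: one predicted class id per input Sample.
val predictions = trainedModel.predictClass(sampleRdd)
predictions.take(5).foreach(println)

// trainedModel.predict(sampleRdd) would instead return the raw per-class activations.
```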

49 Execution Model

➔ Single driver and multiple workers.
➔ Driver - scheduling and dispatching tasks to worker nodes.
➔ Worker - actual computation and physical data storage.
➔ How does BigDL support efficient and scalable distributed training on top of Spark?

50 Data-parallel training

● Synchronous mini-batch SGD.
● Distributed training implemented as an iterative process.

source: https://www.slideshare.net/SparkSummit/bigdl-a-distributed-deep-learning-library-on-spark-spark-summit-east-talk-by-yiheng-wang

51 Data-parallel training

52 How is Data parallelism achieved?

❖ BigDL constructs an RDD of models (replicas of the original neural network model).
❖ The model and Sample RDDs are co-partitioned and co-located.

53 Parameter synchronization

54 Parameter Synchronization

● Tasks in the "model forward-backward" job of the next iteration can read the latest value of all the weights before the next training step begins.
● An AllReduce-like operation is implemented using Spark primitives, namely shuffle and broadcast (see the conceptual sketch below).
● Highly scalable distributed training is possible on up to 256 nodes.
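A conceptual sketch, not BigDL's actual code, of how an AllReduce-style gradient aggregation can be expressed with Spark's shuffle (`reduceByKey`) and broadcast primitives; the slicing and averaging details are illustrative:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Each replica's gradient is cut into N slices; slices with the same index are
// summed via a shuffle, and the averaged result is redistributed via broadcast.
object ShuffleBroadcastAllReduce {
  def allReduceMean(sc: SparkContext, grads: RDD[Array[Float]], nSlices: Int): Array[Float] = {
    val nReplicas = grads.count().toInt
    val len = grads.first().length
    val sliceLen = math.ceil(len.toDouble / nSlices).toInt

    // Shuffle step: slice every local gradient and group equal slice indices together.
    val summedSlices = grads
      .flatMap { g =>
        (0 until nSlices).map { i =>
          (i, g.slice(i * sliceLen, math.min((i + 1) * sliceLen, len)))
        }
      }
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })

    // Average, reassemble, and broadcast back so every task can read the new weights.
    val averaged = summedSlices
      .mapValues(_.map(_ / nReplicas))
      .collect()
      .sortBy(_._1)
      .flatMap(_._2)
    sc.broadcast(averaged).value
  }
}
```

In BigDL itself each node owns and updates one slice of the parameters rather than collecting everything on the driver, so the reassembly step here is a deliberate simplification.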

55 Task scheduling

● DL frameworks typically run stateful tasks which interact with each other.
  ○ Why? - To support synchronous mini-batch SGD for model computation and parameter synchronization.
● Whereas BigDL runs a series of short-lived jobs (i.e., 2 jobs per mini-batch).
  ○ Each task is stateless and non-blocking.
  ○ How do we schedule them?
    ■ Launch a single, multi-threaded task on each worker to achieve high scalability.
    ■ Use group scheduling (introduced by Drizzle) to help schedule a group of computations at once.

Figure: Overheads of task scheduling and dispatch (as a fraction of average compute time) for ImageNet Inception v1 training.

56 Model quantization

● Quantization refers to storing numbers and performing calculations in a more compact and lower-precision form (than their original format, such as 32-bit floating point).
● Why needed?
  ○ For inference speed-up in resource-constrained environments.
● How does BigDL do it? (See the sketch below.)
  ○ Quantizes the parameters of selected layers into 8-bit integers (the quantized model).
  ○ During inference:
    ■ Each quantized layer quantizes the input (float32) data into 8-bit integers,
    ■ applies 8-bit calculations, and
    ■ dequantizes the result to 32-bit floating point.
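Below is a minimal, self-contained sketch of symmetric linear 8-bit quantization and dequantization of a float32 vector. It only illustrates the general idea described above; BigDL's actual quantization scheme may differ in its details.

```scala
object Quantize8Bit {
  final case class Quantized(data: Array[Byte], scale: Float)

  def quantize(x: Array[Float]): Quantized = {
    val maxAbs = x.map(math.abs).max max 1e-8f      // avoid division by zero
    val scale = maxAbs / 127f                       // map [-maxAbs, maxAbs] onto [-127, 127]
    Quantized(x.map(v => math.round(v / scale).toByte), scale)
  }

  def dequantize(q: Quantized): Array[Float] =
    q.data.map(_.toFloat * q.scale)

  def main(args: Array[String]): Unit = {
    val weights = Array(0.12f, -0.5f, 0.33f, 0.01f)
    val q = quantize(weights)
    println(q.data.mkString(", "))                  // compact 8-bit representation
    println(dequantize(q).mkString(", "))           // approximately recovers the floats
  }
}
```

Storing parameters as bytes plus a single scale factor is what shrinks the stored layers roughly 4x and enables the faster 8-bit arithmetic mentioned on the slide.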

57 Model quantization results (accuracy, inference speed and model size) for SSD, VGG16 and VGG19

58 Applications

59 Model Inference: image feature extraction

60 Distributed Training: precipitation nowcasting

● Predicting short-term precipitation.
● Trained model used to predict precipitation patterns for the next hour.

61 Conclusion

➢ Implements a parameter server architecture / AllReduce operation for distributed training, which is not natively supported by existing big data systems.
➢ The CaffeOnSpark and TensorFlowOnSpark frameworks use Spark as the orchestration layer to allocate resources from the cluster and then launch the distributed Caffe or TensorFlow job on the allocated machines; however, the Caffe or TensorFlow job still runs outside of the big data framework. In contrast, BigDL provides distributed training on top of the big data framework using Spark primitives.
