DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters
Xiaoyi Lu, Haiyang Shi, Rajarshi Biswas, M. Haseeb Javed, and Dhabaleswar K. Panda

1 What is Big Data?

● "Big data" is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. (Source: Wikipedia) ● What does “Big” mean? ○ According to Forbes, we generate 2.5 quintillion bytes of data each day. ● Properties of Big Data ○ Variety ○ Velocity ○ Volume

2 How is Big Data different from large data stored in relational databases?


3 Uses of Big Data

● Predictive analytics: utilizing a variety of statistical techniques
  ○ Data mining (discovering patterns in large data sets),
  ○ Predictive modeling (uses statistics to predict outcomes), and
  ○ Machine learning
● User Behaviour Analytics (UBA): analyze human behavior patterns

4 Big Data Technology stack

Source: https://www.slideshare.net/Khalid-Imran/big-data-technology-stack-nutshell

5 Big Data Frameworks

6 Hadoop vs Spark

● Spark can be up to 100x faster than Hadoop MapReduce. Why?
  ○ In-memory computation
● Spark does not provide its own distributed storage.
● Hence, Big Data projects often involve installing Spark on top of Hadoop.
  ○ Spark applications can use the Hadoop Distributed File System (HDFS), as in the sketch below.
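A minimal sketch of a Spark application reading its input from HDFS and caching it in memory; the HDFS path and file name are placeholders, not from the slides:

```scala
import org.apache.spark.sql.SparkSession

object HdfsLineCount {
  def main(args: Array[String]): Unit = {
    // The HDFS path below is purely illustrative.
    val spark = SparkSession.builder().appName("HdfsLineCount").getOrCreate()
    val lines = spark.sparkContext.textFile("hdfs:///data/input/sample.txt")

    // cache() keeps the RDD in memory, so repeated actions avoid re-reading
    // from HDFS -- the in-memory computation behind Spark's speedup.
    val cached = lines.cache()
    println(s"lines = ${cached.count()}, non-empty = ${cached.filter(_.nonEmpty).count()}")

    spark.stop()
  }
}
```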

7 Batch Processing

● Processing happens on blocks of data that have already been stored over a period of time.
● Example - calculating monthly payroll summaries (sketched below).
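A minimal batch-processing sketch in Spark: summarize a month of already-stored payroll records in one job. The file path, column names, and flat hourly rate are illustrative assumptions, not from the slides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object MonthlyPayroll {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MonthlyPayroll").getOrCreate()

    // Records accumulated over the month, processed together in one batch job.
    val payroll = spark.read.option("header", "true")
      .csv("hdfs:///payroll/2024-01/*.csv")

    payroll
      .groupBy("employee_id")
      .agg((sum(col("hours_worked").cast("double")) * 25.0).as("monthly_pay")) // hypothetical flat rate
      .show()

    spark.stop()
  }
}
```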

8 Stream Processing

source - https://medium.com/@gowthamy/big-data-battle-batch-processing-vs-stream-processing-5d94600d8103

● Process data in real time as they arrive.
● Example - determining if a bank transaction is fraudulent or not (sketched below).
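A minimal stream-processing sketch with Spark Structured Streaming: incoming transactions are checked as they arrive. The socket source, the line format, and the simple amount-based rule are illustrative assumptions only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FraudFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FraudFilter").getOrCreate()
    import spark.implicits._

    // Each incoming line is assumed to look like "<transactionId>,<amount>".
    val lines = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()

    val flagged = lines.as[String]
      .map { line => val p = line.split(","); (p(0), p(1).toDouble) }
      .toDF("txnId", "amount")
      .filter(col("amount") > 10000.0)   // naive stand-in for a real fraud model

    // Results are emitted continuously as new transactions arrive.
    flagged.writeStream.format("console").start().awaitTermination()
  }
}
```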

9 MapReduce

source - https://medium.com/@gowthamy/big-data-battle-batch-processing-vs-stream-processing-5d94600d8103

● Processing technique and a program model for distributed computing.
● Key idea - take the input data and divide it into many parts. Each part is then sent to a different machine to be processed and finally aggregated.
● Based upon horizontal scaling.
● Under the MapReduce model, the data processing primitives are called mappers and reducers (see the word-count sketch below).
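A word-count sketch of the mapper/reducer split, written against Spark's RDD API rather than Hadoop's Java MapReduce API; the input path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///data/books/*.txt")
      .flatMap(_.split("\\s+"))   // "mapper": split each input part into words
      .map(word => (word, 1))     //           and emit (word, 1) pairs
      .reduceByKey(_ + _)         // "reducer": aggregate the per-word counts across machines

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```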

10 MapReduce

source - https://towardsdatascience.com/big-data-analysis-spark-and-hadoop-a11ba591c057

11 DLoBD

12 Introduction

➢ What is DLoBD?
  ○ Running Deep Learning libraries over Big Data stacks, like Hadoop and Spark.
➢ Advantages - enable distributed Deep Learning on Big Data analytics clusters.
➢ Contributions of the paper:
  ○ Extensive performance evaluations of representative DLoBD stacks.
  ○ Characterization based on performance, scalability, accuracy, and resource utilization.

13 Why DL over Big Data?

1. Better data locality.

2. Efficient resource sharing & cost effective.

3. Easy integration of DL on Big Data processing components.

source: https://www.kdnuggets.com/2016/02/yahoo-caffe-spark-distributed-deep-learning.html

14
source: https://www.slideshare.net/JenAman/caffeonspark-deep-learning-on-spark-cluster

15 Examples of DLoBD stacks

● CaffeOnSpark
● SparkNet
● TensorFlowOnSpark
● TensorFrames
● DeepLearning4J
● BigDL
● MMLSpark (or CNTKOnSpark)

source: https://www.slideshare.net/databricks/dlobd-an-emerging-paradigm-of-deep-learning-over-big-data-stacks-with-dhabaleswar-k-panda-and-xiaoyi-lu

16 Convergence of DL, Big Data, and HPC

● How much performance benefit can we achieve for end deep learning applications?
● What is the impact of these advanced hardware technologies and the associated efficient building blocks on various Deep Learning aspects?
● How much performance overhead is brought by the heavy layers of DLoBD stacks?

17 CaffeOnSpark Overview

➢ Designed by Yahoo!
➢ Inherits features from Caffe, like computing on:
  ○ CPU
  ○ GPU
  ○ GPU + cuDNN
➢ Enables Deep Learning training and testing with Caffe embedded inside Spark applications.
➢ Major benefit?
  ○ Eliminates unnecessary data movement

18 Background

● Resilient Distributed Datasets (RDDs)
  ○ Fundamental data structure of Spark
  ○ Read-only
  ○ Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster (see the sketch after this list).
● Hadoop YARN (Yet Another Resource Negotiator)
  ○ Job scheduling and cluster resource management
  ○ Two components:
    ■ ResourceManager: manages the global assignment of compute resources to applications, e.g., memory, CPU, disk, network, etc.
    ■ NodeManager: tracks its own local resources and communicates its resource configuration to the ResourceManager.
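A minimal sketch of the RDD abstraction: a collection split into logical partitions that can be computed on different nodes, with transformations producing new (read-only) RDDs. Values and partition count are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

object RddPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddPartitionsDemo").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000, numSlices = 4) // 4 logical partitions
    println(s"partitions = ${rdd.getNumPartitions}")

    // Each partition is processed independently, potentially on a different node.
    rdd.mapPartitionsWithIndex { (idx, values) => Iterator((idx, values.sum)) }
      .collect()
      .foreach { case (idx, s) => println(s"partition $idx sum = $s") }

    // RDDs are read-only: transformations return new RDDs instead of mutating in place.
    val doubled = rdd.map(_ * 2)
    println(s"total of doubled values = ${doubled.sum()}")

    spark.stop()
  }
}
```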

19 Background - Spark Cluster Mode

● Master/slave architecture
● The driver converts the user program into tasks and then schedules the tasks on the executors.
● Executors are worker processes in charge of running individual tasks in a given Spark job. Once they have run a task, they send the results to the driver.

20 CaffeOnSpark Architecture

21 CaffeOnSpark

Advantages:
➢ Integration of Deep Learning, Big Data, and HPC.
➢ Scaling out of Caffe with help from the Hadoop YARN and Spark frameworks.
➢ By using HDFS:
  ○ Data-sharing
  ○ Fault tolerance

Disadvantages:
➢ Parameter exchange phase is implemented in an out-of-band fashion.
  ○ Cannot take advantage of optimizations in default Spark.
➢ Dedicated communication channels (either RDMA or TCP/IP based) in model synchronizers.
  ○ Need extra effort to maintain them.

22 TensorFlowOnSpark

● Again designed by Yahoo!
● Allows running DL jobs using TensorFlow over Big Data clusters.
● Parameter Server approach
  ○ Storing and updating the model's parameters.
  ○ Async training - all workers independently train over the input data and update variables asynchronously.

23 MMLSpark (or CNTKOnSpark)

● Similar architecture to the previous DLoBD stacks.
● Out-of-band communication via MPI.
  ○ Users can specify the channel (TCP/IP or RDMA) based on the MPI library chosen.
  ○ In this paper, MMLSpark is compiled with the OpenMPI library.

24 BigDL

➢ Proposed by Intel.
➢ Allows importing pre-trained Caffe and Torch models into Spark.
➢ Written using Intel's Math Kernel Library (MKL).
  ○ Optimizes math functions like linear algebra and FFTs.
➢ Advantage:
  ○ No separate communication channel required.

25 Characterization Methodology

26 Data Sets

27 Selected Deep Learning Models and Algorithms

28 Performance Evaluation

29 Experimental Setup

1. OSU RI2 Cluster (Cluster A):
   a. 20 nodes
      i. 2 Intel Broadwell 14-core processors
      ii. 128 GB RAM
      iii. 128 GB local HDD
      iv. NVIDIA Tesla K80 GPU
   b. Connected with a Mellanox single-port InfiniBand EDR (100 Gbps) HCA.
2. SDSC Comet Cluster (Cluster B):
   a. 1,984 nodes (used only 17 nodes)
      i. Dual Intel Haswell 12-core processors
      ii. 128 GB RAM
      iii. 320 GB local HDD
   b. Connected with InfiniBand FDR (56 Gbps).
3. Default: # of nodes x batch size = 128.

30 Evaluation on CPU vs. GPU (Cluster A)

31 Evaluation on IPoIB vs. RDMA (Cluster A)

32 Evaluation on Performance & Accuracy

33 Epoch-level evaluation

VGG model using BigDL on default Spark (CPU cores: 192, batch size: 768)

34 Scalability Evaluation

VGG model on Cluster B using BigDL. The figure shows the cumulative time taken to finish the 18th epoch.

2.7x speedup

35 Evaluation on Resource Utilization

Cifar-10 Quick model on Cifar-10 dataset with CaffeOnSpark (Cluster A)

Utilization results generated from average resource consumption sampled every 60 seconds.

(During training)

36 TensorFlowOnSpark

37 Performance Overhead on TensorFlowOnSpark

● TensorFlowOnSpark does not involve Spark drivers in tensor communication.
● Experimental Setup:
  ○ SoftMax Regression model
  ○ MNIST dataset
  ○ 4-node parameter server
  ○ Batch size - 128
  ○ CPU training
● Roughly 15-19% of execution time is spent in each of YARN and Spark.
● This overhead can be amortized for long-running DL jobs.

38 Impact of RDMA on TensorFlow

❖ Experimental Setup:
  ➢ ResNet50 model
  ➢ TF CNN benchmarks
  ➢ Batch size - 64
  ➢ GPU-based training
  ➢ 2 GPUs/node
  ➢ Parameter server mode (1 PS and rest workers)
  ➢ MPI libraries:
    ➢ MVAPICH2-2.3b
    ➢ Intel-MPI-2018

39 Summary

● Detailed architectural overview of 4 representative DLoBD stacks:
  ○ CaffeOnSpark
  ○ TensorFlowOnSpark
  ○ MMLSpark
  ○ BigDL
● The RDMA scheme benefits DL workloads.
  ○ 2.7x performance speedup with RDMA vs. IPoIB.
● Mostly, GPU-based DL designs outperform CPU-based ones.

40 BigDL: A Distributed Deep Learning Framework for Big Data

41 Introduction

❏ Distributed deep learning framework for Big Data platforms and workflows.
❏ Implemented on top of Apache Spark.
❏ Allows writing DL applications as standard Spark programs.

source: https://www.slideshare.net/SparkSummit/bigdl-a-distributed-deep-learning-library-on-spark-spark-summit-east-talk-by-yiheng-wang

42 Why BigDL?

source: https://www.slideshare.net/SparkSummit/bigdl-a-distributed-deep-learning-library-on-spark-spark-summit-east-talk-by-yiheng-wang

43 Why BigDL?

❏ Rich Deep Learning support
  ❏ Neural network operations
  ❏ Layers
  ❏ Losses
  ❏ Optimizers
❏ Directly run existing models defined in other frameworks (such as TensorFlow, Keras, Caffe and Torch).

44 Programming Model

Transform text to list of words
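The original slide shows the code for this step. Here is a minimal Scala sketch (spark-shell style) of the same idea, assuming input lines of the form `label<TAB>text` stored on HDFS; the path and format are illustrative, not from the slides:

```scala
import org.apache.spark.sql.SparkSession

// A real BigDL job would typically build its SparkConf via Engine.createSparkConf().
val spark = SparkSession.builder().appName("BigDLTextClassifier").getOrCreate()
val sc = spark.sparkContext

// Each input line is assumed to be "<label>\t<document text>".
val raw = sc.textFile("hdfs:///text_data/*.txt")
  .map(_.split("\t", 2))
  .filter(_.length == 2)

// Transform each document into its list of lower-cased words.
val labeledWords = raw.map { case Array(label, text) =>
  (label.toFloat, text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)
}
```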

45 Programming Model

Data transformation
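Continuing the sketch: turn each word list into a fixed-size bag-of-words vector wrapped in a BigDL `Sample`. This is a simplification of the slide's pipeline, and the BigDL class names and constructors are given to the best of my recollection of the 0.x Scala API:

```scala
import com.intel.analytics.bigdl.dataset.Sample
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.tensor.Tensor

// Build a vocabulary on the driver (fine for a small corpus; illustrative only).
val vocab = labeledWords.flatMap(_._2).distinct().collect().zipWithIndex.toMap
val vocabSize = vocab.size

// One Sample per document: feature = word-count vector, label = 1-based class id.
val sampleRdd = labeledWords.map { case (label, words) =>
  val counts = new Array[Float](vocabSize)
  words.foreach(w => vocab.get(w).foreach(i => counts(i) += 1f))
  Sample(Tensor(counts, Array(vocabSize)), Tensor(Array(label), Array(1)))
}
```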

46 Programming Model

Model construction
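A simple multi-layer perceptron over the bag-of-words features, built with BigDL's `Sequential` container. The layer sizes and class count are arbitrary, and this is not necessarily the network shown on the slide:

```scala
import com.intel.analytics.bigdl.nn.{Sequential, Linear, ReLU, LogSoftMax}
import com.intel.analytics.bigdl.numeric.NumericFloat

val nClasses = 20 // e.g., a 20-class corpus; adjust to the data set

val model = Sequential()
  .add(Linear(vocabSize, 128)) // bag-of-words -> 128 hidden units
  .add(ReLU())
  .add(Linear(128, nClasses))
  .add(LogSoftMax())           // log-probabilities, paired with ClassNLLCriterion during training
```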

47 Programming Model

Model Training
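Distributed training is then a standard Spark driver program that hands the model, the Sample RDD, and a loss function to BigDL's `Optimizer`; the optimization method and stopping condition below are arbitrary choices:

```scala
import com.intel.analytics.bigdl.nn.ClassNLLCriterion
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.optim.{Optimizer, SGD, Trigger}
import com.intel.analytics.bigdl.utils.Engine

Engine.init // initialize BigDL's execution environment on the cluster

val optimizer = Optimizer(
  model = model,
  sampleRDD = sampleRdd,
  criterion = ClassNLLCriterion(),
  batchSize = 128 // typically a multiple of (# executors x cores per executor)
)

val trainedModel = optimizer
  .setOptimMethod(new SGD(learningRate = 0.01))
  .setEndWhen(Trigger.maxEpoch(10))
  .optimize() // runs data-parallel, synchronous mini-batch SGD on Spark
```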

48 Programming Model

Model Prediction
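Finally, inference runs as another distributed Spark job over an RDD of Samples (here reusing the training data purely for illustration):

```scala
// Distributed inference: one predicted class id per input Sample.
val predictions = trainedModel.predictClass(sampleRdd)
predictions.take(5).foreach(println)

// trainedModel.predict(sampleRdd) would instead return the raw per-class activations.
```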

49 Execution Model

➔ Single driver and multiple workers.
➔ Driver - scheduling and dispatching tasks to worker nodes.
➔ Worker - actual computation and physical data storage.
➔ How does BigDL support efficient and scalable distributed training on top of Spark?

50 Data-parallel training

● Synchronous mini-batch SGD.
● Distributed training implemented as an iterative process.

source: https://www.slideshare.net/SparkSummit/bigdl-a-distributed-deep-learning-library-on-spark-spark-summit-east-talk-by-yiheng-wang

51 Data-parallel training

52 How is Data parallelism achieved?

❖ BigDL constructs an RDD of models (replicas of the original neural network model).
❖ The model and Sample RDDs are co-partitioned and co-located.

53 Parameter synchronization

54 Parameter Synchronization

● Tasks in the "model forward-backward" job of the next iteration can read the latest value of all the weights before the next training step begins.
● An AllReduce-like operation is implemented using Spark primitives, namely shuffle and broadcast (see the conceptual sketch below).
● Highly scalable distributed training is possible on up to 256 nodes.
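A conceptual sketch, not BigDL's actual code, of how an AllReduce-style gradient aggregation can be expressed with Spark's shuffle (`reduceByKey`) and broadcast primitives; the slicing and averaging details are illustrative:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Each replica's gradient is cut into N slices; slices with the same index are
// summed via a shuffle, and the averaged result is redistributed via broadcast.
object ShuffleBroadcastAllReduce {
  def allReduceMean(sc: SparkContext, grads: RDD[Array[Float]], nSlices: Int): Array[Float] = {
    val nReplicas = grads.count().toInt
    val len = grads.first().length
    val sliceLen = math.ceil(len.toDouble / nSlices).toInt

    // Shuffle step: slice every local gradient and group equal slice indices together.
    val summedSlices = grads
      .flatMap { g =>
        (0 until nSlices).map { i =>
          (i, g.slice(i * sliceLen, math.min((i + 1) * sliceLen, len)))
        }
      }
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })

    // Average, reassemble, and broadcast back so every task can read the new weights.
    val averaged = summedSlices
      .mapValues(_.map(_ / nReplicas))
      .collect()
      .sortBy(_._1)
      .flatMap(_._2)
    sc.broadcast(averaged).value
  }
}
```

In BigDL itself each node owns and updates one slice of the parameters rather than collecting everything on the driver, so the reassembly step here is a deliberate simplification.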

55 Task scheduling

● DL frameworks typically run stateful tasks which interact with each other.
  ○ Why? - To support synchronous mini-batch SGD for model computation and parameter synchronization.
● Whereas BigDL runs a series of short-lived jobs (i.e., 2 jobs per mini-batch).
  ○ Each task is stateless and non-blocking.
  ○ How do we schedule them?
    ■ Launch a single, multi-threaded task on each worker to achieve high scalability.
    ■ Use group scheduling (introduced by Drizzle) to help schedule a group of computations at once.

Figure: Overheads of task scheduling and dispatch (as a fraction of average compute time) for ImageNet Inception v1 training.

56 Model quantization

● Quantization refers to storing numbers and performing calculations in a more compact and lower-precision form (than their original format, such as 32-bit floating point).
● Why needed?
  ○ For inference speed-up in resource-constrained environments.
● How does BigDL do it? (See the sketch below.)
  ○ Quantizes the parameters of selected layers into 8-bit integers (the quantized model).
  ○ During inference:
    ■ Each quantized layer quantizes the input (float32) data into 8-bit integers,
    ■ applies 8-bit calculations, and
    ■ dequantizes the result to 32-bit floating point.
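Below is a minimal, self-contained sketch of symmetric linear 8-bit quantization and dequantization of a float32 vector. It only illustrates the general idea described above; BigDL's actual quantization scheme may differ in its details.

```scala
object Quantize8Bit {
  final case class Quantized(data: Array[Byte], scale: Float)

  def quantize(x: Array[Float]): Quantized = {
    val maxAbs = x.map(math.abs).max max 1e-8f      // avoid division by zero
    val scale = maxAbs / 127f                       // map [-maxAbs, maxAbs] onto [-127, 127]
    Quantized(x.map(v => math.round(v / scale).toByte), scale)
  }

  def dequantize(q: Quantized): Array[Float] =
    q.data.map(_.toFloat * q.scale)

  def main(args: Array[String]): Unit = {
    val weights = Array(0.12f, -0.5f, 0.33f, 0.01f)
    val q = quantize(weights)
    println(q.data.mkString(", "))                  // compact 8-bit representation
    println(dequantize(q).mkString(", "))           // approximately recovers the floats
  }
}
```

Storing parameters as bytes plus a single scale factor is what shrinks the stored layers roughly 4x and enables the faster 8-bit arithmetic mentioned on the slide.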

57 Model quantization results (accuracy, inference speed and model size) for SSD, VGG16 and VGG19

58 Applications

59 Model Inference: image feature extraction

60 Distributed Training: precipitation nowcasting

● Predicting short-term precipitation.
● Trained model used to predict precipitation patterns for the next hour.

61 Conclusion

➢ Implements a parameter server architecture / AllReduce operation for distributed training, which is not natively supported by existing big data systems.
➢ The CaffeOnSpark and TensorFlowOnSpark frameworks use Spark as the orchestration layer to allocate resources from the cluster and then launch the distributed Caffe or TensorFlow job on the allocated machines; however, the Caffe or TensorFlow job still runs outside of the big data framework. In contrast, BigDL provides distributed training on top of the big data framework using Spark primitives.
