Comparison of Spark Resource Managers and Distributed File Systems

2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) Comparison of Spark Resource Managers and Distributed File Systems Solmaz Salehian Yonghong Yan Department of Computer Science & Engineering, Department of Computer Science & Engineering, Oakland University, Oakland University, Rochester, MI 48309-4401, United States Rochester, MI 48309-4401, United States [email protected] [email protected] Abstract— The rapid growth in volume, velocity, and variety Section 4 provides information of different resource of data produced by applications of scientific computing, management subsystems of Spark. In Section 5, Four commercial workloads and cloud has led to Big Data. Traditional Distributed File Systems (DFSs) are reviewed. Then, solutions of data storage, management and processing cannot conclusion is included in Section 6. meet demands of this distributed data, so new execution models, data models and software systems have been developed to address II. CHALLENGES IN BIG DATA the challenges of storing data in heterogeneous form, e.g. HDFS, NoSQL database, and for processing data in parallel and Although Big Data provides users with lot of opportunities, distributed fashion, e.g. MapReduce, Hadoop and Spark creating an efficient software framework for Big Data has frameworks. This work comparatively studies Apache Spark many challenges in terms of networking, storage, management, distributed data processing framework. Our study first discusses analytics, and ethics [5, 6]. the resource management subsystems of Spark, and then reviews Previous studies [6-11] have been carried out to review open several of the distributed data storage options available to Spark. Keywords—Big Data, Distributed File System, Resource challenges in Big Data. [6] proposes an overview of Big Data Management, Apache Spark. initiatives and technologies. It also discusses the challenges and potential solutions in industries and academics from I. INTRODUCTION different perspectives such as engineers, computer scientists and statisticians. In [7], the authors provide a survey of diverse ig Data refers to different forms of large data sets which B demand special computational platforms in order to hardware and software platforms available for Big Data discover valuable knowledge. Big Data has revolutionized analytics such as Hadoop, MapReduce, High Performance all aspects of human’s lives and many industries including Computing (HPC), Graphics Processing Unit (GPU) and Field medicine, business, education, and science and engineering. Programmable Gate Arrays (FPGA). The hardware platforms Big Data environment are software tools to acquire, are then compared based on various metrics including organize and analyze heterogeneous large data with different scalability, data I/O rate, fault tolerance, real-time processing. velocities. There are two ways to import data: batch data, The software frameworks used with each of these hardware where a dataset is loading all at once, and streaming data where platforms are discussed with regards to their strengths and stream data is continuously imported. Big Data can be in drawbacks. different forms of structured and unstructured data, of arbitrary A comprehensive survey on different technologies, size. These features of Big Data refer to 3Vs (volume, variety opportunities and challenges in Big Data has been provided by and velocity) and define properties or dimensions of Big Data. [8]. This paper also classified the various attributes of Big Data Because this type of data is different from the traditional data including definitions, volume management, analysis, rapid and is not practical to be stored in a single machine, so special growth rate and security. tools and technologies are demanded to store, access, and Indexing technique for efficient retrieval and management of analyze them. These tools need to process this large amount of data is considered as another open challenge in Big Data, data in a timely and cost-effective manner. which has been investigated in [9]. There are different data processing frameworks proposed [10] illustrates the importance of Big Data and discusses for Big Data including Apache Hadoop [1], Dyrad [2], various areas where Big Data is being used. The background Yahoo!S4 [3], Apache Spark [4], etc. Each of which has its and state-of-the-art of Big Data have been reviewed by [11]. own advantages and disadvantages. In this paper, discussion concentrates on Spark, which is one of the most recent and Related technologies to Big Data including cloud computing, popular interactive data processing frameworks. Because IoT, data centers, and Hadoop have been discussed in the Spark is inspired by Hadoop, we first review Google paper. Then, four phases of the value chain of Big Data, e.g. MapReduce and Hadoop. Then, Spark resource managers are data generation, data acquisition, data storage, and data discussed in details. Four data storage options used in Spark analysis have been proposed. For each phase, the general are also discussed in this work. background, the technical challenges, and the latest advances The rest of the paper is organized as follow. In next section, have been introduced and reviewed. challenges in Big Data are summarized. In Section 3, Big Data Currently, software tools for managing Big Data resources processing frameworks are discussed. which have been developed for data acquisition, storage, 978-1-5090-3936-4/16 $31.00 © 2016 IEEE 567 DOI 10.1109/BDCloud-SocialCom-SustainCom.2016.88 management and analysis, are considered as the initial is slow for interactive data exploration [14], and is not efficient challenges in Big Data. In this paper we review Apache Spark, for real time jobs. Because, Hadoop makes heavy use of which is an efficient data processing framework supporting intermediate files in every step of Map and Reduce functions, real time and interactive job analysis. We also compare the and in every step it needs to read and write frequent used data four data storage options available for Big Data. from and to HDFS. This demands a lot of disk accesses and incurs large I/O overheads. The creation of Spark was III. BIG DATA PROCESSING FRAMEWORKS motivated by this challenges in Hadoop by offering in-memory computation feature. A. Google MapReduce MapReduce is a distributed parallel computing framework C. Spark designed by Google to process large amounts of data, and Apache Spark distributed data processing framework was Google File System (GFS) is its storage system [12]. Google originally developed in 2009 in UC Berkeley’s AMPLab, and MapReduce provides an interface for users to process a huge open sourced in 2010 as an Apache project. Spark provides in- amount of data while also allowing searching for data by memory computation, which leads to significant increase of Google search engine. data processing performance over Hadoop MapReduce [4], MapReduce has three main components: Master, Map because sharing data in memory is 10x to 100x times faster function and Reduce function [12]. The Master is responsible than sharing through network and disk. Spark in comparison for keeping the status of Map and Reduce servers and for with other Big Data technologies has several advantages. Most feeding data to and launching procedures for the Map and MapReduce solutions are efficient for one-pass process but Reduce functions. The Map function processes data and they do not work well for multi-pass computations. Spark with generates intermediate data. The Reduce function combines the in-memory data storage feature is able to handle multi process intermediate data and generates data for the next stage of job faster than other Big Data technologies, since it is able to processing. The assumption made by MapReduce is that all hold dataset in memory that is needed multiple times. Some data can be expressed in a form of key and value pairs [12]. studies compare the performance of Hadoop and Spark [14- The key is the correlation between the data, and the value is the 15], and demonstrate the advantage of Spark over Hadoop. content of data. Map and reduce functions are written by user. Spark also introduces a new data model, Resilient The input file is divided into M chunks with size ranges from Distributer Dataset (RDD), the main abstraction in Spark for 16 MB to 64 MB. Master finds the available server to do representing a distributed collection of elements. All data in computation and assigns servers for the Map function. Those Spark are expressed in form of RDD. There are two types of servers that are responsible for the Map function, retrieve the operation on RDD, transformation and action operations. input data and process it, and then store the Map results in the Transformation operations create a new RDD from existing file system. Map function takes a key and value pair as input RDD, while the action operations return final value and the and creates the intermediate data. The intermediate data is then result is not in form of RDD. Spark provides lazy evaluation, cut into multiple pieces and the Master receives their locations. which postpones the processing until it reaches to the action Servers that are responsible for the Reduce function retrieve operations. By this feature, Spark reduces the number of passes the data and sort and group the value with the same key. The of data by grouping operations together. In systems like data with the same key are processed by the Reduce function Hadoop MapReduce, developers often need to spend a lot of and a new set of key and value is generated. The final results time considering how to group operations in order to minimize will be available to users after processes are completed. the number of MapReduce passes [16]. Spark is written in Scala programming language [17] and it B. Hadoop runs on Java Virtual Machine (JVM) environment. It supports Apache Software Foundation developed Hadoop which was Scala, Python and Java programming languages and also offers inspired by Google MapReduce.

Comparison of Spark Resource Managers and Distributed File Systems

Is 'Distributed' Worth It? Benchmarking Apache Spark with Mesos

Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks

Alluxio: a Virtual Distributed File System by Haoyuan Li A

Crystal: a Unified Cache Storage System for Analytical Databases

Scalable Collaborative Caching and Storage Platform for Data Analytics

An Architecture for Fast and General Data Processing on Large Clusters

Thesis Enabling Autoscaling for In-Memory Storage In

Alluxio: a Virtual Distributed File System

Thesis, We Describe Two Main Areas That Lack Support for Long-Running Applications in Traditional System Designs

Performance Tuning and Evaluation of Iterative Algorithms in Spark

A Machine Learning Data Processing Framework

Scaling Data Mining in Massively Parallel Dataflow Systems