2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom)

Comparison of Spark Resource Managers and Distributed File Systems

Solmaz Salehian and Yonghong Yan
Department of Computer Science & Engineering, Oakland University,
Rochester, MI 48309-4401, United States
[email protected], [email protected]

Abstract— The rapid growth in the volume, velocity, and variety of data produced by applications of scientific computing, commercial workloads, and the cloud has led to Big Data. Traditional solutions for data storage, management, and processing cannot meet the demands of this distributed data, so new execution models, data models, and software systems have been developed to address the challenges of storing data in heterogeneous forms, e.g. HDFS and NoSQL databases, and of processing data in a parallel and distributed fashion, e.g. the MapReduce, Hadoop, and Spark frameworks. This work comparatively studies distributed data processing frameworks. Our study first discusses the resource management subsystems of Spark, and then reviews several of the distributed data storage options available to Spark.

Keywords—Big Data, Distributed File System, Resource Management, Apache Spark.

I. INTRODUCTION

Big Data refers to different forms of large data sets which demand special computational platforms in order to discover valuable knowledge. Big Data has revolutionized all aspects of human life and many industries, including medicine, business, education, and science and engineering. Big Data environments are software tools to acquire, organize, and analyze heterogeneous large data arriving at different velocities. There are two ways to import data: batch data, where a dataset is loaded all at once, and streaming data, where data is imported continuously. Big Data can take different structured and unstructured forms, of arbitrary size. These features of Big Data are referred to as the 3Vs (volume, variety, and velocity) and define the properties or dimensions of Big Data. Because this type of data is different from traditional data and is not practical to store on a single machine, special tools and technologies are needed to store, access, and analyze it. These tools need to process this large amount of data in a timely and cost-effective manner.

There are different data processing frameworks proposed for Big Data, including Apache Hadoop [1], Dryad [2], Yahoo! S4 [3], and Apache Spark [4], each of which has its own advantages and disadvantages. In this paper, the discussion concentrates on Spark, which is one of the most recent and popular interactive data processing frameworks. Because Spark is inspired by Hadoop, we first review Google MapReduce and Hadoop. Then, Spark resource managers are discussed in detail. Four data storage options used with Spark are also discussed in this work.

The rest of the paper is organized as follows. In the next section, challenges in Big Data are summarized. In Section 3, Big Data processing frameworks are discussed. Section 4 provides information on the different resource management subsystems of Spark. In Section 5, four Distributed File Systems (DFSs) are reviewed. The conclusion is given in Section 6.

II. CHALLENGES IN BIG DATA

Although Big Data provides users with a lot of opportunities, creating an efficient software framework for Big Data has many challenges in terms of networking, storage, management, analytics, and ethics [5, 6].

Previous studies [6-11] have reviewed the open challenges in Big Data. [6] gives an overview of Big Data initiatives and technologies, and discusses the challenges and potential solutions in industry and academia from different perspectives, such as those of engineers, computer scientists, and statisticians. In [7], the authors provide a survey of the diverse hardware and software platforms available for Big Data analytics, such as Hadoop, MapReduce, High Performance Computing (HPC), Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). The hardware platforms are compared on various metrics, including scalability, data I/O rate, fault tolerance, and real-time processing, and the software frameworks used with each of these hardware platforms are discussed with regard to their strengths and drawbacks.

A comprehensive survey of the technologies, opportunities, and challenges in Big Data is provided by [8]. That paper also classifies the various attributes of Big Data, including definitions, volume management, analysis, rapid growth rate, and security. Indexing techniques for efficient retrieval and management of data are another open challenge in Big Data, which is investigated in [9].

[10] illustrates the importance of Big Data and discusses various areas where Big Data is being used. The background and state of the art of Big Data are reviewed by [11]. Technologies related to Big Data, including cloud computing, IoT, data centers, and Hadoop, are discussed in that paper. Then, four phases of the Big Data value chain, i.e. data generation, data acquisition, data storage, and data analysis, are proposed; for each phase, the general background, the technical challenges, and the latest advances are introduced and reviewed.

Currently, software tools for managing Big Data resources, which have been developed for data acquisition, storage, management, and analysis, are considered the initial challenges in Big Data. In this paper we review Apache Spark, which is an efficient data processing framework supporting real-time and interactive job analysis. We also compare four data storage options available for Big Data.

III. BIG DATA PROCESSING FRAMEWORKS

A. Google MapReduce

MapReduce is a distributed parallel computing framework designed by Google to process large amounts of data, and the Google File System (GFS) is its storage system [12]. Google MapReduce provides an interface for users to process huge amounts of data while also allowing data to be searched by the Google search engine.

MapReduce has three main components: the Master, the Map function, and the Reduce function [12]. The Master is responsible for keeping the status of the Map and Reduce servers, and for feeding data to and launching procedures for the Map and Reduce functions. The Map function processes data and generates intermediate data. The Reduce function combines intermediate data and generates data for the next stage of processing. The assumption made by MapReduce is that all data can be expressed in the form of key and value pairs [12]: the key is the correlation between the data, and the value is the content of the data. The Map and Reduce functions are written by the user.

The input file is divided into M chunks with sizes ranging from 16 MB to 64 MB. The Master finds available servers and assigns them to the Map function. The servers responsible for the Map function retrieve the input data, process it, and then store the Map results in the file system. The Map function takes a key and value pair as input and creates the intermediate data. The intermediate data is then cut into multiple pieces, and the Master receives their locations. Servers responsible for the Reduce function retrieve the data and sort and group the values with the same key. The data with the same key are processed by the Reduce function, and a new set of key and value pairs is generated. The final results become available to users after all processes are completed.

B. Hadoop

The Apache Software Foundation developed Hadoop, which was inspired by Google MapReduce. The Apache Hadoop software library is a parallel data processing framework for distributed data sets on clusters of computers. Apache Hadoop was initially released in 2011 and is written in Java. It provides a MapReduce implementation.

Hadoop also implements a distributed file system (called HDFS) [13], which distributes data across the nodes and provides a global name space for users. In Hadoop, the software itself discovers and handles failures of the application rather than depending on users to intervene.

Hadoop processes large amounts of either structured or unstructured data using the MapReduce model of parallelism. It is broadly used in industrial applications, including spam filtering, network searching, clickstream analysis, and social recommendation [8]. Hadoop has been proven to perform well for processing archival data in a batch model. However, it is slow for interactive data exploration [14] and is not efficient for real-time jobs, because Hadoop makes heavy use of intermediate files in every step of the Map and Reduce functions, and in every step it needs to read and write frequently used data from and to HDFS. This demands many disk accesses and incurs large I/O overheads. The creation of Spark was motivated by these challenges in Hadoop, which it addresses by offering in-memory computation.

C. Spark

The Apache Spark distributed data processing framework was originally developed in 2009 in UC Berkeley's AMPLab and open sourced in 2010 as an Apache project. Spark provides in-memory computation, which leads to a significant increase in data processing performance over Hadoop MapReduce [4], because sharing data in memory is 10x to 100x faster than sharing through the network and disk. Spark has several advantages in comparison with other Big Data technologies. Most MapReduce solutions are efficient for one-pass processing, but they do not work well for multi-pass computations. Spark, with its in-memory data storage feature, is able to handle multi-pass jobs faster than other Big Data technologies, since it can hold a dataset that is needed multiple times in memory. Several studies compare the performance of Hadoop and Spark [14-15] and demonstrate the advantage of Spark over Hadoop.

Spark also introduces a new data model, the Resilient Distributed Dataset (RDD), the main abstraction in Spark for representing a distributed collection of elements. All data in Spark are expressed in the form of RDDs. There are two types of operations on RDDs: transformations and actions. Transformations create a new RDD from an existing RDD, while actions return a final value that is not in the form of an RDD. Spark provides lazy evaluation, which postpones processing until an action operation is reached. Through this feature, Spark reduces the number of passes over the data by grouping operations together. In systems like Hadoop MapReduce, developers often need to spend a lot of time considering how to group operations in order to minimize the number of MapReduce passes [16].

Spark is written in the Scala programming language [17] and runs on the Java Virtual Machine (JVM). It supports the Scala, Python, and Java programming languages and also offers an interactive shell for both Scala and Python.

IV. SPARK RESOURCE MANAGERS

Apache Spark can operate in two modes, local and cluster mode. In local mode, Spark schedules work within a single node; it is often used for testing and experimental purposes. In cluster mode, Spark scales up computation in a distributed environment involving multiple computer systems. A Spark cluster uses a master/worker architecture; Fig. 1 shows the Spark architecture in cluster mode. A Spark application consists of one coordinator (the driver) and multiple executors distributed on the Spark cluster nodes. The driver is the program's main function and is responsible for converting the user program into tasks and assigning those tasks to the executors. The executors run individual tasks, and they are also responsible for providing in-memory storage (cache) for RDDs.
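To make the RDD model concrete, the following minimal Scala driver program sketches a word count. It is an illustration only: the input path and the local master URL are placeholder values, not part of the original paper. The flatMap, map, and reduceByKey calls are lazy transformations that merely record lineage; no computation happens until the collect action is invoked.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Placeholder master URL: run locally with all available cores.
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("input.txt")  // hypothetical input path
        val counts = lines
          .flatMap(line => line.split(" "))   // transformation (lazy)
          .map(word => (word, 1))             // transformation (lazy)
          .reduceByKey(_ + _)                 // transformation (lazy)

        // collect() is an action: only here does Spark schedule and run the job.
        counts.collect().foreach(println)
        sc.stop()
      }
    }

Because the three transformations are grouped into a single job when the action fires, Spark can pipeline them in one pass over the data, which is exactly the grouping behavior described above.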


Fig. 1. Major Components of the Spark Data Processing Framework

The Spark cluster manager is the component that coordinates the drivers and executors of all applications launched in a Spark environment. It makes resource mapping and data movement decisions to achieve load balancing and fault tolerance. The cluster manager launches the executors and, in certain cases, the driver [16]. The cluster manager can be either the built-in standalone one or an external resource manager, for example Yet Another Resource Negotiator (YARN) [1] or Mesos [18].

A. Spark Standalone

Spark standalone is the simplest approach to setting up a Spark environment, because its deployment complexity is lower than that of the other approaches. In standalone mode, Spark is installed on each node. The scheduler by default runs jobs in a First In, First Out (FIFO) manner. Each job is divided into smaller tasks, and tasks are scheduled to run in the order in which they are created. The scheduler assigns tasks to available resources to achieve the maximum degree of parallelism.

Spark standalone provides a web interface from which users can monitor the cluster. Each worker and each application driver has its own web interface that lists cluster and job statistics, including RDD and task status, as well as the data distribution across executors.

B. Spark on YARN

YARN is an external cluster manager that Spark can use to schedule Spark jobs and allocate resources. YARN was introduced by Hadoop 2.0. In Hadoop 1.0, the MapReduce component handles both resource management and data processing. Hadoop 2.0 separates resource management from MapReduce, splitting it into two components, MapReduce and YARN: MapReduce manages data processing, and YARN handles resource management.

In the YARN resource manager, YARN receives job requests, evaluates whether resources are available, and then decides to allocate resources to the jobs. Thus, it works in a monolithic way (a single-level scheduler). YARN is typically installed on the same nodes as HDFS, so running Spark on YARN allows Spark to access HDFS data quickly [16]. Spark on YARN also provides users with functionality for categorizing, isolating, and prioritizing workloads [13], which does not exist in Spark standalone.

Spark supports two modes for running on YARN, cluster mode and client mode [19]. In YARN, each application has an Application Master that is responsible for requesting resources from the resource manager. In cluster mode, the application driver and the Application Master run in the same process, while in client mode they run in different processes.

C. Spark on Mesos

Mesos [18] is a general-purpose cluster manager that can be used by Spark. Mesos offers two ways to share resources on a cluster. In fine-grained mode, executors are able to scale up and down the number of CPUs that they claim from Mesos. In coarse-grained mode, the CPUs allocated to executors are not released until the end of the application [21].

When a request comes to Mesos, Mesos determines the availability of resources and makes an offer to the application scheduler (e.g. the Hadoop scheduler or the Spark scheduler). The application scheduler can accept or reject offers. Thus, Mesos resource allocation involves a process of negotiation and match-making between resources and jobs. It is a two-level scheduler that allows for better scalability than the other approaches.

D. Comparison

There are several criteria users should consider when selecting a resource manager for Spark, including the application objective, resource requirements, deployment complexity, variety of modes, availability, scheduling, and security level. In this section, the different Spark resource managers are compared with respect to these features. Table I provides a qualitative comparison of the three resource managers used with Spark; the X ratings used in this table should not be read as quantitative scores.


TABLE I. COMPARISON OF RESOURCE MANAGEMENT SUBSYSTEMS OF SPARK

Cluster manager  | Level of scheduler  | Support for Spark shell | Deployment complexity | Variety of modes             | Scheduling option   | Availability | Security
-----------------|---------------------|-------------------------|-----------------------|------------------------------|---------------------|--------------|---------
Spark standalone | One-level scheduler | Yes                     | X                     | -                            | Static sharing      | XX           | X
YARN             | One-level scheduler | Only in client mode     | XX                    | Client, cluster              | Dynamic sharing     | XX           | XX
Mesos            | Two-level scheduler | Yes                     | XX                    | Fine-grained, coarse-grained | CPU dynamic sharing | XX           | X

• Level of scheduler: Among the three resource managers, only Mesos has a two-level scheduling hierarchy, which first checks the availability of resources and then makes an offer to the application scheduler, which may accept or reject the offer. For Spark standalone and Spark on YARN there is no negotiation: if the resources are available, the scheduler decides on the allocation.

• Support for the Spark shell: The Spark shell is the tool that allows users to load configurations dynamically [4]. Both Spark standalone and Spark on Mesos support the Spark shell. Spark on YARN supports it only in client mode, so YARN in cluster mode is not suitable for interactive applications that demand user input.

• Deployment complexity: Spark standalone is the simplest approach to setting up a Spark environment, because its deployment complexity is lower than that of the other two approaches.

• Variety of modes: Spark on YARN supports two modes, cluster and client mode. In cluster mode, driving the application and requesting resources are done by the same process, while in client mode different processes are responsible for these duties. Thus, client mode is more suitable for short and interactive jobs [19], and cluster mode is not well suited to interactive jobs or debugging. Mesos also offers two ways to share resources on a cluster: fine-grained and coarse-grained mode. When multiple users share a cluster for interactive jobs, fine-grained mode works well because it scales down the free CPU cores; however, this can lead to higher latency [16].

• Scheduling: There are several mechanisms for resource allocation between applications, depending on the cluster manager being used with Spark. The simplest is static partitioning, which is available with all cluster managers. In this approach, each application receives a maximum amount of resources that it holds for the whole duration of the application. The second approach is dynamic sharing of resources. Mesos is able to share CPUs dynamically: CPUs can be dynamically shared between applications, although each application still has a fixed amount of memory. When one application is not running tasks on a machine, other applications may use those cores to run their tasks. Spark on YARN allows dynamic sharing of both CPU and memory [4]. Dynamically scaling resources up and down based on the workload is not available in the Spark standalone scheduler.

• High availability: High availability for distributed data processing reduces system downtime and allows quick recovery from failures. Single points of failure in all the Spark cluster managers, including standalone, YARN, and Mesos, can be addressed by utilizing ZooKeeper [4]. ZooKeeper is a third-party component that provides monitoring and fault recovery when the master fails. It is capable of selecting a new master, as well as providing the storage for maintaining the consistent state of the master, without interrupting execution. If the current master dies, ZooKeeper replaces it with a new master and recovers the old master's state.

• Security: Security in Spark is currently supported via a shared secret key; the authentication mechanism makes sure both sides have the same secret key [4]. Spark deployed with YARN enables automatic shared secret key configuration, while with the other two resource managers the authentication parameter must be configured separately on each node, including the master and the workers. This secret key is used by the master, the workers, and the applications, but with YARN each application uses a unique shared secret key.
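As a rough illustration of how these differences surface to an application, the sketch below shows the master URLs that select each cluster manager, together with a few of the configuration properties discussed above (coarse-grained Mesos mode, dynamic allocation, and shared-secret authentication). Host names and ports are placeholders, the yarn-cluster/yarn-client master values follow the Spark 1.x convention, and the same settings can equally be supplied on the spark-submit command line.

    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("ResourceManagerDemo")

    // Selecting a cluster manager through the master URL (hosts/ports are placeholders):
    conf.setMaster("spark://master-host:7077")     // Spark standalone
    // conf.setMaster("yarn-cluster")              // YARN, cluster mode
    // conf.setMaster("yarn-client")               // YARN, client mode
    // conf.setMaster("mesos://master-host:5050")  // Mesos (fine-grained by default)

    // Mesos coarse-grained mode: allocated CPUs are held until the application ends.
    conf.set("spark.mesos.coarse", "true")

    // Dynamic sharing of resources (supported on YARN, not by the standalone scheduler):
    conf.set("spark.dynamicAllocation.enabled", "true")
    conf.set("spark.shuffle.service.enabled", "true")

    // Shared-secret authentication. With standalone and Mesos the secret must be set
    // on every node; with YARN it is generated and distributed automatically.
    conf.set("spark.authenticate", "true")
    conf.set("spark.authenticate.secret", "change-me") // not needed on YARN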

V. DISTRIBUTED FILE SYSTEM

For processing batch data with Spark, the data must be accessible on all worker nodes. The simplest way is to copy the data onto all nodes of the Spark cluster; however, this approach is neither efficient nor cost-optimal for a large data set. Using Spark for large data therefore demands an efficient data storage component. In this section, we briefly discuss several mechanisms for distributed file storage. A Distributed File System (DFS) is an approach for storing large amounts of data across cluster nodes. With a DFS, data can be accessed transparently over the network using a single name space.

The Network File System (NFS) was developed by Sun Microsystems in 1984. It is one alternative for making sure all nodes can access the same files over the network. An NFS setup allows easy sharing with a centralized file server, and security can be achieved by keeping the centralized system secure. However, capacity is an issue, since it is limited by the centralized disk space, and a single point of failure compromises the availability of the data storage.

HDFS, the Amazon Simple Storage Service (S3) [21], and Tachyon [22] are other DFSs, which are able to store massive amounts of data and give the cluster nodes concurrent access to the data.

TABLE II. COMPARISON BETWEEN DISTRIBUTED FILE SYSTEMS

DFSs    | Distributed administration | Fault tolerance | Availability | Performance | Scalability | User cost for account | User cost for hardware
--------|----------------------------|-----------------|--------------|-------------|-------------|-----------------------|-----------------------
NFS     | No                         | No              | X            | X           | X           | No                    | Yes
HDFS    | Yes                        | Yes             | XX           | XXX         | XX          | No                    | Yes
S3      | Yes                        | Yes             | XXX          | XX          | XXX         | Yes                   | No
Tachyon | Yes                        | Yes             | XX           | XXXX        | XX          | No                    | Yes

HDFS is a component of Hadoop, and it has two main modules: the NameNode and the DataNodes. The NameNode manages the file system metadata, while the actual data is stored in the DataNodes. The DataNodes are distributed across the cluster nodes, thereby addressing the single point of failure. Since HDFS files are stored on multiple nodes with redundancy (e.g. 3 copies in the default configuration), HDFS provides better availability and faster data access than NFS. HDFS security is based on the POSIX model of users and groups, and is currently limited to simple file permissions [24].

Amazon S3 is a solution for storing files for Internet applications. Amazon S3 provides object-oriented storage and stores data as objects within resources called buckets [21]. Users are able to store as many objects as they want within a bucket, and to write, read, and delete objects in the bucket [21]. S3 users are provided with a web service interface to store and retrieve data from anywhere on the web.

Tachyon is a memory-centric distributed storage system that enables reliable data sharing across a cluster at memory speed [7, 25]. Tachyon supports tiered storage and is able to manage other storage media such as Solid State Drives (SSDs) and Hard Disk Drives (HDDs). Each tier has a different speed and priority. Memory is the top-level storage, which caches new data; if data does not fit into memory, additional storage such as SSD or HDD is used to keep it. Tachyon detects frequently used files and caches them in memory, which improves access times for data that resides on disk.

Tachyon has three main components: the master, the workers, and the clients [24]. The main responsibility of the master is to manage the global metadata of the system, while the workers manage allocated local resources such as SSDs and HDDs. Clients are applications such as Spark or Hadoop jobs, or Tachyon command-line users. A Tachyon client acts as a gateway that enables users to interact with the Tachyon master and workers. Data is stored as blocks in the Tachyon workers, and the Tachyon master is responsible for keeping the mappings from files to blocks. With its tiered storage feature, Tachyon manages blocks automatically across all configured tiers, so users do not need to manage the location of data manually. Tachyon is based on Hadoop, and any Hadoop or Spark program can run on top of Tachyon without changes. Tachyon resides between the storage subsystems, such as Apache HDFS and Amazon S3, and the data processing frameworks, such as Spark and Hadoop.

A. Comparison

The four distributed file systems are compared with respect to the following metrics: fault tolerance, availability, scalability, performance, and user costs. Table II provides a summary of the comparisons; the X symbols in the table are used only to show a qualitative comparison of these DFSs.

• Distributed administration and fault tolerance: A single point of failure describes a situation in which any failure of the central point makes the entire system unavailable. Among these DFSs, only NFS has centralized administration, namely the NFS file server. This not only leads to a single point of failure but also makes the limited disk space of the centralized file storage a challenge. The other three solutions, HDFS, S3, and Tachyon, address this constraint by storing data over distributed storage and creating redundant copies of the data, which helps the system respond gracefully in the event of failure.

• Availability: NFS has the lowest level of availability in comparison with the other file storages, because it suffers from a single point of failure. Amazon S3, on the other hand, provides high availability, close to 99.99% [24], by redundantly storing data across multiple facilities and devices over the Internet.

• Performance: Tachyon detects frequently used files and caches them in memory. By using memory more aggressively, Tachyon can achieve better performance than HDFS [7]. The other data storages, such as HDFS and S3, use disks and network bandwidth, which are slower than memory [25]. HDFS shows better performance than S3, because HDFS stores data locally rather than reading from and writing to storage over the Internet.

• Scalability: Scalability in terms of storage is achieved by adding more nodes to the cluster or by adding more hard drives to existing nodes. For Amazon S3 users, scalability can be achieved with no effort, and costs grow and shrink with users' demand [21]. HDFS relies on local storage, and scaling it is more complex than scaling Amazon S3; thus, Amazon S3 provides better scalability. NFS, on the other hand, does not scale well, because its disk space is limited by the centralized data storage.

• User costs for account and hardware: Amazon S3 users pay for data storage monthly. Transferring data in S3 is also costly, except when it happens between Amazon Elastic Compute Cloud (EC2) and Amazon S3 within the same region [16]. For the other file storages, on the other hand, users must pay for hardware and facilities.
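From the application's point of view, switching among these storage options is largely a matter of the URI scheme passed to Spark. The sketch below reuses the SparkContext sc from the earlier word-count sketch and assumes era-appropriate deployments (an HDFS NameNode on port 9000, a Tachyon master on its default port 19998, and S3 credentials already present in the Hadoop configuration); every host, bucket, and path name is a placeholder.

    // All host, bucket, and path names below are hypothetical.
    val fromNfs     = sc.textFile("file:///mnt/nfs-share/input.txt")     // NFS mount visible on every node
    val fromHdfs    = sc.textFile("hdfs://namenode:9000/data/input.txt") // HDFS
    val fromS3      = sc.textFile("s3n://my-bucket/data/input.txt")      // Amazon S3 (s3n scheme of that era)
    val fromTachyon = sc.textFile("tachyon://tachyon-master:19998/data/input.txt") // Tachyon

    // Writing back works the same way, e.g. caching results in Tachyon:
    fromHdfs.saveAsTextFile("tachyon://tachyon-master:19998/data/output")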


VI. CONCLUSION

With the fast growth of data volume in both industry and research, traditional data management solutions are inadequate to capture, access, and analyze this huge amount of data. Thus, the management and storage of massive data are considered among the most important challenges in Big Data. This paper provided a comparative survey of parallel data processing frameworks and distributed data storages for Big Data. We specifically compared and evaluated the three resource managers used with Apache Spark.

The choice of a resource manager in Spark depends on both the application and the resource setup. For Spark in cluster mode, although YARN has much more granular control over the resources and provides users with security and job isolation, its deployment is more complex than Spark standalone and may introduce more overhead, which consequently affects performance. Thus, Spark on YARN is preferable when Spark needs to coexist with other scheduling frameworks. Mesos, an independent resource manager, is more suitable for interactive jobs when multiple users share a cluster, because the fine-grained control in Mesos allows the free CPU cores to be scaled up and down efficiently.

To help in selecting distributed file systems for storing large amounts of data over clusters, we also reviewed and compared several DFSs. Generally, decentralized architectures scale better and are able to address the single point of failure. Among the three decentralized file storages, Tachyon achieves better performance by using memory more aggressively.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grants No. CNS-1205708 and SHF-1409946.

REFERENCES

[1] https://hadoop.apache.org/ (accessed December 1, 2015).
[2] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks." ACM SIGOPS Operating Systems Review, vol. 41, pp. 59-72, 2007.
[3] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. "S4: Distributed Stream Computing Platform." In Proceedings of the 10th IEEE International Conference on Data Mining Workshops (ICDMW '10), 2010.
[4] http://spark.apache.org/ (accessed December 1, 2015).
[5] James Ahrens, Bruce Hendrickson, Gabrielle Long, Steve Miller, Rob Ross, and Dean Williams. "Data-intensive science in the US DOE: case studies and future challenges." Computing in Science & Engineering 13, no. 6 (2011): 14-24.
[6] Hua Fang, Zhaoyang Zhang, Chanpaul Jin Wang, Mahmoud Daneshmand, Chonggang Wang, and Honggang Wang. "A survey of big data research." IEEE Network 29, no. 5 (2015): 6.
[7] Dilpreet Singh and Chandan K. Reddy. "A survey on platforms for big data analytics." Journal of Big Data 2, no. 1 (2015): 1-20.
[8] Nawsher Khan, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Zakira Inayat, Waleed Kamaleldin Mahmoud Ali, Muhammad Alam, Muhammad Shiraz, and Abdullah Gani. "Big Data: survey, technologies, opportunities, and challenges." The Scientific World Journal 2014 (2014).
[9] Abdullah Gani, Aisha Siddiqa, Shahaboddin Shamshirband, and Fariza Hanum. "A survey on indexing techniques for big data: taxonomy and performance evaluation." Knowledge and Information Systems (2015): 1-44.
[10] Er Ramandeep Kaur. "Big Data-Is a Turnkey Solution." Procedia Computer Science 62 (2015): 326-331.
[11] Min Chen, Shiwen Mao, and Yunhao Liu. "Big Data: A survey." Mobile Networks and Applications 19, no. 2 (2014): 171-209.
[12] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google file system." In ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29-43. ACM, 2003.
[13] https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html (accessed December 15, 2015).
[14] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, M. Franklin, Scott Shenker, and Ion Stoica. "Fast and interactive analytics over Hadoop data with Spark." USENIX ;login: 37, no. 4 (2012): 45-51.
[15] Lei Gu and Huan Li. "Memory or time: Performance evaluation for iterative operation on Hadoop and Spark." In 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC/EUC), pp. 721-727. IEEE, 2013.
[16] Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.
[17] http://www.scala-lang.org/ (accessed December 1, 2015).
[18] http://mesos.apache.org/ (accessed December 15, 2015).
[19] Sandy Ryza. "Apache Spark Resource Management and YARN App Models." Cloudera. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ (accessed December 1, 2015).
[20] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, and Ion Stoica. "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center." In NSDI, vol. 11, pp. 22-22. 2011.
[21] https://aws.amazon.com/s3/ (accessed December 15, 2015).
[22] http://tachyon-project.org/ (accessed December 15, 2015).
[23] R. Vijayakumari, R. Kirankumar, and K. Gangadhara Rao. "Comparative analysis of Google File System and Hadoop Distributed File System." ICETS-International Journal of Advanced Trends in Computer Science and Engineering 3, no. 1 (2014): 553-558.
[24] Kai Hwang, Geoffrey C. Fox, and Jack Dongarra. Distributed and Cloud Computing: From Parallel Processing to the Internet of Things. Morgan Kaufmann, 2013.
[25] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. "Tachyon: Reliable, memory speed storage for cluster computing frameworks." In Proceedings of the ACM Symposium on Cloud Computing, pp. 1-15. ACM, 2014.
