
2011 IEEE International Conference on Cluster Computing

DARE: Adaptive Data Replication for Efficient Cluster Scheduling

Cristina L. Abad∗‡, Yi Lu†, Roy H. Campbell∗
∗Department of Computer Science, †Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Email: cabad, yilu4, [email protected]

‡Also affiliated with Facultad de Ingeniería en Electricidad y Computación (FIEC), Escuela Superior Politécnica del Litoral (ESPOL), Campus Gustavo Galindo, Km 30.5 Vía Perimetral, Guayaquil, Ecuador.

Abstract—Placing data as close as possible to computation is a common practice of data-intensive systems, commonly referred to as the data locality problem. By analyzing existing production systems, we confirm the benefit of data locality and find that data have different popularity and varying correlation of accesses. We propose DARE, a distributed adaptive data replication algorithm that aids the scheduler to achieve better data locality. DARE solves two problems, how many replicas to allocate for each file and where to place them, using probabilistic sampling and a competitive aging algorithm independently at each node. It takes advantage of existing remote data accesses in the system and incurs no extra network usage. Using two mixed workload traces from Facebook, we show that DARE improves data locality by more than 7 times with the FIFO scheduler in Hadoop and achieves more than 85% data locality for the FAIR scheduler with delay scheduling. Turnaround time and job slowdown are reduced by 19% and 25%, respectively.

Keywords-MapReduce, replication, scheduling, locality.

I. INTRODUCTION

Cluster computing systems, such as MapReduce [1], Hadoop [2] and Dryad [3], together with fault-tolerant distributed storage [4, 5], have become a popular framework for data-intensive applications. Large clusters consisting of tens of thousands of machines [6] have been built for web indexing and searching; small and mid-size clusters have also been built for business analytics and corporate data warehousing [7–9]. In clusters of all sizes, throughput and job completion time are important metrics for computation efficiency, which determine the cost of data centers and user satisfaction [6, 10].

Placing data as close as possible to computation is a common practice of data-intensive systems, referred to as the data locality problem. Current cluster computing systems use uniform data replication to (a) ensure data availability and fault tolerance in the event of failures [11–16], (b) improve data locality by placing a job at the same node as its data [1, 3, 17], and (c) achieve load balancing by distributing work across the replicas.

Data locality is an important problem as it significantly affects system throughput and job completion times [6, 10]. The goal of this paper is to improve data locality in cluster computing systems using adaptive replication with low system overhead. It was suggested in a recent study [18] that the benefit of data locality might disappear as bandwidth in data centers increases. However, we found that the difference between disk and network bandwidths remains significant in practice, as illustrated in Section II, and the use of virtualized clusters and concerns about the energy consumption of network usage make disk reads preferable. Furthermore, the amount of data being processed in data centers, both in business and science, keeps growing at an enormous pace [19].

Uniform data replication and placement are used in current implementations of MapReduce systems (e.g., Hadoop). Applications rely on the scheduler to optimize for data locality. However, we found that there is a significant difference in data popularity and considerable correlation among accesses to different files in a cluster log from Yahoo!. We elaborate on this observation in Section III. A similar observation of the skew in data popularity in a large Dryad production cluster supporting Microsoft's Bing was presented in [6].

There are two ways in which a well-designed data replication and placement strategy can improve data locality:

1. Popular data are assigned a larger number of replicas to improve data locality of concurrent accesses; we call this the replica allocation problem.
2. Different data blocks accessed concurrently are placed on different nodes to reduce contention on a particular node; we call this the replica placement problem.

Scarlett [6] addresses the replica allocation problem at fixed epochs using a centralized algorithm. However, the choice of epochs depends on the workload, and can vary from cluster to cluster and across time periods. While workload characteristics may remain similar in a cluster supporting a single application, they can vary significantly in environments with multiple applications. A dynamic algorithm that adapts to changes in workload is hence preferable.

A. Our Approach

We propose DARE, a distributed data replication and placement algorithm that adapts to the change in workload.

We assume a scheduler oblivious to the data replication policy, such as the first-in, first-out (FIFO) scheduler or the Fair scheduler in Hadoop systems, so our algorithm will be compatible with existing schedulers. We implement and evaluate DARE using the Hadoop framework, Apache's open source implementation of MapReduce. We expand on the details of MapReduce and Hadoop clusters in Section II.

In the current implementation, when local data are not available, a node retrieves data from a remote node in order to process the assigned task, and discards the data once the task is completed. DARE takes advantage of existing remote data retrievals and selects a subset of the data to be inserted into the file system, hence creating a replica without consuming extra network and computation resources.

Each node runs the algorithm independently to create replicas of data that are likely to be heavily accessed in a short period of time. We observe in the Yahoo! log that the popularity of files follows a heavy-tailed distribution. This makes it possible to predict file popularity from the number of accesses that have already occurred: for a heavy-tailed distribution of popularity, the more a file has been accessed, the more future accesses it is likely to receive.

From the point of view of an individual data node, the algorithm comes down to quickly identifying the most popular set of data and creating replicas for this set. Popularity not only means that a piece of data receives a large number of accesses, but also a high intensity of accesses. We observe that this is the same as the problem of heavy hitter detection in network monitoring: in order to detect flows occupying the largest bandwidth, we need to identify flows that are both fast and large. In addition, the popularity of data is relative: we want to create replicas for files that are more popular than others. Hence algorithms based on a hard threshold on the number of accesses do not work well.

We design a probabilistic dynamic replication algorithm with the following features:

1. Each node samples assigned tasks and uses the ElephantTrap [20] structure to replicate popular files in a distributed manner. Experiments on dedicated Hadoop clusters and virtualized EC2 clusters both show more than 7-times improvement of data locality for the FIFO scheduler, and 70% improvement for the Fair scheduler. DARE with the Fair scheduler (which increases locality by introducing a small delay when a job scheduled to run cannot execute a local task, allowing other jobs to launch tasks instead) can lead to locality levels close to 100% for some workloads.
2. Data with correlated accesses are distributed over different nodes as new replicas are created and old replicas expire. This also helps data locality and reduces job turnaround time by 16% in dedicated clusters and 19% in virtualized public clouds. Job slowdown is reduced by 20% and 25%, respectively.
3. The algorithm incurs no extra network usage by taking advantage of existing remote data retrievals. Thrashing is minimized using sampling and a competitive aging algorithm, which produces comparable data locality to a greedy least recently used (LRU) algorithm, but with only 50% of the disk writes of the latter.

The contribution of the paper is two-fold. First, we analyze existing production systems to obtain effective bandwidth, data popularity distributions, and uncover characteristics of access patterns. Second, we propose a distributed dynamic data replication algorithm, which significantly improves data locality and task completion times.

The rest of this paper is organized as follows. We present our motivation in Section II, including a detailed discussion of the effect of data locality in virtualized clusters on public clouds. Section III presents the results of an analysis of data access patterns in a large MapReduce production cluster. Section IV describes the design of our proposed replication scheme. In Section V we present and discuss the evaluation results. Section VI discusses the related work and we summarize the contributions in Section VII.

II. BACKGROUND AND MOTIVATION

A. MapReduce clusters

MapReduce clusters [1, 2] offer a platform suitable for data-intensive applications. MapReduce was originally proposed by Google, and its most widely deployed implementation, Hadoop, is used by many companies including Facebook, Yahoo! and Twitter [9].

MapReduce uses a divide-and-conquer approach in which input data are divided into fixed size units processed independently and in parallel by map tasks, which are executed distributedly across the nodes in the cluster. After the map tasks are executed, their output is shuffled, sorted and then processed in parallel by one or more reduce tasks.

To avoid the network bottlenecks due to moving data into and out of the compute nodes, a distributed file system typically co-exists with the compute nodes (GFS [21] for Google's MapReduce and HDFS [14] for Hadoop).

MapReduce clusters have a master-slave design for the compute and storage systems. The master file system node handles the metadata operations, while the slaves handle the reads/writes initiated by clients. Files are divided into fixed-sized blocks, each stored at a different data node. Files are read-only, but appends may be performed in some implementations. For the sake of simplicity, in this paper we will refer to the components of the distributed file system using the HDFS terminology, where name node refers to the master node and data node refers to the slave.

MapReduce clusters use a configurable number of replicas per file (three by default). While this replica policy makes sense for availability, it is ineffective for locality and load balancing when access patterns of data are not uniform.

As it turns out, the data access patterns (or popularity distribution) of files in MapReduce clusters are not uniform; it is common for some files to be much more popular than others (e.g., job configuration files during initial job stages) while others may be significantly unpopular (e.g., old log files rarely processed). For job files, their popularity can be predicted (i.e., launching a job creates a hotspot), so a solution adopted in currently implemented systems is to have the framework automatically increase the replication factor for these files [1, 14]. For other cases, the current approach is to manually increase or decrease the number of replicas for a file using organization heuristics based on data access patterns. For example, Facebook de-replicates aged data, which can have a lower number of replicas (as low as one copy) compared to other data [18]. The manual approach described above is not scalable and can be error-prone. We provide more insight on the data access patterns of MapReduce clusters in Section III.

B. Hadoop in virtualized environments

Hadoop is used by many small and medium sized companies that frequently use third-party virtual clusters instead of in-house Hadoop data centers. Even if these service providers upgrade to higher-end network fabrics, contention still exists for the shared network resource. Additionally, virtually allocated nodes may not be on the same rack, which further stresses the network and adds latency to network communications.

To illustrate this issue, we evaluate the difference in network bandwidth, disk bandwidth, and network round-trip time (RTT) in a 20-node dedicated cluster in the Illinois Cloud Computing Testbed (CCT) [22], and in a 20-node virtual cluster on Amazon's EC2. The configurations of these clusters are described in the evaluation section. (EC2, the Elastic Compute Cloud, is an infrastructure-as-a-service (IaaS) API for pay-as-you-go virtual machines on Amazon Web Services (AWS).)

We use ping, hdparm and iperf to evaluate the network latency, disk bandwidth, and network bandwidth, respectively. Tables I and II summarize the results.
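As a rough illustration of how such measurements can be collected, the following Python sketch drives the same tools from a script. It is a sketch under stated assumptions, not the scripts used in the paper: it assumes ping, hdparm and iperf are installed, that an iperf server is already running on the peer host, and that the host name and device path are placeholders.

# Hedged sketch: sample RTT, disk read bandwidth, and network bandwidth with
# ping, hdparm, and iperf. The peer host and disk device below are placeholders.
import subprocess

PEER = "node02.example.com"  # hypothetical peer; an iperf server must be running there
DISK = "/dev/sda"            # hypothetical block device; hdparm -t typically needs root

def run(cmd):
    # Run a command and return its raw textual output (parsing is left to the reader).
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout

if __name__ == "__main__":
    print(run(["ping", "-c", "10", PEER]))   # round-trip times (cf. Table I)
    print(run(["hdparm", "-t", DISK]))       # buffered disk read bandwidth (cf. Table II)
    print(run(["iperf", "-c", PEER]))        # TCP bandwidth to the peer (cf. Table II)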

Table I
ALL-TO-ALL PING ROUND-TRIP TIMES, FOR A DEDICATED CLUSTER (CCT) AND A VIRTUALIZED CLUSTER IN A PUBLIC CLOUD (EC2).

       Min       Mean      Max       Std. Deviation
CCT    0.01 ms   0.18 ms   2.17 ms   0.34 ms
EC2    0.02 ms   0.77 ms   75.1 ms   3.36 ms

Table II
DISK (READ) AND NETWORK BANDWIDTH IN MB/S.

                         Min     Mean    Max     Std. Dev.
CCT disk bandwidth       145.3   157.8   167.0   8.02
CCT network bandwidth    115.4   117.7   118.0   0.65
EC2 disk bandwidth       67.1    141.5   357.9   74.2
EC2 network bandwidth    5.8     73.2    109.9   16.9

As shown in Table I (and in [23]), the RTTs between EC2 instances have much higher variability than between non-virtualized machines, making performance difficult to predict.

While the CCT and EC2 bandwidth results are not directly comparable because the nodes in each cluster have different hardware characteristics, a key insight is that the ratio of network bandwidth to disk bandwidth is 40% higher (74.6% vs. 51.75%) for the CCT cluster than for the virtualized EC2 cluster. This means that the gain of local reads (vs. remote reads through the network) is higher for the EC2 cluster than for the CCT cluster.

There are several things that can affect performance in virtualized public clouds. These include the impact of virtualization on disk and network I/O, other clients sharing the physical nodes, and the network topology. With respect to the last issue, many cloud IaaS APIs will allocate nodes for a user in different racks, a practice that provides better load balancing for the provider and better availability for the client, but decreases network bandwidth and increases latency.

We use traceroute to infer the inter-node distance of our 20-node EC2 cluster. In our test cluster most nodes are 4 hops apart (see Fig. 1); in an in-house data center of that size all nodes would have been 1 or 2 hops apart.

Figure 1. Distribution of the number of hops between any two nodes in an EC2 20-node cluster.

A study of network performance in the Amazon EC2 cloud was performed by Wang et al. [23], who found that processor sharing can cause very unstable network performance in EC2 instances, which leads to large packet delays even when the network is not heavily congested.

III. DATA ACCESS PATTERNS IN PRODUCTION CLUSTERS

We analyze Hadoop logs of a Yahoo! 4000-node production cluster [24] from the second week of January 2010. In this paper, the term file denotes the smallest granularity of data that can be accessed by a MapReduce job. A file is comprised of N fixed-size data blocks (of 64 to 256 MB).

File popularity: The data access distribution, or file popularity, of data in a MapReduce cluster can be nonuniform due to the reasons described in Section II. Fig. 2 shows a heavy-tailed distribution in which some files are significantly more popular than others. In this case, the reason is that the cluster is used mainly to perform different types of analysis on a common (time-varying) data set.
Similar results were obtained by analyzing the 64-node CCT Hadoop production cluster, and have previously been observed at Microsoft's Bing [6]. This suggests that uniformly increasing the number of replicas is not an adequate way of improving locality and achieving load balancing.

Figure 2. Number of accesses per file. Files are ranked according to their popularity (measured by number of accesses and by number of accesses weighted by the number of 128 MB blocks in the file).

Duration of the majority of accesses: The cumulative distribution function of the age of a file at the time it is accessed is shown in Fig. 3. We found that around 80% of the accesses of a file occur during its first day of life. A similar graph is presented by Fan et al. [25], obtained from the Yahoo! M45 research cluster; they found that most of the accesses of a block occur very shortly after its creation (50% at 1 minute of age). In our analysis, files are typically accessed over a larger period of time (50% of accesses occur at almost 10 hours of age). We believe this difference is partly due to the fact that we excluded system files (e.g., job.jar, job.xml, job.split), which are created, accessed, and deleted for every MapReduce job in the system, and also that the M45 cluster is an academic research cluster shared among very different groups and institutions, so data sharing may be less common than in a single-organization cluster.

Figure 3. Cumulative distribution function (CDF) of file age at time of access. There is a high temporal correlation in accesses.

Duration of popularity bursts: Fig. 4 shows the percentage of time windows that contain 80% or more of the total accesses. In other words, for every file we found the smallest number of consecutive time slots (each slot is one hour long) that contain 80% or more of the total accesses for that file. The spike at window 121 shows that most files are accessed daily. Fig. 5 shows the same analysis for a sample day (day 2 in the data set). Within a day, most significant file accesses lie within one hour.

Figure 4. Percentage of time windows/slots with 80% or more of the accesses for each file; only "big files" (80% or more of the total accesses) are considered; system (job-related) files are excluded. (a) All accesses are weighted equally; (b) each entry is weighted by the number of accesses.

Figure 5. Percentage of time windows/slots with 80% or more of the accesses for each file, within day 2 of the Yahoo! data set; only "big files"; system files are excluded from the analysis. (a) All accesses are weighted equally; (b) each entry is weighted by the number of accesses.
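The per-file statistics behind Figs. 2-5 can be derived from an access trace with a short script. The sketch below is illustrative only: it assumes a simplified two-field trace format (timestamp and file path), which is not the format of the Yahoo! audit log, and uses the first access to a file as a stand-in for its creation time.

# Hedged sketch: file popularity and age-at-access from a simplified access trace.
# Assumed input: one access per line, "<unix_timestamp> <file_path>" (hypothetical format).
from collections import Counter

def analyze(trace_lines):
    first_seen = {}       # first access per file, used here as a proxy for creation time
    accesses = Counter()  # number of accesses per file
    ages = []             # file age (in seconds) at every access
    for line in trace_lines:
        ts_str, path = line.split(maxsplit=1)
        ts = float(ts_str)
        first_seen.setdefault(path, ts)
        accesses[path] += 1
        ages.append(ts - first_seen[path])
    ranked = accesses.most_common()  # popularity ranking, cf. Fig. 2
    ages.sort()                      # sorted ages yield the empirical CDF of Fig. 3
    return ranked, ages

Ranking files by access count (optionally weighted by file size) reproduces the heavy-tailed shape of Fig. 2, and the sorted ages give the empirical CDF of Fig. 3.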

IV. DESIGN

Our goal is to design an adaptive replication scheme that seeks to increase data locality by replicating "popular" data while keeping a minimum number of replicas for unpopular data. Additionally, we want our scheme to: (a) dynamically adapt to changes in file access patterns, (b) use a replication budget to limit the extra storage consumed by the replicas, and (c) impose a low network overhead.

Role of scheduler in improving locality: Current data-intensive frameworks rely on the job scheduler to optimize for data locality [1, 14, 26] on top of a uniform data replication policy implemented by the distributed file system. Schedulers are able to increase locality by cleverly placing tasks based on proximity to their data. For popular data, especially files with concurrent accesses [6], DARE helps the scheduler improve locality by providing more choices of where the tasks can run.

A. Greedy approach

First, we design a greedy reactive scheme that incurs no extra network traffic by taking advantage of existing data retrievals as follows: when a map task is launched, its data can be local to the node or remote to the node (i.e., located in a different node). In the case of remote data, the original MapReduce framework fetches and processes the data, without keeping a local copy for future tasks. With DARE, when a map task processes remote data, the data are inserted into the HDFS at the node that fetched them. By doing this, the number of replicas for the data is automatically increased by one, without incurring explicit network traffic. Data are replicated at the granularity of a block.

Replication with a budget: DARE uses a replication budget to limit the extra storage consumed by the dynamically replicated data. The budget is configurable, but a value between 10% and 20% is reasonable. An analysis of the sensitivity of this parameter is presented in Section V-D.

Aging and replica eviction policy: When the storage space assigned to dynamically created replicas is not infinite, an eviction mechanism is needed. Traditional eviction schemes include least recently used (LRU) and least frequently used (LFU). The choice between LRU and LFU should be made after profiling typical workloads.

Algorithm 1 presents the greedy approach, with a budget and LRU eviction (with lazy block deletion at idle times).


Algorithm 1 Greedy approach

Dynamic replication logic at a node/slave:
  if a non-data-local map task is scheduled then
    if replication budget will be exceeded then
      Call markBlockForDeletion(block)
    end if
    Insert block in local data node
    Add block to set of dynamically replicated blocks
    Increase budget use by number of bytes in block
  end if

  function markBlockForDeletion(evicting)
    repeat
      // Least recently used victim is selected
      victim ← front of blocksInUsageOrder queue
      if victim belongs to same file as evicting then
        continue
      else
        Remove victim from blocksInUsageOrder
        Set done
      end if
      Mark victim for lazy deletion
      return victim
    until done
    // NOTE: the blocksInUsageOrder queue is refreshed on every read;
    // blocks are inserted at the tail and removed from the front.

B. Probabilistic approach

A problem that may arise from inadequate eviction policies is thrashing. In this context, thrashing is a high rate of replica creation and eviction. To add stability and prevent thrashing, we modify the algorithm so that blocks are not dynamically replicated immediately after a remote read, but rather are replicated with a probability p. Our algorithm is an adaptation of the ElephantTrap [20], a mechanism to identify the largest flows on a network link. A coin is tossed to decide whether to replicate the block and also whether to update the circular list that keeps track of block accesses.

The greedy approach assumes that any nonlocal data access originated by a remote map task is worth replicating. Since jobs with a small number of map tasks tend to achieve poor locality [10], the greedy approach unnecessarily replicates unpopular data. By adopting a probabilistic approach (see Algorithm 2), we are able to ignore most unpopular nonlocal accesses while replicating popular data.

While our approach has some similarities with caching, it is considered to be a replication scheme because the knowledge of dynamically replicated blocks is transmitted to the name node (during a heartbeat), so that this information will be made available to the scheduler and other users of the file system, which in turn can use it to achieve higher locality levels. Replicas created by DARE are first-order replicas and as such they also contribute to increasing the availability of the data in the presence of failures.

Aging and replica eviction policy: A probabilistic approach is also applied to the aging mechanism; instead of refreshing (updating) the number of block accesses per dynamically replicated block on each access, we do it with probability p. When the budget is reached and it is decided that a nonlocal access will trigger a replication, the algorithm iterates through the list of dynamically replicated blocks, reducing their access count by half each time. This is to quickly evict files with decreasing popularity and prevent new popular files from being evicted too soon. The iteration through the list stops when a victim has been found and marked for deletion (access count < threshold) or when we have iterated through the whole list. Blocks marked for deletion are lazily removed to avoid conflicting with other operations.

Section V-D provides an analysis of the sensitivity of the budget, threshold, and p parameters.
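To make the probabilistic replication and competitive aging concrete, the following is a minimal Python sketch of the per-node logic just described. It is an illustration under stated assumptions (the ReplicaCache class, its method names, and the in-memory bookkeeping are hypothetical), not the paper's Hadoop data-node implementation, which Algorithm 2 below summarizes.

# Hedged sketch of DARE-style probabilistic replication with competitive aging.
# Illustrative only: names and data structures are assumptions based on the prose
# of Section IV-B, not the actual HDFS/data-node code.
import random
from collections import OrderedDict

class ReplicaCache:
    def __init__(self, budget_bytes, p=0.3, threshold=1):
        self.budget = budget_bytes   # replication budget, in bytes
        self.used = 0                # bytes consumed by dynamically created replicas
        self.p = p                   # ElephantTrap sampling probability
        self.threshold = threshold   # eviction threshold on the (aged) access count
        self.blocks = OrderedDict()  # block_id -> [size_bytes, access_count]

    def on_nonlocal_read(self, block_id, size):
        # Called when a map task on this node reads a block stored remotely.
        if block_id in self.blocks:
            if random.random() < self.p:      # refresh the count only with probability p
                self.blocks[block_id][1] += 1
            return False
        if random.random() >= self.p:         # ignore most unpopular nonlocal accesses
            return False
        while self.used + size > self.budget:
            if not self._age_and_evict():     # no victim below threshold: skip replication
                return False
        self.blocks[block_id] = [size, 1]     # bookkeeping for the new local replica
        self.used += size
        return True                           # caller would now insert the block into HDFS

    def _age_and_evict(self):
        # Competitive aging: halve counts while scanning; evict the first block whose
        # count drops below the threshold (a real system would mark it for lazy deletion).
        for block_id, entry in list(self.blocks.items()):
            entry[1] //= 2
            if entry[1] < self.threshold:
                self.used -= entry[0]
                del self.blocks[block_id]
                return True
        return False

With parameter values similar to those used later in the evaluation (p = 0.3, threshold = 1, a budget of roughly 20% of local storage), a remote read would create a replica only about 30% of the time, and blocks whose halved counts fall below the threshold are the first to be reclaimed.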

Algorithm 2 Probabilistic approach

Dynamic replication logic at a node/slave:
  if a map task is scheduled then
    Generate a random number, r ∈ (0, 1)
    if r < p then ...

Table III
CONFIGURATION OF THE TEST CLUSTERS.

                  CCT                       EC2
Type of cluster   Dedicated, single rack    Virtual (m1.small instances)

Figure 6. Access pattern (CDF) used in the experiments.

With respect to user metrics, we evaluate both turnaround time and slowdown, which together show the effect of the dynamic replication mechanism on job performance.

The geometric mean of the turnaround time (GMTT) measures how long a job takes from its submission to completion. Previous studies have used this metric [17, 28] instead of the average turnaround time because the latter is dominated by long jobs. The GMTT is defined as follows:

GMTT = ( ∏_{k∈K} TT_k )^{1/|K|},    (1)

where TT_k is the turnaround time of job k, K is the set of jobs in the workload and |K| is the number of jobs.

The slowdown of a job is defined as its running time on a loaded system divided by the running time on a dedicated system [29]; for the case of Hadoop, we calculate the latter as the running time (job completion time − job arrival time) in a completely free Hadoop cluster with 100% data locality.

To evaluate the uniformity of the replica placement with DARE, we assign a popularity value to each file based on its access count for each workload. We calculate the popularity index (PI) of data node i as PI_i = Σ_j blockSize_j × blockPopularity_j, for every block j stored at node i. In this way, we obtain a distribution of the PI_i of the nodes in the system. As a measure of the uniformity of this distribution, we use the coefficient of variation (cv), a normalized measure of the dispersion of a probability distribution. The more uniform the distribution is, the smaller its cv. The cv is defined as the ratio of the standard deviation σ to the mean μ: cv = σ/|μ|.
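The three metrics just defined can be computed directly; the short sketch below shows the corresponding calculations (it is an illustrative helper, not the paper's evaluation harness).

# Hedged sketch: GMTT (Eq. 1), slowdown, popularity index, and coefficient of variation.
import math
import statistics

def gmtt(turnaround_times):
    # Geometric mean turnaround time: (prod_k TT_k)^(1/|K|), computed in log space.
    return math.exp(sum(math.log(t) for t in turnaround_times) / len(turnaround_times))

def slowdown(loaded_time, dedicated_time):
    # Running time on the loaded system divided by the running time on a dedicated cluster.
    return loaded_time / dedicated_time

def popularity_index(blocks_at_node):
    # PI_i = sum over blocks j stored at node i of blockSize_j * blockPopularity_j.
    return sum(size * popularity for size, popularity in blocks_at_node)

def coefficient_of_variation(popularity_indices):
    # cv = sigma / |mu|; a smaller cv indicates a more uniform replica placement.
    mu = statistics.mean(popularity_indices)
    return statistics.pstdev(popularity_indices) / abs(mu)

For example, three jobs with turnaround times of 120, 45 and 300 seconds have a GMTT of roughly 117 seconds, whereas their arithmetic mean (155 seconds) is pulled up by the long job.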

B. Performance: System metrics

Fig. 7a shows that DARE achieves significant improvement in data locality: as much as 7 times improvement for the FIFO scheduler and 70% for the Fair scheduler. When the job workload favors the Fair scheduler (wl2 in Fig. 7a), the locality is quite high even without dynamic replication (83% in our tests), but even for this case our scheme is able to bring locality up to levels close to 100%. The Fair scheduler is able to increase locality at the cost of a small delay for tasks that cannot run locally, and hence increases turnaround time.

Figure 7. Data locality and performance (GMTT and slowdown) of DARE in a dedicated 20-node cluster, for two different replica eviction policies: a greedy least recently used (LRU) one and a probabilistic one (ElephantTrap; p = 0.3; threshold = 1; budget = 0.2). Two different job workloads are evaluated: one that favors the FIFO scheduler (wl1), and one that favors the Fair scheduler (wl2). (a) Data locality (higher is better); (b) geometric mean turnaround time (lower is better); (c) mean slowdown (lower is better).

Data locality is an important metric because it reduces stress on the network fabric, which is desirable in data-intensive scenarios since network fabrics are frequently oversubscribed [30], especially across racks. Furthermore, the reduction in network traffic due to increased locality can be exploited by current and future energy-proportional switches and network designs to reduce energy consumption [31, 32].

C. Performance: User metrics

Fig. 7b and 7c show that DARE achieves up to 16% reduction in turnaround time (GMTT) and 20% in slowdown in the CCT tests. This improvement is similar to that of Scarlett [6], which uses a non-dynamic epoch-based scheme.

We also evaluate the reduction in map task completion time with the second workload. The mean reduction is 12% and 11% for the FIFO and Fair schedulers, respectively. We believe this is due to a mixture of input-bound and output-bound [27] tasks in the trace.

Dynamic replication does not expedite output-bound tasks, whose turnaround time is dominated by output processing. We plan to investigate the effect of different task types further in future work.

D. Sensitivity analysis

We vary the values of different parameters to observe their effects on data locality. The results can be seen in Fig. 8 and Fig. 9. DARE with the ElephantTrap replica eviction mechanism is not too sensitive to changes in the threshold parameter. Increasing the threshold leads to a slow decrease in locality and a slow increase in the number of replicas being created. A higher threshold value leads to some replicas being evicted too early, which has a detrimental effect on both locality and the number of replicas being created. The effect is, however, not too pronounced due to the way the eviction data structure is iterated and the heavy-tailed nature of the accesses; as long as the most popular object is not evicted, other evictions have a less significant incidence on the data locality achieved.

Increasing the budget leads to a minor increase in data locality while producing a more noticeable reduction in the number of replicas created. The reason for this is that the smaller the budget, the smaller the number of different dynamically replicated blocks that a data node can keep, which leads to a higher eviction rate that in turn leads to some popular files being evicted too early. However, the effect that this has on locality is small because even small budgets allow DARE to replicate the most popular files, and achieve most of the benefit of the dynamic replication mechanism.

Increasing the ElephantTrap sampling probability p leads to higher data locality, at the cost of a higher number of blocks being replicated. In our experiments (see Fig. 8a), setting p to a value between 0.2 and 0.3 leads to the most gain in locality without incurring excessive replica thrashing.

Figure 8. Effect on locality (top) and number of blocks dynamically replicated (bottom) when using DARE with the probabilistic (ElephantTrap) eviction policy, for different values of the p and threshold parameters, for the full set of jobs in the second workload (wl2). (a) Effect of changing the replication probability; threshold = 1, budget = 0.20. (b) Effect of changing the eviction threshold; p = 0.90, budget = 0.50.

Figure 9. Effect on locality (top) and number of blocks dynamically replicated (bottom) when using DARE with the greedy (LRU, left) and probabilistic (ElephantTrap, right) eviction policies, for different budget values, for the full set of jobs in the second workload (wl2). (a) DARE with LRU eviction. (b) DARE with ElephantTrap eviction; threshold = 1.

E. Performance in virtualized public clouds

We evaluate DARE in a 100-node EC2 virtual cluster (see Fig. 10). Our results show that for comparable improvement in locality, the improvement in GMTT and mean slowdown was more significant in the EC2 cluster (19% and 25%, respectively). We believe this is due to the fact that the ratio of network bandwidth to disk bandwidth is higher for the CCT cluster than for the virtualized EC2 cluster, for the reasons discussed in Section II-B.

Figure 10. Performance of DARE in a virtualized 100-node cluster in the Amazon Web Services EC2 public cloud, for two different replica eviction policies: a greedy least recently used (LRU) one and a probabilistic one (ElephantTrap; p = 0.3; threshold = 1; budget = 0.2), and one 500-job-long workload (wl1 in the CCT experiments). (a) Data node locality (higher is better); (b) geometric mean turnaround time (lower is better); (c) mean slowdown (with respect to job execution on a dedicated cluster; lower is better).

F. Uniformity of the replica placement

We evaluate DARE's placement of replicas using the coefficient of variation (cv) of the distribution of the popularity indices of the data nodes, both before and after using DARE. Fig. 11 shows that the replica placement with DARE is less dispersed (smaller cv) with respect to the popularity index of the nodes than with the default Hadoop replica placement policy. The parameters used in this test are: FIFO scheduler, wl1, probabilistic version of DARE (budget = 20%, threshold = 1). We observe that the popularity indices gain significant uniformity at p = 0.2.

Figure 11. Uniformity of replica placement, before dynamic replication (without DARE) and after dynamic replication (i.e., at the end of the 500-job run, with DARE enabled during the run). Smaller cv is better.

VI. RELATED WORK

Data replication in distributed file systems is a technique commonly used to improve data availability [11–14, 16] in the presence of failures. Replication has also been studied in the context of improving response time [33], read throughput [14, 16], and, less frequently, job performance [6, 17]. Our contributions are: (i) an analysis of the data access patterns of a production MapReduce cluster, (ii) an analysis of the effect of improving locality through replication for dedicated clusters and for virtualized public clouds, and (iii) the design, implementation, and evaluation of a dynamic replication mechanism for the replica placement problem that is able to adapt to popularity changes.

Only two recent works deal with the specific case of dynamic replication in MapReduce clusters: CDRM [12] and Scarlett [6]. In [12], Wei et al. presented CDRM, a "cost-effective dynamic replication management scheme for cloud storage cluster". CDRM is a replica placement scheme for Hadoop that aims to improve file availability by centrally determining the ideal number of replicas for a file, and an adequate placement strategy based on the blocking probability. The effects of increasing locality are not studied.

Parallel to our work, Ananthanarayanan et al. [6] proposed Scarlett, an off-line system that replicates blocks based on their observed probability in a previous epoch. Scarlett computes a replication factor for each file and creates budget-limited replicas distributed among the cluster with the goal of minimizing hotspots. Replicas are aged to make space for new replicas.

While Scarlett uses a proactive replication scheme that periodically replicates files based on predicted popularity, we proposed a reactive approach that is able to adapt to popularity changes at smaller time scales and can help alleviate recurrent as well as nonrecurrent hotspots.

Zaharia et al.'s delay scheduling [10] increases locality by delaying, for a small amount of time, a map task that, without the delay, would have run nonlocally. DARE is scheduler-agnostic and can work together with this and other scheduling techniques that try to increase locality. The delay scheduling technique is currently part of Hadoop's Fair scheduler, one of the two schedulers used in our evaluations.

VII. CONCLUSION

We found that data access patterns in MapReduce clusters are heavy-tailed, with some files being considerably more popular than others. For nonuniform data access patterns, current replication mechanisms that replicate files a fixed number of times are inadequate and can create suboptimal task locality and hinder the performance of MapReduce clusters. We proposed DARE, an adaptive data replication mechanism that can improve data locality by more than 7 times for a FIFO scheduler and 70% for the Fair scheduler, without incurring extra networking overhead. Turnaround time and slowdown are improved by 19% and 25%, respectively. Furthermore, our scheme is scheduler-agnostic and can be used in parallel with other schemes that aim to improve locality.


ACKNOWLEDGMENT

This paper is based on research sponsored by the Air Force Research Laboratory and the Air Force Office of Scientific Research, under agreement FA8750-11-2-0084. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. We thank Amazon for sponsoring our EC2 experiments through AWS in Education.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. USENIX Symp. Operating Syst. Design and Implementation (OSDI), 2004, pp. 137–150.
[2] "Apache Hadoop," Jun. 2011. [Online]. Available: http://hadoop.apache.org
[3] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed data-parallel programs from sequential building blocks," in Proc. Eur. Conf. Comput. Syst. (EuroSys), 2007, pp. 59–72.
[4] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Trans. Comput. Syst., vol. 26, no. 2, 2008.
[5] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system," SIGOPS Operating Syst. Rev., vol. 44, no. 2, 2010.
[6] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica, D. Harlan, and E. Harris, "Scarlett: Coping with skewed content popularity in MapReduce clusters," in Proc. Eur. Conf. Comput. Syst. (EuroSys), 2011.
[7] J. Dixon, "Pentaho, Hadoop, and data lakes," blog, Oct. 2010. [Online]. Available: http://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes
[8] J. Zuanich, "Twitter analytics lead, Kevin Weil, interviewed," Cloudera Blog, Sep. 2010. [Online]. Available: http://www.cloudera.com/blog/2010/09
[9] "Powered-by – Hadoop Wiki," Jul. 2011. [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy
[10] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling," in Proc. Eur. Conf. Comput. Syst. (EuroSys), 2010.
[11] M. Satyanarayanan, "A survey of distributed file systems," Annu. Rev. of Comput. Sci., vol. 4, pp. 73–104, 1990.
[12] Q. Wei, B. Veeravalli, B. Gong, L. Zeng, and D. Feng, "CDRM: A cost-effective dynamic replication management scheme for cloud storage cluster," in Proc. IEEE Int'l Conf. Cluster Computing (CLUSTER), 2010, pp. 188–196.
[13] J. Xiong, J. Li, R. Tang, and Y. Hu, "Improving data availability for a cluster file system through replication," in Proc. IEEE Int'l Symp. Parallel and Distrib. Processing (IPDPS), 2008.
[14] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Proc. IEEE Symp. Mass Storage Syst. and Technologies (MSST), 2010.
[15] D. Ford, F. Labelle, F. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, "Availability in globally distributed storage systems," in Proc. USENIX Symp. Operating Syst. Design and Implementation (OSDI), 2010.
[16] J. Terrace and M. J. Freedman, "Object storage on CRAQ: High-throughput chain replication for read-mostly workloads," in Proc. USENIX Annu. Tech. Conf. (USENIX), 2009.
[17] M. Tang, B.-S. Lee, X. Tang, and C.-K. Yeo, "The impact of data replication on job scheduling performance in the data grid," Future Generation Comput. Syst., vol. 22, no. 3, 2006.
[18] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Disk-locality in datacenter computing considered irrelevant," in Proc. USENIX Workshop on Hot Topics in Operating Syst. (HotOS), 2011.
[19] T. Hey, S. Tansley, and K. Tolle, Eds., The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[20] Y. Lu, B. Prabhakar, and F. Bonomi, "ElephantTrap: A low cost device for identifying large flows," in Proc. IEEE Symp. High-Performance Interconnects (HOTI), 2007.
[21] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proc. ACM Symp. Operating Syst. Principles (SOSP), 2003, pp. 29–43.
[22] "CCT: Illinois Cloud Computing Testbed," (Last accessed: Jul. 7, 2011). [Online]. Available: http://cloud.cs.illinois.edu
[23] G. Wang and T. S. E. Ng, "The impact of virtualization on network performance of Amazon EC2 data center," in Proc. Conf. Inf. Commun. (INFOCOM), 2010.
[24] "Yahoo! Webscope dataset ydata-hdfs-audit-logs-v1_0," direct distribution, Feb. 2011, http://research.yahoo.com/Academic_Relations.
[25] B. Fan, W. Tantisiriroj, L. Xiao, and G. Gibson, "DiskReduce: RAID for data-intensive scalable computing," in Proc. Annu. Workshop Petascale Data Storage (PDSW), 2009, pp. 6–10.
[26] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proc. USENIX Symp. Operating Syst. Design and Implementation (OSDI), 2008, pp. 29–42.
[27] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz, "The case for evaluating MapReduce performance using workload suites," in Proc. IEEE Int'l Symp. Modeling, Analysis & Simulation of Comput. and Telecomm. Syst. (MASCOTS), 2011.
[28] W. Cirne and F. Berman, "When the herd is smart: Aggregate behavior in the selection of job request," IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 2, 2003.
[29] D. Feitelson and L. Rudolph, "Metrics and benchmarking for parallel job scheduling," Job Scheduling Strategies for Parallel Processing, LNCS, vol. 1459, pp. 1–24, 1998.
[30] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, "The nature of data center traffic: Measurements & analysis," in Proc. ACM Conf. Internet Measurement (IMC), 2009.
[31] P. Mahadevan, S. Banerjee, and P. Sharma, "Energy proportionality of an enterprise network," in Proc. ACM Workshop on Green Netw., 2010, pp. 53–60.
[32] R. Bolla, R. Bruschi, F. Davoli, and F. Cucchietti, "Energy efficiency in the future Internet: A survey of existing approaches and trends in energy-aware fixed network infrastructures," IEEE Commun. Surveys Tuts., vol. 13, no. 2, pp. 223–244, 2011.
[33] T. Loukopoulos and I. Ahmad, "Static and adaptive data replication algorithms for fast information access in large distributed systems," in Proc. Int'l Conf. Distrib. Comput. Syst. (ICDCS), 2000.
