HeteroDoop: A MapReduce Programming System for Accelerator Clusters

Amit Sabne, Putt Sakdhnagool, Rudolf Eigenmann
Purdue University
[email protected] [email protected] [email protected]

ABSTRACT

The deluge of data has inspired big-data processing frameworks that span across large clusters. Frameworks for MapReduce, a state-of-the-art programming model, have primarily made use of the CPUs in distributed systems, leaving out computationally powerful accelerators such as GPUs. This paper presents HeteroDoop, a MapReduce framework that employs both CPUs and GPUs in a cluster. HeteroDoop offers the following novel features: (i) a small set of directives can be placed on an existing sequential, CPU-only program, expressing MapReduce semantics; (ii) an optimizing compiler translates the directive-augmented program into GPU code; (iii) a runtime system assists the compiler in handling MapReduce semantics on the GPU; and (iv) a tail scheduling scheme minimizes job execution time in light of the disparate processing capabilities of CPUs and GPUs. This paper addresses several challenges that need to be overcome in order to support these features. HeteroDoop is built on top of the state-of-the-art, CPU-only Hadoop MapReduce framework, inheriting its functionality. Evaluation results of HeteroDoop on recent hardware indicate that use of even a single GPU per node can improve performance by up to 2.78x, with a geometric mean of 1.6x across our benchmarks, compared to a CPU-only Hadoop running on a cluster with 20-core CPUs.

Categories and Subject Descriptors

D.1.3 [Software]: PROGRAMMING TECHNIQUES—Concurrent Programming, Distributed programming; D.3.3 [Software]: PROGRAMMING LANGUAGES—Frameworks; D.3.4 [Programming Languages]: Compilers

Keywords

Distributed Frameworks; MapReduce; Accelerators; Source-to-source Translation; Scheduling

1. INTRODUCTION

A growing number of commercial and science applications in both classical and new fields process very large data volumes. Dealing with such volumes requires processing in parallel, often on systems that offer high compute power.

For this type of parallel processing, the MapReduce paradigm has found popularity. The key insight of MapReduce is that many processing problems can be structured into one or a sequence of phases, where a first step (Map) operates in fully parallel mode on the input data; a second step (Reduce) combines the resulting data in some manner, often by applying a form of reduction operation. MapReduce programming models allow the user to specify these map and reduce steps as distinct functions; the system then provides the workflow infrastructure, feeding input data to the map, reorganizing the map results, and then feeding them to the appropriate reduce functions, finally generating the output.
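To make the two steps concrete, consider word counting, the canonical MapReduce example. The sketch below is ours, not the paper's: plain sequential C in which map() emits a (word, 1) pair per input token and reduce() sums the counts per distinct word. In a real MapReduce framework, the grouping between the two steps is performed by the runtime's sort/shuffle stage rather than by a direct call.

    /* Minimal, illustrative word-count sketch (not from the paper):
       map emits (word, 1) pairs; reduce sums counts per distinct key. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_KEYS 1024

    static char keys[MAX_KEYS][64];  /* distinct words seen so far */
    static int  counts[MAX_KEYS];    /* summed value for each key  */
    static int  nkeys = 0;

    /* Reduce: fold one (key, value) pair into the running totals. */
    static void reduce(const char *key, int value) {
        for (int i = 0; i < nkeys; i++)
            if (strcmp(keys[i], key) == 0) { counts[i] += value; return; }
        if (nkeys < MAX_KEYS) {
            strncpy(keys[nkeys], key, 63);
            counts[nkeys++] = value;
        }
    }

    /* Map: tokenize one input record and emit a (word, 1) pair per
       token. Here the emission feeds reduce() directly; a framework
       would collect, sort, and group the pairs first. */
    static void map(char *record) {
        for (char *tok = strtok(record, " \t\n"); tok;
             tok = strtok(NULL, " \t\n"))
            reduce(tok, 1);          /* emit(tok, 1) */
    }

    int main(void) {
        char line[512];
        while (fgets(line, sizeof line, stdin))  /* one record per line */
            map(line);
        for (int i = 0; i < nkeys; i++)
            printf("%s\t%d\n", keys[i], counts[i]);
        return 0;
    }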
The large data volumes involved may not fit on a single node. Thus, distributed architectures with many nodes may be needed. Among the systems that support MapReduce on distributed architectures, Hadoop [1] has gained wide use. Hadoop provides a framework that executes MapReduce problems in a distributed and replicated storage organization (the Hadoop Distributed File System, HDFS). In doing so, it also deals with node failures.

Big-data problems pose high demands on processing and IO speeds, often with emphasis on one of the two. For general compute-intensive problems, accelerators such as NVIDIA GPUs and Intel Phis have proven their value for an increasing range of applications. To obtain high performance, their architectures make different chip real-estate tradeoffs between processing cores and memory than CPUs do. A larger number of simpler cores provides higher aggregate processing power and reduces energy consumption, offering better performance/watt ratios than CPUs. In GPUs, intra-chip memory bandwidth is high, and multithreading reduces the effective memory access latency. These optimizations come at the cost of an intricate memory hierarchy, reduced memory size, data accesses that are highly optimized for inter-thread contiguous (a.k.a. coalesced) reference patterns, and explicitly parallel programming models. Using these architectures therefore requires high programmer expertise.

While accelerators tend to perform well on compute-intensive applications, IO-intensive MapReduce problems may not always benefit. Previous research efforts on MapReduce-like systems employ either GPUs [2, 3, 4] alone, disregarding IO-intensive applications, or CPUs [1, 5, 6, 7] alone, losing out on GPU acceleration. In this paper we present our HeteroDoop system, which exploits both CPUs and GPUs in a cluster, as needed by the application.

The first challenge in developing such a heterogeneous MapReduce system is the programming method. In a naive scheme, the programmer would have to write two program versions, one for CPUs and a second for GPUs. This need arises because accelerators rely on explicitly parallel programs, be it low-level programming models such as CUDA [8] and OpenCL [9], or high-level ones such as OpenACC [10] and OpenMP 4.0 [11]. Although the available high-level programming models relieve the user from having to learn model-specific APIs, such as those of CUDA or OpenCL, they still require explicit parallel programming. In CPU-oriented MapReduce systems, on the other hand, programmers write only sequential code; the underlying framework automatically employs all cores in the cluster by concurrently processing the input data, which is split into separate files. Previous research on using GPUs for MapReduce has relied on explicitly parallel codes with accelerator-specific optimizations [3, 4, 12, 13] and/or on specific MapReduce APIs [2, 3, 13, 14]. Programmability in both approaches is poor; the former requires learning low-level APIs and the latter necessitates application rewriting. To overcome these limitations, our contribution enables programmers to port already available sequential MapReduce programs to heterogeneous systems by annotating the code with HeteroDoop directives. Inserting such directives is straightforward, requires no additional accelerator optimizations, and leads to a single input source code for both CPUs and GPUs. Furthermore, the resulting code is portable; it can still execute on CPU-only clusters.
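As an illustration of this programming style, the sketch below annotates the sequential word-count mapper from above. The actual HeteroDoop directives are defined in Section 3; the pragma spelling and the emit() helper here are hypothetical, chosen only to show that the annotated source remains ordinary sequential C, which a CPU-only compiler can simply ignore.

    /* Illustrative only: the real HeteroDoop directives appear in
       Section 3. The pragma name and emit() helper are hypothetical. */
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for the framework's key-value emission. */
    static void emit(const char *key, int value) {
        printf("%s\t%d\n", key, value);
    }

    /* Hypothetical directive marking the sequential map function for
       the compiler; an unknown pragma is ignored by a CPU compiler,
       so the same source still builds for CPU-only clusters. */
    #pragma heterodoop map
    void map(char *record) {
        for (char *tok = strtok(record, " \t\n"); tok;
             tok = strtok(NULL, " \t\n"))
            emit(tok, 1);            /* one (word, 1) pair per token */
    }

    int main(void) {
        char line[] = "the quick brown fox jumps over the lazy dog";
        map(line);
        return 0;
    }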
The second key challenge in exploiting accelerators is their limited, non-virtual memory space. The default parallelization scheme used in MapReduce/Hadoop engages multiple cores by processing separate input files in parallel, typically one per core, as an individual task. The data input is appropriately partitioned into separate fileSplits, which are fed to the different compute nodes and their threads. Simultaneous accesses by many threads to their fileSplits require a large memory. This size requirement is not a problem in today's typical CPUs with 4 to 48 cores and virtual memory support; however, in GPUs with several hundred cores and possibly thousands of threads, the available, non-virtual memory is insufficient. Our second contribution addresses this challenge by processing the data records within a fileSplit in smaller chunks. A naive implementation of this scheme, however, incurs inefficient sorting of the key-value pairs, which follows the map phase. To resolve this issue, HeteroDoop includes a fast runtime compaction scheme, resulting in an efficient sort. Furthermore, the runtime system executes the sort operation on the GPU rather than on the slower CPU. Other optimizations include efficient data placement in the complex GPU memory hierarchy and automatic generation of vector load/store operations.

A final challenge in developing such a heterogeneous MapReduce scheme is to account for the different processing speeds of CPUs and GPUs when partitioning work. While prior work [15, 16] has dealt with load balancing across nodes, intra-node heterogeneity has remained an issue. HeteroDoop's tail scheduling scheme addresses this issue. Our contribution is based on a key observation: load imbalance arises only in the execution of the final tasks of a job; careful, GPU-speedup-based scheduling of these tailing tasks can avoid the imbalance.

We have evaluated the HeteroDoop framework on eight applications, comprising well-known MapReduce programs as well as scientific applications. We demonstrate the utility of HeteroDoop on a 48-node, single-GPU cluster with large datasets, and on a 64-node, 3-GPU cluster with in-memory datasets. Our main results indicate that the use of even a single GPU per node can speed up end-to-end job execution by up to 2.78x, with a geometric mean of 1.6x, as compared to CPU-only Hadoop running on a cluster with 20-core CPUs. Furthermore, the execution time scales with the number of GPUs used per node.

The remainder of the paper is organized as follows: Section 2 provides background on GPUs, MapReduce, and Hadoop. Section 3 describes the HeteroDoop constructs, followed by the compiler design in Section 4. Section 5 describes the overall execution flow of the HeteroDoop framework and details the runtime system. The tail scheduling scheme is explained in Section 6. Section 7 presents the experimental evaluation. Section 8 discusses related work, and Section 9 concludes the paper.

2. PRELIMINARIES

We introduce the basic terminology used in this paper for the GPU/CUDA architecture and programming model, as well as the MapReduce and Hadoop concepts. We keep the discussion of GPUs and CUDA brief, assuming reader familiarity, but we refer to introductory material [17].
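As a quick reference for the CUDA terms used in this paper (grid, thread block, thread, coalesced access), here is a minimal vector-addition example of our own; it is not taken from the paper. A kernel is executed by a grid of thread blocks, each thread computes a global index from its block and thread IDs, and neighboring threads touching neighboring array elements produce the contiguous (coalesced) access pattern the introduction mentions.

    #include <cstdio>
    #include <cuda_runtime.h>

    /* Kernel: one grid of thread blocks; each thread adds one element.
       Thread i reading element i gives coalesced memory accesses. */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; /* global index */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        /* Unified (managed) memory keeps the illustration short. */
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;                        /* threads per block */
        int blocks = (n + threads - 1) / threads; /* blocks in the grid */
        vecAdd<<<blocks, threads>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);              /* expect 3.0 */
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }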