Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale

Muhuan Huang*1,2, Di Wu*1,2, Cody Hao Yu*1, Zhenman Fang1, Matteo Interlandi1, Tyson Condie1 and Jason Cong1
1University of California, Los Angeles   2Falcon Computing Solutions, Inc.
{mhhuang, allwu, hyu, zhenman, minterlandi, tcondie, cong}@cs.ucla.edu
*Author names are listed in alphabetical order.

Abstract

With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems—like Apache Spark and Hadoop—to access the performance and energy benefits of FPGA accelerators.

In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming efforts to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.

Categories and Subject Descriptors: C.1.3 [Computer Systems Organization]: Heterogeneous (hybrid) systems

Keywords: FPGA-as-a-service, heterogeneous datacenter

1. Introduction

Modern big data processing systems, such as Apache Hadoop [1] and Spark [47], have evolved to an unprecedented scale. As a consequence, cloud service providers, such as Amazon, Google and Microsoft, have expanded their datacenter infrastructures to meet the ever-growing demands for supporting big data applications. However, due to the problem of dark silicon [21], simple CPU core scaling has come to an end, and thus CPU performance and energy efficiency have become one of the primary constraints in scaling such systems. To sustain the continued growth in data and processing methods, cloud providers are seeking new solutions to improve the performance and energy efficiency of their big data workloads.

Among various solutions that harness GPU (graphics processing unit), FPGA (field-programmable gate array), and ASIC (application-specific integrated circuit) accelerators in a datacenter, the FPGA-enabled datacenter has gained increased attention and is considered one of the most promising approaches. This is because FPGAs provide low power, high energy efficiency and reprogrammability to customize high-performance accelerators. One breakthrough example is that Microsoft has deployed FPGAs into its datacenters to accelerate the Bing search engine with almost 2x throughput improvement while consuming only 10% more power per CPU-FPGA server [33]. Another example is IBM's deployment of FPGAs in its data engine for large NoSQL data stores [13]. Moreover, Intel, with the $16.7 billion acquisition of Altera, is providing closely integrated CPU-FPGA platforms for datacenters [12], and is targeting the production of around 30% of the servers with FPGAs in datacenters by 2020 [6].
With the emerging trend of FPGA-enabled datacenters, one key question is: How can we easily and efficiently deploy FPGA accelerators into state-of-the-art big data computing systems like Apache Spark [47] and Hadoop YARN [42]? To achieve this goal, both programming abstractions and runtime support are needed to make these existing systems programmable to FPGA accelerators. This is challenging for the following reasons.

1. Unlike conventional CPU and GPU targeted programs, compiling an FPGA program can take several hours, which makes existing runtime systems that use dynamic code generation for CPU-GPU datacenters, such as Dandelion [35], HadoopCL [22] and SWAT [23], not applicable for FPGAs.

2. State-of-the-art big data systems like Apache Hadoop and Spark compile to the Java Virtual Machine (JVM), while FPGA accelerators are usually manipulated through C/C++/OpenCL. Even with predesigned FPGA accelerators, there are still excessive programming efforts required to i) integrate them with the JVM, ii) share an accelerator among multiple threads or multiple applications, and iii) share an FPGA platform among multiple accelerators of different functionalities.

3. A straightforward JNI (Java Native Interface) integration of FPGA accelerators can diminish or even degrade the overall performance (up to 1000X slowdown) due to the overwhelming JVM-to-native-to-FPGA communication overhead [15].

4. It usually takes several seconds to reprogram an FPGA into a different accelerator (with a different functionality). Frequent FPGA reprogramming in a multi-accelerator scenario can significantly degrade the overall system performance. This raises a fundamental question: Do we manage "the hardware platform itself" or "the logical accelerator (functionality) running on top of the hardware platform" as a resource?

To address these challenges, we design and implement Blaze: a framework that provides a programming abstraction and runtime support for easy and efficient FPGA deployments in datacenters. This paper describes the Blaze architecture and makes the following contributions.

1. Programming APIs that enable big data processing applications to leverage FPGA accelerators to perform task-level work. We abstract FPGA accelerators as a service (FaaS), which decouples the hardware accelerator development of data processing tasks (i.e., Spark transformations) and the big data processing logic (i.e., scheduling tasks, shuffling data, etc.).¹

2. Policies for managing logical accelerator functionality—instead of the physical hardware platform itself—as a resource, where better scheduling decisions can be made to optimize the system throughput and energy efficiency.

3. An efficient runtime to share FPGA accelerators in datacenters, where an FaaS framework is implemented to support sharing of accelerators among multiple threads and multiple applications in a single node. Also, accelerator-centric scheduling is proposed for the global accelerator management to alleviate the FPGA reprogramming overhead for multiple accelerators. Finally, several well-known optimization techniques—such as data caching and task pipelining—are employed to reduce the JVM-to-FPGA communication overhead.

4. An open-source prototype that is compatible with existing ecosystems like Apache Spark with no code changes, and with YARN through a lightweight patch.

Our goal is to bring FPGA accelerator developers, big data application developers, and system architects together, to blaze the deployment of accelerators in datacenters.²

¹ While Blaze does support GPU accelerators as well, this paper will mainly focus on FPGA accelerators, which have not been studied before.
² Blaze can be downloaded from github: https://github.com/UCLA-VAST/blaze. Blaze has already been used by multiple groups at Intel Labs to deploy accelerators composed of the Intel-Altera Heterogeneous Accelerator Research Platforms (HARP CPU-FPGA platforms).

2. Background

There has been great success in programming frameworks that enable efficient development and deployment of big data applications in conventional datacenters, i.e., those composed of general-purpose processors. In this section we briefly introduce Apache Spark [47]—our target big data processing framework—and the Hadoop YARN resource manager [42], which we use to expose FPGA resources in a cluster environment. We also give a quick tutorial of FPGA accelerators.

2.1 Apache Spark

Apache Spark [47] is a widely used fast and general large-scale data processing framework. It exposes a programming model based on Resilient Distributed Datasets (RDDs) [46]. The RDD abstraction provides transformations (e.g., map, reduce, filter, join, etc.) and actions (e.g., count, collect) that operate on datasets partitioned over a cluster of nodes. A typical Spark program executes a series of transformations ending with an action that returns a singleton value (e.g., the record count of an RDD) to the Spark driver program, which could then trigger another series of RDD transformations.

Spark caches reused data blocks in memory, often achieving significant performance speedup over Hadoop MapReduce [1] on iterative applications such as machine learning. Recent studies [9, 32] show that Spark applications are often computation-bound, rather than IO- or network-bound as in conventional Hadoop applications. This motivates us to leverage FPGAs to further accelerate the computation.

Spark can be run standalone on a cluster, or with a resource manager like Hadoop YARN [42]. For each Spark application submitted to the YARN cluster, a set of containers (see Section 2.2) is gathered from the resource manager matching the available resources and the application configuration. For each acquired container, the Spark context launches an executor: a JVM instance providing the base runtime for the execution of the actual data-processing computation (i.e., tasks), and managing the application data.
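As a concrete illustration of the RDD model described above, the following minimal Spark (Scala) sketch—our own example, not taken from the paper—chains several transformations and ends with an action; the file path and variable names are placeholders.

  // Count error lines in a log file (illustrative only).
  val lines  = sc.textFile("hdfs:///logs/app.log")        // RDD[String]
  val errors = lines.filter(_.contains("ERROR"))          // transformation (lazy)
  val pairs  = errors.map(l => (l.split(" ")(0), 1))      // transformation (lazy)
  val counts = pairs.reduceByKey(_ + _)                   // transformation (lazy)
  val total  = counts.map(_._2).reduce(_ + _)             // action: triggers execution
  println(s"total error lines: $total")

Only the final reduce action forces evaluation; everything before it merely builds the lineage of the RDD on the driver.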
2.2 Apache YARN

YARN (Yet Another Resource Negotiator) is a widely used cluster resource management layer in the Hadoop system that allocates resources, such as CPU and memory, to multiple big data applications (or jobs). Figure 1 shows a high-level view of the YARN architecture. A typical YARN setup would include a single resource manager (RM) and several node manager (NM) installations. Each NM typically manages the resources of a single machine and periodically reports to the RM, which collects all NM reports and formulates a global view of the cluster resources. The periodic NM reports also provide a basis for monitoring the overall cluster health at the RM, which notifies relevant applications when failures occur.

[Figure 1: Example YARN architecture showing a client submitting jobs to the global resource manager.]

A YARN job is represented by an application master (AM), which is responsible for orchestrating the job's work on allocated containers, i.e., slices of machine resources (some amount of CPU, RAM, disk, etc.). A client submits an AM package—which includes a shell command and any files (i.e., binary executables, configurations) needed to execute the command—to the RM, which then selects a single NM to host the AM. The chosen NM creates a shell environment that includes the file resources, and then executes the given shell command. The NM monitors the containers for resource usage and exit status, which the NM includes in its periodic reports to the RM. At runtime, the AM uses an RPC interface to request containers from the RM, and to ask the NMs that host its containers to launch a desired program. Returning to Figure 1, we see the AM instance running with allocated containers executing a job-specific task.

To manage heterogeneous computing resources in the datacenter and provide placement control, YARN recently introduced a mechanism called label-based scheduling. Administrators can specify labels for each server node and expose the label information to applications. The YARN resource manager then schedules the resource to an application only if the node label matches the application-specified label. Examples of node labels can be 'FPGA' or 'GPU', which indicate that the nodes are equipped with a special hardware platform.

2.3 FPGAs

A field-programmable gate array (FPGA) is a reconfigurable integrated circuit with much lower power consumption compared to CPUs and GPUs. Since an FPGA is essentially customizable hardware, it can achieve significant performance speedup despite its low clock frequency. Many factors contribute to the efficiency of FPGAs. For example, application-specific computation pipelines can be designed to avoid the conventional instruction fetching and decoding overhead. The data access can also be customized to significantly improve data reuse. Processing elements of customized computation pipelines can also be duplicated to scale the performance. Because of these techniques and FPGAs' energy efficiency, FPGAs have been widely adopted in recent years for accelerating the computation-intensive kernels of standalone applications, achieving 18x to more than 300x kernel speedups [16, 18, 28, 48].

An FPGA implementation is usually based on a hardware description language (HDL) such as Verilog or VHDL, and it requires a comprehensive knowledge of hardware. The recent development of high-level synthesis (HLS) [19] allows programmers to use a C-based language to design FPGA accelerators. However, the learning curve for FPGA programming is usually very steep for software programmers, since the optimal implementation still requires a significant amount of FPGA-specific knowledge.

Due to the power wall and dark silicon [21], FPGA acceleration has become increasingly promising, and OpenCL has emerged as a standard framework for FPGA programming. However, there are several fundamental differences between OpenCL applications for FPGAs and GPUs. Since the architecture of a GPU is fixed, GPU programs can be compiled using a just-in-time (JIT) compiler on the fly. FPGAs, on the other hand, are flexible at the architecture level, but require a much longer compilation time (often several hours). This means that an FPGA accelerator has to be generated in advance as a library and loaded by an OpenCL host program at runtime. Moreover, the OpenCL support for FPGAs is still at an early stage compared to that for GPUs. For example, the Xilinx OpenCL implementation does not support FPGA accelerator sharing by multiple applications. This further motivates our FaaS design for transparent and efficient FPGA accelerator sharing.
3. Blaze System Overview

We design Blaze as a generic system to enable big data applications to easily access FPGA accelerators, and implement it as a third-party package that works with existing ecosystems (i.e., Apache Spark and Hadoop YARN) with lightweight changes. Here we give an overview of the Blaze programming and runtime support and discuss how we address the challenges listed in Section 1.

To provide an easy-to-use programming interface, we abstract FPGA accelerators as a service (FaaS) and propose to decouple the software development of big data applications from the hardware development of FPGA accelerators. This means hardware experts can make the best effort to optimize the accelerator design without being burdened with application complexity, and software developers do not need to be aware of tedious hardware details to take advantage of accelerators. Currently, Blaze provides a set of APIs for Spark programs to offload map computations onto accelerators without any change to the Spark framework. All Spark programmers have to do is to register the predefined FPGA accelerators (developed by hardware experts) into Blaze as a service, and call the Blaze API to access the customized accelerators. All the accelerator sharing and management logic is transparently handled by our Blaze runtime.

The Blaze runtime system integrates with Hadoop YARN to manage accelerator sharing among multiple applications. As illustrated in Figure 2, Blaze includes two levels of accelerator management. A global accelerator manager (GAM) oversees all the accelerator resources in the cluster and distributes them to various user applications. Node accelerator managers (NAMs) sit on each cluster node and provide transparent accelerator access to a number of heterogeneous threads from multiple applications. After receiving the accelerator computing resources from GAM, the Spark application begins to offload computation to the accelerators through NAM. NAM monitors the accelerator status and handles JVM-to-FPGA data movement and accelerator task scheduling. NAM also performs a heartbeat protocol with GAM to report the latest accelerator status.

[Figure 2: Overview of the Blaze runtime system: the global accelerator manager (GAM) performs accelerator-centric scheduling, and a node accelerator manager (NAM) on each node provides FaaS and JVM-to-FPGA communication optimization to application containers.]

We summarize the key features of Blaze as follows.

1. FPGA accelerators as a service (FaaS). The most important role of NAM in the Blaze runtime is providing transparent FaaS shared by multiple application jobs (run on the same node) that request accelerators in a fashion similar to software library routines. Each "logical accelerator" library routine exposes a predefined functionality to a Spark program, and can be composed of multiple "physical accelerators" on multiple hardware platforms (e.g., two FPGAs, or one FPGA and one GPU). FaaS automatically manages the task scheduling between logical and physical accelerators. For example, multiple physical accelerators can be allocated for a single logical accelerator for performance-demanding applications, while one physical accelerator can be shared across multiple logical accelerators if each has a low utilization of that physical accelerator.

2. Accelerator-centric scheduling. In order to solve the global application placement problem under the overwhelming FPGA reprogramming overhead, we propose to manage the logical accelerator functionality, instead of the physical hardware itself, as a resource. We extend the label-based scheduling mechanism in YARN to achieve this goal: instead of configuring node labels as 'FPGA', we propose to use accelerator functionality (e.g., 'KMeans-FPGA', 'Compression-FPGA') as node labels. This helps us to differentiate applications that use the FPGA devices to perform different computations. Therefore, we can delay the scheduling of accelerators with different functionalities onto the same FPGA to avoid reprogramming as much as possible. Different from the current YARN solution, where node labels are configured in YARN's configuration files, node labels in Blaze are configured into NAM through the command line. NAM then reports the accelerator information to GAM through heartbeats, and GAM configures these labels into YARN.

3. Hiding JVM-to-FPGA communication. We also employ well-known techniques such as data caching and task pipelining in FaaS to hide the overwhelming JVM-to-native-to-FPGA communication overhead.

4. Fault tolerance. The FaaS design in each NAM also helps the fault tolerance of the system. Whenever a fault in the accelerator hardware occurs, NAM can allocate different hardware to fulfill the request, or fall back to CPU execution when no more accelerators are available.

5. Facilitating rolling upgrades. FaaS makes it easy to configure heterogeneous accelerator resources on compute nodes in the datacenter, facilitating rolling upgrades of next-generation accelerator hardware and making the system administration of large-scale heterogeneous datacenters more scalable.

In summary, the easy-to-use programming interface, transparent FaaS, and accelerator-centric scheduling of Blaze make FPGA accelerator deployment at datacenter scale much easier than existing approaches. Note that the FaaS framework for NAM is provided as a third-party package without any change to Apache Spark, while the accelerator-centric scheduling for GAM and NAM is provided as a lightweight patch to Hadoop YARN. In Section 4 and Section 5, we present more details about the Blaze programming interface and runtime implementation.

4. Blaze Programming Interface

In this section we first describe the programming interfaces of Blaze from two aspects: how to write a big data application that invokes FPGA accelerators, and how to design and register an FPGA accelerator into Blaze. Then we present our support for data serialization during data transfer between the JVM and accelerators.

4.1 Application Programming Interface
We implement Blaze as a third-party package that works with the existing Spark framework³ without any modification of the Spark source code. Thus, Blaze is not specific to a particular version of Spark. Moreover, the Blaze programming model for user applications is designed to support accelerators with minimal code changes. To achieve this, we extend the Spark RDD to AccRDD, which supports accelerated transformations. We explain the detailed usage of AccRDD in Listing 1 with an example of logistic regression.

³ Blaze also supports C++ applications with similar interfaces, but we will mainly focus on Spark applications in this paper.

Listing 1: Blaze application example (Spark Scala)

  val points = sc.textFile(filePath).cache()
  val train = blaze.wrap(points)
  for (i <- 0 until ITERATIONS) {
    val bcW = sc.broadcast(weights)
    val gradients = train.map(
      new LogisticAcc(bcW)
    ).reduce((a, b) => a + b)
    weights -= gradients
  }

  class LogisticAcc(w: Broadcast_var[V])
      extends Accelerator[T, U] {
    val id: String = "LRGradientCompute"
    def call(p: T): U = {
      localGradients.compute(p, w.value)
    }
    ...
  }

In Listing 1, training data samples are loaded from a file and stored in an RDD points, and are used to train weights by calculating gradients in each iteration. To accelerate the gradient calculation with Blaze, the RDD points first needs to be extended to the AccRDD train by calling the Blaze API wrap. Then an accelerator function, LogisticAcc, can be passed to the .map transformation of the AccRDD. This accelerator function is extended from the Blaze interface Accelerator by specifying an accelerator id and an optional compute function for the fall-back CPU execution. The accelerator id specifies the desired accelerator service, which in the example is "LRGradientCompute".
The fall-back CPU function will be called when the accelerator service is not available. This interface is provided with fault-tolerance and portability considerations in mind. In addition, Blaze also supports caching for Spark broadcast variables to reduce JVM-to-FPGA data transfer. This will be elaborated in Section 5.3.

The application interface of Blaze can be used by library developers as well. For example, Spark MLlib developers can include Blaze-compatible code to provide acceleration capabilities to end users. With Blaze, such capabilities are independent of the execution platform: when accelerators are not available, the same computation will be performed on the CPU, and accelerators are totally transparent to the end users. In our evaluation, we created several implementations for Spark MLlib algorithms such as logistic regression and K-Means using this approach.

4.2 Accelerator Programming Interface

For accelerator designers, the programming experience is decoupled from any application-specific details. An example of the interface implementing the "LRGradientCompute" accelerator from the prior subsection is shown in Listing 2.

Our accelerator interface hides the details of FPGA accelerator initialization and data transfer by providing a set of APIs. In this implementation, for example, the user inherits the provided template, Task, and the input and output data can be obtained by simply calling the getInput and getOutput APIs. No explicit OpenCL buffer manipulation is necessary for users. The runtime system will prepare the input data and schedule it to the corresponding task. The accelerator designer can use any available programming framework to implement an accelerator task as long as it can be integrated with an interface in C++.

Listing 2: Blaze accelerator example (C++)

  class LogisticTask : public Task {
  public:
    LogisticTask(): Task(NUM_ARGS) {}
    // overwrite the compute function
    virtual void compute() {
      int num_elements = getInputLength(...);
      double *in  = (double*)getInput(...);
      double *out = (double*)getOutput(...);
      // perform computation
      ...
    }
  };

4.3 Serialization Support

The input and output data of Spark tasks need to be serialized and deserialized, respectively, before they are transferred to and from accelerator platforms. The Blaze implementation includes its own (de)serializer for primitive data types, because the existing Java version is not sufficient for handling the data layout required by accelerators. In addition, Blaze also provides an interface for users to implement their own (de)serializer methods. As a result, users are allowed to use arbitrary data types in the Spark application, as long as the corresponding (de)serializer is able to process the data to match the accelerator interface.
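The paper does not spell out the (de)serializer interface itself, so the following Scala sketch only illustrates, under our own assumptions, the kind of routine a user-supplied (de)serializer has to provide: flattening an application-level record into the primitive layout an accelerator expects, and rebuilding it on the way back. The case class and function names are ours, not Blaze's.

  // Illustrative only: a record type and a flat layout an accelerator might expect.
  case class Sample(label: Double, features: Array[Double])

  // Serialize: [label, f0, f1, ..., f(n-1)] as one contiguous Array[Double].
  def serialize(s: Sample): Array[Double] =
    Array(s.label) ++ s.features

  // Deserialize: invert the layout, given the known feature length.
  def deserialize(buf: Array[Double], numFeatures: Int): Sample =
    Sample(buf(0), buf.slice(1, 1 + numFeatures))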

5. Blaze Runtime Support

In this section we present our Blaze runtime support, including the FaaS implementation to share accelerators among multiple heterogeneous threads in a single node, accelerator-centric scheduling to alleviate the FPGA reprogramming overhead, communication optimization to alleviate the JVM-to-FPGA overhead, and fault tolerance and security support.

5.1 FPGA-as-a-Service (FaaS)

Blaze facilitates FaaS in NAM through two levels of queues: task queues and platform queues. The architecture of NAM is illustrated in Figure 3. Each task queue is associated with a "logical accelerator", which represents an accelerator library routine. When an application task requests a specific accelerator routine, the request is put into the corresponding task queue. Each platform queue is associated with a "physical accelerator", which represents an accelerator hardware platform such as an FPGA board. The tasks in a task queue can be executed by different platform queues depending on the availability of implementations. For example, if both GPU and FPGA implementations of the same accelerator library routine are available, the tasks of that routine can be executed on both devices.

[Figure 3: Node accelerator manager design to enable FPGA accelerators as a service (FaaS): an application scheduler dispatches requests from application containers into per-routine task queues, and a task scheduler dispatches tasks from task queues onto per-device platform queues.]

This mechanism is designed with three considerations: 1) application-level accelerator sharing, 2) minimizing FPGA reprogramming, and 3) efficient overlapping of data transfer and accelerator execution to alleviate the JVM-to-FPGA overhead. We elaborate the first two considerations in the rest of this subsection, and discuss 3) in Section 5.3.

In Blaze, accelerator devices are owned by NAM rather than by individual applications. The reasoning behind this design is our observation that in most big data applications the accelerator utilization is less than 50%; if the accelerator were owned by a specific application, much of the time it would sit idle, wasting energy. The application-level sharing inside NAM is managed by a scheduler that sits between application requests and task queues. In this paper, a simple first-come-first-serve scheduling policy is implemented. We leave the exploration of different policies to future work.

The downside of providing application sharing is the additional overhead of data transfer between the application process and the NAM process. For latency-sensitive applications, Blaze also offers a reservation mode where the accelerator device is reserved for a single application, i.e., a NAM instance is launched inside the application process.

The design of the platform queue focuses on mitigating the large overhead of FPGA reprogramming. For a processor-based accelerator such as a GPU, beginning to execute a different "logical accelerator" simply means loading another program binary, which incurs minimal overhead. With an FPGA, on the other hand, the reprogramming takes much longer. An FPGA device contains an array of logic cells, and programming it effectively configures the logic function and connections of each cell. Each configuration is called a "bitstream", and it typically takes 1~2 seconds to program an FPGA with a given bitstream. Such a reprogramming overhead makes it impractical to use the same scheme as for GPUs in the runtime system. In Blaze, a second scheduler sits between the task queues and the platform queues to avoid frequent reprogramming of the same FPGA device. More details about the scheduling policy are presented in the next subsection.
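To make the two-level queue structure concrete, here is a deliberately simplified Scala model of the dispatch logic described above. It is our own illustration, not NAM's actual implementation (which is native C++/OpenCL), and names such as AccTask and NodeAcceleratorManagerModel are hypothetical.

  import scala.collection.mutable

  // One FIFO task queue per logical accelerator routine (e.g., "LRGradientCompute").
  case class AccTask(appId: String, routine: String, payload: Array[Byte])

  class NodeAcceleratorManagerModel(platformsByRoutine: Map[String, Seq[String]]) {
    private val taskQueues = mutable.Map[String, mutable.Queue[AccTask]]()

    // Application scheduler: first-come-first-serve into the routine's task queue.
    def submit(task: AccTask): Unit =
      taskQueues.getOrElseUpdate(task.routine, mutable.Queue.empty).enqueue(task)

    // Task scheduler: pick a platform (FPGA/GPU) that has an implementation of the
    // routine; a real scheduler would also weigh reprogramming cost (Section 5.2).
    def dispatch(routine: String): Option[(AccTask, String)] =
      for {
        queue    <- taskQueues.get(routine) if queue.nonEmpty
        platform <- platformsByRoutine.getOrElse(routine, Seq.empty).headOption
      } yield (queue.dequeue(), platform)
  }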
5.2 Accelerator-centric Scheduling

In order to mitigate the FPGA reprogramming overhead, it is better to group the tasks that need the same accelerator onto the same set of nodes. The ideal situation is that each cluster node only gets tasks that request the same accelerator, in which case FPGA reprogramming is not needed. Figure 4 illustrates how grouping accelerator tasks can reduce the FPGA reprogramming overhead.

[Figure 4: Different resource allocation policies. In this example, each cluster node has one FPGA platform and two accelerator implementations, "gradient" and "distance sum". Four applications are submitted to the cluster, requesting different accelerators. (a) Naive allocation: applications on a node use different accelerators, which leads to frequent FPGA reprogramming. (b) Ideal allocation: applications on a node use the same accelerator, and thus there is no FPGA reprogramming.]

By managing logical accelerator functionality as a resource, we propose an accelerator-locality-based delay scheduling policy to dynamically partition the cluster at runtime, avoiding launching mixed FPGA workloads on the same cluster node as much as possible. During accelerator allocation in GAM, we consider the nodes in the following order of scheduling priority: 1) idle nodes that do not have any running containers; 2) nodes that run similar workloads; 3) nodes that run a different set of workloads. Specifically, we define an affinity function that describes the i-th node's affinity to an application as f_i = n_acc / n, where n_acc is the number of containers on this node that use the same logical accelerator (or label), and n is the total number of containers on this node. A node with higher affinity represents a better scheduling candidate. An idle node, which has zero running containers, has the highest affinity and is considered the best scheduling candidate.

GAM tries to honor nodes with higher accelerator affinity by using so-called delay scheduling. At runtime, each NAM periodically sends a heartbeat to the GAM, which represents a scheduling opportunity. The GAM scheduler does not simply use the first scheduling opportunity it receives. Instead, it may skip a few scheduling opportunities and wait a short amount of time for a scheduling opportunity with a better accelerator affinity. In our implementation, we maintain a threshold function for each application, which linearly decreases as the number of missed scheduling opportunities increases. A container is allocated on a node only if the node's accelerator affinity is higher than the threshold function.
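The affinity test and the linearly decaying threshold can be written down in a few lines. The Scala sketch below is our own reading of the policy described above; the decay constant and the convention that an idle node gets affinity 1.0 are assumptions, not values given in the paper.

  // Affinity of a node for an application that needs `label`:
  //   f = (#containers on the node using the same label) / (#containers on the node),
  // with idle nodes treated as the best candidates (assumed affinity 1.0).
  def affinity(containersByLabel: Map[String, Int], label: String): Double = {
    val total = containersByLabel.values.sum
    if (total == 0) 1.0
    else containersByLabel.getOrElse(label, 0).toDouble / total
  }

  // Delay scheduling: the acceptance threshold decays linearly with every
  // missed scheduling opportunity (decay rate chosen arbitrarily here).
  def threshold(missedOpportunities: Int, decayPerMiss: Double = 0.1): Double =
    math.max(0.0, 1.0 - decayPerMiss * missedOpportunities)

  def shouldAllocate(nodeAffinity: Double, missed: Int): Boolean =
    nodeAffinity >= threshold(missed)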
5.3 Hiding JVM-to-FPGA Communication

In order for a Spark program to transfer data to an FPGA accelerator, the data first has to be moved from the JVM to native memory, and then moved to the FPGA device memory through a PCIe connection. Such data movement between the host CPU and FPGA accelerators can sometimes diminish or even reverse the overall system performance gains [15]. To mitigate this overhead, Blaze adopts the following well-known techniques within the FaaS framework.

1. Task pipelining. Most datacenter workloads have multiple threads/tasks sharing the same accelerator, which creates an opportunity to hide data transfer behind task execution by pipelining: the task queue in NAM adopts an asynchronous communication scheme that overlaps JVM-to-FPGA data communication with FPGA accelerator execution.

2. FPGA data caching. Many big data applications like machine learning use iterative algorithms that repeatedly perform computation on the same set of input data. This provides the opportunity to cache the data in the FPGA device memory and thus avoid the most time-consuming native-to-FPGA data movement through PCIe. To be more specific, our FaaS framework implements a Block Manager to maintain a data reuse table that records the mapping from a native data block to an FPGA device memory block. For the case of OpenCL, the Block Manager manages a table of OpenCL buffer objects that are mapped to device memory. A flag is used to indicate whether the programmer wants Blaze to cache an input data block. In Spark, the flag is automatically assigned if the user specifies .cache() for the input RDD.

3. Broadcast data caching. Most data analytic frameworks such as Spark support data sharing across the cluster nodes. In Spark, this is provided as broadcast data. Similarly, Blaze also supports broadcast data caching to minimize data transfer across the cluster nodes. A broadcast block only needs to be transferred to the NAM once, and it is cached inside the Block Manager throughout the application's life cycle.
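From the application side, both caching hooks are driven by ordinary Spark calls. The snippet below is a hypothetical usage sketch (variable names are ours) showing the two signals Blaze is described as picking up: .cache() on the input RDD for FPGA data caching, and sc.broadcast(...) for broadcast data caching.

  // Marking the input RDD as cached also tells Blaze it may keep the
  // corresponding blocks resident in FPGA device memory across iterations.
  val points = sc.textFile(inputPath).cache()

  // Broadcast data (e.g., the model weights) is shipped to each NAM once
  // and reused from the Block Manager for the rest of the application.
  val bcWeights = sc.broadcast(weights)

  // The wrapped AccRDD from Listing 1 (blaze.wrap(points)) would consume both.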
5.4 Fault Tolerance and Security Issues

Fault tolerance is inherent in our proposed transparent accelerator scheduling. All accelerator-related errors are caught at the application level, and the CPU implementation is used to resume the execution. Errors of accelerators inside NAM are handled in a fashion similar to Spark or YARN. A counter is used for each accelerator task per platform, keeping track of the number of errors incurred. If the failure is persistent for one accelerator task, the task is removed from NAM's configuration. This information is also propagated to GAM in the heartbeat signals, and GAM removes the corresponding label for this node.

Based on the description of the Blaze accelerator interface in Section 4.2, the accelerator task implementation only has access to its private input data through the provided interface, such as getInput(). The data can only be assigned by NAM based on the dependency, and all input data is read-only. Our underlying platform implementation is based on existing accelerator runtime systems such as OpenCL, so we rely on the runtime implementation to guarantee security at the device level. In general, the security issues in FPGA-enabled datacenters remain an open and interesting direction for future work.

6. Experimental Results

We now evaluate the programming efforts and system performance of deploying FPGA accelerators in datacenters using Blaze. First we present the hardware and software setup, and describe the four representative large-scale applications we chose, which cover two extremes: iterative algorithms like machine learning, and streaming algorithms like compression and genome sequencing. We evaluate the programming efforts to write these applications using Blaze in terms of lines of code (LOC). Then we evaluate the overall system speedup and energy savings for each individual application obtained by putting FPGA accelerators into the cluster. We also analyze the FaaS overhead and break down the performance improvement of each optimization. Finally, we analyze multi-job executions and the efficiency of our accelerator-centric scheduling policy in the global accelerator management.

6.1 Experimental Setup

The experimental platform we use is a local standard CPU cluster with up to 20 nodes, among which 4 nodes⁴ are integrated with FPGA cards in PCI-E slots. Each server has dual-socket Intel Xeon E5-2620v3 CPUs with 12 cores in total and 64GB of main memory. The FPGA card is an AlphaData ADM-PCIE-7V3, which contains a Xilinx Virtex-7 XC7VX690T-2 FPGA chip and 16GB of on-board DDR3 memory. The FPGA board can be powered by PCI-E alone and consumes around 25W, which makes it deployable into commodity datacenters.

⁴ We are planning to install more FPGA cards in the near future.

The software framework is based on a community version of Spark 1.5.1 and Hadoop 2.6.0. The accelerator compilation and runtime are provided by the vendor toolkits. For the AlphaData FPGA cards, we use the OpenCL flow provided by the Xilinx SDAccel tool-chain, where the OpenCL kernels are synthesized into bitstreams to program the FPGA.

We choose a set of four representative compute-intensive large-scale applications. They cover two extremes: iterative machine learning algorithms like logistic regression and K-means clustering, and streaming algorithms like genome sequencing analysis and Apache Parquet compression.

1. Logistic regression (LR). The baseline LR is the training application implemented by Spark MLlib [11] with the LBFGS algorithm. The software baseline uses netlib with the native BLAS library. The computation kernels we select are the logistic gradients and the loss function calculation. The kernel computation takes about 80% of the total application time.

2. K-Means clustering (KM). The KM application is also implemented using Spark MLlib, which uses netlib with the native BLAS library. The computation kernel we select is the local sum of center distances calculation. The datasets used in KM are the same as for LR, and the percentage of kernel computation time is also similar to LR.

3. Genome sequence alignment (GSA). The GSA application is from the open-source Cloud Scale BWAMEM (CS-BWAMEM) software suite [17], which is a scale-out implementation of the BWA-MEM algorithm [29] widely used in the bioinformatics area. The algorithm aligns the short reads from the sequencer to a reference genome. We mainly focus on the alignment step of this application, which uses the Smith-Waterman algorithm, as we did in a prior case study [15].

4. Apache Parquet compression (COMP). Apache Parquet [2] is a compressed and efficient columnar data representation available to any project in the Hadoop/Spark ecosystem. Such columnar data generally have good compression rates and thus are often compressed for better spatial utilization and less data communication. We mainly focus on the compression (deflater) step, which is computation-bound and common across various applications. We use two software baselines: 1) the Java Gzip implementation that uses both the LZ77 algorithm and Huffman encoding, which has a better compression ratio but low throughput; and 2) the open-source Snappy implementation [10] that uses a JNI wrapper to call the C++ Snappy library based on the LZ77 algorithm, which has a lower compression ratio but better throughput.

The input data for LR and KM are based on a variant of the MNIST dataset [8] with 8 million records, sampled such that on average each node processes 2-4GB of data. The data set of GSA is a sample of HCC1954, which is a single person's whole genome. The input data for COMP is the first 100 kilo short reads in HCC1954.

The FPGA accelerators for all applications are designed in-house. The accelerator specifications for LR and KM can be found in [18], and the Smith-Waterman implementation is based on [16]. Our compression FPGA accelerator is designed based on the Gzip implementation with both the LZ77 algorithm and Huffman encoding. Table 1 presents an overview of the accelerator speedup compared to the 12-thread CPU software baseline in terms of throughput improvement. We set --num-executors to 1 and --executor-cores to 12 in Spark. For COMP, 12 independent streams on the CPU are used to take advantage of all cores. The accelerator design details are omitted in this paper, since our focus is on evaluating the benefits of integrating FPGA accelerators into big data applications using Blaze.

Table 1: FPGA accelerator performance profile
  Application | Kernel        | Speedup
  LR          | Gradients     | 3.4x
  KM          | DistancesSum  | 4.3x
  GSA         | SmithWaterman | 10x
  COMP        | Deflater      | 26.7x over Gzip, 3x over Snappy

Currently, we only run the kernel computation on FPGAs for the FPGA-related experiments as a proof of concept. We will consider efficient CPU-FPGA co-working in our future work, which will provide higher performance than our currently reported results.
6.2 Programming Efforts

We begin the analysis by showing Blaze's benefits in reducing the deployment effort of integrating existing FPGA accelerators into big data applications. The results are shown in Table 2, where the lines-of-code (LOC) breakdown is listed for the selected applications. The hardware code to design the accelerators is exactly the same between the manual and Blaze implementations and is decoupled from software developers, so it is excluded from this comparison. As an illustration of the complexity of accelerator designs, it usually takes an experienced hardware engineer around 4 to 24 weeks to implement an efficient FPGA accelerator kernel, which is a big overhead for big data application developers like Spark programmers. In this paper, the LR, KM, GSA, and COMP accelerators took a senior graduate student 4, 4, 24, and 16 weeks, respectively, to implement and optimize. Column 'App' in Table 2 shows the code changes needed to modify the big data applications so as to access accelerators in the application code. Column 'ACC Setup' shows the code changes for PCIe data transfer and accelerator invocation through OpenCL. Finally, column 'Partial FaaS' shows the code changes needed to enable sharing accelerators among multiple threads within the application.

Table 2: Comparison of accelerator deployment efforts in terms of lines-of-code (LOC) changes

  Manual integration:
    Application | App | ACC Setup | Partial FaaS*
    LR          | 26  | 104       | 325
    KM          | 37  | 107       | 364
    GSA         | 0†  | 227       | 896
    COMP        | 0†  | 70        | 360

  Blaze integration:
    Application | App | ACC Setup | FaaS
    LR          | 9   | 99        | 0
    KM          | 7   | 103       | 0
    GSA         | 0†  | 142       | 0
    COMP        | 0†  | 65        | 0

  * Partial FaaS does not support accelerator sharing among different applications, compared to full FaaS.
  † In both GSA and COMP, the accelerator kernels are wrapped as a function replacing the original software version, so no software code change is counted.

Although using LOC to represent the programming effort is not entirely accurate, it provides a rough illustration of the difference between the implementation methods.

Among the breakdown of LOCs, most of the 'ACC Setup' code for accelerator control can be reused as long as the accelerator is fixed. We can see that deploying FPGA accelerators in big data applications using Blaze is very easy, with less than 10 LOC changes in the application and a one-time change of roughly 100 LOC for accelerator setup. Without Blaze, even a manual design for partial FaaS, which only supports accelerator sharing among multiple threads within a single application, requires 325 to 896 LOC changes for every application.

6.3 Overall System Performance and Energy Gains for a Single Application

Figure 5 demonstrates the single-node system speedup⁵ and energy reduction for our application case studies using Blaze and FPGA accelerators. For each individual job, we measure the overall job time and estimate the overall energy consumption based on the average power measured during application runtime. As mentioned earlier, we only run the kernel computation on FPGAs. Compared with the CPU baseline, the system with FPGAs achieves a 1.7× to 3× speedup in overall system throughput, and a 1.5× to 2.7× improvement in system energy consumption. (Note that FPGAs introduce an additional 25 watts per node into the system; therefore the achieved energy efficiency is slightly smaller than the performance speedup numbers.) This confirms that computation-intensive big data applications can take full advantage of FPGA acceleration with Blaze.

[Figure 5: Single-node system performance and energy gains for each individual application (LR, KM, GSA, COMP). The CPU baseline used for COMP is Snappy.]

⁵ For Figures 5, 6, and 7, the experiments are done by configuring --executor-cores to 12 and --num-executors to the number of nodes in Spark.

Moreover, we compare the performance of a 4-node cluster with FPGAs to CPU-only clusters with 4, 8, and 12 nodes. As shown in Figure 6, for LR and KM, a 4-node cluster with FPGA accelerators enabled can provide roughly the same throughput as a cluster of 12 CPU nodes. This indicates that we can reduce the conventional datacenter size by 3× by putting an FPGA into each server node, while achieving the same throughput.

[Figure 6: Performance of LR and KM on multiple nodes. The X-axis represents the experiment configurations; for example, "4N×12T CPU" represents the configuration of 4 CPU-only nodes with 12 threads on each node.]

Finally, Figure 7 presents the execution time breakdown of the Spark jobs (the entire application instead of the kernel task execution time) on a 4-node cluster before and after FPGA acceleration. The results confirm that machine learning workloads such as LR and KM are computation-intensive, and that their computation kernels benefit from FPGA acceleration. Note that the data load and preprocessing part of the original Spark program remains on the CPU, i.e., it is not accelerated by the FPGA.

[Figure 7: Execution time breakdown (data load and preprocessing, scheduler, compute, shuffle) for LR and KM before and after FPGA acceleration on multiple nodes.]

6.4 FaaS Overhead Analysis

To evaluate the potential overhead that Blaze introduces to provide FaaS, we evaluate the performance of the Blaze integration against a reference manual integration. To keep the analysis simple, we focus on the streaming COMP application. We first measure the compression throughput normalized to the reference manual design for the 1-core and 12-core cases. As shown in Figure 8(a), for the two software baselines, the native Snappy implementation is around 10× faster than the Java Gzip implementation. For the single-core version, a manual integration of the compression FPGA accelerator achieves around 8.5× speedup over Snappy, while the Blaze integration achieves around 5.6× speedup. When there are 12 cores, the fully parallelized software implementation gets a significant speedup, while the Blaze integration and the manual integration achieve similar performance, which is 1.7× better than Snappy.

We then analyze why the Blaze integration has more overhead than the manual integration in the single-core case. We break the execution time down into FPGA kernel execution, JVM-to-native data movement, native-to-FPGA data movement, and, for Blaze, an additional private-to-shared memory movement in native memory. The detailed breakdown is illustrated in Figure 8(b). As we can see, Blaze introduces the overhead of moving data from the application's private memory to the shared memory, which is required to manage accelerator sharing by multiple applications and costs around 50% more execution time. Figure 8(b) also confirms that the JVM-to-FPGA communication overhead occupies an overwhelming 76% of the total execution time in the single-core COMP application. Due to the multi-thread nature of big data applications, such communication overhead can be alleviated using the task pipelining (and data caching) that is transparently supported by our FaaS framework in Blaze. As a result, we see comparable performance between our Blaze FaaS integration and a manual integration when there are 12 cores.

[Figure 8: FaaS overhead analysis for the COMP application: (a) compression throughput normalized to the manual design for 1 and 12 threads (SW-GZIP, SW-Snappy, FPGA-Manual, FPGA-Blaze); (b) single-core execution time breakdown (JVM-to-native, native-private-to-shared, native-to-FPGA, FPGA kernel) for the manual and Blaze integrations.]

6.5 Breakdown of FaaS Optimizations

We show the breakdown of the performance improvements contributed by each JVM-to-FPGA communication optimization in Figure 9. We start from a naive FaaS without task pipelining or data caching, and then gradually add task pipelining and data caching. For each FaaS setup, we evaluate the targeted task accelerating the kernel instead of the entire application, i.e., the FPGA task time, which includes both the time of data transfer to the FPGA via PCIe and the FPGA kernel time. The FPGA kernel time stays the same across the different cases, since the total computation that needs to be performed on the FPGA remains the same.

As shown in Figure 9, naive offloading of the workload to the accelerator may result in a slowdown rather than a speedup, e.g., a 6.71× slowdown for LR and a 6.95× slowdown for KM, due to the aforementioned JVM-to-FPGA overhead. By enabling task pipelining, the total time is improved by a factor of 2.8× to 3.8×. For the iterative computation of LR and KM, data caching provides a huge performance improvement, since most of the data transfer is mitigated. Since GSA and COMP process all data only once, the results with and without data caching are identical, and the caching results are thus omitted in Figure 9.

[Figure 9: Breakdown of the JVM-to-FPGA communication optimizations in FaaS: FPGA task time normalized to the CPU baseline for LR, KM, GSA, and COMP under naive offloading, with task pipelining, and with data caching.]

The benefits of task pipelining and data caching can be better illustrated using the accelerator utilization metric. In Figure 10 we show the different utilization patterns of a single LR application running on an FPGA. The accelerator utilization is defined as the ratio of accelerator execution time within a sampled interval of application execution time. The utilization is consistently low in the case without pipelining or caching, since the accelerator keeps waiting for data to be transferred from the application. In the first part of the figure, when pipelining is enabled, the accelerator reaches high utilization periodically: the first batch of data at the beginning of each iteration still needs to be transferred before the accelerator can start, but once the pipeline begins, the accelerator can be kept busy. In the second part, once data caching is enabled and data is continuously flowing in, the accelerator utilization increases dramatically. Similar results can be observed for the KM workloads as well. The high accelerator utilization for the KM and LR applications under full-featured FaaS again confirms that the Blaze runtime overhead is negligible.

[Figure 10: Accelerator utilization of an LR application running on a single FPGA, with and without task pipelining and data caching.]

6.6 Multi-Job Scheduling Analysis

To evaluate the effectiveness of GAM's resource allocation policy (i.e., accelerator-centric scheduling), we choose seven sets of workloads on a 4-node CPU-FPGA cluster. Each set of workloads contains LR and KM applications with various input data sizes, and the ratio of these two applications varies among the different sets of workloads.

We compare GAM with two baselines: static partition and naive sharing. In static partition, we evenly partition the 4 nodes into two sets: 2 nodes only run LR applications and the other 2 nodes only run KM applications. Therefore, FPGA reprogramming never occurs in these experiments. In naive sharing, all the FPGA nodes can run both LR and KM workloads, and we use the Apache YARN default resource allocation policy. Our GAM has settings similar to naive sharing, but uses our accelerator-centric scheduling policy. We also calculate the offline theoretical optimal scheduling results, in which case we assume that all the sets of submitted workloads are known beforehand.

Figure 11 plots the system throughput normalized to the theoretical optimal, and the accelerator utilization. Comparing the results of the different workloads, we find that static partition performs better when the cluster is partitioned in a way that matches the workload mix (i.e., when the ratio of LR workloads is close to 0.5), while naive sharing performs better when the workloads only contain LR or KM applications (the ratio is 1 or close to 0), since the applications can then use all 4 FPGA nodes. However, the advantages of naive sharing decline as the workloads become more heterogeneous, due to the FPGA reprogramming overhead. GAM incorporates the best aspects of static partition and naive sharing: it potentially allows applications to use all FPGAs in the cluster (shown as the accelerator utilization rate in Figure 11(b)), and meanwhile it reduces the FPGA reprogramming overhead by placing similar workloads onto the same set of nodes. On average, static partition and naive sharing are 27% and 22% away from the theoretical optimal results, while GAM is only 9% away from the optimal results.

[Figure 11: Normalized system throughput and accelerator utilization of mixed LR-KM workloads on a CPU-FPGA cluster under static partition, naive sharing, GAM, and the theoretical optimal, across different ratios of LR workloads in the mix.]

7. Related Work

There are several projects on the inclusion of heterogeneous architectures in big-data analytic frameworks. In this section we first discuss the projects that manage large-scale clusters and their support for accelerators. Then we review existing runtime systems that were designed and implemented for CPU-GPU datacenters. Finally, we consider the existing systems especially designed for FPGAs. As we mentioned in Section 1, the approaches for GPUs are almost not applicable to FPGAs.

Cloud-scale resource management. Resource managers for datacenter-scale clusters of machines have a long history. Examples include virtual machine provisioning services, systems that provision and manage long-running software, and cluster management systems widely used in scientific computing for workloads such as MPI and HTCondor [5]. The most fundamental difference between these systems and resource managers such as Hadoop YARN is that YARN specifically targets data processing jobs, which elastically request transient resources and return those resources when the job completes.⁶ Such jobs must be written with the assumption that resources can be preempted or fail, and must save partial state as required to avoid recomputation. Resource managers similar to YARN include Corona [4], Mesos [24], and Omega [37]. However, none of these yet provide support for FPGA management.

⁶ This is the intended use case for YARN. However, not all big data systems follow this design principle.

Distributed runtime systems for GPUs. There are several works on managing a cluster of GPU machines. [7] demonstrates running deep learning jobs on a GPU cluster managed by YARN at Yahoo scale; their system leverages YARN's node label capabilities to allow jobs to state whether they should be launched on CPU or GPU nodes. Dandelion [35] uses Moxie, a cluster dataflow engine, to schedule jobs represented as dataflow graphs onto a cluster of powerful machines with GPUs. The high-level architecture of Moxie is similar to GPU cluster management [26] and YARN. Although these techniques can be used to manage FPGA systems, they cannot achieve the same efficiency as our Blaze because they lack consideration of the FPGA reprogramming overhead. There are also projects that include GPUs in single-application runtime systems: Caffe [27] is a C-based distributed system for convolutional architecture acceleration on GPUs, and HeteroSpark [30] is a CPU-GPU Spark framework; both have users adopt fixed, provided accelerator functions to describe their applications. On the other hand, since the compilation time for GPUs is negligible, a more recent line of work provides fully programmable frameworks that compile and generate GPU kernel code on the fly. For example, MapCG [25], GPMR [40], and Glasswing [20] allow users to write their own map/reduce functions, but their programming models still expose low-level, hardware-architecture-aware APIs such as the thread ID. In addition, HeteroDoop [36], HadoopCL [22], and SWAT [23] propose user-friendly programming models for Hadoop and Spark users to write map/reduce functions without considering the underlying architecture, but sacrifice some performance improvement opportunities.

Distributed runtime systems for FPGAs.
Distributed runtime systems for FPGAs. There are some research projects that try to accelerate big-data analytic frameworks using FPGAs. FPMR [39] attempts to implement an entire MapReduce framework on FPGAs so that the data communication overhead can be eliminated. However, FPMR still requires users to write customized map/reduce functions in RTL, and it supports only very limited MapReduce features. Axel [41] and [44] are C-based MapReduce frameworks for not only FPGAs but also GPUs. Both frameworks have straightforward scheduling mechanisms to allocate tasks to either FPGAs or GPUs. Melia [43] presents a C++ based MapReduce framework on OpenCL-based Altera FPGAs. It generates Altera FPGA accelerators from user-written C/C++ map/reduce functions and executes them on FPGAs along with CPUs. Although the experimental results in [43] evaluate the quality of the generated accelerators, the results for multi-node clusters come only from simulations instead of end-to-end evaluations. Since all of the above systems are implemented in C/C++ as standalone frameworks, they are not compatible with widely used JVM-based big data analytic systems such as Hadoop and Spark.

On the other hand, to bridge the gap between JVM-based frameworks and C-based FPGA accelerators, [45] deploys Hadoop on a server with NetFPGAs connected via an Ethernet switch. However, the programming model in [45] exposes low-level interactions with the FPGA device to users. In addition, no results are provided to show whether this system scales reasonably when extended to a cluster of multiple servers. Different from [45], the Zynq-based cluster [31] deploys Hadoop on a cluster of Xilinx Zynq SoC devices [34], where CPU cores and programmable logic are fabricated on the same chip. Although the system is energy efficient because of the FPGAs, this methodology is tightly coupled with the underlying Zynq platform and is hard to port to clusters with commodity CPU servers. One of the first studies that integrates PCIe-based high-performance FPGA accelerators into Spark running on commodity clusters is our case study for genome sequence alignment in [15]. Compared to [15], Blaze provides generic programming and runtime support to deploy any FPGA accelerator, and manages accelerator sharing by multiple applications.

FPGA virtualization has been discussed in [14], where multiple FPGA accelerators that are designed with certain coding templates can be placed on the same FPGA board and managed through OpenStack. Compared to our work, this mechanism imposes a limited programming model for accelerator designs due to the use of coding templates. Moreover, dividing the FPGA resources into several regions leaves less logic available to each accelerator, which limits accelerator performance.

SparkCL [38] works in a direction that is orthogonal to our Blaze system described in this paper. While Blaze assumes predefined FPGA accelerators, SparkCL adapts Aparapi [3] to automatically generate an OpenCL kernel for Altera FPGAs and executes the FPGA accelerators on Spark. However, the programming model of SparkCL discloses some low-level OpenCL APIs, such as the thread ID, to users and supports only primitive data types. In addition, it lacks experimental results to illustrate system efficiency and scalability. We have another ongoing effort at UCLA to improve automatic Java-to-FPGA code generation, which is orthogonal to Blaze and will be integrated into Blaze in future work.

In summary, to the best of our knowledge, Blaze is the first (open source) system that provides easy and efficient access to FPGA accelerators for big data applications that run on top of Apache Spark and Hadoop YARN.

8. Conclusion

In this paper we present the design and implementation of Blaze, which provides programming and runtime support that enables rapid and efficient deployment of FPGA accelerators at warehouse scale. Blaze abstracts FPGA accelerators as a service (FaaS), decouples FPGA accelerator development from big data application development, and provides a set of clean programming APIs for big data applications to easily access the performance and energy gains of FPGA accelerators. In the FaaS framework, we provide efficient accelerator sharing among multiple heterogeneous threads, hide the overwhelming Java-to-FPGA data communication overhead, and support fault tolerance. We implement FaaS as a third-party package that works with Apache Spark. In addition, we propose to manage the logical accelerator functionality as a resource instead of the physical hardware platform itself. Using this new concept, we are able to extend Hadoop YARN with an accelerator-centric scheduling policy that better manages global accelerator resources and mitigates the FPGA reprogramming overhead. Our experiments with four representative big data applications demonstrate that Blaze greatly reduces the programming effort and improves the system throughput by 1.7× to 3×, i.e., a 1.7× to 3× datacenter size reduction using FPGAs at the same throughput. We also demonstrate that our FaaS implementation achieves performance similar to a manual design under the dominant multi-thread scenarios in big data applications, while our accelerator-centric scheduling achieves close-to-optimal system throughput.
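The accelerator-centric scheduling idea summarized above can be illustrated with a small, self-contained Scala sketch. This is not the Blaze or YARN implementation; the types, names, and cost constants below are assumptions made only for this example. The point is that the scheduler reasons about the logical accelerator functionality a task needs and prefers nodes whose FPGA is already programmed with it, so the expensive reprogramming step is paid only when unavoidable.

// Illustrative sketch of accelerator-centric placement; all names and costs are hypothetical.
final case class Task(id: String, accelerator: String)
final class Node(val id: String, var loaded: Option[String] = None, var queued: Int = 0)

object AcceleratorCentricScheduler {
  // Assumed relative costs: reprogramming an FPGA dwarfs queueing one more task.
  val ReprogramCost = 10.0
  val QueueCostPerTask = 1.0

  private def cost(task: Task, node: Node): Double = {
    val reprogram = if (node.loaded.contains(task.accelerator)) 0.0 else ReprogramCost
    reprogram + QueueCostPerTask * node.queued
  }

  // Pick the node with the lowest combined reprogramming + queueing cost.
  def place(task: Task, nodes: Seq[Node]): Node = {
    val best = nodes.minBy(n => cost(task, n))
    if (!best.loaded.contains(task.accelerator))
      best.loaded = Some(task.accelerator)   // pay the reprogramming cost once
    best.queued += 1
    best
  }

  def main(args: Array[String]): Unit = {
    val nodes = Seq(new Node("node-1", Some("KMeans")), new Node("node-2", Some("LogReg")))
    val tasks = Seq(Task("t1", "KMeans"), Task("t2", "KMeans"), Task("t3", "LogReg"))
    tasks.foreach(t => println(s"${t.id} (${t.accelerator}) -> ${place(t, nodes).id}"))
  }
}

Under these assumptions, tasks that request the same accelerator naturally cluster on nodes already holding its design, which is the behavior an accelerator-centric policy (as opposed to one that only tracks CPU and memory) aims for.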
9. Acknowledgement

This work is partially supported by the Center for Domain-Specific Computing under the NSF InTrans Award CCF-1436827; funding from CDSC industrial partners including Baidu, Fujitsu Labs, Google, Huawei, Intel, IBM Research Almaden, and Mentor Graphics; C-FAR, one of the six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA; grants NSF IIS-1302698 and CNS-1351047; and U54EB020404 awarded by NIH Big Data to Knowledge (BD2K).

References

[1] Apache Hadoop. https://hadoop.apache.org. Accessed: 2016-05-24.
[2] Apache Parquet. https://parquet.apache.org/. Accessed: 2016-05-24.
[3] Aparapi in AMD developer website. http://developer.amd.com/tools-and-sdks/opencl-zone/aparapi/. Accessed: 2016-05-24.
[4] Facebook engineering (2012), Under the hood: Scheduling MapReduce jobs more efficiently with Corona. https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920. Accessed: 2016-01-30.
[5] HTCondor. https://research.cs.wisc.edu/htcondor. Accessed: 2016-05-24.
[6] Intel to start shipping Xeons with FPGAs in early 2016. http://www.eweek.com/servers/intel-to-start-shipping-xeons-with-fpgas-in-early-2016.html. Accessed: 2016-05-17.
[7] Large scale distributed deep learning on Hadoop clusters. http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop. Accessed: 2016-05-24.
[8] The MNIST database of handwritten digits. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist8m. Accessed: 2016-05-24.
[9] Project Tungsten: Bringing Apache Spark closer to bare metal. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html. Accessed: 2016-08-10.
[10] The snappy-java port. https://github.com/xerial/snappy-java. Accessed: 2016-08-01.
[11] Spark MLlib. http://spark.apache.org/mllib/. Accessed: 2016-05-24.
[12] Xeon+FPGA platform for the data center. https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf. Accessed: 2016-05-17.
[13] Brech, B., Rubio, J., and Hollinger, M. IBM Data Engine for NoSQL - Power Systems Edition. Tech. rep., IBM Systems Group, 2015.
[14] Byma, S., Steffan, J. G., Bannazadeh, H., Garcia, A. L., and Chow, P. FPGAs in the cloud: Booting virtualized hardware accelerators with OpenStack. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on (2014), IEEE, pp. 109–116.
[15] Chen, Y.-T., Cong, J., Fang, Z., Lei, J., and Wei, P. When Apache Spark meets FPGAs: A case study for next-generation DNA sequencing acceleration. In The 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16) (2016).
[16] Chen, Y.-T., Cong, J., Lei, J., and Wei, P. A novel high-throughput acceleration engine for read alignment. In Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on (May 2015), pp. 199–202.
[17] Chen, Y.-T., Cong, J., Li, S., Peto, M., Spellman, P., Wei, P., and Zhou, P. CS-BWAMEM: A fast and scalable read aligner at the cloud scale for whole genome sequencing. High Throughput Sequencing Algorithms and Applications (HITSEQ) (2015).
[18] Cong, J., Huang, M., Wu, D., and Yu, C. H. Heterogeneous datacenters: Options and opportunities. In Proceedings of the 53rd Annual Design Automation Conference (2016), ACM.
[19] Cong, J., Liu, B., Neuendorffer, S., Noguera, J., Vissers, K., and Zhang, Z. High-level synthesis for FPGAs: From prototyping to deployment. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 30, 4 (April 2011), 473–491.
[20] El-Helw, I., Hofman, R., and Bal, H. E. Glasswing: Accelerating MapReduce on multi-core and many-core clusters. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing (New York, NY, USA, 2014), HPDC '14, ACM, pp. 295–298.
[21] Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., and Burger, D. Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on (June 2011), pp. 365–376.
[22] Grossman, M., Breternitz, M., and Sarkar, V. HadoopCL: MapReduce on distributed heterogeneous platforms through seamless integration of Hadoop and OpenCL. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (Washington, DC, USA, 2013), IPDPSW '13, IEEE Computer Society, pp. 1918–1927.
[23] Grossman, M., and Sarkar, V. SWAT: A programmable, in-memory, distributed, high-performance computing platform. The 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC) (2016).
[24] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., Shenker, S., and Stoica, I. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2011), NSDI'11, USENIX Association, pp. 295–308.
[25] Hong, C., Chen, D., Chen, W., Zheng, W., and Lin, H. MapCG: Writing parallel program portable between CPU and GPU. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (New York, NY, USA, 2010), PACT '10, ACM, pp. 217–226.
[26] Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review (2007), vol. 41, ACM, pp. 59–72.
[27] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014).
[28] Choi, Y. K., and Cong, J. Acceleration of EM-based 3D CT reconstruction using FPGA. IEEE Transactions on Biomedical Circuits and Systems 10, 3 (June 2016), 754–767.
[29] Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013).
[30] Li, P., Luo, Y., Zhang, N., and Cao, Y. HeteroSpark: A heterogeneous CPU/GPU Spark platform for machine learning algorithms. In Networking, Architecture and Storage (NAS), 2015 IEEE International Conference on (Aug 2015), pp. 347–348.
[31] Lin, Z., and Chow, P. ZCluster: A Zynq-based Hadoop cluster. In Field-Programmable Technology (FPT), 2013 International Conference on (Dec 2013), pp. 450–453.
[32] Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., and Chun, B.-G. Making sense of performance in data analytics frameworks. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15) (Oakland, CA, May 2015), USENIX Association, pp. 293–307.
[33] Putnam, A., Caulfield, A. M., Chung, E. S., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh, H., Fowers, J., Gopal, G. P., Gray, J., Haselman, M., Hauck, S., Heil, S., Hormati, A., Kim, J.-Y., Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J., Xiao, P. Y., and Burger, D. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on (June 2014), pp. 13–24.
[34] Rajagopalan, V., Boppana, V., Dutta, S., Taylor, B., and Wittig, R. Xilinx Zynq-7000 EPP: An extensible processing platform family. In 23rd Hot Chips Symposium (2011), pp. 1352–1357.
[35] Rossbach, C. J., Yu, Y., Currey, J., Martin, J.-P., and Fetterly, D. Dandelion: A compiler and runtime for heterogeneous systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), ACM, pp. 49–68.
[36] Sabne, A., Sakdhnagool, P., and Eigenmann, R. HeteroDoop: A MapReduce programming system for accelerator clusters. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (New York, NY, USA, 2015), HPDC '15, ACM, pp. 235–246.
[37] Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M., and Wilkes, J. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (2013), ACM, pp. 351–364.
[38] Segal, O., Colangelo, P., Nasiri, N., Qian, Z., and Margala, M. SparkCL: A unified programming framework for accelerators on heterogeneous clusters. CoRR abs/1505.01120 (2015).
[39] Shan, Y., Wang, B., Yan, J., Wang, Y., Xu, N., and Yang, H. FPMR: MapReduce framework on FPGA. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2010), FPGA '10, ACM, pp. 93–102.
[40] Stuart, J. A., and Owens, J. D. Multi-GPU MapReduce on GPU clusters. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (Washington, DC, USA, 2011), IPDPS '11, IEEE Computer Society, pp. 1068–1079.
[41] Tsoi, K. H., and Luk, W. Axel: A heterogeneous cluster with FPGAs and GPUs. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2010), FPGA '10, ACM, pp. 115–124.
[42] Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (2013), ACM, p. 5.
[43] Wang, Z., Zhang, S., He, B., and Zhang, W. Melia: A MapReduce framework on OpenCL-based FPGAs. IEEE Transactions on Parallel and Distributed Systems PP, 99 (2016), 1–1.
[44] Yeung, J. H. C., Tsang, C. C., Tsoi, K. H., Kwan, B. S. H., Cheung, C. C. C., Chan, A. P. C., and Leong, P. H. W. Map-reduce as a programming model for custom computing machines. In Field-Programmable Custom Computing Machines, 2008. FCCM '08. 16th International Symposium on (April 2008), pp. 149–159.
[45] Yin, D., Li, G., and Huang, K.-D. Scalable MapReduce framework on FPGA. In Lecture Notes in Computer Science, S. Andreev, S. Balandin, and Y. Koucheryavy, Eds. Springer Berlin Heidelberg, 2012, pp. 280–294.
[46] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012), USENIX Association, pp. 2–2.
[47] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (2010), pp. 10–10.
[48] Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., and Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2015), FPGA '15, ACM, pp. 161–170.