
Enabling OpenMP on Multi-FPGAs

Ramon Nepomuceno, Renan Sterle, Guilherme Valarini, Marcio Pereira, Hervé Yviquel, and Guido Araujo
Institute of Computing, University of Campinas, Brazil
{ramon.nepomuceno, mpereira, herve, guido}@ic.unicamp.br
{r176536, g168891}@dac.unicamp.br

Abstract—FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large modern workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture capable of accelerating a single application. However, programming such an architecture is a challenging endeavor that still requires additional research. This paper extends the OpenMP task-based computation offloading model to enable a number of FPGAs to work together as a single Multi-FPGA architecture. Experimental results for a set of OpenMP stencil applications running on a Multi-FPGA platform consisting of 6 Xilinx VC709 boards interconnected through fiber-optic links have shown close to linear speedups as the number of FPGAs and IP-cores per FPGA increase.

Fig. 1: An optical-link interconnected Multi-FPGA architecture running an OpenMP pipelined application.

I. INTRODUCTION

With the limits imposed by the power density of semiconductor technology, heterogeneous systems became a design alternative that combines CPUs with domain-specific accelerators to improve power-performance efficiency [1]. A modern heterogeneous system typically combines general-purpose CPUs and GPUs to accelerate complex scientific applications [2]. However, for many specialized applications that can benefit from pipelined parallelism (e.g. FFT, networking), FPGA-based hardware accelerators have been shown to produce improved power-performance numbers [3], [4], [5], [6]. Moreover, FPGA reconfigurability facilitates the adaptation of the accelerator to distinct types of workloads and applications. To leverage this, cloud service companies like Microsoft Azure [7] and Amazon AWS [8] are offering heterogeneous computing nodes with integrated FPGAs.

Given their small external memory bandwidth [9], FPGAs do not perform well for applications that require intense memory accesses. Pipelined FPGA accelerators [10], [11], [12] have been designed to address this problem, but such designs are constrained to the boundaries of a single FPGA or are limited by the number of FPGAs that they can handle. By connecting multiple FPGAs, one can design deep pipelines that go beyond the border of one FPGA, thus allowing data to be transferred through optical links from one FPGA to another without using external memory as temporary storage. Such deep pipeline accelerators can considerably expand the applicability of FPGAs, thus enabling increasing speedups as more FPGAs are added to the system [13].

Unfortunately, programming such a Multi-FPGA architecture is a challenging endeavor that still requires additional research [14], [15]. Synchronizing the accelerators inside the FPGA and seamlessly managing data transfers between them are still significant design challenges that restrict the adoption of such architectures. This paper addresses this problem by extending the LLVM OpenMP task programming model [16] to Multi-FPGA architectures.

At a higher abstraction level, the programming model proposed in this paper enables the programmer to see the FPGA as a regular OpenMP device and the IP-core accelerators (IPs) as OpenMP tasks. In the proposed approach, the OpenMP task dependence mechanism transparently coordinates the IPs' work to run the application. For example, consider the simple problem of processing the elements of a vector V in a pipelined fashion using four IPs (IP0-IP3) programmed in two FPGAs. Each IPi (i = 0-3) performs some specific computation foo(V,i). The Multi-FPGA architecture used in this example is shown in Figure 1 and contains two VC709 FPGA boards (a similar approach can also be used for modern Alveo FPGAs) interconnected by two fiber-optic links. Each FPGA is programmed with: (a) a module for communication with the PCIe interface (DMA/PCIE); (b) two IPs from the set IP0-IP3; (c) a NET module for communication with the optical fibers; (d) a Virtual FIFO module (VFIFO) for communication with memory; (e) a MAC frame handler (MFH) to pack/unpack data; and (f) a packet switch (A-SWT) module capable of moving data among the IPs, even if they sit on two distinct boards.

As shown in the figure, the vector is initially moved from the host memory (left of Figure 1) and then pushed through the IP0-IP3 pipeline, returning the final result to the host memory. Listing 2 shows how this pipeline is expressed with OpenMP tasks: each loop iteration offloads one computation foo(V,i), and the depend clauses chain the tasks into the IP0-IP3 pipeline of Figure 1.

int V[M];
bool deps[N+1];
#pragma omp parallel
#pragma omp single
for (int i = 0; i < N; i++) {
  #pragma omp target map(tofrom: V[:M]) \
          depend(in: deps[i]) depend(out: deps[i+1]) \
          nowait
  {
    foo(V, i);
  }
}
Listing 2: Offloading pipelined task computations to FPGA IPs.

B. The Hardware Platform

The Xilinx VC709 Connectivity Kit provides a hardware environment for developing and evaluating designs targeting the Virtex-7 FPGA. The VC709 board provides features common to many embedded processing systems, including dual DDR3 memories, an 8-lane PCI Express interface [19], and small SFP connectors for fiber-optic (or Ethernet) transceivers. In addition to the board's physical components, the kit also comes with a Target Reference Design (TRD) featuring a PCI Express IP, a DMA IP, a Network Module, and a Virtual FIFO interfacing to DDR3 memory. Figure 2 shows a schematic of the main board components and the corresponding TRD components.

This kit was selected to explore the ideas discussed in this paper due to its reduced cost and the fact that the TRD has components ready for inter-FPGA communication. Each of the TRD components is explained in more detail below.

PCIE and DMA. The PCI Express IP (Figure 2) provides a wrapper around the integrated block in the FPGA. The wrapper combines the Virtex-7 XT FPGA Integrated Block for PCIe with transceivers, clocking, and reset logic. It also provides an AXI4-Stream interface to the host.

CONF. The Configuration Registers (Figure 2) are used to read and write control/status information to/from the FPGA components. This information ranges from configuring the network module to reading performance, power, and temperature information. There is also a free address range available that is used to store configuration information for the specific IPs used in this work.

III. THE OPENMP MULTI-FPGA ARCHITECTURE

The approach proposed in this paper has two main goals: (a) to extend the LLVM OpenMP runtime so it can recognize FPGA IPs as OpenMP tasks (i.e. targets); and (b) to design a hardware mechanism that enables transparent communication of IP data dependencies through the cluster of FPGAs. The following two sections detail how these two goals have been achieved.

A. Extending OpenMP

To explain how the proposed system works, consider the code fragment shown in Listing 3. The declare variant directive (line 1), which is part of the OpenMP standard, declares a specialized hardware (FPGA IP-core) variant (hw_laplace2d, in line 4) of a C function (do_laplace2d, in line 2), and specifies the context in which that variant should be called. For example, line 3 of Listing 3 states that the variant hw_laplace2d should be selected when the proper vc709 device flag is provided to the compiler at compile time. This compiler flag is matched by the match(device=arch(vc709)) clause, and when the call to the function do_laplace2d(&V, h, w) (line 16) is to be executed, a call to the IP-core hw_laplace2d(int*, int, int) is performed instead.
1   #pragma omp declare variant \
2       (void do_laplace2d(int *, int, int)) \
3       match(device=arch(vc709))
4   extern void hw_laplace2d(int *, int, int);
5
6   int main() {
7     float V[h*w];
8     bool deps[N+1];
9     #pragma omp parallel
10    #pragma omp single
11    for (int i = 0; i < N; i++) {
12      #pragma omp target map(tofrom: V[:(h*w)]) \
13          depend(in: deps[i]) depend(out: deps[i+1]) \
14          nowait
15      {
16        do_laplace2d(&V, h, w);
17      }
18    }
19  }
Listing 3: Offloading task computations to FPGA IPs.

As shown in line 11 of Listing 3, the main function creates a pipeline of N tasks. Each task receives a vector V containing h*w (height and width) grid elements that are used to calculate a Laplace 2-D stencil. As the tasks are created within a target region and the vc709 device flag was provided to the compiler, the hw_laplace2d variant is selected to run each task (line 16). The compiler then uses this name to specify and offload the hardware IP that will run the task. As a result, at each loop iteration, a hardware IP task is created inside the FPGA. Details of the implementation of the Laplace 2-D IP and the other stencil IPs used in this work are provided in Section IV.

The map clause (line 12) specifies that the data is mapped back and forth to the host at each iteration. However, the implemented mapping algorithm concludes that vector V is sent to the IP from the host memory and its output forwarded to the next IP in the following iteration. The interconnections between these IPs are defined according to the depend clauses (line 13). In the particular example of Listing 3, given that the dependencies between the tasks follow the loop iteration order, a simple pipelined task graph is generated.

A careful comparison of Listing 3 with Listing 2 reveals that, in terms of syntax, the proposed approach does not require the user to change anything concerning the OpenMP standard, besides specifying the vc709 flag to the compiler. This gives the programmer a powerful verification flow: he/she can write the software version do_laplace2d for algorithm verification purposes, and then switch to the hardware (FPGA) version hw_laplace2d by just using the vc709 compiler flag.

To achieve this level of alignment with the OpenMP standard, two extensions were required to the OpenMP runtime implementation: (a) a modification in the task graph construction mechanism; and (b) the design of the VC709 plugin in the libomptarget library. All these changes have been done very carefully to ensure full compatibility with the current implementation of the OpenMP runtime.

Managing the Task Graph. The first modification made to the OpenMP runtime has to do with how it handles the task graph. In the current OpenMP implementation the graph is built and consumed at runtime. Whenever a task has its dependencies satisfied, it becomes available for a worker thread to execute. After the worker thread finishes, the task output data is sent back to the host memory. This approach satisfies the needs of a single accelerator, but causes unnecessary data movements for a Multi-FPGA architecture, as the output data of one (FPGA) task IP may be needed as input to another task IP. To deal with this problem, the OpenMP runtime was changed so that tasks are not immediately dispatched for execution as they are detected by the control thread. In the case of FPGA devices, the runtime waits for the construction of the task graph at the synchronization point at the end of the scope of the OpenMP single clause (line 18 of Listing 3).

Building the VC709 Plugin. In the OpenMP implementation of the Clang/LLVM compiler [20], kernel/data offloading is performed by a library called libomptarget [21]. This library provides an agnostic offloading mechanism that allows the insertion of a new device into the list of devices that the OpenMP runtime supports, and it is responsible for managing kernel and data offloading to acceleration devices. Therefore, to allow the compiler to offload to the VC709 board, it was necessary to create a plugin in this library. Figure 3 illustrates where the plugin is located in the software stack.

Fig. 3: OpenMP software stack with VC709 plugin.

As shown in Figure 3, the plugin receives the task graph generated by the runtime and maps these tasks to the available IPs in the cluster. The cluster configuration is passed through a conf.json file, which contains: (a) the location of the bitstream files, (b) the number of FPGAs, (c) the IPs available in each FPGA, and (d) the addresses of IPs and FPGAs. As in our experiments the FPGAs are connected in a ring topology, a round-robin algorithm is used to map tasks to IPs: each task is mapped in a circular order to the free IP that is closest to the host computer.
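For reference, the sketch below lists the kind of entry points such a plugin exports. The function names follow the generic __tgt_rtl_* interface that libomptarget dispatches to; the comments describing VC709-specific behavior (parsing conf.json, DMA transfers over PCIe, programming the CONF registers) only indicate where that behavior would plug in, and the declarations are a simplified sketch rather than the actual plugin code.

#include <stdint.h>
#include <stddef.h>

/* Simplified view of the entry points a libomptarget plugin exports.
   VC709-specific behavior is only indicated in the comments. */

struct __tgt_device_image;   /* target image descriptor (from omptarget.h) */
struct __tgt_target_table;   /* table of target entry points               */

/* Accept the image only if it was produced for the vc709 device. */
int32_t __tgt_rtl_is_valid_binary(struct __tgt_device_image *image);

/* Number of FPGAs listed in conf.json. */
int32_t __tgt_rtl_number_of_devices(void);

/* Initialize one board: load its bitstream and set up the CONF registers. */
int32_t __tgt_rtl_init_device(int32_t device_id);

/* Resolve variant-function names (e.g. hw_laplace2d) to IP addresses. */
struct __tgt_target_table *__tgt_rtl_load_binary(int32_t device_id,
                                                 struct __tgt_device_image *image);

/* Buffer management and host<->FPGA copies over the DMA/PCIe engine. */
void   *__tgt_rtl_data_alloc(int32_t device_id, int64_t size, void *host_ptr);
int32_t __tgt_rtl_data_submit(int32_t device_id, void *tgt_ptr, void *host_ptr,
                              int64_t size);
int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *host_ptr, void *tgt_ptr,
                                int64_t size);
int32_t __tgt_rtl_data_delete(int32_t device_id, void *tgt_ptr);

/* Dispatch one task of the graph to the IP selected by the
   round-robin mapping described above. */
int32_t __tgt_rtl_run_target_region(int32_t device_id, void *entry,
                                    void **args, ptrdiff_t *offsets,
                                    int32_t num_args);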
B. Designing the Hardware Platform

Besides the extensions to OpenMP, an entire hardware infrastructure was designed to support OpenMP programming of Multi-FPGA architectures. This infrastructure leverages the Target Reference Design (TRD) presented in Section II-B, but could also be ported to modern Alveo FPGAs. To facilitate its understanding, the design is described below from two perspectives: Single-FPGA execution and Multi-FPGA execution.

Single-FPGA Execution. As discussed above, when the acceleration device is an FPGA, OpenMP considers the FPGA IPs as OpenMP tasks, which are specified by the name of a predefined variant function. Consider, for example, the hardware infrastructure in Figure 1, but this time with a single FPGA board and 4 task IPs (IP0-IP3). These IPs are separately designed using a standard FPGA toolchain (e.g. Vivado), which is not addressed in this paper. To connect the IPs into the infrastructure, the designer just needs to ensure that they use the AXI-Stream interface. The infrastructure can be easily changed to accept other interfaces, although for the purpose of this paper AXI-Stream suffices.

According to the proposed programming model, FPGA IPs can execute tasks that have dependencies between each other. In order to implement such dependencies, an AXI4-Stream Interconnect (AXIS-IC) module [22] was inserted into the infrastructure (Figure 1). This enables the IPs to communicate directly with each other, based on the OpenMP dependencies programmed between them, thus avoiding unnecessary communication through the host memory.

The AXI4-Stream Interconnect is an IP core that enables the connection of heterogeneous master/slave AMBA® AXI4-Stream protocol compliant endpoint IPs. It routes data from one or more AXI4-Stream master channels to one or more AXI4-Stream slave channels. The VC709 plugin uses the CONF register bank (Figure 1) to program the source and destination ports of each IP according to their specified task dependencies.

Multi-FPGA Cluster Execution. A Multi-FPGA architecture is composed of one or more cluster nodes containing at least one FPGA board each. To enable such an architecture, routing capability needs to be added to each FPGA so that IPs from two different boards or nodes can communicate through the optical links. To enable that, a MAC Frame Handler (MFH) module was designed and inserted into the hardware infrastructure, as shown in Figure 1. This module is required because the Network Subsystem that routes packets through the optical fibers connecting the boards receives data in the form of MAC frames, which contain four fields: (a) destination, (b) source, (c) type/length and (d) payload. Therefore, to use the optical fibers, a module that can assemble and disassemble MAC frames is required.

The MFH module is responsible for inserting and removing the source and destination MAC addresses and type/length fields whenever the IPs need to send/receive data through the Network Subsystem. MAC addresses are extracted from the dependencies in the task graph, while the type/length fields are extracted from the map clause. The VC709 plugin uses this information to set up the CONF registers, which in turn configure the MFH module.

With all of these components in place, the proposed OpenMP runtime can distribute task IPs across a cluster of FPGAs and map the dependency graph so that FPGA IPs communicate directly.
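To make the framing concrete, the structure below sketches one possible C view of such a MAC frame, assuming standard Ethernet field widths (6-byte addresses and a 2-byte type/length field); the exact payload size and any additional control information are implementation details not fixed by this description.

#include <stdint.h>

/* One possible view of a MAC frame handled by the MFH module, assuming
   standard Ethernet field widths. The destination and source addresses
   identify the consumer and producer FPGA/IP (taken from the task-graph
   dependencies), the type/length field is derived from the map clause,
   and the payload carries the AXI4-Stream data moved between IPs. */
struct mac_frame {
    uint8_t  dest[6];     /* destination MAC address (consumer FPGA/IP) */
    uint8_t  src[6];      /* source MAC address (producer FPGA/IP)      */
    uint16_t type_length; /* transfer size, derived from the map clause */
    uint8_t  payload[];   /* stream data forwarded between IPs          */
};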
IV. A STENCIL MULTI-FPGA PIPELINE

Stencil computation is a method in which a matrix (i.e. grid) is updated iteratively according to a fixed computation pattern [23]. Stencil computations are used in this paper to show the potential of the proposed OpenMP-based Multi-FPGA programming model. In this paper, stencil IPs are used to compute multiple portions and iterations of a grid in parallel on different FPGAs. There are basically two types of parallelism that can be exploited when implementing stencil computation in hardware: cell-parallelism and iteration-parallelism [13]. As detailed below, these two types of parallelism leverage a pipeline architecture to improve performance and are thus good candidates to take advantage of the Multi-FPGA programming model described herein.

Five different types of stencil IPs have been implemented for evaluation. The IPs were adapted from [13] and their computations are listed in Table I in the following order: (1) Laplace eq. 2-D, (2) Diffusion 2-D, (3) Jacobi 9-pt. 2-D, (4) Laplace eq. 3-D and (5) Diffusion 3-D. The formula in the Computation column is used to calculate an element V^{t+1}_{i,j,k}, where t represents the iteration and the indices i, j and k represent the axes of the grid. The C values are constants passed to the IPs.

TABLE I: Stencil kernels.
Kernel                 Computation
1 (Laplace eq. 2-D)    0.25 (V^t_{i,j-1} + V^t_{i-1,j} + V^t_{i+1,j} + V^t_{i,j+1})
2 (Diffusion 2-D)      C1·V^t_{i,j-1} + C2·V^t_{i-1,j} + C3·V^t_{i,j} + C4·V^t_{i+1,j} + C5·V^t_{i,j+1}
3 (Jacobi 9-pt. 2-D)   C1·V^t_{i-1,j-1} + C2·V^t_{i,j-1} + C3·V^t_{i+1,j-1} + C4·V^t_{i-1,j} + C5·V^t_{i,j} + C6·V^t_{i+1,j} + C7·V^t_{i-1,j+1} + C8·V^t_{i,j+1} + C9·V^t_{i+1,j+1}
4 (Laplace eq. 3-D)    0.25 (V^t_{i,j-1,k} + V^t_{i-1,j,k} + V^t_{i+1,j,k} + V^t_{i,j+1,k} + V^t_{i,j,k-1} + V^t_{i,j,k+1})
5 (Diffusion 3-D)      C1·V^t_{i,j-1,k} + C2·V^t_{i-1,j,k} + C3·V^t_{i,j,k-1} + C4·V^t_{i,j,k} + C5·V^t_{i+1,j,k} + C6·V^t_{i,j+1,k}

Cell-Parallelism. Figure 4a shows an example of cell-parallelism on a stencil computation, where cell(1,1) at iteration 2 is computed using the data from its neighboring cells in the yellow area at iteration 1. This can be repeated for other cells at iteration 2, like cell(3,1), which is computed in parallel with cell(1,1).

Iteration-Parallelism. This occurs when elements of different iterations are calculated in parallel. Figure 4b shows two consecutive iterations (1 and 2) where this happens. As shown in Figure 4b, cell(1,1) at iteration 2 is computed using the data from its neighboring cells in the yellow area at iteration 1, while other cells are computed in parallel at iteration 1, like cell(2,2).

Fig. 4: Types of stencil parallelism: (a) cell-parallel computation; (b) iteration-parallel computation. Adapted from [13].

A. IP Implementation

Figure 5 shows an overview of a typical stencil IP implementation using cell and iteration parallelism. Figure 5a shows the grid to be computed, and Figure 5b the components that implement the stencil, namely: (a) a shift-register that stores the grid data in processing order; and (b) the processing element (PE), which does the actual stencil computation. The cells in Figure 5a are computed by the architecture in Figure 5b from left to right and top to bottom, one after the other.

Fig. 5: IP implementation: (a) grid stencil to be computed; (b) shift-register and PE implementation. Adapted from [13].

At each clock cycle, the data in the shift-registers are shifted to the left in Figure 5a, and a new cell value is pushed into the input of the first shift-register (i.e. cell^t_{3,2}). The computation starts after all neighboring data of a cell are available in the shift-register array. In the example of Figure 5, cell^{t+1}_{1,1} is computed while input data is stored into cell^t_{3,2}. In the next clock cycle, the data at cell^t_{0,0} at the output of the shift-register is discarded (shifted out), and the data of cell^t_{4,2} is pushed into the input of the shift-register. Notice that the data at cell^t_{0,0} is no longer required for any computation at this stage.

Each stencil IP has a shift-register and eight processing elements and is thus capable of processing up to eight elements at a time until the end of an iteration. Each IP works with a 256-bit AXI4-Stream interface, as each cell in the matrix is a 32-bit float.

The A-SWT switch in the architecture of Figure 1 can be configured so that the IPs can be reused, thus expanding the system's capacity to deal with larger grids and iteration counts. By doing so, the stencil pipeline can be scaled in both space and time. Such scaling is required to leverage the processing power of the multiple FPGAs, and to enable the computation of large-size problems that could not be done by a single FPGA due to the lack of resources. Unfortunately, as discussed in Section V, the size and number of IPs in an FPGA is constrained by the ability of the synthesis tool and designer to make efficient use of the FPGA resources, and this can sometimes become a bottleneck.
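For reference, the listing below gives a plain-C, double-buffered version of the update each Laplace 2-D processing element performs (kernel 1 of Table I). It is the kind of software variant that the verification flow of Section III-A relies on, not the HDL of the IP itself; the function and argument names are chosen here for illustration, and boundary cells are simply left untouched.

/* Software reference for one Laplace 2-D iteration over an h x w grid
   (kernel 1 of Table I): each interior cell of Vt1 becomes the average
   of its four neighbors in Vt. Boundary handling is a simplification of
   this sketch; the names laplace2d_step, Vt and Vt1 are illustrative. */
void laplace2d_step(const float *Vt, float *Vt1, int h, int w) {
    for (int i = 1; i < h - 1; i++) {
        for (int j = 1; j < w - 1; j++) {
            Vt1[i * w + j] = 0.25f * (Vt[i * w + (j - 1)] + Vt[(i - 1) * w + j] +
                                      Vt[(i + 1) * w + j] + Vt[i * w + (j + 1)]);
        }
    }
}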
V. EXPERIMENTAL RESULTS

To evaluate the proposed system, three sets of experiments were performed using the stencil IPs described in Table I. The first set (Section V-A) aimed at evaluating the scalability of the system with respect to the number of FPGAs. For the second set of experiments (Section V-B), the scalability concerning the number of IPs (i.e. number of iterations) was evaluated. Finally, the goal of the third set of experiments was to evaluate FPGA resource utilization. For all experiments, the board used was the Virtex-7 FPGA VC709 Connectivity Kit [24], which contains a Xilinx Virtex-7 XC7VX690T-2FFG1761C FPGA. Compilation of the HDL codes was done using Vivado 2018.3 [25].

Infra-structure issues. The reader must keep in mind that the goal of these experiments was not to show raw performance numbers, but to demonstrate the viability and scalability of the proposed programming model. Unfortunately, the infra-structure used in the experiments is not new: the machines have old Intel Xeon E5410 @ 2.33 GHz CPUs, DDR2 667 MHz memories, and archaic PCIe Gen1 interfaces, which caused a considerable loss of performance since the FPGA boards use PCIe Gen3. Moreover, as detailed in Section V-C, the size of the original TRD kit made it very hard for Vivado to synthesize more IPs per FPGA, thus reducing the number of grid points inside the hardware and the number of iterations. This harmed the final FPGA utilization and overall performance. However, even under these drawbacks the proposed approach still achieved linear speedups. Therefore, we are confident that, after using more modern machines, FPGAs (e.g. U250) and design flows (e.g. Vitis), the resulting performance will be very competitive with that shown in the hand-designed solution of [13], which in some cases surpasses the performance of GTX 980 Ti and P100 GPUs.

TABLE II: The setup of the stencil IPs.
Stencil Name        Grid Size    Iterations   # IPs
Laplace 2-D         4096x512     240          4
Laplace 3-D         512x64x64    240          2
Diffusion 2-D       4096x512     240          1
Diffusion 3-D       256x32x32    240          1
Jacobi 9-pt. 2-D    1024x128     240          1
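As a rough guide for reading the GFLOPS curves discussed next, the helper below shows one plausible way to estimate such numbers from the Table II parameters: grid cells times iterations times an assumed per-cell operation count, divided by the measured execution time. The per-cell counts are estimates introduced here for illustration (e.g. four operations for the Laplace 2-D update of Table I); they are not values taken from the measurement setup.

/* Rough GFLOPS estimate for a stencil run. grid_cells is the number of
   grid elements (e.g. 4096*512 for Laplace 2-D in Table II), iterations
   the number of time steps, flops_per_cell an assumed operation count
   per cell update (e.g. 4 for Laplace 2-D: three additions and one
   multiplication), and seconds the measured execution time. */
double estimate_gflops(long grid_cells, long iterations,
                       double flops_per_cell, double seconds) {
    double total = (double)grid_cells * (double)iterations * flops_per_cell;
    return total / (seconds * 1e9);
}

For the Laplace 2-D configuration of Table II, for example, the estimate would be estimate_gflops(4096L * 512, 240, 4.0, measured_seconds).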

A. FPGA Scalability

The FPGA scalability experiments were executed with the settings shown in Table II, varying the number of FPGAs from 1 to 6. The Grid Size column shows the dimensions of the initial grid for each kernel. The more computation a kernel does, the more difficult it was for Vivado to synthesize the design while respecting the timing constraints. For this reason, the grid dimensions of each kernel were adjusted to avoid negative slacks. The Iterations column was set at 240 so that it was possible to execute with all 6 FPGAs. The # IPs column specifies the number of IPs in each FPGA. The number of IPs varies for the same reason as the dimensions of the initial grid: the larger the kernel computation, the smaller the number of IPs Vivado could synthesize within the timing constraints. On the other hand, as discussed in Section V-C, there is still plenty of hardware to be used before the FPGA runs out of resources, which reinforces the long-term potential of the model proposed herein when using a more efficient FPGA platform and design flow (e.g. U250 and Vitis).

Fig. 6: Speedup scaling with the number of FPGAs.

The graph of Figure 6 shows the speedup, with respect to the execution on a single FPGA, achieved by the various stencil kernels as the number of FPGAs varies on the x-axis. The speedup grows almost linearly with the number of FPGAs for all five kernels. This result shows that it is possible to scale applications using Multi-FPGA architectures by using programming models like the one proposed in this paper to facilitate the design of such systems.

Fig. 7: GFLOPS scaling with the number of FPGAs.

The graph of Figure 7 shows, on the y-axis, the number of floating-point operations per second (GFLOPS) for each kernel as the number of FPGAs varies on the x-axis. The Laplace-2D kernel (yellow line) executes more GFLOPS than the other kernels. This is because, although the computation of this kernel is the simplest one, during synthesis it was possible to insert more IPs (four) per FPGA, which allowed more iteration parallelism, as discussed in Section IV. Just below the Laplace-2D is the Laplace-3D (green line), which, with only 2 IPs per FPGA, still managed to sustain linear performance growth. The remaining kernels all have only one IP per FPGA, so their GFLOPS numbers are related to the number of operations executed and the grid's dimensions. Notice that Diffusion-3D (red line) and Diffusion-2D (blue line) perform less computation than Jacobi 9-pt. (orange line); however, they achieve better GFLOPS numbers due to their higher grid dimensions, which enable them to take advantage of increased iteration parallelism.

B. Iteration and IP Scalability

A second experiment was performed to evaluate the IPs' scalability concerning the number of iterations. The Laplace-2D kernel was used as an example, although similar results have been achieved for the other kernels.

Fig. 8: Laplace-2D scaling with the number of iterations.
The graph in Figure 8 shows, on the y-axis, the number of GFLOPS produced by the system as the number of iterations varies on the x-axis. The yellow, blue, red, and green lines represent executions with 1, 2, 3, and 4 IPs, respectively. As shown, the execution with a single IP (yellow line) remains practically constant. On the other hand, the execution with 4 IPs shows an increase in performance until reaching a plateau. The executions with 2 and 3 IPs also show a gradual performance increase. This experiment reveals that, by increasing the number of IPs, it is possible to improve the system's scalability in terms of iterations.

Fig. 9: Laplace-2D scaling with the number of IPs.

The graph of Figure 9 shows, on the y-axis, the number of GFLOPS for the Laplace-2D kernel as the number of IPs increases (x-axis). Each line in the graph is a different number of iterations. The graph reinforces the insight revealed by Figure 8: as more IPs are added to the system, the more significantly an increase in the number of iterations improves performance. This can be confirmed by looking at the distances between the lines in Figure 9, which grow larger as the number of IPs increases. This experiment also supports the case for Multi-FPGA architectures.

C. Resource Utilization

Fig. 10: Resource usage distribution of the FPGA hardware.

Regarding resource utilization, the graph in Figure 10 shows the percentage of occupancy of the main FPGA components of the proposed infra-structure (not considering the IPs). Remarkably, the DMA/PCIe component occupies 30.2% of the available LUTs. This large utilization is because the DMA/PCIe was designed to support a board with four communication channels, although the proposed approach requires just one. The MFH, SWITCH, VFIFO, and Network components occupy, respectively, 1.7%, 11.5%, 13.2%, and 6.1% of the available LUTs. BRAMs are used by the DMA/PCIe (5.5%), VFIFO (18.3%), and NET (2.4%). The most significant usage of BRAMs comes from the VFIFO, which uses them to multiplex and demultiplex the four channels of the virtual FIFO. DSPs are the least used components (1%).

TABLE III: IPs resource usage.
Stencil         Slice LUTs (#, %)   Block RAM (#, %)   DSP (#, %)
Laplace-2D      12138, 7.5%         8, 0.7%            16, 0.4%
Diffusion-2D    25024, 15.4%        8, 0.7%            80, 2.2%
Jacobi 9-pt.    45733, 28.3%        8, 0.7%            144, 4.0%
Laplace-3D      21790, 13.5%        65, 6.0%           17, 0.5%
Diffusion-3D    27615, 17.1%        23, 2.1%           97, 2.7%

Table III shows the quantity and percentage of the FPGA components used by each IP from the free region (gray area) of Figure 10. The percentage of the available LUTs effectively used by the stencil IPs varies from 7.5% to 28.3%, depending on the complexity of the kernel. As for BRAM, the utilization ranges from 0.7% to 6.0%; this is directly linked to the size of the shift-registers and is impacted by the size of the grid to be calculated. The number of DSP components used by the IPs varies from 0.4% to 4.0% and is related to the number of multiplications performed by each kernel. The small utilization of the FPGA resources by the kernels has been previously discussed at the beginning of Section V. Additional work will be done to address these shortcomings.

VI. RELATED WORKS

This section summarizes the main works in the literature that propose using OpenMP for FPGAs through LLVM.

Choi et al. [26] use information provided by pragmas to generate better parallel hardware. The compiler synthesizes one kernel IP per thread in the source program.

Sommer et al. [27] use Vivado HLS to generate hardware for the code regions annotated with OpenMP target directives. Their work fully supports omp target directives (including the map clause). It is also the first work that leverages the LLVM libomptarget library to enable an FPGA synthesis flow.

Ceissler et al. [28] describe HardCloud, an OpenMP platform that integrates pre-designed FPGA IPs into an OpenMP program. The authors propose three additional clauses to the OpenMP 4.X standard (use, check and module) that allow access to an IP's inputs/outputs directly from OpenMP code.

Knaust et al. [29] use Clang [30] to outline omp target regions at the level of the LLVM IR and feed them into Intel's OpenCL HLS tool-chain to generate a hardware kernel for the FPGA. Their approach uses Intel's OpenCL API for the communication between host and FPGA.

To the best of our knowledge, and contrary to the previous works, which focused mostly on synthesis and single-FPGA architectures, this paper is the first to enable OpenMP task parallelism to integrate IPs into a Multi-FPGA architecture.
VII. CONCLUSIONS

This paper proposes to extend the OpenMP task-based computation offloading programming model to enable several FPGAs to work together as a single Multi-FPGA architecture. Experimental results for a set of OpenMP stencil-based applications running on a Multi-FPGA platform consisting of 6 Xilinx VC709 FPGA boards interconnected through fiber-optic links have shown close to linear speedups as the number of FPGAs and IP-cores per FPGA increase.

REFERENCES

[1] K. O'Neal and P. Brisk, "Predictive modeling for cpu, gpu, and fpga performance and power consumption: A survey," in 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), July 2018, pp. 763–768.
[2] Z. Guo, T. W. Huang, and Y. Lin, "Gpu-accelerated static timing analysis," in 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2020, pp. 1–9.
[3] T. El-Ghazawi, E. El-Araby, M. Huang, K. Gaj, V. Kindratenko, and D. Buell, "The promise of high-performance reconfigurable computing," Computer, vol. 41, no. 2, pp. 69–76, Feb 2008.
[4] S. Lee, J. Kim, and J. S. Vetter, "Openacc to fpga: A framework for directive-based high-performance reconfigurable computing," in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2016, pp. 544–554.
[5] M. Strickland, "Fpga accelerated hpc and data analytics," in 2018 International Conference on Field-Programmable Technology (FPT), Dec 2018, pp. 21–21.
[6] M. Reichenbach, P. Holzinger, K. Häublein, T. Lieske, P. Blinzer, and D. Fey, "Heterogeneous computing utilizing fpgas," Journal of Signal Processing Systems, vol. 91, no. 7, pp. 745–757, Jul 2019. [Online]. Available: https://doi.org/10.1007/s11265-018-1382-7
[7] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger, "A cloud-scale acceleration architecture," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2016, pp. 1–13.
[8] "Amazon EC2 F1 Instances," https://aws.amazon.com/ec2/instance-types/f1, Nov 2019, [Online; accessed 25 Nov. 2019].
[9] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, "Understanding performance differences of fpgas and gpus (abstract only)," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '18. New York, NY, USA: ACM, 2018, pp. 288–288. [Online]. Available: http://doi.acm.org/10.1145/3174243.3174970
[10] W. Jiang, E. H.-M. Sha, X. Zhang, L. Yang, Q. Zhuge, Y. Shi, and J. Hu, "Achieving super-linear speedup across multi-fpga for real-time dnn inference," ACM Trans. Embed. Comput. Syst., vol. 18, no. 5s, pp. 67:1–67:23, Oct. 2019. [Online]. Available: http://doi.acm.org/10.1145/3358192
[11] U. Farooq, I. Baig, and B. A. Alzahrani, "An efficient inter-fpga routing exploration environment for multi-fpga systems," IEEE Access, vol. 6, pp. 56301–56310, 2018.
[12] M. M. Azeem, R. Chotin-Avot, U. Farooq, M. Ravoson, and H. Mehrez, "Multiple fpgas based prototyping and debugging with complete design flow," in 2016 11th International Design Test Symposium (IDT), Dec 2016, pp. 171–176.
[13] H. M. Waidyasooriya and M. Hariyama, "Multi-fpga accelerator architecture for stencil computation exploiting spacial and temporal scalability," IEEE Access, vol. 7, pp. 53188–53201, 2019.
[14] D. M. Kunzman and L. V. Kale, "Programming heterogeneous systems," in 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, May 2011, pp. 2061–2064.
[15] J. Pu, S. Bell, X. Yang, J. Setter, S. Richardson, J. Ragan-Kelley, and M. Horowitz, "Programming heterogeneous systems from an image processing dsl," ACM Trans. Archit. Code Optim., vol. 14, no. 3, pp. 26:1–26:25, Aug. 2017. [Online]. Available: http://doi.acm.org/10.1145/3107953
[16] "OpenMP 4.5 Specifications," http://www.openmp.org/mp-documents/openmp-4.5.pdf, accessed on Oct 13, 2019.
[17] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2008, vol. 10.
[18] "OpenMP 3.0 Specifications," https://www.openmp.org/wp-content/uploads/spec30.pdf, accessed on Oct 13, 2019.
[19] R. Budruk, D. Anderson, and E. Solari, PCI Express System Architecture. Pearson Education, 2003.
[20] C. Lattner and V. Adve, "Llvm: A compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, ser. CGO '04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 75–. [Online]. Available: http://dl.acm.org/citation.cfm?id=977395.977673
[21] C. Bertolli, S. F. Antao, G.-T. Bercea, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, and K. O'Brien, "Integrating gpu support for openmp offloading directives into clang," in Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, ser. LLVM '15. New York, NY, USA: ACM, 2015, pp. 5:1–5:11. [Online]. Available: http://doi.acm.org/10.1145/2833157.2833161
[22] "Axi4-stream interconnect v1.1 logicore ip product guide (pg035)." [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/axis_interconnect/v1_1/pg035_axis_interconnect.pdf
[23] G. Roth, J. Mellor-Crummey, K. Kennedy, and R. G. Brickner, "Compiling stencils in high performance fortran," in Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, ser. SC '97. New York, NY, USA: Association for Computing Machinery, 1997, pp. 1–20. [Online]. Available: https://doi.org/10.1145/509593.509605
[24] "Xilinx virtex-7 fpga vc709 connectivity kit," https://www.xilinx.com/products/boards-and-kits/dk-v7-vc709-g.html, accessed on 11/11/2020.
[25] "Vivado design suite user guide: Release notes, installation, and licensing (ug973)," accessed on 11/16/2020.
[26] J. Choi, S. Brown, and J. Anderson, "From software threads to parallel hardware in high-level synthesis for fpgas," in 2013 International Conference on Field-Programmable Technology (FPT), Dec 2013, pp. 270–277.
[27] L. Sommer, J. Korinth, and A. Koch, "Openmp device offloading to fpga accelerators," in 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2017, pp. 201–205.
[28] C. Ceissler, R. Nepomuceno, M. Pereira, and G. Araujo, "Automatic offloading of cluster accelerators," in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2018, pp. 224–224.
[29] M. Knaust, F. Mayer, and T. Steinke, "Openmp to fpga offloading prototype using opencl sdk," in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019, pp. 387–390.
[30] LLVM Project. (2007) Clang: a C language family frontend for LLVM. [Online]. Available: https://clang.llvm.org/